Improving Link Time on Windows with clang-cl and lld
source link: http://blog.llvm.org/2018/01/improving-link-time-on-windows-with.html
One of our goals in bringing clang and lld to Windows has always been to improve developer experience, and what is it that developers want the most? Faster build times! Recently, our focus has been on improving link time because it's the step that's the hardest to parallelize, so we can't fall back on the time-honored tradition of throwing more cores at it.
Of the various steps involved in linking, generating the debug info (which, on Windows, is a PDB file) is by far the slowest since it involves merging O(# of linker inputs) sequences of type records, most of which are duplicates anyway. For example, if two cpp files both include <string>, then both of those object files will have hundreds of duplicate type records that need to be de-duplicated during the link step. This means you have to compute O(M x N) hash values (roughly, M linker inputs times N type records per input), even though only a small fraction of those ultimately contribute to the final PDB.
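To make the cost concrete, here is a minimal Python sketch of a linker hashing every type record of every input. The record contents, the names `a_obj`, `b_obj` and `merge_types`, and the use of a plain dictionary are all illustrative assumptions, not how lld or the Microsoft linker is actually implemented.

```python
import hashlib

# Hypothetical stand-ins for the type records of two object files.
# Both "include <string>", so most of their records are identical.
string_types = [f"std::type_{i}".encode() for i in range(200)]
a_obj = string_types + [b"struct OnlyInA"]
b_obj = string_types + [b"struct OnlyInB"]

def merge_types(object_files):
    """Naive linker-side merge: hash every record of every input."""
    merged = {}            # hash -> record kept in the final PDB
    hashes_computed = 0
    for records in object_files:
        for record in records:
            digest = hashlib.sha1(record).digest()
            hashes_computed += 1
            merged.setdefault(digest, record)  # duplicates are dropped here
    return merged, hashes_computed

merged, hashes_computed = merge_types([a_obj, b_obj])
print(hashes_computed, len(merged))  # 402 202
```

With only two inputs, roughly half of the hashing work is already spent on records that are thrown away; the wasted share grows with every additional input that includes the same headers.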
Several strategies have been invented to deal with this over the years and try to make linking faster. Many years ago, Microsoft introduced the notion of a Type Server (enabled via the /Zi compiler option in MSVC), which moves some of the work into the compiler (to take advantage of parallelism). More recently we have been given the /DEBUG:FASTLINK linker option, which attempts to solve the problem by not merging types at all in the linker. However, each of these strategies has its own set of disadvantages, and neither can be considered perfect for all use cases.
In this blog post, we'll first go over some technical background about CodeView so that we can understand the problem, followed by a summary of existing attempts to speed up type merging. Then, we'll describe a novel extension to the PE/COFF file format which speeds up linking by offloading part of the work required to de-duplicate types to the compiler and using a new algorithm which uniquely identifies type records even across input files, and discuss the various tradeoffs of each approach. Finally, we'll present some benchmarks and discuss how you can try this out in clang-cl and lld today.
Background
Existing Solutions
Type Servers (/Zi)
- Type servers add significant context switching and global lock contention to the compilation phase, reducing parallelism and degrading overall system performance while a build is in progress. While some performance is reclaimed from the linker, some is sacrificed due to the use of a global system lock. It’s still a net win, but as it is not free, it leaves open the possibility that we may be able to achieve better parallelism using a different approach.
- The type server process itself (mspdbsrv.exe) introduces a single point of failure. When it crashes (we see C1033 several times per day on Chrome, for example, which seems to indicate an mspdbsrv.exe crash) it could trigger a full rebuild if the type server PDB file is left in a corrupt state.
- mspdbsrv is incompatible with distributed builds, which is a show-stopper for large applications that can take several hours to build on normal workstations. Type servers operate only via local IPC. While multi-processing works well for small applications, many large products have build farms that distribute compilations among tens or hundreds of physical machines. Type servers are incompatible with this scenario.
Fastlink PDBs
- The pdbcopy utility is almost unusable with fastlink PDBs for performance reasons.
- Since type merging doesn’t happen, indexing of type information also doesn’t happen (the expensive part of building an index, the hashing, comes for free when you are hashing the record anyway). This degrades the debugger user experience, since waits which previously happened only at build time now happen at debug time.
- Fastlink PDBs are not portable. The PDB references the object files by path, so if you copy the PDB and object files to a different machine (or even a different path on the same machine) for archival purposes, they can no longer be debugged. This is a deal-breaker for using it on production builds.
- Symbols can’t be enumerated in a Fastlink PDB. This is most obvious if you attempt to use DIA SDK on a Fastlink PDB, where it will simply refuse to do anything at all. This means that the only externally supported way of querying debug info for users is impossible against a Fastlink PDB. Beyond that, however, it also means that even Microsoft’s own tools which need to enumerate symbols cannot use any standard API for doing so. For example, WinDbg doesn’t fully support Fastlink PDBs, and many workflows are broken by the use of them, even using supported Microsoft tools.
- It has several serious stability issues which make it unusable on large projects [ref]. This is probably related to the previous point (symbols can’t be enumerated), namely the fact that every tool that wants to be able to work with a Fastlink PDB needs to use different code than the SDK that has been tested and battle-hardened through years of development.
- When compiling with clang-cl and linking with /DEBUG:FASTLINK, the compiler has to be instructed to emit additional debug information, making .obj files about 29% larger.
Clang's Solution - The COFF .debug$H section
- remapAllTypeIndices is called unconditionally for every type in every object file.
- A hash of the type is computed unconditionally for every type.
- At least one full record comparison is done for every type. In practice it turns out to be much more, because hash buckets are computed modulo table size, so there will actually be 1 full record comparison for every probe.
- remapAllTypeIndices is only called when the record is actually new, which, as we discussed earlier, is a small fraction of the time over many linker inputs.
- A hash of the type is never computed by the linker. It is simply there in the object file (the exception to this is mixed linker inputs, discussed earlier, but those are a small fraction of input files).
- Full record comparisons never happen. Since we are using a strong hash function with negligible chance of false collisions, and since the hash of a record provides equality semantics across streams, the hash is as good as the record itself.
- An array of contiguous hash values.
- An array of contiguous hash buckets.
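The difference between hashing in the linker and reading a precomputed hash from the object file can be sketched in Python as follows. This is a toy model under stated assumptions: `TypeRecord`, `global_hashes` and `link` are hypothetical names, a record is reduced to a payload plus references to earlier records in the same file, and the hash of a record is taken over its payload and the hashes (not the indices) of the records it refers to, so that equal types get equal hashes regardless of how each object file numbers them. It illustrates the idea only; it does not reproduce the actual .debug$H layout or lld's code.

```python
import hashlib
from typing import NamedTuple

class TypeRecord(NamedTuple):
    payload: bytes      # record contents with type indices left out
    refs: tuple = ()    # indices of earlier records in the same object file

def global_hashes(records):
    """Compiler side: one hash per record, independent of local numbering."""
    hashes = []
    for rec in records:
        h = hashlib.sha1(rec.payload)
        for ref in rec.refs:
            h.update(hashes[ref])  # hash the referenced type's hash, not its index
        hashes.append(h.digest())
    return hashes

def link(objects):
    """Linker side: look up precomputed hashes; remap only records that are new."""
    table = {}    # global hash -> index in the merged type stream
    merged = []
    remaps = 0    # how often indices had to be rewritten
    for records, hashes in objects:
        local_to_global = []
        for rec, digest in zip(records, hashes):
            index = table.get(digest)
            if index is None:  # first time this type is seen in any input
                remaps += 1
                new_refs = tuple(local_to_global[r] for r in rec.refs)
                index = len(merged)
                merged.append(TypeRecord(rec.payload, new_refs))
                table[digest] = index
            local_to_global.append(index)
    return merged, remaps

# The same three types (int, char, char*), numbered differently in each file.
a = [TypeRecord(b"int"), TypeRecord(b"char"), TypeRecord(b"pointer", (1,))]
b = [TypeRecord(b"char"), TypeRecord(b"pointer", (0,)), TypeRecord(b"int")]
objects = [(recs, global_hashes(recs)) for recs in (a, b)]
merged, remaps = link(objects)
print(len(merged), remaps)  # 3 3
```

In this sketch the linker never hashes or compares record bytes: all six input records are resolved with dictionary lookups, and indices are rewritten only for the three records that end up in the merged stream. The sketch assumes that references always point to earlier records in the same file.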
Mixed Input Files and Compiler/Linker Compatibility
The On-Disk Format
Limitations and Pitfalls
Benchmarks
Further Improvements
- Use a smaller or faster hash. We use a 20-byte SHA1 hash. This size does not pack evenly into cache lines, and it is more than we need: the probability of collision is astronomically small even in the largest PDBs, considering that the theoretical limit of a PDB is just under 2^32 possible unique types (due to the 4-byte size of a type index). SHA1 is also notoriously slow. It might be interesting to try, for example, Blake2 configured to output an 8-byte hash. This should give sufficiently low probability of a collision while improving cache performance. The on-disk format is designed with this flexibility in mind, as different hash algorithms can be specified in the header.
- Hashes for compilands with missing .debug$H sections can be computed in parallel before linking. Currently when we encounter an object file without a .debug$H section, we must synthesize one in the linker. Our prototype algorithm does this serially for each input.
- Symbol records from .debug$S sections can be merged in parallel. Currently in lld, we first merge type records into the TPI stream, then we iterate symbol records and remap types in each symbol record to correspond to the new type indices. If we merge types from all modules up front, the symbol records (with the exception of global symbols) can be merged in parallel since they get written to independent streams.
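As a rough illustration of the first idea in this list, the following Python sketch compares digest sizes and estimates collision probability with the standard birthday-bound approximation. The record bytes, the figure of ten million unique types, and the helper name `collision_probability` are assumptions chosen for illustration; they are not measurements from any real PDB.

```python
import hashlib
import math

record = b"hypothetical serialized type record"

sha1_digest = hashlib.sha1(record).digest()
blake2_digest = hashlib.blake2b(record, digest_size=8).digest()
print(len(sha1_digest), len(blake2_digest))  # 20 8

def collision_probability(num_records, hash_bits):
    """Birthday-bound approximation: 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = num_records * (num_records - 1) / 2.0 ** (hash_bits + 1)
    return -math.expm1(-exponent)  # expm1 keeps precision for tiny values

assumed_unique_types = 10_000_000  # illustrative figure, not a measurement
p_sha1 = collision_probability(assumed_unique_types, 160)
p_blake2_8 = collision_probability(assumed_unique_types, 64)
print(p_sha1, p_blake2_8)
```

Under this approximation, an 8-byte hash over ten million unique records collides with a probability on the order of a few in a million. At the theoretical limit of 2^32 unique types the same formula gives about 0.39, so an 8-byte hash depends on real PDBs staying far below that limit.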
Try it out!
- To enable the emission of a .debug$H section by the compiler, you will need to pass the undocumented -mllvm -emit-codeview-ghash-section flag to clang-cl (this flag should go away in the future, once this is considered stable and good enough to be turned on by default).
- To tell lld to use this information, you will need to pass the /DEBUG:GHASH option to lld.