Why isn't ld.lld faster?

This article makes an in-depth analysis of ld.lld's performance. The topic has been on my mind for a while. Recently Rui Ueyama released mold 1.0 and people wonder why, with multi-threading, it is faster than LLD's ELF port (ld.lld). So I finally completed the article.

TODO At this point I should move a large portion of "how to implement an ELF linker" to a separate article.

First of all, I am very glad that Rui Ueyama started mold. Our world has a plethora of compilers, but not many people learn or write linkers. As its design documentation says, there are many drastically different designs which haven't been explored. In my view, mold is innovative in that it introduced parallel symbol table initialization, symbol resolution, and relocation scan which to my knowledge hadn't been implemented before, and showed us amazing results. The innovation gives existing and future linkers incentive to optimize further.

Generic ELF linker passes

Before jumping into the analysis, let me first lay out the generic passes to be referenced by subsequent chapters.

Here are the passes a typical linker has.

  • Parse command line options
  • Find and scan input files (.o, .so, .a), interleaved with symbol resolution
  • Global transforms (section based garbage collection, identical code folding, etc)
  • Create synthetic sections
  • Map input sections and synthetic (linker-generated) sections into output sections
  • Scan relocations
  • Layout and assign addresses (thunks, sections with varying sizes, symbol assignments)
  • Write headers
  • Copy section contents to output and resolve relocations

ld.lld

People used to say "ld.lld focuses on speed". This impression was mainly driven by comparisons with GNU ld and gold. People say that ld.lld is at least 2x faster than gold, sometimes 4x or more.

From my observation of the codebase and code review activity, the focus is probably more on parity of important features, ideal semantics, code readability, portability, and generality, and less on speed. In many cases, less code means better performance. In some cases, simple code is preferred over a small performance improvement.

ld.lld has parallelism in a few places, such as writing sections and computing the contents of a few synthetic sections, but some critical paths are not parallel. If you experiment with --threads=N, you may find that the speed-up is not large.

In a large executable with no debug information, I measured the following with ld.lld --threads=1 --time-trace and jq -r '.traceEvents[] | select(.name|startswith("Total")) | "\(.dur/1000000) \(.name)"'. The indented lines indicate what --threads=8 can parallelize:

3.23408 Total Write output file
  --threads=8: 1.946309 Total Write output file
2.715859 Total Parse input files
1.222062 Total Scan relocations
0.964637 Total markLive
0.467088 Total Load input files
0.274216 Total Process symbol versions
0.273555 Total Split sections
  --threads=8: 0.066911 Total Split sections
0.239738 Total Add local symbols
0.217316 Total Merge/finalize input sections
0.210563 Total Finalize synthetic sections
0.177842 Total Assign sections
0.164059 Total Add symbols to symtabs
0.163841 Total Finalize address dependent content

Next, I will describe performance-significant options and give a rudimentary performance analysis.

Performance-significant options

Different programs have different link characteristics. For the same program, we can get vastly different results with different compiler and linker options. The percentage numbers in this article are rough, just indicating the few applications that I tested. YMMV.

The most significant compiler options for link time are:

  • -O: the input size matters
  • -g: leads to a large portion of time spent on writing .debug_* sections, diluting the significance of other passes' performance
  • -gz (compile action): zlib-compressed debug sections require uncompression on the linker side
  • -gz (link action): passes --compress-debug-sections=zlib to the linker
  • -ffunction-sections: often used with -fdata-sections. -fno-unique-section-names may help by shrinking the section name string table.

The most significant linker options for link time are:

  • --compress-debug-sections=zlib: Compress output .debug_* sections with zlib
  • (LLD specific) -O0 (-Wl,-O0 in compiler drivers) disables SHF_MERGE duplicate elimination. This can tremendously decrease link time for a program with large .debug_str, but you will get a huge .debug_str as the penalty.
  • -z nocombreloc: Don't sort dynamic relocations in .rela.dyn. This is a large bottleneck when you have millions of relocations in a mostly statically linked executable. Well, this is the cost of PIE.
  • --icf={safe,all}: The section-based deduplication requires extensive computation on contents and relocations.
  • --gc-sections: Discard unreferenced input sections
  • --build-id: Compute a message digest from the contents of SHF_ALLOC sections

In a large executable with no debug information, I measured the following proportion:

  • 32% Parse input files
  • 15% Write output
  • 14% Scan relocations
  • 12% --gc-sections
  • 5% Load input files
  • 2% Merge/finalize input sections

In a large executable with debug information, I measured the following:

  • 64% Merge/finalize input sections
  • 18% Write output file
  • 5% Parse input files
  • 5% Split sections

You can map these metrics to the generic linker passes.

Now let me make an in-depth analysis of various passes in LLD.

Find and load input files

For each input file, ld.lld calls open and mmap, then checks its file type by scanning the initial bytes (magic): relocatable object file, shared object, archive, lazy object file, LLVM bitcode file, or linker script. The process is sequential and synchronous. In the absence of debug information, this pass may take 5% of the link time.
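For illustration, here is a hedged sketch of such magic-byte dispatch (not ld.lld's actual code; the function and enum names are made up). The magic values themselves are standard: ELF starts with \x7fELF and stores e_type at offset 16, archives start with "!<arch>\n", and LLVM bitcode starts with "BC\xC0\xDE".

#include <cstdint>
#include <cstring>
#include <string_view>

enum class FileKind { Object, SharedObject, Archive, Bitcode, LinkerScript };

// Hypothetical sketch: classify an input file by its leading bytes.
FileKind identify(std::string_view data) {
  if (data.size() >= 18 && data.substr(0, 4) == "\x7f" "ELF") {
    uint16_t type; // the ELF header field e_type lives at offset 16
    std::memcpy(&type, data.data() + 16, sizeof(type)); // little-endian host assumed
    return type == 3 /*ET_DYN*/ ? FileKind::SharedObject : FileKind::Object;
  }
  if (data.substr(0, 8) == "!<arch>\n")
    return FileKind::Archive;
  if (data.substr(0, 4) == "BC\xC0\xDE")
    return FileKind::Bitcode;
  return FileKind::LinkerScript; // last resort: try to lex it as a script
}

Note that lazy object files are ordinary relocatable object files; whether a file is lazy depends on --start-lib, not on its magic.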

lld-link (the PE/COFF port) loads files asynchronously, which speeds things up a bit. Parallelizing this pass in ld.lld could improve performance a little, but at some cost to readability.

Parse input and perform symbol resolution

Relocatable object files

A relocatable object file contributes symbols, sections, and relocations to the link. ld.lld reads the sections and creates in-memory representations for them, then reads the symbol table. In a symbol table, all symbols with STB_LOCAL binding precede the weak and global symbols. The local part is file-local, while the non-local part requires resolution against other files.

The non-local part consists of STB_GLOBAL, STB_WEAK, and (GCC-specific) STB_GNU_UNIQUE symbols. An entry may be defined (Defined) or undefined (Undefined). ld.lld inserts each entry into the global symbol table and resolves it.

template <class ELFT> void ObjFile<ELFT>::initializeSymbols() {
  ArrayRef<Elf_Sym> eSyms = this->getELFSyms<ELFT>();
  // firstGlobal == sh_info of SHT_SYMTAB, the number of local symbols.
  for (size_t i = 0; i != firstGlobal; ++i) {
    const Elf_Sym &eSym = eSyms[i];
    uint32_t secIdx = eSym.st_shndx; // simplified: SHN_XINDEX etc. elided
    InputSectionBase *sec = this->sections[secIdx];
    if (eSym.st_shndx == SHN_UNDEF || sec == &InputSection::discarded)
      new (this->symbols[i]) Undefined(...);
    else
      new (this->symbols[i]) Defined(...);
  }

  for (size_t i = firstGlobal, end = eSyms.size(); i != end; ++i)
    if (!this->symbols[i])
      this->symbols[i] = symtab->insert(eSyms[i].getName());

  std::vector<unsigned> undefineds;
  for (size_t i = firstGlobal, end = eSyms.size(); i != end; ++i) {
    const Elf_Sym &eSym = eSyms[i];
    if (eSym.st_shndx == SHN_UNDEF) {
      undefineds.push_back(i);
      continue;
    }
    // Resolve this->symbols[i] against eSyms[i] (Defined, Common, ...).
  }
  // Undefined symbols are resolved last. Introduced by
  // https://reviews.llvm.org/D95985 to stabilize symbol resolution behaviors.
  for (unsigned i : undefineds) {
    const Elf_Sym &eSym = eSyms[i];
    this->symbols[i]->resolve(Undefined(...));
  }
}

Shared objects

A shared object contributes just symbols to the link process. For a shared object, ld.lld reads its dynamic symbol table (SHT_DYNSYM) which consists of undefined symbols (represented identically to an undefined in a relocatable object file) and defined symbols (represented as a Shared).

template <class ELFT> void SharedFile::parse() {
  ArrayRef<Elf_Sym> syms = this->getGlobalELFSyms<ELFT>();
  for (size_t i = 0, e = syms.size(); i != e; ++i) {
    const Elf_Sym &sym = syms[i];
    StringRef name = sym.getName(this->stringTable);
    uint16_t idx = versyms[i] & ~VERSYM_HIDDEN; // SHT_GNU_versym index
    if (sym.isUndefined()) {
      if (idx != VER_NDX_LOCAL && idx != VER_NDX_GLOBAL)
        name = saver.save((name + "@" + verneedName).toStringRef());
      Symbol *s = symtab->insert(name);
      s->resolve(Undefined(this, name, ...));
      continue;
    }
    if (!(versyms[i] & VERSYM_HIDDEN)) {
      // Handle the unversioned definition.
      Symbol *s = symtab->insert(name);
      s->resolve(SharedSymbol(*this, name, ...));
    }
    if (idx == VER_NDX_GLOBAL)
      continue;
    // Handle the versioned definition.
    if (idx >= verdefs.size() || idx == VER_NDX_LOCAL) {
      error(...);
    } else {
      name = saver.save((name + "@" + verdefName).toStringRef());
      Symbol *s = symtab->insert(name);
      s->resolve(SharedSymbol(*this, name, ...));
    }
  }
}

Archives

This is an ancient trick to decrease linker work: initially every archive member is in a lazy state; if an archive member provides a definition that resolves an undefined symbol, it is extracted. Extraction may trigger recursive extraction within the same archive (by default) or across other archives (--start-group).
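A minimal sketch of the lazy scheme, with hypothetical names (LLD's real machinery differs): members stay unparsed until an undefined symbol matches one of their index entries.

#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical sketch of lazy archive extraction.
struct ArchiveMember {
  std::string name;
  bool extracted = false;
  std::function<void()> parse; // parses the member, may resolve more symbols
};

// Map from defined symbol name (from the archive index) to its member.
std::unordered_map<std::string, ArchiveMember *> lazyIndex;

// Called when symbol resolution sees an undefined symbol.
void onUndefined(const std::string &name) {
  auto it = lazyIndex.find(name);
  if (it == lazyIndex.end() || it->second->extracted)
    return;
  it->second->extracted = true;
  // Parsing may introduce new undefined symbols, which can in turn
  // trigger further extraction (recursively within the archive, or
  // across archives with --start-group).
  it->second->parse();
}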

Symbol resolution

TODO: move some part to "how to implement an ELF linker"

The resolution between two .o defined/undefined symbols is associative. In the absence of weak symbols and duplicate definition errors, the interaction is also commutative.

The resolution between a .o and a .so is commutative and associative.

The resolution between two .so is associative. If two .so don't define the same symbol, the resolution is also commutative.

For an archive symbol, its interaction with any other file kind is neither commutative nor associative.
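To make the order sensitivity concrete, here is a toy pairwise resolution rule (my deliberate simplification; real linkers also handle weak binding, COMMON symbols, versioning, and duplicate-definition errors):

enum class Kind { Undefined, Lazy, SharedDef, RegularDef };

// Toy resolution of an existing symbol table entry against an incoming
// symbol. Only the archive (Lazy) rules matter for the argument here.
Kind resolve(Kind existing, Kind incoming) {
  // A regular (.o) definition beats everything else in this toy model.
  if (incoming == Kind::RegularDef || existing == Kind::RegularDef)
    return Kind::RegularDef;
  // An undefined reference meeting a lazy archive symbol extracts the
  // member, which then acts as a regular definition.
  if ((existing == Kind::Undefined && incoming == Kind::Lazy) ||
      (existing == Kind::Lazy && incoming == Kind::Undefined))
    return Kind::RegularDef;
  // A shared definition satisfies an undefined reference without
  // extracting anything.
  if (existing == Kind::Undefined && incoming == Kind::SharedDef)
    return Kind::SharedDef;
  return existing;
}

With this rule, the inputs {Undefined, Lazy, SharedDef} produce different results depending on order: resolving the undefined reference against the archive first yields a regular definition (the member is extracted), while resolving it against the shared object first leaves the member unextracted and the shared definition wins. This is exactly the a.a/b.so scenario discussed under "Parallel symbol resolution" below.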

When linking a large program with little debug information, ld.lld --time-pass shows several symbol-resolution-related passes, each taking anywhere from 0.x% to 3% of the time. For scale, even an iteration over the global symbol table that does nothing takes 0.x% of the time. In the presence of --dynamic-list and --version-script, the upper bound can go higher because wildcard pattern matching can be slow. Some passes implement ideal symbol resolution semantics that 99.9% of software in the wild does not need: demoting shared symbols/archives, parsing symbol versions, redirecting symbols (for the edge-case --wrap and unversioned symbol elimination).

I believe mold handles fewer cases and will take similar performance hits if the tricky cases are handled.

What can be parallelized

"Parse input and perform symbol resolution" is typically the largest bottleneck in a link and can be parallelized in three places.

  • initialization of sections (embarrassingly parallel)
  • initialization of local symbols (embarrassingly parallel)
  • initialization of non-local symbols
  • symbol resolution

Parallel initialization of local symbols

Initialization of local symbols is embarrassingly parallel. As of 13.0.0, ld.lld called make<Defined> or make<Undefined> for each symbol entry, where make is backed by the non-thread-safe BumpPtrAllocator. I recently changed the code to batch-allocate SymbolUnion. Is this part ready to be parallelized? Not yet.

To support linker scripts, a defined symbol may be relative to either an output section or an input section. The symbol representation therefore has a pointer member section, which means we need the section representations first. To diagnose relocations referencing a local symbol defined in a discarded section (due to COMDAT rules), a local symbol may be represented as Undefined. This needs the section pointer and COMDAT group resolution to decide whether a local symbol should be Defined or Undefined.

For relocatable object files, ld.lld initializes sections before symbols. However, for archives and lazy object files, ld.lld does not parse their sections. In a large link, most files are usually archives and lazy object files, so parallelizing just relocatable object files would be of little help.

One idea is to eagerly parse archives and lazy object files, and forgo the archive index. This would make the linker better prepared for parallelism, at a cost to single-threaded performance.

Tip: ld.lld --print-archive-stats=- writes the utilization rate of archives to - (stdout).

If we do parallel initialization, the section member may need to be fixed up later. The symbol kind may also change from Defined to Undefined after COMDAT group resolution. Note that the symbol kind does not necessarily need to be expressed through C++ class inheritance, though ld.lld currently does this, and it is difficult to change at this point due to the numerous references.

Parallel initialization of non-local symbols

If both a.o and b.o have a non-local symbol foo, they need to refer to the same entry in the global symbol table. Conceptually an object file needs a data structure like vector<Symbol *> symbols;. In the database world, this is a hash join or sort merge join problem. Every linker implementation uses a hash join.

ld.lld represents the global symbol table with llvm::DenseMap<llvm::CachedHashStringRef, int> symMap; std::vector<Symbol *> symVector;. For each symbol table entry, ld.lld reserves a symbol in the global symbol table if it does not exist yet. If we had a concurrent hash table, the insertion could obviously be parallelized. In April 2021, there was an llvm-dev thread "Concurrent Hashmap?" but we haven't taken any action.
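A simplified sketch of the reserve-on-first-sight insertion, using standard containers in place of llvm::DenseMap and CachedHashStringRef (allocation details elided):

#include <string>
#include <unordered_map>
#include <vector>

struct Symbol { /* name, binding, defining file, ... */ };

struct SymbolTable {
  std::unordered_map<std::string, int> symMap; // name -> index into symVector
  std::vector<Symbol *> symVector;             // stable insertion order

  // Return the unique Symbol for this name, creating a placeholder on
  // first sight. a.o and b.o both mentioning foo get the same pointer.
  Symbol *insert(const std::string &name) {
    auto [it, inserted] = symMap.try_emplace(name, (int)symVector.size());
    if (!inserted)
      return symVector[it->second];
    symVector.push_back(new Symbol); // ld.lld bump-allocates a SymbolUnion
    return symVector.back();
  }
};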

mold uses a concurrent hash table from Intel oneTBB.

As with local symbols, if we eagerly parse archives and lazy object files and forgo the archive index, we can parallelize this part.

Parallel symbol resolution

mold uses parallel symbol resolution. I previously mentioned that the resolution of archive symbols is not associative, so parallelizing it is difficult. For a.a b.so, in GNU ld, gold, and ld.lld:

  • if the a.a member is extracted, it will become a regular object definition and override b.so instead.
  • if the a.a member is not extracted, the b.so definition wins.

mold seems willing to break this compatibility. I know 99.99% of software doesn't rely on this point, but personally I think the 0.01% case is important and reflects a symbol resolution ideal I want to keep up with.

mold assigns lower priorities to archive symbols than shared symbols. The b.so definition wins, regardless of the position of a.a.

According to mold --perf, the symbol initialization and resolution performance is like the following:

  • 1 thread: 1.3x slower than ld.lld (atomics have costs)
  • 2 threads: 1.75x faster than 1 thread
  • 4 threads: 3x faster than 1 thread
  • 8 threads: 5.3x faster than 1 thread
  • 16 threads: 5.4x faster than 1 thread

Global transforms

--gc-sections

See Linker garbage collection. For a -ffunction-sections -fdata-sections build, GC can take little time (say, 2%) if there is debug information, or a lot (10%) when there is little debug information. .debug_* sections are large and monolithic (far fewer sections than .text.* and .data.*) and take significant write time, so they dilute the proportion of GC time.

The code is straightforward to parallelize; unfortunately, llvm-project does not provide a parallel producer–consumer framework like oneTBB's [algorithms.parallel_for_each.feeder]. Once such a library is available, it will not be difficult to parallelize this part.
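For reference, the sequential mark phase is just a worklist traversal; in a parallel version, each worker would feed newly reached sections back into a shared pool, which is what the oneTBB feeder provides. A minimal sequential sketch with made-up types:

#include <utility>
#include <vector>

struct Section {
  bool live = false;
  std::vector<Section *> successors; // sections reached via its relocations
};

// Mark everything reachable from the GC roots (e.g. the section holding
// the entry point); unmarked sections are discarded afterwards.
void markLive(std::vector<Section *> roots) {
  std::vector<Section *> worklist = std::move(roots);
  while (!worklist.empty()) {
    Section *s = worklist.back();
    worklist.pop_back();
    if (s->live)
      continue;
    s->live = true;
    for (Section *succ : s->successors)
      if (!succ->live)
        worklist.push_back(succ);
  }
}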

--icf

If enabled, this pass may take 3~15% of the link time. It is more expensive than the current sequential --gc-sections.

That said, ld.lld's algorithm is pretty good. It computes a lightweight message digest for each section, then collects the digests into buckets. For each bucket, it does pairwise comparison, which is easy to reason about, and the expensive computation can be performed lazily.
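A hedged sketch of the bucketing step (a simplification of ld.lld's iterative equivalence-class refinement; the hash here is a stand-in for a digest over contents plus relocation shape):

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct Section {
  std::string contents; // stand-in for data + relocation targets
};

// A lightweight digest; ld.lld hashes contents together with relocation
// "shape" so that sections that can never fold land in different buckets.
static uint64_t shallowHash(const Section &s) {
  return std::hash<std::string>{}(s.contents);
}

// The expensive pairwise comparison then runs only within each bucket.
std::vector<std::vector<Section *>> bucketize(std::vector<Section *> &secs) {
  std::unordered_map<uint64_t, std::vector<Section *>> buckets;
  for (Section *s : secs)
    buckets[shallowHash(*s)].push_back(s);
  std::vector<std::vector<Section *>> result;
  for (auto &[hash, group] : buckets)
    if (group.size() > 1) // a singleton can never fold with anything
      result.push_back(std::move(group));
  return result;
}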

Another idea is to use message digests to fully describe a section (mold). The message digest can be more expensive to compute as the pairwise comparison approach may skip many condition branches.

My test shows that mold's algorithm is sometimes slower, for example when linking Clang.

Scan relocations

An input section may have a dependent SHT_REL/SHT_RELA section. ld.lld scans the relocation records to decide what actions are taken (a simplified sketch follows the list):

  • link-time constants: the location should be updated in-place
  • GOT: .got and .rela.dyn need an update
  • PLT: .plt, .got.plt, and .rela.plt need an update
  • copy relocations, canonical PLT entries
  • TLS dynamic relocations
  • report a diagnostic
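A toy classification, far simpler than ld.lld's RelExpr machinery (all names here are made up):

enum class RelExprKind { Abs, PcRel, GotEntry, PltEntry };
enum class Action { InPlace, DynamicReloc, NeedsGot, NeedsPlt };

struct Symbol {
  bool isPreemptible = false; // may be interposed by another module at run time
};

// Decide what a single relocation needs. Real scans also handle TLS,
// copy relocations, canonical PLT entries, and error reporting.
Action scanOne(RelExprKind expr, const Symbol &sym, bool isPic) {
  switch (expr) {
  case RelExprKind::GotEntry:
    return Action::NeedsGot;       // reserve a .got slot; maybe .rela.dyn too
  case RelExprKind::PltEntry:
    return sym.isPreemptible ? Action::NeedsPlt   // .plt/.got.plt/.rela.plt
                             : Action::InPlace;   // the call can bind directly
  case RelExprKind::Abs:
    if (sym.isPreemptible || isPic)
      return Action::DynamicReloc; // emit into .rela.dyn
    return Action::InPlace;        // a link-time constant
  case RelExprKind::PcRel:
    return Action::InPlace;        // constant within this module (simplified)
  }
  return Action::InPlace;
}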

For single-threading, the speed of mold's relocation scan is 1.7~2.3x of ld.lld's. I think fundamentally it is because ld.lld does more work. It has more dispatches and abstraction costs (e.g. virtual function), and provides more diagnostics and handles many tricky cases that mold doesn't do:

  • support all of REL/RELA/mips64el for one architecture (even if the architecture typically doesn't use REL)
  • use a generic relocation expression (RelExpr) design, which costs two virtual function calls per relocation
  • handle non-preemptible GNU indirect function
  • retain otherwise unused .got and .got.plt for certain relocation types
  • Hexagon/Mips/PowerPC hacks
  • Unfortunate Fortran COMMON symbol semantics (only needed by PowerPC64 AFAIK)

Every check has a small cost. Many a little makes a mickle.

I have a patch refactoring the way ld.lld does relocations, but it may significantly increase the line count: https://reviews.llvm.org/D113228. I do not know whether it is the right trade-off.

Another thing worth pointing out is that this pass records some relocations (InputSectionBase::relocations) which are needed by subsequent passes like range extension thunks. Many input relocations are currently pass-through, so we pay push_back time and memory. If we speculatively assumed most relocations to be pass-through, we could omit InputSectionBase::relocations, but subsequent passes would have to consume input that is more difficult to deal with.

Aside from the constraints mentioned above, relocation scan is conceptually straightforward to parallelize. Local symbols need no communication, while non-local symbols just need atomic states.

My https://reviews.llvm.org/D114783 is an important refactoring of ld.lld's relocation scan. To enable parallelism, there is still much to do. We would likely have to exclude some architectures needing special treatment (Mips, PowerPC). With the other constraints, it may or may not be a good trade-off.

One way to reduce dispatch and virtual function costs is to change template <class ELFT, class RelTy> void scanRelocs(...) to template <class Arch, class RelTy> void scanRelocs(...). For one architecture, dead code elimination will help. This approach will however lead to some code bloat.
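A sketch of what that looks like; X86_64 here is a hypothetical target class, not ld.lld's actual interface:

// With the architecture as a template parameter, getRelExpr becomes a
// direct call the compiler can inline, and branches impossible for this
// Arch are dead-code-eliminated.
struct X86_64 {
  static constexpr bool usesRela = true;
  template <class RelTy> static int getRelExpr(const RelTy &rel) {
    return 0; // placeholder: classify rel's type for this target
  }
};

template <class Arch, class RelTy>
void scanRelocs(const RelTy *begin, const RelTy *end) {
  for (const RelTy *r = begin; r != end; ++r) {
    int expr = Arch::getRelExpr(*r); // no virtual dispatch
    (void)expr; // ... act on the classification ...
  }
}

Each (Arch, RelTy) pair instantiates a separate copy of the scan loop, which is where the code bloat comes from.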

In a Clang build, I measured the following for mold:

  • 1 thread: 2.3x faster than ld.lld
  • 2 threads: 1.88x faster than 1 thread
  • 4 threads: 3.28x faster than 1 thread
  • 16 threads: 11.13x faster than 1 thread

On the other hand, with little debug information, relocation scan takes 15% of the total time. IMO parallelism does not pay off unless we also optimize the other critical paths so that this 15% stands out. I raised this point during the 2021 LLVM Dev Meeting round table "LLD for Mach-O".

Map input sections and synthetic sections into output sections

To support linker scripts, ld.lld's model was designed around output section descriptions and input section descriptions. A typical scenario where a user complains about link performance does not involve a linker script. Nevertheless, the mapping process takes a slight performance hit because we reuse the same framework even in the absence of a SECTIONS command. For a large executable with little debug information, "Assign sections" took 2% of the time.

If we want to optimize this case, we will need to introduce a separate code path which will lead to implementation complexity.

SHF_MERGE duplicate elimination

A section with the SHF_MERGE|STRINGS flags has data elements consisting of NUL-terminated strings. In a program with debug information, .debug_str (with the SHF_MERGE|STRINGS flags) may be the hottest pass. A section with just the SHF_MERGE flag has data elements of a uniform size. In either case, duplicate elements can be eliminated as an optional optimization. ld.lld performs the optimization with (the default) -O1 and above.

ld.lld splits the section content into pieces, collects pieces from all sections, then eliminates duplicates with a hash table. An obvious idea is to use a concurrent hash table. As said previously, llvm-project does not provide one. ld.lld duplicates work and does the following instead:

size_t concurrency = PowerOf2Floor(std::min<size_t>(numThreads, 32));
// Each thread scans all pieces but only inserts those hashed to its shard.
parallelForEachN(0, concurrency, [&](size_t threadId) {
  for (MergeInputSection *sec : sections)
    for (auto &piece : sec->pieces)
      if (piece.live &&
          (getShardId(piece.hash) & (concurrency - 1)) == threadId)
        shards[getShardId(piece.hash)].add(piece.data);
});

// Compute an in-section offset for each shard.
size_t off = 0;
for (size_t i = 0; i < numShards; ++i) {
  shards[i].finalizeInOrder();
  if (shards[i].getSize() > 0)
    off = alignTo(off, alignment);
  shardOffsets[i] = off;
  off += shards[i].getSize();
}
size = off;

// Rewrite each piece's output offset relative to its shard's start.
parallelForEach(sections, [&](MergeInputSection *sec) {
  for (auto &piece : sec->pieces)
    if (piece.live)
      piece.outputOff += shardOffsets[getShardId(piece.hash)];
});

It would be nice if we could explore alternative algorithms.

Open output file

For -o out, ld.lld unlinks out asynchronously, creates out in read-write mode, and resizes out to the output size. The output size is computed in the writer, so this has to wait until output section addresses are assigned.

I recently noticed that the way LLVM's Support library resizes the file is problematic for Linux tmpfs. The API calls posix_fallocate, which has the nice property of reporting ENOSPC up front on some filesystems and operating systems, but may be bad for tmpfs. See https://reviews.llvm.org/D115957. On my system, posix_fallocate on a 1.4GiB output costs 0.3s, which is significant.
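A hedged sketch of the trade-off (error handling abbreviated; real code should also consider the filesystem type, which is what D115957 is about):

#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

// posix_fallocate reports ENOSPC up front on supporting filesystems, but
// on tmpfs it eagerly allocates every page (the 0.3s mentioned above).
// ftruncate is cheap everywhere but defers ENOSPC to page-fault time.
int openAndResize(const char *path, off_t size) {
  int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0777);
  if (fd < 0)
    return -1;
  if (posix_fallocate(fd, 0, size) != 0 && ftruncate(fd, size) != 0) {
    close(fd);
    return -1;
  }
  return fd;
}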

An alternative design is to overwrite the output file in place. This would lead to old/new file inconsistency problems. If the old file has hard links, this would corrupt the other links. Also, one has to be careful to zero all bytes, otherwise the output may not be reproducible.

Write sections

Some output sections contain a single synthetic section. I plan to analyze some expensive sections more carefully. For my experiments this week I have relied heavily on additional time tracing (https://reviews.llvm.org/D115984).

I found that .rela.dyn sorting (-z combreloc) took 10+% of the time for a position-independent executable with millions of relocations. I pushed a trivial parallel commit and have a pending patch to optimize it further: https://reviews.llvm.org/D115993. This is a great example demonstrating that generality and feature creep bring us constraints and abstraction costs (a sketch of the sort order follows the list):

  • outputSec is in the dynamic relocation representation because of Mips.
  • We want to support all of REL/RELA/mips64el. We value code brevity, so manual loop unswitching would not be OK.
  • We support LLD's partition feature, so obtaining the symbol index has an extra condition which adds a little cost.
  • We support Android packed relocations which have very different encoding schemes, so refactoring encodeDynamicReloc needs to touch it.
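Returning to the sort itself, here is a hedged sketch of the -z combreloc ordering (simplified from what ld.lld emits): relative relocations first so ld.so can apply them in a tight loop, the rest grouped by symbol and ordered by address for locality. llvm::parallelSort is the natural drop-in for the parallel version.

#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

struct DynamicReloc {
  uint32_t type;     // e.g. R_X86_64_RELATIVE
  uint32_t symIndex; // 0 for relative relocations
  uint64_t offset;   // r_offset
};

void sortCombreloc(std::vector<DynamicReloc> &relocs, uint32_t relativeType) {
  std::sort(relocs.begin(), relocs.end(),
            [&](const DynamicReloc &a, const DynamicReloc &b) {
              bool aRel = a.type == relativeType, bRel = b.type == relativeType;
              if (aRel != bRel)
                return aRel; // all relative relocations come first
              return std::tie(a.symIndex, a.offset) <
                     std::tie(b.symIndex, b.offset);
            });
}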

--compress-debug-sections=zlib

In one invocation linking a Clang executable (-DCMAKE_BUILD_TYPE=RelWithDebInfo), ld.lld spent 36.7s compressing .debug_* sections while the total link time was 41.5s.

There are several problems.

  • No format except zlib is standard. This is a big ecosystem issue. ld.lld is part of the ecosystem and can drive it, but that requires significant buy-in from debug information consumers. As a generic ELF feature, a new format needs a generic-abi discussion.
  • llvm-project offloads to the system zlib. It does not leverage parallelism.
  • Technically ld.lld can compress different .debug_* output sections in parallel, but the degree of parallelism would be low and it would add implementation complexity.

--build-id

According to https://fedoraproject.org/w/index.php?title=RolandMcGrath/BuildID&oldid=16098, the feature was added as an "approximation of true uniqueness across all binaries that might be used by overlapping sets of people".

In ld.lld, --build-id is implemented as a tree hash, which is pretty fast. The performance is limited by the library code checked into llvm-project. Some library code is not optimal (see https://reviews.llvm.org/D69295 for SHA-1).
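Roughly, a tree hash splits the content into fixed-size chunks, hashes each chunk independently (this is the parallel part), then hashes the concatenated chunk digests. A sketch with a stand-in 64-bit hash (the real digest depends on the --build-id style requested):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string_view>
#include <vector>

// Stand-in for a real digest such as SHA-1 or xxHash.
static uint64_t hashChunk(const char *data, size_t size) {
  return std::hash<std::string_view>{}(std::string_view(data, size));
}

uint64_t treeHash(const char *data, size_t size, size_t chunkSize) {
  std::vector<uint64_t> digests;
  for (size_t off = 0; off < size; off += chunkSize)
    // Chunks are independent; a real implementation hashes them in
    // parallel (e.g. with parallelForEach) and only joins here.
    digests.push_back(hashChunk(data + off, std::min(chunkSize, size - off)));
  // The final digest covers the concatenated per-chunk digests.
  return hashChunk(reinterpret_cast<const char *>(digests.data()),
                   digests.size() * sizeof(uint64_t));
}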

Note: ld.lld's --build-id feature, as I know, has no validation tool.

Executable size

Executable size costs a little performance. This matters very little, but I still want to mention it.

The lld codebase has many make<XXX> calls where XXX is a singleton. In ELF, when XXX denotes a synthetic section, the make<XXX> return value is typically assigned to a global variable XXX *x. The return value resides as a singleton in a 4096-byte slab. Every such type pulls in lots of template code for BumpPtrAllocator, which makes the x86-64 executable more than 1KiB larger.

  • If we change the global variable to std::unique_ptr<XXX>, we get a non-trivial destructor that runs at exit time, which is bad.
  • If we manually destroy global variables, we would do work which should be handled by RAII.
  • Global states are difficult to remove from ld.lld.

If your lld executable is a position-independent executable or is linked against libLLVM.so or libLLD*.so, you pay some relocation processing costs in ld.so. For an lld executable, the -pie version spends an extra 8ms resolving the 34000+ additional relocations (mostly R_X86_64_RELATIVE).

lld is large because LTO pulls in a substantial portion of the LLVM libraries, including middle-end optimizations and code generation. I did an experiment (https://github.com/MaskRay/picolld) showing that if one removes LTO and the non-ELF ports, the ELF-only lld executable is just ~3MiB.

-Map

The feature is for debugging and binary analysis. The link map output is huge, so producing it is inherently slow. ld.lld parallelizes the computation of some symbol properties.

ld64.lld adopts an approach that makes the link map computation asynchronous, interleaved with other parallelizable passes.
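A sketch of that idea with std::async (the helper names are hypothetical):

#include <future>
#include <string>

static std::string computeMapFile() {
  return "..."; // expensive: format every output section and symbol
}

static void writeMapFile(const std::string &) {}

void link() {
  // Kick off the link map computation early; join it only at the end so
  // it overlaps with layout, relocation, and section writing.
  std::future<std::string> map = std::async(std::launch::async, computeMapFile);
  // ... run the remaining passes ...
  writeMapFile(map.get());
}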

New hope?

Within these architectural constraints, I am actively finding performance opportunities for ld.lld. For example, with the patches landed and pending this week, ld.lld --threads={2,4,8} writing to a tmpfs output appears ~11% faster (for both a clang RelWithDebInfo link and a large program with no debug info) compared with 13.0.0.

Epilogue

With multi-threading, why is mold faster than ld.lld? Ordered by my rough estimate of importance:

  • SHF_MERGE duplicate elimination
  • parallel symbol table initialization
  • eager loading of archive members and lazy object files
  • parallel relocation scan
  • faster synthetic sections
  • parallel --gc-sections
  • parallel symbol resolution

mold brings more parallelizable passes to the table. This is nice.

For ld.lld, the lack of suitable data structures and parallel facilities in llvm-project shrinks the design space available to us. This can be fixed.

Features and generality require abstraction. Abstraction has costs. The costs can be tangible (performance) and intangible (reduced design space). For symbol table initialization, symbol resolution, and relocation scanning, I believe ld.lld has found a good balance among the trade-offs. These passes have been ruthlessly challenged by mold. However, with the various constraints, they may be difficult to change. Perhaps the designers of archive semantics and COMDAT had not anticipated that, decades later, linker developers would still be struggling to find a good balance and compromise :) Perhaps 99% of people will not feel sympathy for the sequential choices ld.lld makes in these passes.
