Analysis of the overhead of a minimal Zig program

Jan 1

If you wanted to make a minimal x86-64 Linux program that did nothing, how would you write it? You'd probably whip out an assembler and type something like this:

mov    eax, 60 ; sys_exit
xor    edi, edi
syscall

Letting LLD link it for us nets us a 600-byte binary. Aggressively stripping out all the unnecessary trash that the linker puts into it brings it down to 297 bytes, but we're not interested in linker overhead right now, so let's use 600 bytes as the baseline.

If we write a minimal Zig program that does the same thing, will it be just as small? Probably not. Let's go through every assembly instruction of the Zig binary and see what's up!

First, let's write that program:

pub fn main() void {}

Building it with -O ReleaseSmall --strip -fsingle-threaded results in a 5.4 KiB binary. The very first thing we notice is that not all of the debug symbols have been stripped, because the Zig strip flag isn't completely functional yet and is waiting for the stage 2 compiler. No matter, we just do it manually (with strip -s), shrinking it to 1.7 KiB.

What does all that code do? When we objdump it and take a look, we find 208 lines of assembly consuming 715 bytes. In addition, it uses 128 bytes of read-only data and 12624 bytes of zero-initialized static data in .bss, which only takes up space in the running program and not in the binary itself.

Let's go through each line of assembly to see what's going on. First, we have this:

xor    rbp,rbp

I.e. rbp is cleared. If we take a look in std/start.zig we can see that this comes from inline assembly that Zig runs first thing in _start(). Why? Presumably because the x86-64 ABI calls for it:

The content of this register is unspecified at process initialization time, but the user code should mark the deepest stack frame by setting the frame pointer to zero

I'll allow it. ABI compliance is a very good reason for "wasting" 3 bytes of code, and should arguably be added to our original assembly program. Now, let's check the next line:

mov    QWORD PTR [rip+0x1e1e],rsp

What's this for? Turns out Zig always saves the initial value of rsp, since it starts out pointing at argc, which is followed on the stack by argv, the environment pointers and the auxiliary vector, and you need that address to parse the program arguments. We're not looking at any of that though, so this is at first glance a completely unnecessary waste of 7 bytes.

Next up:

2011e2: call   0x2011e7
2011e7: push   rbp
2011e8: [...]
2011f4: and    rsp,0xfffffffffffffff0

So, we're instantly calling a function located directly on the next byte. Looking around the code, we find that this is the only place it's called from. Why? From reading start.zig we find the answer:

If LLVM inlines stack variables into _start, they will overwrite the command line argument data.

So the reason it isn't inlined is that it's called with never_inline; otherwise LLVM could put things that mess up rsp before the inline assembly that stashes rsp away. Makes sense, though it'd be nicer if there were a non-hacky way of solving it. In any case, we don't need rsp, so ideally we shouldn't have to pay for this at all.

What's up with the and rsp,0xfffffffffffffff0? The function manually aligns the stack by rounding rsp down to a 16-byte boundary. I'm not sure why the stdlib does this. The System V ABI (§2.3.1) already guarantees an initial alignment of 16, both for x86-64 and i386, so it should be superfluous. From looking around a little, musl does the same alignment, as does glibc, but not dietlibc.

Next up, the code parses the auxiliary vector. Not only is this needed for argv, but it also points to the program headers, which the program uses for PIE relocations (if applicable, which it isn't for us). The program headers also record the stack size; if that isn't the default of 8MiB, Zig asks the kernel to resize the stack (it's not done automatically). This seems superfluous; if we compiled the program ourselves and used our own linker, we should be able to hardcode the stack resize at compile time if necessary, not store it in some roundabout program header. Since Zig is working on automatically calculating the maximum stack size required as well, this information could be directly available to the compiler in the future and used here.

Lastly, the data is also needed to initialize the static TLS memory. This is for static threadlocal variables that should have a unique copy for each thread, like errno. "But we are using -fsingle-threaded," you may object, "so why doesn't the compiler turn all the thread-local variables into normal static ones and strip out the TLS section?" The reason is that you could export a threadlocal symbol to another program that's actually threaded, so we can't just remove them willy-nilly.

Moreover, since the TLS initialization calls mmap if the size is large enough, it can fail, in which case it calls abort(). abort() in turn calls raise(SIG.ABRT), and raise in turn masks out all the signals with sigprocmask. It's this call that uses the 128 bytes of read-only data we saw previously. It's fairly large because it needs to contain the entire set of possible signals.

The TLS initialization also explains much of the wasted .bss data; it uses an 8448-byte static buffer when the TLS data is small enough to fit in it.

Tangentially we can see that avoiding TLS when it's not needed is an open issue: #2432, so it's something that's in the pipeline to be handled.

In any case, since we don't use TLS, PIE, argv, nor env variables, all of this is just a waste of space. Let's try commenting all of that out; in start.zig we remove everything that depends on argc, then everything that depends on those lines and so on. After that's done we're more or less back at our initial ideal program size, just with the minor cruft I mentioned at the start:

xor    rbp,rbp
mov    QWORD PTR [rip+0x1016],rsp
call   0x201167
push   rbp ; @ 0x201167
mov    rbp,rsp
and    rsp,0xfffffffffffffff0
push   0x3c
pop    rax
xor    edi,edi
syscall

Now, what was the point of all this? I think there are several benefits to minimizing overhead for simple programs:

Having minimal overhead for tiny programs is actually relevant for system performance. Many scripts, for example, work by chaining together common Unix programs, so you're potentially having the same startup code running tens of thousands of times in a short duration. This can get fairly significant! Right now Linux ameliorates the performance hit from this by either writing built-in copies of the most common tools directly into the shell (like Bash does), or having a single fat binary that you stuff a ton of programs into (like BusyBox) so you don't have to store the same initialization code across hundreds of programs.
The very first thing anybody interested in Zig will attempt to do is compile a "Hello World!" program and look at it. Having it be an order of magnitude smaller than the equivalent C program would be really impressive, and first impressions count for a lot. I've watched friends try Go and immediately uninstall the compiler when they see that the resulting no-op demo program is larger than 2 MiB.
Overhead breeds complacency — if your program is already several megabytes in size, what's a few extra bytes wasted? Such thinking leads to atrocities like writing desktop text editors bundled on top of an entire web browser, and I think it would be nice to have a language that pushes people to be a bit more mindful of the amount of resources they're using.
