Faster zlib/DEFLATE decompression on the Apple M1 (and x86)

Posted on August 20, 2022

DEFLATE is a relatively slow compression algorithm from 1991, which (along with its wrapper format, zlib) is incredibly widely used, for example in the PNG, Zip and Gzip file formats and the HTTP, SSL, and SSH protocols. As such, I had assumed the most widely used implementation, zlib-madler, was extremely optimised. I’d also imagined Apple’s system zlib might outperform open-source alternatives on their own hardware. I was wrong. Fortunately there are optimised forks of zlib such as zlib-ng, Chromium’s zlib and zlib-cloudflare, which significantly outperform zlib-madler and Apple’s system zlib.

I tried optimising zlib's decompression for the M1, using zlib-cloudflare as a starting point, and reached 1.51x the speed of zlib-cloudflare, and 2.1x the speed of Apple's system zlib:

Decompression throughput on the Silesia corpus, in compressed MB/s on M1 (files compressed with zlib level 6, built with Apple clang 13.1.6). zlib-dougallj outperforms the three alternatives by roughly 50% on every file; zlib-cloudflare is consistently second best, closely followed by zlib-ng, with Apple's system zlib about 2x behind.

File        apple-zlib   zlib-ng   zlib-cloudflare   zlib-dougallj
dickens        148          219          238              344
mozilla        153          187          199              307
mr             164          187          208              329
nci            119          171          191              282
ooffice        142          170          186              291
osdb           173          206          224              352
reymont        131          210          213              338
samba          151          211          221              321
sao            234          260          276              380
webster        132          195          205              303
x-ray          181          190          209              330
xml            122          181          194              290
Decompression speed comparison (higher is better)

My version is also faster on x86, at ~1.3x the speed of zlib-cloudflare using clang (~1.7x the speed of Apple’s system zlib), and ~1.46x using gcc (~2.3x the speed of Ubuntu 20.04’s system zlib).

If you have a choice, please use a modern alternative like zstd instead of zlib. If you need zlib, consider optimised zlib forks, like zlib-ng or zlib-cloudflare (or libdeflate if you don’t need zlib API compatibility or streaming). If you maintain a zlib fork or library, please consider incorporating some of my changes. (If you want to use my code, you are free to do so under the zlib license, but it is under-tested, and the bugs tend to be vulnerabilities.)

This speedup is the product of a number of changes, which are discussed in the remainder of the post. The changes interact with each other in very complex ways, so individual speedup measurements shouldn’t be taken seriously – making the same changes in a different order would completely change many numbers.

Bigger root tables

The decoder uses a table to look up bit-sequences and map them to their decoded values. The longest Huffman code allowed by the DEFLATE specification is 15 bits, which would require a 128KB table, so to fit tables comfortably in cache, zlib splits this into a root table and a number of (possibly nested) subtables.

Every time the decoder reads a symbol that isn't in the root table, we'll have a branch mispredict, followed by extra latency for the subtable lookup, making root-table misses very expensive. One of the simplest speedups was increasing the root table size, for a ~7% decompression speed improvement.

(The table size is still limited to the maximum code length used, meaning this typically has zero performance impact on tiny files, where table-building time dominates, since they’re unlikely to use longer codes.)
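As a rough illustration, the two-level lookup looks something like the sketch below. The field layout and the IS_SUBTABLE_LINK flag are illustrative, not zlib's exact encoding; the point is the extra dependent load (and likely mispredict) on the subtable path.

#include <stdint.h>

#define ROOT_BITS 10                /* illustrative root table size */
#define IS_SUBTABLE_LINK 0x40       /* illustrative flag, not zlib's encoding */

typedef struct { uint8_t op, bits; uint16_t val; } code;

static inline code decode_lookup(const code *root, const code *sub,
                                 uint64_t bitbuf) {
    code c = root[bitbuf & ((1u << ROOT_BITS) - 1)];
    if (c.op & IS_SUBTABLE_LINK) {
        /* rare path: a dependent second load, and a likely branch mispredict */
        c = sub[c.val + ((bitbuf >> ROOT_BITS) & ((1u << c.bits) - 1))];
    }
    return c;
}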

Adler32 with UDOT

The zlib wrapper adds an adler32 checksum to the DEFLATE data, which is verified on decoding.

There’s already a NEON SIMD adler32 implementation in zlib-cloudflare, but, not unlike my CRC32 post, the existing implementation doesn’t take full advantage of the unusually wide M1. I added a new implementation which is 2x wider, takes advantage of the amazing UDOT instruction (part of the optional dotprod extension in ARMv8.2, mandatory in v8.4), and uses 64-bit scalar values to allow more iterations of the inner loop before reducing. I think this computes checksums at ~1.5x the speed of the old version (~60GB/s), but the decompression speedup is only about 1.01x, since it was already very fast.
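For a flavour of how UDOT helps, here's a minimal sketch of one 16-byte step (not the actual patch, which keeps all state in vectors and defers the mod-65521 reductions much longer, using 64-bit scalar accumulators). Over a 16-byte block, s1 grows by the plain byte sum and s2 by 16*s1 plus a weighted byte sum, and both sums map directly onto UDOT (the vdotq_u32 intrinsic, which requires the dotprod extension):

#include <arm_neon.h>
#include <stdint.h>

/* one 16-byte adler32 step; compile with -march=armv8.2-a+dotprod */
static void adler32_step16(uint32_t *s1, uint32_t *s2, const uint8_t *p) {
    static const uint8_t ones[16] = {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1};
    static const uint8_t desc[16] = {16,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1};
    uint8x16_t v = vld1q_u8(p);
    /* UDOT accumulates four 4-byte dot products into the 32-bit lanes */
    uint32x4_t sum  = vdotq_u32(vdupq_n_u32(0), v, vld1q_u8(ones));
    uint32x4_t wsum = vdotq_u32(vdupq_n_u32(0), v, vld1q_u8(desc));
    *s2 += 16 * *s1 + vaddvq_u32(wsum);  /* s2 += 16*s1 + sum((16-i) * p[i]) */
    *s1 += vaddvq_u32(sum);              /* s1 += sum(p[i]) */
    /* both sums still need reducing mod 65521 before they can overflow */
}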

I was very happy to see UDOT-based adler32 added to libdeflate, with impressive performance numbers on non-Apple CPUs too (e.g. ~1.7x speed on Cortex-A55).

Reading Bits

Most of the speedup just comes from reading and applying Fabian Giesen's posts on Huffman decoding and bit-buffer refills (the "Reading bits in far too many ways" series).

The goal was to use roughly the following variant-4-style loop to decode and unconditionally refill with eight-cycle latency on Apple M1:

; do {
;  bitbuf |= read64LE(bitptr) << bitcount 
ldr   x11, [x0]
lsl   x11, x11, x9              ; 1c
orr   x10, x11, x10             ; 1c

;  bitptr += 7
add   x0, x0, #0x7             

;  bitptr -= ((bitcount >> 3) & 7)
ubfx  x11, x9, #3, #3
sub   x0, x0, x11

;  bitcount |= 56
orr   x9, x9, #0x38

;  value = table[bitbuf & 0x1FF]
and   x11, x10, #0x1ff          ; 1c
ldr   w11, [x4, x11, lsl #2]    ; 4c

;  bitcount -= value
sub   x9, x9, x11               ; 1c

;  bitbuf >>= (value & 63)
lsr   x10, x10, x11

;  *output++ = value >> 8
lsr   x11, x11, #8
strb  w11, [x1], #1

; } while (--n);
subs  x8, x8, #0x1
b.ne  start

When we don’t need to refill, we can just do a decode with six-cycle latency:

;  value = table[bitbuf & 0x1FF]
and   x11, x10, #0x1ff          ; 1c
ldr   w11, [x4, x11, lsl #2]    ; 4c

;  bitcount -= value
sub   x9, x9, x11  

;  bitbuf >>= (value & 63)
lsr   x10, x10, x11             ; 1c

;  *output++ = value >> 8
lsr   x11, x11, #8
strb  w11, [x1], #1

Most of the commits are trying to encourage the compiler to generate this code without breaking anything. Simply forcing the compiler to use a 32-bit, shifted load was a ~9% speedup, and moving the shift amount to the least-significant bits was another ~3.5%. (Ignoring the high bits of bitcount was accidentally merged into this commit, but had another measurable speedup. Git is counterintuitive. Unconditional refill was combined with the upcoming fastpath optimisation here.)

As covered in Fabian's blog posts, we're mostly concerned with latency. The M1 can run 8 instructions per cycle, meaning that we have time to run 64 instructions during that 8-cycle latency chain – a limit we're unlikely to hit. So the focus is on reducing bit-reading latency and reducing branch mispredicts.
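In C, the branchless refill from the loop above looks roughly like the following (a sketch; zlib's actual state names differ, and real code needs input padding or a bounds check before the 8-byte read):

#include <stdint.h>
#include <string.h>

static inline void refill56(uint64_t *bitbuf, unsigned *bitcount,
                            const uint8_t **bitptr) {
    uint64_t next;
    memcpy(&next, *bitptr, 8);               /* read64LE on a little-endian target */
    *bitbuf |= next << (*bitcount & 63);     /* splice fresh bits above the old ones */
    *bitptr += 7 - ((*bitcount >> 3) & 7);   /* advance by the bytes consumed so far */
    *bitcount |= 56;                         /* buffer now holds at least 56 valid bits */
}

Note that every use of bitcount masks it (& 63 for the shift, & 7 after the byte shift), so only its low six bits are ever trusted. That is what makes it safe for the decode step to subtract the entire table entry (symbol bits and all) from bitcount, as mentioned above.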

Reading Extra Bits

When reading either the length or distance of an LZ77 pair, the decoder would first read a Huffman code, and then optionally read a number of extra bits:

int entry = table[bits & 0x1FF]; // 5c
bits >>= (entry & 63); // 1c
int extra_bits = entry >> 8;
int extra = bits & ((1 << extra_bits) - 1);
bits >>= extra_bits; // 1c

This second shift adds a cycle of latency. I changed this to combine the two shifts, then look back at the old value to extract the extra bits, saving a cycle of latency:

uint64_t old_bits = bits;
int entry = table[bits & 0x1FF];        // 5c
bits >>= (entry & 63);                  // 1c, includes extra
int non_extra_bits = (entry >> 8);
old_bits &= ~(~0ull << (entry & 63));   // keep only the bits just consumed
int extra = old_bits >> non_extra_bits;

This required significant changes to the table format, making it one of the more complicated and error prone changes. Speedup ~5.5%.

Fast-Path for Literals

Since we can have 6-cycle latency (rather than 8-cycle latency) if we decode without refilling, I added a fast-path to the top of the loop that can decode two literals if they’re in the root table, before proceeding as usual. This is cheap because we already have to branch on whether or not a value is a literal.

This can take the first 20 bits of our 56-bit buffer. Unfortunately, the worst case for a length+distance pair with extra bits is 48 bits, which we may no longer have, so this requires a conditional refill if we don't have enough bits while reading a distance code. However, this worst case is exceedingly rare – the whole point of compression is to use short codes in the common cases – and so the "needs refill" branch predicts essentially perfectly, and this has a very low cost. Speedup ~4%.
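A sketch of the shape of that fast path, at the top of the loop (the is_literal predicate is illustrative; the real code tests fields of the table entry):

/* after an unconditional refill we hold at least 56 bits, so two
   root-table literals can consume up to 20 of them without checking */
int entry = table[bitbuf & 0x1FF];
if (is_literal(entry)) {             /* a branch we pay for either way */
    bitbuf >>= entry & 63;
    bitcount -= entry;               /* garbage high bits are masked on use */
    *output++ = (uint8_t)(entry >> 8);
    entry = table[bitbuf & 0x1FF];   /* second literal: still no refill */
    if (is_literal(entry)) {
        bitbuf >>= entry & 63;
        bitcount -= entry;
        *output++ = (uint8_t)(entry >> 8);
        entry = table[bitbuf & 0x1FF];
    }
}
/* fall through to the usual literal/length/distance handling of entry */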

Overlapping Mispredict Latency

There's a fancy "chunkcopy" optimisation, which is responsible for some of zlib-cloudflare's performance over zlib-madler (by copying LZ77 references 16 bytes at a time), but this can have expensive branch mispredicts. There's a benefit to performing the refill and Huffman table lookup for the next iteration before the chunkcopy, allowing the load latency to overlap with any mispredict latency. Speedup ~6%.
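The reordering is roughly the following, reusing the refill56 sketch from earlier (this copy loop assumes dist >= 16 and padded output; the real chunkcopy also handles short overlapping distances):

/* start the next symbol's refill and table load first... */
refill56(&bitbuf, &bitcount, &bitptr);
int next_entry = table[bitbuf & 0x1FF];   /* 4-cycle load issues now */

/* ...so it runs in the shadow of any mispredict in the copy below */
uint8_t *from = out - dist;
do {
    memcpy(out, from, 16);                /* 16 bytes at a time, may overshoot */
    out += 16; from += 16;
} while (out < copy_end);
out = copy_end;                           /* trim the overshoot */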

Table-building tweaks

Table-building isn't terribly expensive or terribly frequent, but it does involve counting the number of symbols of each code length (1-15), which can be problematic due to the latency of reading back recently written memory. This might be a bit worse on the M1 than on x86, as the M1 has no zero-latency store-to-load forwarding (but also no extreme penalties). I instead used NEON intrinsics to keep the counts in two 128-bit registers with lower latency, for a ~0.5% decompression speedup. I also used clang's bitreverse intrinsic, and split the table-filling loop to have simpler code for filling the (now larger) root table, only running the more complex code for the (now smaller and less common) subtables, for another ~0.5%. These changes give a total of a ~1% speedup.
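One way to keep that histogram in registers, roughly in the spirit of the NEON version (a sketch; the intrinsic choices are mine): vceqq_u16 produces 0xFFFF, i.e. -1, in matching lanes, so subtracting the compare result increments the matching count.

#include <arm_neon.h>
#include <stdint.h>

/* count code lengths 0-15 without a store-to-load-forwarding chain */
static void count_lengths(const uint8_t *lens, int n, uint16_t counts[16]) {
    static const uint16_t lanes[16] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
    uint16x8_t lane_lo = vld1q_u16(lanes), lane_hi = vld1q_u16(lanes + 8);
    uint16x8_t cnt_lo = vdupq_n_u16(0), cnt_hi = vdupq_n_u16(0);
    for (int i = 0; i < n; i++) {
        uint16x8_t l = vdupq_n_u16(lens[i]);
        cnt_lo = vsubq_u16(cnt_lo, vceqq_u16(lane_lo, l));  /* -(-1) == +1 */
        cnt_hi = vsubq_u16(cnt_hi, vceqq_u16(lane_hi, l));
    }
    vst1q_u16(counts, cnt_lo);
    vst1q_u16(counts + 8, cnt_hi);
}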

(I was trying anything I could think of to try to get to 1.5x, and this was what finally did it… and then I realised my geometric mean calculation was wrong, and I was already past 1.5x without this change. But it is nice to have.)

Final notes

There’s been a lot of excellent (and inspirational) prior work on zlib performance: Adenilson Cavalcanti has been optimizing zlib decompression on ARM in Chromium, Vlad Krasnov significantly improved compression performance, Janakarajan Natarajan and Volker Simonis ported the Chromium optimisations to cloudflare-zlib, and Nigel Tao wrote The Fastest, Safest PNG Decoder in the World (a great read if you missed it).

Eric Biggers' libdeflate also deserves more mention in this post, and should have been included in the graphs – I was looking mainly at zlib-compatible projects, but it is significantly faster than zlib-cloudflare. zlib-dougallj currently outperforms libdeflate version 1.13 on M1 at ~1.17x speed, but I didn't test other platforms.

My changes seem to be a speedup regardless of file, which I hadn't imagined was possible. There's still a fair bit of room for further work here (I didn't even look at the x86 disassembly), but hopefully this can save some time and energy. The code is on GitHub, under the zlib license, but has not been thoroughly tested. I'm @dougallj on Twitter.

Graph of performance on Canterbury corpus, showing slightly smaller, but still nearly 1.5x speedups.
Graph of performance on Calgary corpus, showing slightly larger, but still around 1.5x speedups.
