Faster zlib/DEFLATE decompression on the Apple M1 (and x86)
source link: https://dougallj.wordpress.com/2022/08/20/faster-zlib-deflate-decompression-on-the-apple-m1-and-x86/
DEFLATE is a relatively slow compression algorithm from 1991, which (along with its wrapper format, zlib) is incredibly widely used, for example in the PNG, Zip and Gzip file formats and the HTTP, SSL, and SSH protocols. As such, I had assumed the most widely used implementation, zlib-madler, was extremely optimised. I’d also imagined Apple’s system zlib might outperform open-source alternatives on their own hardware. I was wrong. Fortunately there are optimised forks of zlib such as zlib-ng, Chromium’s zlib and zlib-cloudflare, which significantly outperform zlib-madler and Apple’s system zlib.
I tried optimising zlib’s decompression for the M1, using zlib-cloudflare as a starting point, and was able to reach 1.51x the speed of zlib-cloudflare, and 2.1x the speed of Apple’s system zlib:
[Chart: decompression speed comparison (higher is better)]

My version is also faster on x86, at ~1.3x the speed of zlib-cloudflare using clang (~1.7x the speed of Apple’s system zlib), and ~1.46x using gcc (~2.3x the speed of Ubuntu 20.04’s system zlib).
If you have a choice, please use a modern alternative like zstd instead of zlib. If you need zlib, consider optimised zlib forks, like zlib-ng or zlib-cloudflare (or libdeflate if you don’t need zlib API compatibility or streaming). If you maintain a zlib fork or library, please consider incorporating some of my changes. (If you want to use my code, you are free to do so under the zlib license, but it is under-tested, and the bugs tend to be vulnerabilities.)
This speedup is the product of a number of changes, which are discussed in the remainder of the post. The changes interact with each other in very complex ways, so individual speedup measurements shouldn’t be taken seriously – making the same changes in a different order would completely change many numbers.
Bigger root tables
The decoder uses a table to look up bit-sequences and map them to their decoded values. The longest Huffman code allowed by the DEFLATE specification is 15 bits, which would require a 128 KB table (2^15 four-byte entries), so to fit tables comfortably in cache, zlib splits this into a root table and a number of (possibly nested) subtables.
Every time the decoder reads a symbol that isn’t in the root table, we’ll have a branch mispredict, followed by extra latency for the subtable lookup, making root-table-misses very expensive. One of the simplest speedups was increasing root table size for a ~7% decompression speed improvement.
(The table size is still limited to the maximum code length used, meaning this typically has zero performance impact on tiny files, where table-building time dominates, since they’re unlikely to use longer codes.)
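In C, the two-level lookup looks roughly like the following sketch. The entry layout here (`SUBTABLE_FLAG`, the field positions) is hypothetical and simplified from zlib's real format, but the structure is the part that matters: a root lookup that usually hits, plus a rarer, more expensive indexed subtable lookup.

```c
#include <stdint.h>

/* Hypothetical entry layout, loosely modelled on zlib's: a flag bit marks
 * entries that redirect to a subtable; ROOT_BITS is the tunable root-table
 * size discussed above. */
#define ROOT_BITS 11
#define SUBTABLE_FLAG 0x80000000u

typedef struct {
    uint32_t root[1 << ROOT_BITS];
    uint32_t sub[1 << 15]; /* subtable storage, sized generously */
} huff_table;

/* Decode one symbol from the low bits of `bits`. Codes up to ROOT_BITS long
 * resolve in a single root lookup; longer codes take a second lookup, which
 * costs an extra (often mispredicted) branch and extra load latency --
 * exactly why a bigger root table helps. */
static uint32_t decode_symbol(const huff_table *t, uint64_t bits)
{
    uint32_t entry = t->root[bits & ((1u << ROOT_BITS) - 1)];
    if (entry & SUBTABLE_FLAG) {
        uint32_t sub_base = entry & 0xFFFF;        /* where the subtable starts */
        uint32_t sub_bits = (entry >> 16) & 0xF;   /* how many further bits to index */
        entry = t->sub[sub_base +
                       ((bits >> ROOT_BITS) & ((1u << sub_bits) - 1))];
    }
    return entry;
}
```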
Adler32 with UDOT
The zlib wrapper adds an adler32 checksum to the DEFLATE data, which is verified on decoding.
There’s already a NEON SIMD adler32 implementation in zlib-cloudflare, but, not unlike my CRC32 post, the existing implementation doesn’t take full advantage of the unusually wide M1. I added a new implementation which is 2x wider, takes advantage of the amazing UDOT instruction (part of the optional dotprod extension in ARMv8.2, mandatory in v8.4), and uses 64-bit scalar values to allow more iterations of the inner loop before reducing. I think this computes checksums at ~1.5x the speed of the old version (~60GB/s), but the decompression speedup is only about 1.01x, since it was already very fast.
I was very happy to see UDOT-based adler32 added to libdeflate, with impressive performance numbers on non-Apple CPUs too (e.g. ~1.7x speed on Cortex-A55).
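For reference, a plain scalar adler32 with the standard deferred-modulo trick looks like the sketch below (this is not the SIMD code, just the underlying idea). Wider accumulators let you sum more bytes before reducing: 5552 is zlib's classic block bound for 32-bit sums, and 64-bit sums (or SIMD lanes fed by UDOT, which sums 16 byte-products per instruction) push that bound much further.

```c
#include <stddef.h>
#include <stdint.h>

#define ADLER_MOD 65521u

/* Scalar adler32: s1 is 1 + the sum of all bytes, s2 is the sum of every
 * intermediate s1, both mod 65521. The expensive modulo is deferred: 5552
 * is the largest block size for which the unreduced sums still fit in
 * 32 bits, so one reduction per block suffices. */
uint32_t adler32_scalar(uint32_t adler, const uint8_t *buf, size_t len)
{
    uint64_t s1 = adler & 0xFFFF;
    uint64_t s2 = (adler >> 16) & 0xFFFF;
    while (len > 0) {
        size_t n = len < 5552 ? len : 5552;
        len -= n;
        while (n--) {
            s1 += *buf++;
            s2 += s1;
        }
        s1 %= ADLER_MOD;
        s2 %= ADLER_MOD;
    }
    return (uint32_t)((s2 << 16) | s1);
}
```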
Reading Bits
Most of the speedup just comes from reading and applying Fabian Giesen’s posts on Huffman decoding.
The goal was to use roughly the following variant-4-style loop to decode and unconditionally refill with eight-cycle latency on Apple M1:
; do {
; bitbuf |= read64LE(bitptr) << bitcount
ldr x11, [x0]
lsl x11, x11, x9 ; 1c
orr x10, x11, x10 ; 1c
; bitptr += 7
add x0, x0, #0x7
; bitptr -= ((bitcount >> 3) & 7)
ubfx x11, x9, #3, #3
sub x0, x0, x11
; bitcount |= 56
orr x9, x9, #0x38
; value = table[bitbuf & 0x1FF]
and x11, x10, #0x1ff ; 1c
ldr w11, [x4, x11, lsl #2] ; 4c
; bitcount -= value
sub x9, x9, x11 ; 1c
; bitbuf >>= (value & 63)
lsr x10, x10, x11
; *output++ = value >> 8
lsr x11, x11, #8
strb w11, [x1], #1
; } while (--n);
subs x8, x8, #0x1
b.ne start
When we don’t need to refill, we can just do a decode with six-cycle latency:
; value = table[bitbuf & 0x1FF]
and x11, x10, #0x1ff ; 1c
ldr w11, [x4, x11, lsl #2] ; 4c
; bitcount -= value
sub x9, x9, x11
; bitbuf >>= (value & 63)
lsr x10, x10, x11 ; 1c
; *output++ = value >> 8
lsr x11, x11, #8
strb w11, [x1], #1
Most of the commits are trying to encourage the compiler to generate this code without breaking anything. Simply forcing the compiler to use a 32-bit, shifted load was a ~9% speedup, and moving the shift amount to the least-significant bits was another ~3.5%. (Ignoring the high bits of bitcount was accidentally merged into this commit, but had another measurable speedup. Git is counterintuitive. Unconditional refill was combined with the upcoming fastpath optimisation here.)
As covered in Fabian’s linked blog posts, we’re mostly concerned with latency. The M1 can run 8 instructions per cycle, meaning that we have time to run 64 instructions during that 8-cycle latency chain – a limit we’re unlikely to hit. So the focus is on reducing bit-reading latency and reducing branch mispredicts.
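The refill in the assembly above corresponds to roughly the C below — a sketch of Fabian Giesen's branchless "variant 4" refill. (A real decoder needs guard bytes at the end of the input so the unconditional 8-byte load can't run off the end of the buffer.)

```c
#include <stdint.h>
#include <string.h>

/* Branchless refill: top up the bit buffer from memory and advance the
 * byte pointer by exactly the number of whole bytes consumed, with no
 * data-dependent branches. Afterwards bitcount is always in 56..63, so
 * at least 56 valid bits are available. */
static const uint8_t *refill(uint64_t *bitbuf, int *bitcount,
                             const uint8_t *bitptr)
{
    uint64_t next;
    memcpy(&next, bitptr, 8);        /* read64LE on a little-endian target */
    *bitbuf |= next << (*bitcount & 63);
    bitptr += 7;                     /* advance by 7 - (whole bytes still */
    bitptr -= (*bitcount >> 3) & 7;  /* buffered), computed branch-free   */
    *bitcount |= 56;                 /* 56..63 bits are now valid */
    return bitptr;
}
```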
Reading Extra Bits
When reading either the length or distance of an LZ77 pair, the decoder would first read a Huffman code, and then optionally read a number of extra bits:
int entry = table[bits & 0x1FF]; // 5c
bits >>= (entry & 63); // 1c
int extra_bits = entry >> 8;
int extra = bits & ((1 << extra_bits) - 1);
bits >>= extra_bits; // 1c
This second shift adds a cycle of latency. I changed this to combine the two shifts, then look back at the old value to extract the extra bits, saving a cycle of latency:
int old_bits = bits;
int entry = table[bits & 0x1FF]; // 5c
bits >>= (entry & 63); // 1c, includes extra
int non_extra_bits = (entry >> 8);
old_bits &= ~(-1 << (entry & 63)); // clear the high bits
int extra = old_bits >> non_extra_bits;
This required significant changes to the table format, making it one of the more complicated and error-prone changes. Speedup ~5.5%.
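To make the latency trade concrete, here are the two formulations side by side as plain C helpers. This is simplified relative to the real code — `code_bits` and `extra_bits` are passed as separate arguments rather than unpacked from a table entry — but the bit manipulation is the same.

```c
#include <stdint.h>

/* Naive version: shift off the Huffman code, then shift off the extra
 * bits separately -- two dependent shifts on the critical path. */
static uint64_t extra_two_shifts(uint64_t bits, int code_bits,
                                 int extra_bits, uint64_t *out_bits)
{
    bits >>= code_bits;
    uint64_t extra = bits & ((1ull << extra_bits) - 1);
    *out_bits = bits >> extra_bits;
    return extra;
}

/* Combined version: one shift removes code + extra bits together, and the
 * extra bits are recovered from the saved old value, off the critical path. */
static uint64_t extra_one_shift(uint64_t bits, int code_bits,
                                int extra_bits, uint64_t *out_bits)
{
    int total = code_bits + extra_bits;
    uint64_t old_bits = bits & ~(~0ull << total); /* keep low `total` bits */
    *out_bits = bits >> total;
    return old_bits >> code_bits;                 /* extra bits sit on top */
}
```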
Fast-Path for Literals
Since we can have 6-cycle latency (rather than 8-cycle latency) if we decode without refilling, I added a fast-path to the top of the loop that can decode two literals if they’re in the root table, before proceeding as usual. This is cheap because we already have to branch on whether or not a value is a literal.
This can consume the first 20 bits of our 56-bit buffer. Unfortunately, the worst case for a length+distance pair with extra bits is 48 bits, which we may no longer have, so this requires a conditional refill if we don’t have enough bits while reading a distance code. However, this worst case is exceedingly rare – the whole point of compression is to use short codes in the common cases – and so the “needs refill” branch predicts essentially perfectly, and this has a very low cost. Speedup ~4%.
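A sketch of that fast-path in C: the `LITERAL_FLAG` encoding below is hypothetical (the real table format differs), but it shows the shape — try up to two root-table decodes without refilling, and bail to the general path on the first non-literal entry. The caller is assumed to have just refilled, so enough bits are available.

```c
#include <stdint.h>

#define LITERAL_FLAG 0x10000u /* hypothetical: marks a root-table literal */

/* Try to decode up to two literals without refilling. Each hit costs only
 * the shorter 6-cycle decode chain; the branch on "is it a literal?" is
 * needed anyway, so the fast path is nearly free. Returns the number of
 * literals written to out[]. */
static int literal_fastpath(const uint32_t *root, uint64_t *bitbuf,
                            int *bitcount, uint8_t *out)
{
    int n = 0;
    for (; n < 2; n++) {
        uint32_t entry = root[*bitbuf & 0x1FF];
        if (!(entry & LITERAL_FLAG))
            break;                       /* length/distance or subtable: slow path */
        *bitbuf >>= entry & 63;
        *bitcount -= entry & 63;
        out[n] = (uint8_t)(entry >> 8);  /* decoded byte lives in bits 8..15 */
    }
    return n;
}
```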
Overlapping Mispredict Latency
There’s a fancy “chunkcopy” optimisation, which is responsible for some of zlib-cloudflare’s performance over zlib-madler (by copying LZ77 references 16-bytes at a time), but this can have expensive branch mispredicts. There’s a benefit to performing the refill and Huffman table lookup for the next iteration before the chunkcopy, allowing the load latency to overlap with any mispredict latency. Speedup ~6%.
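The core of the chunkcopy idea, heavily simplified: copy the match 16 bytes at a time, deliberately over-copying past the requested length (the real decoder guarantees slack space in the output buffer). Short overlapping distances need the repeating pattern handled specially — zlib-cloudflare uses vector permutes there; this sketch just falls back to a byte loop.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy `len` bytes of an LZ77 match ending at `out`, whose source starts
 * `dist` bytes back. For dist >= 16 the 16-byte copies never read bytes
 * they have just written, so over-copying in big chunks is safe (given
 * slack space past the output). Returns the new output pointer. */
static uint8_t *chunkcopy(uint8_t *out, size_t dist, size_t len)
{
    const uint8_t *from = out - dist;
    if (dist >= 16) {
        uint8_t *end = out + len;
        do {
            memcpy(out, from, 16); /* compiles to one 16-byte vector copy */
            out += 16;
            from += 16;
        } while (out < end);
        return end;                /* over-copied bytes are scratch space */
    }
    /* overlapping short distance: bytes must be produced in order so the
     * pattern repeats (e.g. dist == 1 fills with a single byte) */
    while (len--) {
        *out = *(out - dist);
        out++;
    }
    return out;
}
```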
Table-building tweaks
Table-building is neither terribly expensive nor terribly frequent, but it does involve counting the number of symbols of each length (1-15), which can be problematic due to the latency of reading back recently written memory. This might be a bit worse on the M1 than on x86, with no zero-latency forwarding (but no extreme penalties). I instead used NEON intrinsics to keep the sum in two 128-bit registers with lower latency, for a ~0.5% improvement in decompression speed. I also use clang’s bitreverse intrinsic, and split the table-filling loop to have simpler code for filling the (now larger) root table, and only run the more complex code for the (now smaller and less common) subtables, for another ~0.5%. These changes give a total of a ~1% speedup.
(I was trying anything I could think of to try to get to 1.5x, and this was what finally did it… and then I realised my geometric mean calculation was wrong, and I was already past 1.5x without this change. But it is nice to have.)
Final notes
There’s been a lot of excellent (and inspirational) prior work on zlib performance: Adenilson Cavalcanti has been optimizing zlib decompression on ARM in Chromium, Vlad Krasnov significantly improved compression performance, Janakarajan Natarajan and Volker Simonis ported the Chromium optimisations to cloudflare-zlib, and Nigel Tao wrote The Fastest, Safest PNG Decoder in the World (a great read if you missed it).
Eric Biggers’ libdeflate also deserves more mention in this post, and should have been included in the graphs – I was looking mainly at zlib-compatible projects, but it is significantly faster than zlib-cloudflare. zlib-dougallj currently outperforms libdeflate version 1.13 on M1 at ~1.17x speed, but I didn’t test other platforms.
My changes seem to be a speedup regardless of file, which I hadn’t imagined was possible. There’s still a fair bit of room for further work here – I didn’t even look at the x86 disassembly, but hopefully this can save some time and energy. The code is on Github, under the zlib license, but has not been thoroughly tested. I’m @dougallj on Twitter.
This entry was posted in Uncategorized by dougallj. Bookmark the permalink.