source link: https://mastodon.gamedev.place/@rygorous/110572829749524388
Fabian Giesen: "By request, my usual "the leas…"

By request, my usual "the least interesting part about AVX-512 is the 512 bits vector width" infodump in thread form.

So here goes, a laundry list of things introduced with AVX-512 that I think are way more important to typical use cases than the 512-bit vectors are:

* Unsigned integer compares, at long last (used to be either 3 or 2 instructions, depending on what you were comparing against and how much of the prep work you could hoist out)
* Unsigned int <-> float conversions. Better late than never.
* Down converts (narrowing), not just up - these are super-awkward in AVX2 since the packs etc. can't cross 128b boundaries
* VPTERNLOGD (your swiss army railgun for bitwise logic, can often fuse 2 or even 3 ops into one)
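To make the VPTERNLOGD point concrete, here is a scalar model of its per-bit semantics (the function name and test values are mine; the imm8 truth-table indexing follows the instruction's documented behavior):

```python
def ternlog(a: int, b: int, c: int, imm8: int, width: int = 32) -> int:
    """Scalar model of VPTERNLOGD for one lane: each result bit is
    looked up in the imm8 truth table, indexed by the corresponding
    bits of the three sources."""
    out = 0
    for i in range(width):
        idx = (((a >> i) & 1) << 2) | (((b >> i) & 1) << 1) | ((c >> i) & 1)
        out |= ((imm8 >> idx) & 1) << i
    return out

# imm8 = 0xCA encodes "a ? b : c" (bitwise select), which otherwise
# takes an AND/ANDN/OR sequence.
mask, x, y = 0xFF00FF00, 0x12345678, 0x9ABCDEF0
assert ternlog(mask, x, y, 0xCA) == (x & mask) | (y & ~mask & 0xFFFFFFFF)
```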

* VPTEST (logical compare instead of arithmetic for vectors - another one that replaces what used to be either 2- or 3-instruction sequences)
* VRANGEPS (meant for range reduction, can do multiple things including max-abs of multiple operands - another one that folds what is typically 2-4 instructions into one)
* Predicates/masks and "free" predication
* VRNDSCALE (round to fixed point with specified precision, yet another one that replaces 3 ops - mul, round, mul)
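A rough sketch of the VRNDSCALE core idea, i.e. the mul/round/mul three-op sequence it folds (helper name is mine; this models only the round-to-2^-M precision, not the full instruction):

```python
def rndscale(x: float, m: int) -> float:
    """Rough model of VRNDSCALE's core: round x to 2**-m precision,
    i.e. the mul / round / mul sequence it replaces. Python's round()
    is round-half-to-even, matching the default SSE/AVX rounding mode."""
    scale = 2.0 ** m
    return round(x * scale) / scale

assert rndscale(1.37, 2) == 1.25   # nearest multiple of 1/4
assert rndscale(2.5, 0) == 2.0     # ties go to even
```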

* VPRO[LR] (bitwise rotate on vectors, super-valuable for hashing; yet another one that replaces what used to be 3 ops: shift, shift, or)
* VPERM[IT]2* (permute across two source registers instead of one - PPC/SPUs had this, ARM has this, it's yet another one that replaces usually 3 instructions with one: two permutes and an OR or similar. And 2 of these go to the often-contended full shuffle network)
* Broadcast load-operands (4-16x reduction in L1D$ space for all-lane constants for free)
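The rotate point above is easy to see in code; a sketch of what VPROLD does per 32-bit lane (function name is mine):

```python
def rotl32(x: int, n: int) -> int:
    """Per-lane model of VPROLD: one op where pre-AVX-512 code needs
    shift-left, shift-right and OR (three instructions)."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

assert rotl32(0x80000001, 1) == 0x00000003
assert rotl32(0x12345678, 8) == 0x34567812
```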

* Compress/expand (these replace what is typically some awkward movemask / big table / permute combination, are super-useful to have, and save ~1k-4k [sometimes more] of tables you otherwise keep warm in L1D for no good reason)
* Disp8*N encoding (purely a code size thing; this alone does a very good job of offsetting the extra cost of EVEX)
* Variable shifts on not just DWords/QWords, but also Words. (THANK YOU)
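A minimal model of the compress semantics mentioned above (names are mine; the real VPCOMPRESS works on vector lanes under a k-mask, this just shows the packing behavior):

```python
def compress(values, mask):
    """Model of VPCOMPRESS semantics: pack the elements whose mask
    bit is set into the low lanes, in order. Pre-AVX-512 this is
    typically a movemask, a shuffle-control table lookup, and a
    PSHUFB-style permute."""
    return [v for i, v in enumerate(values) if (mask >> i) & 1]

assert compress([10, 20, 30, 40], 0b1010) == [20, 40]
```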

* (VBMI2) VPSH[LR]D(V?) - double-wide shifts. Yet another one that replaces what would usually be 3 instructions.
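A scalar sketch of the double-wide shift (name is mine; this models one dword lane of VPSHLDD):

```python
def shld32(hi: int, lo: int, n: int) -> int:
    """Model of one dword lane of VPSHLDD: shift the 64-bit
    concatenation hi:lo left by n and keep the high 32 bits,
    replacing the usual shift/shift/OR sequence."""
    return ((((hi << 32) | lo) << (n & 31)) >> 32) & 0xFFFFFFFF

assert shld32(0x12345678, 0x9ABCDEF0, 4) == 0x23456789
```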

Notice a theme here? There's just a ton of efficiency stuff in there that patches holes that have been in the ISA forever and replaces 3-insn sequences with a single one, and it's often 3-insn seqs with 2 uops on a contended port down to 1.

More niche stuff:
* VNNI - VPDPBUSD is yet another "replaces 3 insns with one" example, in that case PMADDUBSW, PMADDWD, PADDD
* 52-bit IFMA, mostly interesting for crypto/big int folks
* VPMULTISHIFTQB in VBMI for bit wrangling
* The bfloat16 additions
* Fixed versions of RCP and RSQRT, namely RCP14 and RSQRT14, with an exact spec so they don't differ between Intel and AMD
* Vector leading zero count and pop count
* Math-library focused insns like VFIXUPIMM and VFPCLASS.
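For the VNNI entry in the list above, a scalar model of one dword lane of the unsigned-by-signed byte dot product (helper name is mine):

```python
def vpdpbusd_lane(acc: int, a: list, b: list) -> int:
    """Model of one dword lane of VPDPBUSD: dot product of four
    unsigned bytes (a) with four signed bytes (b), accumulated into
    a 32-bit lane - i.e. PMADDUBSW + PMADDWD + PADDD in one op."""
    assert all(0 <= u <= 255 for u in a) and all(-128 <= s <= 127 for s in b)
    return (acc + sum(u * s for u, s in zip(a, b))) & 0xFFFFFFFF

assert vpdpbusd_lane(100, [1, 2, 3, 4], [10, -10, 10, -10]) == 80
```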

And last but not least, AVX-512 doubles the FP/SIMD architectural register count from 16 to 32.

This, combined with the 512b vectors that are the "-512" part, quadruples the amount of architectural FP/SIMD state, which is one of the main reasons we _don't_ get any AVX-512 in the "small" cores.

Now, the 32 arch regs can pay dividends for all SIMD code; the 512 bits only pay off for code that can actually go that wide efficiently, which is not _that_ frequent.

In short, the 512-bit-ness that's part of the name is probably what provides the least utility to most workloads (outside of things like HPC); it's what has hindered its proliferation the most, and it's the main reason we still keep getting new x86 designs without it.

I wish that around Skylake Intel had defined an "AVX-256" subset that was in consumer SKUs, which is "all of AVX-512 except for the actual 512b-wide vectors".

Because if they had, we'd now be 8 years into "AVX-256" on clients and would probably have a very decent percentage of machines with it.

Then we would have all that goodness I just listed, never mind the 512b vectors, which are honestly pretty niche.

@rygorous Or the double-pumped stuff that was done by AMD for AVX-512 (and one or both did for AVX2 as well IIRC???)

Get the instruction set out there, but not necessarily with full performance benefits.

@scottmichaud No that doesn't help.

The problem is the amount of state, not the width of the implementation.

@scottmichaud I like to dig out die shots of the original AMD Zen for a scale reference.

Wikichip has die shots of the original Zen 1: https://en.wikichip.org/wiki/File:amd_zen_core.png

Here's my annotated version with the FP/SIMD register file marked, and capacities noted.


@scottmichaud AVX-512 _architectural_ state, with absolutely no extra registers for renaming, just the bare minimum you need for a single hardware thread (can't do 2 HW threads either), is 32 regs * 512 bits = 2KiB.

Why are RFs this big compared to more compact structures like the L1D cache? Mainly because they need a ton of ports. I think Zen1's FP RF needs around 8-9 read plus 4 write ports per cycle. A banked L1D is either single- or dual-ported usually.

@scottmichaud Zen 1 isn't super-dense either, that's still a part targeting quite high frequencies.

The main reason you don't see AVX-512 in smaller cores is basically that. They literally don't have the area budget for a register file so huge you can see it from space

@scottmichaud Note the "mold" growing to either side of the FP/SIMD RF is the actual SIMD unit logic. Programmers tend to overestimate how much area is spent on data path (i.e. the stuff that does the actual computation) vs. logistics. In this case, coarsely eyeballing it, looks like more than 1/3rd of the entire FP/SIMD unit is just that register file.

@scottmichaud also pretty much anything you see in that image that has this obvious regular structure is memory of one kind or another - usually caches.

@rygorous @scottmichaud Yup, 1/3 RF, 1/3 ALU, 1/3 "everything else" is pretty normal (just for the FP unit itself)

@TomF @rygorous @scottmichaud And probably the left mold is the ALU, and the right mold is FP Load/Store

@moonchild @scottmichaud That's a difference in implementation (and a common choice, e.g. how the aforementioned Zen 1 handles AVX ops) but it doesn't make the register file any smaller.

@moonchild @scottmichaud The issue is that you can't just look at pure space, that's still not enough.

The architectural state is a lower bound for what _needs_ to fit but you still need a fair amount of extra space to handle in-flight instructions for roughly the length of your pipeline between register allocate (usually around rename time) and register free (usually a few cycles past retire) plus turn-around time on the RF free list.

@moonchild @scottmichaud At the extreme, if you have architectural state plus one FP register, then that prevents you from having more than a single FP operation at any state of completion in the pipeline at once, which is clearly not actually viable.

@moonchild @scottmichaud so if you have say a pipeline that can handle 2 128-bit vector operations per cycle, and time from allocate to reg free is 15 cycles (that's low-balling it), you need at the bare minimum another ~30 or so spare rename registers to not always be bumping into limits, more if you want at least some basic scheduling flexibility without immediately running out of FP regs

@moonchild @scottmichaud but realistically, you can add at least another 20 or so cycles you want to be able to cover in there, so that a single L1D miss, L2 hit doesn't immediately ruin your day. So that's at least another ~40 rename registers you would need with a 2-vector-unit setup.

@moonchild @scottmichaud Gracemont has a pretty deep scheduler and goes for ~207 regs in its 128b implementation. Adding another 24 512b registers' worth of architectural state would add another 96 128b regs, so even if you left everything else the same, that's still ~1.5x the size of the RF you'd need.
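The slot arithmetic in this and the preceding posts can be sanity-checked; a quick back-of-envelope sketch (the ~207-entry figure is from the post above, the slot accounting is mine):

```python
# Counting in 128-bit register slots, as a Gracemont-style core would.
avx2_arch   = 16 * 256 // 128    # existing AVX2 state: 32 slots
avx512_arch = 32 * 512 // 128    # full AVX-512 state: 128 slots
extra_slots = avx512_arch - avx2_arch
assert extra_slots == 96         # = 24 extra 512b registers' worth

# Gracemont's ~207-entry physical RF plus that delta:
assert round((207 + extra_slots) / 207, 2) == 1.46   # ~1.5x, as claimed
```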

@rygorous @scottmichaud I thought that, given move-elimination, architectural registers can alias each other (or, better yet, the zero register), allowing to ameliorate this problem somewhat? Would make for some interesting performance characteristics, but seems workable.
@rygorous @scottmichaud Obviously suboptimal, but perhaps not that bad, if you don't need all the registers.

@moonchild @scottmichaud Yeah some smaller cores do that (e.g. the AMD Jaguar cores do this to not assign RF slots to the x87 regs when they're not used) and it's definitely workable (even the big cores now have reduced renaming capability when 512b vectors are used), but it's still a problem

@moonchild @scottmichaud The other issue is that the logistics for cracking insns into 2 uops for low/high halves (and allocating low/high halves of regs separately) are reasonable but having to do quarters turns into quite a mess.

So AVX 256b regs that are cracked into 128b halves (often with low/high separately renamed, e.g. Jaguar/Zen 1 did this) are doable, but you wouldn't want to do quarters, so 512b nudges you towards 256b as the small vector size.

@rygorous I guess we're kind of getting that now with AMD enabling AVX512 through what I understand is doubling up on 256? Would be nice to have one standard SIMD set of at least 256 wide instructions with all the ops on everything across the board though

@rygorous do we know why intel didn’t give us something more like SVE?

@regehr @rygorous We looked at variable lengths. Many people have tried them - they're the "obvious" thing to look at. But the practicalities are that they suck. Every time something is optional, you get nasty edge cases and people get hurt.

@TomF @rygorous would love to read the blog post version of this, if anyone has time to write it

@TomF @rygorous also, if you don't know my friend @adreid (one of the SVE developers, but now employed by Intel), I bet you two would have things to discuss

@regehr @rygorous It's not much longer than that really. Things like the Cray vector length registers, or the SPARC sliding-window stuff - every time you have something designed for "scalability" or (gods help you) "forward compatibility" it never works as intended, it brings immediate implementation overhead, and everybody ends up hating it.

So after a while we scrapped all those plans and went fixed-size.

@flacs That one's not actually a vector instruction, it sets the flags. This is usable for rejection tests but not for actual vector tests where you need a mask of which lanes pass/fail as a result.

@flacs If you want the actual results (say to do a BLEND with) you're still stuck with an AND/CMP-against-0 combo until AVX-512

@rygorous predicate masks and "free" predication seems pretty useful especially when you have control flow that's non-uniform over your SIMD lanes.

@rygorous As a proud father, I endorse this list. We chose the width before AVX existed, so both 256b and 512b were on the table, and it was a very close decision! The deciding factor was (a) wider was better and (b) 512b is 16x float32, and there was a nice 4x4 symmetry to 16 lanes that we thought would be more important than it turned out to be (because in practice nobody cared about porting SSE code as-is).

More info on the origins here:
https://tomforsyth1000.github.io/papers/LRBNI%20origins%20v4%20full%20fat.pdf

