
Intel Publishes Fast AVX-512 Sorting Library, 10~17x Faster Sorts in NumPy

source link: https://news.ycombinator.com/item?id=34810610

Isn't it on their consumer lines only that Intel removed AVX-512?

Sapphire Rapids is indicated as having support for it.

I know that even though you pointed to an EPYC CPU, all Zen 4 chips support it, but Intel probably released it more for professional users than for their non-pro ones.

Yes, it's only Alder Lake (i.e. the cheaper, consumer-oriented CPUs) in which it has been removed. Server chips still have it AFAIK.

Even on Alder Lake, the official explanation is that it has both P(erformance) and E(fficiency) cores, with the E cores being significantly more power efficient and the P cores being significantly faster. The P cores have AVX512, the E cores don't. Since most kernels have no idea that this is a thing and treat all CPUs in a system as equal, they will happily schedule code with AVX512 instructions on an E core. This obviously crashes, since the CPU doesn't know how to handle those instructions. Some motherboard manufacturers allowed you to work around this by simply turning off all the E-cores, so only the P-cores (with AVX512 support) remained. Intel was not a fan of this and eventually disabled AVX512 in hardware.

As ivegotnoaccount mentioned, the Sapphire Rapids range of CPUs will have AVX512. Those are not intended for the typical consumer or mobile platform though, but for servers and big workstations where power consumption is much less of a concern. You would probably not want such a chip in your laptop or phone.

It would have been possible to devise a system call to 'unlock' AVX512 for an application that wants to use it, which would pin it to only be scheduled on P cores.

You end up with the issue of what happens if a commonly used library (or even something like glibc) wants to use AVX512 for some common operation: you could end up with most or all processes pinned to the P cores.

If you explicitly have to request AVX512, that might discourage glibc from using it.

AVX-512 is wide enough to process eight 64-bit floats at once. Getting a 10x speedup out of an 8-wide SIMD unit is a little difficult to explain; some of the speedup is presumably coming from fewer branch instructions in addition to the vector width. It's extremely impressive. Also, it has taken Intel a surprisingly long time!
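To make the branch point concrete, here is a minimal sketch (mine, not taken from the Intel library) of a branch-free compare-exchange over blocks of 8 doubles; the scalar equivalent needs a data-dependent compare-and-branch per pair, which mispredicts often on random input:

    #include <immintrin.h>

    // Branch-free compare-exchange of two blocks of 8 doubles (AVX-512F):
    // a[] ends up with the element-wise smaller values, b[] with the larger.
    void compare_exchange8(double* a, double* b) {
        __m512d va = _mm512_loadu_pd(a);
        __m512d vb = _mm512_loadu_pd(b);
        _mm512_storeu_pd(a, _mm512_min_pd(va, vb));
        _mm512_storeu_pd(b, _mm512_max_pd(va, vb));
    }

(Compile with -mavx512f or -march=native on a machine that has it.)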
L1 cache on Intel machines reads/writes in 512-bit chunks. So you get a 2x faster L1 cache when working with AVX512 on Intel IIRC.

Or perhaps more accurately: L1 cache that can process twice the data in the same amount of time.

That sounds suspiciously as if an implementation tailored to that cache line size might see a considerable part of the speedup even running on non-SIMD operations? (or on less wide SIMD)

People already do optimizations like this all the time when they're working on low-level code that can benefit from it. Sorting is actually a good example: all of the major sort implementations typically use quicksort when the array is large enough, and then at some level of the recursion the subarrays get small enough that insertion sort (or even a sorting network) is faster. So sorting a large array will use at least two different sorting methods depending on what level of recursion is happening.
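As a rough illustration of that hybrid (a sketch of the general pattern, not any particular library's code; the cutoff of 32 is an arbitrary placeholder):

    #include <algorithm>
    #include <cstddef>
    #include <utility>

    // Quicksort down to small subarrays, then finish with insertion sort.
    // Sorts a[lo..hi] inclusive; call as hybrid_sort(a, 0, n - 1).
    void hybrid_sort(int* a, std::ptrdiff_t lo, std::ptrdiff_t hi) {
        while (hi - lo > 32) {                     // arbitrary small-array cutoff
            int pivot = a[lo + (hi - lo) / 2];
            std::ptrdiff_t i = lo, j = hi;
            while (i <= j) {                       // Hoare-style partition
                while (a[i] < pivot) ++i;
                while (a[j] > pivot) --j;
                if (i <= j) std::swap(a[i++], a[j--]);
            }
            hybrid_sort(a, lo, j);                 // recurse into one side,
            lo = i;                                // iterate on the other
        }
        for (std::ptrdiff_t k = lo + 1; k <= hi; ++k) {  // insertion sort base case
            int v = a[k];
            std::ptrdiff_t j = k;
            while (j > lo && a[j - 1] > v) { a[j] = a[j - 1]; --j; }
            a[j] = v;
        }
    }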

You can get information about the cache line sizes and cache hierarchy at runtime from sysfs/sysconf, but I don't think many people actually do this. Instead they just hard-code sizes that work well on the common architectures they expect to run on, since these cache sizes don't change frequently. If you really want to optimize things, when you compile with -march=native (or some specific target architecture) GCC/Clang will implicitly add a bunch of extra flags to the compiler invocation that expose information about the cache sizes/hierarchy of the target architecture.
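For reference, the sysconf query mentioned above is only a couple of lines on Linux/glibc (the _SC_LEVEL* names are glibc extensions, not POSIX, and some systems report 0 for them):

    #include <cstdio>
    #include <unistd.h>

    int main() {
        long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);  // cache line size in bytes
        long l1d  = sysconf(_SC_LEVEL1_DCACHE_SIZE);      // total L1 data cache
        long l2   = sysconf(_SC_LEVEL2_CACHE_SIZE);       // total L2 cache
        std::printf("L1D line: %ld B, L1D: %ld B, L2: %ld B\n", line, l1d, l2);
        return 0;
    }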

> cache line

Wrong side of the L1 cache. Cache lines are how the L1 cache talks to the L2/L3 cache.

I'm talking about the load/store units in the CPU core, or the Core <--> L1 cache communications. This side is less commonly discussed online, but I am pretty sure it's important in this AVX512 discussion. (To be fair, I probably should have said "load/store unit" instead of L1 cache in my previous post, which would have been clearer.)

-------------

Modern CPU cores only have a limited number of load/store units. It's superscalar of course, like 4 loads/stores per clock tick or something, but still limited. By "batching" your loads/stores into 512 bits instead of 256 bits or 64 bits at a time, your CPU doesn't have to do as much work to talk to the L1 cache.

Almost any array-math implementation that's aware of cache sizes is going to outperform the ones that aren't, by a heavy margin.

AVX-512 has masks and a lot of new instructions. It's not just wider.
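As a concrete (illustrative, not from the article) example of what the masks buy you: a branch-free partition of 16 ints against a pivot, a compare-into-mask followed by a compressing store, which is the building block vectorized quicksorts lean on:

    #include <immintrin.h>

    // Writes the lanes of 'in' that are < pivot, packed contiguously, to 'out'.
    // Returns how many were written. Requires AVX-512F.
    int partition_block(const int* in, int pivot, int* out) {
        __m512i v  = _mm512_loadu_si512(in);              // 16 int32 lanes
        __m512i pv = _mm512_set1_epi32(pivot);
        __mmask16 m = _mm512_cmplt_epi32_mask(v, pv);     // one mask bit per lane
        _mm512_mask_compressstoreu_epi32(out, m, v);      // pack selected lanes
        return _mm_popcnt_u32(m);                         // count of lanes kept
    }

(The compress store in particular has no direct AVX2 counterpart.)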
It's not "8-wide", it's "512 bits wide". The basic "foundation" profile supports splitting those bits into 8 qwords, 16 dwords, etc., while other profiles support finer granularities, down to 64 individual bytes. Plus you get more registers, new instructions, and so on.

My understanding is that AVX-512 also has a lot more functions, so composing something less naturally parallel (e.g. simdjson) is easier in it.

avx512 also gives you 2x more register space, which can be very useful.

On an Ice Lake GCE instance, Highway's vqsort was 40% faster when sorting uint64s. vqsort also doesn't require avx512, and supports a wider array of types (including 128-bit integers and 64-bit k/v pairs), so it's more useful IMO. It's a much heavier-weight dependency though.

Code / scripts here: https://github.com/funrollloops/parallel-sort-bench

I had to use a cloud instance for testing since I don't have an avx512-capable personal machine.

Thanks for sharing the benchmark :D Is there anything we could do to make Highway/vqsort (feel like) a lighter dependency?

I made that comment because the Intel library is header-only. While header-only libraries can be convenient, for non-trivial projects I prefer a well-engineered CMake-based build for better compile times.

Got it. Highway is mostly header-only and we can move towards fully header-only if anyone is interested.

We also have a CMake build, not sure about well-engineered but patches welcome from anyone with more CMake expertise :)

I think as long as FetchContent works, it's super easy to try out your library and see if it's something that should be pulled in as a real dependency.

I have heard that FetchContent can make a build unreliable, as in dependent on having a connection, but I think it opens the door for a lot of people to be willing to try it :) (That, and you can turn off mandatory updates, making it work offline too.)

I think the basic rule for making FetchContent work is to just put the CMakeLists.txt in the root directory and then make sure that you can use the project as a CMake sub-project.

OK, I think we meet those criteria. JPEG XL can use Highway via add_subdirectory, assuming that is what you mean by sub-project.

Interesting! The benchmark appears to be using only random data though. Any measurements for partially sorted or reverse sorted data?

sagarm has posted one result in another thread. I'll also look into adding their code to our benchmark :)

It's great to see more vector code, but a caveat for anyone using this: the pivot sampling is quite basic, just the median of 16 evenly spaced samples. This will perform poorly on skewed distributions, including all-equal and very-few-unique values. Yes, in the worst case it can resort to std::sort, but that's a >10x speed hit and until recently also potentially O(N^2)!

We have drawn larger samples (nine vectors, not one), and subsequently extended the vqsort algorithm beyond what is described in our paper, e.g. special handling for 1..3 unique keys, see https://github.com/google/highway/blob/master/hwy/contrib/so....
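For reference, the basic scheme being criticized looks roughly like this (my sketch of "median of 16 evenly spaced samples", not the Intel code; assumes n is at least a few dozen):

    #include <algorithm>
    #include <cstddef>

    // With all-equal or few-unique keys every sample is identical, so the
    // resulting pivot barely splits the range and the recursion degrades.
    double pick_pivot(const double* a, std::size_t n) {
        double sample[16];
        std::size_t step = n / 16;
        for (int i = 0; i < 16; ++i)
            sample[i] = a[i * step + step / 2];           // evenly spaced positions
        std::nth_element(sample, sample + 8, sample + 16);
        return sample[8];
    }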

Now we only need a consumer CPU from Intel with AVX-512 enabled.

Not listed on that page is the Microsoft Surface Laptop Go, which has the same i5-1135G7 as the X1 Carbon listed.

It appears that MS is clearing out their remaining stock with discounts, and they are really nice little machines with outstanding build quality, very good keyboards, and a 3:2 touchscreen.

It was never a popular machine; I think it had very unfortunate naming, which leads people to confuse it with other MS products. You have to think of it as something like a super-premium Chromebook to understand what it is for. But regardless, you can dump Windows and install Linux just fine.

Dedicated hardware for common algorithms is actually not very far-fetched. In addition to GPUs, we already have examples of HSMs [1] and TPUs [2] that optimize for specific cryptographic and machine learning operations.

[1]: https://en.wikipedia.org/wiki/Hardware_security_module

[2]: https://en.wikipedia.org/wiki/Tensor_Processing_Unit

Pretty sure the internet exists by virtue of algorithm-specialized hardware.

Intel has dedicated gzip hardware in QAT, among a few other hardware blocks.

This is incredible, I feel like I am manifesting things by posting HN comments. What'll they think of next, one billion euros in cash hidden in my wardrobe!?

Well, really, that's basically the only place left to go at the moment. I don't think we're likely to have 10GHz any time soon, or 1,024 cores either. Specialized circuits are probably all that's left before we start hitting asymptotes.

What is the speedup compared against? Is it compared against non-AVX code? Or is it compared against AVX2 (256-bit)?

From my experience, vectorizing with AVX2 can give a 3x-10x speedup over non-AVX operations, depending on the data. The operations involved are things like finding a common prefix or searching in a string.

Thanks Intel! Now maybe you can release some processors for those of us at home that actually have AVX-512!

I read that as 10^17 times faster sorts and I thought to myself, now that’s news!

"Intel has decided that CPUs no longer should sort, and instead will return original arrays in constant time. This has shown to have a 10^17 performance increase in some benchmarks."

"According to the many-worlds interpretation of quantum mechanics, there may exist a universe where every array is already sorted, resulting in a 10^17 performance increase in some timelines."

Is sorting one of those things that's so common that it deserves its own CPU hardware/instructions to aid in performance?

Or does AVX-512 provide a lot of what that would theoretically be?

AVX-512 is not a dedicated sorting instruction set, but rather instructions dedicated to doing the same computations in parallel over 512 bit wide registers. So you can do the same operation in the same time for 8 doubles, 16 integers, or 64 bytes at once.

Coming up with good usage of those instructions can be tricky. It's not just the typical arithmetic things; there are also instructions that shuffle values around in those registers based on values elsewhere, and combining all of that cleverly can yield speed-ups for algorithms that deal with lots of data serially.

A while ago while trying to understand all that (for the older instruction sets) I've read this CodeProject article: https://www.codeproject.com/Articles/874396/Crunching-Number... – AVX-512 is basically similar, just wider. Although I've heard it has a few more useful instructions as well that have no counterpart in the older instruction sets.
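As a small taste of that shuffle-then-compare style (an illustrative fragment, not code from the article): assuming lanes 0-7 and lanes 8-15 of a register are each already sorted, one cross-lane permute plus branch-free min/max/blend moves the 8 smallest values into the low lanes and the 8 largest into the high lanes:

    #include <immintrin.h>

    __m512i merge_halves(__m512i v) {
        // Index vector that reverses all 16 lanes.
        const __m512i rev = _mm512_set_epi32(0, 1, 2, 3, 4, 5, 6, 7,
                                             8, 9, 10, 11, 12, 13, 14, 15);
        __m512i swapped = _mm512_permutexvar_epi32(rev, v);
        __m512i lo = _mm512_min_epi32(v, swapped);
        __m512i hi = _mm512_max_epi32(v, swapped);
        // Keep the mins in lanes 0-7 and the maxes in lanes 8-15.
        return _mm512_mask_blend_epi32(0xFF00, lo, hi);
    }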

Heck yeah. Sorting is a pretty common operation in tons of algorithms, which is why you find some form of a sort function in pretty much every language's standard runtime. Sure, this won't help much for sorting strings, but numerical sorts still address a significant chunk of problems.

"Ordinateur", which is French for "computer", literally means "sorting machine". So wherever the sorting instructions would go, the computer would follow.

Just to pick a random example: creating an SQL database with an index for faster search requires sorting.

Except for the special case of loading data into an empty table that already has an index, it requires repeatedly inserting items into a sorted ‘list’ (more likely a btree), keeping it sorted.

That’s quite a different thing.

>btree

That's literally a collection of arrays that needs to be sorted. And it's still traditional sorting.

I really should have clarified that I understand that sorting is important. Hence why I'm wondering if it deserves its own instructions. ;)

That’s a kind of interesting idea. Better implementations of sorting algorithms will go down to a small-ish base case and then do a linear sort. Maybe with AVX512-sized registers there’s room for a "sort the register" instruction to act as the base case, haha.

Lots of things are sorted: search results, recommendations, your news feed.

This is probably the vectorized quicksort. I remember the paper detailing the algorithm here on HN.

Since then, I know that if I really need to sort numbers very fast one day, I will have to learn the vectorized way to quicksort.

I would probably write it directly in assembly though, with a C API (because tinycc, cproc, scc...)

Thanks Intel for publishing something that's useful on AMD consumer CPUs but not on Intel ones.

For those not in the know here, Intel's actually had some fairly OK AVX-512 implementations on consumer chips in the past, even if they do cause the whole chip to downclock significantly.

But on the new Alder Lake CPUs, which have P (Performance) and E (Efficiency) cores, the efficiency cores don't have AVX-512, so code would have to find a way to switch modes it runs in as it is shuffled between cores. So generally, AVX-512 is regarded as not-actually-available on Alder Lake.

> so code would have to find a way to switch modes it runs in as it is shuffled between core

It's even worse than that. Initially you could disable E-cores in BIOS to get the system to report AVX-512 being available, but Intel released a microcode update to remove this workaround[0]. Intel also stated that they started fusing off the AVX-512 in silicon on later production Alder Lake chips[1]. Also compare the Ark entries for the Rocket Lake[2], Alder Lake[3], and Raptor Lake[4] flagships. Only the 11900k lists AVX-512 as an available Instruction Set Extension. So it's reasonable to say that AVX-512 on consumer Intel lines is dead for now, whereas AMD has just introduced it in the Ryzen 7000 series.

[0] https://www.tomshardware.com/news/intel-reportedly-kills-avx...

[1] https://www.intel.com/content/www/us/en/support/articles/000...

[2] https://ark.intel.com/content/www/us/en/ark/products/212325/...

[3] https://ark.intel.com/content/www/us/en/ark/products/134599/...

[4] https://ark.intel.com/content/www/us/en/ark/products/230496/...

Does anyone know why they would do this? If AVX-512 works fine on P-cores, and if certain people disable E-cores because they want to use AVX-512, why would they stop those who want to from being able to use it? Why would they go to such extreme lengths to disable something?

"The glibc problem"

You can schedule among heterogeneous cores; that's not really a problem. You simply have another bit for "task used AVX512" and let the task run without AVX512 so it faults the first time it tries to use it. The same stuff is done (or used to be done) for AVX, because if you know a task doesn't use AVX, you don't need to preserve all those registers.

The issue is that eventually someone will find that memcpy* is 4.79 % faster on average with AVX-512 and will put that into glibc and approximately five minutes later all processes end up hitting AVX-512 instructions and zero processes can be scheduled on the E cores, making them completely pointless.

* It doesn't have to be memcpy or glibc, it's sufficient if some reasonably commonly used library ends up adopting AVX-512 when available.

> and zero processes can be scheduled on the E cores, making them completely pointless.

So because AVX-512 is fast, but E cores are slow, we should keep everything slow and prevent adoption of fast AVX-512 to prevent those E cores becoming pointless?

Well, Intel is in the business of selling E-cores.

Nobody really knows for sure.

The immediate problem is that CPUID is not deterministic for naive software: if you don't set affinity masks, you don't know whether you will be scheduled onto p-cores or e-cores, and so the result you get will vary.

More generally, software doesn't know what configuration of threads to launch... you want to launch as many AVX-512 threads as you have logical cores that support it, but not more, because they won't run on e-cores.

Software could potentially run a cpuid instruction affine to each logical core though, and collate the results... all you need to know is "16 logical cores with AVX-512 and 4 without".
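A rough sketch of that per-core probe (my assumption of how it could look; Linux plus GCC/Clang's <cpuid.h>, not something any shipping runtime does): pin the thread to each CPU in turn, run CPUID there, and collate the results. AVX-512F is CPUID leaf 7, subleaf 0, EBX bit 16.

    #include <cpuid.h>
    #include <sched.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
        for (long cpu = 0; cpu < ncpu; ++cpu) {
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(cpu, &set);
            if (sched_setaffinity(0, sizeof(set), &set) != 0) continue;
            unsigned eax, ebx, ecx, edx;
            if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) continue;
            std::printf("cpu %ld: avx512f=%u\n", cpu, (ebx >> 16) & 1);
        }
        return 0;
    }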

And software that isn't AVX-512 aware doesn't need to worry about it at all, since it doesn't know AVX-512 instructions. I guess the long tail of support is the stuff written for Skylake-SP in the meantime, but how much adoption really is there? It's that narrow gap between "regular stuff that never adopted AVX-512 because it wasn't on consumer platform" and "stuff that isn't HPC enough to be really custom" but also "stuff that won't receive an update". How much software can that really describe, especially with the reaction against Skylake-SP's clockdowns in mixed AVX-512+non-AVX workloads?

And also, that software can just launch AVX-512 threads, and if they end up on the e-cores you trap the instruction and affine them to the p-cores. Linux already has support for this because Linux doesn't save AVX registers if AVX instructions have never been used, so it would just become another type of interrupt for that first AVX-512 instruction. Linus has commented that this is perfectly feasible and he's puzzled why they're not doing it too.

Nobody knows what the fuck is going on and there has been no plan expressed to anyone outside the company as to what the exact problem is and whether they're looking at anything to fix it going forward. It's a complete mystery, nobody even knows if it's something critical or everything is just too on-fire to care about that right now.

(and if it wasn't on fire before, it probably is now, nobody you want to retain is hanging around after a 20% pay cut off the top and truly insulting retention bonuses... ranging as high as $200 for a senior principal (no, that is not missing a "K"). Oh and we paid $4b in dividends, and you need to move to Ohio if you want to keep your job, yes the ohio with the cancer cloud. Intel is fucked.)

Perhaps market segmentation, perhaps they heard of a vulnerability in their implementation that they couldn't patch (hence the microcode update). Intel loves market segmentation (server-specific AVX extensions, bfloat, ECC, overclocking), and I wouldn't be shocked to see them sell AVX512 support as a "DLC" microcode update down the road.

I wonder what Intel's plan is for the future here. Will a future efficiency core support AVX-512? Or will Intel just abandon it on consumer in favor of a variable-length SIMD instruction set?

Crestmont is the next e-core after Gracemont and appears to still not have AVX-512.

It would be highly desirable for e-cores to implement microcoded AVX-512 support to break the heterogeneous-ISA problem, if nothing else. You don't need to use the same implementation but if you can support the same ISA via microcode then software doesn't need to worry about heterogeneous ISA. Maybe crestmont does the microcode thing, possibly, but in the die shots there aren't many visible changes in the vector unit vs gracemont design.

The next p-core will continue to have AVX-512 but it will continue to be fused off.

This obviously is completely insane, like, even if you validated Raptor Cove already and you can't just take AVX-512 out, you're just going to keep including it in all your future designs too? It's not an insignificant amount of area, even on consumer it's probably at least 10% extra area just to even support a 256b vector/microcode, and it just looks crazy to introduce and then abandon it at the exact moment your competitor adopts and supports it.

Only thing I can think of is that maybe they have some other instructions which utilize microcode intrinsics implemented on the AVX-512 engines... like how Turing implements its Rapid Packed Math (dual-rate FP16) support using the tensor engines. Something else in the design that locks them into AVX-512 even if it is not externally exposed?

But again, they don't even support it even if you turn off the e-cores entirely... why the fuck would you do that? It's like the most confusing resolution to this problem and satisfies nobody, probably not even Intel. I guess they flatly do not want to touch it at all for some reason.

Note that this also includes mobile going forward since mobile will have big.LITTLE designs too... it's a serious amount of work and years of rollout they're pissing down the drain here.

It's going to be somewhat hilarious if Intel makes an AMD Bulldozer-like architecture for E-cores where there are AVX-512 units shared among 2 or 4 e-cores.
To go back to a past tech-screed: CMT as implemented by Bulldozer is just SMT with inefficiently-allocated resources. If the frontend (cache, fetch, decode, scoreboarding) and the FPU and retirement are all shared, what exactly did Bulldozer have that was unique to each 'core'? It was an integer ALU dedicated to each thread, that's it. And that's functionally identical to SMT but with a dedicated ALU for each thread. And if one of the threads has enough ILP to occupy two units and the other one isn't being used... why not let the thread use them both and get more work done?

https://news.ycombinator.com/item?id=34494484

So sure, let's do Bulldozer, a bunch of weaker but space-efficient threads (you know, e-cores) but put four threads on a single module sharing an AVX-512 unit, but let's also make it SMT so they can steal unoccupied execution units from other threads in their module. We could call it... Xeon Phi. ;)

https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Landing

And sure maybe Bulldozer was "ahead of its time" but I think that probably undersells just how weak they are especially when threads in a module start contending for shared resources. Both Bulldozer and Xeon Phi get incredibly weak when multiple threads are on a module, the higher threadcount is offset by a reduction in IPC too. And while that is still probably a net perf-per-area gain, your application really has to like threads for it to be worth it.

I'm gonna say it: if you think Bulldozer was "ahead of its time" then so was Larrabee and Xeon Phi. Bulldozer was a first stab at this Xeon Phi idea for AMD. And in both cases I'm not 100% sure it was worth it. The market doesn't seem to have thought so.

Now again: the devil is kinda in the details. It all depends what is shared. If you can make it so the performance hit is really small except for the shared FPU, that's one thing - there's nothing inherently wrong with this idea, that's what the Sun Niagara series does too. But Sun Niagara is also noted for comparatively weak FP performance (it's an integer-monster for database work). It all depends on just how much duplication per-thread and how much shared resource and how much area benefit it gets you.

But like, take a Niagara core, and let's say we have a couple threads with an opportunity for a bunch of ILP and a bunch of unused execution units sitting there. Why should you not launch onto them? That's the CMT vs SMT question to me. And it's fine if the answer is "scheduling complexity" but you need to think about that question before just blindly pinning resources to specific threads.

And again none of this is to dump on e-cores specifically. Intel's P-cores are too big, they are like triple the transistor count of AMD for a smidge more performance. I think the long-term future lies in the e-cores for Intel, they will replace Coves with Monts eventually. Sierra Forest is the most interesting Intel product in a long time imo. AMD is in less need of e-cores, a Zen3 core only has about 2x the transistor count of a Gracemont core and to me that's fine, it's an SMT core with higher per-thread performance too, that's a reasonable sidegrade. AMD's strategy of pursuing "compact" cores makes sense to me, they don't need a whole separate e-core, their P-cores are already area-efficient. They just are going to squeeze the last 10-20% out of it for area-optimized applications and call it a day.

(AMD has done a really good job avoiding cruft - supposedly Zen3 was a from-scratch redesign (Zen2 was actually supposedly a tweak, according to AMD's engineering lead), etc. And they've built this modularity of design that lets them embrace semicustom and advanced packaging and make innovative products and not just architectures. It really feels like Intel has been coasting on the Sandy Bridge design for a long time now, not even the kinds of Zen2->3 shifts, just incremental tweaks. Their iGPU stuff is evidently just as tightly tied to the CPU side as the CPU stuff is to their nodes, everything at Intel is obviously just one giant ball of mud at this point and it's incremental changes and legacy cruft all the way down. I am very down on Intel lately because even completely ignoring the current set of products and their merits, AMD is executing well and Intel is simply not. They've had 6 years since the Ryzen launch to turn things around and they still can't do the job right. AMD is obviously the better company right now in the Warren Buffett "own stocks that you'd want to buy the product" sense.)

https://www.youtube.com/watch?v=3vyNzgOP5yw

But I'm just not sure the Xeon Phi/Bulldozer/Niagara concept has really worked all that well in practice.

Anyway it's also possible that instead of sharing one unit among four cores, they put a unit in each core but it executes over multiple cycles, like AMD did with Zen1/Zen+ and 256b vectors. Or you have two 256b units that fuse to become a 512b unit. That seems to have been the design trend recently, that's how AMD does their AVX-512 on Zen4.

But those kinds of changes are what I mean when I say "if there were changes in the vector units it would probably show up on the die shots". Crestmont die shots seem to show a pretty unchanged AVX unit from Gracemont - it seems unlikely they changed it too significantly.

https://www.semianalysis.com/p/meteor-lake-die-shot-and-arch...

https://twitter.com/Locuza_/status/1524441315441786881

Perhaps there wasn't enough lead time to get it into Crestmont after the decision to go with P+E for Alder Lake.

AFAIK consumer Zen 4 supports 12 of the 15 AVX-512 extensions; do we know for certain this doesn't target one of the ones AMD is missing?

The newest extension it needs is -VBMI2, which is supported by Zen 4. -DQ and -BW are quite old and very common amongst all implementations by this point.

zen4 supports basically everything except the xeon phi SMT4 intrinsics (4VMMW or whatever). As did alder lake before its removal.

The support story for AVX extensions is not as complex as people make it out to be anyway. Server is a monotonic sequence, consumer is a monotonic sequence, both of them are converging apart from 1 or 2 that are unique to one or the other. Xeon Phi has the SMT4 intrinsics that are completely its own thing due to the SMT4 there, but you'll know if you're targeting xeon phi.

https://i.imgur.com/idAjB1X.png

So as you can see, consumer supports everything except BFloat16 for neural net training. Consumer doesn't do that so it's not a problem. And it doesn't support the Xeon Phi stuff because Xeon Phi is its own crazy thing.

No uarch family in that chart has ever abandoned an extension once it was adopted. So unless you are taking a consumer application and running it in the server, it's literally not even a problem. And server gets bfloat. That's it, that's literally the only two things you have to know.

but letting AMD fanboys draw le funni venn diagram is obviously way catchier than a properly organized chart representing the actual family trees involved... SSE would look bizarre if you represented it that way too, like all AMD's weird one-off SSE4 extension sets released in the middle of more fully-featured implementations... but people working in good faith would never actually be confused by that because they understand it's a different product family and year of release is not the only factor here.

Really the thing that has been a problem is that server has been stalled out forever... first 10nm problems and now sapphire rapid has more than a dozen known steppings. They can't get the newer architectures out, so consumer has been moving ahead without them... up until alder lake nuked the whole thing. If server had been able to get newer uarchs out, there would be a lot more green bars in server too.

supposedly the fab teams are actually ready to go now, and the problem is the design teams aren't used to operating in an environment where they can't go down the hall and have the fab teams fix their shit. Intel put the foot down and aren't letting them do that anymore, since the fab teams need to sell the resulting process/cell libraries to external foundry customers, and the design teams need to be able to make their shit work on external foundries. You can't do this hyper-tuned shit where the process is tweaked to make your bullshit cell designs work. But some of the teams are not mature enough to work in a portable environment where design rules actually have to be obeyed because Intel historically never had to.

When you hear the infinite steppings of Sapphire Rapids and the network chip team's continued inability to put out a 2.5gbe chipset that works (I think we are on public release number 6 now?), it's pretty obvious who the worst culprits were. Meteor Lake may also be having packaging/integration problems (although this is supposition by me based on what products are delayed - coincidentally it is a lot of chiplet/tile stuff and intel obviously lacks experience in advanced packaging) but the products that have infinite steppings obviously can't get their own shit together even on their own tiles let alone talking to other people's tiles.

But Intel supposedly are not kidding that Intel 4 is ready to go and they've just got nothing to run on it yet. Hence looking for outside partners. Supposedly they've got at least one definite order signed for Intel 3 in 2024, and I think there will be a lot of people happy to diversify and derisk away from the TSMC monoculture that has emerged... if TSMC stumbles, right now there is no alternative.

https://www.tomshardware.com/news/intel-ifs-lands-3nm-to-mak...

Samsung has all the same conflict-of-interest problems as Intel and also a track record of really mediocre fab execution. Supposedly they are ahead on GAAFET but like... we'll see, it's Samsung, who knows. They've stumbled just as much as Intel, just not on 7nm tier - I remember the iphone "is it TSMC or Samsung" games too. Samsung has put out a lot of garbage nodes and a lot of poorly-yielding nodes of their own.

edit since I can't edit: "And server gets bfloat" meaning "if you were to bring a ML training server application over to consumer it might not work".

Basically what I'm saying is, the only 2 situations that would be a problem is going consumer->server (which I don't see happening often) or going server ML training -> consumer if it doesn't have a non-BFloat16 fallback. And everyone does ML training on GPUs anyway.

Otherwise everything supports everything. Going backwards within a family might be a problem, but, that's always a problem, it's not a support matrix problem where there's a mixture of capability, it's just backwards compatibility to older hardware with less features.

The real problem, as I said, is that "Cooper Lake" there is Ice Lake-SP which was stalled for years, and by the time it was adopted Milan was already in the market and Cooper Lake was dead on arrival. So nobody actually has Cooper Lake, if you have AVX-512 server it's 99.9% chance it's either Skylake-SP or Cascade Lake-SP.

Which is 100% drop-in compatible with any consumer platform that anyone has (since conveniently nobody has Cannon Lake either). The literal only problem is taking consumer applications and running them on server stuff, and there's a well-defined server compatibility set there too.

Going forward, Sapphire Rapids is Golden Cove cores, so it should have the same support bars as Alder Lake there, ie basically everything, including server bfloat as well.

https://www.phoronix.com/image-viewer.php?id=intel-sapphirer...

(and of course the other problem being Intel has no idea what the fuck they're doing with big.LITTLE on the consumer platform... the support matrix for everything consumer-family going forward is apparently "nothing" because they've dropped AVX-512 entirely.)

Let me drill this down to the generations you actually need to care about: (that poor PNG...)

https://i.imgur.com/2HLrIjr.png

Like literally the AVX-512 support matrix is a complete fucking non-issue, it's an absolute tempest in a teapot by people who have never touched or looked seriously at AVX-512. The AVX-512 rollout is a dumpster fire in many many ways but an overly-complex support matrix is not one of them.

NumPy is something you could expect to find running on a workstation, and Intel's workstation CPU line has had AVX-512 continuously since 2017.

Alder Lake and Raptor Lake workstation (W680 Chipset) don't have AVX512 enabled.

Fair. The "entry workstation" thing from Intel is baffling. I was thinking Xeon W, but then of course there was the Xeon W-12xx that lacked AVX-512.

In short, I was wrong. It would have been more correct to say that Intel has offered a workstation part with AVX-512 continuously since Skylake.

Sad, a big win for Ryzen or a future Zen 4 Threadripper.

There was a longstanding issue where AVX-512 would trigger frequency throttling on a number of Intel CPUs, resulting in a net performance penalty for mixed workloads.

No, there was a longstanding issue of people worrying that it would result in a net performance penalty for mixed workloads. Meanwhile, people actually using AVX-512 dealt with that by making sure mixed workloads were batched appropriately, and it is usually a big net win even if you don't worry about it.

"Longstanding" here means "really only in a single CPU generation (SKX), and even then only if you were dumb and used like 50 AVX-512 instructions in isolation for no reason at all."

Hasn't been a problem because software corrected for that, or hasn't been a problem because Intel resolved the underlying throttling issues with AVX-512?

It was only useful in a couple of corner cases, but not in general library functions.

Intel could never fix its thermal issues; AMD did.

Assuming Intel didn't add code that's:
  if (AMD_CPU) { go_slow() }
.... again

I still don't get why people think Intel is obligated to optimize AMD performance. From what I recall, it wasn't a case of slowing down AMD devices; they just didn't apply code optimizations to machines not using Intel.

The code literally checked for CPU = AMD, and ignored the CPU feature bits that show which accelerations are available.

So sure, Intel shouldn't tune for AMD, but if AVX2 is listed as available, a compiler should use it. This was proven when, via some shared-library trickery, you could lie about the CPU name and suddenly the AMD CPU was faster.

I stumbled across this on a "Why is matlab much slower than I'd expect" thread. Lying about the CPU greatly improved performance and still showed the correct answers.
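To illustrate the distinction being argued over (my sketch, not Intel's actual dispatch code; uses GCC/Clang's <cpuid.h>), here is the difference between gating on the advertised feature bit and gating on the vendor string:

    #include <cpuid.h>
    #include <cstring>

    // Feature-bit check: AVX2 is CPUID leaf 7, subleaf 0, EBX bit 5.
    bool cpu_has_avx2() {
        unsigned eax, ebx, ecx, edx;
        if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return false;
        return (ebx >> 5) & 1;
    }

    // Vendor check: leaf 0 returns the 12-byte vendor string in EBX, EDX, ECX.
    bool vendor_is_intel() {
        unsigned eax, ebx, ecx, edx;
        __get_cpuid(0, &eax, &ebx, &ecx, &edx);
        char vendor[13] = {0};
        std::memcpy(vendor + 0, &ebx, 4);
        std::memcpy(vendor + 4, &edx, 4);
        std::memcpy(vendor + 8, &ecx, 4);
        return std::strcmp(vendor, "GenuineIntel") == 0;
    }

    // The complaint: dispatching on vendor_is_intel() && cpu_has_avx2() instead of
    // just cpu_has_avx2() sends AMD chips down the slow generic path even though
    // the feature bit says the fast path would work.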

If Matlab performance is important for competitive reasons, AMD should hire a few SW engineers to build tuned linear-algebra libraries like Intel does. It’s not rocket science. They could be competitive or faster with a low-millions-per-year investment (and only slightly behind with a much smaller investment in grad students and open source).

I worked on these sorts of libraries (not at Intel) for a decade; it's very, very common to dispatch on CPU family rather than feature flags, because performance details are tied to the microarchitecture, not the availability of the feature. Even though Skylake and Haswell both support AVX2 and FMA, there are huge differences in implementation that affect what the best implementation will be, so you can't just say "oh, FMA is available, use version 37." Instead you do "if we're on Skylake, use the Skylake version, if we're on Haswell, use the Haswell version, otherwise fall back on the generic runs-anywhere version." Nothing underhanded about it.

With all the errata compilers need to correct for, I wouldn't blame the compiler for not optimising for foreign chips. Matlab chose to use a library that only works well on Intel (a simple benchmark would've shown that); I don't think Intel's compiler team should be forced to write code for AMD chips. I very much doubt AMD's driver team will optimise their OpenCL tools for Nvidia hardware either.

Blame Matlab and friends for slowing down their software on your computer.

> it wasn't a case of slowing down AMD devices, they just didn't apply code optimizations to machines not using Intel

What's the difference?

The thing that would be okay is "not having optimizations designed for AMD devices".

But when you already have the optimizations, and you refuse to use them because AMD, that is not okay.

> What's the difference

Let's be naïve: "not doing something" is indeed, on a literal level, different from "doing something to slow it down". It's a bit like the nuances of a lie by omission vs a proper lie. That being said... both are still considered lies in a more abstract sense, and so is this deliberate slowdown. I assume the author just took it all a little more literally than most people (probably?) deem necessary.

I would say "not doing something" is an incorrect description of what it did, though. They had to write extra code to make the behavior on Intel and AMD differ.