DWCAS in C++
source link: https://timur.audio/dwcas-in-c
On x86_64, DWCAS (double-width compare-and-swap) is available (as lock cmpxchg16b in x86 assembly) on every chip apart from some early AMD and Intel Core 2 chips (we’re talking 2008 and earlier). In fact, 64-bit Windows 8.1 and newer requires DWCAS to run.

On 32-bit x86, the corresponding instruction is lock cmpxchg8b, giving you 64-bit atomic lock-free variables. And on ARM, DWCAS functionality is provided via double-width LL/SC instructions, which seem to be available on all modern ARM chips (32 and 64 bit).

Consider a struct that contains two pointers, or perhaps a pointer and an index, such as:

    template <typename T>
    struct stack_head
    {
        std::uintptr_t aba_counter;  // incremented on every update of the head
        stack_node<T>* node;         // current top of the stack
    };
A static_assert such as this should succeed:

    static_assert(std::atomic<stack_head<T>>::is_always_lock_free);
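Because a failing static_assert stops the build, it can be handy to look at the same property without asserting on it. The sketch below is only an illustration: stack_node<T> is a minimal stand-in for a node type the article does not show, and dwcas_guaranteed is a hypothetical helper name. It needs C++17 for is_always_lock_free.

```cpp
#include <atomic>
#include <cstdint>

// Minimal stand-in for the node type (assumed; not shown in the article).
template <typename T>
struct stack_node
{
    T value;
    stack_node* next;
};

// Same layout as the stack_head struct above.
template <typename T>
struct stack_head
{
    std::uintptr_t aba_counter;
    stack_node<T>* node;
};

// Hypothetical helper: true only if the implementation guarantees at compile
// time that std::atomic<stack_head<T>> never falls back to a lock.
template <typename T>
constexpr bool dwcas_guaranteed = std::atomic<stack_head<T>>::is_always_lock_free;

// On mainstream platforms the head is exactly two machine words wide, which
// is why it needs a double-width CAS rather than a regular one.
static_assert(sizeof(stack_head<int>) == 2 * sizeof(void*));
```

Printing or branching on dwcas_guaranteed<int> lets a build report what the toolchain promises instead of failing outright.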
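    // Push fragment: head is the std::atomic holding the current head and node
    // is the new node. head_t is presumably an alias for stack_head<T>.
    head_t next;
    head_t orig = head.load();
    do
    {
        node->next = orig.node;                   // link the new node in front of the current top
        next.aba_counter = orig.aba_counter + 1;  // bump the counter so a stale head fails the CAS
        next.node = node;
    }
    while (!head.compare_exchange_weak(orig, next));  // on failure, orig is refreshed with the current head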
The compare_exchange_weak should compile to lock cmpxchg16b on x86_64 (and the equivalent instructions on other architectures). In other words, I would expect a guarantee that the compiler will not insert any mutexes into the std::atomic and the code will be lock-free and race-free with the best performance possible.
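For reference, the push fragment above can be fleshed out into something that compiles and runs without any double-width support, at the cost of changing the layout. The sketch below shrinks the head to a 32-bit counter and a 32-bit index into a preallocated node pool, so it fits in 8 bytes and a single-width CAS is enough. That layout, and the names small_head, pool_node, index_stack and no_node, are assumptions made for illustration, not the article's code; only the loop structure is the same.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 8-byte head: 32-bit ABA counter plus 32-bit node index.
// The article's stack_head uses a uintptr_t and a pointer instead.
struct small_head
{
    std::uint32_t aba_counter;
    std::uint32_t index;
};

constexpr std::uint32_t no_node = 0xFFFFFFFFu;  // empty stack / end of list

struct pool_node
{
    int value = 0;
    std::atomic<std::uint32_t> next{no_node};
};

class index_stack
{
public:
    explicit index_stack(std::size_t capacity) : nodes(capacity) {}

    // Pushes pool slot i. The caller must own that slot (it is not on the stack).
    void push(std::uint32_t i, int value)
    {
        assert(i < nodes.size());
        nodes[i].value = value;
        small_head next;
        small_head orig = head.load();
        do
        {
            nodes[i].next.store(orig.index, std::memory_order_relaxed);
            next.aba_counter = orig.aba_counter + 1;
            next.index = i;
        }
        while (!head.compare_exchange_weak(orig, next));
    }

    // Pops the top slot. Returns false if the stack is empty.
    bool pop(std::uint32_t& i, int& value)
    {
        small_head next;
        small_head orig = head.load();
        do
        {
            if (orig.index == no_node)
                return false;
            next.aba_counter = orig.aba_counter + 1;
            next.index = nodes[orig.index].next.load(std::memory_order_relaxed);
        }
        while (!head.compare_exchange_weak(orig, next));
        i = orig.index;
        value = nodes[i].value;
        return true;
    }

private:
    std::vector<pool_node> nodes;
    std::atomic<small_head> head{small_head{0, no_node}};
};
```

The trade-off is that a 32-bit counter wraps around much sooner than a pointer-sized one, which is one reason the full-width layout, and therefore DWCAS, is attractive.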
On 32-bit x86, the static_assert succeeds and the lock cmpxchg8b instruction gets generated on all major C++ compilers (godbolt). On some compilers, this only happens if you align the struct to 8 bytes using alignas, but that is easily done. However, on x86_64, a much more important platform these days, things look less rosy.

The only compiler where the static_assert
passes and the correct instructions are generated out of the box when targeting 64 bit is the current Apple Clang. This works when compiling for Apple Silicon as well as when compiling for Intel, all the way down to deployment target macOS 10.7 (this is the oldest that my version of Clang supports). I suppose this is because Apple knows exactly what CPUs their OS can run on, and have a modified backend in their Clang fork that uses DWCAS whenever possible. So if Apple is the only platform you’re interested in, then we’re good. Which of course isn’t helpful at all if your goal is a portable, generic C++ library.

[…] static_assert
fails, and the compiler will instead generate a library call to __atomic_compare_exchange, which may insert mutexes (see documentation here).

[…] /proc/cpuinfo, it will be listed as cx16). So presumably they could do at least a runtime check here, and give us a runtime guarantee that those library calls will use DWCAS instead of mutexes if the CPU supports it. But that’s not what’s happening: if you call the node.is_lock_free() member function on a std::atomic<stack_node<T>> object (which, unlike is_always_lock_free, is a per-instance runtime check), it still returns false on my brand new Linux machine with an 11th Generation (Tiger Lake) Intel Core i9 CPU.

One option is to compile with -mcx16, which assumes DWCAS availability. On Clang, this will do the right thing if and only if you overalign the struct to 16 bytes (godbolt). If you then run the binary on an old CPU without DWCAS, it will presumably just crash. Which is fine if we don’t care about those.

However, you then have to compile all of your code with -mcx16. If you don’t (or if you can’t, because you’re shipping a library), you will get ODR violations and undefined behaviour because now there are two diverging definitions of std::atomic floating around in your binary. And I haven’t found a way yet to programmatically check whether the -mcx16 flag was set, so you can’t even warn your users about this. Which leaves us with the conclusion that you cannot ship a C++ library relying on DWCAS availability on Clang.

On GCC, even the -mcx16
flag does not help: regardless of what you do, GCC simply won’t emit DWCAS instructions on x86_64. Instead, it will always issue calls into libatomic, and either report that is_lock_free() == false or (on my machine) not link at all because the default libatomic contains no implementation of std::atomic for double-width underlying types.

This behaviour of GCC with -mcx16
has resulted in several bug reports (here, here, and here), but apparently GCC folks have decided that this is by design and won’t be fixed. Which means: no lock-free DWCAS on Linux.

[…] is_lock_free() == false, and emits mutexes instead of DWCAS instructions.

[…] std::atomic
would constitute an ABI break, so they won’t do that. It’s another sad example of choosing ABI stability over performance, which continues to baffle me, given that C++’s tag line is to give the programmer the tools for best performance and “leave no room for a lower-level language”.

[…] InterlockedCompareExchange128
function from winnt.h (docs here). So if you want portable code, you basically have to implement your own std::atomic and use that under the hood. This is the case for both x86_64 and ARM.

If you want a std::atomic
that portably uses DWCAS instead of silently adding locks into your code (and I don’t think anyone would want that?), your only option is to implement your own std::atomic for all relevant target platforms, or to use a third-party library that does it for you. Which is a really sad state of affairs: I am currently not aware of a free and open source implementation that works on all major compilers and operating systems. If somebody knows such a library, please let me know!
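As a possible starting point for such a hand-rolled primitive on GCC and Clang, the following sketch may be worth experimenting with. It rests on two assumptions that are not from the article and that you should verify on your own toolchain: that the predefined macro __GCC_HAVE_SYNC_COMPARE_AND_SWAP_16 is defined when the target is known to support a 16-byte CAS (for example when -mcx16 is passed on x86_64), and that the legacy __sync_bool_compare_and_swap builtin is then expanded inline for unsigned __int128. The names has_inline_dwcas and dwcas are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__GCC_HAVE_SYNC_COMPARE_AND_SWAP_16)

// The compiler advertises a native 16-byte compare-and-swap for this target.
constexpr bool has_inline_dwcas = true;

// Hypothetical wrapper: atomically replaces *target with desired if it still
// equals expected, and returns true on success.
inline bool dwcas(unsigned __int128* target,
                  unsigned __int128 expected,
                  unsigned __int128 desired)
{
    // The underlying instructions require a 16-byte aligned operand.
    assert(reinterpret_cast<std::uintptr_t>(target) % 16 == 0);
    return __sync_bool_compare_and_swap(target, expected, desired);
}

#else

// No compile-time guarantee: callers need another strategy here, for example
// a platform intrinsic or an explicit, visible lock.
constexpr bool has_inline_dwcas = false;

#endif
```

If the macro behaves as assumed, the same check could also be used in a header to emit a diagnostic when a translation unit is built without the flag; whether that is enough to rule out the mixed-flags problem described earlier is untested here.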