Add atomic memcpy RFC. by m-ou-se · Pull Request #3301 · rust-lang/rfcs · GitHub
source link: https://github.com/rust-lang/rfcs/pull/3301
added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label
Contributor
bjorn3 commented 5 days ago
cc @ojeda
This could mention the […]
5225225 commented 5 days ago •
With some way for the language to express "this type is valid for any bit pattern", which project safe-transmute presumably will provide (and that exists in the ecosystem as […]). This would also require removing the safe […]. That's extra complexity, but it means that with some help from the ecosystem/future stdlib work, this can be used in 100% safe code, if the data is fine with being torn.
Contributor
Lokathor commented 5 days ago
The "uninit" part of […]: not without the fabled and legendary Freeze Intrinsic, anyway.
T-Dark0 commented 5 days ago
On the other hand, […]
Member
programmerjake commented 5 days ago
Note that LLVM already implements this operation: `llvm.memcpy.element.unordered.atomic`.
comex commented 5 days ago
The trouble with that intrinsic is that […]
> - Require `T: Copy` for `AtomicPerByte<T>`, such that we don't need to worry about
>   duplicating non-`Copy` data.

There might be valid use cases with non-`Copy` data. Also, not all "memcpy'able" data is always marked as `Copy` (e.g. to prevent implicit copies).
crossbeam-deque is one of the crates that needs support for non-`Copy` data.
> `pub const fn uninit() -> Self;`
> `pub fn load(&self, ordering: Ordering) -> MaybeUninit<T>;`
> `pub fn store(&self, value: T, ordering: Ordering);`
Member
Author
m-ou-se 5 days ago
That's good to know, thanks.
For this specific API: instead of `store`, `store_from` can be used to store something from a `&MaybeUninit<T>`. I figured the majority of the cases would use a `T`, so that's what I picked for the `store` method, to keep it as simple as possible. (There doesn't seem to be much use in consuming a `MaybeUninit<T>`, since it doesn't have a drop implementation anyway.)
> - In order for this to be efficient, we need an additional intrinsic hooking into
>   special support in LLVM. (Which LLVM needs to have anyway for C++.)
How do you plan to implement this until LLVM supports it?
I don't think it is necessary to explain the implementation details in the RFC, but if we ship an unsound implementation while waiting for the as-yet-unmerged C++ proposal to land in LLVM, that seems like a problem.
(Also, if the language provides the functionality necessary to implement this soundly in Rust, the ecosystem can implement this soundly as well without inline assembly.)
Member
Author
m-ou-se 5 days ago
I haven't looked into the details yet of what's possible today with LLVM. There's a few possible outcomes:
- We wait until LLVM supports this. (Or contribute it to LLVM.) This feature is delayed until some point in the future when we can rely on an LLVM version that includes it.
- Until LLVM supports it, we use a theoretically unsound but known-to-work-today hack like `ptr::{read_volatile, write_volatile}` combined with a fence. In the standard library we can more easily rely on implementation details of today's compiler.
- We use the existing `llvm.memcpy.element.unordered.atomic`, after figuring out the consequences of the `unordered` property.
- Until LLVM support appears, we implement it in the library using a loop of `AtomicUsize::load()`/`store()`s and a fence, possibly using an efficient inline-assembly alternative for some popular architectures.
I'm not fully sure yet which of these are feasible.
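A minimal sketch of the last fallback (the library loop): emulate the atomic-per-byte copy with relaxed byte-sized atomic accesses plus fences. A real implementation would use word-sized accesses and/or inline assembly; the function names here are made up for illustration.

```rust
use std::sync::atomic::{fence, AtomicU8, Ordering};

// Emulate a release-ordered per-byte store: a release fence sequenced
// before the relaxed stores synchronizes with a reader that performs an
// acquire fence after its relaxed loads.
fn store_per_byte(dst: &[AtomicU8], src: &[u8]) {
    assert_eq!(dst.len(), src.len());
    fence(Ordering::Release);
    for (d, &s) in dst.iter().zip(src) {
        d.store(s, Ordering::Relaxed); // each byte is individually atomic
    }
}

// Emulate an acquire-ordered per-byte load. The value as a whole may be
// torn if it races with a writer; only the individual bytes are atomic.
fn load_per_byte(src: &[AtomicU8], dst: &mut [u8]) {
    assert_eq!(dst.len(), src.len());
    for (d, s) in dst.iter_mut().zip(src) {
        *d = s.load(Ordering::Relaxed);
    }
    fence(Ordering::Acquire);
}
```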
I'm very familiar with the standard Rust and C++ memory orderings, but I don't know much about LLVM's. (It seems […])
> # Unresolved questions
>
> - Should we require `T: Copy`?
Given this is a quite advanced tool, I would lean towards not requiring it. Some documentation in `AtomicPerByte` could ameliorate the issue.
Yes, IMO this should not require `Copy`, since there's no way to shake off this requirement. I think it's reasonable to use this for a `T` which happens to hold something else in an `UnsafeCell`, which would make it non-`Copy`.
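To illustrate that point: a type can be plain "memcpy'able" data and still not be `Copy`, because `UnsafeCell<T>` never implements `Copy`. The example type below is hypothetical.

```rust
use std::cell::UnsafeCell;

// Plain-old-data fields, yet the struct cannot be `Copy`:
// `UnsafeCell<u32>` does not implement `Copy`, so deriving
// `#[derive(Clone, Copy)]` on `Slot` would fail to compile.
struct Slot {
    flags: u32,
    inner: UnsafeCell<u32>,
}
```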
> but it's easy to accidentally cause undefined behavior by using `load`
> to make an extra copy of data that shouldn't be copied.

> - Naming: `AtomicPerByte`? `TearableAtomic`? `NoDataRace`? `NotQuiteAtomic`?
Given these options and considering what the C++ paper chose, `AtomicPerByte` sounds OK and has the advantage of having `Atomic` as a prefix.
ojeda commented 5 days ago
Unordered is not monotonic (as in, it has no total order across all accesses), so LLVM is free to reorder loads/stores in ways it would not be allowed to with Relaxed; it behaves a lot more like a non-atomic variable in this sense. In practical terms, in single-threaded scenarios it behaves as expected, but when you load an atomic variable with unordered where the previous writer was another thread, you basically have to be prepared for it to hand you back any value previously written by that thread, due to the reordering allowed.

Concretely, I don't know how we'd implement relaxed ordering by fencing without that fence having a cost on weakly ordered machines (e.g. without implementing it as an overly strong acquire/release fence). That said, I think we could add an intrinsic to LLVM that does what we want here; I just don't think it already exists.

(FWIW, another part of the issue is that this stuff is not that well specified, but it's likely described by the "plain" accesses explained in https://www.cs.tau.ac.il/~orilahav/papers/popl17.pdf)
Member
thomcc commented 4 days ago
CC @RalfJung, who has stronger opinions on Unordered (and is the one who provided that link in the past). I think we can easily implement this with relaxed in compiler-builtins, though it should get a new intrinsic, since many platforms can implement it more efficiently.
Contributor
bjorn3 commented 4 days ago
We already have unordered atomic memcpy intrinsics in compiler-builtins, for 1-, 2-, 4- and 8-byte access sizes.
Member
thomcc commented 4 days ago
I'm not sure we'd want unordered, as mentioned above...
Member
thomcc commented 4 days ago
To clarify the difference between relaxed and unordered (in terms of loads and stores): if you have […]. In other words, for unordered, it would be legal for 2 to be stored before 1, or for […]
Member
programmerjake commented 4 days ago
Something that could work, but not be technically correct, is: […]. Those fences are no-ops at runtime, but prevent the compiler from reordering the unordered atomics. Assuming you're on any modern CPU (except Alpha, IIRC), it will behave like relaxed atomics, because that's what standard load/store instructions do.
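The shape of that workaround, sketched in Rust. Rust cannot express LLVM's `unordered`, so `Relaxed` stands in for it here; `compiler_fence` is intended to restrict only compiler reordering (though, as noted in the discussion, it is not free on every platform). The function names are made up.

```rust
use std::sync::atomic::{compiler_fence, AtomicUsize, Ordering};

static DATA: AtomicUsize = AtomicUsize::new(0);

// Bracket the access with compiler fences so the compiler cannot move
// other memory operations across it; no hardware fence is requested.
fn fenced_store(v: usize) {
    compiler_fence(Ordering::Release);
    DATA.store(v, Ordering::Relaxed); // stand-in for an `unordered` store
    compiler_fence(Ordering::Release);
}

fn fenced_load() -> usize {
    compiler_fence(Ordering::Acquire);
    let v = DATA.load(Ordering::Relaxed); // stand-in for an `unordered` load
    compiler_fence(Ordering::Acquire);
    v
}
```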
Member
thomcc commented 4 days ago
Those fences aren't always no-ops at runtime, they actually emit code on several platforms (rust-lang/rust#62256). It's also unclear what can and can't be reordered across compiler fences (rust-lang/unsafe-code-guidelines#347), certainly plain stores can in some cases (this is easy to show happening in godbolt). Either way, my point has not been that we can't implement this. We absolutely can and it's probably even straightforward. My point is just that I don't really think those existing intrinsics help us do that. |
I like […]
> The somewhat popular `seqlock` crate and similar implementations found in the ecosystem
> all use a regular non-atomic write (preceded by an atomic fence) for writing,
> and `ptr::read_volatile` (followed by an atomic fence) for reading.
> This "works" "fine", but is technically undefined behavior.
It may be worth noting that, aside from the volatile-volatile data race (which can be somewhat justified by always detecting and discarding the value of a racing read), the volatile-fence pattern doesn't actually provide synchronisation under the C++ memory model.
Fences only "interact" with atomic accesses, and volatile accesses are not atomic accesses. So unless SeqCst is involved, a release fence + volatile write on one side and a volatile read + acquire fence on the other does nothing synchronisation-wise. This can cause the reader to miss a sequence update and use the volatile read's value instead of discarding it, even though the writer did a racing write in the meantime.
Miri can already reproduce this: crossbeam-rs/crossbeam#859. Though most (all?) LLVM compilation schemes emit the same code for volatile and relaxed (https://godbolt.org/z/MsGnq1fGh), so it probably doesn't blow up in real life, until some optimisations come along.
Byte-wise atomics are actually required to resolve this without using potentially more expensive RMWs or SeqCst operations.