
Add atomic memcpy RFC. by m-ou-se · Pull Request #3301 · rust-lang/rfcs · GitHub

source link: https://github.com/rust-lang/rfcs/pull/3301

Member

m-ou-se commented 5 days ago

edited

m-ou-se

added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label

5 days ago

Contributor

bjorn3 commented 5 days ago

cc @ojeda

Contributor

ibraheemdev commented 5 days ago

edited

This could mention the atomic-maybe-uninit crate in the alternatives section (cc @taiki-e).


5225225 commented 5 days ago

edited

With some way for the language to express "this type is valid for any bit pattern" (which project safe transmute will presumably provide, and which already exists in the ecosystem as bytemuck, zerocopy, and probably others), I'm wondering if it would be better to return an AtomicPerByteRead<T>(MaybeUninit<T>), for which we or the ecosystem could provide a safe into_inner (returning a T) when T is valid for any bit pattern.

This would also require removing the safe uninit method. But you could always presumably do an AtomicPerByte<MaybeUninit<T>> with no runtime cost to passing MaybeUninit::uninit() to new.

That's extra complexity, but means that with some help from the ecosystem/future stdlib work, this can be used in 100% safe code, if the data is fine with being torn.
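A minimal sketch of this suggestion, assuming a hypothetical AnyBitPattern marker trait in the spirit of bytemuck/zerocopy (the wrapper name and trait here are illustrative, not from the RFC):

```rust
use core::mem::MaybeUninit;

// Hypothetical wrapper that `load` would return instead of a bare
// `MaybeUninit<T>`.
pub struct AtomicPerByteRead<T>(MaybeUninit<T>);

/// Stand-in for a "valid for any bit pattern" marker, as provided today
/// by crates like bytemuck or zerocopy, or by a future safe-transmute
/// language feature.
pub unsafe trait AnyBitPattern: Copy {}
unsafe impl AnyBitPattern for u32 {}

impl<T> AtomicPerByteRead<T> {
    /// Safe because `T: AnyBitPattern` guarantees every (torn)
    /// combination of initialized bytes is a valid `T`.
    pub fn into_inner(self) -> T
    where
        T: AnyBitPattern,
    {
        unsafe { self.0.assume_init() }
    }
}

fn main() {
    let torn = AtomicPerByteRead(MaybeUninit::new(42u32));
    assert_eq!(torn.into_inner(), 42);
}
```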

Contributor

Lokathor commented 5 days ago

The "uninit" part of MaybeUninit is essentially not a bit pattern though. That's the problem. Even if a value is valid "for all bit patterns", you can't unwrap uninit memory into that type.

not without the fabled and legendary Freeze Intrinsic anyway.

T-Dark0 commented 5 days ago

On the other hand, AnyBitPatternOrPointerFragment isn't a type we have, nor really a type we strictly need for this. Assuming tearing can't deinitialize initialized memory, then MaybeUninit would suffice I think?

Member

programmerjake commented 5 days ago

note that LLVM already implements this operation:
llvm.memcpy.element.unordered.atomic Intrinsic
with an additional fence operation for acquire/release.

comex commented 5 days ago

The trouble with that intrinsic is that unordered is weaker than monotonic aka Relaxed, and it can't easily be upgraded. There's no "relaxed fence" if the ordering you want is Relaxed; and even if the ordering you want is Acquire or Release, combining unordered atomic accesses with fences doesn't produce quite the same result. Fences provide additional guarantees regarding other memory accessed before/after the atomic access, but they don't do anything to restore the missing "single total order" per address of the atomic accesses themselves.

- Require `T: Copy` for `AtomicPerByte<T>`, such that we don't need to worry about

duplicating non-`Copy` data.

There might be valid use cases with non-`Copy` data.

Also, not all "memcpy'able" data is always marked as `Copy` (e.g. to prevent implicit copies).

crossbeam-deque is one of the crates that needs support for non-Copy data.


pub const fn uninit() -> Self;

pub fn load(&self, ordering: Ordering) -> MaybeUninit<T>;

pub fn store(&self, value: T, ordering: Ordering);

Member

Author

m-ou-se 5 days ago

That's good to know, thanks.

For this specific API: instead of store, store_from can be used to store something from a &MaybeUninit<T>. I figured the majority of the cases would use a T, so that's what I picked for the store method, to keep it as simple as possible. (There doesn't seem to be much use to consuming a MaybeUninit<T> since it doesn't have a drop implementation anyway.)


- In order for this to be efficient, we need an additional intrinsic hooking into

special support in LLVM. (Which LLVM needs to have anyway for C++.)

How do you plan to implement this until LLVM supports it?

I don't think the RFC needs to explain the implementation details, but if we ship an unsound implementation while waiting for the (as yet unmerged) C++ proposal to land in LLVM, that seems like a problem.

(Also, if the language provides the functionality necessary to implement this soundly in Rust, the ecosystem can implement this soundly as well without inline assembly.)


Member

Author

m-ou-se 5 days ago

I haven't looked into the details yet of what's possible today with LLVM. There are a few possible outcomes:

  • We wait until LLVM supports this. (Or contribute it to LLVM.) This feature is delayed until some point in the future when we can rely on an LLVM version that includes it.
  • Until LLVM supports it, we use a theoretically unsound but known-to-work-today hack like ptr::{read_volatile, write_volatile} combined with a fence. In the standard library we can more easily rely on implementation details of today's compiler.
  • We use the existing llvm.memcpy.element.unordered.atomic, after figuring out the consequences of the unordered property.
  • Until LLVM support appears, we implement it in the library using a loop of AtomicUsize::load()/store()s and a fence, possibly using an efficient inline assembly alternative for some popular architectures.

I'm not fully sure yet which of these are feasible.
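The last option above could look roughly like the following (an illustrative sketch only: a real implementation would copy in word-sized chunks and handle alignment, and the function name is made up):

```rust
use core::sync::atomic::{fence, AtomicU8, Ordering};

/// Load `len` bytes from `src` one byte at a time with relaxed atomic
/// loads, then issue an acquire fence, as in the bullet above.
unsafe fn atomic_load_per_byte(src: *const AtomicU8, dst: *mut u8, len: usize) {
    unsafe {
        for i in 0..len {
            let b = (*src.add(i)).load(Ordering::Relaxed);
            dst.add(i).write(b);
        }
    }
    fence(Ordering::Acquire);
}

fn main() {
    let src = [AtomicU8::new(1), AtomicU8::new(2), AtomicU8::new(3)];
    let mut dst = [0u8; 3];
    unsafe { atomic_load_per_byte(src.as_ptr(), dst.as_mut_ptr(), dst.len()) };
    assert_eq!(dst, [1, 2, 3]);
}
```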



Member

Author

m-ou-se commented 5 days ago

edited

The trouble with that intrinsic is that unordered is weaker than monotonic aka Relaxed, and it can't easily be upgraded. There's no "relaxed fence" if the ordering you want is Relaxed; and even if the ordering you want is Acquire or Release, combining unordered atomic accesses with fences doesn't produce quite the same result. Fences provide additional guarantees regarding other memory accessed before/after the atomic access, but they don't do anything to restore the missing "single total order" per address of the atomic accesses themselves.

I'm very familiar with the standard Rust and C++ memory orderings, but I don't know much about llvm's unordered ordering. Could you give an example of unexpected results we might get if we were to implement AtomicPerByte<T>::{read, write} using llvm's unordered primitive and a fence? Thanks!

(It seems monotonic behaves identically to unordered for loads and stores?)


# Unresolved questions

- Should we require `T: Copy`?

Given this is a quite advanced tool, I would lean towards not requiring it.

Some documentation in AtomicPerByte could ameliorate the issue.


Yes, IMO this should not require Copy, since there's no way to shake off this requirement. I think it's reasonable to use this for a T which happens to hold something else in an UnsafeCell in it, which would make it non-Copy.


but it's easy to accidentally cause undefined behavior by using `load`

to make an extra copy of data that shouldn't be copied.

- Naming: `AtomicPerByte`? `TearableAtomic`? `NoDataRace`? `NotQuiteAtomic`?

Given these options and considering what the C++ paper chose, AtomicPerByte sounds OK and has the advantage of having Atomic as a prefix.


ojeda commented 5 days ago

cc @ojeda

Thanks! Cc'ing @wedsonaf since he will like it :)

Member

thomcc commented 4 days ago

edited

Unordered is not monotonic (as in, it has no total order across all accesses), so LLVM is free to reorder loads/stores in ways it would not be allowed to with Relaxed (it behaves a lot more like a non-atomic variable in this sense)

In practical terms, in single-thread scenarios it behaves as expected, but when you load an atomic variable with unordered where the previous writer was another thread, you basically have to be prepared for it to hand you back any value previously written by that thread, due to the reordering allowed.

Concretely, I don't know how we'd implement relaxed ordering by fencing without having that fence have a cost on weakly ordered machines (e.g. without implementing it as an overly-strong acquire/release fence).

That said, I think we could add an intrinsic to LLVM that does what we want here. I just don't think it already exists.

(FWIW, another part of the issue is that this stuff is not that well specified, but it's likely described by the "plain" accesses explained in https://www.cs.tau.ac.il/~orilahav/papers/popl17.pdf)

Member

thomcc commented 4 days ago

CC @RalfJung who has stronger opinions on Unordered (and is the one who provided that link in the past).

I think we can easily implement this with relaxed in compiler-builtins though, but it should get a new intrinsic, since many platforms can implement it more efficiently.

Contributor

bjorn3 commented 4 days ago

We already have unordered atomic memcpy intrinsics in compiler-builtins. For 1, 2, 4 and 8 byte access sizes.

Member

thomcc commented 4 days ago

I'm not sure we'd want unordered, as mentioned above...

Member

thomcc commented 4 days ago

To clarify on the difference between relaxed and unordered (in terms of loads and stores), if you have

static ATOM: AtomicU8 = AtomicU8::new(0);
const O: Ordering = ???;

fn thread1() {
    ATOM.store(1, O);
    ATOM.store(2, O);
}

fn thread2() {
    let a = ATOM.load(O);
    let b = ATOM.load(O);
    assert!(a <= b);
}

thread2's assertion can never fail if O is Relaxed, but it could fail if O is (the hypothetical) Unordered.

In other words, for unordered, it would be legal for 2 to be stored before 1, or for b to be loaded before a. In terms of fences, there's no fence that "upgrades" unordered to relaxed, although I believe (but am not certain) that stronger fences do apply to it.

Member

programmerjake commented 4 days ago

something that could work but not be technically correct is:
compiler acquire fence
unordered atomic memcpy
compiler release fence

those fences are no-ops at runtime, but prevent the compiler from reordering the unordered atomics -- assuming you're on any modern CPU (except Alpha, IIRC) it will behave like relaxed atomics, because that's what standard load/store instructions do.
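In today's Rust this pattern can only be approximated, since unordered isn't exposed; a sketch mirroring the sequence above, with relaxed byte loads standing in for the unordered atomic memcpy (and with the thread's own caveat that this is not technically correct):

```rust
use core::sync::atomic::{compiler_fence, AtomicU8, Ordering};

fn relaxed_ish_load(src: &[AtomicU8], dst: &mut [u8]) {
    // Compiler-only fences: no instructions emitted on most targets,
    // but the compiler may not move the loads across them.
    compiler_fence(Ordering::Acquire);
    for (d, s) in dst.iter_mut().zip(src.iter()) {
        // Stand-in for the unordered atomic memcpy.
        *d = s.load(Ordering::Relaxed);
    }
    compiler_fence(Ordering::Release);
}

fn main() {
    let src = [AtomicU8::new(5), AtomicU8::new(6)];
    let mut dst = [0u8; 2];
    relaxed_ish_load(&src, &mut dst);
    assert_eq!(dst, [5, 6]);
}
```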

Member

thomcc commented 4 days ago

Those fences aren't always no-ops at runtime, they actually emit code on several platforms (rust-lang/rust#62256). It's also unclear what can and can't be reordered across compiler fences (rust-lang/unsafe-code-guidelines#347), certainly plain stores can in some cases (this is easy to show happening in godbolt).

Either way, my point has not been that we can't implement this. We absolutely can and it's probably even straightforward. My point is just that I don't really think those existing intrinsics help us do that.

I like MaybeAtomic, but following C++ with AtomicPerByte sounds reasonable.
The LLVM guys started something similar in 2016:
https://reviews.llvm.org/D27133

The somewhat popular `seqlock` crate and similar implementations found in the ecosystem

all use a regular non-atomic write (preceded by an atomic fence) for writing,

and `ptr::read_volatile` (followed by an atomic fence) for reading.

This "works" "fine", but is technically undefined behavior.

It may be worth noting that, aside from the volatile-volatile data race (which can be somewhat justified by always detecting and discarding the value of a racing read), the volatile-fence pattern doesn't actually provide synchronisation under the C++ memory model.

Fences only "interact" with atomic accesses. Volatiles are not atomic accesses. So unless SeqCst is involved, release fence-volatile write + volatile read-acquire fence does nothing synchronisation wise. This can cause the reader to miss a sequence update and use the volatile read value instead of discarding it, even though the writer did a racing write in the mean time.

Miri can already reproduce this: crossbeam-rs/crossbeam#859. Though most (all?) LLVM compilation schemes emit the same code for volatile and relaxed: https://godbolt.org/z/MsGnq1fGh, so it probably doesn't blow up irl, until some optimisations come along.

Byte-wise atomics are actually required to resolve this without resorting to potentially more expensive RMWs or SeqCst operations.
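For context, a minimal single-writer seqlock reader, showing where the byte-wise atomic load would replace the UB read_volatile pattern (here an AtomicU32 stands in for AtomicPerByte<T>; the type and method are illustrative, not the seqlock crate's API):

```rust
use std::sync::atomic::{fence, AtomicU32, AtomicUsize, Ordering};

struct SeqLock {
    seq: AtomicUsize,
    data: AtomicU32, // stand-in for `AtomicPerByte<T>`
}

impl SeqLock {
    fn read(&self) -> u32 {
        loop {
            let s1 = self.seq.load(Ordering::Acquire);
            if s1 % 2 == 1 {
                continue; // odd sequence number: writer in progress
            }
            // The byte-wise atomic load goes here. The acquire fence
            // below pairs with the writer's release fence only because
            // this is an *atomic* access, which a volatile read is not.
            let value = self.data.load(Ordering::Relaxed);
            fence(Ordering::Acquire);
            let s2 = self.seq.load(Ordering::Relaxed);
            if s1 == s2 {
                return value; // sequence unchanged: no racing write
            }
        }
    }
}

fn main() {
    let lock = SeqLock { seq: AtomicUsize::new(0), data: AtomicU32::new(123) };
    assert_eq!(lock.read(), 123);
}
```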

