
Optimize memcpy, memmove and memset by nbdd0121 · Pull Request #405 · rust-lang/...

source link: https://github.com/rust-lang/compiler-builtins/pull/405


Optimize memcpy, memmove and memset #405

Conversation

nbdd0121 commented on Feb 9

Partially addresses #339. (memcmp is not implemented in this PR)

This PR replaces the current simple implementations of memcpy, memmove and memset with more sophisticated ones.
The new code first aligns dest to a machine-word boundary using a bytewise copy.
If dest and src are co-aligned (i.e. the dest and src addresses are congruent modulo the machine word size), it proceeds to copy as many bytes as possible with machine-word-wide copies; the remaining bytes are copied bytewise.
If dest and src are not co-aligned, the misaligned copy is performed by reading two adjacent machine words and assembling them with logical shifts before writing to dest.
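As a rough sketch of the strategy (illustrative only, assuming a little-endian target; the function and constant names here are not the PR's exact code, and the real implementation also gates the word-wise path behind a size threshold):

```rust
use core::mem::size_of;

const WORD_SIZE: usize = size_of::<usize>();
const WORD_MASK: usize = WORD_SIZE - 1;

pub unsafe fn copy_forward_sketch(mut dest: *mut u8, mut src: *const u8, mut n: usize) {
    // Step 1: bytewise copy until `dest` is word-aligned.
    while (dest as usize) & WORD_MASK != 0 && n > 0 {
        *dest = *src;
        dest = dest.add(1);
        src = src.add(1);
        n -= 1;
    }

    let n_words = n / WORD_SIZE;
    if n_words > 0 {
        if (src as usize) & WORD_MASK == 0 {
            // Step 2a: dest and src are co-aligned, so copy whole words.
            let d = dest as *mut usize;
            let s = src as *const usize;
            for i in 0..n_words {
                *d.add(i) = *s.add(i);
            }
        } else {
            // Step 2b: misaligned. Read two adjacent aligned source words and
            // stitch them together with shifts before each aligned store. The
            // first and last aligned reads may touch a few bytes outside
            // [src, src + n); that is the out-of-bounds access discussed below.
            let shift = ((src as usize) & WORD_MASK) * 8;
            let mut s = ((src as usize) & !WORD_MASK) as *const usize;
            let d = dest as *mut usize;
            let mut prev = *s;
            for i in 0..n_words {
                s = s.add(1);
                let next = *s;
                // Little-endian: low bytes come from `prev`, high bytes from `next`.
                *d.add(i) = (prev >> shift) | (next << (WORD_SIZE * 8 - shift));
                prev = next;
            }
        }
        dest = dest.add(n_words * WORD_SIZE);
        src = src.add(n_words * WORD_SIZE);
        n &= WORD_MASK;
    }

    // Step 3: bytewise copy of the remaining tail.
    while n > 0 {
        *dest = *src;
        dest = dest.add(1);
        src = src.add(1);
        n -= 1;
    }
}
```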

The existing implementation can be pretty fast on platforms with SIMD and vectorization, but falls short on those without. The code in this PR is carefully written so that it is fast on the latter while still being vectorizable on the former.

Here are some performance numbers:

[Benchmark table: Rust Builtin vs New Implementation vs Old Implementation vs New with no-vectorize-loops vs Old with no-vectorize-loops; the measured values were not preserved in this copy.]

All the numbers above were measured on my Coffee Lake laptop, with the no-asm feature on.

The numbers show that if vectorization is enabled and supported on the platform, memcpy and memset performance of the old and new implementations is similar (memmove is much faster; it seems the compiler has a hard time vectorizing the backward copy). If there is no vectorization or the platform does not support misaligned access, the new implementation gives a very significant performance gain (>4x with the numbers above; on in-order scalar cores you'd expect a 3-8x improvement depending on whether the access is aligned and on the width of usize).

I inspected the generated assembly for RISC-V and it is pretty close in quality to my hand-optimised assembly.

nbdd0121 changed the title from "Add test cases for memcpy, memmove and memset for different alignment" to "Optimize memcpy, memmove and memset" on Feb 9

nbdd0121 commented on Feb 9

Oh, one thing to note is that the misaligned code currently makes some accesses that might be UB. For example, if we are trying to copy from 0x10007-0x10017, the code will access the entire machine word containing the byte 0x10007 (thus also 0x10000-0x10006) and the machine word containing 0x10010-0x10016 (thus also 0x10017), which are outside the bounds supplied to memcpy/memmove.

I don't think this UB is ever exploitable by the compiler, though, and I am not aware of any architecture where such accesses could cause trouble.

nbdd0121 commented on Feb 9

Oh, big-endian ISAs facepalm, will fix soon.

bjorn3 commented on Feb 9

I don't think this UB is ever exploitable by the compiler, though, and I am not aware of any architecture where such accesses could cause trouble.

On the M1 I believe one of the exploit protection mechanisms may cause such out-of-bounds accesses to crash the program under certain circumstances. I forgot the name of the mechanism though, so I can't check if I remember it correctly.

nbdd0121 commented on Feb 9

On the M1 I believe one of the exploit protection mechanisms may cause such out-of-bounds accesses to crash the program under certain circumstances. I forgot the name of the mechanism though, so I can't check if I remember it correctly.

Hmm, interesting. The platforms with this kind of protection feature that I know of usually track allocations and deallocations, and those memory regions are machine-word aligned. The OOB access here is strictly limited to accessing a machine word that contains an accessible byte.

nbdd0121 commented on Feb 9

Anyway I could force at least 7 bytes to be copied bytewise before and after the unaligned copy, and doing so would be UB-free. But that would hurt other platforms where this specific kind of OOB access is permitted.

alexcrichton left a comment

Has this been tested on architectures that we primarily would like to optimize for? I think it would be good to have a candidate set of target architectures in mind to compare with LLVM's auto-vectorization and alternative implementations such as unaligned loads/stores.

src/mem/impls.rs

    while i < n {
        *dest.add(i) = *src.add(i);
        i += 1;

unsafe fn copy_forward_aligned_words(dest: *mut u8, src: *const u8, n: usize) {

Since this function is taking aligned words, could this perhaps take usize pointers? Additionally should this take u64 pointers perhaps? Or maybe even u128?

nbdd0121 on Feb 10

It could, but I use u8 pointers to keep the signature identical to the other functions.

I thought about using u64 or u128 here, but it turns out to be bad for a few reasons:

  • Many platforms, especially those without vectorization, can load at most one usize per instruction.
  • It requires more alignment than usize, so more bytes have to be copied with the bytewise copy.
  • It makes the misaligned scenario worse: shifting and reassembling two usizes actually generates worse code than one usize.

However, one optimisation that I haven't included in this PR is to unroll the loop in the aligned scenario. If n is a multiple of 8 machine words, for example, then we can have an 8x-unrolled loop followed by a loop for the remainder. On clang this could be done easily with #pragma unroll, but I don't think that is currently possible in Rust, so we might have to do it manually.
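For illustration, a manual 8x unroll of the aligned word copy might look like the following (a hypothetical sketch; this optimisation is not part of the PR):

```rust
unsafe fn copy_aligned_words_unrolled(mut dest: *mut usize, mut src: *const usize, n_words: usize) {
    // Main loop: eight word copies per iteration to amortise the loop overhead.
    for _ in 0..n_words / 8 {
        *dest.add(0) = *src.add(0);
        *dest.add(1) = *src.add(1);
        *dest.add(2) = *src.add(2);
        *dest.add(3) = *src.add(3);
        *dest.add(4) = *src.add(4);
        *dest.add(5) = *src.add(5);
        *dest.add(6) = *src.add(6);
        *dest.add(7) = *src.add(7);
        dest = dest.add(8);
        src = src.add(8);
    }
    // Remainder loop for the last n_words % 8 words.
    for i in 0..n_words % 8 {
        *dest.add(i) = *src.add(i);
    }
}
```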

I feel like taking usize here would make the intent clearer where it's copying based on each word. Perhaps there could be a typedef in the module for the copy size? Some platforms may have better performance on some types than others, so I could see it being conditional as well.

For unrolling I would expect LLVM to do this automatically if supported.
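The suggested alias might look roughly like this (a hypothetical sketch, not code from the PR; the right unit per platform would have to be determined by benchmarking):

```rust
// Default copy unit; a target could override this with a cfg'd alias if a
// different width benchmarks better there.
type CopyUnit = usize;

unsafe fn copy_forward_aligned_units(dest: *mut CopyUnit, src: *const CopyUnit, n_units: usize) {
    for i in 0..n_units {
        *dest.add(i) = *src.add(i);
    }
}
```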

src/mem/impls.rs

    if likely(n >= WORD_COPY_THRESHOLD) {
        // Align dest
        // Because of n >= 2 * WORD_SIZE, dst_misalignment < n
        let dest_misalignment = (dest as usize).wrapping_neg() & WORD_MASK;

Would it be possible to use the align_offset method here to avoid bit-tricks?

nbdd0121 on Feb 10

From align_offset's doc:

If it is not possible to align the pointer, the implementation returns usize::MAX. It is permissible for the implementation to always return usize::MAX. Only your algorithm's performance can depend on getting a usable offset here, not its correctness.

So that rules out the use of align_offset. Also note that this is the very hot path, and I would prefer a simple bit trick rather than relying on LLVM to optimise away a very complex function.
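For a concrete picture of what that bit trick computes (a worked example assuming an 8-byte machine word; not code from the PR):

```rust
fn main() {
    const WORD_MASK: usize = 7; // assuming an 8-byte machine word
    let dest = 0x1003usize;
    // Number of bytes to copy bytewise before `dest` reaches the next word boundary.
    let misalignment = dest.wrapping_neg() & WORD_MASK;
    assert_eq!(misalignment, 5); // 0x1003 + 5 == 0x1008, the next aligned address
    println!("{} bytewise copies needed before the word-wise copy", misalignment);
}
```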

Could you try benchmarking to see if there is overhead?

nbdd0121 on Feb 17

It's not about performance; it simply can't be relied on for correctness:

It is permissible for the implementation to always return usize::MAX.

    if likely(src_misalignment == 0) {
        copy_forward_aligned_words(dest, src, n_words);
    } else {
        copy_forward_misaligned_words(dest, src, n_words);

Out of curiosity, have you tested simply copying using ptr::read_unaligned and ptr::write_unaligned? That way alignment wouldn't be an issue, but I don't know the performance impact that would have on various platforms.

nbdd0121 on Feb 10

Well, first of all, ptr::read_unaligned and ptr::write_unaligned call ptr::copy_nonoverlapping, which translates to memcpy, so I would like to avoid the possibility of recursion if LLVM doesn't optimise them away.

Secondly, ptr::read_unaligned has really poor performance. On platforms without misaligned memory access support it will translate to size_of::<usize>() individual byte loads.

This branch is necessary because we don't want to bear the burden of all the shifts and checks necessary for misaligned loads if dest and src are perfectly co-aligned.

Have you tried this out and seen infinite recursion? Have you tried it out and seen if it's slower?

nbdd0121 on Feb 17

There'll be an infinite recursion if compiled in debug mode.

On architectures that do not support misaligned loads (so most ISAs other than ARMv8 and x86-64), performance is much worse because it generates 8 byte loads and 16 bit ops rather than 1 word load and 3 bit ops.

src/mem/impls.rs

    while dest < dest_end {
        *dest = *src;
        dest = dest.add(1);
        src = src.add(1);

In general LLVM has a bunch of heuristics for auto-vectorizing loops like these, and the auto-vectorized code is probably faster as well.

LLVM doesn't auto-vectorize on all platforms, though. What I'm worried about is that this hurts performance on platforms that do auto-vectorize. Would it be possible to survey some platforms to see if the simple function implementations are enough with LLVM auto-vectorizing them?

nbdd0121 on Feb 10

You can check the benchmarks on x86-64. The code submitted in this PR is actually my second attempt; my first attempt ended up with poor performance because it does not vectorize very well.

The current design, which computes dest_end inside these helper functions, is a result of learning from that failed attempt. On platforms that vectorize, the code is simple enough for LLVM to recognise, so it is effectively treated like the while i < n style of loop and there is no regression. On platforms that don't vectorize, LLVM doesn't have to bookkeep the i variable (yes, LLVM couldn't optimise i away), which gives roughly a 20% performance improvement.
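For illustration, the two loop shapes being compared look roughly like this (a sketch, not the PR's exact code):

```rust
// Index-based form: on targets that don't vectorize, LLVM keeps bookkeeping `i`.
unsafe fn copy_bytes_indexed(dest: *mut u8, src: *const u8, n: usize) {
    let mut i = 0;
    while i < n {
        *dest.add(i) = *src.add(i);
        i += 1;
    }
}

// Pointer-based form with `dest_end` computed inside the helper: still simple
// enough for the vectorizer, but avoids the extra induction variable elsewhere.
unsafe fn copy_bytes_ptr(mut dest: *mut u8, mut src: *const u8, n: usize) {
    let dest_end = dest.add(n);
    while dest < dest_end {
        *dest = *src;
        dest = dest.add(1);
        src = src.add(1);
    }
}
```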

I don't really understand the purpose of benchmarking on x86_64? This file isn't even used on that platform and it seems like we wouldn't want to use it anyway because LLVM otherwise auto-vectorizes very well and probably outperforms what's in this file.

I'm interested in getting data on this for platforms other than x86_64 for those reasons.

nbdd0121 on Feb 17

That's why I turned off auto-vectorization, to mimic platforms without it. I did observe a very significant improvement on RISC-V, but I couldn't just run cargo bench on a no_std platform. Even on x86-64 the code still provides a significant performance improvement for backward copies, though.

nbdd0121 commented on Feb 10

Has this been tested on architectures that we primarily would like to optimize for? I think it would be good to have a candidate set of target architectures in mind to compare with LLVM's auto-vectorization and alternative implementations such as unaligned loads/stores.

I have tested it on RISC-V. The RISC-V platform that I tested it on is no_std so there isn't a bencher but it speeds up my entire program by 25%.

It could hurt platforms that have both vectorization and hardware misaligned load support, like x86-64 and ARM, but I am not sure what the best approach is there, e.g. how to detect whether vectorization is on and how to make use of unaligned load support.

BTW, I just realized the benches do not test the misaligned scenario. I'll add a benchmark soon.

nbdd0121 commented on Feb 10

Misaligned benchmarks:

[Misaligned benchmark table: Rust Builtin vs New Implementation vs Old Implementation vs New with no-vectorize-loops vs Old with no-vectorize-loops; the measured values were not preserved in this copy.]

As expected, the misaligned code path hurts x86-64: it is slower with vectorization on but faster with vectorization off. It is always faster for memmove, presumably because LLVM couldn't vectorize that.

alexcrichton commented on Feb 17

Indeed yeah I'm specifically worried where this makes existing platforms worse because LLVM otherwise does something well today. This whole PR should arguably be replaced with an improved LLVM backend for relevant targets, but that's naturally a very large undertaking so it seems fine to have this in the meantime while we wait for LLVM to improve.

I would prefer to not have a too-overly-complicated implementation since it seems likely to only get used on some platforms. Having a complicated implementation is also a risk because I don't think anything here is tested on CI since CI mostly does x86_64.

Basically I think that for this there should at least be performance numbers for ARM/AArch64 and investigation into what LLVM does today in terms of vectorization and performance of unaligned loads/stores. Additionally it would be great to ensure that everything here is tested on CI regardless of the platform if possible.

nbdd0121 commented on Feb 17

I would argue this isn't overly complicated. The concept is very simple and is what C libraries do already.

Benchmarking on ARM/AArch64 might not be very helpful because they have both vectorization and misaligned load/store support, but I would still expect a memmove performance improvement (I couldn't test it myself because I don't have an ARM dev environment set up). This PR is mainly intended to be a much better default for targets without those features (a quick test on godbolt suggests this includes armv7, powerpc, mips, riscv and more).

As for the CI, it should be noted a few platforms are already being tested (including big-endian ones like PowerPC).


adamgemmell commented 27 days ago

Hi, I've carried out some benchmarks on an aarch64-unknown-linux-gnu platform (Graviton 2, Neoverse-N1) to alleviate some concerns. I rebased the branch to something more recent before carrying these out.

(all numbers ns/iter. +/- ranges were never more than 1% of the result)

| Test Name | Result |
| --- | --- |
| memcpy_builtin_1048576 | 45562 |
| memcpy_builtin_1048576_misalign | 47097 |
| memcpy_builtin_1048576_offset | 47052 |
| memcpy_builtin_4096 | 108 |
| memcpy_builtin_4096_misalign | 111 |
| memcpy_builtin_4096_offset | 111 |
| memmove_builtin_1048576 | 42840 |
| memmove_builtin_1048576_misalign | 62671 |
| memmove_builtin_4096 | 109 |
| memmove_builtin_4096_misalign | 161 |
| memset_builtin_1048576 | 26245 |
| memset_builtin_1048576_offset | 26242 |
| memset_builtin_4096 | 105 |
| memset_builtin_4096_offset | 106 |

| Test Name | New | New (no-vec) | Old | Old (no-vec) |
| --- | --- | --- | --- | --- |
| memcpy_rust_1048576 | 47317 | 108806 | 46978 | 709112 |
| memcpy_rust_1048576_misalign | 78310 | 148440 | 56356 | 708612 |
| memcpy_rust_1048576_offset | 50758 | 109122 | 57212 | 708791 |
| memcpy_rust_4096 | 113 | 421 | 109 | 2766 |
| memcpy_rust_4096_misalign | 287 | 573 | 136 | 2768 |
| memcpy_rust_4096_offset | 121 | 426 | 136 | 2768 |
| memmove_rust_1048576 | 105688 | 109360 | 840238 | 630534 |
| memmove_rust_1048576_misalign | 151630 | 151128 | 840360 | 630725 |
| memmove_rust_4096 | 392 | 421 | 3288 | 2469 |
| memmove_rust_4096_misalign | 573 | 568 | 3288 | 2469 |
| memset_rust_1048576 | 26252 | 104967 | 26246 | 419879 |
| memset_rust_1048576_offset | 26257 | 104960 | 26247 | 419857 |
| memset_rust_4096 | 109 | 420 | 106 | 1658 |
| memset_rust_4096_offset | 114 | 422 | 106 | 1654 |

The short story is the results look similar to the ones observed above on x86_64.

  • memcpy and memset are largely the same, with nice improvements for non-vectorised code. Performance reaches that of the builtins in at least some cases.
  • memmove shows large improvements. Non-vectorised performance seems comparable at least for these tests.
  • Misaligned accesses have a performance regression, though it's smaller than the improvement in memmove.

nbdd0121 commented 27 days ago

I suppose I should conditionally select different versions of the code for the misaligned case. For architectures (e.g. x86/ARM) that have hardware misaligned access support, it might be better to just do misaligned loads instead of bit shifts.
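A hypothetical sketch of what such a target-conditional path could look like (illustrative names and cfg list, not necessarily what the PR ended up with):

```rust
use core::mem::size_of;

#[cfg(any(target_arch = "x86", target_arch = "x86_64", target_arch = "aarch64"))]
unsafe fn copy_forward_misaligned_words(dest: *mut u8, src: *const u8, n: usize) {
    // `dest` is word-aligned here but `src` is not; the hardware handles the
    // misaligned loads directly, so no shifting is needed. Note that inside
    // compiler-builtins, `read_unaligned` needs care because it can lower to a
    // memcpy call in debug builds, as discussed earlier in this thread.
    let n_words = n / size_of::<usize>();
    let dest = dest as *mut usize;
    let src = src as *const usize;
    for i in 0..n_words {
        *dest.add(i) = src.add(i).read_unaligned();
    }
}
// On other targets, the shift-and-stitch version discussed earlier would be used.
```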

nbdd0121 commented 27 days ago

@adamgemmell I have added a different code path for misaligned case for x86/x64/aarch64. Can you do another bench run? Thanks!

adamgemmell replied: Same machine, same branch base, same compiler (which I forgot to mention was 1.56.0-nightly (2faabf579 2021-07-27)). Master runs looked close to what I was getting previously.

| Test | With misaligned case | With MC (no-vec) |
| --- | --- | --- |
| memcpy_rust_1048576 | 46949 | 109190 |
| memcpy_rust_1048576_misalign | 55425 | 104213 |
| memcpy_rust_1048576_offset | 50890 | 109193 |
| memcpy_rust_4096 | 114 | 421 |
| memcpy_rust_4096_misalign | 144 | 398 |
| memcpy_rust_4096_offset | 123 | 427 |
| memmove_rust_1048576 | 104361 | 109690 |
| memmove_rust_1048576_misalign | 106661 | 105638 |
| memmove_rust_4096 | 393 | 421 |
| memmove_rust_4096_misalign | 411 | 415 |
| memset_rust_1048576 | 26254 | 52526 |
| memset_rust_1048576_offset | 26262 | 52510 |
| memset_rust_4096 | 110 | 220 |
| memset_rust_4096_offset | 114 | 222 |

Looks like a good change, really brings the misaligned numbers in line with the other tests.

There's a weird doubling of performance for the memset no-vec case on this machine. Not complaining but I kinda want to look more into it tomorrow. Couldn't replicate something similar on other machines I tried.

Amanieu commented 17 days ago

You can get even better performance with unaligned memory accesses by overlapping the accesses:

  • This is easy with memset: a 7-byte memset can be done with 2 4-byte stores, at [start] and [end - 4] (a sketch of this case follows the list).
  • Small memcpy can be done by reading the entire data into registers with overlapping loads and then writing back everything with overlapping stores. This also allows you to skip the direction check for small memcpy since the direction becomes irrelevant.
  • For larger memcpy you can perform the first copy as a large unaligned copy and the rest using aligned copies. The trick here to avoid overlap issues is to load the data for the first aligned copy before performing the store for the initial unaligned copy. At the end of the main aligned copy loop, do an unaligned copy for the last block of data.
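For instance, the memset case of the trick might be sketched like this (assuming the target tolerates unaligned 4-byte stores; names are illustrative, not taken from any of the implementations referenced):

```rust
use core::ptr;

// Set `n` bytes (4 <= n <= 8) with two possibly-overlapping 4-byte stores:
// one at [s, s + 4) and one at [s + n - 4, s + n).
unsafe fn memset_small(s: *mut u8, c: u8, n: usize) {
    debug_assert!((4..=8).contains(&n));
    let v = u32::from_ne_bytes([c; 4]);
    ptr::write_unaligned(s as *mut u32, v);
    ptr::write_unaligned(s.add(n - 4) as *mut u32, v);
}
```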

I recommend reading through these optimized implementations of memcpy and memset for AArch64 which use all of these tricks:

For platforms such as RISC-V without support for unaligned accesses, I'm surprised that copy_forward_misaligned_words is faster than a simple byte-by-byte copy. But if you've run the benchmarks then I guess it's fine.

nbdd0121 commented 17 days ago

You can get even better performance with unaligned memory accesses by overlapping the accesses:

  • This is easy with memset: a 7-byte memset can be done with 2 4-byte stores at [start] and [end - 4].

Small memset/memcpy/memmove isn't particularly a concern. Sure, there might be some improvement possible, but it is not as significant as the improvements for large memset/memcpy. Plus, if the size is known, the compiler will inline them anyway.

  • Small memcpy can be done by reading the entire data into registers with overlapping loads and then writing back everything with overlapping stores. This also allows you to skip the direction check for small memcpy since the direction becomes irrelevant.
  • For larger memcpy you can perform the first copy as a large unaligned copy and the rest using aligned copies. The trick here to avoid overlap issues is to load the data for the first aligned copy before performing the store for the initial unaligned copy. At the end of the main aligned copy loop, do an unaligned copy for the last block of data.

I don't understand how overlapping is an issue for memcpy. Do you mean memmove? For the trick to work I believe the gap between src and dst must be small.

I recommend reading through these optimized implementations of memcpy and memset for AArch64 which use all of these tricks:

The tricks applied can slow down some processors while speeding up others. For example, the use of branches in small memset and memcpy will slow down processors with simple branch predictors because there are many data-dependent branches. Loops, on the other hand, are much easier to predict.

For platforms such as RISC-V without support for unaligned accesses, I'm surprised that copy_forward_misaligned_words is faster than a simple byte-by-byte copy. But if you've run the benchmarks then I guess it's fine.

The instruction count is reduced from 8 byte-sized loads + 8 byte-sized stores per word copied to 1 word-sized load + 1 word-sized store + 3 arithmetic ops per word copied. LLVM actually wouldn't even unroll the loop, so the loop overhead hurts the bytewise copy even further.

While some tricks you described are interesting, my main focus is on RISC-V which wouldn't benefit from those optimisations and might not even test those code paths. I am afraid it's outside my capacity to optimise for AArch64.

nbdd0121 commented 17 days ago

Failed at testcrate/tests/cmp.rs:34:9. I am not sure I touched anything related to that between the two most recent pushes. Might it be a change in a recent nightly?

nbdd0121 commented 17 days ago

Tried to run the workflow with the current HEAD of master on my fork. Confirmed that it fails: https://github.com/nbdd0121/compiler-builtins/actions/runs/1184124369.

Amanieu commented 16 days ago

I had a look at the powerpc failure and it seems to be an LLVM bug (I'm still working on a minimal reproduction). In the meantime I think it's fine to merge this PR.

Amanieu merged commit 461b5f8 into rust-lang:master 16 days ago

6 of 25 checks passed

Amanieu commented 16 days ago

Published in compiler-builtins 0.1.50.
