Github Simplify logical operations CFG by AngelicosPhosphoros · Pull Request #83...
source link: https://github.com/rust-lang/rust/pull/83663
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
This is basically same commit as e38e954 which was reverted later in 676953f
In both cases, this changes weren't benchmarked.
e38e954 leads to missed optimization described in this issue
676953f leads to missed optimization described in this issue
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @estebank (or someone else) soon.
If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.
Please see the contribution instructions for more information.
I would like to benchmark this changes.
This code like to fix the issue with failed SIMD (#83623)
Code:
pub struct Blueprint { pub fuel_tank_size: u32, pub payload: u32, pub wheel_diameter: u32, pub wheel_width: u32, pub storage: u32, } impl PartialEq for Blueprint{ fn eq(&self, other: &Self)->bool{ (self.fuel_tank_size == other.fuel_tank_size) && (self.payload == other.payload) && (self.wheel_diameter == other.wheel_diameter) && (self.wheel_width == other.wheel_width) && (self.storage == other.storage) } }
LLVM IR:
; ModuleID = 'eq_test.3a1fbbbh-cgu.0' source_filename = "eq_test.3a1fbbbh-cgu.0" target datalayout = "e-m:w-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128" target triple = "x86_64-pc-windows-msvc" %Blueprint = type { [0 x i32], i32, [0 x i32], i32, [0 x i32], i32, [0 x i32], i32, [0 x i32], i32, [0 x i32] } ; <eq_test::Blueprint as core::cmp::PartialEq>::eq ; Function Attrs: norecurse nounwind readonly uwtable willreturn define zeroext i1 @"_ZN59_$LT$eq_test..Blueprint$u20$as$u20$core..cmp..PartialEq$GT$2eq17he4e85086691c6c20E"(%Blueprint* noalias nocapture readonly align 4 dereferenceable(20) %self, %Blueprint* noalias nocapture readonly align 4 dereferenceable(20) %other) unnamed_addr #0 { start: %0 = bitcast %Blueprint* %self to <4 x i32>* %1 = load <4 x i32>, <4 x i32>* %0, align 4 %2 = bitcast %Blueprint* %other to <4 x i32>* %3 = load <4 x i32>, <4 x i32>* %2, align 4 %4 = icmp eq <4 x i32> %1, %3 %5 = getelementptr inbounds %Blueprint, %Blueprint* %self, i64 0, i32 9 %_19 = load i32, i32* %5, align 4 %6 = getelementptr inbounds %Blueprint, %Blueprint* %other, i64 0, i32 9 %_20 = load i32, i32* %6, align 4 %_18 = icmp eq i32 %_19, %_20 %7 = call i1 @llvm.vector.reduce.and.v4i1(<4 x i1> %4) %8 = and i1 %7, %_18 ret i1 %8 } ; Function Attrs: nofree nosync nounwind readnone willreturn declare i1 @llvm.vector.reduce.and.v4i1(<4 x i1>) #1 attributes #0 = { norecurse nounwind readonly uwtable willreturn "target-cpu"="znver1" } attributes #1 = { nofree nosync nounwind readnone willreturn } !llvm.module.flags = !{!0} !0 = !{i32 7, !"PIC Level", i32 2}
And ASM:
_ZN59_$LT$eq_test..Blueprint$u20$as$u20$core..cmp..PartialEq$GT$2eq17he4e85086691c6c20E: vmovdqu (%rcx), %xmm0 movl 16(%rcx), %eax vpcmpeqd (%rdx), %xmm0, %xmm0 cmpl 16(%rdx), %eax vmovmskps %xmm0, %eax sete %cl cmpb $15, %al sete %al andb %cl, %al retq
Awaiting bors try build completion.
@rustbot label: +S-waiting-on-perf
Before this lands an addition of a codegen test will be necessary so that we don't regress this case again.
Try build successful - checks-actions
Build commit: 6b7dccd (6b7dccdb4b4fbb645bdfc176bf01a0f0b9941bb5
)
Finished benchmarking try commit (6b7dccd): comparison url.
Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup-
to bors.
Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.
@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf
Benchmark result shows slight improvement in compilation time for MIR and LLVM passes.
It's too close to the noise to determine runtime changes.
I don't feel comfortable reviewing this. @nagisa would you mind taking this?
r? @nagisa
Benchmark result shows slight improvement in compilation time for MIR and LLVM passes.
It's too close to the noise to determine runtime changes.
Should we merge this and should I continue to work on this then?
@nagisa
Hacky code from eq2
case (#83623) and #[derive(PartialEq)]
cannot be vectorized when comparing arrays of structure on current compiler.
godbolt link
Code from this PR handles vectorizing of slices of such struct fine:
codeI think it is worthwhile to pursue this based on the assembly & IR that gets produced once this is applied. I would be surprised if the compiler exercised the kinds of code paths we're seeing here to produce good motivation for or against merging this. Microbenchmarks would be a significantly better tool in this situation, I suspect.
I'm somewhat surprised that #62993 was sufficient motivation to revert this in the first place, but from what I can tell it might have happened because there was no information suggesting that the code derived by PartialEq
becomes significantly worse without it ("reduces number of basic blocks" seems "meh" enough of an improvement if it regresses an actual code example)
I wrote a benchmark for more cases.
I run it on toolchains from HEAD~1 and HEAD for my working branch, and used target-cpu=znver1
.
So, results:
Legend:
u32s struct:
5 us32 fields struct
u32s struct/Self - comparison with same cloned vec
u32s struct/Random field - comparison with elements differ only in one random field
u32s struct/First field - comparison with elements differ only in first field
u32s struct/Last field - comparison with elements differ only in last field
u32s struct/Every Second - comparison with vec which every even element is equal and every odd is inequal
String struct:
Struct with 4 u32 fields and one string field. String fields generated as random char from A to Z.
cmp/u32s struct/Self - with same cloned Vec
cmp/String struct/u32 field - with random u32 field
cmp/String struct/String field - with String field.
HEAD~1 Running unittests (target\release\deps\bench_u32s-967e14898f5691eb.exe)
Gnuplot not found, using plotters backend
cmp/u32s struct/Self time: [6.7752 us 6.7880 us 6.8033 us]
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) low mild
2 (2.00%) high mild
cmp/u32s struct/Random field
time: [15.176 us 15.204 us 15.234 us]
Found 10 outliers among 100 measurements (10.00%)
9 (9.00%) high mild
1 (1.00%) high severe
cmp/u32s struct/First field
time: [2.6016 us 2.6084 us 2.6149 us]
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) high mild
1 (1.00%) high severe
cmp/u32s struct/Last field
time: [6.7269 us 6.7357 us 6.7459 us]
Found 7 outliers among 100 measurements (7.00%)
4 (4.00%) high mild
3 (3.00%) high severe
cmp/u32s struct/Every Second
time: [12.925 us 12.940 us 12.957 us]
Found 18 outliers among 100 measurements (18.00%)
18 (18.00%) high severe
Running unittests (target\release\deps\bench_with_strings-77b5d3b11469681f.exe)
Gnuplot not found, using plotters backend
cmp/String struct/Self time: [65.674 us 65.750 us 65.823 us]
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
cmp/String struct/u32 field
time: [17.712 us 17.733 us 17.756 us]
Found 5 outliers among 100 measurements (5.00%)
5 (5.00%) high mild
cmp/String struct/String field
time: [10.087 us 10.102 us 10.116 us]
Found 8 outliers among 100 measurements (8.00%)
3 (3.00%) low mild
5 (5.00%) high mild
With PR commit
Running unittests (target\release\deps\bench_u32s-967e14898f5691eb.exe)
Gnuplot not found, using plotters backend
cmp/u32s struct/Self time: [7.1867 us 7.2374 us 7.2920 us]
change: [+6.1698% +7.0979% +7.9865%] (p = 0.00 < 0.05)
Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
5 (5.00%) high mild
cmp/u32s struct/Random field
time: [6.7970 us 6.8126 us 6.8319 us]
change: [-54.664% -53.917% -53.142%] (p = 0.00 < 0.05)
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
5 (5.00%) high mild
1 (1.00%) high severe
cmp/u32s struct/First field
time: [6.7780 us 6.7868 us 6.7966 us]
change: [+158.65% +159.32% +160.07%] (p = 0.00 < 0.05)
Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) low mild
1 (1.00%) high severe
cmp/u32s struct/Last field
time: [6.7810 us 6.7919 us 6.8034 us]
change: [+0.3407% +0.6019% +0.9169%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) low mild
1 (1.00%) high mild
1 (1.00%) high severe
cmp/u32s struct/Every Second
time: [6.7663 us 6.7758 us 6.7861 us]
change: [-48.465% -48.193% -47.936%] (p = 0.00 < 0.05)
Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) low mild
2 (2.00%) high mild
Running unittests (target\release\deps\bench_with_strings-77b5d3b11469681f.exe)
Gnuplot not found, using plotters backend
cmp/String struct/Self time: [59.658 us 59.743 us 59.833 us]
change: [-9.1402% -8.9131% -8.6605%] (p = 0.00 < 0.05)
Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
3 (3.00%) low severe
3 (3.00%) low mild
5 (5.00%) high mild
1 (1.00%) high severe
cmp/String struct/u32 field
time: [3.9043 us 3.9140 us 3.9253 us]
change: [-77.864% -77.797% -77.726%] (p = 0.00 < 0.05)
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
6 (6.00%) high mild
1 (1.00%) high severe
cmp/String struct/String field
time: [6.9662 us 6.9746 us 6.9833 us]
change: [-31.134% -30.958% -30.788%] (p = 0.00 < 0.05)
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
4 (4.00%) high mild
2 (2.00%) high severe
So, it looks like that code generated by trunk version is faster when it's execution finishes on first branch and branch predictor correctly predicts this. When branch predictor fails to do this, it is slower 7 times.
SIMD version shows much less variance and it is significantly slower in First field case and faster in worst Random field.
First field case slowed down by 4.1784 microseconds and Random field sped up by 8.3914
I think, it is a clear win.
For anyone who want to look benchmarks code or run it on own machine, I add zipped project.
bench_changes.zip
@nagisa I would add codegen tests now and I wonder if I need to squash my commits myself or it would be done automatically.
force-pushed the
AngelicosPhosphoros:simplify_binary_and_to_get_better_asm
@@ -0,0 +1,45 @@
// This test checks that jumps generated by logical operators can be optimized away
AngelicosPhosphoros 15 days ago
Author
Contributor
I don't know how to write codegen tests so I made this by example.
I can miss something.
This comment has been hidden.
@@ -0,0 +1,45 @@
// This test checks that jumps generated by logical operators can be optimized away
// compile-flags: -O3
// This test checks that jumps generated by logical operators can be optimized away
// compile-flags: -O3
// only-64bit
nagisa 15 days ago
Contributor
This test might fail for some of the 64-bit targets, and may need to be limited to just x86_64
, but lets see if this works out, first.
Can you help me with building LLVM please?
Missed headers>grep 'No such file or directory' build/x86_64-unknown-linux-gnu/llvm/buil
d/CMakeFiles/CMakeError.log
CheckIncludeFile.c:1:10: fatal error: malloc/malloc.h: No such file or directory
CheckIncludeFile.c:1:10: fatal error: valgrind/valgrind.h: No such file or directory
CheckIncludeFile.c:1:10: fatal error: mach/mach.h: No such file or directory
CheckIncludeFile.c:1:10: fatal error: histedit.h: No such file or directory
CheckIncludeFile.c:1:10: fatal error: CrashReporterClient.h: No such file or directory
CheckSymbolExists.c:2:10: fatal error: malloc_np.h: No such file or directory
CheckSymbolExists.c:2:10: fatal error: malloc/malloc.h: No such file or directory
CheckSymbolExists.c:2:10: fatal error: os/signpost.h: No such file or directory
I don't normally have to do anything special, but best practice (for me) in this situation would be:
- Completely delete your
build
directory - Copy the current
config.toml.example
toconfig.toml
and apply the following:
[llvm] ... optimize = false # required for `llvm-cov show --debug` ... [build] ... profiler = true
I also generally enable [rust]
debug = true
but that's probably not necessary.
Then:
$ ./x.py test src/test/run-make-fulldeps/coverage --bless
should work.
force-pushed the
AngelicosPhosphoros:simplify_binary_and_to_get_better_asm
I finally managed to make it compile.
I looks like that I got expected results:
It looks like the only problems are basic block numbers (bb) changed. Should be OK to re-bless if that's the case.
$ ./x.py test src/test/run-make-fulldeps/coverage --bless
As long as the only changed files are spanview files, you're probably good to go (adding the changed expected results to your PR).
If it changes any of the coverage-reports/* expected results, we can take a closer look, but I don't see that in the above failure.
JYI to make it work I changed this:
[llvm]
...
optimize = false
...
[build]
...
profiler = true
...
[target.x86_64-unknown-linux-gnu]
cc = "clang-11"
cxx = "clang++-11"
So, I failed to build LLVM because it used distribution provided C/C++ compiler (gcc) at first.
Your words about optimize = false
are true too because using optimize = true
deletes coverage reports.
Also, it was very incomfortable to build LLVM on my machine because I needed to limit parallelism since my build constantly killed by OOM on 55 gb of used memory (especially, during linking). Also, rustc compiles very long time (4 hours) if it uses LLVM without optimizations.
Are there any thoughts to move generation of such changes to some build agents? I think, we cut some possible contributors from work because they just haven't such good machines (I have 64 gb RAM on my PC but I don't know anyone other who has).
It woulds be nice if we can call bors
to generate commit with such report.
Also, it was very incomfortable to build LLVM on my machine because I needed to limit parallelism since my build constantly killed by OOM on 55 gb of used memory (especially, during linking). Also, rustc compiles very long time (4 hours) if it uses LLVM without optimizations.
Right. Linking will take a ton of memory because debug info is just that huge. There's definitely some improvements to be had around the coverage tests – having to build a unoptimized LLVM any time you need to bless a test suite potentially prevents contributors from making changes at all (what if they only have a machine with, say, 8GB of memory?)
@bors r+
Commit 87264fa has been approved by nagisa
force-pushed the
AngelicosPhosphoros:simplify_binary_and_to_get_better_asm
@nagisa
Just force-pushed with little change:
Removed #[no_mangle]
here because it's not valid attribute on impl.
// && chains should not prevent SIMD optimizations for primitives
#[no_mangle]
impl PartialEq for Blueprint{
having to build a unoptimized LLVM any time you need to bless a test suite potentially prevents contributors from making changes at all (what if they only have a machine with, say, 8GB of memory?)
I agree. I only just made the connection in my head. I'm going to push a PR later this week that removes spanview files and removes the debug output.
The spanview files are no longer as valuable as they used to be (in tests) so that doesn't hurt much.
It's frustrating that I can't call llvm-cov show --debug
to show the counters without setting optimize = false
for the whole LLVM build. There must be a way to be more granular than that but I'm not aware of it.
But that's not your problem. I think it's worth removing.
cc: @wesleywiser @tmandry
@bors r+
It's frustrating that I can't call llvm-cov show --debug to show the counters without setting optimize = false
In my experience RelWithDebInfo
builds still retain --debug
output at least in llc
and opt
, so that may be an option to pursue. Doesn't solve the issue with linkers though as debug info is a major contributor to the linkers' memory use.
Commit 4464cc2 has been approved by nagisa
I assume #83663 will land before #83755, in which case, I'll need to rebase my changes.
But if for some reason #83755 lands first, you can just wipe out the spanview directory:
src/test/run-make-fulldeps/coverage-spanview/
You shouldn't need to re-bless since (AFAICT) only spanview files were changed in this PR. But if you ever do need to rebless, you no longer need to build your own LLVM (once my PR lands). The LLVM debug options are no longer required.
Thanks!
FYI usually debug flags in LLVM are controlled by assertions = true
. It shouldn't be necessary to use a debug build of LLVM.
Test successful - checks-actions
Approved by: nagisa
Pushing d1065e6 to master...
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK