Github Simplify logical operations CFG by AngelicosPhosphoros · Pull Request #83... - JOYK Joy of Geek, Geek News, Link all geek

This is basically same commit as e38e954 which was reverted later in 676953f
In both cases, this changes weren't benchmarked.
e38e954 leads to missed optimization described in this issue
676953f leads to missed optimization described in this issue

Copy link

Collaborator

rust-highfive commented 16 days ago

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @estebank (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

Copy link

Contributor

Author

AngelicosPhosphoros commented 16 days ago

I would like to benchmark this changes.

This code like to fix the issue with failed SIMD (#83623)

Code:

pub struct Blueprint {
    pub fuel_tank_size: u32,
    pub payload: u32,
    pub wheel_diameter: u32,
    pub wheel_width: u32,
    pub storage: u32,
}
impl PartialEq for Blueprint{
   fn eq(&self, other: &Self)->bool{
      (self.fuel_tank_size == other.fuel_tank_size)
           && (self.payload == other.payload)
           && (self.wheel_diameter == other.wheel_diameter)
           && (self.wheel_width == other.wheel_width)
           && (self.storage == other.storage)
   }
}

LLVM IR:

; ModuleID = 'eq_test.3a1fbbbh-cgu.0'
source_filename = "eq_test.3a1fbbbh-cgu.0"
target datalayout = "e-m:w-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-windows-msvc"

%Blueprint = type { [0 x i32], i32, [0 x i32], i32, [0 x i32], i32, [0 x i32], i32, [0 x i32], i32, [0 x i32] }

; <eq_test::Blueprint as core::cmp::PartialEq>::eq
; Function Attrs: norecurse nounwind readonly uwtable willreturn
define zeroext i1 @"_ZN59_$LT$eq_test..Blueprint$u20$as$u20$core..cmp..PartialEq$GT$2eq17he4e85086691c6c20E"(%Blueprint* noalias nocapture readonly align 4 dereferenceable(20) %self, %Blueprint* noalias nocapture readonly align 4 dereferenceable(20) %other) unnamed_addr #0 {
start:
  %0 = bitcast %Blueprint* %self to <4 x i32>*
  %1 = load <4 x i32>, <4 x i32>* %0, align 4
  %2 = bitcast %Blueprint* %other to <4 x i32>*
  %3 = load <4 x i32>, <4 x i32>* %2, align 4
  %4 = icmp eq <4 x i32> %1, %3
  %5 = getelementptr inbounds %Blueprint, %Blueprint* %self, i64 0, i32 9
  %_19 = load i32, i32* %5, align 4
  %6 = getelementptr inbounds %Blueprint, %Blueprint* %other, i64 0, i32 9
  %_20 = load i32, i32* %6, align 4
  %_18 = icmp eq i32 %_19, %_20
  %7 = call i1 @llvm.vector.reduce.and.v4i1(<4 x i1> %4)
  %8 = and i1 %7, %_18
  ret i1 %8
}

; Function Attrs: nofree nosync nounwind readnone willreturn
declare i1 @llvm.vector.reduce.and.v4i1(<4 x i1>) #1

attributes #0 = { norecurse nounwind readonly uwtable willreturn "target-cpu"="znver1" }
attributes #1 = { nofree nosync nounwind readnone willreturn }

!llvm.module.flags = !{!0}

!0 = !{i32 7, !"PIC Level", i32 2}

And ASM:

_ZN59_$LT$eq_test..Blueprint$u20$as$u20$core..cmp..PartialEq$GT$2eq17he4e85086691c6c20E:
	vmovdqu	(%rcx), %xmm0
	movl	16(%rcx), %eax
	vpcmpeqd	(%rdx), %xmm0, %xmm0
	cmpl	16(%rdx), %eax
	vmovmskps	%xmm0, %eax
	sete	%cl
	cmpb	$15, %al
	sete	%al
	andb	%cl, %al
	retq

Copy link

Contributor

bjorn3 commented 16 days ago

Copy link

Collaborator

rust-timer commented 16 days ago

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

Copy link

Contributor

bors commented 16 days ago

Trying commit c343862 with merge 6b7dccd...

Copy link

Contributor

nagisa commented 16 days ago

Before this lands an addition of a codegen test will be necessary so that we don't regress this case again.

Copy link

Contributor

bors commented 16 days ago

Try build successful - checks-actions
Build commit: 6b7dccd (6b7dccdb4b4fbb645bdfc176bf01a0f0b9941bb5)

Copy link

Collaborator

rust-timer commented 16 days ago

Copy link

Collaborator

rust-timer commented 16 days ago

Finished benchmarking try commit (6b7dccd): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf

Copy link

Contributor

mati865 commented 16 days ago

Benchmark result shows slight improvement in compilation time for MIR and LLVM passes.
It's too close to the noise to determine runtime changes.

Copy link

Contributor

estebank commented 15 days ago

I don't feel comfortable reviewing this. @nagisa would you mind taking this?

Copy link

Contributor

nagisa commented 15 days ago

r? @nagisa

Copy link

Contributor

Author

AngelicosPhosphoros commented 15 days ago

Benchmark result shows slight improvement in compilation time for MIR and LLVM passes.
It's too close to the noise to determine runtime changes.

Should we merge this and should I continue to work on this then?
@nagisa

Copy link

Contributor

Author

AngelicosPhosphoros commented 15 days ago

Hacky code from eq2 case (#83623) and #[derive(PartialEq)] cannot be vectorized when comparing arrays of structure on current compiler.
godbolt link

Code from this PR handles vectorizing of slices of such struct fine:

code

Copy link

Contributor

nagisa commented 15 days ago

I think it is worthwhile to pursue this based on the assembly & IR that gets produced once this is applied. I would be surprised if the compiler exercised the kinds of code paths we're seeing here to produce good motivation for or against merging this. Microbenchmarks would be a significantly better tool in this situation, I suspect.

I'm somewhat surprised that #62993 was sufficient motivation to revert this in the first place, but from what I can tell it might have happened because there was no information suggesting that the code derived by PartialEq becomes significantly worse without it ("reduces number of basic blocks" seems "meh" enough of an improvement if it regresses an actual code example)

Copy link

Contributor

Author

AngelicosPhosphoros commented 15 days ago

I wrote a benchmark for more cases.
I run it on toolchains from HEAD~1 and HEAD for my working branch, and used target-cpu=znver1.

So, results:

Legend:

u32s struct:
   5 us32 fields struct
u32s struct/Self - comparison with same cloned vec
u32s struct/Random field - comparison with elements differ only in one random field
u32s struct/First field - comparison with elements differ only in first field
u32s struct/Last field - comparison with elements differ only in last field
u32s struct/Every Second - comparison with vec which every even element is equal and every odd is inequal

String struct:
  Struct with 4 u32 fields and one string field. String fields generated as random char from A to Z.
cmp/u32s struct/Self - with same cloned Vec
cmp/String struct/u32 field - with random u32 field
cmp/String struct/String field - with String field.

HEAD~1

     Running unittests (target\release\deps\bench_u32s-967e14898f5691eb.exe)

Gnuplot not found, using plotters backend
cmp/u32s struct/Self    time:   [6.7752 us 6.7880 us 6.8033 us]
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
cmp/u32s struct/Random field
                        time:   [15.176 us 15.204 us 15.234 us]
Found 10 outliers among 100 measurements (10.00%)
  9 (9.00%) high mild
  1 (1.00%) high severe
cmp/u32s struct/First field
                        time:   [2.6016 us 2.6084 us 2.6149 us]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
cmp/u32s struct/Last field
                        time:   [6.7269 us 6.7357 us 6.7459 us]
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe
cmp/u32s struct/Every Second
                        time:   [12.925 us 12.940 us 12.957 us]
Found 18 outliers among 100 measurements (18.00%)
  18 (18.00%) high severe

     Running unittests (target\release\deps\bench_with_strings-77b5d3b11469681f.exe)

Gnuplot not found, using plotters backend
cmp/String struct/Self  time:   [65.674 us 65.750 us 65.823 us]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
cmp/String struct/u32 field
                        time:   [17.712 us 17.733 us 17.756 us]
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild
cmp/String struct/String field
                        time:   [10.087 us 10.102 us 10.116 us]
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) low mild
  5 (5.00%) high mild

With PR commit

     Running unittests (target\release\deps\bench_u32s-967e14898f5691eb.exe)
Gnuplot not found, using plotters backend
cmp/u32s struct/Self    time:   [7.1867 us 7.2374 us 7.2920 us]
                        change: [+6.1698% +7.0979% +7.9865%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild
cmp/u32s struct/Random field
                        time:   [6.7970 us 6.8126 us 6.8319 us]
                        change: [-54.664% -53.917% -53.142%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe
cmp/u32s struct/First field
                        time:   [6.7780 us 6.7868 us 6.7966 us]
                        change: [+158.65% +159.32% +160.07%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high severe
cmp/u32s struct/Last field
                        time:   [6.7810 us 6.7919 us 6.8034 us]
                        change: [+0.3407% +0.6019% +0.9169%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe
cmp/u32s struct/Every Second
                        time:   [6.7663 us 6.7758 us 6.7861 us]
                        change: [-48.465% -48.193% -47.936%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild

     Running unittests (target\release\deps\bench_with_strings-77b5d3b11469681f.exe)

Gnuplot not found, using plotters backend
cmp/String struct/Self  time:   [59.658 us 59.743 us 59.833 us]
                        change: [-9.1402% -8.9131% -8.6605%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) low severe
  3 (3.00%) low mild
  5 (5.00%) high mild
  1 (1.00%) high severe
cmp/String struct/u32 field
                        time:   [3.9043 us 3.9140 us 3.9253 us]
                        change: [-77.864% -77.797% -77.726%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe
cmp/String struct/String field
                        time:   [6.9662 us 6.9746 us 6.9833 us]
                        change: [-31.134% -30.958% -30.788%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

So, it looks like that code generated by trunk version is faster when it's execution finishes on first branch and branch predictor correctly predicts this. When branch predictor fails to do this, it is slower 7 times.
SIMD version shows much less variance and it is significantly slower in First field case and faster in worst Random field.
First field case slowed down by 4.1784 microseconds and Random field sped up by 8.3914

I think, it is a clear win.

For anyone who want to look benchmarks code or run it on own machine, I add zipped project.
bench_changes.zip

@nagisa I would add codegen tests now and I wonder if I need to squash my commits myself or it would be done automatically.

Copy link

Contributor

nagisa commented 15 days ago

bors merges commits as is so you do need to cleanup the history in your PR manually.

…

AngelicosPhosphoros

force-pushed the

AngelicosPhosphoros:simplify_binary_and_to_get_better_asm

branch from c343862 to e9350d3

15 days ago

src/test/codegen/issue-83623-SIMD-PartialEq.rs

@@ -0,0 +1,45 @@

// This test checks that jumps generated by logical operators can be optimized away

AngelicosPhosphoros 15 days ago

Author

Contributor

I don't know how to write codegen tests so I made this by example.
I can miss something.

This comment has been hidden.

src/test/codegen/issue-83623-SIMD-PartialEq.rs

Outdated

@@ -0,0 +1,45 @@

// This test checks that jumps generated by logical operators can be optimized away

// compile-flags: -O3

nagisa 15 days ago

Contributor

-Copt-level=3 or perhaps just -O should work.

src/test/codegen/issue-83623-SIMD-PartialEq.rs

// This test checks that jumps generated by logical operators can be optimized away

// compile-flags: -O3

// only-64bit

nagisa 15 days ago

Contributor

This test might fail for some of the 64-bit targets, and may need to be limited to just x86_64, but lets see if this works out, first.

Copy link

Contributor

Author

AngelicosPhosphoros commented 14 days ago

Can you help me with building LLVM please?

Missed headers

>grep 'No such file or directory'  build/x86_64-unknown-linux-gnu/llvm/buil
d/CMakeFiles/CMakeError.log
CheckIncludeFile.c:1:10: fatal error: malloc/malloc.h: No such file or directory
CheckIncludeFile.c:1:10: fatal error: valgrind/valgrind.h: No such file or directory
CheckIncludeFile.c:1:10: fatal error: mach/mach.h: No such file or directory
CheckIncludeFile.c:1:10: fatal error: histedit.h: No such file or directory
CheckIncludeFile.c:1:10: fatal error: CrashReporterClient.h: No such file or directory
CheckSymbolExists.c:2:10: fatal error: malloc_np.h: No such file or directory
CheckSymbolExists.c:2:10: fatal error: malloc/malloc.h: No such file or directory
CheckSymbolExists.c:2:10: fatal error: os/signpost.h: No such file or directory

Copy link

Contributor

richkadel commented 14 days ago

I don't normally have to do anything special, but best practice (for me) in this situation would be:

Completely delete your build directory
Copy the current config.toml.example to config.toml and apply the following:

[llvm]
...
optimize = false # required for `llvm-cov show --debug`
...
[build]
...
profiler = true

I also generally enable [rust] debug = true but that's probably not necessary.

Then:

$ ./x.py test src/test/run-make-fulldeps/coverage --bless

should work.

AngelicosPhosphoros

force-pushed the

AngelicosPhosphoros:simplify_binary_and_to_get_better_asm

branch from c1f2928 to 87264fa

13 days ago

Copy link

Contributor

Author

AngelicosPhosphoros commented 13 days ago •

edited

@richkadel @nagisa

I finally managed to make it compile.
I looks like that I got expected results:

It looks like the only problems are basic block numbers (bb) changed. Should be OK to re-bless if that's the case.

$ ./x.py test src/test/run-make-fulldeps/coverage --bless

As long as the only changed files are spanview files, you're probably good to go (adding the changed expected results to your PR).

If it changes any of the coverage-reports/* expected results, we can take a closer look, but I don't see that in the above failure.

JYI to make it work I changed this:

[llvm]
...
optimize = false
...
[build]
...
profiler = true
...
[target.x86_64-unknown-linux-gnu]
cc = "clang-11"
cxx = "clang++-11"

So, I failed to build LLVM because it used distribution provided C/C++ compiler (gcc) at first.
Your words about optimize = false are true too because using optimize = true deletes coverage reports.

Also, it was very incomfortable to build LLVM on my machine because I needed to limit parallelism since my build constantly killed by OOM on 55 gb of used memory (especially, during linking). Also, rustc compiles very long time (4 hours) if it uses LLVM without optimizations.

Are there any thoughts to move generation of such changes to some build agents? I think, we cut some possible contributors from work because they just haven't such good machines (I have 64 gb RAM on my PC but I don't know anyone other who has).
It woulds be nice if we can call bors to generate commit with such report.

Copy link

Contributor

nagisa commented 13 days ago

Also, it was very incomfortable to build LLVM on my machine because I needed to limit parallelism since my build constantly killed by OOM on 55 gb of used memory (especially, during linking). Also, rustc compiles very long time (4 hours) if it uses LLVM without optimizations.

Right. Linking will take a ton of memory because debug info is just that huge. There's definitely some improvements to be had around the coverage tests – having to build a unoptimized LLVM any time you need to bless a test suite potentially prevents contributors from making changes at all (what if they only have a machine with, say, 8GB of memory?)

Copy link

Contributor

nagisa commented 13 days ago

@bors r+

Copy link

Contributor

bors commented 13 days ago

Commit 87264fa has been approved by nagisa

AngelicosPhosphoros

force-pushed the

AngelicosPhosphoros:simplify_binary_and_to_get_better_asm

branch from 87264fa to 4464cc2

13 days ago

Copy link

Contributor

Author

AngelicosPhosphoros commented 13 days ago

@nagisa
Just force-pushed with little change:
Removed #[no_mangle] here because it's not valid attribute on impl.

// && chains should not prevent SIMD optimizations for primitives
#[no_mangle]
impl PartialEq for Blueprint{

Copy link

Contributor

richkadel commented 13 days ago

having to build a unoptimized LLVM any time you need to bless a test suite potentially prevents contributors from making changes at all (what if they only have a machine with, say, 8GB of memory?)

I agree. I only just made the connection in my head. I'm going to push a PR later this week that removes spanview files and removes the debug output.

The spanview files are no longer as valuable as they used to be (in tests) so that doesn't hurt much.

It's frustrating that I can't call llvm-cov show --debug to show the counters without setting optimize = false for the whole LLVM build. There must be a way to be more granular than that but I'm not aware of it.

But that's not your problem. I think it's worth removing.

cc: @wesleywiser @tmandry

Copy link

Contributor

nagisa commented 13 days ago

@bors r+

It's frustrating that I can't call llvm-cov show --debug to show the counters without setting optimize = false

In my experience RelWithDebInfo builds still retain --debug output at least in llc and opt, so that may be an option to pursue. Doesn't solve the issue with linkers though as debug info is a major contributor to the linkers' memory use.

Copy link

Contributor

bors commented 13 days ago

Commit 4464cc2 has been approved by nagisa

Copy link

Contributor

richkadel commented 13 days ago

I assume #83663 will land before #83755, in which case, I'll need to rebase my changes.

But if for some reason #83755 lands first, you can just wipe out the spanview directory:

src/test/run-make-fulldeps/coverage-spanview/

You shouldn't need to re-bless since (AFAICT) only spanview files were changed in this PR. But if you ever do need to rebless, you no longer need to build your own LLVM (once my PR lands). The LLVM debug options are no longer required.

Thanks!

Copy link

Contributor

nikic commented 13 days ago •

edited

FYI usually debug flags in LLVM are controlled by assertions = true. It shouldn't be necessary to use a debug build of LLVM.

Copy link

Contributor

bors commented 13 days ago

Testing commit 4464cc2 with merge d1065e6...

Copy link

Contributor

bors commented 13 days ago

Test successful - checks-actions
Approved by: nagisa
Pushing d1065e6 to master...

Github Simplify logical operations CFG by AngelicosPhosphoros · Pull Request #83...

rust-highfive commented 16 days ago

AngelicosPhosphoros commented 16 days ago

bjorn3 commented 16 days ago

rust-timer commented 16 days ago

bors commented 16 days ago

nagisa commented 16 days ago

bors commented 16 days ago

rust-timer commented 16 days ago

rust-timer commented 16 days ago

mati865 commented 16 days ago

estebank commented 15 days ago

nagisa commented 15 days ago

AngelicosPhosphoros commented 15 days ago

AngelicosPhosphoros commented 15 days ago

nagisa commented 15 days ago

AngelicosPhosphoros commented 15 days ago

nagisa commented 15 days ago

This comment has been hidden.

AngelicosPhosphoros commented 14 days ago

richkadel commented 14 days ago

AngelicosPhosphoros commented 13 days ago •

nagisa commented 13 days ago

nagisa commented 13 days ago

bors commented 13 days ago

AngelicosPhosphoros commented 13 days ago

richkadel commented 13 days ago

nagisa commented 13 days ago

bors commented 13 days ago

richkadel commented 13 days ago

nikic commented 13 days ago •

bors commented 13 days ago

bors commented 13 days ago

Recommend

About Joyk