Always const-evaluate the GCD in `slice::align_to_offsets` by Sp00ph · Pull Requ...
source link: https://github.com/rust-lang/rust/pull/111296
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
Always const-evaluate the GCD in slice::align_to_offsets
#111296
Conversation
Member
Use an inline const
-block to force the compiler to calculate the GCD at compile time, even in debug mode. This shouldn't affect the behavior of the program at all, but it drastically cuts down on the number of instructions emitted with optimizations disabled.
With the current implementation, a single slice::align_to
instantiation (specifically <[u8]>::align_to::<u128>()
) generates 676 instructions (on x86-64). Forcing the GCD computation to be const cuts it down to 327 instructions, so just over 50% less. This is obviously not representative of actual runtime gains, but I still see it as a significant win as long as it doesn't degrade compile times.
Not having to worry about LLVM const-evaluating the GCD function also allows it to use the textbook recursive euclidean algorithm instead of a much more complicated iterative implementation with multiple unsafe
-blocks.
Collaborator
(rustbot has picked a reviewer for you, use r? to override) |
added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue.
labels
Collaborator
Hey! It looks like you've submitted a new PR for the library teams! If this PR contains changes to any Examples of
|
This comment has been minimized.
added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label
r=me with perf results OK cc @scottmcm, I think you've worked on the GCD code and may be interested. |
Contributor
Try build successful - checks-actions |
This comment has been minimized.
let gcd: usize = const { gcd(mem::size_of::<T>(), mem::size_of::<U>()) }; |
||
let ts: usize = mem::size_of::<U>() / gcd; |
||
let us: usize = mem::size_of::<T>() / gcd; |
Maybe push even a bit more into the const block?
let gcd: usize = const { gcd(mem::size_of::<T>(), mem::size_of::<U>()) }; | |
let ts: usize = mem::size_of::<U>() / gcd; | |
let us: usize = mem::size_of::<T>() / gcd; | |
let (ts, us) = const { | |
let gcd: usize = gcd(mem::size_of::<T>(), mem::size_of::<U>()); | |
(mem::size_of::<U>() / gcd, mem::size_of::<T>() / gcd) | |
}; |
Member
Author
Doing that results in more instructions being emitted in debug mode for some reason.
You can also compute ts * us
in const
too.
Member
Author
I'm not convinced we should move more operations into the const
-block if it increases the instruction count though.
Member
Doing that results in more instructions being emitted in debug mode for some reason.
Caveat: small changes in instruction count in a debug build is a subpar proxy for estimating performance of such a build. LLVM can still select instruction sequences that execute in fewer cycles, but require more instructions in a debug mode, so long as it is able to match on IR sequences that trigger those instruction sequences. A prime example of that is e.g. dividing by a constant that produces a strength reduction that involves a multiplication and a shift. These two operations are still way faster to execute compared to a full division. Something similar may be happening here.
More importantly, though, this is all LLVM version dependent. It might do better in versions V and V+1, worse in versions V+2 and V+3 and even better in V+4... In this case I'd say, as long as the difference in machine instruction count isn’t like 10-20% or more, then we should strive to ensure that rustc does as much work as possible before handing it off to LLVM’s consideration.
EDIT: The actually correct thing would be to actually put the generated code through llvm-mca
and have it compute the actual execution cost in cycles.
EDIT (part 2): Another thing that might be happening is that moving code into the const
block enables LLVM to use FastIsel
or somesuch mechanism which might have been not used before for whatever reason, which is just much faster in terms of generating code (even if the results aren't as good execution-time wise.)
Member
Author
If both T
and U
are ZSTs then doing the divisions in the const-block causes a compile time divide-by-zero error. align_to
has an early return if at least one of the types is a ZST, so the divisor will never be zero at runtime, but it looks like the const-block still gets evaluated even though it's technically dead code.
Got it, sounds like we have no other choice right now, then. We could look into moving the check for ts == 0 || us == 0
into the const block too and bail in some other reasonable way, but it probably something that warrants a separate PR anyhow.
Member
Nope, wasn't me. Maybe @nagisa ? Great change, though! A really nice demonstration of a |
Member
Yeah, I “own” anything |
} |
||
a << k |
||
const fn gcd(a: usize, b: usize) -> usize { |
||
if b == 0 { a } else { gcd(b, a % b) } |
Member
Doesn’t this need to check for a == 0
too? There's no guarantee that either of the types won’t be ZST (in isolation.) Or, add a comment.
Member
Author
If a == 0
then it returns gcd(b, 0 % b)
which will then return b
.
Collaborator
Finished benchmarking commit (2a2eb58): comparison URL. Overall result: no relevant changes - no action neededBenchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. @bors rollup=never Instruction countThis benchmark run did not return any relevant results for this metric. Max RSS (memory usage)Results CyclesThis benchmark run did not return any relevant results for this metric. Binary sizeThis benchmark run did not return any relevant results for this metric. Bootstrap: 653.81s -> 653.024s (-0.12%) |
removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label
Member
@bors r=nagisa,Mark-Simulacrum |
added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties.
labels
Contributor
Test successful - checks-actions |
Collaborator
Finished benchmarking commit (90c02c1): comparison URL. Overall result: improvements - no action needed@rustbot label: -perf-regression Instruction countThis is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage)Results CyclesResults Binary sizeThis benchmark run did not return any relevant results for this metric. Bootstrap: 658.017s -> 656.527s (-0.23%) |
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
None yet
Successfully merging this pull request may close these issues.
None yet
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK