GitHub: Import small cold functions by tmiasko · Pull Request #82980 · rust-lang/rust
source link: https://github.com/rust-lang/rust/pull/82980
Conversation
Rust code is often written under the assumption that, for generic methods, the inline attribute is mostly unnecessary: in optimized builds using ThinLTO, a method will be code-generated in at least one CGU and available for import.
For example, the deref implementations for Box, Vec, and MutexGuard are not currently marked as inline, nor is the identity implementation of the From trait.
In PGO builds, when functions are determined to be cold, the default cold-import multiplier of zero blocks the import entirely, no matter how trivial the implementation.
Slightly increase the default multiplier from 0 to 0.1.
r? @ghost
Awaiting bors try build completion.
@rustbot label: +S-waiting-on-perf
Try build successful - checks-actions
Build commit: df0df90 (df0df906bd16e3d83f3f52a984101dc638c848e3)
Finished benchmarking try commit (df0df90): comparison url.
Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.
Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.
@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf
The perf results look quite promising, with small improvements in instruction counts, wall-time, and max-rss. The change also reduced the size of librustc_driver.so from 102.5 MiB to 97.1 MiB.
The main premise behind the change is that with PGO we will have cold functions with negative inline cost, and we should continue inlining them. I don't know if this is something we would like to land in this form, but it could be interesting from a Rust PGO perspective.
changed the title from "Use small non-zero import-cold-multiplier" to "Import small cold functions"
I'll take a look at this next week. Thanks for the PR, @tmiasko!
Hm, the numbers look like an improvement overall. However, I'm wondering, in the PGO case, wouldn't the compiler have accurate information about the coldness of a function? In which case it should actually be beneficial not to import cold functions?
On the other hand, as you say, a multiplier of 0.1 would be a strong discouragement for importing a given function while still allowing a decision to be made based on other factors such as size.
I think we can just merge this. It's just a small tweak and does not seem to have a negative impact on LLVM times.
@bors r+
Commit 1aee808 has been approved by michaelwoerister
Test successful - checks-actions
Approved by: michaelwoerister
Pushing e423058 to master...
Inlining trivial functions often reduces caller code size, and inlining heuristics for cold functions try to avoid inlining otherwise, so yes, it should be beneficial even with completely accurate profile information. For example, here is an error handling code path using ?, taken from the rustc MIR interpreter. Before:
```
  → callq *0x20446f5(%rip)  # 3e1a678 <<rustc_middle::mir::interpret::error::InterpErrorInfo as core::convert::From<rustc_middle::mir::interpret::error::InterpError>>::from@@Base+0x1ba93a8>
    mov   %rax,%rdi
  → callq <T as core::convert::From<T>>::from
    mov   %rax,%rdi
1ce:→ callq <T as core::convert::From<T>>::from
    mov   %rax,%rdi
  → callq <T as core::convert::From<T>>::from
    mov   $0x1,%sil
  ↑ jmpq  115
```
and after:
```
  → callq *0x1f22ee0(%rip)  # 3a0ae68 <<rustc_middle::mir::interpret::error::InterpErrorInfo as core::convert::From<rustc_middle::mir::interpret::error::InterpError>>::from@@Base+0x1a80f38>
1c8:  mov   $0x1,%sil
  ↑ jmpq  116
```
@tmiasko @michaelwoerister Can you talk a bit more about the performance regressions seen here? Some of the regressions are somewhat severe, so it would be nice to talk through them a bit more. Why is the performance regression worth it?
The main change here is that we allow LLVM to import functions across module boundaries during ThinLTO even if a given function has been marked as cold. This allows LLVM to inline trivial functions (like the example in the comment above), leading to an overall reduction in code size. LLVM still retains the ability to not inline such functions based on its usual heuristics (in which case the effort put into importing the function was wasted). The change should lead to an overall quality improvement for all machine code produced via ThinLTO (i.e. currently all --release builds done via cargo).
Because this PR re-jiggers LLVM inlining decisions it can be expected to have a mixed influence on compile times. Since that effect seems to be positive overall (with the exception of two small synthetic test cases) I think this change is a net win.