proposal: profile-guided optimization · Issue #55022 · golang/go · GitHub
source link: https://github.com/golang/go/issues/55022
proposal: profile-guided optimization #55022
cherrymui opened this issue 2 days ago · 24 comments
Comments
We propose adding support for profile-guided optimization (PGO) to the Go gc toolchain. PGO will enable the toolchain to perform application- and workload-specific optimizations based on run-time information. Unlike many compiler optimizations, PGO requires user involvement to collect profiles and feed them back into the build process. Hence, we propose a design that centers user experience and ease of deployment and fits naturally into the broader Go build ecosystem. Detailed design can be found here. In summary, we propose
Input welcome. Beyond input on the general approach, we're particularly looking for input on whether PGO should be default enabled in 1.20, and flag and file naming conventions. If accepted, we plan to implement a preview of PGO in Go 1.20. See also previous discussions at #28262. Filing a new issue to make it clearer what we are proposing. |
Change https://go.dev/cl/430355 mentions this issue: |
Contributor
rhysh commented 2 days ago
This looks very interesting! Starting with the pprof format sounds good, but I have some concerns about its verbosity in two ways.
First, the files can be really big and processing them can be slow. When I work with profiling data manually, I often start with pprof-format files that include hundreds or thousands of samples, usually covering less than one thread-minute. For PGO, I'd be inclined to use "the most and best data available". For some applications I support, that means multi-megabyte pprof-format files which take multiple seconds to load. That's a big file to add to an application's source control repository (for build environments that work that way), and a big build speed hit if it applies to every package. The approach of "Some of these downsides can be solved transparently by converting the profile to a simpler, indexed format at build time and storing this processed profile in the build cache" sounds promising, but I wonder if that format should be obtainable through some explicit
Second, how can application owners understand what the compiler is learning from the profile? When I encounter a bug that might be affected by or otherwise involve PGO, I'd like to be able to report it (including steps to reproduce) without accidentally disclosing proprietary information about the rest of my applications' performance and without accidentally changing the compiler's PGO-related decisions. When I have the choice between using a 100kB pprof file or a 10MB pprof file, I'd like to know if using the more detailed file actually changes the results. When reviewing a change to the PGO input file (maybe to adjust its weight towards more latency-critical parts of the app), I'd like to know what it's communicating to the compiler. These could also be addressed by a "simpler, indexed format" that is semi-stable and human-readable / human-redactable.
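To make the file-size concern concrete, here is a minimal sketch of capturing a pprof-format CPU profile in-process with the standard `runtime/pprof` package and reporting how many bytes it occupies. The `busyWork` function and the `collectCPUProfile` helper are hypothetical names invented for this illustration; a real service would profile its actual workload, typically for much longer.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// busyWork is a hypothetical stand-in for an application's hot path.
func busyWork(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += i % 7
	}
	return sum
}

// collectCPUProfile runs work with CPU profiling enabled and returns the
// resulting pprof-format bytes (a gzip-compressed protocol buffer).
func collectCPUProfile(work func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, err
	}
	work()
	pprof.StopCPUProfile() // flushes all buffered profile data into buf
	return buf.Bytes(), nil
}

func main() {
	result := 0
	data, err := collectCPUProfile(func() { result = busyWork(50_000_000) })
	if err != nil {
		fmt.Println("profiling failed:", err)
		return
	}
	fmt.Println("work result:", result)
	fmt.Printf("collected %d bytes of profile data\n", len(data))
}
```

The byte count grows with the number of distinct stacks sampled, which is why long-running production profiles can reach the multi-megabyte sizes described above.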
Member
Author
cherrymui commented 2 days ago
@rhysh thanks for the feedback! Yes, if we later switch to a different format, we will provide a tool to convert profiles between the formats. And yes, if we choose a different format we will consider how to make it more human-readable. (We considered this possibility but it was unclear to me how to do that. There is also a range of possibilities. The human readable form could be very low level, basically a textual representation of the raw profile, to very high level, basically a list of "compiler optimization directives", or something in between.) Related, I think the compiler will have an option to output what optimizations are done based on the profile, either through the |
Contributor
rhysh commented 2 days ago
Visibility into the compiler's decisions via If so, it sounds like we have the "simpler [...] format" described in the proposal, in an ad-hoc way. Maybe ad-hoc is all we need, and would keep the messier and less stable parts of the ecosystem out of the main This sounds like the "basically a list of 'compiler optimization directives'" option. What design challenges did you encounter with that? Thanks @cherrymui ! |
Like @rhysh, I feel strongly that the
Member
Author
cherrymui commented 2 days ago
Interesting idea about compiler optimization directive list for bug report. Thanks. I thought for bug report due to PGO we would expect one to share the profile in order to reproduce the buggy build. It is understandable that this can be difficult in some cases. Maybe we can consider supporting an optimization list, even just for debugging purposes. (FWIW, we even had a prototype for optimization list (thanks @dr2chase) which I used for early experiments.) One challenge is how to generate such an optimization directive list from raw profiles. For a streamlined user experience, one would take profiles from production binary, and Another benefit for profiles is that we can do multiple optimizations based on, say, a CPU profile. If we add a new type of PGO-based optimization, one would automatically get the optimization without taking a new profile. For optimization lists, one would need to explicitly add it to the file. I'm not really clear what the optimization list would include. Should it be binary (boolean) options, or weights? Binary options sound simpler, but maybe weights are better for optimizations, so the compiler can take into account the information from source code and its own analysis? There is also the question about the file becoming stale. Of course profiles can also become stale. It might be possible to design an optimization directive list format that is resilient to staleness. Maybe one option is to have the compiler
rajbarik commented 2 days ago
Hi Josh, Do you have any concrete examples of how an optimization file is generated from a profile? Any existing compilers do this today? Please let us know. Typically compiler optimizations such as inlining, specialization, code layout, register allocation and instruction scheduling can use profile information to improve application performance by optimizing hot-paths. Best, Raj |
Non-main packages
As a package writer, I cannot supply any improvements. It would require package users to implement pgo benchmarks that hit my package. Writing these benchmarks to build cpu profiles is far from trivial. As a result, 99% of all compiles with my packages would be without (useful) pgo. You mention this as future work, but it seems like a big miss.
Platforms
This will only cover platform-independent code. Much optimized code has separate code paths for different platforms. As proposed, I don't see any reasonable way to handle multiple platforms automatically, for example when cross-compiling. You could make this a user problem, and say they should use
Future Optimization Scope
I may be lacking imagination, but I don't see the scope of this going much beyond inline decisions. Maybe with branch-information, you could re-order unordered
To justify such a rather complex setup it is important to keep the end-goal in sight.
Counter Proposal: Simplify Proposal
I appreciate the intent to make this automatic and low-effort. But writing the benchmarks and/or capturing useful profiles is a big task, and for a big application with many subsystems it will be massive. While it proposes to just run pprof on a production server, apply the profile -> faster application, it isn't really that simple, since profiles would need to be captured continuously as code changes. I very much agree with Josh, et al, that this should be simpler and in the hands of the developer. To add to that, I know that Go doesn't like pragmas, but it seems to me that with a few well-designed pragmas we could get the same result without all the problems. As far as I can see, having a program propose pragmas (or code reordering) from a profile would be just as simple, and allows you to review the proposed changes before they are committed. If Go still isn't ready for performance-related pragmas, then an external file describing the same would also be an (IMO suboptimal) solution.
I dread a bit how you would specify inlining at callers and auto-vectorization of a |
@klauspost Given that inlining is the most significant optimisation mechanism, I believe that even in its current state, pgo is a very valuable improvement. In addition to inlining, pgo can be applied to interface dispatch, which helps devirtualise method invocations. Profiling branch-taken frequency helps produce better code layout at the very least, and guides the optimiser toward better decisions in other operations such as loop unrolling and cloning operations through phi nodes; in extreme cases branch pruning is also a possibility. Pragmas or external decorative files work on call graphs, whereas optimisation decisions are best made with information from the call trees, so this would be a suboptimal proposal. Additionally, it is not internal details, so I would disagree with this idea.
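As an illustration of the interface-dispatch point, the sketch below writes out by hand the kind of code a profile-guided compiler might conceptually produce when a profile shows one concrete type dominating a call site. The `Shape`, `Square`, and `Circle` types are hypothetical, and nothing here reflects the actual transformation implemented by the Go compiler; it only shows the general shape of guarded devirtualization.

```go
package main

import "fmt"

// Shape, Square and Circle are hypothetical types for illustration only.
type Shape interface{ Area() float64 }

type Square struct{ Side float64 }

func (s Square) Area() float64 { return s.Side * s.Side }

type Circle struct{ R float64 }

func (c Circle) Area() float64 { return 3.14159 * c.R * c.R }

// totalArea uses ordinary interface dispatch on every iteration.
func totalArea(shapes []Shape) float64 {
	total := 0.0
	for _, s := range shapes {
		total += s.Area()
	}
	return total
}

// totalAreaDevirt is a hand-written version of guarded devirtualization:
// if a profile said Square dominates this call site, the hot case can be
// checked first and its body inlined, with a fallback to dynamic dispatch.
func totalAreaDevirt(shapes []Shape) float64 {
	total := 0.0
	for _, s := range shapes {
		if sq, ok := s.(Square); ok {
			total += sq.Side * sq.Side // direct, inlined hot path
		} else {
			total += s.Area() // cold path: normal interface call
		}
	}
	return total
}

func main() {
	shapes := []Shape{Square{2}, Square{3}, Circle{1}, Square{4}}
	fmt.Println(totalArea(shapes) == totalAreaDevirt(shapes))
}
```

Both functions compute the same result; the difference is only in how the hot call is reached, which is information a static call graph alone does not provide.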
@merykitty Thanks! I see your point with interface dispatch, where knowing the common implementation could output implementation-specific branches. The other cases I don't really see as something that couldn't be done with pragmas. Granted, generic implementations could make this hard to control, since they would probably apply to all instances. I don't see how you can do branch pruning without the compiler being able to prove that branches are impossible, in which case you don't really need pgo. Branch reordering and loop unrolling (without vectorization) have very limited impact on modern CPUs. Either way, I stand by my point that this, as proposed, seems like a complex system that is hard to use correctly and will provide limited benefits. It is not "set up and forget", which is my main problem. I am hoping for a simpler, more controllable system that doesn't require a 'main-function' implementation with continuous updates, and which provides 90% of the benefits.
One worry I have with using profiles directly is that it can easily be an unstable mechanism. If I profile a program's current build and use that profile to rebuild it with PGO, grabbing another profile would likely result in fairly different results. For example, if a function call used to take ~2% of CPU time due to being in a very hot path and PGO now inlines it, then it could become far less relevant in the second profile. I imagine we want developers to only use PGO with profiles obtained from runs without PGO. Can we restrict that to prevent confusion? |
@mvdan The ability to collect profiles from a PGO-optimized binary is actually a core requirement in our design because it simplifies profile collection significantly (you can collect directly from production deployments without need for some kind of unoptimized "canary" deployment used only for profile collection). It certainly has the challenge you mention, which we call "iterative stability" and discuss here and here. Sections 4.2.1 and 5.2 of the AutoFDO paper also discuss this. This is something we will need to pay close attention to and test well to make sure results are stable, and may be an area where additional metadata could help (e.g., collecting a CPU "last branch record" (LBR) profile would tell us which basic blocks are frequently executed, even if they have become much cheaper, so, used carefully, that could further mitigate this issue). |
@josharian Having a list of optimization directives (or pragmas in the source) is something we've thought about, and definitely has pros and cons. I certainly think we want PGO to be applicable from a profile (which I think you agree with?); the open question is whether translation to optimization decisions happens directly in the compiler or as a pre-processing step. Some thoughts on your list of advantages below. Note, this list actually talks about two axes: profile vs optimization list and binary vs plain text format. I think we could decide along either of these axes (e.g., we could use a plain-text format that describes the pprof profile).
I'm not sure I fully agree with this. While I agree that the compiler will be simpler if it only takes a list of directives rather than having to process a profile, something still needs to convert the profile to a list of directives. That new tool will contain the complexity, and share a lot of similarities with the compiler, perhaps even directly sharing code, depending on the specifics of the directives. e.g., if inlining directives are binary decisions ("inline this function"), then the conversion tool should probably contain a near copy of the compiler's inlining heuristics so that it makes very similar decisions. That is perhaps less important if the directives are more abstract (providing an inlining importance "weight"?), but if we go too abstract then we are probably just describing a profile anyways. As Cherry mentioned, one option here is that the compiler could accept either a profile or optimization list. When given a profile, it generates the optimization list to be used for future compilations.
I agree this would be easier to test/debug because it adds an intermediate format which you could test both sides of.
100% agreed that a plain text format and an optimization list are more transparent about what is happening in the build.
While I agree that an optimization list makes manual optimization tuning easier/possible, that has not been a goal of this proposal, and it is something we have avoided adding pragmas for (also likely a reason we wouldn't have PGO work by adding pragmas to source code). I think it is a bigger discussion if we want to support custom tuning of optimizations.
This strikes me as out of place in your list. We absolutely want to support (and encourage!) merging profiles from multiple instances in order to get a more representative sample. This seems to be a point in favor of pprof, as it is easy to merge pprof profiles ( On the other hand, optimization lists seem difficult to merge. If one list says "inline foo" and the other says "do not inline foo", how do you merge them? It seems in this case you'd still want to merge the profiles prior to generating the optimization list.
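The contrast between the two kinds of merge can be sketched with plain maps. This is a simplified model, not the pprof format or any real directive format: `mergeWeights` stands in for summing sample counts per function, and `mergeDirectives` stands in for combining hypothetical boolean "inline this function" lists.

```go
package main

import (
	"fmt"
	"sort"
)

// mergeWeights sums per-function sample counts from two profiles.
// Merging is always well defined: weights simply add up.
func mergeWeights(a, b map[string]int64) map[string]int64 {
	out := make(map[string]int64, len(a)+len(b))
	for fn, w := range a {
		out[fn] += w
	}
	for fn, w := range b {
		out[fn] += w
	}
	return out
}

// mergeDirectives combines two hypothetical boolean inline directive
// lists. Entries that disagree cannot be resolved without the underlying
// weights, so they are left out of the result and reported as conflicts.
func mergeDirectives(a, b map[string]bool) (merged map[string]bool, conflicts []string) {
	merged = make(map[string]bool, len(a)+len(b))
	for fn, v := range a {
		merged[fn] = v
	}
	for fn, v := range b {
		if prev, ok := merged[fn]; ok && prev != v {
			conflicts = append(conflicts, fn)
			delete(merged, fn)
			continue
		}
		merged[fn] = v
	}
	sort.Strings(conflicts)
	return merged, conflicts
}

func main() {
	w := mergeWeights(
		map[string]int64{"foo": 90, "bar": 10},
		map[string]int64{"foo": 5, "baz": 20},
	)
	fmt.Println("merged weight of foo:", w["foo"])

	_, conflicts := mergeDirectives(
		map[string]bool{"foo": true},
		map[string]bool{"foo": false, "bar": true},
	)
	fmt.Println("conflicting directives:", conflicts)
}
```

In this model the weights for `foo` combine cleanly, while the directive lists can only flag `foo` as unresolved, which matches the argument for merging profiles before deriving any optimization list.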
Earlier in our design we were planning to create a new PGO format specifically to have the flexibility of adding information beyond CPU profiles. We switched to proposing pprof directly because the format, while not perfect, is actually quite flexible. e.g., dynamic type info could be encoded as a Label on Samples of calls describing the type being called (though for calls the type is already obvious from the call destination, so not the best example). Stack sizes could be sampled as Samples with Location == gp.startpc, and value == stack size. The format is not perfect and we may find ourselves limited in the future (discussion here), but I think there is a lot of runway.
Agreed that a plain-text format makes fix-ups easier. We've discussed tooling for renames for pprof files, but plain text is easier.
I'm not sure what you mean by this, could you expand? |
Contributor
ianlancetaylor commented yesterday
@klauspost My opinion is that history tells us that using pragmas to guide optimizations doesn't sustain over time. What happens is that someone does a bunch of analysis with a specific compiler version and a specific set of benchmarks, and writes a bunch of pragmas that makes those benchmarks faster. So far so good. But two years later the compiler has changed, the libraries have changed, and the program has changed. The existing optimization pragmas have not changed, and no longer do any good, and occasionally actually do harm. Profiling can have the same problem, of course, if nobody updates the profiles. But updating the profiles is a much simpler task than analyzing performance in detail, and therefore tends to happen more frequently. In the best case, updating the profiling is simply automated. So over the long term, I believe that profile guided optimization is a better approach for most projects. It's unlikely to beat careful analysis and hand tuning at the point in time when that happens. But profile guided optimization is much cheaper in what is for most projects the greatest expense: developer time. Just my opinion, of course, but based on experience. |
@klauspost, just a small point
Branch pruning here means splitting everything below the if to separate hot paths and cold paths in the whole function. |
This is something we've been thinking about recently and are open to suggestions. I think we may want automatic build configuration selection eventually, but the details are tricky. Do we use suffixes, like .go files ( What we've proposed I think is the MVP option which remains forward compatible with future auto-selection we might do. (FWIW, I think that a single profile for all platforms is likely good enough for most users. A platform-specific one will of course be at least a bit better, but my intuition says that the vast majority of most profiles will be common across platforms. We could verify this with data by comparing profiles.)
To add to what others have said, we've actually been keeping an eye toward many possible PGO-based optimizations while designing this. Here's a non-exhaustive list:
@klauspost thanks for the feedback!
I totally agree that PGO for non-main packages is important (we want to do it for the standard library as well), and we spent quite some time considering it. But many details are still unclear to us. At the moment we're proposing what we know how to do, and leaving the door open for future improvements including PGO for non-main packages. I think this is better than waiting.
This is not what we expected. Instead, we expect the user to take profiles from their main program. If in that program your package is in the hot path, PGO will apply to your package.
As @prattmic mentioned, we are proposing an MVP option which remains forward compatible with future improvements. As mentioned in the design doc
By "portable" we mean, for example, that if the workloads are similar across platforms, the hot functions on one platform are likely also hot on another platform. See also "Input welcome" from the original post. We particularly welcome input on naming default profiles.
I'm not really sure what your comment meant. But here is a (non-exhaustive) list of optimizations we plan to do in the future https://go.googlesource.com/proposal/+/master/design/55022-pgo.md#optimizations |
I will first admit that I do not have direct experience with profile-guided optimization in any other language, and so my line of questioning here may be naive. If this seems like a nonsense question, I'm happy to accept that as an answer to avoid anyone feeling the need to do my homework for me! The proposal and some of the commentary above discusses the fact that a particular set of profile information is specific to a particular version of the compiler (because it may generate code with different execution characteristics) and to particular input source code (because changing the program will at least change the behavior of the part of the program that changed, and may also have knock-on effects elsewhere from optimizations that are able to consider global information). I quite enjoy the fact that Of course I understand that profile-guided optimization cannot possibly be implemented with an automatically-maintained cache, because gathering profile information is always an explicit step and similarly it's up to the person running However, I do wonder if it seems feasible for the toolchain to automatically announce when a particular profile information file has been invalidated, so that I could be prompted to regenerate it. I assume "invalidated" is not a boolean value in this case but perhaps more like a code coverage report: how much of the program is still the same now as it was when this profile information was generated? What specific parts of the program are not covered by this profile because they have changed since the profile was generated? Was the profile generated from a program compiled with the same toolchain version as I'm currently using? (Hopefully this could also somehow take into account the degree to which the system can "make do" with mismatching profiles, giving higher precedence to changes that are outside the set that the compiler and trace format can account for automatically, as described in Stability to source code and compiler changes.) 
With what I've understood from reviewing the proposal so far it seems like it would be too expensive to track this sort of "profile coverage" on a per-line or per-statement basis, since it would seem to require retaining some sort of checksum for every single line/statement. But I wonder if it would be feasible to track at a coarser grain, such as per-function, per-file, or per-package, just to give the user something meaningful to understand the results against, so that they can then use their intuition to estimate whether the indicated objects have changed enough to warrant the effort of capturing fresh profile data. (For this question I'm imagining a situation where a team collects profile information semi-manually once and uses it for some time before semi-manually generating it again. I expect that the capability I'm describing would be far less useful in situations where a team is able to constantly gather profile information from a currently-running program and feed it into the next build, as described in Iterative Stability. The software I spend most of my time working on is "shrinkwrap" software which we package and ship to users to run on their own computers, and so our ability to capture traces from real user runs is limited. "This feature is primarily intended for servers you can profile constantly" would be a perfectly reasonable argument to dismiss my questioning above, of course.)
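A rough sketch of what per-function "profile coverage" tracking could look like follows. Everything here is hypothetical: the proposal does not define such a mechanism, and the `fingerprint` and `staleFunctions` helpers, as well as the idea of storing a hash per function alongside the profile, are assumptions made only to illustrate the coarse-grained check being asked about.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// fingerprint returns a short hash of a function's source text.
func fingerprint(src string) string {
	sum := sha256.Sum256([]byte(src))
	return fmt.Sprintf("%x", sum[:8])
}

// staleFunctions compares fingerprints recorded when the profile was
// taken (function name -> hash) against the current source (function
// name -> source text). It returns the sorted names of profiled functions
// that changed or disappeared, and the fraction that is still unchanged.
func staleFunctions(recorded, current map[string]string) (stale []string, fresh float64) {
	if len(recorded) == 0 {
		return nil, 1
	}
	for name, fp := range recorded {
		src, ok := current[name]
		if !ok || fingerprint(src) != fp {
			stale = append(stale, name)
		}
	}
	sort.Strings(stale)
	fresh = float64(len(recorded)-len(stale)) / float64(len(recorded))
	return stale, fresh
}

func main() {
	// Fingerprints as they might have been recorded at profile time.
	recorded := map[string]string{
		"pkg.Parse":  fingerprint("func Parse() { step1() }"),
		"pkg.Encode": fingerprint("func Encode() { write() }"),
	}
	// Current source: Parse was edited, Encode is unchanged.
	current := map[string]string{
		"pkg.Parse":  "func Parse() { step1(); step2() }",
		"pkg.Encode": "func Encode() { write() }",
	}
	stale, fresh := staleFunctions(recorded, current)
	fmt.Printf("stale: %v, still covered: %.0f%%\n", stale, fresh*100)
}
```

A report like this would not say whether the profile is still good, only which profiled functions have drifted, leaving the judgment about recapturing to the user.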
Contributor
josharian commented 21 hours ago
Yes, that was rather vague. :) It's similar to @apparentlymart's point (which is a good one). In many systems, performance properties are correctness properties. Losing key optimizations can cause performance regressions, which can cause existing provisioning to be insufficient, which can take down a system. Even merely slowing down a system can cause cascading failures. (I speak from experience, unfortunately.) The more opaque the toolchain and its inputs are, the harder it is to (a) write safety checks that detect performance problems before they make it to production and (b) diagnose performance issues after they make it to production. This is in some ways even worse with less severe regressions, which don't result in an obvious, immediate problem. The obvious approach here using pprof files is to have excellent compiler diagnostic output. This puts us in the same boat as (for example) current inlining tests: exec a build from inside a test and check for magic output strings. It's kind of a miserable existence, but it works. @apparentlymart's notion of a PGO fitness/staleness score would also help, albeit in a less fine-grained way. |
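The "check for magic output strings" approach can be sketched as follows. The sample diagnostics are hard-coded and only modelled on the kind of lines printed by `go build -gcflags=-m`; the exact wording can differ between compiler versions, so the marker string and the `inlinedCalls` and `missingInlines` helpers are assumptions for illustration. A real test would obtain the text by running the build.

```go
package main

import (
	"fmt"
	"strings"
)

// inlinedCalls counts, per callee, the diagnostic lines of the assumed
// form "<position>: inlining call to <callee>".
func inlinedCalls(diag string) map[string]int {
	const marker = ": inlining call to "
	counts := make(map[string]int)
	for _, line := range strings.Split(diag, "\n") {
		i := strings.Index(line, marker)
		if i < 0 {
			continue
		}
		callee := strings.TrimSpace(line[i+len(marker):])
		if callee != "" {
			counts[callee]++
		}
	}
	return counts
}

// missingInlines returns the required callees that were never reported
// as inlined, which a safety check could treat as a regression.
func missingInlines(diag string, required []string) []string {
	counts := inlinedCalls(diag)
	var missing []string
	for _, fn := range required {
		if counts[fn] == 0 {
			missing = append(missing, fn)
		}
	}
	return missing
}

func main() {
	// Hard-coded sample text standing in for real compiler output.
	diag := `./main.go:7:6: can inline add
./main.go:14:12: inlining call to add
./main.go:15:12: inlining call to add`
	fmt.Println(missingInlines(diag, []string{"add", "hotLoop"}))
}
```

With a PGO build the same check could be run before and after swapping profiles, flagging any key function that stops being inlined.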
Contributor
josharian commented 21 hours ago
Or to put it a different way: How do I code review a commit that replaces one pprof profile with a new one? |
Member
prattmic commented 8 hours ago
Thanks for the clarification, that makes sense. Having an optimization list doesn't solve the problem of determining if a new profile has "good" results, but it is at least easier to look at a diff and notice potentially worrying changes. (Though if the list is thousands of lines long and changes a lot from profile-to-profile, then I imagine that could also be difficult to review). |
Member
Author
cherrymui commented 4 hours ago
Thanks. I think we all agree that we want the compiler to emit an optimization list for optimizations it does based on a given profile. Interesting idea about profile invalidation. I think that is a good point. And we can make the compiler emit some information when some profile information doesn't apply, either in the same optimization list file or some other file. I think it shouldn't be too hard or expensive.
rajbarik commented 1 hour ago •
I recommend that we use the same technique as AutoFDO to deal with stale profiles, i.e., to use <pkg_name, function_name, line_offset> instead of <pkg_name, function_name, line_number>. With line_offset information, we can identify whether a call site has moved up or down relative to the start of the function. In this case, we do not inline this call site; however, other call sites that have not moved up or down will continue to use the profile information and get optimized. This design also allows functions that are unchanged to continue to be optimized via PGO. It should be easy to produce a post-PGO report containing a list of optimized functions and their corresponding PGO-enabled optimizations. A couple more notes:
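The effect of keying on line offsets can be sketched with a small lookup table. The `SiteKey` type and `keyFor` helper are hypothetical names, and the line numbers are made up; this only models the matching rule described above, not AutoFDO's or the Go compiler's actual data structures.

```go
package main

import "fmt"

// SiteKey identifies a call site by its position relative to the start of
// the enclosing function, modelling <pkg_name, function_name, line_offset>.
type SiteKey struct {
	Pkg        string
	Func       string
	LineOffset int
}

// keyFor converts an absolute call line into a function-relative key.
func keyFor(pkg, fn string, funcStartLine, callLine int) SiteKey {
	return SiteKey{Pkg: pkg, Func: fn, LineOffset: callLine - funcStartLine}
}

func main() {
	// Profile time: handle starts at line 10 with calls on lines 14 and 16.
	profile := map[SiteKey]int64{
		keyFor("app", "handle", 10, 14): 900,
		keyFor("app", "handle", 10, 16): 50,
	}

	// Build time: 20 lines were added above handle (it now starts at line
	// 30), and one line was inserted between the two calls.
	first := keyFor("app", "handle", 30, 34)  // offset 4, unchanged
	second := keyFor("app", "handle", 30, 37) // offset 7, was 6

	if w, ok := profile[first]; ok {
		fmt.Println("first call site still matches, weight", w)
	}
	if _, ok := profile[second]; !ok {
		fmt.Println("second call site moved within the function; profile data is skipped")
	}
}
```

Edits elsewhere in the file leave every key intact, so only call sites whose position inside their own function changed lose their profile data.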