Dave Cheney | The acme of foolishness
A few bytes here, a few there, pretty soon you’re talking real memory
Today’s post comes from a recent Go pop quiz. Consider this benchmark fragment.1
func BenchmarkSortStrings(b *testing.B) {
s := []string{"heart", "lungs", "brain", "kidneys", "pancreas"}
b.ReportAllocs()
for i := 0; i < b.N; i++ {
sort.Strings(s)
}
}
A convenience wrapper around sort.Sort(sort.StringSlice(s)), sort.Strings sorts the input in place, so it isn't expected to allocate (or at least that's what 43% of the tweeps who responded thought). However it turns out that, at least in recent versions of Go, each iteration of the benchmark causes one heap allocation. Why does this happen?
Interfaces, as all Go programmers should know, are implemented as a two word structure. Each interface value contains a field which holds the type of the interface’s contents, and a pointer to the interface’s contents.2
In pseudo Go code, an interface might look something like this:
type interface struct {
// the ordinal number for the type of the value
// assigned to the interface
type uintptr
// (usually) a pointer to the value assigned to
// the interface
data uintptr
}
interface.data can hold one machine word–8 bytes in most cases–but a []string is 24 bytes: one word for the pointer to the slice's underlying array; one word for the length; and one for the remaining capacity of the underlying array. So how does Go fit 24 bytes into 8? With the oldest trick in the book, indirection. s, a []string, is 24 bytes, but *[]string–a pointer to a slice of strings–is only 8.
Escaping to the heap
To make the example a little more explicit, here's the benchmark rewritten without the sort.Strings helper function:
func BenchmarkSortStrings(b *testing.B) {
s := []string{"heart", "lungs", "brain", "kidneys", "pancreas"}
b.ReportAllocs()
for i := 0; i < b.N; i++ {
var ss sort.StringSlice = s
var si sort.Interface = ss // allocation
sort.Sort(si)
}
}
To make the interface magic work, the compiler rewrites the assignment as var si sort.Interface = &ss–the address of ss is assigned to the interface value.3 We now have a situation where the interface value holds a pointer to ss, but where does it point? Where in memory does ss live?
It appears that ss is moved to the heap, causing the allocation that the benchmark reports.
Total: 296.01MB 296.01MB (flat, cum) 99.66%
  8          .          .   func BenchmarkSortStrings(b *testing.B) {
  9          .          .       s := []string{"heart", "lungs", "brain", "kidneys", "pancreas"}
 10          .          .       b.ReportAllocs()
 11          .          .       for i := 0; i < b.N; i++ {
 12          .          .           var ss sort.StringSlice = s
 13   296.01MB   296.01MB           var si sort.Interface = ss // allocation
 14          .          .           sort.Sort(si)
 15          .          .       }
 16          .          .   }
The allocation occurs because the compiler currently cannot convince itself that ss outlives si. The general attitude amongst Go compiler hackers seems to be that this could be improved, but that's a discussion for another time. As it stands, ss is allocated on the heap. Thus the question becomes, how many bytes are allocated per iteration? Why don't we ask the testing package?
% go test -bench=. sort_test.go
goos: darwin
goarch: amd64
cpu: Intel(R) Core(TM) i7-5650U CPU @ 2.20GHz
BenchmarkSortStrings-4   12591951   91.36 ns/op   24 B/op   1 allocs/op
PASS
ok      command-line-arguments  1.260s
Using Go 1.16beta1, on amd64, 24 bytes are allocated per operation.4 However, the previous Go version, on the same platform, consumes 32 bytes per operation:
% go1.15 test -bench=. sort_test.go
goos: darwin
goarch: amd64
BenchmarkSortStrings-4   11453016   96.4 ns/op   32 B/op   1 allocs/op
PASS
ok      command-line-arguments  1.225s
This brings us, circuitously, to the subject of this post, a nifty quality of life improvement coming in Go 1.16. But before I can talk about it, I need to discuss size classes.
Size classes
To explain what a size class is, consider how a theoretical Go runtime could allocate 24 bytes on its heap. A simple way to do this would be to keep track of all the memory allocated so far using a pointer to the last allocated byte on the heap. To allocate 24 bytes, the heap pointer is incremented by 24, and the previous value is returned to the caller. As long as the code that asked for 24 bytes never writes beyond that mark, this mechanism has no overhead. Sadly, in real life, memory allocators don't just allocate memory, sometimes they have to free it.
Eventually the Go runtime will have to free those 24 bytes, but from the runtime's point of view, the only thing it knows is the starting address it gave to the caller. It does not know how many bytes were allocated after that address. To permit deallocation, our hypothetical Go runtime allocator would have to record, for each allocation on the heap, its length. Where are those lengths allocated? On the heap, of course.
In our scenario, when the runtime wants to allocate memory, it could request a little more than it was asked for and use it to store the amount requested. For our slice example, when we asked for 24 bytes, we’d actually consume 24 bytes plus some overhead to store the number 24. How large is this overhead? It turns out the minimum amount that is practical is one word.5
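The header scheme above can be sketched with a toy bump allocator. Everything here–the heap type, its methods, the use of integer offsets as "addresses"–is illustrative, not how the runtime is actually written:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// heap is a toy bump allocator that prefixes every allocation
// with an 8-byte header recording its length.
type heap struct {
	mem []byte
	off int // offset of the next free byte
}

// alloc reserves n bytes plus an 8-byte length header, and
// returns the offset handed back to the caller.
func (h *heap) alloc(n int) int {
	start := h.off
	binary.LittleEndian.PutUint64(h.mem[start:], uint64(n)) // write header
	h.off += 8 + n
	return start + 8 // the caller never sees the header
}

// sizeOf recovers an allocation's length from its header.
func (h *heap) sizeOf(addr int) int {
	return int(binary.LittleEndian.Uint64(h.mem[addr-8:]))
}

func main() {
	h := &heap{mem: make([]byte, 1024)}
	p := h.alloc(24)
	fmt.Println(h.sizeOf(p)) // 24
	fmt.Println(h.off)       // 32: 24 bytes of data plus 8 bytes of overhead
}
```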
To record a 24 byte allocation the overhead would be 8 bytes. 25% is not great, but not terrible, and as the size of the allocation goes up, the overhead becomes insignificant. However, what would happen if we wanted to store just one byte on the heap? The overhead would be eight times the amount of data we asked for! Is there a more efficient way to allocate small amounts on the heap?
Rather than storing the length alongside each allocation, what if all the things of the same size were stored together? If all the 24 byte things are stored together, then the runtime would automatically know how large they are. All the runtime needs is a single bit to indicate if a 24 byte area is in use or not. In Go these areas are known as size classes, because all the things of the same size are stored together (think school class–all the students are the same grade–not a C++ class). When the runtime needs to allocate a small amount, it does so using the smallest size class that can accommodate the allocation.
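The rounding-up behaviour can be sketched like this; the table below is a small illustrative subset, not the runtime's real size class table:

```go
package main

import "fmt"

// sizeClasses is an illustrative subset of a size class table.
var sizeClasses = []int{8, 16, 24, 32, 48, 64}

// roundUp returns the smallest size class that can hold n bytes.
func roundUp(n int) int {
	for _, c := range sizeClasses {
		if n <= c {
			return c
		}
	}
	return n // large allocations are handled separately
}

func main() {
	fmt.Println(roundUp(9))  // 16
	fmt.Println(roundUp(24)) // 24
}
```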
Unlimited size classes for all
Now that we know how size classes work, the obvious question is: where are they stored? Not surprisingly, the memory for a size class comes from the heap. To minimise overhead, the runtime allocates a larger amount from the heap (usually a multiple of the system page size), then dedicates that space to allocations of a single size. But, there's a problem.
The pattern of allocating a large area to store things of the same size works well6 if there’s a fixed, preferably small, number of allocation sizes, but in a general purpose language programs can ask the runtime for an allocation of any size.7
For example, imagine asking the runtime for 9 bytes. 9 bytes is an uncommon size, so it’s likely that a new size class for things of size 9 would be required. As 9 byte things are uncommon, it’s likely the remainder of the allocation, 4kb or more, would be wasted. Because of this the set of possible size classes is fixed. If a size class of the exact amount is not available, the allocation is rounded up to the next size class. In our example 9 bytes might be allocated in the 12 byte size class. The overhead of 3 unused bytes is better than an entire size class allocation which goes mostly unused.
All together now
This is the final piece of the puzzle. Go 1.15 did not have a 24 byte size class, so the heap allocation of ss was allocated in the 32 byte size class. Thanks to the work of Martin Möhrmann, Go 1.16 has a 24 byte size class, which is a perfect fit for slice values assigned to interfaces.
The story of the one line fix
Picture yourself, an engineer working at the hottest distributed microservices shop du jour, assigned to fix a bug. You jump into an unfamiliar codebase and quickly locate the line where the problem occurred. The fix is simple, just return early or substitute a default value in the case that one cannot be determined from your input. Boom, problem solved. Code compiles, fix goes out to production, and the card's in the done pile in time for Tuesday drinks at the pub.
Now consider this alternative scenario. You find the problematic line and, before you make the fix, you decide to add a test so you will know that your fix worked, and hopefully someone will not accidentally revert it in the future.
You've figured out that some weird edge case causes that line to be called without being able to determine the right value for y. You can see it clearly, y ends up being zero so the program crashes. To write a test, you need to make the conditions that cause y to be zero occur on command. There are only two parameters passed into this function, this shouldn't be hard. Oh, but this is a method, not a function, so you need to construct an instance of the object in the right state to trigger the bug. Hmm, this language uses constructors to make sure people can't just monkey up the bits of the object they need for a test, so you'll have to find, mock, stub, or build instances of all the object's dependencies (and their dependencies, and so on), then run the object through the precise series of operations to put it in the state that causes y to be zero. Then you can write the test.
At this point Tuesday night drinks are fading before your eyes. Without being able to write a test for your fix, how will you convince a reviewer that it actually fixes the problem? All you can prove is that your "simple" change didn't break any behaviour that was covered by existing tests.
Fast forward to Friday and your one line change has finally been merged, along with a few new test cases to make sure it doesn’t happen again. Next week you’re going to sit down and try to figure out a bunch of weird crashes you caused while trying to stub out calls to the database. Eventually you gave up and copied some code from another test that uses a copy of production data. It’s much slower than it needs to be, but it worked, and was the only way to put a reasonable estimate on a job that had already consumed your entire week.
The moral of the story: was the test the reason it took nearly a week to make a one line fix?
How to dump the GOSSAFUNC graph for a method
The Go compiler's SSA backend contains a facility to produce HTML debugging output of the compilation phases. This post covers how to print the SSA output for functions and methods.
Let’s start with a sample program which contains a function, a value method, and a pointer method:
package main
import (
"fmt"
)
type Numbers struct {
vals []int
}
func (n *Numbers) Add(v int) {
n.vals = append(n.vals, v)
}
func (n Numbers) Average() float64 {
sum := 0.0
for _, num := range n.vals {
sum += float64(num)
}
return sum / float64(len(n.vals))
}
func main() {
var numbers Numbers
numbers.Add(200)
numbers.Add(43)
numbers.Add(-6)
fmt.Println(numbers.Average())
}
Control of the SSA debugging output is via the GOSSAFUNC environment variable. This variable takes the name of the function to dump. This is not the function's fully qualified name. For func main above, the name of the function is main, not main.main.
% env GOSSAFUNC=main go build
runtime
dumped SSA to ../../go/src/runtime/ssa.html
t
dumped SSA to ./ssa.html
In this example GOSSAFUNC=main matched both main.main and a function called runtime.main.1 This is a little unfortunate, but in practice probably not a big deal as, if you're performance tuning your code, it won't be in a giant spaghetti blob in func main.
What is more likely is your code will be in a method, so you’ve probably landed on this post looking for the correct incantation to dump the SSA output for a method.
To print the SSA debug for the pointer method func (n *Numbers) Add, the equivalent function name is (*Numbers).Add:2
% env "GOSSAFUNC=(*Numbers).Add" go build
t
dumped SSA to ./ssa.html
To print the SSA debug for the value method func (n Numbers) Average, the equivalent function name is also (*Numbers).Average, even though this is a value method:
% env "GOSSAFUNC=(*Numbers).Average" go build
t
dumped SSA to ./ssa.html
Diamond interface composition in Go 1.14
Per the overlapping interfaces proposal, Go 1.14 now permits embedding of interfaces with overlapping method sets. This is a brief post explaining what this change means.
Let's start with the definition of the three key interfaces from the io package: io.Reader, io.Writer, and io.Closer:
package io
type Reader interface {
Read([]byte) (int, error)
}
type Writer interface {
Write([]byte) (int, error)
}
type Closer interface {
Close() error
}
Just as embedding a type inside a struct allows the embedded type's fields and methods to be accessed as if they were declared on the embedding type1, the same is true for interfaces. Thus there is no difference between explicitly declaring
type ReadCloser interface {
Read([]byte) (int, error)
Close() error
}
and using embedding to compose the interface
type ReadCloser interface {
Reader
Closer
}
You can even mix and match
type WriteCloser interface {
Write([]byte) (int, error)
Closer
}
However, prior to Go 1.14, if you continued to compose interface declarations in this manner you would likely find that something like this,
type ReadWriteCloser interface {
ReadCloser
WriteCloser
}
would fail to compile
% go build interfaces.go
# command-line-arguments
./interfaces.go:27:2: duplicate method Close
Fortunately, with Go 1.14 this is no longer a limitation, thus solving problems that typically occur with diamond-shaped embedding graphs.
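As a sketch, here is the diamond in full, compiling cleanly when built with Go 1.14 or later. The interface definitions are copied inline rather than imported from io so the example is self-contained:

```go
package main

import "fmt"

type Reader interface{ Read(p []byte) (n int, err error) }
type Writer interface{ Write(p []byte) (n int, err error) }
type Closer interface{ Close() error }

type ReadCloser interface {
	Reader
	Closer
}

type WriteCloser interface {
	Writer
	Closer
}

// Both embedded interfaces declare Close; under the Go 1.14 spec
// this overlap is permitted and the program compiles.
type ReadWriteCloser interface {
	ReadCloser
	WriteCloser
}

func main() {
	fmt.Println("ReadWriteCloser compiles despite the duplicate Close")
}
```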
However, there is a catch that I ran into attempting to demonstrate this feature to the local user group–this feature is only enabled when the Go compiler uses the 1.14 (or later) spec.
As near as I can make out the rules for which version of the Go spec is used during compilation appear to be:
- If your source code is stored inside GOPATH (or you have disabled modules with GO111MODULE=off) then the version of the Go spec used to compile with matches the version of the compiler you are using. Said another way, if you have Go 1.13 installed, your Go version is 1.13. If you have Go 1.14 installed, your version is 1.14. No surprises here.
- If your source code is stored outside GOPATH (or you have forced modules on with GO111MODULE=on) then the go tool will take the Go version from the go.mod file.
- If there is no Go version listed in go.mod then the version of the spec will be the version of Go installed. This is identical to point 1.
- If you are in module mode, either by being outside GOPATH or with GO111MODULE=on, but there is no go.mod file in the current, or any parent, directory then the version of the Go spec used to compile your code defaults to Go 1.13.
The last point caught me out.
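For reference, a minimal go.mod that opts a module into the 1.14 spec looks something like this (the module path here is a made-up placeholder):

```
module example.com/demo

go 1.14
```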
Fatih’s question
A few days ago Fatih posted this question on twitter.
I’m going to attempt to give my answer, however to do that I need to apply some simplifications as my previous attempts to answer it involved a lot of phrases like a pointer to a pointer, and other unhelpful waffling. Hopefully my simplified answer can be useful in building a mental framework to answer Fatih’s original question.
Restating the question
Fatih's original tweet showed four different variations of json.Unmarshal. I'm going to focus on the last two, which I'll rewrite a little:
package main
import (
"encoding/json"
"fmt"
)
type Result struct {
Foo string `json:"foo"`
}
func main() {
content := []byte(`{"foo": "bar"}`)
var result1, result2 *Result
err := json.Unmarshal(content, &result1)
fmt.Println(result1, err) // &{bar} <nil>
err = json.Unmarshal(content, result2)
fmt.Println(result2, err) // <nil> json: Unmarshal(nil *main.Result)
}
Restated in words, result1 and result2 are the same type: *Result. Decoding into result1 works as expected, whereas decoding into result2 causes the json package to complain that the value passed to Unmarshal is nil. However, both values were declared without an initialiser so both would have taken on the type's zero value, nil.
Eagle eyed readers will have spotted that the reason for the difference is that the first invocation is passed &result1, while the second is passed result2, but this explanation is unsatisfactory because the documentation for json.Unmarshal states:
Unmarshal parses the JSON-encoded data and stores the result in the value pointed to by v. If v is nil or not a pointer, Unmarshal returns an InvalidUnmarshalError.
Which is confusing because result1 and result2 are pointers. Furthermore, without initialisation, both are nil. Now, the documentation is correct (as you'd expect from a package that has been hammered on for a decade), but explaining why takes a little more investigation.
Functions receive a copy of their arguments
Every assignment in Go is a copy, this includes function arguments and return values.
package main
import (
"fmt"
)
func increment(v int) {
v++
}
func main() {
v := 1
increment(v)
fmt.Println(v) // 1
}
In this example, increment is operating on a copy of main's v. This is because the v declared in main and increment's v parameter have different addresses in memory. Thus changes to increment's v cannot affect the contents of main's v.
package main
import (
"fmt"
)
func increment(v *int) {
*v++
}
func main() {
v := 1
increment(&v)
fmt.Println(v) // 2
}
If we wanted to write increment in a way that could affect the contents of its caller we would need to pass a reference, a pointer, to main.v.1 This example demonstrates why json.Unmarshal needs a pointer to the value to decode JSON into.
Pointers to pointers
Returning to the original question, both result1 and result2 are declared as *Result, that is, pointers to a Result value. We established that you have to pass the address of the caller's value to json.Unmarshal otherwise it won't be able to alter the contents of the caller's value. Why then must we pass the address of result1, a **Result, a pointer to a pointer to a Result, for the operation to succeed?
To explain this another detour is required. Consider this code:
package main
import (
"encoding/json"
"fmt"
)
type Result struct {
Foo *string `json:"foo"`
}
func main() {
content := []byte(`{"foo": "bar"}`)
var result1 *Result
err := json.Unmarshal(content, &result1)
fmt.Printf("%#v %v", result1, err) // &main.Result{Foo:(*string)(0xc0000102f0)} <nil>
}
In this example Result contains a pointer typed field, Foo *string. During JSON decoding Unmarshal allocated a new string value, stored the value bar in it, then placed the address of the string in Result.Foo. This behaviour is quite handy as it frees the caller from having to initialise Result.Foo and makes it easier to detect when a field was not set because the JSON did not contain a value. Beyond the convenience this offers for simple examples, it would be prohibitively difficult for the caller to properly initialise all the reference type fields in a structure before decoding unknown JSON without first inspecting the incoming JSON, which itself may be problematic if the input is coming from an io.Reader without the ability to rewind the input.
To unmarshal JSON into a pointer, Unmarshal first handles the case of the JSON being the JSON literal null. In that case, Unmarshal sets the pointer to nil. Otherwise, Unmarshal unmarshals the JSON into the value pointed at by the pointer. If the pointer is nil, Unmarshal allocates a new value for it to point to.
json.Unmarshal's handling of pointer fields is clearly documented, and works as you would expect, allocating a new value whenever there is a need to decode into a pointer shaped field. It is this behaviour that gives us a hint to what is happening in the original example.
We've seen that when json.Unmarshal encounters a field which points to nil it will allocate a new value of the correct type and assign its address to the field before proceeding. Not only is this behaviour applied recursively–for example in the case of a complex structure which contains pointers to other structures–but it also applies to the value passed to Unmarshal.
package main
import (
"encoding/json"
"fmt"
)
func main() {
content := []byte(`1`)
var result *int
err := json.Unmarshal(content, &result)
fmt.Println(*result, err) // 1 <nil>
}
In this example result is not a struct, but a simple *int which, lacking an initialiser, defaults to nil. After JSON decoding, result points to an int with the value 1.
Putting the pieces together
Now I think I’m ready to take a shot at answering Fatih’s question.
json.Unmarshal requires the address of the variable you want to decode into, otherwise it would decode into a temporary copy which would be discarded on return. Normally this is done by declaring a value, then passing its address, or explicitly initialising the value:
var result1 Result
err := json.Unmarshal(content, &result1) // this is fine
var result2 = new(Result)
err = json.Unmarshal(content, result2) // and this
var result3 = &Result{}
err = json.Unmarshal(content, result3) // this is also fine
In all three cases the address that the *Result points to is not nil; it points to initialised memory that json.Unmarshal decodes into.
Now consider what happens when json.Unmarshal encounters this:
var result4 *Result
err = json.Unmarshal(content, result4) // err json: Unmarshal(nil *main.Result)
result2, result3, and the expression &result1 point to a Result. However result4, even though it has the same type as the previous three, does not point to initialised memory; it points to nil. Thus, according to the examples we saw previously, before json.Unmarshal can decode into it, the memory result4 points to must be initialised.
However, because each function receives a copy of its arguments, the caller's result4 variable and the copy inside json.Unmarshal are unique. json.Unmarshal can allocate a new Result value and decode into it, but it cannot alter result4 to point to this new value because it was not provided with a reference to result4, only a copy of its contents.
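This asymmetry can be sketched without involving the json package at all. Here set and setViaAddress are hypothetical stand-ins for what Unmarshal can and cannot do with its argument:

```go
package main

import "fmt"

type Result struct{ Foo string }

// set receives a copy of the caller's pointer; reassigning p
// changes only the copy, never the caller's variable.
func set(p *Result) {
	p = &Result{Foo: "bar"}
	fmt.Println(p.Foo) // bar, but only inside this function
}

// setViaAddress receives the address of the caller's pointer,
// so it can make the caller's variable point somewhere new.
func setViaAddress(pp **Result) {
	*pp = &Result{Foo: "bar"}
}

func main() {
	var r1, r2 *Result
	set(r1)
	fmt.Println(r1) // <nil>
	setViaAddress(&r2)
	fmt.Println(r2.Foo) // bar
}
```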
Ensmallening Go binaries by prohibiting comparisons
Conventional wisdom dictates that the larger the number of types declared in a Go program, the larger the resulting binary. Intuitively this makes sense, after all, what's the point in defining a bunch of types if you're not going to write code that operates on them? However, part of the job of a linker is to detect functions which are not referenced by a program–say they are part of a library of which only a subset of functionality is used–and remove them from the final output. Yet, the adage mo' types, mo' binary holds true for the majority of Go programs.
In this post I’ll dig into what equality, in the context of a Go program, means and why changes like this have a measurable impact on the size of a Go program.
Defining equality between two values
The Go spec defines the concepts of assignability and equality. Assignability is the act of assigning a value to an identifier. Not everything which is declared can be assigned to, for example constants and functions. Equality is the act of comparing two identifiers by asking are their contents the same?
Being a strongly typed language, the notion of sameness is fundamentally rooted in the identifier’s type. Two things can only be the same if they are of the same type. Beyond that, the type of the values defines how they are compared.
For example, integers are compared arithmetically. For pointer types, equality is determined by whether the addresses they point to are the same. Reference types like maps and channels, like pointers, are considered to be the same if they have the same address.
These are all examples of bitwise equality, that is, if the bit patterns of the memory that value occupies are the same, those values are equal. This is known as memcmp, short for memory comparison, as equality is defined by comparing the contents of two areas of memory.
Hold on to this idea, I’ll come back to in a second.
Struct equality
Beyond scalar types like integers, floats, and pointers is the realm of compound types: structs. All structs are laid out in memory in program order, thus this declaration:
type S struct {
a, b, c, d int64
}
will consume 32 bytes of memory; 8 bytes for a, then 8 bytes for b, and so on. The spec says that struct values are comparable if all their fields are comparable. Thus two structs are equal iff each of their fields are equal.
a := S{1, 2, 3, 4}
b := S{1, 2, 3, 4}
fmt.Println(a == b) // prints true
Under the hood the compiler uses memcmp to compare the 32 bytes of a and b.
Padding and alignment
However the simplistic bitwise comparison strategy will fail in situations like this:
type S struct {
a byte
b uint64
c int16
d uint32
}
func main() {
a := S{1, 2, 3, 4}
b := S{1, 2, 3, 4}
fmt.Println(a == b) // prints true
}
The code compiles, the comparison is still true, but under the hood the compiler cannot rely on comparing the bit patterns of a and b because the structure contains padding.
Go requires each field in a struct to be naturally aligned. 2 byte values must start on an even address, four byte values on an address divisible by 4, and so on1. The compiler inserts padding to ensure the fields are aligned according to their type and the underlying platform. In effect, after padding, this is what the compiler sees2:
type S struct {
a byte
_ [7]byte // padding
b uint64
c int16
_ [2]byte // padding
d uint32
}
Padding exists to ensure the correct field alignments, and while it does take up space in memory, the contents of those padding bytes are unknown. You might assume that, being Go, the padding bytes are always zero, but it turns out that's not the case–the contents of padding bytes are simply not defined. Because they're not defined to always be a certain value, doing a bitwise comparison may return false because the nine bytes of padding spread throughout the 24 bytes of S may not be the same.
The Go compiler solves this problem by generating what is known as an equality function. In this case S's equality function knows how to compare two values of type S by comparing only the fields in the structure while skipping over the padding.
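A hand-written sketch of what such an equality function might look like; the compiler's generated code is not literally this, but the idea is the same:

```go
package main

import "fmt"

type S struct {
	a byte
	b uint64
	c int16
	d uint32
}

// eqS compares the fields of S one by one, never looking at
// the padding bytes between them.
func eqS(p, q *S) bool {
	return p.a == q.a && p.b == q.b && p.c == q.c && p.d == q.d
}

func main() {
	a := S{1, 2, 3, 4}
	b := S{1, 2, 3, 4}
	fmt.Println(eqS(&a, &b)) // true
}
```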
Type algorithms
Phew, that was a lot of setup to illustrate why, for each type defined in a Go program, the compiler may generate several supporting functions, known inside the compiler as the type’s algorithms. In addition to the equality function the compiler will generate a hash function if the type is used as a map key. Like the equality function, the hash function must consider factors like padding when computing its result to ensure it remains stable.
It turns out that it can be hard, and sometimes non obvious, to intuit when the compiler will generate these functions–it’s more than you’d expect–and it can be hard for the linker to eliminate the ones that are not needed as reflection often causes the linker to be more conservative when trimming types.
Reducing binary size by prohibiting comparisons
Now we're at a point to explain Brad's change. By adding an incomparable field 3 to the type, the resulting struct is by extension incomparable, thus forcing the compiler to elide the generation of eq and hash algorithms, short circuiting the linker's elimination of those types and, in practice, reducing the size of the final binary. As an example of this technique, this program:
package main
import "fmt"
func main() {
type t struct {
// _ [0][]byte uncomment to prevent comparison
a byte
b uint16
c int32
d uint64
}
var a t
fmt.Println(a)
}
when compiled with Go 1.14.2 (darwin/amd64), decreased from 2174088 to 2174056 bytes, a saving of 32 bytes. In isolation this 32 byte saving may seem like small beer, but consider that equality and hash functions can be generated for every type in the transitive closure of your program and all its dependencies, and the size of these functions varies depending on the size of the type and its complexity, so prohibiting them can have a sizeable impact on the final binary over and above the old saw of -ldflags="-s -w".
.
The bottom line: if you don't wish to make your types comparable, a hack like this enforces it at the source level while contributing a small reduction in the size of your binary.
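A sketch of the hack in isolation, following the commented-out field in the program above; the zero-length array occupies no space, but because []byte is incomparable, the struct becomes incomparable too:

```go
package main

import "fmt"

// point carries a zero-width incomparable field: [0][]byte takes
// up no memory, but makes any comparison of point a compile error,
// so no eq/hash functions are generated for it.
type point struct {
	_    [0][]byte
	X, Y int
}

func main() {
	p := point{X: 1, Y: 2}
	fmt.Println(p.X + p.Y) // 3
	// fmt.Println(p == point{X: 1, Y: 2}) // does not compile:
	// invalid operation, point cannot be compared
}
```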
Addendum: thanks to Brad’s prodding, Go 1.15 already has a bunch of improvements by Cherry Zhang and Keith Randall that fix the most egregious of the failures to eliminate unnecessary equality and hash functions (although I suspect it was also to avoid the proliferation of this class of CLs).
Mid-stack inlining in Go
In the previous post I discussed how leaf inlining allows the Go compiler to reduce the overhead of function calls and extend optimisation opportunities across function boundaries. In this post I’ll discuss the limits of inlining and leaf vs mid-stack inlining.
The limits of inlining
Inlining a function into its caller removes the call's overhead and increases the opportunity for the compiler to apply additional optimisations, so the question should be asked: if some inlining is good, would more be better? Why not inline as much as possible?
Inlining trades possibly larger program sizes for potentially faster execution time. The main reason to limit inlining is that creating many inlined copies of a function can increase compile time and result in larger binaries for marginal gain. Even taking into account the opportunities for further optimisation, aggressive inlining tends to increase the size of, and the time to compile, the resulting binary.
Inlining works best for small functions that do relatively little work compared to the overhead of calling them. As the size of a function grows, the time saved avoiding the call’s overhead diminishes relative to the work done inside the function. Larger functions tend to be more complex, thus the benefits of optimising their inlined forms vs in situ are reduced.
Inlining budget
During compilation each function's inlineability is calculated using what is known as the inlining budget1. The cost calculation can be tricky to internalise but is broadly one unit per node in the AST for simple things like unary and binary operations, but can be higher for complex operations like make. Consider this example:
package main
func small() string {
s := "hello, " + "world!"
return s
}
func large() string {
s := "a"
s += "b"
s += "c"
s += "d"
s += "e"
s += "f"
s += "g"
s += "h"
s += "i"
s += "j"
s += "k"
s += "l"
s += "m"
s += "n"
s += "o"
s += "p"
s += "q"
s += "r"
s += "s"
s += "t"
s += "u"
s += "v"
s += "w"
s += "x"
s += "y"
s += "z"
return s
}
func main() {
small()
large()
}
Compiling this program with -gcflags=-m=2 allows us to see the cost the compiler assigns to each function.
% go build -gcflags=-m=2 inl.go
# command-line-arguments
./inl.go:3:6: can inline small with cost 7 as: func() string { s := "hello, world!"; return s }
./inl.go:8:6: cannot inline large: function too complex: cost 82 exceeds budget 80
./inl.go:38:6: can inline main with cost 68 as: func() { small(); large() }
./inl.go:39:7: inlining call to small func() string { s := "hello, world!"; return s }
The compiler determined that func small() can be inlined due to its cost of 7. func large() was determined to be too expensive. func main() has been marked as eligible and assigned a cost of 68; 7 from the body of small, 57 from the function call to small, and the remainder in its own overhead.
The inlining budget can be controlled to some degree with the -gcflags=-l flag. Currently the values that apply are:
- -gcflags=-l=0 is the default level of inlining.
- -gcflags=-l (or -gcflags=-l=1) disables inlining.
- -gcflags=-l=2 and -gcflags=-l=3 are currently unused and have no effect over -gcflags=-l=0.
- -gcflags=-l=4 reduces the cost for inlining non-leaf functions and calls through interfaces.2
Hairy optimisations
Some functions with a relatively low inlining cost may be ineligible because of their complexity. This is known as the function’s hairiness, as the semantics of some operations are hard to reason about once inlined, for example recover and break. Others, like select and go, involve co-ordination with the runtime, so the extra effort of inlining doesn’t pay for itself.

The list of hairy statements also includes things like for and range, which don’t have an inherently large cost, but simply haven’t been optimised yet.
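To make the hairiness rule concrete, here is a small program of my own (not from the original post; the file name hairy.go is illustrative). As of the Go versions discussed here, sum is rejected for inlining because its body contains a range loop, even though its cost is tiny, while add is inlined. Compiling with -gcflags=-m=2 shows the compiler’s decision for each.

```go
package main

import "fmt"

// sum contains a range loop. for/range are on the compiler's list of
// "hairy" constructs, so this function is not inlined even though its
// cost is well under the budget. Verify with:
//   go build -gcflags=-m=2 hairy.go
func sum(xs []int) int {
	total := 0
	for _, x := range xs {
		total += x
	}
	return total
}

// add has no hairy statements and a very low cost, so it is inlined.
func add(a, b int) int { return a + b }

func main() {
	fmt.Println(sum([]int{1, 2, 3}))
	fmt.Println(add(4, 5))
}
```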
Mid stack inlining
Historically the Go compiler only performed leaf inlining–only functions which did not call other functions were eligible. In the context of the hairiness discussion previously, a function call would disqualify the function from being inlined.
Enter mid stack inlining which, as its name implies, allows functions in the middle of a call stack to be inlined without requiring everything below them to be eligible. Mid stack inlining was introduced by David Lazar in Go 1.9 and improved in subsequent releases. This presentation goes into some of the difficulties with retaining the behaviour of stack traces and runtime.Callers in code paths that had been heavily inlined.
We see an example of mid-stack inlining in the previous example. After inlining, func main() contains the body of func small() and a call to func large(), thus it is considered a non-leaf function. Historically this would have prevented it from being further inlined even though its combined cost was less than the inlining budget.
The primary use case for mid stack inlining is to reduce the overhead of a path through the call stack. Consider this example:
package main
import (
"fmt"
"strconv"
)
type Rectangle struct {}
//go:noinline
func (r *Rectangle) Height() int {
h, _ := strconv.ParseInt("7", 10, 0)
return int(h)
}
func (r *Rectangle) Width() int {
return 6
}
func (r *Rectangle) Area() int { return r.Height() * r.Width() }
func main() {
var r Rectangle
fmt.Println(r.Area())
}
In this example r.Area() is a simple function which calls two others. r.Width() can be inlined, while r.Height(), simulated here with the //go:noinline annotation, cannot.3
% go build -gcflags='-m=2' square.go
# command-line-arguments
./square.go:12:6: cannot inline (*Rectangle).Height: marked go:noinline
./square.go:17:6: can inline (*Rectangle).Width with cost 2 as: method(*Rectangle) func() int { return 6 }
./square.go:21:6: can inline (*Rectangle).Area with cost 67 as: method(*Rectangle) func() int { return r.Height() * r.Width() }
./square.go:21:61: inlining call to (*Rectangle).Width method(*Rectangle) func() int { return 6 }
./square.go:23:6: cannot inline main: function too complex: cost 150 exceeds budget 80
./square.go:25:20: inlining call to (*Rectangle).Area method(*Rectangle) func() int { return r.Height() * r.Width() }
./square.go:25:20: inlining call to (*Rectangle).Width method(*Rectangle) func() int { return 6 }
As the multiplication performed by r.Area() is cheap compared to the overhead of calling it, inlining r.Area()’s single expression is a net win even if the downstream call to r.Height() remains ineligible.
Fast path inlining
The most startling example of the power of mid-stack inlining comes from 2019, when Carlo Alberto Ferraris improved the performance of sync.Mutex.Lock() by allowing the fast path of the lock–the uncontended case–to be inlined into its caller. Prior to this change sync.Mutex.Lock() was a large function containing many hairy conditions which made it ineligible to be inlined. Even in the case where the lock was available, the caller had to pay the overhead of calling sync.Mutex.Lock().
Carlo’s change split sync.Mutex.Lock() into two functions (a process he dubbed outlining). The outer sync.Mutex.Lock() method now calls sync/atomic.CompareAndSwapInt32() and returns to the caller immediately if the CAS succeeds. If not, the function falls through to sync.Mutex.lockSlow(), which handles the slow path required to register interest on the lock and park the goroutine.4
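The outlining pattern generalises beyond sync.Mutex. Here is a sketch of the same shape using a toy spin lock (SpinLock and lockSlow are my own names; this is not how sync.Mutex is actually implemented): the outer Lock is small enough to inline, and only falls through to a non-inlineable slow path when the CAS fails.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// SpinLock is a hypothetical lock used only to illustrate outlining.
type SpinLock struct {
	state int32
}

// Lock attempts the uncontended fast path with a single CAS. Its body
// is cheap enough that the compiler can inline it into callers; only
// on contention does it call the slow path.
func (l *SpinLock) Lock() {
	if atomic.CompareAndSwapInt32(&l.state, 0, 1) {
		return // fast path: lock acquired without a function call
	}
	l.lockSlow()
}

//go:noinline
func (l *SpinLock) lockSlow() {
	// A real implementation would park the goroutine; spinning is
	// enough for this sketch.
	for !atomic.CompareAndSwapInt32(&l.state, 0, 1) {
	}
}

func (l *SpinLock) Unlock() {
	atomic.StoreInt32(&l.state, 0)
}

func main() {
	var l SpinLock
	l.Lock()
	fmt.Println("locked")
	l.Unlock()
	fmt.Println("unlocked")
}
```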
% go build -gcflags='-m=2 -l=0' sync 2>&1 | grep '(*Mutex).Lock'
../go/src/sync/mutex.go:72:6: can inline (*Mutex).Lock with cost 69 as: method(*Mutex) func() { if "sync/atomic".CompareAndSwapInt32(&m.state, 0, mutexLocked) { if race.Enabled { }; return }; m.lockSlow() }
By splitting the function into an easily inlineable outer function, falling through to a complex inner function to handle the slow path, Carlo combined mid stack inlining and the compiler’s support for intrinsic operations to reduce the cost of an uncontended lock by 14%. Then he repeated the trick for an additional 9% saving in sync.RWMutex.Unlock().
Inlining optimisations in Go
This is a post about how the Go compiler implements inlining and how this optimisation affects your Go code.
n.b. This article focuses on gc, the de facto Go compiler from golang.org. The concepts discussed apply broadly to other Go compilers like gccgo and tinygo but may differ in implementation and efficacy.
What is inlining?
Inlining is the act of combining smaller functions into their respective callers. In the early days of computing this optimisation was typically performed by hand. Nowadays inlining is one of a class of fundamental optimisations performed automatically during the compilation process.
Why is inlining important?
Inlining is important for two reasons. The first is it removes the overhead of the function call itself. The second is it permits the compiler to more effectively apply other optimisation strategies.
Function call overhead
Calling a function1 in any language carries a cost. There are the overheads of marshalling parameters into registers or onto the stack (depending on the ABI) and reversing the process on return. Invoking a function call involves jumping the program counter from one point in the instruction stream to another which can cause a pipeline stall. Once inside the function there is usually some preamble required to prepare a new stack frame for the function to execute and a similar epilogue needed to retire the frame before returning to the caller.
In Go a function call carries additional costs to support dynamic stack growth. On entry the amount of stack space available to the goroutine is compared to the amount required for the function. If insufficient stack space is available, the preamble jumps into the runtime logic that grows the stack by copying it to a new, larger, location. Once this is done the runtime jumps back to the start of the original function, the stack check is performed again, which now passes, and the call continues. In this way goroutines can start with a small stack allocation which grows only when needed.2
This check is cheap–only a few instructions–and because goroutine stacks grow geometrically the check rarely fails. Thus, the branch prediction unit inside a modern processor can hide the cost of the stack check by assuming it will always be successful. In the case where the processor mis-predicts the stack check and has to discard the work done while it was executing speculatively, the cost of the pipeline stall is relatively small compared to the cost of the work needed for the runtime to grow a goroutine’s stack.
While the overhead of the generic and Go specific components of each function call are well optimised by modern processors using speculative execution techniques, those overheads cannot be entirely eliminated, thus each function call carries with it a performance cost over and above the time it takes to perform useful work. As a function call’s overhead is fixed, smaller functions pay a larger cost relative to larger ones because they tend to do less useful work per invocation.
The solution to eliminating these overheads must therefore be to eliminate the function call itself, which the Go compiler does, under certain conditions, by replacing the call to a function with the contents of the function. This is known as inlining because it brings the body of the function in line with its caller.
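As a minimal sketch of what this substitution looks like (double is a made-up function, not from the post), the two assignments below compute the same value; after inlining, the compiler effectively rewrites the first form into the second:

```go
package main

import "fmt"

func double(x int) int { return 2 * x }

func main() {
	// With the call: the compiler would normally emit a CALL to double,
	// with all the overhead described above.
	a := double(21)

	// After inlining, the call site behaves as if the body had been
	// written directly in place; no CALL instruction remains.
	b := 2 * 21

	fmt.Println(a, b) // prints 42 42
}
```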
Improved optimisation opportunities
Dr. Cliff Click describes inlining as the optimisation performed by modern compilers, as it forms the basis for optimisations like constant propagation and dead code elimination. In effect, inlining allows the compiler to see further, letting it observe, in the context that a particular function is being called, logic that can be further simplified or eliminated entirely. As inlining can be applied recursively, optimisation decisions can be made not only in the context of each individual function, but also applied to the chain of functions in a call path.
Inlining in action
The effects of inlining can be demonstrated with this small example:
package main
import "testing"
//go:noinline
func max(a, b int) int {
if a > b {
return a
}
return b
}
var Result int
func BenchmarkMax(b *testing.B) {
var r int
for i := 0; i < b.N; i++ {
r = max(-1, i)
}
Result = r
}
Running this benchmark gives the following result:3
% go test -bench=.
BenchmarkMax-4   530687617   2.24 ns/op
The cost of max(-1, i) is around 2.24 nanoseconds on my 2015 MacBook Air. Now let’s remove the //go:noinline pragma and see the result:
% go test -bench=.
BenchmarkMax-4   1000000000   0.514 ns/op
From 2.24 ns to 0.51 ns, or according to benchstat, a 78% improvement.
% benchstat {old,new}.txt
name   old time/op  new time/op  delta
Max-4  2.21ns ± 1%  0.49ns ± 6%  -77.96%  (p=0.000 n=18+19)
Where did these improvements come from?
First, the removal of the function call and associated preamble4 was a major contributor. Pulling the contents of max into its caller reduced the number of instructions executed by the processor and eliminated several branches.

Second, now that the contents of max are visible to the compiler as it optimises BenchmarkMax, it can make some additional improvements. Consider that once max is inlined, this is what the body of BenchmarkMax looks like to the compiler:
func BenchmarkMax(b *testing.B) {
var r int
for i := 0; i < b.N; i++ {
if -1 > i {
r = -1
} else {
r = i
}
}
Result = r
}
Running the benchmark again we see our manually inlined version performs as well as the version inlined by the compiler:
% benchstat {old,new}.txt
name   old time/op  new time/op  delta
Max-4  2.21ns ± 1%  0.48ns ± 3%  -78.14%  (p=0.000 n=18+18)
Now that the compiler has access to the result of inlining max into BenchmarkMax, it can apply optimisation passes which were not possible before. For example, the compiler has noted that i is initialised to 0 and only incremented, so any comparison with i can assume i will never be negative. Thus, the condition -1 > i will never be true.5
Having proved that -1 > i will never be true, the compiler can simplify the code to
func BenchmarkMax(b *testing.B) {
var r int
for i := 0; i < b.N; i++ {
if false {
r = -1
} else {
r = i
}
}
Result = r
}
and because the branch is now a constant, the compiler can eliminate the unreachable path leaving it with
func BenchmarkMax(b *testing.B) {
var r int
for i := 0; i < b.N; i++ {
r = i
}
Result = r
}
Thus, through inlining and the optimisations it unlocks, the compiler has reduced the expression r = max(-1, i) to simply r = i.
The limits of inlining
In this article I’ve discussed so-called leaf inlining: the act of inlining a function at the bottom of a call stack into its direct caller. Inlining is a recursive process; once a function has been inlined into its caller, the compiler may inline the resulting code into its caller, and so on. For example, this code
func BenchmarkMaxMaxMax(b *testing.B) {
var r int
for i := 0; i < b.N; i++ {
r = max(max(-1, i), max(0, i))
}
Result = r
}
runs as fast as the previous example, as the compiler is able to repeatedly apply the optimisations outlined above to reduce the code to the same r = i expression.
In the next article I’ll discuss an alternative inlining strategy for when the Go compiler wishes to inline a function in the middle of a call stack. Finally I’ll discuss the lengths the compiler is prepared to go to in order to inline code, and which Go constructs are currently beyond its capability.
go test -v streaming output
The testing package is one of my favourite packages in the Go standard library, not just because of its low noise approach to unit testing, but because, over the lifetime of Go, it has received a steady stream of quality of life improvements driven by real world usage.

The most recent example of this is that, in Go 1.14, go test -v will stream t.Log output as it happens, rather than hoarding it until the end of the test run. Here’s an example:
package main
import (
"fmt"
"testing"
"time"
)
func TestLogStreaming(t *testing.T) {
for i := 0; i < 5; i++ {
time.Sleep(300 * time.Millisecond)
fmt.Println("fmt.Println:", i)
t.Log("t.Log:", i)
}
}
Note: Calling fmt.Println inside a test is generally considered a no-no as it bypasses the testing package’s output buffering irrespective of the -v flag. However, for this example, it’s necessary to demonstrate the streaming t.Log change.
% go1.13 test -v tlog_test.go
=== RUN TestLogStreaming
fmt.Println: 0
fmt.Println: 1
fmt.Println: 2
fmt.Println: 3
fmt.Println: 4
--- PASS: TestLogStreaming (1.52s)
tlog_test.go:13: t.Log: 0
tlog_test.go:13: t.Log: 1
tlog_test.go:13: t.Log: 2
tlog_test.go:13: t.Log: 3
tlog_test.go:13: t.Log: 4
PASS
ok command-line-arguments 1.971s
Under Go 1.13 and earlier the fmt.Println lines are output immediately, while the t.Log lines are buffered and printed only after the test completes.
% go1.14 test -v tlog_test.go
=== RUN TestLogStreaming
fmt.Println: 0
TestLogStreaming: tlog_test.go:13: t.Log: 0
fmt.Println: 1
TestLogStreaming: tlog_test.go:13: t.Log: 1
fmt.Println: 2
TestLogStreaming: tlog_test.go:13: t.Log: 2
fmt.Println: 3
TestLogStreaming: tlog_test.go:13: t.Log: 3
fmt.Println: 4
TestLogStreaming: tlog_test.go:13: t.Log: 4
--- PASS: TestLogStreaming (1.51s)
PASS
ok command-line-arguments 1.809s
Under Go 1.14 the fmt.Println and t.Log lines are interleaved, rather than waiting for the test to complete, demonstrating that test output is streamed when go test -v is used.
This is a great quality of life improvement for integration style tests that often retry for long periods when the test is failing. Streaming t.Log output will help Gophers debug those test failures without having to wait until the entire test times out to receive their output.
Are large slices more expensive than smaller ones?
Programmers have a tendency to be superstitious. In particular, when a programmer hears that copies are expensive, they start to see them everywhere, especially once they learn that, in Go, every assignment is a copy.
Consider this code; x is three orders of magnitude larger than y. Is the assignment of x to a more expensive than the assignment of y to b?
func f() {
	x, y := make([]byte, 9000), make([]byte, 9)
	a := x
	b := y
	// ...
}
The answer is: no. x and y have the same type, []byte, that is, a slice of bytes. As both variables have the same type, their assignment involves copying the same amount of data. Both assignments have the same cost.
All slices are the same size: three machine words (three uintptrs). The first word in the slice is a pointer to the slice’s backing array, the storage for the slice; the second word is the slice’s length; and the third is the capacity. Assigning one slice variable to another copies just three machine words.
Further reading: Go slices: usage and internals (blog.golang.org)