通过内存分配来学习 go 中的机制

在前一篇博客中，我介绍了逃逸分析的基础场景。但是还有一些其他场景，我并没有做介绍。为了介绍其他场景，我专门写了了一个程序用于 debug，这个程序中分配内存的方式比较让人吃惊。

程序

为了更多的学习 io 包，我尝试了一个快速的项目。找到字节流中的字符串 elvis ，并且替换为首字母大写的字符串 Elvis 。

代码中列出了两个用于解决这个这个问题的函数。这个博客主要集中于函数 algOne ，因为这个函数用到了 io 包。

下面的数据中，一个是输入，一个是希望通过函数 algOne 作用之后的输出。

Listing 1

Input:
abcelvisaElvisabcelviseelvisaelvisaabeeeelvise l v i saa bb e l v i saa elvi
selvielviselvielvielviselvi1elvielviselvis

Output:
abcElvisaElvisabcElviseElvisaElvisaabeeeElvise l v i saa bb e l v i saa elvi
selviElviselvielviElviselvi1elviElvisElvis
复制代码

下面是函数 algOne

Listing 2

80 func algOne(data []byte, find []byte, repl []byte, output *bytes.Buffer) {
 81
 82     // Use a bytes Buffer to provide a stream to process.
 83     input := bytes.NewBuffer(data)
 84
 85     // The number of bytes we are looking for.
 86     size := len(find)
 87
 88     // Declare the buffers we need to process the stream.
 89     buf := make([]byte, size)
 90     end := size - 1
 91
 92     // Read in an initial number of bytes we need to get started.
 93     if n, err := io.ReadFull(input, buf[:end]); err != nil {
 94         output.Write(buf[:n])
 95         return
 96     }
 97
 98     for {
 99
100         // Read in one byte from the input stream.
101         if _, err := io.ReadFull(input, buf[end:]); err != nil {
102
103             // Flush the reset of the bytes we have.
104             output.Write(buf[:end])
105             return
106         }
107
108         // If we have a match, replace the bytes.
109         if bytes.Compare(buf, find) == 0 {
110             output.Write(repl)
111
112             // Read a new initial number of bytes.
113             if n, err := io.ReadFull(input, buf[:end]); err != nil {
114                 output.Write(buf[:n])
115                 return
116             }
117
118             continue
119         }
120
121         // Write the front byte since it has been compared.
122         output.WriteByte(buf[0])
123
124         // Slice that front byte out.
125         copy(buf, buf[1:])
126     }
127 }
复制代码

我想知道这个函数的表现以及函数给堆上的压力。为了了解这些，我们需要运行下 benchmark。

Benchmarking

下面是用来运行函数 algOne 来处流数据的 benchmark 函数

Listing 3

15 func BenchmarkAlgorithmOne(b *testing.B) {
16     var output bytes.Buffer
17     in := assembleInputStream()
18     find := []byte("elvis")
19     repl := []byte("Elvis")
20
21     b.ResetTimer()
22
23     for i := 0; i < b.N; i++ {
24         output.Reset()
25         algOne(in, find, repl, &output)
26     }
27 }
复制代码

有了这个函数，我们就可以运行 go test 了，并且可以使用选项 -bench ， -benchtime 和 -benchmem 选项。

Listing 4

$ go test -run none -bench AlgorithmOne -benchtime 3s -benchmem
BenchmarkAlgorithmOne-8    	2000000 	     2522 ns/op       117 B/op  	      2 allocs/op
复制代码

在运行 benchmark 之后，我们可以看到函数 algOne 函数的每次操作都分配了两次内存，并且分配的内存大小为 117 字节。这个表现非常好了，但是我们需要知道是哪些代码造成了这些内存的分配。为了知道这些，我们需要产生运行 benchmark 的 profiling data。

Profiling

为了产生 profile data，我们需要运行 benchmark，不过这次需要使用选项 -memprofile 选项。

Listing 5

$ go test -run none -bench AlgorithmOne -benchtime 3s -benchmem -memprofile mem.out
BenchmarkAlgorithmOne-8    	2000000 	     2570 ns/op       117 B/op  	      2 allocs/op
复制代码

在程序运行完之后，就会产生两个新的文件。

Listing 6

~/code/go/src/.../memcpu
$ ls -l
total 9248
-rw-r--r--  1 bill  staff      209 May 22 18:11 mem.out       (NEW)
-rwxr-xr-x  1 bill  staff  2847600 May 22 18:10 memcpu.test   (NEW)
-rw-r--r--  1 bill  staff     4761 May 22 18:01 stream.go
-rw-r--r--  1 bill  staff      880 May 22 14:49 stream_test.go
复制代码

源码所在的文件夹为 memcpu ，函数 algOne 就存在于文件 stream.go 中，函数 BenchmarkAlgorithmOne 存在于 stream_test.go 。两个产生的文件分别是 mem.out 和 memcpu.test 。文件 mem.out 包含了 profiles data。文件 memcpu.test 是一个二进制文件，当我们需要看 profile data 的时候需要使用到这个文件。

有了 profile data 和二进制文件，我们就可以运行 pprof 工具来学习 profile data。

Listing 7

$ go tool pprof -alloc_space memcpu.test mem.out
Entering interactive mode (type "help" for commands)
(pprof) _
复制代码

当需要 profiling memory 并且寻找容易解决的问题的时候，我们需要使用选项 -alloc_space 而不是默认的选项 -inuse_space 。这个选项会展示每次分配内存的情况，而不管你 take the profile 的时候，分配的内存是否还在使用。

通过 pprof 的作用，我们可以使用 list 命令来检查函数 algOne 的情况。 list 命令接受一个正则表达式，用于匹配表达式匹配的函数。

Listing 8

(pprof) list algOne
Total: 335.03MB
ROUTINE ======================== .../memcpu.algOne in code/go/src/.../memcpu/stream.go
 335.03MB   335.03MB (flat, cum)   100% of Total
        .          .     78:
        .          .     79:// algOne is one way to solve the problem.
        .          .     80:func algOne(data []byte, find []byte, repl []byte, output *bytes.Buffer) {
        .          .     81:
        .          .     82: // Use a bytes Buffer to provide a stream to process.
 318.53MB   318.53MB     83: input := bytes.NewBuffer(data)
        .          .     84:
        .          .     85: // The number of bytes we are looking for.
        .          .     86: size := len(find)
        .          .     87:
        .          .     88: // Declare the buffers we need to process the stream.
  16.50MB    16.50MB     89: buf := make([]byte, size)
        .          .     90: end := size - 1
        .          .     91:
        .          .     92: // Read in an initial number of bytes we need to get started.
        .          .     93: if n, err := io.ReadFull(input, buf[:end]); err != nil || n < end {
        .          .     94:       output.Write(buf[:n])
(pprof) _
复制代码

基于这个 profile，我们可以知道 input 以及切片 buf 的底层数组被分配到了堆。由于 input 是指针，所以这个 profile 是说明， input 所指向的 bytes.Buffer 是分配的到堆的。所以我们先聚焦于变量 input 的变量的分配，并且理解是如何分配的。

由于函数 bytes.NewBuffer 创建的变量，和函数 algOne 共享，所以导致变量分配到堆。并且 flat 列(pprof 输出的第一列)出现的值告诉我们这个值是分配到堆的，因为函数 algOne 共享变量的原因导致的变量分配逃逸到堆。

flat 列表示的是函数的堆的分配，可以看看 list 命令展示函数 Benchmark 是如何调用函数 algOne 的。

Listing 9

(pprof) list Benchmark
Total: 335.03MB
ROUTINE ======================== .../memcpu.BenchmarkAlgorithmOne in code/go/src/.../memcpu/stream_test.go
        0   335.03MB (flat, cum)   100% of Total
        .          .     18: find := []byte("elvis")
        .          .     19: repl := []byte("Elvis")
        .          .     20:
        .          .     21: b.ResetTimer()
        .          .     22:
        .   335.03MB     23: for i := 0; i < b.N; i++ {
        .          .     24:       output.Reset()
        .          .     25:       algOne(in, find, repl, &output)
        .          .     26: }
        .          .     27:}
        .          .     28:
(pprof) _
复制代码

由于只有第二列 cum 才有值，所以函数 Benchmark 函数并不直接的创建任何变量到堆的。在循环内部，每次对函数调用的时候都会分配变量到堆。你可以看到两次对 list 命令调用的时候，分配的值到堆是匹配的(译者注：$$318.53 + 16.50 = 335.03$$)。

到此呢，我们仍然不知道为什么 bytes.Buffer 会创建变量到堆。这个时候可以使用 go build 命令的 -gcflags "-m -m" 选项了。 profiler 会告诉我们值逃逸到的堆，而 go build 命令会告诉我们为什么。

编译器报告

我们可以让编译器告诉我们代码里面变量逃逸到堆的原因。

Listing 10

$ go build -gcflags "-m -m"
复制代码

这个命令会产生非常多的输出。我们需要找到的就是包含 stream.go:83 的行，因为 stream.go 是文件的名称，并且第 83 行含有代码来构建 bytes.buffer 的值。在搜索之后，找到了如下 6 行。

Listing 11

./stream.go:83: inlining call to bytes.NewBuffer func([]byte) *bytes.Buffer { return &bytes.Buffer literal }

./stream.go:83: &bytes.Buffer literal escapes to heap
./stream.go:83:   from ~r0 (assign-pair) at ./stream.go:83
./stream.go:83:   from input (assigned) at ./stream.go:83
./stream.go:83:   from input (interface-converted) at ./stream.go:93
./stream.go:83:   from input (passed to call[argument escapes]) at ./stream.go:93
复制代码

第一行是非常有意思的

Listing 12

./stream.go:83: inlining call to bytes.NewBuffer func([]byte) *bytes.Buffer { return &bytes.Buffer literal }
复制代码

这句话告诉了我们 bytes.Buffer 逃逸到堆的原因并不是对函数 bytes.Buffer 调用造成的。因为 bytes.Buffer 压根没有被调用，函数的操作被内联到了调用的地方。

第 83 行的的如下代码

Listing 13

83     input := bytes.NewBuffer(data)
复制代码

由于编译器选择把 bytes.NewBuffer 内联到代码里面，所以上面的代码在实际调用的时候是如下的

Listing 14

input := &bytes.Buffer{buf: data}
复制代码

这就意味着函数 algOne 是直接创建 bytes.Buffer 的。那么到底是什么导致 input 被分配到堆中的呢？答案就在剩下的五行报告中。

Listing 15

./stream.go:83: &bytes.Buffer literal escapes to heap
./stream.go:83:   from ~r0 (assign-pair) at ./stream.go:83
./stream.go:83:   from input (assigned) at ./stream.go:83
./stream.go:83:   from input (interface-converted) at ./stream.go:93
./stream.go:83:   from input (passed to call[argument escapes]) at ./stream.go:93
复制代码

上面的这些内容告诉我们是第 93 行造成的值逃逸的。因为 input 变量被赋值给了一个接口。

接口

我并没有印象在代码中对接口有过赋值的操作。但是如果看了第 93 行代码，问题就变得清晰了。

Listing 16

93     if n, err := io.ReadFull(input, buf[:end]); err != nil {
 94         output.Write(buf[:n])
 95         return
 96     }
复制代码

由于调用了 io.ReadFull 函数，所以造成了对接口的赋值。如果你看了 io.ReadFull 的定义，你可以看到函数 io.ReadFull 接受的第一个参数是一个接口。

Listing 17

type Reader interface {
      Read(p []byte) (n int, err error)
}

func ReadFull(r Reader, buf []byte) (n int, err error) {
      return ReadAtLeast(r, buf, len(buf))
}
复制代码

这个说明了，把 bytes.Buffer 的地址传递给函数，然后函数把这个地址作为一个接口存储，这就造成了变量逃逸到了堆。现在我们看到了使用接口的代价：变量分配到堆和变量的间接使用(如果分配到栈，变量的访问速度会更快)。如果使用接口并没有使得代码变得更好，那就最好别使用接口。我跟随这下面这些指导来使用接口

当有下面几种情况的时候，我会使用接口

用户需要自己实现接口的细节
API 有许多实现方法，需要各自维护其细节
API 的部分操作随着时间会改变，需要解耦

不需要使用接口的情况如下

为了使用接口而使用接口
用于完成一个算法
当用户可以自己定义接口的时候

现在我们需要问自己，这个算法真的需要使用 io.ReadFull 函数吗？答案是否定的，因为 bytes.Buffer 类型有一系列方法可以使用，并且使用这些方法可以有效的避免变量被分配到堆。

现在我们可以移去 io 包，并使用 input 变量已有的方法 Read 。

下面的代码移去了 io 包，为了保持新的代码行和原来的代码行不变，使用了变量 _ 来避免导入 io 包。这样就可以保持 io 包还在引入的行列中。

Listing 18

12 import (
 13     "bytes"
 14     "fmt"
 15     _ "io"
 16 )

 80 func algOne(data []byte, find []byte, repl []byte, output *bytes.Buffer) {
 81
 82     // Use a bytes Buffer to provide a stream to process.
 83     input := bytes.NewBuffer(data)
 84
 85     // The number of bytes we are looking for.
 86     size := len(find)
 87
 88     // Declare the buffers we need to process the stream.
 89     buf := make([]byte, size)
 90     end := size - 1
 91
 92     // Read in an initial number of bytes we need to get started.
 93     if n, err := input.Read(buf[:end]); err != nil || n < end {
 94         output.Write(buf[:n])
 95         return
 96     }
 97
 98     for {
 99
100         // Read in one byte from the input stream.
101         if _, err := input.Read(buf[end:]); err != nil {
102
103             // Flush the reset of the bytes we have.
104             output.Write(buf[:end])
105             return
106         }
107
108         // If we have a match, replace the bytes.
109         if bytes.Compare(buf, find) == 0 {
110             output.Write(repl)
111
112             // Read a new initial number of bytes.
113             if n, err := input.Read(buf[:end]); err != nil || n < end {
114                 output.Write(buf[:n])
115                 return
116             }
117
118             continue
119         }
120
121         // Write the front byte since it has been compared.
122         output.WriteByte(buf[0])
123
124         // Slice that front byte out.
125         copy(buf, buf[1:])
126     }
127 }
复制代码

当我们再次运行 benchmark 的时候，就可以看到变量 bytes.Buffer 不再分配到堆中了。

Listing 19

$ go test -run none -bench AlgorithmOne -benchtime 3s -benchmem -memprofile mem.out
BenchmarkAlgorithmOne-8    	2000000 	     1814 ns/op         5 B/op  	      1 allocs/op
复制代码

也可以从上面的输出看到，代码性能提升了约 29%。代码花费的时间由 2570 ns/op 到 1814 ns/op。既然这个问题解决了，我们现在就可以聚焦于切片 buf 背后的数组分配到了堆的问题。如果我们使用新的代码，来运行得到 profile 的结果，我们也许就可以解决这个问题了。

Listing 20

$ go tool pprof -alloc_space memcpu.test mem.out
Entering interactive mode (type "help" for commands)
(pprof) list algOne
Total: 7.50MB
ROUTINE ======================== .../memcpu.BenchmarkAlgorithmOne in code/go/src/.../memcpu/stream_test.go
     11MB       11MB (flat, cum)   100% of Total
        .          .     84:
        .          .     85: // The number of bytes we are looking for.
        .          .     86: size := len(find)
        .          .     87:
        .          .     88: // Declare the buffers we need to process the stream.
     11MB       11MB     89: buf := make([]byte, size)
        .          .     90: end := size - 1
        .          .     91:
        .          .     92: // Read in an initial number of bytes we need to get started.
        .          .     93: if n, err := input.Read(buf[:end]); err != nil || n < end {
        .          .     94:       output.Write(buf[:n])
复制代码

现在唯一分配到堆的一行就是第 89 行了，这部分的分配就是切片底层的数组。

栈帧

我们需要知道为什么 buf 底层的数组分配到了堆。再次运行 go build 指令，并且使用参数 -gcflags "-m -m" ，在输出的结果中搜索 stream.go:89 。

Listing 21

$ go build -gcflags "-m -m"
./stream.go:89: make([]byte, size) escapes to heap
./stream.go:89:   from make([]byte, size) (too large for stack) at ./stream.go:89
复制代码

报告中说的是分配的数组对于栈来说太大了。这个信息是非常的有迷惑性的。因为并不是底层数组太大了，而是编译器在编译的时候不知道底层数组的大小。

只有在编译器在编译期间知道值的大小的时候，值才会被分配到栈。这是因为每个函数的栈帧的大小都是在编译期间计算的。如果编译器不知道一个值的大小，那么编译器会把值分配到堆上。

为了展示这个，我们暂时硬编码切片的大小为 5 到代码中去

Listing 22

89     buf := make([]byte, 5)
复制代码

这个时候再运行 benchmark，所有的分配到堆的操作都没有了。

Listing 23

$ go test -run none -bench AlgorithmOne -benchtime 3s -benchmem
BenchmarkAlgorithmOne-8    	3000000 	     1720 ns/op         0 B/op  	      0 allocs/op
复制代码

如果再次查看编译器的报告，你会发现没有变量的逃逸行为

Listing 24

$ go build -gcflags "-m -m"
./stream.go:83: algOne &bytes.Buffer literal does not escape
./stream.go:89: algOne make([]byte, 5) does not escape
复制代码

显然，并不能硬编码切片的大小到代码中，所以代码中知道存在着一次的变量分配到堆的操作。

分配和性能

有了三次的修改，我们可以查看、对比每次修改后的性能

Listing 25

Before any optimization
BenchmarkAlgorithmOne-8    	2000000 	     2570 ns/op       117 B/op  	      2 allocs/op

Removing the bytes.Buffer allocation
BenchmarkAlgorithmOne-8    	2000000 	     1814 ns/op         5 B/op  	      1 allocs/op

Removing the backing array allocation
BenchmarkAlgorithmOne-8    	3000000 	     1720 ns/op         0 B/op  	      0 allocs/op
复制代码

在第一优化的时候，性能提升大约 29%。第二次优化之后，性能提升约 33%。通过这些数据，我们可以看到变量分配到堆是影响程序性能的。

结论

go 有许多让人吃惊的工具，来让我们理解编译器在涉及到逃逸分析是所做的决定的原由。基于这些信息，我们可以修改代码以保持可以存在于栈中的变量避免存在于堆中。你并不需要完成一个在堆上不分配内存的程序，但是你需要使得这些操作尽可能的避免。

永远不要基于程序的性能写代码，因为你不想猜测程序的性能。我们应该首先基于正确性来写代码。这意味着需要聚焦于整体性，可读性和简单性。在你有了一个程序的时候，确认下程序是否运行的足够块。如果不够快，那么可以使用 go 提供的工具来找到修复程序运行慢的问题。

欢迎关注我们的微信公众号，每天学习Go知识

程序