Golang性能调优--工具篇

软件开发过程中，项目上线并不是终点。上线后，还要对程序的取样分析运行情况，并重构现有的功能，让程序执行更高效更稳写。 golang的工具包内自带pprof功能，使找出程序中占内存和CPU较多的部分功能方便了不少。加上uber的火焰图，可视化显示，让我们在分析程序时更简单明了。

准备工作

安装graphviz
- brew install graphviz
将 $GOPATH/bin 加⼊ $PATH
- Mac OS: 在 .bash_profile 中修改路径
安装 go-torch（不使用可以不装）
- go get github.com/uber/go-torch
- 下载并复制 flamegraph.pl （https://github.com/brendangregg/FlameGraph）⾄ $GOPATH/bin 路径下
- 将 $GOPATH/bin 加⼊ $PATH

pprof

采样方式：
- runtime/pprof：工具型应用，采集程序（非 Server）的运行数据进行分析
- net/http/pprof：服务型应用，采集 HTTP Server 的运行时数据进行分析
支持模式：
- Report generation：报告生成
- Interactive terminal use：交互式终端使用
- Web interface：Web 界面
分析类型
- CPU Profiling：
  
  CPU 分析是最常见的性能分析类型。
  
  按照一定的频率（默认10ms一次）采集所监听的应用程序 CPU（含寄存器）的使用情况，可确定应用程序在主动消耗 CPU 周期时花费时间的位置。
- Memory Profiling：
  
  内存性能分析记录堆内存分配时的堆栈信息，忽略栈内存分配信息。
  
  内存性能分析启用时，默认每1000次采样1次，这个比例是可以调整的。因为内存性能分析是基于采样的，因此基于内存分析数据来判断程序所有的内存使用情况是很困难的。
- Block Profiling：
  阻塞性能分析是 Go 特有的。
  
  阻塞性能分析用来记录一个协程等待一个共享资源花费的时间。在判断程序的并发瓶颈时会很有用。阻塞的场景包括：
  - 在没有缓冲区的信道上发送或接收数据。
  - 从空的信道上接收数据，或发送数据到满的信道上。
  - 尝试获得一个已经被其他协程锁住的排它锁。
  一般情况下，当所有的 CPU 和内存瓶颈解决后，才会考虑这一类分析。
  
  默认不开启，需要调用 runtime.SetBlockProfileRate 进行设置。
- Mutex Profiling：
  
  锁性能分析与阻塞分析类似，但专注于因为锁竞争导致的等待或延时。
  
  默认不开启，需要调用 runtime.SetMutexProfileFraction 进行设置。
- Goroutine Profiling：
  
  Goroutine 分析，可以对当前应用程序正在运行的 Goroutine 进行堆栈跟踪和分析。这项功能在实际排查中会经常用到，因为很多问题出现时的表象就是 Goroutine 暴增，而这时候我们要做的事情之一就是查看应用程序中的 Goroutine 正在做什么事情，因为什么阻塞了，然后再进行下一步。

采样

runtime/pprof：工具型应用

通过嵌入代码，手动调用 runtime/pprof 的 API
灵活性⾼，适⽤于特定代码段的分析
API 相关⽂档 https://studygolang.com/static/pkgdoc/pkg/runtime_pprof.htm
go tool pprof [binary] [binary.prof]

开启方式

1
2
3
4
5
6
7


import "runtime/pprof"
//cpu
f, err := os.Create("cpu.prof") //创建一个文件句柄
pprof.StartCPUProfile(f) //开启
pprof.StopCPUProfile() //关闭
//内存
pprof.WriteHeapProfile(f1)

net/http/pprof：服务型应用

简单，适合于持续性运⾏的应⽤
在应⽤程序中导⼊ import _ “net/http/pprof”，并启动 http server 即可
http://:/debug/pprof/
go tool pprof http://:/debug/pprof/profile?seconds=10 （默认值为30秒）
go-torch -seconds 10 http://:/debug/pprof/profile

开启方式

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14


//http包使用方式
//如果使用了默认的http.DefaultServeMux（通常是代码直接使用 http.ListenAndServe(“0.0.0.0:8000”, nil)），只需要import _ "net/http/pprof"即可
//如果使用的是自定义的 Mux，则需要手动注册一些路由规则：
r.HandleFunc("/debug/pprof/", pprof.Index)
r.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
r.HandleFunc("/debug/pprof/profile", pprof.Profile)
r.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
r.HandleFunc("/debug/pprof/trace", pprof.Trace)

//gin框架使用方式
//1、github.com/gin-contrib/pprof
//2、github.com/DeanThompson/ginpprof
import "github.com/gin-contrib/pprof"
pprof.Register(r)

文件地址

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13


# http://<host>:<port>/debug/pprof/
allocs：查看过去所有内存分配的样本，访问路径为 $HOST/debug/pprof/allocs。
block：查看导致阻塞同步的堆栈跟踪，访问路径为 $HOST/debug/pprof/block。
cmdline： 当前程序的命令行的完整调用路径。
goroutine：查看当前所有运行的 goroutines 堆栈跟踪，访问路径为 $HOST/debug/pprof/goroutine。
heap：查看活动对象的内存分配情况，访问路径为 $HOST/debug/pprof/heap。
mutex：查看导致互斥锁的竞争持有者的堆栈跟踪，访问路径为 $HOST/debug/pprof/mutex。
profile： 默认进行 30s 的 CPU Profiling，得到一个分析用的 profile 文件，访问路径为 $HOST/debug/pprof/profile。
threadcreate：查看创建新 OS 线程的堆栈跟踪，访问路径为 $HOST/debug/pprof/threadcreate。

1、web界面点击链接默认带debug=1，若没有debug参数，那么将会直接下载对应的 profile 文件。
2、在部署环境中，为了网络安全，通常不会直接对外网暴露 pprof 的相关端口，因此会通过 curl、wget 等方式进行 profile 文件的间接拉取。
3、debug 的访问方式是具有时效性的，在实际场景中，我们常常需要及时将当前状态下的 profile 文件给存储下来，便于二次分析。

交互式终端（go tool pprof）

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35


#go tool pprof
go tool pprof + prof文件进入交互式终端，有以下几个命令
top  默认查看程序中占用cpu前10位的函数，后面可跟数字，默认以flat排列，可使用-cum以cum排列
list +函数名命令查看具体的函数分析
traces 查看调用链路
pdf  可以生成可视化的pdf文件。
svg  生成svg
web  直接生成关系调用图（方块越大表示消耗越大）：
     注意：如果没有安装graphviz，会出现Could not execute dot；may need to install graphviz

#top详解
(pprof) top
Showing nodes accounting for 71.27s, 75.62% of 94.25s total
Dropped 211 nodes (cum <= 0.47s)
Showing top 10 nodes out of 97
      flat  flat%   sum%        cum   cum%
    23.32s 24.74% 24.74%     23.32s 24.74%  runtime.pthread_cond_wait
    10.09s 10.71% 35.45%     10.09s 10.71%  runtime.pthread_kill
    10.08s 10.69% 46.14%     10.08s 10.69%  runtime.kevent
     7.19s  7.63% 53.77%     19.04s 20.20%  main.GetFibonacciServe
     5.10s  5.41% 59.18%      5.10s  5.41%  runtime.procyield
     4.34s  4.60% 63.79%      8.12s  8.62%  runtime.scanobject
     4.17s  4.42% 68.21%      4.17s  4.42%  runtime.pthread_cond_signal
     2.72s  2.89% 71.10%      2.82s  2.99%  runtime.nanotime1
     2.32s  2.46% 73.56%      2.33s  2.47%  runtime.usleep
     1.94s  2.06% 75.62%      1.94s  2.06%  runtime.memclrNoHeapPointers
     
flat：当前函数占用CPU的耗时
flat%：当前函数占用CPU的耗时百分比
sum%：函数占用CPU的累积耗时百分比
cum：当前函数+调用当前函数的占用CPU总耗时
cum%：当前函数+调用当前函数的占用CPU总耗时百分比

【flat和cum的区别】假设函数b由三部分组成：调用函数c、自己直接处理一些事情、调用函数d，其中调用函数c耗时1秒，自己直接处理事情耗时3秒，调用函数d耗时2秒，那么函数b的flat耗时就是3秒，cum耗时就是6秒。
【flat%、cum%、sum%】flat%和cum%指的就是flat耗时和cum耗时占总耗时（也就是94.25秒）的百分比，而最后一个sum%指的就是每一行的flat%与上面所有行的flat%总和，代表从上到下的累计值，比如第二行的75.04%就等于第一行flat%的66.67%+本行flat%的8.37%，下面的以此类推

CPU Profiling

1
2
3


go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
// 执行该命令后，需等待 30 秒（可调整 seconds 的值），pprof 会进行 CPU Profiling，结束后将默认进入 pprof 的命令行交互式模式，可以对分析的结果进行查看或导出。另外如果你所启动的 HTTP Server 是 TLS 的方式，那么在调用 go tool pprof 时，需要将调用路径改为：
go tool pprof https+insecure://localhost:6060/debug/pprof/profile?seconds=30。

Heap Profiling

1
2
3
4
5
6
7
8


go tool pprof http://localhost:6060/debug/pprof/heap
// 执行该命令后，能够很快的拉取到其结果，因为它不需要像 CPU Profiling 做采样等待

// 可以选择传 type 它默认显示的是 inuse_space，实际上可以针对多种内存概况进行分析，常用的类别如下：
inuse_space：分析应用程序的常驻内存占用情况。
alloc_objects：分析应用程序的内存临时分配情况。
//另外还有 inuse_objects 和 alloc_space 类别，分别对应查看每个函数所分别的对象数量和查看分配的内存空间大小，具体可根据情况选用。
go tool pprof -alloc_objects http://localhost:6060/debug/pprof/heap

Goroutine Profiling

1

// 在查看 goroutine 时，我们可以使用 traces 命令，这个命令会打印出对应的所有调用栈，以及指标信息，可以让我们很便捷的查看到整个调用链路有什么，分别在哪里使用了多少个 goroutine，并且能够通过分析查看到谁才是真正的调用方

Mutex Profiling

1
2
3
4


// 需要注意的是 runtime.SetMutexProfileFraction 语句，如果未来希望进行互斥锁的采集，那么需要通过调用该方法来设置采集频率，若不设置或没有设置大于 0 的数值，默认是不进行采集的。
func init() {
	runtime.SetMutexProfileFraction(1)
}

Block Profiling

1
2
3
4


// 与 Mutex 的 runtime.SetMutexProfileFraction 相似，Block 也需要调用 runtime.SetBlockProfileRate() 进行采集量的设置，否则默认关闭，若设置的值小于等于 0 也会认为是关闭。
func init() {
	runtime.SetBlockProfileRate(1)
}

Web 界面

方法一（推荐）：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


# 该命令将在所指定的端口号运行一个 PProf 的分析用的站点。
go tool pprof -http=:6001 profile

# 包含内容：
# Top
# Graph
# Flame Graph （火焰图）
# Peek 相较于 Top 视图，增加了所属的上下文信息的展示，也就是函数的输出调用者/被调用者。
# Source 主要是增加了面向源代码的追踪和分析，可以看到其开销主要消耗在哪里。
# Disassemble

方法二：

1
2


# 在go tool pprof交互式终端中使用web命令
(pprof) web

安全地使用net/http/pprof

benchmark

benchmark和普通的单元测试一样，都位于 _test.go 文件中。
函数名以 Benchmark 开头，参数是 b *testing.B。和普通的单元测试用例很像，单元测试函数名以 Test 开头，参数是 t *testing.T。
go test 命令默认不运行 benchmark 用例的，如果我们想运行 benchmark 用例，则需要加上 -bench 参数。例如：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18


go test -bench .
// 支持正则表达式，下面这个代表运行以Fib结尾的benchmark用例：
go test -bench='Fib$' .
// 默认使用全部cpu核数，可通过-cpu参数改变，-cpu支持传入一个列表作为参数，例如：
go test -bench='Fib$' -cpu=2,4 .
// 默认运行时间为1s，可通过-benchtime指定时间或者次数：
go test -bench='Fib$' -benchtime=5s .
go test -bench='Fib$' -benchtime=50x .
// 默认执行一轮，可通过-count指定轮数：
go test -bench='Fib$' -benchtime=5s -count=3 .
// 可通过-benchmem可查看内存分配量和分配次数：
go test -bench=. -benchmem
//常用方法：
b.StopTimer() //暂停计时
b.ResetTimer() //重置定时器
b.StartTimer() //开始计时
//下面输出的意思是：通过1s时间内运行该方法33次,每次需要30403687ns，进行内存分配40次，每次分配45188395B内存
BenchmarkGenerate-8         33  30403687 ns/op  45188395 B/op  40 allocs/op

延伸阅读

Go 测试，go test 工具的具体指令 flag

go tool trace

文章目录

准备工作

pprof

采样

runtime/pprof：工具型应用

开启方式

net/http/pprof：服务型应用

开启方式

文件地址

交互式终端（go tool pprof）

CPU Profiling

Heap Profiling

Goroutine Profiling

Mutex Profiling

Block Profiling

Web 界面

安全地使用net/http/pprof

benchmark

延伸阅读