Oct 27, 2023 · Linux tools

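A Brief Introduction to Using perf

0. Preface

perf is a collection of performance tools maintained alongside the Linux kernel. It makes it convenient to analyze and trace performance metrics, and it can instrument CPU performance counters, tracepoints, kprobes, and uprobes (dynamic tracing).

perf is extremely powerful, and there are many ways to use and extend it. This post only introduces a handful of commonly used subcommands.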

1. Basic usage

On a Linux system, run

perf record -a -- sleep 30

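This produces a perf.data file that records system events over those 30 seconds. The events are collected by sampling, so each record is a sample.

Once recording has finished, the following command parses the data into a human-readable report.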

perf report

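In addition, perf script can parse perf.data and dump the captured samples as text, which is convenient for further analysis (for example, feeding your own scripts or looking up the data at a specific point in time).

The other perf subcommands are listed below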
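As a rough illustration of "looking up the data at a specific point in time", the sketch below filters the text printed by perf script down to a time window. It assumes the default header layout `comm pid [cpu] timestamp: ...`; that layout changes with the perf version and with `-F`, so the regular expression and the helper name `samples_between` are assumptions, not part of perf.

```python
import re
from decimal import Decimal

# Assumed header layout: "comm pid [cpu] timestamp: rest".
# Recordings made without -a may omit the [cpu] field.
HEADER = re.compile(
    r"^\s*(?P<comm>.+?)\s+(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+"
    r"(?P<ts>\d+\.\d+):\s+(?P<rest>.*)$"
)


def samples_between(lines, start, end):
    """Yield perf script header lines whose timestamp lies in [start, end]."""
    start, end = Decimal(start), Decimal(end)
    for line in lines:
        match = HEADER.match(line)
        if match and start <= Decimal(match.group("ts")) <= end:
            yield line.rstrip("\n")


if __name__ == "__main__":
    import sys

    # Example: perf script | python3 window.py 521094.019 521094.022
    for hit in samples_between(sys.stdin, sys.argv[1], sys.argv[2]):
        print(hit)
```

Timestamps are compared as Decimal values so that microsecond-resolution values are not affected by floating-point rounding.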

 usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS]

 The most commonly used perf commands are:
   annotate        Read perf.data (created by perf record) and display annotated code
   archive         Create archive with object files with build-ids found in perf.data file
   bench           General framework for benchmark suites
   buildid-cache   Manage build-id cache.
   buildid-list    List the buildids in a perf.data file
   c2c             Shared Data C2C/HITM Analyzer.
   config          Get and set variables in a configuration file.
   data            Data file related processing
   diff            Read perf.data files and display the differential profile
   evlist          List the event names in a perf.data file
   ftrace          simple wrapper for kernel's ftrace functionality
   inject          Filter to augment the events stream with additional information
   kallsyms        Searches running kernel for symbols
   kmem            Tool to trace/measure kernel memory properties
   kvm             Tool to trace/measure kvm guest os
   list            List all symbolic event types
   lock            Analyze lock events
   mem             Profile memory accesses
   record          Run a command and record its profile into perf.data
   report          Read perf.data (created by perf record) and display the profile
   sched           Tool to trace/measure scheduler properties (latencies)
   script          Read perf.data (created by perf record) and display trace output
   stat            Run a command and gather performance counter statistics
   test            Runs sanity tests.
   timechart       Tool to visualize total system behavior during a workload
   top             System profiling tool.
   version         display the version of perf binary
   probe           Define new dynamic tracepoints
   trace           strace inspired tool

2. Probe

perf-probe(1) — Linux manual page

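2-1 Usage

One frequently used feature is dynamic instrumentation with probe. Static tracepoints cannot cover every kind of performance problem, so the kernel provides a dynamic instrumentation mechanism, and perf integrates it as perf probe.

Basic usage is shown below, where xxx is the name of the function to instrument.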

perf probe -a xxx

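Note, however, that a probe can be placed by symbol name only on functions that are not inlined. If the target has been inlined and no debug information is available to locate its inlined instances, you may need to force it out of line by adding the noinline attribute and rebuilding.

perf probe also supports several other probe locations

perf probe -a xxx:xx # probe at line xx of function xxx, counted relative to the function's first line
perf probe -a xxx+xx # probe at byte offset xx from the start of function xxx
perf probe -a xxx%return # probe at the return of function xxx

Probes can also be placed by source file; for more detail see the PROBE SYNTAX section of perf probe --help

or refer to the perf-probe(1) Linux manual page.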
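The three location forms above differ only in the suffix attached to the function name. The following sketch builds the definition string and the argument list for perf probe without running anything. The helper names `probe_spec` and `probe_command` and the function name `my_func` are hypothetical and serve only as an illustration.

```python
def probe_spec(func, line=None, offset=None, on_return=False):
    """Build the probe definition string passed to `perf probe -a`.

    line      -> FUNC:LINE    (line relative to the function's first line)
    offset    -> FUNC+OFFSET  (byte offset from the function entry)
    on_return -> FUNC%return  (probe the function return)
    """
    chosen = [line is not None, offset is not None, on_return]
    if sum(chosen) > 1:
        raise ValueError("choose at most one of line, offset, on_return")
    if line is not None:
        return f"{func}:{line}"
    if offset is not None:
        return f"{func}+{offset}"
    if on_return:
        return f"{func}%return"
    return func


def probe_command(spec):
    """Return the argv list for adding the probe; the caller decides how to run it."""
    return ["perf", "probe", "-a", spec]


if __name__ == "__main__":
    # "my_func" is a placeholder, not a real kernel symbol.
    print(" ".join(probe_command(probe_spec("my_func", on_return=True))))
```

Returning an argument list rather than a shell string keeps the `%` and `+` characters from being interpreted by a shell if the command is later passed to subprocess.run.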

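2-2 How it works

perf probe is built on the kernel's kprobes mechanism. kprobes relies on the kernel's exception handling: it places a breakpoint instruction (for example int3 on x86) at the specified address, and when that instruction executes, the resulting exception handler runs the probe handler registered by the user.

An outline of how probe is implemented in perf: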

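Parse the user input
Look up the kernel symbol: perf probe uses vmlinux to find the kernel address of the specified function. If the user gave a source line number, perf probe also resolves that line to its kernel address.
Register the probe event: perf probe registers the resolved address and its event name with the kernel through the kprobe_events interface in tracefs. When the event is later opened through the perf_events subsystem, a new perf_event of type PERF_TYPE_TRACEPOINT is created and tied to the kprobes framework.
Insert the kprobe:
1. The kernel calls the kprobes API (such as register_kprobe() or register_kretprobe()) to insert a kprobe or kretprobe at the specified address. This places a breakpoint instruction at the target address, so that reaching it raises an exception.
  
  On x86, for example, the byte 0xcc is written at the target address. 0xcc is the opcode of the int3 instruction; when the CPU executes it, a breakpoint exception is dispatched through the interrupt descriptor table, and the handler triggers the kprobe event.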
```
#define BREAKPOINT_INSTRUCTION	0xcc

void arch_arm_kprobe(struct kprobe *p)
{
	text_poke(p->addr, ((unsigned char []){BREAKPOINT_INSTRUCTION}), 1);
}
```
  On arm64 it is instead
```
#define BRK64_ESR_KPROBES	0x0004 /* break immediate that identifies a kprobe */
#define AARCH64_BREAK_MON	0xd4200000 /* BRK instruction: software breakpoint that raises an exception */

#define BRK64_OPCODE_KPROBES	(AARCH64_BREAK_MON | (BRK64_ESR_KPROBES << 5))
```
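2. In the kprobe handler, perf_tp_event() is called to deliver the hit to the perf_event associated with the kprobe. As a result, each time the kprobe fires, the perf_event collects and records information about the execution of the kernel function.
Collect and analyze the data: perf record and perf report can then be used to collect and analyze that information. For these commands, the perf tool uses the perf_events subsystem to subscribe to and read the perf_event data tied to the kprobes.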
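To make the arm64 macros quoted above concrete, the arithmetic can be reproduced directly. The sketch below reuses the two constants from the snippet and shows both the resulting 32-bit opcode and its in-memory byte order, assuming a little-endian kernel.

```python
import struct

# Constants copied from the arm64 snippet above.
BRK64_ESR_KPROBES = 0x0004      # immediate that marks the break as a kprobe
AARCH64_BREAK_MON = 0xD4200000  # BRK instruction with a zero immediate

# The 16-bit immediate of BRK occupies bits 5..20 of the instruction word.
BRK64_OPCODE_KPROBES = AARCH64_BREAK_MON | (BRK64_ESR_KPROBES << 5)

# Little-endian byte order, as the instruction would sit in memory.
opcode_bytes = struct.pack("<I", BRK64_OPCODE_KPROBES)

# Recover the immediate from the opcode to check the encoding.
decoded_imm = (BRK64_OPCODE_KPROBES >> 5) & 0xFFFF

if __name__ == "__main__":
    print(hex(BRK64_OPCODE_KPROBES))  # 0xd4200080
    print(opcode_bytes.hex())         # 800020d4
    print(hex(decoded_imm))           # 0x4
```

Unlike the single 0xcc byte on x86, the arm64 breakpoint is a full 4-byte instruction, and the immediate it carries tells the exception handler that the break belongs to kprobes.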

3. Sched

perf-sched(1)

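3-1 Usage

Sometimes we need to analyze how the kernel schedules tasks; the perf sched command is a convenient way to obtain that information.

First, record the scheduler events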

perf sched record -- sleep 30

Then parse the scheduling latency information

perf sched latency -p

This produces a readable summary of the scheduling information, like the following

  cpuPercentTest:21721  |   2996.171 ms |     1872 | avg:    0.017 ms | max:    1.586 ms | max at: 521094.021548 s
  cpuPercentTest:21727  |   2995.264 ms |     1857 | avg:    0.016 ms | max:    1.278 ms | max at: 521095.473709 s
  cpuPercentTest:21725  |   2998.391 ms |     1875 | avg:    0.016 ms | max:    0.895 ms | max at: 521090.525446 s
  cpuPercentTest:21723  |   2995.983 ms |     1860 | avg:    0.015 ms | max:    1.693 ms | max at: 521095.423880 s
	......
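When the summary is long, it can help to pick out the worst rows programmatically. The sketch below parses rows in the six-column layout shown above (task:pid, runtime, switches, average delay, maximum delay, time of the maximum). Other perf versions print different columns, so the layout and the helper name `parse_latency_row` are assumptions.

```python
def parse_latency_row(row):
    """Parse one data row of `perf sched latency -p`; return None for other lines."""
    cols = [col.strip() for col in row.split("|")]
    if len(cols) != 6:
        return None
    try:
        task, _, pid = cols[0].rpartition(":")
        return {
            "task": task,
            "pid": int(pid),
            "runtime_ms": float(cols[1].split()[0]),
            "switches": int(cols[2]),
            "avg_ms": float(cols[3].split()[1]),
            "max_ms": float(cols[4].split()[1]),
            # Keep the timestamp as text so it can be searched for verbatim
            # in the perf script output.
            "max_at": cols[5].split()[2],
        }
    except (ValueError, IndexError):
        # Header, separator, or total lines end up here.
        return None


def worst_delay(rows):
    """Return the parsed row with the largest maximum delay, or None."""
    parsed = [r for r in map(parse_latency_row, rows) if r is not None]
    return max(parsed, key=lambda r: r["max_ms"], default=None)


if __name__ == "__main__":
    import sys

    # Example: perf sched latency -p | python3 worst.py
    print(worst_delay(sys.stdin))
```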

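At this point, perf script can print the raw events behind this summary, and a timestamp such as max at: 521094.021548 s can be used to look up the exact scheduling sequence.

For example, the actual sequence for one of the scheduling delays above is as follows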

         swapper     0 [094] 521094.019960: sched:sched_migrate_task: comm=cpuPercentTest pid=21721 prio=120 orig_cpu=94 dest_cpu=72
  cpuPercentTest 21734 [024] 521094.019961: sched:sched_migrate_task: comm=cpuPercentTest pid=21729 prio=120 orig_cpu=24 dest_cpu=72
         swapper     0 [094] 521094.019962:       sched:sched_wakeup: comm=cpuPercentTest pid=21721 prio=120 target_cpu=072 # wakeup of process 21721 begins
  cpuPercentTest 21734 [024] 521094.019964:       sched:sched_wakeup: comm=cpuPercentTest pid=21729 prio=120 target_cpu=072
         swapper     0 [072] 521094.019966:       sched:sched_switch: prev_comm=swapper/72 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=cpuPercentTest next_pid=21729 next_prio=120
  cpuPercentTest 21729 [072] 521094.020129: sched:sched_stat_runtime: comm=cpuPercentTest pid=21729 runtime=164651 [ns] vruntime=734902380619 [ns]
   ......
  cpuPercentTest 21733 [046] 521094.021546: sched:sched_migrate_task: comm=cpuPercentTest pid=21721 prio=120 orig_cpu=72 dest_cpu=46 # the task is migrated again, this time to CPU 46
  cpuPercentTest 21733 [046] 521094.021547: sched:sched_stat_runtime: comm=cpuPercentTest pid=21733 runtime=6907 [ns] vruntime=3761424387 [ns]
  cpuPercentTest 21733 [046] 521094.021548:       sched:sched_switch: prev_comm=cpuPercentTest prev_pid=21733 prev_prio=120 prev_state=S ==> next_comm=cpuPercentTest next_pid=21721 next_prio=120 # the context switch to 21721 completes

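The trace shows that two processes, 21721 and 21729, were woken on CPU 72 at almost the same time, and 21729 ran first. Process 21721 stayed runnable without being scheduled until it was migrated to CPU 46 and switched in there, which is why this scheduling delay is long.

3-2 How it works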

perf sched for Linux CPU scheduler analysis

From the example above it is easy to see that the value perf sched latency reports is the time between the task's sched:sched_wakeup and the sched:sched_switch that switches it in

521094.021548s - 521094.019962s = 0.001586s = 1.586 ms
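The same subtraction can be automated over perf script output. The sketch below pairs each sched_wakeup of a given pid with the next sched_switch whose next_pid is that pid. It is a simplified illustration of the idea rather than the algorithm perf sched latency actually implements, and it assumes the field names shown in the trace above; the function name `wakeup_to_switch_delays` is made up for this example.

```python
import re
from decimal import Decimal

# Matches "... timestamp: sched:event_name: key=value ..." in perf script output.
EVENT = re.compile(r"\s(?P<ts>\d+\.\d+):\s+sched:(?P<name>\w+):\s+(?P<fields>.*)$")


def wakeup_to_switch_delays(lines, pid):
    """Return the delays (in seconds) between wakeup and switch-in for one pid."""
    pid = str(pid)
    woken_at = None
    delays = []
    for line in lines:
        match = EVENT.search(line)
        if not match:
            continue
        # Split "key=value" tokens; tokens without "=" (such as "[ns]") are skipped.
        fields = dict(
            token.split("=", 1)
            for token in match.group("fields").split()
            if "=" in token
        )
        ts = Decimal(match.group("ts"))
        name = match.group("name")
        if name == "sched_wakeup" and fields.get("pid") == pid:
            woken_at = ts
        elif (
            name == "sched_switch"
            and fields.get("next_pid") == pid
            and woken_at is not None
        ):
            delays.append(ts - woken_at)
            woken_at = None
    return delays
```

Applied to the trace in section 3-1 with pid 21721, the only wakeup and switch-in pair is 521094.019962 and 521094.021548, which gives the 1.586 ms computed above.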

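In addition, perf sched record is essentially a wrapper that enables a specific set of scheduler tracepoints (the list may be incomplete; it covers only the ones usually observed)

sched:sched_switch # task switch
sched:sched_wakeup # task wakeup
sched:sched_migrate_task # task migration
sched:sched_process_exec # process exec event
sched:sched_process_fork # process creation (fork) event

It is roughly equivalent to the following command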

perf record -a -e 'sched:sched_switch,sched:sched_wakeup,sched:sched_migrate_task,sched:sched_process_exec,sched:sched_process_fork' -- sleep 30

4. Flame graphs

First, collect the raw perf data by running

perf record -g -p xxx -- sleep 60

Then parse perf.data

perf script -i perf.data > perf.unfold

Next you need the FlameGraph tools

git clone https://github.com/brendangregg/FlameGraph.git

Then run the scripts from the FlameGraph repository (mind the relative path)

./../FlameGraph/stackcollapse-perf.pl perf.unfold > perf.folded

Finally, generate the SVG

./../FlameGraph/flamegraph.pl perf.folded > perf.svg
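The intermediate perf.folded file has a simple format: one line per unique stack, with frames joined by semicolons from root to leaf, followed by a sample count. The sketch below reproduces only that folding step on stacks that are already parsed. It is a simplified illustration of the format, not a replacement for stackcollapse-perf.pl, and the function name `fold` and the sample data are made up.

```python
from collections import Counter


def fold(samples):
    """Fold stacks into "comm;root;...;leaf count" lines.

    samples: iterable of (comm, frames), with frames ordered leaf-first,
    which is the order in which perf script prints a call chain.
    """
    counts = Counter()
    for comm, frames in samples:
        # Reverse so the stack reads from root to leaf.
        counts[";".join([comm, *reversed(frames)])] += 1
    return [f"{stack} {count}" for stack, count in sorted(counts.items())]


if __name__ == "__main__":
    # Made-up samples, for illustration only.
    demo = [
        ("app", ["leaf", "mid", "main"]),
        ("app", ["leaf", "mid", "main"]),
        ("app", ["other", "main"]),
    ]
    for line in fold(demo):
        print(line)
```

flamegraph.pl reads lines of this form: the count sets the width of each box, and the frame order sets the vertical stacking.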
