An introduction to lmbench.
Overview
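lmbench is a fairly small benchmark suite. It depends on an operating system and runs on Linux.
It measures file read/write, memory operations, process creation and teardown overhead, networking, and similar areas; the metrics are latency and bandwidth. For CPU development, the memory operations matter most.
The program is available at:
http://www.bitmover.com/lmbench/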
Run on host
Run the following commands directly:
$ cd src
This step does three things:
Compile the sources in src, with the output placed under the bin directory
gmake[1]: Entering directory `/home/francis.zheng/work/lmbench/lmbench-3.0-a9/src'
gcc -O -DRUSAGE -DHAVE_unit=1 -DHAVE_int64_t=1 -DHAVE_pmap_clnt_h -DHAVE_socklen_t -DHAVE_DRAND48 -DHAVE_SCHED_SETAFFINITY=1 -c lib_tcp.c -o ../bin/x86_64-linux-gnu/lib_tcp.o
....
Run ../scripts/config-run, which generates the CONFIG file
=====================================================================
L M B E N C H C ON F I G U R A T I O N
----------------------------------------
You need to configure some parameters to lmbench. Once you have configured
these parameters, you may do multiple runs by saying
"make rerun"
in the src subdirectory.
NOTICE: please do not have any other activity on the system if you can
help it. Things like the second hand on your xclock or X perfmeters
are not so good when benchmarking. In fact, X is not so good when
benchmarking.
=====================================================================
...
MULTIPLE COPIES [default 1]:
=====================================================================
...
Job placement selection [default 1]:
=====================================================================
Hang on, we are calculating your timing granularity.
OK, it looks like you can time stuff down to 5000 usec resolution.
Hang on, we are calculating your timing overhead.
OK, it looks like your gettimeofday() costs 0 usecs.
Hang on, we are calculating your loop overhead.
OK, it looks like your benchmark loop costs 0.00007627 usecs.
=====================================================================
...
MB [default 540585]: 64
Checking to see if you have 64 MB; please wait for a mement...
64MB OK
64MB OK
64MB OK
Hang on, we are calculating your cache line size.
OK, it looks like your cache line is bytes.
=====================================================================
...
SUBSET (ALL|HARWARE|OS|DEVELOPMENT) [default all]: h
=====================================================================
...
FASTMEM [default no]:
=====================================================================
...
SLOWFS [default no]:
=====================================================================
...
DISKS [default none]:
=====================================================================
...
REMOTE [default none]:
=====================================================================
...
Processor mhz [default 2685MHz, 0.3724 nanosec clock]:
=====================================================================
...
FSDIR [default /usr/tmp]:
=====================================================================
...
Status output file [default /dev/tty]:
=====================================================================
...
Mail results [default yes]: n
OK, no results mailed.
=====================================================================
Configuration done, thanks.
Run ../scripts/results to produce the results
Using config in CONFIG.sfccpu003
Wed May 11 22:20:40 CST 2022
Latency measurements
Wed May 11 22:21:00 CST 2022
Local networking
Wed May 11 22:21:03 CST 2022
Bandwidth measurements
Wed May 11 22:21:18 CST 2022
Calculating effective TLB size
Wed May 11 22:21:18 CST 2022
Calculating memory load parallelsim
Wed May 11 22:21:18 CST 2022
McCalpin's STEAM benchmark
Wed May 11 22:21:20 CST 2022
Calculating memory load latency
Wed May 11 22:27:21 CST 2022
Run the following command to view the results:
$ make see
Run on riscv
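Running lmbench on the RISC-V architecture generally requires cross-compilation: build on the host, then run on the RISC-V target.
Use the following command:
$ make CC="riscv64-unknown-linux-gnu-gcc" OS=riscv-linux
Note that this needs the Linux-targeted RISC-V gcc, not the elf-gcc variant.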
If the build reports a redefinition error, comment out the typedef int socklen_t line in src/bench.h.
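Once the build finishes, copy the result into the rootfs.
Because the rootfs may not have make and similar tools yet, the commands can be run directly rather than through make:
$ cd src
If the system has no shared libraries installed, running the binaries fails with a not found error; static linking avoids this:
$ make CC="riscv64-unknown-linux-gnu-gcc -static" OS=riscv-linux
With static linking, however, the bin directory grows from 3MB to 200MB.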
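Benchmarks
The lmbench code is fairly simple: mostly C programs plus some scripts. The C programs perform the actual measurements, and the scripts orchestrate the runs and collect and process the results.
Listed here are the benchmarks that are useful for CPU work. For usage examples, see the scripts/lmbench script.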
http://lmbench.sourceforge.net/man/index.html
mhz
calculate processor clock rate
mhz calculates the processor clock rate and megahertz. It uses an unrolled, interlocked loop of adds or shifts. So far, superscalarness has been defeated on the tested processors (SuperSPARC, RIOS, Alpha).
Output format is either just the clock rate as a float (-c) or more verbose
39.80 Mhz, 25 nanosec clock
Example:
$ ./mhz
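The two numbers in the mhz output are linked by simple arithmetic: a clock running at f MHz has a period of 1000 / f nanoseconds. The sketch below only illustrates that conversion; `clock_period_ns` is a hypothetical helper name and is not part of lmbench.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not from lmbench): clock period in nanoseconds for a
 * clock rate given in MHz. 1 MHz is 1e6 cycles per second, so one cycle
 * lasts 1000 / mhz nanoseconds. */
double clock_period_ns(double mhz)
{
    assert(mhz > 0.0);
    return 1000.0 / mhz;
}

#ifdef DEMO_MAIN
/* Build with: cc -DDEMO_MAIN file.c */
int main(void)
{
    /* 39.80 MHz is the rate shown in the sample output above. */
    printf("%.2f Mhz, %.0f nanosec clock\n", 39.80, clock_period_ns(39.80));
    return 0;
}
#endif
```

Applied to the 2685 MHz reported in the configuration log, the same formula gives roughly 0.3724 ns, which agrees with the clock value that config-run printed.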
bw_mem
time memory bandwidth
bw_mem allocates twice the specified amount of memory, zeros it, and then times the copying of the first half to the second half. Results are reported in megabytes moved per second.
The size specification may end with "k" or "m" to mean kilobytes (* 1024) or megabytes (* 1024 * 1024).
Output format is "%0.2f %.2f\n", megabytes, megabytes_per_second, i.e.,
8.00 25.33
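The size suffix rule can be sketched in a few lines of C. This is an illustration of the rule as the man page states it, not lmbench's own parsing code; `parse_size` is a hypothetical name, and only the lowercase suffixes mentioned above are handled.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not from lmbench): turn a size such as "512",
 * "1024k" or "8m" into a byte count. A trailing 'k' multiplies by 1024,
 * a trailing 'm' by 1024 * 1024; anything else leaves the number as is. */
unsigned long long parse_size(const char *text)
{
    char *end;
    unsigned long long bytes;

    assert(text != NULL);
    bytes = strtoull(text, &end, 10);
    if (*end == 'k')
        bytes *= 1024ULL;
    else if (*end == 'm')
        bytes *= 1024ULL * 1024ULL;
    return bytes;
}

#ifdef DEMO_MAIN
/* Build with: cc -DDEMO_MAIN file.c */
int main(void)
{
    printf("1024k = %llu bytes\n", parse_size("1024k"));
    printf("8m    = %llu bytes\n", parse_size("8m"));
    return 0;
}
#endif
```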
There are nine different memory benchmarks in bw_mem. They each measure slightly different methods for reading, writing or copying data.
rd
measures the time to read data into the processor. It computes the sum of an array of integer values. It accesses every fourth word.
wr
measures the time to write data to memory. It assigns a constant value to each element of an array of integer values. It accesses every fourth word.
rdwr
measures the time to read data into memory and then write data to the same memory location. For each element in an array it adds the current value to a running sum before assigning a new (constant) value to the element. It accesses every fourth word.
cp
measures the time to copy data from one location to another. It does an array copy: dest[i] = source[i]. It accesses every fourth word.
frd
measures the time to read data into the processor. It computes the sum of an array of integer values.
fwr
measures the time to write data to memory. It assigns a constant value to each element of an array of integer values.
fcp
measures the time to copy data from one location to another. It does an array copy: dest[i] = source[i].
bzero
measures how fast the system can bzero memory.
bcopy
measures how fast the system can bcopy data.
Belongs to: OS | HARDWARE
Example:
$ ./bw_mem 1024k rd # a lowercase k multiplies by 1024, a lowercase m by 1024*1024
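The difference between the strided variants (rd, wr) and the full variants (frd, fwr) comes down to the loop step. The sketch below shows only that access pattern under simplifying assumptions: the function names are hypothetical, the loops are not unrolled, and nothing is timed, so it should not be read as lmbench's actual implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Sum the array, touching every `step`-th word.
 * step = 4 mirrors the "rd" description, step = 1 mirrors "frd". */
long sum_words(const int *a, size_t n, size_t step)
{
    long sum = 0;
    size_t i;

    assert(a != NULL && step > 0);
    for (i = 0; i < n; i += step)
        sum += a[i];
    return sum;
}

/* Store a constant into every `step`-th word.
 * step = 4 mirrors "wr", step = 1 mirrors "fwr". */
void fill_words(int *a, size_t n, size_t step, int value)
{
    size_t i;

    assert(a != NULL && step > 0);
    for (i = 0; i < n; i += step)
        a[i] = value;
}

#ifdef DEMO_MAIN
/* Build with: cc -DDEMO_MAIN file.c */
int main(void)
{
    static int data[1024];

    fill_words(data, 1024, 1, 1); /* fwr-style: write every word */
    printf("frd-style sum: %ld\n", sum_words(data, 1024, 1));
    printf("rd-style sum:  %ld\n", sum_words(data, 1024, 4));
    return 0;
}
#endif
```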
lat_mem_rd
memory read latency benchmark
lat_mem_rd measures memory read latency for varying memory sizes and strides. The results are reported in nanoseconds per load and have been verified accurate to within a few nanoseconds on an SGI Indy.
The entire memory hierarchy is measured, including onboard cache latency and size, external cache latency and size, main memory latency, and TLB miss latency.
Only data accesses are measured; the instruction cache is not measured.
The benchmark runs as two nested loops. The outer loop is the stride size. The inner loop is the array size. For each array size, the benchmark creates a ring of pointers that point backward one stride. Traversing the array is done by
p = (char **)*p;
in a for loop (the overhead of the for loop is not significant; the loop is an unrolled loop 100 loads long).
The size of the array varies from 512 bytes to (typically) eight megabytes. For the small sizes, the cache will have an effect, and the loads will be much faster. This becomes much more apparent when the data is plotted.
Since this benchmark uses fixed-stride offsets in the pointer chain, it may be vulnerable to smart, stride-sensitive cache prefetching policies. Older machines were typically able to prefetch for sequential access patterns, and some were able to prefetch for strided forward access patterns, but only a few could prefetch for backward strided patterns. These capabilities are becoming more widespread in newer processors.
Belongs to: HARDWARE | MEM
Example:
$ lat_mem_rd 64 128 # measure 64MB with a stride of 128
The "Memory latencies in nanoseconds" figures in Summary.out come from lat_mem_rd. The rules are:
$ lat_mem_rd 64 128 # size must be greater than 8MB, stride must be 128
$ lat_mem_rd -t 64 16 # size must be greater than 8MB, stride must be 16
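The pointer ring that the lat_mem_rd description relies on can be sketched as follows. This is a simplified illustration, assuming a single buffer and a stride that is a multiple of the pointer size; `build_ring` and `chase` are hypothetical names, and the real benchmark additionally unrolls the loop and times it.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Build a ring of pointers inside buf: the pointer stored in slot i points
 * backward one stride to slot i - 1, and slot 0 wraps around to the last
 * slot. Returns the number of slots in the ring. */
size_t build_ring(char *buf, size_t len, size_t stride)
{
    size_t slots, i;

    assert(buf != NULL && stride > 0 && len >= stride);
    assert(stride % sizeof(char *) == 0); /* keeps every slot aligned */
    slots = len / stride;
    for (i = 0; i < slots; i++) {
        size_t prev = (i == 0) ? slots - 1 : i - 1;
        *(char **)(buf + i * stride) = buf + prev * stride;
    }
    return slots;
}

/* Perform `loads` dependent loads along the chain and return where we end
 * up. Each load needs the result of the previous one, which is what makes
 * the time per iteration a latency measurement. */
char **chase(char **p, size_t loads)
{
    while (loads-- > 0)
        p = (char **)*p;
    return p;
}

#ifdef DEMO_MAIN
/* Build with: cc -DDEMO_MAIN file.c */
int main(void)
{
    size_t len = 1024, stride = 128;
    char *buf = malloc(len);
    size_t slots;

    if (buf == NULL)
        return 1;
    slots = build_ring(buf, len, stride);
    /* One full lap around the ring returns to the starting slot. */
    printf("slots=%zu back_at_start=%d\n", slots,
           chase((char **)buf, slots) == (char **)buf);
    free(buf);
    return 0;
}
#endif
```

Timing a large number of `chase` iterations and dividing by the load count would give a per-load latency for that array size and stride, which is the quantity lat_mem_rd reports.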
lat_ops
basic CPU operation latency
lat_ops [ -W <warmups> ] [ -N <repetitions> ]
lat_ops measures the latency of basic CPU operations, such as integer ADD.
Example:
$ ./lat_ops
par_ops
basic CPU operation parallelism
par_ops [ -W <warmups> ] [ -N <repetitions> ]
par_ops measures the available parallelism for basic CPU operations, such as integer ADD. Results are reported as the average operation latency divided by the minimum average operation latency across all levels of parallelism.
Example:
$ ./par_ops
par_mem
memory parallelism benchmark
par_mem [ -L <line size> ] [ -M <len> ] [ -W <warmups> ] [ -N <repetitions> ]
par_mem measures the available parallelism in the memory hierarchy, up to len bytes. Modern processors can often service multiple memory requests in parallel, while older processors typically blocked on LOAD instructions and had no available parallelism (other than that provided by cache prefetching). par_mem measures the available parallelism at a variety of points, since the available parallelism is often a function of the data location in the memory hierarchy.
Here, -M is the size len to measure and -L is the line size; it is not clear to me what the latter means.
$ ./par_mem -L 512 -M 64M
lat_ctx
context switching benchmark
lat_ctx measures context switching time for any reasonable number of processes of any reasonable size. The processes are connected in a ring of Unix pipes. Each process reads a token from its pipe, possibly does some work, and then writes the token to the next process.
lat_ctx [ -P \
Belongs to: OS | CTX
Example:
$ ./lat_ctx -s 0 2 4 8 16 24 32 64 96
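The ring-of-pipes arrangement described for lat_ctx can be sketched with fork and pipe. This is a minimal illustration of the token passing only, under assumptions of my own: `token_ring` and `MAX_PROCS` are hypothetical names, the processes do no extra work, and nothing is timed, so it is not lmbench's implementation.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_PROCS 16 /* arbitrary cap chosen for this sketch */

/* Pass a one-byte token `rounds` times around a ring of `nprocs` processes.
 * Process i reads from pipe i and writes to pipe (i + 1) % nprocs; the
 * caller acts as process 0. Returns 0 on success, -1 on any error. */
int token_ring(int nprocs, int rounds)
{
    int fds[MAX_PROCS][2];
    char token = 't';
    int i, j, r, forked = 0, rc = 0;

    assert(nprocs >= 2 && nprocs <= MAX_PROCS && rounds > 0);
    for (i = 0; i < nprocs; i++)
        if (pipe(fds[i]) < 0)
            return -1; /* earlier pipes are leaked; acceptable in a sketch */

    for (i = 1; i < nprocs && rc == 0; i++) {
        pid_t pid = fork();
        if (pid < 0) {
            rc = -1;
        } else if (pid == 0) {
            int in = fds[i][0];
            int out = fds[(i + 1) % nprocs][1];
            /* Keep only this process's two pipe ends open. */
            for (j = 0; j < nprocs; j++) {
                if (fds[j][0] != in)
                    close(fds[j][0]);
                if (fds[j][1] != out)
                    close(fds[j][1]);
            }
            for (r = 0; r < rounds; r++)
                if (read(in, &token, 1) != 1 || write(out, &token, 1) != 1)
                    _exit(1);
            _exit(0);
        } else {
            forked++;
        }
    }

    /* Process 0: send the token to process 1, wait for it to come back. */
    for (r = 0; rc == 0 && r < rounds; r++)
        if (write(fds[1][1], &token, 1) != 1 ||
            read(fds[0][0], &token, 1) != 1)
            rc = -1;

    for (i = 0; i < nprocs; i++) {
        close(fds[i][0]);
        close(fds[i][1]);
    }
    for (i = 0; i < forked; i++) {
        int status;
        if (wait(&status) < 0 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            rc = -1;
    }
    return rc;
}

#ifdef DEMO_MAIN
/* Build with: cc -DDEMO_MAIN file.c */
int main(void)
{
    printf("token_ring(4, 1000) returned %d\n", token_ring(4, 1000));
    return 0;
}
#endif
```

Every hop of the token forces the kernel to switch from the writer to the blocked reader, so timing many rounds and dividing by the number of hops approximates the cost of one context switch plus the pipe overhead.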
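Timing
Time is measured with gettimeofday.
Prototype: int gettimeofday(struct timeval *tv, struct timezone *tz);
Description: gettimeofday() stores the current time in the structure pointed to by tv, and the local time zone information in the structure pointed to by tz.
The timeval structure is defined as:
struct timeval{
long tv_sec; // seconds
long tv_usec; // microseconds
};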
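Measuring an interval with gettimeofday means sampling it twice and subtracting the two timeval values field by field. The following is a minimal sketch of that pattern; `elapsed_usec` is a hypothetical helper name and not an lmbench function.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/time.h>

/* Hypothetical helper (not from lmbench): microseconds elapsed between two
 * timeval samples. The tv_usec difference may be negative; the tv_sec term
 * compensates for that. */
long long elapsed_usec(const struct timeval *start, const struct timeval *stop)
{
    assert(start != NULL && stop != NULL);
    return (long long)(stop->tv_sec - start->tv_sec) * 1000000LL
         + (long long)(stop->tv_usec - start->tv_usec);
}

#ifdef DEMO_MAIN
/* Build with: cc -DDEMO_MAIN file.c */
int main(void)
{
    struct timeval t0, t1;
    volatile unsigned long sink = 0; /* volatile keeps the loop from being removed */
    unsigned long i;

    gettimeofday(&t0, NULL); /* the timezone argument is not needed here */
    for (i = 0; i < 10000000UL; i++)
        sink += i;
    gettimeofday(&t1, NULL);

    printf("loop took %lld usec\n", elapsed_usec(&t0, &t1));
    return 0;
}
#endif
```

Because the resolution is at best one microsecond, a benchmark has to repeat the measured operation many times between the two samples and divide, which is why the configuration step first estimates the timing granularity and loop overhead.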
Scripts
Pull out the benchmarks that matter for CPU work and run them directly: write a small script and simply run that.
#!/bin/sh
Note
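lmbench's bandwidth figure counts both sides of the transfer rather than the amount of data moved by memcpy, so to get the amount of data copied, divide the reported figure by 2.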
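As a worked instance of that rule, the sketch below halves a reported figure. It simply encodes the note's claim; `copied_mb_per_sec` is a hypothetical helper name, and the input value is the sample figure from the bw_mem output format.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: convert a bandwidth figure that counts both the read
 * side and the write side into the rate of data actually copied, following
 * the divide-by-two rule stated in the note. */
double copied_mb_per_sec(double reported_mb_per_sec)
{
    assert(reported_mb_per_sec >= 0.0);
    return reported_mb_per_sec / 2.0;
}

#ifdef DEMO_MAIN
/* Build with: cc -DDEMO_MAIN file.c */
int main(void)
{
    /* 25.33 is the sample figure from the bw_mem output format above. */
    printf("reported %.2f MB/s, copied %.3f MB/s\n",
           25.33, copied_mb_per_sec(25.33));
    return 0;
}
#endif
```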