An introduction to lmbench.

Overview

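lmbench is a fairly small benchmark suite. It depends on an operating system and runs on Linux.

It can measure file read/write, memory operations, process creation and teardown overhead, networking and more; the metrics are latency and bandwidth. For CPU development, the memory operations are of most interest.

The program can be downloaded from:

http://www.bitmover.com/lmbench/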

Run on host

Run the following commands directly:

$ cd src
$ make results

This step does three things:

Compile the sources in src, placing the output in the bin directory

gmake[1]: Entering directory `/home/francis.zheng/work/lmbench/lmbench-3.0-a9/src'
gcc -O -DRUSAGE -DHAVE_uint=1 -DHAVE_int64_t=1 -DHAVE_pmap_clnt_h -DHAVE_socklen_t -DHAVE_DRAND48 -DHAVE_SCHED_SETAFFINITY=1 -c lib_tcp.c -o ../bin/x86_64-linux-gnu/lib_tcp.o
....

Run ../scripts/config-run to generate CONFIG

=====================================================================

		L M B E N C H   C O N F I G U R A T I O N
		----------------------------------------

You need to configure some parameters to lmbench.  Once you have configured
these parameters, you may do multiple runs by saying

	"make rerun"

in the src subdirectory.

NOTICE: please do not have any other activity on the system if you can
help it.  Things like the second hand on your xclock or X perfmeters
are not so good when benchmarking.  In fact, X is not so good when
benchmarking.

=====================================================================
...
MULTIPLE COPIES [default 1]:
=====================================================================
...
Job placement selection [default 1]:
=====================================================================
Hang on, we are calculating your timing granularity.
OK, it looks like you can time stuff down to 5000 usec resolution.

Hang on, we are calculating your timing overhead.
OK, it looks like your gettimeofday() costs 0 usecs.

Hang on, we are calculating your loop overhead.
OK, it looks like your benchmark loop costs 0.00007627 usecs.
=====================================================================
...
MB [default 540585]: 64
Checking to see if you have 64 MB; please wait for a moment...
64MB OK
64MB OK
64MB OK
Hang on, we are calculating your cache line size.
OK, it looks like your cache line is bytes.
=====================================================================
...
SUBSET (ALL|HARWARE|OS|DEVELOPMENT) [default all]: h
=====================================================================
...
FASTMEM [default no]:
=====================================================================
...
SLOWFS [default no]:
=====================================================================
...
DISKS [default none]:
=====================================================================
...
REMOTE [default none]:
=====================================================================
...
Processor mhz [default 2685MHz, 0.3724 nanosec clock]:
=====================================================================
...
FSDIR [default /usr/tmp]:
=====================================================================
...
Status output file [default /dev/tty]:
=====================================================================
...
Mail results [default yes]: n
OK, no results mailed.
=====================================================================
Configuration done, thanks.

Run ../scripts/results to produce the results

Using config in CONFIG.sfccpu003
Wed May 11 22:20:40 CST 2022
Latency measurements
Wed May 11 22:21:00 CST 2022
Local networking
Wed May 11 22:21:03 CST 2022
Bandwidth measurements
Wed May 11 22:21:18 CST 2022
Calculating effective TLB size
Wed May 11 22:21:18 CST 2022
Calculating memory load parallelism
Wed May 11 22:21:18 CST 2022
McCalpin's STREAM benchmark
Wed May 11 22:21:20 CST 2022
Calculating memory load latency
Wed May 11 22:27:21 CST 2022

Run the following command to view the results

$ make see
cd results && make summary >summary.out 2>summary.errs
cd results && make percent >percent.out 2>percent.errs


                 L M B E N C H  3 . 0   S U M M A R Y
                 ------------------------------------
		 (Alpha software, do not distribute)

Basic system parameters
------------------------------------------------------------------------------
Host                 OS Description              Mhz  tlb  cache  mem   scal
                                                     pages line   par   load
                                                           bytes
--------- ------------- ----------------------- ---- ----- ----- ------ ----
sfccpu003 Linux 3.10.0-        x86_64-linux-gnu 2685                       1

Processor, Processes - times in microseconds - smaller is better
------------------------------------------------------------------------------
Host                 OS  Mhz null null      open slct sig  sig  fork exec sh
                             call  I/O stat clos TCP  inst hndl proc proc proc
--------- ------------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
sfccpu003 Linux 3.10.0- 2685 

Basic integer operations - times in nanoseconds - smaller is better
-------------------------------------------------------------------
Host                 OS  intgr intgr  intgr  intgr  intgr
                          bit   add    mul    div    mod
--------- ------------- ------ ------ ------ ------ ------
sfccpu003 Linux 3.10.0- 0.3600 0.1900 1.1200 9.9800   10.2

Basic uint64 operations - times in nanoseconds - smaller is better
-------------------------------------------------------------------
Host                 OS  int64 int64  int64  int64  int64
                          bit   add    mul    div    mod
--------- ------------- ------ ------ ------ ------ ------
sfccpu003 Linux 3.10.0- 0.3700        1.1000  15.9   15.1

Basic float operations - times in nanoseconds - smaller is better
-----------------------------------------------------------------
Host                 OS  float  float  float  float
                         add    mul    div    bogo
--------- ------------- ------ ------ ------ ------
sfccpu003 Linux 3.10.0- 1.4900 1.4700 4.2800 1.1200

Basic double operations - times in nanoseconds - smaller is better
------------------------------------------------------------------
Host                 OS  double double double double
                         add    mul    div    bogo
--------- ------------- ------  ------ ------ ------
sfccpu003 Linux 3.10.0- 1.4800  1.4900 5.4100 1.5300

...

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------------------------
Host                OS  Pipe AF    TCP  File   Mmap  Bcopy  Bcopy  Mem   Mem
                             UNIX      reread reread (libc) (hand) read write
--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
sfccpu003 Linux 3.10.0-                              4711.4 5217.6 8331 6940.

Memory latencies in nanoseconds - smaller is better
    (WARNING - may not be correct, check graphs)
------------------------------------------------------------------------------
Host                 OS   Mhz   L1 $   L2 $    Main mem    Rand mem    Guesses
--------- -------------   ---   ----   ----    --------    --------    -------
sfccpu003 Linux 3.10.0-  2685 1.5100 5.2090    28.4         93.1

Run on riscv

Running lmbench on the RISC-V architecture generally requires cross-compilation: build on the host, then run on the RISC-V target.

Use the following command:

$ make CC="riscv64-unknown-linux-gnu-gcc" OS=riscv-linux

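Note that you need the Linux version of the RISC-V gcc, not the elf-gcc one.

If the build reports a redefinition error, comment out the typedef int socklen_t line in src/bench.h.

After the build, copy the result into the rootfs.

Since the rootfs may not have make and similar tools yet, you can run the scripts directly instead of going through make: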

$ cd src
$ env OS="riscv-linux" ../scripts/config-run
$ env OS="riscv-linux" ../scripts/results

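If the shared libraries are not present on the target system, running the benchmarks fails with a "not found" error. In that case you can link statically:

$ make CC="riscv64-unknown-linux-gnu-gcc -static" OS=riscv-linux

With static linking, however, the bin directory grows from 3 MB to 200 MB.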

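Benchmarks

The lmbench code is fairly simple: it consists mainly of C programs plus some scripts. The C programs perform the actual measurements; the scripts organize the tests and collect and process the results.

This section lists the benchmarks that are useful for CPU work. For usage examples, see the scripts/lmbench script.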

http://lmbench.sourceforge.net/man/index.html

mhz

calculate processor clock rate

mhz calculates the processor clock rate and megahertz. It uses an unrolled, interlocked loop of adds or shifts. So far, superscalarness has been defeated on the tested processors (SuperSPARC, RIOS, Alpha).

Output format is either just the clock rate as a float (-c) or more verbose

39.80 Mhz, 25 nanosec clock

Example:

$ ./mhz
2681 MHz, 0.3730 nanosec clock
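The two numbers that mhz prints are reciprocals of each other: the clock period in nanoseconds is 1000 divided by the clock rate in MHz. A minimal sketch of that conversion, where mhz_to_ns is a hypothetical helper name and not part of lmbench:

```shell
#!/bin/sh
# Hypothetical helper (not part of lmbench): convert a clock rate in MHz
# to a clock period in nanoseconds, using period_ns = 1000 / MHz.
mhz_to_ns() {
	awk -v mhz="$1" 'BEGIN { printf "%.4f\n", 1000 / mhz }'
}

mhz_to_ns 2681   # the clock rate from the example above; prints 0.3730
```

The same arithmetic applies to the "Processor mhz" default shown during configuration.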

bw_mem

time memory bandwidth

bw_mem allocates twice the specified amount of memory, zeros it, and then times the copying of the first half to the second half. Results are reported in megabytes moved per second.

The size specification may end with "k" or "m" to mean kilobytes (* 1024) or megabytes (* 1024 * 1024).

Output format is "%0.2f %.2f\n", megabytes, megabytes_per_second, i.e.,

8.00 25.33

There are nine different memory benchmarks in bw_mem. They each measure slightly different methods for reading, writing or copying data.

rd

measures the time to read data into the processor. It computes the sum of an array of integer values. It accesses every fourth word.

wr

measures the time to write data to memory. It assigns a constant value to each member of an array of integer values. It accesses every fourth word.

rdwr

measures the time to read data into memory and then write data to the same memory location. For each element in an array it adds the current value to a running sum before assigning a new (constant) value to the element. It accesses every fourth word.

cp

measures the time to copy data from one location to another. It does an array copy: dest[i] = source[i]. It accesses every fourth word.

frd

measures the time to read data into the processor. It computes the sum of an array of integer values.

fwr

measures the time to write data to memory. It assigns a constant value to each member of an array of integer values.

fcp

measures the time to copy data from one location to another. It does an array copy: dest[i] = source[i].

bzero

measures how fast the system can bzero memory.

bcopy

measures how fast the system can bcopy data.

Belongs to: OS | HARDWARE

Example:

$ ./bw_mem 1024k rd    # lowercase k multiplies by 1024, lowercase m by 1024*1024
1.05 44780.39          # first value is megabytes, second is megabytes_per_second
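The first output column looks like decimal megabytes (10^6 bytes) even though the size suffix uses binary multiples; that reading is consistent with 1024k being reported as 1.05 above. A sketch of that arithmetic under this assumption, where spec_to_mb is a hypothetical helper and not part of lmbench:

```shell
#!/bin/sh
# Hypothetical helper (not part of lmbench): turn a bw_mem size spec such as
# 1024k or 8m into the megabytes figure shown in the first output column.
# Assumptions: k = 1024 bytes, m = 1024*1024 bytes, 1 megabyte = 10^6 bytes.
spec_to_mb() {
	awk -v spec="$1" 'BEGIN {
		n = spec + 0                         # leading numeric part of the spec
		if (spec ~ /k$/) n *= 1024
		else if (spec ~ /m$/) n *= 1024 * 1024
		printf "%.2f\n", n / 1000000
	}'
}

spec_to_mb 1024k   # prints 1.05, matching the example above
```

Other lmbench versions may scale this column differently, so it is worth checking against a run on the actual build.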

lat_mem_rd

memory read latency benchmark

lat_mem_rd measures memory read latency for varying memory sizes and strides. The results are reported in nanoseconds per load and have been verified accurate to within a few nanoseconds on an SGI Indy.

The entire memory hierarchy is measured, including onboard cache latency and size, external cache latency and size, main memory latency, and TLB miss latency.

Only data accesses are measured; the instruction cache is not measured.

The benchmark runs as two nested loops. The outer loop is the stride size. The inner loop is the array size. For each array size, the benchmark creates a ring of pointers that point backward one stride. Traversing the array is done by

p = (char **)*p;

in a for loop (the overhead of the for loop is not significant; the loop is an unrolled loop 100 loads long).

The size of the array varies from 512 bytes to (typically) eight megabytes. For the small sizes, the cache will have an effect, and the loads will be much faster. This becomes much more apparent when the data is plotted.

Since this benchmark uses fixed-stride offsets in the pointer chain, it may be vulnerable to smart, stride-sensitive cache prefetching policies. Older machines were typically able to prefetch for sequential access patterns, and some were able to prefetch for strided forward access patterns, but only a few could prefetch for backward strided patterns. These capabilities are becoming more widespread in newer processors.

Belongs to: HARDWARE | MEM

Example:

$ lat_mem_rd 64 128						# measure up to 64 MB with stride 128
$ lat_mem_rd 64 16 32 64 128 512 1024		# measure up to 64 MB with strides 16/32/64/128/512/1024

$ lat_mem_rd -t 64 16						# random load latency

The "Memory latencies in nanoseconds" table in summary.out is derived from lat_mem_rd. The rules are:

$ lat_mem_rd 64 128	# size must be larger than 8 MB, stride must be 128
"stride=128
0.00049 1.493
0.00098 1.497        # L1 $
0.00195 1.493
....
0.03125 1.496
0.04688 5.016
....
0.12500 5.239		# L2 $
...
0.75000 5.950
1.00000 7.087
....
6.00000 8.620
8.00000 31.398
12.00000 30.405
....
64.00000 31.508		# Main mem

$ lat_mem_rd -t 64 16	# size must be larger than 8 MB, stride must be 16
"stride=16
0.00049 1.392
0.00098 1.388
0.00195 1.493
....
0.03125 1.446
0.04688 4.868
....
0.12500 4.941
...
0.75000 5.322
1.00000 9.715
1.50000 19.136
2.00000 21.574
3.00000 21.907
4.00000 86.572
6.00000 89.870
8.00000 90.388
12.00000 90.182
....
64.00000 90.813		# Rand mem
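The annotated rows can be pulled out of a saved run mechanically. A minimal sketch, assuming the two-column size/latency format shown above; lat_at is a hypothetical helper, and the sizes chosen for L1, L2 and main memory are simply the rows annotated above, not necessarily the exact heuristic the summary script applies:

```shell
#!/bin/sh
# Hypothetical helper (not part of lmbench): print the latency (ns) that
# lat_mem_rd reported for a given array size (MB), reading its output on stdin.
# Non-numeric lines such as the "stride=128 header are skipped.
lat_at() {
	awk -v size="$1" '$1 ~ /^[0-9.]+$/ && $1 + 0 == size + 0 { print $2 }'
}

# A few rows copied from the stride=128 run above.
sample='"stride=128
0.00098 1.497
0.12500 5.239
64.00000 31.508'

echo "L1 cache: $(printf '%s\n' "$sample" | lat_at 0.00098) ns"
echo "L2 cache: $(printf '%s\n' "$sample" | lat_at 0.125) ns"
echo "Main mem: $(printf '%s\n' "$sample" | lat_at 64) ns"
```

On a real run the benchmark output would be piped into lat_at instead of the sample text; if nothing arrives on the pipe, the results may be going to stderr, in which case adding 2>&1 before the pipe is worth trying.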

lat_ops

basic CPU operation latency

lat_ops [ -W <warmups> ] [ -N <repetitions> ]

lat_ops measures the latency of basic CPU operations, such as integer ADD.

Example:

$ ./lat_ops
integer bit: 0.37 nanoseconds
integer add: 0.19 nanoseconds
integer mul: 1.06 nanoseconds
integer div: 10.13 nanoseconds
integer mod: 9.66 nanoseconds
int64 bit: 0.37 nanoseconds
uint64 add: 0.17 nanoseconds
int64 mul: 1.12 nanoseconds
int64 div: 14.78 nanoseconds
int64 mod: 15.27 nanoseconds
float add: 1.41 nanoseconds
float mul: 1.50 nanoseconds
float div: 4.00 nanoseconds
double add: 1.50 nanoseconds
double mul: 1.39 nanoseconds
double div: 5.40 nanoseconds
float bogomflops: 1.04 nanoseconds
double bogomflops: 1.50 nanoseconds

par_ops

basic CPU operation parallelism

par_ops [ -W <warmups> ] [ -N <repetitions> ]

par_ops measures the available parallelism for basic CPU operations, such as integer ADD. Results are reported as the average operation latency divided by the minimum average operation latency across all levels of parallelism.

Example:

$ ./par_ops
integer bit parallelism: 3.71 nanoseconds
integer add parallelism: 2.67 nanoseconds
integer mul parallelism: 3.33 nanoseconds
integer div parallelism: 4.15 nanoseconds
integer mod parallelism: 4.34 nanoseconds
int64 bit parallelism: 2.88 nanoseconds
int64 add parallelism: 2.00 nanoseconds
int64 mul parallelism: 3.10 nanoseconds
int64 div parallelism: 1.67 nanoseconds
int64 mod parallelism: 1.70 nanoseconds
float add parallelism: 8.01 nanoseconds
float mul parallelism: 7.60 nanoseconds
float div parallelism: 3.85 nanoseconds
double add parallelism: 7.96 nanoseconds
double mul parallelism: 7.60 nanoseconds
double div parallelism: 3.72 nanoseconds

par_mem

memory parallelism benchmark

par_mem [ -L <line size> ] [ -M <len> ] [ -W <warmups> ] [ -N <repetitions> ]

par_mem measures the available parallelism in the memory hierarchy, up to len bytes. Modern processors can often service multiple memory requests in parallel, while older processors typically blocked on LOAD instructions and had no available parallelism (other than that provided by cache prefetching). par_mem measures the available parallelism at a variety of points, since the available parallelism is often a function of the data location in the memory hierarchy.

Here -M is the size (len) to measure and -L is the line size; it is not clear to me what the latter means.

$ ./par_mem -L 512 -M 64M
0.004096 7.99
0.008192 7.99
...
1.048576 10.37
...
4.194304 7.42
8.388608 18.62
16.777216 33.92
33.554432 9.48

lat_ctx

context switching benchmark

lat_ctx measures context switching time for any reasonable number of processes of any reasonable size. The processes are connected in a ring of Unix pipes. Each process reads a token from its pipe, possibly does some work, and then writes the token to the next process.

lat_ctx [ -P <parallelism> ] [ -W <warmups> ] [ -N <repetitions> ] [ -s <size_in_kbytes> ] #procs [ #procs ... ]

Belongs to: OS | CTX

Example:

$ ./lat_ctx -s 0 2 4 8 16 24 32 64 96
"size=0k ovr=1.32
2 3.99
4 4.24
8 5.63
16 5.83
24 5.29
32 5.56
64 7.32
96 5.05

$ ./lat_ctx -s 32 2 4 8 16 24 32 64 96
"size=32k ovr=2.40
2 2.38
4 12.39
8 5.08
16 7.46
24 12.00
32 5.30
64 18.63
96 18.58

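Timing

Time is measured with gettimeofday.

Prototype: int gettimeofday(struct timeval *tv, struct timezone *tz);

Description: gettimeofday() returns the current time in the structure pointed to by tv, and stores the local time zone information in the structure pointed to by tz.

The timeval structure is defined as:
struct timeval {
	long tv_sec;  // seconds
	long tv_usec; // microseconds
};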
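An elapsed interval is the difference between two such samples, combining the seconds and microseconds fields. A sketch of that arithmetic with made-up sample values; tv_diff_usec is a hypothetical helper for illustration, not lmbench's own timing code:

```shell
#!/bin/sh
# Hypothetical helper: elapsed microseconds between two timeval samples,
# passed as start_sec start_usec end_sec end_usec.
tv_diff_usec() {
	echo $(( ($3 - $1) * 1000000 + ($4 - $2) ))
}

# Made-up samples: start at 10 s + 900000 us, stop at 12 s + 100000 us.
tv_diff_usec 10 900000 12 100000   # prints 1200000
```

The microseconds difference can be negative, as in this example, which is why both fields have to be combined before interpreting the result.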

Scripts

The benchmarks needed for CPU work can be pulled out and run directly: write a small script like the one below and just run that.

#!/bin/sh

echo "-----------------------------"
echo "lat_ops"
./lat_ops
echo "-----------------------------"
echo " "

SIZE="32k 64k 128k 256k 512k 1m 2m 4m 8m 16m 32m 64m 128m"
BW_MEM_TYPE="rd wr rdwr cp frd fwr fcp bzero bcopy"
for t in $BW_MEM_TYPE;
do
	echo "-----------------------------"
	echo "bw_mem < $t >: size(megabytes) bandwidth(megabytes per second)"
	for s in $SIZE;
	do
		./bw_mem $s $t
	done
	echo "-----------------------------"
	echo " "
done

echo "-----------------------------"
echo "lat_mem_rd 128m stride=128"
./lat_mem_rd 128 128
echo "-----------------------------"
echo " "

echo "-----------------------------"
echo "lat_mem_rd 128m stride=16"
./lat_mem_rd -t 128 16
echo "-----------------------------"
echo " "

echo "-----------------------------"
echo "par_ops"
./par_ops
echo "-----------------------------"
echo " "

echo "-----------------------------"
echo "par_mem -L 512 -M 64M"
./par_mem -L 512 -M 64M
echo "-----------------------------"
echo " "

Note

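lmbench's bandwidth counts the traffic on both sides (read and write) rather than the amount of data that memcpy copies, so to get the amount of data copied, divide the reported figure by 2.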
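As a worked example of that rule, the sketch below halves a reported figure; copied_bw is a hypothetical helper name, and the input is the Bcopy (libc) value from the summary earlier in this post:

```shell
#!/bin/sh
# Hypothetical helper (not part of lmbench): halve a reported bandwidth to get
# the rate at which data is actually copied, following the note above.
copied_bw() {
	awk -v bw="$1" 'BEGIN { printf "%.2f\n", bw / 2 }'
}

copied_bw 4711.4   # prints 2355.70 (MB/s of data copied)
```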