Benchmarking¶
Objectives
Introduce the example problem
Prepare the system for benchmarking using pyperf
Learn how to run benchmarks using time, timeit and pyperf timeit
Instructor note
15 min teaching/type-along
The problem: word count¶

In this episode, we will use a local word-count exercise derived from the
word-count-hpda example project.
It finds the most frequent words in books. The copied exercise files live in
the wordcount/ directory next to this lesson, and the books are saved in
plain-text format in wordcount/data/.
For example, to run this code for one book, pg99.txt:
$ cd wordcount
$ python v0.py data/pg99.txt processed_data/pg99.dat
Preparation: Use pyperf to tune your system¶
Most personal laptops run in a power-saver or balanced power management mode. Among other things, this means that the system has a scaling governor which can change the CPU clock frequency on demand. This can cause jitter, which makes benchmarks less reproducible and less reliable.
In order to improve the reliability of your benchmarks, consider running the following:
Warning
It requires admin / root privileges.
# python -m pyperf system tune
When you are done with the lesson, you can run python -m pyperf system reset or
restart the computer to go back to your default CPU settings.
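On Linux, the scaling governor can be inspected before and after tuning. The following is a minimal sketch, assuming the kernel exposes the cpufreq interface under /sys/devices/system/cpu. The helper name read_scaling_governors is hypothetical and not part of the lesson files. On systems without this interface it returns an empty result.

```python
from pathlib import Path


def read_scaling_governors(sysfs_root="/sys/devices/system/cpu"):
    """Return a mapping such as {"cpu0": "powersave"}; empty if unavailable."""
    governors = {}
    # Each core that supports frequency scaling has its own cpufreq directory.
    for path in sorted(Path(sysfs_root).glob("cpu[0-9]*/cpufreq/scaling_governor")):
        try:
            governors[path.parent.parent.name] = path.read_text().strip()
        except OSError:
            # Skip entries that cannot be read (for example, due to permissions).
            continue
    return governors


if __name__ == "__main__":
    found = read_scaling_governors()
    if found:
        for cpu, governor in found.items():
            print(f"{cpu}: {governor}")
    else:
        print("No cpufreq information found on this system.")
```

A governor such as powersave or ondemand suggests that the clock frequency may change during a benchmark.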
Benchmark using time¶
In order to observe the cost of computation, we need to choose a sufficiently large input data file and time the computation. We can do that by concatenating all the books into a single input file approximately 45 MB in size.
Type-Along
Copy the following script.
import fileinput
from pathlib import Path
files = Path("data").glob("pg*.txt")
file_concat = Path("data", "concat.txt")
with (
    fileinput.input(files) as file_in,
    file_concat.open("w") as file_out
):
    for line in file_in:
        file_out.write(line)
Open an IPython console or JupyterLab, with wordcount as the
current working directory (you can also use %cd inside IPython
to change the directory).
%paste
%ls -lh data/concat.txt
import sys
sys.path.insert(0, ".")
import v0 as wordcount
%time wordcount.word_count("data/concat.txt", "processed_data/concat.dat", 1)
Alternatively, from a Unix shell:
$ cat data/pg*.txt > data/concat.txt
$ ls -lh data/concat.txt
$ time python v0.py data/concat.txt processed_data/concat.dat
Solution
In [1]: %paste
import fileinput
from pathlib import Path
files = Path("data").glob("pg*.txt")
file_concat = Path("data", "concat.txt")
with (
    fileinput.input(files) as file_in,
    file_concat.open("w") as file_out
):
    for line in file_in:
        file_out.write(line)
## -- End pasted text --
In [2]: %ls -lh data/concat.txt
-rw-rw-r-- 1 ashwinmo ashwinmo 45M sep 24 14:54 data/concat.txt
In [3]: import sys
...: sys.path.insert(0, ".")
In [4]: import v0 as wordcount
In [5]: %time wordcount.word_count("data/concat.txt", "processed_data/concat.dat", 1)
CPU times: user 2.64 s, sys: 146 ms, total: 2.79 s
Wall time: 2.8 s
The equivalent from a Unix shell:
$ cat data/pg*.txt > data/concat.txt
$ ls -lh data/concat.txt
-rw-rw-r-- 1 ashwinmo ashwinmo 46M sep 24 14:58 data/concat.txt
$ time python v0.py data/concat.txt processed_data/concat.dat
real 0m2,826s
user 0m2,645s
sys 0m0,180s
Note
What are the implications of this small benchmark test?
It takes a few seconds to analyze a 45 MB file. Imagine that you are working in a library and you are tasked with running this on several terabytes of data.
10 TB = 10 000 000 MB
Current processing speed = 45 MB / 2.8 s ~ 16 MB/s
Estimated time = 10 000 000 / 16 = 625 000 s = 7.2 days
Then the same script would take days to complete!
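The estimate above can be reproduced with a few lines of Python. This is a rough sketch that assumes the runtime scales linearly with input size, which may not hold for real workloads. The function name estimate_runtime_days is introduced here for illustration only.

```python
def estimate_runtime_days(sample_mb, sample_seconds, target_mb):
    """Extrapolate linearly from a sample run to a larger data set."""
    throughput = sample_mb / sample_seconds  # MB per second
    seconds = target_mb / throughput
    return throughput, seconds / 86_400  # 86 400 seconds in a day


throughput, days = estimate_runtime_days(
    sample_mb=45, sample_seconds=2.8, target_mb=10_000_000
)
print(f"Throughput: {throughput:.1f} MB/s, estimated time: {days:.1f} days")
```

Without rounding the throughput to 16 MB/s first, the estimate still comes out at roughly 7.2 days.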
Benchmark using timeit¶
If you run the %time magic / time command again, you will notice
that the results vary a bit. To get a reliable answer we should repeat
the benchmark several times using timeit. timeit is part of
the Python standard library and it can be imported in a Python script
or used via a command-line interface.
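As a minimal sketch of the scripted use, the snippet below times a small stand-in workload with the timeit module. The function sum_of_squares is a hypothetical example and is not related to the word-count code.

```python
import timeit


# Hypothetical workload used only for illustration.
def sum_of_squares(n=1000):
    return sum(i * i for i in range(n))


timer = timeit.Timer(sum_of_squares)
# autorange() picks a loop count so that one measurement takes at least 0.2 s.
loops, _ = timer.autorange()
# repeat() returns one total time per run; divide by loops for the per-call time.
totals = timer.repeat(repeat=5, number=loops)
per_call = [total / loops for total in totals]
print(f"{loops} loops, best of 5: {min(per_call) * 1e6:.2f} µs per loop")
```

Passing a callable to timeit.Timer avoids building the statement as a string, at the cost of a small function-call overhead in every loop.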
If you’re using IPython / Jupyter notebook, the best choice will be
to use the %timeit magic.
As an example, here we benchmark squaring a NumPy array:
import numpy as np
a = np.arange(1000)
%timeit a ** 2
# 1.4 µs ± 25.1 ns per loop
We could do the same for the word_count function.
In [6]: %timeit wordcount.word_count("data/concat.txt", "processed_data/concat.dat", 1)
# 2.81 s ± 12.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
We could also use python -m timeit, which is the command-line interface of the standard library module timeit,
$ export PYTHONPATH=.
$ python -m timeit --setup 'import v0 as wordcount' 'wordcount.word_count("data/concat.txt", "processed_data/concat.dat", 1)'
1 loop, best of 5: 2.75 sec per loop
or, even better, use python -m pyperf timeit:
$ export PYTHONPATH=.
$ python -m pyperf timeit --fast --setup 'import v0 as wordcount' 'wordcount.word_count("data/concat.txt", "processed_data/concat.dat", 1)'
...........
Mean +- std dev: 2.72 sec +- 0.22 sec
Notice that the output reports the arithmetic mean and standard deviation of the timings. This is a good choice, since it means that outliers and temporary spikes in the results are not automatically removed. Such spikes could be a result of:
garbage collection
JIT compilation
CPU or memory resource limitations
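To illustrate the difference between the two summaries, the sketch below compares a "best of N" figure with the mean and standard deviation. The timings are made-up values, not measurements from this lesson, and one of them contains a deliberate spike.

```python
import statistics

# Hypothetical timings in seconds; the fourth run contains a temporary spike.
timings = [2.71, 2.69, 2.73, 3.40, 2.70]

best = min(timings)
mean = statistics.mean(timings)
std_dev = statistics.stdev(timings)

print(f"best of {len(timings)}: {best:.2f} s")
print(f"mean +- std dev: {mean:.2f} s +- {std_dev:.2f} s")
```

The minimum hides the spike entirely, whereas the mean shifts upwards and the standard deviation grows, which signals that the runs were not uniform.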
Keypoints
pyperf can be used to tune the system
We understood the use of time and timeit to create benchmarks
time is faster, since it is executed only once
timeit is more reliable, since it collects statistics
Additional benchmarking examples¶
Benchmarking is a method of doing performance analysis for either the end-to-end execution of a whole program or a part of a program.
time¶
One easy way to benchmark is to use the time.time() function:
import time
def some_function():
    ...
# start the timer
start_time = time.time()
# here is the code you would like to measure
result = some_function()
# stop the timer
end_time = time.time()
print("Runtime: {:.4f} seconds".format(end_time - start_time))
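For measuring short intervals, the standard library also provides time.perf_counter(), a monotonic clock with the highest available resolution that is not affected by system clock adjustments. Below is a minimal sketch of the same pattern. The body of some_function is a placeholder workload assumed for illustration.

```python
import time


def some_function():
    # Placeholder workload (an assumption for illustration).
    return sum(i * i for i in range(100_000))


# perf_counter() only has meaning as a difference between two readings.
start = time.perf_counter()
result = some_function()
elapsed = time.perf_counter() - start

print(f"Runtime: {elapsed:.4f} seconds")
```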
The IPython “magic” command %time can also be used to make a similar benchmark with less effort as follows:
%time some_function()
timeit¶
If you’re using a Jupyter notebook, the best choice will be to use the timeit module or the IPython “magic” command %timeit to repeatedly time a small piece of code:
import numpy as np
a = np.arange(1000)
%timeit a ** 2
We will shortly see in an
One can also use the cell magic %%timeit to benchmark a full cell containing a block of code.
Exercise 1¶
Exercise
Start with the following code:
import numpy as np
a = np.arange(1000)
def square_sum(array):
    return (array ** 2).sum()
1. Run %time square_sum(a) a couple of times. Do you get the same result?
2. Run %timeit square_sum(a) a couple of times. Do you get the same result?
3. (optional) Execute the following benchmark and compare it with the output of question number 1.
from urllib.request import urlopen
%time urlopen("https://raw.githubusercontent.com/ENCCS/hpda-python/refs/heads/main/content/data/tas1840.nc")
Solution
1. Run %time square_sum(a) a couple of times.
In [1]: import numpy as np
...:
...:
...: a = np.arange(1000)
...:
...: def square_sum(array):
...:     return (array ** 2).sum()
...:
In [2]: %time square_sum(a)
CPU times: user 184 μs, sys: 5 μs, total: 189 μs
Wall time: 155 μs
Out[2]: np.int64(332833500)
In [3]: %time square_sum(a)
CPU times: user 74 μs, sys: 0 ns, total: 74 μs
Wall time: 77.7 μs
Out[3]: np.int64(332833500)
We get a rough estimate of how long it takes to execute a function for a given
input value. While useful, a few sample timings of the function square_sum()
do not represent a reproducible benchmark.
Subsequent measurements can result in different runtimes, due to the state of the
computer such as:
what background processes are running,
hyperthreading,
memory and cache usage,
CPU’s temperature,
and many more factors, also collectively known as system jitter.
2. Run %timeit square_sum(a) a couple of times.
In [4]: %timeit square_sum(a)
1.62 μs ± 55.4 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
In [5]: %timeit square_sum(a)
1.6 μs ± 46.6 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
By making several measurements, we reduce the effect of jitter and the measurement becomes more reliable.
Note
For long-running calls, use %time instead of %timeit; it is
less precise but faster.
3. (optional) Comparing benchmarks of %time square_sum(a) and %time urlopen(...).
In [6]: from urllib.request import urlopen
In [7]: %time urlopen("https://raw.githubusercontent.com/ENCCS/hpda-python/refs/heads/main/content/data/tas1840.nc")
CPU times: user 4.66 ms, sys: 974 μs, total: 5.63 ms
Wall time: 21.4 ms
Out[7]: <http.client.HTTPResponse at 0x78ea989eed40>
In (1) we see that the CPU time and the wall time are comparable, which indicates that the operation is CPU bound.
However, in (3) we clearly see that the CPU time is lower than the wall time, from which we can deduce that it is not a CPU-bound operation. In this particular case, the operation was I/O bound. Some common I/O-bound operations are network related, or are due to latency in filesystems or the use of inefficient file storage formats.
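The same distinction can be observed outside IPython by reading two clocks from the standard library: time.perf_counter() for wall time and time.process_time() for CPU time. The following is a sketch in which time.sleep stands in for waiting on a network or disk. The helper names measure, cpu_bound and waiting are hypothetical.

```python
import time


def measure(func):
    """Return (wall_seconds, cpu_seconds) for a single call of func."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


def cpu_bound():
    # Busy arithmetic keeps the CPU occupied for the whole call.
    sum(i * i for i in range(1_000_000))


def waiting():
    # Sleeping uses almost no CPU time; it mimics waiting for I/O (an assumption).
    time.sleep(0.2)


for func in (cpu_bound, waiting):
    wall, cpu = measure(func)
    print(f"{func.__name__}: wall {wall:.3f} s, CPU {cpu:.3f} s")
```

For the CPU-bound call the two numbers should be close, while for the waiting call the CPU time should be far below the wall time.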