Profiling¶
Objectives
Learn how to profile Python code using
cProfileLearn how to visualise cProfile results using
SnakeVizExamine the most most expensive function call via
line_profiler
Instructor note
15 min teaching/type-along
Using cProfile to investigate performance¶
While %timeit can provide good benchmarking information on single lines or single functions,
larger codebases have more complex function hierarchies which require more sofisticated tools to
traverse properly. Python comes with two built-in tools
to profile code, which implement the same interface: cProfile and profile.
These tools can help to identify performance bottlenecks in the code.
In this lesson, we will use cProfile due to its smaller overhead (profile, on the other hand,
is more extensible). The standard syntax to call it is:
$ python -m cProfile [-o <outputFile>] <python_module>
By default, cProfile writes the results to stdout, but the optional -o flag redirects
the output to file instead. A report can be generated using the pstats command.
Type-Along
Let’s profile the wordcount script and write the results to a file.
Warning
Use the shell variant. The profiling output from Jupyter, although it seems to work, is hard to decipher.
The %run magic supports profiling out-of-the-box using the -p flag. The script can be run as:
In [1]: %run -p -D wordcount.prof v0.py data/concat.txt processed_data/concat.dat
*** Profile stats marshalled to file 'wordcount.prof'.
We can call cProfile as:
$ python -m cProfile -o wordcount.prof v0.py data/concat.txt processed_data/concat.dat
We can then generate a report using the pstats command:
$ python -m pstats wordcount.prof
# Welcome to the profile statistics browser.
# wordcount.prof% sort tottime
# wordcount.prof% stats
# Wed Sep 25 11:52:27 2024 wordcount.prof
# 53473208 function calls in 8.410 seconds
# Ordered by: internal time
# ncalls tottime percall cumtime percall filename:lineno(function)
# 1233410 4.151 0.000 7.204 0.000 v0.py:41(update_word_counts)
# 32068660 1.799 0.000 1.799 0.000 {method 'replace' of 'str' objects}
# 7747363 0.570 0.000 0.570 0.000 {method 'lower' of 'str' objects}
# 7747363 0.428 0.000 0.428 0.000 {method 'strip' of 'str' objects}
# 1530212 0.271 0.000 0.271 0.000 v0.py:23(<genexpr>)
# 1233411 0.256 0.000 0.256 0.000 {method 'split' of 'str' objects}
# 1 0.184 0.184 7.388 7.388 v0.py:59(calculate_word_counts)
# 382553 0.133 0.000 0.404 0.000 {method 'join' of 'str' objects}
# 1 0.126 0.126 0.580 0.580 v0.py:16(save_word_counts)
# ...
Discussion
Profiling introduces a non-negligible overhead on the code being executed. Thus, the absolute values for time being spent in each function should be taken with a grain of salt. The real objective lies in understanding the relative amount of time spent in each function call.
Using SnakeViz to visualise performance reports¶
SnakeViz is a browser-based visualiser of
performance reports generated by cProfile. It is already included among the
dependecies installed in this virtual/Conda environment.
Type-Along
SnakeViz has a IPython magic to profile and open a browser directly. To use it, we just need to load the relevant extension and run it:
In [4]: %load_ext snakeviz
In [5]: %snakeviz wordcount.word_count("data/concat.txt", "processed_data/concat.dat", 1)
Warning
This will run only if the source IPython instance has access to a local web browser.
This also means that, e.g., if you are on Windows and following the tutorial in WSL,
this will not work.
We can run SnakeViz as:
$ snakeviz wordcount.prof --server
The output will contain a clickable link containing the visualisation.

Based on the output, we can clearly see that the update_word_counts() function
is where most of the runtime of the script is spent.
Using line_profiler to inspect the expensive function¶
Once the main performance-intensive function is identified, we can further examine it
to find bottlenecks. This can be done using the line_profiler tool, which returns a line-by-line
breakdown of where time is spent.
Type-Along
Let’s profile the wordcount script and write the results to a file.
The line_profiler package provides a magic to be used in IPython. First, the
magic needs to be loaded:
In [1]: %load_ext line_profiler
The script can be run with the %lprun magic, whose syntax is very close to the %run
introduced above. Notice that we have to explicitly mention which functions we want to step through
line by line:
In [5]: %lprun -f wordcount.update_word_counts wordcount.word_count("data/concat.txt", "processed_data/concat.dat", 1)
To use line_profiler from the command line, the functions to be profiled need to be explicitly marked.
For this reason, we need to import the relevant decorator at the beginning of the v0.py script:
from line_profiler import profile
Note
The tool will also work without the import
We can now mark the update_word_counts function with the @profile decorator:
@profile
def update_word_counts(line, counts):
Profiling is performed with a script called kernprof. Its usage is the following:
kernprof -lvr v0.py data/concat.txt processed_data/concat.dat
The -l flag ensures that the function is profiled step-by-step, the -v flag
shows the result on stdout and the -r flag is used to have rich format on output.
Wrote profile results to wordcount.py.lprof
Timer unit: 1e-06 s
Total time: 12.2802 s
File: v0.py
Function: update_word_counts at line 40
Line # Hits Time Per Hit % Time Line Contents
==============================================================
40 @profile
41 def update_word_counts(line, counts):
42 """
43 Given a string, parse the string and update a dictionary of word
44 counts (mapping words to counts of their frequencies). DELIMITERS are
45 removed before the string is parsed. The function is case-insensitive
46 and words in the dictionary are in lower-case.
47 """
48 33302070 2574252.9 0.1 21.0 for purge in DELIMITERS:
49 32068660 4405499.9 0.1 35.9 line = line.replace(purge, " ")
50 1233410 392268.8 0.3 3.2 words = line.split()
51 8980773 819407.4 0.1 6.7 for word in words:
52 7747363 1457841.0 0.2 11.9 word = word.lower().strip()
53 7747363 1355462.5 0.2 11.0 if word in counts:
54 7364810 1211000.7 0.2 9.9 counts[word] += 1
55 else:
56 382553 64505.7 0.2 0.5 counts[word] = 1
Based on the output, we can conclude that most of the time is spent replacing delimiters.
Keypoints
The
cProfilemodule can provide information on how costly each function call is.Profile reports can be inspected using the
pstatstool in tabular form or with SnakeViz for a graphical visualisationThe
line_profilertool can be used to inspect line-by-line performance overhead.
Profiling concepts and exercises¶
Profilers are applications which attach to the execution of the program, which in our case is done by the CPython interpreter and analyze the time taken for different portions of the code. Profilers help to identify performance bottlenecks in the code by showing
wall-time (or start to end time that the user observes),
CPU and GPU time, and
memory usage patterns
in function/method/line of code level granularity.
Deterministic profilers vs. sampling profilers¶
Note
Deterministic profilers are also called tracing profilers.
Deterministic profilers record every function call and event in the program, logging the exact sequence and duration of events.
👍 Pros: : - Provides detailed information on the program’s execution.
Deterministic: Captures exact call sequences and timings.
👎 Cons: : - Higher overhead, slowing down the program.
Can generate larger amount of data.
Sampling profilers periodically samples the program’s state (where it is and how much memory is used), providing a statistical view of where time is spent.
👍 Pros: : - Lower overhead, as it doesn’t track every event.
Scales better with larger programs.
👎 Cons: : - Less precise, potentially missing infrequent or short calls.
Provides an approximation rather than exact timing.
Discussion
Analogy: Imagine we want to optimize the Stockholm Länstrafik (SL) metro system We wish to detect bottlenecks in the system to improve the service and for this we have asked few passengers to help us by tracking their journey.
Deterministic: : We follow every train and passenger, recording every stop and delay. When passengers enter and exit the train, we record the exact time and location.
Sampling: : Every 5 minutes the phone notifies the passenger to note down their current location. We then use this information to estimate the most crowded stations and trains.
In addition to the above distinctions, some profilers can also
Examples of some profilers
CPU profilers:
Memory profilers:
Both CPU and memory:
In the following sections, we will use cProfile and Line-profiler to profile a Python program. cProfile is a deterministic (tracing) profiler built-in to the Python standard library and gives timings in function-level granularity. Line profiler is also deterministic and it provides timings in line-of-code granularity for few selected functions.
cProfile¶
As a demo, let us consider the following code which simulates a random walk in one dimension
(we can save it as walk.py or download from here):
"""A 1-D random walk.
See also:
- https://lectures.scientific-python.org/intro/numpy/auto_examples/plot_randomwalk.html
"""
import numpy as np
def step():
import random
return 1.0 if random.random() > 0.5 else -1.0
def walk(n: int, dx: float = 1.0):
"""The for-loop version.
Parameters
----------
n: int
Number of time steps
dx: float
Step size. Default step size is unity.
"""
xs = np.zeros(n)
for i in range(n - 1):
x_new = xs[i] + dx * step()
xs[i + 1] = x_new
return xs
def walk_vec(n: int, dx: float = 1.0):
"""The vectorized version of :func:`walk` using numpy functions."""
import random
steps = np.array(random.sample([1, -1], k=n, counts=[10 * n, 10 * n]))
# steps = np.random.choice([1, -1], size=n)
dx_steps = dx * steps
# set initial condition to zero
dx_steps[0] = 0
# use cumulative sum to replicate time evolution of position x
xs = np.cumsum(dx_steps)
return xs
if __name__ == "__main__":
n = 1_000_000
_ = walk(n)
_ = walk_vec(n)
We can profile it with cProfile:
$ python -m cProfile -s time walk.py
The -s switch sorts the results by time. Other options include
e.g. function name, cumulative time, etc. However, this will print a lot of
output which is difficult to read.
$ python -m cProfile -o walk.prof walk.py
It’s also possible to write the profile
to a file with the -o flag and view it with profile pstats module
or profile visualisation tools like
Snakeviz
or tuna.
Note
Similar functionality is available in interactive IPython or Jupyter sessions with the magic command %%prun.
Line-profiler¶
The cProfile tool tells us which function takes most of the time but it does not give us a line-by-line breakdown of where time is being spent. For this information, we can use the line_profiler tool.
Demo: line profiling
For line-profiling source files from the command line, we can add a decorator @profile
to the functions of interests. If we do this for the step() and walk() function
in the example above, we can then run the script using the kernprof.py program which comes with
line_profiler, making sure to include the switches -l, --line-by-line and -v, --view:
$ kernprof -l -v walk.py
line_profiler also works in a Jupyter notebook. First one needs to load the extension:
%load_ext line_profiler
If the walk() and step() functions are defined in code cells, we can get the line-profiling
information by:
%lprun -f walk -f step walk(10000)
Based on the output, can you spot a mistake which is affecting performance?
Line-profiling output
Wrote profile results to walk.py.lprof
Timer unit: 1e-06 s
Total time: 0.113249 s
File: walk.py
Function: step at line 4
Line # Hits Time Per Hit % Time Line Contents
==============================================================
4 @profile
5 def step():
6 99999 57528.0 0.6 50.8 import random
7 99999 55721.0 0.6 49.2 return 1. if random.random() > .5 else -1.
Total time: 0.598811 s
File: walk.py
Function: walk at line 9
Line # Hits Time Per Hit % Time Line Contents
==============================================================
9 @profile
10 def walk(n):
11 1 20.0 20.0 0.0 x = np.zeros(n)
12 1 1.0 1.0 0.0 dx = 1. / n
13 100000 44279.0 0.4 7.4 for i in range(n - 1):
14 99999 433303.0 4.3 72.4 x_new = x[i] + dx * step()
15 99999 53894.0 0.5 9.0 if x_new > 5e-3:
16 x[i + 1] = 0.
17 else:
18 99999 67313.0 0.7 11.2 x[i + 1] = x_new
19 1 1.0 1.0 0.0 return x
The mistake
The mistake is that the random module is loaded inside the step() function
which is called thousands of times! Moving the module import to the top level saves
considerable time.
Exercise 2¶
Exercise
Start by copying in the script walk.py into Jupyter or
download it and execute %run walk.py
Use
%timeitmagic command in Jupyter to benchmark the functionswalk(1_000_000)andwalk_vec(1_000_000). Which is faster?Use:
%load_ext line_profiler %lprun -f walk_vec walk_vec(1_000_000)
to apply line-profiler on
walk_vecfunction. What is the bottleneck?Modify the following lines to use to change how the
stepsarray is initialized. Execute it to re-initializewalk_vec. Redo the above benchmark and profiling. Does it improve the performance?def walk_vec(n: int, dx: float = 1.0): """The vectorized version of :func:`walk` using numpy functions.""" import random steps = np.array(random.sample([1, -1], k=n, counts=[10 * n, 10 * n])) # steps = np.random.choice([1, -1], size=n) dx_steps = dx * steps # set initial condition to zero dx_steps[0] = 0 # use cumulative sum to replicate time evolution of position x xs = np.cumsum(dx_steps) return xs
Solution
Benchmarking
walkandwalk_vecIn [1]: %run walk.py In [2]: %timeit walk(1_000_000) 246 ms ± 2.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) In [3]: %timeit walk_vec(1_000_000) 397 ms ± 6.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Shows
walk_vec, although use vectorised operations, is slower. Although it is not directly obvious why.Profiling
walk_vecIn [4]: %load_ext line_profiler In [5]: %lprun -f walk_vec walk_vec(1_000_000)
illustrates how creating the
stepsarray consumes more than 99% of the total time:Timer unit: 1e-09 s Total time: 1.14911 s File: /home/ashwinmo/Sources/enccs/hpda-python/content/example/walk.py Function: walk_vec at line 36 Line # Hits Time Per Hit % Time Line Contents ============================================================== 36 def walk_vec(n: int, dx: float = 1.0): 37 """The vectorized version of :func:`walk` using numpy functions.""" 38 1 3374.0 3374.0 0.0 import random 39 1 1145604166.0 1e+09 99.7 steps = np.array(random.sample([1, -1], k=n, counts=[10 * n, 10 * n])) 40 41 # steps = np.random.choice([1, -1], size=n) 42 43 1 1085779.0 1e+06 0.1 dx_steps = dx * steps 44 45 # set initial condition to zero 46 1 3666.0 3666.0 0.0 dx_steps[0] = 0 47 # use cumulative sum to replicate time evolution of position x 48 1 2408374.0 2e+06 0.2 xs = np.cumsum(dx_steps) 49 50 1 507.0 507.0 0.0 return xs
This is because
random.sampleis part of the standard library and produces a list.After modifying
walk_vecfunction, we see it is much faster.In [7]: %ed walk.py Editing... done. Executing edited code... In [8]: %timeit walk(1_000_000) 259 ms ± 14.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) In [9]: %timeit walk_vec(1_000_000) 6.49 ms ± 90.4 μs per loop (mean ± std. dev. of 7 runs, 100 loops each) In [10]: %lprun -f walk_vec walk_vec(1_000_000)
Line-profiler output:
Timer unit: 1e-09 s Total time: 0.0078706 s File: /home/ashwinmo/Sources/enccs/hpda-python/content/example/walk.py Function: walk_vec at line 36 Line # Hits Time Per Hit % Time Line Contents ============================================================== 36 def walk_vec(n: int, dx: float = 1.0): 37 """The vectorized version of :func:`walk` using numpy functions.""" 38 # import random 39 # steps = np.array(random.sample([1, -1], k=n, counts=[10 * n, 10 * n])) 40 41 1 4323251.0 4e+06 54.9 steps = np.random.choice([1, -1], size=n) 42 43 1 1142060.0 1e+06 14.5 dx_steps = dx * steps 44 45 # set initial condition to zero 46 1 2907.0 2907.0 0.0 dx_steps[0] = 0 47 # use cumulative sum to replicate time evolution of position x 48 1 2401730.0 2e+06 30.5 xs = np.cumsum(dx_steps) 49 50 1 649.0 649.0 0.0 return xs
Scalene profiler¶
Profiling¶
Scalene¶
Scalene is a sampling profiler. In addition to timings , it can also give insight into:
CPU time spent in Python (interpreted), native (compiled) and system function calls
Memory usage and copy
GPU utilization
Memory leak detection
Moreover, it adds minimal overhead due to profiling. The downside is the results are less reproducible, because it is a sampling profiler.
Scalene can be used as a CLI tool, or using IPython magic or in
Web interface as an interactive widget. Here are some examples profiling
walk.py with Scalene.
CLI tool¶
$ scalene --cli walk.py
IPython magic¶
This allows for profiling a specific function. For example to profile just walk, we do as follows:
In [1]: %load_ext scalene
In [2]: %run walk.py
In [3]: %scrun --cli walk(n)
Gives the following output:
SCRUN MAGIC
/home/ashwinmo/Sources/enccs/hpda-python/content/example/walk.py: % of time = 100.00% (1.933s) out of 1.933s.
╷ ╷ ╷ ╷ ╷
│Time │–––––– │–––––– │–––––– │
Line │Python │native │system │GPU │/home/ashwinmo/Sources/enccs/hpda-python/content/example/walk.py
╺━━━━━━┿━━━━━━━┿━━━━━━━┿━━━━━━━┿━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸
1 │ │ │ │ │"""A 1-D random walk.
2 │ │ │ │ │
3 │ │ │ │ │See also:
4 │ │ │ │ │- https://lectures.scientific-python.org/intro/numpy/auto_examples/plot_randomwalk.html
5 │ │ │ │ │
6 │ │ │ │ │"""
7 │ │ │ │ │import numpy as np
8 │ │ │ │ │
9 │ │ │ │ │
10 │ 6% │ │ │ │def step():
11 │ │ │ │ │ import random
12 │ 7% │ 64% │ 13% │ │ return 1.0 if random.random() > 0.5 else -1.0
13 │ │ │ │ │
14 │ │ │ │ │
15 │ │ │ │ │def walk(n: int, dx: float = 1.0):
16 │ │ │ │ │ """The for-loop version.
17 │ │ │ │ │
18 │ │ │ │ │ Parameters
19 │ │ │ │ │ ----------
20 │ │ │ │ │ n: int
21 │ │ │ │ │ Number of time steps
22 │ │ │ │ │
23 │ │ │ │ │ dx: float
24 │ │ │ │ │ Step size. Default step size is unity.
25 │ │ │ │ │
26 │ │ │ │ │ """
27 │ │ │ │ │ xs = np.zeros(n)
28 │ │ │ │ │
29 │ │ │ │ │ for i in range(n - 1):
30 │ │ │ │ │ x_new = xs[i] + dx * step()
31 │ 7% │ │ │ │ xs[i + 1] = x_new
32 │ │ │ │ │
33 │ │ │ │ │ return xs
34 │ │ │ │ │
35 │ │ │ │ │
36 │ │ │ │ │def walk_vec(n: int, dx: float = 1.0):
37 │ │ │ │ │ """The vectorized version of :func:`walk` using numpy functions."""
38 │ │ │ │ │ import random
39 │ │ │ │ │ steps = np.array(random.sample([1, -1], k=n, counts=[10 * n, 10 * n]))
40 │ │ │ │ │
41 │ │ │ │ │ # steps = np.random.choice([1, -1], size=n)
42 │ │ │ │ │
43 │ │ │ │ │ dx_steps = dx * steps
44 │ │ │ │ │
45 │ │ │ │ │ # set initial condition to zero
46 │ │ │ │ │ dx_steps[0] = 0
47 │ │ │ │ │ # use cumulative sum to replicate time evolution of position x
48 │ │ │ │ │ xs = np.cumsum(dx_steps)
49 │ │ │ │ │
50 │ │ │ │ │ return xs
51 │ │ │ │ │
52 │ │ │ │ │
53 │ │ │ │ │if __name__ == "__main__":
54 │ │ │ │ │ n = 1_000_000
55 │ │ │ │ │ _ = walk(n)
56 │ │ │ │ │ _ = walk_vec(n)
57 │ │ │ │ │
│ │ │ │ │
╶──────┼───────┼───────┼───────┼───────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╴
│ │ │ │ │function summary for /home/ashwinmo/Sources/enccs/hpda-python/content/example/walk.py
10 │ 14% │ 69% │ 9% │ │step
15 │ 7% │ │ │ │walk
╵ ╵ ╵ ╵
If you run the magic command in Jupyter you can use %scrun walk(n) instead and it should an output similar to the Web interface below.
Web interface¶
Running
$ scalene walk.py
opens up the following web app: