Profiling¶

Objectives

Learn how to profile Python code using cProfile
Learn how to visualise cProfile results using SnakeViz
Examine the most most expensive function call via line_profiler

Instructor note

15 min teaching/type-along

Using `cProfile` to investigate performance¶

While %timeit can provide good benchmarking information on single lines or single functions, larger codebases have more complex function hierarchies which require more sofisticated tools to traverse properly. Python comes with two built-in tools to profile code, which implement the same interface: cProfile and profile. These tools can help to identify performance bottlenecks in the code.

In this lesson, we will use cProfile due to its smaller overhead (profile, on the other hand, is more extensible). The standard syntax to call it is:

$ python -m cProfile [-o <outputFile>] <python_module>

By default, cProfile writes the results to stdout, but the optional -o flag redirects the output to file instead. A report can be generated using the pstats command.

Type-Along

Let’s profile the wordcount script and write the results to a file.

Warning

Use the shell variant. The profiling output from Jupyter, although it seems to work, is hard to decipher.

IPython / Jupyter

The %run magic supports profiling out-of-the-box using the -p flag. The script can be run as:

In [1]: %run -p -D wordcount.prof v0.py data/concat.txt processed_data/concat.dat
 
*** Profile stats marshalled to file 'wordcount.prof'.

Unix Shell

We can call cProfile as:

$ python -m cProfile -o wordcount.prof v0.py data/concat.txt processed_data/concat.dat

We can then generate a report using the pstats command:

$ python -m pstats wordcount.prof
# Welcome to the profile statistics browser.
# wordcount.prof% sort tottime
# wordcount.prof% stats
# Wed Sep 25 11:52:27 2024    wordcount.prof

#          53473208 function calls in 8.410 seconds

#    Ordered by: internal time

#    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
#   1233410    4.151    0.000    7.204    0.000 v0.py:41(update_word_counts)
#  32068660    1.799    0.000    1.799    0.000 {method 'replace' of 'str' objects}
#   7747363    0.570    0.000    0.570    0.000 {method 'lower' of 'str' objects}
#   7747363    0.428    0.000    0.428    0.000 {method 'strip' of 'str' objects}
#   1530212    0.271    0.000    0.271    0.000 v0.py:23(<genexpr>)
#   1233411    0.256    0.000    0.256    0.000 {method 'split' of 'str' objects}
#         1    0.184    0.184    7.388    7.388 v0.py:59(calculate_word_counts)
#    382553    0.133    0.000    0.404    0.000 {method 'join' of 'str' objects}
#         1    0.126    0.126    0.580    0.580 v0.py:16(save_word_counts)
# ...

Discussion

Profiling introduces a non-negligible overhead on the code being executed. Thus, the absolute values for time being spent in each function should be taken with a grain of salt. The real objective lies in understanding the relative amount of time spent in each function call.

Using SnakeViz to visualise performance reports¶

SnakeViz is a browser-based visualiser of performance reports generated by cProfile. It is already included among the dependecies installed in this virtual/Conda environment.

Type-Along

IPython / Jupyter

SnakeViz has a IPython magic to profile and open a browser directly. To use it, we just need to load the relevant extension and run it:

In [4]: %load_ext snakeviz
In [5]: %snakeviz wordcount.word_count("data/concat.txt", "processed_data/concat.dat", 1)

Warning

This will run only if the source IPython instance has access to a local web browser. This also means that, e.g., if you are on Windows and following the tutorial in WSL, this will not work.

Unix Shell

We can run SnakeViz as:

$ snakeviz wordcount.prof --server

The output will contain a clickable link containing the visualisation.

SnakeViz output

Based on the output, we can clearly see that the update_word_counts() function is where most of the runtime of the script is spent.

Using `line_profiler` to inspect the expensive function¶

Once the main performance-intensive function is identified, we can further examine it to find bottlenecks. This can be done using the line_profiler tool, which returns a line-by-line breakdown of where time is spent.

Type-Along

Let’s profile the wordcount script and write the results to a file.

IPython / Jupyter

The line_profiler package provides a magic to be used in IPython. First, the magic needs to be loaded:

In [1]: %load_ext line_profiler

The script can be run with the %lprun magic, whose syntax is very close to the %run introduced above. Notice that we have to explicitly mention which functions we want to step through line by line:

In [5]: %lprun -f wordcount.update_word_counts wordcount.word_count("data/concat.txt", "processed_data/concat.dat", 1)

Unix Shell

To use line_profiler from the command line, the functions to be profiled need to be explicitly marked. For this reason, we need to import the relevant decorator at the beginning of the v0.py script:

from line_profiler import profile

Note

The tool will also work without the import

We can now mark the update_word_counts function with the @profile decorator:

@profile
def update_word_counts(line, counts):

Profiling is performed with a script called kernprof. Its usage is the following:

kernprof -lvr v0.py data/concat.txt processed_data/concat.dat

The -l flag ensures that the function is profiled step-by-step, the -v flag shows the result on stdout and the -r flag is used to have rich format on output.

Wrote profile results to wordcount.py.lprof
Timer unit: 1e-06 s

Total time: 12.2802 s
File: v0.py
Function: update_word_counts at line 40

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
                                         @profile
                                         def update_word_counts(line, counts):
                                             """
                                             Given a string, parse the string and update a dictionary of word
                                             counts (mapping words to counts of their frequencies). DELIMITERS are
                                             removed before the string is parsed. The function is case-insensitive
                                             and words in the dictionary are in lower-case.
                                             """
33302070    2574252.9      0.1     21.0      for purge in DELIMITERS:
32068660    4405499.9      0.1     35.9          line = line.replace(purge, " ")
 1233410     392268.8      0.3      3.2      words = line.split()
 8980773     819407.4      0.1      6.7      for word in words:
 7747363    1457841.0      0.2     11.9          word = word.lower().strip()
 7747363    1355462.5      0.2     11.0          if word in counts:
 7364810    1211000.7      0.2      9.9              counts[word] += 1
                                                 else:
  382553      64505.7      0.2      0.5              counts[word] = 1

Based on the output, we can conclude that most of the time is spent replacing delimiters.

Keypoints

The cProfile module can provide information on how costly each function call is.
Profile reports can be inspected using the pstats tool in tabular form or with SnakeViz for a graphical visualisation
The line_profiler tool can be used to inspect line-by-line performance overhead.

Profiling concepts and exercises¶

Profilers are applications which attach to the execution of the program, which in our case is done by the CPython interpreter and analyze the time taken for different portions of the code. Profilers help to identify performance bottlenecks in the code by showing

wall-time (or start to end time that the user observes),
CPU and GPU time, and
memory usage patterns

in function/method/line of code level granularity.

Deterministic profilers vs. sampling profilers¶

Note

Deterministic profilers are also called tracing profilers.

Deterministic profilers record every function call and event in the program, logging the exact sequence and duration of events.

👍 Pros: : - Provides detailed information on the program’s execution.

Deterministic: Captures exact call sequences and timings.

👎 Cons: : - Higher overhead, slowing down the program.

Can generate larger amount of data.

Sampling profilers periodically samples the program’s state (where it is and how much memory is used), providing a statistical view of where time is spent.

👍 Pros: : - Lower overhead, as it doesn’t track every event.

Scales better with larger programs.

👎 Cons: : - Less precise, potentially missing infrequent or short calls.

Provides an approximation rather than exact timing.

Discussion

Analogy: Imagine we want to optimize the Stockholm Länstrafik (SL) metro system We wish to detect bottlenecks in the system to improve the service and for this we have asked few passengers to help us by tracking their journey.

Deterministic: : We follow every train and passenger, recording every stop and delay. When passengers enter and exit the train, we record the exact time and location.
Sampling: : Every 5 minutes the phone notifies the passenger to note down their current location. We then use this information to estimate the most crowded stations and trains.

In addition to the above distinctions, some profilers can also

Examples of some profilers

CPU profilers:

Memory profilers:

Both CPU and memory:

Scalene (see optional course material on Scalene)

In the following sections, we will use cProfile and Line-profiler to profile a Python program. cProfile is a deterministic (tracing) profiler built-in to the Python standard library and gives timings in function-level granularity. Line profiler is also deterministic and it provides timings in line-of-code granularity for few selected functions.

cProfile¶

As a demo, let us consider the following code which simulates a random walk in one dimension (we can save it as walk.py or download from here):

"""A 1-D random walk.

See also:
- https://lectures.scientific-python.org/intro/numpy/auto_examples/plot_randomwalk.html

"""
import numpy as np


def step():
    import random
    return 1.0 if random.random() > 0.5 else -1.0


def walk(n: int, dx: float = 1.0):
    """The for-loop version.

    Parameters
    ----------
    n: int
        Number of time steps

    dx: float
        Step size. Default step size is unity.

    """
    xs = np.zeros(n)

    for i in range(n - 1):
        x_new = xs[i] + dx * step()
        xs[i + 1] = x_new

    return xs


def walk_vec(n: int, dx: float = 1.0):
    """The vectorized version of :func:`walk` using numpy functions."""
    import random
    steps = np.array(random.sample([1, -1], k=n, counts=[10 * n, 10 * n]))

    # steps = np.random.choice([1, -1], size=n)

    dx_steps = dx * steps

    # set initial condition to zero
    dx_steps[0] = 0
    # use cumulative sum to replicate time evolution of position x
    xs = np.cumsum(dx_steps)

    return xs


if __name__ == "__main__":
    n = 1_000_000
    _ = walk(n)
    _ = walk_vec(n)

We can profile it with cProfile:

$  python -m cProfile -s time walk.py

The -s switch sorts the results by time. Other options include e.g. function name, cumulative time, etc. However, this will print a lot of output which is difficult to read.

$ python -m cProfile -o walk.prof walk.py

It’s also possible to write the profile to a file with the -o flag and view it with profile pstats module or profile visualisation tools like Snakeviz or tuna.

Note

Similar functionality is available in interactive IPython or Jupyter sessions with the magic command %%prun.

Line-profiler¶

The cProfile tool tells us which function takes most of the time but it does not give us a line-by-line breakdown of where time is being spent. For this information, we can use the line_profiler tool.

Demo: line profiling

For line-profiling source files from the command line, we can add a decorator @profile to the functions of interests. If we do this for the step() and walk() function in the example above, we can then run the script using the kernprof.py program which comes with line_profiler, making sure to include the switches -l, --line-by-line and -v, --view:

$ kernprof -l -v walk.py

line_profiler also works in a Jupyter notebook. First one needs to load the extension:

%load_ext line_profiler

If the walk() and step() functions are defined in code cells, we can get the line-profiling information by:

%lprun -f walk -f step walk(10000)

Based on the output, can you spot a mistake which is affecting performance?

Line-profiling output

Wrote profile results to walk.py.lprof
Timer unit: 1e-06 s

Total time: 0.113249 s
File: walk.py
Function: step at line 4

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   4                                           @profile
   5                                           def step():
   6     99999      57528.0      0.6     50.8      import random
   7     99999      55721.0      0.6     49.2      return 1. if random.random() > .5 else -1.

Total time: 0.598811 s
File: walk.py
Function: walk at line 9

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   9                                           @profile
   10                                           def walk(n):
   11         1         20.0     20.0      0.0      x = np.zeros(n)
   12         1          1.0      1.0      0.0      dx = 1. / n
   13    100000      44279.0      0.4      7.4      for i in range(n - 1):
   14     99999     433303.0      4.3     72.4          x_new = x[i] + dx * step()
   15     99999      53894.0      0.5      9.0          if x_new > 5e-3:
   16                                                       x[i + 1] = 0.
   17                                                   else:
   18     99999      67313.0      0.7     11.2              x[i + 1] = x_new
   19         1          1.0      1.0      0.0      return x

The mistake

The mistake is that the random module is loaded inside the step() function which is called thousands of times! Moving the module import to the top level saves considerable time.

Exercise 2¶

Exercise

Start by copying in the script walk.py into Jupyter or download it and execute %run walk.py

Use %timeit magic command in Jupyter to benchmark the functions walk(1_000_000) and walk_vec(1_000_000). Which is faster?
Use:
```
%load_ext line_profiler
%lprun -f walk_vec walk_vec(1_000_000)
```
to apply line-profiler on walk_vec function. What is the bottleneck?

Modify the following lines to use to change how the steps array is initialized. Execute it to re-initialize walk_vec. Redo the above benchmark and profiling. Does it improve the performance?

def walk_vec(n: int, dx: float = 1.0):
    """The vectorized version of :func:`walk` using numpy functions."""
    import random
    steps = np.array(random.sample([1, -1], k=n, counts=[10 * n, 10 * n]))

    # steps = np.random.choice([1, -1], size=n)

    dx_steps = dx * steps

    # set initial condition to zero
    dx_steps[0] = 0
    # use cumulative sum to replicate time evolution of position x
    xs = np.cumsum(dx_steps)

    return xs

Solution

Benchmarking walk and walk_vec

In [1]: %run walk.py

In [2]: %timeit walk(1_000_000)
246 ms ± 2.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit walk_vec(1_000_000)
397 ms ± 6.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Shows walk_vec, although use vectorised operations, is slower. Although it is not directly obvious why.

Profiling walk_vec

In [4]: %load_ext line_profiler

In [5]: %lprun -f walk_vec walk_vec(1_000_000)

illustrates how creating the steps array consumes more than 99% of the total time:

Timer unit: 1e-09 s

Total time: 1.14911 s
File: /home/ashwinmo/Sources/enccs/hpda-python/content/example/walk.py
Function: walk_vec at line 36

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   36                                           def walk_vec(n: int, dx: float = 1.0):
   37                                               """The vectorized version of :func:`walk` using numpy functions."""
   38         1       3374.0   3374.0      0.0      import random
   39         1 1145604166.0    1e+09     99.7      steps = np.array(random.sample([1, -1], k=n, counts=[10 * n, 10 * n]))
   40
   41                                               # steps = np.random.choice([1, -1], size=n)
   42
   43         1    1085779.0    1e+06      0.1      dx_steps = dx * steps
   44
   45                                               # set initial condition to zero
   46         1       3666.0   3666.0      0.0      dx_steps[0] = 0
   47                                               # use cumulative sum to replicate time evolution of position x
   48         1    2408374.0    2e+06      0.2      xs = np.cumsum(dx_steps)
   49
   50         1        507.0    507.0      0.0      return xs

This is because random.sample is part of the standard library and produces a list.

After modifying walk_vec function, we see it is much faster.

In [7]: %ed walk.py
Editing... done. Executing edited code...

In [8]: %timeit walk(1_000_000)
259 ms ± 14.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: %timeit walk_vec(1_000_000)
6.49 ms ± 90.4 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [10]: %lprun -f walk_vec walk_vec(1_000_000)

Line-profiler output:

Timer unit: 1e-09 s

Total time: 0.0078706 s
File: /home/ashwinmo/Sources/enccs/hpda-python/content/example/walk.py
Function: walk_vec at line 36

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   36                                           def walk_vec(n: int, dx: float = 1.0):
   37                                               """The vectorized version of :func:`walk` using numpy functions."""
   38                                               # import random
   39                                               # steps = np.array(random.sample([1, -1], k=n, counts=[10 * n, 10 * n]))
   40
   41         1    4323251.0    4e+06     54.9      steps = np.random.choice([1, -1], size=n)
   42
   43         1    1142060.0    1e+06     14.5      dx_steps = dx * steps
   44
   45                                               # set initial condition to zero
   46         1       2907.0   2907.0      0.0      dx_steps[0] = 0
   47                                               # use cumulative sum to replicate time evolution of position x
   48         1    2401730.0    2e+06     30.5      xs = np.cumsum(dx_steps)
   49
   50         1        649.0    649.0      0.0      return xs

Scalene profiler¶

Profiling¶

Scalene¶

Scalene is a sampling profiler. In addition to timings , it can also give insight into:

CPU time spent in Python (interpreted), native (compiled) and system function calls
Memory usage and copy
GPU utilization
Memory leak detection

Moreover, it adds minimal overhead due to profiling. The downside is the results are less reproducible, because it is a sampling profiler.

Scalene can be used as a CLI tool, or using IPython magic or in Web interface as an interactive widget. Here are some examples profiling walk.py with Scalene.

CLI tool¶

$ scalene --cli walk.py

IPython magic¶

This allows for profiling a specific function. For example to profile just walk, we do as follows:

In [1]: %load_ext scalene

In [2]: %run walk.py

In [3]: %scrun --cli walk(n)

Gives the following output:

SCRUN MAGIC
                              /home/ashwinmo/Sources/enccs/hpda-python/content/example/walk.py: % of time = 100.00% (1.933s) out of 1.933s.
       ╷       ╷       ╷       ╷       ╷
       │Time   │–––––– │–––––– │–––––– │
  Line │Python │native │system │GPU    │/home/ashwinmo/Sources/enccs/hpda-python/content/example/walk.py
╺━━━━━━┿━━━━━━━┿━━━━━━━┿━━━━━━━┿━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸
│       │       │       │       │"""A 1-D random walk.
│       │       │       │       │
│       │       │       │       │See also:
│       │       │       │       │- https://lectures.scientific-python.org/intro/numpy/auto_examples/plot_randomwalk.html
│       │       │       │       │
│       │       │       │       │"""
│       │       │       │       │import numpy as np
│       │       │       │       │
│       │       │       │       │
│    6% │       │       │       │def step():
│       │       │       │       │    import random
│    7% │   64% │  13%  │       │    return 1.0 if random.random() > 0.5 else -1.0
│       │       │       │       │
│       │       │       │       │
│       │       │       │       │def walk(n: int, dx: float = 1.0):
│       │       │       │       │    """The for-loop version.
│       │       │       │       │
│       │       │       │       │    Parameters
│       │       │       │       │    ----------
│       │       │       │       │    n: int
│       │       │       │       │        Number of time steps
│       │       │       │       │
│       │       │       │       │    dx: float
│       │       │       │       │        Step size. Default step size is unity.
│       │       │       │       │
│       │       │       │       │    """
│       │       │       │       │    xs = np.zeros(n)
│       │       │       │       │
│       │       │       │       │    for i in range(n - 1):
│       │       │       │       │        x_new = xs[i] + dx * step()
│    7% │       │       │       │        xs[i + 1] = x_new
│       │       │       │       │
│       │       │       │       │    return xs
│       │       │       │       │
│       │       │       │       │
│       │       │       │       │def walk_vec(n: int, dx: float = 1.0):
│       │       │       │       │    """The vectorized version of :func:`walk` using numpy functions."""
│       │       │       │       │    import random
│       │       │       │       │    steps = np.array(random.sample([1, -1], k=n, counts=[10 * n, 10 * n]))
│       │       │       │       │
│       │       │       │       │    # steps = np.random.choice([1, -1], size=n)
│       │       │       │       │
│       │       │       │       │    dx_steps = dx * steps
│       │       │       │       │
│       │       │       │       │    # set initial condition to zero
│       │       │       │       │    dx_steps[0] = 0
│       │       │       │       │    # use cumulative sum to replicate time evolution of position x
│       │       │       │       │    xs = np.cumsum(dx_steps)
│       │       │       │       │
│       │       │       │       │    return xs
│       │       │       │       │
│       │       │       │       │
│       │       │       │       │if __name__ == "__main__":
│       │       │       │       │    n = 1_000_000
│       │       │       │       │    _ = walk(n)
│       │       │       │       │    _ = walk_vec(n)
│       │       │       │       │
       │       │       │       │       │
╶──────┼───────┼───────┼───────┼───────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╴
       │       │       │       │       │function summary for /home/ashwinmo/Sources/enccs/hpda-python/content/example/walk.py
│   14% │   69% │   9%  │       │step
│    7% │       │       │       │walk
       ╵       ╵       ╵       ╵

If you run the magic command in Jupyter you can use %scrun walk(n) instead and it should an output similar to the Web interface below.

Web interface¶

Running

$ scalene walk.py

opens up the following web app:

Profiling¶

Using cProfile to investigate performance¶

Using SnakeViz to visualise performance reports¶

Using line_profiler to inspect the expensive function¶

Profiling concepts and exercises¶

Deterministic profilers vs. sampling profilers¶

cProfile¶

Line-profiler¶

Exercise 2¶

Scalene profiler¶

Profiling¶

Scalene¶

CLI tool¶

IPython magic¶

Web interface¶

Using `cProfile` to investigate performance¶

Using `line_profiler` to inspect the expensive function¶