Profiling

Objectives

  • Learn how to profile Python code using cProfile

  • Learn how to visualise cProfile results using SnakeViz

  • Examine the most most expensive function call via line_profiler

Instructor note

  • 15 min teaching/type-along

Using cProfile to investigate performance

While %timeit can provide good benchmarking information on single lines or single functions, larger codebases have more complex function hierarchies which require more sofisticated tools to traverse properly. Python comes with two built-in tools to profile code, which implement the same interface: cProfile and profile. These tools can help to identify performance bottlenecks in the code.

In this lesson, we will use cProfile due to its smaller overhead (profile, on the other hand, is more extensible). The standard syntax to call it is:

$ python -m cProfile [-o <outputFile>] <python_module>

By default, cProfile writes the results to stdout, but the optional -o flag redirects the output to file instead. A report can be generated using the pstats command.

Type-Along

Let’s profile the wordcount script and write the results to a file.

Warning

Use the shell variant. The profiling output from Jupyter, although it seems to work, is hard to decipher.

The %run magic supports profiling out-of-the-box using the -p flag. The script can be run as:

In [1]: %run -p -D wordcount.prof v0.py data/concat.txt processed_data/concat.dat
 
*** Profile stats marshalled to file 'wordcount.prof'.

We can call cProfile as:

$ python -m cProfile -o wordcount.prof v0.py data/concat.txt processed_data/concat.dat

We can then generate a report using the pstats command:

$ python -m pstats wordcount.prof
# Welcome to the profile statistics browser.
# wordcount.prof% sort tottime
# wordcount.prof% stats
# Wed Sep 25 11:52:27 2024    wordcount.prof

#          53473208 function calls in 8.410 seconds

#    Ordered by: internal time

#    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
#   1233410    4.151    0.000    7.204    0.000 v0.py:41(update_word_counts)
#  32068660    1.799    0.000    1.799    0.000 {method 'replace' of 'str' objects}
#   7747363    0.570    0.000    0.570    0.000 {method 'lower' of 'str' objects}
#   7747363    0.428    0.000    0.428    0.000 {method 'strip' of 'str' objects}
#   1530212    0.271    0.000    0.271    0.000 v0.py:23(<genexpr>)
#   1233411    0.256    0.000    0.256    0.000 {method 'split' of 'str' objects}
#         1    0.184    0.184    7.388    7.388 v0.py:59(calculate_word_counts)
#    382553    0.133    0.000    0.404    0.000 {method 'join' of 'str' objects}
#         1    0.126    0.126    0.580    0.580 v0.py:16(save_word_counts)
# ...

Discussion

Profiling introduces a non-negligible overhead on the code being executed. Thus, the absolute values for time being spent in each function should be taken with a grain of salt. The real objective lies in understanding the relative amount of time spent in each function call.

Using SnakeViz to visualise performance reports

SnakeViz is a browser-based visualiser of performance reports generated by cProfile. It is already included among the dependecies installed in this virtual/Conda environment.

Type-Along

SnakeViz has a IPython magic to profile and open a browser directly. To use it, we just need to load the relevant extension and run it:

In [4]: %load_ext snakeviz
In [5]: %snakeviz wordcount.word_count("data/concat.txt", "processed_data/concat.dat", 1)

Warning

This will run only if the source IPython instance has access to a local web browser. This also means that, e.g., if you are on Windows and following the tutorial in WSL, this will not work.

We can run SnakeViz as:

$ snakeviz wordcount.prof --server

The output will contain a clickable link containing the visualisation.

SnakeViz output

Based on the output, we can clearly see that the update_word_counts() function is where most of the runtime of the script is spent.

Using line_profiler to inspect the expensive function

Once the main performance-intensive function is identified, we can further examine it to find bottlenecks. This can be done using the line_profiler tool, which returns a line-by-line breakdown of where time is spent.

Type-Along

Let’s profile the wordcount script and write the results to a file.

The line_profiler package provides a magic to be used in IPython. First, the magic needs to be loaded:

In [1]: %load_ext line_profiler

The script can be run with the %lprun magic, whose syntax is very close to the %run introduced above. Notice that we have to explicitly mention which functions we want to step through line by line:

In [5]: %lprun -f wordcount.update_word_counts wordcount.word_count("data/concat.txt", "processed_data/concat.dat", 1)

To use line_profiler from the command line, the functions to be profiled need to be explicitly marked. For this reason, we need to import the relevant decorator at the beginning of the v0.py script:

from line_profiler import profile

Note

The tool will also work without the import

We can now mark the update_word_counts function with the @profile decorator:

@profile
def update_word_counts(line, counts):

Profiling is performed with a script called kernprof. Its usage is the following:

kernprof -lvr v0.py data/concat.txt processed_data/concat.dat

The -l flag ensures that the function is profiled step-by-step, the -v flag shows the result on stdout and the -r flag is used to have rich format on output.

Wrote profile results to wordcount.py.lprof
Timer unit: 1e-06 s

Total time: 12.2802 s
File: v0.py
Function: update_word_counts at line 40

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    40                                           @profile
    41                                           def update_word_counts(line, counts):
    42                                               """
    43                                               Given a string, parse the string and update a dictionary of word
    44                                               counts (mapping words to counts of their frequencies). DELIMITERS are
    45                                               removed before the string is parsed. The function is case-insensitive
    46                                               and words in the dictionary are in lower-case.
    47                                               """
    48  33302070    2574252.9      0.1     21.0      for purge in DELIMITERS:
    49  32068660    4405499.9      0.1     35.9          line = line.replace(purge, " ")
    50   1233410     392268.8      0.3      3.2      words = line.split()
    51   8980773     819407.4      0.1      6.7      for word in words:
    52   7747363    1457841.0      0.2     11.9          word = word.lower().strip()
    53   7747363    1355462.5      0.2     11.0          if word in counts:
    54   7364810    1211000.7      0.2      9.9              counts[word] += 1
    55                                                   else:
    56    382553      64505.7      0.2      0.5              counts[word] = 1

Based on the output, we can conclude that most of the time is spent replacing delimiters.

Keypoints

  • The cProfile module can provide information on how costly each function call is.

  • Profile reports can be inspected using the pstats tool in tabular form or with SnakeViz for a graphical visualisation

  • The line_profiler tool can be used to inspect line-by-line performance overhead.

Profiling concepts and exercises

Profilers are applications which attach to the execution of the program, which in our case is done by the CPython interpreter and analyze the time taken for different portions of the code. Profilers help to identify performance bottlenecks in the code by showing

  • wall-time (or start to end time that the user observes),

  • CPU and GPU time, and

  • memory usage patterns

in function/method/line of code level granularity.

Deterministic profilers vs. sampling profilers

Note

Deterministic profilers are also called tracing profilers.

Deterministic profilers record every function call and event in the program, logging the exact sequence and duration of events.

👍 Pros: : - Provides detailed information on the program’s execution.

  • Deterministic: Captures exact call sequences and timings.

👎 Cons: : - Higher overhead, slowing down the program.

  • Can generate larger amount of data.

Sampling profilers periodically samples the program’s state (where it is and how much memory is used), providing a statistical view of where time is spent.

👍 Pros: : - Lower overhead, as it doesn’t track every event.

  • Scales better with larger programs.

👎 Cons: : - Less precise, potentially missing infrequent or short calls.

  • Provides an approximation rather than exact timing.

Discussion

Analogy: Imagine we want to optimize the Stockholm Länstrafik (SL) metro system We wish to detect bottlenecks in the system to improve the service and for this we have asked few passengers to help us by tracking their journey.

  • Deterministic: : We follow every train and passenger, recording every stop and delay. When passengers enter and exit the train, we record the exact time and location.

  • Sampling: : Every 5 minutes the phone notifies the passenger to note down their current location. We then use this information to estimate the most crowded stations and trains.

In addition to the above distinctions, some profilers can also

In the following sections, we will use cProfile and Line-profiler to profile a Python program. cProfile is a deterministic (tracing) profiler built-in to the Python standard library and gives timings in function-level granularity. Line profiler is also deterministic and it provides timings in line-of-code granularity for few selected functions.

cProfile

As a demo, let us consider the following code which simulates a random walk in one dimension (we can save it as walk.py or download from here):

"""A 1-D random walk.

See also:
- https://lectures.scientific-python.org/intro/numpy/auto_examples/plot_randomwalk.html

"""
import numpy as np


def step():
    import random
    return 1.0 if random.random() > 0.5 else -1.0


def walk(n: int, dx: float = 1.0):
    """The for-loop version.

    Parameters
    ----------
    n: int
        Number of time steps

    dx: float
        Step size. Default step size is unity.

    """
    xs = np.zeros(n)

    for i in range(n - 1):
        x_new = xs[i] + dx * step()
        xs[i + 1] = x_new

    return xs


def walk_vec(n: int, dx: float = 1.0):
    """The vectorized version of :func:`walk` using numpy functions."""
    import random
    steps = np.array(random.sample([1, -1], k=n, counts=[10 * n, 10 * n]))

    # steps = np.random.choice([1, -1], size=n)

    dx_steps = dx * steps

    # set initial condition to zero
    dx_steps[0] = 0
    # use cumulative sum to replicate time evolution of position x
    xs = np.cumsum(dx_steps)

    return xs


if __name__ == "__main__":
    n = 1_000_000
    _ = walk(n)
    _ = walk_vec(n)

We can profile it with cProfile:

$  python -m cProfile -s time walk.py

The -s switch sorts the results by time. Other options include e.g. function name, cumulative time, etc. However, this will print a lot of output which is difficult to read.

$ python -m cProfile -o walk.prof walk.py

It’s also possible to write the profile to a file with the -o flag and view it with profile pstats module or profile visualisation tools like Snakeviz or tuna.

Note

Similar functionality is available in interactive IPython or Jupyter sessions with the magic command %%prun.

Line-profiler

The cProfile tool tells us which function takes most of the time but it does not give us a line-by-line breakdown of where time is being spent. For this information, we can use the line_profiler tool.

Demo: line profiling

For line-profiling source files from the command line, we can add a decorator @profile to the functions of interests. If we do this for the step() and walk() function in the example above, we can then run the script using the kernprof.py program which comes with line_profiler, making sure to include the switches -l, --line-by-line and -v, --view:

$ kernprof -l -v walk.py

line_profiler also works in a Jupyter notebook. First one needs to load the extension:

%load_ext line_profiler

If the walk() and step() functions are defined in code cells, we can get the line-profiling information by:

%lprun -f walk -f step walk(10000)
  • Based on the output, can you spot a mistake which is affecting performance?

Exercise 2

Exercise

Start by copying in the script walk.py into Jupyter or download it and execute %run walk.py

  1. Use %timeit magic command in Jupyter to benchmark the functions walk(1_000_000) and walk_vec(1_000_000). Which is faster?

  2. Use:

    %load_ext line_profiler
    %lprun -f walk_vec walk_vec(1_000_000)
    

    to apply line-profiler on walk_vec function. What is the bottleneck?

  3. Modify the following lines to use to change how the steps array is initialized. Execute it to re-initialize walk_vec. Redo the above benchmark and profiling. Does it improve the performance?

    def walk_vec(n: int, dx: float = 1.0):
        """The vectorized version of :func:`walk` using numpy functions."""
        import random
        steps = np.array(random.sample([1, -1], k=n, counts=[10 * n, 10 * n]))
    
        # steps = np.random.choice([1, -1], size=n)
    
        dx_steps = dx * steps
    
        # set initial condition to zero
        dx_steps[0] = 0
        # use cumulative sum to replicate time evolution of position x
        xs = np.cumsum(dx_steps)
    
        return xs
    

Scalene profiler

Profiling

Scalene

Scalene is a sampling profiler. In addition to timings , it can also give insight into:

  • CPU time spent in Python (interpreted), native (compiled) and system function calls

  • Memory usage and copy

  • GPU utilization

  • Memory leak detection

Moreover, it adds minimal overhead due to profiling. The downside is the results are less reproducible, because it is a sampling profiler.

Scalene can be used as a CLI tool, or using IPython magic or in Web interface as an interactive widget. Here are some examples profiling walk.py with Scalene.

CLI tool

$ scalene --cli walk.py

IPython magic

This allows for profiling a specific function. For example to profile just walk, we do as follows:

In [1]: %load_ext scalene

In [2]: %run walk.py

In [3]: %scrun --cli walk(n)

Gives the following output:

SCRUN MAGIC
                              /home/ashwinmo/Sources/enccs/hpda-python/content/example/walk.py: % of time = 100.00% (1.933s) out of 1.933s.
       ╷       ╷       ╷       ╷       ╷
       │Time   │–––––– │–––––– │–––––– │
  Line │Python │native │system │GPU    │/home/ashwinmo/Sources/enccs/hpda-python/content/example/walk.py
╺━━━━━━┿━━━━━━━┿━━━━━━━┿━━━━━━━┿━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸
     1 │       │       │       │       │"""A 1-D random walk.
     2 │       │       │       │       │
     3 │       │       │       │       │See also:
     4 │       │       │       │       │- https://lectures.scientific-python.org/intro/numpy/auto_examples/plot_randomwalk.html
     5 │       │       │       │       │
     6 │       │       │       │       │"""
     7 │       │       │       │       │import numpy as np
     8 │       │       │       │       │
     9 │       │       │       │       │
    10 │    6% │       │       │       │def step():
    11 │       │       │       │       │    import random
    12 │    7% │   64% │  13%  │       │    return 1.0 if random.random() > 0.5 else -1.0
    13 │       │       │       │       │
    14 │       │       │       │       │
    15 │       │       │       │       │def walk(n: int, dx: float = 1.0):
    16 │       │       │       │       │    """The for-loop version.
    17 │       │       │       │       │
    18 │       │       │       │       │    Parameters
    19 │       │       │       │       │    ----------
    20 │       │       │       │       │    n: int
    21 │       │       │       │       │        Number of time steps
    22 │       │       │       │       │
    23 │       │       │       │       │    dx: float
    24 │       │       │       │       │        Step size. Default step size is unity.
    25 │       │       │       │       │
    26 │       │       │       │       │    """
    27 │       │       │       │       │    xs = np.zeros(n)
    28 │       │       │       │       │
    29 │       │       │       │       │    for i in range(n - 1):
    30 │       │       │       │       │        x_new = xs[i] + dx * step()
    31 │    7% │       │       │       │        xs[i + 1] = x_new
    32 │       │       │       │       │
    33 │       │       │       │       │    return xs
    34 │       │       │       │       │
    35 │       │       │       │       │
    36 │       │       │       │       │def walk_vec(n: int, dx: float = 1.0):
    37 │       │       │       │       │    """The vectorized version of :func:`walk` using numpy functions."""
    38 │       │       │       │       │    import random
    39 │       │       │       │       │    steps = np.array(random.sample([1, -1], k=n, counts=[10 * n, 10 * n]))
    40 │       │       │       │       │
    41 │       │       │       │       │    # steps = np.random.choice([1, -1], size=n)
    42 │       │       │       │       │
    43 │       │       │       │       │    dx_steps = dx * steps
    44 │       │       │       │       │
    45 │       │       │       │       │    # set initial condition to zero
    46 │       │       │       │       │    dx_steps[0] = 0
    47 │       │       │       │       │    # use cumulative sum to replicate time evolution of position x
    48 │       │       │       │       │    xs = np.cumsum(dx_steps)
    49 │       │       │       │       │
    50 │       │       │       │       │    return xs
    51 │       │       │       │       │
    52 │       │       │       │       │
    53 │       │       │       │       │if __name__ == "__main__":
    54 │       │       │       │       │    n = 1_000_000
    55 │       │       │       │       │    _ = walk(n)
    56 │       │       │       │       │    _ = walk_vec(n)
    57 │       │       │       │       │
       │       │       │       │       │
╶──────┼───────┼───────┼───────┼───────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╴
       │       │       │       │       │function summary for /home/ashwinmo/Sources/enccs/hpda-python/content/example/walk.py
    10 │   14% │   69% │   9%  │       │step
    15 │    7% │       │       │       │walk
       ╵       ╵       ╵       ╵

If you run the magic command in Jupyter you can use %scrun walk(n) instead and it should an output similar to the Web interface below.

Web interface

Running

$ scalene walk.py

opens up the following web app:

../../_images/scalene_web.png