Optimizing¶

Objectives

Optimize the most expensive function from the local word-count exercise.
Show how changes to algorithm influences the performance.
Introduce a few Python accelerators: cython, numba, pythran
Mention the library transonic

Instructor note

15 min teaching/demo
No type-along intended

Targeting the most expensive function¶

In the previous episode by profiling, we found out that update_word_counts consumes around half of the CPU wall time and is called repeatedly. Here is a snippet from profiling output.

...
         53473208 function calls in 8.410 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1233410    4.151    0.000    7.204    0.000 v0.py:41(update_word_counts)
  ...

Option 1: changing the algorithm¶

If we look at the output from the line profiler, we can see that the following two lines are the most time-consuming.

def update_word_counts(line, counts):
    """
    Given a string, parse the string and update a dictionary of word
    counts (mapping words to counts of their frequencies). DELIMITERS are
    removed before the string is parsed. The function is case-insensitive
    and words in the dictionary are in lower-case.
    """
    for purge in DELIMITERS:
        line = line.replace(purge, " ")
    words = line.split()
    for word in words:
        word = word.lower().strip()
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1

Demo

Instead of a for loop and a str.replace we could use a single regular expression substitution. This change would look like this

import re
# WARNING: there is a bug in the regular expression below!
DELIMITERS = re.compile(r"""[\.,;:?$@^<>#%`!\*-=\(\)\[\]\{\}/\\\\"']""")


def update_word_counts(line, counts):
    """
    Given a string, parse the string and update a dictionary of word
    counts (mapping words to counts of their frequencies). DELIMITERS are
    removed before the string is parsed. The function is case-insensitive
    and words in the dictionary are in lower-case.
    """
    line = DELIMITERS.sub(" ", line)
    words = line.split()
    for word in words:
        word = word.lower().strip()
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1

If we run our benchmark with the original code (v0.py) and the regex version (v0_1.py), we get

$ time python v0.py data/concat.txt processed_data/concat.dat

real    0m2,934s
user    0m2,733s
sys     0m0,191s
$ time python v0_1.py data/concat.txt processed_data/concat.dat

real    0m2,472s
user    0m2,320s
sys     0m0,147s

Summary

There is a marginal gain of ~0.5 s which amounts to a 16% performance boost.
Such changes are less maintainable, but sometime necessary.

Option 2: using an accelerator¶

Accelerators¶

The following are the few well-known accelerators for Python-Numpy applications.

Accelerator	Compiles	Implemented in	Level	Supports	Advantage

Cython	Ahead of time	C	Module	All of Python, Numpy, and C	Generic and can also interface C,C++
Pythran	Ahead of time	C++	Module	Most Python and Numpy features	Escapes GIL always, can optimize vectorized code without loops. Can parallelize using OpenMP.
Numba	Just in time	LLVM	Function	Most Python and Numpy features	Specializes in Numeric codes. Has GPU support, can parallelize
Jax	Just in time	C++	Function or Expression	Most Python and Numpy features	Drop-in alternative for Numpy. Designed for creating ML libraries
Cupy	Pre-compiled / JIT	Cython / C / C++	Function or Expression	Numpy and Scipy	Drop-in alternative for Numpy. Supports CUDA and ROCm GPUs

Refactoring¶

One complication with optimizing update_word_counts is that it is an impure function. In other words, it has some side-effects since it:

accesses a global variable DELIMITERS, and
mutates an external dictionary counts which is a local variable inside the function calculate_word_counts.

Thus the function update_word_counts on its own can be complicated for an accelerator to compile since the types of the external variables are unknown.

def update_word_counts(line, counts):
    """
    Given a string, parse the string and update a dictionary of word
    counts (mapping words to counts of their frequencies). DELIMITERS are
    removed before the string is parsed. The function is case-insensitive
    and words in the dictionary are in lower-case.
    """
    for purge in DELIMITERS:
        line = line.replace(purge, " ")
    words = line.split()
    for word in words:
        word = word.lower().strip()
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1

DELIMITERS = ". , ; : ? $ @ ^ < > # % ` ! * - = ( ) [ ] { } / \" '".split()

def calculate_word_counts(lines):
    """
    Given a list of strings, parse each string and create a dictionary of
    word counts (mapping words to counts of their frequencies). DELIMITERS
    are removed before the string is parsed. The function is
    case-insensitive and words in the dictionary are in lower-case.
    """
    counts = {}
    for line in lines:
        update_word_counts(line, counts)
    return counts

Cython¶

In this example we shall demonstrate Cython via a package called Transonic . Transonic lets you switch between Cython, Numba, Pythran and to some extent Jax using very similar syntax

To use Transonic we add decorators to functions we need to optimize. There are two decorators

@transonic.boost to create ahead-of-time (AOT) compiled modules and it requires type annotations
@transonic.jit to create just-in-time (JIT) compiled modules where type is inferred on runtime

The advantage of using transonic is that you can quickly find out which accelerator works best while preserving the Python code for debugging and future development. It also abstracts away the syntax variations that Cython, Pythran etc. have.

The accelerator backend can be chosen in 3 ways:

Using an environment variable, export TRANSONIC_BACKEND=cython
As a parameter to the decorator, @boost(backend="cython")
As a parameter to the Transonic CLI, transonic -b cython /path/to/file.py

We shall use the @boost decorator and the environment variable TRANSONIC_BACKEND for simplicity

Demo

We make a few changes to the code:

Pull DELIMITERS inside update_word_counts function
Add @boost decorators
Add type annotations as required by transonic.

Cython has an ability to create inline functions and this is also supported in Transonic. Therefore it is OK that update_word_counts is impure.

from transonic import boost
from transonic.typing import List, Dict

@boost(inline=True)
def update_word_counts(line: str, counts: Dict[str, int]):
    """
    Given a string, parse the string and update a dictionary of word
    counts (mapping words to counts of their frequencies). DELIMITERS are
    removed before the string is parsed. The function is case-insensitive
    and words in the dictionary are in lower-case.
    """
    DELIMITERS = ". , ; : ? $ @ ^ < > # % ` ! * - = ( ) [ ] { } / \" '".split()

    for purge in DELIMITERS:
        line = line.replace(purge, " ")

    words = line.split()
    for word in words:
        word = word.lower().strip()
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1


@boost
def calculate_word_counts(lines: List[str]):
    """
    Given a list of strings, parse each string and create a dictionary of
    word counts (mapping words to counts of their frequencies). DELIMITERS
    are removed before the string is parsed. The function is
    case-insensitive and words in the dictionary are in lower-case.
    """
    counts = {}

    for line in lines:
        update_word_counts(line, counts)

    return counts

Then compile the file wordcount/v1_1.py.

$ export TRANSONIC_BACKEND=cython
$ transonic v1_1.py 
...
1 files created or updated needs to be cythonized
$ ls -1 __cython__/
build
v1_1_ee8b793c43119b782190c854a1eb2ba7.cpython-312-x86_64-linux-gnu.so
v1_1.pxd
v1_1.py

This would auto-generate a module containing only the functions to be optimized and also compiles it. While running the application, Transonic takes care of swapping the Python function with the compiled counterpart.

We are ready to benchmark this.

$ time python v1_1.py data/concat.txt processed_data/concat.dat

real    0m4,071s
user    0m4,373s
sys     0m0,288s

Summary

We see that the compiled function made the script slower! This could happen because of a few reasons

Python’s dictionary which uses hash-maps, is quite optimized and it is hard to beat it
Cython interacts with Python a lot. This can be analyzed by running cd __cython__; cythonize --annotate v1_1.py which generates the following HTML page.

Pythran can be used to escape interaction the GIL, but it has a similar performance. Source code: wordcount/v1_2.py and wordcount/v1_2_pythran.py

When do we use accelerators?¶

An example: Astrophysics N-body problem¶

To simulate an N-body problem using a naive algorithm involves $O(N^2)$ operations for each time-step.

Naive Python version¶

It uses a list of Numpy arrays!

import sys
import numpy as np
from itertools import combinations
from time import perf_counter
from datetime import timedelta


class Particle:
    """
    A Particle has mass, position, velocity and acceleration.
    """

    def __init__(self, mass, x, y, z, vx, vy, vz):
        self.mass = mass
        self.position = np.array([x, y, z])
        self.velocity = np.array([vx, vy, vz])
        self.acceleration = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])

    @property
    def ke(self):
        return 0.5 * self.mass * sum(v ** 2 for v in self.velocity)


class Cluster(list):
    """
    A Cluster is just a list of particles with methods to accelerate and
    advance them.
    """

    @property
    def ke(self):
        return sum(particle.ke for particle in self)

    @property
    def energy(self):
        return self.ke + self.pe

    def step(self, dt):
        self.__accelerate()
        self.__advance_positions(dt)
        self.__accelerate()
        self.__advance_velocities(dt)

    def __accelerate(self):
        for particle in self:
            particle.acceleration[1] = particle.acceleration[0]
            particle.acceleration[0] = 0.0
            self.pe = 0.0
        for p1, p2 in combinations(self, 2):
            vector = np.subtract(p1.position, p2.position)
            distance = np.sqrt(np.sum(vector ** 2))
            p1.acceleration[0] = (
                p1.acceleration[0] - (p2.mass / distance ** 3) * vector
            )
            p2.acceleration[0] = (
                p2.acceleration[0] + (p1.mass / distance ** 3) * vector
            )
            self.pe -= (p1.mass * p2.mass) / distance

    def __advance_positions(self, dt):
        for p in self:
            p.position = (
                p.position + p.velocity * dt + 0.5 * dt ** 2 * p.acceleration[0]
            )

    def __advance_velocities(self, dt):
        for p in self:
            p.velocity = (
                p.velocity + 0.5 * (p.acceleration[0] + p.acceleration[1]) * dt
            )


if __name__ == "__main__":
    t_start = perf_counter()
    tend, dt = 10.0, 0.001  # end time, timestep
    cluster = Cluster()
    with open(sys.argv[1]) as input_file:
        for line in input_file:
            # try/except is a blunt instrument to clean up input
            try:
                cluster.append(Particle(*[float(x) for x in line.split()[1:]]))
            except:
                pass
    old_energy = -0.25
    for step in range(1, int(tend / dt + 1)):
        cluster.step(dt)
        if not step % 100:
            print(
                f"t = {dt * step:.2f}, E = {cluster.energy:.10f}, "
                f"dE/E = {(cluster.energy - old_energy) / old_energy:.10f}"
            )
            old_energy = cluster.energy
    print(f"Final dE/E = {(cluster.energy + 0.25) / -0.25:.6e}")
    print(f"run in {timedelta(seconds=perf_counter()-t_start)}")

Numpy vectorized version¶

from math import sqrt
from time import perf_counter
from datetime import timedelta

import numpy as np
import pandas as pd


def load_input_data(path):
    df = pd.read_csv(
        path, names=["mass", "x", "y", "z", "vx", "vy", "vz"], delimiter=r"\s+"
    )
    # warning: copy() is for Pythran...
    masses = df["mass"].values.copy()
    positions = df.loc[:, ["x", "y", "z"]].values.copy()
    velocities = df.loc[:, ["vx", "vy", "vz"]].values.copy()
    return masses, positions, velocities


def advance_positions(positions, velocities, accelerations, time_step):
    positions += time_step * velocities + 0.5 * time_step ** 2 * accelerations


def advance_velocities(velocities, accelerations, accelerations1, time_step):
    velocities += 0.5 * time_step * (accelerations + accelerations1)


def compute_accelerations(accelerations, masses, positions):
    nb_particules = masses.size
    for index_p0 in range(nb_particules - 1):
        position0 = positions[index_p0]
        mass0 = masses[index_p0]
        for index_p1 in range(index_p0 + 1, nb_particules):
            mass1 = masses[index_p1]
            vector = position0 - positions[index_p1]
            distance = sqrt(sum(vector ** 2))
            coef = 1.0 / distance ** 3
            accelerations[index_p0] -= coef * mass1 * vector
            accelerations[index_p1] += coef * mass0 * vector


def loop(time_step, nb_steps, masses, positions, velocities):
    accelerations = np.zeros_like(positions)
    accelerations1 = np.zeros_like(positions)

    compute_accelerations(accelerations, masses, positions)

    time = 0.0
    energy0, _, _ = compute_energies(masses, positions, velocities)
    energy_previous = energy0

    for step in range(nb_steps):
        advance_positions(positions, velocities, accelerations, time_step)
        # swap acceleration arrays
        accelerations, accelerations1 = accelerations1, accelerations
        accelerations.fill(0)
        compute_accelerations(accelerations, masses, positions)
        advance_velocities(velocities, accelerations, accelerations1, time_step)
        time += time_step

        if not step % 100:
            energy, _, _ = compute_energies(masses, positions, velocities)
            print(
                f"t = {time_step * step:5.2f}, E = {energy:.10f}, "
                f"dE/E = {(energy - energy_previous) / energy_previous:+.10f}"
            )
            energy_previous = energy

    return energy, energy0


def compute_kinetic_energy(masses, velocities):
    return 0.5 * np.sum(masses * np.sum(velocities ** 2, 1))


def compute_potential_energy(masses, positions):
    nb_particules = masses.size
    pe = 0.0
    for index_p0 in range(nb_particules - 1):
        for index_p1 in range(index_p0 + 1, nb_particules):
            mass0 = masses[index_p0]
            mass1 = masses[index_p1]
            vector = positions[index_p0] - positions[index_p1]
            distance = sqrt(sum(vector ** 2))
            pe -= (mass0 * mass1) / distance
    return pe


def compute_energies(masses, positions, velocities):
    energy_kin = compute_kinetic_energy(masses, velocities)
    energy_pot = compute_potential_energy(masses, positions)
    return energy_kin + energy_pot, energy_kin, energy_pot


if __name__ == "__main__":

    import sys

    t_start = perf_counter()
    try:
        time_end = float(sys.argv[2])
    except IndexError:
        time_end = 10.

    time_step = 0.001
    nb_steps = int(time_end / time_step) + 1

    path_input = sys.argv[1]
    masses, positions, velocities = load_input_data(path_input)

    energy, energy0 = loop(time_step, nb_steps, masses, positions, velocities)
    print(f"Final dE/E = {(energy - energy0) / energy0:.6e}")
    print(
        f"{nb_steps} time steps run in {timedelta(seconds=perf_counter()-t_start)}"
    )

Demo

Data for 16-bodies: nbabel/input16

Naive Python version: nbabel/bench0.py
Numpy vectorized version: nbabel/bench_numpy_highlevel.py
Numpy vectorized version + JIT compilation using Transonic and Pythran: nbabel/bench_numpy_highlevel_jit.py

$ time python bench0.py input16
$ time bench_numpy_highlevel.py input16
$ export TRANSONIC_BACKEND=pythran
$ time bench_numpy_highlevel_jit.py input16  # Rerun after Pythran module is compiled

Solution

$ time python bench0.py input16
...
run in 0:00:12.637249

$ time bench_numpy_highlevel.py input16
...
10001 time steps run in 0:00:04.297485

$ export TRANSONIC_BACKEND=pythran
$ time bench_numpy_highlevel_jit.py input16  # Rerun after Pythran module is compiled
...
10001 time steps run in 0:00:00.042925

~300x speedup by using Pythran!

Keypoints

Algortihmic optimizations are often better
Accelerators work well with contiguous data structures
The word-count problem is a poor candidate, but when it involves contiguous data structures such as arrays of numbers these accelerators can give amazing performance boosts. See here:
- https://enccs.github.io/hpda-python/performance-boosting/
- https://github.com/paugier/nbabel/

Numerical and array optimization¶

Once we have identified the bottlenecks, we need to make the corresponding code go faster. The specific optimization can vary widely based on the computational load (how big or small the data is, and how frequently a function is executed) and particular problem at hand. Nevertheless, we present some common methods which can be handy to know.

Algorithm optimization¶

The first thing to look into is the underlying algorithm you chose: is it optimal? To answer this question, a good understanding of the maths behind the algorithm helps. For certain algorithms, many of the bottlenecks will be linear algebra computations. In these cases, using the right function to solve the right problem is key. For instance, an eigenvalue problem with a symmetric matrix is much easier to solve than with a general matrix. Moreover, most often, you can avoid inverting a matrix and use a less costly (and more numerically stable) operation. However, it can be as simple as moving computation or memory allocation outside a loop, and this happens very often as well.

Singular value decomposition¶

Singular Value Decomposition (SVD) is quite often used in climate model data analysis. The computational cost of this algorithm is roughly $n^3$ where $n$ is the size of the input matrix. However, in most cases, we are not using all the output of the SVD, but only the first few rows of its first returned argument. If we use the svd implementation from SciPy, we can ask for an incomplete version of the SVD. Note that implementations of linear algebra in SciPy are richer then those in NumPy and should be preferred. The following example demonstrates the performance benefit for a “slim” array (i.e. much larger along one axis):

import numpy as np
data = np.random.random((4000,100))

%timeit np.linalg.svd(data)
# 1.09 s ± 19.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

from scipy import linalg

%timeit linalg.svd(data)
# 1.03 s ± 24.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit linalg.svd(data, full_matrices=False)
# 21.2 ms ± 716 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit np.linalg.svd(data, full_matrices=False)
# 23.8 ms ± 3.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The Fibonacci sequence¶

The Fibonacci sequence is defined by the recurrence relatioin:

\[\begin{split}F[0] &= 0 \text{ , } F[1] =1 \\ F[n] &= F[n-1] + F[n-2] \text{ for } n > 1\end{split}\]

The most straightforward version of the Fibonacci sequence is the one using recursion. However, it turns out that it performs very badly. Things can be improved by using the iterative version or the cached version.

def fib_rec(n):
    if n < 2:
        return n
    return fib_rec(n - 2) + fib_rec(n - 1)

def fib_iter(n):
    a, b = 0, 1
    for i in range(n):
        a, b = a + b, a
    return a

"""Using a least recently used (LRU) cache."""
from functools import lru_cache

@lru_cache
def fib_cached(n):
    if n < 2:
       return n

    return fib_cached(n - 2) + fib_cached(n - 1)

CPU usage optimization¶

Vectorization¶

Arithmetic is one place where NumPy performance outperforms python list and the reason is that it uses vectorization. A lot of the data analysis involves a simple operation being applied to each element of a large dataset. In such cases, vectorization is key for better performance. In practice, a vectorised operation means reframing the code in a manner that completely avoids a loop and instead uses e.g. slicing to apply the operation on the whole array (slice) at one go. For example, the following code for calculating the difference of neighbouring elements in an array:

Consider the following code:

%%timeit

import numpy as np
a = np.arange(1000)
a_dif = np.zeros(999, np.int64)
for i in range(1, len(a)):
    a_dif[i-1] = a[i] - a[i-1]

# 564 µs ± 25.2 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

How can the for loop be vectorized? We need to use clever indexing to get rid of the loop:

%%timeit

import numpy as np
a = np.arange(1000)
a_dif = a[1:] - a[:-1]

# 2.12 µs ± 25.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

The first brute force approach using a for loop is much slower than the second vectorised form!

So one should consider using vectorized operations whenever possible, not only for performance but also because the vectorized version can be more convenient.

What if we have a function that only take scalar values as input, but we want to apply it element-by-element on an array? We can vectorize the function! Let’s define a simple function f which takes scalars input:

import math
def f(x, y):
    return math.pow(x,3.0) + 4*math.sin(y)

If we pass an array we get an error

x = np.ones(10_000, dtype=np.int8)
f(x,x)

# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "<stdin>", line 2, in f
# TypeError: only size-1 arrays can be converted to Python scalars

We could loop over the array:

%%timeit
for i in x:
    f(i,i)

# 49.9 ms ± 3.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

However, in order to pass a NumPy array it is better to vectorize the function using np.vectorize() which takes a nested sequence of objects or NumPy arrays as inputs and returns a single NumPy array or a tuple of NumPy arrays:

import math
import numpy as np

def f(x, y):
    return math.pow(x,3.0) + 4*math.sin(y)

f_numpy = np.vectorize(f)

# benchmark
x = np.ones(10_000, dtype=np.int8)
%timeit f_numpy(x,x)
# 4.84 ms ± 75.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

For high performance vectorization, another choice is to use Numba. Adding the decorator in a function, Numba will figure out the rest for you:

import math
import numba
import numpy as np

def f(x, y):
    return math.pow(x,3.0) + 4*math.sin(y)

f_numba = numba.vectorize(f)

# benchmark
x = np.ones(10_000, dtype=np.int8)
%timeit f_numba(x,x)

# 89.2 µs ± 1.74 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

We will learn more about Numba in the next episode.

Memory usage optimization¶

Broadcasting¶

Basic operations of NumPy are elementwise, and the shape of the arrays should be compatible. However, in practice under certain conditions, it is possible to do operations on arrays of different shapes. NumPy expands the arrays such that the operation becomes viable.

Note

Broadcasting Rules

Dimensions match when they are equal, or when either is 1 or None.
In the latter case, the dimension of the output array is expanded to the larger of the two.
Broadcasted arrays are never physically constructed, which saves memory.

Broadcasting

import numpy as np
a = np.array([1, 2, 3])
b = 4
a + b

import numpy as np
a = np.array([[0, 0, 0],[10, 10, 10],[20, 20, 20],[30, 30, 30]])
b = np.array([1, 2, 3])
a + b

import numpy as np
a = np.array([0, 10, 20, 30])
b = np.array([1, 2, 3])
a + b # this does not work
a[:,None] +b
# or
a[:,np.newaxis] +b

Cache effects¶

Memory access is cheaper when it is grouped: accessing a big array in a continuous way is much faster than random access. This implies amongst other things that smaller strides are faster:

c = np.zeros((10_000, 10_000), order='C')

%timeit c.sum(axis=0)
# 1 loops, best of 3: 3.89 s per loop

%timeit c.sum(axis=1)
# 1 loops, best of 3: 188 ms per loop

c.strides
# (80000, 8)

This is the reason why Fortran ordering or C ordering may make a big difference on operations.

Temporary arrays¶

In complex expressions, NumPy stores intermediate values in temporary arrays
Memory consumption can be higher than expected

a = np.random.random((1024, 1024, 50))
b = np.random.random((1024, 1024, 50))

# two temporary arrays will be created
c = 2.0 * a - 4.5 * b

# four temporary arrays will be created, and from which two are due to unnecessary parenthesis
c = (2.0 * a - 4.5 * b) + (np.sin(a) + np.cos(b))

# solution
# apply the operation one by one for really large arrays
c = 2.0 * a
c = c - 4.5 * b
c = c + np.sin(a)
c = c + np.cos(b)

Broadcasting approaches can lead also to hidden temporary arrays
- Input data M x 3 array
- Output data M x M array
- There is a temporary M x M x 3 array

import numpy as np
M = 10_000
X = np.random.random((M, 3))
D = np.sqrt(((X[:, np.newaxis, :] - X) ** 2).sum(axis=-1))

Numexpr¶

Evaluation of complex expressions with one operation at a time can lead also into suboptimal performance
- Effectively, one carries out multiple for loops in the NumPy C-code
- Memory gets allocated for intermediate results
Numexpr package provides fast evaluation of array expressions

import numpy as np
import numexpr as ne

x = np.random.random((10_000_000, 1))
y = np.random.random((10_000_000, 1))
%timeit y = ((.25*x + .75)*x - 1.5)*x - 2
%timeit y = ne.evaluate("((.25*x + .75)*x - 1.5)*x - 2")

By default, Numexpr tries to use multiple threads
Number of threads can be queried and set with numexpr.set_num_threads(nthreads)
Supported operators and functions: +,-,*,/,**, sin, cos, tan, exp, log, sqrt
Speedups in comparison to NumPy are typically between 0.95 and 4
Works best on arrays that do not fit in CPU cache

Keypoints

Measure and benchmark before you start optimizing
Optimization can be to change algorithms, optimize memory usage or add vectorization, or to convert performance-critical functions to Numba or Cython