Compilers#
CPython#
CPython is the main Python distribution.
(Not to be confused with Cython, which we’ll touch on later).
CPython uses an Ahead-Of-Time (AOT) compiler, i.e., the code is compiled in advance.
(A compiler translates program source code into machine-readable instructions.)
It ships as an assortment of statically compiled C extensions.
CPython is a general purpose interpreter, allowing it to work on a variety of problems.
It is dynamically typed, so the type held by a variable can change as the program runs.
For example:
# assign x to an integer
x = 5
print(x)
# then assign x to a string
x = "Gary"
print(x)
5
Gary
Numba#
Numba uses a JIT (Just-In-Time) compiler on functions, i.e., the function is compiled at execution time.
This converts the function into fast machine code via the LLVM compiler library.
Numba works with the default CPython.
It works by adding decorators around functions.
Numba is helpful when you want to speed up numerical operations in specific functions.
There are two main modes: object and nopython.
object mode (@jit)#
Works by adding the @jit decorator around the function.
This then compiles code that handles all values as Python objects and uses CPython to work on those objects.
@jit first tries to use nopython mode (covered next), and if that fails it falls back to object mode.
The main improvement over CPython is for loops.
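For example, here is a minimal sketch of object mode (the function name and sample values are just for illustration; forceobj=True requests object mode explicitly, since depending on your Numba version plain @jit may no longer fall back to it automatically):
from numba import jit

@jit(forceobj=True)  # explicitly request object mode
def object_mode_sum(values):
    # values are handled as ordinary Python objects via CPython
    total = 0.0
    for value in values:
        total += value
    return total

object_mode_sum([1.5, 2.5, 3.0])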
nopython mode (@njit)#
Works by adding the @jit(nopython=True) decorator (aliased as @njit) around the function.
This then compiles code that does not access CPython.
This has higher performance than object mode.
The nopython mode requires specific types (mainly numbers), otherwise it raises a TypingError.
For example:
import numpy as np
from numba import njit
First, let's profile an example numerical function without Numba:
nums = np.arange(1_000_000)
def slow_function(nums):
    trace = 0.0
    for num in nums: # loop
        trace += np.cos(num) # numpy
    return nums + trace # broadcasting
%%timeit
slow_function(nums)
1.19 s ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Now, let's add the Numba njit decorator to the same function:
@njit
def fast_function(nums):
    trace = 0.0
    for num in nums: # loop
        trace += np.cos(num) # numpy
    return nums + trace # broadcasting
The first call to the Numba function has an overhead to compile the function.
%%timeit -n 1 -r 1 # -n 1 means execute the statement once, -r 1 means for one repetition
fast_function(nums)
414 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Then, all subsequent calls use this compiled version, which is much faster.
%%timeit -n 1 -r 1
fast_function(nums)
22.1 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Question 1
For the function below (fast_add):
@njit
def fast_add(x, y):
    return x + y
What will happen when it is called with:
fast_add(1, (2,))
Signatures#
The signature of the Numba function can limit it to specific input and output types, among other things.
This can save Numba the time needed to infer the types, and is also useful when we use GPUs later.
These are added as arguments to the Numba decorator.
For example:
from numba import float32, int32
Here, the output type (float32) is wrapped around the input types (int32, int32); because the return type is float32, the integer inputs below give a floating-point result.
@njit(float32(int32, int32))
def fast_add(x, y):
    return x + y
fast_add(2, 2)
4.0
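Several signatures can also be supplied as a list, so that multiple specialisations are compiled eagerly. Here is a minimal sketch (the function name and the int64/float64 pairing are just an illustrative choice):
from numba import njit, int64, float64

@njit([int64(int64, int64), float64(float64, float64)])
def fast_add_eager(x, y):
    return x + y

fast_add_eager(2, 2)      # uses the int64 specialisation
fast_add_eager(2.0, 2.0)  # uses the float64 specialisation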
@vectorize#
Numba also simplifies the creation of a NumPy ufunc using the @vectorize decorator.
These ufuncs can be targeted at different hardware using the target keyword in the decorator.
The default target is for a single CPU case (which has the least overhead).
This is suitable for smaller data sizes (<1 KB) and low compute intensities.
For example:
from numba import vectorize
Don’t worry about what this function does, just focus on the vectorisation bit.
You’ll notice that this is the same example as in the previous lesson on vectorisation, except that we’re now adding Numba’s @vectorize decorator.
import math
SQRT_2PI = np.float32((2.0 * math.pi) ** 0.5)
x = np.random.uniform(-3.0, 3.0, size=1_000_000)
mean = 0.0
sigma = 1.0
@vectorize # I'm new
def my_function(x, mean, sigma):
    """Compute the value of a Gaussian probability density function at x with given mean and sigma."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2.0) / (sigma * SQRT_2PI)
So, the first call to the function compiles it:
%%timeit -n 1 -r 1
my_function(x, 0.0, 1.0)
76.4 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Then, all subsequent calls use the fast compiled version:
%%timeit -n 1 -r 1
my_function(x, 0.0, 1.0)
11.4 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
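As mentioned above, the hardware target for the ufunc can be chosen in the decorator. Here is a minimal sketch of a multi-threaded version of the same function (a non-default target requires an explicit signature, and the function name here is just for illustration):
import math

import numpy as np
from numba import float64, vectorize

SQRT_2PI = np.float32((2.0 * math.pi) ** 0.5)

@vectorize([float64(float64, float64, float64)], target="parallel")
def my_parallel_ufunc(x, mean, sigma):
    # the same Gaussian probability density function, now compiled as a multi-threaded ufunc
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2.0) / (sigma * SQRT_2PI)

my_parallel_ufunc(np.random.uniform(-3.0, 3.0, size=1_000_000), 0.0, 1.0)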
@guvectorize#
During our last lesson on vectorisation we also touched on generalised ufuncs (gufuncs).
These extend vectorize to work on arrays of input elements, rather than one element at a time.
Numba has a nice implementation of these using guvectorize.
The signature also requires the types to be specified first in a list.
For example:
[(int64[:], int64, int64[:])]
means an n-element one-dimensional array of int64, a scalar of int64, and another n-element one-dimensional array of int64.
Then the signature includes the input(s) and output(s) dimensions in symbolic form.
For example:
'(n),()->(n)'
means the inputs are an n-element one-dimensional array ((n)) and a scalar (()), and the output is an n-element one-dimensional array ((n)).
from numba import guvectorize, int64
@guvectorize([(int64[:], int64, int64[:])], "(n),()->(n)")
def g(x, y, result):
    for index in range(x.shape[0]):
        result[index] = x[index] + y
First, let's try the gufunc with a 1D array and an integer:
x = np.arange(5)
x
array([0, 1, 2, 3, 4])
g(x, 5)
array([5, 6, 7, 8, 9])
Okay. So, now how about a 2D array and an integer:
x = np.arange(6).reshape(2, 3)
x
array([[0, 1, 2],
[3, 4, 5]])
g(x, 10)
array([[10, 11, 12],
[13, 14, 15]])
And, what about a 2D array and a 1D array:
g(x, np.array([10, 20]))
array([[10, 11, 12],
[23, 24, 25]])
parallel=True#
The next lesson covers parallelisation in detail. However, before that, let’s touch on a nice feature within Numba.
Numba can target different hardware in the signature.
Just now, we saw a Numba function for a single CPU, which is suitable for small data sizes.
The next target is for a multi-core CPU.
This has small additional overheads for threading.
This is suitable for medium data sizes (1 KB - 1 MB).
If code contains operations that are parallelisable (and supported), Numba can compile a version that will run in parallel on multiple threads.
This parallelisation is performed automatically and is enabled by simply adding the keyword argument parallel=True to @njit.
For example, let’s first use the function in serial (i.e., with parallel=False, which is also the default):
x = np.arange(1.0e7, dtype=np.int64)
@njit
def my_serial_function_for_cpu(x):
    return np.cos(x) ** 2 + np.sin(x) ** 2
%%timeit
my_serial_function_for_cpu(x)
296 ms ± 748 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Okay, so let’s now change that to run in parallel:
@njit(parallel=True)
def my_parallel_function_for_cpu(x):
    return np.cos(x) ** 2 + np.sin(x) ** 2
Note
The timing of this parallel function depends on how many CPUs your machine has and how free their resources are.
%%timeit
my_parallel_function_for_cpu(x)
146 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
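parallel=True also supports explicit parallel loops via numba.prange, which replaces range in loops whose iterations are safe to run across threads. Here is a minimal sketch (the function and the reduction it performs are just an illustrative example):
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def parallel_sum_of_squares(x):
    total = 0.0
    for i in prange(x.shape[0]): # prange splits the iterations across threads
        total += x[i] ** 2 # simple reductions like this are supported in parallel
    return total

parallel_sum_of_squares(np.arange(1.0e7))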
Exercises#
Exercise 1
What is the default Python distribution?
Cython
PyPy
CPython
Exercise 2
Which Numba compilation mode has higher performance?
object
nopython
Exercise 3
How do I compile a function in Numba using nopython mode?
Exercise 4
What is the keyword argument that enables Numba compiled functions to run over multiple CPUs?
Exercise 5
Create your own Numba vectorised function that calculates the cube root of an array over all elements.
Solutions#
Key Points#
Important
Speed up numerical functions with the Numba @njit (nopython) compiler.
Further information#
More information and considerations#
Factor out the performance-critical part of the code for compilation in Numba.
Consider what data precision is required, i.e., is 64-bit needed?
Numba can also target CUDA GPUs, which we’ll cover in the final lesson.
Other options#
PyPy
Also uses a JIT compiler (though it is written in Python).
PyPy enables optimisations at run time, especially for numerical tasks with repetition and loops.
It completely replaces CPython.
Caution: it may not be compatible with the libraries you use.
It is generally fast, though there are overheads for start-up and memory.
PyPy is helpful when you want to speed up numerical operations across all of the code.
Resources#
Why is Python slow?, Anthony Shaw, PyCon 2020.
CPython Internals, Anthony Shaw.