Compilers#
CPython#
CPython is the main Python distribution.
(Not to be confused with Cython, which we’ll touch on later).
CPython uses an Ahead-Of-Time (AOT) compiler, i.e., the source code is compiled (to bytecode) in advance, which the interpreter then executes.
(A compiler translates program source code into machine-readable instructions.)
It is distributed as an assortment of statically compiled C extensions.
CPython is a general purpose interpreter, allowing it to work on a variety of problems.
It is dynamically typed, so types can change as you go.
For example:
# assign x to an integer
x = 5
print(x)
# then assign x to a string
x = "Gary"
print(x)
5
Gary
Numba#
Numba uses a JIT (Just-In-Time) compiler on functions i.e., compiles the function at execution time.
This converts the function to fast machine code (via LLVM).
Numba works with the default CPython.
It works by adding decorators around functions.
Numba is helpful when you want to speed up numerical operations in specific functions.
There are two main modes: object and nopython.
object mode (@jit)#
Works by adding the @jit decorator around the function.
This then compiles code that handles all values as Python objects and uses CPython to work on those objects.
@jit first tries to use nopython mode (covered next), and falls back to object mode if that fails.
The main improvement over CPython is for loops.
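As a minimal sketch (the function name and data here are made up, and forceobj=True is used to request object mode explicitly), a loop over plain Python objects can still be decorated:
from numba import jit

@jit(forceobj=True)  # explicitly request object mode (illustrative only)
def join_words(words):
    sentence = ""
    for word in words:  # plain Python loop over Python string objects
        sentence = sentence + word + " "
    return sentence.strip()

join_words(["Numba", "object", "mode"])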
nopython mode (@njit)#
Works by adding the @jit(nopython=True) decorator (aliased as @njit) around the function.
This then compiles code that does not access CPython.
This has higher performance than object mode.
The nopython mode requires specific types (mainly numbers); otherwise compilation fails with a TypingError.
For example:
import numpy as np
from numba import njit
First, let's profile an example numerical function without Numba:
nums = np.arange(1_000_000)
def slow_function(nums):
    trace = 0.0
    for num in nums:  # loop
        trace += np.cos(num)  # numpy
    return nums + trace  # broadcasting
%%timeit
slow_function(nums)
1.02 s ± 3.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Now, let's add the Numba njit decorator to the same function:
@njit
def fast_function(nums):
    trace = 0.0
    for num in nums:  # loop
        trace += np.cos(num)  # numpy
    return nums + trace  # broadcasting
The first call to the Numba function has an overhead to compile the function.
%%timeit -n 1 -r 1 # -n 1 means execute the statement once, -r 1 means for one repetition
fast_function(nums)
414 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Then, all subsequent calls use this compiled version, which is much faster.
%%timeit -n 1 -r 1
fast_function(nums)
11.7 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Question 1
For the function below (fast_add):
@njit
def fast_add(x, y):
    return x + y
What will happen when it is called with:
fast_add(1, (2,))
Signatures#
The signature of the Numba function can limit it to specific input and output types, among other things.
This saves Numba the time needed to infer the types, and is also useful when we use GPUs (covered in a later course).
These are added as arguments to the Numba decorator.
For example:
from numba import float32, int32
Here, the output type is wrapped around the input types.
@njit(float32(int32, int32))
def fast_add(x, y):
    return x + y
fast_add(2, 2)
4.0
@vectorize#
Numba also simplifies the creation of a NumPy ufunc using the @vectorize decorator.
They can be targeted to different hardware in the signature.
The default target is for a single CPU case (which has the least overhead).
This is suitable for smaller data sizes (<1 KB) and low compute intensities.
For example:
from numba import vectorize
Don’t worry about what this function does, just focus on the vectorisation bit.
You'll notice that this is the same example as in the previous lesson on vectorisation, except that we're now adding Numba's @vectorize decorator.
import math
SQRT_2PI = np.float32((2.0 * math.pi) ** 0.5)
x = np.random.uniform(-3.0, 3.0, size=1_000_000)
mean = 0.0
sigma = 1.0
@vectorize  # I'm new
def my_function(x, mean, sigma):
    """Compute the value of a Gaussian probability density function at x with given mean and sigma."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2.0) / (sigma * SQRT_2PI)
So, the first call to the function compiles it:
%%timeit -n 1 -r 1
my_function(x, 0.0, 1.0)
75.3 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Then, all subsequent calls use the fast compiled version:
%%timeit -n 1 -r 1
my_function(x, 0.0, 1.0)
8.03 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
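As a hedged sketch (the function name below is made up), the same ufunc could instead be aimed at a multi-core CPU by giving @vectorize an explicit type signature and the target keyword:
from numba import vectorize, float64

@vectorize([float64(float64, float64, float64)], target="parallel")  # multi-core CPU target; needs an explicit signature
def my_parallel_gaussian(x, mean, sigma):
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2.0) / (sigma * SQRT_2PI)

my_parallel_gaussian(x, 0.0, 1.0)
For small arrays the default single-CPU target is usually the better choice, because the parallel target adds threading overheads.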
@guvectorize#
During our last lesson on vectorisation we also touched on generalised ufuncs (gufuncs).
These extend vectorize to operate on arrays of input elements rather than one element at a time.
Numba has a nice implementation of these using guvectorize.
The signature also requires the types to be specified first in a list.
For example:
[(int64[:], int64, int64[:])] means an n-element one-dimensional array of int64, a scalar of int64, and another n-element one-dimensional array of int64.
Then the signature includes the input(s) and output(s) dimensions in symbolic form.
For example:
'(n),()->(n)' means input an n-element one-dimensional array ((n)) and a scalar (()), and output an n-element one-dimensional array ((n)).
from numba import guvectorize, int64
@guvectorize([(int64[:], int64, int64[:])], "(n),()->(n)")
def g(x, y, result):
    for index in range(x.shape[0]):
        result[index] = x[index] + y
First, let's try the gufunc with a 1D array and an integer:
x = np.arange(5)
x
array([0, 1, 2, 3, 4])
g(x, 5)
array([5, 6, 7, 8, 9])
Okay. So, now how about a 2D array and an integer:
x = np.arange(6).reshape(2, 3)
x
array([[0, 1, 2],
[3, 4, 5]])
g(x, 10)
array([[10, 11, 12],
[13, 14, 15]])
And, what about a 2D array and a 1D array:
g(x, np.array([10, 20]))
array([[10, 11, 12],
[23, 24, 25]])
parallel=True#
The next lesson covers parallelisation in detail. However, before that, let’s touch on a nice feature within Numba.
Numba can target different hardware in the signature.
Just now, we saw a Numba function for a single CPU, which is suitable for small data sizes.
The next target is for a multi-core CPU.
This has small additional overheads for threading.
This is suitable for medium data sizes (1 KB - 1 MB).
If code contains operations that are parallelisable (and supported) Numba can compile a version that will run in parallel on multiple threads.
This parallelisation is performed automatically and is enabled by simply adding the keyword argument parallel=True to @njit.
For example, let's first use the function in serial (i.e., with parallel=False, which is also the default):
x = np.arange(1.0e7, dtype=np.int64)
@njit
def my_serial_function_for_cpu(x):
    return np.cos(x) ** 2 + np.sin(x) ** 2
%%timeit
my_serial_function_for_cpu(x)
177 ms ± 557 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Okay, so let’s now change that to run in parallel:
@njit(parallel=True)
def my_parallel_function_for_cpu(x):
    return np.cos(x) ** 2 + np.sin(x) ** 2
Note
The timing of this parallel function depends on how many CPUs your machine has and how free their resources are.
%%timeit
my_parallel_function_for_cpu(x)
78.8 ms ± 605 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
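Loops can also be parallelised explicitly with numba.prange. Here is a minimal sketch (the function name is made up), which only runs in parallel because parallel=True is set:
from numba import njit, prange

@njit(parallel=True)
def parallel_sum_of_cos(x):
    total = 0.0
    for i in prange(x.shape[0]):  # iterations are divided across threads; Numba handles the reduction on total
        total += np.cos(x[i])
    return total

parallel_sum_of_cos(x)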
Exercises#
Exercise 1
What is the default Python distribution?
Cython
PyPy
CPython
Exercise 2
Which Numba compilation mode has higher performance?
object
nopython
Exercise 3
How do I compile a function in Numba using nopython mode?
Exercise 4
What is the keyword argument that enables Numba compiled functions to run over multiple CPUs?
Exercise 5
Create your own Numba vectorised function that calculates the cube root of an array over all elements.
Solutions#
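A possible solution to Exercise 5 (one sketch among many):
from numba import vectorize

@vectorize  # lazily compiled ufunc; an explicit signature such as [float64(float64)] could also be given
def cube_root(x):
    return x ** (1.0 / 3.0)

cube_root(np.array([1.0, 8.0, 27.0]))  # approximately array([1., 2., 3.])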
Key Points#
Important
Speed up numerical functions with the Numba @njit (nopython) compiler.
Further information#
More information and considerations#
Factor out the performance-critical part of the code for compilation in Numba.
Consider what data precision is required, i.e., is 64-bit needed?
Numba can also target CUDA GPUs, which we’ll cover in the final lesson.
Other options#
PyPy
- Also uses a JIT compiler (though it is written in Python).
- PyPy enables optimisations at run time, especially for numerical tasks with repetition and loops.
- Completely replaces CPython.
- Caution: it may not be compatible with the libraries you use.
- Generally fast, though there are overheads for start-up and memory.
- PyPy is helpful when you want to speed up numerical operations in all of the code.
Resources#
Why is Python slow?, Anthony Shaw, PyCon 2020.
CPython Internals, Anthony Shaw.