Solutions
Profiling
Exercises
Exercise 1
What is profiling and why is it useful?
Solution 1
Profiling analyses your code in terms of speed or memory.
This helps identify where the bottlenecks are (why and where is it slow?) and how much potential there is for improvement.
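For example, here is a minimal sketch using Python's built-in cProfile module (slow_sum is a hypothetical function, made up here for illustration):

import cProfile

def slow_sum(n):
    total = 0.0
    for i in range(n):
        total += i ** 0.5  # arbitrary per-element work
    return total

cProfile.run("slow_sum(1_000_000)")  # prints call counts and times per function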
Exercise 2
What profiling tool times the execution of a cell in a Jupyter Notebook?
Solution 2
%%timeit
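For example, placed on the first line of a cell (a minimal sketch; the summed range is arbitrary), it times the whole cell body:

%%timeit
total = sum(range(1_000))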
Exercise 3
Below are two approaches for filling up an empty NumPy array.
Which approach is faster and why?
import numpy as np

def fill_array_approach_1(n):
    array = np.empty(1)
    for index in range(n):
        new_point = np.random.rand()
        array = np.append(array, new_point)
    return array

def fill_array_approach_2(n):
    array = np.empty(n)
    for index in range(len(array)):
        new_point = np.random.rand()
        array[index] = new_point
    return array
Solution 3
Approach 2 is about 10x faster.
For example, over 1,000 data points:
%timeit fill_array_approach_1(1_000)
2.7 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit fill_array_approach_2(1_000)
251 µs ± 4.75 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
This is because approach 2 allocates one array of the required size up front, then fills it in place.
In contrast, approach 1 calls np.append on every iteration, which copies the entire array into a new, larger one each time, making the loop O(n^2) overall.
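If the final size isn't known in advance, a common middle ground (a sketch, not part of the original solution) is to append to a Python list, which grows in amortised constant time, and convert to an array once at the end:

import numpy as np

def fill_array_with_list(n):
    points = []
    for _ in range(n):
        points.append(np.random.rand())  # amortised O(1) appends
    return np.asarray(points)  # single conversion to a NumPy array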
Exercise 4
Below are two methods that find two numbers from an array of unique integers that add up to a target sum.
If the target can’t be made, then an empty list is returned.
Each element in the array can only be used once.
Which method is faster and why?
def two_sum_brute_force(array, target):
    """
    A brute force approach to find two numbers from an array that add up to a target.

    Steps
    1. Loop through the array twice, adding up pairs of array elements.
    2. Compare each of these sums to the target.
    3. Return the pair that sums to the target, if one exists.
    """
    for index_one in range(len(array)):
        for index_two in range(index_one + 1, len(array)):
            if (
                array[index_one] + array[index_two] == target  # check sum of pair
                and index_one != index_two  # can't use the same element twice
            ):
                return [index_one, index_two]  # return the target pair
    return []  # return an empty list if the target pair isn't found
def two_sum_cache(array, target):
    """
    Use caching to find two numbers from an array that add up to a target.

    Steps
    1. Create a dictionary of cached differences relative to the target sum.
    2. Loop through the array once, adding each index and difference to the cache.
    3. If the required difference of a new array element is already in the cache,
       then you've found a matching pair, which you can return.
    """
    cache_differences = {}
    for index, element in enumerate(array):
        difference = target - element  # calculate the target difference for this element
        if difference in cache_differences:  # if we have the matching pair
            return [index, cache_differences[difference]]  # return the target pair
        cache_differences[element] = index  # if we don't have a match, add to the cache
    return []  # return an empty list if the target pair isn't found
import numpy as np
array = np.random.choice(1_000, 500, replace=False)
target = 250
Solution 4
The cached version is faster, because it only loops through the array once.
The brute force method uses a nested loop, so its time complexity is O(n^2), while its space complexity is O(1).
The cached method has time complexity O(n) and space complexity O(n), because the cache itself (i.e., the dictionary) can grow up to the size of the array.
For example, using the array and target above:
%timeit two_sum_brute_force(array, target)
272 µs ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit two_sum_cache(array, target)
31.8 µs ± 8.57 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
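As a quick sanity check (a minimal sketch using the array and target defined above), you can verify that a returned pair really does sum to the target:

pair = two_sum_cache(array, target)
if pair:  # an empty list means no pair was found
    assert array[pair[0]] + array[pair[1]] == target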
Data Structures, Algorithms, and Libraries
Questions
Question 1
Which of the following uses the least memory, and how can you check?
np.float16
np.float32
np.float64
Solution 1
Single precision floats have 32 bits (i.e., np.single or np.float32).
This is equal to 4 bytes, as 1 byte is 8 bits.
You can check this using:
np.float32(1.0).nbytes
4
Half precision floats have (you guessed it) half of this (i.e., np.half or np.float16):
np.float16(1.0).nbytes
2
And double precision floats (i.e., np.double or np.float64) have double:
np.float64(1.0).nbytes
8
So np.float16 uses the least memory.
In some situations, you may be able to significantly reduce your memory usage by using half or single precision arrays.
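For example, a minimal sketch (the array size is arbitrary) showing how casting an array to single precision halves its memory footprint:

import numpy as np

big = np.ones(1_000_000, dtype=np.float64)
small = big.astype(np.float32)  # cast to single precision
print(big.nbytes, small.nbytes)  # 8000000 4000000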
Exercises
Exercise 1
What data structure would be suitable for finding or removing duplicate values?
a. List
b. Dictionary
c. Queue
d. Set
Test out your answer on the following array:
array = np.random.choice(100, 80)
Are there any other ways of doing it?
Solution 1
A suitable data structure for finding or removing duplicate values is:
d. Set
This is because sets are unordered collections of distinct hashable objects.
You can call it via set(array).
Another approach would be to use NumPy features. For example, when creating a random array, you can specify whether values are sampled with or without replacement via:
array = np.random.choice(100, 80, replace=False)
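Yet another NumPy option (not in the original solution) is np.unique, which removes duplicates from an existing array:

import numpy as np

array = np.random.choice(100, 80)  # sampled with replacement, so duplicates are likely
deduplicated = np.unique(array)    # sorted array of the distinct values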
Exercise 2
In the exercise from the profiling lesson, we saw an example of two_sum, i.e., finding two numbers from an array of unique integers that add up to a target sum.
What would be a good approach for generalising this sum of two numbers to three, four, or n numbers?
Solution 2
Many algorithms are already implemented for you.
For example, the algorithms library has an n_sum method.
This generalises the function to any value of n.
For example, here is a “six_sum”:
from algorithms.arrays import n_sum
array = np.random.choice(100, 80, replace=False)
target = 23
n_sum(6, array, target)
[[0, 1, 3, 4, 6, 9],
[0, 1, 3, 4, 7, 8],
[0, 1, 3, 5, 6, 8],
[0, 1, 4, 5, 6, 7]]
Try it for yourself, or any one of the many other examples.
Vectorisation
Questions
Question 1
If something doesn’t vary for a given loop, should it be inside or outside of that loop?
Solution 1
Loop-invariants should be outside of loops.
This is because loops are slow in Python (CPython, the default interpreter).
Loops are slow because Python type-checks and dispatches functions on every iteration.
Try to avoid them where you can (e.g., by using NumPy ufuncs, aggregations, broadcasting, etc.).
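For example, a minimal sketch (the computation is arbitrary) of hoisting a loop-invariant expression out of a loop:

import math

n = 100_000

# Invariant recomputed on every iteration:
results = [math.sqrt(2.0) * i for i in range(n)]

# Invariant hoisted out and computed once:
scale = math.sqrt(2.0)
results = [scale * i for i in range(n)]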
Question 2
Can you run the unvectorised my_function directly on the same inputs (i.e., all of x)?
Solution 2
No, a TypeError is returned.
As the function has not yet been vectorised, it cannot operate across arrays with more than 1 element.
To check that it does work for 1 element, you could try:
my_function(x[0], mean, sigma)
Exercises
Exercise 1
What is broadcasting?
Solution 1
Broadcasting allows operations between arrays of different shapes, by implicitly stretching the smaller array to match the larger one (without copying data).
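For example, a minimal sketch adding a row vector to a column vector, which broadcasts to a 2D grid:

import numpy as np

row = np.arange(3)                    # shape (3,)
column = np.arange(4)[:, np.newaxis]  # shape (4, 1)
grid = row + column                   # broadcast to shape (4, 3)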
Exercise 2
What is vectorisation?
Solution 2
Vectorisation allows the code to operate on multiple array elements at once, rather than looping through them one at a time.
Exercise 3
How would you replace the compute_reciprocals function below with a vectorised version?
def compute_reciprocals(array):
    """
    Divides 1 by an array of values.
    """
    output = np.empty(len(array))
    for i in range(len(array)):
        output[i] = 1.0 / array[i]
    return output
big_array = np.random.randint(1, 100, size=1_000_000)
%timeit compute_reciprocals(big_array)
1.43 s ± 8.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Solution 3
You can either use the np.divide ufunc as follows:
%timeit np.divide(1.0, big_array)
1.29 ms ± 124 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Or you could just use /, which works on every element by overloading the operator (similar to broadcasting):
%timeit (1.0 / big_array)
1.19 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Exercise 4
Create your own vectorised ufunc that calculates the cube root of an array over all elements.
Hint
Here is an unvectorised version (i.e., it has a for loop):
def my_cube_root(array):
    output = np.empty(len(array))
    for i in range(len(array)):
        output[i] = array[i] ** (1 / 3)
    return output
%timeit my_cube_root(big_array)
1.77 s ± 152 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Solution 4
Option 1:
You could use the np.vectorize decorator:
import math
from numpy import vectorize

@vectorize
def my_cube_root(array):
    return math.pow(array, 1 / 3)
%timeit my_cube_root(big_array)
213 ms ± 33.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Option 2:
You could use the overloaded ** and / operators:
def my_cube_root(array):
    return array ** (1 / 3)
%timeit my_cube_root(big_array)
36 ms ± 864 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Option 3:
You could use the optimised ufuncs from NumPy or SciPy (recommended):
%timeit np.cbrt(big_array)
13.9 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
from scipy.special import cbrt
%timeit cbrt(big_array)
18.9 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
Compilers
Questions
Question 1
For the function below (fast_add):
from numba import njit

@njit
def fast_add(x, y):
    return x + y
What will happen when it is called with:
fast_add(1, (2,))
Solution 1
A TypingError is returned.
This is because Numba is trying to compile a function that adds an integer to a tuple.
Take care with types.
Ensure that the function works first, before adding Numba to make it faster.
Exercises
Exercise 1
What is the default Python distribution?
Cython
PyPy
CPython
Solution 1
CPython
Exercise 2
Which Numba compilation mode has higher performance?
object
nopython
Solution 2
nopython
Exercise 3
How do I compile a function in Numba using nopython mode?
Solution 3
Wrap the function in the @njit (or @jit(nopython=True)) decorator.
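For example, a minimal sketch using the fast_add function from the question above:

from numba import njit

@njit
def fast_add(x, y):
    return x + y

fast_add(1.0, 2.0)  # the first call compiles; subsequent calls reuse the machine code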
Exercise 4
What is the keyword argument that enables Numba compiled functions to run over multiple CPUs?
Solution 4
parallel=True
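For example, a minimal sketch (the reduction is arbitrary) combining parallel=True with prange, Numba's parallel range:

from numba import njit, prange

@njit(parallel=True)
def parallel_sum(array):
    total = 0.0
    for i in prange(len(array)):  # iterations are distributed across CPU cores
        total += array[i]  # Numba recognises this pattern as a parallel reduction
    return total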
Exercise 5
Create your own Numba vectorised function that calculates the cube root of an array over all elements.
Hint
Have a look at the similar exercise from the Vectorisation lesson.
Solution 5
Using the Numba @vectorize decorator and the math library:
import math
from numba import vectorize  # I'm the only change from the NumPy ufunc exercise

@vectorize
def my_cube_root(array):
    return math.pow(array, 1 / 3)
%timeit my_cube_root(big_array)
27.7 ms ± 643 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Numba is a nice way to get performant numerical code.
Parallelisation
Questions
Question 1
What did our Dask Dashboard show?
Hint
Look at the rectangles in the Task Stream.
Solution 1
Our specific results show us that:
All workers were kept busy with numerical tasks (i.e., no white/red blocks), which utilises the CPU cores well.
Also, the CPU percentage for the workers increased to ~100 % during the computation.
Question 2
How can we check that the job used the CPU cores efficiently?
Hint
Use qacct -j JOBID on the job.
Then look at the time spent computing things (cpu) compared to the total time taken up (ru_wallclock * slots).
slots 8
ru_wallclock 1102s
cpu 6623.054s
What does this comparison tell us?
Solution 2
After the job has finished, we can find information on the job using qacct -j JOBID, where JOBID is the unique ID for your job.
is the unique ID for your job.
The result is a large table of information. The important variables here are:
slots 8
ru_wallclock 1102s
cpu 6623.054s
We can calculate the efficiency of the job by dividing how long the cores were computing things (cpu) by the total amount of time they used up (ru_wallclock * slots). This efficiency should be as close to 100 % as possible.
For example:
Efficiency = 100 * cpu / (ru_wallclock * slots)
Efficiency = 100 * 6623 / (1102 * 8)
Efficiency = 75 %
The efficiency here of 75 % is pretty good, though why is this not higher? The Task Stream was full of computation the whole time, right?
Well, the reason is that Dask’s scheduler and client are each pinned to their own process. This left only 6 of the 8 slots for workers. If we recalculate the efficiency over these 6 worker processes only, then the difference is Dask’s overhead:
Efficiency = 100 * 6623 / (1102 * 6)
Efficiency = 100 %
Hence, the worker efficiency was approximately 100 %, while the overall efficiency including Dask’s overheads was 75 %. So, although Dask’s scheduler and client were each assigned a slot, they didn’t use much of those resources (i.e., they were only managing the workers).
It would be better if Dask’s scheduler and client ran elsewhere (e.g., on the login nodes), so that the compute nodes handled only the efficient worker computation.
This alternative does exist: it’s Dask-Jobqueue, which we’re exploring next.
Question 3
How well did this job use the resources (use the output from qacct below, which is from one of the workers)?
$ qacct -j 3526684
==============================================================
qname feps-cpu.q
hostname d13s0b1.arc4.leeds.ac.uk
group EAR
owner earlacoa
project feps-cpu
department defaultdepartment
jobname dask-worker
jobnumber 3526684
taskid undefined
account sge
priority 0
qsub_time Fri Feb 25 16:25:14 2022
start_time Fri Feb 25 16:25:35 2022
end_time Fri Feb 25 16:38:21 2022
granted_pe smp
slots 1
failed 100 : assumedly after job
exit_status 137 (Killed)
ru_wallclock 766s
ru_utime 0.040s
ru_stime 0.045s
ru_maxrss 4.879KB
ru_ixrss 0.000B
ru_ismrss 0.000B
ru_idrss 0.000B
ru_isrss 0.000B
ru_minflt 15248
ru_majflt 0
ru_nswap 0
ru_inblock 0
ru_oublock 16
ru_msgsnd 0
ru_msgrcv 0
ru_nsignals 0
ru_nvcsw 208
ru_nivcsw 38
cpu 728.840s
mem 6.774TBs
io 43.651MB
iow 0.000s
maxvmem 18.891GB
arid undefined
ar_sub_time undefined
category -U admiralty,feps-cpu,feps-gpu -l disk=48G,env=centos7,h_rt=3600,h_vmem=48G,node_type=40core-192G,project=feps-cpu -pe smp 1
Solution 3
Instead of one big job with all the workers (as for Dask-MPI), Dask-Jobqueue submits individual workers.
Similar to before (i.e., using qacct -j JOBID), we can check the efficiency of each of these individually.
The results for each worker are:
slots 1
ru_wallclock 766s
cpu 728.840s
Hence, the efficiency of each worker is:
Efficiency = 100 * cpu / (ru_wallclock * slots)
Efficiency = 100 * 729 / (766 * 1)
Efficiency = 95 %
This is a nice use of HPC resources.
Exercises
Exercise 1
Why does parallelisation speed up code?
Solution 1
Parallelisation divides a large problem into many smaller ones and solves them simultaneously.
This spreads the time and memory costs over multiple workers, reducing the overall wall-clock time.
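For example, a minimal sketch using Python's built-in multiprocessing module to split work over several processes (the function and process count are arbitrary):

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    with Pool(processes=4) as pool:  # four worker processes
        results = pool.map(square, range(10))  # work split across the pool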
Exercise 2
What are there multiple of to split the work over?
Solution 2
Multiple processes
Multiple threads
Exercise 3
If you need to share memory, would you use MPI or OpenMP?
Solution 3
OpenMP is often suitable for shared memory problems.
Exercise 4
Which Dask library can be tailored to a variety of resource managers (e.g., SGE, SLURM)?
Solution 4
Dask-Jobqueue
Exercise 5
Which of the 3 examples below is most efficient, and why?
Note, the chunks keyword argument is the size of each chunk.
Example 1: Many, small chunks.
import dask.array as da

x = da.random.random(10_000_000, chunks=(1_000,))
y = x.sum().compute()
Example 2: Fewer, large chunks.
x = da.random.random(10_000_000, chunks=(100_000,))
y = x.sum().compute()
Example 3: Use NumPy.
x = np.random.random(10_000_000)
y = x.sum()
Hint
If you’re not in Colab, then you could profile them with %%timeit.
Otherwise, you could guess, thinking about overheads and chunks.
Note, if you’ve closed the Dask client, you may need to restart it:
from dask.distributed import Client
client = Client()
Solution 5
Dask introduces overhead to parallelise work (i.e., the scheduler and client).
Hence, Dask is inefficient (i.e., the workers are under-utilised) when doing lots of very small computations.
Consider using fewer, larger chunks to reduce this overhead.
Also, consider if parallelisation is really needed.
For this example:
Example 1: Many, small chunks.
%%timeit
x = da.random.random(10_000_000, chunks=(1_000,))
y = x.sum().compute()
12.4 s ± 2.04 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Example 2: Fewer, large chunks.
%%timeit
x = da.random.random(10_000_000, chunks=(100_000,))
y = x.sum().compute()
129 ms ± 4.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So, for this example, fewer big chunks (each of size 100,000) improved things.
There is a nice feature that lets Dask automatically decide on the chunk size.
This improves things further:
%%timeit
x = da.random.random(10_000_000, chunks='auto')
y = x.sum().compute()
71.6 ms ± 868 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
However, compared to NumPy, Dask only makes things slower for this problem (due to its overheads).
Example 3: Use NumPy.
%%timeit
x = np.random.random(10_000_000)
y = x.sum()
60 ms ± 6.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
This demonstrates that NumPy is already performant for some tasks, and that parallelisation is sometimes not needed.
Exercise 6
How well did the job below use the HPC resources?
$ qacct -j 3524073
==============================================================
qname feps-cpu.q
hostname d9s9b4.arc4.leeds.ac.uk
group EAR
owner earlacoa
project feps-cpu
department defaultdepartment
jobname example_bodo_mpi_sge.bash
jobnumber 3524073
taskid undefined
account sge
priority 0
qsub_time Thu Feb 24 12:48:24 2022
start_time Thu Feb 24 12:48:34 2022
end_time Thu Feb 24 12:48:55 2022
granted_pe smp
slots 8
failed 0
exit_status 0
ru_wallclock 21s
ru_utime 139.030s
ru_stime 7.635s
ru_maxrss 1.687MB
ru_ixrss 0.000B
ru_ismrss 0.000B
ru_idrss 0.000B
ru_isrss 0.000B
ru_minflt 764941
ru_majflt 2
ru_nswap 0
ru_inblock 0
ru_oublock 80
ru_msgsnd 0
ru_msgrcv 0
ru_nsignals 0
ru_nvcsw 38884
ru_nivcsw 665
cpu 146.665s
mem 112.727GBs
io 166.518MB
iow 0.000s
maxvmem 12.973GB
arid undefined
ar_sub_time undefined
category -U admiralty,feps-cpu,feps-gpu -l env=centos7,h_rt=600,h_vmem=24G,node_type=40core-192G,project=feps-cpu -pe smp 8
Solution 6
qacct -j JOBID tells us that:
slots 8
ru_wallclock 21s
cpu 146.665s
Hence, the efficiency is:
Efficiency = 100 * cpu / (ru_wallclock * slots)
Efficiency = 100 * 147 / (21 * 8)
Efficiency = 88 %
That’s good.
Exercise 7
How well did this job use the HPC resources?
If it wasn’t ideal, what went wrong and what might fix it?
$ qacct -j 3524046
==============================================================
qname feps-cpu.q
hostname d9s9b4.arc4.leeds.ac.uk
group EAR
owner earlacoa
project feps-cpu
department defaultdepartment
jobname example_bodo_mpi_sge.bash
jobnumber 3524046
taskid undefined
account sge
priority 0
qsub_time Thu Feb 24 12:34:54 2022
start_time Thu Feb 24 12:35:08 2022
end_time Thu Feb 24 12:37:14 2022
granted_pe smp
slots 8
failed 0
exit_status 0
ru_wallclock 126s
ru_utime 125.250s
ru_stime 6.542s
ru_maxrss 1.689MB
ru_ixrss 0.000B
ru_ismrss 0.000B
ru_idrss 0.000B
ru_isrss 0.000B
ru_minflt 758663
ru_majflt 2
ru_nswap 0
ru_inblock 0
ru_oublock 80
ru_msgsnd 0
ru_msgrcv 0
ru_nsignals 0
ru_nvcsw 35039
ru_nivcsw 30366
cpu 131.792s
mem 102.207GBs
io 166.212MB
iow 0.000s
maxvmem 13.432GB
arid undefined
ar_sub_time undefined
category -U admiralty,feps-cpu,feps-gpu -l env=centos7,h_rt=600,h_vmem=24G,node_type=40core-192G,project=feps-cpu -pe smp 8
Solution 7
qacct -j JOBID tells us that:
slots 8
ru_wallclock 126s
cpu 131.792s
Hence, the efficiency is:
Efficiency = 100 * cpu / (ru_wallclock * slots)
Efficiency = 100 * 131 / (126 * 8)
Efficiency = 13 %
That’s very low.
The likely cause of this low efficiency is that all of the work is being pinned to one process, instead of being distributed across all 8.
(There are other ways to check this, e.g., using top or ps on the compute nodes.)
One solution to try is to set an environment variable that stops processes being pinned (remember, we saw this earlier). To do this, add the following command to the submission bash script: unset GOMP_CPU_AFFINITY.
GPUs
Exercises
Exercise 1
In general, what kind of tasks are GPUs faster than CPUs for, and why?
Solution 1
Generally, GPUs are faster than CPUs for numerical operations, because they are optimised for data parallelism: the same computation is applied to many elements at once.
Exercise 2
Which Numba decorators can you use to offload a function to GPUs?
Solution 2
@vectorize
@cuda.jit
Exercise 3
How would you vectorise the following function for GPUs?
import math

def my_serial_function_for_gpu(x):
    return math.cos(x) ** 2 + math.sin(x) ** 2
Solution 3
You could add the @vectorize decorator with a signature to target CUDA.
In the example below, the input and output types are specified as float32:
import math
from numba import vectorize

@vectorize(['float32(float32)'], target='cuda')
def my_serial_function_for_gpu(x):
    return math.cos(x) ** 2 + math.sin(x) ** 2
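To run it, you can pass a NumPy array directly (a minimal usage sketch, assuming a CUDA-capable GPU is available):

import numpy as np

x = np.linspace(0, 1, 1_000_000, dtype=np.float32)
result = my_serial_function_for_gpu(x)  # executed element-wise on the GPU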
Exercise 4
What are ways you can check if your Python environment has access to a GPU?
Solution 4
You could use Numba via:
from numba import cuda
print(cuda.gpus)
Or in an IPython cell, you could use:
!nvidia-smi
There are other ways too. This is an important step to check.
Exercise 5
If you wanted to do NumPy-style work on GPUs, could you use:
CuPy
JAX
Solution 5
You could use either CuPy or JAX for NumPy-style work on GPUs.
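For example, a minimal CuPy sketch (assuming a CUDA-capable GPU and the cupy package are available), mirroring the NumPy API:

import cupy as cp

x = cp.arange(1_000_000)  # array allocated on the GPU
y = (x ** 2).sum()        # computed on the GPU
y_host = cp.asnumpy(y)    # copy the result back to host memory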