Solutions#

Profiling#

Exercises#

Exercise 1

What is profiling and why is it useful?

Exercise 2

What profiling tool times the execution of a cell in a Jupyter Notebook?
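
As a syntax reminder, a notebook cell magic goes on the first line of a cell (a minimal sketch; the sum is just a placeholder workload):

%%timeit
total = sum(range(1_000_000))  # the cell magic times everything in the cell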

Exercise 3

Below are two approaches for filling up an empty NumPy array.

Which approach is faster and why?

import numpy as np

def fill_array_approach_1(n):
    array = np.empty(1)

    for index in range(n):
        new_point = np.random.rand()
        array = np.append(array, new_point)

    return array

def fill_array_approach_2(n):
    array = np.empty(n)

    for index in range(len(array)):
        new_point = np.random.rand()
        array[index] = new_point

    return array
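
One way to check empirically is to time both functions on the same input (a sketch using the notebook magic; the choice of n is arbitrary):

n = 10_000  # arbitrary size for the comparison

%timeit fill_array_approach_1(n)  # grows the array with np.append on every iteration
%timeit fill_array_approach_2(n)  # fills a pre-allocated array in place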

Exercise 4

Below are two methods that find two numbers from an array of unique integers that add up to a target sum.

If the target can’t be made, then an empty list is returned.

Each element in the array can only be used once.

Which method is faster and why?

def two_sum_brute_force(array, target):
    """
    A brute force approach to find two numbers from an array that add up to a target.

    Steps
    1. Loop through the array twice, adding up pairs of array elements.
    2. Compare each of these sums to the target.
    3. Return the pair that sums to the target, if one exists.
    """
    for index_one in range(len(array)):
        for index_two in range(index_one + 1, len(array)):
            if (
                array[index_one] + array[index_two] == target  # check sum of pair
                and index_one != index_two  # can't use the same element twice
            ):
                return [index_one, index_two]  # return the target pair

    return []  # return an empty list if the target pair isn't found

def two_sum_cache(array, target):
    """
    Use caching to find two numbers from an array that add up to a target.

    Steps
    1. Create a dictionary of cached differences relative to the target sum.
    2. Loop through the array once, adding each index and difference to the cache.
    3. If the required difference of a new array element is already in the cache,
       then you've found a matching pair, which you can return.
    """
    cache_differences = {}
    for index, element in enumerate(array):
        difference = (
            target - element
        )  # calculate the target difference for this element
        if difference in cache_differences:  # if we have the matching pair
            return [index, cache_differences[difference]]  # return the target pair
        cache_differences[element] = index  # if we don't have a match, add to the cache

    return []  # return an empty list if the target pair isn't found

import numpy as np

array = np.random.choice(1_000, 500, replace=False)
target = 250
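
To compare the two methods empirically, they could be timed on the inputs defined above (a sketch; exact timings will vary between machines):

%timeit two_sum_brute_force(array, target)
%timeit two_sum_cache(array, target)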

Data Structures, Algorithms, and Libraries#

Questions#

Question 1

Which of the following uses the least memory, and how can you check?

  • np.float16

  • np.float32

  • np.float64
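
One way to check is to ask NumPy for the size of each dtype (a minimal sketch; calling .nbytes on an array of each dtype would work equally well):

import numpy as np

for dtype in (np.float16, np.float32, np.float64):
    print(np.dtype(dtype).name, np.dtype(dtype).itemsize, "bytes per element")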

Exercises#

Exercise 1

What data structure would be suitable for finding or removing duplicate values?

a. List
b. Dictionary
c. Queue
d. Set

Test out your answer on the following array:

array = np.random.choice(100, 80)

Are there any other ways of doing it?
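
For example, a set (or NumPy's own np.unique) could be tried on the array above (a sketch only; deciding which option you prefer is part of the exercise):

import numpy as np

# Converting to a set keeps only the unique values.
unique_values = set(array.tolist())
print(len(array), len(unique_values))

# np.unique offers another route: count occurrences and keep values seen more than once.
values, counts = np.unique(array, return_counts=True)
print(values[counts > 1])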

Exercise 2

In the exercise from the profiling lesson, we saw an example of two_sum, i.e., finding two numbers from an array of unique integers that add up to a target sum.

What would be a good approach for generalising this sum of two numbers to three, four, or n numbers?
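
As a baseline to improve on, a brute-force generalisation could simply try every combination of n elements (a sketch using itertools.combinations; the exercise asks whether a better approach exists, for example by reusing the caching idea from two_sum_cache):

from itertools import combinations

def n_sum_brute_force(array, target, n):
    """Brute-force n-sum: try every combination of n distinct indices."""
    for indices in combinations(range(len(array)), n):
        if sum(array[i] for i in indices) == target:
            return list(indices)
    return []  # return an empty list if no combination sums to the target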

Vectorisation#

Questions#

Question 1

If something doesn’t vary for a given loop, should it be inside or outside of that loop?

Question 2

Can you run the unvectorised my_function directly on the same inputs (i.e., all of x)?

Exercises#

Exercise 1

What is broadcasting?

Exercise 2

What is vectorisation?

Exercise 3

How would you replace the compute_reciprocals function below with a vectorised version?

def compute_reciprocals(array):
    """
    Divides 1 by an array of values.
    """
    output = np.empty(len(array))
    for i in range(len(array)):
        output[i] = 1.0 / array[i]

    return output

big_array = np.random.randint(1, 100, size=1_000_000)
%timeit compute_reciprocals(big_array)
1.43 s ± 8.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
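
For comparison, NumPy's ufuncs already apply arithmetic element-wise, so one vectorised candidate is simply the expression below (a sketch; it assumes the same big_array as above, and timings will differ by machine):

def compute_reciprocals_vectorised(array):
    """
    Divides 1 by an array of values, letting NumPy's ufunc work element-wise.
    """
    return 1.0 / array

%timeit compute_reciprocals_vectorised(big_array)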

Exercise 4

Create your own vectorised ufunc that calculates the cube root of an array over all elements.
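
One possible starting point uses np.vectorize (a sketch; note that np.vectorize is a convenience wrapper around a Python loop rather than a compiled ufunc):

import numpy as np

@np.vectorize
def cube_root(x):
    """Element-wise cube root."""
    return x ** (1 / 3)

cube_root(np.array([1.0, 8.0, 27.0]))  # array([1., 2., 3.])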

Compilers#

Questions#

Question 1

For the function below (fast_add):

from numba import njit

@njit
def fast_add(x, y):
    return x + y

What will happen when it is called with:
fast_add(1, (2,))

Exercises#

Exercise 1

What is the default Python distribution?

  • Cython

  • PyPy

  • CPython

Exercise 2

Which Numba compilation mode has higher performance?

  • object

  • nopython

Exercise 3

How do I compile a function in Numba using nopython mode?
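
As a syntax reminder, nopython mode can be requested in either of the forms below (a minimal sketch):

from numba import jit, njit

@jit(nopython=True)
def add_one(x):
    return x + 1

@njit  # njit is shorthand for jit(nopython=True)
def add_two(x):
    return x + 2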

Exercise 4

What is the keyword argument that enables Numba compiled functions to run over multiple CPUs?
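
As a reminder of how it is used (a sketch; whether it actually speeds things up depends on the workload and the number of cores available):

from numba import njit, prange

@njit(parallel=True)  # the keyword argument in question
def parallel_sum(array):
    total = 0.0
    for i in prange(len(array)):  # prange marks the loop for parallel execution
        total += array[i]
    return total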

Exercise 5

Create your own Numba vectorised function that calculates the cube root of an array over all elements.
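
A possible shape for the answer uses numba.vectorize with an explicit signature (a sketch; the single float64 signature is just one choice):

import numpy as np
from numba import vectorize

@vectorize(["float64(float64)"])
def cube_root_numba(x):
    return x ** (1 / 3)

cube_root_numba(np.array([1.0, 8.0, 27.0]))  # array([1., 2., 3.])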

Parallelisation#

Questions#

Question 1

What did our Dask Dashboard show?

Question 2

How can we check that the job used the CPU cores efficiently?

Question 3

How well did this job use the resources (use the output from qacct below, which is from one of the workers)?

$ qacct -j 3526684
==============================================================
qname        feps-cpu.q          
hostname     d13s0b1.arc4.leeds.ac.uk
group        EAR                 
owner        earlacoa            
project      feps-cpu            
department   defaultdepartment   
jobname      dask-worker         
jobnumber    3526684             
taskid       undefined
account      sge                 
priority     0                   
qsub_time    Fri Feb 25 16:25:14 2022
start_time   Fri Feb 25 16:25:35 2022
end_time     Fri Feb 25 16:38:21 2022
granted_pe   smp                 
slots        1                   
failed       100 : assumedly after job
exit_status  137                  (Killed)
ru_wallclock 766s
ru_utime     0.040s
ru_stime     0.045s
ru_maxrss    4.879KB
ru_ixrss     0.000B
ru_ismrss    0.000B
ru_idrss     0.000B
ru_isrss     0.000B
ru_minflt    15248               
ru_majflt    0                   
ru_nswap     0                   
ru_inblock   0                   
ru_oublock   16                  
ru_msgsnd    0                   
ru_msgrcv    0                   
ru_nsignals  0                   
ru_nvcsw     208                 
ru_nivcsw    38                  
cpu          728.840s
mem          6.774TBs
io           43.651MB
iow          0.000s
maxvmem      18.891GB
arid         undefined
ar_sub_time  undefined
category     -U admiralty,feps-cpu,feps-gpu -l disk=48G,env=centos7,h_rt=3600,h_vmem=48G,node_type=40core-192G,project=feps-cpu -pe smp 1

Exercises#

Exercise 1

Why does parallelisation speed up code?

Exercise 2

What are there multiple of to split the work over?

Exercise 3

If you need to share memory, would you use MPI or OpenMP?

Exercise 4

Which Dask library can be tailored to a variety of resource managers (e.g., SGE, SLURM)?

Exercise 5

Which of the three examples below is the most efficient, and why?

Note that the chunks keyword argument sets the size of each chunk.

Example 1: Many, small chunks.

import dask.array as da

x = da.random.random(10_000_000, chunks=(1_000,))
y = x.sum().compute()

Example 2: Fewer, large chunks.

x = da.random.random(10_000_000, chunks=(100_000,))
y = x.sum().compute()

Example 3: Use NumPy.

x = np.random.random(10_000_000)
y = x.sum()
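
One way to compare the three is simply to time each version (a sketch; absolute numbers depend on the machine and the Dask scheduler in use):

%timeit da.random.random(10_000_000, chunks=(1_000,)).sum().compute()
%timeit da.random.random(10_000_000, chunks=(100_000,)).sum().compute()
%timeit np.random.random(10_000_000).sum()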

Exercise 6

How well did the job below use the HPC resources?

$ qacct -j 3524073
==============================================================
qname        feps-cpu.q          
hostname     d9s9b4.arc4.leeds.ac.uk
group        EAR                 
owner        earlacoa            
project      feps-cpu            
department   defaultdepartment   
jobname      example_bodo_mpi_sge.bash
jobnumber    3524073             
taskid       undefined
account      sge                 
priority     0                   
qsub_time    Thu Feb 24 12:48:24 2022
start_time   Thu Feb 24 12:48:34 2022
end_time     Thu Feb 24 12:48:55 2022
granted_pe   smp                 
slots        8                   
failed       0    
exit_status  0                   
ru_wallclock 21s
ru_utime     139.030s
ru_stime     7.635s
ru_maxrss    1.687MB
ru_ixrss     0.000B
ru_ismrss    0.000B
ru_idrss     0.000B
ru_isrss     0.000B
ru_minflt    764941              
ru_majflt    2                   
ru_nswap     0                   
ru_inblock   0                   
ru_oublock   80                  
ru_msgsnd    0                   
ru_msgrcv    0                   
ru_nsignals  0                   
ru_nvcsw     38884               
ru_nivcsw    665                 
cpu          146.665s
mem          112.727GBs
io           166.518MB
iow          0.000s
maxvmem      12.973GB
arid         undefined
ar_sub_time  undefined
category     -U admiralty,feps-cpu,feps-gpu -l env=centos7,h_rt=600,h_vmem=24G,node_type=40core-192G,project=feps-cpu -pe smp 8

Exercise 7

How well did this job use the HPC resources?

If it wasn’t ideal, what went wrong and what might fix it?

$ qacct -j 3524046
==============================================================
qname        feps-cpu.q          
hostname     d9s9b4.arc4.leeds.ac.uk
group        EAR                 
owner        earlacoa            
project      feps-cpu            
department   defaultdepartment   
jobname      example_bodo_mpi_sge.bash
jobnumber    3524046             
taskid       undefined
account      sge                 
priority     0                   
qsub_time    Thu Feb 24 12:34:54 2022
start_time   Thu Feb 24 12:35:08 2022
end_time     Thu Feb 24 12:37:14 2022
granted_pe   smp                 
slots        8                   
failed       0    
exit_status  0                   
ru_wallclock 126s
ru_utime     125.250s
ru_stime     6.542s
ru_maxrss    1.689MB
ru_ixrss     0.000B
ru_ismrss    0.000B
ru_idrss     0.000B
ru_isrss     0.000B
ru_minflt    758663              
ru_majflt    2                   
ru_nswap     0                   
ru_inblock   0                   
ru_oublock   80                  
ru_msgsnd    0                   
ru_msgrcv    0                   
ru_nsignals  0                   
ru_nvcsw     35039               
ru_nivcsw    30366               
cpu          131.792s
mem          102.207GBs
io           166.212MB
iow          0.000s
maxvmem      13.432GB
arid         undefined
ar_sub_time  undefined
category     -U admiralty,feps-cpu,feps-gpu -l env=centos7,h_rt=600,h_vmem=24G,node_type=40core-192G,project=feps-cpu -pe smp 8

GPUs#

Exercises#

Exercise 1

In general, what kind of tasks are GPUs faster than CPUs for, and why?

Exercise 2

Which Numba decorators can you use to offload a function to GPUs?

Exercise 3

How would you vectorise the following function for GPUs?

import math

def my_serial_function_for_gpu(x):
    return math.cos(x) ** 2 + math.sin(x) ** 2
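
One possible route is numba.vectorize with a CUDA target (a sketch, assuming a CUDA-capable GPU and toolkit are available; the name my_parallel_function_for_gpu is just illustrative):

import math

import numpy as np
from numba import vectorize

@vectorize(["float64(float64)"], target="cuda")
def my_parallel_function_for_gpu(x):
    return math.cos(x) ** 2 + math.sin(x) ** 2

my_parallel_function_for_gpu(np.linspace(0.0, 2.0 * np.pi, 1_000))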

Exercise 4

What are ways you can check if your Python environment has access to a GPU?
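
A couple of checks that can be tried from Python (a sketch; running nvidia-smi in a terminal is another common option):

from numba import cuda

print(cuda.is_available())  # True if Numba can detect a CUDA-capable GPU
cuda.detect()               # prints details of any devices found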

Exercise 5

If you wanted to do NumPy-style work on GPUs, could you use:

  • CuPy

  • JAX