Summary
In this workshop, we covered:
1. Understand the fundamentals of machine learning and deep learning.
Machine learning and deep learning cover a range of prediction methods that learn associations from training data.
The objective is for the models to generalise to new data.
They mainly use tensors (multi-dimensional arrays) as inputs.
Problems are mainly either supervised (if you provide labels) or unsupervised (if you don’t provide labels).
Problems are either classification (if you’re trying to predict a discrete category) or regression (if you’re trying to predict a continuous number).
Data is split into training, validation, and test sets (see the sketch after this list).
The models only learn from the training data.
The test set is used only once, for the final evaluation.
Hyperparameters are set before model training.
Parameters (i.e., the weights and biases) are learnt during model training.
The aim is to minimise the loss function.
The model underfits when it has high bias.
The model overfits when it has high variance.
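As a concrete (if minimal) illustration of these points, the sketch below uses scikit-learn with synthetic placeholder data to split a dataset into training, validation, and test subsets, set a hyperparameter before training, and learn the parameters from the training data only; the dataset, model choice, and split sizes are arbitrary assumptions rather than workshop code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic placeholder data: 1,000 samples, 20 features, 2 classes.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Hold out a test set first; it is used only once, at the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Carve a validation set out of the remaining training data.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)

# C is a hyperparameter: set before training.
model = LogisticRegression(C=1.0, max_iter=1000)

# The parameters (weights and biases) are learnt from the training data only.
model.fit(X_train, y_train)

# Use the validation set to compare models and hyperparameters...
print("Validation accuracy:", model.score(X_val, y_val))

# ...and touch the test set only once, for the final estimate of generalisation.
print("Test accuracy:", model.score(X_test, y_test))
```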
2. Know how to use key tools, including:
scikit-learn
scikit-learn is great for classic machine learning problems.
TensorFlow and Keras
TensorFlow is great for deep learning problems.
Keras (high-level API for TensorFlow) has many high-level objects to help you create deep learning models.
PyTorch and PyTorch Lightning
PyTorch is great for deep learning problems.
PyTorch Lightning (high-level API for PyTorch) has many high-level objects to help you create deep learning models.
You can use low-level APIs for any custom objects.
Explore your data before using it.
Check your model before fitting it to the training data.
Evaluate your model and analyse the errors it makes (see the sketch below).
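For instance, a minimal Keras workflow might look like the sketch below: build a small model, check it with model.summary() before fitting, then fit, evaluate, and go on to analyse the errors. The random placeholder arrays stand in for real (explored) data, and the layer sizes, data shapes, and epoch count are assumptions for illustration only.

```python
import numpy as np
import tensorflow as tf

# Random placeholder arrays; in practice, explore your real data first.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(800, 20)).astype("float32")
y_train = rng.integers(0, 2, size=(800,))
X_val = rng.normal(size=(100, 20)).astype("float32")
y_val = rng.integers(0, 2, size=(100,))
X_test = rng.normal(size=(100, 20)).astype("float32")
y_test = rng.integers(0, 2, size=(100,))

# A small model built from Keras's high-level objects.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Check the model (layers, shapes, parameter counts) before fitting it.
model.summary()

# Fit on the training data, monitoring the validation data.
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=5, verbose=2)

# Evaluate, then analyse the errors (e.g., inspect misclassified examples).
loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print("Test accuracy:", accuracy)
```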
3. Be aware of good practices for data, such as pipelines and modules.
Always split the data into train and test subsets first, before any pre-processing.
Never fit to the test data.
Use a data pipeline.
Use a random seed and any available deterministic functionalities for reproducibility.
Try to reproduce your own work, to check that it really is reproducible.
Consider optimising the data pipeline with the following (see the sketch after this list):
Shuffling.
Batching.
Caching.
Prefetching.
Parallel data extraction.
Data augmentation.
Parallel data transformation.
Vectorised mapping.
Mixed precision.
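With tf.data, most of these optimisations map onto a few chained calls, as in the rough sketch below. The random placeholder tensors, buffer and batch sizes, and the flip augmentation are arbitrary assumptions, and parallel data extraction (e.g., interleaving reads over many files) is omitted because the data here is already in memory.

```python
import tensorflow as tf

# Seed Python, NumPy, and TensorFlow in one call for reproducibility.
tf.keras.utils.set_random_seed(42)

# Optional: mixed precision for the model side of the workflow.
# tf.keras.mixed_precision.set_global_policy("mixed_float16")

AUTOTUNE = tf.data.AUTOTUNE

# Random placeholder "images" and labels standing in for real data.
images = tf.random.uniform((256, 32, 32, 3))
labels = tf.random.uniform((256,), maxval=10, dtype=tf.int32)

def augment(image_batch, label_batch):
    # Data augmentation applied to a whole batch at once (vectorised mapping).
    image_batch = tf.image.random_flip_left_right(image_batch)
    return image_batch, label_batch

dataset = (
    tf.data.Dataset.from_tensor_slices((images, labels))
    .cache()                                    # caching: reuse data after the first pass
    .shuffle(buffer_size=256, seed=42)          # shuffling, seeded for reproducibility
    .batch(64)                                  # batching before map, so the map is vectorised
    .map(augment, num_parallel_calls=AUTOTUNE)  # parallel data transformation
    .prefetch(AUTOTUNE)                         # prefetching: overlap input prep with training
)
```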
4. Be aware of good practices for models, such as hyperparameter tuning, transfer learning, and callbacks.
Tune hyperparameters for the best model fit.
Use transfer learning to save computation on similar problems.
Consider using callbacks to help with model training (see the sketch after this list), such as:
Checkpoints.
Fault tolerance.
Logging.
Profiling.
Early stopping.
Learning rate decay.
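In Keras, each of these corresponds to a built-in callback; the sketch below simply assembles one plausible list to pass to model.fit via its callbacks argument, with placeholder file paths, patience values, and monitored metrics.

```python
import tensorflow as tf

callbacks = [
    # Checkpoints: keep the best model seen so far (placeholder file path).
    tf.keras.callbacks.ModelCheckpoint("best_model.keras", save_best_only=True),
    # Fault tolerance: resume training after an interruption.
    tf.keras.callbacks.BackupAndRestore("backup"),
    # Logging: per-epoch metrics to a CSV file.
    tf.keras.callbacks.CSVLogger("training_log.csv"),
    # Logging and profiling: TensorBoard, profiling batches 10-20.
    tf.keras.callbacks.TensorBoard(log_dir="logs", profile_batch=(10, 20)),
    # Early stopping: stop when the validation loss stops improving.
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True),
    # Learning rate decay: reduce the learning rate on a plateau.
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=2),
]

# Hypothetical usage, assuming `model`, `train_data`, and `val_data` exist:
# model.fit(train_data, validation_data=val_data, epochs=100, callbacks=callbacks)
```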
5. Be able to undertake distributed training.
Ensure that you really need distributed training across multiple devices.
Check that everything works on a single device first.
Ensure that the data pipeline can efficiently use multiple devices.
Use data parallelism to split the data over multiple devices (see the sketch after this list).
Take care when setting the global batch size.
Check the efficiency of your jobs to ensure they utilise the requested resources (for both single-device and multi-device runs).
When moving from Jupyter to HPC:
Clean non-essential code.
Refactor Jupyter Notebook code into functions.
Create a Python script.
Create a submission script.
Create tests.
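As a rough single-node sketch of data parallelism with Keras, tf.distribute.MirroredStrategy creates one model replica per visible device, and the global batch size is scaled by the number of replicas. The model, data, and batch sizes below are placeholders, not workshop code.

```python
import tensorflow as tf

# Data parallelism on a single node: one model replica per visible device.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

# Scale the global batch size with the number of replicas, so each
# device still sees the same per-replica batch size.
per_replica_batch_size = 64
global_batch_size = per_replica_batch_size * strategy.num_replicas_in_sync

# The model and optimiser must be created inside the strategy scope.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(20,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Placeholder in-memory data; a real job would use an efficient input pipeline.
x = tf.random.uniform((1024, 20))
y = tf.random.uniform((1024, 1))
dataset = (
    tf.data.Dataset.from_tensor_slices((x, y))
    .batch(global_batch_size)
    .prefetch(tf.data.AUTOTUNE)
)

model.fit(dataset, epochs=2)
```

Across multiple nodes, the analogous approach would be tf.distribute.MultiWorkerMirroredStrategy (or, in PyTorch, DistributedDataParallel), with the same care taken over the global batch size.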
Next steps
For the topics that you're interested in:
Try things out yourself (e.g., play around with the examples).
Check out the Online Courses.