# Fundamentals

## Basic ideas

### Overview

Machine learning is a subset of Artificial Intelligence.

It is a range of methods that learn associations from (training) data.

It then uses these associations to make predictions on new data (also known as [inference](https://developers.google.com/machine-learning/glossary/#inference)).

The ability to do this well on data the model has not seen before is [generalising](https://developers.google.com/machine-learning/glossary/#generalization).

These can be useful for a range of problems including:

- Prediction problems (e.g., pattern recognition).
- Problems that are impossible or difficult to program by hand (e.g., image recognition).
- Faster approximations to problems that you can program (e.g., spam classification).

```{image} images/ai_ml_dl.jpg
:height: 300px
:name: ai_ml_dl.jpg
```

*[Image source](https://vas3k.com/blog/machine_learning/)*

### Methods

Within machine learning, there are many different methods.

We'll focus on _classic machine learning_ and _deep learning_ in this course.

```{image} images/ml_types.jpg
:height: 300px
:name: ml_types.jpg
```

*[Image source](https://vas3k.com/blog/machine_learning/)*

### Classic Machine Learning

There are a wide variety of types. Some common ones are:

- [Linear Regression](https://youtu.be/kHwlB_j7Hkc)
    - Predict a continuous number using a linear model i.e., fit a straight line.
- [Logistic Regression](https://youtu.be/hjrYrynGWGA)
    - Predict a class of either 0 or 1 i.e., a binary classification problem.
- [Clustering](https://youtu.be/hDmNF9JG3lo)
    - Predictions are based on their similarity to their neighbours.
- [Support vector machines](https://youtu.be/hCOIMkcsm_g)
    - Predictions are based on their position relative to a decision boundary.
    - The decision boundary is found by focusing on the hardest to classify examples (the support vectors, which lie closest to the boundary) and maximising the margin between them.
- And many, many more.
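
As a rough illustration of the first item, a straight line can be fitted by ordinary least squares. This is a minimal sketch in plain Python. The data and the function name `fit_line` are made up for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x).
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data that roughly follows y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

slope, intercept = fit_line(xs, ys)

# Use the fitted line to predict a continuous number for a new input.
prediction = slope * 5.0 + intercept
```

In practice you would normally use a library implementation rather than writing this by hand.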

### Deep Learning (Neural Networks)

[Neural networks](https://youtu.be/n1l-9lIMW7E) are models made of layers of neurons.

[Neurons](https://developers.google.com/machine-learning/glossary/#neuron) (also known as units or nodes) take in inputs and return an output by applying an activation function.

[Activation functions](https://youtu.be/Xvg00QnyaIY) (also known as non-linearities) take a weighted sum of inputs from a previous layer, apply a non-linear function, and pass the output onto the next layer.

Common activation functions are:

- [ReLU (Rectified Linear Unit)](https://developers.google.com/machine-learning/glossary/#rectified-linear-unit-relu)
    - If input is negative, the output equals 0.
    - If input is positive, the output equals the input.
- [Sigmoid](https://developers.google.com/machine-learning/glossary/#sigmoid-function)
    - Converts [log-odds](logits_and_log_odds) (we'll see these later) into probabilities between 0 and 1.
    - Used for binary classification.
- [Softmax](https://developers.google.com/machine-learning/glossary/#softmax)
    - A generalisation of sigmoid to multi-class classification.
    - Converts a vector of logits into probabilities that sum to 1.
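
The three activation functions above can be sketched in a few lines of plain Python. This is only an illustration (deep learning libraries provide their own optimised versions), and the function names are chosen here for clarity.

```python
import math


def relu(x):
    """Return 0 for negative inputs, and the input itself otherwise."""
    return max(0.0, x)


def sigmoid(x):
    """Squash any real number (log-odds) into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([1.0, 2.0, 3.0])
```

With only two classes, where one logit is fixed at 0, softmax gives the same probability as sigmoid.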

The neurons are connected in [many layers](https://developers.google.com/machine-learning/glossary/#hidden-layer).

Each layer is an input-output transformation.

All the layers together are the model.

The number of hidden layers between the [input layer](https://developers.google.com/machine-learning/glossary/#input-layer) and [output layer](https://developers.google.com/machine-learning/glossary/#output-layer) is the depth of the model (hence, _deep_ learning).

A common type of layer is one where every neuron is [fully connected](https://developers.google.com/machine-learning/glossary/#fully-connected-layer) to every neuron in the previous layer (also known as a dense layer).

The types of layers, how many there are, and how they are connected is the architecture of the neural network.
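
To make the idea of layers as input-output transformations concrete, here is a minimal sketch of a forward pass through one hidden dense layer and one output layer. The weights and biases are hand-picked for illustration (a real model would learn them), and the names `dense` and `model` are hypothetical.

```python
def relu(x):
    return max(0.0, x)


def identity(x):
    return x


def dense(inputs, weights, biases, activation):
    """One fully connected layer: every neuron sees every input."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        weighted_sum = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(weighted_sum))
    return outputs


# Hypothetical parameters: 2 inputs -> 3 hidden neurons -> 1 output.
hidden_weights = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]
hidden_biases = [0.0, 0.0, -1.0]
output_weights = [[1.0, 1.0, 1.0]]
output_biases = [0.5]


def model(inputs):
    """All the layers together are the model."""
    hidden = dense(inputs, hidden_weights, hidden_biases, relu)
    return dense(hidden, output_weights, output_biases, identity)


output = model([2.0, 1.0])
```

Changing the number of layers, the number of neurons per layer, or the activation functions changes the architecture.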

There are a wide variety of types of neural networks. Some common ones are:

- [Convolutional Neural Networks (CNN)](https://youtu.be/3PyJA9AfwSk)
    - A neural network that uses [convolutional layers](https://youtu.be/jPOAS7uCODQ).
    - These layers find features e.g., for image recognition:
        - Low-level such as vertical lines, horizontal lines, etc.
        - Medium-level such as eyes, ears, etc.
        - High-level such as faces, glasses, etc.
    - They reuse the same small set of weights (a filter) across the whole input, which reduces the number of parameters learned.
- [Recurrent Neural Networks (RNN)](https://developers.google.com/machine-learning/glossary/#recurrent-neural-network)
    - For sequential data e.g., time-series, natural language.
    - Loops over timesteps while maintaining information from previous timesteps.
- And many, many more.
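
As a small illustration of how a convolutional layer finds features, the sketch below slides one tiny filter over a 1D signal. The filter values are hand-picked here to detect edges, whereas a CNN would learn them. As in most deep learning libraries, the filter is not flipped (strictly, this is a cross-correlation).

```python
def conv1d(signal, kernel):
    """Slide one small kernel over the signal (no padding, stride of 1)."""
    width = len(kernel)
    return [
        sum(k * s for k, s in zip(kernel, signal[i:i + width]))
        for i in range(len(signal) - width + 1)
    ]


signal = [0, 0, 0, 1, 1, 1, 0, 0]
edge_kernel = [-1, 1]  # responds where the signal steps up or down

feature_map = conv1d(signal, edge_kernel)
# feature_map is [0, 0, 1, 0, 0, -1, 0]:
# +1 marks the rising edge and -1 marks the falling edge.
```

The same two weights are reused at every position, so the number of parameters does not grow with the length of the input.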

[Deep learning has been progressing](https://youtu.be/xflCLdJh0n0) primarily due to scale (bigger datasets and bigger neural networks), investment, and attention (additional research).

(tensors)=
### [Data](https://developers.google.com/machine-learning/glossary/#data-set-or-dataset)

The data is a sample of the problem you're studying.

Data has inputs (also known as features) and outputs (also known as targets).

- The inputs are what you provide to the model.
- The outputs are what you're trying to predict.

The data is normally in the form of tensors.

[Tensors](https://developers.google.com/machine-learning/glossary/#tensor) are multi-dimensional arrays:

- Scalars are rank-0 tensors.
- Vectors are rank-1 tensors.
- Matrices are rank-2 tensors.
- 3+ dimensional arrays are rank-3+ tensors.

![tensors.png](images/tensors.png)  

*[Image source](https://medium.com/mlait/tensors-representation-of-data-in-neural-networks-bbe8a711b93b)*
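
The ranks above can be illustrated with nested Python lists standing in for tensors. This is only a sketch: in practice you would use an array library, and the helper names `rank` and `shape` are chosen here for illustration. The helpers assume regular, non-empty lists.

```python
def rank(tensor):
    """Count the number of dimensions (nesting depth) of a nested list."""
    depth = 0
    while isinstance(tensor, list):
        depth += 1
        tensor = tensor[0]
    return depth


def shape(tensor):
    """Return the size of each dimension of a nested list."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0]
    return tuple(dims)


scalar = 5.0                                     # rank 0
vector = [1.0, 2.0, 3.0]                         # rank 1
matrix = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]    # rank 2
cube = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]  # rank 3
```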

(logits_and_log_odds)=
#### Logits and Log-odds

[Logits](https://developers.google.com/machine-learning/glossary#logits) are a vector of raw (non-normalised) predictions from a classification model. For multi-class classification, these are converted to (normalised) probabilities using a softmax function.

[Log-odds](https://developers.google.com/machine-learning/glossary#log-odds) are the logarithm of the odds of an event. The log-odds function (also known as the logit function) is the inverse of the sigmoid function.
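
The relationship between probabilities, odds, log-odds, and the sigmoid function can be checked numerically. This is a minimal sketch in plain Python, and the probability of 0.8 is an arbitrary example.

```python
import math


def sigmoid(log_odds):
    """Convert log-odds into a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def logit(probability):
    """Convert a probability into log-odds (the inverse of sigmoid)."""
    return math.log(probability / (1.0 - probability))


probability = 0.8
odds = probability / (1.0 - probability)  # roughly 4 to 1
log_odds = logit(probability)             # the logarithm of the odds

# Passing the log-odds through sigmoid recovers the original probability.
recovered = sigmoid(log_odds)
```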

### Supervised and Unsupervised

- [Supervised learning](https://developers.google.com/machine-learning/glossary/#supervised-machine-learning) is when you provide [labelled](https://developers.google.com/machine-learning/glossary/#label) outputs to learn from.
- [Unsupervised learning](https://developers.google.com/machine-learning/glossary/#unsupervised-machine-learning) is when you don't provide any labels.

Below is an example of supervised learning (classify different coloured markers) and unsupervised learning (find clusters within data).

![supervised_vs_unsupervised.png](images/supervised_vs_unsupervised.png)  

*[Image source](https://analystprep.com/study-notes/cfa-level-2/quantitative-method/supervised-machine-learning-unsupervised-machine-learning-deep-learning/)*

We'll focus on supervised learning in this course.

### Classification and Regression

- [Classification](https://developers.google.com/machine-learning/glossary/#classification-model) problems are those that try to predict a [discrete category](https://developers.google.com/machine-learning/glossary/#categorical-data).
    - e.g., binary: cat or dog, multi-class: dog breeds (poodle, greyhound, etc.).
- [Regression](https://developers.google.com/machine-learning/glossary/#regression-model) problems are those that try to predict a [continuous number](https://developers.google.com/machine-learning/glossary/#numerical-data).
    - e.g., beans in a jar, house prices.

Below is an example of classification (separate blue circles from purple crosses) and regression (predict a numerical value from the data).

![classification_vs_regression.png](images/classification_vs_regression.png)  

*[Image source](https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d)*

### [Training, validation, and test splits](https://youtu.be/1waHlpKiNyY)

The data is normally split into training, validation, and test sets.

- The [training set](https://developers.google.com/machine-learning/glossary/#training_set) is for training the model.
- The [validation set](https://developers.google.com/machine-learning/glossary/#validation_set) (optional) is for iteratively optimising the model during training.
- The [test set](https://developers.google.com/machine-learning/glossary/#test-set) is _only_ for testing the model at the end.
    - This should remain untouched (i.e., [held out](https://developers.google.com/machine-learning/glossary/#holdout-data) of training).
    - _Single-use_ (to ensure it stays representative of future data).
    - Think of this like the exam at the end of a course. You don't want the students to just parrot back the teaching material. You'd like them to demonstrate understanding.

![train-val-test-split.png](images/train-val-test-split.png)  

*[Image source](https://stackoverflow.com/a/56100053/6250873)*

The [size of the split](https://youtu.be/_Fe5kKmFieg) depends on the size of the dataset and the signal you're trying to predict (i.e., the smaller the signal, the larger the test set needs to be).

For example:  

| Data set size | Training split (%) | Validation split (%) | Test split (%) |
| --- | --- | --- | --- |
| Small | 60 | 20 | 20 |
| Medium | 80 | 10 | 10 |
| Large | 90 | 5 | 5 |
| Very large | 98 | 1 | 1 |

The split may benefit from being stratified (preserving original class frequencies) to ensure that each set has a sample of the classes.
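
A stratified 60/20/20 split can be sketched as follows. This is an illustration in plain Python under simple assumptions (the labels, the fractions, and the function name `stratified_split` are made up here). Libraries such as scikit-learn provide more complete tools for this.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return (train, validation, test) index lists, split class by class."""
    rng = random.Random(seed)

    # Group the example indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, validation, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_validation = round(fractions[1] * len(indices))
        train += indices[:n_train]
        validation += indices[n_train:n_train + n_validation]
        test += indices[n_train + n_validation:]  # whatever is left over
    return train, validation, test


# Hypothetical imbalanced dataset: 80 cats and 20 dogs.
labels = ["cat"] * 80 + ["dog"] * 20
train, validation, test = stratified_split(labels)
```

Each set keeps the original 80/20 ratio of cats to dogs, so even the rarer class appears in every set.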

### [Cross-validation](https://developers.google.com/machine-learning/glossary/#cross-validation)

Cross-validation estimates how well a model generalises to new data _before_ you check it on the _single-use_ test data.

It also estimates the _variability_ in the _validation_ score across the different splits.

This repeats the _training/validation_ split multiple times (_the test data remains untouched_).

There are various methods for cross-validation.

These are mainly variations of K-fold cross-validation, where you split the data into K folds (e.g., 5) and each fold takes a turn as the validation set while the rest are used for training.

Variations then consider stratifying, shuffling, sampling, and replacing.

Below is an example for 5-fold cross-validation (i.e., splitting 5 times).

![cross_validation.png](images/cross_validation_diagram.png)  

*[Image source](https://inria.github.io/scikit-learn-mooc/python_scripts/02_numerical_pipeline_cross_validation.html)*
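
The splitting in the diagram can be sketched as a function that returns the training and validation indices for each fold. This is a simplified illustration without shuffling or stratifying, and the name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold also takes any leftover samples.
        stop = n_samples if fold == k - 1 else start + fold_size
        validation = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, validation


folds = list(k_fold_indices(10, k=5))
first_train, first_validation = folds[0]
# first_validation is [0, 1] and first_train is [2, 3, ..., 9].
```

You would train and score a model once per fold, then look at the mean and spread of the K validation scores.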

(hyperparameters)=
### [Hyperparameters](https://youtu.be/VTE2KlfoO3Q)

These are what _you set before_ model training.

They control the learning process.

These include, for example:

- The number of layers.
- The number of units per layer.
- The activation function(s).
- Whether to use dropout.
- The optimiser learning rate.
- The batch size.

They are often found through iteratively trying out different options.

This iterative tuning method can be:

- Systematically over a grid (i.e., grid search).
    - Thorough, but slow. Hence, not suitable for problems with many hyperparameters.
- Randomly over a grid (i.e., random search).
    - Faster and more suitable for problems with many hyperparameters.
- Other options including:
    - Using Bayes' Theorem (i.e., Bayesian optimisation) to choose a new set of hyperparameters to test based on the performance of the previous sets.
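
The difference between the first two approaches can be sketched as below. The search space and the `validation_score` function are hypothetical stand-ins. In a real project, `validation_score` would train a model with those hyperparameters and score it on the validation set.

```python
import itertools
import random

# Hypothetical search space.
search_space = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [16, 32, 64],
    "n_layers": [1, 2, 3],
}


def validation_score(params):
    """Made-up score that is best at learning_rate=0.01 and n_layers=2."""
    return -abs(params["learning_rate"] - 0.01) - abs(params["n_layers"] - 2) / 10


names = list(search_space)

# Grid search: try every combination (3 * 3 * 3 = 27 trials).
grid = [
    dict(zip(names, values))
    for values in itertools.product(*search_space.values())
]
best_grid = max(grid, key=validation_score)

# Random search: try a fixed budget of randomly chosen combinations.
rng = random.Random(0)
trials = [
    {name: rng.choice(options) for name, options in search_space.items()}
    for _ in range(10)
]
best_random = max(trials, key=validation_score)
```

Grid search is guaranteed to find the best combination on the grid, while random search trades that guarantee for a smaller, fixed number of trials.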

### [Parameters](https://developers.google.com/machine-learning/glossary/#parameter)

These are what the model learns _during training_ (i.e., the weights / biases / coefficients of the model).

The weights first need to be [initialised](https://youtu.be/s2coXdufOzE) e.g., as zeros, random numbers, etc.

The parameters are then optimised in training.

### [Model](https://developers.google.com/machine-learning/glossary/#model)

A model is the machine learning system.

This includes the architecture, parameters, and hyperparameters.

### [Training](https://developers.google.com/machine-learning/glossary/#training)

Training is the process of finding the best model.

```{image} images/unteachable.jpg
:height: 300px
:name: unteachable.jpg
```

*[Image source](https://vas3k.com/blog/machine_learning/)*

#### [Loss Function](https://developers.google.com/machine-learning/glossary/#loss)

The loss function measures how wrong the model's predictions are during training.

This is measured as the error on a single training example.

You always want to minimise the loss function.

A similar concept is the [Cost Function](https://youtu.be/SHEPb1JHw5o), which is the average of the loss functions over the whole training set.

The loss function is a proxy for the metric (covered below) that has a smooth gradient. Note that in some cases it is actually the same as the metric (e.g., mean squared error).

Common loss functions are:

- [Mean squared error](https://developers.google.com/machine-learning/glossary/#mean-squared-error-mse)
    - The average squared loss per example.
- [Crossentropy](https://machinelearningmastery.com/cross-entropy-for-machine-learning/)
    - A measure of the difference between two probability distributions.
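    - Binary Crossentropy for binary classification.
    - Categorical Crossentropy for multi-class classification with one-hot encoded labels.
    - Sparse Categorical Crossentropy for multi-class classification with integer labels.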
    - Similar to the negative log-likelihood loss. In some libraries (e.g., PyTorch), the crossentropy loss takes in raw logits, while the negative log-likelihood loss takes in _log_ probabilities.
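
These losses can be written out in plain Python to show what they compute. This is a minimal sketch that assumes the model already outputs probabilities (not logits) for the crossentropy losses. Library versions handle batches and numerical stability for you.

```python
import math


def mean_squared_error(targets, predictions):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


def binary_crossentropy(target, probability):
    """Loss for one example: target is 0 or 1, probability is P(class 1)."""
    return -(target * math.log(probability)
             + (1 - target) * math.log(1 - probability))


def categorical_crossentropy(true_class, probabilities):
    """Loss for one example: true_class is the integer index of the label."""
    return -math.log(probabilities[true_class])


mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])

# A confident correct prediction has a low loss, a confident wrong one a high loss.
good = binary_crossentropy(1, 0.9)
bad = binary_crossentropy(1, 0.1)

multi = categorical_crossentropy(0, [0.5, 0.25, 0.25])
```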

#### [Gradient descent](https://youtu.be/uJryes5Vk1o)

A group of methods to minimise the loss function.

It is how the model gets updated based on the data it sees.

One step of gradient descent includes:

- Forward propagate the inputs through the model to calculate the outputs of each neuron.
- Calculate the loss (error) and gradient of the loss for these parameters.
- [Back propagate](https://developers.google.com/machine-learning/glossary/#backpropagation) these gradients back through the model to update the parameters and reduce the loss.

The loss is reduced (optimised) in this process by moving the parameters a small step in the direction opposite to the gradient (i.e., descending the gradient).

It aims to find the best parameters (i.e., the weights and biases that minimise the loss).

The best parameters represent the single _global minimum_ of the loss (i.e., think of the lowest point in a bowl).

If there are many peaks and valleys in the loss function, then there may be many _local minima_ (i.e., where the loss function can't reduce any more locally).

Common methods of gradient descent are:

- [Stochastic Gradient Descent (SGD)](https://developers.google.com/machine-learning/glossary/#stochastic-gradient-descent-sgd).
    - Uses 1 training example per iteration.
- [Batch gradient descent](https://youtu.be/KKfZLXcF-aE).
    - Uses all training examples per iteration.
- [Mini-batch gradient descent](https://youtu.be/4qJaSmvhxi8).
    - Uses a smaller batch (e.g., 32) of training examples per iteration.
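
The three methods differ only in how many examples are used for each update. The sketch below fits a straight line with a single function, where `batch_size` switches between them. The data, learning rate, and number of epochs are made up for illustration.

```python
import random


def gradient_descent(xs, ys, batch_size, learning_rate=0.05, epochs=500, seed=0):
    """Fit y = w * x + b by minimising the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0  # initialise the parameters
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Forward pass: prediction errors for this batch.
            errors = [(w * x + b) - y for x, y in batch]
            # Gradients of the mean squared error with respect to w and b.
            grad_w = 2 * sum(e * x for e, (x, _) in zip(errors, batch)) / len(batch)
            grad_b = 2 * sum(errors) / len(batch)
            # Update: take a small step against the gradient.
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # exactly y = 2x + 1

w_sgd, b_sgd = gradient_descent(xs, ys, batch_size=1)            # stochastic
w_mini, b_mini = gradient_descent(xs, ys, batch_size=2)          # mini-batch
w_batch, b_batch = gradient_descent(xs, ys, batch_size=len(xs))  # batch
```

On this tiny, noise-free dataset all three settings end up close to `w = 2` and `b = 1`. On real data they trade off the cost of each update against how noisy it is.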

#### [Optimiser](https://developers.google.com/machine-learning/glossary/#optimizer)

The optimiser is the type of gradient descent used.

A common choice is the [Adam (ADAptive Moment estimation) optimiser](https://youtu.be/JXQT_vxqwIs).

Adam combines [momentum](https://youtu.be/k8fTYJPd3_I) and [RMSprop](https://youtu.be/_e-LFe_igno) (Root Mean Squared propagation).

Momentum remembers past gradients to speed up learning and get out of local minima.

RMSprop scales the step size for each parameter by a moving average of its recent squared gradients, which damps oscillations and speeds up learning.
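
To show how the two ideas combine, here is a sketch of the Adam update for a single parameter. The default values follow those commonly used for Adam, except the learning rate, which is set higher for this toy problem. The function being minimised, $(x - 3)^2$, is a made-up example.

```python
import math


def adam_minimise(gradient, x, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  epsilon=1e-8, steps=500):
    """Minimise a function of one parameter, given its gradient function."""
    m = 0.0  # moving average of the gradients (the momentum part)
    v = 0.0  # moving average of the squared gradients (the RMSprop part)
    for t in range(1, steps + 1):
        g = gradient(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Bias correction, because m and v both start at zero.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        x -= learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return x


# The gradient of (x - 3)^2 is 2 * (x - 3), so the minimum is at x = 3.
x_min = adam_minimise(lambda x: 2 * (x - 3.0), x=0.0)
```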

#### [Metric](https://developers.google.com/machine-learning/glossary/#metric)

The goal of machine learning is predicting new data.

Hence, the [objective](https://developers.google.com/machine-learning/glossary/#objective) is to minimise the _test error_ (as this represents new data).

This is the evaluation metric i.e., the number you primarily care about. 

It is helpful to have a [single evaluation metric](https://youtu.be/sofffBNhVSo) to guide decisions.

### [Error analysis](https://youtu.be/JoAxZsdw_3w)

This is where you manually analyse the prediction errors from the model to help guide how to improve the model.

An example is a [confusion matrix](https://developers.google.com/machine-learning/glossary/#confusion-matrix). This is where you aggregate a classification model's correct and incorrect guesses. This is useful to see which classes have more errors.

For example, you could have a classification model to predict whether or not there is a tumor in the image:

|  | Tumor (predicted) | Non-Tumor (predicted) |
| --- | --- | --- |
| Tumor (ground truth) | 18 | 1 |
| Non-Tumor (ground truth) | 6 | 452 |

So, here there are 19 ground truth images that had tumors (18 + 1), of which the model predicted 18 correct (_true positives, TP_) and 1 wrong (_false negative, FN_).

Also, there are 458 ground truth images that did not have tumors (452 + 6), of which the model predicted 452 correct (_true negatives, TN_) and 6 wrong (_false positives, FP_).

_Precision_ is the fraction of positive predictions that were correct. Here:  

$precision = TP / (TP + FP)$  
$precision = 18 / (18 + 6)$  
$precision = 0.75$  

_Recall_ is the fraction of actual positive cases that the model correctly identified. Here:

$recall = TP / (TP + FN)$  
$recall = 18 / (18 + 1)$  
$recall \approx 0.95$  


There is often a trade-off between precision and recall (i.e., one goes up and the other goes down).
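
The same calculation can be done in code from the four counts in the confusion matrix above. The function name `precision_recall` is chosen here for illustration.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from confusion matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


# Counts from the tumor example above.
tp, fn = 18, 1   # tumor images: predicted correctly / missed
fp, tn = 6, 452  # non-tumor images: flagged wrongly / predicted correctly

precision, recall = precision_recall(tp, fp, fn)
# precision is 0.75 and recall is about 0.95.
```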

### [Underfit](https://youtu.be/SjQyLhQIXSM)

A model _underfits_ the data when it has _high bias_ (i.e., systematic errors). 

This means the model is _too simple_ to capture the association (i.e., it doesn't have enough capacity to learn the generalisation).

You can tell that the model underfits because there are _both_ high training errors and high test errors.

To reduce underfitting, try:

- Adding more features.
- Adding more complex features.
- Decreasing [regularisation](https://youtu.be/6g0t3Phly2M) (i.e., decrease the preference for simpler functions).

_More training data is unlikely to help a model that underfits the data._

(overfit)=
### [Overfit](https://youtu.be/u73PU6Qwl1I)

A model _overfits_ the data when it has _high variance_ (i.e., its predictions vary a lot depending on the particular training data).

This means the model is _too complex_ for the association (i.e., it has too much capacity, so the training data is memorised).

You can tell that the model overfits because there are _low_ training errors _but_ high test errors (i.e., there is a big difference between these errors, where the model doesn't work well on new data because it overfitted to the noise in the training data).

To reduce overfitting, try:

- Adding more data.
- Using fewer or simpler features.
- Increasing [regularisation](https://youtu.be/6g0t3Phly2M) (i.e., increase the preference for simpler functions).
    - [L1](https://developers.google.com/machine-learning/glossary/#l1-regularization) regularisation penalises weights in proportion to the _sum_ of their absolute values.
    - [L2](https://youtu.be/6g0t3Phly2M) regularisation penalises weights in proportion to the _sum_ of their _squared_ values.
    - [Dropout](https://youtu.be/D8PJAL-MZv8) regularisation removes a random selection of neurons for a training step.
- Using a smaller neural network with fewer layers/parameters.
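
The L1 and L2 penalties are terms added to the loss. A minimal sketch, assuming made-up weights, a made-up data loss, and a made-up regularisation strength:

```python
def l1_penalty(weights, strength):
    """Penalty proportional to the sum of the absolute weights."""
    return strength * sum(abs(w) for w in weights)


def l2_penalty(weights, strength):
    """Penalty proportional to the sum of the squared weights."""
    return strength * sum(w ** 2 for w in weights)


weights = [0.5, -2.0, 0.0, 1.5]  # hypothetical model weights
data_loss = 0.30                 # hypothetical loss on the training data
strength = 0.01                  # hyperparameter: how strongly to penalise

loss_with_l1 = data_loss + l1_penalty(weights, strength)  # 0.30 + 0.01 * 4.0
loss_with_l2 = data_loss + l2_penalty(weights, strength)  # 0.30 + 0.01 * 6.5
```

Larger weights increase the total loss, so training is nudged towards smaller weights (i.e., simpler functions). Note how the largest weight dominates the L2 penalty.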

Below is an example of underfitting (straight line through non-linear data) and overfitting (very high-order polynomial passing through every training point).

![underfit_vs_overfit.png](images/underfit_vs_overfit.png)  

*[Image source](https://www.educative.io/edpresso/overfitting-and-underfitting)*

## Questions

```{admonition} Question 1

What does _deep_ mean in deep learning?

```

```{admonition} Question 2

Activation functions help neural networks learn complex functions because they are:

- Linear
- Non-linear

```

```{admonition} Question 3

What is a tensor?

```

```{admonition} Question 4

I have labelled pictures of cats and dogs that I'd like a model to classify.

Is this a supervised or unsupervised problem?

```

```{admonition} Question 5

I'd like a model to predict house prices from their features.

Is this a classification or regression problem?

```

```{admonition} Question 6

How many times can I use the test data?

```

```{admonition} Question 7

I've decided on the number of hidden layers to use in my neural network.

Is this a parameter or hyperparameter?

```

```{admonition} Question 8

Do I want to minimise or maximise the loss?

```

```{admonition} Question 9

A model underfits the data when it has:

- High bias
- High variance

```

```{admonition} Question 10

If my model underfits, what might help:

- Adding more features
- Adding more data

```

```{admonition} Question 11

If my model overfits, what might help:

- Adding more complex features
- Increasing regularisation

```

## {ref}`Solutions <fundamentals>`

## Key Points

```{important}

- [x] _Machine learning and deep learning are a range of prediction methods that learn associations from training data._
- [x] _The objective is for the models to generalise to new data._
- [x] _They mainly use tensors (multi-dimensional arrays) as inputs._
- [x] _Problems are mainly either supervised (if you provide labels) or unsupervised (if you don't provide labels)._
- [x] _Problems are either classification (if you're trying to predict a discrete category) or regression (if you're trying to predict a continuous number)._
- [x] _Data is split into training, validation, and test sets._
- [x] _The models only learn from the training data._
- [x] _The test set is used only once._
- [x] _Hyperparameters are set before model training._
- [x] _Parameters (i.e., the weights and biases) are learnt during model training._
- [x] _The aim is to minimise the loss function._
- [x] _The model underfits when it has high bias._
- [x] _The model overfits when it has high variance._

```

## Further information

### Good practices

- Start simple.
- Incrementally test ideas.
- The choice of algorithm depends on the problem/data (i.e., whether you use linear regression, deep learning, etc.).
    - What assumptions are appropriate?
- Future data should be from the same distribution as the training data (to avoid _data drift_).
- The test set should be representative of the future data you're trying to predict. For example:
    - For time series, test data may be 2021, while training data was 2015-2020. 
    - For a medical application, test data may be completely new patients, not multiple visits from the same patients in the training data.
- Consider ways to reduce the dimensionality of the data (e.g., using PCA, Principal Component Analysis).
- Have a baseline to compare the model skill against (i.e., simple model, human performance, etc.).

### Caveats

- Predictions are primarily based on associations, not explanations or causation.
- Predictions and models are specific to the data they were trained on.

### Resources

**Bold** are highly-recommended.

- **[Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/), Aurélien Géron, 2019, O’Reilly Media, Inc.**  
    - **[Jupyter notebooks](https://github.com/ageron/handson-ml2).**  
- [Deep Learning with Python, 2nd Edition](https://www.manning.com/books/deep-learning-with-python-second-edition?a_aid=keras&a_bid=76564dff), François Chollet, 2021, Manning.  
    - [Jupyter notebooks](https://github.com/fchollet/deep-learning-with-python-notebooks).  
- [Artificial Intelligence: A Modern Approach, 4th edition](http://aima.cs.berkeley.edu/), Stuart Russell and Peter Norvig, 2021, Pearson.  
- [Machine Learning Yearning](https://www.deeplearning.ai/programs/), Andrew Ng.  

(online_courses)=
### Online courses

**Bold** are highly-recommended.

#### Machine learning

- **[Machine learning](https://www.coursera.org/learn/machine-learning), Coursera, Andrew Ng.**
    - **CS229, Stanford University: [Video lectures](https://www.youtube.com/playlist?list=PLoROMvodv4rMiGQp3WXShtMGgzqpfVfbU).**  
- **[Machine Learning for Intelligent Systems](http://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/), Kilian Weinberger, 2018.**  
    - **CS4780, Cornell: [Video lectures](https://youtube.com/playlist?list=PLl8OlHZGYOQ7bkVbuRthEsaLr7bONzbXS).**  
- [Artificial Intelligence: Principles and Techniques](https://www.youtube.com/playlist?list=PLoROMvodv4rO1NB9TD4iUZ3qghGEGtqNX), Percy Liang and Dorsa Sadigh, CS221, Stanford, 2019.  
- [Machine learning in Python with scikit-learn](https://www.fun-mooc.fr/en/courses/machine-learning-python-scikit-learn/), scikit-learn developers, 2022.
  - [Course materials](https://inria.github.io/scikit-learn-mooc/)
  - [Jupyter Notebooks](https://github.com/INRIA/scikit-learn-mooc/) 


#### Deep learning

- **[Deep Learning Specialization](https://www.coursera.org/specializations/deep-learning), Coursera, DeepLearning.AI (_NumPy, Keras, TensorFlow_)**
    - **CS230, Stanford University: [Video lectures](https://www.youtube.com/playlist?list=PLoROMvodv4rOABXSygHTsbvUz4G_YQhOb), [Syllabus](http://cs230.stanford.edu/syllabus/)**
- [NYU Deep Learning](https://atcold.github.io/NYU-DLSP21/), Yann LeCun and Alfredo Canziani, NYU, 2021 (_PyTorch_)
    - [Video lectures](https://www.youtube.com/playlist?list=PLLHTzKZzVU9e6xUfG10TkTWApKSZCzuBI)  