# Data

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ARCTraining/swd8_intro_ml/blob/main/docs/03_data.ipynb)

```{note}
If you're in COLAB or have a local CUDA GPU, you can follow along with the more computationally intensive training in this lesson.

For those in COLAB, ensure the session is using a GPU by going to: Runtime > Change runtime type > Hardware accelerator = GPU.
```

In [None]:
# if you're using colab, then install the required modules
import sys

IN_COLAB = "google.colab" in sys.modules
if IN_COLAB:
    %pip install --quiet --upgrade pytorch-lightning lightning-bolts

## [Tensors](tensors)

### NumPy

In [None]:
import numpy as np

In [None]:
np.random.normal(size=(1,))  # scalar

In [None]:
np.random.normal(size=(3,))  # vector

In [None]:
np.random.normal(size=(3, 3))  # matrix
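
As a quick check of what these arrays are, the sketch below inspects `ndim` and `shape`. Note that `size=(1,)` gives a one-element array standing in for a scalar, whereas a true scalar has zero dimensions. The variable names are illustrative only.

```python
import numpy as np

scalar_like = np.random.normal(size=(1,))  # one-element array (1 dimension)
true_scalar = np.random.normal()           # a single float (0 dimensions)
vector = np.random.normal(size=(3,))       # 1 dimension
matrix = np.random.normal(size=(3, 3))     # 2 dimensions

for name, arr in [("scalar_like", scalar_like), ("vector", vector), ("matrix", matrix)]:
    print(name, arr.ndim, arr.shape)
print("true_scalar", np.ndim(true_scalar))
```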

### [TensorFlow](https://www.tensorflow.org/guide/tensor)

Tensors are immutable.

There are also [sparse tensors](https://www.tensorflow.org/guide/tensor#sparse_tensors) (mostly zeros), and a range of other data structures such as [variables](https://www.tensorflow.org/guide/variable).

You can do a range of [mathematics](https://www.tensorflow.org/api_docs/python/tf/math) with tensors.

In [None]:
import tensorflow as tf

In [None]:
tf.random.normal(shape=(1,))  # scalar

In [None]:
tf.random.normal(shape=(3,))  # vector

In [None]:
tf.random.normal(shape=(3, 3))  # matrix

### [PyTorch](https://pytorch.org/tutorials/beginner/basics/tensorqs_tutorial.html)

More information for doing [maths](https://pytorch.org/tutorials/beginner/introyt/tensors_deeper_tutorial.html#math-logic-with-pytorch-tensors) with PyTorch tensors.

In [None]:
import torch

In [None]:
torch.rand(size=(1,))  # scalar

In [None]:
torch.rand(size=(3,))  # vector

In [None]:
torch.rand(size=(3, 3))  # matrix

## Reproducibility

Use random seeds to assist reproducibility.

### Python

In [None]:
import random

random.seed(42)
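
A minimal sketch of what seeding buys you: resetting the seed replays the same sequence of "random" numbers. The variable names are illustrative.

```python
import random

random.seed(42)
first_run = [random.random() for _ in range(3)]

random.seed(42)  # reset to the same seed
second_run = [random.random() for _ in range(3)]

print(first_run == second_run)  # True
```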

### NumPy

Used by scikit-learn.

In [None]:
np.random.seed(42)

So, after running the random seed cell above (with 42), this next scalar should always return:
```python
>>> np.random.normal(size=(1,))
array([0.49671415])
```

In [None]:
np.random.normal(size=(1,))

### [scikit-learn](https://scikit-learn.org/stable/common_pitfalls.html#controlling-randomness)

For any object that accepts the `random_state` keyword, pass it `rng` (a random number generator instance) rather than leaving it as `None`.

For example, `random_state` is used in:

- `sklearn.model_selection.train_test_split`
- `sklearn.datasets.make_classification`
- `sklearn.model_selection.KFold`
- `sklearn.ensemble.RandomForestClassifier`

```python
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
```
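
One consequence worth seeing: a generator instance advances its internal state each time it is used, so successive draws differ from one another, yet the whole sequence replays when you start again from the same seed. A NumPy-only sketch, where the permutations stand in for the shuffling a splitter would do:

```python
import numpy as np

rng = np.random.RandomState(42)
first = rng.permutation(10)   # consumes some of the generator's state
second = rng.permutation(10)  # so this draw is a different ordering

rng_again = np.random.RandomState(42)  # same seed, fresh state
replay = rng_again.permutation(10)

print(np.array_equal(first, replay))  # the first draw is reproduced
```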

### TensorFlow (Keras)

In [None]:
tf.keras.utils.set_random_seed(42)

### [PyTorch (Lightning)](https://pytorch.org/docs/stable/notes/randomness.html)

For PyTorch, there are separate seeds for the CPU and GPU:

In [None]:
def set_seed(seed):
    # cpu
    random.seed(seed)  # python
    np.random.seed(seed)  # numpy
    torch.manual_seed(seed)  # torch

    # gpu
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)

In [None]:
set_seed(42)

PyTorch Lightning also has its own seed function:

In [None]:
from pytorch_lightning import seed_everything

seed_everything(42)

Additionally, some operations on GPUs are implemented stochastically for efficiency. Check the documentation for details.

To make your GPU workflow deterministic, you may also need to set:

```python
# in pytorch
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# in pytorch lightning trainer
Trainer(deterministic=True)

# in the pytorch dataloader
def seed_worker(worker_id):
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

generator = torch.Generator()
generator.manual_seed(42)

DataLoader(
    train_dataset,
    worker_init_fn=seed_worker,
    generator=generator,
)
```

```{tip}
It's good practice to try and reproduce your own work, to check that this is working correctly.
```

## Data pipelines

Data pipelines are good practice for your workflow for many reasons such as convenience, reproducibility, and avoiding data leakage.

They are especially useful:

- When the data does not fit in memory.
- When the data requires pre-processing.
- To efficiently use hardware.

The steps can include:

- Extract e.g., read data from memory / storage.
- Transform e.g., pre-processing, batching, shuffling.
- Load e.g., transfer to GPU.
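
The three steps above can be sketched in plain NumPy. This is a toy illustration under stated assumptions: the data is synthetic, and the `extract`, `transform`, and `load` functions are hypothetical names, not part of any library.

```python
import numpy as np


def extract(n_samples=10):
    # Extract: stand-in for reading from storage (here, synthetic data)
    features = np.arange(n_samples * 2, dtype=np.float32).reshape(n_samples, 2)
    labels = np.arange(n_samples) % 2
    return features, labels


def transform(features, labels, batch_size, rng):
    # Transform: scale to [0, 1], shuffle, then batch
    features = features / features.max()
    order = rng.permutation(len(features))
    features, labels = features[order], labels[order]
    for start in range(0, len(features), batch_size):
        yield features[start:start + batch_size], labels[start:start + batch_size]


def load(batches):
    # Load: stand-in for transferring each batch to the accelerator
    return [(x.copy(), y.copy()) for x, y in batches]


rng = np.random.default_rng(42)
batches = load(transform(*extract(), batch_size=4, rng=rng))
print([x.shape for x, _ in batches])  # [(4, 2), (4, 2), (2, 2)]
```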

### Data loading

#### scikit-learn

##### [Datasets](https://scikit-learn.org/stable/datasets.html)

`sklearn.datasets` has a range of [toy](https://scikit-learn.org/stable/datasets/toy_dataset.html) and [real-world](https://scikit-learn.org/stable/datasets/real_world.html) datasets.

In [None]:
from sklearn import datasets

In [None]:
digits = datasets.load_digits()

In [None]:
import matplotlib.pyplot as plt

plt.gray()
plt.matshow(digits.images[0])
plt.show()

In [None]:
df_california_housing = datasets.fetch_california_housing(as_frame=True)

In [None]:
df_california_housing["frame"]

##### [Pipelines](https://scikit-learn.org/stable/modules/compose.html)

You can create data pipelines via a list of key-value pairs:

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
from sklearn.decomposition import PCA
from sklearn.svm import SVC

In [None]:
estimators = [("reduce_dim", PCA()), ("clf", SVC())]

In [None]:
Pipeline(estimators)

Or, by using the `make_pipeline` function and passing in instantiated classes:

In [None]:
from sklearn.pipeline import make_pipeline

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import Binarizer

In [None]:
make_pipeline(Binarizer(), MultinomialNB())

#### TensorFlow (Keras)

[Keras](https://keras.io/api/data_loading/) models accept three types of inputs:

- [NumPy arrays](https://www.tensorflow.org/guide/data#consuming_numpy_arrays)
    - Suitable for when the data fits in memory.
- [TensorFlow Dataset objects](https://www.tensorflow.org/guide/data#dataset_structure)
    - Suitable for datasets that do not fit in memory and that are streamed from disk or from a distributed filesystem.
- [Python generators](https://www.tensorflow.org/guide/data#consuming_python_generators)
    - Suitable for custom processing that yields batches of data (subclasses of `tf.keras.utils.Sequence` class).

The documentation has more information on different data formats, such as [CSV](https://www.tensorflow.org/tutorials/load_data/csv) and [Pandas DataFrames](https://www.tensorflow.org/tutorials/load_data/pandas_dataframe).

```{note}
The word class has two definitions here depending on the context.  

- A [class in Python](https://docs.python.org/3/tutorial/classes.html) bundles data and functionality together to make new object instances.
- The [class](https://en.wikipedia.org/wiki/Statistical_classification) in machine learning is the category that a sample belongs to e.g., cat or dog.

```

Keras features a range of utilities to help you turn raw data on disk into a Dataset:

- [`tf.keras.utils.image_dataset_from_directory`](https://www.tensorflow.org/api_docs/python/tf/keras/utils/image_dataset_from_directory) turns image files sorted into class-specific folders into a labeled dataset of image tensors.
- [`tf.keras.utils.text_dataset_from_directory`](https://www.tensorflow.org/api_docs/python/tf/keras/utils/text_dataset_from_directory) does the same for text files.
- [`tf.keras.utils.timeseries_dataset_from_array`](https://www.tensorflow.org/api_docs/python/tf/keras/utils/timeseries_dataset_from_array) creates a dataset of sliding windows over a timeseries provided as array.

```{tip}
If you have a large dataset and you are training on GPU(s), consider using `Dataset` objects, since they will take care of performance-critical details, such as:

- Asynchronously preprocessing your data on CPU while your GPU is busy, and buffering it into a queue.
- Prefetching data on GPU memory so it's immediately available when the GPU has finished processing the previous batch, so you can reach full GPU utilization.
```

##### [Keras Utilities](https://www.tensorflow.org/tutorials/load_data/images)

In [None]:
import pathlib

import matplotlib.pyplot as plt

In [None]:
if IN_COLAB:
    dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
    data_dir = tf.keras.utils.get_file(
        origin=dataset_url, fname="flower_photos", untar=True
    )
    data_dir = pathlib.Path(data_dir)

    BATCH_SIZE = 32
    IMAGE_HEIGHT = 180
    IMAGE_WIDTH = 180

    ds_train = tf.keras.utils.image_dataset_from_directory(
        data_dir,
        validation_split=0.2,
        subset="training",
        seed=123,
        image_size=(IMAGE_HEIGHT, IMAGE_WIDTH),
        batch_size=BATCH_SIZE,
    )

In [None]:
if IN_COLAB:
    class_names = ds_train.class_names

    plt.figure(figsize=(10, 10))
    for images, labels in ds_train.take(1):
        for i in range(9):
            ax = plt.subplot(3, 3, i + 1)
            plt.imshow(images[i].numpy().astype("uint8"))
            plt.title(class_names[labels[i]])
            plt.axis("off")

##### [TensorFlow Datasets](https://www.tensorflow.org/datasets/overview)

Can [split](https://www.tensorflow.org/datasets/splits) the data on load.

In [None]:
import tensorflow_datasets as tfds

In [None]:
if IN_COLAB:
    (ds_train, ds_val, ds_test), ds_info = tfds.load(
        "tf_flowers",
        split=["train[:80%]", "train[80%:90%]", "train[90%:]"],
        with_info=True,
        as_supervised=True,  # returns (img, label) instead of {'image': img, 'label': label}
    )

##### [NumPy to TensorFlow Dataset](https://www.tensorflow.org/tutorials/load_data/numpy)

Load a `.npz` file:

In [None]:
DATA_URL = "https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz"

path = tf.keras.utils.get_file("mnist.npz", DATA_URL)
with np.load(path) as data:
    x_train = data["x_train"]
    y_train = data["y_train"]
    x_test = data["x_test"]
    y_test = data["y_test"]

In [None]:
type(x_train)

Convert to a TensorFlow object:

In [None]:
ds_train = tf.data.Dataset.from_tensor_slices((x_train, y_train))
ds_test = tf.data.Dataset.from_tensor_slices((x_test, y_test))

In [None]:
ds_train

#### [PyTorch (Lightning)](https://pytorch-lightning.readthedocs.io/en/stable/guides/data.html)

There are a few different options for data:

- [Dataset](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#loading-a-dataset)
    - Maps keys to data samples.
    - Also, [Iterable Datasets](https://pytorch-lightning.readthedocs.io/en/stable/guides/data.html#iterable-datasets) for sequential data.
- [DataLoader](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#preparing-your-data-for-training-with-dataloaders)
    - Wraps an iterable around Dataset.
- [LightningDataModule](https://pytorch-lightning.readthedocs.io/en/stable/extensions/datamodules.html#datamodules)
    - A collection of training/validation/test/predict DataLoaders, along with their preprocessing/downloading steps.
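
To make the first two ideas concrete, here is a plain-Python sketch that does not use PyTorch at all. `SquaresDataset` and `simple_loader` are hypothetical names that only mimic the roles of a map-style Dataset (keys map to samples) and a DataLoader (an iterable of batches wrapped around it).

```python
class SquaresDataset:
    """Map-style dataset: an integer key maps to one (input, target) sample."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        if not 0 <= idx < self.n:
            raise IndexError(idx)
        return idx, idx ** 2


def simple_loader(dataset, batch_size):
    """Wrap an iterable around the dataset, yielding lists of samples."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]


batches = list(simple_loader(SquaresDataset(5), batch_size=2))
print(batches)  # [[(0, 0), (1, 1)], [(2, 4), (3, 9)], [(4, 16)]]
```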

##### [Dataset](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#loading-a-dataset)

In [None]:
from torch.utils.data import random_split
from torchvision.datasets import MNIST

In [None]:
import os

data_path = f"{os.getcwd()}/data"

In [None]:
train_dataset = MNIST(data_path, train=True, download=True)
test_dataset = MNIST(data_path, train=False, download=True)
predict_dataset = MNIST(
    data_path, train=False, download=True
)  # same as the test dataset

train_dataset, val_dataset = random_split(train_dataset, [55000, 5000])

In [None]:
train_dataset

##### [DataLoader](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#preparing-your-data-for-training-with-dataloaders)

In [None]:
from torch.utils.data import DataLoader

In [None]:
BATCH_SIZE = 32

train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE)
val_dataloader = DataLoader(val_dataset, batch_size=BATCH_SIZE)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE)
predict_dataloader = DataLoader(predict_dataset, batch_size=BATCH_SIZE)

In [None]:
train_dataloader

##### [LightningDataModule](https://pytorch-lightning.readthedocs.io/en/stable/extensions/datamodules.html#datamodules)

Decouples the data hooks from the PyTorch Lightning model, so you can develop dataset agnostic models with reusable and sharable DataModules.

You can think of this as the data pipeline.

For multi-node training, you can also add [`prepare_data_per_node`](https://pytorch-lightning.readthedocs.io/en/stable/extensions/datamodules.html#prepare-data-per-node).

In [1]:
import pytorch_lightning as pl
from torchvision import transforms

In [2]:
class MNISTDataModule(pl.LightningDataModule):
    def __init__(self, data_path=data_path, batch_size=BATCH_SIZE):
        super().__init__()
        self.data_path = data_path
        self.batch_size = batch_size
        self.transform = transforms.Compose(
            [
                transforms.ToTensor(),
                transforms.Normalize((0.1307,), (0.3081,)),  # specific to MNIST
            ]
        )

    def prepare_data(self):
        # download data once, useful for distributed training to avoid duplicates
        MNIST(self.data_path, train=True, download=True)
        MNIST(self.data_path, train=False, download=True)

    def setup(self, stage=None):
        if stage == "fit" or stage is None:
            mnist_full = MNIST(self.data_path, train=True, transform=self.transform)
            self.mnist_train, self.mnist_val = random_split(mnist_full, [55000, 5000])

        if stage == "test" or stage is None:
            self.mnist_test = MNIST(
                self.data_path, train=False, transform=self.transform
            )

        if stage == "predict" or stage is None:
            self.mnist_predict = MNIST(
                self.data_path, train=False, transform=self.transform
            )

    def train_dataloader(self):
        return DataLoader(self.mnist_train, batch_size=self.batch_size)

    def val_dataloader(self):
        return DataLoader(self.mnist_val, batch_size=self.batch_size)

    def test_dataloader(self):
        return DataLoader(self.mnist_test, batch_size=self.batch_size)

    def predict_dataloader(self):
        return DataLoader(self.mnist_predict, batch_size=self.batch_size)


In [None]:
datamodule = MNISTDataModule()

You can then [use the LightningDataModule in the Trainer](https://pytorch-lightning.readthedocs.io/en/stable/extensions/datamodules.html#using-a-datamodule):

```python
trainer.fit(model, datamodule=datamodule)
trainer.test(datamodule=datamodule)
```

Now you can also swap out the datamodule for another one that works with the same model e.g., [Fashion MNIST](https://www.kaggle.com/datasets/zalando-research/fashionmnist):

```python
datamodule = FashionMNISTDataModule()
```

### Shuffle

Shuffle the _training_ data so that the batches differ between epochs, which helps with training accuracy.

Normally, the _test_ data is not shuffled.
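
A framework-free sketch of what shuffling involves, assuming the inputs and labels are held in two NumPy arrays: a new order is drawn each epoch, and the same order is applied to both arrays so that each input keeps its label.

```python
import numpy as np

X = np.arange(10).reshape(5, 2)  # row i is [2i, 2i + 1]
y = np.arange(5)                 # label i belongs to row i

rng = np.random.default_rng(42)
for epoch in range(2):
    order = rng.permutation(len(X))  # a new order each epoch
    X_epoch, y_epoch = X[order], y[order]
    # pairs stay aligned: the first column of each row is still 2 * label
    assert np.array_equal(X_epoch[:, 0], 2 * y_epoch)
```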

#### TensorFlow (Keras)

In [None]:
ds_train = ds_train.shuffle(10_000)

In [None]:
ds_train

#### PyTorch (Lightning)

In [None]:
BATCH_SIZE = 32

train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)

In [None]:
train_dataloader

(batch_tf)=
### Batch

A batch is a set of examples used in one iteration of model training.

The batch size is the number of examples in a batch.

The optimum batch size depends on the problem and what you're optimising for.  

In general:

- They are often multiples of 32, where 32 or 64 is a good starting point if unsure.
- Larger batch sizes can be more performant (e.g., 256 is often used for distributed training over multiple GPUs).
- Batch sizes that match the number of classes for multi-class classification can increase accuracy (e.g., 10 for MNIST). 
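
The batch size also fixes how many iterations make up one epoch. As a small arithmetic sketch, assuming 60,000 training examples (the size of the MNIST training set):

```python
import math

num_examples = 60_000  # e.g., the MNIST training set
steps = {}
for batch_size in (32, 64, 256):
    steps[batch_size] = math.ceil(num_examples / batch_size)

print(steps)  # {32: 1875, 64: 938, 256: 235}
```

When the batch size does not divide the dataset evenly, the final batch of the epoch is smaller than the rest.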

#### [TensorFlow (Keras)](https://www.tensorflow.org/guide/data#batching_dataset_elements)

In [None]:
BATCH_SIZE = 32

ds_train = ds_train.batch(batch_size=BATCH_SIZE)

In [None]:
ds_train

#### PyTorch (Lightning)

In [None]:
BATCH_SIZE = 32

train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE)

In [None]:
train_dataloader

[Automatic batch size](https://pytorch-lightning.readthedocs.io/en/stable/advanced/training_tricks.html#batch-size-finder) with PyTorch Lightning:

```python
Trainer(auto_scale_batch_size=True)
```

### Map

Map a preprocessing function to a dataset.

#### [TensorFlow (Keras)](https://www.tensorflow.org/guide/data#preprocessing_data)

```python
dataset.map(function)
```

```{tip}
There are a range of ways to [improve the performance](https://www.tensorflow.org/guide/data_performance) of the data pipeline.

In these examples, using `tf.data.AUTOTUNE` leaves the decision to TensorFlow.
```

(cache_tf)=
### Dataset caching

Cache the data after the first iteration through it. The data can be cached to either memory or a local file.

This can improve performance when:

- The data stays the same each iteration.
- The data is read from a remote distributed filesystem.
- The data is I/O (input/output) bound and fits in memory.

Note, large datasets are [sharded](https://www.tensorflow.org/tutorials/distribute/input#sharding) rather than cached, as they don't fit into memory.
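
The idea of in-memory caching can be sketched without any framework. `CachedDataset` below is a hypothetical wrapper, not a library class: it runs the expensive loader on the first pass and serves later passes from memory.

```python
import time


class CachedDataset:
    """Hypothetical wrapper: runs the expensive loader once, then reuses the result."""

    def __init__(self, load_fn):
        self.load_fn = load_fn
        self._cache = None
        self.num_loads = 0

    def __iter__(self):
        if self._cache is None:  # first pass: do the real work
            self._cache = list(self.load_fn())
            self.num_loads += 1
        return iter(self._cache)  # later passes: served from memory


def slow_load():
    for i in range(3):
        time.sleep(0.01)  # stand-in for slow I/O or preprocessing
        yield i * i


ds = CachedDataset(slow_load)
for epoch in range(3):
    values = list(ds)

print(values, ds.num_loads)  # [0, 1, 4] 1
```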

#### [TensorFlow (Keras)](https://www.tensorflow.org/guide/data_performance#caching)

In [None]:
ds_train = ds_train.cache()

In [None]:
ds_train

(prefetch_tf)=
### Prefetch data

Overlaps data pre-processing and model execution while training.
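
A rough sketch of the mechanism, using only the standard library: a background thread fills a small buffer while the consumer works on the previous item. The `prefetch` helper is a hypothetical illustration and not how TensorFlow implements it.

```python
import queue
import threading


def prefetch(iterable, buffer_size=2):
    """Yield items from `iterable`, produced ahead of time by a background thread."""
    buffer = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the data

    def producer():
        for item in iterable:
            buffer.put(item)  # blocks when the buffer is full
        buffer.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is sentinel:
            return
        yield item


def preprocess(n):
    for i in range(n):
        yield i * 10  # stand-in for CPU-side pre-processing


consumed = [batch for batch in prefetch(preprocess(5))]
print(consumed)  # [0, 10, 20, 30, 40]
```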

#### [TensorFlow (Keras)](https://www.tensorflow.org/guide/data_performance#prefetching)

In [None]:
ds_train = ds_train.prefetch(buffer_size=tf.data.AUTOTUNE)

In [None]:
ds_train

### Parallel data extraction

Extract the data in parallel.
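
As a framework-free sketch, several shards can be read concurrently with a thread pool. `read_shard` is a hypothetical stand-in for a file reader, and the example only illustrates the parallel read, not the element ordering that a framework's interleave operation produces.

```python
from concurrent.futures import ThreadPoolExecutor


def read_shard(shard_id):
    # stand-in for reading one file; a real reader would open a path here
    return [shard_id * 100 + i for i in range(3)]


shard_ids = [0, 1, 2, 3]
with ThreadPoolExecutor(max_workers=4) as pool:
    shards = list(pool.map(read_shard, shard_ids))  # results keep input order

records = [record for shard in shards for record in shard]
print(len(records))  # 12
```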

#### [TensorFlow (Keras)](https://www.tensorflow.org/guide/data_performance#parallelizing_data_extraction)

```python
dataset.interleave(
    build_dataset, 
    num_parallel_calls=tf.data.AUTOTUNE
)
```

#### [PyTorch (Lightning)](https://pytorch.org/docs/stable/data.html#multi-process-data-loading)

Set `num_workers` to be greater than 0 in the DataLoader:

```python
train_dataloader = DataLoader(train_dataset, num_workers=4)
```

```{tip}
You can also use pinned (page-locked) host memory for faster copies from CPU to GPU by adding `pin_memory=True` inside the DataLoader.
```

### Data pre-processing

Pre-processing your data is often helpful as the raw data is often not in the exact format that the model performs well with.

For example, normalising the values in a tensor can help with model training.

```{tip}
Pre-processing transformations are based on the training data _only_ (not the test data). For example, if you normalise by the average, ensure that this is the average of the training data only.

These are then applied to the inputs of both the _training_ and the _test_ data.  

This helps avoid [data leakage](https://scikit-learn.org/stable/common_pitfalls.html#data-leakage) of the test data into training.
```
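
A minimal NumPy sketch of this rule, using synthetic data as an assumption: the mean and standard deviation are computed from the training split only, and the same values are then used to transform both splits.

```python
import numpy as np

rng = np.random.default_rng(42)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# statistics come from the training data only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# the same transformation is applied to both splits
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std
```

The scaled training data has exactly zero mean and unit variance per feature, while the scaled test data is only approximately so, which is expected.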

#### [scikit-learn](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing)

In [None]:
from sklearn import preprocessing

In [None]:
preprocessing.StandardScaler()  # standardisation: zero mean and unit variance

In [None]:
preprocessing.Normalizer()  # normalisation: unit norm

In [None]:
preprocessing.PowerTransformer()  # mapping to Gaussian distribution

In [None]:
preprocessing.OneHotEncoder()  # encoding categorical features

#### [TensorFlow (Keras)](https://www.tensorflow.org/guide/keras/preprocessing_layers)

In [None]:
tf.keras.layers.Rescaling(1.0 / 255)

#### [PyTorch (Lightning)](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html)

In [None]:
import torchvision

In [None]:
torchvision.transforms.Normalize((0.1307,), (0.3081,))  # specific to MNIST

(data_augmentation)=
### [Data augmentation](https://youtu.be/JI8saFjK84o)

Data augmentation artificially increases the range and number of training examples.

This is useful for small data sets.

There are a range of methods. For example, in image problems you could rotate, stretch, and reflect images.
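
A small NumPy sketch of two such transformations on a toy array, to show what a reflection and a rotation do to the pixel grid. Real pipelines would use the framework layers or transforms and apply them randomly.

```python
import numpy as np

image = np.arange(12).reshape(3, 4)  # a tiny 3x4 "image"

flipped = np.fliplr(image)  # horizontal reflection
rotated = np.rot90(image)   # 90 degree anticlockwise rotation

augmented = np.stack([image, flipped])  # same-shape variants can be stacked
print(augmented.shape, rotated.shape)  # (2, 3, 4) (4, 3)
```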

```{tip}
Apply random transformations _after_ both caching (to avoid caching randomness) and batching (for vectorisation).
```

#### [TensorFlow (Keras)](https://www.tensorflow.org/tutorials/images/data_augmentation)

In [None]:
tf.keras.layers.RandomFlip("horizontal")

In [None]:
tf.keras.layers.RandomRotation(0.1)

#### [PyTorch (Lightning)](https://pytorch.org/vision/master/transforms.html)

In [None]:
torchvision.transforms.RandomHorizontalFlip()

In [None]:
torchvision.transforms.RandomRotation(36)  # in degrees; Keras RandomRotation(0.1) is a fraction of a full turn, i.e. 36 degrees

### Parallel data transformation

Pre-process your data in parallel.

#### [TensorFlow (Keras)](https://www.tensorflow.org/guide/data_performance#parallelizing_data_transformation)

```python
dataset.map(
    function, 
    num_parallel_calls=tf.data.AUTOTUNE
)
```

### Vectorise mapping

Batch _before_ mapping, to vectorise a function.
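
The reason this helps can be seen in plain NumPy: a function written for arrays can process a whole batch in one call instead of one call per example. The `normalise` function and the random images here are illustrative assumptions.

```python
import numpy as np


def normalise(x):
    return x.astype(np.float32) / 255.0


images = np.random.default_rng(0).integers(0, 256, size=(256, 28, 28), dtype=np.uint8)

# map then batch: one call per example
per_example = np.stack([normalise(img) for img in images])

# batch then map: one call for the whole batch
per_batch = normalise(images)

print(np.array_equal(per_example, per_batch))  # True
```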

#### [TensorFlow (Keras)](https://www.tensorflow.org/guide/data_performance#vectorizing_mapping)

```python
dataset.batch(256).map(function)
```

### Mixed precision

Mixed precision is the combined use of 16-bit and 32-bit floating-point types during training to use less memory and make it run faster.

It uses 32-bits where it needs to for accuracy and 16-bits elsewhere for speed.

```{note}
This functionality varies by GPU, and is mostly available to modern NVIDIA GPUs.
```

```{warning}
Be careful with underflow and overflow issues.

16-bit floats above 65,504 overflow to infinity and below 6.0 × 10<sup>-8</sup> underflow to zero.

[Loss scaling](https://www.tensorflow.org/guide/mixed_precision#loss_scaling_overview) can help avoid errors by scaling the losses up or down temporarily, e.g.:  
`optimizer = mixed_precision.LossScaleOptimizer(optimizer)`  

```
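
These limits, and the idea behind loss scaling, can be checked with NumPy's `float16` type. This is only a sketch: the scale factor of 1024 and the gradient value are arbitrary choices for illustration, and real frameworks adjust the scale dynamically.

```python
import numpy as np

with np.errstate(over="ignore"):  # silence the overflow warning for the demo
    too_big = np.float32(70_000).astype(np.float16)
too_small = np.float32(1e-8).astype(np.float16)
print(too_big, too_small)  # inf 0.0

# loss scaling: multiply before casting down, divide after casting back up
scale = np.float32(1024)
small_grad = np.float32(1e-8)
scaled = (small_grad * scale).astype(np.float16)  # now representable
recovered = scaled.astype(np.float32) / scale
print(recovered > 0)  # True
```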

#### [TensorFlow (Keras)](https://www.tensorflow.org/guide/mixed_precision)

```python
tf.keras.mixed_precision.set_global_policy('mixed_float16')
```

#### [PyTorch (Lightning)](https://pytorch-lightning.readthedocs.io/en/stable/advanced/precision.html#)

```python
Trainer(precision=16)
```

### Example - Digit Classification

(tensorflow_datasets)=
#### [TensorFlow Datasets](https://www.tensorflow.org/datasets)

TensorFlow Datasets is a collection of datasets ready to use, with TensorFlow or other Python ML frameworks.

Here is an example for [MNIST](https://www.tensorflow.org/datasets/keras_example).

Load the data:

In [None]:
(ds_train, ds_val, ds_test), ds_info = tfds.load(
    "mnist",
    split=["train[:80%]", "train[80%:90%]", "train[90%:]"],
    shuffle_files=True,  # good practice for larger datasets with many files on disk
    as_supervised=True,
    with_info=True,
)

Create the data pipelines:

In [None]:
AUTOTUNE = tf.data.AUTOTUNE


def normalise_image(image, label):
    return tf.cast(image, tf.float32) / 255.0, label


def training_pipeline(ds_train):
    ds_train = ds_train.map(
        normalise_image, num_parallel_calls=AUTOTUNE
    )  # parallelise preprocessing first to reuse it
    ds_train = ds_train.cache()  # cache before shuffling for performance
    ds_train = ds_train.shuffle(
        ds_info.splits["train"].num_examples
    )  # shuffle by the full dataset size
    ds_train = ds_train.batch(
        128
    )  # batch after shuffling for unique batches at each epoch
    ds_train = ds_train.prefetch(
        AUTOTUNE
    )  # end pipeline with prefetching for performance
    return ds_train


def test_pipeline(ds_test):
    ds_test = ds_test.map(normalise_image, num_parallel_calls=AUTOTUNE)
    ds_test = ds_test.batch(128)
    ds_test = ds_test.cache()
    # cache after batching because batches can be the same between epochs
    # no shuffling needed
    ds_test = ds_test.prefetch(AUTOTUNE)
    return ds_test


ds_train = training_pipeline(ds_train)
ds_val = training_pipeline(ds_val)
ds_test = test_pipeline(ds_test)

Create the model using the [Functional API](https://keras.io/guides/functional_api/):

In [None]:
inputs = tf.keras.Input(shape=(28, 28, 1), name="inputs")
x = tf.keras.layers.Flatten(name="flatten")(inputs)
x = tf.keras.layers.Dense(128, activation="relu", name="layer1")(x)
x = tf.keras.layers.Dense(128, activation="relu", name="layer2")(x)
outputs = tf.keras.layers.Dense(10, name="outputs")(x)

model = tf.keras.Model(inputs, outputs, name="functional")

model.summary()

Compile the model:

In [None]:
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy(name="accuracy")],
)

Train the model:

In [None]:
NUM_EPOCHS = 10

history = model.fit(
    ds_train,
    validation_data=ds_val,
    epochs=NUM_EPOCHS,
    verbose=False,
);

View the loss and accuracy curves over the epochs:

In [None]:
import matplotlib.pyplot as plt

epochs_range = range(1, NUM_EPOCHS + 1)

In [None]:
plt.plot(epochs_range, history.history["loss"], "bo", label="Training loss")
plt.plot(epochs_range, history.history["val_loss"], "b", label="Validation loss")
plt.title("Training and validation loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
plt.show()

In [None]:
plt.plot(epochs_range, history.history["accuracy"], "bo", label="Training accuracy")
plt.plot(
    epochs_range, history.history["val_accuracy"], "b", label="Validation accuracy"
)
plt.title("Training and validation accuracy")
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.ylim([0.9, 1.0])
plt.legend()
plt.show()

The training accuracy and the validation accuracy are diverging.

This means that the model is overfitting (i.e., memorising the training data and not generalising to the validation data).

One way to alleviate this is to add [regularisation](https://www.tensorflow.org/tutorials/keras/overfit_and_underfit#add_weight_regularization).

In this example, we'll add [dropout](overfit) to the dense layers.

In [None]:
inputs = tf.keras.Input(shape=(28, 28, 1), name="inputs")
x = tf.keras.layers.Flatten(name="flatten")(inputs)
x = tf.keras.layers.Dense(128, activation="relu", name="layer1")(x)

# new: dropout layer for regularisation
x = tf.keras.layers.Dropout(0.2)(x)

x = tf.keras.layers.Dense(128, activation="relu", name="layer2")(x)
outputs = tf.keras.layers.Dense(10, name="outputs")(x)

model = tf.keras.Model(inputs, outputs, name="functional")

In [None]:
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy(name="accuracy")],
)

In [None]:
history = model.fit(
    ds_train,
    validation_data=ds_val,
    epochs=NUM_EPOCHS,
    verbose=False,
);

In [None]:
plt.plot(epochs_range, history.history["accuracy"], "bo", label="Training accuracy")
plt.plot(
    epochs_range, history.history["val_accuracy"], "b", label="Validation accuracy"
)
plt.title("Training and validation accuracy")
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.ylim([0.9, 1.0])
plt.legend()
plt.show()

Much better. The model now performs well on _both_ the training and validation data.

#### [PyTorch (Lightning)](https://pytorch-lightning.readthedocs.io/en/stable/notebooks/lightning_examples/datamodules.html)

In [None]:
import os

import torch
import torch.nn.functional as F
from pytorch_lightning import (
    LightningDataModule,
    LightningModule,
    Trainer,
    seed_everything,
)
from pytorch_lightning.callbacks.progress import TQDMProgressBar
from torch import nn
from torch.utils.data import DataLoader, random_split
from torchmetrics.functional import accuracy
from torchvision import transforms
from torchvision.datasets import CIFAR10, MNIST

Set global parameters:

In [None]:
seed_everything(42)

In [None]:
PATH_DATASETS = f"{os.getcwd()}/data"
AVAIL_GPUS = min(1, torch.cuda.device_count())
BATCH_SIZE = 256 if AVAIL_GPUS else 64

[Create the (dataset agnostic) PyTorch Lightning Model](https://pytorch-lightning.readthedocs.io/en/stable/notebooks/lightning_examples/datamodules.html#Defining-the-dataset-agnostic-LitModel):

In [None]:
class LitModel(LightningModule):
    def __init__(
        self, channels, width, height, num_classes, hidden_size=64, learning_rate=2e-4
    ):

        super().__init__()

        # We take in input dimensions as parameters and use those to dynamically build model.
        self.channels = channels
        self.width = width
        self.height = height
        self.num_classes = num_classes
        self.hidden_size = hidden_size
        self.learning_rate = learning_rate

        self.model = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels * width * height, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_size, num_classes),
        )

    def forward(self, x):
        x = self.model(x)
        return F.log_softmax(x, dim=1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        preds = torch.argmax(logits, dim=1)
        acc = accuracy(preds, y)
        self.log("val_loss", loss, prog_bar=True)
        self.log("val_acc", acc, prog_bar=True)
        return loss

    def test_step(self, batch, batch_idx):
        # Here we just reuse the validation_step for testing
        return self.validation_step(batch, batch_idx)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
        return optimizer

[Create the PyTorch Lightning DataModule](https://pytorch-lightning.readthedocs.io/en/stable/notebooks/lightning_examples/datamodules.html#Defining-The-MNISTDataModule):

In [None]:
class MNISTDataModule(LightningDataModule):
    def __init__(self, data_dir=PATH_DATASETS):
        super().__init__()
        self.data_dir = data_dir
        self.transform = transforms.Compose(
            [
                transforms.ToTensor(),
                transforms.Normalize((0.1307,), (0.3081,)),  # specific to MNIST
            ]
        )

    def prepare_data(self):  # download the data, once if distributed
        MNIST(self.data_dir, train=True, download=True)
        MNIST(self.data_dir, train=False, download=True)

    def setup(self, stage=None):
        # Assign train/val datasets for use in dataloaders
        if stage == "fit" or stage is None:
            ds_full = MNIST(self.data_dir, train=True, transform=self.transform)
            self.ds_train, self.ds_val = random_split(ds_full, [55000, 5000])

        # Assign test dataset for use in dataloader(s)
        if stage == "test" or stage is None:
            self.ds_test = MNIST(self.data_dir, train=False, transform=self.transform)

    def train_dataloader(self):
        return DataLoader(self.ds_train, batch_size=BATCH_SIZE)

    def val_dataloader(self):
        return DataLoader(self.ds_val, batch_size=BATCH_SIZE)

    def test_dataloader(self):
        return DataLoader(self.ds_test, batch_size=BATCH_SIZE)

Instantiate the Model, DataModule, and Trainer:

In [None]:
datamodule = MNISTDataModule()

In [None]:
model = LitModel(channels=1, width=28, height=28, num_classes=10)

In [None]:
trainer = Trainer(
    gpus=AVAIL_GPUS,
    max_epochs=3,
    callbacks=TQDMProgressBar(refresh_rate=20),
)

Run training:

In [None]:
if IN_COLAB:
    trainer.fit(model, datamodule=datamodule)

Test the model:

In [None]:
if IN_COLAB:
    trainer.test(datamodule=datamodule)

Now, we can [change over to a different dataset](https://pytorchlightning.github.io/lightning-tutorials/notebooks/lightning_examples/datamodules.html#Defining-the-CIFAR10-DataModule) e.g., [CIFAR10](https://en.wikipedia.org/wiki/CIFAR-10):

In [None]:
class CIFAR10DataModule(LightningDataModule):
    def __init__(self, data_dir=PATH_DATASETS):
        super().__init__()
        self.data_dir = data_dir
        self.transform = transforms.Compose(
            [
                transforms.ToTensor(),
                transforms.Normalize(
                    (0.5, 0.5, 0.5), (0.5, 0.5, 0.5)
                ),  # rescales each RGB channel from [0, 1] to [-1, 1]
            ]
        )

    def prepare_data(self):  # download the data, once if distributed
        CIFAR10(self.data_dir, train=True, download=True)
        CIFAR10(self.data_dir, train=False, download=True)

    def setup(self, stage=None):
        # Assign train/val datasets for use in dataloaders
        if stage == "fit" or stage is None:
            ds_full = CIFAR10(self.data_dir, train=True, transform=self.transform)
            self.ds_train, self.ds_val = random_split(ds_full, [45000, 5000])

        # Assign test dataset for use in dataloader(s)
        if stage == "test" or stage is None:
            self.ds_test = CIFAR10(self.data_dir, train=False, transform=self.transform)

    def train_dataloader(self):
        return DataLoader(self.ds_train, batch_size=BATCH_SIZE)

    def val_dataloader(self):
        return DataLoader(self.ds_val, batch_size=BATCH_SIZE)

    def test_dataloader(self):
        return DataLoader(self.ds_test, batch_size=BATCH_SIZE)

In [None]:
datamodule = CIFAR10DataModule()

In [None]:
model = LitModel(channels=3, width=32, height=32, num_classes=10, hidden_size=512)

In [None]:
trainer = Trainer(
    gpus=AVAIL_GPUS,
    max_epochs=3,
    callbacks=TQDMProgressBar(refresh_rate=20),
)

In [None]:
if IN_COLAB:
    trainer.fit(model, datamodule=datamodule)

In [None]:
if IN_COLAB:
    trainer.test(datamodule=datamodule)

This simple model works well for MNIST but not for CIFAR10. However, it demonstrates the ease and benefits of switching out data modules.
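
One concrete difference between the two runs is the size of each input, which is set by the `channels`, `width`, and `height` arguments. A quick sketch of the arithmetic, assuming the model flattens each image into a single vector before its first linear layer (an assumption about `LitModel`; `flattened_size` is a hypothetical helper):

```python
def flattened_size(channels, width, height):
    """Number of values per image once it is flattened into a vector."""
    return channels * width * height


mnist_inputs = flattened_size(channels=1, width=28, height=28)
cifar10_inputs = flattened_size(channels=3, width=32, height=32)

print(mnist_inputs)    # 784
print(cifar10_inputs)  # 3072
```

CIFAR10 images are colour photographs with roughly four times as many values per image as MNIST digits, which may be one reason a model sized for MNIST struggles on them.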

[PyTorch Lightning Bolts](https://lightning-bolts.readthedocs.io/en/latest/introduction_guide.html) simplifies this even further for common [DataModules](https://lightning-bolts.readthedocs.io/en/latest/datamodules/vision.html) (e.g., MNIST, FashionMNIST, CIFAR10, [ImageNet](https://en.wikipedia.org/wiki/ImageNet)) by providing them for you:

In [None]:
from pl_bolts.datamodules import MNISTDataModule

In [None]:
datamodule = MNISTDataModule(PATH_DATASETS)

In [None]:
model = LitModel(channels=1, width=28, height=28, num_classes=10)

In [None]:
trainer = Trainer(
    gpus=AVAIL_GPUS,
    max_epochs=3,
    callbacks=TQDMProgressBar(refresh_rate=20),
)

In [None]:
if IN_COLAB:
    trainer.fit(model, datamodule=datamodule)

In [None]:
if IN_COLAB:
    trainer.test(datamodule=datamodule)

PyTorch Lightning Bolts also has a range of models (e.g., regression, [GPT-2](https://en.wikipedia.org/wiki/GPT-2), [ImageGPT](https://openai.com/blog/image-gpt/), [GAN](https://en.wikipedia.org/wiki/Generative_adversarial_network), [VAE](https://en.wikipedia.org/wiki/Variational_autoencoder)):

```python
from pl_bolts.models.vision import ImageGPT
```

Also, you can [override functionality for fast iteration of research ideas](https://lightning-bolts.readthedocs.io/en/latest/introduction_guide.html#for-research):

```python
class VideoGPT(ImageGPT):  # inherit from the pre-trained model
    def training_step(self, batch, batch_idx):  # create a new training step
        # cool science
        ...  # placeholder so the method body is valid Python
```
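
The override pattern itself is plain Python inheritance. A toy sketch that runs without Bolts installed; `BaseModel` and `CustomModel` are hypothetical stand-ins, not Bolts classes, and the "losses" are dummy values:

```python
class BaseModel:
    """Hypothetical stand-in for a library model."""

    def forward(self, x):
        return [2 * value for value in x]

    def training_step(self, batch, batch_idx):
        # default behaviour: sum of the outputs as a dummy loss
        return sum(self.forward(batch))


class CustomModel(BaseModel):
    """Reuses forward() but swaps in a different training step."""

    def training_step(self, batch, batch_idx):
        # new behaviour: mean of the outputs as a dummy loss
        outputs = self.forward(batch)
        return sum(outputs) / len(outputs)


batch = [1.0, 2.0, 3.0]
base_loss = BaseModel().training_step(batch, batch_idx=0)
custom_loss = CustomModel().training_step(batch, batch_idx=0)
```

Only `training_step` changes in the subclass; everything else is inherited unchanged.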

## Questions

```{admonition} Question 1

Should I split my data into train and test subsets _before_ or _after_ pre-processing?

```

```{admonition} Question 2

Before I use random functionality, what is a good practice for reproducibility?

```

```{admonition} Question 3

What should I create if there are multiple steps to my data pre-processing?

```

```{admonition} Question 4

Name three ways to improve performance in a data pipeline.

```

## {ref}`Solutions <data>`

## Key Points

```{important}

- [x] _Always split the data into train and test subsets first, before any pre-processing._
- [x] _Never fit to the test data._
- [x] _Use a data pipeline._
- [x] _Use a random seed and any available deterministic functionalities for reproducibility._
    - [x] _Try to reproduce your own work, to check that it is reproducible._
- [x] _Consider optimising the data pipeline with:_
    - [x] _Shuffling._
    - [x] _Batching._
    - [x] _Caching._
    - [x] _Prefetching._
    - [x] _Parallel data extraction._
    - [x] _Data augmentation._
    - [x] _Parallel data transformation._
    - [x] _Vectorised mapping._
    - [x] _Mixed precision._

```
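
Two of the points above, seeding and the shuffle-then-batch pattern, can be sketched in plain Python. This is a toy illustration rather than how a framework data loader is implemented; `batches` is a hypothetical helper:

```python
import random


def batches(samples, batch_size, seed):
    """Yield shuffled batches; the same seed gives the same order."""
    rng = random.Random(seed)  # local generator, leaves the global state alone
    order = list(samples)
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield order[start : start + batch_size]


first_run = list(batches(range(10), batch_size=4, seed=42))
second_run = list(batches(range(10), batch_size=4, seed=42))
```

Both runs produce identical batches because they share a seed. The final batch is smaller than `batch_size` when the number of samples is not an exact multiple of it.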

## Further information

### Good practices

- Do data processing in a pipeline or module to increase portability and reproducibility.
- Pre-processing transformations are based on the training data _only_ (not the test data). These are then applied to the inputs of both the _training_ and the _test_ data. This helps avoid [data leakage](https://scikit-learn.org/stable/common_pitfalls.html#data-leakage).
- Analyse data pipeline performance with [TensorBoard Profiler](https://www.tensorflow.org/guide/data_performance_analysis).
- Use sparse tensors when most values are zero or missing (`np.nan`) (e.g., [TensorFlow](https://www.tensorflow.org/guide/sparse_tensor)).
- Take care with [datasets with imbalanced classes](https://developers.google.com/machine-learning/glossary/#class-imbalanced-dataset) (e.g., only a few positive samples).
- Best practices for [managing data with PyTorch Lightning](https://pytorch-lightning.readthedocs.io/en/stable/guides/data.html) and [scikit-learn](https://scikit-learn.org/stable/common_pitfalls.html).
- Models are often heavily optimised, while the data is less so. There are many good practices around [data-centric machine learning](https://datacentricai.org/).
- Consider sharing your data for reproducibility if you can.
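
The advice above about basing pre-processing transformations on the training data only can be sketched with NumPy. The synthetic data and the 80/20 split are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# 1. split first
indices = rng.permutation(len(data))
train, test = data[indices[:80]], data[indices[80:]]

# 2. fit the transformation on the training data only
train_mean = train.mean(axis=0)
train_std = train.std(axis=0)

# 3. apply the same training statistics to both subsets
train_scaled = (train - train_mean) / train_std
test_scaled = (test - train_mean) / train_std
```

The scaled test subset will generally not have exactly zero mean or unit standard deviation. That is expected, because no information from the test data was used to fit the transformation.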

### Other options

- [NVIDIA Data Loading Library (DALI)](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html)
    - A library for data loading and pre-processing to accelerate deep learning applications.
- [Ray Datasets](https://docs.ray.io/en/latest/data/dataset.html)
    - Load and exchange data in Ray libraries and applications. 
- [NVIDIA Replicator Composer](https://docs.omniverse.nvidia.com/app_isaacsim/app_isaacsim/tutorial_replicator_composer.html#replicator-composer)
    - A tool for creating synthetic data.
 
### Resources

- [Papers with code - Datasets](https://paperswithcode.com/datasets)
- [HuggingFace - Datasets](https://huggingface.co/datasets)
- [Google research datasets](https://ai.google/tools/datasets/)
- [Google Dataset Search](https://datasetsearch.research.google.com/)
- [Google Cloud public datasets](https://console.cloud.google.com/marketplace/browse?filter=solution-type:dataset&pli=1)
- [Kaggle Datasets](https://www.kaggle.com/datasets)
- [Torch Vision Datasets](https://pytorch.org/vision/stable/datasets.html)
- [Torch Text Datasets](https://pytorch.org/text/stable/datasets.html)
- [Torch Audio Datasets](https://pytorch.org/audio/stable/datasets.html)