Tip

This tutorial is available as a Jupyter notebook.

⏯️ Autoresume Training#

We all know the pain of having a training run die partway through. Composer’s autoresume feature provides a convenient way to pick back up automatically from the last checkpoint when you re-run a training script that previously failed.

We’ve put together this tutorial to demonstrate this feature in action and how you can activate it through the Composer trainer.

🐕 Autoresume via Watchdog: Composer autoresumption works best when coupled with automated node failure detection and retries on Mosaic AI training. See our platform docs page on enabling this feature for your runs.

Recommended Background#

This tutorial assumes you are familiar with the Composer trainer basics and its saving/checkpointing features.

Tutorial Goals and Concepts Covered#

The goal of this tutorial is to demonstrate autoresume by simulating a failed and re-run training script with the Composer trainer. More details on that below.

For a deeper look into the way autoresume works, check out our more in-depth notes.

Install Dependencies#

Install Composer, if it isn’t already installed.

%pip install mosaicml
# To install from source instead of the latest release, comment out the command above and uncomment the following one.
# %pip install git+https://github.com/mosaicml/composer.git

Training Script#

Let’s use the block of code below as our training script. Autoresume comes in handy when training gets interrupted for some reason (say, your node dies). In those cases, being able to seamlessly pick up from your last checkpoint can be a huge time saver. We can simulate such a situation by manually interrupting our “training script” (the cell below) and re-running it with autoresume!

To see this example in action, run this notebook twice.

The first time the notebook is run, the trainer starts from scratch and saves a checkpoint to the save_folder at the end of each epoch (save_interval='1ep') until it reaches max_duration or is interrupted.
Any subsequent time the notebook is run with autoresume=True, the trainer will resume from the latest checkpoint. If the latest checkpoint was saved at max_duration, meaning all training is finished, the Trainer will exit immediately with an error that no training would occur.

When the Trainer is configured with autoresume=True, it will automatically look for existing checkpoints and resume training. If no checkpoints exist, it’ll start a new training run. This allows you to automatically resume from any faults, with no code changes.
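The resume-or-start decision described above can be sketched in plain Python. This is only an illustration of the idea, not Composer's implementation: the helper names `find_latest_checkpoint` and `resume_or_start` are hypothetical, and the file naming pattern (`ep{epoch}-ba{batch}-rank0.pt`) is an assumption about how checkpoints in the folder are named.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed checkpoint naming, e.g. "ep3-ba15-rank0.pt". Adjust to your own setup.
_CKPT_PATTERN = re.compile(r"^ep(\d+)-ba(\d+)-rank0\.pt$")


def find_latest_checkpoint(save_folder: str) -> Optional[Path]:
    """Return the checkpoint with the highest (epoch, batch), or None if none exist."""
    folder = Path(save_folder)
    if not folder.is_dir():
        return None
    candidates = []
    for path in folder.iterdir():
        match = _CKPT_PATTERN.match(path.name)
        if match:
            # Compare numerically so that ep10 sorts after ep2.
            progress = (int(match.group(1)), int(match.group(2)))
            candidates.append((progress, path))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


def resume_or_start(save_folder: str) -> str:
    """Describe what a resumable script would do for this folder (hypothetical helper)."""
    latest = find_latest_checkpoint(save_folder)
    if latest is None:
        return "start new run"
    return f"resume from {latest.name}"
```

With autoresume=True you do not need to write anything like this yourself; the sketch only shows why the same script can serve both as the first launch and as every relaunch.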

To simulate a flaky spot instance, try interrupting the notebook (e.g. Ctrl-C) midway through the first training run (say, after epoch 0 is finished). Notice how the progress bars resume at the next epoch and do not repeat any training already completed.
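If you would like to see the interrupt-and-resume behavior in isolation before trying it with the trainer, here is a toy sketch that does not use Composer at all. The `train` function, the `SimulatedInterruption` exception (a stand-in for Ctrl-C), and the JSON state file are all hypothetical; the "checkpoint" is just a count of completed epochs.

```python
import json
import tempfile
from pathlib import Path


class SimulatedInterruption(Exception):
    """Stand-in for a Ctrl-C or a node failure (hypothetical)."""


def train(state_file: Path, max_epochs: int, fail_after_epoch: int = -1) -> list:
    """Toy loop: resume from state_file if present and return the epochs run in this call."""
    start_epoch = 0
    if state_file.exists():
        start_epoch = json.loads(state_file.read_text())["epochs_completed"]
    epochs_run = []
    for epoch in range(start_epoch, max_epochs):
        epochs_run.append(epoch)  # stand-in for one epoch of real training
        # "Checkpoint" at the end of every epoch, like save_interval='1ep'.
        state_file.write_text(json.dumps({"epochs_completed": epoch + 1}))
        if epoch == fail_after_epoch:
            raise SimulatedInterruption(f"interrupted after epoch {epoch}")
    return epochs_run


with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "state.json"
    try:
        train(state_file, max_epochs=5, fail_after_epoch=1)
    except SimulatedInterruption as err:
        print(err)  # prints "interrupted after epoch 1"
    # The second call picks up at epoch 2 and does not repeat epochs 0 and 1.
    print(train(state_file, max_epochs=5))  # prints [2, 3, 4]
```

The real trainer restores far more than an epoch counter (model weights, optimizer state, and so on), but the control flow is the same idea.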

This feature does not require code or configuration changes to distinguish between starting a new training run and automatically resuming from an existing one, making it easy to use Composer on preemptible cloud instances. Simply configure the instance to start Composer with the same command every time until training has finished!
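As a rough illustration of "the same command every time until training has finished", a small supervisor loop might look like the sketch below. The function name `run_until_success` and the retry limits are hypothetical, and the sketch assumes your training script exits with a non-zero code when it is interrupted and with 0 when training completes. On a real cluster this job is usually done by the scheduler or platform rather than by a Python loop.

```python
import subprocess
import time


def run_until_success(command: list, max_attempts: int = 10, wait_seconds: float = 0.0) -> int:
    """Re-run the same command until it exits with code 0; return the attempts used."""
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(command)
        if result.returncode == 0:
            return attempt
        # Optional pause before relaunching, e.g. to wait for a replacement node.
        time.sleep(wait_seconds)
    raise RuntimeError(f"command still failing after {max_attempts} attempts")


# Example call (train.py is a hypothetical script that sets autoresume=True):
# run_until_success(["python", "train.py"], max_attempts=5, wait_seconds=30)
```

Because the script itself decides whether to resume or start fresh, the supervisor never needs to pass a different flag or checkpoint path on relaunch.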

import torch
import torch.nn as nn
import torch.nn.functional as F

class MNISTModel(nn.Module):
    """Toy convolutional neural network architecture in pytorch for MNIST."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.num_classes = num_classes
        self.conv1 = nn.Conv2d(1, 16, (3, 3), padding=0)
        self.conv2 = nn.Conv2d(16, 32, (3, 3), padding=0)
        self.bn = nn.BatchNorm2d(32)
        self.fc1 = nn.Linear(32 * 16, 32)
        self.fc2 = nn.Linear(32, num_classes)

    def forward(self, x):
        out = self.conv1(x)
        out = F.relu(out)
        out = self.conv2(out)
        out = self.bn(out)
        out = F.relu(out)
        out = F.adaptive_avg_pool2d(out, (4, 4))
        out = torch.flatten(out, 1, -1)
        out = self.fc1(out)
        out = F.relu(out)
        return self.fc2(out)

import torch.utils.data
from torch.optim import SGD
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor

from composer import Trainer
from composer.models import ComposerClassifier

# Configure the trainer -- here, we train a simple MNIST classifier
model = ComposerClassifier(module=MNISTModel(num_classes=10), num_classes=10)
optimizer = SGD(model.parameters(), lr=0.01)
train_dataloader = torch.utils.data.DataLoader(
    dataset=MNIST('~/datasets', train=True, download=True, transform=ToTensor()),
    batch_size=2048,
)
eval_dataloader = torch.utils.data.DataLoader(
    # Evaluate on the held-out MNIST test split rather than the training split
    dataset=MNIST('~/datasets', train=False, download=True, transform=ToTensor()),
    batch_size=2048,
)

# When using `autoresume`, it is required to specify the `run_name`, so
# Composer will know which training run to resume
run_name = 'my_autoresume_training_run'

trainer = Trainer(
    model=model,
    max_duration='5ep',
    optimizers=optimizer,

    # Training data configuration
    train_dataloader=train_dataloader,
    train_subset_num_batches=5,  # For this example, limit each epoch to 5 batches

    # Evaluation configuration
    eval_dataloader=eval_dataloader,
    eval_subset_num_batches=5,  # For this example, limit evaluation to 5 batches

    # Checkpoint configuration
    run_name=run_name,
    save_folder='./my_autoresume_training_run', # Make sure to specify `save_folder` to enable saving
    save_interval='1ep',

    # Configure autoresume!
    autoresume=True,
)

print('Training!')

# Train!
trainer.fit()

# Print the number of trained epochs (should always be the `max_duration`, which is 5ep)
print(f'\nNumber of epochs trained: {trainer.state.timestamp.epoch}')

What next?#

You’ve now seen how autoresume can save you from the headache of unpredictable training interruptions.

For a deeper dive, check out our more in-depth notes on this feature.

In addition, please continue to explore our tutorials! Here are a few suggestions:

Continue learning about other Composer features like automatic gradient accumulation.
Explore more advanced applications of Composer like applying image segmentation to medical images or fine-tuning a transformer for sentiment classification.
Learn how to train without local storage.

Come get involved with MosaicML!#

We’d love for you to get involved with the MosaicML community in any of these ways:

Star Composer on GitHub#

Help make others aware of our work by starring Composer on GitHub.

Join the MosaicML Slack#

Head on over to the MosaicML Slack to join other ML efficiency enthusiasts. Come for the paper discussions, stay for the memes!

Contribute to Composer#

Is there a bug you noticed or a feature you’d like? File an issue or make a pull request!
