⏯️ Autoresume Training#
We all know the pain of having a training run die in the middle of training. Composer’s autoresume feature provides a convenient way to automatically pick back up from the last checkpoint when you re-run a previously failed training script.
We’ve put together this tutorial to demonstrate this feature in action and how you can activate it through the Composer trainer.
Recommended Background#
This tutorial assumes you are familiar with the Composer trainer basics and its saving/checkpointing features.
Tutorial Goals and Concepts Covered#
The goal of this tutorial is to demonstrate autoresume by simulating a failed and re-run training script with the Composer trainer. More details on that below.
For a deeper look into how autoresume works, check out our more in-depth notes.
Install Dependencies#
Install Composer, if it isn’t already installed.
[ ]:
%pip install mosaicml
# To install from source instead of the last release, comment the command above and uncomment the following one.
# %pip install git+https://github.com/mosaicml/composer.git
Training Script#
Let’s use the block of code below as our training script. Autoresume comes in handy when training gets interrupted for some reason (say, your node dies). In those cases, being able to seamlessly pick up from your last checkpoint can be a huge time saver. We can simulate such a situation by manually interrupting our “training script” (the cell below) and re-running with autoresume!
To see this example in action, run this notebook twice.
The first time the notebook is run, the trainer will save checkpoints to the save_folder and train until max_duration is reached. Any subsequent time the notebook is run, the trainer will resume from the latest checkpoint if autoresume=True. If the latest checkpoint was saved at max_duration, meaning all training is finished, the Trainer will exit immediately with an error that no training would occur.
When the Trainer is configured with autoresume=True, it will automatically look for existing checkpoints and resume training. If no checkpoints exist, it’ll start a new training run. This allows you to automatically resume from any faults, with no code changes.
To simulate a flaky spot instance, try interrupting the notebook (e.g., Ctrl-C or the kernel’s interrupt button) midway through the first training run (say, after epoch 0 finishes). When you re-run, notice how the progress bars resume at the next epoch rather than repeating training that was already completed.
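If you’d rather trigger the failure programmatically than interrupt the kernel by hand, a small callback can raise an exception partway through training. The sketch below is a hypothetical helper (not part of the original tutorial); it assumes Composer’s Callback API and its epoch_end event, and you would pass an instance to the Trainer via its callbacks argument.
[ ]:
from composer import Callback, Logger, State

class SimulatedFailure(Callback):
    """Hypothetical helper: pretend the node died after the first epoch."""

    def epoch_end(self, state: State, logger: Logger) -> None:
        # Crash once epoch 0 has finished, so a later run with
        # autoresume=True has a checkpoint to resume from.
        if state.timestamp.epoch.value >= 1:
            raise RuntimeError('Simulated node failure -- re-run to autoresume!')
If you use something like this, add callbacks=[SimulatedFailure()] to the Trainer configuration below; otherwise, interrupting the kernel works just as well.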
This feature does not require code or configuration changes to distinguish between starting a new training run and automatically resuming from an existing one, which makes it easy to use Composer on preemptible cloud instances. Simply configure the instance to start Composer with the same command every time until training has finished!
[ ]:
import torch.utils.data
from torch.optim import SGD
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
from composer import Trainer
from composer.models.classify_mnist import mnist_model
# Configure the trainer -- here, we train a simple MNIST classifier
model = mnist_model(num_classes=10)
optimizer = SGD(model.parameters(), lr=0.01)
train_dataloader = torch.utils.data.DataLoader(
dataset=MNIST('~/datasets', train=True, download=True, transform=ToTensor()),
batch_size=2048,
)
eval_dataloader = torch.utils.data.DataLoader(
dataset=MNIST('~/datasets', train=True, download=True, transform=ToTensor()),
batch_size=2048,
)
# When using `autoresume`, it is required to specify the `run_name`, so
# Composer will know which training run to resume
run_name = 'my_autoresume_training_run'
trainer = Trainer(
model=model,
max_duration='5ep',
optimizers=optimizer,
# Training data configuration
train_dataloader=train_dataloader,
train_subset_num_batches=5, # For this example, limit each epoch to 5 batches
# Evaluation configuration
eval_dataloader=eval_dataloader,
eval_subset_num_batches=5, # For this example, limit evaluation to 5 batches
# Checkpoint configuration
run_name=run_name,
save_folder='./my_autoresume_training_run', # Make sure to specify `save_folder` to enable saving
save_interval='1ep',
# Configure autoresume!
autoresume=True,
)
print('Training!')
# Train!
trainer.fit()
# Print the number of trained epochs (should always be the `max_duration`, which is 5ep)
print(f'\nNumber of epochs trained: {trainer.state.timestamp.epoch}')
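After a run (complete or interrupted), you can peek at what the trainer wrote to the save_folder. The snippet below simply lists the directory contents, so it makes no assumptions about Composer’s checkpoint file names.
[ ]:
import os

save_folder = './my_autoresume_training_run'
if os.path.isdir(save_folder):
    # List whatever checkpoint files Composer has written so far.
    for filename in sorted(os.listdir(save_folder)):
        print(filename)
else:
    print(f'No checkpoints found in {save_folder} yet.')
If you ever need to resume from a particular checkpoint without autoresume, you can pass one of these files to the Trainer’s load_path argument instead.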
What next?#
You’ve now seen how autoresume can save you from the headache of unpredictable training interruptions.
For a deeper dive, check out our in-depth notes on this feature.
In addition, please continue to explore our tutorials! Here are a couple of suggestions:
Continue learning about other Composer features like automatic gradient accumulation.
Explore more advanced applications of Composer like applying image segmentation to medical images or fine-tuning a transformer for sentiment classification.
Learn how to train without local storage.
Come get involved with MosaicML!#
We’d love for you to get involved with the MosaicML community in any of these ways:
Star Composer on GitHub#
Help make others aware of our work by starring Composer on GitHub.
Join the MosaicML Slack#
Head on over to the MosaicML Slack to join other ML efficiency enthusiasts. Come for the paper discussions, stay for the memes!
Contribute to Composer#
Is there a bug you noticed or a feature you’d like? File an issue or make a pull request!