This tutorial is available as a Jupyter notebook.
♻️ Auto Grad Accum#
Have you ever wanted to choose your batch size without having to stress about CUDA Out-of-Memory (OOM) errors? We sure have. That’s why we built Composer’s automatic gradient accumulation feature.
This tutorial will demonstrate how to use automatic gradient accumulation to avoid CUDA OOMs, regardless of your batch size choice, GPU type, and number of devices.
Note that this demo requires a GPU.
To follow this tutorial, you should be familiar with the basics of using the Composer trainer. Beyond that, it’s pretty straightforward.
Tutorial Goals and Concepts Covered#
The goal of this tutorial is to show you how to turn on automatic gradient accumulation and to provide a sandbox to play around with it a bit. Please feel free to experiment with different batch sizes and other configuration choices to see how it works!
For details of the implementation, see our Auto Grad Accum documentation.
Let’s get started!
Set Up Our Workspace#
We’ll start by installing Composer:
%pip install mosaicml
# To install from source instead of the last release, comment the command above and uncomment the following one.
# %pip install git+https://github.com/mosaicml/composer.git
We are going to use the CIFAR-10 dataset with a ResNet-56 model and some standard optimization settings. For the purposes of this tutorial, we’ll choose a very large batch size and increase the image size to 96x96. These settings will cause CUDA Out-of-Memory errors on most GPUs.
import torch
import composer
from torchvision import datasets, transforms

torch.manual_seed(42)  # For replicability

data_directory = "./data"

# Normalization constants
mean = (0.507, 0.487, 0.441)
std = (0.267, 0.256, 0.276)

# choose a very large batch size
batch_size = 2048

cifar10_transforms = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean, std),
    transforms.Resize(size=[96, 96])  # choose a large image size
])

train_dataset = datasets.CIFAR10(data_directory, train=True, download=True, transform=cifar10_transforms)
test_dataset = datasets.CIFAR10(data_directory, train=False, download=True, transform=cifar10_transforms)

train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_dataloader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=True)
from composer import models

model = models.composer_resnet_cifar(model_name='resnet_56', num_classes=10)

optimizer = composer.optim.DecoupledSGDW(
    model.parameters(),  # Model parameters to update
    lr=0.05,
    momentum=0.9,
)
Train a Baseline Model#
Now we run our trainer code with the grad_accum='auto' parameter set:
assert torch.cuda.is_available(), "Demonstrating automatic gradient accumulation requires a GPU."

trainer = composer.trainer.Trainer(
    model=model,
    train_dataloader=train_dataloader,
    eval_dataloader=test_dataloader,
    optimizers=optimizer,
    max_duration="1ep",
    grad_accum='auto',  # <--- Activate Composer magic!
    device='gpu'
)

# Train
trainer.fit()
Depending on your GPU type, you should see log messages showing the gradient accumulation rate being increased dynamically until the model fits into memory, before training starts. For example:
INFO:composer.trainer.trainer:CUDA out of memory detected. Gradient Accumulation increased from 1 -> 2, and the batch will be retrained.
Worry not! This just means everything is working as expected. With automatic gradient accumulation enabled, Composer responds to OOM errors during training by doubling the accumulation rate. Under the hood, each minibatch is split into n “microbatches”, where n is the accumulation rate, and gradients are accumulated across microbatches before stepping the optimizer. So, you should expect to see the accumulation rate increase until the resulting microbatch size fits on the device. This lets you focus on getting the best minibatch size without having to stress about what your hardware can handle.
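To make the doubling behavior concrete, here is a minimal pure-Python sketch of the idea. It is not Composer’s actual implementation: the `SimulatedOOM` exception, the `max_microbatch_size` limit (a stand-in for GPU memory), and all function names are hypothetical and exist only for illustration. The toy model is a single weight `w` with a squared-error loss, so the gradient can be computed by hand.

```python
class SimulatedOOM(RuntimeError):
    """Hypothetical stand-in for a CUDA out-of-memory error."""


def split_into_microbatches(batch, grad_accum):
    """Split a list of samples into at most `grad_accum` contiguous chunks."""
    size = -(-len(batch) // grad_accum)  # ceiling division
    return [batch[i:i + size] for i in range(0, len(batch), size)]


def microbatch_grad(w, microbatch):
    """Sum of per-sample gradients of (w * x - y) ** 2 with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in microbatch)


def accumulated_grad(w, batch, grad_accum, max_microbatch_size):
    """Accumulate gradients over microbatches, then average over the minibatch.

    Raises SimulatedOOM if a microbatch exceeds the pretend memory limit.
    """
    total = 0.0
    for microbatch in split_into_microbatches(batch, grad_accum):
        if len(microbatch) > max_microbatch_size:
            raise SimulatedOOM(f"microbatch of {len(microbatch)} samples does not fit")
        total += microbatch_grad(w, microbatch)
    return total / len(batch)


def step_with_auto_accum(w, batch, max_microbatch_size, grad_accum=1):
    """Retry the same minibatch, doubling grad_accum after each simulated OOM."""
    while True:
        try:
            return accumulated_grad(w, batch, grad_accum, max_microbatch_size), grad_accum
        except SimulatedOOM:
            if grad_accum >= len(batch):
                raise  # even single-sample microbatches do not fit
            print(f"Simulated OOM. Gradient accumulation increased from "
                  f"{grad_accum} -> {grad_accum * 2}")
            grad_accum *= 2


# Toy minibatch of 8 (x, y) samples; pretend only 2 samples fit in memory at once.
batch = [(x, 2 * x + 1) for x in range(1, 9)]
grad, grad_accum = step_with_auto_accum(3, batch, max_microbatch_size=2)
print(f"grad_accum={grad_accum}, gradient={grad}")
```

In this sketch the accumulated gradient matches what a single full-batch pass would produce, because the per-microbatch sums are added together before dividing by the minibatch size. That is why the choice of accumulation rate changes memory use but not the optimizer step for a loss that averages over samples.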
You’ve now seen how to turn on automatic gradient accumulation using the Composer trainer.
To dig deeper, see our Auto Grad Accum documentation.
In addition, please continue to explore our tutorials! Here are a couple of suggestions:
Continue learning about other Composer features like automatic restarting from checkpoints
Give your model life after training with Composer’s export for inference tools
Explore more advanced applications of Composer like applying image segmentation to medical images or fine-tuning a transformer for sentiment classification.
Come get involved with MosaicML!#
We’d love for you to get involved with the MosaicML community in any of these ways:
Help make others aware of our work by starring Composer on GitHub.
Head on over to the MosaicML Slack to join other ML efficiency enthusiasts. Come for the paper discussions, stay for the memes!