This tutorial is available as a Jupyter notebook.
♻️ Auto Grad Accum
This notebook will demonstrate how to use automatic gradient accumulation to avoid CUDA OOMs, regardless of your batch size choice, GPU type, and number of devices. Experiment with different combinations and see how it works!
For details of the implementation, see our Auto Grad Accum documentation.
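For readers new to the underlying technique, here is a minimal, framework-free sketch of gradient accumulation itself. It is not Composer's implementation: it assumes a hypothetical one-parameter linear model with a mean squared error loss, and all function names are illustrative. It shows that splitting a batch into equal-sized microbatches and averaging the per-microbatch gradients reproduces the full-batch gradient, while only one microbatch has to be processed at a time.

```python
# Toy illustration of gradient accumulation (all names here are hypothetical).
# Model: y_hat = w * x, loss: mean squared error over the batch.

def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, grad_accum):
    """Split the batch into `grad_accum` equal microbatches and average their gradients."""
    assert len(xs) % grad_accum == 0, "sketch assumes the batch divides evenly"
    micro = len(xs) // grad_accum
    total = 0.0
    for i in range(grad_accum):
        lo, hi = i * micro, (i + 1) * micro
        # Each microbatch gradient is scaled by 1/grad_accum before being summed.
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / grad_accum
    return total


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5

full = mse_grad(w, xs, ys)
accum = accumulated_grad(w, xs, ys, grad_accum=4)
print(abs(full - accum) < 1e-9)  # the two gradients agree up to rounding
```

The trade-off is time for memory: the same gradient is obtained, but the forward and backward passes run once per microbatch instead of once per batch.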
We’ll start by installing Composer:
%pip install mosaicml
Set Up Our Workspace
We are going to use the CIFAR10 dataset with a ResNet56 model and some standard optimization settings. For the purposes of this notebook, we’ll choose a very large batch size and also increase the image size to 96x96, so that you would typically hit CUDA out-of-memory errors on most GPUs.
import torch

import composer
from torchvision import datasets, transforms

torch.manual_seed(42)  # For replicability

data_directory = "./data"

# Normalization constants
mean = (0.507, 0.487, 0.441)
std = (0.267, 0.256, 0.276)

# choose a very large batch size
batch_size = 2048

cifar10_transforms = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean, std),
    transforms.Resize(size=[96, 96])  # choose a large image size
])

train_dataset = datasets.CIFAR10(data_directory, train=True, download=True, transform=cifar10_transforms)
test_dataset = datasets.CIFAR10(data_directory, train=False, download=True, transform=cifar10_transforms)

train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_dataloader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=True)
from composer import models

model = models.ComposerResNetCIFAR(model_name='resnet_56', num_classes=10)
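To get a rough sense of why these settings stress GPU memory, the sketch below computes the size of the input tensor alone for one batch, assuming 32-bit floats (4 bytes per value). The helper name is illustrative. Activations stored for the backward pass are typically much larger than the input, so treat these numbers only as a lower bound on what a batch needs.

```python
def input_batch_bytes(batch_size, channels, height, width, bytes_per_value=4):
    """Size in bytes of one float32 input batch of shape (N, C, H, W)."""
    return batch_size * channels * height * width * bytes_per_value


native = input_batch_bytes(2048, 3, 32, 32)   # CIFAR10's native 32x32 images
resized = input_batch_bytes(2048, 3, 96, 96)  # after Resize(size=[96, 96])

print(f"32x32 batch: {native / 2**20:.0f} MiB")   # 24 MiB
print(f"96x96 batch: {resized / 2**20:.0f} MiB")  # 216 MiB
print(f"ratio: {resized / native:.0f}x")          # 9x
```

Tripling the side length multiplies the pixel count by nine, and per-layer activation memory in a convolutional network grows roughly in step with it.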
Train a Baseline Model
Now we run our trainer code with the grad_accum='auto' setting. Note that this demo requires a GPU to demonstrate automatic gradient accumulation.
assert torch.cuda.is_available(), "Demonstrating automatic gradient accumulation requires a GPU."

optimizer = composer.optim.DecoupledSGDW(
    model.parameters(),  # Model parameters to update
    lr=0.05,
    momentum=0.9,
)

trainer = composer.trainer.Trainer(
    model=model,
    train_dataloader=train_dataloader,
    eval_dataloader=test_dataloader,
    optimizers=optimizer,
    max_duration="1ep",
    grad_accum='auto',
    device='gpu'
)

trainer.fit()
Depending on your GPU type, you should see log messages before training starts, showing the trainer increasing the gradient accumulation dynamically until the model fits into memory, e.g. something like:
INFO:composer.trainer.trainer:CUDA out of memory detected. Gradient Accumulation increased from 1 -> 2, and the batch will be retrained.
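The behavior described by that log line can be pictured as a retry loop. The following is a simplified, CPU-only sketch rather than Composer's actual code: `FakeOOM`, `run_batch`, and `fit_grad_accum` are hypothetical names, the memory limit is simulated by a maximum microbatch size, and doubling on each failure is an assumption suggested by the 1 -> 2 step in the log.

```python
class FakeOOM(RuntimeError):
    """Stand-in for a CUDA out-of-memory error (hypothetical, for illustration)."""


def run_batch(batch_size, grad_accum, max_microbatch):
    """Pretend to train one batch; fail if a microbatch exceeds the simulated limit."""
    microbatch = -(-batch_size // grad_accum)  # ceiling division
    if microbatch > max_microbatch:
        raise FakeOOM(f"microbatch of {microbatch} samples does not fit")


def fit_grad_accum(batch_size, max_microbatch):
    """Increase grad_accum until the batch fits, then return the value found."""
    grad_accum = 1
    while True:
        try:
            run_batch(batch_size, grad_accum, max_microbatch)
            return grad_accum
        except FakeOOM:
            if grad_accum >= batch_size:
                raise  # even a single sample does not fit
            new_value = min(grad_accum * 2, batch_size)
            print(f"OOM detected. Gradient accumulation increased from {grad_accum} -> {new_value}")
            grad_accum = new_value


# With a simulated limit of 300 samples per microbatch, a batch of 2048
# is retried at grad_accum = 2, 4, and finally 8 (microbatches of 256).
print(fit_grad_accum(batch_size=2048, max_microbatch=300))
```

In this sketch the failed batch is simply retried with the new setting, which mirrors the "batch will be retrained" wording in the log; a real implementation also has to release the memory held by the failed attempt before retrying.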
Experiment with different batch sizes and image sizes, and notice that the trainer adjusts to avoid CUDA out-of-memory errors, so you do not have to manually tweak the gradient accumulation to get the model to fit!
For more details, see our Auto Grad Accum documentation.