Tip

This tutorial is available as a Jupyter notebook.

Open in Colab

🖼️ Getting Started#

Welcome to Composer! If you’re wondering how this thing works, then you’re in the right place. This is the first in our series of tutorials aimed at getting you familiar with the core concepts, capabilities, and patterns that make Composer such a useful tool!

Tutorial Goals and Concepts Covered#

We’re going to focus this introduction to Composer around an old classic: training a ResNet-56 model on CIFAR-10.

Our focus here is on running this with the Composer Trainer.

As part of using the Trainer, we’ll set up:

  • Dataloaders for CIFAR-10

  • A ResNet-56 model

  • An optimizer and a learning rate scheduler

  • A logger

We’ll show how to assemble these pieces and train with the Trainer.

And finally, we’ll demonstrate training with speed-up algorithms!

Along the way, we’ll illustrate some ways to visualize training progress using loggers.

Let’s get started!

Install Composer#

We’ll start by installing Composer:

[ ]:
%pip install mosaicml
# To install from source instead of the last release, comment the command above and uncomment the following one.
# %pip install git+https://github.com/mosaicml/composer.git

Set Up Our Workspace#

Imports#

In this section we’ll set up our workspace. We’ll import the necessary packages, and set up our dataset and trainer. First, the imports:

[ ]:
import time

import torch
import torch.utils.data

import composer
import matplotlib.pyplot as plt

from torchvision import datasets, transforms
from composer.loggers import InMemoryLogger

torch.manual_seed(42) # For replicability

Dataset and DataLoader#

Now we’ll start setting up the ingredients for the Composer Trainer! Let’s start with dataloaders…

Here, we instantiate our CIFAR-10 dataset and dataloader. Composer has its own CIFAR-10 dataset and dataloaders for convenience, but we’ll stick with the Torchvision CIFAR-10 and PyTorch dataloader for the sake of familiarity.

As a bit of extra detail, there are three ways of passing training and/or evaluation dataloaders to the Trainer. If you’re used to working with PyTorch dataloaders, good news—that’s one of the ways (and the one we use here; we sketch one of the other options right after the next cell). The code below should look pretty familiar if you’re coming from PyTorch, but if not, now may be a good time to familiarize yourself with those basics.

[ ]:
data_directory = "./data"

# Normalization constants
mean = (0.507, 0.487, 0.441)
std = (0.267, 0.256, 0.276)

batch_size = 1024

cifar10_transforms = transforms.Compose([transforms.ToTensor(), transforms.Normalize(mean, std)])

train_dataset = datasets.CIFAR10(data_directory, train=True, download=True, transform=cifar10_transforms)
test_dataset = datasets.CIFAR10(data_directory, train=False, download=True, transform=cifar10_transforms)

# Our train and test dataloaders are PyTorch DataLoader objects!
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_dataloader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=True)
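
For reference, another of those options is to wrap a dataloader in Composer’s DataSpec, which lets you attach extra information (for example, how to split a batch or count its samples). Here’s a minimal sketch, not needed for this tutorial:

[ ]:
from composer.core import DataSpec

# Minimal sketch (not required for this tutorial): a DataSpec wraps a dataloader so you
# can attach extra information, such as custom batch splitting or sample counting.
train_dataspec = DataSpec(dataloader=train_dataloader)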

Model#

Next, we create our model. We’re using Composer’s built-in ResNet56. To use your own custom model, please see this custom model example.

Note: The model below is an instance of a ComposerModel. Models need to be wrapped in this class, which provides a convenient interface between the underlying PyTorch model and the Trainer.

[ ]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from composer.models import ComposerClassifier

class Block(nn.Module):
    """A ResNet block."""

    def __init__(self, f_in: int, f_out: int, downsample: bool = False):
        super(Block, self).__init__()

        stride = 2 if downsample else 1
        self.conv1 = nn.Conv2d(f_in, f_out, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(f_out)
        self.conv2 = nn.Conv2d(f_out, f_out, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(f_out)
        self.relu = nn.ReLU(inplace=True)

        # Shortcut connection: a 1x1 projection when the shape changes, otherwise an identity (no parameters).
        if downsample or f_in != f_out:
            self.shortcut = nn.Sequential(
                nn.Conv2d(f_in, f_out, kernel_size=1, stride=2, bias=False),
                nn.BatchNorm2d(f_out),
            )
        else:
            self.shortcut = nn.Sequential()

    def forward(self, x: torch.Tensor):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += self.shortcut(x)
        return self.relu(out)

class ResNetCIFAR(nn.Module):
    """A residual neural network as originally designed for CIFAR-10."""

    def __init__(self, outputs: int = 10):
        super(ResNetCIFAR, self).__init__()

        depth = 56
        width = 16
        num_blocks = (depth - 2) // 6

        plan = [(width, num_blocks), (2 * width, num_blocks), (4 * width, num_blocks)]

        self.num_classes = outputs

        # Initial convolution.
        current_filters = plan[0][0]
        self.conv = nn.Conv2d(3, current_filters, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(current_filters)
        self.relu = nn.ReLU(inplace=True)

        # The subsequent blocks of the ResNet.
        blocks = []
        for segment_index, (filters, num_blocks) in enumerate(plan):
            for block_index in range(num_blocks):
                downsample = segment_index > 0 and block_index == 0
                blocks.append(Block(current_filters, filters, downsample))
                current_filters = filters

        self.blocks = nn.Sequential(*blocks)

        # Final fc layer. Size = number of filters in last segment.
        self.fc = nn.Linear(plan[-1][0], outputs)
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, x: torch.Tensor):
        out = self.relu(self.bn(self.conv(x)))
        out = self.blocks(out)
        out = F.avg_pool2d(out, out.size()[3])
        out = out.view(out.size(0), -1)
        out = self.fc(out)
        return out

model = ComposerClassifier(module=ResNetCIFAR(), num_classes=10)
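
As an aside, any torch.nn.Module that returns class logits can be wrapped the same way. Here’s a hypothetical example (not part of this tutorial’s recipe) using a torchvision ResNet-18:

[ ]:
# Hypothetical illustration only; not used anywhere else in this tutorial.
from torchvision.models import resnet18

torchvision_model = ComposerClassifier(module=resnet18(num_classes=10), num_classes=10)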

Optimizer and Scheduler#

Now let’s create the optimizer and LR scheduler. We’re using MosaicML’s SGD with decoupled weight decay:

[ ]:
optimizer = composer.optim.DecoupledSGDW(
    model.parameters(), # Model parameters to update
    lr=0.05, # Peak learning rate
    momentum=0.9,
    weight_decay=2.0e-3 # If this looks large, it's because it's not scaled by the LR as in non-decoupled weight decay
)
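
If you’re curious what “decoupled” means here: in classic SGD with L2 regularization, the decay term is folded into the gradient and therefore scaled by the learning rate, whereas a decoupled optimizer applies the decay as its own step. The cell below is a simplified single-parameter sketch of that difference (not Composer’s actual optimizer code):

[ ]:
# Simplified sketch of one SGD step on a single parameter (not the actual optimizer code).
w = torch.tensor(1.0)
grad = torch.tensor(0.5)
lr, wd = 0.05, 2.0e-3

w_coupled = w - lr * (grad + wd * w)   # classic L2 penalty: decay strength scales with the LR
w_decoupled = w - lr * grad - wd * w   # decoupled: decay applied separately from the LR
print(w_coupled.item(), w_decoupled.item())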

We’ll assume this is being run on Colab, which means training for hundreds of epochs would take a very long time. Instead we’ll train our baseline model for three epochs. The first epoch will be linear warmup, followed by two epochs of constant LR. We achieve this by instantiating a LinearWithWarmupScheduler class.

Note: Composer provides a handful of different schedulers to help customize your training!

[ ]:
lr_scheduler = composer.optim.LinearWithWarmupScheduler(
    t_warmup="1ep", # Warm up over 1 epoch
    alpha_i=1.0, # Flat LR schedule achieved by having alpha_i == alpha_f
    alpha_f=1.0
)
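
To sanity-check the shape of this schedule over our three epochs, here’s a tiny sketch of the learning rate multiplier it implies (an illustration only, not Composer’s internal computation):

[ ]:
# Rough shape of the schedule (illustration only, not Composer's internals):
# linear warmup to 1.0 over the first of three epochs, then flat because alpha_i == alpha_f.
def lr_multiplier(frac_of_training, warmup_frac=1/3):
    return min(frac_of_training / warmup_frac, 1.0)

print([round(lr_multiplier(f), 2) for f in (0.0, 1/6, 1/3, 2/3, 1.0)])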

Logging#

Finally, we instantiate an InMemoryLogger that records all the data from the Composer Trainer. We will use this logger to generate data plots after we complete training.

[ ]:
# "baseline" = no algorithms (which is what we're doing now)
logger_for_baseline = InMemoryLogger()

Train a Baseline Model#

And now we create our trainer!

The Trainer class is the workhorse of Composer. You may be wondering what exactly it does. In short, the Trainer class takes a handful of ingredients (e.g., the model, data loaders, algorithms) and instructions (e.g., training duration, device) and composes them into a single object (here, trainer) that can manage the entire training loop described by those inputs. This lets you focus on the higher-level details of training without having to worry about things like distributed training, scheduling, memory issues, and all other kinds of low-level headaches.

If you want to learn more about the Trainer, we recommend a deeper dive through our docs and the API reference! In the meantime, you can follow the patterns in this tutorial to get going quickly.

Here’s a quick reference for the parameters we’re specifying below:

  • model: The model to train, an instance of ComposerModel (in this case a ResNet-56)

  • train_dataloader: A data loader supplying the training data. More info here.

  • eval_dataloader: A data loader supplying the data used during evaluation (see same reference for train_dataloader). Model-defined evaluation metrics will be aggregated across this dataset each evaluation round.

  • max_duration: The training duration. You can use integers to specify the number of epochs or provide a Time string – e.g., "50ba" or "2ep" for 50 batches and 2 epochs, respectively.

  • optimizer: The optimizer used to update the model during training. More info here.

  • schedulers: Any schedulers used to schedule hyperparameter (e.g., learning rate) values over the course of training. More info here.

  • device: The device to use (e.g., CPU or GPU).

  • loggers: Any loggers to use during training. More info here.

[ ]:
train_epochs = "3ep" # Train for 3 epochs because we're assuming Colab environment and hardware
device = "gpu" if torch.cuda.is_available() else "cpu" # select the device

trainer = composer.trainer.Trainer(
    model=model,
    train_dataloader=train_dataloader,
    eval_dataloader=test_dataloader,
    max_duration=train_epochs,
    optimizers=optimizer,
    schedulers=lr_scheduler,
    device=device,
    loggers=logger_for_baseline,
)

We train and measure the training time below.

[ ]:
start_time = time.perf_counter()
trainer.fit() # <-- Your training loop in action!
end_time = time.perf_counter()
print(f"It took {end_time - start_time:0.4f} seconds to train")

If you’re running this on Colab, the runtime will vary a lot based on the instance. We found that the three epochs of training could take anywhere from 120-550 seconds to run, and the mean validation accuracy was typically in the range of 25%-40%.

Extract and Plot Logged Data#

We can now plot our validation accuracy over training…

Now, you might be thinking, “I don’t remember logging any validation accuracy!” And you’d be right—sort of. That’s because all the logging happened automatically during the trainer.fit() call. The values that we’ll plot were set up by the model, which defines which metric(s) to measure during evaluation (and possibly also during training). Check out the documentation for a deeper dive on the loggers Composer offers and how to interact with them!

[ ]:
timeseries_raw = logger_for_baseline.get_timeseries("metrics/eval/MulticlassAccuracy")
plt.plot(timeseries_raw['epoch'], timeseries_raw["metrics/eval/MulticlassAccuracy"])
plt.xlabel("Epoch")
plt.ylabel("Validation Accuracy")
plt.title("Accuracy per epoch with Baseline Training")
plt.show()

Use Algorithms to Speed Up Training#

One of the things we’re most excited about at MosaicML is our arsenal of speed-up algorithms. We used these algorithms to speed up training of ResNet-50 on ImageNet by up to 7.6x. Let’s try applying a few algorithms to make our ResNet-56 more efficient.

Before we jump in, here’s a quick primer on Composer speed-up algorithms. Each one is implemented as an Algorithm class, which adds the structure that defines what the algorithm does and when in the training loop it runs. Adding an algorithm to the training loop is as simple as creating an instance of it (using args/kwargs to set any hyperparameters) and passing it to the Trainer during initialization. We’ll see that in action below…

For our first algorithm here, let’s start with Label Smoothing, which serves as a form of regularization by interpolating between the target distribution and another distribution that usually has higher entropy.

[ ]:
label_smoothing = composer.algorithms.LabelSmoothing(0.1) # We're creating an instance of the LabelSmoothing algorithm class
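
For intuition, here’s what label smoothing does to a single one-hot target under the standard formulation (a conceptual sketch, not necessarily Composer’s exact implementation):

[ ]:
# Conceptual sketch of label smoothing on a single one-hot target (standard formulation;
# shown for intuition, not necessarily Composer's exact implementation).
num_classes, smoothing = 10, 0.1
one_hot = torch.nn.functional.one_hot(torch.tensor(3), num_classes).float()
smoothed = (1 - smoothing) * one_hot + smoothing / num_classes
print(smoothed) # the true class keeps most of the probability mass; the rest is spread uniformly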

Let’s also use BlurPool, which increases accuracy by applying a spatial low-pass filter before the pooling operation in max pooling and before strided convolutions.

[ ]:
blurpool = composer.algorithms.BlurPool(
    replace_convs=True, # Blur before convs
    replace_maxpools=True, # Blur before max-pools
    blur_first=True # Blur before conv/max-pool
)
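
Conceptually, BlurPool is anti-aliased downsampling: apply a fixed low-pass (blur) filter first, then reduce resolution. Here’s a hand-rolled sketch of that idea (for intuition only, not Composer’s implementation):

[ ]:
# Conceptual sketch of anti-aliased downsampling (for intuition only, not Composer's implementation).
blur_kernel = torch.tensor([[1., 2., 1.], [2., 4., 2.], [1., 2., 1.]]) / 16.0
x = torch.randn(1, 1, 8, 8)
blurred = F.conv2d(x, blur_kernel.view(1, 1, 3, 3), padding=1) # low-pass filter first
downsampled = blurred[..., ::2, ::2] # then reduce resolution (stride-2 subsampling)
print(downsampled.shape)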

Our final algorithm in our improved training recipe is Progressive Image Resizing. Progressive Image Resizing initially shrinks the size of training images and slowly scales them back to their full size over the course of training. It increases throughput during the early phase of training, when the network may learn coarse-grained features that do not require the details lost by reducing image resolution.

[ ]:
prog_resize = composer.algorithms.ProgressiveResizing(
    initial_scale=.6, # Size of images at the beginning of training = .6 * default image size
    finetune_fraction=0.34, # Train on default size images for 0.34 of total training time.
)
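
To make those two arguments concrete, here’s a rough sketch of the image-scale schedule they imply: the scale ramps (roughly linearly) from initial_scale up to 1.0 over the first (1 - finetune_fraction) of training, then stays at full size (an illustration of the shape, not Composer’s internal code):

[ ]:
# Illustration of the scale schedule these settings imply (not Composer's internal code).
def image_scale(frac_of_training, initial_scale=0.6, finetune_fraction=0.34):
    ramp = min(frac_of_training / (1 - finetune_fraction), 1.0)
    return initial_scale + (1.0 - initial_scale) * ramp

print([round(image_scale(f), 2) for f in (0.0, 0.33, 0.66, 1.0)])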

We’ll assemble all our algorithms into a list to pass to our trainer.

[ ]:
algorithms = [label_smoothing, blurpool, prog_resize]

Now let’s instantiate our model, optimizer, logger, and trainer again. No need to instantiate our scheduler again because it’s stateless!

[ ]:
model = ComposerClassifier(module=ResNetCIFAR(), num_classes=10)
[ ]:
logger_for_algorithm_run = InMemoryLogger()

optimizer = composer.optim.DecoupledSGDW(
    model.parameters(),
    lr=0.05,
    momentum=0.9,
    weight_decay=2.0e-3
)

trainer = composer.trainer.Trainer(
    model=model,
    train_dataloader=train_dataloader,
    eval_dataloader=test_dataloader,
    max_duration=train_epochs,
    optimizers=optimizer,
    schedulers=lr_scheduler,
    device=device,
    loggers=logger_for_algorithm_run,
    algorithms=algorithms # Adding algorithms this time!
)

And let’s get training!

[ ]:
# Now we're cooking with algorithms!
start_time = time.perf_counter()
trainer.fit()
end_time = time.perf_counter()
three_epochs_accelerated_time = end_time - start_time
print(f"It took {three_epochs_accelerated_time:0.4f} seconds to train")

Again, the runtime will vary based on the instance, but we found that training took about 0.43x-0.75x as long (a 1.3x-2.3x speedup, corresponding to 90-400 seconds) relative to the baseline recipe without speed-up algorithms. We also found that validation accuracy was similar for the algorithm-enhanced and baseline recipes.

Bonus Training!#

Because Progressive Resizing increases data throughput (i.e. more samples per second), we can train for more iterations in the same amount of wall clock time. Let’s train our model for one additional epoch!

[ ]:
train_epochs = "1ep"

Resuming training means we’ll need to use a flat LR schedule:

[ ]:
lr_scheduler = composer.optim.scheduler.ConstantScheduler(alpha=1.0, t_max='1dur')

We can also get rid of progressive resizing, since we want to train on full-size images for this additional epoch. And because BlurPool has already modified the model in place, we don’t need to pass it again either:

[ ]:
algorithms = [label_smoothing]
[ ]:
logger_for_bonus_1ep = InMemoryLogger()

trainer = composer.trainer.Trainer(
    model=model,
    train_dataloader=train_dataloader,
    eval_dataloader=test_dataloader,
    max_duration=train_epochs,
    optimizers=optimizer,
    schedulers=lr_scheduler,
    device=device,
    loggers=logger_for_bonus_1ep,
    algorithms=algorithms,
)

start_time = time.perf_counter()
trainer.fit()

end_time = time.perf_counter()
final_epoch_accelerated_time = end_time - start_time
# Time for four epochs = time for three epochs + time for fourth epoch
four_epochs_accelerated_time = three_epochs_accelerated_time + final_epoch_accelerated_time
print(f"It took {four_epochs_accelerated_time:0.4f} seconds to train")

We found that using these speed-up algorithms for four epochs resulted in runtime similar to or less than three epochs without speed-up algorithms (120-550 seconds, depending on the instance), and that they usually improved validation accuracy by 5-15 percentage points, yielding validation accuracy in the range of 30%-50%.

Let’s plot the results from using Label Smoothing and Progressive Resizing!

[ ]:
# Baseline (no algorithms) data
baseline_timeseries = logger_for_baseline.get_timeseries("metrics/eval/MulticlassAccuracy")
baseline_epochs = baseline_timeseries['epoch']
baseline_acc = baseline_timeseries["metrics/eval/MulticlassAccuracy"]

# Composer data
with_algorithms_timeseries = logger_for_algorithm_run.get_timeseries("metrics/eval/MulticlassAccuracy")
with_algorithms_epochs = list(with_algorithms_timeseries["epoch"])
with_algorithms_acc = list(with_algorithms_timeseries["metrics/eval/MulticlassAccuracy"])

# Concatenate 3 epochs with Label Smoothing and ProgRes with 1 epoch without ProgRes
bonus_epoch_timeseries = logger_for_bonus_1ep.get_timeseries("metrics/eval/MulticlassAccuracy")
bonus_epoch_epochs = [with_algorithms_epochs[-1] + i for i in bonus_epoch_timeseries["epoch"]]
with_algorithms_epochs.extend(bonus_epoch_epochs)
with_algorithms_acc.extend(bonus_epoch_timeseries["metrics/eval/MulticlassAccuracy"])

# Print mean validation accuracies
print("Baseline Validation Mean: " + str(sum(baseline_acc)/len(baseline_acc)))
print("W/ Algs. Validation Mean: " + str(sum(with_algorithms_acc)/len(with_algorithms_acc)))

# Plot both sets of data
plt.plot(baseline_epochs, baseline_acc, label="Baseline")
plt.plot(with_algorithms_epochs, with_algorithms_acc, label="With Algorithms")
plt.legend()
plt.xlabel("Epoch")
plt.ylabel("Validation Accuracy")
plt.title("Accuracy and speed improvements with equivalent WCT")
plt.show()

What next?#

You’ve now seen a simple example of how to use the Composer Trainer and how to take advantage of useful features like learning rate scheduling, logging, and speed-up algorithms.

If you want to keep learning more, try looking through some of the documents linked throughout this tutorial to see if you can form a deeper intuition for how these examples were structured.

In addition, please continue to explore our tutorials!

Come get involved with MosaicML!#

We’d love for you to get involved with the MosaicML community in any of these ways:

Star Composer on GitHub#

Help make others aware of our work by starring Composer on GitHub.

Join the MosaicML Slack#

Head on over to the MosaicML Slack to join other ML efficiency enthusiasts. Come for the paper discussions, stay for the memes!

Contribute to Composer#

Is there a bug you noticed or a feature you’d like? File an issue or make a pull request!