🤖 Algorithms#
Composer has a curated collection of speedup methods (“Algorithms”) that can be composed to easily create efficient training recipes.
Below is a brief overview of the algorithms currently in Composer. For more detailed information about each one, see its method card. Each algorithm has a functional implementation intended for use with your own training loop and an implementation intended for use with Composer’s trainer.
Name | tldr
---|---
ALiBi | Replaces attention with ALiBi
AugMix | Image-preserving data augmentations
BlurPool | Applies blur before pooling or downsampling
ChannelsLast | Uses channels last memory format (NHWC)
ColOut | Removes columns and rows from the image for augmentation and efficiency
CutMix | Combines pairs of examples in non-overlapping regions and mixes labels
CutOut | Randomly erases rectangular blocks from the image
EMA | Maintains an exponential moving average of model weights for use in evaluation
Factorize | Factorizes GEMMs into smaller GEMMs
Fused LayerNorm | Fuses the underlying LayerNorm kernels into a single kernel
Gated Linear Units | Swaps the building block from a Linear layer to a Gated Linear layer
Ghost BatchNorm | Uses a smaller number of samples to compute BatchNorm statistics
Gradient Clipping | Clips all gradients in the model based on the specified clipping_type
GyroDropout | Replaces Dropout layers with GyroDropout
Label Smoothing | Smooths the labels with a uniform prior
Layer Freezing | Progressively freezes layers during training
Low Precision LayerNorm | Forces LayerNorm to run in fp16 or bf16
MixUp | Blends pairs of examples and labels
Progressive Image Resizing | Increases the input image size during training
RandAugment | Applies a series of random augmentations
Sharpness Aware Minimization (SAM) | SAM optimizer that measures the sharpness of the optimization landscape
Selective Backprop | Drops examples with small loss contributions
Sequence Length Warmup | Progressively increases the sequence length during training
Squeeze-and-Excitation | Replaces eligible layers with Squeeze-Excite layers
Stochastic Depth | Replaces a specified layer with a stochastic version that randomly drops the layer or samples during training
Stochastic Weight Averaging (SWA) | Computes a running average of model weights
Weight Standardization | Standardizes convolutional weights along the input channel and kernel axes
Functional API#
The simplest way to use Composer’s algorithms is via the functional API. Composer’s algorithms can be grouped into three broad classes:
- data augmentations add additional transforms to the training data.
- model surgery algorithms modify the network architecture.
- training loop modifications change various properties of the training loop.
Data Augmentations#
Data augmentations can be inserted into your dataset.transforms similar to Torchvision’s transforms. For example, with 🎲 RandAugment:
import torch
from torchvision import datasets, transforms

from composer import functional as cf

# CIFAR-10 per-channel statistics (commonly used values)
mean = (0.4914, 0.4822, 0.4465)
std = (0.2470, 0.2435, 0.2616)

# The functional augmentation takes an image directly, so pass the function itself as the transform
c10_transforms = transforms.Compose([cf.randaugment_image,  # <---- Add RandAugment
                                     transforms.ToTensor(),
                                     transforms.Normalize(mean, std)])

dataset = datasets.CIFAR10('../data',
                           train=True,
                           download=True,
                           transform=c10_transforms)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=1024)
Some augmentations, such as ✂️ CutMix, act on a batch of inputs. Insert these in your training loop after a batch is loaded from the dataloader:
from composer import functional as cf

cutmix_alpha = 1
num_classes = 10

for batch_idx, (data, target) in enumerate(dataloader):
    # cutmix_batch mixes the inputs and returns the corresponding mixed targets
    # (see the CutMix method card for the exact signature in your Composer version)
    data, target = cf.cutmix_batch(
        data,
        target,
        alpha=cutmix_alpha,
        num_classes=num_classes,
    )
    optimizer.zero_grad()
    output = model(data)
    loss = loss_fn(output, target)
    loss.backward()
    optimizer.step()
Model Surgery#
Model surgery algorithms make direct modifications to the network itself. For example, applying 🏊 BlurPool inserts a blur layer before strided convolution layers, as demonstrated here:
from composer import functional as cf
import torchvision.models as models
model = models.resnet18()
cf.apply_blurpool(model)
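The surgery happens in place. A quick, purely illustrative way to see its effect is to print the network and look for the blur modules inserted ahead of the strided convolutions (and, depending on the options used, max pooling layers):

# Illustrative only: inspect the modified architecture; blur modules now
# precede the strided convolutions in the ResNet-18 created above.
print(model)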
For a transformer model, we can swap out the attention head of a 🤗 transformer with one from 🥸 ALiBi:
from composer import functional as cf
from composer.algorithms.alibi.gpt2_alibi import _attn
from composer.algorithms.alibi.gpt2_alibi import enlarge_mask

from transformers import GPT2Model
from transformers.models.gpt2.modeling_gpt2 import GPT2Attention

model = GPT2Model.from_pretrained("gpt2")

cf.apply_alibi(
    model=model,
    heads_per_layer=12,
    max_sequence_length=8192,
    position_embedding_attribute="module.transformer.wpe",
    attention_module=GPT2Attention,
    attr_to_replace="_attn",
    alibi_attention=_attn,
    mask_replacement_function=enlarge_mask,
)
Training Loop#
Methods such as 🏞️ Progressive Image Resizing or ❄️ Layer Freezing apply changes to the training loop. See their method cards for details on how to use them in your own code.
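As a rough illustration of the kind of change these methods make, here is a minimal sketch of the idea behind Progressive Image Resizing written in plain PyTorch rather than with Composer's implementation (see the method card for the actual functional API): images are downsampled early in training and the resolution is ramped back up as training progresses. The schedule and helper names below are hypothetical.

import torch.nn.functional as F

def image_scale(fraction_of_training, initial_scale=0.5, ramp_fraction=0.5):
    # Hypothetical schedule: start at half resolution and linearly return to
    # full resolution over the first half of training.
    progress = min(fraction_of_training / ramp_fraction, 1.0)
    return initial_scale + (1.0 - initial_scale) * progress

def downscale_batch(images, scale):
    # Downsample a batch of NCHW images by the given scale factor.
    if scale >= 1.0:
        return images
    return F.interpolate(images, scale_factor=scale, mode='bilinear',
                         align_corners=False)

# Inside a training loop, before the forward pass:
#     scale = image_scale(current_step / total_steps)
#     data = downscale_batch(data, scale)
#     output = model(data)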
Composer Trainer#
Building training recipes requires composing these different methods together, which is the purpose of our Trainer. Pass a list of algorithm instances to the trainer, and each one will automatically run at the appropriate time during training, with any collisions or reorderings handled as needed.
from composer import Trainer
from composer.algorithms import BlurPool, ChannelsLast

trainer = Trainer(
    model=model,
    algorithms=[ChannelsLast(), BlurPool()],
    train_dataloader=train_dataloader,
    eval_dataloader=test_dataloader,
    max_duration='10ep',
)
For more information, see: ⚙️ Using the Trainer and 🚌 Welcome Tour.
Two-way callbacks#
The way our algorithms insert themselves into the trainer is based on the two-way callback system developed by Howard et al. (2020). Algorithms interact with the training loop at various Events and effect their changes by modifying the trainer State.
Events denote locations inside the training procedure where algorithms can be run. In pseudocode, Composer’s events look as follows:
EVENT.INIT
state.model = model()
state.train_dataloader = train_dataloader()
state.optimizers = optimizers()
load_checkpoint()
EVENT.AFTER_LOAD
EVENT.FIT_START
for epoch in epochs:
    EVENT.EPOCH_START
    for batch in state.train_dataloader:
        EVENT.AFTER_DATALOADER
        EVENT.BATCH_START
        prepare_batch_for_training()
        EVENT.BEFORE_TRAIN_BATCH
        EVENT.BEFORE_FORWARD
        forward_pass()
        EVENT.AFTER_FORWARD
        EVENT.BEFORE_LOSS
        compute_loss()
        EVENT.AFTER_LOSS
        EVENT.BEFORE_BACKWARD
        backward_pass()
        EVENT.AFTER_BACKWARD
        EVENT.AFTER_TRAIN_BATCH
        optimizers.step()
        EVENT.BATCH_END
    EVENT.EPOCH_END
Complete definitions of these events can be found here. Some events have before and after flavors, which differ in the order that algorithms are run. On EVENT.BEFORE_X, algorithms passed to the trainer in order [A, B, C] are run in order [A, B, C]. On EVENT.AFTER_X, algorithms passed in order [A, B, C] are run in order [C, B, A]. This allows algorithms to cleanly undo their effects on state if necessary.
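As a toy sketch of this behavior (not the trainer's actual engine code), the ordering amounts to iterating the algorithm list forward on before events and in reverse on after events, so the most recent change to state is the first one undone:

# Illustrative only: how before/after events reverse the run order.
class ToyAlgorithm:
    # Stand-in for an Algorithm that just records when it runs.
    def __init__(self, name):
        self.name = name

    def apply(self, event, state, logger=None):
        print(f"{self.name} runs on {event}")

algorithms = [ToyAlgorithm("A"), ToyAlgorithm("B"), ToyAlgorithm("C")]
state = {}

# BEFORE_* events: run in the order the algorithms were passed -> A, B, C
for algorithm in algorithms:
    algorithm.apply("BEFORE_FORWARD", state)

# AFTER_* events: run in reverse so changes unwind like a stack -> C, B, A
for algorithm in reversed(algorithms):
    algorithm.apply("AFTER_FORWARD", state)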
Composer’s state tracks relevant quantities for the training procedure. The code for state can be found here. Algorithms can modify state, and therefore modify the training procedure.
To implement a custom algorithm, create a class that inherits from Composer’s Algorithm class and implements a match method, which specifies the event(s) on which the algorithm should run, and an apply method, which specifies how the algorithm should modify quantities in state.
The match method simply takes state and the current event as arguments, determines whether or not the algorithm should run, and returns True if it should and False otherwise. In code, a simple match might look like this:
def match(self, event, state):
    return event in [Event.AFTER_DATALOADER, Event.AFTER_FORWARD]
This will cause the algorithm to run on the AFTER_DATALOADER and AFTER_FORWARD events. Note that a given algorithm might run on multiple events.
The apply method also takes state and the current event as arguments. Based on this information, apply carries out the appropriate algorithm logic, and modifies state with the changes necessary. In code, an apply might look like this:
def apply(self, event, state, logger):
    if event == Event.AFTER_DATALOADER:
        state.batch = process_inputs(state.batch)
    if event == Event.AFTER_FORWARD:
        state.outputs = process_outputs(state.outputs)
Note that different logic can be used for different events.
Packaging this all together into a class gives the object that Composer can run:
from composer.core import Algorithm, Event

class MyAlgorithm(Algorithm):
    def __init__(self, hparam1=1):
        self.hparam1 = hparam1

    def match(self, event, state):
        return event in [Event.AFTER_DATALOADER, Event.AFTER_FORWARD]

    def apply(self, event, state, logger):
        if event == Event.AFTER_DATALOADER:
            state.batch = process_inputs(state.batch, self.hparam1)
        if event == Event.AFTER_FORWARD:
            state.outputs = process_outputs(state.outputs)
Using this in training can be done the same way as with Composer’s native algorithms.
from composer import Trainer
from composer.algorithms.blurpool import BlurPool
from composer.algorithms.channels_last import ChannelsLast

channels_last = ChannelsLast()
blurpool = BlurPool(replace_convs=True, replace_maxpools=True, blur_first=True)
custom_algorithm = MyAlgorithm(hparam1=1)

trainer = Trainer(model=model,
                  train_dataloader=train_dataloader,
                  eval_dataloader=test_dataloader,
                  max_duration='90ep',
                  device='gpu',
                  algorithms=[channels_last, blurpool, custom_algorithm],
                  eval_interval="0ep",
                  seed=42)