🔆 Sequence Length Warmup#
Natural Language Processing
Sequence Length Warmup linearly increases the sequence length (number of tokens per sentence) used to train a language model from a
min_seq_length to a
max_seq_length over some duration at the beginning of training. The underlying motivation is that sequence length is a proxy for the difficulty of an example, and this method assumes a simple curriculum where the model is trained on easy examples (by this definition) first. Sequence Length Warmup is able to reduce the training time of GPT-style models by ~1.5x while still achieving the same loss as baselines.
The sequence length used to train a model over the course of training. It increases linearly over the first 30% of training before reaching its full value for the remainder of training.
How to Use#
from composer import functional as cf def training_loop(model, train_loader): opt = torch.optim.Adam(model.parameters()) loss_fn = F.cross_entropy model.train() max_seq_length = 1024 curr_seq_length = 8 seq_length_step_size = 8 # in this example, we're going to define a warmup schedule that increases the # sequence length by 8 at every step until it reaches the maximum sequence length for epoch in range(num_epochs): for X, y in train_loader: curr_seq_length = max(max_seq_length, curr_seq_length + seq_length_step_size) X = cf.set_batch_sequence_length(X, curr_seq_length) y_hat = model(X) loss = loss_fn(y_hat, y) loss.backward() opt.step() opt.zero_grad()
from composer.trainer import Trainer from composer.algorithms import SeqLengthWarmup trainer = Trainer(model=model, train_dataloader=train_dataloader, eval_dataloader=eval_dataloader, max_duration='25ep', algorithms=[SeqLengthWarmup(max_seq_length=64)]) trainer.fit()
We implement this as a pre-processing step during the forward pass when training the model.
We found that running Sequence Length Warmup for 30% of training (i.e., setting
duration=0.3) provided the largest speedup that could still maintain full model quality on GPT-2 125M. We also recommend always ensuring that the sequence length is a multiple of eight in order to take advantage of hardware acceleration, such as Tensor Cores.
Sequence Length Warmup is a form of curriculum learning, a category of techniques that present examples in a structured or organized order, such as by difficulty. The particular heuristic it uses to determine example difficulty is the length of the sequence. Typically, sequences in language modeling tasks are sentences, so Sequence Length Warmup entails training a model on sentences of increasing length. Note that our implementation of Sequence Length Warmup (which follows that of Li et al., 2021) creates short sentences by truncating or segmenting longer sentences; it does not explicitly train on shorter sentences.
🚧 Sequence Length Warmup Truncates or Segments Sentences to Create Shorter Ones
To create shorter sentences, Sequence Length Warmup truncates longer sentences or breaks them into shorter segments. It does not explicitly train on only the shortest sentences in the corpus. This design decision is in line with Li et al., 2021, whose implementation was the basis for ours.
As the name suggests, Sequence Length Warmup starts by training on shorter sentences (determined by the
min_seq_length hyperparameter) and linearly increases sentence length to the full value (determined by the
max_seq_length hyperparameter) over the course of the beginning of training (the fraction of training specified by the
After this point, the model is trained exclusively on sentences of up to
Our experiments found that Sequence Length Warmup could speed up training by a factor of ~1.5x while achieving the same loss. Li et al., 2021 claim that Sequence Length Warmup also reduces the outliers in Adam’s variance term, which makes training more stable and permits training on larger batch sizes and larger learning rates without divergence.
✅ Sequence Length Warmup’s Tradeoff Between Quality and Training Speed
In our experiments, Sequence Length Warmup improves the attainable tradeoffs between training speed and the final quality of the trained model.
One of the key design decisions when performing Sequence Length Warmup is the manner in which the sentences are shortened to the appropriate length. There are two options for doing this:
Truncating the sentence, discarding everything beyond the desired sequence length.
Segmenting the sentence, breaking it up into segments of the desired sequence lenght and making all segments into separate trianing examples.
Curriculum Learning: A Regularization Method for Efficient and Stable Billion-Scale GPT Model Pre-Training by Conglong Li, Minjia Zhang, and Yuxiong He. Posted to arXiv in 2021.
The Composer implementation of this method and the accompanying documentation were produced by Moin Nadeem at MosaicML.