For PyTorch schedulers, we step every epoch by default. To instead step every batch, set step_schedulers_every_batch=True:
```python
from composer import Trainer
from torch.optim.lr_scheduler import CosineAnnealingLR

trainer = Trainer(
    ...,
    schedulers=CosineAnnealingLR(optimizer, T_max=2),
    step_schedulers_every_batch=True,
)
```
When step_schedulers_every_batch is True, remember to specify the arguments to your PyTorch scheduler in units of batches, not epochs.
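As a small sketch of that conversion, the batch count and epoch count below are hypothetical values, not part of the example above:

```python
from torch.optim.lr_scheduler import CosineAnnealingLR

# Hypothetical numbers: a dataloader with 100 batches per epoch, trained for 2 epochs.
batches_per_epoch = 100                          # e.g. len(train_dataloader)
num_epochs = 2                                   # matches a max_duration of '2ep'
t_max_in_batches = batches_per_epoch * num_epochs

# Since the scheduler is stepped per batch, its horizon is given in batches:
# CosineAnnealingLR(optimizer, T_max=t_max_in_batches)
```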
Our experiments have shown better accuracy using stepwise (per-batch) schedulers, so this is the recommended setting in most cases.
Our schedulers take advantage of our Time abstraction to provide easier ways to set time. Time parameters can be provided in different units: samples ("sp"), tokens ("tok"), batches ("ba"), epochs ("ep"), and duration ("dur"). See Time.
For example, the below would step the learning rate at 30%, 50%, and 90% of the way through training:
```python
from composer import Trainer
from composer.optim.scheduler import MultiStepScheduler

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration='90ep',
    schedulers=MultiStepScheduler(
        milestones=['0.3dur', '0.5dur', '0.9dur'],
        gamma=0.1,
    ),
)
```
These schedulers typically read state.timestamp to determine the trainer's progress and return a learning rate multiplier. Inside the Trainer, we convert these to LambdaLR schedulers. By default, our schedulers are stepped at every batch.
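To make this concrete, here is a minimal sketch of a scheduler written as a plain callable. The function name is hypothetical, and the (state, ssr) -> float signature is an assumption based on the interface described above:

```python
from composer.core import State

def halve_after_ten_epochs(state: State, ssr: float = 1.0) -> float:
    """Hypothetical custom scheduler: returns a multiplier on the base learning rate."""
    # Read the trainer's progress from the timestamp and choose the multiplier.
    return 1.0 if state.timestamp.epoch.value < 10 else 0.5
```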
Below are the supported schedulers found in composer.optim.scheduler:
- StepScheduler: Decays the learning rate discretely at fixed intervals.
- MultiStepScheduler: Decays the learning rate discretely at fixed milestones.
- MultiStepWithWarmupScheduler: Decays the learning rate discretely at fixed milestones, with an initial warmup.
- ConstantScheduler: Maintains a fixed learning rate.
- LinearScheduler: Adjusts the learning rate linearly.
- LinearWithWarmupScheduler: Adjusts the learning rate linearly, with an initial warmup.
- ExponentialScheduler: Decays the learning rate exponentially.
- CosineAnnealingScheduler: Decays the learning rate according to the decreasing part of a cosine curve.
- CosineAnnealingWithWarmupScheduler: Decays the learning rate according to the decreasing part of a cosine curve, with an initial warmup.
- CosineAnnealingWarmRestartsScheduler: Cyclically decays the learning rate according to the decreasing part of a cosine curve.
- PolynomialScheduler: Sets the learning rate to be proportional to a power of the fraction of training time left.
- PolynomialWithWarmupScheduler: Decays the learning rate according to a power of the fraction of training time left, with an initial warmup.
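For example, one of the warmup variants can be configured in the same style as the MultiStepWithWarmupScheduler shown later; the constructor argument below is a sketch to verify against the API reference:

```python
from composer.optim.scheduler import CosineAnnealingWithWarmupScheduler

# Warm up for one epoch, then follow a cosine decay for the rest of training
# (assumed constructor arguments; check the API reference for the exact signature).
scheduler = CosineAnnealingWithWarmupScheduler(t_warmup="1ep")
```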
Compared to PyTorch schedulers, a ComposerScheduler does not need to be given an optimizer directly. The trainer handles binding the optimizer when it compiles the scheduler later.
Scale Schedule Ratio
The Scale Schedule Ratio (SSR) scales the learning rate schedule by some factor and is a powerful way to trade off training time and quality. scale_schedule_ratio is an argument to the Trainer.
Scale Schedule changes the training duration by a scaling factor and scales the learning rate scheduler accordingly. This serves to vary the training budget, making it possible to explore tradeoffs between cost (measured in time or money) and model quality.
For example, the code below will scale the training time by half (to 10 epochs) and also scale the learning rate schedule.
```python
from composer import Trainer
from composer.optim.scheduler import MultiStepScheduler

trainer = Trainer(
    ...,
    max_duration="20ep",
    schedulers=MultiStepScheduler(milestones=["10ep", "16ep"]),
    scale_schedule_ratio=0.5,
)

# or equivalently, with default SSR=1.0:
trainer = Trainer(
    ...,
    max_duration="10ep",
    schedulers=MultiStepScheduler(milestones=["5ep", "8ep"]),
)
```
Importantly, for our schedulers that have warmup, the warmup period is never scaled. For example, if we apply scale_schedule_ratio=0.5 to the scheduler below:
```python
from composer.optim.scheduler import MultiStepWithWarmupScheduler

scheduler = MultiStepWithWarmupScheduler(
    milestones=["10ep", "20ep"],
    t_warmup="4ep",
)
```
The resulting scheduler would warm up for 4 epochs and then have step milestones at 5 epochs and 10 epochs.
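To spell out the arithmetic (a small sketch; the 0.5 ratio is taken from the example above):

```python
ssr = 0.5
milestones_in_epochs = [10, 20]
scaled_milestones = [m * ssr for m in milestones_in_epochs]  # [5.0, 10.0] -> 5ep and 10ep
warmup_in_epochs = 4  # unchanged: the warmup period is not scaled by SSR
```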