# composer.optim.scheduler

Stateless learning rate schedulers.

Stateless schedulers solve some of the problems associated with PyTorch’s built-in schedulers provided in torch.optim.lr_scheduler. The primary design goal of the schedulers provided in this module is to allow schedulers to interface directly with Composer’s time abstraction. This means that schedulers can be configured using arbitrary but explicit time units.

See ComposerScheduler for more information on stateless schedulers.

Functions

 compile_composer_scheduler Converts a stateless scheduler into a PyTorch scheduler object.

Classes

 ComposerScheduler Specification for a stateless scheduler function.
 ConstantScheduler Maintains a fixed learning rate.
 ConstantWithWarmupScheduler Maintains a fixed learning rate, with an initial warmup.
 CosineAnnealingScheduler Decays the learning rate according to the decreasing part of a cosine curve.
 CosineAnnealingWarmRestartsScheduler Cyclically decays the learning rate according to the decreasing part of a cosine curve.
 CosineAnnealingWithWarmupScheduler Decays the learning rate according to the decreasing part of a cosine curve, with an initial warmup.
 ExponentialScheduler Decays the learning rate exponentially.
 LinearScheduler Adjusts the learning rate linearly.
 LinearWithWarmupScheduler Adjusts the learning rate linearly, with an initial warmup.
 MultiStepScheduler Decays the learning rate discretely at fixed milestones.
 MultiStepWithWarmupScheduler Decays the learning rate discretely at fixed milestones, with an initial warmup.
 PolynomialScheduler Sets the learning rate to be proportional to a power of the fraction of training time left.
 PolynomialWithWarmupScheduler Decays the learning rate according to a power of the fraction of training time left, with an initial warmup.
 StepScheduler Decays the learning rate discretely at fixed intervals.

class composer.optim.scheduler.ComposerScheduler

Specification for a stateless scheduler function.

While this specification is provided as a Python class, an ordinary function can implement this interface as long as it matches the signature of this interface’s __call__() method.

For example, a scheduler that halves the learning rate after 10 epochs could be written as:

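    from composer import Trainer
    from composer.core import State

    def ten_epoch_decay_scheduler(state: State) -> float:
        # Full learning rate for the first 10 epochs, half afterwards.
        if state.timestamp.epoch < 10:
            return 1.0
        return 0.5

    # ten_epoch_decay_scheduler is a valid ComposerScheduler
    trainer = Trainer(
        schedulers=[ten_epoch_decay_scheduler],
        ...
    )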


To allow schedulers to be configured, they may also be written as callable classes:

    from composer.optim.scheduler import ComposerScheduler

    class VariableEpochDecayScheduler(ComposerScheduler):

        def __init__(self, num_epochs: int):
            self.num_epochs = num_epochs

        def __call__(self, state: State) -> float:
            # Full learning rate until num_epochs is reached, half afterwards.
            if state.timestamp.epoch < self.num_epochs:
                return 1.0
            return 0.5

    ten_epoch_decay_scheduler = VariableEpochDecayScheduler(num_epochs=10)
    # ten_epoch_decay_scheduler is also a valid ComposerScheduler
    trainer = Trainer(
        schedulers=[ten_epoch_decay_scheduler],
        ...
    )


The constructions of ten_epoch_decay_scheduler in each of the examples above are equivalent. Note that neither scheduler uses the scale_schedule_ratio parameter. As long as this parameter is not used when initializing Trainer, it is not required that any schedulers implement that parameter.
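
A scheduler that does honor the scale schedule ratio might look like the minimal sketch below. `FakeTimestamp` and `FakeState` are hypothetical stand-ins for Composer's real `State`, used only so the example runs without a trainer, and `ScaledEpochDecayScheduler` is an illustrative name. Multiplying the epoch threshold by `ssr` is one assumed way to implement the stretching described for `__call__()`.

```python
from dataclasses import dataclass


@dataclass
class FakeTimestamp:
    """Hypothetical stand-in exposing only the field this sketch reads."""
    epoch: int


@dataclass
class FakeState:
    """Hypothetical stand-in for the trainer state."""
    timestamp: FakeTimestamp


class ScaledEpochDecayScheduler:
    """Halves the learning rate after num_epochs * ssr epochs."""

    def __init__(self, num_epochs: int):
        self.num_epochs = num_epochs

    def __call__(self, state, ssr: float = 1.0) -> float:
        # Stretching the schedule by ssr moves the decay point to num_epochs * ssr.
        if state.timestamp.epoch < self.num_epochs * ssr:
            return 1.0
        return 0.5


scheduler = ScaledEpochDecayScheduler(num_epochs=10)
state = FakeState(timestamp=FakeTimestamp(epoch=12))
unscaled = scheduler(state)            # epoch 12 is past the 10-epoch threshold
stretched = scheduler(state, ssr=2.0)  # threshold moves to epoch 20
```

With an SSR of 2.0 the same epoch falls before the decay point, so the multiplier stays at its initial value.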

__call__(state, ssr=1.0)

Calculate the current learning rate multiplier $$\alpha$$.

A scheduler function should be a pure function that returns a multiplier to apply to the optimizer’s provided learning rate, given the current trainer state, and optionally a “scale schedule ratio” (SSR). A typical implementation will read state.timestamp, and possibly other fields like state.max_duration, to determine the trainer’s latest temporal progress.

Note

All instances of ComposerScheduler output a multiplier for the learning rate, rather than the learning rate directly. By convention, we use the symbol $$\alpha$$ to refer to this multiplier. This means that the learning rate $$\eta$$ at time $$t$$ can be represented as $$\eta(t) = \eta_i \times \alpha(t)$$, where $$\eta_i$$ represents the learning rate used to initialize the optimizer.

Note

It is possible to use multiple schedulers, in which case their effects will stack multiplicatively.

The ssr param indicates that the schedule should be “stretched” accordingly. In symbolic terms, where $$\alpha_\sigma(t)$$ represents the scheduler output at time $$t$$ using scale schedule ratio $$\sigma$$:

$\alpha_{\sigma}(t) = \alpha(t / \sigma)$
Parameters
• state (State) – The current Composer Trainer state.

• ssr (float) – The scale schedule ratio. In general, the learning rate computed by this scheduler at time $$t$$ with an SSR of 1.0 should be the same as that computed by this scheduler at time $$t \times s$$ with an SSR of $$s$$. Default = 1.0.

Returns

alpha (float) – A multiplier to apply to the optimizer’s provided learning rate.
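
The relationships described for `__call__()` can be sketched in plain Python. This sketch does not use Composer: time is a bare number, a schedule is any function from time to a multiplier, and the names `effective_lr`, `stretch`, `halve_after_ten` and `quarter_after_twenty` are hypothetical.

```python
from typing import Callable, Sequence

# A schedule maps a time value t to a learning rate multiplier alpha(t).
Schedule = Callable[[float], float]


def effective_lr(initial_lr: float, t: float, schedules: Sequence[Schedule]) -> float:
    """eta(t) = eta_i * alpha_1(t) * alpha_2(t) * ... (schedulers stack multiplicatively)."""
    lr = initial_lr
    for alpha in schedules:
        lr *= alpha(t)
    return lr


def stretch(alpha: Schedule, ssr: float) -> Schedule:
    """Build alpha_sigma such that alpha_sigma(t) = alpha(t / sigma)."""
    return lambda t: alpha(t / ssr)


def halve_after_ten(t: float) -> float:
    return 1.0 if t < 10 else 0.5


def quarter_after_twenty(t: float) -> float:
    return 1.0 if t < 20 else 0.25


lr_early = effective_lr(0.1, 5, [halve_after_ten, quarter_after_twenty])
lr_late = effective_lr(0.1, 25, [halve_after_ten, quarter_after_twenty])
stretched = stretch(halve_after_ten, ssr=2.0)
```

Here `stretched` switches to 0.5 at time 20 rather than 10, which matches the rule that the output at time $$t$$ with an SSR of 1.0 equals the output at time $$t \times s$$ with an SSR of $$s$$.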

class composer.optim.scheduler.ConstantScheduler(alpha=1.0, t_max='1dur')

Maintains a fixed learning rate.

This scheduler is based on ConstantLR from PyTorch.

The default settings for this scheduler simply maintain a learning rate factor of 1 for the entire training duration. However, both the factor and the duration of this scheduler can be configured.

Specifically, the learning rate multiplier $$\alpha$$ can be expressed as:

$\alpha(t) = \begin{cases} \alpha, & \text{if } t < t_{max} \\ 1.0 & \text{otherwise} \end{cases}$

Where $$\alpha$$ represents the learning rate multiplier to maintain while this scheduler is active, and $$t_{max}$$ represents the duration of this scheduler.

Parameters
• alpha (float) – Learning rate multiplier to maintain while this scheduler is active. Default = 1.0.

• t_max (str | Time) – Duration of this scheduler. Default = "1dur".

class composer.optim.scheduler.ConstantWithWarmupScheduler(t_warmup, alpha=1.0, t_max='1dur')

Maintains a fixed learning rate, with an initial warmup.

This scheduler is based on ConstantLR from PyTorch, with an added warmup.

Starts with a linear warmup over t_warmup time, then simply maintains a learning rate factor of 1 for the entire training duration. However, both the factor and the duration of this scheduler can be configured.

Specifically, the learning rate multiplier $$\alpha$$ can be expressed as:

$\alpha(t) = \begin{cases} t / t_{warmup}, & \text{if } t < t_{warmup} \\ \alpha, & \text{if } t < t_{max} \\ 1.0 & \text{otherwise} \end{cases}$

Where $$\alpha$$ represents the learning rate multiplier to maintain while this scheduler is active, and $$t_{max}$$ represents the duration of this scheduler.

Parameters
• t_warmup (str | Time) – Warmup time.

• alpha (float) – Learning rate multiplier to maintain while this scheduler is active. Default = 1.0.

• t_max (str | Time) – Duration of this scheduler. Default = "1dur".
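
As an illustration of the piecewise definition, the following sketch evaluates it on plain numbers. It is not Composer's implementation: `constant_with_warmup` is a hypothetical helper, and all times are assumed to be already expressed in one unit (for example, epochs).

```python
def constant_with_warmup(t: float, t_warmup: float, alpha: float = 1.0,
                         t_max: float = float("inf")) -> float:
    """Evaluate the three cases of the multiplier in order."""
    if t < t_warmup:
        return t / t_warmup   # linear warmup
    if t < t_max:
        return alpha          # scheduler is active
    return 1.0                # scheduler has ended


during_warmup = constant_with_warmup(1, t_warmup=4, alpha=0.5, t_max=10)
while_active = constant_with_warmup(5, t_warmup=4, alpha=0.5, t_max=10)
after_end = constant_with_warmup(12, t_warmup=4, alpha=0.5, t_max=10)
```

Note that, read literally, the warmup ramps towards 1 rather than towards `alpha`, so a value of `alpha` other than 1 produces a jump at the end of warmup in this sketch.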

class composer.optim.scheduler.CosineAnnealingScheduler(t_max='1dur', alpha_f=0.0)

Decays the learning rate according to the decreasing part of a cosine curve.

This scheduler is based on CosineAnnealingLR from PyTorch.

Specifically, the learning rate multiplier $$\alpha$$ can be expressed as:

$\alpha(t) = \alpha_f + (1 - \alpha_f) \times \frac{1}{2} (1 + \cos(\pi \times \tau))$

Given $$\tau$$, the fraction of time elapsed (clipped to the interval $$[0, 1]$$), as:

$\tau = t / t_{max}$

Where $$t_{max}$$ represents the duration of this scheduler, and $$\alpha_f$$ represents the learning rate multiplier to decay to.

Parameters
• t_max (str | Time) – The duration of this scheduler. Default = "1dur".

• alpha_f (float) – Learning rate multiplier to decay to. Default = 0.0.
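
The cosine formula can be checked numerically with a few lines of plain Python. The helper name `cosine_annealing` is hypothetical, and times are assumed to be plain numbers in a single unit.

```python
import math


def cosine_annealing(t: float, t_max: float, alpha_f: float = 0.0) -> float:
    """alpha(t) = alpha_f + (1 - alpha_f) * 0.5 * (1 + cos(pi * tau))."""
    tau = min(max(t / t_max, 0.0), 1.0)  # fraction of time elapsed, clipped to [0, 1]
    return alpha_f + (1 - alpha_f) * 0.5 * (1 + math.cos(math.pi * tau))


start = cosine_annealing(0, t_max=100)                   # full multiplier
halfway = cosine_annealing(50, t_max=100)                # midpoint of the curve
end = cosine_annealing(100, t_max=100, alpha_f=0.1)      # decays to alpha_f
beyond = cosine_annealing(150, t_max=100, alpha_f=0.1)   # clipped, stays at alpha_f
```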

class composer.optim.scheduler.CosineAnnealingWarmRestartsScheduler(t_0, t_mult=1.0, alpha_f=0.0)

Cyclically decays the learning rate according to the decreasing part of a cosine curve.

This scheduler is based on CosineAnnealingWarmRestarts from PyTorch.

This scheduler resembles a regular cosine annealing curve, as seen in CosineAnnealingScheduler, except that after the curve first completes t_0 time, the curve resets to the start. The durations of subsequent cycles are each multiplied by t_mult.

Specifically, the learning rate multiplier $$\alpha$$ can be expressed as:

$\alpha(t) = \alpha_f + (1 - \alpha_f) \times \frac{1}{2}(1 + \cos(\pi \times \tau_i))$

Given $$\tau_i$$, the fraction of time elapsed through the $$i^\text{th}$$ cycle, as:

$\tau_i = (t - \sum_{j=0}^{i-1} t_0 t_{mult}^j) / (t_0 t_{mult}^i)$

Where $$t_0$$ represents the period of the first cycle, $$t_{mult}$$ represents the multiplier for the duration of successive cycles, and $$\alpha_f$$ represents the learning rate multiplier to decay to.

Parameters
• t_0 (str | Time) – The period of the first cycle.

• t_mult (float) – The multiplier for the duration of successive cycles. Default = 1.0.

• alpha_f (float) – Learning rate multiplier to decay to. Default = 0.0.
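
One way to locate the current cycle and compute $$\tau_i$$ is to walk forward through the cycles, as in the sketch below. The function name is hypothetical, times are plain numbers, and the guard requiring `t_0 > 0` and `t_mult >= 1` is an assumption made here so that the loop always terminates.

```python
import math


def cosine_warm_restarts(t: float, t_0: float, t_mult: float = 1.0,
                         alpha_f: float = 0.0) -> float:
    if t_0 <= 0 or t_mult < 1:
        raise ValueError("this sketch assumes t_0 > 0 and t_mult >= 1")
    cycle_start = 0.0   # sum of the lengths of all completed cycles
    cycle_len = t_0     # length of the current cycle: t_0 * t_mult ** i
    while t >= cycle_start + cycle_len:
        cycle_start += cycle_len
        cycle_len *= t_mult
    tau_i = (t - cycle_start) / cycle_len  # fraction elapsed in the current cycle
    return alpha_f + (1 - alpha_f) * 0.5 * (1 + math.cos(math.pi * tau_i))


# With t_0=10 and t_mult=2 the cycles span [0, 10), [10, 30), [30, 70), ...
first_cycle_mid = cosine_warm_restarts(5, t_0=10, t_mult=2)
restart = cosine_warm_restarts(10, t_0=10, t_mult=2)
second_cycle_mid = cosine_warm_restarts(20, t_0=10, t_mult=2)
```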

class composer.optim.scheduler.CosineAnnealingWithWarmupScheduler(t_warmup, t_max='1dur', alpha_f=0.0)

Decays the learning rate according to the decreasing part of a cosine curve, with an initial warmup.

This scheduler is based on CosineAnnealingScheduler, with an added warmup.

Specifically, the learning rate multiplier $$\alpha$$ can be expressed as:

$\alpha(t) = \begin{cases} t / t_{warmup}, & \text{if } t < t_{warmup} \\ \alpha_f + (1 - \alpha_f) \times \frac{1}{2} (1 + \cos(\pi \times \tau_w)) & \text{otherwise} \end{cases}$

Given $$\tau_w$$, the fraction of post-warmup time elapsed (clipped to the interval $$[0, 1]$$), as:

$\tau_w = (t - t_{warmup}) / t_{max}$

Where $$t_{warmup}$$ represents the warmup time, $$t_{max}$$ represents the duration of this scheduler, and $$\alpha_f$$ represents the learning rate multiplier to decay to.

Warning

Initial warmup time is not scaled according to any provided scale schedule ratio! However, the duration of the scheduler is still scaled accordingly. To achieve this, after warmup, the scheduler’s “pace” will be slightly distorted from what would otherwise be expected.

Parameters
• t_warmup (str | Time) – Warmup time.

• t_max (str | Time) – The duration of this scheduler. Default = "1dur".

• alpha_f (float) – Learning rate multiplier to decay to. Default = 0.0.
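
The interaction between warmup and the scale schedule ratio can be illustrated with a sketch that follows the formulas exactly as written in this section. The function name is hypothetical, times are plain numbers, and modeling the SSR as a multiplication of `t_max` (while leaving `t_warmup` untouched) is an assumption about how the scaling is applied.

```python
import math


def cosine_with_warmup(t: float, t_warmup: float, t_max: float,
                       alpha_f: float = 0.0, ssr: float = 1.0) -> float:
    if t < t_warmup:
        return t / t_warmup            # warmup length is NOT scaled by ssr
    scaled_t_max = t_max * ssr         # assumption: only the duration is stretched
    tau_w = min(max((t - t_warmup) / scaled_t_max, 0.0), 1.0)
    return alpha_f + (1 - alpha_f) * 0.5 * (1 + math.cos(math.pi * tau_w))


warm_unscaled = cosine_with_warmup(5, t_warmup=10, t_max=100)
warm_scaled = cosine_with_warmup(5, t_warmup=10, t_max=100, ssr=0.5)
decay_unscaled = cosine_with_warmup(60, t_warmup=10, t_max=100)
decay_scaled = cosine_with_warmup(35, t_warmup=10, t_max=100, ssr=0.5)
```

The two warmup values are identical regardless of `ssr`, while the decay midpoint moves from time 60 to time 35 when the schedule is compressed by half.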

class composer.optim.scheduler.ExponentialScheduler(gamma, decay_period='1ep')

Decays the learning rate exponentially.

This scheduler is based on ExponentialLR from PyTorch.

Exponentially decays the learning rate such that it decays by a factor of gamma every decay_period time.

Specifically, the learning rate multiplier $$\alpha$$ can be expressed as:

$\alpha(t) = \gamma ^ {t / \rho}$

Where $$\rho$$ represents the decay period, and $$\gamma$$ represents the multiplicative decay factor.

Parameters
• decay_period (str | Time) – Decay period. Default = "1ep".

• gamma (float) – Multiplicative decay factor.
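
A small numeric sketch of the exponential formula, using a hypothetical helper and plain numbers for time:

```python
def exponential(t: float, gamma: float, decay_period: float = 1.0) -> float:
    """alpha(t) = gamma ** (t / rho), where rho is the decay period."""
    return gamma ** (t / decay_period)


at_start = exponential(0, gamma=0.5)                          # no decay yet
after_two_periods = exponential(2, gamma=0.5)                 # 0.5 * 0.5
half_period = exponential(1, gamma=0.25, decay_period=2)      # sqrt(0.25)
```

Because the exponent is not rounded, the multiplier changes continuously between whole decay periods.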

class composer.optim.scheduler.LinearScheduler(alpha_i=1.0, alpha_f=0.0, t_max='1dur')

Adjusts the learning rate linearly.

This scheduler is based on LinearLR from PyTorch.

Warning

Note that the defaults for this scheduler differ from the defaults for LinearLR. The PyTorch scheduler, by default, linearly increases the learning rate multiplier from 1.0 / 3 to 1.0, whereas this implementation, by default, linearly decreases the multiplier from 1.0 to 0.0.

Linearly adjusts the learning rate multiplier from alpha_i to alpha_f over t_max time.

Specifically, the learning rate multiplier $$\alpha$$ can be expressed as:

$\alpha(t) = \alpha_i + (\alpha_f - \alpha_i) \times \tau$

Given $$\tau$$, the fraction of time elapsed (clipped to the interval $$[0, 1]$$), as:

$\tau = t / t_{max}$

Where $$\alpha_i$$ represents the initial learning rate multiplier, $$\alpha_f$$ represents the learning rate multiplier to decay to, and $$t_{max}$$ represents the duration of this scheduler.

Parameters
• alpha_i (float) – Initial learning rate multiplier. Default = 1.0.

• alpha_f (float) – Final learning rate multiplier. Default = 0.0.

• t_max (str | Time) – The duration of this scheduler. Default = "1dur".
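
To make the difference in defaults concrete, the sketch below evaluates the linear formula with this scheduler's defaults and with the PyTorch-style endpoints mentioned in the warning. `linear` is a hypothetical helper, and times are plain numbers.

```python
def linear(t: float, t_max: float, alpha_i: float = 1.0, alpha_f: float = 0.0) -> float:
    """alpha(t) = alpha_i + (alpha_f - alpha_i) * tau."""
    tau = min(max(t / t_max, 0.0), 1.0)  # fraction of time elapsed, clipped to [0, 1]
    return alpha_i + (alpha_f - alpha_i) * tau


# Defaults documented here: decay from 1.0 to 0.0.
decay_mid = linear(5, t_max=10)
decay_end = linear(20, t_max=10)

# PyTorch-style endpoints from the warning: increase from 1/3 to 1.0.
ramp_start = linear(0, t_max=5, alpha_i=1 / 3, alpha_f=1.0)
ramp_end = linear(5, t_max=5, alpha_i=1 / 3, alpha_f=1.0)
```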

class composer.optim.scheduler.LinearWithWarmupScheduler(t_warmup, alpha_i=1.0, alpha_f=0.0, t_max='1dur')

Adjusts the learning rate linearly, with an initial warmup.

This scheduler is based on LinearScheduler, with an added warmup.

Starts with a linear warmup over t_warmup time, then linearly adjusts the learning rate multiplier from alpha_i to alpha_f over t_max time.

Specifically, the learning rate multiplier $$\alpha$$ can be expressed as:

$\alpha(t) = \begin{cases} t / t_{warmup}, & \text{if } t < t_{warmup} \\ \alpha_i + (\alpha_f - \alpha_i) \times \tau_w & \text{otherwise} \end{cases}$

Given $$\tau_w$$, the fraction of post-warmup time elapsed (clipped to the interval $$[0, 1]$$), as:

$\tau_w = (t - t_{warmup}) / t_{max}$

Where $$t_{warmup}$$ represents the warmup time, $$\alpha_i$$ represents the initial learning rate multiplier, $$\alpha_f$$ represents the learning rate multiplier to decay to, and $$t_{max}$$ represents the duration of this scheduler.

Warning

Initial warmup time is not scaled according to any provided scale schedule ratio! However, the duration of the scheduler is still scaled accordingly. To achieve this, after warmup, the scheduler’s “pace” will be slightly distorted from what would otherwise be expected.

Parameters
• t_warmup (str | Time) – Warmup time.

• alpha_i (float) – Initial learning rate multiplier. Default = 1.0.

• alpha_f (float) – Final learning rate multiplier. Default = 0.0.

• t_max (str | Time) – The duration of this scheduler. Default = "1dur".

class composer.optim.scheduler.MultiStepScheduler(milestones, gamma=0.1)

Decays the learning rate discretely at fixed milestones.

This scheduler is based on MultiStepLR from PyTorch.

Decays the learning rate by a factor of gamma whenever a time milestone in milestones is reached.

Specifically, the learning rate multiplier $$\alpha$$ can be expressed as:

$\alpha(t) = \gamma ^ x$

Where $$x$$ represents the number of milestones that have been reached, and $$\gamma$$ represents the multiplicative decay factor.

Parameters
• milestones (List[str | Time]) – Times at which the learning rate should change.

• gamma (float) – Multiplicative decay factor. Default = 0.1.
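
Counting the milestones reached is a sorted-list lookup, as the following sketch shows. `multi_step` is a hypothetical helper, times are plain numbers, and treating a milestone as reached once `t` is greater than or equal to it is an assumption of this sketch.

```python
from bisect import bisect_right
from typing import Sequence


def multi_step(t: float, milestones: Sequence[float], gamma: float = 0.1) -> float:
    """alpha(t) = gamma ** x, where x is the number of milestones reached."""
    reached = bisect_right(sorted(milestones), t)  # count of milestones <= t
    return gamma ** reached


before_first = multi_step(29, milestones=[30, 60])
at_first = multi_step(30, milestones=[30, 60])
at_second = multi_step(60, milestones=[30, 60])
```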

class composer.optim.scheduler.MultiStepWithWarmupScheduler(t_warmup, milestones, gamma=0.1)

Decays the learning rate discretely at fixed milestones, with an initial warmup.

This scheduler is based on MultiStepScheduler, with an added warmup.

Starts with a linear warmup over t_warmup time, then decays the learning rate by a factor of gamma whenever a time milestone in milestones is reached.

Specifically, the learning rate multiplier $$\alpha$$ can be expressed as:

$\alpha(t) = \begin{cases} t / t_{warmup}, & \text{if } t < t_{warmup} \\ \gamma ^ x & \text{otherwise} \end{cases}$

Where $$t_{warmup}$$ represents the warmup time, $$x$$ represents the number of milestones that have been reached, and $$\gamma$$ represents the multiplicative decay factor.

Warning

All milestones should be greater than t_warmup; otherwise, they will have no effect on the computed learning rate multiplier until the warmup has completed.

Warning

Initial warmup time is not scaled according to any provided scale schedule ratio! However, the milestones will still be scaled accordingly.

Parameters
• t_warmup (str | Time) – Warmup time.

• milestones (List[str | Time]) – Times at which the learning rate should change.

• gamma (float) – Multiplicative decay factor. Default = 0.1.
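
The first warning can be seen in a short sketch of the piecewise formula. The helper name is hypothetical, times are plain numbers, and a milestone is assumed to count as reached once `t` is greater than or equal to it.

```python
from typing import Sequence


def multi_step_with_warmup(t: float, t_warmup: float, milestones: Sequence[float],
                           gamma: float = 0.1) -> float:
    if t < t_warmup:
        return t / t_warmup                         # milestones are ignored here
    reached = sum(1 for m in milestones if t >= m)  # x in the formula
    return gamma ** reached


# The milestone at time 2 lies inside the 5-unit warmup.
during_warmup = multi_step_with_warmup(3, t_warmup=5, milestones=[2, 8])
just_after = multi_step_with_warmup(5, t_warmup=5, milestones=[2, 8])
after_second = multi_step_with_warmup(8, t_warmup=5, milestones=[2, 8])
```

The early milestone has no effect at time 3, then takes effect all at once when the warmup ends, so the multiplier drops from near 1.0 to `gamma` at time 5.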

class composer.optim.scheduler.PolynomialScheduler(power, t_max='1dur', alpha_f=0.0)

Sets the learning rate to be proportional to a power of the fraction of training time left.

Specifically, the learning rate multiplier $$\alpha$$ can be expressed as:

$\alpha(t) = \alpha_f + (1 - \alpha_f) \times (1 - \tau) ^ {\kappa}$

Given $$\tau$$, the fraction of time elapsed (clipped to the interval $$[0, 1]$$), as:

$\tau = t / t_{max}$

Where $$\kappa$$ represents the exponent to be used for the proportionality relationship, $$t_{max}$$ represents the duration of this scheduler, and $$\alpha_f$$ represents the learning rate multiplier to decay to.

Parameters
• power (float) – The exponent to be used for the proportionality relationship.

• t_max (str | Time) – The duration of this scheduler. Default = "1dur".

• alpha_f (float) – Learning rate multiplier to decay to. Default = 0.0.
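
A quick numeric check of the polynomial formula, using a hypothetical helper and plain numbers for time:

```python
def polynomial(t: float, t_max: float, power: float, alpha_f: float = 0.0) -> float:
    """alpha(t) = alpha_f + (1 - alpha_f) * (1 - tau) ** kappa."""
    tau = min(max(t / t_max, 0.0), 1.0)  # fraction of time elapsed, clipped to [0, 1]
    return alpha_f + (1 - alpha_f) * (1 - tau) ** power


quadratic_mid = polynomial(5, t_max=10, power=2)           # (1 - 0.5) ** 2
linear_mid = polynomial(5, t_max=10, power=1)              # power 1 is a straight line
floor = polynomial(10, t_max=10, power=2, alpha_f=0.1)     # ends at alpha_f
```

With `power=1` the curve reduces to a linear decay from 1.0 to `alpha_f`; larger exponents decay faster early on.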

class composer.optim.scheduler.PolynomialWithWarmupScheduler(t_warmup, power=2.0, t_max='1dur', alpha_f=0.0)

Decays the learning rate according to a power of the fraction of training time left, with an initial warmup.

This scheduler is based on PolynomialScheduler, with an added warmup.

Specifically, the learning rate multiplier $$\alpha$$ can be expressed as:

$\alpha(t) = \begin{cases} t / t_{warmup}, & \text{if } t < t_{warmup} \\ \alpha_f + (1 - \alpha_f) \times (1 - \tau_w) ^ {\kappa} & \text{otherwise} \end{cases}$

Given $$\tau_w$$, the fraction of post-warmup time elapsed (clipped to the interval $$[0, 1]$$), as:

$\tau_w = (t - t_{warmup}) / t_{max}$

Where $$\kappa$$ represents the exponent to be used for the proportionality relationship, $$t_{warmup}$$ represents the warmup time, $$t_{max}$$ represents the duration of this scheduler, and $$\alpha_f$$ represents the learning rate multiplier to decay to.

Warning

Initial warmup time is not scaled according to any provided scale schedule ratio! However, the duration of the scheduler is still scaled accordingly. To achieve this, after warmup, the scheduler’s “pace” will be slightly distorted from what would otherwise be expected.

Parameters
• t_warmup (str | Time) – Warmup time.

• power (float) – The exponent to be used for the proportionality relationship. Default = 2.0.

• t_max (str | Time) – The duration of this scheduler. Default = "1dur".

• alpha_f (float) – Learning rate multiplier to decay to. Default = 0.0.

class composer.optim.scheduler.StepScheduler(step_size, gamma=0.1)

Decays the learning rate discretely at fixed intervals.

This scheduler is based on StepLR from PyTorch.

Decays the learning rate by a factor of gamma periodically, with a frequency determined by step_size.

Specifically, the learning rate multiplier $$\alpha$$ can be expressed as:

$\alpha(t) = \gamma ^ {\text{floor}(t / \rho)}$

Where $$\rho$$ represents the time between changes to the learning rate (the step size), and $$\gamma$$ represents the multiplicative decay factor.

Parameters
• step_size (str | Time) – Time between changes to the learning rate.

• gamma (float) – Multiplicative decay factor. Default = 0.1.
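
The step formula is the exponential decay with its exponent rounded down, which the sketch below evaluates on plain numbers. `step` is a hypothetical helper name.

```python
import math


def step(t: float, step_size: float, gamma: float = 0.1) -> float:
    """alpha(t) = gamma ** floor(t / rho), where rho is the step size."""
    return gamma ** math.floor(t / step_size)


before_first_step = step(29, step_size=30)
at_first_step = step(30, step_size=30)
after_two_steps = step(65, step_size=30)
```

The multiplier is constant within each interval of length `step_size` and drops by a factor of `gamma` at every boundary.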

composer.optim.scheduler.compile_composer_scheduler(scheduler, state, ssr=1.0)

Converts a stateless scheduler into a PyTorch scheduler object.

While the resulting scheduler provides a .step() interface similar to other PyTorch schedulers, the scheduler is also given a bound reference to the current State. This means that any internal state updated by .step() can be ignored, and the scheduler can instead simply use the bound state to recalculate the current learning rate.

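Parameters
• scheduler (ComposerScheduler) – The stateless scheduler to convert.

• state (State) – The current Composer Trainer state.

• ssr (float) – The scale schedule ratio. Default = 1.0.

Returns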

compiled_scheduler (PyTorchScheduler) – The scheduler, in a form compatible with PyTorch scheduler interfaces.