OptimizerMonitor

class composer.callbacks.OptimizerMonitor(log_optimizer_metrics=True)

Computes and logs the L2 norm of gradients as well as any optimizer-specific metrics implemented in the optimizer's report_per_parameter_metrics method.

L2 norms are calculated after gradients have been reduced across GPUs. The callback iterates over every parameter of the model, which may reduce throughput when training large models. If gradients are scaled (e.g., under mixed precision), the norms are only correct when computed after gradient unscaling.
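Per the description above, OptimizerMonitor picks up per-parameter metrics from any optimizer that implements report_per_parameter_metrics, as DecoupledAdamW does. Below is a minimal sketch of such an optimizer; the signature is an assumption modeled on DecoupledAdamW's hook, so check your Composer version for the exact interface.

import torch

class AdamWWithMetrics(torch.optim.AdamW):
    """Sketch of an optimizer exposing per-parameter metrics.

    The method name matches the hook OptimizerMonitor queries; the exact
    signature is an assumption based on DecoupledAdamW.
    """

    def report_per_parameter_metrics(self, param: torch.Tensor, name: str,
                                     optimizer_metrics: dict) -> dict:
        state = self.state.get(param, {})
        if 'exp_avg' in state:
            # Report the L2 norm of the Adam first moment for this parameter.
            optimizer_metrics[f'l2_norm/moment/{name}'] = torch.linalg.vector_norm(state['exp_avg'])
        return optimizer_metrics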

Example

>>> from composer import Trainer
>>> from composer.callbacks import OptimizerMonitor
>>> # constructing trainer object with this callback
>>> trainer = Trainer(
...     model=model,
...     train_dataloader=train_dataloader,
...     eval_dataloader=eval_dataloader,
...     optimizers=optimizer,
...     max_duration="1ep",
...     callbacks=[OptimizerMonitor()],
... )
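With log_optimizer_metrics=False, the callback restricts itself to the gradient norms and skips the optimizer's per-parameter metrics, avoiding the extra per-parameter work at each step:

>>> # Gradient norms only; optimizer-specific metrics are skipped
>>> trainer = Trainer(
...     model=model,
...     train_dataloader=train_dataloader,
...     eval_dataloader=eval_dataloader,
...     optimizers=optimizer,
...     max_duration="1ep",
...     callbacks=[OptimizerMonitor(log_optimizer_metrics=False)],
... )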

The metrics are logged by the Logger to the keys described below. l2_norm/grad/global and l2_norm/grad/LAYER_NAME are logged in addition to any metrics produced by the optimizer's report_per_parameter_metrics method. For convenience, the metrics logged by DecoupledAdamW are listed below.

Key | Logged data
--- | ---
l2_norm/grad/global | L2 norm of the gradients of all parameters in the model on the Event.AFTER_TRAIN_BATCH event.
l2_norm/grad/LAYER_NAME | Layer-wise L2 norms of gradients.
l2_norm/moment/LAYER_NAME | Layer-wise L2 norms of the Adam first moment after calling optimizer step.
l2_norm_ratio/moment_grad/LAYER_NAME | Layer-wise ratio of the gradient norm to the moment norm after calling optimizer step.
cosine/moment_grad/LAYER_NAME | Layer-wise cosine similarity between the gradient and the moment after calling optimizer step.
l2_norm/param/LAYER_NAME | Layer-wise L2 norms of parameter weights.
l2_norm/second_moment_sqrt/LAYER_NAME | Layer-wise L2 norms of the square root of the Adam second moment.
l2_norm/update/LAYER_NAME | Layer-wise L2 norms of the step (update).
cosine/update_grad/LAYER_NAME | Layer-wise cosine similarity between the gradient and the step.
l2_norm_ratio/update_param/LAYER_NAME | Layer-wise ratio between the step size and the parameter norm.
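For reference, l2_norm/grad/global is the L2 norm of all gradients viewed as one flattened vector, which equals the square root of the sum of squared per-layer norms. A standalone sketch of the same quantity (the helper below is illustrative, not part of Composer):

import torch

def global_grad_l2_norm(model: torch.nn.Module) -> float:
    # sqrt of the sum of squared per-parameter gradient norms, i.e. the
    # L2 norm of all gradients concatenated into a single vector.
    squared_sum = sum(
        torch.linalg.vector_norm(p.grad.detach()).item() ** 2
        for p in model.parameters()
        if p.grad is not None
    )
    return squared_sum ** 0.5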