SpeedMonitor

class composer.callbacks.SpeedMonitor(window_size=100, gpu_flops_available=None, time_unit='hours')[source]

Logs the training throughput and utilization.

The training throughput is logged on the Event.BATCH_END event once window_size batches have been processed. If the model has a flops_per_batch attribute, flops per second is also logged. If running on a known GPU type, or if gpu_flops_available is set, model flops utilization (MFU) is also logged. All throughput metrics are additionally logged per device by dividing by the world size.

To compute flops_per_sec, the model attribute flops_per_batch should be set to a callable that accepts a batch and returns the number of flops for that batch. Typically this is flops per sample times the batch size, unless pad tokens are used.
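
For example, a rough flops_per_batch for a decoder-only transformer can use the common 6 * n_params flops-per-token estimate for a combined forward and backward pass. This is a sketch with an assumed batch format, not code from this library:

>>> # Assumed: batch is a dict with an 'input_ids' tensor; adapt to your batch format.
>>> n_params = sum(p.numel() for p in model.parameters())
>>> def flops_per_batch(batch):
...     # ~6 flops per parameter per token (2 forward + 4 backward)
...     n_tokens = batch['input_ids'].numel()
...     return 6 * n_params * n_tokens
...
>>> model.flops_per_batch = flops_per_batch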

The wall clock time is logged on every Event.BATCH_END event.

Example

>>> from composer import Trainer
>>> from composer.callbacks import SpeedMonitor
>>> # constructing trainer object with this callback
>>> trainer = Trainer(
...     model=model,
...     train_dataloader=train_dataloader,
...     eval_dataloader=eval_dataloader,
...     optimizers=optimizer,
...     max_duration='1ep',
...     callbacks=[SpeedMonitor(window_size=100)],
... )

The training throughput is logged by the Logger to the following keys:

throughput/batches_per_sec
    Rolling average (over the window_size most recent batches) of the number of batches processed per second.

throughput/samples_per_sec
    Rolling average (over the window_size most recent batches) of the number of samples processed per second.

throughput/tokens_per_sec
    Rolling average (over the window_size most recent batches) of the number of tokens processed per second. Only logged if the dataspec returns the number of tokens per batch.

throughput/flops_per_sec
    Estimated as flops_per_batch * batches_per_sec. Only logged if the model has a flops_per_batch attribute.

throughput/device/batches_per_sec
    throughput/batches_per_sec divided by the world size.

throughput/device/samples_per_sec
    throughput/samples_per_sec divided by the world size.

throughput/device/tokens_per_sec
    throughput/tokens_per_sec divided by the world size. Only logged if the dataspec returns the number of tokens per batch.

throughput/device/flops_per_sec
    throughput/flops_per_sec divided by the world size. Only logged if the model has a flops_per_batch attribute.

throughput/device/mfu
    throughput/device/flops_per_sec divided by the flops available on the GPU device (gpu_flops_available). Only logged when the model has a flops_per_batch attribute and gpu_flops_available is known, which can be passed as an argument if not determined automatically by SpeedMonitor.

time/train
    Total elapsed training time.

time/val
    Total elapsed validation time.

time/total
    Total elapsed time (time/train + time/val).
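
For reference, the MFU value reduces to the following arithmetic. This is a sketch with illustrative numbers, not SpeedMonitor's internal code; the peak figure below is the published A100 bfloat16 dense number:

>>> world_size = 8
>>> flops_per_sec = 1.6e15           # aggregate measured flops across all devices
>>> gpu_flops_available = 3.12e14    # A100 bfloat16 dense peak
>>> device_flops_per_sec = flops_per_sec / world_size
>>> round(device_flops_per_sec / gpu_flops_available, 3)
0.641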

Parameters
  • window_size (int, optional) – Number of batches to use for a rolling average of throughput. Defaults to 100.

  • gpu_flops_available (float, optional) – Number of flops available on the GPU. If not set, SpeedMonitor will attempt to determine this automatically. Defaults to None.

  • time_unit (str, optional) – Time unit to use for time logging. Can be one of 'seconds', 'minutes', 'hours', or 'days'. Defaults to 'hours'.
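
When running on hardware that SpeedMonitor cannot identify, the peak flops can be supplied directly. The value below is illustrative (A100 bfloat16 dense peak); substitute the figure for your own device:

>>> speed_monitor = SpeedMonitor(
...     window_size=50,
...     gpu_flops_available=3.12e14,  # illustrative peak flops for the device
...     time_unit='seconds',
... )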