SpeedMonitor

class composer.callbacks.SpeedMonitor(window_size=100, gpu_flops_available=None, time_unit='hours')

Logs the training throughput and utilization.

The training throughput is logged on the Event.BATCH_END event once window_size batches have been seen. If the model has a flops_per_batch attribute, flops per second is also logged. If running on a known GPU type, or if gpu_flops_available is set, model flops utilization (MFU) is also logged. Each metric is additionally logged per device by dividing by the world size.

To compute flops_per_sec, the model attribute flops_per_batch should be set to a callable which accepts a batch and returns the number of flops for that batch. Typically, this should be flops per sample times the batch size unless pad tokens are used.
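
A minimal sketch of this contract is shown below; the TinyModel, the (inputs, targets) batch layout, and the 6 * n_params flop estimate are illustrative assumptions rather than Composer API.

>>> import torch
>>> class TinyModel(torch.nn.Module):
...     def __init__(self):
...         super().__init__()
...         self.fc = torch.nn.Linear(128, 10)
...     def forward(self, batch):
...         inputs, _ = batch
...         return self.fc(inputs)
>>> model = TinyModel()
>>> n_params = sum(p.numel() for p in model.parameters())
>>> # Rough estimate: ~6 flops per parameter per sample covers forward + backward
>>> model.flops_per_batch = lambda batch: 6 * n_params * batch[0].shape[0]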

The wall clock time is logged on every Event.BATCH_END event.

Example

>>> from composer import Trainer
>>> from composer.callbacks import SpeedMonitor
>>> # constructing trainer object with this callback
>>> trainer = Trainer(
...     model=model,
...     train_dataloader=train_dataloader,
...     eval_dataloader=eval_dataloader,
...     optimizers=optimizer,
...     max_duration='1ep',
...     callbacks=[SpeedMonitor(window_size=100)],
... )

The training throughput is logged by the Logger to the following keys:

throughput/batches_per_sec
    Rolling average (over the window_size most recent batches) of the number of batches processed per second.

throughput/samples_per_sec
    Rolling average (over the window_size most recent batches) of the number of samples processed per second.

throughput/tokens_per_sec
    Rolling average (over the window_size most recent batches) of the number of tokens processed per second. Only logged if the dataspec returns tokens per batch.

throughput/flops_per_sec
    Estimated flops, computed as flops_per_batch * batches_per_sec. Only logged if the model has a flops_per_batch attribute.

throughput/device/batches_per_sec
    throughput/batches_per_sec divided by the world size.

throughput/device/samples_per_sec
    throughput/samples_per_sec divided by the world size.

throughput/device/tokens_per_sec
    throughput/tokens_per_sec divided by the world size. Only logged if the dataspec returns tokens per batch.

throughput/device/flops_per_sec
    throughput/flops_per_sec divided by the world size. Only logged if the model has a flops_per_batch attribute.

throughput/device/mfu
    throughput/device/flops_per_sec divided by the flops available on the GPU device. Only logged if the model has a flops_per_batch attribute and gpu_flops_available is known, either determined automatically by SpeedMonitor or passed as an argument.

time/train
    Total elapsed training time.

time/val
    Total elapsed validation time.

time/total
    Total elapsed time (time/train + time/val).
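
To make the derivation of throughput/device/mfu concrete, here is a worked example with hypothetical numbers; the 312e12 per-device peak is an assumption corresponding roughly to an A100 in bf16.

>>> flops_per_sec = 1.0e15          # rolling-average model flops across all ranks
>>> world_size = 8                  # number of training devices
>>> gpu_flops_available = 312e12    # assumed per-device peak (e.g. A100, bf16)
>>> device_flops_per_sec = flops_per_sec / world_size  # throughput/device/flops_per_sec
>>> mfu = device_flops_per_sec / gpu_flops_available   # throughput/device/mfu
>>> round(mfu, 4)
0.4006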

Parameters
  • window_size (int, optional) – Number of batches to use for a rolling average of throughput. Defaults to 100.

  • gpu_flops_available (float, optional) – Number of flops available on the GPU. If not set, SpeedMonitor will attempt to determine this automatically. Defaults to None.

  • time_unit (str, optional) – Time unit to use for time logging. Can be one of 'seconds', 'minutes', 'hours', or 'days'. Defaults to 'hours'.
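
When SpeedMonitor cannot detect the GPU type, gpu_flops_available can be supplied directly. The 312e12 figure below is an assumed peak for bf16 on an A100; substitute the value for your hardware.

>>> from composer.callbacks import SpeedMonitor
>>> # 312e12 is an assumed per-device peak (A100, bf16), not a default
>>> speed_monitor = SpeedMonitor(
...     window_size=100,
...     gpu_flops_available=312e12,
...     time_unit='hours',
... )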