- class composer.callbacks.SpeedMonitor(window_size=100, gpu_flops_available=None, time_unit='hours')
Logs the training throughput and utilization.
The training throughput is logged on the Event.BATCH_END event once the window_size threshold is reached. If the model has a flops_per_batch attribute, flops per second is also logged. If running on a known GPU type, or if gpu_flops_available is set, MFU is also logged. All throughput metrics are also logged per device by dividing by the world size.
To compute flops_per_sec, the model attribute flops_per_batch should be set to a callable that accepts a batch and returns the number of flops for that batch. Typically this is flops per sample times the batch size, unless pad tokens are used.
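As an illustration, such a callable might use the common 6 × parameters × tokens approximation for transformer training flops (forward plus backward). The helper name and the 'input_ids' batch format below are assumptions for the sketch, not part of Composer's API.

```python
# Hypothetical helper: builds a flops_per_batch callable using the common
# 6 * n_params * n_tokens approximation for transformer training flops
# (forward + backward). The 'input_ids' batch key is an assumption.
def make_flops_per_batch(n_params):
    def flops_per_batch(batch):
        # Count every token in the batch, including any padding, since
        # padded positions are still computed by the model.
        n_tokens = sum(len(seq) for seq in batch['input_ids'])
        return 6 * n_params * n_tokens
    return flops_per_batch

# The callable would then be attached to the model before training:
# model.flops_per_batch = make_flops_per_batch(n_params=125_000_000)
```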
The wall clock time is logged on every Event.BATCH_END event.
>>> from composer import Trainer
>>> from composer.callbacks import SpeedMonitor
>>> # constructing trainer object with this callback
>>> trainer = Trainer(
...     model=model,
...     train_dataloader=train_dataloader,
...     eval_dataloader=eval_dataloader,
...     optimizers=optimizer,
...     max_duration='1ep',
...     callbacks=[SpeedMonitor(window_size=100)],
... )
The training throughput is logged by the Logger to the following keys, as described below.
- throughput/batches_per_sec — Rolling average (over the window_size most recent batches) of the number of batches processed per second.
- throughput/samples_per_sec — Rolling average (over the window_size most recent batches) of the number of samples processed per second.
- throughput/tokens_per_sec — Rolling average (over the window_size most recent batches) of the number of tokens processed per second. Only logged when dataloader.dataset has max_seq_len. This may include padding, depending on the dataset.
- throughput/flops_per_sec — Estimates flops as flops_per_batch * batches_per_sec. Only logged when the model has a flops_per_batch attribute.
- throughput/device/batches_per_sec — throughput/batches_per_sec divided by world size.
- throughput/device/samples_per_sec — throughput/samples_per_sec divided by world size.
- throughput/device/tokens_per_sec — throughput/tokens_per_sec divided by world size. Only logged when dataloader.dataset has max_seq_len. This may include pad tokens, depending on the dataset.
- throughput/device/flops_per_sec — throughput/flops_per_sec divided by world size. Only logged when the model has a flops_per_batch attribute.
- throughput/device/mfu — throughput/device/flops_per_sec divided by gpu_flops_available. Only logged when the model has a flops_per_batch attribute and gpu_flops_available is known, which can be passed as an argument if not automatically determined by SpeedMonitor.
- time/train — Total elapsed training time.
- time/val — Total elapsed validation time.
- time/total — Total elapsed time (time/train + time/val).
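The arithmetic behind these keys can be sketched in plain Python. This is an illustrative model of the rolling average, the per-device division, and the MFU ratio only; the class and function names are invented for the example and are not Composer's internal implementation.

```python
from collections import deque

# Illustrative sketch of the throughput arithmetic (names invented).
class ThroughputWindow:
    def __init__(self, window_size=100):
        self.durations = deque(maxlen=window_size)  # seconds per batch
        self.samples = deque(maxlen=window_size)    # samples per batch

    def record(self, batch_duration, n_samples):
        self.durations.append(batch_duration)
        self.samples.append(n_samples)

    def samples_per_sec(self):
        # Rolling average over the window_size most recent batches.
        return sum(self.samples) / sum(self.durations)

    def device_samples_per_sec(self, world_size):
        # Per-device throughput: the global rate divided by world size.
        return self.samples_per_sec() / world_size

def mfu(device_flops_per_sec, gpu_flops_available):
    # Model flops utilization: achieved per-device flops as a
    # fraction of the device's peak.
    return device_flops_per_sec / gpu_flops_available
```

The deque's maxlen keeps only the most recent window_size batches, which is what makes the average "rolling".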
Parameters:
- window_size (int, optional) – Number of batches to use for a rolling average of throughput. Defaults to 100.
- gpu_flops_available (float, optional) – Number of flops available on the GPU. If not set, SpeedMonitor will attempt to determine this automatically. Defaults to None.
- time_unit (str, optional) – Time unit to use for time logging. Can be one of 'seconds', 'minutes', 'hours', or 'days'. Defaults to 'hours'.
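As a configuration sketch, one might pass a device's published peak explicitly rather than rely on automatic detection; 312 TFLOPS is the dense BF16 peak of an NVIDIA A100, and the constant and kwargs dict below are illustrative, not part of Composer's API.

```python
# Peak dense BF16 throughput of an NVIDIA A100, in flops per second.
# Substitute your own accelerator's peak for an accurate MFU.
A100_BF16_PEAK_FLOPS = 312e12

# Illustrative keyword arguments for the callback's constructor:
speed_monitor_kwargs = {
    'window_size': 50,                            # average over 50 batches
    'gpu_flops_available': A100_BF16_PEAK_FLOPS,  # enables MFU logging
    'time_unit': 'seconds',                       # log time/* in seconds
}
# speed_monitor = SpeedMonitor(**speed_monitor_kwargs)
```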