SystemMetricsMonitor

class composer.callbacks.SystemMetricsMonitor(log_all_data=False)

Logs GPU/CPU metrics.

GPU Metrics:

gpu_percentage: Occupancy rate, the percent of time over the sampling period during which one or more kernels were executing on the GPU.

memory_percentage: Percent of time over the sampling period during which global memory was being read or written.

gpu_temperature_C: Temperature of the device, in Celsius.

gpu_power_usage_W: Power usage of the device, in Watts.

By default, only the maximum and minimum values of these metrics are logged, and each key name includes the rank that produced the value. Logging happens for every batch on the Event.BATCH_START, Event.EVAL_BATCH_START, and Event.PREDICT_BATCH_START events. If log_all_data is set to True, the values of these metrics from all ranks are logged on the same events for every batch.
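The difference between the two logging modes can be illustrated with a minimal sketch in plain Python. It is not Composer's implementation: the helper names summarize_min_max and flatten_all are hypothetical, and the key formats they produce are assumptions chosen only to show a rank embedded in a key name.

```python
from typing import Dict, List


def summarize_min_max(per_rank: List[Dict[str, float]]) -> Dict[str, float]:
    """Reduce per-rank metrics to min/max entries (hypothetical helper).

    per_rank[i] holds the metrics sampled on rank i.
    """
    summary: Dict[str, float] = {}
    for key in per_rank[0]:
        values = [metrics[key] for metrics in per_rank]
        # Index of the rank holding the smallest and largest value.
        min_rank = min(range(len(values)), key=values.__getitem__)
        max_rank = max(range(len(values)), key=values.__getitem__)
        # Key naming is an assumption for illustration, not Composer's format.
        summary[f"min_{key}/rank_{min_rank}"] = values[min_rank]
        summary[f"max_{key}/rank_{max_rank}"] = values[max_rank]
    return summary


def flatten_all(per_rank: List[Dict[str, float]]) -> Dict[str, float]:
    """Keep every value from every rank (hypothetical helper)."""
    return {
        f"rank_{rank}_{key}": value
        for rank, metrics in enumerate(per_rank)
        for key, value in metrics.items()
    }


# Made-up sample values for two ranks.
sample = [
    {"gpu_percentage": 80.0, "gpu_temperature_C": 61.0},
    {"gpu_percentage": 95.0, "gpu_temperature_C": 58.0},
]
print(summarize_min_max(sample))
print(flatten_all(sample))
```

With the min/max reduction, the number of logged entries stays at two per metric regardless of how many ranks participate, while the flattened form grows linearly with the number of ranks.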

Example:

>>> from composer import Trainer
>>> from composer.callbacks import SystemMetricsMonitor
>>> # constructing trainer object with this callback
>>> trainer = Trainer(
...    model=model,
...    train_dataloader=train_dataloader,
...    eval_dataloader=eval_dataloader,
...    optimizers=optimizer,
...    max_duration='1ep',
...    callbacks=[SystemMetricsMonitor()],
... )
Parameters

log_all_data (bool, optional) – True to log data for all ranks, not just the min/max. Defaults to False.