SystemMetricsMonitor
- class composer.callbacks.SystemMetricsMonitor(log_all_data=False)
Logs GPU/CPU metrics.
- GPU Metrics:
  - gpu_percentage: Occupancy rate, the percent of time over the sampling period during which one or more kernels were executing on the GPU.
  - memory_percentage: Percent of time over the sampling period during which global memory was being read or written.
  - gpu_temperature_C: Temperature of the device, in Celsius.
  - gpu_power_usage_W: Power usage of the device, in Watts.
By default, only the maximum and minimum values for these metrics, alongside their respective ranks in the key names, are logged on the Event.BATCH_START, Event.EVAL_BATCH_START, and Event.PREDICT_BATCH_START events for every batch. If log_all_data is set to True, all values for these metrics across all ranks are logged on the above events for every batch.
Example:
>>> from composer import Trainer
>>> from composer.callbacks import SystemMetricsMonitor
>>> # constructing trainer object with this callback
>>> trainer = Trainer(
...     model=model,
...     train_dataloader=train_dataloader,
...     eval_dataloader=eval_dataloader,
...     optimizers=optimizer,
...     max_duration='1ep',
...     callbacks=[SystemMetricsMonitor()],
... )
- Parameters
log_all_data (bool, optional) – True if the user wants to log data for all ranks, not just the min/max. Defaults to False.
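The difference between the default min/max logging and log_all_data=True can be illustrated with a small standalone sketch. This is not Composer's implementation: the helper summarize_metrics, the sample values, and the key format (min_<metric>/rank_<n>, max_<metric>/rank_<n>, <metric>/rank_<n>) are hypothetical and chosen only to show the idea of carrying the rank in the key name. The actual key names emitted by the callback may differ.

```python
from typing import Dict, List


def summarize_metrics(
    per_rank: List[Dict[str, float]],
    log_all_data: bool = False,
) -> Dict[str, float]:
    """Hypothetical helper: per_rank[i] holds the metrics gathered from rank i."""
    logged: Dict[str, float] = {}
    if not per_rank:
        return logged
    for name in per_rank[0]:
        values = [metrics[name] for metrics in per_rank]
        if log_all_data:
            # Log every rank's value under its own key.
            for rank, value in enumerate(values):
                logged[f"{name}/rank_{rank}"] = value
        else:
            # Log only the extremes, with the owning rank in the key name.
            min_rank = min(range(len(values)), key=values.__getitem__)
            max_rank = max(range(len(values)), key=values.__getitem__)
            logged[f"min_{name}/rank_{min_rank}"] = values[min_rank]
            logged[f"max_{name}/rank_{max_rank}"] = values[max_rank]
    return logged


# Made-up sample readings for three ranks (illustrative values only).
per_rank = [
    {"gpu_percentage": 91.0, "gpu_temperature_C": 64.0},
    {"gpu_percentage": 78.0, "gpu_temperature_C": 71.0},
    {"gpu_percentage": 85.0, "gpu_temperature_C": 66.0},
]

default_view = summarize_metrics(per_rank)
full_view = summarize_metrics(per_rank, log_all_data=True)
```

With the default setting the number of logged values stays at two per metric regardless of world size, while log_all_data=True grows linearly with the number of ranks. That is presumably why the smaller summary is the default.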