MemoryMonitor
- class composer.callbacks.MemoryMonitor(memory_keys=None, dist_aggregate_batch_interval=None)
Logs the memory usage of the model.
This callback calls the torch memory stats API for CUDA (see torch.cuda.memory_stats()) on the Event.AFTER_TRAIN_BATCH event and reports different memory statistics.

Example
>>> from composer import Trainer
>>> from composer.callbacks import MemoryMonitor
>>> # constructing trainer object with this callback
>>> trainer = Trainer(
...     model=model,
...     train_dataloader=train_dataloader,
...     eval_dataloader=eval_dataloader,
...     optimizers=optimizer,
...     max_duration="1ep",
...     callbacks=[MemoryMonitor()],
... )
The memory statistics are logged by the Logger under keys of the form memory/{statistic}. Several memory usage statistics are logged on the Event.AFTER_TRAIN_BATCH event. The following statistics are recorded:
current_allocated_mem: Current amount of allocated memory in gigabytes.
current_active_mem: Current amount of active memory in gigabytes at the time of recording.
current_inactive_mem: Current amount of inactive, non-releasable memory in gigabytes at the time of recording.
current_reserved_mem: Current amount of reserved memory in gigabytes at the time of recording.
peak_allocated_mem: Peak amount of allocated memory in gigabytes.
peak_active_mem: Peak amount of active memory in gigabytes at the time of recording.
peak_inactive_mem: Peak amount of inactive, non-releasable memory in gigabytes at the time of recording.
peak_reserved_mem: Peak amount of reserved memory in gigabytes at the time of recording.
alloc_retries: Number of failed cudaMalloc calls that result in a cache flush and retry.
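These names correspond to counters returned by torch.cuda.memory_stats(). As a rough sketch of how the gigabyte values relate to the raw byte counters (the specific counter names read here are an assumption for illustration, not necessarily the callback's internal mapping):

>>> import torch
>>> stats = torch.cuda.memory_stats()
>>> # byte counters are converted to gigabytes; alloc_retries is a plain count
>>> current_allocated_gb = stats['allocated_bytes.all.current'] / 2**30
>>> peak_allocated_gb = stats['allocated_bytes.all.peak'] / 2**30
>>> alloc_retries = stats['num_alloc_retries']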
Additionally, if dist_aggregate_batch_interval is enabled, the avg, min, and max of the aforementioned statistics are also logged.
Note
Memory usage monitoring is only supported for GPU devices.
- Parameters
memory_keys (dict[str, str], optional) – A dict specifying memory statistics to log. Keys are the names of memory statistics to log from torch.cuda.memory_stats(), and values are the names they will be logged under. If not provided, the above statistics are logged. Defaults to None.
dist_aggregate_batch_interval (int, optional) – Interval, in batches, at which memory stats are aggregated across all nodes. Defaults to None (aggregation is disabled).
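For instance, a minimal sketch of passing a custom memory_keys mapping and enabling cross-node aggregation; the torch.cuda.memory_stats() key names and the interval value below are illustrative choices, not defaults of the callback:

>>> from composer import Trainer
>>> from composer.callbacks import MemoryMonitor
>>> # log only allocated-memory counters under custom names, and aggregate
>>> # the stats across ranks every 10 batches (illustrative values)
>>> memory_monitor = MemoryMonitor(
...     memory_keys={
...         'allocated_bytes.all.current': 'current_allocated_mem',
...         'allocated_bytes.all.peak': 'peak_allocated_mem',
...     },
...     dist_aggregate_batch_interval=10,
... )
>>> trainer = Trainer(
...     model=model,
...     train_dataloader=train_dataloader,
...     optimizers=optimizer,
...     max_duration="1ep",
...     callbacks=[memory_monitor],
... )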