MemoryMonitor#

class composer.callbacks.MemoryMonitor(memory_keys=None, dist_aggregate_batch_interval=None)[source]#

Logs the memory usage of the model.

This callback calls the torch CUDA memory stats API (see torch.cuda.memory_stats()) on the Event.AFTER_TRAIN_BATCH event and reports the memory statistics described below.

Example

>>> from composer import Trainer
>>> from composer.callbacks import MemoryMonitor
>>> # constructing trainer object with this callback
>>> trainer = Trainer(
...     model=model,
...     train_dataloader=train_dataloader,
...     eval_dataloader=eval_dataloader,
...     optimizers=optimizer,
...     max_duration="1ep",
...     callbacks=[MemoryMonitor()],
... )

The memory statistics are logged by the Logger to the keys described below.

Key                  Logged data
memory/{statistic}   Several memory usage statistics, logged on the Event.AFTER_TRAIN_BATCH event.

The following statistics are recorded:

Statistic               Description
current_allocated_mem   Current amount of allocated memory in gigabytes.
current_active_mem      Current amount of active memory in gigabytes at the time of recording.
current_inactive_mem    Current amount of inactive, non-releasable memory in gigabytes at the time of recording.
current_reserved_mem    Current amount of reserved memory in gigabytes at the time of recording.
peak_allocated_mem      Peak amount of allocated memory in gigabytes.
peak_active_mem         Peak amount of active memory in gigabytes at the time of recording.
peak_inactive_mem       Peak amount of inactive, non-releasable memory in gigabytes at the time of recording.
peak_reserved_mem       Peak amount of reserved memory in gigabytes at the time of recording.
alloc_retries           Number of failed cudaMalloc calls that result in a cache flush and retry.
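
For reference, the sketch below shows how such a report could be assembled directly from torch.cuda.memory_stats(). The key-to-name mapping is an assumption for illustration only (the callback's internal defaults may differ), with byte counts converted to gigabytes:

>>> import torch
>>> # Assumed mapping from torch.cuda.memory_stats() keys to the logged names above.
>>> memory_keys = {
...     "allocated_bytes.all.current": "current_allocated_mem",
...     "active_bytes.all.current": "current_active_mem",
...     "inactive_split_bytes.all.current": "current_inactive_mem",
...     "reserved_bytes.all.current": "current_reserved_mem",
...     "allocated_bytes.all.peak": "peak_allocated_mem",
...     "active_bytes.all.peak": "peak_active_mem",
...     "inactive_split_bytes.all.peak": "peak_inactive_mem",
...     "reserved_bytes.all.peak": "peak_reserved_mem",
...     "num_alloc_retries": "alloc_retries",
... }
>>> stats = torch.cuda.memory_stats()
>>> # Convert byte counts to gigabytes; counters (e.g. alloc_retries) are left as-is.
>>> report = {
...     name: stats[key] / 1e9 if "bytes" in key else stats[key]
...     for key, name in memory_keys.items()
... }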

Additionally, if dist_aggregate_batch_interval is set, the average, minimum, and maximum of the aforementioned statistics across all nodes are also logged.

Note

Memory usage monitoring is only supported for GPU devices.

Parameters
  • memory_keys (Dict[str, str], optional) – A dict specifying the memory statistics to log. Keys are the names of memory statistics from torch.cuda.memory_stats(), and values are the names they will be logged under. If not provided, the statistics above are logged. Defaults to None.

  • dist_aggregate_batch_interval (int, optional) – Interval, in batches, at which memory statistics are aggregated across all nodes. Defaults to None (aggregation disabled).
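
For example, a minimal sketch that logs only the allocated-memory statistics and aggregates them across nodes every 10 batches; the statistic names passed to memory_keys are standard torch.cuda.memory_stats() keys, assumed available in your torch version:

>>> from composer import Trainer
>>> from composer.callbacks import MemoryMonitor
>>> # Log only current/peak allocated memory, aggregated across nodes every 10 batches.
>>> monitor = MemoryMonitor(
...     memory_keys={
...         "allocated_bytes.all.current": "current_allocated_mem",
...         "allocated_bytes.all.peak": "peak_allocated_mem",
...     },
...     dist_aggregate_batch_interval=10,
... )
>>> trainer = Trainer(
...     model=model,
...     train_dataloader=train_dataloader,
...     max_duration="1ep",
...     callbacks=[monitor],
... )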