MemorySnapshot

class composer.callbacks.MemorySnapshot(skip_batches=1, interval='3ba', max_entries=100000, folder='{run_name}/torch_traces', filename='rank{rank}.{batch}.memory_snapshot', remote_file_name='{run_name}/torch_memory_traces', overwrite=False)

Logs the memory snapshot of the model.

This callback calls the torch memory snapshot API (see torch.cuda.memory._snapshot()) to record a model's tensor memory allocations over a user-defined interval. Recording happens only once, over the window [skip_batches, skip_batches + interval]. The result is a fine-grained view of GPU memory for debugging GPU out-of-memory (OOM) errors. Captured snapshots show memory events, including allocations, frees, and OOMs, along with their stack traces over that one interval.
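The recording window described above can be sketched in plain Python. This is only an illustration of the arithmetic, not Composer's implementation: the helper names recording_window and is_recorded are hypothetical, the interval is assumed to be expressed in batches, and treating the window as half-open (the batch at skip_batches + interval is not recorded) is an assumption.

```python
from typing import Tuple


def recording_window(skip_batches: int = 1, interval_batches: int = 3) -> Tuple[int, int]:
    """Return (start, stop) batch counts for the single recording window.

    Hypothetical helper: mirrors [skip_batches, skip_batches + interval]
    from the description, with the interval given in batches.
    """
    if skip_batches < 0 or interval_batches <= 0:
        raise ValueError("skip_batches must be >= 0 and interval_batches > 0")
    return skip_batches, skip_batches + interval_batches


def is_recorded(batch_index: int, skip_batches: int = 1, interval_batches: int = 3) -> bool:
    """Tell whether a batch falls inside the window.

    Assumption: the window is half-open, so recording stops once
    skip_batches + interval_batches batches have completed.
    """
    start, stop = recording_window(skip_batches, interval_batches)
    return start <= batch_index < stop


# With the defaults (skip_batches=1, interval='3ba'), three batches are covered.
recorded = [b for b in range(6) if is_recorded(b)]
print(recorded)
```

Under these assumptions, the default settings skip the first batch and then record three consecutive batches, after which no further snapshots are taken.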

Example

>>> from composer import Trainer
>>> from composer.callbacks import MemorySnapshot
>>> # constructing trainer object with this callback
>>> trainer = Trainer(
...     model=model,
...     train_dataloader=train_dataloader,
...     eval_dataloader=eval_dataloader,
...     optimizers=optimizer,
...     max_duration="1ep",
...     callbacks=[MemorySnapshot()],
... )

Note

Memory snapshot is only supported for GPU devices.

Parameters
  • skip_batches (int, optional) – Number of batches to skip before recording of the memory snapshot starts. Defaults to 1.

  • interval (Union[int, str, Time], optional) – Time string specifying how long to record tensor allocations. For example, interval='3ba' records 3 batches. Defaults to '3ba'.

  • max_entries (int, optional) – Maximum number of memory alloc/free events to record. Defaults to 100000.

  • folder (str, optional) – A format string describing the folder containing the memory snapshot files. Defaults to '{run_name}/torch_traces'.

  • filename (str, optional) – A format string describing the prefix used to name the memory snapshot files. Defaults to 'rank{rank}.{batch}.memory_snapshot'.

  • remote_file_name (str, optional) –

    A format string describing the prefix for the memory snapshot remote file name. Defaults to '{run_name}/torch_traces/rank{rank}.{batch}.memory_snapshot'.

    Whenever a trace file is saved, it is also uploaded as a file according to this format string. The same format variables as for filename are available.

    See also

    Uploading Files, for notes on file uploading.

    Leading slashes ('/') will be stripped.

    To disable uploading trace files, set this parameter to None.

  • overwrite (bool, optional) –

    Whether to overwrite existing memory snapshots. Defaults to False.

    If False, then the trace folder as determined by folder must be empty.
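To illustrate how the format strings for folder, filename, and remote_file_name might expand, here is a minimal sketch using Python's built-in str.format. Composer applies its own formatting logic, which may support more variables than shown; the helpers expand and expand_remote are hypothetical, and the values "my-run", 0, and 4 are made up for the example.

```python
def expand(template: str, *, run_name: str, rank: int, batch: int) -> str:
    """Fill in the format variables named in the parameter descriptions.

    Hypothetical helper built on str.format; not Composer's own formatter.
    """
    return template.format(run_name=run_name, rank=rank, batch=batch)


def expand_remote(template: str, **variables) -> str:
    """Expand a remote file name and strip leading slashes, as documented."""
    return expand(template, **variables).lstrip("/")


# Illustrative values only.
variables = dict(run_name="my-run", rank=0, batch=4)

local_folder = expand("{run_name}/torch_traces", **variables)
local_name = expand("rank{rank}.{batch}.memory_snapshot", **variables)
remote_name = expand_remote("/{run_name}/torch_memory_traces", **variables)

print(local_folder)
print(local_name)
print(remote_name)
```

Because the rank and batch appear in the file name, each rank writes a distinct file for the recorded window, which is presumably why both variables are part of the default filename.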