MemorySnapshot
- class composer.callbacks.MemorySnapshot(skip_batches=1, interval='3ba', max_entries=100000, folder='{run_name}/torch_traces', filename='rank{rank}.{batch}.memory_snapshot', remote_file_name='{run_name}/torch_memory_traces', overwrite=False)
Logs the memory snapshot of the model.
This callback calls the torch memory snapshot API (see torch.cuda.memory._snapshot()) to record a model's tensor memory allocation over a user-defined interval. Recording happens only once, over the window [skip_batches, skip_batches + interval]. This provides a fine-grained GPU memory visualization for debugging GPU OOMs. Captured memory snapshots show memory events, including allocations, frees, and OOMs, along with their stack traces over one interval.
Example
>>> from composer import Trainer
>>> from composer.callbacks import MemorySnapshot
>>> # constructing trainer object with this callback
>>> trainer = Trainer(
...     model=model,
...     train_dataloader=train_dataloader,
...     eval_dataloader=eval_dataloader,
...     optimizers=optimizer,
...     max_duration="1ep",
...     callbacks=[MemorySnapshot()],
... )
Note
Memory snapshot is only supported for GPU devices.
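The single recording window described above can be illustrated in plain Python. This is a minimal sketch, not Composer's implementation: the helper name is_recorded is hypothetical, the interval is assumed to be expressed in batches, batches are assumed to be 0-indexed, and the window is assumed to be half-open so that exactly interval batches are recorded.

```python
def is_recorded(batch: int, skip_batches: int = 1, interval_batches: int = 3) -> bool:
    """Return True if `batch` falls inside the single recording window.

    Assumption: the window is half-open, [skip_batches, skip_batches + interval_batches),
    so exactly `interval_batches` batches are recorded.
    """
    return skip_batches <= batch < skip_batches + interval_batches


# With the default arguments, batch 0 is skipped and the next three batches are recorded.
recorded = [b for b in range(6) if is_recorded(b)]
print(recorded)  # [1, 2, 3]
```

Raising skip_batches shifts the window later in training, which can help when the first few batches include one-time allocations that are not of interest.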
- Parameters
skip_batches (int, optional) – Number of batches to skip before starting to record the memory snapshot. Defaults to 1.
interval (Union[int, str, Time], optional) – Time string specifying how long to record the tensor allocation. For example, interval='3ba' means 3 batches are recorded. Defaults to '3ba'.
max_entries (int, optional) – Maximum number of memory alloc/free events to record. Defaults to 100000.
folder (str, optional) – A format string describing the folder containing the memory snapshot files. Defaults to '{run_name}/torch_traces'.
filename (str, optional) – A format string describing the prefix used to name the memory snapshot files. Defaults to 'rank{rank}.{batch}.memory_snapshot'.
remote_file_name (str, optional) –
A format string describing the prefix for the memory snapshot remote file name. Defaults to '{run_name}/torch_memory_traces'.
Whenever a trace file is saved, it is also uploaded as a file according to this format string. The same format variables as for filename are available.
See also
Uploading Files for notes on file uploading.
Leading slashes ('/') will be stripped.
To disable uploading trace files, set this parameter to None.
overwrite (bool, optional) –
Whether to overwrite existing memory snapshots. Defaults to False. If False, then the trace folder as determined by folder must be empty.
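To make the format-string parameters concrete, the following sketch expands the default folder, filename, and remote_file_name templates. It is only an illustration under stated assumptions: plain str.format stands in for Composer's own name-formatting logic (which may support more variables), and the run_name, rank, and batch values are hypothetical.

```python
# Hypothetical values; in a real run Composer supplies these.
fields = {'run_name': 'my-run', 'rank': 0, 'batch': 4}

# Default templates taken from the class signature.
folder = '{run_name}/torch_traces'.format(**fields)
filename = 'rank{rank}.{batch}.memory_snapshot'.format(**fields)
local_path = f'{folder}/{filename}'

# The documentation states that leading slashes in remote_file_name are stripped;
# lstrip('/') is used here as an assumed stand-in for that behavior.
remote = '/{run_name}/torch_memory_traces'.format(**fields).lstrip('/')

print(local_path)  # my-run/torch_traces/rank0.4.memory_snapshot
print(remote)      # my-run/torch_memory_traces
```

Including {rank} in filename keeps snapshots from different processes from colliding in the same folder during distributed training.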