OOMObserver

class composer.callbacks.OOMObserver(max_entries=100000, folder='{run_name}/torch_traces', filename='rank{rank}_oom', remote_file_name='{run_name}/oom_traces/rank{rank}_oom', overwrite=False)

Generate visualizations of the state of allocated memory during an OutOfMemory exception.

This callback registers an observer with the allocator. The observer is called every time the allocator is about to raise an OutOfMemoryError, before any memory has been released while unwinding the exception. OOMObserver is attached to the Trainer at the init stage. The visualizations include a snapshot of the memory state, a trace plot, a segment plot, a segment flamegraph, and a memory flamegraph.

Example

>>> from composer import Trainer
>>> from composer.callbacks import OOMObserver
>>> # constructing trainer object with this callback
>>> trainer = Trainer(
...     model=model,
...     train_dataloader=train_dataloader,
...     eval_dataloader=eval_dataloader,
...     optimizers=optimizer,
...     max_duration="1ep",
...     callbacks=[OOMObserver()],
... )

Note

OOMObserver is only supported for GPU devices.

Parameters
  • max_entries (int, optional) – Maximum number of memory alloc/free events to record. Defaults to 100000.

  • folder (str, optional) – A format string describing the folder containing the memory visualization files. Defaults to '{run_name}/torch_traces'.

  • filename (str, optional) – A format string describing the prefix used to name the memory visualization files. Defaults to 'rank{rank}_oom'.

  • remote_file_name (str, optional) –

    A format string describing the prefix for the memory visualization remote file name. Defaults to '{run_name}/oom_traces/rank{rank}_oom'.

    Whenever a trace file is saved, it is also uploaded as a file according to this format string. The same format variables as for filename are available.

    See also

    Uploading Files for notes on file uploading.

    Leading slashes ('/') will be stripped.

    To disable uploading trace files, set this parameter to None.

  • overwrite (bool, optional) –

    Whether to overwrite existing memory snapshots. Defaults to False.

    If False, then the trace folder, as determined by folder, must be empty.
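The way the format-string parameters and the overwrite check fit together can be illustrated with a short sketch. This is not Composer's implementation: the helpers `resolve_oom_paths` and `check_folder_is_empty` are hypothetical names, and the sketch assumes that only the `run_name` and `rank` variables are substituted (via `str.format`) and that the folder and filename prefix are joined with a forward slash. Composer's own formatting may support additional variables.

```python
import os


def resolve_oom_paths(
    run_name,
    rank,
    folder='{run_name}/torch_traces',
    filename='rank{rank}_oom',
    remote_file_name='{run_name}/oom_traces/rank{rank}_oom',
):
    """Hypothetical helper: expand the format strings for one rank.

    Returns (local_prefix, remote_prefix). remote_prefix is None when
    uploading is disabled by passing remote_file_name=None.
    """
    variables = {'run_name': run_name, 'rank': rank}

    # Assumption: the local prefix is the formatted folder joined with the
    # formatted filename prefix.
    local_prefix = folder.format(**variables) + '/' + filename.format(**variables)

    remote_prefix = None
    if remote_file_name is not None:
        # Leading slashes are stripped from the remote name, as documented.
        remote_prefix = remote_file_name.format(**variables).lstrip('/')
    return local_prefix, remote_prefix


def check_folder_is_empty(folder, overwrite=False):
    """Hypothetical helper mirroring the documented overwrite=False rule."""
    if not overwrite and os.path.isdir(folder) and os.listdir(folder):
        raise FileExistsError(
            f'{folder} is not empty; pass overwrite=True to reuse it.'
        )


# Example: paths for rank 0 of a run named "my-run".
local_prefix, remote_prefix = resolve_oom_paths('my-run', 0)
# local_prefix  == 'my-run/torch_traces/rank0_oom'
# remote_prefix == 'my-run/oom_traces/rank0_oom'
```

Because the remote name is a prefix rather than a full file name, each visualization produced for a rank would share it; the exact suffixes appended per file are not specified here.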