composer.profiler.torch_profiler

Profiler to collect torch performance metrics during training.

Classes

TorchProfiler – Profile the execution using the PyTorch Profiler.
class composer.profiler.torch_profiler.TorchProfiler(folder='{run_name}/torch_traces', filename='rank{rank}.{batch}.pt.trace.json', artifact_name='{run_name}/torch_traces/rank{rank}.{batch}.pt.trace.json', *, overwrite=False, use_gzip=False, record_shapes=False, profile_memory=True, with_stack=False, with_flops=True, num_traces_to_keep=-1)

Profile the execution using the PyTorch Profiler.

Profiling results are stored in TensorBoard format in the directory specified by folder.

Note

The Composer Trainer automatically creates an instance of this TorchProfiler callback whenever any of the PyTorch Profiler arguments (torch_prof_record_shapes, torch_prof_profile_memory, torch_prof_with_stack, or torch_prof_with_flops) are enabled.

When using the Composer Trainer, one does not need to directly create an instance of this TorchProfiler callback.

To view profiling results, run:

pip install tensorboard torch_tb_profiler
tensorboard --logdir path/to/torch/trace_folder


Note

See torch.profiler for additional usage details on torch.profiler.profile.

Note

Enabling shape and stack tracing results in additional overhead. When record_shapes=True is specified, the profiler will temporarily hold references to tensors which may prevent certain optimizations that depend on the reference count and can introduce extra tensor copies.

Parameters
• folder (str, optional) –

Format string for the folder containing the Torch Profiler trace files. Defaults to '{run_name}/torch_traces'.

The following format variables are available:

Variable             Description
{run_name}           The name of the training run. See run_name.
{rank}               The global rank, as returned by get_global_rank().
{local_rank}         The local rank of the process, as returned by get_local_rank().
{world_size}         The world size, as returned by get_world_size().
{local_world_size}   The local world size, as returned by get_local_world_size().
{node_rank}          The node rank, as returned by get_node_rank().

For example, if the run_name is 'awesome_training_run', and the default folder of '{run_name}/torch_traces' is used, Torch Profiler traces will be stored in 'awesome_training_run/torch_traces'.

• filename (str, optional) –

A format string describing how to name Torch Profiler trace files. Defaults to 'rank{rank}.{batch}.pt.trace.json'.

At the end of each batch where get_action() returns ACTIVE_AND_SAVE, trace files are saved to approximately {folder.format(...)}/{filename.format(...)}.

The following format variables are available:

Variable             Description
{run_name}           The name of the training run. See run_name.
{rank}               The global rank, as returned by get_global_rank().
{local_rank}         The local rank of the process, as returned by get_local_rank().
{world_size}         The world size, as returned by get_world_size().
{local_world_size}   The local world size, as returned by get_local_world_size().
{node_rank}          The node rank, as returned by get_node_rank().
{epoch}              The total epoch count, as returned by epoch().
{batch}              The total batch count, as returned by batch().
{batch_in_epoch}     The batch count in the current epoch, as returned by batch_in_epoch().
{sample}             The total sample count, as returned by sample().
{sample_in_epoch}    The sample count in the current epoch, as returned by sample_in_epoch().
{token}              The total token count, as returned by token().
{token_in_epoch}     The token count in the current epoch, as returned by token_in_epoch().
{total_wct}          The total training duration in seconds, as returned by total_wct().
{epoch_wct}          The epoch duration in seconds, as returned by epoch_wct().
{batch_wct}          The batch duration in seconds, as returned by batch_wct().

Consider the following scenario, where:

• The run_name is 'awesome-training-run'.

• The default folder='{run_name}/torch_traces' is used.

• The default filename='rank{rank}.{batch}.pt.trace.json' is used.

• The current epoch count is 1.

• The current batch count is 42.

Each rank (process) will save traces to:

awesome-training-run/torch_traces/rank0.42.pt.trace.json
awesome-training-run/torch_traces/rank1.42.pt.trace.json
awesome-training-run/torch_traces/rank2.42.pt.trace.json
...
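
How the default folder and filename format strings expand can be sketched with plain str.format. This is only an illustration: trace_path is a hypothetical helper that is not part of Composer, and the callback itself may supply more variables or normalize the path differently.

```python
import posixpath


def trace_path(folder, filename, **variables):
    """Hypothetical helper: expand both format strings, then join them."""
    # str.format ignores keyword arguments that a template does not use,
    # so the same variables can be passed to both templates.
    return posixpath.join(folder.format(**variables), filename.format(**variables))


# Values taken from the scenario above (epoch 1, batch 42).
variables = {"run_name": "awesome-training-run", "epoch": 1, "batch": 42}

for rank in range(3):
    print(trace_path("{run_name}/torch_traces",
                     "rank{rank}.{batch}.pt.trace.json",
                     rank=rank, **variables))
```

Because the default filename uses only {rank} and {batch}, the epoch count does not appear in the resulting file names unless {epoch} is added to the format string.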


• artifact_name (str, optional) –

Format string for a Torch Profiler trace file’s artifact name. Defaults to '{run_name}/torch_traces/rank{rank}.{batch}.pt.trace.json'.

Whenever a trace file is saved, it is also logged as a file artifact according to this format string. The same format variables as for filename are available.

See file_artifact() for details on file artifact logging.

Leading slashes ('/') will be stripped.

To disable logging trace files as file artifacts, set this parameter to None.
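
The artifact naming rules above (format the string, strip leading slashes, skip logging when the value is None) can be sketched as follows. format_artifact_name is a hypothetical helper written for illustration; the order in which Composer formats and strips is an assumption.

```python
def format_artifact_name(artifact_name, **variables):
    """Hypothetical helper mirroring the artifact_name rules described above."""
    if artifact_name is None:
        # None means trace files are not logged as file artifacts.
        return None
    # Assumption: formatting happens first, then leading slashes are stripped.
    return artifact_name.format(**variables).lstrip("/")


name = format_artifact_name(
    "/{run_name}/torch_traces/rank{rank}.{batch}.pt.trace.json",
    run_name="awesome-training-run", rank=0, batch=42,
)
print(name)
```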

• overwrite (bool, optional) –

Whether to overwrite existing Torch Profiler traces. Defaults to False.

If False, then the trace folder as determined by folder must be empty.

• use_gzip (bool, optional) – Whether to use gzip for the trace. Defaults to False. If True, '.gz' will be appended to filename and artifact_name (if they do not already end in '.gz').

• record_shapes (bool, optional) – Whether to record tensor shapes. Defaults to False.

• profile_memory (bool, optional) – Whether to profile memory. Defaults to True.

• with_stack (bool, optional) – Whether to record stack info. Defaults to False.

• with_flops (bool, optional) – Whether to estimate flops for operators. Defaults to True.

• num_traces_to_keep (int, optional) –

The number of trace files to keep locally. Defaults to -1.

If set to -1, then all trace files are kept locally.

After a trace has been saved and logged as a file artifact, the oldest traces are removed until num_traces_to_keep traces remain. This parameter only controls how many traces are kept locally; traces are not deleted from artifact stores.

It can be useful to set this parameter to 0 when using an artifact logger such as the ObjectStoreLogger. This combination will minimize local disk usage by deleting trace files immediately after they have been uploaded to the object store.
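
The retention rule for num_traces_to_keep can be sketched as below. prune_traces is a hypothetical helper, not Composer's implementation; it assumes the list of local trace paths is ordered oldest first and that any upload has already completed.

```python
import os
import tempfile


def prune_traces(saved, num_traces_to_keep):
    """Hypothetical helper: delete the oldest local traces beyond the limit.

    `saved` is a list of file paths ordered oldest first. The list is
    modified in place and returned.
    """
    if num_traces_to_keep < 0:
        # -1 keeps every trace file locally.
        return saved
    while len(saved) > num_traces_to_keep:
        oldest = saved.pop(0)
        os.remove(oldest)  # only the local copy is removed
    return saved


with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for batch in range(3):
        path = os.path.join(tmp, f"rank0.{batch}.pt.trace.json")
        open(path, "w").close()
        paths.append(path)

    remaining = prune_traces(list(paths), num_traces_to_keep=1)
    print([os.path.basename(p) for p in remaining])
```

With num_traces_to_keep=0 the loop removes every local file, which matches the disk-saving setup described above.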

saved_traces

The trace timestamps and filepaths.

This list contains tuples of the save timestamp and the trace filepaths. This list will have at most num_traces_to_keep entries. The latest trace will be at the end.

The index of a filepath in each list corresponds to the global rank of the process that wrote that file. Each filepath is valid only on the process’s (rank’s) node.

Type

List[Tuple[Timestamp, List[Path]]]
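
A minimal sketch of reading this structure, assuming the layout described above. The data is made up for illustration, an integer batch count stands in for Composer's Timestamp, and latest_trace_for_rank is a hypothetical helper.

```python
from pathlib import Path

# Illustrative stand-in data: each entry is (timestamp, [path for rank 0, path for rank 1, ...]).
# A real entry holds a Composer Timestamp rather than an int.
saved_traces = [
    (41, [Path("run/torch_traces/rank0.41.pt.trace.json"),
          Path("run/torch_traces/rank1.41.pt.trace.json")]),
    (42, [Path("run/torch_traces/rank0.42.pt.trace.json"),
          Path("run/torch_traces/rank1.42.pt.trace.json")]),
]


def latest_trace_for_rank(traces, global_rank):
    """Hypothetical helper: path of the newest trace written by a given rank."""
    if not traces:
        return None
    _timestamp, paths = traces[-1]  # the latest trace is at the end
    return paths[global_rank]       # list index equals the global rank


print(latest_trace_for_rank(saved_traces, 1))
```

The returned path is only meaningful on the node that hosts the given rank, as noted above.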