TorchProfiler
- class composer.profiler.TorchProfiler(folder='{run_name}/torch_traces', filename='rank{rank}.{batch}.pt.trace.json', remote_file_name='{run_name}/torch_traces/rank{rank}.{batch}.pt.trace.json', memory_filename=None, memory_remote_file_name='{run_name}/torch_memory_traces/rank{rank}.{batch}.pt.trace.memory.html', overwrite=False, use_gzip=False, record_shapes=False, profile_memory=True, with_stack=False, with_flops=True, num_traces_to_keep=-1)
Profile the execution using the PyTorch Profiler.
Profiling results are stored in TensorBoard format in the directory specified by folder.
Note
The Composer Trainer automatically creates an instance of this TorchProfiler callback whenever any of the PyTorch Profiler arguments (torch_prof_record_shapes, torch_prof_profile_memory, torch_prof_with_stack, or torch_prof_with_flops) are enabled.
When using the Composer Trainer, you do not need to create an instance of this TorchProfiler callback directly.
To view profiling results, run:
pip install tensorboard torch_tb_profiler
tensorboard --logdir path/to/torch/trace_folder
Note
See torch.profiler for additional usage details on torch.profiler.profile.
Note
Enabling shape and stack tracing results in additional overhead. When record_shapes=True is specified, the profiler will temporarily hold references to tensors, which may prevent certain optimizations that depend on the reference count and can introduce extra tensor copies.
- Parameters
folder (str, optional) –
Format string for the folder containing the Torch Profiler trace files. Defaults to '{run_name}/torch_traces'.
The following format variables are available:
Variable | Description
{run_name} | The name of the training run. See Logger.run_name.
{rank} | The global rank, as returned by get_global_rank().
{local_rank} | The local rank of the process, as returned by get_local_rank().
{world_size} | The world size, as returned by get_world_size().
{local_world_size} | The local world size, as returned by get_local_world_size().
{node_rank} | The node rank, as returned by get_node_rank().
For example, if the run_name is 'awesome_training_run', and the default folder of '{run_name}/torch_traces' is used, Torch Profiler traces will be stored in 'awesome_training_run/torch_traces'.
filename (str, optional) –
A format string describing how to name Torch Profiler trace files. Defaults to 'rank{rank}.{batch}.pt.trace.json'.
At the end of each batch where get_action() returns ACTIVE_AND_SAVE, trace files are saved approximately to {folder.format(...)}/{filename.format(...)}.
The following format variables are available:
Variable | Description
{run_name} | The name of the training run. See Logger.run_name.
{rank} | The global rank, as returned by get_global_rank().
{local_rank} | The local rank of the process, as returned by get_local_rank().
{world_size} | The world size, as returned by get_world_size().
{local_world_size} | The local world size, as returned by get_local_world_size().
{node_rank} | The node rank, as returned by get_node_rank().
{epoch} | The total epoch count, as returned by epoch().
{batch} | The total batch count, as returned by batch().
{batch_in_epoch} | The batch count in the current epoch, as returned by batch_in_epoch().
{sample} | The total sample count, as returned by sample().
{sample_in_epoch} | The sample count in the current epoch, as returned by sample_in_epoch().
{token} | The total token count, as returned by token().
{token_in_epoch} | The token count in the current epoch, as returned by token_in_epoch().
{total_wct} | The total training duration in seconds, as returned by total_wct().
{epoch_wct} | The epoch duration in seconds, as returned by epoch_wct().
{batch_wct} | The batch duration in seconds, as returned by batch_wct().
Consider the following scenario, where:
- The run_name is 'awesome-training-run'.
- The default folder='{run_name}/torch_traces' is used.
- The default filename='rank{rank}.{batch}.pt.trace.json' is used.
- The current epoch count is 1.
- The current batch count is 42.
Each rank (process) will save traces to:
awesome-training-run/torch_traces/rank0.42.pt.trace.json
awesome-training-run/torch_traces/rank1.42.pt.trace.json
awesome-training-run/torch_traces/rank2.42.pt.trace.json
...
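As a rough illustration of how the folder and filename format strings combine, the sketch below expands them with plain str.format. The helper render_trace_path and the hard-coded values are hypothetical and not part of Composer; Composer resolves these variables internally from the run state, so this only approximates the resulting path.

```python
import posixpath


def render_trace_path(folder: str, filename: str, **variables) -> str:
    """Expand both format strings and join them (hypothetical helper, not Composer API)."""
    # str.format ignores unused keyword arguments, so passing extra variables is harmless.
    return posixpath.join(folder.format(**variables), filename.format(**variables))


# Illustrative values only; in a real run they come from the training state.
variables = {'run_name': 'awesome-training-run', 'rank': 0, 'epoch': 1, 'batch': 42}

path = render_trace_path(
    '{run_name}/torch_traces',
    'rank{rank}.{batch}.pt.trace.json',
    **variables,
)
print(path)  # awesome-training-run/torch_traces/rank0.42.pt.trace.json
```

A filename that should include the epoch, such as 'ep{epoch}-ba{batch}-rank{rank}.pt.trace.json', can be passed the same way.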
remote_file_name (str, optional) –
Format string for a Torch Profiler trace file's remote file name. Defaults to '{run_name}/torch_traces/rank{rank}.{batch}.pt.trace.json'.
Whenever a trace file is saved, it is also uploaded as a file according to this format string. The same format variables as for filename are available.
See also
Uploading Files for notes on file uploading.
Leading slashes ('/') will be stripped.
To disable uploading trace files, set this parameter to None.
memory_filename (str, optional) –
A format string describing how to name Torch Profiler memory trace files. Defaults to None. An example memory_filename is 'rank{rank}.{batch}.pt.trace.memory.html'.
At the end of each batch where get_action() returns ACTIVE_AND_SAVE, trace files are saved approximately to {folder.format(...)}/{memory_filename.format(...)}.
The following format variables are available:
Variable | Description
{run_name} | The name of the training run. See Logger.run_name.
{rank} | The global rank, as returned by get_global_rank().
{local_rank} | The local rank of the process, as returned by get_local_rank().
{world_size} | The world size, as returned by get_world_size().
{local_world_size} | The local world size, as returned by get_local_world_size().
{node_rank} | The node rank, as returned by get_node_rank().
{epoch} | The total epoch count, as returned by epoch().
{batch} | The total batch count, as returned by batch().
{batch_in_epoch} | The batch count in the current epoch, as returned by batch_in_epoch().
{sample} | The total sample count, as returned by sample().
{sample_in_epoch} | The sample count in the current epoch, as returned by sample_in_epoch().
{token} | The total token count, as returned by token().
{token_in_epoch} | The token count in the current epoch, as returned by token_in_epoch().
{total_wct} | The total training duration in seconds, as returned by total_wct().
{epoch_wct} | The epoch duration in seconds, as returned by epoch_wct().
{batch_wct} | The batch duration in seconds, as returned by batch_wct().
Consider the following scenario, where:
- The run_name is 'awesome-training-run'.
- The default folder='{run_name}/torch_traces' is used.
- memory_filename='rank{rank}.{batch}.pt.trace.memory.html' is used.
- The current epoch count is 1.
- The current batch count is 42.
Each rank (process) will save traces to:
awesome-training-run/torch_traces/rank0.42.pt.trace.memory.html
awesome-training-run/torch_traces/rank1.42.pt.trace.memory.html
awesome-training-run/torch_traces/rank2.42.pt.trace.memory.html
...
memory_remote_file_name (str, optional) –
Format string for a Torch Profiler memory trace file's remote file name. Defaults to '{run_name}/torch_memory_traces/rank{rank}.{batch}.pt.trace.memory.html', as shown in the class signature.
Whenever a memory trace file is saved, it is also uploaded as a file according to this format string. The same format variables as for filename are available.
See also
Uploading Files for notes on file uploading.
Leading slashes ('/') will be stripped.
To disable uploading memory trace files, set this parameter to None.
overwrite (bool, optional) –
Whether to overwrite existing Torch Profiler traces. Defaults to False.
If False, then the trace folder as determined by folder must be empty.
use_gzip (bool, optional) – Whether to use gzip for the trace. Defaults to False. If True, '.gz' will be appended to filename and remote_file_name (if they do not already end in '.gz').
record_shapes (bool, optional) – Whether to record tensor shapes. Defaults to False.
profile_memory (bool, optional) – Whether to profile memory. Defaults to True.
with_stack (bool, optional) – Whether to record stack info. Defaults to False.
with_flops (bool, optional) – Whether to estimate flops for operators. Defaults to True.
num_traces_to_keep (int, optional) –
The number of trace files to keep locally. Defaults to -1.
If set to -1, then all trace files are kept locally.
After a trace has been saved and uploaded, the oldest traces are removed until num_traces_to_keep traces remain. This parameter only controls how many traces are kept locally; traces are not deleted from remote file systems.
It can be useful to set this parameter to 0 when using a remote file uploader such as the RemoteUploaderDownloader. This combination will minimize local disk usage by deleting trace files immediately after they have been uploaded to the object store.
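The name handling described for remote_file_name and use_gzip (leading slashes stripped, '.gz' appended only when missing, None disabling uploads) might be approximated as follows. This is a minimal sketch under those stated rules; normalize_remote_name is a hypothetical helper, not a Composer function, and Composer's actual ordering of these steps may differ.

```python
from typing import Optional


def normalize_remote_name(name: Optional[str], use_gzip: bool = False) -> Optional[str]:
    """Approximate the documented name handling (hypothetical helper, not Composer API)."""
    if name is None:
        # None disables uploading, so there is nothing to normalize.
        return None
    # Leading slashes are stripped from remote file names.
    name = name.lstrip('/')
    # '.gz' is appended only if the name does not already end with it.
    if use_gzip and not name.endswith('.gz'):
        name += '.gz'
    return name


print(normalize_remote_name('/run/torch_traces/rank0.42.pt.trace.json', use_gzip=True))
# run/torch_traces/rank0.42.pt.trace.json.gz
```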
- saved_traces
The trace timestamps and filepaths.
This list contains tuples of the save timestamp and the trace filepaths. This list will have at most num_traces_to_keep entries. The latest trace will be at the end.
The index of a filepath in each list corresponds to the global rank of the process that wrote that file. Each filepath is valid only on the process's (rank's) node.
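To make the shape of saved_traces and the effect of num_traces_to_keep more concrete, here is a small sketch of oldest-first pruning over a list of (timestamp, filepaths) tuples. The float timestamps, the file names, and the prune_saved_traces helper are assumptions for illustration; Composer uses its own timestamp type and also deletes the corresponding local files, which this sketch omits.

```python
from typing import List, Tuple

# Each entry pairs a save timestamp with one filepath per global rank.
# A float timestamp is an assumption made for this sketch.
SavedTrace = Tuple[float, List[str]]


def prune_saved_traces(saved: List[SavedTrace], num_traces_to_keep: int) -> List[SavedTrace]:
    """Drop the oldest entries in place and return them (hypothetical helper)."""
    if num_traces_to_keep < 0:
        # -1 means every trace is kept locally.
        return []
    removed = []
    while len(saved) > num_traces_to_keep:
        # The oldest trace sits at the front; the latest stays at the end.
        removed.append(saved.pop(0))
    return removed


saved = [
    (100.0, ['run/rank0.1.json', 'run/rank1.1.json']),
    (200.0, ['run/rank0.2.json', 'run/rank1.2.json']),
    (300.0, ['run/rank0.3.json', 'run/rank1.3.json']),
]
removed = prune_saved_traces(saved, num_traces_to_keep=2)
print([ts for ts, _ in removed])  # [100.0]
print(saved[-1][1][1])  # run/rank1.3.json (latest trace, global rank 1)
```

With num_traces_to_keep=0 the same loop empties the list, which mirrors the "delete immediately after upload" use case described for remote uploaders.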