JSONTraceHandler

class composer.profiler.JSONTraceHandler(folder='{run_name}/traces', filename='ep{epoch}-ba{batch}-rank{rank}.json', remote_file_name='{run_name}/traces/ep{epoch}-ba{batch}-rank{rank}.json', merged_trace_filename='merged_trace.json', merged_trace_remote_file_name='{run_name}/traces/merged_trace.json', *, overwrite=False, num_traces_to_keep=-1)

Records trace events in Chrome JSON trace format.

See this document for more information.

Traces are written to the directory given by folder. They can be visualized using the Chrome Trace Viewer: in a Google Chrome browser, navigate to chrome://tracing and load the JSON trace file.
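For orientation, a Chrome JSON trace is a JSON document holding a list of event objects. The sketch below writes a minimal hand-built trace file of the kind chrome://tracing can load. The event names, timings, and the file name example_trace.json are illustrative assumptions. This is not the code the handler runs, and the fields it emits may differ.

```python
import json
import os
import tempfile

# Two hand-written "complete" events (ph="X"). Timestamps (ts) and durations
# (dur) are in microseconds; pid and tid group events into rows in the viewer.
events = [
    {"name": "forward", "ph": "X", "ts": 0, "dur": 1500, "pid": 0, "tid": 0},
    {"name": "backward", "ph": "X", "ts": 1500, "dur": 3000, "pid": 0, "tid": 0},
]

# The object form of the format wraps the event list under "traceEvents".
trace = {"traceEvents": events}

# Write the trace to a temporary directory (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "example_trace.json")
with open(path, "w") as f:
    json.dump(trace, f)

# Read it back to confirm the file is valid JSON.
with open(path) as f:
    loaded = json.load(f)
```

Loading the resulting file in chrome://tracing should show the two events back to back on a single row.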

Parameters
  • folder (str, optional) –

    Format string for the trace file folder. Defaults to '{run_name}/traces'.

    The following format variables are available:

    Variable             Description
    {run_name}           The name of the training run. See Logger.run_name.
    {rank}               The global rank, as returned by get_global_rank().
    {local_rank}         The local rank of the process, as returned by get_local_rank().
    {world_size}         The world size, as returned by get_world_size().
    {local_world_size}   The local world size, as returned by get_local_world_size().
    {node_rank}          The node rank, as returned by get_node_rank().

    For example, if the run_name is 'awesome_training_run', and the default folder of '{run_name}/traces' is used, traces will be stored in 'awesome_training_run/traces'.

  • filename (str, optional) –

    A format string describing how to name trace files. (default: 'ep{epoch}-ba{batch}-rank{rank}.json')

    At the end of each batch where get_action() returns ACTIVE_AND_SAVE, trace files are saved approximately to {folder}/{filename.format(...)}.

    The following format variables are available:

    Variable             Description
    {run_name}           The name of the training run. See Logger.run_name.
    {rank}               The global rank, as returned by get_global_rank().
    {local_rank}         The local rank of the process, as returned by get_local_rank().
    {world_size}         The world size, as returned by get_world_size().
    {local_world_size}   The local world size, as returned by get_local_world_size().
    {node_rank}          The node rank, as returned by get_node_rank().
    {epoch}              The total epoch count, as returned by epoch().
    {batch}              The total batch count, as returned by batch().
    {batch_in_epoch}     The batch count in the current epoch, as returned by batch_in_epoch().
    {sample}             The total sample count, as returned by sample().
    {sample_in_epoch}    The sample count in the current epoch, as returned by sample_in_epoch().
    {token}              The total token count, as returned by token().
    {token_in_epoch}     The token count in the current epoch, as returned by token_in_epoch().
    {total_wct}          The total training duration in seconds, as returned by total_wct().
    {epoch_wct}          The epoch duration in seconds, as returned by epoch_wct().
    {batch_wct}          The batch duration in seconds, as returned by batch_wct().

    Consider the following scenario, where:

    • The run_name is 'awesome-training-run'

    • The default folder='{run_name}/traces' is used.

    • The default filename='ep{epoch}-ba{batch}-rank{rank}.json' is used.

    • The current epoch count is 1.

    • The current batch count is 42.

    Each rank (process) will save traces to:

    awesome-training-run/traces/ep1-ba42-rank0.json
    awesome-training-run/traces/ep1-ba42-rank1.json
    awesome-training-run/traces/ep1-ba42-rank2.json
    ...
    

  • remote_file_name (str, optional) –

    Format string for the trace file’s remote name. (default: '{run_name}/traces/ep{epoch}-ba{batch}-rank{rank}.json')

    Whenever a trace file is saved, it is also uploaded as a remote file according to this format string. The same format variables as for filename are available.

    See also

    Uploading Files for notes on file uploading.

    Leading slashes ('/') will be stripped.

    To disable uploading trace files, set this parameter to None.

  • merged_trace_filename (str, optional) –

    Format string for the merged trace filename. (default: 'merged_trace.json')

    Each rank writes a separate trace file at the end of each profiling cycle. However, when visualizing traces, it is generally helpful to merge them into a single file, so that the traces across all ranks are shown in a single view.

    The same format variables as for folder are available. The merged trace file is saved approximately to {folder}/{merged_trace_filename.format(...)} on the local rank zero process for each node.

    If specified (the default), the local rank zero process merges all trace files from that node, across all profiling cycles, into a single trace file. The merged trace file is written to the filename specified by this format string. There will be one merged trace file per node.

    To disable merging, set this parameter to None.

    Warning

    Trace merging blocks the training loop. When profiling live training runs, it is recommended to disable trace merging by setting this parameter to None. Instead, traces should be merged together in a post-processing step. See composer.profiler.json_trace_merger for additional info.

  • merged_trace_remote_file_name (str, optional) –

    Format string for the merged trace file’s remote file name. (default: '{run_name}/traces/merged_trace.json')

    The same format variables as for folder are available.

    This parameter has no effect if merged_trace_filename is None.

    To disable uploading merged trace files, set this parameter to None.

  • overwrite (bool, optional) – Whether to overwrite existing traces. (default: False) If False, the trace folder (as determined by the folder argument) must be empty when training starts.

  • num_traces_to_keep (int, optional) –

    The number of traces to keep locally. The oldest traces are removed first. Set to -1 to keep all traces locally. (default: -1)

    Traces will be removed after they have been uploaded. For example, when this handler is used in conjunction with the RemoteUploaderDownloader, set this parameter to 0 to immediately delete traces from the local disk after they have been uploaded to the object store.

    This parameter only controls how many traces are kept locally; traces are not deleted from remote file systems.
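The retention rule for num_traces_to_keep can be sketched in plain Python as follows. The helper prune_local_traces is hypothetical and only models which local paths would be kept or removed. It does not touch the file system, does not check whether an upload has finished, and is not the handler's own implementation.

```python
def prune_local_traces(saved, num_traces_to_keep):
    """Split trace paths (ordered oldest first) into (kept, removed).

    Hypothetical helper for illustration only.
    """
    if num_traces_to_keep < 0:
        # -1 keeps every trace locally.
        return list(saved), []
    # Everything before the split index is older than the retention window.
    split = max(len(saved) - num_traces_to_keep, 0)
    return saved[split:], saved[:split]


# Example trace file names for a single rank, oldest first.
saved = ["ep0-ba10-rank0.json", "ep0-ba20-rank0.json", "ep0-ba30-rank0.json"]

keep_all = prune_local_traces(saved, -1)   # nothing removed
keep_two = prune_local_traces(saved, 2)    # oldest trace removed
keep_none = prune_local_traces(saved, 0)   # every local copy removed
```

With a value of 0, every trace lands in the removed list, which matches the case described above where local copies are deleted once they have been uploaded.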

saved_traces

The trace timestamps and filepaths.

This list contains tuples of the save timestamp and the trace filepaths. This list will have at most num_traces_to_keep entries. The latest trace will be at the end.

The index of a filepath in each list corresponds to the global rank of the process that wrote that file. Each filepath is valid only on the process’s (rank’s) node.

Type

list[tuple[Timestamp, list[Path]]]
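As a sketch of how this structure might be read, the snippet below builds a stand-in list with the same shape and looks up the latest trace for one rank. The data is made up for illustration: plain strings stand in for Timestamp objects, and the paths and the two-rank setup are assumptions.

```python
from pathlib import Path

# Hypothetical stand-in for saved_traces with two ranks and two saves.
# Each entry is (timestamp, [path for rank 0, path for rank 1]).
saved_traces = [
    ("ep1-ba42", [Path("run/traces/ep1-ba42-rank0.json"),
                  Path("run/traces/ep1-ba42-rank1.json")]),
    ("ep1-ba84", [Path("run/traces/ep1-ba84-rank0.json"),
                  Path("run/traces/ep1-ba84-rank1.json")]),
]

# The latest save is the last entry in the list.
latest_timestamp, latest_paths = saved_traces[-1]

# The position in the inner list is the global rank that wrote the file.
global_rank = 1
my_latest_trace = latest_paths[global_rank]
```

Because each filepath is valid only on the node of the rank that wrote it, a lookup like this is meaningful only for ranks on the same node as the reader.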