save_checkpoint

composer.utils.save_checkpoint(state, filename='ep{epoch}-ba{batch}-rank{rank}', *, weights_only=False, ignore_keys=None)

Checkpoint the training state.

Parameters

state (State) – The training state.
logger (Logger) – The logger.

filename (str) –

A format string describing how to name checkpoints. (default: 'ep{epoch}-ba{batch}-rank{rank}')

The following format variables are available:

Variable	Description
`{run_name}`	The name of the training run. See `Logger.run_name`.
`{rank}`	The global rank, as returned by `get_global_rank()`.
`{local_rank}`	The local rank of the process, as returned by `get_local_rank()`.
`{world_size}`	The world size, as returned by `get_world_size()`.
`{local_world_size}`	The local world size, as returned by `get_local_world_size()`.
`{node_rank}`	The node rank, as returned by `get_node_rank()`.
`{epoch}`	The total epoch count, as returned by `epoch()`.
`{batch}`	The total batch count, as returned by `batch()`.
`{batch_in_epoch}`	The batch count in the current epoch, as returned by `batch_in_epoch()`.
`{sample}`	The total sample count, as returned by `sample()`.
`{sample_in_epoch}`	The sample count in the current epoch, as returned by `sample_in_epoch()`.
`{token}`	The total token count, as returned by `token()`.
`{token_in_epoch}`	The token count in the current epoch, as returned by `token_in_epoch()`.
`{total_wct}`	The total training duration in seconds, as returned by `total_wct()`.
`{epoch_wct}`	The epoch duration in seconds, as returned by `epoch_wct()`.
`{batch_wct}`	The batch duration in seconds, as returned by `batch_wct()`.
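
The substitution of these variables can be pictured as ordinary Python format-string filling. The following is a minimal sketch, assuming plain `str.format` semantics; the helper `format_checkpoint_name` and the hardcoded values are hypothetical and are not part of Composer, which derives the values from the training state and the distributed environment.

```python
def format_checkpoint_name(template: str, **values: object) -> str:
    """Fill a checkpoint filename template with the given values.

    Hypothetical helper for illustration only. Values that the template
    does not reference are ignored by str.format.
    """
    return template.format(**values)


# Values assumed for the example.
values = {"run_name": "my-run", "rank": 0, "epoch": 1, "batch": 42}

default_name = format_checkpoint_name("ep{epoch}-ba{batch}-rank{rank}", **values)
custom_name = format_checkpoint_name("{run_name}/ep{epoch}-rank{rank}.pt", **values)

print(default_name)
print(custom_name)
```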

Note

By default, only the rank zero process will save a checkpoint file.
When using DeepSpeed, each rank will save a checkpoint file in tarball format. DeepSpeed requires tarball format, as it saves model and optimizer states in separate files. Ensure that '{rank}' appears within the filename. Otherwise, multiple ranks may attempt to write to the same file(s), leading to corrupted checkpoints. If no tarball file extension is specified, .tar will be used.
To write to compressed tar files (regardless of whether DeepSpeed is enabled), set the file extension to '.tar.gz', '.tgz', '.tar.bz2', or '.tar.lzma' (depending on the desired compression algorithm).
To write to compressed pt files (when DeepSpeed is disabled), set the file extension to '.pt.bz2', '.pt.gz', '.pt.lz4', '.pt.lzma', '.pt.lzo', '.pt.xz', or '.pt.zst' (depending on the desired algorithm). You must have the corresponding CLI tool installed. lz4 is a good choice: it gives a modest space saving and is very fast to compress.
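
The extension rules in this note can be summarized as a small lookup. This is a sketch only: the function `classify_checkpoint_filename` and the two tuples are hypothetical names built from the extensions listed above, and they do not reflect how Composer implements the check internally.

```python
# Extensions taken from the note above.
TAR_EXTENSIONS = (".tar", ".tar.gz", ".tgz", ".tar.bz2", ".tar.lzma")
COMPRESSED_PT_EXTENSIONS = (
    ".pt.bz2", ".pt.gz", ".pt.lz4", ".pt.lzma", ".pt.lzo", ".pt.xz", ".pt.zst",
)


def classify_checkpoint_filename(filename: str) -> str:
    """Return 'tar', 'compressed-pt', or 'plain' based on the file extension.

    Hypothetical helper for illustration only.
    """
    # str.endswith accepts a tuple of suffixes.
    if filename.endswith(TAR_EXTENSIONS):
        return "tar"
    if filename.endswith(COMPRESSED_PT_EXTENSIONS):
        return "compressed-pt"
    return "plain"


for name in ("ep1-ba42-rank0.tar.gz", "ep1-ba42-rank0.pt.lz4", "ep1-ba42-rank0"):
    print(name, "->", classify_checkpoint_filename(name))
```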

Warning

Using compression blocks the training loop while checkpoints are being compressed, and the compressibility of checkpoints can vary significantly depending on your setup. As such, we recommend saving checkpoints without compression by default.

If you have the lz4 command available on your system, you may want to try saving as .pt.lz4, as the overhead is minimal (usually less than a second) and the space saved can sometimes be significant (1% to 40%).
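
One way to follow this advice is to check for the CLI tool before choosing an extension. The sketch below uses the standard library's `shutil.which`; the helper `choose_pt_extension` is hypothetical, and the uncompressed '.pt' fallback is an assumption made for the example rather than something this page specifies.

```python
import shutil


def choose_pt_extension(tool: str = "lz4") -> str:
    """Return '.pt.<tool>' if the CLI tool is on PATH, else '.pt'.

    Hypothetical helper. The mapping from tool name to extension only holds
    where the two match, as they do for lz4. The '.pt' fallback is assumed.
    """
    if shutil.which(tool) is not None:
        return f".pt.{tool}"
    return ".pt"


# Append the chosen extension to the default filename template.
extension = choose_pt_extension()
filename = "ep{epoch}-ba{batch}-rank{rank}" + extension
print(filename)
```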

Consider the following scenario, where:

The default filename='ep{epoch}-ba{batch}-rank{rank}' is used.
The current epoch count is 1.
The current batch count is 42.

When DeepSpeed is not being used, the rank zero process will save the checkpoint to 'ep1-ba42-rank0'. When DeepSpeed is being used, each rank (process) will save checkpoints to:

ep1-ba42-rank0.tar
ep1-ba42-rank1.tar
ep1-ba42-rank2.tar
...
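
The per-rank names in this scenario can be reproduced with a short sketch. The function `deepspeed_checkpoint_names` is hypothetical and only mirrors the behavior described on this page (a '{rank}' placeholder is expected, and '.tar' is appended when no tarball extension is given); it does not call Composer or DeepSpeed.

```python
TAR_EXTENSIONS = (".tar", ".tar.gz", ".tgz", ".tar.bz2", ".tar.lzma")


def deepspeed_checkpoint_names(
    template: str, epoch: int, batch: int, world_size: int
) -> list[str]:
    """List the filename each rank would write under the rules described above.

    Hypothetical helper for illustration only.
    """
    if "{rank}" not in template:
        # Without {rank}, several ranks would target the same file.
        raise ValueError("template must contain '{rank}' when every rank saves")
    names = []
    for rank in range(world_size):
        name = template.format(epoch=epoch, batch=batch, rank=rank)
        if not name.endswith(TAR_EXTENSIONS):
            name += ".tar"  # default tarball extension
        names.append(name)
    return names


names = deepspeed_checkpoint_names(
    "ep{epoch}-ba{batch}-rank{rank}", epoch=1, batch=42, world_size=3
)
print(names)
```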

weights_only (bool, optional) –
If True, save only the model weights instead of the entire training state. (default: False)

Note

When using DeepSpeed, this parameter must be False. Weights-only checkpointing is not currently compatible with DeepSpeed.
Returns –
list[pathlib.Path]: The list of checkpoint files saved, indexed by the rank of the process.

Note

When using DeepSpeed, each process (rank) saves its own checkpoint file. When doing multi-node training, the filepaths are valid only on each process’s node; Composer does not move checkpoint files between nodes.

Otherwise, when not using DeepSpeed, the list will contain only one filepath, since only the rank zero process saves checkpoints.