save_checkpoint
composer.utils.save_checkpoint(state, filename='ep{epoch}-ba{batch}-rank{rank}', *, weights_only=False)
Checkpoint the training state.
Parameters
state (State) – The training state.
logger (Logger) – The logger.
filename (str) – A format string describing how to name checkpoints. (default: 'ep{epoch}-ba{batch}-rank{rank}')
The following format variables are available:
| Variable | Description |
| --- | --- |
| {run_name} | The name of the training run. See Logger.run_name. |
| {rank} | The global rank, as returned by get_global_rank(). |
| {local_rank} | The local rank of the process, as returned by get_local_rank(). |
| {world_size} | The world size, as returned by get_world_size(). |
| {local_world_size} | The local world size, as returned by get_local_world_size(). |
| {node_rank} | The node rank, as returned by get_node_rank(). |
| {epoch} | The total epoch count, as returned by epoch(). |
| {batch} | The total batch count, as returned by batch(). |
| {batch_in_epoch} | The batch count in the current epoch, as returned by batch_in_epoch(). |
| {sample} | The total sample count, as returned by sample(). |
| {sample_in_epoch} | The sample count in the current epoch, as returned by sample_in_epoch(). |
| {token} | The total token count, as returned by token(). |
| {token_in_epoch} | The token count in the current epoch, as returned by token_in_epoch(). |
| {total_wct} | The total training duration in seconds, as returned by total_wct(). |
| {epoch_wct} | The epoch duration in seconds, as returned by epoch_wct(). |
| {batch_wct} | The batch duration in seconds, as returned by batch_wct(). |

Note
By default, only the rank zero process will save a checkpoint file.
When using DeepSpeed, each rank will save a checkpoint file in tarball format. DeepSpeed requires tarball format, as it saves model and optimizer states in separate files. Ensure that '{rank}' appears within the filename. Otherwise, multiple ranks may attempt to write to the same file(s), leading to corrupted checkpoints. If no tarball file extension is specified, .tar will be used.
To use compression (regardless of whether DeepSpeed is enabled), set the file extension to '.tar.gz', '.tgz', '.tar.bzip', or '.tar.lzma' (depending on the desired compression algorithm).
Warning
Using compression will block the training loop while checkpoints are being compressed. As such, we recommend saving checkpoints without compression.
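The extension rules in the note above can be sketched with a small helper. This is an illustration only, not Composer's implementation: the function name `resolve_tarball` and the mapping from each documented extension to a standard-library `tarfile` write mode are assumptions made for this example.

```python
from typing import Tuple

# Assumed mapping from the documented extensions to tarfile write modes.
# In particular, treating '.tar.bzip' as bzip2 and '.tar.lzma' as 'w:xz'
# is a guess for illustration, not confirmed Composer behavior.
_TAR_MODES = {
    ".tar.gz": "w:gz",
    ".tgz": "w:gz",
    ".tar.bzip": "w:bz2",
    ".tar.lzma": "w:xz",
    ".tar": "w",
}


def resolve_tarball(filename: str) -> Tuple[str, str]:
    """Return (final filename, tarfile write mode) for a checkpoint name.

    If the name has no recognized tarball extension, '.tar' is appended
    and the archive is written without compression.
    """
    for ext, mode in _TAR_MODES.items():
        if filename.endswith(ext):
            return filename, mode
    return filename + ".tar", "w"


uncompressed = resolve_tarball("ep1-ba42-rank0")
compressed = resolve_tarball("ep1-ba42-rank0.tar.gz")
```

Under these assumptions, the uncompressed `"w"` mode is the one that avoids blocking the training loop on compression.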
Consider the following scenario, where:
- The default filename='ep{epoch}-ba{batch}-rank{rank}' is used.
- The current epoch count is 1.
- The current batch count is 42.
When DeepSpeed is not being used, the rank zero process will save the checkpoint to 'ep1-ba42-rank0'. When DeepSpeed is being used, each rank (process) will save checkpoints to:
ep1-ba42-rank0.tar
ep1-ba42-rank1.tar
ep1-ba42-rank2.tar
...
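The scenario above can be reproduced with plain Python string formatting. This is a minimal sketch that assumes the format variables are substituted with `str.format`; the helper names `format_checkpoint_name` and `uses_rank` are hypothetical and not part of Composer.

```python
from string import Formatter


def format_checkpoint_name(filename: str, **variables) -> str:
    """Fill a checkpoint format string with the given variable values."""
    return filename.format(**variables)


def uses_rank(filename: str) -> bool:
    """Check whether the format string contains the {rank} variable."""
    return any(field == "rank" for _, field, _, _ in Formatter().parse(filename))


default_filename = "ep{epoch}-ba{batch}-rank{rank}"

# Without DeepSpeed: only the rank zero process writes a file.
single = format_checkpoint_name(default_filename, epoch=1, batch=42, rank=0)

# With DeepSpeed: every rank writes its own tarball (three ranks assumed here).
per_rank = [
    format_checkpoint_name(default_filename, epoch=1, batch=42, rank=r) + ".tar"
    for r in range(3)
]
```

A check like `uses_rank(filename)` before a DeepSpeed run is one way to catch names that would make several ranks write to the same file.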
weights_only (bool, optional) – If True, save only the model weights instead of the entire training state. (default: False)
Note
When using DeepSpeed, this parameter must be False. Weights-only checkpointing is not currently compatible with DeepSpeed.
Returns
List[pathlib.Path]: The list of checkpoint files saved, indexed by the rank of the process.
Note
When using DeepSpeed, each process (rank) saves its own checkpoint file. When doing multi-node training, the filepaths are valid only on each process's node; Composer does not move checkpoint files between nodes.
Otherwise, when not using DeepSpeed, the list will contain only one filepath, since only the rank zero process saves checkpoints.
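A call following the signature documented above might look like the sketch below. The wrapper `save_weights_checkpoint` and its `save_fn` parameter are hypothetical, added only so the example can be loaded and exercised without Composer installed. It assumes the caller already has a `State` object and is not using DeepSpeed, since `weights_only=True` is not compatible with it.

```python
from pathlib import Path
from typing import Any, Callable, List, Optional


def save_weights_checkpoint(
    state: Any,
    save_fn: Optional[Callable[..., List[Path]]] = None,
) -> List[Path]:
    """Save a weights-only checkpoint and return the saved file paths.

    `save_fn` is a hypothetical injection point for testing; when it is
    omitted, composer.utils.save_checkpoint is imported and used.
    """
    if save_fn is None:
        # Imported lazily so this sketch loads without Composer installed.
        from composer.utils import save_checkpoint
        save_fn = save_checkpoint

    # weights_only is keyword-only in the documented signature.
    return save_fn(
        state,
        filename="ep{epoch}-ba{batch}-rank{rank}",
        weights_only=True,
    )
```

Without DeepSpeed, the returned list is expected to hold a single path, the file written by the rank zero process.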