composer.callbacks.checkpoint_saver
Callback to save checkpoints during training.
Functions
checkpoint_periodically | Helper function to create a checkpoint scheduler according to a specified interval.
Classes
CheckpointSaver | Callback to save checkpoints.
- class composer.callbacks.checkpoint_saver.CheckpointSaver(folder='{run_name}/checkpoints', filename='ep{epoch}-ba{batch}-rank{rank}', artifact_name='{run_name}/checkpoints/ep{epoch}-ba{batch}-rank{rank}', latest_filename='latest-rank{rank}', latest_artifact_name='{run_name}/checkpoints/latest-rank{rank}', save_interval='1ep', *, overwrite=False, num_checkpoints_to_keep=-1, weights_only=False)
Bases:
composer.core.callback.Callback
Callback to save checkpoints.
Note
If the folder argument is specified when constructing the Trainer, then the CheckpointSaver callback need not be constructed manually. However, for advanced checkpointing use cases (such as saving a weights-only checkpoint at one interval and the full training state at another interval), instance(s) of this CheckpointSaver callback can be specified in the callbacks argument of the Trainer, as shown in the example below.
Example
>>> from composer import Trainer
>>> from composer.callbacks import CheckpointSaver
>>> trainer = Trainer(..., callbacks=[
...     CheckpointSaver(
...         folder='{run_name}/checkpoints',
...         filename="ep{epoch}-ba{batch}-rank{rank}",
...         latest_filename="latest-rank{rank}",
...         save_interval="1ep",
...         weights_only=False,
...     )
... ])
- Parameters
folder (str, optional) –
Format string for the folder where checkpoints will be saved. (default: '{run_name}/checkpoints')
The following format variables are available:
Variable | Description
{run_name} | The name of the training run. See run_name.
{rank} | The global rank, as returned by get_global_rank().
{local_rank} | The local rank of the process, as returned by get_local_rank().
{world_size} | The world size, as returned by get_world_size().
{local_world_size} | The local world size, as returned by get_local_world_size().
{node_rank} | The node rank, as returned by get_node_rank().
Note
When training with multiple devices (i.e. GPUs), ensure that '{rank}' appears in the format. Otherwise, multiple processes may attempt to write to the same file.
filename (str, optional) –
A format string describing how to name checkpoints. (default: 'ep{epoch}-ba{batch}-rank{rank}')
Checkpoints will be saved approximately to {folder}/{filename.format(...)}.
The following format variables are available:
Variable | Description
{run_name} | The name of the training run. See run_name.
{rank} | The global rank, as returned by get_global_rank().
{local_rank} | The local rank of the process, as returned by get_local_rank().
{world_size} | The world size, as returned by get_world_size().
{local_world_size} | The local world size, as returned by get_local_world_size().
{node_rank} | The node rank, as returned by get_node_rank().
{epoch} | The total epoch count, as returned by epoch().
{batch} | The total batch count, as returned by batch().
{batch_in_epoch} | The batch count in the current epoch, as returned by batch_in_epoch().
{sample} | The total sample count, as returned by sample().
{sample_in_epoch} | The sample count in the current epoch, as returned by sample_in_epoch().
{token} | The total token count, as returned by token().
{token_in_epoch} | The token count in the current epoch, as returned by token_in_epoch().
Note
By default, only the rank zero process will save a checkpoint file.
When using DeepSpeed, each rank will save a checkpoint file in tarball format. DeepSpeed requires tarball format, as it saves model and optimizer states in separate files. Ensure that '{rank}' appears within the filename. Otherwise, multiple ranks may attempt to write to the same file(s), leading to corrupted checkpoints. If no tarball file extension is specified, '.tar' will be used.
To use compression (regardless of whether DeepSpeed is enabled), set the file extension to '.tar.gz', '.tgz', '.tar.bzip', or '.tar.lzma' (depending on the desired compression algorithm).
Warning
Using compression will block the training loop while checkpoints are being compressed. As such, we recommend saving checkpoints without compression.
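To make the tarball note above concrete, here is a minimal sketch of packing separate state files into one tarball, with gzip applied only when the file extension asks for it. It uses only the Python standard library and is not Composer's or DeepSpeed's implementation: the bundle_checkpoint helper and the two state file names are hypothetical, and only the gzip extensions are handled.

```python
import os
import tarfile
import tempfile


def bundle_checkpoint(file_paths, tar_path):
    """Hypothetical helper: pack several state files into one tarball."""
    # gzip only when the extension asks for it; otherwise write an uncompressed tarball.
    mode = "w:gz" if tar_path.endswith((".tar.gz", ".tgz")) else "w"
    with tarfile.open(tar_path, mode) as tar:
        for file_path in file_paths:
            tar.add(file_path, arcname=os.path.basename(file_path))


with tempfile.TemporaryDirectory() as folder:
    state_files = []
    for name in ("model_states.pt", "optim_states.pt"):  # placeholder file names
        file_path = os.path.join(folder, name)
        with open(file_path, "wb") as f:
            f.write(b"placeholder bytes")
        state_files.append(file_path)

    tar_path = os.path.join(folder, "ep1-ba42-rank0.tar")
    bundle_checkpoint(state_files, tar_path)

    # Mode "r" detects the compression (if any) automatically.
    with tarfile.open(tar_path, "r") as tar:
        member_names = sorted(tar.getnames())

print(member_names)
```

Writing with mode "w" only copies bytes into the archive, while "w:gz" also compresses them, which is the extra work the warning above refers to.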
Consider the following scenario, where:
- The run_name is 'awesome-training-run'.
- The default folder='{run_name}/checkpoints' is used.
- The default filename='ep{epoch}-ba{batch}-rank{rank}' is used.
- The current epoch count is 1.
- The current batch count is 42.
When DeepSpeed is not being used, the rank zero process will save the checkpoint to "awesome-training-run/checkpoints/ep1-ba42-rank0".
When DeepSpeed is being used, each rank (process) will save checkpoints to:
awesome-training-run/checkpoints/ep1-ba42-rank0.tar
awesome-training-run/checkpoints/ep1-ba42-rank1.tar
awesome-training-run/checkpoints/ep1-ba42-rank2.tar
...
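The path resolution in this scenario can be reproduced with plain str.format. The sketch below is an illustration only, not Composer's code: the resolve_checkpoint_path helper is hypothetical, the values are hard-coded from the scenario, and only a subset of the format variables is supplied.

```python
import posixpath


def resolve_checkpoint_path(folder, filename, **format_vars):
    """Hypothetical helper: fill in the format variables, then join folder and filename."""
    return posixpath.join(folder.format(**format_vars), filename.format(**format_vars))


# Values copied from the scenario; in a real run they would come from the trainer state.
format_vars = {"run_name": "awesome-training-run", "epoch": 1, "batch": 42, "rank": 0}

path = resolve_checkpoint_path(
    folder="{run_name}/checkpoints",
    filename="ep{epoch}-ba{batch}-rank{rank}",
    **format_vars,
)
print(path)
```

Dropping '{rank}' from the filename would make every rank resolve to the same path, which is the collision the notes above warn about.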
artifact_name (str, optional) –
Format string for the checkpoint's artifact name. (default: '{run_name}/checkpoints/ep{epoch}-ba{batch}-rank{rank}')
After the checkpoint is saved, it will be periodically logged as a file artifact. The artifact name will be determined by this format string.
See also log_file_artifact() for file artifact logging.
The same format variables as for filename are available.
Leading slashes ('/') will be stripped.
To disable logging checkpoints as file artifacts, set this parameter to None.
latest_filename (str, optional) –
A format string for a symlink which points to the last saved checkpoint. (default: 'latest-rank{rank}')
Symlinks will be created approximately at {folder}/{latest_filename.format(...)}.
The same format variables as for filename are available.
To disable symlinks, set this parameter to None.
Consider the following scenario, where:
- The run_name is 'awesome-training-run'.
- The default folder='{run_name}/checkpoints' is used.
- The default filename='ep{epoch}-ba{batch}-rank{rank}' is used.
- The default latest_filename='latest-rank{rank}' is used.
- The current epoch count is 1.
- The current batch count is 42.
When DeepSpeed is not being used, the rank zero process will save the checkpoint to 'awesome-training-run/checkpoints/ep1-ba42-rank0', and a symlink will be created at 'awesome-training-run/checkpoints/latest-rank0' -> 'awesome-training-run/checkpoints/ep1-ba42-rank0'.
When DeepSpeed is being used, each rank (process) will save checkpoints to:
awesome-training-run/checkpoints/ep1-ba42-rank0.tar
awesome-training-run/checkpoints/ep1-ba42-rank1.tar
awesome-training-run/checkpoints/ep1-ba42-rank2.tar
...
Corresponding symlinks will be created at:
awesome-training-run/checkpoints/latest-rank0.tar -> awesome-training-run/checkpoints/ep1-ba42-rank0.tar
awesome-training-run/checkpoints/latest-rank1.tar -> awesome-training-run/checkpoints/ep1-ba42-rank1.tar
awesome-training-run/checkpoints/latest-rank2.tar -> awesome-training-run/checkpoints/ep1-ba42-rank2.tar
...
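The "latest" symlink behavior can be sketched with the standard library. This is an illustration under stated assumptions rather than Composer's implementation: the update_latest_symlink helper is hypothetical, the second checkpoint name (ep2-ba84-rank0) is invented for the demo, and the choice of a relative link target is an assumption. It requires a platform that permits creating symlinks.

```python
import os
import tempfile


def update_latest_symlink(checkpoint_path, latest_path):
    """Hypothetical helper: point latest_path at checkpoint_path, replacing any old link."""
    if os.path.islink(latest_path):
        os.remove(latest_path)
    # Assumption: use a relative target so the link survives moving the folder.
    target = os.path.relpath(checkpoint_path, os.path.dirname(latest_path))
    os.symlink(target, latest_path)


with tempfile.TemporaryDirectory() as root:
    folder = os.path.join(root, "awesome-training-run", "checkpoints")
    os.makedirs(folder)
    latest = os.path.join(folder, "latest-rank0")

    # Save two placeholder checkpoints in order; the link is updated after each save.
    for name in ("ep1-ba42-rank0", "ep2-ba84-rank0"):
        checkpoint = os.path.join(folder, name)
        with open(checkpoint, "w") as f:
            f.write("placeholder checkpoint contents")
        update_latest_symlink(checkpoint, latest)

    resolved_name = os.path.basename(os.path.realpath(latest))

print(resolved_name)
```

After the second save, the link resolves to the newer file, so code that resumes from latest-rank0 never needs to know the epoch or batch count.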
latest_artifact_name (str, optional) –
Format string for the checkpoint's latest symlink artifact name. (default: '{run_name}/checkpoints/latest-rank{rank}')
Whenever a new checkpoint is saved, a symlink artifact is created or updated to point to the latest checkpoint's artifact_name. The artifact name will be determined by this format string. This parameter has no effect if latest_filename or artifact_name is None.
See also log_symlink_artifact() for symlink artifact logging.
The same format variables as for filename are available.
Leading slashes ('/') will be stripped.
To disable symlinks in the logger, set this parameter to None.
overwrite (bool, optional) – Whether existing checkpoints should be overwritten. If False (the default), then the folder must not exist or must be empty. (default: False)
save_interval (Time | str | int | (State, Event) -> bool) –
A Time, time-string, integer (in epochs), or a function that takes (state, event) and returns a boolean indicating whether a checkpoint should be saved.
If an integer, checkpoints will be saved every n epochs. If Time or a time-string, checkpoints will be saved according to this interval.
See also checkpoint_periodically().
If a function, then this function should take two arguments (State, Event). The first argument will be the current state of the trainer, and the second argument will be either Event.BATCH_CHECKPOINT or Event.EPOCH_CHECKPOINT (depending on the current training progress). It should return True if a checkpoint should be saved given the current state and event.
weights_only (bool) – If True, save only the model weights instead of the entire training state. This parameter must be False when using DeepSpeed. (default: False)
num_checkpoints_to_keep (int, optional) –
The number of checkpoints to keep locally. The oldest checkpoints are removed first. Set to -1 to keep all checkpoints locally. (default: -1)
Checkpoints will be removed after they have been logged as a file artifact. For example, when this callback is used in conjunction with the ObjectStoreLogger, set this parameter to 0 to immediately delete checkpoints from the local disk after they have been uploaded to the object store.
This parameter only controls how many checkpoints are kept locally; checkpoints are not deleted from artifact stores.
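The retention rule (oldest removed first, -1 keeps everything) can be sketched as below. This is a simplified stand-in, not Composer's code: the prune_checkpoints helper and the file names are hypothetical, and the step of logging each checkpoint as a file artifact before deletion is omitted.

```python
import os
import tempfile


def prune_checkpoints(saved_paths, num_checkpoints_to_keep):
    """Hypothetical helper: delete the oldest files until the limit is met.

    saved_paths is ordered oldest first; -1 means keep everything.
    Returns the paths that are still on disk.
    """
    if num_checkpoints_to_keep < 0:
        return list(saved_paths)
    remaining = list(saved_paths)
    while len(remaining) > num_checkpoints_to_keep:
        os.remove(remaining.pop(0))
    return remaining


with tempfile.TemporaryDirectory() as folder:
    saved = []
    for epoch in range(1, 5):
        path = os.path.join(folder, f"ep{epoch}-rank0")
        with open(path, "w") as f:
            f.write("placeholder")
        saved.append(path)
        # Prune right after each save, keeping only the two newest files.
        saved = prune_checkpoints(saved, num_checkpoints_to_keep=2)

    kept_names = [os.path.basename(p) for p in saved]
    files_on_disk = sorted(os.listdir(folder))

print(kept_names)
```

With a limit of 0, the same loop removes each file as soon as it is written, which matches the upload-then-delete use case described above.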
- saved_checkpoints
The checkpoint timestamps and filepaths.
This list contains tuples of the save timestamp and the checkpoint filepaths. This list will have at most num_checkpoints_to_keep entries. The latest checkpoint will be at the end.
Note
When using DeepSpeed, the index of a filepath in each list corresponds to the global rank of the process that wrote that file. Each filepath is valid only on the process's (rank's) node.
Otherwise, when not using DeepSpeed, each sub-list will contain only one filepath since only rank zero saves checkpoints.
- composer.callbacks.checkpoint_saver.checkpoint_periodically(interval)
Helper function to create a checkpoint scheduler according to a specified interval.
- Parameters
interval (Union[str, int, Time]) –
The interval describing how often checkpoints should be saved. If an integer, it will be assumed to be in EPOCHs. Otherwise, the unit must be either TimeUnit.EPOCH or TimeUnit.BATCH.
Checkpoints will be saved every n batches or epochs (depending on the unit), and at the end of training.
- Returns
Callable[[State, Event], bool] – A function that can be passed as the save_interval argument into the CheckpointSaver.
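A function with the Callable[[State, Event], bool] shape can be sketched without Composer installed. The State and Event classes below are simplified stand-ins defined only for this example (the real classes are richer, and the attribute names here are assumptions), and save_every_n_batches is a hypothetical factory that omits the "at the end of training" rule.

```python
from dataclasses import dataclass
from enum import Enum


class Event(Enum):
    """Stand-in for the two checkpoint events named in the save_interval description."""
    BATCH_CHECKPOINT = "batch_checkpoint"
    EPOCH_CHECKPOINT = "epoch_checkpoint"


@dataclass
class State:
    """Stand-in holding only the counters this sketch needs."""
    epoch: int
    batch: int


def save_every_n_batches(n):
    """Hypothetical factory: return a (state, event) -> bool function firing every n batches."""
    def should_save(state, event):
        return (
            event is Event.BATCH_CHECKPOINT
            and state.batch > 0
            and state.batch % n == 0
        )
    return should_save


should_save = save_every_n_batches(100)
on_interval = should_save(State(epoch=0, batch=100), Event.BATCH_CHECKPOINT)
off_interval = should_save(State(epoch=0, batch=150), Event.BATCH_CHECKPOINT)
wrong_event = should_save(State(epoch=1, batch=200), Event.EPOCH_CHECKPOINT)
print(on_interval, off_interval, wrong_event)
```

Checking the event as well as the counter keeps the function from firing twice when a batch boundary and an epoch boundary coincide.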