save_checkpoint

composer.utils.save_checkpoint(state, filename='ep{epoch}-ba{batch}-rank{rank}', *, weights_only=False, ignore_keys=None)

Checkpoint the training state.

Parameters
  • state (State) – The training state.

  • logger (Logger) – The logger.

  • filename (str) –

    A format string describing how to name checkpoints. (default: 'ep{epoch}-ba{batch}-rank{rank}')

    The following format variables are available:

    Variable             Description
    -------------------  ------------------------------------------------------------------------
    {run_name}           The name of the training run. See Logger.run_name.
    {rank}               The global rank, as returned by get_global_rank().
    {local_rank}         The local rank of the process, as returned by get_local_rank().
    {world_size}         The world size, as returned by get_world_size().
    {local_world_size}   The local world size, as returned by get_local_world_size().
    {node_rank}          The node rank, as returned by get_node_rank().
    {epoch}              The total epoch count, as returned by epoch().
    {batch}              The total batch count, as returned by batch().
    {batch_in_epoch}     The batch count in the current epoch, as returned by batch_in_epoch().
    {sample}             The total sample count, as returned by sample().
    {sample_in_epoch}    The sample count in the current epoch, as returned by sample_in_epoch().
    {token}              The total token count, as returned by token().
    {token_in_epoch}     The token count in the current epoch, as returned by token_in_epoch().
    {total_wct}          The total training duration in seconds, as returned by total_wct().
    {epoch_wct}          The epoch duration in seconds, as returned by epoch_wct().
    {batch_wct}          The batch duration in seconds, as returned by batch_wct().

    Note

    • By default, only the rank zero process will save a checkpoint file.

    • To write to compressed tar files, set the file extension to '.tar.gz', '.tgz', '.tar.bz2', or '.tar.lzma' (depending on the desired compression algorithm).

    • To write to compressed pt files, set the file extension to '.pt.bz2', '.pt.gz', '.pt.lz4', '.pt.lzma', '.pt.lzo', '.pt.xz', or '.pt.zst' (depending on the desired algorithm). You must have the corresponding CLI tool installed. lz4 is a good choice: it compresses very quickly and gives a modest space saving.

    Warning

    Using compression blocks the training loop while checkpoints are being compressed, and the compressibility of checkpoints can vary significantly depending on your setup. As such, we recommend saving checkpoints without compression by default.

    If you have the lz4 command available on your system, you may want to try saving as .pt.lz4 as the overhead is minimal (usually less than a second) and the saved space can sometimes be significant (1% - 40%).
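
One way to act on this advice is to choose the checkpoint extension at runtime, depending on whether the lz4 command is on the PATH. The sketch below uses only the Python standard library. The helper name choose_checkpoint_filename is hypothetical and not part of Composer, and the uncompressed '.pt' fallback extension is an assumption made for illustration.

```python
import shutil


def choose_checkpoint_filename(base: str = "ep{epoch}-ba{batch}-rank{rank}") -> str:
    """Hypothetical helper (not part of Composer).

    Appends '.pt.lz4' when the lz4 command is found on PATH, and falls back
    to an uncompressed '.pt' extension (an assumption) otherwise.
    """
    # shutil.which returns None when the executable cannot be found.
    if shutil.which("lz4") is not None:
        return base + ".pt.lz4"
    return base + ".pt"


filename_template = choose_checkpoint_filename()
print(filename_template)
```

The resulting string keeps its format variables unexpanded, so it could then be passed as the filename argument of save_checkpoint.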

    Consider the following scenario, where:

    • The default filename='ep{epoch}-ba{batch}-rank{rank}' is used.

    • The current epoch count is 1.

    • The current batch count is 42.

    The rank zero process will save the checkpoint to 'ep1-ba42-rank0'.
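
The scenario above can be reproduced with plain Python string formatting. This only illustrates how the placeholders expand and is not Composer's internal implementation: the values are passed by hand here, whereas save_checkpoint fills them in from the training state.

```python
# Default filename template from the save_checkpoint signature.
filename = "ep{epoch}-ba{batch}-rank{rank}"

# Values from the scenario: epoch count 1, batch count 42, rank zero process.
checkpoint_name = filename.format(epoch=1, batch=42, rank=0)
print(checkpoint_name)  # ep1-ba42-rank0
```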

  • weights_only (bool, optional) – If True, save only the model weights instead of the entire training state. (default: False)

Returns
  list[pathlib.Path]: The list of checkpoint files saved, indexed by the rank of the process.

    Note

    When doing multi-node training, the filepaths are valid only on each process's node; Composer does not move checkpoint files between nodes.

    Each list will contain only one filepath since only the rank zero process saves checkpoints.