# composer.core.state#

The state of the trainer.

Classes

• State – The state of the trainer.

State#

The state of the trainer.

Contains variables that the trainer tracks throughout the training loop. Note that all the necessary parts of state (i.e., the serialized_attributes) are serialized when the trainer is checkpointed, so that the state can be used to restore the trainer and continue training from a checkpoint. Algorithms are able to modify an instance of this class in place.

Note

An instance of this class is automatically constructed by the Trainer constructor. A user need not instantiate this class.

Parameters
• model (Module) – The model, typically as a subclass of ComposerModel.

• rank_zero_seed (int) – The seed used on the rank zero process. It is assumed that each rank’s seed is rank_zero_seed + dist.get_global_rank().

• grad_accum (int, optional) – The number of gradient accumulation steps to use. With this argument, the microbatch size for each device becomes microbatch_size = train_batch_size / (num_devices * grad_accum).

• evaluators (Evaluator | Sequence[Evaluator], optional) – The evaluators used for evaluation.

• dataloader_len (int | Time[int], optional) – The number of batches per dataloader iteration (e.g. epoch). The trainer will yield the first dataloader_len batches per iteration. If -1 (the default), the entire dataloader will be iterated over.

• dataloader_label (str, optional) – The name for the dataloader. Required if dataloader is specified. (default: None)

By convention, the training dataloader is called 'train'. The evaluator dataloader is called 'eval', or, when multiple evaluators are used, the name of the evaluator.

• max_duration (str | Time, optional) – The maximum duration to train for. (default: None)

• precision (str | Precision) – The numerical precision to use for training. See Precision for the supported precisions.

• optimizers (Optimizer | Sequence[Optimizer], optional) – The optimizer being used to train the model. Multiple optimizers are not currently supported.

• schedulers (types.PyTorchScheduler | Sequence[types.PyTorchScheduler], optional) – The learning rate scheduler (can also be a list or tuple of schedulers).

• scaler (GradScaler, optional) – The gradient scaler in use for mixed precision training.

• algorithms (Algorithm | Sequence[Algorithm], optional) – The algorithms used for training.

• callbacks (Callback | Sequence[Callback], optional) – The callbacks used for training.

• profiler (Profiler, optional) – The Composer profiler.
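Two relations stated in the parameter list above can be checked with plain arithmetic. A minimal sketch with hypothetical numbers (the variables are illustrative stand-ins, not Composer API calls):

```python
# Per-rank seed: each rank's seed is rank_zero_seed + its global rank.
rank_zero_seed = 42
global_rank = 3  # stand-in for dist.get_global_rank()
rank_seed = rank_zero_seed + global_rank
print(rank_seed)  # 45

# Microbatch size under gradient accumulation:
# microbatch_size = train_batch_size / (num_devices * grad_accum)
train_batch_size = 2048
num_devices = 8
grad_accum = 4
microbatch_size = train_batch_size // (num_devices * grad_accum)
print(microbatch_size)  # 64
```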

batch#

The batch. This will be the entire batch during the Event.AFTER_DATALOADER, or a microbatch between Event.BATCH_START and Event.BATCH_END.

Type

types.Batch

batch_num_samples#

The number of samples in the batch.

Type

int

batch_num_tokens#

The number of tokens in the batch.

Type

int

current_metrics#

The current computed metrics, organized by dataloader label and then by metric name. The train dataloader is labeled 'train'. If not using an Evaluator, the eval dataloader is labeled 'eval'. Otherwise, the evaluator label is used.

For example:

>>> trainer = Trainer(
...     ...,
...     compute_training_metrics=True,
... )
>>> trainer.fit()
>>> trainer.state.current_metrics
{'train': {'Accuracy': tensor(...)}, 'eval': {'Accuracy': tensor(...)}}


Or, when using an Evaluator:

>>> from torchmetrics import Accuracy
>>> from composer.core import Evaluator
>>> trainer = Trainer(
...     ...,
...     compute_training_metrics=True,
...     eval_dataloader=[
...         Evaluator(label='eval1', dataloader=eval_1_dl, metrics=Accuracy()),
...         Evaluator(label='eval2', dataloader=eval_2_dl, metrics=Accuracy()),
...     ],
... )
>>> trainer.fit()
>>> trainer.state.current_metrics
{'train': {'Accuracy': tensor(...)}, 'eval1': {'Accuracy': tensor(...)}, 'eval2': {'Accuracy': tensor(...)}}

Type

Dict[str, Dict[str, Any]]

eval_timestamp#

The timestamp for the current evaluation dataloader. This timestamp is reset before the dataloader is evaluated. The epoch attribute for this timestamp is always 0.

Type

Timestamp

grad_accum#

The number of gradient accumulation steps per batch.

Type

int

loss#

The most recently computed loss.

Type

Tensor | Sequence[Tensor]

model#

The training model.

Note

When using DeepSpeed or multi-rank training, the model will be wrapped with DeepSpeedEngine or DistributedDataParallel, respectively.

Type

Module

outputs#

The most recently computed output from the model’s forward pass.

Type

Tensor | Sequence[Tensor]

predict_timestamp#

The timestamp for the current prediction dataloader. This timestamp is reset before the dataloader is used. The epoch attribute for this timestamp is always 0.

Type

Timestamp

profiler#

The profiler (if profiling is enabled), or None if not profiling.

Type

Optional[Profiler]

rank_zero_seed#

The seed of the rank zero process.

Type

int

scaler#

The gradient scaler if using mixed-precision training, or None if not using mixed-precision training.

Type

Optional[GradScaler]

serialized_attributes#

The names of the attributes that are serialized in a checkpoint.

By default, the following attributes are serialized:

| Attribute | Description |
| --- | --- |
| model | The model under training. |
| optimizers | The optimizers being used to train the model. |
| schedulers | The learning rate schedulers. |
| algorithms | The algorithms used for training. |
| callbacks | The callbacks used for training. |
| scaler | The gradient scaler in use for mixed precision training. |
| timestamp | The timestamp that tracks training loop progress. |
| rank_zero_seed | The seed of the rank zero process. |
| current_metrics | The current metrics. |

Type

List[str]
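The checkpointing behavior described above amounts to gathering each listed attribute into a dictionary. A simplified sketch of that idea (assumed semantics only; the real State.state_dict also converts each attribute into a serializable form):

```python
class MiniState:
    """A toy stand-in for State, holding only a few of the serialized attributes."""
    serialized_attributes = ['timestamp', 'rank_zero_seed', 'current_metrics']

    def __init__(self):
        self.timestamp = {'epoch': 2, 'batch': 100}
        self.rank_zero_seed = 42
        self.current_metrics = {'train': {'Accuracy': 0.91}}

    def state_dict(self):
        # Collect every attribute named in serialized_attributes into a checkpoint dict.
        return {name: getattr(self, name) for name in self.serialized_attributes}

ckpt = MiniState().state_dict()
print(sorted(ckpt))  # ['current_metrics', 'rank_zero_seed', 'timestamp']
```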

timestamp#

The current training timestamp.

Type

Timestamp

train_dataloader#

The training dataloader. (May be None if not training.)

Type

Iterable

property algorithms#

The algorithms.

batch_get_item(key=None, get_fn=None)[source]#

Gets an element from the batch, specified either by key or by a user-specified function.

Parameters
• key (Any) – A key to index into the batch. Key is optional if get_fn is supplied.

• get_fn (Callable) – A user-specified function to do the extracting. Note: get_fn is optional if key is supplied.

Returns

The part of the batch specified by the key or extracted by the get_fn. This could be any type, depending on what the batch is composed of.

Raises

ValueError – If both key and get_fn are unset, or if both are set.
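The key/get_fn dispatch described above can be sketched in a few lines. This is an illustrative reimplementation of the documented behavior, not the actual Composer source:

```python
def batch_get_item(batch, key=None, get_fn=None):
    """Return the part of `batch` selected by `key`, or extracted by `get_fn`.

    Illustrative sketch: exactly one of `key` and `get_fn` must be provided.
    """
    if (key is None) == (get_fn is None):
        raise ValueError('Provide exactly one of key or get_fn.')
    if get_fn is not None:
        return get_fn(batch)
    return batch[key]

# For an (inputs, targets) tuple batch:
batch = (['img0', 'img1'], [0, 1])
inputs = batch_get_item(batch, key=0)                   # ['img0', 'img1']
targets = batch_get_item(batch, get_fn=lambda b: b[1])  # [0, 1]
```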

batch_set_item(*, key=None, value, set_fn=None)[source]#

Sets the element of the batch specified by key, or via the set_fn, to the specified value.

This is not an in-place operation, as for tuple-typed batches, a new batch object must be created to modify them.

Parameters
• key (Any) – A key to index into the batch. Optional if set_fn is specified.

• value (Any) – The value that batch[key] or batch.key gets set to or that the set_fn uses to set a part of the batch to.

• set_fn (Callable) – A user-specified function to do the setting. set_fn is optional if key and value are supplied. The set_fn must return the updated batch.

Returns

batch (Any) – The updated batch with value set at key.

Raises

ValueError – If:

• key and set_fn are both unset

• key and set_fn are both set
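The not-in-place behavior for tuple batches can be illustrated with a small sketch (assumed semantics, not the actual Composer implementation):

```python
def batch_set_item(batch, *, key=None, value=None, set_fn=None):
    """Return a batch with `value` set at `key`, or the result of `set_fn(batch)`.

    Illustrative sketch: exactly one of `key` and `set_fn` must be provided,
    and `set_fn` must return the updated batch.
    """
    if (key is None) == (set_fn is None):
        raise ValueError('Provide exactly one of key or set_fn.')
    if set_fn is not None:
        return set_fn(batch)
    if isinstance(batch, tuple):
        # Tuples are immutable, so a new batch object must be created.
        items = list(batch)
        items[key] = value
        return tuple(items)
    batch[key] = value  # lists and dicts can be updated in place
    return batch

new_batch = batch_set_item((['img0'], [0]), key=1, value=[1])
print(new_batch)  # (['img0'], [1])
```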

property callbacks#

The callbacks.

By default, the training dataloader is called 'train'. The evaluator dataloader is called 'eval', or when multiple evaluators are used, the name of the evaluator. However, the dataloader label can be explicitely specified in Trainer.fit() and Trainer.eval().

Returns

Optional[str] – The dataloader label, or None if no dataloader is set.

property dataloader_len#

The number of batches per dataloader iteration (e.g. epoch), as used by the trainer.

Note

If not explicitly specified, this value is an approximation, as it depends on len(self.dataloader). See the PyTorch DataLoader documentation for more information.

Returns

Optional[Time[int]] – The number of batches per dataloader iteration (e.g. epoch), or None if no dataloader is defined or if the dataloader has an unknown length (e.g. streaming dataloaders).

property deepspeed_model#

Cast model to DeepSpeedEngine.

property evaluators#

The evaluators.

get_elapsed_duration()[source]#

Get the elapsed training duration.

Returns

Optional[Time[float]] – The elapsed duration, in TimeUnit.DURATION. Time(0.0, TimeUnit.DURATION) represents the beginning of training and Time(1.0, TimeUnit.DURATION) represents a completed training process. Returns None if max_duration is None.
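The fraction returned here reduces to simple arithmetic. A hypothetical sketch (the real method computes this from the current Timestamp and max_duration; the function below is illustrative only):

```python
def elapsed_duration(batches_done, max_batches):
    """Fraction of training completed, or None when no maximum is set."""
    if max_batches is None:
        return None  # mirrors the documented None return when max_duration is None
    return batches_done / max_batches

print(elapsed_duration(400, 1000))  # 0.4, i.e. Time(0.4, TimeUnit.DURATION)
print(elapsed_duration(400, None))  # None
```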

property is_model_ddp#

Whether model is an instance of a DistributedDataParallel.

property is_model_deepspeed#

Whether model is an instance of a DeepSpeedEngine.

load_model_state(state_dict, strict)[source]#

Loads the model's state from a state_dict.

Parameters
• state_dict (Dict[str, Any]) – The state dict, generated from a previous call to state_dict().

• strict (bool) – Whether the keys (i.e., model parameter names) in the model state dict should perfectly match the keys in the model instance.

load_state_dict(state, strict=False)[source]#

Loads the state.

Parameters
• state (Dict[str, Any]) – object returned from call to state_dict().

• strict (bool) – whether the keys in the state["model"] should perfectly match the keys in the self.model. Defaults to False.

property max_duration#

The maximum training duration.

property optimizers#

The optimizers.

property precision#

The numerical precision to use for training.

See Precision for the supported precisions.

property schedulers#

The schedulers.

property seed#

The seed for the current rank.

set_dataloader(dataloader=None, dataloader_label=None, dataloader_len=-1)[source]#

Updates the active dataloader and dataloader settings.

Parameters
• dataloader (Iterable, optional) – The dataloader. Defaults to None.

• dataloader_label (str, optional) – The dataloader label. Must be None if and only if dataloader is None. Defaults to None.

• dataloader_len (int | Time[int], optional) – The number of batches per dataloader iteration (e.g. epoch), as used by the trainer. Set to -1 to iterate over the entire dataset. (default: -1)