We use the Time class to represent and track time throughout the training loop.
Several time-related quantities (epochs, batches, samples, and tokens) are tracked
during training and represented as elements of the TimeUnit enum class. Values
can be provided as strings, for example "7ep" for seven epochs or "0.1dur" for
10% of the training duration.
Duration is defined as a multiplier of max_duration.
The string inputs above are valid wherever an argument accepts the Time type.
There are some exceptions: for example, dur is not valid when setting
max_duration, as that would be circular.
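As an illustration of the string format, here is a minimal sketch of a parser. It is not Composer's own implementation: `parse_timestring` and `SUFFIX_TO_UNIT` are hypothetical names, and the `ba`, `sp`, and `tok` suffixes are assumptions, since only `ep` and `dur` appear in the text above.

```python
import re

# Illustrative suffix table; only "ep" and "dur" are taken from the text,
# the remaining suffixes are assumptions.
SUFFIX_TO_UNIT = {
    "ep": "epoch",
    "ba": "batch",
    "sp": "sample",
    "tok": "token",
    "dur": "duration",
}

_TIMESTRING = re.compile(r"^\s*([0-9]*\.?[0-9]+)\s*([a-z]+)\s*$")


def parse_timestring(text: str, allow_duration: bool = True):
    """Split a string such as "7ep" into a (value, unit) pair."""
    match = _TIMESTRING.match(text)
    if match is None:
        raise ValueError(f"not a time string: {text!r}")
    number, suffix = match.groups()
    if suffix not in SUFFIX_TO_UNIT:
        raise ValueError(f"unknown unit suffix: {suffix!r}")
    unit = SUFFIX_TO_UNIT[suffix]
    value = float(number)
    if unit == "duration":
        if not allow_duration:
            # A total duration cannot be expressed as a fraction of itself.
            raise ValueError("'dur' is circular in this context")
        return value, unit
    if not value.is_integer():
        raise ValueError(f"{unit} counts must be whole numbers: {text!r}")
    return int(value), unit


epochs = parse_timestring("7ep")
milestone = parse_timestring("0.1dur")
```

Passing `allow_duration=False` mimics the `max_duration` case, where a `dur` value would refer to itself.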
Users can also specify milestones for objects such as learning rate schedulers
in units of duration, such as 0.1dur. This makes it easy to build recipes
such as “decay the learning rate 10% into training”.
For dur arguments, we keep the same units as used in max_duration and round down.
For example, if max_duration = "7ep" and warmup = "0.2dur", then warmup will be
converted to floor(7 * 0.2) = 1 epoch.
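The rounding rule can be written out as a short sketch. `resolve_duration` is a hypothetical helper used only for illustration; it takes the fraction and the already-parsed `max_duration` value and unit.

```python
import math


def resolve_duration(fraction: float, max_value: int, max_unit: str):
    """Convert a fraction of training into the units of max_duration."""
    # Keep the unit of max_duration and round down.
    return math.floor(max_value * fraction), max_unit


# max_duration = "7ep", warmup = "0.2dur"
warmup = resolve_duration(0.2, 7, "ep")  # floor(1.4) epochs
```

Because the result is rounded down, short runs can lose a noticeable share of the requested fraction: here 20% of 7 epochs becomes 1 epoch rather than 1.4.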
We also support arithmetic between instances that share the same units. For more information,
see the documentation for the Time class.
The trainer has a Timestamp object, stored in State.timestamp, that
measures progress in all the time formats above. State.timestamp can be
read by algorithms and callbacks to trigger behavior at different times
during training. This feature allows algorithms to specify time in whatever unit
is most useful – e.g. an algorithm could activate once every n batches or
during the last 20% of training.
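The two triggers mentioned above can be sketched with plain integers. In a real algorithm or callback the counters would be read from State.timestamp; the function names here are hypothetical, and `elapsed` and `total` are assumed to be expressed in the same unit.

```python
def every_n_batches(batch_count: int, n: int) -> bool:
    """True on every n-th completed batch (n, 2n, 3n, ...)."""
    return batch_count > 0 and batch_count % n == 0


def in_final_fraction(elapsed: int, total: int, fraction: float = 0.2) -> bool:
    """True once progress has entered the last `fraction` of training."""
    return elapsed >= total * (1.0 - fraction)


# An algorithm that fires every 50 batches:
fires_at_100 = every_n_batches(100, 50)
# An algorithm that is active only during the last 20% of a 1000-batch run:
active_at_900 = in_final_fraction(900, 1000)
```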
After each batch and epoch, State.timestamp is updated to reflect
the amount of data consumed so far in terms of epochs, batches, samples, and tokens.
By default, we attempt to infer the number of samples from the batch type:
torch.Tensor: the size of its first dimension is used.
tuple: the first-dimension size of its elements is used, so all elements must have the same first-dimension size.
dict: the first-dimension size of its values is used, so all values must have the same first-dimension size.
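The inference rules above can be sketched as follows. This is an illustration, not Composer's actual code: `infer_num_samples` is a hypothetical name, and any object exposing a `.shape` tuple stands in for a torch.Tensor so that the example runs without PyTorch.

```python
from types import SimpleNamespace


def infer_num_samples(batch) -> int:
    """Guess the sample count of a batch from its first dimension."""
    if hasattr(batch, "shape"):
        # Tensor-like: the first dimension is the batch dimension.
        return int(batch.shape[0])
    if isinstance(batch, dict):
        elements = list(batch.values())
    elif isinstance(batch, tuple):
        elements = list(batch)
    else:
        raise TypeError(f"cannot infer samples from {type(batch).__name__}")
    sizes = {infer_num_samples(element) for element in elements}
    if len(sizes) != 1:
        # Elements disagree (or the container is empty), so there is no
        # single batch size to report.
        raise ValueError(f"inconsistent first dimensions: {sorted(sizes)}")
    return sizes.pop()


# Stand-ins for tensors: only the `.shape` attribute is used.
images = SimpleNamespace(shape=(32, 3, 224, 224))
labels = SimpleNamespace(shape=(32,))
num_samples = infer_num_samples((images, labels))
```

When the elements of a batch legitimately differ in their first dimension, this kind of inference fails, which is the case a custom function is meant to handle.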
Users can supply their own get_num_samples_in_batch method to the trainer
via the DataSpec for more complicated datasets:

    from composer import Trainer
    from composer.core import DataSpec

    # `model` and `my_train_dataloader` are assumed to be defined elsewhere.
    def my_num_samples(batch: dict) -> int:
        # Each batch holds two image tensors; count the samples in both.
        return batch['image1'].shape[0] + batch['image2'].shape[0]

    data_spec = DataSpec(
        dataloader=my_train_dataloader,
        get_num_samples_in_batch=my_num_samples,
    )

    trainer = Trainer(
        model=model,
        train_dataloader=data_spec,
    )
To track tokens properly, users will need to supply the get_num_tokens_in_batch
function to the Trainer; otherwise, tokens will not be tracked.
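A token-counting function might look like the sketch below. The `input_ids` key and the `(batch_size, sequence_length)` layout are assumptions about the batch format, and `count_tokens` is a hypothetical name. Such a function would be supplied through the DataSpec in the same way as get_num_samples_in_batch above; the exact parameter name should be checked against the DataSpec documentation.

```python
from types import SimpleNamespace


def count_tokens(batch: dict) -> int:
    """Count tokens as batch_size * sequence_length of `input_ids`."""
    batch_size, seq_len = batch["input_ids"].shape[:2]
    return int(batch_size) * int(seq_len)


# Stand-in for a tensor of token ids with shape (batch_size, sequence_length).
fake_batch = {"input_ids": SimpleNamespace(shape=(8, 128))}
num_tokens = count_tokens(fake_batch)
```

Note that this simple count includes padding positions; excluding them would require an attention mask or similar, which is outside this sketch.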
Samples Per Epoch
To convert between samples and epochs, we infer the number of samples per epoch
from len(dataloader.dataset) if the property is available. If not, we assume
the dataset is unsized.
num_samples can also be provided directly to the DataSpec to override
this default behavior.

    from composer import Trainer
    from composer.core import DataSpec

    trainer = Trainer(
        model=model,
        train_dataloader=DataSpec(
            dataloader=my_train_dataloader,
            num_samples=1028428,
        )
    )
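The lookup order described in this section can be sketched as a small helper. `infer_samples_per_epoch` is a hypothetical function, not part of Composer, and a return value of `None` is used here to represent an unsized dataset.

```python
from types import SimpleNamespace
from typing import Optional


def infer_samples_per_epoch(dataloader, num_samples: Optional[int] = None) -> Optional[int]:
    """Return samples per epoch, or None if the dataset is unsized."""
    if num_samples is not None:
        # An explicit value overrides any inference.
        return num_samples
    dataset = getattr(dataloader, "dataset", None)
    try:
        return len(dataset)
    except TypeError:
        # No dataset attribute, or the dataset does not define __len__.
        return None


sized = SimpleNamespace(dataset=[0, 1, 2, 3])
unsized = SimpleNamespace(dataset=iter([0, 1, 2, 3]))

inferred = infer_samples_per_epoch(sized)
unknown = infer_samples_per_epoch(unsized)
overridden = infer_samples_per_epoch(unsized, num_samples=1028428)
```

For streaming or iterable datasets, which typically have no length, supplying the count explicitly is the only way to make sample-to-epoch conversions possible.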