Time#
We use the Time class to represent and track time throughout the training loop. Several time-related quantities (epochs, batches, samples, and tokens) are tracked during training and represented as elements of the TimeUnit enum class. Values can be provided as strings:
Unit | Suffix | Example | Enum |
---|---|---|---|
Epochs | "ep" | "10ep" | TimeUnit.EPOCH |
Batches | "ba" | "100ba" | TimeUnit.BATCH |
Samples | "sp" | "2048sp" | TimeUnit.SAMPLE |
Tokens | "tok" | "93874tok" | TimeUnit.TOKEN |
Duration | "dur" | "0.7dur" | TimeUnit.DURATION |
Seconds | "sec" | "30sec" | TimeUnit.SECOND |
Duration is defined as a multiplier of the max_duration.
The string inputs above are valid whenever an argument accepts the Time type. There are some exceptions: for example, dur is not valid when setting max_duration, since that would be circular, and seconds cannot be used for schedulers or for max_duration.
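For example, here is a minimal sketch of passing a suffixed string where a Time argument is accepted; model and my_train_dataloader are assumed to already be defined, as in the examples later on this page.
from composer import Trainer

trainer = Trainer(
    model=model,
    train_dataloader=my_train_dataloader,
    max_duration="10ep",   # equivalent to Time(10, TimeUnit.EPOCH)
)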
Timedelta strings are also supported and are converted into seconds by the Time class. For instance, 1h20m40s is supported and will be converted to Time(4840, TimeUnit.SECOND).
Users can also specify milestones for objects such as learning rate schedulers in units of duration, e.g. 0.1dur. This makes it easy to build recipes such as "decay the learning rate 10% into training".
Warning
For dur arguments, we keep the same units as used in max_duration, and round down. For example, if max_duration = "7ep" and warmup = "0.2dur", then warmup will be converted to floor(7 * 0.2) = 1 epoch.
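As a sketch of this rounding behavior, the snippet below passes a duration-based warmup to a warmup scheduler. The scheduler class and its t_warmup argument reflect common Composer usage rather than anything stated on this page, so check your version for the exact API.
from composer import Trainer
from composer.optim import LinearWithWarmupScheduler

# Assumed scheduler class; "0.2dur" of a 7-epoch run rounds down to 1 epoch of warmup.
scheduler = LinearWithWarmupScheduler(t_warmup="0.2dur")

trainer = Trainer(
    model=model,
    train_dataloader=my_train_dataloader,
    max_duration="7ep",    # 7 epochs total
    schedulers=scheduler,  # warmup = floor(7 * 0.2) = 1 epoch
)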
We also support arithmetic between instances that share the same units. For more information, see the documentation for Time.
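For instance, a small sketch of same-unit arithmetic, assuming Time and TimeUnit are importable from composer.core as in the DataSpec examples below:
from composer.core import Time, TimeUnit

a = Time(100, TimeUnit.BATCH)
b = Time(50, TimeUnit.BATCH)

total = a + b       # Time(150, TimeUnit.BATCH)
print(total, a < b)  # arithmetic and comparisons require matching units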
Tracking Time#
The trainer has a Timestamp object stored in State.timestamp that measures progress in all the time formats above. State.timestamp can be read by algorithms and callbacks to trigger behavior at different times during training. This feature allows algorithms to specify time in whatever unit is most useful: e.g. an algorithm could activate once every n batches or during the last 20% of training.
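As a sketch of how a callback might read this state, the toy callback below fires every n batches. It assumes the standard Callback interface with a batch_end hook and the timestamp attribute names shown; verify these against your Composer version.
from composer.core import Callback, State
from composer.loggers import Logger

class EveryNBatches(Callback):
    """Toy callback that reacts once every n batches by reading the timestamp."""

    def __init__(self, n: int = 100):
        self.n = n

    def batch_end(self, state: State, logger: Logger) -> None:
        # state.timestamp tracks progress in epochs, batches, samples, and tokens.
        batches = state.timestamp.batch.value
        if batches % self.n == 0:
            print(f"Seen {batches} batches and {state.timestamp.sample.value} samples")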
After each batch and epoch, State.timestamp is updated to reflect the amount of data consumed in terms of epochs, batches, samples, and tokens.
By default, we attempt to infer the number of samples based on the batch type (a minimal sketch of this logic follows the list):
- If a torch.Tensor, the size of its first dimension is used.
- If a list or tuple, the size of the first dimension of its elements is used. As such, all elements must have the same first dimension size.
- If a dict, the size of the first dimension of its values is used. As such, all values must have the same first dimension size.
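The hypothetical helper below illustrates these rules; it is not Composer's actual implementation, and the name default_num_samples is made up for this sketch.
import torch

def default_num_samples(batch) -> int:
    # Hypothetical sketch of the rules above, not Composer's actual code.
    if isinstance(batch, torch.Tensor):
        return batch.shape[0]
    if isinstance(batch, (list, tuple)):
        return batch[0].shape[0]                    # all elements share dim-0 size
    if isinstance(batch, dict):
        return next(iter(batch.values())).shape[0]  # all values share dim-0 size
    raise TypeError(f"Cannot infer the number of samples for {type(batch)}")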
Users can supply their own get_num_samples_in_batch method to the trainer via the DataSpec for more complicated datasets:
from composer.core import DataSpec
from composer import Trainer

def my_num_samples(batch: dict) -> int:
    # Count samples across both image tensors in each batch.
    return batch['image1'].shape[0] + batch['image2'].shape[0]

data_spec = DataSpec(
    dataloader=my_train_dataloader,
    get_num_samples_in_batch=my_num_samples,
)

trainer = Trainer(
    model=model,
    train_dataloader=data_spec,
)
To track tokens properly, users will need to supply the get_num_tokens_in_batch
function to the Trainer; otherwise, tokens will not be tracked.
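For instance, a sketch along the same lines as the sample-counting example above; the token-count logic assumes a batch dict with an 'input_ids' tensor and is purely illustrative.
from composer.core import DataSpec
from composer import Trainer

def my_num_tokens(batch: dict) -> int:
    # Illustrative only: treat every element of 'input_ids' as a token.
    return batch['input_ids'].numel()

data_spec = DataSpec(
    dataloader=my_train_dataloader,
    get_num_tokens_in_batch=my_num_tokens,
)

trainer = Trainer(
    model=model,
    train_dataloader=data_spec,
)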
Samples Per Epoch#
To convert between samples and epochs, we infer the number of samples per epoch from len(dataloader.dataset) if the property is available. If not, we assume the dataset is unsized.
num_samples can also be provided directly to the DataSpec to override this default behavior.
from composer.core import DataSpec
from composer import Trainer

trainer = Trainer(
    model=model,
    train_dataloader=DataSpec(
        dataloader=my_train_dataloader,
        num_samples=1028428,
    ),
)