Time#
We use the Time class to represent and track time throughout the training loop. Several time-related quantities (epochs, batches, samples, and tokens) are tracked during training and represented as elements of the TimeUnit enum class. Values can be provided as strings:
Unit | Suffix | Example | Enum |
---|---|---|---|
Epochs | "ep" | "10ep" | TimeUnit.EPOCH |
Batches | "ba" | "100ba" | TimeUnit.BATCH |
Samples | "sp" | "2048sp" | TimeUnit.SAMPLE |
Tokens | "tok" | "93874tok" | TimeUnit.TOKEN |
Duration | "dur" | "0.7dur" | TimeUnit.DURATION |
Seconds | "sec" | "30sec" | TimeUnit.SECOND |
Duration is defined as a multiplier of the max_duration.
The string inputs above are valid whenever an argument accepts the Time type. There are some exceptions: for example, dur is not valid when setting max_duration, since that would be circular, and seconds cannot be used for schedulers or for max_duration.
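For example, here is a minimal sketch of passing a suffixed string where a Time argument is accepted; model and my_train_dataloader are assumed to already be defined, as in the examples later on this page.
from composer import Trainer

trainer = Trainer(
    model=model,
    train_dataloader=my_train_dataloader,
    max_duration="10ep",   # equivalent to Time(10, TimeUnit.EPOCH)
)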
Timedelta strings are also supported and are converted into seconds by the Time class. For instance, 1h20m40s is supported and will be converted to Time(4840, TimeUnit.SECOND).
Users can also specify milestones for objects such as learning rate schedulers in units of duration, e.g. 0.1dur. This makes it easy to build recipes such as "decay the learning rate 10% into training".
Warning
For dur arguments, we keep the same units as used in max_duration, and round down. For example, if max_duration = "7ep" and warmup = "0.2dur", then warmup will be converted to floor(7 * 0.2) = 1 epoch.
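As a sketch of this rounding behavior, the snippet below passes a duration-based warmup to a warmup scheduler. The scheduler class and its t_warmup argument reflect common Composer usage rather than anything stated on this page, so check your version for the exact API.
from composer import Trainer
from composer.optim import LinearWithWarmupScheduler

# Assumed scheduler class; "0.2dur" of a 7-epoch run rounds down to 1 epoch of warmup.
scheduler = LinearWithWarmupScheduler(t_warmup="0.2dur")

trainer = Trainer(
    model=model,
    train_dataloader=my_train_dataloader,
    max_duration="7ep",    # 7 epochs total
    schedulers=scheduler,  # warmup = floor(7 * 0.2) = 1 epoch
)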
We also support arithmetic between instances that share the same units. For more information, see the documentation for Time.
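For instance, a small sketch of same-unit arithmetic, assuming Time and TimeUnit are importable from composer.core as in the DataSpec examples below:
from composer.core import Time, TimeUnit

a = Time(100, TimeUnit.BATCH)
b = Time(50, TimeUnit.BATCH)

total = a + b       # Time(150, TimeUnit.BATCH)
print(total, a < b)  # arithmetic and comparisons require matching units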
Tracking Time#
The trainer has a Timestamp object stored in State.timestamp that measures progress in all the time formats above. State.timestamp can be read by algorithms and callbacks to trigger behavior at different times during training. This feature allows algorithms to specify time in whatever unit is most useful: e.g. an algorithm could activate once every n batches or during the last 20% of training.
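As a sketch of how a callback might read this state, the toy callback below fires every n batches. It assumes the standard Callback interface with a batch_end hook and the timestamp attribute names shown; verify these against your Composer version.
from composer.core import Callback, State
from composer.loggers import Logger

class EveryNBatches(Callback):
    """Toy callback that reacts once every n batches by reading the timestamp."""

    def __init__(self, n: int = 100):
        self.n = n

    def batch_end(self, state: State, logger: Logger) -> None:
        # state.timestamp tracks progress in epochs, batches, samples, and tokens.
        batches = state.timestamp.batch.value
        if batches % self.n == 0:
            print(f"Seen {batches} batches and {state.timestamp.sample.value} samples")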
After each batch and epoch, State.timestamp is updated to reflect the amount of data consumed in terms of epochs, batches, samples, and tokens.
By default, we attempt to infer the number of samples based on the batch type (a minimal sketch of this logic follows the list):
- If a torch.Tensor, the size of its first dimension is used.
- If a list or tuple, the size of the first dimension of its elements is used. As such, all elements must have the same first dimension size.
- If a dict, the size of the first dimension of its values is used. As such, all values must have the same first dimension size.
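The hypothetical helper below illustrates these rules; it is not Composer's actual implementation, and the name default_num_samples is made up for this sketch.
import torch

def default_num_samples(batch) -> int:
    # Hypothetical sketch of the rules above, not Composer's actual code.
    if isinstance(batch, torch.Tensor):
        return batch.shape[0]
    if isinstance(batch, (list, tuple)):
        return batch[0].shape[0]                    # all elements share dim-0 size
    if isinstance(batch, dict):
        return next(iter(batch.values())).shape[0]  # all values share dim-0 size
    raise TypeError(f"Cannot infer the number of samples for {type(batch)}")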
Users can supply their own get_num_samples_in_batch method to the trainer via the DataSpec for more complicated datasets:
from composer.core import DataSpec
from composer import Trainer

def my_num_samples(batch: dict) -> int:
    # Count samples across both image tensors in each batch.
    return batch['image1'].shape[0] + batch['image2'].shape[0]

data_spec = DataSpec(
    dataloader=my_train_dataloader,
    get_num_samples_in_batch=my_num_samples,
)

trainer = Trainer(
    model=model,
    train_dataloader=data_spec,
)
To track tokens properly, users will need to supply the get_num_tokens_in_batch
function to the Trainer; otherwise, tokens will not be tracked.
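For instance, a sketch along the same lines as the sample-counting example above; the token-count logic assumes a batch dict with an 'input_ids' tensor and is purely illustrative.
from composer.core import DataSpec
from composer import Trainer

def my_num_tokens(batch: dict) -> int:
    # Illustrative only: treat every element of 'input_ids' as a token.
    return batch['input_ids'].numel()

data_spec = DataSpec(
    dataloader=my_train_dataloader,
    get_num_tokens_in_batch=my_num_tokens,
)

trainer = Trainer(
    model=model,
    train_dataloader=data_spec,
)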
Samples Per Epoch#
To convert between samples and epochs, we infer the number of samples per epoch from len(dataloader.dataset) if the property is available. If not, we assume the dataset is unsized.
num_samples can also be provided directly to the DataSpec to override this default behavior.
from composer.core import DataSpec
from composer import Trainer

trainer = Trainer(
    model=model,
    train_dataloader=DataSpec(
        dataloader=my_train_dataloader,
        num_samples=1028428,
    ),
)