# 📊 Evaluation

To track training progress, validation datasets can be provided to the Composer Trainer through the `eval_dataloader` parameter. The trainer will compute evaluation metrics on the evaluation dataset at a frequency specified by the Trainer parameter `eval_interval`.

```python
from composer import Trainer

trainer = Trainer(
    ...,
    eval_dataloader=my_eval_dataloader,  # validation DataLoader, assumed defined elsewhere
    eval_interval="1ep",  # Default is every epoch
)
```


The metrics should be provided by ComposerModel.get_metrics(). For more information, see the “Metrics” section in 🛻 ComposerModel.
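As a rough, framework-free sketch of this contract (the classes below are hypothetical stand-ins, not Composer's actual base classes): `get_metrics(train=...)` returns a mapping from metric names to metric objects, with separate objects for training and evaluation so their accumulated state never mixes.

```python
class Accuracy:
    """Toy metric following the torchmetrics update()/compute() interface."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, outputs, targets):
        # Accumulate running counts across batches
        self.correct += sum(o == t for o, t in zip(outputs, targets))
        self.total += len(targets)

    def compute(self):
        return self.correct / self.total


class MyModel:
    """Hypothetical model sketching the get_metrics() contract."""

    def __init__(self):
        # Separate metric objects so train and eval state never mix
        self.train_accuracy = Accuracy()
        self.val_accuracy = Accuracy()

    def get_metrics(self, train=False):
        metric = self.train_accuracy if train else self.val_accuracy
        return {'Accuracy': metric}
```

With a real `ComposerModel`, the values in the returned dict would be `torchmetrics.Metric` instances rather than this toy class.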

To provide a deeper intuition, here’s pseudocode for the evaluation logic that occurs every eval_interval:

```python
metrics = model.get_metrics(train=False)

for batch in eval_dataloader:
    outputs, targets = model.eval_forward(batch)
    metrics.update(outputs, targets)  # implements the torchmetrics interface

metrics.compute()
```

- The trainer iterates over `eval_dataloader` and passes each batch to the model's `ComposerModel.eval_forward()` method.

- Outputs of `model.eval_forward` are used to update the metrics (`torchmetrics.Metric` objects returned by `ComposerModel.get_metrics`).

- Finally, metrics over the whole validation dataset are computed.

Note that the tuple returned by `ComposerModel.eval_forward()` provides the positional arguments to `metric.update`. Please keep this in mind when using custom models and/or metrics.
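To make this positional contract concrete, here is a minimal, self-contained sketch in plain Python (the metric and model below are hypothetical, not Composer code). The tuple returned by `eval_forward` is unpacked directly into `metric.update`, so the element order must match `update`'s signature:

```python
class MeanAbsoluteError:
    """Toy metric with a torchmetrics-style update()/compute() interface."""

    def __init__(self):
        self.total_error = 0.0
        self.count = 0

    def update(self, outputs, targets):
        # Positional order here must match the tuple order of eval_forward()
        self.total_error += sum(abs(o - t) for o, t in zip(outputs, targets))
        self.count += len(targets)

    def compute(self):
        return self.total_error / self.count


def eval_forward(batch):
    """Hypothetical model step: 'predicts' twice the input feature."""
    inputs, targets = batch
    outputs = [2 * x for x in inputs]
    return outputs, targets  # this order becomes update(outputs, targets)


metric = MeanAbsoluteError()
for batch in [([1, 2], [2, 5]), ([3], [6])]:
    metric.update(*eval_forward(batch))  # tuple unpacked positionally

print(metric.compute())  # 1/3: total absolute error of 1 over 3 examples
```

If a custom metric's `update` took its arguments in a different order, the model's `eval_forward` would have to return its tuple in that same order.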

## Multiple Datasets

If there are multiple validation datasets, each potentially with different metrics, use `Evaluator` to pair each dataloader with its metrics. This class is a simple container that bundles a label, a dataloader, and the names of the metrics to compute.

For example, the GLUE tasks for a language model can be specified as follows:

```python
from composer import Trainer
from composer.core import Evaluator

# mrpc_dataloader and mnli_dataloader are assumed to be defined elsewhere
glue_mrpc_task = Evaluator(
    label='glue_mrpc',
    dataloader=mrpc_dataloader,
    metric_names=['BinaryF1Score', 'MulticlassAccuracy']
)

glue_mnli_task = Evaluator(
    label='glue_mnli',
    dataloader=mnli_dataloader,
    metric_names=['MulticlassAccuracy']
)

trainer = Trainer(
    ...,
    eval_dataloader=[glue_mrpc_task, glue_mnli_task],
    ...
)
```


Note that metric_names must be a subset of the metrics provided by the model in ComposerModel.get_metrics().
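The subset relationship can be sketched as a simple lookup (a hypothetical helper, not a Composer API): each evaluator keeps only its named metrics from the model's full metric dict and fails loudly if a name is missing.

```python
def select_metrics(model_metrics, metric_names):
    """Keep only the requested metrics; error if the model doesn't provide one.

    Hypothetical sketch of the subset check, not Composer's implementation.
    """
    missing = set(metric_names) - set(model_metrics)
    if missing:
        raise ValueError(f'Metrics not provided by the model: {sorted(missing)}')
    return {name: model_metrics[name] for name in metric_names}


# The model's full eval-metric dict (in Composer the values would be
# torchmetrics.Metric objects returned by ComposerModel.get_metrics)
model_metrics = {'BinaryF1Score': object(), 'MulticlassAccuracy': object()}

# glue_mnli asks for only a subset of what the model provides
mnli_metrics = select_metrics(model_metrics, ['MulticlassAccuracy'])
```

Requesting a metric the model never defined (e.g. `metric_names=['F1']` here) would raise immediately rather than silently reporting nothing.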