MLPerfCallback
- class composer.callbacks.MLPerfCallback(root_folder, index, benchmark='resnet', target=0.759, division='open', metric_name='MulticlassAccuracy', metric_label='eval', submitter='MosaicML', system_name=None, status='onprem', cache_clear_cmd=None, host_processors_per_node=None, exit_at_target=False)
Creates a compliant results file for the MLPerf Training benchmark.
A submission folder structure will be created with root_folder as the base and the following directories:

root_folder/
    results/
        [system_name]/
            [benchmark]/
                result_0.txt
                result_1.txt
                ...
    systems/
        [system_name].json
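The layout above can be sketched with `pathlib`. This is a minimal illustration, not Composer's implementation: `submission_paths` is a hypothetical helper, and the result file name follows the `result_[index].txt` pattern given in the description of the `index` parameter.

```python
from pathlib import Path


def submission_paths(root_folder, system_name, benchmark="resnet", index=0):
    """Return the paths one run would write to under the submission root.

    Hypothetical helper for illustration only; it is not part of Composer.
    """
    root = Path(root_folder)
    return {
        # One result file per repetition, named after the `index` argument.
        "result_file": root / "results" / system_name / benchmark / f"result_{index}.txt",
        # One JSON system description per system.
        "system_desc": root / "systems" / f"{system_name}.json",
    }


paths = submission_paths("/submission", "8xA100_composer", index=0)
print(paths["result_file"].as_posix())
print(paths["system_desc"].as_posix())
```

Running several repetitions with different `index` values against the same `root_folder` would place each result file next to the others in the same benchmark directory.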
The required system description is generated automatically, with a best-effort attempt to populate its fields. Check it manually before submission.
Currently, only open division submissions are supported with this Callback.
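Example

    from composer.callbacks import MLPerfCallback

    callback = MLPerfCallback(
        root_folder='/submission',         # base of the submission folder
        index=0,                           # repetition index of this run
        metric_name='MulticlassAccuracy',
        metric_label='eval',
        target=0.759,                      # float, as documented under Parameters
    )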
During training, the metric found in state.eval_metrics[metric_label][metric_name] is compared against the target criterion.
Note
This is currently an experimental callback that has not yet been used to submit an actual result to MLPerf. Please use it with caution.
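The comparison of the evaluation metric against the target can be sketched as follows. This is a simplified illustration under stated assumptions, not the callback's actual code: `check_target` is a hypothetical function, plain floats stand in for computed metric values, and the use of `>=` (rather than a strict `>`) is an assumption.

```python
def check_target(eval_metrics, metric_label, metric_name, target, exit_at_target=False):
    """Return (reached, should_stop) for the current evaluation metrics.

    Hypothetical helper for illustration only; it is not part of Composer.
    """
    # Look the metric up by evaluator label first, then by metric name.
    value = float(eval_metrics[metric_label][metric_name])
    # Assumption: meeting the target exactly counts as reaching it.
    reached = value >= target
    # Training stops early only if the caller opted in via exit_at_target.
    should_stop = reached and exit_at_target
    return reached, should_stop


# Plain floats stand in for computed metric values.
eval_metrics = {"eval": {"MulticlassAccuracy": 0.761}}
print(check_target(eval_metrics, "eval", "MulticlassAccuracy", target=0.759))
print(check_target(eval_metrics, "eval", "MulticlassAccuracy", target=0.759, exit_at_target=True))
```

With `exit_at_target` left at its default of `False`, reaching the target does not by itself request that training stop.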
Note
MLPerf submissions require clearing the system cache before every training run. By default, this callback does not clear the cache, because doing so is a system-specific operation. To enable cache clearing, and thus pass the MLPerf compliance checker, provide a cache_clear_cmd, which will be executed with os.system.
Parameters
- root_folder (str) – The root submission folder.
- index (int) – The repetition index of this run. The file created will be named result_[index].txt.
- benchmark (str, optional) – Benchmark name. Currently only resnet is supported. Default: 'resnet'.
- target (float, optional) – The target metric value to reach before the mllogger marks the stop of the timing run. Default: 0.759 (resnet benchmark).
- division (str, optional) – Division of submission. Currently only the open division is supported. Default: 'open'.
- metric_name (str, optional) – Name of the metric to compare against the target. Default: MulticlassAccuracy.
- metric_label (str, optional) – The label name. The metric will be accessed via state.eval_metrics[metric_label][metric_name]. Default: 'eval'.
- submitter (str, optional) – Submitting organization. Default: "MosaicML".
- system_name (str, optional) – Name of the system (e.g. 8xA100_composer). If not provided, the system name defaults to [world_size]x[device_name]_composer, e.g. 8xNVIDIA_A100_80GB_composer.
- status (str, optional) – Submission status. One of onprem, cloud, or preview. Default: "onprem".
- cache_clear_cmd (str, optional) – Command to invoke during the cache clear. This callback will call os.system(cache_clear_cmd). Default: None (disabled).
- host_processors_per_node (int, optional) – Total number of host processors per node. Default: None.
- exit_at_target (bool, optional) – Whether to exit training when the target metric is met. Default: False.
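The `cache_clear_cmd` behaviour described above can be sketched as follows. This is a minimal illustration, not the callback's actual code: `clear_cache` is a hypothetical function, and `exit 0` is a harmless stand-in command. A real cache-clear command is system-specific; on Linux, something like `sync && echo 3 > /proc/sys/vm/drop_caches` is commonly used and requires root privileges.

```python
import os


def clear_cache(cache_clear_cmd=None):
    """Run the cache-clear command if one was given.

    Hypothetical helper for illustration only; it is not part of Composer.
    Returns None when clearing is disabled, otherwise the os.system() status.
    """
    if cache_clear_cmd is None:
        # Matches the documented default: cache clearing is disabled.
        return None
    # The command string is passed to the system shell unchanged.
    return os.system(cache_clear_cmd)


# "exit 0" is a stand-in that does nothing and succeeds in common shells.
print(clear_cache())
print(clear_cache("exit 0"))
```

Because the command is handed to the shell as-is, it should come from a trusted source, and a non-zero status is worth checking before starting a timed run.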