Watchdog 🐕#
When training large models across many nodes, some nodes will inevitably fail over time due to hardware issues, halting any in-progress training runs. Instead of manually restarting these runs, you can enable Watchdog to automatically restart a run if a system or node failure is detected. To do so, you must enable Watchdog either in your run YAML or via MCLI for every run you submit.
Note that Watchdog only restarts the run from the beginning; it does not handle graceful resumption from a checkpoint automatically. To resume gracefully, see the sections below on enabling autoresume in Composer or LLM Foundry training. Alternatively, you can include custom logic in your command to resume from a specified checkpoint.
To enable Watchdog via MCLI, use:
mcli watchdog <run_name>
To disable Watchdog via MCLI, use:
mcli watchdog --disable <run_name>
You can also configure Watchdog at run submission time in your run YAML by specifying the following fields under scheduling:
scheduling:
  retry_on_system_failure: True
  max_retries: 10
If Watchdog is enabled for your run, you'll see a 🐕 icon next to your run_name in the mcli get runs display. By default, enabling Watchdog will automatically retry your run 10 times. You can change this default in your YAML by overriding the max_retries scheduling parameter.
Resuming from a checkpoint in LLM Foundry training#
For more efficient training with Watchdog, you can configure autoresume from a checkpoint in LLM Foundry training by passing a configured save_folder inside your parameters.
For example, a full run YAML would look something like this:
name: mpt-1b-gpus-8
image: mosaicml/llm-foundry:2.1.0_cu121_flash2-latest

command: |
  cd llm-foundry/scripts
  composer train/train.py $PARAMETERS

compute:
  gpus: 8

# Enable watchdog with max 10 retries
scheduling:
  retry_on_system_failure: True
  max_retries: 10

parameters:
  # See https://github.com/mosaicml/llm-foundry/tree/main/scripts/train/yamls
  # for example training parameters
  # ...

  # Configure a save folder for checkpointing and auto-resumption
  save_folder: s3://my-bucket/my-folder/{run_name}/checkpoints
  save_interval: 5000ba
The training script leverages Composer, our deep learning training library, and will pre-populate Composer's autoresume parameter by default if you pass in a save_folder. If you want to explicitly turn off resuming from a checkpoint, you can specify this in parameters:
parameters:
  ...
  autoresume: False
Resuming from a checkpoint in Composer#
For more efficient training with Watchdog, you can configure autoresume from a checkpoint using Composer. To do so, pass both of the arguments below into the Composer Trainer (a minimal sketch follows the list):
- autoresume: should be set to True to resume from the latest checkpoint.
- save_folder: the remote folder where checkpoints should be saved during training. Checkpoints are written to this folder at every save_interval. If a run crashes between checkpoints, autoresume will pick up from the latest checkpoint.
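For reference, here is a minimal sketch of how these arguments fit together. The toy model, dataloader, run name, and S3 path are placeholders for illustration only; the relevant pieces are the run_name, save_folder, save_interval, and autoresume Trainer arguments:

import torch
from torch.utils.data import DataLoader, TensorDataset

from composer import Trainer
from composer.models import ComposerClassifier

# Toy stand-ins so the sketch is self-contained; replace these with your own
# ComposerModel and dataloader in a real run.
dataset = TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,)))
train_dataloader = DataLoader(dataset, batch_size=16)
model = ComposerClassifier(torch.nn.Linear(8, 2), num_classes=2)

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration="2ep",
    # Keep run_name stable across restarts so autoresume can find the
    # checkpoints written by the previous attempt.
    run_name="my-run",
    # Placeholder remote path; checkpoints are uploaded here at every save_interval.
    save_folder="s3://my-bucket/my-folder/{run_name}/checkpoints",
    save_interval="5000ba",
    # Resume from the latest checkpoint in save_folder, if one exists.
    autoresume=True,
)
trainer.fit()

When Watchdog restarts the run, the same script executes again and the Trainer loads the latest checkpoint from save_folder instead of starting from scratch.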
A full end-to-end example is available in the Composer Autoresume documentation.