Watchdog ๐Ÿ•#

When training large models across many nodes, some nodes will inevitably fail over time due to hardware issues, halting any in-progress training runs. Instead of manually restarting these runs, you can enable Watchdog to automatically restart a run when a system or node failure is detected. Watchdog must be enabled for every run you submit, either in your run YAML or via MCLI.

Note that Watchdog restarts the run from the beginning; it does not automatically resume from a checkpoint. To resume gracefully, see the sections below on enabling autoresume in Composer or LLM Foundry training. Alternatively, you can include custom logic in your run command to resume from a specific checkpoint.

To enable Watchdog via MCLI, use:

mcli watchdog <run_name>

To disable Watchdog via MCLI, use:

mcli watchdog --disable <run_name>

You can also enable Watchdog at run submission via your run YAML by specifying the following fields under scheduling:

scheduling:
  retry_on_system_failure: True
  max_retries: 10

If Watchdog is enabled for your run, you'll see a 🐕 icon next to your run name in the mcli get runs display. By default, enabling Watchdog retries your run up to 10 times. You can change this default in your YAML by overriding the max_retries scheduling parameter.

Resuming from a checkpoint in LLM Foundry training

For more efficient training with Watchdog, you can configure autoresume from a checkpoint in LLM Foundry training by setting a save_folder in your parameters.

For example, a full run YAML would look something like this:

name: mpt-1b-gpus-8
image: mosaicml/llm-foundry:2.1.0_cu121_flash2-latest
command: |
  cd llm-foundry/scripts
  composer train/train.py $PARAMETERS
compute:
  gpus: 8

# Enable watchdog with max 10 retries
scheduling:
  retry_on_system_failure: True
  max_retries: 10

parameters:
  # See https://github.com/mosaicml/llm-foundry/tree/main/scripts/train/yamls
  # for example training parameters
  # ...

  # Configure a save folder for checkpointing and auto-resumption
  save_folder: s3://my-bucket/my-folder/{run_name}/checkpoints
  save_interval: 5000ba

The training script leverages Composer, our deep learning training library, and will pre-populate Composer's autoresume parameter by default if you pass in a save_folder. If you want to explicitly turn off resuming from a checkpoint, you can specify this in parameters:

parameters:
  ...

  autoresume: False
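To make the precedence clear, the short Python sketch below summarizes the defaulting behavior described above. The resolve_autoresume helper is purely illustrative and not the actual LLM Foundry source: autoresume defaults to True whenever a save_folder is configured, and an explicit autoresume: False overrides that default.

def resolve_autoresume(parameters: dict) -> bool:
    # Hypothetical helper, not the actual LLM Foundry code: autoresume
    # defaults to True when a save_folder is configured, unless the YAML
    # explicitly sets autoresume to False.
    default = parameters.get("save_folder") is not None
    return parameters.get("autoresume", default)

# save_folder set, autoresume omitted -> resumes from the latest checkpoint
print(resolve_autoresume({"save_folder": "s3://my-bucket/ckpts"}))  # True

# autoresume explicitly disabled in parameters
print(resolve_autoresume({"save_folder": "s3://my-bucket/ckpts", "autoresume": False}))  # False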

Resuming from a checkpoint in Composer

For more efficient training with Watchdog, you can configure autoresume from a checkpoint directly in Composer. To do so, pass both of the arguments below to the Composer Trainer (a minimal sketch follows the list):

  • autoresume: set to True to resume from the latest checkpoint.

  • save_folder: the remote folder where checkpoints are saved during training. Checkpoints are written to this folder every save_interval. If a run crashes between checkpoints, autoresume picks up from the latest checkpoint.
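
Below is a minimal sketch of how these arguments fit together in the Composer Trainer. The toy model, random dataset, run name, and bucket path are illustrative placeholders; substitute your own training objects and remote save_folder.

import torch
from torch.utils.data import DataLoader, TensorDataset
from composer import Trainer
from composer.models import ComposerClassifier

# Toy classifier and random data so the sketch is self-contained
module = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(16, 2))
model = ComposerClassifier(module, num_classes=2)
dataset = TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,)))
train_dataloader = DataLoader(dataset, batch_size=8)

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration="2ep",
    # Keep the run name stable across restarts so autoresume finds the right checkpoints
    run_name="my-autoresume-run",
    # Remote folder where checkpoints are written every save_interval
    save_folder="s3://my-bucket/my-folder/{run_name}/checkpoints",
    save_interval="100ba",
    # Resume from the latest checkpoint in save_folder when the run restarts
    autoresume=True,
)
trainer.fit()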

A full end-to-end example is available in the Composer Autoresume documentation.