Common Commands

mcli run -f <your_yaml>

Use run to submit a run with the provided YAML configuration.


mcli run --clone <existing_run_name>

Use run --clone to submit a new run using the existing run’s configuration.


mcli get runs

Use get runs to list all of your submitted runs (see mcli get runs --help to view the many filters available).


mcli describe run <run_name>

Use describe run to get detailed information about a run, including the config that was used to launch it.


mcli logs <run_name>

Use logs to retrieve the console log of the latest resumption of the indicated run.


mcli logs <run_name> --prev

Use logs with the --prev flag to retrieve the console log for the previous (next-to-last) resumption of the indicated run.


mcli logs <run_name> --resumption <N>

Use logs with a --resumption parameter to retrieve the console log for a given resumption of the indicated run.
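
When a run has been resumed several times, it can help to keep each resumption's console log in its own file. The following is a minimal shell sketch, not part of mcli: save_resumption_logs is a hypothetical helper and the output file naming is an assumption. It relies only on the mcli logs <run_name> --resumption <N> form shown above, and is intended for resumptions that have already finished.

```shell
# Hypothetical helper (not an mcli command): save the console log of each
# requested resumption to its own file in the current directory.
# Usage: save_resumption_logs <run_name> <N> [<N> ...]
save_resumption_logs() {
  run_name="$1"
  shift
  for n in "$@"; do
    # File name pattern is an assumption chosen for this example.
    mcli logs "$run_name" --resumption "$n" > "${run_name}-resumption-${n}.log" || return 1
  done
}
```

Pass whichever resumption numbers apply to your run, for example save_resumption_logs <run_name> <N1> <N2>. The helper stops at the first resumption whose logs cannot be retrieved.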


mcli stop run <run_name>

Use stop run to stop the provided run. The run will be stopped but not deleted from the cluster.


mcli run -r <stopped_run>

Use run -r to restart a stopped run. See Composer’s Auto Resumption guide for details.


mcli delete run <run_name>

Use delete run to delete the run (and its associated logs) from the cluster. We recommend using this sparingly.


mcli update run <run_name> --max-duration <hours>

Use update run to update a handful of run parameters, like the max time (in hours) that a run can run for.

Full documentation for the mcli update run command:

usage: mcli update run [-h] [--priority PRIORITY] [--no-preemptible | --preemptible]
                       [--max-retries MAX_RETRIES] [--max-duration MAX_DURATION] run_name
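
Since every option in the usage text is optional, several of them can presumably be combined in one call. The sketch below wraps that in a small shell function. extend_run is a hypothetical name, not something mcli provides, and the values you pass are your own. Only the --max-duration and --max-retries flags from the usage text are used.

```shell
# Hypothetical wrapper (not an mcli command): raise a run's time limit and
# retry budget in a single `mcli update run` call.
# Usage: extend_run <run_name> <max_duration_hours> <max_retries>
extend_run() {
  run_name="$1"
  max_hours="$2"
  max_retries="$3"
  mcli update run "$run_name" --max-duration "$max_hours" --max-retries "$max_retries"
}
```

For example, extend_run <run_name> 12 3 would request a 12-hour limit and up to 3 retries. Those numbers are illustrative only.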

mcli watchdog <run_name>

Use watchdog to turn automatic retries on for a run. See the Watchdog section below for details.

Run sharing

If run sharing is enabled, users within the same organization have read access to other users’ runs. We often enable this when your organization is created. Ask your administrator if you would like this feature enabled or disabled.

This enables easier collaboration, so a user can fetch other users’ runs with:

mcli get runs --user <another_users_email>

Users can also tail the logs of another user’s run (or inspect it with mcli describe run <run_name>):

mcli logs <run_name>

Watchdog

When training large-scale runs, there may be hardware failures (e.g. node failures). We’ve developed a system called Watchdog that will automatically resume your run if our system detects any failures. This is not enabled by default because gracefully resuming models during training requires careful consideration. If you are using Composer or LLM Foundry, this can be easily enabled.

You can enable Watchdog on an existing, active run.

To enable Watchdog, use:

mcli watchdog <run_name>

To disable Watchdog, use:

mcli watchdog --disable <run_name>

If Watchdog is enabled for your run, you’ll see a 🐕 icon next to your run_name in the mcli get runs display.

By default, enabling Watchdog will automatically retry your run 10 times.

You can change this limit in your YAML by overriding the max_retries scheduling parameter.
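
As a rough illustration of that override, the sketch below writes a small YAML fragment from the shell. Two things here are assumptions rather than statements from this page: that max_retries sits under a top-level scheduling key, and the file name scheduling_fragment.yaml, which is purely hypothetical. In practice you would add the same keys to the YAML you pass to mcli run -f.

```shell
# Write an illustrative YAML fragment to a hypothetical file.
cat > scheduling_fragment.yaml <<'EOF'
# Assumption: max_retries is nested under a top-level `scheduling` key.
scheduling:
  max_retries: 3  # illustrative value; the default described above is 10
EOF
```

For a run that has already been submitted, the --max-retries option of mcli update run listed earlier appears to serve the same purpose.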