Common Commands#
mcli run -f <your_yaml>
Use `run` to submit a run with the provided YAML configuration.
mcli run --clone <existing_run_name>
Use `run --clone` to submit a new run using an existing run’s configuration.
mcli get runs
Use `get runs` to list all of your submitted runs (see `mcli get runs --help` to view the many filters available).
mcli describe run <run_name>
Use `describe run` to get detailed information about a run, including the config that was used to launch it.
mcli logs <run_name>
Use `logs` to retrieve the console log of the latest resumption of the indicated run.
mcli logs <run_name> --prev
Use `logs` with the `--prev` parameter to retrieve the console log for the previous (next-to-last) resumption of the indicated run.
mcli logs <run_name> --resumption <N>
Use `logs` with the `--resumption` parameter to retrieve the console log for a given resumption of the indicated run.
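When comparing several resumptions of the same run, it can help to save each console log to its own file. The following is a minimal sketch that only uses the `mcli logs <run_name> --resumption <N>` form shown above; the function name `save_resumption_logs` and the output file names are hypothetical, and the caller supplies the resumption indices because their numbering is not stated here.

```shell
# Hypothetical helper (not part of mcli): save the console log of several
# resumptions of one run into separate files in the current directory.
# Usage: save_resumption_logs <run_name> <N> [<N> ...]
save_resumption_logs() {
  run_name="$1"
  shift
  for n in "$@"; do
    # Documented form: mcli logs <run_name> --resumption <N>
    mcli logs "$run_name" --resumption "$n" > "${run_name}-resumption-${n}.log"
  done
}
```

For example, `save_resumption_logs <run_name> 0 1` would write two files, one per requested resumption.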
mcli stop run <run_name>
Use `stop run` to stop the provided run. The run will be stopped but not deleted from the cluster.
mcli run -r <stopped_run>
Use `run -r` to restart a stopped run. See Composer’s Auto Resumption guide!
mcli delete run <run_name>
Use `delete run` to delete the run (and its associated logs) from the cluster. We recommend using this sparingly.
mcli update run <run_name> --max-duration <hours>
Use `update run` to update a handful of run parameters, such as the maximum duration (in hours) that a run is allowed to run.
Full documentation for the `mcli update run` command
usage: mcli update run [-h] [--priority PRIORITY] [--no-preemptible | --preemptible]
[--max-retries MAX_RETRIES] [--max-duration MAX_DURATION] run_name
mcli watchdog <run_name>
Use `watchdog` to turn on automatic retries for a run.
More details can be found in the Watchdog section below.
Run sharing#
If run sharing is enabled, users within the same organization have read access to other users’ runs. We often enable this when your organization is created – ask your administrator if you would like this feature enabled or disabled.
This enables easier collaboration, so a user can fetch other users’ runs with:
mcli get runs --user <another_users_email>
Users can also describe another user’s runs and tail their logs, for example with:
mcli logs <run_name>
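The sharing workflow above can be sketched as a small wrapper that lists a teammate's runs and then inspects one of them. This is only an illustration: the function name `inspect_shared_run` is hypothetical, it assumes run sharing is enabled for your organization, and it uses only the `get runs --user`, `describe run`, and `logs` commands shown on this page.

```shell
# Hypothetical helper (not part of mcli): look at a run owned by another
# user in the same organization. Assumes run sharing is enabled.
# Usage: inspect_shared_run <another_users_email> <run_name>
inspect_shared_run() {
  user_email="$1"
  run_name="$2"
  # List the other user's runs so the run name can be confirmed.
  mcli get runs --user "$user_email"
  # Show the run's details, including the config used to launch it.
  mcli describe run "$run_name"
  # Retrieve the console log of the latest resumption.
  mcli logs "$run_name"
}
```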
Watchdog#
When training large-scale runs, there may be hardware failures (e.g., node failures). We’ve developed a system called Watchdog that will automatically resume your run if our system detects any failures.
This is not enabled by default because gracefully resuming models during training requires careful consideration.
If you are using Composer or LLM Foundry, this can be easily enabled.
You can enable Watchdog on an existing and active run.
To enable Watchdog, use:
mcli watchdog <run_name>
To disable Watchdog, use:
mcli watchdog --disable <run_name>
If Watchdog is enabled for your run, you’ll see a 🐕 icon next to your run_name in the `mcli get runs` display.
By default, enabling Watchdog will automatically retry your run 10 times.
You can configure this default in your YAML by overriding the `max_retries` scheduling parameter.
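For a run that is already active, one possible way to combine these steps from the command line is sketched below. It relies on the `mcli watchdog <run_name>` command and the `--max-retries` option listed in the `mcli update run` usage text; the function name `enable_watchdog_with_retries` is hypothetical, and it is an assumption that `--max-retries` controls the same retry limit as the `max_retries` scheduling parameter.

```shell
# Hypothetical helper (not part of mcli): enable Watchdog on an active run
# and set how many automatic retries it may use.
# Assumption: --max-retries maps to the max_retries scheduling parameter.
# Usage: enable_watchdog_with_retries <run_name> [<max_retries>]
enable_watchdog_with_retries() {
  run_name="$1"
  retries="${2:-10}"   # 10 is the default retry count stated above
  mcli watchdog "$run_name"
  mcli update run "$run_name" --max-retries "$retries"
}
```

Afterwards, `mcli get runs` should show the 🐕 icon next to the run if Watchdog was enabled.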