Common Commands
mcli run -f <your_yaml>
Submits a run with the provided YAML configuration.
mcli run --clone <existing_run_name>
Submits a new run using the existing run’s configuration.
mcli get runs
Lists all of your submitted runs (see mcli get runs --help to view the many filters available).
mcli describe run <run_name>
Get detailed information about a run, including the config that was used to launch it.
mcli logs <run_name>
Retrieves the console log of the latest resumption of the indicated run.
mcli logs <run_name> --prev
Retrieves the console log for the previous (next to last) resumption of the indicated run.
mcli logs <run_name> --resumption <N>
Retrieves the console log for a given resumption of the indicated run.
mcli stop run <run_name>
Stops the provided run. The run is not deleted from the cluster.
mcli run -r <stopped_run>
Restarts a stopped run. See Composer’s Auto Resumption guide!
mcli delete run <run_name>
Deletes the run (and its associated logs) from the cluster.
mcli update run <run_name> --max-duration <hours>
Updates the maximum time (in hours) that a run can run for.
Full documentation for the mcli update run command
usage: mcli update run [-h] [--priority PRIORITY] [--no-preemptible | --preemptible]
[--max-retries MAX_RETRIES] [--max-duration MAX_DURATION] run_name
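When a run has been resumed, it can help to compare the console log of the latest resumption against the previous one. The sketch below is illustrative only: `save_resumption_logs` and the `MCLI` override variable are hypothetical helpers, not part of mcli. Only the `mcli logs <run_name>` and `mcli logs <run_name> --prev` invocations come from the list above, and the sketch assumes the run has already stopped or finished so that each command returns.

```shell
# Hypothetical helper (not part of mcli): save the latest and previous
# resumption logs of a run to files so they can be compared.
# MCLI is a hypothetical override, e.g. MCLI="echo mcli" previews the
# commands without contacting the cluster.
save_resumption_logs() {
  local run_name="$1"
  local out_dir="${2:-.}"
  local mcli_cmd="${MCLI:-mcli}"

  mkdir -p "$out_dir" || return 1

  # Console log of the latest resumption.
  $mcli_cmd logs "$run_name" > "$out_dir/$run_name.latest.log" || return 1

  # Console log of the previous (next to last) resumption.
  $mcli_cmd logs "$run_name" --prev > "$out_dir/$run_name.prev.log"
}
```

For example, `save_resumption_logs <run_name> ./logs` writes two files into `./logs`, which can then be compared with `diff`.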
Run sharing
If run sharing is enabled, users within the same organization have read access to other users’ runs. Ask your administrator if you would like this feature enabled!
This enables easier collaboration, so a user can fetch other users’ runs with:
mcli get runs --user <another_users_email>
Users can also describe another user’s runs and tail their logs with:
mcli describe run <run_name>
mcli logs <run_name>
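With run sharing enabled, these commands can be chained to review a colleague's run. This is a minimal sketch: `inspect_shared_run` and the `MCLI` override are hypothetical names introduced here for illustration, and the sketch assumes your organization has run sharing turned on.

```shell
# Hypothetical helper (not part of mcli): list another user's runs,
# then describe one of them and fetch its console log.
# MCLI is a hypothetical override, e.g. MCLI="echo mcli" for a preview.
inspect_shared_run() {
  local user_email="$1"
  local run_name="$2"
  local mcli_cmd="${MCLI:-mcli}"

  # List the runs submitted by the other user.
  $mcli_cmd get runs --user "$user_email" || return 1

  # Show detailed information, including the launch config.
  $mcli_cmd describe run "$run_name" || return 1

  # Retrieve the console log of the latest resumption.
  $mcli_cmd logs "$run_name"
}
```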
Watchdog
During large-scale training runs, hardware failures (e.g. node failures) can occur. We’ve developed a system called watchdog that automatically resumes your run if our system detects a failure.
This is not enabled by default because gracefully resuming models during training requires careful consideration.
If you are using Composer or the LLM Foundry, this can be easily enabled.
You can enable watchdog on an existing and active run.
To enable watchdog, use:
mcli watchdog <run_name>
To disable watchdog, use:
mcli watchdog --disable <run_name>
If watchdog is enabled for your run, you’ll see a 🐕 icon next to your run_name in the mcli get runs display.
By default, enabling watchdog will automatically retry your run 10 times. You can configure this default in your YAML by overriding the max_retries scheduling parameter.
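The watchdog steps can be combined into one helper, sketched below. `enable_watchdog` and the `MCLI` override are hypothetical names, not part of mcli. The `--max-retries` flag appears in the `mcli update run` usage text. That it adjusts the same max_retries scheduling parameter for an already-submitted run is an assumption, so check `mcli update run --help` before relying on it.

```shell
# Hypothetical helper (not part of mcli): enable watchdog on a run,
# optionally change its retry limit, then list runs to check for the
# watchdog icon.
# MCLI is a hypothetical override, e.g. MCLI="echo mcli" for a preview.
enable_watchdog() {
  local run_name="$1"
  local max_retries="${2:-}"
  local mcli_cmd="${MCLI:-mcli}"

  # Enable watchdog on the existing, active run.
  $mcli_cmd watchdog "$run_name" || return 1

  # Assumption: --max-retries overrides the default retry count.
  if [ -n "$max_retries" ]; then
    $mcli_cmd update run "$run_name" --max-retries "$max_retries" || return 1
  fi

  # The run should now appear with the watchdog icon next to its name.
  $mcli_cmd get runs
}
```

Calling `enable_watchdog <run_name>` leaves the default retry count untouched, while `enable_watchdog <run_name> 3` also requests a limit of three retries.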