MosaicML CLI & SDK Documentation

The MosaicML platform is an AI development platform designed to tackle the challenges of training and serving large AI models.
This documentation covers the MosaicML platform's Command Line Interface (CLI) and Python SDK. For documentation on other MosaicML components, see the main MosaicML platform documentation.

MosaicML platform features

  • 🚀 Easily scale training across multiple nodes:

mcli run -f gpt_70b.yaml --gpus 256
  • Direct jobs across multiple clouds with a single flag.

> mcli get clusters
NAME           PROVIDER   GPU_TYPES_AND_NUMS
onprem-oregon  MosaicML   a100_40gb: [1, 2, 4, 8, 16, 32, 64, 128]
                          none (CPU only): [0]
aws-us-west-2  AWS        a100_80gb: [1, 2, 4, 8, 16]
                          none (CPU only): [0]
aws-us-east-1  AWS        a100_40gb: [1, 2, 4, 8, 16]
                          none (CPU only): [0]
oracle-sjc     OCI        a100_40gb: [1, 2, 4, 8, 16, 32, 64, 128, 256]
                          none (CPU only): [0]
mcli run -f gpt_30b.yaml --gpus 64 --cluster oracle-sjc
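To decide which cluster to pass to `--cluster`, it can help to check which clusters allow the GPU count you need. The following is a minimal sketch that parses the sample listing above as plain text. The helper names `parse_clusters` and `clusters_supporting` are hypothetical and not part of mcli, and the parser assumes the three-column layout shown in the sample, with no spaces in the cluster or provider names.

```python
# Sample output copied from the `mcli get clusters` listing above.
SAMPLE_OUTPUT = """\
NAME           PROVIDER   GPU_TYPES_AND_NUMS
onprem-oregon  MosaicML   a100_40gb: [1, 2, 4, 8, 16, 32, 64, 128]
                          none (CPU only): [0]
aws-us-west-2  AWS        a100_80gb: [1, 2, 4, 8, 16]
                          none (CPU only): [0]
aws-us-east-1  AWS        a100_40gb: [1, 2, 4, 8, 16]
                          none (CPU only): [0]
oracle-sjc     OCI        a100_40gb: [1, 2, 4, 8, 16, 32, 64, 128, 256]
                          none (CPU only): [0]
"""


def parse_clusters(text):
    """Return {cluster_name: {gpu_type: [allowed GPU counts]}}."""
    clusters = {}
    current = None
    for line in text.splitlines()[1:]:  # skip the header row
        if not line.strip():
            continue
        if line[0].isspace():
            # Continuation row: another GPU type for the current cluster.
            spec = line.strip()
        else:
            # New cluster row: name, provider, then the first GPU spec.
            current, _provider, spec = line.split(None, 2)
            clusters[current] = {}
        gpu_type, nums = spec.split(":", 1)
        counts = [int(n) for n in nums.strip(" []").split(",")]
        clusters[current][gpu_type.strip()] = counts
    return clusters


def clusters_supporting(clusters, gpus):
    """List clusters where some GPU type allows exactly `gpus` GPUs."""
    return [
        name
        for name, gpu_types in clusters.items()
        if any(gpus in counts for counts in gpu_types.values())
    ]


if __name__ == "__main__":
    parsed = parse_clusters(SAMPLE_OUTPUT)
    print(clusters_supporting(parsed, 64))
```

In the sample listing, only `onprem-oregon` and `oracle-sjc` list 64 among their allowed GPU counts, which is why the command above targets `oracle-sjc`.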
  • 🐍 Fully featured Python API. Build advanced workflows for your team.

from mcli import Run, RunConfig, RunStatus, create_run, wait_for_run_status

def monitor_run(run: Run, max_retries: int):
    """Monitor a run and resubmit it on failure for automatic resumption."""
    num_retries = 0
    while True:
        # Block until the run finishes, and keep the refreshed Run object
        # so that run.status reflects the final state.
        run = wait_for_run_status(run, RunStatus.COMPLETED)
        if run.status == RunStatus.FAILED:
            num_retries += 1
            if num_retries > max_retries:
                raise RuntimeError('Exceeded maximum number of retries')

            # Resubmit a copy of the failed run and keep monitoring it.
            run = run.clone()
            print(f'Failure detected, resubmitting new run: {run.name}')
        else:
            print(f'Run {run.name} finished with status {run.status}')
            break

# Load the run configuration, submit it, and retry up to 5 times on failure.
config = RunConfig.from_file('resnet50.yaml')
run = create_run(config)
monitor_run(run, max_retries=5)
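The retry logic in `monitor_run` can be tried out without platform access. The following is a minimal sketch of the same control flow, assuming hypothetical `FakeRun` and `FakeStatus` stand-ins whose outcomes are scripted in advance. Neither class is part of mcli, and `FakeRun.wait` only imitates blocking on a real run.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class FakeStatus(Enum):
    """Hypothetical stand-in for mcli's RunStatus."""
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


@dataclass
class FakeRun:
    """Hypothetical stand-in for mcli's Run with scripted outcomes."""
    outcomes: List[FakeStatus]  # shared queue: one final status per submission
    status: Optional[FakeStatus] = None

    def wait(self) -> "FakeRun":
        """Pretend to block until the run finishes, then record its status."""
        self.status = self.outcomes.pop(0)
        return self

    def clone(self) -> "FakeRun":
        """Resubmit: a new run that shares the remaining scripted outcomes."""
        return FakeRun(self.outcomes)


def monitor_run(run: FakeRun, max_retries: int):
    """Wait for a run and resubmit on failure, up to max_retries times.

    Returns the completed run and the number of resubmissions used.
    """
    num_retries = 0
    while True:
        run = run.wait()
        if run.status is FakeStatus.COMPLETED:
            return run, num_retries
        num_retries += 1
        if num_retries > max_retries:
            raise RuntimeError("Exceeded maximum number of retries")
        run = run.clone()


if __name__ == "__main__":
    script = [FakeStatus.FAILED, FakeStatus.FAILED, FakeStatus.COMPLETED]
    final_run, retries = monitor_run(FakeRun(script), max_retries=5)
    print(final_run.status, retries)
```

Scripting the outcomes makes it easy to check both paths: a run that eventually completes, and one that exhausts its retry budget and raises `RuntimeError`.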

We support integrations with all your favorite tooling: Git, Weights & Biases, CometML, and more!

About Us

MosaicML’s mission is to make training and serving large AI models accessible. We continually productionize state-of-the-art research on efficient model training and inference, and study how these methods combine to ensure that model training and serving are ✨ as efficient as possible ✨. These findings are built into our highly efficient MosaicML platform.

If you have questions, please feel free to reach out to us on Twitter, by email, or in our Slack channel!

Contents

Python API

Resources

Releases