Mosaic AI CLI & SDK Documentation

Databricks Mosaic AI training is designed to tackle the challenges of training large AI models.

This documentation covers how to train using the Command Line Interface (CLI) and Python SDK. To explore documentation of other Mosaic AI components, check out our main documentation page.

We offer three interfaces for training models on Mosaic AI:

Custom Training: Using llm-foundry or your own custom image and training code, pretrain, finetune, and evaluate models with maximum flexibility. This is the most powerful and flexible way to train models on Mosaic AI.
Pretraining API (in preview): Train DBRX from scratch with ease using our Pretraining API. This is the only way to train DBRX from scratch on Mosaic AI with the speedups we built while training DBRX.
Finetuning API: Confidently finetune and adapt models with our Finetuning API. Less flexible than Custom Training, but includes a large set of prebuilt models and training configurations that “just work”.

Key features

🚀 Easily scale training across multiple nodes:

mcli run -f gpt_70b.yaml --gpus 256

☁ Direct jobs across multiple clouds with a single flag.

> mcli get clusters
NAME           PROVIDER   GPU_TYPES_AND_NUMS
onprem-oregon  MosaicML   a100_40gb: [1, 2, 4, 8, 16, 32, 64, 128]
                          none (CPU only): [0]
aws-us-west-2  AWS        a100_80gb: [1, 2, 4, 8, 16]
                          none (CPU only): [0]
aws-us-east-1  AWS        a100_40gb: [1, 2, 4, 8, 16]
                          none (CPU only): [0]
oracle-sjc     OCI        a100_40gb: [1, 2, 4, 8, 16, 32, 64, 128, 256]
                          none (CPU only): [0]

mcli run -f gpt_30b.yaml --gpus 64 --cluster oracle-sjc
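
As a rough illustration of how a cluster might be chosen for a given GPU count, the sketch below parses the sample listing above as plain text and assembles an `mcli run` command from the `-f`, `--gpus`, and `--cluster` flags shown on this page. The helper names (`parse_clusters`, `clusters_supporting`, `build_run_command`) are hypothetical and not part of mcli; the text parsing is an assumption made only for illustration and does not use any mcli API.

```python
from typing import Dict, List

# Sample listing copied from the `mcli get clusters` output shown above.
CLUSTER_LISTING = """\
NAME           PROVIDER   GPU_TYPES_AND_NUMS
onprem-oregon  MosaicML   a100_40gb: [1, 2, 4, 8, 16, 32, 64, 128]
                          none (CPU only): [0]
aws-us-west-2  AWS        a100_80gb: [1, 2, 4, 8, 16]
                          none (CPU only): [0]
aws-us-east-1  AWS        a100_40gb: [1, 2, 4, 8, 16]
                          none (CPU only): [0]
oracle-sjc     OCI        a100_40gb: [1, 2, 4, 8, 16, 32, 64, 128, 256]
                          none (CPU only): [0]
"""


def parse_clusters(listing: str) -> Dict[str, Dict[str, List[int]]]:
    """Map each cluster name to {gpu_type: [allowed GPU counts]}."""
    clusters: Dict[str, Dict[str, List[int]]] = {}
    current: Dict[str, List[int]] = {}
    for line in listing.splitlines()[1:]:  # skip the header row
        if not line.strip():
            continue
        if not line[0].isspace():
            # A new cluster row: name, provider, then the first GPU entry.
            name, _provider, rest = line.split(None, 2)
            current = clusters.setdefault(name, {})
        else:
            # A continuation row that lists another GPU type for the same cluster.
            rest = line.strip()
        gpu_type, _, counts = rest.partition(": ")
        current[gpu_type] = [int(n) for n in counts.strip("[]").split(",")]
    return clusters


def clusters_supporting(clusters: Dict[str, Dict[str, List[int]]], gpus: int) -> List[str]:
    """Return the names of clusters that list `gpus` as an allowed GPU count."""
    return [
        name
        for name, gpu_types in clusters.items()
        if any(gpus in counts for counts in gpu_types.values())
    ]


def build_run_command(config_file: str, gpus: int, cluster: str) -> List[str]:
    """Assemble the argument list for `mcli run` using the flags shown above."""
    return ["mcli", "run", "-f", config_file, "--gpus", str(gpus), "--cluster", cluster]


clusters = parse_clusters(CLUSTER_LISTING)
candidates = clusters_supporting(clusters, 256)
print(candidates)
print(" ".join(build_run_command("gpt_70b.yaml", 256, candidates[0])))
```

With the sample listing, only `oracle-sjc` offers 256 GPUs, so the printed command targets that cluster. The argument list could be passed to `subprocess.run` to submit the job.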

🐍 Fully featured Python API. Build advanced workflows for your team.

from mcli import wait_for_run_status, Run, RunConfig, RunStatus, create_run

def monitor_run(run: Run, max_retries: int) -> Run:
    """Monitor and resubmit failed runs for automatic resumption."""
    num_retries = 0
    while True:
        # Block until the run completes or stops in another terminal state.
        # Assumes wait_for_run_status returns the refreshed Run when it is
        # not called in future mode.
        run = wait_for_run_status(run, RunStatus.COMPLETED)
        if run.status != RunStatus.FAILED:
            print(f'Run {run.name} finished with status {run.status}')
            return run

        num_retries += 1
        if num_retries > max_retries:
            raise RuntimeError('Exceeded maximum number of retries')

        # Resubmit a copy of the failed run and keep monitoring it.
        run = run.clone()
        print(f'Failure detected, resubmitted as new run: {run.name}')

config = RunConfig.from_file('resnet50.yaml')
run = create_run(config)
monitor_run(run, max_retries=5)
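
The retry-and-resubmit pattern in `monitor_run` can be tried out without access to a cluster. The following is a minimal sketch under stated assumptions: `FakeStatus`, `FakeRun`, and `monitor_with_retries` are hypothetical stand-ins written for this example, not mcli classes, and the `wait` argument takes the place of whatever call blocks until a run finishes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class FakeStatus(Enum):
    """Hypothetical status values; not the mcli RunStatus enum."""
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


@dataclass
class FakeRun:
    """Hypothetical stand-in for a run whose outcomes are scripted in advance."""
    base_name: str
    outcomes: List[FakeStatus]  # final status of attempt 0, 1, 2, ...
    attempt: int = 0

    @property
    def name(self) -> str:
        return f"{self.base_name}-{self.attempt}"

    @property
    def status(self) -> FakeStatus:
        # Reuse the last scripted outcome if there are more attempts than entries.
        return self.outcomes[min(self.attempt, len(self.outcomes) - 1)]

    def clone(self) -> "FakeRun":
        """Return a new attempt that shares the same scripted outcomes."""
        return FakeRun(self.base_name, self.outcomes, self.attempt + 1)


def monitor_with_retries(
    run: FakeRun, max_retries: int, wait: Callable[[FakeRun], FakeRun]
) -> FakeRun:
    """Resubmit a failed run until it succeeds or the retry budget is spent."""
    num_retries = 0
    while True:
        run = wait(run)  # block until this attempt reaches a final status
        if run.status != FakeStatus.FAILED:
            return run
        num_retries += 1
        if num_retries > max_retries:
            raise RuntimeError("Exceeded maximum number of retries")
        run = run.clone()


# Fake runs finish instantly, so waiting simply returns the run unchanged.
flaky = FakeRun("demo", [FakeStatus.FAILED, FakeStatus.FAILED, FakeStatus.COMPLETED])
final = monitor_with_retries(flaky, max_retries=5, wait=lambda r: r)
print(final.name, final.status.value)
```

Here the scripted run fails twice and succeeds on the third attempt, so the loop returns after two resubmissions. Lowering `max_retries` to 1 makes the same scenario raise `RuntimeError`.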

We support integrations with all your favorite tooling: Git, Weights & Biases, CometML, and more!

About Us

The mission of Databricks Mosaic AI is to make training and tuning of large AI models accessible. We continually productionize state-of-the-art research on efficient model training and study how these methods combine, so that model training is ✨ as efficient as possible ✨.

If you have questions, please feel free to reach out to us on Twitter, by email, or in our Slack channel!

Contents

Getting Started

Training

Pretraining API (In Preview)

Finetuning API

Resources

Releases