MosaicML CLI & SDK Documentation
MosaicML platform features
🚀 Easily scale training across multiple nodes:
mcli run -f gpt_70b.yaml --gpus 256
☁ Direct jobs across multiple clouds with a single flag.
> mcli get clusters
NAME           PROVIDER  GPU_TYPES_AND_NUMS
onprem-oregon  MosaicML  a100_40gb: [1, 2, 4, 8, 16, 32, 64, 128]
                         none (CPU only): [0]
aws-us-west-2  AWS       a100_80gb: [1, 2, 4, 8, 16]
                         none (CPU only): [0]
aws-us-east-1  AWS       a100_40gb: [1, 2, 4, 8, 16]
                         none (CPU only): [0]
oracle-sjc     OCI       a100_40gb: [1, 2, 4, 8, 16, 32, 64, 128, 256]
                         none (CPU only): [0]
mcli run -f gpu_30b.yaml --gpus 64 --cluster oracle-sjc
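As a rough illustration of how a script might choose a value for `--cluster`, the sketch below hard-codes the GPU counts from the listing above and picks the first cluster that offers a requested GPU type and count. The `CLUSTERS` dictionary, `pick_cluster`, and `build_run_command` are hypothetical helpers written for this example and are not part of `mcli`; only the `-f`, `--gpus`, and `--cluster` flags come from the commands shown above.

```python
from typing import Dict, List, Optional

# GPU counts per cluster, copied by hand from the `mcli get clusters` listing above.
# This dictionary is an illustrative stand-in, not something mcli provides.
CLUSTERS: Dict[str, Dict[str, List[int]]] = {
    "onprem-oregon": {"a100_40gb": [1, 2, 4, 8, 16, 32, 64, 128]},
    "aws-us-west-2": {"a100_80gb": [1, 2, 4, 8, 16]},
    "aws-us-east-1": {"a100_40gb": [1, 2, 4, 8, 16]},
    "oracle-sjc": {"a100_40gb": [1, 2, 4, 8, 16, 32, 64, 128, 256]},
}


def pick_cluster(gpu_type: str, gpus: int) -> Optional[str]:
    """Return the first cluster offering `gpus` GPUs of `gpu_type`, or None."""
    for name, gpu_types in CLUSTERS.items():
        if gpus in gpu_types.get(gpu_type, []):
            return name
    return None


def build_run_command(yaml_path: str, gpus: int, cluster: str) -> List[str]:
    """Assemble the argument list for an `mcli run` call (hypothetical helper)."""
    return ["mcli", "run", "-f", yaml_path, "--gpus", str(gpus), "--cluster", cluster]


# Look for a cluster that can host 256 A100 40GB GPUs and print the command.
cluster = pick_cluster("a100_40gb", 256)
if cluster is not None:
    print(" ".join(build_run_command("gpu_30b.yaml", 256, cluster)))
```

Returning an argument list rather than a single string keeps the command safe to hand to `subprocess.run` without shell quoting.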
🐍 Fully featured python API. Build advanced workflows for your team.
from mcli import wait_for_run_status, Run, RunConfig, RunStatus, create_run

def monitor_run(run: Run, max_retries: int):
    """Monitor and resubmit failed runs for automatic resumption."""
    num_retries = 0
    while True:
        # Block until the run finishes, then inspect the returned Run,
        # which carries the latest status.
        run = wait_for_run_status(run, RunStatus.COMPLETED)
        if run.status == RunStatus.FAILED:
            num_retries += 1
            if num_retries > max_retries:
                raise RuntimeError('Exceeded maximum number of retries')
            # Resubmit a copy of the failed run and keep monitoring it.
            run = run.clone()
            print(f'Failure detected, resubmitting new run: {run.name}')
        else:
            print(f'Run {run.name} completed successfully with status {run.status}')
            break

config = RunConfig.from_file('resnet50.yaml')
run = create_run(config)
monitor_run(run, max_retries=5)
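The control flow of `monitor_run` can be exercised without access to the platform. The sketch below reproduces the same retry logic against a stand-in object that replays a scripted list of outcomes. `FakeRun`, `Status`, and `monitor_with_retries` are hypothetical names invented for this example; they do not mirror the real `mcli` objects beyond the `name`, `status`, and `clone()` members used above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Status(Enum):
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


@dataclass
class FakeRun:
    """Hypothetical stand-in for a run; replays a scripted list of outcomes."""
    base_name: str
    outcomes: List[Status]  # outcome of attempt 0, attempt 1, ...
    attempt: int = 0

    @property
    def name(self) -> str:
        return f"{self.base_name}-{self.attempt}"

    @property
    def status(self) -> Status:
        return self.outcomes[self.attempt]

    def clone(self) -> "FakeRun":
        # A resubmitted run moves on to the next scripted outcome.
        return FakeRun(self.base_name, self.outcomes, self.attempt + 1)


def monitor_with_retries(run: FakeRun, max_retries: int) -> FakeRun:
    """Resubmit a failed run up to `max_retries` times; return the run that completed."""
    num_retries = 0
    while run.status == Status.FAILED:
        num_retries += 1
        if num_retries > max_retries:
            raise RuntimeError("Exceeded maximum number of retries")
        run = run.clone()
        print(f"Failure detected, resubmitting new run: {run.name}")
    print(f"Run {run.name} completed successfully with status {run.status}")
    return run


# Two scripted failures followed by a success.
scripted = [Status.FAILED, Status.FAILED, Status.COMPLETED]
final = monitor_with_retries(FakeRun("resnet50", scripted), max_retries=5)
```

Scripting the outcomes up front makes it easy to check both the success path and the retry limit in a unit test before pointing the same logic at real runs.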
We support integrations with all your favorite tooling: Git, Weights & Biases, CometML, and more!
About Us
MosaicML’s mission is to make the training and serving of large AI models accessible. We continually productionize state-of-the-art research on efficient model training and inference, and study combinations of these methods to ensure that model training and serving are ✨ as efficient as possible ✨. These findings are baked into our highly efficient MosaicML platform.
If you have questions, please feel free to reach out to us on Twitter or by email, or join our Slack channel!
Contents
Getting Started
Finetuning
Python API
Resources
[Deprecated] Inference