0.5#
Looking for the latest release notes? See v0.6.x
0.5.34#
Bugfix where experimentTracker was not returned on createFinetune
0.5.33#
Additional finetuning docs
Update finetuning config to use experimentTracker instead of experimentTrackers
Remove --follow from mcli finetune
Remove generic 500s from MAPI retries
Small bug fix on interactive event loops
0.5.32#
Don't strip strings when printing BaseSubmissionConfig
Add new list_finetuning_events function
Fix update_run_metadata to serialize input data or ignore the data if it is not serializable
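The serialize-or-ignore behavior can be sketched in plain Python. This is an illustrative shim, not the actual mcli implementation, and safe_metadata is a hypothetical name:

```python
import json

def safe_metadata(metadata: dict) -> dict:
    """Keep only metadata values that can be JSON-serialized; drop the rest."""
    out = {}
    for key, value in metadata.items():
        try:
            json.dumps(value)
        except TypeError:
            continue  # non-serializable value: ignore it instead of raising
        out[key] = value
    return out
```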
0.5.31#
Documentation updates
0.5.30#
Finetuning docs updates
Improvements to the mcli describe ft display:
Rename Reason column to Details
Hide Details when null
0.5.29#
Use runType filter instead of isInteractive to fetch interactive runs
Report credentials check failures as "Failed"
Documentation updates
Fix stop run name filter bugs
Fix estimated end time on display
0.5.28#
Updated documentation for Hugging Face on mcli
mcli describe ftnow shows original submitted yaml
0.5.27#
Add mcli stop and delete finetuning runs
Update formatting for mcli describe ft
0.5.26#
Add optional --container flag to mcli logs to filter logs by container name
0.5.25#
Docs updates around managed MLflow
Move errors that don't get retried to debug mode
0.5.24#
Add retry and an optional bool to protect from SIGTERMs
Add dependent deployments for eval.
0.5.23#
⚠️ BACKWARDS INCOMPATIBLE CHANGES ⚠️:
The mcli finetune SDK now returns a distinct FineTune object as opposed to a Run object
Corresponding documentation updates
0.5.22#
Add node_name to the MCLI admin support command
0.5.21#
Add gpu_type to response for cluster utilization
0.5.20#
Add support for UC volumes as data input for the Finetuning API
Modifies the cluster utilization type to support MVP version of serverless scheduling for runs
Adds new Run field parent_name to support tracking for runs that spawn child runs
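For the UC volumes change above, a finetuning config could point its training data at a Unity Catalog volume. The field name and path below are illustrative only; check the Finetuning API docs for the exact schema:

```yaml
# Hypothetical finetuning config snippet; path format is illustrative
train_data_path: dbfs:/Volumes/my_catalog/my_schema/my_volume/train.jsonl
```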
0.5.19#
Fix S3 secret bug
Error out for unknown finetuning fields
0.5.18#
Sort run metadata keys by value not length in mcli describe run
Fix bug with watchdog retry logic
0.5.17#
⚠️ BACKWARDS INCOMPATIBLE CHANGES ⚠️:
The task_type for instruction finetuning has changed from IFT to INSTRUCTION_FINETUNE
Removed instruction_finetune function. Please use the finetune function with task_type="INSTRUCTION_FINETUNE"
0.5.16#
Bugfix for invalid profile error when creating s3 secrets
0.5.15#
You can now use mcli logs <run> --tail N to get the latest N log lines from your run
Added support for "Continued Pretraining" to mcli finetune
Added Databricks secret type
0.5.14#
Fixed S3 secrets created without explicit profiles
Run reason autopopulates from latest attempt
Support for HF secrets
0.5.13#
Added Reason column to mcli util, displaying the reason for pending or queued runs under the queued training runs section.
0.5.12#
Fix for describe run
Update to finetuning API
0.5.11#
Adds alias mcli util -t for mcli util --training
Fixes bug introduced in mcli describe run from 0.5.10
Fixes bug in ETA in mcli get runs
0.5.10#
Finetuning compute updates
Fix describe run bugs
Add estimated end time to commands
Display cpu information in describe cluster
0.5.9#
Doc updates for finetuning
Added --yaml-only flag for runs
Fixed timestamp bug
0.5.8#
Added mcli stop deployment as well as SDK support for it
Added code evaluation
Allow mcli-admin to be permissioned to stop and restart other users' runs
Add override_eval_data_path and tests for finetuning
Fix --resumption and --prev flags for mcli logs
0.5.7#
Add cluster and GPU information to mcli connect
Validate tty to accept StringIO stdin for mcli interactive
Finetuning docs updates
Inference no-code docs updates
0.5.6#
New create_default_deployment & get_cluster SDK functions
Increase default predict timeout to 60 seconds
Add pip command for how to upgrade mcli
Show node statuses in mcli describe run view
By default mcli get runs shows only the latest 3 resumptions
Max duration improvements
Improved mcli get deployments view
Add max_batch_size_in_bytes to BatchingConfig
0.5.5#
rateLimit can be specified in submission YAMLs
Visual improvements to get runs and describe runs output
Initial finetuning support
0.5.4#
Small version change for finetuning API
0.5.3#
Improved mcli get deployments with full and --compact modes
Filter mcli util using --training and --inference arguments
Early release of finetuning API (subject to change)
0.5.2#
Add mcli describe cluster support to show a detailed view with cluster information
0.5.1#
Add --max-duration flag for creating and updating runs
Add ability to view the logs of the latest failed replica with mcli get deployment logs --failed
0.5.0#
This page includes information on updates and features related to the MosaicML CLI and SDK. The first sections cover general features and deprecation notices, followed by features specific to the training and inference products, respectively.
General#
CLI AutoComplete#
We now support tab autocomplete in bash and zsh shells! To enable, run:
eval "$(register-python-argcomplete mcli)"
Deprecation & breaking changes#
ClusterUtilization object returned from get_clusters(utilization=True):
active_by_user is now active_runs_by_user
queued_by_user is now queued_runs_by_user
ClusterUtilizationByRun now has name columns instead of run_name
Deprecated two RunStatus values:
FAILED_PULL - This is reported as RunStatus.FAILED with reason FailedImagePull
SCHEDULED - This is synonymous with RunStatus.QUEUED
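If you have scripts that reference the old names, a small shim like the following can ease migration. It is illustrative only; the rename tables simply mirror the lists above:

```python
# Deprecated -> current names from the 0.5.0 release notes (illustrative shim).
FIELD_RENAMES = {
    "active_by_user": "active_runs_by_user",
    "queued_by_user": "queued_runs_by_user",
    "run_name": "name",  # ClusterUtilizationByRun column rename
}
STATUS_RENAMES = {
    "FAILED_PULL": "FAILED",  # now FAILED with reason FailedImagePull
    "SCHEDULED": "QUEUED",    # synonymous with QUEUED
}

def current_status(status: str) -> str:
    """Map a possibly-deprecated RunStatus name to its replacement."""
    return STATUS_RENAMES.get(status, status)
```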
New training features#
First-class watchdog (🐕) support#
Hero run users are well familiar with our watchdog script, which autoresumes your run after a system failure via a Python script embedded in your YAML.
🐕 Now, we are launching first-class support for watchdog! 🐕
# Enable watchdog for an existing run
mcli watchdog <run>
# Disable watchdog for an existing run
mcli watchdog <run> --disable
You're still able to configure resumable: True in your yaml if you'd like to launch watchdog at the start.
Also, see autoresume within Composer for a fully managed autoresumption experience from the last checkpoint.
If watchdog was configured for your run, you'll see a 🐕 icon next to your run_name in the mcli get runs display.
NAME USER CREATED_TIME STATUS START_TIME END_TIME CLUSTER INSTANCE NODES
finetune-mpt-7b-bZOcnU 🐕 [email protected] 2023-07-20 05:34 PM Completed 2023-07-20 05:35 PM 2023-07-20 05:47 PM r1z1 8x a100_80gb 1
By default, enabling watchdog will automatically retry your run 10 times. You can configure this default in your yaml by overriding the max_retries scheduling parameter:
scheduling:
resumable: True
max_retries: 5
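The retry behavior described in this section amounts to a simple rule, shown here as a toy predicate (illustrative, not mcli internals):

```python
def should_retry(resumable: bool, retries_so_far: int, max_retries: int) -> bool:
    """A failed run is resumed only while watchdog is on and retries remain."""
    return resumable and retries_so_far < max_retries
```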
(Preview) Interactive runs#
Interactive runs give you the ability to debug and iterate quickly inside your cluster in a secure way. Interactivity works on top of the existing MosaicML runs and adds two new CLI commands:
# Submit a new run entirely for interactive debugging
mcli interactive --hours 1 --gpus 1 --tmux --cluster <cluster-name>
# Connect to an existing run (either launched via mcli run or mcli interactive)
mcli connect --tmux <run-name>
You can find the full docs and details about connecting using VSCode here.
Improved resumption UX#
If your run autoresumes on our platform, you'll see a new view when fetching runs that displays high-level information on the multiple resumptions:
> mcli get runs
NAME USER CREATED_TIME RESUMPTION STATUS START_TIME END_TIME CLUSTER INSTANCE NODES
long-run-GfeqDT [email protected] 2023-06-21 05:36 PM 1 Completed 2023-06-21 05:36 PM 2023-06-21 06:36 PM r1z1 cpu 1
0 Stopped 2023-06-21 05:36 PM 2023-06-21 05:37 PM r1z1 cpu 1
Run resumptions are listed in descending order so you can focus on the latest resumption by default. Resumptions for a single run are also grouped visually for comparison.
We also improved the describe view to easily visualize different resumptions of your run, their run states, and their duration.
There's also a handy Event Log section that details when states changed within your overall run.
> mcli describe run
Run Lifecycle
Resumption 1:
╭──────── Pending ────────╮ ╭──────── Running ────────╮ ╭─────── Completed ───────╮
│ At: 2023-06-21 05:36 PM │ │ At: 2023-06-21 05:36 PM │ │ At: 2023-06-21 06:36 PM │
│ For: 5s                 │ │ For: 1hr                │ │                         │
╰─────────────────────────╯ ╰─────────────────────────╯ ╰─────────────────────────╯
Initial Run:
╭──────── Pending ────────╮ ╭──────── Running ────────╮ ╭──────── Stopped ────────╮
│ At: 2023-06-21 05:36 PM │ │ At: 2023-06-21 05:36 PM │ │ At: 2023-06-21 05:37 PM │
│ For: 7s                 │ │ For: 1min               │ │ For: 0s                 │
╰─────────────────────────╯ ╰─────────────────────────╯ ╰─────────────────────────╯
Number of Resumptions: 2
Total time spent in Pending: 12s
Total time spent in Running: 1hr
Event Log
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Time                ┃ Resumption ┃ Event                                   ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 2023-06-21 05:36 PM │ 0          │ Run created                             │
│ 2023-06-21 05:36 PM │ 0          │ Run started                             │
├─────────────────────┼────────────┼─────────────────────────────────────────┤
│ 2023-06-21 05:36 PM │ 1          │ Run placed back in the scheduling queue │
│ 2023-06-21 05:36 PM │ 1          │ Run resumed                             │
│ 2023-06-21 06:36 PM │ 1          │ Run completed successfully              │
└─────────────────────┴────────────┴─────────────────────────────────────────┘
Update scheduling properties of a run#
The scheduling configurations of a run can be updated. This includes:
priority: Update the default priority of the run from auto to low or lowest
preemptible: Update whether the run can be stopped and re-queued by higher priority jobs; default is False
max_retries: Update the max number of times the run can be retried; default is 0
# Update run scheduling fields
mcli update run example-run --preemptible true --max-retries 10 --priority medium
from mcli import get_run
my_run = get_run("example-run")
updated_run = my_run.update(preemptible=True, max_retries=10, priority='medium')
All run parameters can also be updated when cloning:
# Update any parameters when cloning a run
mcli clone <run-name> --gpus 10 --priority low
This can also be done in the SDK:
from mcli import get_run
my_run = get_run("example-run")
new_run = my_run.clone(gpus=10, priority='low')
New inference features#
Batching support#
Batching config allows the user to specify the max batch size and timeout for inference request processing:
batching:
max_batch_size: 4
max_timeout_ms: 3000
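Conceptually, a batcher with these two knobs collects requests until either the batch is full or the timeout elapses. The sketch below illustrates that policy; it is not MosaicML's server code:

```python
import time
from collections import deque

def collect_batch(queue: deque, max_batch_size: int = 4, max_timeout_ms: int = 3000):
    """Drain up to max_batch_size requests, waiting at most max_timeout_ms."""
    batch = []
    deadline = time.monotonic() + max_timeout_ms / 1000.0
    while len(batch) < max_batch_size and time.monotonic() < deadline:
        if queue:
            batch.append(queue.popleft())
        else:
            time.sleep(0.001)  # wait briefly for more requests to arrive
    return batch
```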
View utilization of inference clusters#
mcli util now shows inference usage:
Inference Instances:
NAME INSTANCE_NAME NODE_INFO GPUS_AVAILABLE GPUS_USED GPUS_TOTAL
r7z14 oci.vm.gpu.a10.2 2xa10 1 1 2
oci.vm.gpu.a10.1 1xa10 0 4 4
Active Inference Deployments:
DEPLOYMENT_NAME USER AGE GPUS
mpt-7b-u0qtof [email protected] 4d 1
mpt-30b-vl5mrp [email protected] 5d 2
mpt-7b-test-9sc4ta [email protected] 21hr 3
Queued Inference Deployments:
No items found.
As shown above, mcli util now also shows GPU instance names along with node info, since some clusters now contain nodes whose GPU instances differ only in the number of GPUs per instance. This makes it easier to target the exact instance you want for a deployment or training run from your yaml.
Customizable compute resource requests#
Compute specifications can now be configured with the compute field in deployment yamls:
compute:
cluster: my-cluster
gpus: 4
instance: oci.vm.gpu.a10.2
Update properties of a deployment#
After you've created an inference deployment, you can easily update a few configurations with:
mcli update deployment <deployment_name> --replicas 2 --image "new_image"
There's also a handy SDK command for updating your deployment:
from mcli import update_inference_deployments
update_inference_deployments(['name-of-deployment'], {'replicas': 2, 'image': 'new_image'})