0.5#
Looking for the latest release notes? See v0.6.x
0.5.34#
Bugfix where experimentTracker was not returned on createFinetune
0.5.33#
Additional finetuning docs
Update finetuning config to use experimentTracker instead of experimentTrackers
Remove --follow from mcli finetune
Remove generic 500s from MAPI retries
Small bug fix on interactive event loops
0.5.32#
Don't strip strings when printing BaseSubmissionConfig
Add new list_finetuning_events function
Fix update_run_metadata to serialize input data or ignore the data if it is not serializable
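As a quick illustration of the update_run_metadata fix, here is a minimal SDK sketch; the signature shown (a run name plus a dictionary of values) is assumed for illustration rather than taken from the release note itself:
from mcli import update_run_metadata

# Assumed signature: update_run_metadata(<run name>, <metadata dict>).
# Values should be JSON-serializable; per the fix above, entries that cannot
# be serialized are ignored instead of failing the whole call.
update_run_metadata("example-run", {"eval_accuracy": 0.92, "notes": "baseline"})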
0.5.31#
Documentation updates
0.5.30#
Finetuning docs updates
Improvements to mcli describe ft display:
Rename Reason column to Details
Hide Details when null
0.5.29#
Use runType filter instead of isInteractive to fetch interactive runs
Report credentials check failures as "Failed"
Documentation updates
Fix stop run name filter bugs
Fix estimated end time on display
0.5.28#
Updated documentation for Huggingface on mcli
mcli describe ft now shows original submitted yaml
0.5.27#
Add mcli stop and delete finetuning runs
Update formatting for mcli describe ft
0.5.26#
Add optional --container flag to mcli logs to filter logs by container name
0.5.25#
Docs updates around managed MLflow.
Move errors that don't get retried to debug mode.
0.5.24#
Add retry and an optional bool to protect from SIGTERMs
Add dependent deployments for eval.
0.5.23#
⚠️ BACKWARDS INCOMPATIBLE CHANGES ⚠️:
The mcli finetune SDK now returns a distinct FineTune object as opposed to a Run object
Corresponding documentation updates
0.5.22#
Add node_name to the MCLI admin support command
0.5.21#
Add gpu_type to response for cluster utilization
0.5.20#
Add support for UC volumes as data input for the Finetuning API
Modifies the cluster utilization type to support MVP version of serverless scheduling for runs
Adds new Run field parent_name to support tracking for runs that spawn child runs
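As a short illustration of the new parent_name field, here is a hedged SDK sketch; the run name and the assumption that parent_name is unset for top-level runs are illustrative, while get_run and parent_name come from these notes:
from mcli import get_run

# Hypothetical child run name for illustration
child = get_run("spawned-run-example")
# parent_name references the run that spawned this one (assumed unset otherwise)
if child.parent_name:
    parent = get_run(child.parent_name)
    print(f"{child.name} was spawned by {parent.name}")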
0.5.19#
Fix S3 secret bug
Error out for unknown finetuning fields
0.5.18#
Sort run metadata keys by value not length in mcli describe run
Fix bug with watchdog retry logic
0.5.17#
⚠️ BACKWARDS INCOMPATIBLE CHANGES ⚠️:
The task_type for instruction finetuning has changed from IFT to INSTRUCTION_FINETUNE
Removed the instruction_finetune function. Please use the finetune function with task_type="INSTRUCTION_FINETUNE"
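A hedged migration sketch for this change follows; the finetune function and task_type value come from the note above, while the remaining parameter names and values (model, train_data_path, save_folder) are illustrative assumptions:
from mcli import finetune

# Previously: instruction_finetune(...) with task_type "IFT" (both now removed)
ft = finetune(
    model="mosaicml/mpt-7b",                       # illustrative value
    train_data_path="s3://my-bucket/train.jsonl",  # illustrative value
    save_folder="s3://my-bucket/checkpoints",      # illustrative value
    task_type="INSTRUCTION_FINETUNE",              # was "IFT" before 0.5.17
)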
0.5.16#
Bugfix for invalid profile error when creating S3 secrets
0.5.15#
You can now use mcli log <run> --tail N to get the latest N log lines from your run
Added support for "Continued Pretraining" to mcli finetune
Added Databricks secret type
0.5.14#
Fixed S3 secrets created without explicit profiles
Run reason autopopulates from latest attempt
Support for HF secrets
0.5.13#
Added Reason column to mcli util displaying the reason for pending or queued runs under the queued training runs section.
0.5.12#
Fix for describe run
Update to finetuning API
0.5.11#
Adds alias mcli util -t for mcli util --training
Fixes bug introduced in mcli describe run from 0.5.10
Fixes bug in ETA in mcli get runs
0.5.10#
Finetuning compute updates
Fix describe run bugs
Add estimated end time to commands
Display cpu information in describe cluster
0.5.9#
Doc updates for finetuning
Added --yaml-only flag for runs
Fixed timestamp bug.
0.5.8#
Added mcli stop deployment as well as SDK support for it
Added code evaluation
Allow mcli-admin to be permissioned to stop and restart other users' runs
Add override_eval_data_path and tests for finetuning
Fix --resumption and --prev flags for mcli logs
0.5.7#
Add cluster and GPU information to mcli connect
Validate tty to accept StringIO stdin for mcli interactive
Finetuning docs updates
Inference no-code docs updates
0.5.6#
New create_default_deployment & get_cluster SDK functions
Increase default predict timeout to 60 seconds
Add pip command for how to upgrade mcli
Show node statuses in mcli describe run view
By default mcli get runs shows only the latest 3 resumptions
Max duration improvements
Improved mcli get deployments view
Add max_batch_size_in_bytes to BatchingConfig
0.5.5#
rateLimit can be specified in submission YAMLs
Visual improvements to get runs and describe runs output
Initial finetuning support
0.5.4#
Small version change for finetuning API
0.5.3#
Improved mcli get deployments with full and --compact mode
Filter mcli util using --training and --inference arguments
Early release of finetuning API (subject to change)
0.5.2#
Add mcli describe cluster support to show a detailed view with cluster information
0.5.1#
Add --max-duration flag for creating and updating runs
Add ability to view the logs of the latest failed replica with mcli get deployment logs --failed
0.5.0#
This page includes information on updates and features related to MosaicML CLI and SDK. The first sections cover general features and deprecation notices, followed by features specific to the training and inference products, respectively.
General#
CLI AutoComplete#
We now support tab autocomplete in bash and zsh shells! To enable, run:
eval "$(register-python-argcomplete mcli)"
Deprecation & breaking changes#
ClusterUtilization object returned from get_clusters(utilization=True) (see the sketch after this list):
active_by_user is now active_runs_by_user
queued_by_user is now queued_runs_by_user
ClusterUtilizationByRun now has name columns instead of run_name
Deprecated two RunStatus values:
FAILED_PULL - This is reported as RunStatus.FAILED with reason FailedImagePull
SCHEDULED - This is synonymous with RunStatus.QUEUED
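A small sketch of reading the renamed utilization fields; get_clusters(utilization=True) and the field names come from the list above, while the utilization attribute path on the returned cluster objects is an assumption for illustration:
from mcli import get_clusters

for cluster in get_clusters(utilization=True):
    util = cluster.utilization            # assumed attribute name
    print(util.active_runs_by_user)       # formerly active_by_user
    print(util.queued_runs_by_user)       # formerly queued_by_user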
New training features#
First-class watchdog (🐕) support#
Hero run users are well familiar with our watchdog script, which autoresumes your run after a system failure using a Python script within a yaml.
🐕 Now, we are launching first-class support for watchdog! 🐕
# Enable watchdog for an existing run
mcli watchdog <run>
# Disable watchdog for an existing run
mcli watchdog <run> --disable
You're still able to configure resumable: True in your yaml if you'd like to launch watchdog at the start.
Also, see autoresume within Composer for a fully managed autoresumption experience from the last checkpoint.
If watchdog was configured for your run, you'll see a 🐕 icon next to your run_name in the mcli get runs display.
NAME USER CREATED_TIME STATUS START_TIME END_TIME CLUSTER INSTANCE NODES
finetune-mpt-7b-bZOcnU 🐕 [email protected] 2023-07-20 05:34 PM Completed 2023-07-20 05:35 PM 2023-07-20 05:47 PM r1z1 8x a100_80gb 1
By default, enabling watchdog will automatically retry your run 10 times. You can configure this default in your yaml by overriding the max_retries scheduling parameter:
scheduling:
resumable: True
max_retries: 5
(Preview) Interactive runs#
Interactive runs give the ability to debug and iterate quickly inside your cluster in a secure way. Interactivity works on top of the existing MosaicML runs and adds two new CLI commands:
# Submit a new run entirely for interactive debugging
mcli interactive --hours 1 --gpus 1 --tmux --cluster <cluster-name>
# Connect to an existing run (either launched via mcli run or mcli interactive)
mcli connect --tmux <run-name>
You can find the full docs and details about connecting using VSCode here.
Improved resumption UX#
If your run autoresumes on our platform, youโll see a new view when fetching runs that displays high-level information on the multiple resumptions:
> mcli get runs
NAME USER CREATED_TIME RESUMPTION STATUS START_TIME END_TIME CLUSTER INSTANCE NODES
long-run-GfeqDT [email protected] 2023-06-21 05:36 PM 1 Completed 2023-06-21 05:36 PM 2023-06-21 06:36 PM r1z1 cpu 1
0 Stopped 2023-06-21 05:36 PM 2023-06-21 05:37 PM r1z1 cpu 1
Run resumptions are listed in descending order so you can focus on the latest resumption by default. Resumptions for a single run are also grouped visually for comparison.
We also improved the describe view to easily visualize different resumptions of your run, their run states, and their duration.
There's also a handy Event Log section that details when states changed within your overall run.
> mcli describe run
Run Lifecycle
Resumption 1:
╭──────── Pending ─────────╮ ╭──────── Running ─────────╮ ╭─────── Completed ────────╮
│ At: 2023-06-21 05:36 PM  │ │ At: 2023-06-21 05:36 PM  │ │ At: 2023-06-21 06:36 PM  │
│ For: 5s                  │ │ For: 1hr                 │ │                          │
╰──────────────────────────╯ ╰──────────────────────────╯ ╰──────────────────────────╯
Initial Run:
╭──────── Pending ─────────╮ ╭──────── Running ─────────╮ ╭──────── Stopped ─────────╮
│ At: 2023-06-21 05:36 PM  │ │ At: 2023-06-21 05:36 PM  │ │ At: 2023-06-21 05:37 PM  │
│ For: 7s                  │ │ For: 1min                │ │ For: 0s                  │
╰──────────────────────────╯ ╰──────────────────────────╯ ╰──────────────────────────╯
Number of Resumptions: 2
Total time spent in Pending: 12s
Total time spent in Running: 1hr
Event Log
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Time                ┃ Resumption ┃ Event                                    ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 2023-06-21 05:36 PM │ 0          │ Run created                              │
│ 2023-06-21 05:36 PM │ 0          │ Run started                              │
├─────────────────────┼────────────┼──────────────────────────────────────────┤
│ 2023-06-21 05:36 PM │ 1          │ Run placed back in the scheduling queue  │
│ 2023-06-21 05:36 PM │ 1          │ Run resumed                              │
│ 2023-06-21 06:36 PM │ 1          │ Run completed successfully               │
└─────────────────────┴────────────┴──────────────────────────────────────────┘
Update scheduling properties of a run#
The scheduling configurations of a run can be updated. This includes:
priority: Update the default priority of the run from auto to low or lowest
preemptible: Update whether the run can be stopped and re-queued by higher priority jobs; default is False
max_retries: Update the max number of times the run can be retried; default is 0
# Update run scheduling fields
mcli update run example-run --preemptible true --max-retries 10 --priority medium
from mcli import get_run
my_run = get_run("example-run")
updated_run = my_run.update(preemptible=True, max_retries=10, priority='medium')
All run parameters can also be updated when cloning:
# Update any parameters when cloning a run
mcli clone <run-name> --gpus 10 --priority low
This can also be done in the SDK:
from mcli import get_run
my_run = get_run("example-run")
new_run = my_run.clone(gpus=10, priority='low')
New inference features#
Batching support#
Batching config allows the user to specify the max batch size and timeout for inference request processing:
batching:
max_batch_size: 4
max_timeout_ms: 3000
View utilization of inference clusters#
mcli util
now shows inference usage:
Inference Instances:
NAME INSTANCE_NAME NODE_INFO GPUS_AVAILABLE GPUS_USED GPUS_TOTAL
r7z14 oci.vm.gpu.a10.2 2xa10 1 1 2
oci.vm.gpu.a10.1 1xa10 0 4 4
Active Inference Deployments:
DEPLOYMENT_NAME USER AGE GPUS
mpt-7b-u0qtof [email protected] 4d 1
mpt-30b-vl5mrp [email protected] 5d 2
mpt-7b-test-9sc4ta [email protected] 21hr 3
Queued Inference Deployments:
No items found.
As shown above, mcli util now also shows GPU instance names along with node info. This is because there are now nodes in our clusters whose GPU instances differ only in the number of GPUs per instance, so the instance name makes it easier to target the specific instance you want for your deployment or training run from your yaml.
Customizable compute resource requests#
Compute specifications can now be configured with the compute field in deployment yamls:
compute:
cluster: my-cluster
gpus: 4
instance: oci.vm.gpu.a10.2
Update properties of a deployment#
After you've created an inference deployment, you can easily update a few configurations with:
mcli update deployment <deployment_name> --replicas 2 --image "new_image"
There's also a handy SDK command for updating your deployment:
from mcli import update_inference_deployments
update_inference_deployments(['name-of-deployment'], {'replicas': 2, 'image': 'new_image'})