Python SDK#

Setup#

`create_secret`	Create a secret in the MosaicML platform
`delete_secrets`	Deletes secrets from the MosaicML platform
`initialize`	Initialize the MosaicML platform
`get_cluster`	Gets a cluster available in the MosaicML platform
`get_clusters`	Get clusters available in the MosaicML platform
`set_api_key`	Set the api key for the MosaicML platform
`MAPIException`	Exceptions raised when a request to MAPI fails
`MCLIConfig`	Global Config Store persisted on local disk

mcli.create_secret(secret, *, timeout=10, future=False)[source]#

Create a secret in the MosaicML platform

Parameters

secret (Secret) – A Secret object to create
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to create_secret() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Secret output, use return_value.result() with an optional timeout argument.

Raises

MAPIException – Raised when a MAPI communication error occurs
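The timeout and future arguments recur on nearly every call in this reference. The sketch below illustrates the two modes using only the standard library's concurrent.futures; `slow_request` and `submit` are hypothetical stand-ins for an SDK call made with future=True, not part of mcli.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeoutError


def slow_request(delay: float) -> str:
    """Hypothetical stand-in for a request that takes `delay` seconds."""
    time.sleep(delay)
    return "done"


def submit(delay: float, executor: ThreadPoolExecutor) -> Future:
    """Mimic future=True: return at once, let the work finish in the background."""
    return executor.submit(slow_request, delay)


with ThreadPoolExecutor(max_workers=2) as pool:
    fast = submit(0.01, pool)
    slow = submit(0.5, pool)

    # result() blocks until the value is ready or the optional timeout expires.
    fast_result = fast.result(timeout=5)

    try:
        slow.result(timeout=0.05)
        timed_out = False
    except FutureTimeoutError:
        # The request keeps running; only the wait was abandoned.
        timed_out = True
```

Once the `with` block exits, both background requests have finished, so a later `slow.result()` returns immediately.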

mcli.delete_secrets(secrets=None, *, timeout=10, future=False)[source]#

Deletes secrets from the MosaicML platform

Parameters

secrets (Secret) – List of Secret objects or secret name strings to delete.
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to delete_secrets() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Secret output, use return_value.result() with an optional timeout argument.

Raises

MAPIException – Raised when a MAPI communication error occurs

mcli.initialize(api_key=None)[source]#

Initialize the MosaicML platform

Parameters: api_key – Optional value to set

mcli.get_cluster(cluster, *, include_utilization=True, include_all=False, timeout=10, future=False)[source]#

Gets a cluster available in the MosaicML platform

Parameters

cluster (ClusterDetails) – ClusterDetails object or cluster name string to get.
include_utilization (bool) – Include information on how the cluster is currently being utilized
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to get_cluster() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the ClusterDetails output, use return_value.result() with an optional timeout argument.

Raises

MAPIException – Raised when a MAPI communication error occurs

mcli.get_clusters(clusters=None, *, include_utilization=False, include_all=False, timeout=10, future=False, submission_type_filter=None)[source]#

Get clusters available in the MosaicML platform

Parameters

clusters (ClusterDetails) – List of ClusterDetails objects or cluster name strings to get.
include_utilization (bool) – Include information on how the cluster is currently being utilized
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to get_clusters() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the ClusterDetails output, use return_value.result() with an optional timeout argument.

Raises

MAPIException – Raised when a MAPI communication error occurs

mcli.set_api_key(api_key)[source]#

Set the api key for the MosaicML platform

Parameters: api_key – value to set

class mcli.ObjectList(data, obj_type)[source]#: Common helper for list of objects

class mcli.MAPIException(status, message='Unknown Error', description=None)[source]#

Exceptions raised when a request to MAPI fails

Parameters

status – The status code for the exception
message – A brief description of the error
description – An optional longer description of the error

Details: MAPI responds to failures with the following status codes:

- 400: The request was misconfigured or missing an argument. Double-check the API and try again.
- 401: User credentials were either missing or invalid. Be sure to set your API key before making a request.
- 403: User credentials were valid, but the requested action is not allowed.
- 404: Could not find the requested resource(s).
- 409: Attempted to create an object with a name that already exists. Change the name and try again.
- 500: Internal error in MAPI. Please report the issue.
- 503: MAPI or a subcomponent is currently offline. Please report the issue.
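As a rough illustration of how the status codes above might drive error handling, the sketch below maps each code to a short hint and a retry decision. The hints paraphrase the list above; treating 500 and 503 as retryable is an assumption of this sketch, not documented SDK behavior, and `triage` is a hypothetical helper.

```python
from typing import Tuple

# Hints paraphrase the documented status codes.
STATUS_HINTS = {
    400: "Request was misconfigured or missing an argument",
    401: "Credentials missing or invalid; set your API key",
    403: "Credentials valid, but the action is not allowed",
    404: "Requested resource(s) not found",
    409: "An object with that name already exists",
    500: "Internal error in MAPI; report the issue",
    503: "MAPI or a subcomponent is offline; report the issue",
}

# Assumption: only server-side failures are worth retrying.
RETRYABLE = frozenset({500, 503})


def triage(status: int) -> Tuple[str, bool]:
    """Return (hint, should_retry) for a status code."""
    hint = STATUS_HINTS.get(status, "Unknown Error")
    return hint, status in RETRYABLE
```

Inside an `except MAPIException as e:` handler, this could be called as `triage(e.status)`, assuming the exception exposes its `status` constructor argument as an attribute.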

class mcli.MCLIConfig(MOSAICML_API_KEY='', feature_flags=<factory>, last_update_check=<factory>, mcloud_envs=<factory>, _user_id=None, _organization_id=None)[source]#: Global Config Store persisted on local disk

Runs#

`create_run`	Launch a run in the MosaicML platform
`create_interactive_run`	Launch an interactive run in the MosaicML platform
`delete_run`	Delete a run in the MosaicML platform
`delete_runs`	Delete a list of runs in the MosaicML platform
`follow_run_logs`	Follow the logs for an active or completed run in the MosaicML platform
`get_run_logs`	Get the current logs for an active or completed run
`get_run`	Get a run that has been launched in the MosaicML platform
`get_runs`	List runs that have been launched in the MosaicML platform
`start_run`	Start a run
`start_runs`	Start a list of runs
`stop_run`	Stop a run
`stop_runs`	Stop a list of runs
`update_run_metadata`	Update a run's metadata in the MosaicML platform.
`update_run`	Update a run's data in the MosaicML platform.
`wait_for_run_status`	Wait for a launched run to reach a specific status
`watch_run_status`	Watch a launched run and retrieve a new Run object every time its status updates
`Run`	A run that has been launched on the MosaicML platform
`RunConfig`	A run configuration for the MosaicML platform
`RunStatus`	Possible statuses of a run
`ComputeConfig`	Typed dictionary for nested compute requests
`SchedulingConfig`	Typed dictionary for nested scheduling configurations

mcli.create_run(run, *, timeout=10, future=False)[source]#

Launch a run in the MosaicML platform

The provided run must contain enough information to fully detail the run

Parameters

run – A fully-configured run to launch. The run will be queued and persisted in the run database.
timeout – Time, in seconds, in which the call should complete. If the run creation takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future – Return the output as a Future. If True, the call to create_run will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.

Returns

A Run that includes the launched run details and the run status
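A minimal sketch of launching a run, assuming create_run accepts a RunConfig built from the keyword arguments in the RunConfig signature. The helper names `build_run_kwargs` and `launch`, the run name, the command, and the cluster name `my-cluster` are all hypothetical placeholders. The SDK import is deferred so the helper that builds the arguments can be exercised without mcli installed.

```python
from typing import Any, Dict


def build_run_kwargs(name: str, image: str, command: str, cluster: str) -> Dict[str, Any]:
    """Collect the required RunConfig arguments into one dictionary."""
    return {
        "name": name,
        "image": image,
        "command": command,
        "compute": {"cluster": cluster},
    }


def launch(kwargs: Dict[str, Any], timeout: float = 10):
    """Build a RunConfig and submit it (requires the mcli package)."""
    from mcli import RunConfig, create_run  # deferred import

    config = RunConfig(**kwargs)
    return create_run(config, timeout=timeout)


kwargs = build_run_kwargs(
    name="hello-world",
    image="mosaicml/composer",
    command="echo hello",
    cluster="my-cluster",  # placeholder cluster name
)
```

Calling `launch(kwargs)` would then return the launched Run, or raise if the call does not complete within the timeout.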

mcli.create_interactive_run(run, *, timeout=10, seconds=None, future=False)[source]#

Launch an interactive run in the MosaicML platform

Users are not required to provide a name, image, or ‘hours’ variable for an interactive run. If these variables are not provided, they will be filled in with defaults. If the user provides a value for the ‘command’ variable, this will be overwritten with sleep <hours>, where <hours> is the value of the ‘hours’ variable.

Parameters

run – A fully-configured run to launch. The run will be queued and persisted in the run database.
timeout – Time, in seconds, in which the call should complete. If the run creation takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future – Return the output as a Future. If True, the call to create_interactive_run will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.
hours – How many hours an interactive run can sleep for until MORC marks it as completed.

Returns

A Run that includes the launched run details and the run status

mcli.delete_run(run, *, timeout=10, future=False)[source]#

Delete a run in the MosaicML platform

If a run is currently running, it will first be stopped.

Parameters

run – A run to delete
timeout – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future – Return the output as a Future. If True, the call to delete_run will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.

Returns

A Run for the run that was deleted

mcli.delete_runs(runs, *, timeout=10, future=False)[source]#

Delete a list of runs in the MosaicML platform

Any runs that are currently running will first be stopped.

Parameters

runs – A list of runs or run names to delete
timeout – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future – Return the output as a Future. If True, the call to delete_runs will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the list of Run output, use return_value.result() with an optional timeout argument.

Returns

A list of Run objects for the runs that were deleted

mcli.follow_run_logs(run, rank=None, *, node_rank=None, local_gpu_rank=None, global_gpu_rank=None, timeout=None, future=False, resumption=None, tail=None, container=None)[source]#

Follow the logs for an active or completed run in the MosaicML platform

This returns a generator of individual log lines, line-by-line, and will wait until new lines are produced if the run is still active.

Parameters

run (str | Run) – The run to get logs for. If a name is provided, the remaining required run details will be queried with get_runs().
rank (Optional[int]) – Node rank of a run to get logs for. Defaults to the lowest available rank. This will usually be rank 0 unless something has gone wrong.
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored. A run may take some time to generate logs, so you likely do not want to set a timeout.
future (bool) – Return the output as a Future. If True, the call to follow_run_logs() will return immediately and the request will be processed in the background. The generator returned by the Future will yield a Future for each new log string returned from the cloud. This takes precedence over the timeout argument. To get the generator, use return_value.result() with an optional timeout argument and log_future.result() for each new log string.
resumption (Optional[int]) – Resumption (0-indexed) of a run to get logs for. Defaults to the last resumption
tail (Optional[int]) – Number of chars to read from the end of the log. Defaults to reading the entire log.
container (Optional[str]) – Container name of a run to get logs for. Defaults to the MAIN container.

Returns

If future is False – A line-by-line Generator of the logs for a run
Otherwise – A Future of a line-by-line generator of the logs for a run
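Because follow_run_logs returns a generator of log lines when future is False, it can be consumed like any other iterable. The sketch below scans a stream of lines for the first one containing a substring and stops early; `first_line_containing` and its `max_lines` guard are hypothetical helpers, not part of the SDK.

```python
from typing import Iterable, Optional


def first_line_containing(
    lines: Iterable[str], needle: str, max_lines: int = 10_000
) -> Optional[str]:
    """Return the first line containing `needle`, or None if it never appears.

    Stops after `max_lines` lines so a long-lived stream is not read forever.
    """
    for count, line in enumerate(lines, start=1):
        if needle in line:
            return line.rstrip("\n")
        if count >= max_lines:
            break
    return None
```

With the SDK installed, this could be used as `first_line_containing(follow_run_logs("my-run"), "loss")`, where `my-run` is a placeholder run name.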

mcli.get_run_logs(run, rank=None, *, node_rank=None, local_gpu_rank=None, global_gpu_rank=None, timeout=None, future=False, failed=False, resumption=None, tail=None, container=None)[source]#

Get the current logs for an active or completed run

Get the current logs for an active or completed run in the MosaicML platform. This returns the full logs as a str, as they exist at the time the request is made. If you want to follow the logs for an active run line-by-line, use follow_run_logs().

Parameters

run (str | Run) – The run to get logs for. If a name is provided, the remaining required run details will be queried with get_runs().
rank (Optional[int]) – [DEPRECATED, Use node_rank instead] Node rank of a run to get logs for. Defaults to the lowest available rank. This will usually be rank 0 unless something has gone wrong.
node_rank (Optional[int]) – Specifies the node rank within a multi-node run to fetch logs for. Defaults to lowest available rank. Indexing starts from 0.
local_gpu_rank (Optional[int]) – Specifies the GPU rank on the specified node to fetch logs for. Cannot be used with global_gpu_rank. Indexing starts from 0. Note: GPU rank logs are only available for runs using Composer and/or LLM Foundry and MAIN container logs.
global_gpu_rank (Optional[int]) – Specifies the global GPU rank to fetch logs for. Cannot be used with node_rank and local_gpu_rank. Indexing starts from 0. Note: GPU rank logs are only available for runs using Composer and/or LLM Foundry and MAIN container logs.
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to get_run_logs() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the log text, use return_value.result() with an optional timeout argument.
failed (bool) – Return the logs of the first failed rank for the provided resumption if True. False by default.
resumption (Optional[int]) – Resumption (0-indexed) of a run to get logs for. Defaults to the last resumption
tail (Optional[int]) – Number of chars to read from the end of the log. Defaults to reading the entire log.
container (Optional[str]) – Container name of a run to get logs for. Defaults to the MAIN container.

Returns

If future is False – The full log text for a run at the time of the request as a str
Otherwise – A Future for the log text

mcli.get_run(run, *, timeout=10, future=False, include_details=True)[source]#

Get a run that has been launched in the MosaicML platform

The run will contain all details requested

Parameters

run – Run on which to get information
timeout – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future – Return the output as a Future. If True, the call to get_run will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.
include_details – If true, will fetch detailed information like run input for the run.

Raises

MAPIException – Raised when a MAPI communication error occurs

mcli.get_runs(runs=None, *, cluster_names=None, before=None, after=None, gpu_types=None, gpu_nums=None, statuses=None, timeout=10, future=False, user_emails=None, run_types=None, include_details=False, include_deleted=False, ended_before=None, ended_after=None, limit=100)[source]#

List runs that have been launched in the MosaicML platform

The returned list will contain all of the details stored about the requested runs.

Parameters

runs – List of runs on which to get information
cluster_names – List of cluster names to filter runs. This can be a list of str or Cluster objects. Only runs submitted to these clusters will be returned.
before – Only runs created strictly before this time will be returned. This can be a str in ISO 8601 format (e.g. 2023-03-31T12:23:04.34+05:30) or a datetime object.
after – Only runs created at or after this time will be returned. This can be a str in ISO 8601 format (e.g. 2023-03-31T12:23:04.34+05:30) or a datetime object.
gpu_types – List of GPU types to filter runs. This can be a list of str or GPUType enums. Only runs scheduled on these GPUs will be returned.
gpu_nums – List of GPU counts to filter runs. Only runs scheduled on this number of GPUs will be returned.
statuses – List of run statuses to filter runs. This can be a list of str or RunStatus enums. Only runs currently in these phases will be returned.
user_emails – List of user emails to filter runs. Only runs submitted by these users will be returned. By default, will return runs submitted by the current user. Requires shared runs or admin permission.
run_types – List of run types to filter runs. Valid values: ‘INTERACTIVE’ (runs created with the mcli interactive command), ‘HERO_RUN’ (runs created with is_hero_run in the metadata), and ‘TRAINING’ (all other runs).
timeout – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future – Return the output as a Future. If True, the call to get_runs will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the list of runs, use return_value.result() with an optional timeout argument.
include_details – If true, will fetch detailed information like run input for each run.
include_deleted – If true, will include deleted runs in the response.
ended_before – Only runs ended strictly before this time will be returned.
ended_after – Only runs ended at or after this time will be returned.
limit – Maximum number of runs to return. If None, the latest 100 runs will be returned.

Raises

MAPIException – Raised when a MAPI communication error occurs
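The time filters accept either a datetime object or an ISO 8601 string. The sketch below builds a set of keyword arguments for "runs created in the last N hours that are still running"; `recent_runs_filter` and `iso_bound` are hypothetical helpers, and passing the status as the string "RUNNING" assumes that name is a valid RunStatus value.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Optional


def recent_runs_filter(
    hours: float, now: Optional[datetime] = None, limit: int = 100
) -> Dict[str, Any]:
    """Keyword arguments selecting running runs created in the last `hours` hours."""
    now = now or datetime.now(timezone.utc)
    return {
        "after": now - timedelta(hours=hours),  # created at or after this time
        "statuses": ["RUNNING"],
        "limit": limit,
    }


def iso_bound(moment: datetime) -> str:
    """Render a timezone-aware datetime as an ISO 8601 string."""
    return moment.isoformat()
```

With the SDK installed, the result could be unpacked into the call, for example `get_runs(**recent_runs_filter(24))`.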

mcli.start_run(run, *, timeout=10, future=False)[source]#

Start a run

Start a run currently stopped in the MosaicML platform.

Parameters

run (Optional[str | Run]) – A run or run name to start
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to start_run() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.

Raises

MAPIException – Raised if starting the requested run failed. A successfully started run will have the status `RunStatus.PENDING`.

Returns

If future is False – Started Run object
Otherwise – A Future for the object

mcli.start_runs(runs, *, timeout=10, future=False)[source]#

Start a list of runs

Start a list of runs currently stopped in the MosaicML platform.

Parameters

runs (Optional[List[str] | List[Run]]) – A list of runs or run names to start
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to start_runs() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the list of Run output, use return_value.result() with an optional timeout argument.

Raises

MAPIException – Raised if starting any of the requested runs failed. All successfully started runs will have the status `RunStatus.PENDING`. You can freely retry any started and unstarted runs if this error is raised due to a connection issue.

Returns

If future is False – A list of started Run objects
Otherwise – A Future for the list

mcli.stop_run(run, *, reason=None, timeout=10, future=False)[source]#

Stop a run

Stop a run currently running in the MosaicML platform.

Parameters

run (Optional[str | Run]) – A run or run name to stop. Using Run objects is most efficient. See the note below.
reason (Optional[str]) – A reason for stopping the run
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to stop_run() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.

Raises

MAPIException – Raised if stopping the requested run failed. A successfully stopped run will have the status `RunStatus.STOPPED`.

Returns

If future is False – Stopped Run object
Otherwise – A Future for the object

mcli.stop_runs(runs, *, reason=None, timeout=10, future=False)[source]#

Stop a list of runs

Stop a list of runs currently running in the MosaicML platform.

Parameters

runs (Optional[List[str] | List[Run]]) – A list of runs or run names to stop. Using Run objects is most efficient. See the note below.
reason (Optional[str]) – A reason for stopping the run
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to stop_runs() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the list of Run output, use return_value.result() with an optional timeout argument.

Raises

MAPIException – Raised if stopping any of the requested runs failed. All successfully stopped runs will have the status `RunStatus.STOPPED`. You can freely retry any stopped and unstopped runs if this error is raised due to a connection issue.

Returns

If future is False – A list of stopped Run objects
Otherwise – A Future for the list
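Since the start and stop calls are described as safe to retry after a connection issue, a small retry wrapper is one way to act on that. The sketch below is generic: `call_with_retries`, `FakeConnectionError`, and `flaky_stop` are hypothetical names used for illustration, and the exception type to retry on is passed in by the caller.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    delay: float = 0.0,
) -> T:
    """Call `fn`, retrying up to `attempts` times when it raises `retry_on`."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay)


class FakeConnectionError(Exception):
    """Stand-in for a connection failure; not an SDK exception."""


calls = {"n": 0}


def flaky_stop():
    """Fails twice, then succeeds, to exercise the retry loop."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise FakeConnectionError("connection dropped")
    return ["run-a", "run-b"]


result = call_with_retries(flaky_stop, (FakeConnectionError,), attempts=5)
```

With the SDK installed, the same wrapper could be applied as `call_with_retries(lambda: stop_runs(runs), (MAPIException,))`. Note that this retries on every MAPIException, including ones that will never succeed (such as a 404), so a real handler may want to inspect the status first.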

mcli.update_run(run, update_run_data=None, *, preemptible=None, priority=None, max_retries=None, retry_on_system_failure=None, timeout=10, future=False, max_duration=None)[source]#

Update a run’s data in the MosaicML platform.

Any values that are not specified will not be modified.

Parameters

run (Optional[str | Run]) – A run or run name to update. Using Run objects is most efficient. See the note below.
update_run_data (Dict[str, Any]) – DEPRECATED: Use the individual named-arguments instead. The data to update the run with. This can include preemptible, priority, maxRetries, and retryOnSystemFailure
preemptible (bool) – Update whether the run can be stopped and re-queued by higher priority jobs; default is False
priority (str) – Update the default priority of the run from auto to low or lowest
max_retries (int) – Update the max number of times the run can be retried; default is 0
retry_on_system_failure (bool) – Update whether the run should be retried on system failure (i.e. a node failure); default is False
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
max_duration – Update the max time that a run can run for (in hours).
future (bool) – Return the output as a Future. If True, the call to update_run() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.

Raises

MAPIException – Raised if updating the requested run failed

Returns

If future is False – Updated Run object
Otherwise – A Future for the object

mcli.update_run_metadata(run, metadata, *, timeout=10, future=False, protect=False)[source]#

Update a run’s metadata in the MosaicML platform.

Parameters

run (Optional[str | Run]) – A run or run name to update. Using Run objects is most efficient. See the note below.
metadata (Dict[str, Any]) – The metadata to update the run with. This will be merged with the existing metadata. Keys not specified in this dictionary will not be modified.
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to update_run_metadata() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.
protect (bool) – If True, the call will be protected from SIGTERMs to allow it to complete reliably. Defaults to False.

Raises

MAPIException – Raised if updating the requested run failed

Returns

If future is False – Updated Run object
Otherwise – A Future for the object

mcli.wait_for_run_status(run, status, timeout=None, future=False)[source]#

Wait for a launched run to reach a specific status

Parameters

run (str | Run) – The run whose status should be watched. This can be provided using the run’s name or an existing Run object.
status (str | RunStatus) – Status to wait for. This can be any valid RunStatus value. If the status is short-lived, or the run terminates, it is possible the run will reach a LATER status than the one requested. If the run never reaches this state (e.g. it stops early or the wait times out), then an error will be raised. See exception details below.
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to wait_for_run_status() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.

Raises

MAPIException – Raised if the run does not exist or there is an issue connecting to the MAPI service.
RunStatusNotReached – Raised in the event that the watch closes before the run reaches the desired status. If this happens, the connection to MAPI may have dropped, so try again.
TimeoutError – Raised if the run did not reach the correct status in the specified time

Returns

If future is False – A Run object once it has reached the requested status
Otherwise – A Future for the run. This will not resolve until the run reaches the requested status

mcli.watch_run_status(run, timeout=None, future=False)[source]#

Watch a launched run and retrieve a new Run object every time its status updates

Parameters

run (str | Run) – The run whose status should be watched. This can be provided using the run’s name or an existing Run object.
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored. A run may take some time to change statuses (especially to go from RUNNING to COMPLETED), so you likely do not want to set a timeout.
future (bool) – Return the output as a Future. If True, each iteration will yield a Future for the next updated Run object. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument. With futures, you can easily watch multiple Runs in parallel. NOTE: If you set future==True, you should wrap your return_value.result() in a try: ... except StopAsyncIteration to catch the end of the iteration.

Raises

MAPIException – Raised if the run could not be found or if there is an issue contacting the MAPI service
TimeoutError – Raised if the run did not reach the correct status in the specified time

Yields

If future is False – A Run object at each status update
Otherwise – A Future for the run. This will not resolve until the run reaches a new status

class mcli.Run(run_uid, name, status, created_at, updated_at, created_by, priority, preemptible, retry_on_system_failure, cluster, gpus, gpu_type, cpus, node_count, latest_resumption, is_deleted, run_type, max_retries=None, reason=None, nodes=<factory>, submitted_config=None, metadata=None, last_resumption_id=None, resumptions=<factory>, events=<factory>, lifecycle=<factory>, image=None, max_duration=None, _required_properties=('id', 'name', 'status', 'createdAt', 'updatedAt', 'reason', 'createdByEmail', 'priority', 'preemptible', 'retryOnSystemFailure', 'resumptions', 'isDeleted', 'runType'))[source]#

A run that has been launched on the MosaicML platform

Parameters

run_uid (str) – Unique identifier for the run
name (str) – User-defined name of the run
status (RunStatus) – Status of the run at a moment in time
created_at (datetime) – Date and time when the run was created
updated_at (datetime) – Date and time when the run was last updated
created_by (str) – Email of the user who created the run
priority (str) – Priority of the run; defaults to auto but can be updated to low or lowest
preemptible (bool) – Whether the run can be stopped and re-queued by higher priority jobs
retry_on_system_failure (bool) – Whether the run should be retried on system failure
cluster (str) – Cluster the run is running on
gpus (int) – Number of GPUs the run is using
gpu_type (str) – Type of GPU the run is using
cpus (int) – Number of CPUs the run is using
node_count (int) – Number of nodes the run is using
latest_resumption (Resumption) – Latest resumption of the run
max_retries (Optional[int]) – Maximum number of times the run can be retried
reason (Optional[str]) – Reason the run was stopped
nodes (List[Node]) – Nodes the run is using
submitted_config (Optional[RunConfig]) – Submitted run configuration
metadata (Optional[Dict[str, Any]]) – Metadata associated with the run
last_resumption_id (Optional[str]) – ID of the last resumption of the run
resumptions (List[Resumption]) – Resumptions of the run
lifecycle (List[RunLifecycle]) – Lifecycle of the run
image (Optional[str]) – Image the run is using

clone(name=None, image=None, cluster=None, instance=None, nodes=None, gpu_type=None, gpus=None, priority=None, preemptible=None, max_retries=None, max_duration=None)[source]#

Submits a new run with the same configuration as this run

Parameters

name (str) – Override the name of the run
image (str) – Override the image of the run
cluster (str) – Override the cluster of the run
instance (str) – Override the instance of the run
nodes (int) – Override the number of nodes of the run
gpu_type (str) – Override the GPU type of the run
gpus (int) – Override the number of GPUs of the run
priority (str) – Override the default priority of the run from auto to low or lowest
preemptible (bool) – Override whether the run can be stopped and re-queued by higher priority jobs
max_retries (int) – Override the max number of times the run can be retried
max_duration (float) – Override the max duration (in hours) that a run can run for

Returns

New Run object
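As a small illustration of clone, the sketch below resubmits a run with a different GPU count and a derived name. `clone_with_gpus` is a hypothetical helper, and `FakeRun` is a stand-in that mimics only the `name` and `gpus` overrides so the example runs without the SDK.

```python
def clone_with_gpus(run, gpus: int):
    """Clone `run`, overriding the GPU count and deriving a new name."""
    return run.clone(name=f"{run.name}-{gpus}gpu", gpus=gpus)


class FakeRun:
    """Stand-in for a Run; supports only the overrides used above."""

    def __init__(self, name: str, gpus: int):
        self.name = name
        self.gpus = gpus

    def clone(self, name=None, gpus=None):
        # Arguments left as None keep the original run's values.
        return FakeRun(
            name if name is not None else self.name,
            gpus if gpus is not None else self.gpus,
        )


original = FakeRun("sweep", gpus=8)
bigger = clone_with_gpus(original, 16)
```

The original object is left unchanged; clone returns a new object for the newly submitted run.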

refresh()[source]#

Refreshes the data on the run object

Returns: Refreshed Run object

stop()[source]#

Stops the run

Returns: Stopped Run object

delete()[source]#

Deletes the run

Returns: Deleted Run object

update(preemptible=None, priority=None, max_retries=None, retry_on_system_failure=None, max_duration=None)[source]#

Updates the run’s data

Parameters

preemptible (bool) – Update whether the run can be stopped and re-queued by higher priority jobs; default is False
priority (str) – Update the default priority of the run from auto to low or lowest
max_retries (int) – Update the max number of times the run can be retried; default is 0
retry_on_system_failure (bool) – Update whether the run should be retried on system failure (i.e. a node failure); default is False
max_duration (float) – Update the max duration (in hours) that a run can run for

Returns

Updated Run object

update_metadata(metadata)[source]#

Updates the run’s metadata

Parameters: metadata (Dict[str, Any]) – The metadata to update the run with. This will be merged with the existing metadata. Keys not specified in this dictionary will not be modified.
Returns: Updated Run object
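The merge behavior described above can be pictured with plain dictionaries. This is a minimal sketch that assumes a shallow, top-level merge; the source does not say how nested values are combined, and the keys shown are made up.

```python
# Metadata already attached to the run (made-up values)
existing = {"experiment": "baseline", "seed": 17}

# Metadata passed to update_metadata()
update = {"seed": 42, "owner": "alice"}

# Assumed shallow merge: keys in `update` win, all other keys are kept
merged = {**existing, **update}
print(merged)
```

Here `experiment` is left unmodified because it does not appear in the update, while `seed` is overwritten and `owner` is added.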

class mcli.RunConfig(name=None, parent_name=None, image=None, gpu_type=None, gpu_num=None, cpus=None, cluster=None, scheduling=<factory>, compute=<factory>, parameters=<factory>, integrations=<factory>, env_variables=<factory>, metadata=<factory>, command='', dependent_deployment=<factory>, _suppress_deprecation_warnings=False)[source]#

A run configuration for the MosaicML platform

Values in this configuration are not yet validated, and some required values may be missing. Attempting to create a run from a bad config raises a MAPIException with a 400 status code.

Required args:

name (str): User-defined name of the run
image (str): Docker image (e.g. mosaicml/composer)
command (str): Command to use when a run starts
compute (ComputeConfig or Dict[str, Any]): Compute configuration. Typically a subset of the following fields will be required:
- cluster (str): Name of cluster to use
- instance (str): Name of instance to use
- gpu_type (str): Name of gpu type to use
- gpus (int): Number of GPUs to use
- cpus (int): Number of CPUs to use
- nodes (int): Number of nodes to use
See mcli get clusters for a list of available clusters and instances

Optional args:

parameters (Dict[str, Any]): Parameters to mount into the environment
scheduling (SchedulingConfig or Dict[str, Any]): Scheduling configuration
- priority (str): Priority of the run (default auto, with options low and lowest)
- preemptible (bool): Whether the run is preemptible (default False)
- retry_on_system_failure (bool): Whether the run should be retried on system failure (default False)
- max_retries (int): Maximum number of retries (default 0)
- max_duration (float): Maximum duration of the run in hours (default None). The run will be automatically stopped after this duration has elapsed.
integrations (List[Dict[str, Any]]): List of integrations. See the integration documentation for more details: https://docs.mosaicml.com/projects/mcli/en/latest/resources/integrations/index.html
env_variables (Dict[str, str]): Dictionary of environment variables to set in the run
- key (str): Name of the environment variable
- value (str): Value of the environment variable
metadata (Dict[str, Any]): Arbitrary metadata to attach to the run
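Because values are not validated until the run is created, a quick local check for the required arguments can catch obvious omissions before a request is sent. The sketch below builds a configuration as a plain dictionary whose keys follow the argument names listed above. `REQUIRED_FIELDS`, `missing_required`, and the cluster name are hypothetical and not part of the SDK.

```python
from typing import Any, Dict, List

# Required args as listed in the documentation above
REQUIRED_FIELDS = ("name", "image", "command", "compute")


def missing_required(config: Dict[str, Any]) -> List[str]:
    """Return the required fields that are absent or empty (hypothetical helper)."""
    # An empty string counts as missing, since command defaults to ''
    return [field for field in REQUIRED_FIELDS if not config.get(field)]


config = {
    "name": "hello-composer",
    "image": "mosaicml/composer",
    "command": "echo 'hello'",
    # "my-cluster" is a placeholder; use a name from `mcli get clusters`
    "compute": {"cluster": "my-cluster", "gpus": 8},
    "scheduling": {"priority": "low", "max_retries": 1},
    "env_variables": {"LOG_LEVEL": "info"},
}

print(missing_required(config))
print(missing_required({"name": "incomplete"}))
```

The dictionary keys match the constructor arguments shown in the class signature, so it is intended to map onto `RunConfig(**config)`; that call was not exercised here.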

class mcli.RunStatus(value)[source]#

Possible statuses of a run

PENDING = 'PENDING'#: The run has been submitted and is waiting to be scheduled

QUEUED = 'QUEUED'#: The run is awaiting execution

STARTING = 'STARTING'#: The run is starting up and preparing to run

RUNNING = 'RUNNING'#: The run is actively running

TERMINATING = 'TERMINATING'#: The run is in the process of being terminated

COMPLETED = 'COMPLETED'#: The run has finished without any errors

STOPPED = 'STOPPED'#: The run has stopped

FAILED = 'FAILED'#: The run has failed due to an issue at runtime

UNKNOWN = 'UNKNOWN'#: A valid run status cannot be found

before(other, inclusive=False)[source]#

Returns True if this state usually comes “before” the other

Parameters

other – Another RunStatus
inclusive – If True, equality evaluates to True. Default False.

Returns

If this state is “before” the other

Example

>>> RunStatus.RUNNING.before(RunStatus.COMPLETED)
True
>>> RunStatus.PENDING.before(RunStatus.RUNNING)
True

after(other, inclusive=False)[source]#

Returns True if this state usually comes “after” the other

Parameters

other – Another RunStatus
inclusive – If True, equality evaluates to True. Default False.

Returns

If this state is “after” the other

Example

>>> RunStatus.COMPLETED.after(RunStatus.RUNNING)
True
>>> RunStatus.RUNNING.after(RunStatus.PENDING)
True

is_terminal()[source]#

Returns True if this state is terminal

Returns: If this state is terminal

Example

>>> RunStatus.RUNNING.is_terminal()
False
>>> RunStatus.COMPLETED.is_terminal()
True
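A common use of is_terminal() is polling until a run has finished. The sketch below shows one way to structure such a loop. It is written against any object with an `is_terminal()` method so that it runs without the SDK; `wait_until_terminal` and `FakeStatus` are hypothetical names, not part of mcli.

```python
import time
from typing import Callable, Protocol


class HasIsTerminal(Protocol):
    """Anything exposing is_terminal(), such as a RunStatus value."""

    def is_terminal(self) -> bool: ...


def wait_until_terminal(
    get_status: Callable[[], HasIsTerminal],
    poll_interval: float = 5.0,
    max_polls: int = 100,
) -> HasIsTerminal:
    """Call get_status() repeatedly until it reports a terminal state."""
    for _ in range(max_polls):
        status = get_status()
        if status.is_terminal():
            return status
        time.sleep(poll_interval)
    raise TimeoutError(f"Status was not terminal after {max_polls} polls")


class FakeStatus:
    """Stand-in for RunStatus, used only to demonstrate the loop."""

    def __init__(self, name: str, terminal: bool) -> None:
        self.name = name
        self.terminal = terminal

    def is_terminal(self) -> bool:
        return self.terminal


# Simulate a run that moves through PENDING and RUNNING before completing
sequence = iter([
    FakeStatus("PENDING", False),
    FakeStatus("RUNNING", False),
    FakeStatus("COMPLETED", True),
])
final = wait_until_terminal(lambda: next(sequence), poll_interval=0.0)
print(final.name)
```

With a real run, `get_status` could be something like `lambda: run.refresh().status`, assuming the Run object exposes its status as a `status` attribute.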

classmethod from_string(run_status)[source]#

Convert a string to a valid RunStatus Enum

If the run status string is not recognized, will return RunStatus.UNKNOWN instead of raising a KeyError
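The fallback behavior described above can be illustrated with a small stand-in enum. This is a sketch of the lookup-with-default pattern only, not the library's implementation; `StatusSketch` is a hypothetical class, and whether the real method normalizes case or whitespace is not stated in the source.

```python
from enum import Enum


class StatusSketch(Enum):
    """Stand-in enum using the status names documented above."""

    PENDING = "PENDING"
    QUEUED = "QUEUED"
    STARTING = "STARTING"
    RUNNING = "RUNNING"
    TERMINATING = "TERMINATING"
    COMPLETED = "COMPLETED"
    STOPPED = "STOPPED"
    FAILED = "FAILED"
    UNKNOWN = "UNKNOWN"

    @classmethod
    def from_string(cls, run_status: str) -> "StatusSketch":
        """Look up a member by name, falling back to UNKNOWN."""
        try:
            return cls[run_status]
        except KeyError:
            # Unrecognized strings map to UNKNOWN instead of raising
            return cls.UNKNOWN


print(StatusSketch.from_string("RUNNING"))
print(StatusSketch.from_string("NOT_A_STATUS"))
```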

class mcli.ComputeConfig[source]#: Typed dictionary for nested compute requests

class mcli.SchedulingConfig[source]#: Typed dictionary for nested scheduling configurations

class mcli.PaginatedObjectList(data, obj, query_function, pagination_function)[source]#

A list of objects that is paginated

next_page(limit=None)[source]#: Returns the next page of results

Example pagination of runs:

import time
from mcli import get_runs

runs = get_runs()
while True:
    try:
        print(f'Found {len(runs)} runs')
        time.sleep(1)
        runs = runs.next_page()
    except StopIteration:
        print("No more pages")
        break
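If you want every result in a single list rather than one page at a time, the same loop can be wrapped in a generator. The sketch below assumes only what the example above shows: that a page behaves like a list and that next_page() raises StopIteration when no pages remain. `iter_pages` and `FakePage` are hypothetical names so that the example runs without the SDK.

```python
from typing import Iterator, List, Optional


def iter_pages(first_page) -> Iterator:
    """Yield the first page and every following page until exhausted."""
    page = first_page
    while True:
        yield page
        try:
            page = page.next_page()
        except StopIteration:
            return  # no more pages


class FakePage(list):
    """Stand-in for PaginatedObjectList, used only for demonstration."""

    def __init__(self, items: List[int], rest: Optional[List[List[int]]] = None) -> None:
        super().__init__(items)
        self._rest = rest or []

    def next_page(self, limit=None) -> "FakePage":
        if not self._rest:
            raise StopIteration
        return FakePage(self._rest[0], self._rest[1:])


pages = FakePage([1, 2], rest=[[3, 4], [5]])
all_items = [item for page in iter_pages(pages) for item in page]
print(all_items)
```

With the SDK, `iter_pages(get_runs())` would be the corresponding call under the same assumptions.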

Pretraining API#

`create_pretraining_run`	Create a pretraining run.

mcli.create_pretraining_run(model, train_data, save_folder, *, compute=None, tokenizer=None, training_duration=None, parameters=None, eval=None, experiment_tracker=None, custom_weights_path=None, timeout=10, future=False)[source]#

Create a pretraining run.

Parameters

model – The name of the Hugging Face model to use. Required.
train_data – Either a list of paths to the training data or a mapping of dataset names to the path and proportion of the dataset to use. For example, if you have two datasets, dataset1 and dataset2, and you want to use 80% of dataset1 and 20% of dataset2, you can pass in {"dataset1": {"path": "path/to/dataset1", "proportion": .8}, "dataset2": {"path": "path/to/dataset2", "proportion": .2}}. Required.
save_folder – The remote location to save the checkpoints. For example, if your save_folder is s3://my-bucket/my-checkpoints, the Composer checkpoints will be saved to s3://my-bucket/my-checkpoints/<run-name>/checkpoints, and Hugging Face formatted checkpoints will be saved to s3://my-bucket/my-checkpoints/<run-name>/hf_checkpoints. The supported cloud provider prefixes are s3://, gs://, and oci://. Required.
compute – The compute configuration to use. Currently required.
tokenizer – Tokenizer configuration to use. If not provided, the default tokenizer for the model will be used.
training_duration – The total duration of your run. This can be specified in batches (e.g. 100ba), epochs (e.g. 10ep), or tokens (e.g. 1_000_000tok). Default is 1ep.
parameters – Additional parameters to pass to the model:
- learning_rate: The peak learning rate to use. Default is 5e-7. The optimizer used is DecoupledLionW with betas of 0.90 and 0.95 and no weight decay, and the learning rate scheduler used is LinearWithWarmupSchedule with a warmup of 2% of the total training duration and a final learning rate multiplier of 0.
- context_length: The maximum sequence length to use. This will be used to truncate any data that is too long. The default is the default for the provided Hugging Face model. We do not support extending the context length beyond each model’s default.
experiment_tracker – The configuration for an experiment tracker. For example, to add Weights and Biases tracking, you can pass in {"wandb": {"project": "my-project", "entity": "my-entity"}}. To add MLflow tracking, you can pass in {"mlflow": {"experiment_path": "my-experiment", "model_registry_path": "catalog.schema.model_name"}}.
eval – Configuration for evaluation
custom_weights_path – The remote location of a custom model checkpoint to resume from. If provided, these weights will be used instead of the original pretrained weights of the model. This must be a Composer checkpoint.
timeout – Time, in seconds, in which the call should complete. If the run creation takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future – Return the output as a Future. If True, the call to create_pretraining_run will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.

Returns

A Run object containing the pretraining run information.
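The training_duration strings described above combine an integer with a unit suffix. The sketch below shows one way to check such a string locally before submitting a run. `parse_duration` is a hypothetical helper and not part of the SDK; it accepts only the three suffixes named in the documentation, and the platform's own parser may accept other forms.

```python
import re
from typing import Tuple

# Unit suffixes named in the documentation above
_UNITS = {"ba": "batches", "ep": "epochs", "tok": "tokens"}
_PATTERN = re.compile(r"^(\d[\d_]*)(ba|ep|tok)$")


def parse_duration(text: str) -> Tuple[int, str]:
    """Split a duration such as '10ep' into (10, 'epochs')."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"Unrecognized training duration: {text!r}")
    # int() accepts single underscores between digits, e.g. '1_000_000'
    value = int(match.group(1))
    return value, _UNITS[match.group(2)]


for example in ("100ba", "10ep", "1_000_000tok"):
    print(example, "->", parse_duration(example))
```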

Finetuning API#

`create_finetuning_run`	Finetunes a model on a finetuning dataset and converts the final composer checkpoint to a Hugging Face formatted checkpoint for inference.
`Finetune`	A Finetune that has been run on the MosaicML platform

mcli.create_finetuning_run(model, train_data_path, save_folder, *, task_type='INSTRUCTION_FINETUNE', eval_data_path=None, eval_prompts=None, custom_weights_path=None, training_duration=None, learning_rate=None, context_length=None, experiment_tracker=None, disable_credentials_check=None, timeout=10, future=False)[source]#

Finetunes a model on a finetuning dataset and converts the final composer checkpoint to a Hugging Face formatted checkpoint for inference.

Parameters

model – The name of the Hugging Face model to use.
train_data_path – The full remote location of your training data (e.g. s3://my-bucket/my-data.jsonl). For INSTRUCTION_FINETUNE, another option is to provide the name of a Hugging Face dataset that includes the train split, like mosaicml/dolly_hhrlhf/train. The data should be formatted with each row containing a prompt and response field for INSTRUCTION_FINETUNE, or in raw data format for CONTINUED_PRETRAIN.
save_folder – The remote location to save the finetuned checkpoints. For example, if your save_folder is s3://my-bucket/my-checkpoints, the finetuned Composer checkpoints will be saved to s3://my-bucket/my-checkpoints/<run-name>/checkpoints, and Hugging Face formatted checkpoints will be saved to s3://my-bucket/my-checkpoints/<run-name>/hf_checkpoints. The supported cloud provider prefixes are s3://, gs://, and oci://.
task_type – The type of finetuning task to run. Current available options are INSTRUCTION_FINETUNE and CONTINUED_PRETRAIN, defaults to INSTRUCTION_FINETUNE.
eval_data_path – The remote location of your evaluation data (e.g. s3://my-bucket/my-data.jsonl). For INSTRUCTION_FINETUNE, the name of a Hugging Face dataset with the test split (e.g. mosaicml/dolly_hhrlhf/test) can also be given. The evaluation data should be formatted with each row containing a prompt and response field for INSTRUCTION_FINETUNE, or as raw data for CONTINUED_PRETRAIN. Default is None.
eval_prompts – A list of prompt strings to generate during training. Results will be logged to the experiment tracker(s) you’ve configured. Generations will occur at every model checkpoint with the following generation parameters:
- max_new_tokens: 100
- temperature: 1
- top_k: 50
- top_p: 0.95
- do_sample: true
Default is None (do not generate prompts).
custom_weights_path – The remote location of a custom model checkpoint to use for finetuning. If provided, these weights will be used instead of the original pretrained weights of the model. This must be a Composer checkpoint. Default is None.
training_duration – The total duration of your finetuning run. This can be specified in batches (e.g. 100ba), epochs (e.g. 10ep), or tokens (e.g. 1_000_000tok). Default is 1ep.
learning_rate – The peak learning rate to use for finetuning. Default is 5e-7. The optimizer used is DecoupledLionW with betas of 0.90 and 0.95 and no weight decay, and the learning rate scheduler used is LinearWithWarmupSchedule with a warmup of 2% of the total training duration and a final learning rate multiplier of 0.
context_length – The maximum sequence length to use. This will be used to truncate any data that is too long. The default is the default for the provided Hugging Face model. We do not support extending the context length beyond each model’s default.
experiment_tracker – The configuration for an experiment tracker. For example, to add Weights and Biases tracking, you can pass in {"wandb": {"project": "my-project", "entity": "my-entity"}}. To add MLflow tracking, you can pass in {"mlflow": {"experiment_path": "my-experiment", "model_registry_path": "catalog.schema.model_name"}}.
disable_credentials_check – Flag to disable checking credentials (S3, Databricks, etc.). If the credentials check is enabled (False), a preflight check will be run on finetune submission, running a few tests to ensure that the credentials provided are valid for the resources you are attempting to access (S3 buckets, Databricks experiments, etc.). If the credentials check fails, your finetune run will be stopped.
timeout – Time, in seconds, in which the call should complete. If the run creation takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future – Return the output as a Future. If True, the call to create_finetuning_run will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Finetune output, use return_value.result() with an optional timeout argument.

Returns

A Finetune object containing the finetuning run information.
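For INSTRUCTION_FINETUNE, each row of the training data needs a prompt and a response field. The sketch below checks a JSONL document for rows that break that rule before you upload it. `invalid_rows` is a hypothetical helper and not part of the SDK, and the sample rows are made up.

```python
import json
from typing import List

REQUIRED_KEYS = {"prompt", "response"}


def invalid_rows(jsonl_text: str) -> List[int]:
    """Return 1-based line numbers of rows that are not valid prompt/response records."""
    bad = []
    for number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            bad.append(number)
            continue
        # A valid row is a JSON object containing both required keys
        if not isinstance(row, dict) or not REQUIRED_KEYS <= row.keys():
            bad.append(number)
    return bad


sample = "\n".join([
    json.dumps({"prompt": "What is 2 + 2?", "response": "4"}),
    json.dumps({"prompt": "Missing a response"}),
    "not json",
])
print(invalid_rows(sample))
```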

class mcli.Finetune(id, name, status, created_at, updated_at, created_by, started_at=None, completed_at=None, reason=None, estimated_end_time=None, model=None, save_folder=None, train_data_path=None, submitted_config=None, events=None, _required_properties=('id', 'name', 'status', 'createdByEmail', 'createdAt', 'updatedAt'))[source]#

A Finetune that has been run on the MosaicML platform

Parameters

id – The unique identifier for this finetuning run.
name – The name of the finetuning run.
status – The current status of the finetuning run. This is a RunStatus enum, which has values such as PENDING, RUNNING, or COMPLETED.
created_at – The timestamp at which the finetuning run was created.
updated_at – The timestamp at which the finetuning run was last updated.
created_by – The email address of the user who created the finetuning run.
started_at – The timestamp at which the finetuning run was started.
completed_at – The timestamp at which the finetuning run was completed.
reason – The reason for the finetuning run’s current status, such as Run completed successfully.