Python SDK

Setup

create_secret

Create a secret in the MosaicML platform

delete_secrets

Deletes secrets from the MosaicML platform

initialize

Initialize the MosaicML platform

get_cluster

Gets a cluster available in the MosaicML platform

get_clusters

Get clusters available in the MosaicML platform

set_api_key

Set the API key for the MosaicML platform

MAPIException

Exceptions raised when a request to MAPI fails

MCLIConfig

Global Config Store persisted on local disk

mcli.create_secret(secret, *, timeout=10, future=False)

Create a secret in the MosaicML platform

Parameters
  • secret (Secret) – A Secret object to create

  • timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

  • future (bool) – Return the output as a Future. If True, the call to create_secret() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Secret output, use return_value.result() with an optional timeout argument.

Raises

MAPIException – Raised if a communication error occurs while connecting to MAPI
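The timeout and future arguments recur across most calls in this SDK. The sketch below illustrates the documented pattern using only the standard library. Note that fake_create_secret is a hypothetical stand-in written for this example, not the real mcli.create_secret; it only mimics the described behaviour of returning a concurrent.futures.Future when future=True and blocking for up to timeout seconds otherwise.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=1)


def fake_create_secret(secret: dict, *, timeout: float = 10, future: bool = False):
    """Hypothetical stand-in for an SDK call; NOT the real mcli.create_secret."""

    def _request() -> dict:
        time.sleep(0.05)  # pretend network latency
        return {"name": secret["name"], "created": True}

    pending: Future = _executor.submit(_request)
    if future:
        # Return immediately; the timeout argument is ignored in this mode.
        return pending
    # Blocking mode: wait up to `timeout` seconds before a timeout error is raised.
    return pending.result(timeout=timeout)


# future=True: the call returns at once, and the output is read later.
return_value = fake_create_secret({"name": "my-secret"}, future=True)
created = return_value.result(timeout=5)
print(created)
```

With the real SDK the same shape would apply: pass future=True, keep the returned Future, and call result() with an optional timeout when the output is needed.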

mcli.delete_secrets(secrets=None, *, timeout=10, future=False)

Deletes secrets from the MosaicML platform

Parameters
  • secrets (Secret) – List of Secret objects or secret name strings to delete.

  • timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

  • future (bool) – Return the output as a Future. If True, the call to delete_secrets() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Secret output, use return_value.result() with an optional timeout argument.

Raises

MAPIException – Raised if a communication error occurs while connecting to MAPI

mcli.initialize(api_key=None)

Initialize the MosaicML platform

Parameters

api_key – Optional value to set

mcli.get_cluster(cluster, *, include_utilization=True, include_all=False, timeout=10, future=False)

Gets a cluster available in the MosaicML platform

Parameters
  • cluster (ClusterDetails) – ClusterDetails object or cluster name string to get.

  • include_utilization (bool) – Include information on how the cluster is currently being utilized

  • timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

  • future (bool) – Return the output as a Future. If True, the call to get_cluster() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the ClusterDetails output, use return_value.result() with an optional timeout argument.

Raises

MAPIException – Raised if a communication error occurs while connecting to MAPI

mcli.get_clusters(clusters=None, *, include_utilization=False, include_all=False, timeout=10, future=False, submission_type_filter=None)

Get clusters available in the MosaicML platform

Parameters
  • clusters (ClusterDetails) – List of ClusterDetails objects or cluster name strings to get.

  • include_utilization (bool) – Include information on how the cluster is currently being utilized

  • timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

  • future (bool) – Return the output as a Future. If True, the call to get_clusters() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the ClusterDetails output, use return_value.result() with an optional timeout argument.

Raises

MAPIException – Raised if a communication error occurs while connecting to MAPI

mcli.set_api_key(api_key)

Set the API key for the MosaicML platform

Parameters

api_key – value to set

class mcli.ObjectList(data, obj_type)

Common helper for list of objects

class mcli.MAPIException(status, message='Unknown Error', description=None)

Exceptions raised when a request to MAPI fails

Parameters
  • status – The status code for the exception

  • message – A brief description of the error

  • description – An optional longer description of the error

Details: MAPI responds to failures with the following status codes:

  • 400: The request was misconfigured or missing an argument. Double-check the API and try again.

  • 401: User credentials were either missing or invalid. Be sure to set your API key before making a request.

  • 403: User credentials were valid, but the requested action is not allowed.

  • 404: Could not find the requested resource(s).

  • 409: Attempted to create an object with a name that already exists. Change the name and try again.

  • 500: Internal error in MAPI. Please report the issue.

  • 503: MAPI or a subcomponent is currently offline. Please report the issue.
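The status codes above lend themselves to a small lookup table for turning a failed request into an actionable hint. The following is a minimal sketch: MAPI_STATUS_HINTS and explain_mapi_status are hypothetical names introduced here, and the idea of reading the code from a caught exception (for example via a status attribute) is an assumption based on the constructor signature rather than confirmed behaviour.

```python
from typing import Dict

# Hints paraphrased from the status codes documented for MAPIException.
MAPI_STATUS_HINTS: Dict[int, str] = {
    400: "Request misconfigured or missing an argument; double-check the API.",
    401: "Credentials missing or invalid; set your API key first.",
    403: "Credentials valid, but the requested action is not allowed.",
    404: "Requested resource(s) could not be found.",
    409: "An object with that name already exists; change the name.",
    500: "Internal error in MAPI; please report the issue.",
    503: "MAPI or a subcomponent is offline; please report the issue.",
}


def explain_mapi_status(status: int) -> str:
    """Return a short hint for a status code (hypothetical helper)."""
    # Fall back to the default message shown in the MAPIException signature.
    return MAPI_STATUS_HINTS.get(status, "Unknown Error")


for code in (401, 409, 418):
    print(code, explain_mapi_status(code))
```

In practice this would sit inside an except MAPIException block, with the status code taken from the caught exception.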

class mcli.MCLIConfig(MOSAICML_API_KEY='', feature_flags=<factory>, last_update_check=<factory>, mcloud_envs=<factory>, _user_id=None, _organization_id=None)

Global Config Store persisted on local disk

Runs

create_run

Launch a run in the MosaicML platform

create_interactive_run

Launch an interactive run in the MosaicML platform

delete_run

Delete a run in the MosaicML platform

delete_runs

Delete a list of runs in the MosaicML platform

follow_run_logs

Follow the logs for an active or completed run in the MosaicML platform

get_run_logs

Get the current logs for an active or completed run

get_run

Get a run that has been launched in the MosaicML platform

get_runs

List runs that have been launched in the MosaicML platform

start_run

Start a run

start_runs

Start a list of runs

stop_run

Stop a run

stop_runs

Stop a list of runs

update_run_metadata

Update a run's metadata in the MosaicML platform.

update_run

Update a run's data in the MosaicML platform.

wait_for_run_status

Wait for a launched run to reach a specific status

watch_run_status

Watch a launched run and retrieve a new Run object every time its status updates

Run

A run that has been launched on the MosaicML platform

RunConfig

A run configuration for the MosaicML platform

RunStatus

Possible statuses of a run

ComputeConfig

Typed dictionary for nested compute requests

SchedulingConfig

Typed dictionary for nested scheduling configurations

mcli.create_run(run, *, timeout=10, future=False)

Launch a run in the MosaicML platform

The provided run must contain enough information to fully detail the run

Parameters
  • run – A fully-configured run to launch. The run will be queued and persisted in the run database.

  • timeout – Time, in seconds, in which the call should complete. If the run creation takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

  • future – Return the output as a Future. If True, the call to create_run will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.

Returns

A Run that includes the launched run details and the run status

mcli.create_interactive_run(run, *, timeout=10, seconds=None, future=False)

Launch an interactive run in the MosaicML platform

Users are not required to provide a name, image, or 'hours' variable for an interactive run. If these variables are not provided, they will be filled in with defaults. If the user provides a value for the 'command' variable, this will be overwritten with sleep <hours>, where <hours> is the value of the 'hours' variable.

Parameters
  • run – A fully-configured run to launch. The run will be queued and persisted in the run database.

  • timeout – Time, in seconds, in which the call should complete. If the run creation takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

  • future – Return the output as a Future. If True, the call to create_interactive_run will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.

  • hours – How many hours an interactive run can sleep for until MORC marks it as completed.

Returns

A Run that includes the launched run details and the run status

mcli.delete_run(run, *, timeout=10, future=False)

Delete a run in the MosaicML platform

If a run is currently running, it will first be stopped.

Parameters
  • run – A run to delete

  • timeout – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

  • future – Return the output as a Future. If True, the call to delete_run will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.

Returns

A Run for the run that was deleted

mcli.delete_runs(runs, *, timeout=10, future=False)

Delete a list of runs in the MosaicML platform

Any runs that are currently running will first be stopped.

Parameters
  • runs – A list of runs or run names to delete

  • timeout – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

  • future – Return the output as a Future. If True, the call to delete_runs will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the list of Run output, use return_value.result() with an optional timeout argument.

Returns

A list of Run objects for the runs that were deleted

mcli.follow_run_logs(run, rank=None, *, node_rank=None, local_gpu_rank=None, global_gpu_rank=None, timeout=None, future=False, resumption=None, tail=None, container=None)

Follow the logs for an active or completed run in the MosaicML platform

This returns a generator of individual log lines, line-by-line, and will wait until new lines are produced if the run is still active.

Parameters
  • run (str | Run) – The run to get logs for. If a name is provided, the remaining required run details will be queried with get_runs().

  • rank (Optional[int]) – Node rank of a run to get logs for. Defaults to the lowest available rank. This will usually be rank 0 unless something has gone wrong.

  • timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored. A run may take some time to generate logs, so you likely do not want to set a timeout.

  • future (bool) – Return the output as a Future. If True, the call to follow_run_logs() will return immediately and the request will be processed in the background. The generator returned by the Future will yield a Future for each new log string returned from the cloud. This takes precedence over the timeout argument. To get the generator, use return_value.result() with an optional timeout argument and log_future.result() for each new log string.

  • resumption (Optional[int]) – Resumption (0-indexed) of a run to get logs for. Defaults to the last resumption

  • tail (Optional[int]) – Number of chars to read from the end of the log. Defaults to reading the entire log.

  • container (Optional[str]) – Container name of a run to get logs for. Defaults to the MAIN container.

Returns
  • If future is False – A line-by-line Generator of the logs for a run

  • Otherwise – A Future of a line-by-line generator of the logs for a run
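Because follow_run_logs() returns a generator of log lines, it can be consumed lazily and abandoned as soon as the line of interest appears. The sketch below shows that consumption pattern against a stand-in generator: fake_log_stream and first_matching_line are hypothetical names created for this example, and the log contents are invented.

```python
from typing import Iterable, Iterator, Optional


def first_matching_line(lines: Iterable[str], needle: str) -> Optional[str]:
    """Scan a line-by-line log stream and stop at the first match."""
    for line in lines:
        if needle in line:
            return line.rstrip("\n")
    return None  # the stream ended without a match


def fake_log_stream() -> Iterator[str]:
    """Hypothetical stand-in for the generator returned by follow_run_logs()."""
    yield "step 1 loss=2.31\n"
    yield "step 2 loss=1.97\n"
    yield "Traceback (most recent call last):\n"
    yield "this line is never pulled by the search below\n"


hit = first_matching_line(fake_log_stream(), "Traceback")
print(hit)
```

With the real SDK, the generator from follow_run_logs() would take the place of fake_log_stream(). For an active run the loop waits whenever no new line is available yet.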

mcli.get_run_logs(run, rank=None, *, node_rank=None, local_gpu_rank=None, global_gpu_rank=None, timeout=None, future=False, failed=False, resumption=None, tail=None, container=None)

Get the current logs for an active or completed run

Get the current logs for an active or completed run in the MosaicML platform. This returns the full logs as a str, as they exist at the time the request is made. If you want to follow the logs for an active run line-by-line, use follow_run_logs().

Parameters
  • run (str | Run) – The run to get logs for. If a name is provided, the remaining required run details will be queried with get_runs().

  • rank (Optional[int]) – [DEPRECATED, Use node_rank instead] Node rank of a run to get logs for. Defaults to the lowest available rank. This will usually be rank 0 unless something has gone wrong.

  • node_rank (Optional[int]) – Specifies the node rank within a multi-node run to fetch logs for. Defaults to lowest available rank. Indexing starts from 0.

  • local_gpu_rank (Optional[int]) – Specifies the GPU rank on the specified node to fetch logs for. Cannot be used with global_gpu_rank. Indexing starts from 0. Note: GPU rank logs are only available for runs using Composer and/or LLM Foundry and MAIN container logs.

  • global_gpu_rank (Optional[int]) – Specifies the global GPU rank to fetch logs for. Cannot be used with node_rank and local_gpu_rank. Indexing starts from 0. Note: GPU rank logs are only available for runs using Composer and/or LLM Foundry and MAIN container logs.

  • timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

  • future (bool) – Return the output as a Future. If True, the call to get_run_logs() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the log text, use return_value.result() with an optional timeout argument.

  • failed (bool) – Return the logs of the first failed rank for the provided resumption if True. False by default.

  • resumption (Optional[int]) – Resumption (0-indexed) of a run to get logs for. Defaults to the last resumption

  • tail (Optional[int]) – Number of chars to read from the end of the log. Defaults to reading the entire log.

  • container (Optional[str]) – Container name of a run to get logs for. Defaults to the MAIN container.

Returns
  • If future is False – The full log text for a run at the time of the request as a str

  • Otherwise – A Future for the log text
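The rank arguments carry mutual-exclusion rules: global_gpu_rank cannot be combined with node_rank or local_gpu_rank, and all ranks are 0-indexed. A client-side pre-check can catch a bad combination before any request is made. The helper below is a hypothetical sketch of such a check, not something the SDK is known to provide. It returns a keyword dictionary that could then be passed on to get_run_logs().

```python
from typing import Dict, Optional


def build_rank_kwargs(
    node_rank: Optional[int] = None,
    local_gpu_rank: Optional[int] = None,
    global_gpu_rank: Optional[int] = None,
) -> Dict[str, int]:
    """Validate the documented rank rules and keep only the ranks that were set."""
    if global_gpu_rank is not None and (
        node_rank is not None or local_gpu_rank is not None
    ):
        raise ValueError(
            "global_gpu_rank cannot be combined with node_rank or local_gpu_rank"
        )
    candidates = {
        "node_rank": node_rank,
        "local_gpu_rank": local_gpu_rank,
        "global_gpu_rank": global_gpu_rank,
    }
    for name, value in candidates.items():
        if value is not None and value < 0:
            raise ValueError(f"{name} is 0-indexed and cannot be negative")
    return {name: value for name, value in candidates.items() if value is not None}


print(build_rank_kwargs(node_rank=1, local_gpu_rank=0))
print(build_rank_kwargs(global_gpu_rank=9))
```

The resulting dictionary can be unpacked into the call with the ** operator, leaving unset ranks at their defaults.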

mcli.get_run(run, *, timeout=10, future=False, include_details=True)

Get a run that has been launched in the MosaicML platform

The run will contain all details requested

Parameters
  • run – Run on which to get information

  • timeout – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

  • future – Return the output as a Future. If True, the call to get_run will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the run, use return_value.result() with an optional timeout argument.

  • include_details – If true, will fetch detailed information like run input for the run.

Raises

MAPIException – Raised if a communication error occurs while connecting to MAPI

mcli.get_runs(runs=None, *, cluster_names=None, before=None, after=None, gpu_types=None, gpu_nums=None, statuses=None, timeout=10, future=False, user_emails=None, run_types=None, include_details=False, include_deleted=False, ended_before=None, ended_after=None, limit=100)

List runs that have been launched in the MosaicML platform

The returned list will contain all of the details stored about the requested runs.

Parameters
  • runs – List of runs on which to get information

  • cluster_names – List of cluster names to filter runs. This can be a list of str or Cluster objects. Only runs submitted to these clusters will be returned.

  • before – Only runs created strictly before this time will be returned. This can be a str in ISO 8601 format (e.g., 2023-03-31T12:23:04.34+05:30) or a datetime object.

  • after – Only runs created at or after this time will be returned. This can be a str in ISO 8601 format (e.g., 2023-03-31T12:23:04.34+05:30) or a datetime object.

  • gpu_types – List of GPU types to filter runs. This can be a list of str or GPUType enums. Only runs scheduled on these GPUs will be returned.

  • gpu_nums – List of GPU counts to filter runs. Only runs scheduled on this number of GPUs will be returned.

  • statuses – List of run statuses to filter runs. This can be a list of str or RunStatus enums. Only runs currently in these phases will be returned.

  • user_emails – List of user emails to filter runs. Only runs submitted by these users will be returned. By default, will return runs submitted by the current user. Requires shared runs or admin permission

  • run_types – List of run types to filter runs:

    • 'INTERACTIVE': Runs created with the mcli interactive command

    • 'HERO_RUN': Runs created with is_hero_run in the metadata

    • 'TRAINING': All other runs

  • timeout – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

  • future – Return the output as a Future. If True, the call to get_runs will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the list of runs, use return_value.result() with an optional timeout argument.

  • include_details – If true, will fetch detailed information like run input for each run.

  • include_deleted – If true, will include deleted runs in the response.

  • ended_before – Only runs ended strictly before this time will be returned.

  • ended_after – Only runs ended at or after this time will be returned.

  • limit – Maximum number of runs to return. If None, the latest 100 runs will be returned.

Raises

MAPIException – Raised if a communication error occurs while connecting to MAPI
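The before and after filters form a half-open window: after is inclusive ("at or after") and before is exclusive ("strictly before"). The sketch below reproduces that rule locally with timezone-aware datetime objects so the boundary behaviour is easy to see. The created_in_window helper is hypothetical and only mirrors the documented semantics; the filtering itself is done by the platform.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def created_in_window(
    created_at: datetime,
    *,
    before: Optional[datetime] = None,
    after: Optional[datetime] = None,
) -> bool:
    """Mirror the documented rule: after <= created_at < before."""
    if after is not None and created_at < after:
        return False
    if before is not None and created_at >= before:
        return False
    return True


# A timezone-aware window using the +05:30 offset from the documented example.
tz = timezone(timedelta(hours=5, minutes=30))
after = datetime(2023, 3, 31, 12, 23, 4, 340000, tzinfo=tz)
before = after + timedelta(days=1)

print(after.isoformat())  # an ISO 8601 string, as accepted by the filters
print(created_in_window(after, before=before, after=after))
print(created_in_window(before, before=before, after=after))
```

Either the datetime objects or their isoformat() strings could be passed as the before and after arguments.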

mcli.start_run(run, *, timeout=10, future=False)

Start a run

Start a run currently stopped in the MosaicML platform.

Parameters
  • run (Optional[str | Run]) – A run or run name to start

  • timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

  • future (bool) – Return the output as a Future. If True, the call to start_run() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.

Raises

MAPIException – Raised if starting the requested run failed. A successfully started run will have the status `RunStatus.PENDING`.

Returns
  • If future is False – Started Run object

  • Otherwise – A Future for the object

mcli.start_runs(runs, *, timeout=10, future=False)

Start a list of runs

Start a list of runs currently stopped in the MosaicML platform.

Parameters
  • runs (Optional[List[str] | List[Run]]) – A list of runs or run names to start

  • timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

  • future (bool) – Return the output as a Future. If True, the call to start_runs() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the list of Run output, use return_value.result() with an optional timeout argument.

Raises

MAPIException – Raised if starting any of the requested runs failed. All successfully started runs will have the status `RunStatus.PENDING`. You can freely retry any started and unstarted runs if this error is raised due to a connection issue.

Returns
  • If future is False – A list of started Run objects

  • Otherwise – A Future for the list

mcli.stop_run(run, *, reason=None, timeout=10, future=False)

Stop a run

Stop a run currently running in the MosaicML platform.

Parameters
  • run (Optional[str | Run]) – A run or run name to stop. Using Run objects is most efficient. See the note below.

  • reason (Optional[str]) – A reason for stopping the run

  • timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

  • future (bool) – Return the output as a Future. If True, the call to stop_run() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.

Raises

MAPIException – Raised if stopping the requested run failed. A successfully stopped run will have the status `RunStatus.STOPPED`.

Returns
  • If future is False – Stopped Run object

  • Otherwise – A Future for the object

mcli.stop_runs(runs, *, reason=None, timeout=10, future=False)

Stop a list of runs

Stop a list of runs currently running in the MosaicML platform.

Parameters
  • runs (Optional[List[str] | List[Run]]) – A list of runs or run names to stop. Using Run objects is most efficient. See the note below.

  • reason (Optional[str]) – A reason for stopping the run

  • timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

  • future (bool) – Return the output as a Future. If True, the call to stop_runs() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the list of Run output, use return_value.result() with an optional timeout argument.

Raises

MAPIException – Raised if stopping any of the requested runs failed. All successfully stopped runs will have the status `RunStatus.STOPPED`. You can freely retry any stopped and unstopped runs if this error is raised due to a connection issue.

Returns
  • If future is False – A list of stopped Run objects

  • Otherwise – A Future for the list
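Since start_runs() and stop_runs() are described as safe to retry after a connection issue, a small retry wrapper is one way to act on that. The following is a generic sketch: retry_call and flaky_stop are hypothetical names, and ConnectionError is used as a stand-in for whichever exception signals a connection problem. With the real SDK that would presumably be MAPIException, filtered to connection-related failures, which is an assumption here.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_call(
    fn: Callable[[], T],
    *,
    attempts: int = 3,
    delay: float = 0.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
) -> T:
    """Call fn(), retrying up to `attempts` times on the listed exceptions."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay)
    raise AssertionError("unreachable")


calls = {"count": 0}


def flaky_stop() -> str:
    """Hypothetical operation that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("connection dropped")
    return "STOPPED"


result = retry_call(flaky_stop, attempts=5)
print(result, "after", calls["count"], "calls")
```

The operation to retry is passed as a zero-argument callable, for example a lambda wrapping the stop_runs() call with its arguments.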

mcli.update_run(run, update_run_data=None, *, preemptible=None, priority=None, max_retries=None, retry_on_system_failure=None, timeout=10, future=False, max_duration=None)

Update a run's data in the MosaicML platform.

Any values that are not specified will not be modified.

Parameters
  • run (Optional[str | Run]) – A run or run name to update. Using Run objects is most efficient. See the note below.

  • update_run_data (Dict[str, Any]) – DEPRECATED: Use the individual named-arguments instead. The data to update the run with. This can include preemptible, priority, maxRetries, and retryOnSystemFailure

  • preemptible (bool) – Update whether the run can be stopped and re-queued by higher priority jobs; default is False

  • priority (str) – Update the default priority of the run from auto to low or lowest

  • max_retries (int) – Update the max number of times the run can be retried; default is 0

  • retry_on_system_failure (bool) – Update whether the run should be retried on system failure (i.e. a node failure); default is False

  • timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

  • max_duration – Update the max time that a run can run for (in hours).

  • future (bool) – Return the output as a Future. If True, the call to update_run() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.

Raises

MAPIException – Raised if updating the requested run failed

Returns
  • If future is False – Updated Run object

  • Otherwise – A Future for the object

mcli.update_run_metadata(run, metadata, *, timeout=10, future=False, protect=False)

Update a run's metadata in the MosaicML platform.

Parameters
  • run (Optional[str | Run]) – A run or run name to update. Using Run objects is most efficient. See the note below.

  • metadata (Dict[str, Any]) – The metadata to update the run with. This will be merged with the existing metadata. Keys not specified in this dictionary will not be modified.

  • timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

  • future (bool) – Return the output as a Future. If True, the call to update_run_metadata() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.

  • protect (bool) – If True, the call will be protected from SIGTERMs to allow it to complete reliably. Defaults to False.

Raises

MAPIException – Raised if updating the requested run failed

Returns
  • If future is False – Updated Run object

  • Otherwise – A Future for the object
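The metadata argument is merged into the existing metadata, and keys that are not supplied stay untouched. The sketch below models that rule locally. It assumes a shallow, top-level merge, which the text above does not spell out, and merge_metadata is a hypothetical helper rather than part of the SDK.

```python
from typing import Any, Dict


def merge_metadata(existing: Dict[str, Any], update: Dict[str, Any]) -> Dict[str, Any]:
    """Model the documented merge, ASSUMING a shallow (top-level) merge."""
    merged = dict(existing)  # copy so the original mapping is not mutated
    merged.update(update)    # supplied keys win; all other keys are kept
    return merged


existing = {"experiment": "baseline", "seed": 17}
update = {"seed": 42, "notes": "rerun with new seed"}

merged = merge_metadata(existing, update)
print(merged)
```

Under that assumption, only the keys that should change need to be sent in the metadata dictionary.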

mcli.wait_for_run_status(run, status, timeout=None, future=False)

Wait for a launched run to reach a specific status

Parameters
  • run (str | Run) – The run whose status should be watched. This can be provided using the run's name or an existing Run object.

  • status (str | RunStatus) – Status to wait for. This can be any valid RunStatus value. If the status is short-lived, or the run terminates, it is possible the run will reach a LATER status than the one requested. If the run never reaches this state (e.g. it stops early or the wait times out), then an error will be raised. See exception details below.

  • timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

  • future (bool) – Return the output as a Future. If True, the call to wait_for_run_status() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.

Raises
  • MAPIException – Raised if the run does not exist or there is an issue connecting to the MAPI service.

  • RunStatusNotReached – Raised in the event that the watch closes before the run reaches the desired status. If this happens, the connection to MAPI may have dropped, so try again.

  • TimeoutError – Raised if the run did not reach the correct status in the specified time

Returns

  • If future is False – A Run object once it has reached the requested status

  • Otherwise – A Future for the run. This will not resolve until the run reaches the requested status

mcli.watch_run_status(run, timeout=None, future=False)

Watch a launched run and retrieve a new Run object every time its status updates

Parameters
  • run (str | Run) – The run whose status should be watched. This can be provided using the run's name or an existing Run object.

  • timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored. A run may take some time to change statuses (especially to go from RUNNING to COMPLETED), so you likely do not want to set a timeout.

  • future (bool) – Return the output as a Future. If True, each iteration will yield a Future for the next updated Run object. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument. With futures, you can easily watch multiple Runs in parallel. NOTE: If you set future==True, you should wrap your return_value.result() in a try: ... except StopAsyncIteration to catch the end of the iteration.

Raises
  • MAPIException – Raised if the run could not be found or if there is an issue contacting the MAPI service

  • TimeoutError – Raised if the run did not reach the correct status in the specified time

Yields

  • If future is False – A Run object at each status update

  • Otherwise – A Future for the run. This will not resolve until the run reaches a new status
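With future=True, each iteration yields a Future and the end of the stream is signalled by StopAsyncIteration when result() is called. The sketch below shows one way to structure that loop. fake_watch is a hypothetical stand-in that yields pre-resolved Future objects carrying plain status strings, whereas the real call would yield Futures that resolve to Run objects.

```python
from concurrent.futures import Future
from typing import Iterable, Iterator, List


def collect_statuses(updates: Iterable[Future]) -> List[str]:
    """Resolve each Future in turn, stopping cleanly at the end of iteration."""
    seen: List[str] = []
    for pending in updates:
        try:
            seen.append(pending.result(timeout=5))
        except StopAsyncIteration:
            break  # the watch has ended; no more status updates
    return seen


def fake_watch() -> Iterator[Future]:
    """Hypothetical stand-in for watch_run_status(run, future=True)."""
    for status in ("PENDING", "RUNNING", "COMPLETED"):
        update: Future = Future()
        update.set_result(status)
        yield update
    done: Future = Future()
    done.set_exception(StopAsyncIteration())
    yield done


statuses = collect_statuses(fake_watch())
print(statuses)
```

Wrapping result() in the try block keeps the end-of-stream signal from surfacing as an unhandled exception in the caller.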

class mcli.Run(run_uid, name, status, created_at, updated_at, created_by, priority, preemptible, retry_on_system_failure, cluster, gpus, gpu_type, cpus, node_count, latest_resumption, is_deleted, run_type, max_retries=None, reason=None, nodes=<factory>, submitted_config=None, metadata=None, last_resumption_id=None, resumptions=<factory>, events=<factory>, lifecycle=<factory>, image=None, max_duration=None, _required_properties=('id', 'name', 'status', 'createdAt', 'updatedAt', 'reason', 'createdByEmail', 'priority', 'preemptible', 'retryOnSystemFailure', 'resumptions', 'isDeleted', 'runType'))

A run that has been launched on the MosaicML platform

Parameters
  • run_uid (str) – Unique identifier for the run

  • name (str) – User-defined name of the run

  • status (RunStatus) – Status of the run at a moment in time

  • created_at (datetime) – Date and time when the run was created

  • updated_at (datetime) – Date and time when the run was last updated

  • created_by (str) – Email of the user who created the run

  • priority (str) – Priority of the run; defaults to auto but can be updated to low or lowest

  • preemptible (bool) – Whether the run can be stopped and re-queued by higher priority jobs

  • retry_on_system_failure (bool) – Whether the run should be retried on system failure

  • cluster (str) – Cluster the run is running on

  • gpus (int) – Number of GPUs the run is using

  • gpu_type (str) – Type of GPU the run is using

  • cpus (int) – Number of CPUs the run is using

  • node_count (int) – Number of nodes the run is using

  • latest_resumption (Resumption) – Latest resumption of the run

  • max_retries (Optional[int]) – Maximum number of times the run can be retried

  • reason (Optional[str]) – Reason the run was stopped

  • nodes (List[Node]) – Nodes the run is using

  • submitted_config (Optional[RunConfig]) – Submitted run configuration

  • metadata (Optional[Dict[str, Any]]) – Metadata associated with the run

  • last_resumption_id (Optional[str]) – ID of the last resumption of the run

  • resumptions (List[Resumption]) – Resumptions of the run

  • lifecycle (List[RunLifecycle]) – Lifecycle of the run

  • image (Optional[str]) – Image the run is using

clone(name=None, image=None, cluster=None, instance=None, nodes=None, gpu_type=None, gpus=None, priority=None, preemptible=None, max_retries=None, max_duration=None)

Submits a new run with the same configuration as this run

Parameters
  • name (str) – Override the name of the run

  • image (str) – Override the image of the run

  • cluster (str) – Override the cluster of the run

  • instance (str) – Override the instance of the run

  • nodes (int) – Override the number of nodes of the run

  • gpu_type (str) – Override the GPU type of the run

  • gpus (int) – Override the number of GPUs of the run

  • priority (str) – Override the default priority of the run from auto to low or lowest

  • preemptible (bool) – Override whether the run can be stopped and re-queued by higher priority jobs

  • max_retries (int) – Override the max number of times the run can be retried

  • max_duration (float) – Override the max duration (in hours) that a run can run for

Returns

New Run object

refresh()[source]#

Refreshes the data on the run object

Returns

Refreshed Run object

stop()[source]#

Stops the run

Returns

Stopped Run object

delete()[source]#

Deletes the run

Returns

Deleted Run object

update(preemptible=None, priority=None, max_retries=None, retry_on_system_failure=None, max_duration=None)[source]#

Updates the run’s data

Parameters
  • preemptible (bool) – Update whether the run can be stopped and re-queued by higher priority jobs; default is False

  • priority (str) – Update the default priority of the run from auto to low or lowest

  • max_retries (int) – Update the max number of times the run can be retried; default is 0

  • retry_on_system_failure (bool) – Update whether the run should be retried on system failure (i.e. a node failure); default is False

  • max_duration (float) – Update the max duration (in hours) that a run can run for

Returns

Updated Run object

update_metadata(metadata)[source]#

Updates the run’s metadata

Parameters

metadata (Dict[str, Any]) – The metadata to update the run with. This will be merged with the existing metadata. Keys not specified in this dictionary will not be modified.

Returns

Updated Run object
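
The merge behaviour described for update_metadata can be pictured with plain dictionaries. The sketch below only illustrates the documented semantics (supplied keys are added or overwritten, unspecified keys are left alone). It assumes a shallow, top-level merge, which this reference does not state explicitly, and merge_metadata is a hypothetical helper, not part of mcli.

```python
from typing import Any, Dict


def merge_metadata(existing: Dict[str, Any], update: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: return a copy of `existing` with `update` applied.

    Assumption: the merge is shallow (top-level keys only).
    """
    merged = dict(existing)  # keys not mentioned in `update` are kept as-is
    merged.update(update)    # supplied keys are added or overwritten
    return merged


# Placeholder metadata, chosen only for illustration.
existing = {"team": "nlp", "experiment": "baseline"}
merged = merge_metadata(existing, {"experiment": "sweep-1", "seed": 17})
print(merged)
```

Only the keys passed in the update change; "team" survives untouched, which is the behaviour the parameter description promises.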

class mcli.RunConfig(name=None, parent_name=None, image=None, gpu_type=None, gpu_num=None, cpus=None, cluster=None, scheduling=<factory>, compute=<factory>, parameters=<factory>, integrations=<factory>, env_variables=<factory>, metadata=<factory>, command='', dependent_deployment=<factory>, _suppress_deprecation_warnings=False)[source]#

A run configuration for the MosaicML platform

Values in here are not yet validated and some required values may be missing. On attempting to create the run, a bad config will raise a MAPIException with a 400 status code.

Required args:
  • name (str): User-defined name of the run

  • image (str): Docker image (e.g. mosaicml/composer)

  • command (str): Command to use when a run starts

  • compute (ComputeConfig or Dict[str, Any]): Compute configuration. Typically a subset of the following fields will be required:

    • cluster (str): Name of cluster to use

    • instance (str): Name of instance to use

    • gpu_type (str): Name of gpu type to use

    • gpus (int): Number of GPUs to use

    • cpus (int): Number of CPUs to use

    • nodes (int): Number of nodes to use

    See mcli get clusters for a list of available clusters and instances

Optional args:
  • parameters (Dict[str, Any]): Parameters to mount into the environment

  • scheduling (SchedulingConfig or Dict[str, Any]): Scheduling configuration
    • priority (str): Priority of the run (default auto, with options low and lowest)

    • preemptible (bool): Whether the run is preemptible (default False)

    • retry_on_system_failure (bool): Whether the run should be retried on system failure (default False)

    • max_retries (int): Maximum number of retries (default 0)

    • max_duration (float): Maximum duration of the run in hours (default None)

      Run will be automatically stopped after this duration has elapsed.

  • integrations (List[Dict[str, Any]]): List of integrations. See integration documentation for more details:

    https://docs.mosaicml.com/projects/mcli/en/latest/resources/integrations/index.html

  • env_variables (Dict[str, str]): Dictionary of environment variables to set in the run
    • key (str): Name of the environment variable

    • value (str): Value of the environment variable

  • metadata (Dict[str, Any]): Arbitrary metadata to attach to the run
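
Because the configuration is not validated until the run is created, a quick local check of the required fields can catch omissions early. The following is a minimal sketch using a plain dictionary shaped like the arguments listed above. The helper missing_required and every value in the dictionary are hypothetical placeholders, and the check is no substitute for the platform's own validation.

```python
from typing import Any, Dict, List

# Required top-level fields, taken from the "Required args" list above.
REQUIRED_FIELDS = ("name", "image", "command", "compute")


def missing_required(config: Dict[str, Any]) -> List[str]:
    """Hypothetical local pre-check: list required fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not config.get(field)]


# All values below are placeholders chosen for illustration.
config: Dict[str, Any] = {
    "name": "hello-world",
    "image": "mosaicml/composer",
    "command": "echo 'hello'",
    "compute": {"cluster": "my-cluster", "gpus": 8},
    "scheduling": {"priority": "low", "max_retries": 1, "max_duration": 2.5},
    "env_variables": {"MY_FLAG": "1"},
    "metadata": {"owner": "me"},
}

print(missing_required(config))                # nothing missing
print(missing_required({"name": "no-image"}))  # image, command and compute are missing
```

A dictionary of this shape could then be unpacked into the constructor, for example RunConfig(**config), assuming the keys match the keyword names in the signature above.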

class mcli.RunStatus(value)[source]#

Possible statuses of a run

PENDING = 'PENDING'#

The run has been submitted and is waiting to be scheduled

QUEUED = 'QUEUED'#

The run is awaiting execution

STARTING = 'STARTING'#

The run is starting up and preparing to run

RUNNING = 'RUNNING'#

The run is actively running

TERMINATING = 'TERMINATING'#

The run is in the process of being terminated

COMPLETED = 'COMPLETED'#

The run has finished without any errors

STOPPED = 'STOPPED'#

The run has stopped

FAILED = 'FAILED'#

The run has failed due to an issue at runtime

UNKNOWN = 'UNKNOWN'#

A valid run status cannot be found

before(other, inclusive=False)[source]#

Returns True if this state usually comes “before” the other

Parameters
  • other – Another RunStatus

  • inclusive – If True, equality evaluates to True. Default False.

Returns

If this state is “before” the other

Example

>>> RunStatus.RUNNING.before(RunStatus.COMPLETED)
True
>>> RunStatus.PENDING.before(RunStatus.RUNNING)
True

after(other, inclusive=False)[source]#

Returns True if this state usually comes “after” the other

Parameters
  • other – Another RunStatus

  • inclusive – If True, equality evaluates to True. Default False.

Returns

If this state is “after” the other

Example

>>> RunStatus.COMPLETED.after(RunStatus.RUNNING)
True
>>> RunStatus.RUNNING.after(RunStatus.PENDING)
True

is_terminal()[source]#

Returns True if this state is terminal

Returns

If this state is terminal

Example

>>> RunStatus.RUNNING.is_terminal()
False
>>> RunStatus.COMPLETED.is_terminal()
True

classmethod from_string(run_status)[source]#

Convert a string to a valid RunStatus Enum

If the run status string is not recognized, will return RunStatus.UNKNOWN instead of raising a KeyError
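
To make the ordering and fallback behaviour concrete without a platform connection, here is a small stand-in enum. It is not the mcli implementation: DemoStatus is a hypothetical class, the ordering is assumed to follow the declaration order of the statuses listed above, and the set of terminal states (COMPLETED, STOPPED, FAILED) is likewise an assumption.

```python
from enum import Enum


class DemoStatus(Enum):
    """Stand-in enum (not mcli.RunStatus); members are declared in lifecycle order."""
    PENDING = "PENDING"
    QUEUED = "QUEUED"
    STARTING = "STARTING"
    RUNNING = "RUNNING"
    TERMINATING = "TERMINATING"
    COMPLETED = "COMPLETED"
    STOPPED = "STOPPED"
    FAILED = "FAILED"
    UNKNOWN = "UNKNOWN"

    def rank(self) -> int:
        # Position in declaration order; assumed to mirror the usual lifecycle.
        return list(type(self)).index(self)

    def before(self, other: "DemoStatus", inclusive: bool = False) -> bool:
        return self.rank() <= other.rank() if inclusive else self.rank() < other.rank()

    def after(self, other: "DemoStatus", inclusive: bool = False) -> bool:
        return self.rank() >= other.rank() if inclusive else self.rank() > other.rank()

    def is_terminal(self) -> bool:
        # Assumption: these three states are the terminal ones.
        return self in (DemoStatus.COMPLETED, DemoStatus.STOPPED, DemoStatus.FAILED)

    @classmethod
    def from_string(cls, value: str) -> "DemoStatus":
        # Fall back to UNKNOWN rather than raising KeyError on an unrecognized string.
        try:
            return cls[value]
        except KeyError:
            return cls.UNKNOWN


print(DemoStatus.RUNNING.before(DemoStatus.COMPLETED))
print(DemoStatus.RUNNING.before(DemoStatus.RUNNING, inclusive=True))
print(DemoStatus.from_string("NOT_A_STATUS"))
```

The inclusive flag only changes the result when both statuses are equal, and an unrecognized string maps to UNKNOWN instead of raising.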

class mcli.ComputeConfig[source]#

Typed dictionary for nested compute requests

class mcli.SchedulingConfig[source]#

Typed dictionary for nested scheduling configurations

class mcli.PaginatedObjectList(data, obj, query_function, pagination_function)[source]#

A list of objects that is paginated

next_page(limit=None)[source]#

Returns the next page of results

Example pagination of runs:

import time
from mcli import get_runs

runs = get_runs()
while True:
    try:
        print(f'Found {len(runs)} runs')
        time.sleep(1)
        runs = runs.next_page()
    except StopIteration:
        print("No more pages")
        break
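
The loop above can also be wrapped in a generator that yields every item across pages. The sketch below uses a stand-in class so that it runs without a platform connection: FakePage and iter_all are hypothetical names, and the only behaviour borrowed from the example above is that next_page() raises StopIteration when no pages remain.

```python
from typing import Any, Iterator, List, Sequence


class FakePage(list):
    """Stand-in for a paginated list; not the mcli class."""

    def __init__(self, pages: Sequence[List[str]], index: int = 0):
        super().__init__(pages[index])
        self._pages = pages
        self._index = index

    def next_page(self) -> "FakePage":
        # Signal the end of the results with StopIteration, as in the example above.
        if self._index + 1 >= len(self._pages):
            raise StopIteration
        return FakePage(self._pages, self._index + 1)


def iter_all(first_page: Any) -> Iterator[Any]:
    """Yield every item across pages, following next_page() until StopIteration."""
    page = first_page
    while True:
        yield from page
        try:
            page = page.next_page()
        except StopIteration:
            # Caught here so it never escapes the generator body.
            return


all_items = list(iter_all(FakePage([["run-1", "run-2"], ["run-3"]])))
print(all_items)
```

Catching StopIteration inside the generator matters: if it escaped the generator body, Python would convert it into a RuntimeError.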

Pretraining API#

create_pretraining_run

Create a pretraining run.

mcli.create_pretraining_run(model, train_data, save_folder, *, compute=None, tokenizer=None, training_duration=None, parameters=None, eval=None, experiment_tracker=None, custom_weights_path=None, timeout=10, future=False)[source]#

Create a pretraining run.

Parameters
  • model – The name of the Hugging Face model to use. Required.

  • train_data – Either a list of paths to the training data or a mapping of dataset names to the path and proportion of the dataset to use. For example, if you have two datasets, dataset1 and dataset2, and you want to use 80% of dataset1 and 20% of dataset2, you can pass in {"dataset1": {"path": "path/to/dataset1", "proportion": .8}, "dataset2": {"path": "path/to/dataset2", "proportion": .2}}. Required.

  • save_folder – The remote location to save the checkpoints. For example, if your save_folder is s3://my-bucket/my-checkpoints, the Composer checkpoints will be saved to s3://my-bucket/my-checkpoints/<run-name>/checkpoints, and Hugging Face formatted checkpoints will be saved to s3://my-bucket/my-checkpoints/<run-name>/hf_checkpoints. The supported cloud provider prefixes are s3://, gs://, and oci://. Required.

  • compute – The compute configuration to use. Currently required.

  • tokenizer – Tokenizer configuration to use. If not provided, the default tokenizer for the model will be used.

  • training_duration – The total duration of your run. This can be specified in batches (e.g. 100ba), epochs (e.g. 10ep), or tokens (e.g. 1_000_000tok). Default is 1ep.

  • parameters –

    Additional parameters to pass to the model:

    • learning_rate: The peak learning rate to use. Default is 5e-7. The optimizer used is DecoupledLionW with betas of 0.90 and 0.95 and no weight decay, and the learning rate scheduler used is LinearWithWarmupSchedule with a warmup of 2% of the total training duration and a final learning rate multiplier of 0.

    • context_length: The maximum sequence length to use. This will be used to truncate any data that is too long. The default is the default for the provided Hugging Face model. We do not support extending the context length beyond each model’s default.

  • experiment_tracker – The configuration for an experiment tracker. For example, to add Weights and Biases tracking, you can pass in {"wandb": {"project": "my-project", "entity": "my-entity"}}. To add MLflow tracking, you can pass in {"mlflow": {"experiment_path": "my-experiment", "model_registry_path": "catalog.schema.model_name"}}.

  • eval – Configuration for evaluation

  • custom_weights_path – The remote location of a custom model checkpoint to resume from. If provided, these weights will be used instead of the original pretrained weights of the model. This must be a Composer checkpoint.

  • timeout – Time, in seconds, in which the call should complete. If the run creation takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

  • future – Return the output as a Future. If True, the call to create_pretraining_run will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.

Returns

A Run object containing the pretraining run information.
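
The training_duration strings described above combine a count with a unit suffix (ba, ep, or tok). As a rough local sanity check before submitting, they can be split with a small parser. This is only a sketch: parse_duration is a hypothetical helper, and the exact grammar the platform accepts is an assumption inferred from the three examples given (100ba, 10ep, 1_000_000tok).

```python
import re
from typing import Tuple

# A count made of digits and optional underscores, followed by one of the three units.
_DURATION = re.compile(r"^(\d[\d_]*)(ba|ep|tok)$")


def parse_duration(text: str) -> Tuple[int, str]:
    """Hypothetical helper: split a duration like '10ep' into (10, 'ep')."""
    match = _DURATION.match(text)
    if match is None:
        raise ValueError(f"Unrecognized duration: {text!r}")
    value, unit = match.groups()
    # Underscores are visual separators only, so drop them before converting.
    return int(value.replace("_", "")), unit


print(parse_duration("100ba"))
print(parse_duration("10ep"))
print(parse_duration("1_000_000tok"))
```

Strings without a recognized suffix, such as "10 hours", raise a ValueError in this sketch.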

Finetuning API#

create_finetuning_run

Finetunes a model on a finetuning dataset and converts the final composer checkpoint to a Hugging Face formatted checkpoint for inference.

Finetune

A Finetune that has been run on the MosaicML platform

mcli.create_finetuning_run(model, train_data_path, save_folder, *, task_type='INSTRUCTION_FINETUNE', eval_data_path=None, eval_prompts=None, custom_weights_path=None, training_duration=None, learning_rate=None, context_length=None, experiment_tracker=None, disable_credentials_check=None, timeout=10, future=False)[source]#

Finetunes a model on a finetuning dataset and converts the final composer checkpoint to a Hugging Face formatted checkpoint for inference.

Parameters
  • model – The name of the Hugging Face model to use.

  • train_data_path – The full remote location of your training data (e.g. ‘s3://my-bucket/my-data.jsonl’). For INSTRUCTION_FINETUNE, another option is to provide the name of a Hugging Face dataset that includes the train split, like ‘mosaicml/dolly_hhrlhf/train’. The data should be formatted with each row containing a ‘prompt’ and ‘response’ field for INSTRUCTION_FINETUNE, or in raw data format for CONTINUED_PRETRAIN.

  • save_folder – The remote location to save the finetuned checkpoints. For example, if your save_folder is s3://my-bucket/my-checkpoints, the finetuned Composer checkpoints will be saved to s3://my-bucket/my-checkpoints/<run-name>/checkpoints, and Hugging Face formatted checkpoints will be saved to s3://my-bucket/my-checkpoints/<run-name>/hf_checkpoints. The supported cloud provider prefixes are s3://, gs://, and oci://.

  • task_type – The type of finetuning task to run. Currently available options are INSTRUCTION_FINETUNE and CONTINUED_PRETRAIN; defaults to INSTRUCTION_FINETUNE.

  • eval_data_path – The remote location of your evaluation data (e.g. s3://my-bucket/my-data.jsonl). For INSTRUCTION_FINETUNE, the name of a Hugging Face dataset with the test split (e.g. mosaicml/dolly_hhrlhf/test) can also be given. The evaluation data should be formatted with each row containing a prompt and response field for INSTRUCTION_FINETUNE, or as raw data for CONTINUED_PRETRAIN. Default is None.

  • eval_prompts –

    A list of prompt strings to generate during training. Results will be logged to the experiment tracker(s) you’ve configured. Generations will occur at every model checkpoint with the following generation parameters:

    • max_new_tokens: 100

    • temperature: 1

    • top_k: 50

    • top_p: 0.95

    • do_sample: true

    Default is None (do not generate prompts).

  • custom_weights_path – The remote location of a custom model checkpoint to use for finetuning. If provided, these weights will be used instead of the original pretrained weights of the model. This must be a Composer checkpoint. Default is None.

  • training_duration – The total duration of your finetuning run. This can be specified in batches (e.g. 100ba), epochs (e.g. 10ep), or tokens (e.g. 1_000_000tok). Default is 1ep.

  • learning_rate – The peak learning rate to use for finetuning. Default is 5e-7. The optimizer used is DecoupledLionW with betas of 0.90 and 0.95 and no weight decay, and the learning rate scheduler used is LinearWithWarmupSchedule with a warmup of 2% of the total training duration and a final learning rate multiplier of 0.

  • context_length – The maximum sequence length to use. This will be used to truncate any data that is too long. The default is the default for the provided Hugging Face model. We do not support extending the context length beyond each model’s default.

  • experiment_tracker – The configuration for an experiment tracker. For example, to add Weights and Biases tracking, you can pass in {"wandb": {"project": "my-project", "entity": "my-entity"}}. To add MLflow tracking, you can pass in {"mlflow": {"experiment_path": "my-experiment", "model_registry_path": "catalog.schema.model_name"}}.

  • disable_credentials_check – Flag to disable checking credentials (S3, Databricks, etc.). If the credentials check is enabled (False), a preflight check will be run on finetune submission, running a few tests to ensure that the credentials provided are valid for the resources you are attempting to access (S3 buckets, Databricks experiments, etc.). If the credentials check fails, your finetune run will be stopped.

  • timeout – Time, in seconds, in which the call should complete. If the run creation takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

  • future – Return the output as a Future. If True, the call to create_finetuning_run will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Finetune output, use return_value.result() with an optional timeout argument.

Returns

A Finetune object containing the finetuning run information.
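
The learning_rate description above implies a simple schedule shape: a linear ramp to the peak over the first 2% of training, then a linear decay to the peak times the final multiplier (0 by default). The sketch below computes that shape in terms of optimizer steps. It illustrates the description only and is not the platform's or Composer's implementation; lr_at and its step-based units are assumptions made for illustration.

```python
def lr_at(step: int, total_steps: int, peak_lr: float = 5e-7,
          warmup_frac: float = 0.02, final_multiplier: float = 0.0) -> float:
    """Hypothetical helper: learning rate at `step` under linear warmup then linear decay."""
    warmup_steps = warmup_frac * total_steps
    if warmup_steps > 0 and step < warmup_steps:
        # Warmup: ramp linearly from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    decay_span = total_steps - warmup_steps
    # Fraction of the post-warmup phase that has elapsed, capped at 1.
    progress = min(1.0, (step - warmup_steps) / decay_span) if decay_span > 0 else 1.0
    # Decay: interpolate the multiplier linearly from 1 down to final_multiplier.
    return peak_lr * (1.0 + progress * (final_multiplier - 1.0))


# With 1000 total steps, warmup covers the first 20 steps.
for step in (0, 10, 20, 510, 1000):
    print(step, lr_at(step, total_steps=1000))
```

With the defaults, the rate reaches its peak of 5e-7 at the end of warmup and falls back to 0 at the final step.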

class mcli.Finetune(id, name, status, created_at, updated_at, created_by, started_at=None, completed_at=None, reason=None, estimated_end_time=None, model=None, save_folder=None, train_data_path=None, submitted_config=None, events=None, _required_properties=('id', 'name', 'status', 'createdByEmail', 'createdAt', 'updatedAt'))[source]#

A Finetune that has been run on the MosaicML platform

Parameters
  • id – The unique identifier for this finetuning run.

  • name – The name of the finetuning run.

  • status – The current status of the finetuning run. This is a RunStatus enum, which has values such as PENDING, RUNNING, or COMPLETED.

  • created_at – The timestamp at which the finetuning run was created.

  • updated_at – The timestamp at which the finetuning run was last updated.

  • created_by – The email address of the user who created the finetuning run.

  • started_at – The timestamp at which the finetuning run was started.

  • completed_at – The timestamp at which the finetuning run was completed.

  • reason – The reason for the finetuning run’s current status, such as Run completed successfully.