API Reference
Setup
Create a secret in the MosaicML platform
Deletes secrets from the MosaicML platform
Initialize the MosaicML platform
Gets a cluster available in the MosaicML platform
Get clusters available in the MosaicML platform
Set the api key for the MosaicML platform
Exceptions raised when a request to MAPI fails
Global Config Store persisted on local disk
- mcli.create_secret(secret, *, timeout=10, future=False)
Create a secret in the MosaicML platform
- Parameters
secret (Secret) – A Secret object to create
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to create_secret() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Secret output, use return_value.result() with an optional timeout argument.
- Raises
MAPIException – Raised if a communication error occurs while connecting to MAPI
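The timeout and future arguments recur on most calls in this reference, so a small sketch of the contract may help. It does not use mcli: `_slow_request` and `request` are hypothetical stand-ins, and the timeout error shown is the one raised by `concurrent.futures`, which is an assumption about what the library surfaces.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeoutError
from typing import Optional, Union

_executor = ThreadPoolExecutor(max_workers=2)


def _slow_request(delay: float) -> str:
    """Hypothetical stand-in for a request to the platform."""
    time.sleep(delay)
    return "created"


def request(delay: float, *, timeout: Optional[float] = 10,
            future: bool = False) -> Union[str, Future]:
    """Mimic the documented contract: block with a timeout, or return a Future."""
    pending = _executor.submit(_slow_request, delay)
    if future:
        # future=True takes precedence: return immediately and ignore timeout.
        return pending
    # Blocking mode: wait up to `timeout` seconds, otherwise raise a timeout error.
    return pending.result(timeout=timeout)


# Blocking call: the value is returned directly.
value = request(0.01)

# Background call: returns at once; the caller decides how long to wait.
pending = request(0.01, future=True)
background_value = pending.result(timeout=5)
```

With future=True the caller can start several requests first and collect each result later with its own timeout.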
- mcli.delete_secrets(secrets=None, *, timeout=10, future=False)
Deletes secrets from the MosaicML platform
- Parameters
secrets (Secret) – List of Secret objects or secret name strings to delete.
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to delete_secrets() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Secret output, use return_value.result() with an optional timeout argument.
- Raises
MAPIException – Raised if a communication error occurs while connecting to MAPI
- mcli.initialize(api_key=None)
Initialize the MosaicML platform
- Parameters
api_key – Optional value to set
- mcli.get_cluster(cluster, *, include_utilization=True, timeout=10, future=False)
Gets a cluster available in the MosaicML platform
- Parameters
cluster (ClusterDetails) – ClusterDetails object or cluster name string to get.
include_utilization (bool) – Include information on how the cluster is currently being utilized
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to get_cluster() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the ClusterDetails output, use return_value.result() with an optional timeout argument.
- Raises
MAPIException – Raised if a communication error occurs while connecting to MAPI
- mcli.get_clusters(clusters=None, *, include_utilization=False, timeout=10, future=False, submission_type_filter=None)
Get clusters available in the MosaicML platform
- Parameters
clusters (Cluster) – List of Cluster objects or cluster name strings to get.
include_utilization (bool) – Include information on how the cluster is currently being utilized
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to get_clusters() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the ClusterDetails output, use return_value.result() with an optional timeout argument.
- Raises
MAPIException – Raised if a communication error occurs while connecting to MAPI
- mcli.set_api_key(api_key)
Set the api key for the MosaicML platform
- Parameters
api_key – value to set
- class mcli.MAPIException(status, message='Unknown Error', description=None)
Exceptions raised when a request to MAPI fails
- Parameters
status – The status code for the exception
message – A brief description of the error
description – An optional longer description of the error
Details: MAPI responds to failures with the following status codes:
- 400: The request was misconfigured or missing an argument. Double-check the API and try again.
- 401: User credentials were either missing or invalid. Be sure to set your API key before making a request.
- 403: User credentials were valid, but the requested action is not allowed.
- 404: Could not find the requested resource(s).
- 409: Attempted to create an object with a name that already exists. Change the name and try again.
- 500: Internal error in MAPI. Please report the issue.
- 503: MAPI or a subcomponent is currently offline. Please report the issue.
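As a rough illustration of acting on these status codes, the sketch below maps each documented code to a short hint and flags the ones the text asks users to report. It is self-contained and does not import mcli; `MAPI_STATUS_HINTS`, `hint_for_status`, and `should_report` are hypothetical helper names. Reading the code from a caught exception through a `status` attribute is an assumption based on the constructor argument of the same name.

```python
from typing import Dict

# Hints paraphrased from the status-code list in this reference.
MAPI_STATUS_HINTS: Dict[int, str] = {
    400: "Request misconfigured or missing an argument",
    401: "Credentials missing or invalid; set your API key",
    403: "Credentials valid, but the action is not allowed",
    404: "Requested resource(s) not found",
    409: "An object with that name already exists; change the name",
    500: "Internal error in MAPI",
    503: "MAPI or a subcomponent is offline",
}


def hint_for_status(status: int) -> str:
    """Return a short hint for a documented status code."""
    return MAPI_STATUS_HINTS.get(status, "Unrecognized status code")


def should_report(status: int) -> bool:
    """The reference asks users to report 500 and 503 failures."""
    return status in (500, 503)


# Possible use with the real library (assumes the exception exposes `.status`):
#     try:
#         ...  # some mcli call
#     except MAPIException as err:
#         print(hint_for_status(err.status))
```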
Runs
Launch a run in the MosaicML platform
Delete a run in the MosaicML platform
Delete a list of runs in the MosaicML platform
Follow the logs for an active or completed run in the MosaicML platform
Get the current logs for an active or completed run
Get a run that has been launched in the MosaicML platform
List runs that have been launched in the MosaicML platform
Start a run
Start a list of runs
Stop a run
Stop a list of runs
Update a run's metadata in the MosaicML platform.
Update a run's data in the MosaicML platform.
Wait for a launched run to reach a specific status
Watch a launched run and retrieve a new Run object every time its status updates
A run that has been launched on the MosaicML platform
A run configuration for the MosaicML platform
Possible statuses of a run
Typed dictionary for nested scheduling configurations
- mcli.create_run(run, *, timeout=10, future=False)
Launch a run in the MosaicML platform
The provided run must contain enough information to fully detail the run
- Parameters
run – A fully-configured run to launch. The run will be queued and persisted in the run database.
timeout – Time, in seconds, in which the call should complete. If the run creation takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future – Return the output as a Future. If True, the call to create_run will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.
- Returns
A Run that includes the launched run details and the run status
- mcli.delete_run(run, *, timeout=10, future=False)
Delete a run in the MosaicML platform
If a run is currently running, it will first be stopped.
- Parameters
run – A run to delete
timeout – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future – Return the output as a Future. If True, the call to delete_run will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.
- Returns
The Run object for the run that was deleted
- mcli.delete_runs(runs, *, timeout=10, future=False)
Delete a list of runs in the MosaicML platform
Any runs that are currently running will first be stopped.
- Parameters
runs – A list of runs or run names to delete
timeout – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future – Return the output as a Future. If True, the call to delete_runs will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the list of Run objects, use return_value.result() with an optional timeout argument.
- Returns
A list of Run objects for the runs that were deleted
- mcli.follow_run_logs(run, rank=None, *, timeout=None, future=False, resumption=None, tail=None, container=None)
Follow the logs for an active or completed run in the MosaicML platform
This returns a generator of individual log lines, line-by-line, and will wait until new lines are produced if the run is still active.
- Parameters
run (str | Run) – The run to get logs for. If a name is provided, the remaining required run details will be queried with get_runs().
rank (Optional[int]) – Node rank of a run to get logs for. Defaults to the lowest available rank. This will usually be rank 0 unless something has gone wrong.
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored. A run may take some time to generate logs, so you likely do not want to set a timeout.
future (bool) – Return the output as a Future. If True, the call to follow_run_logs() will return immediately and the request will be processed in the background. The generator returned by the concurrent.futures.Future will yield a concurrent.futures.Future for each new log string returned from the cloud. This takes precedence over the timeout argument. To get the generator, use return_value.result() with an optional timeout argument and log_future.result() for each new log string.
resumption (Optional[int]) – Resumption (0-indexed) of a run to get logs for. Defaults to the last resumption.
tail (Optional[int]) – Number of characters to read from the end of the log. Defaults to reading the entire log.
container (Optional[str]) – Container name of a run to get logs for. Defaults to the MAIN container.
- Returns
If future is False – A line-by-line Generator of the logs for a run
Otherwise – A Future of a line-by-line generator of the logs for a run
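Because the non-future form yields log lines one at a time, ordinary iterator tools apply to it. The sketch below shows two common patterns on a stand-in generator; `fake_log_lines`, `last_lines`, and `first_match` are hypothetical names, and no connection to the platform is made.

```python
from collections import deque
from typing import Iterable, Iterator, List, Optional


def fake_log_lines() -> Iterator[str]:
    """Hypothetical stand-in for the generator returned when following logs."""
    for step in range(5):
        yield f"step {step}: loss={1.0 / (step + 1):.3f}"


def last_lines(lines: Iterable[str], count: int) -> List[str]:
    """Consume a line generator, keeping only the final `count` lines."""
    return list(deque(lines, maxlen=count))


def first_match(lines: Iterable[str], needle: str) -> Optional[str]:
    """Stop reading as soon as a line containing `needle` appears."""
    for line in lines:
        if needle in line:
            return line
    return None


tail = last_lines(fake_log_lines(), 2)
hit = first_match(fake_log_lines(), "step 2")
```

Note that `last_lines` only returns once the generator is exhausted, so on a still-active run it would block until the run stops producing logs; `first_match` returns as soon as the line is seen.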
- mcli.get_run_logs(run, rank=None, *, timeout=None, future=False, failed=False, resumption=None, tail=None, container=None)
Get the current logs for an active or completed run
Get the current logs for an active or completed run in the MosaicML platform. This returns the full logs as a str, as they exist at the time the request is made. If you want to follow the logs for an active run line-by-line, use follow_run_logs().
- Parameters
run (str | Run) – The run to get logs for. If a name is provided, the remaining required run details will be queried with get_runs().
rank (Optional[int]) – Node rank of a run to get logs for. Defaults to the lowest available rank. This will usually be rank 0 unless something has gone wrong.
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to get_run_logs() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the log text, use return_value.result() with an optional timeout argument.
failed (bool) – Return the logs of the first failed rank for the provided resumption if True. False by default.
resumption (Optional[int]) – Resumption (0-indexed) of a run to get logs for. Defaults to the last resumption.
tail (Optional[int]) – Number of characters to read from the end of the log. Defaults to reading the entire log.
container (Optional[str]) – Container name of a run to get logs for. Defaults to the MAIN container.
- Returns
- mcli.get_run(run, *, timeout=10, future=False, include_details=True)
Get a run that has been launched in the MosaicML platform
The run will contain all details requested
- Parameters
run – Run on which to get information
timeout – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future – Return the output as a Future. If True, the call to get_run will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the run, use return_value.result() with an optional timeout argument.
include_details – If true, will fetch detailed information like run input for the run.
- Raises
MAPIException – Raised if a communication error occurs while connecting to MAPI
- mcli.get_runs(runs=None, *, cluster_names=None, before=None, after=None, gpu_types=None, gpu_nums=None, statuses=None, timeout=10, future=False, clusters=None, user_emails=None, run_types=None, include_details=False, limit=None, include_interactive=None)
List runs that have been launched in the MosaicML platform
The returned list will contain all of the details stored about the requested runs.
- Parameters
runs – List of runs on which to get information
cluster_names – List of cluster names to filter runs. This can be a list of str or Cluster objects. Only runs submitted to these clusters will be returned.
before – Only runs created strictly before this time will be returned. This can be a str in ISO 8601 format (e.g. 2023-03-31T12:23:04.34+05:30) or a datetime object.
after – Only runs created at or after this time will be returned. This can be a str in ISO 8601 format (e.g. 2023-03-31T12:23:04.34+05:30) or a datetime object.
gpu_types – List of gpu types to filter runs. This can be a list of str or GPUType enums. Only runs scheduled on these GPUs will be returned.
gpu_nums – List of gpu counts to filter runs. Only runs scheduled on this number of GPUs will be returned.
statuses – List of run statuses to filter runs. This can be a list of str or RunStatus enums. Only runs currently in these phases will be returned.
user_emails – List of user emails to filter runs. Only runs submitted by these users will be returned. By default, will return runs submitted by the current user. Requires shared runs or admin permission.
run_types – List of run types to filter runs:
- ‘INTERACTIVE’: Runs created with the mcli interactive command
- ‘HERO_RUN’: Runs created with is_hero_run in the metadata
- ‘TRAINING’: All other runs
timeout – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future – Return the output as a Future. If True, the call to get_runs will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the list of runs, use return_value.result() with an optional timeout argument.
include_details – If true, will fetch detailed information like run input for each run.
limit – Maximum number of runs to return. If None, all runs will be returned.
- Raises
MAPIException – Raised if a communication error occurs while connecting to MAPI
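The before and after filters describe a half-open time window: after is inclusive and before is strict. The sketch below applies that rule to local stand-in records, purely to make the boundary behaviour concrete; `RunRecord` and `in_window` are hypothetical, and the real filtering presumably happens on the service side.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, NamedTuple, Optional


class RunRecord(NamedTuple):
    """Hypothetical stand-in for a run with a creation time."""
    name: str
    created_at: datetime


def in_window(records: Iterable[RunRecord], *,
              before: Optional[datetime] = None,
              after: Optional[datetime] = None) -> List[RunRecord]:
    """Keep records with after <= created_at < before."""
    kept = []
    for record in records:
        if before is not None and not record.created_at < before:
            continue  # "strictly before": a record exactly at `before` is dropped
        if after is not None and not record.created_at >= after:
            continue  # "at or after": a record exactly at `after` is kept
        kept.append(record)
    return kept


# Timezone-aware datetimes; the +05:30 offset matches the ISO 8601 example above.
IST = timezone(timedelta(hours=5, minutes=30))
start = datetime(2023, 3, 31, 12, 0, tzinfo=IST)
end = datetime(2023, 3, 31, 13, 0, tzinfo=IST)

records = [
    RunRecord("early", start - timedelta(minutes=1)),
    RunRecord("at-start", start),
    RunRecord("middle", start + timedelta(minutes=30)),
    RunRecord("at-end", end),
]
selected = [r.name for r in in_window(records, before=end, after=start)]

# An ISO 8601 string for the same instant, usable where a str is accepted.
start_iso = start.isoformat()
```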
- mcli.start_run(run, *, timeout=10, future=False)
Start a run
Start a run currently stopped in the MosaicML platform.
- Parameters
run (Optional[str | Run]) – A run or run name to start
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to start_run() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.
- Raises
MAPIException – Raised if starting the requested run failed. A successfully started run will have the status RunStatus.PENDING.
- Returns
- mcli.start_runs(runs, *, timeout=10, future=False)
Start a list of runs
Start a list of runs currently stopped in the MosaicML platform.
- Parameters
runs (Optional[List[str] | List[Run]]) – A list of runs or run names to start
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to start_runs() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the list of Run objects, use return_value.result() with an optional timeout argument.
- Raises
MAPIException – Raised if starting any of the requested runs failed. All successfully started runs will have the status RunStatus.PENDING. You can freely retry any started and unstarted runs if this error is raised due to a connection issue.
- Returns
- mcli.stop_run(run, *, reason=None, timeout=10, future=False)
Stop a run
Stop a run currently running in the MosaicML platform.
- Parameters
run (Optional[str | Run]) – A run or run name to stop. Using Run objects is most efficient. See the note below.
reason (Optional[str]) – A reason for stopping the run
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to stop_run() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.
- Raises
MAPIException – Raised if stopping the requested run failed. A successfully stopped run will have the status RunStatus.STOPPED.
- Returns
- mcli.stop_runs(runs, *, reason=None, timeout=10, future=False)
Stop a list of runs
Stop a list of runs currently running in the MosaicML platform.
- Parameters
runs (Optional[List[str] | List[Run]]) – A list of runs or run names to stop. Using Run objects is most efficient. See the note below.
reason (Optional[str]) – A reason for stopping the run
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to stop_runs() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the list of Run objects, use return_value.result() with an optional timeout argument.
- Raises
MAPIException – Raised if stopping any of the requested runs failed. All successfully stopped runs will have the status RunStatus.STOPPED. You can freely retry any stopped and unstopped runs if this error is raised due to a connection issue.
- Returns
- mcli.update_run(run, update_run_data=None, *, preemptible=None, priority=None, max_retries=None, retry_on_system_failure=None, timeout=10, future=False, max_duration=None)
Update a run’s data in the MosaicML platform.
Any values that are not specified will not be modified.
- Parameters
run (Optional[str | Run]) – A run or run name to update. Using Run objects is most efficient. See the note below.
update_run_data (Dict[str, Any]) – DEPRECATED: Use the individual named-arguments instead. The data to update the run with. This can include preemptible, priority, maxRetries, and retryOnSystemFailure
preemptible (bool) – Update whether the run can be stopped and re-queued by higher priority jobs; default is False
priority (str) – Update the priority of the run to low, medium, or high; default is medium
max_retries (int) – Update the max number of times the run can be retried; default is 0
retry_on_system_failure (bool) – Update whether the run should be retried on system failure (i.e. a node failure); default is False
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
max_duration – Update the max time that a run can run for (in hours).
future (bool) – Return the output as a Future. If True, the call to update_run() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.
- Raises
MAPIException – Raised if updating the requested run failed
- Returns
- mcli.update_run_metadata(run, metadata, *, timeout=10, future=False, protect=False)
Update a run’s metadata in the MosaicML platform.
- Parameters
run (Optional[str | Run]) – A run or run name to update. Using Run objects is most efficient. See the note below.
metadata (Dict[str, Any]) – The metadata to update the run with. This will be merged with the existing metadata. Keys not specified in this dictionary will not be modified.
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to update_run_metadata() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.
protect (bool) – If True, the call will be protected from SIGTERMs to allow it to complete reliably. Defaults to False.
- Raises
MAPIException – Raised if updating the requested run failed
- Returns
- mcli.wait_for_run_status(run, status, timeout=None, future=False)
Wait for a launched run to reach a specific status
- Parameters
run (str | Run) – The run whose status should be watched. This can be provided using the run’s name or an existing Run object.
status (str | RunStatus) – Status to wait for. This can be any valid RunStatus value. If the status is short-lived, or the run terminates, it is possible the run will reach a LATER status than the one requested. If the run never reaches this state (e.g. it stops early or the wait times out), then an error will be raised. See exception details below.
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to wait_for_run_status() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.
- Raises
MAPIException – Raised if the run does not exist or there is an issue connecting to the MAPI service.
RunStatusNotReached – Raised in the event that the watch closes before the run reaches the desired status. If this happens, the connection to MAPI may have dropped, so try again.
TimeoutError – Raised if the run did not reach the correct status in the specified time
- Returns
If future is False – A Run object once it has reached the requested status
Otherwise – A Future for the run. This will not resolve until the run reaches the requested status
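The note that a run may be returned in a LATER status than the one requested can be sketched with plain strings. This is not the library's implementation: `STATUS_ORDER` assumes the statuses are ordered as they are listed in this reference, and `StatusNotReached` and `wait_for_status` are hypothetical local names.

```python
from typing import Iterable, List

# Assumed ordering: the order in which the statuses appear in this reference.
STATUS_ORDER: List[str] = [
    "PENDING", "QUEUED", "STARTING", "RUNNING", "TERMINATING",
    "COMPLETED", "STOPPED", "FAILED", "UNKNOWN",
]


class StatusNotReached(Exception):
    """Hypothetical local error for a watch that ends too early."""


def wait_for_status(updates: Iterable[str], target: str) -> str:
    """Return the first observed status at or past `target`."""
    target_rank = STATUS_ORDER.index(target)
    for status in updates:
        if STATUS_ORDER.index(status) >= target_rank:
            # A short-lived target may be skipped, so a later status can be returned.
            return status
    raise StatusNotReached(f"updates ended before reaching {target}")


# RUNNING was never observed, so the later COMPLETED status is returned.
skipped = wait_for_status(["PENDING", "STARTING", "COMPLETED"], "RUNNING")
exact = wait_for_status(["PENDING", "RUNNING"], "RUNNING")
```

Callers that need the exact status should therefore check the status of the returned object rather than assume it equals the one requested.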
- mcli.watch_run_status(run, timeout=None, future=False)
Watch a launched run and retrieve a new Run object every time its status updates
- Parameters
run (str | Run) – The run whose status should be watched. This can be provided using the run’s name or an existing Run object.
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored. A run may take some time to change statuses (especially to go from RUNNING to COMPLETED), so you likely do not want to set a timeout.
future (bool) – Return the output as a Future. If True, each iteration will yield a Future for the next updated Run object. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument. With futures, you can easily watch multiple Runs in parallel. NOTE: If you set future=True, you should wrap your return_value.result() in a try: ... except StopAsyncIteration to catch the end of the iteration.
- Raises
MAPIException – Raised if the run could not be found or if there is an issue contacting the MAPI service
TimeoutError – Raised if the run did not reach the correct status in the specified time
- Yields
If future is False – A Run object at each status update
Otherwise – A Future for the run. This will not resolve until the run reaches a new status
- class mcli.Run(run_uid, name, status, created_at, updated_at, created_by, priority, preemptible, retry_on_system_failure, cluster, gpus, gpu_type, cpus, node_count, latest_resumption, max_retries=None, reason=None, nodes=<factory>, submitted_config=None, metadata=None, last_resumption_id=None, resumptions=<factory>, lifecycle=<factory>, image=None, max_duration=None, _required_properties=('id', 'name', 'status', 'createdAt', 'updatedAt', 'reason', 'createdByEmail', 'priority', 'preemptible', 'retryOnSystemFailure', 'resumptions'))
A run that has been launched on the MosaicML platform
- Parameters
run_uid (str) – Unique identifier for the run
name (str) – User-defined name of the run
status (RunStatus) – Status of the run at a moment in time
created_at (datetime) – Date and time when the run was created
updated_at (datetime) – Date and time when the run was last updated
created_by (str) – Email of the user who created the run
priority (str) – Priority of the run
preemptible (bool) – Whether the run can be stopped and re-queued by higher priority jobs
retry_on_system_failure (bool) – Whether the run should be retried on system failure
cluster (str) – Cluster the run is running on
gpus (int) – Number of GPUs the run is using
gpu_type (str) – Type of GPU the run is using
cpus (int) – Number of CPUs the run is using
node_count (int) – Number of nodes the run is using
latest_resumption (Resumption) – Latest resumption of the run
max_retries (Optional[int]) – Maximum number of times the run can be retried
reason (Optional[str]) – Reason the run was stopped
nodes (List[Node]) – Nodes the run is using
submitted_config (Optional[RunConfig]) – Submitted run configuration
metadata (Optional[Dict[str, Any]]) – Metadata associated with the run
last_resumption_id (Optional[str]) – ID of the last resumption of the run
resumptions (List[Resumption]) – Resumptions of the run
lifecycle (List[RunLifecycle]) – Lifecycle of the run
image (Optional[str]) – Image the run is using
- clone(name=None, image=None, cluster=None, instance=None, nodes=None, gpu_type=None, gpus=None, priority=None, preemptible=None, max_retries=None, max_duration=None)
Submits a new run with the same configuration as this run
- Parameters
name (str) – Override the name of the run
image (str) – Override the image of the run
cluster (str) – Override the cluster of the run
instance (str) – Override the instance of the run
nodes (int) – Override the number of nodes of the run
gpu_type (str) – Override the GPU type of the run
gpus (int) – Override the number of GPUs of the run
priority (str) – Override the priority of the run
preemptible (bool) – Override whether the run can be stopped and re-queued by higher priority jobs
max_retries (int) – Override the max number of times the run can be retried
max_duration (float) – Override the max duration (in hours) that a run can run for
- Returns
New Run object
- refresh()
Refreshes the data on the run object
- Returns
Refreshed Run object
- update(preemptible=None, priority=None, max_retries=None, retry_on_system_failure=None, max_duration=None)
Updates the run’s data
- Parameters
preemptible (bool) – Update whether the run can be stopped and re-queued by higher priority jobs; default is False
priority (str) – Update the priority of the run to low, medium, or high; default is medium
max_retries (int) – Update the max number of times the run can be retried; default is 0
retry_on_system_failure (bool) – Update whether the run should be retried on system failure (i.e. a node failure); default is False
- Returns
Updated Run object
- update_metadata(metadata)
Updates the run’s metadata
- Parameters
metadata (Dict[str, Any]) – The metadata to update the run with. This will be merged with the existing metadata. Keys not specified in this dictionary will not be modified.
- Returns
Updated Run object
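The merge rule for metadata (new keys are added, given keys are overwritten, unspecified keys are left alone) can be sketched with a plain dictionary. `merge_metadata` is a hypothetical helper, and the sketch assumes a shallow merge; this reference does not say whether nested dictionaries are merged recursively.

```python
from typing import Any, Dict


def merge_metadata(existing: Dict[str, Any], update: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where keys in `update` win and all other keys are kept."""
    merged = dict(existing)   # copy, so the original mapping is not mutated
    merged.update(update)     # shallow merge (assumption)
    return merged


current = {"project": "demo", "owner": "alice"}
result = merge_metadata(current, {"owner": "bob", "stage": "eval"})
```

Here `result` keeps "project", replaces "owner", and gains "stage", while `current` is unchanged.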
- class mcli.RunConfig(name=None, parent_name=None, image=None, gpu_type=None, gpu_num=None, cpus=None, cluster=None, scheduling=<factory>, compute=<factory>, parameters=<factory>, integrations=<factory>, env_variables=<factory>, metadata=<factory>, command='', dependent_deployment=<factory>, run_name=None, entrypoint='', partitions=None, optimization_level=None, platform=None)
A run configuration for the MosaicML platform
Values in here are not yet validated and some required values may be missing. On attempting to create the run, a bad config will raise a MAPIException with a 400 status code.
- Required args:
name (str): User-defined name of the run
image (str): Docker image (e.g. mosaicml/composer)
command (str): Command to use when a run starts
compute (ComputeConfig or Dict[str, Any]): Compute configuration. Typically a subset of the following fields will be required:
  cluster (str): Name of cluster to use
  instance (str): Name of instance to use
  gpu_type (str): Name of gpu type to use
  gpus (int): Number of GPUs to use
  cpus (int): Number of CPUs to use
  nodes (int): Number of nodes to use
  See mcli get clusters for a list of available clusters and instances
- Optional args:
parameters (Dict[str, Any]): Parameters to mount into the environment
scheduling (SchedulingConfig or Dict[str, Any]): Scheduling configuration
  priority (str): Priority of the run
  preemptible (bool): Whether the run is preemptible (default False)
  retry_on_system_failure (bool): Whether the run should be retried on system failure (default False)
  max_retries (int): Maximum number of retries (default 0)
  max_duration (float): Maximum duration of the run in hours (default None). Run will be automatically stopped after this duration has elapsed.
integrations (List[Dict[str, Any]]): List of integrations. See integration documentation for more details: https://docs.mosaicml.com/projects/mcli/en/latest/resources/integrations/index.html
env_variables (List[Dict[str, str]]): List of environment variables. Each dict should have:
  key (str): Name of the environment variable
  value (str): Value of the environment variable
metadata (Dict[str, Any]): Arbitrary metadata to attach to the run
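Since values are not validated until the run is created, a local check for the required fields can catch an incomplete configuration early. The sketch below lays the documented fields out as a plain dictionary; every value is a placeholder, and `REQUIRED_KEYS` and `missing_required` are hypothetical helpers rather than part of the library.

```python
from typing import Any, Dict, List

# The four required args named in this reference.
REQUIRED_KEYS = ("name", "image", "command", "compute")

# Placeholder values throughout; the cluster name is invented for illustration.
example_config: Dict[str, Any] = {
    "name": "hello-world",
    "image": "mosaicml/composer",
    "command": "echo hello",
    "compute": {"cluster": "my-cluster", "gpus": 8},
    "scheduling": {
        "priority": "medium",
        "preemptible": False,
        "max_retries": 0,
        "max_duration": 1.5,  # hours
    },
    "env_variables": [{"key": "LOG_LEVEL", "value": "INFO"}],
    "metadata": {"project": "demo"},
}


def missing_required(config: Dict[str, Any]) -> List[str]:
    """List required keys that are absent or empty in `config`."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


problems = missing_required({"name": "incomplete"})
```

A check like this only covers presence; whether the values themselves are acceptable is decided by the platform when the run is submitted.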
- class mcli.RunStatus(value)
Possible statuses of a run
- PENDING = 'PENDING'
The run has been submitted and is waiting to be scheduled
- QUEUED = 'QUEUED'
The run is awaiting execution
- STARTING = 'STARTING'
The run is starting up and preparing to run
- RUNNING = 'RUNNING'
The run is actively running
- TERMINATING = 'TERMINATING'
The run is in the process of being terminated
- COMPLETED = 'COMPLETED'
The run has finished without any errors
- STOPPED = 'STOPPED'
The run has stopped
- FAILED = 'FAILED'
The run has failed due to an issue at runtime
- UNKNOWN = 'UNKNOWN'
A valid run status cannot be found
- before(other, inclusive=False)[source]#
Returns True if this state usually comes “before” the other
- Parameters
other – Another RunStatus
inclusive – If True, equality evaluates to True. Default False.
- Returns
If this state is “before” the other
Example
>>> RunStatus.RUNNING.before(RunStatus.COMPLETED)
True
>>> RunStatus.PENDING.before(RunStatus.RUNNING)
True
- after(other, inclusive=False)[source]#
Returns True if this state usually comes “after” the other
- Parameters
other – Another RunStatus
inclusive – If True, equality evaluates to True. Default False.
- Returns
If this state is “after” the other
Example
>>> RunStatus.COMPLETED.after(RunStatus.RUNNING)
True
>>> RunStatus.RUNNING.after(RunStatus.PENDING)
True
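The "usually comes before/after" comparison can be pictured as a lookup into a fixed sequence of states. The following is a minimal, self-contained sketch of that idea, not the mcli implementation: `SketchStatus` is a hypothetical stand-in class, it covers only some of the statuses, and it assumes the states are ordered in the sequence in which they are listed above.

```python
from enum import Enum


class SketchStatus(Enum):
    """Hypothetical stand-in for RunStatus; not the mcli class."""

    PENDING = "PENDING"
    QUEUED = "QUEUED"
    STARTING = "STARTING"
    RUNNING = "RUNNING"
    TERMINATING = "TERMINATING"
    COMPLETED = "COMPLETED"

    @property
    def order(self) -> int:
        # Assumption: a state's rank is its position in the declaration order.
        return list(type(self)).index(self)

    def before(self, other: "SketchStatus", inclusive: bool = False) -> bool:
        """Return True if this state usually comes before `other`."""
        if inclusive and self is other:
            return True
        return self.order < other.order

    def after(self, other: "SketchStatus", inclusive: bool = False) -> bool:
        """Return True if this state usually comes after `other`."""
        if inclusive and self is other:
            return True
        return self.order > other.order
```

With this sketch, comparing a state with itself is False unless `inclusive=True` is passed, which mirrors the `inclusive` parameter described above.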
Finetunes#
create_finetuning_run | Finetunes a model on a finetuning dataset and converts the final composer checkpoint to a Hugging Face formatted checkpoint for inference.
Finetune | A Finetune that has been run on the MosaicML platform
- mcli.create_finetuning_run(model, train_data_path, save_folder, *, task_type='INSTRUCTION_FINETUNE', eval_data_path=None, eval_prompts=None, custom_weights_path=None, training_duration=None, learning_rate=None, context_length=None, experiment_trackers=None, disable_credentials_check=None, timeout=10, future=False)[source]#
Finetunes a model on a finetuning dataset and converts the final composer checkpoint to a Hugging Face formatted checkpoint for inference.
- Parameters
model – The name of the Hugging Face model to use.
train_data_path – The full remote location of your training data (e.g. ‘s3://my-bucket/my-data.jsonl’). For INSTRUCTION_FINETUNE, another option is to provide the name of a Hugging Face dataset that includes the train split, like ‘mosaicml/dolly_hhrlhf/train’. The data should be formatted with each row containing a ‘prompt’ and ‘response’ field for INSTRUCTION_FINETUNE, or in raw data format for CONTINUED_PRETRAIN.
save_folder – The remote location to save the finetuned checkpoints. For example, if your save_folder is s3://my-bucket/my-checkpoints, the finetuned Composer checkpoints will be saved to s3://my-bucket/my-checkpoints/<run-name>/checkpoints, and Hugging Face formatted checkpoints will be saved to s3://my-bucket/my-checkpoints/<run-name>/hf_checkpoints. The supported cloud provider prefixes are s3://, gs://, and oci://.
task_type – The type of finetuning task to run. The currently available options are INSTRUCTION_FINETUNE and CONTINUED_PRETRAIN. Defaults to INSTRUCTION_FINETUNE.
eval_data_path – The remote location of your evaluation data (e.g. s3://my-bucket/my-data.jsonl). For INSTRUCTION_FINETUNE, the name of a Hugging Face dataset with the test split (e.g. mosaicml/dolly_hhrlhf/test) can also be given. The evaluation data should be formatted with each row containing a prompt and response field for INSTRUCTION_FINETUNE, or in raw data format for CONTINUED_PRETRAIN. Default is None.
eval_prompts – A list of prompt strings to generate during training. Results will be logged to the experiment tracker(s) you’ve configured. Generations will occur at every model checkpoint with the following generation parameters:
max_new_tokens: 100
temperature: 1
top_k: 50
top_p: 0.95
do_sample: true
Default is None (do not generate prompts).
custom_weights_path – The remote location of a custom model checkpoint to use for finetuning. If provided, these weights will be used instead of the original pretrained weights of the model. This must be a Composer checkpoint. Default is None.
training_duration – The total duration of your finetuning run. This can be specified in batches (e.g. 100ba), epochs (e.g. 10ep), or tokens (e.g. 1_000_000tok). Default is 1ep.
learning_rate – The peak learning rate to use for finetuning. Default is 5e-7. The optimizer used is DecoupledLionW with betas of 0.99 and 0.95 and no weight decay, and the learning rate scheduler used is LinearWithWarmupSchedule with a warmup of 2% of the total training duration and a final learning rate multiplier of 0.
context_length – The maximum sequence length to use. This will be used to truncate any data that is too long. The default is the default for the provided Hugging Face model. We do not support extending the context length beyond each model’s default.
experiment_trackers – A list of experiment tracker configurations. For example, to add Weights and Biases tracking, you can pass in {'integration_type': 'wandb', 'project': 'my-project', 'entity': 'my-entity'}.
disable_credentials_check – Flag to disable checking credentials (S3, Databricks, etc.). If the credentials check is enabled (False), a preflight check will be run on finetune submission, running a few tests to ensure that the credentials provided are valid for the resources you are attempting to access (S3 buckets, Databricks experiments, etc.). If the credentials check fails, your finetune run will be stopped.
timeout – Time, in seconds, in which the call should complete. If the run creation takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future – Return the output as a Future. If True, the call to create_finetuning_run will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Finetune output, use return_value.result() with an optional timeout argument.
- Returns
A Finetune object containing the finetuning run information.
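As an illustration of how these arguments fit together, here is a minimal sketch that assembles keyword arguments for create_finetuning_run and derives the checkpoint locations described under save_folder. The helpers `build_finetune_kwargs`, `expected_checkpoint_paths`, and `submit` are hypothetical names introduced for this example, as is the model name used in the usage note; only the parameter names and the path layout come from the description above.

```python
from typing import Any, Dict, Optional

# Cloud provider prefixes listed in the save_folder description.
SUPPORTED_PREFIXES = ("s3://", "gs://", "oci://")


def build_finetune_kwargs(
    model: str,
    train_data_path: str,
    save_folder: str,
    training_duration: Optional[str] = None,
    learning_rate: Optional[float] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: collect arguments for create_finetuning_run."""
    if not save_folder.startswith(SUPPORTED_PREFIXES):
        raise ValueError(f"save_folder must start with one of {SUPPORTED_PREFIXES}")
    kwargs: Dict[str, Any] = {
        "model": model,
        "train_data_path": train_data_path,
        "save_folder": save_folder,
    }
    # Leave unset options out so the documented defaults apply.
    if training_duration is not None:
        kwargs["training_duration"] = training_duration
    if learning_rate is not None:
        kwargs["learning_rate"] = learning_rate
    return kwargs


def expected_checkpoint_paths(save_folder: str, run_name: str) -> Dict[str, str]:
    """Return the two checkpoint locations described for save_folder."""
    base = f"{save_folder.rstrip('/')}/{run_name}"
    return {
        "composer": f"{base}/checkpoints",
        "hugging_face": f"{base}/hf_checkpoints",
    }


def submit(kwargs: Dict[str, Any]):
    """Submit the run; requires mcli to be installed and configured."""
    # Imported lazily so the helpers above can be used without mcli.
    import mcli

    return mcli.create_finetuning_run(**kwargs)
```

For example, `build_finetune_kwargs("my-org/my-model", "s3://my-bucket/my-data.jsonl", "s3://my-bucket/my-checkpoints", training_duration="10ep")` yields a dictionary that could be passed to `submit`, and `expected_checkpoint_paths` shows where the Composer and Hugging Face checkpoints would be expected once the run name is known.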
- class mcli.Finetune(id, name, status, created_at, updated_at, created_by, started_at=None, completed_at=None, reason=None, estimated_end_time=None, model=None, save_folder=None, train_data_path=None, submitted_config=None, events=None, _required_properties=('id', 'name', 'status', 'createdByEmail', 'createdAt', 'updatedAt'))[source]#
A Finetune that has been run on the MosaicML platform
- Parameters
id – The unique identifier for this finetuning run.
name – The name of the finetuning run.
status – The current status of the finetuning run. This is a RunStatus enum, which has values such as PENDING, RUNNING, or COMPLETED.
created_at – The timestamp at which the finetuning run was created.
updated_at – The timestamp at which the finetuning run was last updated.
created_by – The email address of the user who created the finetuning run.
started_at – The timestamp at which the finetuning run was started.
completed_at – The timestamp at which the finetuning run was completed.
reason – The reason for the finetuning run’s current status, such as Run completed successfully.
Deployments#
create_inference_deployment | Launch an inference deployment in the MosaicML platform
delete_inference_deployment | Delete an inference deployment in the MosaicML Cloud
delete_inference_deployments | Delete a list of inference deployments in the MosaicML Cloud
get_inference_deployment_logs | Get the current logs for an active or completed inference deployment
get_inference_deployment | Gets a single inference deployment that has been launched in the MosaicML platform
get_inference_deployments | List inference deployments that have been launched in the MosaicML platform
InferenceDeployment | A deployment that has been launched on the MosaicML Cloud
InferenceDeploymentConfig | A deployment configuration for the MosaicML Cloud
ping | Pings an inference deployment that has been launched in the MosaicML platform and returns the status of the deployment.
predict | Sends input to '/predict' endpoint of an inference deployment on the MosaicML platform.
update_inference_deployment | Updates a single inference deployment that has been launched in the MosaicML platform
update_inference_deployments | Updates a list of inference deployments that have been launched in the MosaicML platform
- mcli.create_inference_deployment(deployment, *, timeout=10, future=False)[source]#
Launch an inference deployment in the MosaicML platform
The provided deployment must contain enough information to fully detail the inference deployment
- Parameters
deployment – A fully-configured inference deployment to launch. The deployment will be queued and persisted in the deployment database.
timeout – Time, in seconds, in which the call should complete. If the deployment creation takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future – Return the output as a Future. If True, the call to create_inference_deployment will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the InferenceDeployment output, use return_value.result() with an optional timeout argument.
- Returns
An InferenceDeployment that includes the launched deployment details and the deployment status
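The timeout and future arguments recur throughout this reference, so a small illustration of the pattern may help. The sketch below is not mcli code: `slow_create` is a hypothetical stand-in for a request that takes a moment to finish, `create` is a hypothetical wrapper, and it assumes the returned Future behaves like a standard `concurrent.futures.Future`.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Union

# A single background worker is enough for this illustration.
_executor = ThreadPoolExecutor(max_workers=1)


def slow_create(name: str) -> str:
    """Hypothetical stand-in for a request that takes a moment to finish."""
    time.sleep(0.05)
    return f"created:{name}"


def create(name: str, *, timeout: float = 10, future: bool = False) -> Union[str, Future]:
    """Hypothetical wrapper showing the timeout/future calling convention."""
    pending = _executor.submit(slow_create, name)
    if future:
        # Return immediately; the timeout argument is ignored in this mode.
        return pending
    # Block until the work finishes; a timeout error is raised if it takes too long.
    return pending.result(timeout=timeout)
```

A blocking call such as `create("demo")` returns the value directly, while `create("demo", future=True)` returns a Future whose value is read later with `.result(timeout=...)`, which is the same shape as the `return_value.result()` usage described above.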
- mcli.delete_inference_deployment(deployment, *, timeout=10, future=False)[source]#
Delete an inference deployment in the MosaicML Cloud
If it is currently running the deployment will first be stopped.
- Parameters
deployment – An inference deployment or inference deployment name to delete
timeout – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future – Return the output as a Future. If True, the call to delete_inference_deployment will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the InferenceDeployment output, use return_value.result() with an optional timeout argument.
- Returns
The InferenceDeployment that was deleted
- mcli.delete_inference_deployments(deployments, *, timeout=10, future=False)[source]#
Delete a list of inference deployments in the MosaicML Cloud
Any deployments that are currently running will first be stopped.
- Parameters
deployments – A list of inference deployments or inference deployment names to delete
timeout – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future – Return the output as a Future. If True, the call to delete_inference_deployments will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the InferenceDeployment output, use return_value.result() with an optional timeout argument.
- Returns
A list of InferenceDeployment objects for the inference deployments that were deleted
- mcli.get_inference_deployment_logs(deployment, *, restart=None, timeout=None, future=False, failed=False, follow=False, tail=None)[source]#
Get the current logs for an active or completed inference deployment
Get the current logs for an active or completed inference deployment in the MosaicML platform. This returns the full logs as a str, as they exist at the time the request is made.
- Parameters
deployment (str | InferenceDeployment) – The inference deployment to get logs for. If a name is provided, the remaining required deployment details will be queried with get_inference_deployments().
restart (Optional[int]) – Which restart of an inference deployment to get logs for. Defaults to the most recent deployment restart.
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to get_inference_deployment_logs() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the log text, use return_value.result() with an optional timeout argument.
failed (bool) – Return the logs of the latest failed deployment if True. False by default.
follow (bool) – Return the logs of the inference deployment as they are produced if True. Defaults to False.
- Returns
- mcli.get_inference_deployment(deployment=None, *, timeout=10, future=False)[source]#
Gets a single inference deployment that has been launched in the MosaicML platform
The returned object will contain all of the details stored about the requested deployment.
- Parameters
deployment – Inference deployment object or name string
timeout – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future – Return the output as a Future. If True, the call to get_inference_deployment will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the deployment, use return_value.result() with an optional timeout argument.
- Raises
MAPIException – Raised when a MAPI communication error occurs
- mcli.get_inference_deployments(deployments=None, *, clusters=None, before=None, after=None, gpu_types=None, gpu_nums=None, statuses=None, timeout=10, future=False)[source]#
List inference deployments that have been launched in the MosaicML platform
The returned list will contain all of the details stored about the requested deployments.
- Parameters
deployments – List of inference deployments on which to get information
clusters – List of clusters to filter inference deployments. This can be a list of str or Cluster objects. Only deployments submitted to these clusters will be returned.
before – Only inference deployments created strictly before this time will be returned. This can be a str in ISO 8601 format (e.g. 2023-03-31T12:23:04.34+05:30) or a datetime object.
after – Only inference deployments created at or after this time will be returned. This can be a str in ISO 8601 format (e.g. 2023-03-31T12:23:04.34+05:30) or a datetime object.
gpu_types – List of GPU types to filter inference deployments. This can be a list of str or GPUType enums. Only deployments scheduled on these GPUs will be returned.
gpu_nums – List of gpu counts to filter inference deployments. Only deployments scheduled on this number of GPUs will be returned.
statuses – List of inference deployment statuses to filter deployments. This can be a list of str. Only deployments currently in these phases will be returned.
timeout – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future – Return the output as a Future. If True, the call to get_inference_deployments will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the list of deployments, use return_value.result() with an optional timeout argument.
- Raises
MAPIException – If connecting to MAPI, raised when a MAPI communication error occurs
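The before and after filters differ in whether the boundary is included: before is strict, while after includes the boundary time. The sketch below illustrates those semantics on plain dictionaries. It is a local illustration only and says nothing about how the platform applies the filters; `to_datetime`, `filter_by_created`, and the `created_at` key on the records are hypothetical. The sample timestamps use whole seconds, since older Python versions accept only a subset of ISO 8601 in `datetime.fromisoformat`.

```python
from datetime import datetime
from typing import Any, Dict, List, Optional, Union

TimeLike = Union[str, datetime]


def to_datetime(value: TimeLike) -> datetime:
    """Accept either a datetime or an ISO 8601 string."""
    if isinstance(value, datetime):
        return value
    return datetime.fromisoformat(value)


def filter_by_created(
    records: List[Dict[str, Any]],
    before: Optional[TimeLike] = None,
    after: Optional[TimeLike] = None,
) -> List[Dict[str, Any]]:
    """Keep records created strictly before `before` and at or after `after`."""
    upper = to_datetime(before) if before is not None else None
    lower = to_datetime(after) if after is not None else None
    kept = []
    for record in records:
        created = to_datetime(record["created_at"])
        if upper is not None and not created < upper:
            continue  # strict upper bound: the boundary itself is excluded
        if lower is not None and created < lower:
            continue  # inclusive lower bound: the boundary itself is kept
        kept.append(record)
    return kept
```

Given three records created at 12:00:00, 12:23:04, and 13:00:00 on the same day, passing the 12:23:04 timestamp as `before` keeps only the first record, and passing it as `after` keeps the second and third.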
- class mcli.InferenceDeployment(deployment_uid, name, status, created_at, updated_at, config, created_by, public_dns='', current_version=0, deleted_at=None, submitted_config=None, replicas=<factory>, _required_properties=('id', 'name', 'status', 'createdAt', 'updatedAt', 'inferenceDeploymentInput', 'publicDNS'))[source]#
A deployment that has been launched on the MosaicML Cloud
- Parameters
deployment_uid (str) – Unique identifier for the deployment
name (str) – User-defined name of the deployment
status (DeploymentStatus) – Status of the deployment at a moment in time
created_at (datetime) – Date and time when the deployment was created
updated_at (datetime) – Date and time when the deployment was last updated
config (DeploymentConfig) – The deployment configuration that was used to launch the deployment
- refresh()[source]#
Refreshes the data on the deployment object
- Returns
Refreshed InferenceDeployment object
- class mcli.InferenceDeploymentConfig(name=None, gpu_type=None, gpu_num=None, cluster=None, image=None, replicas=None, command=None, metadata=None, env_variables=<factory>, integrations=<factory>, model=None, default_model=None, batching=<factory>, compute=<factory>, rate_limit=None)[source]#
A deployment configuration for the MosaicML Cloud
Values in here are not yet validated and some required values may be missing.
- Parameters
name (Optional[str]) – User-defined name of the deployment
gpu_type (Optional[str]) – GPU type (optional if only one gpu type for your cluster)
gpu_num (Optional[int]) – Number of GPUs
image (Optional[str]) – Docker image (e.g. mosaicml/composer)
command (str) – Command to use when a deployment starts
env_variables (List[Dict[str, str]]) – List of environment variables
integrations (List[Dict[str, Any]]) – List of integrations
compute (ComputeConfig) – The compute to use for the inference deployment.
replicas (Optional[int]) – Number of replicas to create
batching (BatchingConfig) – The dynamic batching configuration.
cluster (Optional[str]) – Deprecated. Cluster to use (optional if you only have one)
- mcli.ping(deployment, *, timeout=10)[source]#
Pings an inference deployment that has been launched in the MosaicML platform and returns the status of the deployment. The deployment must have a ‘/ping’ endpoint defined.
- Parameters
deployment – The deployment to check the status of. Can be an InferenceDeployment object, the name of a deployment, or a string of the form https://<deployment dns>.
timeout – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised.
- Raises
HTTPError – If pinging the endpoint fails
MAPIException – If connecting to MAPI, raised when a MAPI communication error occurs
- mcli.predict(deployment, inputs, *, timeout=60, stream=False)[source]#
Sends input to ‘/predict’ endpoint of an inference deployment on the MosaicML platform. Runs prediction on input and returns output produced by the model.
- Parameters
deployment – The deployment to make a prediction with. Can be an InferenceDeployment object, the name of a deployment, or a string of the form https://<deployment dns>.
inputs – Input data to run prediction on, in the form of a dictionary
timeout – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised.
stream – If True, the response will be streamed and a generator will be returned. Streaming supports only a single input at a time.
- Raises
HTTPError – If sending the request to the endpoint fails
MAPIException – If connecting to MAPI, raised when a MAPI communication error occurs
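To make the '/predict' endpoint concrete, the sketch below builds (but does not send) an HTTP request for a deployment address of the form https://<deployment dns>. It is an illustration under stated assumptions rather than a description of what predict does internally: the helper `build_predict_request` is hypothetical, the JSON body and Content-Type header are assumed, the shape of the inputs dictionary is invented for the example, and any authentication the platform requires is omitted.

```python
import json
import urllib.request
from typing import Any, Dict


def build_predict_request(base_url: str, inputs: Dict[str, Any]) -> urllib.request.Request:
    """Hypothetical helper: prepare a POST to <base_url>/predict."""
    if not base_url.startswith("https://"):
        raise ValueError("base_url should look like https://<deployment dns>")
    url = base_url.rstrip("/") + "/predict"
    # Assumption: the endpoint accepts a JSON-encoded dictionary.
    body = json.dumps(inputs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The resulting request object could be passed to `urllib.request.urlopen` with a `timeout` argument; a '/ping' request could be prepared the same way by changing the path.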
- mcli.update_inference_deployment(deployment, updates, *, timeout=10, future=False)[source]#
Updates a single inference deployment that has been launched in the MosaicML platform
Any deployments that are currently running will not be interrupted.
- Parameters
deployment – An inference deployment or inference deployment name to update
updates – A dictionary of inference deployment fields to update (e.g. {"image": "new_image", "replicas": 2})
timeout – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future – Return the output as a Future. If True, the call to update_inference_deployment will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the InferenceDeployment output, use return_value.result() with an optional timeout argument.
- Raises
MAPIException – Raised if updating the deployment failed
- Returns
The InferenceDeployment for the deployment that was updated
- mcli.update_inference_deployments(deployments, updates, *, timeout=10, future=False)[source]#
Updates a list of inference deployments that have been launched in the MosaicML platform
Any deployments that are currently running will not be interrupted.
- Parameters
deployments – A list of inference deployments or inference deployment names to update
updates – A dictionary of inference deployment fields to update (e.g. {"image": "new_image", "replicas": 2})
timeout – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future – Return the output as a Future. If True, the call to update_inference_deployments will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the InferenceDeployment output, use return_value.result() with an optional timeout argument.
- Raises
MAPIException – Raised if updating the deployments failed
- Returns
A list of InferenceDeployment objects for the deployments that were updated