Python SDK
Setup
mcli.create_secret | Create a secret in the MosaicML platform
mcli.delete_secrets | Deletes secrets from the MosaicML platform
mcli.initialize | Initialize the MosaicML platform
mcli.get_cluster | Gets a cluster available in the MosaicML platform
mcli.get_clusters | Get clusters available in the MosaicML platform
mcli.set_api_key | Set the API key for the MosaicML platform
mcli.MAPIException | Exceptions raised when a request to MAPI fails
Global Config Store persisted on local disk
- mcli.create_secret(secret, *, timeout=10, future=False)
Create a secret in the MosaicML platform
- Parameters
secret (Secret): A Secret object to create.
timeout (Optional[float]): Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool): Return the output as a Future. If True, the call to create_secret() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Secret output, use return_value.result() with an optional timeout argument.
- Raises
MAPIException: Raised when a communication error occurs while connecting to MAPI
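The timeout and future arguments follow the same pattern in every SDK call documented here. The sketch below imitates those documented semantics with a hypothetical stand-in, `call_with_future`, built on the standard library's `concurrent.futures`. It is not the SDK's implementation, `fake_request` is a placeholder for real work, and nothing here contacts the platform.

```python
from concurrent.futures import Future, ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=2)


def call_with_future(work, *, timeout=10, future=False):
    """Hypothetical stand-in that mirrors the documented timeout/future behavior."""
    pending = _executor.submit(work)
    if future:
        # Return immediately; `timeout` is ignored, as the parameter docs describe.
        return pending
    # Block until the work finishes; Future.result raises a timeout error
    # if the work does not complete within `timeout` seconds.
    return pending.result(timeout=timeout)


def fake_request():
    # Placeholder for a request to the platform (assumption, no network I/O).
    return "secret-created"


# Blocking style: the value comes back directly.
blocking = call_with_future(fake_request, timeout=5)

# Future style: the call returns at once, and the caller decides when to wait.
background = call_with_future(fake_request, future=True)
deferred = background.result(timeout=5)

_executor.shutdown(wait=True)
```

With the real SDK, the equivalent of `background.result(timeout=5)` is the `return_value.result()` call mentioned in the parameter descriptions.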
- mcli.delete_secrets(secrets=None, *, timeout=10, future=False)
Deletes secrets from the MosaicML platform
- Parameters
secrets (Secret): List of Secret objects or secret name strings to delete.
timeout (Optional[float]): Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool): Return the output as a Future. If True, the call to delete_secrets() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Secret output, use return_value.result() with an optional timeout argument.
- Raises
MAPIException: Raised when a communication error occurs while connecting to MAPI
- mcli.initialize(api_key=None)
Initialize the MosaicML platform
- Parameters
api_key: Optional value to set
- mcli.get_cluster(cluster, *, include_utilization=True, include_all=False, timeout=10, future=False)
Gets a cluster available in the MosaicML platform
- Parameters
cluster (ClusterDetails): ClusterDetails object or cluster name string to get.
include_utilization (bool): Include information on how the cluster is currently being utilized.
timeout (Optional[float]): Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool): Return the output as a Future. If True, the call to get_cluster() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the ClusterDetails output, use return_value.result() with an optional timeout argument.
- Raises
MAPIException: Raised when a communication error occurs while connecting to MAPI
- mcli.get_clusters(clusters=None, *, include_utilization=False, include_all=False, timeout=10, future=False, submission_type_filter=None)
Get clusters available in the MosaicML platform
- Parameters
clusters (ClusterDetails): List of ClusterDetails objects or cluster name strings to get.
include_utilization (bool): Include information on how the cluster is currently being utilized.
timeout (Optional[float]): Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool): Return the output as a Future. If True, the call to get_clusters() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the ClusterDetails output, use return_value.result() with an optional timeout argument.
- Raises
MAPIException: Raised when a communication error occurs while connecting to MAPI
- mcli.set_api_key(api_key)
Set the API key for the MosaicML platform
- Parameters
api_key: Value to set
- class mcli.MAPIException(status, message='Unknown Error', description=None)
Exceptions raised when a request to MAPI fails
- Parameters
status: The status code for the exception
message: A brief description of the error
description: An optional longer description of the error
Details: MAPI responds to failures with the following status codes:
- 400: The request was misconfigured or missing an argument. Double-check the API and try again.
- 401: User credentials were either missing or invalid. Be sure to set your API key before making a request.
- 403: User credentials were valid, but the requested action is not allowed.
- 404: Could not find the requested resource(s).
- 409: Attempted to create an object with a name that already exists. Change the name and try again.
- 500: Internal error in MAPI. Please report the issue.
- 503: MAPI or a subcomponent is currently offline. Please report the issue.
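The status codes listed for MAPIException can be turned into a small lookup for error handling. The following is a minimal sketch: `MAPI_STATUS_HINTS` and `describe_status` are hypothetical names introduced for illustration and are not part of the SDK. The fallback text reuses the default `message` value from the class signature.

```python
# Status codes and meanings as listed in the MAPIException documentation.
MAPI_STATUS_HINTS = {
    400: "Request misconfigured or missing an argument",
    401: "Credentials missing or invalid; set your API key",
    403: "Credentials valid, but the action is not allowed",
    404: "Requested resource(s) not found",
    409: "An object with that name already exists",
    500: "Internal error in MAPI; report the issue",
    503: "MAPI or a subcomponent is offline; report the issue",
}


def describe_status(status: int) -> str:
    """Hypothetical helper: map a status code to a short hint."""
    return MAPI_STATUS_HINTS.get(status, "Unknown Error")


hint = describe_status(409)
```

In an `except MAPIException` handler, a helper like this could be called with the `status` value the exception was constructed with.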
Runs
mcli.create_run | Launch a run in the MosaicML platform
mcli.create_interactive_run | Launch an interactive run in the MosaicML platform
mcli.delete_run | Delete a run in the MosaicML platform
mcli.delete_runs | Delete a list of runs in the MosaicML platform
mcli.follow_run_logs | Follow the logs for an active or completed run in the MosaicML platform
mcli.get_run_logs | Get the current logs for an active or completed run
mcli.get_run | Get a run that has been launched in the MosaicML platform
mcli.get_runs | List runs that have been launched in the MosaicML platform
mcli.start_run | Start a run
mcli.start_runs | Start a list of runs
mcli.stop_run | Stop a run
mcli.stop_runs | Stop a list of runs
mcli.update_run_metadata | Update a run's metadata in the MosaicML platform.
mcli.update_run | Update a run's data in the MosaicML platform.
mcli.wait_for_run_status | Wait for a launched run to reach a specific status
mcli.watch_run_status | Watch a launched run and retrieve a new Run object every time its status updates
mcli.Run | A run that has been launched on the MosaicML platform
mcli.RunConfig | A run configuration for the MosaicML platform
mcli.RunStatus | Possible statuses of a run
mcli.ComputeConfig | Typed dictionary for nested compute requests
mcli.SchedulingConfig | Typed dictionary for nested scheduling configurations
- mcli.create_run(run, *, timeout=10, future=False)
Launch a run in the MosaicML platform
The provided run must contain enough information to fully detail the run.
- Parameters
run: A fully-configured run to launch. The run will be queued and persisted in the run database.
timeout: Time, in seconds, in which the call should complete. If the run creation takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future: Return the output as a Future. If True, the call to create_run will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.
- Returns
A Run that includes the launched run details and the run status
- mcli.create_interactive_run(run, *, timeout=10, seconds=None, future=False)
Launch an interactive run in the MosaicML platform
Users are not required to provide a name, image, or "hours" variable for an interactive run. If these variables are not provided, they will be filled in with defaults. If the user provides a value for the "command" variable, it will be overwritten with sleep <hours>, where <hours> is the value of the "hours" variable.
- Parameters
run: A fully-configured run to launch. The run will be queued and persisted in the run database.
timeout: Time, in seconds, in which the call should complete. If the run creation takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future: Return the output as a Future. If True, the call to create_interactive_run will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.
hours: How many hours an interactive run can sleep for until MORC marks it as completed.
- Returns
A Run that includes the launched run details and the run status
- mcli.delete_run(run, *, timeout=10, future=False)
Delete a run in the MosaicML platform
If a run is currently running, it will first be stopped.
- Parameters
run: A run to delete
timeout: Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future: Return the output as a Future. If True, the call to delete_run will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.
- Returns
The Run that was deleted
- mcli.delete_runs(runs, *, timeout=10, future=False)
Delete a list of runs in the MosaicML platform
Any runs that are currently running will first be stopped.
- Parameters
runs: A list of runs or run names to delete
timeout: Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future: Return the output as a Future. If True, the call to delete_runs will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the list of Run objects, use return_value.result() with an optional timeout argument.
- Returns
A list of Run objects for the runs that were deleted
- mcli.follow_run_logs(run, rank=None, *, node_rank=None, local_gpu_rank=None, global_gpu_rank=None, timeout=None, future=False, resumption=None, tail=None, container=None)
Follow the logs for an active or completed run in the MosaicML platform
This returns a generator of individual log lines, line-by-line, and will wait until new lines are produced if the run is still active.
- Parameters
run (str | Run): The run to get logs for. If a name is provided, the remaining required run details will be queried with get_runs().
rank (Optional[int]): Node rank of a run to get logs for. Defaults to the lowest available rank. This will usually be rank 0 unless something has gone wrong.
timeout (Optional[float]): Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored. A run may take some time to generate logs, so you likely do not want to set a timeout.
future (bool): Return the output as a Future. If True, the call to follow_run_logs() will return immediately and the request will be processed in the background. The generator returned by the Future will yield a Future for each new log string returned from the cloud. This takes precedence over the timeout argument. To get the generator, use return_value.result() with an optional timeout argument, and log_future.result() for each new log string.
resumption (Optional[int]): Resumption (0-indexed) of a run to get logs for. Defaults to the last resumption.
tail (Optional[int]): Number of chars to read from the end of the log. Defaults to reading the entire log.
container (Optional[str]): Container name of a run to get logs for. Defaults to the MAIN container.
- Returns
If future is False: A line-by-line Generator of the logs for a run
Otherwise: A Future of a line-by-line generator of the logs for a run
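When future is True, follow_run_logs() is documented as returning a Future of a generator whose items are themselves Futures. The sketch below shows how that two-level unwrapping might look. `fake_follow_run_logs` and `resolved` are hypothetical stand-ins built from the standard library so the example runs without the SDK or a live run.

```python
from concurrent.futures import Future


def resolved(value):
    """Return an already-completed Future holding `value`."""
    completed = Future()
    completed.set_result(value)
    return completed


def fake_follow_run_logs(lines):
    """Hypothetical stand-in: a Future of a generator of per-line Futures."""
    def generate():
        for line in lines:
            yield resolved(line)
    return resolved(generate())


return_value = fake_follow_run_logs(["epoch 0 done", "epoch 1 done"])

# First unwrap the generator, then unwrap each log line as it arrives.
log_generator = return_value.result(timeout=5)
collected = [log_future.result(timeout=5) for log_future in log_generator]
```

With future left as False, the documented return value is a plain generator, so a simple `for line in ...` loop would be enough.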
- mcli.get_run_logs(run, rank=None, *, node_rank=None, local_gpu_rank=None, global_gpu_rank=None, timeout=None, future=False, failed=False, resumption=None, tail=None, container=None)
Get the current logs for an active or completed run
Get the current logs for an active or completed run in the MosaicML platform. This returns the full logs as a str, as they exist at the time the request is made. If you want to follow the logs for an active run line-by-line, use follow_run_logs().
- Parameters
run (str | Run): The run to get logs for. If a name is provided, the remaining required run details will be queried with get_runs().
rank (Optional[int]): [DEPRECATED, use node_rank instead] Node rank of a run to get logs for. Defaults to the lowest available rank. This will usually be rank 0 unless something has gone wrong.
node_rank (Optional[int]): Specifies the node rank within a multi-node run to fetch logs for. Defaults to the lowest available rank. Indexing starts from 0.
local_gpu_rank (Optional[int]): Specifies the GPU rank on the specified node to fetch logs for. Cannot be used with global_gpu_rank. Indexing starts from 0. Note: GPU rank logs are only available for runs using Composer and/or LLM Foundry and MAIN container logs.
global_gpu_rank (Optional[int]): Specifies the global GPU rank to fetch logs for. Cannot be used with node_rank and local_gpu_rank. Indexing starts from 0. Note: GPU rank logs are only available for runs using Composer and/or LLM Foundry and MAIN container logs.
timeout (Optional[float]): Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool): Return the output as a Future. If True, the call to get_run_logs() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the log text, use return_value.result() with an optional timeout argument.
failed (bool): Return the logs of the first failed rank for the provided resumption if True. False by default.
resumption (Optional[int]): Resumption (0-indexed) of a run to get logs for. Defaults to the last resumption.
tail (Optional[int]): Number of chars to read from the end of the log. Defaults to reading the entire log.
container (Optional[str]): Container name of a run to get logs for. Defaults to the MAIN container.
- Returns
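The rank arguments of get_run_logs() carry documented constraints: global_gpu_rank cannot be combined with node_rank or local_gpu_rank, and all ranks are 0-indexed. A caller could check these before sending a request. The sketch below is a hypothetical client-side check, not SDK code; rejecting negative values is an assumption drawn from the 0-indexing note.

```python
from typing import Optional


def check_rank_args(node_rank: Optional[int] = None,
                    local_gpu_rank: Optional[int] = None,
                    global_gpu_rank: Optional[int] = None) -> None:
    """Hypothetical pre-flight check of the documented rank constraints."""
    if global_gpu_rank is not None and (
            node_rank is not None or local_gpu_rank is not None):
        raise ValueError(
            "global_gpu_rank cannot be combined with node_rank or local_gpu_rank")
    ranks = {
        "node_rank": node_rank,
        "local_gpu_rank": local_gpu_rank,
        "global_gpu_rank": global_gpu_rank,
    }
    for name, value in ranks.items():
        # Assumption: ranks are 0-indexed, so negative values are rejected.
        if value is not None and value < 0:
            raise ValueError(f"{name} is 0-indexed and cannot be negative")


# A node rank plus a GPU rank local to that node is an allowed combination.
check_rank_args(node_rank=0, local_gpu_rank=1)
```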
- mcli.get_run(run, *, timeout=10, future=False, include_details=True)
Get a run that has been launched in the MosaicML platform
The run will contain all details requested.
- Parameters
run: Run on which to get information
timeout: Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future: Return the output as a Future. If True, the call to get_run will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the run, use return_value.result() with an optional timeout argument.
include_details: If true, will fetch detailed information like run input for the run.
- Raises
MAPIException: Raised when a communication error occurs while connecting to MAPI
- mcli.get_runs(runs=None, *, cluster_names=None, before=None, after=None, gpu_types=None, gpu_nums=None, statuses=None, timeout=10, future=False, user_emails=None, run_types=None, include_details=False, include_deleted=False, ended_before=None, ended_after=None, limit=100)
List runs that have been launched in the MosaicML platform
The returned list will contain all of the details stored about the requested runs.
- Parameters
runs: List of runs on which to get information
cluster_names: List of cluster names to filter runs. This can be a list of str or Cluster objects. Only runs submitted to these clusters will be returned.
before: Only runs created strictly before this time will be returned. This can be a str in ISO 8601 format (e.g. 2023-03-31T12:23:04.34+05:30) or a datetime object.
after: Only runs created at or after this time will be returned. This can be a str in ISO 8601 format (e.g. 2023-03-31T12:23:04.34+05:30) or a datetime object.
gpu_types: List of GPU types to filter runs. This can be a list of str or GPUType enums. Only runs scheduled on these GPUs will be returned.
gpu_nums: List of GPU counts to filter runs. Only runs scheduled on this number of GPUs will be returned.
statuses: List of run statuses to filter runs. This can be a list of str or RunStatus enums. Only runs currently in these phases will be returned.
user_emails: List of user emails to filter runs. Only runs submitted by these users will be returned. By default, will return runs submitted by the current user. Requires shared runs or admin permission.
run_types: List of run types to filter runs:
  "INTERACTIVE": Runs created with the mcli interactive command
  "HERO_RUN": Runs created with is_hero_run in the metadata
  "TRAINING": All other runs
timeout: Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future: Return the output as a Future. If True, the call to get_runs will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the list of runs, use return_value.result() with an optional timeout argument.
include_details: If true, will fetch detailed information like run input for each run.
include_deleted: If true, will include deleted runs in the response.
ended_before: Only runs ended strictly before this time will be returned.
ended_after: Only runs ended at or after this time will be returned.
limit: Maximum number of runs to return. If None, the latest 100 runs will be returned.
- Raises
MAPIException: Raised when a communication error occurs while connecting to MAPI
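The before and after filters of get_runs() differ at the boundary: before is strict, while after is inclusive. The sketch below reproduces that comparison on the client side with the standard library, purely to illustrate the documented semantics. `created_in_window` is a hypothetical helper, and how the platform evaluates the filters internally is not described in this reference.

```python
from datetime import datetime
from typing import Optional, Union

TimeLike = Union[str, datetime]


def _as_datetime(value: TimeLike) -> datetime:
    # Accept either an ISO 8601 string or a datetime, as the parameters do.
    return datetime.fromisoformat(value) if isinstance(value, str) else value


def created_in_window(created_at: TimeLike,
                      before: Optional[TimeLike] = None,
                      after: Optional[TimeLike] = None) -> bool:
    """Hypothetical helper: strict `before`, inclusive `after`."""
    created = _as_datetime(created_at)
    if before is not None and not created < _as_datetime(before):
        return False
    if after is not None and not created >= _as_datetime(after):
        return False
    return True


# Timezone-aware timestamps compare correctly across UTC offsets.
cutoff = "2023-03-31T12:23:04+05:30"
on_boundary_after = created_in_window(cutoff, after=cutoff)
on_boundary_before = created_in_window(cutoff, before=cutoff)
```

A run created exactly at the cutoff therefore matches `after=cutoff` but not `before=cutoff`. Mixing timezone-aware and naive datetimes in one comparison raises a TypeError in Python, so it helps to keep all timestamps timezone-aware.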
- mcli.start_run(run, *, timeout=10, future=False)
Start a run
Start a run currently stopped in the MosaicML platform.
- Parameters
run (Optional[str | Run]): A run or run name to start
timeout (Optional[float]): Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool): Return the output as a Future. If True, the call to start_run() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.
- Raises
MAPIException: Raised if starting the requested run failed. A successfully started run will have the status RunStatus.PENDING.
- Returns
- mcli.start_runs(runs, *, timeout=10, future=False)
Start a list of runs
Start a list of runs currently stopped in the MosaicML platform.
- Parameters
runs (Optional[List[str] | List[Run]]): A list of runs or run names to start
timeout (Optional[float]): Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool): Return the output as a Future. If True, the call to start_runs() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the list of Run output, use return_value.result() with an optional timeout argument.
- Raises
MAPIException: Raised if starting any of the requested runs failed. All successfully started runs will have the status RunStatus.PENDING. You can freely retry any started and unstarted runs if this error is raised due to a connection issue.
- Returns
- mcli.stop_run(run, *, reason=None, timeout=10, future=False)
Stop a run
Stop a run currently running in the MosaicML platform.
- Parameters
run (Optional[str | Run]): A run or run name to stop. Using Run objects is most efficient. See the note below.
reason (Optional[str]): A reason for stopping the run
timeout (Optional[float]): Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool): Return the output as a Future. If True, the call to stop_run() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.
- Raises
MAPIException: Raised if stopping the requested run failed. A successfully stopped run will have the status RunStatus.STOPPED.
- Returns
- mcli.stop_runs(runs, *, reason=None, timeout=10, future=False)
Stop a list of runs
Stop a list of runs currently running in the MosaicML platform.
- Parameters
runs (Optional[List[str] | List[Run]]): A list of runs or run names to stop. Using Run objects is most efficient. See the note below.
reason (Optional[str]): A reason for stopping the runs
timeout (Optional[float]): Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool): Return the output as a Future. If True, the call to stop_runs() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the list of Run output, use return_value.result() with an optional timeout argument.
- Raises
MAPIException: Raised if stopping any of the requested runs failed. All successfully stopped runs will have the status RunStatus.STOPPED. You can freely retry any stopped and unstopped runs if this error is raised due to a connection issue.
- Returns
- mcli.update_run(run, update_run_data=None, *, preemptible=None, priority=None, max_retries=None, retry_on_system_failure=None, timeout=10, future=False, max_duration=None)
Update a run's data in the MosaicML platform.
Any values that are not specified will not be modified.
- Parameters
run (Optional[str | Run]): A run or run name to update. Using Run objects is most efficient. See the note below.
update_run_data (Dict[str, Any]): DEPRECATED: Use the individual named arguments instead. The data to update the run with. This can include preemptible, priority, maxRetries, and retryOnSystemFailure.
preemptible (bool): Update whether the run can be stopped and re-queued by higher priority jobs; default is False
priority (str): Update the default priority of the run from auto to low or lowest
max_retries (int): Update the max number of times the run can be retried; default is 0
retry_on_system_failure (bool): Update whether the run should be retried on system failure (i.e. a node failure); default is False
timeout (Optional[float]): Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
max_duration: Update the max time that a run can run for (in hours).
future (bool): Return the output as a Future. If True, the call to update_run() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.
- Raises
MAPIException: Raised if updating the requested run failed
- Returns
- mcli.update_run_metadata(run, metadata, *, timeout=10, future=False, protect=False)
Update a run's metadata in the MosaicML platform.
- Parameters
run (Optional[str | Run]): A run or run name to update. Using Run objects is most efficient. See the note below.
metadata (Dict[str, Any]): The metadata to update the run with. This will be merged with the existing metadata. Keys not specified in this dictionary will not be modified.
timeout (Optional[float]): Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool): Return the output as a Future. If True, the call to update_run_metadata() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.
protect (bool): If True, the call will be protected from SIGTERMs to allow it to complete reliably. Defaults to False.
- Raises
MAPIException: Raised if updating the requested run failed
- Returns
- mcli.wait_for_run_status(run, status, timeout=None, future=False)
Wait for a launched run to reach a specific status
- Parameters
run (str | Run): The run whose status should be watched. This can be provided using the run's name or an existing Run object.
status (str | RunStatus): Status to wait for. This can be any valid RunStatus value. If the status is short-lived, or the run terminates, it is possible the run will reach a LATER status than the one requested. If the run never reaches this state (e.g. it stops early or the wait times out), then an error will be raised. See exception details below.
timeout (Optional[float]): Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool): Return the output as a Future. If True, the call to wait_for_run_status() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.
- Raises
MAPIException: Raised if the run does not exist or there is an issue connecting to the MAPI service.
RunStatusNotReached: Raised in the event that the watch closes before the run reaches the desired status. If this happens, the connection to MAPI may have dropped, so try again.
TimeoutError: Raised if the run did not reach the correct status in the specified time
- Returns
If future is False: A Run object once it has reached the requested status
Otherwise: A Future for the run. This will not resolve until the run reaches the requested status
- mcli.watch_run_status(run, timeout=None, future=False)
Watch a launched run and retrieve a new Run object every time its status updates
- Parameters
run (str | Run): The run whose status should be watched. This can be provided using the run's name or an existing Run object.
timeout (Optional[float]): Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored. A run may take some time to change statuses (especially to go from RUNNING to COMPLETED), so you likely do not want to set a timeout.
future (bool): Return the output as a Future. If True, each iteration will yield a Future for the next updated Run object. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument. With futures, you can easily watch multiple Runs in parallel. NOTE: If you set future=True, you should wrap your return_value.result() in a try: ... except StopAsyncIteration to catch the end of the iteration.
- Raises
MAPIException: Raised if the run could not be found or if there is an issue contacting the MAPI service
TimeoutError: Raised if the run did not reach the correct status in the specified time
- Yields
If future is False: A Run object at each status update
Otherwise: A Future for the run. This will not resolve until the run reaches a new status
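The note on future=True asks callers to wrap return_value.result() in a try/except for StopAsyncIteration. The sketch below shows one way that loop might be written. `fake_watch_run_status` is a hypothetical stand-in that yields completed Futures and ends with one that raises StopAsyncIteration; the status strings are illustrative only, and the real iterator's behavior may differ in detail.

```python
from concurrent.futures import Future


def fake_watch_run_status(statuses):
    """Hypothetical stand-in: yields one Future per status update."""
    for status in statuses:
        update = Future()
        update.set_result(status)
        yield update
    # Assumption: the end of the watch surfaces as StopAsyncIteration
    # when the final Future is resolved, as the note above suggests.
    finished = Future()
    finished.set_exception(StopAsyncIteration())
    yield finished


seen = []
for return_value in fake_watch_run_status(["PENDING", "RUNNING", "COMPLETED"]):
    try:
        seen.append(return_value.result(timeout=5))
    except StopAsyncIteration:
        # The watch has ended; no further status updates will arrive.
        break
```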
- class mcli.Run(run_uid, name, status, created_at, updated_at, created_by, priority, preemptible, retry_on_system_failure, cluster, gpus, gpu_type, cpus, node_count, latest_resumption, is_deleted, run_type, max_retries=None, reason=None, nodes=<factory>, submitted_config=None, metadata=None, last_resumption_id=None, resumptions=<factory>, events=<factory>, lifecycle=<factory>, image=None, max_duration=None, _required_properties=('id', 'name', 'status', 'createdAt', 'updatedAt', 'reason', 'createdByEmail', 'priority', 'preemptible', 'retryOnSystemFailure', 'resumptions', 'isDeleted', 'runType'))
A run that has been launched on the MosaicML platform
- Parameters
run_uid (str): Unique identifier for the run
name (str): User-defined name of the run
status (RunStatus): Status of the run at a moment in time
created_at (datetime): Date and time when the run was created
updated_at (datetime): Date and time when the run was last updated
created_by (str): Email of the user who created the run
priority (str): Priority of the run; defaults to auto but can be updated to low or lowest
preemptible (bool): Whether the run can be stopped and re-queued by higher priority jobs
retry_on_system_failure (bool): Whether the run should be retried on system failure
cluster (str): Cluster the run is running on
gpus (int): Number of GPUs the run is using
gpu_type (str): Type of GPU the run is using
cpus (int): Number of CPUs the run is using
node_count (int): Number of nodes the run is using
latest_resumption (Resumption): Latest resumption of the run
max_retries (Optional[int]): Maximum number of times the run can be retried
reason (Optional[str]): Reason the run was stopped
nodes (List[Node]): Nodes the run is using
submitted_config (Optional[RunConfig]): Submitted run configuration
metadata (Optional[Dict[str, Any]]): Metadata associated with the run
last_resumption_id (Optional[str]): ID of the last resumption of the run
resumptions (List[Resumption]): Resumptions of the run
lifecycle (List[RunLifecycle]): Lifecycle of the run
image (Optional[str]): Image the run is using
- clone(name=None, image=None, cluster=None, instance=None, nodes=None, gpu_type=None, gpus=None, priority=None, preemptible=None, max_retries=None, max_duration=None)
Submits a new run with the same configuration as this run
- Parameters
name (str): Override the name of the run
image (str): Override the image of the run
cluster (str): Override the cluster of the run
instance (str): Override the instance of the run
nodes (int): Override the number of nodes of the run
gpu_type (str): Override the GPU type of the run
gpus (int): Override the number of GPUs of the run
priority (str): Override the default priority of the run from auto to low or lowest
preemptible (bool): Override whether the run can be stopped and re-queued by higher priority jobs
max_retries (int): Override the max number of times the run can be retried
max_duration (float): Override the max duration (in hours) that a run can run for
- Returns
New Run object
- refresh()
Refreshes the data on the run object
- Returns
Refreshed Run object
- update(preemptible=None, priority=None, max_retries=None, retry_on_system_failure=None, max_duration=None)
Updates the run's data
- Parameters
preemptible (bool): Update whether the run can be stopped and re-queued by higher priority jobs; default is False
priority (str): Update the default priority of the run from auto to low or lowest
max_retries (int): Update the max number of times the run can be retried; default is 0
retry_on_system_failure (bool): Update whether the run should be retried on system failure (i.e. a node failure); default is False
- Returns
Updated Run object
- update_metadata(metadata)
Updates the run's metadata
- Parameters
metadata (Dict[str, Any]): The metadata to update the run with. This will be merged with the existing metadata. Keys not specified in this dictionary will not be modified.
- Returns
Updated Run object
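The metadata update is described as a merge in which keys absent from the update are left untouched. The sketch below illustrates that behavior with plain dictionaries. `merge_metadata` is a hypothetical helper, and treating the merge as top-level only (nested dictionaries replaced rather than merged recursively) is an assumption, since this reference does not specify how nested values are handled.

```python
from typing import Any, Dict


def merge_metadata(existing: Dict[str, Any], update: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical illustration of a top-level metadata merge."""
    merged = dict(existing)  # copy so the original mapping is not mutated
    merged.update(update)    # keys in `update` overwrite; others are kept
    return merged


current = {"experiment": "baseline", "seed": 17}
merged = merge_metadata(current, {"seed": 42, "notes": "rerun"})
```

Here `experiment` survives because it is not named in the update, `seed` is overwritten, and `notes` is added.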
- class mcli.RunConfig(name=None, parent_name=None, image=None, gpu_type=None, gpu_num=None, cpus=None, cluster=None, scheduling=<factory>, compute=<factory>, parameters=<factory>, integrations=<factory>, env_variables=<factory>, metadata=<factory>, command='', dependent_deployment=<factory>, _suppress_deprecation_warnings=False)[source]#
A run configuration for the MosaicML platform
Values in here are not yet validated and some required values may be missing. On attempting to create the run, a bad config will raise a MapiException with a 400 status code.
- Required args:
name (str): User-defined name of the run
image (str): Docker image (e.g. mosaicml/composer)
command (str): Command to use when a run starts
- compute (ComputeConfig or Dict[str, Any]): Compute configuration. Typically a subset of the following fields will be required:
cluster (str): Name of cluster to use
instance (str): Name of instance to use
gpu_type (str): Name of gpu type to use
gpus (int): Number of GPUs to use
cpus (int): Number of CPUs to use
nodes (int): Number of nodes to use
See mcli get clusters for a list of available clusters and instances
- Optional args:
parameters (Dict[str, Any]): Parameters to mount into the environment
- scheduling (SchedulingConfig or Dict[str, Any]): Scheduling configuration
priority (str): Priority of the run (default auto, with options low and lowest)
preemptible (bool): Whether the run is preemptible (default False)
retry_on_system_failure (bool): Whether the run should be retried on system failure (default False)
max_retries (int): Maximum number of retries (default 0)
max_duration (float): Maximum duration of the run in hours (default None). The run will be automatically stopped after this duration has elapsed.
- integrations (List[Dict[str, Any]]): List of integrations. See integration documentation for more details:
https://docs.mosaicml.com/projects/mcli/en/latest/resources/integrations/index.html
- env_variables (Dict[str, str]): Dictionary of environment variables to set in the run
key (str): Name of the environment variable
value (str): Value of the environment variable
metadata (Dict[str, Any]): Arbitrary metadata to attach to the run
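Because a RunConfig is not validated until the run is created, a local pre-check can catch missing required args before any request is sent. The sketch below works on a plain dictionary whose keys mirror the constructor arguments listed above. missing_required_fields is a hypothetical helper, and the cluster name, command, and environment variable are placeholder values.

```python
from typing import Any, Dict, List

# Required top-level keys, taken from the "Required args" list above.
REQUIRED_KEYS = ("name", "image", "command", "compute")


def missing_required_fields(config: Dict[str, Any]) -> List[str]:
    """Return the required keys that are absent or empty in `config`."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


config = {
    "name": "hello-world",
    "image": "mosaicml/composer",
    "command": "echo 'hello world'",
    "compute": {"cluster": "my-cluster", "gpus": 8},  # placeholder values
    "scheduling": {"priority": "low", "max_retries": 1},
    "env_variables": {"MY_FLAG": "1"},
}

print(missing_required_fields(config))
print(missing_required_fields({"name": "no-image"}))
```

A dictionary that passes this check uses only keyword names that appear in the RunConfig signature above, so it should map onto the constructor's keyword arguments. The platform may still reject it for reasons this check does not cover.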
- class mcli.RunStatus(value)[source]#
Possible statuses of a run
- PENDING = 'PENDING'#
The run has been submitted and is waiting to be scheduled
- QUEUED = 'QUEUED'#
The run is awaiting execution
- STARTING = 'STARTING'#
The run is starting up and preparing to run
- RUNNING = 'RUNNING'#
The run is actively running
- TERMINATING = 'TERMINATING'#
The run is in the process of being terminated
- COMPLETED = 'COMPLETED'#
The run has finished without any errors
- STOPPED = 'STOPPED'#
The run has stopped
- FAILED = 'FAILED'#
The run has failed due to an issue at runtime
- UNKNOWN = 'UNKNOWN'#
A valid run status cannot be found
- before(other, inclusive=False)[source]#
Returns True if this state usually comes "before" the other
- Parameters
other – Another RunStatus
inclusive – If True, equality evaluates to True. Default False.
- Returns
If this state is "before" the other
Example
>>> RunStatus.RUNNING.before(RunStatus.COMPLETED)
True
>>> RunStatus.PENDING.before(RunStatus.RUNNING)
True
- after(other, inclusive=False)[source]#
Returns True if this state usually comes "after" the other
- Parameters
other – Another RunStatus
inclusive – If True, equality evaluates to True. Default False.
- Returns
If this state is "after" the other
Example
>>> RunStatus.COMPLETED.after(RunStatus.RUNNING)
True
>>> RunStatus.RUNNING.after(RunStatus.PENDING)
True
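The ordering semantics of before() and after() can be tried out without platform access using a local stand-in. This is a sketch only: StatusSketch is a hypothetical class, not the mcli implementation, and it assumes that the order in which the statuses are listed above matches the usual lifecycle order. In particular, the relative order of COMPLETED, STOPPED, FAILED, and UNKNOWN is an assumption.

```python
from enum import Enum


class StatusSketch(Enum):
    """Local stand-in for RunStatus (hypothetical, not the mcli class)."""

    PENDING = "PENDING"
    QUEUED = "QUEUED"
    STARTING = "STARTING"
    RUNNING = "RUNNING"
    TERMINATING = "TERMINATING"
    COMPLETED = "COMPLETED"
    STOPPED = "STOPPED"
    FAILED = "FAILED"
    UNKNOWN = "UNKNOWN"

    def _rank(self) -> int:
        # Assumption: declaration order matches the usual lifecycle order.
        return list(type(self)).index(self)

    def before(self, other: "StatusSketch", inclusive: bool = False) -> bool:
        if inclusive and self is other:
            return True
        return self._rank() < other._rank()

    def after(self, other: "StatusSketch", inclusive: bool = False) -> bool:
        if inclusive and self is other:
            return True
        return self._rank() > other._rank()


def has_started(status: StatusSketch) -> bool:
    """True once a run has reached RUNNING or any later state."""
    return status.after(StatusSketch.RUNNING, inclusive=True)


print(has_started(StatusSketch.QUEUED))
print(has_started(StatusSketch.RUNNING))
```

The inclusive flag is what makes a check such as has_started treat RUNNING itself as "started"; without it, only strictly later states would qualify.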
- class mcli.PaginatedObjectList(data, obj, query_function, pagination_function)[source]#
A list of objects that is paginated
Example pagination of runs:
import time

from mcli import get_runs

runs = get_runs()
while True:
    try:
        print(f'Found {len(runs)} runs')
        time.sleep(1)
        runs = runs.next_page()
    except StopIteration:
        print("No more pages")
        break
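The same loop can be wrapped in a generator so that callers iterate over items rather than pages. The sketch below relies only on the behavior shown in the example above, namely that next_page() raises StopIteration when no pages remain. FakePage is a hypothetical stand-in used to exercise the helper locally, and iter_all is not part of mcli.

```python
from typing import Any, Iterator, List


class FakePage(list):
    """Hypothetical stand-in for a paginated list, for local testing only."""

    def __init__(self, items: List[Any], remaining_pages: List[List[Any]]):
        super().__init__(items)
        self._remaining = remaining_pages

    def next_page(self) -> "FakePage":
        if not self._remaining:
            # Mirrors the behavior the example above relies on.
            raise StopIteration
        return FakePage(self._remaining[0], self._remaining[1:])


def iter_all(first_page: Any) -> Iterator[Any]:
    """Yield every item across pages until next_page() raises StopIteration."""
    page = first_page
    while True:
        yield from page
        try:
            page = page.next_page()
        except StopIteration:
            # Caught here so it never escapes the generator body.
            return


pages = FakePage([1, 2], [[3, 4], [5]])
print(list(iter_all(pages)))
```

Catching StopIteration inside the generator matters: if it escaped the generator body, Python would convert it into a RuntimeError.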
Pretraining API#
- mcli.create_pretraining_run(model, train_data, save_folder, *, compute=None, tokenizer=None, training_duration=None, parameters=None, eval=None, experiment_tracker=None, custom_weights_path=None, timeout=10, future=False)[source]#
Create a pretraining run.
- Parameters
model – The name of the Hugging Face model to use. Required.
train_data – Either a list of paths to the training data or a mapping of dataset names to the path and proportion of the dataset to use. For example, if you have two datasets, dataset1 and dataset2, and you want to use 80% of dataset1 and 20% of dataset2, you can pass in {"dataset1": {"path": "path/to/dataset1", "proportion": .8}, "dataset2": {"path": "path/to/dataset2", "proportion": .2}}. Required.
save_folder – The remote location to save the checkpoints. For example, if your save_folder is s3://my-bucket/my-checkpoints, the Composer checkpoints will be saved to s3://my-bucket/my-checkpoints/<run-name>/checkpoints, and Hugging Face formatted checkpoints will be saved to s3://my-bucket/my-checkpoints/<run-name>/hf_checkpoints. The supported cloud provider prefixes are s3://, gs://, and oci://. Required.
compute – The compute configuration to use. Required for now.
tokenizer – Tokenizer configuration to use. If not provided, the default tokenizer for the model will be used.
training_duration – The total duration of your run. This can be specified in batches (e.g. 100ba), epochs (e.g. 10ep), or tokens (e.g. 1_000_000tok). Default is 1ep.
parameters – Additional parameters to pass to the model:
learning_rate: The peak learning rate to use. Default is 5e-7. The optimizer used is DecoupledLionW with betas of 0.90 and 0.95 and no weight decay, and the learning rate scheduler used is LinearWithWarmupSchedule with a warmup of 2% of the total training duration and a final learning rate multiplier of 0.
context_length: The maximum sequence length to use. This will be used to truncate any data that is too long. The default is the default for the provided Hugging Face model. We do not support extending the context length beyond each model's default.
experiment_tracker – The configuration for an experiment tracker. For example, to add Weights and Biases tracking, you can pass in {"wandb": {"project": "my-project", "entity": "my-entity"}}. To add MLflow tracking, you can pass in {"mlflow": {"experiment_path": "my-experiment", "model_registry_path": "catalog.schema.model_name"}}.
eval – Configuration for evaluation.
custom_weights_path – The remote location of a custom model checkpoint to resume from. If provided, these weights will be used instead of the original pretrained weights of the model. This must be a Composer checkpoint.
timeout – Time, in seconds, in which the call should complete. If the run creation takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future – Return the output as a Future. If True, the call to create_pretraining_run will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.
- Returns
A Run object containing the pretraining run information.
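Several of the argument formats above (storage prefixes, duration strings, dataset proportions) can be checked locally before submitting. The sketch below is one way to do that under stated assumptions: find_problems is a hypothetical helper, the model name and bucket paths are placeholders, and the rule that each proportion lies in (0, 1] is an assumption rather than something the reference states.

```python
import re
from typing import Any, Dict, List

SUPPORTED_PREFIXES = ("s3://", "gs://", "oci://")
# Matches strings such as 100ba, 10ep, or 1_000_000tok.
DURATION_RE = re.compile(r"^\d+(?:_\d+)*(?:ba|ep|tok)$")


def find_problems(args: Dict[str, Any]) -> List[str]:
    """Return human-readable problems found in pretraining arguments."""
    problems = []
    if not args["save_folder"].startswith(SUPPORTED_PREFIXES):
        problems.append("save_folder has an unsupported prefix")
    if not DURATION_RE.match(args.get("training_duration", "1ep")):
        problems.append("training_duration is not in ba/ep/tok form")
    train_data = args["train_data"]
    if isinstance(train_data, dict):
        for name, entry in train_data.items():
            # Assumption: a proportion is a fraction in the range (0, 1].
            if not 0 < entry["proportion"] <= 1:
                problems.append(f"proportion for {name} is outside (0, 1]")
    return problems


good_args = {
    "model": "my-org/my-model",  # placeholder model name
    "train_data": {
        "dataset1": {"path": "path/to/dataset1", "proportion": 0.8},
        "dataset2": {"path": "path/to/dataset2", "proportion": 0.2},
    },
    "save_folder": "s3://my-bucket/my-checkpoints",
    "training_duration": "1_000_000tok",
}
bad_args = {
    "model": "my-org/my-model",
    "train_data": ["path/to/dataset1"],
    "save_folder": "/tmp/checkpoints",
    "training_duration": "10 epochs",
}

print(find_problems(good_args))
print(find_problems(bad_args))
```

Arguments that pass could then be forwarded to create_pretraining_run as keyword arguments, together with the compute configuration that the reference lists as required.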
Finetuning API#
- mcli.create_finetuning_run(model, train_data_path, save_folder, *, task_type='INSTRUCTION_FINETUNE', eval_data_path=None, eval_prompts=None, custom_weights_path=None, training_duration=None, learning_rate=None, context_length=None, experiment_tracker=None, disable_credentials_check=None, timeout=10, future=False)[source]#
Finetunes a model on a finetuning dataset and converts the final composer checkpoint to a Hugging Face formatted checkpoint for inference.
- Parameters
model – The name of the Hugging Face model to use.
train_data_path – The full remote location of your training data (e.g. "s3://my-bucket/my-data.jsonl"). For INSTRUCTION_FINETUNE, another option is to provide the name of a Hugging Face dataset that includes the train split, like "mosaicml/dolly_hhrlhf/train". The data should be formatted with each row containing a "prompt" and "response" field for INSTRUCTION_FINETUNE, or in raw data format for CONTINUED_PRETRAIN.
save_folder – The remote location to save the finetuned checkpoints. For example, if your save_folder is s3://my-bucket/my-checkpoints, the finetuned Composer checkpoints will be saved to s3://my-bucket/my-checkpoints/<run-name>/checkpoints, and Hugging Face formatted checkpoints will be saved to s3://my-bucket/my-checkpoints/<run-name>/hf_checkpoints. The supported cloud provider prefixes are s3://, gs://, and oci://.
task_type – The type of finetuning task to run. The currently available options are INSTRUCTION_FINETUNE and CONTINUED_PRETRAIN; the default is INSTRUCTION_FINETUNE.
eval_data_path – The remote location of your evaluation data (e.g. s3://my-bucket/my-data.jsonl). For INSTRUCTION_FINETUNE, the name of a Hugging Face dataset with the test split (e.g. mosaicml/dolly_hhrlhf/test) can also be given. The evaluation data should be formatted with each row containing a prompt and response field for INSTRUCTION_FINETUNE, or as raw data for CONTINUED_PRETRAIN. Default is None.
eval_prompts – A list of prompt strings to generate during training. Results will be logged to the experiment tracker(s) you've configured. Generations will occur at every model checkpoint with the following generation parameters:
max_new_tokens: 100
temperature: 1
top_k: 50
top_p: 0.95
do_sample: true
Default is None (do not generate prompts).
custom_weights_path – The remote location of a custom model checkpoint to use for finetuning. If provided, these weights will be used instead of the original pretrained weights of the model. This must be a Composer checkpoint. Default is None.
training_duration – The total duration of your finetuning run. This can be specified in batches (e.g. 100ba), epochs (e.g. 10ep), or tokens (e.g. 1_000_000tok). Default is 1ep.
learning_rate – The peak learning rate to use for finetuning. Default is 5e-7. The optimizer used is DecoupledLionW with betas of 0.90 and 0.95 and no weight decay, and the learning rate scheduler used is LinearWithWarmupSchedule with a warmup of 2% of the total training duration and a final learning rate multiplier of 0.
context_length – The maximum sequence length to use. This will be used to truncate any data that is too long. The default is the default for the provided Hugging Face model. We do not support extending the context length beyond each model's default.
experiment_tracker – The configuration for an experiment tracker. For example, to add Weights and Biases tracking, you can pass in {"wandb": {"project": "my-project", "entity": "my-entity"}}. To add MLflow tracking, you can pass in {"mlflow": {"experiment_path": "my-experiment", "model_registry_path": "catalog.schema.model_name"}}.
disable_credentials_check – Flag to disable checking credentials (S3, Databricks, etc.). If the credentials check is enabled (False), a preflight check will be run on finetune submission, running a few tests to ensure that the credentials provided are valid for the resources you are attempting to access (S3 buckets, Databricks experiments, etc.). If the credentials check fails, your finetune run will be stopped.
timeout – Time, in seconds, in which the call should complete. If the run creation takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future – Return the output as a Future. If True, the call to create_finetuning_run will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Finetune output, use return_value.result() with an optional timeout argument.
- Returns
A Finetune object containing the finetuning run information.
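For INSTRUCTION_FINETUNE, the reference above asks for rows that each contain a prompt and a response field. The sketch below serializes such rows and rejects incomplete ones. It assumes the file is JSON Lines (one JSON object per line), as the .jsonl extension in the example paths suggests; to_jsonl is a hypothetical helper and the sample rows are made up for illustration.

```python
import json
from typing import Dict, Iterable

REQUIRED_FIELDS = {"prompt", "response"}


def to_jsonl(rows: Iterable[Dict[str, str]]) -> str:
    """Serialize rows as JSON Lines, checking the required fields first."""
    lines = []
    for index, row in enumerate(rows):
        missing = REQUIRED_FIELDS - row.keys()
        if missing:
            raise ValueError(f"row {index} is missing: {sorted(missing)}")
        lines.append(json.dumps(row))
    return "\n".join(lines) + "\n"


rows = [
    {"prompt": "What is 2 + 2?", "response": "4"},
    {"prompt": "Name a primary color.", "response": "Red"},
]
text = to_jsonl(rows)
print(text, end="")
```

The resulting text could be written to a file and uploaded to the remote location that is then passed as train_data_path or eval_data_path.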
- class mcli.Finetune(id, name, status, created_at, updated_at, created_by, started_at=None, completed_at=None, reason=None, estimated_end_time=None, model=None, save_folder=None, train_data_path=None, submitted_config=None, events=None, _required_properties=('id', 'name', 'status', 'createdByEmail', 'createdAt', 'updatedAt'))[source]#
A Finetune that has been run on the MosaicML platform
- Parameters
id – The unique identifier for this finetuning run.
name – The name of the finetuning run.
status – The current status of the finetuning run. This is a RunStatus enum, which has values such as PENDING, RUNNING, or COMPLETED.
created_at – The timestamp at which the finetuning run was created.
updated_at – The timestamp at which the finetuning run was last updated.
created_by – The email address of the user who created the finetuning run.
started_at – The timestamp at which the finetuning run was started.
completed_at – The timestamp at which the finetuning run was completed.
reason – The reason for the finetuning run's current status, such as Run completed successfully.