Deployments#
This page outlines how to work with deployments, including creating, updating, getting, and deleting deployments, as well as pinging a deployment and sending requests to it.
Creating a deployment#
Deployments can be created programmatically, giving you the flexibility to define custom workflows or create similar deployments in quick succession.
create_inference_deployment() takes an InferenceDeploymentConfig object, which is a fully-configured deployment ready to launch. The method launches the inference deployment and returns an InferenceDeployment object, which includes the InferenceDeploymentConfig data in InferenceDeployment.config as well as data received at the time the deployment was launched.
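For example (a minimal sketch; creating the configuration is covered in detail below, and the name attribute shown comes from the config fields used later on this page):
from mcli import InferenceDeploymentConfig, create_inference_deployment
config = InferenceDeploymentConfig.from_file('hello_world.yaml')
deployment = create_inference_deployment(config)
# The original configuration is echoed back on the returned object
print(deployment.config.name)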
The InferenceDeploymentConfig object#
The InferenceDeploymentConfig object holds the configuration data needed to launch a deployment.
This is the underlying Python data structure MCLI uses, so before beginning, make sure to familiarize yourself with the inference schema. Take a look at the API Reference for the full list of fields on the InferenceDeploymentConfig object.
There are two ways to initialize an InferenceDeploymentConfig object that can be used to configure and create a deployment.
The first is by referencing a YAML file, equivalent to the file argument in MCLI:
from mcli import InferenceDeploymentConfig, create_inference_deployment
deployment_config = InferenceDeploymentConfig.from_file('hello_world.yaml')
created_deployment = create_inference_deployment(deployment_config)
Alternatively, you can instantiate the InferenceDeploymentConfig object directly in Python:
from mcli import InferenceDeploymentConfig, create_inference_deployment
cluster = "<your-cluster>"
inference_deployment_config = InferenceDeploymentConfig(
    name='hello-world',
    image='bash',
    command='echo "Hello World!" && sleep 60',
    gpu_type='none',
    cluster=cluster,
)
created_deployment = create_inference_deployment(inference_deployment_config)
These can also be used in combination, for example loading a base configuration file and modifying select fields:
from mcli import InferenceDeploymentConfig, create_inference_deployment
special_config = InferenceDeploymentConfig.from_file('base_config.yaml')
special_config.metadata = {"version": 1}
created_deployment = create_inference_deployment(special_config)
The InferenceDeployment object#
Created deployments are returned as an InferenceDeployment object by create_inference_deployment().
This object can be used as input to any subsequent deployment function; for example, you can start a deployment and then immediately ping it to see if it's ready.
from mcli import create_inference_deployment, ping_inference_deployment as ping
created_deployment = create_inference_deployment(config)
ping(created_deployment)
Querying a deployment#
When querying your inference deployment, you must provide a JSON body with a key called inputs in the request. This will typically be a list of inputs to the model. For example, in a text-to-text language model the inputs field will contain a list of strings to be tokenized and fed into the model.
Optionally, you can also provide a parameters field which contains hyperparameters used in the forward pass of your model. An example of where one might use the parameters field is to pass arguments to the generation pipeline in a text-to-text language model. See our docs on this for more details.
The reason parameters is separated from inputs in the request is so that the webserver's dynamic batching functionality can automatically group requests with the same set of parameters together in the batches it creates. This is important because in some cases different sets of parameters cannot be grouped together when running inference. For example, if requests with different max_output_sequence_length parameters were batched together in a text-to-text language model, the user's model handler class would have to implement logic to handle this. Separating out parameters makes it possible for the user to write a handler class without having to consider these details.
An example request is shown below:
{
  "inputs": ["(required) <any JSON value>"],
  "parameters": "(optional) <any JSON value>"
}
The response from the server will be formatted as shown below. Note that the length of outputs will be the same as the length of inputs in the server request, where index i in inputs corresponds to index i in outputs.
{
  "outputs": ["<whatever the model handler output is for a single `input`>"]
}
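Programmatically, the same request body can be passed to predict() (described further below). Here is a minimal sketch for a hypothetical text-to-text deployment; the deployment name and the temperature parameter are placeholders, and it assumes predict() returns the parsed response shown above:
from mcli import predict
# Two prompts that share a single set of (example) generation parameters
payload = {
    'inputs': ['Hello, my name is', 'The weather today is'],
    'parameters': {'temperature': 0.7},
}
response = predict('your-deployment-name', payload)
# outputs[i] corresponds to inputs[i]
for prompt, output in zip(payload['inputs'], response['outputs']):
    print(prompt, '->', output)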
Observing a deployment#
Getting a deployment’s logs#
get_inference_deployment_logs() gets currently available logs for any deployment.
from mcli import create_inference_deployment, get_inference_deployment_logs
created_deployment = create_inference_deployment(config)
logs = get_inference_deployment_logs(created_deployment)
Listing deployments#
All deployments from your organization that have been launched through the MosaicML platform and have not been deleted can be accessed using the get_inference_deployments() function.
Optional filters allow you to specify a subset of deployments to list by name, cluster, gpu type, gpu number, or status.
from mcli import get_inference_deployments
listed_deployments = get_inference_deployments(gpu_nums=1)
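Filters can also be combined. Here is a brief sketch using the cluster and GPU-count filters shown elsewhere on this page; iterating the result and reading each deployment's config assumes the function returns a list of InferenceDeployment objects:
from mcli import get_inference_deployments
# Deployments on a given cluster that use a single GPU
for deployment in get_inference_deployments(cluster='<your-cluster>', gpu_nums=1):
    print(deployment.config.name)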
Updating a deployment#
To update a deployment, you must supply the deployment name or an InferenceDeployment object and the fields that need to be updated.
To update a set of deployments, you can use the output of get_inference_deployments() or even define your own filters directly.
Currently, we support updating the following fields:
image: Takes a string value.
replicas: Takes an int value.
metadata: Takes a dict value of metadata keys (strings) and values (any).
from mcli import update_inference_deployment
update_inference_deployment('deployment-name', {"metadata": {"name": "my_first_model"}, "replicas": 2, "image": "my_new_image"})
from mcli import update_inference_deployments, get_inference_deployments
to_update = get_inference_deployments(cluster="name")
update_inference_deployments(to_update, {"replicas": 3})
Every time a deployment is updated, we create a new Inference Deployment Release for it. You can see how many releases your deployment has had through mcli describe deployment <deployment_name>.
Stopping or deleting deployments#
If you want to take your deployment down temporarily, and still be able to view logs for it, then consider stopping your deployment. Otherwise, use the delete functionality.
To stop/delete deployments, you must supply the deployment names or InferenceDeployment objects.
To stop/delete a set of deployments, you can use the output of get_inference_deployments() or even define your own filters directly:
from mcli import delete_inference_deployment, stop_inference_deployment
stop_inference_deployment('stop-this-deployment')
delete_inference_deployment('delete-this-deployment')
from mcli import delete_inference_deployments, stop_inference_deployments, get_inference_deployments
to_delete = get_inference_deployments(cluster="name")
stop_inference_deployments(to_delete)
delete_inference_deployments(to_delete)
Pinging a deployment#
You can ping a deployment to determine the server status. We return a status code 200 when the server is live, which indicates the model has finished loading and is ready to accept requests. You can pass in either a name or an InferenceDeployment object.
from mcli import ping
ping('deployment-name')
Sending predictions to a deployment#
You can send predictions to your deployment programmatically. There are 3 ways you can specify the deployment you’d like to send your request to:
You can pass in the deployment object returned from create_inference_deployment or get_inference_deployment.
from mcli import get_inference_deployments, predict
# get_inference_deployments returns a list of matching deployments
deployment = get_inference_deployments(name='your-deployment-name')[0]
predict(deployment, {'inputs': ['some input']})
You can pass in the URL of the deployment.
from mcli import predict
predict('https://your-deployment.inf.hosted-on.mosaicml.hosting', {'inputs': ['some input']})
You can pass in the name of the deployment.
from mcli import predict
predict('your-deployment-name', {'inputs': ['some input']})
Getting metrics for a deployment#
You can retrieve latency, throughput, error rate, and CPU utilization metrics from the /metrics endpoint on the deployment. These metrics are computed over the past hour at one-minute intervals.
curl https://{deployment-name}.inf.hosted-on.mosaicml.hosting/metrics -H "Authorization: {api-key}"
A sample response is shown below:
{
  "status": 200,
  "metrics": {
    "error_rate": [
      ["2023-05-01 16:24:57", "10"],
      ["2023-05-01 16:23:57", "10"],
      ...
    ],
    "cpu_seconds": [
      ["2023-05-01 16:24:57", "0.006"],
      ["2023-05-01 16:23:57", "0.001"],
      ...
    ],
    "avg_latency": [
      ["2023-05-01 16:24:57", "1.2"],
      ["2023-05-01 16:23:57", "1.5"],
      ...
    ],
    "requests_per_second": [
      ["2023-05-01 16:24:57", "0.5"],
      ["2023-05-01 16:23:57", "1.2"],
      ...
    ]
  }
}
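The same request can be made from Python. Below is a minimal sketch using the third-party requests library, with the URL pattern and Authorization header taken from the curl example above; treating the first entry as the most recent sample follows the ordering of the sample response:
import requests
deployment_name = 'your-deployment-name'
api_key = 'your-api-key'
resp = requests.get(
    f'https://{deployment_name}.inf.hosted-on.mosaicml.hosting/metrics',
    headers={'Authorization': api_key},
)
metrics = resp.json()['metrics']
# Most recent one-minute average latency sample
timestamp, latency = metrics['avg_latency'][0]
print(f'{timestamp}: avg_latency={latency}')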