This page outlines how to work with deployments: creating, updating, getting, and deleting them, as well as pinging a deployment and sending requests to it.

Creating a deployment#

Deployments can be created programmatically, giving you the flexibility to define custom workflows or create similar deployments in quick succession. create_inference_deployment() takes an InferenceDeploymentConfig object, which is a fully-configured deployment ready to launch. The method launches the inference deployment and returns an InferenceDeployment object, which includes the InferenceDeploymentConfig data in InferenceDeployment.config along with data recorded at the time the deployment was launched.

The InferenceDeploymentConfig object#

The InferenceDeploymentConfig object holds the configuration data needed to launch a deployment. This is the underlying Python data structure MCLI uses, so before beginning, make sure to familiarize yourself with the inference schema. Take a look at the API Reference for the full list of fields on the InferenceDeploymentConfig object.

There are two ways to initialize an InferenceDeploymentConfig object that can be used to configure and create a deployment. The first is by referencing a YAML file, equivalent to the file argument in MCLI:

from mcli import InferenceDeploymentConfig, create_inference_deployment

deployment_config = InferenceDeploymentConfig.from_file('hello_world.yaml')
created_deployment = create_inference_deployment(deployment_config)

Alternatively, you can instantiate the InferenceDeploymentConfig object directly in Python:

from mcli import InferenceDeploymentConfig, create_inference_deployment

cluster = "<your-cluster>"
inference_deployment_config = InferenceDeploymentConfig(
    name='hello-world',
    cluster=cluster,
    command='echo "Hello World!" && sleep 60',
)
created_deployment = create_inference_deployment(inference_deployment_config)

These can also be used in combination, for example loading a base configuration file and modifying select fields:

from mcli import InferenceDeploymentConfig, create_inference_deployment

special_config = InferenceDeploymentConfig.from_file('base_config.yaml')
special_config.metadata = {"version": 1}
created_deployment = create_inference_deployment(special_config)

The InferenceDeployment object#

Created deployments are returned as an InferenceDeployment object by create_inference_deployment(). This object can be used as input to any subsequent deployment function; for example, you can create a deployment and then immediately ping it to see if it’s ready.

from mcli import create_inference_deployment, ping_inference_deployment as ping

created_deployment = create_inference_deployment(config)
ping(created_deployment)

Querying a deployment#

When querying your inference deployment, you must provide a JSON payload with a key called inputs in the request. This will typically be a list of inputs to the model. For example, in a text-to-text language model, the inputs field will contain a list of strings to be tokenized and fed into the model.

Optionally, you can also provide a parameters field which contains hyperparameters used in the forward pass of your model. An example of where one might use the parameters field is to pass arguments to the generation pipeline in a text-to-text language model. See our docs on this for more details.

The reason parameters is separated from inputs in the request is so that the webserver’s dynamic batching functionality can automatically group requests with the same set of parameters into the batches it creates. This matters because some sets of parameters cannot be grouped together when running inference. For example, if requests with different max_output_sequence_length values were batched together in a text-to-text language model, the user’s model handler class would have to implement logic to handle the mismatch. Separating out parameters makes it possible to write a handler class without having to consider these details.
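To illustrate the idea, here is a toy sketch (not MCLI internals) of how a dynamic batcher might group queued requests so that only requests sharing identical parameters land in the same batch:

```python
import json
from itertools import groupby

# Toy illustration (not MCLI internals): group queued requests so that only
# requests with identical `parameters` are batched together.
queued = [
    {"inputs": ["a"], "parameters": {"max_output_sequence_length": 64}},
    {"inputs": ["b"], "parameters": {"max_output_sequence_length": 128}},
    {"inputs": ["c"], "parameters": {"max_output_sequence_length": 64}},
]

def params_key(request):
    # Serialize so that dicts with equal contents produce equal keys.
    return json.dumps(request.get("parameters"), sort_keys=True)

# groupby requires the list to be sorted by the same key first.
batches = [
    list(group)
    for _, group in groupby(sorted(queued, key=params_key), key=params_key)
]
# Two batches: inputs "a" and "c" share parameters; "b" is batched alone.
```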

An example request is shown below:

  "inputs": ["(required) <any JSON value>"],
  "parameters": "(optional) <any JSON value>"

The response from the server will be formatted as shown below. Note that the length of outputs will be the same as the length of inputs in the server request where index i in inputs corresponds to index i in outputs.

  "outputs": ["<whatever the model handler output is for a single `input`>"]

Observing a deployment#

Getting a deployment’s logs#

get_inference_deployment_logs() retrieves the currently available logs for any deployment.

from mcli import create_inference_deployment, get_inference_deployment_logs

created_deployment = create_inference_deployment(config)
logs = get_inference_deployment_logs(created_deployment)

Listing deployments#

All deployments from your organization that have been launched through the MosaicML platform and have not been deleted can be accessed using the get_inference_deployments() function. Optional filters allow you to list a subset of deployments by name, cluster, GPU type, GPU number, or status.

from mcli import get_inference_deployments

listed_deployments = get_inference_deployments(gpu_nums=1)

Updating a deployment#

To update a deployment, supply the deployment name or an InferenceDeployment object along with the fields that need to be updated. To update a set of deployments, you can use the output of get_inference_deployments() or even define your own filters directly.

Currently, we support the following fields:

  1. image: Takes a string value.

  2. replicas: Takes an int value.

  3. metadata: Takes a dict value of metadata keys (strings) and values (any).

from mcli import update_inference_deployment

update_inference_deployment('deployment-name', {"metadata": {"name": "my_first_model"}, "replicas": 2, "image": "my_new_image"})

from mcli import update_inference_deployments, get_inference_deployments

to_update = get_inference_deployments(cluster="name")
update_inference_deployments(to_update, {"replicas": 3})

Every time a deployment is updated, a new Inference Deployment Release is created for it. You can see how many releases your deployment has had with mcli describe deployment <deployment_name>.

Stopping or Deleting deployments#

If you want to take your deployment down temporarily while still being able to view its logs, consider stopping your deployment. Otherwise, use the delete functionality.

To stop or delete a deployment, you must supply the deployment name or an InferenceDeployment object. To stop or delete a set of deployments, you can use the output of get_inference_deployments() or even define your own filters directly:

from mcli import delete_inference_deployment, stop_inference_deployment

stop_inference_deployment('deployment-name')
delete_inference_deployment('deployment-name')

from mcli import delete_inference_deployments, stop_inference_deployments, get_inference_deployments

to_delete = get_inference_deployments(cluster="name")
delete_inference_deployments(to_delete)

Pinging a deployment#

You can ping a deployment to determine the server status. The server returns a status code of 200 when it is live, which indicates the model has finished loading and is ready to accept requests. You can pass in either a name or an InferenceDeployment object.

from mcli import ping

ping('deployment-name')

Sending predictions to a deployment#

You can send predictions to your deployment programmatically. There are three ways to specify the deployment you’d like to send your request to:

  1. You can pass in the deployment object returned from create_inference_deployment or get_inference_deployment.

from mcli import get_inference_deployments, predict

deployments = get_inference_deployments(name='your-deployment-name')
predict(deployments[0], {'inputs': ['some input']})
  2. You can pass in the url to the deployment.

from mcli import predict

predict('<deployment-url>', {'inputs': ['some input']})
  3. You can pass in the name of the deployment.

from mcli import predict

predict('your-deployment-name', {'inputs': ['some input']})

Getting metrics for a deployment#

You can retrieve latency, throughput, error rate, and CPU utilization metrics from the /metrics endpoint on the deployment. These metrics are computed over the past hour at 1-minute intervals.

curl https://{deployment-name}/metrics -H "Authorization: {api-key}"

Sample response here:

  "status": 200,
  "metrics": {
    "error_rate": [
      ["2023-05-01 16:24:57", "10"],
      ["2023-05-01 16:23:57", "10"],
    "cpu_seconds": [
      ["2023-05-01 16:24:57", "0.006"],
      ["2023-05-01 16:23:57", "0.001"],
    "avg_latency": [
      ["2023-05-01 16:24:57", "1.2"],
      ["2023-05-01 16:23:57", "1.5"],
    "requests_per_second": [
      ["2023-05-01 16:24:57", "0.5"],
      ["2023-05-01 16:23:57", "1.2"],