# Custom Deployments

There are two ways to customize your inference deployment: you can provide your own downloader function or your own model handler implementation. In this section, we'll cover the interface for each and show you how to use them.

## Model Handlers

Model handlers allow you to define how your model should be loaded and what should happen in a forward pass, which lets the MosaicML platform support a wide variety of models and use cases. This is configured by the `model_handler` field in your deployment input YAML, which expects a Python path to your model handler class, and the `model_parameters` field, which expects a key-value mapping of parameters that is passed as kwargs to initialize your model handler class.

### Default

We provide a model handler that is built into the webserver by default. This model handler is used when no model handler is specified in the deployment input yaml. It loads a model from a checkpoint file (expected to be in the HuggingFace checkpoint format) and runs a forward pass on the model. It is a good starting point for most text generation or text embedding use cases.

The parameters for the default model handler are as follows:

| Field | Type | Details |
| --- | --- | --- |
| `task` | `str` | Required. Determines how the forward pass is computed. Supported values are `text-generation` and `feature-extraction`. |
| `model_dtype` | `str` | The dtype that the Hugging Face model is loaded as. Defaults to `fp16`. Note that `bf16` is not supported by DeepSpeed, which our default model handler uses. |
| `autocast_dtype` | `str` | The dtype that the model is autocast to, if provided. Defaults to `None`. |
| `model_name_or_path` | `str` | The name of the Hugging Face repo of the model to load, or the path to a locally downloaded Hugging Face checkpoint. |
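
For example, a deployment that relies on the default model handler might configure it as in the minimal sketch below. This assumes the same `model` section layout as the full examples later in this section, and the model name is purely illustrative. Because no `model_handler` is specified, the built-in default handler is used:

```yaml
model:
  model_parameters:
    task: text-generation
    model_name_or_path: mosaicml/mpt-7b
    model_dtype: fp16
```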

### Custom Model Handlers

You may have a use case that is not covered by our default model handler (e.g. you want to deploy a vision model).

If you’d like to define your own model handler, you can implement a class that exposes the below interface.

```python
from typing import Any, Dict, List


class ModelHandlerInterface:

    def __init__(self, **kwargs):
        '''
        The init function you define can have keyword arguments equal
        to the values passed in the `model_parameters` section of the deployment YAML.
        '''

    def predict(self, model_requests: List[Dict[str, Any]]):
        '''
        Specify the logic of your model's forward pass.
        For example, for Hugging Face text generation models, this would be a call to generate().

        `model_requests` is a list of dictionaries where each dictionary
        is an individual request and the list represents a batch of requests.
        Note that each dictionary in the list is guaranteed to have
        two keys: `input` and `parameters`.

        The value of the `input` key represents a single input
        to the model. If you pass a list of inputs in your request to the server, it will be
        flattened into a list of dictionaries where each one contains an `input` whose value
        is one of the values in the request.
        '''
```
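
For instance, expanding on the generate() example mentioned in the docstring, a text-generation handler built on the Hugging Face `transformers` library might look roughly like the sketch below. The class name, the use of `pipeline()`, and the way per-request `parameters` are forwarded are illustrative assumptions, not the built-in default handler:

```python
from typing import Any, Dict, List

from transformers import pipeline


class TextGenerationHandler:

    def __init__(self, model_name_or_path: str, **kwargs):
        # Load the checkpoint named in `model_parameters` (illustrative sketch).
        self.pipe = pipeline('text-generation', model=model_name_or_path)

    def predict(self, model_requests: List[Dict[str, Any]]):
        outputs = []
        for request in model_requests:
            # Forward any per-request generation parameters to the pipeline call.
            params = request.get('parameters') or {}
            result = self.pipe(request['input'], **params)
            outputs.append(result[0]['generated_text'])
        return outputs
```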

Note that `model_requests` is a flattened version of the requests that are passed in to the webserver. See Querying a Deployment for details on the input request format of the webserver. For example, if the `inputs` key in the request to the webserver maps to a list of length 2, then `model_requests` will be a list of length two where each dictionary in the list has an `input` key that maps to one of the values in the `inputs` list.

Specifically, if the request is:

```json
{
  "inputs": ["prompt 1", "prompt 2"]
}
```

then the `predict` function will receive `model_requests` with the following:

[{ "input": "prompt 1" }, { "input": "prompt 2" }]

There are some examples you can follow in the examples repo here.

Let’s walk through a concrete example. Here’s a very simple model handler implementation that just returns the input string as the output of the forward pass:

```python
# Saved as hello_world_handler.py
from typing import Any, Dict, List


class HelloWorldModelHandler(ModelHandlerInterface):

    def predict(self, model_requests: List[Dict[str, Any]]):
        # Echo each request's input string back as the model output.
        return [inp["input"] for inp in model_requests]
```

Suppose my model handler is saved in a git repo with this structure:

```
hello_world/
├── hello_world_handler.py
└── __init__.py
```

And here is a sample yaml for how you can configure your deployment to use your custom model handler:

```yaml
name: hello-world-model
compute:
  gpus: 1
  gpu_type: a100_40gb
replicas: 1
image: mosaicml/inference
integrations:
  - integration_type: git_repo
    git_repo: hello_world
model:
  model_handler: hello_world.hello_world_handler.HelloWorldModelHandler
  model_parameters:
    print_string: "hello world!"
```
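
With this YAML, the value of `print_string` is passed as a keyword argument when the handler class is constructed, per the `model_parameters` behavior described above. A minimal sketch of an `__init__` that consumes it might look like the following; the default value, the print statement, and dropping the explicit base class are illustrative choices, not requirements of the interface:

```python
from typing import Any, Dict, List


class HelloWorldModelHandler:

    def __init__(self, print_string: str = "hello world!", **kwargs):
        # Entries under `model_parameters` in the YAML arrive here as keyword arguments.
        self.print_string = print_string
        print(self.print_string)

    def predict(self, model_requests: List[Dict[str, Any]]):
        # Same echo behavior as the handler defined earlier.
        return [inp["input"] for inp in model_requests]
```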

## Downloader Function

### Default

The downloader function allows you to customize how your model checkpoint is downloaded. This is configured by the `downloader` field in your deployment input YAML, which expects a Python path to your downloader function, and the `download_parameters` field, which expects a key-value mapping of parameters that is passed as kwargs to your downloader function.

If you don't provide a custom downloader, you can use the downloader that is built into the webserver, which can download checkpoint files in the Hugging Face format from the Hugging Face Hub, S3, or GCP. You must provide at most one of the parameters in the following table to `download_parameters`.

| Parameter | Description | Default | Example | Output Path |
| --- | --- | --- | --- | --- |
| `hf_path` | The name/path of the model on the Hugging Face Hub. | `None` | `mosaicml/mpt-7b` | Hugging Face cache directory |
| `s3_path` | The path to the model on S3. | `None` | `s3://my-bucket/checkpoint` | `/mosaicml/local_model` |
| `gcp_path` | The path to the model on GCP. | `None` | `gs://my-bucket/checkpoint` | `/mosaicml/local_model` |
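
For example, to have the built-in downloader pull a checkpoint from S3, the `model` section of the deployment YAML might look like the sketch below. Pointing `model_name_or_path` at the downloader's output path is an assumption about how the two fields are typically combined, and the bucket name is illustrative:

```yaml
model:
  download_parameters:
    s3_path: s3://my-bucket/checkpoint
  model_parameters:
    task: text-generation
    model_name_or_path: /mosaicml/local_model
```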

### Custom Downloader

If you'd like to download your checkpoint from a custom location, you can implement a function with the following interface, where `my_custom_location` is passed in under the `download_parameters` field in your deployment YAML:

```python
def download_model(my_custom_location: str) -> None:
    # In a real downloader, fetch the checkpoint from your custom location to local disk here.
    print("My custom location:", my_custom_location)
```

You can also take a look at this diffusion example for reference here.
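
As a slightly fuller sketch, a custom downloader might fetch a checkpoint archive over HTTP and unpack it to local disk. The parameter names (`checkpoint_url`, `local_dir`) and the use of the standard library here are hypothetical; the values would be supplied via `download_parameters` just like `my_custom_location` above:

```python
import os
import tarfile
import urllib.request


def download_model(checkpoint_url: str, local_dir: str = "/tmp/local_model") -> None:
    os.makedirs(local_dir, exist_ok=True)
    archive_path = os.path.join(local_dir, "checkpoint.tar.gz")
    # Fetch the archive and unpack it so the model handler can load the checkpoint.
    urllib.request.urlretrieve(checkpoint_url, archive_path)
    with tarfile.open(archive_path) as archive:
        archive.extractall(local_dir)
```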

Again, let's walk through a concrete example and add to the repo from the earlier model handler example by saving the download function to custom_downloader.py:

```
hello_world/
├── hello_world_handler.py
├── custom_downloader.py
└── __init__.py
```

Let's hook up the downloader to the input YAML:

```yaml
name: hello-world-model
compute:
  gpus: 1
  gpu_type: a100_40gb
replicas: 1
image: mosaicml/inference
integrations:
  - integration_type: git_repo
    git_repo: hello_world
model:
  downloader: hello_world.custom_downloader.download_model
  download_parameters:
    my_custom_location: my_custom_location
  model_handler: hello_world.hello_world_handler.HelloWorldModelHandler
  model_parameters:
    print_string: "hello world!"
```