Inference#
To make large models accessible to all organizations, we’ve built MosaicML Inference, which offers two service tiers: Enterprise and Starter.
To learn more about MosaicML Inference, see our product page.
Enterprise Tier#
With the Enterprise Tier, you can turn any saved model checkpoint into a secure, inexpensive API inside a MosaicML-managed cluster, or inside your own virtual private cloud (VPC), in under a minute.
To learn more about deploying your own model in your own secure environment, read our blog post and check out our documentation.
Starter Tier#
For less demanding applications, the Starter Tier offers a suite of open source models with commercial licensing terms. These models are hosted by MosaicML and served through an API, supporting text embedding and text completion use cases.
Text Embedding Models#
Embedding models are used to obtain a vector representation of an input string. Embedding vectors can be used to compute the similarity of two input strings, retrieve documents relevant to a specific query, and more.
| Model | Description | Endpoint |
| --- | --- | --- |
| Instructor-Large | A 335M parameter, instruction finetuned model capable of generating embeddings for various tasks. The model is queried with ["Instruction", "Sentence"] pairs (see Embedding Requests below). | instructor-large/v1 |
| Instructor-XL | A 1.2B parameter, instruction finetuned model capable of generating text embeddings for various tasks. The model uses the same ["Instruction", "Sentence"] input format. | instructor-xl/v1 |
Text Completion Models#
Text completion models are used to generate text based on a provided input prompt string. They can be used for generic text completion, question answering, information extraction, and much more.
| Model | Description | Instruction finetuned | Endpoint |
| --- | --- | --- | --- |
| GPT-2 XL | The 1.5B parameter version of GPT-2, an open-source language model capable of generating free-form text completions, trained and released by OpenAI. | No | gpt2-xl/v1 |
| MPT-7B-Instruct | A state-of-the-art 6.7B parameter instruction finetuned language model trained by MosaicML. The model is pretrained on 1T tokens from a mixture of datasets, then further instruction finetuned on a dataset derived from the Databricks Dolly-15k and Anthropic Helpful and Harmless (HH-RLHF) datasets. | Yes | mpt-7b-instruct/v1 |
| Dolly-12B | A 12B parameter instruction finetuned language model released by Databricks. The model is based on the Pythia-12B model trained by EleutherAI, and is further instruction finetuned on a dataset created by Databricks. See the Databricks code for an example of how to best format your prompt for this model. | Yes | dolly-12b/v1 |
| GPT-NeoX-20B | A 20B parameter language model capable of generating free-form text completions, trained and released by EleutherAI. See the paper for more information. | No | gpt-neox-20b/v1 |
API Reference#
Users can interact with MosaicML’s hosted models through HTTP requests to our REST API, enabling robust and extensible support for any programming language.
Authentication#
Accessing the MosaicML REST API requires a MosaicML platform API key for authentication. Please see our Quick Start for instructions on how to set up MosaicML platform access.
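For example, once you have a key, the MosaicML CLI can store it locally for use by both mcli and the mcli.sdk Python client (a sketch assuming the standard mcli setup; see the Quick Start for the authoritative steps):
mcli set api-key <your_api_key>
The curl examples below instead pass the key per request via the Authorization header.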
Embedding Requests#
To calculate embeddings for a string, send your text string to the hosted endpoint of the embedding model you wish to query.
POST https://models.hosted-on.mosaicml.hosting/<endpoint>
Request example:
Python:
from mcli.sdk import predict

inputs = {
    "input_strings": [
        [
            "Represent the Science title:",
            "3D ActionSLAM: wearable person tracking in multi-floor environments"
        ]
    ]
}

predict('https://models.hosted-on.mosaicml.hosting/instructor-large/v1', inputs)
curl:
curl https://models.hosted-on.mosaicml.hosting/instructor-large/v1/predict \
  -H "Authorization: <your_api_key>" \
  -H "Content-Type: application/json" \
  -d '{"input_strings": [["Represent the Science title:", "3D ActionSLAM: wearable person tracking in multi-floor environments"]]}'
Response example:
{
"data":[
[
-0.06155527010560036,0.010419987142086029,0.005884397309273481...-0.03766140714287758,0.010227023623883724,0.04394740238785744
]
]
}
Request body#
| Parameters | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| input_strings | List[[str, str]] | yes | N/A | List of pairs of strings in the format [["Instruction", "Sentence"]] |
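Once you have embeddings, a common next step is comparing them with cosine similarity. The sketch below is illustrative, not part of the API: it assumes predict returns the parsed JSON shown in the response example above, and the second title is a hypothetical input added for the comparison.
import math

from mcli.sdk import predict

# Embed two ["Instruction", "Sentence"] pairs in a single request.
inputs = {
    "input_strings": [
        ["Represent the Science title:", "3D ActionSLAM: wearable person tracking in multi-floor environments"],
        ["Represent the Science title:", "Indoor localization with wearable inertial sensors"],  # hypothetical second input
    ]
}
response = predict('https://models.hosted-on.mosaicml.hosting/instructor-large/v1', inputs)
vec_a, vec_b = response["data"]

# Cosine similarity: dot product divided by the product of the vector norms.
dot = sum(a * b for a, b in zip(vec_a, vec_b))
norms = math.sqrt(sum(a * a for a in vec_a)) * math.sqrt(sum(b * b for b in vec_b))
print(dot / norms)  # values closer to 1.0 indicate more similar inputs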
Text Completion Requests#
To generate text completions, send an input string to the hosted endpoint of the text completion model you wish to query.
POST https://models.hosted-on.mosaicml.hosting/<endpoint>
Request example:
Python:
from mcli.sdk import predict

prompt = "Write 3 reasons why you should train an AI model on domain specific data set."

predict('https://models.hosted-on.mosaicml.hosting/mpt-7b-instruct/v1', {'input_strings': [prompt], 'temperature': 0.01})
curl:
curl https://models.hosted-on.mosaicml.hosting/mpt-7b-instruct/v1/predict \
  -H "Authorization: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"temperature": 0.01, "input_strings": ["Write 3 reasons why you should train an AI model on domain specific data set."]}'
Response example:
{
'data': [
'1. The model will be more accurate.\n2. The model will be more efficient.\n3. The model will be more interpretable.'
]
}
Request body#
| Parameters | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| input_strings | List[str] | yes | N/A | The prompt(s) to generate completions for. |
| top_p | float | no | 0.95 | Nucleus sampling: tokens are added to the sampling pool from most probable to least probable until the sum of their probabilities exceeds top_p. |
| temperature | float | no | 0.8 | The temperature of the sampling operation. 1.0 means regular sampling, 0 means always take the highest-scoring token, and higher values (e.g., 100.0) approach uniform sampling. |
| max_length | int | no | 256 | The maximum length of the generated output, in tokens. |
| use_cache | bool | no | true | Whether to use KV caching during autoregressive decoding. This uses more memory but improves speed. |
| do_sample | bool | no | true | Whether to use sampling; if false, greedy decoding is used. |
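Putting the sampling parameters together, here is a hedged sketch of a completion request that overrides the defaults listed above (parameter names and defaults are taken from the table; the response indexing assumes the parsed shape shown in the response example):
from mcli.sdk import predict

payload = {
    "input_strings": ["Write 3 reasons why you should train an AI model on domain specific data set."],
    "temperature": 0.1,  # near-greedy: strongly prefer the highest-probability tokens
    "top_p": 0.95,       # nucleus sampling cutoff (the default)
    "max_length": 128,   # cap the completion at 128 tokens
    "do_sample": True,   # set to False for fully greedy decoding
    "use_cache": True,   # KV caching: more memory, faster decoding
}
response = predict('https://models.hosted-on.mosaicml.hosting/mpt-7b-instruct/v1', payload)
print(response['data'][0])
Lowering the temperature makes the completion more deterministic, which suits instruction following; raise temperature and keep do_sample true when you want more varied output.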