Inference#
To make large models accessible to all organizations, we’ve built MosaicML Inference.
To learn more about MosaicML Inference, including the current list of available models and per-1K-token pricing, see our product page.
If you need more support or want to let us know what you think, please join the conversation in the #inference-support channel on our community Slack!
Note: This document refers to MosaicML Inference, the hosted model offering also called “Starter Tier.” If you are looking for our “Enterprise Tier” inference offering, see this section of the MCLI documentation.
Get Started#
To get started using the Inference API endpoints, you need to generate an API key to submit with your requests. Optionally, you can also install the MosaicML client package, mosaicml-cli, which provides our mcli client SDK. Below, you will find examples of querying our endpoints with curl, as well as Python code using mcli.
Generate an API key#
To generate an API key, log into the MosaicML Console at console.mosaicml.com.
Note: Access to the MosaicML console is granted by invitation. If you have received an invitation via email, you can click the link in that email, or begin at console.mosaicml.com, making sure to sign in with the same email address that received the invitation.
Upon your very first login, you will be presented with a Terms and Conditions document. Please review and accept these terms and conditions to use the MosaicML platform.
Once logged in, click the Account tab in the left navigation bar:
Once in the Account tab, click the Add button in the API Keys window.
In the Add API key window, give your Key a name.
Your API key secret will be displayed. Use the Copy button to copy the key to your clipboard, then paste and save it in a secure location.
That’s it. Now that you have your API key, you’re ready to try out MosaicML Inference.
(Optional) Install mosaicml-cli#
To work with our Python SDK for Inference, you'll also want to install our client application, mosaicml-cli.
In your terminal application of choice, run:
pip install --upgrade mosaicml-cli
After that has completed, run:
mcli init
The mcli init command will prompt you to go through the same API key generation steps described above. You do not need to generate a second key; when prompted, simply paste the key you have already generated to complete the initialization. Once this is done, all mcli commands, and all usage of the mcli SDK in your Python code, will use that API key for authentication.
Example Usage#
Request example:
from mcli.sdk import predict

model_requests = {
    "inputs": [
        [
            "Represent the Science title:",
            "3D ActionSLAM: wearable person tracking in multi-floor environments"
        ]
    ]
}
predict('https://models.hosted-on.mosaicml.hosting/instructor-large/v1', model_requests)
curl https://models.hosted-on.mosaicml.hosting/instructor-large/v1/predict \
-H "Authorization: <your_api_key>" \
-H "Content-Type: application/json" \
-d '{"inputs": [["Represent the Science title:", "3D ActionSLAM: wearable person tracking in multi-floor environments"]]}'
Response example:
{
  "outputs": [
    [
      -0.06155527010560036, 0.010419987142086029, 0.005884397309273481, ..., -0.03766140714287758, 0.010227023623883724, 0.04394740238785744
    ]
  ]
}
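Once you have embeddings, you can compare them directly. Below is a minimal sketch that scores the similarity of two titles with cosine similarity; the embed helper is our own, and it assumes predict() returns the response body as a dict, as in the response example above:
import numpy as np
from mcli.sdk import predict

URL = 'https://models.hosted-on.mosaicml.hosting/instructor-large/v1'

def embed(instruction, sentence):
    # One ["Instruction", "Sentence"] pair per request, as in the example above.
    response = predict(URL, {"inputs": [[instruction, sentence]]})
    return np.array(response["outputs"][0])

a = embed("Represent the Science title:", "3D ActionSLAM: wearable person tracking in multi-floor environments")
b = embed("Represent the Science title:", "Indoor localization with wearable sensors")
similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {similarity:.3f}")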
MPT Models Example#
MPT-Instruct models were trained on the Alpaca prompt format. For best response quality, we recommend you follow the request format described here.
Request example:
from mcli.sdk import predict
prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction: write 3 reasons why you should train an AI model on domain specific data set.
### Response: """
predict('https://models.hosted-on.mosaicml.hosting/mpt-7b-instruct/v1', {"inputs": [prompt], "parameters": {"temperature": 0}})
curl https://models.hosted-on.mosaicml.hosting/mpt-7b-instruct/v1/predict \
-H "Authorization: <your-api-key>" \
-H "Content-Type: application/json" \
-d '{"inputs": ["Below is an instruction that describes a task. Write a response that appropriately completes the request.\n### Instruction: write 3 reasons why you should train an AI model on domain specific data set.\n### Response: "], "parameters": {"temperature": 0}}'
Response example:
{
"outputs": [
"\n1. The model will be more accurate\n2. The model will be more efficient\n3. The model will be more domain specific"
]
}
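Because the instruction/response scaffolding is the same for every request, it can be convenient to wrap it in a small helper. The sketch below is our own (the function name is not part of the API) and reuses the endpoint and parameters from the example above:
from mcli.sdk import predict

MPT_URL = 'https://models.hosted-on.mosaicml.hosting/mpt-7b-instruct/v1'

ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n"
    "### Instruction: {instruction}\n"
    "### Response: "
)

def instruct(instruction, **parameters):
    # Wrap the raw instruction in the Alpaca-style prompt shown above.
    prompt = ALPACA_TEMPLATE.format(instruction=instruction)
    return predict(MPT_URL, {"inputs": [prompt], "parameters": parameters})

print(instruct("write 3 reasons why you should train an AI model on domain specific data set.", temperature=0))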
We can also stream the outputs of MPT models token by token. With the SDK, this just means setting the stream argument to True in predict().
Request example:
from mcli.sdk import predict
prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction: write 3 reasons why you should train an AI model on domain specific data set.
### Response: """
predict('https://models.hosted-on.mosaicml.hosting/mpt-7b-instruct/v1', {"inputs": [prompt], "parameters": {"temperature": 0}}, stream=True)
curl https://models.hosted-on.mosaicml.hosting/mpt-7b-instruct/v1/predict_stream \
-H "Authorization: <your-api-key>" \
-H "Content-Type: application/json" \
-d '{"inputs": ["Below is an instruction that describes a task. Write a response that appropriately completes the request.\n### Instruction: write 3 reasons why you should train an AI model on domain specific data set.\n### Response: "], "parameters": {"temperature": 0}}'
Llama2 Model Example#
Llama2-Chat models work best when prompts contain a two-party conversation. Each turn in the conversation should start with a user instruction inside an [INST] ... [/INST] block, followed by the model's response to that instruction. For more on this, see Designing effective prompts on our blog.
Request example:
from mcli import predict
prompt = """[INST] <<SYS>>
Always answer in a professional and engaging manner.
<</SYS>>
Write LinkedIn Post about Llama2-70B-Chat being available on MosaicML Inference. [/INST]"""
response = predict("https://models.hosted-on.mosaicml.hosting/llama2-70b-chat/v1", {
    "inputs": [prompt],
    "parameters": {"max_new_tokens": 128, "temperature": 0}})
curl https://models.hosted-on.mosaicml.hosting/llama2-70b-chat/v1/predict \
-H "Authorization: <your-api-key>" \
-H "Content-Type: application/json" \
-d '{"inputs": ["[INST] <<SYS>>\nAlways answer in a professional and engaging manner.\n<</SYS>> \nWrite LinkedIn Post about Llama2-70B-Chat being available on MosaicML Inference. [/INST]"], "parameters": {"max_new_tokens": 128, "temperature": 0}}'
Response example:
{
"outputs": [
" Exciting news, everyone! 🚀 We're thrilled to announce that Llama2-70B-Chat is now available on MosaicML Inference! 💻\n\nThis powerful language model is trained on a massive dataset of text from the internet and is capable of generating human-like responses to a wide range of questions and prompts. With its impressive knowledge base and conversational capabilities, Llama2-70B-Chat is perfect for a variety of applications, including chatbots, virtual assistants, and customer service platforms"
]
}
We can also stream the outputs of Llama2 models token by token. With the SDK, this just means setting the stream argument to True in predict().
Request example:
from mcli import predict
prompt = """[INST] <<SYS>>
Always answer in a professional and engaging manner.
<</SYS>>
Write LinkedIn Post about Llama2-70B-Chat being available on MosaicML Inference. [/INST]"""
response = predict("https://models.hosted-on.mosaicml.hosting/llama2-70b-chat/v1", {
    "inputs": [prompt],
    "parameters": {"max_new_tokens": 128, "temperature": 0}},
    stream=True)
curl https://models.hosted-on.mosaicml.hosting/llama2-70b-chat/v1/predict_stream \
-H "Authorization: <your-api-key>" \
-H "Content-Type: application/json" \
-d '{"inputs": ["[INST] <<SYS>>\nAlways answer in a professional and engaging manner.\n<</SYS>> \nWrite LinkedIn Post about Llama2-70B-Chat being available on MosaicML Inference. [/INST]"], "parameters": {"max_new_tokens": 128, "temperature": 0}}'
API Reference#
Text Embedding Models#
Embedding models are used to obtain a vector representation of an input string. Embedding vectors can be used to compute the similarity of two input strings, retrieve documents relevant to a specific query, and more.
| Model | Description | Rate Limit | Endpoint |
|---|---|---|---|
| Instructor Large | A 335M parameter instruction finetuned model capable of generating embeddings for various tasks. | 75 RPS | instructor-large/v1 |
| Instructor XL | A 1.2B parameter instruction finetuned model capable of generating text embeddings for various tasks. See the model card for guidelines on how to best prompt the model. | 75 RPS | |
Text Embedding Request - Endpoint#
To calculate embeddings for a string, send your text string to the hosted endpoint of the embedding model you wish to query in the following format:
POST https://models.hosted-on.mosaicml.hosting/<endpoint>
Text Embedding Request - Body#
| Name | Type | Required | Description |
|---|---|---|---|
| inputs | List[[str, str]] | Yes | List of pairs of strings in the format [["Instruction", "Sentence"]] |
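If you are not using the SDK, the same request can be sent over plain HTTP. Below is a minimal sketch with the requests library, following the /predict path and headers from the curl example above:
import requests

response = requests.post(
    "https://models.hosted-on.mosaicml.hosting/instructor-large/v1/predict",
    headers={
        "Authorization": "<your_api_key>",
        "Content-Type": "application/json",
    },
    json={"inputs": [["Represent the Science title:",
                      "3D ActionSLAM: wearable person tracking in multi-floor environments"]]},
)
response.raise_for_status()
embedding = response.json()["outputs"][0]
print(len(embedding))  # dimensionality of the returned embedding vector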
Text Completion Models#
Text completion models are used to generate text based on a provided input prompt. They can be used for generic text completion, question answering, information extraction, and much more. All of our text completion models support streaming responses.
| Model | Description | Rate Limit | Endpoint |
|---|---|---|---|
| MPT-7B-Instruct | A state-of-the-art 6.7B parameter instruction finetuned language model trained by MosaicML. The model is pretrained for 1T tokens on a mixture of datasets, and then further instruction finetuned on a dataset derived from the Databricks Dolly-15k and the Anthropic Helpful and Harmless (HH-RLHF) datasets. | 5 RPS | mpt-7b-instruct/v1 |
| MPT-30B-Instruct | A state-of-the-art 30B parameter, 8,192-token sequence length, instruction finetuned language model trained by MosaicML. The model is pretrained for 1T tokens on a mixture of datasets, and then further instruction finetuned on a dataset derived from the Databricks Dolly-15k, Anthropic Helpful and Harmless (HH-RLHF), CompetitionMath, GradeSchoolMath, DialogSum, DuoRC, QASPER, QuALITY, SummScreen, and Spider datasets. | 5 RPS | |
| Llama2-70B-Chat | A state-of-the-art 70B parameter language model with a context length of 4096 tokens, trained by Meta. The model was pretrained on 2T tokens of text and fine-tuned for dialog use cases leveraging over 1 million human annotations. Llama 2 is licensed under the LLAMA 2 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved. Customers are responsible for ensuring compliance with applicable model licenses. | 5 RPS | llama2-70b-chat/v1 |
Text Completion Request - Endpoint#
To generate text completions, send a prompt to the hosted endpoint of the text completion model you wish to query using the following format:
POST https://models.hosted-on.mosaicml.hosting/<endpoint>
Text Completion Request - Body#
| Name | Type | Required | Description |
|---|---|---|---|
| inputs | List[str] | Yes | The prompt(s) to generate a completion for. |
| parameters | Dict[str, Any] | No | See the Text Completion Request - Parameters section below. |
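Putting the two fields together, a minimal sketch of a request body (the endpoint and parameter values are taken from the examples above; the prompt is our own):
from mcli.sdk import predict

body = {
    "inputs": ["Write a haiku about efficient model training."],  # List[str]
    "parameters": {"temperature": 0, "max_new_tokens": 64},       # Dict[str, Any], optional
}
# Assumes predict() returns the response body as a dict, as in the examples above.
response = predict('https://models.hosted-on.mosaicml.hosting/mpt-7b-instruct/v1', body)
print(response["outputs"][0])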
Text Completion Request - Parameters#
This section documents the parameters that can be used with the parameters field in the request body.
| Name | Type | Valid Range | Default | Description |
|---|---|---|---|---|
| top_p | float | [0.0, 1.0] | 0.95 | Defines the pool of tokens considered during sampling: tokens are added from most probable to least probable until the sum of their probabilities exceeds top_p. |
| top_k | int | \(\ge\) 1 | 50 | The number of highest-probability tokens kept for top-k filtering. Set this value to 1 to make outputs deterministic. |
| temperature | float | \(\ge\) 0.0 | 0.8 | The temperature of the sampling operation. 1 means regular sampling; 0 means always take the highest-scoring token. |
| max_new_tokens | int | \(\gt\) 0 | 256 | The maximum number of new tokens generated in the output. |
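For example, here is a minimal sketch contrasting deterministic and sampled decoding with these parameters (endpoint taken from the MPT example above; the prompt is our own):
from mcli.sdk import predict

URL = 'https://models.hosted-on.mosaicml.hosting/mpt-7b-instruct/v1'
prompt = "Write one sentence about why sequence length matters for language models."

# Deterministic: temperature 0 always takes the highest-scoring token.
greedy = predict(URL, {"inputs": [prompt], "parameters": {"temperature": 0, "max_new_tokens": 64}})

# Sampled: higher temperature with nucleus (top_p) and top-k filtering.
sampled = predict(URL, {"inputs": [prompt], "parameters": {
    "temperature": 0.8, "top_p": 0.95, "top_k": 50, "max_new_tokens": 64}})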
Rate Limiting#
All of our hosted models are rate limited on a per-user basis to ensure fairness across the platform. For example, the Instructor Large and Instructor XL models are each limited to 75 requests per second (RPS) per user. A user who sends more than 75 RPS to either text embedding model may receive HTTP 429 errors for some of their outstanding requests.
If your use case requires higher RPS than the default rate limit allows, please contact us, and we may be able to make accommodations.
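If you do hit the limit, a simple retry with exponential backoff is usually enough to smooth over short bursts. The sketch below uses the requests library; the retry policy is our own suggestion, not part of the API:
import time
import requests

def predict_with_backoff(url, headers, body, max_retries=5):
    # POST to an inference endpoint, backing off on HTTP 429 responses.
    for attempt in range(max_retries):
        response = requests.post(url, headers=headers, json=body)
        if response.status_code != 429:
            response.raise_for_status()
            return response.json()
        time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("Rate limited after repeated retries")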
Questions and Next Steps#
We want to help you get the most out of MosaicML Inference. Please bring your questions and feedback to the #inference-support channel in the MosaicML Community Slack workspace. We look forward to working with you.