Finetuning#

Finetuning is an easy and cost-effective way to create your own custom model.

Our finetuning API offers:

  1. A simple interface to our training stack to perform full model finetuning.

  2. Optimal default hyperparameters and model training setup.

  3. Finetuned model checkpoints saved to a remote store of your choice.

  4. Automatic conversion of Composer checkpoints to Hugging Face checkpoints for easy inference deployment.

  5. Finetuning on top of a completed proprietary model by loading the weights of a previously finetuned model.

We recommend trying finetuning if:

  • You have tried few-shot learning and want better results.

  • You have tried prompt engineering on an existing model and want better results.

  • You want full ownership over a custom model for data privacy.

  • You are latency-sensitive or cost-sensitive and want to use a smaller, cheaper model with your task-specific data.

Setup#

Before getting started with finetuning, make sure you have configured MosaicML access.

Task types#

We currently support two types of finetuning for your model:

  • Instruction Finetuning: Use this to finetune your model on prompt-response data.

  • Continued Pre-training: Use this task to continue training your model with additional text data.

Data preparation and credentials#

Instruction Finetuning#

The training data is your custom data, representing your specific task.

The training data must be in JSONL format, where each line is a prompt and response JSON object.

{"prompt": <your-custom-prompt>, "response": <your-custom-response>}
{"prompt": <your-custom-prompt>, "response": <your-custom-response>}

For a more extensive example, check out the mosaicml/dolly_hhrlhf example instruction finetuning dataset on Hugging Face.

To give you an idea of the data the model sees in this example, here are a couple of rows from the mosaicml/dolly_hhrlhf dataset.

{"prompt": "Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: what is Kubernetes? ### Response: ", "response": "Kubernetes is an open-source container orchestration system for automating software deployment, scaling, and management. Originally designed by Google, the project is now maintained by the Cloud Native Computing Foundation."}
{"prompt": "Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: Van Halen famously banned what color M&Ms in their rider? ### Response: ", "response": "Brown."}

File format#

If your training data (or optional eval data) is in a remote object store, you must provide the full path to your .jsonl file, e.g. s3://bucket/my-dataset.jsonl.

If you are using a Hugging Face dataset as your train data source, you must specify the full path with the split, e.g. mosaicml/instruct-v3/train and mosaicml/instruct-v3/test. This accounts for datasets that have different split schemas.

Continued Pre-training#

Here, the training data is your additional raw text data.

The training data must be in a folder containing .txt files. Any non-.txt files in the folder are ignored.

The first time you finetune with this data, it is converted internally into the more efficient Mosaic Data Streaming (MDS) format, which speeds up subsequent runs with the same training data. The MDS-converted data is stored in the checkpoint path you provide, under a /mds subdirectory.

File format#

If your training data (or optional eval data) is in a remote object store, you must provide the full path to your folder containing .txt files, e.g. s3://bucket/my-dataset.
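As a sketch, you can launch a continued pre-training run through the create_finetuning_run SDK call documented below by pointing train_data_path at your folder of .txt files and setting task_type accordingly; the bucket and folder names here are placeholders:

from mcli import create_finetuning_run

run = create_finetuning_run(
    model="mosaicml/mpt-7b",
    task_type="CONTINUED_PRETRAIN",
    train_data_path="s3://my-bucket/my-txt-folder",  # placeholder folder of .txt files
    save_folder="s3://my-bucket/checkpoints",        # MDS-converted data is cached under .../mds
)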

Supported data sources#

If you are using a remote object store as the source of your training data, you must first create an MCLI secret with the credentials to access your data.

Note that the folder for saving your checkpoints must also be in a remote object store, which likewise requires a secrets configuration. We support the following data sources:

Data Source        Example               MCLI Secret
Unity Catalog      dbfs:/Volumes/...     Databricks
AWS S3             s3://bucket/...       AWS S3
OCI                oci://bucket/...      OCI
GCP                gs://bucket/...       GCP
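For example, if your data lives in S3, you would first register your AWS credentials with MCLI. A minimal sketch of this step is shown below; the command runs interactively, and the exact prompts are described in the MCLI secrets documentation:

# Creates an S3 secret; MCLI will prompt for the paths to your AWS config and credentials files.
mcli create secret s3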

Supported models#

We currently support finetuning on the following suite of models:

Model                                    Maximum context length
mosaicml/mpt-7b-8k                       8192
mosaicml/mpt-7b                          2048
mosaicml/mpt-30b                         8192
mosaicml/mpt-7b-8k-instruct              8192
mosaicml/mpt-7b-instruct                 2048
mosaicml/mpt-30b-instruct                8192
meta-llama/Llama-2-7b-hf                 4096
meta-llama/Llama-2-13b-hf                4096
meta-llama/Llama-2-70b-hf                4096
meta-llama/Llama-2-7b-chat-hf            4096
meta-llama/Llama-2-13b-chat-hf           4096
meta-llama/Llama-2-70b-chat-hf           4096
codellama/CodeLlama-7b-hf                16384
codellama/CodeLlama-13b-hf               16384
codellama/CodeLlama-34b-hf               16384
codellama/CodeLlama-7b-Python-hf         16384
codellama/CodeLlama-13b-Python-hf        16384
codellama/CodeLlama-34b-Python-hf        16384
codellama/CodeLlama-7b-Instruct-hf       16384
codellama/CodeLlama-13b-Instruct-hf      16384
codellama/CodeLlama-34b-Instruct-hf      16384
mistralai/Mistral-7B-v0.1                32768
mistralai/Mistral-7B-Instruct-v0.2       32768
mistralai/Mixtral-8x7B-v0.1              32768

Building on custom model weights#

We also support instruction finetuning any of the models above starting from custom weights via the optional custom_weights_path argument. For example, you can create a domain-specific model with your custom data, and then pass the resulting checkpoint to the instruction finetuning API for further finetuning.

Provide the remote location of the Composer checkpoint from your previous run:

model: mosaicml/mpt-7b
custom_weights_path: oci://my-bucket/my-folder/mpt-7b/checkpoints/some_checkpoint.pt
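Equivalently, here is a sketch of the same setup through the SDK documented below, using the custom_weights_path argument; the bucket and checkpoint paths are placeholders:

from mcli import create_finetuning_run

run = create_finetuning_run(
    model="mosaicml/mpt-7b",
    train_data_path="mosaicml/dolly_hhrlhf/train",
    save_folder="oci://my-bucket/my-folder/checkpoints",  # placeholder bucket
    custom_weights_path="oci://my-bucket/my-folder/mpt-7b/checkpoints/some_checkpoint.pt",
)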

A quick example#

Here is a minimal example of finetuning a model on a dataset.

model: mosaicml/mpt-7b
train_data_path: mosaicml/dolly_hhrlhf/train
save_folder: s3://<my-bucket>/checkpoints

You can then launch this run and save checkpoints to your S3 bucket with the following command:

mcli finetune -f finetune.yaml

You can also pass overrides for the mandatory and optional fields to the YAML via the CLI command:

mcli finetune -f finetune.yaml \
  --model mosaicml/mpt-30b \
  --train-data-path mosaicml/instruct-v3/train \
  --eval-data-path mosaicml/instruct-v3/test \
  --custom-weights-path s3://my-custom-weights.pt \
  --training-duration 10000tok \
  --learning-rate 1e-5 \
  --context-length 8192
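These CLI flags map one-to-one onto the SDK keyword arguments documented below, so the same overridden run could also be launched programmatically. This is a sketch, with the placeholder S3 paths carried over from the examples above:

from mcli import create_finetuning_run

run = create_finetuning_run(
    model="mosaicml/mpt-30b",
    train_data_path="mosaicml/instruct-v3/train",
    save_folder="s3://<my-bucket>/checkpoints",       # placeholder bucket
    eval_data_path="mosaicml/instruct-v3/test",
    custom_weights_path="s3://my-custom-weights.pt",  # placeholder checkpoint
    training_duration="10000tok",
    learning_rate=1e-5,
    context_length=8192,
)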

Experiment tracking#

We support both MLflow and WandB as experiment trackers to monitor and visualize the metrics for your finetuning run. Set experiment_tracker to contain the configuration for the tracker you want to use.

MLflow#

Provide the full path for the experiment, including the experiment name. In Databricks Managed MLflow, this will be a workspace path resembling /Users/example@domain.com/my_experiment. You can also provide a model_registry_path for model deployment. Make sure to configure your Databricks secret.

experiment_tracker:
  mlflow:
    experiment_path: /Users/example@domain.com/my_experiment
    model_registry_path: catalog.schema | catalog.schema.model_name # optional

Non-Databricks Managed MLflow

If you are not using Databricks Managed MLflow, include your tracking_uri. We do not support model serving for non-Databricks Managed MLflow tracking, so do not include a model_registry_path.

experiment_tracker:
  mlflow:
    experiment_path: my_experiment_path
    tracking_uri: my_tracking_uri

Weights & Biases#

Include both project name and entity name in your configuration, and make sure to set up your WandB secret.

experiment_tracker:
  wandb:
    project: my-project
    entity: my-entity
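Either tracker configuration can also be passed directly to the SDK through the experiment_tracker argument. A sketch with placeholder workspace path, project, and bucket names:

from mcli import create_finetuning_run

run = create_finetuning_run(
    model="mosaicml/mpt-7b",
    train_data_path="mosaicml/dolly_hhrlhf/train",
    save_folder="s3://my-bucket/checkpoints",  # placeholder bucket
    # Databricks Managed MLflow tracking:
    experiment_tracker={"mlflow": {"experiment_path": "/Users/example@domain.com/my_experiment"}},
    # or, for Weights & Biases:
    # experiment_tracker={"wandb": {"project": "my-project", "entity": "my-entity"}},
)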

Launching a finetuning run#

Finetuning can be configured through a YAML or via our Python API.

The fields are the same for both methods:

mcli.api.finetuning_runs.create_finetuning_run(model, train_data_path, save_folder, *, task_type='INSTRUCTION_FINETUNE', eval_data_path=None, eval_prompts=None, custom_weights_path=None, training_duration=None, learning_rate=None, context_length=None, experiment_tracker=None, disable_credentials_check=None, timeout=10, future=False)[source]

Finetunes a model on a finetuning dataset and converts the final Composer checkpoint to a Hugging Face formatted checkpoint for inference.

Parameters
  • model – The name of the Hugging Face model to use.

  • train_data_path – The full remote location of your training data (e.g. s3://my-bucket/my-data.jsonl). For INSTRUCTION_FINETUNE, another option is to provide the name of a Hugging Face dataset that includes the train split, e.g. mosaicml/dolly_hhrlhf/train. The data should be formatted with each row containing a prompt and response field for INSTRUCTION_FINETUNE, or as raw text for CONTINUED_PRETRAIN.

  • save_folder – The remote location to save the finetuned checkpoints. For example, if your save_folder is s3://my-bucket/my-checkpoints, the finetuned Composer checkpoints will be saved to s3://my-bucket/my-checkpoints/<run-name>/checkpoints, and Hugging Face formatted checkpoints will be saved to s3://my-bucket/my-checkpoints/<run-name>/hf_checkpoints. The supported cloud provider prefixes are s3://, gs://, and oci://.

  • task_type – The type of finetuning task to run. Current available options are INSTRUCTION_FINETUNE and CONTINUED_PRETRAIN, defaults to INSTRUCTION_FINETUNE.

  • eval_data_path – The remote location of your evaluation data (e.g. s3://my-bucket/my-data.jsonl). For INSTRUCTION_FINETUNE, the name of a Hugging Face dataset with the test split (e.g. mosaicml/dolly_hhrlhf/test) can also be given. The evaluation data should be formatted with each row containing a prompt and response field for INSTRUCTION_FINETUNE, or as raw text for CONTINUED_PRETRAIN. Default is None.

  • eval_prompts

    A list of prompt strings to generate during training. Results will be logged to the experiment tracker(s) you’ve configured. Generations will occur at every model checkpoint with the following generation parameters:

    • max_new_tokens: 100

    • temperature: 1

    • top_k: 50

    • top_p: 0.95

    • do_sample: true

    Default is None (no generations are performed).

  • custom_weights_path – The remote location of a custom model checkpoint to use for finetuning. If provided, these weights will be used instead of the original pretrained weights of the model. This must be a Composer checkpoint. Default is None.

  • training_duration – The total duration of your finetuning run. This can be specified in batches (e.g. 100ba), epochs (e.g. 10ep), or tokens (e.g. 1_000_000tok). Default is 1ep.

  • learning_rate – The peak learning rate to use for finetuning. Default is 5e-7. The optimizer used is DecoupledLionW with betas of 0.90 and 0.95 and no weight decay, and the learning rate scheduler used is LinearWithWarmupSchedule with a warmup of 2% of the total training duration and a final learning rate multiplier of 0.

  • context_length – The maximum sequence length to use. This will be used to truncate any data that is too long. The default is the default for the provided Hugging Face model. We do not support extending the context length beyond each model’s default.

  • experiment_tracker – The configuration for an experiment tracker. For example, to add Weights and Biases tracking, you can pass in {"wandb": {"project": "my-project", "entity": "my-entity"}}. To add MLflow tracking, you can pass in {"mlflow": {"experiment_path": "my-experiment", "model_registry_path": "catalog.schema.model_name"}}.

  • disable_credentials_check – Flag to disable checking credentials (S3, Databricks, etc.). If the credentials check is enabled (False), a preflight check will be run on finetune submission, running a few tests to ensure that the credentials provided are valid for the resources you are attempting to access (S3 buckets, Databricks experiments, etc.). If the credentials check fails, your finetuning run will be stopped.

  • timeout – Time, in seconds, in which the call should complete. If the run creation takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

  • future – Return the output as a Future. If True, the call to finetune will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Finetune output, use return_value.result() with an optional timeout argument.

Returns

A Finetune object containing the finetuning run information.

You can configure your finetuning run either using the SDK or a YAML:

from mcli import create_finetuning_run
run = create_finetuning_run(
    model="mosaicml/mpt-30b",
    train_data_path="s3://my-bucket/my-data",
    save_folder="s3://my-bucket/checkpoints",
    eval_data_path="s3://my-bucket/my-data",
    eval_prompts=["Example prompt", "Example prompt 2"],
    custom_weights_path="s3://my-bucket/my-custom-weights.pt",
    training_duration="10ep",
    experiment_tracker={"mlflow": {"experiment_path": "/Users/example@domain.com/my_experiment"}},
)

The equivalent YAML configuration:

model: mosaicml/mpt-30b
train_data_path: s3://my-bucket/my-data
save_folder: s3://my-bucket/checkpoints
eval_data_path: s3://my-bucket/my-data
eval_prompts:
  - Example prompt
  - Example prompt 2
custom_weights_path: s3://my-bucket/my-custom-weights.pt
training_duration: 10ep
experiment_tracker:
  mlflow:
    experiment_path: /Users/example@domain.com/my_experiment

Calling the finetuning API launches your run directly, while the YAML needs to be launched with mcli finetune -f <your-yaml>. The SDK call returns a Finetune object.
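If you prefer a non-blocking submission, here is a sketch of using the documented future argument and resolving the result into the Finetune object (whose fields are described below); the bucket path is a placeholder:

from mcli import create_finetuning_run

# Submit in the background; the call returns a Future immediately.
run_future = create_finetuning_run(
    model="mosaicml/mpt-7b",
    train_data_path="mosaicml/dolly_hhrlhf/train",
    save_folder="s3://my-bucket/checkpoints",  # placeholder bucket
    future=True,
)

# Block until the run has been created, then inspect the returned Finetune object.
run = run_future.result(timeout=60)
print(run.name, run.status)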

mcli.Finetune(id, name, status, created_at, updated_at, created_by, started_at=None, completed_at=None, reason=None, estimated_end_time=None, model=None, save_folder=None, train_data_path=None, submitted_config=None, events=None, _required_properties=('id', 'name', 'status', 'createdByEmail', 'createdAt', 'updatedAt'))[source]

A Finetune that has been run on the MosaicML platform

Parameters
  • id – The unique identifier for this finetuning run.

  • name – The name of the finetuning run.

  • status – The current status of the finetuning run. This is a RunStatus enum, which has values such as PENDING, RUNNING, or COMPLETED.

  • created_at – The timestamp at which the finetuning run was created.

  • updated_at – The timestamp at which the finetuning run was last updated.

  • created_by – The email address of the user who created the finetuning run.

  • started_at – The timestamp at which the finetuning run was started.

  • completed_at – The timestamp at which the finetuning run was completed.

  • reason – The reason for the finetuning run’s current status, such as Run completed successfully.

See the Finetuning CLI and Finetuning SDK for more information on how to interact with your finetuning runs.

See the Finetuning Schema for more information about the parameters for the finetuning API.

Looking for more configurability over model training? Try creating a training run instead, and see the LLM Foundry finetuning documentation for more details.

To give you an idea of what this will look like in an experiment tracker, we provide the loss curve (logged to Weights & Biases) of a 3-epoch finetuning run using mosaicml/mpt-7b on the mosaicml/dolly_hhrlhf dataset.

Finetuning loss curve


Want to evaluate your model?#

Our finetuning API provides two lightweight solutions that run evaluation during finetuning:

  1. eval_data_path: The remote location of your evaluation data (e.g. s3://my-bucket/my-data.jsonl). This should be in the same format as your training data; see the file format instructions above. We will compute Cross Entropy and Perplexity on this evaluation data.

  2. eval_prompts: A list of prompt strings to generate from periodically during training. Results will be logged to the experiment tracker(s) you’ve configured. Note: an experiment tracker is required to use this parameter.

For complete evaluation after finetuning, see our LLM evaluation framework for open-source in-context learning (ICL) tasks.


Ready to deploy your model?#

You can use Databricks Model Serving to serve your model. Make sure your Databricks credentials are set up.

You’ll need Databricks Managed MLflow, and you’ll need to configure your MLflow integration with the following fields:

experiment_tracker:
  mlflow:
    experiment_path: /Users/example@domain.com/<your-experiment-name>
    model_registry_path: <catalog>.<schema>

With this configuration, the finetuning run will automatically register your model in the Unity Catalog. There are a few steps to deploy your model:

  1. Navigate to your Databricks UI

  2. Click on Catalog

  3. Search for your model: type your catalog name into the left-hand panel and click on it, then find your schema in the right-hand panel and click on it.

  4. You’ll see a list of the registered models in the left-hand panel.

We are working on simplifying this experience in the Databricks-native finetuning product.

Don’t want to use MLflow? Your Hugging Face formatted checkpoints will be available in the location you specified for save_folder.
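Once you have copied a Hugging Face formatted checkpoint locally (from <save_folder>/<run-name>/hf_checkpoints), here is a sketch of loading it for inference with the transformers library. The local directory name is hypothetical, and this step is independent of the finetuning API:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical local copy of the hf_checkpoints output folder.
checkpoint_dir = "./hf_checkpoints"

tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
# trust_remote_code=True is needed for models such as MPT that ship custom modeling code.
model = AutoModelForCausalLM.from_pretrained(checkpoint_dir, trust_remote_code=True)

inputs = tokenizer("Below is an instruction that describes a task. ### Instruction: what is Kubernetes? ### Response: ", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))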


Help us improve!#

We’re eager to hear your feedback! If our Finetuning API doesn’t meet your needs, please let us know so we can prioritize future enhancements to better support you. Your input is invaluable in shaping our API’s growth and development!