Configure a finetuning run#

Finetuning run submissions to the MosaicML platform can be configured through a YAML file or using our Python API create_finetuning_run().

The fields are identical across both methods:

| Field | Required/Optional | Type |
| --- | --- | --- |
| model | required | str |
| train_data_path | required | str |
| save_folder | required | str |
| task_type | optional | str |
| eval_data_path | optional | str |
| eval_prompts | optional | str |
| custom_weights_path | optional | str |
| training_duration | optional | str |
| learning_rate | optional | str |
| context_length | optional | str |
| experiment_tracker | optional | Dict[Dict] |
| disable_credentials_check | optional | bool |

Here’s an example finetuning run configuration in YAML:

```yaml
model: mosaicml/mpt-7b
train_data_path: mosaicml/dolly_hhrlhf/train
eval_data_path: mosaicml/dolly_hhrlhf/test
save_folder: <fill-in>
experiment_tracker:
  mlflow:
    experiment_path: <fill-in>
    model_registry_path: <fill-in>
eval_prompts:
- A quick brown fox jumped
- Who was the president of the US in 1776?
```

And the same run configured through the Python API:

```python
from mcli import create_finetuning_run

ft = create_finetuning_run(
    model='mosaicml/mpt-7b',
    train_data_path='mosaicml/dolly_hhrlhf/train',
    eval_data_path='mosaicml/dolly_hhrlhf/test',
    save_folder='<fill-in>',
    experiment_tracker={'mlflow': {'experiment_path': '<fill-in>', 'model_registry_path': '<fill-in>'}},
    eval_prompts=['A quick brown fox jumped', 'Who was the president of the US in 1776?'],
)
```

Field Types#

Model#

The name of the Hugging Face model to use. Current available options are listed in the supported models section of the Finetuning landing page.

Train data path#

The full remote location of your training data (e.g. s3://my-bucket/my-data.jsonl).

For INSTRUCTION_FINETUNE, you can instead provide the name of a Hugging Face dataset that includes the split name, e.g. mosaicml/dolly_hhrlhf/train. For this task type, each row of the data should contain a ‘prompt’ field and a ‘response’ field.
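As a sketch of that row format, the snippet below writes and validates a couple of hypothetical instruction-finetuning rows (the file name and contents are illustrative only):

```python
import json

# Hypothetical illustration of the expected INSTRUCTION_FINETUNE row
# format: one JSON object per line, each with a 'prompt' and a
# 'response' field.
rows = [
    {"prompt": "What is the capital of France?", "response": "Paris."},
    {"prompt": "Summarize: The sky is blue.", "response": "The sky appears blue."},
]

with open("my-data.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Basic validation: every line parses and has the required fields.
with open("my-data.jsonl") as f:
    for line in f:
        parsed = json.loads(line)
        assert {"prompt", "response"} <= parsed.keys()
```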

For CONTINUED_PRETRAIN, this is a folder of txt files. We automatically convert your raw data to MosaicML streaming dataset format.

Save folder#

The remote location to save the finetuned checkpoints. For example, if your save_folder is s3://my-bucket/my-checkpoints, the finetuned Composer checkpoints will be saved to s3://my-bucket/my-checkpoints/<run-name>/checkpoints, and Hugging Face formatted checkpoints will be saved to s3://my-bucket/my-checkpoints/<run-name>/hf_checkpoints. The supported cloud provider prefixes are s3://, gs://, and oci://.
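The layout described above can be sketched as simple path concatenation (the run name below is a hypothetical value; the platform assigns the actual one):

```python
# Illustration only: derive the two checkpoint destinations described
# above from a save_folder and a run name (both hypothetical values).
save_folder = "s3://my-bucket/my-checkpoints"
run_name = "my-finetune-run"  # assumed run name for illustration

composer_ckpts = f"{save_folder}/{run_name}/checkpoints"
hf_ckpts = f"{save_folder}/{run_name}/hf_checkpoints"

print(composer_ckpts)  # s3://my-bucket/my-checkpoints/my-finetune-run/checkpoints
print(hf_ckpts)        # s3://my-bucket/my-checkpoints/my-finetune-run/hf_checkpoints
```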

Task type#

The type of finetuning task to run. Currently available options are INSTRUCTION_FINETUNE and CONTINUED_PRETRAIN. Defaults to INSTRUCTION_FINETUNE.

Eval data path#

The full remote location of your evaluation data (e.g. s3://my-bucket/my-data.jsonl).

For INSTRUCTION_FINETUNE, you can instead provide the name of a Hugging Face dataset that includes the split name, e.g. mosaicml/dolly_hhrlhf/test. As with the training data, each row should contain a ‘prompt’ and a ‘response’ field.

For CONTINUED_PRETRAIN, this is a folder of txt files. We automatically convert your raw data to streaming format to support training at scale.

Eval prompts#

A list of prompt strings to generate responses for during training. Results are logged to the experiment tracker(s) you’ve configured. Generations occur at every model checkpoint with the following generation parameters:

  • max_new_tokens: 100

  • temperature: 1

  • top_k: 50

  • top_p: 0.95

  • do_sample: true

Default is None (no generations are run).
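For reference, the fixed generation parameters listed above collected into a Python dict (an illustrative summary, not an actual mcli API object):

```python
# The generation parameters listed above, gathered for reference only.
generation_params = {
    "max_new_tokens": 100,
    "temperature": 1,
    "top_k": 50,
    "top_p": 0.95,
    "do_sample": True,
}

print(generation_params["max_new_tokens"])  # 100
```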

Custom weights path#

The remote location of a custom model checkpoint to use for finetuning. If provided, these weights will be used instead of the original pretrained weights of the model. This must be a Composer checkpoint. Note, these must be weights that match the model type you have chosen. No custom architecture modifications are supported. Default is None, meaning the finetuning run will start from the original pretrained weights of the chosen model.

Training duration#

The total duration of your finetuning run. This can be specified in batches (e.g. 100ba), epochs (e.g. 10ep), or tokens (e.g. 1_000_000tok). Default is 1ep.
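These duration strings are just a number with a unit suffix; a minimal parser sketch (illustrative only — the platform's own parsing may differ) looks like:

```python
def parse_duration(s: str) -> tuple[int, str]:
    """Split a duration string like '100ba', '10ep', or '1_000_000tok'
    into (value, unit). Illustrative only; not the platform's parser."""
    for unit in ("tok", "ba", "ep"):
        if s.endswith(unit):
            # Python int() accepts underscore digit separators.
            return int(s[: -len(unit)]), unit
    raise ValueError(f"Unrecognized duration unit in {s!r}")

print(parse_duration("1_000_000tok"))  # (1000000, 'tok')
```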

Learning rate#

The peak learning rate to use for finetuning. Default is 5e-7. The optimizer used is DecoupledLionW with betas of 0.90 and 0.95 and no weight decay, and the learning rate scheduler used is LinearWithWarmupSchedule with a warmup of 2% of the total training duration and a final learning rate multiplier of 0.
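The schedule described above can be sketched as a plain function of training progress (a simplified illustration, not the actual LinearWithWarmupSchedule implementation, and assuming steps as the duration unit):

```python
def lr_at_step(step: int, total_steps: int, peak_lr: float = 5e-7,
               warmup_frac: float = 0.02, final_mult: float = 0.0) -> float:
    """Illustrative linear warmup + linear decay schedule matching the
    description above: warm up over 2% of training, then decay linearly
    to peak_lr * final_mult (0 here) by the end of training."""
    warmup_steps = warmup_frac * total_steps
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Linear decay from peak_lr down to peak_lr * final_mult.
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * (1 - frac * (1 - final_mult))
```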

Context length#

The maximum sequence length to use. This will be used to truncate any data that is too long.

The default is the provided Hugging Face model’s default. We do not support extending the context length beyond each model’s default at this time. See the supported models section for context length information per model.

Experiment tracker#

Experiment tracker configurations. For example, to add MLflow tracking, set the tracker config to {'mlflow': {'experiment_path': '/Users/xxx@yyyy.com/<your-experiment-name>'}}.

Disable credentials check#

Flag to disable checking credentials (S3, Databricks, etc.).

If the credentials check is enabled (the default), a preflight check will be run on finetune submission. This runs a few tests to ensure that your user has valid credentials for the resources you are attempting to access (e.g. S3 buckets, MLflow experiments, etc.). If the credential check fails, your finetune run will be stopped.