Finetuning SDK#
Creating a finetuning run#
- mcli.create_finetuning_run(model, train_data_path, save_folder, *, task_type='INSTRUCTION_FINETUNE', eval_data_path=None, eval_prompts=None, custom_weights_path=None, training_duration=None, learning_rate=None, context_length=None, experiment_tracker=None, disable_credentials_check=None, timeout=10, future=False)
Finetunes a model on a finetuning dataset and converts the final Composer checkpoint to a Hugging Face formatted checkpoint for inference.
- Parameters
model – The name of the Hugging Face model to use.
train_data_path – The full remote location of your training data (e.g. s3://my-bucket/my-data.jsonl). For INSTRUCTION_FINETUNE, you can instead provide the name of a Hugging Face dataset that includes the train split, such as mosaicml/dolly_hhrlhf/train. Each row of the data should contain a prompt and a response field for INSTRUCTION_FINETUNE, or be in raw data format for CONTINUED_PRETRAIN.
save_folder – The remote location to save the finetuned checkpoints. For example, if your save_folder is s3://my-bucket/my-checkpoints, the finetuned Composer checkpoints will be saved to s3://my-bucket/my-checkpoints/<run-name>/checkpoints, and Hugging Face formatted checkpoints will be saved to s3://my-bucket/my-checkpoints/<run-name>/hf_checkpoints. The supported cloud provider prefixes are s3://, gs://, and oci://.
task_type – The type of finetuning task to run. The currently available options are INSTRUCTION_FINETUNE and CONTINUED_PRETRAIN. Defaults to INSTRUCTION_FINETUNE.
eval_data_path – The remote location of your evaluation data (e.g. s3://my-bucket/my-data.jsonl). For INSTRUCTION_FINETUNE, the name of a Hugging Face dataset with the test split (e.g. mosaicml/dolly_hhrlhf/test) can also be given. Each row of the evaluation data should contain a prompt and a response field for INSTRUCTION_FINETUNE, or be in raw data format for CONTINUED_PRETRAIN. Default is None.
eval_prompts – A list of prompt strings to generate from during training. Results will be logged to the experiment tracker(s) you’ve configured. Generations occur at every model checkpoint with the following generation parameters:
max_new_tokens: 100
temperature: 1
top_k: 50
top_p: 0.95
do_sample: true
Default is None (no generations are produced).
custom_weights_path – The remote location of a custom model checkpoint to use for finetuning. If provided, these weights will be used instead of the original pretrained weights of the model. This must be a Composer checkpoint. Default is None.
training_duration – The total duration of your finetuning run. This can be specified in batches (e.g. 100ba), epochs (e.g. 10ep), or tokens (e.g. 1_000_000tok). Default is 1ep.
learning_rate – The peak learning rate to use for finetuning. Default is 5e-7. The optimizer used is DecoupledLionW with betas of 0.90 and 0.95 and no weight decay, and the learning rate scheduler used is LinearWithWarmupSchedule with a warmup of 2% of the total training duration and a final learning rate multiplier of 0.
context_length – The maximum sequence length to use. Any data longer than this will be truncated. The default is the default for the provided Hugging Face model. Extending the context length beyond each model’s default is not supported.
experiment_tracker – The configuration for an experiment tracker. For example, to add Weights and Biases tracking, you can pass in {"wandb": {"project": "my-project", "entity": "my-entity"}}. To add MLflow tracking, you can pass in {"mlflow": {"experiment_path": "my-experiment", "model_registry_path": "catalog.schema.model_name"}}.
disable_credentials_check – Flag to disable checking credentials (S3, Databricks, etc.). If the credentials check is enabled (False), a preflight check will be run on finetune submission, running a few tests to ensure that the credentials provided are valid for the resources you are attempting to access (S3 buckets, Databricks experiments, etc.). If the credentials check fails, your finetuning run will be stopped.
timeout – Time, in seconds, in which the call should complete. If the run creation takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future – Return the output as a Future. If True, the call to finetune will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Finetune output, use return_value.result() with an optional timeout argument.
- Returns
A Finetune object containing the finetuning run information.
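The learning-rate behavior described for the learning_rate parameter can be pictured as a multiplier applied to the peak rate. The snippet below is a minimal sketch and not Composer's implementation: it assumes the rate rises linearly from 0 to the peak over the first 2% of training and then decays linearly to the final multiplier of 0, with progress measured in batches. The function name lr_multiplier and the 1000-batch run length are illustrative assumptions.

```python
def lr_multiplier(step: int, total_steps: int, warmup_frac: float = 0.02,
                  final_multiplier: float = 0.0) -> float:
    """Multiplier applied to the peak learning rate at a given batch index.

    Assumed shape: linear warmup from 0 to 1, then linear decay from 1 to
    `final_multiplier` over the rest of training.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    step = min(max(step, 0), total_steps)  # clamp to the valid range
    warmup_steps = warmup_frac * total_steps
    if warmup_steps > 0 and step < warmup_steps:
        return step / warmup_steps
    decay_span = total_steps - warmup_steps
    if decay_span <= 0:
        return final_multiplier
    progress = (step - warmup_steps) / decay_span
    return 1.0 + progress * (final_multiplier - 1.0)


peak_lr = 5e-7  # documented default peak learning rate
total = 1000    # hypothetical run length in batches
for step in (0, 10, 20, 510, 1000):
    print(step, peak_lr * lr_multiplier(step, total))
```

Under these assumptions the rate reaches its peak at batch 20 of 1000, falls to half the peak at batch 510, and reaches 0 at the final batch.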
Finetuning runs can be programmatically created, which provides flexibility to define custom workflows or create similar finetuning runs in quick succession.
create_finetuning_run() takes fields that allow you to create a customized model. At a minimum, you’ll need to provide the model you want to finetune, the location of your training dataset, and the location where your checkpoints will be saved. There are many optional fields that allow you to perform evaluation, register your model for deployment, and change the hyperparameters of your finetuning run.
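As a rough illustration of the minimum inputs, the sketch below prepares an INSTRUCTION_FINETUNE dataset as JSON Lines with a prompt and a response field per row, then shows a call that passes only the required arguments plus one optional field. The helper names write_instruction_jsonl and launch_run, the sample rows, and the bucket paths are illustrative assumptions; the call itself is wrapped in a function and not executed here, and the local file would need to be uploaded to the remote location before launching.

```python
import json
import tempfile
from pathlib import Path


def write_instruction_jsonl(rows, path):
    """Write prompt/response rows as JSON Lines, one JSON object per line."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for i, row in enumerate(rows):
            missing = {"prompt", "response"} - set(row)
            if missing:
                raise ValueError(f"row {i} is missing fields: {sorted(missing)}")
            record = {"prompt": row["prompt"], "response": row["response"]}
            f.write(json.dumps(record) + "\n")
    return path


def launch_run():
    """Submit a run with the three required arguments and one optional field."""
    import mcli  # imported lazily so the data helper works without the SDK

    return mcli.create_finetuning_run(
        model="mosaicml/mpt-7b",
        train_data_path="s3://my-bucket/my-data.jsonl",  # upload the file here first
        save_folder="s3://my-bucket/my-checkpoints",
        training_duration="2ep",  # optional; the documented default is 1ep
    )


# Illustrative rows only; real data would come from your own dataset.
rows = [
    {"prompt": "What is finetuning?", "response": "Adapting a pretrained model to new data."},
    {"prompt": "Name a supported cloud prefix.", "response": "s3://"},
]
with tempfile.TemporaryDirectory() as tmp:
    local_file = write_instruction_jsonl(rows, Path(tmp) / "my-data.jsonl")
    lines = local_file.read_text(encoding="utf-8").splitlines()
print(len(lines))
```

Validating the two fields before upload is a design choice for this sketch: it surfaces a malformed row locally instead of waiting for the run's data validation step.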
Listing finetuning runs#
You can use the get_finetuning_runs() function to see the finetuning runs you have launched.
Optional filters allow you to specify a subset of the finetuning runs to list by finetuning run name, email of the person who created the run, or the run status.
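Once a list of runs has been retrieved, it is often useful to summarize it locally. The sketch below tallies runs by status using a stand-in class so that it runs without the SDK. RunStub and count_by_status are hypothetical names, and treating the status as a plain string attribute is an assumption about the returned objects.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class RunStub:
    """Stand-in for a returned run object; `status` as a string is an assumption."""
    name: str
    status: str


def count_by_status(runs):
    """Return a Counter mapping each status to the number of runs in it."""
    return Counter(run.status for run in runs)


# With the SDK installed, `runs` would come from get_finetuning_runs(),
# optionally narrowed first, e.g. get_finetuning_runs(statuses=['FAILED']).
runs = [
    RunStub("experiment1-a", "COMPLETED"),
    RunStub("experiment1-b", "FAILED"),
    RunStub("experiment2-a", "COMPLETED"),
]
summary = count_by_status(runs)
print(dict(summary))
```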
Stopping finetuning runs#
To stop finetuning runs, pass a list of run names or Finetune objects. You can optionally provide a custom reason explaining why you are stopping the runs, which is kept for posterity.
Deleting finetuning runs#
To delete a run, pass the finetuning run name or the Finetune object.
To delete a set of runs, you can use the output of get_finetuning_runs():
import datetime as dt

from mcli import delete_finetuning_run, delete_finetuning_runs, get_finetuning_runs

# delete a finetuning run by name
delete_finetuning_run('delete-this-run')

# delete failed runs
failed_finetuning_runs = get_finetuning_runs(statuses=['FAILED'])
delete_finetuning_runs(failed_finetuning_runs)

# delete completed runs older than a month whose names match a pattern
completed = get_finetuning_runs(statuses=['COMPLETED'])
ref_date = dt.datetime.now() - dt.timedelta(days=30)
old_finetuning_runs = [
    ft for ft in completed
    if 'experiment1' in ft.name and ft.created_at < ref_date
]
delete_finetuning_runs(old_finetuning_runs)
List finetuning events#
You can use the list_finetuning_events function at any time during your run to understand your run’s progress. Call list_finetuning_events by passing a finetuning run name or the Finetune object.
# list events for a finetuning run
list_finetuning_events('my-ft-run')
# returns a list of events that have occurred
[
    FormattedRunEvent(
        event_type='CREATED',
        event_time='2023-12-05T19:02:57.191Z',
        event_message='Run created.'),
    FormattedRunEvent(
        event_type='CHECK_PASSED',
        event_time='2023-12-05T19:03:14.325Z',
        event_message='Credentials check passed.'),
    FormattedRunEvent(
        event_type='STARTED',
        event_time='2023-12-05T19:03:17.757Z',
        event_message='Run started.'),
    FormattedRunEvent(
        event_type='DATA_VALIDATED',
        event_time='2023-12-05T19:05:29.225Z',
        event_message='Training data validated.'),
    FormattedRunEvent(
        event_type='MODEL_INITIALIZED',
        event_time='2023-12-05T19:06:49.702Z',
        event_message='Model data downloaded and initialized for base model mosaicml/mpt-7b.'),
    FormattedRunEvent(
        event_type='TRAIN_UPDATED',
        event_time='2023-12-05T19:35:48.806Z',
        event_message='[epoch=1/1][batch=50/56][ETA=5min] Train loss: 1.71'),
]
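The TRAIN_UPDATED message packs several progress fields into one string. The sketch below pulls them out with a regular expression. The pattern is inferred from the single example message shown above, so it is an assumption about the format and not a documented contract; messages for runs measured in tokens may look different. parse_train_update is a hypothetical helper name.

```python
import re

# Pattern inferred from: '[epoch=1/1][batch=50/56][ETA=5min] Train loss: 1.71'
TRAIN_UPDATE_PATTERN = re.compile(
    r"\[epoch=(?P<epoch>\d+)/(?P<total_epochs>\d+)\]"
    r"\[batch=(?P<batch>\d+)/(?P<total_batches>\d+)\]"
    r"\[ETA=(?P<eta>[^\]]+)\]\s*"
    r"Train loss:\s*(?P<loss>\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
)


def parse_train_update(message):
    """Return the progress fields as a dict, or None if the message does not match."""
    match = TRAIN_UPDATE_PATTERN.search(message)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "epoch": int(fields["epoch"]),
        "total_epochs": int(fields["total_epochs"]),
        "batch": int(fields["batch"]),
        "total_batches": int(fields["total_batches"]),
        "eta": fields["eta"],  # kept as text, e.g. '5min'
        "train_loss": float(fields["loss"]),
    }


progress = parse_train_update("[epoch=1/1][batch=50/56][ETA=5min] Train loss: 1.71")
print(progress)
```

Returning None for non-matching messages lets the same helper be applied to every event in the list without first filtering on event_type.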
Finetuning events#
The expected order of events in a finetuning run is as follows:
| Event Type | Example Event Message | Definition |
|---|---|---|
| CREATED | Run created. | Finetuning run was created. At this point, the run will either proceed if resources are available or will be pending. |
| CHECK_PASSED | Credentials check passed. | Run passes preflight check. |
| | | Run fails preflight check. |
| STARTED | Run started. | Resources have been allocated and the run has started. |
| DATA_VALIDATED | Training data validated. | Validated that training data is correctly formatted. |
| MODEL_INITIALIZED | Model data downloaded and initialized for base model mosaicml/mpt-7b. | Weights for base model have been downloaded, and training is ready to begin. |
| TRAIN_UPDATED | [epoch=1/1][batch=50/56][ETA=5min] Train loss: 1.71 | Reports the current training batch, epoch, or token, ETA for training to finish (not including checkpoint upload time) and train loss. This event is updated on every batch end. Note: if you submit your finetuning config with |
| | | Training has finished. Checkpoint uploading begins. |
| | | Checkpoint has uploaded, and the run has completed. |
| | | Run will stop if |
| | | Check |
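Because events arrive in a known order, a script can poll the event list until a particular stage is reached. The sketch below is one way to do that, under stated assumptions: wait_for_event is a hypothetical helper, the event source is passed in as a callable so the example runs without the SDK, and events are assumed to expose an event_type attribute as in the FormattedRunEvent output shown earlier.

```python
import time
from types import SimpleNamespace


def wait_for_event(fetch_events, target_types, poll_seconds=30.0,
                   max_polls=120, sleep=time.sleep):
    """Poll until an event whose type is in `target_types` appears.

    `fetch_events` is any zero-argument callable returning the run's events,
    for example `lambda: list_finetuning_events('my-ft-run')` with the SDK.
    Returns the first matching event, or None once `max_polls` is exhausted.
    """
    targets = set(target_types)
    for attempt in range(max_polls):
        for event in fetch_events():
            if event.event_type in targets:
                return event
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    return None


def make_events(*types):
    """Build stand-in events carrying only an event_type attribute."""
    return [SimpleNamespace(event_type=t) for t in types]


# Simulated history: each poll sees one more event than the previous poll.
history = iter([
    make_events("CREATED"),
    make_events("CREATED", "STARTED"),
    make_events("CREATED", "STARTED", "DATA_VALIDATED"),
])
found = wait_for_event(lambda: next(history), {"DATA_VALIDATED"},
                       sleep=lambda seconds: None)  # skip real sleeping in the demo
print(found.event_type)
```

Bounding the loop with max_polls keeps a script from waiting forever on a run that never reaches the target stage.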