Pretraining SDK#
Creating a pretraining run#
- mcli.create_pretraining_run(model, train_data, save_folder, *, compute=None, tokenizer=None, training_duration=None, parameters=None, eval=None, experiment_tracker=None, custom_weights_path=None, timeout=10, future=False)[source]
Create a pretraining run.
- Parameters
  - model – The name of the Hugging Face model to use. Required.
  - train_data – Either a list of paths to the training data, or a mapping of dataset names to the path and proportion of the dataset to use. For example, if you have two datasets, `dataset1` and `dataset2`, and you want to use 80% of `dataset1` and 20% of `dataset2`, you can pass in `{"dataset1": {"path": "path/to/dataset1", "proportion": .8}, "dataset2": {"path": "path/to/dataset2", "proportion": .2}}`. Required.
  - save_folder – The remote location to save the checkpoints. For example, if your `save_folder` is `s3://my-bucket/my-checkpoints`, the Composer checkpoints will be saved to `s3://my-bucket/my-checkpoints/<run-name>/checkpoints`, and Hugging Face formatted checkpoints will be saved to `s3://my-bucket/my-checkpoints/<run-name>/hf_checkpoints`. The supported cloud provider prefixes are `s3://`, `gs://`, and `oci://`. Required.
  - compute – The compute configuration to use. Required for now.
  - tokenizer – Tokenizer configuration to use. If not provided, the default tokenizer for the model will be used.
  - training_duration – The total duration of your run. This can be specified in batches (e.g. `100ba`), epochs (e.g. `10ep`), or tokens (e.g. `1_000_000tok`). Default is `1ep`.
  - parameters – Additional parameters to pass to the model:
    - learning_rate: The peak learning rate to use. Default is `5e-7`. The optimizer used is DecoupledLionW with betas of 0.90 and 0.95 and no weight decay, and the learning rate scheduler used is LinearWithWarmupSchedule with a warmup of 2% of the total training duration and a final learning rate multiplier of 0.
    - context_length: The maximum sequence length to use. This will be used to truncate any data that is too long. The default is the default for the provided Hugging Face model. We do not support extending the context length beyond each model's default.
  - experiment_tracker – The configuration for an experiment tracker. For example, to add Weights and Biases tracking, you can pass in `{"wandb": {"project": "my-project", "entity": "my-entity"}}`. To add MLflow tracking, you can pass in `{"mlflow": {"experiment_path": "my-experiment", "model_registry_path": "catalog.schema.model_name"}}`.
  - eval – Configuration for evaluation.
  - custom_weights_path – The remote location of a custom model checkpoint to resume from. If provided, these weights will be used instead of the original pretrained weights of the model. This must be a Composer checkpoint.
  - timeout – Time, in seconds, in which the call should complete. If the run creation takes too long, a `TimeoutError` will be raised. If `future` is `True`, this value will be ignored.
  - future – Return the output as a `Future`. If `True`, the call to `create_pretraining_run` will return immediately and the request will be processed in the background. This takes precedence over the `timeout` argument. To get the `Run` output, use `return_value.result()` with an optional `timeout` argument.
- Returns
  A `Run` object containing the pretraining run information.
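For reference, here is a sketch of a call that exercises most of these parameters. The model name, bucket, dataset paths, cluster name, and tracker settings are placeholders, and the exact compute field names are assumptions; adjust them for your environment.

```python
import mcli

# Sketch of a fully specified call. Model name, bucket, dataset paths,
# cluster name, and tracker settings are placeholders; compute field
# names are assumptions.
run = mcli.create_pretraining_run(
    model="mosaicml/mpt-7b",
    train_data={
        # Mix two datasets: 80% of dataset1, 20% of dataset2
        "dataset1": {"path": "s3://my-bucket/data/dataset1", "proportion": 0.8},
        "dataset2": {"path": "s3://my-bucket/data/dataset2", "proportion": 0.2},
    },
    save_folder="s3://my-bucket/my-checkpoints",
    compute={"gpus": 8, "cluster": "my-cluster"},   # assumed field names
    training_duration="10ep",
    parameters={"learning_rate": 5e-7, "context_length": 2048},
    experiment_tracker={"wandb": {"project": "my-project", "entity": "my-entity"}},
)
print(run.name)  # keep the run name to look the run up later
```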
Pretraining runs can be programmatically created, which provides flexibility to define custom workflows or create similar pretraining runs in quick succession.
`create_pretraining_run()` takes fields that allow you to create a customized model. At a minimum, you'll need to provide the model you want to pretrain, the location of your training dataset, and the location where your checkpoints will be saved. There are many optional fields that allow you to perform evaluation, specify a custom tokenizer, and more.
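As a minimal sketch (the model name, data path, bucket, and compute values below are placeholders), you only need `model`, `train_data`, `save_folder`, and, for now, `compute`; passing `future=True` returns a `Future` you can resolve later:

```python
import mcli

# Minimal call; placeholder model, data path, bucket, and compute values.
pending = mcli.create_pretraining_run(
    model="mosaicml/mpt-7b",
    train_data=["s3://my-bucket/data/train"],       # list-of-paths form
    save_folder="s3://my-bucket/my-checkpoints",
    compute={"gpus": 8, "cluster": "my-cluster"},   # assumed field names; required for now
    future=True,                                    # return immediately with a Future
)
run = pending.result(timeout=60)  # wait up to 60 seconds for the Run object
print(f"Created pretraining run: {run.name}")
```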
Other actions on pretraining runs#
For listing, stopping, deleting, describing, and debugging (viewing logs of) pretraining runs, follow the same workflow as for a Run, described here.
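As an illustration only, and assuming the general-purpose mcli run helpers (`get_runs`, `get_run_logs`, `stop_run`, `delete_run`) apply to pretraining runs as the linked Run documentation describes, a typical management workflow might look like:

```python
import mcli

# Assumed helpers from the shared Run workflow; names and return types
# follow the general mcli run API rather than anything pretraining-specific.
runs = mcli.get_runs()              # list runs
for r in runs:
    print(r.name, r.status)         # describe each run briefly

run = runs[0]
print(mcli.get_run_logs(run))       # fetch logs for debugging
mcli.stop_run(run)                  # stop a run that is still executing
mcli.delete_run(run)                # delete the run record
```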