Configure a pretraining run#
Pretraining run submissions can be configured through a YAML file or using our Python API `create_pretraining_run()`.
The fields are:
| Field | Type | |
|---|---|---|
| `model` | `str` | required |
| `train_data` | `List[str]` or `Dict[str, Dict]` | required |
| `save_folder` | `str` | required |
| `compute` | `Dict` | required |
| `custom_weights_path` | `str` | optional |
| `eval` | `Dict` | optional |
| `training_duration` | `str` | optional |
| `tokenizer` | `Dict` | optional |
| `experiment_tracker` | `Dict` | optional |
Here's an example pretraining run configuration:
model: databricks/dbrx-9b
train_data:
  - s3://my-bucket/my-data
save_folder: s3://my-bucket/checkpoints
eval:
  data_path: s3://my-bucket/my-data
training_duration: 10000000000tok
experiment_tracker:
  mlflow:
    experiment_path: /Users/[email protected]/my_experiment
tokenizer:
  name: EleutherAI/gpt-neox-20b
compute:
  cluster: r1z1
  gpus: 128
from mcli import create_pretraining_run

run = create_pretraining_run(
    model="databricks/dbrx-9b",
    train_data_path=["s3://my-bucket/my-data"],
    save_folder="s3://my-bucket/checkpoints",
    eval={"data_path": "s3://my-bucket/my-data"},
    training_duration="10000000000tok",
    experiment_tracker={"mlflow": {"experiment_path": "/Users/[email protected]/my_experiment"}},
    tokenizer={"name": "EleutherAI/gpt-neox-20b"},
    compute={"cluster": "r1z1", "gpus": 128},
)
Field Types#
Model#
Currently available options are listed in the supported models section of the Pretraining landing page.
Train data path#
Either a list of paths to the training data or a mapping of dataset names to the path and proportion of the dataset to use. For example, to configure a single dataset:
train_data:
- s3://my-bucket/my-data
To configure 80% of `dataset1` and 20% of `dataset2`:
train_data:
  dataset1:
    proportion: 0.8
    remote: s3://my-bucket/my-data/dataset1
  dataset2:
    proportion: 0.2
    remote: s3://my-bucket/my-data/dataset2
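The same split can also be expressed through the Python API. The sketch below assumes that the `train_data_path` argument accepts the same dataset-name-to-proportion mapping as the YAML `train_data` field; check your client version before relying on it.

from mcli import create_pretraining_run

# Sketch only: assumes train_data_path accepts the same mapping form as the
# YAML train_data field (dataset name -> proportion and remote path).
train_data = {
    "dataset1": {"proportion": 0.8, "remote": "s3://my-bucket/my-data/dataset1"},
    "dataset2": {"proportion": 0.2, "remote": "s3://my-bucket/my-data/dataset2"},
}

run = create_pretraining_run(
    model="databricks/dbrx-9b",
    train_data_path=train_data,
    save_folder="s3://my-bucket/checkpoints",
    compute={"cluster": "r1z1", "gpus": 128},
)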
Save folder#
The remote location to save the pretrained checkpoints. For example, if your `save_folder` is `s3://my-bucket/my-checkpoints`, the pretrained Composer checkpoints will be saved to `s3://my-bucket/my-checkpoints/<run-name>/checkpoints`, and Hugging Face formatted checkpoints will be saved to `s3://my-bucket/my-checkpoints/<run-name>/hf_checkpoints`. The supported cloud provider prefixes are `s3://`, `gs://`, and `oci://`.
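As a quick illustration of where checkpoints land (the run name below is hypothetical):

# Sketch only: checkpoint locations derived from save_folder and the run name.
save_folder = "s3://my-bucket/my-checkpoints"
run_name = "my-pretraining-run"  # hypothetical run name
composer_checkpoints = f"{save_folder}/{run_name}/checkpoints"
hf_checkpoints = f"{save_folder}/{run_name}/hf_checkpoints"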
Custom weights path#
The remote location of a checkpoint to resume training from. If these weights are provided, they will be used in place of the initial weights of the model being pretrained. This must be a Composer checkpoint.
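For illustration, a run that resumes from existing weights might look like the sketch below; the `custom_weights_path` parameter name is inferred from this field's name, and the checkpoint path is a placeholder.

from mcli import create_pretraining_run

# Sketch only: resume pretraining from an existing Composer checkpoint.
# The custom_weights_path keyword and the checkpoint path are illustrative assumptions.
run = create_pretraining_run(
    model="databricks/dbrx-9b",
    train_data_path=["s3://my-bucket/my-data"],
    save_folder="s3://my-bucket/checkpoints",
    custom_weights_path="s3://my-bucket/checkpoints/previous-run/checkpoints/latest-rank0.pt",
    compute={"cluster": "r1z1", "gpus": 128},
)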
Eval#
| Field | Type | |
|---|---|---|
| `data_path` | `str` | optional |
| `prompts` | `List[str]` | optional |
- `data_path`: The remote location of your evaluation data (e.g. `s3://my-bucket/my-data`), containing MDS files. See the main readme for supported data sources. Metrics include Cross Entropy and Perplexity.
- `prompts`: A list of prompts to pass through the model for manual generation evaluation.

Both are triggered at every checkpoint and logged to the experiment tracker. See evaluate your model for tips on custom and more complete evaluation.
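Putting both fields together, an eval configuration passed through the Python API might look like this sketch (the eval data path and prompts are placeholders):

from mcli import create_pretraining_run

# Sketch only: evaluation data plus a couple of hand-written generation prompts.
eval_config = {
    "data_path": "s3://my-bucket/my-eval-data",  # placeholder MDS dataset location
    "prompts": [
        "The capital of France is",
        "Write a haiku about large language models.",
    ],
}

run = create_pretraining_run(
    model="databricks/dbrx-9b",
    train_data_path=["s3://my-bucket/my-data"],
    save_folder="s3://my-bucket/checkpoints",
    eval=eval_config,
    compute={"cluster": "r1z1", "gpus": 128},
)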
Training duration#
The total duration of your pretraining run. This should be specified in tokens (e.g. `1000000tok`).
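For example, you could build the duration string from a token budget (a trivial sketch; the budget is illustrative):

# Sketch only: training_duration is a token count followed by the "tok" suffix.
num_tokens = 10_000_000_000  # illustrative token budget
training_duration = f"{num_tokens}tok"  # -> "10000000000tok"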
Tokenizer#
You can configure the following fields for your tokenizer:
| Field | Type | Description |
|---|---|---|
| `name` | `str` | The name of your tokenizer. This is the name of an approved HuggingFace tokenizer or a path to a remote object store. |
| `model_max_length` | `int` | The maximum length (in number of tokens) for the inputs to the transformer model. |
| `model_input_names` | `List[str]` | The list of inputs accepted by the forward pass of the model (like `token_type_ids` or `attention_mask`). |
We currently support the following HuggingFace tokenizers. You can specify this under `tokenizer.name`:
Tokenizer |
---|
|
|
|
|
|
|
You may alternatively provide a remote path that uses one of these tokenizers with your custom vocabulary. Your tokenizer should be listed under the `tokenizer_class` field inside the `tokenizer_config.json` file. We support downloading your tokenizer files (for tokenizers only) from the following object storage providers:
- S3: `s3://my-bucket/my-data`
- GCP: `gs://my-bucket/my-data`
- OCI: `oci://my-bucket/my-data`
- Azure: `azure://my-bucket/my-data`
- Databricks File System: `dbfs:/Volumes/my-data`
Here's what an example configuration would look like:
tokenizer:
  name: EleutherAI/gpt-neox-20b
  model_max_length: 4096
  model_input_names:
    - name1
    - name2
from mcli import create_pretraining_run

tokenizer_config = {
    "name": "EleutherAI/gpt-neox-20b",
    "model_max_length": 4096,
    "model_input_names": ["name1", "name2"],
}

run = create_pretraining_run(
    model="databricks/dbrx-9b",
    train_data_path=["s3://my-bucket/my-data"],
    save_folder="s3://my-bucket/checkpoints",
    tokenizer=tokenizer_config,
    compute={"cluster": "r1z1", "gpus": 128},
)
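If your tokenizer lives in your own object store instead, the `name` field can point at that remote path. The sketch below uses a placeholder bucket and assumes the folder contains your tokenizer files, with `tokenizer_config.json` declaring a supported `tokenizer_class`.

# Sketch only: reference a custom tokenizer stored in your own bucket.
# The remote path is a placeholder; tokenizer_config.json in that folder must
# list a supported tokenizer under the tokenizer_class field.
tokenizer_config = {
    "name": "s3://my-bucket/my-custom-tokenizer",
    "model_max_length": 4096,
}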
If this does not satisfy your requirements, please reach out to us and we can work together on adding your required tokenizer to our whitelist.
Experiment tracker#
Experiment tracker configurations. For example, to add MLflow tracking, set the tracker to `{'mlflow': {'experiment_path': '/Users/xxx@yyyy.com/<your-experiment-name>'}}`.
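Through the Python API this is the same dictionary passed as `experiment_tracker` (the experiment path below is a placeholder):

from mcli import create_pretraining_run

# Sketch only: log the run to an MLflow experiment (placeholder experiment path).
run = create_pretraining_run(
    model="databricks/dbrx-9b",
    train_data_path=["s3://my-bucket/my-data"],
    save_folder="s3://my-bucket/checkpoints",
    experiment_tracker={"mlflow": {"experiment_path": "/Users/[email protected]/my_experiment"}},
    compute={"cluster": "r1z1", "gpus": 128},
)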