Configure a pretraining run#

Pretraining run submissions to the Databricks Mosaic AI training platform can be configured through a YAML file or through our Python API, create_pretraining_run().

The fields are:

| Field | Type |
| --- | --- |
| model (required) | str |
| train_data_path (required) | List[str] |
| save_folder (required) | str |
| compute (required) | Dict[Dict] |
| eval (optional) | Dict[str] |
| training_duration (optional) | str |
| experiment_tracker (optional) | Dict[Dict] |
| tokenizer (optional) | Dict[Dict] |

Here’s an example pretraining run configuration:

model: databricks/dbrx-9b
train_data_path:
  - s3://my-bucket/my-data
save_folder: s3://my-bucket/checkpoints
eval: 
  data_path: s3://my-bucket/my-data
training_duration: 10000000000tok
experiment_tracker:
  mlflow:
    experiment_path: /Users/[email protected]/my_experiment
tokenizer:
  name: EleutherAI/gpt-neox-20b
compute:
  cluster: r1z1
  gpus: 128
The same run can be submitted with the Python API:

from mcli import create_pretraining_run

run = create_pretraining_run(
    model="databricks/dbrx-9b",
    train_data_path=["s3://my-bucket/my-data"],
    save_folder="s3://my-bucket/checkpoints",
    eval={"data_path": "s3://my-bucket/my-data"},
    training_duration="10000000000tok",
    experiment_tracker={"mlflow": {"experiment_path": "/Users/[email protected]/my_experiment"}},
    tokenizer={"name": "EleutherAI/gpt-neox-20b"},
    compute={"cluster": "r1z1", "gpus": 128},
)

Field Types#

Model#

Currently available options are listed in the supported models section of the Pretraining landing page.

Train data path#

The full remote location of your training dataset(s) (e.g. [s3://my-bucket/my-data]).
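Because train_data_path is a list, multiple datasets can be combined in a single run. A minimal sketch (the bucket and prefix names below are placeholders):

# Each entry is the full remote location of one training dataset.
# Bucket and prefix names here are placeholders.
train_data_path = [
    "s3://my-bucket/common-crawl",
    "s3://my-bucket/code",
]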

Save folder#

The remote location to save the pretrained checkpoints. For example, if your save_folder is s3://my-bucket/my-checkpoints, the pretrained Composer checkpoints will be saved to s3://my-bucket/my-checkpoints/<run-name>/checkpoints, and Hugging Face formatted checkpoints will be saved to s3://my-bucket/my-checkpoints/<run-name>/hf_checkpoints. The supported cloud provider prefixes are s3://, gs://, and oci://.
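As a concrete illustration of the resulting layout, here is a small sketch assuming a run named my-pretrain-run (a hypothetical run name):

save_folder = "s3://my-bucket/my-checkpoints"
run_name = "my-pretrain-run"  # hypothetical run name

# Composer-format checkpoints are written under <save_folder>/<run-name>/checkpoints
composer_checkpoints = f"{save_folder}/{run_name}/checkpoints"
# Hugging Face-format checkpoints are written under <save_folder>/<run-name>/hf_checkpoints
hf_checkpoints = f"{save_folder}/{run_name}/hf_checkpoints"

print(composer_checkpoints)  # s3://my-bucket/my-checkpoints/my-pretrain-run/checkpoints
print(hf_checkpoints)        # s3://my-bucket/my-checkpoints/my-pretrain-run/hf_checkpoints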

Eval#

Eval data path#

The remote location of your evaluation data (e.g. s3://my-bucket/my-data), containing MDS files.
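In the Python API this is passed as a dict; a minimal sketch (the bucket path is a placeholder):

# eval expects a dict with a data_path entry pointing at MDS-formatted evaluation data.
# Pass it to create_pretraining_run() via the eval keyword argument.
eval_config = {"data_path": "s3://my-bucket/my-eval-data"}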

Training duration#

The total duration of your pretraining run. This should be specified in tokens (e.g. 1_000_000tok).
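For example, a 10-billion-token run can be written either way below; underscores are only digit separators, as in the 1_000_000tok example above:

training_duration = "10_000_000_000tok"  # 10 billion tokens, with digit separators
training_duration = "10000000000tok"     # equivalent, without separators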

Tokenizer#

We allow the following tokenizers:

- meta-llama/Meta-Llama-3-70B
- meta-llama/Meta-Llama-3-70B-Instruct
- meta-llama/Meta-Llama-3-8B
- meta-llama/Meta-Llama-3-8B-Instruct
- EleutherAI/gpt-neox-20b
- openai-community/gpt2

You may alternatively provide a remote path (s3, oci, or hf) to one of these tokenizers with your custom vocabulary. The remote path should be specified in tokenizer.name.
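For example, a custom-vocabulary copy of a supported tokenizer stored in your own bucket could be referenced like this (the path below is a placeholder):

# tokenizer.name may point at a remote copy (s3, oci, or hf) of one of the
# supported tokenizers with a custom vocabulary. The path below is a placeholder.
tokenizer = {"name": "s3://my-bucket/my-custom-gpt-neox-tokenizer"}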

If this does not meet your requirements, please reach out to us and we can work together on adding your required tokenizer to our whitelist.

Experiment tracker#

Experiment tracker configurations. For example, to add MLflow tracking, set the tracker to {'mlflow': {'experiment_path': '/Users/xxx@yyyy.com/<your-experiment-name>'}}.