Configure a pretraining run#

Pretraining run submissions can be configured through a YAML file or using our Python API create_pretraining_run().

The fields are:

| Field              | Required | Type       |
| ------------------ | -------- | ---------- |
| model              | required | str        |
| train_data_path    | required | List[str]  |
| save_folder        | required | str        |
| compute            | required | Dict[Dict] |
| eval               | optional | Dict[str]  |
| training_duration  | optional | str        |
| experiment_tracker | optional | Dict[Dict] |
| tokenizer          | optional | Dict[Dict] |

Here’s an example pretraining run configuration in YAML:

model: databricks/dbrx-9b
train_data_path: 
  - s3://my-bucket/my-data
save_folder: s3://my-bucket/checkpoints
eval: 
  data_path: s3://my-bucket/my-data
training_duration: 10000000000tok
experiment_tracker:
  mlflow:
    experiment_path: /Users/[email protected]/my_experiment
tokenizer:
  name: EleutherAI/gpt-neox-20b
compute:
  cluster: r1z1
  gpus: 128
The same run can be created with the Python API:

from mcli import create_pretraining_run

run = create_pretraining_run(
    model="databricks/dbrx-9b",
    train_data_path=["s3://my-bucket/my-data"],
    save_folder="s3://my-bucket/checkpoints",
    eval={"data_path": "s3://my-bucket/my-data"},
    training_duration="10000000000tok",
    experiment_tracker={"mlflow": {"experiment_path": "/Users/[email protected]/my_experiment"}},
    tokenizer={"name": "EleutherAI/gpt-neox-20b"},
    compute={"cluster": "r1z1", "gpus": 128},
)

Field Types#

Model#

Currently available options are listed in the supported models section of the Pretraining landing page.

Train data path#

The full remote location of your training dataset(s) (e.g. ["s3://my-bucket/my-data"]). Since this field is a List[str], multiple dataset paths can be provided, as in the sketch below.
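A minimal sketch of passing multiple datasets through the Python API (the second bucket path is hypothetical, and how multiple datasets are combined is not specified here):

from mcli import create_pretraining_run

run = create_pretraining_run(
    model="databricks/dbrx-9b",
    train_data_path=[
        "s3://my-bucket/my-data",        # first dataset
        "s3://my-bucket/my-other-data",  # hypothetical second dataset
    ],
    save_folder="s3://my-bucket/checkpoints",
    compute={"cluster": "r1z1", "gpus": 128},
)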

Save folder#

The remote location to save the pretrained checkpoints. For example, if your save_folder is s3://my-bucket/my-checkpoints, the pretrained Composer checkpoints will be saved to s3://my-bucket/my-checkpoints/<run-name>/checkpoints, and Hugging Face formatted checkpoints will be saved to s3://my-bucket/my-checkpoints/<run-name>/hf_checkpoints. The supported cloud provider prefixes are s3://, gs://, and oci://.
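For a concrete picture, assuming a hypothetical run named my-run, the checkpoints described above would be laid out as:

s3://my-bucket/my-checkpoints/
└── my-run/
    ├── checkpoints/      # Composer checkpoints
    └── hf_checkpoints/   # Hugging Face formatted checkpoints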

Eval#

| Field     | Required | Type      |
| --------- | -------- | --------- |
| data_path | optional | str       |
| prompts   | optional | List[str] |

  • data_path: The remote location of your evaluation data (e.g. s3://my-bucket/my-data), containing MDS files. See the main readme for supported data sources. Metrics include Cross Entropy and Perplexity.

  • prompts: A list of prompts to pass through the model for manual generation evaluation.

Both are triggered at every checkpoint and logged to the experiment tracker. See evaluate your model for tips on custom and more complete evaluation. A sketch combining both fields follows.
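A minimal sketch of an eval configuration using both fields (the eval data path and prompt strings are hypothetical):

# Held-out MDS data plus manual-generation prompts
eval_config = {
    "data_path": "s3://my-bucket/my-eval-data",  # hypothetical MDS eval dataset
    "prompts": [
        "Explain what a tokenizer does.",        # hypothetical prompt
        "Write one sentence about the ocean.",   # hypothetical prompt
    ],
}
# Passed as: create_pretraining_run(..., eval=eval_config)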

Training duration#

The total duration of your pretraining run. This should be specified in tokens (e.g. 1_000_000tok).
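A quick sketch of building the duration string in Python (the token count is just an example value):

# Durations are strings ending in "tok"; underscores are only for readability
num_tokens = 10_000_000_000             # 10B training tokens (example value)
training_duration = f"{num_tokens}tok"  # -> "10000000000tok"
# Passed as: create_pretraining_run(..., training_duration=training_duration)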

Tokenizer#

We allow the following tokenizers:

| Tokenizer                            |
| ------------------------------------ |
| meta-llama/Meta-Llama-3-70B          |
| meta-llama/Meta-Llama-3-70B-Instruct |
| meta-llama/Meta-Llama-3-8B           |
| meta-llama/Meta-Llama-3-8B-Instruct  |
| EleutherAI/gpt-neox-20b              |
| openai-community/gpt2                |

You may alternatively provide a remote path (s3, oci, hf) to a tokenizer that uses one of these tokenizers with your custom vocabulary. The remote path should be specified in tokenizer.name.
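For example, a minimal sketch of pointing tokenizer.name at a custom-vocabulary tokenizer stored remotely (the bucket path is hypothetical):

# Assumption: this path holds a tokenizer derived from one of the whitelisted ones
tokenizer_config = {"name": "s3://my-bucket/my-custom-tokenizer"}
# Passed as: create_pretraining_run(..., tokenizer=tokenizer_config)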

If none of these options meets your requirements, please reach out to us and we can work together on adding your required tokenizer to our whitelist.

Experiment tracker#

Experiment tracker configurations. For example, to add MLflow tracking, set the tracker to {'mlflow': {'experiment_path': '/Users/xxx@yyyy.com/<your-experiment-name>'}}.