Configure a pretraining run#

Pretraining run submissions can be configured through a YAML file or using our Python API create_pretraining_run().

The fields are:

Field                Type
-------------------  ---------------------
model                str (required)
train_data_path      List[str] (required)
save_folder          str (required)
compute              Dict[Dict] (required)
eval                 Dict[str] (optional)
training_duration    str (optional)
experiment_tracker   Dict[Dict] (optional)
tokenizer            Dict[Dict] (optional)
custom_weights_path  str (optional)

Here's an example pretraining run configuration:

model: databricks/dbrx-9b
train_data: 
  - s3://my-bucket/my-data
save_folder: s3://my-bucket/checkpoints
eval: 
  data_path: s3://my-bucket/my-data
training_duration: 10000000000tok
experiment_tracker:
  mlflow:
    experiment_path: /Users/[email protected]/my_experiment
tokenizer:
  name: EleutherAI/gpt-neox-20b
compute:
  cluster: r1z1
  gpus: 128
The same run can be created with the Python API:

from mcli import create_pretraining_run

run = create_pretraining_run(
    model="databricks/dbrx-9b",
    train_data_path=["s3://my-bucket/my-data"],
    save_folder="s3://my-bucket/checkpoints",
    eval={"data_path": "s3://my-bucket/my-data"},
    training_duration="10000000000tok",
    experiment_tracker={"mlflow": {"experiment_path": "/Users/[email protected]/my_experiment"}},
    tokenizer={"name": "EleutherAI/gpt-neox-20b"},
    compute={"cluster": "r1z1", "gpus": 128},
)

Field Types#

Model#

Currently available options are listed in the supported models section of the Pretraining landing page.

Train data path#

Either a list of paths to the training data or a mapping of dataset names to the path and proportion of the dataset to use. For example, to configure a single dataset:

train_data:
  - s3://my-bucket/my-data

To configure 80% of dataset1 and 20% of dataset2:

train_data:
  dataset1:
    proportion: 0.8
    remote: s3://my-bucket/my-data/dataset1
  dataset2:
    proportion: 0.2
    remote: s3://my-bucket/my-data/dataset2
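The same mixture can be expressed through the Python API. Here is a minimal sketch, assuming the train_data_path argument accepts the same dataset-name mapping as the YAML form (bucket paths are placeholders):

from mcli import create_pretraining_run

# Mix 80% of dataset1 with 20% of dataset2 (placeholder paths).
run = create_pretraining_run(
    model="databricks/dbrx-9b",
    train_data_path={
        "dataset1": {"proportion": 0.8, "remote": "s3://my-bucket/my-data/dataset1"},
        "dataset2": {"proportion": 0.2, "remote": "s3://my-bucket/my-data/dataset2"},
    },
    save_folder="s3://my-bucket/checkpoints",
    compute={"cluster": "r1z1", "gpus": 128},
)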

Save folder#

The remote location to save the pretrained checkpoints. For example, if your save_folder is s3://my-bucket/my-checkpoints, the pretrained Composer checkpoints will be saved to s3://my-bucket/my-checkpoints/<run-name>/checkpoints, and Hugging Face formatted checkpoints will be saved to s3://my-bucket/my-checkpoints/<run-name>/hf_checkpoints. The supported cloud provider prefixes are s3://, gs://, and oci://.
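As an illustration, a minimal sketch tying save_folder to the resulting checkpoint layout (the bucket and run name are placeholders):

from mcli import create_pretraining_run

# Checkpoints are written under save_folder, namespaced by the run name.
run = create_pretraining_run(
    model="databricks/dbrx-9b",
    train_data_path=["s3://my-bucket/my-data"],
    save_folder="s3://my-bucket/my-checkpoints",
    compute={"cluster": "r1z1", "gpus": 128},
)
# Composer checkpoints:      s3://my-bucket/my-checkpoints/<run-name>/checkpoints
# Hugging Face checkpoints:  s3://my-bucket/my-checkpoints/<run-name>/hf_checkpoints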

Custom weights path#

The remote location of a checkpoint to resume training from. If these weights are provided, they will be used instead of the initial weights of the model being pretrained. This must be a Composer checkpoint.
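For example, a minimal sketch of resuming from previous weights via the Python API (the Composer checkpoint path below is a placeholder):

from mcli import create_pretraining_run

# Start from an existing Composer checkpoint instead of the model's initial weights.
run = create_pretraining_run(
    model="databricks/dbrx-9b",
    train_data_path=["s3://my-bucket/my-data"],
    save_folder="s3://my-bucket/checkpoints",
    custom_weights_path="s3://my-bucket/checkpoints/previous-run/checkpoints/latest-rank0.pt",
    compute={"cluster": "r1z1", "gpus": 128},
)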

Eval#

Field      Type
---------  ---------------------
data_path  str (optional)
prompts    List[str] (optional)

  • data_path: The remote location of your evaluation data (e.g. s3://my-bucket/my-data), containing MDS files. See the main readme for supported data sources. Metrics include Cross Entropy and Perplexity.

  • prompts: A list of prompts to pass through the model for manual generation evaluation.

Both are triggered at every checkpoint and logged to the experiment tracker. See evaluate your model for tips on custom and more complete evaluation.
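For instance, a minimal sketch that enables both evaluation options, assuming the eval dictionary accepts the data_path and prompts keys described above (paths and prompts are placeholders):

from mcli import create_pretraining_run

# Evaluate on held-out MDS data and generate from a few prompts at every checkpoint.
run = create_pretraining_run(
    model="databricks/dbrx-9b",
    train_data_path=["s3://my-bucket/my-data"],
    save_folder="s3://my-bucket/checkpoints",
    eval={
        "data_path": "s3://my-bucket/my-eval-data",
        "prompts": ["What is machine learning?", "Write a haiku about GPUs."],
    },
    compute={"cluster": "r1z1", "gpus": 128},
)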

Training duration#

The total duration of your pretraining run. This should be specified in tokens (e.g. 1000000tok).

Tokenizer#

You can configure the following fields for your tokenizer:

  • name (str, required): The name of your tokenizer. This is the name of an approved HuggingFace tokenizer or a path to a remote object store.

  • model_max_length (int, optional): The maximum length (in number of tokens) for the inputs to the transformer model.

  • model_input_names (List[str], optional): The list of inputs accepted by the forward pass of the model (like "token_type_ids" or "attention_mask").

We currently support the following HuggingFace tokenizers. You can specify this under tokenizer.name:

  • meta-llama/Meta-Llama-3-70B

  • meta-llama/Meta-Llama-3-70B-Instruct

  • meta-llama/Meta-Llama-3-8B

  • meta-llama/Meta-Llama-3-8B-Instruct

  • EleutherAI/gpt-neox-20b

  • openai-community/gpt2

You may alternatively provide a remote path that uses one of these tokenizers with your custom vocabulary. Your tokenizer should be listed under the tokenizer_class field inside the tokenizer_config.json file. We support the following object storage providers for downloading your tokenizer files (for tokenizers only); a sketch of this remote-path option follows the list:

  • S3: s3://my-bucket/my-data

  • GCP: gs://my-bucket/my-data

  • OCI: oci://my-bucket/my-data

  • Azure: azure://my-bucket/my-data

  • Databricks File System: dbfs:/Volumes/my-data
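For the remote-path option, a minimal sketch of a tokenizer configuration that points at custom tokenizer files in object storage (the bucket path is a placeholder; the files must include a tokenizer_config.json whose tokenizer_class is one of the supported tokenizers above):

tokenizer_config = {
    # Remote folder containing the tokenizer files, e.g. tokenizer_config.json and vocabulary (placeholder path).
    "name": "s3://my-bucket/my-tokenizer",
    "model_max_length": 4096,
}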

Here's what a full example configuration would look like:

tokenizer:
  name: EleutherAI/gpt-neox-20b
  model_max_length: 4096
  model_input_names:
    - name1
    - name2
The same tokenizer configuration can be passed through the Python API:

from mcli import create_pretraining_run

tokenizer_config = {
    "name": "EleutherAI/gpt-neox-20b",
    "model_max_length": 4096,
    "model_input_names": ["name1", "name2"],
}

run = create_pretraining_run(
    model="databricks/dbrx-9b",
    train_data_path=["s3://my-bucket/my-data"],
    save_folder="s3://my-bucket/checkpoints",
    tokenizer=tokenizer_config,
    compute={"cluster": "r1z1", "gpus": 128},
)

If this does not satisfy your requirements, please reach out to us and we can work together on adding your required tokenizer to our whitelist.

Experiment tracker#

Experiment tracker configurations. For example, to add MLflow tracking, you can set the tracker to {'mlflow': {'experiment_path': '/Users/xxx@yyyy.com/<your-experiment-name>'}}.
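A minimal sketch of passing the same MLflow tracker through the Python API (the experiment path is a placeholder):

from mcli import create_pretraining_run

# Log metrics from the pretraining run to an MLflow experiment.
run = create_pretraining_run(
    model="databricks/dbrx-9b",
    train_data_path=["s3://my-bucket/my-data"],
    save_folder="s3://my-bucket/checkpoints",
    experiment_tracker={"mlflow": {"experiment_path": "/Users/[email protected]/my_experiment"}},
    compute={"cluster": "r1z1", "gpus": 128},
)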