Configure a pretraining run#

Pretraining run submissions can be configured through a YAML file or using our Python API create_pretraining_run().

The fields are:

Field                Type
-------------------  ---------------------
model                str (required)
train_data_path      List[str] (required)
save_folder          str (required)
compute              Dict[Dict] (required)
eval                 Dict[str] (optional)
training_duration    str (optional)
experiment_tracker   Dict[Dict] (optional)
tokenizer            Dict[Dict] (optional)
custom_weights_path  str (optional)

Here's an example pretraining run configuration:

model: databricks/dbrx-9b
train_data: 
  - s3://my-bucket/my-data
save_folder: s3://my-bucket/checkpoints
eval: 
  data_path: s3://my-bucket/my-data
training_duration: 10000000000tok
experiment_tracker:
  mlflow:
    experiment_path: /Users/[email protected]/my_experiment
tokenizer:
  name: EleutherAI/gpt-neox-20b
compute:
  cluster: r1z1
  gpus: 128
The same run can be created with the Python API:

from mcli import create_pretraining_run

run = create_pretraining_run(
    model="databricks/dbrx-9b",
    train_data_path=["s3://my-bucket/my-data"],
    save_folder="s3://my-bucket/checkpoints",
    eval={"data_path": "s3://my-bucket/my-data"},
    training_duration="10000000000tok",
    experiment_tracker={"mlflow": {"experiment_path": "/Users/[email protected]/my_experiment"}},
    tokenizer={"name": "EleutherAI/gpt-neox-20b"},
    compute={"cluster": "r1z1", "gpus": 128},
)

Field Types#

Model#

Currently available options are listed in the supported models section of the Pretraining landing page.

Train data path#

Either a list of paths to the training data or a mapping of dataset names to the path and proportion of the dataset to use. For example, to configure a single dataset:

train_data:
  - s3://my-bucket/my-data

To configure 80% of dataset1 and 20% of dataset2:

train_data:
  dataset1:
    proportion: 0.8
    remote: s3://my-bucket/my-data/dataset1
  dataset2:
    proportion: 0.2
    remote: s3://my-bucket/my-data/dataset2
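The same mixture can be expressed through the Python API. Here is a minimal sketch, assuming the train_data_path argument accepts the same dataset-name mapping as the YAML form (bucket paths are placeholders):

from mcli import create_pretraining_run

# Mix 80% of dataset1 with 20% of dataset2 (placeholder paths).
run = create_pretraining_run(
    model="databricks/dbrx-9b",
    train_data_path={
        "dataset1": {"proportion": 0.8, "remote": "s3://my-bucket/my-data/dataset1"},
        "dataset2": {"proportion": 0.2, "remote": "s3://my-bucket/my-data/dataset2"},
    },
    save_folder="s3://my-bucket/checkpoints",
    compute={"cluster": "r1z1", "gpus": 128},
)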

Save folder#

The remote location to save the pretrained checkpoints. For example, if your save_folder is s3://my-bucket/my-checkpoints, the pretrained Composer checkpoints will be saved to s3://my-bucket/my-checkpoints/<run-name>/checkpoints, and Hugging Face formatted checkpoints will be saved to s3://my-bucket/my-checkpoints/<run-name>/hf_checkpoints. The supported cloud provider prefixes are s3://, gs://, and oci://.
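As an illustration, a minimal sketch tying save_folder to the resulting checkpoint layout (the bucket and run name are placeholders):

from mcli import create_pretraining_run

# Checkpoints are written under save_folder, namespaced by the run name.
run = create_pretraining_run(
    model="databricks/dbrx-9b",
    train_data_path=["s3://my-bucket/my-data"],
    save_folder="s3://my-bucket/my-checkpoints",
    compute={"cluster": "r1z1", "gpus": 128},
)
# Composer checkpoints:      s3://my-bucket/my-checkpoints/<run-name>/checkpoints
# Hugging Face checkpoints:  s3://my-bucket/my-checkpoints/<run-name>/hf_checkpoints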

Custom weights path#

The remote location of a checkpoint to resume training from. If these weights are provided, they will be used instead of the initial weights of the model being pretrained. This must be a Composer checkpoint.
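For example, a minimal sketch of resuming from previous weights via the Python API (the Composer checkpoint path below is a placeholder):

from mcli import create_pretraining_run

# Start from an existing Composer checkpoint instead of the model's initial weights.
run = create_pretraining_run(
    model="databricks/dbrx-9b",
    train_data_path=["s3://my-bucket/my-data"],
    save_folder="s3://my-bucket/checkpoints",
    custom_weights_path="s3://my-bucket/checkpoints/previous-run/checkpoints/latest-rank0.pt",
    compute={"cluster": "r1z1", "gpus": 128},
)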

Eval#

Field      Type
---------  ---------------------
data_path  str (optional)
prompts    List[str] (optional)

  • data_path: The remote location of your evaluation data (e.g. s3://my-bucket/my-data), containing MDS files. See the main readme for supported data sources. Metrics include Cross Entropy and Perplexity.

  • prompts: A list of prompts to pass through the model for manual generation evaluation.

Both are triggered at every checkpoint and logged to the experiment tracker. See evaluate your model for tips on custom and more complete evaluation.
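For instance, a minimal sketch that enables both evaluation options, assuming the eval dictionary accepts the data_path and prompts keys described above (paths and prompts are placeholders):

from mcli import create_pretraining_run

# Evaluate on held-out MDS data and generate from a few prompts at every checkpoint.
run = create_pretraining_run(
    model="databricks/dbrx-9b",
    train_data_path=["s3://my-bucket/my-data"],
    save_folder="s3://my-bucket/checkpoints",
    eval={
        "data_path": "s3://my-bucket/my-eval-data",
        "prompts": ["What is machine learning?", "Write a haiku about GPUs."],
    },
    compute={"cluster": "r1z1", "gpus": 128},
)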

Training duration#

The total duration of your pretraining run. This should be specified in tokens (e.g. 1000000tok).

Tokenizer#

You can configure the following fields for your tokenizer:

  • name (str, required): The name of your tokenizer. This is the name of an approved HuggingFace tokenizer or a path to a remote object store.

  • model_max_length (int, optional): The maximum length (in number of tokens) for the inputs to the transformer model.

  • model_input_names (List[str], optional): The list of inputs accepted by the forward pass of the model (like "token_type_ids" or "attention_mask").

We currently support the following HuggingFace tokenizers. You can specify this under tokenizer.name:

  • meta-llama/Meta-Llama-3-70B

  • meta-llama/Meta-Llama-3-70B-Instruct

  • meta-llama/Meta-Llama-3-8B

  • meta-llama/Meta-Llama-3-8B-Instruct

  • EleutherAI/gpt-neox-20b

  • openai-community/gpt2

You may alternatively provide a remote path that uses one of these tokenizers with your custom vocabulary. Your tokenizer should be listed under the tokenizer_class field inside the tokenizer_config.json file. We support the following object storage providers for downloading your tokenizer files (for tokenizers only); a sketch of this remote-path option follows the list:

  • S3: s3://my-bucket/my-data

  • GCP: gs://my-bucket/my-data

  • OCI: oci://my-bucket/my-data

  • Azure: azure://my-bucket/my-data

  • Databricks File System: dbfs:/Volumes/my-data
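For the remote-path option, a minimal sketch of a tokenizer configuration that points at custom tokenizer files in object storage (the bucket path is a placeholder; the files must include a tokenizer_config.json whose tokenizer_class is one of the supported tokenizers above):

tokenizer_config = {
    # Remote folder containing the tokenizer files, e.g. tokenizer_config.json and vocabulary (placeholder path).
    "name": "s3://my-bucket/my-tokenizer",
    "model_max_length": 4096,
}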

Here's what a full example configuration would look like:

tokenizer:
  name: EleutherAI/gpt-neox-20b
  model_max_length: 4096
  model_input_names:
    - name1
    - name2
The same tokenizer configuration can be passed through the Python API:

from mcli import create_pretraining_run

tokenizer_config = {
    "name": "EleutherAI/gpt-neox-20b",
    "model_max_length": 4096,
    "model_input_names": ["name1", "name2"],
}

run = create_pretraining_run(
    model="databricks/dbrx-9b",
    train_data_path=["s3://my-bucket/my-data"],
    save_folder="s3://my-bucket/checkpoints",
    tokenizer=tokenizer_config,
    compute={"cluster": "r1z1", "gpus": 128},
)

If this does not satisfy your requirements, please reach out to us and we can work together on adding your required tokenizer to our whitelist.

Experiment tracker#

Experiment tracker configurations. For example, to add MLflow tracking, you can set the tracker to {'mlflow': {'experiment_path': '/Users/xxx@yyyy.com/<your-experiment-name>'}}.
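A minimal sketch of passing the same MLflow tracker through the Python API (the experiment path is a placeholder):

from mcli import create_pretraining_run

# Log metrics from the pretraining run to an MLflow experiment.
run = create_pretraining_run(
    model="databricks/dbrx-9b",
    train_data_path=["s3://my-bucket/my-data"],
    save_folder="s3://my-bucket/checkpoints",
    experiment_tracker={"mlflow": {"experiment_path": "/Users/[email protected]/my_experiment"}},
    compute={"cluster": "r1z1", "gpus": 128},
)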