First Model#

Let’s train your first 1 billion parameter GPT model!

Download the following run YAML file as mosaic_gpt_1b.yaml:

name: mosaic-gpt-1b-gpus-8
image: mosaicml/llm-foundry:2.1.0_cu121_flash2-ddba5c8
# You can find other images that are ready to use here: https://hub.docker.com/u/mosaicml
compute:
  gpus: 8
  # cluster: <name> ## If you have access to multiple clusters, this may be required

integrations:
- integration_type: git_repo
  git_repo: mosaicml/llm-foundry
  git_commit: v0.4.0
  pip_install: .[gpu-flash2]
  ssh_clone: false # Should be true if using a private repo

command: |
  cd llm-foundry/scripts
  python data_prep/convert_dataset_hf.py --dataset c4 --data_subset en --out_root ./my-copy-c4 --splits train_small val \
    --concat_tokens 2048 --tokenizer gpt2 --eos_text '<|endoftext|>'
  composer train/train.py train/yamls/pretrain/mpt-1b.yaml \
    train_loader.dataset.split=train_small \
    max_duration=100ba \
    eval_interval=0

# The below is injected as a YAML file: /mnt/config/parameters.yaml
# but is not used in this example.
parameters: {}

This run clones MosaicML's public LLM Foundry repository and trains a 1-billion-parameter GPT-style language model on the C4 dataset using 8x A100 40GB GPUs.

C4

The configuration above first runs a conversion script to convert the C4 dataset into a format usable by our streaming dataloader. For more details, see the Streaming documentation.
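
To sanity-check the converted data locally, you can open the output directory with the streaming package's StreamingDataset. This is a minimal sketch: the ./my-copy-c4 path and val split match the conversion command above, but the exact sample fields depend on the conversion options used.

from streaming import StreamingDataset

# Point at the locally converted shards produced by convert_dataset_hf.py.
# Field names (e.g. 'tokens') depend on the conversion options used above.
dataset = StreamingDataset(local='./my-copy-c4', split='val', shuffle=False)

print(f'Number of samples: {len(dataset)}')
print(f'Fields in the first sample: {list(dataset[0].keys())}')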

Submit and follow the run from the command line:

mcli run -f mosaic_gpt_1b.yaml --follow

Or submit it with the Python SDK:

from mcli import RunConfig, create_run

config = RunConfig.from_file('mosaic_gpt_1b.yaml')
# config.cluster = <your_cluster_name> # Only needed if you have more than one cluster

create_run(config)

After a brief setup period, training starts and you should see output like:

Starting training...
******************************
Config:
enabled_algorithms/GradientClipping: true
node_name: inst-dlgir-r8z5-2-workers
num_gpus_per_node: 8
num_nodes: 1
rank_zero_seed: 17

******************************
[batch=1/100]:
	 Train time/epoch: 0
	 Train time/batch: 0
	 Train time/sample: 0
	 Train time/batch_in_epoch: 0
	 Train time/sample_in_epoch: 0
	 Train time/token: 0
	 Train time/token_in_epoch: 0
	 Train memory/allocated_mem: 2.4149
	 Train memory/active_mem: 2.4149
	 Train memory/inactive_mem: 1.1692
	 Train memory/reserved_mem: 25.6170
	 Train memory/alloc_retries: 0
	 Train trainer/device_train_microbatch_size: 4
	 Train loss/train/total: 11.8108
	 Train metrics/train/LanguageCrossEntropy: 11.8108
	 Train metrics/train/LanguagePerplexity: 134696.6562
	 Train time/train: 0.0045
	 Train time/val: 0.0000
	 Train time/total: 0.0045
	 Train lr-DecoupledAdamW/group0: 0.0000
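
If you submit with the SDK rather than the --follow flag, you can stream the same logs from Python. A minimal sketch, assuming the follow_run_logs helper in the mcli SDK:

from mcli import RunConfig, create_run, follow_run_logs

config = RunConfig.from_file('mosaic_gpt_1b.yaml')
run = create_run(config)  # create_run returns the submitted run

# Stream log lines as they are produced (blocks until the run finishes)
for line in follow_run_logs(run):
    print(line, end='')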

Unique Names

Databricks Mosaic AI training appends a unique six-character identifier to the run name you provide so that every run name is unique.

View the run's status, and its unique name, at any time with:

mcli get runs
NAME                       CLUSTER  GPU_TYPE   GPU_NUM  ...  STATUS
mosaic-gpt-1b-gpus-8-3isk9a  r8z2     a100_40gb  8        ...  Running
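
The same information is available from the Python SDK. A minimal sketch, assuming your mcli version exposes get_runs():

from mcli import get_runs

# List all runs and print their generated names and current status.
for run in get_runs():
    print(run.name, run.status)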

Let’s stop the run, using its unique name:

mcli stop run <run-name>
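
Runs can also be stopped programmatically. A sketch, assuming the stop_run helper in the mcli SDK:

from mcli import get_runs, stop_run

# Stop every run whose name starts with this tutorial's prefix.
for run in get_runs():
    if run.name.startswith('mosaic-gpt-1b'):
        stop_run(run)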

Scaling up the number of GPUs is easy. If you have access to multiple nodes, simply override the GPU count when you submit the run:

mcli run -f mosaic_gpt_1b.yaml --gpus 16 --follow
i  Run mosaic-gpt-1b-gpus-16-4czz submitted. Waiting for it to start...
i  You can press Ctrl+C to quit and follow your run manually.
⠏ Rank 0: Waiting for resources to become available... 0:00:03
⠏ Rank 1: Waiting for resources to become available... 0:00:03
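
The same override works from the Python SDK by editing the compute field of the RunConfig before submission. A sketch, assuming compute is a plain dict on RunConfig:

from mcli import RunConfig, create_run

config = RunConfig.from_file('mosaic_gpt_1b.yaml')
config.compute['gpus'] = 16  # override the GPU count from the YAML
create_run(config)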

Clean up all your runs with:

mcli delete runs --all
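
From the SDK, the equivalent is a sketch like the following, assuming delete_runs is available. Deleting removes a run's logs and metadata, so use it carefully:

from mcli import get_runs, delete_runs

# Delete every run returned by get_runs(); this cannot be undone.
delete_runs(get_runs())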

Customization#

Our examples repository is designed to be easily modifiable for your own use cases. For example, you could fork the repository and edit the 1b.yaml configuration file.

Instead, we recommend using the parameters field, which makes it easy to tweak these settings with each run. Anything under parameters is mounted at /mnt/config/parameters.yaml for your code to read.
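
For example, your training script can read the mounted file with any YAML loader. A minimal sketch using PyYAML:

import yaml

# Anything under the run's `parameters` field is injected here by the platform.
with open('/mnt/config/parameters.yaml') as f:
    params = yaml.safe_load(f)

print(params.get('max_seq_len'))  # e.g. 2048 in the example below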

The resulting run YAML looks the same as above, except that (1) the Python program is pointed at /mnt/config/parameters.yaml instead of its own config, and (2) the contents of 1b.yaml are appended under the parameters field.

Example YAML:
name: mosaic-gpt-1b-gpus-8
image: mosaicml/pytorch:1.12.1_cu116-python3.9-ubuntu20.04

compute:
  gpus: 8

integrations:
  - integration_type: git_repo
    git_repo: mosaicml/examples
    git_branch: v0.0.2
    pip_install: -r llm/requirements.txt

command: |
  cd examples/llm
  python convert_c4.py --out_root ./my-copy-c4 --splits val
  composer main.py /mnt/config/parameters.yaml \
    train_loader.dataset.split=val \
    progress_bar=false \
    run_name=$COMPOSER_RUN_NAME

parameters:
  data_remote: &data_remote ./my-copy-c4
  data_local: &data_local ./my-copy-c4
  max_seq_len: &max_seq_len 2048
  tokenizer_name: &tokenizer_name gpt2

  # Run Name
  run_name: gpt-1b

  # Model
  model:
    name: mosaic_gpt
    device: meta
    tokenizer_name: *tokenizer_name
    d_model: 2048
    n_heads: 16
    n_layers: 24
    mlp_ratio: 4
    max_seq_len: *max_seq_len
    vocab_size: 50257
    init_std: 0.02
    attn_pdrop: 0.0
    resid_pdrop: 0.0
    emb_pdrop: 0.0
    attn_impl: flash

  # Tokenizer
  tokenizer:
    type: hftokenizer
    args:
        tokenizer_name: *tokenizer_name
        max_seq_len: *max_seq_len

  # Dataloaders
  train_loader:
    name: c4
    dataset:
        remote: *data_remote
        local: *data_local
        split: train
        shuffle: true
        prefetch: 1_000_000
        tokenizer_name: *tokenizer_name
        max_seq_len: *max_seq_len
        group_method: concat
    drop_last: true
    num_workers: 8
    pin_memory: true
    prefetch_factor: 2
    persistent_workers: true
    timeout: 0

  eval_loader:
    name: c4
    dataset:
        remote: *data_remote
        local: *data_local
        split: val
        shuffle: false
        prefetch: 1000
        tokenizer_name: *tokenizer_name
        max_seq_len: *max_seq_len
        group_method: truncate
    drop_last: false
    num_workers: 8
    pin_memory: true
    prefetch_factor: 2
    persistent_workers: true
    timeout: 0

  # Optimization
  scheduler:
    name: cosine_with_warmup
    t_warmup: 100ba
    alpha_f: 0.1

  optimizer:
    name: decoupled_adamw
    lr: 2.0e-4
    betas:
    - 0.9
    - 0.95
    eps: 1.0e-08
    weight_decay: 0.0

  max_duration: 24800ba
  eval_interval: 2000ba
  global_train_batch_size: 512
  grad_clip_norm: 1.0

  # System
  seed: 17
  device_eval_batch_size: 16
  device_train_microbatch_size: 16
  # device_train_microbatch_size: auto
  precision: bf16

  # FSDP
  fsdp_config:
    sharding_strategy: FULL_SHARD
    min_params: 2e8
    mixed_precision: DEFAULT
    activation_checkpointing: true
    activation_cpu_offload: false
    verbose: true

  # Logging
  progress_bar: true
  log_to_console: true

  callbacks:
    speed_monitor:
        window_size: 10
    lr_monitor: {}
    memory_monitor: {}
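
With the configuration living under parameters, per-run tweaks no longer require editing the YAML file. A sketch using the Python SDK, assuming parameters is a plain dict on RunConfig (the filename below is a placeholder for wherever you saved the customized YAML above):

from mcli import RunConfig, create_run

config = RunConfig.from_file('mosaic_gpt_1b_custom.yaml')  # hypothetical filename
config.parameters['max_duration'] = '4800ba'               # shorter training run
config.parameters['global_train_batch_size'] = 256
create_run(config)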