First Model#
Let’s train your first 1 billion parameter GPT model!
Download the following run YAML file as mosaic_gpt_1b.yaml:
name: mosaic-gpt-1b-gpus-4
image: mosaicml/llm-foundry:2.1.0_cu121_flash2-ddba5c8
# You can find other images that are ready to use here: https://hub.docker.com/u/mosaicml
compute:
  gpus: 4
  # cluster: <name> ## If you have access to multiple clusters, this may be required
integrations:
  - integration_type: git_repo
    git_repo: mosaicml/llm-foundry
    git_commit: v0.4.0
    pip_install: .[gpu-flash2]
    ssh_clone: false  # Should be true if using a private repo
command: |
  cd llm-foundry/scripts
  python data_prep/convert_dataset_hf.py --dataset c4 --data_subset en --out_root ./my-copy-c4 --splits train_small val \
    --concat_tokens 2048 --tokenizer gpt2 --eos_text '<|endoftext|>'
  composer train/train.py train/yamls/pretrain/mpt-1b.yaml \
    train_loader.dataset.split=train_small \
    max_duration=100ba \
    eval_interval=0
# The below is injected as a YAML file: /mnt/config/parameters.yaml
# but is not used in this example.
parameters: {}
This run clones MosaicML's LLM code from our public LLM Foundry repository and trains a 1-billion-parameter GPT language model on the C4 dataset using the 4 GPUs requested in the compute field.
C4
The configuration above first runs a conversion script to convert the C4 dataset into a format usable by our streaming dataloader. For more details, see the Streaming documentation.
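If you want to sanity-check the converted shards before launching a long run, a few lines of Python can read them back locally. This is a minimal sketch, assuming the conversion command above was run with --out_root ./my-copy-c4 and a train_small split; the exact sample fields and dtype depend on the converter version.

# Sketch: inspect the converted C4 shards with the `streaming` package.
import numpy as np
from streaming import StreamingDataset

dataset = StreamingDataset(local='./my-copy-c4', split='train_small', shuffle=False)
print(f'{len(dataset)} samples in train_small')

sample = dataset[0]
# With --concat_tokens, each sample typically stores packed token IDs as raw bytes.
token_ids = np.frombuffer(sample['tokens'], dtype=np.int64)
print(token_ids[:16])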
Submit the run with either the CLI or the Python SDK below; training starts after a brief setup period:
mcli run -f mosaic_gpt_1b.yaml --follow
from mcli import RunConfig, create_run
config = RunConfig.from_file('mosaic_gpt_1b.yaml')
# config.cluster = <your_cluster_name> # Only needed if you have more than one cluster
create_run(config)
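If you submit from Python, you can also wait for the run to start and stream its logs from the SDK. A minimal sketch, assuming the wait_for_run_status, RunStatus, and follow_run_logs helpers exported by your installed mcli version:

# Sketch: submit the run and stream its logs from Python.
from mcli import RunConfig, RunStatus, create_run, follow_run_logs, wait_for_run_status

config = RunConfig.from_file('mosaic_gpt_1b.yaml')
run = create_run(config)

# Block until the run is scheduled and running, then stream its logs.
wait_for_run_status(run, RunStatus.RUNNING)
for line in follow_run_logs(run):
    print(line, end='')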
Starting training...
******************************
Config:
enabled_algorithms/GradientClipping: true
node_name: inst-dlgir-r8z5-2-workers
num_gpus_per_node: 8
num_nodes: 1
rank_zero_seed: 17
******************************
[batch=1/100]:
    Train time/epoch: 0
    Train time/batch: 0
    Train time/sample: 0
    Train time/batch_in_epoch: 0
    Train time/sample_in_epoch: 0
    Train time/token: 0
    Train time/token_in_epoch: 0
    Train memory/allocated_mem: 2.4149
    Train memory/active_mem: 2.4149
    Train memory/inactive_mem: 1.1692
    Train memory/reserved_mem: 25.6170
    Train memory/alloc_retries: 0
    Train trainer/device_train_microbatch_size: 4
    Train loss/train/total: 11.8108
    Train metrics/train/LanguageCrossEntropy: 11.8108
    Train metrics/train/LanguagePerplexity: 134696.6562
    Train time/train: 0.0045
    Train time/val: 0.0000
    Train time/total: 0.0045
    Train lr-DecoupledAdamW/group0: 0.0000
Unique Names
Databricks Mosaic AI training appends a unique identifier to your provided run name so that every submitted run has a distinct name.
View your run status and the unique run name at any time with:
> mcli get runs
NAME CLUSTER GPU_TYPE GPU_NUM ... STATUS
mosaic-gpt-1b-gpus-8-3isk9a r8z2 a100_40gb 8 ... Running
Let’s stop your run:
mcli stop run <run-name>
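The same can be done from Python; a short sketch, assuming the get_runs and stop_run helpers in your mcli version:

# Sketch: look up a run by name and stop it from Python.
from mcli import get_runs, stop_run

for run in get_runs():
    if run.name.startswith('mosaic-gpt-1b'):
        stop_run(run)
        print(f'Stopped {run.name}')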
Scaling up the number of GPUs is easy. If you have access to multiple nodes, simply run:
mcli run -f mosaic_gpt_1b.yaml --gpus 16 --follow
i Run mosaic-gpt-1b-gpus-16-4czz submitted. Waiting for it to start...
i You can press Ctrl+C to quit and follow your run manually.
⠏ Rank 0: Waiting for resources to become available... 0:00:03
⠏ Rank 1: Waiting for resources to become available... 0:00:03
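The same override is available when submitting from Python; a sketch, assuming compute is exposed as a plain dictionary on RunConfig that mirrors the YAML's compute block:

# Sketch: request 16 GPUs for the same config from the SDK.
from mcli import RunConfig, create_run

config = RunConfig.from_file('mosaic_gpt_1b.yaml')
config.compute['gpus'] = 16  # overrides gpus: 4 from the YAML
run = create_run(config)
print(run.name)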
Clean up all your runs with:
mcli delete runs --all
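Or from Python, assuming your mcli version exports the delete_runs helper:

# Sketch: delete all of your runs from Python.
from mcli import delete_runs, get_runs

delete_runs(get_runs())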
Customization#
Our examples repository is designed to be easily modifiable for your own use cases. For example, you could fork the repository and edit the 1b.yaml configuration file.
Alternatively, we recommend using the parameters field to make it easy to tweak these settings with each run. Anything under parameters is mounted at /mnt/config/parameters.yaml for your code to access.
The resulting run YAML looks the same as above, except that (1) the python program is pointed at /mnt/config/parameters.yaml instead of its own config, and (2) we append the contents of 1b.yaml under the parameters field.
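For reference, this is roughly how training code can pick up the mounted file inside the container; a minimal sketch using OmegaConf as the YAML loader (any YAML library works), with values matching the example YAML below:

# Sketch: read the run's `parameters` block at /mnt/config/parameters.yaml.
from omegaconf import OmegaConf

cfg = OmegaConf.load('/mnt/config/parameters.yaml')
print(cfg.run_name)       # 'gpt-1b' in the example below
print(cfg.model.d_model)  # 2048
print(cfg.optimizer.lr)   # 0.0002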
Example YAML
name: mosaic-gpt-1b-gpus-8
image: mosaicml/pytorch:1.12.1_cu116-python3.9-ubuntu20.04
compute:
  gpus: 8
integrations:
  - integration_type: git_repo
    git_repo: mosaicml/examples
    git_branch: v0.0.2
    pip_install: -r llm/requirements.txt
command: |
  cd examples/llm
  python convert_c4.py --out_root ./my-copy-c4 --splits val
  composer main.py /mnt/config/parameters.yaml \
    train_loader.dataset.split=val \
    progress_bar=false \
    run_name=$COMPOSER_RUN_NAME
parameters:
  data_remote: &data_remote ./my-copy-c4
  data_local: &data_local ./my-copy-c4
  max_seq_len: &max_seq_len 2048
  tokenizer_name: &tokenizer_name gpt2

  # Run Name
  run_name: gpt-1b

  # Model
  model:
    name: mosaic_gpt
    device: meta
    tokenizer_name: *tokenizer_name
    d_model: 2048
    n_heads: 16
    n_layers: 24
    mlp_ratio: 4
    max_seq_len: *max_seq_len
    vocab_size: 50257
    init_std: 0.02
    attn_pdrop: 0.0
    resid_pdrop: 0.0
    emb_pdrop: 0.0
    attn_impl: flash

  # Tokenizer
  tokenizer:
    type: hftokenizer
    args:
      tokenizer_name: *tokenizer_name
      max_seq_len: *max_seq_len

  # Dataloaders
  train_loader:
    name: c4
    dataset:
      remote: *data_remote
      local: *data_local
      split: train
      shuffle: true
      prefetch: 1_000_000
      tokenizer_name: *tokenizer_name
      max_seq_len: *max_seq_len
      group_method: concat
    drop_last: true
    num_workers: 8
    pin_memory: true
    prefetch_factor: 2
    persistent_workers: true
    timeout: 0

  eval_loader:
    name: c4
    dataset:
      remote: *data_remote
      local: *data_local
      split: val
      shuffle: false
      prefetch: 1000
      tokenizer_name: *tokenizer_name
      max_seq_len: *max_seq_len
      group_method: truncate
    drop_last: false
    num_workers: 8
    pin_memory: true
    prefetch_factor: 2
    persistent_workers: true
    timeout: 0

  # Optimization
  scheduler:
    name: cosine_with_warmup
    t_warmup: 100ba
    alpha_f: 0.1

  optimizer:
    name: decoupled_adamw
    lr: 2.0e-4
    betas:
      - 0.9
      - 0.95
    eps: 1.0e-08
    weight_decay: 0.0

  max_duration: 24800ba
  eval_interval: 2000ba
  global_train_batch_size: 512
  grad_clip_norm: 1.0

  # System
  seed: 17
  device_eval_batch_size: 16
  device_train_microbatch_size: 16
  # device_train_microbatch_size: auto
  precision: bf16

  # FSDP
  fsdp_config:
    sharding_strategy: FULL_SHARD
    min_params: 2e8
    mixed_precision: DEFAULT
    activation_checkpointing: true
    activation_cpu_offload: false
    verbose: true

  # Logging
  progress_bar: true
  log_to_console: true

  callbacks:
    speed_monitor:
      window_size: 10
    lr_monitor: {}
    memory_monitor: {}
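Because everything under parameters is plain data, it is also easy to tweak these values per run at submission time. A sketch, assuming the example YAML above is saved under the hypothetical filename mosaic_gpt_1b_custom.yaml and that parameters is exposed as a nested dictionary on RunConfig:

# Sketch: override a few values under `parameters` before submitting.
from mcli import RunConfig, create_run

config = RunConfig.from_file('mosaic_gpt_1b_custom.yaml')  # hypothetical filename
config.parameters['max_duration'] = '4800ba'            # shorter training run
config.parameters['global_train_batch_size'] = 256      # smaller global batch
config.parameters['optimizer']['lr'] = 1.0e-4           # lower peak learning rate
run = create_run(config)
print(run.name)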