# Train ResNet-50 on AWS
Composer is a PyTorch library that accelerates training for deep learning models while improving quality at significantly lower cost. Composer makes it possible to train ResNet-50 on the ImageNet dataset to the standard 76.6% top-1 accuracy in 27 minutes on an AWS EC2 instance for a mere $15. In this tutorial we'll show you how simple it is to do this yourself!
The starting point for this tutorial is the set of training recipes we present in our Mosaic ResNet blog post. We'll walk through:
- Launching an AWS EC2 instance capable of running GPU training
- Configuring your AWS EC2 instance to run Composer with our pre-built Docker images
- Running Composer training using the ResNet-50 Mild recipe introduced in our blog post
## Prerequisites
- AWS account with permissions to:
  - Create/manage EC2 instances and EBS volumes
  - Create/manage Security Groups and Key Pairs (alternatively, provided by an IT admin)
- AWS quota to create Accelerated Computing EC2 instances
> **Note:** We use a p4d.24xlarge instance in this tutorial; however, these steps should run on any P-type EC2 instance.
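If you're unsure whether your account has enough quota for P-type instances, you can check from the AWS CLI. This is a sketch, not part of the original steps: it assumes the AWS CLI is installed and configured, and that `L-417A185B` is the "Running On-Demand P instances" quota code; confirm the code in the Service Quotas console for your account and region.

```bash
# Check the vCPU quota for On-Demand P instances in your region.
# The quota code L-417A185B is an assumption; verify it in the
# Service Quotas console before relying on this.
aws service-quotas get-service-quota \
    --service-code ec2 \
    --quota-code L-417A185B
```

A p4d.24xlarge uses 96 vCPUs, so the returned `Value` should be at least 96.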
- Download the latest ImageNet dataset
> **Note:** Due to the challenges associated with distributing ImageNet, we assume users will provide their own copy of this dataset for the purposes of this tutorial.
- MosaicML's ResNet-50 Recipes Docker image
  - Tag: `mosaicml/pytorch_vision:resnet50_recipes`
The image comes pre-configured with the following dependencies:
- Mosaic ResNet training recipes
- Training entrypoint: `train.py`
- Composer Version: 0.9.0
- PyTorch Version: 1.11.0
- CUDA Version: 11.3
- Python Version: 3.9
- Ubuntu Version: 20.04
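Optionally, you can pull the image ahead of time on any Docker-equipped machine and confirm the versions listed above. A minimal sanity check might look like this:

```bash
# Pull the recipes image (this also happens implicitly on first `docker run`).
docker pull mosaicml/pytorch_vision:resnet50_recipes

# Print the Composer and PyTorch versions baked into the image.
docker run --rm mosaicml/pytorch_vision:resnet50_recipes \
    python -c "import composer, torch; print(composer.__version__, torch.__version__)"
```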
## Launching an AWS EC2 Instance
First, let's create an EC2 instance that we can run GPU training on.
Log in to your AWS account and open the Management Console.

For the purposes of this tutorial, we will configure and launch a new `p4d.24xlarge` instance. On your **EC2 Dashboard**, click the **Launch instance** button.

Name your instance and select an AMI, instance type, key pair, and network settings. The following settings were used for this tutorial; customize as required depending on your AWS setup and IT requirements:
- Name: `composer-r50-demo-a100x8`
- Amazon Machine Image (AMI): `Deep Learning AMI GPU PyTorch 1.12.0 (Amazon Linux 2)`
- Instance type: `p4d.24xlarge`
- Key pair: **Create key pair** (make sure to note where you save the private key)
  - Key pair name: `composer_demo`
  - Key pair type: RSA
  - Private key format: `.pem`
- Network settings: Use defaults
- Storage (volumes): size the volume to fit your copy of ImageNet
Click **Launch instance**!
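If you prefer scripting over the console, the same launch can be expressed with the AWS CLI. The sketch below is illustrative only: the AMI ID is a placeholder, since Deep Learning AMI IDs vary by region, and your default security group and subnet may differ.

```bash
# Hypothetical CLI equivalent of the console launch above.
# Replace ami-0123456789abcdef0 with the "Deep Learning AMI GPU PyTorch 1.12.0
# (Amazon Linux 2)" ID for your region.
aws ec2 run-instances \
    --image-id ami-0123456789abcdef0 \
    --instance-type p4d.24xlarge \
    --key-name composer_demo \
    --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=composer-r50-demo-a100x8}]'
```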
## Configuring your AWS EC2 instance
Next, we will connect to our newly launched `p4d.24xlarge` instance, perform some basic system configuration to optimize our runtime environment, and set up our dataset area.
Navigate back to the **Instances** page in your AWS console. Click on the running instance you just launched and, in the Details pane, copy the instance's **Public IPv4 DNS** address. You will need this value to connect to the instance.

Using the private key you downloaded during the launch configuration and the instance's public DNS address, connect to the system using SSH:
```bash
ssh -i <path_to_private_key> ec2-user@<public_dns_address>
```
For example, if you saved the key created above at `~/composer_demo.pem`:

```bash
ssh -i ~/composer_demo.pem ec2-user@<public_dns_address>
```
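If SSH rejects the key with an "UNPROTECTED PRIVATE KEY FILE" warning (a common snag, not covered in the original steps), tighten the key's permissions first; SSH requires private keys to be readable only by their owner:

```bash
# Restrict the private key so only you can read it.
chmod 400 ~/composer_demo.pem
```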
Now let's create a `datasets` area to hold the ImageNet data:

```bash
sudo mkdir -p /datasets/ImageNet
sudo chmod -R 777 /datasets
```
(Optional) If the EC2 instance you selected comes with directly attached Instance Store Volumes, they can be formatted and mounted as follows:

```bash
sudo mkfs -t xfs /dev/nvme1n1
sudo mkdir /ImageNet
sudo mount /dev/nvme1n1 /ImageNet
sudo chmod 777 /ImageNet/
```
Instance Store Volumes (ISVs) generally have better performance than EBS volumes since they are directly attached to the instance, at the expense of persistence: Instance Store Volumes are ephemeral, and any data stored on them becomes inaccessible after the instance is powered off.
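To confirm the instance store is formatted and mounted as expected before copying data onto it, you can inspect the block devices. Treat this as a sketch: device names like `/dev/nvme1n1` vary by instance type.

```bash
# Show filesystems on each block device; the instance store should show xfs.
lsblk -f

# Confirm the mount point and available space.
df -h /ImageNet
```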
Regardless of whether you choose an EBS volume or an Instance Store Volume to host your dataset, copy the ImageNet data into the `/datasets/ImageNet` folder. In our example, the directory tree under `/datasets` looks as follows:

```
[ec2-user@ip-172-31-0-30 /]$ find ./datasets/ -maxdepth 2
./datasets/
./datasets/imagenet_files.tar
./datasets/ImageNet
./datasets/ImageNet/train
./datasets/ImageNet/val
```
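If your copy of the dataset arrived as a tarball, as in the tree above, you can extract it in place. This assumes the archive expands into `train/` and `val/` subdirectories; inspect it first and adjust paths if your archive is laid out differently:

```bash
# Extract the ImageNet archive into the dataset area. The internal layout
# of imagenet_files.tar is an assumption; check it first with `tar -tf`.
tar -xf /datasets/imagenet_files.tar -C /datasets/ImageNet
```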
Once you populate the dataset area, you'll be ready to start training!
## Train ResNet-50 on ImageNet
Now that we have launched an EC2 instance, configured the runtime and populated the dataset area, we are ready to kick off training.
Pull and run the `mosaicml/pytorch_vision:resnet50_recipes` Docker image. The image contains everything required for training, including pre-installed Composer, package dependencies, the training entrypoint, and recipe configuration files:

```bash
docker run -it -v /datasets:/datasets --gpus all --shm-size 1g mosaicml/pytorch_vision:resnet50_recipes
```
> **Note:** The default shared memory size of a Docker container is typically too small for larger datasets. In this example, increasing the shared memory size to 1GB is usually sufficient.
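If dataloader workers still crash with shared-memory errors, you can raise `--shm-size` further, or use Docker's `--ipc=host` flag, which shares the host's IPC namespace (and therefore its shared memory) with the container:

```bash
# Alternative to --shm-size: give the container access to the host's shared memory.
docker run -it -v /datasets:/datasets --gpus all --ipc=host \
    mosaicml/pytorch_vision:resnet50_recipes
```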
Run ResNet-50 training using the Mild recipe!

```bash
composer train.py -f recipes/resnet50_mild.yaml --scale_schedule_ratio 0.36 \
    --train_dataset.imagenet.ffcv_dir /datasets/ImageNet/ffcv \
    --val_dataset.imagenet.ffcv_dir /datasets/ImageNet/ffcv
```
> **Note:** The ResNet-50 Mild and Medium recipes use the very efficient, high-performing [FFCV dataloader](https://ffcv.io/), which requires the raw ImageNet data to be processed into FFCV format. Composer can perform this step for you automatically before launching the training run; simply append the following command line arguments to the training command above: `--train_dataset.imagenet.datadir /datasets/ImageNet/`, `--val_dataset.imagenet.datadir /datasets/ImageNet/`, `--train_dataset.imagenet.ffcv_write_dataset`, and `--val_dataset.imagenet.ffcv_write_dataset`. The first two arguments specify the location of the raw ImageNet training and validation data, respectively; the last two enable dataset conversion if the expected FFCV-formatted files do not exist.
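Putting the note above together with the training command, a first run that converts the raw data to FFCV format and then trains might look like the following sketch:

```bash
composer train.py -f recipes/resnet50_mild.yaml --scale_schedule_ratio 0.36 \
    --train_dataset.imagenet.ffcv_dir /datasets/ImageNet/ffcv \
    --val_dataset.imagenet.ffcv_dir /datasets/ImageNet/ffcv \
    --train_dataset.imagenet.datadir /datasets/ImageNet/ \
    --val_dataset.imagenet.datadir /datasets/ImageNet/ \
    --train_dataset.imagenet.ffcv_write_dataset \
    --val_dataset.imagenet.ffcv_write_dataset
```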
To perform this conversion manually, please follow the instructions detailed in the [README](https://github.com/mosaicml/examples/tree/main/examples/resnet#using-mosaic-recipes) in our [examples repository](https://github.com/mosaicml/examples/tree/main/examples/resnet), which contains all the code associated with our original blog post.
## Expected Results
We've performed various sweeps on AWS EC2 instances to understand the efficiency frontier across time, accuracy, and cost, as shown below.
The recipe explored in this tutorial should result in a model trained to a Top-1 accuracy of 76.6% in about 27 minutes for a total cost of $14.77.
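As a rough sanity check on that cost figure (assuming an on-demand p4d.24xlarge rate of about $32.77/hour, which varies by region and over time): 27 minutes is 0.45 hours, and 0.45 × $32.77 ≈ $14.75, in line with the $14.77 reported here.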
You can browse the results of our other ResNet-50 runs on AWS in Explorer, our tool for exploring efficiency frontiers for different models and datasets with different speed-up techniques across various clouds.
## Next steps
- Check out our GitHub repository for the latest information on Composer
- Check out the Composer + FFCV: Faster Together blog post for more information on how FFCV and Composer work together
- Reproduce our record-setting MLPerf ResNet-50 benchmark! Note that you will need access to p4de.24xlarge (in preview) EC2 instances, which contain NVIDIA A100 80GB GPUs. Please see the MLPerf Training Results v2.0 GitHub repository for additional details.
- Try training your own models and datasets using Composer!