Training EfficientNets at Supercomputer Scale: 83% ImageNet Top-1 Accuracy in One Hour
Arissa Wongpanich, Hieu Pham, James Demmel, Mingxing Tan, Quoc Le, Yang You, Sameer Kumar

TL;DR
This paper shows how to scale EfficientNet training on TPU-v3 Pods, reaching 83% ImageNet top-1 accuracy in just over an hour by tuning the batch size, optimizer, and learning rate schedule and by distributing evaluation and batch normalization.
Contribution
The paper introduces optimization techniques for large-scale EfficientNet training on TPU-v3 Pods, cutting training time from the order of days to about an hour while retaining high accuracy.
Findings
Achieved 83% ImageNet Top-1 accuracy in 1 hour and 4 minutes.
Scaled training to a batch size of 65536 on 1024 TPU-v3 cores using large-batch optimizers, tuned learning rate schedules, and distributed evaluation and batch normalization.
Provided performance benchmarks for EfficientNets at supercomputer scale.
Abstract
EfficientNets are a family of state-of-the-art image classification models based on efficiently scaled convolutional neural networks. Currently, EfficientNets can take on the order of days to train; for example, training an EfficientNet-B0 model takes 23 hours on a Cloud TPU v2-8 node. In this paper, we explore techniques to scale up the training of EfficientNets on TPU-v3 Pods with 2048 cores, motivated by speedups that can be achieved when training at such scales. We discuss optimizations required to scale training to a batch size of 65536 on 1024 TPU-v3 cores, such as selecting large batch optimizers and learning rate schedules as well as utilizing distributed evaluation and batch normalization techniques. Additionally, we present timing and performance benchmarks for EfficientNet models trained on the ImageNet dataset in order to analyze the behavior of EfficientNets at scale. With…
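To make the scale in the abstract concrete, the sketch below works through the arithmetic for a global batch of 65536 on 1024 cores. The helper names are hypothetical. The learning rate rule (linear scaling with warmup) is a common large-batch heuristic used here as an assumption, not necessarily the schedule the paper adopts. The `base_lr`, `base_batch`, and `warmup_steps` values are illustrative placeholders.

```python
import math

IMAGENET_TRAIN_IMAGES = 1_281_167  # ILSVRC-2012 training set size


def per_core_batch(global_batch: int, num_cores: int) -> int:
    """Examples each core processes per step under data parallelism."""
    if global_batch % num_cores != 0:
        raise ValueError("global batch must divide evenly across cores")
    return global_batch // num_cores


def steps_per_epoch(num_examples: int, global_batch: int) -> int:
    """Optimizer updates needed to see every example once."""
    return math.ceil(num_examples / global_batch)


def scaled_lr_with_warmup(step: int, global_batch: int,
                          base_lr: float = 0.1, base_batch: int = 256,
                          warmup_steps: int = 500) -> float:
    """Assumed heuristic: peak LR grows linearly with batch size,
    and the LR ramps up linearly over the first warmup_steps updates."""
    peak_lr = base_lr * global_batch / base_batch
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


if __name__ == "__main__":
    print(per_core_batch(65536, 1024))
    print(steps_per_epoch(IMAGENET_TRAIN_IMAGES, 65536))
    print(scaled_lr_with_warmup(0, 65536), scaled_lr_with_warmup(500, 65536))
```

At this batch size each epoch consists of only a handful of optimizer updates, which is one reason the choice of optimizer and learning rate schedule matters so much when training at this scale.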
Taxonomy
Methods: Depthwise Convolution · Pointwise Convolution · Depthwise Separable Convolution · Sigmoid Activation · Dropout · Inverted Residual Block · 1x1 Convolution · Dense Connections · Squeeze-and-Excitation Block
