Image Classification at Supercomputer Scale

Chris Ying; Sameer Kumar; Dehao Chen; Tao Wang; Youlong Cheng

arXiv:1811.06992 · cs.LG · December 4, 2018 · 95 cites

Open Access

TL;DR

This paper presents three systems-level optimizations for training deep learning models at petaFLOPS scale. Combined, they train ResNet-50 on ImageNet in 2.2 minutes on a 1024-chip TPU v3 Pod at over 1.05 million images/second.

Contribution

The paper describes three systems optimizations for large-scale deep learning training: distributed batch normalization, input pipeline optimizations, and 2-D torus all-reduce.
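One way to picture a 2-D all-reduce is as a gradient sum carried out one mesh dimension at a time. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the function name `torus_all_reduce_2d`, the three-phase schedule (reduce-scatter along columns, all-reduce along rows, all-gather along columns), and the nested-list device grid are all illustrative. The simulation computes the result of each phase directly and does not model ring steps or wraparound links.

```python
def torus_all_reduce_2d(grads):
    """Simulate a dimension-wise all-reduce on a rows x cols device grid.

    grads[r][c] is the gradient vector (a list of numbers) held by device (r, c).
    Returns a grid of the same shape where every device holds the global sum.
    """
    rows, cols = len(grads), len(grads[0])
    n = len(grads[0][0])
    assert n % rows == 0, "vector length must split evenly across rows"
    shard = n // rows

    # Phase 1: reduce-scatter within each column.
    # Device (r, c) ends up owning shard r of its column's sum.
    owned = [[None] * cols for _ in range(rows)]
    for c in range(cols):
        for r in range(rows):
            lo, hi = r * shard, (r + 1) * shard
            owned[r][c] = [
                sum(grads[k][c][i] for k in range(rows)) for i in range(lo, hi)
            ]

    # Phase 2: all-reduce within each row, on shards that are 1/rows the size.
    for r in range(rows):
        total = [sum(owned[r][c][i] for c in range(cols)) for i in range(shard)]
        for c in range(cols):
            owned[r][c] = list(total)

    # Phase 3: all-gather within each column to rebuild the full vector.
    result = [[None] * cols for _ in range(rows)]
    for c in range(cols):
        full = [x for r in range(rows) for x in owned[r][c]]
        for r in range(rows):
            result[r][c] = list(full)
    return result
```

In this sketch the second phase moves only `1/rows` of the vector per device, which is the usual motivation for reducing along one dimension before the other.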

Findings

01. Trained ResNet-50 on ImageNet to 76.3% accuracy in 2.2 minutes.

02. Achieved over 1.05 million images/second throughput.

03. No accuracy drop despite large-scale distributed training.

Abstract

Deep learning is extremely computationally intensive, and hardware vendors have responded by building faster accelerators in large clusters. Training deep learning models at petaFLOPS scale requires overcoming both algorithmic and systems software challenges. In this paper, we discuss three systems-related optimizations: (1) distributed batch normalization to control per-replica batch sizes, (2) input pipeline optimizations to sustain model throughput, and (3) 2-D torus all-reduce to speed up gradient summation. We combine these optimizations to train ResNet-50 on ImageNet to 76.3% accuracy in 2.2 minutes on a 1024-chip TPU v3 Pod with a training throughput of over 1.05 million images/second and no accuracy drop.
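The idea behind distributed batch normalization, as the abstract describes it, is to decouple the batch-norm statistics from the per-replica batch size by pooling statistics over several replicas. A minimal sketch follows, assuming a single channel, a fixed `group_size` of adjacent replicas, and a summed (count, sum, sum of squares) triple standing in for the cross-replica reduction. The function name `group_batch_norm` and its parameters are hypothetical and are not taken from the paper.

```python
import math


def group_batch_norm(replica_batches, group_size, eps=1e-5):
    """Normalize one channel using statistics pooled over groups of replicas.

    replica_batches[i] is the list of activations held by replica i.
    Replicas are grouped in consecutive blocks of `group_size` (an assumption).
    """
    assert len(replica_batches) % group_size == 0
    out = []
    for start in range(0, len(replica_batches), group_size):
        group = replica_batches[start:start + group_size]
        # Each replica contributes its local count, sum, and sum of squares;
        # adding them here stands in for an all-reduce within the group.
        count = sum(len(b) for b in group)
        total = sum(sum(b) for b in group)
        total_sq = sum(sum(x * x for x in b) for b in group)
        mean = total / count
        # Biased variance via E[x^2] - E[x]^2, clamped against rounding error.
        var = max(total_sq / count - mean * mean, 0.0)
        scale = 1.0 / math.sqrt(var + eps)
        out.extend([[(x - mean) * scale for x in b] for b in group])
    return out
```

With `group_size=1` this reduces to ordinary per-replica batch normalization. Larger groups make the effective normalization batch `group_size` times the per-replica batch, at the cost of one extra small reduction per layer.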

Taxonomy

Topics: Advanced Neural Network Applications · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Batch Normalization