Scale out for large minibatch SGD: Residual network training on ImageNet-1K with improved accuracy and reduced time to train

Valeriu Codreanu, Damian Podareanu, and Vikram Saletore

arXiv:1711.04291·stat.ML·November 17, 2017·36 cites

Open Access

TL;DR

This paper demonstrates scalable training of ResNet-50 on ImageNet-1K across large supercomputers, reporting over 90% scaling efficiency up to 104K cores, a training time of 28 minutes, and 77.5% top-1 accuracy with a Collapsed Ensemble.

Contribution

It introduces a methodology for scaling out large-minibatch SGD training of ResNet-50 on supercomputers, together with a Collapsed Ensemble technique for improved accuracy.

Findings

01 Over 90% scaling efficiency up to 104K cores

02 Training time reduced to 28 minutes for ResNet-50

03 Achieved 77.5% top-1 accuracy with Collapsed Ensemble
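
The scaling-efficiency figure above can be read with a small worked example. The sketch below assumes the common definition of scaling efficiency as measured speedup divided by ideal linear speedup; this page does not state which definition the paper uses. The function name `scaling_efficiency` and all numbers in the usage lines are hypothetical and are not taken from the paper.

```python
def scaling_efficiency(base_workers, base_throughput, workers, throughput):
    """Return measured speedup divided by ideal linear speedup.

    Assumed definition (not confirmed by the source page):
        efficiency = (throughput / base_throughput) / (workers / base_workers)
    Throughput can be any rate, for example images processed per second.
    """
    if min(base_workers, workers) <= 0:
        raise ValueError("worker counts must be positive")
    if min(base_throughput, throughput) <= 0:
        raise ValueError("throughputs must be positive")
    speedup = throughput / base_throughput   # how much faster we actually got
    ideal = workers / base_workers           # how much faster linear scaling would be
    return speedup / ideal


# Hypothetical measurements, chosen only to illustrate the arithmetic.
efficiency = scaling_efficiency(
    base_workers=1, base_throughput=100.0,
    workers=64, throughput=5760.0,
)
print(f"scaling efficiency: {efficiency:.0%}")
```

An efficiency of 1.0 means throughput grew in exact proportion to the number of workers; values below 1.0 indicate overhead, typically from communication and synchronization.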

Abstract

For the past 5 years, the ILSVRC competition and the ImageNet dataset have attracted a lot of interest from the Computer Vision community, allowing for state-of-the-art accuracy to grow tremendously. This should be credited to the use of deep artificial neural network designs. As these became more complex, the storage, bandwidth, and compute requirements increased. This means that with a non-distributed approach, even when using the most high-density server available, the training process may take weeks, making it prohibitive. Furthermore, as datasets grow, the representation learning potential of deep networks grows as well by using more complex models. This synchronicity triggers a sharp increase in the computational requirements and motivates us to explore the scaling behaviour on petaflop scale supercomputers. In this paper we will describe the challenges and novel solutions needed…

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning