Large batch size training of neural networks with adversarial training and second-order information
Zhewei Yao, Amir Gholami, Daiyaan Arfeen, Richard Liaw, Joseph Gonzalez, Kurt Keutzer, Michael Mahoney

TL;DR
This paper introduces an efficient adaptive batch size training framework with autoscaling and second-order methods, improving training speed and accuracy for neural networks across multiple datasets.
Contribution
It presents a novel elastic scaling approach with negligible overhead and a new adaptive batch size scheme leveraging second-order and adversarial training methods.
Findings
Achieves up to 1% higher accuracy
Reduces number of SGD iterations by up to 5x
Demonstrates effectiveness across multiple datasets and architectures
Abstract
The most straightforward method to accelerate Stochastic Gradient Descent (SGD) computation is to distribute the randomly selected batch of inputs over multiple processors. To keep the distributed processors fully utilized requires commensurately growing the batch size. However, large batch training often leads to poorer generalization. A recently proposed solution for this problem is to use adaptive batch sizes in SGD. In this case, one starts with a small number of processes and scales the processes as training progresses. Two major challenges with this approach are (i) that dynamically resizing the cluster can add non-trivial overhead, in part since it is currently not supported, and (ii) that the overall speed up is limited by the initial phase with smaller batches. In this work, we address both challenges by developing a new adaptive batch size framework, with autoscaling based on…
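The adaptive batch size idea described in the abstract (start with small batches, then grow them as training progresses) can be illustrated with a toy example. The sketch below is not the paper's method: it swaps in a fixed doubling schedule and a linear learning-rate scaling rule as stand-ins for the second-order criterion, and it omits adversarial training and cluster autoscaling entirely. The function name `adaptive_batch_sgd` and all of its parameters are hypothetical, chosen only for this illustration.

```python
import random


def adaptive_batch_sgd(xs, ys, epochs=8, base_batch=4, max_batch=64,
                       base_lr=0.05, grow_every=2, seed=0):
    """Fit y ~ w * x with mini-batch SGD while growing the batch size.

    Assumptions (not from the paper): the batch size doubles every
    `grow_every` epochs, and the learning rate scales linearly with it.
    """
    rng = random.Random(seed)
    w = 0.0
    batch = base_batch
    history = []  # one (epoch, batch size, SGD iterations) tuple per epoch
    indices = list(range(len(xs)))
    for epoch in range(epochs):
        # Fixed doubling schedule, capped at max_batch.
        if epoch > 0 and epoch % grow_every == 0:
            batch = min(2 * batch, max_batch)
        # Linear learning-rate scaling with batch size (assumption).
        lr = base_lr * batch / base_batch
        rng.shuffle(indices)
        steps = 0
        for start in range(0, len(indices), batch):
            chunk = indices[start:start + batch]
            # Gradient of the mean squared error over the mini-batch.
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in chunk) / len(chunk)
            w -= lr * grad
            steps += 1
        history.append((epoch, batch, steps))
    return w, history


# Noise-free toy data: y = 3x on 256 evenly spaced points in [0, 1).
xs = [i / 256 for i in range(256)]
ys = [3.0 * x for x in xs]
w, history = adaptive_batch_sgd(xs, ys)
total_steps = sum(steps for _, _, steps in history)
print(f"w = {w:.4f}, SGD iterations = {total_steps}")
```

With these defaults the batch size goes 4, 8, 16, 32 over eight epochs, so the run takes 240 SGD iterations instead of the 512 that a fixed batch size of 4 would need. Each larger-batch iteration is the part that could, in principle, be spread across more processors.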
Taxonomy
Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Adversarial Robustness in Machine Learning
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Stochastic Gradient Descent
