Measuring the Effects of Data Parallelism on Neural Network Training

Christopher J. Shallue; Jaehoon Lee; Joseph Antognini; Jascha Sohl-Dickstein; Roy Frostig; George E. Dahl

arXiv:1811.03600 · cs.LG · July 22, 2019 · 163 cites

TL;DR

This paper experimentally investigates how increasing the batch size, one of the simplest ways to exploit data parallelism, affects neural network training time and model quality. The relationship varies extremely widely across workloads, and the authors find no evidence that larger batch sizes degrade out-of-sample performance.

Contribution

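It provides an empirical analysis of how batch size affects training time across training algorithms, models, and data sets, and shows that disagreements in the literature about batch size and model quality can largely be explained by differences in metaparameter tuning and compute budgets.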

Findings

01

The experiments show no evidence that larger batch sizes degrade out-of-sample performance.

02

The effect of batch size on training time, measured as the number of steps needed to reach a goal out-of-sample error, varies extremely widely across workloads.

03

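Disagreements in the literature on how batch size affects model quality can largely be explained by differences in metaparameter tuning and compute budgets at different batch sizes.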

Abstract

Recent hardware developments have dramatically increased the scale of data parallelism available for neural network training. Among the simplest ways to harness next-generation hardware is to increase the batch size in standard mini-batch neural network training algorithms. In this work, we aim to experimentally characterize the effects of increasing the batch size on training time, as measured by the number of steps necessary to reach a goal out-of-sample error. We study how this relationship varies with the training algorithm, model, and data set, and find extremely large variation between workloads. Along the way, we show that disagreements in the literature on how batch size affects model quality can largely be explained by differences in metaparameter tuning and compute budgets at different batch sizes. We find no evidence that larger batch sizes degrade out-of-sample performance.…
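The abstract measures training time as the number of steps needed to reach a goal out-of-sample error. A minimal sketch of that measurement is given here, assuming each batch size has a recorded validation-error curve from its own tuned run. The function name `steps_to_target`, the target value, and all numbers in `curves` are hypothetical placeholders for illustration, not results or code from the paper.

```python
from typing import Dict, List, Optional, Tuple

# A curve is a list of (training step, validation error) pairs in step order.
Curve = List[Tuple[int, float]]


def steps_to_target(curve: Curve, target_error: float) -> Optional[int]:
    """Return the first recorded step whose error is at or below the target.

    Returns None if the run never reaches the target within its budget.
    """
    for step, error in curve:
        if error <= target_error:
            return step
    return None


# Hypothetical measurements (NOT from the paper), keyed by batch size.
curves: Dict[int, Curve] = {
    64: [(1000, 0.40), (2000, 0.28), (4000, 0.19), (8000, 0.15)],
    128: [(1000, 0.33), (2000, 0.20), (4000, 0.16), (8000, 0.14)],
    256: [(1000, 0.27), (2000, 0.18), (4000, 0.15), (8000, 0.14)],
}

TARGET_ERROR = 0.20  # assumed goal out-of-sample error

# Steps needed to reach the goal at each batch size.
result = {
    batch_size: steps_to_target(curve, TARGET_ERROR)
    for batch_size, curve in sorted(curves.items())
}

for batch_size, steps in result.items():
    print(f"batch size {batch_size}: {steps} steps to reach error <= {TARGET_ERROR}")
```

Plotting `result` against batch size gives the kind of steps-versus-batch-size relationship the abstract describes. Returning `None` for runs that never reach the target keeps a limited compute budget visible rather than hiding it.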

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications