Online Batch Selection for Faster Training of Neural Networks
Ilya Loshchilov, Frank Hutter

TL;DR
This paper explores online batch selection strategies for neural network training, demonstrating that selecting batches based on loss ranking can significantly accelerate convergence of optimizers like AdaDelta and Adam.
Contribution
It introduces a simple loss-based ranking strategy for online batch selection, improving training speed for stochastic gradient methods.
Findings
Online batch selection speeds up training with both AdaDelta and Adam by a factor of about 5.
The proposed ranking strategy effectively controls selection pressure.
Results are demonstrated on the MNIST dataset.
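To make the idea of rank-based selection with a tunable selection pressure concrete, here is a minimal sketch, not the authors' implementation. It assumes that datapoints are ranked by their most recently recorded loss and that the sampling probability decays geometrically with rank. The names `rank_selection_probs`, `select_batch`, and `pressure` are hypothetical; `pressure` is taken here to mean the ratio between the probability of picking the highest-loss and the lowest-loss datapoint. Sampling with replacement is a simplification chosen to keep the example within the standard library.

```python
import math
import random


def rank_selection_probs(n, pressure):
    """Return selection probabilities for ranks 0..n-1 (rank 0 = largest loss).

    Assumption: probabilities decay geometrically with rank so that
    probs[0] / probs[n - 1] == pressure. pressure == 1 gives uniform sampling.
    """
    if n == 1:
        return [1.0]
    decay = math.log(pressure) / (n - 1)
    weights = [math.exp(-decay * rank) for rank in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def select_batch(latest_losses, batch_size, pressure, rng):
    """Sample datapoint indices, favoring those with the largest recorded loss."""
    n = len(latest_losses)
    # Indices of datapoints sorted by decreasing loss: order[0] has the largest loss.
    order = sorted(range(n), key=lambda i: latest_losses[i], reverse=True)
    probs = rank_selection_probs(n, pressure)
    # Draw ranks (with replacement, a simplification), then map back to indices.
    chosen_ranks = rng.choices(range(n), weights=probs, k=batch_size)
    return [order[r] for r in chosen_ranks]


if __name__ == "__main__":
    rng = random.Random(0)
    losses = [rng.random() for _ in range(1000)]  # stand-in for per-datapoint losses
    batch = select_batch(losses, batch_size=64, pressure=100.0, rng=rng)
    print(len(batch), "indices selected")
```

In a training loop, the recorded loss of each selected datapoint would be refreshed after the forward pass, so the ranking tracks the current state of the model. Lowering `pressure` toward 1 recovers ordinary uniform sampling, which is one way to reason about how strongly the selection should favor hard examples.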
Abstract
Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Machine Learning and Data Classification
Methods: AdaDelta · Adam
