TL;DR
This paper empirically analyzes the behavior of SGD from a line search perspective, revealing that the full-batch loss along update directions is parabolic and that SGD can perform near-exact line searches, providing insights into batch size and learning rate effects.
Contribution
It offers the first empirical analysis of SGD trajectories from a line search perspective, demonstrating parabolic loss behavior and near-exact line search conditions.
Findings
Full-batch loss along update lines is highly parabolic.
There exists a learning rate at which SGD performs near-exact line searches on the full-batch loss.
Increasing the batch size has almost the same effect as decreasing the learning rate by the same factor.
Abstract
Optimization in Deep Learning is mainly guided by vague intuitions and strong assumptions, with a limited understanding of how and why these work in practice. To shed more light on this, our work provides a deeper understanding of how SGD behaves by empirically analyzing the trajectory taken by SGD from a line search perspective. Specifically, a costly quantitative analysis of the full-batch loss along SGD trajectories from commonly used models trained on a subset of CIFAR-10 is performed. Our core results include that the full-batch loss along lines in the update step direction is highly parabolic. Furthermore, we show that there exists a learning rate with which SGD always performs almost exact line searches on the full-batch loss. Finally, we provide a different perspective on why increasing the batch size has almost the same effect as decreasing the learning rate by the same factor.
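To make the line search perspective concrete, here is a minimal sketch of measuring the loss along an update direction, fitting a parabola to it, and reading off the step size an exact line search would take. It is not the paper's code. The toy quadratic `toy_loss` stands in for the full-batch loss of a real network, and all function names (`loss_along_line`, `fit_parabola`, `parabola_minimum`) are hypothetical.

```python
import numpy as np

def loss_along_line(loss_fn, w, direction, steps):
    """Evaluate loss_fn at w + s * direction for every step size s."""
    return np.array([loss_fn(w + s * direction) for s in steps])

def fit_parabola(steps, losses):
    """Least-squares fit losses ~ a*s^2 + b*s + c; returns (a, b, c)."""
    a, b, c = np.polyfit(steps, losses, deg=2)
    return a, b, c

def parabola_minimum(a, b):
    """Step size minimizing a*s^2 + b*s + c (requires a > 0)."""
    if a <= 0:
        raise ValueError("fitted parabola is not convex; no finite minimum")
    return -b / (2.0 * a)

# Assumption: a toy quadratic replaces the full-batch loss of a network.
A = np.diag([1.0, 4.0])

def toy_loss(w):
    return 0.5 * w @ A @ w

w0 = np.array([2.0, 1.0])
grad = A @ w0                              # gradient of toy_loss at w0
direction = -grad / np.linalg.norm(grad)   # normalized update direction

# Sample the loss on the line through w0 and fit a parabola to the samples.
steps = np.linspace(-0.5, 2.0, 11)
losses = loss_along_line(toy_loss, w0, direction, steps)
a, b, c = fit_parabola(steps, losses)

# Step an exact line search would take along this direction.
s_star = parabola_minimum(a, b)
print(f"fitted a={a:.4f}, b={b:.4f}, c={c:.4f}, exact step={s_star:.4f}")
```

On a real model, `toy_loss` would be replaced by the loss over the full training set and `direction` by the mini-batch update direction. How well the fitted parabola matches the sampled losses then indicates how parabolic the loss is along that line, and comparing the step SGD actually takes with `s_star` indicates how close that step is to an exact line search.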
Taxonomy
Methods: Stochastic Gradient Descent
