Why Does Multi-Epoch Training Help?
Yi Xu, Qi Qian, Hao Li, Rong Jin

TL;DR
This paper gives a theoretical explanation for why multi-epoch (multi-pass) stochastic gradient descent can outperform a single pass when training deep neural networks, focusing on objectives that satisfy the Polyak-Łojasiewicz (PL) condition.
Contribution
It offers the first theoretical analysis showing faster convergence rates for multi-pass SGD compared to one-pass SGD in non-convex least squares problems.
Findings
Multi-pass SGD achieves a faster excess risk convergence rate than one-pass SGD.
Theoretical evidence under the PL condition explains the empirical benefits.
The analysis improves understanding of multi-epoch training in deep learning.
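For readers unfamiliar with the PL condition mentioned above, its standard textbook form is sketched below. The symbols F (the objective), w (the parameters), F^* (the minimum value of F), and \mu (the PL constant) are notation assumed here for illustration; the paper's exact statement and constants may differ.

```latex
% Polyak-Łojasiewicz (PL) condition with parameter \mu > 0
% (standard form; notation assumed, not taken from the paper)
\frac{1}{2}\,\lVert \nabla F(w) \rVert^{2} \;\ge\; \mu \,\bigl( F(w) - F^{*} \bigr)
\quad \text{for all } w, \qquad F^{*} = \min_{w} F(w).
```

The condition does not require F to be convex, which is why it is commonly used to analyze non-convex objectives. In this notation, the excess risk of a solution w is F(w) - F^* when F is the population risk.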
Abstract
Stochastic gradient descent (SGD) has become the most attractive optimization method for training large-scale deep neural networks due to its simplicity, low computational cost per update step, and good performance. Standard excess risk bounds show that SGD only needs to take one pass over the training data and that more passes cannot improve performance. Empirically, however, SGD that takes more than one pass over the training data (multi-pass SGD) has been observed to achieve much better excess risk than SGD that takes only one pass (one-pass SGD). It is not clear how to explain this phenomenon in theory. In this paper, we provide some theoretical evidence for why multiple passes over the training data can help improve performance under certain circumstances. Specifically, we consider smooth risk minimization…
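The difference between one-pass and multi-pass SGD can be made concrete with a small sketch. It is only an illustration of the mechanics, not the paper's setting or experiment: it uses a convex linear least-squares model on synthetic data rather than a non-convex objective, and the function names, step size, data sizes, and number of passes are all assumptions chosen for this example.

```python
import numpy as np

def sgd_passes(X, y, num_passes, lr=0.01, seed=0):
    """Plain SGD on the per-sample loss 0.5 * (x @ w - y)**2.

    Each pass visits every training sample exactly once, in a fresh
    random order. num_passes=1 is one-pass SGD; larger values give
    multi-pass (multi-epoch) SGD. All defaults are illustrative.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_passes):
        for i in rng.permutation(n):
            grad = (X[i] @ w - y[i]) * X[i]  # gradient of the per-sample loss
            w -= lr * grad
    return w

def mean_squared_loss(X, y, w):
    """Average of 0.5 * (x @ w - y)**2 over a dataset."""
    return 0.5 * np.mean((X @ w - y) ** 2)

# Synthetic linear-regression data (assumed setup, not from the paper).
rng = np.random.default_rng(0)
n, d = 200, 5
w_true = rng.normal(size=d)
X_train = rng.normal(size=(n, d))
y_train = X_train @ w_true + 0.1 * rng.normal(size=n)
X_test = rng.normal(size=(2000, d))
y_test = X_test @ w_true + 0.1 * rng.normal(size=2000)

w_one = sgd_passes(X_train, y_train, num_passes=1)
w_multi = sgd_passes(X_train, y_train, num_passes=20)

# Held-out loss serves as a rough empirical proxy for the population risk.
print("one-pass   test loss:", mean_squared_loss(X_test, y_test, w_one))
print("multi-pass test loss:", mean_squared_loss(X_test, y_test, w_multi))
```

Comparing the two printed values shows whether the extra passes helped on this toy problem. The outcome depends on the step size and the noise level, so it should not be read as evidence for or against the paper's bounds.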
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Machine Learning and Algorithms · Domain Adaptation and Few-Shot Learning
Methods: Stochastic Gradient Descent
