Why Does Multi-Epoch Training Help?

Yi Xu; Qi Qian; Hao Li; Rong Jin

arXiv:2105.06015·cs.LG·May 14, 2021

Open Access

TL;DR

This paper gives theoretical explanations for why multi-epoch (multi-pass) stochastic gradient descent improves performance when training deep neural networks, in particular under the Polyak-Łojasiewicz (PL) condition.
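For reference, the PL condition in its commonly used form is shown below. The symbols are assumptions made here for illustration (a differentiable objective $F$, its minimum value $F^{*}$, and a constant $\mu > 0$); the paper's own notation and constants may differ.

```latex
\frac{1}{2}\,\lVert \nabla F(w) \rVert^{2} \;\ge\; \mu \bigl( F(w) - F^{*} \bigr)
\quad \text{for all } w .
```

The condition does not require $F$ to be convex, but it does imply that every stationary point of $F$ is a global minimizer.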

Contribution

It offers the first theoretical analysis showing a faster convergence rate of the excess risk bound for multi-pass SGD than for one-pass SGD on non-convex least squares problems.

Findings

01. Multi-pass SGD achieves faster excess risk convergence.

02. Theoretical evidence under PL condition explains empirical benefits.

03. Improves understanding of multi-epoch training in deep learning.

Abstract

Stochastic gradient descent (SGD) has become the most attractive optimization method for training large-scale deep neural networks because of its simplicity, low computational cost per update step, and good performance. Standard excess risk bounds show that SGD needs only one pass over the training data and that additional passes do not improve performance. Empirically, however, SGD that takes more than one pass over the training data (multi-pass SGD) has been observed to achieve much better excess risk than SGD that takes only one pass (one-pass SGD). It is not yet clear how to explain this phenomenon in theory. In this paper, we provide theoretical evidence explaining why multiple passes over the training data can help improve performance under certain circumstances. Specifically, we consider smooth risk minimization…
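The empirical gap between one-pass and multi-pass SGD that the abstract describes can be reproduced in a toy setting. The sketch below is not the paper's experiment or analysis: it swaps in a convex linear least squares model for the paper's non-convex setting, and every name and value in it (`W_TRUE`, `make_data`, `sgd`, `compare`, the step size, the number of epochs) is a hypothetical choice made for illustration. It fixes one constant step size for both runs, so it shows only that extra passes can lower held-out error in this setup, not the convergence rates the paper proves.

```python
import random

# Hypothetical ground-truth weights for a synthetic linear regression task.
W_TRUE = [1.0, -2.0, 0.5, 3.0, -1.5]


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def make_data(n, noise_std, seed):
    """Draw n samples (x, y) with y = <W_TRUE, x> + Gaussian noise."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in W_TRUE] for _ in range(n)]
    y = [dot(W_TRUE, x) + rng.gauss(0.0, noise_std) for x in X]
    return X, y


def sgd(X, y, epochs, lr, seed):
    """Plain SGD on the squared loss; epochs=1 is one-pass SGD."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # reshuffle the data at the start of each pass
        for i in order:
            residual = dot(w, X[i]) - y[i]
            # Gradient of 0.5 * residual**2 with respect to w is residual * x_i.
            w = [wj - lr * residual * xj for wj, xj in zip(w, X[i])]
    return w


def mean_squared_error(X, y, w):
    """Average squared prediction error of weights w on (X, y)."""
    return sum((dot(w, x) - t) ** 2 for x, t in zip(X, y)) / len(X)


def compare(n_train=200, n_test=2000, epochs=20, lr=0.01):
    """Return held-out MSE for one-pass and multi-pass SGD on the same data."""
    X_train, y_train = make_data(n_train, noise_std=0.1, seed=0)
    X_test, y_test = make_data(n_test, noise_std=0.1, seed=1)
    w_one = sgd(X_train, y_train, epochs=1, lr=lr, seed=2)
    w_multi = sgd(X_train, y_train, epochs=epochs, lr=lr, seed=2)
    return (mean_squared_error(X_test, y_test, w_one),
            mean_squared_error(X_test, y_test, w_multi))


if __name__ == "__main__":
    one_pass, multi_pass = compare()
    print(f"held-out MSE, one pass:   {one_pass:.4f}")
    print(f"held-out MSE, multi-pass: {multi_pass:.4f}")
```

With these arbitrary settings, one pass of 200 small steps leaves the weights far from the least squares solution, while 20 passes bring the held-out error close to the noise level. A tuned step size for the one-pass run would narrow the gap, which is one reason a careful theoretical comparison is needed.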

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Machine Learning and Algorithms · Domain Adaptation and Few-Shot Learning

Methods: Stochastic Gradient Descent