Towards Optimal One Pass Large Scale Learning with Averaged Stochastic Gradient Descent
Wei Xu

TL;DR
This paper analyzes averaged stochastic gradient descent (ASGD) for large-scale learning and shows that setting the learning rate properly lets ASGD reach a near-optimal model with substantially fewer training samples.
Contribution
It provides a finite sample analysis of ASGD, proposes a simple method to set the learning rate properly, and demonstrates its superiority over other algorithms in large-scale linear classification.
Findings
A properly set learning rate reduces the number of samples ASGD needs to reach its asymptotic performance.
ASGD with the proposed learning rate outperforms other algorithms in experiments.
Finite sample analysis explains the practical efficiency of ASGD in large-scale learning.
Abstract
For large scale learning problems, it is desirable if we can obtain the optimal model parameters by going through the data in only one pass. Polyak and Juditsky (1992) showed that asymptotically the test performance of the simple average of the parameters obtained by stochastic gradient descent (SGD) is as good as that of the parameters which minimize the empirical cost. However, to our knowledge, despite its optimal asymptotic convergence rate, averaged SGD (ASGD) received little attention in recent research on large scale learning. One possible reason is that it may take a prohibitively large number of training samples for ASGD to reach its asymptotic region for most real problems. In this paper, we present a finite sample analysis for the method of Polyak and Juditsky (1992). Our analysis shows that it indeed usually takes a huge number of samples for ASGD to reach its asymptotic…
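The averaging idea described in the abstract can be illustrated with a minimal sketch: run plain SGD over the data once and keep a running mean of the iterates. The example below is an illustration only, not the paper's implementation. The squared-loss linear model, the synthetic data stream, the function names `asgd_least_squares` and `make_stream`, and the step-size schedule `gamma0 / t**decay` are all assumptions chosen for the example; in particular, the schedule is not the learning rate setting the paper proposes.

```python
import random


def asgd_least_squares(samples, dim, gamma0=0.1, decay=0.5):
    """One pass of SGD on squared loss; returns (last iterate, averaged iterate).

    gamma0 and decay define an assumed step-size schedule gamma0 / t**decay,
    picked for illustration rather than taken from the paper.
    """
    w = [0.0] * dim      # current SGD iterate
    w_bar = [0.0] * dim  # running average of all iterates so far
    for t, (x, y) in enumerate(samples, start=1):
        gamma = gamma0 / (t ** decay)
        # Gradient of 0.5 * (w.x - y)^2 with respect to w is (w.x - y) * x.
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        w = [wi - gamma * err * xi for wi, xi in zip(w, x)]
        # Incremental mean: w_bar_t = w_bar_{t-1} + (w_t - w_bar_{t-1}) / t.
        w_bar = [bi + (wi - bi) / t for bi, wi in zip(w_bar, w)]
    return w, w_bar


def make_stream(n, true_w, noise=0.5, seed=0):
    """Hypothetical synthetic data: y = true_w . x + Gaussian noise."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in true_w]
        y = sum(wi * xi for wi, xi in zip(true_w, x)) + rng.gauss(0.0, noise)
        yield x, y


true_w = [1.0, -2.0, 0.5]
w_last, w_avg = asgd_least_squares(make_stream(20000, true_w), dim=len(true_w))
print("last iterate:    ", w_last)
print("averaged iterate:", w_avg)
```

Each sample is visited exactly once, and the average is updated in constant memory, so the sketch matches the one-pass setting the abstract describes. How quickly the averaged iterate becomes useful depends heavily on the step-size schedule, which is the issue the paper's finite sample analysis addresses.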
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Machine Learning and Algorithms · Machine Learning and ELM
