Towards Optimal One Pass Large Scale Learning with Averaged Stochastic Gradient Descent

Wei Xu

arXiv:1107.2490·cs.LG·December 23, 2011·120 cites

TL;DR

This paper analyzes averaged stochastic gradient descent (ASGD) for large-scale learning and shows that a properly set learning rate lets ASGD approach the optimal model with substantially fewer training samples.

Contribution

The paper provides a finite sample analysis of ASGD, proposes a simple method for setting the learning rate, and shows experimentally that ASGD with this learning rate outperforms other algorithms on large-scale linear classification.

Findings

01. Setting the learning rate properly reduces the number of samples ASGD needs to reach its asymptotic performance.

02. ASGD with the proposed learning rate outperforms the other algorithms compared in the large-scale linear classification experiments.

03. The finite sample analysis explains the practical efficiency of ASGD in large-scale learning.

Abstract

For large scale learning problems, it is desirable if we can obtain the optimal model parameters by going through the data in only one pass. Polyak and Juditsky (1992) showed that asymptotically the test performance of the simple average of the parameters obtained by stochastic gradient descent (SGD) is as good as that of the parameters which minimize the empirical cost. However, to our knowledge, despite its optimal asymptotic convergence rate, averaged SGD (ASGD) received little attention in recent research on large scale learning. One possible reason is that it may take a prohibitively large number of training samples for ASGD to reach its asymptotic region for most real problems. In this paper, we present a finite sample analysis for the method of Polyak and Juditsky (1992). Our analysis shows that it indeed usually takes a huge number of samples for ASGD to reach its asymptotic…
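The averaging idea described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: it runs one pass of SGD on a synthetic least-squares problem and keeps a running mean of the iterates. The function names, the synthetic data, the squared loss, and the polynomially decaying step size (with hypothetical parameters `gamma0`, `decay`, and `power`) are all assumptions made for this example; the excerpt above does not specify the learning rate schedule the paper proposes.

```python
import random


def asgd_one_pass(samples, dim, gamma0=0.1, decay=0.01, power=0.75):
    """Run one pass of SGD on squared loss and average the iterates.

    Returns (last_iterate, averaged_iterate). The step-size schedule is an
    illustrative assumption, not taken from the source text.
    """
    w = [0.0] * dim      # current SGD iterate
    w_avg = [0.0] * dim  # running mean of the iterates w_1..w_t
    for t, (x, y) in enumerate(samples, start=1):
        # Hypothetical polynomially decaying step size.
        gamma = gamma0 / (1.0 + decay * gamma0 * t) ** power
        # Gradient of 0.5 * (w.x - y)^2 with respect to w is (w.x - y) * x.
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        w = [wi - gamma * err * xi for wi, xi in zip(w, x)]
        # Incremental update of the simple average of all iterates so far.
        w_avg = [a + (wi - a) / t for a, wi in zip(w_avg, w)]
    return w, w_avg


def make_samples(n, w_true, noise=0.5, seed=0):
    """Generate synthetic linear-regression data (illustrative only)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in w_true]
        y = sum(wi * xi for wi, xi in zip(w_true, x)) + rng.gauss(0.0, noise)
        samples.append((x, y))
    return samples


W_TRUE = [2.0, -1.0, 0.5]
w_last, w_avg = asgd_one_pass(make_samples(20000, W_TRUE), dim=len(W_TRUE))
```

Here `w_avg` is the simple average of the SGD iterates, which is the quantity the Polyak and Juditsky result concerns, while `w_last` is the plain SGD output. Each sample is visited exactly once, so the sketch is a one-pass procedure; how quickly the average becomes useful depends on the step-size parameters, which is the question the paper's finite sample analysis addresses.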

Code & Models

Repositories

ksopyla/svm_mnist_digit_classification

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Machine Learning and Algorithms · Machine Learning and ELM