Convergence of Gradient Descent on Separable Data

Mor Shpigel Nacson; Jason D. Lee; Suriya Gunasekar; Pedro H. P. Savarese; Nathan Srebro; Daniel Soudry

arXiv:1803.01905 · stat.ML · March 26, 2019 · 29 cites

TL;DR

This paper investigates how gradient descent on separable data converges to maximum-margin solutions depending on the loss function's tail behavior, revealing conditions for convergence and optimal rates.

Contribution

It characterizes the conditions under which gradient descent converges to the maximum-margin separator for various loss tails and proposes improved convergence rates with aggressive step sizes.

Findings

01

Gradient descent converges to the maximum-margin solution for super-polynomial tailed losses.

02

Among these losses, exponentially tailed losses such as the logistic loss achieve the optimal margin convergence rate when the step size is fixed.

03

Aggressive step sizes can improve the convergence rate to $\log(t)/\sqrt{t}$ for linear models.
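
The effect of a more aggressive step size can be sketched on a toy problem. The summary above does not spell out the schedule, so this example assumes normalized gradient descent on the exponential loss with step size 1/sqrt(t+1), and compares it with fixed-step gradient descent. The two-point dataset `Z`, the step sizes, and all function names are illustrative assumptions, not taken from the paper.

```python
import math

# Toy separable data (assumed for illustration). Each row is z_i = y_i * x_i,
# so w separates the data when w . z_i > 0 for all i.
# For these two points the L2 max-margin direction is (1, 0), with margin 1.
Z = [(1.0, 2.0), (1.0, -1.0)]


def margin_gap(w):
    """1 minus the normalized margin min_i (w . z_i) / ||w||; 0 is optimal here."""
    norm = math.hypot(w[0], w[1])
    return 1.0 - min(w[0] * z[0] + w[1] * z[1] for z in Z) / norm


def gd_fixed(steps, eta=0.1):
    """Gradient descent on sum_i exp(-w . z_i) with a fixed step size."""
    w = [0.0, 0.0]
    for _ in range(steps):
        g = [0.0, 0.0]  # accumulates the negative gradient
        for z in Z:
            p = math.exp(-(w[0] * z[0] + w[1] * z[1]))
            g[0] += p * z[0]
            g[1] += p * z[1]
        w[0] += eta * g[0]
        w[1] += eta * g[1]
    return w


def gd_normalized(steps):
    """Normalized gradient descent with the assumed schedule eta_t = 1/sqrt(t+1)."""
    w = [0.0, 0.0]
    for t in range(steps):
        margins = [w[0] * z[0] + w[1] * z[1] for z in Z]
        m_min = min(margins)
        g = [0.0, 0.0]
        for z, m in zip(Z, margins):
            # Shifting by m_min rescales the gradient but keeps its direction,
            # and avoids underflow once the margins become large.
            p = math.exp(-(m - m_min))
            g[0] += p * z[0]
            g[1] += p * z[1]
        norm = math.hypot(g[0], g[1])
        eta_t = 1.0 / math.sqrt(t + 1)
        w[0] += eta_t * g[0] / norm
        w[1] += eta_t * g[1] / norm
    return w


steps = 10000
print("fixed step gap:     ", margin_gap(gd_fixed(steps)))
print("normalized step gap:", margin_gap(gd_normalized(steps)))
```

On this toy problem the normalized variant should end much closer to the max-margin direction after the same number of iterations, because its step size compensates for the rapidly shrinking gradient norm.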

Abstract

We provide a detailed study on the implicit bias of gradient descent when optimizing loss functions with strictly monotone tails, such as the logistic loss, over separable datasets. We look at two basic questions: (a) what are the conditions on the tail of the loss function under which gradient descent converges in the direction of the $L_{2}$ maximum-margin separator? (b) how does the rate of margin convergence depend on the tail of the loss function and the choice of the step size? We show that for a large family of super-polynomial tailed losses, gradient descent iterates on linear networks of any depth converge in the direction of the $L_{2}$ maximum-margin solution, while this does not hold for losses with heavier tails. Within this family, for simple linear models we show that the optimal rate with a fixed step size is indeed obtained for the commonly used exponentially tailed losses such…
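
As a rough illustration of question (a), the following sketch runs fixed-step gradient descent on the logistic loss for a simple linear model and tracks how far the iterate's direction is from the $L_{2}$ maximum-margin separator. The two-point dataset `Z`, the step size `eta=0.5`, and the function names are assumptions chosen for this example, not values from the paper.

```python
import math

# Toy separable dataset (an assumption for illustration, not from the paper).
# Each row is z_i = y_i * x_i, so "correctly classified" means w . z_i > 0.
# For these two points the L2 max-margin direction is (1, 0), with margin 1.
Z = [(1.0, 2.0), (1.0, -1.0)]


def normalized_margin(w):
    """Smallest margin of w over the dataset, divided by ||w||."""
    norm = math.hypot(w[0], w[1])
    return min(w[0] * z[0] + w[1] * z[1] for z in Z) / norm


def gd_logistic(steps, eta=0.5):
    """Fixed-step gradient descent on sum_i log(1 + exp(-w . z_i)) from w = 0."""
    w = [0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0]
        for z in Z:
            m = w[0] * z[0] + w[1] * z[1]
            s = 1.0 / (1.0 + math.exp(m))  # equals sigmoid(-m)
            grad[0] -= s * z[0]
            grad[1] -= s * z[1]
        w[0] -= eta * grad[0]
        w[1] -= eta * grad[1]
    return w


# Gap between the best achievable margin (1 for this data) and the current one.
for t in (100, 1000, 10000):
    w = gd_logistic(t)
    print(t, 1.0 - normalized_margin(w))
```

The printed gap should keep shrinking as the iteration count grows by factors of ten, but only slowly, which matches the slow margin convergence expected for a fixed step size.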

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Markov Chains and Monte Carlo Methods