Gradient Descent Happens in a Tiny Subspace

Guy Gur-Ari; Daniel A. Roberts; Ethan Dyer

arXiv:1812.04754·cs.LG·December 13, 2018·110 cites

TL;DR

In a variety of large-scale deep learning settings, the gradient concentrates after a short period of training in a small subspace spanned by the top Hessian eigenvectors, as many as there are classes in the dataset. This subspace is mostly preserved over long periods of training, which suggests that gradient descent happens mostly within it.

Contribution

The paper shows that the gradient lies mostly in a tiny, largely stable subspace spanned by the top Hessian eigenvectors, illustrates the effect in a solvable model of classification, and comments on possible implications for optimization and learning.

Findings

01

The gradient converges to a very small subspace after a short period of training

02

The subspace is spanned by a few top eigenvectors of the Hessian, as many as there are classes in the dataset

03

A simple argument suggests that gradient descent happens mostly within this subspace

Abstract

We show that in a variety of large-scale deep learning scenarios the gradient dynamically converges to a very small subspace after a short period of training. The subspace is spanned by a few top eigenvectors of the Hessian (equal to the number of classes in the dataset), and is mostly preserved over long periods of training. A simple argument then suggests that gradient descent may happen mostly in this subspace. We give an example of this effect in a solvable model of classification, and we comment on possible implications for optimization and learning.
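
The quantity behind these statements, the share of the gradient that lies in the top Hessian eigenspace, can be measured directly on a small model. The sketch below is an illustration only and is not taken from the paper: it assumes plain softmax regression on synthetic Gaussian clusters (so the full Hessian is small enough to diagonalize exactly), and the names `loss_grad_hessian`, `top_subspace_fraction`, and `run_demo` are hypothetical. It tracks the fraction of the squared gradient norm captured by the top k Hessian eigenvectors, with k set to the number of classes, during full-batch gradient descent.

```python
import numpy as np


def softmax(z):
    # Subtract the row-wise max for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def loss_grad_hessian(W, X, Y):
    """Cross-entropy loss, gradient, and exact Hessian of softmax regression.

    W: (k, d) weights, X: (n, d) inputs, Y: (n, k) one-hot labels.
    Parameters are flattened in (class, feature) order, so index = a * d + c.
    """
    n, d = X.shape
    k = W.shape[0]
    P = softmax(X @ W.T)                                  # (n, k) probabilities
    loss = -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))
    grad = ((P - Y).T @ X / n).ravel()                    # (k * d,)
    # Per-example class block: diag(p_i) - p_i p_i^T, shape (n, k, k).
    A = P[:, :, None] * np.eye(k)[None] - P[:, :, None] * P[:, None, :]
    # Hessian entry [(a, c), (b, e)] = mean_i A[i, a, b] * X[i, c] * X[i, e].
    H = np.einsum("iab,ic,ie->acbe", A, X, X).reshape(k * d, k * d) / n
    return loss, grad, H


def top_subspace_fraction(grad, H, k_top):
    """Fraction of ||grad||^2 inside the span of the k_top leading eigenvectors.

    Assumes k_top >= 1 and a nonzero gradient.
    """
    _, eigvecs = np.linalg.eigh(H)          # eigenvalues in ascending order
    top = eigvecs[:, -k_top:]               # eigenvectors of the largest eigenvalues
    proj = top.T @ grad
    return float(proj @ proj / (grad @ grad))


def run_demo(k=3, d=10, n_per_class=50, steps=200, lr=0.01, seed=0):
    """Full-batch gradient descent on synthetic clusters (an assumed toy setup)."""
    rng = np.random.default_rng(seed)
    means = rng.normal(scale=2.0, size=(k, d))
    X = np.concatenate([m + rng.normal(size=(n_per_class, d)) for m in means])
    Y = np.eye(k)[np.repeat(np.arange(k), n_per_class)]
    W = rng.normal(scale=0.01, size=(k, d))
    history = []
    for step in range(steps + 1):
        loss, grad, H = loss_grad_hessian(W, X, Y)
        history.append((step, loss, top_subspace_fraction(grad, H, k)))
        W = W - lr * grad.reshape(k, d)
    return history


if __name__ == "__main__":
    for step, loss, frac in run_demo()[::50]:
        print(f"step {step:4d}  loss {loss:.4f}  top-k gradient fraction {frac:.3f}")
```

A fraction near 1 means the gradient lies almost entirely in the top eigenspace. For the networks studied in the paper the Hessian is far too large to diagonalize this way, so an equivalent measurement would need an iterative eigensolver built on Hessian-vector products; that substitution is not shown here.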

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Gaussian Processes and Bayesian Inference · Domain Adaptation and Few-Shot Learning