Dual Space Preconditioning for Gradient Descent in the Overparameterized Regime
Reza Ghane, Danil Akhtiamov, Babak Hassibi

TL;DR
This paper analyzes the convergence and implicit bias of Dual Space Preconditioned Gradient Descent, including optimizers like Adam, in over-parameterized linear models, introducing novel divergence techniques and characterizing convergence points.
Contribution
It proves convergence of dual space preconditioned gradient descent in over-parameterized linear models and characterizes the implicit bias for isotropic preconditioners.
Findings
The iterates are guaranteed to converge to a solution satisfying XW = Y.
For isotropic preconditioners, the limit point minimizes the Frobenius-norm distance from the initialization among all solutions of XW = Y.
For general preconditioners, the Frobenius-norm distance from the initialization is within a constant factor of this minimum.
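The findings above can be checked numerically on a small random problem. The following is a minimal sketch, not the authors' code: it assumes a squared loss, uses gradient clipping by Frobenius norm as an example of an isotropic preconditioner and entrywise clipping as an example of a non-isotropic one, and all function names, problem sizes, and the step size are hypothetical choices made for illustration.

```python
import numpy as np

def preconditioned_gd(X, Y, W0, grad_K, eta, steps):
    """Iterate W <- W - eta * grad_K(grad L(W)) for L(W) = 0.5 * ||XW - Y||_F^2."""
    W = W0.copy()
    for _ in range(steps):
        G = X.T @ (X @ W - Y)      # gradient of the (assumed) squared loss
        W = W - eta * grad_K(G)    # preconditioned step
    return W

def clip_frobenius(G, c=1.0):
    """Isotropic example: rescale G so that its Frobenius norm is at most c."""
    norm = np.linalg.norm(G)
    return G if norm <= c else G * (c / norm)

def clip_entrywise(G, c=1.0):
    """Non-isotropic example: clip every entry of G to [-c, c]."""
    return np.clip(G, -c, c)

rng = np.random.default_rng(0)
n, d, k = 5, 20, 3                      # n < d: over-parameterized
X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, k))
W0 = rng.standard_normal((d, k))
eta = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / (largest eigenvalue of X^T X)

# Solution of XW = Y closest to W0 in Frobenius norm (via the pseudoinverse).
W_star = W0 + np.linalg.pinv(X) @ (Y - X @ W0)
min_dist = np.linalg.norm(W_star - W0)

W_iso = preconditioned_gd(X, Y, W0, clip_frobenius, eta, steps=20000)
W_gen = preconditioned_gd(X, Y, W0, clip_entrywise, eta, steps=20000)

res_iso = np.linalg.norm(X @ W_iso - Y)              # interpolation residual
res_gen = np.linalg.norm(X @ W_gen - Y)
gap_iso = np.linalg.norm(W_iso - W_star)             # distance to the min-norm solution
ratio_gen = np.linalg.norm(W_gen - W0) / min_dist    # at least 1 for any interpolator

print("residuals:", res_iso, res_gen)
print("isotropic gap to minimum-distance solution:", gap_iso)
print("entrywise distance ratio:", ratio_gen)
```

In this sketch the Frobenius-clipped steps are always scalar multiples of the gradient, so the iterates never leave the affine set W0 + rowspace(X); that is why the isotropic run lands on the minimum-distance interpolating solution. The entrywise run still interpolates, but its distance ratio only illustrates the third finding on one instance and says nothing about the constant proved in the paper.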
Abstract
In this work we study the convergence properties of Dual Space Preconditioned Gradient Descent, which encompasses optimizers such as Normalized Gradient Descent, Gradient Clipping, and Adam. We consider preconditioners of the form ∇K, where K is convex, and assume that the method is applied to train an over-parameterized linear model with loss of the form ℓ(XW − Y), for weights W, labels Y, and data X. Under these assumptions, we prove that the iterates of the preconditioned gradient descent always converge to a point W satisfying XW = Y. Our proof techniques are of independent interest, as we introduce a novel version of the Bregman Divergence with accompanying identities that…
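For orientation, the update usually meant by dual space preconditioning can be written out as below. This is a sketch under assumed notation rather than a quotation from the paper: the step size \eta, the iterate index t, the symbols K, \ell and L, and the Huber-type function h_c with threshold c are all introduced here for illustration.

```latex
% Training loss of the linear model and the preconditioned update
\begin{aligned}
L(W) &= \ell(XW - Y), \\
W_{t+1} &= W_t - \eta \, \nabla K\bigl(\nabla L(W_t)\bigr).
\end{aligned}

% Example choices of the convex function K
\begin{aligned}
K(G) &= \tfrac{1}{2}\|G\|_F^2
  &&\Rightarrow\quad \nabla K(G) = G
  &&\text{(plain gradient descent)} \\
K(G) &= \|G\|_F
  &&\Rightarrow\quad \nabla K(G) = \frac{G}{\|G\|_F}, \quad G \neq 0
  &&\text{(normalized gradient descent)} \\
K(G) &= h_c(\|G\|_F)
  &&\Rightarrow\quad \nabla K(G) = \min\!\Bigl(1, \frac{c}{\|G\|_F}\Bigr)\, G
  &&\text{(gradient clipping)}
\end{aligned}

% Huber-type profile used in the clipping example
h_c(r) =
\begin{cases}
  \tfrac{1}{2} r^2, & r \le c, \\
  c\,r - \tfrac{1}{2} c^2, & r > c.
\end{cases}
```

All three examples depend on G only through its Frobenius norm, which is the sense of "isotropic" assumed in this sketch.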
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Markov Chains and Monte Carlo Methods · Sparse and Compressive Sensing Techniques
