On The Concurrence of Layer-wise Preconditioning Methods and Provable Feature Learning

Thomas T. Zhang; Behrad Moniri; Ansh Nagwekar; Faraz Rahman; Anton Xue; Hamed Hassani; Nikolai Matni

arXiv:2502.01763 · cs.LG · February 5, 2025

TL;DR

This paper argues that layer-wise preconditioning methods are not only effective in practice but also provably necessary, from a statistical perspective, for feature learning in neural networks once inputs depart from the ideal isotropic case. In that regime SGD is a suboptimal feature learner, and standard optimizers like Adam only mildly mitigate the problem.

Contribution

The paper provides a theoretical and empirical analysis showing that layer-wise preconditioning is necessary for effective feature learning when inputs are not ideally isotropic.

Findings

1. Layer-wise preconditioning improves feature learning in neural networks.

2. SGD is a suboptimal feature learner when inputs are not isotropic.

3. Standard optimizers like Adam only mildly mitigate the limitations of SGD.

Abstract

Layer-wise preconditioning methods are a family of memory-efficient optimization algorithms that introduce preconditioners per axis of each layer's weight tensors. These methods have seen a recent resurgence, demonstrating impressive performance relative to entry-wise ("diagonal") preconditioning methods such as Adam(W) on a wide range of neural network optimization tasks. Complementary to their practical performance, we demonstrate that layer-wise preconditioning methods are provably necessary from a statistical perspective. To showcase this, we consider two prototypical models, linear representation learning and single-index learning, which are widely used to study how typical algorithms efficiently learn useful features to enable generalization. In these problems, we show SGD is a suboptimal feature learner when extending beyond ideal isotropic inputs $\mathbf{x} \sim…
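
To make "preconditioners per axis of each layer's weight tensors" concrete, here is a minimal NumPy sketch that contrasts one layer-wise preconditioned step with an entry-wise ("diagonal") step for a single weight matrix. It is an illustration under stated assumptions, not the paper's algorithm: the function names, the matrices `P_out` and `P_in`, the statistic `V`, and the `damping` term are all hypothetical, and how a given method builds its preconditioners is not specified in the abstract above.

```python
import numpy as np

def layerwise_preconditioned_step(W, G, P_out, P_in, lr=0.1, damping=1e-3):
    """One update W <- W - lr * P_out^{-1} G P_in^{-1} for a (d_out, d_in) weight matrix.

    P_out (d_out x d_out) and P_in (d_in x d_in) are hypothetical per-axis
    preconditioners; real methods differ in how they estimate them.
    """
    d_out, d_in = W.shape
    # Damping (an assumption of this sketch) keeps both systems well conditioned.
    left = P_out + damping * np.eye(d_out)
    right = P_in + damping * np.eye(d_in)
    # Solve linear systems instead of forming explicit inverses.
    step = np.linalg.solve(left, G)            # P_out^{-1} G
    step = np.linalg.solve(right.T, step.T).T  # (P_out^{-1} G) P_in^{-1}
    return W - lr * step

def diagonal_preconditioned_step(W, G, V, lr=0.1, eps=1e-8):
    """Entry-wise update in the style of Adam: one scaling factor per weight entry.

    V is a hypothetical running estimate of squared gradients, same shape as W.
    """
    return W - lr * G / (np.sqrt(V) + eps)
```

In this sketch the layer-wise update stores `d_out**2 + d_in**2` preconditioner entries per layer, whereas a full preconditioner over all entries of `W` would need `(d_out * d_in)**2`. The entry-wise update stores `d_out * d_in` entries but cannot mix information across coordinates of the input or output axis.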

Taxonomy

Topics: Neural Networks and Applications · Machine Learning and ELM · Speech Recognition and Synthesis

Methods: Adam · Stochastic Gradient Descent