On The Concurrence of Layer-wise Preconditioning Methods and Provable Feature Learning
Thomas T. Zhang, Behrad Moniri, Ansh Nagwekar, Faraz Rahman, Anton Xue, Hamed Hassani, Nikolai Matni

TL;DR
This paper demonstrates that layer-wise preconditioning methods are both practically effective and theoretically necessary for feature learning in neural networks, especially beyond idealized input conditions, outperforming standard optimizers like Adam.
Contribution
It provides a theoretical and empirical analysis showing the fundamental necessity of layer-wise preconditioning for effective feature learning beyond ideal assumptions.
Findings
Layer-wise preconditioning improves feature learning in neural networks.
SGD is suboptimal for feature learning in non-ideal input settings.
Standard optimizers like Adam only mildly mitigate the limitations of SGD.
Abstract
Layer-wise preconditioning methods are a family of memory-efficient optimization algorithms that introduce preconditioners per axis of each layer's weight tensors. These methods have seen a recent resurgence, demonstrating impressive performance relative to entry-wise ("diagonal") preconditioning methods such as Adam(W) on a wide range of neural network optimization tasks. Complementary to their practical performance, we demonstrate that layer-wise preconditioning methods are provably necessary from a statistical perspective. To showcase this, we consider two prototypical models, linear representation learning and single-index learning, which are widely used to study how typical algorithms efficiently learn useful features to enable generalization. In these problems, we show SGD is a suboptimal feature learner when extending beyond ideal isotropic inputs $\mathbf{x} \sim…
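To make "preconditioners per axis of each layer's weight tensors" concrete, here is a minimal NumPy sketch of one such update on a single linear layer. It is an illustration under stated assumptions, not the paper's algorithm: the function name `layerwise_step`, the toy least-squares problem, the anisotropic input scales, and the choice of the empirical input covariance as the right preconditioner are all hypothetical choices made for this example.

```python
import numpy as np

def layerwise_step(W, G, L, R, lr=1.0):
    """One update W <- W - lr * L^{-1} G R^{-1}.

    L preconditions the output axis of W and R the input axis
    (hypothetical helper, written for this illustration only).
    """
    left = np.linalg.solve(L, G)            # L^{-1} G
    both = np.linalg.solve(R.T, left.T).T   # (L^{-1} G) R^{-1}
    return W - lr * both

rng = np.random.default_rng(0)
d_in, d_out, n = 5, 3, 200

# Anisotropic inputs: each coordinate has a different scale (assumed values).
scales = np.array([10.0, 3.0, 1.0, 0.3, 0.1])
X = scales[:, None] * rng.standard_normal((d_in, n))
W_true = rng.standard_normal((d_out, d_in))
Y = W_true @ X                              # noiseless linear targets

W0 = np.zeros((d_out, d_in))
G = (W0 @ X - Y) @ X.T / n                  # gradient of (1/2n) * ||W X - Y||_F^2

R = X @ X.T / n                             # input-axis preconditioner: empirical covariance
L = np.eye(d_out)                           # output-axis preconditioner: identity here

W_pre = layerwise_step(W0, G, L, R, lr=1.0)
W_sgd = W0 - 1e-3 * G                       # plain gradient step; small lr chosen for stability

err_pre = np.linalg.norm(W_pre - W_true)
err_sgd = np.linalg.norm(W_sgd - W_true)
print(f"preconditioned error: {err_pre:.2e}, plain gradient error: {err_sgd:.2e}")
```

In this noiseless least-squares toy, the preconditioned step with unit step size lands on the least-squares solution, whereas a single plain gradient step makes little progress along the low-variance input directions. The two preconditioners need only `d_out × d_out` and `d_in × d_in` entries, rather than one entry per pair of weights.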
Taxonomy
Topics: Neural Networks and Applications · Machine Learning and ELM · Speech Recognition and Synthesis
Methods: Adam · Stochastic Gradient Descent
