The lazy (NTK) and rich ($\mu$P) regimes: a gentle tutorial
Dhruva Karkada

TL;DR
This tutorial explains how the hyperparameter choices in wide neural networks determine whether they behave like kernel machines or exhibit feature learning, synthesizing recent research and providing empirical support.
Contribution
It offers a nonrigorous derivation and unified perspective on the richness scale in wide neural networks, connecting lazy and active training regimes.
Findings
Wide networks can train lazily like kernel machines or actively learn features.
The richness of training behavior is controlled by a single degree of freedom in the hyperparameters, such as the learning rate and the size of the initial weights.
Empirical evidence supports the theoretical claims.
Abstract
A central theme of the modern machine learning paradigm is that larger neural networks achieve better performance on a variety of metrics. Theoretical analyses of these overparameterized models have recently centered around studying very wide neural networks. In this tutorial, we provide a nonrigorous but illustrative derivation of the following fact: in order to train wide networks effectively, there is only one degree of freedom in choosing hyperparameters such as the learning rate and the size of the initial weights. This degree of freedom controls the richness of training behavior: at minimum, the wide network trains lazily like a kernel machine, and at maximum, it exhibits feature learning in the active $\mu$P regime. In this paper, we explain this richness scale, synthesize recent research results into a coherent whole, offer new perspectives and intuitions, and provide empirical…
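The lazy end of this scale can be illustrated with a small numerical sketch. The construction below is an assumption made for illustration and is not the tutorial's own parameterization: a two-layer tanh network whose centered output is multiplied by a hypothetical factor `alpha`, with the learning rate scaled as `base_lr / alpha**2`. Under this standard output-rescaling trick, a larger `alpha` should fit the data with smaller relative movement of the hidden-layer weights, which is the kernel-like behavior described above. All names (`train_two_layer`, `alpha`, `base_lr`) and the synthetic data are hypothetical.

```python
import numpy as np


def train_two_layer(alpha, width=512, steps=200, base_lr=0.1, seed=0):
    """Full-batch gradient descent on a two-layer tanh network.

    The trained function is alpha * (f(x; theta) - f(x; theta_0)), so the
    output is exactly zero at initialization. `alpha` is an illustrative
    laziness knob (an assumption of this sketch), not the paper's notation.
    """
    rng = np.random.default_rng(seed)
    n, d = 16, 4
    X = rng.standard_normal((n, d))       # synthetic inputs
    y = np.sin(X[:, 0])                   # synthetic regression targets

    W0 = rng.standard_normal((d, width))  # hidden-layer weights at init
    a0 = rng.standard_normal(width)       # readout weights at init
    W, a = W0.copy(), a0.copy()

    def hidden(W_):
        return np.tanh(X @ W_ / np.sqrt(d))

    def raw(W_, a_):
        return hidden(W_) @ a_ / np.sqrt(width)

    f0 = raw(W0, a0)                      # subtracted to center the output
    init_loss = 0.5 * np.mean(y ** 2)     # loss when the output is zero
    lr = base_lr / alpha ** 2             # keeps function-space speed comparable

    for _ in range(steps):
        H = hidden(W)
        out = alpha * (H @ a / np.sqrt(width) - f0)
        err = out - y
        g_raw = alpha * err / n           # dLoss/d(raw output), loss = 0.5*mean(err^2)
        grad_a = H.T @ g_raw / np.sqrt(width)
        d_hidden = np.outer(g_raw, a) / np.sqrt(width)
        d_pre = d_hidden * (1.0 - H ** 2)  # tanh'(z) = 1 - tanh(z)^2
        grad_W = X.T @ d_pre / np.sqrt(d)
        a -= lr * grad_a
        W -= lr * grad_W

    final_loss = 0.5 * np.mean((alpha * (raw(W, a) - f0) - y) ** 2)
    rel_move = np.linalg.norm(W - W0) / np.linalg.norm(W0)
    return {"rel_move": rel_move, "init_loss": init_loss, "final_loss": final_loss}


if __name__ == "__main__":
    for alpha in (1.0, 100.0):
        result = train_two_layer(alpha)
        print(f"alpha={alpha:6.1f}  relative hidden-weight movement="
              f"{result['rel_move']:.2e}  final loss={result['final_loss']:.3e}")
```

Comparing the printed relative weight movement across the two values of `alpha` gives a rough sense of how a single scaling choice moves training toward the lazy regime. The sketch does not attempt to reproduce the $\mu$P end of the scale.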
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Gaussian Processes and Bayesian Inference · Machine Learning and Data Classification
