Sharp feature-learning transitions and Bayes-optimal neural scaling laws in extensive-width networks
Minh-Toan Nguyen, Jean Barbier

TL;DR
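This paper analyzes the information-theoretic limits of learning hierarchical features in extensive-width neural networks (width proportional to the input dimension), revealing sharp phase transitions in feature recoverability and deriving sharp scaling laws for the Bayes-optimal generalization error.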
Contribution
It derives a closed system of fixed-point equations, obtained through a heuristic leave-one-out decoupling argument that is validated numerically, to characterize feature learnability and the scaling of the generalization error in extensive-width networks.
Findings
Teacher features become learnable one after another as the amount of data grows, each through a sharp phase transition with a discontinuous jump in overlap.
An effective width $k_c$ unifies the different scaling regimes of the generalization error.
Empirical results show that Adam-trained models near $k_c$ achieve the optimal scaling laws.
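To make the idea of a discontinuous jump in overlap concrete, here is a minimal toy sketch. It does not use the paper's fixed-point equations: the scalar map `q = tanh(alpha * q**2)`, the name `largest_fixed_point`, and the reading of `alpha` as a data-to-dimension ratio are all assumptions made purely for illustration. The sketch iterates the map from a fully informed start (`q0 = 1`) and reports the fixed point it reaches.

```python
import math


def largest_fixed_point(alpha, q0=1.0, tol=1e-12, max_iter=10_000):
    """Iterate the toy map q <- tanh(alpha * q**2) from an informed start.

    This is a hypothetical stand-in for an overlap equation, not the
    system derived in the paper. Starting from q0 = 1, the iterates
    decrease monotonically to the largest fixed point in [0, 1].
    """
    q = q0
    for _ in range(max_iter):
        q_new = math.tanh(alpha * q * q)
        if abs(q_new - q) < tol:
            return q_new
        q = q_new
    return q  # best estimate if the tolerance was not reached


if __name__ == "__main__":
    # Sweep the toy sample ratio and print the overlap reached at each value.
    for alpha in [1.0, 1.2, 1.4, 1.6, 1.8, 2.0]:
        print(f"alpha={alpha:.1f}  q*={largest_fixed_point(alpha):.4f}")
```

In this toy model the only fixed point at small `alpha` is `q = 0`, while at larger `alpha` a second stable fixed point appears at a value well away from zero, so the reported overlap jumps rather than growing continuously. The threshold lies somewhere between `alpha = 1` and `alpha = 2`; a finer sweep would locate it more precisely.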
Abstract
We study the information-theoretic limits of learning a one-hidden-layer teacher network with hierarchical features from noisy queries, in the context of knowledge transfer to a smaller student model. We work in the high-dimensional regime where the teacher width scales linearly with the input dimension -- a setting that captures large-but-finite-width networks and has only recently become analytically tractable. Using a heuristic leave-one-out decoupling argument, validated numerically throughout, we derive asymptotically sharp characterizations of the Bayes-optimal generalization error and individual feature overlaps via a system of closed fixed-point equations. These equations reveal that feature learnability is governed by a sequence of sharp phase transitions: as data grows, teacher features become recoverable sequentially, each through a discontinuous jump in overlap. This…
