Mitigating the Curse of Detail: Scaling Arguments for Feature Learning and Sample Complexity
Noa Rubin, Orit Davidovich, Zohar Ringel

TL;DR
This paper introduces a simplified scale analysis heuristic to predict feature learning patterns and sample complexity in deep neural networks, extending theoretical understanding to complex architectures.
Contribution
It proposes a new heuristic method for predicting feature learning regimes and scales, simplifying the analysis of deep learning models compared to existing complex theories.
Findings
Reproduces known scaling exponents of feature learning.
Predicts feature learning behavior in complex architectures.
Extends theoretical predictions to non-linear and attention-based networks.
Abstract
Two pressing topics in the theory of deep learning are the interpretation of feature learning (FL) mechanisms and the determination of implicit bias of networks in the rich regime. Current theories of rich FL often appear in the form of high-dimensional non-linear equations, which require computationally intensive numerical solutions. Given the many details that go into defining a deep learning problem, this analytical complexity is a significant and often unavoidable challenge. Here, we propose a powerful heuristic route for predicting the data and width scales at which various patterns of FL emerge. This form of scale analysis is considerably simpler than such exact theories and reproduces the scaling exponents of various known results. In addition, we make novel predictions on complex toy architectures, such as three-layer non-linear networks and attention heads, thus extending the…
Peer Reviews
Decision: ICLR 2026 Poster
The paper addresses a well-defined problem: how to estimate the scaling of sample complexity in deep networks, which is usually analytically intractable. To address this problem, the authors use a lower-bound variational energy and a set of heuristics. The paper supports its claims with both theoretical results and simulations, and shows that these heuristics can estimate the lower-bound variational energy and derive the sample complexity for fully-connected networks and CNNs with non
As the paper employs heuristics to estimate the lower variational bound for a specific set of feature learning patterns (Gaussian Process, Gaussian Feature Learning, Feature Specialization) under a specific task setup (polynomial of degree m), it is difficult to evaluate how these pattern-specific heuristics would apply to different tasks (classification, regression with more general basis functions). I would appreciate if the authors could discuss the expected
* Strong motivation and relevance: The paper tackles an important and timely question, namely how to quantify representation quality via alignment, and aims to connect it to generalization performance.
* Potential for theoretical contribution: If clarified, the proposed framework could bridge geometric and probabilistic views of feature learning, an area of broad interest to both machine learning and neuroscience communities.
* Clear high-level narrative: The intuition that alignment captures “how well l
* Unclear and inconsistent mathematical exposition: The logic behind the central inequality (around line 98) is confusing. It provides only a lower bound on test MSE in terms of alignment, so high alignment is at best a necessary, but not sufficient, condition for good generalization. The paper does not establish or even discuss whether an upper bound exists, which weakens the claim that alignment close to 1 can reliably indicate low test error.
* Ambiguity in probabilistic statements: In Section
There appears to be quality work done on the mathematics behind this paper’s analysis, as well as an original approach to mech interp. It’s clear not enough similar work is done in that community, so the authors should be applauded for their instinct to pursue this direction. Another major strength is connecting the results obtained back to previous findings, this time under a new light. The main math-heavy sections are well presented. This work appears to be quite significant to those in the me
This paper would greatly benefit from plots showing how the theory matches results, not just empirical results, which are difficult to connect back to this work specifically rather than to broad notions of the field. Another weakness is the downplaying of grokking: since this theory is entirely about near-convergence networks, it seems like there should be a much greater focus on things like weight decay or grokking; outside of the intro, I don’t see those mentioned again. Finally, some
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Stochastic Gradient Optimization Techniques · Face Recognition and Perception
