Neural Networks Learn Generic Multi-Index Models Near Information-Theoretic Limit
Bohan Zhang, Zihao Wang, Hengyu Fu, Jason D. Lee

TL;DR
This paper proves that two-layer neural networks trained by gradient descent can learn high-dimensional Gaussian multi-index models with sample and time complexity near the information-theoretic limit, and that the early stage of training implicitly performs a spectral power iteration.
Contribution
It establishes that neural networks can achieve near-optimal sample and time complexity for learning multi-index models, with a detailed spectral analysis of the training process.
Findings
Neural networks learn multi-index models with optimal sample complexity.
Gradient descent implicitly performs spectral power iteration.
Training for more than a constant number of steps is necessary for optimal learning.
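To make the power-iteration finding concrete, here is a minimal sketch of block power iteration (orthogonal iteration) on a symmetric matrix. It is not the paper's construction: the spiked matrix `M`, the eigenvalues, the noise scale, and all function names are assumptions chosen for illustration. It only shows that repeated multiplication by a matrix whose top eigenspace is close to a hidden subspace recovers that subspace, and that more iterations help when starting from a random initialization.

```python
import numpy as np


def subspace_power_iteration(M, r, num_steps, seed=0):
    """Block power iteration: approximate the top-r eigenspace of symmetric M."""
    rng = np.random.default_rng(seed)
    d = M.shape[0]
    # Random orthonormal start, nearly orthogonal to any fixed r-dim subspace.
    Q, _ = np.linalg.qr(rng.standard_normal((d, r)))
    for _ in range(num_steps):
        # Multiply by M, then re-orthonormalize the columns.
        Q, _ = np.linalg.qr(M @ Q)
    return Q


def subspace_error(Q, U):
    """Frobenius distance between the projectors onto span(Q) and span(U)."""
    return np.linalg.norm(Q @ Q.T - U @ U.T)


def demo(d=200, r=2, noise_scale=0.1, seed=0):
    """Hypothetical spiked matrix: low-rank signal on span(U) plus symmetric noise."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((d, r)))  # d x r, orthonormal columns
    G = rng.standard_normal((d, d))
    noise = noise_scale * (G + G.T) / (2.0 * np.sqrt(d))
    M = U @ np.diag(np.linspace(2.0, 1.5, r)) @ U.T + noise
    # Subspace error after 1, 5, and 50 iterations from the same random start.
    return {
        t: subspace_error(subspace_power_iteration(M, r, t, seed=seed + 1), U)
        for t in (1, 5, 50)
    }


if __name__ == "__main__":
    for steps, err in demo().items():
        print(f"steps={steps:3d}  subspace error={err:.4f}")
```

In this toy setting the random start overlaps the hidden subspace only at scale roughly `sqrt(r / d)`, so a single step leaves a large error and further steps shrink it. That is one way to read the finding that a constant number of steps does not suffice.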
Abstract
In deep learning, a central issue is to understand how neural networks efficiently learn high-dimensional features. To this end, we explore the gradient descent learning of a general Gaussian Multi-index model with hidden subspace, which is the canonical setup to study representation learning. We prove that under generic non-degenerate assumptions on the link function, a standard two-layer neural network trained via layer-wise gradient descent can agnostically learn the target with test error using samples and time. The sample and time complexity both align with the information-theoretic limit up to leading order and are therefore optimal. During the first stage of gradient descent learning, the proof proceeds by showing…
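As an illustration of the setup the abstract describes, the sketch below samples from a Gaussian multi-index model and evaluates a two-layer ReLU network. Everything specific here is an assumption for illustration only: the names `sample_multi_index`, `example_link`, and `two_layer_forward`, the particular link function, and the ReLU activation are not taken from the paper. The key property is that the label depends on the Gaussian input only through its projection onto a hidden low-dimensional subspace.

```python
import numpy as np


def sample_multi_index(n, d, r, link, seed=0):
    """Draw n Gaussian inputs in R^d with labels link(X U^T), where U is r x d."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((d, r)))
    U = Q.T                            # rows form an orthonormal basis of the hidden subspace
    X = rng.standard_normal((n, d))    # isotropic Gaussian inputs
    y = link(X @ U.T)                  # labels see only the r projected coordinates
    return X, y, U


def example_link(Z):
    """Hypothetical link function on r = 2 projected coordinates (illustrative only)."""
    return Z[:, 0] * Z[:, 1] + Z[:, 0] ** 2 - 1.0


def two_layer_forward(X, W, b, a):
    """Two-layer network: ReLU(X W^T + b) @ a, with W of shape (m, d)."""
    return np.maximum(X @ W.T + b, 0.0) @ a


if __name__ == "__main__":
    X, y, U = sample_multi_index(n=1000, d=50, r=2, link=example_link)
    rng = np.random.default_rng(1)
    m = 64                             # hidden width, chosen arbitrarily
    W = rng.standard_normal((m, X.shape[1])) / np.sqrt(X.shape[1])
    preds = two_layer_forward(X, W, np.zeros(m), rng.standard_normal(m) / m)
    print("untrained test MSE:", np.mean((preds - y) ** 2))
```

Layer-wise training, as named in the abstract, would update the first-layer weights `W` in one stage and fit the second-layer weights `a` in another. The sketch stops at the forward pass and does not attempt to reproduce the paper's training procedure.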
Peer Reviews
Decision·ICLR 2026 Poster
It was already known from Chen & Meka (2020) and related works that low-dimensional polynomial functions (i.e., generative-exponent-2 multi-index models) can be learned efficiently, near the information-theoretic limit. The present paper’s contribution is to re-establish this result within a neural-network training framework, showing that a standard two-layer network trained by gradient descent can implicitly reproduce the behavior of these earlier polynomial-learning algorithms. Replacing an ex
I think the main result is genuinely interesting—essentially showing that, in a specific regime, the network dynamics are equivalent to a power iteration on a certain matrix. But the framing is off: the paper presents this as a broad statement about neural networks learning generic multi-index models, whereas in reality it addresses a much narrower, well-understood setting. A more honest and focused framing would make the contribution stronger and easier to appreciate. As of now, the paper curre
The finding that a super-constant number of steps can improve the sample complexity of feature learning under the multi-index model is interesting, and the proof sketch seems intuitive and can be used beyond this work.
1. In comparison with the recent literature, Assumption 5 is a bit restrictive. It would be interesting if one could show that gradient descent automatically reduces the leap complexity to at most 2, similar to Lee et al., 2024 for the single-index case. I agree, however, that achieving such a result would be technically challenging, and it is perhaps an interesting direction for future research in this area.
2. The Lipschitz assumption on the loss (Assumption 3) rules out squared loss, and I'
The paper presents a novel theoretical analysis of the ability of shallow neural networks to learn multi-index models. The assumptions on the loss function and activation are fairly general and their necessity is discussed transparently. The connection between the early-stage training dynamics and power iteration provides insight into the trade-off between training time and full subspace recovery. The bound in Theorem 1 improves upon previous results, showing that, in principle, two-layer neural
1. The analysis strongly depends on specific initialization assumptions and on the particular layer-wise training scheme. The manuscript lacks a discussion of their necessity or the validity of the claims beyond these settings.
2. Several references are missing or misplaced:
- Line 95: [1] and [2] are two other relevant references on learning thresholds of single-index models.
- Line 116: spectral methods for multi-index models (with generative exponent 2) achieving the optimal learning thres
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Machine Learning and ELM · Advanced Graph Neural Networks
