Neural Networks Learn Generic Multi-Index Models Near Information-Theoretic Limit

Bohan Zhang; Zihao Wang; Hengyu Fu; Jason D. Lee

arXiv:2511.15120·stat.ML·February 6, 2026

Open Access 3 Reviews

TL;DR

This paper proves that a two-layer neural network trained with layer-wise gradient descent learns generic Gaussian multi-index models using $O(d)$ samples and $O(d^{2})$ time, matching the information-theoretic limit up to leading order. The early stage of training is shown to behave like a spectral power iteration.

Contribution

It establishes that a standard two-layer network achieves sample and time complexity that are optimal up to leading order for learning multi-index models, supported by a spectral analysis of the training dynamics.

Findings

01

Neural networks learn multi-index models with optimal sample complexity.

02

Gradient descent implicitly performs spectral power iteration.

03

Training for more than a constant number of steps is necessary for optimal learning.
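As a rough illustration of the power-iteration finding, the sketch below runs a block (subspace) power iteration on a label-weighted second-moment matrix and checks how well the result aligns with the hidden subspace. It is a minimal sketch under assumptions: the matrix `M`, the quadratic link, the sizes `d`, `r`, `n`, and the helper `subspace_power_iteration` are illustrative choices that do not come from the paper, and the code is an explicit spectral method, not the network's gradient descent dynamics.

```python
import numpy as np

def subspace_power_iteration(M, r, num_iters, rng):
    """Block power iteration: repeatedly multiply by M and re-orthonormalize."""
    d = M.shape[0]
    Q, _ = np.linalg.qr(rng.standard_normal((d, r)))  # random orthonormal start
    for _ in range(num_iters):
        Q, _ = np.linalg.qr(M @ Q)
    return Q  # shape (d, r), orthonormal columns

rng = np.random.default_rng(0)
d, r, n = 20, 2, 20000  # assumed sizes, chosen only so the demo runs quickly

# Hypothetical hidden subspace U with orthonormal rows, shape (r, d).
U = np.linalg.qr(rng.standard_normal((d, r)))[0].T

# Gaussian inputs and an illustrative link with second-order information.
X = rng.standard_normal((n, d))
Z = X @ U.T
y = Z[:, 0] ** 2 + 2.0 * Z[:, 1] ** 2

# Empirical version of E[y (x x^T - I)]; for this link its population value
# is supported on the row space of U.
M = (X.T * y) @ X / n - y.mean() * np.eye(d)

Q = subspace_power_iteration(M, r, num_iters=20, rng=rng)

# Singular values of U @ Q are the cosines of the principal angles
# between the true and the estimated subspace.
overlap = np.linalg.svd(U @ Q, compute_uv=False)
print(overlap)
```

Singular values of `U @ Q` close to 1 indicate that the recovered basis spans nearly the same subspace as `U`.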

Abstract

In deep learning, a central issue is to understand how neural networks efficiently learn high-dimensional features. To this end, we explore the gradient descent learning of a general Gaussian multi-index model $f(x) = g(Ux)$ with hidden subspace $U \in \mathbb{R}^{r \times d}$, which is the canonical setup to study representation learning. We prove that under generic non-degenerate assumptions on the link function, a standard two-layer neural network trained via layer-wise gradient descent can agnostically learn the target with $o_{d}(1)$ test error using $O(d)$ samples and $O(d^{2})$ time. The sample and time complexity both align with the information-theoretic limit up to leading order and are therefore optimal. During the first stage of gradient descent learning, the proof proceeds via showing…
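To make the setup concrete, the following sketch draws data from a Gaussian multi-index model $f(x) = g(Ux)$ and checks that the label depends on the input only through the $r$-dimensional projection $Ux$. The link `example_link`, the dimensions, and the helper `sample_multi_index` are hypothetical choices for illustration; the paper's assumptions on $g$ are not reproduced here.

```python
import numpy as np

def sample_multi_index(n, d, r, link, rng):
    """Draw n pairs (x, y) with x ~ N(0, I_d) and y = link(U x)."""
    # Hypothetical hidden subspace: r orthonormal rows from a QR factorization.
    Q, _ = np.linalg.qr(rng.standard_normal((d, r)))
    U = Q.T                           # shape (r, d), with U @ U.T = I_r
    X = rng.standard_normal((n, d))   # Gaussian inputs
    y = link(X @ U.T)                 # labels depend on x only through U x
    return X, y, U

def example_link(Z):
    # Illustrative link g: R^2 -> R (an assumption, not the paper's choice).
    return Z[:, 0] * Z[:, 1] + np.tanh(Z[:, 0])

rng = np.random.default_rng(0)
X, y, U = sample_multi_index(n=1000, d=50, r=2, link=example_link, rng=rng)

# Perturb the inputs only in directions orthogonal to the rows of U.
noise = rng.standard_normal(X.shape)
noise_perp = noise - (noise @ U.T) @ U
y_perturbed = example_link((X + noise_perp) @ U.T)

# The labels should be unchanged up to floating-point error.
print(np.allclose(y, y_perturbed))
```

The learning problem is hard because `U` is unknown: a learner sees only `X` and `y`, with $d$ much larger than $r$.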

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

It was already known from Chen & Meka (2020) and related works that low-dimensional polynomial functions (i.e., generative-exponent-2 multi-index models) can be learned efficiently, near the information-theoretic limit. The present paper’s contribution is to re-establish this result within a neural-network training framework, showing that a standard two-layer network trained by gradient descent can implicitly reproduce the behavior of these earlier polynomial-learning algorithms. Replacing an ex

Weaknesses

I think the main result is genuinely interesting—essentially showing that, in a specific regime, the network dynamics are equivalent to a power iteration on a certain matrix. But the framing is off: the paper presents this as a broad statement about neural networks learning generic multi-index models, whereas in reality it addresses a much narrower, well-understood setting. A more honest and focused framing would make the contribution stronger and easier to appreciate. As of now, the paper curre

Reviewer 02 · Rating 8 · Confidence 4

Strengths

The finding that a super-constant number of steps can improve the sample complexity of feature learning under the multi-index model is interesting, and the proof sketch seems intuitive and can be used beyond this work.

Weaknesses

1. In comparison with the recent literature, Assumption 5 is a bit restrictive. It would be interesting if one could show that gradient descent would automatically reduce leap complexity down to at most 2, similar to Lee et al., 2024 for the single-index case. I agree, however, that it would be technically challenging to achieve such a result, and it's perhaps an interesting direction for future research in this area.
2. The Lipschitz assumption on loss (Assumption 3) rules out squared loss, and I'

Reviewer 03 · Rating 4 · Confidence 4

Strengths

The paper presents a novel theoretical analysis of the ability of shallow neural networks to learn multi-index models. The assumptions on the loss function and activation are fairly general and their necessity is discussed transparently. The connection between the early-stage training dynamics and power iteration provides insight into the trade-off between training time and full subspace recovery. The bound in Theorem 1 improves upon previous results, showing that, in principle, two-layer neural

Weaknesses

1. The analysis strongly depends on specific initialization assumptions and on the particular layer-wise training scheme. The manuscript lacks a discussion of their necessity or the validity of the claims beyond these settings.
2. Several references are missing or misplaced:
- Line 95: [1] and [2] are two other relevant references on learning thresholds of single-index models.
- Line 116: spectral methods for multi-index models (with generative exponent 2) achieving the optimal learning thres

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Machine Learning and ELM · Advanced Graph Neural Networks