TL;DR
This paper introduces a derandomization lemma that explains structure discovery in neural networks across a range of architectures and training methods, with applications to MAXCUT approximation and Johnson-Lindenstrauss embeddings.
Contribution
It presents a derandomization lemma that explains structure discovery in neural networks under weak assumptions, extending previous analyses to more general settings.
Findings
The derandomization lemma shows that, under mild conditions, the weight components orthogonal to the teacher directions vanish.
The framework is applied to MAXCUT approximation and Johnson-Lindenstrauss embeddings.
The result supports structure discovery in neural networks trained with any method that attains a second-order stationary point (SOSP).
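For reference, a common way to formalize an approximate second-order stationary point of a twice-differentiable objective $f$ is sketched below. The symbols $\theta$, $\varepsilon$ and $\rho$ and the exact form are assumptions taken from the standard optimization literature; the paper's own definition (the reviews mention $\rho$-SOSPs) may differ in constants or scaling.

```latex
\|\nabla f(\theta)\| \le \varepsilon
\qquad\text{and}\qquad
\lambda_{\min}\!\left(\nabla^2 f(\theta)\right) \ge -\rho .
```

The first condition rules out points with a large gradient; the second rules out strict saddle points with a direction of strongly negative curvature.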
Abstract
Understanding the dynamics of feature learning in neural networks (NNs) remains a significant challenge. The work of Mousavi-Hosseini et al. (2023) analyzes a multiple index teacher-student setting and shows that a two-layer student attains a low-rank structure in its first-layer weights when trained with stochastic gradient descent (SGD) and a strong regularizer. This structural property is known to reduce the sample complexity of generalization. Indeed, in a second step, the same authors establish algorithm-specific learning guarantees under additional assumptions. In this paper, we focus exclusively on the structure discovery aspect and study it under weaker assumptions, more specifically: we allow (a) NNs of arbitrary size and depth, (b) with all parameters trainable, (c) under any smooth loss function, (d) tiny regularization, and (e) trained by any method that attains a second-order…
Peer Reviews
Decision: ICLR 2026 Poster
The main result of the paper is appealingly clean, general, and powerful. The applications are creative and interesting. In the case of neural network optimization, the paper recovers a "structure discovery" result which had been proved much more painstakingly (and under more restrictive assumptions) in prior work. The derandomization applications are conceptually thought-provoking and potentially of independent interest in complexity theory and randomized algorithms. The paper is mostly written
A weakness of the paper is that it is stylized in certain important ways. Probably the biggest concern is that all the results rely on perfect Gaussianity of the random variables, because of the crucial use of Stein's lemma in the main result. Another concern is the requirement that the functions involved be smooth, which necessitates smooth activation functions and the like. Finally, all the claims require regularization and only hold for $\rho$-SOSPs with $\rho$ very small compared to the regularization str
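To make the role of Gaussianity concrete, here is a minimal numerical sketch of the one-dimensional Stein's lemma, $\mathbb{E}[X g(X)] = \mathbb{E}[g'(X)]$ for $X \sim N(0, 1)$ and smooth $g$. The choice $g = \tanh$, the sample size, and the helper names `stein_sides` and `tanh_prime` are illustrative assumptions, not taken from the paper.

```python
import math
import random


def stein_sides(g, g_prime, n_samples=200_000, seed=0):
    """Monte Carlo estimates of E[X g(X)] and E[g'(X)] for X ~ N(0, 1)."""
    rng = random.Random(seed)
    lhs = 0.0
    rhs = 0.0
    for _ in range(n_samples):
        x = rng.gauss(0.0, 1.0)
        lhs += x * g(x)      # left-hand side of Stein's identity
        rhs += g_prime(x)    # right-hand side of Stein's identity
    return lhs / n_samples, rhs / n_samples


def tanh_prime(x):
    """Derivative of tanh, used as the smooth test function's gradient."""
    return 1.0 - math.tanh(x) ** 2


lhs, rhs = stein_sides(math.tanh, tanh_prime)
print(f"E[X tanh(X)] ~ {lhs:.4f}, E[tanh'(X)] ~ {rhs:.4f}")
```

The two estimates should agree up to Monte Carlo error. The identity is specific to the Gaussian distribution and requires a differentiable $g$, which is consistent with the two concerns raised in this review.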
- The paper extends the earlier result of Mousavi-Hosseini et al. (2023) to a broader setting.
- The paper does not guarantee learning the teacher directions; it shows only that the component of the student weights in the subspace orthogonal to the teacher directions vanishes. For example, consider the setting where the teacher is $y = \mathrm{He}_4(\langle \theta, x\rangle) + \epsilon$ for some unit vector $\theta$, and the student is $\hat y = \mathrm{He}_2(\langle w, x\rangle)$, $w \in \mathbb{R}^d$, with $x \sim N
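The reviewer's example is cut off in this excerpt, so the sketch below does not attempt to reconstruct it. It only checks a standard background fact that such examples typically build on: the probabilists' Hermite polynomials $\mathrm{He}_2$ and $\mathrm{He}_4$ are orthogonal under the standard Gaussian measure. The helper names and the coefficient-list representation are illustrative assumptions.

```python
def gaussian_moment(n):
    """E[z^n] for z ~ N(0, 1): 0 for odd n, (n - 1)!! for even n."""
    if n % 2 == 1:
        return 0
    result = 1
    for k in range(n - 1, 0, -2):
        result *= k
    return result


def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (lowest degree first)."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def gaussian_inner(p, q):
    """Exact value of E[p(z) q(z)] for z ~ N(0, 1)."""
    return sum(c * gaussian_moment(n) for n, c in enumerate(poly_mul(p, q)))


HE2 = [-1, 0, 1]        # He_2(z) = z^2 - 1
HE4 = [3, 0, -6, 0, 1]  # He_4(z) = z^4 - 6 z^2 + 3

print(gaussian_inner(HE2, HE4), gaussian_inner(HE2, HE2), gaussian_inner(HE4, HE4))
```

The cross term is exactly zero, while the squared norms are $2! = 2$ and $4! = 24$, matching the general identity $\mathbb{E}[\mathrm{He}_n(z)^2] = n!$.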
1. The authors propose a new derandomization lemma for analyzing the feature learning behaviour of neural networks. The obtained results extend and generalize the analysis in (Mousavi-Hosseini et al., 2023).
2. The authors apply the result to derandomization in other domains, including MAXCUT and JL embeddings.
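For context on the JL application, the sketch below shows the classical randomized Johnson-Lindenstrauss construction: a Gaussian random matrix that approximately preserves pairwise distances. This is the randomized baseline that a derandomization would replace, not the paper's procedure; the dimensions, seed, and function names are illustrative assumptions.

```python
import math
import random


def gaussian_projection(d, k, rng):
    """Return a k x d matrix with i.i.d. N(0, 1/k) entries."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def project(matrix, x):
    """Apply the projection matrix to a single point x."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def sq_dist(u, v):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def distortion_range(n=10, d=500, k=200, seed=0):
    """Min and max ratio of projected to original squared pairwise distances."""
    rng = random.Random(seed)
    points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    matrix = gaussian_projection(d, k, rng)
    images = [project(matrix, p) for p in points]
    ratios = [
        sq_dist(images[i], images[j]) / sq_dist(points[i], points[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return min(ratios), max(ratios)


lo, hi = distortion_range()
print(f"squared-distance ratios lie in [{lo:.3f}, {hi:.3f}]")
```

With these settings each ratio concentrates around 1 with standard deviation roughly $\sqrt{2/k} = 0.1$, so all pairwise distances are preserved up to a modest multiplicative distortion with high probability over the random draw.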
1. The assumption of first- and second-order smoothness is restrictive and does not apply to many practical scenarios, such as ReLU networks, forcing the authors to adopt the approximation in Section 4.2. Can this result be extended to non-smooth settings such as the original ReLU activation?
2. The applications to MAXCUT and JL embedding are actually not new since, as acknowledged by the authors, there are already known derandomized algorithms for both problems. Can this result be applied to new p
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Model Reduction and Neural Networks · Machine Learning and ELM
