Symmetric Single Index Learning
Aaron Zweig, Joan Bruna

TL;DR
This paper extends the theory of single index models to symmetric neural networks. It shows that gradient flow recovers the hidden planted direction under analytic assumptions on the activation and maximum degree assumptions on the link function, and it introduces a new notion of information exponent for this setting.
Contribution
It introduces a framework for analyzing single index learning in symmetric neural networks, including conditions for successful recovery and a new information exponent concept.
Findings
Gradient flow recovers the hidden direction in symmetric neural networks.
A new information exponent controls the learning efficiency.
Under certain assumptions, the model achieves polynomial sample complexity.
Abstract
Few neural architectures lend themselves to provable learning with gradient based methods. One popular model is the single-index model, in which labels are produced by composing an unknown linear projection with a possibly unknown scalar link function. Learning this model with SGD is relatively well-understood, whereby the so-called information exponent of the link function governs a polynomial sample complexity rate. However, extending this analysis to deeper or more complicated architectures remains challenging. In this work, we consider single index learning in the setting of symmetric neural networks. Under analytic assumptions on the activation and maximum degree assumptions on the link function, we prove that gradient flow recovers the hidden planted direction, represented as a finitely supported vector in the feature space of power sum polynomials. We characterize a notion of…
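To make the abstract's setup concrete, the following is a minimal sketch of a symmetric single-index label: the input set is mapped to power sum features, projected onto a finitely supported direction, and passed through a scalar link. The names `power_sums`, `symmetric_single_index`, `direction`, and `link` are illustrative assumptions, as are the real-valued inputs and the unnormalized power sums; the paper's exact feature normalization and input distribution may differ.

```python
def power_sums(x, max_degree):
    """Return [p_1(x), ..., p_K(x)] with p_k(x) = sum_i x_i**k (unnormalized, an assumption)."""
    return [sum(xi ** k for xi in x) for k in range(1, max_degree + 1)]


def symmetric_single_index(x, direction, link):
    """Evaluate link(<direction, p(x)>), where direction has finitely many entries."""
    feats = power_sums(x, len(direction))
    projection = sum(h * p for h, p in zip(direction, feats))
    return link(projection)


# Hypothetical planted direction supported on degrees 2 and 3, and a hypothetical link.
direction = [0.0, 1.0, 0.5]
link = lambda t: t ** 2

x = [1, 2, 3]
x_permuted = [3, 1, 2]

y = symmetric_single_index(x, direction, link)
y_permuted = symmetric_single_index(x_permuted, direction, link)

# Power sums depend only on the multiset of inputs, so the label is permutation invariant.
print(y, y_permuted)
```

Because each power sum is a sum over all elements, reordering the input leaves the features, and therefore the label, unchanged. That invariance is what makes the model symmetric.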
Peer Reviews
Decision·ICLR 2024 poster
- Understanding the convergence of neural network training with gradient-based methods is an important problem, and this paper provides a new theoretical framework that admits a clean analysis. The problem setting and techniques may be useful for further work in this line of study.
- The paper is well written overall. I am new to this area, but I was able to understand the high-level landscape of previous work and the contribution of this work relative to it. Though there are se
- There are several assumptions that limit the applicability of the framework, though they are properly discussed. Proposition 2.3 is the key technique underlying all of the technical assumptions, including the assumption on the link function (Assumption 2.4). As a newcomer to this field, it is unclear to me how restrictive this is and whether it can be relaxed. Any further discussion would be appreciated.
- Though the paper starts with the single-index learning, it seems that the fram
The presentation is clear and easy to follow. The problem of provable guarantees for the dynamics of learning symmetric functions has not been studied as far as I know, so this can be of interest to the community. Related literature is covered well. The analysis seems correct.
1) "These dynamics naturally motivate the question of learning efficiency, measured in convergence rates in time in the case of gradient flow". Does this say anything about sample complexity when discretized? Gradient flow time can be rescaled, so it doesn't appear to be a well-defined complexity measure.
2) Can you prove a converse to Theorem 4.2 with corresponding lower bounds on the time? This seems doable and like it would strengthen the result to make it more of a characterization.
The paper provides the first positive result in the proposed setting and obtains bounds in terms of the information exponent, which is a well-studied quantity. The presentation is clear, sufficiently detailed and accurate. The techniques proposed in this paper might be of independent interest, especially as they motivate further study of learning problems under Vandermonde marginals.
As the authors mention in their limitations section, the (distributional and modeling) assumptions required for the proposed analysis to work are not as common in the literature. Therefore, it is not clear to what extent such assumptions are realistic or significantly valuable from a theoretical perspective. That said, this work provides results that apply to a very large class of link functions, which partly justifies a certain number of strong distributional assumptions. Another weakness of t
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and Algorithms · Machine Learning and ELM
Methods: Stochastic Gradient Descent
