Interactive Learning of Single-Index Models via Stochastic Gradient Descent
Nived Rajaraman, Yanjun Han

TL;DR
This paper analyzes the learning dynamics of stochastic gradient descent (SGD) in high-dimensional single-index models, revealing a two-phase process and demonstrating near-optimal sample complexity and regret guarantees with proper learning rate schedules.
Contribution
It provides the first detailed analysis of SGD's behavior in sequential learning of single-index models, including phase transition and optimality guarantees.
Findings
SGD passes through a burn-in phase followed by a learning phase in single-index models.
Proper learning rate schedules enable near-optimal sample complexity.
SGD achieves competitive regret bounds in adaptive data settings.
Abstract
Stochastic gradient descent (SGD) is a cornerstone algorithm for high-dimensional optimization, renowned for its empirical successes. Recent theoretical advances have provided a deep understanding of how SGD enables feature learning in high-dimensional nonlinear models, most notably the \textit{single-index model} with i.i.d. data. In this work, we study the sequential learning problem for single-index models, also known as generalized linear bandits or ridge bandits, where SGD is a simple and natural solution, yet its learning dynamics remain largely unexplored. We show that, similar to the optimal interactive learner, SGD undergoes a distinct ``burn-in'' phase before entering the ``learning'' phase in this setting. Moreover, with an appropriately chosen learning rate schedule, a single SGD procedure simultaneously achieves near-optimal (or best-known) sample complexity and regret…
Peer Reviews
Decision: ICLR 2026 Poster
While there is now a rich literature for learning single-index models in the supervised setting, I believe the literature on online/bandit settings is more sparse. Therefore, the problem that this paper wants to tackle is novel and of importance to the community. Also, the paper is explicit about its assumptions and it is generally easy to read and follow.
My main concerns are the following: * In the pure exploration case $\sigma_t = 1$, this algorithm is the same as one-pass SGD studied by Ben Arous et al., 2021. However, the sample complexity seems to be worse. Due to the monotonicity of $f$ (hence information exponent 1), the sample complexity of (pure exploration) SGD would scale linearly with $d$ (at least in the noiseless setting, but I believe it should be able to tolerate $O(1)$ i.i.d. noise as well). However, the bound of Theorem 1 (SGD w
* Overall, this is a well-written paper and is easy to follow. In addition, it is short (19 pages), which is a nice and rare thing for a theory paper to have. * It is somewhat surprising how much being able to choose the position of the query simplifies the analysis and improves the bounds (when the label noise is large). They choose the next query position $a_t$ to be a weighted average of the current weight $\theta_t$ and a noise that is *orthogonal* to the current weight. Thi
This is a neat paper that does everything the authors claim to achieve, so I do not think there is any major weakness, though one could complain that the setting is too easy. Nevertheless, the following are a few complaints I have. * As someone who is more familiar with IE/Gaussian single-index models, I found the $\tilde{O}(d^2)$ bounds really confusing until I realized that the scaling is different, as the non-interactive bounds are $\tilde{O}(d)$. It might be better to point this out
The paper proves novel results on the training dynamics and sample complexity of SGD for single-index models with interactive features. The results are novel and demonstrate the value of SGD in this setting. The paper does a good job of explaining its results and the outlines of the proofs, which are often otherwise hidden in the appendix.
The assumptions feel rather restrictive, particularly monotonicity of the link function. There is discussion of the 'necessity' of the monotonicity assumption; however, there is not nearly enough detail to convince me that the assumption is necessary. If the assumption is truly necessary for the results, it would be nice to have a rigorous negative result, such as a counterexample demonstrating that a lack of monotonicity will indeed break the conclusion of the proof. The relationship of the info
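As an illustration of the query rule mentioned in the reviews above (the next query $a_t$ mixes the current weight $\theta_t$ with noise orthogonal to it), the following Python sketch shows one way such an interactive SGD loop could look. It is not the authors' algorithm. The mixing form sqrt(1 - sigma^2) * theta + sigma * z, the reward-weighted update along z, the renormalization to the unit sphere, and all names (make_query, run_sgd, sigma, lr, noise_std) are assumptions introduced here for illustration only.

```python
import numpy as np


def make_query(theta, sigma, rng):
    """Mix the unit-norm estimate `theta` with a random unit direction orthogonal to it.

    Assumed (hypothetical) form: a = sqrt(1 - sigma^2) * theta + sigma * z,
    so sigma = 1 queries only the orthogonal noise direction.
    """
    g = rng.standard_normal(theta.shape[0])
    g -= (g @ theta) * theta            # remove the component along theta
    z = g / np.linalg.norm(g)           # unit vector orthogonal to theta
    a = np.sqrt(1.0 - sigma ** 2) * theta + sigma * z
    return a, z


def run_sgd(theta_star, link, steps, lr, sigma, noise_std, seed=0):
    """Run a hypothetical interactive SGD loop on a single-index reward model.

    theta_star : unknown unit-norm parameter (used only to simulate rewards)
    link       : link function f, applied to <theta_star, a_t>
    lr, sigma  : callables t -> learning rate / exploration weight (assumed schedules)
    """
    rng = np.random.default_rng(seed)
    d = theta_star.shape[0]
    theta = rng.standard_normal(d)
    theta /= np.linalg.norm(theta)      # random unit-norm initialization
    alignments = np.empty(steps)
    for t in range(steps):
        a, z = make_query(theta, sigma(t), rng)
        # Noisy reward observed at the chosen query point.
        r = link(a @ theta_star) + noise_std * rng.standard_normal()
        # Assumed update: reward-weighted step along the exploration direction,
        # followed by projection back onto the unit sphere.
        theta = theta + lr(t) * r * z
        theta /= np.linalg.norm(theta)
        alignments[t] = theta @ theta_star   # track <theta_t, theta_star>
    return theta, alignments
```

A call such as run_sgd(theta_star, np.tanh, 1000, lambda t: 0.05, lambda t: 0.5, 0.1) returns the final estimate together with the trajectory of $\langle \theta_t, \theta^\star \rangle$, which is the quantity one would plot to look for a burn-in phase and a learning phase under a given schedule.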
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Bandit Algorithms Research · Gaussian Processes and Bayesian Inference
