Single Index Bandits: Generalized Linear Contextual Bandits with Unknown Reward Functions
Yue Kang, Mingshuo Liu, Bongsoo Yi, Jing Lyu, Zhi Zhang, Doudou Zhou, Yao Li

TL;DR
This paper introduces new algorithms for generalized linear bandits with unknown reward functions, addressing the practical challenge of link function misspecification and achieving near-optimal regret bounds.
Contribution
It proposes the STOR, ESTOR, and GSTOR algorithms for unknown and general reward functions, extending bandit theory to more realistic scenarios.
Findings
ESTOR achieves nearly optimal $\tilde{O}_T(\sqrt{T})$ regret.
Algorithms perform well on synthetic and real datasets.
Extensions to high-dimensional sparse settings are effective.
Abstract
Generalized linear bandits have been extensively studied due to their broad applicability in real-world online decision-making problems. However, these methods typically assume that the expected reward function is known to the users, an assumption that is often unrealistic in practice. Misspecification of this link function can lead to the failure of all existing algorithms. In this work, we address this critical limitation by introducing a new problem of generalized linear bandits with unknown reward functions, also known as single index bandits. We first consider the case where the unknown reward function is monotonically increasing, and propose two novel and efficient algorithms, STOR and ESTOR, that achieve decent regrets under standard assumptions. Notably, our ESTOR can obtain the nearly optimal regret bound in terms of the time horizon $T$. We then extend…
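The reviews below credit a closed-form, Stein's-method estimator for recovering the parameter without modeling the link function. The following is a minimal sketch of that general idea, not the paper's STOR or ESTOR procedure. It assumes contexts drawn i.i.d. from a standard Gaussian, for which Stein's identity gives $E[y\,x] = E[f'(x^\top\theta)]\,\theta$, so the sample average of $y_i x_i$ points along $\theta$. The names `stein_direction` and `greedy_arm`, the `tanh` link, and all constants are hypothetical choices for illustration.

```python
import numpy as np

def stein_direction(X, y):
    """Estimate the direction of theta, ASSUMING rows of X are i.i.d. N(0, I).

    By Stein's identity, E[y * x] is proportional to theta, so the
    normalized sample mean of y_i * x_i needs no knowledge of the link.
    """
    v = X.T @ y / len(y)
    return v / np.linalg.norm(v)

def greedy_arm(arms, theta_hat):
    """Pick the arm with the largest estimated index x^T theta_hat.

    For a monotonically increasing link this ordering matches the
    ordering of expected rewards, so the link is never evaluated.
    """
    return int(np.argmax(arms @ theta_hat))

rng = np.random.default_rng(0)
d, n = 5, 20000                      # illustrative dimension and sample size
theta = rng.normal(size=d)
theta /= np.linalg.norm(theta)       # unit-norm ground truth

X = rng.normal(size=(n, d))          # Gaussian contexts (assumption)
link = np.tanh                       # hypothetical monotone link, hidden from the estimator
y = link(X @ theta) + 0.1 * rng.normal(size=n)

theta_hat = stein_direction(X, y)
cosine = float(theta_hat @ theta)    # alignment between estimate and truth

arms = rng.normal(size=(10, d))      # a fresh set of candidate arms
chosen = greedy_arm(arms, theta_hat)
```

Only the direction of $\theta$ is identifiable this way, since the scale is absorbed by the unknown link; for a monotone link the direction is all that arm selection needs.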
Peer Reviews
Decision·ICLR 2026 Poster
1) This paper tackles a challenging and relatively unexplored generalized linear bandit setting in which the link function is unknown. 2) It offers a hierarchical perspective on assumptions and algorithms, distinguishing between monotone and general link functions, to illustrate the trade-off between assumptions and regret. 3) The proposed Stein’s estimator is new and conceptually clear, enabling efficient estimation of the underlying parameter without explicitly modeling the link function.
1) The monotonicity assumption is key to achieving near-optimal regret but remains restrictive. It would be helpful to discuss whether weaker conditions, e.g., local monotonicity or Lipschitz continuity, could still lead to meaningful results. 2) The $O(T^{3/4})$ rate achieved in the non-monotone case is suboptimal. The dependence on the number of arms $K$ might also be suboptimal. 3) The analysis of sparsity is good but assumes knowledge of the true sparsity level.
1. The paper tackles a fundamental limitation of GLBs: the assumption of a known reward function (link function). The authors address this issue with an approach based on Stein's method. The resulting estimator is both theoretically optimal (achieving minimax rates) and computationally efficient (closed-form solution requiring no optimization). 2. The paper provides two algorithms: STOR, a simpler variant with uniform exploration, and ESTOR, an epoch-based algorithm that balances exploration-exploi
1. To me, one of the main weaknesses of the paper is that it tries to pack in too much material without going deep into any one section. 2. Following the previous comment, none of the theorems or technical novelties has been explained in detail. 3. The sparse high-dimensional experiments only test linear rewards (not truly testing the unknown-$f$ capability), and the GSTOR evaluation is limited.
1. The work directly addresses a critical and often unrealistic limitation (known link function) in the extensive GLB literature, making bandit algorithms significantly more robust to model misspecification. 2. The paper is generally well-written and structured. The motivation is compelling, and the challenges of the problem are clearly explained, especially in contrasting with existing GLB and general contextual bandit approaches.
1. The analysis crucially relies on the assumption that the context vectors (arms) are i.i.d. from a fixed distribution. This is a significant limitation, as a large body of bandit literature deals with adversarially chosen arm sets. Why is this assumption needed? 2. The worst-case regret bound of ESTOR exhibits a $K^{3/2}$ dependence on the number of arms, a limitation not present in prior work on heavy-tailed GLBs. Why does this term exist? Is there any lower bound? 3. This paper assumes $||\theta
Taxonomy
Topics: Advanced Bandit Algorithms Research
