Efficient Simple Regret Algorithms for Stochastic Contextual Bandits

Shuai Liu; Alireza Bakhtiari; Alex Ayoub; Botao Hao; Csaba Szepesvári

arXiv:2601.21167·cs.LG·January 30, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces the first algorithms with provable simple regret guarantees for stochastic contextual logistic bandits, extending linear bandit results and providing practical, tractable solutions with empirical validation.

Contribution

It proposes algorithms achieving the first simple regret bounds for logistic bandits, including a new Thompson Sampling variant. The leading term of the bound is free of the constant $\kappa$, which can grow exponentially with the bound $S$ on the magnitude of the unknown parameter.

Findings

01

Achieves simple regret $\tilde{O}(d/\sqrt{T})$ for logistic bandits.

02

Introduces a Thompson Sampling algorithm with regret $\tilde{O}(d^{3/2}/\sqrt{T})$.

03

Empirically validates theoretical guarantees through experiments.
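To make the gap between the two rates concrete, the sketch below solves each bound for the number of rounds $T$ needed to reach a target simple regret $\varepsilon$. It ignores constants and logarithmic factors, so the function `rounds_needed` and the sample values of `d` and `eps` are illustrative assumptions and not quantities taken from the paper.

```python
import math

def rounds_needed(d: int, eps: float, power: float) -> int:
    """Smallest T with d**power / sqrt(T) <= eps.

    Constants and log factors hidden by the O-tilde notation are ignored,
    so this only illustrates how the guarantees scale.
    """
    return math.ceil((d ** power / eps) ** 2)

d, eps = 16, 0.125  # hypothetical dimension and target simple regret

t_deterministic = rounds_needed(d, eps, 1.0)  # rate d / sqrt(T)
t_thompson = rounds_needed(d, eps, 1.5)       # rate d^{3/2} / sqrt(T)

print(t_deterministic, t_thompson, t_thompson / t_deterministic)
```

Under these assumptions the Thompson Sampling guarantee needs roughly $d$ times as many rounds as the $\tilde{O}(d/\sqrt{T})$ guarantee. Both are upper bounds, so this compares guarantees and not observed performance.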

Abstract

We study stochastic contextual logistic bandits under the simple regret objective. While simple regret guarantees have been established for the linear case, no such results were previously known for the logistic setting. Building on ideas from contextual linear bandits and self-concordant analysis, we propose the first algorithm that achieves simple regret $\tilde{O}(d/\sqrt{T})$. Notably, the leading term of our regret bound is free of the constant $\kappa = O(\exp(S))$, where $S$ is a bound on the magnitude of the unknown parameter vector. The algorithm is shown to be fully tractable when the action set is finite. We also introduce a new variant of Thompson Sampling tailored to the simple-regret setting. This yields the first simple regret guarantee for randomized algorithms in stochastic contextual linear bandits, with regret…
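A short numerical sketch shows why removing $\kappa$ from the leading term matters. It assumes the definition common in the logistic bandit literature, $\kappa = \max_{|z| \le S} 1/\mu'(z)$ with $\mu$ the logistic sigmoid; this page does not state the paper's exact definition, so the function `kappa` below is an illustration only.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic link function mu(z)."""
    return 1.0 / (1.0 + math.exp(-z))

def kappa(S: float) -> float:
    """Worst-case inverse slope 1 / mu'(z) over |z| <= S.

    Assumed definition: mu'(z) = mu(z) * (1 - mu(z)) is smallest at |z| = S,
    so the maximum of 1 / mu'(z) is attained there.
    """
    mu = sigmoid(S)
    return 1.0 / (mu * (1.0 - mu))

# Compare kappa(S) with exp(S) for a few parameter-norm bounds.
for S in (1.0, 2.0, 5.0, 10.0):
    print(f"S={S:4.1f}  kappa={kappa(S):12.2f}  exp(S)={math.exp(S):12.2f}")
```

Under this definition $\kappa = 2 + e^{S} + e^{-S}$, so a bound that multiplies its leading term by $\kappa$ degrades exponentially as $S$ grows.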

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- The authors propose an effective algorithm for simple-regret minimization in stochastic contextual bandits. The regret guarantees are reasonable, and the authors provide sufficient explanations for their derivations.
- In particular, the finite-sample analysis is strong.

Weaknesses

1. In my understanding, several other studies address simple-regret minimization in stochastic contextual bandits. For example, Kato et al. (2024) develop policy-learning algorithms in this setting. Their goal is to train a policy that minimizes simple regret in a best-arm-identification setting, and they characterize regret bounds using the VC dimension, which covers certain linear and logistic models. Theoretically, that analysis may be somewhat coarse, but could those results be applied to th

Reviewer 02 · Rating 6 · Confidence 3

Strengths

The logistic simple-regret setting is well motivated, and the work fills a clear gap in the literature. To my knowledge, this is the first paper to remove the dependence on the curvature constant $\kappa$ from the leading term of the regret bound. The construction of a monotone surrogate Hessian and the associated decreasing-uncertainty lemma are non-trivial and address the main technical challenge in logistic models, where the uncertainty depends on the unknown slope $\mu'(z)$. These ideas are

Weaknesses

Although there are no fatal theoretical flaws, the paper contains numerous typographical and consistency issues that make verification difficult. The most important ones are:

- In both MULIN and SIMPLELINTS, the design matrix $V_{t+1}$ is never updated. The pseudocode should include $V_{t+1} \leftarrow V_t + \phi(S_t, A_t)\phi(S_t, A_t)^\top$.
- From Equation (17), we have $\mathcal{V}_{t+1} \subseteq \mathcal{V}_t$, but it is reversed in Rows 1420-1421.
- The term $\varphi(s,a)^\top \the
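The rank-one update the reviewer asks for can be sketched in a few lines. This is a minimal illustration, not the paper's pseudocode: the helper `update_design_matrix`, the regularized initialization $V_1 = \lambda I$, and the sample feature vectors are all assumptions made for the example.

```python
def update_design_matrix(V, phi):
    """Return V + phi phi^T as a new matrix, leaving V unchanged."""
    d = len(phi)
    return [[V[i][j] + phi[i] * phi[j] for j in range(d)] for i in range(d)]

d = 2
lam = 1.0  # hypothetical regularization parameter
# Assumed initialization V_1 = lam * I.
V = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]

# Hypothetical feature vectors phi(S_t, A_t) observed over two rounds.
for phi in ([1.0, 0.0], [1.0, 1.0]):
    V = update_design_matrix(V, phi)

print(V)
```

Without this step the matrix stays at its initial value, so any confidence width computed from it would never shrink as data accumulates.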

Reviewer 03 · Rating 8 · Confidence 3

Strengths

1. The methods and analysis are unified: the paper first establishes the intuitive and important theoretical property in linear bandits and then describes the natural extension to the logistic model. The paper is effortless to follow, and the text is flawless.
2. SIMPLELINTS (Theorem 3) achieves $\tilde{O}(d^{3/2}/\sqrt{T})$. Based on it, they also analyze a randomized logistic algorithm based on TS. These randomized methods have computational advantages over the deterministic met

Weaknesses

The practical motivation for studying simple regret is not discussed.

Taxonomy

Topics: Advanced Bandit Algorithms Research · Risk and Portfolio Optimization · Stochastic Gradient Optimization Techniques