Neural Logistic Bandits
Seoungbin Bae, Dabeen Lee

TL;DR
This paper introduces a new theoretical framework for neural logistic bandits that achieves regret bounds depending on an effective dimension rather than the feature dimension, improving learning efficiency.
Contribution
It develops a novel Bernstein-type inequality for vector martingales and proposes two algorithms with improved regret bounds for neural logistic bandits.
Findings
Regret bounds depend on effective dimension, not feature dimension.
NeuralLog-UCB-1 and NeuralLog-UCB-2 improve on prior regret bounds for neural logistic bandits.
Numerical experiments validate theoretical improvements.
Abstract
We study the problem of neural logistic bandits, where the main task is to learn an unknown reward function within a logistic link function using a neural network. Existing approaches either exhibit unfavorable dependencies on $\kappa$, where $1/\kappa$ represents the minimum variance of reward distributions, or suffer from direct dependence on the feature dimension $d$, which can be huge in neural network-based settings. In this work, we introduce a novel Bernstein-type inequality for self-normalized vector-valued martingales that is designed to bypass a direct dependence on the ambient dimension. This lets us deduce a regret upper bound that grows with the effective dimension $\tilde{d}$, not the feature dimension, while keeping a minimal dependence on $\kappa$. Based on the concentration inequality, we propose two algorithms, NeuralLog-UCB-1 and NeuralLog-UCB-2, that guarantee…
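To make the contrast between the feature dimension and the effective dimension concrete, the sketch below computes a log-det effective dimension for a toy feature set. It assumes the NeuralUCB-style definition $\tilde{d} = \log\det(I + H/\lambda) / \log(1 + n/\lambda)$, which may differ in constants from the definition used in this paper. The function name `effective_dimension`, the Gram matrix standing in for the NTK matrix $H$, and the toy data are all illustrative assumptions.

```python
import numpy as np

def effective_dimension(features: np.ndarray, lam: float = 1.0) -> float:
    """Log-det effective dimension (NeuralUCB-style definition, assumed here).

    features: (n, p) array, one row per context-arm pair.
    """
    n = features.shape[0]
    # Gram matrix of the features; a stand-in for the NTK matrix H.
    gram = features @ features.T
    # slogdet avoids overflow that log(det(...)) can hit for large n.
    _, logdet = np.linalg.slogdet(np.eye(n) + gram / lam)
    return float(logdet / np.log(1.0 + n / lam))

# Toy data: 200 unit-norm feature vectors in a 500-dimensional ambient
# space that actually lie in a 3-dimensional subspace.
rng = np.random.default_rng(0)
basis = rng.standard_normal((3, 500))
coeffs = rng.standard_normal((200, 3))
features = coeffs @ basis
features /= np.linalg.norm(features, axis=1, keepdims=True)

d_eff = effective_dimension(features)
print(f"ambient dimension: {features.shape[1]}, effective dimension: {d_eff:.2f}")
```

In this toy setup the effective dimension cannot exceed the rank of the feature set (3), even though the ambient dimension is 500, which is the kind of gap a bound in $\tilde{d}$ is meant to exploit.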
Peer Reviews
Decision·Submitted to ICLR 2026
**Strengths of the paper:** 1. This paper considers the neural logistic bandit problem, where an unknown latent non-linear reward function is estimated using neural networks. 2. This paper proposes a novel Bernstein-type concentration inequality for self-normalized vector-valued martingales and then uses it to derive tighter regret upper bounds. 3. The authors propose two new algorithms, NeuralLog-UCB-1 and NeuralLog-UCB-2, both improving on existing regret bounds, closing the gap betwe
**Weaknesses of the paper:** 1. It is unclear how one can choose the right architecture of a neural network (NN) to estimate the underlying unknown reward function. If the NN architecture (too small or too large) is not good enough for estimating the reward function, it may lead to misspecification. 2. The empirical results could also include Thompson sampling-based variants of the proposed algorithms. Also, the authors could mention the key challenges to integrating Thompson sampling with the propose
1. Technically solid and interesting concentration result. The new Bernstein-type, self-normalized martingale inequality is elegant and, as far as I can check, correct. It simultaneously achieves (i) variance adaptivity (using the true logistic variance instead of the worst-case $1/\kappa$) and (ii) data adaptivity (via the log-det / effective-dimension term), thereby bridging the gap between the variance-aware inequality of Faury et al. (2020), which still kept an explicit $d$, and the neural l
1. Algorithm requires knowing $S$. The algorithm needs the learner to input the norm parameter $S$ (Condition 4.4: “set $S$ as a norm parameter satisfying $S \ge \sqrt{2 h^\top H^{-1} h}$”) but this quantity is defined in terms of the true latent reward vector $h$ and the NTK matrix over all future contexts — not something the learner can observe or compute online. So in practice this is an assumption, not an implementable step. The paper itself later notes that removing the dependence on $S$ is
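The reviewer's point can be seen by writing Condition 4.4 as code. The sketch below evaluates the lower bound $\sqrt{2 h^\top H^{-1} h}$ for made-up inputs. The name `norm_parameter_lower_bound` and the toy values of `h` and `H` are hypothetical; in the bandit setting `h` is exactly the quantity the learner does not have.

```python
import numpy as np

def norm_parameter_lower_bound(h: np.ndarray, H: np.ndarray) -> float:
    """Smallest S allowed by S >= sqrt(2 * h^T H^{-1} h).

    h: true latent reward vector (an oracle quantity, unknown to the learner).
    H: positive-definite NTK matrix over all contexts (also not available online).
    """
    # Solve H z = h rather than forming H^{-1} explicitly.
    z = np.linalg.solve(H, h)
    return float(np.sqrt(2.0 * (h @ z)))

# Toy oracle values, chosen only for illustration.
h_true = np.array([3.0, 4.0])
H_toy = np.eye(2)
S_min = norm_parameter_lower_bound(h_true, H_toy)
print(f"any valid S must be at least {S_min:.4f}")
```

Because both arguments are oracle quantities, a practical implementation would have to treat $S$ as a tuned hyperparameter or an assumed upper bound.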
- Improved variance dependence for neural logistic bandits. The paper introduces two UCB-style algorithms (NeuralLog-UCB-1 and NeuralLog-UCB-2) that achieve regret bounds with better dependence on the logistic variance parameters, improving over prior $\tilde{O}(\kappa \tilde{d}\sqrt{T})$ in neural logistic bandits.
- Data-adaptive exploration design. NeuralLog-UCB-2 estimates per-arm uncertainty using a learned, curvature-weighted design matrix rather than a crude global worst-case variance. T
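As a rough illustration of what a curvature-weighted design matrix looks like, the sketch below builds $W = \lambda I + \sum_s \dot\mu(x_s^\top \theta)\, x_s x_s^\top$ with $\dot\mu = \mu(1-\mu)$ for a plain logistic model and uses it for an elliptical exploration bonus. This is not the paper's algorithm: linear features stand in for the network's gradient features, and the names `curvature_weighted_design` and `exploration_bonus` are assumptions made for this example.

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def curvature_weighted_design(X: np.ndarray, theta: np.ndarray, lam: float = 1.0) -> np.ndarray:
    """W = lam * I + sum_s mu'(x_s^T theta) x_s x_s^T, with mu' = mu * (1 - mu)."""
    p = sigmoid(X @ theta)
    w = p * (1.0 - p)  # per-sample logistic variance (local curvature)
    return lam * np.eye(X.shape[1]) + (X * w[:, None]).T @ X

def exploration_bonus(x: np.ndarray, W: np.ndarray) -> float:
    """Elliptical width ||x||_{W^{-1}} used as a per-arm uncertainty estimate."""
    return float(np.sqrt(x @ np.linalg.solve(W, x)))

# Two past pulls along the first coordinate, none along the second.
X_hist = np.array([[1.0, 0.0], [1.0, 0.0]])
theta_hat = np.zeros(2)  # hypothetical current estimate
W = curvature_weighted_design(X_hist, theta_hat)

bonus_seen = exploration_bonus(np.array([1.0, 0.0]), W)
bonus_unseen = exploration_bonus(np.array([0.0, 1.0]), W)
print(bonus_seen, bonus_unseen)
```

The unexplored direction receives the larger bonus, and each past observation contributes in proportion to its own estimated variance instead of a single global worst-case constant.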
Weaknesses
- Removing $d$ is not new. The paper highlights that its regret bounds no longer depend directly on the ambient dimension $d$, but instead on the effective dimension $\tilde{d}$. However, this “$d \rightarrow \tilde{d}$” replacement has already been standard since NeuralUCB / NeuralTS–style analyses in neural bandits, where NTK-based arguments control regret via an effective log-det complexity term rather than the raw parameter dimension.
- The regret bounds are not strictly stronger tha
Taxonomy
Topics: Heart Rate Variability and Autonomic Control · Advanced Bandit Algorithms Research
