Stage-wise Conservative Linear Bandits
Ahmadreza Moradipari, Christos Thrampoulidis, Mahnoosh Alizadeh

TL;DR
This paper introduces two algorithms for stage-wise conservative linear bandits that satisfy a baseline safety constraint at every round while maximizing cumulative reward, with provable regret bounds and adaptability to several constraint settings.
Contribution
The paper proposes novel algorithms, SCLTS and SCLUCB, for safe linear bandit optimization with regret guarantees and flexibility for different safety constraint scenarios.
Findings
SCLTS and SCLUCB achieve regret bounds of O(√T log^{3/2}T) and O(√T log T), respectively.
The algorithms limit the number of rounds in which the baseline action is played to O(log T).
Methods adapt to constraints with bandit feedback and unknown baseline actions.
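To make the gap between the two regret rates concrete, the following minimal sketch evaluates only the growth terms √T log^{3/2}T and √T log T. The function names `ts_growth` and `ucb_growth` are hypothetical, and all constants hidden by the O(·) notation are dropped, so the numbers indicate scaling only, not actual regret.

```python
import math

def ts_growth(T: int) -> float:
    """Growth term sqrt(T) * log(T)^(3/2); hidden constants are ignored."""
    return math.sqrt(T) * math.log(T) ** 1.5

def ucb_growth(T: int) -> float:
    """Growth term sqrt(T) * log(T); hidden constants are ignored."""
    return math.sqrt(T) * math.log(T)

for T in (10**3, 10**5, 10**7):
    # The ratio of the two terms simplifies to sqrt(log T).
    ratio = ts_growth(T) / ucb_growth(T)
    print(f"T={T:>9}  ratio={ratio:.2f}  sqrt(log T)={math.sqrt(math.log(T)):.2f}")
```

The two rates differ by a factor of √(log T), which grows very slowly with the horizon.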
Abstract
We study stage-wise conservative linear stochastic bandits: an instance of bandit optimization, which accounts for (unknown) safety constraints that appear in applications such as online advertising and medical trials. At each stage, the learner must choose actions that not only maximize cumulative reward across the entire time horizon but further satisfy a linear baseline constraint that takes the form of a lower bound on the instantaneous reward. For this problem, we present two novel algorithms, stage-wise conservative linear Thompson Sampling (SCLTS) and stage-wise conservative linear UCB (SCLUCB), that respect the baseline constraints and enjoy probabilistic regret bounds of order O(\sqrt{T} \log^{3/2}T) and O(\sqrt{T} \log T), respectively. Notably, the proposed algorithms can be adjusted with only minor modifications to tackle different problem variations, such as constraints…
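The per-round safety requirement can be illustrated with a simplified, UCB-style sketch; it is not the authors' SCLTS or SCLUCB procedure. Several things here are assumptions made for illustration: the constraint is taken to be "reward at least (1 - alpha) times the baseline reward", the baseline reward is treated as known, the learner falls back to the baseline action itself when no action can be certified safe, and the names `choose_action`, `confidence_width`, `beta`, and `alpha` are hypothetical.

```python
import math

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def confidence_width(x, V_inv, beta):
    """Width beta * sqrt(x^T V^{-1} x) of the reward confidence interval for x."""
    V_inv_x = [dot(row, x) for row in V_inv]
    return beta * math.sqrt(dot(x, V_inv_x))

def choose_action(actions, baseline_action, baseline_reward,
                  theta_hat, V_inv, beta, alpha):
    """Pick the most optimistic action whose pessimistic reward estimate
    clears the assumed threshold (1 - alpha) * baseline_reward.
    Returns (action, certified_safe); falls back to the baseline otherwise."""
    threshold = (1.0 - alpha) * baseline_reward
    best, best_ucb = None, -math.inf
    for x in actions:
        mean = dot(x, theta_hat)
        width = confidence_width(x, V_inv, beta)
        is_safe = mean - width >= threshold  # lower confidence bound check
        if is_safe and mean + width > best_ucb:
            best, best_ucb = x, mean + width
    if best is None:
        return baseline_action, False
    return best, True

actions = [[1.0, 0.0], [0.0, 1.0]]
baseline = [0.5, 0.5]

# Early rounds: no data yet, so the confidence intervals are wide.
early = choose_action(actions, baseline, 0.5, [0.0, 0.0],
                      [[1.0, 0.0], [0.0, 1.0]], beta=1.0, alpha=0.1)

# Later rounds: a tighter estimate lets one action be certified safe.
late = choose_action(actions, baseline, 0.5, [0.99, 0.0],
                     [[0.01, 0.0], [0.0, 0.01]], beta=1.0, alpha=0.1)
```

In this toy setup the learner plays the baseline while its estimates are uncertain and switches to exploring once some action's lower confidence bound clears the threshold, which is the intuition behind bounding how often the baseline must be played.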
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Machine Learning and Algorithms
