The best of both worlds: stochastic and adversarial bandits
Sebastien Bubeck, Aleksandrs Slivkins

TL;DR
This paper introduces SAO (Stochastic and Adversarial Optimal), a bandit algorithm whose regret is essentially optimal for both adversarial and stochastic rewards, the two main settings in multi-armed bandits, which prior work treated separately.
Contribution
SAO jointly optimizes for adversarial and stochastic rewards, which prior work did not attempt: it combines the square-root worst-case regret of Exp3 with the (poly)logarithmic regret of UCB1 for stochastic rewards.
Findings
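For adversarial rewards, SAO essentially matches the square-root worst-case regret of Exp3.
For stochastic rewards, SAO attains (poly)logarithmic regret, comparable to UCB1.
A single algorithm attains both guarantees, keeping good worst-case performance while also taking advantage of "nice" (stochastic) problem instances.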
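For reference, the two baseline guarantees that these findings compare against can be written out as below. These are the standard bounds for Exp3 and UCB1 from the Auer et al. papers cited in the abstract, not SAO's own bounds, which this summary does not state. The notation is an assumption made here for illustration: K arms, T rounds, rewards in [0, 1], mean rewards \mu_i with \mu^* the largest, and gaps \Delta_i = \mu^* - \mu_i.

```latex
% Exp3, adversarial rewards, exploration rate tuned to a known horizon T:
\max_{i} \sum_{t=1}^{T} x_i(t) \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} x_{I_t}(t)\right]
  \;\le\; 2\sqrt{e-1}\,\sqrt{T K \ln K}

% UCB1, stochastic rewards:
T\mu^* \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} \mu_{I_t}\right]
  \;\le\; 8 \sum_{i:\,\Delta_i > 0} \frac{\ln T}{\Delta_i}
  \;+\; \left(1 + \frac{\pi^2}{3}\right) \sum_{i=1}^{K} \Delta_i
```

The first bound grows like the square root of T regardless of how rewards are generated; the second grows only logarithmically in T but holds only when rewards are stochastic.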
Abstract
We present a new bandit algorithm, SAO (Stochastic and Adversarial Optimal), whose regret is, essentially, optimal both for adversarial rewards and for stochastic rewards. Specifically, SAO combines the square-root worst-case regret of Exp3 (Auer et al., SIAM J. on Computing 2002) and the (poly)logarithmic regret of UCB1 (Auer et al., Machine Learning 2002) for stochastic rewards. Adversarial rewards and stochastic rewards are the two main settings in the literature on (non-Bayesian) multi-armed bandits. Prior work on multi-armed bandits treats them separately, and does not attempt to jointly optimize for both. Our result falls into a general theme of achieving good worst-case performance while also taking advantage of "nice" problem instances, an important issue in the design of algorithms with partially known inputs.
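To make the two baselines named in the abstract concrete, here is a minimal Python sketch of UCB1 and Exp3 in their textbook form. It is not SAO and not the authors' code. The function names, the `pull` callback, and the `gamma` parameter are assumptions chosen for illustration, and rewards are assumed to lie in [0, 1].

```python
import math
import random


def ucb1(pull, n_arms, horizon):
    """Textbook UCB1: play each arm once, then pick the arm with the
    largest empirical mean plus confidence radius sqrt(2 ln t / n_i).
    `pull(arm)` is a hypothetical callback returning a reward in [0, 1].
    Returns the number of times each arm was played."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # initialization: try every arm once
        else:
            arm = max(
                range(n_arms),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
    return counts


def exp3(pull, n_arms, horizon, gamma):
    """Textbook Exp3: exponential weights over arms, mixed with uniform
    exploration at rate `gamma` (assumed to be in (0, 1]).
    Returns the number of times each arm was played."""
    weights = [1.0] * n_arms
    counts = [0] * n_arms
    for _ in range(horizon):
        total = sum(weights)
        probs = [(1.0 - gamma) * w / total + gamma / n_arms for w in weights]
        arm = random.choices(range(n_arms), weights=probs)[0]
        reward = pull(arm)
        counts[arm] += 1
        # Importance-weighted estimate: only the played arm is updated.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / n_arms)
        # Rescale so the largest weight is 1; probabilities are unchanged,
        # and this avoids floating-point overflow on long runs.
        biggest = max(weights)
        weights = [w / biggest for w in weights]
    return counts
```

On a stochastic instance (for example, Bernoulli arms with fixed means), both routines should concentrate their plays on the best arm, with UCB1 typically wasting far fewer rounds on the others. Exp3's guarantee, by contrast, does not depend on the rewards being stochastic. A common choice for `gamma` with a known horizon T and K arms is min(1, sqrt(K ln K / ((e - 1) T))).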
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Machine Learning and Algorithms
