Variational Bayesian Optimistic Sampling
Brendan O'Donoghue, Tor Lattimore

TL;DR
This paper derives a set of Bayesian optimistic policies for online sequential decision problems, gives a unified regret analysis for any policy in that set, extends the analysis to bilinear saddle-point problems, and proposes a flexible variational method for finding such policies.
Contribution
It develops a new class of Bayesian optimistic policies, including a variational method that works with any posterior, and extends regret analysis to bilinear saddle-point problems.
Findings
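Optimistic policies achieve $\tilde O(\sqrt{AT})$ Bayesian regret for a problem with $A$ actions after $T$ rounds.
In bilinear saddle-point problems, Thompson sampling can produce policies outside the optimistic set and suffer linear regret in some instances.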
The variational approach allows flexible policy tuning and constraint incorporation.
Abstract
We consider online sequential decision problems where an agent must balance exploration and exploitation. We derive a set of Bayesian `optimistic' policies which, in the stochastic multi-armed bandit case, includes the Thompson sampling policy. We provide a new analysis showing that any algorithm producing policies in the optimistic set enjoys $\tilde O(\sqrt{AT})$ Bayesian regret for a problem with $A$ actions after $T$ rounds. We extend the regret analysis for optimistic policies to bilinear saddle-point problems which include zero-sum matrix games and constrained bandits as special cases. In this case we show that Thompson sampling can produce policies outside of the optimistic set and suffer linear regret in some instances. Finding a policy inside the optimistic set amounts to solving a convex optimization problem and we call the resulting algorithm `variational Bayesian optimistic…
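For readers unfamiliar with the baseline named in the abstract, the following is a minimal sketch of Thompson sampling on a stochastic Bernoulli bandit. It is not the paper's variational algorithm. The function name `thompson_sampling`, the uniform Beta(1, 1) prior, the arm means, and the horizon are all assumptions chosen for illustration.

```python
import random


def thompson_sampling(means, horizon, seed=0):
    """Beta-Bernoulli Thompson sampling (illustrative sketch only).

    Returns the number of pulls per arm and the cumulative expected regret.
    `means` holds the true arm means, which the agent never observes directly.
    """
    rng = random.Random(seed)
    n_arms = len(means)
    successes = [0] * n_arms
    failures = [0] * n_arms
    counts = [0] * n_arms
    regret = 0.0
    best = max(means)
    for _ in range(horizon):
        # Draw one plausible mean per arm from its Beta posterior
        # (assumed uniform Beta(1, 1) prior).
        samples = [
            rng.betavariate(successes[a] + 1, failures[a] + 1)
            for a in range(n_arms)
        ]
        # Act greedily with respect to the sampled means.
        arm = max(range(n_arms), key=lambda a: samples[a])
        reward = 1 if rng.random() < means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        counts[arm] += 1
        # Expected regret of this pull relative to the best arm.
        regret += best - means[arm]
    return counts, regret


counts, regret = thompson_sampling([0.2, 0.5, 0.8], horizon=2000)
print(counts, round(regret, 1))
```

The regret reported here is for one fixed set of arm means; the Bayesian regret discussed in the abstract would additionally average over means drawn from the prior.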
Taxonomy
Topics: Advanced Bandit Algorithms Research · Optimization and Search Problems · Machine Learning and Algorithms
