From Bandits to Experts: A Tale of Domination and Independence
Noga Alon, Nicolò Cesa-Bianchi, Claudio Gentile, Yishay Mansour

TL;DR
This paper characterizes regret in the partial observability model for multi-armed bandits using graph parameters, and shows that in the undirected case optimal regret can be achieved without accessing the observability graph before selecting an action.
Contribution
It introduces a graph-theoretic characterization of regret in directed observability models and demonstrates that optimal regret is achievable without prior graph access in undirected models.
Findings
Regret in the directed observability model is characterized by the dominating and independence numbers of the observability graph.
Optimal regret is achievable in undirected models without prior access to the observability graph.
Variants of the Exp3 algorithm are used to achieve these results efficiently.
Abstract
We consider the partial observability model for multi-armed bandits, introduced by Mannor and Shamir. Our main result is a characterization of regret in the directed observability model in terms of the dominating and independence numbers of the observability graph. We also show that in the undirected case, the learner can achieve optimal regret without even accessing the observability graph before selecting an action. Both results are shown using variants of the Exp3 algorithm operating on the observability graph in a time-efficient manner.
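To make the idea of an Exp3 variant operating on the observability graph more concrete, the following is a minimal Python sketch of an exponential-weights learner with side observations. It is an illustration under stated assumptions, not the authors' exact algorithm: the function name `exp3_with_side_observations`, the fixed (time-invariant) graph, the learning rate `eta`, and the input layout are all hypothetical choices made for this example. Playing action `j` reveals the loss of `j` itself and of every action in `observes[j]`; each revealed loss is divided by the probability that it would have been observed, which keeps the loss estimates unbiased.

```python
import math
import random


def exp3_with_side_observations(losses, observes, eta, seed=0):
    """Exponential weights with importance-weighted side observations.

    losses   : list of T rounds, each a list of K losses in [0, 1]
    observes : observes[j] is the set of actions whose loss is revealed
               when j is played (j always sees its own loss)
    eta      : learning rate (assumed given; not tuned here)
    Returns the list of played actions and the final distribution.
    """
    rng = random.Random(seed)
    K = len(observes)
    log_w = [0.0] * K  # log-weights, for numerical stability
    played = []

    def distribution():
        m = max(log_w)
        w = [math.exp(x - m) for x in log_w]
        total = sum(w)
        return [x / total for x in w]

    def sees(j, i):
        # Playing j reveals the loss of i.
        return i == j or i in observes[j]

    for loss_t in losses:
        p = distribution()
        action = rng.choices(range(K), weights=p)[0]
        played.append(action)
        for i in range(K):
            if not sees(action, i):
                continue  # unobserved losses get a zero estimate
            # Probability that the loss of i is observed this round.
            q_i = sum(p[j] for j in range(K) if sees(j, i))
            log_w[i] -= eta * loss_t[i] / q_i

    return played, distribution()
```

Note that the sketch only uses the graph after the action has been drawn, to decide which losses were revealed and to compute the observation probabilities. This mirrors the abstract's point that, in the undirected case, the learner need not access the observability graph before selecting an action.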
Taxonomy
Topics: Advanced Bandit Algorithms Research · Machine Learning and Algorithms · Reinforcement Learning in Robotics
