Global Bandits
Onur Atan, Cem Tekin, Mihaela van der Schaar

TL;DR
This paper introduces 'global bandits', a new class of multi-armed bandit models where rewards are correlated through a single unknown parameter, and proposes algorithms with bounded and sub-linear regret for such models.
Contribution
The paper defines global bandits, in which the expected rewards of all arms depend on a shared unknown parameter, so that pulling one arm also informs the reward estimates of every other arm. It then develops algorithms that achieve bounded and sub-linear regret in this setting.
Findings
The greedy policy achieves bounded regret, with a bound that depends on the true parameter.
A variant attains $\tilde{O}(\sqrt{T})$ worst-case regret.
Experiments show significant gains in dynamic pricing applications.
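For reference, the regret notions these findings refer to can be written as below. This is the standard definition of expected (pseudo-)regret. The symbols ($\mu_k$ for arm k's expected reward, $\theta_*$ for the true parameter, $I_t$ for the arm chosen in round t, $C(\theta_*)$ for a parameter-dependent constant) are our own labels and may differ from the paper's notation.

```latex
% \mu_k(\theta): expected reward of arm k; \theta_*: true parameter; I_t: arm chosen in round t
\mu^*(\theta_*) = \max_{k} \mu_k(\theta_*), \qquad
\mathrm{Reg}(T) = \sum_{t=1}^{T} \mathbb{E}\left[ \mu^*(\theta_*) - \mu_{I_t}(\theta_*) \right]
% Bounded regret:    \mathrm{Reg}(T) \le C(\theta_*) for all T  (constant may depend on \theta_*)
% Sub-linear regret: \mathrm{Reg}(T) / T \to 0 as T \to \infty
% Worst-case bound:  \sup_{\theta_*} \mathrm{Reg}(T) = \tilde{O}(\sqrt{T})
```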
Abstract
Multi-armed bandits (MAB) model sequential decision making problems, in which a learner sequentially chooses arms with unknown reward distributions in order to maximize its cumulative reward. Most of the prior work on MAB assumes that the reward distributions of each arm are independent. But in a wide variety of decision problems, from drug dosage to dynamic pricing, the expected rewards of different arms are correlated, so that selecting one arm provides information about the expected rewards of other arms as well. We propose and analyze a class of models of such decision problems, which we call global bandits. In the case in which rewards of all arms are deterministic functions of a single unknown parameter, we construct a greedy policy that achieves bounded regret, with a bound that depends on the single true parameter of the problem. Hence, this policy selects…
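To make the single-parameter setting concrete, here is a minimal sketch of a greedy policy in its spirit. It is not the authors' exact algorithm. The sketch assumes three hypothetical arms whose expected rewards are linear, and hence invertible, functions of an unknown theta in [0, 1], Gaussian reward noise, and one initial pull of every arm. Each arm's sample mean is inverted into a parameter estimate. These estimates are averaged with weights proportional to pull counts, and the policy plays the arm that is best under the pooled estimate. The names `ARMS`, `mu`, `mu_inverse`, and `greedy_global_bandit` are illustrative only.

```python
import random
from typing import List, Tuple

# Hypothetical arms: expected reward of arm k is a_k * theta + b_k with a_k != 0,
# so each reward function is strictly monotone and invertible on [0, 1].
ARMS: List[Tuple[float, float]] = [
    (1.0, 0.0),   # arm 0: mu_0(theta) = theta
    (-1.0, 1.0),  # arm 1: mu_1(theta) = 1 - theta
    (0.5, 0.2),   # arm 2: mu_2(theta) = 0.5 * theta + 0.2
]


def mu(k: int, theta: float) -> float:
    """Expected reward of arm k under parameter theta."""
    a, b = ARMS[k]
    return a * theta + b


def mu_inverse(k: int, x: float) -> float:
    """Parameter value that explains mean reward x on arm k, clipped to [0, 1]."""
    a, b = ARMS[k]
    return min(1.0, max(0.0, (x - b) / a))


def greedy_global_bandit(theta_true: float, horizon: int, noise_sd: float = 0.1,
                         seed: int = 0) -> Tuple[List[int], float]:
    """Run a count-weighted greedy policy; return pull counts and pseudo-regret."""
    rng = random.Random(seed)
    n_arms = len(ARMS)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    best_mean = max(mu(k, theta_true) for k in range(n_arms))
    regret = 0.0

    for t in range(horizon):
        if t < n_arms:
            arm = t  # initialization (assumption): pull every arm once
        else:
            # Per-arm parameter estimates obtained by inverting sample means.
            estimates = [mu_inverse(k, sums[k] / counts[k]) for k in range(n_arms)]
            # Pooled estimate: weight each arm by its share of the t pulls so far.
            theta_hat = sum(counts[k] * estimates[k] for k in range(n_arms)) / t
            # Greedy step: play the arm with the highest reward under theta_hat.
            arm = max(range(n_arms), key=lambda k: mu(k, theta_hat))
        reward = rng.gauss(mu(arm, theta_true), noise_sd)
        counts[arm] += 1
        sums[arm] += reward
        # Pseudo-regret uses expected rewards, not the noisy observations.
        regret += best_mean - mu(arm, theta_true)

    return counts, regret


if __name__ == "__main__":
    counts, regret = greedy_global_bandit(theta_true=0.8, horizon=2000)
    print(counts, round(regret, 3))
```

Because every observation, from any arm, feeds the same estimate of theta, the pooled estimate keeps improving even while the policy plays only the apparently best arm. This is the intuition behind bounded regret in this model.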
Taxonomy
Topics: Advanced Bandit Algorithms Research
