Observe Before Play: Multi-armed Bandit with Pre-observations

Jinhang Zuo; Xiaoxi Zhang; Carlee Joe-Wong

arXiv:1911.09458 · cs.LG · November 22, 2019 · 1 citation

TL;DR

This paper introduces algorithms for multi-armed bandit problems in which a player can pay to pre-observe arm rewards, balancing observation cost against reward maximization. It extends the setting to multiple players with collision management, proves regret bounds logarithmic in $T$, and validates the algorithms empirically.

Contribution

The paper proposes the OBP-UCB algorithm for the single-player setting, and a centralized algorithm (C-MP-OBP) together with distributed algorithms for multi-player bandits with pre-observations, supported by theoretical regret bounds and empirical validation.

Findings

01

OBP-UCB achieves $O(K^2 \log T)$ regret in the single-player setting.

02

C-MP-OBP attains $O(\frac{K^4}{M^2} \log T)$ regret in the multi-player setting.

03

Distributed algorithms outperform heuristics and non-pre-observation policies.

Abstract

We consider the stochastic multi-armed bandit (MAB) problem in a setting where a player can pay to pre-observe arm rewards before playing an arm in each round. Apart from the usual trade-off between exploring new arms to find the best one and exploiting the arm believed to offer the highest reward, we encounter an additional dilemma: pre-observing more arms gives a higher chance to play the best one, but incurs a larger cost. For the single-player setting, we design an Observe-Before-Play Upper Confidence Bound (OBP-UCB) algorithm for $K$ arms with Bernoulli rewards, and prove a $T$-round regret upper bound $O(K^2 \log T)$. In the multi-player setting, collisions will occur when players select the same arm to play in the same round. We design a centralized algorithm, C-MP-OBP, and prove its $T$-round regret relative to an offline greedy strategy is upper bounded in $O(\frac{K^4}{M^2}\log…
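The observe-versus-cost dilemma in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the paper's OBP-UCB algorithm: the cost model (a fixed `obs_cost` subtracted per pre-observation), the stopping rule (stop at the first arm observed to pay 1, then play it), the UCB1-style index, and all function names are assumptions made for illustration only.

```python
import math
import random


def ucb_indices(counts, means, t):
    """UCB1-style index per arm; arms never observed get +inf so they come first."""
    return [
        float("inf") if n == 0 else m + math.sqrt(2.0 * math.log(t) / n)
        for n, m in zip(counts, means)
    ]


def run_observe_before_play(arm_probs, horizon, obs_cost, seed=0):
    """Simulate a UCB-ordered observe-then-play loop on Bernoulli arms.

    Assumptions (hypothetical, not taken from the source): every
    pre-observation costs `obs_cost`; arms are observed in decreasing
    index order; the player stops at the first arm whose observed reward
    is 1 and plays it; observing stops once the accumulated cost would
    reach the maximum reward of 1.
    """
    rng = random.Random(seed)
    k = len(arm_probs)
    counts = [0] * k      # number of times each arm was pre-observed
    means = [0.0] * k     # empirical mean reward of each arm
    total = 0.0           # cumulative reward minus observation costs

    for t in range(1, horizon + 1):
        # Draw this round's Bernoulli rewards (hidden until observed).
        rewards = [1 if rng.random() < p else 0 for p in arm_probs]
        idx = ucb_indices(counts, means, t)
        order = sorted(range(k), key=lambda a: idx[a], reverse=True)

        paid, gained = 0.0, 0.0
        for arm in order:
            if paid + obs_cost >= 1.0:
                break  # another observation could not pay for itself
            paid += obs_cost
            counts[arm] += 1
            means[arm] += (rewards[arm] - means[arm]) / counts[arm]
            if rewards[arm] == 1:
                gained = 1.0  # play the arm that was just seen to pay out
                break
        total += gained - paid

    return total, counts


if __name__ == "__main__":
    total, counts = run_observe_before_play([0.9, 0.5, 0.2], 2000, 0.1, seed=1)
    print("net reward:", round(total, 1), "observations per arm:", counts)
```

Raising `obs_cost` in this sketch shortens the list of arms worth observing in a round, which is the tension the abstract describes: more pre-observations raise the chance of playing a rewarding arm but reduce the net payoff.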

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Optimization and Search Problems