Maximal Objectives in the Multi-armed Bandit with Applications
Eren Ozbay, Vijay Kamble

TL;DR
This paper introduces a new objective for the multi-armed bandit problem: maximizing the expected highest total reward earned by any single arm. It provides theoretical regret bounds and an adaptive policy, with applications to managing participants on online platforms.
Contribution
It proposes a novel 'max' objective for multi-armed bandits, derives regret bounds, and develops an adaptive policy that outperforms natural alternatives in practical scenarios.
Findings
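Lower bounds for the max objective: with $K$ arms and $T$ pulls, any policy incurs an instance-dependent asymptotic regret of $\Omega(\log T)$ and a worst-case regret of $\Omega(K^{1/3} T^{2/3})$.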
An adaptive explore-then-commit policy achieves near-optimal regret bounds.
Numerical experiments show the policy's effectiveness over alternatives.
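To make the explore-then-commit idea concrete, here is a minimal Python sketch. It is not the paper's adaptive policy: it uses a fixed exploration length rather than an adaptive stopping rule, and the function name `explore_then_commit`, the Bernoulli reward model, and the parameter values are all assumptions chosen for illustration.

```python
import random


def explore_then_commit(means, horizon, explore_per_arm, rng):
    """Non-adaptive explore-then-commit on Bernoulli arms (illustrative only).

    Returns the list of total rewards earned by each arm after `horizon` pulls.
    """
    k = len(means)
    totals = [0] * k
    pulls = 0
    # Exploration: pull each arm `explore_per_arm` times in round-robin order.
    for _ in range(explore_per_arm):
        for arm in range(k):
            if pulls >= horizon:
                break
            totals[arm] += 1 if rng.random() < means[arm] else 0
            pulls += 1
    # Commit: play the arm with the highest total so far for all remaining pulls.
    best = max(range(k), key=lambda a: totals[a])
    for _ in range(horizon - pulls):
        totals[best] += 1 if rng.random() < means[best] else 0
    return totals


# Hypothetical instance: 3 arms, 1000 pulls, 50 exploration pulls per arm.
rng = random.Random(0)
totals = explore_then_commit([0.3, 0.5, 0.7], horizon=1000, explore_per_arm=50, rng=rng)
print("per-arm totals:", totals)
print("max objective:", max(totals))
```

The payoff under the max objective is `max(totals)`, so rewards earned on arms other than the one with the highest total do not count toward it.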
Abstract
In several applications of the stochastic multi-armed bandit problem, the traditional objective of maximizing the expected total reward can be inappropriate. In this paper, motivated by certain operational concerns in online platforms, we consider a new objective in the classical setup. Given $K$ arms, instead of maximizing the expected total reward from $T$ pulls (the traditional "sum" objective), we consider the vector of total rewards earned from each of the $K$ arms at the end of $T$ pulls and aim to maximize the expected highest total reward across arms (the "max" objective). For this objective, we show that any policy must incur an instance-dependent asymptotic regret of $\Omega(\log T)$ (with a higher instance-dependent constant compared to the traditional objective) and a worst-case regret of $\Omega(K^{1/3} T^{2/3})$. We then design an adaptive explore-then-commit policy…
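The difference between the "sum" and "max" objectives can be illustrated on realized rewards. The short Python sketch below uses made-up reward sequences and hypothetical helper names (`sum_objective`, `max_objective`). It evaluates a single realization, whereas the objectives in the abstract are expectations over such realizations.

```python
def sum_objective(rewards_per_arm):
    """Total reward collected across all arms (the traditional objective)."""
    return sum(sum(r) for r in rewards_per_arm)


def max_objective(rewards_per_arm):
    """Highest total reward earned by any single arm (the "max" objective)."""
    return max(sum(r) for r in rewards_per_arm)


# Hypothetical realized rewards for 3 arms over 6 pulls in total.
spread_out = [[1, 1], [1, 1], [1, 0]]      # pulls split evenly across arms
concentrated = [[1, 1, 0, 1, 1], [1], []]  # pulls mostly on one arm

print(sum_objective(spread_out), max_objective(spread_out))
print(sum_objective(concentrated), max_objective(concentrated))
```

Both allocations collect the same total reward, but the concentrated one scores higher under the max objective because a single arm accumulates most of it.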
Taxonomy
Topics: Advanced Bandit Algorithms Research · Optimization and Search Problems · Auction Theory and Applications
