Lenient Regret for Multi-Armed Bandits
Nadav Merlis, Shie Mannor

TL;DR
This paper introduces a lenient regret criterion for multi-armed bandits, focusing on near-optimal arms instead of absolute optimality, and proposes an asymptotically optimal Thompson Sampling variant called epsilon-TS.
Contribution
It proposes a new lenient regret measure and develops epsilon-TS, a variant of Thompson Sampling, with proven asymptotic optimality under this criterion.
Findings
Epsilon-TS achieves asymptotic optimality in lenient regret.
When the optimal arm's mean is high enough, the lenient regret of epsilon-TS is bounded by a constant.
When the agent knows a lower bound on the suboptimality gaps, epsilon-TS can be applied to improve performance.
Abstract
We consider the Multi-Armed Bandit (MAB) problem, where an agent sequentially chooses actions and observes rewards for the actions it took. While the majority of algorithms try to minimize the regret, i.e., the cumulative difference between the reward of the best action and the agent's action, this criterion might lead to undesirable results. For example, in large problems, or when the interaction with the environment is brief, finding an optimal arm is infeasible, and regret-minimizing algorithms tend to over-explore. To overcome this issue, algorithms for such settings should instead focus on playing near-optimal arms. To this end, we suggest a new, more lenient, regret criterion that ignores suboptimality gaps smaller than some epsilon. We then present a variant of the Thompson Sampling (TS) algorithm, called epsilon-TS, and prove its asymptotic optimality in terms of the…
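The difference between standard and lenient regret can be illustrated with a minimal Python sketch. It assumes the arm means are known (as they would be in a simulation) and that gaps larger than epsilon are charged in full; the abstract only states that smaller gaps are ignored, so this weighting is an assumption and the paper may use a different form. The function names and example values are hypothetical.

```python
def standard_regret(means, pulls):
    """Cumulative gap between the best arm's mean and each pulled arm's mean."""
    best = max(means)
    return sum(best - means[arm] for arm in pulls)


def lenient_regret(means, pulls, eps):
    """Same sum, but gaps of at most eps contribute nothing.

    Assumption: gaps larger than eps are counted in full. The source
    only says that gaps smaller than epsilon are ignored.
    """
    best = max(means)
    total = 0.0
    for arm in pulls:
        gap = best - means[arm]
        if gap > eps:
            total += gap
    return total


# Hypothetical three-armed problem: arm 1 is near-optimal, arm 2 is poor.
means = [0.9, 0.85, 0.5]
pulls = [0, 1, 1, 2]  # sequence of arms chosen by some agent

print("standard:", standard_regret(means, pulls))
print("lenient (eps=0.1):", lenient_regret(means, pulls, eps=0.1))
```

In this example the two pulls of the near-optimal arm add to the standard regret but not to the lenient regret, so an agent that settles on that arm is not penalized under the lenient criterion.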
Taxonomy
Topics: Advanced Bandit Algorithms Research · Optimization and Search Problems · Machine Learning and Algorithms
