p-Mean Regret for Stochastic Bandits
Anand Krishna, Philips George John, Adarsh Barik, Vincent Y. F. Tan

TL;DR
This paper introduces a flexible $p$-mean regret framework for stochastic bandits, in which the parameter $p$ trades off fairness against efficiency, and gives a unified UCB-based algorithm with new regret bounds across a range of $p$ values.
Contribution
The work extends $p$-mean welfare to bandit regret, proposing a simple unified algorithm with novel bounds applicable to a range of $p$ values, including Nash regret.
Findings
Achieves $p$-mean regret bounds of $\tilde{O}(\sqrt{T^{1/(2|p|)}})$ for $p \neq 0$.
Matches lower bounds for $0 < p < 1$ up to logarithmic factors.
Unifies analysis for average and Nash regret with a single algorithm.
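To make the unification of average and Nash regret concrete, the $p$-mean regret can be written out explicitly. The block below is a sketch in notation assumed here ($k$ arms with positive means $\mu_1, \dots, \mu_k$, best mean $\mu^*$, and $I_t$ the arm pulled in round $t$), reconstructed from the abstract's verbal definition; the paper's exact statement may differ in details.

```latex
% Assumed notation: \mu^* = \max_i \mu_i, and I_t is the arm pulled in round t.
\[
  \mathrm{R}^{p}_{T} \;=\; \mu^{*} -
  \left( \frac{1}{T} \sum_{t=1}^{T} \bigl(\mathbb{E}[\mu_{I_t}]\bigr)^{p} \right)^{1/p}
\]
% p = 1: the p-mean is the arithmetic mean, giving average regret.
\[
  \mathrm{R}^{1}_{T} \;=\; \mu^{*} - \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}[\mu_{I_t}]
\]
% p -> 0: the p-mean tends to the geometric mean, giving Nash regret.
\[
  \lim_{p \to 0} \mathrm{R}^{p}_{T} \;=\; \mu^{*} -
  \Bigl( \prod_{t=1}^{T} \mathbb{E}[\mu_{I_t}] \Bigr)^{1/T}
\]
```

Because the power mean is non-decreasing in $p$, smaller values of $p$ give a larger regret for the same sequence of pulls, which is why lowering $p$ penalizes rounds with low expected reward more heavily.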
Abstract
In this work, we extend the concept of the $p$-mean welfare objective from social choice theory (Moulin 2004) to study $p$-mean regret in stochastic multi-armed bandit problems. The $p$-mean regret, defined as the difference between the optimal mean among the arms and the $p$-mean of the expected rewards, offers a flexible framework for evaluating bandit algorithms, enabling algorithm designers to balance fairness and efficiency by adjusting the parameter $p$. Our framework encompasses both average cumulative regret and Nash regret as special cases. We introduce a simple, unified UCB-based algorithm (Explore-Then-UCB) that achieves novel $p$-mean regret bounds. Our algorithm consists of two phases: a carefully calibrated uniform exploration phase to initialize sample means, followed by the UCB1 algorithm of Auer, Cesa-Bianchi, and Fischer (2002). Under mild assumptions, we prove that…
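The two-phase structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the exploration length `explore_per_arm` is a hypothetical parameter (the paper's calibrated value is not given here), rewards are assumed to be Bernoulli, arm means are assumed to lie in (0, 1], and the regret is estimated from a single run by using the true mean of each pulled arm in place of its expectation.

```python
import math
import random


def p_mean(values, p):
    """Power mean of positive values; p = 0 is treated as the geometric mean."""
    n = len(values)
    if p == 0:
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)


def explore_then_ucb(arm_means, horizon, explore_per_arm, seed=0):
    """Uniform exploration followed by UCB1; returns the pulled arm indices.

    explore_per_arm is a hypothetical knob, not the paper's calibrated value.
    """
    if explore_per_arm < 1:
        raise ValueError("explore_per_arm must be at least 1")
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k   # number of pulls per arm
    sums = [0.0] * k   # total observed reward per arm
    pulls = []

    def pull(arm):
        # Bernoulli reward with the arm's mean (an assumption of this sketch).
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        pulls.append(arm)

    # Phase 1: round-robin exploration to initialize the sample means.
    for t in range(min(horizon, k * explore_per_arm)):
        pull(t % k)

    # Phase 2: UCB1 index = sample mean + sqrt(2 ln t / n_i).
    while len(pulls) < horizon:
        t = len(pulls) + 1
        index = [
            sums[i] / counts[i] + math.sqrt(2.0 * math.log(t) / counts[i])
            for i in range(k)
        ]
        pull(max(range(k), key=lambda i: index[i]))
    return pulls


def p_mean_regret(arm_means, pulls, p):
    """Single-run estimate: best mean minus the p-mean of pulled arms' means."""
    return max(arm_means) - p_mean([arm_means[a] for a in pulls], p)


if __name__ == "__main__":
    means = [0.9, 0.5, 0.2]
    pulls = explore_then_ucb(means, horizon=2000, explore_per_arm=20)
    for p in (1, 0, -1):
        print(p, p_mean_regret(means, pulls, p))
```

Running the script prints one regret estimate per value of `p`. Since the power mean is non-decreasing in `p`, the estimates for `p = 0` and `p = -1` are never smaller than the one for `p = 1` on the same run.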
Taxonomy
Topics: Advanced Bandit Algorithms Research · Data Stream Mining Techniques · Distributed Sensor Networks and Detection Algorithms
