Revisiting Social Welfare in Bandits: UCB is (Nearly) All You Need
Dhruv Sarkar, Nishant Pandey, Sayak Ray Chowdhury

TL;DR
This paper shows that a simple UCB algorithm with an initial exploration phase achieves near-optimal Nash regret, a fairness-aware performance measure, in stochastic bandits. The result extends to a broad class of fairness metrics, also with near-optimal guarantees.
Contribution
The paper demonstrates that a standard UCB algorithm, combined with initial exploration, suffices for near-optimal Nash regret, removing the need for specialized, assumption-heavy algorithms.
Findings
UCB with exploration achieves near-optimal Nash regret.
The approach extends to sub-Gaussian rewards.
The method generalizes to p-mean regret with strong guarantees.
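As a rough illustration of the p-mean generalization mentioned above, the sketch below assumes the standard generalized (power) mean, in which p = 1 gives the arithmetic mean and the limit p → 0 gives the geometric mean used by Nash regret. The function names `p_mean` and `p_mean_regret` are hypothetical, and the paper's exact definition and admissible range of p may differ.

```python
import math

def p_mean(values, p):
    """Generalized (power) mean of strictly positive values.

    Assumption: standard definition ((1/n) * sum(v**p)) ** (1/p),
    with p == 0 treated as the geometric-mean limit.
    """
    n = len(values)
    if p == 0:
        # Geometric mean, computed in log space for numerical stability.
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)

def p_mean_regret(best_mean, expected_rewards, p):
    """Best arm's mean minus the p-mean of the per-round expected rewards."""
    return best_mean - p_mean(expected_rewards, p)
```

For the rewards `[1.0, 4.0]`, `p_mean` returns 2.5 at p = 1, approximately 2.0 at p = 0, and approximately 1.6 at p = -1. Smaller p penalizes low-reward rounds more heavily, so the corresponding regret is larger.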
Abstract
Regret in stochastic multi-armed bandits traditionally measures the difference between the highest reward and either the arithmetic mean of accumulated rewards or the final reward. These conventional metrics often fail to address fairness among agents receiving rewards, particularly in settings where rewards are distributed across a population, such as patients in clinical trials. To address this, a recent body of work has introduced Nash regret, which evaluates performance via the geometric mean of accumulated rewards, aligning with the Nash social welfare function known for satisfying fairness axioms. To minimize Nash regret, existing approaches require specialized algorithm designs and strong assumptions, such as multiplicative concentration inequalities and bounded, non-negative rewards, making them unsuitable for even Gaussian reward distributions. We demonstrate that an initial…
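To make the contrast between the arithmetic-mean and geometric-mean views concrete, here is a minimal simulation sketch. It is not the paper's algorithm: the round-robin exploration scheme, the `explore_rounds` parameter, the confidence-bonus constant, and all function names are assumptions chosen for illustration. The regret values are single-run plug-in quantities computed from the means of the arms actually pulled, not the expectations used in the formal definitions.

```python
import math
import random

def average_regret(best_mean, pulled_means):
    """Best mean minus the arithmetic mean of the pulled arms' means."""
    return best_mean - sum(pulled_means) / len(pulled_means)

def nash_regret(best_mean, pulled_means):
    """Best mean minus the geometric mean of the pulled arms' means (all > 0)."""
    log_avg = sum(math.log(m) for m in pulled_means) / len(pulled_means)
    return best_mean - math.exp(log_avg)

def ucb_with_initial_exploration(means, horizon, explore_rounds, sigma=0.1, seed=0):
    """Round-robin exploration, then UCB. Rewards are Gaussian (hence sub-Gaussian).

    Assumption: the bonus sigma * sqrt(2 * log(horizon) / n) is a generic
    sub-Gaussian confidence width, not the paper's exact choice.
    """
    k = len(means)
    if explore_rounds < k:
        raise ValueError("explore_rounds must cover every arm at least once")
    rng = random.Random(seed)
    counts = [0] * k
    sums = [0.0] * k
    pulled_means = []
    for t in range(horizon):
        if t < explore_rounds:
            arm = t % k  # hypothetical exploration scheme: cycle through arms
        else:
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + sigma * math.sqrt(2.0 * math.log(horizon) / counts[i]),
            )
        reward = rng.gauss(means[arm], sigma)
        counts[arm] += 1
        sums[arm] += reward
        pulled_means.append(means[arm])
    return pulled_means

if __name__ == "__main__":
    means = [0.2, 0.5, 0.9]
    pulled = ucb_with_initial_exploration(means, horizon=2000, explore_rounds=30)
    print("average regret:", average_regret(max(means), pulled))
    print("Nash regret:   ", nash_regret(max(means), pulled))
```

Because the geometric mean never exceeds the arithmetic mean, the Nash regret of a run is always at least its average regret, which is why it is the more demanding criterion.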
