Dueling Bandits With Weak Regret
Bangrui Chen, Peter I. Frazier

TL;DR
This paper introduces the Winner Stays (WS) algorithm for dueling bandits, with variants that target weak regret and strong regret in content recommendation with pairwise feedback. The authors report that it outperforms existing methods in simulations and on real data.
Contribution
The paper proposes Winner Stays, the first dueling bandit algorithm designed to optimize weak regret, with theoretical guarantees and a computationally simple implementation for both the weak and strong regret settings.
Findings
WS-W achieves constant weak regret over time.
WS outperforms existing algorithms in simulations.
WS is computationally simple for many arms.
Abstract
We consider online content recommendation with implicit feedback through pairwise comparisons, formalized as the so-called dueling bandit problem. We study the dueling bandit problem in the Condorcet winner setting, and consider two notions of regret: the more well-studied strong regret, which is 0 only when both arms pulled are the Condorcet winner; and the less well-studied weak regret, which is 0 if either arm pulled is the Condorcet winner. We propose a new algorithm for this problem, Winner Stays (WS), with variations for each kind of regret: WS for weak regret (WS-W) has expected cumulative weak regret that is $O(N^2)$, and $O(N \log(N))$ if arms have a total order; WS for strong regret (WS-S) has expected cumulative strong regret of $O(N^2 + N \log(T))$, and $O(N \log(N) + N \log(T))$ if arms have a total order. WS-W is the first dueling bandit algorithm with weak regret that is…
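The two regret notions in the abstract can be illustrated with a minimal Python sketch. It assumes a binary form of regret (1 per round when the condition fails, 0 otherwise), which is one common convention and may differ from the exact definition used in the paper. The function names, the arm indices, and the example sequence of pulls are all hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Tuple

Pair = Tuple[int, int]  # the two arms pulled in one round


def weak_regret(pair: Pair, winner: int) -> int:
    """Binary weak regret (assumed form): 0 if either pulled arm is the Condorcet winner."""
    first, second = pair
    return 0 if (first == winner or second == winner) else 1


def strong_regret(pair: Pair, winner: int) -> int:
    """Binary strong regret (assumed form): 0 only if both pulled arms are the Condorcet winner."""
    first, second = pair
    return 0 if (first == winner and second == winner) else 1


def cumulative_regret(
    pairs: List[Pair], winner: int, regret_fn: Callable[[Pair, int], int]
) -> int:
    """Sum a per-round regret function over a sequence of pulled pairs."""
    return sum(regret_fn(pair, winner) for pair in pairs)


# Hypothetical three-round history in which arm 0 is the Condorcet winner.
history = [(0, 0), (0, 1), (1, 2)]
total_weak = cumulative_regret(history, winner=0, regret_fn=weak_regret)
total_strong = cumulative_regret(history, winner=0, regret_fn=strong_regret)
```

In this example the second round, which pairs the winner with another arm, incurs strong regret but no weak regret. That difference is why an algorithm can keep cumulative weak regret bounded while still exploring other arms.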
Taxonomy
Topics: Advanced Bandit Algorithms Research · Auction Theory and Applications · Machine Learning and Algorithms
