Relative Upper Confidence Bound for the K-Armed Dueling Bandit Problem
Masrour Zoghi, Shimon Whiteson, Remi Munos, Maarten de Rijke

TL;DR
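This paper introduces an Upper Confidence Bound (UCB)-based algorithm for the K-armed dueling bandit problem, in which the only feedback is relative: the outcome of a comparison between a pair of arms. The algorithm comes with a finite-time regret bound of order O(log t) and outperforms prior methods in experiments on real information retrieval data.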
Contribution
It extends the UCB algorithm to the dueling bandit setting, proves a finite-time regret bound for the resulting method, and reports improved empirical results over existing methods.
Findings
Achieves a finite-time regret bound of order O(log t)
Outperforms state-of-the-art algorithms in experiments on real data from an information retrieval application
Learns from relative (pairwise) feedback alone, without absolute rewards for individual arms
Abstract
This paper proposes a new method for the K-armed dueling bandit problem, a variation on the regular K-armed bandit problem that offers only relative feedback about pairs of arms. Our approach extends the Upper Confidence Bound algorithm to the relative setting by using estimates of the pairwise probabilities to select a promising arm and applying Upper Confidence Bound with the winner as a benchmark. We prove a finite-time regret bound of order O(log t). In addition, our empirical results using real data from an information retrieval application show that it greatly outperforms the state of the art.
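The two-stage selection described in the abstract can be sketched in Python as follows. This is a minimal illustration written under stated assumptions, not the authors' reference implementation: the function names `rucb_step` and `simulate`, the preference matrix `pref`, the default exploration parameter `alpha=0.51`, the random tie-breaking, and the choice to exclude the benchmark arm from the challenger search are all assumptions made for the example.

```python
import math
import random


def rucb_step(wins, t, alpha=0.51, rng=random):
    """Pick a pair (c, d) of arms to duel at time step t >= 1.

    wins[i][j] counts how often arm i has beaten arm j so far.
    alpha is an assumed exploration parameter (hypothetical default).
    """
    k = len(wins)
    # Optimistic (upper-confidence) estimate u[i][j] of P(arm i beats arm j).
    u = [[0.5] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            if i == j:
                continue
            n = wins[i][j] + wins[j][i]
            if n == 0:
                u[i][j] = 1.0  # pair never compared: fully optimistic
            else:
                u[i][j] = wins[i][j] / n + math.sqrt(alpha * math.log(t) / n)
    # Stage 1: promising arms are those that, optimistically, lose to no one.
    candidates = [i for i in range(k) if all(u[i][j] >= 0.5 for j in range(k))]
    c = rng.choice(candidates) if candidates else rng.randrange(k)
    # Stage 2: UCB against the benchmark c. Pick the challenger with the
    # highest upper bound on beating c (c itself excluded: an assumption).
    d = max((j for j in range(k) if j != c), key=lambda j: u[j][c])
    return c, d


def simulate(pref, horizon, alpha=0.51, seed=0):
    """Run the sketch against a hypothetical preference matrix.

    pref[i][j] is the assumed probability that arm i beats arm j.
    Returns the win counts and how often each arm was the benchmark.
    """
    rng = random.Random(seed)
    k = len(pref)
    wins = [[0] * k for _ in range(k)]
    picks = [0] * k
    for t in range(1, horizon + 1):
        c, d = rucb_step(wins, t, alpha, rng)
        picks[c] += 1
        # Only relative feedback is observed: which arm won this duel.
        if rng.random() < pref[c][d]:
            wins[c][d] += 1
        else:
            wins[d][c] += 1
    return wins, picks
```

Calling `simulate(pref, horizon=1000)` with a matrix in which one arm beats every other arm with probability above 1/2 lets one inspect `picks` to see how often each arm served as the benchmark. The learner never observes a reward for a single arm, only the winner of each comparison, which is the relative-feedback restriction the abstract refers to.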
Taxonomy
Topics: Advanced Bandit Algorithms Research · Optimization and Search Problems · Auction Theory and Applications
