Adversarial Dueling Bandits
Aadirupa Saha, Tomer Koren, Yishay Mansour

TL;DR
This paper studies regret minimization in adversarial dueling bandits, introducing algorithms with matching upper and lower bounds for regret in both general and fixed-gap settings, advancing understanding of preference-based online learning.
Contribution
The paper proposes algorithms with tight regret bounds for adversarial dueling bandits, covering both the general Borda-winner setting and a simpler fixed-gap model that sits between stationary preferences and an arbitrary sequence of preferences.
Findings
Achieves an $\tilde{O}(K^{1/3}T^{2/3})$ regret bound against the Borda-winner in the adversarial setting.
Provides a lower bound of $\tilde{\Omega}(K^{1/3}T^{2/3})$, matching the upper bound.
In the fixed-gap setup, offers an $\tilde{O}((K/\Delta^2)\log T)$ regret algorithm with tight lower bounds.
Abstract
We introduce the problem of regret minimization in Adversarial Dueling Bandits. As in classic Dueling Bandits, the learner has to repeatedly choose a pair of items and observe only a relative binary 'win-loss' feedback for this pair, but here this feedback is generated from an arbitrary preference matrix, possibly chosen adversarially. Our main result is an algorithm whose $T$-round regret compared to the Borda-winner from a set of $K$ items is $\tilde{O}(K^{1/3}T^{2/3})$, as well as a matching lower bound. We also prove a similar high probability regret bound. We further consider a simpler fixed-gap adversarial setup, which bridges between two extreme preference feedback models for dueling bandits: stationary preferences and an arbitrary sequence of preferences. For the fixed-gap adversarial setup we give an $\smash{ \tilde{O}((K/\Delta^2)\log{T})…
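To make the Borda-winner benchmark concrete, here is a minimal Python sketch. It assumes the standard definitions, which the excerpt above does not spell out: the Borda score of item $i$ under preference matrix $P_t$ is the average of $P_t(i,j)$ over $j \neq i$, the Borda-winner is the item with the largest cumulative score, and the per-round regret of a played pair is the winner's score minus the average score of the two played items. The function names `borda_scores` and `borda_regret` and the example matrix are hypothetical, chosen only for illustration.

```python
import numpy as np

def borda_scores(P):
    """Borda score of each item under preference matrix P.

    Assumed convention: P[i, j] is the probability that item i beats item j,
    with P[i, j] + P[j, i] = 1 and P[i, i] = 0.5.
    """
    K = P.shape[0]
    # Average over opponents j != i, so the diagonal is subtracted out.
    return (P.sum(axis=1) - np.diag(P)) / (K - 1)

def borda_regret(P_seq, pairs):
    """Cumulative regret of the played pairs against the best fixed item.

    P_seq: sequence of T preference matrices (may change arbitrarily per round).
    pairs: sequence of T (x_t, y_t) index pairs chosen by the learner.
    """
    scores = np.array([borda_scores(P) for P in P_seq])  # shape (T, K)
    best = int(np.argmax(scores.sum(axis=0)))            # Borda-winner in hindsight
    played = np.array(
        [(scores[t, x] + scores[t, y]) / 2 for t, (x, y) in enumerate(pairs)]
    )
    return best, float((scores[:, best] - played).sum())

# Illustrative example: item 0 beats the others with probability 0.8.
P = np.array([
    [0.5, 0.8, 0.8],
    [0.2, 0.5, 0.6],
    [0.2, 0.4, 0.5],
])
best, regret = borda_regret([P, P], [(0, 0), (1, 2)])
```

In this toy run the learner pays nothing for dueling the Borda-winner against itself in the first round, and pays the score gap to the pair (1, 2) in the second. The sketch only evaluates regret after the fact; it is not the algorithm from the paper.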
Taxonomy
Topics: Advanced Bandit Algorithms Research · Machine Learning and Algorithms · Optimization and Search Problems
