Simple Algorithms for Dueling Bandits

Tyler Lekang; Andrew Lamperski

arXiv:1906.07611·cs.LG·June 19, 2019


TL;DR

This paper introduces simple algorithms for Dueling Bandits, providing regret bounds independent of preference gaps, and demonstrates their competitive performance through theoretical analysis and experiments.

Contribution

The paper proposes new simple algorithms for Dueling Bandits with regret bounds not depending on preference gaps, advancing the state-of-the-art.

Findings

1. Regret bounds of order O(T^rho) with 1/2 <= rho <= 3/4

2. Algorithms outperform existing methods in some synthetic experiments

3. Regret performance comparable or better than state-of-the-art algorithms

Abstract

In this paper, we present simple algorithms for Dueling Bandits. We prove that the algorithms have regret bounds for time horizon T of order O(T^rho ) with 1/2 <= rho <= 3/4, which importantly do not depend on any preference gap between actions, Delta. Dueling Bandits is an important extension of the Multi-Armed Bandit problem, in which the algorithm must select two actions at a time and only receives binary feedback for the duel outcome. This is analogous to comparisons in which the rater can only provide yes/no or better/worse type responses. We compare our simple algorithms to the current state-of-the-art for Dueling Bandits, ISS and DTS, discussing complexity and regret upper bounds, and conducting experiments on synthetic data that demonstrate their regret performance, which in some cases exceeds state-of-the-art.

Equations

$$i^{*}_{C}\in\operatorname*{arg\,max}_{i}\sum_{j}\mathbb{1}\big[X(i,j)>0.5\big]$$

$$i^{*}_{M}\in\operatorname*{arg\,max}_{i}\,\min_{j}X(i,j),\qquad i^{*}_{B}\in\operatorname*{arg\,max}_{i}\,\frac{1}{A}\sum_{j}X(i,j)$$

$w_{p,t}(i)$

$p_{t+1}(i)$

$$\widetilde{\mathbf{X}}_{p,t}(i,\mathbf{J}_{t})=\frac{\mathbf{X}_{t}(i,\mathbf{J}_{t})\,\mathbb{1}(i=\mathbf{I}_{t})+\beta}{\mathbf{p}_{t}(i)},\qquad\widetilde{\mathbf{X}}_{q,t}(j,\mathbf{I}_{t})=\frac{\mathbf{X}_{t}(j,\mathbf{I}_{t})\,\mathbb{1}(j=\mathbf{J}_{t})+\beta}{\mathbf{q}_{t}(j)}$$

$w_{t}(i)$

$p_{t+1}(i)$

$$\mathbf{g}_{t}(i)=\frac{1}{A}\sum_{k=1}^{A}\mathbf{X}_{t}(i,k),\qquad\widetilde{\mathbf{g}}_{t}(i)=\frac{1}{A}\,\frac{\mathbf{X}_{t}(i,\mathbf{J}_{t})\,\mathbb{1}(i=\mathbf{I}_{t})}{\mathbf{p}_{t}(\mathbf{I}_{t})\,\mathbf{p}_{t}(\mathbf{J}_{t})}$$

$$\mathbf{R}_{T}=\frac{1}{2}\sum_{t=1}^{T}\Big(\mathbf{X}_{t}(\mathbf{i}^{*}_{M},\mathbf{I}_{t})-\mathbf{X}_{t}(\mathbf{i}^{*}_{M},\mathbf{i}^{*}_{M})+\mathbf{X}_{t}(\mathbf{i}^{*}_{M},\mathbf{J}_{t})-\mathbf{X}_{t}(\mathbf{i}^{*}_{M},\mathbf{i}^{*}_{M})\Big)$$

$$\mathbb{E}[\mathbf{R}_{T}]\leq\frac{A}{\sqrt{2}}\sqrt{T\log A}$$

$$\mathbf{R}_{T}=\frac{1}{2A}\sum_{t=1}^{T}\sum_{k=1}^{A}\Big(\mathbf{X}_{t}(\mathbf{i}^{*}_{B},k)-\mathbf{X}_{t}(\mathbf{I}_{t},k)+\mathbf{X}_{t}(\mathbf{i}^{*}_{B},k)-\mathbf{X}_{t}(\mathbf{J}_{t},k)\Big)$$

$$\mathbb{E}[\mathbf{R}_{T}]\leq\Big(c+\frac{A}{c}\log A\Big)\,T^{2/3}$$

$$\mathbf{R}_{T}=\frac{1}{2}\sum_{t=1}^{T}\Big(\mathbf{X}_{t}(i^{*}_{M},\mathbf{J}_{t})-\mathbf{X}_{t}(\mathbf{I}_{t},\mathbf{J}_{t})+\mathbf{X}_{t}(i^{*}_{M},\mathbf{I}_{t})-\mathbf{X}_{t}(\mathbf{I}_{t},\mathbf{J}_{t})\Big)$$

$$\beta=\sqrt{\frac{\log A}{A\,T}},\qquad\eta=0.95\sqrt{\frac{\log A}{A\,T}},\qquad\gamma=1.05\sqrt{\frac{A\log A}{T}}$$

$$T\geq\max\bigg[4.41\,A\log A,\ \frac{0.95^{2}\log A}{0.1^{2}\,A}\bigg]$$

$$\mathbb{E}[\mathbf{R}_{T}]\leq\bigg(\sqrt{A\,(\log A)^{-1}}+4.2\sqrt{A\log A}\bigg)\sqrt{T}$$

$$\mathbf{R}_{T}=\frac{1}{2A}\sum_{t=1}^{T}\sum_{k=1}^{A}\Big(\mathbf{X}_{t}(i^{*}_{B},k)-\mathbf{X}_{t}(\mathbf{I}_{t},k)+\mathbf{X}_{t}(i^{*}_{B},k)-\mathbf{X}_{t}(\mathbf{J}_{t},k)\Big)$$

$$\eta=(e-2)^{-1/4}\,\bigg(\frac{\log A}{A^{2/3}\,T}\bigg)^{3/4},\qquad\gamma=(e-2)^{1/4}\,\bigg(\frac{A^{2}\log A}{T}\bigg)^{1/4}$$

$$T>(e-2)\,A^{2}\log A$$

$$\mathbb{E}[\mathbf{R}_{T}]\leq 2\,(e-2)^{1/4}\,A\,(\log A)^{1/2}\,T^{3/4}$$

$$\mathbb{P}[i\succ j]=X(i,j)=\frac{1}{1+\exp\big(u(j)-u(i)\big)}$$

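$$\mathbb{E}_{t}[\mathbf{r}_{t}]=\sum_{i^{*},j}\mathbb{P}_{t}(\mathbf{i}^{*}_{M}=j)\,\mathbb{P}_{t}(\mathbf{i}^{*}_{M}=i^{*})\Big(\mathbb{E}_{t}[\mathbf{X}_{t}(i^{*},j)\mid\mathbf{i}^{*}_{M}=i^{*}]-\mathbb{E}_{t}[\mathbf{X}_{t}(i^{*},j)]\Big)$$

$$\mathcal{I}_{t}(\mathbf{i}^{*}_{M};(\mathbf{I}_{t},\mathbf{J}_{t},\mathbf{X}_{t}(i,j)))=\sum_{i,j,i^{*}}\mathbb{P}_{t}(\mathbf{i}^{*}_{M}=i)\,\mathbb{P}_{t}(\mathbf{i}^{*}_{M}=i^{*})\,\mathbb{P}_{t}(\mathbf{i}^{*}_{M}=j)\,D\big(\mathbf{p}_{t}(\mathbf{X}_{t}(i,j)\mid\mathbf{i}^{*}_{M}=i^{*})\ \|\ \mathbf{p}_{t}(\mathbf{X}_{t}(i,j))\big)$$

$$\mathbb{E}_{t}[\mathbf{X}_{t}(\mathbf{i}^{*}_{M},\mathbf{J}_{t})]=\sum_{i^{*},j}\mathbb{P}_{t}(\mathbf{i}^{*}_{M}=i^{*})\,\mathbb{P}_{t}(\mathbf{i}^{*}_{M}=j)\,\mathbb{E}_{t}[\mathbf{X}_{t}(i^{*},j)\mid\mathbf{i}^{*}_{M}=i^{*}]$$

$$\mathbb{E}_{t}[\mathbf{X}_{t}(\mathbf{i}^{*}_{M},\mathbf{i}^{*}_{M})]=\sum_{i^{*},j}\mathbb{P}_{t}(\mathbf{i}^{*}_{M}=i^{*})\,\mathbb{P}_{t}(\mathbf{i}^{*}_{M}=j)\,\mathbb{E}_{t}[\mathbf{X}_{t}(i^{*},j)]$$

$$\begin{aligned}\mathcal{I}_{t}(\mathbf{i}^{*}_{M};(\mathbf{I}_{t},\mathbf{J}_{t},\mathbf{X}_{t}(\mathbf{I}_{t},\mathbf{J}_{t})))&=\mathcal{I}_{t}(\mathbf{i}^{*}_{M};(\mathbf{I}_{t},\mathbf{J}_{t}))+\mathcal{I}_{t}(\mathbf{i}^{*}_{M};\mathbf{X}_{t}(\mathbf{I}_{t},\mathbf{J}_{t})\mid\mathbf{I}_{t},\mathbf{J}_{t})\\&=\mathcal{I}_{t}(\mathbf{i}^{*}_{M};\mathbf{X}_{t}(\mathbf{I}_{t},\mathbf{J}_{t})\mid\mathbf{I}_{t},\mathbf{J}_{t})\\&=\sum_{i,j}\mathbb{P}_{t}(\mathbf{i}^{*}_{M}=i)\,\mathbb{P}_{t}(\mathbf{i}^{*}_{M}=j)\,\mathcal{I}_{t}(\mathbf{i}^{*}_{M};\mathbf{X}_{t}(i,j))\\&=\sum_{i,j,i^{*}}\mathbb{P}_{t}(\mathbf{i}^{*}_{M}=i)\,\mathbb{P}_{t}(\mathbf{i}^{*}_{M}=j)\,\mathbb{P}_{t}(\mathbf{i}^{*}_{M}=i^{*})\,D\big(\mathbf{p}_{t}(\mathbf{X}_{t}(i,j)\mid\mathbf{i}^{*}_{M}=i^{*})\ \|\ \mathbf{p}_{t}(\mathbf{X}_{t}(i,j))\big)\end{aligned}$$

$$\begin{aligned}\mathbb{E}_{t}[\mathbf{r}_{t}]&\leq\sqrt{\frac{A^{2}}{2}\sum_{i^{*},j}\mathbb{P}_{t}(\mathbf{i}^{*}_{M}=j)^{2}\,\mathbb{P}_{t}(\mathbf{i}^{*}_{M}=i^{*})^{2}\,D\big(\mathbf{p}_{t}(\mathbf{X}_{t}(i^{*},j)\mid\mathbf{i}^{*}_{M}=i^{*})\ \|\ \mathbf{p}_{t}(\mathbf{X}_{t}(i^{*},j))\big)}\\&\leq\sqrt{\frac{A^{2}}{2}\sum_{i,j,i^{*}}\mathbb{P}_{t}(\mathbf{i}^{*}_{M}=i)^{2}\,\mathbb{P}_{t}(\mathbf{i}^{*}_{M}=i^{*})^{2}\,\mathbb{P}_{t}(\mathbf{i}^{*}_{M}=j)\,D\big(\mathbf{p}_{t}(\mathbf{X}_{t}(i,j)\mid\mathbf{i}^{*}_{M}=i^{*})\ \|\ \mathbf{p}_{t}(\mathbf{X}_{t}(i,j))\big)}\end{aligned}$$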


Taxonomy

Topics: Advanced Bandit Algorithms Research · Auction Theory and Applications · Optimization and Search Problems

Full text

Simple Algorithms for Dueling Bandits

Tyler Lekang

University of Minnesota, Twin Cities

Minneapolis, MN

[email protected]

Andrew Lamperski

University of Minnesota, Twin Cities

Minneapolis, MN

[email protected]

Abstract

In this paper, we present simple algorithms for Dueling Bandits. We prove that the algorithms have regret bounds for time horizon $T$ of order $O(T^{\rho})$ with $1/2\leq\rho\leq 3/4$ , which importantly do not depend on any preference gap between actions $\Delta$ . Dueling Bandits is an important extension of the Multi-Armed Bandit problem, in which the algorithm must select two actions at a time and only receives binary feedback for the duel outcome. This is analogous to comparisons in which the rater can only provide yes/no or better/worse type responses. We compare our simple algorithms to the current state-of-the-art for Dueling Bandits, ISS and DTS, discussing complexity and regret upper bounds, and conducting experiments on synthetic data that demonstrate their regret performance, which in some cases exceeds state-of-the-art.

1 Introduction

Dueling Bandits, first proposed in [24], is an important variation on the Multi-Armed Bandit (MAB), a well-known online machine learning problem that has been studied extensively in previous work, such as [4], [6], and [5]. Dueling Bandits differs from MAB in that the only feedback at each time is binary: the win/lose outcome of a duel between two actions. This corresponds well to comparisons between two system states that receive better/worse type responses from users, patients, raters, and so on. Previous work on this topic has proposed various algorithms that generally allow regret bounds of order $\log T/\Delta$ to be proven, where $\Delta$ represents the preference gap between two different states (or actions); see [18] for a reference. Such algorithms include Beat the Mean [25], Interleaved Filter [23], SAVAGE [20], RUCB [27] and RCS [28], MultiSBM and Sparring [3], Sparse Borda [9], RMED [11], CCB [26], and (E)CW-RMED [13].

Thompson Sampling, first proposed in [19], is a powerful method for learning true parameter values $\theta$ by sampling from a posterior distribution obtained via Bayes' theorem; see [14] and [16] for reference. It has been implemented in algorithms for multi-armed bandits, such as in [7], [1], [10], [2], [12], and [22]. The current state-of-the-art algorithms for Dueling Bandits, Independent Self-Sparring (ISS) [17] and Double Thompson Sampling (DTS) [21], both utilize Thompson Sampling methods. The ISS method is relatively simple, has strong empirical performance, and has been proven to converge asymptotically to a Condorcet winner, if one exists. However, its non-asymptotic regret has not been analyzed. The DTS algorithm is relatively complex and has a highly complex proof. It achieves regret of order $\log T/\Delta$. However, worst-case $\Delta$ values lead to regret bounds that are actually of order $\sqrt{T\log T}$.

We address these issues in this paper, with our main contributions: (1) we present four simple algorithms for Dueling Bandits, each of which allows provable upper bounds on regret of order $O(T^{\rho})$ with $1/2\leq\rho\leq 3/4$ that do not depend on any preference gap $\Delta$ between actions, (2) we compare and contrast the algorithm complexity and theoretical results of the presented simple algorithms against the current state-of-the-art algorithms for Dueling Bandits, and (3) we evaluate the algorithms on multiple scenarios using synthetically generated data, demonstrating their performance for multiple definitions of optimality, which in some cases exceeds the state-of-the-art.

2 Background

2.1 Dueling Bandits

The dueling bandits problem is described in Problem 1. The random matrices $\mathbf{X}_{1},\ldots,\mathbf{X}_{T}\in\{0,1\}^{A\times A}$ are independent and identically distributed. Each element is Bernoulli distributed such that $\mathbb{P}[\mathbf{X}_{t}(i,j)=1]=\mathbb{P}[i\succ j]$ denotes the probability of action $i$ winning a duel with action $j$ .

For Thompson sampling algorithms, we will assume that the win probabilities depend on an unobserved random parameter, $\boldsymbol{\theta}$ , so that $\mathbb{P}[\mathbf{X}_{t}(i,j)=1|\boldsymbol{\theta}]=\mathbf{X}(i,j)$ . The parameter can be used to encode correlations between the actions and other structural assumptions.

For algorithms based on Exp3.P and partial monitoring, we assume that $\mathbb{P}[\mathbf{X}_{t}(i,j)=1]=X(i,j)$, where $X$ is a fixed but unknown matrix of win probabilities.

We assume that $\mathbf{X}_{t}(i,j)=1-\mathbf{X}_{t}(j,i)$ when $i\neq j$ and that $\mathbf{X}(i,i)=1/2$ or $X(i,i)=1/2$ , depending on the problem setup.

Random variables $\mathbf{I}_{t}$ and $\mathbf{J}_{t}$ represent the actions selected to duel at each time, and we denote by $\mathbf{H}_{t}=\{\mathbf{I}_{\tau},\mathbf{J}_{\tau},\mathbf{X}_{\tau}(\mathbf{I}_{\tau},\mathbf{J}_{\tau})\}_{\tau=1}^{t-1}$ the history available to help guide the selections. Note that the assumptions about $\mathbf{X}_{t}$ imply that if $\mathbf{X}_{t}(\mathbf{I}_{t},\mathbf{J}_{t})$ is observed, then $\mathbf{X}_{t}(\mathbf{J}_{t},\mathbf{I}_{t})$ is also known.
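The feedback model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `sample_duel` and `play_random_duels` are hypothetical, the win-probability matrix is passed as a nested list, and actions are chosen uniformly at random simply to produce a history (no learning takes place).

```python
import random

def sample_duel(X, i, j, rng):
    """Sample one duel outcome: returns (X_t(i, j), X_t(j, i))."""
    # X_t(i, j) is Bernoulli with success probability X[i][j].
    win = 1 if rng.random() < X[i][j] else 0
    # For i != j the reverse outcome is determined: X_t(j, i) = 1 - X_t(i, j).
    return win, 1 - win

def play_random_duels(X, T, seed=0):
    """Build a history of (I_t, J_t, X_t(I_t, J_t)) under uniform selection."""
    rng = random.Random(seed)
    A = len(X)
    history = []
    for _ in range(T):
        i, j = rng.randrange(A), rng.randrange(A)
        outcome = sample_duel(X, i, j, rng)[0]
        history.append((i, j, outcome))
    return history
```

Each history entry stores only the observed entry of $\mathbf{X}_{t}$; the mirrored entry is implied whenever the two actions differ.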

2.2 Optimal Actions

It is assumed that there is a subset of optimal actions within $\{1,\dots,A\}$, and that we wish to find an optimal action as efficiently as possible. There are several optimality notions used for dueling bandits. We discuss some of these below, and note that section 4.1 of [18] provides additional definitions.

2.2.1 Copeland and Condorcet Winners

The standard definitions of optimal actions in the dueling bandits literature are Copeland and Condorcet winners. These rely on counting the number of other actions that a particular action is likely to beat in a duel (in the sense of $\mathbb{P}[i\succ j]=X(i,j)>0.5$). Copeland winners $i^{*}_{C}$ are defined as,

$$i^{*}_{C}\in\operatorname*{arg\,max}_{i}\sum_{j}\mathbb{1}\big[X(i,j)>0.5\big]$$

If there is a single action that is likely to beat all other actions, this is known as a Condorcet winner. Copeland winners always exist, even if a Condorcet winner does not exist.
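As a small illustration of these two notions, the sketch below counts, for each action, how many opponents it beats with probability above 0.5. The function names are hypothetical, and the matrix $X$ is assumed to be given as a nested list with $X(i,i)=0.5$.

```python
def copeland_winners(X):
    """Actions that beat the largest number of other actions (X[i][j] > 0.5)."""
    A = len(X)
    scores = [sum(1 for j in range(A) if X[i][j] > 0.5) for i in range(A)]
    best = max(scores)
    return [i for i in range(A) if scores[i] == best]

def condorcet_winner(X):
    """The single action that beats every other action, or None if there is none."""
    A = len(X)
    for i in range(A):
        if all(X[i][j] > 0.5 for j in range(A) if j != i):
            return i
    return None
```

For a cyclic preference matrix such as `[[0.5, 0.6, 0.4], [0.4, 0.5, 0.6], [0.6, 0.4, 0.5]]`, every action beats exactly one other, so all three are Copeland winners and no Condorcet winner exists.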

2.2.2 Maximin and Borda Winners

In this paper, we focus on two alternatives to Copeland and Condorcet winners for defining optimal actions: Maximin winners and Borda winners. Both rely on simpler measures of $X$ to determine the optimal actions. Maximin winners use row minimum values of $X$ , and Borda winners use row average values of $X$ . Let us define Maximin winners $i^{*}_{M}$ and Borda winners $i^{*}_{B}$ as,

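$$i^{*}_{M}\in\operatorname*{arg\,max}_{i}\,\min_{j}X(i,j),\qquad i^{*}_{B}\in\operatorname*{arg\,max}_{i}\,\frac{1}{A}\sum_{j}X(i,j)$$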

Maximin and Borda winners both always exist, even if a Condorcet winner does not exist. Also, Copeland winners are not guaranteed to align with either Maximin or Borda winners. Condorcet winners are guaranteed to align with Maximin winners, but not with Borda winners. For these reasons, we find these to be compelling alternative definitions for optimal actions.
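The row-minimum and row-average definitions translate directly into code. The sketch below uses hypothetical function names and takes the minimum and the average over all $j$, including the diagonal entry $X(i,i)=0.5$, which is an assumption about how the definitions are read.

```python
def maximin_winners(X):
    """Actions whose worst-case win probability (row minimum) is largest."""
    scores = [min(row) for row in X]
    best = max(scores)
    return [i for i, s in enumerate(scores) if s == best]

def borda_winners(X):
    """Actions whose average win probability (row mean) is largest."""
    scores = [sum(row) / len(row) for row in X]
    best = max(scores)
    return [i for i, s in enumerate(scores) if s == best]
```

With `X = [[0.5, 0.55, 0.55], [0.45, 0.5, 0.95], [0.45, 0.05, 0.5]]`, action 0 narrowly beats both opponents and is the Condorcet and Maximin winner, while action 1 has the highest row average and is the Borda winner, which illustrates why the two notions can disagree.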

2.3 Regret

To characterize the performance of the selected actions over time horizon $T$ , we can compare them against ideal selections that could have been made over that time period. This is known as regret. While it may be intuitive that an ideal sequence of $\mathbf{I}_{t}$ selections would be any $\mathbf{I}_{1},\dots,\mathbf{I}_{T}$ which maximizes $\sum_{t=1}^{T}\mathbf{X}_{t}(\mathbf{I}_{t},\mathbf{J}_{t})$ , for a given sequence of $\mathbf{J}_{1},\dots,\mathbf{J}_{T}$ selections (and vice versa, minimizes it for ideal $\mathbf{J}_{t}$ selections), this is unreasonable and not possible. Selections are unknown prior to a duel, and adaptations to selection strategies are made after a duel, meaning the original given selection sequence would no longer be valid. Instead, a reasonable ideal sequence of selections that could have been made is for both $\mathbf{I}_{t}$ and $\mathbf{J}_{t}$ to have been optimal actions, at all times. Therefore, if the regret incurred over time horizon $T$ is minimized, then the selected actions have converged to optimal actions as efficiently as possible in that time period.

3 Algorithms

3.1 Thompson Sampling for Dueling Bandits

We describe Thompson Sampling in generality, in order to highlight its flexibility. It learns true parameter values $\boldsymbol{\theta}$ , which can represent $\mathbf{X}$ directly or some other latent values for each action, by sampling the posterior distribution conditioned on the history $\mathbf{H}_{t}$ . The samples of $\boldsymbol{\theta}$ become more accurate as the information in $\mathbf{H}_{t}$ increases, and are used to form an estimate of $\mathbf{X}$ , which can be used with any optimal action definition. We present algorithms for both Maximin winners (Alg. 1) and Borda winners (Alg. 2).

An appropriate prior distribution over $\boldsymbol{\theta}$ must be chosen so that the posterior distribution can either be determined analytically or sampled from by using computational means (such as Markov chain Monte Carlo). The prior can be used to model correlations between actions, for example by using a Gaussian Process.
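A minimal sketch of the sampling loop for Maximin winners follows. It is not a transcription of Alg. 1: it assumes independent $\mathrm{Beta}(1,1)$ priors on each off-diagonal entry of $\mathbf{X}$ (the choice described in the experiments), draws $\mathbf{I}_{t}$ and $\mathbf{J}_{t}$ from two independent posterior samples, and skips the update when the same action is drawn twice. All function names are hypothetical.

```python
import random

def sample_matrix(wins, rng):
    """Draw one posterior sample of X; wins[i][j] counts duels i won against j."""
    A = len(wins)
    X = [[0.5] * A for _ in range(A)]
    for i in range(A):
        for j in range(A):
            if i != j:
                # Beta(1, 1) prior updated with the wins and losses of i against j.
                X[i][j] = rng.betavariate(1 + wins[i][j], 1 + wins[j][i])
    return X

def maximin_action(X):
    """Index of the row with the largest minimum (ties go to the lowest index)."""
    return max(range(len(X)), key=lambda i: min(X[i]))

def thompson_maximin(X_true, T, rng):
    """Run T rounds and return how often each action was selected."""
    A = len(X_true)
    wins = [[0] * A for _ in range(A)]
    picks = [0] * A
    for _ in range(T):
        # Two independent posterior samples give the two duelists.
        I = maximin_action(sample_matrix(wins, rng))
        J = maximin_action(sample_matrix(wins, rng))
        picks[I] += 1
        picks[J] += 1
        if I != J:
            if rng.random() < X_true[I][J]:
                wins[I][J] += 1
            else:
                wins[J][I] += 1
    return picks
```

Swapping `maximin_action` for a row-average rule would give a Borda-style variant, although the exploration parameter $\alpha$ that appears in Theorem 4.2 is not modelled here.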

3.2 SparringExp3.P for Dueling Bandits

SparringExp3.P is implemented for dueling bandits in Algorithm 3, and is inspired by the methods in [3] and [8]. It learns from the previous duel outcomes and accordingly adjusts the strategies $\mathbf{p}_{t+1}$ and $\mathbf{q}_{t+1}$ using hyperparameters $\eta>0$ and $0<\gamma<1$ . For all times $t\leq T$ and all actions $i,j\in\{1,\dots,A\}$ , the update equations are,

[TABLE]

Since only outcome $\mathbf{X}_{t}(\mathbf{I}_{t},\mathbf{J}_{t})$ is revealed at each time $t$ , the other outcomes in the corresponding rows of $\mathbf{X}_{t}$ must be estimated. These estimates are made using the observed outcome and hyperparameter $0\leq\beta\leq 1$ as follows,

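$$\widetilde{\mathbf{X}}_{p,t}(i,\mathbf{J}_{t})=\frac{\mathbf{X}_{t}(i,\mathbf{J}_{t})\,\mathbb{1}(i=\mathbf{I}_{t})+\beta}{\mathbf{p}_{t}(i)},\qquad\widetilde{\mathbf{X}}_{q,t}(j,\mathbf{I}_{t})=\frac{\mathbf{X}_{t}(j,\mathbf{I}_{t})\,\mathbb{1}(j=\mathbf{J}_{t})+\beta}{\mathbf{q}_{t}(j)}$$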

for all $i,j\in\{1,\ldots,A\}$ . These estimates satisfy $\mathbb{E}[\widetilde{\mathbf{X}}_{p,t}(i,\mathbf{J}_{t})|\mathbf{J}_{t},\mathbf{H}_{t}]=\mathbf{X}_{t}(i,\mathbf{J}_{t})+\beta/\mathbf{p}_{t}(i)$ and $\mathbb{E}[\widetilde{\mathbf{X}}_{q,t}(j,\mathbf{I}_{t})|\mathbf{I}_{t},\mathbf{H}_{t}]=\mathbf{X}_{t}(j,\mathbf{I}_{t})+\beta/\mathbf{q}_{t}(j)$ for all $i,j$ and all times $t$ .
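The stated expectation can be checked numerically. The sketch below assumes the row-player estimate has the importance-weighted form $(\mathbf{X}_{t}(i,\mathbf{J}_{t})\,\mathbb{1}(i=\mathbf{I}_{t})+\beta)/\mathbf{p}_{t}(i)$, which is the form implied by the expectation above, and averages it exactly over $\mathbf{I}_{t}\sim\mathbf{p}_{t}$ with $\mathbf{J}_{t}$ held fixed. Function names and the example numbers are hypothetical.

```python
def row_estimates(x_obs, I_t, p_t, beta):
    """Estimates for every row action i, given the single observed outcome x_obs."""
    return [((x_obs if i == I_t else 0.0) + beta) / p_t[i]
            for i in range(len(p_t))]

def expected_row_estimates(x_col, p_t, beta):
    """Exact expectation over I_t ~ p_t, where x_col[i] = X_t(i, J_t)."""
    A = len(p_t)
    expectation = [0.0] * A
    for I_t in range(A):
        est = row_estimates(x_col[I_t], I_t, p_t, beta)
        for i in range(A):
            expectation[i] += p_t[I_t] * est[i]
    return expectation
```

For `x_col = [1, 0, 1]`, `p_t = [0.2, 0.3, 0.5]` and `beta = 0.05`, the result matches `x_col[i] + beta / p_t[i]` entry by entry, so the estimate is biased upward by exactly $\beta/\mathbf{p}_{t}(i)$.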

3.3 Partial Monitoring Forecaster for Dueling Bandits

The Partial Monitoring forecaster [6] is implemented for dueling bandits in Algorithm 4. The forecaster learns from the previous duel outcomes and accordingly adjusts the strategy $\mathbf{p}_{t+1}$ using hyperparameters $\eta>0$ and $0<\gamma<1$ . For all times $t\leq T$ and all actions $i\in\{1,\dots,A\}$ , the update equations are,

[TABLE]

Since only outcome $\mathbf{X}_{t}(\mathbf{I}_{t},\mathbf{J}_{t})$ is revealed at each time $t$ , the Borda score for $\mathbf{X}_{t}$ , must be estimated using the observed outcome as follows,

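$$\mathbf{g}_{t}(i)=\frac{1}{A}\sum_{k=1}^{A}\mathbf{X}_{t}(i,k),\qquad\widetilde{\mathbf{g}}_{t}(i)=\frac{1}{A}\,\frac{\mathbf{X}_{t}(i,\mathbf{J}_{t})\,\mathbb{1}(i=\mathbf{I}_{t})}{\mathbf{p}_{t}(\mathbf{I}_{t})\,\mathbf{p}_{t}(\mathbf{J}_{t})}$$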

for all $i\in\{1,\dots,A\}$ . These estimates satisfy $\mathbb{E}[\widetilde{\mathbf{g}}_{t}(i)|\mathbf{H}_{t}]=\mathbf{g}_{t}(i)$ for all $i$ and all times $t$ .
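The unbiasedness claim can be verified by exact enumeration. The sketch below assumes the estimate has the form $\frac{1}{A}\,\mathbf{X}_{t}(i,\mathbf{J}_{t})\,\mathbb{1}(i=\mathbf{I}_{t})/(\mathbf{p}_{t}(\mathbf{I}_{t})\,\mathbf{p}_{t}(\mathbf{J}_{t}))$ and that $\mathbf{I}_{t}$ and $\mathbf{J}_{t}$ are drawn independently from the same strategy $\mathbf{p}_{t}$; both are assumptions made for this illustration, and the function names are hypothetical.

```python
def borda_estimates(x_obs, I_t, J_t, p_t):
    """Importance-weighted Borda score estimates from one observed duel."""
    A = len(p_t)
    g = [0.0] * A
    # Only the selected row action receives a non-zero estimate.
    g[I_t] = x_obs / (A * p_t[I_t] * p_t[J_t])
    return g

def expected_borda_estimates(X_t, p_t):
    """Exact expectation of the estimates over independent I_t, J_t ~ p_t."""
    A = len(p_t)
    expectation = [0.0] * A
    for I_t in range(A):
        for J_t in range(A):
            g = borda_estimates(X_t[I_t][J_t], I_t, J_t, p_t)
            weight = p_t[I_t] * p_t[J_t]
            for i in range(A):
                expectation[i] += weight * g[i]
    return expectation
```

The importance weights cancel the selection probabilities, so the expectation equals the row average of the outcome matrix, which is the Borda score $\mathbf{g}_{t}(i)$.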

3.4 Comparison to State-of-the-Art

Both state-of-the-art dueling bandits algorithms, ISS [17] and DTS [21], use variations of specific Thompson Sampling implementations. They both use $\mathrm{Beta}(1,1)$ as the prior distribution $p(\theta_{n})$ for each independent, true $\theta_{n}$ value they attempt to learn. Since the Beta distribution is the conjugate prior for the Bernoulli likelihood, the independent posterior distributions $p(\theta_{n}|\mathbf{H}_{t})$ can be determined analytically and are themselves Beta distributions.

While the ISS algorithm is very simple, it does not learn an estimate for $X$. Instead, it learns the more basic overall probability of each action winning a duel with a Condorcet winner. It therefore learns $A$ independent $\theta_{n}$ values, one for each action. Since it does not learn $X$, it cannot learn to track a Borda winner unless it is also the Condorcet winner.

The DTS algorithm does learn an estimate of $X$ . It thus learns $A^{2}$ independent $\theta_{n}$ values, one for each $i,j$ pair in $X$ . However, it is a complex and specialized algorithm that tracks the Copeland winner, so it cannot learn to track a Borda winner unless it is also the Copeland winner.

4 Theoretical Results

In this section, we will present theorems that upper bound the regret for each of the algorithms described in the previous section, and also compare the bounds to those for the current state-of-the-art. Each of the regret upper bounds is of the order $O(T^{\rho})$ with $1/2\leq\rho\leq 3/4$ , and this bound holds regardless of the size of any preference gaps between any two actions $\Delta$ . All definitions of regret are normalized, such that the regret incurred at any time $t$ satisfies $\mathbf{r}_{t}\leq 1$ , and therefore $\mathbf{R}_{T}=\sum_{t=1}^{T}\mathbf{r}_{t}\leq T$ . Detailed proofs are provided in the appendix.

Theorem 4.1

Let us define regret over time horizon $T$ in the sense of Maximin winner $\mathbf{i}^{*}_{M}$ ,

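$$\mathbf{R}_{T}=\frac{1}{2}\sum_{t=1}^{T}\Big(\mathbf{X}_{t}(\mathbf{i}^{*}_{M},\mathbf{I}_{t})-\mathbf{X}_{t}(\mathbf{i}^{*}_{M},\mathbf{i}^{*}_{M})+\mathbf{X}_{t}(\mathbf{i}^{*}_{M},\mathbf{J}_{t})-\mathbf{X}_{t}(\mathbf{i}^{*}_{M},\mathbf{i}^{*}_{M})\Big)$$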

Then, if actions $\mathbf{I}_{t},\mathbf{J}_{t}$ are selected at each time using Thompson Sampling for Dueling Bandits with Maximin winners (Alg. 1), the expected regret is upper bounded as,

$$\mathbb{E}[\mathbf{R}_{T}]\leq\frac{A}{\sqrt{2}}\sqrt{T\log A}$$

The proof method is a variation on the worst case bound from [15].

Theorem 4.2

Let us define regret over time horizon $T$ in the sense of Borda winner $\mathbf{i}^{*}_{B}$ ,

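$$\mathbf{R}_{T}=\frac{1}{2A}\sum_{t=1}^{T}\sum_{k=1}^{A}\Big(\mathbf{X}_{t}(\mathbf{i}^{*}_{B},k)-\mathbf{X}_{t}(\mathbf{I}_{t},k)+\mathbf{X}_{t}(\mathbf{i}^{*}_{B},k)-\mathbf{X}_{t}(\mathbf{J}_{t},k)\Big)$$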

Then, if actions $\mathbf{I}_{t},\mathbf{J}_{t}$ are selected at each time using Thompson Sampling for Dueling Bandits with Borda winners (Alg. 2), using $\alpha=c\,T^{-1/3}<\frac{1}{2}$ for $c>0$ , the expected regret is upper bounded as,

[TABLE]

The proof method uses the same concepts from [15] as the proof of Theorem 4.1.
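To make the Borda regret concrete, the sketch below evaluates it in expectation for a given sequence of selected pairs. It assumes the per-round regret is $\frac{1}{2A}\sum_{k}\big(2X(i^{*}_{B},k)-X(\mathbf{I}_{t},k)-X(\mathbf{J}_{t},k)\big)$ and replaces the realized outcomes $\mathbf{X}_{t}$ by their means $X$, so it is a simplification for illustration rather than the random quantity bounded in the theorem. The function names are hypothetical.

```python
def borda_scores(X):
    """Row averages of the win-probability matrix."""
    return [sum(row) / len(row) for row in X]

def expected_borda_regret(X, pairs):
    """Sum over rounds of B(i*_B) - (B(I_t) + B(J_t)) / 2 for pairs (I_t, J_t)."""
    scores = borda_scores(X)
    best = max(scores)
    return sum(best - 0.5 * (scores[i] + scores[j]) for i, j in pairs)
```

Each round contributes at most 1, in line with the normalization $\mathbf{r}_{t}\leq 1$, and the contribution is zero only when both selected actions are Borda winners.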

Theorem 4.3

Let us define regret over time horizon $T$ in the sense of Maximin winner $i^{*}_{M}$ ,

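$$\mathbf{R}_{T}=\frac{1}{2}\sum_{t=1}^{T}\Big(\mathbf{X}_{t}(i^{*}_{M},\mathbf{J}_{t})-\mathbf{X}_{t}(\mathbf{I}_{t},\mathbf{J}_{t})+\mathbf{X}_{t}(i^{*}_{M},\mathbf{I}_{t})-\mathbf{X}_{t}(\mathbf{I}_{t},\mathbf{J}_{t})\Big)$$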

Then, if actions $\mathbf{I}_{t},\mathbf{J}_{t}$ are selected at each time using SparringExp3.P for Dueling Bandits (Alg. 3), with hyperparameter values of,

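$$\beta=\sqrt{\frac{\log A}{A\,T}},\qquad\eta=0.95\sqrt{\frac{\log A}{A\,T}},\qquad\gamma=1.05\sqrt{\frac{A\log A}{T}}$$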

and $T$ satisfying,

$$T\geq\max\bigg[4.41\,A\log A,\ \frac{0.95^{2}\log A}{0.1^{2}\,A}\bigg]$$

the expected regret is upper bounded as,

$$\mathbb{E}[\mathbf{R}_{T}]\leq\bigg(\sqrt{A\,(\log A)^{-1}}+4.2\sqrt{A\log A}\bigg)\sqrt{T}$$

The proof follows the methods used for Lemma 3.1 and Theorems 3.2 and 3.3 in [5].
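The tuning in Theorem 4.3 can be evaluated numerically. The sketch below assumes the square-root form $\beta=\sqrt{\log A/(AT)}$, $\eta=0.95\sqrt{\log A/(AT)}$, $\gamma=1.05\sqrt{A\log A/T}$, which is the standard Exp3.P tuning and is consistent with the condition on $T$ (the factor $4.41=2.1^{2}$ corresponds to $\gamma\leq 1/2$). The function names are hypothetical.

```python
import math

def exp3p_params(A, T):
    """Return (beta, eta, gamma) under the assumed square-root tuning."""
    log_a = math.log(A)
    beta = math.sqrt(log_a / (A * T))
    eta = 0.95 * math.sqrt(log_a / (A * T))
    gamma = 1.05 * math.sqrt(A * log_a / T)
    return beta, eta, gamma

def min_horizon(A):
    """Smallest T allowed by the condition of the theorem."""
    log_a = math.log(A)
    return max(4.41 * A * log_a, 0.95 ** 2 * log_a / (0.1 ** 2 * A))

def regret_bound(A, T):
    """Upper bound (sqrt(A / log A) + 4.2 sqrt(A log A)) sqrt(T)."""
    log_a = math.log(A)
    return (math.sqrt(A / log_a) + 4.2 * math.sqrt(A * log_a)) * math.sqrt(T)
```

For example, with $A=10$ and $T=40{,}000$ the horizon condition is comfortably met, $\gamma$ is about 0.025, and the bound is well below the trivial bound $T$.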

Theorem 4.4

Let us define regret over time horizon $T$ in the sense of Borda winner $i^{*}_{B}$ ,

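$$\mathbf{R}_{T}=\frac{1}{2A}\sum_{t=1}^{T}\sum_{k=1}^{A}\Big(\mathbf{X}_{t}(i^{*}_{B},k)-\mathbf{X}_{t}(\mathbf{I}_{t},k)+\mathbf{X}_{t}(i^{*}_{B},k)-\mathbf{X}_{t}(\mathbf{J}_{t},k)\Big)$$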

Then, if actions $\mathbf{I}_{t},\mathbf{J}_{t}$ are selected at each time using the Partial Monitoring Forecaster for Dueling Bandits (Alg. 4), with hyperparameter values of,

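$$\eta=(e-2)^{-1/4}\,\bigg(\frac{\log A}{A^{2/3}\,T}\bigg)^{3/4},\qquad\gamma=(e-2)^{1/4}\,\bigg(\frac{A^{2}\log A}{T}\bigg)^{1/4}$$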

and $T$ satisfying,

$$T>(e-2)\,A^{2}\log A$$

the expected regret is upper bounded as,

[TABLE]

The proof follows the method used for Theorem 6.5 in [6].
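The hyperparameters of Theorem 4.4 can be computed as follows. The sketch assumes $\eta=(e-2)^{-1/4}\big(\log A/(A^{2/3}T)\big)^{3/4}$ and $\gamma=(e-2)^{1/4}\big(A^{2}\log A/T\big)^{1/4}$ together with the condition $T>(e-2)A^{2}\log A$, which is exactly the requirement $\gamma<1$. The function names are hypothetical.

```python
import math

E_MINUS_2 = math.e - 2.0

def partial_monitoring_params(A, T):
    """Return (eta, gamma) for the Partial Monitoring forecaster."""
    log_a = math.log(A)
    eta = E_MINUS_2 ** -0.25 * (log_a / (A ** (2.0 / 3.0) * T)) ** 0.75
    gamma = E_MINUS_2 ** 0.25 * (A ** 2 * log_a / T) ** 0.25
    return eta, gamma

def min_horizon(A):
    """Threshold on T; gamma equals 1 exactly at this value."""
    return E_MINUS_2 * A ** 2 * math.log(A)
```

With $A=10$ the threshold is roughly 165 rounds, and at $T=40{,}000$ the exploration rate $\gamma$ is about 0.25.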

4.1 Comparison to State-of-the-Art

Many works on dueling bandits assume that a Condorcet winner, $i^{*}_{C}$, exists. In this case, $X(i^{*}_{C},j)>1/2$ for all $j\neq i^{*}_{C}$, and we let $\Delta=\min_{j\neq i^{*}_{C}}X(i^{*}_{C},j)-1/2$ be the preference gap between the Condorcet winner and the next best action. This commonly allows regret bounds of $O\left(\frac{\log T}{\Delta}\right)$ to be proven. These bounds appear to be superior to the $O(\sqrt{T})$ bounds derived in this paper. However, as discussed in [5] (and others), when $\Delta$ is small, the $(\log T)/\Delta$ bound becomes larger than the regret for selecting the sub-optimal action each time, which is $\Delta T$. The regret is therefore governed by the smaller of the two quantities, and taking a worst-case value over $\Delta$ leads to an actual regret bound of $O(\sqrt{T\log T})$, which is not superior to the $O(\sqrt{T})$ bounds we show.
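The worst-case argument can be checked numerically. The sketch below treats the effective gap-dependent bound as $\min(\log T/\Delta,\ \Delta T)$ with all constants dropped, which is an assumption made for illustration; the two terms balance at $\Delta=\sqrt{\log T/T}$, where both equal $\sqrt{T\log T}$. The function names are hypothetical.

```python
import math

def gap_dependent_bound(T, delta):
    """Smaller of the log(T)/delta bound and the trivial delta*T bound."""
    return min(math.log(T) / delta, delta * T)

def worst_case_gap(T):
    """Gap at which the two terms are equal, maximizing their minimum."""
    return math.sqrt(math.log(T) / T)
```

For $T=40{,}000$ the worst-case gap is about 0.016 and the resulting bound is about 651, that is, $\sqrt{T\log T}$; no other value of $\Delta$ gives a larger minimum.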

This is the case for both state-of-the-art methods ISS [17] and DTS [21]. Furthermore, we note that the proof for ISS demonstrates only asymptotic convergence to a Condorcet winner, while the proof for DTS is highly complex (owing to the relatively complex nature of the algorithm). In comparison, the proofs available in appendix A are relatively simple (though presented in a detailed manner).

5 Experimental Results

5.1 Methods

We simulate each of the proposed algorithms, along with the two state-of-the-art algorithms ISS [17] and DTS [21], on two different scenarios using synthetic data. For the Thompson Sampling methods, we use $A^{2}-A$ independent $\mathrm{Beta}(1,1)$ priors for the $X$ values we attempt to learn. We set $\mathbb{E}[\mathbf{X}(i,i)|\boldsymbol{\theta}]=0.5$ directly, for all $i$. In the Condorcet scenario, an $X$ matrix is synthetically generated by linking a latent value for each action (called "utility") to the duel winning probability $\mathbb{P}[i\succ j]=X(i,j)$ for each pair of actions $i,j\in\{1,\dots,A\}$. The utility of each action, $u(i)$, is uniformly distributed between [math] and $c>0$. We chose $c=3$ to give a larger spread of probabilities over the actions. One action has the maximum utility, which is significantly higher than that of all other actions, and so it is the lone Borda winner and Condorcet winner, and thus also the lone Maximin winner. The utilities of each pair of actions are linked to the corresponding duel winning probability by applying the logistic function to the gap between the utilities of the actions,

$$\mathbb{P}[i\succ j]=X(i,j)=\frac{1}{1+\exp\big(u(j)-u(i)\big)}$$

In the Borda scenario, we modify the previous $X$ matrix such that the action with the second largest utility $i_{2}$ becomes the lone Borda winner, even though the same Condorcet and Maximin winner still exists. This is done by setting $X(i_{2},j)=0.95$ for all $j\neq i_{2}$ other than the Condorcet winner. This aptly represents why the Borda winner is a reasonable definition for optimality. Even though it isn’t likely to beat every action, it is the most likely to beat an action drawn at random. Each algorithm runs with a time horizon of $T=40,000$ iterations, for $100$ separate runs, on each scenario.
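The two synthetic scenarios can be sketched as below. The code takes the utilities as an input list rather than sampling them, assumes the logistic link $X(i,j)=1/(1+\exp(u(j)-u(i)))$, and, when overwriting $X(i_{2},j)=0.95$, also sets $X(j,i_{2})=0.05$ so that $X(i,j)=1-X(j,i)$ keeps holding; that mirrored update is an assumption, as are the function names.

```python
import math

def logistic_preferences(u):
    """Condorcet scenario: win probabilities from utility gaps."""
    A = len(u)
    return [[1.0 / (1.0 + math.exp(u[j] - u[i])) for j in range(A)]
            for i in range(A)]

def borda_scenario(X, u, p=0.95):
    """Boost the second-highest-utility action against all but the top action."""
    order = sorted(range(len(u)), key=lambda i: u[i], reverse=True)
    top, second = order[0], order[1]
    Y = [row[:] for row in X]  # copy so the Condorcet matrix is left intact
    for j in range(len(u)):
        if j != top and j != second:
            Y[second][j] = p
            Y[j][second] = 1.0 - p
    return Y, top, second
```

Whether the boosted action actually becomes the Borda winner depends on the utilities: with `u = [1.0, 0.9, 0.5, 0.2, 0.0]` it does, while a top action with a very large utility lead can retain the highest row average.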

5.2 Results

The results of the Condorcet scenario are shown in Figure 1, and the results of the Borda scenario are shown in Figure 2. In subfigure (c) of each figure, a shaded area, plotted above the mean, shows the standard deviation over the runs. Additional detailed plots of each algorithm, for each scenario, are available in appendix B.

In the Condorcet scenario, the regret for each algorithm is as prescribed in the respective theorem, and the regret for ISS and DTS uses the Maximin winner (theorem 4.1). All formulations for regret are comparable, because the scenario has the same winning action in all cases. Both state-of-the-art methods show very strong regret performance. However, the Thompson Sampling with Borda winners method shows comparably strong performance, with other methods also performing well. All methods beat the regret upper bounds proposed in their respective theorems.

In the Borda scenario, the regret for all algorithms (including ISS and DTS) uses the Borda winner. This is to highlight the fact that some of the methods are not capable of performing well in this type of scenario. Both state-of-the-art methods struggle with Borda winners, and so their Borda regret grows linearly. A similar behavior ultimately happens to SparringExp3.P (more details available in the appendix). Thompson Sampling shines in this case. Both methods that focus on Borda winners are able to beat their respective regret upper bounds.

6 Conclusion

In this paper, we have presented four simple algorithms for Dueling Bandits, each of which is able to efficiently find an optimal action within a finite set of available actions. We proved an upper bound on regret for each, over a variety of different optimal action types, such as the Borda Winner. The proven regret bounds were all of the order $O(T^{\rho})$ with $1/2\leq\rho\leq 3/4$ , and did not depend on any preference gap between any two actions $\Delta_{ij}$ . The algorithms were all evaluated and compared against the current state-of-the-art for Dueling Bandits, the ISS and DTS algorithms. While they did not meet or exceed the performance of ISS and DTS in certain scenarios, in others they demonstrated superior ability to find different types of optimal actions. Overall, their simplicity, regret bounds, and ability do merit inclusion with the current state-of-the-art.

Appendix A Theoretical Results

In this section, we provide formal proofs for all theorems presented in the paper. All random variables and probability distributions use bold font.

A.1 Proof of Theorem 4.1

The proof method is a variation on the worst case bound from [15].

First, we make the following definitions: $\mathbb{E}_{t}$ is the expectation, $\mathbb{P}_{t}$ is the probability measure, $\mathbf{p}_{t}$ is the probability density, and $\mathcal{I}_{t}(\cdot\ ;\ \cdot)$ is mutual information, all conditioned on the history $\mathbf{H}_{t}$ , at time $t$ . Furthermore, $D(\cdot\ ||\ \cdot)$ is the Kullback-Leibler divergence and $\mathcal{H}$ is entropy.

Then we note that Thompson Sampling selects both $\mathbf{I}_{t}$ and $\mathbf{J}_{t}$ using independent samples from the same posterior distribution conditioned on $\mathbf{H}_{t}$ . Therefore, $\mathbf{I}_{t}$ and $\mathbf{J}_{t}$ are independent and identically distributed, and the terms $\mathbf{X}_{t}(\mathbf{i}^{*}_{M},\mathbf{I}_{t})$ and $\mathbf{X}_{t}(\mathbf{i}^{*}_{M},\mathbf{J}_{t})$ are identically distributed.

Let $\mathbf{r}_{t}$ be the instantaneous regret at time $t$ , such that $\mathbf{R}_{T}=\sum_{t=1}^{T}\mathbf{r}_{t}$ .

We claim the following,

[TABLE]

To begin proving (7), we show,

[TABLE]

where the second equality follows because $\mathbf{J}_{t}$ is independent of $\mathbf{X}_{t}$ , when conditioned on $\mathbf{H}_{t}$ .

Furthermore,

[TABLE]

where the second equality follows because of the assumption $\mathbb{E}_{t}[\mathbf{X}_{t}(i^{*},j)]=1-\mathbb{E}_{t}[\mathbf{X}_{t}(j,i^{*})]$ . Combining (9) and (10), gives (7).

Next we prove (A.1).

[TABLE]

Here the first equality is the chain rule for mutual information, while the second follows from conditional independence of $\mathbf{I}_{t}$ , $\mathbf{J}_{t}$ , and $\mathbf{i}^{*}_{M}$ , given $\mathbf{H}_{t}$ . The third equality follows because of conditional independence of $(\mathbf{X}_{t},\mathbf{i}^{*}_{M})$ and $(\mathbf{I}_{t},\mathbf{J}_{t})$ given $\mathbf{H}_{t}$ . The final equality is a standard identity for mutual information. Thus, (A.1) holds.

Then we bound $\mathbb{E}_{t}[\mathbf{r}_{t}]$ in terms of the mutual information.

[TABLE]

The first inequality is from Pinsker’s inequality. The second is from the Cauchy-Schwarz inequality. The third is because adding more non-negative terms cannot decrease the sum. The final inequality is because $\mathbb{P}_{t}(\mathbf{i}^{*}_{M}=i)^{2}\leq\mathbb{P}_{t}(\mathbf{i}^{*}_{M}=i)$.

Next we cite the following, $\sum_{t=1}^{T}\,\mathcal{I}_{t}(\mathbf{i}^{*}_{M};(\mathbf{I}_{t},\mathbf{J}_{t},\mathbf{X}_{t}(i,j)))\leq\mathcal{H}(\mathbf{i}^{*}_{M})$ (see section 5 of [15]) and therefore $\sum_{t=1}^{T}\,\sqrt{\mathcal{I}_{t}(\mathbf{i}^{*}_{M};(\mathbf{I}_{t},\mathbf{J}_{t},\mathbf{X}_{t}(i,j)))}\leq\sqrt{T\,\sum_{t=1}^{T}\,\mathcal{I}_{t}(\mathbf{i}^{*}_{M};(\mathbf{I}_{t},\mathbf{J}_{t},\mathbf{X}_{t}(i,j)))}\leq\sqrt{T\,\mathcal{H}(\mathbf{i}^{*}_{M})}$ (Cauchy-Schwarz inequality),

[TABLE]

Finally, we have $\mathcal{H}(\mathbf{i}^{*}_{M})\leq\log A$ since there are $A$ actions, and so the desired bound is achieved. $\hfill\blacksquare$

A.2 Proof of Theorem 4.2

The proof method uses the same concepts from [15] as the proof of Theorem 4.1.

First, we make the following definitions: $\mathbb{E}_{t}$ is the expectation, $\mathbb{P}_{t}$ is the probability measure, $\mathbf{p}_{t}$ is the probability density, and $\mathcal{I}_{t}(\cdot\ ;\ \cdot)$ is mutual information, all conditioned on the history $\mathbf{H}_{t}$ , at time $t$ . Furthermore, $D(\cdot\ ||\ \cdot)$ is the Kullback-Leibler divergence and $\mathcal{H}$ is entropy.

Then we note that Thompson Sampling selects both $\mathbf{I}_{t}$ and $\mathbf{J}_{t}$ using independent samples from the same posterior distribution conditioned on $\mathbf{H}_{t}$ . Therefore, $\mathbf{I}_{t}$ and $\mathbf{J}_{t}$ are independent and identically distributed, and the terms $\mathbf{X}_{t}(\mathbf{i}^{*}_{B},\mathbf{I}_{t})$ and $\mathbf{X}_{t}(\mathbf{i}^{*}_{B},\mathbf{J}_{t})$ are identically distributed.

Let $\mathbf{r}_{t}$ be the instantaneous regret at time $t$ , such that $\mathbf{R}_{T}=\sum_{t=1}^{T}\mathbf{r}_{t}$ .

By construction,

[TABLE]

Now we bound $\mathbb{E}_{t}[\mathbf{r}_{t}]$ in terms of mutual information.

[TABLE]

Here (12) is derived analogously to (7), and the inequality (14) follows because $\mathbb{E}_{t}[\mathbf{X}_{t}(i,j)]\leq 1$ and $\frac{1}{A}\sum_{i,j}\mathbb{P}_{t}(\mathbf{i}_{B}^{*}=i)=1$ . Then the inequalities (15), (16), and (17) respectively follow from Pinsker’s inequality, the Cauchy-Schwarz inequality, and concavity. The inequality (18) follows because,

[TABLE]

and also from (11),

[TABLE]

The inequality (19) follows because adding extra non-negative terms cannot decrease the sum, and the result $\mathcal{I}_{t}(\mathbf{i}_{B}^{*};(\mathbf{I}_{t},\mathbf{J}_{t},\mathbf{X}_{t}(\mathbf{I}_{t},\mathbf{J}_{t})))$ in (20) is derived analogously to (A.1). The inequality (21) follows because $\alpha<\frac{1}{2}$ implies that $\frac{1}{\sqrt{1-\alpha}}<\sqrt{2}$ .

Next we cite the following, $\sum_{t=1}^{T}\,\mathcal{I}_{t}(\mathbf{i}^{*}_{B};(\mathbf{I}_{t},\mathbf{J}_{t},\mathbf{X}_{t}(i,j)))\leq\mathcal{H}(i^{*}_{B})$ (see Section 5 of [15]) and therefore $\sum_{t=1}^{T}\,\sqrt{\mathcal{I}_{t}(\mathbf{i}^{*}_{B};(\mathbf{I}_{t},\mathbf{J}_{t},\mathbf{X}_{t}(i,j)))}\leq\sqrt{T\,\sum_{t=1}^{T}\,\mathcal{I}_{t}(\mathbf{i}^{*}_{B};(\mathbf{I}_{t},\mathbf{J}_{t},\mathbf{X}_{t}(i,j)))}\leq\sqrt{T\,\mathcal{H}(i^{*}_{B})}$ , from the Cauchy-Schwarz inequality. Thus, the regret can be bounded as

[TABLE]

Finally, we have $\mathcal{H}(i^{*}_{B})\leq\log A$ since there are $A$ actions, and so the desired bound is achieved when substituting $\alpha=c\,T^{-1/3}$ ,

[TABLE]

$\hfill\blacksquare$

A.3 Proof of Theorem 4.3

The proof of Theorem 4.3 requires the following auxiliary lemma.

Lemma. If hyperparameter $\beta\leq 1$ , then the following holds for all $i,j$ and any $0<\delta<1$ ,

[TABLE]

Proof. The proof method follows that used for Lemma 3.1 in [5].

Taking the expected value with respect to $\mathbf{I}_{t}$ , for any $i$ and any $t\leq T$ ,

[TABLE]

where (a) uses $\exp(x)\leq 1+x+x^{2}$ for $x\leq 1$ , which is true because $\beta\leq 1$ , $|\mathbf{X}_{t}(i,\mathbf{J}_{t})|\leq 1$ , and $|\mathbf{X}_{t}(j,\mathbf{I}_{t})|\leq 1$ for all $i,j$ and $t\leq T$ ,

(b) uses,

[TABLE]

(c) uses,

[TABLE]

and (d) uses $(1+x)\exp(-x)\leq 1$ for all $x$ .
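The two scalar inequalities invoked in steps (a) and (d) can be checked on a grid. The sketch below is only a numerical illustration; the grid ranges are arbitrary choices and not part of the proof.

```python
import math

def exp_quadratic_gap(x: float) -> float:
    """(1 + x + x^2) - exp(x); non-negative whenever x <= 1."""
    return 1.0 + x + x * x - math.exp(x)

def damped_linear(x: float) -> float:
    """(1 + x) * exp(-x); at most 1 for every real x."""
    return (1.0 + x) * math.exp(-x)

xs_a = [-5.0 + 0.01 * k for k in range(601)]   # grid on [-5, 1]
xs_d = [-5.0 + 0.01 * k for k in range(1001)]  # grid on [-5, 5]

min_gap_a = min(exp_quadratic_gap(x) for x in xs_a)
max_val_d = max(damped_linear(x) for x in xs_d)
print(min_gap_a, max_val_d)
```

Both extremes occur at $x=0$, where each inequality holds with equality.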

Then the following holds for any $i,j$ , since all $\mathbf{X}_{t}$ are independent,

[TABLE]

Finally, since Markov’s inequality implies $\mathbb{P}\big{[}\log\exp(\mathbf{Y})\leq\log\,\delta^{-1}\big{]}\geq 1-\delta\,\mathbb{E}\big{[}\exp(\mathbf{Y})\big{]}$ , by then setting,

[TABLE]

we have that $\mathbb{E}\big{[}\exp(\mathbf{Y})\big{]}\leq 1$ , and therefore we achieve the desired results,

[TABLE]

$\hfill\blacksquare$
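For reference, the Markov step at the end of the lemma's proof can be written out in full. The chain below only restates that step for a generic random variable $\mathbf{Y}$ and any $0<\delta<1$; it adds no new assumption.

```latex
\begin{aligned}
\mathbb{P}\big[\mathbf{Y}\leq\log\delta^{-1}\big]
  &= 1-\mathbb{P}\big[\exp(\mathbf{Y})>\delta^{-1}\big] \\
  &\geq 1-\delta\,\mathbb{E}\big[\exp(\mathbf{Y})\big] \\
  &\geq 1-\delta \qquad\text{whenever } \mathbb{E}\big[\exp(\mathbf{Y})\big]\leq 1 .
\end{aligned}
```

The first line uses monotonicity of the exponential, and the second applies Markov's inequality to the non-negative variable $\exp(\mathbf{Y})$.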

Now we turn to the proof of Theorem 4.3. The proof method follows those used for Theorems 3.2 and 3.3 in [5].

Recall that the regret has the form

[TABLE]

Taking the expected value with respect to $i\sim\mathbf{p}_{t}$ and $j\sim\mathbf{q}_{t}$ , for any $t\leq T$ ,

[TABLE]

This means we have, for any $i,j$ ,

[TABLE]

We now bound the expectation terms, which are taken with $i,j$ distributed as $\mathbf{p}_{t},\mathbf{q}_{t}$ respectively. By the definitions of those distributions, each splits into the uniform portion $\mathbf{u}$ and a softmax portion, $\mathbf{s}_{p,t}$ or $\mathbf{s}_{q,t}$ , such that $\mathbf{p}_{t}=(1-\gamma)\,\mathbf{s}_{p,t}+\gamma\,\mathbf{u}$ and $\mathbf{q}_{t}=(1-\gamma)\,\mathbf{s}_{q,t}+\gamma\,\mathbf{u}$ . Therefore,

[TABLE]

Next we focus on the main softmax expectation terms in eqs. 27 and 28,

[TABLE]

where (a) uses $\log x\leq x-1$ , (b) uses $\exp x\leq 1+x+x^{2}$ , and (c) uses $\mathbf{X}_{t}(\mathbf{I}_{t},\mathbf{J}_{t})\,\mathbbm{1}(k=\mathbf{I}_{t})\leq b$ , $\mathbf{X}_{t}(\mathbf{J}_{t},\mathbf{I}_{t})\,\mathbbm{1}(k=\mathbf{I}_{t})\leq b$ , $\mathbf{s}_{p,t}(k)/\mathbf{p}_{t}(k)\ \leq\ 1/(1-\gamma)$ , and $\mathbf{s}_{q,t}(k)/\mathbf{q}_{t}(k)\ \leq\ 1/(1-\gamma)$ for all $k$ .

Note that (a) and (b) require $x\leq 1$ , meaning that we need $\eta\,\widetilde{\mathbf{X}}_{p,t}(i,\mathbf{J}_{t})\leq 1$ and $\eta\,\widetilde{\mathbf{X}}_{q,t}(j,\mathbf{I}_{t})\leq 1$ for all $i,j$ and $t\leq T$ .

From their definitions,

[TABLE]

and so this requirement is exactly met by the assumption $0\leq(1+\beta)A\,\eta\leq\gamma\leq 1/2$ .
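The mixture structure used in this part of the proof can be sketched in a few lines. The code builds $\mathbf{p}=(1-\gamma)\,\mathbf{s}+\gamma\,\mathbf{u}$ from a softmax $\mathbf{s}$ and checks the ratio bound $\mathbf{s}(k)/\mathbf{p}(k)\leq 1/(1-\gamma)$. The scores and the values of `gamma` and `eta` are placeholders chosen for illustration, not the paper's estimates or hyperparameters.

```python
import math

def softmax(scores, eta):
    """Softmax of eta * scores, computed stably by subtracting the max."""
    m = max(scores)
    weights = [math.exp(eta * (s - m)) for s in scores]
    z = sum(weights)
    return [w / z for w in weights]

def mix_with_uniform(s, gamma):
    """p = (1 - gamma) * s + gamma * u, with u uniform over the A actions."""
    A = len(s)
    return [(1.0 - gamma) * sk + gamma / A for sk in s]

scores = [3.0, 1.0, 0.0, -2.0]  # placeholder cumulative estimates
gamma, eta = 0.1, 0.5           # illustrative values only
s = softmax(scores, eta)
p = mix_with_uniform(s, gamma)
max_ratio = max(sk / pk for sk, pk in zip(s, p))
print(p, max_ratio, 1.0 / (1.0 - gamma))
```

The uniform portion guarantees $\mathbf{p}(k)\geq\gamma/A$ for every action, which is what keeps importance-weighted estimates bounded.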

Then we look at the uniform expectation terms in eqs. 27 and 28,

[TABLE]

Making these substitutions into eqs. 27 and 28, and summing over time,

[TABLE]

where (a) uses the definitions of $\mathbf{s}_{p,t}(k)$ and $\mathbf{s}_{q,t}(k)$ as,

[TABLE]

(b) uses the cancellation of numerators and denominators in successive terms of the product, and that $\widetilde{\mathbf{X}}_{p,0}(k,\mathbf{J}_{t})=\widetilde{\mathbf{X}}_{q,0}(k,\mathbf{I}_{t})=0$ for all $k$ , (c) uses that $-(1-\gamma)\,\log(A)/\eta\ \leq\ \log(A)/\eta$ , and (d) uses that $\big{(}1-\gamma-(1+\beta)A\,\eta\big{)}\ \leq\ 1$ , which comes from the assumption $0\leq(1+\beta)A\,\eta\leq\gamma\leq 1/2$ , together with the lemma eqs. 22 and 23. Note that the inclusion of $\delta$ from the lemma equations implies that these results hold with probability $1-\delta$ for any $0<\delta<1$ .

Then substituting these into eqs. 25 and 26,

[TABLE]

with probability $1-\delta$ for any $0<\delta<1$ , where (a) uses the assumption $0\leq(1+\beta)A\,\eta\leq\gamma\leq 1/2$ .

Since these results are valid for any $i,j$ , we can use them directly in eq. 24,

[TABLE]

and applying the defined hyperparameter values,

[TABLE]

with probability $1-\delta$ for any $0<\delta<1$ .

Now we will verify the requirements on $T$ for enforcing the assumption $0\leq(1+\beta)A\,\eta\leq\gamma\leq 1/2$ .

Since all of the hyperparameter values are non-negative, the left-hand inequality is trivially satisfied.

[TABLE]

And so the requirement is $T\ \geq\ \max\big{[}4.41\,A\log A\ ,\ (0.95^{2}\log A)/(0.1^{2}\,A)\big{]}$ , as desired.
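To make the horizon requirement concrete, the following sketch evaluates the stated maximum for a given number of actions. It assumes $\log$ denotes the natural logarithm; the function name `min_horizon` is illustrative.

```python
import math

def min_horizon(num_actions: int) -> float:
    """Evaluate max[4.41 A log A, (0.95^2 log A) / (0.1^2 A)] for A actions."""
    A = num_actions
    log_a = math.log(A)  # natural logarithm (assumption)
    first = 4.41 * A * log_a
    second = (0.95 ** 2 * log_a) / (0.1 ** 2 * A)
    return max(first, second)

for A in (2, 4, 5, 10):
    print(A, round(min_horizon(A), 2))
```

Under this reading, the second term dominates only for $A\leq 4$; from $A=5$ onward the requirement is governed by $4.41\,A\log A$.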

Finally, we demonstrate the following fact for random variable $\mathbf{W}$ with cumulative distribution function $F_{\mathbf{W}}$ ,

[TABLE]

Then recalling the regret high probability upper bound, for the required $T$ and any $0<\delta<1$ ,

[TABLE]

Now selecting $\mathbf{W}=\big{(}\mathbf{R}_{T}-4.2\sqrt{A\log A}\,\sqrt{T}\big{)}\ /\ \big{(}\sqrt{A\,(\log A)^{-1}}\,\sqrt{T}\big{)}$ , we have,

[TABLE]

Therefore, we achieve the desired result:

[TABLE]

$\hfill\blacksquare$

A.4 Proof of Theorem 4.4

The proof method follows that used for Theorem 6.5 in [6].

First, we recall our definition of the (estimated) Borda score for $\mathbf{X}_{t}$ as,

[TABLE]

and we define the sum of (estimated) Borda scores for action $i$ over $t\leq T$ as,

[TABLE]

which means we can rewrite the regret as,

[TABLE]

Since $\mathbf{I}_{t}$ and $\mathbf{J}_{t}$ are independently drawn from the same probability distribution $\mathbf{p}_{t}$ at each time $t$ , we can equivalently prove the expected regret bound using an expected regret equation strictly in terms of the Borda scores for $\mathbf{I}_{t}$ and $i^{*}_{B}$ ,

[TABLE]

Next we define a lower bound for the log of the ratio of weight sums at times $T$ and $0$ , for any $j$ ,

[TABLE]

and an upper bound for the log of the ratio of weight sums at times $t$ and $t-1$ ,

[TABLE]

where (a) is from the definition of $\mathbf{p}_{t}(i)$ , (b) is because $e^{x}\leq 1+x+(e-2)x^{2}$ for $x\leq 1$ , and (d) is because $\log(1+x)\leq x$ .

For (c), first note that $\sum_{i=1}^{A}(\mathbf{p}_{t}(i)-\gamma/A)/(1-\gamma)=1$ , since these terms are the softmax components of $\mathbf{p}_{t}$ . Step (c) would therefore hold with equality, except that the right-hand side keeps only $\mathbf{p}_{t}(i)$ in place of $\mathbf{p}_{t}(i)-\gamma/A$ . Since $\mathbf{p}_{t}(i)-\gamma/A\leq\mathbf{p}_{t}(i)$ and $\log(1+a)\leq\log(1+b)$ whenever $a\leq b$ , the inequality follows.

The (b) requirement $\eta\,\widetilde{\mathbf{g}}_{t}(i)\leq 1$ holds if we have that $\eta\,A/\gamma^{2}\leq 1$ , because $\eta>0$ and $\widetilde{\mathbf{g}}_{t}(i)\leq A/\gamma^{2}$ , with $0<\gamma<1$ . We confirm this at the end of the proof.
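The constant $(e-2)$ in step (b) is the smallest one that works on $x\leq 1$. A short derivation, using the standard fact that the function $\varphi$ below is nondecreasing on $\mathbb{R}$, is:

```latex
\varphi(x)=\frac{e^{x}-1-x}{x^{2}}\ (x\neq 0),\qquad \varphi(0)=\tfrac{1}{2},
\qquad
\varphi(x)\leq\varphi(1)=e-2 \ \text{ for all } x\leq 1
\;\Longrightarrow\;
e^{x}\leq 1+x+(e-2)\,x^{2} \ \text{ for all } x\leq 1 .
```

Equality holds at $x=1$, and the bound is applied with $x=\eta\,\widetilde{\mathbf{g}}_{t}(i)\leq\eta\,A/\gamma^{2}\leq 1$.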

Now we sum the upper bound over $t\leq T$ , to get the log of the ratio of weight sums at times $T$ and $0$ ,

[TABLE]

Then we can compare the lower and upper bounds, to get a single inequality.

[TABLE]

Multiplying both sides by $(1-\gamma)/\eta$ gives,

[TABLE]

and by rearranging terms and noting that $(1-\gamma)<1$ ,

[TABLE]

By definition of $\widetilde{\mathbf{g}}_{t}(i)$ we then have,

[TABLE]

and by definition $\mathbf{p}_{t}(\mathbf{I}_{t})<1$ for all $t\leq T$ ,

[TABLE]

Since all terms are using the unbiased estimates of the Borda scores, we can take the expected value on both sides and replace the estimates with the actual scores,

[TABLE]

Noting that $\mathbf{G}_{T}(i)\leq T$ for any $i$ ,

[TABLE]

Next we bound the remaining expectation term,

[TABLE]

and because the $\mathbf{G}_{T}(j)$ term from the original lower bound is valid for any $j$ , we can arbitrarily choose the Borda winner $i^{*}_{B}$ . We thus have,

[TABLE]

Then by canceling the $\gamma T$ terms and taking the expected value of both sides,

[TABLE]

Now we define the hyperparameters $\gamma$ and $\eta$ by using the positive terms $T,\,A,\,\text{and}\,(e-2)$ ,

[TABLE]

which guarantees $\eta>0$ and $\gamma>0$ .

Then we substitute them into the terms on the right-hand side of the inequality,

[TABLE]

Combining the terms achieves the desired result.

Finally, we determine the required $T$ such that $\gamma<1$ and $\eta\,A/\gamma^{2}\leq 1$ hold,

[TABLE]

Since $(e-2)\,A^{2}>(e-2)^{-3}\,A^{-2}$ for all $A\geq 2$ , this gives the required $T$ .
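The comparison of the two terms reduces to a condition on $A$ alone, which can be spelled out as follows:

```latex
(e-2)\,A^{2} > (e-2)^{-3}A^{-2}
\iff A^{4} > (e-2)^{-4}
\iff A > \frac{1}{e-2}\approx 1.39 .
```

Any dueling problem has at least two actions, so the condition always holds.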

$\hfill\blacksquare$

Appendix B Experimental Results

In this section, we provide additional plots that detail the behavior of the algorithms for the different experimental scenarios. For all figures:

• (a) shows a detailed plot of the regret over the runs for the scenario, with off-color lines showing individual runs, a thick line showing the mean over runs, and a shaded area showing the standard deviation over runs (plotted above the mean)

• (b) shows the $\mathbf{I}_{t}$ action selections over the runs for the scenario, with off-color lines showing individual runs, a thick line showing the mean over runs, and a shaded area showing the standard deviation over runs (plotted above and below the mean)

• (c) shows the $\mathbf{J}_{t}$ action selections over the runs for the scenario, with off-color lines showing individual runs, a thick line showing the mean over runs, and a shaded area showing the standard deviation over runs (plotted above and below the mean)

• (d, if applicable) shows the $\mathbf{p}_{t}$ strategy over the runs for the scenario, with a thick line showing the mean over runs, and a shaded area showing the standard deviation over runs (plotted above and below the mean)

• (e, if applicable) shows the $\mathbf{q}_{t}$ strategy over the runs for the scenario, with a thick line showing the mean over runs, and a shaded area showing the standard deviation over runs (plotted above and below the mean)

For the Condorcet scenario, see Figs. 3 - 8. For the Borda scenario, see Figs. 9 - 14.

Bibliography

[1] Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39–1, 2012.
[2] Shipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson sampling. In Artificial Intelligence and Statistics, pages 99–107, 2013.
[3] Nir Ailon, Zohar Karnin, and Thorsten Joachims. Reducing dueling bandits to cardinal bandits. In International Conference on Machine Learning, pages 856–864, 2014.
[4] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
[5] Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
[6] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[7] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249–2257, 2011.
[8] Miroslav Dudík, Katja Hofmann, Robert E Schapire, Aleksandrs Slivkins, and Masrour Zoghi. Contextual dueling bandits. arXiv preprint arXiv:1502.06362, 2015.