Simple Algorithms for Dueling Bandits
Tyler Lekang, Andrew Lamperski

TL;DR
This paper introduces simple algorithms for Dueling Bandits, providing regret bounds independent of preference gaps, and demonstrates their competitive performance through theoretical analysis and experiments.
Contribution
The paper proposes new simple algorithms for Dueling Bandits with regret bounds not depending on preference gaps, advancing the state-of-the-art.
Findings
Regret bounds of order O(T^rho) with 1/2 <= rho <= 3/4
Algorithms outperform existing methods in some synthetic experiments
Regret performance comparable or better than state-of-the-art algorithms
Abstract
In this paper, we present simple algorithms for Dueling Bandits. We prove that the algorithms have regret bounds for time horizon T of order O(T^rho ) with 1/2 <= rho <= 3/4, which importantly do not depend on any preference gap between actions, Delta. Dueling Bandits is an important extension of the Multi-Armed Bandit problem, in which the algorithm must select two actions at a time and only receives binary feedback for the duel outcome. This is analogous to comparisons in which the rater can only provide yes/no or better/worse type responses. We compare our simple algorithms to the current state-of-the-art for Dueling Bandits, ISS and DTS, discussing complexity and regret upper bounds, and conducting experiments on synthetic data that demonstrate their regret performance, which in some cases exceeds state-of-the-art.
Tyler Lekang
University of Minnesota, Twin Cities
Minneapolis, MN
Andrew Lamperski
University of Minnesota, Twin Cities
Minneapolis, MN
1 Introduction
Dueling Bandits, first proposed in [24], is an important variation on the Multi-Armed Bandit (MAB), a well-known online machine learning problem that has been studied extensively in previous works such as [4], [6], and [5]. Dueling Bandits differs from MAB in that the feedback at each time is binary: the win/lose outcome of a duel between two actions. This corresponds well to comparisons between two system states that receive better/worse type responses from users, patients, raters, and so on. Previous work on this topic has proposed various algorithms whose proven regret bounds generally depend on $\Delta$, the preference gap between two different states (or actions). See [18] for a reference. Such algorithms include Beat the Mean [25], Interleaved Filter [23], SAVAGE [20], RUCB [27] and RCS [28], MultiSBM and Sparring [3], Sparse Borda [9], RMED [11], CCB [26], and (E)CW-RMED [13].
Thompson Sampling, first proposed in [19], is a powerful method of learning true parameter values by sampling from a posterior distribution obtained via Bayes' theorem. See [14] and [16] for reference. It has been implemented in algorithms for multi-armed bandits, such as in [7], [1], [10], [2], [12], and [22].
The current state-of-the-art algorithms for Dueling Bandits, Independent Self-Sparring (ISS) [17] and Double Thompson Sampling (DTS) [21], both utilize Thompson Sampling methods. The ISS method is relatively simple, has strong empirical performance, and has been proven to converge asymptotically to a Condorcet winner, if one exists. However, its non-asymptotic regret has not been analyzed. The DTS algorithm is a relatively complex algorithm with a highly complex proof. Its regret bound depends on $\Delta$, and taking the worst-case value of $\Delta$ leads to a regret bound that is not superior to the bounds shown in this paper (see section 4.1).
We address these issues in this paper, with our main contributions: (1) we present four simple algorithms for Dueling Bandits, each of which allows provable upper bounds on regret of order $O(T^{\rho})$ with $1/2 \leq \rho \leq 3/4$ that do not depend on any preference gap between actions, (2) we compare and contrast the algorithm complexity and theoretical results of the presented simple algorithms against the current state-of-the-art algorithms for Dueling Bandits, and (3) we evaluate the algorithms on multiple scenarios using synthetically generated data, demonstrating performance under multiple definitions of optimality that in some cases exceeds the state-of-the-art.
2 Background
2.1 Dueling Bandits
The dueling bandits problem is described in Problem 1. The random matrices are independent and identically distributed. Each element is Bernoulli distributed such that denotes the probability of action winning a duel with action .
For Thompson sampling algorithms, we will assume that the win probabilities depend on an unobserved random parameter, , so that . The parameter can be used to encode correlations between the actions and other structural assumptions.
For algorithms based on Exp3.P and partial monitoring, we assume that , where is a fixed but unknown matrix of win probabilities.
We assume that when and that or , depending on the problem setup.
Random variables and represent the actions selected to duel at each time, and we denote as the available history to help guide the selections. Note that the assumptions about imply that if is observed, then is also known.
2.2 Optimal Actions
It is assumed that there is a subset of optimal actions within the set of available actions, and that we wish to find an optimal action as efficiently as possible. There are several notions of optimality used for dueling bandits. We discuss some of these below, and note that section 4.1 of [18] provides additional definitions.
2.2.1 Copeland and Condorcet Winners
The standard definitions of optimal actions in the dueling bandits literature are Copeland and Condorcet winners. These rely on counting the number of other actions that a particular action is likely to beat in a duel (in the sense of ). Copeland winners are defined as,
[TABLE]
If there is a single action that is likely to beat all other actions, this is known as a Condorcet winner. Copeland winners always exist, even if a Condorcet winner does not exist.
2.2.2 Maximin and Borda Winners
In this paper, we focus on two alternatives to Copeland and Condorcet winners for defining optimal actions: Maximin winners and Borda winners. Both rely on simpler measures of to determine the optimal actions. Maximin winners use row minimum values of , and Borda winners use row average values of . Let us define Maximin winners and Borda winners as,
[TABLE]
Maximin and Borda winners both always exist, even if a Condorcet winner does not exist. Also, Copeland winners are not guaranteed to align with either Maximin or Borda winners. Condorcet winners are guaranteed to align with Maximin winners, but not with Borda winners. For these reasons, we find these to be compelling alternative definitions for optimal actions.
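The four notions of optimality discussed in this section can be compared on a small example. The following is a minimal sketch, not the paper's code: the function name `winners`, the nested-list matrix `P` (with `P[i][j]` the probability that action `i` beats action `j`), and the choice to ignore the diagonal are all assumptions made for illustration.

```python
def winners(P, tol=1e-12):
    """Return the sets of Copeland, Condorcet, Maximin and Borda winners.

    P[i][j] is the probability that action i beats action j.
    The diagonal is ignored (an assumption of this sketch).
    """
    n = len(P)

    def others(i):
        return [j for j in range(n) if j != i]

    # Number of other actions that i is likely to beat (P[i][j] > 1/2).
    copeland = [sum(P[i][j] > 0.5 for j in others(i)) for i in range(n)]
    # Row minimum and row average over the other actions.
    row_min = [min(P[i][j] for j in others(i)) for i in range(n)]
    row_avg = [sum(P[i][j] for j in others(i)) / (n - 1) for i in range(n)]

    def argmax_set(scores):
        best = max(scores)
        return {i for i, s in enumerate(scores) if s >= best - tol}

    return {
        "copeland": argmax_set(copeland),
        # A Condorcet winner is likely to beat every other action.
        "condorcet": {i for i in range(n) if copeland[i] == n - 1},
        "maximin": argmax_set(row_min),
        "borda": argmax_set(row_avg),
    }


# Hypothetical 3-action matrix: action 0 narrowly beats both others,
# while action 1 loses to action 0 but always beats action 2.
P_example = [
    [0.5, 0.6, 0.6],
    [0.4, 0.5, 1.0],
    [0.4, 0.0, 0.5],
]
result = winners(P_example)
```

In this hypothetical matrix, action 0 is the Copeland, Condorcet, and Maximin winner, while action 1 has the largest row average and is therefore the Borda winner, which illustrates that a Condorcet winner aligns with the Maximin winner but not necessarily with the Borda winner.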
2.3 Regret
To characterize the performance of the selected actions over time horizon , we can compare them against ideal selections that could have been made over that time period. This is known as regret. While it may be intuitive that an ideal sequence of selections would be any which maximizes , for a given sequence of selections (and vice versa, minimizes it for ideal selections), this is unreasonable and not possible. Selections are unknown prior to a duel, and adaptations to selection strategies are made after a duel, meaning the original given selection sequence would no longer be valid. Instead, a reasonable ideal sequence of selections that could have been made is for both and to have been optimal actions, at all times. Therefore, if the regret incurred over time horizon is minimized, then the selected actions have converged to optimal actions as efficiently as possible in that time period.
3 Algorithms
3.1 Thompson Sampling for Dueling Bandits
We describe Thompson Sampling in generality, in order to highlight its flexibility. It learns true parameter values , which can represent directly or some other latent values for each action, by sampling the posterior distribution conditioned on the history . The samples of become more accurate as the information in increases, and are used to form an estimate of , which can be used with any optimal action definition. We present algorithms for both Maximin winners (Alg. 1) and Borda winners (Alg. 2).
An appropriate prior distribution over must be chosen so that the posterior distribution can either be determined analytically or sampled from by using computational means (such as Markov chain Monte Carlo). The prior can be used to model correlations between actions, for example by using a Gaussian Process.
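As a hedged illustration of the sampling loop described above, the sketch below uses independent uniform Beta priors on each pairwise win probability and selects Maximin winners of two independent posterior samples. It is not a reproduction of Alg. 1: the function names, the uniform prior, the mirrored entries, and the decision to skip the update when both selected actions coincide are assumptions of this sketch.

```python
import random


def sample_matrix(wins, rng):
    """Draw one preference matrix from independent Beta posteriors.

    wins[i][j] counts duels in which i beat j. Each pair i < j gets a
    Beta(1 + wins[i][j], 1 + wins[j][i]) posterior (uniform prior assumed),
    and the mirrored entry is set to one minus the sample.
    """
    n = len(wins)
    P = [[0.5] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            P[i][j] = rng.betavariate(1 + wins[i][j], 1 + wins[j][i])
            P[j][i] = 1.0 - P[i][j]
    return P


def maximin_action(P):
    """Action whose worst-case win probability against the others is largest."""
    n = len(P)
    return max(range(n), key=lambda i: min(P[i][j] for j in range(n) if j != i))


def thompson_maximin(P_true, T, seed=0):
    """Run T rounds; both actions come from independent posterior samples."""
    rng = random.Random(seed)
    n = len(P_true)
    wins = [[0] * n for _ in range(n)]
    picks = []
    for _ in range(T):
        x = maximin_action(sample_matrix(wins, rng))
        y = maximin_action(sample_matrix(wins, rng))
        if x != y:  # a duel of an action against itself is skipped (assumption)
            if rng.random() < P_true[x][y]:
                wins[x][y] += 1
            else:
                wins[y][x] += 1
        picks.append((x, y))
    return picks, wins


# Hypothetical environment in which action 0 is the Maximin winner.
P_true = [[0.5, 0.9, 0.9], [0.1, 0.5, 0.6], [0.1, 0.4, 0.5]]
picks, wins = thompson_maximin(P_true, T=200, seed=1)
```

Replacing `maximin_action` with a function that maximizes the row average of the sampled matrix would give the corresponding Borda-winner variant under the same assumptions.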
3.2 SparringExp3.P for Dueling Bandits
SparringExp3.P is implemented for dueling bandits in Algorithm 3, and is inspired by the methods in [3] and [8]. It learns from the previous duel outcomes and accordingly adjusts the strategies and using hyperparameters and . For all times and all actions , the update equations are,
[TABLE]
Since only outcome is revealed at each time , the other outcomes in the corresponding rows of must be estimated. These estimates are made using the observed outcome and hyperparameter as follows,
[TABLE]
for all . These estimates satisfy and for all and all times .
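To make the sparring idea concrete, here is a minimal sketch in which two learners, each following the standard Exp3.P form from [5] (softmax weights mixed with a uniform distribution, and gain estimates biased upward by a constant), play against each other. The class `Exp3P`, the function `sparring_round`, the gain convention (1 for the duel winner, 0 for the loser), and the parameter names `eta`, `gamma`, and `beta` are assumptions of this sketch and may differ from the update equations of Algorithm 3.

```python
import math
import random


class Exp3P:
    """Exp3.P-style learner over n actions with gains in [0, 1]."""

    def __init__(self, n, eta, gamma, beta):
        self.n, self.eta, self.gamma, self.beta = n, eta, gamma, beta
        self.log_w = [0.0] * n  # log-weights, kept in log space for stability

    def probs(self):
        """Mix the softmax of the weights with the uniform distribution."""
        m = max(self.log_w)
        w = [math.exp(v - m) for v in self.log_w]
        s = sum(w)
        return [(1 - self.gamma) * wi / s + self.gamma / self.n for wi in w]

    def update(self, played, gain, p):
        """Biased importance-weighted gain estimate for every action."""
        for i in range(self.n):
            est = ((gain if i == played else 0.0) + self.beta) / p[i]
            self.log_w[i] += self.eta * est


def sparring_round(row, col, P_true, rng):
    """One duel: the row learner picks x, the column learner picks y."""
    p, q = row.probs(), col.probs()
    x = rng.choices(range(row.n), weights=p)[0]
    y = rng.choices(range(col.n), weights=q)[0]
    x_wins = rng.random() < P_true[x][y]
    row.update(x, 1.0 if x_wins else 0.0, p)
    col.update(y, 0.0 if x_wins else 1.0, q)
    return x, y


# Hypothetical run with illustrative (not tuned) hyperparameters.
rng = random.Random(0)
P_true = [[0.5, 0.9, 0.9], [0.1, 0.5, 0.6], [0.1, 0.4, 0.5]]
row = Exp3P(3, eta=0.05, gamma=0.1, beta=0.01)
col = Exp3P(3, eta=0.05, gamma=0.1, beta=0.01)
history = [sparring_round(row, col, P_true, rng) for _ in range(500)]
```

The uniform mixing keeps every selection probability at least `gamma / n`, which bounds the size of the importance-weighted estimates.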
3.3 Partial Monitoring Forecaster for Dueling Bandits
The Partial Monitoring forecaster [6] is implemented for dueling bandits in Algorithm 4. The forecaster learns from the previous duel outcomes and accordingly adjusts the strategy using hyperparameters and . For all times and all actions , the update equations are,
[TABLE]
Since only outcome is revealed at each time , the Borda score for , must be estimated using the observed outcome as follows,
[TABLE]
for all . These estimates satisfy for all and all times .
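The estimator of Alg. 4 is not reproduced here. As a hedged illustration of what an unbiased Borda score estimate can look like when both actions are drawn independently from the same strategy `p`, the sketch below uses a hypothetical importance-weighted estimate and verifies its expectation exactly by enumeration. The functions, the strategy `p`, and the convention that the Borda score of action `i` is the average of row `i` over all actions (diagonal included) are assumptions of this sketch.

```python
def borda_scores(P):
    """Row averages of P over all n actions (diagonal included, by assumption)."""
    n = len(P)
    return [sum(row) / n for row in P]


def borda_estimate(i, x, y, outcome, p):
    """Hypothetical importance-weighted estimate of the Borda score of action i.

    x and y are the duelling actions, outcome is 1 if x won and 0 otherwise,
    and p is the strategy from which both x and y were drawn.
    """
    n = len(p)
    return outcome / (n * p[x] * p[y]) if x == i else 0.0


def expected_estimate(i, P, p):
    """Exact expectation of the estimate over x ~ p, y ~ p and the duel outcome."""
    n = len(P)
    total = 0.0
    for x in range(n):
        for y in range(n):
            for outcome, prob in ((1, P[x][y]), (0, 1.0 - P[x][y])):
                total += p[x] * p[y] * prob * borda_estimate(i, x, y, outcome, p)
    return total


# Hypothetical win matrix and strategy.
P = [[0.5, 0.6, 0.6], [0.4, 0.5, 1.0], [0.4, 0.0, 0.5]]
p = [0.5, 0.3, 0.2]
exact = borda_scores(P)
expected = [expected_estimate(i, P, p) for i in range(3)]
```

Under these assumptions the expectation matches the row average for every action, which is the unbiasedness property that such estimates are meant to have.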
3.4 Comparison to State-of-the-Art
Both state-of-the-art dueling bandits algorithms ISS [17] and DTS [21] use variations of specific Thompson Sampling implementations. They both use as prior distributions , for each independent, true value they attempt to learn. Since Beta distributions are conjugate pairs with Bernoulli likelihoods, the independent posterior distributions are able to be determined analytically and are themselves Beta distributions.
While the ISS algorithm is very simple, it does not learn an estimate for . Instead, it learns the more basic overall probability of each action winning a duel with a Condorcet winner. It therefore learns independent values, one for each action. Since it does not learn , it cannot learn to track a Borda winner unless it is also the Condorcet winner.
The DTS algorithm does learn an estimate of . It thus learns independent values, one for each pair in . However, it is a complex and specialized algorithm that tracks the Copeland winner, so it cannot learn to track a Borda winner unless it is also the Copeland winner.
4 Theoretical Results
In this section, we present theorems that upper bound the regret for each of the algorithms described in the previous section, and compare the bounds to those for the current state-of-the-art. Each of the regret upper bounds is of the order $O(T^{\rho})$ with $1/2 \leq \rho \leq 3/4$, and this bound holds regardless of the size of any preference gap $\Delta$ between any two actions. All definitions of regret are normalized, such that the regret incurred at any time satisfies , and therefore . Detailed proofs are provided in the appendix.
Theorem 4.1
Let us define regret over time horizon in the sense of Maximin winner ,
[TABLE]
Then, if actions are selected at each time using Thompson Sampling for Dueling Bandits with Maximin winners (Alg. 1), the expected regret is upper bounded as,
[TABLE]
The proof method is a variation on the worst case bound from [15].
Theorem 4.2
Let us define regret over time horizon in the sense of Borda winner ,
[TABLE]
Then, if actions are selected at each time using Thompson Sampling for Dueling Bandits with Borda winners (Alg. 2), using for , the expected regret is upper bounded as,
[TABLE]
The proof method uses the same concepts from [15] as the proof of Theorem 4.1.
Theorem 4.3
Let us define regret over time horizon in the sense of Maximin winner ,
[TABLE]
Then, if actions are selected at each time using SparringExp3.P for Dueling Bandits (Alg. 3), with hyperparameter values of,
[TABLE]
and satisfying,
[TABLE]
the expected regret is upper bounded as,
[TABLE]
The proof method follows those used for lemma 3.1 and theorems 3.2 and 3.3 in [5].
Theorem 4.4
Let us define regret over time horizon in the sense of Borda winner ,
[TABLE]
Then, if actions are selected at each time using the Partial Monitoring Forecaster for Dueling Bandits (Alg. 4), with hyperparameter values of,
[TABLE]
and satisfying,
[TABLE]
the expected regret is upper bounded as,
[TABLE]
The proof method follows those used for theorem 6.5 in [6].
4.1 Comparison to State-of-the-Art
Many works on dueling bandits assume that a Condorcet winner, , exists. In this case, for all , and let $\Delta$ be the preference gap between the Condorcet winner and the next best action. This commonly allows regret bounds of to be proven. These bounds appear to be superior to the bounds derived in this paper. However, as discussed in [5] (and others), when $\Delta$ is small, the bound becomes larger than the regret for selecting the sub-optimal action each time, which is . Therefore, taking a worst-case value over $\Delta$ leads to an actual regret bound of , which is not superior to the bounds we show.
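The worst-case argument can be checked numerically. The sketch below assumes, purely for illustration, a gap-dependent bound of the generic form $\log(T)/\Delta$ with constants dropped, and a regret of $T\Delta$ for always selecting a sub-optimal action whose gap is $\Delta$. Both forms and the function name are assumptions of this sketch rather than the exact expressions of the works being compared.

```python
import math


def worst_case_regret(T, grid=100_000):
    """Maximise min(log(T)/delta, T*delta) over delta in (0, 1] on a grid.

    For each gap, the regret cannot exceed either bound, so the smaller of
    the two applies; the worst case is the largest such value over all gaps.
    """
    best = 0.0
    for k in range(1, grid + 1):
        delta = k / grid
        best = max(best, min(math.log(T) / delta, T * delta))
    return best


T = 10_000
numeric = worst_case_regret(T)
closed_form = math.sqrt(T * math.log(T))  # value where the two bounds cross
```

Under these assumed forms the two bounds cross at $\Delta = \sqrt{\log(T)/T}$, so the worst case over $\Delta$ grows like $\sqrt{T \log T}$ rather than logarithmically in $T$.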
This is the case for both state-of-the-art methods ISS [17] and DTS [21]. Furthermore, we note that the proof for ISS demonstrates only asymptotic convergence to a Condorcet winner, while the proof for DTS is highly complex (owing to the relatively complex nature of the algorithm). In comparison, the proofs available in appendix A are relatively simple (though presented in a detailed manner).
5 Experimental Results
5.1 Methods
We simulate each of the proposed algorithms, along with the two state-of-the-art algorithms ISS [17] and DTS [21], on two different scenarios using synthetic data. For the Thompson Sampling methods, we use independent priors for the values we attempt to learn. We set directly, for all .
In the Condorcet scenario, an matrix is synthetically generated by linking a latent value for each action (called “utility”) to the duel winning probability for each pair of actions . The utility of each action, , is uniformly distributed between [math] and . We chose to give a larger spread of probabilities over the actions. One action has a maximum utility that is significantly better than that of all other actions, and so it is the lone Borda winner and Condorcet winner, and thus also the lone Maximin winner. Linking the utility of each pair of actions to the corresponding duel winning probability is accomplished by using the logistic function on the gap between utilities of the actions,
[TABLE]
In the Borda scenario, we modify the previous matrix such that the action with the second largest utility becomes the lone Borda winner, even though the same Condorcet and Maximin winner still exists. This is done by setting for all other than the Condorcet winner. This aptly represents why the Borda winner is a reasonable definition for optimality. Even though it isn’t likely to beat every action, it is the most likely to beat an action drawn at random. Each algorithm runs with a time horizon of iterations, for separate runs, on each scenario.
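The construction of the Condorcet scenario described in this section can be sketched as follows. This is an illustration under stated assumptions, not the experimental code: the function name, the direction of the utility gap (`utilities[i] - utilities[j]`), the unscaled logistic function, and the example utility values are all hypothetical.

```python
import math


def logistic_preferences(utilities):
    """Win-probability matrix from latent utilities via the logistic function.

    P[i][j] = 1 / (1 + exp(-(u_i - u_j))), so a larger utility gap in favour
    of action i gives a larger probability that i beats j.
    """
    n = len(utilities)
    return [
        [1.0 / (1.0 + math.exp(-(utilities[i] - utilities[j]))) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical utilities; the last action is clearly better than the rest.
utilities = [0.2, 1.0, 1.7, 3.0]
P = logistic_preferences(utilities)
best = max(range(len(utilities)), key=utilities.__getitem__)
```

With this construction each diagonal entry equals 1/2, the entries `P[i][j]` and `P[j][i]` sum to one, and the action with the largest utility beats every other action with probability above 1/2, so it is a Condorcet winner.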
5.2 Results
The results of the Condorcet scenario are shown in Figure 1, and the results of the Borda scenario are shown in Figure 2. In subfigure (c) of each figure, a shaded area, plotted above the mean, shows the standard deviation over the runs. Additional detailed plots of each algorithm, for each scenario, are available in appendix B.
In the Condorcet scenario, the regret for each algorithm is as prescribed in the respective theorem, and the regret for ISS and DTS uses the Maximin winner (theorem 4.1). All formulations of regret are comparable, because the scenario has the same winning action in all cases. Both state-of-the-art methods show very strong regret performance. However, the Thompson Sampling with Borda winners method shows comparably strong performance, with other methods also performing well. All methods stay below the regret upper bounds proven in their respective theorems.
In the Borda scenario, the regret for all algorithms (including ISS and DTS) uses the Borda winner. This highlights the fact that some of the methods are not capable of performing well in this type of scenario. Both state-of-the-art methods struggle with Borda winners, and so their Borda regret grows linearly. A similar behavior ultimately happens to SparringExp3.P (more details available in the appendix). Thompson Sampling performs strongly in this case. Both methods that focus on Borda winners stay below their respective regret upper bounds.
6 Conclusion
In this paper, we have presented four simple algorithms for Dueling Bandits, each of which is able to efficiently find an optimal action within a finite set of available actions. We proved an upper bound on regret for each, over a variety of different optimal action types, such as the Borda winner. The proven regret bounds were all of the order $O(T^{\rho})$ with $1/2 \leq \rho \leq 3/4$, and did not depend on any preference gap $\Delta$ between any two actions. The algorithms were all evaluated and compared against the current state-of-the-art for Dueling Bandits, the ISS and DTS algorithms. While they did not meet or exceed the performance of ISS and DTS in certain scenarios, in others they demonstrated superior ability to find different types of optimal actions. Overall, their simplicity, regret bounds, and capabilities merit inclusion alongside the current state-of-the-art.
Appendix A Theoretical Results
In this section, we provide formal proofs for all theorems presented in the paper. All random variables and probability distributions use bold font.
A.1 Proof of Theorem 4.1
The proof method is a variation on the worst case bound from [15].
First, we make the following definitions: is the expectation, is the probability measure, is the probability density, and is mutual information, all conditioned on the history , at time . Furthermore, is the Kullback-Leibler divergence and is entropy.
Then we note that Thompson Sampling selects both and using independent samples from the same posterior distribution conditioned on . Therefore, and are independent and identically distributed, and the terms and are identically distributed.
Let be the instantaneous regret at time , such that .
We claim the following,
[TABLE]
To begin proving (7), we show,
[TABLE]
where the second equality follows because is independent of , when conditioned on .
Furthermore,
[TABLE]
where the second equality follows because of the assumption . Combining (9) and (10), gives (7).
Next we prove (8).
[TABLE]
Here the first equality is the chain rule for mutual information, while the second follows from conditional independence of , , and , given . The third equality follows because of conditional independence of and given . The final equality is a standard identity for mutual information. Thus, (8) holds.
Then we bound in terms of the mutual information.
[TABLE]
The first inequality is from Pinsker’s inequality. The second is from the Cauchy-Schwarz inequality. The third is because adding more non-negative terms cannot decrease the sum. The final inequality is because .
Next we cite the following, (see section 5 of [15]) and therefore (Cauchy-Schwarz inequality),
[TABLE]
Finally, we have since there are actions, and so the desired bound is achieved.
A.2 Proof of Theorem 4.2
The proof method uses the same concepts from [15] as the proof of Theorem 4.1.
First, we make the following definitions: is the expectation, is the probability measure, is the probability density, and is mutual information, all conditioned on the history , at time . Furthermore, is the Kullback-Leibler divergence and is entropy.
Then we note that Thompson Sampling selects both and using independent samples from the same posterior distribution conditioned on . Therefore, and are independent and identically distributed, and the terms and are identically distributed.
Let be the instantaneous regret at time , such that .
By construction,
[TABLE]
Now we bound in terms of mutual information.
[TABLE]
[TABLE]
Here (12) is derived analogously to (7), and the inequality (14) follows because and . Then the inequalities (15), (16), and (17) respectively follow from Pinsker’s inequality, the Cauchy-Schwarz inequality, and concavity. The inequality (18) follows because,
[TABLE]
and also from (11),
[TABLE]
The inequality (19) follows because adding extra non-negative terms cannot decrease the sum, and the result in (20) is derived analogously to (8). The inequality (21) follows because implies that .
Next we cite the following, (see section 5 of [15]) and therefore , from the Cauchy-Schwarz inequality. Thus, the regret can be bounded as
[TABLE]
Finally, we have since there are actions, and so the desired bound is achieved when substituting ,
[TABLE]
A.3 Proof of Theorem 4.3
The proof of Theorem 4.3 requires the following auxiliary lemma.
Lemma. If hyperparameter , then the following holds for all and any ,
[TABLE]
Proof. The proof method follows those used for lemma 3.1 in [5].
Taking the expected value with respect to , for any and any ,
[TABLE]
where (a) uses for , which is true because , , and for all and ,
(b) uses,
[TABLE]
(c) uses,
[TABLE]
and (d) uses for all .
Then the following holds for any , since all are independent,
[TABLE]
Finally, since Markov’s inequality implies $\mathbb{P}\big[\log\exp(\mathbf{Y}) \leq \log \delta^{-1}\big] \geq 1 - \delta\,\mathbb{E}\big[\exp(\mathbf{Y})\big]$, by then setting,
[TABLE]
we have that $\mathbb{E}\big[\exp(\mathbf{Y})\big] \leq 1$, and therefore we achieve the desired results,
[TABLE]
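The Markov step used in the lemma can be spelled out as a short worked derivation. It relies only on $\exp(\mathbf{Y})$ being non-negative and on $\delta \in (0,1)$, which is assumed here:

```latex
\begin{align*}
\mathbb{P}\big[\mathbf{Y} > \log \delta^{-1}\big]
  &= \mathbb{P}\big[\exp(\mathbf{Y}) > \delta^{-1}\big]
  \leq \delta\,\mathbb{E}\big[\exp(\mathbf{Y})\big], \\
\mathbb{P}\big[\mathbf{Y} \leq \log \delta^{-1}\big]
  &\geq 1 - \delta\,\mathbb{E}\big[\exp(\mathbf{Y})\big]
  \geq 1 - \delta
  \quad \text{whenever } \mathbb{E}\big[\exp(\mathbf{Y})\big] \leq 1 .
\end{align*}
```

The first line is Markov's inequality applied to $\exp(\mathbf{Y})$ with threshold $\delta^{-1}$, and the second line takes the complementary event.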
Now we turn to the proof of Theorem 4.3. The proof method follows those used for Theorems 3.2 and 3.3 in [5].
Recall that the regret has the form
[TABLE]
Taking the expected value with respect to and , for any ,
[TABLE]
This means we have, for any ,
[TABLE]
Now we will begin bounding the expectation terms, which are taken with respect to being distributed as respectively. But by the definitions of those distributions, we can split them up into the uniform portion and the softmax portions , such that and . Therefore,
[TABLE]
Next we focus on the main softmax expectation terms in eqs. 27 and 28,
[TABLE]
where (a) uses , (b) uses , and (c) uses , , , and for all .
Note that (a) and (b) require , meaning that we need and for all and .
From their definitions,
[TABLE]
and so this requirement is exactly met by the assumption .
Then we look at the uniform expectation terms in eqs. 27 and 28,
[TABLE]
Making these substitutions into eqs. 27 and 28, and summing over time,
[TABLE]
where (a) uses the definitions of and as,
[TABLE]
(b) uses the cancellation of numerators and denominators in successive terms of the product, and that for all , (c) uses that , and (d) uses that $\big(1-\gamma-(1+\beta)A\,\eta\big) \leq 1$, which comes from the assumption , together with the lemma eqs. 22 and 23. Note that the inclusion of from the lemma equations implies that these results hold with probability for any .
Then substituting these into eqs. 25 and 26,
[TABLE]
with probability for any , where (a) uses the assumption .
Since these results are valid for any , we can use them directly in eq. 24,
[TABLE]
and applying the defined hyperparameter values,
[TABLE]
with probability for any .
Now we will verify the requirements on for enforcing the assumption .
Since all of the hyperparameter values are non-negative, then the left-hand side inequality is trivially satisfied.
[TABLE]
And so the requirement is $T \geq \max\big[4.41\,A\log A,\ (0.95^{2}\log A)/(0.1^{2}\,A)\big]$, as desired.
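The requirement on the time horizon is easy to evaluate for a given number of actions. The helper below is a small sketch: the function name is hypothetical, and it assumes that $\log$ denotes the natural logarithm and that there are at least two actions.

```python
import math


def min_horizon(A):
    """Smallest integer T with T >= max(4.41*A*log(A), 0.95^2*log(A)/(0.1^2*A)).

    Assumes the natural logarithm and A >= 2 actions.
    """
    first = 4.41 * A * math.log(A)
    second = (0.95 ** 2) * math.log(A) / ((0.1 ** 2) * A)
    return math.ceil(max(first, second))


horizon_small = min_horizon(2)   # the second term dominates for few actions
horizon_large = min_horizon(10)  # the first term dominates for more actions
```

For two actions the second term is the binding one, while for ten actions the first term is, so neither part of the maximum is redundant.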
Finally, we demonstrate the following fact for random variable with cumulative distribution function ,
[TABLE]
Then recalling the regret high probability upper bound, for the required and any ,
[TABLE]
Now selecting $\mathbf{W}=\big(\mathbf{R}_{T}-4.2\sqrt{A\log A}\,\sqrt{T}\big)/\big(\sqrt{A\,(\log A)^{-1}}\,\sqrt{T}\big)$ we have,
[TABLE]
Therefore, we achieve the desired result:
[TABLE]
A.4 Proof of Theorem 4.4
The proof method follows those used for theorem 6.5 in [6].
First, we recall our definition of the (estimated) Borda score for as,
[TABLE]
and we define the sum of (estimated) Borda scores for action over as,
[TABLE]
which means we can rewrite the regret definition of Theorem 4.4 as,
[TABLE]
Since and are independently drawn from the same probability distribution at each time , we can equivalently prove the expected regret bound of Theorem 4.4 using an expected regret equation strictly in terms of the Borda scores for and ,
[TABLE]
Next we define a lower bound for the log of the ratio of weight sums at times and [math], for any ,
[TABLE]
and an upper bound for the log of the ratio of weight sums at times and ,
[TABLE]
where (a) is from the definition of , (b) is because for , and (d) is because .
For (c), first note that , as sum of softmax components. So it would be equivalent, except that on the far right side it has only a term, and hence . This gives the inequality, since if .
The (b) requirement holds if we have that , because and , with . We confirm this at the end of the proof.
Now we sum the upper bound over , to get the log of the ratio of weight sums at times and [math],
[TABLE]
Then we can compare the lower and upper bounds, to get a single inequality.
[TABLE]
Multiplying both sides by gives,
[TABLE]
and by rearranging terms and noting that ,
[TABLE]
By definition of we then have,
[TABLE]
and by definition for all ,
[TABLE]
Since all terms are using the unbiased estimates of the Borda scores, we can take the expected value on both sides and replace the estimates with the actual scores,
[TABLE]
Noting that for any ,
[TABLE]
Next we bound the remaining expectation term,
[TABLE]
and because the term from the original lower bound is valid for any , we can arbitrarily choose the Borda winner . We thus have,
[TABLE]
Then by canceling the terms and taking the expected value of both sides,
[TABLE]
Now we define the hyperparameters and by using the positive terms ,
[TABLE]
which guarantees and .
Then we substitute them into the terms on the right-hand side of the inequality,
[TABLE]
Combining the terms achieves the desired result.
Finally, we determine the required such that and hold,
[TABLE]
Since for all , this gives the required .
Appendix B Experimental Results
In this section, we provide additional plots that detail the behavior of the algorithms for the different experimental scenarios. For all figures:
- (a) shows a detailed plot of the regret over the runs for the scenario, with off-color lines showing individual runs, thick line showing the mean over runs, and shaded area showing the standard deviation over runs (plotted above the mean)
- (b) shows the action selections over the runs for the scenario, with off-color lines showing individual runs, thick line showing the mean over runs, and shaded area showing the standard deviation over runs (plotted above and below the mean)
- (c) shows the action selections over the runs for the scenario, with off-color lines showing individual runs, thick line showing the mean over runs, and shaded area showing the standard deviation over runs (plotted above and below the mean)
- (d - if applicable) shows the strategy over the runs for the scenario, with thick line showing the mean over runs, and shaded area showing the standard deviation over runs (plotted above and below the mean)
- (e - if applicable) shows the strategy over the runs for the scenario, with thick line showing the mean over runs, and shaded area showing the standard deviation over runs (plotted above and below the mean)
For the Condorcet scenario, see Figs. 3 - 8. For the Borda scenario, see Figs. 9 - 14.
References
- [1] Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39–1, 2012.
- [2] Shipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson sampling. In Artificial Intelligence and Statistics, pages 99–107, 2013.
- [3] Nir Ailon, Zohar Karnin, and Thorsten Joachims. Reducing dueling bandits to cardinal bandits. In International Conference on Machine Learning, pages 856–864, 2014.
- [4] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
- [5] Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
- [6] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
- [7] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249–2257, 2011.
- [8] Miroslav Dudík, Katja Hofmann, Robert E Schapire, Aleksandrs Slivkins, and Masrour Zoghi. Contextual dueling bandits. arXiv preprint arXiv:1502.06362, 2015.
