The Multi-Armed Bandit Problem: An Efficient Non-Parametric Solution
Hock Peng Chan

TL;DR
This paper introduces efficient non-parametric algorithms for the multi-armed bandit problem, addressing limitations of existing methods under general parametric settings, and enhancing arm allocation strategies in machine learning applications.
Contribution
It proposes novel non-parametric procedures that are computationally efficient and effective across various reward distribution settings, improving upon existing methods.
Findings
New non-parametric algorithms outperform traditional methods in diverse settings
Proposed methods achieve lower regret compared to existing non-parametric approaches
Algorithms are applicable to a wide range of reward distributions
Abstract
Lai and Robbins (1985) and Lai (1987) provided efficient parametric solutions to the multi-armed bandit problem, showing that arm allocation via upper confidence bounds (UCB) achieves minimum regret. These bounds are constructed from the Kullback-Leibler information of the reward distributions, estimated from specified parametric families. In recent years there has been renewed interest in the multi-armed bandit problem due to new applications in machine learning algorithms and data analytics. Non-parametric arm allocation procedures like $\epsilon$-greedy, Boltzmann exploration and BESA were studied, and modified versions of the UCB procedure were also analyzed under non-parametric settings. However, unlike UCB, these non-parametric procedures are not efficient under general parametric settings. In this paper we propose efficient non-parametric procedures.
Table 1. Regret (mean ± standard error).

| | Regret | |
|---|---|---|
| SSMC | 88.4±0.2 | 137.0±0.5 |
| UCB1 | 90.2±0.3 | 154.4±0.7 |
| UCB-Agrawal | 113.0±0.3 | 195.7±0.8 |

Table 2. Regret (mean ± standard error).

| | Regret | |
|---|---|---|
| SSTC | 239±1 | 492±5 |
| UCB1-tuned | 130±2 | 847±23 |
| UCB1-Normal | 1536±5 | 4911±31 |

Table 3. Regret (mean ± standard error).

| | Regret | | | Regret | | |
|---|---|---|---|---|---|---|
| SSMC | 141.7±0.4 | 330±1 | 795±3 | 23.6±0.1 | 65.0±0.3 | 236.9±0.8 |
| BESA | 117±1 | 265±2 | 627±3 | 28.9±0.7 | 73±1 | 215±2 |
| UCB1-tuned | 101±2 | 244±3 | 608±6 | 50±1 | 183±3 | 499±6 |
| Boltz 0.1 | 130±2 | 294±4 | 673±7 | 84±2 | 224±4 | 557±6 |
| Boltz 0.2 | 128±2 | 264±3 | 632±6 | 80±1 | 169±3 | 465±6 |
| Boltz 0.5 | 332±1 | 387±2 | 632±5 | 310±5 | 311±2 | 428±4 |
| Boltz 1 | 728±2 | 737±2 | 816±4 | 731±2 | 716±2 | 712±3 |
| $\epsilon$-greedy 0.1 | 170±3 | 327±4 | 681±7 | 133±3 | 283±4 | 579±7 |
| $\epsilon$-greedy 0.2 | 162±3 | 312±4 | 653±6 | 114±2 | 251±4 | 536±6 |
| $\epsilon$-greedy 0.5 | 150±2 | 282±3 | 604±6 | 82±2 | 189±3 | 444±5 |
| $\epsilon$-greedy 1 | 159±2 | 271±3 | 569±5 | 61±1 | 146±3 | 370±5 |
| $\epsilon$-greedy 2 | 200±1 | 289±2 | 559±4 | 52.9±0.9 | 113±2 | 302±4 |
| $\epsilon$-greedy 5 | 334±1 | 396±2 | 617±4 | 63.4±0.5 | 101±1 | 241±3 |
| $\epsilon$-greedy 10 | 524±2 | 567±2 | 742±3 | 95.7±0.4 | 119.5±0.8 | 226±2 |
| $\epsilon$-greedy 20 | 811±3 | 839±3 | 951±3 | 156.9±0.5 | 172.1±0.7 | 251±2 |

Table 4. Frequency of empirical regrets lying within a given range.

| | 0 to 200 | 200 to 400 | 400 to 600 | 600 to 800 | 800 to 1000 | 1000 to 1200 | 1200 to 2100 | Worst emp. regret |
|---|---|---|---|---|---|---|---|---|
| SSMC | 9134 | 845 | 16 | 5 | 0 | 0 | 0 | 770 |
| BESA | 9314 | 424 | 143 | 66 | 27 | 15 | 11 | 2089 |
| UCB1-tuned | 8830 | 625 | 301 | 132 | 64 | 32 | 16 | 1772 |

Table 5. Frequency of empirical regrets lying within a given range.

| | 0 to 1000 | 1000 to 2000 | 2000 to 3000 | 3000 to 4000 | 4000 to 5000 | 5000 to 10000 | 10000 to 21000 | Worst emp. regret |
|---|---|---|---|---|---|---|---|---|
| SSMC | 9988 | 8 | 3 | 0 | 0 | 1 | 0 | 6192 |
| BESA | 9708 | 125 | 59 | 34 | 25 | 40 | 9 | 20639 |
| UCB1-tuned | 8833 | 365 | 250 | 161 | 122 | 225 | 44 | 16495 |

Table 6. Regret by scenario (mean ± standard error where available).

| | Scenario 1 | Scenario 2 | Scenario 3 | Scenario 4 |
|---|---|---|---|---|
| SSMC | 12.4±0.1 | 43.1±0.4 | 97.9±0.2 | 165.3±0.2 |
| SSMC∗ | 9.5±0.2 | 48.5±0.6 | 64.4±0.3 | 156.0±0.4 |
| BESA | 11.83 | 42.6 | 74.41 | 156.7 |
| KL-UCB | 17.48 | 52.34 | 121.21 | 170.82 |
| KL-UCB | 11.54 | 41.71 | 72.84 | 165.28 |
| Thompson | 11.3 | 46.14 | 83.36 | 165.08 |

Table 7. Regret (mean ± standard error where available).

| | Trunc. expo. | Trunc. Poisson |
|---|---|---|
| SSMC | 33.8±0.4 | 18.6±0.1 |
| SSMC∗ | 29.6±0.7 | 14.7±0.2 |
| BESA | 53.26 | 19.37 |
| BESAT | 31.41 | 16.72 |
| KL-UCB-expo | 65.67 | - |
| KL-UCB-Poisson | - | 25.05 |
Hock Peng Chan, National University of Singapore
Department of Statistics and Applied Probability
Block S16, Level 7, 6 Science Drive 2
Faculty of Science
National University of Singapore
Singapore 117546
AMS subject classification: 62L05.
Keywords: efficiency, KL-UCB, subsampling, Thompson sampling, UCB.
Supported by MOE grant number R-155-000-158-112.
1 Introduction
Lai and Robbins (1985) provided an asymptotic lower bound for the regret in the multi-armed bandit problem, and proposed an index strategy that is efficient, that is it achieves this bound. Lai (1987) showed that allocation to the arm having the highest upper confidence bound (UCB), constructed from the Kullback-Leibler (KL) information between the estimated reward distributions of the arms, is efficient when the distributions belong to a specified exponential family. Agrawal (1995) proposed a modified UCB procedure that is efficient despite not having to know in advance the total sample size. Cappé, Garivier, Maillard, Munos and Stoltz (2013) provided explicit, non-asymptotic bounds on the regret of a KL-UCB procedure that is efficient on a larger class of distribution families.
Burnetas and Katehakis (1996) extended UCB to multi-parameter families, almost showing efficiency in the natural setting of normal rewards with unequal variances. Yakowitz and Lowe (1991) proposed non-parametric procedures that do not make use of KL-information, suggesting logarithmic and polynomial rates of regret under finite exponential moment and moment conditions respectively.
Auer, Cesa-Bianchi and Fischer (2002) proposed a UCB1 procedure that achieves logarithmic regret when the reward distributions are supported on [0,1]. They also studied the $\epsilon$-greedy algorithm of Sutton and Barto (1998) and provided finite-time upper bounds of its regret. Both UCB1 and $\epsilon$-greedy are non-parametric in their applications and, unlike UCB-Lai or UCB-Agrawal, are not expected to be efficient under a general exponential family setting. Other non-parametric methods that have been proposed include reinforcement comparison, Boltzmann exploration (Sutton and Barto, 1998) and pursuit (Thathachar and Sastry, 1985). Kuleshov and Precup (2014) provided numerical comparisons between UCB and these methods. For a description of applications to recommender systems and clinical trials, see Shivaswamy and Joachims (2012). Burtini, Loeppky and Lawrence (2015) provided a comprehensive survey of the methods, results and applications of the multi-armed bandit problem, developed over the past thirty years.
A strong competitor to UCB under the parametric setting is the Bayesian method, see for example Fabius and van Zwet (1970) and Berry (1972). There is also a well-developed literature on optimization under an infinite-time discounted window setting, in which allocation is to the arm maximizing a dynamic allocation (or Gittins) index, see the seminal papers Gittins (1979) and Gittins and Jones (1979), and also Berry and Fristedt (1985), Chang and Lai (1987), Brezzi and Lai (2002). Recently there has been renewed interest in the Bayesian method due to the developments of UCB-Bayes [see Kaufmann, Cappé and Garivier (2012)] and Thompson sampling [see for example Korda, Kaufmann and Munos (2013)].
In this paper we propose an arm allocation procedure, subsample-mean comparison (SSMC), that, though non-parametric, is nevertheless efficient when the reward distributions are from an unspecified one-dimensional exponential family. It achieves this by comparing subsample means of the leading arm with the sample means of its competitors. It is empirical in its approach, using more informative subsample means rather than full-sample means alone, for better decision-making. The subsampling strategy was first employed by Baransi, Maillard and Mannor (2014) in their best empirical sampled average (BESA) procedure. However there are key differences in their implementation of subsampling from ours, as will be elaborated in Section 2.2. Though efficiency has been attained for various one-dimensional exponential families by say UCB-Agrawal or KL-UCB, SSMC is the first to achieve efficiency without having to know the specific distribution family. In addition we propose in Section 2.4 a related subsample-$t$ comparison (SSTC) procedure, applying $t$-statistic comparisons in place of mean comparisons, that is efficient for normal distributions with unknown and unequal variances.
The layout of the paper is as follows. In Section 2 we describe the subsample comparison strategy for allocating arms. In Section 3 we show that the strategy is efficient for exponential families, including the setting of normal rewards with unknown and unequal variances. In Section 4 we show logarithmic regret for Markovian rewards. In Section 5 we provide numerical comparisons against existing methods. In Section 6 we provide a concluding discussion. In Section 7 we prove the results of Sections 3 and 4.
2 Subsample comparisons
Let $Y_{k1}, Y_{k2}, \ldots$, $1 \le k \le K$, be the observations (or rewards) from a population (or arm) $\Pi_k$. We assume here and in Section 3 that the rewards are independent and identically distributed (i.i.d.) within each arm. We extend to Markovian rewards in Section 4. Let $\mu_k = E Y_{kt}$ and $\mu_* = \max_{1 \le k \le K} \mu_k$.
Consider a sequential procedure for selecting the population to be sampled, with the decision based on past rewards. Let $N_k$ be the number of observations from $\Pi_k$ when there are $N$ total observations, hence $N = \sum_{k=1}^K N_k$. The objective is to minimize the regret
$$R_N := \sum_{k=1}^K (\mu_* - \mu_k) E N_k.$$
The Kullback-Leibler information number between two densities $f$ and $g$, with respect to a common ($\sigma$-finite) measure, is
$$D(f|g) = E_f \Big[ \log \frac{f(Y)}{g(Y)} \Big], \tag{2.1}$$
where $E_f$ denotes expectation with respect to $Y \sim f$. An arm allocation procedure is said to be uniformly good if
$$R_N = o(N^{\epsilon}) \text{ for all } \epsilon > 0, \tag{2.2}$$
over all reward distributions lying within a specified parametric family.
Let $f_k$ be the density of $Y_{kt}$ and let $f_* = f_k$ for $k$ such that $\mu_k = \mu_*$ (assuming $f_*$ is unique). The celebrated result of Lai and Robbins (1985) is that under (2.2) and additional regularity conditions,
$$\liminf_{N \to \infty} \frac{R_N}{\log N} \ge \sum_{k: \mu_k < \mu_*} \frac{\mu_* - \mu_k}{D(f_k|f_*)}. \tag{2.3}$$
Lai and Robbins (1985) and Lai (1987) went on to propose arm allocation procedures that have regrets achieving the lower bound in (2.3), and are hence efficient.
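To make the lower bound concrete, the following minimal sketch evaluates the constant multiplying $\log N$ in the Lai and Robbins bound for unit-variance normal arms, where the KL information between N($\mu_k$,1) and N($\mu_*$,1) is $(\mu_* - \mu_k)^2/2$. The function name and the example means are illustrative assumptions and do not come from the paper.

```python
def lai_robbins_constant(means):
    """Sum over inferior arms of (mu_star - mu_k) / KL(f_k, f_star).

    Assumes unit-variance normal rewards, for which
    KL(f_k, f_star) = (mu_star - mu_k)^2 / 2.
    """
    mu_star = max(means)
    total = 0.0
    for mu in means:
        gap = mu_star - mu
        if gap > 0:                 # optimal arms contribute nothing
            kl = gap ** 2 / 2.0
            total += gap / kl       # equals 2 / gap for this family
    return total


# Hypothetical example: two inferior arms with gaps 1 and 0.5.
print(lai_robbins_constant([0.0, 0.5, 1.0]))
```

An efficient procedure has regret growing like this constant times $\log N$, so small gaps between arms lead to a larger unavoidable regret.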
2.1 Review of existing methods
In the setting of normal rewards with unit variances, UCB-Lai can be described as the selection, for sampling, of the population $\Pi_k$ maximizing
$$\bar Y_{k n_k} + \sqrt{\frac{2 \log (N/n)}{n_k}}, \tag{2.4}$$
where $\bar Y_{kt} = \frac{1}{t} \sum_{u=1}^t Y_{ku}$, $n$ is the current number of observations from the $K$ populations, and $n_k$ is the current number of observations from $\Pi_k$. Agrawal (1995) proposed a modified version of UCB-Lai that does not involve the total sample size $N$, with the selection instead of the population $\Pi_k$ maximizing
$$\bar Y_{k n_k} + \sqrt{\frac{2 (\log n + \log \log n + b_n)}{n_k}}, \tag{2.5}$$
with $b_n \to \infty$ and $b_n = o(\log n)$. Efficiency holds for (2.4) and (2.5), and there are corresponding versions of (2.4) and (2.5) that are efficient for other one-parameter exponential families. Cappé et al. (2013) proposed a more general KL-UCB procedure that is also efficient for distributions with given finite support.
Auer, Cesa-Bianchi and Fischer (2002) simplified UCB-Agrawal to UCB1, proposing that the population $\Pi_k$ maximizing
$$\bar Y_{k n_k} + \sqrt{\frac{2 \log n}{n_k}} \tag{2.6}$$
be selected. They showed that under UCB1, logarithmic regret is achieved when the reward distributions are supported on [0,1]. In the setting of normal rewards with unequal and unknown variances, Auer et al. suggested applying a variant of UCB1 which they called UCB1-Normal, and showed logarithmic regret. Under UCB1-Normal, an observation is taken from any population $\Pi_k$ with $n_k < \lceil 8 \log n \rceil$. If such a population does not exist, then an observation is taken from the population $\Pi_k$ maximizing
$$\bar Y_{k n_k} + 4 \hat\sigma_{k n_k} \sqrt{\frac{\log n}{n_k}},$$
where $\hat\sigma^2_{kt} = \frac{1}{t-1} \sum_{u=1}^t (Y_{ku} - \bar Y_{kt})^2$.
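The UCB1 rule of Auer et al. can be sketched as follows: each arm is scored by its sample mean plus an exploration bonus of $\sqrt{2 \log n / n_k}$, where $n$ is the total number of rewards so far and $n_k$ the number from arm $k$. The function name and the argument layout (per-arm reward sums and counts) are assumptions made for this illustration.

```python
import math


def ucb1_select(sums, counts):
    """Return the index of the arm with the largest UCB1 index.

    sums[k] is the total reward and counts[k] the number of rewards
    observed from arm k. Arms with no rewards are sampled first.
    """
    for k, c in enumerate(counts):
        if c == 0:
            return k
    n = sum(counts)
    best, best_val = 0, -math.inf
    for k, (s, c) in enumerate(zip(sums, counts)):
        val = s / c + math.sqrt(2.0 * math.log(n) / c)
        if val > best_val:
            best, best_val = k, val
    return best
```

With equal counts the bonus is the same for every arm and the rule reduces to picking the largest sample mean; a rarely sampled arm receives a large bonus and is explored.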
Auer et al. provided an excellent study of various non-parametric arm allocation procedures, for example the $\epsilon$-greedy procedure proposed by Sutton and Barto (1998), in which an observation is taken from the population with the largest sample mean with probability $1 - \epsilon$, and randomly with probability $\epsilon$. Auer et al. suggested replacing the fixed $\epsilon$ at every stage by a stage-dependent
$$\epsilon_n = \min \Big( 1, \frac{cK}{d^2 n} \Big), \tag{2.7}$$
with user-specified $c > 0$ and $0 < d < 1$. They showed that if $c > 5$ and $d$ is no larger than the smallest gap $\mu_* - \mu_k$ among the inferior arms, then logarithmic regret is achieved for reward distributions supported on $[0,1]$. A more recent numerical study by Kuleshov and Precup (2014) considered additional non-parametric procedures, for example Boltzmann exploration in which an observation is taken from $\Pi_k$ with probability proportional to $e^{\bar Y_{k n_k}/\tau}$, for some $\tau > 0$.
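The two randomized rules just described can be sketched as below. The schedule follows the decaying-$\epsilon$ form of Auer et al., $\min(1, cK/(d^2 n))$, and the Boltzmann rule is written in its usual softmax form with a temperature parameter; the function names and the temperature argument `tau` are assumptions for this illustration.

```python
import math
import random


def epsilon_schedule(n, c, d, n_arms):
    """Stage-dependent exploration probability min(1, c*K / (d^2 * n))."""
    return min(1.0, c * n_arms / (d * d * n))


def epsilon_greedy_select(means, eps, rng=random):
    """With probability 1 - eps pick the arm with the largest sample
    mean, otherwise pick an arm uniformly at random."""
    if rng.random() < eps:
        return rng.randrange(len(means))
    return max(range(len(means)), key=lambda k: means[k])


def boltzmann_probs(means, tau):
    """Selection probabilities proportional to exp(mean / tau)."""
    m = max(means)  # subtracting the max avoids overflow
    weights = [math.exp((x - m) / tau) for x in means]
    total = sum(weights)
    return [w / total for w in weights]
```

Both rules base their decisions on full-sample means only, which is the feature that the subsample comparisons of the next section depart from.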
2.2 Subsample-mean comparisons
A common characteristic of the procedures described in Section 2.1 is that allocation is based solely on a comparison of the sample means $\bar Y_{k n_k}$, with the exception of UCB1-Normal in which $\hat\sigma_{k n_k}$ is also utilized. As we shall illustrate in Section 2.3, we can utilize subsample-mean information from the leading arm to estimate the confidence bounds for selecting from the other arms. In contrast UCB-based procedures like KL-UCB discard subsample information and rely on parametric information to estimate these bounds. Even though subsample-mean and KL-UCB are both efficient for exponential families, the advantage of subsample-mean is that the underlying family need not be specified.
In SSMC a leader is chosen in each round of play to compete against all the other arms. Let $r$ denote the round number. In round 1, we sample all $K$ arms. In round $r$ for $r > 1$, we set up a challenge between the leading arm (to be defined below) and each of the other arms. An arm is sampled only if it wins all its challenges in that round. Hence for round $r > 1$ we sample either the leading arm or a non-empty subset of the challengers. Let $n$ ($= n^r$) be the total number of observations from all $K$ arms at the beginning of round $r$, let $n_k$ ($= n_k^r$) be the corresponding number from $\Pi_k$. Hence $n^2 = K$ and $n_k^2 = 1$ for all $k$, and $n^r + 1 \le n^{r+1} \le n^r + K - 1$ for $r \ge 2$.
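Let $c_n$ be a non-negative monotone increasing sampling threshold in SSMC and SSTC, with
$$c_n = o(\log n) \text{ and } \frac{c_n}{\log \log n} \to \infty \text{ as } n \to \infty. \tag{2.8}$$
For example in our implementation of SSMC and SSTC in Section 5, we select $c_n = \sqrt{\log n}$. An explanation of why (2.8) is required for efficiency of SSMC is given in the beginning of Section 7.1. Let $\bar Y_{k,t:u} = \frac{1}{u-t+1} \sum_{v=t}^u Y_{kv}$, hence $\bar Y_{kt} = \bar Y_{k,1:t}$.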
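Subsample-mean comparison (SSMC)
1. $r = 1$. Sample each $\Pi_k$ exactly once.
2. $r = 2, 3, \ldots$.
   (a) Let the leader $\zeta$ ($= \zeta^r$) be the population with the most observations, with ties resolved by (in order):
      i. the population with the larger sample mean,
      ii. the leader of the previous round,
      iii. randomization.
   (b) For all $k \ne \zeta$ set up a challenge between $\Pi_\zeta$ and $\Pi_k$ in the following manner.
      i. If $n_k = n_\zeta$, then $\Pi_k$ loses the challenge automatically.
      ii. If $n_k < n_\zeta$ and $n_k < c_n$, then $\Pi_k$ wins the challenge automatically.
      iii. If $c_n \le n_k < n_\zeta$, then $\Pi_k$ wins the challenge when
      $$\bar Y_{k n_k} \ge \bar Y_{\zeta, t:(t+n_k-1)} \text{ for some } 1 \le t \le n_\zeta - n_k + 1. \tag{2.9}$$
   (c) For all $k \ne \zeta$, sample from $\Pi_k$ if $\Pi_k$ wins its challenge against $\Pi_\zeta$. Sample from $\Pi_\zeta$ if $\Pi_\zeta$ wins all its challenges. Hence either $\Pi_\zeta$ is sampled, or a non-empty subset of $\{\Pi_k: k \ne \zeta\}$ is sampled.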
SSMC may recommend more than one population to be sampled in a single round when $K > 2$. In the event that $n^r < N < n^{r+1}$ for some $r$, we select $N - n^r$ populations randomly from among the $n^{r+1} - n^r$ recommended by SSMC in the $r$th round, in order to make up exactly $N$ observations.
If $\Pi_\zeta$ wins all its challenges, then $\zeta$ and $(n_k: k \ne \zeta)$ are unchanged, and in the next round it suffices to perform the comparison in (2.9) at the largest $t$ instead of at every $t$. The computational cost is thus $O(1)$. The computational cost is $O(r)$ if at least one $k \ne \zeta$ wins its challenge. Hence when there is only one optimal arm and SSMC achieves logarithmic regret, the total computational cost is $O(r \log r)$ for $r$ rounds of the algorithm.
In step 2(b)ii. we force the exploration of arms with fewer than $c_n$ rewards. By (2.8) we select $c_n$ small compared to $\log n$, so that the cost of such forced explorations is asymptotically negligible. In contrast the forced exploration in the greedy algorithm (2.7) is more substantial, of order $\log N$ for $N$ rewards.
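The SSMC steps above can be sketched in code as follows. This is a minimal illustration rather than the paper's implementation: the challenger's sample mean is compared with the mean of every block of consecutive leader rewards of the same length, recomputed from scratch in each round instead of updated incrementally as described above. The threshold $\sqrt{\log n}$, the `draw` callback and all function names are assumptions made for the sketch.

```python
import math
import random


def mean(xs):
    return sum(xs) / len(xs)


def choose_leader(obs, prev_leader, rng):
    """Arm with the most rewards; ties go to the larger sample mean,
    then to the previous leader, then to a random pick."""
    top_n = max(len(o) for o in obs)
    cand = [k for k, o in enumerate(obs) if len(o) == top_n]
    top_mean = max(mean(obs[k]) for k in cand)
    cand = [k for k in cand if mean(obs[k]) == top_mean]
    if prev_leader in cand:
        return prev_leader
    return rng.choice(cand)


def wins_challenge(challenger, leader, c_n):
    """Challenge between a challenger and the leader (lists of rewards)."""
    n_k, n_z = len(challenger), len(leader)
    if n_k == n_z:
        return False   # equal sample sizes: challenger loses
    if n_k < c_n:
        return True    # forced exploration of rarely sampled arms
    target = mean(challenger)
    # compare with every block of n_k consecutive leader rewards
    return any(target >= mean(leader[t:t + n_k])
               for t in range(n_z - n_k + 1))


def ssmc(draw, n_arms, n_rounds, seed=0):
    """Run the sketch for n_rounds rounds; draw(k, rng) returns one
    reward from arm k. Returns the list of rewards per arm."""
    rng = random.Random(seed)
    obs = [[draw(k, rng)] for k in range(n_arms)]   # round 1
    leader = None
    for _ in range(2, n_rounds + 1):
        n = sum(len(o) for o in obs)
        c_n = math.sqrt(math.log(n))                # assumed threshold
        leader = choose_leader(obs, leader, rng)
        winners = [k for k in range(n_arms) if k != leader
                   and wins_challenge(obs[k], obs[leader], c_n)]
        for k in (winners or [leader]):             # leader only if no winner
            obs[k].append(draw(k, rng))
    return obs
```

A hypothetical run with two normal arms would be `ssmc(lambda k, rng: rng.gauss([0.0, 1.0][k], 1.0), 2, 200)`. With two arms exactly one reward is added per round after the first, so the call returns 201 rewards in total.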
BESA, proposed by Baransi, Maillard and Mannor (2014), also applies subsample-mean comparisons. We describe BESA for $K = 2$ below, noting that tournament-style elimination is applied for $K > 2$. Unlike SSMC, exactly one population is sampled in each round $r > 1$ even when $K > 2$.
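Best Empirical Sampled Average (BESA)
1. $r = 1$. Sample both $\Pi_1$ and $\Pi_2$.
2. $r = 2, 3, \ldots$.
   (a) Let the leader $\zeta$ be the population with more observations, and let $k \ne \zeta$.
   (b) Sample randomly without replacement $n_k$ of the $n_\zeta$ observations from $\Pi_\zeta$, and let $\bar Y^*_{\zeta n_k}$ be the mean of the $n_k$ observations.
   (c) If $\bar Y_{k n_k} \ge \bar Y^*_{\zeta n_k}$, then sample from $\Pi_k$. Otherwise sample from $\Pi_\zeta$.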
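The two-arm BESA step can be sketched as below: the arm with more rewards is subsampled without replacement down to the size of the other arm, and the two means are compared. Breaking equal sample sizes in favour of arm 0 as leader is an assumption of this sketch, as are the function and argument names.

```python
import random


def besa_two_arm_choice(obs, rng=random):
    """Return the index (0 or 1) of the arm to sample next.

    obs is a pair of lists holding the rewards seen so far from
    each arm; both lists must be non-empty.
    """
    leader = 0 if len(obs[0]) >= len(obs[1]) else 1
    k = 1 - leader
    n_k = len(obs[k])
    sub = rng.sample(obs[leader], n_k)   # subsample without replacement
    if sum(obs[k]) / n_k >= sum(sub) / n_k:
        return k
    return leader
```

Only one random subsample of the leader is examined per round, whereas SSMC scans all blocks of consecutive leader rewards, which is why SSMC gives an undersampled arm more chances to win.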
As can be seen from the descriptions of SSMC and BESA, the mechanism of choosing the arm to be played in SSMC clearly promotes exploration of non-leading arms, relative to BESA. Whereas Baransi et al. demonstrated logarithmic regret of BESA for rewards bounded on [0,1] (though BESA can of course be applied on more general settings but with no such guarantees), we show in Section 3 that SSMC is able to extend BESA’s subsampling idea to achieve asymptotic optimality, that is efficiency, on a wider set of distributions. Tables 4 and 5 in Section 5 show that SSMC controls the oversampling of inferior arms better relative to BESA, due to its added explorations.
2.3 Comparison of SSMC with UCB methods
Lai and Robbins (1985) proposed a UCB strategy in which the arms take turns to challenge a leader with order observations. Let us restrict to the setting of exponential families. Denote the leader by and the challenger by . Lai and Robbins proposed, in their (3.1), upper confidence bounds satisfying
[TABLE]
The decision is to sample from arm if
[TABLE]
otherwise arm is sampled. By doing this we ensure that if , then the probability that arm is sampled is .
We next consider SSMC. Let . Since is of order , it follows that if , then as is stochastically larger than ,
[TABLE]
In SSMC we sample from arm if , ensuring, as in Lai and Robbins, that an optimal arm is sampled with probability when the leading arm is inferior.
In summary SSMC differs from UCB in that it compares against a lower confidence bound of the leading arm, computed from subsample-means instead of parametrically. Nevertheless the critical values that SSMC and UCB-based methods employ for allocating arms are asymptotically the same, as we shall next show.
For simplicity let us consider unit variance normal densities with . Consider firstly unbalanced sample sizes with say and note, see Appendix A, that
[TABLE]
Hence arm 2 winning the challenge requires
[TABLE]
By (2.5) and (2.6), UCB-Agrawal, KL-UCB and UCB1 also select arm 2 when (2.11) holds, since . Hence what SSMC does is to estimate the critical value , empirically by using the minimum of the running averages . In the case of both large compared to , , and SSMC, UCB-Agrawal, KL-UCB and UCB1 essentially select the population with the larger sample mean.
2.4 Subsample-$t$ comparisons
For efficiency outside one-parameter exponential families, we need to work with test statistics beyond sample means. For example to achieve efficiency for normal rewards with unknown and unequal variances, the analogue of mean comparisons is $t$-statistic comparisons
[TABLE]
where and . Since is unknown, we estimate it by .
Subsample-$t$ comparison (SSTC)
Proceed as in SSMC, with step 2(b)iii.′ below replacing step 2(b)iii.
iii.′ If , then wins the challenge when either or
[TABLE]
As in SSMC only computations are needed for rounds when there is only one optimal arm and the regret is logarithmic. This is because it suffices to record the range of that satisfies (2.12) for each , and the actual value of . The updating of these requires computations when both and are unchanged.
3 Efficiency
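Consider firstly an exponential family of density functions
$$f(y; \theta) = e^{\theta y - \psi(\theta)} f(y; 0), \quad \theta \in \Theta, \tag{3.1}$$
with respect to some measure $\nu$, where $\psi(\theta) = \log [\int e^{\theta y} f(y; 0) \nu(dy)]$ is the log moment generating function and $\Theta = \{\theta: \psi(\theta) < \infty\}$. For example the Bernoulli family satisfies (3.1) with $\nu$ the counting measure on $\{0, 1\}$ and $f(0; 0) = f(1; 0) = \frac{1}{2}$. The family of normal densities with variance $\sigma^2$ satisfies (3.1) with $\nu$ the Lebesgue measure and $f(y; 0) = \frac{1}{\sigma \sqrt{2\pi}} e^{-y^2/(2\sigma^2)}$.
Let $f_k = f(\cdot; \theta_k)$ for some $\theta_k \in \Theta$, $1 \le k \le K$. Let $\theta_* = \max_{1 \le k \le K} \theta_k$ and $f_* = f(\cdot; \theta_*)$. By (2.1) and (3.1), the KL-information in (2.3),
$$D(f_k|f_*) = \int \{(\theta_k - \theta_*) y - [\psi(\theta_k) - \psi(\theta_*)]\} f(y; \theta_k) \nu(dy) = (\theta_k - \theta_*) \mu_k - [\psi(\theta_k) - \psi(\theta_*)] = I_*(\mu_k),$$
where $I_*$ is the large deviations rate function of $f_*$. Let $\Xi = \{\ell: \mu_\ell = \mu_*\}$ be the set of optimal arms.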
Theorem 1.
For the exponential family (3.1), SSMC satisfies
[TABLE]
and is thus efficient.
UCB-Agrawal and KL-UCB are efficient as well for (3.1), see Agrawal (1995) and Cappé et al. (2013). SSMC is unique in that it achieves efficiency by being adaptive to the exponential family, whereas UCB-Agrawal and KL-UCB achieve efficiency by having selection procedures that are specific to the exponential family. On the other hand UCB-based methods require less storage space, and more informative finite-time bounds have been obtained. Specifically for UCB-based methods in exponential families we need only store the sample mean for each arm, and the numerical complexity is of the same order as the sample size. For SSMC as given in Section 2.3, all observations are stored (more of this in Section 6) and the numerical complexity for a sample of size $N$ is $N \log N$ when we have efficiency and exactly one optimal arm.
We next consider normal rewards with unequal and unknown variances, that is with densities
[TABLE]
with respect to Lebesgue measure. Let . Burnetas and Katehakis (1996) showed that if , then under uniformly fast convergence and additional regularity conditions, an arm allocation procedure must have regret satisfying
[TABLE]
They proposed an extension of UCB-Lai but needed the verification of a technical condition to show efficiency. In the case of UCB1-Normal, logarithmic regret also depended on tail bounds of the $\chi^2$- and $t$-distributions that were only shown to hold numerically by Auer et al. (2002). In Theorem 2 we show that SSTC achieves efficiency.
Theorem 2.
For normal densities (3.3) with unequal and unknown variances, SSTC satisfies
[TABLE]
and is thus efficient.
4 Logarithmic regret
We show here that logarithmic regret can be achieved by SSMC under Markovian assumptions. This is possible because in SSMC we compare blocks of observations that retain the Markovian structure.
For , let be a potentially unobserved -valued Markov chain, with -field and transition kernel
[TABLE]
We shall assume for convenience that is stationary. Let be real-valued and conditionally independent given , and having conditional densities , with respect to some measure , such that
[TABLE]
We assume that the Markov chains are independent, and that the following Doeblin-type condition holds.
(C1) For , there exists a non-trivial measure on such that
[TABLE]
As before let , and the regret
[TABLE]
In addition to (C1) we assume the following sample mean large deviations.
(C2) For any , there exists and such that for and ,
[TABLE]
(C3) For such that and such that , there exists , and such that for and ,
[TABLE]
Theorem 3.
For Markovian rewards satisfying (C1)–(C3), SSMC achieves for , hence .
Agrawal, Teneketzis and Anantharam (1989) and Graves and Lai (1997) considered control problems in which, instead of (4.1) with Markov chains, there are arms with each arm representing a distinct Markov transition kernel acting on the same chain. Tekin and Liu (2010) on the other hand considered (4.1), with the constraints that is finite and is a point mass function for all and . They provided a UCB algorithm that achieves logarithmic regret.
We can apply Theorem 3 to show logarithmic regret for i.i.d. rewards on non-exponential parametric families. Lai and Robbins (1985) showed that for the double exponential (DE) densities
[TABLE]
with , efficiency is achieved by a UCB strategy involving KL-information of the DE densities, hence implementation requires knowledge that the family is DE, including knowing . In Example 1 below we state logarithmic regret, rather than efficiency, for SSMC. The advantage of SSMC is that we do not assume knowledge of (4.4) in its implementation. Verification of (C1)–(C3) under (4.4) is given in Appendix B.
Example 1. For the double exponential densities (4.4), conditions (C1)–(C3) hold, hence under SSMC, for .
5 Numerical studies
We compare SSMC and SSTC against procedures described in Section 2.1, as well as more modern procedures like BESA, KL-UCB, UCB-Bayes and Thompson sampling. The reader can refer to Chapters 1–3 of Kaufmann (2014) for a description of these procedures. In Examples 2 and 3 we consider normal rewards and the comparisons are against procedures in which either efficiency or logarithmic regret has been established. In Example 4 we consider double exponential rewards and there the comparisons are against procedures that have been shown to perform well numerically. In Examples 5–7 we perform comparisons under the settings of Baransi, Maillard and Mannor (2014).
In the simulations done here 10000 datasets are generated for each $N$, and the regret of a procedure is estimated by averaging over $\sum_{k=1}^K (\mu_* - \mu_k) N_k$. Standard errors are located after the ± sign. In Examples 5–7 we reproduce simulation results from Baransi et al. (2014). Though no standard errors are provided, they are likely to be small given that a larger number of datasets are generated there.
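The estimation scheme can be sketched as follows: the empirical regret of one dataset is the gap-weighted count of pulls of inferior arms, and the reported figure is the average over independent datasets together with the standard error of that average. The function names and the `run_once` callback are assumptions made for this illustration.

```python
import math
import random


def empirical_regret(means, counts):
    """Gap-weighted number of pulls: sum of (mu_star - mu_k) * N_k."""
    mu_star = max(means)
    return sum((mu_star - mu) * c for mu, c in zip(means, counts))


def regret_estimate(run_once, n_datasets, seed=0):
    """Average the empirical regret over independent datasets and
    return (average, standard error of the average).

    run_once(rng) must simulate one dataset and return its
    empirical regret; n_datasets must be at least 2.
    """
    rng = random.Random(seed)
    vals = [run_once(rng) for _ in range(n_datasets)]
    avg = sum(vals) / n_datasets
    var = sum((v - avg) ** 2 for v in vals) / (n_datasets - 1)
    return avg, math.sqrt(var / n_datasets)
```

The standard error shrinks like one over the square root of the number of datasets, which is why estimates based on many datasets can be reported to one decimal place.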
Example 2. Consider N(), . In Table 1 we see that SSMC improves upon UCB1 and outperforms UCB-Agrawal [setting in (2.5)]. Here we generate N(0,1) in each dataset.
Example 3. Consider N(), . We compare SSTC against UCB1-tuned and UCB1-Normal. UCB1-tuned was suggested by Auer et al. and shown to perform well numerically. Under UCB1-tuned the population maximizing
[TABLE]
where , is selected. In Table 2 we see that UCB1-tuned is significantly better at whereas SSTC is better at . UCB1-Normal performs quite poorly. Here we generate and in each dataset.
Kaufmann, Cappé and Garivier (2012) performed simulations under the setting of normal rewards with unequal variances, with , , and . They showed that UCB-Bayes achieves regret of about 28 at and about 47 at . We apply SSTC on this setting, achieving regrets of 26.0±0.1 at and 43.3±0.2 at .
Example 4. Consider double exponential rewards , with densities
[TABLE]
We compare SSMC against UCB1-tuned, BESA, Boltzmann exploration and $\epsilon$-greedy. For $\epsilon$-greedy we consider . We generate N(0,1) in each dataset.
Table 3 shows that UCB1-tuned has the best performances at , whereas SSMC has the best performances at . BESA does well for at , and also for at . A properly-tuned Boltzmann exploration does well at for , whereas a properly-tuned $\epsilon$-greedy does well at and 5 for and at for .
In Tables 4 and 5 we tabulate the frequencies of the empirical regrets over the simulation runs each for and 10000, at , for SSMC, BESA and UCB1-tuned. The tables show that SSMC has the best control of excessive sampling of inferior arms, the worst empirical regret being less than half that of BESA and UCB1-tuned.
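A frequency tabulation of this kind can be sketched as below, counting how many simulated empirical regrets fall in each range and recording the worst one. Treating each range as closed on the left and open on the right is an assumption of the sketch, as are the function name and the example values.

```python
def regret_frequencies(regrets, edges):
    """Count empirical regrets in [edges[i], edges[i+1]) and return
    the counts together with the worst (largest) regret."""
    counts = [0] * (len(edges) - 1)
    for x in regrets:
        for i in range(len(counts)):
            if edges[i] <= x < edges[i + 1]:
                counts[i] += 1
                break
    return counts, max(regrets)


# Hypothetical regrets from four simulation runs.
print(regret_frequencies([10, 250, 399, 770], [0, 200, 400, 600, 800]))
```

Looking at the whole distribution rather than the average alone exposes the rare runs in which a procedure oversamples an inferior arm.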
Example 5. Consider Bernoulli rewards under the following scenarios.
, . 2. 2.
, . 3. 3.
, , ,
. 4. 4.
, .
When comparing the simulated regrets in Table 6, it is useful to remember that BESA and SSMC are non-parametric, using the same procedures even when the rewards are not Bernoulli, whereas KL-UCB and Thompson sampling utilize information on the Bernoulli family. SSMC∗ is a variant of SSMC, see Section 6, with more moderate levels of explorations.
Example 6. Consider truncated exponential and Poisson distributions with . For truncated exponential we consider , where (density ) with , . For truncated Poisson we consider , where , with , . The simulation results are given in Table 7. BESAT is a variation of BESA that starts with 10 observations from each population.
Example 7. Consider and with and . Here SSMC underperforms with regret of 163±7 compared to Thompson sampling, which has regret of 13.18. On the other hand SSTC, by normalizing the different scales of the two uniform distributions, is able to achieve the best regret of 2.9±0.2.
6 Discussion
Together with BESA, the procedures SSMC and SSTC that we introduce here form a class of non-parametric procedures that differ from traditional non-parametric procedures, like $\epsilon$-greedy and Boltzmann exploration, in their recognition that when deciding which of two populations is to be sampled, samples or subsamples of the same rather than different sizes should be compared. Among the parametric procedures, Thompson sampling fits most with this scheme.
As mentioned earlier, in SSMC (and SSTC), when the leading population in the previous round is sampled, essentially only one additional comparison is required in the current round between and for . On the other hand when there are rewards, an order comparisons may be required between and when wins in the previous round. It is these added comparisons that, relative to BESA, allows for faster catching-up of a potentially undersampled optimal arm. Tables 4 and 5 show the benefits of such added explorations in minimizing the worst-case empirical regret.
To see if SSMC still works well if we moderate these added explorations, we experimented with the following variation of SSMC in Examples 6 and 7. The numerical results indicate improvements.
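SSMC∗
Proceed as in SSMC, with step 2(b)iii. replaced by the following.
2(b)iii′ If $c_n \le n_k < n_\zeta$, then $\Pi_k$ wins the challenge when
$$\bar Y_{k n_k} \ge \bar Y_{\zeta, t:(t+n_k-1)} \text{ for some } t = 1 + u n_k, \ 0 \le u \le \lfloor n_\zeta/n_k \rfloor - 1.$$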
In contrast to SSMC, in SSMC∗ we partition the rewards of the leading arm into groups of size for comparisons instead of reusing the rewards in moving-averages. In principle the members of the group need not be consecutive in time, thus allowing for the modifications of SSMC∗ to provide storage space savings when the support of the distributions is finite. That is rather than to store the full sequence, we simply store the number of occurrences at each support point, and generate a new (permuted) sequence for comparisons whenever necessary. Likewise in BESA, there is substantial storage space savings for finite-support distributions by storing the number of occurrences at each support point.
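The partitioned comparison and the count-based storage can be sketched as follows. The first function compares the challenger's mean with the means of non-overlapping groups of leader rewards; the second regenerates a permuted reward sequence from stored counts at each support point. Both function names are assumptions, and the sketch omits the sample-size checks of steps 2(b)i and 2(b)ii.

```python
import random
from collections import Counter


def wins_challenge_blocks(challenger, leader):
    """Challenger wins if its mean is at least the mean of some
    non-overlapping group of len(challenger) leader rewards."""
    n_k = len(challenger)
    target = sum(challenger) / n_k
    n_blocks = len(leader) // n_k
    return any(
        target >= sum(leader[u * n_k:(u + 1) * n_k]) / n_k
        for u in range(n_blocks)
    )


def permuted_sequence(counts, rng):
    """Rebuild a reward sequence in random order from the number of
    occurrences stored at each support point."""
    seq = [value for value, c in counts.items() for _ in range(c)]
    rng.shuffle(seq)
    return seq


# Hypothetical finite-support example: three 0s and two 1s stored as counts.
stored = Counter({0: 3, 1: 2})
print(sorted(permuted_sequence(stored, random.Random(1))))
```

With leader rewards 1, 0, 0, 1 and a challenger of size two, the groups are (1, 0) and (0, 1), so a challenger with mean 0.4 loses here although it would win against the moving-average block (0, 0).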
7 Proofs of Theorems 1–3
Since SSMC and SSTC are index-blind, we may assume without loss of generality that . We provide here the statements and proofs of supporting Lemmas 1 and 2, and follow up with the proofs of Theorems 1–3 in Sections 7.1–7.3. We denote the complement of an event by , let and denote the greatest and least integer function respectively, and let denote the number of elements in a set .
Let be the number of observations from at the beginning of round . Let . Let . Let
[TABLE]
More specifically, let
[TABLE]
If , then . Otherwise the leader is selected randomly (uniformly) from . In particular if has a single element, then that element must be . For , let
[TABLE]
We restrict to because the leader is not defined at . Likewise in our subsequent notations on events , , , and , we restrict to .
In Lemma 1 below the key ingredient leading to (7.3) is condition (I) on the event , which says that it is difficult for an inferior arm with at least rewards to win against a leading optimal arm . In the case of exponential families we show efficiency by verifying (I) with . Condition (II), on the event , says that analogous winnings from an inferior arm with at least rewards, for large, are asymptotically negligible. Condition (III) limits the number of times an inferior arm is leading. This condition is important because and refer to the winning of arm when the leader is optimal, hence the need, in (III), to bound the event probability of an inferior leader.
Lemma 1.
Let i.e. is not an optimal arm and define
[TABLE]
for some , and . Consider the following conditions.
(I) There exists such that for all , as .
(II) There exists such that as .
(III) as .
Under (I)–(III),
[TABLE]
Proof. Consider . Let and . Under the event , arm in round is sampled to a size beyond only when (i.e. under the event ). In view that , it follows that
[TABLE]
Hence
[TABLE]
Similarly under the event ,
[TABLE]
Hence
[TABLE]
[TABLE]
By (III), , therefore by (7), (I) and (II),
[TABLE]
We can thus conclude (7.3) by letting . □
The verification of (III) is made easier by Lemma 2 below. To provide intuition for the reader, we sketch its proof before giving the details.
Lemma 2.
Let
[TABLE]
If as ,
[TABLE]
then as .
Sketch of proof. Note that (7.7) bounds the probability of an inferior arm taking the leadership from an optimal leader in round , whereas (7.8) bounds the probability of an inferior leader winning against an optimal challenger in round . Let and for , let
[TABLE]
Under , there is a leadership takeover by an inferior arm at least once between rounds and . More specifically let be the largest for which . If , then by the definition of , . If , then since we are under , . In summary
[TABLE]
By showing that
[TABLE]
we can conclude from (7.7) and (7) that
[TABLE]
To see (7.10), recall that by step 2(b)i of SSMC or SSTC, if the (optimal) leader and (inferior) challenger have the same sample size, then the challenger loses by default. The tie-breaking rule then ensures that the challenger is unable to take over leadership in the next round. Hence for to lose leadership to an inferior arm in round , it has to lose to arm when arm has exactly observations.
What (7.11) says is that if at some previous round the leader is optimal, then (7.7) makes it difficult for an inferior arm to take over leadership during and after round , so the leader is likely to be optimal all the way from rounds to . The only situation we need to guard against is , the event that leaders are inferior for all rounds between and . Let be the number of rounds an inferior leader wins against at least one optimal arm. In (7.13) we show that by (7.8), the optimal arms will, with high probability, lose less than times between rounds and when the leader is inferior.
We next show that
[TABLE]
(or ), that is if the optimal arms lose this few times, then one of them has to be a leader at some round between to . Lemma 2 follows from (7.11)–(7.13).
Proof of Lemma 2. Consider . By (7.8),
[TABLE]
hence by Markov’s inequality,
[TABLE]
It remains for us to show (7.12). Assume . Let . Observe that if for some . This is because the leader is not sampled if it loses at least one challenge. Moreover by step 2(b)i. of SSMC or SSTC, all arms with the same number of observations as are not sampled. Therefore if and for all , that is if all optimal arms win against an inferior leader, then . In other words,
[TABLE]
Since , it follows from (7.14) that . Therefore
[TABLE]
and since and , we can conclude that
[TABLE]
Under , for , and it follows from (7.15) that
[TABLE]
and (7.12) indeed holds. □
7.1 Proof of Theorem 1
We consider here SSMC. Equation (7.7) follows from Lemma 4 below and whereas (7.8) follows from Lemma 5 and . We can thus conclude from Lemma 2, and together with the verification in Lemma 6 of (I), see Lemma 1, for and (II) for large, we can conclude Theorem 1.
The proofs of Lemmas 4–6 use large deviations Chernoff bounds that are given below in Lemma 3. They can be shown using change-of-measure arguments. Let be the large deviations rate function of .
Lemma 3.
Under (3.1), if , and for some , then
[TABLE]
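As a concrete illustration of a Chernoff bound of this type, the sketch below checks the standard inequality P(sample mean ≥ x) ≤ exp(−n I(x)), for x above the mean, against an exact binomial tail. It assumes Bernoulli rewards as one instance of an exponential family; the function names and the values of `n`, `p` and `x` are illustrative, and this is not the statement of Lemma 3 itself.

```python
import math


def bernoulli_rate(x, p):
    """Large deviations rate function of Bernoulli(p) at x in [0, 1].

    For Bernoulli rewards this equals the Kullback-Leibler divergence
    between Bernoulli(x) and Bernoulli(p).
    """
    def term(a, b):
        return 0.0 if a == 0 else a * math.log(a / b)
    return term(x, p) + term(1 - x, 1 - p)


def upper_tail(n, p, x):
    """Exact P(mean of n i.i.d. Bernoulli(p) draws >= x)."""
    k0 = math.ceil(n * x - 1e-12)  # small slack guards against float round-off
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k0, n + 1))


n, p, x = 50, 0.3, 0.5
exact = upper_tail(n, p, x)
bound = math.exp(-n * bernoulli_rate(x, p))
```

Here `exact` is the true tail probability and `bound` is the Chernoff upper bound; the bound holds for every `n`, which is what makes it usable inside finite-sample arguments.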
In Lemmas 4–6 we let and . Recall that the parameter is a threshold for forced explorations, in step 2(b)ii. of SSMC.
Lemma 4.
Under (3.1), when .
Proof. Let be such that . The event occurs if at round the leading arm is optimal (i.e. ), and it loses to an inferior arm with and for (since arm is leading). It follows from Lemma 3 that
[TABLE]
Since arm loses to arm when , it follows that
[TABLE]
and Lemma 4 holds. □
Lemma 5.
Under (3.1), .
Proof. The event occurs if at round the leading arm is inferior (i.e. ), and it wins a challenge against one or more optimal arms . By step 2(b)ii. of SSMC, arm loses automatically when , hence we need only consider . Note that when , for arm to be the leader, by the tie-breaking rule we require . We shall consider in case 1 and for in case 2.
Case 1: . By Lemma 3,
[TABLE]
Case 2: for . In view that when is the leading arm, we shall show that for large, for each such there exists such that
[TABLE]
The inequality within the brackets in (7.1) follows from partitioning into segments of length , and applying independence of the sample on each segment.
Since , if , then by (3.1),
[TABLE]
Hence if , then as ,
[TABLE]
Let for large ) be such that
[TABLE]
Equation (7.1) follows from (7.22) and the first inequality in (7.23), whereas (7.1) follows from the second inequality in (7.23) and . By (7.18)–(7.1),
[TABLE]
and Lemma 5 holds. □
Lemma 6.
Under (3.1) and , (I) in the statement of Lemma 1 holds for and (II) holds for , where .
Proof. Let . Let be such that . Consider for (in ) and (in ). Since for , it follows from Lemma 3 that
[TABLE]
Since , we can consider large enough such that . Hence if in round arm has sample size of at least , it wins against leading optimal arm only if
[TABLE]
By (1), (7.24), (7.25) and Bonferroni’s inequality,
[TABLE]
and (I) holds because and .
Let . It follows from (1), (7.24), (7.25) and the arguments above that
[TABLE]
and (II) holds because and . □
7.2 Proof of Theorem 2
We consider here SSTC. By Lemmas 1 and 2 it suffices, in Lemmas 8–11 below, to verify the conditions needed to show that (7.3) holds with . Lemma 7 provides the underlying large deviations bounds for the standard error estimator. Let and for ) for N(0,1).
Lemma 7.
For and ,
[TABLE]
Proof. We note that , where , and that has large deviations rate function
[TABLE]
The last equality holds because the supremum occurs when . We conclude (7.26) and (7.27) from (7.16) and (7.17) respectively. □
Lemma 8.
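The supremum computation in a rate function of this kind can be checked numerically. The sketch below assumes the variable in question is the square of a standard normal, whose log moment generating function is −½ log(1 − 2θ) for θ < 1/2; under that assumption the rate function is ½(x − 1 − log x), attained at θ = (x − 1)/(2x). The symbols in the proof above are not recoverable from the text, so this is an illustration of the technique rather than a reproduction of the paper's display.

```python
import math


def objective(theta, x):
    # theta * x - log E[exp(theta * Z^2)] for Z ~ N(0, 1), valid for theta < 1/2
    return theta * x + 0.5 * math.log(1 - 2 * theta)


def rate_closed_form(x):
    # Closed form of sup over theta of objective(theta, x), for x > 0
    return 0.5 * (x - 1 - math.log(x))


def rate_numeric(x, lo=-5.0, hi=0.4999, step=1e-4):
    # Crude grid search over theta in [lo, hi]; adequate because the
    # objective is concave in theta.
    n = int((hi - lo) / step)
    thetas = (lo + i * step for i in range(n + 1))
    best_theta = max(thetas, key=lambda t: objective(t, x))
    return objective(best_theta, x), best_theta


checks = []
for x in (0.5, 2.0, 3.0):
    value, theta_hat = rate_numeric(x)
    checks.append((x, value, rate_closed_form(x), theta_hat, (x - 1) / (2 * x)))
```

Each tuple in `checks` pairs the grid-search value and maximiser with their closed-form counterparts; the rate is zero at x = 1 and positive elsewhere.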
Under (3.3), for some and , when .
Proof. Let be such that . The event occurs if at round the leading arm is optimal, and it loses to an inferior arm with and for . Let , and let be such that . Let , , be quantities that we shall define below. Note that
[TABLE]
Since ,
[TABLE]
It follows from (7.26) and (7.27) that
[TABLE]
[TABLE]
Since N(0,) for , it follows that
[TABLE]
Hence by (7.31),
[TABLE]
We check that for ,
[TABLE]
[TABLE]
and Lemma 8 indeed holds. □
Lemma 9.
Under (3.3), for some .
Proof. The event occurs if at round the leading arm is inferior, and it wins a challenge against one or more optimal arms . By step 2(b)ii. of SSTC, we need only consider . Note that when , for arm to be leader, by the tie-breaking rule we require . Consider taking values , taking values and let , , be quantities that we shall define below.
Case 1. . Let and check that
[TABLE]
Case 2. . Let be such that
[TABLE]
Hence
[TABLE]
We shall show that there exists such that for large ,
[TABLE]
For ,
[TABLE]
Since (7.2) and (7.38) hold with “” replacing “” and “” respectively, by adding to the upper bounds,
[TABLE]
We conclude Lemma 9 from (7.2) and (7.2)–(7.39).
We shall now show (7.38), noting firstly that for large, the satisfying (7.36) is negative. This is because for ,
[TABLE]
whereas .
Let be the common density function of and . By the independence of and ,
[TABLE]
By similar arguments,
[TABLE]
where . Let , and . Since and for large,
[TABLE]
where ( as ). Let . It follows from (7.2)–(7.42) that for large,
[TABLE]
Hence by (7.36), the inequality in (7.38) indeed holds. □
Lemma 10.
Let and be independent. For any and , there exists such that for ,
[TABLE]
Proof. Consider the domain , and the set
[TABLE]
Let , and check that
[TABLE]
the second last equality follows from the infimum occurring at .
Let , , be half-spaces constructed as follows. Let
[TABLE]
The existence of satisfying the second line of (7.2) follows from . Since , by (7.2), we can find half-spaces
[TABLE]
such that . Therefore , and so
[TABLE]
It follows from (7.27), (7.2), (7.2) and the independence of and , setting , that
[TABLE]
Lemma 10, with , follows from substituting (7.47) into (7.46). □
Lemma 11.
Under (3.3) and , (I) in the statement of Lemma 1 holds for and (II) holds for large.
Proof. By considering the rewards , we may assume without loss of generality that . Let (hence ) and . Let and let and be such that
[TABLE]
Let . Since , we can consider large enough such that . By (7.27),
[TABLE]
Let . For ,
[TABLE]
[TABLE]
By (7.49)–(7.52), to show (I), it suffices to show that
[TABLE]
Keeping in mind that , let be such that . It follows from (7.26) and that
[TABLE]
and (7.53) indeed holds. Finally by Lemma 10,
[TABLE]
for some , and so (7.54) follows from (7.48).
To show (II), we consider . By (7.27), we can select large enough to satisfy (7.49) with “” replaced by “”. We note that (7.52) holds with in place of for this . Therefore to show (II), it suffices to note that for large enough, (7.51), (7.53) and (7.54) hold with “” replaced by “”. □
7.3 Proof of Theorem 3
Assume (C1)–(C3) and let . By Lemmas 1 and 2 it suffices, in Lemmas 12–14 below, to verify the conditions needed for SSMC to satisfy (7.3) for some .
Lemma 12.
Under (C2), for some and , when .
Proof. Consider such that . Let and let and be the constants satisfying (C2). Lemma 12 follows from arguments similar to those in the proof of Lemma 4, setting . □
Lemma 13.
Under (C1)–(C3), for some and .
Proof. The event occurs if at round the leading arm is inferior, and it wins against one or more optimal arms . By step 2(b)ii. of SSMC, we need only consider for . Note that and .
Case 1: . Let and be as in the proof of Lemma 12. By (C2), there exists and such that
[TABLE]
Case 2: for . Select for large) such that
[TABLE]
Let and let , . By (C1) and the second inequality of (7.56),
[TABLE]
To see the second inequality of (7.3), let
[TABLE]
Note that the probability in the second line of (7.3) is , and that by (7.56), . By the triangle inequality and the convention ,
[TABLE]
By (C1),
[TABLE]
since depends on whereas depends on . Substituting (7.59) into (7.3) gives us the second inequality of (7.3).
It follows from (C3) and the first inequality of (7.56) that there exists , and such that for ,
[TABLE]
Hence by (7.55) and (7.3), for such that ,
[TABLE]
and Lemma 13 holds. □
Lemma 14.
Under (C2) and , statement (II) in Lemma 1 holds.
Proof. Let and be as in the proof of Lemma 12, and let and be the constants satisfying (C2). For an optimal arm ,
[TABLE]
Let . Since , for large, and therefore by Bonferroni’s inequality,
[TABLE]
and (II) holds. □
Appendix A Showing (2.10)
Let for . It follows from as that
[TABLE]
Assume without loss of generality and consider and (hence ) with . By (A.1) and Bonferroni’s inequality,
[TABLE]
By (A.2) and independence of for ,
[TABLE]
We conclude (2.10) from (A) and (A).
Appendix B Verifications of (C1)–(C3) for double exponential densities
By dividing by if necessary, we may assume without loss of generality that . We check that (C1) holds for , whereas (C2) follows from the Chernoff bounds given in Lemma 3, that is (4.2) holds for and , where is the large deviations rate function of the double exponential density .
Let with and let . Since , and similarly when is replaced by , to show (C3), it suffices to show that for and ,
[TABLE]
where . By (B.1), (C3) holds for , and the above .
Since , with and independent exponential random variables with mean 1, it follows that where and are independent Gamma random variables. Using this, Kotz, Kozubowski and Podgórski (2001) showed, see their (2.3.25), that the density of can be expressed as for , where
[TABLE]
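The representation used above, a double exponential variable as the difference of two independent mean-1 exponentials and a sum of n of them as the difference of two independent Gamma variables, can be checked through characteristic functions. The sketch below assumes the standard double exponential density ½e^{−|x|} (unit scale, as in the reduction at the start of this appendix); the function names are illustrative.

```python
def exp_cf(t):
    """Characteristic function of an exponential random variable with mean 1."""
    return 1 / (1 - 1j * t)


def gamma_cf(t, n):
    """Characteristic function of a sum of n independent mean-1 exponentials."""
    return exp_cf(t) ** n


def laplace_cf(t):
    """Characteristic function of the standard double exponential density."""
    return 1 / (1 + t * t)


def diff_cf(t, n=1):
    # For independent G1, G2 ~ Gamma(n, 1):
    # E exp(it(G1 - G2)) = phi(t) * phi(-t), with phi the Gamma(n, 1) cf.
    return gamma_cf(t, n) * gamma_cf(-t, n)


errors = [abs(diff_cf(t, n) - laplace_cf(t) ** n)
          for t in (0.3, 1.0, 2.5)
          for n in (1, 4)]
```

Every entry of `errors` is zero up to floating-point round-off, since (1 − it)(1 + it) = 1 + t². Characteristic functions determine distributions, so this agreement is the identity in law that the density formula relies on.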
We shall show that
[TABLE]
By (B.3),
[TABLE]
and therefore for ,
[TABLE]
Hence . It follows that for ,
[TABLE]
and (C3) indeed holds.
We shall now show (B.3) by checking that after substituting (B.2) into (B.3), the coefficient of in the left-hand side of (B.3) is not more than in the right-hand side, for . More specifically that (with ),
[TABLE]
Indeed by (B.2),
[TABLE]
and the right-inequality of (B.4) holds.
Acknowledgment
We would like to thank three referees and an Associate Editor for going over the manuscript carefully and providing useful feedback. The changes made in response to their comments have resulted in a much better paper. Thanks also to Shouri Hu for going over the proofs and performing some of the simulations in Examples 5 and 6.
- [1] Agrawal, R. (1995). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Adv. Appl. Probab. 17 1054–1078.
- [2] Agrawal, R., Teneketzis, D. and Anantharam, V. (1989). Asymptotically efficient adaptive allocation schemes for controlled Markov chains: Finite parameter space. IEEE Trans. Automat. Control AC-34 1249–1259.
- [3] Auer, P., Cesa-Bianchi, N. and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning 47 235–256.
- [4] Baransi, A., Maillard, O.A. and Mannor, S. (2014). Sub-sampling for multi-armed bandits. Proceedings of the European Conference on Machine Learning pp.13.
- [5] Berry, D. and Fristedt, B. (1985). Bandit Problems. Chapman and Hall, London.
- [6] Brezzi, M. and Lai, T.L. (2002). Optimal learning and experimentation in bandit problems. J. Econ. Dynamics Cont. 27 87–108.
- [7] Burnetas, A. and Katehakis, M. (1996). Optimal adaptive policies for sequential allocation problems. Adv. Appl. Math. 17 122–142.
- [8] Burtini, G., Loeppky, J. and Lawrence, R. (2015). A survey of online experiment design with the stochastic multi-armed bandit. arXiv:1510.00757.
