The Multi-Armed Bandit Problem: An Efficient Non-Parametric Solution

Hock Peng Chan

arXiv:1703.08285·math.ST·January 17, 2019


TL;DR

This paper introduces efficient non-parametric algorithms for the multi-armed bandit problem, addressing limitations of existing methods under general parametric settings, and enhancing arm allocation strategies in machine learning applications.

Contribution

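It proposes two non-parametric procedures, subsample-mean comparison (SSMC) and subsample-$t$ comparison (SSTC). SSMC is efficient when the reward distributions come from an unspecified one-dimensional exponential family, and SSTC is efficient for normal rewards with unknown and unequal variances.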

Findings

1. New non-parametric algorithms outperform traditional methods in diverse settings

2. Proposed methods achieve lower regret compared to existing non-parametric approaches

3. Algorithms are applicable to a wide range of reward distributions

Abstract

Lai and Robbins (1985) and Lai (1987) provided efficient parametric solutions to the multi-armed bandit problem, showing that arm allocation via upper confidence bounds (UCB) achieves minimum regret. These bounds are constructed from the Kullback-Leibler information of the reward distributions, estimated from specified parametric families. In recent years there has been renewed interest in the multi-armed bandit problem due to new applications in machine learning algorithms and data analytics. Non-parametric arm allocation procedures like $\epsilon$-greedy, Boltzmann exploration and BESA were studied, and modified versions of the UCB procedure were also analyzed under non-parametric settings. However, unlike UCB, these non-parametric procedures are not efficient under general parametric settings. In this paper we propose efficient non-parametric procedures.

Tables

Table 1: The regrets of SSMC, UCB1 and UCB-Agrawal. The rewards have normal distributions with unit variances. For each $N$ we generate $\mu_{k}\sim N(0,1)$ for $1\leq k\leq 10$ a total of $J=10000$ times.

	Regret
	$N = 1000$	$N = 10000$
SSMC	88.4 $\pm$ 0.2	137.0 $\pm$ 0.5
UCB1	90.2 $\pm$ 0.3	154.4 $\pm$ 0.7
UCB-Agrawal	113.0 $\pm$ 0.3	195.7 $\pm$ 0.8

Table 2: The regrets of SSTC, UCB1-tuned and UCB1-Normal. The rewards have normal distributions with unequal and unknown variances. For each $N$ we generate $\mu_{k}\sim N(0,1)$ and $\sigma_{k}^{-2}\sim \mbox{Exp}(1)$ for $1\leq k\leq 10$ a total of $J=10000$ times.

	Regret
	$N = 1000$	$N = 10000$
SSTC	239 $\pm$ 1	492 $\pm$ 5
UCB1-tuned	130 $\pm$ 2	847 $\pm$ 23
UCB1-Normal	1536 $\pm$ 5	4911 $\pm$ 31

Table 3: Regret comparisons for double exponential density rewards. For each $N$ and $\lambda$ we generate $\mu_{k}\sim N(0,1)$ for $1\leq k\leq 10$ a total of $J=10000$ times.

	Regret			Regret $(\times 10)$
	$N = 1000$			$N = 10000$
	$\lambda = 1$	$\lambda = 2$	$\lambda = 5$	$\lambda = 1$	$\lambda = 2$	$\lambda = 5$
SSMC	141.7 $\pm$ 0.4	330 $\pm$ 1	795 $\pm$ 3	23.6 $\pm$ 0.1	65.0 $\pm$ 0.3	236.9 $\pm$ 0.8
BESA	117 $\pm$ 1	265 $\pm$ 2	627 $\pm$ 3	28.9 $\pm$ 0.7	73 $\pm$ 1	215 $\pm$ 2
UCB1-tuned	101 $\pm$ 2	244 $\pm$ 3	608 $\pm$ 6	50 $\pm$ 1	183 $\pm$ 3	499 $\pm$ 6
Boltz $\tau = 0.1$	130 $\pm$ 2	294 $\pm$ 4	673 $\pm$ 7	84 $\pm$ 2	224 $\pm$ 4	557 $\pm$ 6
Boltz $\tau = 0.2$	128 $\pm$ 2	264 $\pm$ 3	632 $\pm$ 6	80 $\pm$ 1	169 $\pm$ 3	465 $\pm$ 6
Boltz $\tau = 0.5$	332 $\pm$ 1	387 $\pm$ 2	632 $\pm$ 5	310 $\pm$ 5	311 $\pm$ 2	428 $\pm$ 4
Boltz $\tau = 1$	728 $\pm$ 2	737 $\pm$ 2	816 $\pm$ 4	731 $\pm$ 2	716 $\pm$ 2	712 $\pm$ 3
$\epsilon$-greedy $c = 0.1$	170 $\pm$ 3	327 $\pm$ 4	681 $\pm$ 7	133 $\pm$ 3	283 $\pm$ 4	579 $\pm$ 7
$\epsilon$-greedy $c = 0.2$	162 $\pm$ 3	312 $\pm$ 4	653 $\pm$ 6	114 $\pm$ 2	251 $\pm$ 4	536 $\pm$ 6
$\epsilon$-greedy $c = 0.5$	150 $\pm$ 2	282 $\pm$ 3	604 $\pm$ 6	82 $\pm$ 2	189 $\pm$ 3	444 $\pm$ 5
$\epsilon$-greedy $c = 1$	159 $\pm$ 2	271 $\pm$ 3	569 $\pm$ 5	61 $\pm$ 1	146 $\pm$ 3	370 $\pm$ 5
$\epsilon$-greedy $c = 2$	200 $\pm$ 1	289 $\pm$ 2	559 $\pm$ 4	52.9 $\pm$ 0.9	113 $\pm$ 2	302 $\pm$ 4
$\epsilon$-greedy $c = 5$	334 $\pm$ 1	396 $\pm$ 2	617 $\pm$ 4	63.4 $\pm$ 0.5	101 $\pm$ 1	241 $\pm$ 3
$\epsilon$-greedy $c = 10$	524 $\pm$ 2	567 $\pm$ 2	742 $\pm$ 3	95.7 $\pm$ 0.4	119.5 $\pm$ 0.8	226 $\pm$ 2
$\epsilon$-greedy $c = 20$	811 $\pm$ 3	839 $\pm$ 3	951 $\pm$ 3	156.9 $\pm$ 0.5	172.1 $\pm$ 0.7	251 $\pm$ 2

Table 4: Number of simulations (out of 10000) lying within a given empirical regret range, and the worst empirical regret, when $N=1000$ and $\lambda=1$.

	Frequency of emp. regrets lying within a given range
	0 to 200	200 to 400	400 to 600	600 to 800	800 to 1000	1000 to 1200	1200 to 2100	Worst emp. regret
SSMC	9134	845	16	5	0	0	0	770
BESA	9314	424	143	66	27	15	11	2089
UCB1-tuned	8830	625	301	132	64	32	16	1772

Table 5: Number of simulations (out of 10000) lying within a given empirical regret range, and the worst empirical regret, when $N=10000$ and $\lambda=1$.

	Frequency of emp. regrets lying within a given range
	0 to 1000	1000 to 2000	2000 to 3000	3000 to 4000	4000 to 5000	5000 to 10000	10000 to 21000	Worst emp. regret
SSMC	9988	8	3	0	0	1	0	6192
BESA	9708	125	59	34	25	40	9	20639
UCB1-tuned	8833	365	250	161	122	225	44	16495

Table 6: Regret comparisons for Bernoulli rewards.

	Scenario
	1	2	3	4
SSMC	12.4 $\pm$ 0.1	43.1 $\pm$ 0.4	97.9 $\pm$ 0.2	165.3 $\pm$ 0.2
SSMC^∗	9.5 $\pm$ 0.2	48.5 $\pm$ 0.6	64.4 $\pm$ 0.3	156.0 $\pm$ 0.4
BESA	11.83	42.6	74.41	156.7
KL-UCB	17.48	52.34	121.21	170.82
KL-UCB $+$	11.54	41.71	72.84	165.28
Thompson	11.3	46.14	83.36	165.08

Table 7: Regret comparisons for truncated exponential and Poisson rewards.

	Trunc. expo.	Trunc. Poisson
SSMC	33.8 $\pm$ 0.4	18.6 $\pm$ 0.1
SSMC^∗	29.6 $\pm$ 0.7	14.7 $\pm$ 0.2
BESA	53.26	19.37
BESAT	31.41	16.72
KL-UCB-expo	65.67	—
KL-UCB-Poisson	—	25.05

Equations

$$R_{N}:=\sum_{k=1}^{K}(\mu_{*}-\mu_{k})EN_{k}.$$

$$D(f|g)=E_{f}\Big[\log\frac{f(Y)}{g(Y)}\Big],$$

$$R_{N}=o(N^{\epsilon})\mbox{ for all }\epsilon>0,$$

$$\liminf_{N\rightarrow\infty}\frac{R_{N}}{\log N}\geq\sum_{k:\mu_{k}<\mu_{*}}\frac{\mu_{*}-\mu_{k}}{D(f_{k}|f_{*})}.$$

$$\bar{Y}_{kn_{k}}+\sqrt{\frac{2\log(N/n)}{n}},$$

$$\bar{Y}_{kn_{k}}+\sqrt{\frac{2(\log n+\log\log n+b_{n})}{n_{k}}},$$

$$\bar{Y}_{kn_{k}}+\sqrt{\frac{2\log n}{n_{k}}}$$

$$\bar{Y}_{kn_{k}}+4\widehat{\sigma}_{kn_{k}}\sqrt{\frac{\log n}{n_{k}}},$$

$$\epsilon_{n}=\min\Big(1,\frac{cK}{d^{2}n}\Big),$$

$$c_{n}=o(\log n)\mbox{ and }\frac{c_{n}}{\log\log n}\rightarrow\infty\mbox{ as }n\rightarrow\infty.$$

$$\bar{Y}_{kn_{k}}\geq\bar{Y}_{\zeta,t:(t+n_{k}-1)}\mbox{ for some }1\leq t\leq n_{\zeta}-n_{k}+1.$$

$$P\Big(\min_{1\leq t\leq n}U_{kt}^{n}\geq\mu_{k}-\epsilon\Big)=1-o(n^{-1})\mbox{ for all }\epsilon>0.$$

$$U_{kn_{k}}^{n}\geq\bar{Y}_{\zeta n_{\zeta}}\ (\doteq\mu_{\zeta}),$$

$$P(L_{\zeta n_{k}}\leq\bar{Y}_{kn_{k}})=1-o(n^{-1}).$$

$$\min_{1\leq t\leq n_{1}-n_{2}+1}\bar{Y}_{1,t:(t+n_{2}-1)}=\mu_{1}-[1+o_{p}(1)]\sqrt{\frac{2\log n}{n_{2}}}.$$

$$\bar{Y}_{2n_{2}}\geq\mu_{1}-[1+o_{p}(1)]\sqrt{\frac{2\log n}{n_{2}}}.$$

$$\frac{\bar{Y}_{kn_{k}}-\mu_{\zeta}}{\widehat{\sigma}_{kn_{k}}}\geq\frac{\bar{Y}_{\zeta,t:(t+n_{k}-1)}-\mu_{\zeta}}{\widehat{\sigma}_{\zeta,t:(t+n_{k}-1)}},$$

$$\frac{\bar{Y}_{kn_{k}}-\bar{Y}_{\zeta n_{\zeta}}}{\widehat{\sigma}_{kn_{k}}}\geq\frac{\bar{Y}_{\zeta,t:(t+n_{k}-1)}-\bar{Y}_{\zeta n_{\zeta}}}{\widehat{\sigma}_{\zeta,t:(t+n_{k}-1)}}\mbox{ for some }1\leq t\leq n_{\zeta}-n_{k}+1.$$

$$f(x;\theta)=e^{\theta x-\psi(\theta)}f(x;0),\ \theta\in\Theta,$$

$$D(f_{k}|f_{*})$$

$$\limsup_{r\rightarrow\infty}\frac{En_{k}^{r}}{\log r}\leq\frac{1}{D(f_{k}|f_{*})},\ k\not\in\Xi,$$

$$f(x;\mu,\sigma^{2})=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}},$$

$$\liminf_{N\rightarrow\infty}\frac{R_{N}}{\log N}\geq\sum_{k:\mu_{k}<\mu_{*}}\frac{\mu_{*}-\mu_{k}}{M(\frac{\mu_{*}-\mu_{k}}{\sigma_{k}})}.$$

$$\limsup_{r\rightarrow\infty}\frac{En_{k}^{r}}{\log r}\leq\frac{1}{M(\frac{\mu_{*}-\mu_{k}}{\sigma_{k}})},\ k\not\in\Xi,$$

$$P_{k}(x,A)=P(X_{kt}\in A|X_{k,t-1}=x),\ x\in{\cal X},\ A\in{\cal A}.$$

$$P(Y_{kt}\in B|X_{k1}=x_{1},X_{k2}=x_{2},\ldots)=\int_{B}f_{k}(y|x_{t})\nu(dy).$$

$$P_{k}(x,A)\geq\lambda_{k}(A),\ x\in{\cal X},\ A\in{\cal A}.$$

$$R_{N}=\sum_{k:\mu_{k}<\mu_{*}}(\mu_{*}-\mu_{k})EN_{k}.$$

$$P(|\bar{Y}_{kt}-\mu_{k}|\geq\epsilon)\leq Qe^{-tb}.$$

$$P(\bar{Y}_{\ell t}<\omega)\leq Q_{1}e^{-tb_{1}}P(\bar{Y}_{kt}<\omega).$$


Full text

THE MULTI-ARMED BANDIT PROBLEM: AN EFFICIENT NON-PARAMETRIC SOLUTION

Hock Peng Chan, National University of Singapore

Department of Statistics and Applied Probability

Block S16, Level 7, 6 Science Drive 2

Faculty of Science

National University of Singapore

Singapore 117546


AMS subject classification: 62L05

Keywords: efficiency, KL-UCB, subsampling, Thompson sampling, UCB

Supported by MOE grant number R-155-000-158-112

1 Introduction

Lai and Robbins (1985) provided an asymptotic lower bound for the regret in the multi-armed bandit problem, and proposed an index strategy that is efficient, that is, it achieves this bound. Lai (1987) showed that allocation to the arm having the highest upper confidence bound (UCB), constructed from the Kullback-Leibler (KL) information between the estimated reward distributions of the arms, is efficient when the distributions belong to a specified exponential family. Agrawal (1995) proposed a modified UCB procedure that is efficient despite not having to know in advance the total sample size. Cappé, Garivier, Maillard, Munos and Stoltz (2013) provided explicit, non-asymptotic bounds on the regret of a KL-UCB procedure that is efficient on a larger class of distribution families.

Burnetas and Katehakis (1996) extended UCB to multi-parameter families, almost showing efficiency in the natural setting of normal rewards with unequal variances. Yakowitz and Lowe (1991) proposed non-parametric procedures that do not make use of KL-information, suggesting logarithmic and polynomial rates of regret under finite exponential moment and moment conditions respectively.

Auer, Cesa-Bianchi and Fischer (2002) proposed a UCB1 procedure that achieves logarithmic regret when the reward distributions are supported on [0,1]. They also studied the $\epsilon$-greedy algorithm of Sutton and Barto (1998) and provided finite-time upper bounds of its regret. Both UCB1 and $\epsilon$-greedy are non-parametric in their applications and, unlike UCB-Lai or UCB-Agrawal, are not expected to be efficient under a general exponential family setting. Other non-parametric methods that have been proposed include reinforcement comparison, Boltzmann exploration (Sutton and Barto, 1998) and pursuit (Thathachar and Sastry, 1985). Kuleshov and Precup (2014) provided numerical comparisons between UCB and these methods. For a description of applications to recommender systems and clinical trials, see Shivaswamy and Joachims (2012). Burtini, Loeppky and Lawrence (2015) provided a comprehensive survey of the methods, results and applications of the multi-armed bandit problem, developed over the past thirty years.

A strong competitor to UCB under the parametric setting is the Bayesian method, see for example Fabius and van Zwet (1970) and Berry (1972). There is also a well-developed literature on optimization under an infinite-time discounted window setting, in which allocation is to the arm maximizing a dynamic allocation (or Gittins) index, see the seminal papers Gittins (1979) and Gittins and Jones (1979), and also Berry and Fristedt (1985), Chang and Lai (1987), Brezzi and Lai (2002). Recently there has been renewed interest in the Bayesian method due to the developments of UCB-Bayes [see Kaufmann, Cappé and Garivier (2012)] and Thompson sampling [see for example Korda, Kaufmann and Munos (2013)].

In this paper we propose an arm allocation procedure, subsample-mean comparison (SSMC), that, though non-parametric, is nevertheless efficient when the reward distributions are from an unspecified one-dimensional exponential family. It achieves this by comparing subsample means of the leading arm with the sample means of its competitors. It is empirical in its approach, using more informative subsample means rather than full-sample means alone, for better decision-making. The subsampling strategy was first employed by Baransi, Maillard and Mannor (2014) in their best empirical sampled average (BESA) procedure. However, there are key differences between their implementation of subsampling and ours, as will be elaborated in Section 2.2. Though efficiency has been attained for various one-dimensional exponential families by, say, UCB-Agrawal or KL-UCB, SSMC is the first to achieve efficiency without having to know the specific distribution family. In addition we propose in Section 2.4 a related subsample-$t$ comparison (SSTC) procedure, applying $t$-statistic comparisons in place of mean comparisons, that is efficient for normal distributions with unknown and unequal variances.

The layout of the paper is as follows. In Section 2 we describe the subsample comparison strategy for allocating arms. In Section 3 we show that the strategy is efficient for exponential families, including the setting of normal rewards with unknown and unequal variances. In Section 4 we show logarithmic regret for Markovian rewards. In Section 5 we provide numerical comparisons against existing methods. In Section 6 we provide a concluding discussion. In Section 7 we prove the results of Sections 3 and 4.

2 Subsample comparisons

Let $Y_{k1},Y_{k2},\ldots$ , $1\leq k\leq K$ , be the observations (or rewards) from a population (or arm) $\Pi_{k}$ . We assume here and in Section 3 that the rewards are independent and identically distributed (i.i.d.) within each arm. We extend to Markovian rewards in Section 4. Let $\mu_{k}=EY_{kt}$ and $\mu_{*}=\max_{1\leq k\leq K}\mu_{k}$ .

Consider a sequential procedure for selecting the population to be sampled, with the decision based on past rewards. Let $N_{k}$ be the number of observations from $\Pi_{k}$ when there are $N$ total observations, hence $N=\sum_{k=1}^{K}N_{k}$ . The objective is to minimize the regret

$$R_{N}:=\sum_{k=1}^{K}(\mu_{*}-\mu_{k})EN_{k}.$$

The Kullback-Leibler information number between two densities $f$ and $g$ , with respect to a common ( $\sigma$ -finite) measure, is

$$D(f|g)=E_{f}\Big[\log\frac{f(Y)}{g(Y)}\Big], \tag{2.1}$$

where $E_{f}$ denotes expectation with respect to $Y\sim f$ . An arm allocation procedure is said to be uniformly good if

$$R_{N}=o(N^{\epsilon})\mbox{ for all }\epsilon>0, \tag{2.2}$$

over all reward distributions lying within a specified parametric family.

Let $f_{k}$ be the density of $Y_{kt}$ and let $f_{*}=f_{k}$ for $k$ such that $\mu_{k}=\mu_{*}$ (assuming $f_{*}$ is unique). The celebrated result of Lai and Robbins (1985) is that under (2.2) and additional regularity conditions,

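$$\liminf_{N\rightarrow\infty}\frac{R_{N}}{\log N}\geq\sum_{k:\mu_{k}<\mu_{*}}\frac{\mu_{*}-\mu_{k}}{D(f_{k}|f_{*})}. \tag{2.3}$$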

Lai and Robbins (1985) and Lai (1987) went on to propose arm allocation procedures that have regrets achieving the lower bound in (2.3), and are hence efficient.
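As a small numerical illustration of the regret and of the constant in the lower bound (2.3), the sketch below assumes unit-variance normal rewards, for which the KL information is $D(f_{k}|f_{*})=(\mu_{*}-\mu_{k})^{2}/2$. The function names and the example means and counts are hypothetical and chosen only for illustration.

```python
def regret(means, expected_counts):
    """R_N = sum over arms of (mu_* - mu_k) * E[N_k]."""
    mu_star = max(means)
    return sum((mu_star - m) * n for m, n in zip(means, expected_counts))


def lai_robbins_constant_unit_normal(means):
    """Sum over inferior arms of (mu_* - mu_k) / D(f_k | f_*).

    Assumes unit-variance normal rewards, so D = (mu_* - mu_k)^2 / 2
    and each term reduces to 2 / (mu_* - mu_k).
    """
    mu_star = max(means)
    return sum(2.0 / (mu_star - m) for m in means if m < mu_star)


means = [0.0, 0.5, 1.0]                 # illustrative arm means
counts = [10, 20, 970]                  # illustrative values of E[N_k]
print(regret(means, counts))
print(lai_robbins_constant_unit_normal(means))
```

An efficient procedure has regret growing like this constant times $\log N$, so arms with means close to $\mu_{*}$ contribute the largest terms.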

2.1 Review of existing methods

In the setting of normal rewards with unit variances, UCB-Lai can be described as the selection, for sampling, of the population $\Pi_{k}$ maximizing

$$\bar{Y}_{kn_{k}}+\sqrt{\frac{2\log(N/n)}{n}}, \tag{2.4}$$

where $\bar{Y}_{kt}=\frac{1}{t}\sum_{u=1}^{t}Y_{ku}$ , $n$ is the current number of observations from the $K$ populations, and $n_{k}$ is the current number of observations from $\Pi_{k}$ . Agrawal (1995) proposed a modified version of UCB-Lai that does not involve the total sample size $N$ , with the selection instead of the population $\Pi_{k}$ maximizing

$$\bar{Y}_{kn_{k}}+\sqrt{\frac{2(\log n+\log\log n+b_{n})}{n_{k}}}, \tag{2.5}$$

with $b_{n}\rightarrow\infty$ and $b_{n}=o(\log n)$ . Efficiency holds for (2.4) and (2.5), and there are corresponding versions of (2.4) and (2.5) that are efficient for other one-parameter exponential families. Cappé et al. (2013) proposed a more general KL-UCB procedure that is also efficient for distributions with given finite support.

Auer, Cesa-Bianchi and Fischer (2002) simplified UCB-Agrawal to UCB1, proposing that $\Pi_{k}$ maximizing

$$\bar{Y}_{kn_{k}}+\sqrt{\frac{2\log n}{n_{k}}} \tag{2.6}$$

be selected. They showed that under UCB1, logarithmic regret $R_{N}=O(\log N)$ is achieved when the reward distributions are supported on [0,1]. In the setting of normal rewards with unequal and unknown variances, Auer et al. suggested applying a variant of UCB1 which they called UCB1-Normal, and showed logarithmic regret. Under UCB1-Normal, an observation is taken from any population $\Pi_{k}$ with $n_{k}<8\log n$ . If such a population does not exist, then an observation is taken from $\Pi_{k}$ maximizing

$$\bar{Y}_{kn_{k}}+4\widehat{\sigma}_{kn_{k}}\sqrt{\frac{\log n}{n_{k}}},$$

where $\widehat{\sigma}_{kt}^{2}=\frac{1}{t-1}\sum_{u=1}^{t}(Y_{ku}-\bar{Y}_{kt})^{2}$ .

Auer et al. provided an excellent study of various non-parametric arm allocation procedures, for example the $\epsilon$-greedy procedure proposed by Sutton and Barto (1998), in which an observation is taken from the population with the largest sample mean with probability $1-\epsilon$, and randomly with probability $\epsilon$. Auer et al. suggested replacing the fixed $\epsilon$ at every stage by a stage-dependent

$$\epsilon_{n}=\min\Big(1,\frac{cK}{d^{2}n}\Big), \tag{2.7}$$

with $c$ user-specified and $0<d\leq\min_{k:\mu_{k}<\mu_{*}}(\mu_{*}-\mu_{k})$. They showed that if $c>5$, then logarithmic regret is achieved for reward distributions supported on $[0,1]$. A more recent numerical study by Kuleshov and Precup (2014) considered additional non-parametric procedures, for example Boltzmann exploration in which an observation is taken from $\Pi_{k}$ with probability proportional to $e^{\bar{Y}_{kn_{k}}/\tau}$, for some $\tau>0$.
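The selection rules reviewed in this section are simple enough to state as functions. The sketch below is illustrative only: it codes the UCB1 index $\bar{Y}_{kn_{k}}+\sqrt{2\log n/n_{k}}$, the stage-dependent $\epsilon_{n}=\min(1,cK/(d^{2}n))$, and the Boltzmann selection probabilities. The function names are hypothetical, and subtracting the maximum before exponentiating is a numerical-stability choice of this sketch that leaves the probabilities unchanged.

```python
import math


def ucb1_index(sample_mean, n_k, n):
    """UCB1 index: sample mean plus sqrt(2 log n / n_k)."""
    return sample_mean + math.sqrt(2.0 * math.log(n) / n_k)


def epsilon_schedule(n, c, K, d):
    """Stage-dependent exploration probability min(1, cK / (d^2 n))."""
    return min(1.0, c * K / (d * d * n))


def boltzmann_probs(sample_means, tau):
    """Selection probabilities proportional to exp(sample_mean / tau)."""
    top = max(sample_means)             # shift for numerical stability
    weights = [math.exp((m - top) / tau) for m in sample_means]
    total = sum(weights)
    return [w / total for w in weights]


# An arm with few observations receives a larger exploration bonus.
print(ucb1_index(0.4, n_k=5, n=1000), ucb1_index(0.4, n_k=500, n=1000))
print(epsilon_schedule(n=10000, c=1.0, K=10, d=1.0))
print(boltzmann_probs([0.2, 0.5, 0.9], tau=0.1))
```

Note that the $\epsilon_{n}$ schedule requires a lower bound $d$ on the gap between the best and second best means, which is rarely known in practice.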

2.2 Subsample-mean comparisons

A common characteristic of the procedures described in Section 2.1 is that allocation is based solely on a comparison of the sample means $\bar{Y}_{kn_{k}}$ , with the exception of UCB1-Normal in which $\widehat{\sigma}_{kn_{k}}$ is also utilized. As we shall illustrate in Section 2.3, we can utilize subsample-mean information from the leading arm to estimate the confidence bounds for selecting from the other arms. In contrast UCB-based procedures like KL-UCB discard subsample information and rely on parametric information to estimate these bounds. Even though subsample-mean and KL-UCB are both efficient for exponential families, the advantage of subsample-mean is that the underlying family need not be specified.

In SSMC a leader is chosen in each round of play to compete against all the other arms. Let $r$ denote the round number. In round 1, we sample all $K$ arms. In round $r$ for $r>1$ , we set up a challenge between the leading arm (to be defined below) and each of the other arms. An arm is sampled only if it wins all its challenges in that round. Hence for round $r>1$ we sample either the leading arm or a non-empty subset of the challengers. Let $n(=n^{r})$ be the total number of observations from all $K$ arms at the beginning of round $r$ , let $n_{k}(=n_{k}^{r})$ be the corresponding number from $\Pi_{k}$ . Hence $n_{k}^{1}=0$ and $n_{k}^{2}=1$ for all $k$ , and $K+(r-2)\leq n^{r}\leq K+(K-1)(r-2)$ for $r\geq 2$ .

Let $c_{n}$ be a non-negative monotone increasing sampling threshold in SSMC and SSTC, with

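$$c_{n}=o(\log n)\mbox{ and }\frac{c_{n}}{\log\log n}\rightarrow\infty\mbox{ as }n\rightarrow\infty. \tag{2.8}$$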

For example in our implementation of SSMC and SSTC in Section 5, we select $c_{n}=(\log n)^{\frac{1}{2}}$ . An explanation of why (2.8) is required for efficiency of SSMC is given in the beginning of Section 7.1. Let $\bar{Y}_{k,t:u}=\frac{1}{u-t+1}\sum_{v=t}^{u}Y_{kv}$ , hence $\bar{Y}_{kt}=\bar{Y}_{k,1:t}$ .

Subsample-mean comparison (SSMC)

1. $r=1$. Sample each $\Pi_{k}$ exactly once.

2. $r=2,3,\ldots$.

(a) Let the leader $\zeta(=\zeta^{r})$ be the population with the most observations, with ties resolved by (in order):

i. the population with the larger sample mean,

ii. the leader of the previous round,

iii. randomization.

(b) For all $k\neq\zeta$ set up a challenge between $\Pi_{\zeta}$ and $\Pi_{k}$ in the following manner.

i. If $n_{k}=n_{\zeta}$, then $\Pi_{k}$ loses the challenge automatically.

ii. If $n_{k}<n_{\zeta}$ and $n_{k}<c_{n}$, then $\Pi_{k}$ wins the challenge automatically.

iii. If $c_{n}\leq n_{k}<n_{\zeta}$, then $\Pi_{k}$ wins the challenge when

$$\bar{Y}_{kn_{k}}\geq\bar{Y}_{\zeta,t:(t+n_{k}-1)}\mbox{ for some }1\leq t\leq n_{\zeta}-n_{k}+1. \tag{2.9}$$

(c) For all $k\neq\zeta$, sample from $\Pi_{k}$ if $\Pi_{k}$ wins its challenge against $\Pi_{\zeta}$. Sample from $\Pi_{\zeta}$ if $\Pi_{\zeta}$ wins all its challenges. Hence either $\Pi_{\zeta}$ is sampled, or a non-empty subset of $\{\Pi_{k}:k\neq\zeta\}$ is sampled.

SSMC may recommend more than one population to be sampled in a single round when $K>2$. In the event that $n^{r}<N<n^{r+1}$ for some $r$, we select $N-n^{r}$ populations randomly from among the $n^{r+1}-n^{r}$ recommended by SSMC in the $r$th round, in order to make up exactly $N$ observations.
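The SSMC steps above can be sketched in Python as follows. This is a minimal illustration rather than the author's implementation: it runs a fixed number of rounds instead of truncating at exactly $N$ observations, it recomputes every sliding-window comparison in each round (so it does not attain the $O(1)$ updates discussed next), and it uses the threshold $c_{n}=(\log n)^{\frac{1}{2}}$. The names `wins_challenge`, `choose_leader` and `ssmc`, and the representation of an arm as a function of a random generator, are hypothetical.

```python
import math
import random


def wins_challenge(y_k, y_lead, c_n):
    """Steps 2(b)i-iii: does the challenger with rewards y_k beat the leader?"""
    n_k, n_lead = len(y_k), len(y_lead)
    if n_k == n_lead:
        return False                       # 2(b)i: automatic loss
    if n_k < c_n:
        return True                        # 2(b)ii: forced exploration
    # 2(b)iii: windows all have length n_k, so comparing sums compares means.
    target = sum(y_k)
    window = sum(y_lead[:n_k])
    if target >= window:
        return True
    for t in range(n_k, n_lead):           # slide the window by one reward
        window += y_lead[t] - y_lead[t - n_k]
        if target >= window:
            return True
    return False


def choose_leader(rewards, prev_leader, rng):
    """Step 2(a): most observations, then larger mean, previous leader, random."""
    most = max(len(y) for y in rewards)
    tied = [k for k, y in enumerate(rewards) if len(y) == most]
    best = max(sum(rewards[k]) / most for k in tied)
    tied = [k for k in tied if sum(rewards[k]) / most == best]
    return prev_leader if prev_leader in tied else rng.choice(tied)


def ssmc(arms, num_rounds, rng):
    """Run SSMC for num_rounds rounds; arms[k](rng) draws one reward."""
    rewards = [[arm(rng)] for arm in arms]              # round 1
    leader = None
    for _ in range(2, num_rounds + 1):
        n = sum(len(y) for y in rewards)
        c_n = math.sqrt(math.log(n))
        leader = choose_leader(rewards, leader, rng)
        winners = [k for k in range(len(arms)) if k != leader
                   and wins_challenge(rewards[k], rewards[leader], c_n)]
        for k in (winners or [leader]):                 # step 2(c)
            rewards[k].append(arms[k](rng))
    return rewards


arms = [lambda g: g.gauss(0.0, 1.0), lambda g: g.gauss(1.0, 1.0)]
rewards = ssmc(arms, num_rounds=200, rng=random.Random(0))
print([len(y) for y in rewards])
```

With $K=2$ exactly one arm is sampled in each round after the first, so 200 rounds give 201 observations in total.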

If $\Pi_{\zeta}$ wins all its challenges, then $\zeta$ and $(n_{k}:k\neq\zeta)$ are unchanged, and in the next round it suffices to perform the comparison in (2.9) at the largest $t$ instead of at every $t$ . The computational cost is thus $O(1)$ . The computational cost is $O(r)$ if at least one $k\neq\zeta$ wins its challenge. Hence when there is only one optimal arm and SSMC achieves logarithmic regret, the total computational cost is $O(r\log r)$ for $r$ rounds of the algorithm.

In step 2(b)ii. we force the exploration of arms with fewer than $c_{n}$ rewards. By (2.8) we select $c_{n}$ small compared to $\log n$, so that the cost of such forced explorations is asymptotically negligible. In contrast the forced exploration in the $\epsilon$-greedy algorithm (2.7) is more substantial, of order $\log n$ for $n$ rewards.

BESA, proposed by Baransi, Maillard and Mannor (2014), also applies subsample-mean comparisons. We describe BESA for $K=2$ below, noting that tournament-style elimination is applied for $K>2$ . Unlike SSMC, exactly one population is sampled in each round $r>1$ even when $K>2$ .

Best Empirical Sampled Average (BESA)

1. $r=1$. Sample both $\Pi_{1}$ and $\Pi_{2}$.

2. $r=2,3,\ldots$.

(a) Let the leader $\zeta$ be the population with more observations, and let $k\neq\zeta$.

(b) Sample randomly without replacement $n_{k}$ of the $n_{\zeta}$ observations from $\Pi_{\zeta}$, and let $\bar{Y}_{\zeta n_{k}}^{*}$ be the mean of the $n_{k}$ observations.

(c) If $\bar{Y}_{kn_{k}}\geq\bar{Y}_{\zeta n_{k}}^{*}$, then sample from $\Pi_{k}$. Otherwise sample from $\Pi_{\zeta}$.
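For comparison with SSMC, one round of BESA for $K=2$ can be sketched as below. The function name `besa_choose` is hypothetical, and the description above does not say how a tie in the number of observations is resolved, so treating the first arm as leader in that case is an assumption of this sketch.

```python
import random


def besa_choose(y1, y2, rng):
    """Return 0 or 1, the index of the arm BESA samples next (K = 2)."""
    ys = (y1, y2)
    # Assumption: when both arms have equally many rewards, arm 0 leads.
    lead = 0 if len(y1) >= len(y2) else 1
    k = 1 - lead
    n_k = len(ys[k])
    # Step (b): subsample n_k leader rewards without replacement.
    subsample = rng.sample(ys[lead], n_k)
    # Step (c): compare the challenger mean with the subsample mean.
    if sum(ys[k]) / n_k >= sum(subsample) / n_k:
        return k
    return lead


rng = random.Random(0)
print(besa_choose([5.0, 5.0, 5.0], [0.0], rng))
print(besa_choose([0.0, 0.0, 0.0], [1.0], rng))
```

Unlike the challenge in SSMC, which scans every block of $n_{k}$ consecutive leader rewards, BESA draws a single random subsample and has no forced-exploration step.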

As can be seen from the descriptions of SSMC and BESA, the mechanism of choosing the arm to be played in SSMC clearly promotes exploration of non-leading arms, relative to BESA. Whereas Baransi et al. demonstrated logarithmic regret of BESA for rewards bounded on [0,1] (though BESA can of course be applied on more general settings but with no such guarantees), we show in Section 3 that SSMC is able to extend BESA’s subsampling idea to achieve asymptotic optimality, that is efficiency, on a wider set of distributions. Tables 4 and 5 in Section 5 show that SSMC controls the oversampling of inferior arms better relative to BESA, due to its added explorations.

2.3 Comparison of SSMC with UCB methods

Lai and Robbins (1985) proposed a UCB strategy in which the arms take turns to challenge a leader with order $n$ observations. Let us restrict to the setting of exponential families. Denote the leader by $\zeta$ and the challenger by $k$ . Lai and Robbins proposed, in their (3.1), upper confidence bounds $U_{kt}^{n}=U_{k}^{n}(Y_{k1},\ldots,Y_{kt})$ satisfying

$$P\Big(\min_{1\leq t\leq n}U_{kt}^{n}\geq\mu_{k}-\epsilon\Big)=1-o(n^{-1})\mbox{ for all }\epsilon>0.$$

The decision is to sample from arm $k$ if

$$U_{kn_{k}}^{n}\geq\bar{Y}_{\zeta n_{\zeta}}\ (\doteq\mu_{\zeta}),$$

otherwise arm $\zeta$ is sampled. By doing this we ensure that if $\mu_{k}>\mu_{\zeta}$ , then the probability that arm $k$ is sampled is $1-o(n^{-1})$ .

We next consider SSMC. Let $L_{\zeta n_{k}}=\min_{1\leq t\leq n_{\zeta}-n_{k}+1}\bar{Y}_{\zeta,t:(t+n_{k}-1)}$ . Since $n_{\zeta}$ is of order $n$ , it follows that if $\mu_{k}>\mu_{\zeta}$ , then as $Y_{kt}$ is stochastically larger than $Y_{\zeta t}$ ,

$$P(L_{\zeta n_{k}}\leq\bar{Y}_{kn_{k}})=1-o(n^{-1}).$$

In SSMC we sample from arm $k$ if $L_{\zeta n_{k}}\leq\bar{Y}_{kn_{k}}$ , ensuring, as in Lai and Robbins, that an optimal arm is sampled with probability $1-o(n^{-1})$ when the leading arm is inferior.

In summary SSMC differs from UCB in that it compares $\bar{Y}_{kn_{k}}$ against a lower confidence bound $L_{\zeta n_{k}}$ of the leading arm, computed from subsample-means instead of parametrically. Nevertheless the critical values that SSMC and UCB-based methods employ for allocating arms are asymptotically the same, as we shall next show.

For simplicity let us consider unit variance normal densities with $K=2$ . Consider firstly unbalanced sample sizes with say $n_{2}=O(\log n)$ and note, see Appendix A, that

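$$\min_{1\leq t\leq n_{1}-n_{2}+1}\bar{Y}_{1,t:(t+n_{2}-1)}=\mu_{1}-[1+o_{p}(1)]\sqrt{\frac{2\log n}{n_{2}}}.$$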

Hence arm 2 winning the challenge requires

$$\bar{Y}_{2n_{2}}\geq\mu_{1}-[1+o_{p}(1)]\sqrt{\frac{2\log n}{n_{2}}}. \tag{2.11}$$

By (2.5) and (2.6), UCB-Agrawal, KL-UCB and UCB1 also select arm 2 when (2.11) holds, since $\bar{Y}_{1n_{1}}+\sqrt{\frac{2\log n}{n_{1}}}=\mu_{1}+o_{p}(1)$. Hence what SSMC does is to estimate the critical value $\mu_{1}-[1+o_{p}(1)]\sqrt{\frac{2\log n}{n_{2}}}$ empirically, by using the minimum of the running averages $\bar{Y}_{1,t:(t+n_{2}-1)}$. In the case of $n_{1},n_{2}$ both large compared to $\log n$, $\sqrt{\frac{2\log n}{n_{1}}}+\sqrt{\frac{2\log n}{n_{2}}}\rightarrow 0$, and SSMC, UCB-Agrawal, KL-UCB and UCB1 essentially select the population with the larger sample mean.
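The agreement between the empirical and parametric critical values can be checked by simulation. The sketch below assumes unit-variance normal rewards with $\mu_{1}=0$ and the illustrative sizes $n_{1}=10000$ and $n_{2}=20$, and compares the minimum running average over windows of length $n_{2}$ with $\mu_{1}-\sqrt{2\log n/n_{2}}$. The relation is asymptotic, so the two printed numbers should be expected to be close rather than equal. The function name `min_running_mean` is hypothetical.

```python
import math
import random


def min_running_mean(y, m):
    """Minimum over all windows of m consecutive values of the window mean."""
    window = sum(y[:m])
    best = window
    for t in range(m, len(y)):
        window += y[t] - y[t - m]       # slide the window by one value
        best = min(best, window)
    return best / m


rng = random.Random(1)
mu_1, n_1, n_2 = 0.0, 10000, 20         # illustrative values only
y_1 = [rng.gauss(mu_1, 1.0) for _ in range(n_1)]

empirical = min_running_mean(y_1, n_2)
critical = mu_1 - math.sqrt(2.0 * math.log(n_1 + n_2) / n_2)
print(empirical, critical)
```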

2.4 Subsample-$t$ comparisons

For efficiency outside one-parameter exponential families, we need to work with test statistics beyond sample means. For example to achieve efficiency for normal rewards with unknown and unequal variances, the analogue of mean comparisons is $t$-statistic comparisons

$$\frac{\bar{Y}_{kn_{k}}-\mu_{\zeta}}{\widehat{\sigma}_{kn_{k}}}\geq\frac{\bar{Y}_{\zeta,t:(t+n_{k}-1)}-\mu_{\zeta}}{\widehat{\sigma}_{\zeta,t:(t+n_{k}-1)}},$$

where $\widehat{\sigma}^{2}_{k,t:u}=\frac{1}{u-t}\sum_{v=t}^{u}(Y_{kv}-\bar{Y}_{k,t:u})^{2}$ and $\widehat{\sigma}_{kt}=\widehat{\sigma}_{k,1:t}$ . Since $\mu_{\zeta}$ is unknown, we estimate it by $\bar{Y}_{\zeta n_{\zeta}}$ .

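Subsample-$t$ comparison (SSTC)

Proceed as in SSMC, with step 2(b)iii.′ below replacing step 2(b)iii.

iii.′ If $c_{n}\leq n_{k}<n_{\zeta}$, then $\Pi_{k}$ wins the challenge when either $\bar{Y}_{kn_{k}}\geq\bar{Y}_{\zeta n_{\zeta}}$ or

$$\frac{\bar{Y}_{kn_{k}}-\bar{Y}_{\zeta n_{\zeta}}}{\widehat{\sigma}_{kn_{k}}}\geq\frac{\bar{Y}_{\zeta,t:(t+n_{k}-1)}-\bar{Y}_{\zeta n_{\zeta}}}{\widehat{\sigma}_{\zeta,t:(t+n_{k}-1)}}\mbox{ for some }1\leq t\leq n_{\zeta}-n_{k}+1. \tag{2.12}$$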

As in SSMC only $O(r\log r)$ computations are needed for $r$ rounds when there is only one optimal arm and the regret is logarithmic. This is because it suffices to record the range of $\bar{Y}_{\zeta n_{\zeta}}$ that satisfies (2.12) for each $k\neq\zeta$ , and the actual value of $\bar{Y}_{\zeta n_{\zeta}}$ . The updating of these requires $O(1)$ computations when both $\zeta$ and $(n_{k}:k\neq\zeta)$ are unchanged.
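A direct, unoptimized version of the SSTC challenge in step 2(b)iii.′ is sketched below. It recomputes the mean and standard deviation of every window, so it costs far more than the $O(1)$ update described above and is meant only to make the comparison concrete. Two choices are assumptions of this sketch: the challenger is assumed to have at least two rewards, and the two sides of the inequality are cross-multiplied by the (non-negative) standard deviations so that no division by zero can occur. The names `mean_sd` and `wins_t_challenge` are hypothetical.

```python
import math


def mean_sd(y):
    """Sample mean and standard deviation (divisor n - 1); needs len(y) >= 2."""
    n = len(y)
    m = sum(y) / n
    return m, math.sqrt(sum((v - m) ** 2 for v in y) / (n - 1))


def wins_t_challenge(y_k, y_lead):
    """Step 2(b)iii' for a challenger with c_n <= n_k < n_lead rewards."""
    n_k = len(y_k)
    mean_k, sd_k = mean_sd(y_k)
    mean_lead = sum(y_lead) / len(y_lead)
    if mean_k >= mean_lead:
        return True
    for t in range(len(y_lead) - n_k + 1):
        mean_w, sd_w = mean_sd(y_lead[t:t + n_k])
        # Equivalent to comparing the two t-statistics when sd_k, sd_w > 0.
        if (mean_k - mean_lead) * sd_w >= (mean_w - mean_lead) * sd_k:
            return True
    return False


leader = [0.0, 2.0, 0.0, 2.0, 10.0, 12.0]
print(wins_t_challenge([3.0, 5.0], leader))
print(wins_t_challenge([-10.0, -9.0], leader))
```

Standardizing by the sample standard deviations means that a challenger with noisy rewards is not penalized as heavily for a low sample mean as it would be under the mean comparison of SSMC.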

3 Efficiency

Consider firstly an exponential family of density functions

$$f(x;\theta)=e^{\theta x-\psi(\theta)}f(x;0),\ \theta\in\Theta, \tag{3.1}$$

with respect to some measure $\nu$ , where $\psi(\theta)=\log[\int e^{\theta x}f(x;0)\nu(dx)]$ is the log moment generating function and $\Theta=\{\theta:\psi(\theta)<\infty\}$ . For example the Bernoulli family satisfies (3.1) with $\nu$ the counting measure on $\{0,1\}$ and $f(0;0)=f(1;0)=\frac{1}{2}$ . The family of normal densities with variance $\sigma^{2}$ satisfies (3.1) with $\nu$ the Lebesgue measure and $f(x;0)=\frac{1}{\sigma\sqrt{2\pi}}e^{-x^{2}/(2\sigma^{2})}$ .

Let $f_{k}=f(\cdot;\theta_{k})$ for some $\theta_{k}\in\Theta$ , $1\leq k\leq K$ . Let $\theta_{*}=\max_{1\leq k\leq K}\theta_{k}$ and $f_{*}=f(\cdot;\theta_{*})$ . By (2.1) and (3.1), the KL-information in (2.3),

$$D(f_{k}|f_{*})=(\theta_{k}-\theta_{*})\mu_{k}-[\psi(\theta_{k})-\psi(\theta_{*})]=I_{*}(\mu_{k}),$$

where $I_{*}$ is the large deviations rate function of $f_{*}$ . Let $\Xi=\{\ell:\mu_{\ell}=\mu_{*}\}$ be the set of optimal arms.
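For a concrete check of how (2.1) specializes under (3.1), take the Bernoulli family with $f(0;0)=f(1;0)=\frac{1}{2}$. Then $\psi(\theta)=\log[(1+e^{\theta})/2]$ and a success probability $p$ corresponds to $\theta=\log[p/(1-p)]$. Substituting (3.1) into (2.1) gives the standard identity $D(f_{k}|f_{*})=(\theta_{k}-\theta_{*})\mu_{k}-[\psi(\theta_{k})-\psi(\theta_{*})]$, which the sketch below compares numerically with the KL information computed directly from the two Bernoulli distributions. The function names and the probabilities 0.3 and 0.6 are illustrative.

```python
import math


def psi(theta):
    """Log moment generating function for f(0;0) = f(1;0) = 1/2."""
    return math.log((1.0 + math.exp(theta)) / 2.0)


def natural_param(p):
    """Natural parameter theta of a Bernoulli(p) density under (3.1)."""
    return math.log(p / (1.0 - p))


def kl_exp_family(p_k, p_star):
    """(theta_k - theta_*) mu_k - [psi(theta_k) - psi(theta_*)], with mu_k = p_k."""
    th_k, th_star = natural_param(p_k), natural_param(p_star)
    return (th_k - th_star) * p_k - (psi(th_k) - psi(th_star))


def kl_direct(p, q):
    """E_f[log f(Y)/g(Y)] for f = Bernoulli(p), g = Bernoulli(q)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


print(kl_exp_family(0.3, 0.6), kl_direct(0.3, 0.6))
```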

Theorem 1.

For the exponential family (3.1), SSMC satisfies

$$\limsup_{r\rightarrow\infty}\frac{En_{k}^{r}}{\log r}\leq\frac{1}{D(f_{k}|f_{*})},\ k\not\in\Xi,$$

and is thus efficient.

UCB-Agrawal and KL-UCB are efficient as well for (3.1), see Agrawal (1995) and Cappé et al. (2013). SSMC is unique in that it achieves efficiency by being adaptive to the exponential family, whereas UCB-Agrawal and KL-UCB achieve efficiency by having selection procedures that are specific to the exponential family. On the other hand UCB-based methods require less storage space, and more informative finite-time bounds have been obtained for them. Specifically, for UCB-based methods in exponential families we need only store the sample mean for each arm, and the numerical complexity is of the same order as the sample size. For SSMC as given in Section 2.2, all observations are stored (more on this in Section 6) and the numerical complexity for a sample of size $N$ is of order $N\log N$ when we have efficiency and exactly one optimal arm.

We next consider normal rewards with unequal and unknown variances, that is with densities

$$f(x;\mu,\sigma^{2})=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}, \tag{3.3}$$

with respect to Lebesgue measure. Let $M(g)=\frac{1}{2}\log(1+g^{2})$ . Burnetas and Katehakis (1996) showed that if $f_{k}=f(\cdot;\mu_{k},\sigma_{k}^{2})$ , then under uniformly fast convergence and additional regularity conditions, an arm allocation procedure must have regret $R_{N}$ satisfying

[TABLE]

They proposed an extension of UCB-Lai but needed the verification of a technical condition to show efficiency. In the case of UCB1-Normal, logarithmic regret also depended on tail bounds of the $\chi^{2}$ - and $t$ -distributions that were only shown to hold numerically by Auer et al. (2002). In Theorem 2 we show that SSTC achieves efficiency.

Theorem 2.

For normal densities (3.3) with unequal and unknown variances, SSTC satisfies

[TABLE]

and is thus efficient.
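To get a feel for the size of the asymptotic constant, the sketch below evaluates $\sum_{k:\mu_{k}<\mu_{*}}(\mu_{*}-\mu_{k})/M(\frac{\mu_{*}-\mu_{k}}{\sigma_{k}})$ with $M(g)=\frac{1}{2}\log(1+g^{2})$. That this sum is the constant multiplying $\log N$ is an assumption here, inferred from the choice $\xi_{k}=1/M(\frac{\mu_{*}-\mu_{k}}{\sigma_{k}})$ in Section 7.2; the arm parameters are the four-arm setting of Kaufmann et al. quoted in Section 5, and the function names are illustrative.

```python
import math

def M(g):
    # M(g) = (1/2) log(1 + g^2), as defined in the text.
    return 0.5 * math.log(1.0 + g * g)

def regret_constant(mus, sigmas):
    # Sum over inferior arms of gap / M(gap / sigma_k) (assumed form of the constant).
    mu_star = max(mus)
    total = 0.0
    for mu, sigma in zip(mus, sigmas):
        gap = mu_star - mu
        if gap > 0:
            total += gap / M(gap / sigma)
    return total

mus = [1.8, 2.0, 1.5, 2.2]
sigmas = [0.5, 0.7, 0.5, 0.3]
const = regret_constant(mus, sigmas)
```

The arm closest to the optimal one, relative to its standard deviation, dominates the sum, since $M(g)$ behaves like $g^{2}/2$ for small $g$.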

4 Logarithmic regret

We show here that logarithmic regret can be achieved by SSMC under Markovian assumptions. This is possible because in SSMC we compare blocks of observations that retain the Markovian structure.

For $1\leq k\leq K$ , let $X_{k1},X_{k2},\ldots$ be a potentially unobserved ${\cal X}$ -valued Markov chain, with $\sigma$ -field ${\cal A}$ and transition kernel

[TABLE]

We shall assume for convenience that $(X_{kt})_{t\geq 1}$ is stationary. Let $Y_{k1},Y_{k2},\ldots$ be real-valued and conditionally independent given $(X_{kt})_{t\geq 1}$ , and having conditional densities $\{f_{k}(\cdot|x):1\leq k\leq K,x\in{\cal X}\}$ , with respect to some measure $\nu$ , such that

[TABLE]

We assume that the $K$ Markov chains are independent, and that the following Doeblin-type condition holds.

(C1) For $1\leq k\leq K$ , there exists a non-trivial measure $\lambda_{k}$ on $({\cal X},{\cal A})$ such that

[TABLE]

As before let $\mu_{k}=EY_{kt}$ , $\mu_{*}=\max_{1\leq k\leq K}\mu_{k}$ and the regret

[TABLE]

In addition to (C1) we assume the following sample mean large deviations.

(C2) For any $\epsilon>0$ , there exists $b(=b_{\epsilon})>0$ and $Q(=Q_{\epsilon})>0$ such that for $1\leq k\leq K$ and $t\geq 1$ ,

[TABLE]

(C3) For $k$ such that $\mu_{k}<\mu_{*}$ and $\ell$ such that $\mu_{\ell}=\mu_{*}$ , there exists $b_{1}>0$ , $Q_{1}>0$ and $t_{1}\geq 1$ such that for $\omega\leq\mu_{k}$ and $t\geq t_{1}$ ,

[TABLE]

Theorem 3.

For Markovian rewards satisfying (C1)–(C3), SSMC achieves $En_{k}^{r}=O(\log r)$ for $k\not\in\Xi$ , hence $R_{N}=O(\log N)$ .

Agrawal, Teneketzis and Anantharam (1989) and Graves and Lai (1997) considered control problems in which, instead of (4.1) with $K$ Markov chains, there are $K$ arms with each arm representing a distinct Markov transition kernel acting on the same chain. Tekin and Liu (2010) on the other hand considered (4.1), with the constraints that ${\cal X}$ is finite and $f_{k}(\cdot|x)$ is a point mass function for all $k$ and $x$ . They provided a UCB algorithm that achieves logarithmic regret.

We can apply Theorem 3 to show logarithmic regret for i.i.d. rewards on non-exponential parametric families. Lai and Robbins (1985) showed that for the double exponential (DE) densities

[TABLE]

with $\tau>0$ , efficiency is achieved by a UCB strategy involving KL-information of the DE densities, hence implementation requires knowledge that the family is DE, including knowing $\tau$ . In Example 1 below we state logarithmic regret, rather than efficiency, for SSMC. The advantage of SSMC is that we do not assume knowledge of (4.4) in its implementation. Verifications of (C1)–(C3) under (4.4) are given in Appendix B.

Example 1. For the double exponential densities (4.4), conditions (C1)–(C3) hold, hence under SSMC, $En_{k}^{r}=O(\log r)$ for $k\not\in\Xi$ .

5 Numerical studies

We compare SSMC and SSTC against procedures described in Section 2.1, as well as more modern procedures like BESA, KL-UCB, UCB-Bayes and Thompson sampling. The reader can refer to Chapters 1–3 of Kaufmann (2014) for a description of these procedures. In Examples 2 and 3 we consider normal rewards and the comparisons are against procedures in which either efficiency or logarithmic regret has been established. In Example 4 we consider double exponential rewards and there the comparisons are against procedures that have been shown to perform well numerically. In Examples 5–7 we perform comparisons under the settings of Baransi, Maillard and Mannor (2014).

In the simulations done here $J=10000$ datasets are generated for each $N$ , and the regret of a procedure is estimated by averaging $\sum_{k=1}^{K}(\mu_{*}-\mu_{k})N_{k}$ over the datasets. Standard errors are located after the $\pm$ sign. In Examples 5–7 we reproduce simulation results from Baransi et al. (2014). Though no standard errors are provided, they are likely to be small given that a larger number, $J=50000$ , of datasets is generated there.
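The regret estimate and its standard error can be sketched as follows. The pull counts and means are made-up toy values, and for brevity the same means are used in every run, whereas in several examples the $\mu_{k}$ are redrawn for each dataset; the function names are illustrative.

```python
import math

def empirical_regret(mus, pulls):
    # sum_k (mu_* - mu_k) N_k for one simulated dataset.
    mu_star = max(mus)
    return sum((mu_star - mu) * n for mu, n in zip(mus, pulls))

def mean_and_standard_error(values):
    # Sample mean over the J runs and its standard error.
    j = len(values)
    avg = sum(values) / j
    var = sum((v - avg) ** 2 for v in values) / (j - 1)
    return avg, math.sqrt(var / j)

mus = [0.0, 0.5, 1.0]                                  # toy arm means
runs = [[10, 20, 70], [5, 15, 80], [20, 20, 60]]       # toy pull counts N_k per run
regrets = [empirical_regret(mus, pulls) for pulls in runs]
estimate, se = mean_and_standard_error(regrets)
```

With $J=10000$ runs the standard error shrinks by a factor of $\sqrt{J}=100$ relative to the standard deviation of a single run.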

Example 2. Consider $Y_{kt}\sim$ N( $\mu_{k},1$ ), $1\leq k\leq 10$ . In Table 1 we see that SSMC improves upon UCB1 and outperforms UCB-Agrawal [setting $b_{n}=\log\log\log n$ in (2.5)]. Here we generate $\mu_{k}\sim$ N(0,1) in each dataset.

Example 3. Consider $Y_{kt}\sim$ N( $\mu_{k},\sigma_{k}^{2}$ ), $1\leq k\leq 10$ . We compare SSTC against UCB1-tuned and UCB1-Normal. UCB1-tuned was suggested by Auer et al. and shown to perform well numerically. Under UCB1-tuned the population $\Pi_{k}$ maximizing

[TABLE]

where $V_{kn}=\widehat{\sigma}_{kn_{k}}^{2}+\sqrt{\frac{2\log n}{n_{k}}}$ , is selected. In Table 2 we see that UCB1-tuned is significantly better at $N=1000$ whereas SSTC is better at $N=10000$ . UCB1-Normal performs quite poorly. Here we generate $\mu_{k}\sim{\rm N}(0,1)$ and $\sigma_{k}^{-2}\sim{\rm Exp}(1)$ in each dataset.
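A minimal sketch of the UCB1-tuned index, assuming it has the form given by Auer et al. (2002), namely $\bar{Y}_{kn_{k}}+\sqrt{\frac{\log n}{n_{k}}\min(\frac{1}{4},V_{kn})}$ with $V_{kn}$ as above. The use of the divide-by-$n_{k}$ variance estimate and the function names are assumptions made for illustration, not details confirmed by this section.

```python
import math

def ucb1_tuned_index(rewards, n):
    # rewards: observations from one arm; n: total number of observations so far.
    n_k = len(rewards)
    mean = sum(rewards) / n_k
    var = sum(x * x for x in rewards) / n_k - mean * mean   # assumed variance estimate
    v = var + math.sqrt(2.0 * math.log(n) / n_k)            # V_kn
    return mean + math.sqrt((math.log(n) / n_k) * min(0.25, v))

def select_arm(samples):
    # Pick the arm with the largest index.
    n = sum(len(s) for s in samples)
    indices = [ucb1_tuned_index(s, n) for s in samples]
    return max(range(len(samples)), key=lambda k: indices[k])

samples = [[0.2, 0.4, 0.3, 0.5], [0.9, 0.1]]   # toy data
chosen = select_arm(samples)
```

In this toy example the second arm is chosen: it has both the higher sample mean and the larger exploration bonus because it has fewer observations.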

Kaufmann, Cappé and Garivier (2012) performed simulations under the setting of normal rewards with unequal variances, with $(\mu_{1},\sigma_{1})=(1.8,0.5)$ , $(\mu_{2},\sigma_{2})=(2,0.7)$ , $(\mu_{3},\sigma_{3})=(1.5,0.5)$ and $(\mu_{4},\sigma_{4})=(2.2,0.3)$ . They showed that UCB-Bayes achieves regret of about 28 at $N=1000$ and about 47 at $N=10000$ . We apply SSTC to this setting, achieving regrets of 26.0 $\pm$ 0.1 at $N=1000$ and 43.3 $\pm$ 0.2 at $N=10000$ .

Example 4. Consider double exponential rewards $Y_{kt}\sim f_{k}$ , with densities

[TABLE]

We compare SSMC against UCB1-tuned, BESA, Boltzmann exploration and $\epsilon$ -greedy. For $\epsilon$ -greedy we consider $\epsilon_{n}=\min(1,\tfrac{3c}{n})$ . We generate $\mu_{k}\sim$ N(0,1) in each dataset.
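The exploration schedule $\epsilon_{n}=\min(1,\frac{3c}{n})$ can be sketched as below. The selection rule shown (explore uniformly with probability $\epsilon_{n}$, otherwise pick the arm with the largest sample mean) is the usual form of $\epsilon$-greedy and is assumed here; the value $c=2$ and the function names are illustrative.

```python
import random

def epsilon(n, c):
    # Exploration probability at step n for tuning constant c.
    return min(1.0, 3.0 * c / n)

def epsilon_greedy_choice(means, n, c, rng):
    # Explore uniformly with probability epsilon(n, c), otherwise exploit.
    if rng.random() < epsilon(n, c):
        return rng.randrange(len(means))
    return max(range(len(means)), key=lambda k: means[k])

rng = random.Random(0)
schedule = [epsilon(n, c=2.0) for n in (1, 6, 60, 600)]
arm = epsilon_greedy_choice([0.1, 0.7, 0.4], n=10**9, c=2.0, rng=rng)
```

The schedule stays at 1 until $n$ exceeds $3c$ and then decays like $1/n$, so the expected number of exploratory pulls grows logarithmically in $n$.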

Table 3 shows that UCB1-tuned has the best performances at $N=1000$ , whereas SSMC has the best performances at $N=10000$ . BESA does well for $\lambda=2$ at $N=1000$ , and also for $\lambda=5$ at $N=10000$ . A properly-tuned Boltzmann exploration does well at $N=1000$ for $\lambda=2$ , whereas a properly-tuned $\epsilon$ -greedy does well at $\lambda=2$ and 5 for $N=1000$ and at $\lambda=5$ for $N=10000$ .

In Tables 4 and 5 we tabulate the frequencies of the empirical regrets $\sum_{k=1}^{K}(\mu_{*}-\mu_{k})N_{k}$ over the $J=10000$ simulation runs each for $N=1000$ and 10000, at $\lambda=1$ , for SSMC, BESA and UCB1-tuned. The tables show that SSMC has the best control of excessive sampling of inferior arms, the worst empirical regret being less than half that of BESA and UCB1-tuned.

Example 5. Consider $N=20000$ Bernoulli rewards under the following scenarios.

1. $\mu_{1}=0.9$ , $\mu_{2}=0.8$ .

2. $\mu_{1}=0.81$ , $\mu_{2}=0.8$ .

3. $\mu_{1}=0.1$ , $\mu_{2}=\mu_{3}=\mu_{4}=0.05$ , $\mu_{5}=\mu_{6}=\mu_{7}=0.02$ , $\mu_{8}=\mu_{9}=\mu_{10}=0.01$ .

4. $\mu_{1}=0.51$ , $\mu_{2}=\cdots=\mu_{10}=0.5$ .
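For orientation, the four Bernoulli scenarios can be compared through the constant in the Lai and Robbins lower bound, $\sum_{k:\mu_{k}<\mu_{*}}(\mu_{*}-\mu_{k})/D(f_{k}|f_{*})$, where $D$ is the Bernoulli KL-information. The sketch below is only a benchmark calculation and is not part of the simulation study; the function names are illustrative, and scenario 3 is taken to have best arm mean 0.1.

```python
import math

def kl_bernoulli(p, q):
    # KL-information between Bernoulli(p) and Bernoulli(q).
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def lai_robbins_constant(mus):
    # Coefficient of log N in the asymptotic lower bound on the regret.
    mu_star = max(mus)
    return sum((mu_star - mu) / kl_bernoulli(mu, mu_star)
               for mu in mus if mu < mu_star)

scenarios = {
    1: [0.9, 0.8],
    2: [0.81, 0.8],
    3: [0.1] + [0.05] * 3 + [0.02] * 3 + [0.01] * 3,
    4: [0.51] + [0.5] * 9,
}
constants = {s: lai_robbins_constant(m) for s, m in scenarios.items()}
```

Scenarios 2 and 4, in which the gap between the best and the other arms is only 0.01, have much larger constants than scenario 1, which indicates how hard they are asymptotically.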

When comparing the simulated regrets in Table 6, it is useful to remember that BESA and SSMC are non-parametric, using the same procedures even when the rewards are not Bernoulli, whereas KL-UCB and Thompson sampling utilize information on the Bernoulli family. SSMC∗ is a variant of SSMC, see Section 6, with more moderate levels of exploration.

Example 6. Consider truncated exponential and Poisson distributions with $N=20000$ . For truncated exponential we consider $Y_{kt}=\min(\frac{X_{kt}}{10},1)$ , where $X_{kt}\stackrel{{\scriptstyle\rm i.i.d.}}{{\sim}}{\rm Exp}(\lambda_{k})$ (density $\lambda_{k}e^{-\lambda_{k}x}$ ) with $\lambda_{k}=\frac{1}{k}$ , $1\leq k\leq 5$ . For truncated Poisson we consider $Y_{kt}=\min(\frac{X_{kt}}{10},1)$ , where $X_{kt}\stackrel{{\scriptstyle\rm i.i.d.}}{{\sim}}{\rm Poisson}(\lambda_{k})$ , with $\lambda_{k}=0.5+\frac{k}{3}$ , $1\leq k\leq 6$ . The simulation results are given in Table 7. BESAT is a variation of BESA that starts with 10 observations from each population.
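The arm means in these two settings can be computed exactly, which helps when reading the regrets in Table 7. The sketch below uses $E\min(X,c)=\int_{0}^{c}P(X>x)\,dx$ for the exponential case and $E\min(X,c)=\sum_{j=0}^{c-1}P(X>j)$ for the Poisson case; the function names are illustrative.

```python
import math

def truncated_exp_mean(rate, cap=10.0):
    # E[min(X / cap, 1)] for X ~ Exp(rate), from integrating exp(-rate * x) on [0, cap].
    return (1.0 - math.exp(-rate * cap)) / (rate * cap)

def truncated_poisson_mean(lam, cap=10):
    # E[min(X / cap, 1)] for X ~ Poisson(lam), using sum_{j < cap} P(X > j).
    pmf, cdf, total = math.exp(-lam), 0.0, 0.0
    for j in range(cap):
        cdf += pmf                 # cdf is now P(X <= j)
        total += 1.0 - cdf         # add P(X > j)
        pmf *= lam / (j + 1)       # pmf is now P(X = j + 1)
    return total / cap

exp_means = [truncated_exp_mean(1.0 / k) for k in range(1, 6)]
poisson_means = [truncated_poisson_mean(0.5 + k / 3.0) for k in range(1, 7)]
```

In both settings the means increase with $k$, so the optimal arm is the last one ($k=5$ and $k=6$ respectively).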

Example 7. Consider $K=2$ and $N=20000$ with $Y_{1t}\stackrel{{\scriptstyle\rm i.i.d.}}{{\sim}}{\rm Uniform}(0.2,0.4)$ and $Y_{2t}\stackrel{{\scriptstyle\rm i.i.d.}}{{\sim}}{\rm Uniform}(0,1)$ . Here SSMC underperforms with regret of 163 $\pm$ 7 compared to Thompson sampling, which has regret of 13.18. On the other hand SSTC, by normalizing the different scales of the two uniform distributions, is able to achieve the best regret of 2.9 $\pm$ 0.2.

6 Discussion

Together with BESA, the procedures SSMC and SSTC that we introduce here form a class of non-parametric procedures that differ from traditional non-parametric procedures, like $\epsilon$ -greedy and Boltzmann exploration, in their recognition that when deciding which of two populations to sample, samples or subsamples of the same rather than different sizes should be compared. Among the parametric procedures, Thompson sampling fits best with this scheme.

As mentioned earlier, in SSMC (and SSTC), when the leading population $\Pi_{\zeta}$ in the previous round is sampled, essentially only one additional comparison is required in the current round between $\Pi_{\zeta}$ and $\Pi_{k}$ for $k\neq\zeta$ . On the other hand when there are $n$ rewards, order $n$ comparisons may be required between $\Pi_{\zeta}$ and $\Pi_{k}$ when $\Pi_{k}$ wins in the previous round. It is these added comparisons that, relative to BESA, allow for faster catching-up of a potentially undersampled optimal arm. Tables 4 and 5 show the benefits of such added explorations in minimizing the worst-case empirical regret.

To see if SSMC still works well if we moderate these added explorations, we experimented with the following variation of SSMC in Examples 6 and 7. The numerical results indicate improvements.

SSMC∗

Proceed as in SSMC, with step 2(b)iii. replaced by the following.

2(b)iii′. If $c_{n}\leq n_{k}<n_{\zeta}$ , then $\Pi_{k}$ wins the challenge when

[TABLE]

In contrast to SSMC, in SSMC∗ we partition the rewards of the leading arm into groups of size $n_{k}$ for comparisons instead of reusing the rewards in moving averages. In principle the members of a group need not be consecutive in time, which allows SSMC∗ to save storage space when the support of the distributions is finite. That is, rather than storing the full sequence, we simply store the number of occurrences at each support point, and generate a new (permuted) sequence for comparisons whenever necessary. Likewise in BESA, there are substantial storage space savings for finite-support distributions from storing the number of occurrences at each support point.
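The storage idea for finite-support rewards can be sketched as follows: only the counts at each support point are kept, and a permuted sequence is regenerated and split into groups of size $n_{k}$ when a comparison is needed. The sketch covers the storage and partitioning only, not the winning criterion of step 2(b)iii′; the counts, group size and function names are illustrative assumptions.

```python
import random
from collections import Counter

def regenerate_sequence(counts, rng):
    # Rebuild a reward sequence from stored counts and permute it at random.
    seq = [value for value, c in counts.items() for _ in range(c)]
    rng.shuffle(seq)
    return seq

def group_means(seq, size):
    # Means of consecutive, non-overlapping groups of the given size.
    n_groups = len(seq) // size
    return [sum(seq[i * size:(i + 1) * size]) / size for i in range(n_groups)]

rng = random.Random(1)
leader_counts = Counter({0: 7, 1: 5})   # 12 Bernoulli rewards stored as two counts
seq = regenerate_sequence(leader_counts, rng)
means = group_means(seq, size=3)        # groups of size n_k = 3
```

For Bernoulli rewards this reduces the storage per arm from the full sequence to two integers, at the cost of a reshuffle whenever group means are required.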

7 Proofs of Theorems 1–3

Since SSMC and SSTC are index-blind, we may assume without loss of generality that $\mu_{1}=\mu_{*}$ . We provide here the statements and proofs of supporting Lemmas 1 and 2, and follow up with the proofs of Theorems 1–3 in Sections 7.1–7.3. We denote the complement of an event $D$ by $\bar{D}$ , let $\lfloor\cdot\rfloor$ and $\lceil\cdot\rceil$ denote the greatest and least integer function respectively, and let $|A|$ denote the number of elements in a set $A$ .

Let $n_{k}^{r}(=n_{k})$ be the number of observations from $\Pi_{k}$ at the beginning of round $r$ . Let $n^{r}(=n)=\sum_{k=1}^{K}n_{k}^{r}$ . Let $n_{*}^{r}=\max_{1\leq k\leq K}n_{k}^{r}$ . Let

[TABLE]

More specifically, let

[TABLE]

If $\zeta^{r-1}\in{\cal Z}_{1}^{r}$ , then $\zeta^{r}=\zeta^{r-1}$ . Otherwise the leader $\zeta^{r}$ is selected randomly (uniformly) from ${\cal Z}_{1}^{r}$ . In particular if ${\cal Z}_{1}^{r}$ has a single element, then that element must be $\zeta^{r}$ . For $r\geq 2$ , let
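The leader update can be sketched as below, under the assumption that ${\cal Z}_{1}^{r}$ consists of the arms with the largest sample size that also have the largest sample mean among those arms (the defining displays are not reproduced here, so this reading is inferred from the tie-breaking rule used later in the proofs). Function and variable names are illustrative.

```python
import random

def leader_candidates(sizes, means):
    # Arms with the largest sample size and, among those, the largest sample mean.
    n_star = max(sizes)
    largest = [k for k, n in enumerate(sizes) if n == n_star]
    best_mean = max(means[k] for k in largest)
    return [k for k in largest if means[k] == best_mean]

def update_leader(prev_leader, sizes, means, rng):
    # Keep the previous leader if it is still a candidate, else pick uniformly.
    candidates = leader_candidates(sizes, means)
    if prev_leader in candidates:
        return prev_leader
    return rng.choice(candidates)

rng = random.Random(0)
kept = update_leader(0, sizes=[5, 5, 3], means=[0.4, 0.4, 0.9], rng=rng)
switched = update_leader(2, sizes=[5, 6, 3], means=[0.4, 0.2, 0.9], rng=rng)
```

In the first call arm 0 remains leader because it is still among the candidates; in the second call the candidate set has the single element 1, which therefore becomes the leader.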

[TABLE]

We restrict to $r\geq 2$ because the leader is not defined at $r=1$ . Likewise in our subsequent notations on events $B^{r}$ , $C^{r}$ , $D^{r}$ , $G_{k}^{r}$ and $H_{k}^{r}$ , we restrict to $r\geq 2$ .

In Lemma 1 below the key ingredient leading to (7.3) is condition (I) on the event $G_{k}^{r}$ , which says that it is difficult for an inferior arm $k$ with at least $(1+\epsilon)\xi_{k}\log r$ rewards to win against a leading optimal arm $\zeta$ . In the case of exponential families we show efficiency by verifying (I) with $\xi_{k}=\frac{1}{I_{1}(\mu_{k})}$ . Condition (II), on the event $H_{k}^{r}$ , says that analogous winnings from an inferior arm $k$ with at least $J_{k}\log r$ rewards, for $J_{k}$ large, are asymptotically negligible. Condition (III) limits the number of times an inferior arm is leading. This condition is important because $G_{k}^{r}$ and $H_{k}^{r}$ refer to the winning of arm $k$ when the leader is optimal, hence the need, in (III), to bound the event probability of an inferior leader.

Lemma 1.

Let $k\not\in\Xi$ (i.e. $k$ is not an optimal arm) and define

[TABLE]

for some $\epsilon>0$ , $\xi_{k}>0$ and $J_{k}>0$ . Consider the following conditions.

(I) There exists $\xi_{k}>0$ such that for all $\epsilon>0$ , $P(G_{k}^{r})\rightarrow 0$ as $r\rightarrow\infty$ .

(II) There exists $J_{k}>0$ such that $P(H_{k}^{r})=O(r^{-1})$ as $r\rightarrow\infty$ .

(III) $P(A^{r})=o(r^{-1})$ as $r\rightarrow\infty$ .

Under (I)–(III),

[TABLE]

Proof. Consider $r\geq 3$ . Let $b_{r}=1+(1+\epsilon)\xi_{k}\log r$ and $d_{r}=1+J_{k}\log r$ . Under the event $\bar{G}_{k}^{r}$ , arm $k$ in round $s\in[2,r-1]$ is sampled to a size beyond $b_{r}$ only when $\zeta^{s}\not\in\Xi$ (i.e. under the event $A^{s}$ ). In view that $n_{k}^{2}=1(<b_{r})$ , it follows that

[TABLE]

Hence

[TABLE]

Similarly under the event $\bar{H}_{k}^{r}$ ,

[TABLE]

Hence

[TABLE]

Since $n_{k}^{r}\leq r$ , by (7.4) and (7.5),

[TABLE]

By (III), $\sum_{s=2}^{r}P(A^{s})=o(\log r)$ , therefore by (7), (I) and (II),

[TABLE]

We can thus conclude (7.3) by letting $\epsilon\rightarrow 0$ . $\Box$

The verification of (III) is made easier by Lemma 2 below. To provide intuition for the reader, we sketch its proof before providing the details.

Lemma 2.

Let

[TABLE]

If as $s\rightarrow\infty$ ,

[TABLE]

then $P(A^{r})=o(r^{-1})$ as $r\rightarrow\infty$ .

Sketch of proof. Note that (7.7) bounds the probability of an inferior arm taking the leadership from an optimal leader in round $s+1$ , whereas (7.8) bounds the probability of an inferior leader winning against an optimal challenger in round $s$ . Let $s_{0}=\lfloor\frac{r}{4}\rfloor$ and for $r\geq 8$ , let

[TABLE]

Under $A^{r}\cap D^{r}$ , there is a leadership takeover by an inferior arm at least once between rounds $s_{0}+1$ and $r$ . More specifically let $s_{1}$ be the largest $s\in[s_{0},r-1]$ for which $\zeta^{s}\in\Xi$ . If $s_{1}<r-1$ , then by the definition of $s_{1}$ , $\zeta^{s_{1}+1}\not\in\Xi$ . If $s_{1}=r-1$ , then since we are under $A^{r}$ , $\zeta^{s_{1}+1}=\zeta^{r}\not\in\Xi$ . In summary

[TABLE]

By showing that

[TABLE]

we can conclude from (7.7) and (7) that

[TABLE]

To see (7.10), recall that by step 2(b)i of SSMC or SSTC, if the (optimal) leader and (inferior) challenger have the same sample size, then the challenger loses by default. The tie-breaking rule then ensures that the challenger is unable to take over leadership in the next round. Hence for $\zeta^{s}$ to lose leadership to an inferior arm $k$ in round $s+1$ , it has to lose to arm $k$ when arm $k$ has exactly $n_{\zeta}^{s}-1$ observations.

What (7.11) says is that if at some previous round $s\geq s_{0}$ the leader is optimal, then (7.7) makes it difficult for an inferior arm to take over leadership during and after round $s$ , so the leader is likely to be optimal all the way from rounds $s$ to $r$ . The only situation we need to guard against is $\bar{D}^{r}$ , the event that leaders are inferior for all rounds between $s_{0}$ and $r-1$ . Let $\#^{r}=\sum_{s=s_{0}}^{r-1}{\bf 1}_{C^{s}}$ be the number of rounds an inferior leader wins against at least one optimal arm. In (7.13) we show that by (7.8), the optimal arms will, with high probability, lose less than $\frac{r}{4}$ times between rounds $s_{0}$ and $r-1$ when the leader is inferior.

We next show that

[TABLE]

(or $\{\#^{r}<\frac{r}{4}\}\subset D^{r}$ ), that is if the optimal arms lose this few times, then one of them has to be a leader at some round between $s_{0}$ and $r-1$ . Lemma 2 follows from (7.11)–(7.13).

Proof of Lemma 2. Consider $r\geq 8$ . By (7.8),

[TABLE]

hence by Markov’s inequality,

[TABLE]

It remains for us to show (7.12). Assume $\bar{D}^{r}$ . Let $m^{s}=n_{\zeta}^{s}-\max_{\ell\in\Xi}n_{\ell}^{s}$ . Observe that $n_{\zeta}^{s+1}=n_{\zeta}^{s}$ if $n_{\ell}^{s+1}=n_{\ell}^{s}+1$ for some $\ell\neq\zeta^{s}$ . This is because the leader $\zeta^{s}$ is not sampled if it loses at least one challenge. Moreover by step 2(b)i. of SSMC or SSTC, all arms with the same number of observations as $\zeta^{s}$ are not sampled. Therefore if $\zeta^{s}\not\in\Xi$ and $n_{\ell}^{s+1}=n_{\ell}^{s}+1$ for all $\ell\in\Xi$ , that is if all optimal arms win against an inferior leader, then $m^{s+1}=m^{s}-1$ . In other words,

[TABLE]

Since $m^{s+1}\leq m^{s}+1$ , it follows from (7.14) that $m^{s+1}\leq m^{s}+1-2{\bf 1}_{F^{s}}$ . Therefore

[TABLE]

and since $m^{r}\geq 0$ and $m^{s_{0}}\leq s_{0}$ , we can conclude that

[TABLE]

Under $\bar{D}^{r}$ , ${\bf 1}_{C^{s}}=1-{\bf 1}_{F^{s}}$ for $s_{0}\leq s\leq r-1$ , and it follows from (7.15) that

[TABLE]

and (7.12) indeed holds. $\Box$

7.1 Proof of Theorem 1

We consider here SSMC. Equation (7.7) follows from Lemma 4 below and $c_{r}=o(\log r)$ whereas (7.8) follows from Lemma 5 and $\frac{c_{r}}{\log\log r}\rightarrow\infty$ . We can thus conclude $P(A^{r})=o(r^{-1})$ from Lemma 2, and together with the verification in Lemma 6 of (I), see Lemma 1, for $\xi_{k}=1/I_{1}(\mu_{k})$ and (II) for $J_{k}$ large, we can conclude Theorem 1.

The proofs of Lemmas 4–6 use large deviations Chernoff bounds that are given below in Lemma 3. They can be shown using change-of-measure arguments. Let $I_{k}$ be the large deviations rate function of $f_{k}$ .

Lemma 3.

Under (3.1), if $1\leq k\leq K$ , $t\geq 1$ and $\omega=\psi^{\prime}(\theta)$ for some $\theta\in\Theta$ , then

[TABLE]

In Lemmas 4–6 we let $\omega=\frac{1}{2}(\mu_{*}+\max_{k:\mu_{k}<\mu_{*}}\mu_{k})$ and $a=\min_{1\leq k\leq K}I_{k}(\omega)$ . Recall that the parameter $c_{r}$ is a threshold for forced explorations, in step 2(b)ii. of SSMC.

Lemma 4.

Under (3.1), $P(B^{r})\leq\frac{3K^{2}}{1-e^{-a}}e^{-a(\frac{r}{K}-1)}$ when $\frac{r}{K}-1\geq c_{r}$ .

Proof. Let $r$ be such that $\frac{r}{K}-1\geq c_{r}$ . The event $B^{r}$ occurs if at round $r$ the leading arm $\ell$ is optimal (i.e. $\ell\in\Xi$ ), and it loses to an inferior arm $k(\not\in\Xi)$ with $n_{k}=u$ and $n_{\ell}=u+1$ for $u+1\geq\frac{r}{K}$ (since arm $\ell$ is leading). It follows from Lemma 3 that

[TABLE]

Since arm $\ell$ loses to arm $k$ when $\bar{Y}_{ku}\geq\min(\bar{Y}_{\ell,1:u},\bar{Y}_{\ell,2:(u+1)})$ , it follows that

[TABLE]

and Lemma 4 holds. $\Box$
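The comparison used in the proof above, in which the challenger's mean is set against the means of consecutive blocks of the leader's rewards, can be sketched for general sample sizes as follows. The sketch assumes that the challenger wins as soon as its mean is at least the mean of some block of $n_{k}$ consecutive leader rewards, which for $n_{\ell}=n_{k}+1$ reduces to the condition $\bar{Y}_{ku}\geq\min(\bar{Y}_{\ell,1:u},\bar{Y}_{\ell,2:(u+1)})$; the forced-exploration threshold $c_{n}$ and the equal-sample-size rule of step 2(b) are omitted, and the function name is illustrative.

```python
def challenger_wins(leader_rewards, challenger_rewards):
    # Compare block sums, which is equivalent to comparing means of equal-size blocks.
    n_k, n_z = len(challenger_rewards), len(leader_rewards)
    if not 0 < n_k < n_z:
        raise ValueError("challenger must have fewer rewards than the leader")
    target = sum(challenger_rewards)
    window = sum(leader_rewards[:n_k])           # first block of size n_k
    if target >= window:
        return True
    for t in range(n_k, n_z):
        # Slide the block by one position: add the new reward, drop the oldest.
        window += leader_rewards[t] - leader_rewards[t - n_k]
        if target >= window:
            return True
    return False

win = challenger_wins([3, 1, 4, 1, 5], [2, 2])    # first leader block sums to 4
loss = challenger_wins([3, 1, 4, 1, 5], [1, 1])   # every leader block sums to at least 4
```

Updating the window sum incrementally keeps the cost of one challenge linear in the leader's sample size rather than quadratic.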

Lemma 5.

Under (3.1), $P(C^{r})\leq K^{2}e^{-c_{r}a}\frac{(\log r)^{6}}{r}+o(r^{-1})$ .

Proof. The event $C^{r}$ occurs if at round $r$ the leading arm $k$ is inferior (i.e. $k\not\in\Xi$ ), and it wins a challenge against one or more optimal arms $\ell(\in\Xi)$ . By step 2(b)ii. of SSMC, arm $k$ loses automatically when $n_{\ell}<c_{n}$ , hence we need only consider $n_{\ell}\geq c_{n}$ . Note that when $n_{k}=n_{\ell}$ , for arm $k$ to be the leader, by the tie-breaking rule we require $\bar{Y}_{kn_{\ell}}\geq\bar{Y}_{\ell n_{\ell}}$ . We shall consider $n_{\ell}>(\log r)^{2}$ in case 1 and $n_{\ell}=v$ for $c_{n}\leq v<(\log r)^{2}$ in case 2.

Case 1: $n_{\ell}>(\log r)^{2}$ . By Lemma 3,

[TABLE]

Case 2: $n_{\ell}=v$ for $(c_{r}\leq)c_{n}\leq v<(\log r)^{2}$ . In view that $n_{k}\geq\frac{r}{K}$ when $k$ is the leading arm, we shall show that for $r$ large, for each such $v$ there exists $\xi(=\xi_{v})$ such that

[TABLE]

The inequality within the brackets in (7.1) follows from partitioning $[1,\frac{r}{K}]$ into $\lfloor\frac{r}{Kv}\rfloor$ segments of length $v$ , and applying independence of the sample on each segment.

Since $\theta_{\ell}>\theta_{k}$ , if $\sum_{t=1}^{v}y_{t}\leq v\mu_{k}$ , then by (3.1),

[TABLE]

Hence if $\xi\leq\mu_{k}$ , then as $v\geq c_{r}$ ,

[TABLE]

Let $\xi(\leq\mu_{k}$ for large $r$ ) be such that

[TABLE]

Equation (7.1) follows from (7.22) and the first inequality in (7.23), whereas (7.1) follows from the second inequality in (7.23) and $v<(\log r)^{2}$ . By (7.18)–(7.1),

[TABLE]

and Lemma 5 holds. $\Box$

Lemma 6.

Under (3.1) and $c_{r}=o(\log r)$ , (I) (in the statement of Lemma 1) holds for $\xi_{k}=1/I_{1}(\mu_{k})$ and (II) holds for $J_{k}>\max(\frac{1}{I_{k}(\omega)},\frac{2}{I_{1}(\omega)})$ , where $\omega=\frac{1}{2}(\mu_{*}+\max_{k:\mu_{k}<\mu_{*}}\mu_{k})$ .

Proof. Let $k\not\in\Xi$ . Let $\mu_{k}<\omega_{k}<\mu_{1}$ be such that $(1+\epsilon)I_{1}(\omega_{k})>I_{1}(\mu_{k})$ . Consider $n_{k}=u$ for $u\geq(1+\epsilon)\xi_{k}\log r$ (in $G_{k}^{r}$ ) and $u\geq J_{k}\log r$ (in $H_{k}^{r}$ ). Since $I_{\ell}=I_{1}$ for $\ell\in\Xi$ , it follows from Lemma 3 that

[TABLE]

Since $c_{r}=o(\log r)$ , we can consider $r$ large enough such that $(1+\epsilon)\xi_{k}\log r\geq c_{r}$ . Hence if in round $1\leq s\leq r$ arm $k$ has sample size of at least $(1+\epsilon)\xi_{k}\log r$ , it wins against leading optimal arm $\ell$ only if

[TABLE]

By (1), (7.24), (7.25) and Bonferroni’s inequality,

[TABLE]

and (I) holds because $(1+\epsilon)\xi_{k}I_{1}(\omega_{k})>1$ and $(1+\epsilon)\xi_{k}I_{k}(\omega_{k})>0$ .

Let $J_{k}>\max(\frac{1}{I_{k}(\omega)},\frac{2}{I_{1}(\omega)})$ . It follows from (1), (7.24), (7.25) and the arguments above that

[TABLE]

and (II) holds because $J_{k}I_{1}(\omega)>2$ and $J_{k}I_{k}(\omega)>1$ . $\Box$

7.2 Proof of Theorem 2

We consider here SSTC. By Lemmas 1 and 2 it suffices, in Lemmas 8–11 below, to verify the conditions needed to show that (7.3) holds with $\xi_{k}=1/M(\frac{\mu_{*}-\mu_{k}}{\sigma_{k}})$ . Lemma 7 provides the underlying large deviations bounds for the standard error estimator. Let $\Phi(z)=P(Z\leq z)$ and $\bar{\Phi}(z)=P(Z>z)(\leq e^{-z^{2}/2}$ for $z\geq 0$ ) for $Z\sim$ N(0,1).

Lemma 7.

For $1\leq k\leq K$ and $t\geq 2$ ,

[TABLE]

Proof. We note that $\widehat{\sigma}^{2}_{kt}/\sigma_{k}^{2}\stackrel{{\scriptstyle d}}{{=}}\frac{1}{t-1}\sum_{s=1}^{t-1}U_{s}$ , where $U_{s}\stackrel{{\scriptstyle\rm i.i.d.}}{{\sim}}\chi^{2}_{1}$ , and that $U_{1}$ has large deviations rate function

[TABLE]

The last equality holds because the supremum occurs when $\theta=\frac{x-1}{2x}$ . We conclude (7.26) and (7.27) from (7.16) and (7.17) respectively. $\Box$
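The rate function of $U_{1}\sim\chi^{2}_{1}$ can be checked numerically. The sketch below assumes the standard facts that $\log Ee^{\theta U_{1}}=-\frac{1}{2}\log(1-2\theta)$ for $\theta<\frac{1}{2}$ and that the resulting rate function is $\frac{1}{2}(x-1-\log x)$, which matches the $w$-part of $I(w,z)$ used in the proof of Lemma 10; the grid search and function names are illustrative.

```python
import math

def rate_closed_form(x):
    # Assumed closed form of the chi-square(1) rate function.
    return 0.5 * (x - 1.0 - math.log(x))

def rate_numeric(x, grid=20000):
    # Grid search for sup over theta < 1/2 of theta * x + 0.5 * log(1 - 2 * theta).
    best = -math.inf
    for i in range(grid):
        theta = -5.0 + 5.5 * i / grid        # theta ranges over [-5, 0.5)
        best = max(best, theta * x + 0.5 * math.log(1.0 - 2.0 * theta))
    return best

def maximiser(x):
    # Location of the supremum stated in the proof.
    return (x - 1.0) / (2.0 * x)

checks = {x: (rate_numeric(x), rate_closed_form(x)) for x in (0.5, 2.0, 3.0)}
```

The grid maximum agrees with the closed form to within the grid resolution, and the maximiser is negative for $x<1$ and positive for $x>1$, corresponding to lower and upper deviations of the variance estimate.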

Lemma 8.

Under (3.3), $P(B^{r})\leq Qe^{-ar}$ for some $Q>0$ and $a>0$ , when $\frac{r}{K}-1\geq c_{r}$ .

Proof. Let $r$ be such that $\frac{r}{K}-1\geq c_{r}$ . The event $B^{r}$ occurs if at round $r$ the leading arm $\ell$ is optimal, and it loses to an inferior arm $k$ with $n_{k}=u$ and $n_{\ell}=u+1$ for $u\geq\frac{r}{K}-1$ . Let $k\not\in\Xi$ , $\ell\in\Xi$ and let $\epsilon>0$ be such that $\omega:=\frac{\mu_{k}-\mu_{\ell}+\epsilon}{2\sigma_{k}}<0$ . Let $\tau_{i}(u)$ , $1\leq i\leq 3$ , be quantities that we shall define below. Note that

[TABLE]

Since $\bar{Y}_{ku}-\bar{Y}_{{\ell},u+1}\sim{\rm N}(\mu_{k}-\mu_{\ell},\tfrac{\sigma_{\ell}^{2}}{u+1}+\tfrac{\sigma_{k}^{2}}{u})$ ,

[TABLE]

It follows from (7.26) and (7.27) that

[TABLE]

where $a_{1}=1-\log 2(>0)$ and $a_{2}=\log 2-\frac{1}{2}(>0)$ . By (7.28)–(7.30),

[TABLE]

Since $\frac{\bar{Y}_{\ell u}-\bar{Y}_{\ell,u+1}}{\sigma_{\ell}/2}\sim$ N(0, $\lambda$ ) for $\lambda\leq 4(\frac{1}{u}+\frac{1}{u+1})\leq\frac{8}{u}$ , it follows that

[TABLE]

Hence by (7.31),

[TABLE]

We check that for $\omega_{k}=\frac{\mu_{k}+\mu_{\ell}}{2}$ ,

[TABLE]

By (7.32)–(7.2),

[TABLE]

and Lemma 8 indeed holds. $\Box$

Lemma 9.

Under (3.3), $P(C^{r})\leq K^{2}e^{-c_{r}a}\tfrac{(\log r)^{6}}{r}+o(r^{-1})$ for some $a>0$ .

Proof. The event $C^{r}$ occurs if at round $r$ the leading arm $k$ is inferior, and it wins a challenge against one or more optimal arms $\ell$ . By step 2(b)ii. of SSTC, we need only consider $n_{\ell}\geq c_{n}$ . Note that when $n_{k}=n_{\ell}$ , for arm $k$ to be leader, by the tie-breaking rule we require $\bar{Y}_{kn_{k}}\geq\bar{Y}_{\ell n_{\ell}}$ . Consider $n_{k}$ taking values $u$ , $n_{\ell}$ taking values $v$ and let $\tau_{i}(\cdot)$ , $1\leq i\leq 4$ , be quantities that we shall define below.

Case 1. $n_{\ell}>(\log r)^{2}$ . Let $\omega=\frac{\mu_{\ell}+\mu_{k}}{2}$ and check that

[TABLE]

Case 2. $(c_{r}\leq)c_{n}\leq n_{\ell}<(\log r)^{2}$ . Let $\omega$ be such that

[TABLE]

Hence

[TABLE]

We shall show that there exists $a>0$ such that for large $r$ ,

[TABLE]

For $u\geq\tfrac{r}{K}$ ,

[TABLE]

Since (7.2) and (7.38) hold with “ $-\bar{Y}_{ku}$ ” replacing “ $-\mu_{k}+r^{-\frac{1}{3}}$ ” and “ $-\mu_{k}-r^{-\frac{1}{3}}$ ” respectively, by adding $\tau_{4}(u)$ to the upper bounds,

[TABLE]

We conclude Lemma 9 from (7.2) and (7.2)–(7.39).

We shall now show (7.38), noting firstly that for $r$ large, the $\omega$ satisfying (7.36) is negative. This is because for $v<(\log r)^{2}$ ,

[TABLE]

whereas $\frac{(\log r)^{4}}{r}\rightarrow 0$ .

Let $g_{v}$ be the common density function of $\widehat{\sigma}_{kv}/\sigma_{k}$ and $\widehat{\sigma}_{\ell v}/\sigma_{\ell}$ . By the independence of $\bar{Y}_{kv}$ and $\widehat{\sigma}_{kv}$ ,

[TABLE]

By similar arguments,

[TABLE]

where $\Delta:=\mu_{\ell}-\mu_{k}(>0)$ . Let $\delta_{1}=\frac{r^{-\frac{1}{3}}}{\sigma_{k}}$ , $\delta_{2}=\frac{\Delta-r^{-\frac{1}{3}}}{\sigma_{\ell}}$ and $b=-\omega x$ . Since $b>0$ and $\delta_{2}>\delta_{1}>0$ for $r$ large,

[TABLE]

where $a_{r}=\frac{(\delta_{2}-\delta_{1})^{2}}{2}$ ( $\rightarrow\frac{\Delta^{2}}{2\sigma_{\ell}^{2}}$ as $r\rightarrow\infty$ ). Let $a=\frac{\Delta^{2}}{4\sigma_{\ell}^{2}}$ . It follows from (7.2)–(7.42) that for $r$ large,

[TABLE]

Hence by (7.36), the inequality in (7.38) indeed holds. $\Box$

Lemma 10.

Let $Z_{s}\sim{\rm N}(0,\frac{1}{s+1})$ and $W_{s}\sim\chi^{2}_{s}/s$ be independent. For any $g<0$ and $0<\delta<M(g)$ , there exists $Q>0$ such that for $s_{1}\geq 1$ ,

[TABLE]

Proof. Consider the domain $\Omega={\bf R}^{+}\times{\bf R}$ , and the set

[TABLE]

Let $I(w,z)=\frac{1}{2}(z^{2}+w-1-\log w)$ , and check that

[TABLE]

the second last equality follows from the infimum occurring at $w=\frac{1}{g^{2}+1}$ .

Let $L_{v}$ , $1\leq v\leq V$ , be half-spaces constructed as follows. Let

[TABLE]

The existence of $z_{1}$ satisfying second line of (7.2) follows from $I(1,g)=\frac{1}{2}g^{2}>M(g)$ . Since $(A\setminus L_{1})\subset(0,1)\times(z_{1},0)$ , by (7.2), we can find half-spaces

[TABLE]

such that $(A\setminus L_{1})\subset\cup_{v=2}^{V}L_{v}$ . Therefore $A\subset\cup_{v=1}^{V}L_{v}$ , and so

[TABLE]

It follows from (7.27), (7.2), (7.2) and the independence of $Z_{s}$ and $W_{s}$ , setting $w_{1}=1$ , that

[TABLE]

Lemma 10, with $Q=\frac{V}{1-e^{-M(g)+\delta}}$ , follows from substituting (7.47) into (7.46). $\Box$
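The role of $M(g)$ as an infimum of $I(w,z)$ can be checked numerically. The sketch assumes that the relevant boundary of the set $A$ is the curve $z=g\sqrt{w}$ (the display defining $A$ is not reproduced here), along which $I(w,g\sqrt{w})=\frac{1}{2}[(g^{2}+1)w-1-\log w]$ is minimized at $w=\frac{1}{g^{2}+1}$ with minimum value $M(g)$. The value $g=-1.5$, the grid and the function names are illustrative.

```python
import math

def M(g):
    return 0.5 * math.log(1.0 + g * g)

def I(w, z):
    # I(w, z) = (1/2)(z^2 + w - 1 - log w), as in the proof of Lemma 10.
    return 0.5 * (z * z + w - 1.0 - math.log(w))

def boundary_infimum(g, grid=100000, w_max=5.0):
    # Minimize I along the assumed boundary z = g * sqrt(w) over a grid of w values.
    best_val, best_w = math.inf, None
    for i in range(1, grid + 1):
        w = w_max * i / grid
        val = I(w, g * math.sqrt(w))
        if val < best_val:
            best_val, best_w = val, w
    return best_val, best_w

g = -1.5
value, w_hat = boundary_infimum(g)
```

The grid minimum matches $M(g)$ and is attained near $w=\frac{1}{g^{2}+1}$, while $I(1,g)=\frac{1}{2}g^{2}$ is strictly larger, as used in the construction of $z_{1}$.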

Lemma 11.

Under (3.3) and $c_{r}=o(\log r)$ , (I) (in the statement of Lemma 1) holds for $\xi_{k}=1/M(\tfrac{\mu_{*}-\mu_{k}}{\sigma_{k}})$ and (II) holds for $J_{k}$ large.

Proof. By considering the rewards $Y_{kt}-\mu_{*}$ , we may assume without loss of generality that $\mu_{*}=0$ . Let $k\not\in\Xi$ (hence $\mu_{k}<0$ ) and $\epsilon>0$ . Let $g_{k}=\frac{\mu_{k}}{\sigma_{k}}$ and let $g_{\omega}<0$ and $\delta>0$ be such that

[TABLE]

Let $m_{r}=\lceil(1+\epsilon)(\log r)/M(g_{k})\rceil$ . Since $c_{r}=o(\log r)$ , we can consider $r$ large enough such that $m_{r}\geq c_{r}$ . By (7.27),

[TABLE]

Let $\sigma_{0}=\min_{1\leq\ell\leq K}\sigma_{\ell}$ . For $\ell\in\Xi$ ,

[TABLE]

By (1) and (7.48),

[TABLE]

By (7.49)–(7.52), to show (I), it suffices to show that

[TABLE]

Keeping in mind that $g_{k}+\delta<0$ , let $w>1$ be such that $\sqrt{w}(g_{k}+\delta)>g_{k}$ . It follows from (7.26) and $g_{k}\sigma_{k}=\mu_{k}$ that

[TABLE]

and (7.53) indeed holds. Finally by Lemma 10,

[TABLE]

for some $Q>0$ , and so (7.54) follows from (7.48).

To show (II), we consider $m_{r}=\lceil J_{k}\log r\rceil$ . By (7.27), we can select $J_{k}$ large enough to satisfy (7.49) with “ $\rightarrow 0$ ” replaced by “ $=O(r^{-1})$ ”. We note that (7.52) holds with $H_{k}^{r}$ in place of $G_{k}^{r}$ for this $m_{r}$ . Therefore to show (II), it suffices to note that for $J_{k}$ large enough, (7.51), (7.53) and (7.54) hold with “ $\rightarrow 0$ ” replaced by “ $=O(r^{-1})$ ”. $\Box$

7.3 Proof of Theorem 3

Assume (C1)–(C3) and let $\widetilde{\mu}=\max_{k:\mu_{k}<\mu_{*}}\mu_{k}$ . By Lemmas 1 and 2 it suffices, in Lemmas 12–14 below, to verify the conditions needed for SSMC to satisfy (7.3) for some $\xi_{k}>0$ .

Lemma 12.

Under (C2), $P(B^{r})\leq\frac{3QK^{2}}{1-e^{-b}}e^{-b(\frac{r}{K}-1)}$ for some $b>0$ and $Q>0$ , when $\frac{r}{K}-1\geq c_{r}$ .

Proof. Consider $r$ such that $(n_{k}\geq)\frac{r}{K}-1\geq c_{r}$ . Let $\epsilon=\frac{1}{2}(\mu_{*}-\widetilde{\mu})$ and let $b$ and $Q$ be the constants satisfying (C2). Lemma 12 follows from arguments similar to those in the proof of Lemma 4, setting $\omega=\frac{1}{2}(\mu_{*}+\widetilde{\mu})$ . $\Box$

Lemma 13.

Under (C1)–(C3), $P(C^{r})\leq K^{2}Q_{1}e^{-c_{r}b_{1}}\frac{(\log r)^{6}}{r}+o(r^{-1})$ for some $b_{1}>0$ and $Q_{1}>0$ .

Proof. The event $C^{r}$ occurs if at round $r$ the leading arm $k$ is inferior, and it wins against one or more optimal arms $\ell$ . By step 2(b)ii. of SSMC, we need only consider $n_{\ell}=v$ for $v\geq c_{n}$ . Note that $n_{k}\geq\frac{r}{K}$ and $n_{k}\geq n_{\ell}$ .

Case 1: $n_{\ell}>(\log r)^{2}$ . Let $\omega$ and $\epsilon$ be as in the proof of Lemma 12. By (C2), there exists $b>0$ and $Q>0$ such that

[TABLE]

Case 2: $n_{\ell}=v$ for $(c_{r}\leq)c_{n}\leq v<(\log r)^{2}$ . Select $\omega(\leq\mu_{k}$ for $r$ large) such that

[TABLE]

Let $p_{\omega}=P(\bar{Y}_{kv}>\omega)$ and let $d=\lceil 2(\log r)^{2}\rceil$ , $\eta=\lfloor\frac{r/K-1}{d}\rfloor$ . By (C1) and the second inequality of (7.56),

[TABLE]

To see the second inequality of (7.3), let

[TABLE]

Note that the probability in the second line of (7.3) is $P(\cap_{m=0}^{\eta}D_{m})$ , and that by (7.56), $P(D_{m})=p_{\omega}\leq 1-\frac{(\log r)^{4}}{r}$ . By the triangle inequality and the convention $\prod_{m=\eta+1}^{\eta}=1$ ,

[TABLE]

By (C1),

[TABLE]

since $\cap_{m=0}^{u-1}D_{m}$ depends on $(Y_{k1},\ldots,Y_{k,(u-1)d+v})$ whereas $D_{u}$ depends on $(Y_{k,ud+1},\ldots,Y_{k,ud+v})$ . Substituting (7.59) into (7.3) gives us the second inequality of (7.3).

It follows from (C3) and the first inequality of (7.56) that there exists $Q_{1}>0$ , $b_{1}>0$ and $t_{1}\geq 1$ such that for $v\geq t_{1}$ ,

[TABLE]

Hence by (7.55) and (7.3), for $r$ such that $c_{r}\geq t_{1}$ ,

[TABLE]

and Lemma 13 holds. $\Box$

Lemma 14.

Under (C2) and $c_{r}=o(\log r)$ , statement (II) in Lemma 1 holds.

Proof. Let $\epsilon$ and $\omega$ be as in the proof of Lemma 12, and let $b$ and $Q$ be the constants satisfying (C2). For an optimal arm $\ell$ ,

[TABLE]

Let $J_{k}>\frac{2}{b}$ . Since $c_{r}=o(\log r)$ , for $r$ large, $\lceil J_{k}\log r\rceil\geq c_{r}$ and therefore by Bonferroni’s inequality,

[TABLE]

and (II) holds. $\square$

Appendix A Showing (2.10)

Let $\Phi(z)=P(Z\leq z)$ for $Z\sim N(0,1)$. It follows from $\Phi(-z)=[1+o(1)]\frac{1}{z\sqrt{2\pi}}e^{-z^{2}/2}$ as $z\rightarrow\infty$ that

[TABLE]
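The tail asymptotic for $\Phi(-z)$ used here can be checked numerically. The short Python sketch below is an illustration added for the reader: the function names are chosen here and are not from the paper. It compares the exact lower tail, computed from the complementary error function, with the leading-order approximation $\frac{1}{z\sqrt{2\pi}}e^{-z^{2}/2}$.

```python
import math


def normal_lower_tail(z):
    """Phi(-z) = P(Z <= -z) for Z ~ N(0,1), via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def tail_approximation(z):
    """Leading-order approximation exp(-z^2/2) / (z sqrt(2 pi))."""
    return math.exp(-0.5 * z * z) / (z * math.sqrt(2.0 * math.pi))


# The ratio exact/approximation should approach 1 as z grows.
for z in (1.0, 2.0, 4.0, 8.0):
    ratio = normal_lower_tail(z) / tail_approximation(z)
    print(f"z = {z:4.1f}  ratio = {ratio:.4f}")
```

The ratio stays below 1 and increases towards 1 as $z$ grows, which is the $[1+o(1)]$ factor in the display above.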

Assume without loss of generality $\mu_{1}=0$ and consider $n_{1}=u$ and $n_{2}=v$ (hence $u+v=n$) with $v=O(\log n)$. By (A.1) and Bonferroni’s inequality,

[TABLE]

By (A.2) and independence of $\bar{Y}_{1,(sv+1):[(s+1)v]}$ for $0\leq s\leq\frac{u-v}{v}$,

[TABLE]

We conclude (2.10) from (A) and (A).

Appendix B Verifications of (C1)–(C3) for double exponential densities

By dividing $Y_{kt}$ by $\tau$ if necessary, we may assume without loss of generality that $\tau=1$. We check that (C1) holds for $\lambda_{k}(A)=\int_{A}f_{k}(y)dy$, whereas (C2) follows from the Chernoff bounds given in Lemma 3, that is, (4.2) holds for $Q=2$ and $b=I(\epsilon)$, where $I(\mu)=\sup_{|\theta|<1}[\theta\mu+\log(1-\theta^{2})]$ is the large deviations rate function of the double exponential density $f(y)=\frac{1}{2}e^{-|y|}$.
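For reference, the moment generating function behind this rate function can be computed directly. The derivation below is a standard calculation added for the reader; the notation $\Lambda(\theta)$ is introduced here and does not appear in the paper.

```latex
\begin{aligned}
E e^{\theta Y}
 &= \int_{-\infty}^{\infty} \tfrac{1}{2} e^{-|y|} e^{\theta y}\,dy
  = \tfrac{1}{2}\int_{0}^{\infty} e^{-(1-\theta)y}\,dy
   + \tfrac{1}{2}\int_{0}^{\infty} e^{-(1+\theta)y}\,dy \\
 &= \tfrac{1}{2}\Big(\frac{1}{1-\theta}+\frac{1}{1+\theta}\Big)
  = \frac{1}{1-\theta^{2}}, \qquad |\theta|<1 .
\end{aligned}
```

Hence $\Lambda(\theta)=\log Ee^{\theta Y}=-\log(1-\theta^{2})$, which is finite only for $|\theta|<1$, and the rate function is its Legendre transform $\sup_{|\theta|<1}[\theta\mu-\Lambda(\theta)]$.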

Let $S_{t}=\sum_{u=1}^{t}Y_{u}$ with $Y_{u}\stackrel{\rm i.i.d.}{\sim}f$ and let $\Delta=\mu_{\ell}-\mu_{k}$. Since $\mu_{k}-Y_{kt}\sim f$, and similarly when $k$ is replaced by $\ell$, to show (C3), it suffices to show that for $z\geq 0$ and $t\geq 1$,

[TABLE]

where $b_{1}=\Delta-2\log(1+\frac{\Delta}{2})$ $(>0)$. By (B.1), (C3) holds for $Q_{1}=1$, $t_{1}=1$ and the above $b_{1}$.
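The positivity of $b_{1}$ follows from $\log(1+x)<x$ for $x>0$, applied with $x=\frac{\Delta}{2}$. As a quick illustration, the Python sketch below evaluates $b_{1}$ for a few gaps $\Delta$; the function name `b1` and the sample values of $\Delta$ are assumptions chosen here, not taken from the paper.

```python
import math


def b1(delta):
    """Exponent b_1 = delta - 2 log(1 + delta/2) from the text, for delta > 0."""
    return delta - 2.0 * math.log1p(delta / 2.0)


# Illustrative (assumed) gaps between the optimal and inferior arm means.
for delta in (0.1, 0.5, 1.0, 2.0, 5.0):
    print(f"Delta = {delta:3.1f}  b_1 = {b1(delta):.5f}")
```

For small gaps $b_{1}$ behaves like $\frac{\Delta^{2}}{4}$, so the exponential rate in (C3) weakens quadratically as the arms become harder to distinguish.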

Since $Y_{u}\stackrel{d}{=}Z_{u1}-Z_{u2}$, with $Z_{u1}$ and $Z_{u2}$ independent exponential random variables with mean 1, it follows that $S_{t}\stackrel{d}{=}S_{t1}-S_{t2}$ where $S_{t1}$ and $S_{t2}$ are independent Gamma random variables. Using this, Kotz, Kozubowski and Podgórski (2001) showed, see their (2.3.25), that the density $f_{t}$ of $S_{t}$ can be expressed as $f_{t}(x)=e^{-x}g_{t}(x)$ for $x\geq 0$, where

[TABLE]

We shall show that

[TABLE]

By (B.3),

[TABLE]

and therefore for $y\geq 0$ ,

[TABLE]

Hence $f_{t}(y+t\Delta)\leq e^{-tb_{1}}f_{t}(y)$. It follows that for $z\geq 0$,

[TABLE]

and (C3) indeed holds.

We shall now show (B.3) by checking that, after substituting (B.2) into (B.3), the coefficient of $x^{j}$ on the left-hand side of (B.3) is no larger than the corresponding coefficient on the right-hand side, for $0\leq j\leq t-1$. More specifically, we check that (with $c_{tt}=0$),

[TABLE]

Indeed by (B.2),

[TABLE]

and the right inequality of (B.4) holds.

Acknowledgment

We would like to thank three referees and an Associate Editor for going over the manuscript carefully and providing useful feedback. The changes made in response to their comments have resulted in a much better paper. Thanks also to Shouri Hu for going over the proofs and performing some of the simulations in Examples 5 and 6.

Bibliography

[1] Agrawal, R. (1995). Sample mean based index policies with $O(\log n)$ regret for the multi-armed bandit problem. Adv. Appl. Probab. 27 1054–1078.
[2] Agrawal, R., Teneketzis, D. and Anantharam, V. (1989). Asymptotically efficient adaptive allocation schemes for controlled Markov chains: Finite parameter space. IEEE Trans. Automat. Control AC-34 1249–1259.
[3] Auer, P., Cesa-Bianchi, N. and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning 47 235–256.
[4] Baransi, A., Maillard, O.A. and Mannor, S. (2014). Sub-sampling for multi-armed bandits. Proceedings of the European Conference on Machine Learning pp.13.
[5] Berry, D. and Fristedt, B. (1985). Bandit problems. Chapman and Hall, London.
[6] Brezzi, M. and Lai, T.L. (2002). Optimal learning and experimentation in bandit problems. J. Econ. Dynamics Cont. 27 87–108.
[7] Burnetas, A. and Katehakis, M. (1996). Optimal adaptive policies for sequential allocation problems. Adv. Appl. Math. 17 122–142.
[8] Burtini, G., Loeppky, J. and Lawrence, R. (2015). A survey of online experiment design with the stochastic multi-armed bandit. arXiv:1510.00757.
