Regret Minimisation in Multi-Armed Bandits Using Bounded Arm Memory

Arghya Roy Chaudhuri; Shivaram Kalyanakrishnan

arXiv:1901.08387·cs.LG·January 25, 2019

TL;DR

This paper introduces a simple, efficient regret minimization algorithm for multi-armed bandits that operates with a constant amount of memory, suitable for both finite and infinite arm settings, and demonstrates its effectiveness through theoretical bounds and experiments.

Contribution

The paper presents a novel constant-memory algorithm for regret minimization in multi-armed bandits, applicable to finite and infinite cases, improving over prior methods that require extensive memory or restrictive assumptions.

Findings

1. Achieves a regret bound of $O(KM + K^{1.5}\sqrt{T\log(T/MK)}/M)$ for finite bandits.
2. Extends to sub-linear quantile-regret.
3. Empirically demonstrates efficiency through experiments.

Abstract

In this paper, we propose a constant word (RAM model) algorithm for regret minimisation for both finite and infinite Stochastic Multi-Armed Bandit (MAB) instances. Most of the existing regret minimisation algorithms need to remember the statistics of all the arms they encounter. This may become a problem for the cases where the number of available words of memory is limited. Designing an efficient regret minimisation algorithm that uses a constant number of words has long been interesting to the community. Some early attempts consider the number of arms to be infinite, and require the reward distribution of the arms to belong to some particular family. Recently, for finitely many-armed bandits an explore-then-commit based algorithm (Liau et al., 2018) seems to escape such assumption. However, due to the underlying PAC-based elimination their method incurs a high regret. We present a…

Tables

Table 1: Cumulative regret ($/10^{5}$) of QUCB-M, QTS-M, QMoss-M (with $\alpha=0.347$) and the strategies proposed by Herschkorn et al. (1996) and Berry et al. (1997) after $10^{6}$ pulls, on instances $I_{1}$, $I_{2}$, $I_{3}$ and $I_{4}$. Each result is the average of 20 runs, showing one standard error.

| Algorithms | $M$ | $I_{1}$: $\beta(0.5, 2)$, $\mu^{*}=1$ | $I_{2}$: $\beta(1, 1)$, $\mu^{*}=1$ | $I_{3}$: $\beta(0.5, 2)$, $\mu^{*}=0.6$ | $I_{4}$: $\beta(1, 1)$, $\mu^{*}=0.6$ |
| --- | --- | --- | --- | --- | --- |
| Non-stationary Policy (Herschkorn et al., 1996) | 1 | 3.58 $\pm$ 0.4 | 1.11 $\pm$ 0.2 | 1.64 $\pm$ 0.2 | 0.79 $\pm$ 0.1 |
| $\sqrt{T}$-run (Berry et al., 1997) | 2 | 6.18 $\pm$ 0.5 | 1.11 $\pm$ 0.4 | 4.18 $\pm$ 0.3 | 2.03 $\pm$ 0.3 |
| $\sqrt{T}\ln T$-learning (Berry et al., 1997) | 2 | 6.32 $\pm$ 0.4 | 0.69 $\pm$ 0.3 | 4.38 $\pm$ 0.2 | 2.15 $\pm$ 0.3 |
| Non-recalling $\sqrt{T}$-run (Berry et al., 1997) | 1 | 5.35 $\pm$ 0.5 | 0.03 $\pm$ 0.004 | 4.56 $\pm$ 0.001 | 2.55 $\pm$ 0.001 |
| QUCB-M | 2 | 1.84 $\pm$ 0.17 | 0.41 $\pm$ 0.02 | 1.29 $\pm$ 0.10 | 0.49 $\pm$ 0.02 |
| QUCB-M | 10 | 1.98 $\pm$ 0.16 | 0.59 $\pm$ 0.02 | 1.49 $\pm$ 0.09 | 0.63 $\pm$ 0.01 |
| QUCB-M ($\eta=0.2$) | 2 | 2.00 $\pm$ 0.20 | 0.32 $\pm$ 0.05 | 1.41 $\pm$ 0.10 | 0.69 $\pm$ 0.04 |
| QUCB-M ($\eta=0.2$) | 10 | 1.71 $\pm$ 0.16 | (value missing) $\pm$ 0.02 | 1.16 $\pm$ 0.09 | 0.30 $\pm$ 0.02 |
| QTS-M | 2 | 1.77 $\pm$ 0.17 | 0.32 $\pm$ 0.04 | 1.23 $\pm$ 0.09 | 0.40 $\pm$ 0.02 |
| QTS-M | 10 | 1.91 $\pm$ 0.16 | 0.18 $\pm$ 0.03 | 1.14 $\pm$ 0.10 | 0.30 $\pm$ 0.02 |
| QMoss-M | 2 | 1.74 $\pm$ 0.17 | 0.31 $\pm$ 0.02 | 1.20 $\pm$ 0.10 | 0.39 $\pm$ 0.02 |
| QMoss-M | 10 | 1.69 $\pm$ 0.15 | 0.25 $\pm$ 0.02 | 1.13 $\pm$ 0.09 | 0.30 $\pm$ 0.010 |

Table 2: Cumulative regret ($/10^{5}$) of QUCB-M, QTS-M, QMoss-M and the strategies proposed by Herschkorn et al. (1996) and Berry et al. (1997) after $10^{6}$ pulls, on instances $I_{1}$, $I_{2}$, $I_{3}$ and $I_{4}$. Each result is the average of 20 runs, showing one standard error.

| Algorithms | $M$ | $I_{1}$: $\beta(0.5, 2)$, $\mu^{*}=1$ | $I_{2}$: $\beta(1, 1)$, $\mu^{*}=1$ | $I_{3}$: $\beta(0.5, 2)$, $\mu^{*}=0.6$ | $I_{4}$: $\beta(1, 1)$, $\mu^{*}=0.6$ |
| --- | --- | --- | --- | --- | --- |
| Non-stationary Policy (Herschkorn et al., 1996) | 1 | 3.58 $\pm$ 0.4 | 1.11 $\pm$ 0.2 | 1.64 $\pm$ 0.2 | 0.79 $\pm$ 0.1 |
| $\sqrt{T}$-run (Berry et al., 1997) | 2 | 6.18 $\pm$ 0.5 | 1.11 $\pm$ 0.4 | 4.18 $\pm$ 0.3 | 2.03 $\pm$ 0.3 |
| $\sqrt{T}\ln T$-learning (Berry et al., 1997) | 2 | 6.32 $\pm$ 0.4 | 0.69 $\pm$ 0.3 | 4.38 $\pm$ 0.2 | 2.15 $\pm$ 0.3 |
| Non-recalling $\sqrt{T}$-run (Berry et al., 1997) | 1 | 5.35 $\pm$ 0.5 | 0.03 $\pm$ 0.004 | 4.56 $\pm$ 0.001 | 2.55 $\pm$ 0.001 |
| QUCB-M | 2 | 3.69 $\pm$ 0.34 | 0.74 $\pm$ 0.11 | 2.27 $\pm$ 0.21 | 0.51 $\pm$ 0.07 |
| QUCB-M | 10 | 4.26 $\pm$ 0.37 | 0.91 $\pm$ 0.19 | 2.65 $\pm$ 0.22 | 0.63 $\pm$ 0.11 |
| QUCB-M ($\eta=0.2$) | 2 | 3.67 $\pm$ 0.35 | 0.72 $\pm$ 0.12 | 2.21 $\pm$ 0.21 | 0.55 $\pm$ 0.08 |
| QUCB-M ($\eta=0.2$) | 10 | 4.15 $\pm$ 0.36 | 0.79 $\pm$ 0.19 | 2.51 $\pm$ 0.22 | 0.54 $\pm$ 0.11 |
| QTS-M | 2 | 3.14 $\pm$ 0.39 | 0.62 $\pm$ 0.07 | 1.97 $\pm$ 0.19 | 0.44 $\pm$ 0.07 |
| QTS-M | 10 | 3.88 $\pm$ 0.35 | 0.67 $\pm$ 0.13 | 2.49 $\pm$ 0.23 | 0.45 $\pm$ 0.06 |
| QMoss-M | 2 | 3.64 $\pm$ 0.34 | 0.70 $\pm$ 0.11 | 2.21 $\pm$ 0.21 | 0.46 $\pm$ 0.07 |
| QMoss-M | 10 | 4.16 $\pm$ 0.36 | 0.80 $\pm$ 0.19 | 2.53 $\pm$ 0.22 | 0.52 $\pm$ 0.11 |

Equations

$$R_{T}^{*}\stackrel{\text{def}}{=}T\mu^{*}-\sum_{t=1}^{T}\mathbb{E}[\mu_{a_{t}}],$$

$$\mu_{\rho}=\inf\left\{x\in[0,1]:\Pr_{a\sim P_{\mathcal{A}}}\{\mu_{a}\leq x\}\geq 1-\rho\right\}.$$

$$R_{T}(\rho)\stackrel{\text{def}}{=}T\mu_{\rho}-\sum_{t=1}^{T}\mathbb{E}[\mu_{a_{t}}],$$

$$\Pr\{a_{t}=a^{*}\}=\mathbb{E}_{X_{t}}\left[\Pr\{a_{t}=a^{*}\mid a^{*}\in X_{t}\}\Pr\{a^{*}\in X_{t}\}\right].$$

$$\operatorname*{Lt}_{T\uparrow\infty}\frac{\sum_{t=1}^{T}\mathbb{E}_{X_{t}}\left[\Pr\{a_{t}=a^{*}\mid a^{*}\in X_{t}\}\Pr\{a^{*}\in X_{t}\}\right]}{T}=1,$$

$$\mathbb{E}[r^{*}]\stackrel{\text{def}}{=}\mu^{*}-\mathbb{E}[\mu_{b_{t}}].$$

$$\max_{1\leq j\leq h_{0}}\mathbb{E}[\mu^{*}-\mu_{*}^{y,j}]\leq 2h_{0}\mathbb{E}[r^{y}].$$

$$\begin{aligned}
R_{w,j}^{*}&=b_{w}\mu^{*}-\sum_{t=1}^{b_{w}}\mathbb{E}[\mu_{a_{t}}]\\
&=b_{w}\mathbb{E}[\mu^{*}-\mu_{*}^{w,j}]+\sum_{t=1}^{b_{w}}\left(\mathbb{E}[\mu_{*}^{w,j}]-\mathbb{E}[\mu_{a_{t}}]\right).
\end{aligned}$$

$$R_{T}^{*}=\sum_{w=1}^{x_{0}}\sum_{j=1}^{h_{0}}R_{w,j}^{*}=\sum_{w=1}^{x_{0}}\sum_{j=1}^{h_{0}}\left(R_{w,j}^{(1)}+R_{w,j}^{(2)}\right).$$

$$R_{T}(\rho)\in o\left(\left(\frac{1}{\rho}\log\frac{1}{\rho}\right)^{4.89}+MT^{0.205}+T^{0.81}\sqrt{\frac{\log M}{M^{2}}\log\frac{T}{M}}\right)$$

$$\sum_{r=1}^{\log T}t_{r}\Pr\{E_{r}(\rho)\}\in O\left(\left(\frac{1}{\rho}\log\frac{1}{\rho}\right)^{\frac{1}{\alpha}}+T^{1-\frac{\alpha\log e}{1+\gamma}}\right)$$

$$\begin{aligned}
&\sum_{r=r^{*}}^{\log T}C\left(n_{r}M+\left(n_{r}^{3/2}/M\right)\sqrt{t_{r}\log t_{r}}\right),\\
&\leq C^{\prime}\left(MT^{\alpha}+\left(\sqrt{\log M}/M\right)\sqrt{T^{1+3\alpha}\log(T/M)}\right),
\end{aligned}$$

$$\begin{aligned}
T&=\sum_{w=1}^{x}\sum_{j=1}^{h_{0}}b_{w}=h_{0}b_{1}\sum_{w=1}^{x}2^{w-1}=h_{0}b_{1}(2^{x}-1),\\
\implies x&=\log\left(\frac{T}{h_{0}b_{1}}+1\right),\\
&\leq\log\left(\frac{T}{\left\lceil\frac{K-1}{M-1}\right\rceil b_{1}}+1\right)\quad\left[\text{because } h_{0}=\left\lceil\frac{K-1}{M-1}\right\rceil\right],\\
&\leq\log\left(\frac{T}{\left\lceil\frac{K-1}{M-1}\right\rceil M(M+2)}+1\right)\quad\left[\text{because } b_{1}=M(M+2)\right],\\
&\leq\log\left(\frac{T}{\frac{K-1}{M-1}M(M+2)}+1\right)<\log\left(\frac{T}{KM}+1\right),\\
&\leq\log\left(\frac{2T}{KM}\right)\quad\left[\text{because } T>2MK\right],\\
\implies x&\leq\left\lceil\log\frac{2T}{MK}\right\rceil=x_{0}.
\end{aligned}$$

$$\begin{aligned}
\mathbb{E}[r_{*}^{y,i}]&\leq\max_{2\leq j_{0}\leq h_{0}}\;\max_{1\leq i\leq j_{0}-1}\mathbb{E}[r_{*}^{y-1,h_{0}}]+i\cdot\mathbb{E}[r^{y}],\\
&\leq\max_{2\leq j_{0}\leq h_{0}}\;\max_{1\leq i\leq j_{0}-1}(h_{0}-j_{0}+1)\mathbb{E}[r^{y-1}]+i\cdot\mathbb{E}[r^{y}],\\
&\leq(h_{0}-1)\mathbb{E}[r^{y-1}]+\mathbb{E}[r^{y}]<2h_{0}\mathbb{E}[r^{y}]\quad\left[\text{because } \mathbb{E}[r^{y}]<\mathbb{E}[r^{y-1}]<2\mathbb{E}[r^{y}]\right].
\end{aligned}$$

$$\begin{aligned}
\sum_{w=1}^{x_{0}}\sum_{j=1}^{h_{0}}R_{w,j}^{(1)}&=\sum_{j=1}^{h_{0}}R_{1,j}^{(1)}+\sum_{w=2}^{x_{0}}\sum_{j=1}^{h_{0}}R_{w,j}^{(1)}\\
&\leq h_{0}b_{1}+\sum_{w=2}^{x_{0}}\sum_{j=1}^{h_{0}}R_{w,j}^{(1)}\\
&\leq h_{0}b_{1}+\sum_{w=2}^{x_{0}}\sum_{j=1}^{h_{0}}b_{w}\mathbb{E}[\mu^{*}-\mu_{*}^{w,j}]\\
&\leq h_{0}b_{1}+\sum_{w=2}^{x_{0}}\sum_{j=1}^{h_{0}}b_{w}\left(2h_{0}\mathbb{E}[r^{w}]\right)\quad\left[\text{using Lemma \ref{lem:ubsimpreg}}\right]\\
&\leq h_{0}b_{1}+2C_{1}h_{0}^{2}\sum_{w=2}^{x_{0}}b_{w}\sqrt{\frac{M\log b_{w}}{b_{w}}}\quad\left[\text{using Corollary \ref{cor:succsimpreg}}\right]\\
&\leq h_{0}b_{1}+2C_{1}h_{0}^{2}\sum_{w=2}^{x_{0}}\sqrt{b_{w}M\log b_{w}}\\
&\leq h_{0}b_{1}+2C_{2}h_{0}^{2}\sqrt{Mb_{1}}\sum_{w=2}^{x_{0}}\left(2^{w-1}\log(2^{w-1}b_{1})\right)^{\frac{1}{2}}\quad\left[\text{because } b_{w}=2^{w-1}b_{1}\right]\\
&=h_{0}b_{1}+2C_{2}h_{0}^{2}\sqrt{Mb_{1}}\sum_{w=2}^{x_{0}}\left((w-1+\log b_{1})2^{w-1}\right)^{\frac{1}{2}}\\
&\leq h_{0}b_{1}+C_{3}h_{0}^{2}\sqrt{Mb_{1}}\sum_{w=2}^{x_{0}}\left((w-1)2^{w-1}\right)^{\frac{1}{2}}\\
&\qquad\left[\text{because } T>KM^{2}(M+2) \text{ and } b_{1}=M(M+2), \text{ therefore } x_{0}-1\geq\log b_{1}=\log M(M+2)\right]\\
&\leq h_{0}b_{1}+C_{4}h_{0}^{2}\sqrt{Mb_{1}}\left(x_{0}\cdot 2^{x_{0}}\right)^{\frac{1}{2}}\\
&\leq C_{5}\left(\left\lceil\frac{K-1}{M-1}\right\rceil M(M+2)+\left\lceil\frac{K-1}{M-1}\right\rceil^{2}\sqrt{M^{2}(M+2)}\left(\frac{T}{MK}\log\frac{T}{MK}\right)^{\frac{1}{2}}\right)\quad\left[\text{substituting for } b_{1}, h_{0} \text{ and } x_{0}\right]\\
&\leq C_{6}\left(\frac{K}{M}M^{2}+\left(\frac{K}{M}\right)^{2}\sqrt{M^{3}}\left(\frac{T}{MK}\log\frac{T}{MK}\right)^{\frac{1}{2}}\right)\\
&\leq C_{7}\left(KM+\frac{K^{3/2}}{M}\sqrt{T\log\frac{T}{MK}}\right)
\end{aligned}$$

$$\leq C_{8}\left(KM+\frac{K^{3/2}}{M}\sqrt{T\log\frac{T}{MK}}\right),$$

Full text

Abstract

In this paper, we propose a constant word (RAM model) algorithm for regret minimisation for both finite and infinite Stochastic Multi-Armed Bandit (MAB) instances. Most of the existing regret minimisation algorithms need to remember the statistics of all the arms they encounter. This may become a problem for the cases where the number of available words of memory is limited.

Designing an efficient regret minimisation algorithm that uses a constant number of words has long been interesting to the community. Some early attempts consider the number of arms to be infinite, and require the reward distribution of the arms to belong to some particular family. Recently, for finitely many-armed bandits, an explore-then-commit based algorithm (Liau et al., 2018) seems to escape such assumptions. However, due to the underlying PAC-based elimination, their method incurs a high regret. We present a conceptually simple and efficient algorithm that needs to remember statistics of at most $M$ arms, and for any $K$-armed finite bandit instance it enjoys an $O(KM+K^{1.5}\sqrt{T\log(T/MK)}/M)$ upper bound on regret. We extend it to achieve sub-linear quantile-regret (Roy Chaudhuri & Kalyanakrishnan, 2018) and empirically verify the efficiency of our algorithm via experiments.

1 Introduction

In this paper, we investigate the problem of regret minimisation in Multi-Armed Bandit (MAB) (Berry & Fristedt, 1985) using a bounded number of words. Each arm in a bandit instance represents a slot-machine with a fixed (but unknown) real-valued reward distribution associated with it. At each time step, the experimenter is supposed to select and pull an arm, and observe the reward. The goal of the experimenter is to maximise the expected total reward for a finite time horizon, thereby minimising the expected regret measured with respect to the mean of the optimal arm.

A range of real-world applications, such as drug testing (Armitage, 1960; Colton, 1963) and crowd-sourcing (Tran-Thanh et al., 2014), can be modelled using multi-armed bandits in which the number of arms is high. In such cases, due to budgetary constraints or other practical considerations, it might be viable to experiment with only a small number of arms instead of the whole pool. The problem is of particular interest because the experimenter gets to store the statistics of only a small, fixed number of arms. This adds another layer of exploration-exploitation dilemma to the task of regret minimisation. This particular set-up has drawn attention for a long time (Cover, 1968); however, only a few investigations have been made in this direction.

In this paper, we present a regret minimisation algorithm that uses a bounded number of words, for both finite and infinite-armed bandits. Unlike the existing algorithms, our algorithm does not need any special assumption on the reward distributions of the arms, and it bypasses explicit PAC-based exploration for the sake of efficiency. Below, we formalise our problem and then list our specific contributions.

Background and Problem Setup.

A bandit instance $\mathcal{B}=(\mathcal{A},\mathcal{D})$ consists of a set of arms $\mathcal{A}$, and a set of sub-Gaussian cumulative distribution functions (CDFs) $\mathcal{D}$. Each arm $a\in\mathcal{A}$, when pulled, generates an i.i.d. reward from the corresponding CDF $D_{a}\in\mathcal{D}$, defined over $[0,1]$. The expected reward of arm $a\in\mathcal{A}$ is given by $\mu_{a}\stackrel{\text{def}}{=}\mathbb{E}_{r\sim D_{a}}[r]$. We also assume that the experimenter has no information regarding $\mathcal{D}$. The only way for her to gather knowledge about $\mathcal{D}$ is via the rewards generated by sampling the arms. We define the history as $H_{t}\stackrel{\text{def}}{=}\{(a_{i},r_{i})\}_{i=1}^{t}$, where $r_{i}\in[0,1]$ is the reward produced at the $i$-th step by pulling the arm $a_{i}\in\mathcal{A}$.

Cumulative Regret Minimisation.

Assuming $\mu^{*}\stackrel{\text{def}}{=}\min\{y\in[0,1]:\forall a\in\mathcal{A},\mu_{a}\leq y\}$, and given a horizon of $T$ pulls, the conventional cumulative regret incurred by an algorithm is defined as

$$\mathbb{R}_{T}^{*}\stackrel{\text{def}}{=}T\mu^{*}-\sum_{t=1}^{T}\mathbb{E}[\mu_{a_{t}}],$$

wherein $a_{t}$ is the arm pulled by the algorithm at time $t$ . The expectation is taken over random rewards and the possible randomisation introduced by the algorithm.
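As a small illustration of this definition, the following sketch evaluates the regret of one fixed, already realised sequence of pulls on a finite instance whose arm means are known. The function name `cumulative_regret` and the example means are hypothetical, and the outer expectation over the algorithm's randomness is not modelled here; it would be approximated by averaging over many runs.

```python
from typing import Sequence


def cumulative_regret(means: Sequence[float], pulls: Sequence[int]) -> float:
    """Regret T * mu_star - sum_t mu_{a_t} for one fixed sequence of pulled arms."""
    mu_star = max(means)  # on a finite instance the best mean is attained by some arm
    return len(pulls) * mu_star - sum(means[a] for a in pulls)


# Hypothetical three-armed instance and a sequence of T = 5 pulls.
means = [0.2, 0.5, 0.9]
pulls = [0, 1, 2, 2, 2]
regret = cumulative_regret(means, pulls)
print(regret)
```

Only the two pulls of sub-optimal arms contribute to the total; every pull of the best arm adds zero.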

We briefly restate the definition of “quantile regret” introduced by Roy Chaudhuri & Kalyanakrishnan (2018), based on their previous contribution in a pure exploration setting (Roy Chaudhuri & Kalyanakrishnan, 2017). A problem instance $\mathcal{I}=(\mathcal{B},P_{\mathcal{A}})$ consists of a bandit instance $\mathcal{B}$, and a sampling distribution $P_{\mathcal{A}}$ for choosing arms from $\mathcal{A}$. Letting $\rho\in[0,1]$, the $(1-\rho)$-th quantile of $P_{\mathcal{A}}$ is defined as

$$\mu_{\rho}=\inf\left\{x\in[0,1]:\Pr_{a\sim P_{\mathcal{A}}}\{\mu_{a}\leq x\}\geq 1-\rho\right\}.$$

Then, for a given horizon of $T$ pulls, the quantile regret with respect to $\mu_{\rho}$ is defined as

$$\mathbb{R}_{T}(\rho)\stackrel{\text{def}}{=}T\mu_{\rho}-\sum_{t=1}^{T}\mathbb{E}[\mu_{a_{t}}],$$

wherein, $a_{t}$ and $\operatorname*{\mathop{\mathbb{E}}}[\cdot]$ bear the same interpretation.
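To make the quantile concrete, here is a minimal sketch that computes $\mu_{\rho}$ and the quantile regret when $P_{\mathcal{A}}$ is a discrete distribution over finitely many arms. That discreteness is an assumption made purely for illustration (the paper allows infinitely many arms), and the names `quantile_mean` and `quantile_regret` are hypothetical.

```python
from typing import Sequence


def quantile_mean(means: Sequence[float], probs: Sequence[float], rho: float) -> float:
    """Smallest x in [0, 1] with Pr_{a ~ P}(mu_a <= x) >= 1 - rho, for a discrete P."""
    target = 1.0 - rho
    if target <= 0.0:
        return 0.0  # every x in [0, 1] qualifies, so the infimum is 0
    cumulative = 0.0
    for mu, p in sorted(zip(means, probs)):
        cumulative += p
        if cumulative >= target - 1e-12:  # small tolerance for float round-off
            return mu
    return max(means)


def quantile_regret(means, probs, rho, pulls) -> float:
    """T * mu_rho - sum_t mu_{a_t} for one fixed sequence of pulled arms."""
    mu_rho = quantile_mean(means, probs, rho)
    return len(pulls) * mu_rho - sum(means[a] for a in pulls)


# Hypothetical instance: four arms sampled uniformly.
means = [0.1, 0.4, 0.7, 0.9]
probs = [0.25, 0.25, 0.25, 0.25]
mu_quarter = quantile_mean(means, probs, rho=0.25)
regret = quantile_regret(means, probs, 0.25, pulls=[2, 3, 3, 3])
print(mu_quarter, regret)
```

In this example the algorithm mostly pulls an arm whose mean exceeds $\mu_{\rho}$, so the quantile regret is negative. That can happen whenever the reference mean $\mu_{\rho}$ lies below the best mean $\mu^{*}$.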

RAM Model.

Given any bandit instance $\mathcal{B}=(\mathcal{A},\mathcal{D})$, as we do not assume any special structure in $\mathcal{A}$ or $\mathcal{D}$, restricting an algorithm to a bounded number of words of space either restricts the horizon of pulls, or restricts the algorithm to storing the statistics of only a bounded number of arms simultaneously. In this paper, we consider the latter and assume $M$ to be that number. We adopt the word RAM model (Aho et al., 1974; Cormen et al., 2009), which considers a word as the unit of space. Under this model, each of the input values and variables can be stored in $O(1)$ words. For finite bandit instances ($|\mathcal{A}|<\infty$), we consider a word to consist of $O(\log T)$ bits. Therefore, our algorithm has a space complexity of $O(M\log T+\log|\mathcal{A}|)$ bits. For infinite bandit instances ($|\mathcal{A}|=\infty$), for $\rho\in[0,1]$, if the experimenter needs to analyse the performance with respect to $\mu_{\rho}$ (the $(1-\rho)$-th quantile), she must allow the algorithm to use $O(M\log T+\log(1/\rho))$ bits.

We call the set of arm indices whose statistics are stored the arm memory, and its cardinality the arm memory size. Hence, an algorithm with arm memory size $M$ can store the statistics of at most $M$ arms. Also, an algorithm is allowed to pull an arm only if it is stored in the memory. Hence, before pulling a new arm (one that is not currently in the arm memory), the algorithm must replace an arm in its arm memory with this new arm. It is interesting to note that an algorithm that works with $M=1$ can only keep the statistics of the arm it is currently pulling. Therefore, switching to a new arm costs such an algorithm all the experience gained by sampling the previous arm. However, for a finite bandit instance, as algorithms are allowed to remember all the arm indices, such an algorithm can retain some of the gained experience by storing a bounded number of arm indices for possible special treatment later. The scenario is very different for infinite bandit instances, where an algorithm can pull a new arm only if it is chosen by the given sampling distribution $P_{\mathcal{A}}$. In such a scenario, once an algorithm discards an arm from the arm memory, it can encounter that arm again only if it is sampled again by $P_{\mathcal{A}}$ in the future. Hence, the algorithm cannot recall a discarded arm. In the existing literature on infinite bandit instances (Herschkorn et al., 1996; Berry et al., 1997), such algorithms are termed non-recalling algorithms. However, for $M>1$ (but bounded above), an algorithm has the freedom to keep a previously encountered good arm in the memory, irrespective of whether the bandit instance is finite. Our findings show that this is more beneficial than the non-recalling algorithms for infinite bandit instances.
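The arm-memory constraint can be pictured as a small container that holds statistics for at most $M$ arms and rejects any pull of an arm outside it. The sketch below is one possible way to express that constraint; the class `ArmMemory` and its method names are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional


class ArmMemory:
    """Statistics for at most `capacity` arms; pulls outside the memory are rejected."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.stats: Dict[int, List[float]] = {}  # arm -> [pull count, reward sum]

    def admit(self, arm: int, evict: Optional[int] = None) -> None:
        """Bring `arm` into memory; if the memory is full, discard `evict` first."""
        if arm in self.stats:
            return
        if len(self.stats) >= self.capacity:
            if evict not in self.stats:
                raise ValueError("memory is full: name an in-memory arm to evict")
            del self.stats[evict]  # all experience with the evicted arm is lost
        self.stats[arm] = [0, 0.0]

    def record(self, arm: int, reward: float) -> None:
        """Record the reward of a pull; only in-memory arms may be pulled."""
        if arm not in self.stats:
            raise KeyError(f"arm {arm} is not in the arm memory")
        self.stats[arm][0] += 1
        self.stats[arm][1] += reward


memory = ArmMemory(capacity=2)
memory.admit(0)
memory.admit(1)
memory.record(0, 1.0)
memory.admit(7, evict=0)   # arm 0 and its statistics are discarded
memory.admit(0, evict=1)   # arm 0 returns, but with fresh statistics
print(sorted(memory.stats), memory.stats[0])
```

With `capacity=1` this container reproduces the $M=1$ situation described above, in which every switch to a new arm erases what was learned about the previous one.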

Problem Definition.

Given a positive integer $M$ , below, we define the problem of conventional regret minimisation (CR-M) and extend the definition to quantile regret minimisation (QR-M).

CR-M.

An algorithm $\mathcal{L}$ is said to solve CR-M if it takes $\mathcal{A}$ and $M$ as input and, for a sufficiently large budget $T$ (not necessarily known beforehand), achieves $\operatorname{\mathop{\mathbb{R}}}_{T}^{*}\in o(T)$ using an arm memory size of at most $M$. It is assumed that for a finite bandit instance with $|\mathcal{A}|=K<\infty$, the algorithm is allowed to store $O(M\log(T)+\log K)$ bits of information.

QR-M.

Suppose we are given a problem instance $\mathcal{I}=(\mathcal{B},P_{\mathcal{A}})$ and a positive integer $M$. Let $\rho_{0}\in(0,1]$. An algorithm $\mathcal{L}$ is said to solve QR-M if it takes $\mathcal{A}$, $P_{\mathcal{A}}$, and $M$ as input and, for every $\rho\in(\rho_{0},1]$, given a sufficiently large budget $T$, achieves $\operatorname{\mathop{\mathbb{R}}}_{T}(\rho)\in o(T)$ using an arm memory size of at most $M$. It is assumed that the algorithm is allowed to store $O(M\log(T)+\log(1/\rho_{0}))$ bits of information.

In this paper we present algorithms that solve CR-M on finite bandit instances, and QR-M, with $M\geq 2$. We summarise our contributions below.

Contributions.

We present algorithms for the minimisation of conventional regret and quantile regret, using a bounded number of words of space. Our specific contributions are as follows.

1. In Section 4.1 we present an algorithm UCB-M, which solves CR-M and achieves $\operatorname{\mathop{\mathbb{R}}}_{T}^{*}\in O\left(KM+\frac{K^{3/2}}{M}\sqrt{T\log(T/MK)}\right)$ over an unknown but finite horizon of $T$ pulls. The existing upper bound on regret is due to Liau et al. (2018), and it involves problem-specific quantities. Hence, ours is the first problem-independent finite-time regret upper bound. Also, in Section 4.2 we empirically compare UCB-M and its variations with the existing algorithms that solve CR-M.

2. In Section 5.1 we present a meta-algorithm QUCB-M, which uses UCB-M as a subroutine to the algorithm QRM2 (Roy Chaudhuri & Kalyanakrishnan, 2018) to solve QR-M, and achieves $\operatorname{\mathop{\mathbb{R}}}_{T}(\rho)\in o\left(\left(\frac{1}{\rho}\log\frac{1}{\rho}\right)^{4.9}+MT^{0.205}+T^{0.81}\sqrt{\frac{\log M}{M^{2}}\log\frac{T}{M}}\right)$. In Section 5.2 we experimentally demonstrate that QUCB-M (in terms of conventional regret $\operatorname{\mathop{\mathbb{R}}}_{T}^{*}$) is more efficient than the algorithms by Herschkorn et al. (1996) and Berry et al. (1997) on problem instances with Bernoulli arms.

We briefly review the existing literature, before we present the key intuitions in Section 3.

2 Related Work

Starting with Robbins (1952), the predominant body of literature on stochastic multi-armed bandits is dedicated to the regret minimisation task on finite and infinite bandit instances. Later, a number of salient algorithms such as UCB1 (Auer et al., 2002), Thompson Sampling (Chapelle & Li, 2011; Agrawal & Goyal, 2012), and Moss (Audibert & Bubeck, 2009) have been shown to achieve order-optimal cumulative regret on finite instances. When the number of arms is infinite, algorithms make special assumptions on the reward function (Agrawal, 1995; Kleinberg, 2005) or on the sampling distribution (Wang et al., 2008) to guarantee sub-linear regret. Despite thorough study of the finite and the infinite instances, the number of investigations into memory-frugal algorithms is limited.

Finite memory hypothesis testing has long drawn the attention of researchers (Robbins, 1956; Isbell, 1959; Cover et al., 1976). In the MAB setting, Cover (1968) first presented a finite memory algorithm for two-armed Bernoulli instances that achieves an average reward which converges to the optimal proportion in the limit, with probability 1. His approach consists of a collection of interleaved test and trial blocks, where each test block is divided into several sub-blocks and the switching among these sub-blocks is governed by a finite state machine. However, he considered only two-armed Bernoulli instances, and the approach guarantees only an asymptotic convergence of the empirical average reward. Hence, this setup is of limited relevance here, as our objective is to present a finite-time analysis of regret for general bandit instances.

Herschkorn et al. (1996) presented the first non-recalling algorithm for infinite bandit instances with Bernoulli arms, which maximises the almost sure average reward over an infinite horizon. Berry et al. (1997) improved over them for problem instances where the sampling distribution $P_{\mathcal{A}}$ is uniform over the set of expected rewards of the Bernoulli arms. Towards relaxing the assumption of Bernoulli reward distributions, Peköz (2003) showed a peculiarity that may arise if the reward distributions of the arms are not stochastically ordered. Specifically, for some function $f:\operatorname{\mathop{\mathbb{R}}}^{+}\mapsto\operatorname{\mathop{\mathbb{R}}}^{+}$, he proposed two policies, PolicyA and PolicyB, parameterised by $f(\cdot)$, where the latter is a non-stationary version of the former. Then he showed that for some choice of $f(\cdot)$, there exist instances with bounded positive rewards on which, in the limit, for exactly one of PolicyA and PolicyB the average reward converges to the supremum mean reward, while for the other it converges to the infimum mean reward. Most recently, Liau et al. (2018) presented UCBConstantSpace, an algorithm based on an explore-then-commit strategy that incurs sub-linear finite-time regret on any finite bandit instance. However, their algorithm explicitly uses a PAC-based arm elimination strategy that leads to a high regret. Moreover, like the previous algorithms, it has no provision for taking advantage of a larger arm memory when one is available. Next, we describe the key intuitions behind our approach.

3 Key Intuitions

One of our objectives is to solve CR-M for finite bandit instances ($|\mathcal{A}|<\infty$). The problem is interesting for $M<|\mathcal{A}|$; otherwise, one can solve the problem by using any existing regret minimisation algorithm such as UCB1 (Auer et al., 2002). Intuitively, any algorithm that solves CR-M for finite bandit instances must ensure that the probability of pulling the optimal arm increases, by progressively increasing at least one of two probabilities: first, the probability that the optimal arm $a^{*}$ is in the arm memory; second, given that $a^{*}$ is in the arm memory, the probability that it is pulled more often than the other arms in the arm memory. For any algorithm that achieves sub-linear regret we can write $\operatorname{\mathop{\mathbb{R}}}_{T}^{*}\in o(T)\implies\operatorname*{Lt}_{T\uparrow\infty}\frac{\operatorname{\mathop{\mathbb{R}}}_{T}^{*}}{T}=0$, and hence, for a unique optimal arm $a^{*}$, $\operatorname*{Lt}_{T\uparrow\infty}\frac{1}{T}\sum_{t=1}^{T}\Pr\{a_{t}=a^{*}\}=1$. Now, imposing the arm memory constraint, and letting $X_{t}$ be the arm memory at the $t$-th pull, we notice that $\{a_{t}=a^{*}\}\implies\{a^{*}\in X_{t}\}$. Therefore,

$$\Pr\{a_{t}=a^{*}\}=\mathbb{E}_{X_{t}}\left[\Pr\{a_{t}=a^{*}\mid a^{*}\in X_{t}\}\Pr\{a^{*}\in X_{t}\}\right].$$

Hence, a necessary and sufficient condition for an algorithm to asymptotically solve CR-M is

$$\operatorname*{Lt}_{T\uparrow\infty}\frac{\sum_{t=1}^{T}\mathbb{E}_{X_{t}}\left[\Pr\{a_{t}=a^{*}\mid a^{*}\in X_{t}\}\Pr\{a^{*}\in X_{t}\}\right]}{T}=1,$$

for $|X_{t}|\leq M$ , where $1\leq t\leq T$ .

Given a bandit instance, the algorithm of Liau et al. (2018) first solves a pure exploration problem for a horizon $\bar{T}$ (a function of the mean rewards of the arms) to maximise the quantity $\Pr\{a^{*}\in X_{t}\}$ in the R.H.S. of Equation (4). Once the number of pulls crosses $\bar{T}$, it chooses the arm with the highest empirical reward in $X_{t}$ as the candidate best arm, and assigns the rest of the horizon to that arm. Therefore, for $t>\bar{T}$, it switches to pure-exploitation mode, thus maximising the quantity $\Pr\{a_{t}=a^{*}|a^{*}\in X_{\bar{T}}\}$. In contrast, we adopt balanced exploration with the aim of simultaneously increasing $\Pr\{a^{*}\in X_{t}\}$ and $\Pr\{a_{t}=a^{*}|a^{*}\in X_{t}\}$. Therefore, our algorithm does not depend on such a $\bar{T}$. For the sake of sufficient exploration, an algorithm should not stick to the same arm memory for too long; however, while selecting new arms (not in the current arm memory), it must judiciously choose which in-memory arms to replace. This trade-off relies on the notion of simple regret, which we introduce next.

Simple Regret.

Whereas the cumulative regret minimisation problem is based on the trade-off between exploration and exploitation, there is a separate line of literature in the pure exploration setting. One of the popular problems in the pure exploration setting is to minimise “simple regret”. If $b_{t}\in\mathcal{A}$ is the arm recommended by the algorithm after the $t$-th pull, then the simple regret of the algorithm at $t$ is defined as

$$\mathbb{E}[r^{*}]\stackrel{\text{def}}{=}\mu^{*}-\mathbb{E}[\mu_{b_{t}}].$$

Relation of Cumulative Regret Minimisation with the Minimisation of Simple Regret.

Bubeck et al. (2009) gave a general definition of a forecaster, depicted in Figure 1. Given a set of arms $\mathcal{A}$ as input, at each step $t$, possibly depending on $H_{t-1}$, it selects an arm $a_{t}$ by using a strategy called the “allocation strategy”. On pulling the arm $a_{t}$ it receives a reward $r_{t}$, and executes a “recommendation strategy” that takes $H_{t}$ as input and outputs an arm $b_{t}$. The forecaster continues to alternately execute the allocation strategy and the recommendation strategy until some stopping condition is met.

A careful look reveals that if a forecaster, at each step $t$, recommends the arm selected by the allocation strategy in the same step (that is, $b_{t}\equiv a_{t}$ in Figure 1), then the cumulative regret (Equation 3) of that forecaster is identical to the sum of its simple regret (Equation 5) over the time steps $t$. This tempts one to intuit that using an allocation strategy which incurs low cumulative regret will help in designing a forecaster that achieves a small simple regret, and vice versa. However, Bubeck et al. (2009) present a negative result on this trade-off. Further, they present an upper bound on $\operatorname*{\mathop{\mathbb{E}}}[r^{*}]$ for a number of forecasters (Bubeck et al., 2009, Table 1), one of which is defined below:
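The identity between cumulative regret and summed simple regret for $b_{t}\equiv a_{t}$ follows in one line from the two definitions. In the worked step below, $\mathbb{E}[r_{t}]$ is shorthand (introduced here only for illustration) for the simple regret after the $t$-th pull.

```latex
\begin{align*}
\sum_{t=1}^{T} \mathbb{E}[r_{t}]
  &= \sum_{t=1}^{T} \left(\mu^{*} - \mathbb{E}[\mu_{b_{t}}]\right)
     && \text{(definition of simple regret at step } t\text{)} \\
  &= \sum_{t=1}^{T} \left(\mu^{*} - \mathbb{E}[\mu_{a_{t}}]\right)
     && \text{(since } b_{t} \equiv a_{t}\text{)} \\
  &= T\mu^{*} - \sum_{t=1}^{T} \mathbb{E}[\mu_{a_{t}}]
     && \text{(the cumulative regret over } T \text{ pulls)}.
\end{align*}
```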

UCB-MPA.

A forecaster that at each step uses UCB1 (Auer et al., 2002) as the allocation strategy, and uses the recommendation strategy that outputs the most played arm (MPA), shall be called UCB-MPA.
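A minimal sketch of such a forecaster is given below. It assumes the textbook UCB1 index (empirical mean plus $\sqrt{2\ln t/n_{a}}$) and breaks ties by list order; the exploration constant analysed by Bubeck et al. (2009) may differ. The function name `ucb_mpa` and the callback `pull` are hypothetical.

```python
import math
import random
from typing import Callable, Dict, Sequence, Tuple


def ucb_mpa(pull: Callable[[int], float], arms: Sequence[int],
            horizon: int) -> Tuple[int, Dict[int, int]]:
    """Allocate pulls with UCB1, then recommend the most played arm (MPA)."""
    counts = {a: 0 for a in arms}
    sums = {a: 0.0 for a in arms}
    for t in range(1, horizon + 1):
        untried = [a for a in arms if counts[a] == 0]
        if untried:
            arm = untried[0]  # play every arm once before using the index
        else:
            arm = max(arms, key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2.0 * math.log(t) / counts[a]))
        reward = pull(arm)  # reward is assumed to lie in [0, 1]
        counts[arm] += 1
        sums[arm] += reward
    recommended = max(arms, key=lambda a: counts[a])
    return recommended, counts


# Hypothetical Bernoulli instance, seeded so that a run is repeatable.
rng = random.Random(0)
means = [0.3, 0.5, 0.8]
recommended, counts = ucb_mpa(lambda a: float(rng.random() < means[a]),
                              arms=[0, 1, 2], horizon=2000)
print(recommended, counts)
```

Recommending the most played arm, rather than the arm with the highest empirical mean, only needs the pull counts that UCB1 already maintains.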

We quote their result (Bubeck et al., 2009, Theorem 3) in Theorem 3.1 which serves as the cornerstone of analysis of the algorithm UCB-M.

Theorem 3.1 (Distribution-free upper bound on Simple-Regret of UCB-MPA by Bubeck et al. (2009)).

Given a $K$ -sized set of arms $\mathcal{A}$ as input, if UCB-MPA runs for a horizon of $T$ pulls such that $T\geq K(K+2)$ , then for some constant $C>0$ , it achieves the expected simple regret $\operatorname*{\mathop{\mathbb{E}}}[r^{*}]\leq C\sqrt{\frac{K\log T}{T}}$ .
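To get a feel for how this bound shrinks with the horizon, the sketch below evaluates $\sqrt{K\log T/T}$, that is, the bound divided by the unspecified constant $C$, for a sequence of doubling horizons. The base of the logarithm (base 2 here) only affects the constant, and the function name is hypothetical.

```python
import math


def simple_regret_bound_over_c(num_arms: int, horizon: int) -> float:
    """Return sqrt(K * log2(T) / T): the Theorem 3.1 bound without the constant C."""
    if horizon < num_arms * (num_arms + 2):
        raise ValueError("Theorem 3.1 requires T >= K(K + 2)")
    return math.sqrt(num_arms * math.log2(horizon) / horizon)


K = 10
horizons = [K * (K + 2) * 2 ** i for i in range(6)]  # 120, 240, ..., 3840
bounds = [simple_regret_bound_over_c(K, T) for T in horizons]
for T, b in zip(horizons, bounds):
    print(T, round(b, 3))
```

Each doubling of the horizon reduces the bound by a factor slightly smaller than $\sqrt{2}$, because the $\log T$ term grows slowly.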

Although UCB1 was originally designed as a cumulative regret minimisation algorithm, empirically it performs reasonably well as an exploration strategy, giving us a good balance between exploration and exploitation. We choose UCB-MPA over other forecasters as it is easy to comprehend and leads to a simpler derivation. For the rest of the paper we shall use $\log$ and $\ln$ to denote the base-2 and natural logarithms, respectively. Also, for any positive integer $Z$ we shall denote the set $\{1,2,3,\cdots,Z\}$ by $[Z]$.

4 Algorithm for Finite Bandit Instances

We present the algorithm UCB-M and establish a problem-independent upper bound on the cumulative regret. Then we empirically compare UCB-M and its variations with the algorithm by Liau et al. (2018). Algorithm 1 is based on UCB-MPA. However, one can replace the underlying call to UCB1 with any other allocation strategy, such as Thompson sampling (Agrawal & Goyal, 2012) or Moss (Audibert & Bubeck, 2009), as we do in our experiments.

4.1 Algorithm and Regret-Analysis.

Algorithm 1 describes UCB-M that solves the problem CR-M for finite bandit instances. We improve upon the contribution of Liau et al. (2018) in three aspects—first, UCB-M is empirically much more efficient even if we allow $M=2$ (as opposed to $M=4$ for theirs) as it does not explicitly use pure exploration based elimination; second, it scales with the arm memory size; third, we present a distribution-free upper bound on the incurred regret of UCB-M for solving CR-M on finite bandit instances.

Given a finite set of arms $\mathcal{A}$ ($|\mathcal{A}|=K<\infty$) and an arm memory size $M$ ($2\leq M<K$), UCB-M proceeds in phases. It breaks each phase into $h_{0}\stackrel{\text{def}}{=}\lceil(K-1)/(M-1)\rceil$ sub-phases. Inside any phase $w$, at each sub-phase $j$, it runs UCB-MPA on an $M$-sized subset of arms $S^{w,j}$ (the arm memory), assigns the recommended arm to $\hat{a}$, and forwards it to the next sub-phase. In the subsequent sub-phase (which might belong to the next phase), it chooses $M-1$ new arms from $\mathcal{A}$, along with the arm $\hat{a}$ forwarded from the previous sub-phase, and repeats the previous steps. The horizon spent on each sub-phase of a phase $w$ is the same and is given by $b_{w}$. Also, for $w\geq 2$, the total horizon spent in phase $w$ is given by $h_{0}b_{w}=2h_{0}b_{w-1}$. To satisfy the assumption in Theorem 3 of Bubeck et al. (2009), in the first phase UCB-M chooses a horizon of $b_{1}=M(M+2)$ pulls for each of the sub-phases. For the rest of the analysis of UCB-M, we shall denote the number of phases executed by UCB-M as $x_{0}$.
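The phase structure described above can be sketched as follows. This is an illustrative reading of the prose, not the paper's Algorithm 1: the round-robin order in which new arms are brought into memory, the reset of statistics at every sub-phase, and the textbook UCB1 index are all assumptions, and the names `ucb_mpa`, `ucb_m` and `pull` are hypothetical.

```python
import math
import random
from typing import Callable, List, Tuple


def ucb_mpa(pull: Callable[[int], float], arms: List[int],
            budget: int) -> Tuple[int, float]:
    """Run UCB1 on `arms` for `budget` pulls; return (most played arm, total reward)."""
    counts = {a: 0 for a in arms}
    sums = {a: 0.0 for a in arms}
    total = 0.0
    for t in range(1, budget + 1):
        untried = [a for a in arms if counts[a] == 0]
        if untried:
            arm = untried[0]
        else:
            arm = max(arms, key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2.0 * math.log(t) / counts[a]))
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return max(arms, key=lambda a: counts[a]), total


def ucb_m(pull: Callable[[int], float], num_arms: int, memory_size: int,
          horizon: int) -> Tuple[int, float]:
    """Phase/sub-phase loop of UCB-M; returns (last forwarded arm, total reward)."""
    K, M = num_arms, memory_size
    if M >= K:  # no effective memory constraint: run on all arms at once
        return ucb_mpa(pull, list(range(K)), horizon)
    if M < 2:
        raise ValueError("the sketch needs an arm memory size of at least 2")
    h0 = math.ceil((K - 1) / (M - 1))  # sub-phases per phase
    b = M * (M + 2)                    # b_1: pulls per sub-phase in phase 1
    a_hat, pointer, used, total = None, 0, 0, 0.0
    while used < horizon:
        for _ in range(h0):
            # Memory = forwarded arm plus new arms, taken round-robin (assumption).
            memory = [] if a_hat is None else [a_hat]
            while len(memory) < M:
                if pointer not in memory:
                    memory.append(pointer)
                pointer = (pointer + 1) % K
            budget = min(b, horizon - used)
            a_hat, reward = ucb_mpa(pull, memory, budget)  # statistics start afresh
            total += reward
            used += budget
            if used >= horizon:
                break
        b *= 2  # b_w = 2 * b_{w-1}
    return a_hat, total


rng = random.Random(1)
means = [0.2, 0.4, 0.5, 0.6, 0.9]
last_arm, total_reward = ucb_m(lambda a: float(rng.random() < means[a]),
                               num_arms=5, memory_size=2, horizon=3000)
print(last_arm, total_reward)
```

At any moment the sketch holds statistics for at most `memory_size` arms, plus a constant number of scalars (`a_hat`, `pointer`, `b`, `used`), which is the point of the arm-memory constraint.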

For $M\geq K$ , the arm memory is large enough that the memory constraint is effectively removed. In such unconstrained scenarios it is preferable to run UCB1 (Auer et al., 2002) on the whole instance, which incurs a lower regret. We adopt this into UCB-M, and hence running UCB-M with $M\geq K$ is identical to running the original UCB1.

Theorem 4.1.

Given a set of $K$ arms $\mathcal{A}$ with $K\geq 3$ and an arm memory size $M$ such that $2\leq M<K$ as input, for a horizon of $T$ pulls with $T>KM^{2}(M+2)$ , UCB-M incurs $\operatorname{\mathop{\mathbb{R}}}_{T}^{*}=O\left(KM+({K^{3/2}}/{M})\sqrt{T\log({T}/{KM})}\right).$

We note that for a given bandit instance with $K$ arms, an arm memory size $M$ , and horizon $T$ , the number of sub-phases is $h_{0}=\lceil(K-1)/(M-1)\rceil$ , and the total number of phases is upper bounded by $x_{0}\stackrel{{\scriptstyle\mathclap{\mbox{\tiny{def}}}}}{{=}}\lceil\log({2T}/{MK})\rceil$ (Lemma A.2 in Appendix A). Now, we upper bound the maximum regret incurred in a sub-phase inside a given phase, and sum over all the sub-phases.
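As a quick numerical sanity check of the phase count, the following sketch counts how many phases are started within $T$ pulls under the schedule $b_{1}=M(M+2)$ , $b_{w}=2b_{w-1}$ , and compares the count against $x_{0}$ . The function names are hypothetical, and the logarithm in $x_{0}$ is assumed to be base 2.

```python
import math


def phases_started(K, M, T):
    """Count phases UCB-M starts within T pulls, using b_1 = M(M+2) and doubling budgets."""
    h0 = math.ceil((K - 1) / (M - 1))  # sub-phases per phase
    b = M * (M + 2)                    # per-sub-phase budget in phase 1
    used, phases = 0, 0
    while used < T:
        phases += 1
        used += h0 * b  # pulls consumed if the phase ran to completion
        b *= 2
    return phases


def x0(K, M, T):
    """The bound ceil(log2(2T / (MK))) on the number of phases (base-2 log assumed)."""
    return math.ceil(math.log2(2 * T / (M * K)))
```

For instance, with $K=100$ , $M=2$ and $T=10^{6}$ the schedule starts 11 phases, while the bound evaluates to 14.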

As UCB-M ensures inclusion of the optimal arm at least once in every phase, we note the following.

Corollary 4.2.

Let us denote the sequence of sub-phase-wise arm memories as $\mathcal{S}\stackrel{{\scriptstyle\mathclap{\mbox{\tiny{def}}}}}{{=}}\{S^{1,1},S^{1,2},\cdots,$ $S^{1,h_{0}},S^{2,1},S^{2,2},$ $\cdots,S^{2,h_{0}},\cdots,S^{x_{0},1},$ $S^{x_{0},2},\cdots,S^{x_{0},h_{0}}\}$ . Then, for $d\geq h_{0}$ , any $d$ consecutive elements of $\mathcal{S}$ include at least one that contains $a^{*}$ .

In any given phase, we need to upper-bound the difference between the means of the best in-memory arms of two successive sub-phases. Considering $\mathcal{S}$ as defined in Corollary 4.2, let $a_{*}^{y,j}\in S^{y,j}\in\mathcal{S}$ be the arm recommended by sub-phase $j-1$ to sub-phase $j$ . It is important to note that $\max_{a\in S^{y,j}}\mu_{a}\geq\mu_{a_{*}^{y,j}}$ . Therefore, in the interest of finding an upper bound on the regret, it is safe to use $\mu_{a_{*}^{y,j}}$ in place of $\max_{a\in S^{y,j}}\mu_{a}$ , as a pessimistic estimate of the best mean in $S^{y,j}$ . In any given sub-phase $j\leq h_{0}-1$ in phase $y$ , we let $\operatorname*{\mathop{\mathbb{E}}}[r^{y}]\stackrel{{\scriptstyle\mathclap{\mbox{\tiny{def}}}}}{{=}}\operatorname*{\mathop{\mathbb{E}}}[\mu_{a_{*}^{y,j}}-\mu_{a_{*}^{y,j+1}}]$ . Now, noticing that UCB-M spends $b_{y}$ pulls on each sub-phase of phase $y$ , we upper bound $\operatorname*{\mathop{\mathbb{E}}}[r^{y}]$ as follows.

Corollary 4.3.

Using Theorem 3.1, in phase $y$ , at the end of each sub-phase $j$ , the expected simple regret with respect to $\mu_{a_{*}^{y,j}}$ is upper-bounded as $\operatorname*{\mathop{\mathbb{E}}}[r^{y}]\leq C\sqrt{(M\log b_{y})/b_{y}}$ . The upper bound is independent of $j$ as the budget for each sub-phase in a given phase remains the same.

We notice that the arm forwarded from each sub-phase to the next one is not necessarily the optimal arm. Hence, in the worst case, the expected difference between the mean of the optimal arm and the highest mean reward in the current arm memory grows linearly with the number of sub-phases in a given phase. We upper bound it as follows.

Lemma 4.4.

Suppose we are given a $K$ -sized set of arms $\mathcal{A}$ and an arm memory size $M$ . Also, as defined in Corollary 4.3, at any phase $y\geq 2$ and sub-phase $j$ , let $\mu_{*}^{y,j}$ be the maximum of the means of the arms in the arm memory. Then

[TABLE]

The proof is presented in Appendix A. Next, we use Lemma 4.4 to upper bound the cumulative regret ( $\operatorname{\mathop{\mathbb{R}}}_{T}^{*}$ ).

Bifurcation of $\operatorname{\mathop{\mathbb{R}}}_{T}^{*}$ .

For any given phase $w$ and sub-phase $j$ , let $\mu_{*}^{w,j}\stackrel{{\scriptstyle\mathclap{\mbox{\tiny{def}}}}}{{=}}\max\{\mu_{a}:a\in S^{w,j}\}$ , and let $R_{w,j}$ be the incurred regret. Then,

[TABLE]

where the expectation is taken over all possible sources of randomisation. Now, letting $R_{w,j}^{(1)}=b_{w}(\mu^{*}-\mu_{*}^{w,j})$ , and $R_{w,j}^{(2)}=\sum_{t=1}^{b_{w}}(\operatorname*{\mathop{\mathbb{E}}}[\mu_{*}^{w,j}]-\operatorname*{\mathop{\mathbb{E}}}[\mu_{a_{t}}])$ , we can write

[TABLE]

Now, using Lemma 4.4 we upper bound $R_{w,j}^{(1)}$ as follows.

Lemma 4.5.

For $2\leq M<K$ , $T>KM^{2}(M+2)$ , and some constant $C^{\prime}$ , $\sum_{w=1}^{x_{0}}\sum_{j=1}^{h_{0}}R_{w,j}^{(1)}$ $\leq C^{\prime}$ $\left(KM+({K^{3/2}}/{M})\sqrt{T\log({T}/{MK})}\right)$ .

For the detailed proof we refer to Appendix A. We note that $\sum_{w=1}^{x_{0}}\sum_{j=1}^{h_{0}}R_{w,j}^{(2)}$ can be upper-bounded using the problem-independent upper bound on the cumulative regret of UCB1 (Auer et al., 2002), which we restate below.

Lemma 4.6 (Distribution-Free Upper Bound on Cumulative Regret of UCB1 Auer et al. (2002)).

Given a set of $K$ arms as the input, for any horizon $T$ , the cumulative regret incurred by UCB1 satisfies $R_{T}^{*}\leq 12\sqrt{TK\log T}+6K$ . Further, if $T\geq K/2$ , then $R_{T}^{*}\leq 18\sqrt{TK\log T}$ .

Next, using Lemma 4.6, we upper bound $\sum_{w=1}^{x_{0}}\sum_{j=1}^{h_{0}}R_{w,j}^{(2)}$ , with the proof given in Appendix A.

Lemma 4.7.

For $2\leq M<K$ , $T>KM^{2}(M+2)$ , and some constant $C^{\prime\prime}>0$ , $\sum_{w=1}^{x_{0}}\sum_{j=1}^{h_{0}}R_{w,j}^{(2)}\leq$ $C^{\prime\prime}\left(KM+\sqrt{TK\log({T}/{MK})}\right)$ .

Proof of Theorem 4.1

Using Equation 6, and applying Lemma 4.5 and Lemma 4.7 we prove the theorem.
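The combination step can be spelled out as follows. This is a sketch that assumes Equation 6 expresses $\operatorname{\mathop{\mathbb{R}}}_{T}^{*}$ as the sum of the two double sums over $R_{w,j}^{(1)}$ and $R_{w,j}^{(2)}$ ; the equation itself is not reproduced here. Applying Lemma 4.5 and Lemma 4.7 to the two sums gives

$$\operatorname{\mathop{\mathbb{R}}}_{T}^{*}\leq C^{\prime}\left(KM+\frac{K^{3/2}}{M}\sqrt{T\log\frac{T}{MK}}\right)+C^{\prime\prime}\left(KM+\sqrt{TK\log\frac{T}{MK}}\right).$$

Since $M<K$ , we have $\sqrt{K}\leq K^{3/2}/M$ , so the second square-root term is dominated by the first, and

$$\operatorname{\mathop{\mathbb{R}}}_{T}^{*}\leq(C^{\prime}+C^{\prime\prime})\left(KM+\frac{K^{3/2}}{M}\sqrt{T\log\frac{T}{MK}}\right),$$

which is the order stated in Theorem 4.1.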

Next, we present an empirical comparison of UCB-M and some of its variations with the algorithm of Liau et al. (2018).

4.2 Experiment

The use of UCB1 (Auer et al., 2002) as a subroutine in Algorithm 1 can be replaced by any other allocation strategy, which in effect gives rise to a different upper bound. To study the empirical behaviour, we consider Thompson Sampling (Agrawal & Goyal, 2013) and Moss (Audibert & Bubeck, 2009) in our experiments, and call the resulting variants of UCB-M TS-M and Moss-M respectively. Everything else in Algorithm 1, including the recommendation strategy, is kept unchanged.

Bandit Instances.

We run the experiments on three different instances. Let $K$ be the number of arms in the instance. For convenience, let the arm indices be sorted in descending order of their means, with $\mu_{1}=\mu^{*}=0.99$ . As we randomly permute the arm indices in all our experiments, this assumption does not affect the results. We write $\mathcal{B}^{K}_{L}$ to denote an instance in which the means of the $K$ arms are linearly spaced between $0.99$ $(=\mu^{*})$ and $0.01$ . The other two $K$ -armed instances are analogous to the ones used by Jamieson et al. (2014). For $\alpha\in\{0.3,0.6\}$ , they are denoted $\mathcal{B}^{K}_{\alpha}$ , in which any sub-optimal arm $i>1$ has the mean $\mu_{i}=0.01+\mu^{*}-(\mu^{*}-0.01)((i-1)/(K-1))^{\alpha}$ .
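The instance definitions above translate directly into code. The sketch below uses hypothetical function names and transcribes the formula for $\mathcal{B}^{K}_{\alpha}$ exactly as printed; taken literally, that formula gives the worst arm a mean of $0.02$ rather than $0.01$ , and the code does not attempt to correct it.

```python
def linear_instance(K, mu_star=0.99, mu_min=0.01):
    """Means of B^K_L: K values linearly spaced from mu_star down to mu_min."""
    return [mu_star - (mu_star - mu_min) * i / (K - 1) for i in range(K)]


def alpha_instance(K, alpha, mu_star=0.99, mu_min=0.01):
    """Means of B^K_alpha, transcribing the printed formula for sub-optimal arms."""
    means = [mu_star]  # arm 1 is the optimal arm
    for i in range(2, K + 1):  # 1-based arm index, as in the text
        frac = ((i - 1) / (K - 1)) ** alpha
        means.append(mu_min + mu_star - (mu_star - mu_min) * frac)
    return means
```

Calling `linear_instance(100)` or `alpha_instance(100, 0.3)` yields the list of means in descending order; the experiments described in the text permute the arm indices before running.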

For $K=100$ , Figure 2 compares the cumulative regret incurred by the algorithms UCB-M, TS-M and Moss-M for an arm memory size $M=2$ with that of the algorithm of Liau et al. (2018) (UCBConstantSpace). A comparison of cumulative regret and of the number of pulls to the individual arms on the instances with $K=10$ is presented in Figure 4 in Appendix B. It is important to note that despite using a larger arm memory of $M=4$ , twice that of the others, their algorithm incurs a significantly higher regret.

Intuitively, Liau et al.’s (2018) algorithm first solves a pure exploration problem to identify a near-optimal arm, and then commits the rest of the horizon to that arm. Consequently, it spends a prohibitively large number of pulls on the sub-optimal arms, leading to a high regret. In contrast, we just make sure that at any instant, the expected difference between the mean of the optimal arm and the best arm in the current arm memory is not too large. This difference increases over successive sub-phases. However, UCB-M ensures that the optimal arm enters its arm memory at least once in any given phase, which "resets" this difference. In addition, this difference progressively reduces because the budget doubles in each phase. This explains why UCB-M, TS-M, and Moss-M incur significantly lower regret.

As UCB-M can take advantage of larger arm memory size, next we shall compare the incurred regret by varying it. Recalling the algorithm UCB1 (Auer et al., 2002), if an arm $a$ has been pulled $u_{a}^{t}$ times till the time step $t$ , and if $\hat{\mu}_{a}^{t}$ is its empirical average reward, then the upper confidence bound of that arm is given by $ucb_{a}^{t}=\hat{\mu}_{a}^{t}+\eta\sqrt{2\log t/u_{a}^{t}},$ with $\eta=1$ . It can be experimentally validated that tuning $\eta$ can lead to achieving a smaller regret as claimed by the authors (Auer et al., 2002, Section 4). We present the regret incurred by UCB-M for $\eta=0.2$ , alongside the other algorithms.
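The scaled index above can be written as a one-line function. This is a minimal sketch with a hypothetical name `ucb_index`; it assumes the natural logarithm, and the handling of an unpulled arm (returning infinity so that it is tried first) is a common convention rather than something stated in the text.

```python
import math


def ucb_index(mean_hat, pulls, t, eta=1.0):
    """Scaled UCB1 index: empirical mean plus eta * sqrt(2 log t / pulls)."""
    if pulls == 0:
        return math.inf  # an unpulled arm is always tried first (assumed convention)
    return mean_hat + eta * math.sqrt(2.0 * math.log(t) / pulls)
```

With `eta=0.2` the exploration bonus is one fifth of the standard one, so the index stays closer to the empirical mean and the algorithm explores less aggressively.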

Intuition suggests that increasing arm memory should help in achieving a low regret, as it increases the chance of pulling the optimal arm more frequently. The upper bound given by Theorem 4.1 supports this intuition. However, in practice, we notice an interesting behaviour. On the instance $\mathcal{B}^{100}_{L}$ , we compare the cumulative regret incurred by UCB-M, TS-M, and Moss-M in Figure 3 by varying $M$ . For a comparison on the other instances, the reader is referred to Figure 5 in Appendix B. As expected, UCB-M, TS-M and Moss-M always incur a higher regret than their unconstrained ( $M=K$ ) counterparts. Also, for UCB-M with $\eta=0.2$ , TS-M and Moss-M, increasing the arm memory size $M$ leads to a lower regret. However, the behaviour of UCB-M ( $\eta=1$ ) is significantly different from the others. For $M<K$ , it incurs a relatively low regret at $M=2$ ; the regret then increases with $M$ , followed by a slow decrease. We attribute this peculiarity to the intrinsic looseness in the calculation of the upper confidence bound. This is also why UCB-M with $\eta=0.2$ and the other variants not only incur a lower regret but also behave consistently.

5 Algorithm for Infinite Bandit Instances

In this section, we provide a bounded arm-memory algorithm and its upper bound on the incurred quantile-regret. Also, on various problem instances we empirically compare its incurred conventional cumulative regret ( $\operatorname{\mathop{\mathbb{R}}}_{T}^{*}$ ) with the existing algorithms.

5.1 Algorithm and Quantile-Regret Analysis

We solve the problem QR-M by modifying the algorithm QRM2 (Roy Chaudhuri & Kalyanakrishnan, 2018) to use UCB-M as the sub-routine, and adjust the arm exploration rate accordingly to minimise the upper bound. We call the resulting algorithm QUCB-M and describe it in Algorithm 2.

Below, we present the upper bound on the quantile regret incurred by QUCB-M.

Theorem 5.1 (Sub-linear quantile-regret of QUCB-M).

For $\rho\in(0,1)$ and for sufficiently large $T$ , QUCB-M incurs

[TABLE]

Proof.

To prove the theorem we follow the steps of the proof of Theorem 3.3 in Roy Chaudhuri & Kalyanakrishnan (2018). For any fixed $\rho\in(0,1)$ , we break the analysis of the upper bound on $\operatorname{\mathop{\mathbb{R}}}_{T}(\rho)$ into two cases: first, the algorithm never encounters an arm from $\mathcal{TOP}_{\rho}$ ; second, it picks at least one arm from $\mathcal{TOP}_{\rho}$ .

The key step in the analysis of the first part is showing that there exists $r^{*}\geq 1$ such that for all $r\geq r^{*}$ , the set of arms $\mathcal{K}_{r}$ is sufficiently large to contain an arm from $\mathcal{TOP}_{\rho}$ with high probability. Defining the event that no arm from $\mathcal{TOP}_{\rho}$ is in $\mathcal{K}_{r}$ as $E_{r}(\rho)\stackrel{{\scriptstyle\mathclap{\mbox{\tiny{def}}}}}{{=}}\{\mathcal{K}_{r}\cap\mathcal{TOP}_{\rho}=\emptyset\}$ , and following the steps for the derivation of Equation (3) in the proof of Theorem 3.3 in Roy Chaudhuri & Kalyanakrishnan (2018), we arrive at

[TABLE]

The detailed derivation of Equation (7) is given in Lemma C.2 in Appendix C.

In the second part, we upper-bound the incurred regret for the case where QUCB-M encounters at least one arm from $\mathcal{TOP}_{\rho}$ in $\mathcal{K}_{r}$ (the event $\neg E_{r}(\rho)$ ). Using Theorem 4.1 and an approach similar to the derivation of Equation (4) in the proof of Theorem 3.3 in Roy Chaudhuri & Kalyanakrishnan (2018), we arrive at

[TABLE]

for some constant $C^{\prime}$ . The intermediate steps to obtain (5.1) are presented in Lemma C.3 in Appendix C. Combining Equation (7), Equation (5.1) and substituting for $t_{r^{*}}$ , the upper bound on $\operatorname{\mathop{\mathbb{R}}}_{T}(\rho)$ with respect to $T$ gets minimised for $\alpha=1/(3+2\log\mathrm{e}/(1+\gamma))\approx 0.205$ , thus proving the theorem. ∎
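The value of $\alpha$ can be recovered by balancing exponents. The following is a sketch that ignores logarithmic factors and the lower-order term $MT^{\alpha}$ , and takes the $T$ -dependent terms from Lemma C.2 and Lemma C.3 in Appendix C, with $\log$ read as the base-2 logarithm. Writing $c=\log\mathrm{e}/(1+\gamma)$ , the two terms are

$$T^{1-\alpha c}\quad\text{and}\quad\sqrt{T^{1+3\alpha}}=T^{(1+3\alpha)/2}.$$

The first exponent decreases with $\alpha$ and the second increases, so the larger of the two is minimised where they coincide:

$$1-\alpha c=\frac{1+3\alpha}{2}\iff\alpha=\frac{1}{3+2c}=\frac{1}{3+2\log\mathrm{e}/(1+\gamma)}.$$

With $\log\mathrm{e}\approx 1.4427$ and $\gamma\approx 0.5307$ , we get $c\approx 0.9425$ and $\alpha\approx 1/4.885\approx 0.2047$ , at which both exponents are approximately $0.807$ .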

It is to be noted that inside QUCB-M one can use the algorithm by Liau et al. (2018) instead of UCB-M. However, as we have already shown in Section 4.2 that UCB-M is empirically superior to their algorithm, we do not consider this variation in our experiments.

5.2 Experiment

Although QUCB-M is designed to minimise quantile-regret, we use conventional cumulative regret as the evaluation metric. Like UCB-M, the algorithm QUCB-M can be altered to use TS-M or Moss-M as the subroutine instead; we call the resulting algorithms QTS-M and QMoss-M respectively. Algorithm 2 uses the value of $\alpha$ that minimises the upper bound on regret in Theorem 5.1. However, for empirical efficiency, we keep $\alpha=0.347$ as used by the algorithm QRM2 (Roy Chaudhuri & Kalyanakrishnan, 2018). We compare the conventional regret incurred by each of these algorithms against the algorithms by Herschkorn et al. (1996) and Berry et al. (1997), and present the results in Table 1. We use the same four Bernoulli instances used by Roy Chaudhuri & Kalyanakrishnan (2018). Instances $I_{1}$ and $I_{2}$ have $\mu^{*}=1$ , and the probability distributions on $\mu$ induced by $P_{\mathcal{A}}$ are given by $\beta(0.5,2)$ and $\beta(1,1)$ respectively. Similarly, instances $I_{3}$ and $I_{4}$ have $\mu^{*}=0.6$ , and the probability distributions on $\mu$ induced by $P_{\mathcal{A}}$ are given by scaled $\beta(0.5,2)$ and $\beta(1,1)$ respectively. Each column of the tables is labelled by the corresponding probability density function of encountering the mean rewards. As Table 1 suggests, the existing algorithms incur a significantly higher regret in most of the cases. The comparison for $\alpha=0.205$ is given in Table 2 in Appendix D.
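To make the four instances concrete, the sketch below draws the mean of a freshly sampled arm for each of them. The function name is hypothetical, and "scaled" is interpreted here as multiplying a Beta draw by $\mu^{*}=0.6$ , which is an assumption about the construction rather than something spelled out in the text.

```python
import random


def sample_arm_mean(instance, rng):
    """Draw one arm's mean for instances I1..I4 as described in the text."""
    shape = {"I1": (0.5, 2.0), "I2": (1.0, 1.0), "I3": (0.5, 2.0), "I4": (1.0, 1.0)}
    scale = {"I1": 1.0, "I2": 1.0, "I3": 0.6, "I4": 0.6}  # mu* of each instance
    a, b = shape[instance]
    # Assumed scaling: a Beta(a, b) draw multiplied by mu*.
    return scale[instance] * rng.betavariate(a, b)
```

For example, `sample_arm_mean("I3", random.Random(0))` returns a mean in $[0,0.6]$ ; each pull of that arm would then be a Bernoulli reward with this mean.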

It is interesting to note that, as for the finite instances, increasing arm memory leads to a lower regret. Specifically, the scaled version of QUCB-M (using UCB-M with $\eta=0.2$ ) along with QTS-M and QMoss-M show an improvement with larger arm memory. However, with $\eta=1$ in the underlying UCB-M, QUCB-M fails to take advantage of larger arm memory.

6 Conclusion

In this paper, we address the problem of regret minimisation using a bounded number of words of memory. This problem becomes interesting where the number of arms is too large to consider all of them simultaneously, for example in crowd-sourcing or drug testing. Some existing approaches (Herschkorn et al., 1996; Berry et al., 1997) consider only infinite bandit instances consisting of Bernoulli arms. Recently, Liau et al. (2018) presented an explore-then-commit based algorithm for finite bandit instances, which escapes such assumptions but is very inefficient in practice.

We provide a UCB1 (Auer et al., 2002) based algorithm UCB-M for finite bandit instances, which is empirically far more efficient and enjoys a sub-linear upper bound on the cumulative regret, while using a bounded number of words of memory. Also, unlike all the existing algorithms, UCB-M offers the flexibility of varying the arm memory size, allowing the experimenter to use the available memory resource. Further, we extend the existing algorithm QRM2 (Roy Chaudhuri & Kalyanakrishnan, 2018) for quantile-regret minimisation to QUCB-M to achieve sub-linear quantile regret under the bounded arm memory constraint. We empirically verify that QUCB-M incurs a lower conventional cumulative regret on various infinite bandit instances than the existing algorithms (Herschkorn et al., 1996; Berry et al., 1997), which need $O(1)$ memory.

We find that providing a lower bound on the cumulative regret under the bounded arm memory constraint is an interesting question, and we leave that for future investigation.

Appendix A Proofs from Section 4.1

Lemma A.1.

For a given $K$ -sized set of arms $\mathcal{A}$ , and an arm memory size $M<K$ , the number of sub-phases required to ensure that each arm in $\mathcal{A}$ has been chosen into arm memory at least once is not more than $h_{0}$ .

Proof.

We notice that at the beginning of each sub-phase the arm memory holds exactly $M-1$ new arms besides the arm $\hat{a}$ recommended by the previous sub-phase. Let $h$ be the maximum number of sub-phases possible in a phase. Each phase $w$ ends as soon as, for every arm $a\in\mathcal{A}$ , there exists a sub-phase $j$ such that $S^{w,j}\ni a$ . Therefore, $h=\min\{y:\mathcal{A}\subseteq\cup_{j=1}^{y}S^{w,j}\}=\Big{\lceil}\frac{K-1}{M-1}\Big{\rceil}=h_{0}$ . ∎
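The counting argument can be checked mechanically. The sketch below (with a hypothetical function name) counts sub-phases under the assumption that the forwarded arm is already covered and each sub-phase brings in $M-1$ previously unseen arms, and compares the result with the closed form $\lceil(K-1)/(M-1)\rceil$ .

```python
import math


def sub_phases_needed(K, M):
    """Smallest y with 1 + y*(M-1) >= K: one forwarded arm plus M-1 new arms per sub-phase."""
    y = 0
    covered = 1  # the forwarded arm is already in memory
    while covered < K:
        y += 1
        covered += M - 1
    return y
```

For $K=100$ and $M=2$ this gives 99 sub-phases per phase, and for $M=4$ it gives 33.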

Lemma A.2.

For a given $K$ -sized set of arms $\mathcal{A}$ , and an arm memory size $M<K$ , the number of phases UCB-M executes is upper bounded by $x_{0}\stackrel{{\scriptstyle\mathclap{\mbox{\tiny{def}}}}}{{=}}\Big{\lceil}\log\frac{2T}{MK}\Big{\rceil}$ .

Proof.

Let $x$ be the total number of phases executed by UCB-M. It should be noted that the value of $M$ , $K$ , and $T$ might be such that the total horizon ( $T$ ) runs out before finishing the last phase. Now, for any given phase $w$ ( $w\geq 1$ ), the horizon spent on each sub-phase is the same, that is $b_{w}=2^{w-1}b_{1}$ . Therefore, we can write

[TABLE]

∎
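One way to see the bound is sketched below; it may differ from the omitted derivation, assumes base-2 logarithms, and uses the condition $T>KM^{2}(M+2)$ of Theorem 4.1 only through $T\geq MK$ . If phase $x$ is started, the first $x-1$ phases have been completed using fewer than $T$ pulls:

$$h_{0}b_{1}\sum_{w=1}^{x-1}2^{w-1}=h_{0}b_{1}\left(2^{x-1}-1\right)<T.$$

Since $h_{0}\geq(K-1)/(M-1)$ and $b_{1}=M(M+2)$ , we have $h_{0}b_{1}\geq MK$ whenever $(K-1)(M+2)\geq K(M-1)$ , that is, whenever $3K\geq M+2$ , which holds for $M<K$ . Hence

$$2^{x-1}<\frac{T}{MK}+1\leq\frac{2T}{MK},$$

so $x-1<\log(2T/(MK))$ , and because $x$ is an integer, $x\leq\lceil\log(2T/(MK))\rceil$ .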

See Lemma 4.4.

Proof.

Let $\operatorname*{\mathop{\mathbb{E}}}[r_{*}^{y,j}]\stackrel{{\scriptstyle\mathclap{\mbox{\tiny{def}}}}}{{=}}\operatorname*{\mathop{\mathbb{E}}}[\mu^{*}-\mu_{*}^{y,j}]$ . We break the proof into two steps. Step 1 upper bounds $\operatorname*{\mathop{\mathbb{E}}}[r_{*}^{y,h_{0}}]$ , which is an upper bound on $\operatorname*{\mathop{\mathbb{E}}}[r_{*}^{y,j}]$ for all $j\in[h_{0}]$ , while Step 2 upper bounds $\operatorname*{\mathop{\mathbb{E}}}[r_{*}^{y+1,j}]$ . Both steps are based on Corollary 4.2, which ensures that at least one of any $h_{0}$ consecutive sub-phases (not necessarily from the same phase) contains the optimal arm $a^{*}$ in the arm memory.

Step 1.

Let $1\leq k_{0}\leq h_{0}-1$ be the first sub-phase in phase $y$ to have $a^{*}$ in the arm memory. Therefore, $k_{0}\stackrel{{\scriptstyle\mathclap{\mbox{\tiny{def}}}}}{{=}}\min\{i\in[h_{0}]:a^{*}\in S^{y,i}\}$ , and hence, by definition, $\operatorname*{\mathop{\mathbb{E}}}[r_{*}^{y,k_{0}+1}]=\operatorname*{\mathop{\mathbb{E}}}[r^{y,k_{0}+1}]$ . Therefore, for any subsequent sub-phase $j\in\{k_{0}+1,\cdots,h_{0}\}$ in phase $y$ , $\operatorname*{\mathop{\mathbb{E}}}[r_{*}^{y,j}]=\operatorname*{\mathop{\mathbb{E}}}[\mu^{*}-\mu_{*}^{y,j}]=\operatorname*{\mathop{\mathbb{E}}}[\mu^{*}-\mu_{*}^{y,k_{0}+1}]+\sum_{v=k_{0}+2}^{j-1}\operatorname*{\mathop{\mathbb{E}}}[\mu_{*}^{y,v}-\mu_{*}^{y,v+1}]$ . As there are $h_{0}$ sub-phases in any phase, for all $k_{0}+1\leq j\leq h_{0}$ , $\operatorname*{\mathop{\mathbb{E}}}[r_{*}^{y,j}]\leq\operatorname*{\mathop{\mathbb{E}}}[r_{*}^{y,h_{0}}]\leq(h_{0}-k_{0}+1)\operatorname*{\mathop{\mathbb{E}}}[r^{y}]\leq h_{0}\operatorname*{\mathop{\mathbb{E}}}[r^{y}]$ .

Step 2.

Let $j_{0}$ be a sub-phase in phase $y-1$ such that $a^{*}\in S^{y-1,j_{0}}$ . From Step 1, $\operatorname*{\mathop{\mathbb{E}}}[r_{*}^{y-1,h_{0}}]\leq(h_{0}-j_{0}+1)\operatorname*{\mathop{\mathbb{E}}}[r^{y-1}]$ . Now, considering sub-phase $i$ in phase $y$ , we observe that if $i\geq j_{0}$ , then there exists a sub-phase $w\in\{1,\cdots,i\}$ such that $a^{*}\in S^{y,w}$ . Now, for $i\leq j_{0}-1$ ,

[TABLE]

Together, Step 1 and Step 2 prove the lemma. ∎

See Lemma 4.5.

Proof.

[TABLE]

wherein, $C_{1},C_{2},\cdots,C_{8}$ are appropriate constants. ∎

See Lemma 4.7.

Proof.

We notice that at any sub-phase $j$ of any phase $w\geq 2$ , due to Lemma 4.6, there exists a constant $C$ , such that $R_{w,j}^{(2)}\leq C\sqrt{b_{w}M\log b_{w}}$ . Therefore,

[TABLE]

wherein, $C_{1},C_{2},C_{3},C_{4}$ are appropriate constants. ∎

Appendix B Additional Experimental Results from Section 4.2

Appendix C Proofs from Section 5.1

In this appendix we provide the materials to complete the proof of Theorem 5.1.

Lemma C.1.

Let $r^{*}=\Big{\lceil}\frac{1}{\alpha}\log\left(\frac{1}{\rho}\log\frac{1}{\rho}\right)-\log B\Big{\rceil}$ . Then, for every phase $r\geq r^{*}$ , the size of $\mathcal{K}_{r}$ can be lower bounded as $n_{r}=\Big{\lceil}t_{r}^{\alpha}\Big{\rceil}\geq\Big{\lceil}\frac{\alpha\log\mathrm{e}}{(1+\gamma)\rho}\cdot\ln t_{r}\Big{\rceil}$ , wherein $0.53<\gamma\stackrel{{\scriptstyle\mathclap{\mbox{\tiny{def}}}}}{{=}}\max_{x}\frac{\log\log x}{\log x}<0.531$ .

Proof.

We notice that for every $r\geq r^{*}$ , $t_{r}\geq\Big{\lceil}\left(\frac{1}{\rho}\log\frac{1}{\rho}\right)^{\frac{1}{\alpha}}\Big{\rceil}$ . Then, for each $r\geq r^{*}$ , we can lower bound the size of the set $\mathcal{K}_{r}$ as follows. As $|\mathcal{K}_{r}|=n_{r}=\Big{\lceil}t_{r}^{\alpha}\Big{\rceil}$ is an integer, to ease the calculation let us define $s_{u}=2^{u}B$ , where $u\in\mathbb{R}^{+}$ , so that $s_{u}\in\mathbb{R}^{+}$ does not need to be an integer. Now, letting $u^{*}\stackrel{{\scriptstyle\mathclap{\mbox{\tiny{def}}}}}{{=}}\log\left(\frac{1}{\rho}\log\frac{1}{\rho}\right)^{\frac{1}{\alpha}}$ , we get

[TABLE]

As $s_{u}^{\alpha}$ grows with $u$ faster than $\log s_{u}$ ,

[TABLE]

Therefore, recalling that $r$ is an integer, for all values of $r\geq\lceil r^{*}\rceil$ , the statement of the lemma follows. ∎
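The numerical range quoted for $\gamma$ can be checked with a short script. The sketch below assumes $\log$ denotes the base-2 logarithm, in which case $\log\log x/\log x$ is maximised at $\log x=\mathrm{e}$ with value $\log\mathrm{e}/\mathrm{e}$ ; the grid search is included only as an independent check, and both names are hypothetical.

```python
import math


def gamma_grid(lo=1.01, hi=64.0, steps=200_000):
    """Grid-search the maximum of log2(log2 x) / log2 x over x in [lo, hi]."""
    best = -math.inf
    for k in range(steps + 1):
        x = lo + (hi - lo) * k / steps
        y = math.log2(x)  # positive because x > 1
        best = max(best, math.log2(y) / y)
    return best


# Closed form: with y = log2 x, the function log2(y) / y peaks at y = e.
gamma_closed = math.log2(math.e) / math.e
```

Both computations give a value of about $0.5307$ , consistent with the stated range $0.53<\gamma<0.531$ .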

Lemma C.2.

The expected regret due to not encountering any arm from the set $\mathcal{TOP}_{\rho}$ during the run of the algorithm is in $O\left(\left(\frac{1}{\rho}\log\frac{1}{\rho}\right)^{\frac{1}{\alpha}}+T^{1-\frac{\alpha\log\mathrm{e}}{1+\gamma}}\right)$ .

Proof.

We define an event that no arm from $\mathcal{TOP}_{\rho}$ is in $\mathcal{K}_{r}$ as $E_{r}(\rho)\stackrel{{\scriptstyle\mathclap{\mbox{\tiny{def}}}}}{{=}}\{\mathcal{K}_{r}\cap\mathcal{TOP}_{\rho}=\emptyset\}$ , and note $\Pr\{E_{r}(\rho)\}=(1-\rho)^{n_{r}}$ . Now, for some $\alpha\in(0,1)$ that shall be tuned later, let $r^{*}=\lceil(1/\alpha)\log((1/\rho)\log(1/\rho))\rceil$ . Therefore, in the round $r^{*}$ , the number of pulls is given by $t_{r^{*}}=2^{r^{*}}=((1/\rho)\log(1/\rho))^{1/\alpha}$ . Now, for $r\geq r^{*}$ , the number of arms in $\mathcal{K}_{r}$ is given by $n_{r}=t_{r}^{\alpha}\geq\lceil(\alpha/((1+\gamma)\rho))\cdot\ln t_{r}^{\log\mathrm{e}}\rceil$ , wherein, $\gamma=\max_{x}(\log\log x)/\log x$ ( $0.53<\gamma<0.531$ ).

Therefore, $\Pr\{E_{r}(\rho)\}$ $=(1-\rho)^{n_{r}}$ $\leq\exp(-\lceil(\alpha/((1+\gamma)))\cdot\ln t_{r}^{\log\mathrm{e}}\rceil)$ $\leq{t_{r}}^{-\alpha\log\mathrm{e}/(1+\gamma)}$ .

Using Lemma C.1, below we present the detailed steps for obtaining (7) in the proof of Theorem 5.1.

[TABLE]

∎
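To illustrate the probability bound used in this proof, the sketch below evaluates $\Pr\{E_{r}(\rho)\}=(1-\rho)^{n_{r}}$ for one concrete round and compares it with the two upper bounds that appear above. The function names and the example values ( $\rho=0.1$ , $\alpha=0.347$ , $t_{r}=2^{15}$ ) are illustrative choices, and $\log$ is again assumed to be base 2.

```python
import math


def miss_probability(rho, n_arms):
    """Probability that n_arms i.i.d. draws all fall outside the top-rho fraction."""
    return (1.0 - rho) ** n_arms


def arms_in_round(t_r, alpha):
    """n_r = ceil(t_r ** alpha), the number of arms sampled in a round with t_r pulls."""
    return math.ceil(t_r ** alpha)


# Illustrative values (assumptions, not taken from the experiments).
rho, alpha, t_r = 0.1, 0.347, 2 ** 15
n_r = arms_in_round(t_r, alpha)
p_miss = miss_probability(rho, n_r)
gamma = math.log2(math.e) / math.e
poly_bound = t_r ** (-alpha * math.log2(math.e) / (1.0 + gamma))
```

Here `p_miss` is bounded both by `math.exp(-rho * n_r)` and by `poly_bound`, the polynomially decaying quantity that drives the $T^{1-\alpha\log\mathrm{e}/(1+\gamma)}$ term in the lemma.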

Lemma C.3.

For $r^{*}$ defined in Lemma C.1, given that for all $r\geq r^{*}$ , algorithm QUCB-M has encountered at least one arm from $\mathcal{TOP}_{\rho}$ , the incurred regret beyond the round $r^{*}$ is not more than $C^{\prime}\left(MT^{\alpha}+\frac{\sqrt{\log M}}{M}\sqrt{T^{1+3\alpha}\log\frac{T}{M}}\right)$ ; for some constant $C^{\prime}$ .

[TABLE]

for some constants $C_{1},C_{2},C_{3},C_{4}$ and $C_{5}$ .

Appendix D Additional Experimental Results from Section 5.2

For $\alpha=0.205$ the algorithms explore a very small number of arms, which makes it very unlikely that a good arm is encountered and leads to a high regret.

Bibliography


Agrawal, R. The continuum-armed bandit problem. SIAM J. Control Optim., 33(6):1926–1951, 1995.
Agrawal, S. and Goyal, N. Analysis of Thompson sampling for the multi-armed bandit problem. In Proc. of the 25th Annual Conf. on Learning Theory, volume 23, pp. 39.1–39.26, Edinburgh, Scotland, 2012. PMLR.
Agrawal, S. and Goyal, N. Further optimal regret bounds for Thompson sampling. In Proc. AISTATS 2013, volume 31, pp. 99–107. PMLR, 2013.
Aho, A. V., Hopcroft, J. E., and Ullman, J. D. The Design and Analysis of Computer Algorithms. Addison-Wesley, 1974.
Armitage, P. Sequential Medical Trials. Blackwell Scientific Publications, 1960.
Audibert, J.-Y. and Bubeck, S. Minimax policies for adversarial and stochastic bandits. In Proc. COLT 2009, pp. 217–226, 2009.
Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
Berry, D. and Fristedt, B. Bandit Problems: Sequential Allocation of Experiments. Chapman & Hall, 1985.
