Batch Data Processing and Gaussian Two-Armed Bandit

Alexander V. Kolnogorov

arXiv:1704.03631·math.ST·April 13, 2017


TL;DR

This paper analyzes the Gaussian two-armed bandit problem in batch data processing, showing that large packet sizes can be used without significant loss in control performance, especially when processing methods have similar efficiencies.

Contribution

It introduces a model for batch processing in the two-armed bandit framework and quantifies the impact of packet size and method efficiency differences on control risk.

Findings

01

Large packet processing does not significantly increase minimax risk when methods are similarly efficient.

02

Initial small-sized packets can mitigate losses when method efficiencies differ.

03

Control performance remains robust with sufficiently large packet numbers.



Equations

$$f(x|m)=(2\pi)^{-1/2}\exp\left\{-(x-m)^{2}/2\right\}.$$

$$\sigma_{\ell}(y^{n-1},x^{n-1})=\Pr(y_{n}=\ell|y^{n-1},x^{n-1}),$$

$$L_{N}(\sigma,\theta)=N(m_{1}\vee m_{2})-\mathbb{E}_{\sigma,\theta}\left(\sum_{n=1}^{N}\xi_{n}\right)$$

$$\Theta=\{\theta:|m_{1}-m_{2}|\leq 2C\},$$

$$R^{M}_{N}(\Theta)=\inf_{\Sigma}\sup_{\Theta}L_{N}(\sigma,\theta)$$

$$L_{N}(\sigma^{M},\theta)\leq R^{M}_{N}(\Theta)$$

$$\Pr(\xi_{n}=1|y_{n}=\ell)=p_{\ell},\quad\Pr(\xi_{n}=0|y_{n}=\ell)=q_{\ell},$$

$$0.612\leq(DN)^{-1/2}R^{M}_{N}(\Theta)\leq 0.752,$$

$$R^{B}_{N}(\lambda)=\inf_{\Sigma}\int_{\Theta}L_{N}(\sigma,\theta)\lambda(\theta)d\theta$$

$$R^{M}_{N}(\Theta)=R^{B}_{N}(\lambda_{0})=\sup_{\lambda}R^{B}_{N}(\lambda),$$

$$\lambda(m_{1},m_{2}|X_{1},n_{1},X_{2},n_{2})=\frac{f_{n_{1}}(X_{1}|n_{1}m_{1})f_{n_{2}}(X_{2}|n_{2}m_{2})\lambda(m_{1},m_{2})}{p(X_{1},n_{1},X_{2},n_{2})}$$

$$p(X_{1},n_{1},X_{2},n_{2})=\iint\limits_{\Theta}f_{n_{1}}(X_{1}|n_{1}m_{1})f_{n_{2}}(X_{2}|n_{2}m_{2})\lambda(m_{1},m_{2})dm_{1}dm_{2}$$

$$R^{B}_{N-n}(\cdot)=\min(R^{(1)}_{N-n}(\cdot),R^{(2)}_{N-n}(\cdot)),$$

$$\begin{aligned}R^{(1)}_{N-n}(\lambda;X_{1},n_{1},X_{2},n_{2})=\iint\limits_{\Theta}\Big(&M(m_{2}-m_{1})^{+}+\mathbb{E}_{x}^{(1)}R^{B}_{N-(n+M)}(\lambda;X_{1}+x,n_{1}+M,X_{2},n_{2})\Big)\\ &\times\lambda(m_{1},m_{2}|X_{1},n_{1},X_{2},n_{2})dm_{1}dm_{2},\end{aligned}$$

$$\begin{aligned}R^{(2)}_{N-n}(\lambda;X_{1},n_{1},X_{2},n_{2})=\iint\limits_{\Theta}\Big(&M(m_{1}-m_{2})^{+}+\mathbb{E}_{x}^{(2)}R^{B}_{N-(n+M)}(\lambda;X_{1},n_{1},X_{2}+x,n_{2}+M)\Big)\\ &\times\lambda(m_{1},m_{2}|X_{1},n_{1},X_{2},n_{2})dm_{1}dm_{2}\end{aligned}$$

$$\mathbb{E}_{x}^{(\ell)}R(x)=\int\limits_{-\infty}^{+\infty}R(x)f_{M}(x|Mm_{\ell})dx,\quad\ell=1,2.$$

$$R^{B}_{N}(\alpha\lambda+\tilde{\alpha}\tilde{\lambda})\geq\alpha R^{B}_{N}(\lambda)+\tilde{\alpha}R^{B}_{N}(\tilde{\lambda}),$$

$$\nu_{a}(m,v)=\kappa_{a}(m)\rho(v),$$

$$R_{M}(U,n_{1},n_{2})=\min_{\ell=1,2}R^{(\ell)}_{M}(U,n_{1},n_{2}),$$

$$\begin{aligned}R^{(1)}_{M}(U,n_{1},n_{2})&=Mg^{(1)}(U,n_{1},n_{2})+\int\limits_{-\infty}^{\infty}R_{M}(U-x,n_{1}+M,n_{2})f_{Mn_{2}^{2}n^{-1}(n+M)^{-1}}(x)dx,\\ R^{(2)}_{M}(U,n_{1},n_{2})&=Mg^{(2)}(U,n_{1},n_{2})+\int\limits_{-\infty}^{\infty}R_{M}(U-x,n_{1},n_{2}+M)f_{Mn_{1}^{2}n^{-1}(n+M)^{-1}}(x)dx\end{aligned}$$

$$g^{(\ell)}(U,n_{1},n_{2})=\int\limits_{0}^{C}2v\exp\left((-1)^{\ell}2Uv-2v^{2}n_{1}n_{2}n^{-1}\right)\rho(v)dv,$$

$$\lim_{a\to\infty}R^{B}_{N}(\nu_{a}(m,v))=R^{B}_{N}(\rho(v))=4M\int\limits_{0}^{C}v\rho(v)dv+\int\limits_{-\infty}^{\infty}R_{M}(U,M,M)f_{0.5M}(U)dU.$$

$$\Theta_{N}=\{w:w\leq c\}=\{\theta:|m_{1}-m_{2}|\leq 2cN^{-1/2}\}.$$

$$r_{\varepsilon}(u,t_{1},t_{2})=\min_{\ell=1,2}r^{(\ell)}_{\varepsilon}(u,t_{1},t_{2}),$$

$$\begin{aligned}r^{(1)}_{\varepsilon}(u,t_{1},t_{2})&=\varepsilon g^{(1)}(u,t_{1},t_{2})+\int\limits_{-\infty}^{\infty}r_{\varepsilon}(u-x,t_{1}+\varepsilon,t_{2})f_{\varepsilon t_{2}^{2}t^{-1}(t+\varepsilon)^{-1}}(x)dx,\\ r^{(2)}_{\varepsilon}(u,t_{1},t_{2})&=\varepsilon g^{(2)}(u,t_{1},t_{2})+\int\limits_{-\infty}^{\infty}r_{\varepsilon}(u-x,t_{1},t_{2}+\varepsilon)f_{\varepsilon t_{1}^{2}t^{-1}(t+\varepsilon)^{-1}}(x)dx\end{aligned}$$

$$g^{(\ell)}(u,t_{1},t_{2})$$


Taxonomy

Topics: Advanced Bandit Algorithms Research · Data Stream Mining Techniques

Full text

Batch Data Processing

and Gaussian Two-Armed Bandit

Alexander V. Kolnogorov

Yaroslav-the-Wise Novgorod State University, Velikiy Novgorod, 173003 Russia, (e-mail: [email protected]).

Abstract

We consider the two-armed bandit problem as applied to data processing if there are two alternative processing methods available with different a priori unknown efficiencies. One should determine the most effective method and provide its predominant application. Gaussian two-armed bandit describes the batch, and possibly parallel, processing when the same methods are applied to sufficiently large packets of data and accumulated incomes are used for the control. If the number of packets is large enough then such control does not deteriorate the control performance, i.e. does not increase the minimax risk. For example, in case of 50 packets the minimax risk is about 2% larger than that one corresponding to one-by-one optimal processing. However, this is completely true only for methods with close efficiencies because otherwise there may be significant expected losses at the initial stage of control when both actions are applied turn-by-turn. To avoid significant losses at the initial stage of control one should take initial packets of data having smaller sizes.

keywords:

two-armed bandit problem, stochastic robust control, minimax and bayesian approaches, batch processing.

This work was supported in part by the Project Part of the State Assignment in the Field of Scientific Activity by the Ministry of Education and Science of the Russian Federation, project no. 1.949.2014/K.

1 Introduction

We consider the following setting of the two-armed bandit problem (see, e.g., Berry and Fristedt (1985), Presman and Sonin (1990)), which is also well known as the problem of expedient behavior in a random environment (see, e.g., Tsetlin (1973), Varshavsky (1973)) and the problem of adaptive control in a random environment (see, e.g., Sragovich (2006), Nazin and Poznyak (1986)). Let $\xi_{n}$ , $n=1,\dots,N$ be a controlled random process whose values are interpreted as incomes, depend only on the currently chosen actions $y_{n}$ ( $y_{n}\in\{1,2\}$ ) and are normally distributed with probability densities $f(x|m_{\ell})$ if $y_{n}=\ell$ , $\ell=1,2$ , where

$$f(x|m)=(2\pi)^{-1/2}\exp\left\{-(x-m)^{2}/2\right\}.$$

This is the so-called Gaussian (or Normal) two-armed bandit. It is completely described by the vector parameter $\theta=(m_{1},m_{2})$ . The goal is to maximize the total expected income. To this end, one should determine the action corresponding to the larger of $m_{1}$ , $m_{2}$ and ensure its predominant application.

Let’s explain why the Gaussian two-armed bandit is considered. We investigate the problem as applied to the control of data processing when two alternative processing methods are available with different a priori unknown efficiencies. Let $T=NM$ items of data be given which may be processed by either of the two alternative methods. Processing may be successful ( $\zeta_{t}=1$ ) or unsuccessful ( $\zeta_{t}=0$ ). The goal is to maximize the total expected number of successfully processed items of data. Probabilities of successful and unsuccessful processing depend only on the chosen methods (actions), i.e. $\Pr(\zeta_{t}=1|y_{t}=\ell)=p_{\ell}$ , $\Pr(\zeta_{t}=0|y_{t}=\ell)=q_{\ell}$ , $\ell=1,2$ . Assume that $p_{1}$ , $p_{2}$ are close to $p$ ( $0<p<1$ ). We partition all data items into $N$ packets, each containing $M$ data items. For data processing in each packet we use the same method. Note that data in the same packet may be processed in parallel. For control we use the values of the process $\xi_{n}=(DM)^{-1/2}\displaystyle{\sum_{t=(n-1)M+1}^{nM}}\zeta_{t}$ , $n=1,\dots,N$ with $D=p(1-p)$ . According to the central limit theorem, the distributions of $\xi_{n}$ , $n=1,\dots,N$ are close to Gaussian and their variances are close to unity, just as in the setup considered above.
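The normalization of packet incomes can be illustrated by a small simulation. The following is a minimal sketch under stated assumptions: the function name `packet_incomes`, the seed, and the values of `M`, `N` and `p` are illustrative choices, not taken from the paper. It draws Bernoulli outcomes $\zeta_{t}$ packet by packet and forms $\xi_{n}=(DM)^{-1/2}\sum\zeta_{t}$ ; the sample variance should come out close to one, and the sample mean close to $p\,(M/D)^{1/2}$ .

```python
import math
import random
import statistics

def packet_incomes(p_arm, p_ref, M, N, rng):
    """Simulate N packets of M Bernoulli(p_arm) items and return the
    normalized incomes xi_n = (D*M)**-0.5 * (successes in packet n),
    with D = p_ref * (1 - p_ref) as in the text."""
    D = p_ref * (1.0 - p_ref)
    scale = 1.0 / math.sqrt(D * M)
    incomes = []
    for _ in range(N):
        successes = sum(1 for _ in range(M) if rng.random() < p_arm)
        incomes.append(scale * successes)
    return incomes

rng = random.Random(0)  # fixed seed: an arbitrary illustrative choice
xi = packet_incomes(p_arm=0.5, p_ref=0.5, M=200, N=2000, rng=rng)
sample_var = statistics.variance(xi)          # should be close to 1
sample_mean = statistics.fmean(xi)            # should be close to expected_mean
expected_mean = 0.5 * math.sqrt(200 / 0.25)   # m = p * sqrt(M / D)
```

In these units the two methods differ in mean income by $(p_{1}-p_{2})(M/D)^{1/2}$ , so larger packets make a fixed difference in success probabilities easier to detect per step.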

Remark 1

Parallel control in the two-armed bandit problem was first proposed for treating a large group of patients with either of two alternative drugs with different unknown efficiencies. Clearly, the doctor cannot treat the patients sequentially one-by-one. Say, if the result of the treatment becomes manifest in a week and there are a thousand patients, then one-by-one treatment would take about twenty years. Therefore, it was proposed to give both drugs to sufficiently large groups of patients and then to give the more effective one to the rest of them. As a result, the entire treatment takes two weeks. The discussion and bibliography of the problem as applied to medical trials can be found, for example, in Lai et al (1980); Cheng (2003).

Control strategy $\sigma$ at the point of time $n$ assigns a random choice of the action $y_{n}$ depending on the current history of the process, i.e. responses $x^{n-1}=x_{1},\dots,x_{n-1}$ to applied actions $y^{n-1}=y_{1},\dots,y_{n-1}$ :

$$\sigma_{\ell}(y^{n-1},x^{n-1})=\Pr(y_{n}=\ell|y^{n-1},x^{n-1}),$$

$\ell=1,2$ . The set of strategies is denoted by $\Sigma$ .

Recall that the goal is to maximize (in some sense) the total expected income. Therefore, if parameter $\theta$ is known then the optimal strategy should always apply the action corresponding to the largest value of $m_{1}$ , $m_{2}$ . The total expected income would thus be equal to $N(m_{1}\vee m_{2})$ . If the parameter is unknown then the loss function

$$L_{N}(\sigma,\theta)=N(m_{1}\vee m_{2})-\mathbb{E}_{\sigma,\theta}\left(\sum_{n=1}^{N}\xi_{n}\right)$$

is equal to the expected loss of total income relative to its maximal possible value. Here $\mathbb{E}_{\sigma,\theta}$ denotes the mathematical expectation with respect to the measure generated by strategy $\sigma$ and parameter $\theta$ . The set of parameters is assumed to be

$$\Theta=\{\theta:|m_{1}-m_{2}|\leq 2C\},$$

where $0<C<\infty$ . Here restriction $C<\infty$ ensures the boundedness of the loss function on $\Theta$ .

According to the minimax approach the maximal value of the loss function on the set of parameters $\Theta$ should be minimized on the set of strategies $\Sigma$ . The value

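$$R^{M}_{N}(\Theta)=\inf_{\Sigma}\sup_{\Theta}L_{N}(\sigma,\theta)\tag{2}$$

is called the minimax risk and corresponding strategy $\sigma^{M}$ is called the minimax strategy. Note that if strategy $\sigma^{M}$ is applied then the following inequality holds

$$L_{N}(\sigma^{M},\theta)\leq R^{M}_{N}(\Theta)$$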

for all $\theta\in\Theta$ and this implies robustness of the control.

The minimax approach to the problem was proposed in Robbins (1952) and generated considerable interest. The classic object of most of the resulting articles was the so-called Bernoulli two-armed bandit, which can be described by the distribution

$$\Pr(\xi_{n}=1|y_{n}=\ell)=p_{\ell},\quad\Pr(\xi_{n}=0|y_{n}=\ell)=q_{\ell},$$

$p_{\ell}+q_{\ell}=1$ , $\ell=1,2$ . It can be described by a parameter $\theta=(p_{1},p_{2})$ with the set of values $\Theta=\{\theta:0\leq p_{\ell}\leq 1;\ell=1,2\}$ . It was shown in Fabius and van Zwet (1970) that explicit determination of the minimax strategy and minimax risk is virtually impossible already for $N\geq 5$ . However, an asymptotic minimax theorem was proved in Vogel (1960) using some indirect techniques. This theorem states that the following estimates hold as $N\to\infty$ :

$$0.612\leq(DN)^{-1/2}R^{M}_{N}(\Theta)\leq 0.752,\tag{3}$$

where $D=0.25$ is the maximal variance of one-step income. The lower estimate presented here was obtained in Bather (1983). The maximal value of expected losses corresponds to $|p_{1}-p_{2}|\approx 3.78(D/N)^{1/2}$ with the additional requirement that $p_{1}$ , $p_{2}$ are close to $0.5$ .
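To get a feeling for the scale of these asymptotic estimates, they can be evaluated for a concrete horizon. The sketch below is only a worked arithmetic example: the function names and the choice $N=10000$ are assumptions made for illustration, while the constants 0.612, 0.752 and 3.78 are those quoted in the text.

```python
import math

def asymptotic_risk_bounds(N, D=0.25):
    """Lower and upper asymptotic estimates of the minimax risk,
    i.e. 0.612 * (D*N)**0.5 and 0.752 * (D*N)**0.5."""
    scale = math.sqrt(D * N)
    return 0.612 * scale, 0.752 * scale

def worst_case_gap(N, D=0.25):
    """Approximate gap |p1 - p2| at which expected losses are maximal."""
    return 3.78 * math.sqrt(D / N)

lower, upper = asymptotic_risk_bounds(N=10_000)   # about 30.6 and 37.6
gap = worst_case_gap(N=10_000)                    # about 0.0189
```

For $N=10000$ the hardest case is thus a difference of under two percentage points in success probability, with a maximal expected loss of a few dozen incomes out of ten thousand.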

Remark 2

There are several other approaches to robust control in the two-armed bandit problem, see, e.g., Nazin and Poznyak (1986); Lugosi and Cesa-Bianchi (2006); Juditsky et al (2008); Gasnikov et al (2015). In these articles, other ideas such as the stochastic approximation method and the mirror descent algorithm are used for the control. The order of the minimax risk for these algorithms is $N^{1/2}$ or close to $N^{1/2}$ .

Another very popular approach to the problem is a Bayesian one. Let $\lambda(\theta)=\lambda(m_{1},m_{2})$ be some prior probability density. The value

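$$R^{B}_{N}(\lambda)=\inf_{\Sigma}\int_{\Theta}L_{N}(\sigma,\theta)\lambda(\theta)d\theta\tag{4}$$

is called the Bayesian risk and the corresponding optimal strategy is called the Bayesian strategy. The Bayesian approach makes it possible to find the Bayesian strategy and risk by solving a recursive Bellman-type equation. Minimax risk (2) and Bayesian risk (4) are related by the main theorem of the theory of games as follows:

$$R^{M}_{N}(\Theta)=R^{B}_{N}(\lambda_{0})=\sup_{\lambda}R^{B}_{N}(\lambda),\tag{5}$$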

where $\lambda_{0}$ is called the worst-case prior distribution.

The goal of this paper is to present the approach based on the main theorem of the theory of games. This approach makes it possible to determine the minimax strategy and minimax risk explicitly by solving an appropriate Bellman-type recursive equation and finding the worst-case prior distribution according to (5). This in turn makes it possible to evaluate the control performance. In particular, it turns out that in the case of close mathematical expectations $m_{1}$ , $m_{2}$ batching of data hardly enlarges the maximal expected losses if the number of packets is large enough, e.g. if the number of packets is 50 or larger. Therefore, say, $50000$ items of data may be processed in 50 steps by packets of 1000 items with almost the same maximal losses as if the data were processed optimally one-by-one. To be more precise, the maximal expected losses in the case of batch processing in 50 steps are about 2% larger than in the case of optimal one-by-one processing. However, in the case of distant expectations there may be large expected losses at the initial stage of control when actions are applied turn-by-turn. To reduce the losses at the initial stage, one should reduce the sizes of the corresponding packets. An example is given in Section 5.

The structure of the paper is the following. In section 2 we present the Bellman-type recursive equation which allows to determine explicitly Bayesian strategy and risk for any prior distribution. In section 3 properties of the worst-case prior distribution are investigated and this allows to simplify the recursive Bellman-type equation significantly. In section 4 we obtain invariant recursive Bellman-type equation with unity control horizon and its limiting description by the second order partial differential equation. In section 5 we find minimax risks numerically. Section 6 contains a conclusion. Note that some results are presented in Kolnogorov (2010), Kolnogorov (2012), Kolnogorov (2015). Here we combine and compare them.

2 Recursive equation for determination of Bayesian strategy and risk

Bayesian strategy and risk can be calculated recursively. Let history of control up to the point of time $n$ be described by $(X_{1},n_{1},X_{2},n_{2})$ . Here $n_{1}$ , $n_{2}$ are total numbers of applications of both actions ( $n_{1}+n_{2}=n$ ) and $X_{1}$ , $X_{2}$ are corresponding total incomes. Let $X_{\ell}=0$ if $n_{\ell}=0$ . Denote by $f_{D}(x|m)=(2\pi D)^{-1/2}\exp\left\{-(x-m)^{2}/(2D)\right\}$ the Gaussian probability density. The posterior distribution density is thus equal to

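$$\lambda(m_{1},m_{2}|X_{1},n_{1},X_{2},n_{2})=\frac{f_{n_{1}}(X_{1}|n_{1}m_{1})f_{n_{2}}(X_{2}|n_{2}m_{2})\lambda(m_{1},m_{2})}{p(X_{1},n_{1},X_{2},n_{2})}$$

with

$$p(X_{1},n_{1},X_{2},n_{2})=\iint\limits_{\Theta}f_{n_{1}}(X_{1}|n_{1}m_{1})f_{n_{2}}(X_{2}|n_{2}m_{2})\lambda(m_{1},m_{2})dm_{1}dm_{2}.\tag{6}$$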

If it is assumed additionally that $f_{n}(X|nm)=1$ at $n=0$ then this expression holds true if $n_{1}=0$ and/or $n_{2}=0$ as well.
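The posterior update can be sketched numerically. The example below is a minimal illustration that assumes a discrete prior on finitely many parameter values $(m_{1},m_{2})$ instead of a density, so the integral over $\Theta$ becomes a sum; the function names and the numeric values are hypothetical and chosen only for the example.

```python
import math

def gauss_density(x, mean, var):
    """Gaussian density f_D(x|m) with mean m and variance D = var."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def likelihood(X, n, m):
    """f_n(X | n*m): density of the total income X after n applications.
    Set to 1 when n == 0, following the convention in the text."""
    return 1.0 if n == 0 else gauss_density(X, n * m, n)

def posterior(prior, X1, n1, X2, n2):
    """prior: dict {(m1, m2): probability}. Returns the posterior dict."""
    weights = {
        theta: likelihood(X1, n1, theta[0]) * likelihood(X2, n2, theta[1]) * w
        for theta, w in prior.items()
    }
    total = sum(weights.values())  # discrete analogue of p(X1, n1, X2, n2)
    return {theta: w / total for theta, w in weights.items()}

# Symmetric two-point prior (an assumption made for this example).
prior = {(0.1, -0.1): 0.5, (-0.1, 0.1): 0.5}
post = posterior(prior, X1=3.0, n1=10, X2=-1.0, n2=10)
no_data = posterior(prior, X1=0.0, n1=0, X2=0.0, n2=0)  # equals the prior
```

For this symmetric prior with $n_{1}=n_{2}$ the posterior odds depend on the data only through $X_{1}-X_{2}$ : they equal $\exp(2v(X_{1}-X_{2}))$ with $v=0.1$ .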

In the sequel, we consider strategies which apply each chosen action $M$ times. For the sake of simplicity we assume that $N$ is a multiple of $M$ . If incomes arise sequentially one-by-one, these strategies make it possible to switch actions less often. If incomes arise by packets, these strategies allow their parallel processing. Denote by $R^{B}_{N-n}(\lambda;X_{1},n_{1},X_{2},n_{2})$ the Bayesian risk over the remaining $(N-n)$ steps calculated with respect to the posterior distribution density $\lambda(m_{1},m_{2}|X_{1},n_{1},X_{2},n_{2})$ . Let $x^{+}=\max(x,0)$ . Then

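$$R^{B}_{N-n}(\cdot)=\min(R^{(1)}_{N-n}(\cdot),R^{(2)}_{N-n}(\cdot)),\tag{7}$$

where $R^{(1)}_{0}(\cdot)=R^{(2)}_{0}(\cdot)=0$ if $n_{1}+n_{2}=N$ ,

$$\begin{aligned}R^{(1)}_{N-n}(\lambda;X_{1},n_{1},X_{2},n_{2})=\iint\limits_{\Theta}\Big(&M(m_{2}-m_{1})^{+}+\mathbb{E}_{x}^{(1)}R^{B}_{N-(n+M)}(\lambda;X_{1}+x,n_{1}+M,X_{2},n_{2})\Big)\\ &\times\lambda(m_{1},m_{2}|X_{1},n_{1},X_{2},n_{2})dm_{1}dm_{2},\\ R^{(2)}_{N-n}(\lambda;X_{1},n_{1},X_{2},n_{2})=\iint\limits_{\Theta}\Big(&M(m_{1}-m_{2})^{+}+\mathbb{E}_{x}^{(2)}R^{B}_{N-(n+M)}(\lambda;X_{1},n_{1},X_{2}+x,n_{2}+M)\Big)\\ &\times\lambda(m_{1},m_{2}|X_{1},n_{1},X_{2},n_{2})dm_{1}dm_{2}\end{aligned}$$

if $n_{1}+n_{2}<N$ where

$$\mathbb{E}_{x}^{(\ell)}R(x)=\int\limits_{-\infty}^{+\infty}R(x)f_{M}(x|Mm_{\ell})dx,\quad\ell=1,2.$$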

The Bayesian strategy prescribes choosing, at the current step, the action corresponding to the smaller of $R^{(1)}_{N-n}(\cdot)$ , $R^{(2)}_{N-n}(\cdot)$ ; the choice may be arbitrary if these values are equal.

3 Description of the Worst-Case Prior and Corresponding Recursive Equation

A direct use of the main theorem of the theory of games is virtually impossible because of the high computational complexity. In this section, we’ll specify the properties of the worst-case prior which make it possible to simplify equations (7)-(9) significantly. These properties are based on the following inequality

$$R^{B}_{N}(\alpha\lambda+\tilde{\alpha}\tilde{\lambda})\geq\alpha R^{B}_{N}(\lambda)+\tilde{\alpha}R^{B}_{N}(\tilde{\lambda}),$$

if $\alpha+\tilde{\alpha}=1$ ; $\alpha,\tilde{\alpha}>0$ , i.e. Bayesian risk is a concave function of the prior distribution density.

This property allows to specify the worst-case prior distribution. Like in Kolnogorov (2010), one can prove that the following transformations $\tilde{\lambda}$ of the prior distribution density $\lambda$ do not change the Bayesian risk, i.e. $R^{B}_{N}(\tilde{\lambda})=R^{B}_{N}(\lambda)$ :

1. $\tilde{\lambda}^{(1)}(m_{1},m_{2})=\lambda(m_{2},m_{1})$ (for all $m_{1}$ , $m_{2}$ ). This property means that expected losses do not change if one swaps the arms of the bandit.

2. $\tilde{\lambda}^{(2)}(m_{1},m_{2})=\lambda(m_{1}+c,m_{2}+c)$ (for all $m_{1}$ , $m_{2}$ and any fixed $c$ ). This property means that expected losses do not change if one equally shifts both mathematical expectations.

So, by concavity, if $\lambda$ is the worst-case prior distribution then $\alpha\lambda+\tilde{\alpha}\tilde{\lambda}$ is the worst-case prior as well. It means that the worst-case prior distribution can be chosen so that it does not change under the above transformations. In the sequel, it is convenient to modify the parameterization. Let’s put $m_{1}=m+v$ , $m_{2}=m-v$ , then $\theta=(m+v,m-v)$ and $\Theta=\{\theta:|v|\leq C\}$ . Taking into account the Jacobian $|\partial(m_{1},m_{2})/\partial(m,v)|=2$ , a prior distribution density is equal to $\nu(m,v)=2\lambda(m+v,m-v)$ . Then the transformations of the prior distribution densities $\tilde{\nu}^{(1)}(m,v)=\nu(m,-v)$ and $\tilde{\nu}^{(2)}(m,v)=\nu(m+c,v)$ (for any fixed $c$ ) do not change the value of the Bayesian risk. These properties make it possible to specify the worst-case prior. Namely, asymptotically the worst-case prior distribution density can be chosen as follows:

$$\nu_{a}(m,v)=\kappa_{a}(m)\rho(v),\tag{10}$$

where $\kappa_{a}(m)$ is the uniform density on the interval $|m|\leq a$ , $\rho(v)$ is a symmetric density (i.e. $\rho(-v)=\rho(v)$ ) on the interval $|v|\leq C$ and $a\to\infty$ . This prior does not change under the first transformation and asymptotically (as $a\to\infty$ ) does not change under the second transformation.
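The Jacobian used in the reparameterization $m_{1}=m+v$ , $m_{2}=m-v$ above can be checked directly as a short worked step:

$$\left|\frac{\partial(m_{1},m_{2})}{\partial(m,v)}\right|=\left|\det\begin{pmatrix}1&1\\ 1&-1\end{pmatrix}\right|=|-2|=2,$$

so that $\lambda(m_{1},m_{2})\,dm_{1}dm_{2}=2\lambda(m+v,m-v)\,dm\,dv=\nu(m,v)\,dm\,dv$ , which is why the factor 2 appears in the density $\nu$ .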

Now let’s write the dynamic programming equation for calculating the Bayesian risk with respect to (10). These equations follow from (7)-(9) if the prior distribution density is formally assumed to be constant with respect to $m$ ; this gives correct expressions for the posterior densities if $n_{1}\geq M$ , $n_{2}\geq M$ . At the first two steps actions should be chosen turn-by-turn. At the time point $n=n_{1}+n_{2}$ the control is completely determined by the triple $(U,n_{1},n_{2})$ with $U=(X_{1}n_{2}-X_{2}n_{1})n^{-1}$ .
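One way to see why the statistic $U$ is compatible with a prior that is flat in $m$ is that $U$ does not change when both mathematical expectations are shifted by the same constant. The sketch below checks this numerically; the function names and the numbers are hypothetical and serve only as an illustration.

```python
def u_statistic(X1, n1, X2, n2):
    """U = (X1*n2 - X2*n1) / n with n = n1 + n2."""
    return (X1 * n2 - X2 * n1) / (n1 + n2)

def shift_incomes(X1, n1, X2, n2, c):
    """Total incomes if every single income had been larger by c,
    i.e. if both means m1 and m2 were shifted by the same constant c."""
    return X1 + c * n1, X2 + c * n2

X1, n1, X2, n2 = 12.0, 30, 7.5, 20
u_before = u_statistic(X1, n1, X2, n2)
Y1, Y2 = shift_incomes(X1, n1, X2, n2, c=4.0)
u_after = u_statistic(Y1, n1, Y2, n2)   # equal to u_before
```

The shift contributes $c(n_{1}n_{2}-n_{2}n_{1})n^{-1}=0$ to $U$ , so only the difference between the two methods is retained.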

Theorem 1

Let’s put $f_{D}(x)=f_{D}(x|0)$ . The strategy at the initial stage $n\leq 2M$ applies actions turn-by-turn. In the sequel it can be determined by solving the recursive Bellman-type equation:

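$$R_{M}(U,n_{1},n_{2})=\min_{\ell=1,2}R^{(\ell)}_{M}(U,n_{1},n_{2}),\tag{11}$$

where $R^{(1)}_{M}(U,n_{1},n_{2})=R^{(2)}_{M}(U,n_{1},n_{2})=0$ if $n_{1}+n_{2}=N$ and

$$\begin{aligned}R^{(1)}_{M}(U,n_{1},n_{2})&=Mg^{(1)}(U,n_{1},n_{2})+\int\limits_{-\infty}^{\infty}R_{M}(U-x,n_{1}+M,n_{2})f_{Mn_{2}^{2}n^{-1}(n+M)^{-1}}(x)dx,\\ R^{(2)}_{M}(U,n_{1},n_{2})&=Mg^{(2)}(U,n_{1},n_{2})+\int\limits_{-\infty}^{\infty}R_{M}(U-x,n_{1},n_{2}+M)f_{Mn_{1}^{2}n^{-1}(n+M)^{-1}}(x)dx\end{aligned}\tag{12}$$

if $n_{1}+n_{2}<N$ . Here

$$g^{(\ell)}(U,n_{1},n_{2})=\int\limits_{0}^{C}2v\exp\left((-1)^{\ell}2Uv-2v^{2}n_{1}n_{2}n^{-1}\right)\rho(v)dv,$$

$\ell=1,2$ . If $n>2M$ then the $\ell$ -th action is currently optimal iff $R^{(\ell)}_{M}(U,n_{1},n_{2})$ has the smaller value ( $\ell=1,2$ ). The corresponding Bayesian risk (4) is calculated as follows:

$$\lim_{a\to\infty}R^{B}_{N}(\nu_{a}(m,v))=R^{B}_{N}(\rho(v))=4M\int\limits_{0}^{C}v\rho(v)dv+\int\limits_{-\infty}^{\infty}R_{M}(U,M,M)f_{0.5M}(U)dU.\tag{17}$$

The proof of the theorem is presented in Appendix A.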

4 Invariant Recursive Equation and Passage to the Limit

Let’s introduce the following change of variables $\varepsilon=MN^{-1}$ , $t_{1}=n_{1}N^{-1}$ , $t_{2}=n_{2}N^{-1}$ , $t=nN^{-1}$ , $u=UN^{-1/2}$ , $w=vN^{1/2}$ , $c=CN^{1/2}$ , $\varrho(w)=N^{1/2}\rho(v)$ , $r_{\varepsilon}(u,t_{1},t_{2})=N^{-1/2}R_{M}(U,n_{1},n_{2})$ , $r^{(\ell)}_{\varepsilon}(u,t_{1},t_{2})=N^{-1/2}R^{(\ell)}_{M}(U,n_{1},n_{2})$ . Now we consider the set of close expectations

$$\Theta_{N}=\{w:w\leq c\}=\{\theta:|m_{1}-m_{2}|\leq 2cN^{-1/2}\}.$$

Recall that according to (3) the maximal expected losses in the two-armed bandit problem have the order $N^{1/2}$ and are attained just for close expectations with $c>0$ large enough. On the contrary, the maximal expected losses for distant expectations $|m_{1}-m_{2}|\geq\delta>0$ have the order $\log(N)$ . This estimate follows from the results of Lai et al (1980).
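The change of variables introduced at the start of this section can be written out as a small helper. This is a sketch only: the class and function names and the numeric inputs are hypothetical, and the formulas are exactly the substitutions $\varepsilon=MN^{-1}$ , $t_{\ell}=n_{\ell}N^{-1}$ , $u=UN^{-1/2}$ , $w=vN^{1/2}$ , $c=CN^{1/2}$ .

```python
import math
from typing import NamedTuple

class Normalized(NamedTuple):
    eps: float  # relative packet size M / N
    t1: float   # share of the horizon spent on action 1
    t2: float   # share of the horizon spent on action 2
    u: float    # normalized statistic U * N**-0.5
    w: float    # normalized half-difference of means v * N**0.5
    c: float    # normalized bound C * N**0.5

def normalize(M, N, n1, n2, U, v, C):
    """Map (M, N, n1, n2, U, v, C) to the invariant variables."""
    root = math.sqrt(N)
    return Normalized(eps=M / N, t1=n1 / N, t2=n2 / N,
                      u=U / root, w=v * root, c=C * root)

# Example: horizon of 5000 steps in packets of 100, i.e. 50 stages.
state = normalize(M=100, N=5000, n1=100, n2=100, U=14.0, v=0.02, C=0.05)
```

Note that the product $uw$ equals $Uv$ , so the exponent $2Uv$ in $g^{(\ell)}$ keeps its form in the new variables, and 50 stages correspond to $\varepsilon=0.02$ .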

For close expectations the following theorem holds.

Theorem 2

The strategy at the initial stage $t\leq 2\varepsilon$ $(n\leq 2\varepsilon N)$ applies actions turn-by-turn. Then it can be determined by solving the following recursive Bellman-type equation:

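$$r_{\varepsilon}(u,t_{1},t_{2})=\min_{\ell=1,2}r^{(\ell)}_{\varepsilon}(u,t_{1},t_{2}),\tag{18}$$

where $r^{(1)}_{\varepsilon}(u,t_{1},t_{2})=r^{(2)}_{\varepsilon}(u,t_{1},t_{2})=0$ if $t_{1}+t_{2}=1$ and

$$\begin{aligned}r^{(1)}_{\varepsilon}(u,t_{1},t_{2})&=\varepsilon g^{(1)}(u,t_{1},t_{2})+\int\limits_{-\infty}^{\infty}r_{\varepsilon}(u-x,t_{1}+\varepsilon,t_{2})f_{\varepsilon t_{2}^{2}t^{-1}(t+\varepsilon)^{-1}}(x)dx,\\ r^{(2)}_{\varepsilon}(u,t_{1},t_{2})&=\varepsilon g^{(2)}(u,t_{1},t_{2})+\int\limits_{-\infty}^{\infty}r_{\varepsilon}(u-x,t_{1},t_{2}+\varepsilon)f_{\varepsilon t_{1}^{2}t^{-1}(t+\varepsilon)^{-1}}(x)dx\end{aligned}\tag{19}$$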

if $t_{1}+t_{2}<1$ . Here

[TABLE]

$\ell=1,2$ . If $t>2\varepsilon$ $(n>2\varepsilon N)$ then the $\ell$ -th action is currently optimal iff $r^{(\ell)}_{\varepsilon}(u,t_{1},t_{2})$ has smaller value ( $\ell=1,2$ ). Bayesian risk corresponding to the worst-case prior distribution is calculated according to the formula

[TABLE]

Proof. The result follows by applying the change of variables described above.

Let’s denote by $r_{\varepsilon}(\varrho;u,t_{1},t_{2})$ the Bayesian risk as dependent on a prior distribution $\varrho(w)$ . Obviously, $r_{\varepsilon}(\varrho;u,t_{1},t_{2})$ decreases as $\varepsilon$ decreases for any fixed $u$ , $t_{1}$ , $t_{2}$ because a smaller $\varepsilon$ implies that actions may be changed more often. The following theorem is given without proof.

Theorem 3

For all $u$ , $t_{1}$ , $t_{2}$ , for which the solution to equation (18)-(20) is well defined, there exist limits $r(\varrho;u,t_{1},t_{2})=\lim\limits_{\varepsilon\to 0}r_{\varepsilon}(\varrho;u,t_{1},t_{2})$ which can be extended by continuity to all $u,\,t_{1},\,t_{2}\,(t_{1}>0,\,t_{2}>0,\,t_{1}+t_{2}<1)$ . These limits are uniformly bounded and satisfy Lipschitz conditions in $u$ . The minimax risk on the set of close expectations $\Theta_{N}=\{|m_{1}-m_{2}|\leq 2cN^{-1/2}\}$ satisfies the equality

[TABLE]

where $r(\varrho;0,0,0)=\lim\limits_{\varepsilon\to 0}r(\varrho;0,\varepsilon,\varepsilon)$ .

Let’s present the limiting description of $r(u,t_{1},t_{2})$ by the second order partial differential equation. Assume that $r_{\varepsilon}(u,t_{1},t_{2})$ has continuous partial derivatives of proper orders. We present $r_{\varepsilon}(u-x,t_{1}+\varepsilon,t_{2})$ as Taylor series:

[TABLE]

Noting that

[TABLE]

and substituting (26) into (19) one obtains

[TABLE]

Similarly,

[TABLE]

Recall now that equations (31)-(35) must be complemented by equation (18) which can be written as

[TABLE]

From (31)-(36) one obtains (as $\varepsilon\downarrow 0$ ) the partial differential equation:

[TABLE]

with $\overline{\ell}=3-\ell$ . Initial and boundary conditions take the form

[TABLE]

The optimal strategy prescribes choosing the $\ell$ -th action if the $\ell$ -th member in the left-hand side of (37) has the minimal value.

5 Numerical Results

Bayesian risks were calculated by (18)-(22) with $\varepsilon=0.02$ . It was assumed that the worst-case prior $\varrho(w)$ is a degenerate one and concentrated at two points $w=\pm d$ with equal probabilities 0.5. The risks are presented by line 1 on figure 1 as a function of $d$ . The worst-case prior corresponds to its maximum. The maximum is approximately equal to $0.65$ at $d\approx 1.6$ .

Expected losses corresponding to the determined strategy $\sigma_{\ell}(u,t_{1},t_{2})=\Pr(y_{n}=\ell|u,t_{1},t_{2})$ were found by solving the recursive equation

[TABLE]

where

[TABLE]

if $t_{1}+t_{2}=1$ and then

[TABLE]

if $t_{1}+t_{2}<1$ . Then

[TABLE]

The losses are presented by line 2 in figure 1. One can see that their maximal value does not exceed 0.65, which confirms the assumption concerning the worst-case prior. Nevertheless, one can see that expected losses become larger than 0.65 if $d>16$ . This is caused by the initial stage of control where both actions are equally applied. In figure 1, lines 3 and 4 present the risks and expected losses without the contribution of the initial stage. These functions do not grow with growing $d$ . Therefore, to reduce expected losses at large $d$ one should shorten the initial stage of control.

To obtain the limiting value of the minimax risk (23) calculations of $r(\varrho;u,t_{1},t_{2})$ , as a function of $d$ , were implemented according to (18), (31), (35), (40) with $\varepsilon=0.001$ for $|u|\leq 2.3$ . Partial derivatives were replaced by partial differences with $\Delta u=0.023$ , $\Delta t=2000^{-1}$ . It was assumed that $\varrho(w)$ is a degenerate distribution density concentrated at two points $w=\pm d$ . For $0.5\leq d\leq 2.5$ maximum of $2d\varepsilon+r(\varrho;0,\varepsilon,\varepsilon)$ was approximately equal to 0.637 at $d\approx 1.57$ . Hence, the minimax risk corresponding to batch processing in 50 stages is approximately 2% larger than the limiting value.
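The 2% figure follows from the two reported maxima by simple arithmetic, as the short check below shows. The denormalization for $T=50000$ items with $p=0.5$ is an added illustration that assumes the normalization $(DT)^{-1/2}L_{T}$ used for the simulations; the variable names are hypothetical.

```python
import math

batch_risk = 0.65    # reported maximum for eps = 0.02 (50 stages)
limit_risk = 0.637   # reported limiting value of the minimax risk
relative_excess = batch_risk / limit_risk - 1.0

# Illustrative denormalization: T = 50000 items, p = 0.5, so D = 0.25.
T, D = 50_000, 0.25
max_loss_items = batch_risk * math.sqrt(D * T)

print(f"relative excess: {100 * relative_excess:.1f}%")
print(f"maximal expected loss: about {max_loss_items:.0f} of {T} items")
```

Under these assumptions, batch processing in 50 stages would cost at most about 73 expected successes out of 50000 items in the worst case of close efficiencies.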

Monte-Carlo simulations were implemented for batch processing of $T=5000$ items of data by packets of $M=100$ data items, i.e. in 50 stages. The normalized expected losses $(DT)^{-1/2}L_{T}(\sigma,\theta)$ with $\theta=(p+d(D/T)^{1/2},p-d(D/T)^{1/2})$ , $p=0.5$ , $D=0.25$ were calculated as a function of $d$ . This function practically coincides with line 2 in figure 1, which is why it is not plotted separately.
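A simulation of this kind can be sketched as follows. Important assumption: the decision rule below is a simple "follow the empirical leader" rule used as a stand-in, not the Bayesian strategy computed in the paper, so the resulting numbers are not expected to reproduce line 2 of figure 1. The function name, the number of runs and the seed are also illustrative choices.

```python
import math
import random

def simulate_losses(d, T=5000, M=100, p=0.5, runs=100, seed=1):
    """Average normalized loss (D*T)**-0.5 * L_T under batch processing,
    using a hypothetical leader-following rule (NOT the paper's strategy)."""
    D = p * (1.0 - p)
    delta = d * math.sqrt(D / T)
    probs = (p + delta, p - delta)      # index 0 is the better method
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        successes = [0, 0]
        counts = [0, 0]
        for stage in range(T // M):
            if stage < 2:
                arm = stage             # initial stage: methods turn-by-turn
            else:
                rate0 = successes[0] / counts[0]
                rate1 = successes[1] / counts[1]
                arm = 0 if rate0 >= rate1 else 1
            # process one packet of M items with the chosen method
            successes[arm] += sum(1 for _ in range(M) if rng.random() < probs[arm])
            counts[arm] += M
        # expected loss given the allocation: gap times items sent to method 2
        total += (probs[0] - probs[1]) * counts[1]
    return total / (runs * math.sqrt(D * T))

d = 1.6
loss = simulate_losses(d)
```

Because one packet is always given to the worse method at the initial stage, the normalized loss of any such rule is at least $2d\varepsilon$ with $\varepsilon=M/T$ , which is the initial-stage term appearing in the numerical results above.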

6 Conclusion

The minimax approach to the two-armed bandit problem based on the main theorem of the theory of games is proposed. Incomes of the two-armed bandit are assumed to have Gaussian distributions and this implies the possibility of their batch processing. The approach allows to determine numerically minimax strategy and minimax risk for any finite control horizon by solving Bellman-type recursive equation. However, the results have an asymptotic nature because they should be applied to batch processing a large amount of data by packets in a moderate number of stages. At the initial stage of control, there may be large expected losses because at initial stage actions are chosen turn-by-turn. To reduce losses at the initial stage one should take initial packets of data having smaller sizes.

Appendix A Proof of Theorem 1

Proof. Let’s put

[TABLE]

with $p(X_{1},n_{1},X_{2},n_{2})$ defined in (6). Denote $\hat{R}(Z,n_{1},n_{2})=\hat{R}(X_{1},n_{1},X_{2},n_{2})$ with $Z=X_{1}n_{2}-X_{2}n_{1}$ . Let’s check that if the prior is given by (10) then (8) takes the form

[TABLE]

with

[TABLE]

and

[TABLE]

Indeed, if the prior is taken from (10) then (8) takes the form

[TABLE]

Here

[TABLE]

and

[TABLE]

So, these expressions correspond to (42)-(43). Note that at $n_{1}+M$ , $n_{2}$ the value $Z$ is recalculated by expression $Z\leftarrow(X_{1}+x)n_{2}-X_{2}(n_{1}+M)=Z+z$ with $z=xn_{2}-MX_{2}$ . Noting that $MX_{1}-n_{1}x=n_{2}^{-1}(ZM-n_{1}z)$ and changing the integration variable in (44) from $x$ to $z$ one obtains (41).

Now let’s put $\hat{R}(Z,n_{1},n_{2})=f_{n_{1}n_{2}n}(Z)R(U,n_{1},n_{2})$ . The first equation (12) may be obtained from (41) and equality

[TABLE]

The second equation (12) is similarly checked. Obviously, Bayesian risk (4) is calculated according to the formula

[TABLE]

and hence by (17).

Bibliography

1. Bather, J. A. (1983). The minimax risk for the two-armed bandit problem. Lecture Notes in Statistics, volume 20, 1–11. Springer-Verlag, New York.
2. Berry, D. A., and Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, London, New York.
3. Cheng, T., Su, Y., and Berry, D. A. (2003). Choosing sample size for a clinical trial using decision analysis. Biometrika, 90, 923–936.
4. Fabius, J., and van Zwet, W. R. (1970). Some remarks on the two-armed bandit. Ann. Math. Statist., 41, 1906–1916.
5. Gasnikov, A. V., Nesterov, Yu. E., and Spokoiny, V. G. (2015). On the efficiency of a randomized mirror descent algorithm in online optimization problems. Computational Mathematics and Mathematical Physics, 55:4, 580–596.
6. Juditsky, A., Nazin, A. V., Tsybakov, A. B., and Vayatis, N. (2008). Gap-free bounds for stochastic multi-armed bandit. Proc. 17th World Congress IFAC, Seoul, Korea, July 6–11, 11560–11563.
7. Kolnogorov, A. V. (2010). Determination of the minimax risk for the normal two-armed bandit. In Proceedings of the IFAC Workshop “Adaptation and Learning in Control and Signal Processing ALCOSP 2010”, Antalya, Turkey, August 26–28, 2010. DOI 10.3182/20100826-3-TR-4015.00044. http://www.ifac-papersonline.net.
8. Kolnogorov, A. V. (2012). Parallel design of robust control in the stochastic environment (the two-armed bandit problem). Automation and Remote Control, 73:4, 689–701.