A new approach to Poissonian two-armed bandit problem

Alexander Kolnogorov

arXiv:1907.06074·math.ST·July 16, 2019

TL;DR

This paper introduces a Bayesian method for solving a continuous-time two-armed bandit problem with Poisson processes, using current process history rather than posterior evolution, and provides recursive and PDE-based solutions.

Contribution

It presents a novel Bayesian approach that leverages process history instead of posterior evolution, with recursive equations and PDEs for strategy and risk calculation.

Findings

1. Developed recursive equations for Bayesian strategies
2. Derived PDEs for limiting case analysis
3. Enhanced understanding of process history in Bayesian bandit solutions

Abstract

We consider a continuous-time two-armed bandit problem in which incomes are described by Poissonian processes. We develop a Bayesian approach with an arbitrary prior distribution. We present two versions of a recursive equation for determining the Bayesian piecewise-constant strategy and the Bayesian risk, and a partial differential equation for the limiting case. Unlike previously considered Bayesian settings, our description uses the current history of the process rather than the evolution of the posterior distribution.

Equations

$$\Pr\big(X(\tau+t)-X(\tau)=i\big)=p(i,t;\lambda_{\ell})=\frac{(\lambda_{\ell}t)^{i}}{i!}e^{-\lambda_{\ell}t},\quad i=0,1,2,\dots \tag{1.1}$$

$$L_{T}(\sigma,\theta)=T\max(\lambda_{1},\lambda_{2})-\mathrm{E}_{\sigma,\theta}\big(X_{1}(T)+X_{2}(T)\big) \tag{1.2}$$

$$R_{T}(\mu)=\inf_{\{\sigma\}}\int_{\Theta}L_{T}(\sigma,\theta)\mu(\theta)\,d\theta, \tag{1.3}$$

$$R_{T}^{M}(\Theta)=\inf_{\{\sigma\}}\sup_{\Theta}L_{T}(\sigma,\theta), \tag{1.4}$$

$$R_{T}^{M}(\Theta)=R_{T}(\mu_{0})=\sup_{\{\mu\}}R_{T}(\mu), \tag{1.5}$$

$$\mu(\lambda_{1},\lambda_{2}\mid X_{1},t_{1},X_{2},t_{2})=\frac{p(X_{1},t_{1};\lambda_{1})\,p(X_{2},t_{2};\lambda_{2})\,\mu(\lambda_{1},\lambda_{2})}{\mu(X_{1},t_{1},X_{2},t_{2})}, \tag{2.1}$$

$$\mu(X_{1},t_{1},X_{2},t_{2})=\iint_{\Theta}p(X_{1},t_{1};\lambda_{1})\,p(X_{2},t_{2};\lambda_{2})\,\mu(\lambda_{1},\lambda_{2})\,d\lambda_{1}d\lambda_{2}. \tag{2.2}$$

$$R(X_{1},t_{1},X_{2},t_{2})=\min\big(R^{(1)}(X_{1},t_{1},X_{2},t_{2}),R^{(2)}(X_{1},t_{1},X_{2},t_{2})\big), \tag{2.3}$$

$$R^{(1)}(X_{1},t_{1},X_{2},t_{2})=R^{(2)}(X_{1},t_{1},X_{2},t_{2})=0$$

$$\begin{aligned}R^{(1)}(X_{1},t_{1},X_{2},t_{2})={}&\iint_{\Theta}\mu(\lambda_{1},\lambda_{2}\mid X_{1},t_{1},X_{2},t_{2})\\ &\times\Big((\lambda_{2}-\lambda_{1})^{+}\Delta+\sum_{j=0}^{\infty}R(X_{1}+j,t_{1}+\Delta,X_{2},t_{2})\,p(j,\Delta;\lambda_{1})\Big)d\lambda_{1}d\lambda_{2},\\ R^{(2)}(X_{1},t_{1},X_{2},t_{2})={}&\iint_{\Theta}\mu(\lambda_{1},\lambda_{2}\mid X_{1},t_{1},X_{2},t_{2})\\ &\times\Big((\lambda_{1}-\lambda_{2})^{+}\Delta+\sum_{j=0}^{\infty}R(X_{1},t_{1},X_{2}+j,t_{2}+\Delta)\,p(j,\Delta;\lambda_{2})\Big)d\lambda_{1}d\lambda_{2}.\end{aligned} \tag{2.9}$$

$$R_{T}(\mu)=R(0,0,0,0).$$

$$\tilde{R}(X_{1},t_{1},X_{2},t_{2})=R(X_{1},t_{1},X_{2},t_{2})\times\mu(X_{1},t_{1},X_{2},t_{2}),$$

$$\tilde{R}(X_{1},t_{1},X_{2},t_{2})=\min\big(\tilde{R}^{(1)}(X_{1},t_{1},X_{2},t_{2}),\tilde{R}^{(2)}(X_{1},t_{1},X_{2},t_{2})\big), \tag{3.1}$$

$$\tilde{R}^{(1)}(X_{1},t_{1},X_{2},t_{2})=\tilde{R}^{(2)}(X_{1},t_{1},X_{2},t_{2})=0$$

$$\begin{aligned}\tilde{R}^{(1)}(X_{1},t_{1},X_{2},t_{2})={}&g^{(1)}(X_{1},t_{1},X_{2},t_{2})\times\Delta\\ &+\sum_{j=0}^{\infty}\tilde{R}(X_{1}+j,t_{1}+\Delta,X_{2},t_{2})\times\frac{t_{1}^{X_{1}}\Delta^{j}(X_{1}+j)!}{(t_{1}+\Delta)^{X_{1}+j}X_{1}!\,j!},\\ \tilde{R}^{(2)}(X_{1},t_{1},X_{2},t_{2})={}&g^{(2)}(X_{1},t_{1},X_{2},t_{2})\times\Delta\\ &+\sum_{j=0}^{\infty}\tilde{R}(X_{1},t_{1},X_{2}+j,t_{2}+\Delta)\times\frac{t_{2}^{X_{2}}\Delta^{j}(X_{2}+j)!}{(t_{2}+\Delta)^{X_{2}+j}X_{2}!\,j!},\end{aligned} \tag{3.7}$$

$$g^{(1)}(X_{1},t_{1},X_{2},t_{2})=\iint_{\Theta}(\lambda_{2}-\lambda_{1})^{+}p(X_{1},t_{1};\lambda_{1})\,p(X_{2},t_{2};\lambda_{2})\,\mu(\lambda_{1},\lambda_{2})\,d\lambda_{1}d\lambda_{2},$$

$$g^{(2)}(X_{1},t_{1},X_{2},t_{2})=\iint_{\Theta}(\lambda_{1}-\lambda_{2})^{+}p(X_{1},t_{1};\lambda_{1})\,p(X_{2},t_{2};\lambda_{2})\,\mu(\lambda_{1},\lambda_{2})\,d\lambda_{1}d\lambda_{2}.$$

$$R_{T}(\mu)=\tilde{R}(0,0,0,0).$$

$$\begin{aligned}\tilde{R}^{(1)}(X_{1},t_{1},X_{2},t_{2})={}&g^{(1)}(X_{1},t_{1},X_{2},t_{2})\Delta\\ &+\tilde{R}(X_{1},t_{1}+\Delta,X_{2},t_{2})-\tilde{R}(X_{1},t_{1}+\Delta,X_{2},t_{2})X_{1}t_{1}^{-1}\Delta\\ &+\tilde{R}(X_{1}+1,t_{1}+\Delta,X_{2},t_{2})(X_{1}+1)t_{1}^{-1}\Delta+o(\Delta),\\ \tilde{R}^{(2)}(X_{1},t_{1},X_{2},t_{2})={}&g^{(2)}(X_{1},t_{1},X_{2},t_{2})\Delta\\ &+\tilde{R}(X_{1},t_{1},X_{2},t_{2}+\Delta)-\tilde{R}(X_{1},t_{1},X_{2},t_{2}+\Delta)X_{2}t_{2}^{-1}\Delta\\ &+\tilde{R}(X_{1},t_{1},X_{2}+1,t_{2}+\Delta)(X_{2}+1)t_{2}^{-1}\Delta+o(\Delta),\end{aligned} \tag{4.7}$$

$$\min_{\ell=1,2}\big(\tilde{R}^{(\ell)}(X_{1},t_{1},X_{2},t_{2})-\tilde{R}(X_{1},t_{1},X_{2},t_{2})\big)=0. \tag{4.8}$$

$$\min_{\ell=1,2}\Big(\frac{\partial\tilde{R}}{\partial t_{\ell}}+D^{(\ell)}\tilde{R}(X_{1},t_{1},X_{2},t_{2})+g^{(\ell)}(X_{1},t_{1},X_{2},t_{2})\Big)=0, \tag{4.9}$$

$$D^{(1)}\tilde{R}(X_{1},t_{1},X_{2},t_{2})=-\tilde{R}(X_{1},t_{1},X_{2},t_{2})X_{1}t_{1}^{-1}+\tilde{R}(X_{1}+1,t_{1},X_{2},t_{2})(X_{1}+1)t_{1}^{-1},$$

$$D^{(2)}\tilde{R}(X_{1},t_{1},X_{2},t_{2})=-\tilde{R}(X_{1},t_{1},X_{2},t_{2})X_{2}t_{2}^{-1}+\tilde{R}(X_{1},t_{1},X_{2}+1,t_{2})(X_{2}+1)t_{2}^{-1}.$$

$$R_{T}(\mu)=\tilde{R}(0,0,0,0).$$

Taxonomy

Topics: Advanced Bandit Algorithms Research · Optimization and Search Problems · Reinforcement Learning in Robotics

Full text

Alexander Kolnogorov

Yaroslav-the-Wise Novgorod State University

Applied Mathematics and Information Science Department

41 B. Saint-Petersburgskaya Str., Velikiy Novgorod, Russia, 173003

MSC subject classifications: 93E20, 62L05, 62C10, 62C20, 62F35.

Keywords: Poissonian two-armed bandit, Bayesian approach.

1 Introduction

We consider a continuous-time two-armed bandit problem. This setting leads to either a Poissonian or a diffusion two-armed bandit. A quite general Poissonian two-armed bandit was considered in [1, 2]. In [3], the treatment of Poissonian and diffusion bandit problems is restricted to the case of independent arms and discounted rewards. An interesting, though special, case of a diffusion two-armed bandit is presented in [4]. Some approaches to the discrete-time two-armed bandit problem are presented in [5], [6], [7]. In the present article, we develop a new general approach to the Poissonian two-armed bandit in the Bayesian setting.

Formally, a Poissonian two-armed bandit is a continuous-time controlled random process $X(t)$. Its values are usually interpreted as incomes and depend only on the chosen actions $y(t)$ as follows. If the action $y(t^{\prime})=\ell$ was chosen on the time interval $t^{\prime}\in[\tau,\tau+t]$, $t>0$, then

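$$\Pr\big(X(\tau+t)-X(\tau)=i\big)=p(i,t;\lambda_{\ell})=\frac{(\lambda_{\ell}t)^{i}}{i!}e^{-\lambda_{\ell}t},\quad i=0,1,2,\dots, \tag{1.1}$$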

$\ell=1,2$. Thus the vector parameter $\theta=(\lambda_{1},\lambda_{2})$ completely describes the Poissonian two-armed bandit under consideration. The set $\Theta$ of admissible parameter values is assumed to be known.
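As a minimal sketch of (1.1), the Poisson probability $p(i,t;\lambda)$ can be evaluated directly. The function name `poisson_pmf` and the numerical values below are illustrative assumptions and do not come from the paper.

```python
import math


def poisson_pmf(i: int, t: float, lam: float) -> float:
    """Probability p(i, t; lam) that a Poisson process with rate lam
    yields income i over an interval of length t, as in (1.1)."""
    if i < 0:
        return 0.0
    return math.exp(-lam * t) * (lam * t) ** i / math.factorial(i)


# Illustrative values (assumptions): rate 1.5, interval length 2.0.
total = sum(poisson_pmf(i, 2.0, 1.5) for i in range(60))
print(total)                      # close to 1: the probabilities sum to one
print(poisson_pmf(0, 0.0, 1.5))   # no time elapsed, so zero income is certain
```

Note that the formula gives $p(0,0;\lambda)=1$ without any special case, which is the property used later for $t_{1}=0$ or $t_{2}=0$.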

A control strategy generally assigns a random choice of the action at time $t$ depending on the currently observed history of the process, i.e., the cumulative times $t_{1},t_{2}$ for which the two actions have been applied ($t_{1}+t_{2}=t$) and the corresponding cumulative incomes $X_{1},X_{2}$. In what follows, the current values of $X_{1},X_{2}$ at time $t$ are denoted by $X_{1}(t),X_{2}(t)$. If one knew $\lambda_{1},\lambda_{2}$, one should always choose the action corresponding to the larger of them, and the total expected income on the control horizon $T$ would then equal $T\max(\lambda_{1},\lambda_{2})$. If instead some strategy $\sigma$ is used, the total expected income falls short of this maximum by the value

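$$L_{T}(\sigma,\theta)=T\max(\lambda_{1},\lambda_{2})-\mathrm{E}_{\sigma,\theta}\big(X_{1}(T)+X_{2}(T)\big), \tag{1.2}$$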

which is called the regret. Here $\mathrm{E}_{\sigma,\theta}$ denotes the mathematical expectation with respect to the measure generated by the strategy $\sigma$ and the parameter $\theta$.
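To make the regret concrete, the sketch below evaluates it for the simplest case of a non-adaptive strategy that applies action 1 for a fixed total time $t_{1}$ and action 2 for a fixed total time $t_{2}$, so that $\mathrm{E}X_{\ell}(T)=\lambda_{\ell}t_{\ell}$. The fixed allocation, the function name and the numbers are assumptions made for illustration; adaptive strategies require the recursions of the later sections.

```python
def regret_fixed_allocation(lam1: float, lam2: float,
                            t1: float, t2: float) -> float:
    """Regret of a non-adaptive strategy with horizon T = t1 + t2.

    Assumption: the times t1, t2 are fixed in advance, so the expected
    incomes are simply lam1 * t1 and lam2 * t2.
    """
    horizon = t1 + t2
    expected_income = lam1 * t1 + lam2 * t2
    return horizon * max(lam1, lam2) - expected_income


# Illustrative numbers: the worse arm (rate 1.0) is applied for 3 time units.
print(regret_fixed_allocation(1.0, 2.0, 3.0, 7.0))  # 3.0
```

The regret here equals the time spent on the worse arm multiplied by the gap $|\lambda_{1}-\lambda_{2}|$.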

Let us assign a prior distribution density $\mu(\theta)=\mu(\lambda_{1},\lambda_{2})$ on the set of parameters $\Theta$. The corresponding Bayesian risk is defined as

$$R_{T}(\mu)=\inf_{\{\sigma\}}\int_{\Theta}L_{T}(\sigma,\theta)\mu(\theta)\,d\theta, \tag{1.3}$$

and the optimal strategy $\sigma^{B}$ is called the Bayesian strategy. The minimax risk on the set $\Theta$ is defined as

$$R_{T}^{M}(\Theta)=\inf_{\{\sigma\}}\sup_{\Theta}L_{T}(\sigma,\theta), \tag{1.4}$$

and the corresponding optimal strategy $\sigma^{M}$ is called the minimax strategy.

There is no direct method of determining the minimax strategy and the minimax risk. However, they can be determined using the main theorem of game theory. According to this theorem, the following equality holds

$$R_{T}^{M}(\Theta)=R_{T}(\mu_{0})=\sup_{\{\mu\}}R_{T}(\mu), \tag{1.5}$$

i.e., the minimax risk equals the Bayesian risk calculated with respect to the worst-case prior distribution, and the minimax strategy coincides with the corresponding Bayesian strategy. Note that for a finite set $\Theta$, determining the minimax risk from equality (1.5) is not laborious because the Bayesian risk is a concave function of the prior distribution.

The rest of the paper is organized as follows. A recursive Bellman-type equation for determining the Bayesian risk for piecewise-constant strategies is presented in Section 2. Note that our approach differs from that of [1], [2]: we recalculate the Bayesian risk with respect to the current statistics $(X_{1},t_{1},X_{2},t_{2})$, whereas in [1], [2] the recalculation is carried out with respect to the current posterior distribution and $t=t_{1}+t_{2}$. Our approach applies to quite general sets $\Theta$. The approach of [1], [2] applies to finite sets of parameters, and its generalization to arbitrary sets is not obvious. In Section 3, another version of the recursive equation is derived. In the limiting case, we obtain a partial differential equation, which is presented in Section 4.

2 Recursive equation

Let us consider piecewise-constant strategies $\{\sigma_{\ell}(X_{1},t_{1},X_{2},t_{2})\}$. To this end, we assume that the control horizon is partitioned into intervals of length $\Delta$ on which the chosen action does not change. Hence $T=N\Delta$, and for any $n_{1}+n_{2}=n<N$, $t_{1}=n_{1}\Delta$, $t_{2}=n_{2}\Delta$ we have $\Pr(y(t^{\prime})=\ell)=\sigma_{\ell}(X_{1},t_{1},X_{2},t_{2})$, where $\sigma_{\ell}(X_{1},t_{1},X_{2},t_{2})$ is constant on the time interval $t^{\prime}\in[n\Delta,(n+1)\Delta]$. The posterior distribution at time $t=t_{1}+t_{2}$ is calculated as

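$$\mu(\lambda_{1},\lambda_{2}\mid X_{1},t_{1},X_{2},t_{2})=\frac{p(X_{1},t_{1};\lambda_{1})\,p(X_{2},t_{2};\lambda_{2})\,\mu(\lambda_{1},\lambda_{2})}{\mu(X_{1},t_{1},X_{2},t_{2})}, \tag{2.1}$$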

where

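$$\mu(X_{1},t_{1},X_{2},t_{2})=\iint_{\Theta}p(X_{1},t_{1};\lambda_{1})\,p(X_{2},t_{2};\lambda_{2})\,\mu(\lambda_{1},\lambda_{2})\,d\lambda_{1}d\lambda_{2}. \tag{2.2}$$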
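The posterior (2.1) with normalizing factor (2.2) can be sketched numerically as follows. The sketch assumes a finite parameter set, so that the double integral becomes a sum; the names `posterior`, `theta` and `prior` and all numerical values are hypothetical and serve only as an illustration.

```python
import math


def poisson_pmf(i, t, lam):
    """p(i, t; lam) from (1.1)."""
    return math.exp(-lam * t) * (lam * t) ** i / math.factorial(i)


def posterior(theta, prior, x1, t1, x2, t2):
    """Posterior weights (2.1) and normalizer (2.2) on a finite Theta.

    theta: list of (lam1, lam2) pairs; prior: matching prior weights.
    Assumption: the integral over Theta is replaced by a finite sum.
    """
    joint = [poisson_pmf(x1, t1, l1) * poisson_pmf(x2, t2, l2) * w
             for (l1, l2), w in zip(theta, prior)]
    marginal = sum(joint)                      # mu(X1, t1, X2, t2)
    return [v / marginal for v in joint], marginal


theta = [(1.0, 2.0), (2.0, 1.0)]   # two hypothetical parameter values
prior = [0.5, 0.5]
# Income 3 from action 1 over unit time, income 0 from action 2 over unit time.
weights, marginal = posterior(theta, prior, 3, 1.0, 0, 1.0)
print(weights, marginal)
```

With no observations ($t_{1}=t_{2}=0$, $X_{1}=X_{2}=0$) the function returns the prior itself and a normalizer equal to 1.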

Since $p(0,0;\lambda)=1$, formula (2.1) remains correct if $t_{1}=0$ and/or $t_{2}=0$. Denote $x^{+}=\max(x,0)$. Using (1.1), we obtain the following standard recursive Bellman-type equation for determining the Bayesian risk (1.3) with respect to the posterior distribution (2.1):

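$$R(X_{1},t_{1},X_{2},t_{2})=\min\big(R^{(1)}(X_{1},t_{1},X_{2},t_{2}),R^{(2)}(X_{1},t_{1},X_{2},t_{2})\big), \tag{2.3}$$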

where

$$R^{(1)}(X_{1},t_{1},X_{2},t_{2})=R^{(2)}(X_{1},t_{1},X_{2},t_{2})=0$$

if $t_{1}+t_{2}=T$, and for $t_{1}+t_{2}<T$

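$$\begin{aligned}R^{(1)}(X_{1},t_{1},X_{2},t_{2})={}&\iint_{\Theta}\mu(\lambda_{1},\lambda_{2}\mid X_{1},t_{1},X_{2},t_{2})\\ &\times\Big((\lambda_{2}-\lambda_{1})^{+}\Delta+\sum_{j=0}^{\infty}R(X_{1}+j,t_{1}+\Delta,X_{2},t_{2})\,p(j,\Delta;\lambda_{1})\Big)d\lambda_{1}d\lambda_{2},\\ R^{(2)}(X_{1},t_{1},X_{2},t_{2})={}&\iint_{\Theta}\mu(\lambda_{1},\lambda_{2}\mid X_{1},t_{1},X_{2},t_{2})\\ &\times\Big((\lambda_{1}-\lambda_{2})^{+}\Delta+\sum_{j=0}^{\infty}R(X_{1},t_{1},X_{2}+j,t_{2}+\Delta)\,p(j,\Delta;\lambda_{2})\Big)d\lambda_{1}d\lambda_{2}.\end{aligned} \tag{2.9}$$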

Here $\{R^{(\ell)}(X_{1},t_{1},X_{2},t_{2})\}$ are the expected losses if the $\ell$-th action is first applied over an interval of length $\Delta$ and the control is optimal thereafter ($\ell=1,2$).

Bayesian risk (1.3) is calculated by the formula

$$R_{T}(\mu)=R(0,0,0,0).$$

Equations (2.3)–(2.9) determine both the Bayesian risk and the Bayesian strategy. The Bayesian strategy prescribes choosing the $\ell$-th action (i.e., $\sigma_{\ell}(X_{1},t_{1},X_{2},t_{2})=1$) if $R^{(\ell)}(X_{1},t_{1},X_{2},t_{2})$ has the smaller value. In case of a tie, $R^{(1)}(X_{1},t_{1},X_{2},t_{2})=R^{(2)}(X_{1},t_{1},X_{2},t_{2})$, the choice is arbitrary.
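The backward recursion of this section can be sketched in a few lines under two simplifying assumptions that are not part of the paper: the set $\Theta$ is finite, so integrals become sums, and the infinite sum over $j$ is truncated at a hypothetical cutoff `j_max`. All names and numerical values are illustrative.

```python
import math
from functools import lru_cache


def pmf(i, t, lam):
    """p(i, t; lam) from (1.1)."""
    return math.exp(-lam * t) * (lam * t) ** i / math.factorial(i)


def bayes_risk(theta, prior, delta, n_steps, j_max=20):
    """Approximate R(0,0,0,0) for the horizon T = n_steps * delta.

    theta: finite list of (lam1, lam2) pairs (assumption);
    prior: matching prior weights; j_max truncates the sum over j.
    """

    def posterior(x1, t1, x2, t2):
        joint = [pmf(x1, t1, l1) * pmf(x2, t2, l2) * w
                 for (l1, l2), w in zip(theta, prior)]
        norm = sum(joint)
        return [v / norm for v in joint]

    @lru_cache(maxsize=None)
    def risk(x1, n1, x2, n2):
        if n1 + n2 == n_steps:            # terminal condition, t1 + t2 = T
            return 0.0
        post = posterior(x1, n1 * delta, x2, n2 * delta)
        r1 = r2 = 0.0
        for (l1, l2), w in zip(theta, post):
            # Expected future risk after applying action 1 or 2 for time delta.
            cont1 = sum(risk(x1 + j, n1 + 1, x2, n2) * pmf(j, delta, l1)
                        for j in range(j_max + 1))
            cont2 = sum(risk(x1, n1, x2 + j, n2 + 1) * pmf(j, delta, l2)
                        for j in range(j_max + 1))
            r1 += w * (max(l2 - l1, 0.0) * delta + cont1)
            r2 += w * (max(l1 - l2, 0.0) * delta + cont2)
        return min(r1, r2)                # Bayesian choice of the action

    return risk(0, 0, 0, 0)


theta = [(1.0, 2.0), (2.0, 1.0)]          # hypothetical two-point Theta
prior = [0.5, 0.5]
print(bayes_risk(theta, prior, delta=0.5, n_steps=4))
```

The state is indexed by the integers $n_{1},n_{2}$ rather than by $t_{1},t_{2}$ so that memoization keys are exact. For comparison, always applying action 1 in this symmetric example gives a risk of $T/2=1$, and the computed value should be strictly smaller.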

3 Another version of recursive equation

In this section, we obtain another version of the recursive Bellman-type equation. Let us denote

$$\tilde{R}(X_{1},t_{1},X_{2},t_{2})=R(X_{1},t_{1},X_{2},t_{2})\times\mu(X_{1},t_{1},X_{2},t_{2}),$$

where $\{R(X_{1},t_{1},X_{2},t_{2})\}$ are the Bayesian risks calculated with respect to the posterior distribution (2.1) and $\{\mu(X_{1},t_{1},X_{2},t_{2})\}$ are defined in (2.2). Then the following recursive equation holds

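$$\tilde{R}(X_{1},t_{1},X_{2},t_{2})=\min\big(\tilde{R}^{(1)}(X_{1},t_{1},X_{2},t_{2}),\tilde{R}^{(2)}(X_{1},t_{1},X_{2},t_{2})\big), \tag{3.1}$$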

where

$$\tilde{R}^{(1)}(X_{1},t_{1},X_{2},t_{2})=\tilde{R}^{(2)}(X_{1},t_{1},X_{2},t_{2})=0$$

if $t_{1}+t_{2}=T$, and for $t_{1}+t_{2}<T$

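$$\begin{aligned}\tilde{R}^{(1)}(X_{1},t_{1},X_{2},t_{2})={}&g^{(1)}(X_{1},t_{1},X_{2},t_{2})\times\Delta\\ &+\sum_{j=0}^{\infty}\tilde{R}(X_{1}+j,t_{1}+\Delta,X_{2},t_{2})\times\frac{t_{1}^{X_{1}}\Delta^{j}(X_{1}+j)!}{(t_{1}+\Delta)^{X_{1}+j}X_{1}!\,j!},\\ \tilde{R}^{(2)}(X_{1},t_{1},X_{2},t_{2})={}&g^{(2)}(X_{1},t_{1},X_{2},t_{2})\times\Delta\\ &+\sum_{j=0}^{\infty}\tilde{R}(X_{1},t_{1},X_{2}+j,t_{2}+\Delta)\times\frac{t_{2}^{X_{2}}\Delta^{j}(X_{2}+j)!}{(t_{2}+\Delta)^{X_{2}+j}X_{2}!\,j!},\end{aligned} \tag{3.7}$$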

where

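$$\begin{aligned}g^{(1)}(X_{1},t_{1},X_{2},t_{2})&=\iint_{\Theta}(\lambda_{2}-\lambda_{1})^{+}p(X_{1},t_{1};\lambda_{1})\,p(X_{2},t_{2};\lambda_{2})\,\mu(\lambda_{1},\lambda_{2})\,d\lambda_{1}d\lambda_{2},\\ g^{(2)}(X_{1},t_{1},X_{2},t_{2})&=\iint_{\Theta}(\lambda_{1}-\lambda_{2})^{+}p(X_{1},t_{1};\lambda_{1})\,p(X_{2},t_{2};\lambda_{2})\,\mu(\lambda_{1},\lambda_{2})\,d\lambda_{1}d\lambda_{2}.\end{aligned}$$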

The Bayesian strategy prescribes choosing the $\ell$-th action (i.e., $\sigma_{\ell}(X_{1},t_{1},X_{2},t_{2})=1$) if $\tilde{R}^{(\ell)}(X_{1},t_{1},X_{2},t_{2})$ has the smaller value. In case of a tie, $\tilde{R}^{(1)}(X_{1},t_{1},X_{2},t_{2})=\tilde{R}^{(2)}(X_{1},t_{1},X_{2},t_{2})$, the choice is arbitrary. The Bayesian risk (1.3) is calculated by the formula

$$R_{T}(\mu)=\tilde{R}(0,0,0,0).$$
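The second version of the recursion avoids recomputing the posterior at every state, which the following sketch illustrates. As before, a finite set $\Theta$ and a truncation of the sum over $j$ at a hypothetical cutoff `j_max` are assumptions made for the sake of a runnable example; names and numbers are illustrative.

```python
import math
from functools import lru_cache


def pmf(i, t, lam):
    """p(i, t; lam) from (1.1)."""
    return math.exp(-lam * t) * (lam * t) ** i / math.factorial(i)


def bayes_risk_tilde(theta, prior, delta, n_steps, j_max=20):
    """Approximate R~(0,0,0,0) for the horizon T = n_steps * delta.

    theta: finite list of (lam1, lam2) pairs (assumption);
    prior: matching prior weights; j_max truncates the sum over j.
    """

    def g(ell, x1, t1, x2, t2):
        # g^(ell): prior-weighted one-step loss rate, integral replaced by a sum.
        total = 0.0
        for (l1, l2), w in zip(theta, prior):
            gap = max(l2 - l1, 0.0) if ell == 1 else max(l1 - l2, 0.0)
            total += gap * pmf(x1, t1, l1) * pmf(x2, t2, l2) * w
        return total

    def weight(x, t, j):
        # t^x * delta^j * (x+j)! / ((t+delta)^(x+j) * x! * j!)
        return math.comb(x + j, j) * t ** x * delta ** j / (t + delta) ** (x + j)

    @lru_cache(maxsize=None)
    def r_tilde(x1, n1, x2, n2):
        if n1 + n2 == n_steps:            # terminal condition, t1 + t2 = T
            return 0.0
        t1, t2 = n1 * delta, n2 * delta
        r1 = g(1, x1, t1, x2, t2) * delta + sum(
            r_tilde(x1 + j, n1 + 1, x2, n2) * weight(x1, t1, j)
            for j in range(j_max + 1))
        r2 = g(2, x1, t1, x2, t2) * delta + sum(
            r_tilde(x1, n1, x2 + j, n2 + 1) * weight(x2, t2, j)
            for j in range(j_max + 1))
        return min(r1, r2)

    return r_tilde(0, 0, 0, 0)


theta = [(1.0, 2.0), (2.0, 1.0)]          # hypothetical two-point Theta
prior = [0.5, 0.5]
print(bayes_risk_tilde(theta, prior, delta=0.5, n_steps=4))
```

The weights depend only on the statistics $(X_{\ell},t_{\ell})$ and on $\Delta$, not on the prior, so they could be tabulated once and reused for different priors.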

Formulas (3.1)–(3.8) follow from (2.3)–(2.10): one multiplies both sides of (2.9) by $\mu(X_{1},t_{1},X_{2},t_{2})$ and carries out the necessary transformations.
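The transformations are not spelled out in the text. A plausible reconstruction, based only on (1.1), (2.1) and (2.2) and offered as a sketch rather than as the author's derivation, rests on a factorization of the product of two Poisson probabilities:

```latex
% Step 1: factorization that follows directly from (1.1).
p(X_1,t_1;\lambda_1)\,p(j,\Delta;\lambda_1)
  = \frac{(\lambda_1 t_1)^{X_1}}{X_1!}\,\frac{(\lambda_1\Delta)^{j}}{j!}\,
    e^{-\lambda_1 (t_1+\Delta)}
  = \frac{t_1^{X_1}\Delta^{j}(X_1+j)!}{(t_1+\Delta)^{X_1+j}X_1!\,j!}\;
    p(X_1+j,\,t_1+\Delta;\,\lambda_1).

% Step 2: multiply the recursion for R^{(1)} by \mu(X_1,t_1,X_2,t_2)
% and use (2.1) to cancel the normalizer.
\mu(X_1,t_1,X_2,t_2)\,R^{(1)}(X_1,t_1,X_2,t_2)
  = g^{(1)}(X_1,t_1,X_2,t_2)\,\Delta
  + \sum_{j=0}^{\infty} R(X_1+j,t_1+\Delta,X_2,t_2)
    \iint_{\Theta} p(X_1,t_1;\lambda_1)\,p(j,\Delta;\lambda_1)\,
    p(X_2,t_2;\lambda_2)\,\mu(\lambda_1,\lambda_2)\,d\lambda_1 d\lambda_2.

% Step 3: by Step 1 and (2.2), the remaining integral equals
% the weight times \mu(X_1+j,t_1+\Delta,X_2,t_2), so each summand becomes
\frac{t_1^{X_1}\Delta^{j}(X_1+j)!}{(t_1+\Delta)^{X_1+j}X_1!\,j!}\;
\tilde{R}(X_1+j,\,t_1+\Delta,\,X_2,\,t_2).
```

The same steps with the indices 1 and 2 interchanged give the expression for $\tilde{R}^{(2)}$.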

4 A limiting description

In this section, we consider the case of small $\Delta$. In this case, (3.7) takes the form

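$$\begin{aligned}\tilde{R}^{(1)}(X_{1},t_{1},X_{2},t_{2})={}&g^{(1)}(X_{1},t_{1},X_{2},t_{2})\Delta\\ &+\tilde{R}(X_{1},t_{1}+\Delta,X_{2},t_{2})-\tilde{R}(X_{1},t_{1}+\Delta,X_{2},t_{2})X_{1}t_{1}^{-1}\Delta\\ &+\tilde{R}(X_{1}+1,t_{1}+\Delta,X_{2},t_{2})(X_{1}+1)t_{1}^{-1}\Delta+o(\Delta),\\ \tilde{R}^{(2)}(X_{1},t_{1},X_{2},t_{2})={}&g^{(2)}(X_{1},t_{1},X_{2},t_{2})\Delta\\ &+\tilde{R}(X_{1},t_{1},X_{2},t_{2}+\Delta)-\tilde{R}(X_{1},t_{1},X_{2},t_{2}+\Delta)X_{2}t_{2}^{-1}\Delta\\ &+\tilde{R}(X_{1},t_{1},X_{2}+1,t_{2}+\Delta)(X_{2}+1)t_{2}^{-1}\Delta+o(\Delta).\end{aligned} \tag{4.7}$$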
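The small-$\Delta$ form relies on the behaviour of the weights $t^{X}\Delta^{j}(X+j)!/\big((t+\Delta)^{X+j}X!\,j!\big)$ from (3.7): for $j=0$ the weight is $1-Xt^{-1}\Delta+o(\Delta)$, for $j=1$ it is $(X+1)t^{-1}\Delta+o(\Delta)$, and for $j\geq 2$ it is $o(\Delta)$. The following sketch checks this numerically; the function names and the test point $X=3$, $t=2$ are assumptions chosen for illustration.

```python
import math


def weight(x, t, delta, j):
    """Coefficient t^x * delta^j * (x+j)! / ((t+delta)^(x+j) * x! * j!)."""
    return math.comb(x + j, j) * t ** x * delta ** j / (t + delta) ** (x + j)


def expansion_errors(x, t, delta):
    """Deviation of the j = 0 and j = 1 weights from their first-order
    forms, plus the size of the j = 2 weight; all should be O(delta^2)."""
    err0 = abs(weight(x, t, delta, 0) - (1.0 - x * delta / t))
    err1 = abs(weight(x, t, delta, 1) - (x + 1) * delta / t)
    return err0, err1, weight(x, t, delta, 2)


# Each tenfold decrease of delta should shrink all three numbers
# by roughly a factor of one hundred.
for delta in (1e-1, 1e-2, 1e-3):
    print(delta, expansion_errors(3, 2.0, delta))
```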

Equation (4.7) must be complemented with (3.1), which is now written as

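$$\min_{\ell=1,2}\big(\tilde{R}^{(\ell)}(X_{1},t_{1},X_{2},t_{2})-\tilde{R}(X_{1},t_{1},X_{2},t_{2})\big)=0. \tag{4.8}$$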

From (4.7)–(4.8) one derives, in the limit as $\Delta\to+0$, the following partial differential equation

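$$\min_{\ell=1,2}\Big(\frac{\partial\tilde{R}}{\partial t_{\ell}}+D^{(\ell)}\tilde{R}(X_{1},t_{1},X_{2},t_{2})+g^{(\ell)}(X_{1},t_{1},X_{2},t_{2})\Big)=0, \tag{4.9}$$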

where

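$$\begin{aligned}D^{(1)}\tilde{R}(X_{1},t_{1},X_{2},t_{2})&=-\tilde{R}(X_{1},t_{1},X_{2},t_{2})X_{1}t_{1}^{-1}+\tilde{R}(X_{1}+1,t_{1},X_{2},t_{2})(X_{1}+1)t_{1}^{-1},\\ D^{(2)}\tilde{R}(X_{1},t_{1},X_{2},t_{2})&=-\tilde{R}(X_{1},t_{1},X_{2},t_{2})X_{2}t_{2}^{-1}+\tilde{R}(X_{1},t_{1},X_{2}+1,t_{2})(X_{2}+1)t_{2}^{-1}.\end{aligned}$$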

Bayesian risk (1.3) is calculated by the formula

$$R_{T}(\mu)=\tilde{R}(0,0,0,0).$$

Note that the partial differential equation describes both the evolution of $\tilde{R}(X_{1},t_{1},X_{2},t_{2})$ and the strategy. The strategy must choose the $\ell$-th action if the $\ell$-th term on the left-hand side of (4.9) has the smaller value; in case of a tie, the choice of action may be arbitrary.

Bibliography

[1] Presman, E. L. and Sonin, I. M. (1990). Sequential Control with Incomplete Information: Bayesian Approach. Academic Press, New York.
[2] Presman, E. L. (1990). Poisson Version of the Two-Armed Bandit Problem with Discounting. Theory Probab. Appl. 35 307–317.
[3] Mandelbaum, A. (1987). Continuous Multi-Armed Bandits and Multiparameter Processes. Ann. Probab. 15 1527–1556.
[4] Berry, D. A. and Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments. Chapman & Hall, London.
[5] Sragovich, V. G. (2006). Mathematical Theory of Adaptive Control. World Sci., Singapore.
[6] Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge Univ. Press, Cambridge.
[7] Kolnogorov, A. V. (2018). Gaussian Two-Armed Bandit and Optimization of Batch Data Processing. Problems of Information Transmission 54 84–100.