Exponential two-armed bandit problem
Alexander Kolnogorov, Denis Grunev

TL;DR
This paper analyzes the exponential two-armed bandit problem using Bayesian methods, deriving strategies and risks, and compares it with the Gaussian case, revealing similar limiting behaviors and implications for batch processing.
Contribution
It develops a Bayesian approach for exponential two-armed bandits and derives a PDE for the limiting case, showing equivalence with Gaussian bandits in the limit.
Findings
Exponential and Gaussian bandits have the same description in the limit.
Batch processing does not increase Bayesian risk compared to individual processing as data size grows.
Derived recursive Bayesian strategies and risks for exponential bandits.
Abstract
We consider the exponential two-armed bandit problem, in which incomes are described by exponential distribution densities. We develop a Bayesian approach and present a recursive equation for determining the Bayesian strategy and Bayesian risk. In the limiting case, as the control horizon goes to infinity, we obtain a second-order partial differential equation in the domain of "close distributions". The results are compared with the Gaussian two-armed bandit. It turns out that the exponential and Gaussian two-armed bandits have the same description in the limiting case. Since the Gaussian two-armed bandit describes batch processing, this means that for the exponential two-armed bandit, batch processing does not increase the Bayesian risk compared with optimal one-by-one processing as the total number of processed data items goes to infinity.
Alexander Kolnogorov and Denis Grunev
Yaroslav-the-Wise Novgorod State University, Applied Mathematics and Information Science Department
41 B. Saint-Petersburgskaya Str., Velikiy Novgorod, Russia, 173003
MSC subject classifications: 93E20, 62L05, 62C10, 62C20, 62F35
Keywords: Poissonian two-armed bandit, Bayesian approach
1 Introduction
We consider the two-armed bandit problem (see, e.g., [1, 2]) in the following setting. Let , , be a controlled random process whose values are interpreted as incomes, depend only on the currently chosen actions, and are described by the exponential distribution density
[TABLE]
if , . Here are the one-step mathematical expectations of income, and a vector parameter completely describes the exponential two-armed bandit. We assume that the set of admissible parameter values is known a priori.
A control strategy generally assigns a random choice of the action at the point of time depending on the currently observed history of the process. For exponential distributions of incomes, the history consists of the cumulative numbers () of applications of both actions and the corresponding cumulative incomes . If one knew both , one should always choose the action corresponding to the larger of them, and the total expected income on the control horizon would then be equal to . If one instead uses strategy , the total expected income is less than the maximal one by the value
[TABLE]
which is called the regret. Here denotes the mathematical expectation with respect to the measure generated by strategy and parameter .
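The regret can be made concrete with a small simulation. The sketch below is illustrative only and is not taken from the paper: the names `estimate_regret`, `always_first` and `greedy_after_one_try` are hypothetical, and `m1`, `m2` stand for the one-step expected incomes of the two actions. The regret is accumulated as the expected per-step loss, which gives the same value as the difference of total expected incomes because each income has conditional mean equal to the mean of the chosen action.

```python
import random


def always_first(counts, incomes, rng):
    """Hypothetical baseline: always apply the first action (index 0)."""
    return 0


def greedy_after_one_try(counts, incomes, rng):
    """Hypothetical heuristic: try each action once, then pick the larger sample mean."""
    for arm in (0, 1):
        if counts[arm] == 0:
            return arm
    mean0 = incomes[0] / counts[0]
    mean1 = incomes[1] / counts[1]
    return 0 if mean0 >= mean1 else 1


def estimate_regret(m1, m2, horizon, strategy, runs=2000, seed=0):
    """Monte Carlo estimate of the regret of `strategy` over `horizon` steps."""
    rng = random.Random(seed)
    means = (m1, m2)
    best = max(means)
    total_loss = 0.0
    for _ in range(runs):
        counts = [0, 0]        # cumulative numbers of applications of each action
        incomes = [0.0, 0.0]   # cumulative incomes of each action
        for _ in range(horizon):
            arm = strategy(counts, incomes, rng)
            # random.expovariate takes the rate, i.e. 1 / mean
            income = rng.expovariate(1.0 / means[arm])
            counts[arm] += 1
            incomes[arm] += income
            total_loss += best - means[arm]   # expected loss of this step
    return total_loss / runs
```

For instance, `estimate_regret(1.0, 2.0, 50, always_first)` equals 50, since every step loses the difference of the two means. The strategies receive only the cumulative counts and incomes, which is the history described above.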
Let’s assign a prior distribution density on the set of parameters . The corresponding Bayesian risk is defined as follows:
[TABLE]
the optimal strategy is called the Bayesian strategy. Note that the Bayesian approach makes it possible to determine the Bayesian risk and the Bayesian strategy by solving a recursive Bellman-type equation for an arbitrary prior distribution. The minimax risk on the set is defined as
[TABLE]
the corresponding optimal strategy is called the minimax strategy. There is no direct method for determining the minimax strategy and minimax risk. However, one can determine them using the main theorem of game theory, according to which the following equality holds
[TABLE]
i.e., the minimax risk is equal to the Bayesian risk calculated with respect to the worst-case prior distribution, and the minimax strategy coincides with the corresponding Bayesian strategy.
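To illustrate how averaging over a prior relates to the worst case, the following sketch evaluates one fixed, non-optimal strategy on a finite parameter set. Everything here is an assumption made for illustration: the strategy (apply each action once, then commit to the action with the larger observed income), the function names, and the discrete prior are not from the paper. Because the Bayesian and minimax risks are defined through optimal strategies, the two numbers computed below are only upper bounds on them; the sketch shows that the prior-averaged regret never exceeds the worst-case regret.

```python
def try_once_then_commit_regret(m1, m2, horizon):
    """Exact regret of: apply each action once, then always apply the action
    whose single observed income was larger (a simple, non-optimal strategy)."""
    if horizon < 2:
        raise ValueError("horizon must be at least 2")
    gap = abs(m1 - m2)
    # For independent exponential incomes with means m1 and m2,
    # P(income of action 1 > income of action 2) = m1 / (m1 + m2),
    # so the worse action wins the comparison with probability min / (m1 + m2).
    p_wrong = min(m1, m2) / (m1 + m2)
    # One exploratory step on the worse action, then horizon - 2 committed steps.
    return gap * (1.0 + (horizon - 2) * p_wrong)


def bayes_and_worst_case(prior, horizon):
    """`prior` maps parameter pairs (m1, m2) to probabilities summing to 1.

    Returns the prior-averaged regret and the worst-case regret of the
    fixed strategy above over the finite parameter set.
    """
    regrets = {theta: try_once_then_commit_regret(*theta, horizon) for theta in prior}
    bayes = sum(prior[theta] * regrets[theta] for theta in prior)
    worst = max(regrets.values())
    return bayes, worst
```

A call such as `bayes_and_worst_case({(1.0, 3.0): 0.5, (2.0, 2.0): 0.25, (3.0, 2.0): 0.25}, 10)` returns the two values for a horizon of 10 steps.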
There are several other approaches to the two-armed bandit problem; we refer here to [3], [4], [5] and references therein.
The rest of the paper is organized as follows. A recursive Bellman-type equation for determining the Bayesian risk and Bayesian strategy is presented in Section 2. Another version of the recursive equation is presented in Section 3. In Section 4, we obtain a second-order partial differential equation in the limiting case as the control horizon goes to infinity. In Section 5, we compare the exponential and Gaussian two-armed bandits. It turns out that they have the same description in the limiting case. Since the Gaussian two-armed bandit describes batch processing (see, e.g., [5]), this means that for the exponential two-armed bandit, batch processing does not asymptotically increase the Bayesian risk compared with optimal one-by-one processing as the total number of processed data items goes to infinity.
2 Recursive equation
Let’s consider control strategies which are defined by a condition
[TABLE]
where are the current cumulative numbers of applications of both actions and are the corresponding current cumulative incomes. The posterior distribution at the point of time is calculated as
[TABLE]
where
[TABLE]
Here is defined as
[TABLE]
Note that is exponential distribution density. Let’s put . Then (2.1) remains correct if and/or . Denote . Using (1.3), (1.4) we obtain the following recursive Bellman-type equation for determining Bayesian risk (1.5) with respect to the posterior distribution (2.1):
[TABLE]
where
[TABLE]
if and then
[TABLE]
Here denote expected losses if initially the -th action is applied at the point of time and then control is optimally implemented (). Bayesian risk (1.5) is as follows
[TABLE]
Equations (2.3)–(2.5) also make it possible to determine the Bayesian strategy, which prescribes choosing the -th action if has the smaller value. In case of a tie, the choice of the action may be arbitrary.
Given large enough, consider the strategy that, at the start of the control, applies both actions times equally and then controls optimally. In this case
[TABLE]
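The posterior update used in this section depends on the history only through the cumulative numbers of applications and the cumulative incomes. As a minimal sketch, assuming a finite set of candidate parameter pairs instead of a prior density, the update can be written as follows. The function names and the discrete prior are hypothetical. The likelihood factor used is m^(-n) exp(-X/m) for n exponential incomes with mean m and total X, and it is not claimed to match the paper's formula (2.1) symbol by symbol.

```python
import math


def log_likelihood(mean, count, income):
    """Log-likelihood of `count` exponential incomes with total `income`."""
    return -count * math.log(mean) - income / mean


def posterior(prior, counts, incomes):
    """Posterior over a finite set of parameter pairs (m1, m2).

    `prior` maps (m1, m2) to prior probabilities; `counts` and `incomes` are the
    cumulative numbers of applications and cumulative incomes of the two actions.
    """
    log_weights = {}
    for (m1, m2), p in prior.items():
        if p <= 0.0:
            continue  # parameters with zero prior mass keep zero posterior mass
        log_weights[(m1, m2)] = (
            math.log(p)
            + log_likelihood(m1, counts[0], incomes[0])
            + log_likelihood(m2, counts[1], incomes[1])
        )
    shift = max(log_weights.values())   # subtract the maximum for numerical stability
    weights = {theta: math.exp(lw - shift) for theta, lw in log_weights.items()}
    norm = sum(weights.values())
    return {theta: w / norm for theta, w in weights.items()}
```

With zero counts and zero incomes the function returns the prior unchanged, which mirrors the remark that the posterior formula remains correct before an action has been applied.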
3 One more version of recursive equation
In this section, we obtain another version of recursive Bellman-type equation. Let’s denote
[TABLE]
where are Bayesian risks calculated with respect to the posterior distribution (2.1) and are defined in (2.2). Then the following recursive equation holds
[TABLE]
where
[TABLE]
if and then
[TABLE]
Here
[TABLE]
The Bayesian strategy prescribes choosing the -th action if has the smaller value. In case of a tie, the choice of the action is arbitrary. The Bayesian risk (1.5) is calculated by the formula
[TABLE]
Given large enough, consider the strategy that, at the start of the control, applies both actions times equally and then controls optimally. In this case
[TABLE]
Formulas (3.1)–(3.6) follow from (2.3)–(2.7).
4 A limiting description
In this section, we present a limiting description by a second-order partial differential equation. We consider the domain of “close distributions”, satisfying the condition with large enough but independent of , because it is in this domain that the maximal expected losses occur. Denote , , so that . Note that the one-step expected income and variance of the exponential two-armed bandit are the following
[TABLE]
. In the domain of distributions such that are close to , let’s put
[TABLE]
where , , ; . Let’s estimate the functions in (3.4). If is large enough, then according to the central limit theorem we have
[TABLE]
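The quality of the normal approximation behind this step can be checked numerically. The sketch below is an illustration under stated assumptions rather than the paper's estimate: it uses the fact that the sum of n independent exponential incomes with mean m has mean nm, variance nm², and an Erlang distribution with a closed-form distribution function, and compares that function with the matching normal one. The function names and the grid of evaluation points are hypothetical.

```python
import math


def erlang_cdf(x, n, mean):
    """Exact CDF of the sum of n independent exponential incomes with the given mean."""
    if x <= 0.0:
        return 0.0
    t = x / mean
    # P(S <= x) = 1 - sum_{k < n} exp(-t) t^k / k!, evaluated in log space
    tail = sum(math.exp(-t + k * math.log(t) - math.lgamma(k + 1)) for k in range(n))
    return 1.0 - tail


def normal_cdf(x, mu, sigma):
    """CDF of the normal distribution with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def max_clt_error(n, mean=1.0, points=121):
    """Largest |exact - normal| CDF gap on a grid within 3 standard deviations."""
    mu, sigma = n * mean, mean * math.sqrt(n)
    worst = 0.0
    for i in range(points):
        z = -3.0 + 6.0 * i / (points - 1)
        x = mu + z * sigma
        worst = max(worst, abs(erlang_cdf(x, n, mean) - normal_cdf(x, mu, sigma)))
    return worst
```

Comparing `max_clt_error(10)` with `max_clt_error(1000)` shows the gap shrinking as the number of summed incomes grows, which is the regime in which the estimates of this section are applied.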
Hence, for functions in (3.4) one derives
[TABLE]
where
[TABLE]
Let’s estimate factors in integrals of (3.3). First, one derives
[TABLE]
with
[TABLE]
Let’s now estimate . Since
[TABLE]
and , we obtain that
[TABLE]
So, one can verify that
[TABLE]
Let’s put
[TABLE]
Using (4.1)–(4.4) and (4.6), one derives from (3.1)–(3.4) the integro-difference equation
[TABLE]
where
[TABLE]
if and then
[TABLE]
The Bayesian strategy prescribes choosing the -th action if has the smaller value. In case of a tie, the choice of the action is arbitrary. For the strategy that, at the start of the control, applies both actions times equally and then controls optimally, the Bayesian risk (1.5) is calculated by the formula
[TABLE]
Formula (4.10) follows from (3.6) with the use of (4.1) and (4.6).
Finally, let’s present a limiting description of (4.9) by a second-order partial differential equation. It is sufficient to consider the first equation of (4.9). The estimates below are carried out up to terms of order . First, we expand as a Taylor series
[TABLE]
Substituting (4.11) into the first equation (4.9) and using (4.5) we obtain
[TABLE]
and, hence, in the limiting case as
[TABLE]
where . Similarly,
[TABLE]
Equations (4.12), (4.13) must be complemented by equation (4.7), which is now written as
[TABLE]
From (4.12)–(4.14) one derives
[TABLE]
Initial conditions are the following
[TABLE]
if . The Bayesian strategy prescribes choosing the -th action if the -th term in the left-hand side of (4.15) has the smaller value. In case of a tie, the choice of the action is arbitrary.
5 Comparison with Gaussian two-armed bandit
The Gaussian two-armed bandit is characterized by incomes , , whose values depend only on the currently chosen actions and are described by the Gaussian (normal) distribution density
[TABLE]
if , . The variance is assumed to be known and the expectations are unknown, so a vector parameter describes the Gaussian two-armed bandit. We assume that the set of admissible parameter values is known a priori.
A control strategy generally assigns a random choice of the action at the point of time depending on the currently observed history , where () are the cumulative numbers of applications of both actions and are the corresponding cumulative incomes.
Again, one can assign a prior distribution density and define a regret and Bayesian risk just like in (1.4) and (1.5). To determine Bayesian risk in the domain of “close distributions” one should solve the following integro-difference equation (see, e.g. [5]):
[TABLE]
where
[TABLE]
if and then
[TABLE]
Here and are defined in Section 4. The Bayesian strategy prescribes choosing the -th action if has the smaller value. In case of a tie, the choice of the action is arbitrary. For the strategy that, at the start of the control, applies both actions times equally and then controls optimally, the Bayesian risk (1.5) is calculated by the formula
[TABLE]
Here is defined in Section 4. In the limiting case as , one can verify that the integro-difference equation (5.4) results in the second-order partial differential equation (4.15) with initial conditions (4.16). Recall that the Gaussian two-armed bandit describes batch processing [5]. This means that for the exponential two-armed bandit, batch processing does not asymptotically increase the Bayesian risk compared with optimal one-by-one processing as the total number of processed data items goes to infinity.
References
[1] Berry, D. A. and Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments. Chapman & Hall, London.
[2] Presman, E. L. and Sonin, I. M. (1990). Sequential Control with Incomplete Information: Bayesian Approach. Academic Press, New York.
[3] Sragovich, V. G. (2006). Mathematical Theory of Adaptive Control. World Sci., Singapore.
[4] Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge Univ. Press, Cambridge.
[5] Kolnogorov, A. V. (2018). Gaussian Two-Armed Bandit and Optimization of Batch Data Processing. Problems of Information Transmission 54 84–100.
