A new approach to Poissonian two-armed bandit problem
Alexander Kolnogorov

TL;DR
This paper introduces a Bayesian method for solving a continuous-time two-armed bandit problem with Poisson processes, using current process history rather than posterior evolution, and provides recursive and PDE-based solutions.
Contribution
It presents a novel Bayesian approach that leverages process history instead of posterior evolution, with recursive equations and PDEs for strategy and risk calculation.
Findings
Developed recursive equations for Bayesian strategies
Derived PDEs for limiting case analysis
Enhanced understanding of process history in Bayesian bandit solutions
Abstract
We consider a continuous-time two-armed bandit problem in which incomes are described by Poissonian processes. We develop a Bayesian approach with an arbitrary prior distribution. We present two versions of a recursive equation for determining the Bayesian piece-wise constant strategy and the Bayesian risk, and a partial differential equation in the limiting case. Unlike previously considered Bayesian settings, our description uses the current history of the process rather than the evolution of the posterior distribution.
Yaroslav-the-Wise Novgorod State University
Applied Mathematics and Information Science Department
41 B.Saint-Petersburgskaya Str., Velikiy Novgorod, Russia, 173003
MSC classes: 93E20, 62L05, 62C10, 62C20, 62F35
Keywords: Poissonian two-armed bandit, Bayesian approach
1 Introduction
We consider a continuous-time two-armed bandit problem. This setting results in either a Poissonian or a diffusion two-armed bandit. A quite general Poissonian two-armed bandit was considered in [1, 2]. In [3], the consideration of Poissonian and diffusion bandit problems is restricted to the case of independent arms and discounted rewards. An interesting, though special, case of a diffusion two-armed bandit is presented in [4]. Some approaches to the discrete-time two-armed bandit problem are presented in [5], [6], [7]. In the present article, we develop a new general approach to the Poissonian two-armed bandit in the Bayesian setting.
Formally, a Poissonian two-armed bandit is a continuous-time controlled random process . Its values are usually interpreted as incomes and depend only on the chosen actions as follows. If on the time interval , the action was chosen, then
[TABLE]
. Thus a vector parameter completely describes the considered Poissonian two-armed bandit. The set of admissible parameter values is assumed to be known.
A control strategy generally assigns a random choice of the action at the point of time depending on the currently observed history of the process, i.e. the cumulative times for which each action has been applied () and the corresponding cumulative incomes . In what follows, current values at the point of time are denoted by . If one knew , one should always choose the action corresponding to the larger of them; the total expected income on the control horizon would thus be equal to . But if one uses some strategy , the total expected income is less than the maximal one by the value
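The process described above can be simulated directly. The following is a minimal sketch, assuming that each arm generates a Poisson process with its own intensity and that the action is held fixed on consecutive intervals of equal length. The names `poisson_increment`, `simulate`, `rates`, `actions`, and `eps` are illustrative and do not come from the paper.

```python
import random


def poisson_increment(rate, duration, rng):
    """Count the jumps of a Poisson process with intensity `rate`
    on an interval of length `duration`, using exponential waiting times."""
    if rate <= 0.0:
        return 0  # a zero-intensity arm never produces income
    count = 0
    elapsed = rng.expovariate(rate)
    while elapsed <= duration:
        count += 1
        elapsed += rng.expovariate(rate)
    return count


def simulate(rates, actions, eps, rng):
    """Apply the arms listed in `actions` (0 or 1), each for a time `eps`.

    Returns the cumulative application time and cumulative income of each arm,
    which together form the observed history of the process.
    """
    times = [0.0, 0.0]
    incomes = [0, 0]
    for arm in actions:
        incomes[arm] += poisson_increment(rates[arm], eps, rng)
        times[arm] += eps
    return times, incomes
```

For example, `simulate((1.0, 3.0), [0] * 50 + [1] * 50, 0.1, random.Random(0))` applies each arm for a cumulative time of 5 and returns the two cumulative incomes.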
[TABLE]
which is called the regret. Here denotes the mathematical expectation with respect to the measure generated by strategy and parameter .
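As a small worked illustration of the regret, assume that the two arms have known intensities and that a non-anticipating strategy applies each arm for a given expected cumulative time. The expected income of an arm is then its intensity times its expected application time, so the regret reduces to a weighted sum of the intensity gaps. The function name `regret` and its arguments are hypothetical.

```python
def regret(rates, expected_times):
    """Regret of a strategy described by its expected cumulative times per arm.

    rates          -- intensities of the two arms (assumed known for this check)
    expected_times -- expected cumulative application time of each arm;
                      their sum is the control horizon
    """
    best = max(rates)
    horizon = sum(expected_times)
    expected_income = sum(lam * t for lam, t in zip(rates, expected_times))
    # Maximal expected income minus the expected income actually obtained.
    return horizon * best - expected_income
```

With intensities (1, 3) and expected times (2, 8) on a horizon of length 10, the maximal expected income is 30 and the obtained one is 26, so the regret is 4. A strategy that always applies the better arm has zero regret.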
Let us assign a prior distribution density on the set of parameters . The corresponding Bayesian risk is defined as follows
[TABLE]
the optimal strategy is called the Bayesian strategy. The minimax risk on the set is defined as
[TABLE]
the corresponding optimal strategy is called the minimax strategy.
No direct method for determining the minimax strategy and the minimax risk exists. However, one can determine them using the main theorem of game theory. According to this theorem, the following equality holds
[TABLE]
i.e. the minimax risk is equal to the Bayesian risk calculated with respect to the worst-case prior distribution, and the minimax strategy coincides with the corresponding Bayesian strategy. Note that in the case of a finite set , determining the minimax risk according to equality (1.5) is not laborious because the Bayesian risk is a concave function of the prior distribution.
The rest of the paper is organized as follows. A recursive Bellman-type equation for determining the Bayesian risk for piece-wise constant strategies is presented in Section 2. Note that our approach differs from that presented in [1], [2]: we recalculate the Bayesian risk with respect to the current statistics, whereas in [1], [2] recalculations are performed with respect to the current posterior distribution and . Our approach applies to quite general sets . The approach presented in [1], [2] applies to finite sets of parameters, and its generalization to arbitrary sets is not obvious. In Section 3, another version of the recursive equation is derived. In the limiting case, we obtain a partial differential equation, which is presented in Section 4.
2 Recursive equation
Let us consider piece-wise constant strategies . To this end, we assume that the control horizon is partitioned into a number of intervals of length on which the chosen action does not change. Hence, and for any , , we have where is constant on the time interval . The posterior distribution at the point of time is calculated as
[TABLE]
where
[TABLE]
Since , this formula remains correct if and/or . Denote . With the use of (1.1) we obtain the following standard recursive Bellman-type equation for determining Bayesian risk (1.3) with respect to the posterior distribution (2.1)
[TABLE]
where
[TABLE]
if and then
[TABLE]
Here are the expected losses if the -th action is applied first on the control horizon of length and the control is optimal thereafter ().
Bayesian risk (1.3) is calculated by the formula
[TABLE]
Equations (2.3)–(2.9) simultaneously determine the Bayesian risk and the Bayesian strategy. The Bayesian strategy prescribes choosing the -th action (i.e. ) if has the smaller value. In case of a tie, the choice is arbitrary.
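A recursion of this kind can be sketched numerically. The code below is not a transcription of (2.3)–(2.9). It is a minimal illustration under stated assumptions: the parameter set is finite, the prior is a list of weights, the state is the current history (cumulative incomes and numbers of elapsed intervals for each arm), and the infinite sum over possible incomes on one interval is truncated at `max_jump`. All names (`bayes_risk`, `params`, `prior`, `eps`, `n_steps`, `max_jump`) are hypothetical.

```python
import math
from functools import lru_cache


def bayes_risk(params, prior, eps, n_steps, max_jump=25):
    """Bayesian risk of the optimal piece-wise constant strategy.

    params  -- finite list of (lambda_1, lambda_2) pairs (an assumption)
    prior   -- prior weights of the pairs, summing to one
    eps     -- length of one interval; the horizon is n_steps * eps
    """
    def pmf(count, mean):
        # Poisson probability; Python evaluates 0.0 ** 0 as 1.0.
        return mean ** count * math.exp(-mean) / math.factorial(count)

    def posterior(x1, n1, x2, n2):
        weights = [w * pmf(x1, lam1 * n1 * eps) * pmf(x2, lam2 * n2 * eps)
                   for w, (lam1, lam2) in zip(prior, params)]
        total = sum(weights)
        return [w / total for w in weights]

    @lru_cache(maxsize=None)
    def risk(x1, n1, x2, n2):
        if n1 + n2 == n_steps:
            return 0.0  # horizon exhausted, no further losses
        post = posterior(x1, n1, x2, n2)
        losses = []
        for arm in (0, 1):
            # Expected regret accumulated on the next interval.
            loss = eps * sum(q * (max(th) - th[arm]) for q, th in zip(post, params))
            for jump in range(max_jump + 1):
                # Predictive probability of observing `jump` incomes.
                prob = sum(q * pmf(jump, eps * th[arm]) for q, th in zip(post, params))
                if prob > 0.0:
                    if arm == 0:
                        loss += prob * risk(x1 + jump, n1 + 1, x2, n2)
                    else:
                        loss += prob * risk(x1, n1, x2 + jump, n2 + 1)
            losses.append(loss)
        return min(losses)  # the arm with the smaller loss is chosen

    return risk(0, 0, 0, 0)
```

The strategy is implicit in the `min`: at each state the arm with the smaller expected loss is applied. Because of the truncation, the returned value slightly underestimates the risk when `max_jump` is small relative to `eps` times the largest intensity.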
3 Another version of recursive equation
In this section, we obtain another version of the recursive Bellman-type equation. Let us denote
[TABLE]
where are Bayesian risks calculated with respect to the posterior distribution (2.1) and are defined in (2.2). Then the following recursive equation holds
[TABLE]
where
[TABLE]
if and then
[TABLE]
where
[TABLE]
The Bayesian strategy prescribes choosing the -th action (i.e. ) if has the smaller value. In case of a tie, the choice is arbitrary. The Bayesian risk (1.3) is calculated by the formula
[TABLE]
Formulas (3.1)–(3.8) follow from (2.3)–(2.10). To obtain them, one multiplies both sides of (2.9) by and carries out the algebraic transformations.
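The idea of rescaling the risk so that the posterior normalization disappears can be illustrated as follows. The multiplier used in the paper is not reproduced here. This sketch assumes one natural choice: the risk at a state is multiplied by the prior-weighted likelihood of the observed history, with factors that do not depend on the parameter dropped. It also assumes a finite parameter set and truncates the sum over incomes at `max_jump`. All names are hypothetical.

```python
import math
from functools import lru_cache


def scaled_bayes_risk(params, prior, eps, n_steps, max_jump=25):
    """Bayesian risk computed through a rescaled (unnormalized) recursion.

    The rescaled quantity equals the risk times the sum over parameters of
    prior * lambda_1**x1 * exp(-lambda_1*t1) * lambda_2**x2 * exp(-lambda_2*t2),
    so no posterior normalization is needed inside the recursion.
    """
    def weight(theta, x1, n1, x2, n2):
        lam1, lam2 = theta
        return (lam1 ** x1 * math.exp(-lam1 * n1 * eps)
                * lam2 ** x2 * math.exp(-lam2 * n2 * eps))

    @lru_cache(maxsize=None)
    def scaled(x1, n1, x2, n2):
        if n1 + n2 == n_steps:
            return 0.0
        mass = [w * weight(th, x1, n1, x2, n2) for w, th in zip(prior, params)]
        values = []
        for arm in (0, 1):
            # Unnormalized expected regret on the next interval.
            value = eps * sum(m * (max(th) - th[arm]) for m, th in zip(mass, params))
            for jump in range(max_jump + 1):
                # Parameter-free factor left over from the Poisson probability.
                factor = eps ** jump / math.factorial(jump)
                if arm == 0:
                    value += factor * scaled(x1 + jump, n1 + 1, x2, n2)
                else:
                    value += factor * scaled(x1, n1, x2 + jump, n2 + 1)
            values.append(value)
        return min(values)

    # At the initial state the scaling factor is the total prior mass, i.e. one.
    return scaled(0, 0, 0, 0)
```

Since the scaling factor is positive, taking the minimum of the rescaled losses selects the same arm as taking the minimum of the original ones, so the strategy is unchanged by the rescaling.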
4 A limiting description
In this section, we consider the case when has a small value. In this case (3.7) takes the form
[TABLE]
Equation (4.7) must be complemented with (3.1), which is now written as
[TABLE]
By (4.7)–(4.8) one derives in the limiting case (as ) the following partial differential equation
[TABLE]
where
[TABLE]
Bayesian risk (1.3) is calculated by the formula
[TABLE]
Note that the partial differential equation simultaneously describes the evolution of and the strategy. The strategy must choose the -th action if the -th term on the left-hand side of (4.9) has the smaller value; in case of a tie, the choice of the action may be arbitrary.
References
- [1] Presman, E. L. and Sonin, I. M. (1990). Sequential Control with Incomplete Information: Bayesian Approach, Academic Press, New York.
- [2] Presman, E. L. (1990). Poisson Version of the Two-Armed Bandit Problem with Discounting. Theory Probab. Appl. 35 307–317.
- [3] Mandelbaum, A. (1987). Continuous Multi-Armed Bandits and Multiparameter Processes. Ann. Probab. 15 1527–1556.
- [4] Berry, D. A. and Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments, Chapman & Hall, London.
- [5] Sragovich, V. G. (2006). Mathematical Theory of Adaptive Control, World Sci., Singapore.
- [6] Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, Learning, and Games, Cambridge Univ. Press, Cambridge.
- [7] Kolnogorov, A. V. (2018). Gaussian Two-Armed Bandit and Optimization of Batch Data Processing. Problems of Information Transmission 54 84–100.
