Boundary Crossing Probabilities for General Exponential Families
Odalric-Ambrym Maillard

TL;DR
This paper extends boundary crossing probability bounds for exponential families to arbitrary finite dimensions, enabling analysis of advanced bandit algorithms and revealing overlooked classical techniques.
Contribution
It generalizes boundary crossing probability bounds from one-dimensional to multi-dimensional exponential families, facilitating new regret analyses for bandit algorithms.
Findings
Provides concentration inequalities for multi-dimensional exponential families.
Enables analysis of KL-ucb and KL-ucb+ strategies in general settings.
Highlights the rediscovery of classical proof techniques relevant to modern bandit theory.
INRIA Lille - Nord Europe
40 Avenue Halley
59650 Villeneuve d’Ascq, France
We consider parametric exponential families of dimension $K$ on the real line. We study a variant of boundary crossing probabilities coming from the multi-armed bandit literature, in the case when the real-valued distributions form an exponential family of dimension $K$. Formally, our result is a concentration inequality that bounds the probability that $\mathcal{B}^{\psi}(\widehat{\theta}_{n},\theta^{\star})\geqslant f(t/n)/n$, where $\theta^{\star}$ is the parameter of an unknown target distribution, $\widehat{\theta}_{n}$ is the empirical parameter estimate built from $n$ observations, $\psi$ is the log-partition function of the exponential family and $\mathcal{B}^{\psi}$ is the corresponding Bregman divergence. From the perspective of stochastic multi-armed bandits, we pay special attention to the case when the boundary function $f$ is logarithmic, as it enables us to analyze the regret of the state-of-the-art KL-ucb and KL-ucb+ strategies, whose analysis was left open in such generality. Indeed, previous results only hold for the case when $K=1$, while we provide results for arbitrary finite dimension $K$, thus considerably extending the existing results. Perhaps surprisingly, we highlight that the proof techniques to achieve these strong results already existed three decades ago in the work of T.L. Lai, and were apparently forgotten in the bandit community. We provide a modern rewriting of these beautiful techniques that we believe are useful beyond the application to stochastic multi-armed bandits.
*Keywords:* Exponential Families, Bregman Concentration, Multi-armed Bandits, Optimality.
1 Multi-armed bandit setup and notations
Let us consider a stochastic multi-armed bandit problem $(\mathcal{A},\nu)$, where $\mathcal{A}$ is a finite set of cardinality $A\in\mathbb{N}$ and $\nu=(\nu_{a})_{a\in\mathcal{A}}$ is a set of probability distributions over $\mathbb{R}$ indexed by $\mathcal{A}$. The game is sequential and goes as follows:
At each round $t\in\mathbb{N}$, the player picks an arm $a_{t}$ (based on her past observations) and receives a stochastic payoff $Y_{t}$ drawn independently at random according to the distribution $\nu_{a_{t}}$. She only observes the payoff $Y_{t}$, and her goal is to maximize her expected cumulated payoff, $\mathbb{E}\bigl[\sum_{t=1}^{T}Y_{t}\bigr]$, over a possibly unknown number of steps $T$.
Although the term multi-armed bandit problem was probably coined during the 60's in reference to the casino slot machines of the 19th century, the formulation of this problem is due to Herbert Robbins, one of the most brilliant minds of his time (see Robbins (1952)), and takes its origin in earlier questions about optimal stopping policies for clinical trials, see Thompson (1933, 1935), Wald (1945). We refer the interested reader to Robbins (2012) regarding the legacy of the immense work of H. Robbins in mathematical statistics for the sequential design of experiments, a volume compiling his most outstanding research for his 70th birthday. Since then, the field of multi-armed bandits has grown large and bold, and we humbly refer to the introduction of Cappé et al. (2013) for key historical aspects about the development of the field. Most notably, these include: first, the introduction of dynamic allocation indices (aka Gittins indices, Gittins (1979)), suggesting that an optimal strategy can be found in the form of an index strategy (one that at each round selects an arm with highest "index"); second, the seminal work of Lai and Robbins (1985), which shows that indexes can be chosen as "upper confidence bounds" on the mean reward of each arm, and provides the first asymptotic lower bound on the achievable performance for specific distributions; third, the generalization of this lower bound in the 90's to generic distributions by Burnetas and Katehakis (1997) (see also the recent work from Garivier et al. (2016)), as well as the asymptotic analysis by Agrawal (1995) of generic classes of upper-confidence-bound based index policies; and finally Auer et al. (2002), which popularized a simple sub-optimal index strategy termed UCB and, most importantly, opened the quest for finite-time, as opposed to asymptotic, performance guarantees. For the purpose of this paper, we now recall the formal definitions and notations for the stochastic multi-armed bandit problem, following Cappé et al. (2013).
Quality of a strategy
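For each arm $a\in\mathcal{A}$, let $\mu_{a}$ be the expectation of the distribution $\nu_{a}$, and let $a^{\star}$ be any optimal arm in the sense that
$$a^{\star}\in\operatorname*{argmax}_{a\in\mathcal{A}}\mu_{a}\,.$$
We write $\mu^{\star}$ as a short-hand notation for the largest expectation $\mu_{a^{\star}}$ and denote the gap of the expected payoff $\mu_{a}$ of an arm $a$ to $\mu^{\star}$ as $\Delta_{a}=\mu^{\star}-\mu_{a}$. In addition, we denote the number of times each arm $a$ is pulled between the rounds $1$ and $T$ by $N_{a}(T)$,
$$N_{a}(T)=\sum_{t=1}^{T}\mathbb{I}\{a_{t}=a\}\,.$$
Definition 1 (Expected regret)
The quality of a strategy is evaluated using the notion of expected regret (or simply, regret) at round $T\geqslant 1$, defined as
$$R_{T}=\mathbb{E}\biggl[\sum_{t=1}^{T}\bigl(\mu^{\star}-Y_{t}\bigr)\biggr]=\mathbb{E}\biggl[\sum_{t=1}^{T}\bigl(\mu^{\star}-\mu_{a_{t}}\bigr)\biggr]=\sum_{a\in\mathcal{A}}\Delta_{a}\,\mathbb{E}\bigl[N_{a}(T)\bigr]\,,$$
where we used the tower rule for the first equality. The expectation is with respect to the random draws of the $Y_{t}$ according to the $\nu_{a_{t}}$ and to the possible auxiliary randomization introduced by the decision-making strategy.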
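As a hedged illustration of the regret decomposition over arms, the following minimal sketch simulates a Bernoulli bandit and evaluates the quantity $\sum_{a}\Delta_{a}N_{a}(T)$ on one run (its expectation is the regret). The names `run_bandit` and `round_robin`, the Bernoulli payoffs and the policy signature are assumptions made for this example only; they do not come from the paper.

```python
import random

def run_bandit(means, policy, horizon, seed=0):
    """Play a Bernoulli bandit for `horizon` rounds.

    Returns (sum_a Delta_a * N_a(T), pull counts) for this single run.
    Averaging the first value over many runs estimates the expected regret.
    """
    rng = random.Random(seed)
    counts = [0] * len(means)    # N_a(t): number of pulls of each arm so far
    sums = [0.0] * len(means)    # cumulated payoff of each arm
    for t in range(1, horizon + 1):
        a = policy(t, counts, sums)                   # arm chosen from past data
        y = 1.0 if rng.random() < means[a] else 0.0   # payoff drawn from nu_a
        counts[a] += 1
        sums[a] += y
    mu_star = max(means)
    regret = sum((mu_star - mu) * n for mu, n in zip(means, counts))
    return regret, counts

def round_robin(t, counts, sums):
    """Hypothetical baseline policy: cycle through the arms."""
    return (t - 1) % len(counts)

regret, counts = run_bandit([0.5, 0.9], round_robin, horizon=10)
print(regret, counts)
```

Any index strategy can be plugged in through the `policy` argument, as long as it only uses the round number and the per-arm counts and sums.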
Empirical distributions
We denote empirical distributions in two related ways, depending on whether random averages indexed by the global time $t$ or averages of given numbers $n$ of pulls of a given arm are considered. The first series of averages will be referred to by using a functional notation for the indexation in the global time: $\widehat{\nu}_{a}(t)$, while the second series will be indexed with the local times $n$ in subscripts: $\widehat{\nu}_{a,n}$. These two related indexations, functional for global times and random averages versus subscript indexes for local times, will be consistent throughout the paper for all quantities at hand, not only empirical averages.
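Definition 2 (Empirical distributions)
For each $m\geqslant 1$, we denote by $\tau_{a,m}$ the round at which arm $a$ was pulled for the $m$-th time, that is
$$\tau_{a,m}=\min\bigl\{t\in\mathbb{N}:\ N_{a}(t)=m\bigr\}\,.$$
For each round $t$ such that $N_{a}(t)\geqslant 1$, we then define the following two empirical distributions
$$\widehat{\nu}_{a}(t)=\frac{1}{N_{a}(t)}\sum_{s=1}^{t}\delta_{Y_{s}}\,\mathbb{I}\{a_{s}=a\}\qquad\text{and}\qquad\widehat{\nu}_{a,n}=\frac{1}{n}\sum_{m=1}^{n}\delta_{X_{a,m}}\,,\quad\text{where}\quad X_{a,m}=Y_{\tau_{a,m}}\,,$$
where $\delta_{x}$ denotes the Dirac distribution on $x\in\mathbb{R}$.
Lemma 1
The random variables $X_{a,m}=Y_{\tau_{a,m}}$, where $m=1,2,\ldots$, are independent and identically distributed according to $\nu_{a}$. Moreover, we have the rewriting $\widehat{\nu}_{a}(t)=\widehat{\nu}_{a,N_{a}(t)}$.
**Proof of Lemma 1:** For means based on local times we consider the filtration $(\mathcal{F}_{t})$, where for all $t\geqslant 1$, the $\sigma$-algebra $\mathcal{F}_{t}$ is generated by $a_{1}$, $Y_{1}$, $\ldots$, $a_{t}$, $Y_{t}$. In particular, $a_{t+1}$ and all $N_{a}(t+1)$ are $\mathcal{F}_{t}$-measurable. Likewise, $\bigl\{\tau_{a,m}=t\bigr\}$ is $\mathcal{F}_{t-1}$-measurable. That is, each random variable $\tau_{a,m}$ is a (predictable) stopping time. Hence, the result follows by a standard result in probability theory (see, e.g., Chow and Teicher 1988, Section 5.3).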
2 Boundary crossing probabilities for the generic KL-ucb strategy.
The first appearance of the KL-ucb strategy can be traced back at least to Lai (1987), although it was not given an explicit name at that time. It seems the strategy was forgotten after the work of Auer et al. (2002), which opened a decade of intensive research on finite-time analysis of bandit strategies and extensions to variants of the problem (Audibert et al. (2009), Audibert and Bubeck (2010), see also Bubeck et al. (2012) for a survey of relevant variants of bandit problems), until the work of Honda and Takemura (2010) shed a novel light on the asymptotically optimal strategies. Thanks to their illuminating work, the first finite-time regret analysis of KL-ucb was obtained by Maillard et al. (2011) for discrete distributions, soon extended to handle exponential families of dimension $1$ as well, in the unifying work of Cappé et al. (2013). However, as we will see in this paper, we should all be much indebted to the outstanding work of T.L. Lai regarding the analysis of this index strategy, both asymptotically and in finite time, as a second look at his papers shows how to bypass the limitations of the state-of-the-art regret bounds for the control of boundary crossing probabilities in this context (see Theorem 3 below). Actually, the first focus of the present paper is not stochastic bandits but boundary crossing probabilities, and the bandit setting that we provide here should be considered only as giving a solid motivation for the contribution of this paper.
Let us now introduce formally the KL-ucb strategy. We assume that the learner is given a family $\mathcal{D}_{a}\subset\mathfrak{M}_{1}(\mathbb{R})$ of probability distributions that satisfies $\nu_{a}\in\mathcal{D}_{a}$ for each arm $a\in\mathcal{A}$, where $\mathfrak{M}_{1}(\mathbb{R})$ denotes the set of all probability distributions over $\mathbb{R}$. For two distributions $\nu,\nu^{\prime}\in\mathfrak{M}_{1}(\mathbb{R})$, we denote by $\texttt{KL}(\nu,\nu^{\prime})$ their Kullback-Leibler divergence and by $E(\nu)$ and $E(\nu^{\prime})$ their expectations. (This expectation operator is denoted by $E$ while expectations with respect to underlying randomizations are referred to as $\mathbb{E}$.)
The generic form of the algorithm of interest in this paper is described as Algorithm 1. It relies on two parameters: an operator $\Pi_{a}$ (in spirit, a projection operator) that associates with each empirical distribution $\widehat{\nu}_{a}(t)$ an element of the model $\mathcal{D}_{a}$; and a non-decreasing function $f$, which is typically such that $f(t)\approx\log(t)$.
At each round $t$, an upper confidence bound $U_{a}(t)$ is associated with the expectation $\mu_{a}$ of the distribution $\nu_{a}$ of each arm; an arm with highest upper confidence bound is then played.
In the literature, another variant of KL-ucb is introduced where the term $f(t)$ is replaced with $f(t/N_{a}(t))$. We refer to this algorithm as KL-ucb+. While KL-ucb has been analyzed and shown to be provably near-optimal, the variant KL-ucb+ has not been analyzed yet.
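To make the index computation concrete, here is a minimal sketch for the special case of Bernoulli arms, where the divergence between two distributions reduces to the binary relative entropy between their means. The upper confidence bound is taken as the largest mean $\mu$ with $N_{a}(t)\,\mathrm{kl}(\widehat{\mu},\mu)\leqslant f(t)$ for KL-ucb, and with $f(t/N_{a}(t))$ for KL-ucb+. The function names, the bisection tolerance and the clamping used to keep $\log\log$ well defined for small arguments are assumptions of this sketch, not part of the paper's Algorithm 1.

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """Binary relative entropy kl(p, q), with clipping to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def threshold(x, xi=0.0):
    """f(x) = log(x) + xi * log(log(x)); arguments are clamped (assumption)."""
    log_x = math.log(max(x, 1.0))
    return log_x + xi * math.log(max(log_x, 1.0))

def kl_ucb_index(mean_hat, n_pulls, t, xi=0.0, plus=False, tol=1e-9):
    """Largest mu in [mean_hat, 1] with n_pulls * kl(mean_hat, mu) <= f(.).

    plus=False uses f(t) (KL-ucb); plus=True uses f(t / n_pulls) (KL-ucb+).
    """
    level = threshold(t / n_pulls if plus else t, xi) / n_pulls
    lo, hi = mean_hat, 1.0
    # kl(mean_hat, .) is increasing on [mean_hat, 1], so bisection applies.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(mean_hat, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo

print(kl_ucb_index(0.5, 10, 100), kl_ucb_index(0.5, 10, 100, plus=True))
```

Since $f$ is non-decreasing and $t/N_{a}(t)\leqslant t$, the KL-ucb+ index is never larger than the KL-ucb index for the same data.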
Alternative formulation of KL-ucb
We wrote the KL-ucb algorithm so that the optimization problem resulting from the computation of $U_{a}(t)$ is easy to handle. Under some assumption, one can rewrite this term in an equivalent form that is better suited to the analysis. We refer to Cappé et al. (2013):
Lemma 2 (Rewriting)
Assume that the following holds.
Assumption 1
There is a known interval $\Omega\subset\mathbb{R}$ with boundary $\mu^{-}\leqslant\mu^{+}$, for which each model $\mathcal{D}_{a}$ of probability measures is included in $\mathfrak{M}_{1}(\Omega)$ and such that, for every distribution $\nu$ under consideration and every $\mu\in\Omega\setminus\{\mu^{+}\}$,
$$\inf\,\Bigl\{\texttt{KL}(\nu,\nu^{\prime}):\ \nu^{\prime}\in\mathcal{D}_{a}\ \ \text{s.t.}\ \ E(\nu^{\prime})>\mu\Bigr\}=\min\,\Bigl\{\texttt{KL}(\nu,\nu^{\prime}):\ \nu^{\prime}\in\mathcal{D}_{a}\ \ \text{s.t.}\ \ E(\nu^{\prime})\geqslant\mu\Bigr\}\,.$$
Then the upper bound used by the KL-ucb algorithm satisfies the following equality
$$U_{a}(t)=\max\left\{\mu\in\Omega\setminus\{\mu^{+}\}:\;\mathcal{K}_{a}\Bigl(\Pi_{a}\bigl(\widehat{\nu}_{a}(t)\bigr),\mu\Bigr)\leqslant\frac{f(t)}{N_{a}(t)}\right\}\,,$$
where $\mathcal{K}_{a}(\nu,\mu)$ denotes the infimum on the left-hand side of the identity in Assumption 1.
Likewise, a similar result holds for KL-ucb+ but where $f(t)$ is replaced with $f(t/N_{a}(t))$.
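Remark 1
For instance, this assumption is valid when $\mathcal{D}_{a}=\mathfrak{M}_{1}([0,1])$ and $\Omega=[0,1]$. Indeed we can replace the strict inequality with an inequality provided that $\mu<1$ by Honda and Takemura (2010), and the infimum is reached by lower semi-continuity of the KL divergence and convexity and closure of the set $\bigl\{\nu^{\prime}\in\mathfrak{M}_{1}([0,1]):\ E(\nu^{\prime})\geqslant\mu\bigr\}$.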
Using boundary-crossing probabilities for regret analysis
We continue this warm-up by restating a convenient way to decompose the regret that brings out the boundary crossing probabilities at the heart of this paper. The following lemma is a direct adaptation from Cappé et al. (2013):
Lemma 3 (From Regret to Boundary Crossing Probabilities)
Let $\varepsilon>0$ be a small constant such that $\varepsilon<\min\bigl\{\mu^{\star}-\mu_{a}:\ a\in\mathcal{A},\ \mu_{a}<\mu^{\star}\bigr\}$. For $\mu,\gamma\in\mathbb{R}$, let us introduce the following set
$$\mathcal{C}_{\mu,\gamma}=\Bigl\{\nu^{\prime}\in\mathfrak{M}_{1}(\mathbb{R}):\ \ \mathcal{K}_{a}(\Pi_{a}(\nu^{\prime}),\mu)<\gamma\Bigr\}\,.$$
Then, the number of pulls of a sub-optimal arm $a\in\mathcal{A}$ by Algorithm KL-ucb satisfies
$$\mathbb{E}\bigl[N_{a}(T)\bigr]\leqslant 2+\inf_{n_{0}\leqslant T}\bigg\{n_{0}+\sum_{n\geqslant n_{0}+1}^{T}\mathbb{P}\Bigl\{\widehat{\nu}_{a,n}\in\mathcal{C}_{\mu^{\star}-\varepsilon,f(T)/n}\Bigr\}\bigg\}+\sum_{t=|\mathcal{A}|}^{T-1}\underbrace{\mathbb{P}\Bigl\{N_{a^{\star}}(t)\,\mathcal{K}_{a^{\star}}\bigl(\Pi_{a^{\star}}(\widehat{\nu}_{a^{\star},N_{a^{\star}}(t)}),\,\mu^{\star}-\varepsilon\bigr)>f(t)\Bigr\}}_{\text{Boundary Crossing Probability}}\,.$$
Likewise, the number of pulls of a sub-optimal arm $a\in\mathcal{A}$ by Algorithm KL-ucb+ satisfies
$$\mathbb{E}\bigl[N_{a}(T)\bigr]\leqslant 2+\inf_{n_{0}\leqslant T}\bigg\{n_{0}+\sum_{n\geqslant n_{0}+1}^{T}\mathbb{P}\Bigl\{\widehat{\nu}_{a,n}\in\mathcal{C}_{\mu^{\star}-\varepsilon,f(T/n)/n}\Bigr\}\bigg\}+\sum_{t=|\mathcal{A}|}^{T-1}\underbrace{\mathbb{P}\Bigl\{N_{a^{\star}}(t)\,\mathcal{K}_{a^{\star}}\bigl(\Pi_{a^{\star}}(\widehat{\nu}_{a^{\star},N_{a^{\star}}(t)}),\,\mu^{\star}-\varepsilon\bigr)>f\bigl(t/N_{a^{\star}}(t)\bigr)\Bigr\}}_{\text{Boundary Crossing Probability}}\,.$$
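**Proof of Lemma 3:** The first part of this lemma, for KL-ucb, is proved in Cappé et al. (2013). The second part, about KL-ucb+, can be proved by following the very same lines. We thus only provide the main steps here for clarity. We start by introducing a small $\varepsilon>0$ that satisfies $\varepsilon<\min\bigl\{\mu^{\star}-\mu_{a}:\ a\in\mathcal{A},\ \mu_{a}<\mu^{\star}\bigr\}$, and then consider the following inclusion of events:
$$\bigl\{a_{t+1}=a\bigr\}\subseteq\Bigl\{\mu^{\star}-\varepsilon\geqslant U_{a^{\star}}(t)\Bigr\}\,\cup\,\Bigl(\bigl\{a_{t+1}=a\bigr\}\cap\Bigl\{\mu^{\star}-\varepsilon<U_{a}(t)\Bigr\}\Bigr)\,;$$
indeed, on the event $\bigl\{a_{t+1}=a\bigr\}\,\cap\,\bigl\{\mu^{\star}-\varepsilon<U_{a^{\star}}(t)\bigr\}$, we have $\mu^{\star}-\varepsilon<U_{a^{\star}}(t)\leqslant U_{a}(t)$ (where the last inequality is by definition of the strategy). Moreover, let us note that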
[TABLE]
since $\mathcal{K}_{a}$ is a non-decreasing function in its second argument and $\mathcal{K}_{a}\bigl(\nu,E(\nu)\bigr)=0$ for all distributions $\nu$. Therefore, this simple remark leads us to the following decomposition
[TABLE]
The remaining steps of the proof of the result from Cappé et al. (2013), equation (10), can now be straightforwardly modified to work with $f(t/N_{a}(t))$ instead of $f(t)$, thus concluding this proof.
Lemma 3 shows that two terms need to be controlled in order to derive regret bounds for the considered strategy. The boundary crossing probability term is arguably the most difficult to handle and is the focus of the next sections. The other term involves the probability that an empirical distribution belongs to a convex set, which can be handled either directly as in Cappé et al. (2013) or by resorting to finite-time Sanov-type results such as that of (Dinwoodie, 1992, Theorem 2.1 and comments on page 372), or its variant from (Maillard et al., 2011, Lemma 1). For completeness, the exact result from Dinwoodie (1992) reads as follows.
Lemma 4 (Non-asymptotic Sanov's lemma)
Let $\mathcal{C}$ be an open convex subset of $\mathfrak{M}_{1}(\mathcal{X})$ such that $\Lambda(\mathcal{C})=\inf_{\kappa\in\mathcal{C}}\texttt{KL}(\kappa,\nu)$ is finite. Then, for all $t\geqslant 1$,
$$\mathbb{P}_{\nu}\{\widehat{\nu}_{t}\in\mathcal{C}\}\leqslant\exp\bigl(-t\Lambda(\overline{\mathcal{C}})\bigr)\,,$$
where $\overline{\mathcal{C}}$ is the closure of $\mathcal{C}$.
Scope and focus of this work
We focus on the setting of stochastic multi-armed bandits because this gives a strong and natural motivation for studying boundary crossing probabilities. However, one should understand that the primary goal of this paper is to give credit to the work of T.L. Lai regarding the neat understanding of boundary crossing probabilities and not necessarily to provide a regret bound for such bandit algorithms as KL-ucb or KL-ucb+. Also, we believe that results on boundary crossing probabilities are useful beyond the bandit problem, for instance in hypothesis testing. Thus, and in order to avoid obscuring the main result regarding boundary crossing probabilities, we choose not to provide regret bounds here and to leave them as an exercise for the interested reader; controlling the remaining term appearing in the decomposition of Lemma 3 is indeed mostly technical and does not seem to require any especially illuminating or fancy idea. We refer to Cappé et al. (2013) for an example of bound in the case of exponential families of dimension $1$.
High-level overview of the contribution
We are now ready to explain the main results of this paper. For the purpose of clarity, we provide them as an informal statement before proceeding with the technical material.
Our contribution is about the behavior of the boundary crossing probability term for exponential families of dimension $K$ when choosing the threshold function $f(x)=\log(x)+\xi\log\log(x)$. Our result reads as follows.
Theorem (Informal statement) Assuming that the observations are generated from a distribution that belongs to an exponential family of dimension $K$ that satisfies some mild conditions, then for any non-negative $\varepsilon$ and some class-dependent but fully explicit constants (also depending on $\varepsilon$) it holds
[TABLE]
where the first inequality holds for all $t$ and the second one for $t$ large enough, beyond a threshold that is class dependent but explicit and "reasonably" small.
We provide the rigorous statement in Theorem 3 and Corollaries 1, 2 below. The main interest of this result is that it shows how to tune $\xi$ with respect to the dimension $K$ of the family. Indeed, in order to ensure that the probability term is summable in $t$, the bound suggests that $\xi$ should be at least larger than $K/2-1$. The case of exponential families of dimension $1$ ($K=1$) is especially interesting, as it supports the fact that both KL-ucb and KL-ucb+ can be tuned using $\xi=0$ (and even negative $\xi$ for KL-ucb). This was observed in numerical experiments in Cappé et al. (2013) although not theoretically supported until now.
The remainder of the paper is organized as follows: Section 3 provides the required background and notations about exponential families, Section 4 provides the precise statements as well as previous results, Section 5 details the proof of Theorem 3, and finally Section 6 details the proof of Corollaries 1 and 2.
3 General exponential families, properties and examples
Before focusing on the boundary crossing probabilities, we require a few tools and definitions related to exponential families. The purpose of this section is thus to present them and prepare for the main result of this paper. In this section, for a set $\mathcal{X}\subset\mathbb{R}$, we consider a multivariate function $F:\mathcal{X}\to\mathbb{R}^{K}$ and denote $\mathcal{Y}=F(\mathcal{X})\subset\mathbb{R}^{K}$.
Definition 3 (Exponential families)
The exponential family generated by the function $F$ and the reference measure $\nu_{0}$ on the set $\mathcal{X}$ is
$$\mathcal{E}(F;\nu_{0})=\Bigl\{\nu_{\theta}\in\mathfrak{M}_{1}(\mathcal{X})\,;\ \forall x\in\mathcal{X}\ \ \nu_{\theta}(dx)=\exp\bigl(\langle\theta,F(x)\rangle-\psi(\theta)\bigr)\nu_{0}(dx),\ \theta\in\mathbb{R}^{K}\Bigr\}\,,$$
where $\psi(\theta)\stackrel{\rm def}{=}\log\int_{\mathcal{X}}\exp\bigl(\langle\theta,F(x)\rangle\bigr)\nu_{0}(dx)$ is the normalization function (aka log-partition function) of the exponential family. The vector $\theta$ is called the vector of canonical parameters. The parameter set of the family is the domain $\Theta_{\mathcal{D}}\stackrel{\rm def}{=}\bigl\{\theta\in\mathbb{R}^{K}\,;\,\psi(\theta)<\infty\bigr\}$, and the invertible parameter set of the family is $\Theta_{I}\stackrel{\rm def}{=}\bigl\{\theta\in\mathbb{R}^{K}\,;\,0<\lambda_{\texttt{MIN}}(\nabla^{2}\psi(\theta))\leqslant\lambda_{\texttt{MAX}}(\nabla^{2}\psi(\theta))<\infty\bigr\}\subset\Theta_{\mathcal{D}}$, where $\lambda_{\texttt{MIN}}(M)$ and $\lambda_{\texttt{MAX}}(M)$ denote the minimum and maximum eigenvalues of a positive semi-definite matrix $M$.
Remark 2
When is compact, which is the usual assumption in multi-armed bandits () and is continuous, then we automatically get .
In the sequel, we always assume that the family is regular, that is $\Theta_{\mathcal{D}}$ has non-empty interior. Another key assumption is that the parameter $\theta^{\star}$ of the optimal arm belongs to the interior of $\Theta_{I}$ and is away from its boundary, which essentially avoids degenerate distributions, as we illustrate below.
**Examples** Bernoulli distributions form an exponential family with $K=1$, $\mathcal{X}=\{0,1\}$, $F(x)=x$, $\psi(\theta)=\log(1+e^{\theta})$. The Bernoulli distribution with mean $\mu$ has parameter $\theta=\log(\mu/(1-\mu))$. Note that $\Theta_{\mathcal{D}}=\mathbb{R}$ and that degenerate distributions with mean $0$ or $1$ correspond to parameters $\pm\infty$.
Gaussian distributions on $\mathcal{X}=\mathbb{R}$ form an exponential family with $K=2$, $F(x)=(x,x^{2})$, and for each $\theta=(\theta_{1},\theta_{2})$, $\psi(\theta)=-\frac{\theta_{1}^{2}}{4\theta_{2}}+\frac{1}{2}\log\bigl(-\frac{\pi}{\theta_{2}}\bigr)$. The Gaussian distribution $\mathcal{N}(\mu,\sigma^{2})$ has parameter $\theta=\bigl(\frac{\mu}{\sigma^{2}},-\frac{1}{2\sigma^{2}}\bigr)$. It is immediate to check that $\Theta_{\mathcal{D}}=\mathbb{R}\times\mathbb{R}^{-}_{\star}$. Degenerate distributions with variance $0$ correspond to a parameter with both infinite components, while as $\theta$ approaches the boundary $\mathbb{R}\times\{0\}$, the variance tends to infinity. It is natural to consider only parameters that correspond to a not too large variance.
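The two examples can be checked numerically. The sketch below, under the assumption that the reference measure is the counting measure on $\{0,1\}$ for Bernoulli distributions and the Lebesgue measure for Gaussian distributions, evaluates the density $\exp(\langle\theta,F(x)\rangle-\psi(\theta))$ and compares it with the usual closed forms. All function names are hypothetical helpers introduced for this illustration.

```python
import math

def bernoulli_natural(mu):
    """Canonical parameter theta = log(mu / (1 - mu)) for a mean 0 < mu < 1."""
    return (math.log(mu / (1.0 - mu)),)

def bernoulli_psi(theta):
    """Log-partition function log(1 + exp(theta)), with F(x) = x."""
    return math.log1p(math.exp(theta[0]))

def gaussian_natural(mu, sigma2):
    """Canonical parameter (mu / sigma^2, -1 / (2 sigma^2)), with F(x) = (x, x^2)."""
    return (mu / sigma2, -1.0 / (2.0 * sigma2))

def gaussian_psi(theta):
    """Log-partition function -theta1^2 / (4 theta2) + 0.5 * log(-pi / theta2)."""
    t1, t2 = theta
    return -t1 * t1 / (4.0 * t2) + 0.5 * math.log(-math.pi / t2)

def density(theta, features, psi_value):
    """exp(<theta, F(x)> - psi(theta)) with respect to the reference measure."""
    inner = sum(t * f for t, f in zip(theta, features))
    return math.exp(inner - psi_value)

# Bernoulli with mean 0.3: the mass at x = 1 should be 0.3.
th_b = bernoulli_natural(0.3)
print(density(th_b, (1.0,), bernoulli_psi(th_b)))

# Gaussian N(1, 2) evaluated at x = 0.5, compared with the usual formula.
th_g = gaussian_natural(1.0, 2.0)
x = 0.5
usual = math.exp(-(x - 1.0) ** 2 / (2 * 2.0)) / math.sqrt(2 * math.pi * 2.0)
print(density(th_g, (x, x * x), gaussian_psi(th_g)), usual)
```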
3.1 Bregman divergence induced by the exponential family
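An interesting property of exponential families is the following straightforward identity:
$$\texttt{KL}(\nu_{\theta},\nu_{\theta^{\prime}})=\bigl\langle\theta-\theta^{\prime},\mathbb{E}_{X\sim\nu_{\theta}}\bigl(F(X)\bigr)\bigr\rangle-\psi(\theta)+\psi(\theta^{\prime})\,.$$
In particular, the vector $\mathbb{E}_{X\sim\nu_{\theta}}(F(X))$ is called the vector of dual (or expectation) parameters. It is equal to the vector $\nabla\psi(\theta)$. Now, we write $\texttt{KL}(\nu_{\theta},\nu_{\theta^{\prime}})=\mathcal{B}^{\psi}(\theta,\theta^{\prime})$, where we introduced the Bregman divergence with potential function $\psi$ defined by
$$\mathcal{B}^{\psi}(\theta,\theta^{\prime})=\psi(\theta^{\prime})-\psi(\theta)-\langle\theta^{\prime}-\theta,\nabla\psi(\theta)\rangle\,.$$
Thus, if $\Pi_{a}$ is chosen to be the projection on the exponential family $\mathcal{E}(F;\nu_{0})$, and $\nu$ is a distribution with projection given by $\nu_{\theta}=\Pi_{a}(\nu)$, then we can rewrite the definition of $\mathcal{K}_{a}$ in the simpler form
$$\mathcal{K}_{a}\bigl(\Pi_{a}(\nu),\mu\bigr)=\inf\Bigl\{\mathcal{B}^{\psi}(\theta,\theta^{\prime}):\ \theta^{\prime}\in\Theta_{\mathcal{D}}\ \ \text{s.t.}\ \ \mathbb{E}_{X\sim\nu_{\theta^{\prime}}}(X)>\mu\Bigr\}\,.\qquad(3)$$
We continue by providing a powerful rewriting of the Bregman divergence.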
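The identification of the Kullback-Leibler divergence with a Bregman divergence of the log-partition function can be sanity-checked numerically. The sketch below assumes the convention $\mathcal{B}^{\psi}(\theta,\theta^{\prime})=\psi(\theta^{\prime})-\psi(\theta)-\langle\theta^{\prime}-\theta,\nabla\psi(\theta)\rangle$, which is the one that matches $\texttt{KL}(\nu_{\theta},\nu_{\theta^{\prime}})$, and compares it with the textbook closed forms of the divergence for Bernoulli and Gaussian distributions. The helper names are hypothetical.

```python
import math

def bregman(psi, grad_psi, theta, theta_p):
    """B(theta, theta') = psi(theta') - psi(theta) - <theta' - theta, grad psi(theta)>."""
    g = grad_psi(theta)
    inner = sum((tp - t) * gi for t, tp, gi in zip(theta, theta_p, g))
    return psi(theta_p) - psi(theta) - inner

# Bernoulli family: psi(theta) = log(1 + e^theta), grad psi(theta) = mean.
def bern_psi(theta):
    return math.log1p(math.exp(theta[0]))

def bern_grad(theta):
    return (1.0 / (1.0 + math.exp(-theta[0])),)

def bern_kl(p, q):
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Gaussian family with F(x) = (x, x^2): grad psi(theta) = (mu, mu^2 + sigma^2).
def gauss_psi(theta):
    t1, t2 = theta
    return -t1 * t1 / (4 * t2) + 0.5 * math.log(-math.pi / t2)

def gauss_grad(theta):
    t1, t2 = theta
    return (-t1 / (2 * t2), t1 * t1 / (4 * t2 * t2) - 1 / (2 * t2))

def gauss_kl(mu, s2, mu_p, s2_p):
    return 0.5 * math.log(s2_p / s2) + (s2 + (mu - mu_p) ** 2) / (2 * s2_p) - 0.5

p, q = 0.3, 0.6
theta, theta_p = (math.log(p / (1 - p)),), (math.log(q / (1 - q)),)
print(bregman(bern_psi, bern_grad, theta, theta_p), bern_kl(p, q))

theta, theta_p = (0.0 / 1.0, -1 / (2 * 1.0)), (1.0 / 2.0, -1 / (2 * 2.0))
print(bregman(gauss_psi, gauss_grad, theta, theta_p), gauss_kl(0.0, 1.0, 1.0, 2.0))
```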
We continue by providing a powerful rewriting of the Bregman divergence.
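Lemma 5 (Bregman duality)
Let $\Phi(\eta)=\psi(\theta^{\star}+\eta)-\psi(\theta^{\star})$, and its Fenchel-Legendre dual given by
$$\Phi^{\star}(y)=\sup_{\eta\in\mathbb{R}^{K}}\ \langle\eta,y\rangle-\Phi(\eta)\,.$$
Then, for all $\theta^{\star}\in\Theta_{\mathcal{D}}$ and $\eta\in\mathbb{R}^{K}$ such that $\theta^{\star}+\eta\in\Theta_{\mathcal{D}}$, it holds
$$\log\mathbb{E}_{\theta^{\star}}\exp\bigl(\langle\eta,F(X)\rangle\bigr)=\Phi(\eta)\,.$$
Further, for all $F$ with $F=\nabla\psi(\theta)$ for some $\theta\in\Theta_{\mathcal{D}}$, then $\Phi^{\star}(F)=\mathcal{B}^{\psi}(\theta,\theta^{\star})$.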
Lemma 6** (Bregman and Smoothness)**
We have on the one hand
and on the other hand
where and are the largest and smallest eigenvalue of .
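**Proof of Lemma 5:** The second equality holds by simple algebra. Now the first equality is immediate, since
$$\log\mathbb{E}_{\theta^{\star}}\exp\bigl(\langle\eta,F(X)\rangle\bigr)=\log\int_{\mathcal{X}}\exp\bigl(\langle\eta,F(x)\rangle+\langle\theta^{\star},F(x)\rangle-\psi(\theta^{\star})\bigr)\nu_{0}(dx)=\psi(\theta^{\star}+\eta)-\psi(\theta^{\star})\,.$$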
**Proof of Lemma 6: ** We have by definition that
Then, by a Taylor expansion, there exists such that
[TABLE]
Likewise, there exists such that
3.2 Dual formulation of the optimization problem
Using Bregman divergences enables us to rewrite the $K$-dimensional optimization problem (3) in a slightly more convenient form thanks to a dual formulation. Indeed, introducing a Lagrangian parameter $\lambda\in\mathbb{R}^{+}$ and using Karush-Kuhn-Tucker conditions, one gets the following necessary optimality conditions
[TABLE]
and by definition of the exponential family, we can make use of the fact that
[TABLE]
where we remember that and . Combining these two equations, we obtain the system
[TABLE]
For minimal exponential family, this system admits for each fixed a unique solution in , that we write for clarity to indicate its dependency with the optimal value of the dual parameter as well as the constraints.
Remark 3
For , when the optimal value of is , then it means that and thus , which is only possible if . Thus whenever , the dual constraint is active, i.e. , and we get the vector equation
[TABLE]
The example of discrete distributions In many cases, the previous optimization problem reduces to a simpler one-dimensional optimization problem, where we optimize over the dual parameter . We illustrate this phenomenon on a family of discrete distributions. Let be a set of distinct real-values. Without loss of generality, assume that . The family of distributions with support in is a specific -dimensional family. Indeed, let be the feature function with component , for all . Then the parameter of the distribution has components for all . Note that for all , and . It then comes , and . Further, and corresponds to the condition . Now, for a non trivial value such that , it can be readily checked that the system (4) specialized to this family is equivalent (with no surprise) to the one considered for instance in Honda and Takemura (2010) for discrete distributions. After some tedious but simple steps detailed in Honda and Takemura (2010), one obtains the following easy-to-solve one-dimensional optimization problem (see also Cappé et al. (2013)), although the family is of dimension :
[TABLE]
3.3 Empirical parameter and definition
In this section we discuss whether the empirical parameter corresponding to the projection of the empirical distribution on the exponential family is well defined. While this is innocuous for most settings, in full generality, one needs to take specific care to ensure that all the objects we deal with are well defined and that all parameters we talk about indeed belong to the set $\Theta_{\mathcal{D}}$ (or better, $\Theta_{I}$).
An important property is that if the family is regular, then is an open set that coincides with the interior of realizable values of for for any absolutely continuous with respect to . In particular, by convexity of the set this means that the empirical average belongs to for all with . Thus, for the observed samples coming from , the projection on the family can be represented by a sequence such that
[TABLE]
In the sequel, we want to ensure that provided that with , then we also have , which means that there is a unique such that , or equivalently . To this end, we assume that is away from the boundary of . In many cases, it is then sufficient to assume that is larger than a small constant (roughly ) to ensure that we can find a unique such that .
Example Let us consider Gaussian distributions on $\mathbb{R}$, with $K=2$. We consider a parameter $\theta^{\star}$ corresponding to a Gaussian with finite mean and positive variance. Now, for any $n\geqslant 2$, the empirical mean is finite and the empirical variance is (almost surely) positive, and thus $\widehat{\theta}_{n}$ is well-defined.
The case of Bernoulli distributions is interesting as it shows a slightly different situation. Let us consider a parameter $\theta^{\star}$ corresponding to a Bernoulli distribution with mean $\mu$. Before $\widehat{\theta}_{n}$ can be mapped to a point in $\Theta_{\mathcal{D}}$, one needs to wait until the number of observations for both $0$ and $1$ is positive. Whenever $\mu\in(0,1)$, the probability that this does not happen is controlled by $\mathbb{P}\bigl(n_{0}(n)=0\ \text{or}\ n_{1}(n)=0\bigr)=\mu^{n}+(1-\mu)^{n}\leqslant 2\max(\mu,1-\mu)^{n}$, where $n_{x}(n)$ denotes the number of observations of symbol $x$ after $n$ samples. For $\delta\in(0,1)$, the latter quantity is less than $\delta$ for $n\geqslant\log(2/\delta)/\log\bigl(1/\max(\mu,1-\mu)\bigr)$, which depends on the probability level $\delta$ and cannot be considered to be especially small when $\mu$ is close to $1$. That said, even when the parameter $\widehat{\theta}_{n}$ does not belong to $\Theta_{\mathcal{D}}$, the event $n_{0}(n)=0$ corresponds to having empirical mean equal to $1$. This is a favorable situation since any optimistic algorithm should pull the corresponding arm. Thus, one only needs to control $\mathbb{P}\bigl(n_{1}(n)=0\bigr)=(1-\mu)^{n}$, which is less than $\delta$ for $n\geqslant\log(1/\delta)/\log\bigl(1/(1-\mu)\bigr)$, which is essentially a constant. As a matter of illustration, when and , this condition is met for .
(Footnote 1: This also suggests replacing the empirical estimate with a Laplace or a Krichevsky-Trofimov estimate, which provides an initial bonus to each symbol and, as a result, maps any empirical estimate to a well-defined parameter.)
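The waiting time discussed above can be computed directly. For i.i.d. Bernoulli samples with mean $\mu\in(0,1)$, the probability that one of the two symbols has not been observed after $n$ samples is $\mu^{n}+(1-\mu)^{n}$, an elementary fact used here as the working formula. The sketch below, with hypothetical function names, returns the smallest $n$ for which this probability falls below a level $\delta$.

```python
def prob_symbol_missing(mu, n):
    """P(all n samples are 1) + P(all n samples are 0) for Bernoulli(mu)."""
    return mu ** n + (1.0 - mu) ** n

def samples_needed(mu, delta):
    """Smallest n such that both symbols were observed with probability >= 1 - delta."""
    if not 0.0 < mu < 1.0:
        raise ValueError("mu must lie strictly between 0 and 1")
    if delta <= 0.0:
        raise ValueError("delta must be positive")
    n = 1
    while prob_symbol_missing(mu, n) > delta:
        n += 1
    return n

for mu in (0.5, 0.9, 0.99):
    print(mu, samples_needed(mu, delta=0.05))
```

The required number of samples grows quickly as the mean approaches $0$ or $1$, which is the reason the text treats the one-sided event separately.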
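Following the previous discussion, in the sequel we consider that $n$ is always large enough so that $\widehat{\theta}_{n}$ can be uniquely defined. We now discuss the separation between the parameter $\theta^{\star}$ and the boundary more formally, and for that purpose introduce the following definition.
Definition 4 (Enlarged parameter set)
Let $\Theta\subset\Theta_{\mathcal{D}}$ and some constant $\rho>0$. The enlargement of size $\rho$ of $\Theta$ in Euclidean norm (aka $\rho$-neighborhood) is defined by
$$\Theta_{\rho}=\Bigl\{\theta\in\mathbb{R}^{K}\,;\ \inf_{\theta^{\prime}\in\Theta}\|\theta-\theta^{\prime}\|<\rho\Bigr\}\,.$$
For each $\rho$ such that $\Theta_{\rho}\subset\Theta_{I}$, we further introduce the quantities
$$v_{\rho}=\inf_{\theta\in\Theta_{\rho}}\lambda_{\texttt{MIN}}\bigl(\nabla^{2}\psi(\theta)\bigr)\,,\qquad V_{\rho}=\sup_{\theta\in\Theta_{\rho}}\lambda_{\texttt{MAX}}\bigl(\nabla^{2}\psi(\theta)\bigr)\,.$$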
Using the notion of enlarged parameter set, we highlight an especially useful property to prove concentration inequalities, summarized in the following result
Lemma 7** (Log-Laplace control)**
Let be a convex set and such that . Then, for all such that , it holds
**Proof of Lemma 7: ** Indeed, it holds by simple algebra
[TABLE]
where . The equality holds by definition and basic rewriting. In the inequalities, we used that is convex as an enlargement of a convex set, and thus that .
In the sequel, we are interested in sets such that for some specific . This comes essentially from the fact that we require some room around and to ensure all quantities remain finite and well-defined. Before proceeding, it is convenient to introduce the notation , as well as the Euclidean ball . Using these notations, the following lemma whose proof is immediate provides conditions for which all future technical considerations are satisfied.
Lemma 8** (Well-defined parameters)**
Let and . Now for any convex set such that and , and any , it holds .
Further, for any such that , such that .
In the sequel, we will restrict our analysis to the slightly more restrictive case when with . This is mostly for convenience and avoid dealing with rather specific situations.
Remark 4
Again let us remind that when is compact and is continuous, then .
Illustration
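We now illustrate the definition of $v_{\rho}$ and $V_{\rho}$. For Bernoulli distributions with parameter $\mu\in[0,1]$, $\nabla\psi(\theta)=1/(1+e^{-\theta})$ and $\nabla^{2}\psi(\theta)=e^{-\theta}/(1+e^{-\theta})^{2}=\mu(1-\mu)$. Thus, $v_{\rho}$ is away from $0$ whenever $\Theta_{\rho}$ excludes the means $\mu$ close to $0$ or $1$, and $V_{\rho}\leqslant 1/4$.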
Now for a family of Gaussian distributions with unknown mean and variance, \psi(\theta)=-\frac{\theta_{1}^{2}}{4\theta_{2}}+\frac{1}{2}\log\big{(}\frac{-\pi}{\theta_{2}}\big{)}, where . Thus, , and . The smallest eigenvalue is larger than and the largest is upper bounded by , which enables to control and .
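For a numerical feel of these eigenvalues, the sketch below evaluates the Hessian of the log-partition function in both examples. It relies on the general fact that $\nabla^{2}\psi(\theta)$ is the covariance matrix of $F(X)$ under $\nu_{\theta}$; for the Gaussian family with $F(x)=(x,x^{2})$ this gives the entries $\sigma^{2}$, $2\mu\sigma^{2}$ and $4\mu^{2}\sigma^{2}+2\sigma^{4}$, which can also be obtained by differentiating $\psi$ twice. The function names are assumptions of this illustration.

```python
import math

def bernoulli_hessian(mu):
    """Second derivative of log(1 + e^theta), expressed in the mean: mu (1 - mu)."""
    return mu * (1.0 - mu)

def gaussian_hessian(mu, sigma2):
    """Hessian of psi for F(x) = (x, x^2): covariance matrix of (X, X^2)."""
    off = 2.0 * mu * sigma2
    return ((sigma2, off), (off, 4.0 * mu * mu * sigma2 + 2.0 * sigma2 ** 2))

def eig_sym_2x2(h):
    """Smallest and largest eigenvalues of a symmetric 2x2 matrix (closed form)."""
    (a, b), (_, d) = h
    center = 0.5 * (a + d)
    radius = math.sqrt(0.25 * (a - d) ** 2 + b * b)
    return center - radius, center + radius

# The smallest eigenvalue shrinks with the variance, the largest grows with it.
for mu, sigma2 in ((0.0, 1.0), (0.0, 0.01), (3.0, 1.0), (0.0, 10.0)):
    print(mu, sigma2, eig_sym_2x2(gaussian_hessian(mu, sigma2)))
```

Restricting the parameters to a set where the variance is bounded away from zero and from infinity, and the mean is bounded, therefore keeps both eigenvalues under control.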
4 Boundary crossing for $K$-dimensional exponential families
In this section, we study the boundary crossing probability term appearing in Lemma 3 for a $K$-dimensional exponential family $\mathcal{E}(F;\nu_{0})$. We first provide an overview of the existing results before detailing our main contribution. As explained in the introduction, the key technical tools that enable us to obtain the novel results were already known three decades ago. Thus, even though the novel result is impressive due to its generality and tightness, it should be regarded as a modernized version of an existing, but almost forgotten, result, one that solves a few long-standing open questions as a by-product.
4.1 Previous work on boundary-crossing probabilities
The existing results used in the bandit literature about boundary-crossing probabilities are restricted to a few specific cases. For instance in Cappé et al. (2013), the authors provide the following control
Theorem 1** (KL-ucb)**
In the case of canonical (that is ) exponential families of dimension , then for a function such that , then it holds for all
$$\mathbb{P}_{\theta^{\star}}\Bigl\{\bigcup_{n=1}^{t-A+1}n\,\mathcal{K}_{a^{\star}}\bigl(\Pi_{a^{\star}}(\widehat{\nu}_{a^{\star},n}),\,\mu^{\star}\bigr)>f(t)\ \cap\ \mu_{a^{\star}}>\widehat{\mu}_{a^{\star},n}\Bigr\}\leqslant e\lceil f(t)\log(t)\rceil e^{-f(t)}\,.$$
Further, in the special case of distributions with finitely many atoms, it holds for all
$$\mathbb{P}_{\theta^{\star}}\Bigl\{\bigcup_{n=1}^{t-A+1}n\,\mathcal{K}_{a^{\star}}\bigl(\Pi_{a^{\star}}(\widehat{\nu}_{a^{\star},n}),\,\mu^{\star}-\varepsilon\bigr)>f(t)\Bigr\}\leqslant e^{-f(t)}\Bigl(3e+2+4\varepsilon^{-2}+8e\varepsilon^{-4}\Bigr)\,.$$
In contrast, Lai (1988) provides an asymptotic control in the more general case of exponential families of dimension $K$ with some basic regularity condition, as we explained earlier. We now restate this beautiful result from Lai (1988) in a way that is suitable for a more direct comparison with other results. The following holds:
Theorem 2** (Lai, 88)**
Let us consider an exponential family of dimension . Define for the cone . Then, for a function such that it holds for all such that , where , as ,
$$\mathbb{P}_{\theta^{\star}}\Bigl\{\bigcup_{n=1}^{t}\widehat{\theta}_{n}\in\Theta_{\rho}\,\cap\,n\mathcal{B}^{\psi}(\widehat{\theta}_{n},\theta^{\dagger})\geqslant f\Bigl(\frac{t}{n}\Bigr)\,\cap\,\nabla\psi(\widehat{\theta}_{n})-\nabla\psi(\theta^{\dagger})\in\mathcal{C}_{\gamma}(\theta^{\dagger}-\theta^{\star})\Bigr\}$$
$$=O\Bigl(t^{-\alpha}|\theta^{\dagger}-\theta^{\star}|^{-2\alpha}\log^{-\xi-\alpha+K/2}\bigl(t|\theta^{\dagger}-\theta^{\star}|^{2}\bigr)\Bigr)$$
$$=O\Bigl(e^{-f(t|\theta^{\dagger}-\theta^{\star}|^{2})}\log^{-\alpha+K/2}\bigl(t|\theta^{\dagger}-\theta^{\star}|^{2}\bigr)\Bigr)\,.$$
Discussion The quantity is the direct analog of \mathcal{K}_{a^{\star}}\bigl{(}\Pi_{a^{\star}}(\widehat{\nu}_{a^{\star},n}),\mu^{\star}-\varepsilon) in Theorem 1. Note however that replaces the larger quantity , which means that Theorem 2 controls a larger quantity than Theorem 1, and is thus in this sense stronger. It also holds for general exponential families of dimension . Another important difference is the order of magnitude of the right hand side terms of both theorems. Indeed, since , Theorem 1 requires that in order that this term is , and for the second term of Theorem 1. In contrast, Theorem 2 shows that it is enough to consider with to ensure a bound. For , this means we can even use and in particular , which corresponds to the value they recommend in the experiments.
Thus, Theorem 2 improves in three ways over Theorem 1: it is an extension to dimension , it provides a bound for (and thus for KL-ucb+) and not only , and finally it allows for smaller values of . These improvements are partly due to the fact that Theorem 2 controls a concentration with respect to , not , which takes advantage of the fact that there is some gap when going from to distributions with mean . The proof of Theorem 2 directly takes advantage of this, contrary to that of the first part of Theorem 1.
On the other hand, Theorem 2 is only asymptotic whereas Theorem 1 holds for finite . Furthermore, we notice two restrictions on the control event. First, it requires , but we showed in the previous section that this is a minor restriction. Second, there is the restriction to a cone , which simplifies the analysis but is a more severe restriction. This restriction cannot be removed trivially, as it can be seen from the complete statement of (Lai, 1988, Theorem 2) that the right-hand side blows up to when . As we will see, it is possible to overcome this restriction by resorting to a smart covering of the space with cones, and summing the resulting terms via a union bound over the covering. We explain the precise way of proceeding in the proof of Theorem 3 in Section 5.
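To get a feel for the order of magnitude of the right-hand side of the first bound of Theorem 1, the following minimal sketch evaluates $e\lceil f(t)\log(t)\rceil e^{-f(t)}$ numerically. It assumes the logarithmic boundary $f(t)=\log(t)+\xi\log\log(t)$; the function name `theorem1_rhs` and the values of $\xi$ and $t$ are illustrative choices only.

```python
import math

def theorem1_rhs(t: float, xi: float) -> float:
    """Right-hand side e * ceil(f(t) log t) * exp(-f(t)) of the first bound of Theorem 1.

    Assumption of this sketch: f(t) = log(t) + xi * log(log(t)).
    """
    f_t = math.log(t) + xi * math.log(math.log(t))
    return math.e * math.ceil(f_t * math.log(t)) * math.exp(-f_t)

# If t * bound decreases with t, this is consistent with the bound being o(1/t).
for xi in (1.0, 2.0, 3.0):
    scaled = [t * theorem1_rhs(t, xi) for t in (10**2, 10**4, 10**6)]
    print(f"xi={xi}: t * bound = {scaled}")
```

Since $t\,e^{-f(t)}=\log(t)^{-\xi}$, the scaled quantity behaves roughly like $e\log(t)^{2-\xi}$, which is why it only vanishes for sufficiently large $\xi$ in this bound.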
Hint at proving the first part of Theorem 1 We believe it is interesting to give some hints about the proof of the first part of Theorem 1, as it involves an elegant step, despite relying quite heavily on two specific properties of the canonical exponential family of dimension . Indeed, in the special case of the canonical one-dimensional family (that is and ), coincides with the empirical mean and it can be shown that is strictly decreasing on . Thus for any , it holds
[TABLE]
Further, using the notations of Section 3.1, it also holds in that case \mathcal{K}_{a^{\star}}\bigl{(}\Pi_{a^{\star}}(\widehat{\nu}_{a^{\star},n}),\,\mu^{\star}\bigr{)}=\mathcal{B}^{\psi}(\widehat{\theta}_{n},\theta^{\star})=\Phi^{\star}(\widehat{F}_{n}), where is uniquely defined. A second non-trivial property that is shown in Cappé et al. (2013) is that for all , we can localize the supremum as
[TABLE]
Armed with these two properties, the proof reduces almost trivially to the following elegant lemma:
Lemma 9 (Dimension 1)
Consider a canonical one-dimensional family (that is and ). Then, for all such that is non-increasing in ,
\displaystyle\mathbb{P}_{\theta^{\star}}\Big{\{}\bigcup_{m\leqslant n<M}\,\,\mathcal{B}^{\psi}(\widehat{\theta}_{n},\theta^{\star})\geqslant f(t/n)/n\Big{\}}\leqslant\exp\bigg{(}-\frac{m}{M}f(t/M)\bigg{)}\,.
The proof of this lemma, provided in the appendix for the interested reader, is directly adapted from the proof of Theorem 1. The first statement of Theorem 1 is obtained by a peeling argument, using . However, this argument does not seem to extend nicely to using , which explains why there is no statement regarding this threshold.
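The identity between the Kullback-Leibler divergence and the Bregman divergence of the log-partition function, used in the hint above, can be checked numerically in the Bernoulli case. The sketch below is only an illustration: it assumes the argument convention $\mathcal{B}^{\psi}(\theta,\theta^{\prime})=\psi(\theta^{\prime})-\psi(\theta)-\langle\theta^{\prime}-\theta,\nabla\psi(\theta)\rangle$, which is not restated in this section, and all function names are hypothetical.

```python
import math

def psi(theta: float) -> float:
    """Log-partition function of the Bernoulli family in natural parameter theta."""
    return math.log1p(math.exp(theta))

def grad_psi(theta: float) -> float:
    """Mean parameter associated with theta (the sigmoid function)."""
    return 1.0 / (1.0 + math.exp(-theta))

def bregman(theta: float, theta_prime: float) -> float:
    # Assumed convention: psi(theta') - psi(theta) - <theta' - theta, grad psi(theta)>.
    return psi(theta_prime) - psi(theta) - (theta_prime - theta) * grad_psi(theta)

def kl_bernoulli(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q), for p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def logit(p: float) -> float:
    """Natural parameter of Bernoulli(p)."""
    return math.log(p / (1 - p))

p, q = 0.3, 0.6
print(bregman(logit(p), logit(q)), kl_bernoulli(p, q))  # the two values should agree
```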
4.2 Main results and contributions
In this section, we provide several results on boundary crossing probabilities, which we prove in detail in the next section. We first provide a non-asymptotic bound with explicit terms for the control of the boundary crossing probability term. We then provide two corollaries that can be used directly for the analysis of KL-ucb and KL-ucb+ and that better highlight the asymptotic scaling of the bound with , which helps to see the effect of the parameter on the bound.
Theorem 3 (Boundary crossing for exponential families)
Let , and define . Let and be a set such that and . Thus for each . Assume that is non-increasing and is non-decreasing. Then, for every , and if , , it holds
\displaystyle\mathbb{P}_{\theta^{\star}}\Big{\{}\bigcup_{1\leqslant n\leqslant t}\widehat{\theta}_{n}\in\Theta_{\rho}\cap\mathcal{K}_{a^{\star}}(\Pi_{a^{\star}}(\widehat{\nu}_{a^{\star},n}),\mu^{\star}-\varepsilon)\geqslant f(t/n)/n\Big{\}}
\displaystyle\leqslant C(K,b,\rho,p,\eta)\sum_{i=0}^{I_{t}-1}\exp\bigg{(}-n_{i}\rho_{\varepsilon}^{2}\alpha^{2}-\rho_{\varepsilon}\chi\sqrt{n_{i}f(t/n_{i})}-f\Big{(}\frac{t}{n_{i+1}\!-\!1}\Big{)}\bigg{)}f\Big{(}\frac{t}{n_{i+1}\!-\!1}\Big{)}^{K/2}\,,
where we introduced the constants , and
\displaystyle C(K,b,\rho,p,\eta)=C_{p,\eta,K}\Big{(}2\frac{\omega_{p,K\!-\!2}}{\omega_{\max\{p,\frac{2}{\sqrt{5}}\},K\!-\!2}}\max\Big{\{}\frac{2bV_{\rho}^{4}}{p\rho^{2}v^{6}_{\rho}},\frac{V_{\rho}^{3}}{v_{\rho}^{4}},\frac{b^{2}V_{\rho}^{5}}{pv_{\rho}^{6}(\frac{1}{2}\!+\!\frac{1}{K})}\Big{\}}^{K/2}+1\Big{)}\,,
where is the cone-covering number of \nabla\psi\big{(}\Theta_{\rho}\setminus\mathcal{B}_{2}(\theta^{\star},\rho_{\varepsilon})\big{)} with minimal angular separation and not intersecting the set \nabla\psi\big{(}\Theta_{\rho}\setminus\mathcal{B}_{2}(\theta^{\star},\eta\rho_{\varepsilon})\big{)}, and if and else.
Remark 5
The same result holds by replacing all occurrences of by the constant .
Remark 6
In dimension , the theorem takes a simpler form. Indeed for all and thus, choosing for instance, reduces to 2\Big{(}2\max\Big{\{}\frac{2V_{\rho}^{2}}{\rho v^{3}_{\rho}},\frac{V_{\rho}^{3/2}}{v_{\rho}^{2}},\frac{2V_{\rho}^{5/2}}{v_{\rho}^{3}}\Big{\}}+1\Big{)}. In the case of Bernoulli distributions, if , then , and .
Remark 7
We believe it is possible to reduce the term by a factor in the definition of .
Let . We now state two corollaries of Theorem 3. The first one is stated for the case when the boundary is set to and is thus directly relevant to the analysis of KL-ucb. The second corollary is about the more challenging boundary that corresponds to the KL-ucb+ strategy. We note that is non-decreasing only for . When , this requires that . Now, when where , imposing that is non-decreasing requires that for large , that is . In the sequel we thus restrict to when using the boundary and to when using the boundary . Finally, we recall that the quantity is a function of and , and introduce the notation for convenience.
Corollary 1 (Boundary crossing for )
Let . Using the same notations as in Theorem 3, for all and all such that it holds
\displaystyle\mathbb{P}_{\theta^{\star}}\Big{\{}\bigcup_{1\leqslant n<t}\widehat{\theta}_{n}\in\Theta_{\rho}\cap\mathcal{K}_{a^{\star}}(\Pi_{a^{\star}}(\widehat{\nu}_{a^{\star},n}),\mu^{\star}-\varepsilon)\geqslant f(t)/n\Big{\}}
\displaystyle\leqslant\frac{C(K,4,\rho,p,\eta)(1+\chi_{\varepsilon})}{\chi_{\varepsilon}t}\bigg{(}1+\xi\frac{\log\log(t)}{\log(t)}\bigg{)}^{K/2}\log(t)^{-\xi+K/2}e^{-\chi_{\varepsilon}\sqrt{\log(t)+\xi\log\log(t)}}\,.
Corollary 2 (Boundary crossing for )
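The explicit right-hand side of Corollary 1 is easy to evaluate numerically, which may help to see how slowly the asymptotic regime sets in. The sketch below transcribes the displayed expression; the constant $C(K,4,\rho,p,\eta)$ and $\chi_{\varepsilon}$ are problem-dependent, so the values passed as `C` and `chi` here are placeholders chosen for illustration only.

```python
import math

def corollary1_rhs(t: float, xi: float, K: int, chi: float, C: float = 1.0) -> float:
    """Evaluate the right-hand side of Corollary 1.

    C stands for C(K, 4, rho, p, eta) and chi for chi_epsilon; both are
    problem-dependent and treated as free placeholder inputs in this sketch.
    Assumes t > e so that log(log(t)) is well defined and positive.
    """
    log_t = math.log(t)
    loglog_t = math.log(log_t)
    prefactor = C * (1.0 + chi) / (chi * t)
    poly = (1.0 + xi * loglog_t / log_t) ** (K / 2) * log_t ** (-xi + K / 2)
    return prefactor * poly * math.exp(-chi * math.sqrt(log_t + xi * loglog_t))

# t * bound for increasing t: a decreasing sequence is consistent with an o(1/t) bound.
for t in (1e2, 1e12, 1e50):
    print(t, t * corollary1_rhs(t, xi=0.0, K=1, chi=0.5))
```

With these placeholder values the scaled bound decreases, but only over many orders of magnitude of $t$, which is in line with the caution expressed in the remarks about the asymptotic regime.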
Let . For all and , provided that where , it holds
\displaystyle\mathbb{P}_{\theta^{\star}}\Big{\{}\bigcup_{1\leqslant n<t}\widehat{\theta}_{n}\in\Theta_{\rho}\cap\mathcal{K}_{a^{\star}}(\Pi_{a^{\star}}(\widehat{\nu}_{a^{\star},n}),\mu^{\star}-\varepsilon)\geqslant f(t/n)/n\Big{\}}\leqslant C(K,4,\rho,p,\eta)\bigg{[}e^{-\chi_{\varepsilon}\sqrt{t}c^{\prime}}+
\displaystyle\frac{(1+\xi)^{K/2}}{ct\log(tc)}\begin{cases}\frac{16}{3}\log(tc\log(tc)/4)^{K/2-\xi}+80\log(1.25)^{K/2-\xi}&\text{if }\xi\geqslant K/2\\ \frac{16}{3}\log(t/3)^{K/2-\xi}+80\log(t\frac{c\log(tc)}{4-c\log(tc)})^{K/2-\xi}&\text{if }\xi\in[K/2-1,K/2]\end{cases}\bigg{]},
where , and if and else. Further, for larger values of , , the second term in the brackets becomes
Remark 8
In Corollary 1, since the asymptotic regime of may take a massive amount of time to kick in when , we recommend taking . We also note that the value is interesting in practice, since then it holds for all .
Remark 9
The restriction to is merely for . For instance for , the restriction becomes , and it becomes less restrictive for larger . The term is virtually infinite: For instance when , this is already larger than , while .
Remark 10
According to this result, the value (when it is non-negative) appears to be a critical value for , since the boundary crossing probabilities are not summable in for , but are summable for . Indeed, the terms behind the curved brackets are conveniently with respect to , except when . In practice however, since this asymptotic behavior may take a long time to kick in, we recommend to be away from .
Remark 11
Achieving a bound for the threshold is more challenging than for . Only the latter case was analyzed in Cappé et al. (2013), as the former was out of reach of their analysis. Also, the result is valid for exponential families of dimension and not only dimension , which is a major improvement. It is interesting to note that when , , and to observe experimentally that a sharp phase transition indeed appears for KL-ucb+ precisely at the value : the algorithm suffers a linear regret when and a logarithmic regret when . For KL-ucb, no sharp phase transition appears at point . Instead, a relatively smooth phase transition appears for a negative dependent on the problem. Both observations are consistent with the statements of the corollaries.
Discussion regarding the proof technique The proof technique that we consider below significantly differs from the proofs of Cappé et al. (2013) and Honda and Takemura (2010), and combines key ideas disseminated in two works from Tze Leung Lai, Lai (1988) and Lai (1987), with some non-trivial extensions that we describe below. We also simplify some of the original arguments and improve the readability of the initial proof technique, in order to shed more light on these neat ideas.
-Change of measure At a high level, the first main idea of this proof is to resort to a change of measure argument, which is the proof technique used to prove the lower bound on the regret. The work of Lai (1988) should be given full credit for this idea. This is in stark contrast with the proof techniques later developed for the finite-time analysis of stochastic bandits. The change of measure is actually not used once, but twice. First, to go from , the parameter of the optimal arm, to some perturbation of it . Then, which is perhaps more surprising, to go from this perturbed point to a mixture over a well-chosen ball centered on it. Although we have reasons to believe that this second change of measure may not be required (at least choosing a ball in dimension seems slightly sub-optimal), this two-step localization procedure is definitely the first main component that makes it possible to handle the boundary crossing probabilities. The other steps of the proof of the theorem include a concentration of measure argument and a peeling argument, which are more standard.
-Bregman divergence The second main idea is the use of the Bregman divergence and its relation with the quadratic norm, which is due to Lai (1987). This indeed makes it possible to carry out explicit computations for exponential families of dimension without too much effort, at the price of losing some "variance" terms (linked to the Hessian of the family). We combine this idea with some key properties of the Bregman divergence that enable us to simplify a few steps, notably the concentration step, which we revisited entirely in order to obtain clean bounds valid in finite time and not only asymptotically.
-Concentration of measure and boundary effects One specific difficulty that appeared in the proof is to handle the shape of the parameter set , and the fact that should be away from its boundary. The initial asymptotic proof of Lai did not account for this and was not entirely accurate. Going beyond this proved to be quite challenging due to the boundary effects, although the concentration result (Section 5.4, Lemma 15) that we obtain is ultimately valid without restriction and the final proof looks deceptively easy. This concentration result is novel.
-Cone covering and dimension In Lai (1988), the author analyzed a boundary crossing problem first in the case of exponential families of dimension , and then sketched the analysis for exponential families of dimension and for the intersection with one cone. However, the complete result was nowhere stated explicitly. As a matter of fact, the initial proof from Lai (1988) restricts to a cone, which greatly simplifies the result. In order to obtain the full-blown results, valid in dimension for the unrestricted event, we introduced a cone covering of the space. This seemingly novel (although not very fancy) idea makes it possible to get a final result that depends only on the cone-covering number of the space. It required some careful considerations and simplifications of the initial steps from Lai (1988). Along the way, we made explicit the sketch of proof provided in Lai (1988) for the dimension .
-Corollaries and ratios The final key idea that should be credited to T.L. Lai is about the fine tuning of the final bound resulting from the two changes of measure, the application of concentration and the peeling argument. Indeed, these steps lead to a bound by a sum of terms, say , that should be studied and depends on a few free parameters. This corresponds, with our rewriting and modifications, to the statement of Theorem 3.
The brilliant idea of T.L. Lai, which we separate from the proof of Theorem 3 and use in the proofs of Corollaries 1 and 2, is to bound the ratios of for small values of and the ratio for large values of separately (instead of resorting, for instance, to a sum-integral comparison lemma). A careful study of these terms makes it possible to improve the scaling and allow for smaller values of , up to , while other approaches seem unable to go below . Nevertheless, in our quest to obtain explicit bounds valid not only asymptotically but also in finite time, this step is quite delicate, since a naive approach easily requires huge values for before the asymptotic regimes kick in. By refining the initial proof strategy of Lai (1988), we managed to obtain a result valid for all for the setting of Corollary 1 and for all "reasonably" large for the more challenging setting of Corollary 2 (we require to be at least about times some problem-dependent constant, against a factor that could be in the initial analysis).
5 Analysis of boundary crossing probabilities: proof of Theorem 3
In this section, we closely follow the proof technique used in Lai (1988) for the proof of Theorem 2, in order to prove the result of Theorem 3. We make the constants more precise, remove the cone restriction on the parameter and modify the original proof to be fully non-asymptotic which, using the technique of Lai (1988), forces us to make some parts of the proof a little more accurate.
Let us recall that we consider and such that . The proof is divided in four main steps that we briefly present here for clarity:
In Section 5.1, we take care of the random number of pulls of the arm by a peeling argument. Simultaneously, we introduce a covering of the space with cones, which later enables us to use arguments from the proof of Theorem 2.
In Section 5.2, we proceed with the first change of measure argument: taking advantage of the gap between and , we move from a concentration argument around to one around a shifted point .
In Section 5.3, we localize the empirical parameter and make use of the second change of measure, this time to a mixture of measures, following Lai (1988). Even though we follow the same high-level idea, we modified the original proof in order to better handle the cone covering, and also to make all quantities explicit.
In Section 5.4, we apply a concentration of measure argument. This part requires specific care since this is the core of the finite-time result. An important complication comes from the "boundary" of the parameter set, which was not explicitly controlled in the original proof from Lai (1988). A very careful analysis makes it possible to obtain the finite-time concentration result without further restriction.
We finally combine all these steps in Section 5.5.
5.1 Peeling and covering
In this section, the intuition we follow is that we want to control the random number of pulls and to this end use a standard peeling argument, considering maximum concentration inequalities on time intervals for some . Likewise, since the term can be seen as an infimum of some quantity over the set of parameters , we use a covering of in order to reduce the control of the desired quantity to that of each cell of the cover. Formally, we show that
Lemma 10 (Peeling and cone covering decomposition)
For all and it holds
\displaystyle\mathbb{P}_{\theta^{\star}}\Big{\{}\bigcup_{1\leqslant n\leqslant t}\widehat{\theta}_{n}\in\Theta_{\rho}\cap\mathcal{K}_{a^{\star}}(\Pi_{a^{\star}}(\widehat{\nu}_{a^{\star},n}),\mu^{\star}-\varepsilon)\geqslant f(t/n)/n\Big{\}}
\displaystyle\leqslant\sum_{i=0}^{\lceil\log_{b}(\beta t+\beta)\rceil-2}\sum_{c=1}^{C_{p,\eta,K}}\mathbb{P}_{\theta^{\star}}\Big{\{}\bigcup_{b^{i}\leqslant n<b^{i+1}}E_{c,p}(n,t)\Big{\}}+\sum_{c=1}^{C_{p,\eta,K}}\mathbb{P}_{\theta^{\star}}\Big{\{}\bigcup_{n=b^{\lceil\log_{b}(\beta t+\beta)\rceil-1}}^{t}E_{c,p}(n,t)\Big{\}}\,,
where the event is defined by
\displaystyle E_{c,p}(n,t)\stackrel{{\scriptstyle\rm def}}{{=}}\bigg{\{}\widehat{\theta}_{n}\in\Theta_{\rho}\cap\widehat{F}_{n}\in\mathcal{C}_{p}(\theta^{\star}_{c})\cap\mathcal{B}^{\psi}(\widehat{\theta}_{n},\theta^{\star}_{c})\geqslant\frac{f(t/n)}{n}\bigg{\}}\,.
(7)
In this definition, , constrained to satisfy , parameterize a minimal covering of with cones (That is ), where \mathcal{C}_{p}(y;\Delta)=\bigg{\{}y^{\prime}\in\mathbb{R}^{K}:\langle y^{\prime}-y,\Delta\rangle\geqslant p\|y^{\prime}-y\|\|\Delta\|\bigg{\}}. For all , is of order and , while when .
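The cone $\mathcal{C}_{p}(y;\Delta)$ defined above is a simple angular condition: a point $y^{\prime}$ belongs to it when the cosine of the angle between $y^{\prime}-y$ and $\Delta$ is at least $p$. A minimal sketch of this membership test is given below; the function name `in_cone` and the use of plain Python sequences for vectors are illustrative choices.

```python
import math

def in_cone(y_prime, y, delta, p: float) -> bool:
    """Check whether y_prime lies in C_p(y; delta), i.e. whether
    <y_prime - y, delta> >= p * ||y_prime - y|| * ||delta||."""
    d = [a - b for a, b in zip(y_prime, y)]
    inner = sum(di * de for di, de in zip(d, delta))
    norm_d = math.sqrt(sum(di * di for di in d))
    norm_delta = math.sqrt(sum(de * de for de in delta))
    return inner >= p * norm_d * norm_delta

# A point almost aligned with the direction delta is inside an acute cone,
# while an orthogonal point is not.
print(in_cone((1.0, 0.1), (0.0, 0.0), (1.0, 0.0), p=0.9))
print(in_cone((0.0, 1.0), (0.0, 0.0), (1.0, 0.0), p=0.5))
```

The larger $p$ is, the more acute the cone, so covering a given set requires more cones as $p$ approaches $1$.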
Peeling Let us introduce an increasing sequence such that for some . Then by a simple union bound it holds for any event
[TABLE]
We apply this simple result to the following sequence, defined for some and by
[TABLE]
(this is indeed a valid sequence since ), and to the event
[TABLE]
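The peeling grid can be read off the bound of Lemma 10: the time indices $1,\dots,t$ are split into geometric slices $[b^{i},b^{i+1})$ for $i\leqslant\lceil\log_{b}(\beta t+\beta)\rceil-2$, plus one last slice running up to $t$. The sketch below builds these slices; it assumes, for simplicity, that $b$ is an integer larger than $1$, that $\beta\in(0,1)$ and that $\beta(t+1)>1$. The function name `peeling_intervals` is hypothetical.

```python
def peeling_intervals(t: int, b: int, beta: float):
    """Return the inclusive integer slices [n_i, n_{i+1} - 1] of the peeling grid.

    Assumptions of this sketch: integer b > 1, 0 < beta < 1 and beta * (t + 1) > 1.
    """
    if not (isinstance(b, int) and b > 1 and 0 < beta < 1 and beta * (t + 1) > 1):
        raise ValueError("sketch assumes integer b > 1, 0 < beta < 1, beta*(t+1) > 1")
    # I_t = ceil(log_b(beta * t + beta)), found by repeated multiplication
    # to avoid floating-point issues with logarithms near exact powers of b.
    I_t, power = 0, 1
    while power < beta * (t + 1):
        power *= b
        I_t += 1
    # n_i = b^i for i < I_t, and the last grid point is t + 1.
    grid = [b**i for i in range(I_t)] + [t + 1]
    return [(grid[i], grid[i + 1] - 1) for i in range(I_t)]

print(peeling_intervals(100, 2, 0.5))
```

Each index $n\in\{1,\dots,t\}$ falls in exactly one slice, so a union bound over the slices loses nothing beyond the number of slices, which is logarithmic in $t$.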
Covering We now make the Kullback-Leibler projection explicit, and remark that in case of a regular family, it holds that
[TABLE]
where is any point such that . This rewriting makes explicit a shift from to another point . For this reason, it is natural to study the link between and . Immediate computations show that for any such that it holds
[TABLE]
With this equality, the Kullback-Leibler projection can be rewritten so as to exhibit an infimum over the shift term only. In order to control the second part of the shift term, we localize it thanks to a cone covering of . More precisely, on the event , we know that . Indeed, for all , , and thus . It is thus natural to build a covering of . Formally, for a given and a base point , let us introduce the cone
[TABLE]
We then associate to each a cone defined by . Now for a given , let be a set of points corresponding to a minimal covering of the set , in the sense that
[TABLE]
constrained to be outside the ball , that is for each . It can be readily checked that by minimality of the size of the covering , it must be that . More precisely, when , then is such that is positive and away from [math]. Also, we have by property of that , and by the constraint that .
The size of the covering depends on the angle separation , the ambient dimension , and the repulsive parameter . For instance it can be checked that for all and . In higher dimension, typically scales as and blows up when . It also blows up when . It is now natural to introduce the decomposition
[TABLE]
Using this notation, we deduce that for all (we remind that ),
[TABLE]
5.2 Change of measure
In this section, we focus on one event . The idea is to take advantage of the gap between and , which allows us to shift from to some of the from the cover. The key observation is to control the change of measure from to each . Note that and that . We show that
Lemma 11 (Change of measure)
If is non-decreasing, then for any increasing sequence of non-negative integers it holds
\displaystyle\mathbb{P}_{\theta^{\star}}\Bigl{\{}\bigcup_{n=n_{i}}^{n_{i+1}-1}E_{c,p}(n,t)\Bigr{\}}\leqslant\exp\bigg{(}-n_{i}\alpha^{2}-\chi\sqrt{n_{i}f(t/n_{i})}\bigg{)}\mathbb{P}_{\theta^{\star}_{c}}\Bigl{\{}\bigcup_{n=n_{i}}^{n_{i+1}-1}E_{c,p}(n,t)\Bigr{\}}
where and .
Proof of Lemma 11: For any measurable event , we have by absolute continuity that
[TABLE]
We thus bound the ratio which, in the case of , leads to
[TABLE]
where . Note that this rewriting exhibits the same term as the shift term appearing in (8). Now, we remark that since by construction, then under the event it holds by convexity of and elementary Taylor approximation
[TABLE]
where we used the fact that . On the other hand, it also holds that
[TABLE]
To conclude the proof, we plug (11) and (12) into (10). Then, it remains to use that together with the fact that is non-decreasing.
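The change of measure used in this proof relies on the explicit form of the likelihood ratio between two members of an exponential family: for $n$ i.i.d. observations, it only depends on the data through the empirical mean of the feature function. The sketch below checks this in the Bernoulli case, assuming the canonical density $\exp(\langle\theta,F(x)\rangle-\psi(\theta))$ with $F(x)=x$; the function names and the sample are illustrative only.

```python
import math

def psi(theta: float) -> float:
    """Log-partition function of the Bernoulli family."""
    return math.log1p(math.exp(theta))

def likelihood_ratio(theta: float, theta_c: float, xs) -> float:
    """dP_theta / dP_theta_c on the sample xs, via the exponential-family formula
    exp(n <theta - theta_c, F_hat_n> - n (psi(theta) - psi(theta_c)))."""
    n = len(xs)
    f_hat = sum(xs) / n  # empirical mean of F(x) = x
    return math.exp(n * (theta - theta_c) * f_hat - n * (psi(theta) - psi(theta_c)))

def bernoulli_pmf(theta: float, x: int) -> float:
    """Probability of x in {0, 1} under the Bernoulli law with natural parameter theta."""
    p = 1.0 / (1.0 + math.exp(-theta))
    return p if x == 1 else 1.0 - p

xs = [1, 0, 1, 1, 0]
theta, theta_c = 0.4, -0.2
direct = math.prod(bernoulli_pmf(theta, x) for x in xs) / math.prod(
    bernoulli_pmf(theta_c, x) for x in xs
)
print(direct, likelihood_ratio(theta, theta_c, xs))  # the two values should agree
```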
5.3 Localized change of measure
In this section, we decompose further the event of interest in \mathbb{P}_{\theta^{\star}_{c}}\Bigl{\{}\bigcup_{n_{i}\leqslant n<n_{i+1}}E_{c,p}(n,t)\Bigr{\}} in order to apply some concentration of measure argument. In particular, since by construction
[TABLE]
it is then natural to control . This is what we call localization. More precisely, we introduce for any sequence of positive values, the following decomposition
[TABLE]
We handle the first term in (13) by another change of measure argument that we detail below, and the second term thanks to a concentration of measure argument that we detail in Section 5.4. We will show more precisely that
Lemma 12 (Change of measure)
For any sequence of positive values , it holds
\displaystyle\mathbb{P}_{\theta^{\star}_{c}}\Big{\{}\bigcup_{n_{i}\leqslant n<n_{i+1}}E_{c,p}(n,t)\cap\|\nabla\psi(\widehat{\theta}_{n})-\nabla\psi(\theta^{\star}_{c})\|<\varepsilon_{t,i,c}\Big{\}}
\displaystyle\leqslant\alpha_{\rho,p}\exp\Big{(}-f\Big{(}\frac{t}{n_{i+1}\!-\!1}\Big{)}\Big{)}\min\Big{\{}\rho^{2}v_{\rho}^{2},\tilde{\varepsilon}_{t,i,c}^{2},\frac{(K+2)v_{\rho}^{2}}{K(n_{i+1}-1)V_{\rho}}\Big{\}}^{-K/2}\tilde{\varepsilon}_{t,i,c}^{K}\,.
where \tilde{\varepsilon}_{t,i,c}=\min\{\varepsilon_{t,i,c},\text{Diam}\big{(}\nabla\psi(\Theta_{\rho})\cap\mathcal{C}_{p}(\theta^{\star}_{c})\big{)}\} and \alpha_{\rho,p}=2\frac{\omega_{p,K-2}}{\omega_{p^{\prime},K-2}}\bigg{(}\frac{V_{\rho}}{v_{\rho}^{2}}\bigg{)}^{K/2}\Big{(}\frac{V_{\rho}}{v_{\rho}}\Big{)}^{K} where , with for and .
Let us recall that .
The idea is to go from to the measure that corresponds to the mixture of all the in the shrunk ball B=\Theta_{\rho}\cap\nabla\psi^{-1}\big{(}\mathcal{C}_{p}(\theta^{\star}_{c})\cap\mathcal{B}_{2}(\nabla\psi(\theta^{\star}_{c}),\varepsilon_{t,i,c})\big{)} where \mathcal{B}_{2}(y,r)\stackrel{{\scriptstyle\rm def}}{{=}}\Big{\{}y^{\prime}\in\mathbb{R}^{K}\,;\,\|y-y^{\prime}\|\leqslant r\Big{\}}. This makes sense since, on the one hand, under , , and on the other hand, . For convenience, let us introduce the event of interest
[TABLE]
We use the following change of measure
[TABLE]
where Q_{B}(\Omega)\stackrel{{\scriptstyle\rm def}}{{=}}\int_{\theta^{\prime}\in B}\mathbb{P}_{\theta^{\prime}}\Bigl{\{}\Omega\Bigr{\}}d\theta^{\prime} is the mixture of all distributions with parameter in . The proof technique consists now in bounding the ratio by some quantity not depending on .
[TABLE]
It is now convenient to remark that the term in the exponent can be rewritten in terms of Bregman divergence: by elementary substitution of the definition of the divergence and of , it holds
[TABLE]
Thus, the previous likelihood ratio simplifies as follows
[TABLE]
where we note that both and belong to .
The next step is to consider a set that contains . For each such set, and the upper bound , we now obtain
[TABLE]
In this derivation, holds by positivity of and the inclusion , follows by a change of parameter argument and is obtained by controlling the determinant (in dimension ) of the Hessian, whose highest eigenvalue is .
In order to identify a good candidate for the set let us now study the set . A first remark is that plays a central role in : It is not difficult to show that, by construction of ,
[TABLE]
Indeed, if belongs to the set on the left hand side, then it must satisfy on the one hand. This implies that (this last inclusion is by construction of ). On the other hand, it satisfies . These two properties show that such a belongs to .
Thus, a natural candidate should satisfy , with . It is then natural to look for in the form , where is a sub-cone of with base point [math]. In this case, the previous derivation simplifies into
[TABLE]
where and . Cases of special interest for the set are such that the value of the function g:y\mapsto\int_{y^{\prime}\in\mathcal{B}_{2}(0,\tilde{r})\cap\mathcal{D}}\exp\big{(}-C\|y-y^{\prime}\|^{2}\big{)}dy^{\prime}, for is minimal at the base point [math]. Indeed this enables to derive the following bound
[TABLE]
where follows from another change of parameter argument, with combined with isotropy of the Euclidean norm (the right hand side of no longer depends on the random direction ), plus the fact that the sub-cone is invariant by rescaling. We recognize here a Gaussian integral on that can be bounded explicitly (see below).
Following this reasoning, we are now ready to specify the set . Let be a sub-cone where (remember that the larger , the more acute the cone) and is chosen such that (there always exists such a cone). It thus remains to specify . A study of the function (defined above) on the domain reveals that it is minimal at point [math] provided that is not too small, more precisely provided that . The intuitive reasons are that the points that contribute most to the integral belong to the set for small values of , that this set has lowest volume (the map is minimal) when and that is a minimizer amongst these points provided that is not too small. More formally, the function rewrites
[TABLE]
from which we see that a minimal should be such that the spherical section is minimal for small values of (note also that ). Then, since is a convex set, the sections are of minimal size for points that are extremal, in the sense that satisfies . In order to choose and fully specify , we finally use the following lemma:
Lemma 13
Let be a cone with base point [math] and define . Provided that , then the set of extremal points reduces to .
Proof of Lemma 13: First, note that the boundary of the convex set is supported by the union of the base point [math] and the set . Since this set is a sphere in dimension with radius , all its points are at distance at most from each other. Now they are also at distance exactly from the base point [math]. Thus, when , that is , then [math] is the unique point that satisfies .
We now summarize the previous steps. So far, we have proved the following upper bound
[TABLE]
where denotes the volume of , and for , We remark that by definition of , it holds
[TABLE]
Thus, it remains to analyze the volume and the Gaussian integral of . To do so, we use the following result from elementary geometry, whose proof is given in Appendix A:
Lemma 14
For all , and all the following equality and inequality hold
[TABLE]
where for and using the convention that .
Applying this Lemma, we thus get for ,
[TABLE]
This concludes the proof of Lemma 12.
5.4 Concentration of measure
In this section, we focus on the second term in (13), that is we want to control \mathbb{P}_{\theta^{\star}_{c}}\Bigl{\{}\bigcup_{n_{i}\leqslant n<n_{i+1}}E_{c,p}(n,t)\cap||\nabla\psi(\theta^{\star}_{c})-\widehat{F}_{n}||\geqslant\varepsilon_{t,i,c}\Bigr{\}}. In this term, should be considered as decreasing fast to [math] with , and slowly increasing with . Note that by definition is an empirical mean with mean given by and covariance matrix . We thus resort to a concentration of measure argument.
Lemma 15 (Concentration of measure)
Let where we introduced the projected cone . Then, for all , it holds
\displaystyle\mathbb{P}_{\theta^{\star}_{c}}\Big{\{}\!\bigcup_{n=n_{i}}^{n_{i+1}-1}\!\!E_{c,p}(n,t)\cap||\nabla\psi(\widehat{\theta}_{n})\!-\!\nabla\psi(\theta^{\star}_{c})||\!\geqslant\!\varepsilon_{t,i,c}\Big{\}}\!\leqslant\exp\!\bigg{(}\!-\!\frac{n_{i}^{2}p\varepsilon_{t,i,c}^{2}}{2V_{\rho}(n_{i+1}\!-\!1)}\bigg{)}\mathbb{I}\{\varepsilon_{t,i,c}\!\leqslant\!\overline{\varepsilon}_{c}\}.
Proof of Lemma 15: Note that by definition if , then
[TABLE]
We thus restrict to the case when , or equivalently, replace by . Now, by definition of the event , we have the rewriting
[TABLE]
Now, applying the function to both sides of the inequality, for a deterministic , we obtain
[TABLE]
Now we recognize that the sequence , where W_{n}(\lambda)=\exp\bigg{(}\sum_{i=1}^{n}\langle\frac{\lambda\Delta_{c}}{\|\Delta_{c}\|},\nabla\psi(\theta^{\star}_{c})-F(X_{a^{\star},i})\rangle-n\frac{\lambda^{2}V_{\rho}}{2}\bigg{)} is a non-negative super-martingale provided that is not too large. Indeed, provided that it holds
[TABLE]
that is \mathbb{E}\bigg{[}W_{n}(\lambda)\bigg{|}H_{n-1}\bigg{]}\leqslant W_{n-1}(\lambda). Thus, we apply Doob’s maximal inequality for non-negative super-martingales and deduce that
[TABLE]
Optimizing over gives , and thus the condition becomes . At this point, it is convenient to introduce the quantity
[TABLE]
Indeed, it suffices to show that to ensure that the condition is satisfied. It is now not difficult to relate to : Indeed, any that maximizes and belongs to must satisfy
[TABLE]
on the one hand, and on the other hand, since ,
[TABLE]
Combining these two inequalities, we deduce that . Thus, using that and , we deduce that is indeed satisfied. We then get without further restriction
[TABLE]
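The one-step inequality behind the super-martingale property used in this proof can be checked numerically in a simple case. The sketch below considers Bernoulli observations and replaces $V_{\rho}$ by the global variance bound $1/4$, which is a simplifying assumption of this illustration: in that case Hoeffding's lemma guarantees the inequality for every $\lambda$, whereas the proof above works with a local bound and therefore needs $\lambda$ not to be too large. The function name `one_step_factor` is hypothetical.

```python
import math

def one_step_factor(p: float, lam: float, V: float = 0.25) -> float:
    """E[exp(lam * (p - X) - lam^2 * V / 2)] for X ~ Bernoulli(p).

    A value at most 1 means that multiplying by this increment keeps
    the process W_n a non-negative super-martingale.
    """
    mgf = p * math.exp(lam * (p - 1.0)) + (1.0 - p) * math.exp(lam * p)
    return mgf * math.exp(-(lam**2) * V / 2.0)

# Largest one-step factor over a grid of means p and tilting parameters lam.
worst = max(
    one_step_factor(p / 20, lam / 4)
    for p in range(1, 20)
    for lam in range(-20, 21)
)
print(worst)  # expected to be at most 1, up to rounding
```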
5.5 Combining the different steps
In this part, we recap what we have shown so far. Combining the peeling, change of measure, localization and concentration of measure steps of the four previous sections, we have shown that for all ,
[TABLE]
where we recall that and that the definition of is
[TABLE]
A simple rewriting leads to the form
[TABLE]
which suggests we use . Replacing this term in the above expression, we obtain
[TABLE]
At this point, using the somewhat crude lower bound it is convenient to introduce the constant
[TABLE]
which leads to the final bound
[TABLE]
6 Fine-tuned upper bounds
In this section, we study the behavior of the bound obtained in Theorem 3 as a function of , for a specific choice of function , namely , and prove Corollary 1 and Corollary 2, using a fine-tuning of the remaining free quantities. This tuning is not completely trivial, as a naive tuning yields the condition that to ensure that the final bound is , while proceeding with some more care makes it possible to show that is enough. Let us recall that is non-decreasing only for . We thus restrict to in Corollary 1, which uses the threshold , and to in Corollary 2, which uses the threshold function . In the sequel, we use the short-hand notation in order to replace .
6.1 Proof of Corollary 1
As a warm-up, we start with the boundary crossing probability involving instead of . Indeed, controlling the boundary crossing probability with term is more challenging. Although we focused so far on the boundary crossing probability with term , the previous proof directly applies to the case when is considered. In particular, the result of Theorem 3 also holds when all the terms are replaced with .
With the choice , which is non-increasing on the set of such that , Theorem 3 specifies for all , to
[TABLE]
In order to study the sum we provide two strategies. First, a direct upper bound gives . Thus, setting and we obtain
[TABLE]
This term is thus whenever and when . We now show that a more careful analysis leads to a similar behavior even for smaller values of . Indeed, let us note that for all , it holds by definition
[TABLE]
Since , if we set , which belongs to for all , we obtain that . Thus, we deduce that
[TABLE]
Thus, is asymptotically , and we deduce that $\mathbb{P}_{\theta^{\star}}\Big\{\bigcup_{1\leqslant n<t}\widehat{\theta}_{n}\in\Theta_{\rho}\cap\mathcal{K}_{a^{\star}}(\Pi_{a^{\star}}(\widehat{\nu}_{a^{\star},n}),\mu^{\star}-\varepsilon)\geqslant f(t)/n\Big\}=o(1/t)$ beyond the condition . It is interesting to note that due to the term in the exponent, and owing to the fact that for all positive and all , we actually have the stronger property that for all (using and ). However, since this asymptotic regime may take a massive amount of time to kick in when , we do not advise taking smaller than . All in all, we obtain, for with ,
[TABLE]
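To make the event controlled in this section concrete, the following minimal sketch checks, for a single Bernoulli sample path, whether the boundary is crossed at some n < t. Everything here is an illustrative assumption rather than the paper's construction: the Bernoulli family, the reading of the event as "the empirical mean lies below μ* − ε and its Bernoulli KL divergence to μ* − ε is at least f(t)/n", the threshold form f(t) = ln t + ξ ln ln t, and the helper names `bernoulli_kl`, `threshold` and `crosses_boundary`.

```python
import math


def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), with 0*log(0) = 0."""
    tiny = 1e-12
    q = min(max(q, tiny), 1.0 - tiny)  # keep the logarithms finite
    kl = 0.0
    if p > 0.0:
        kl += p * math.log(p / q)
    if p < 1.0:
        kl += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return kl


def threshold(t, xi):
    """Assumed threshold f(t) = ln(t) + xi * ln(ln(t)); requires t > 1."""
    return math.log(t) + xi * math.log(math.log(t))


def crosses_boundary(samples, mu_star, eps, xi):
    """Return True if, for some n < t (with t = len(samples) + 1), the empirical
    mean is below mu_star - eps and its KL divergence to mu_star - eps is at
    least f(t) / n."""
    t = len(samples) + 1
    f_t = threshold(t, xi)
    target = mu_star - eps
    total = 0.0
    for n, x in enumerate(samples, start=1):
        total += x
        mean_n = total / n
        if mean_n < target and bernoulli_kl(mean_n, target) >= f_t / n:
            return True
    return False


# Two hand-built sample paths (hypothetical data, for illustration only).
all_zeros = [0] * 10
all_ones = [1] * 10
print(crosses_boundary(all_zeros, mu_star=0.5, eps=0.1, xi=1.0))
print(crosses_boundary(all_ones, mu_star=0.5, eps=0.1, xi=1.0))
```

Averaging `crosses_boundary` over many simulated paths drawn with mean μ* would give a Monte Carlo estimate of the crossing probability for a fixed t, although such an estimate cannot by itself confirm the o(1/t) rate.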
6.2 Proof of Corollary 2
Let us now focus on the proof of Corollary 2 involving the threshold . We consider the choice , which is non-increasing on the set of such that . When and is about , ensuring this monotonicity property means that we require to dominate , that is . Now, following the result of Theorem 3, we thus obtain for all ,
[TABLE]
We thus study the sum . To this end, let us first study the term . Since is a decreasing function of , it holds for any index that
[TABLE]
Small values of 
We start by handling the terms corresponding to small values of , for some to be chosen. In that case, we note that satisfies and thus
[TABLE]
from which we deduce that
[TABLE]
Following Lai (1988), in order to ensure that this quantity is summable in , it is convenient to define as
[TABLE]
for and a positive constant . Indeed, in that case, when , we obtain the bounds (this is also valid when , since the sum is equal to [math] in that case)
[TABLE]
We easily see that this is when both when and when , by construction of . Note that can further be chosen to be equal to [math] when . The value of is fixed by looking at what happens for larger values of . We note that the initial proof of Lai (1988) uses the value .
Large values of 
We now consider the terms of the sum corresponding to large values , and thus focus on the term , or rather on the following ratio
[TABLE]
Remarking that this ratio is a non-increasing function of , we upper bound it by replacing with either or [math]. Using that , we thus obtain
[TABLE]
Since we would like this ratio to be less than for all (large enough) , we readily see from this expression that this excludes the cases when : the term in the inner brackets converges to [math] in such cases, and thus the ratio is asymptotically upper bounded by . Thus we impose that , that is .
For the critical value it is then natural to study the term . First, when , this quantity is larger than for . Then, it can be checked that if . These two conditions show that, provided that , then . Now, in order to get a ratio that is away from , we target the bound . This can be achieved by requiring that by setting . Eventually, we obtain for and the bound
[TABLE]
Remark 12
Another notable value is . A study similar to the previous one shows that for , the term is larger than for , which entails that provided that .
Plugging-in the definition of , and since , we obtain if , and for ,
[TABLE]
It remains to handle the case when . Note that this case only happens for large enough so that . The latter quantity may be huge since is possibly large when is close to [math]. In that case, we directly control . We control the ratio by provided that
[TABLE]
Thus, if we define to be the smallest such , then when and provided that , the bound of (16) remains valid for the sum , up to replacing with and with . The latter constraint is satisfied as soon as , which is generally satisfied for not too large.
Final control on S
We can now control the term by combining the two bounds for large and small . We get for and , and provided that and $t\leqslant\chi^{-2}\frac{\exp\big(\chi^{-2}\ln(4.5)^{2}\big)}{4\ln(4.5)^{2}}$, the following bound
[TABLE]
Further, for larger values of , namely $t\geqslant\chi^{-2}\frac{\exp\big(\chi^{-2}\ln(4.5)^{2}\big)}{4\ln(4.5)^{2}}$, we get
[TABLE]
Concluding step
In this final step, we now gather equation (15) together with the previous bounds (17), (18) on . We obtain that for all
[TABLE]
where we recall the definition of the constants
When , one can then choose . When , there is a trade-off in , since the first term (the exponential) is decreasing with while the second term is increasing with . For instance choosing , where and leads to . When , simply choosing gives the final bound after some cosmetic simplifications.
Conclusion
In this work, which should be considered as a tribute to the contributions of T.L. Lai, we shed light on a beautiful and seemingly forgotten result from Lai (1988), which we modernized into a fully non-asymptotic statement, with explicit constants that can be directly used, for instance, for the regret analysis of multi-armed bandit strategies. Interestingly, the final results, whose roots are thirty years old, show that the existing analysis of KL-ucb, which was only stated for exponential families of dimension and discrete distributions, leads to sub-optimal constraints on the tuning of the threshold function , and can be extended to work with exponential families of arbitrary dimension and even for the thresholding term of the KL-ucb+ strategy, whose analysis was left open.
This proof technique is mostly based on a change-of-measure argument, like the lower bounds for the analysis of sequential decision making strategies and in stark contrast with other key results in the literature (Honda and Takemura (2010), Maillard et al. (2011), Cappé et al. (2013)). We believe and hope that the novel writing of this proof technique that we provided here will greatly benefit the community working on boundary crossing probabilities, sequential design of experiments as well as stochastic decision making strategies.
Appendix A Technical details
Lemma 9 (Dimension 1)
Consider a canonical one-dimensional family (that is and ). Then, for all such that is non-increasing in ,
[TABLE]
**Proof of Lemma 9:** The proof goes as follows. First, we observe that:
[TABLE]
At this point note that if for all with mean , it holds that then the probability of interest is [math] and we are done. In the other case, there exists an such that . We thus proceed with this case as follows
[TABLE]
where (a) holds by (5), (b) holds for all , and (c) for all such that . Now, the process defined by and $W_{\lambda,n}=\exp\Big(\sum_{i=1}^{n}\big(\lambda F(X_{i})-\Phi(\lambda)\big)\Big)$ is a non-negative super-martingale, since it holds
[TABLE]
Thus, we deduce that for all such that
[TABLE]
Since by (6) this is satisfied by the optimal for , we thus deduce that
[TABLE]
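The super-martingale argument used in this proof can be illustrated numerically. The sketch below is a minimal simulation under assumptions that are not taken from the paper: observations are drawn from a unit-variance Gaussian with mean `mu`, the feature function is F(x) = x, and Φ(λ) is taken to be the log-moment generating function λμ + λ²/2, so that each factor exp(λF(X_i) − Φ(λ)) has expectation one. It reports the empirical mean of the process at the final step and the fraction of paths whose running maximum reaches a given level, which the maximal inequality for non-negative super-martingales started at 1 bounds by 1/level.

```python
import math
import random


def simulate_supermartingale(lam, mu, n_steps, n_paths, level, seed=0):
    """Simulate W_n = exp(sum_i (lam * X_i - phi)) for X_i ~ N(mu, 1).

    Assumption for illustration: F(x) = x and phi = lam*mu + lam**2/2, the
    log-moment generating function of X_i at lam.
    Returns (empirical mean of W at n_steps,
             fraction of paths whose running maximum of W reaches `level`).
    """
    rng = random.Random(seed)
    phi = lam * mu + 0.5 * lam ** 2
    log_level = math.log(level)
    total_final = 0.0
    n_cross = 0
    for _ in range(n_paths):
        log_w = 0.0      # W_0 = 1
        max_log_w = 0.0  # running maximum of log W along the path
        for _ in range(n_steps):
            x = rng.gauss(mu, 1.0)
            log_w += lam * x - phi
            max_log_w = max(max_log_w, log_w)
        total_final += math.exp(log_w)
        if max_log_w >= log_level:
            n_cross += 1
    return total_final / n_paths, n_cross / n_paths


mean_w, cross_freq = simulate_supermartingale(
    lam=0.2, mu=0.5, n_steps=20, n_paths=20000, level=5.0
)
print("empirical mean of W_n:", mean_w)
print("crossing frequency:", cross_freq, "maximal-inequality bound:", 1.0 / 5.0)
```

The simulation works on the log scale to avoid overflow for long paths; the crossing frequency is typically well below the bound, which is loose for moderate levels.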
Lemma 14
For all , and all , the following equality holds
[TABLE]
where for and using the convention that . Further,
[TABLE]
**Proof of Lemma 14:** First of all, let us remark that provided that , then
[TABLE]
where is the dimensional unit sphere of . Let us recall that when , we get . For convenience, let us denote . Then, for ,
[TABLE]
For , . Likewise, following the same steps, we obtain that
[TABLE]
We obtain the first part of the lemma by combining the two previous equalities. For the second part, we use the inequality , which gives
[TABLE]
Thus, whenever , we obtain
[TABLE]
On the other hand, if , then
[TABLE]
Thus, in all cases, the integral is larger than , and we conclude by simple algebra.
References
- [1] Agrawal (1995) Rajeev Agrawal. Sample mean based index policies by O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27(04):1054–1078, 1995.
- [2] Audibert et al. (2009) J-Y. Audibert, R. Munos, and Cs. Szepesvári. Exploration-exploitation trade-off using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19), 2009.
- [3] Audibert and Bubeck (2010) J.Y. Audibert and S. Bubeck. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11:2635–2686, 2010.
- [4] Auer et al. (2002) P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.
- [5] Bubeck et al. (2012) Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
- [6] Burnetas and Katehakis (1997) A.N. Burnetas and M.N. Katehakis. Optimal adaptive policies for Markov decision processes. Mathematics of Operations Research, pages 222–255, 1997.
- [7] Cappé et al. (2013) Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, and Gilles Stoltz. Kullback–Leibler upper confidence bounds for optimal sequential allocation. Annals of Statistics, 41(3):1516–1541, 2013.
- [8] Chow and Teicher (1988) Y.S. Chow and H. Teicher. Probability theory. 2nd. Springer-Verlag, 1:988, 1988.
