Boundary Crossing Probabilities for General Exponential Families

Odalric-Ambrym Maillard

arXiv:1705.08814·stat.ML·May 25, 2017

TL;DR

This paper extends boundary crossing probability bounds for exponential families to arbitrary finite dimensions, enabling analysis of advanced bandit algorithms and revealing overlooked classical techniques.

Contribution

It generalizes boundary crossing probability bounds from one-dimensional to multi-dimensional exponential families, facilitating new regret analyses for bandit algorithms.

Findings

01

Provides concentration inequalities for multi-dimensional exponential families.

02

Enables analysis of KL-ucb and KL-ucb+ strategies in general settings.

03

Highlights the rediscovery of classical proof techniques relevant to modern bandit theory.

Abstract

We consider parametric exponential families of dimension $K$ on the real line. We study a variant of boundary crossing probabilities coming from the multi-armed bandit literature, in the case when the real-valued distributions form an exponential family of dimension $K$ . Formally, our result is a concentration inequality that bounds the probability that $\mathcal{B}^{\psi}(\widehat{\theta}_{n},\theta^{\star})\geqslant f(t/n)/n$ , where $\theta^{\star}$ is the parameter of an unknown target distribution, $\widehat{\theta}_{n}$ is the empirical parameter estimate built from $n$ observations, $\psi$ is the log-partition function of the exponential family and $\mathcal{B}^{\psi}$ is the corresponding Bregman divergence. From the perspective of stochastic multi-armed bandits, we pay special attention to the case when the boundary function $f$ is logarithmic, as it enables the analysis of the regret of the…

Equations

a^{\star}\in\mathop{\mathrm{Argmax}}_{a\in\mathcal{A}}\mu_{a}\,.

N_{a}(T)\stackrel{\rm def}{=}\sum_{t=1}^{T}\mathbb{I}_{\{a_{t}=a\}}\,.

\mathfrak{R}_{T}\stackrel{{\scriptstyle\rm def}}{{=}}\mathbb{E}\!\left[T\mu^{\star}-\sum_{t=1}^{T}Y_{t}\right]=\mathbb{E}\!\left[T\mu^{\star}-\sum_{t=1}^{T}\mu_{a_{t}}\right]=\sum_{a\in\mathcal{A}}\Delta_{a}\,\mathbb{E}\bigl{[}N_{a}(T)\bigr{]}\,,

\tau_{a,m}=\min\bigl{\{}t\in\mathbb{N}:\ \ N_{a}(t)=m\bigr{\}}\,.

\widehat{\nu}_{a}(t)=\frac{1}{N_{a}(t)}\sum_{s=1}^{t}\delta_{Y_{s}}\,\mathbb{I}_{\{a_{s}=a\}}\quad\text{and}\quad\widehat{\nu}_{a,n}=\frac{1}{n}\sum_{m=1}^{n}\delta_{X_{a,m}}\,,\quad\text{where}\quad X_{a,m}\stackrel{\rm def}{=}Y_{\tau_{a,m}}\,.

U_{a}(t)=\sup\biggl{\{}E(\nu):\quad\nu\in\mathcal{D}\quad\mbox{and}\quad\texttt{KL}\Bigl{(}\Pi_{\mathcal{D}}\bigl{(}\widehat{\nu}_{a}(t)\bigr{)},\,\nu\Bigr{)}\leqslant\frac{f(t)}{N_{a}(t)}\biggr{\}}\,;

\inf\,\Bigl{\{}\texttt{KL}(\nu,\nu^{\prime}):\ \ \nu^{\prime}\in\mathcal{D}_{a}\ \ \mbox{\rm s.t.}\ \ E(\nu^{\prime})>\mu\Bigr{\}}=\min\,\Bigl{\{}\texttt{KL}(\nu,\nu^{\prime}):\ \ \nu^{\prime}\in\mathcal{D}_{a}\ \ \mbox{\rm s.t.}\ \ E(\nu^{\prime})\geqslant\mu\Bigr{\}}\,,

U_{a}(t)

\mathcal{C}_{\mu,\gamma}

\displaystyle\mathbb{E}\bigl{[}N_{T}(a)\bigr{]}\leqslant 2+\inf_{n_{0}\leqslant T}\bigg{\{}n_{0}+\sum_{n\geqslant n_{0}+1}^{T}\mathbb{P}\Bigl{\{}\widehat{\nu}_{a,n}\in\mathcal{C}_{\mu^{\star}-\varepsilon,f(T)/n}\Bigr{\}}\bigg{\}}

\displaystyle+\sum_{t=|\mathcal{A}|}^{T-1}\underbrace{\mathbb{P}\Bigl{\{}N_{a^{\star}}(t)\,\,\mathcal{K}_{a^{\star}}\bigl{(}\Pi_{a^{\star}}(\widehat{\nu}_{a^{\star},N_{a^{\star}}(t)}),\,\mu^{\star}-\varepsilon\bigr{)}>f(t)\Bigr{\}}}_{\text{Boundary Crossing Probability}}\,.

\displaystyle\mathbb{E}\bigl{[}N_{T}(a)\bigr{]}\leqslant 2+\inf_{n_{0}\leqslant T}\bigg{\{}n_{0}+\sum_{n\geqslant n_{0}+1}^{T}\mathbb{P}\Bigl{\{}\widehat{\nu}_{a,n}\in\mathcal{C}_{\mu^{\star}-\varepsilon,f(T/n)/n}\Bigr{\}}\bigg{\}}

\displaystyle+\sum_{t=|\mathcal{A}|}^{T-1}\underbrace{\mathbb{P}\Bigl{\{}N_{a^{\star}}(t)\,\,\mathcal{K}_{a^{\star}}\bigl{(}\Pi_{a^{\star}}(\widehat{\nu}_{a^{\star},N_{a^{\star}}(t)}),\,\mu^{\star}-\varepsilon\bigr{)}>f(t/N_{a^{\star}}(t))\Bigr{\}}}_{\text{Boundary Crossing Probability}}\,.

\bigl{\{}a_{t+1}=a\bigr{\}}\subseteq\Bigl{\{}\mu^{\star}-\varepsilon<U_{a}(t)\text{ and }a_{t+1}=a\Bigr{\}}\,\cup\,\Bigl{\{}\mu^{\star}-\varepsilon\geqslant U_{a^{\star}}(t)\Bigr{\}}\,;

\displaystyle\Bigl{\{}\mu^{\star}\!-\!\varepsilon<U_{a}(t)\Bigr{\}}\subseteq\Bigl{\{}\exists\nu^{\prime}\!\in\!\mathcal{D}:E(\nu^{\prime})>\mu^{\star}\!-\!\varepsilon\text{ and }N_{a}(t)\,\,\mathcal{K}_{a}\bigl{(}\Pi_{a}(\widehat{\nu}_{a,N_{a}(t)}),\,\mu^{\star}\!-\!\varepsilon\bigr{)}\leqslant f(t/N_{a}(t))\Bigr{\}}\,,

\displaystyle\text{and}\quad\Bigl{\{}\mu^{\star}\!-\!\varepsilon\geqslant U_{a^{\star}}(t)\Bigr{\}}\subseteq\Bigl{\{}\exists\nu^{\prime}\!\in\!\mathcal{D}:N_{a^{\star}}(t)\,\,\mathcal{K}_{a^{\star}}\bigl{(}\Pi_{a^{\star}}(\widehat{\nu}_{a^{\star},N_{a^{\star}}(t)}),\,\mu^{\star}\!-\!\varepsilon\bigr{)}>f(t/N_{a^{\star}}(t))\Bigr{\}}\,,

\mathbb{E}\bigl{[}N_{T}(a)\bigr{]}\leqslant 1+\sum_{t=|\mathcal{A}|}^{T-1}\mathbb{P}\Bigl{\{}N_{a^{\star}}(t)\,\,\mathcal{K}_{a^{\star}}\bigl{(}\Pi_{a^{\star}}(\widehat{\nu}_{a^{\star},N_{a^{\star}}(t)}),\,\mu^{\star}-\varepsilon\bigr{)}>f(t/N_{a^{\star}}(t))\Bigr{\}}\\ +\sum_{t=|\mathcal{A}|}^{T-1}\mathbb{P}\Bigl{\{}N_{a}(t)\,\,\mathcal{K}_{a}\bigl{(}\Pi_{a}(\widehat{\nu}_{a,N_{a}(t)}),\,\mu^{\star}-\varepsilon\bigr{)}\leqslant f(t/N_{a}(t))\ \,\,\mbox{\small and}\ \,\,a_{t+1}=a\Bigr{\}}\,.

\displaystyle\mathbb{P}\Bigl{\{}N_{a^{\star}}(t)\,\,\mathcal{K}_{a^{\star}}\bigl{(}\Pi_{a^{\star}}(\widehat{\nu}_{a^{\star},N_{a^{\star}}(t)}),\,\mu^{\star}-\varepsilon\bigr{)}>f(t)\Bigr{\}}

\displaystyle\mathbb{P}\Bigl{\{}N_{a^{\star}}(t)\,\,\mathcal{K}_{a^{\star}}\bigl{(}\Pi_{a^{\star}}(\widehat{\nu}_{a^{\star},N_{a^{\star}}(t)}),\,\mu^{\star}-\varepsilon\bigr{)}>f(t/N_{a^{\star}}(t))\Bigr{\}}

\mathcal{E}(F;\nu_{0})=\left\{\nu_{\theta}\in\mathfrak{M}_{1}(\mathcal{X})\,;\,\forall x\in\mathcal{X}\,\,\nu_{\theta}(x)=\exp\big{(}\,\langle\theta,F(x)\rangle-\psi(\theta)\,\big{)}\nu_{0}(x),\,\,\theta\in\mathbb{R}^{K}\right\}\,,

\forall\theta,\theta^{\prime}\in\Theta_{\mathcal{D}},\quad\mathcal{K}(\nu_{\theta},\nu_{\theta^{\prime}})=\langle\theta-\theta^{\prime},\mathbb{E}_{X\sim\nu_{\theta}}(F(X))\rangle-\psi(\theta)+\psi(\theta^{\prime})\,,

\mathcal{B}^{\psi}(\theta,\theta^{\prime})\stackrel{\rm def}{=}\psi(\theta^{\prime})-\psi(\theta)-\langle\theta^{\prime}-\theta,\nabla\psi(\theta)\rangle\,.

\displaystyle\mathcal{K}_{a}\bigl{(}\Pi_{a}(\nu),\,\mu\bigr{)}=\inf\big{\{}\,\mathcal{B}^{\psi}(\theta,\theta^{\prime})\,;\,\mathbb{E}_{\nu_{\theta^{\prime}}}(X)>\mu\,\big{\}}\,.

\Phi^{\star}(y)=\sup_{\eta\in\mathbb{R}^{K}}\langle\eta,y\rangle-\psi(\theta^{\star}+\eta)+\psi(\theta^{\star})\,.

\displaystyle\log\mathbb{E}_{\theta^{\star}}\exp\bigg{(}\langle\eta,F(X)\rangle\bigg{)}=\Phi(\eta)\,.

\mathcal{B}^{\psi}(\theta,\theta^{\prime})\leqslant\frac{|\theta-\theta^{\prime}|^{2}}{2}\sup\bigl\{\lambda_{\texttt{MAX}}(\nabla^{2}\psi(\tilde{\theta}))\,;\,\tilde{\theta}\in[\theta,\theta^{\prime}]\bigr\}\,,

|\nabla\psi(\theta)-\nabla\psi(\theta^{\prime})|\leqslant\sup\bigl\{\lambda_{\texttt{MAX}}(\nabla^{2}\psi(\tilde{\theta}))\,;\,\tilde{\theta}\in[\theta,\theta^{\prime}]\bigr\}\,|\theta-\theta^{\prime}|\,,

\mathcal{B}^{\psi}(\theta,\theta^{\prime})\geqslant\frac{|\theta-\theta^{\prime}|^{2}}{2}\inf\bigl\{\lambda_{\texttt{MIN}}(\nabla^{2}\psi(\tilde{\theta}))\,;\,\tilde{\theta}\in[\theta,\theta^{\prime}]\bigr\}\,,

|\nabla\psi(\theta)-\nabla\psi(\theta^{\prime})|\geqslant\inf\bigl\{\lambda_{\texttt{MIN}}(\nabla^{2}\psi(\tilde{\theta}))\,;\,\tilde{\theta}\in[\theta,\theta^{\prime}]\bigr\}\,|\theta-\theta^{\prime}|\,,

\log\mathbb{E}_{\theta^{\star}}\exp\bigl(\langle\eta,F(X)\rangle\bigr)

\psi(\theta)=\psi(\theta^{\prime})+\langle\theta-\theta^{\prime},\nabla\psi(\theta^{\prime})\rangle+\frac{1}{2}(\theta-\theta^{\prime})^{\top}\nabla^{2}\psi(\tilde{\theta})(\theta-\theta^{\prime})\,.

\nabla\psi(\theta^{\prime})-\nabla\psi(\theta)-\lambda\,\partial_{\theta^{\prime}}\mathbb{E}_{\nu_{\theta^{\prime}}}(X)=0\,,\quad\text{with}

\lambda\bigl(\mu-\mathbb{E}_{\nu_{\theta^{\prime}}}(X)\bigr)=0\,,\quad\lambda\geqslant 0\,,\quad\mathbb{E}_{\nu_{\theta^{\prime}}}(X)\geqslant\mu\,,

\partial_{\theta^{\prime}}\mathbb{E}_{\nu_{\theta^{\prime}}}(X)=\mathbb{E}_{\nu_{\theta^{\prime}}}(XF(X))-\mathbb{E}_{\nu_{\theta^{\prime}}}(X)\nabla\psi(\theta^{\prime})\in\mathbb{R}^{K}\,,

Full text

Boundary Crossing Probabilities for General Exponential Families

Odalric-Ambrym Maillard

*INRIA Lille - Nord Europe

40 Avenue Halley

59650 Villeneuve d’Ascq, France

*[email protected]

We consider parametric exponential families of dimension $K$ on the real line. We study a variant of boundary crossing probabilities coming from the multi-armed bandit literature, in the case when the real-valued distributions form an exponential family of dimension $K$ . Formally, our result is a concentration inequality that bounds the probability that $\mathcal{B}^{\psi}(\widehat{\theta}_{n},\theta^{\star})\geqslant f(t/n)/n$ , where $\theta^{\star}$ is the parameter of an unknown target distribution, $\widehat{\theta}_{n}$ is the empirical parameter estimate built from $n$ observations, $\psi$ is the log-partition function of the exponential family and $\mathcal{B}^{\psi}$ is the corresponding Bregman divergence. From the perspective of stochastic multi-armed bandits, we pay special attention to the case when the boundary function $f$ is logarithmic, as it enables the analysis of the regret of the state-of-the-art KL-ucb and KL-ucb+ strategies, whose analysis was left open in such generality. Indeed, previous results only hold for the case when $K=1$ , while we provide results for arbitrary finite dimension $K$ , thus considerably extending the existing results. Perhaps surprisingly, we highlight that the proof techniques to achieve these strong results already existed three decades ago in the work of T.L. Lai, and were apparently forgotten in the bandit community. We provide a modern rewriting of these beautiful techniques that we believe are useful beyond the application to stochastic multi-armed bandits.

*Keywords: * Exponential Families, Bregman Concentration, Multi-armed Bandits, Optimality.

1 Multi-armed bandit setup and notations

Let us consider a stochastic multi-armed bandit problem $(\mathcal{A},\nu)$ , where $\mathcal{A}$ is a finite set of cardinality $A\in\mathbb{N}$ and $\nu=(\nu_{a})_{a\in\mathcal{A}}$ is a family of probability distributions over $\mathbb{R}$ indexed by $\mathcal{A}$ . The game is sequential and goes as follows:

At each round $t\in\mathbb{N}$ , the player picks an arm $a_{t}$ (based on her past observations) and receives a stochastic payoff $Y_{t}$ drawn independently at random according to the distribution $\nu_{a_{t}}$ . She only observes the payoff $Y_{t}$ , and her goal is to maximize her expected cumulative payoff, that is, the expectation of $\sum_{t}Y_{t}$ , over a possibly unknown number of steps.

Although the term multi-armed bandit problem was probably coined during the 1960s in reference to the casino slot machines of the 19th century, the formulation of this problem is due to Herbert Robbins, one of the most brilliant minds of his time (see Robbins (1952)), and takes its origin in earlier questions about optimal stopping policies for clinical trials, see Thompson (1933, 1935), Wald (1945). We refer the interested reader to Robbins (2012) regarding the legacy of the immense work of H. Robbins in mathematical statistics for the sequential design of experiments, compiling his most outstanding research for his 70th birthday. Since then, the field of multi-armed bandits has grown large and bold, and we humbly refer to the introduction of Cappé et al. (2013) for key historical aspects about the development of the field. Most notably, they include first the introduction of dynamic allocation indices (aka Gittins indices, Gittins (1979)), suggesting that an optimal strategy can be found in the form of an index strategy (that at each round selects an arm with highest "index"); second, the seminal work of Lai and Robbins (1985), which showed that indexes can be chosen as "upper confidence bounds" on the mean reward of each arm and provided the first asymptotic lower bound on the achievable performance for specific distributions; third, the generalization of this lower bound in the 1990s to generic distributions by Burnetas and Katehakis (1997) (see also the recent work from Garivier et al. (2016)), as well as the asymptotic analysis by Agrawal (1995) of generic classes of upper-confidence-bound based index policies; and finally Auer et al. (2002), who popularized a simple sub-optimal index strategy termed UCB and, most importantly, opened the quest for finite-time, as opposed to asymptotic, performance guarantees. For the purpose of this paper, we now recall the formal definitions and notations for the stochastic multi-armed bandit problem, following Cappé et al. (2013).

Quality of a strategy

For each arm $a\in\mathcal{A}$ , let $\mu_{a}$ be the expectation of the distribution $\nu_{a}$ , and let $a^{\star}$ be any optimal arm in the sense that

$a^{\star}\in\mathop{\mathrm{Argmax}}_{a\in\mathcal{A}}\mu_{a}\,.$

We write $\mu^{\star}$ as a short-hand notation for the largest expectation $\mu_{a^{\star}}$ and denote the gap of the expected payoff $\mu_{a}$ of an arm $a$ to $\mu^{\star}$ as $\Delta_{a}=\mu^{\star}-\mu_{a}$ . In addition, we denote the number of times each arm $a$ is pulled between the rounds $1$ and $T$ by $N_{a}(T)$ ,

$N_{a}(T)\stackrel{\rm def}{=}\sum_{t=1}^{T}\mathbb{I}_{\{a_{t}=a\}}\,.$

Definition 1 (Expected regret)

The quality of a strategy is evaluated using the notion of expected regret (or simply, regret) at round $T\geqslant 1$ , defined as

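$\mathfrak{R}_{T}\stackrel{{\scriptstyle\rm def}}{{=}}\mathbb{E}\!\left[T\mu^{\star}-\sum_{t=1}^{T}Y_{t}\right]=\mathbb{E}\!\left[T\mu^{\star}-\sum_{t=1}^{T}\mu_{a_{t}}\right]=\sum_{a\in\mathcal{A}}\Delta_{a}\,\mathbb{E}\bigl{[}N_{a}(T)\bigr{]}\,,$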

where we used the tower rule for the first equality. The expectation is with respect to the random draws of the $Y_{t}$ according to the $\nu_{a_{t}}$ and to the possible auxiliary randomization introduced by the decision-making strategy.
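As a quick sanity check of the last equality in Definition 1, the following minimal sketch compares, for one fixed sequence of pulls, the per-round sum $T\mu^{\star}-\sum_{t}\mu_{a_{t}}$ with the per-arm sum $\sum_{a}\Delta_{a}N_{a}(T)$ , that is, the quantities inside the expectations. The function name `pseudo_regret` and the toy means and pulls are illustrative assumptions, not part of the original text.

```python
import numpy as np

def pseudo_regret(means, arms_played):
    """Return the two sides of the regret decomposition for a fixed run.

    means[a] is the expectation mu_a of arm a; arms_played[t-1] is a_t.
    Both names are hypothetical helpers introduced for this illustration.
    """
    means = np.asarray(means, dtype=float)
    arms_played = np.asarray(arms_played, dtype=int)
    mu_star = means.max()
    horizon = len(arms_played)

    # Per-round form: T * mu_star - sum_t mu_{a_t}
    per_round = mu_star * horizon - means[arms_played].sum()

    # Per-arm form: sum_a Delta_a * N_a(T)
    counts = np.bincount(arms_played, minlength=len(means))  # N_a(T)
    gaps = mu_star - means                                   # Delta_a
    per_arm = float(gaps @ counts)
    return per_round, per_arm

# Toy example (assumed values): three arms, seven rounds.
per_round, per_arm = pseudo_regret([0.5, 0.7, 0.4], [0, 1, 2, 1, 1, 0, 1])
print(per_round, per_arm)
```

Both values coincide for any sequence of pulls, since each round contributes exactly the gap of the arm that was played.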

Empirical distributions

We denote empirical distributions in two related ways, depending on whether random averages indexed by the global time $t$ or averages of given numbers $t$ of pulls of a given arm are considered. The first series of averages will be referred to by using a functional notation for the indexation in the global time: $\widehat{\nu}_{a}(t)$ , while the second series will be indexed with the local times $t$ in subscripts: $\widehat{\nu}_{a,t}$ . These two related indexations, functional for global times and random averages versus subscript indexes for local times, will be consistent throughout the paper for all quantities at hand, not only empirical averages.

Definition 2 (Empirical distributions)

For each $m\geqslant 1$ , we denote by $\tau_{a,m}$ the round at which arm $a$ was pulled for the $m$ –th time, that is

$\tau_{a,m}=\min\bigl{\{}t\in\mathbb{N}:\ \ N_{a}(t)=m\bigr{\}}\,.$

For each round $t$ such that $N_{a}(t)\geqslant 1$ , we then define the following two empirical distributions

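$\widehat{\nu}_{a}(t)=\frac{1}{N_{a}(t)}\sum_{s=1}^{t}\delta_{Y_{s}}\,\mathbb{I}_{\{a_{s}=a\}}\quad\text{and}\quad\widehat{\nu}_{a,n}=\frac{1}{n}\sum_{m=1}^{n}\delta_{X_{a,m}}\,,\quad\text{where}\quad X_{a,m}\stackrel{\rm def}{=}Y_{\tau_{a,m}}\,,$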

where $\delta_{x}$ denotes the Dirac distribution on $x\in\mathbb{R}$ .

Lemma 1

The random variables $X_{a,m}=Y_{\tau_{a,m}}$ , where $m=1,2,\ldots$ , are independent and identically distributed according to $\nu_{a}$ . Moreover, we have the rewriting $\widehat{\nu}_{a}(t)=\widehat{\nu}_{a,N_{a}(t)}\,.$

**Proof of Lemma 1: ** For means based on local times we consider the filtration $(\mathcal{F}_{t})$ , where for all $t\geqslant 1$ , the $\sigma$ –algebra $\mathcal{F}_{t}$ is generated by $a_{1},Y_{1}$ , $\ldots$ , $a_{t},Y_{t}$ . In particular, $a_{t+1}$ and all $N_{a}(t+1)$ are $\mathcal{F}_{t}$ –measurable. Likewise, $\bigl{\{}\tau_{a,m}=t\bigr{\}}$ is $\mathcal{F}_{t-1}$ –measurable. That is, each random variable $\tau_{a,m}$ is a (predictable) stopping time. Hence, the result follows by a standard result in probability theory (see, e.g., Chow and Teicher 1988, Section 5.3). $\hfill\square$
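The two indexations of Definition 2 and the rewriting $\widehat{\nu}_{a}(t)=\widehat{\nu}_{a,N_{a}(t)}$ of Lemma 1 can be illustrated on a toy trajectory. The sketch below is only an illustration under assumed data: the helper names (`local_samples`, `mean_global`, `mean_local`) are hypothetical, and the comparison is made on the means of the two empirical distributions rather than on the distributions themselves.

```python
import numpy as np

def local_samples(arms_played, payoffs, arm):
    """Return (tau, X) for `arm`: tau[m-1] is the round of its m-th pull
    (rounds are 1-indexed) and X[m-1] = Y_{tau_{a,m}}."""
    arms_played = np.asarray(arms_played)
    payoffs = np.asarray(payoffs, dtype=float)
    tau = np.flatnonzero(arms_played == arm) + 1
    return tau, payoffs[tau - 1]

def mean_global(arms_played, payoffs, arm, t):
    """Mean of the global-time empirical distribution of `arm` at round t."""
    a = np.asarray(arms_played[:t])
    y = np.asarray(payoffs[:t], dtype=float)
    return y[a == arm].mean()

def mean_local(X, n):
    """Mean of the local-time empirical distribution built from n pulls."""
    return X[:n].mean()

# Toy trajectory (assumed values): arm indices and observed payoffs.
arms = [0, 1, 0, 0, 1]
ys = [1.0, 0.0, 0.5, 1.0, 1.0]
tau, X = local_samples(arms, ys, arm=0)
t = 4
n_pulls = int(np.sum(np.asarray(arms[:t]) == 0))  # N_a(t)
print(tau, mean_global(arms, ys, 0, t), mean_local(X, n_pulls))
```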

2 Boundary crossing probabilities for the generic KL-ucb strategy

The first appearance of the KL-ucb strategy can be traced at least to Lai (1987), although it was not given an explicit name at that time. It seems the strategy was forgotten after the work of Auer et al. (2002), which opened a decade of intensive research on finite-time analysis of bandit strategies and extensions to variants of the problem (Audibert et al. (2009), Audibert and Bubeck (2010), see also Bubeck et al. (2012) for a survey of relevant variants of bandit problems), until the work of Honda and Takemura (2010) shed a novel light on the asymptotically optimal strategies. Thanks to their illuminating work, the first finite-time regret analysis of KL-ucb was obtained by Maillard et al. (2011) for discrete distributions, soon extended to handle exponential families of dimension $1$ as well, in the unifying work of Cappé et al. (2013). However, as we will see in this paper, we are all much indebted to the outstanding work of T.L. Lai regarding the analysis of this index strategy, both asymptotically and in finite time, as a second look at his papers shows how to bypass the limitations of the state-of-the-art regret bounds for the control of boundary crossing probabilities in this context (see Theorem 3 below). Actually, the primary focus of the present paper is not stochastic bandits but boundary crossing probabilities, and the bandit setting that we provide here should be considered only as giving a solid motivation for the contribution of this paper.

Let us now introduce formally the KL-ucb strategy. We assume that the learner is given a family $\mathcal{D}\subset\mathfrak{M}_{1}(\mathbb{R})$ of probability distributions that satisfies $\nu_{a}\in\mathcal{D}$ for each arm $a\in\mathcal{A}$ , where $\mathfrak{M}_{1}(\mathbb{R})$ denotes the set of all probability distributions over $\mathbb{R}$ . For two distributions $\nu,\nu^{\prime}\in\mathfrak{M}_{1}(\mathbb{R})$ , we denote by $\texttt{KL}(\nu,\nu^{\prime})$ their Kullback-Leibler divergence and by $E(\nu)$ and $E(\nu^{\prime})$ their expectations. (This expectation operator is denoted by $E$ while expectations with respect to underlying randomizations are referred to as $\mathbb{E}$ .)

The generic form of the algorithm of interest in this paper is described as Algorithm 1. It relies on two parameters: an operator $\Pi_{\mathcal{D}}$ (in spirit, a projection operator) that associates with each empirical distribution $\widehat{\nu}_{a}(t)$ an element of the model $\mathcal{D}$ ; and a non-decreasing function $f$ , which is typically such that $f(t)\approx\log(t)$ .

At each round $t\geqslant K+1$ , an upper confidence bound $U_{a}(t)$ is associated with the expectation $\mu_{a}$ of the distribution $\nu_{a}$ of each arm,

$U_{a}(t)=\sup\biggl{\{}E(\nu):\quad\nu\in\mathcal{D}\quad\mbox{and}\quad\texttt{KL}\Bigl{(}\Pi_{\mathcal{D}}\bigl{(}\widehat{\nu}_{a}(t)\bigr{)},\,\nu\Bigr{)}\leqslant\frac{f(t)}{N_{a}(t)}\biggr{\}}\,;$

an arm $a_{t+1}$ with highest upper confidence bound is then played.

In the literature, another variant of KL-ucb is introduced where the term $f(t)$ is replaced with $f(t/N_{a}(t))$ . We refer to this algorithm as KL-ucb+. While KL-ucb has been analyzed and shown to be provably near-optimal, the variant KL-ucb+ has not been analyzed yet.

Alternative formulation of KL-ucb

We wrote the KL-ucb algorithm so that the optimization problem resulting from the computation of $U_{a}(t)$ is easy to handle. Now, under some assumption, one can rewrite this term, in an equivalent form more suited for the analysis. We refer to Cappé et al. (2013):

Lemma 2 (Rewriting)

Under the assumption that

Assumption 1

There is a known interval $\Omega\subset\mathbb{R}$ with boundary $\mu^{-}\leqslant\mu^{+}$ , for which each model $\mathcal{D}_{a}$ of probability measures is included in $\mathcal{P}(\Omega)$ and such that $\forall\nu\in\mathcal{D}_{a}\forall\mu\in\Omega\setminus\{\mu^{+}\}$ ,

$\inf\,\Bigl{\{}\texttt{KL}(\nu,\nu^{\prime}):\ \ \nu^{\prime}\in\mathcal{D}_{a}\ \ \mbox{\rm s.t.}\ \ E(\nu^{\prime})>\mu\Bigr{\}}=\min\,\Bigl{\{}\texttt{KL}(\nu,\nu^{\prime}):\ \ \nu^{\prime}\in\mathcal{D}_{a}\ \ \mbox{\rm s.t.}\ \ E(\nu^{\prime})\geqslant\mu\Bigr{\}}\,,$

then the upper bound used by the KL-ucb algorithm satisfies the following equality

$\displaystyle U_{a}(t)$ $\displaystyle=$ $\displaystyle\max\left\{\mu\in\Omega\setminus\{\mu^{+}\}:\;\mathcal{K}_{a}\!\Big{(}\Pi_{a}\left(\widehat{\nu}_{a}(t)\right),\mu\Big{)}\leqslant\frac{f(t)}{N_{a}(t)}\right\}\,$

$\displaystyle\mbox{where}\quad\mathcal{K}_{a}(\nu_{a},\mu^{\star})\stackrel{{\scriptstyle\rm def}}{{=}}\inf_{\nu\in\mathcal{D}_{a}:\,E(\nu)>\mu^{\star}}\texttt{KL}(\nu_{a},\nu)\,.$

Likewise, a similar result holds for KL-ucb+ but where $f(t)$ is replaced with $f(t/N_{a}(t))$ .

Remark 1

For instance, this assumption is valid when $\mathcal{D}_{a}=\mathcal{P}([0,1])$ and $\Omega=[0,1]$ . Indeed we can replace the strict inequality with an inequality provided that $\mu<1$ by Honda and Takemura (2010), and the infimum is reached by lower semi-continuity of the KL divergence and convexity and closure of the set $\{\nu^{\prime}\in\mathcal{P}([0,1])\ \ \mbox{\rm s.t.}\ \ E(\nu^{\prime})\geqslant\mu\}$ .
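To make the rewriting of Lemma 2 concrete, here is a minimal sketch of the index computation in the simplest case of Bernoulli arms. Several ingredients are assumptions made for illustration and are not taken from the text: the restriction to Bernoulli distributions (for which $\mathcal{K}_{a}$ reduces to the binary relative entropy between means above the empirical mean), the identity projection, the clamping of the argument of $f$ so that $\log\log$ is defined, and all function names. The index is found by bisection, using that the binary relative entropy is non-decreasing in its second argument above the empirical mean.

```python
import math

def kl_bernoulli(p, q, eps=1e-15):
    """Binary relative entropy kl(p, q), clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def threshold(x, xi=0.0):
    """f(x) = log(x) + xi * log(log(x)).

    Assumption: x is clamped to at least e so that log(log(x)) is defined
    and non-negative; this guard is illustrative only.
    """
    x = max(x, math.e)
    return math.log(x) + xi * math.log(math.log(x))

def kl_ucb_index(mean_hat, n_pulls, t, xi=0.0, plus=False, tol=1e-9):
    """Largest mu in [mean_hat, 1] with n_pulls * kl(mean_hat, mu) <= f(.).

    plus=False uses f(t) (KL-ucb); plus=True uses f(t / n_pulls) (KL-ucb+).
    """
    level = threshold(t / n_pulls if plus else t, xi) / n_pulls
    lo, hi = mean_hat, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(mean_hat, mid) <= level:
            lo = mid  # mid is still inside the confidence region
        else:
            hi = mid  # mid is outside, shrink from above
    return lo

# Toy usage (assumed values): empirical mean 0.5 after 10 pulls at round 100.
print(kl_ucb_index(0.5, 10, 100), kl_ucb_index(0.5, 10, 100, plus=True))
```

Since $f$ is non-decreasing and $t/N_{a}(t)\leqslant t$ , the KL-ucb+ index computed this way is never larger than the KL-ucb index for the same data.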

Using boundary-crossing probabilities for regret analysis

We continue this warm-up by restating a convenient way to decompose the regret and bring out the boundary crossing probabilities that are at the heart of this paper. The following lemma is a direct adaptation from Cappé et al. (2013):

Lemma 3 (From Regret to Boundary Crossing Probabilities)

Let $\varepsilon\in\mathbb{R}^{+}$ be a small constant such that $\varepsilon\in(0,\min\{\,\mu^{\star}-\mu_{a}\,,\,a\in\mathcal{A}\,\})$ . For $\mu,\gamma\in\mathbb{R}$ , let us introduce the following set

$\displaystyle\mathcal{C}_{\mu,\gamma}$ $\displaystyle=$ $\displaystyle\Bigl{\{}\nu^{\prime}\in\mathfrak{M}_{1}(\mathbb{R}):\ \ \mathcal{K}_{a}(\Pi_{a}(\nu^{\prime}),\mu)<\gamma\Bigr{\}}\,.$

Then, the number of pulls of a sub-optimal arm $a\in\mathcal{A}$ by Algorithm KL-ucb satisfies

$\displaystyle\mathbb{E}\bigl{[}N_{T}(a)\bigr{]}\leqslant 2+\inf_{n_{0}\leqslant T}\bigg{\{}n_{0}+\sum_{n\geqslant n_{0}+1}^{T}\mathbb{P}\Bigl{\{}\widehat{\nu}_{a,n}\in\mathcal{C}_{\mu^{\star}-\varepsilon,f(T)/n}\Bigr{\}}\bigg{\}}$

$\displaystyle+\sum_{t=|\mathcal{A}|}^{T-1}\underbrace{\mathbb{P}\Bigl{\{}N_{a^{\star}}(t)\,\,\mathcal{K}_{a^{\star}}\bigl{(}\Pi_{a^{\star}}(\widehat{\nu}_{a^{\star},N_{a^{\star}}(t)}),\,\mu^{\star}-\varepsilon\bigr{)}>f(t)\Bigr{\}}}_{\text{Boundary Crossing Probability}}\,.$

Likewise, the number of pulls of a sub-optimal arm $a\in\mathcal{A}$ by Algorithm KL-ucb+ satisfies

$\displaystyle\mathbb{E}\bigl{[}N_{T}(a)\bigr{]}\leqslant 2+\inf_{n_{0}\leqslant T}\bigg{\{}n_{0}+\sum_{n\geqslant n_{0}+1}^{T}\mathbb{P}\Bigl{\{}\widehat{\nu}_{a,n}\in\mathcal{C}_{\mu^{\star}-\varepsilon,f(T/n)/n}\Bigr{\}}\bigg{\}}$

$\displaystyle+\sum_{t=|\mathcal{A}|}^{T-1}\underbrace{\mathbb{P}\Bigl{\{}N_{a^{\star}}(t)\,\,\mathcal{K}_{a^{\star}}\bigl{(}\Pi_{a^{\star}}(\widehat{\nu}_{a^{\star},N_{a^{\star}}(t)}),\,\mu^{\star}-\varepsilon\bigr{)}>f(t/N_{a^{\star}}(t))\Bigr{\}}}_{\text{Boundary Crossing Probability}}\,.$

**Proof of Lemma 3: ** The first part of this lemma for KL-ucb is proved in Cappé et al. (2013). The second part, which is about KL-ucb+, can be proved straightforwardly following the very same lines. We thus only provide the main steps here for clarity: We start by introducing a small $\varepsilon>0$ that satisfies $\varepsilon<\min\{\,\mu^{\star}-\mu_{a}\,,\,a\in\mathcal{A}\,\}$ , and then consider the following inclusion of events:

$\bigl{\{}a_{t+1}=a\bigr{\}}\subseteq\Bigl{\{}\mu^{\star}-\varepsilon<U_{a}(t)\text{ and }a_{t+1}=a\Bigr{\}}\,\cup\,\Bigl{\{}\mu^{\star}-\varepsilon\geqslant U_{a^{\star}}(t)\Bigr{\}}\,;$

indeed, on the event $\displaystyle{\bigl{\{}a_{t+1}=a\bigr{\}}\,\cap\,\Bigl{\{}\mu^{\star}-\varepsilon<U_{a^{\star}}(t)\Bigr{\}}}\,,$ we have, $\mu^{\star}-\varepsilon<U_{a^{\star}}(t)\leqslant U_{a}(t)$ (where the last inequality is by definition of the strategy). Moreover, let us note that

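$\displaystyle\Bigl{\{}\mu^{\star}\!-\!\varepsilon<U_{a}(t)\Bigr{\}}\subseteq\Bigl{\{}\exists\nu^{\prime}\!\in\!\mathcal{D}:E(\nu^{\prime})>\mu^{\star}\!-\!\varepsilon\text{ and }N_{a}(t)\,\,\mathcal{K}_{a}\bigl{(}\Pi_{a}(\widehat{\nu}_{a,N_{a}(t)}),\,\mu^{\star}\!-\!\varepsilon\bigr{)}\leqslant f(t/N_{a}(t))\Bigr{\}}\,,$

$\displaystyle\text{and}\quad\Bigl{\{}\mu^{\star}\!-\!\varepsilon\geqslant U_{a^{\star}}(t)\Bigr{\}}\subseteq\Bigl{\{}\exists\nu^{\prime}\!\in\!\mathcal{D}:N_{a^{\star}}(t)\,\,\mathcal{K}_{a^{\star}}\bigl{(}\Pi_{a^{\star}}(\widehat{\nu}_{a^{\star},N_{a^{\star}}(t)}),\,\mu^{\star}\!-\!\varepsilon\bigr{)}>f(t/N_{a^{\star}}(t))\Bigr{\}}\,,$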

since $\mathcal{K}_{a}$ is a non-decreasing function in its second argument and $\mathcal{K}_{a}\bigl{(}\nu,E(\nu)\bigr{)}=0$ for all distributions $\nu$ . Therefore, this simple remark leads us to the following decomposition

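$\displaystyle\mathbb{E}\bigl{[}N_{T}(a)\bigr{]}\leqslant 1+\sum_{t=|\mathcal{A}|}^{T-1}\mathbb{P}\Bigl{\{}N_{a^{\star}}(t)\,\,\mathcal{K}_{a^{\star}}\bigl{(}\Pi_{a^{\star}}(\widehat{\nu}_{a^{\star},N_{a^{\star}}(t)}),\,\mu^{\star}-\varepsilon\bigr{)}>f(t/N_{a^{\star}}(t))\Bigr{\}}$

$\displaystyle+\sum_{t=|\mathcal{A}|}^{T-1}\mathbb{P}\Bigl{\{}N_{a}(t)\,\,\mathcal{K}_{a}\bigl{(}\Pi_{a}(\widehat{\nu}_{a,N_{a}(t)}),\,\mu^{\star}-\varepsilon\bigr{)}\leqslant f(t/N_{a}(t))\ \,\,\mbox{\small and}\ \,\,a_{t+1}=a\Bigr{\}}\,.$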

The remaining steps of the proof of the result from Cappé et al. (2013), equation (10) can now be straightforwardly modified to work with $f(t/N_{a}(t))$ instead of $f(t)$ , thus concluding this proof. $\hfill\square$

Lemma 3 shows that two terms need to be controlled in order to derive regret bounds for the considered strategy. The boundary crossing probability term is arguably the most difficult to handle and is the focus of the next sections. The other term involves the probability that an empirical distribution belongs to a convex set, which can be handled either directly as in Cappé et al. (2013) or by resorting to finite-time Sanov-type results such as that of (Dinwoodie, 1992, Theorem 2.1 and comments on page 372), or its variant from (Maillard et al., 2011, Lemma 1). For completeness, the exact result from Dinwoodie (1992) reads as follows.

Lemma 4 (Non-asymptotic Sanov’s lemma)

Let $\mathcal{C}$ be an open convex subset of $\mathcal{P}(\mathcal{X})$ such that $\quad\Lambda(\mathcal{C})=\inf_{\kappa\in\mathcal{C}}\texttt{KL}(\kappa,\nu)$ is finite. Then, for all $t\geqslant 1$ , $\qquad\mathbb{P}_{\nu}\{\widehat{\nu}_{t}\in\mathcal{C}\}\leqslant\exp\big{(}-t\Lambda(\overline{\mathcal{C}})\big{)}\qquad$ where $\overline{\mathcal{C}}$ is the closure of $\mathcal{C}$ .
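A simple special case may help to read Lemma 4. For Bernoulli observations with parameter $p$ and the open convex set of distributions with mean larger than $x>p$ , the rate $\Lambda(\overline{\mathcal{C}})$ is the binary relative entropy between $x$ and $p$ , and the lemma reduces to the classical Chernoff bound. The sketch below compares this bound with the exact binomial tail; the Bernoulli specialization, the numerical values and the function names are assumptions chosen for illustration.

```python
import math

def kl_bernoulli(p, q):
    """Binary relative entropy kl(p, q) for p, q strictly inside (0, 1)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def binomial_upper_tail(n, p, x):
    """Exact P(S_n / n > x) for S_n ~ Binomial(n, p)."""
    k_min = math.floor(n * x) + 1  # smallest count k with k / n > x
    return sum(
        math.comb(n, k) * p**k * (1.0 - p) ** (n - k)
        for k in range(k_min, n + 1)
    )

def sanov_bound(n, p, x):
    """exp(-n * kl(x, p)): the bound of Lemma 4 in this Bernoulli case."""
    return math.exp(-n * kl_bernoulli(x, p))

# Assumed toy values: true mean 0.3, threshold 0.5.
for n in (10, 50, 200):
    print(n, binomial_upper_tail(n, 0.3, 0.5), sanov_bound(n, 0.3, 0.5))
```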

Scope and focus of this work

We focus on the setting of stochastic multi-armed bandits because this gives a strong and natural motivation for studying boundary crossing probabilities. However, one should understand that the primary goal of this paper is to give credit to the work of T.L. Lai regarding the neat understanding of boundary crossing probabilities and not necessarily to provide a regret bound for such bandit algorithms as KL-ucb or KL-ucb+. Also, we believe that results on boundary crossing probabilities are useful beyond the bandit problem, for instance in hypothesis testing. Thus, and in order to avoid obscuring the main result regarding boundary crossing probabilities, we choose not to provide regret bounds here and to leave them as an exercise for the interested reader; controlling the remaining term appearing in the decomposition of Lemma 3 is indeed mostly technical and does not seem to require especially illuminating or fancy ideas. We refer to Cappé et al. (2013) for an example of bound in the case of exponential families of dimension $1$ .

High-level overview of the contribution

We are now ready to explain the main results of this paper. For the purpose of clarity, we provide them as an informal statement before proceeding with the technical material.

Our contribution concerns the behavior of the boundary crossing probability term for exponential families of dimension $K$ when choosing the threshold function $f(x)=\log(x)+\xi\log\log(x)$ . Our result reads as follows.

Theorem (Informal statement)

Assuming that the observations are generated from a distribution that belongs to an exponential family of dimension $K$ that satisfies some mild conditions, then for any non-negative $\varepsilon$ and some class-dependent but fully explicit constants $c,C$ (also depending on $\varepsilon$ ) it holds

[TABLE]

*where the first inequality holds for all $t$ and the second one for large enough $t\geqslant t_{c}$ where $t_{c}$ is class dependent but explicit and "reasonably" small. *

We provide the rigorous statement in Theorem 3 and Corollaries 1, 2 below. The main interest of this result is that it shows how to tune $\xi$ with respect to the dimension $K$ of the family. Indeed, in order to ensure that the probability term is summable in $t$ , the bound suggests that $\xi$ should be at least larger than $K/2-1$ . The case of exponential families of dimension $1$ ( $K=1$ ) is especially interesting, as it supports the fact that both KL-ucb and KL-ucb+ can be tuned using $\xi=0$ (and even negative $\xi$ for KL-ucb). This was observed in numerical experiments in Cappé et al. (2013) although not theoretically supported until now.

The remainder of the paper is organized as follows: Section 3 provides the required background and notations about exponential families, Section 4 provides the precise statements as well as previous results, Section 5 details the proof of Theorem 3, and finally Section 6 details the proof of Corollaries 1 and 2.

3 General exponential families, properties and examples

Before focusing on the boundary crossing probabilities, we require a few tools and definitions related to exponential families. The purpose of this section is thus to present them and prepare for the main result of this paper. In this section, for a set $\mathcal{X}\subset\mathbb{R}$ , we consider a multivariate function $F:\mathcal{X}\to\mathbb{R}^{K}$ and denote $\mathcal{Y}=F(\mathcal{X})\subset\mathbb{R}^{K}$ .

Definition 3 (Exponential families)

The exponential family generated by the function $F$ and the reference measure $\nu_{0}$ on the set $\mathcal{X}$ is

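$\mathcal{E}(F;\nu_{0})=\left\{\nu_{\theta}\in\mathfrak{M}_{1}(\mathcal{X})\,;\,\forall x\in\mathcal{X}\,\,\nu_{\theta}(x)=\exp\big{(}\,\langle\theta,F(x)\rangle-\psi(\theta)\,\big{)}\nu_{0}(x),\,\,\theta\in\mathbb{R}^{K}\right\}\,,$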

where $\displaystyle{\psi(\theta)\stackrel{{\scriptstyle\rm def}}{{=}}\log\int_{\mathcal{X}}\exp\Big{(}\langle\theta,F(x)\rangle\Big{)}\nu_{0}(dx)}$ is the normalization function (aka log-partition function) of the exponential family. The vector $\theta$ is called the vector of canonical parameters. The parameter set of the family is the domain $\Theta_{\mathcal{D}}\stackrel{{\scriptstyle\rm def}}{{=}}\Big{\{}\theta\in\mathbb{R}^{K}\,;\,\psi(\theta)<\infty\Big{\}}$ , and the invertible parameter set of the family is $\Theta_{I}\stackrel{{\scriptstyle\rm def}}{{=}}\Big{\{}\theta\in\mathbb{R}^{K}\,;\,0<\lambda_{\texttt{MIN}}(\nabla^{2}\psi(\theta))\leqslant\lambda_{\texttt{MAX}}(\nabla^{2}\psi(\theta))<\infty\Big{\}}\subset\Theta_{\mathcal{D}}$ , where $\lambda_{\texttt{MIN}}(M)$ and $\lambda_{\texttt{MAX}}(M)$ denote the minimum and maximum eigenvalues of a semi-definite positive matrix $M$ .

Remark 2

When $\mathcal{X}$ is compact, which is the usual assumption in multi-armed bandits ( $\mathcal{X}=[0,1]$ ) and $F$ is continuous, then we automatically get $\Theta_{\mathcal{D}}=\mathbb{R}^{K}$ .

In the sequel, we always assume that the family is regular, that is $\Theta_{\mathcal{D}}$ has non empty interior. Another key assumption is that the parameter $\theta^{\star}$ of the optimal arm belongs to the interior of $\Theta_{I}$ and is away from its boundary, which essentially avoids degenerate distributions, as we illustrate below.

**Examples** Bernoulli distributions form an exponential family with $K=1$ , $\mathcal{X}=\{0,1\}$ , $F(x)=x$ , $\psi(\theta)=\log(1+e^{\theta})$ . The Bernoulli distribution with mean $\mu$ has parameter $\theta=\log(\mu/(1-\mu))$ . Note that $\Theta_{\mathcal{D}}=\mathbb{R}$ and that degenerate distributions with mean $0$ or $1$ correspond to parameters $\pm\infty$ .

Gaussian distributions on $\mathcal{X}=\mathbb{R}$ form an exponential family with $K=2$ , $F(x)=(x,x^{2})$ , and for each $\theta=(\theta_{1},\theta_{2})$ , $\psi(\theta)=-\frac{\theta_{1}^{2}}{4\theta_{2}}+\frac{1}{2}\log\Big{(}-\frac{\pi}{\theta_{2}}\Big{)}$ . The Gaussian distribution $\mathcal{N}(\mu,\sigma^{2})$ has parameter $\theta=(\frac{\mu}{\sigma^{2}},-\frac{1}{2\sigma^{2}})$ . It is immediate to check that $\Theta_{\mathcal{D}}=\mathbb{R}\times\mathbb{R}_{\star}^{-}$ . Degenerate distributions with variance $0$ correspond to a parameter $\theta$ with both components infinite, while the variance tends to infinity as $\theta$ approaches the boundary $\mathbb{R}\times\{0\}$ . It is thus natural to consider only parameters that correspond to a not too large variance.

3.1 Bregman divergence induced by the exponential family

An interesting property of exponential families is the following straightforward identity:

[TABLE]

In particular, the vector $\mathbb{E}_{X\sim\nu_{\theta}}(F(X))$ is called the vector of dual (or expectation) parameters. It is equal to the vector $\nabla\psi(\theta)$ . Now, we write $\mathcal{K}(\nu_{\theta},\nu_{\theta^{\prime}})=\mathcal{B}^{\psi}(\theta,\theta^{\prime})$ , where we introduced the Bregman divergence with potential function $\psi$ defined by

[TABLE]

Thus, if $\Pi_{a}$ is chosen to be the projection on the exponential family $\mathcal{E}(F;\nu_{0})$ , and $\nu$ is a distribution with projection given by $\nu_{\theta}=\Pi_{a}(\nu)$ , then we can rewrite the definition of $\mathcal{K}_{a}$ in the simpler form

[TABLE]

We continue by providing a powerful rewriting of the Bregman divergence.

Lemma 5 (Bregman duality)

Let $\Phi(\eta)=\psi(\theta^{\star}+\eta)-\psi(\theta^{\star})$ , and its Fenchel-Legendre dual given by

$\displaystyle\Phi^{\star}(y)=\sup_{\eta\in\mathbb{R}^{K}}\langle\eta,y\rangle-\psi(\theta^{\star}+\eta)+\psi(\theta^{\star})\,.$

Then, for all $\theta^{\star}\in\Theta_{\mathcal{D}}$ and $\eta\in\mathbb{R}^{K}$ such that $\theta^{\star}+\eta\in\Theta_{\mathcal{D}}$ , it holds

$\displaystyle\log\mathbb{E}_{\theta^{\star}}\exp\bigg{(}\langle\eta,F(X)\rangle\bigg{)}=\Phi(\eta)\,.$

Further, for all $F\in\nabla\psi(\Theta_{\mathcal{D}})$ , that is $F=\nabla\psi(\theta)$ for some $\theta\in\Theta_{\mathcal{D}}$ , it holds $\quad\Phi^{\star}(F)=\mathcal{B}^{\psi}(\theta,\theta^{\star})$ .

Lemma 6 (Bregman and Smoothness)

We have on the one hand

$\displaystyle\mathcal{B}^{\psi}(\theta,\theta^{\prime})\leqslant\frac{|\theta-\theta^{\prime}|^{2}}{2}\sup\{\lambda_{\texttt{MAX}}(\nabla^{2}\psi(\tilde{\theta}))\,;\,\tilde{\theta}\in[\theta,\theta^{\prime}]\}\,,$

$\displaystyle|\nabla\psi(\theta)-\nabla\psi(\theta^{\prime})|\leqslant\sup\{\lambda_{\texttt{MAX}}(\nabla^{2}\psi(\tilde{\theta}))\,;\,\tilde{\theta}\in[\theta,\theta^{\prime}]\}|\theta-\theta^{\prime}|\,,$

and on the other hand

$\displaystyle\mathcal{B}^{\psi}(\theta,\theta^{\prime})\geqslant\frac{|\theta-\theta^{\prime}|^{2}}{2}\inf\{\lambda_{\texttt{MIN}}(\nabla^{2}\psi(\tilde{\theta}))\,;\,\tilde{\theta}\in[\theta,\theta^{\prime}]\}\,,$

$\displaystyle|\nabla\psi(\theta)-\nabla\psi(\theta^{\prime})|\geqslant\inf\{\lambda_{\texttt{MIN}}(\nabla^{2}\psi(\tilde{\theta}))\,;\,\tilde{\theta}\in[\theta,\theta^{\prime}]\}|\theta-\theta^{\prime}|\,,$

where $\lambda_{\texttt{MAX}}(\nabla^{2}\psi(\tilde{\theta}))$ and $\lambda_{\texttt{MIN}}(\nabla^{2}\psi(\tilde{\theta}))$ are the largest and smallest eigenvalue of $\nabla^{2}\psi(\tilde{\theta})$ .

**Proof of Lemma 5: ** The second equality holds by simple algebra. Now the first equality is immediate, since

[TABLE]

$\hfill\square$
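Both statements of Lemma 5 can be checked numerically in the Bernoulli case. The sketch below is only an illustration under stated assumptions: it takes $K=1$, $F(x)=x$ and $\psi(\theta)=\log(1+e^{\theta})$ from the examples above, approximates the supremum defining $\Phi^{\star}$ by a ternary search on a bounded interval (an implementation choice, not part of the paper), and compares the result with the Bernoulli Kullback-Leibler divergence, which is what $\mathcal{B}^{\psi}(\theta,\theta^{\star})=\mathcal{K}(\nu_{\theta},\nu_{\theta^{\star}})$ reduces to in this family. All function names are hypothetical.

```python
import math

def psi(theta):
    """Bernoulli log-partition function psi(theta) = log(1 + e^theta)."""
    return math.log1p(math.exp(theta))

def logit(mu):
    """Canonical parameter of the Bernoulli distribution with mean mu."""
    return math.log(mu / (1.0 - mu))

def phi(eta, theta_star):
    """Phi(eta) = psi(theta_star + eta) - psi(theta_star)."""
    return psi(theta_star + eta) - psi(theta_star)

def log_laplace(eta, mu_star):
    """log E[exp(eta * X)] for X ~ Bernoulli(mu_star), computed directly."""
    return math.log(1.0 - mu_star + mu_star * math.exp(eta))

def phi_star(y, theta_star, lo=-30.0, hi=30.0, iters=200):
    """Fenchel-Legendre dual of Phi at y, by ternary search on the
    concave map eta -> eta * y - Phi(eta) over [lo, hi] (assumed to
    contain the maximizer)."""
    def objective(eta):
        return eta * y - phi(eta, theta_star)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            lo = m1
        else:
            hi = m2
    return objective(0.5 * (lo + hi))

def kl_bernoulli(p, q):
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

if __name__ == "__main__":
    mu_star, mu = 0.6, 0.3
    theta_star = logit(mu_star)
    print(log_laplace(0.7, mu_star), phi(0.7, theta_star))
    print(phi_star(mu, theta_star), kl_bernoulli(mu, mu_star))
```

In this example the dual is evaluated at $F=\mu=\nabla\psi(\theta)$, so the two printed pairs should agree up to numerical precision.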

**Proof of Lemma 6: ** We have by definition that $\quad\mathcal{B}^{\psi}(\theta,\theta^{\prime})=\psi(\theta)-\psi(\theta^{\prime})-\langle\theta-\theta^{\prime},\nabla\psi(\theta^{\prime})\rangle\,.\quad$

Then, by a Taylor expansion, there exists $\tilde{\theta}^{\prime}\in[\theta,\theta^{\prime}]$ such that

[TABLE]

Likewise, there exists $\tilde{\theta}\in[\theta,\theta^{\prime}]$ such that $\quad\nabla\psi(\theta)=\nabla\psi(\theta^{\prime})+\nabla^{2}\psi(\tilde{\theta})(\theta-\theta^{\prime})\,.$ $\hfill\square$
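As a sanity check of Lemma 6, the following sketch evaluates both sandwich bounds for the Bernoulli family, where $\nabla^{2}\psi(\theta)=\sigma(\theta)(1-\sigma(\theta))$ with $\sigma$ the logistic function. It uses the expression of $\mathcal{B}^{\psi}$ written in the proof above. The closed form for the extreme values of $\nabla^{2}\psi$ on a segment relies on the fact that this function is even and decreasing in $|\theta|$, which is specific to this family; function names are hypothetical.

```python
import math

def sigmoid(theta):
    """Gradient of the Bernoulli log-partition function."""
    return 1.0 / (1.0 + math.exp(-theta))

def psi(theta):
    """Bernoulli log-partition function."""
    return math.log1p(math.exp(theta))

def hessian(theta):
    """Second derivative of psi: sigma(theta) * (1 - sigma(theta))."""
    s = sigmoid(theta)
    return s * (1.0 - s)

def bregman(theta, theta_p):
    """Bregman divergence, written as in the proof of Lemma 6."""
    return psi(theta) - psi(theta_p) - (theta - theta_p) * sigmoid(theta_p)

def hessian_range(theta, theta_p):
    """Infimum and supremum of the Hessian over the segment [theta, theta_p].
    The Hessian peaks at 0 and decreases with |theta|, so the infimum sits
    at an endpoint and the supremum at 0 when the segment contains 0."""
    lo, hi = min(theta, theta_p), max(theta, theta_p)
    h_min = min(hessian(lo), hessian(hi))
    h_max = 0.25 if lo <= 0.0 <= hi else max(hessian(lo), hessian(hi))
    return h_min, h_max

def lemma6_check(theta, theta_p):
    """Return True when both sandwich inequalities of Lemma 6 hold."""
    h_min, h_max = hessian_range(theta, theta_p)
    gap = abs(theta - theta_p)
    div = bregman(theta, theta_p)
    grad_gap = abs(sigmoid(theta) - sigmoid(theta_p))
    ok_div = 0.5 * gap**2 * h_min <= div <= 0.5 * gap**2 * h_max
    ok_grad = h_min * gap <= grad_gap <= h_max * gap
    return ok_div and ok_grad

if __name__ == "__main__":
    print(lemma6_check(-1.0, 2.0), lemma6_check(0.5, 3.0))
```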

3.2 Dual formulation of the optimization problem

Using Bregman divergences makes it possible to rewrite the $K$ -dimensional optimization problem (3) in a slightly more convenient form thanks to a dual formulation. Indeed, introducing a Lagrangian parameter $\lambda\in\mathbb{R}^{+}$ and using Karush-Kuhn-Tucker conditions, one gets the following necessary optimality conditions

[TABLE]

and by definition of the exponential family, we can make use of the fact that

[TABLE]

where we recall that $X\in\mathbb{R}$ and $F(X)\in\mathbb{R}^{K}$ . Combining these two equations, we obtain the system

[TABLE]

For a minimal exponential family, this system admits for each fixed $\theta,\mu$ a unique solution in $\theta^{\prime}$ , that we write for clarity $\theta(\lambda^{\star};\theta,\mu)$ to indicate its dependency on the optimal value $\lambda^{\star}$ of the dual parameter as well as on the constraints.

Remark 3

For $\theta\in\Theta_{I}$ , when the optimal value of $\lambda$ is $\lambda^{\star}=0$ , then it means that $\nabla\psi(\theta^{\prime})=\nabla\psi(\theta)$ and thus $\theta^{\prime}=\theta$ , which is only possible if $\mathbb{E}_{\nu_{\theta}}(X)\geqslant\mu$ . Thus whenever $\mu>\mathbb{E}_{\nu_{\theta}}(X)$ , the dual constraint is active, i.e. $\lambda>0$ , and we get the vector equation

[TABLE]

The example of discrete distributions In many cases, the previous optimization problem reduces to a simpler one-dimensional optimization problem, where we optimize over the dual parameter $\lambda$ . We illustrate this phenomenon on a family of discrete distributions. Let $\mathbb{X}=\{x_{1},\dots,x_{K},x_{\star}\}$ be a set of distinct real values. Without loss of generality, assume that $x_{\star}>\max_{k\leqslant K}x_{k}$ . The family of distributions $p$ with support in $\mathbb{X}$ is a specific $K$ -dimensional family. Indeed, let $F$ be the feature function with $k^{th}$ component $F_{k}(x)=\mathbb{I}\{x=x_{k}\}$ , for all $k\in\{1,\dots,K\}$ . Then the parameter $\theta=(\theta_{k})_{1\leqslant k\leqslant K}$ of the distribution $p=p_{\theta}$ has components $\theta_{k}=\log(\frac{p(x_{k})}{p(x_{\star})})$ for all $k\in\{1,\dots,K\}$ . Note that $p(x_{k})=\exp(\theta_{k}-\psi(\theta))$ for all $k\in\{1,\dots,K\}$ , and $p(x_{\star})=\exp(-\psi(\theta))$ . It then comes $\psi(\theta)=\log(\sum_{k=1}^{K}e^{\theta_{k}}+1)$ , $\nabla\psi(\theta)=(p(x_{1}),\dots,p(x_{K}))^{\top}$ and $\mathbb{E}(XF_{k}(X))=x_{k}p_{\theta}(x_{k})$ . Further, $\Theta_{\mathcal{D}}=(\mathbb{R}\cup\{-\infty\})^{K}$ and $\theta\in\Theta_{\mathcal{D}}$ corresponds to the condition $p_{\theta}(x_{\star})>0$ . Now, for a non-trivial value $\mu$ such that $\mathbb{E}_{p_{\theta}}(X)<\mu<x_{\star}$ , it can be readily checked that the system (4) specialized to this family is equivalent (unsurprisingly) to the one considered for instance in Honda and Takemura (2010) for discrete distributions. After some tedious but simple steps detailed in Honda and Takemura (2010), one obtains the following easy-to-solve one-dimensional optimization problem (see also Cappé et al. (2013)), although the family is of dimension $K$ :

[TABLE]
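The parametrization of the discrete family described above can be checked numerically. The following sketch is an illustration only: it maps the atom probabilities to the canonical parameter $\theta_{k}=\log(p(x_{k})/p(x_{\star}))$, maps back through $p(x_{k})=\exp(\theta_{k}-\psi(\theta))$, and compares a finite-difference gradient of $\psi$ with $(p(x_{1}),\dots,p(x_{K}))$. The function names and the finite-difference step are assumptions made for the example.

```python
import math

def log_partition(theta):
    """psi(theta) = log(1 + sum_k exp(theta_k)) for the discrete family."""
    return math.log(1.0 + sum(math.exp(t) for t in theta))

def natural_parameter(p_atoms, p_star):
    """theta_k = log(p(x_k) / p(x_star)) for k = 1..K."""
    return [math.log(p / p_star) for p in p_atoms]

def probabilities(theta):
    """Recover (p(x_1), ..., p(x_K)) and p(x_star) from theta."""
    psi = log_partition(theta)
    p_atoms = [math.exp(t - psi) for t in theta]
    p_star = math.exp(-psi)
    return p_atoms, p_star

def numerical_gradient(theta, step=1e-6):
    """Central finite-difference approximation of the gradient of psi."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[k] += step
        down[k] -= step
        grad.append((log_partition(up) - log_partition(down)) / (2.0 * step))
    return grad

if __name__ == "__main__":
    p_atoms, p_star = [0.2, 0.3, 0.1], 0.4
    theta = natural_parameter(p_atoms, p_star)
    print(probabilities(theta))
    print(numerical_gradient(theta))
```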

3.3 Empirical parameter and definition

In this section we discuss whether the empirical parameter corresponding to the projection of the empirical distribution on the exponential family is well-defined. While this is innocuous for most settings, in full generality, one needs to take some specific care to ensure that all the objects we deal with are well-defined and that all parameters $\theta$ we talk about indeed belong to the set $\Theta_{\mathcal{D}}$ (or better $\Theta_{I}$ ).

An important property is that if the family is regular, then $\nabla\psi(\Theta_{\mathcal{D}})$ is an open set that coincides with the interior of realizable values of $F(x)$ for $x\sim\nu$ for any $\nu$ absolutely continuous with respect to $\nu_{0}$ . In particular, by convexity of the set $\nabla\psi(\Theta_{\mathcal{D}})$ this means that the empirical average $\frac{1}{n}\sum_{i=1}^{n}F(X_{i})\in\mathbb{R}^{K}$ belongs to $\overline{\nabla\psi(\Theta_{\mathcal{D}})}$ for all $\{X_{i}\}_{i\leqslant n}\sim\nu_{\theta}$ with $\theta\in\Theta_{\mathcal{D}}$ . Thus, for the observed samples $X_{1},\dots,X_{n}\in\mathcal{X}$ coming from $\nu_{a^{\star}}$ , the projection $\Pi_{a^{\star}}(\widehat{\nu}_{a^{\star},n})$ on the family can be represented by a sequence $\{\widehat{\theta}_{n,m}\}_{m\in\mathbb{N}}\in\Theta_{\mathcal{D}}$ such that

[TABLE]

In the sequel, we want to ensure that provided that $\nu_{a^{\star}}=\nu_{\theta^{\star}}$ with $\theta^{\star}\in\mathring{\Theta}_{I}$ , then we also have $\widehat{F}_{n}\in\nabla\psi(\mathring{\Theta}_{I})$ , which means that there is a unique $\widehat{\theta}_{n}\in\mathring{\Theta}_{I}$ such that $\nabla\psi(\widehat{\theta}_{n})=\widehat{F}_{n}$ , or equivalently $\widehat{\theta}_{n}=\nabla\psi^{-1}(\widehat{F}_{n})$ . To this end, we assume that $\theta^{\star}$ is away from the boundary of $\Theta_{I}$ . In many cases, it is then sufficient to assume that $n$ is larger than a small constant (roughly $K$ ) to ensure that we can find a unique $\widehat{\theta}_{n}\in\mathring{\Theta}_{I}$ such that $\nabla\psi(\widehat{\theta}_{n})=\widehat{F}_{n}$ .

Example Let us consider Gaussian distributions on $\mathcal{X}=\mathbb{R}$ , with $K=2$ . We consider a parameter $\theta^{\star}=(\frac{\mu}{\sigma^{2}},-\frac{1}{2\sigma^{2}})$ corresponding to a Gaussian distribution with finite mean $\mu$ and positive variance $\sigma^{2}$ . Now, for any $n\geqslant 2$ , the empirical mean $\widehat{\mu}_{n}$ is finite and the empirical variance $\widehat{\sigma}^{2}_{n}$ is almost surely positive, and thus $\widehat{\theta}_{n}=\nabla\psi^{-1}(\widehat{F}_{n})$ is well-defined.

The case of Bernoulli distributions is interesting as it shows a slightly different situation. Let us consider a parameter $\theta^{\star}=\log(\mu/(1-\mu))$ corresponding to a Bernoulli distribution with mean $\mu$ . Before $\widehat{F}_{n}$ can be mapped to a point in $\mathring{\Theta}_{I}=\mathbb{R}$ , one needs to wait until the number of observations of both $0$ and $1$ is positive. Whenever $\mu\in(0,1)$ , the probability that this does not happen is controlled by $\mathbb{P}(n_{0}(n)=0\text{ or }n_{1}(n)=0)=\mu^{n}+(1-\mu)^{n}\leqslant 2\max(\mu,1-\mu)^{n},$ where $n_{x}(n)$ denotes the number of observations of symbol $x\in\{0,1\}$ after $n$ samples. For $\mu\geqslant 1/2$ , the latter quantity is less than $\delta_{0}\in(0,1)$ for $n\geqslant\frac{\log(2/\delta_{0})}{\log(1/\mu)}$ , which depends on the probability level $\delta_{0}$ and cannot be considered to be especially small when $\mu$ is close to $1$ . (This also suggests replacing $\widehat{F}_{n}$ with a Laplace or a Krichevsky-Trofimov estimate, which provides an initial bonus to each symbol and, as a result, maps any $\widehat{F}_{n}$ , for $n\geqslant 0$ , to a parameter $\widehat{\theta}_{n}\in\mathbb{R}$ .) That said, even when the parameter $\widehat{\theta}_{n}$ does not belong to $\mathbb{R}$ , the event $n_{0}(n)=0$ corresponds to having empirical mean equal to $1$ . This is a favorable situation since any optimistic algorithm should pull the corresponding arm. Thus, we only need to control $\mathbb{P}(n_{1}(n)=0)=(1-\mu)^{n}$ , which is less than $\delta_{0}\in(0,1)$ for $n\geqslant\frac{\log(1/\delta_{0})}{\log(1/(1-\mu))}$ , which is essentially a constant. As a matter of illustration, when $\delta_{0}=10^{-3}$ and $\mu=0.9$ , this condition is met for $n\geqslant 3$ .
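The sample-size conditions of the Bernoulli discussion are easy to evaluate. The sketch below is a minimal illustration with hypothetical function names: it computes the exact probability that one of the two symbols is still unobserved, and the two thresholds on $n$ quoted in the text. Using $\max(\mu,1-\mu)$ in the first threshold is an assumption that extends the case $\mu\geqslant 1/2$ by symmetry.

```python
import math

def prob_symbol_missing(mu, n):
    """P(n_0(n) = 0 or n_1(n) = 0) = mu^n + (1 - mu)^n."""
    return mu**n + (1.0 - mu)**n

def threshold_both_symbols(mu, delta0):
    """Real-valued threshold log(2/delta0) / log(1/m), m = max(mu, 1 - mu),
    above which 2 * m^n <= delta0."""
    m = max(mu, 1.0 - mu)
    return math.log(2.0 / delta0) / math.log(1.0 / m)

def threshold_symbol_one(mu, delta0):
    """Real-valued threshold log(1/delta0) / log(1/(1 - mu)),
    above which P(n_1(n) = 0) = (1 - mu)^n <= delta0."""
    return math.log(1.0 / delta0) / math.log(1.0 / (1.0 - mu))

if __name__ == "__main__":
    mu, delta0 = 0.9, 1e-3
    print(threshold_both_symbols(mu, delta0))  # roughly 72
    print(threshold_symbol_one(mu, delta0))    # roughly 3
```

With $\mu=0.9$ and $\delta_{0}=10^{-3}$, waiting for both symbols requires about 72 samples, whereas controlling only the event $n_{1}(n)=0$ requires about 3, which illustrates why the second condition is the one worth using.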

Following the previous discussion, in the sequel we consider that $n$ is always large enough so that $\widehat{\theta}_{n}=\nabla\psi^{-1}(\widehat{F}_{n})\in\mathring{\Theta}_{I}$ can be uniquely defined. We now discuss the separation between the parameter and the boundary more formally, and for that purpose introduce the following definition.

Definition 4 (Enlarged parameter set)

Let $\Theta\subset\Theta_{\mathcal{D}}$ and some constant $\rho>0$ . The enlargement of size $\rho$ of $\Theta$ in Euclidean norm (aka $\rho$ -neighborhood) is defined by

[TABLE]

For each $\rho$ such that $\Theta_{\rho}\subset\Theta_{I}$ , we further introduce the quantities

[TABLE]

Using the notion of enlarged parameter set, we highlight an especially useful property to prove concentration inequalities, summarized in the following result

Lemma 7 (Log-Laplace control)

Let $\Theta\subset\Theta_{\mathcal{D}}$ be a convex set and $\rho>0$ such that $\theta^{\star}\in\Theta_{\rho}\subset\Theta_{I}$ . Then, for all $\eta\in\mathbb{R}^{K}$ such that $\theta^{\star}+\eta\in\Theta_{\rho}$ , it holds

$\displaystyle\log\mathbb{E}_{\theta^{\star}}\exp(\eta^{\top}F(X))\leqslant\eta^{\top}\nabla\psi(\theta^{\star})+\frac{V_{\rho}}{2}\|\eta\|^{2}\,.$

**Proof of Lemma 7: ** Indeed, it holds by simple algebra

[TABLE]

where $H(\theta,\theta^{\prime})=\{\alpha\theta+(1-\alpha)\theta^{\prime},\alpha\in[0,1]\}$ . The equality holds by definition and basic rewriting. In the inequalities, we used that $\Theta_{\rho}$ is convex as an enlargement of a convex set, and thus that $H(\eta+\theta^{\star},\theta^{\star})\subset\Theta_{\rho}$ . $\hfill\square$
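For Bernoulli distributions, the inequality of Lemma 7 can be checked directly. The sketch below is a hedged illustration: it replaces $V_{\rho}$ by the global bound $1/4$ on $\nabla^{2}\psi$ (so the bound is tested for all $\eta$ on a grid, not only those with $\theta^{\star}+\eta\in\Theta_{\rho}$), in which case the statement coincides with the classical Hoeffding lemma for variables in $[0,1]$. Function names are hypothetical.

```python
import math

def log_laplace_bernoulli(eta, mu):
    """log E[exp(eta * X)] for X ~ Bernoulli(mu)."""
    return math.log(1.0 - mu + mu * math.exp(eta))

def quadratic_bound(eta, mu, v_upper=0.25):
    """Right-hand side of Lemma 7 with K = 1: eta * mu + (V / 2) * eta^2.
    v_upper = 1/4 is the global upper bound on the Bernoulli Hessian,
    used here in place of V_rho (an assumption of this sketch)."""
    return eta * mu + 0.5 * v_upper * eta**2

def largest_violation(mus, etas):
    """Largest value of (left-hand side minus right-hand side) on a grid;
    a non-positive value means the bound holds at every grid point."""
    return max(log_laplace_bernoulli(e, m) - quadratic_bound(e, m)
               for m in mus for e in etas)

if __name__ == "__main__":
    mus = [0.1, 0.3, 0.5, 0.7, 0.9]
    etas = [k / 10.0 for k in range(-50, 51)]
    print(largest_violation(mus, etas))
```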

In the sequel, we are interested in sets $\Theta$ such that $\Theta_{\rho}\subset\mathring{\Theta}_{I}$ for some specific $\rho$ . This comes essentially from the fact that we require some room around $\Theta$ and $\Theta_{I}$ to ensure all quantities remain finite and well-defined. Before proceeding, it is convenient to introduce the notation $d(\Theta^{\prime},\Theta)=\inf_{\theta\in\Theta,\theta^{\prime}\in\Theta^{\prime}}\|\theta-\theta^{\prime}\|$ , as well as the Euclidean ball $B(y,\delta)=\{y^{\prime}\in\mathbb{R}^{K}:||y^{\prime}-y||\leqslant\delta\}$ . Using these notations, the following lemma whose proof is immediate provides conditions for which all future technical considerations are satisfied.

Lemma 8 (Well-defined parameters)

Let $\theta^{\star}\in\mathring{\Theta}_{I}$ and $\rho^{\star}=d(\{\theta^{\star}\},\mathbb{R}^{K}\setminus\Theta_{I})>0$ . Now for any convex set $\Theta\subset\Theta_{I}$ such that $\theta^{\star}\in\Theta$ and $d(\Theta,\mathbb{R}^{K}\setminus\Theta_{I})=\rho^{\star}$ , and any $\rho<\rho^{\star}/2$ , it holds $\Theta_{2\rho}\subset\mathring{\Theta}_{I}$ .

Further, for any $\delta$ such that $\widehat{F}_{n}\!\in\!B(\nabla\psi(\theta^{\star}),\delta)\!\subset\!\nabla\psi(\Theta_{\rho})$ , $\exists\widehat{\theta}_{n}\!\in\!\Theta_{\rho}\!\subset\!\mathring{\Theta}_{I}$ such that $\nabla\psi(\widehat{\theta}_{n})\!=\!\widehat{F}_{n}$ .

In the sequel, we will restrict our analysis to the slightly more restrictive case when $\widehat{\theta}_{n}\in\Theta_{\rho}$ with $\Theta_{2\rho}\subset\mathring{\Theta}_{I}$ . This is mostly for convenience and to avoid dealing with rather specific situations.

Remark 4

Let us recall again that when $\mathcal{X}$ is compact and $F$ is continuous, then $\Theta_{I}=\Theta_{\mathcal{D}}=\mathbb{R}^{K}$ .

Illustration

We now illustrate the definition of $v_{\rho}$ and $V_{\rho}$ . For Bernoulli distributions with parameter $\mu\in[0,1]$ , $\nabla\psi(\theta)=1/(1+e^{-\theta})$ and $\nabla^{2}\psi(\theta)=e^{-\theta}/(1+e^{-\theta})^{2}=\mu(1-\mu)$ . Thus, $v_{\rho}$ is away from $0$ whenever $\Theta_{\rho}$ excludes the means $\mu$ close to $0$ or $1$ , and $V_{\rho}\leqslant 1/4$ .

Now for a family of Gaussian distributions with unknown mean and variance, $\psi(\theta)=-\frac{\theta_{1}^{2}}{4\theta_{2}}+\frac{1}{2}\log\big{(}\frac{-\pi}{\theta_{2}}\big{)}$ , where $\theta=(\frac{\mu}{\sigma^{2}},-\frac{1}{2\sigma^{2}})$ . Thus, $\nabla\psi(\theta)=(-\frac{\theta_{1}}{2\theta_{2}},\frac{\theta_{1}^{2}}{4\theta_{2}^{2}}-\frac{1}{2\theta_{2}})$ , and $\nabla^{2}\psi(\theta)=\begin{pmatrix}-\frac{1}{2\theta_{2}}&\frac{\theta_{1}}{2\theta_{2}^{2}}\\ \frac{\theta_{1}}{2\theta_{2}^{2}}&-\frac{\theta_{1}^{2}}{2\theta_{2}^{3}}+\frac{1}{2\theta_{2}^{2}}\end{pmatrix}=2\mu\sigma^{2}\begin{pmatrix}\frac{1}{2\mu}&1\\ 1&2\mu+\frac{\sigma^{2}}{\mu}\end{pmatrix}$ . The smallest eigenvalue is larger than $\sigma^{4}/(1/2+\sigma^{2}+2\mu^{2})$ and the largest is upper bounded by $\sigma^{2}(1+2\sigma^{2}+4\mu^{2})$ , which makes it possible to control $V_{\rho}$ and $v_{\rho}$ .
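The eigenvalue bounds for the Gaussian family can be verified on a concrete instance. The following sketch is illustrative only: it evaluates the Hessian of $\psi$ at the canonical parameter of $\mathcal{N}(\mu,\sigma^{2})$, computes the two eigenvalues of the resulting symmetric $2\times 2$ matrix in closed form, and compares them with the two bounds stated above. Function names are hypothetical.

```python
import math

def natural_parameters(mu, sigma2):
    """Canonical parameter (mu / sigma^2, -1 / (2 sigma^2)) of N(mu, sigma^2)."""
    return mu / sigma2, -1.0 / (2.0 * sigma2)

def hessian(theta1, theta2):
    """Entries (h11, h12, h22) of the Hessian of the Gaussian log-partition."""
    h11 = -1.0 / (2.0 * theta2)
    h12 = theta1 / (2.0 * theta2**2)
    h22 = -theta1**2 / (2.0 * theta2**3) + 1.0 / (2.0 * theta2**2)
    return h11, h12, h22

def eigenvalues_sym2(h11, h12, h22):
    """Eigenvalues (smallest, largest) of a symmetric 2x2 matrix."""
    center = 0.5 * (h11 + h22)
    radius = math.hypot(0.5 * (h11 - h22), h12)
    return center - radius, center + radius

def eigenvalue_bounds(mu, sigma2):
    """Lower bound on the smallest and upper bound on the largest eigenvalue,
    as stated in the text."""
    lower = sigma2**2 / (0.5 + sigma2 + 2.0 * mu**2)
    upper = sigma2 * (1.0 + 2.0 * sigma2 + 4.0 * mu**2)
    return lower, upper

if __name__ == "__main__":
    mu, sigma2 = 1.0, 2.0
    lam_min, lam_max = eigenvalues_sym2(*hessian(*natural_parameters(mu, sigma2)))
    print(lam_min, lam_max, eigenvalue_bounds(mu, sigma2))
```

The two bounds correspond to the ratio determinant over trace and to the trace itself, which is one way to see why they hold for any positive definite $2\times 2$ matrix.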

4 Boundary crossing for $K$ -dimensional exponential families

In this section, we study the boundary crossing probability term appearing in Lemma 3 for a $K$ -dimensional exponential family $\mathcal{E}(F;\nu_{0})$ . We first provide an overview of the existing results before detailing our main contribution. As explained in the introduction, the key technical tools that make it possible to obtain the novel results were already known three decades ago. Thus, even though the novel result is impressive due to its generality and tightness, it should be regarded as a modernized version of an existing, but almost forgotten, result, which makes it possible to solve a few long-standing open questions as a by-product.

4.1 Previous work on boundary-crossing probabilities

The existing results used in the bandit literature about boundary-crossing probabilities are restricted to a few specific cases. For instance in Cappé et al. (2013), the authors provide the following control

Theorem 1 (KL-ucb)

In the case of canonical (that is $F(x)=x$ ) exponential families of dimension $K=1$ , for a function $f$ such that $f(x)=\log(x)+\xi\log\log(x)$ , it holds for all $t>A$

$\displaystyle\mathbb{P}_{\theta^{\star}}\Bigl{\{}\bigcup_{n=1}^{t-A+1}n\,\,\mathcal{K}_{a^{\star}}\bigl{(}\Pi_{a^{\star}}(\widehat{\nu}_{a^{\star},n}),\,\mu^{\star}\bigr{)}>f\big{(}t\big{)}\cap\mu_{a^{\star}}>\widehat{\mu}_{a^{\star},n}\Bigr{\}}\leqslant e\lceil f(t)\log(t)\rceil e^{-f(t)}\,.$

Further, in the special case of distributions with finitely many $K$ atoms, it holds for all $t>A,\varepsilon>0$

$\displaystyle\mathbb{P}_{\theta^{\star}}\Bigl{\{}\bigcup_{n=1}^{t-A+1}n\mathcal{K}_{a^{\star}}\bigl{(}\Pi_{a^{\star}}(\widehat{\nu}_{a^{\star},n}),\,\mu^{\star}-\varepsilon\bigr{)}>f\big{(}t\big{)}\Bigr{\}}\leqslant e^{-f(t)}\Big{(}3e+2+4\varepsilon^{-2}+8e\varepsilon^{-4}\Big{)}\,.$

In contrast, in Lai (1988), the author provides an asymptotic control in the more general case of exponential families of dimension $K$ with some basic regularity condition, as we explained earlier. We now restate this beautiful result from Lai (1988) in a way that is suitable for a more direct comparison with other results. The following holds:

Theorem 2 (Lai, 88)

Let us consider an exponential family of dimension $K$ . Define for $\gamma>0$ the cone $\mathcal{C}_{\gamma}(\theta)=\{\theta^{\prime}\in\mathbb{R}^{K}:\langle\theta^{\prime},\theta\rangle\geqslant\gamma|\theta||\theta^{\prime}|\}$ . Then, for a function $f$ such that $f(x)=\alpha\log(x)+\xi\log\log(x)$ it holds for all $\theta^{\dagger}\in\Theta$ such that $|\theta^{\dagger}-\theta^{\star}|^{2}\geqslant\delta_{t}$ , where $\delta_{t}\to 0$ , $t\delta_{t}\to\infty$ as $t\to\infty$ ,

$\displaystyle\mathbb{P}_{\theta^{\star}}\Bigl{\{}\bigcup_{n=1}^{t}\widehat{\theta}_{n}\in\Theta_{\rho}\,\cap\,n\mathcal{B}^{\psi}(\widehat{\theta}_{n},\theta^{\dagger})\geqslant f\Big{(}\frac{t}{n}\Big{)}\,\cap\,\nabla\psi(\widehat{\theta}_{n})-\nabla\psi(\theta^{\dagger})\in\mathcal{C}_{\gamma}(\theta^{\dagger}-\theta^{\star})\Bigr{\}}$

$\displaystyle\stackrel{{\scriptstyle t\to\infty}}{{=}}O\bigg{(}t^{-\alpha}|\theta^{\dagger}-\theta^{\star}|^{-2\alpha}\log^{-\xi-\alpha+K/2}(t|\theta^{\dagger}-\theta^{\star}|^{2})\bigg{)}=O\bigg{(}e^{-f(t|\theta^{\dagger}-\theta^{\star}|^{2})}\log^{-\alpha+K/2}(t|\theta^{\dagger}-\theta^{\star}|^{2})\bigg{)}\,.$

Discussion The quantity $\mathcal{B}^{\psi}(\widehat{\theta}_{n},\theta^{\dagger})$ is the direct analog of $\mathcal{K}_{a^{\star}}\bigl{(}\Pi_{a^{\star}}(\widehat{\nu}_{a^{\star},n}),\mu^{\star}-\varepsilon\bigr{)}$ in Theorem 1. Note however that $f(t/n)$ replaces the larger quantity $f(t)$ , which means that Theorem 2 controls the probability of a larger event than Theorem 1, and is thus in this sense stronger. It also holds for general exponential families of dimension $K$ . Another important difference is the order of magnitude of the right-hand side terms of both theorems. Indeed, since $e\lceil f(t)\log(t)\rceil e^{-f(t)}=O(\frac{\log^{2-\xi}(t)+\xi\log(t)^{1-\xi}\log\log(t)}{t})$ , Theorem 1 requires $\xi>2$ for this term to be $o(1/t)$ , and $\xi>0$ for the second term of Theorem 1. In contrast, Theorem 2 shows that it is enough to consider $f(x)=\log(x)+\xi\log\log(x)$ with $\xi>K/2-1$ to ensure a $o(1/t)$ bound. For $K=1$ , this means we can even use $\xi>-1/2$ and in particular $\xi=0$ , which corresponds to the value that Cappé et al. (2013) recommend in their experiments.

Thus, Theorem 2 improves in three ways over Theorem 1: it is an extension to dimension $K$ , it provides a bound for $f(t/n)$ (and thus for KL-ucb+) and not only $f(t)$ , and finally it allows for smaller values of $\xi$ . These improvements are partly due to the fact that Theorem 2 controls a concentration with respect to $\theta^{\dagger}$ , not $\theta^{\star}$ , which takes advantage of the fact that there is some gap when going from $\mu^{\star}$ to distributions with mean $\mu^{\star}-\varepsilon$ . The proof of Theorem 2 directly takes advantage of this, contrary to that of the first part of Theorem 1.

On the other hand, Theorem 2 is only asymptotic whereas Theorem 1 holds for finite $t$ . Furthermore, we notice two restrictions on the control event. First, it requires $\widehat{\theta}_{n}\in\Theta_{\rho}$ , but we showed in the previous section that this is a minor restriction. Second, there is the restriction to a cone $\mathcal{C}_{\gamma}(\theta^{\dagger}-\theta^{\star})$ , which simplifies the analysis but is a more dramatic restriction. This restriction cannot be removed trivially, as can be seen from the complete statement of (Lai, 1988, Theorem 2): the right-hand side blows up to $\infty$ when $\gamma\to 0$ . As we will see, it is possible to overcome this restriction by resorting to a smart covering of the space with cones, and summing the resulting terms via a union bound over the covering. We explain the precise way of proceeding in the proof of Theorem 3 in Section 5.

Hint at proving the first part of Theorem 1 We believe it is interesting to give some hints about the proof of the first part of Theorem 1, as it involves an elegant step, despite relying quite heavily on two specific properties of the canonical exponential family of dimension $1$ . Indeed in the special case of the canonical one-dimensional family (that is $K=1$ and $F_{1}(x)=x\in\mathbb{R}$ ), $\widehat{F}_{n}=\frac{1}{n}\sum_{i=1}^{n}X_{i}$ coincides with the empirical mean and it can be shown that $\Phi^{\star}(F)$ is strictly decreasing on $(-\infty,\mu^{\star}]$ . Thus for any $F\leqslant\mu^{\star}$ , it holds

[TABLE]

Further, using the notations of Section 3.1, it also holds in that case $\mathcal{K}_{a^{\star}}\bigl{(}\Pi_{a^{\star}}(\widehat{\nu}_{a^{\star},n}),\,\mu^{\star}\bigr{)}=\mathcal{B}^{\psi}(\widehat{\theta}_{n},\theta^{\star})=\Phi^{\star}(\widehat{F}_{n})$ , where $\widehat{\theta}_{n}=\dot{\psi}^{-1}(\widehat{F}_{n})$ is uniquely defined. A second non-trivial property that is shown in Cappé et al. (2013) is that for all $F\leqslant\mu^{\star}$ , we can localize the supremum as

[TABLE]

Armed with these two properties, the proof reduces almost trivially to the following elegant lemma:

Lemma 9 (Dimension 1)

Consider a canonical one-dimensional family (that is $K=1$ and $F_{1}(x)=x\in\mathbb{R}$ ). Then, for all $f$ such that $f(t/n)/n$ is non-increasing in $n$ ,

$\displaystyle\mathbb{P}_{\theta^{\star}}\Big{\{}\bigcup_{m\leqslant n<M}\,\,\mathcal{B}^{\psi}(\widehat{\theta}_{n},\theta^{\star})\geqslant f(t/n)/n\Big{\}}\leqslant\exp\bigg{(}-\frac{m}{M}f(t/M)\bigg{)}\,.$

The proof of this lemma is provided in the appendix for the interested reader; it is directly adapted from the proof of Theorem 1. The first statement of Theorem 1 is obtained by a peeling argument, using $m/M=(f(t)-1)/f(t)$ . However, this argument does not seem to extend nicely to using $f(t/n)$ , which explains why there is no statement regarding this threshold.

4.2 Main results and contributions

In this section, we provide several results on boundary crossing probabilities, which we prove in detail in the next section. We first provide a non-asymptotic bound with explicit terms for the control of the boundary crossing probability term. We then provide two corollaries that can be used directly for the analysis of KL-ucb and KL-ucb+ and that better highlight the asymptotic scaling of the bound with $t$ , which helps to see the effect of the parameter $\xi$ on the bound.

Theorem 3 (Boundary crossing for exponential families)

Let $\varepsilon<\min_{a\in\mathcal{A}:\mu_{a}<\mu^{\star}}(\mu^{\star}-\mu_{a})$ , and define $\rho_{\varepsilon}=\inf\{||\theta^{\prime}-\theta||:\mu_{\theta^{\prime}}=\mu^{\star}-\varepsilon,\mu_{\theta}=\mu^{\star}\}$ . Let $\rho^{\star}=d(\{\theta^{\star}\},\mathbb{R}^{K}\setminus\Theta_{I})$ and $\Theta\subset\Theta_{\mathcal{D}}$ be a set such that $\theta^{\star}\in\Theta$ and $d(\Theta,\mathbb{R}^{K}\setminus\Theta_{I})=\rho^{\star}$ . Thus $\theta^{\star}\in\Theta\subset\Theta_{\rho}\subset\mathring{\Theta}_{I}$ for each $\rho<\rho^{\star}$ . Assume that $n\to f(t/n)/n$ is non-increasing and $n\to nf(t/n)$ is non-decreasing. Then, for every $b>1,p,q,\eta\in[0,1]$ , and $n_{i}=b^{i}$ if $i<I_{t}=\lceil\log_{b}(qt)\rceil$ , $n_{I_{t}}=t+1$ , it holds

$\displaystyle\mathbb{P}_{\theta^{\star}}\Big{\{}\bigcup_{1\leqslant n\leqslant t}\widehat{\theta}_{n}\in\Theta_{\rho}\cap\mathcal{K}_{a^{\star}}(\Pi_{a^{\star}}(\widehat{\nu}_{a^{\star},n}),\mu^{\star}-\varepsilon)\geqslant f(t/n)/n\Big{\}}$

$\displaystyle\leqslant C(K,b,\rho,p,\eta)\sum_{i=0}^{I_{t}-1}\exp\bigg{(}-n_{i}\rho_{\varepsilon}^{2}\alpha^{2}-\rho_{\varepsilon}\chi\sqrt{n_{i}f(t/n_{i})}-f\Big{(}\frac{t}{n_{i+1}\!-\!1}\Big{)}\bigg{)}f\Big{(}\frac{t}{n_{i+1}\!-\!1}\Big{)}^{K/2}\,,$

where we introduced the constants $\alpha=\eta\sqrt{v_{\rho}/2}$ , $\chi=p\eta\sqrt{2v_{\rho}^{2}/V_{\rho}}$ and

$\displaystyle C(K,b,\rho,p,\eta)=C_{p,\eta,K}\Big{(}2\frac{\omega_{p,K\!-\!2}}{\omega_{\max\{p,\frac{2}{\sqrt{5}}\},K\!-\!2}}\max\Big{\{}\frac{2bV_{\rho}^{4}}{p\rho^{2}v^{6}_{\rho}},\frac{V_{\rho}^{3}}{v_{\rho}^{4}},\frac{b^{2}V_{\rho}^{5}}{pv_{\rho}^{6}(\frac{1}{2}\!+\!\frac{1}{K})}\Big{\}}^{K/2}+1\Big{)}\,,$

where $C_{p,\eta,K}$ is the cone-covering number of $\nabla\psi\big{(}\Theta_{\rho}\setminus\mathcal{B}_{2}(\theta^{\star},\rho_{\varepsilon})\big{)}$ with minimal angular separation $p$ and not intersecting the set $\nabla\psi\big{(}\Theta_{\rho}\setminus\mathcal{B}_{2}(\theta^{\star},\eta\rho_{\varepsilon})\big{)}$ , and $\omega_{p,K}=\int_{p}^{1}\sqrt{1-z^{2}}^{K}dz$ if $K\geqslant 0$ and $1$ else.
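To make the index set of Theorem 3 concrete, the following sketch builds the geometric grid $n_{i}=b^{i}$ for $i<I_{t}=\lceil\log_{b}(qt)\rceil$ with the last point $n_{I_{t}}=t+1$, and checks that the slices $\{n:n_{i}\leqslant n<n_{i+1}\}$ partition $\{1,\dots,t\}$. Reading the sum over $i$ as a peeling over these slices is an interpretation of the statement, not a quotation of the proof, and the function names are hypothetical. The sketch assumes $qt>1$.

```python
import math

def peeling_grid(t, b=2.0, q=1.0):
    """Return (I_t, [n_0, ..., n_{I_t}]) with n_i = b^i for i < I_t and
    n_{I_t} = t + 1, where I_t = ceil(log_b(q * t)). Assumes q * t > 1."""
    i_t = math.ceil(math.log(q * t) / math.log(b))
    grid = [b**i for i in range(i_t)] + [t + 1]
    return i_t, grid

def slice_index(n, grid):
    """Index i of the unique slice such that grid[i] <= n < grid[i + 1]."""
    for i in range(len(grid) - 1):
        if grid[i] <= n < grid[i + 1]:
            return i
    raise ValueError("n is not covered by the grid")

if __name__ == "__main__":
    i_t, grid = peeling_grid(100, b=2.0, q=1.0)
    print(i_t, grid)
    print([slice_index(n, grid) for n in (1, 2, 3, 64, 100)])
```

A larger $b$ gives fewer terms in the sum but coarser slices, which is the trade-off reflected by the dependence of $C(K,b,\rho,p,\eta)$ on $b$.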

Remark 5

The same result holds by replacing all occurrences of $f(\cdot)$ by the constant $f(t)$ .

Remark 6

In dimension $1$ , the theorem takes a simpler form. Indeed $C_{p,\eta,1}=2$ for all $p,\eta\in(0,1)$ and thus, choosing $b=2$ for instance, $C(1,2,\rho,p,\eta)$ reduces to $2\Big{(}2\max\Big{\{}\frac{2V_{\rho}^{2}}{\rho v^{3}_{\rho}},\frac{V_{\rho}^{3/2}}{v_{\rho}^{2}},\frac{2V_{\rho}^{5/2}}{v_{\rho}^{3}}\Big{\}}+1\Big{)}$ . In the case of Bernoulli distributions, if $\Theta_{\rho}=\{\log(\mu/(1-\mu)),\mu\in[\mu_{\rho},1-\mu_{\rho}]\}$ , then $v_{\rho}=\mu_{\rho}(1-\mu_{\rho})$ , $V_{\rho}=1/4$ and $C(1,2,\rho,p,\eta)=2(\frac{1}{8\mu_{\rho}^{3}(1-\mu_{\rho})^{3}}+1)$ .

Remark 7

We believe it is possible to reduce the $\max$ term by a factor $V_{\rho}^{3}/v_{\rho}^{4}$ in the definition of $C(K,b,\rho,p,\eta)$ .

Let $f(x)=\log(x)+\xi\log\log(x)$ . We now state two corollaries of Theorem 3. The first one is stated for the case when the boundary is set to $f(t)/n$ and is thus directly relevant to the analysis of KL-ucb. The second corollary is about the more challenging boundary $f(t/n)/n$ that corresponds to the KL-ucb+ strategy. We note that $f$ is non-decreasing only for $x\geqslant e^{-\xi}$ . When $x=t$ , this requires that $t\geqslant e^{-\xi}$ . Now, when $x=t/N_{a^{\star}}(t)$ where $N_{a^{\star}}(t)=t-O(\ln(t))$ , imposing that $f$ is non-decreasing requires that $\xi\geqslant\ln(1-O(\ln(t)/t))$ for large $t$ , that is $\xi\geqslant 0$ . In the sequel we thus restrict to $t\geqslant e^{-\xi}$ when using the boundary $f(t)$ and to $\xi\geqslant 0$ when using the boundary $f(t/n)$ . Finally, we recall that the quantity $\chi=p\eta\sqrt{2v_{\rho}^{2}/V_{\rho}}$ is a function of $p,\eta$ and $\rho$ , and introduce the notation $\chi_{\varepsilon}=\rho_{\varepsilon}\chi$ for convenience.
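The monotonicity threshold $x\geqslant e^{-\xi}$ follows from $f^{\prime}(x)=\frac{\log(x)+\xi}{x\log(x)}$ for $x>1$. The short sketch below, with hypothetical function names, evaluates $f$ and this derivative so that the sign change can be observed, and compares the derivative with a finite difference. It is an illustration under the assumption $x>1$, where $\log\log(x)$ is defined.

```python
import math

def boundary(x, xi):
    """f(x) = log(x) + xi * log(log(x)), defined for x > 1."""
    return math.log(x) + xi * math.log(math.log(x))

def boundary_derivative(x, xi):
    """f'(x) = (log(x) + xi) / (x * log(x)) for x > 1."""
    return (math.log(x) + xi) / (x * math.log(x))

def finite_difference(x, xi, step=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (boundary(x + step, xi) - boundary(x - step, xi)) / (2.0 * step)

if __name__ == "__main__":
    xi = -1.0  # threshold e^{-xi} = e
    for x in (2.0, math.e, 3.0):
        print(x, boundary_derivative(x, xi))
```

For $\xi=-1$ the derivative is negative at $x=2$, vanishes at $x=e$ and is positive at $x=3$, while for $\xi\geqslant 0$ it is positive for every $x>1$.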

Corollary 1 (Boundary crossing for $f(t)$ )

Let $f(x)=\log(x)+\xi\log\log(x)$ . Using the same notations as in Theorem 3, for all $p,\eta\in[0,1],\rho<\rho^{\star}$ and all $t\geqslant e^{-\xi}$ such that $f(t)\geqslant 1$ it holds

$\displaystyle\mathbb{P}_{\theta^{\star}}\Big{\{}\bigcup_{1\leqslant n<t}\widehat{\theta}_{n}\in\Theta_{\rho}\cap\mathcal{K}_{a^{\star}}(\Pi_{a^{\star}}(\widehat{\nu}_{a^{\star},n}),\mu^{\star}-\varepsilon)\geqslant f(t)/n\Big{\}}$

$\displaystyle\leqslant\frac{C(K,4,\rho,p,\eta)(1+\chi_{\varepsilon})}{\chi_{\varepsilon}t}\bigg{(}1+\xi\frac{\log\log(t)}{\log(t)}\bigg{)}^{K/2}\log(t)^{-\xi+K/2}e^{-\chi_{\varepsilon}\sqrt{\log(t)+\xi\log\log(t)}}\,.$

Corollary 2 (Boundary crossing for $f(t/n)$ )

Let $f(x)=\log(x)+\xi\log\log(x)$ . For all $p,\eta\in[0,1],\rho<\rho^{\star}$ and $\xi\geqslant\max(K/2-1,0)$ , provided that $t\in[85\chi_{\varepsilon}^{-2},t_{\chi}]$ where $t_{\chi}=\chi_{\varepsilon}^{-2}\frac{\exp(\ln(4.5)^{2}/\chi_{\varepsilon}^{2})}{4\ln(4.5)^{2}}$ , it holds

$\displaystyle\mathbb{P}_{\theta^{\star}}\Big{\{}\bigcup_{1\leqslant n<t}\widehat{\theta}_{n}\in\Theta_{\rho}\cap\mathcal{K}_{a^{\star}}(\Pi_{a^{\star}}(\widehat{\nu}_{a^{\star},n}),\mu^{\star}-\varepsilon)\geqslant f(t/n)/n\Big{\}}\leqslant C(K,4,\rho,p,\eta)\bigg{[}e^{-\chi_{\varepsilon}\sqrt{t}c^{\prime}}+$

$\displaystyle\frac{(1+\xi)^{K/2}}{ct\log(tc)}\begin{cases}\frac{16}{3}\log(tc\log(tc)/4)^{K/2-\xi}+80\log(1.25)^{K/2-\xi}&\text{if }\xi\geqslant K/2\\ \frac{16}{3}\log(t/3)^{K/2-\xi}+80\log(t\frac{c\log(tc)}{4-c\log(tc)})^{K/2-\xi}&\text{if }\xi\in[K/2-1,K/2]\end{cases}\bigg{]},$

where $c=\chi_{\varepsilon}^{2}/(2\log(5))^{2}$ , and $c^{\prime}=\sqrt{f(5)/5}$ if $\xi\geqslant K/2$ and $\sqrt{f(4)/4}$ else. Further, for larger values of $t$ , $t\geqslant t_{\chi}$ , the second term in the brackets becomes

$\displaystyle\frac{(1+\xi)^{K/2}}{ct\log(tc)}\begin{cases}144\log(1.25)^{K/2-\xi}&\text{if }\xi\geqslant K/2\\ 144\log(t/3)^{K/2-\xi}&\text{if }\xi\in[K/2-1,K/2]\text{ (and }\xi\geqslant 0)\,.\end{cases}$

Remark 8

In Corollary 1, since the asymptotic regime of $\chi_{\varepsilon}\sqrt{\log(t)}-(K/2-\xi)\log\log(t)$ may take a massive amount of time to kick in when $\xi<K/2-2\chi_{\varepsilon}$ , we recommend taking $\xi>K/2-2\chi_{\varepsilon}$ . We also note that the value $\xi=K/2-1/2$ is interesting in practice, since then it holds $\log(t)^{K/2-\xi}=\sqrt{\log(t)}<5$ for all $t\leqslant 10^{9}$ .

Remark 9

The restriction to $t\geqslant 85\chi_{\varepsilon}^{-2}$ is merely for $\xi\simeq K/2-1$ . For instance for $\xi\geqslant K/2$ , the restriction becomes $t\geqslant 76\chi_{\varepsilon}^{-2}$ , and it becomes less restrictive for larger $\xi$ . The term $t_{\chi}$ is virtually infinite: For instance when $\chi_{\varepsilon}=0.3$ , this is already larger than $10^{12}$ , while $85\chi_{\varepsilon}^{-2}<945$ .

Remark 10

According to this result, the value $K/2-1$ (when it is non-negative) appears to be a critical value for $\xi$ , since the boundary crossing probabilities are not summable in $t$ for $\xi\leqslant K/2-1$ , but are summable for $\xi>K/2-1$ . Indeed, the terms behind the curved brackets are conveniently $o(\log(t))$ with respect to $t$ , except when $\xi=K/2-1$ . In practice however, since this asymptotic behavior may take a long time to kick in, we recommend choosing $\xi$ away from $K/2-1$ .

Remark 11

Achieving a bound for the threshold $f(t/N_{a}(t))$ is more challenging than for $f(t)$ . Only the latter case was analyzed in Cappé et al. (2013), as the former was out of reach of their analysis. Also, the result is valid for exponential families of dimension $K$ and not only dimension $1$ , which is a major improvement. It is interesting to note that when $K=1$ , $\max(K/2-1,0)=0$ , and to observe experimentally that a sharp phase transition indeed appears for KL-ucb+ precisely at the value $\xi=0$ : the algorithm suffers a linear regret when $\xi<0$ and a logarithmic regret when $\xi=0$ . For KL-ucb, no sharp phase transition appears at $\xi=0$ . Instead, a relatively smooth phase transition appears at a negative, problem-dependent value of $\xi$ . Both observations are consistent with the statements of the corollaries.

Discussion regarding the proof technique The proof technique that we consider below differs significantly from the proofs of Cappé et al. (2013) and Honda and Takemura (2010), and combines key ideas disseminated in two works of Tze Leung Lai, Lai (1988) and Lai (1987), with some non-trivial extensions that we describe below. We also simplify some of the original arguments and improve the readability of the initial proof technique, in order to shed more light on these neat ideas.

-Change of measure At a high level, the first main idea of this proof is to resort to a change of measure argument, which is the proof technique used to prove the lower bound on the regret. The work of Lai (1988) should be given full credit for this idea. This is in stark contrast with the proof techniques later developed for the finite-time analysis of stochastic bandits. The change of measure is actually used not once, but twice: first, to go from $\theta^{\star}$ , the parameter of the optimal arm, to some perturbation of it $\theta^{\star}_{c}$ ; then, which is perhaps more surprising, to go from this perturbed point to a mixture over a well-chosen ball centered on it. Although we have reasons to believe that this second change of measure may not be required (at least choosing a ball in dimension $K$ seems slightly sub-optimal), this two-step localization procedure is definitely the first main component that enables us to handle the boundary crossing probabilities. The other steps of the proof of the theorem include a concentration of measure argument and a peeling argument, which are more standard.

-Bregman divergence The second main idea is the use of the Bregman divergence and its relation to the quadratic norm, which is due to Lai (1987). This indeed enables explicit computations for exponential families of dimension $K$ without too much effort, at the price of losing some "variance" terms (linked to the Hessian of the family). We combine this idea with some key properties of the Bregman divergence that enable us to simplify a few steps, notably the concentration step, which we revisited entirely in order to obtain clean bounds valid in finite time and not only asymptotically.

-Concentration of measure and boundary effects One specific difficulty that appeared in the proof is handling the shape of the parameter set $\Theta$ , and the fact that $\theta^{\star}$ should be away from its boundary. The initial asymptotic proof of Lai did not account for this and was not entirely accurate. Going beyond this proved to be quite challenging due to the boundary effects, although the concentration result (Section 5.4, Lemma 15) that we obtain is eventually valid without restriction and the final proof looks deceptively easy. This concentration result is novel.

-Cone covering and dimension $K$ In Lai (1988), the author analyzed a boundary crossing problem first in the case of exponential families of dimension $1$ , and then sketched the analysis for exponential families of dimension $K$ and for the intersection with one cone. However, the complete result was nowhere stated explicitly. As a matter of fact, the initial proof from Lai (1988) restricts to a cone, which greatly simplifies the result. In order to obtain the full-blown results, valid in dimension $K$ for the unrestricted event, we introduced a cone covering of the space. This seemingly novel (although not very fancy) idea yields a final result that depends only on the cone-covering number of the space. It required some careful considerations and simplifications of the initial steps from Lai (1988). Along the way, we made explicit the sketch of proof provided in Lai (1988) for dimension $K$ .

-Corollaries and ratios The final key idea that should be credited to T.L. Lai concerns the fine tuning of the final bound resulting from the two changes of measure, the application of concentration and the peeling argument. Indeed, these steps lead to a bound by a sum of terms, say $\sum_{i=0}^{I}s_{i}$ , that should be studied and that depends on a few free parameters. This corresponds, with our rewriting and modifications, to the statement of Theorem 3.

The brilliant idea of T.L. Lai, which we separate from the proof of Theorem 3 and use in the proofs of Corollaries 1 and 2, is to bound the ratios $s_{i+1}/s_{i}$ for small values of $i$ and the ratios $s_{i}/s_{i+1}$ for large values of $i$ separately (instead of resorting, for instance, to a sum-integral comparison lemma). A careful study of these terms improves the scaling and allows for smaller values of $\xi$ , down to $K/2-1$ , while other approaches seem unable to go below $K/2+1$ . Nevertheless, in our quest to obtain explicit bounds valid not only asymptotically but also in finite time, this step is quite delicate, since a naive approach easily requires huge values of $t$ before the asymptotic regime kicks in. By refining the initial proof strategy of Lai (1988), we managed to obtain a result valid for all $t$ in the setting of Corollary 1 and for all "reasonably" large $t$ in the more challenging setting of Corollary 2 (we require $t$ to be at least about $10^{2}$ times some problem-dependent constant, against a factor that could be $e^{15}$ in the initial analysis).
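To illustrate the generic mechanism behind this ratio argument (and not the actual computation carried out for Corollaries 1 and 2), here is a minimal Python sketch. It assumes positive terms, a split index chosen by hand, and ratios strictly below $1$ on each side; the function name and the toy sequence are ours. The head of the sum is then dominated by a geometric series started at $s_{0}$ , and the tail by a geometric series started at the last term.

```python
def two_sided_ratio_bound(s, split):
    """Upper-bound sum(s) by two geometric series (illustrative sketch).

    Assumptions: all terms are positive and 0 <= split < len(s).
    Head s[0..split]: ratios s[i+1]/s[i] are at most r_head < 1.
    Tail s[split+1..]: ratios s[i]/s[i+1] are at most r_tail < 1.
    Returns float('inf') when a measured ratio is not below 1.
    """
    head, tail = s[: split + 1], s[split + 1:]
    r_head = max((head[i + 1] / head[i] for i in range(len(head) - 1)), default=0.0)
    r_tail = max((tail[i] / tail[i + 1] for i in range(len(tail) - 1)), default=0.0)
    if r_head >= 1.0 or r_tail >= 1.0:
        return float("inf")
    # s_i <= s_0 * r_head^i on the head, so the head sums to at most s_0 / (1 - r_head).
    bound = head[0] / (1.0 - r_head)
    if tail:
        # s_i <= s_I * r_tail^(I - i) on the tail, so it sums to at most s_I / (1 - r_tail).
        bound += tail[-1] / (1.0 - r_tail)
    return bound

# Toy sequence: halves for six terms, then doubles back up to 1.
s = [2.0 ** -i for i in range(6)] + [2.0 ** -(10 - i) for i in range(6, 11)]
print(sum(s), two_sided_ratio_bound(s, split=5))
```

The bound only involves the two extreme terms and the two worst ratios, which is what makes it convenient when the individual $s_{i}$ are explicit but their sum is not.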

5 Analysis of boundary crossing probabilities: proof of Theorem 3

In this section, we closely follow the proof technique used in Lai (1988) for the proof of Theorem 2, in order to prove the result of Theorem 3. We make the constants more precise, remove the cone restriction on the parameter and modify the original proof to be fully non-asymptotic which, using the technique of Lai (1988), forces us to make some parts of the proof a little more accurate.

Let us recall that we consider $\Theta$ and $\rho$ such that $\theta^{\star}\in\Theta_{\rho}\subset\mathring{\Theta}_{I}$ . The proof is divided in four main steps that we briefly present here for clarity:

In Section 5.1, we take care of the random number of pulls of the arm by a peeling argument. Simultaneously, we introduce a covering of the space with cones, which later enables us to use arguments from the proof of Theorem 2.

In Section 5.2, we proceed with the first change of measure argument: taking advantage of the gap between $\mu^{\star}$ and $\mu^{\star}-\varepsilon$ , we move from a concentration argument around $\theta^{\star}$ to one around a shifted point $\theta^{\star}-\Delta_{c}$ .

In Section 5.3, we localize the empirical parameter $\widehat{\theta}_{n}$ and make use of the second change of measure, this time to a mixture of measures, following Lai (1988). Even though we follow the same high level idea, we modified the original proof in order to better handle the cone covering, and also make all quantities explicit.

In Section 5.4, we apply a concentration of measure argument. This part requires specific care since this is the core of the finite-time result. An important complication comes from the "boundary" of the parameter set, which was not explicitly controlled in the original proof from Lai (1988). A very careful analysis enables us to obtain the finite-time concentration result without further restriction.

We finally combine all these steps in Section 5.5.

5.1 Peeling and covering

In this section, the intuition we follow is that we want to control the random number of pulls $N_{a^{\star}}(t)\in[1,t]$ and to this end use a standard peeling argument, considering maximum concentration inequalities on time intervals $[b^{i},b^{i+1}]$ for some $b>1$ . Likewise, since the term $\mathcal{K}_{a^{\star}}(\Pi_{a^{\star}}(\widehat{\nu}_{a^{\star},n}),\mu^{\star}-\varepsilon)$ can be seen as an infimum of some quantity over the set of parameters $\Theta$ , we use a covering of $\Theta$ in order to reduce the control of the desired quantity to that of each cell of the cover. Formally, we show that

Lemma 10 (Peeling and cone covering decomposition)

For all $\beta\in(0,1),b>1$ and $\eta\in[0,1)$ it holds

$\displaystyle\mathbb{P}_{\theta^{\star}}\Big{\{}\bigcup_{1\leqslant n\leqslant t}\widehat{\theta}_{n}\in\Theta_{\rho}\cap\mathcal{K}_{a^{\star}}(\Pi_{a^{\star}}(\widehat{\nu}_{a^{\star},n}),\mu^{\star}-\varepsilon)\geqslant f(t/n)/n\Big{\}}$

$\displaystyle\leqslant$ $\displaystyle\sum_{i=0}^{\lceil\log_{b}(\beta t+\beta)\rceil-2}\sum_{c=1}^{C_{p,\eta,K}}\mathbb{P}_{\theta^{\star}}\Big{\{}\bigcup_{b^{i}\leqslant n<b^{i+1}}E_{c,p}(n,t)\Big{\}}+\sum_{c=1}^{C_{p,\eta,K}}\mathbb{P}_{\theta^{\star}}\Big{\{}\bigcup_{n=b^{\lceil\log_{b}(\beta t+\beta)\rceil-1}}^{t}E_{c,p}(n,t)\Big{\}}\,,$

where the event $E_{c,p}(n,t)$ is defined by

$\displaystyle E_{c,p}(n,t)\stackrel{{\scriptstyle\rm def}}{{=}}\bigg{\{}\widehat{\theta}_{n}\in\Theta_{\rho}\cap\widehat{F}_{n}\in\mathcal{C}_{p}(\theta^{\star}_{c})\cap\mathcal{B}^{\psi}(\widehat{\theta}_{n},\theta^{\star}_{c})\geqslant\frac{f(t/n)}{n}\bigg{\}}\,.$

(7)

In this definition, $(\theta^{\star}_{c})_{c\leqslant C_{p,\eta,K}}$ , constrained to satisfy $\theta^{\star}_{c}\notin\mathcal{B}_{2}(\theta^{\star},\eta\rho_{\varepsilon})$ , parameterize a minimal covering of $\nabla\psi(\Theta_{\rho}\setminus\mathcal{B}_{2}(\theta^{\star},\rho_{\varepsilon}))$ with cones $\mathcal{C}_{p}(\theta^{\star}_{c}):=\mathcal{C}_{p}(\nabla\psi(\theta^{\star}_{c});\theta^{\star}-\theta^{\star}_{c})$ (That is $\nabla\psi(\Theta_{\rho}\setminus\mathcal{B}_{2}(\theta^{\star},\rho_{\varepsilon}))\subset\displaystyle{\bigcup_{c=1}^{C_{p,\eta,K}}\mathcal{C}_{p}(\theta^{\star}_{c})}$ ), where $\mathcal{C}_{p}(y;\Delta)=\bigg{\{}y^{\prime}\in\mathbb{R}^{K}:\langle y^{\prime}-y,\Delta\rangle\geqslant p\|y^{\prime}-y\|\|\Delta\|\bigg{\}}$ . For all $\eta<1$ , $C_{p,\eta,K}$ is of order $(1-p)^{-K}$ and $C_{p,\eta,1}=2$ , while $C_{p,\eta,K}\to\infty$ when $\eta\to 1$ .
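As an illustration of the cone $\mathcal{C}_{p}(y;\Delta)$ defined above, the following Python sketch tests membership of a point. The function name is ours and vectors are plain lists of floats; this is only a sketch of the definition, not part of the proof.

```python
import math

def in_cone(y_prime, y, delta, p):
    """Return True if y_prime belongs to C_p(y; delta).

    Membership means <y' - y, delta> >= p * ||y' - y|| * ||delta||,
    i.e. the angle between y' - y and delta has cosine at least p.
    """
    diff = [a - b for a, b in zip(y_prime, y)]
    inner = sum(d * e for d, e in zip(diff, delta))
    norm_diff = math.sqrt(sum(d * d for d in diff))
    norm_delta = math.sqrt(sum(e * e for e in delta))
    return inner >= p * norm_diff * norm_delta

# In dimension 1 there are only two directions, +1 and -1, and every point
# lies in the cone of one of them for any p in (0, 1].
print(in_cone([0.3], [0.0], [1.0], 0.9), in_cone([-0.3], [0.0], [-1.0], 0.9))
```

The one-dimensional example gives some intuition for why two cones suffice when $K=1$ , in line with $C_{p,\eta,1}=2$ .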

Peeling Let us introduce an increasing sequence $\{n_{i}\}_{i\in\mathbb{N}}$ such that $n_{0}=1<n_{1}<\dots<n_{I_{t}}=t+1$ for some $I_{t}\in\mathbb{N}_{\star}$ . Then by a simple union bound it holds for any event $E_{n}$

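$\displaystyle\mathbb{P}_{\theta^{\star}}\Big{\{}\bigcup_{1\leqslant n\leqslant t}E_{n}\Big{\}}\leqslant\sum_{i=0}^{I_{t}-1}\mathbb{P}_{\theta^{\star}}\Big{\{}\bigcup_{n_{i}\leqslant n<n_{i+1}}E_{n}\Big{\}}\,.$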

We apply this simple result to the following sequence, defined for some $b>1$ and $\beta\in(0,1)$ by

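$\displaystyle n_{i}=\begin{cases}b^{i}&\text{if }i<I_{t}\stackrel{{\scriptstyle\rm def}}{{=}}\lceil\log_{b}(\beta t+\beta)\rceil\\ t+1&\text{if }i=I_{t}\end{cases}$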

(this is indeed a valid sequence since $n_{I_{t}-1}\leqslant b^{\log_{b}(\beta t+\beta)}=\beta(t+1)<t+1=n_{I_{t}}$ ), and to the event

[TABLE]
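The peeling grid can be made concrete with a short Python sketch. We assume here, consistently with the statement of Lemma 10, that $n_{i}=b^{i}$ for $i<I_{t}=\lceil\log_{b}(\beta t+\beta)\rceil$ and $n_{I_{t}}=t+1$ ; the function name and the numerical values are ours.

```python
import math

def peeling_grid(t, b, beta):
    """Return [n_0, ..., n_{I_t}] with n_i = b**i for i < I_t and n_{I_t} = t + 1.

    Assumptions: b > 1, 0 < beta < 1 and beta * (t + 1) > 1, so that I_t >= 1.
    """
    I_t = math.ceil(math.log(beta * t + beta) / math.log(b))
    return [b ** i for i in range(I_t)] + [t + 1]

grid = peeling_grid(t=1000, b=2, beta=0.5)
# Each n in {1, ..., t} falls in exactly one slice [n_i, n_{i+1}).
slices = list(zip(grid[:-1], grid[1:]))
print(grid)
print(all(sum(lo <= n < hi for lo, hi in slices) == 1 for n in range(1, 1001)))
```

All slices except the last are geometric, while the last slice absorbs every $n$ between $b^{I_{t}-1}$ and $t$ .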

Covering We now make the Kullback-Leibler projection explicit, and remark that in case of a regular family, it holds that

[TABLE]

where $\widehat{\theta}_{n}\in\Theta_{\mathcal{D}}$ is any point such that $\widehat{F}_{n}=\nabla\psi(\widehat{\theta}_{n})$ . This rewriting makes explicit a shift from $\theta^{\star}$ to another point $\theta^{\star}-\Delta$ . For this reason, it is natural to study the link between $\mathcal{B}^{\psi}(\widehat{\theta}_{n},\theta^{\star})$ and $\mathcal{B}^{\psi}(\widehat{\theta}_{n},\theta^{\star}-\Delta)$ . Immediate computations show that for any $\Delta$ such that $\theta^{\star}-\Delta\in\Theta_{\mathcal{D}}$ it holds

[TABLE]

With this equality, the Kullback-Leibler projection can be rewritten as an infimum over the shift term only. In order to control the second part of the shift term we localize it thanks to a cone covering of $\nabla\psi(\Theta_{\mathcal{D}})$ . More precisely, on the event $E_{n}$ , we know that $\widehat{\theta}_{n}\notin\mathcal{B}_{2}(\theta^{\star},\rho_{\varepsilon})$ . Indeed, for all $\theta\in\mathcal{B}_{2}(\theta^{\star},\rho_{\varepsilon})\cap\Theta_{\mathcal{D}}$ , $\mu_{\theta}\geqslant\mu^{\star}-\varepsilon$ , and thus $\mathcal{K}_{a^{\star}}(\nu_{\theta},\mu^{\star}-\varepsilon)=0$ . It is thus natural to build a covering of $\nabla\psi(\Theta_{\rho}\setminus\mathcal{B}_{2}(\theta^{\star},\rho_{\varepsilon}))$ . Formally, for a given $p\in[0,1]$ and a base point $y\in\mathcal{Y}$ , let us introduce the cone

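$\displaystyle\mathcal{C}_{p}(y;\Delta)=\Big{\{}y^{\prime}\in\mathbb{R}^{K}:\langle y^{\prime}-y,\Delta\rangle\geqslant p\|y^{\prime}-y\|\|\Delta\|\Big{\}}\,.$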

We then associate to each $\theta\in\Theta_{\rho}$ a cone defined by $\mathcal{C}_{p}(\theta)=\mathcal{C}_{p}(\nabla\psi(\theta),\theta^{\star}-\theta)$ . Now for a given $p$ , let $(\theta^{\star}_{c})_{c=1,\dots,C_{p,\eta,K}}$ be a set of points corresponding to a minimal covering of the set $\nabla\psi(\Theta_{\rho}\setminus\mathcal{B}_{2}(\theta^{\star},\rho_{\varepsilon}))$ , in the sense that

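$\displaystyle\nabla\psi(\Theta_{\rho}\setminus\mathcal{B}_{2}(\theta^{\star},\rho_{\varepsilon}))\subset\bigcup_{c=1}^{C_{p,\eta,K}}\mathcal{C}_{p}(\theta^{\star}_{c})\,,$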

constrained to be outside the ball $\mathcal{B}_{2}(\theta^{\star},\eta\rho_{\varepsilon})$ , that is $\theta^{\star}_{c}\notin\mathcal{B}_{2}(\theta^{\star},\eta\rho_{\varepsilon})$ for each $c$ . It can be readily checked that by minimality of the size of the covering $C_{p,\eta,K}$ , it must be that $\theta^{\star}_{c}\in\Theta_{\rho}\cap\mathcal{B}_{2}(\theta^{\star},\rho_{\varepsilon})$ . More precisely, when $p<1$ , then $\Delta_{c}=\theta^{\star}-\theta^{\star}_{c}$ is such that $\rho_{\varepsilon}-\|\Delta_{c}\|$ is positive and away from $0$ . Also, we have by property of $\mathcal{B}_{2}(\theta^{\star},\rho_{\varepsilon})$ that $\mu_{\theta^{\star}_{c}}\geqslant\mu^{\star}-\varepsilon$ , and by the constraint that $\|\Delta_{c}\|>\eta\rho_{\varepsilon}$ .

The size of the covering $C_{p,\eta,K}$ depends on the angle separation $p$ , the ambient dimension $K$ , and the repulsive parameter $\eta$ . For instance it can be checked that $C_{p,\eta,1}=2$ for all $p\in(0,1]$ and $\eta<1$ . In higher dimension, $C_{p,\eta,K}$ typically scales as $(1-p)^{-K}$ and blows up when $p\to 1$ . It also blows up when $\eta\to 1$ . It is now natural to introduce the decomposition

[TABLE]

Using this notation, we deduce that for all $\beta\in(0,1),b>1$ (recall that $I_{t}=\lceil\log_{b}(\beta t+\beta)\rceil$ ),

[TABLE]

5.2 Change of measure

In this section, we focus on one event $E_{c,p}(n,t)$ . The idea is to take advantage of the gap between $\mu^{\star}$ and $\mu^{\star}-\varepsilon$ , which allows us to shift from $\theta^{\star}$ to one of the $\theta^{\star}_{c}$ from the cover. The key step is to control the change of measure from $\theta^{\star}$ to each $\theta^{\star}_{c}$ . Note that $\theta^{\star}_{c}\in(\Theta_{\rho}\cap\mathcal{B}_{2}(\theta^{\star},\rho_{\varepsilon}))\setminus\mathcal{B}_{2}(\theta^{\star},\eta\rho_{\varepsilon})$ and that $\mu_{\theta^{\star}_{c}}\geqslant\mu^{\star}-\varepsilon$ . We show that

Lemma 11 (Change of measure)

If $n\mapsto nf(t/n)$ is non-decreasing, then for any increasing sequence $\{n_{i}\}_{i\geqslant 0}$ of non-negative integers it holds

$\displaystyle\mathbb{P}_{\theta^{\star}}\Bigl{\{}\bigcup_{n=n_{i}}^{n_{i+1}-1}E_{c,p}(n,t)\Bigr{\}}\leqslant\exp\bigg{(}-n_{i}\alpha^{2}-\chi\sqrt{n_{i}f(t/n_{i})}\bigg{)}\mathbb{P}_{\theta^{\star}_{c}}\Bigl{\{}\bigcup_{n=n_{i}}^{n_{i+1}-1}E_{c,p}(n,t)\Bigr{\}}$

where $\alpha=\alpha(p,\eta,\varepsilon)=\eta\rho_{\varepsilon}\sqrt{v_{\rho}/2}$ and $\chi=p\eta\rho_{\varepsilon}\sqrt{2v_{\rho}^{2}/V_{\rho}}$ .
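To get a feeling for the multiplicative factor gained in Lemma 11, the following Python sketch evaluates $\alpha$ , $\chi$ and $\exp(-n_{i}\alpha^{2}-\chi\sqrt{n_{i}f(t/n_{i})})$ . The function names and the numerical values of $p,\eta,\rho_{\varepsilon},v_{\rho},V_{\rho}$ are illustrative assumptions, not values taken from the paper.

```python
import math

def lemma11_constants(p, eta, rho_eps, v_rho, V_rho):
    """Return (alpha, chi) as defined after Lemma 11."""
    alpha = eta * rho_eps * math.sqrt(v_rho / 2)
    chi = p * eta * rho_eps * math.sqrt(2 * v_rho ** 2 / V_rho)
    return alpha, chi

def change_of_measure_factor(n_i, t, alpha, chi, f):
    """Return exp(-n_i * alpha^2 - chi * sqrt(n_i * f(t / n_i))).

    Assumption: f(t / n_i) is non-negative, so the square root is well defined.
    """
    return math.exp(-n_i * alpha ** 2 - chi * math.sqrt(n_i * f(t / n_i)))

# Hypothetical problem constants, chosen only for illustration.
alpha, chi = lemma11_constants(p=0.9, eta=0.5, rho_eps=1.0, v_rho=1.0, V_rho=2.0)
print(alpha, chi, change_of_measure_factor(4, 1000, alpha, chi, math.log))
```

The factor is always below $1$ and decays with $n_{i}$ , which is what makes the shift from $\theta^{\star}$ to $\theta^{\star}_{c}$ profitable.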

**Proof of Lemma 11: ** For any measurable event $E$ , we have by absolute continuity that

[TABLE]

We thus bound the ratio which, in the case of $E=\{\bigcup_{n_{i}\leqslant n<n_{i+1}}E_{c,p}(n,t)\}$ , leads to

[TABLE]

where $\Delta_{c}=\theta^{\star}-\theta^{\star}_{c}$ . Note that this rewriting exhibits the same term as the shift term appearing in (8). Now, we remark that since $\theta^{\star}_{c}\in\Theta_{\rho}$ by construction, under the event $E_{c,p}(n,t)$ it holds, by convexity of $\Theta_{\rho}$ and an elementary Taylor approximation,

[TABLE]

where we used the fact that $\|\Delta_{c}\|\geqslant\eta\rho_{\varepsilon}$ . On the other hand, it also holds that

[TABLE]

To conclude the proof, we plug (11) and (12) into (10). Then, it remains to use that $n\geqslant b^{i}$ together with the fact that $n\mapsto nf(t/n)$ is non-decreasing. $\hfill\square$

5.3 Localized change of measure

In this section, we decompose further the event of interest in $\mathbb{P}_{\theta^{\star}_{c}}\Bigl{\{}\bigcup_{n_{i}\leqslant n<n_{i+1}}E_{c,p}(n,t)\Bigr{\}}$ in order to apply some concentration of measure argument. In particular, since by construction

[TABLE]

it is then natural to control $\|\nabla\psi(\theta^{\star}_{c})-\widehat{F}_{n}\|$ . This is what we call localization. More precisely, we introduce for any sequence $\{\varepsilon_{t,i,c}\}_{t,i}$ of positive values, the following decomposition

[TABLE]

We handle the first term in (13) by another change of measure argument that we detail below, and the second term thanks to a concentration of measure argument that we detail in section 5.4. We will show more precisely that

Lemma 12 (Change of measure)

For any sequence of positive values $\{\varepsilon_{t,i,c}\}_{i\geqslant 0}$ , it holds

$\displaystyle\mathbb{P}_{\theta^{\star}_{c}}\Big{\{}\bigcup_{n_{i}\leqslant n<n_{i+1}}E_{c,p}(n,t)\cap\|\nabla\psi(\widehat{\theta}_{n})-\nabla\psi(\theta^{\star}_{c})\|<\varepsilon_{t,i,c}\Big{\}}$

$\displaystyle\leqslant$ $\displaystyle\alpha_{\rho,p}\exp\Big{(}-f\Big{(}\frac{t}{n_{i+1}\!-\!1}\Big{)}\Big{)}\min\Big{\{}\rho^{2}v_{\rho}^{2},\tilde{\varepsilon}_{t,i,c}^{2},\frac{(K+2)v_{\rho}^{2}}{K(n_{i+1}-1)V_{\rho}}\Big{\}}^{-K/2}\tilde{\varepsilon}_{t,i,c}^{K}\,.$

where $\tilde{\varepsilon}_{t,i,c}=\min\{\varepsilon_{t,i,c},\text{Diam}\big{(}\nabla\psi(\Theta_{\rho})\cap\mathcal{C}_{p}(\theta^{\star}_{c})\big{)}\}$ and $\alpha_{\rho,p}=2\frac{\omega_{p,K-2}}{\omega_{p^{\prime},K-2}}\bigg{(}\frac{V_{\rho}}{v_{\rho}^{2}}\bigg{)}^{K/2}\Big{(}\frac{V_{\rho}}{v_{\rho}}\Big{)}^{K}$ where $p^{\prime}>\max\{p,\frac{2}{\sqrt{5}}\}$ , with $\omega_{p,K}=\int_{p}^{1}\sqrt{1-z^{2}}^{K}dz$ for $K\geqslant 0$ and $\omega_{p,-1}=1$ .

Let us recall that $E_{c,p}(n,t)=\{\widehat{\theta}_{n}\!\in\!\Theta_{\rho}\,\cap\widehat{F}_{n}\!\in\!\mathcal{C}_{p}(\theta^{\star}_{c})\cap n\mathcal{B}^{\psi}(\widehat{\theta}_{n},\theta^{\star}_{c})\geqslant f(t/n)\}$ .

The idea is to go from $\theta^{\star}_{c}$ to the measure that corresponds to the mixture of all the $\theta^{\prime}$ in the shrunken ball $B=\Theta_{\rho}\cap\nabla\psi^{-1}\big{(}\mathcal{C}_{p}(\theta^{\star}_{c})\cap\mathcal{B}_{2}(\nabla\psi(\theta^{\star}_{c}),\varepsilon_{t,i,c})\big{)}$ where $\mathcal{B}_{2}(y,r)\stackrel{{\scriptstyle\rm def}}{{=}}\Big{\{}y^{\prime}\in\mathbb{R}^{K}\,;\,\|y-y^{\prime}\|\leqslant r\Big{\}}$ . This makes sense since, on the one hand, under $E_{c,p}(n,t)$ , $\nabla\psi(\widehat{\theta}_{n})\in\mathcal{C}_{p}(\theta^{\star}_{c})$ , and on the other hand, $||\nabla\psi(\widehat{\theta}_{n})-\nabla\psi(\theta^{\star}_{c})||\leqslant\varepsilon_{t,i,c}$ . For convenience, let us introduce the event of interest

[TABLE]

We use the following change of measure

[TABLE]

where $Q_{B}(\Omega)\stackrel{{\scriptstyle\rm def}}{{=}}\int_{\theta^{\prime}\in B}\mathbb{P}_{\theta^{\prime}}\Bigl{\{}\Omega\Bigr{\}}d\theta^{\prime}$ is the mixture of all distributions with parameter in $B$ . The proof technique consists now in bounding the ratio by some quantity not depending on $\Omega$ .

[TABLE]

It is now convenient to remark that the term in the exponent can be rewritten in terms of Bregman divergence: by elementary substitution of the definition of the divergence and of $\nabla\psi(\widehat{\theta}_{n})=\widehat{F}_{a^{\star},n}$ , it holds

[TABLE]

Thus, the previous likelihood ratio simplifies as follows

[TABLE]

where we note that both $\theta^{\prime}$ and $\widehat{\theta}_{n}$ belong to $\Theta_{\rho}$ .

The next step is to consider a set $B^{\prime}\subset B$ that contains $\widehat{\theta}_{n}$ . For each such set, using the upper bound $\mathcal{B}^{\psi}(\widehat{\theta}_{n},\theta^{\prime})\leqslant\frac{V_{\rho}}{2v_{\rho}^{2}}\|\nabla\psi(\widehat{\theta}_{n})-\nabla\psi(\theta^{\prime})\|^{2}$ , we now obtain

[TABLE]

In this derivation, $(a)$ holds by positivity of $\exp$ and the inclusion $B^{\prime}\subset B$ , $(b)$ follows by a change of parameter argument and $(c)$ is obtained by controlling the determinant (in dimension $K$ ) of the Hessian, whose highest eigenvalue is $V_{\rho}$ .
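The quadratic upper bound on the Bregman divergence used above can be checked numerically on a simple family. The Python sketch below uses the Bernoulli family with log-partition function $\psi(\theta)=\log(1+e^{\theta})$ on the interval $[-2,2]$ , which plays the role of $\Theta_{\rho}$ . We assume that $v_{\rho}$ and $V_{\rho}$ are the smallest and largest values of $\psi^{\prime\prime}$ on this interval, and we use the standard convention $\mathcal{B}^{\psi}(\theta,\theta^{\prime})=\psi(\theta)-\psi(\theta^{\prime})-\langle\theta-\theta^{\prime},\nabla\psi(\theta^{\prime})\rangle$ ; since the check runs over all ordered pairs, it does not depend on the order of the arguments.

```python
import math

def psi(theta):
    """Log-partition function of the Bernoulli family (natural parameter theta)."""
    return math.log1p(math.exp(theta))

def grad_psi(theta):
    """Mean parameter, i.e. the sigmoid of theta."""
    return 1.0 / (1.0 + math.exp(-theta))

def hess_psi(theta):
    """Variance of the Bernoulli distribution with natural parameter theta."""
    m = grad_psi(theta)
    return m * (1.0 - m)

def bregman(theta, theta_prime):
    """psi(theta) - psi(theta') - (theta - theta') * grad_psi(theta')."""
    return psi(theta) - psi(theta_prime) - (theta - theta_prime) * grad_psi(theta_prime)

def quadratic_bound_holds(radius=2.0, steps=41):
    """Check B(a, b) <= V / (2 v^2) * (grad_psi(a) - grad_psi(b))^2 on a grid of [-radius, radius]."""
    grid = [-radius + 2.0 * radius * k / (steps - 1) for k in range(steps)]
    v = hess_psi(radius)   # the Hessian is symmetric and smallest at the endpoints
    V = hess_psi(0.0)      # and largest at 0
    const = V / (2.0 * v ** 2)
    return all(
        bregman(a, b) <= const * (grad_psi(a) - grad_psi(b)) ** 2 + 1e-12
        for a in grid for b in grid
    )

print(quadratic_bound_holds())
```

The constant $V_{\rho}/(2v_{\rho}^{2})$ comes from combining $\mathcal{B}^{\psi}(\theta,\theta^{\prime})\leqslant\frac{V_{\rho}}{2}\|\theta-\theta^{\prime}\|^{2}$ with $\|\nabla\psi(\theta)-\nabla\psi(\theta^{\prime})\|\geqslant v_{\rho}\|\theta-\theta^{\prime}\|$ .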

In order to identify a good candidate for the set $B^{\prime}$ let us now study the set $B$ . A first remark is that $\theta^{\star}_{c}$ plays a central role in $B$ : It is not difficult to show that, by construction of $B$ ,

[TABLE]

Indeed, if $\theta^{\prime}$ belongs to the set on the left hand side, then it must satisfy on the one hand $\nabla\psi(\theta^{\prime})\in\nabla\psi(\theta^{\star}_{c})+\mathcal{B}_{2}(0,v_{\rho}\rho)$ . This implies that $\theta^{\prime}\in\mathcal{B}_{2}(\theta^{\star}_{c},\rho)\subset\Theta_{\rho}$ (this last inclusion is by construction of $\Theta$ ). On the other hand, it satisfies $\nabla\psi(\theta^{\prime})\in\nabla\psi(\theta^{\star}_{c})+\mathcal{B}_{2}(0,\varepsilon_{t,i,c})\cap\mathcal{C}_{p}(0,\Delta_{c})$ . These two properties show that such a $\theta^{\prime}$ belongs to $B$ .

Thus, a natural candidate $B^{\prime}$ should satisfy $\nabla\psi(\!B^{\prime}\!)\subset\nabla\psi(\theta^{\star}_{c})+\mathcal{B}_{2}(0,\tilde{r})\!\cap\!\mathcal{C}_{p}(0;\!\Delta_{c})$ , with $\tilde{r}=\!\min\{\!v_{\rho}\rho,\varepsilon_{t,i,c}\!\}$ . It is then natural to look for $B^{\prime}$ in the form $\nabla\psi^{-1}(\nabla\psi(\theta^{\star}_{c})+\mathcal{B}_{2}(0,\tilde{r})\cap\mathcal{D})$ , where $\mathcal{D}\subset\mathcal{C}_{p}(0;\Delta_{c})$ is a sub-cone of $\mathcal{C}_{p}(0;\Delta_{c})$ with base point $0$ . In this case, the previous derivation simplifies into

[TABLE]

where $y_{n}=\nabla\psi(\widehat{\theta}_{n})-\nabla\psi(\theta^{\star}_{c})\in\mathcal{B}_{2}(0,\tilde{r})\cap\mathcal{D}$ and $C=\frac{nV_{\rho}}{2v_{\rho}^{2}}$ . Cases of special interest for the set $\mathcal{D}$ are such that the value of the function $g:y\mapsto\int_{y^{\prime}\in\mathcal{B}_{2}(0,\tilde{r})\cap\mathcal{D}}\exp\big{(}-C\|y-y^{\prime}\|^{2}\big{)}dy^{\prime}$ , for $y\in\mathcal{B}_{2}(0,\tilde{r})\cap\mathcal{D}$ is minimal at the base point $0$ . Indeed this enables us to derive the following bound

[TABLE]

where $(d)$ follows from another change of parameter argument, with $r_{\rho}=\sqrt{\frac{V_{\rho}}{2v_{\rho}^{2}}}\min\{v_{\rho}\rho,\varepsilon_{t,i,c}\}$ combined with isotropy of the Euclidean norm (the right hand side of $(d)$ no longer depends on the random direction $\Delta_{n}$ ), plus the fact that the sub-cone $\mathcal{D}$ is invariant by rescaling. We recognize here a Gaussian integral on $\mathcal{B}_{2}(0,r_{\rho})\cap\mathcal{D}$ that can be bounded explicitly (see below).

Following this reasoning, we are now ready to specify the set $\mathcal{D}$ . Let $\mathcal{D}=\mathcal{C}_{p^{\prime}}(0;\Delta_{n})\subset\mathcal{C}_{p}(0;\Delta_{c})$ be a sub-cone where $p^{\prime}\geqslant p$ (remember that the larger $p$ , the more acute the cone) and $\Delta_{n}$ is chosen such that $\nabla\psi(\widehat{\theta}_{n})\in\nabla\psi(\theta^{\star}_{c})+\mathcal{D}$ (there always exists such a cone). It thus remains to specify $p^{\prime}$ . A study of the function $g$ (defined above) on the domain $\mathcal{B}_{2}(0,\tilde{r})\cap\mathcal{C}_{p^{\prime}}(0;\Delta_{n})$ reveals that it is minimal at the point $0$ provided that $p^{\prime}$ is not too small, more precisely provided that $p^{\prime}>2/\sqrt{5}$ . The intuitive reasons are that the points that contribute most to the integral belong to the set $\mathcal{B}_{2}(y,r)\cap\mathcal{B}_{2}(0,\tilde{r})\cap\mathcal{D}$ for small values of $r$ , that this set has lowest volume (the map $y\to|\mathcal{B}_{2}(y,r)\cap\mathcal{B}_{2}(0,\tilde{r})\cap\mathcal{D}|$ is minimal) when $y\in\partial\mathcal{B}_{2}(0,\tilde{r})\cap\partial\mathcal{D}$ and that $y=0$ is a minimizer amongst these points provided that $p^{\prime}$ is not too small. More formally, the function $g$ rewrites

[TABLE]

from which we see that a minimal $y$ should be such that the spherical section $|\mathcal{S}_{2}(y,r)\cap\mathcal{B}_{2}(0,\tilde{r})\cap\mathcal{D}|$ is minimal for small values of $r$ (note also that $C=O(n)$ ). Then, since $B=\mathcal{B}_{2}(0,\tilde{r})\cap\mathcal{D}$ is a convex set, the sections $|\mathcal{S}_{2}(y,r)\cap\mathcal{B}_{2}(0,\tilde{r})\cap\mathcal{D}|$ are of minimal size for points $y\in B$ that are extremal, in the sense that $y$ satisfies $B\subset\mathcal{B}_{2}(y,\text{Diam}(B))$ . In order to choose $p^{\prime}$ and fully specify $\mathcal{D}$ , we finally use the following lemma:

Lemma 13

Let $\mathcal{C}_{p^{\prime}}=\{y^{\prime}:\langle y^{\prime},\Delta\rangle\geqslant p^{\prime}\|y^{\prime}\|\|\Delta\|\}$ be a cone with base point $0$ and define $B=\mathcal{B}_{2}(0,r)\cap\mathcal{C}_{p^{\prime}}$ . Provided that $p^{\prime}>2/\sqrt{5}$ , then the set of extremal points $\{y\in B:B\subset\mathcal{B}_{2}(y,\text{Diam}(B))\}$ reduces to $\{0\}$ .

**Proof of Lemma 13: ** First, note that the boundary of the convex set $B$ is supported by the union of the base point $0$ and the set $\partial\mathcal{B}_{2}(0,\tilde{r})\cap\partial\mathcal{D}$ . Since this set is a sphere in dimension $K-1$ with radius $\frac{\sqrt{1-{p^{\prime}}^{2}}}{p^{\prime}}\tilde{r}$ , all its points are at distance at most $2\frac{\sqrt{1-{p^{\prime}}^{2}}}{p^{\prime}}\tilde{r}$ from each other. Now they are also at distance exactly $\tilde{r}$ from the base point $0$ . Thus, when $2\frac{\sqrt{1-{p^{\prime}}^{2}}}{p^{\prime}}\tilde{r}<\tilde{r}$ , that is $p^{\prime}>2/\sqrt{5}$ , then $0$ is the unique point that satisfies $B\subset\mathcal{B}_{2}(y,\text{Diam}(B))$ . $\hfill\square$
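The threshold $2/\sqrt{5}$ in Lemma 13 comes from the elementary equivalence $2\sqrt{1-{p^{\prime}}^{2}}/p^{\prime}<1\Leftrightarrow 4(1-{p^{\prime}}^{2})<{p^{\prime}}^{2}\Leftrightarrow{p^{\prime}}^{2}>4/5$ . The following Python sketch (the function name is ours) checks the condition on both sides of the threshold.

```python
import math

def base_point_is_unique_extremal(p_prime):
    """Condition used in the proof of Lemma 13: 2 * sqrt(1 - p'^2) / p' < 1.

    Assumption: 0 < p_prime <= 1.
    """
    return 2.0 * math.sqrt(1.0 - p_prime ** 2) / p_prime < 1.0

threshold = 2.0 / math.sqrt(5.0)  # approximately 0.894
print(threshold, base_point_is_unique_extremal(0.90), base_point_is_unique_extremal(0.89))
```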

We now summarize the previous steps. So far, we have proved the following upper bound

[TABLE]

where $|B|$ denotes the volume of $B$ , $r_{\rho}=\sqrt{\frac{V_{\rho}}{2v_{\rho}^{2}}}\min\{v_{\rho}\rho,\varepsilon_{t,i,c}\}$ and $p^{\prime}>\max\{p,2/\sqrt{5}\}$ . We remark that by definition of $B$ , it holds

[TABLE]

Thus, it remains to analyze the volume and the Gaussian integral of $\mathcal{B}_{2}(0,\varepsilon_{t,i,c})\cap\mathcal{C}_{p}(0;{\bf 1})$ . To do so, we use the following result from elementary geometry, whose proof is given in Appendix A:

Lemma 14

For all $\varepsilon,\varepsilon^{\prime}>0$ , $p,p^{\prime}\in[0,1]$ and all $K\geqslant 1$ the following equality and inequality hold

[TABLE]

where $\omega_{p,K-2}=\int_{p}^{1}\sqrt{1-z^{2}}^{K-2}dz$ for $K\geqslant 2$ and using the convention that $\omega_{p,-1}=1$ .
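The quantity $\omega_{p,k}$ has no simple closed form for general $k$ , but it is straightforward to approximate. The Python sketch below uses a midpoint rule; the function name, the quadrature scheme and the number of steps are our own choices, and the convention $\omega_{p,-1}=1$ is taken from the text.

```python
def omega(p, k, steps=20000):
    """Midpoint-rule approximation of the integral over [p, 1] of (1 - z^2)^(k/2).

    Uses the convention omega(p, -1) = 1. Assumptions: 0 <= p <= 1 and k >= 0 otherwise.
    """
    if k == -1:
        return 1.0
    h = (1.0 - p) / steps
    # Midpoints are strictly below 1, so 1 - z^2 stays positive.
    return h * sum((1.0 - (p + (j + 0.5) * h) ** 2) ** (k / 2) for j in range(steps))

# For K = 2 the exponent is 0 and omega(p, 0) = 1 - p.
print(omega(0.5, 0), omega(0.0, 2))
```

For instance, the ratio $\omega_{p,K-2}/\omega_{p^{\prime},K-2}$ appearing in $\alpha_{\rho,p}$ can be evaluated this way for given $p<p^{\prime}$ and $K$ .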

Applying this Lemma, we thus get for $r_{\rho}=\sqrt{\frac{V_{\rho}}{2v_{\rho}^{2}}}\min\{v_{\rho}\rho,\varepsilon_{t,i,c}\}$ ,

[TABLE]

This concludes the proof of Lemma 12.

5.4 Concentration of measure

In this section, we focus on the second term in (13), that is we want to control $\mathbb{P}_{\theta^{\star}_{c}}\Bigl{\{}\bigcup_{n_{i}\leqslant n<n_{i+1}}E_{c,p}(n,t)\cap||\nabla\psi(\theta^{\star}_{c})-\widehat{F}_{n}||\geqslant\varepsilon_{t,i,c}\Bigr{\}}$ . In this term, $\varepsilon_{t,i,c}$ should be considered as decreasing fast to $0$ with $i$ , and slowly increasing with $t$ . Note that by definition $\nabla\psi(\widehat{\theta}_{n})=\widehat{F}_{a^{\star},n}=\frac{1}{n}\sum_{i=1}^{n}F(X_{a^{\star},i})\in\mathbb{R}^{K}$ is an empirical mean with mean given by $\nabla\psi(\theta^{\star}_{c})\in\mathbb{R}^{K}$ and covariance matrix $\frac{1}{n}\nabla^{2}\psi(\theta^{\star}_{c})$ . We thus resort to a concentration of measure argument.

Lemma 15 (Concentration of measure)

Let $\varepsilon^{\max}_{c}=\text{Diam}(\nabla\psi(\Theta_{\rho}\!\cap\!\mathcal{C}_{c,p}))$ where we introduced the projected cone $\mathcal{C}_{c,p}=\{\theta\!\in\!\Theta:\langle\frac{\Delta_{c}}{\|\Delta_{c}\|},\frac{\nabla\psi(\theta^{\star}_{c})-\nabla\psi(\theta)}{||\nabla\psi(\theta^{\star}_{c})-\nabla\psi(\theta)||}\rangle\geqslant p\}$ . Then, for all $\varepsilon_{t,i,c}$ , it holds

$\displaystyle\mathbb{P}_{\theta^{\star}_{c}}\Big{\{}\!\bigcup_{n=n_{i}}^{n_{i+1}-1}\!\!E_{c,p}(n,t)\cap||\nabla\psi(\widehat{\theta}_{n})\!-\!\nabla\psi(\theta^{\star}_{c})||\!\geqslant\!\varepsilon_{t,i,c}\Big{\}}\!\leqslant\exp\!\bigg{(}\!-\!\frac{n_{i}^{2}p\varepsilon_{t,i,c}^{2}}{2V_{\rho}(n_{i+1}\!-\!1)}\bigg{)}\mathbb{I}\{\varepsilon_{t,i,c}\!\leqslant\!\varepsilon^{\max}_{c}\}.$

**Proof of Lemma 15: ** Note that by definition if $\varepsilon_{t,i,c}>\varepsilon^{\max}_{c}$ , then

[TABLE]

We thus restrict to the case when $\varepsilon_{t,i,c}\leqslant\varepsilon^{\max}_{c}$ , or equivalently, replace $\varepsilon_{t,i,c}$ by $\tilde{\varepsilon}_{t,i,c}=\min\{\varepsilon_{t,i,c},\varepsilon^{\max}_{c}\}$ . Now, by definition of the event $E_{c,p}(n,t)$ , we have the rewriting

[TABLE]

Now, applying the function $x\mapsto\exp(\lambda x)$ to both sides of the inequality, for a deterministic $\lambda>0$ , we obtain

[TABLE]

Now we recognize that the sequence $\{W_{n}(\lambda)\}_{n\geqslant 0}$ , where $W_{n}(\lambda)=\exp\bigg{(}\sum_{i=1}^{n}\langle\frac{\lambda\Delta_{c}}{\|\Delta_{c}\|},\nabla\psi(\theta^{\star}_{c})-F(X_{a^{\star},i})\rangle-n\frac{\lambda^{2}V_{\rho}}{2}\bigg{)}$ , is a non-negative super-martingale provided that $\lambda$ is not too large. Indeed, provided that $\theta^{\star}_{c}-\frac{\lambda\Delta_{c}}{\|\Delta_{c}\|}\in\Theta_{\rho}$ it holds

[TABLE]

that is $\mathbb{E}\bigg{[}W_{n}(\lambda)\bigg{|}H_{n-1}\bigg{]}\leqslant W_{n-1}(\lambda)$ . Thus, we apply Doob’s maximal inequality for non-negative super-martingales and deduce that

[TABLE]

Optimizing over $\lambda$ gives $\lambda=\lambda^{\star}=\frac{n_{i}p\tilde{\varepsilon}_{t,i,c}}{(n_{i+1}\!-\!1)V_{\rho}}$ , and thus the condition becomes $\theta^{\star}_{c}-\frac{n_{i}p\tilde{\varepsilon}_{t,i,c}}{(n_{i+1}\!-\!1)V_{\rho}\|\Delta_{c}\|}\Delta_{c}\in\Theta_{\rho}$ . At this point, it is convenient to introduce the quantity

[TABLE]

Indeed, it suffices to show that $\lambda^{\star}\leqslant\lambda_{c}$ to ensure that the condition is satisfied. It is now not difficult to relate $\lambda_{c}$ to $\varepsilon^{\max}_{c}$ : Indeed, any $\theta_{\lambda}=\theta^{\star}_{c}-\lambda\frac{\Delta_{c}}{\|\Delta_{c}\|}$ that maximizes $||\nabla\psi(\theta^{\star}_{c})-\nabla\psi(\theta_{\lambda})||$ and belongs to $\Theta_{\rho}$ must satisfy

[TABLE]

on the one hand, and on the other hand, since $\theta^{\star}_{c},\theta_{\lambda}\in\Theta_{\rho}$ ,

[TABLE]

Combining these two inequalities, we deduce that $\lambda_{c}\geqslant p\varepsilon^{\max}_{c}/V_{\rho}$ . Thus, using that $n_{i}/(n_{i+1}\!-\!1)\leqslant 1$ and $\tilde{\varepsilon}_{t,i,c}\leqslant\varepsilon^{\max}_{c}$ , we deduce that $\lambda^{\star}=\frac{n_{i}p\tilde{\varepsilon}_{t,i,c}}{(n_{i+1}\!-\!1)V_{\rho}}\leqslant\frac{p\varepsilon^{\max}_{c}}{V_{\rho}}\leqslant\lambda_{c}$ is indeed satisfied. We then get without further restriction

[TABLE]

$\hfill\square$
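The value of $\lambda^{\star}$ used in the proof above comes from a standard Chernoff-type optimization. As a generic sketch, with placeholder symbols $a>0$ and $m\geqslant 1$ that are introduced here for illustration and are not notation from the paper, and assuming the quantity to optimize has the quadratic-exponent form below:

```latex
\min_{\lambda>0}\exp\Big(-\lambda a+\frac{m\lambda^{2}V_{\rho}}{2}\Big)
=\exp\Big(-\frac{a^{2}}{2mV_{\rho}}\Big),
\qquad\text{attained at}\quad
\lambda^{\star}=\frac{a}{mV_{\rho}},
\quad\text{since}\quad
\frac{d}{d\lambda}\Big(-\lambda a+\frac{m\lambda^{2}V_{\rho}}{2}\Big)=-a+m\lambda V_{\rho}.
```

Taking $a=n_{i}p\tilde{\varepsilon}_{t,i,c}$ and $m=n_{i+1}-1$ gives back $\lambda^{\star}=\frac{n_{i}p\tilde{\varepsilon}_{t,i,c}}{(n_{i+1}-1)V_{\rho}}$ , the value stated in the proof.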

5.5 Combining the different steps

In this part, we recap what we have shown so far. Combining the peeling, change of measure, localization and concentration of measure steps of the four previous sections, we have shown that for all $\{\varepsilon_{t,i,c}\}_{t,i}$ ,

[TABLE]

where we recall that $\alpha=\alpha(p,\eta,\varepsilon)=\eta\rho_{\varepsilon}\sqrt{v_{\rho}/2}$ and that the definition of $n_{i}$ is

[TABLE]

A simple rewriting leads to the form

[TABLE]

which suggests we use $\varepsilon_{t,i,c}=\sqrt{\frac{2V_{\rho}(\!n_{i+1}\!-\!1\!)f(t/(\!n_{i+1}\!-\!1\!))}{pn_{i}^{2}}}$ . Replacing this term in the above expression, we obtain

[TABLE]

At this point, using the somewhat crude lower bound $b^{i}\geqslant 1$ , it is convenient to introduce the constant

[TABLE]

which leads to the final bound

[TABLE]

6 Fine-tuned upper bounds

In this section, we study the behavior of the bound obtained in Theorem 3 as a function of $t$ , for a specific choice of function $f$ , namely $f(x)=\log(x)+\xi\log\log x$ , and prove Corollary 1 and Corollary 2 by fine-tuning the remaining free quantities. This tuning is not completely trivial: a naive tuning yields the condition $\xi>K/2+1$ to ensure that the final bound is $o(1/t)$ , while proceeding with more care shows that $\xi>K/2-1$ is enough. Let us recall that $f$ is non-decreasing only for $x\geqslant e^{-\xi}$ . We thus restrict to $t\geqslant e^{-\xi}$ in Corollary 1, which uses the threshold $f(t)$ , and to $\xi\geqslant 0$ in Corollary 2, which uses the threshold function $f(t/n)$ . In the sequel, we use the short-hand notation $C$ for $C(K,\rho,p,b,\eta)$ .
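The monotonicity condition on $f$ follows from its derivative, $f^{\prime}(x)=\frac{\log(x)+\xi}{x\log(x)}$ for $x>1$ , which is non-negative exactly when $x\geqslant e^{-\xi}$ . The minimal Python sketch below (function names are ours, chosen for illustration) evaluates $f$ , its derivative, and the identity $e^{-f(x)}=\frac{1}{x\log(x)^{\xi}}$ that drives the $1/t$ rates discussed in this section.

```python
import math


def f(x: float, xi: float) -> float:
    """Threshold f(x) = log(x) + xi * log(log(x)), defined for x > 1."""
    return math.log(x) + xi * math.log(math.log(x))


def f_prime(x: float, xi: float) -> float:
    """Derivative f'(x) = (log(x) + xi) / (x * log(x)) for x > 1."""
    return (math.log(x) + xi) / (x * math.log(x))


def exp_minus_f(x: float, xi: float) -> float:
    """exp(-f(x)), which equals 1 / (x * log(x) ** xi)."""
    return math.exp(-f(x, xi))
```

For instance, with the illustrative value $\xi=-1$ the sign of `f_prime` changes at $x=e$ , while for any $\xi\geqslant 0$ it is positive for all $x>1$ .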

6.1 Proof of Corollary 1

As a warm-up, we start with the boundary crossing probability involving $f(t)$ instead of $f(t/n)$ , since controlling the boundary crossing probability with term $f(t/n)$ is more challenging. Although we have focused so far on the boundary crossing probability with term $f(t/n)$ , the previous proof directly applies to the case when $f(t)$ is considered. In particular, the result of Theorem 3 also holds when all the terms $f(t/n),f(t/b^{i}),f(t/b^{i+1})$ are replaced with $f(t)$ .

With the choice $f(x)=\log(x)+\xi\log\log x$ , which is non-decreasing on the set of $x$ such that $\xi>-\ln(x)$ , Theorem 3 specializes, for all $b>1$ and $p,q,\eta\in(0,1)$ , to

[TABLE]

In order to study the sum $S=\sum_{i=0}^{\lceil\log_{b}(qt)\rceil-1}s_{i}$ we provide two strategies. First, a direct upper bound gives $S\leqslant\lceil\log_{b}(qt)\rceil\leqslant\log_{b}(qt)+1$ . Thus, setting $q=1$ and $b=2$ we obtain

[TABLE]

This term is thus $o(1/t)$ whenever $\xi>K/2+1$ and $O(1/t)$ when $\xi=K/2+1$ . We now show that a more careful analysis leads to a similar behavior even for smaller values of $\xi$ . Indeed, let us note that for all $i\geqslant 0$ , it holds by definition

[TABLE]

Since $f(t)\geqslant 1$ , if we set $b=\lceil(1+\frac{\ln(1+\chi)}{\chi})^{2}\rceil$ , which belongs to $(1,4]$ for all $\chi\geqslant 0$ , we obtain that $s_{i+1}/s_{i}\leqslant\frac{1}{1+\chi}$ . Thus, we deduce that

[TABLE]

Thus, $S$ is asymptotically $o(1)$ , and we deduce that $\mathbb{P}_{\theta^{\star}}\Big{\{}\bigcup_{1\leqslant n<t}\widehat{\theta}_{n}\in\Theta_{\rho}\cap\mathcal{K}_{a^{\star}}(\Pi_{a^{\star}}(\widehat{\nu}_{a^{\star},n}),\mu^{\star}-\varepsilon)\geqslant f(t)/n\Big{\}}=o(1/t)$ beyond the condition $\xi>K/2+1$ . It is interesting to note that due to the term $-\chi\sqrt{f(t)}$ in the exponent, and owing to the fact that $\alpha\sqrt{\log(t)}-\beta\log\log(t)\to\infty$ for all positive $\alpha$ and all $\beta$ , we actually have the stronger property that $S\log(t)^{-\xi+K/2}=o(1)$ for all $\xi$ (using $\alpha=\chi$ and $\beta=K/2-\xi$ ). However, since this asymptotic regime may take a massive amount of time to kick in when $\alpha/\beta<1/2$ , we do not advise taking $\xi$ smaller than $K/2-2\chi$ . All in all, we obtain, for $C=C(K,b,\rho,p,\eta)$ with $b=\lceil(1+\frac{\ln(1+\chi)}{\chi})^{2}\rceil\leqslant 4$ ,

[TABLE]
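The choice of $b$ and the geometric decay $s_{i+1}/s_{i}\leqslant\frac{1}{1+\chi}$ used in this proof can be checked numerically. The sketch below is illustrative only (function names are ours): it computes $b=\lceil(1+\frac{\ln(1+\chi)}{\chi})^{2}\rceil$ and the elementary geometric-series bound $\sum_{i<n}s_{i}\leqslant s_{0}\frac{1+\chi}{\chi}$ that holds for any non-negative sequence with this decay.

```python
import math


def peeling_base(chi: float) -> int:
    """b = ceil((1 + ln(1 + chi) / chi) ** 2), for chi > 0."""
    return math.ceil((1.0 + math.log1p(chi) / chi) ** 2)


def geometric_sum_bound(s0: float, chi: float, n_terms: int) -> float:
    """Upper bound on s_0 + ... + s_{n_terms-1} when s_{i+1} <= s_i / (1 + chi)."""
    r = 1.0 / (1.0 + chi)
    return s0 * (1.0 - r ** n_terms) / (1.0 - r)
```

Since $\ln(1+\chi)/\chi\in(0,1)$ for $\chi>0$ , the value returned by `peeling_base` is always 2, 3 or 4, in line with $b\in(1,4]$ .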

6.2 Proof of Corollary 2

Let us now focus on the proof of Corollary 2 involving the threshold $f(t/n)$ . We consider the choice $f(x)=\log(x)+\xi\log\log x$ , which is non-decreasing on the set of $x$ such that $\xi>-\ln(x)$ . When $x=t/n$ and $n$ is about $t-O(\ln(t))$ , ensuring this monotonicity property means that we require $\xi$ to dominate $\ln(1-O(\ln(t)/t))$ , that is $\xi\geqslant 0$ . Now, following the result of Theorem 3, we obtain for all $b>1$ and $p,q,\eta\in(0,1)$ ,

[TABLE]

We thus study the sum $S=\sum_{i=0}^{\lceil\log_{b}(qt)\rceil-2}s_{i}$ . To this end, let us first study the term $s_{i}$ . Since $i\mapsto\log(t/b^{i+1})$ is a decreasing function of $i$ , it holds for any index $i_{0}\in\mathbb{N}$ that

[TABLE]

Small values of $i$

We start by handling the terms corresponding to small values of $i\leqslant i_{0}$ , for some $i_{0}$ to be chosen. In that case, we note that $r_{i}=\frac{b^{i+1}}{t}$ satisfies $r_{i-1}/r_{i}=1/b<1$ and thus

[TABLE]

from which we deduce that

[TABLE]

Following Lai (1988), in order to ensure that this quantity is summable in $t$ , it is convenient to define $i_{0}$ as

[TABLE]

for $\eta>K/2-\xi$ and a positive constant $c$ . Indeed, in that case, when $i_{0}\geqslant 0$ we obtain the bounds (this is also valid when $i_{0}<0$ since the sum is equal to [math] in that case)

[TABLE]

We easily see that this is $o(1/t)$ both when $\xi>K/2$ and when $\xi\leqslant K/2$ , by construction of $\eta$ . Note that $\eta$ can further be chosen to be equal to [math] when $\xi>K/2$ . The value of $c$ is fixed by looking at what happens for larger values of $i\geqslant i_{0}$ . We note that the initial proof of Lai (1988) uses the value $\eta=1$ .

Large values of $i$

We now consider the terms of the sum $S$ corresponding to large values $i>i_{0}$ and thus focus on the term $s^{\prime}_{i}=\exp(-\chi\sqrt{b^{i}\log(t/b^{i})})b^{i+1}$ , and more precisely on the following ratio

[TABLE]

Noting that this ratio is a non-increasing function of $i$ , we upper bound it by replacing $i$ with either $i_{0}+1$ or [math]. Using that $b^{i_{0}+1}\leqslant t_{0}$ we thus obtain,

[TABLE]

Since we would like this ratio to be less than $1$ for all (large enough) $t$ , we readily see from this expression that this excludes the cases when $\eta>1$ : the term in the inner brackets converges to [math] in such cases, and thus the ratio is asymptotically upper bounded by $b>1$ . Thus we impose that $\eta\leqslant 1$ , that is $\xi\geqslant K/2-1$ .

For the critical value $\eta=1$ it is then natural to study the term $\sqrt{\frac{b\log(x\log(x)/b)}{\log(x)}}-\sqrt{\frac{\log(x\log(x))}{\log(x)}}$ . First, when $b=4$ , this quantity is larger than $1/2$ for $x\geqslant 8.2$ . Then, it can be checked that $4\exp(-\frac{1}{2}\sqrt{\chi^{2}/c})<1$ if $c<\chi^{2}/(2\ln(4))^{2}$ . These two conditions show that, provided that $t\geqslant 8.2(2\ln(4))^{2}\chi^{-2}\simeq 63\chi^{-2}$ , then $\frac{s^{\prime}_{i+1}}{s^{\prime}_{i}}<1$ . Now, in order to get a ratio $\frac{s^{\prime}_{i+1}}{s^{\prime}_{i}}$ that is bounded away from $1$ , we target the bound $\frac{s^{\prime}_{i+1}}{s^{\prime}_{i}}<b/(b+1)$ . This can be achieved by setting $c=\chi^{2}/(2\ln(5))^{2}$ and requiring that $t\geqslant 8.2(2\ln(5))^{2}\chi^{-2}\simeq 85\chi^{-2}$ . Eventually, we obtain for $b=4$ and $t\geqslant 85\chi^{-2}$ the bound

[TABLE]

Remark 12

Another notable value is $\eta=0$ . A study similar to the previous one shows that for $b=3.5$ , the term $\sqrt{b\log(x/b)}-\sqrt{\log(x)}$ is larger than $1/2$ for $x>12$ , which entails that $\frac{s^{\prime}_{i+1}}{s^{\prime}_{i}}<b/(b+1)$ provided that $t\geqslant 12(2\ln(3.5))^{2}\chi^{-2}\simeq 76\chi^{-2}$ .
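The numerical constants appearing in the discussion of $\eta=1$ and in Remark 12 can be sanity-checked directly. The sketch below is a plain numerical evaluation, not part of the proof; the function names are ours. It evaluates the two gap terms and the coefficient of $\chi^{-2}$ in the conditions of the form $t\geqslant x_{\min}(2\ln(\cdot))^{2}\chi^{-2}$ .

```python
import math


def gap_eta1(x: float, b: float) -> float:
    """sqrt(b*log(x*log(x)/b)/log(x)) - sqrt(log(x*log(x))/log(x)), case eta = 1."""
    lx = math.log(x)
    first = math.sqrt(b * math.log(x * lx / b) / lx)
    second = math.sqrt(math.log(x * lx) / lx)
    return first - second


def gap_eta0(x: float, b: float) -> float:
    """sqrt(b*log(x/b)) - sqrt(log(x)), case eta = 0."""
    return math.sqrt(b * math.log(x / b)) - math.sqrt(math.log(x))


def t_coefficient(x_min: float, base: float) -> float:
    """Coefficient of chi**-2 in the condition t >= x_min * (2*ln(base))**2 * chi**-2."""
    return x_min * (2.0 * math.log(base)) ** 2
```

Evaluating `gap_eta1(8.2, 4.0)` gives a value slightly above $1/2$ while `gap_eta1(8.0, 4.0)` falls slightly below, and `t_coefficient(8.2, 4.0)` and `t_coefficient(8.2, 5.0)` round to the constants 63 and 85 used above.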

Plugging in the definition of $t_{0}$ , and since $b^{i_{0}+1}\leqslant bt_{0}$ , we obtain if $i_{0}\geqslant 0$ , and for $b=4,c=\chi^{2}/(2\ln(5))^{2}$ ,

[TABLE]

It remains to handle the case when $i_{0}<0$ . Note that this case only happens for $t$ large enough so that $t>c^{-1}e^{\frac{1}{bc}}$ . The latter quantity may be huge since $1/bc=\log(5)^{2}\chi^{-2}$ is possibly large when $\chi$ is close to [math]. In that case, we directly control $\sum_{i=0}^{I_{t}-2}s_{i}$ . We control the ratio $s^{\prime}_{i+1}/s^{\prime}_{i}$ by $b/(b+1/2)$ provided that

[TABLE]

Thus, if we define $t_{\chi}$ to be the smallest such $t$ , then when $t>c^{-1}e^{\frac{1}{bc}}$ and provided that $t\geqslant t_{\chi}$ , the bound of (16) remains valid for the sum $S$ , up to replacing $b^{2}(b+1)$ with $2b^{2}(b+1/2)$ and $\log(t\frac{c\log(tc)}{b-c\log(tc)})$ with $\log(t/(b-1))$ . The latter constraint $t\geqslant t_{\chi}$ is satisfied as soon as $4\ln(5)^{2}\chi^{-2}e^{\chi^{-2}\ln(5)^{2}}\geqslant t_{\chi}$ , which is generally satisfied for $\chi$ not too large.

Final control on S

We can now control the term $S$ by combining the two bounds for large and small $i$ . We get for $c=\chi^{2}/(2\ln(4.5))^{2}$ and $b=4$ , and provided that $t\geqslant 85\chi^{-2}$ and $t\leqslant\chi^{-2}\frac{\exp\big{(}\chi^{-2}\ln(4.5)^{2}\big{)}}{4\ln(4.5)^{2}}$ , the following bound

[TABLE]

Further, for larger values of $t$ , namely $t\geqslant\chi^{-2}\frac{\exp\big{(}\chi^{-2}\ln(4.5)^{2}\big{)}}{4\ln(4.5)^{2}}$ , we get

[TABLE]

Concluding step

In this final step, we now gather equation (15) together with the previous bounds (17), (18) on $S$ . We obtain that for all $p,q,\eta\in(0,1)$

[TABLE]

where we recall the definition of the constants $\alpha=\eta\rho_{\varepsilon}\sqrt{v_{\rho}/2},\quad\chi=p\eta\rho_{\varepsilon}\sqrt{2v_{\rho}^{2}/V_{\rho}}.$

When $\xi\in[K/2-1,K/2]$ , one can then choose $q=1$ . When $\xi\geqslant K/2$ , there is a trade-off in $q$ , since the first term (the exponential) is decreasing with $q$ while the second term is increasing with $q$ . For instance choosing $q=\exp(-\kappa^{-1/\eta})$ , where $\eta=\xi-K/2$ and $\kappa>0$ leads to $\log(1/q)^{K/2-\xi}=\kappa$ . When $b=4$ , simply choosing $q=0.8$ gives the final bound after some cosmetic simplifications.
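The suggested choice of $q$ can be verified by a one-line computation: with $\eta=\xi-K/2>0$ , $\log(1/q)=\kappa^{-1/\eta}$ and hence $\log(1/q)^{K/2-\xi}=\kappa^{(-1/\eta)(-\eta)}=\kappa$ . The following sketch (function names are ours, and it assumes the strict inequality $\xi>K/2$ so that $\eta\neq 0$ ) checks this numerically.

```python
import math


def q_for_kappa(kappa: float, xi: float, K: int) -> float:
    """q = exp(-kappa ** (-1 / eta)) with eta = xi - K/2, assuming xi > K/2."""
    eta = xi - K / 2.0
    if eta <= 0:
        raise ValueError("this choice of q requires xi > K/2")
    return math.exp(-kappa ** (-1.0 / eta))


def log_term(q: float, xi: float, K: int) -> float:
    """The quantity log(1/q) ** (K/2 - xi)."""
    return math.log(1.0 / q) ** (K / 2.0 - xi)
```

For any $\kappa>0$ the resulting $q$ lies in $(0,1)$ , as required by the constraint on $q$ .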

Conclusion

In this work, which should be considered as a tribute to the contributions of T.L. Lai, we shed light on a beautiful and seemingly forgotten result from Lai (1988), which we modernized into a fully non-asymptotic statement, with explicit constants that can be directly used, for instance, for the regret analysis of multi-armed bandit strategies. Interestingly, the final results, whose roots are thirty years old, show that the existing analysis of KL-ucb, which was only stated for exponential families of dimension $1$ and discrete distributions, leads to sub-optimal constraints on the tuning of the threshold function $f$ , and can be extended to work with exponential families of arbitrary dimension $K$ and even for the thresholding term of the KL-ucb+ strategy, whose analysis was left open.

This proof technique is mostly based on a change-of-measure argument, like the lower bounds for the analysis of sequential decision making strategies and in stark contrast with other key results in the literature (Honda and Takemura (2010), Maillard et al. (2011), Cappé et al. (2013)). We believe and hope that the novel writing of this proof technique that we provided here will greatly benefit the community working on boundary crossing probabilities, sequential design of experiments as well as stochastic decision making strategies.

Appendix A Technical details

Lemma 9 (Dimension 1)

Consider a canonical one-dimensional family (that is $K=1$ and $F_{1}(x)=x\in\mathbb{R}$ ). Then, for all $f$ such that $f(t/n)/n$ is non-increasing in $n$ ,

[TABLE]

**Proof of Lemma 9:** The proof goes as follows. First, we observe that:

[TABLE]

At this point note that if for all $F=\nabla\psi(\theta)$ with mean $\mu_{\theta}\leqslant\mu^{\star}-\varepsilon$ , it holds that $\Phi^{\star}(F)<f(t/M)/M$ then the probability of interest is [math] and we are done. In the other case, there exists an $F_{M}$ such that $\Phi^{\star}(F_{M})=f(t/M)/M$ . We thus proceed with this case as follows

[TABLE]

where (a) holds by (5), (b) holds for all $\lambda<0$ , and (c) for all $\lambda<0$ such that $\lambda F_{M}-\Phi(\lambda)>0$ . Now, the process defined by $W_{\lambda,0}=1$ and $W_{\lambda,n}=\exp\bigg{(}\sum_{i=1}^{n}\Big{(}\lambda F(X_{i})-\Phi(\lambda)\Big{)}\bigg{)}$ is a non-negative super-martingale, since it holds

[TABLE]

Thus, we deduce that for all $\lambda<0$ such that $\lambda F_{M}-\Phi(\lambda)>0$

[TABLE]

Since by (6) this is satisfied by the optimal $\lambda$ for $\Phi^{\star}(F_{M})$ , we thus deduce that

[TABLE]

$\hfill\square$
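The maximal inequality for the non-negative super-martingale $W_{\lambda,n}$ used in this proof can be illustrated by simulation. The sketch below makes an illustrative assumption that is not in the paper: the $X_{i}$ are i.i.d. standard Gaussian with $F(x)=x$ , so that $\Phi(\lambda)=\lambda^{2}/2$ and $W_{\lambda,n}$ is in fact a martingale with $W_{\lambda,0}=1$ . The maximal inequality then states that the probability that $\max_{n}W_{\lambda,n}$ exceeds a level $c$ is at most $1/c$ .

```python
import math
import random


def max_crossing_frequency(lam: float, n_steps: int, level: float,
                           n_paths: int, seed: int = 0) -> float:
    """Fraction of simulated paths on which max_{n <= n_steps} W_n >= level.

    Assumption for illustration: X_i are i.i.d. standard Gaussian, F(x) = x,
    and Phi(lam) = lam**2 / 2 is their log-Laplace transform, so that
    W_n = exp(sum_i (lam * X_i - Phi(lam))) is a non-negative martingale.
    """
    rng = random.Random(seed)
    phi = lam * lam / 2.0
    log_level = math.log(level)
    hits = 0
    for _ in range(n_paths):
        log_w = 0.0  # log W_0 = 0
        for _ in range(n_steps):
            log_w += lam * rng.gauss(0.0, 1.0) - phi
            if log_w >= log_level:
                hits += 1
                break
    return hits / n_paths
```

Calling, for example, `max_crossing_frequency(-0.5, 100, 5.0, 2000)` returns an empirical frequency that can be compared with the bound $1/5$ ; being a Monte Carlo estimate, it fluctuates with the seed and the number of paths.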

Lemma 14

For all $\varepsilon,\varepsilon^{\prime}>0$ , $p,p^{\prime}\in[0,1]$ and all $K\geqslant 1$ the following equality holds

[TABLE]

where $\omega_{p,K-2}=\int_{p}^{1}\sqrt{1-z^{2}}^{K-2}dz$ for $K\geqslant 2$ and using the convention that $\omega_{p,-1}=1$ . Further,

[TABLE]

**Proof of Lemma 14:** First of all, let us remark that provided that $K\geqslant 2$ , then

[TABLE]

where $\mathcal{S}_{K-1}\subset\mathbb{R}^{K-1}$ is the $K-2$ dimensional unit sphere of $\mathbb{R}^{K-1}$ . Let us recall that when $K=2$ , we get $|\mathcal{S}_{K-1}|=2$ . For convenience, let us denote $\omega_{p,K-2}=\int_{p}^{1}\sqrt{1-z^{2}}^{K-2}dz$ . Then, for $K\geqslant 2$ ,

[TABLE]

For $K=1$ , $|\mathcal{B}_{2}(0,\varepsilon)\cap\mathcal{C}_{p}(0;{\bf 1})|=\varepsilon$ . Likewise, we obtain, following the same steps that

[TABLE]

We obtain the first part of the lemma by combining the two previous equalities. For the second part, we use the inequality $e^{-x}\geqslant 1-x$ , which gives

[TABLE]

Thus, whenever $\varepsilon^{2}<(K+2)/K$ , we obtain

[TABLE]

On the other hand, if $\varepsilon^{2}\geqslant(K+2)/K$ , then

[TABLE]

Thus, in all cases, the integral is larger than $\frac{\min\{\varepsilon,\sqrt{1+2/K}\}^{K}}{2K}$ , and we conclude by simple algebra. $\hfill\square$
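The quantity $\omega_{p,K-2}=\int_{p}^{1}\sqrt{1-z^{2}}^{K-2}dz$ appearing in Lemma 14 has no simple closed form for general $K$ , but it is easy to evaluate numerically. The sketch below uses a plain midpoint rule (the function name and the grid size are ours) and follows the convention $\omega_{p,-1}=1$ for $K=1$ stated in the lemma.

```python
import math


def omega(p: float, K: int, n_grid: int = 20_000) -> float:
    """omega_{p,K-2} = int_p^1 (1 - z^2)^((K-2)/2) dz via the midpoint rule.

    Uses the convention omega_{p,-1} = 1 when K = 1.
    """
    if K < 1:
        raise ValueError("K must be at least 1")
    if K == 1:
        return 1.0
    h = (1.0 - p) / n_grid
    total = 0.0
    for j in range(n_grid):
        z = p + (j + 0.5) * h  # midpoint of the j-th cell, always < 1
        total += (1.0 - z * z) ** ((K - 2) / 2.0)
    return total * h
```

Known special cases give a quick check: $\omega_{p,0}=1-p$ for $K=2$ , $\omega_{0,1}=\pi/4$ for $K=3$ , and $\omega_{0,2}=2/3$ for $K=4$ .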

Bibliography


1. Agrawal (1995). Rajeev Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4):1054–1078, 1995.
2. Audibert et al. (2009). J-Y. Audibert, R. Munos, and Cs. Szepesvári. Exploration-exploitation trade-off using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19), 2009.
3. Audibert and Bubeck (2010). J.Y. Audibert and S. Bubeck. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11:2635–2686, 2010.
4. Auer et al. (2002). P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.
5. Bubeck et al. (2012). Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
6. Burnetas and Katehakis (1997). A.N. Burnetas and M.N. Katehakis. Optimal adaptive policies for Markov decision processes. Mathematics of Operations Research, pages 222–255, 1997.
7. Cappé et al. (2013). Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, and Gilles Stoltz. Kullback–Leibler upper confidence bounds for optimal sequential allocation. Annals of Statistics, 41(3):1516–1541, 2013.
8. Chow and Teicher (1988). Y.S. Chow and H. Teicher. Probability theory. 2nd edition. Springer-Verlag, 1988.
