Asymptotic convergence of iterative optimization algorithms

Randal Douc; Sylvain Le Corff

arXiv:2302.12544·stat.ML·February 27, 2023

TL;DR

This paper presents a comprehensive framework for iterative optimization algorithms, proving their asymptotic geometric convergence and providing exact rates, applicable to various algorithms including EM and mirror descent.

Contribution

It introduces a unified framework for analyzing convergence rates of iterative algorithms, including constrained cases and variants like alpha-EM and Mirror Prox.

Findings

01

Convergence is asymptotically geometric under general assumptions.

02

Exact asymptotic convergence rates are established.

03

Conditions for systematic convergence of Mirror Prox are provided.

Abstract

This paper introduces a general framework for iterative optimization algorithms and establishes under general assumptions that their convergence is asymptotically geometric. We also prove that under appropriate assumptions, the rate of convergence can be lower bounded. The convergence is then only geometric, and we provide the exact asymptotic convergence rate. This framework handles constrained optimization and encompasses the Expectation Maximization algorithm and the mirror descent algorithm, as well as some variants such as the alpha-Expectation Maximization or the Mirror Prox algorithm. Furthermore, we establish sufficient conditions for the convergence of the Mirror Prox algorithm, under which the method converges systematically to the unique minimizer of a convex function on a convex compact set.

Equations

$$\mathcal{Q}\colon(\theta,\theta^{\prime})\mapsto\mathcal{Q}_{\theta}(\theta^{\prime}).$$

$$\mathcal{M}(\theta):=\mathrm{argmin}_{\theta^{\prime}\in\Theta}\,\mathcal{Q}_{\theta}(\theta^{\prime}).$$

$$\theta_{n+1}\in\mathcal{M}(\theta_{n}).$$

$$\theta_{n+1}\in\mathrm{argmin}_{\theta\in\Theta}\,\mathcal{Q}_{\theta_{n}}(\theta),$$

$$\mathcal{Q}_{\theta}(\theta^{\prime}):=-\mathbb{E}_{\theta}\left[\log p_{\theta^{\prime}}(X,Y)\mid Y\right],$$

$$\mathcal{Q}_{\theta}^{\mathrm{samp}}(\theta^{\prime}):=-\frac{1}{k}\sum_{i=1}^{k}\int_{\mathsf{X}}p_{\theta}(x\mid Y_{i})\log p_{\theta^{\prime}}(x,Y_{i})\,\mu(\mathrm{d}x).$$

$$\mathcal{Q}_{\theta}^{\mathrm{pop}}(\theta^{\prime}):=-\int_{\mathsf{Y}}\left(\int_{\mathsf{X}}p_{\theta}(x\mid y)\log p_{\theta^{\prime}}(x,y)\,\mu(\mathrm{d}x)\right)p_{\theta_{\star}}(y)\,\mu(\mathrm{d}y),$$

$$\left\{\begin{array}{l}\partial\Phi(\zeta_{n+1})=\partial\Phi(\theta_{n})-\eta g_{n},\quad\text{where}\ g_{n}\in\partial f(\theta_{n}),\\ \theta_{n+1}\in\mathrm{argmin}_{\theta\in\mathsf{C}\cap\mathsf{D}}\,D_{\Phi}(\theta,\zeta_{n+1}),\end{array}\right.$$

$$\forall x,y\in\mathsf{D},\quad D_{\Phi}(x,y)=\Phi(x)-\Phi(y)-\partial\Phi(y)^{\top}(x-y).$$

$$\theta_{n+1}\in\mathrm{argmin}_{\theta\in\mathsf{C}\cap\mathsf{D}}\;\eta g_{n}^{\top}\theta+D_{\Phi}(\theta,\theta_{n}),$$

$$\mathcal{Q}_{\theta}(\theta^{\prime}):=\eta g^{\top}\theta^{\prime}+D_{\Phi}(\theta^{\prime},\theta),\quad\text{where}\ g\in\partial f(\theta).$$

$$\partial\Phi(\,\ldots$$

$$\zeta_{n+1}\in\mathrm{argmin}_{\theta\in\mathsf{C}\cap\mathsf{D}}\,D_{\Phi}(\theta,\zeta_{n+1}^{\prime}),$$

$$\partial\Phi(\,\ldots$$

$$\theta_{n+1}\in\mathrm{argmin}_{\theta\in\mathsf{C}\cap\mathsf{D}}\,D_{\Phi}(\theta,\theta_{n+1}^{\prime}).$$

$$\zeta_{n+1}\in\mathcal{M}(\theta_{n})=\ldots$$

$$\theta_{n+1}\in\ldots$$

$$\mathcal{Q}_{\theta}^{m}(\theta^{\prime}):=\eta\,\partial f(\mathcal{M}(\theta))^{\top}\theta^{\prime}+D_{\Phi}(\theta^{\prime},\theta).$$

$$\mathcal{A}_{\star}:=\partial_{22}\mathcal{Q}_{\theta_{\star}}(\theta_{\star}),\quad\mathcal{B}_{\star}:=-\partial_{12}\mathcal{Q}_{\theta_{\star}}(\theta_{\star}).$$

$$\hat{\uprho}_{\star}:=\sup_{v\in\mathsf{V}\setminus\{0\}}\frac{|v^{\top}\mathcal{B}_{\star}v|}{v^{\top}\mathcal{A}_{\star}v}.$$

$$\theta_{n}-\theta_{\star}=o(\uprho^{n}),$$

$$\limsup_{n\to\infty}\frac{1}{n}\log\|\theta_{n}-\theta_{\star}\|_{2}\leqslant\log\hat{\uprho}_{\star}.$$

$$\mathcal{Q}_{\theta_{n}}(\theta_{n+1})-\mathcal{Q}_{\theta_{\star}}(\theta_{\star})=o(\uprho^{n}).$$

$$\check{\uprho}_{\star}:=\inf_{v\in\mathsf{V}\setminus\{0\}}\frac{|v^{\top}\mathcal{B}_{\star}v|}{v^{\top}\mathcal{A}_{\star}v}.$$

$$\log\check{\uprho}_{\star}\leqslant\liminf_{n\to\infty}\frac{1}{n}\log\|\theta_{n}-\theta_{\star}\|_{2}.$$

$$\log\min\left(\hat{\uprho}_{\star},\frac{\check{\uprho}_{\star}}{\hat{\uprho}_{\star}}\right)\leqslant\liminf_{n\to\infty}\frac{1}{n}\log\|\theta_{n}-\theta_{\star}\|_{2},$$

$$\lim_{n\to\infty}\frac{1}{n}\log\|\theta_{n}-\theta_{\star}\|_{2}=\log\hat{\uprho}_{\star}.$$

$$\breve{\uprho}_{\star}:=\inf_{v\in\mathsf{T}_{\star}\setminus\{0\}}\frac{|v^{\top}\mathcal{B}_{\star}v|}{v^{\top}\mathcal{A}_{\star}v}\quad\text{and}\quad\invbreve{\uprho}_{\star}:=\sup_{v\in\mathsf{T}_{\star}\setminus\{0\}}\frac{|v^{\top}\mathcal{B}_{\star}v|}{v^{\top}\mathcal{A}_{\star}v}.$$

$$\theta_{n}-\theta_{\star}=o(\uprho^{n}).$$

$$\uprho^{n}=o(\|\theta_{n}-\theta_{\star}\|_{2}).$$

$$\vartheta(\theta^{\prime})\leqslant\vartheta(\theta),$$

$$\mathcal{A}_{\star}^{\mathrm{pop}}=I_{X,Y}(\theta_{\star})\quad\text{and}\quad\mathcal{B}_{\star}^{\mathrm{pop}}=I_{X,Y}(\theta_{\star})-I_{Y}(\theta_{\star}),$$

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Optimization and Variational Analysis · Advanced Bandit Algorithms Research

Full text

1 Introduction

The minimization of a real-valued function is the most common formulation for mathematical optimization problems. Examples of convex optimization problems in machine learning can be found for instance in [Bubeck, 2015]. For models involving missing or latent data, [Dempster et al., 1977] introduced the modern formulation of the Expectation Maximization (EM) algorithm, whose convergence has been proved under general assumptions in [Wu, 1983].

The asymptotic convergence rate of the EM algorithm has been widely studied and identified as a ratio of missing information from the very beginning [Dempster et al., 1977, Meng and Rubin, 1991, Meng and Rubin, 1993]. Since then, some links with gradient descent approaches have also been drawn, see for instance [Lange, 1995]. Among the most notable recent works, [Balakrishnan et al., 2017] provided quantitative results on the non-asymptotic convergence of the EM algorithm to local optima by considering smoothness and strong-concavity assumptions. In the particular case of exponential families, [Kunstner et al., 2021] show that the $M$-step is equivalent to a mirror descent update. This yields a non-asymptotic linear convergence rate, which directly depends on the ratio of missing information.

In this paper, instead of casting the EM algorithm into a gradient or a mirror descent framework, we propose an extended formulation to encompass both classes of algorithms, not restricted to exponential families. Indeed, both EM and mirror descent algorithms can be defined using a bivariate function that is iteratively minimized with respect to one coordinate. Such a representation can actually describe any iterative optimization algorithm whose minimization steps are parametrized only by the current parameter estimate. This paper provides the following contributions.

• We prove under general assumptions that the convergence of such iterative optimization algorithms is asymptotically geometric, see Theorem 1. We also provide lower bounds for the rate of convergence, which show that the convergence can be only geometric, see Theorem 2, and in some cases establish the exact asymptotic convergence rate, see Theorem 3. We show that those assumptions are natural either in an EM or in a mirror descent framework, and that they are satisfied generically without requiring any notable technical work, in contrast with non-asymptotic results that tend to be more demanding. Regarding the EM algorithm, we retrieve the well-known ratio of missing information under even more general assumptions, as the minimization mapping is not required to be point-to-point and the framework handles constrained optimization.

• We derive results for settings with both finite and infinite data, as well as for a variant of the EM algorithm, known as the $\alpha$-EM algorithm, see [Matsuyama, 2003]. However, the most significant contribution is that brought to the mirror descent framework: under mild assumptions, we prove that its convergence is asymptotically geometric. This also applies to the mirror prox variant. More generally, the convergence rates we exhibit are proved to be invariant to $C^{2}$-reparametrization.

• Furthermore, we prove that under general assumptions, the convergence of mirror prox is guaranteed for convex functions with a unique minimizer on a convex compact set, without imposing any condition on the initialization.

This paper is organized as follows. Section 2 introduces the general iterative optimization framework we consider and shows how it encompasses classical settings such as the EM algorithm or the mirror descent algorithm. Section 3 states the main general results of this paper on asymptotic convergence rates. Sections 4-6 discuss the assumptions of Theorem 1, illustrating how they are met in those classical settings, but also in variants such as the $\alpha$ -EM or the mirror prox algorithm. Section 7 displays the proof of Theorem 1, and Section 8 is dedicated to the convergence of mirror prox. A discussion follows in Section 9. Additional proofs are postponed to Appendix A, using technical results listed in Appendix B and proved in the Supplementary material.

Notation

Throughout this paper, $\mathrm{Spec}(\cdot)$ denotes the spectrum of a matrix and $\varrho\left(\cdot\right)$ the spectral radius. The Euclidean norm is denoted by $\left\|\cdot\right\|_{2}$, the spectral norm by ${\left|\!\left|\!\left|\cdot\right|\!\right|\!\right|}_{2}$, the Frobenius norm by ${\left|\!\left|\!\left|\cdot\right|\!\right|\!\right|}_{F}$, and for all symmetric positive-definite matrices $S$, we define the norm $\left\|\cdot\right\|_{S}$ by $\|x\|_{S}^{2}:=x^{\top}Sx$. The first derivative (resp. the second) of any univariate function $f$ is written $\partial f$ (resp. $\partial^{2}f$). For all bivariate functions $\mathcal{Q}\colon(x_{1},x_{2})\mapsto\mathcal{Q}_{x_{1}}(x_{2})$ and $i,j\in\{1,2\}$, we write $\partial_{i}\mathcal{Q}:=\partial\mathcal{Q}/\partial x_{i}$ and $\partial_{ij}\mathcal{Q}:=\partial^{2}\mathcal{Q}/\partial x_{i}\partial x_{j}$. The maximum of two real numbers $a,b$ is denoted by $a\vee b$. For any topological space $\mathsf{E}$, its closure is written $\overline{\mathsf{E}}$ and its interior $\mathring{\mathsf{E}}$. Finally, $\mathrm{Conv}(\cdot)$ stands for convex hull, $\mathrm{Aff}(\cdot)$ for affine hull and $\mathrm{ri}(\cdot)$ for relative interior, i.e. the interior of a set within its affine hull.

2 General framework

Let $q\in\mathbb{N}^{*}$ and let $\mathcal{Q}$ be a real-valued function defined on $\mathbb{R}^{q}\times\mathbb{R}^{q}$ :

$$\mathcal{Q}\colon(\theta,\theta^{\prime})\mapsto\mathcal{Q}_{\theta}(\theta^{\prime}).$$

Let $\Theta$ be a subset of $\mathbb{R}^{q}$ and $\mathcal{M}$ be the point-to-set map defined on $\Theta$ by

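$$\mathcal{M}(\theta):=\mathrm{argmin}_{\theta^{\prime}\in\Theta}\,\mathcal{Q}_{\theta}(\theta^{\prime}).$$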

In what follows, provided that $\mathcal{M}\left(\theta\right)\neq\emptyset$ for any $\theta\in\Theta$ , we let $(\theta_{n})_{n\in\mathbb{N}}$ be a sequence defined on $\Theta$ such that for all $n\in\mathbb{N}$ ,

$$\theta_{n+1}\in\mathcal{M}(\theta_{n}).$$

**Example 1** (EM algorithm).

Let $X$ and $Y$ be random variables taking values in measurable spaces $(\mathsf{X},\mathcal{X})$ and $(\mathsf{Y},\mathcal{Y})$ , respectively. Assume that the pair $(X,Y)$ has a joint density function $p_{{\theta_{\star}}}$ with respect to a reference measure $\mu$ on $\mathcal{X}\otimes\mathcal{Y}$ that belongs to some parameterized family $\{p_{\theta}\;:\;\theta\in\Theta\}$ . Assume also that the state variable $X$ is latent in the sense that the model is only partially observed through the observation $Y$ . In this case, the Expectation Maximization (EM) algorithm, as defined in [Douc et al., 2013, Appendix D.1, p.492], provides an estimate of the unknown parameter ${\theta_{\star}}$ by considering a sequence $(\theta_{n})_{n\in\mathbb{N}}$ defined on $\Theta$ by

$$\theta_{n+1}\in\mathrm{argmin}_{\theta\in\Theta}\,\mathcal{Q}_{\theta_{n}}(\theta),\tag{2}$$

where for all $(\theta,\theta^{\prime})\in\Theta\times\Theta$,

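$$\mathcal{Q}_{\theta}(\theta^{\prime}):=-\mathbb{E}_{\theta}\left[\log p_{\theta^{\prime}}(X,Y)\mid Y\right],\tag{3}$$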

and $\mathbb{E}_{\theta}$ denotes the expectation under $p_{\theta}$ . Note that $\mathcal{Q}$ is a random function which depends on the observations we consider. For instance, in a model where $(X_{i},Y_{i})_{1\leqslant i\leqslant k}$ are independent and identically distributed, with $k$ observations $(Y_{i})_{1\leqslant i\leqslant k}$ , we define at the sample level: $X=(X_{1},\ldots,X_{k})$ and $Y=(Y_{1},\ldots,Y_{k})$ , and inserting in (3), we obtain up to a multiplicative constant (see [Balakrishnan et al., 2017]):

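$$\mathcal{Q}_{\theta}^{\mathrm{samp}}(\theta^{\prime}):=-\frac{1}{k}\sum_{i=1}^{k}\int_{\mathsf{X}}p_{\theta}(x\mid Y_{i})\log p_{\theta^{\prime}}(x,Y_{i})\,\mu(\mathrm{d}x).$$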

In the limit of infinite data (i.e. $k\to\infty$ ), we define at the population level:

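$$\mathcal{Q}_{\theta}^{\mathrm{pop}}(\theta^{\prime}):=-\int_{\mathsf{Y}}\left(\int_{\mathsf{X}}p_{\theta}(x\mid y)\log p_{\theta^{\prime}}(x,y)\,\mu(\mathrm{d}x)\right)p_{\theta_{\star}}(y)\,\mu(\mathrm{d}y),$$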

where $x\mapsto p_{\theta}(x|y)$ denotes the conditional density of $X$ given $Y$ when the parameter value is $\theta$ and where we assume that $(Y_{i})_{i\geqslant 1}$ are iid with density $p_{\theta_{\star}}$ . Both settings are studied in this paper, replacing $\mathcal{Q}$ in (2) by $\mathcal{Q}^{\mathrm{samp}}$ or $\mathcal{Q}^{\mathrm{pop}}$ .
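The sample-level iteration can be sketched on a toy model. The model below is an assumption made for illustration and is not taken from the paper: the latent variables are $X_{i}\sim\mathcal{N}(\theta,1)$ and the observations are $Y_{i}\mid X_{i}\sim\mathcal{N}(X_{i},\sigma^{2})$ with $\sigma^{2}$ known. Under this assumption the minimizer of $\theta^{\prime}\mapsto\mathcal{Q}^{\mathrm{samp}}_{\theta}(\theta^{\prime})$ is the average of the posterior means $\mathbb{E}_{\theta}[X_{i}\mid Y_{i}]$. The names `em_step` and `run_em` are hypothetical.

```python
import random


def em_step(theta, ys, sigma2):
    """One update theta_n -> theta_{n+1} for the assumed toy model
    X ~ N(theta, 1), Y | X ~ N(X, sigma2), with sigma2 known."""
    # E-step: posterior mean E_theta[X_i | Y_i] = theta + (Y_i - theta) / (1 + sigma2)
    post_means = [theta + (y - theta) / (1.0 + sigma2) for y in ys]
    # M-step: the minimizer of Q^samp_theta is the average posterior mean
    return sum(post_means) / len(post_means)


def run_em(theta0, ys, sigma2, n_iter):
    """Iterate the EM map n_iter times starting from theta0."""
    theta = theta0
    for _ in range(n_iter):
        theta = em_step(theta, ys, sigma2)
    return theta


rng = random.Random(0)
sigma2 = 2.0
theta_true = 1.5
# Simulate Y_i = X_i + noise with X_i ~ N(theta_true, 1) and noise ~ N(0, sigma2)
ys = [rng.gauss(theta_true, 1.0) + rng.gauss(0.0, sigma2 ** 0.5) for _ in range(500)]

theta_hat = run_em(theta0=-3.0, ys=ys, sigma2=sigma2, n_iter=200)
sample_mean = sum(ys) / len(ys)
print(theta_hat, sample_mean)  # the two values should agree closely
```

In this toy model the fixed point of the iteration is the sample mean of the observations, which is also the maximum likelihood estimate since $Y_{i}\sim\mathcal{N}(\theta,1+\sigma^{2})$ marginally.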

**Example 2.1** (Mirror descent).

Let $\mathsf{C}$ be a convex compact set of $\mathbb{R}^{q}$ and $f$ be a real-valued function defined on $\mathsf{C}$ . The mirror descent strategy defined in [Bubeck, 2015, Chapter 4, p.296] considers a convex open set $\mathsf{D}$ of $\mathbb{R}^{q}$ such that $\mathsf{C}$ is contained in the closure of $\mathsf{D}$ and $\mathsf{C}\cap\mathsf{D}\neq\emptyset$ , along with a mirror map $\Phi\colon\mathsf{D}\rightarrow\mathbb{R}$ , that is,

(i) $\Phi$ is strictly convex and differentiable,

(ii) the gradient of $\Phi$ takes all possible values: $\partial\Phi(\mathsf{D})=\mathbb{R}^{q}$,

(iii) the gradient of $\Phi$ diverges on the boundary of $\mathsf{D}$: $\lim_{x\to\partial\mathsf{D}}\|\partial\Phi(x)\|=+\infty$.

Then, the mirror descent algorithm produces two sequences $(\theta_{n})_{n\in\mathbb{N}}$ and $(\zeta_{n})_{n\in\mathbb{N}}$ , defined on $\mathsf{C}$ and $\mathsf{D}$ respectively by

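$$\left\{\begin{array}{l}\partial\Phi(\zeta_{n+1})=\partial\Phi(\theta_{n})-\eta g_{n},\quad\text{where}\ g_{n}\in\partial f(\theta_{n}),\\ \theta_{n+1}\in\mathrm{argmin}_{\theta\in\mathsf{C}\cap\mathsf{D}}\,D_{\Phi}(\theta,\zeta_{n+1}),\end{array}\right.\tag{8}$$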

where $\eta>0$ is the step-size, $\partial f$ is the sub-differential of $f$ (by abuse of notation) and $D_{\Phi}$ is the Bregman divergence associated with $\Phi$ :

$$\forall x,y\in\mathsf{D},\quad D_{\Phi}(x,y)=\Phi(x)-\Phi(y)-\partial\Phi(y)^{\top}(x-y).\tag{9}$$

Note that gradient descent is a particular case of mirror descent with $\Phi\colon x\mapsto x^{\top}x/2$ . Following [Bubeck, 2015, p.301], mirror descent can be rewritten as

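$$\theta_{n+1}\in\mathrm{argmin}_{\theta\in\mathsf{C}\cap\mathsf{D}}\;\eta g_{n}^{\top}\theta+D_{\Phi}(\theta,\theta_{n}),$$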

which fits into the general framework (1) with $\Theta:=\mathsf{C}\cap\mathsf{D}$ and $\mathcal{Q}$ defined for all $(\theta,\theta^{\prime})\in\Theta\times\Theta$ by

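$$\mathcal{Q}_{\theta}(\theta^{\prime}):=\eta g^{\top}\theta^{\prime}+D_{\Phi}(\theta^{\prime},\theta),\quad\text{where}\ g\in\partial f(\theta).\tag{10}$$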
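As a minimal sketch of this iteration, assume the negative-entropy mirror map $\Phi(x)=\sum_{i}x_{i}\log x_{i}$ on $\mathsf{D}=(0;+\infty)^{q}$ with $\mathsf{C}$ the probability simplex, a standard choice that is not specific to the paper. The dual step then becomes a multiplicative update and the Bregman projection onto the simplex reduces to a normalization. The quadratic objective and the function names are hypothetical choices for illustration.

```python
import math


def grad_f(theta, c):
    """Gradient of the assumed objective f(theta) = 0.5 * ||theta - c||^2."""
    return [t - ci for t, ci in zip(theta, c)]


def mirror_descent_step(theta, c, eta):
    """One mirror descent update on the simplex with the negative-entropy mirror map."""
    g = grad_f(theta, c)
    # Dual step: log(zeta_i) = log(theta_i) - eta * g_i
    zeta = [t * math.exp(-eta * gi) for t, gi in zip(theta, g)]
    # Bregman projection onto the simplex: normalize so the entries sum to one
    total = sum(zeta)
    return [z / total for z in zeta]


c = [0.5, 0.3, 0.2]            # minimizer of f, chosen inside the simplex
theta = [1.0 / 3.0] * 3        # uniform starting point
for _ in range(300):
    theta = mirror_descent_step(theta, c, eta=0.5)

print(theta)  # should be close to c
```

With $\Phi\colon x\mapsto x^{\top}x/2$ instead, the same two steps would reduce to a projected gradient step.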

**Example 2.2** (Mirror prox).

Mirror prox is a variant of mirror descent defined by the following equations [Bubeck, 2015, Chapter 4, p.305]:

[TABLE]

Straightforward algebra yields the equivalent definition:

[TABLE]

where $\mathcal{M}$ is defined for mirror descent by (10). We deduce from Example 2.1 that mirror prox fits into the general framework (1) with $\Theta:=\mathsf{C}\cap\mathsf{D}$ and $\mathcal{Q}^{m}$ defined on $\Theta\times\Theta$ by

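$$\mathcal{Q}_{\theta}^{m}(\theta^{\prime}):=\eta\,\partial f(\mathcal{M}(\theta))^{\top}\theta^{\prime}+D_{\Phi}(\theta^{\prime},\theta).$$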

The fact that $\mathcal{M}\left(\theta\right)$ is a singleton is ensured by (i) and (iii) as in this case $\Phi$ is a Legendre function, see [Cesa-Bianchi and Lugosi, 2006, Lemma 11.1] or [Bauschke, 1997, Theorem 3.12].
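The two-stage structure of mirror prox can be sketched under the same kind of assumptions as a simplex example: negative-entropy mirror map, $\mathsf{C}$ the probability simplex, and a hypothetical quadratic objective. In line with the definition of $\mathcal{Q}^{m}$, the second stage evaluates the gradient at the intermediate point $\mathcal{M}(\theta_{n})$ but measures the Bregman divergence from $\theta_{n}$. The helper names are illustrative.

```python
import math


def grad_f(theta, c):
    """Gradient of the assumed objective f(theta) = 0.5 * ||theta - c||^2."""
    return [t - ci for t, ci in zip(theta, c)]


def entropic_step(base, g, eta):
    """argmin over the simplex of eta * g^T x + D_Phi(x, base)
    for the negative-entropy mirror map (closed form: reweight and normalize)."""
    w = [b * math.exp(-eta * gi) for b, gi in zip(base, g)]
    total = sum(w)
    return [wi / total for wi in w]


def mirror_prox_step(theta, c, eta):
    # Stage 1: plain mirror descent step from theta, giving the intermediate point
    zeta = entropic_step(theta, grad_f(theta, c), eta)
    # Stage 2: gradient taken at zeta, divergence still measured from theta
    return entropic_step(theta, grad_f(zeta, c), eta)


c = [0.5, 0.3, 0.2]
theta = [1.0 / 3.0] * 3
# f is 1-smooth and the negative entropy is 1-strongly convex on the simplex
# with respect to the Euclidean norm, so eta = 0.5 lies in (0, gamma / beta).
for _ in range(300):
    theta = mirror_prox_step(theta, c, eta=0.5)

print(theta)  # should be close to c
```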

3 Asymptotic convergence rate

Assume there exists ${\theta_{\star}}\in\Theta$ such that $\partial_{2}\mathcal{Q}$ is well-defined in a neighborhood of ${\theta_{\star}}$ and differentiable at $({\theta_{\star}},{\theta_{\star}})$ , and write

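$$\mathcal{A}_{\star}:=\partial_{22}\mathcal{Q}_{\theta_{\star}}(\theta_{\star}),\quad\mathcal{B}_{\star}:=-\partial_{12}\mathcal{Q}_{\theta_{\star}}(\theta_{\star}).$$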

Let $\mathsf{V}:=\mathrm{span}\{\theta-\theta^{\prime}\;:\;\theta,\theta^{\prime}\in\Theta\}$ be the direction of $\mathrm{Aff}(\Theta)$ . Consider the following set of assumptions.

(H1) The set $\Theta$ is convex.

(H2) The sequence $(\theta_{n})_{n\in\mathbb{N}}$ converges to ${\theta_{\star}}$.

(H3) There exists a neighborhood of $({\theta_{\star}},{\theta_{\star}})$ on which $\mathcal{Q}$ is continuous and $\partial_{2}\mathcal{Q}$ is well-defined and $C^{1}$-differentiable.

(H4) The matrix $\mathcal{B}_{\star}$ is symmetric and for all $v\in\mathsf{V}\setminus\{0\}$, $v^{\top}\mathcal{A}_{\star}v>|v^{\top}\mathcal{B}_{\star}v|$.

Under (H4), we can define

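$$\hat{\uprho}_{\star}:=\sup_{v\in\mathsf{V}\setminus\{0\}}\frac{|v^{\top}\mathcal{B}_{\star}v|}{v^{\top}\mathcal{A}_{\star}v}.\tag{13}$$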

In what follows, we set by convention, $\log 0=-\infty$ .

Theorem 1.

Assume that (H1)-(H4) hold. Then, $\hat{\uprho}_{\star}\in[0;1)$ and for all $\uprho\in(\hat{\uprho}_{\star};1)$ ,

$$\theta_{n}-\theta_{\star}=o(\uprho^{n}),$$

or equivalently,

$$\limsup_{n\to\infty}\frac{1}{n}\log\|\theta_{n}-\theta_{\star}\|_{2}\leqslant\log\hat{\uprho}_{\star}.$$

As any two norms on a finite-dimensional linear space are equivalent, Theorem 1 and the next results could be given using another norm on $\Theta$ . They are stated here with $\|\cdot\|_{2}$ for simplicity.

Proof.

See Section 7. ∎

In [Balakrishnan et al., 2017, Theorem 1], the authors prove that the population EM algorithm converges geometrically. Their proof relies mainly on convergence results for gradient ascent algorithms applied to the intermediate quantity of the EM algorithm, which is assumed to be smooth and strongly concave. Theorem 1 establishes under general assumptions that the convergence of the algorithms introduced in Section 2 is asymptotically geometric. Corollary 1 extends the statement of Theorem 1 to the values taken by the function $(\theta,\theta^{\prime})\mapsto\mathcal{Q}_{\theta}(\theta^{\prime})$, which are invariant to the choice of the parametrisation.
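The statement of Theorem 1 can be illustrated numerically on unconstrained gradient descent, i.e. mirror descent with $\Phi\colon x\mapsto x^{\top}x/2$. The sketch below assumes a diagonal quadratic objective, for which a direct computation from the definition of $\mathcal{Q}$ in Example 2.1 suggests $\mathcal{A}_{\star}=I_{q}$ and $\mathcal{B}_{\star}=I_{q}-\eta\,\partial^{2}f({\theta_{\star}})$. The objective, step-size and names are hypothetical.

```python
import math


def gd_step(theta, hess_diag, eta):
    """Gradient step for the assumed objective f(theta) = 0.5 * sum_i h_i * theta_i^2,
    whose minimizer is theta_star = 0."""
    return [t - eta * h * t for t, h in zip(theta, hess_diag)]


hess_diag = [1.0, 4.0]
eta = 0.2
# With A_star = I and B_star = I - eta * Hessian, the rate is max_i |1 - eta * h_i|
rho_hat = max(abs(1.0 - eta * h) for h in hess_diag)

theta = [1.0, 1.0]
n = 200
for _ in range(n):
    theta = gd_step(theta, hess_diag, eta)

# (1/n) * log ||theta_n - theta_star||_2, to be compared with log(rho_hat)
empirical_log_rate = math.log(math.hypot(*theta)) / n
print(empirical_log_rate, math.log(rho_hat))
```

The two printed quantities should be close for large $n$, which is the behaviour described by the $\limsup$ bound of Theorem 1.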

Corollary 1.

Under (H1)-(H4), if $\partial_{1}\mathcal{Q}_{{\theta_{\star}}}\left({\theta_{\star}}\right)$ is well-defined, then for all $\uprho\in(\hat{\uprho}_{\star};1)$ ,

$$\mathcal{Q}_{\theta_{n}}(\theta_{n+1})-\mathcal{Q}_{\theta_{\star}}(\theta_{\star})=o(\uprho^{n}).$$

Proof.

See Section A.1. ∎

Besides, the speed of convergence can be lower-bounded if the limit ${\theta_{\star}}$ lies in the relative interior of $\Theta$ . Under (H4), we can define

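$$\check{\uprho}_{\star}:=\inf_{v\in\mathsf{V}\setminus\{0\}}\frac{|v^{\top}\mathcal{B}_{\star}v|}{v^{\top}\mathcal{A}_{\star}v}.$$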

Theorem 2.

Assume that (H1)-(H4) hold, that ${\theta_{\star}}\in\mathrm{ri}(\Theta)$ , and that the sequence $(\theta_{n})_{n\in\mathbb{N}}$ is not eventually equal to ${\theta_{\star}}$ . Then, $\check{\uprho}_{\star}\in[0;1)$ and

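$$\log\check{\uprho}_{\star}\leqslant\liminf_{n\to\infty}\frac{1}{n}\log\|\theta_{n}-\theta_{\star}\|_{2}.$$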

Proof.

See Section A.1. ∎

If the limit ${\theta_{\star}}$ lies in the relative interior of $\Theta$ and $\check{\uprho}_{\star}>0$ , the asymptotic convergence is therefore only geometric.

Theorem 3.

Assume that (H1)-(H4) hold, that $\partial_{2}\mathcal{Q}$ is $C^{2}$ -differentiable in a neighborhood of $({\theta_{\star}},{\theta_{\star}})$ , that ${\theta_{\star}}\in\mathrm{ri}(\Theta)$ and that for all $p\in\mathbb{N}$ , $\mathrm{Span}(\theta_{n}-{\theta_{\star}},n\geqslant p)=\mathsf{V}$ . Then $\check{\uprho}_{\star},\hat{\uprho}_{\star}\in[0;1)$ and

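$$\log\min\left(\hat{\uprho}_{\star},\frac{\check{\uprho}_{\star}}{\hat{\uprho}_{\star}}\right)\leqslant\liminf_{n\to\infty}\frac{1}{n}\log\|\theta_{n}-\theta_{\star}\|_{2},$$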

where in the left-hand term we use the convention $0/0=0$ and $\log(0)=-\infty$ . In particular, if $\hat{\uprho}_{\star}^{2}\leqslant\check{\uprho}_{\star}$ then

$$\lim_{n\to\infty}\frac{1}{n}\log\|\theta_{n}-\theta_{\star}\|_{2}=\log\hat{\uprho}_{\star}.$$

Proof.

See Section A.1. ∎

4 Comments on H1

The assumption that $\Theta$ needs to be convex can be relaxed as follows.

(H’1) There exist $\mathsf{E}\subset\mathbb{R}^{q}$ and a submanifold $\mathsf{S}\subset\mathbb{R}^{q}$ of class $C^{2}$ such that $\Theta=\mathsf{E}\cap\mathsf{S}$ and ${\theta_{\star}}\in\mathring{\mathsf{E}}$.

Under (H’1), if $d$ is the dimension of the submanifold $\mathsf{S}$, for all $x\in\mathsf{S}$, there exist two open neighborhoods $\mathsf{U}_{1}$ and $\mathsf{U}_{2}$ of $x$ and of the null-vector $\mathbf{0}$ in $\mathbb{R}^{q}$, respectively, and a $C^{2}$-diffeomorphism $\psi\colon\mathsf{U}_{1}\rightarrow\mathsf{U}_{2}$ such that $\psi(x)=\mathbf{0}$ and $\psi(\mathsf{U}_{1}\cap\mathsf{S})=\mathsf{U}_{2}\cap(\mathbb{R}^{d}\times\{0\}^{q-d})$. Note that we identify $\mathbb{R}^{q}$ and $\mathbb{R}^{d}\times\mathbb{R}^{q-d}$ in a standard way using $(x_{1},\ldots,x_{q})\mapsto((x_{1},\ldots,x_{d}),(x_{d+1},\ldots,x_{q}))$. Denote by ${\mathsf{T}_{\star}}$ the tangent space to $\mathsf{S}$ at the point ${\theta_{\star}}$. If $\mathsf{U}_{1}$ and $\mathsf{U}_{2}$ are two open neighborhoods of ${\theta_{\star}}$ and of the null-vector $\mathbf{0}$ in $\mathbb{R}^{q}$, respectively, and $\psi\colon\mathsf{U}_{1}\rightarrow\mathsf{U}_{2}$ is a $C^{1}$-diffeomorphism such that $\psi({\theta_{\star}})=\mathbf{0}$ and $\psi(\mathsf{U}_{1}\cap\mathsf{S})=\mathsf{U}_{2}\cap(\mathbb{R}^{d}\times\{0\}^{q-d})$, then ${\mathsf{T}_{\star}}=\partial\psi^{-1}_{{\theta_{\star}}}(\mathbb{R}^{d}\times\{0\}^{q-d})$, where $\partial\psi$ is the differential of $\psi$ at ${\theta_{\star}}$.

(H’4) The matrix $\mathcal{B}_{\star}$ is symmetric and for all $v\in{\mathsf{T}_{\star}}\setminus\{0\}$, $v^{\top}\mathcal{A}_{\star}v>|v^{\top}\mathcal{B}_{\star}v|$.

Under (H’4), we can define

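$$\breve{\uprho}_{\star}:=\inf_{v\in\mathsf{T}_{\star}\setminus\{0\}}\frac{|v^{\top}\mathcal{B}_{\star}v|}{v^{\top}\mathcal{A}_{\star}v}\quad\text{and}\quad\invbreve{\uprho}_{\star}:=\sup_{v\in\mathsf{T}_{\star}\setminus\{0\}}\frac{|v^{\top}\mathcal{B}_{\star}v|}{v^{\top}\mathcal{A}_{\star}v}.$$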

Theorem 4.

Assume that (H’1), (H2), (H3) and (H’4) hold. Then, $\invbreve{\uprho}_{\star}\in[0;1)$ and for all $\uprho\in(\invbreve{\uprho}_{\star};1)$ ,

$$\theta_{n}-\theta_{\star}=o(\uprho^{n}).$$

Furthermore, if the sequence $(\theta_{n})_{n\in\mathbb{N}}$ is not eventually equal to ${\theta_{\star}}$ , then $\breve{\uprho}_{\star}\in[0;1)$ and for all $\uprho\in(0;\breve{\uprho}_{\star})$ ,

$$\uprho^{n}=o(\|\theta_{n}-\theta_{\star}\|_{2}).$$

Proof.

See Section A.2. ∎

**Remark 1.**

If (H1) holds with ${\theta_{\star}}\in\mathrm{ri}(\Theta)$ , then (H’1) is satisfied with $\mathsf{E}:=\Theta+\mathsf{V}^{\perp}$ and $\mathsf{S}:=\mathrm{Aff}(\Theta)$ .

**Remark 2.**

If ${\theta_{\star}}$ does not lie in the relative interior of $\Theta$ , the asymptotic convergence rates are not necessarily invariant to $C^{2}$ -reparametrization. Define for instance $\tilde{\mathcal{Q}}_{\theta_{0}}(\theta_{1})=\theta_{0}^{2}-\theta_{0}\theta_{1}+\theta_{1}^{2}$ with $\Theta=[1;+\infty[$ . Using the reparametrization function $\Psi:\theta\mapsto\theta^{\alpha}$ , we set $\check{\mathcal{Q}}_{\theta_{0}}(\theta_{1})=\tilde{\mathcal{Q}}_{\Psi(\theta_{0})}(\Psi(\theta_{1}))=\theta_{0}^{2\alpha}-\theta_{0}^{\alpha}\theta_{1}^{\alpha}+\theta_{1}^{2\alpha}$ . Then, with $\mathcal{Q}=\tilde{\mathcal{Q}}$ , we get $\check{\uprho}_{\star}=\hat{\uprho}_{\star}=1/2$ whereas with $\mathcal{Q}=\check{\mathcal{Q}}$ and $\alpha=2/5$ , we get $\check{\uprho}_{\star}=\hat{\uprho}_{\star}=2$ .
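The two rates of this remark can be checked numerically. The sketch below approximates $\mathcal{A}_{\star}=\partial_{22}\mathcal{Q}_{\theta_{\star}}(\theta_{\star})$ and $\mathcal{B}_{\star}=-\partial_{12}\mathcal{Q}_{\theta_{\star}}(\theta_{\star})$ by central finite differences at $\theta_{\star}=1$, treating both functions as defined on a neighborhood of $1$ in $\mathbb{R}$ (an assumption needed to evaluate them slightly outside $\Theta$). The helper `rate_at` and the step `h` are hypothetical choices.

```python
def q_tilde(t0, t1):
    """Original parametrization: theta_0^2 - theta_0 * theta_1 + theta_1^2."""
    return t0 ** 2 - t0 * t1 + t1 ** 2


def q_check(t0, t1, alpha=0.4):
    """Reparametrized function with Psi(theta) = theta^alpha, alpha = 2/5."""
    return q_tilde(t0 ** alpha, t1 ** alpha)


def rate_at(q, theta_star, h=1e-4):
    """|B_star| / A_star in dimension one, using central finite differences."""
    t = theta_star
    # Second derivative with respect to the second argument
    a_star = (q(t, t + h) - 2.0 * q(t, t) + q(t, t - h)) / h ** 2
    # Mixed derivative with respect to both arguments
    d12 = (q(t + h, t + h) - q(t + h, t - h)
           - q(t - h, t + h) + q(t - h, t - h)) / (4.0 * h ** 2)
    b_star = -d12
    return abs(b_star) / a_star


rate_original = rate_at(q_tilde, 1.0)
rate_reparam = rate_at(q_check, 1.0)
print(rate_original, rate_reparam)  # approximately 0.5 and 2.0
```

For $\check{\mathcal{Q}}$ the exact derivatives at $(1,1)$ are $\partial_{22}\check{\mathcal{Q}}=3\alpha^{2}-\alpha$ and $\partial_{12}\check{\mathcal{Q}}=-\alpha^{2}$, which give $2/25$ and $-4/25$ for $\alpha=2/5$, hence the ratio $2$.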

5 Comments on H2

The convergence of the sequence $(\theta_{n})_{n\in\mathbb{N}}$ to ${\theta_{\star}}$ (stated in (H2)) may be the most challenging assumption of Theorem 1. However, we provide alternative sufficient assumptions to establish such convergence.

( $\mathsf{H}$ 2.1) The set $\Theta$ is compact.

( $\mathsf{H}$ 2.2) The function $\mathcal{Q}$ is continuous on $\Theta\times\Theta$.

( $\mathsf{H}$ 2.3) The point ${\theta_{\star}}$ is a limit point of the sequence $(\theta_{n})_{n\in\mathbb{N}}$.

( $\mathsf{H}$ 2.4) $\mathcal{M}\left({\theta_{\star}}\right)=\{{\theta_{\star}}\}$.

Theorem 5.

Under (H1), ( $\mathsf{H}$ 2.1)-( $\mathsf{H}$ 2.4), (H3) and (H4), the sequence $(\theta_{n})_{n\in\mathbb{N}}$ converges to ${\theta_{\star}}$ .

Proof.

See Section A.3. ∎

**Remark 3.**

Assumption ( $\mathsf{H}$ 2.3) weakens (H2) by only requiring that (H3)-(H4) hold for an arbitrary ${\theta_{\star}}$ in the limit set of $(\theta_{n})_{n\in\mathbb{N}}$ , which is non-empty under ( $\mathsf{H}$ 2.1).

**Example 2.1** (Mirror descent, cont.).

The map $\mathcal{M}$ is point-to-point on $\Theta$ under the assumptions of the definition (see Example 2.1). Indeed, the surjectivity of the gradient in (ii) provides the existence of $\zeta_{n+1}$ in (8), and the strict convexity of $\Phi$ in (i) proves its uniqueness. Assumptions (i) and (iii) ensure the existence and the uniqueness of $\theta_{n+1}$ in (8) (see [Bauschke, 1997, Theorem 3.12]). Note that if $f$ is convex or differentiable on $\mathsf{C}$, then for all $\theta\in\mathsf{C}$, $\partial f(\theta)\neq\emptyset$ and $g_{n}$ can be defined in (8).

Moreover, if ${\theta_{\star}}$ is a local minimizer of $f$ and $f$ is differentiable at ${\theta_{\star}}$ , then ( $\mathsf{H}$ 2.4) is met. Indeed, those two assumptions provide that for all $\theta\in\Theta$ , $\partial f({\theta_{\star}})^{\top}(\theta-{\theta_{\star}})\geqslant 0$ , and thus $\mathcal{Q}_{{\theta_{\star}}}\left(\theta\right)\geqslant\mathcal{Q}_{{\theta_{\star}}}\left({\theta_{\star}}\right)$ in (10) with equality if and only if $\theta={\theta_{\star}}$ .

Proposition 1 establishes that the mirror prox approach described in Example 2.2 satisfies ( $\mathsf{H}$ 2.3) and ( $\mathsf{H}$ 2.4) under additional assumptions.

Proposition 1.

Assume in Example 2.2 that ( $\mathsf{H}$ 2.1)-( $\mathsf{H}$ 2.2) hold and that: (i) $\Phi$ and $f$ are twice differentiable on $\Theta$ , (ii) $\Phi$ is $\gamma$ -strongly convex on $\mathsf{C}\cap\mathsf{D}$ and $f$ is convex and $\beta$ -smooth, with respect to $\left\|\cdot\right\|_{2}$ , (iii) $\eta\in(0;\gamma/\beta)$ , (iv) ${\theta_{\star}}$ is the unique minimizer of $f$ on $\mathsf{C}$ , (v) ${\theta_{\star}}\in\mathrm{ri}(\Theta)$ . Then, ( $\mathsf{H}$ 2.3) and ( $\mathsf{H}$ 2.4) hold.

Proof.

See Section A.3. ∎

We also provide alternative assumptions to prove ( $\mathsf{H}$ 2.4) in the general case.

( $\tilde{\mathsf{H}}$ 4.1) $\mathcal{M}({\theta_{\star}})$ is a singleton.

( $\tilde{\mathsf{H}}$ 4.2) There exists a continuous function $\vartheta\colon\Theta\rightarrow\mathbb{R}$ such that for all $\theta\in\Theta$, $\theta^{\prime}\in\mathcal{M}\left(\theta\right)$,

$$\vartheta(\theta^{\prime})\leqslant\vartheta(\theta),$$

with equality if and only if $\theta=\theta^{\prime}$ .

Theorem 6.

Assume ( $\mathsf{H}$ 2.1), ( $\mathsf{H}$ 2.2), ( $\mathsf{H}$ 2.3). Then, ( $\tilde{\mathsf{H}}$ 4.1) and ( $\tilde{\mathsf{H}}$ 4.2) imply ( $\mathsf{H}$ 2.4).

Proof.

See Section A.3. ∎

**Example 1** (EM algorithm, cont.).

By definition of the intermediate quantity (3), the EM algorithm monotonically increases the likelihood of the observations and Assumption ( $\tilde{\mathsf{H}}$ 4.2) is satisfied as soon as the log-likelihood is continuous, see for example [Cappé et al., 2005, Proposition 10.1.4, p.350].

6 Comments on H4

First of all, the matrix $\mathcal{B}_{\star}$ appears to be symmetric in all the examples below. The discussion then focuses on the domination assumption and on the value of the convergence rate $\hat{\uprho}_{\star}$ . Note that the domination assumption in (H4) is equivalent to having both $\tilde{\mathcal{A}}_{\star}\succ\tilde{\mathcal{B}}_{\star}$ and $\tilde{\mathcal{A}}_{\star}\succ-\tilde{\mathcal{B}}_{\star}$ . In the case where $\tilde{\mathcal{A}}_{\star}\succ 0$ , it is equivalent to $\hat{\uprho}_{\star}\in[0;1)$ .

**Example 1.1** (Population EM).

Assume that ${\theta_{\star}}$ is the true parameter of the model, that for all $(x,y)\in\mathsf{X}\times\mathsf{Y}$, the functions $\theta\mapsto p_{\theta}(x|y)$ and $\theta\mapsto p_{\theta}(y)$ are twice differentiable in a neighborhood of ${\theta_{\star}}$, and that conditions similar to [Douc et al., 2013, Assumption AD.1, p.492] hold to differentiate under the integral sign. Then, we prove in Section A.4, see (63) and (64), that

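$$\mathcal{A}_{\star}^{\mathrm{pop}}=I_{X,Y}(\theta_{\star})\quad\text{and}\quad\mathcal{B}_{\star}^{\mathrm{pop}}=I_{X,Y}(\theta_{\star})-I_{Y}(\theta_{\star}),$$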

where $I_{X,Y}(\theta):=-\mathbb{E}_{\theta}[\partial^{2}_{\theta}\log p_{\theta}(X,Y)]$ and $I_{Y}(\theta):=-\mathbb{E}_{\theta}[\partial^{2}_{\theta}\log p_{\theta}(Y)]$ denote the Fisher information matrices of $(X,Y)$ and $Y$, respectively. Therefore, (H4) is satisfied as soon as $I_{Y}({\theta_{\star}})\succ 0$. Regarding the value of $\hat{\uprho}_{\star}$, the above expressions of $\mathcal{A}_{\star}^{\mathrm{pop}}$ and $\mathcal{B}_{\star}^{\mathrm{pop}}$ provide the well-known ratio of missing information $I_{X,Y}({\theta_{\star}})^{-1}I_{X|Y}({\theta_{\star}})$ (see [Dempster et al., 1977, Kunstner et al., 2021, Meng and Rubin, 1991, Meng and Rubin, 1993, Orchard and Woodbury, 1972]), where

[TABLE]
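The ratio of missing information can be made concrete on a one-dimensional toy model, assumed here for illustration only: $X\sim\mathcal{N}(\theta,1)$ is latent and $Y\mid X\sim\mathcal{N}(X,\sigma^{2})$ with $\sigma^{2}$ known. In that model $I_{X,Y}(\theta)=1$ and $I_{Y}(\theta)=1/(1+\sigma^{2})$, and averaging the posterior mean of $X$ over $Y\sim p_{\theta_{\star}}$ gives the population EM map used below. The sketch compares the observed contraction factor with $(\mathcal{A}_{\star}^{\mathrm{pop}})^{-1}\mathcal{B}_{\star}^{\mathrm{pop}}$ computed from the expressions above; all names are hypothetical.

```python
def pop_em_step(theta, theta_star, sigma2):
    """Population EM update for the assumed toy model: the average over
    Y ~ N(theta_star, 1 + sigma2) of the posterior mean E_theta[X | Y]."""
    return theta + (theta_star - theta) / (1.0 + sigma2)


sigma2 = 2.0
theta_star = 1.5
info_complete = 1.0                    # I_{X,Y}(theta_star) in the toy model
info_observed = 1.0 / (1.0 + sigma2)   # I_Y(theta_star), since Y ~ N(theta, 1 + sigma2)
# Ratio B_pop / A_pop = (I_{X,Y} - I_Y) / I_{X,Y}
missing_info_ratio = (info_complete - info_observed) / info_complete

theta = -3.0
ratios = []
for _ in range(20):
    new_theta = pop_em_step(theta, theta_star, sigma2)
    # Contraction factor of the error at this iteration
    ratios.append((new_theta - theta_star) / (theta - theta_star))
    theta = new_theta

print(missing_info_ratio, ratios[-1])
```

In this linear toy model the error contracts by exactly $\sigma^{2}/(1+\sigma^{2})$ at every iteration, so the asymptotic rate is already visible from the first step.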

**Example 1.2** (Sample EM).

As for the other examples, all the results below are proved in Section A.4. Assume that for all $(x,y)\in\mathsf{X}\times\mathsf{Y}$, the functions $\theta\mapsto p_{\theta}(x|y)$ and $\theta\mapsto p_{\theta}(y)$ are twice differentiable in a neighborhood of ${\theta_{\star}}$, and that conditions similar to [Douc et al., 2013, Assumption AD.1, p.492] hold to differentiate under the integral sign. Let $(Y_{i})_{i\in\mathbb{N}^{*}}$ be a sequence of independent and identically distributed random variables with probability density function $p_{{\theta_{\star}}}$, and write for all $k\in\mathbb{N}^{*}$, $Y_{1:k}:=(Y_{i})_{1\leqslant i\leqslant k}$. Then, for all $k\in\mathbb{N}^{*}$, by (65) and (66),

[TABLE]

where $I_{X|Y=Y_{i}}({\theta_{\star}})=\int_{\mathsf{X}}p_{{\theta_{\star}}}(x|Y_{i})\partial\log p_{{\theta_{\star}}}(x|Y_{i})\left[\partial\log p_{{\theta_{\star}}}(x|Y_{i})\right]^{\top}\mu(\mathrm{d}x)$ .

Note that $\mathcal{A}_{\star}^{\mathrm{samp}}\left(Y_{1:k}\right)$ and $\mathcal{B}_{\star}^{\mathrm{samp}}\left(Y_{1:k}\right)$ converge almost surely to $\mathcal{A}_{\star}^{\mathrm{pop}}$ and $\mathcal{B}_{\star}^{\mathrm{pop}}$. Then, if the corresponding population EM meets (H4), almost surely, for sufficiently large $k$, the sample EM meets (H4). Denoting by $\hat{\uprho}_{\star}^{\mathrm{pop}}$ and $\hat{\uprho}_{\star}^{\mathrm{samp}}(Y_{1:k})$ their respective rates, as defined in (13), by Lemma A.3, we also have that

[TABLE]

Furthermore, if $\partial^{2}\log p_{\theta_{\star}}(X_{1},Y_{1}),\partial^{2}\log p_{{\theta_{\star}}}(Y_{1})\in\mathrm{L}^{2}(\mathbb{R}^{q\times q})$ , Lemma A.3 also establishes that for all $\delta\in(0;1)$ there exists $C_{\delta}>0$ such that

[TABLE]

**Example 2.1** (Mirror descent, cont.).

If $f$ and $\Phi$ are twice differentiable in a neighborhood of ${\theta_{\star}}$ , we prove in Section A.4 that

[TABLE]

If for all $v\in\mathsf{V}\setminus\{0\}$, $v^{\top}\partial^{2}f({\theta_{\star}})v>0$, the condition $\tilde{\mathcal{A}}_{\star}\succ\tilde{\mathcal{B}}_{\star}$ is automatically satisfied. The domination assumption in (H4) then reduces to $\tilde{\mathcal{A}}_{\star}\succ-\tilde{\mathcal{B}}_{\star}$, which corresponds to $\eta$ being small enough. In the particular case of unconstrained gradient descent where $\mathrm{Aff}(\Theta)=\mathbb{R}^{q}$ and $\Phi\colon x\mapsto x^{\top}x/2$, as (19) yields $\mathcal{A}_{\star}=I_{q}$, the above condition is equivalent to $\eta\in(0;2/\beta_{\star})$, the optimal choice being $\eta=2/(\alpha_{\star}+\beta_{\star})$ where $\alpha_{\star}:=\min\mathrm{Spec}(\partial^{2}f({\theta_{\star}}))$ and $\beta_{\star}:=\max\mathrm{Spec}(\partial^{2}f({\theta_{\star}}))$.
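The step-size condition and the optimal choice can be checked with a short computation. The sketch assumes unconstrained gradient descent with $\mathcal{A}_{\star}=I_{q}$ and $\mathcal{B}_{\star}=I_{q}-\eta\,\partial^{2}f({\theta_{\star}})$, so that the rate only depends on the extreme eigenvalues $\alpha_{\star}$ and $\beta_{\star}$ of the Hessian. The numerical values and the grid search are hypothetical.

```python
def rho_hat(eta, alpha_star, beta_star):
    """Rate sup_v |v^T B v| / v^T A v with A = I and B = I - eta * Hessian:
    the supremum is attained at one of the two extreme eigenvalues."""
    return max(abs(1.0 - eta * alpha_star), abs(1.0 - eta * beta_star))


alpha_star, beta_star = 1.0, 4.0
eta_opt = 2.0 / (alpha_star + beta_star)

# Grid of step-sizes strictly inside (0, 2 / beta_star)
grid = [k * (2.0 / beta_star) / 1000 for k in range(1, 1000)]
best_eta = min(grid, key=lambda e: rho_hat(e, alpha_star, beta_star))

print(best_eta, eta_opt)                         # grid minimizer vs. closed form
print(rho_hat(eta_opt, alpha_star, beta_star))   # (beta - alpha) / (beta + alpha)
```

At the optimal step-size the two terms $|1-\eta\alpha_{\star}|$ and $|1-\eta\beta_{\star}|$ are equal, which gives the rate $(\beta_{\star}-\alpha_{\star})/(\beta_{\star}+\alpha_{\star})$.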

Besides, the asymptotic convergence rate $\hat{\uprho}_{\star}$ can be interpreted similarly to the EM framework. Despite not being, strictly speaking, a ratio of missing information, $\hat{\uprho}_{\star}$ still compares the mirror map $\Phi$ with the objective function $f$ . Intuitively, the choice of a mirror map with variations closer to those of $f$ provides a better convergence rate. If $\eta=1$ , the extreme case $\Phi=f$ yields $\mathcal{B}_{\star}=0$ and $\hat{\uprho}_{\star}=0$ , which is consistent with the fact that, in this case, the mirror descent is defined for all $n\in\mathbb{N}$ by $\theta_{n+1}\in\mathrm{argmin}_{\theta\in\Theta}f(\theta)$ .

The following discussion extends the above interpretation to a general class of functions $\mathcal{Q}$ that encompasses both mirror descent and the EM algorithm. The first thing to note is that in both settings the function $\mathcal{Q}$ can be redefined as

[TABLE]

where $f\colon\mathbb{R}^{q}\rightarrow\mathbb{R}$ is the objective function and $D\colon\mathbb{R}^{q}\times\mathbb{R}^{q}\rightarrow\mathbb{R}$ is a function such that for all $\theta\in\Theta$ , $\partial_{2}D(\theta,\theta)=0$ . Indeed, it is common knowledge that the intermediate quantity of the EM algorithm can be expressed as

[TABLE]

where $D_{\mathrm{KL}}$ denotes the Kullback-Leibler divergence (see [Daudel et al., 2020] for example). Regarding mirror descent, if $f$ is twice differentiable in Example 2.1, straightforward computation yields the following equivalent definition for $\mathcal{Q}$ :

[TABLE]

where the expression of $D_{\Phi-\eta f}$ follows that of (9) (and defines a Bregman divergence if $\Phi-\eta f$ is strictly convex). Besides, the condition $\partial_{2}D(\theta,\theta)=0$ for all $\theta\in\Theta$ is equivalent to $\partial_{2}\mathcal{Q}_{\theta}(\theta)=\partial f(\theta)$ for all $\theta\in\Theta$ , hence

[TABLE]

and

[TABLE]

In the framework of (20), the convergence rate can thus be viewed as a relative difference between the second-order variations of $\mathcal{Q}$ and $f$ . Computing iteratively $\mathrm{argmin}_{\Theta}\mathcal{Q}_{\theta_{n}}(\cdot)$ to estimate $\mathrm{argmin}_{\Theta}f$ can prove useful if those minimizations are easier to carry out, but the price to pay in terms of iterations (through the convergence rate) is directly related to how far the surrogate function $\mathcal{Q}$ is from the objective function $f$ . If $\tilde{\mathcal{A}}_{\star}$ is invertible, $\hat{\uprho}_{\star}$ is indeed the spectral radius of $\tilde{\mathcal{B}}_{\star}\tilde{\mathcal{A}}_{\star}^{-1}=(\partial_{22}\tilde{\mathcal{Q}}_{{\theta_{\star}}}({\theta_{\star}})-\partial^{2}\tilde{f}({\theta_{\star}}))(\partial_{22}\tilde{\mathcal{Q}}_{{\theta_{\star}}}({\theta_{\star}}))^{-1}$ by Lemma B.3, and the interpretation of a ratio of missing information generalizes to that of a ratio measuring the loss of exactness in the minimization procedure.
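The interpretation of the rate as a relative gap between the curvature of the surrogate and that of the objective can be illustrated on a toy problem. The sketch below is an assumption-laden example, not the paper's construction: it uses a hypothetical quadratic objective with Hessian `H` and a hypothetical quadratic surrogate with curvature `A`, so that `B = A - H` plays the role of $\partial_{22}\mathcal{Q}-\partial^{2}f$, and it compares the observed contraction with the spectral radius of $\mathcal{B}\mathcal{A}^{-1}$.

```python
import numpy as np

# Hypothetical quadratic objective f(theta) = 0.5 * theta^T H theta, minimized at 0.
H = np.array([[2.0, 0.5],
              [0.5, 1.0]])
# Hypothetical surrogate: Q_theta(t) = f(theta) + grad f(theta)^T (t - theta)
#                                      + 0.5 * (t - theta)^T A (t - theta).
A = np.array([[3.0, 0.0],
              [0.0, 2.0]])
B = A - H  # curvature of the surrogate minus curvature of the objective

# Predicted asymptotic rate: spectral radius of B A^{-1}.
predicted = float(np.max(np.abs(np.linalg.eigvals(B @ np.linalg.inv(A)))))

# Minimizing the surrogate gives theta_{n+1} = theta_n - A^{-1} grad f(theta_n).
theta = np.array([1.0, 1.0])
errors = []
for _ in range(100):
    errors.append(np.linalg.norm(theta))
    theta = theta - np.linalg.solve(A, H @ theta)

# Ratio of two consecutive late errors, to be compared with `predicted`.
observed = errors[-1] / errors[-2]
```

In this toy setting, a surrogate whose curvature is closer to `H` gives a smaller `B` and hence a smaller rate, at the price of a minimization step that is closer to solving the original problem.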

Finally, in the particular case where $D$ is a distance or a divergence twice differentiable at $({\theta_{\star}},{\theta_{\star}})$ with respect to the second argument, $\theta\in\mathrm{argmin}_{\Theta}D(\theta,\cdot)$ for all $\theta\in\Theta$ implies $\mathcal{B}_{\star}=\partial_{22}D\left({\theta_{\star}},{\theta_{\star}}\right)\succeq 0$ . The domination assumption in (H4) then boils down to $\partial^{2}\tilde{f}({\theta_{\star}})\succ 0$ .

*Example 1.3** (The $\alpha$ -EM algorithm).*

The above discussion highlighted how the choice of the surrogate function $\mathcal{Q}$ determines the convergence rate $\hat{\uprho}_{\star}$ . In the EM algorithm of Example 1, where the function $\mathcal{Q}$ can be defined for all $(\theta,\theta^{\prime})\in\Theta\times\Theta$ as

[TABLE]

the question then arises whether replacing the Kullback-Leibler divergence by an $\alpha$ -divergence (see [Daudel et al., 2020] for example) could provide a better convergence rate. This leads to replacing the previous expression of $\mathcal{Q}$ by:

[TABLE]

where for all $\alpha\in\mathbb{R}\setminus\{0,1\}$ , the concave function $f_{\alpha}$ is defined on $\mathbb{R}_{+}^{*}$ by $f_{\alpha}(x):=(1-x^{\alpha})/(\alpha(\alpha-1))$ and $f_{0}:=\log$ . This approach has been introduced and developed in [Matsuyama, 2003]. We provide further elements for the choice of $\alpha$ by proving (see Section A.4) that at a population level, under the assumptions of Example 1.1 with $f_{\alpha}$ instead of $f_{0}$ ,

[TABLE]

Note that when $\alpha=0$ we recover the previous quantities $\mathcal{A}_{\star}$ and $\mathcal{B}_{\star}$ for the classical EM algorithm. If $I_{Y}({\theta_{\star}})\succ 0$ , then $\alpha\in(0;1/2)$ is a sufficient condition to meet (H4). Besides, if $\mathcal{A}_{\star}^{\alpha}$ is invertible we can write

[TABLE]

A necessary condition to improve the convergence rate is then $\alpha/(1-\alpha)>0$ , i.e. $\alpha\in(0;1)$ . We can also rewrite (23) as follows:

[TABLE]

By the positivity of $\mathcal{B}_{\star}$ for the original EM algorithm, we deduce that the optimal choice of $\alpha$ corresponds to $\alpha=(\hat{\uprho}_{\star}+\check{\uprho}_{\star})/2$ and $\hat{\uprho}_{\star}^{\alpha}=(\hat{\uprho}_{\star}-\check{\uprho}_{\star})/(2-\hat{\uprho}_{\star}-\check{\uprho}_{\star})$ , where $\hat{\uprho}_{\star}$ and $\check{\uprho}_{\star}$ are defined in (13) and (14) for the classical EM algorithm.
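The optimal choice of $\alpha$ stated above can be explored numerically. The following sketch uses the two closed-form expressions from the text; in addition, the function `alpha_em_rate` relies on an assumption that is not taken from the text, namely that each eigenvalue $\lambda$ of the classical EM rate matrix is mapped to $(\lambda-\alpha)/(1-\alpha)$ by the $\alpha$-EM algorithm. This map was chosen because it reproduces the stated optimum, and the spectrum `em_eigs` is hypothetical.

```python
import numpy as np

def optimal_alpha(rho_hat, rho_check):
    """Optimal alpha and resulting rate, using the formulas stated in the text."""
    alpha = 0.5 * (rho_hat + rho_check)
    rate = (rho_hat - rho_check) / (2.0 - rho_hat - rho_check)
    return alpha, rate

def alpha_em_rate(em_eigs, alpha):
    """ASSUMPTION: an EM eigenvalue lam becomes (lam - alpha) / (1 - alpha)."""
    em_eigs = np.asarray(em_eigs, dtype=float)
    return float(np.max(np.abs((em_eigs - alpha) / (1.0 - alpha))))

# Hypothetical spectrum of the classical EM rate matrix.
em_eigs = [0.3, 0.5, 0.9]
rho_check, rho_hat = min(em_eigs), max(em_eigs)

alpha_star, rate_star = optimal_alpha(rho_hat, rho_check)

# Grid search over alpha in (0, 1) under the assumed spectral map.
grid = np.linspace(0.01, 0.99, 981)
grid_best = float(grid[np.argmin([alpha_em_rate(em_eigs, a) for a in grid])])
```

Whatever the spectral map, the stated optimal rate never exceeds the classical rate, since $\hat{\uprho}_{\star}-(\hat{\uprho}_{\star}-\check{\uprho}_{\star})/(2-\hat{\uprho}_{\star}-\check{\uprho}_{\star})$ has the sign of $(1-\hat{\uprho}_{\star})(\hat{\uprho}_{\star}+\check{\uprho}_{\star})\geqslant 0$.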

As a remark, we can see in [Matsuyama, 2003] that the $\alpha$ -EM algorithm does not simply change the $D$ -function in (20), it also replaces the objective function with a different bivariate function.

*Example 2.2** (Mirror prox, cont.).*

Assume that ( $\mathsf{H}$ 2.1) holds, that $\Phi$ and $f$ are $C^{1}$ -differentiable on $\Theta$ and twice differentiable at ${\theta_{\star}}$ , and that the corresponding mirror descent satisfies (H3)-(H4) and ${\theta_{\star}}=\mathcal{M}\left({\theta_{\star}}\right)\in\mathrm{ri}(\Theta)$ . Then, we prove in Section A.4 that

[TABLE]

where $\mathcal{A}_{\star}$ , $\mathcal{B}_{\star}$ are defined in (19) for mirror descent. This provides the symmetry of $\tilde{\mathcal{B}}_{\star}^{m}$ and thus of $\mathcal{B}_{\star}^{m}$ , as well as

[TABLE]

We deduce that under (H4) for mirror descent, $\hat{\uprho}_{\star}^{m}<1$ if and only if $\tilde{\mathcal{B}}_{\star}\succ 0$ , which is met as soon as $\eta\in(0;\gamma_{\star}/\beta_{\star})$ , where $\beta_{\star}:=\max\mathrm{Spec}(\partial^{2}\tilde{f}({\theta_{\star}}))$ and $\gamma_{\star}:=\min\mathrm{Spec}(\partial^{2}\tilde{\Phi}({\theta_{\star}}))$ . Besides, a sufficient condition for the $C^{1}$ -differentiability of $\partial_{2}\mathcal{Q}^{m}$ in a neighborhood of ${\theta_{\star}}$ is the $C^{2}$ -differentiability of $\Phi$ and $f$ in a neighborhood of ${\theta_{\star}}$ . Under all those assumptions, mirror prox thus meets (H3)-(H4).

Note that the above sufficient condition of regularity implies (H3) for mirror descent, and that if $\tilde{\mathcal{B}}_{\star}\succ 0$ , then $\partial^{2}\tilde{f}({\theta_{\star}})\succ 0$ implies (H4) for mirror descent (see Example 2.1).

As a remark, (24) yields that the convergence rates defined in (13-14) are always strictly higher for mirror prox than for the corresponding mirror descent. The rate $\check{\uprho}_{\star}^{m}$ is even lower-bounded by $3/4$ (see Section A.4).

*Example 3** (Newton’s method).*

Let $f$ be a $C^{2}$ -differentiable function whose Hessian is invertible on $\Theta$ . Newton’s method considers the procedure defined for all $n\in\mathbb{N}$ by

$$\theta_{n+1}=\theta_{n}-\left[\partial^{2}f(\theta_{n})\right]^{-1}\partial f(\theta_{n}).$$

It fits into the general framework of (1) with $\mathcal{Q}$ defined on $\Theta\times\Theta$ by

[TABLE]

If $f$ is thrice differentiable at ${\theta_{\star}}$ and $\partial f({\theta_{\star}})=0$ , straightforward calculus yields $\mathcal{A}_{\star}=I_{q}$ and $\mathcal{B}_{\star}=0$ . Newton’s method thus meets (H4) with $\check{\uprho}_{\star}=\hat{\uprho}_{\star}=0$ , which is consistent with the fact that the convergence is quadratic under the assumptions of [Nocedal and Wright, 2006, Theorem 3.5, p.44].
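A zero asymptotic rate means that the ratio of successive errors tends to zero, which is what quadratic convergence produces. The sketch below illustrates this on a hypothetical one-dimensional objective $f(x)=e^{x}-2x$, whose minimizer is $\log 2$; the function `newton_minimize` is a minimal illustration of the standard Newton update, not code from the paper.

```python
import math

def newton_minimize(grad, hess, theta0, n_steps):
    """1D Newton iterates theta_{n+1} = theta_n - grad(theta_n) / hess(theta_n)."""
    thetas = [theta0]
    for _ in range(n_steps):
        theta = thetas[-1]
        thetas.append(theta - grad(theta) / hess(theta))
    return thetas

# Hypothetical objective f(x) = exp(x) - 2x, minimized at log(2).
grad = lambda x: math.exp(x) - 2.0
hess = lambda x: math.exp(x)
theta_star = math.log(2.0)

thetas = newton_minimize(grad, hess, theta0=1.0, n_steps=4)
errors = [abs(t - theta_star) for t in thetas]

# Successive error ratios: they shrink toward 0 instead of settling at a constant.
ratios = [errors[k + 1] / errors[k] for k in range(3)]
```

A geometric rate $\hat{\uprho}_{\star}>0$ would instead make these ratios stabilize around a positive constant.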

7 Proof of Theorem 1

We start with some notation that will be used in several parts of the paper. Set $d:=\dim(\mathsf{V})$ . Let $v_{1},\ldots,v_{d}\in\mathbb{R}^{q}$ be an orthonormal basis of $\mathsf{V}$ and let $P$ be the matrix

$$P:=\begin{pmatrix}v_{1}&\cdots&v_{d}\end{pmatrix}\in\mathbb{R}^{q\times d},\tag{25}$$

so that $\mathsf{V}=P(\mathbb{R}^{d})$ . For all $x\in\mathbb{R}^{q}$ and $M\in\mathbb{R}^{q\times q}$ , write

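$$\tilde{x}:=P^{\top}x\in\mathbb{R}^{d}\quad\text{and}\quad\tilde{M}:=P^{\top}MP\in\mathbb{R}^{d\times d}.\tag{26}$$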

Note that for all $v\in\mathsf{V}$ , $P\tilde{v}=PP^{\top}v=v$ . Write for all $n\in\mathbb{N}$ ,

[TABLE]

and hence

[TABLE]

Then, for all $M\in\mathbb{R}^{q\times q}$ and $n,m\in\mathbb{N}$ ,

[TABLE]

In particular, with $M=I_{q}$ the identity matrix, for all $n\in\mathbb{N}$ ,

[TABLE]
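The reduction to the subspace $\mathsf{V}$ can be made concrete as follows. This is a minimal sketch assuming that $P$ stacks the orthonormal basis vectors as columns, that $\tilde{x}=P^{\top}x$ and that $\tilde{M}=P^{\top}MP$; the dimensions and random matrices are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
q, d = 5, 2

# Orthonormal basis v_1, ..., v_d of a hypothetical subspace V, stored as columns of P.
P, _ = np.linalg.qr(rng.standard_normal((q, d)))

v = P @ rng.standard_normal(d)   # an arbitrary element of V
w = P @ rng.standard_normal(d)   # another element of V
x = rng.standard_normal(q)       # a generic vector of R^q, not in V in general
M = rng.standard_normal((q, q))

v_tilde, w_tilde = P.T @ v, P.T @ w   # coordinates in the basis (v_i)
M_tilde = P.T @ M @ P                 # d x d compression of M

# Bilinear forms and Euclidean norms of elements of V are preserved.
bilinear_full = v @ M @ w
bilinear_reduced = v_tilde @ M_tilde @ w_tilde
```

The identity $PP^{\top}v=v$ holds only for $v\in\mathsf{V}$: for a generic `x`, `P @ (P.T @ x)` is the orthogonal projection of `x` onto $\mathsf{V}$.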

Proof of Theorem 1.

In the definition of $\hat{\uprho}_{\star}$ given in (13), the supremum of $v\mapsto|v^{\top}\mathcal{B}_{\star}v|/(v^{\top}\mathcal{A}_{\star}v)$ can be taken over the compact set $\{v\in\mathsf{V}\;:\;v^{\top}v=1\}$ and it is thus attained. This yields $\hat{\uprho}_{\star}\in[0;1)$ under (H4).

Let $\uprho\in(\hat{\uprho}_{\star};1)$ . Proposition 2 below provides for sufficiently large $n$ ,

[TABLE]

This yields $\|\tilde{\Delta}_{n}\|_{\tilde{\mathcal{A}}_{\star}}=O(\uprho^{n})$ , and hence $\|\tilde{\Delta}_{n}\|_{2}=O(\uprho^{n})$ by the equivalence of norms in finite dimension. The proof is concluded by noting that this holds for any arbitrary $\uprho>\hat{\uprho}_{\star}$ and since by (30),

[TABLE]

∎

Lemma 7.1.

Under (H2)-(H3), ${\theta_{\star}}$ is a local minimizer on $\Theta$ of the function $\theta\mapsto\mathcal{Q}_{{\theta_{\star}}}\left(\theta\right)$ .

Proof.

Let $\mathsf{N}$ be a neighborhood of ${\theta_{\star}}$ such that $\mathcal{Q}$ is continuous on $\mathsf{N}\times\mathsf{N}$ . For all $\theta\in\mathsf{N}$ and $n\in\mathbb{N}$ , the definition of $(\theta_{n})_{n\in\mathbb{N}}$ in (1) provides $\mathcal{Q}_{\theta_{n}}(\theta_{n+1})\leqslant\mathcal{Q}_{\theta_{n}}(\theta)$ . Taking the limit when $n$ goes to infinity yields $\mathcal{Q}_{{\theta_{\star}}}({\theta_{\star}})\leqslant\mathcal{Q}_{{\theta_{\star}}}(\theta)$ for all $\theta\in\mathsf{N}$ . ∎

Proposition 2.

Under (H1)-(H4), for all $\uprho>\hat{\uprho}_{\star}$ , for sufficiently large $n$ ,

[TABLE]

Proof.

By Lemma 7.1, ${\theta_{\star}}$ is a local minimizer of the function $\theta\mapsto\mathcal{Q}_{{\theta_{\star}}}(\theta)$ . From the differentiability of that function at ${\theta_{\star}}$ under (H3), and the convexity of $\Theta$ under (H1), we deduce that for all $\theta\in\Theta$ ,

[TABLE]

Similarly, under (H2)-(H3) the function $\theta\mapsto\mathcal{Q}_{\theta_{n}}(\theta)$ is differentiable at $\theta_{n+1}$ for sufficiently large $n$ , which yields for all $\theta\in\Theta$ ,

[TABLE]

Using (32) with $\theta={\theta_{\star}}$ and (31) with $\theta=\theta_{n+1}$ provides

[TABLE]

which in turn implies

[TABLE]

Besides, applying Taylor’s theorem to $\theta\mapsto\partial_{2}\mathcal{Q}_{\theta_{n}}(\theta)$ and $\theta\mapsto\partial_{2}\mathcal{Q}_{\theta}({\theta_{\star}})$ yields for sufficiently large $n$ ,

[TABLE]

where

[TABLE]

Plugging (34-35) into (33), we deduce

[TABLE]

Using (29), this can be written as

[TABLE]

Now, by Schwarz’s theorem, $\mathcal{A}_{\star}$ and hence $\tilde{\mathcal{A}}_{\star}$ are symmetric. Similarly, under (H2)-(H3), $\mathcal{A}_{n}$ and hence $\tilde{\mathcal{A}}_{n}$ are symmetric for sufficiently large $n$ . Moreover, (H4) implies the positive-definiteness of $\tilde{\mathcal{A}}_{\star}$ , and by (H2)-(H3) that of $\tilde{\mathcal{A}}_{n}$ for sufficiently large $n$ (see [Tao, 2012, Section 1.3.4, p.47]). We can thus apply Lemma B.1 to (37) with $x=\tilde{\Delta}_{n+1}$ , $y=\tilde{\Delta}_{n}$ , $A=\tilde{\mathcal{A}}_{n}$ and $B=\tilde{\mathcal{B}}_{n}$ and we obtain

[TABLE]

where $\hat{\uprho}_{n}:={|\kern-1.07639pt|\kern-1.07639pt|\tilde{\mathcal{A}}_{n}^{-1/2}\tilde{\mathcal{B}}_{n}\tilde{\mathcal{A}}_{n}^{-1/2}|\kern-1.07639pt|\kern-1.07639pt|}_{2}$ . Under (H2)-(H3), Lemma B.2 shows that $\hat{\uprho}_{n}$ converges to ${|\kern-1.07639pt|\kern-1.07639pt|\tilde{\mathcal{A}}_{\star}^{-1/2}\tilde{\mathcal{B}}_{\star}\tilde{\mathcal{A}}_{\star}^{-1/2}|\kern-1.07639pt|\kern-1.07639pt|}_{2}$ by choosing $A=\tilde{\mathcal{A}}_{\star}$ , $B=\tilde{\mathcal{B}}_{\star}$ , $M=\tilde{\mathcal{A}}_{n}-\tilde{\mathcal{A}}_{\star}$ and $N=\tilde{\mathcal{B}}_{n}-\tilde{\mathcal{B}}_{\star}$ . On the other hand, by Lemma B.3, $\hat{\uprho}_{\star}={|\kern-1.07639pt|\kern-1.07639pt|\tilde{\mathcal{A}}_{\star}^{-1/2}\tilde{\mathcal{B}}_{\star}\tilde{\mathcal{A}}_{\star}^{-1/2}|\kern-1.07639pt|\kern-1.07639pt|}_{2}$ .
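The identification of the rate with an operator norm can be verified numerically. The sketch below assumes a symmetric positive-definite `A` and a symmetric `B`, both hypothetical and randomly generated, and compares the operator norm of $A^{-1/2}BA^{-1/2}$, the spectral radius of $BA^{-1}$ and sampled values of the quotient $|v^{\top}Bv|/(v^{\top}Av)$.

```python
import numpy as np

def sym_inv_sqrt(A):
    """Inverse square root of a symmetric positive-definite matrix."""
    w, U = np.linalg.eigh(A)
    return (U / np.sqrt(w)) @ U.T

rng = np.random.default_rng(1)
d = 4
G = rng.standard_normal((d, d))
A = G @ G.T + d * np.eye(d)   # hypothetical symmetric positive-definite matrix
B = rng.standard_normal((d, d))
B = 0.5 * (B + B.T)           # hypothetical symmetric matrix

A_inv_sqrt = sym_inv_sqrt(A)
op_norm = float(np.linalg.norm(A_inv_sqrt @ B @ A_inv_sqrt, 2))
spec_radius = float(np.max(np.abs(np.linalg.eigvals(B @ np.linalg.inv(A)))))

# Quotient |v^T B v| / (v^T A v) sampled over random directions (columns of V).
V = rng.standard_normal((d, 20000))
num = np.abs(np.einsum("ij,ik,kj->j", V, B, V))
den = np.einsum("ij,ik,kj->j", V, A, V)
quotients = num / den
```

The sampled quotients stay below the operator norm, which is their supremum after the change of variable $u=A^{1/2}v$.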

Let $\uprho>\hat{\uprho}_{\star}$ . Set $\uprho^{\prime}:=(\uprho+\hat{\uprho}_{\star})/2$ and $\varepsilon>0$ such that $(1+\varepsilon)\uprho^{\prime}\leqslant(1-\varepsilon)\uprho$ . Under (H2)-(H3), Lemma B.4 yields that for sufficiently large $n$ , for all $u\in\mathbb{R}^{d}$ ,

[TABLE]

Combining with (38) and the convergence of $\hat{\uprho}_{n}$ to $\hat{\uprho}_{\star}$ , we deduce for sufficiently large $n$ ,

[TABLE]

∎

8 Convex constrained optimization

Theorem 7.

Assume that the mirror prox strategy defined in Example 2.2 satisfies the following assumptions: (i) $\mathsf{C}\subset\mathsf{D}$ , (ii) $\Phi$ and $f$ are twice differentiable on $\mathsf{C}$ and $C^{2}$ -differentiable in a neighborhood of ${\theta_{\star}}$ , (iii) $\Phi$ is $\gamma$ -strongly convex on $\mathsf{C}$ and $f$ is convex and $\beta$ -smooth, with respect to $\left\|\cdot\right\|_{2}$ , (iv) $\eta\in(0;\gamma/\beta)$ , (v) ${\theta_{\star}}$ is the unique minimizer of $f$ on $\mathsf{C}$ , (vi) ${\theta_{\star}}\in\mathrm{ri}(\mathsf{C})$ and $\partial^{2}\tilde{f}({\theta_{\star}})\succ 0$ .

Then, the algorithm converges and the convergence is asymptotically geometric.

Proof.

See Proposition 1 and Example 2.2. ∎

Even if the convergence rates of mirror descent are always lower than those of mirror prox for the same optimization problem (see Example 2.2), the convergence of mirror prox is guaranteed under the assumptions of Theorem 7.

Note that no conditions are imposed on the initialization (see [Bubeck, 2015, Chapter 4, p.299]).

Corollary 2.

Let $q\in\mathbb{N}^{*}$ , $\mathsf{C}\subset\mathbb{R}^{q}$ be a compact set, and $f$ be a function that meets assumptions (ii)-(iii) and (v)-(vi) of Theorem 7. Then, mirror prox provides an algorithm that converges to $\mathrm{argmin}_{\mathsf{C}}f$ .

Proof.

Write $R:=\max_{x\in\mathsf{C}}\left\|x\right\|_{2}$ . For all $R^{\prime}>R$ , the mirror map $\Phi$ defined on $\mathsf{D}:=\mathbf{B}\left(0,R^{\prime}\right):=\left\{x\in\mathbb{R}^{q}\;:\;\left\|x\right\|_{2}<R^{\prime}\right\}$ by $\Phi(x):=\left\|x\right\|_{2}^{2}/(R^{\prime}-\left\|x\right\|_{2}^{2})$ meets the assumptions of Theorem 7 for all $\eta\in(0;2(R^{\prime}\beta)^{-1})$ . ∎

9 Discussion

9.1 Non-asymptotic convergence

We proved the asymptotic geometric convergence in Section 7 by using that for all $n\in\mathbb{N}$ ,

[TABLE]

as soon as we can define the above quantities (see (38)). The question then arises of deriving non-asymptotic convergence rates. Apart from the fact that the norm depends on $n$ , the main issue would be to obtain $\hat{\uprho}_{n}<1$ . However, the ratio $\hat{\uprho}_{\star}$ compares $\partial_{22}\mathcal{Q}_{{\theta_{\star}}}({\theta_{\star}})$ with $\partial_{12}\mathcal{Q}_{{\theta_{\star}}}({\theta_{\star}})$ , whereas $\hat{\uprho}_{n}$ compares

[TABLE]

and the problem is not of the same complexity. A sufficient condition to simplify it can be $\min\mathrm{Spec}(\mathcal{A}_{n})>\max|\mathrm{Spec}(\mathcal{B}_{n})|$ . Despite being less precise than a comparison between matrices, such a condition has the advantage that it is sufficient to verify it for every $s\in[0;1]$ in (39). That pointwise condition essentially corresponds to the conditions 1 and 2 behind $\gamma<\lambda$ in [Balakrishnan et al., 2017, Theorem 1], concerning $\partial_{12}\mathcal{Q}$ and $\partial_{22}\mathcal{Q}$ respectively (the proof of that theorem also inspired this work). We can also identify the classical assumptions of smoothness and Lipschitz continuity for gradient descent and its variants.
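The gap between the spectral sufficient condition and the matrix comparison can be seen on small examples. The sketch below assumes symmetric matrices (so that $\max|\mathrm{Spec}(\cdot)|$ is the operator norm); the two pairs of matrices are hypothetical and the helpers `rate` and `crude_bound` are introduced only for this illustration.

```python
import numpy as np

def rate(A, B):
    """Operator 2-norm of A^{-1/2} B A^{-1/2} for symmetric positive-definite A."""
    w, U = np.linalg.eigh(A)
    A_inv_sqrt = (U / np.sqrt(w)) @ U.T
    return float(np.linalg.norm(A_inv_sqrt @ B @ A_inv_sqrt, 2))

def crude_bound(A, B):
    """max |Spec(B)| / min Spec(A), for symmetric A and B."""
    return float(np.max(np.abs(np.linalg.eigvalsh(B))) / np.min(np.linalg.eigvalsh(A)))

# Pair 1: min Spec(A) > max |Spec(B)|, so the spectral condition holds.
A1 = np.diag([2.0, 3.0])
B1 = np.array([[1.0, 0.5],
               [0.5, -1.0]])

# Pair 2: the spectral condition fails although the rate is still below 1.
A2 = np.diag([1.0, 10.0])
B2 = np.diag([0.5, 5.0])

rates = [rate(A1, B1), rate(A2, B2)]
bounds = [crude_bound(A1, B1), crude_bound(A2, B2)]
```

In the second pair the large eigenvalue of `B2` is aligned with the large eigenvalue of `A2`, which the matrix comparison detects and the spectral condition does not.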

In light of that remark, we better understand why the framework introduced in this paper yields better results asymptotically, as it allows one to work with the true asymptotic convergence rate. We can see it when comparing with the results stated so far in the EM literature [Dempster et al., 1977, Kunstner et al., 2021, Meng and Rubin, 1994] (that generally do not consider constrained optimization and assume that the mapping $\mathcal{M}$ is differentiable, among other things).

9.2 Quadratic convergence

Under the assumptions of Theorem 2, we established in Section A.1 that for sufficiently large $n$ , we can write $\tilde{\mathcal{B}}_{n}\tilde{\Delta}_{n}=\tilde{\mathcal{A}}_{n}\tilde{\Delta}_{n+1}$ , which is equivalent to

[TABLE]

Note that $\tilde{\mathcal{A}}_{n}$ and $\tilde{\mathcal{B}}_{n}$ cannot be computed as they depend on ${\theta_{\star}}$ (see (36)). However, we can approximate them using only $\theta_{n}$ in order to estimate iteratively ${\theta_{\star}}$ with (40). In the example of unconstrained gradient descent with step-size $1$ , using $\hat{\mathcal{A}}_{n}=I_{q}$ and $\hat{\mathcal{B}}_{n}=I_{q}-\partial^{2}f(\theta_{n})$ (see Example 2.1) corresponds to Newton’s method (see Example 3).
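The link with Newton's method can be checked on a small example. The sketch below assumes the unconstrained case and a particular reading of the extrapolation step: the estimate $t$ of ${\theta_{\star}}$ is taken as the solution of $\theta_{n+1}-t=\hat{\mathcal{S}}_{n}(\theta_{n}-t)$ with $\hat{\mathcal{S}}_{n}=\hat{\mathcal{A}}_{n}^{-1}\hat{\mathcal{B}}_{n}$. This form is an assumption made for the illustration, and the objective below is hypothetical.

```python
import numpy as np

# Hypothetical smooth objective f(x) = sum(exp(x_i)) + 0.5 * ||x - c||^2.
c = np.array([1.0, -2.0, 0.5])
grad = lambda x: np.exp(x) + (x - c)
hess = lambda x: np.diag(np.exp(x)) + np.eye(x.size)

theta_n = np.array([0.3, -0.1, 0.8])
theta_next = theta_n - grad(theta_n)   # gradient descent step with step size 1

# Plug-in estimates: A_hat = I and B_hat = I - Hessian(theta_n), so S_hat = B_hat.
I = np.eye(theta_n.size)
S_hat = I - hess(theta_n)

# ASSUMED extrapolation: solve theta_next - t = S_hat (theta_n - t) for t.
extrapolated = np.linalg.solve(I - S_hat, theta_next - S_hat @ theta_n)

# Standard Newton step from theta_n, for comparison.
newton_step = theta_n - np.linalg.solve(hess(theta_n), grad(theta_n))
```

Algebraically, $I_{q}-\hat{\mathcal{S}}_{n}=\partial^{2}f(\theta_{n})$ and $\theta_{n+1}-\hat{\mathcal{S}}_{n}\theta_{n}=\partial^{2}f(\theta_{n})\theta_{n}-\partial f(\theta_{n})$, so the two vectors coincide up to rounding.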

9.3 Non-convex constrained optimization

Considering Corollary 2 and Lemmas B.9 and B.11, the problem of finding the unique minimizer of non-convex functions can be brought down to finding $\beta$ -smooth approximations of their biconjugates (for an arbitrary $\beta\in\mathbb{R}^{*}_{+}$ ).

Appendix A Proofs

A.1 Asymptotic convergence rate

Proof of Corollary 1.

Let $\left\|\cdot\right\|$ be any norm on $\mathbb{R}^{q}$ . Under (H3), the Taylor expansion with integral remainder yields

[TABLE]

where the last equality follows from the continuity of the function $(\theta,\theta^{\prime})\mapsto\partial_{2}\mathcal{Q}_{\theta}\left(\theta^{\prime}\right)$ . Moreover, since $\theta\mapsto\mathcal{Q}_{\theta}\left({\theta_{\star}}\right)$ is differentiable at ${\theta_{\star}}$ ,

[TABLE]

Summing (41) and (42) and combining with Theorem 1 yields the expected result. ∎

Proof of Theorem 2.

First, note that the theorem is proved if $\check{\uprho}_{\star}=0$ . We now assume that $\check{\uprho}_{\star}>0$ , which in particular implies that $\tilde{\mathcal{B}}_{\star}$ is invertible. In what follows, we use the notation introduced in Section 7. Following the proof of Theorem 1, see in particular the proof of Proposition 2, with the additional assumption that ${\theta_{\star}}\in\mathrm{ri}(\Theta)$ , we can prove that for sufficiently large $n$ , for all $\theta\in\Theta$ ,

[TABLE]

In other words, $\partial_{2}\mathcal{Q}_{\theta_{n}}(\theta_{n+1}),\partial_{2}\mathcal{Q}_{{\theta_{\star}}}({\theta_{\star}})\in\mathsf{V}^{\perp}$ . This implies that for sufficiently large $n$ ,

[TABLE]

where $\mathcal{A}_{n}$ , $\mathcal{B}_{n}$ , $P$ and $\tilde{\Delta}_{n}$ are defined respectively in (36), (25) and (28). As by definition of $P$ , the condition $v\in\mathsf{V}^{\perp}$ is equivalent to the identity $P^{\top}v=0$ , we deduce from (43) that $P^{\top}\mathcal{B}_{n}P\tilde{\Delta}_{n}=P^{\top}\mathcal{A}_{n}P\tilde{\Delta}_{n+1}$ , which can be written as

[TABLE]

Moreover, by (H4), $\tilde{\mathcal{A}}_{\star}$ is positive-definite and by (H2)-(H3), $\tilde{\mathcal{A}}_{n}$ is also positive-definite for sufficiently large $n$ . Then, the invertibility of $\tilde{\mathcal{B}}_{\star}$ allows us to write for sufficiently large $n$ :

[TABLE]

and thus, combining with $\|\tilde{\Delta}_{n}\|_{\tilde{\mathcal{A}}_{n}}=\|\tilde{\mathcal{A}}_{n}^{1/2}\tilde{\Delta}_{n}\|_{2}$ and $\|\tilde{\Delta}_{n+1}\|_{\tilde{\mathcal{A}}_{n}}=\|\tilde{\mathcal{A}}_{n}^{1/2}\tilde{\Delta}_{n+1}\|_{2}$ , we get

[TABLE]

Besides, by the symmetry of $\tilde{\mathcal{A}}_{\star}^{-1/2}\tilde{\mathcal{B}}_{\star}\tilde{\mathcal{A}}_{\star}^{-1/2}$ ,

[TABLE]

Let $\uprho\in(0;\check{\uprho}_{\star})$ . Following the same steps as for the proof of Theorem 1, we can prove that ${|\kern-1.07639pt|\kern-1.07639pt|(\tilde{\mathcal{A}}_{n}^{-1/2}\tilde{\mathcal{B}}_{n}\tilde{\mathcal{A}}_{n}^{-1/2})^{-1}|\kern-1.07639pt|\kern-1.07639pt|}_{2}$ converges to ${|\kern-1.07639pt|\kern-1.07639pt|(\tilde{\mathcal{A}}_{\star}^{-1/2}\tilde{\mathcal{B}}_{\star}\tilde{\mathcal{A}}_{\star}^{-1/2})^{-1}|\kern-1.07639pt|\kern-1.07639pt|}_{2}$ . Together with (45) and (46), this provides the existence of $n_{0}\in\mathbb{N}$ such that for all $n\geqslant n_{0}$ , $\|\tilde{\Delta}_{n}\|_{\tilde{\mathcal{A}}_{\star}}\leqslant\uprho^{-1}\|\tilde{\Delta}_{n+1}\|_{\tilde{\mathcal{A}}_{\star}}$ . We deduce by induction that for all $n\geqslant n_{0}$ , $\|\tilde{\Delta}_{n_{0}}\|_{\tilde{\mathcal{A}}_{\star}}\leqslant\uprho^{-(n-n_{0})}\|\tilde{\Delta}_{n}\|_{\tilde{\mathcal{A}}_{\star}}$ . As the sequence $(\theta_{n})_{n\in\mathbb{N}}$ is not eventually equal to ${\theta_{\star}}$ , by (30) we can choose $n_{0}$ such that $\|\tilde{\Delta}_{n_{0}}\|_{\tilde{\mathcal{A}}_{\star}}\neq 0$ . Hence, since all the norms on a finite dimensional space are equivalent, we deduce $\underset{n\to\infty}{\lim\inf}\;\frac{1}{n}\log\left\|\tilde{\Delta}_{n}\right\|_{2}=\underset{n\to\infty}{\lim\inf}\;\frac{1}{n}\log\left\|\tilde{\Delta}_{n}\right\|_{\tilde{\mathcal{A}}_{\star}}\geqslant\log\uprho$ . The proof is then concluded by applying (30) and by noting that $\uprho$ is arbitrary in $(0;\check{\uprho}_{\star})$ . ∎
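The two-sided control of the exponential rate can be illustrated on a purely linear recursion $\tilde{\Delta}_{n+1}=\mathcal{S}\tilde{\Delta}_{n}$. In the sketch below the matrix `S` is hypothetical, and $\hat{\uprho}_{\star}$ and $\check{\uprho}_{\star}$ are assumed to be its largest and smallest eigenvalues in absolute value, which is an assumption made for the illustration since the defining equations (13) and (14) are not reproduced here.

```python
import numpy as np

def empirical_log_rate(S, delta0, n):
    """(1/n) * log ||S^n delta0||_2, renormalizing at each step to avoid underflow."""
    delta = np.asarray(delta0, dtype=float)
    log_norm = np.log(np.linalg.norm(delta))
    delta = delta / np.linalg.norm(delta)
    for _ in range(n):
        delta = S @ delta
        step_norm = np.linalg.norm(delta)
        log_norm += np.log(step_norm)
        delta = delta / step_norm
    return log_norm / n

# Hypothetical rate matrix with eigenvalues 0.9, 0.5 and 0.2.
S = np.diag([0.9, 0.5, 0.2])
rho_hat, rho_check = 0.9, 0.2

# A generic start excites the slowest mode; a degenerate one stays in the fastest mode.
generic = empirical_log_rate(S, [1.0, 1.0, 1.0], 2000)
degenerate = empirical_log_rate(S, [0.0, 0.0, 1.0], 2000)
```

Both values lie between $\log\check{\uprho}_{\star}$ and $\log\hat{\uprho}_{\star}$, and the degenerate start shows why an extra condition on the directions spanned by the errors is needed to obtain the exact rate.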

Proof of Theorem 3.

In this proof, we use the notation introduced in Section 7. Define

[TABLE]

Note that $\tilde{\mathcal{S}}_{\star}$ is similar to the symmetric matrix $\tilde{\mathcal{A}}_{\star}^{-1/2}\tilde{\mathcal{B}}_{\star}\tilde{\mathcal{A}}_{\star}^{-1/2}$ and is therefore diagonalizable. Hence, there exists an invertible matrix $R=[R(i,j)]_{1\leqslant i,j\leqslant d}\in\mathbb{R}^{d\times d}$ such that $\tilde{\mathcal{S}}_{\star}=\tilde{\mathcal{A}}_{\star}^{-1}\tilde{\mathcal{B}}_{\star}=R\tilde{\mathcal{D}}_{\star}R^{-1}$ where $\tilde{\mathcal{D}}_{\star}$ is a diagonal matrix. For any matrix $S\in\mathbb{R}^{d\times d}$ , it is convenient to use the notation $S^{R}=R^{-1}SR$ . In particular, we have $\tilde{\mathcal{S}}_{\star}^{R}=\tilde{\mathcal{D}}_{\star}$ . Moreover, for any vector $\Delta\in\mathbb{R}^{d}$ , we use the notation $\Delta^{R}:=R^{-1}\Delta$ . Write

[TABLE]

where $\mathcal{A}_{n}$ and $\mathcal{B}_{n}$ are defined in (36) and $\tilde{\mathcal{S}}_{n}$ is well-defined for sufficiently large $n$ as $\tilde{\mathcal{A}}_{\star}$ is positive-definite by (H4), and using (H2)-(H3). Besides, Theorems 1 and 2 provide $\check{\uprho}_{\star},\hat{\uprho}_{\star}\in[0;1)$ . As the theorem is proved if $\check{\uprho}_{\star}=0$ , we now assume that $\check{\uprho}_{\star}>0$ , which implies the invertibility of $\tilde{\mathcal{B}}_{\star}$ and hence of $\tilde{\mathcal{S}}_{\star}$ . Moreover, Theorem 1 also yields that for all $\uprho\in(\hat{\uprho}_{\star};1)$ , $\theta_{n}-{\theta_{\star}}=o(\uprho^{n})$ . We deduce by the $C^{2}$ -differentiability of $\partial_{2}\mathcal{Q}$ in a neighborhood of $({\theta_{\star}},{\theta_{\star}})$ that for all $\uprho\in(\hat{\uprho}_{\star};1)$ ,

[TABLE]

Following the proof of Theorem 2, by (44) there exists $n_{0}\in\mathbb{N}$ such that for all $n\geqslant n_{0}$ , $\tilde{\mathcal{B}}_{n}\tilde{\Delta}_{n}=\tilde{\mathcal{A}}_{n}\tilde{\Delta}_{n+1}$ , that is, $\tilde{\mathcal{S}}_{n}\tilde{\Delta}_{n}=\tilde{\Delta}_{n+1}$ or equivalently $\tilde{\mathcal{S}}_{n}^{R}\tilde{\Delta}_{n}^{R}=\tilde{\Delta}_{n+1}^{R}$ . This implies for all $m\in\mathbb{N}^{*}$ ,

[TABLE]

Component-wise, this yields for such $n,m\in\mathbb{N}^{*}$ that for all $i\in\llbracket 1:d\rrbracket$ ,

[TABLE]

where for any $k\in\llbracket 0:m-1\rrbracket$ , $L_{n,m,k}(i)^{\top}$ denotes the $i$ -th row of the matrix

[TABLE]

Recalling that ${\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\cdot\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{F}$ is the Frobenius norm, we let $C_{F}>0$ be a constant such that ${\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\cdot\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{F}\leqslant{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\cdot\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}C_{F}$ on $\mathbb{R}^{d\times d}$ . Let $i\in\llbracket 1:d\rrbracket$ such that $|\tilde{\mathcal{D}}_{\star}(i,i)|=\hat{\uprho}_{\star}$ . Using the Cauchy-Schwarz inequality we deduce

[TABLE]

Let $\delta>\max(\hat{\uprho}_{\star}^{-1},\hat{\uprho}_{\star}\check{\uprho}_{\star}^{-1})$ . Pick $\uprho\in(\hat{\uprho}_{\star},1)$ and $\varepsilon>0$ such that $\uprho(\check{\uprho}_{\star}^{-1}+\varepsilon)<\delta$ . By (47) there exist $C>0$ and $n_{1}\geqslant n_{0}$ such that for all $n\geqslant n_{1}$ ,

[TABLE]

Then, (48) yields for all $n\geqslant n_{1}$ and $m\in\mathbb{N}^{*}$ , using $\hat{\uprho}_{\star}^{-1}<\delta$ and $\uprho(\check{\uprho}_{\star}^{-1}+\varepsilon)<\delta$ ,

[TABLE]

We now show that there exists $n\geqslant n_{1}$ such that $\tilde{\Delta}_{n}^{R}(i)=R^{-1}P^{\top}\Delta_{n}(i)\neq 0$ . Indeed, otherwise, by Lemma A.1 below, there exists a basis $w_{1},\ldots,w_{d}$ of $\mathsf{V}$ such that $\Delta_{n}=\theta_{n}-{\theta_{\star}}\in\mathrm{Span}(w_{j},j\in\llbracket 1:d\rrbracket\setminus\{i\})$ for any $n\geqslant n_{1}$ , which contradicts the assumption $\mathrm{Span}(\Delta_{n},n\geqslant n_{1})=\mathsf{V}$ in the statement of Theorem 3. Therefore, we can choose $n\geqslant n_{1}$ such that the lhs of (50) is strictly positive. This $n$ being chosen, take the log in the previous inequality and divide by $m$ . Letting $m$ go to infinity, we then obtain for any $\delta>\max(\hat{\uprho}_{\star}^{-1},\hat{\uprho}_{\star}\check{\uprho}_{\star}^{-1})$ ,

[TABLE]

where the last equality follows from (30). The proof is completed since $\delta$ is arbitrary provided that $\delta>\max(\hat{\uprho}_{\star}^{-1},\hat{\uprho}_{\star}\check{\uprho}_{\star}^{-1})$ , which is equivalent to $\delta^{-1}<\min(\hat{\uprho}_{\star},\hat{\uprho}_{\star}^{-1}\check{\uprho}_{\star})$ . ∎

Lemma A.1.

Let $\mathsf{V}$ be a $d$ -dimensional linear subspace of $\mathbb{R}^{q}$ . Let $v_{1},\ldots,v_{d}\in\mathbb{R}^{q}$ be an orthonormal basis of $\mathsf{V}$ and let $w_{1},\ldots,w_{d}\in\mathbb{R}^{q}$ be another basis obtained from $(v_{i})_{1\leqslant i\leqslant d}$ by the change-of-basis matrix $R$ , that is, for any $j\in\llbracket 1:d\rrbracket$ , $w_{j}=\sum_{i=1}^{d}R(i,j)v_{i}$ .

Then, for any $\Delta\in\mathsf{V}$ , the $i$ -th component of the decomposition of $\Delta$ on the basis $(w_{i})_{1\leqslant i\leqslant d}$ is $R^{-1}P^{\top}\Delta(i)$ , where $P$ is the matrix

$$P:=\begin{pmatrix}v_{1}&\cdots&v_{d}\end{pmatrix}\in\mathbb{R}^{q\times d}.$$

Proof.

Decomposing the vector $\Delta\in\mathsf{V}$ on the basis $(v_{j})_{1\leqslant j\leqslant d}$ and using $v_{j}=\sum_{i=1}^{d}R^{-1}(i,j)w_{i}$ , we get

[TABLE]

∎
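The change-of-basis statement of Lemma A.1 is easy to verify numerically. The sketch below assumes that $P$ stacks $v_{1},\ldots,v_{d}$ as columns, so that the matrix with columns $w_{1},\ldots,w_{d}$ is $PR$; the subspace and the matrix `R` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
q, d = 6, 3

# Columns of P: orthonormal basis v_1, ..., v_d of a hypothetical subspace V.
P, _ = np.linalg.qr(rng.standard_normal((q, d)))

# Hypothetical invertible change-of-basis matrix (its determinant is 7).
R = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 3.0]])
W = P @ R   # columns: w_j = sum_i R(i, j) v_i

delta = P @ rng.standard_normal(d)          # arbitrary element of V
coords_w = np.linalg.solve(R, P.T @ delta)  # R^{-1} P^T delta

# Recombining the basis (w_j) with these coordinates should give back delta.
reconstructed = W @ coords_w
```

Using `np.linalg.solve` rather than an explicit inverse is a numerical convenience and does not change the identity being checked.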

A.2 Comments on (H1)

Proof of Theorem 4.

The proof amounts to building an analogous framework that satisfies (H1)-(H4). The first step is to define a suitable minimization set that is convex. Denote by $d$ the dimension of the submanifold $\mathsf{S}$ and for all $x\in\mathbb{R}^{q}$ , $R>0$ , set $\mathbf{B}\left(x,R\right):=\left\{y\in\mathbb{R}^{q}\;:\;\left\|x-y\right\|_{2}<R\right\}$ .

Under (H’1) there exist $\mathsf{U}_{1}$ , $\mathsf{U}_{2}$ two open neighborhoods of ${\theta_{\star}}$ and the null-vector $\mathbf{0}$ in $\mathbb{R}^{q}$ , respectively, and a $C^{2}$ -diffeomorphism $\psi\colon\mathsf{U}_{1}\rightarrow\mathsf{U}_{2}$ such that $\psi({\theta_{\star}})=\mathbf{0}$ and $\psi(\mathsf{U}_{1}\cap\mathsf{S})=\mathsf{U}_{2}\cap(\mathbb{R}^{d}\times\{0\}^{q-d})$ . Let $\mathsf{N}$ be an open neighborhood of ${\theta_{\star}}$ such that $\mathcal{Q}$ meets the conditions of (H3) on $\mathsf{N}\times\mathsf{N}$ . Define $\mathsf{U}:=\mathsf{U}_{1}\cap\mathsf{N}\cap\mathring{\mathsf{E}}$ , which is an open set containing ${\theta_{\star}}$ by (H’1). Set $r>0$ such that $\mathbf{B}\left({\theta_{\star}},r\right)\subset\mathsf{U}$ , and $\varepsilon>0$ such that $\mathbf{B}\left(\mathbf{0},\varepsilon\right)\subset\psi(\mathbf{B}\left({\theta_{\star}},r/2\right))$ . Define $\mathsf{W}:=\mathbb{R}^{d}\times\{0\}^{q-d}$ and the convex set

[TABLE]

Denote by $\phi$ the corestriction of $\psi$ to $\Xi$ , that is, $\phi\colon\psi^{-1}(\Xi)\rightarrow\Xi$ such that $\phi(x)=\psi(x)$ for any $x\in\psi^{-1}(\Xi)$ . Note that $\phi$ is still a $C^{2}$ -diffeomorphism. Define then

[TABLE]

Under (H2) there exists $n_{0}\in\mathbb{N}$ such that $\theta_{n}\in\phi^{-1}(\Xi)$ for all $n\geqslant n_{0}$ , which allows us to define on $\Xi$ the sequence

[TABLE]

We deduce from the definition of $(\theta_{n})_{n\in\mathbb{N}}$ in (1) and from $\phi^{-1}(\Xi)\subset\mathsf{U}\subset\mathsf{E}$ that for all $n\in\mathbb{N}$ and $\zeta\in\Xi$ ,

[TABLE]

The framework defined by (51-53) thus fits into (1) and meets (H1)-(H3) with $(\theta_{n},{\theta_{\star}},\mathcal{Q})$ replaced by $(\zeta_{n},\mathbf{0},\mathcal{R})$ . For consistency of notation, we write $\zeta_{\star}:=\phi({\theta_{\star}})=\mathbf{0}$ . We now prove that (H4) is satisfied with $({\theta_{\star}},\mathcal{Q})$ replaced by $(\zeta_{\star},\mathcal{R})$ .

Denote by $J_{\phi}$ the Jacobian matrix of $\phi$ . The fact that $\phi^{-1}(\Xi)\subset\mathsf{U}\subset\mathsf{N}$ allows us to write for all $\theta,\theta^{\prime}\in\phi^{-1}(\Xi)$ ,

[TABLE]

using that the image of $\phi$ is included in $\mathbb{R}^{d}\times\{0\}^{q-d}$ , i.e. $\phi_{i}\equiv 0$ for all $i\in\llbracket d+1:q\rrbracket{}$ . This yields

[TABLE]

Besides, (54) and Lemma 7.1 provide that $\zeta_{\star}$ is a local minimizer of the function $\zeta\mapsto\mathcal{R}_{\zeta_{\star}}\left(\zeta\right)$ . Together with the convexity of $\Xi$ , the fact that $\zeta_{\star}\in\mathrm{ri}(\Xi)$ and that $\mathrm{Aff}(\Xi)=\mathsf{W}$ , this implies $\partial_{2}\mathcal{R}_{\zeta_{\star}}\left(\zeta_{\star}\right)\in\mathsf{W}^{\perp}$ . As $\mathsf{W}=\mathbb{R}^{d}\times\{0\}^{q-d}$ , the second term of the rhs in (56) is thus null. Combining with ${\mathsf{T}_{\star}}=J_{\phi}({\theta_{\star}})^{-1}\mathsf{W}$ by definition of the tangent space of a submanifold, we deduce from (55-56) that the rates $\check{\uprho}_{\star},\hat{\uprho}_{\star}$ defined in (13-14) for $\mathcal{R}$ are equal to the rates $\breve{\uprho}_{\star},\invbreve{\uprho}_{\star}$ defined in (15) for $\mathcal{Q}$ . Satisfying (H4) for $\mathcal{R}$ is thus equivalent to satisfying (H’4) for $\mathcal{Q}$ .

Therefore, we can apply Theorems 1 and 2 to the sequence $(\zeta_{n})_{n\in\mathbb{N}}$ and we get for any $(\uprho_{1},\uprho_{2})\in(\invbreve{\uprho}_{\star},1)\times(0,\breve{\uprho}_{\star})$

[TABLE]

To relate with the speed of convergence of $\theta_{n}-{\theta_{\star}}$ , note that for all $n\in\mathbb{N}$ ,

[TABLE]

and that $\psi^{-1}(\Xi)\subset\mathbf{B}\left({\theta_{\star}},r/2\right)$ by (51). The $C^{1}$ -differentiability of $\psi$ on $\bar{\mathbf{B}}({\theta_{\star}},r/2)$ and of $\psi^{-1}$ on $\overline{\Xi}$ provides the existence of $C>0$ such that $\sup_{\theta\in\mathbf{B}\left({\theta_{\star}},r/2\right)}{|\kern-1.07639pt|\kern-1.07639pt|\mathrm{d}_{\psi}(\theta)|\kern-1.07639pt|\kern-1.07639pt|}_{2}\leqslant C$ and $\sup_{\zeta\in\Xi}{|\kern-1.07639pt|\kern-1.07639pt|\mathrm{d}_{\phi^{-1}}(\zeta)|\kern-1.07639pt|\kern-1.07639pt|}_{2}\leqslant C$ , where $\mathrm{d}_{\psi}$ and $\mathrm{d}_{\phi^{-1}}$ denote the differentials of $\psi$ and $\phi^{-1}$ , respectively. Thus $\psi$ is Lipschitz on $\mathbf{B}\left({\theta_{\star}},r/2\right)$ and $\phi^{-1}$ is Lipschitz on $\Xi$ . Combining with (58) and (57) concludes the proof. ∎

A.3 Comments on (H2)

We first prove a general version of Proposition 2.

Proposition 3.

Assume that (H1), ( $\mathsf{H}$ 2.4), (H3) and (H4) hold. Let $\left\|\cdot\right\|$ be a norm on $\mathbb{R}^{q}$ . Then, for all $\uprho>\hat{\uprho}_{\star}$ , there exists $\delta>0$ such that for all $\theta\in\Theta$ , $\theta^{\prime}\in\mathcal{M}\left(\theta\right)$ ,

[TABLE]

where the notations $\tilde{\theta}$, $\tilde{\theta}^{\prime}$ and $\tilde{\theta}_{\star}$ are defined in (26).

Proof.

In this proof, we use the notation introduced in Section 7. For all $\delta>0$ , write $\mathbf{B}\left({\theta_{\star}},\delta\right):=\{\theta\in\mathbb{R}^{q}\;:\;\left\|\theta-{\theta_{\star}}\right\|<\delta\}$ . Under (H3) there exists $\delta_{0}>0$ such that $\partial_{2}\mathcal{Q}$ is well-defined and $C^{1}$ -differentiable on $\mathbf{B}\left({\theta_{\star}},\delta_{0}\right)\times\mathbf{B}\left({\theta_{\star}},\delta_{0}\right)$ . By (H1), ( $\mathsf{H}$ 2.4) and (H3), we can prove, similarly to (33) in the proof of Proposition 2, that for all $\theta,\theta^{\prime}\in\mathbf{B}\left({\theta_{\star}},\delta_{0}\right)$ with $\theta^{\prime}\in\mathcal{M}\left(\theta\right)$ ,

[TABLE]

which yields

[TABLE]

where

[TABLE]

Under (H3)-(H4) there also exists $\delta_{1}\in(0;\delta_{0})$ such that if $\theta,\theta^{\prime}\in\mathbf{B}\left({\theta_{\star}},\delta_{1}\right)$ , then the symmetric matrix $\tilde{\mathcal{A}}_{\theta\theta^{\prime}}$ is positive-definite. Using (29) and applying Lemma B.1 to (59) then provides

[TABLE]

where $\hat{\uprho}_{\theta\theta^{\prime}}:={|\kern-1.07639pt|\kern-1.07639pt|\tilde{\mathcal{A}}_{\theta\theta^{\prime}}^{-1/2}\tilde{\mathcal{B}}_{\theta\theta^{\prime}}\tilde{\mathcal{A}}_{\theta\theta^{\prime}}^{-1/2}|\kern-1.07639pt|\kern-1.07639pt|}_{2}$ .

Let $\uprho>\hat{\uprho}_{\star}$ . Set $\uprho^{\prime}:=(\uprho+\hat{\uprho}_{\star})/2$ and $\varepsilon>0$ such that $(1+\varepsilon)\uprho^{\prime}\leqslant(1-\varepsilon)\uprho$ . By (H3)-(H4) and Lemma B.4 there exists $\delta_{2}\in(0;\delta_{1})$ such that for all $\theta,\theta^{\prime}\in\mathbf{B}\left({\theta_{\star}},\delta_{2}\right)$ , for all $u\in\mathbb{R}^{d}$ ,

[TABLE]

Moreover, by Lemmas B.2 and B.3, under (H3) there exists $\delta_{3}\in(0;\delta_{2})$ such that $\hat{\uprho}_{\theta\theta^{\prime}}\leqslant\uprho^{\prime}$ for all $\theta,\theta^{\prime}\in\mathbf{B}\left({\theta_{\star}},\delta_{3}\right)$ . Combining with (60) yields that for all $\theta,\theta^{\prime}\in\mathbf{B}\left({\theta_{\star}},\delta_{3}\right)$ with $\theta^{\prime}\in\mathcal{M}\left(\theta\right)$ ,

[TABLE]

∎

Proof of Theorem 5.

Applying Proposition 3 to $\uprho:=(1+\hat{\uprho}_{\star})/2$ and the norm $\left\|\cdot\right\|_{2}$ provides the existence of $\delta_{0}>0$ such that for all $\theta\in\Theta$ , $\theta^{\prime}\in\mathcal{M}\left(\theta\right)$ ,

[TABLE]

Moreover, by Lemma B.6 under ( $\mathsf{H}$ 2.1)-( $\mathsf{H}$ 2.2) and by ( $\mathsf{H}$ 2.4), there exists $\delta_{1}\in(0;\delta_{0})$ such that for all $\theta\in\Theta$ , $\theta^{\prime}\in\mathcal{M}\left(\theta\right)$ ,

[TABLE]

By (30) and the equivalence of norms in finite dimension, combining with (61) yields the existence of $\delta>0$ such that for all $\theta\in\Theta$ , $\theta^{\prime}\in\mathcal{M}\left(\theta\right)$ ,

[TABLE]

Besides, by ( $\mathsf{H}$ 2.3) there exists $n_{0}\in\mathbb{N}$ such that $\|\tilde{\theta}_{n_{0}}-\tilde{\theta}_{\star}\|_{\tilde{\mathcal{A}}_{\star}}\leqslant\delta$ . Using that $\uprho<1$ by (H4), we deduce by induction that for all $n\geqslant n_{0}$ , $\|\tilde{\theta}_{n}-\tilde{\theta}_{\star}\|_{\tilde{\mathcal{A}}_{\star}}\leqslant\delta$ , and that

[TABLE]

which concludes the proof. ∎
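The induction closing this proof can be illustrated on a toy example. The sketch below is not the setting of Theorem 5: it assumes a hypothetical scalar map `M` with linear rate `rho_hat` at its fixed point `theta_star` and a quadratic remainder with coefficient `c` (all names and values are illustrative). With $\uprho:=(1+\hat{\uprho}_{\star})/2$ as in the proof, one step contracts the error by at most $\uprho$ inside a ball of radius `delta`, so an iterate that enters the ball never leaves it and the error decays geometrically.

```python
# Toy illustration (hypothetical scalar map, not the paper's general setting).
rho_hat = 0.5                      # assumed local linear rate at the fixed point
rho = 0.5 * (1.0 + rho_hat)        # rho := (1 + rho_hat) / 2, as in the proof
c = 2.0                            # assumed size of the quadratic remainder
theta_star = 1.0                   # fixed point of the toy map
delta = (rho - rho_hat) / c        # radius on which one step contracts by rho

def M(theta):
    """Toy update: linear contraction plus a quadratic remainder."""
    e = theta - theta_star
    return theta_star + rho_hat * e + c * e * e

theta = theta_star + delta         # start on the boundary of the ball
errors = [abs(theta - theta_star)]
for _ in range(30):
    theta = M(theta)
    errors.append(abs(theta - theta_star))

# Induction step: every iterate stays in the ball of radius delta ...
stays_in_ball = all(e <= delta for e in errors)
# ... and the error is bounded by rho**n times the initial error.
geometric = all(errors[n] <= rho ** n * errors[0] + 1e-15
                for n in range(len(errors)))
# Asymptotically the observed ratio approaches rho_hat itself.
last_ratio = errors[-1] / errors[-2]
```

In this toy case the bound with $\uprho$ is conservative: the ratio of successive errors tends to `rho_hat`, in line with the idea that any $\uprho>\hat{\uprho}_{\star}$ is an admissible rate.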

Lemma A.2.

Assume that $\partial_{2}\mathcal{Q}$ is well-defined and differentiable on $\Theta$, and that for all $\theta,\theta^{\prime}\in\Theta$, for all $v\in\mathsf{V}$, $v^{\top}\partial_{12}\mathcal{Q}_{\theta}\left(\theta^{\prime}\right)v<0$. Then, for all $\theta,\theta^{\prime}\in\Theta$,

[TABLE]

Proof.

Let $\theta^{\prime\prime}\in\mathcal{M}\left(\theta\right)\cap\mathcal{M}\left(\theta^{\prime}\right)\cap\mathrm{ri}(\Theta)$ . We can prove as for Theorem 2 that it implies $\tilde{\mathcal{B}}_{\theta\theta^{\prime}}(\tilde{\theta}-\tilde{\theta}^{\prime})=\tilde{\mathcal{A}}_{\theta\theta^{\prime}}(\tilde{\theta}^{\prime\prime}-\tilde{\theta}^{\prime\prime})=0$ , where

[TABLE]

Besides, by assumption, for all $v\in\mathsf{V}$, $v^{\top}\mathcal{B}_{\theta\theta^{\prime}}v=-\int_{0}^{1}v^{\top}\partial_{12}\mathcal{Q}_{s\theta^{\prime}+(1-s)\theta}\left(\theta^{\prime\prime}\right)v\mathrm{d}s>0$. This provides the invertibility of $\tilde{\mathcal{B}}_{\theta\theta^{\prime}}$ and thus $\tilde{\theta}-\tilde{\theta}^{\prime}=0$, which concludes the proof by (30). ∎

Proof of Proposition 1.

It is proved in [Bubeck, 2015, Theorem 4.4, p.305] that under assumptions (ii) and (iii), for all $\theta\in\mathsf{C}\cap\mathsf{D}$ and $n\geqslant 1$ ,

[TABLE]

We deduce by applying Lemma B.5 under (iv) and the compactness of $\mathsf{C}$ that there exists $\varphi\colon\mathbb{N}\rightarrow\mathbb{N}$ strictly increasing such that $(\zeta_{\varphi(n)})_{n\in\mathbb{N}}$ converges to ${\theta_{\star}}$. By (11) this is equivalent to $(\mathcal{M}(\theta_{\varphi(n)-1}))_{n\in\mathbb{N}^{*}}$ converging to ${\theta_{\star}}$. Besides, (iv) and the differentiability of $f$ provide $\mathcal{M}\left({\theta_{\star}}\right)=\{{\theta_{\star}}\}$ (see Example 2.1), and by Lemma B.7 under ($\mathsf{H}$2.1)-($\mathsf{H}$2.2) the function $\mathcal{M}$ is continuous on $\Theta$. We deduce that all accumulation points $\ell$ of the sequence $(\theta_{\varphi(n)-1})_{n\in\mathbb{N}^{*}}$ satisfy $\mathcal{M}(\ell)={\theta_{\star}}=\mathcal{M}\left({\theta_{\star}}\right)$. By Lemma A.2 under (i), (ii), (iii) and (v) (using (70-71)), this yields $\ell={\theta_{\star}}$ for all accumulation points, and thus the convergence of $(\theta_{\varphi(n)-1})_{n\in\mathbb{N}^{*}}$ to ${\theta_{\star}}$ by the compactness of $\mathsf{C}$.

Using that $\mathcal{M}\left({\theta_{\star}}\right)=\{{\theta_{\star}}\}$, we can prove as in Example 2.1 that $\mathcal{M}^{m}({\theta_{\star}})=\{{\theta_{\star}}\}$, where $\mathcal{M}^{m}$ is the minimization mapping corresponding to mirror prox (see (12)). ∎

Proof of Theorem 6.

Under ($\mathsf{H}$2.3) there exists $\psi\colon\mathbb{N}\rightarrow\mathbb{N}$ strictly increasing such that $(\theta_{\psi(n)})_{n\in\mathbb{N}}$ converges to ${\theta_{\star}}$. By the compactness of $\Theta\times\Theta$ under ($\mathsf{H}$2.1) there also exist ${\theta_{\star\star}}\in\Theta$ and $\varphi\colon\mathbb{N}\rightarrow\mathbb{N}$ strictly increasing such that

[TABLE]

By the monotonicity of the sequence $(\vartheta(\theta_{n}))_{n\in\mathbb{N}}$ under ( $\tilde{\mathsf{H}}$ 4.2), for all $n\in\mathbb{N}$ , $\vartheta(\theta_{\varphi(n+1)})\leqslant\vartheta(\theta_{\varphi(n)+1})\leqslant\vartheta(\theta_{\varphi(n)})$ . Together with (62) and the continuity of $\vartheta$ this yields

[TABLE]

Besides, by the definition of $(\theta_{n})_{n\in\mathbb{N}}$ , for all $\theta\in\Theta$ ,

[TABLE]

Using the continuity of $\mathcal{Q}$ under ( $\mathsf{H}$ 2.2), this yields ${\theta_{\star\star}}\in\mathcal{M}({\theta_{\star}})$ . We deduce under ( $\tilde{\mathsf{H}}$ 4.2) that ${\theta_{\star}}={\theta_{\star\star}}$ , and hence $\mathcal{M}\left({\theta_{\star}}\right)=\{{\theta_{\star}}\}$ under ( $\tilde{\mathsf{H}}$ 4.1). ∎

A.4 Comments on (H4)

Proof for Example 1.1 (Population EM).

By (5), for all $\theta,\theta^{\prime}\in\Theta$ ,

[TABLE]

which yields, in a neighborhood of $({\theta_{\star}},{\theta_{\star}})$ ,

[TABLE]

On the other hand, using that for all $\theta,\theta^{\prime}\in\Theta$ ,

[TABLE]

we deduce that in a neighborhood of $({\theta_{\star}},{\theta_{\star}})$ ,

[TABLE]

Using the chain rule for Fisher information matrices [Zamir, 1998, Zegers, 2015],

[TABLE]

∎

Proofs for Example 1.2 (Sample EM).

Similarly to the proof of Example 1.1, we deduce from (4) that for all $\theta,\theta^{\prime}$ in a neighborhood of ${\theta_{\star}}$ ,

[TABLE]

and therefore,

[TABLE]

Lemma A.3.

Assume that ${\theta_{\star}}$ is the true parameter of the model, that for all $(x,y)\in\mathsf{X}\times\mathsf{Y}$, the functions $\theta\mapsto p_{\theta}(x|y)$ and $\theta\mapsto p_{\theta}(y)$ are twice differentiable in a neighborhood of ${\theta_{\star}}$, and that conditions similar to [Douc et al., 2013, Assumption AD.1, p.492] hold to differentiate under the integral sign. Then, if the corresponding population EM meets (H4), almost surely the sample EM meets (H4) for sufficiently large $k$ and

[TABLE]

Furthermore, if $\partial^{2}\log p_{\theta_{\star}}(X_{1},Y_{1}),\partial^{2}\log p_{{\theta_{\star}}}(Y_{1})\in\mathrm{L}^{2}(\mathbb{R}^{q\times q})$ , then for all $\delta\in(0;1)$ there exists $C_{\delta}>0$ such that

[TABLE]

Proof.

As $\hat{\uprho}_{\star}^{\mathrm{samp}}\left(Y_{1:k}\right)$ is not necessarily well-defined in (13), using Lemma B.3 we consider the following definition:

[TABLE]

Note first that $\hat{\uprho}_{\star}$ is a measurable function of $Y_{1:k}$ . Besides, we deduce from the assumptions that the following random variables are integrable:

[TABLE]

Together with (63-64) and (65-66), the strong law of large numbers provides

[TABLE]

By the continuity of the functions used in (67), this yields that if the corresponding population EM meets (H4), then, almost surely, the sample EM meets (H4) for sufficiently large $k$ , and

[TABLE]

Assume now that $\partial^{2}\log p_{\theta_{\star}}(X_{1},Y_{1}),\partial^{2}\log p_{{\theta_{\star}}}(Y_{1})\in\mathrm{L}^{2}(\mathbb{R}^{q\times q})$. First, using Jensen’s inequality with $\left\|\cdot\right\|_{2}^{2}$ provides $W_{1}\in\mathrm{L}^{2}(\mathbb{R}^{q\times q})$, and thus $Z_{1}=W_{1}-\partial^{2}\log p_{\theta_{\star}}(Y_{1})\in\mathrm{L}^{2}(\mathbb{R}^{q\times q})$. Let $\delta\in(0;1)$ and write $\bar{Z}_{k}=\sum_{i=1}^{k}Z_{i}/k$. Let $\delta^{\prime}\in(0;1)$ be such that $4q^{2}\delta^{\prime}\leqslant\delta$ and write $x(\delta^{\prime})$ for the quantile of order $1-\delta^{\prime}$ of the standard Gaussian distribution. Applying the central limit theorem to each component of $Z_{1}$ provides for all $i,j\in\llbracket 1:q\rrbracket$ and $k\in\mathbb{N}^{*}$,

[TABLE]

where $\mu_{ij}:=\mathbb{E}[Z_{1}(i,j)]$ , $\sigma_{ij}^{2}:=\mathrm{Var}\left[Z_{1}(i,j)\right]$ and $\sigma^{2}:=\mathbb{E}[{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|Z_{1}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{F}^{2}]\vee\mathbb{E}[{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|Z_{2}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{F}^{2}]$ . This yields

[TABLE]

Similarly, the same inequality holds for $\partial_{22}\mathcal{Q}^{\mathrm{samp}}_{{\theta_{\star}}}({\theta_{\star}})$ . Let $C_{F}>0$ such that ${\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\cdot\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{F}\geqslant C_{F}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\cdot\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}$ on $\mathbb{R}^{q\times q}$ . We deduce using (29) and (30) that

[TABLE]

Let $\varepsilon>0$ and $C>0$ be constants obtained by applying Lemma B.2 to $(\partial_{22}\tilde{\mathcal{Q}}^{\mathrm{pop}}_{{\theta_{\star}}}({\theta_{\star}}),\partial_{12}\tilde{\mathcal{Q}}^{\mathrm{pop}}_{{\theta_{\star}}}({\theta_{\star}}))$ . We deduce from (69) that

[TABLE]

which proves (18). ∎

∎

Proof for Example 2.1 (Mirror descent, cont.).

For all $\theta,\theta^{\prime}\in\Theta$ in a neighborhood of ${\theta_{\star}}$ ,

[TABLE]

∎

Proof for Example 1.3 (The $\alpha$ -EM algorithm).

Let $\alpha\in\mathbb{R}\setminus\{0,1\}$ . The function $f_{\alpha}$ is defined on $\mathbb{R}_{+}^{*}$ by $f_{\alpha}\colon x\mapsto(1-x^{\alpha})/(\alpha(\alpha-1))$ . Note that $f_{\alpha}(1)=0$ and that for all differentiable functions $g$ taking values in $\mathbb{R}_{+}^{*}$ ,

[TABLE]

The function $\mathcal{Q}^{\alpha}$ defined in (21) can be written as $\mathcal{Q}^{\alpha}_{\theta}\left(\theta^{\prime}\right)=-\int_{\mathsf{X}}p_{\theta}(x|Y)F^{\alpha}_{\theta}(\theta^{\prime})\mu(\mathrm{d}x)$ , where $F^{\alpha}\colon(\theta,\theta^{\prime})\mapsto f_{\alpha}\left(p_{\theta^{\prime}}(x,Y)/p_{\theta}(x,Y)\right)$ . For all $\theta\in\Theta$ , $F^{\alpha}_{\theta}(\theta)=0$ , and under the assumptions of Example 1.1 with $f_{\alpha}$ instead of $f_{0}$ , for all $\theta,\theta^{\prime}\in\Theta$ ,

[TABLE]

We deduce

[TABLE]

At a population level this yields $\partial_{22}\mathcal{Q}^{\alpha}_{{\theta_{\star}}}\left({\theta_{\star}}\right)=I_{X,Y}({\theta_{\star}})$ and

[TABLE]

Regarding the value of $\hat{\uprho}_{\star}^{\alpha}$, note that $\tilde{\mathcal{A}}_{\star}^{-1}\tilde{\mathcal{B}}_{\star}$ is similar to the symmetric matrix $\tilde{\mathcal{A}}_{\star}^{-1/2}\tilde{\mathcal{B}}_{\star}\tilde{\mathcal{A}}_{\star}^{-1/2}$ and is therefore diagonalizable. Besides, we deduce from (16) and (22) that

[TABLE]

This yields $\mathrm{Spec}((\tilde{\mathcal{A}}_{\star}^{\alpha})^{-1}\tilde{\mathcal{B}}_{\star}^{\alpha})=g_{\alpha}(\mathrm{Spec}(\tilde{\mathcal{A}}_{\star}^{-1}\tilde{\mathcal{B}}_{\star}))$ where $g_{\alpha}(x):=(x-\alpha)/(1-\alpha)$ . We obtain the optimal $\alpha$ by equating $g_{\alpha}(\hat{\uprho}_{\star})=-g_{\alpha}(\check{\uprho}_{\star})$ . ∎
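The balancing condition $g_{\alpha}(\hat{\uprho}_{\star})=-g_{\alpha}(\check{\uprho}_{\star})$ can be solved in closed form: it gives $\alpha=(\hat{\uprho}_{\star}+\check{\uprho}_{\star})/2$. The following minimal sketch checks this numerically, assuming a real spectrum contained in $[\check{\uprho}_{\star},\hat{\uprho}_{\star}]\subset[0,1)$. The spectrum values and the function names are illustrative, and the value $\alpha=0$, for which the formula for $g_{\alpha}$ reduces to the identity, is used only as a baseline for comparison.

```python
def g_alpha(x, alpha):
    """Affine map g_alpha(x) = (x - alpha) / (1 - alpha) from the proof."""
    return (x - alpha) / (1.0 - alpha)

def balanced_alpha(rho_check, rho_hat):
    """Solve g_alpha(rho_hat) = -g_alpha(rho_check) for alpha."""
    return 0.5 * (rho_check + rho_hat)

def worst_rate(spectrum, alpha):
    """Largest modulus of the transformed spectrum."""
    return max(abs(g_alpha(x, alpha)) for x in spectrum)

# Hypothetical spectrum of the (diagonalizable) matrix, for illustration only.
spectrum = [0.2, 0.5, 0.8]
rho_check, rho_hat = min(spectrum), max(spectrum)

alpha_star = balanced_alpha(rho_check, rho_hat)
rate_star = worst_rate(spectrum, alpha_star)   # rate after balancing
rate_base = worst_rate(spectrum, 0.0)          # g_0 is the identity map

# The two extreme eigenvalues are mapped to opposite values.
balance_gap = g_alpha(rho_hat, alpha_star) + g_alpha(rho_check, alpha_star)
```

Since $g_{\alpha}$ is affine, the largest modulus over the spectrum is attained at one of the two extreme eigenvalues, which is why balancing them is the natural criterion.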

Proof for Example 2.2 (Mirror prox, cont.).

To begin with, the assumptions imply that the corresponding mirror descent meets (H3)-(H4) and ($\mathsf{H}$2.1)-($\mathsf{H}$2.2), and that ${\theta_{\star}}=\mathcal{M}\left({\theta_{\star}}\right)\in\mathrm{ri}(\Theta)$. Together with the fact that the mapping is point-to-point on $\Theta$ (see Example 2.1), this allows us to apply Lemma B.8 to mirror descent. We deduce the $C^{1}$-differentiability of $\mathcal{M}$ in a neighborhood of ${\theta_{\star}}$, where we can write

[TABLE]

This yields $\mathcal{A}_{\star}^{m}=\partial^{2}\Phi({\theta_{\star}})$ and $\mathcal{B}_{\star}^{m}=\partial^{2}\Phi({\theta_{\star}})-\eta\partial^{2}f({\theta_{\star}})P\tilde{\mathcal{A}}_{\star}^{-1}P^{\top}\mathcal{B}_{\star}$. Regarding the value of $\hat{\uprho}_{\star}^{m}$, note that $\tilde{\mathcal{A}}_{\star}^{-1}\tilde{\mathcal{B}}_{\star}$ is similar to the symmetric matrix $\tilde{\mathcal{A}}_{\star}^{-1/2}\tilde{\mathcal{B}}_{\star}\tilde{\mathcal{A}}_{\star}^{-1/2}$ and is therefore diagonalizable. Together with (24) this yields

[TABLE]

where $f$ is defined by $f(x):=x^{2}-x+1$ . Note that $|f(x)|<1$ if and only if $x\in(0;1)$ . Besides, $\mathrm{Spec}(\tilde{\mathcal{A}}_{\star}^{-1}\tilde{\mathcal{B}}_{\star})\subset\mathbb{R}_{+}^{*}$ if and only if $u^{\top}\tilde{\mathcal{A}}_{\star}^{-1/2}\tilde{\mathcal{B}}_{\star}\tilde{\mathcal{A}}_{\star}^{-1/2}u>0$ for all $u\in\mathbb{R}^{d}$ , which is equivalent to $u^{\top}\tilde{\mathcal{B}}_{\star}u>0$ for all $u\in\mathbb{R}^{d}$ , by the symmetry of $\tilde{\mathcal{A}}_{\star}^{-1/2}$ . Finally, $\min_{\mathbb{R}}f=3/4$ , which is attained at $x=1/2$ , and for all $x\in(0;1)$ , $f(x)>x$ . ∎
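The three elementary facts used above about $f(x)=x^{2}-x+1$ follow from the identities $f(x)=1-x(1-x)$, $f(x)=(x-1/2)^{2}+3/4$ and $f(x)-x=(x-1)^{2}$. As a sanity check, the sketch below verifies them on a finite grid; the grid and the variable names are illustrative choices.

```python
def f(x):
    """Polynomial from the proof: f(x) = x^2 - x + 1."""
    return x * x - x + 1.0

# Regular grid on [-2, 3] with step 1e-3 (illustrative choice).
grid = [i / 1000.0 for i in range(-2000, 3001)]
inside = [x for x in grid if 0.0 < x < 1.0]
outside = [x for x in grid if not (0.0 < x < 1.0)]

# |f(x)| < 1 if and only if x lies in (0, 1).
below_one_inside = all(abs(f(x)) < 1.0 for x in inside)
at_least_one_outside = all(abs(f(x)) >= 1.0 for x in outside)

# The minimum over the grid is 3/4, attained at x = 1/2.
x_min = min(grid, key=f)
f_min = f(x_min)

# On (0, 1), f(x) > x because f(x) - x = (x - 1)^2 > 0.
above_identity = all(f(x) > x for x in inside)
```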

Appendix B Technical results

All lemmas below are proved in the Supplementary material (Appendix C).

B.1 Linear algebra

Lemma B.1.

Let $d\in\mathbb{N}^{*}$ and $A,B\in\mathbb{R}^{d\times d}$ such that $A$ is symmetric positive-definite. Then, for all $x,y\in\mathbb{R}^{d}$ , $x^{\top}Ax\leqslant x^{\top}By\implies\left\|x\right\|_{A}\leqslant\uprho\left\|y\right\|_{A}$ , where $\uprho:={|\kern-1.07639pt|\kern-1.07639pt|A^{-1/2}BA^{-1/2}|\kern-1.07639pt|\kern-1.07639pt|}_{2}$ .

Lemma B.2.

Let $d\in\mathbb{N}^{*}$ and $A,B\in\mathbb{R}^{d\times d}$ such that $A$ is symmetric positive-definite. Then, there exist $\varepsilon>0$ and $C>0$ such that for all symmetric matrices $M\in\mathbb{R}^{d\times d}$ and for all matrices $N\in\mathbb{R}^{d\times d}$ verifying ${|\kern-1.07639pt|\kern-1.07639pt|M|\kern-1.07639pt|\kern-1.07639pt|}_{2}\vee{|\kern-1.07639pt|\kern-1.07639pt|N|\kern-1.07639pt|\kern-1.07639pt|}_{2}\leqslant\varepsilon$ ,

[TABLE]

Lemma B.3.

Let $d\in\mathbb{N}^{*}$ , $A\in\mathbb{R}^{d\times d}$ be a symmetric positive-definite matrix, and $B\in\mathbb{R}^{d\times d}$ be a symmetric matrix. Then,

[TABLE]

Lemma B.4.

Let $d\in\mathbb{N}^{*}$ and $S_{\star}\in\mathbb{R}^{d\times d}$ be a symmetric positive-definite matrix. Then, for all $\varepsilon>0$ , there exists $\delta>0$ such that for all symmetric matrices $S\in\mathbb{R}^{d\times d}$ verifying ${|\kern-1.07639pt|\kern-1.07639pt|S-S_{\star}|\kern-1.07639pt|\kern-1.07639pt|}_{2}<\delta$ , and all $x\in\mathbb{R}^{d}$ ,

[TABLE]

B.2 Minimization

Lemma B.5.

Let $(\mathsf{K},\mathrm{d})$ be a compact metric space and $f\colon\mathsf{K}\rightarrow\mathbb{R}$ be a continuous function. Write $m:=\min_{\mathsf{K}}f$ and $\mathsf{M}:=\mathrm{argmin}_{\mathsf{K}}f$. Then, for all $\delta>0$ there exists $\varepsilon>0$ such that for all $x\in\mathsf{K}$,

[TABLE]

Lemma B.6.

Let $(\mathsf{K},\mathrm{d})$ be a compact metric space and $\mathcal{Q}\colon\mathsf{K}\times\mathsf{K}\rightarrow\mathbb{R}$ be a continuous function. Define for all $x\in\mathsf{K}$ , $\mathcal{M}(x):=\mathrm{argmin}_{x^{\prime}\in\mathsf{K}}\mathcal{Q}_{x}\left(x^{\prime}\right)$ . Then, for all $x_{\star}\in\mathsf{K}$ and $\delta>0$ , there exists $\delta^{\prime}>0$ such that for all $x\in\mathsf{K}$ ,

[TABLE]

Lemma B.7 (Continuity of the minimization mapping).

Let $(\mathsf{K},\mathrm{d})$ be a compact metric space and $\mathcal{Q}\colon\mathsf{K}\times\mathsf{K}\rightarrow\mathbb{R}$ be a continuous function such that for all $x\in\mathsf{K}$ , $\mathcal{M}(x):=\mathrm{argmin}_{x^{\prime}\in\mathsf{K}}\mathcal{Q}_{x}\left(x^{\prime}\right)$ is a singleton. Then, the function $\mathcal{M}$ is continuous on $\mathsf{K}$ .

Lemma B.8 (Differentiability of the minimization mapping).

Let $d\in\mathbb{N}^{*}$ , $\mathsf{K}$ be a compact set of $\mathbb{R}^{d}$ , and $\mathcal{Q}\colon\mathbb{R}^{d}\times\mathbb{R}^{d}\rightarrow\mathbb{R}$ be a function continuous on $\mathsf{K}\times\mathsf{K}$ such that for all $x\in\mathsf{K}$ , $\mathcal{M}(x):=\mathrm{argmin}_{x^{\prime}\in\mathsf{K}}\mathcal{Q}_{x}\left(x^{\prime}\right)$ is a singleton. Let $x\in\mathsf{K}$ such that:

(i) $x,\mathcal{M}\left(x\right)\in\mathrm{ri}(\mathsf{K})$,

(ii) $\partial_{2}\mathcal{Q}$ is well-defined and $C^{k}$-differentiable in a neighborhood of $(x,\mathcal{M}\left(x\right))$ for $k\in\mathbb{N}^{*}$,

(iii) $\partial_{22}\tilde{\mathcal{Q}}_{x}(\mathcal{M}\left(x\right))$ is invertible.

Then, the function $\mathcal{M}$ is $C^{k}$ -differentiable in a neighborhood of $x$ (considering that the domain lies in the ambient space $\mathrm{Aff}(\mathsf{K})$ ).

Lemma B.9.

Let $(\mathsf{K},\mathrm{d})$ be a compact metric space and $f\colon\mathsf{K}\rightarrow\mathbb{R}$ be a continuous function. Then, for all $\delta>0$ , there exists $\varepsilon>0$ such that for all functions $\hat{f}\colon\mathsf{K}\rightarrow\mathbb{R}$ ,

[TABLE]

with $\hat{\mathsf{M}}:=\mathrm{argmin}_{\mathsf{K}}\hat{f}$ , and the convention that the supremum over an empty set is equal to minus infinity.

Lemma B.10.

Let $(\mathsf{K},\mathrm{d})$ be a compact metric space and $\mathcal{Q}\colon\mathsf{K}\times\mathsf{K}\rightarrow\mathbb{R}$ be a continuous function such that for all $x\in\mathsf{K}$ , $\mathcal{M}(x):=\mathrm{argmin}_{x^{\prime}\in\mathsf{K}}\mathcal{Q}_{x}\left(x^{\prime}\right)$ is a singleton. Then, for all $\delta>0$ , there exists $\varepsilon>0$ such that for all functions $\hat{\mathcal{Q}}\colon\mathsf{K}\times\mathsf{K}\rightarrow\mathbb{R}$ ,

[TABLE]

where $\hat{\mathcal{M}}$ is defined by $\hat{\mathcal{M}}(x):=\mathrm{argmin}_{x^{\prime}\in\mathsf{K}}\hat{\mathcal{Q}}_{x}(x^{\prime})$ , and with the convention that the supremum over an empty set is equal to minus infinity.

B.3 Convexity

Lemma B.11.

Let $d\in\mathbb{N}^{*}$, $\mathsf{K}$ be a bounded convex set of $\mathbb{R}^{d}$, and $f\colon\mathsf{K}\rightarrow\mathbb{R}$ be a continuous function. Assume that $f$ has a unique minimizer $x_{\star}$ on $\mathsf{K}$, that $x_{\star}\in\mathrm{ri}(\mathsf{K})$, that $f$ is $C^{2}$-differentiable in a neighborhood of $x_{\star}$ and that $\partial^{2}f(x_{\star})\succ 0$.

Then, $x_{\star}$ is the unique minimizer of $f^{**}$ on $\mathsf{K}$ and $f^{**}$ is equal to $f$ in a neighborhood of $x_{\star}$ , where $f^{**}$ denotes the biconjugate of $f$ (see [Rockafellar and Wets, 1998, Section 11, p.473]).

Appendix C Supplementary material

C.1 Linear algebra

Proof of Lemma B.1.

Let $x,y\in\mathbb{R}^{d}$ such that $0<x^{\top}Ax\leqslant x^{\top}By$ . The Cauchy-Schwarz inequality provides

$$x^{\top}By=(A^{1/2}x)^{\top}(A^{-1/2}By)\leqslant\|A^{1/2}x\|_{2}\,\|A^{-1/2}By\|_{2}.$$

Using that $\|A^{-1/2}By\|_{2}\leqslant{|\kern-1.07639pt|\kern-1.07639pt|A^{-1/2}BA^{-1/2}|\kern-1.07639pt|\kern-1.07639pt|}_{2}\|A^{1/2}y\|_{2}$ , we deduce

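$$\left\|x\right\|_{A}^{2}=x^{\top}Ax\leqslant x^{\top}By\leqslant\uprho\left\|x\right\|_{A}\left\|y\right\|_{A},\quad\text{and hence}\quad\left\|x\right\|_{A}\leqslant\uprho\left\|y\right\|_{A},$$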

where $\uprho={|\kern-1.07639pt|\kern-1.07639pt|A^{-1/2}BA^{-1/2}|\kern-1.07639pt|\kern-1.07639pt|}_{2}$ . ∎
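Lemma B.1 can be checked numerically in dimension $d=2$. The sketch below assumes that $\left\|v\right\|_{A}:=(v^{\top}Av)^{1/2}$ and uses closed-form $2\times 2$ formulas for the inverse, the square root of a symmetric positive-definite matrix and the spectral norm. The matrices `A` and `B`, the helper names and the sampling scheme are illustrative.

```python
import math
import random

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def matvec(X, v):
    return [X[i][0] * v[0] + X[i][1] * v[1] for i in range(2)]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def inverse(X):
    det = X[0][0] * X[1][1] - X[0][1] * X[1][0]
    return [[X[1][1] / det, -X[0][1] / det],
            [-X[1][0] / det, X[0][0] / det]]

def spd_sqrt(S):
    """Square root of a 2x2 SPD matrix: (S + sqrt(det) I) / sqrt(tr + 2 sqrt(det))."""
    s = math.sqrt(S[0][0] * S[1][1] - S[0][1] * S[1][0])
    t = math.sqrt(S[0][0] + S[1][1] + 2.0 * s)
    return [[(S[0][0] + s) / t, S[0][1] / t],
            [S[1][0] / t, (S[1][1] + s) / t]]

def spectral_norm(X):
    """Square root of the largest eigenvalue of X^T X (2x2 closed form)."""
    Xt = [[X[0][0], X[1][0]], [X[0][1], X[1][1]]]
    S = matmul(Xt, X)
    tr = S[0][0] + S[1][1]
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    return math.sqrt(0.5 * (tr + math.sqrt(max(tr * tr - 4.0 * det, 0.0))))

def a_norm(A, v):
    return math.sqrt(dot(v, matvec(A, v)))

A = [[2.0, 0.5], [0.5, 1.0]]     # symmetric positive-definite (illustrative)
B = [[0.6, -0.2], [0.3, 0.4]]    # arbitrary square matrix (illustrative)
A_inv_half = inverse(spd_sqrt(A))
rho = spectral_norm(matmul(matmul(A_inv_half, B), A_inv_half))

rng = random.Random(0)
checked = 0      # number of sampled pairs satisfying the premise
violations = 0   # number of those pairs violating the conclusion
for _ in range(20000):
    x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
    y = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
    if dot(x, matvec(A, x)) <= dot(x, matvec(B, y)):   # premise of the lemma
        checked += 1
        if a_norm(A, x) > rho * a_norm(A, y) + 1e-9:   # conclusion fails
            violations += 1
```

Only the pairs $(x,y)$ satisfying the premise are tested, and no such pair should violate the conclusion.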

Lemma C.1.

Let $d\in\mathbb{N}^{*}$ and $A\in\mathbb{R}^{d\times d}$ be a symmetric positive-definite matrix. Then, there exist $\varepsilon>0$ and $C>0$ such that for all symmetric matrices $M\in\mathbb{R}^{d\times d}$ verifying ${|\kern-1.07639pt|\kern-1.07639pt|M|\kern-1.07639pt|\kern-1.07639pt|}_{2}\leqslant\varepsilon$ , the matrix $A+M$ is symmetric positive-definite and

[TABLE]

Proof.

Write $\lambda:=\min\mathrm{Spec}(A)>0$. Let $M\in\mathbb{R}^{d\times d}$ be a symmetric matrix such that ${|\kern-1.07639pt|\kern-1.07639pt|M|\kern-1.07639pt|\kern-1.07639pt|}_{2}\leqslant\lambda/2$. The matrix $A+M$ is then symmetric positive-definite and we can define its square root. By the symmetry of $A_{M}:=(A+M)^{1/2}-A^{1/2}$, there exist $\mu\in\mathrm{Spec}(A_{M})$ such that $|\mu|={|\kern-1.07639pt|\kern-1.07639pt|A_{M}|\kern-1.07639pt|\kern-1.07639pt|}_{2}$ and $x\in\mathbb{R}^{d}$ such that $A_{M}x=\mu x$ and $\left\|x\right\|_{2}=1$. This yields

[TABLE]

We deduce, using that $x^{\top}(A+M)^{1/2}x\geqslant 0$ and $\lambda^{1/2}=\min\mathrm{Spec}(A^{1/2})$,

[TABLE]

and hence the result with $\varepsilon=\lambda/2$ and $C=\lambda^{-1/2}$ . ∎

Lemma C.2.

Let $d\in\mathbb{N}^{*}$ and $A\in\mathbb{R}^{d\times d}$ be an invertible matrix. Then, there exist $\varepsilon>0$ and $C>0$ such that for all matrices $M\in\mathbb{R}^{d\times d}$ verifying ${|\kern-1.07639pt|\kern-1.07639pt|M|\kern-1.07639pt|\kern-1.07639pt|}_{2}\leqslant\varepsilon$ ,

[TABLE]

Proof.

Let $M\in\mathbb{R}^{d\times d}$ such that ${|\kern-1.07639pt|\kern-1.07639pt|M|\kern-1.07639pt|\kern-1.07639pt|}_{2}\leqslant{|\kern-1.07639pt|\kern-1.07639pt|A^{-1}|\kern-1.07639pt|\kern-1.07639pt|}_{2}^{-1}/2$ . We can then write

[TABLE]

This yields the result with $\varepsilon={|\kern-1.07639pt|\kern-1.07639pt|A^{-1}|\kern-1.07639pt|\kern-1.07639pt|}_{2}^{-1}/2$ and $C=2{|\kern-1.07639pt|\kern-1.07639pt|A^{-1}|\kern-1.07639pt|\kern-1.07639pt|}_{2}^{2}$ . ∎
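The constants obtained in this proof can be tested numerically. The sketch below reads the conclusion of Lemma C.2 as a bound of the form ${\|(A+M)^{-1}-A^{-1}\|}_{2}\leqslant C{\|M\|}_{2}$ in spectral norm, which is our reading and is consistent with how the lemma is used in the proof of Lemma B.2. It samples random perturbations of a $2\times 2$ matrix. The matrix `A`, the helper names and the sampling scheme are illustrative.

```python
import math
import random

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def inverse(X):
    det = X[0][0] * X[1][1] - X[0][1] * X[1][0]
    return [[X[1][1] / det, -X[0][1] / det],
            [-X[1][0] / det, X[0][0] / det]]

def spectral_norm(X):
    """Square root of the largest eigenvalue of X^T X (2x2 closed form)."""
    Xt = [[X[0][0], X[1][0]], [X[0][1], X[1][1]]]
    S = matmul(Xt, X)
    tr = S[0][0] + S[1][1]
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    return math.sqrt(0.5 * (tr + math.sqrt(max(tr * tr - 4.0 * det, 0.0))))

def combine(X, Y, sign):
    """Return X + sign * Y entrywise."""
    return [[X[i][j] + sign * Y[i][j] for j in range(2)] for i in range(2)]

A = [[1.0, 2.0], [0.0, -1.5]]        # invertible, not symmetric (illustrative)
A_inv = inverse(A)
norm_A_inv = spectral_norm(A_inv)
eps = 0.5 / norm_A_inv               # epsilon from the proof
C = 2.0 * norm_A_inv ** 2            # constant from the proof

rng = random.Random(1)
worst_ratio = 0.0
for _ in range(5000):
    M = [[rng.uniform(-1.0, 1.0) for _col in range(2)] for _row in range(2)]
    # Rescale M so that its spectral norm lies in [0.01 * eps, eps].
    scale = rng.uniform(0.01, 1.0) * eps / spectral_norm(M)
    M = [[scale * M[i][j] for j in range(2)] for i in range(2)]
    err = spectral_norm(combine(inverse(combine(A, M, 1.0)), A_inv, -1.0))
    worst_ratio = max(worst_ratio, err / (C * spectral_norm(M)))
```

A value of `worst_ratio` at most 1 means that no sampled perturbation exceeds the bound.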

Proof of Lemma B.2.

Applying Lemma C.1 to $A^{-1}$ and Lemma C.2 to $A$ provides the existence of $\varepsilon>0$ and $C>1$ such that for all symmetric matrices $S\in\mathbb{R}^{d\times d}$ verifying ${|\kern-1.07639pt|\kern-1.07639pt|S|\kern-1.07639pt|\kern-1.07639pt|}_{2}\leqslant\varepsilon$ , the matrix $A^{-1}+S$ is symmetric positive-definite and

[TABLE]

Let $M\in\mathbb{R}^{d\times d}$ be a symmetric matrix such that ${|\kern-1.07639pt|\kern-1.07639pt|M|\kern-1.07639pt|\kern-1.07639pt|}_{2}\leqslant\varepsilon/C\leqslant\varepsilon$ . By (73) we can then define $M^{\prime}:=A^{-1}-(A+M)^{-1}$ , which verifies ${|\kern-1.07639pt|\kern-1.07639pt|M^{\prime}|\kern-1.07639pt|\kern-1.07639pt|}_{2}\leqslant C{|\kern-1.07639pt|\kern-1.07639pt|M|\kern-1.07639pt|\kern-1.07639pt|}_{2}\leqslant\varepsilon$ . Choosing $S=-M^{\prime}$ , we deduce that $A^{-1}-M^{\prime}=(A+M)^{-1}$ is symmetric positive-definite and by (72),

[TABLE]

This concludes the proof up to simple algebra, noting that for all matrices $N\in\mathbb{R}^{d\times d}$ ,

[TABLE]

∎

Proof of Lemma B.3.

The symmetry of $A^{-1}$ and $B$ provides the first equality. As $A^{-1}B=A^{-1/2}(A^{-1/2}BA^{-1/2})A^{1/2}$ , $A^{-1}B$ and $A^{-1/2}BA^{-1/2}$ are similar and then

[TABLE]

Since $A^{-1/2}BA^{-1/2}$ is symmetric, we can write

[TABLE]

This provides the second equality, along with the third one by considering the change of variables $v=A^{-1/2}x$ . ∎
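The argument above relies on three quantities coinciding: the spectral radius of $A^{-1}B$, the spectral norm of $A^{-1/2}BA^{-1/2}$, and the supremum of $|v^{\top}Bv|/(v^{\top}Av)$ over nonzero $v$. The sketch below compares them in dimension $d=2$. The matrices, the helper names and the angular grid used to approximate the supremum are illustrative.

```python
import math

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def inverse(X):
    det = X[0][0] * X[1][1] - X[0][1] * X[1][0]
    return [[X[1][1] / det, -X[0][1] / det],
            [-X[1][0] / det, X[0][0] / det]]

def spd_sqrt(S):
    """Square root of a 2x2 SPD matrix: (S + sqrt(det) I) / sqrt(tr + 2 sqrt(det))."""
    s = math.sqrt(S[0][0] * S[1][1] - S[0][1] * S[1][0])
    t = math.sqrt(S[0][0] + S[1][1] + 2.0 * s)
    return [[(S[0][0] + s) / t, S[0][1] / t],
            [S[1][0] / t, (S[1][1] + s) / t]]

def spectral_norm(X):
    Xt = [[X[0][0], X[1][0]], [X[0][1], X[1][1]]]
    S = matmul(Xt, X)
    tr = S[0][0] + S[1][1]
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    return math.sqrt(0.5 * (tr + math.sqrt(max(tr * tr - 4.0 * det, 0.0))))

A = [[2.0, 0.5], [0.5, 1.0]]       # symmetric positive-definite (illustrative)
B = [[0.3, -0.4], [-0.4, -0.1]]    # symmetric (illustrative)

# Spectral radius of A^{-1} B; its eigenvalues are real by similarity.
N = matmul(inverse(A), B)
tr = N[0][0] + N[1][1]
det = N[0][0] * N[1][1] - N[0][1] * N[1][0]
disc = math.sqrt(max(tr * tr - 4.0 * det, 0.0))
spectral_radius = max(abs(0.5 * (tr + disc)), abs(0.5 * (tr - disc)))

# Spectral norm of the symmetric matrix A^{-1/2} B A^{-1/2}.
A_inv_half = inverse(spd_sqrt(A))
sym_norm = spectral_norm(matmul(matmul(A_inv_half, B), A_inv_half))

def quotient(angle):
    """|v^T B v| / (v^T A v) for the unit vector v at the given angle."""
    c, s = math.cos(angle), math.sin(angle)
    num = B[0][0] * c * c + 2.0 * B[0][1] * c * s + B[1][1] * s * s
    den = A[0][0] * c * c + 2.0 * A[0][1] * c * s + A[1][1] * s * s
    return abs(num) / den

# The quotient is invariant under scaling and sign, so angles in [0, pi) suffice.
rayleigh_sup = max(quotient(math.pi * k / 20000) for k in range(20000))
```

The grid maximum can only underestimate the supremum, so `rayleigh_sup` is expected to sit marginally below the two other quantities.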

Proof of Lemma B.4.

For all $x\in\mathbb{R}^{d}$ , we have

[TABLE]

This implies

[TABLE]

which yields, for all $S$ such that ${|\kern-1.07639pt|\kern-1.07639pt|S-S_{\star}|\kern-1.07639pt|\kern-1.07639pt|}_{2}\leqslant\delta$ ,

[TABLE]

The proof follows. ∎

C.2 Minimization

Proof of Lemma B.5.

Let $\delta>0$ . The function $\tilde{\mathrm{d}}$ defined on $\mathsf{K}$ by $\tilde{\mathrm{d}}(x):=\mathrm{d}(x,\mathsf{M})$ being continuous, the set $\tilde{\mathsf{K}}:=\tilde{\mathrm{d}}^{-1}([\delta,+\infty[)$ is compact, as the intersection of a closed set with a compact set. By the continuity of $f$ we deduce the existence of $x_{0}\in\tilde{\mathsf{K}}$ such that

[TABLE]

∎

Proof of Lemma B.6.

Let $x_{\star}\in\mathsf{K}$ and $\delta>0$. By the compactness of $\mathsf{K}$ and the continuity of $\mathcal{Q}_{x_{\star}}\left(\cdot\right)$, there exists $x_{\star\star}\in\mathcal{M}\left(x_{\star}\right)$. Besides, Lemma B.5 applied to $\mathcal{Q}_{x_{\star}}\left(\cdot\right)$ provides the existence of $\varepsilon>0$ such that for all $x^{\prime}\in\mathsf{K}$,

[TABLE]

Moreover, by the uniform continuity of $\mathcal{Q}$ on $(\mathsf{K}\times\mathsf{K},\tilde{\mathrm{d}})$ , where $\tilde{\mathrm{d}}((y,y^{\prime}),(z,z^{\prime})):=\mathrm{d}(y,z)+\mathrm{d}(y^{\prime},z^{\prime})$ , there exists $\delta^{\prime}>0$ such that for all $y,y^{\prime},z,z^{\prime}\in\mathsf{K}$ ,

[TABLE]

We deduce that for all $x\in\mathsf{K}$ such that $\mathrm{d}(x,x_{\star})<\delta^{\prime}$ , for all $x^{\prime}\in\mathcal{M}\left(x\right)$ ,

[TABLE]

and thus $\mathrm{d}(x^{\prime},\mathcal{M}\left(x_{\star}\right))<\delta$ by (74). ∎

Proof of Lemma B.7.

This is a direct consequence of Lemma B.6. ∎

Proof of Lemma B.8.

We first prove the lemma for $k=1$. Write $\mathsf{V}$ for the direction of $\mathrm{Aff}(\mathsf{K})$ and, for all $v\in\mathsf{V}$ and $\varepsilon>0$, write $\mathbf{B}\left(v,\varepsilon\right):=\{w\in\mathsf{V}\;:\;\|v-w\|_{2}<\varepsilon\}$. Note that the ball is defined as a subset of $\mathsf{V}$.

Under (i) and (ii) there exists $\varepsilon_{0}>0$ such that $\partial_{2}\mathcal{Q}$ is $C^{1}$-differentiable on $\mathbf{B}\left(x,2\varepsilon_{0}\right)\times\mathbf{B}\left(\mathcal{M}(x),2\varepsilon_{0}\right)\subset\mathrm{ri}(\mathsf{K})\times\mathrm{ri}(\mathsf{K})$. Moreover, the function $\mathcal{M}$ is continuous on $\mathsf{K}$ by Lemma B.7, which yields the existence of $\varepsilon_{1}\in(0;\varepsilon_{0})$ such that $y\in\mathbf{B}\left(x,\varepsilon_{1}\right)$ implies $\mathcal{M}(y)\in\mathbf{B}\left(\mathcal{M}\left(x\right),\varepsilon_{0}\right)$. By (iii) there also exists $\varepsilon_{2}\in(0;\varepsilon_{1})$ such that for all $y\in\mathbf{B}\left(x,\varepsilon_{2}\right)$, the matrix $\partial_{22}\tilde{\mathcal{Q}}_{y}(\mathcal{M}\left(y\right))$ is invertible.

Let $y\in\mathbf{B}\left(x,\varepsilon_{2}\right)$ and set $\varepsilon>0$ such that $\mathbf{B}\left(y,\varepsilon\right)\subset\mathbf{B}\left(x,\varepsilon_{2}\right)$ . For all $h\in\mathbf{B}\left(0,\varepsilon\right)$ , the fact that $\mathcal{M}\left(y\right),\mathcal{M}\left(y+h\right)\in\mathrm{ri}(\mathsf{K})$ provides (see the proof of Theorem 2):

[TABLE]

where the functions $\mathcal{A}$ and $\mathcal{B}$ are defined on $\mathbf{B}\left(0,\varepsilon\right)$ by

[TABLE]

Besides, $\tilde{\mathcal{A}}(0)=\partial_{22}\tilde{\mathcal{Q}}_{y}(\mathcal{M}\left(y\right))$ is invertible by the definition of $\varepsilon_{2}$ , and the functions $\mathcal{A}$ , $\mathcal{B}$ are continuous on $\mathbf{B}\left(0,\varepsilon\right)$ by the continuity of $\mathcal{M}$ and the uniform continuity of $\partial_{12}\mathcal{Q},\partial_{22}\mathcal{Q}$ on $\bar{\mathbf{B}}(x,\varepsilon_{0})\times\bar{\mathbf{B}}(\mathcal{M}(x),\varepsilon_{0})$ . Therefore, there exists $\varepsilon^{\prime}\in(0;\varepsilon)$ such that on $\mathbf{B}\left(0,\varepsilon^{\prime}\right)$ ,

[TABLE]

We deduce

[TABLE]

where $P$ is defined as in Section 3. The case $k>1$ follows by induction, using the above expression and the $C^{\infty}$ -differentiability of the matrix inverse. ∎
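In the scalar, unconstrained case, differentiating the first-order condition $\partial_{2}\mathcal{Q}_{x}(\mathcal{M}(x))=0$ gives the classical implicit-function formula $\mathcal{M}^{\prime}(x)=-\partial_{12}\mathcal{Q}_{x}(\mathcal{M}(x))/\partial_{22}\mathcal{Q}_{x}(\mathcal{M}(x))$. The sketch below compares this formula with a finite-difference derivative for a hypothetical surrogate $\mathcal{Q}_{x}(x^{\prime})=x^{\prime 2}/2+x^{\prime 4}/4-x^{\prime}\sin x$. The surrogate, the Newton solver and all names are illustrative and do not come from the paper.

```python
import math

# Hypothetical surrogate: Q_x(x') = x'^2 / 2 + x'^4 / 4 - x' * sin(x).
def d2_Q(x, xp):
    """Partial derivative of Q_x(x') with respect to x'."""
    return xp + xp ** 3 - math.sin(x)

def d22_Q(x, xp):
    """Second partial derivative with respect to x' (always >= 1 here)."""
    return 1.0 + 3.0 * xp ** 2

def d12_Q(x, xp):
    """Mixed partial derivative with respect to x and x'."""
    return -math.cos(x)

def M(x, iters=50):
    """Minimizer of Q_x, computed by Newton's method on d2_Q(x, .) = 0."""
    xp = 0.0
    for _ in range(iters):
        xp -= d2_Q(x, xp) / d22_Q(x, xp)
    return xp

x0 = 0.7
m0 = M(x0)

# Derivative of the minimization mapping from the implicit function theorem.
ift_derivative = -d12_Q(x0, m0) / d22_Q(x0, m0)

# Central finite difference of M for comparison.
h = 1e-5
fd_derivative = (M(x0 + h) - M(x0 - h)) / (2.0 * h)
```

Here $\partial_{22}\mathcal{Q}\geqslant 1$ everywhere, which plays the role of the invertibility condition (iii) in this toy setting.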

Proof of Lemma B.9.

Let $\delta>0$ and $x_{\star}\in\mathsf{M}$ . By Lemma B.5 there exists $\varepsilon>0$ such that for all $y\in\mathsf{K}$ ,

[TABLE]

Let $\hat{f}$ be a real-valued function defined on $\mathsf{K}$ such that $\sup_{\mathsf{K}}|\hat{f}-f|<\varepsilon$ . Then, for all $y\in\hat{\mathsf{M}}$ ,

[TABLE]

which implies $\mathrm{d}(y,\mathsf{M})<\delta$ by (75). ∎
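The mechanism of this proof can be seen on a finite grid. The sketch below takes $\mathsf{K}$ to be a grid on $[0,1]$ and $f(x)=(x-0.3)^{2}$, computes the gap that $f$ leaves outside the $\delta$-neighborhood of its argmin, and perturbs $f$ uniformly by less than half of this gap. The factor $1/2$ is our choice of constant for the illustration and is not taken from the paper; the perturbation and all names are hypothetical.

```python
import math

K = [i / 1000.0 for i in range(1001)]      # finite grid standing in for K

def f(x):
    return (x - 0.3) ** 2

m = min(f(x) for x in K)
M_set = [x for x in K if f(x) == m]        # argmin of f on the grid

def dist_to_M(x):
    return min(abs(x - z) for z in M_set)

delta = 0.1
# Smallest excess of f over its minimum outside the delta-neighborhood of M_set.
gap = min(f(x) for x in K if dist_to_M(x) >= delta) - m
eps = 0.5 * gap                            # assumed constant: half of the gap

def f_hat(x):
    """Perturbed function with sup |f_hat - f| <= 0.99 * eps < eps."""
    return f(x) + 0.99 * eps * math.sin(50.0 * x)

m_hat = min(f_hat(x) for x in K)
M_hat = [x for x in K if f_hat(x) == m_hat]   # argmin of the perturbed function

# Largest distance from a perturbed minimizer to the original argmin.
worst = max(dist_to_M(y) for y in M_hat)
```

Any minimizer $y$ of `f_hat` satisfies $f(y)<m+2\varepsilon$, which is below the gap, so it has to lie within $\delta$ of the original argmin.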

Lemma C.3.

Let $(\mathsf{K},\mathrm{d})$ be a compact metric space and $\mathcal{Q}\colon\mathsf{K}\times\mathsf{K}\rightarrow\mathbb{R}$ be a continuous function such that for all $x\in\mathsf{K}$ , $\mathcal{M}(x):=\mathrm{argmin}_{x^{\prime}\in\mathsf{K}}\mathcal{Q}_{x}\left(x^{\prime}\right)$ is a singleton. Then, for all $\delta>0$ , there exists $\varepsilon>0$ such that for all $x,x^{\prime}\in\mathsf{K}$ ,

[TABLE]

Proof.

By Lemma B.7, the function $\tilde{\mathrm{d}}$ defined on $\mathsf{K}^{2}$ by $\tilde{\mathrm{d}}(x,x^{\prime}):=\mathrm{d}(\mathcal{M}(x),x^{\prime})$ is continuous. Let $\delta>0$. By the compactness of $\tilde{\mathsf{K}}:=\tilde{\mathrm{d}}^{-1}([\delta,+\infty[)$ and the continuity of $\mathcal{Q}$, there exists $(x_{0},x_{0}^{\prime})\in\tilde{\mathsf{K}}$ such that

[TABLE]

∎

Proof of Lemma B.10.

Let $\delta>0$ . By Lemma C.3 there exists $\varepsilon>0$ such that for all $x,y\in\mathsf{K}$ ,

[TABLE]

Let $\hat{\mathcal{Q}}$ be a real-valued function defined on $\mathsf{K}\times\mathsf{K}$ such that $\sup_{\mathsf{K}\times\mathsf{K}}|\hat{\mathcal{Q}}-\mathcal{Q}|<\varepsilon$ . Then, for all $x,y\in\mathsf{K}$ such that $y\in\hat{\mathcal{M}}(x)$ ,

[TABLE]

which implies $\mathrm{d}(y,\mathcal{M}\left(x\right))<\delta$ by (76). ∎

C.3 Convexity

Proof of Lemma B.11.

For all $\varepsilon>0$ , write $\mathbf{B}\left(x_{\star},\varepsilon\right):=\left\{x\in\mathsf{K}\;:\;\left\|x-x_{\star}\right\|_{2}<\varepsilon\right\}$ . To begin with, note that $f(x_{\star})=f^{**}(x_{\star})$ . Indeed, $f(x_{\star})\geqslant f^{**}(x_{\star})$ by definition of the biconjugate, and the constant $f(x_{\star})$ is an affine minorant of $f$ , which provides $f^{**}\geqslant f(x_{\star})$ .

We now prove that $x_{\star}$ is the unique minimizer of $f^{**}$ on $\mathsf{K}$ . By assumption there exists $\varepsilon_{0}>0$ such that $f$ is $C^{2}$ -differentiable and $\partial^{2}f\succ 0$ on $\mathbf{B}\left(x_{\star},\varepsilon_{0}\right)$ , which implies the convexity of $f$ on that neighborhood. Besides, by Lemma B.5 there exists $\delta>0$ such that for all $x\in\mathsf{K}$ ,

[TABLE]

Let $\varepsilon_{1}\in(0;\varepsilon_{0})$ such that $f(\mathbf{B}\left(x_{\star},\varepsilon_{1}\right))\subset\mathbf{B}\left(f(x_{\star}),\delta/2\right)$ . As $x_{\star}\in\mathrm{ri}(K)$ , for all $x\in\mathsf{K}$ , $\partial f(x_{\star})^{\top}(x-x_{\star})=0$ . By the boundedness of $\mathsf{K}$ and the $C^{1}$ -differentiability of $f$ on $\mathbf{B}\left(x_{\star},\varepsilon_{1}\right)$ this provides the existence of $\varepsilon_{2}\in(0;\varepsilon_{1})$ such that for all $x\in\mathbf{B}\left(x_{\star},\varepsilon_{2}\right)$ and $y\in\mathsf{K}$ , $\partial f(x)^{\top}(y-x)\leqslant\delta/2$ . Together with (77) this yields for all $x\in\mathbf{B}\left(x_{\star},\varepsilon_{2}\right)$ and $y\in\mathsf{K}\setminus\mathbf{B}\left(x_{\star},\varepsilon_{0}\right)$ ,

[TABLE]

By the convexity of $f$ on $\mathbf{B}\left(x_{\star},\varepsilon_{0}\right)$, the same inequality holds for all $y\in\mathbf{B}\left(x_{\star},\varepsilon_{0}\right)$. We deduce from (78) that for all $x\in\mathbf{B}\left(x_{\star},\varepsilon_{2}\right)$, $y\in\mathsf{K}$,

[TABLE]

Let $y\in\mathsf{K}\setminus\{x_{\star}\}$ and choose $t\in(0;1)$ such that $x(t):=x_{\star}+t(y-x_{\star})\in\mathbf{B}\left(x_{\star},\varepsilon_{2}\right)$. By the convexity of $f$ on $\mathbf{B}\left(x_{\star},\varepsilon_{2}\right)$,

[TABLE]

Using (79) with $x=x(t)$ then yields $f^{**}(y)\geqslant f(x(t))>f(x_{\star})=f^{**}(x_{\star})$, which proves that $x_{\star}$ is the only minimizer of $f^{**}$ on $\mathsf{K}$.

Finally, for all $x\in\mathbf{B}\left(x_{\star},\varepsilon_{2}\right)$, using (79) with $y=x$ provides $f^{**}(x)\geqslant f(x)$. Since $f^{**}\leqslant f$ always holds, we obtain $f=f^{**}$ on $\mathbf{B}\left(x_{\star},\varepsilon_{2}\right)$. ∎
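
The conclusion of the lemma can be checked numerically on a toy case. The sketch below is an illustration under stated assumptions, not the construction used in the proof: it takes the hypothetical example $\mathsf{K}=[-3,3]$ and $f(x)=1-\mathrm{e}^{-x^{2}}$, which is non-convex on $\mathsf{K}$ but satisfies $f''>0$ near its unique minimizer $x_{\star}=0$, and approximates $f^{**}$ on a grid by the lower convex hull of the sampled graph. The helper `lower_convex_envelope` and all variable names are introduced for this example only.

```python
import numpy as np

def lower_convex_envelope(x, fx):
    """Lower convex hull of the points (x[i], fx[i]), evaluated at x (x increasing)."""
    hull = []  # indices of the hull vertices, scanned from left to right
    for i in range(len(x)):
        while len(hull) >= 2:
            a, b = hull[-2], hull[-1]
            # Cross product of (b - a) and (i - a): a value <= 0 means that
            # the point b lies on or above the chord from a to i.
            cross = (x[b] - x[a]) * (fx[i] - fx[a]) - (fx[b] - fx[a]) * (x[i] - x[a])
            if cross <= 0:
                hull.pop()
            else:
                break
        hull.append(i)
    # Piecewise-linear interpolation between the remaining hull vertices.
    return np.interp(x, x[hull], fx[hull])

# Hypothetical example: K = [-3, 3], f(x) = 1 - exp(-x^2), minimized at x_star = 0.
x = np.linspace(-3.0, 3.0, 601)
f_vals = 1.0 - np.exp(-x ** 2)
env = lower_convex_envelope(x, f_vals)  # grid approximation of f** on K

near = np.abs(x) <= 0.1
print("max |f - f**| near x_star:", np.max(np.abs(f_vals[near] - env[near])))
print("argmin of f** on the grid:", x[np.argmin(env)])
print("max gap f - f** on K:", np.max(f_vals - env))
```

On this example the envelope coincides with $f$ on a neighborhood of $x_{\star}$ and is strictly smaller than $f$ farther away, while $x_{\star}$ remains the minimizer of the envelope, in line with the two claims of the lemma.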

Acknowledgement

Many results presented in this paper were obtained during the first year of the Ph.D. of Rayan Charrier, before his resignation for personal reasons.

Bibliography


[Balakrishnan et al., 2017] Balakrishnan, S., Wainwright, M. J., and Yu, B. (2017). Statistical guarantees for the EM algorithm: from population to sample-based analysis. Ann. Statist., 45:77–120.
[Bauschke, 1997] Bauschke, H. H. and Borwein, J. M. (1997). Legendre functions and the method of random Bregman projections. J. Convex Anal., 4:27–67.
[Bubeck, 2015] Bubeck, S. (2015). Convex optimization: algorithms and complexity. Found. Trends Mach. Learn., 8:231–357.
[Cappé et al., 2005] Cappé, O., Moulines, E., and Rydén, T. (2005). Inference in hidden Markov models. Springer Series in Statistics. Springer, New York.
[Cesa-Bianchi and Lugosi, 2006] Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, learning, and games. Cambridge University Press.
[Daudel et al., 2020] Daudel, K., Douc, R., and Portier, F. (2020). Infinite-dimensional gradient-based descent for alpha-divergence. arXiv:2005.10618v2.
[Dempster et al., 1977] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B, 39:1–22.
[Douc et al., 2013] Douc, R., Moulines, E., and Stoffer, D. S. (2013). Nonlinear time series. Theory, methods, and applications with R examples. Chapman & Hall/CRC Texts Stat. Sci. Ser.
