This paper presents a comprehensive framework for iterative optimization algorithms, proving their asymptotic geometric convergence and providing exact rates, applicable to various algorithms including EM and mirror descent.
Contribution
It introduces a unified framework for analyzing convergence rates of iterative algorithms, including constrained cases and variants like alpha-EM and Mirror Prox.
Findings
1. Convergence is asymptotically geometric under general assumptions.
2. Exact asymptotic convergence rates are established.
3. Conditions for systematic convergence of Mirror Prox are provided.
Abstract
This paper introduces a general framework for iterative optimization algorithms and establishes under general assumptions that their convergence is asymptotically geometric. We also prove that under appropriate assumptions, the rate of convergence can be lower bounded. The convergence is then only geometric, and we provide the exact asymptotic convergence rate. This framework handles constrained optimization and encompasses the Expectation Maximization algorithm and the mirror descent algorithm, as well as some variants such as the alpha-Expectation Maximization or the Mirror Prox algorithm. Furthermore, we establish sufficient conditions for the convergence of the Mirror Prox algorithm, under which the method converges systematically to the unique minimizer of a convex function on a convex compact set.
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Optimization and Variational Analysis · Advanced Bandit Algorithms Research
Full text
Asymptotic convergence of iterative optimization algorithms
Randal Douc
Sylvain Le Corff
1 Introduction
The minimization of a real-valued function is the most common formulation for mathematical optimization problems. Examples of convex optimization problems in machine learning can be found for instance in [Bubeck, 2015].
For models involving missing or latent data, [Dempster et al., 1977] introduced the modern formulation of the Expectation Maximization (EM) algorithm, whose convergence has been proved under general assumptions in [Wu, 1983].
The asymptotic convergence rate of the EM algorithm has been widely
studied and identified as a ratio of missing information from the very
beginning [Dempster et al., 1977, Meng and Rubin, 1991, Meng and Rubin, 1993]. Since then, some links with gradient descent approaches have also
been drawn, see for instance [Lange, 1995]. Among the most notable
recent works, [Balakrishnan et al., 2017] provided quantitative results on the
non-asymptotic convergence of the EM algorithm to local optima by
considering smoothness and strong-concavity assumptions. In the
particular case of exponential families, [Kunstner et al., 2021] show that the
M-step is equivalent to a mirror descent update. This yields a
non-asymptotic linear convergence rate, which depends directly on the ratio
of missing information.
In this paper, instead of casting the EM algorithm into a gradient or
a mirror descent framework, we propose an extended formulation to
encompass both classes of algorithms, not restricted to exponential
families.
Indeed, both EM and mirror descent algorithms can be defined using a bivariate function that is iteratively minimized with respect to one coordinate.
Such a representation can actually describe any iterative optimization algorithm whose minimization steps are parametrized only by the current parameter estimate. This paper provides the following contributions.
• We prove under general assumptions that the convergence of such iterative optimization algorithms is asymptotically geometric, see Theorem 1. We also provide lower bounds for the rate of convergence, which show that the convergence can be only geometric, see Theorem 2, and in some cases establish the exact asymptotic convergence rate, see Theorem 3.
We show that those assumptions are natural either in an EM or in a
mirror descent framework, and that they are satisfied generically
without requiring any notable technical work, in contrast with
non-asymptotic results that tend to be more demanding. Regarding the
EM algorithm, we retrieve the well-known ratio of missing information
under even more general assumptions, as the minimization mapping is
not required to be point-to-point and this framework can handle
constrained optimization.
• We derive results for settings with both finite and infinite data, as well as for a variant of the EM algorithm, known as the α-EM algorithm, see [Matsuyama, 2003]. However, the most significant contribution is that brought to the mirror descent framework: under mild assumptions, we prove that its convergence is asymptotically geometric. This also applies to the mirror prox variant.
More generally, the convergence rates we exhibit are proved to be invariant under C2-reparametrization.
• Furthermore, we prove that under general assumptions, the convergence of mirror prox is guaranteed for convex functions with a unique minimizer on a convex compact set, without imposing any condition on the initialization.
This paper is organized as follows.
Section 2 introduces the general iterative optimization framework we consider and shows how it encompasses classical settings such as the EM algorithm or the mirror descent algorithm.
Section 3 states the main general results of this paper on asymptotic convergence rates.
Sections 4-6 discuss the assumptions of Theorem 1, illustrating how they are met in those classical settings, but also in variants such as the α-EM or the mirror prox algorithm.
Section 7 presents the proof of Theorem 1, and Section 8 is dedicated to the convergence of mirror prox.
A discussion follows in Section 9.
Additional proofs are deferred to Appendix A, using technical results listed in Appendix B and proved in the Supplementary material.
Notation
Throughout this paper, Spec(⋅) denotes the spectrum of a matrix and ϱ(⋅) the spectral radius.
The Euclidean norm is denoted by ∥⋅∥2, the spectral norm by ∣∣∣⋅∣∣∣2, the Frobenius norm by ∣∣∣⋅∣∣∣F, and for every symmetric positive-definite matrix S, we define the norm ∥⋅∥S by (∥x∥S)²:=x⊤Sx. The first derivative (resp. the second) of any univariate function f is written ∂f (resp. ∂²f). For every bivariate function Q:(x1,x2)↦Qx1(x2) and i,j∈{1,2}, we write ∂iQ:=∂Q/∂xi and ∂ijQ:=∂²Q/∂xi∂xj.
The maximum of two real numbers a,b is denoted by a∨b. For every topological space E, its closure is written Ē and its interior E˚. Finally, Conv(⋅) stands for convex hull, Aff(⋅) for affine hull and ri(⋅) for relative interior, i.e. the interior of a set within its affine hull.
2 General framework
Let q∈N∗ and let Q be a real-valued function defined on Rq×Rq:
Q:(θ,θ′)↦Qθ(θ′).
Let Θ be a subset of Rq and M be the point-to-set map defined on Θ by
M(θ) := argmin_{θ′∈Θ} Qθ(θ′).
In what follows, provided that M(θ)≠∅ for every θ∈Θ, we let (θn)n∈N be a sequence defined on Θ such that for all n∈N,
θn+1 ∈ M(θn). (1)
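The recursion of this section, in which each iterate minimizes the surrogate indexed by the previous one, can be sketched in a few lines. This is only an illustration under stated assumptions: the helper `iterate`, the box Θ=[−1,1]², the objective f and the surrogate Q are hypothetical choices made here, not taken from the paper. The surrogate is separable and convex in each coordinate, so its minimizer over the box is a clipped gradient step.

```python
import numpy as np

def iterate(M, theta0, n_iter):
    """Return [theta_0, ..., theta_{n_iter}] with theta_{n+1} = M(theta_n)."""
    thetas = [np.asarray(theta0, dtype=float)]
    for _ in range(n_iter):
        thetas.append(M(thetas[-1]))
    return thetas

# Hypothetical example: f(x) = ||x - c||^2 / 2 on the box Theta = [-1, 1]^2 and
# Q_theta(theta') = eta * grad_f(theta)^T theta' + ||theta' - theta||^2 / 2.
c = np.array([0.3, -0.5])
eta = 0.5

def M(theta):
    grad = theta - c
    # Coordinate-wise minimizer of Q_theta over the box: clip the unconstrained minimizer.
    return np.clip(theta - eta * grad, -1.0, 1.0)

thetas = iterate(M, theta0=[1.0, 1.0], n_iter=50)
print(thetas[-1])
```

Here M is point-to-point, so the inclusion in the definition of the sequence is an equality; a set-valued M would only require picking any minimizer at each step.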
Example 1 (EM algorithm).
Let X and Y be random variables taking values in measurable spaces (X,X) and (Y,Y), respectively.
Assume that the pair (X,Y) has a joint density function pθ⋆ with respect to a reference measure μ on X⊗Y that belongs to some parameterized family {pθ:θ∈Θ}. Assume also that the state variable X is
latent in the sense that the model is only partially observed through
the observation Y. In this case, the Expectation Maximization (EM)
algorithm, as defined in [Douc et al., 2013, Appendix D.1, p.492], provides an estimate of the unknown parameter θ⋆ by considering a sequence (θn)n∈N defined on Θ by
[TABLE]
where for all (θ,θ′)∈Θ×Θ,
[TABLE]
and Eθ denotes the expectation under pθ.
Note that Q is a random function which depends on the observations we consider.
For instance, in a model where (Xi,Yi)1⩽i⩽k are independent and identically distributed, with k observations (Yi)1⩽i⩽k, we define at the sample level: X=(X1,…,Xk) and Y=(Y1,…,Yk), and inserting in (3), we obtain up to a multiplicative constant (see [Balakrishnan et al., 2017]):
[TABLE]
In the limit of infinite data (i.e. k→∞), we define at the population level:
[TABLE]
where x↦pθ(x∣y) denotes the conditional density of X given Y when the parameter value is θ and where we assume that (Yi)i⩾1 are iid with density pθ⋆.
Both settings are studied in this paper, replacing Q in (2) by Qsamp or Qpop.
Example 2.1 (Mirror descent).
Let C be a convex compact set of Rq and f be a real-valued function defined on C.
The mirror descent strategy defined in [Bubeck, 2015, Chapter 4, p.296] considers a convex open set D of Rq such that C is contained in the closure of D and C∩D≠∅,
along with a mirror map Φ:D→R, that is,
(i) Φ is strictly convex and differentiable,
(ii) the gradient of Φ takes all possible values: ∂Φ(D)=Rq,
(iii) the gradient of Φ diverges on the boundary of D: limx→∂D∥∂Φ(x)∥=+∞.
Then, the mirror descent algorithm produces two sequences (θn)n∈N and (ζn)n∈N, defined on C and D respectively by
[TABLE]
where η>0 is the step-size, ∂f is the sub-differential of f (by abuse of notation) and DΦ is the Bregman divergence associated with Φ:
[TABLE]
Note that gradient descent is a particular case of mirror descent with Φ:x↦x⊤x/2.
Following [Bubeck, 2015, p.301], mirror descent can be rewritten as
[TABLE]
which fits into the general framework (1) with Θ:=C∩D and Q defined for all (θ,θ′)∈Θ×Θ by
[TABLE]
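As a concrete instance of Example 2.1, the sketch below runs mirror descent on the probability simplex with the negative-entropy mirror map Φ(x)=Σ xi log xi on the positive orthant, a standard choice discussed in [Bubeck, 2015]. The objective f, the point p, the step-size and the function name are hypothetical and chosen only for illustration. With this mirror map the dual step is a multiplicative update and the Bregman projection onto the simplex is a renormalization.

```python
import numpy as np

def mirror_descent_simplex(grad_f, theta0, eta, n_iter):
    """Mirror descent on the probability simplex with the negative-entropy mirror map."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        # Dual step: grad Phi(zeta) = grad Phi(theta) - eta * g, with grad Phi(x) = 1 + log(x).
        zeta = theta * np.exp(-eta * grad_f(theta))
        # Bregman (Kullback-Leibler) projection onto the simplex is a renormalization.
        theta = zeta / zeta.sum()
    return theta

# Hypothetical objective f(x) = ||x - p||^2 / 2, whose minimizer on the simplex is p.
p = np.array([0.2, 0.3, 0.5])
theta_md = mirror_descent_simplex(lambda x: x - p, np.ones(3) / 3, eta=0.5, n_iter=500)
print(theta_md)
```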
Example 2.2 (Mirror prox).
Mirror prox is a variant of mirror descent defined by the following equations [Bubeck, 2015, Chapter 4, p.305]:
[TABLE]
Straightforward algebra yields the equivalent definition:
[TABLE]
where M is defined for mirror descent by (10). We deduce from Example 2.1 that mirror prox fits into the general framework (1) with Θ:=C∩D and Qm defined on Θ×Θ by
[TABLE]
The fact that M(θ) is a singleton is ensured by (i) and (iii) as in this case Φ is a Legendre function, see [Cesa-Bianchi and Lugosi, 2006, Lemma 11.1] or [Bauschke, 1997, Theorem 3.12].
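A minimal sketch of the two-step structure of mirror prox, under simplifying assumptions made here for illustration: the Euclidean mirror map Φ:x↦x⊤x/2 (so that Bregman projections are Euclidean projections), a box for C, and a hypothetical quadratic objective. The first step computes an intermediate point from the gradient at θn; the second step restarts from θn but uses the gradient at that intermediate point.

```python
import numpy as np

# Hypothetical convex quadratic f(x) = x^T H x / 2 - b^T x on the box C = [-2, 2]^2.
H = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -0.5])

def grad_f(x):
    return H @ x - b

def proj(x):
    # Euclidean projection onto the box, i.e. the Bregman projection for Phi(x) = ||x||^2 / 2.
    return np.clip(x, -2.0, 2.0)

beta = np.linalg.eigvalsh(H).max()   # smoothness constant of f
eta = 0.5 / beta                     # step-size in (0, gamma / beta) with gamma = 1

theta = np.array([2.0, 2.0])
for _ in range(300):
    y = proj(theta - eta * grad_f(theta))      # mirror descent step from theta_n
    theta = proj(theta - eta * grad_f(y))      # same starting point, gradient taken at y

theta_star = np.linalg.solve(H, b)             # unconstrained minimizer, interior to the box
print(theta, theta_star)
```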
3 Asymptotic convergence rate
Assume there exists θ⋆∈Θ such that ∂2Q is well-defined in a neighborhood of θ⋆ and differentiable at (θ⋆,θ⋆), and write
[TABLE]
Let V:=span{θ−θ′:θ,θ′∈Θ} be the direction of Aff(Θ).
Consider the following set of assumptions.
(H1) The set Θ is convex.
(H2) The sequence (θn)n∈N converges to θ⋆.
(H3) There exists a neighborhood of (θ⋆,θ⋆) on which Q is continuous and ∂2Q is well-defined and C1-differentiable.
(H4) The matrix B⋆ is symmetric and for all v∈V∖{0}, v⊤A⋆v>∣v⊤B⋆v∣.
Theorem 1.
Assume that (H1)-(H4) hold. Then, ρ^⋆∈[0;1) and for all ρ∈(ρ^⋆;1),
[TABLE]
or equivalently,
[TABLE]
As any two norms on a finite-dimensional linear space are equivalent, Theorem 1 and the next results could be given using another norm on Θ. They are stated here with ∥⋅∥2 for simplicity.
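The statement can be checked numerically in the simplest setting. The sketch below assumes unconstrained gradient descent on a hypothetical quadratic, for which Section 6 gives A⋆=Iq; the expression B⋆=Iq−η∂²f(θ⋆) is inferred from the step-size discussion there and should be read as an assumption. The empirical quantity ∥θn−θ⋆∥2^(1/n) is compared with the supremum of ∣v⊤B⋆v∣/(v⊤A⋆v).

```python
import numpy as np

H = np.diag([1.0, 4.0])      # Hessian of the hypothetical objective f(x) = x^T H x / 2, theta_star = 0
eta = 0.3
A_star = np.eye(2)
B_star = np.eye(2) - eta * H  # assumed form of B_star for gradient descent
# Since A_star is the identity, the supremum of |v^T B v| / (v^T A v) is the
# largest absolute eigenvalue of the symmetric matrix B_star.
rho_hat = np.abs(np.linalg.eigvalsh(B_star)).max()

theta = np.array([1.0, 1.0])
n = 200
for _ in range(n):
    theta = theta - eta * (H @ theta)

empirical_rate = np.linalg.norm(theta) ** (1.0 / n)
print(rho_hat, empirical_rate)
```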
In [Balakrishnan et al., 2017, Theorem 1], the authors prove that the population EM algorithm converges geometrically. Their proof relies mainly on convergence results for gradient ascent algorithms applied to the intermediate quantity of the EM algorithm, which is assumed to be smooth and strongly concave. Theorem 1 establishes under general assumptions that the convergence of the algorithms introduced in Section 2 is asymptotically geometric.
Corollary 1 extends the statement of Theorem 1 to the values taken by the function (θ,θ′)↦Qθ(θ′) which are invariant to the choice of the parametrisation.
Corollary 1.
Under (H1)-(H4), if ∂1Qθ⋆(θ⋆) is well-defined, then for all ρ∈(ρ^⋆;1),
If the limit θ⋆ lies in the relative interior of Θ and ρˇ⋆>0, the asymptotic convergence is therefore only geometric.
Theorem 3.
Assume that (H1)-(H4) hold, that ∂2Q is C2-differentiable in a neighborhood of (θ⋆,θ⋆), that θ⋆∈ri(Θ) and that for all p∈N, Span(θn−θ⋆,n⩾p)=V. Then ρˇ⋆,ρ^⋆∈[0;1) and
[TABLE]
where in the left-hand term we use the convention 0/0=0 and log(0)=−∞. In particular, if (ρ^⋆)²⩽ρˇ⋆ then
The assumption that Θ needs to be convex can be relaxed as follows.
(H’1) There exist E⊂Rq and a submanifold S⊂Rq of class C2 such that Θ=E∩S and θ⋆∈E˚.
Under (H’1), if d is the dimension of the submanifold S, for all x∈S, there exist two open neighborhoods U1 and U2 of x and of the null vector 0 in Rq, respectively, and a C2-diffeomorphism ψ:U1→U2 such that ψ(x)=0 and ψ(U1∩S)=U2∩(Rd×{0}q−d). Note that we identify Rq and Rd×Rq−d in a standard way using (x1,…,xq)↦((x1,…,xd),(xd+1,…,xq)). Write T⋆ for the tangent space to S at the point θ⋆. If U1 and U2 are two open neighborhoods of θ⋆ and of the null vector 0 in Rq, respectively, and ψ:U1→U2 is a C1-diffeomorphism such that ψ(θ⋆)=0 and ψ(U1∩S)=U2∩(Rd×{0}q−d), then T⋆=(∂ψθ⋆)^(−1)(Rd×{0}q−d), where ∂ψθ⋆ is the differential of ψ at θ⋆.
(H’4) The matrix B⋆ is symmetric and for all v∈T⋆∖{0}, v⊤A⋆v>∣v⊤B⋆v∣.
If (H1) holds with θ⋆∈ri(Θ), then
(H’1) is satisfied with E:=Θ+V⊥ and S:=Aff(Θ).
Remark 2.
If θ⋆ does not lie in the relative interior of Θ, the asymptotic convergence rates are not necessarily invariant to C2-reparametrization. Define for instance Q~θ0(θ1)=θ0²−θ0θ1+θ1² with Θ=[1;+∞[. Using the
reparametrization function Ψ:θ↦θ^α, we
set Qˇθ0(θ1)=Q~Ψ(θ0)(Ψ(θ1))=θ0^(2α)−θ0^α θ1^α+θ1^(2α). Then, with Q=Q~, we get ρˇ⋆=ρ^⋆=1/2 whereas with Q=Qˇ and α=2/5, we get ρˇ⋆=ρ^⋆=2.
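The two values in Remark 2 can be recovered by hand. The computation below is a sketch under assumptions that are not spelled out in the surviving text: A⋆=∂22Qθ⋆(θ⋆) and B⋆=−∂12Qθ⋆(θ⋆) (the two derivatives that Section 9.1 says the rate compares), θ⋆=1, and both rates equal to ∣B⋆∣/A⋆ because V is one-dimensional here. It also requires α>1/3 so that A⋆>0 in the reparametrized problem.

```latex
\begin{align*}
\tilde Q_{\theta_0}(\theta_1) &= \theta_0^2-\theta_0\theta_1+\theta_1^2:
  & \partial_{22}\tilde Q_{1}(1) &= 2,
  & \partial_{12}\tilde Q_{1}(1) &= -1,
  & \frac{|B_\star|}{A_\star} &= \frac12, \\
\check Q_{\theta_0}(\theta_1) &= \theta_0^{2\alpha}-\theta_0^{\alpha}\theta_1^{\alpha}+\theta_1^{2\alpha}:
  & \partial_{22}\check Q_{1}(1) &= 3\alpha^2-\alpha,
  & \partial_{12}\check Q_{1}(1) &= -\alpha^2,
  & \frac{|B_\star|}{A_\star} &= \frac{\alpha}{3\alpha-1}.
\end{align*}
```

Taking α=1 recovers 1/2, and α=2/5 gives (2/5)/(1/5)=2, in agreement with the remark.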
5 Comments on H2
The convergence of the sequence (θn)n∈N to θ⋆ (stated in
(H2)) may be the most challenging assumption of Theorem 1.
However, we provide alternative sufficient assumptions to establish
such convergence.
(H2.1) The set Θ is compact.
(H2.2) The function Q is continuous on Θ×Θ.
(H2.3) The point θ⋆ is a limit point of the sequence (θn)n∈N.
Assumption (H2.3) weakens (H2) by only requiring that (H3)-(H4) hold for an arbitrary θ⋆ in the limit set of (θn)n∈N, which is non-empty under (H2.1).
Example 2.1 (Mirror descent, cont.).
The map M is point-to-point on Θ under the assumptions of the definition (see Example 2.1).
Indeed, the surjectivity of the gradient in (ii) provides the existence of ζn+1 in (8), and the strict convexity of Φ in (i) proves its uniqueness.
Assumptions (i) and (iii) ensure the existence and the uniqueness of θn+1 in (8) (see [Bauschke, 1997, Theorem 3.12]).
Note that if f is convex or differentiable on C, then for all θ∈C, ∂f(θ)≠∅ and gn can be defined in (8).
Moreover, if θ⋆ is a local minimizer of f and f is differentiable at θ⋆, then (H2.4) is met.
Indeed, those two assumptions provide that for all θ∈Θ, ∂f(θ⋆)⊤(θ−θ⋆)⩾0, and thus Qθ⋆(θ)⩾Qθ⋆(θ⋆) in (10) with equality if and only if θ=θ⋆.
Assume in Example 2.2 that (H2.1)-(H2.2) hold and that:
(i) Φ and f are twice differentiable on Θ,
(ii) Φ is γ-strongly convex on C∩D and f is convex and β-smooth, with respect to ∥⋅∥2,
(iii) η∈(0;γ/β),
(iv) θ⋆ is the unique minimizer of f on C,
(v) θ⋆∈ri(Θ).
Then, (H2.3) and (H2.4) hold.
By definition of the intermediate quantity (3), the EM algorithm monotonically increases the likelihood of the observations and Assumption (H~4.2) is satisfied as soon as the log-likelihood is continuous, see for example [Cappé et al., 2005, Proposition 10.1.4, p.350].
6 Comments on H4
First of all, the matrix B⋆ appears to be symmetric in all the examples below. The discussion then focuses on the domination assumption and on the value of the convergence rate ρ^⋆.
Note that the domination assumption in (H4) is equivalent to having both A~⋆≻B~⋆ and A~⋆≻−B~⋆. In the case where A~⋆≻0, it is equivalent to ρ^⋆∈[0;1).
Example 1.1 (Population EM).
Assume that θ⋆ is the true parameter of the model, that for all x,y∈X,Y, the functions θ↦pθ(x∣y) and θ↦pθ(y) are twice differentiable in a neighborhood of θ⋆, and that conditions similar to [Douc et al., 2013, Assumption AD.1, p.492] hold to differentiate under the integral sign. Then, we prove in Section A.4, see (63) and (64), that
[TABLE]
where IX,Y(θ):=−Eθ[∂θ2logpθ(X,Y)] and IY:=−Eθ[∂θ2logpθ(Y)] denote the Fisher information matrices of (X,Y) and Y, respectively.
Therefore, (H4) is satisfied as soon as IY(θ⋆)≻0. Regarding the value of ρ^⋆, the above expressions of A⋆pop and B⋆pop provide the well-known ratio of missing information IX,Y(θ⋆)−1IX∣Y(θ⋆) (see [Dempster et al., 1977, Kunstner et al., 2021, Meng and Rubin, 1991, Meng and Rubin, 1993, Orchard and Woodbury, 1972]), where
[TABLE]
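The ratio of missing information can be observed directly in a toy model where everything is explicit. The sketch below is an illustration under assumptions chosen here, not an example from the paper: a latent X∼N(θ,1) observed through Y=X+ε with ε∼N(0,σ²) and σ² known. Then IX,Y=1, IY=1/(1+σ²), so IX,Y−1IX∣Y=σ²/(1+σ²), and each EM step contracts the distance to the maximum likelihood estimate by exactly that factor.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 3.0                 # known noise variance (hypothetical value)
theta_true = 1.5
k = 1000
X = rng.normal(theta_true, 1.0, size=k)                # latent variables
Y = X + rng.normal(0.0, np.sqrt(sigma2), size=k)       # observations

def em_step(theta):
    # E-step: conditional mean of X_i given Y_i under the current parameter.
    post_mean = theta + (Y - theta) / (1.0 + sigma2)
    # M-step: the expected complete-data log-likelihood is maximized by the average.
    return post_mean.mean()

theta_star = Y.mean()        # maximum likelihood estimate based on Y ~ N(theta, 1 + sigma2)
theta = 0.0
errors = []
for _ in range(30):
    errors.append(abs(theta - theta_star))
    theta = em_step(theta)

ratios = np.array(errors[1:]) / np.array(errors[:-1])
missing_info_ratio = sigma2 / (1.0 + sigma2)
print(ratios[-1], missing_info_ratio)
```

In this Gaussian model the information matrices do not depend on the data, which is why the sample and population rates coincide; in general they only agree asymptotically.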
Example 1.2 (Sample EM).
As for the other examples, all the results below are proved in Section A.4.
Assume that for all x,y∈X,Y, the functions θ↦pθ(x∣y) and θ↦pθ(y) are twice differentiable in a neighborhood of θ⋆, and that conditions similar to [Douc et al., 2013, Assumption AD.1, p.492] hold to differentiate under the integral sign.
Let (Yi)i∈N∗ be a sequence of independent and identically distributed random variables with probability density function pθ⋆, and write for all k∈N∗, Y1:k:=(Yi)1⩽i⩽k. Then, for all k∈N∗, by (65) and (66),
[TABLE]
where IX∣Y=Yi(θ⋆)=∫Xpθ⋆(x∣Yi)∂logpθ⋆(x∣Yi)[∂logpθ⋆(x∣Yi)]⊤μ(dx).
Note that A⋆samp(Y1:k) and B⋆samp(Y1:k) converge almost surely to A⋆pop and B⋆pop. Then, if the corresponding population EM meets (H4), almost surely, for sufficiently large k, the sample EM meets (H4).
Denoting by ρ^⋆pop and ρ^⋆samp(Y1:k) their respective rates, as defined in (13), by Lemma A.3, we also have that
[TABLE]
Furthermore, if ∂2logpθ⋆(X1,Y1),∂2logpθ⋆(Y1)∈L2(Rq×q), Lemma A.3 also establishes that for all δ∈(0;1) there exists Cδ>0 such that
[TABLE]
Example 2.1 (Mirror descent, cont.).
If f and Φ are twice differentiable in a neighborhood of θ⋆, we prove in Section A.4 that
[TABLE]
If for all v∈V, v⊤∂2f(θ⋆)v>0, the condition A~⋆≻B~⋆ is automatically satisfied. The domination assumption in (H4) then reduces to A~⋆≻−B~⋆, which corresponds to η being small enough.
In the particular case of unconstrained gradient descent where Aff(Θ)=Rq and Φ:x↦x⊤x/2, as (19) yields A⋆=Iq the above condition is equivalent to η∈(0;2/β⋆), the optimal choice being η=2/(α⋆+β⋆) where α⋆:=minSpec(∂2f(θ⋆)) and β⋆:=maxSpec(∂2f(θ⋆)).
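The optimal step-size for unconstrained gradient descent can be checked with a small grid search. The sketch assumes, as inferred from A⋆=Iq and the admissible range η∈(0;2/β⋆), that the rate is the largest value of ∣1−ηλ∣ over the eigenvalues λ of ∂²f(θ⋆); the eigenvalues used are hypothetical.

```python
import numpy as np

alpha_star, beta_star = 1.0, 4.0                 # hypothetical extreme eigenvalues of the Hessian
eigs = np.array([alpha_star, 2.5, beta_star])

def rate(eta):
    # Largest absolute eigenvalue of I - eta * Hessian (assumed expression of the rate).
    return np.abs(1.0 - eta * eigs).max()

etas = np.linspace(0.01, 0.60, 60)               # grid with step 0.01
rates = np.array([rate(e) for e in etas])
best_eta = etas[rates.argmin()]
print(best_eta, rates.min())
```

On this grid the minimizer coincides with 2/(α⋆+β⋆), and the minimal rate equals (β⋆−α⋆)/(β⋆+α⋆), the classical condition-number expression for gradient descent on quadratics.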
Besides, the asymptotic convergence rate ρ^⋆ can be interpreted similarly to the EM framework. Despite not being, strictly speaking, a ratio of missing information, ρ^⋆ still compares the mirror map Φ with the objective function f.
Intuitively, the choice of a mirror map with variations closer to those of f provides a better convergence rate.
If η=1, the extreme case Φ=f yields B⋆=0 and ρ^⋆=0, which is coherent with the fact that, in this case, the mirror descent is defined for all n∈N by θn+1∈argminθ∈Θf(θ).
The following discussion extends the above interpretation to a general class of functions Q that encompasses both mirror descent and the EM algorithm.
The first thing to note is that in both settings the function Q can be redefined as
[TABLE]
where f:Rq→R is the objective function and D:Rq×Rq→R is a function such that for all θ∈Θ, ∂2D(θ,θ)=0.
Indeed, it is common knowledge that the intermediate quantity of the EM algorithm can be expressed as
[TABLE]
where DKL denotes the Kullback-Leibler divergence (see [Daudel et al., 2020] for example).
Regarding mirror descent, if f is twice differentiable in Example 2.1, straightforward computation yields the following equivalent definition for Q:
[TABLE]
where the expression of DΦ−ηf follows that of (9) (and defines a Bregman divergence if Φ−ηf is strictly convex).
Besides, the condition ∂2D(θ,θ)=0 for all θ∈Θ is equivalent to ∂2Qθ(θ)=∂f(θ) for all θ∈Θ, hence
[TABLE]
and
[TABLE]
In the framework of (20), the convergence rate can thus be viewed as a relative difference between the second-order variations of Q and f.
Computing iteratively argminΘQθn(⋅) to estimate argminΘf can prove useful if those minimizations are easier to carry out, but the price to pay in terms of iterations (through the convergence rate) is directly related to how far the surrogate function Q is from the objective function f.
If A~⋆ is invertible, ρ^⋆ is indeed the spectral radius of B~⋆A~⋆−1=(∂22Q~θ⋆(θ⋆)−∂2f~(θ⋆))(∂22Q~θ⋆(θ⋆))−1 by Lemma B.3, and the interpretation of a ratio of missing information generalizes to that of a ratio measuring the loss of exactness in the minimization procedure.
Finally, in the particular case where D is a distance or a divergence twice differentiable at (θ⋆,θ⋆) with respect to the second argument, θ∈argminΘD(θ,⋅) for all θ∈Θ implies B⋆=∂22D(θ⋆,θ⋆)⪰0.
The domination assumption in (H4) then boils down to ∂2f~(θ⋆)≻0.
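The identification of the rate with a spectral radius can be illustrated numerically. The sketch below uses hypothetical random matrices, with A symmetric positive-definite and B symmetric, and checks that the supremum of ∣v⊤Bv∣/(v⊤Av), the spectral norm of A−1/2BA−1/2 and the spectral radius of BA−1 agree. It illustrates the linear-algebra fact attributed to Lemma B.3 and does not reproduce its proof.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
G = rng.normal(size=(d, d))
A = G @ G.T + d * np.eye(d)       # symmetric positive-definite
S = rng.normal(size=(d, d))
B = (S + S.T) / 2                 # symmetric

# A^{-1/2} from the eigen-decomposition of A.
w, U = np.linalg.eigh(A)
A_inv_sqrt = U @ np.diag(w ** -0.5) @ U.T

spec_norm = np.linalg.norm(A_inv_sqrt @ B @ A_inv_sqrt, 2)
spec_radius = np.abs(np.linalg.eigvals(B @ np.linalg.inv(A))).max()

# Monte Carlo lower estimate of sup_v |v^T B v| / (v^T A v).
V = rng.normal(size=(100000, d))
quad_B = np.einsum("ni,ij,nj->n", V, B, V)
quad_A = np.einsum("ni,ij,nj->n", V, A, V)
sampled_sup = (np.abs(quad_B) / quad_A).max()
print(spec_norm, spec_radius, sampled_sup)
```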
Example 1.3 (The α-EM algorithm).
The above discussion highlighted how the choice of the surrogate function Q determines the convergence rate ρ^⋆.
In the EM algorithm of Example 1, where the function Q can be defined for all (θ,θ′)∈Θ×Θ as
[TABLE]
the question then arises whether replacing the Kullback-Leibler divergence by an α-divergence (see [Daudel et al., 2020] for example) could provide a better convergence rate.
This leads to replacing the previous expression of Q by:
[TABLE]
where for all α∈R∖{0,1}, the concave function fα is defined on R+∗ by fα(x):=(1−x^α)/(α(α−1)) and f0:=log.
This approach has been introduced and developed in [Matsuyama, 2003].
We provide further elements for the choice of α by proving (see Section A.4) that at a population level, under the assumptions of Example 1.1 with fα instead of f0,
[TABLE]
Note that when α=0 we recover the previous quantities A⋆ and B⋆ for the classical EM algorithm.
If IY(θ⋆)≻0, then α∈(0;1/2) is a sufficient condition to meet (H4).
Besides, if A⋆α is invertible we can write
[TABLE]
A necessary condition to improve the convergence rate is then α/(1−α)>0, i.e. α∈(0;1).
We can also rewrite (23) as follows:
[TABLE]
By the positivity of B⋆ for the original EM algorithm, we deduce that the optimal choice of α corresponds to α=(ρ^⋆+ρˇ⋆)/2 and ρ^⋆α=(ρ^⋆−ρˇ⋆)/(2−ρ^⋆−ρˇ⋆), where ρ^⋆ and ρˇ⋆ are defined in (13) and (14) for the classical EM algorithm.
As a remark, we can see in [Matsuyama, 2003] that the α-EM algorithm does not simply change the D-function in (20), it also replaces the objective function with a different bivariate function.
Example 2.2 (Mirror prox, cont.).
Assume that (H2.1) holds, that Φ and f are C1-differentiable on Θ and twice differentiable at θ⋆, and that the corresponding mirror descent satisfies (H3)-(H4) and θ⋆=M(θ⋆)∈ri(Θ).
Then, we prove in Section A.4 that
[TABLE]
where A⋆, B⋆ are defined in (19) for mirror descent.
This provides the symmetry of B~⋆m and thus of B⋆m, as well as
[TABLE]
We deduce that under (H4) for mirror descent, ρ^⋆m<1 if and only if B~⋆≻0, which is met as soon as η∈(0;γ⋆/β⋆), where β⋆:=maxSpec(∂2f~(θ⋆)) and γ⋆:=minSpec(∂2Φ~(θ⋆)).
Besides, a sufficient condition for the C1-differentiability of ∂2Qm in a neighborhood of θ⋆ is the C2-differentiability of Φ and f in a neighborhood of θ⋆.
Under all those assumptions, mirror prox thus meets (H3)-(H4).
Note that the above sufficient condition of regularity implies (H3) for mirror descent, and that if B~⋆≻0, then ∂2f~(θ⋆)≻0 implies (H4) for mirror descent (see Example 2.1).
As a remark, (24) yields that the convergence rates defined in (13-14) are always strictly higher for mirror prox than for the corresponding mirror descent.
The rate ρˇ⋆m is even lower-bounded by 3/4 (see Section A.4).
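The 3/4 lower bound can be seen in a special case, chosen here for concreteness and not claimed to be the setting of Section A.4: unconstrained mirror prox with Φ:x↦x⊤x/2 and a quadratic objective f(θ)=(θ−θ⋆)⊤H(θ−θ⋆)/2 with H≻0. One mirror prox step is then an extragradient step, and the error is multiplied by a fixed matrix:

```latex
\begin{align*}
y_{n+1} &= \theta_n - \eta H(\theta_n-\theta_\star), \\
\theta_{n+1}-\theta_\star &= (\theta_n-\theta_\star) - \eta H(y_{n+1}-\theta_\star)
  = \bigl(I_q - \eta H + \eta^2 H^2\bigr)(\theta_n-\theta_\star), \\
1 - t + t^2 &= \bigl(t-\tfrac12\bigr)^2 + \tfrac34 \;\geqslant\; \tfrac34
  \qquad\text{for } t=\eta\lambda,\ \lambda\in\mathrm{Spec}(H).
\end{align*}
```

Each eigenvalue λ of H therefore contributes a contraction factor of at least 3/4, whereas the corresponding gradient descent factor ∣1−ηλ∣ is strictly smaller whenever ηλ∈(0;1).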
Example 3 (Newton’s method).
Let f be a C2-differentiable function whose Hessian is invertible on Θ. Newton’s method considers the procedure defined for all n∈N by
[TABLE]
It fits into the general framework of (1) with Q defined on Θ×Θ by
[TABLE]
If f is thrice differentiable at θ⋆ and ∂f(θ⋆)=0, straightforward calculus yields A⋆=Iq and B⋆=0.
Newton’s method thus meets (H4) with ρˇ⋆=ρ^⋆=0, which is coherent with the fact that the convergence is quadratic under the assumptions of [Nocedal and Wright, 2006, Theorem 3.5, p.44].
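A zero asymptotic rate means that the error shrinks faster than any geometric sequence. The sketch below illustrates this on a hypothetical one-dimensional objective, f(x)=exp(x)−2x, whose minimizer is log 2; the function name `newton` and the starting point are choices made here for illustration.

```python
import math

def newton(grad, hess, x0, n_iter):
    """Newton iterations x_{n+1} = x_n - grad(x_n) / hess(x_n) in dimension one."""
    xs = [x0]
    for _ in range(n_iter):
        x = xs[-1]
        xs.append(x - grad(x) / hess(x))
    return xs

# Hypothetical objective f(x) = exp(x) - 2x: gradient exp(x) - 2, Hessian exp(x).
xs = newton(lambda x: math.exp(x) - 2.0, lambda x: math.exp(x), x0=1.0, n_iter=6)
errors = [abs(x - math.log(2.0)) for x in xs]
for n, e in enumerate(errors):
    print(n, e)
```

The printed errors are roughly squared from one iteration to the next until machine precision is reached, which is the quadratic behaviour referred to in the text.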
7 Proof of Theorem 1
We start with some notation that will be used in several parts of the paper.
Set d:=dim(V).
Let v1,…,vd∈Rq be an orthonormal basis of V and let P be the matrix
[TABLE]
so that V=P(Rd).
For all x∈Rq and M∈Rq×q, write
[TABLE]
Note that for all v∈V, Pv~=PP⊤v=v.
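The reduction to coordinates in V can be made concrete. The sketch below assumes, consistently with the identity Pv~=PP⊤v=v, that the lost display defines x~:=P⊤x and M~:=P⊤MP; the dimensions and the random subspace are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
q, d = 5, 2
spanning = rng.normal(size=(q, d))     # columns span a d-dimensional subspace V of R^q
P, _ = np.linalg.qr(spanning)          # q x d matrix whose columns form an orthonormal basis of V

v = spanning @ rng.normal(size=d)      # an arbitrary element of V
v_tilde = P.T @ v                      # coordinates of v in the basis (assumed definition of the tilde)

M = rng.normal(size=(q, q))
M_tilde = P.T @ M @ P                  # d x d compression of M to V (assumed definition)

print(np.allclose(P @ v_tilde, v))
print(np.isclose(v @ M @ v, v_tilde @ M_tilde @ v_tilde))
```

Quadratic forms evaluated on V are unchanged by this reduction, which is why the domination assumption in (H4) can be stated equivalently on A~⋆ and B~⋆.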
Write for all n∈N,
[TABLE]
and hence
[TABLE]
Then, for all M∈Rq×q and n,m∈N,
[TABLE]
In particular, with M=Iq the identity matrix, for all n∈N,
In the definition of ρ^⋆ given in (13), the supremum of v↦∣v⊤B⋆v∣/(v⊤A⋆v) can be taken over the compact set {v∈V:v⊤v=1} and it is thus attained.
This yields ρ^⋆∈[0;1) under (H4).
Let ρ∈(ρ^⋆;1). Proposition 2 below provides for sufficiently large n,
[TABLE]
This yields ∥Δ~n∥A~⋆=O(ρⁿ), and hence ∥Δ~n∥2=O(ρⁿ) by the equivalence of norms in finite dimension.
The proof is concluded by noting that this holds for any arbitrary ρ>ρ^⋆ and since by (30),
[TABLE]
∎
Lemma 7.1.
Under (H2)-(H3), θ⋆ is a local minimizer on Θ of the function θ↦Qθ⋆(θ).
Proof.
Let N be a neighborhood of θ⋆ such that Q is continuous on N×N.
For all θ∈N and n∈N, the definition of (θn)n∈N in (1) provides Qθn(θn+1)⩽Qθn(θ).
Taking the limit when n goes to infinity yields Qθ⋆(θ⋆)⩽Qθ⋆(θ) for all θ∈N.
∎
Proposition 2.
Under (H1)-(H4), for all ρ>ρ^⋆, for sufficiently large n,
[TABLE]
Proof.
By Lemma 7.1, θ⋆ is a local minimizer of the function θ↦Qθ⋆(θ).
From the differentiability of that function at θ⋆ under (H3), and the convexity of Θ under (H1), we deduce that for all θ∈Θ,
[TABLE]
Similarly, under (H2)-(H3) the function θ↦Qθn(θ) is differentiable at θn+1 for sufficiently large n, which yields for all θ∈Θ,
[TABLE]
Using (32) with θ=θ⋆ and (31) with θ=θn+1 provides
[TABLE]
which in turn implies
[TABLE]
Besides, applying Taylor’s theorem to θ↦∂2Qθn(θ) and θ↦∂2Qθ(θ⋆) yields for sufficiently large n,
[TABLE]
where
[TABLE]
Plugging (34-35) into the preceding display, we deduce
Now, by Schwarz’s theorem, A⋆ and hence A~⋆ are
symmetric. Similarly, under
(H2)-(H3), An and
hence A~n are symmetric for sufficiently large n. Moreover, (H4) implies the positive-definiteness of A~⋆, and by (H2)-(H3) that of A~n for sufficiently large n (see [Tao, 2012, Section 1.3.4, p.47]). We can thus apply Lemma B.1 to (37) with x=Δ~n+1, y=Δ~n, A=A~n and B=B~n and we obtain
[TABLE]
where ρ^n:=∣∣∣A~n−1/2B~nA~n−1/2∣∣∣2. Under (H2)-(H3), Lemma B.2 shows that ρ^n converges to ∣∣∣A~⋆−1/2B~⋆A~⋆−1/2∣∣∣2 by choosing A=A~⋆, B=B~⋆, M=A~n−A~⋆ and N=B~n−B~⋆. On the other hand, by Lemma B.3, ρ^⋆=∣∣∣A~⋆−1/2B~⋆A~⋆−1/2∣∣∣2.
Let ρ>ρ^⋆. Set ρ′:=(ρ+ρ^⋆)/2 and ε>0 such that (1+ε)ρ′⩽(1−ε)ρ.
Under (H2)-(H3), Lemma B.4 yields that for sufficiently large n, for all u∈Rd,
[TABLE]
Combining with (38) and the convergence of ρ^n to ρ^⋆, we deduce for sufficiently large n,
[TABLE]
∎
8 Convex constrained optimization
Theorem 7.
Assume that the mirror prox strategy defined in Example 2.2 satisfies the following assumptions:
(i) C⊂D,
(ii) Φ and f are twice differentiable on C and C2-differentiable in a neighborhood of θ⋆,
(iii) Φ is γ-strongly convex on C and f is convex and β-smooth, with respect to ∥⋅∥2,
(iv) η∈(0;γ/β),
(v) θ⋆ is the unique minimizer of f on C,
(vi) θ⋆∈ri(C) and ∂2f~(θ⋆)≻0.
Then, the algorithm converges and the convergence is asymptotically geometric.
Even if the convergence rates of mirror descent are always lower than those of mirror prox for the same optimization problem (see Example 2.2), the convergence of mirror prox is guaranteed under the assumptions of Theorem 7.
Note that no conditions are imposed on the initialization (see [Bubeck, 2015, Chapter 4, p.299]).
Corollary 2.
Let q∈N∗, C⊂Rq be a compact set, and f be a function that meets assumptions (ii)-(iii) and (v)-(vi) of Theorem 7.
Then, mirror prox provides an algorithm that converges to argminCf.
Proof.
Write R:=maxx∈C∥x∥2. For all R′>R, the mirror map Φ defined on D:=B(0,R′):={x∈Rd:∥x∥2<R′} by Φ(x):=∥x∥22/(R′−∥x∥22) meets the assumptions of Theorem 7 for all η∈(0;2(R′β)−1).
∎
9 Discussion
9.1 Non-asymptotic convergence
We proved the asymptotic geometric convergence in Section 7 by using that for all n∈N,
[TABLE]
as soon as we can define the above quantities (see (38)).
The question then arises of deriving non-asymptotic convergence rates. Apart from the fact that the norm depends on n, the main issue would be to obtain ρ^n<1.
However, the ratio ρ^⋆ compares ∂22Qθ⋆(θ⋆) with ∂12Qθ⋆(θ⋆), whereas ρ^n compares
[TABLE]
and the problem is not of the same complexity. A sufficient condition to simplify it can be minSpec(An)>max∣Spec(Bn)∣.
Despite being less precise than a comparison between matrices, such a condition has the advantage that it is sufficient to verify it for every s∈[0;1] in (39).
That pointwise condition essentially corresponds to the conditions 1 and 2 behind γ<λ in [Balakrishnan et al., 2017, Theorem 1], concerning ∂12Q and ∂22Q respectively
(the proof of that theorem also inspired this work).
We can also identify the classical assumptions of smoothness and
Lipschitz continuity for gradient descent and its variants.
In light of that remark, we better understand why the framework introduced in this paper yields better results asymptotically, as it makes it possible to work with the true asymptotic convergence rate. We can see it when comparing with the results stated so far in the EM literature [Dempster et al., 1977, Kunstner et al., 2021, Meng and Rubin, 1994] (that generally do not consider constrained optimization and assume that the mapping M is differentiable, among other things).
9.2 Quadratic convergence
Under the assumptions of Theorem 2, we established in Section A.1 that for sufficiently large n, we can write B~nΔ~n=A~nΔ~n+1, which is equivalent to
[TABLE]
Note that A~n and B~n cannot be computed as they depend on θ⋆ (see (36)). However, we can approximate them using only θn in order to estimate iteratively θ⋆ with (40).
In the example of unconstrained gradient descent with step-size 1, using A^n=Iq and B^n=Iq−∂2f(θn) (see Example 2.1) corresponds to Newton’s method (see Example 3).
9.3 Non-convex constrained optimization
Considering Corollary 2 and Lemmas B.9 and B.11, the problem of finding the unique minimizer of non-convex functions can be brought down to finding β-smooth approximations of their biconjugates (for an arbitrary β∈R+∗).
First, note that the theorem is proved if ρˇ⋆=0. We now assume that ρˇ⋆>0, which in particular implies that B~⋆ is invertible. In what follows, we use the notation introduced in Section 7.
Following the proof of Theorem 1, see in particular the proof of Proposition 2, with the additional assumption that θ⋆∈ri(Θ), we can prove that for sufficiently large n, for all θ∈Θ,
[TABLE]
In other words, ∂2Qθn(θn+1),∂2Qθ⋆(θ⋆)∈V⊥. This implies that for sufficiently large n,
[TABLE]
where An, Bn, P and Δ~n are defined respectively in (36), (25) and (28). As by definition of P, the condition v∈V⊥ is equivalent to the identity P⊤v=0, we deduce from (43) that P⊤BnPΔ~n=P⊤AnPΔ~n+1, which can be written as
[TABLE]
Moreover, by (H4), A~⋆ is positive-definite and by (H2)-(H3), A~n is also positive-definite for sufficiently large n. Then, the invertibility of B~⋆ allows us to write, for sufficiently large n:
[TABLE]
and thus, combining with ∥Δ~n∥A~n=∥A~n1/2Δ~n∥2 and ∥Δ~n+1∥A~n=∥A~n1/2Δ~n+1∥2, we get
[TABLE]
Besides, by the symmetry of A~⋆−1/2B~⋆A~⋆−1/2,
[TABLE]
Let ρ∈(0;ρˇ⋆).
Following the same steps as for the proof of Theorem 1, we can prove that ∣∣∣(A~n−1/2B~nA~n−1/2)−1∣∣∣2 converges to ∣∣∣(A~⋆−1/2B~⋆A~⋆−1/2)−1∣∣∣2.
Together with (45) and (46), this provides the existence of n0∈N such that for all n⩾n0, ∥Δ~n∥A~⋆⩽ρ^{−1}∥Δ~n+1∥A~⋆.
We deduce by induction that for all n⩾n0, ∥Δ~n0∥A~⋆⩽ρ^{−(n−n0)}∥Δ~n∥A~⋆.
As the sequence (θn)n∈N is not eventually equal to θ⋆, by (30) we can choose n0 such that ∥Δ~n0∥A~⋆≠0. Hence, since all the norms on a finite dimensional space are equivalent, we deduce liminf_{n→∞} (1/n) log∥Δ~n∥2 = liminf_{n→∞} (1/n) log∥Δ~n∥A~⋆ ⩾ log ρ. The proof is then concluded by applying (30) and by noting that ρ is arbitrary in (0;ρˇ⋆).
∎
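To make the statement concrete, the following Python sketch evaluates (1/n) log∥Δn∥2 for a linear recursion Δn+1=SΔn with S diagonal. It assumes, as the proof above suggests but as is not restated here, that ρˇ⋆ and ρ^⋆ are the smallest and largest eigenvalue moduli of A~⋆−1B~⋆; the matrix diag(0.9, 0.3) and the two starting points are hypothetical.

```python
import math


def empirical_log_rate(delta0, eigenvalues, n):
    # Delta_n for the diagonal recursion Delta_{k+1} = diag(eigenvalues) Delta_k,
    # followed by (1/n) * log of its Euclidean norm.
    delta_n = [lam ** n * d for lam, d in zip(eigenvalues, delta0)]
    norm = math.sqrt(sum(x * x for x in delta_n))
    return math.log(norm) / n


# Hypothetical spectrum, standing in for the eigenvalues of the limiting matrix.
eigenvalues = [0.9, 0.3]
rho_check, rho_hat = min(eigenvalues), max(eigenvalues)

# Generic start: both eigen-directions are excited.
generic = empirical_log_rate([1.0, 1.0], eigenvalues, 200)
# Degenerate start: only the fastest-contracting direction is excited.
degenerate = empirical_log_rate([0.0, 1.0], eigenvalues, 200)

if __name__ == "__main__":
    print("log rho_check:", math.log(rho_check))
    print("degenerate   :", degenerate)
    print("generic      :", generic)
    print("log rho_hat  :", math.log(rho_hat))
```

In this toy setting both empirical rates lie between log ρˇ⋆ and log ρ^⋆, and the degenerate start attains the lower bound, which is the situation a spanning condition such as the one in Theorem 3 rules out.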
In this proof, we use the notation introduced in Section 7. Define
[TABLE]
Note that S~⋆ is similar to the symmetric matrix A~⋆−1/2B~⋆A~⋆−1/2 and is therefore diagonalizable. Hence, there exists an invertible matrix R=[R(i,j)]1⩽i,j⩽d∈Rd×d such that S~⋆=A~⋆−1B~⋆=RD~⋆R−1 where D~⋆ is a diagonal matrix. For any matrix S∈Rd×d, it is convenient to use the notation SR=R−1SR. In particular, we have S~⋆R=D~⋆. Moreover, for any vector Δ∈Rd, we use the notation ΔR:=R−1Δ. Write
[TABLE]
where An and Bn are defined in (36) and S~n is well-defined for sufficiently large n as A~⋆ is positive-definite by (H4), and using (H2)-(H3). Besides, Theorems 1 and 2 provide ρˇ⋆,ρ^⋆∈[0;1). As the theorem is proved if ρˇ⋆=0, we now assume that ρˇ⋆>0, which implies the invertibility of B~⋆ and hence of S~⋆.
Moreover, Theorem 1 also yields that for all ρ∈(ρ^⋆;1), θn−θ⋆=o(ρⁿ).
We deduce by the C2-differentiability of ∂2Q in a neighborhood of (θ⋆,θ⋆) that for all ρ∈(ρ^⋆;1),
[TABLE]
Following the proof of Theorem 2, by (44) there exists n0∈N such that for all n⩾n0, B~nΔ~n=A~nΔ~n+1, that is, S~nΔ~n=Δ~n+1 or equivalently S~nRΔ~nR=Δ~n+1R. This implies for all m∈N∗,
[TABLE]
Component-wise, this yields for such n,m∈N∗ that for all i∈[[1:d]],
[TABLE]
where for any k∈[[0:m−1]], Ln,m,k(i)⊤ denotes the i-th row of the matrix
[TABLE]
Recalling that ∣∣∣⋅∣∣∣F is the Frobenius norm, we let CF>0 be a constant such that ∣∣∣⋅∣∣∣F⩽∣∣∣⋅∣∣∣2CF on Rd×d.
Let i∈[[1:d]] such that ∣D~⋆(i,i)∣=ρ^⋆.
Using the Cauchy-Schwarz inequality we deduce
[TABLE]
Let δ>max((ρ^⋆)^{−1},ρ^⋆(ρˇ⋆)^{−1}). Pick ρ∈(ρ^⋆,1) and ε>0 such that ρ((ρˇ⋆)^{−1}+ε)<δ. By (47) there exist C>0 and n1⩾n0 such that for all n⩾n1,
[TABLE]
Then, (48) yields for all n⩾n1 and m∈N∗, using (ρ^⋆)^{−1}<δ and ρ((ρˇ⋆)^{−1}+ε)<δ,
[TABLE]
We now show that there exists n⩾n1 such that Δ~nR(i)=R−1PTΔn(i)≠0. Indeed, otherwise, by Lemma A.1 below, there exists a basis w1,…,wd of V such that Δn=θn−θ⋆∈Span(wj,j∈[[1:d]]∖{i}) for any n⩾n1, which contradicts the assumption Span(Δn,n⩾n1)=V in the statement of Theorem 3. Therefore, we can choose n⩾n1 such that the lhs of (50) is strictly positive. This n being chosen, take the log in the previous inequality and divide by m. Letting m go to infinity, we then obtain for any δ>max((ρ^⋆)^{−1},ρ^⋆(ρˇ⋆)^{−1}),
[TABLE]
where the last equality follows from (30). The proof is completed since δ is arbitrary provided that δ>max((ρ^⋆)^{−1},ρ^⋆(ρˇ⋆)^{−1}), which is equivalent to δ^{−1}<min(ρ^⋆,(ρ^⋆)^{−1}ρˇ⋆).
∎
Lemma A.1.
Let V be a d-dimensional linear subspace of Rq. Let v1,…,vd∈Rq be an orthonormal basis of V and let w1,…,wd∈Rq be another basis obtained from (vi)1⩽i⩽d by the change-of-basis matrix R, that is, for any j∈[[1:d]], wj=∑i=1dR(i,j)vi.
Then, for any Δ∈V, the i-th component of the decomposition of Δ on the basis (wi)1⩽i⩽d is R−1PTΔ(i), where P is the matrix
[TABLE]
Proof.
Decomposing the vector Δ∈V on the basis (vj)1⩽j⩽d and using vj=∑i=1dR−1(i,j)wi, we get
The proof amounts to building an analogous framework that satisfies (H1)-(H4). The first step is to define a suitable minimization set that is convex.
Write d for the dimension of the submanifold S and for all x∈Rq, R>0, we set B(x,R):={y∈Rq:∥x−y∥2<R}.
Under (H’1) there exist two open neighborhoods U1 and U2 of θ⋆ and of the null vector 0 in Rq, respectively, and a C2-diffeomorphism ψ:U1→U2 such that ψ(θ⋆)=0 and ψ(U1∩S)=U2∩(Rd×{0}q−d).
Let N be an open neighborhood of θ⋆ such that Q meets the conditions of (H3) on N×N. Define U:=U1∩N∩E˚, which is an open set containing θ⋆ by (H’1).
Set r>0 such that B(θ⋆,r)⊂U, and ε>0 such that B(0,ε)⊂ψ(B(θ⋆,r/2)). Define W:=Rd×{0}q−d and the convex set
[TABLE]
Write ϕ for the corestriction of ψ to Ξ, that is, ϕ:ψ−1(Ξ)→Ξ such that ϕ(x)=ψ(x) for any x∈ψ−1(Ξ). Note that ϕ is still a C2-diffeomorphism. Define then
[TABLE]
Under (H2) there exists n0∈N such that θn∈ϕ−1(Ξ) for all n⩾n0, which allows us to define on Ξ the sequence
[TABLE]
We deduce from the definition of (θn)n∈N in (1) and from ϕ−1(Ξ)⊂U⊂E that for all n∈N and ζ∈Ξ,
[TABLE]
The framework defined by (51-53) thus fits into (1) and meets (H1)-(H3) with (θn,θ⋆,Q) replaced by (ζn,0,R). For consistency of notation, we write ζ⋆:=ϕ(θ⋆)=0. We now prove that (H4) is satisfied with (θ⋆,Q) replaced by (ζ⋆,R).
Denote by Jϕ the Jacobian matrix of ϕ. The fact that ϕ−1(Ξ)⊂U⊂N allows us to write for all θ,θ′∈ϕ−1(Ξ),
[TABLE]
using that the image of ϕ is included in Rd×{0}q−d, i.e. ϕi≡0 for all i∈[[d+1:q]]. This yields
[TABLE]
Besides, (54) and Lemma 7.1 provide that ζ⋆ is a local minimizer of the function ζ↦Rζ⋆(ζ). Together with the convexity of Ξ, the fact that ζ⋆∈ri(Ξ) and that Aff(Ξ)=W, this implies ∂2Rζ⋆(ζ⋆)∈W⊥.
As W=Rd×{0}q−d, the second term of the rhs in (56) is thus null.
Combining with T⋆=Jϕ(θ⋆)−1W by definition of the tangent space of a submanifold, we deduce from (55-56) that the rates ρˇ⋆,ρ^⋆ defined in (13-14) for R are equal to the rates ρ˘⋆,ρ̑⋆ defined in (15) for Q. Satisfying (H4) for R is thus equivalent to satisfying (H’4) for Q.
Therefore, we can apply Theorems 1 and 2 to the sequence (ζn)n∈N and we get for any (ρ1,ρ2)∈(ρ̑⋆,1)×(0,ρ˘⋆)
[TABLE]
To relate with the speed of convergence of θn−θ⋆, note that for all n∈N,
[TABLE]
and that ψ−1(Ξ)⊂B(θ⋆,r/2) by (51). The C1-differentiability of ψ on Bˉ(θ⋆,r/2) and of ψ−1 on Ξ provides the existence of C>0 such that supθ∈B(θ⋆,r/2)∣∣∣dψ(θ)∣∣∣2⩽C and supζ∈Ξ∣∣∣dϕ−1(ζ)∣∣∣2⩽C, where dψ and dϕ−1 denote the differentials of ψ and ϕ−1, respectively. Thus ψ is Lipschitz on B(θ⋆,r/2) and ϕ−1 is Lipschitz on Ξ. Combining with (58) and (57) concludes the proof.
∎
Assume that (H1), (H2.4), (H3) and (H4) hold.
Let ∥⋅∥ be a norm on Rq. Then, for all ρ>ρ^⋆, there exists δ>0 such that for all θ∈Θ, θ′∈M(θ),
[TABLE]
where θ~, θ~′ and θ~⋆ are defined in (26).
Proof.
In this proof, we use the notation introduced in Section 7. For all δ>0, write B(θ⋆,δ):={θ∈Rq:∥θ−θ⋆∥<δ}.
Under (H3) there exists δ0>0 such that ∂2Q is well-defined and C1-differentiable on B(θ⋆,δ0)×B(θ⋆,δ0).
By (H1), (H2.4) and (H3), we can prove, similarly to (33) in the proof of Proposition 2, that for all θ,θ′∈B(θ⋆,δ0) with θ′∈M(θ),
[TABLE]
which yields
[TABLE]
where
[TABLE]
Under (H3)-(H4) there also exists δ1∈(0;δ0) such that if θ,θ′∈B(θ⋆,δ1), then the symmetric matrix A~θθ′ is positive-definite.
Using (29) and applying Lemma B.1 to (59) then provides
[TABLE]
where ρ^θθ′:=∣∣∣A~θθ′−1/2B~θθ′A~θθ′−1/2∣∣∣2.
Let ρ>ρ^⋆. Set ρ′:=(ρ+ρ^⋆)/2 and ε>0 such that (1+ε)ρ′⩽(1−ε)ρ.
By (H3)-(H4) and Lemma B.4 there exists δ2∈(0;δ1) such that for all θ,θ′∈B(θ⋆,δ2), for all u∈Rd,
[TABLE]
Moreover, by Lemmas B.2 and B.3, under (H3) there exists δ3∈(0;δ2) such that ρ^θθ′⩽ρ′ for all θ,θ′∈B(θ⋆,δ3).
Combining with (60) yields that for all θ,θ′∈B(θ⋆,δ3) with θ′∈M(θ),
Applying Proposition 3 to ρ:=(1+ρ^⋆)/2 and the norm ∥⋅∥2 provides the existence of δ0>0 such that for all θ∈Θ, θ′∈M(θ),
[TABLE]
Moreover, by Lemma B.6 under (H2.1)-(H2.2) and by (H2.4), there exists δ1∈(0;δ0) such that for all θ∈Θ, θ′∈M(θ),
[TABLE]
By (30) and the equivalence of norms in finite dimension, combining with (61) yields the existence of δ>0 such that for all θ∈Θ, θ′∈M(θ),
[TABLE]
Besides, by (H2.3) there exists n0∈N such that ∥θ~n0−θ~⋆∥A~⋆⩽δ.
Using that ρ<1 by (H4), we deduce by induction that for all n⩾n0, ∥θ~n−θ~⋆∥A~⋆⩽δ, and that
[TABLE]
which concludes the proof.
∎
Lemma A.2.
Assume that ∂2Q is well-defined and differentiable on Θ, and that for all θ,θ′∈Θ, for all v∈V, v⊤∂12Qθ(θ′)v<0.
Then, for all θ,θ′∈Θ,
[TABLE]
Proof.
Let θ′′∈M(θ)∩M(θ′)∩ri(Θ). We can prove as for Theorem 2 that it implies B~θθ′(θ~−θ~′)=A~θθ′(θ~′′−θ~′′)=0, where
[TABLE]
Besides, by assumption, for all v∈V, v⊤Bθθ′v=−∫01v⊤∂12Qsθ′+(1−s)θ(θ′′)vds>0.
This provides the invertibility of B~θθ′ and thus θ~−θ~′=0, which concludes the proof by (30).
∎
It is proved in [Bubeck, 2015, Theorem 4.4, p.305] that under assumptions (ii) and (iii), for all θ∈C∩D and n⩾1,
[TABLE]
We deduce by applying Lemma B.5 under (iv) and the compactness of C that there exists φ:N→N strictly increasing such that (ζφ(n))n∈N converges to θ⋆.
By (11) this is equivalent to (M(θφ(n)−1))n∈N∗ converging to θ⋆.
Besides, (iv) and the differentiability of f provide M(θ⋆)={θ⋆} (see Example 2.1), and by Lemma B.7 under (H2.1)-(H2.2) the function M is continuous on Θ.
We deduce that all accumulation points ℓ of the sequence (θφ(n)−1)n∈N∗ verify M(ℓ)=θ⋆=M(θ⋆).
By Lemma A.2 under (i), (ii), (iii) and (v) (using (70-71)), this yields ℓ=θ⋆ for all accumulation points, and thus the convergence of (θφ(n)−1)n∈N∗ to θ⋆ by the compactness of C.
Using that M(θ⋆)={θ⋆}, we can prove as in Example 2.1 that Mm(θ⋆)={θ⋆}, where Mm is the minimization mapping corresponding to mirror prox (see (12)).
∎
Under (H2.3) there exists ψ:N→N strictly increasing such that (θψ(n))n∈N converges to θ⋆.
By the compactness of Θ×Θ under (H2.1) there also exist θ⋆⋆∈Θ and φ:N→N strictly increasing such that
[TABLE]
By the monotonicity of the sequence (ϑ(θn))n∈N under (H~4.2), for all n∈N, ϑ(θφ(n+1))⩽ϑ(θφ(n)+1)⩽ϑ(θφ(n)).
Together with (62) and the continuity of ϑ this yields
[TABLE]
Besides, by the definition of (θn)n∈N, for all θ∈Θ,
[TABLE]
Using the continuity of Q under (H2.2), this yields θ⋆⋆∈M(θ⋆). We deduce under (H~4.2) that θ⋆=θ⋆⋆, and hence M(θ⋆)={θ⋆} under (H~4.1).
∎
Similarly to the proof of Example 1.1, we deduce from (4) that for all θ,θ′ in a neighborhood of θ⋆,
[TABLE]
and therefore,
[TABLE]
Lemma A.3.
Assume that θ⋆ is the true parameter of the model, that for all x,y∈X,Y, the functions θ↦pθ(x∣y) and θ↦pθ(y) are twice differentiable in a neighborhood of θ⋆, and that conditions similar to [Douc et al., 2013, Assumption AD.1, p.492] hold to differentiate under the integral sign. Then, if the corresponding population EM meets (H4), almost surely the sample EM meets (H4) for sufficiently large k and
[TABLE]
Furthermore, if ∂2logpθ⋆(X1,Y1),∂2logpθ⋆(Y1)∈L2(Rq×q), then for all δ∈(0;1) there exists Cδ>0 such that
[TABLE]
Proof.
As ρ^⋆samp(Y1:k) is not necessarily well-defined in (13), using Lemma B.3 we consider the following definition:
[TABLE]
Note first that ρ^⋆ is a measurable function of Y1:k. Besides, we deduce from the assumptions that the following random variables are integrable:
[TABLE]
Together with (63-64) and (65-66), the strong law of large numbers provides
[TABLE]
By the continuity of the functions used in (67), this yields that if the corresponding population EM meets (H4), then, almost surely, the sample EM meets (H4) for sufficiently large k, and
[TABLE]
Assume now that ∂2logpθ⋆(X1,Y1),∂2logpθ⋆(Y1)∈L2(Rq×q).
First, using Jensen’s inequality with ∥⋅∥22 provides W1∈L2(Rq×q), and thus Z1=W1−∂2logpθ⋆(Y1)∈L2(Rq×q).
Let δ∈(0;1) and write Zˉk=∑i=1kZi/k. Set δ′∈(0;1) such that 4q2δ′⩽δ and write x(δ′) the quantile of order 1−δ′ of the standard Gaussian distribution. Applying the central limit theorem to each component of Z1 provides for all i,j∈[[1:q]] and k∈N∗,
[TABLE]
where μij:=E[Z1(i,j)], σij2:=Var[Z1(i,j)] and σ2:=E[∣∣∣Z1∣∣∣F2]∨E[∣∣∣Z2∣∣∣F2]. This yields
[TABLE]
Similarly, the same inequality holds for ∂22Qθ⋆samp(θ⋆).
Let CF>0 such that ∣∣∣⋅∣∣∣F⩾CF∣∣∣⋅∣∣∣2 on Rq×q.
We deduce using (29) and (30) that
[TABLE]
Let ε>0 and C>0 be constants obtained by applying Lemma B.2 to (∂22Q~θ⋆pop(θ⋆),∂12Q~θ⋆pop(θ⋆)).
We deduce from (69) that
Let α∈R∖{0,1}. The function fα is defined on R+∗ by
fα:x↦(1−xα)/(α(α−1)).
Note that fα(1)=0 and that for all differentiable functions g taking values in R+∗,
[TABLE]
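As a quick numerical companion to the definition of fα, the Python sketch below evaluates fα and checks two elementary facts: fα(1)=0, and fα(x) approaches log x as α tends to 0. The second fact is a direct computation from the formula above; reading it as the reason why α-EM reduces to EM in the limit is our inference and is not stated in this section.

```python
import math


def f_alpha(x, alpha):
    # f_alpha(x) = (1 - x**alpha) / (alpha * (alpha - 1)), for x > 0
    # and alpha outside {0, 1}, as in the definition above.
    if alpha in (0.0, 1.0):
        raise ValueError("alpha must differ from 0 and 1")
    if x <= 0.0:
        raise ValueError("x must be positive")
    return (1.0 - x ** alpha) / (alpha * (alpha - 1.0))


if __name__ == "__main__":
    # f_alpha vanishes at 1 for every admissible alpha.
    print(f_alpha(1.0, 0.5), f_alpha(1.0, -2.0))
    # Small alpha: values get close to log(x).
    for alpha in (0.1, 0.01, 1e-6):
        print(alpha, f_alpha(2.0, alpha), math.log(2.0))
```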
The function Qα defined in (21) can be written as Qθα(θ′)=−∫Xpθ(x∣Y)Fθα(θ′)μ(dx),
where Fα:(θ,θ′)↦fα(pθ′(x,Y)/pθ(x,Y)).
For all θ∈Θ, Fθα(θ)=0, and under the assumptions of Example 1.1 with fα instead of f0, for all θ,θ′∈Θ,
[TABLE]
We deduce
[TABLE]
At a population level this yields ∂22Qθ⋆α(θ⋆)=IX,Y(θ⋆) and
[TABLE]
Regarding the value of ρ^⋆α, note that A~⋆−1B~⋆ is similar to the symmetric matrix A~⋆−1/2B~⋆A~⋆−1/2 and is therefore diagonalizable.
Besides, we deduce from (16) and (22) that
[TABLE]
This yields Spec((A~⋆α)−1B~⋆α)=gα(Spec(A~⋆−1B~⋆)) where gα(x):=(x−α)/(1−α). We obtain the optimal α by solving gα(ρ^⋆)=−gα(ρˇ⋆).
∎
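The last step of this proof can be sketched numerically. The Python snippet below applies gα to a hypothetical spectrum contained in (0;1) and solves the balancing equation gα(ρ^⋆)=−gα(ρˇ⋆), which gives α=(ρ^⋆+ρˇ⋆)/2. It assumes that "optimal" means minimizing the largest modulus of the transformed spectrum over α<1; the spectrum values are illustrative only.

```python
def g_alpha(x, alpha):
    # Spectral map g_alpha(x) = (x - alpha) / (1 - alpha), defined for alpha != 1.
    return (x - alpha) / (1.0 - alpha)


def alpha_em_rate(spectrum, alpha):
    # Largest modulus of the transformed spectrum.
    return max(abs(g_alpha(lam, alpha)) for lam in spectrum)


def balancing_alpha(spectrum):
    # Solution of g_alpha(max) = -g_alpha(min): the midpoint of the extremes.
    return 0.5 * (max(spectrum) + min(spectrum))


# Hypothetical spectrum standing in for the eigenvalues of the limiting matrix.
spectrum = [0.2, 0.5, 0.8]
alpha_star = balancing_alpha(spectrum)
rate_star = alpha_em_rate(spectrum, alpha_star)
# alpha = 0 leaves the spectrum unchanged, since g_0(x) = x.
rate_zero = alpha_em_rate(spectrum, 0.0)

if __name__ == "__main__":
    print("alpha*:", alpha_star)
    print("rate at alpha*:", rate_star)
    print("rate at alpha = 0:", rate_zero)
```

On this example the balanced choice lowers the largest modulus from 0.8 to about 0.6, and a grid search over α in [−1, 0.99] finds no better value.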
To begin with, the assumptions imply that the corresponding mirror descent meets (H3)-(H4) and (H2.1)-(H2.2), and that θ⋆=M(θ⋆)∈ri(Θ).
Together with the fact that the mapping is point-to-point on Θ (see Example 2.1), this allows us to apply Lemma B.8 to mirror descent.
We deduce the C1-differentiability of M in a neighborhood of θ⋆, where we can write
[TABLE]
This yields A⋆m=∂2Φ(θ⋆) and B⋆m=∂2Φ(θ⋆)−η∂2f(θ⋆)PA~⋆−1P⊤B⋆. Regarding the value of ρ^⋆m, note that A~⋆−1B~⋆ is similar to the symmetric matrix A~⋆−1/2B~⋆A~⋆−1/2 and is therefore diagonalizable.
Together with (24) this yields
[TABLE]
where f is defined by f(x):=x2−x+1. Note that ∣f(x)∣<1 if and only if x∈(0;1).
Besides, Spec(A~⋆−1B~⋆)⊂R+∗ if and only if u⊤A~⋆−1/2B~⋆A~⋆−1/2u>0 for all u∈Rd, which is equivalent to u⊤B~⋆u>0 for all u∈Rd, by the symmetry of A~⋆−1/2.
Finally, minRf=3/4, which is attained at x=1/2, and for all x∈(0;1), f(x)>x.
∎
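The three elementary facts about f(x)=x2−x+1 used in this proof can be checked on a grid. The Python sketch below is only a sanity check of those scalar claims; the grid and its resolution are arbitrary choices made here.

```python
def mirror_prox_map(x):
    # Scalar map f(x) = x^2 - x + 1 from the proof above.
    return x * x - x + 1.0


# Arbitrary evaluation grid on [-2, 3] with step 0.001.
grid = [k / 1000 for k in range(-2000, 3001)]
inside = [x for x in grid if 0.0 < x < 1.0]
outside = [x for x in grid if x <= 0.0 or x >= 1.0]

# |f(x)| < 1 exactly on (0; 1).
contracting_inside = all(abs(mirror_prox_map(x)) < 1.0 for x in inside)
non_contracting_outside = all(abs(mirror_prox_map(x)) >= 1.0 for x in outside)

# Minimum value 3/4, attained at x = 1/2.
x_min = min(grid, key=mirror_prox_map)
f_min = mirror_prox_map(x_min)

# f(x) > x on (0; 1), since f(x) - x = (x - 1)^2.
above_identity = all(mirror_prox_map(x) > x for x in inside)

if __name__ == "__main__":
    print(contracting_inside, non_contracting_outside)
    print(x_min, f_min)
    print(above_identity)
```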
Appendix B Technical results
All lemmas below are proved in the Supplementary material (C.1).
B.1 Linear algebra
Lemma B.1.
Let d∈N∗ and A,B∈Rd×d such that A is symmetric positive-definite.
Then, for all x,y∈Rd, x⊤Ax⩽x⊤By⟹∥x∥A⩽ρ∥y∥A, where ρ:=∣∣∣A−1/2BA−1/2∣∣∣2.
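A randomized check of this implication can be sketched in Python for d=2. To keep the sketch dependency-free, A is taken diagonal (so that A−1/2 is explicit) and the spectral norm of a 2×2 matrix is computed in closed form; the matrices, the sampling box and the tolerance are all hypothetical choices, and a passing run is an illustration, not a proof.

```python
import math
import random


def spectral_norm_2x2(m):
    # Square root of the largest eigenvalue of m^T m, in closed form for 2x2.
    p = m[0][0] ** 2 + m[1][0] ** 2
    r = m[0][1] ** 2 + m[1][1] ** 2
    q = m[0][0] * m[0][1] + m[1][0] * m[1][1]
    lam_max = 0.5 * (p + r) + math.sqrt(0.25 * (p - r) ** 2 + q * q)
    return math.sqrt(lam_max)


def check_lemma_b1(a_diag, b, trials=10000, seed=0):
    # A = diag(a_diag) with positive entries, so A^{-1/2} B A^{-1/2}
    # has entries b[i][j] / sqrt(a_i * a_j).
    rng = random.Random(seed)
    scaled = [[b[i][j] / math.sqrt(a_diag[i] * a_diag[j]) for j in range(2)]
              for i in range(2)]
    rho = spectral_norm_2x2(scaled)
    tested = 0
    for _ in range(trials):
        x = [rng.uniform(-1.0, 1.0) for _ in range(2)]
        y = [rng.uniform(-1.0, 1.0) for _ in range(2)]
        x_a_x = sum(a_diag[i] * x[i] ** 2 for i in range(2))
        x_b_y = sum(x[i] * b[i][j] * y[j] for i in range(2) for j in range(2))
        if x_a_x <= x_b_y:
            # Hypothesis of the lemma holds: test the conclusion.
            tested += 1
            norm_x = math.sqrt(x_a_x)
            norm_y = math.sqrt(sum(a_diag[i] * y[i] ** 2 for i in range(2)))
            if norm_x > rho * norm_y + 1e-12:
                return rho, tested, False
    return rho, tested, True


if __name__ == "__main__":
    print(check_lemma_b1([2.0, 1.0], [[1.0, 0.5], [0.0, 0.8]]))
```

Note that B is deliberately not symmetric in the usage line, since the lemma does not require it.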
Lemma B.2.
Let d∈N∗ and A,B∈Rd×d such that A is symmetric positive-definite.
Then, there exist ε>0 and C>0 such that for all symmetric matrices M∈Rd×d and for all matrices N∈Rd×d verifying ∣∣∣M∣∣∣2∨∣∣∣N∣∣∣2⩽ε,
[TABLE]
Lemma B.3.
Let d∈N∗, A∈Rd×d be a symmetric positive-definite matrix, and B∈Rd×d be a symmetric matrix. Then,
[TABLE]
Lemma B.4.
Let d∈N∗ and S⋆∈Rd×d be a symmetric positive-definite matrix.
Then, for all ε>0, there exists δ>0 such that for all symmetric matrices S∈Rd×d verifying ∣∣∣S−S⋆∣∣∣2<δ, and all x∈Rd,
[TABLE]
B.2 Minimization
Lemma B.5.
Let (K,d) be a compact metric space and f:K→R be a continuous function. Write m:=minKf and M:=argminKf.
Then, for all δ>0 there exists ε>0 such that for all x∈K,
[TABLE]
Lemma B.6.
Let (K,d) be a compact metric space and Q:K×K→R be a continuous function. Define for all x∈K, M(x):=argminx′∈KQx(x′).
Then, for all x⋆∈K and δ>0, there exists δ′>0 such that for all x∈K,
[TABLE]
Lemma B.7 (Continuity of the minimization mapping).
Let (K,d) be a compact metric space and Q:K×K→R be a continuous function such that for all x∈K, M(x):=argminx′∈KQx(x′) is a singleton.
Then, the function M is continuous on K.
Lemma B.8 (Differentiability of the minimization mapping).
Let d∈N∗, K be a compact set of Rd, and Q:Rd×Rd→R be a function continuous on K×K such that for all x∈K, M(x):=argminx′∈KQx(x′) is a singleton.
Let x∈K such that:
(i) x,M(x)∈ri(K),
(ii) ∂2Q is well-defined and Ck-differentiable in a neighborhood of (x,M(x)) for k∈N∗,
(iii) ∂22Q~x(M(x)) is invertible.
Then, the function M is Ck-differentiable in a neighborhood of x (considering that the domain lies in the ambient space Aff(K)).
Lemma B.9.
Let (K,d) be a compact metric space and f:K→R be a continuous function.
Then, for all δ>0, there exists ε>0 such that for all functions f^:K→R,
[TABLE]
with M^:=argminKf^, and the convention that the supremum over an empty set is equal to minus infinity.
Lemma B.10.
Let (K,d) be a compact metric space and Q:K×K→R be a continuous function such that for all x∈K, M(x):=argminx′∈KQx(x′) is a singleton.
Then, for all δ>0, there exists ε>0 such that for all functions Q^:K×K→R,
[TABLE]
where M^ is defined by M^(x):=argminx′∈KQ^x(x′), and with the convention that the supremum over an empty set is equal to minus infinity.
B.3 Convexity
Lemma B.11.
Let d∈N∗, K be a bounded convex set of Rd, and f:K→R be a continuous function.
Assume that f has a unique minimizer x⋆ on K, that x⋆∈ri(K), that f is C2-differentiable in a neighborhood of x⋆ and that ∂2f(x⋆)≻0.
Then, x⋆ is the unique minimizer of f∗∗ on K and f∗∗ is equal to f in a neighborhood of x⋆, where f∗∗ denotes the biconjugate of f (see [Rockafellar and Wets, 1998, Section 11, p.473]).
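The statement can be visualized on a one-dimensional example with a discrete Legendre transform applied twice. In the Python sketch below, the tilted double well f(x)=(x2−1)2+0.3x on K=[−2,2], the grid of points and the grid of slopes are all hypothetical choices; the discrete biconjugate is only an approximation of f∗∗ (exact up to the slope resolution), so the comparison with f near the minimizer holds up to a small discretization error.

```python
def biconjugate_on_grid(xs, fs, slopes):
    # Discrete conjugate f*(s) = max_x (s*x - f(x)) over the grid xs,
    # then discrete biconjugate f**(x) = max_s (s*x - f*(s)) over the slopes.
    f_star = [max(s * x - fx for x, fx in zip(xs, fs)) for s in slopes]
    return [max(s * x - fs_s for s, fs_s in zip(slopes, f_star)) for x in xs]


def tilted_double_well(x):
    # Non-convex illustrative objective with a unique minimizer near x = -1.04.
    return (x * x - 1.0) ** 2 + 0.3 * x


# Grid on K = [-2, 2] and a slope grid that contains the slope 0.
xs = [k / 100 for k in range(-200, 201)]
fs = [tilted_double_well(x) for x in xs]
slopes = [k / 10 for k in range(-300, 301)]
f_bi = biconjugate_on_grid(xs, fs, slopes)

# Index of the grid minimizer of f.
i_min = min(range(len(xs)), key=lambda i: fs[i])

if __name__ == "__main__":
    print("grid minimizer of f:", xs[i_min], fs[i_min])
    print("min of f**         :", min(f_bi))
    print("f(0), f**(0)       :", fs[200], f_bi[200])
```

On this example f and its discrete biconjugate share the same minimum value and agree closely at the grid points next to the minimizer, while they differ markedly in the non-convex region around x=0.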
Let x,y∈Rd such that 0<x⊤Ax⩽x⊤By. The Cauchy-Schwarz inequality provides
[TABLE]
Using that ∥A−1/2By∥2⩽∣∣∣A−1/2BA−1/2∣∣∣2∥A1/2y∥2, we deduce
[TABLE]
where ρ=∣∣∣A−1/2BA−1/2∣∣∣2.
∎
Lemma C.1.
Let d∈N∗ and A∈Rd×d be a symmetric positive-definite matrix.
Then, there exist ε>0 and C>0 such that for all symmetric matrices M∈Rd×d verifying ∣∣∣M∣∣∣2⩽ε, the matrix A+M is symmetric positive-definite and
[TABLE]
Proof.
Write λ:=minSpec(A)>0. Let M∈Rd×d be a symmetric matrix such that ∣∣∣M∣∣∣2⩽λ/2. The matrix A+M is then symmetric positive-definite and we can define its square root.
By the symmetry of AM:=(A+M)1/2−A1/2, there exist μ∈Spec(AM) such that ∣μ∣=∣∣∣AM∣∣∣2 and x∈Rd such that AMx=μx and ∥x∥2=1. This yields
[TABLE]
We deduce, using that xT(A+M)1/2x⩾0 and λ1/2=minSpec(A1/2),
[TABLE]
and hence the result with ε=λ/2 and C=λ−1/2.
∎
Lemma C.2.
Let d∈N∗ and A∈Rd×d be an invertible matrix.
Then, there exist ε>0 and C>0 such that for all matrices M∈Rd×d verifying ∣∣∣M∣∣∣2⩽ε,
[TABLE]
Proof.
Let M∈Rd×d such that ∣∣∣M∣∣∣2⩽∣∣∣A−1∣∣∣2−1/2. We can then write
[TABLE]
This yields the result with ε=∣∣∣A−1∣∣∣2−1/2 and C=2∣∣∣A−1∣∣∣22.
∎
Applying Lemma C.1 to A−1 and Lemma C.2 to A provides the existence of ε>0 and C>1 such that for all symmetric matrices S∈Rd×d verifying ∣∣∣S∣∣∣2⩽ε, the matrix A−1+S is symmetric positive-definite and
[TABLE]
Let M∈Rd×d be a symmetric matrix such that ∣∣∣M∣∣∣2⩽ε/C⩽ε.
By (73) we can then define M′:=A−1−(A+M)−1, which verifies ∣∣∣M′∣∣∣2⩽C∣∣∣M∣∣∣2⩽ε. Choosing S=−M′, we deduce that A−1−M′=(A+M)−1
is symmetric positive-definite and by (72),
[TABLE]
This concludes the proof up to simple algebra, noting that for all matrices N∈Rd×d,
Let δ>0. The function d~ defined on K by d~(x):=d(x,M) being continuous, the set K~:=d~−1([δ,+∞[) is compact, as the intersection of a closed set with a compact set.
By the continuity of f we deduce the existence of x0∈K~ such that
Let x⋆∈K and δ>0. By the compactness of K and the continuity of Qx⋆(⋅), there exists x⋆⋆∈M(x⋆). Besides, Lemma B.5 applied to Qx⋆(⋅) provides the existence of ε>0 such that for all x′∈K,
[TABLE]
Moreover, by the uniform continuity of Q on (K×K,d~), where d~((y,y′),(z,z′)):=d(y,z)+d(y′,z′), there exists δ′>0 such that for all y,y′,z,z′∈K,
[TABLE]
We deduce that for all x∈K such that d(x,x⋆)<δ′, for all x′∈M(x),
We first prove the lemma for k=1.
Write V for the direction of Aff(K) and for all v∈V,ε>0, write B(v,ε):={w∈V:∥v−w∥2<ε}. Note that the ball is defined as a subset of V.
Under (i) and (ii) there exists ε0>0 such that ∂2Q is C1-differentiable on B(x,2ε0)×B(M(x),2ε0)⊂ri(K)×ri(K).
Moreover, the function M is continuous on K by Lemma B.7, which yields the existence of ε1∈(0;ε0) such that y∈B(x,ε1) implies M(y)∈B(M(x),ε0).
By (iii) there also exists ε2∈(0;ε1) such that for all y∈B(x,ε2), the matrix ∂22Q~y(M(y)) is invertible.
Let y∈B(x,ε2) and set ε>0 such that B(y,ε)⊂B(x,ε2). For all h∈B(0,ε), the fact that M(y),M(y+h)∈ri(K) provides (see the proof of Theorem 2):
[TABLE]
where the functions A and B are defined on B(0,ε) by
[TABLE]
Besides, A~(0)=∂22Q~y(M(y)) is invertible by the definition of ε2, and the functions A, B are continuous on B(0,ε) by the continuity of M and the uniform continuity of ∂12Q,∂22Q on Bˉ(x,ε0)×Bˉ(M(x),ε0).
Therefore, there exists ε′∈(0;ε) such that on B(0,ε′),
[TABLE]
We deduce
[TABLE]
where P is defined as in Section 3.
The case k>1 follows by induction, using the above expression and the C∞-differentiability of the matrix inverse.
∎
Let (K,d) be a compact metric space and Q:K×K→R be a continuous function such that for all x∈K, M(x):=argminx′∈KQx(x′) is a singleton.
Then, for all δ>0, there exists ε>0 such that for all x,x′∈K,
[TABLE]
Proof.
By Lemma B.7, the function d~ defined on K2 by d~(x,x′):=d(M(x),x′) is continuous.
Let δ>0. By the compactness of K~:=d~−1([δ,+∞[) and the continuity of Q, there exists (x0,x0′)∈K~ such that
For all ε>0, write B(x⋆,ε):={x∈K:∥x−x⋆∥2<ε}.
To begin with, note that f(x⋆)=f∗∗(x⋆). Indeed, f(x⋆)⩾f∗∗(x⋆) by definition of the biconjugate, and the constant f(x⋆) is an affine minorant of f, which provides f∗∗⩾f(x⋆).
We now prove that x⋆ is the unique minimizer of f∗∗ on K.
By assumption there exists ε0>0 such that f is C2-differentiable and ∂2f≻0 on B(x⋆,ε0), which implies the convexity of f on that neighborhood.
Besides, by Lemma B.5 there exists δ>0 such that for all x∈K,
[TABLE]
Let ε1∈(0;ε0) such that f(B(x⋆,ε1))⊂B(f(x⋆),δ/2).
As x⋆∈ri(K), for all x∈K, ∂f(x⋆)⊤(x−x⋆)=0.
By the boundedness of K and the C1-differentiability of f on B(x⋆,ε1) this provides the existence of ε2∈(0;ε1) such that for all x∈B(x⋆,ε2) and y∈K, ∂f(x)⊤(y−x)⩽δ/2.
Together with (77) this yields for all x∈B(x⋆,ε2) and y∈K∖B(x⋆,ε0),
[TABLE]
By the convexity of f on B(x⋆,ε0), the same inequality holds for all y∈B(x⋆,ε0).
We deduce from (78) that for all x∈B(x⋆,ε2), y∈K,
[TABLE]
Let y∈K∖{x⋆} and set t∈(0;1) such that x(t):=x⋆+t(y−x⋆)∈B(x⋆,ε2). By the convexity of f on B(x⋆,ε2),
[TABLE]
Using (79) with x=x(t) then yields f∗∗(y)⩾f(x(t))>f(x⋆)=f∗∗(x⋆), which proves that x⋆ is the only minimizer of f∗∗ on K.
Finally, for all x∈B(x⋆,ε2), using (79) with y=x provides f∗∗(x)⩾f(x), and hence f=f∗∗ on B(x⋆,ε2).
∎
Acknowledgement
Many results presented in this paper were obtained during the first year of the Ph.D. of Rayan Charrier, before his resignation for personal reasons.
Bibliography
[Balakrishnan et al., 2017] Balakrishnan, S., Wainwright, M. J., and Yu, B. (2017). Statistical guarantees for the EM algorithm: from population to sample-based analysis. Ann. Statist., 45:77–120.
[Bauschke, 1997] Bauschke, H. H. and Borwein, J. M. (1997). Legendre functions and the method of random Bregman projections. J. Convex Anal., 4:27–67.
[Bubeck, 2015] Bubeck, S. (2015). Convex optimization: algorithms and complexity. Found. Trends Mach. Learn., 8:231–357.
[Cappé et al., 2005] Cappé, O., Moulines, E., and Rydén, T. (2005). Inference in hidden Markov models. Springer Series in Statistics. Springer, New York.
[Cesa-Bianchi and Lugosi, 2006] Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, learning, and games. Cambridge University Press.
[Daudel et al., 2020] Daudel, K., Douc, R., and Portier, F. (2020). Infinite-dimensional gradient-based descent for alpha-divergence. arXiv:2005.10618v2.
[Dempster et al., 1977] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B, 39:1–22.
[Douc et al., 2013] Douc, R., Moulines, E., and Stoffer, D. S. (2013). Nonlinear time series. Theory, methods, and applications with R examples. Chapman & Hall/CRC Texts Stat. Sci. Ser.