Tensor Methods for Finding Approximate Stationary Points of Convex Functions
Geovani Nunes Grapiglia, Yurii Nesterov

TL;DR
This paper develops tensor-based algorithms to efficiently find approximate stationary points of convex functions with specific smoothness properties, providing complexity bounds for both accelerated and non-accelerated schemes.
Contribution
It introduces new tensor methods with proven iteration complexity bounds for convex functions with Hölder continuous derivatives, including cases with unknown smoothness parameters.
Findings
Non-accelerated schemes require O(ε^{-1/(p+ν-1)}) iterations.
Accelerated schemes improve complexity bounds, e.g., O(ε^{-(p+ν)/[(p+ν-1)(p+ν+1)]}).
Universal accelerated method achieves bounds when ν is unknown.
G.N. Grapiglia^a,* and Yu. Nesterov^b
*Corresponding author. Email: [email protected]
^a Departamento de Matemática, Universidade Federal do Paraná, Centro Politécnico, Cx. postal 19.081, 81531-980, Curitiba, Paraná, Brazil
^b Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 34 voie du Roman Pays, 1348 Louvain-la-Neuve, Belgium
(August 18, 2020)
Abstract
In this paper we consider the problem of finding $\epsilon$-approximate stationary points of convex functions that are $p$-times differentiable with $\nu$-Hölder continuous $p$th derivatives. We present tensor methods with and without acceleration. Specifically, we show that the non-accelerated schemes take at most $\mathcal{O}(\epsilon^{-1/(p+\nu-1)})$ iterations to reduce the norm of the gradient of the objective below a given $\epsilon\in(0,1)$. For accelerated tensor schemes we establish improved complexity bounds of $\mathcal{O}(\epsilon^{-(p+\nu)/[(p+\nu-1)(p+\nu+1)]})$ and $\mathcal{O}(|\log(\epsilon)|\epsilon^{-1/(p+\nu)})$, when the Hölder parameter $\nu\in[0,1]$ is known. For the case in which $\nu$ is unknown, we obtain a bound of $\mathcal{O}(\epsilon^{-(p+1)/[(p+\nu-1)(p+2)]})$ for a universal accelerated scheme. Finally, we also obtain a lower complexity bound of $\mathcal{O}(\epsilon^{-2/[3(p+\nu)-2]})$ for finding $\epsilon$-approximate stationary points using $p$-order tensor methods.
Keywords: unconstrained minimization; high-order methods; tensor methods; Hölder condition; worst-case complexity
AMS Subject Classification: 49M15; 49M37; 58C15; 90C25; 90C30
1 Introduction
1.1 Motivation
In this paper we study tensor methods for unconstrained optimization, i.e., methods in which the iterates are obtained by the (approximate) minimization of models defined from high-order Taylor approximations of the objective function. This type of method is not new in the optimization literature (see, e.g., [35, 4, 1]). Recently, interest in tensor methods has been renewed by the work in [2], where $p$-order tensor methods were proposed for unconstrained minimization of nonconvex functions with Lipschitz continuous $p$th derivatives. It was shown that these methods take at most $\mathcal{O}(\epsilon^{-(p+1)/p})$ iterations to find an $\epsilon$-approximate first-order stationary point of the objective, generalizing the bound of $\mathcal{O}(\epsilon^{-3/2})$, originally established in [28] for the Cubic Regularization of Newton's Method ($p=2$). After [2], several high-order methods have been proposed and analyzed for nonconvex optimization (see, e.g., [10, 11, 12, 24]), resulting even in worst-case complexity bounds for the number of iterations that $p$-order methods need to generate approximate $q$th order stationary points [8, 9].
More recently, in [31], $p$-order tensor methods with and without acceleration were proposed for unconstrained minimization of convex functions with Lipschitz continuous $p$th derivatives. As is usual in Convex Optimization, these methods aim at generating a point $\bar{x}$ such that $f(\bar{x})-f^{*}\leq\epsilon$, where $f$ is the objective function, $f^{*}$ is its optimal value and $\epsilon>0$ is a given precision. Specifically, it was shown that the non-accelerated scheme takes at most $\mathcal{O}(\epsilon^{-1/p})$ iterations to reduce the functional residual below a given $\epsilon\in(0,1)$, while the accelerated scheme takes at most $\mathcal{O}(\epsilon^{-1/(p+1)})$ iterations to accomplish the same task. Auxiliary problems in these methods consist in the minimization of a $(p+1)$-regularization of the $p$th order Taylor approximation of the objective, which is a multivariate polynomial. A remarkable result shown in [31] (which distinguishes this work from [1]) is that, in the convex case, the auxiliary problems in tensor methods become convex when the corresponding regularization parameter is sufficiently large. Since [31], several high-order methods have been proposed for convex optimization (see, e.g., [14, 15, 19, 21]), including near-optimal methods [5, 16, 22, 32, 33] motivated by the second-order method in [25]. In particular, in [19], we have adapted and generalized the methods in [17, 18, 31] to handle convex functions with $\nu$-Hölder continuous $p$th derivatives. It was shown that the non-accelerated schemes take at most $\mathcal{O}(\epsilon^{-1/(p+\nu-1)})$ iterations to generate a point with functional residual smaller than a given $\epsilon\in(0,1)$, while the accelerated variants take only $\mathcal{O}(\epsilon^{-1/(p+\nu)})$ iterations when the parameter $\nu$ is explicitly used in the scheme. For the case in which $\nu$ is not known, we also proposed a universal accelerated scheme for which we established an iteration complexity bound of $\mathcal{O}(\epsilon^{-p/[(p+1)(p+\nu-1)]})$.
As a natural development, in this paper we present variants of the $p$-order methods ($p\geq 2$) proposed in [19] that aim at generating a point $\bar{x}$ such that $\|\nabla f(\bar{x})\|_{*}\leq\epsilon$, for a given threshold $\epsilon\in(0,1)$. In the context of nonconvex optimization, finding approximate stationary points is usually the best one can expect from local optimization methods. In the context of convex optimization, one motivation to search for approximate stationary points is the fact that the norm of the gradient may serve as a measure of feasibility and optimality when one applies the dual approach for solving constrained convex problems (see, e.g., [30]). Another motivation comes from the inexact high-order proximal-point methods, recently proposed in [32, 33], in which the iterates are computed as approximate stationary points of uniformly convex models.
Specifically, our contributions are the following:
1. We show that the non-accelerated schemes in [19] take at most $\mathcal{O}(\epsilon^{-1/(p+\nu-1)})$ iterations to reduce the norm of the gradient of the objective below a given $\epsilon\in(0,1)$, when the objective $f$ is convex, and $\mathcal{O}(\epsilon^{-(p+\nu)/(p+\nu-1)})$ iterations, when $f$ is nonconvex. These complexity bounds extend our previous results reported in [17] for regularized Newton methods (case $p=2$). Moreover, our complexity bound for the nonconvex case agrees in order with the bounds obtained in [24] and [10] for different tensor methods.
2. For accelerated tensor schemes we establish improved complexity bounds of $\mathcal{O}(\epsilon^{-(p+\nu)/[(p+\nu-1)(p+\nu+1)]})$, when the Hölder parameter $\nu\in[0,1]$ is known. This result generalizes the bound of $\mathcal{O}(\epsilon^{-2/3})$ obtained in [30] for the accelerated gradient method ($p=\nu=1$). In contrast, when $\nu$ is unknown, we prove a bound of $\mathcal{O}(\epsilon^{-(p+1)/[(p+\nu-1)(p+2)]})$ for a universal accelerated scheme.
3. For the case in which $\nu$ and the corresponding Hölder constant are known, we propose tensor schemes for the composite minimization problem. In particular, we prove a bound of iterations, where is an upper bound for the initial distance to the optimal set. This result generalizes the bounds obtained in [30] for first-order and second-order accelerated schemes combined with a regularization approach ($p=1$ and $p=2$). We also prove a bound of iterations, where is an upper bound for the initial functional residual.
4. Considering the same class of difficult functions described in [19], we derive a lower complexity bound of $\mathcal{O}(\epsilon^{-2/[3(p+\nu)-2]})$ iterations (in terms of the initial distance to the optimal set), and a lower complexity bound of $\mathcal{O}(\epsilon^{-2(p+\nu)/[3(p+\nu)-2]})$ iterations (in terms of the initial functional residual), for $p$-order tensor methods to find $\epsilon$-approximate stationary points of convex functions with $\nu$-Hölder continuous $p$th derivatives. These bounds generalize the corresponding bounds given in [6] for first-order methods ($p=\nu=1$).
The paper is organized as follows. In section 2, we define our problem. In section 3, we present complexity results for tensor schemes without acceleration. In section 4, we present complexity results for accelerated schemes. In section 5, we analyze tensor schemes for the composite minimization problem. Finally, in section 6, we establish lower complexity bounds for tensor methods to find $\epsilon$-approximate stationary points of convex functions under the Hölder condition. Some auxiliary results are left to the Appendix.
1.2 Notations and Generalities
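Let $\mathbb{E}$ be a finite-dimensional real vector space, and $\mathbb{E}^{*}$ be its dual space. We denote by $\langle s,x\rangle$ the value of the linear functional $s\in\mathbb{E}^{*}$ at point $x\in\mathbb{E}$. Spaces $\mathbb{E}$ and $\mathbb{E}^{*}$ are equipped with conjugate Euclidean norms:
$$\|x\|=\langle Bx,x\rangle^{1/2},\ x\in\mathbb{E},\qquad \|s\|_{*}=\langle s,B^{-1}s\rangle^{1/2},\ s\in\mathbb{E}^{*},\tag{1.1}$$
where $B:\mathbb{E}\to\mathbb{E}^{*}$ is a self-adjoint positive definite operator ($B\succ 0$). For a smooth function $f:\mathbb{E}\to\mathbb{R}$, denote by $\nabla f(x)$ its gradient, and by $\nabla^{2}f(x)$ its Hessian evaluated at point $x\in\mathbb{E}$. Then $\nabla f(x)\in\mathbb{E}^{*}$ and $\nabla^{2}f(x)h\in\mathbb{E}^{*}$ for $x,h\in\mathbb{E}$.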
For any integer , denote by
[TABLE]
the directional derivative of function at along directions , . For any and we have
[TABLE]
If , we denote by . Using this notation, the th order Taylor approximation of function at can be written as follows:
[TABLE]
where
[TABLE]
Since is a symmetric -linear form, its norm is defined as:
[TABLE]
It can be shown that (see, e.g., Appendix 1 in [27])
[TABLE]
Similarly, since is also a symmetric -linear form for fixed , it follows that
[TABLE]
2 Problem Statement
In this paper we consider methods for solving the following minimization problem
[TABLE]
where is a convex function -times differentiable. We assume that (2.4) has at least one optimal solution . As in [19], the level of smoothness of the objective will be characterized by the family of Hölder constants
[TABLE]
From (2.5), it can be shown that, for all ,
[TABLE]
[TABLE]
and
[TABLE]
Given , if and , by (2.6) we have
[TABLE]
This property motivates the use of the following class of models of around :
[TABLE]
Note that, by (2.9), if then for all .
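The overestimation property of the regularized Taylor model can be checked numerically on a one-dimensional example. The sketch below is illustrative only and rests on several assumptions not taken from the text: the test function is $f=\exp$ restricted to $[-1,1]$ (so that every derivative is Lipschitz there with constant $L=e$, i.e., $\nu=1$), the model is assumed to have the form $\Phi_{x,p}(y)+\frac{H}{p!}|y-x|^{p+\nu}$, and the names `taylor` and `model` are hypothetical. The check is valid because the classical Taylor remainder is bounded by $\frac{L}{(p+1)!}|y-x|^{p+1}\leq\frac{L}{p!}|y-x|^{p+1}$.

```python
import math

def taylor(x, y, p):
    """p-th order Taylor polynomial of exp around x, evaluated at y.
    Every derivative of exp at x equals exp(x)."""
    return sum(math.exp(x) * (y - x) ** i / math.factorial(i) for i in range(p + 1))

def model(x, y, p, H, nu=1.0):
    """Assumed model form: Taylor polynomial plus (H / p!) * |y - x|**(p + nu)."""
    return taylor(x, y, p) + H / math.factorial(p) * abs(y - x) ** (p + nu)

# Assumption: on [-1, 1] every derivative of exp is Lipschitz with constant e.
L = math.e
grid = [-1.0 + 0.05 * k for k in range(41)]

# The model with H >= L should lie above f for every pair (x, y) on the grid.
ok = all(
    math.exp(y) <= model(x, y, p, L) + 1e-12
    for p in (1, 2, 3)
    for x in grid
    for y in grid
)
print(ok)
```

With a regularization parameter at least as large as the smoothness constant, each model is a global upper bound on the interval, which is the property that makes a model-decrease step a function-decrease step.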
3 Tensor Schemes Without Acceleration
Let us consider the following assumption:
- H1
for some .
Regarding the smoothness parameter , there are only two possible situations: either is known, or is unknown. In order to cover both cases in a single framework, as in [19], we shall consider the parameter
[TABLE]
Algorithm 1. Tensor Method (Algorithm 2 in [19])
Step 0. Choose , , and . Set by (3.11) and .
Step 1. If , STOP.
Step 2. Set .
Step 2.1 Compute an approximate solution to
(3.12)
such that
(3.13)
Step 2.2. If either or
(3.14)
holds, set and go to Step 3. Otherwise, set and go to Step 2.1.
Step 3. Set and .
Step 4. Set and go back to Step 1.
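The adaptive structure of Algorithm 1 (try a regularization parameter, double it until a sufficient-decrease test holds, then halve the accepted value for the next iteration) can be sketched in one dimension with $p=2$ and $\alpha=1$. This is a minimal illustration under stated assumptions, not the authors' implementation: the model is assumed to be $g s+\frac{1}{2}hs^{2}+\frac{H}{2}|s|^{3}$, the acceptance constant $[1/(8(p+1)!\,H)]^{1/(p+\alpha-1)}$ is an assumed form of test (3.14), the subproblem is solved exactly (closed form in 1-D), and the objective and all function names are hypothetical.

```python
import math

def f(x):      # hypothetical convex test objective, minimizer x* = 0
    return 0.25 * x**4 + 0.5 * x**2

def df(x):
    return x**3 + x

def d2f(x):
    return 3.0 * x**2 + 1.0

def model_step(g, h, H):
    """Exact minimizer s of g*s + 0.5*h*s**2 + (H/2)*|s|**3 for h >= 0, H > 0.
    Uses the rationalized root formula to avoid cancellation."""
    if g == 0.0:
        return 0.0
    r = 2.0 * abs(g) / (h + math.sqrt(h * h + 6.0 * H * abs(g)))
    return -math.copysign(r, g)

def tensor_method_1d(x0, H0=1.0, eps=1e-6, max_iter=100):
    x, H = x0, H0
    for t in range(max_iter):
        if abs(df(x)) <= eps:            # Step 1: stopping test on the gradient
            return x, t
        Hi = H
        for _ in range(60):              # Step 2: doubling loop (guarded)
            x_plus = x + model_step(df(x), d2f(x), Hi)
            # Assumed sufficient-decrease test with p = 2, alpha = 1.
            needed = math.sqrt(1.0 / (48.0 * Hi)) * abs(df(x_plus)) ** 1.5
            if f(x) - f(x_plus) >= needed:
                break
            Hi *= 2.0
        x, H = x_plus, Hi / 2.0          # Step 3: accept and halve the estimate
    return x, max_iter

x_final, iters = tensor_method_1d(3.0)
print(iters, abs(df(x_final)))
```

Halving the accepted parameter lets the estimate decrease when the function is locally smoother than the initial guess suggests, while the doubling loop protects against estimates that are too small.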
Remark 1.
If $\nu$ is unknown, by (3.11) we set $\alpha=1$ in Algorithm 1. The resulting algorithm is a universal scheme that can be viewed as a generalization of the universal second-order method (6.10) in [17]. Moreover, it is worth mentioning that for $p=3$ and $\alpha=1$, one can use Gradient Methods with Bregman distance [20] to approximately solve (3.12) in the sense of (3.13).
For both cases ( known or unknown), Algorithm 1 is a particular instance of Algorithm 1 in [19] in which for all . Let us define the following function of :
[TABLE]
The next lemma provides upper bounds on and on the number of calls of the oracle in Algorithm 1.
Lemma 3.1.
Suppose that H1 holds. Given , assume that is a sequence generated by Algorithm 1 such that
[TABLE]
Then,
[TABLE]
and, consequently,
[TABLE]
Moreover, the number of calls of the oracle after iterations is bounded as follows:
[TABLE]
Proof.
Let us prove (3.17) by induction. Clearly it holds for . Assume that (3.17) is true for some , . If is known, then by (3.11) we have . Thus, it follows from H1 and Lemma A.2 in [19] that the final value of cannot be bigger than , since otherwise we should stop the line search earlier. Therefore,
[TABLE]
that is, (3.17) holds for . On the other hand, if is unknown, we have . In view of (3.16), Corollary A.5 [19] and , we must have
[TABLE]
Consequently, it follows that
[TABLE]
that is, (3.17) holds for . This completes the induction argument. Using (3.17), for we get . Finally, note that at the th iteration of Algorithm 1, the oracle is called times. Since , it follows that . Thus, by (3.17) we get
[TABLE]
and the proof is complete. ∎
Let us consider the additional assumption:
- H2
The level sets of are bounded, that is, for , with being the starting point.
The next theorem gives global convergence rates for Algorithm 1 in terms of the functional residual.
Theorem 3.2.
Suppose that H1 and H2 are true and let be a sequence generated by Algorithm 1 such that, for , we have
[TABLE]
Let be the first iteration number such that
[TABLE]
and assume that . Then
[TABLE]
and, for all , , we have
[TABLE]
Proof.
By Lemma 3.1, this result follows from Theorem 3.1 in [19] with . ∎
Now, we can derive global convergence rates for Algorithm 1 in terms of the norm of the gradient.
Theorem 3.3.
Under the same assumptions of Theorem 3.2, if for some , then
[TABLE]
Consequently,
[TABLE]
with
[TABLE]
Proof.
By Theorem 3.2, we have
[TABLE]
for all , . In particular, it follows from (3.14) and (3.24) that
[TABLE]
Therefore,
[TABLE]
and so (3.22) holds. By assumption, we have . Thus, by (3.22) we get
[TABLE]
[TABLE]
Finally, by analyzing separately the cases in which is known and unknown, it follows from (3.25) and (3.15) that (3.23) is true. ∎
Remark 2.
Suppose that the objective in (2.4) is nonconvex and bounded from below by . Then, it follows from (3.14) and (3.18) that
[TABLE]
Summing up these inequalities, we get
[TABLE]
and so, by (3.15), we obtain . This bound generalizes the bound of proved in [17] for . It agrees in order with the complexity bounds proved in [24] and [10] for different universal tensor methods.
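The summation argument in Remark 2 can be spelled out as follows. This is a sketch under assumptions: $c>0$ is a placeholder (not a symbol from the text) for the constant produced by (3.14) and (3.18), $q=(p+\alpha)/(p+\alpha-1)$ is the assumed exponent in the sufficient-decrease condition, $f_{low}$ denotes the lower bound on $f$, and $T$ counts the iterations performed while the gradient norm stays above $\epsilon$.

```latex
f(x_t)-f(x_{t+1}) \;\ge\; c\,\|\nabla f(x_{t+1})\|_{*}^{q} \;>\; c\,\epsilon^{q},
\qquad t=0,\ldots,T-1,
\\[4pt]
f(x_0)-f_{low} \;\ge\; \sum_{t=0}^{T-1}\bigl[f(x_t)-f(x_{t+1})\bigr] \;>\; T\,c\,\epsilon^{q}
\quad\Longrightarrow\quad
T \;<\; \frac{f(x_0)-f_{low}}{c}\,\epsilon^{-q}.
```

For $p=2$ and $\alpha=1$ this gives $q=3/2$, which matches the classical $\mathcal{O}(\epsilon^{-3/2})$ bound for cubic regularization.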
4 Accelerated Tensor Schemes
The schemes presented here generalize the procedures described in [30] for and . Specifically, our general scheme is obtained by adding Step 2 of Algorithm 1 at the end of Algorithm 4 in [19], in order to relate the functional decrease with the norm of the gradient of in suitable points:
Algorithm 2. Adaptive Accelerated Tensor Method
Step 0. Choose , , and . Set by (3.11) and define function . Set , and .
Step 1. If , STOP.
Step 2. Set .
Step 2.1. Compute the coefficient by solving equation
Step 2.2. Set and compute .
Step 2.3 Compute an approximate solution to , such that
Step 2.4. If either condition or
holds, set and go to Step 3. Otherwise, set and go back to Step 2.1.
Step 3. Set and .
Step 4. Define and compute .
Step 5. Set and .
Step 6. Set .
Step 6.1 Compute an approximate solution to such that
Step 6.2 If either or
(4.26)
holds, set and go to Step 7. Otherwise, set and go to Step 6.1.
Step 7. Set , , and go to Step 1.
Let us define the following function of :
[TABLE]
In Algorithm 2, note that is independent of . The next theorem establishes global convergence rates for the functional residual with respect to .
Theorem 4.1.
Assume that H1 holds and let the sequence be generated by Algorithm 2 such that, for we have
[TABLE]
Then,
[TABLE]
for .
Proof.
As in the proof of Lemma 3.1, it follows from (4.28), (4.27) and Lemmas A.6 and A.7 in [19] that
[TABLE]
which gives
[TABLE]
Then, (4.29) follows directly from Theorem 4.2 in [19] with . ∎
Now we can obtain global convergence rates for Algorithm 2 in terms of the norm of the gradient.
Theorem 4.2.
Suppose that H1 holds and let sequences and be generated by Algorithm 2. Assume that, for , we have
[TABLE]
If for some , then
[TABLE]
where
[TABLE]
with and defined in (3.15) and (4.27), respectively. Consequently,
[TABLE]
if is known (i.e., ), and
[TABLE]
if is unknown (i.e., ).
Proof.
By Theorem 4.1, we have
[TABLE]
for . On the other hand, as in Lemma 3.1, by (4.30) we get
[TABLE]
where is defined in (3.15). Then, in view of (4.26), it follows that
[TABLE]
for . In particular, for . Moreover, by the definition of , we get and . Therefore
[TABLE]
and
[TABLE]
Now, since , summing up (4.37), we get
[TABLE]
Thus,
[TABLE]
and so (4.31) holds. By assumption, we have . Thus, it follows from (4.31) that
[TABLE]
[TABLE]
If is known, by (3.15) and (4.27) we have . Then,
[TABLE]
and so
[TABLE]
Combining (4.38), (4.39) and , we obtain (4.32). If is unknown, it follows from (3.15) and (4.27) that
[TABLE]
Then,
[TABLE]
and so
[TABLE]
Combining (4.38), (4.40) and , we obtain (4.33). ∎
Remark 3.
When , bounds (4.32) and (4.33) have the same dependence on . However, when , the bound of obtained for the universal scheme (i.e., ) is worse than the bound of obtained for the non-universal scheme (i.e., ). In both cases, these complexity bounds are better than the bound of proved for Algorithm 1.
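The comparison in Remark 3 can be made concrete by evaluating the exponents of $\epsilon^{-1}$ in the three upper bounds. The sketch below uses exact rational arithmetic; the exponents $1/(p+\nu-1)$ and $(p+\nu)/[(p+\nu-1)(p+\nu+1)]$ are those listed in the Findings, while the universal exponent $(p+1)/[(p+\nu-1)(p+2)]$ is taken here as an assumption, since that formula did not survive in the text above. The function name `exponents` is hypothetical.

```python
from fractions import Fraction

def exponents(p, nu):
    """Exponents e such that the iteration bound is O(eps**(-e)).
    p: integer order of the method; nu: Hoelder parameter as a Fraction."""
    s = Fraction(p) + Fraction(nu)
    return {
        "non_accelerated": 1 / (s - 1),
        "accelerated_known_nu": s / ((s - 1) * (s + 1)),
        # Assumed form of the universal accelerated bound (nu unknown).
        "accelerated_universal": Fraction(p + 1) / ((s - 1) * (p + 2)),
    }

lipschitz = exponents(2, Fraction(1))
hoelder = exponents(2, Fraction(1, 2))
for name in lipschitz:
    print(name, lipschitz[name], hoelder[name])
```

For $p=2$, $\nu=1$ both accelerated exponents equal $3/8$, while for $\nu=1/2$ the known-$\nu$ scheme gives $10/21$ against $1/2$ for the universal one; in both cases the values are below the non-accelerated exponent.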
5 Composite Minimization
From now on, we will assume that and are known. In this setting, we can consider the composite minimization problem:
[TABLE]
where is a convex function satisfying H1 (see page 4), and is a simple closed convex function whose effective domain has nonempty relative interior, that is, . We assume that there exists at least one optimal solution for (5.41). By (2.6), if we have
[TABLE]
This motivates the following class of models of around a fixed point :
[TABLE]
where is defined in (2.10). The next lemma gives a sufficient condition for function to be convex. Its proof is an adaptation of the proof of Theorem 1 in [31].
Lemma 5.1.
Suppose that H1 holds for some . Then, for any we have
[TABLE]
Moreover, if , then function is convex for any .
Proof.
For any , it follows from (2.8) that
[TABLE]
Since is arbitrary, we get (5.43).
Now, suppose that . Then, by (5.43) we have
[TABLE]
Therefore, is convex. ∎
From Lemma 5.1, if it follows that is also convex. In this case, since , any solution of
[TABLE]
satisfies the first-order optimality condition:
[TABLE]
Therefore, there exists such that
[TABLE]
Instead of solving (5.44) exactly, in our algorithms we consider inexact solutions satisfying conditions (5.47) below. These conditions have already been used in [21] and are the composite analogue of the conditions proposed in [2]. It is worth mentioning that, for $p=3$ and $\nu=1$, the tensor model has very nice relative smoothness properties (see [31]), which allow the approximate solution of (5.44) by Bregman Proximal Gradient Algorithms [23, 3]. Specifically, we consider inexact solutions such that
[TABLE]
for some and . For such points , we define
[TABLE]
with satisfying (5.47). Clearly, we have .
Lemma 5.2.
Suppose that H1 holds and let be an approximate solution of (5.44) such that (5.47) holds for some . If
[TABLE]
then
[TABLE]
Proof.
By (5.48), (2.7), (2.10), (5.47) and (5.49) we have
[TABLE]
where the last inequality is due to . On the other hand, by (2.6), (5.42), (5.49), we have
[TABLE]
Note that . Thus,
[TABLE]
Finally, combining (5.51) and (5.52), we get (5.50). ∎
In this composite context, let us consider the following scheme:
Algorithm 3. Tensor Method for Composite Minimization
Step 0. Choose and . Set and .
Step 1. Compute an approximate solution to such that
for some .
Step 2. Set and go back to Step 1.
For , point at Step 1 can be computed by Algorithm 2 in [20], which is linearly convergent. As far as we know, the development of efficient algorithms to approximately solve (5.44)-(5.42) with is still an open problem.
Theorem 5.3.
Suppose that H1 holds and that is bounded from below by . Given , assume that is a sequence generated by Algorithm 3 such that for . Then,
[TABLE]
Proof.
By Lemma 5.2, bound (5.53) follows as in Remark 2. ∎
5.1 Extended Accelerated Scheme
Let us consider the following variant of Algorithm 2 for composite minimization:
Algorithm 4. Two-Phase Accelerated Tensor Method
Step 0. Choose , and . Define . Set , , and .
Step 1. If and , STOP.
Step 2. Compute the coefficient by solving the equation
Step 3. Set , with .
Step 4. Compute an approximate solution to such that
(5.54)
for some .
Step 5. Define and compute .
Step 6. Set and .
Step 7 Compute an approximate solution to such that
(5.55)
for some .
Step 8 Set and go to Step 1.
The next theorem gives the global convergence rate for Algorithm 4 in terms of the norm of the gradient. Its proof is a direct adaptation of the proof of Theorem 4.2.
Theorem 5.4.
Suppose that H1 holds. Assume that is a sequence generated by Algorithm 4 such that
[TABLE]
If for some , then
[TABLE]
Consequently,
[TABLE]
Proof.
In view of Theorem A.2, we have
[TABLE]
for . On the other hand, by (5.55) and Lemma 5.2, we have
[TABLE]
for . Thus, and, consequently,
[TABLE]
and
[TABLE]
Since , combining (5.59), (5.61) and (5.62), we obtain
[TABLE]
where . Therefore,
[TABLE]
which gives (5.57). Finally, by (5.56) we have . Thus, (5.58) follows directly from (5.57). ∎
5.2 Regularization Approach
Now, let us consider the ideal situation in which , and are known. In this case, a complexity bound with a better dependence on can be obtained by repeatedly applying an accelerated algorithm to a suitable regularization of . Specifically, given , consider the regularized problem
[TABLE]
for
[TABLE]
Lemma 5.5.
Given and , let be defined by , where is the Euclidean norm defined in (1.1). Then,
[TABLE]
where .
Proof.
See [34]. ∎
As a consequence of the lemma above, we have the following property.
Lemma 5.6.
If H1 holds, then the th derivative of in (5.64) is -Hölder continuous with constant .
In view of Lemma 5.6, to solve (5.63) we can use the following instance of Algorithm A (see Appendix A):
Algorithm 5. Accelerated Tensor Method for Problem (5.63)
Step 0. Choose , and . Define function
. Set
(5.65)
, and .
Step 1. Compute the coefficient by solving equation
Step 2. Compute , with .
Step 3. Compute an approximate solution to such that
for some .
Step 4. Define and compute .
Step 5. Set and go back to Step 1.
Let us consider the following restart procedure based on Algorithm 5.
Algorithm 6. Accelerated Regularized Tensor Method
Step 0. Choose , , and . Define
(5.66)
for defined in (5.65). Set , and .
Step 1. If and , STOP.
Step 2. By applying Algorithm 5 to problem (5.63), with , compute the first iterates .
Step 3. Set and compute such that
(5.67)
for some .
Step 4. Set and go back to Step 1.
Theorem 5.7.
Suppose that H1 holds and let be a sequence generated by Algorithm 6 such that
[TABLE]
Then,
[TABLE]
Proof.
Let . By Theorem A.2 and (5.66), we have
[TABLE]
On the other hand, by Lemma 5 in [13] and Lemma 1 in [29], function is uniformly convex of degree with parameter . Thus,
[TABLE]
Combining (5.70) and (5.71), we obtain , and so
[TABLE]
Thus, it follows from (5.70) and (5.72) that
[TABLE]
In view of Lemma 5.2, by (5.67) and (5.65), we get
[TABLE]
Then, combining (5.73) and (5.74), it follows that
[TABLE]
In particular, for , it follows from (5.68) that
[TABLE]
Since , it follows that . Thus, combining this with (5.75), we get (5.69). ∎
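The proof above relies on the uniform convexity of a power of the Euclidean norm. As a numerical sanity check, the sketch below tests the well-known gradient inequality $\langle\nabla d(x)-\nabla d(y),x-y\rangle\geq 2^{2-q}\|x-y\|^{q}$ for $d(x)=\frac{1}{q}\|x\|^{q}$ with $q\geq 2$, on random points. It is illustrative only: it assumes the standard Euclidean norm ($B=I$), the values of $q$ are arbitrary stand-ins for $p+\nu$, and the names `grad_d` and `gap` are hypothetical.

```python
import random

def grad_d(x, q):
    """Gradient of d(x) = ||x||**q / q for the standard Euclidean norm (B = I)."""
    norm = sum(v * v for v in x) ** 0.5
    return [norm ** (q - 2) * v for v in x]

def gap(x, y, q):
    """<grad d(x) - grad d(y), x - y> - 2**(2 - q) * ||x - y||**q."""
    gx, gy = grad_d(x, q), grad_d(y, q)
    lhs = sum((a - b) * (u - v) for a, b, u, v in zip(gx, gy, x, y))
    dist = sum((u - v) ** 2 for u, v in zip(x, y)) ** 0.5
    return lhs - 2.0 ** (2 - q) * dist ** q

random.seed(0)

def rand_point(n=3):
    return [random.uniform(-1.0, 1.0) for _ in range(n)]

# Smallest gap over random pairs; it should be nonnegative up to rounding.
worst = min(
    gap(rand_point(), rand_point(), q)
    for q in (2.0, 2.5, 3.0, 4.0)
    for _ in range(2000)
)
print(worst >= -1e-12)
```

The inequality is tight for $y=-x$, so the tolerance only absorbs floating-point rounding.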
Corollary 5.8.
Suppose that H1 holds and that . Then, Algorithm 6 with
[TABLE]
performs at most
[TABLE]
iterations of Algorithm 5 in order to generate such that .
Proof.
By Theorem 5.7, we can obtain with
[TABLE]
Moreover, it follows from (5.65), (5.76), the definition of in Lemma 5.6, and that
[TABLE]
Combining (5.78), (5.79) and (5.76), we have
[TABLE]
At this point , we have
[TABLE]
Since is uniformly convex of degree with parameter , it follows from (5.74) and (5.73) that
[TABLE]
Therefore, , and so
[TABLE]
Now, combining (5.81), (5.83) and (5.76), we obtain
[TABLE]
The conclusion is obtained by noticing that, for given in (5.76) we have
[TABLE]
Thus, (5.77) follows from multiplying (5.80) and (5.85). ∎
Suppose now that is known. In this case, we have the following variant of Theorem 5.7.
Theorem 5.9.
Suppose that H1 holds and let be a sequence generated by Algorithm 6 such that
[TABLE]
Then,
[TABLE]
Proof.
By (5.75), we have
[TABLE]
Since is uniformly convex of degree with parameter we have
[TABLE]
Combining (5.88) and (5.89) we get (5.87). ∎
Corollary 5.10.
Suppose that H1 holds and that . Then, Algorithm 6 with
[TABLE]
performs at most
[TABLE]
iterations of Algorithm 5 in order to generate such that .
Proof.
By Theorem 5.9, we can obtain with
[TABLE]
In view of (5.90), and , we also have
[TABLE]
Thus, from (5.92) and (5.93) it follows that
[TABLE]
At this point we have
[TABLE]
[TABLE]
Thus, it follows from (5.2), (5.95) and (5.90) that
[TABLE]
Finally, by (5.66) and (5.90) we have
[TABLE]
Thus, (5.91) follows by multiplying (5.94) by the upper bound on given above. ∎
6 Lower complexity bounds under Hölder condition
In this section we derive lower complexity bounds for -order tensor methods applied to the problem (2.4) in terms of the norm of the gradient of , where the objective is convex and for some .
For simplicity, assume that and . Given an approximation for the solution of (2.4), we consider -order methods that compute trial points of the form , where the search direction is the solution of an auxiliary problem of the form
[TABLE]
with , and . Denote by the set of all stationary points of function and define the linear subspace
[TABLE]
More specifically, we consider the class of -order tensor methods characterized by the following assumption.
Assumption 1. Given , the method generates a sequence of test points such that
[TABLE]
Given , we consider the same family of difficult problems discussed in [19], namely:
[TABLE]
The next lemma establishes that for each we have .
Lemma 6.1.
Given an integer , the th derivative of is -Hölder continuous with
[TABLE]
Proof.
See Lemma 5.1 in [19]. ∎
The next lemma provides additional properties of .
Lemma 6.2.
Given an integer , let function be defined by (6.99). Then, has a unique global minimizer . Moreover,
[TABLE]
Proof.
See Lemma 5.2 in [19]. ∎
Our goal is to understand the behavior of the tensor methods specified by Assumption 1 when applied to the minimization of with a suitable . For that, let us consider the following subspaces:
[TABLE]
Lemma 6.3.
For any and , .
Proof.
It follows directly from (6.99). ∎
Lemma 6.4.
Let be a -order tensor method satisfying Assumption 1. If is applied to the minimization of () starting from , then the sequence of test points generated by satisfies
[TABLE]
Proof.
See Lemma 2 in [31]. ∎
The next lemma gives a lower bound for the norm of the gradient of on suitable points.
Lemma 6.5.
Let be an integer in the interval , with . If , then .
Proof.
In view of (6.99) we have
[TABLE]
where
[TABLE]
and
[TABLE]
By (6.104) and (6.103), we have
[TABLE]
Since , it follows that for . Therefore,
[TABLE]
which means that . Then, from (6.102), we obtain
[TABLE]
By (6.104), we have
[TABLE]
Consequently,
[TABLE]
and
[TABLE]
From (6.111), it can be checked that
[TABLE]
with
[TABLE]
Now, combining (6.110) and (6.111)–(6.112), we get
[TABLE]
Then, it follows from (6.109) and (6.114) that
[TABLE]
Finally, by (6.108) and (6.123) we have
[TABLE]
and the proof is complete. ∎
The next theorem establishes a lower bound for the rate of convergence of -order tensor methods with respect to the initial functional residual .
Theorem 6.6.
Let be a -order tensor method satisfying Assumption 1. Assume that for any function with this method ensures the rate of convergence:
[TABLE]
where is the sequence generated by method and is the optimal value of . Then, for all such that we have
[TABLE]
Proof.
Suppose that method is applied to minimize function with initial point . By Lemma 6.4, we have for all , . Thus, from Lemma 6.5 it follows that
[TABLE]
Then, combining (6.124), (6.126), Lemma 6.1 and Lemma 6.2 we get
[TABLE]
where constant is given in (6.125). ∎
Remark 4.
Theorem 6.6 gives a lower bound of $\mathcal{O}(k^{-[3(p+\nu)-2]/[2(p+\nu)]})$ for the rate of convergence of tensor methods with respect to the initial functional residual. For first-order methods in the Lipschitz case (i.e., $p=\nu=1$), we have $\mathcal{O}(k^{-1})$. This gives a lower complexity bound of $\mathcal{O}(\epsilon^{-1})$ iterations for finding $\epsilon$-stationary points of convex functions using first-order methods, which coincides with the lower bound (8a) in [6]. Moreover, in view of Corollary 5.10, Algorithm 6 is suboptimal in terms of the initial residual, with a complexity gap that increases as $p$ grows.
Now, we obtain a lower bound for the rate of convergence of -order tensor methods with respect to the distance .
Theorem 6.7.
Let be a -order tensor method satisfying Assumption 1. Assume that for any function with this method ensures the rate of convergence:
[TABLE]
where is the sequence generated by method and is a global minimizer of . Then, for all such that we have
[TABLE]
Proof.
Let us apply method for minimizing function starting from point . By Lemma 6.4, we have for all , . Thus, from Lemma 6.5 it follows that
[TABLE]
Then, combining (6.127), (6.129), Lemma 6.1 and Lemma 6.2 we get
[TABLE]
where constant is given in (6.128). ∎
Remark 5.
Theorem 6.7 establishes that the lower bound for the rate of convergence of tensor methods in terms of the norm of the gradient is also of $\mathcal{O}(k^{-[3(p+\nu)-2]/2})$. For first-order methods in the Lipschitz case (i.e., $p=\nu=1$) we have $\mathcal{O}(k^{-2})$. This gives a lower complexity bound of $\mathcal{O}(\epsilon^{-1/2})$ for finding $\epsilon$-stationary points of convex functions using first-order methods, which coincides with the lower bound (8b) in [6].
Remark 6.
The rate of corresponds to a worst-case complexity bound of iterations necessary to ensure . Note that, for , we have
[TABLE]
Thus, by increasing the power of the oracle (i.e., the order ), our non-universal schemes become nearly optimal. For example, if and , we have
7 Conclusion
In this paper, we presented $p$-order methods that can find $\epsilon$-approximate stationary points of convex functions that are $p$-times differentiable with $\nu$-Hölder continuous $p$th derivatives. For the universal and the non-universal schemes without acceleration, we established iteration complexity bounds of $\mathcal{O}(\epsilon^{-1/(p+\nu-1)})$ for finding $\bar{x}$ such that $\|\nabla f(\bar{x})\|_{*}\leq\epsilon$. For the case in which $\nu$ is known, we obtained improved complexity bounds of $\mathcal{O}(\epsilon^{-(p+\nu)/[(p+\nu-1)(p+\nu+1)]})$ and $\mathcal{O}(|\log(\epsilon)|\epsilon^{-1/(p+\nu)})$ for the corresponding accelerated schemes. For the case in which $\nu$ is unknown, we obtained a bound of $\mathcal{O}(\epsilon^{-(p+1)/[(p+\nu-1)(p+2)]})$ for a universal accelerated scheme. Similar bounds were also obtained for tensor schemes adapted to the minimization of composite convex functions. A lower complexity bound of $\mathcal{O}(\epsilon^{-2/[3(p+\nu)-2]})$ was obtained for the referred problem class. Therefore, in practice, our non-universal schemes become nearly optimal as we increase the order $p$.
As an additional result, we showed that Algorithm 6 takes at most $\mathcal{O}(\log(\epsilon^{-1}))$ iterations to find $\epsilon$-stationary points of uniformly convex functions of degree $p+\nu$ in the form (5.64). Notice that strongly convex functions are uniformly convex of degree 2. Thus, our result generalizes the known bound of $\mathcal{O}(\log(\epsilon^{-1}))$ obtained for first-order schemes ($p=1$) applied to strongly convex functions with Lipschitz continuous gradients ($\nu=1$). At this point, it is not clear to us how $p$-order methods (with $p\geq 2$) behave when the objective function is strongly convex with $\nu$-Hölder continuous $p$th derivatives. Nevertheless, from the remarks made in [13, p. 6] for $p=2$, it appears that in our case the class of uniformly convex functions of degree $p+\nu$ is the most suitable for $p$-order methods from a physical point of view.
Acknowledgments
The authors are very grateful to an anonymous referee, whose comments helped to improve the first version of this paper.
Funding
G.N. Grapiglia was supported by the National Council for Scientific and Technological Development - Brazil (grant 406269/2016-5) and by the European Research Council Advanced Grant 788368. Yu. Nesterov was supported by the European Research Council Advanced Grant 788368.
Appendix A Accelerated Scheme for Composite Minimization
To solve problem (5.41), we can apply the following modification of Algorithm 3 in [19]:
Algorithm A. Accelerated Tensor Method for Composite Minimization
Step 0. Choose , and define . Set , , and .
Step 1. Compute by solving the equation
(A.130)
Step 2. Compute with .
Step 3. Compute an approximate solution to such that
(A.131)
for some .
Step 4. Define .
Step 5. Set and go to Step 1.
In order to establish a convergence rate for Algorithm A, we will need the following result.
Lemma A.1.
Suppose that H1 holds and let be an approximate solution to such that
[TABLE]
for some . If , then
[TABLE]
Proof.
Denote . Then,
[TABLE]
which gives
[TABLE]
From (A.134), the rest of the proof follows exactly as in the proof of Lemma A.6 in [19]. ∎
Theorem A.2.
Suppose that H1 holds and let the sequence be generated by Algorithm A. Then, for ,
[TABLE]
Proof.
For all , we have
[TABLE]
Indeed, (A.136) is true for because and . Suppose that (A.136) is true for some . Then,
[TABLE]
Thus, (A.136) follows by induction. Now, let us prove that
[TABLE]
Again, using , we see that (A.137) is true for . Assume that (A.137) is true for some . Note that is uniformly convex of degree with parameter . Thus, by the induction assumption
[TABLE]
Consequently,
[TABLE]
Since is convex and differentiable and , we have
[TABLE]
and
[TABLE]
Using (A.139) and (A.140) in (A.138), it follows that
[TABLE]
Note that and . Thus, combining (A.141) and Lemma A.1, we obtain
[TABLE]
where the last inequality follows from (A.130) exactly as in the proof of Theorem 4.2 in [17]. Thus, (A.137) also holds for , which completes the induction argument.
Now, combining (A.136) and (A.137) we have
[TABLE]
Once again, as in the proof of Theorem 4.2 in [19], it follows from (A.130) that
[TABLE]
Finally, (A.135) follows directly from (A.142) and (A.143). ∎
- [1] Baes, M.: Estimate Sequence Methods: Extensions and Approximations. Optimization Online (2009)
- [2] Birgin, E.G., Gardenghi, J.L., Martínez, J.M., Santos, S.A., Toint, Ph.L.: Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models. Mathematical Programming 163, 359–368 (2017)
- [3] Bolte, J., Sabach, S., Teboulle, M., Vaisbourd, Y.: First Order Methods Beyond Convexity and Lipschitz Gradient Continuity with Applications to Quadratic Inverse Problems. SIAM Journal on Optimization 28, 2131–2151 (2018)
- [4] Bouaricha, A.: Tensor methods for large, sparse unconstrained optimization. SIAM Journal on Optimization 7, 732–756 (1997)
- [5] Bubeck, S., Jiang, Q., Lee, Y.T., Li, Y., Sidford, A.: Near-optimal method for highly smooth convex optimization. arXiv:1812.08026v2 [math.OC] (2019)
- [6] Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Lower Bounds for Finding Stationary Points II: First-Order Methods. arXiv:1711.00841 [math.OC] (2017)
- [7] Cartis, C., Gould, N.I.M., Toint, Ph.L.: Adaptive cubic regularization methods for unconstrained optimization. Part II: worst-case function- and derivative-evaluation complexity. Mathematical Programming 130, 295–319 (2011)
- [8] Cartis, C., Gould, N.I.M., Toint, Ph.L.: Second-order optimality and beyond: Characterization and evaluation complexity in convexly constrained nonlinear optimization. Foundations of Computational Mathematics 18, 1073–1107 (2018)
