Tensor Methods for Minimizing Convex Functions with Hölder Continuous Higher-Order Derivatives
Geovani Nunes Grapiglia, Yurii Nesterov

TL;DR
This paper develops tensor-based optimization methods for convex functions with higher-order derivatives that are Hölder continuous, providing complexity bounds for both accelerated and universal schemes, advancing the theoretical understanding of such optimization problems.
Contribution
It introduces new tensor schemes with and without acceleration for convex minimization, establishing their iteration complexity bounds and a universal scheme for unknown Hölder parameters.
Findings
Accelerated tensor schemes achieve improved complexity bounds.
Universal scheme works without knowing Hölder continuity parameter.
A lower complexity bound shows that the accelerated nonuniversal scheme is nearly optimal.
SIAM Journal on Optimization, Vol. 30, No. 4, pp. 2750–2779
G.N. Grapiglia: Departamento de Matemática, Universidade Federal do Paraná, Centro Politécnico, Cx. postal 19.081, 81531-980, Curitiba, Paraná, Brazil. This author was supported by the National Council for Scientific and Technological Development - Brazil (grants 401288/2014-5 and 406269/2016-5) and by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 788368).
Yu. Nesterov: Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 34 voie du Roman Pays, 1348 Louvain-la-Neuve, Belgium. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 788368).
(November 28, 2019)
Abstract
In this paper we study $p$-order methods for unconstrained minimization of convex functions that are $p$-times differentiable ($p \geq 2$) with $\nu$-Hölder continuous $p$th derivatives. We propose tensor schemes with and without acceleration. For the schemes without acceleration, we establish iteration complexity bounds of $\mathcal{O}(\epsilon^{-1/(p+\nu-1)})$ for reducing the functional residual below a given $\epsilon \in (0,1)$. Assuming that $\nu$ is known, we obtain an improved complexity bound of $\mathcal{O}(\epsilon^{-1/(p+\nu)})$ for the corresponding accelerated scheme. For the case in which $\nu$ is unknown, we present a universal accelerated tensor scheme with iteration complexity of $\mathcal{O}(\epsilon^{-p/[(p+1)(p+\nu-1)]})$. A lower complexity bound of $\mathcal{O}(\epsilon^{-2/[3(p+\nu)-2]})$ is also obtained for this problem class.
keywords:
unconstrained minimization, high-order methods, tensor methods, Hölder condition, worst-case global complexity bounds
AMS:
49M15, 49M37, 58C15, 90C25, 90C30
1 Introduction
1.1 Motivation
In [13], it was shown that a suitable cubic regularization of the Newton method (CNM) takes at most $\mathcal{O}(\epsilon^{-1/2})$ iterations to reduce the functional residual below a given precision $\epsilon > 0$ when the objective is a twice-differentiable convex function with a Lipschitz continuous Hessian. A better complexity bound of $\mathcal{O}(\epsilon^{-1/3})$ was shown in [14] for an accelerated version of CNM. Auxiliary problems in these methods consist in the minimization of a third-order regularization of the second-order Taylor approximation of the objective function around the current iterate. A natural generalization is to consider auxiliary problems in which one minimizes a $(p+1)$-order regularization of the $p$th-order Taylor approximation of the objective function, resulting in tensor methods. Unconstrained optimization by tensor methods is not a new subject (see, for example, [17, 5]). In the context of convex optimization, accelerated tensor methods (as described above) were first considered by Baes [2]. However, the author did not realize that under a proper choice of the regularization coefficient the auxiliary problems become convex. This important observation was made in a recent paper [15], where tensor methods with and without acceleration were proposed for unconstrained minimization of $p$-times differentiable convex functions with Lipschitz continuous $p$th derivatives. An iteration complexity bound of $\mathcal{O}(\epsilon^{-1/p})$ was proved for the method without acceleration, while an improved bound of $\mathcal{O}(\epsilon^{-1/(p+1)})$ was proved for the accelerated tensor method.
In the present paper, we study tensor methods (with and without acceleration) that can handle convex functions with $\nu$-Hölder continuous $p$th derivatives ($\nu \in [0,1]$) and allow the inexact solution of auxiliary problems (in the sense of [4]). Specifically, our contribution is threefold:
1. For the schemes without acceleration, we establish iteration complexity bounds of $\mathcal{O}(\epsilon^{-1/(p+\nu-1)})$ for reducing the functional residual below a given $\epsilon \in (0,1)$.
2. Assuming that $\nu$ is known, we obtain an improved complexity bound of $\mathcal{O}(\epsilon^{-1/(p+\nu)})$ for the corresponding accelerated scheme. For the case in which $\nu$ is unknown, we present a universal accelerated tensor scheme with iteration complexity of $\mathcal{O}(\epsilon^{-p/[(p+1)(p+\nu-1)]})$.
3. A lower complexity bound of $\mathcal{O}(\epsilon^{-2/[3(p+\nu)-2]})$ is also obtained, from which we conclude that our accelerated nonuniversal scheme is nearly optimal.
The methods and results presented here extend in a significant way the contributions in [2, 8, 9, 15]. Indeed, [8, 9] deal only with second-order schemes ($p = 2$) which require the exact solution of the auxiliary problems. On the other hand, the $p$-order methods proposed in [2, 15] are restricted to the Lipschitz case ($\nu = 1$), assuming that the Lipschitz constant is known and that the auxiliary problems are solved exactly. We believe that the development of $p$-order methods with affordable trial steps and automatic adjustment to the objective’s function class (universality) constitutes a fundamental step towards implementable high-order methods for convex optimization.
1.2 Contents
The paper is organized as follows. In Section 2, we define our problem. In Section 3, we present tensor methods without acceleration and establish their convergence properties. In Section 4, we present complexity results for accelerated schemes. Finally, in Section 5 we obtain lower complexity bounds for tensor methods under the Hölder condition. All necessary auxiliary results are included in Appendix A.
1.3 Notations and Generalities
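In what follows, we denote by $\mathbb{E}$ a finite-dimensional real vector space, and by $\mathbb{E}^*$ its dual space, composed of linear functionals on $\mathbb{E}$. The value of function $s \in \mathbb{E}^*$ at point $x \in \mathbb{E}$ is denoted by $\langle s, x \rangle$. Given a self-adjoint positive definite operator $B : \mathbb{E} \to \mathbb{E}^*$ (notation $B \succ 0$), we can endow these spaces with conjugate Euclidean norms:
$$\|x\| = \langle Bx, x \rangle^{1/2}, \ x \in \mathbb{E}, \qquad \|s\|_* = \langle s, B^{-1}s \rangle^{1/2}, \ s \in \mathbb{E}^*.$$
For a smooth function $f : \operatorname{dom} f \to \mathbb{R}$ with convex and open domain $\operatorname{dom} f \subseteq \mathbb{E}$, denote by $\nabla f(x)$ its gradient and by $\nabla^2 f(x)$ its Hessian evaluated at point $x \in \operatorname{dom} f$. Note that $\nabla f(x) \in \mathbb{E}^*$ and $\nabla^2 f(x)h \in \mathbb{E}^*$ for $x \in \operatorname{dom} f$ and $h \in \mathbb{E}$.
For any integer $p \geq 1$, denote by
$$D^p f(x)[h_1, \ldots, h_p]$$
the directional derivative of function $f$ at $x$ along directions $h_i \in \mathbb{E}$, $i = 1, \ldots, p$. In particular, for any $x \in \operatorname{dom} f$ and $h_1, h_2 \in \mathbb{E}$ we have
$$Df(x)[h_1] = \langle \nabla f(x), h_1 \rangle, \qquad D^2 f(x)[h_1, h_2] = \langle \nabla^2 f(x) h_1, h_2 \rangle.$$
For $h_1 = \ldots = h_p = h$, we use notation $D^p f(x)[h]^p$. Then the $p$th-order Taylor approximation of function $f$ at $x \in \operatorname{dom} f$ can be written as follows:
$$f(x+h) = \Phi_{x,p}(x+h) + o(\|h\|^p), \quad x + h \in \operatorname{dom} f,$$
where
$$\Phi_{x,p}(y) \equiv f(x) + \sum_{i=1}^{p} \frac{1}{i!} D^i f(x)[y-x]^i, \quad y \in \mathbb{E}.$$
Note that $D^p f(x)[\cdot]$ is a symmetric $p$-linear form. Its norm is defined by
$$\|D^p f(x)\| = \max_{h_1, \ldots, h_p} \left\{ \left| D^p f(x)[h_1, \ldots, h_p] \right| : \|h_i\| \leq 1, \ i = 1, \ldots, p \right\}.$$
In fact, it can be shown that (see, e.g., [3])
$$\|D^p f(x)\| = \max_{h} \left\{ \left| D^p f(x)[h]^p \right| : \|h\| \leq 1 \right\}.$$
Similarly, since $D^p f(x)[\cdot] - D^p f(y)[\cdot]$ is also a symmetric $p$-linear form for fixed $x, y \in \operatorname{dom} f$, we can define
$$\|D^p f(x) - D^p f(y)\| = \max_{h} \left\{ \left| D^p f(x)[h]^p - D^p f(y)[h]^p \right| : \|h\| \leq 1 \right\}.$$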
2 Problem Statement
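In this paper we consider methods for solving the following minimization problem
$$\min_{x \in \mathbb{E}} f(x), \tag{2.3}$$
where $f : \mathbb{E} \to \mathbb{R}$ is a convex $p$-times differentiable function ($p \geq 2$). We assume that there exists at least one optimal solution $x^* \in \mathbb{E}$ for problem (2.3). Let us characterize the level of smoothness of the objective $f$ by the system of Hölder constants
$$H_{f,p}(\nu) \equiv \sup_{x, y \in \mathbb{E},\ x \neq y} \frac{\|D^p f(x) - D^p f(y)\|}{\|x - y\|^{\nu}}, \quad 0 \leq \nu \leq 1. \tag{2.4}$$
Then, from (2.4) and from the integral form of the mean-value theorem, it follows that
$$|f(y) - \Phi_{x,p}(y)| \leq \frac{H_{f,p}(\nu)}{p!} \|y - x\|^{p+\nu}, \tag{2.5}$$
$$\|\nabla f(y) - \nabla \Phi_{x,p}(y)\|_* \leq \frac{H_{f,p}(\nu)}{(p-1)!} \|y - x\|^{p+\nu-1}, \tag{2.6}$$
for all $x, y \in \mathbb{E}$. Given $x \in \mathbb{E}$, if $H_{f,p}(\nu) < +\infty$ and $H \geq H_{f,p}(\nu)$, by (2.5) we have
$$f(y) \leq \Phi_{x,p}(y) + \frac{H}{p!} \|y - x\|^{p+\nu}, \quad y \in \mathbb{E}. \tag{2.7}$$
This property motivates the use of the following class of models of $f$ around $x \in \mathbb{E}$:
$$\Omega^{(\alpha)}_{x,p,H}(y) = \Phi_{x,p}(y) + \frac{H}{p!} \|y - x\|^{p+\alpha}, \quad \alpha \in [0,1]. \tag{2.8}$$
In particular, as long as $H \geq H_{f,p}(\nu)$, by (2.7) we have
$$f(y) \leq \Omega^{(\nu)}_{x,p,H}(y), \quad y \in \mathbb{E}. \tag{2.9}$$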
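The Taylor-remainder bound that follows from the Hölder condition can be checked numerically on a one-dimensional instance. The sketch below is illustrative only. It assumes $\mathbb{E} = \mathbb{R}$ with $\|x\| = |x|$, the test function $f(x) = |x|^{2.5}$ (so $p = 2$, $\nu = 1/2$, and $f''$ is $1/2$-Hölder continuous with constant $H = 3.75$), and the normalization $|f(y) - \Phi_{x,p}(y)| \leq \frac{H}{p!}|y - x|^{p+\nu}$. All function names are hypothetical.

```python
import math

P, NU, H = 2, 0.5, 3.75  # assumed instance: f(x) = |x|**2.5


def f(x):
    return abs(x) ** 2.5


def df(x):
    # derivative of |x|**2.5 is 2.5 * sign(x) * |x|**1.5
    return 2.5 * math.copysign(abs(x) ** 1.5, x)


def d2f(x):
    # second derivative 3.75 * |x|**0.5 is 1/2-Hoelder with constant 3.75
    return 3.75 * abs(x) ** 0.5


def taylor2(x, y):
    """Second-order Taylor approximation of f around x, evaluated at y."""
    h = y - x
    return f(x) + df(x) * h + 0.5 * d2f(x) * h * h


def max_ratio(points):
    """Largest observed ratio |f(y) - Phi(y)| / (H/p! * |y - x|**(p + nu))."""
    worst = 0.0
    for x in points:
        for y in points:
            if x == y:
                continue
            gap = abs(f(y) - taylor2(x, y))
            bound = (H / math.factorial(P)) * abs(y - x) ** (P + NU)
            worst = max(worst, gap / bound)
    return worst
```

On any grid of points the returned ratio should not exceed 1, which is the one-dimensional content of the remainder bound for this particular $f$.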
3 Tensor schemes without acceleration
If we assume that for some , there are two possible situations: either is known, or is unknown. We cover both cases in a single framework by introducing parameter
[TABLE]
Let be the target precision. At the beginning of the th iteration one has an estimate for the solution of (2.3) and a scaling coefficient . A trial point is computed as an approximate solution to the auxiliary problem
[TABLE]
with given by (3.10). Similarly to [4], the trial point must satisfy the following conditions:
[TABLE]
where is a user-defined parameter. When (3.11) is not convex, then is not necessarily an approximation of its global solution. If the descent condition
[TABLE]
holds, then is accepted and we define . Otherwise, constant is increased until the corresponding trial point is accepted. We will see that this process is well defined in the sense that there exists such that for all . This general scheme can be summarized in the following way.
Algorithm 1. Tensor Method
Step 0. Choose and . Set by (3.10) and .
Step 1. Find such that (3.13) holds for an approximate solution to (3.11) satisfying conditions (3.12).
Step 2. Set .
Step 3. Set and go back to Step 1.
Remark 1.
Regarding the approximate solution of the auxiliary problems, it is easy to see that satisfying (3.12) can be computed by any monotone optimization scheme that drives the gradient of the objective to zero. In [10], we investigated the possible use of gradient methods with Bregman distance for the case and . Specifically, under assumptions H1 and H2 below, we showed that if is a sequence generated by these methods applied to with
[TABLE]
then
[TABLE]
For more details, see Theorems 3.8 and 3.10 in [10].
To analyze the convergence of Algorithm 1, we introduce the following assumptions:
- H1
for some .
- H2
The level sets of are bounded, that is, for , with being the starting point.
The next theorem establishes global convergence rate for Algorithm 1.
Theorem 2.
Suppose that H1 and H2 are true and let be a sequence generated by Algorithm 1. Denote by the first iteration number such that
[TABLE]
and assume that . Then
[TABLE]
and, for all , , we have
[TABLE]
Proof.
By Step 1 in Algorithm 1, we have
[TABLE]
Thus, in view of (3.13), (3.16), and H2, for we have
[TABLE]
where the last inequality is due to the convexity of . Now, denoting
[TABLE]
we see from (3.18) that this sequence satisfies condition (1.1) of Lemma 1.1 in [8] with . Note that is the first iteration for which . Using Lemma 1.1 in [8], this inequality allows us to obtain the upper bound (3.14) for and also simplifies our final bound for the functional residual. Indeed, if , then and, in view of inequality (1.2) of Lemma 1.1 in [8], we have
[TABLE]
[TABLE]
Thus, , and so (3.14) holds. Consequently, from inequality (1.3) of Lemma 1.1 in [8] we get the following rate of convergence:
[TABLE]
that is,
[TABLE]
Therefore,
[TABLE]
∎
If we assume that and are known, by Lemma 17, we can set
[TABLE]
Here, by (3.10) the corresponding version of Algorithm 1 takes at most
iterations to generate such that for a given . However, in most practical problems, is not known. To deal with this situation, we can consider the following adaptive version of Algorithm 1:
Algorithm 2. Adaptive Tensor Method
Step 0. Choose , and . Set by (3.10) and .
Step 1. Set .
Step 1.1 Compute an approximate solution to , such that
Step 1.2. If
holds, set and go to Step 2. Otherwise, set and go to Step 1.1.
Step 2. Set and .
Step 3. Set and go to Step 1.
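The adaptive search over the regularization constant can be sketched in code for the simplest case. The block below is a minimal illustration, not the scheme analyzed in this paper: it assumes $p = 2$, $\alpha = 1$, a one-dimensional convex objective, an exact minimizer of the regularized model with regularization term $\frac{M}{2}|h|^3$, and a simplified acceptance test (the trial point must not exceed the model value) in place of the descent condition of Step 1.2. The update $M_{t+1} = 2^{i_t - 1} M_t$ is also an assumption, as are the lower floor on $M$ and all function names.

```python
import math


def cubic_step(g, h2, M):
    """Minimizer of g*h + 0.5*h2*h**2 + 0.5*M*|h|**3 for h2 >= 0 and M > 0."""
    if g == 0.0:
        return 0.0
    # positive root of 1.5*M*r**2 + h2*r - |g| = 0, written in a stable form
    r = 2.0 * abs(g) / (h2 + math.sqrt(h2 * h2 + 6.0 * M * abs(g)))
    return -math.copysign(r, g)


def adaptive_cubic_newton(f, df, d2f, x0, M0=1.0, tol=1e-6, max_iter=200):
    """Regularized Newton iteration with doubling search on M (illustrative)."""
    x, M, calls = x0, M0, 0
    for _ in range(max_iter):
        g, h2 = df(x), d2f(x)
        if abs(g) <= tol:
            break
        for i in range(60):  # trial constants M, 2M, 4M, ...
            Mi = (2.0 ** i) * M
            h = cubic_step(g, h2, Mi)
            calls += 1
            model = f(x) + g * h + 0.5 * h2 * h * h + 0.5 * Mi * abs(h) ** 3
            if f(x + h) <= model:  # simplified acceptance test (assumption)
                break
        x = x + h
        M = max(0.5 * Mi, 1e-3)  # halve the accepted constant, keep a floor
    return x, M, calls


def logistic_loss(x):
    return math.log1p(math.exp(x)) - 0.3 * x


def logistic_grad(x):
    return 1.0 / (1.0 + math.exp(-x)) - 0.3


def logistic_hess(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)
```

For the test objective `logistic_loss`, whose minimizer is $\log(0.3/0.7)$, calling `adaptive_cubic_newton(logistic_loss, logistic_grad, logistic_hess, 3.0)` returns a point close to that value without any prior knowledge of the smoothness constant.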
Let us define the following function of :
[TABLE]
where
[TABLE]
The next lemma provides upper bounds on and on the number of calls of the oracle (by calls of the oracle we mean the joint computation of the objective value and its derivatives).
Lemma 3.
Suppose that H1 and H2 are true. Given , assume that is a sequence generated by Algorithm 2 such that
[TABLE]
[TABLE]
Then,
[TABLE]
Moreover, the number of calls of the oracle after iterations is bounded as follows:
[TABLE]
Proof.
Let us prove (3.23) by induction. Clearly it holds for . Assume that (3.23) is true for some , . If is known, then by (3.10) we have . Thus, by H1 and Lemma 17 the final value of cannot exceed
[TABLE]
since otherwise we should stop the line search earlier. Therefore,
[TABLE]
that is, (3.23) holds for .
On the other hand, if is unknown, we have . In view of (3.20), (3.21) and H2, it follows that
[TABLE]
Thus, by (3.22) and Lemma A.5 in [9] we have In this case, it follows from Corollary 20 with that
[TABLE]
Consequently, we also have
[TABLE]
that is, (3.23) holds for . This completes the induction argument.
Finally, note that at the th iteration of Algorithm 2, the oracle is called times. Since , it follows that . Thus, by (3.23) we get
[TABLE]
∎
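The counting argument at the end of this proof is a telescoping sum. The derivation below is a sketch under stated assumptions: it takes the update rule to be $M_{t+1} = 2^{i_t - 1} M_t$, counts $i_t + 1$ oracle calls at iteration $t$, and writes $N_T$ (a symbol introduced here only for illustration) for the total number of calls after $T$ iterations.

```latex
\begin{aligned}
N_T = \sum_{t=0}^{T-1} (i_t + 1)
    &= \sum_{t=0}^{T-1} \bigl( 2 + \log_2 M_{t+1} - \log_2 M_t \bigr) \\
    &= 2T + \log_2 M_T - \log_2 M_0 .
\end{aligned}
```

Under these assumptions, any upper bound on $M_T$ bounds the total number of oracle calls by $2T$ plus a logarithmic term, so the adaptive search costs on average about two oracle calls per iteration.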
From Lemma 3, we see that Algorithm 2 is a particular case of Algorithm 1 in which
[TABLE]
Thus, combining Theorem 2 and Lemma 3, we obtain the following result.
Theorem 4.
Suppose that H1 and H2 are true. Given , assume that is a sequence generated by Algorithm 2 such that
[TABLE]
[TABLE]
Denote by the first iteration number such that
[TABLE]
and assume that . Then
[TABLE]
and
[TABLE]
Consequently,
[TABLE]
where
[TABLE]
Proof.
As mentioned above, by Lemma 3 we have
[TABLE]
Then, (3.28) and (3.29) follow directly from Theorem 2 with
[TABLE]
Now, combining (3.27) and (3.29), we obtain
[TABLE]
and so,
[TABLE]
If is known, then and, by (3.19), we have
[TABLE]
Thus, combining (3.31) and (3.32), we get (3.30). On the other hand, if is unknown, then and, by (3.19), (3.25) and , we have
[TABLE]
In this case, combining (3.31) and (3.33) we also get (3.30). ∎
Remark 5.
Note that for any in the interval
[TABLE]
the corresponding right-hand side in (3.30) has the same value as for .
Note that Algorithm 2 with is a universal scheme: it works for any Hölder parameter without using it explicitly. This algorithm can be viewed as a generalization of the universal method (6.10) in [8]. Looking at the efficiency bound (3.30), for known and unknown, we see that the universal scheme ensures the same dependence on the accuracy as the nonuniversal scheme (). Remarkably, this is not true for the accelerated schemes obtained from the standard estimating sequences technique, as we will see in the next section.
4 Accelerated tensor schemes
Similarly to Section 3, we shall consider a general accelerated tensor method parametrized by the constant given in (3.10). Specifically, at the beginning of the th iteration () one has an estimate for the solution of (2.3), an auxiliary vector and constants . A new vector is computed as a convex combination of and :
[TABLE]
where
[TABLE]
with being computed from the equation
[TABLE]
Then, a trial point is computed as an approximate solution to the auxiliary problem
[TABLE]
such that
[TABLE]
where is a user-defined parameter. If the descent condition
[TABLE]
is satisfied, then is accepted, and we define . Otherwise, constant is increased until the corresponding trial point is accepted. As in Algorithm 1, we assume that there exists such that for all . After obtaining , we set and compute
[TABLE]
where
[TABLE]
To initialize, we choose and we set , , and . This general scheme can be summarized in the following way.
Algorithm 3. Accelerated Tensor Method
Step 0. Choose , . Set by (3.10), , , and .
Step 1. Find such that (4.39) holds for an approximate solution to (4.37) satisfying (4.38), with being defined by (4.34)-(4.36).
Step 2. Set and , with obtained from (4.36).
Step 3. Define by (4.41) and compute by (4.40).
Step 4. Set and go back to Step 1.
Remark 6.
From the expression of we can see that admits a closed form solution, namely,
[TABLE]
Regarding the computation of , for , (4.36) gives
[TABLE]
For , we have . Thus, the computation of requires the solution of a univariate nonlinear equation of the form
[TABLE]
Denoting , it is easy to see that
[TABLE]
where . Since is continuous, we can use the bisection method to compute an approximation to such that . As can be seen in the proof of Theorem 8, our convergence results only require
[TABLE]
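The bisection step mentioned in this remark can be sketched as follows. This is an illustration under assumptions, not the exact equation (4.36): it supposes the coefficient $a$ is the positive root of a generic equation of the form $a^{q} = c\,(A + a)^{q-1}$ with $q = p + \alpha$, where $A \geq 0$ plays the role of the current accumulated coefficient and $c > 0$ is a hypothetical constant standing in for the problem-dependent factor. The function name and the tolerance are likewise illustrative.

```python
def solve_step_coefficient(A, c, q, tol=1e-12):
    """Positive root a of a**q = c * (A + a)**(q - 1), for A >= 0, c > 0, q > 1."""

    def g(a):
        return a ** q - c * (A + a) ** (q - 1.0)

    # g(0) <= 0 and g grows like a**q, so doubling finds an upper bracket
    lo, hi = 0.0, max(1.0, c)
    while g(hi) <= 0.0:
        hi *= 2.0
    # standard bisection: keep g(lo) <= 0 <= g(hi)
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return hi
```

For $q = 2$ the assumed equation is quadratic, with closed-form root $a = \bigl(c + \sqrt{c^2 + 4cA}\bigr)/2$, which gives a convenient check of the routine. For $q > 2$ no closed form is available in general, which is where a root-finding step of this kind is needed.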
The next result establishes the relationship between the estimating functions and the objective function .
Lemma 7.
For all ,
[TABLE]
Proof.
We prove this result by induction in . Since , for all ,
[TABLE]
that is, (4.42) is true for . Suppose that (4.42) is true for some . Then (4.41) and the convexity of imply that, for all ,
[TABLE]
Thus, (4.42) is also true for , and the proof is completed. ∎
The theorem below establishes the global convergence rate for Algorithm 3.
Theorem 8.
Assume that H1 is true and let the sequence be generated by Algorithm 3. Then, for ,
[TABLE]
Proof.
Let us prove by induction that
[TABLE]
Since , we have . Thus, (4.44) is true for . Assume that it is true for some . Note that for any we have
[TABLE]
Note that is a linear function. Moreover, by Lemma 4 in [14], function
is uniformly convex of degree with parameter . Thus, is also a uniformly convex function of degree with parameter . Therefore, Lemma A.2 in [9] and the induction assumption imply that
[TABLE]
Thus,
[TABLE]
Since is convex and differentiable, we have
[TABLE]
Then, substituting this inequality above, we obtain
[TABLE]
Note that . Therefore, , and
[TABLE]
Moreover, , and so
[TABLE]
where the last inequality is due to (4.39). Thus, to prove that (4.44) is true for , it is enough to show that
[TABLE]
for all . Using Lemma 2 in [14] with , and , we see that a sufficient condition for (4.45) is
[TABLE]
which is equivalent to
[TABLE]
Note that, . Therefore, by (4.36) we have
[TABLE]
Thus (4.44) is true for , completing the induction argument.
Let us now estimate the growth of the coefficients . Since for all , by (4.36) we get with
[TABLE]
Consequently,
[TABLE]
Now, denoting for all , it follows from (4.47) that
[TABLE]
Then, by Lemma A.4 in [9], we have
[TABLE]
Note that . Thus, and consequently
[TABLE]
Therefore, for all , we have
[TABLE]
Finally, by (4.44) and Lemma 7, for , we have
[TABLE]
Hence, , and (4.43) follows immediately from (4.46) and (4.48). ∎
If we assume that and are known, then, by Lemma 21, we can set
[TABLE]
Here, by (3.10) the corresponding version of Algorithm 3 takes at most
iterations to generate such that For problems in which is not known, let us consider the following adaptive version of Algorithm 3.
Algorithm 4. Adaptive Accelerated Tensor Method
Step 0. Choose , , and . Set by (3.10) and define function . Set , , and .
Step 1. Set .
Step 1.1. Compute the coefficient by solving equation
Step 1.2. Set and compute vector .
Step 1.3 Compute an approximate solution to such that
Step 1.4. If condition
set and go to Step 2. Otherwise, set and go back to Step 1.1.
Step 2. Set , , and . Define and .
Step 3. Define by (4.41) and compute by (4.40).
Step 4. Set and go back to Step 1.
Note that Algorithm 4 is a particular case of Algorithm 3 in which
[TABLE]
Let us define the following function of :
[TABLE]
The next lemma provides upper bounds on and on the number of calls of the oracle in Algorithm 4.
Lemma 9.
Suppose that H1 and H2 are true. Given , assume that is a sequence generated by Algorithm 4 such that
[TABLE]
and
[TABLE]
Then
[TABLE]
Moreover, the number of calls of the oracle after iterations is bounded as follows:
[TABLE]
Proof.
Let us prove by induction that the scaling coefficients in Algorithm 4 satisfy (4.52). This is obvious for . Assume that (4.52) is true for some . If , it follows from Lemma 21 that the final value cannot be bigger than
[TABLE]
since otherwise we should stop the line-search earlier. Thus,
[TABLE]
that is, (4.53) holds for . On the other hand, suppose that . In view of Lemma A.5 in [9], at any trial point we have
[TABLE]
Thus, it follows from Lemma 22 that
[TABLE]
Consequently, we also have ; i.e., (4.52) holds for . This completes the induction argument. Finally, as in the proof of Lemma 3, from (4.52) we get (4.53). ∎
Now we can prove the following convergence result for Algorithm 4.
Theorem 10.
Suppose that H1 and H2 are true. Given , assume that is a sequence generated by Algorithm 4 such that (4.50) and (4.51) hold. Then
[TABLE]
Consequently,
[TABLE]
if is known (i.e., ), and
[TABLE]
if is unknown (i.e., ).
Proof.
By Lemma 9, we have
[TABLE]
Then (4.54) follows directly from Theorem 8 with
[TABLE]
Now, combining (4.54) and (4.51) for , we obtain
[TABLE]
and so,
[TABLE]
If is known, then and, by (4.49), we have
[TABLE]
Thus, combining (4.57) and (4.58), we get (4.55). On the other hand, if is unknown, then and, by (4.49), (3.25) and , we have
[TABLE]
In this case, combining (4.57) and (4.59) we get (4.56). ∎
When , bounds (4.55) and (4.56) have the same dependence on . However, when , the bound of obtained for the universal scheme (i.e., Algorithm 4 with ) is worse than the bound of obtained for the nonuniversal scheme (). For high-order methods (), to the best of our knowledge, there is no simple procedure by which one can identify the level of smoothness of the th derivatives (in general). Therefore, despite this gap in the complexity bounds, we believe that the automatic choice of the best function subclass in the universal scheme is a very attractive feature. Moreover, in the nonuniversal scheme, for any with , the corresponding right-hand side of (4.55) has an additional term
[TABLE]
in comparison to its value when . In contrast, in the accelerated universal scheme, for any in the interval
[TABLE]
the corresponding right-hand side in (4.56) is the same as for . In this sense, it appears that the accelerated universal scheme is more robust than the accelerated nonuniversal scheme in terms of the inexact solution of the auxiliary problems.
5 Lower complexity bounds under Hölder condition
In this section we investigate how much the convergence rates of our tensor methods can be improved with respect to problems satisfying H1. Specifically, we derive lower complexity bounds for -order tensor methods applied to the problem (2.3), where the objective is convex and for some .
5.1 Hard functions and Lower Complexity Bounds
For simplicity, let us consider and . Given an approximation for the solution of (2.3), -order methods usually compute the next test point as , where the search direction is the solution of an auxiliary problem of the form
[TABLE]
with , , and . Denote by the set of all stationary points of function , and define the linear subspace
[TABLE]
With this notation, we can characterize the class of -order tensor methods by the following assumption.
Assumption 1. Given , the method generates a sequence of test points such that
[TABLE]
Given , our parametric family of difficult functions for -order tensor methods is defined as
[TABLE]
The next lemma establishes that for each we have .
Lemma 11.
Given an integer , the th derivative of is -Hölder continuous with
[TABLE]
Proof.
In view of (5.63), we have
[TABLE]
where
[TABLE]
[TABLE]
It can be shown that (see page 13 in [15])
[TABLE]
On the other hand, for any , we have
[TABLE]
Therefore, for all , it follows that
[TABLE]
Consequently, for all , we have
[TABLE]
Note that
[TABLE]
and, by (5.68), that
[TABLE]
Thus, combining (5.69)-(5.71), we get
[TABLE]
∎
The next lemma provides additional properties of .
Lemma 12.
Given an integer , let function be defined by (5.63). Then, has a unique global minimizer . Moreover,
[TABLE]
Proof.
The existence and uniqueness of follows from the fact that is uniformly convex. In view of (5.65), it follows from the first-order optimality condition that
[TABLE]
Therefore, , where satisfies
[TABLE]
with being the vector of all ones and being the origin in . Note that
[TABLE]
Consequently, (5.73) is equivalent to
[TABLE]
Thus,
[TABLE]
and so
[TABLE]
where . Finally, combining (5.65), (5.66), (5.75) and (5.76) we get
[TABLE]
[TABLE]
∎
Our goal is to understand the behavior of the tensor methods specified by Assumption 1 when applied to the minimization of with a suitable . For that, let us consider the following subspaces:
[TABLE]
Lemma 13.
For any and , .
Proof.
It follows directly from (5.63). ∎
Lemma 14.
Let be a -order tensor method satisfying Assumption 1. If is applied to the minimization of starting from , then the sequence of test points generated by satisfies
[TABLE]
Proof.
See Lemma 2 in [15]. ∎
Now, we can prove the lower complexity bound for -order tensor methods applied to the minimization of functions with -Hölder continuous th derivatives.
Theorem 15.
Let be a -order tensor method satisfying Assumption 1. Assume that for any function with this method ensures the rate of convergence:
[TABLE]
where is the sequence generated by method and is a global minimizer of . Then, for all such that we have
[TABLE]
where
[TABLE]
Proof.
Let us apply method for minimizing function starting from point . By Lemma 14 we have for all , . Moreover, by Lemma 13 we have
[TABLE]
Thus, from (5.77), (5.80), Lemma 11 and Lemma 12 we get
[TABLE]
where constant is given by (5.79). ∎
5.2 Discussion
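Theorem 15 establishes that the lower bound for the rate of convergence of tensor methods applied to functions with $\nu$-Hölder continuous $p$th derivatives is of $\mathcal{O}\left((1/k)^{[3(p+\nu)-2]/2}\right)$. In the Lipschitz case (i.e., $\nu = 1$) we have $\mathcal{O}\left((1/k)^{(3p+1)/2}\right)$, which coincides with the bounds in [1, 15]. On the other hand, for first-order methods (i.e., $p = 1$) we have $\mathcal{O}\left((1/k)^{(3\nu+1)/2}\right)$, which is the bound in [12].
The rate of $\mathcal{O}\left((1/k)^{[3(p+\nu)-2]/2}\right)$ corresponds to a worst-case complexity bound of $\mathcal{O}\left(\epsilon^{-2/[3(p+\nu)-2]}\right)$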
iterations necessary to ensure This means that the nonuniversal accelerated schemes proposed in this paper (e.g., Algorithm 4 with ) are nearly optimal tensor methods. In fact, for , note that
[TABLE]
In particular, if , we have Thus, in practice, the complexity bounds of our accelerated nonuniversal methods differ from the lower bound just by a small constant factor.
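The comparison between the upper and lower bounds can be made concrete by tabulating the exponents of $1/\epsilon$. The sketch below assumes the four bounds $\mathcal{O}(\epsilon^{-1/(p+\nu-1)})$ (no acceleration), $\mathcal{O}(\epsilon^{-1/(p+\nu)})$ (accelerated, $\nu$ known), $\mathcal{O}(\epsilon^{-p/[(p+1)(p+\nu-1)]})$ (accelerated universal) and $\mathcal{O}(\epsilon^{-2/[3(p+\nu)-2]})$ (lower bound). It ignores all constants, and the function names are hypothetical.

```python
from fractions import Fraction


def complexity_exponents(p, nu):
    """Exponents of 1/eps in the assumed iteration complexity bounds."""
    q = Fraction(p) + Fraction(nu)
    return {
        "nonaccelerated": 1 / (q - 1),
        "accelerated": 1 / q,
        "universal_accelerated": Fraction(p) / ((p + 1) * (q - 1)),
        "lower_bound": 2 / (3 * q - 2),
    }


def gap_factor(p, nu, eps):
    """Ratio (1/eps)**(accelerated exponent - lower-bound exponent)."""
    e = complexity_exponents(p, nu)
    return (1.0 / eps) ** float(e["accelerated"] - e["lower_bound"])
```

For example, with $p = 2$, $\nu = 1$ and $\epsilon = 10^{-6}$ the accelerated and lower-bound exponents are $1/3$ and $2/7$, so `gap_factor(2, 1, 1e-6)` evaluates $(10^{6})^{1/21}$, a factor of about 1.9. With $\nu = 1$ the universal and nonuniversal accelerated exponents coincide, as noted at the end of Section 4.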
Notice that the lower bound described in Theorem 15 is only valid while the iteration counter satisfies , where is the dimension of the domain of the objective. The same condition on appears in other lower bounds in the literature for the case and (see, e.g., Theorem 2.1.7 in [16]).
6 Conclusion
In this paper, we presented $p$-order methods for unconstrained minimization of convex functions that are $p$-times differentiable with $\nu$-Hölder continuous $p$th derivatives. For the universal and the nonuniversal schemes without acceleration, we established iteration complexity bounds of $\mathcal{O}(\epsilon^{-1/(p+\nu-1)})$ for reducing the functional residual below a given $\epsilon \in (0,1)$. Assuming that $\nu$ is known, we obtained an improved complexity bound of $\mathcal{O}(\epsilon^{-1/(p+\nu)})$ for the corresponding accelerated scheme. For the case in which $\nu$ is unknown, we presented an accelerated universal tensor scheme with an iteration complexity of $\mathcal{O}(\epsilon^{-p/[(p+1)(p+\nu-1)]})$.
Finally, a lower complexity bound of $\mathcal{O}(\epsilon^{-2/[3(p+\nu)-2]})$ was also obtained for the referred problem class. This means that, in practice, our accelerated nonuniversal schemes are nearly optimal. Remarkably, the complexity bound obtained for the accelerated universal schemes is slightly worse than the bound obtained for the nonuniversal accelerated schemes. Up to now, it is not clear whether the estimating sequences technique can be modified to provide an accelerated universal $p$-order method with a complexity bound of $\mathcal{O}(\epsilon^{-1/(p+\nu)})$.
It is worth mentioning that the study of high-order methods is still at its early stages, with the majority of recent works in this area focusing on the derivation of global complexity bounds (see, e.g., [2, 4, 6, 7, 11, 15]). These bounds predict that high-order methods with may require significantly fewer calls of the oracle than second-order methods. As pointed out in [7, 15], the computation of high-order derivatives may be affordable for structured objectives (such as separable functions). Moreover, at least for and , the auxiliary problems can be solved using Bregman gradient methods that also take into account their particular structure [10, 15]. Nevertheless, the practical impact of high-order methods is yet to be seen.
Appendix A Auxiliary Results
In all algorithms described in this paper, the acceptance of new points is conditioned on the achievement of a sufficient decrease of the objective function value. In the nonaccelerated schemes, the sufficient decrease condition is specified by (3.13), while for accelerated schemes, it is specified by (4.39). In this Appendix we present auxiliary results from which we conclude that these conditions are satisfied when the regularization parameter is sufficiently large.
A.1 Results for schemes without acceleration
Our first lemma gives a lower bound for the functional decrease in terms of a suitable power of the norm of the displacement, when is known.
Lemma 16.
Let for some and assume that satisfies
[TABLE]
for some and . If , then
[TABLE]
Proof.
In view of (2.9) and (A.81), we have
[TABLE]
which gives
[TABLE]
Since for all , it follows that
[TABLE]
∎
The next lemma provides a lower bound for the functional decrease in terms of a suitable power of the norm of the gradient when is known.
Lemma 17.
Let for some , and assume that satisfies (A.81) and
[TABLE]
for some , , and . If
[TABLE]
then
[TABLE]
Proof.
By (2.6), (2.8), (A.83), and (A.84), we have
[TABLE]
Thus,
[TABLE]
On the other hand, by (A.81) and (A.84) it follows from Lemma 16 that
[TABLE]
Then, combining (A.86), and (A.87) we get (A.85). ∎
The lemma below gives lower bounds for powers of the norm of the displacement when is unknown.
Lemma 18.
Let for some , and assume that satisfies
[TABLE]
for some , and . If for some we have
[TABLE]
with constant , then
[TABLE]
and, consequently,
[TABLE]
Proof.
For , (A.90) is obvious. Thus, assume that and denote . Then, by (2.6), (2.8), and (A.88), we have
[TABLE]
Assume by contradiction that (A.90) is not true, i.e., . Since and , it follows that
[TABLE]
This implies that contradicting the second inequality in (A.89). Therefore, (A.90) holds.
Finally, let us prove (A.91). In view of inequality (A.90), we have
[TABLE]
Thus, it follows from (A.92) that
[TABLE]
∎
Now, using Lemma 18, we obtain a lower bound for the functional decrease in terms of a computable power of the norm of the displacement, when is unknown.
Lemma 19.
Let for some and assume that satisfies
[TABLE]
and
[TABLE]
for some , and . If for some we have
[TABLE]
with constant , then
[TABLE]
Proof.
In view of (2.5), (2.8), and (A.93), we have
[TABLE]
and so
[TABLE]
Assume by contradiction that (A.96) is not true, i.e.,
[TABLE]
Then, combining (A.97) and (A.98), we obtain
[TABLE]
which implies that
[TABLE]
By (A.94) and (A.95), the conclusions of Lemma 18 hold. In particular, we have
[TABLE]
and so
[TABLE]
Then it follows from (A.99) and (A.100) that
[TABLE]
contradicting the second inequality in (A.95). Therefore, (A.96) is true. ∎
Finally, the next corollary gives a lower bound for the functional decrease in terms of a computable power of the norm of the gradient when is unknown.
Corollary 20.
Let for some , and assume that satisfies (A.93) and (A.94) for some , , and . Given , define
[TABLE]
If and , then
[TABLE]
Proof.
From inequality (A.91) in Lemma 18 we have
[TABLE]
which implies that
[TABLE]
Then, it follows from inequality (A.96) in Lemma 19 that
[TABLE]
∎
A.2 Results for accelerated schemes
For the case in which is known, the lemma below establishes that (4.39) is achievable when the regularization parameter is sufficiently large.
Lemma 21.
Let for some , and assume that satisfies
[TABLE]
for some , and . If
[TABLE]
then
[TABLE]
Proof.
Denote . Then, by (2.6), (2.8), and (A.102), we have
[TABLE]
Thus, we obtain
[TABLE]
which implies that
[TABLE]
For , (A.105) leads to the desired relation. Let us assume that . Denote and By (A.103), we have
[TABLE]
Consider the right-hand side of inequality (A.105) as a function of :
[TABLE]
Since , is a convex function for . Thus, let us find the optimal as a solution to the first-order optimality condition for function :
[TABLE]
Solving this equation for , we obtain Consequently,
[TABLE]
Now, using (A.106) we obtain
[TABLE]
Note that
[TABLE]
Thus, and so by (A.105) we get (A.104). ∎
Finally, for the case in which is unknown, the next lemma establishes that (4.39) is also achievable when the regularization parameter is sufficiently large.
Lemma 22.
Let for some , and assume that satisfies
[TABLE]
for some , , and . If for some we have
[TABLE]
with , then
[TABLE]
Proof.
Denote . Then, by (2.6), (2.8), and (A.107) we have
[TABLE]
Therefore,
[TABLE]
which gives
[TABLE]
Since , it follows that
[TABLE]
Because , we have
[TABLE]
and so
[TABLE]
Therefore,
[TABLE]
Denote and consider the right-hand side of (A.110) as a function of :
[TABLE]
Let us find the optimal as a solution to the first-order optimality condition for function :
[TABLE]
Solving this equation for , we obtain
[TABLE]
Consequently,
[TABLE]
Therefore, (A.109) holds. ∎
Acknowledgments
The authors are very grateful to the two anonymous referees, whose comments helped to improve the first version of this paper.
References
- [1] Y. Arjevani, O. Shamir, and R. Shiff: Oracle complexity of second-order methods for smooth convex optimization. Mathematical Programming 178, 327–360 (2019)
- [2] M. Baes: Estimate Sequence Methods: Extensions and Approximations. Optimization Online (2009)
- [3] S. Banach: Über homogene Polynome in ($L^2$). Studia Math. 7, 36–44 (1938)
- [4] E. G. Birgin, J. L. Gardenghi, J. M. Martínez, S. A. Santos, and Ph. L. Toint: Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models. Mathematical Programming 163, 359–368 (2017)
- [5] A. Bouaricha: Tensor methods for large, sparse unconstrained optimization. SIAM Journal on Optimization 7, 732–756 (1997)
- [6] C. Cartis, N. I. M. Gould, and Ph. L. Toint: Universal regularized methods - varying the power, the smoothness, and the accuracy. SIAM Journal on Optimization 29, 595–615 (2019)
- [7] X. Chen, Ph. L. Toint, and H. Wang: Partially separable convexly-constrained optimization with non-Lipschitzian singularities and its complexity. SIAM Journal on Optimization 29, 874–903 (2019)
- [8] G. N. Grapiglia and Yu. Nesterov: Regularized Newton Methods for minimizing functions with Hölder continuous Hessians. SIAM Journal on Optimization 27, 478–506 (2017)
