Tensor Methods for Finding Approximate Stationary Points of Convex Functions

Geovani Nunes Grapiglia; Yurii Nesterov

arXiv:1907.07053·math.OC·June 7, 2021·Optim. Methods Softw.


TL;DR

This paper develops tensor-based algorithms to efficiently find approximate stationary points of convex functions with specific smoothness properties, providing complexity bounds for both accelerated and non-accelerated schemes.

Contribution

It introduces new tensor methods with proven iteration complexity bounds for convex functions with Hölder continuous derivatives, including cases with unknown smoothness parameters.

Findings

01

Non-accelerated schemes require O(ε^{-1/(p+ν-1)}) iterations.

02

Accelerated schemes improve complexity bounds, e.g., O(ε^{-(p+ν)/[(p+ν-1)(p+ν+1)]}).

03

Universal accelerated method achieves bounds when ν is unknown.

Abstract

In this paper we consider the problem of finding $\epsilon$-approximate stationary points of convex functions that are $p$-times differentiable with $\nu$-Hölder continuous $p$th derivatives. We present tensor methods with and without acceleration. Specifically, we show that the non-accelerated schemes take at most $\mathcal{O}\left(\epsilon^{-1/(p+\nu-1)}\right)$ iterations to reduce the norm of the gradient of the objective below a given $\epsilon\in(0,1)$. For accelerated tensor schemes we establish improved complexity bounds of $\mathcal{O}\left(\epsilon^{-(p+\nu)/[(p+\nu-1)(p+\nu+1)]}\right)$ and $\mathcal{O}\left(|\log(\epsilon)|\epsilon^{-1/(p+\nu)}\right)$, when the Hölder parameter $\nu\in[0,1]$ is known. For the case in which $\nu$ is unknown, we obtain a bound of $\mathcal{O}\left(\epsilon^{-(p+1)/[(p+\nu-1)(p+2)]}\right)$ for a universal accelerated scheme.…

Equations

$\|x\|=\langle Bx,x\rangle^{1/2},\quad x\in\mathbb{E},\qquad\|s\|_{*}=\langle s,B^{-1}s\rangle^{1/2},\quad s\in\mathbb{E}^{*}.$

$D^{p}f(x)[h_{1},\ldots,h_{p}]$

$Df(x)[h_{1}]=\langle\nabla f(x),h_{1}\rangle\quad\text{and}\quad D^{2}f(x)[h_{1},h_{2}]=\langle\nabla^{2}f(x)h_{1},h_{2}\rangle.$

$f(x+h)=\Phi_{x,p}(x+h)+o(\|h\|^{p}),$

$\Phi_{x,p}(y)\equiv f(x)+\sum_{i=1}^{p}\frac{1}{i!}D^{i}f(x)[y-x]^{i},\quad y\in\mathbb{E}.$

$\|D^{p}f(x)\|=\max_{h_{1},\ldots,h_{p}}\left\{|D^{p}f(x)[h_{1},\ldots,h_{p}]|\,:\,\|h_{i}\|\leq 1,\ i=1,\ldots,p\right\}.$

$\|D^{p}f(x)\|=\max_{h}\left\{|D^{p}f(x)[h]^{p}|\,:\,\|h\|\leq 1\right\}.$

$\|D^{p}f(x)-D^{p}f(y)\|=\max_{h}\left\{|D^{p}f(x)[h]^{p}-D^{p}f(y)[h]^{p}|\,:\,\|h\|\leq 1\right\}.$

$\min_{x\in\mathbb{E}}f(x),$

$H_{f,p}(\nu)\equiv\sup_{x,y\in\mathbb{E}}\left\{\dfrac{\|D^{p}f(x)-D^{p}f(y)\|}{\|x-y\|^{\nu}}\right\},\quad 0\leq\nu\leq 1.$

$|f(y)-\Phi_{x,p}(y)|\leq\dfrac{H_{f,p}(\nu)}{p!}\|y-x\|^{p+\nu},$

$\|\nabla f(y)-\nabla\Phi_{x,p}(y)\|_{*}\leq\dfrac{H_{f,p}(\nu)}{(p-1)!}\|y-x\|^{p+\nu-1},$

$\|\nabla^{2}f(y)-\nabla^{2}\Phi_{x,p}(y)\|\leq\dfrac{H_{f,p}(\nu)}{(p-2)!}\|y-x\|^{p+\nu-2}.$

$f(y)\leq\Phi_{x,p}(y)+\dfrac{H}{p!}\|y-x\|^{p+\nu},\quad y\in\mathbb{E}.$

$\Omega_{x,p,H}^{(\alpha)}(y)=\Phi_{x,p}(y)+\dfrac{H}{p!}\|y-x\|^{p+\alpha},\quad\alpha\in[0,1].$

$\alpha=\left\{\begin{array}{ll}\nu,&\text{if }\nu\text{ is known},\\ 1,&\text{if }\nu\text{ is unknown}.\end{array}\right.$

$\min_{y\in\mathbb{E}}\,\Omega_{x_{t},p,2^{i}H_{t}}^{(\alpha)}(y)$

$\Omega_{x_{t},p,2^{i}H_{t}}^{(\alpha)}(x_{t,i}^{+})\leq f(x_{t})\quad\text{and}\quad\|\nabla\Omega_{x_{t},p,2^{i}H_{t}}^{(\alpha)}(x_{t,i}^{+})\|_{*}\leq\theta\|x_{t,i}^{+}-x_{t}\|^{p+\alpha-1}.$

$f(x_{t})-f(x_{t,i}^{+})\geq\dfrac{1}{8(p+1)!(2^{i}H_{t})^{\frac{1}{p+\alpha-1}}}\|\nabla f(x_{t,i}^{+})\|_{*}^{\frac{p+\alpha}{p+\alpha-1}},$

$N_{\nu}(\epsilon)=\left\{\begin{array}{ll}\max\left\{\dfrac{3H_{f,p}(\nu)}{2},3\theta(p-1)!\right\},&\text{if }\nu\text{ is known},\\ \max\left\{\theta,\left(\dfrac{3H_{f,p}(\nu)}{2}\right)^{\frac{p}{p+\nu-1}}4^{\frac{1-\nu}{p+\nu-1}}\right\}\epsilon^{-\frac{1-\nu}{p+\nu-1}},&\text{if }\nu\text{ is unknown}.\end{array}\right.$

$\|\nabla f(x_{t})\|_{*}>\epsilon,\quad t=0,\ldots,T.$

$H_{t}\leq\max\left\{H_{0},N_{\nu}(\epsilon)\right\},\quad\text{for }t=0,\ldots,T,$

$M_{t}=2^{i_{t}}H_{t}\leq 2\max\left\{H_{0},N_{\nu}(\epsilon)\right\},\quad\text{for }t=0,\ldots,T-1,$

$O_{T}\leq 2T+\log_{2}\max\left\{H_{0},N_{\nu}(\epsilon)\right\}-\log_{2}H_{0}.$

$H_{t+1}=\frac{1}{2}2^{i_{t}}H_{t}\leq\max\left\{\dfrac{3H_{f,p}(\nu)}{2},3\theta(p-1)!\right\}=N_{\nu}(\epsilon)\leq\max\left\{H_{0},N_{\nu}(\epsilon)\right\},$

$2^{i_{t}}H_{t}\leq 2\max\left\{\theta,\left(\dfrac{3H_{f,p}(\nu)}{2}\right)^{\frac{p}{p+\nu-1}}\left(\dfrac{4}{\epsilon}\right)^{\frac{1-\nu}{p+\nu-1}}\right\}\leq 2N_{\nu}(\epsilon).$

$H_{t+1}=\frac{1}{2}2^{i_{t}}H_{t}\leq N_{\nu}(\epsilon)\leq\max\left\{H_{0},N_{\nu}(\epsilon)\right\},$

$O_{T}$

$\|\nabla f(x_{t,i}^{+})\|_{*}>\epsilon,\quad i=0,\ldots,i_{t}.$

$f(x_{m})-f(x^{*})\leq 4[8(p+1)!]^{p+\alpha-1}\max\left\{H_{0},N_{\nu}(\nu)\right\}D_{0}^{p+\alpha},$


Full text

Tensor Methods for Finding Approximate Stationary Points of Convex Functions

G.N. Grapiglia (a, *), Yu. Nesterov (b)

(*) Corresponding author. Email: [email protected]

(a) Departamento de Matemática, Universidade Federal do Paraná, Centro Politécnico, Cx. postal 19.081, 81531-980, Curitiba, Paraná, Brazil

(b) Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 34 voie du Roman Pays, 1348 Louvain-la-Neuve, Belgium

(August 18, 2020)

Abstract

In this paper we consider the problem of finding $\epsilon$ -approximate stationary points of convex functions that are $p$ -times differentiable with $\nu$ -Hölder continuous $p$ th derivatives. We present tensor methods with and without acceleration. Specifically, we show that the non-accelerated schemes take at most $\mathcal{O}\left(\epsilon^{-1/(p+\nu-1)}\right)$ iterations to reduce the norm of the gradient of the objective below a given $\epsilon\in(0,1)$ . For accelerated tensor schemes we establish improved complexity bounds of $\mathcal{O}\left(\epsilon^{-(p+\nu)/[(p+\nu-1)(p+\nu+1)]}\right)$ and $\mathcal{O}\left(|\log(\epsilon)|\epsilon^{-1/(p+\nu)}\right)$ , when the Hölder parameter $\nu\in[0,1]$ is known. For the case in which $\nu$ is unknown, we obtain a bound of $\mathcal{O}\left(\epsilon^{-(p+1)/[(p+\nu-1)(p+2)]}\right)$ for a universal accelerated scheme. Finally, we also obtain a lower complexity bound of $\mathcal{O}\left(\epsilon^{-2/[3(p+\nu)-2]}\right)$ for finding $\epsilon$ -approximate stationary points using $p$ -order tensor methods.

Keywords: unconstrained minimization; high-order methods; tensor methods; Hölder condition; worst-case complexity

AMS Subject Classification: 49M15; 49M37; 58C15; 90C25; 90C30

1 Introduction

1.1 Motivation

In this paper we study tensor methods for unconstrained optimization, i.e., methods in which the iterates are obtained by the (approximate) minimization of models defined from high-order Taylor approximations of the objective function. This type of method is not new in the optimization literature (see, e.g., [35, 4, 1]). Recently, interest in tensor methods has been renewed by the work in [2], where $p$-order tensor methods were proposed for unconstrained minimization of nonconvex functions with Lipschitz continuous $p$th derivatives. It was shown that these methods take at most $\mathcal{O}\left(\epsilon^{-\frac{p+1}{p}}\right)$ iterations to find an $\epsilon$-approximate first-order stationary point of the objective, generalizing the bound of $\mathcal{O}\left(\epsilon^{-3/2}\right)$ originally established in [28] for the Cubic Regularization of Newton's Method ($p=2$). After [2], several high-order methods have been proposed and analyzed for nonconvex optimization (see, e.g., [10, 11, 12, 24]), resulting even in worst-case complexity bounds for the number of iterations that $p$-order methods need to generate approximate $q$th-order stationary points [8, 9].

More recently, in [31], $p$-order tensor methods with and without acceleration were proposed for unconstrained minimization of convex functions with Lipschitz continuous $p$th derivatives. As is usual in convex optimization, these methods aim to generate a point $\bar{x}$ such that $f(\bar{x})-f^{*}\leq\epsilon$, where $f$ is the objective function, $f^{*}$ is its optimal value and $\epsilon>0$ is a given precision. Specifically, it was shown that the non-accelerated scheme takes at most $\mathcal{O}(\epsilon^{-1/p})$ iterations to reduce the functional residual below a given $\epsilon>0$, while the accelerated scheme takes at most $\mathcal{O}(\epsilon^{-1/(p+1)})$ iterations to accomplish the same task. The auxiliary problems in these methods consist in the minimization of a $(p+1)$-regularization of the $p$th-order Taylor approximation of the objective, which is a multivariate polynomial. A remarkable result shown in [31] (which distinguishes this work from [1]) is that, in the convex case, the auxiliary problems in tensor methods become convex when the corresponding regularization parameter is sufficiently large. Since [31], several high-order methods have been proposed for convex optimization (see, e.g., [14, 15, 19, 21]), including near-optimal methods [5, 16, 22, 32, 33] motivated by the second-order method in [25]. In particular, in [19], we adapted and generalized the methods in [17, 18, 31] to handle convex functions with $\nu$-Hölder continuous $p$th derivatives. It was shown that the non-accelerated schemes take at most $\mathcal{O}(\epsilon^{-1/(p+\nu-1)})$ iterations to generate a point with functional residual smaller than a given $\epsilon\in(0,1)$, while the accelerated variants take only $\mathcal{O}(\epsilon^{-1/(p+\nu)})$ iterations when the parameter $\nu$ is explicitly used in the scheme. For the case in which $\nu$ is not known, we also proposed a universal accelerated scheme for which we established an iteration complexity bound of $\mathcal{O}(\epsilon^{-p/[(p+1)(p+\nu-1)]})$.

As a natural development, in this paper we present variants of the $p$-order methods ($p\geq 2$) proposed in [19] that aim to generate a point $\bar{x}$ such that $\|\nabla f(\bar{x})\|_{*}\leq\epsilon$, for a given threshold $\epsilon\in(0,1)$. In the context of nonconvex optimization, finding approximate stationary points is usually the best one can expect from local optimization methods. In the context of convex optimization, one motivation to search for approximate stationary points is the fact that the norm of the gradient may serve as a measure of feasibility and optimality when one applies the dual approach for solving constrained convex problems (see, e.g., [30]). Another motivation comes from the inexact high-order proximal-point methods, recently proposed in [32, 33], in which the iterates are computed as approximate stationary points of uniformly convex models.

Specifically, our contributions are the following:

1. We show that the non-accelerated schemes in [19] take at most $\mathcal{O}\left(\epsilon^{-1/(p+\nu-1)}\right)$ iterations to reduce the norm of the gradient of the objective below a given $\epsilon\in(0,1)$, when the objective is convex, and $\mathcal{O}\left(\epsilon^{-(p+\nu)/(p+\nu-1)}\right)$ iterations, when $f$ is nonconvex. These complexity bounds extend our previous results reported in [17] for regularized Newton methods (case $p=2$). Moreover, our complexity bound for the nonconvex case agrees in order with the bounds obtained in [24] and [10] for different tensor methods.

2. For accelerated tensor schemes we establish improved complexity bounds of $\mathcal{O}\left(\epsilon^{-(p+\nu)/[(p+\nu-1)(p+\nu+1)]}\right)$, when the Hölder parameter $\nu\in[0,1]$ is known. This result generalizes the bound of $\mathcal{O}\left(\epsilon^{-2/3}\right)$ obtained in [30] for the accelerated gradient method ($p=\nu=1$). In contrast, when $\nu$ is unknown, we prove a bound of $\mathcal{O}\left(\epsilon^{-(p+1)/[(p+\nu-1)(p+2)]}\right)$ for a universal accelerated scheme.

3. For the case in which $\nu$ and the corresponding Hölder constant are known, we propose tensor schemes for the composite minimization problem. In particular, we prove a bound of $\mathcal{O}\left(R^{\frac{p+\nu-1}{p+\nu}}|\log_{2}(\epsilon)|\epsilon^{-\frac{1}{p+\nu}}\right)$ iterations, where $R$ is an upper bound for the initial distance to the optimal set. This result generalizes the bounds obtained in [30] for first-order and second-order accelerated schemes combined with a regularization approach ($p=1,2$ and $\nu=1$). We also prove a bound of $\mathcal{O}\left(S^{\frac{p+\nu-1}{p+\nu}}|\log_{2}(\epsilon)|\epsilon^{-1}\right)$ iterations, where $S$ is an upper bound for the initial functional residual.

4. Considering the same class of difficult functions described in [19], we derive a lower complexity bound of $\mathcal{O}\left(\epsilon^{-2/[3(p+\nu)-2]}\right)$ iterations (in terms of the initial distance to the optimal set), and a lower complexity bound of $\mathcal{O}\left(\epsilon^{-2(p+\nu)/[3(p+\nu)-2]}\right)$ iterations (in terms of the initial functional residual), for $p$-order tensor methods to find $\epsilon$-approximate stationary points of convex functions with $\nu$-Hölder continuous $p$th derivatives. These bounds generalize the corresponding bounds given in [6] for first-order methods ($p=\nu=1$).

The paper is organized as follows. In section 2, we define our problem. In section 3, we present complexity results for tensor schemes without acceleration. In section 4, we present complexity results for accelerated schemes. In section 5, we analyze tensor schemes for the composite minimization problem. Finally, in section 6, we establish lower complexity bounds for tensor methods to find $\epsilon$-approximate stationary points of convex functions under the Hölder condition. Some auxiliary results are left to the Appendix.

1.2 Notations and Generalities

Let $\mathbb{E}$ be a finite-dimensional real vector space, and $\mathbb{E}^{*}$ be its dual space. We denote by $\langle s,x\rangle$ the value of the linear functional $s\in\mathbb{E}^{*}$ at point $x\in\mathbb{E}$ . Spaces $\mathbb{E}$ and $\mathbb{E}^{*}$ are equipped with conjugate Euclidean norms:

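$\|x\|=\langle Bx,x\rangle^{1/2},\quad x\in\mathbb{E},\qquad\|s\|_{*}=\langle s,B^{-1}s\rangle^{1/2},\quad s\in\mathbb{E}^{*},$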

where $B:\mathbb{E}\to\mathbb{E}^{*}$ is a self-adjoint positive definite operator ( $B\succ 0$ ). For a smooth function $f:\mathbb{E}\to\mathbb{R}$ , denote by $\nabla f(x)$ its gradient, and by $\nabla^{2}f(x)$ its Hessian evaluated at point $x\in\mathbb{E}$ . Then $\nabla f(x)\in\mathbb{E}^{*}$ and $\nabla^{2}f(x)h\in\mathbb{E}^{*}$ for $x,h\in\mathbb{E}$ .
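As a small numerical illustration of the conjugate norms $\|x\|=\langle Bx,x\rangle^{1/2}$ and $\|s\|_{*}=\langle s,B^{-1}s\rangle^{1/2}$, the sketch below assumes $\mathbb{E}=\mathbb{R}^{n}$ and a diagonal operator $B$. Both choices, and all helper names, are assumptions made only for this example. It checks the inequality $|\langle s,x\rangle|\leq\|s\|_{*}\|x\|$ on one pair of vectors.

```python
import math

def primal_norm(b_diag, x):
    """||x|| = <Bx, x>^{1/2} for B = diag(b_diag) with positive entries (assumed)."""
    return math.sqrt(sum(b * xi * xi for b, xi in zip(b_diag, x)))

def dual_norm(b_diag, s):
    """||s||_* = <s, B^{-1} s>^{1/2} for the same diagonal B."""
    return math.sqrt(sum(si * si / b for b, si in zip(b_diag, s)))

def pairing(s, x):
    """Value <s, x> of the linear functional s at the point x."""
    return sum(si * xi for si, xi in zip(s, x))

# Illustrative data (not taken from the paper).
b_diag = [2.0, 0.5, 4.0]
x = [1.0, -2.0, 0.5]
s = [3.0, 1.0, -1.0]

# Cauchy-Schwarz inequality for the conjugate pair of norms.
assert abs(pairing(s, x)) <= dual_norm(b_diag, s) * primal_norm(b_diag, x)
```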

For any integer $p\geq 1$ , denote by

$D^{p}f(x)[h_{1},\ldots,h_{p}]$

the directional derivative of function $f$ at $x$ along directions $h_{i}\in\mathbb{E}$ , $i=1,\ldots,p$ . For any $x\in\text{dom}\,f$ and $h_{1},h_{2}\in\mathbb{E}$ we have

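$Df(x)[h_{1}]=\langle\nabla f(x),h_{1}\rangle\quad\text{and}\quad D^{2}f(x)[h_{1},h_{2}]=\langle\nabla^{2}f(x)h_{1},h_{2}\rangle.$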

If $h_{1}=\ldots=h_{p}=h\in\mathbb{E}$ , we denote $D^{p}f(x)[h_{1},\ldots,h_{p}]$ by $D^{p}f(x)[h]^{p}$ . Using this notation, the $p$ th order Taylor approximation of function $f$ at $x\in\mathbb{E}$ can be written as follows:

$f(x+h)=\Phi_{x,p}(x+h)+o(\|h\|^{p}),$

where

$\Phi_{x,p}(y)\equiv f(x)+\sum_{i=1}^{p}\frac{1}{i!}D^{i}f(x)[y-x]^{i},\quad y\in\mathbb{E}.$

Since $D^{p}f(x)[\,.\,]$ is a symmetric $p$ -linear form, its norm is defined as:

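$\|D^{p}f(x)\|=\max_{h_{1},\ldots,h_{p}}\left\{|D^{p}f(x)[h_{1},\ldots,h_{p}]|\,:\,\|h_{i}\|\leq 1,\ i=1,\ldots,p\right\}.$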

It can be shown that (see, e.g., Appendix 1 in [27])

$\|D^{p}f(x)\|=\max_{h}\left\{|D^{p}f(x)[h]^{p}|\,:\,\|h\|\leq 1\right\}.$

Similarly, since $D^{p}f(x)[.\,,\ldots,\,.]-D^{p}f(y)[.,\ldots,.]$ is also a symmetric $p$ -linear form for fixed $x,y\in\mathbb{E}$ , it follows that

$\|D^{p}f(x)-D^{p}f(y)\|=\max_{h}\left\{|D^{p}f(x)[h]^{p}-D^{p}f(y)[h]^{p}|\,:\,\|h\|\leq 1\right\}.$

2 Problem Statement

In this paper we consider methods for solving the following minimization problem

$\min_{x\in\mathbb{E}}f(x),$

(2.4)

where $f:\mathbb{E}\to\mathbb{R}$ is a $p$-times differentiable convex function. We assume that (2.4) has at least one optimal solution $x^{*}\in\mathbb{E}$. As in [19], the level of smoothness of the objective $f$ will be characterized by the family of Hölder constants

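$H_{f,p}(\nu)\equiv\sup_{x,y\in\mathbb{E}}\left\{\dfrac{\|D^{p}f(x)-D^{p}f(y)\|}{\|x-y\|^{\nu}}\right\},\quad 0\leq\nu\leq 1.$

(2.5)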

From (2.5), it can be shown that, for all $x,y\in\mathbb{E}$ ,

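$|f(y)-\Phi_{x,p}(y)|\leq\dfrac{H_{f,p}(\nu)}{p!}\|y-x\|^{p+\nu},$

(2.6)

$\|\nabla f(y)-\nabla\Phi_{x,p}(y)\|_{*}\leq\dfrac{H_{f,p}(\nu)}{(p-1)!}\|y-x\|^{p+\nu-1},$

(2.7)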

and

$\|\nabla^{2}f(y)-\nabla^{2}\Phi_{x,p}(y)\|\leq\dfrac{H_{f,p}(\nu)}{(p-2)!}\|y-x\|^{p+\nu-2}.$

(2.8)

Given $x\in\mathbb{E}$ , if $H_{f,p}(\nu)<+\infty$ and $H\geq H_{f,p}(\nu)$ , by (2.6) we have

$f(y)\leq\Phi_{x,p}(y)+\dfrac{H}{p!}\|y-x\|^{p+\nu},\quad y\in\mathbb{E}.$

(2.9)

This property motivates the use of the following class of models of $f$ around $x\in\mathbb{E}$ :

$\Omega_{x,p,H}^{(\alpha)}(y)=\Phi_{x,p}(y)+\dfrac{H}{p!}\|y-x\|^{p+\alpha},\quad\alpha\in[0,1].$

(2.10)

Note that, by (2.9), if $H\geq H_{f,p}(\nu)$ then $f(y)\leq\Omega_{x,p,H}^{(\nu)}(y)$ for all $y\in\mathbb{E}$ .
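The upper-bound property of the regularized model can be checked numerically on a toy instance. The sketch below is illustrative only and rests on assumptions that are not in the paper: $\mathbb{E}=\mathbb{R}$ with $B=1$, the test function $f(x)=x^{4}$, $p=3$ and $\nu=1$. For this function $f'''(x)=24x$, so $H_{f,3}(1)=24$. All function names are hypothetical.

```python
import math

P = 3        # model order p (assumed for this example)
NU = 1.0     # Hölder exponent: f''' of x^4 is Lipschitz continuous
H_F = 24.0   # H_{f,3}(1) for f(x) = x^4, because f'''(x) = 24 x

def f(x):
    return x ** 4

def derivatives(x):
    """[f'(x), f''(x), f'''(x)] for f(x) = x^4."""
    return [4.0 * x ** 3, 12.0 * x ** 2, 24.0 * x]

def taylor_model(x, y):
    """Phi_{x,p}(y): Taylor polynomial of degree p of f at x."""
    h = y - x
    return f(x) + sum(d * h ** i / math.factorial(i)
                      for i, d in enumerate(derivatives(x), start=1))

def regularized_model(x, y, H, alpha=NU):
    """Omega^{(alpha)}_{x,p,H}(y) = Phi_{x,p}(y) + (H / p!) |y - x|^{p + alpha}."""
    return taylor_model(x, y) + H / math.factorial(P) * abs(y - x) ** (P + alpha)

# With H = H_{f,p}(nu) the model should majorize f on any grid of points.
grid = [k / 4 for k in range(-12, 13)]
assert all(f(y) <= regularized_model(x, y, H_F) + 1e-9 for x in grid for y in grid)
```

In this instance $f(y)-\Phi_{x,3}(y)=(y-x)^{4}$ while the regularization term equals $4(y-x)^{4}$, so the inequality holds with room to spare.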

3 Tensor Schemes Without Acceleration

Let us consider the following assumption:

H1

$H_{f,p}(\nu)<+\infty$ for some $\nu\in[0,1]$ .

Regarding the smoothness parameter $\nu$ , there are only two possible situations: either $\nu$ is known, or $\nu$ is unknown. In order to cover both cases in a single framework, as in [19], we shall consider the parameter

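$\alpha=\left\{\begin{array}{ll}\nu,&\text{if }\nu\text{ is known},\\ 1,&\text{if }\nu\text{ is unknown}.\end{array}\right.$

(3.11)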

Algorithm 1. Tensor Method (Algorithm 2 in [19])

Step 0. Choose $x_{0}\in\mathbb{E}$ , $H_{0}>0$ , $\theta\geq 0$ and $\epsilon\in(0,1)$ . Set $\alpha$ by (3.11) and $t:=0$ .

Step 1. If $\|\nabla f(x_{t})\|_{*}\leq\epsilon$ , STOP.

Step 2. Set $i:=0$ .

Step 2.1 Compute an approximate solution $x_{t,i}^{+}$ to

$\min_{y\in\mathbb{E}}\,\Omega_{x_{t},p,2^{i}H_{t}}^{(\alpha)}(y)$

(3.12)

such that

$\Omega_{x_{t},p,2^{i}H_{t}}^{(\alpha)}(x_{t,i}^{+})\leq f(x_{t})\quad\text{and}\quad\|\nabla\Omega_{x_{t},p,2^{i}H_{t}}^{(\alpha)}(x_{t,i}^{+})\|_{*}\leq\theta\|x_{t,i}^{+}-x_{t}\|^{p+\alpha-1}.$

(3.13)

Step 2.2. If either $\|\nabla f(x_{t,i}^{+})\|_{*}\leq\epsilon$ or

$f(x_{t})-f(x_{t,i}^{+})\geq\dfrac{1}{8(p+1)!(2^{i}H_{t})^{\frac{1}{p+\alpha-1}}}\|\nabla f(x_{t,i}^{+})\|_{*}^{\frac{p+\alpha}{p+\alpha-1}},$

(3.14)

holds, set $i_{t}:=i$ and go to Step 3. Otherwise, set $i:=i+1$ and go to Step 2.1.

Step 3. Set $x_{t+1}=x_{t,i_{t}}^{+}$ and $H_{t+1}=2^{i_{t}-1}H_{t}$ .

Step 4. Set $t:=t+1$ and go back to Step 1.
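The steps above can be sketched in a few lines for a one-dimensional toy problem. The code below is a minimal illustration under explicit assumptions that are not part of the paper: $\mathbb{E}=\mathbb{R}$ with $B=1$, $p=2$, $\nu=1$ known (so $\alpha=1$), $\theta=0$ (the model is minimized exactly in closed form), and the test objective $f(x)=\log(2\cosh x)$. The function names and the safeguards `max_iter` and `max_doublings` are hypothetical additions.

```python
import math

def f(x):
    return math.log(2.0 * math.cosh(x))

def grad(x):
    return math.tanh(x)

def hess(x):
    return 1.0 - math.tanh(x) ** 2

def minimize_model(x, M):
    """Exact minimizer of Omega(y) = f(x) + g h + c h^2 / 2 + (M / 2) |h|^3, h = y - x.

    Stationarity g + c h + (3 M / 2) |h| h = 0 gives
    |h| = 2 |g| / (c + sqrt(c^2 + 6 M |g|)), with h pointing against the sign of g.
    """
    g, c = grad(x), hess(x)
    r = 2.0 * abs(g) / (c + math.sqrt(c * c + 6.0 * M * abs(g)))
    return x - math.copysign(r, g)

def tensor_method(x0, H0=1.0, eps=1e-6, max_iter=100, max_doublings=60):
    p, alpha = 2, 1.0
    q = p + alpha - 1.0                              # exponent p + alpha - 1
    x, H = x0, H0
    for t in range(max_iter):
        if abs(grad(x)) <= eps:                      # Step 1: stopping test
            return x, t
        for i in range(max_doublings):               # Step 2: search over 2^i H_t
            M = 2.0 ** i * H
            x_plus = minimize_model(x, M)            # Step 2.1 with theta = 0
            g_plus = abs(grad(x_plus))
            decrease = f(x) - f(x_plus)
            threshold = g_plus ** ((p + alpha) / q) / (
                8.0 * math.factorial(p + 1) * M ** (1.0 / q))
            if g_plus <= eps or decrease >= threshold:   # Step 2.2, test (3.14)
                break
        x, H = x_plus, M / 2.0                       # Step 3: H_{t+1} = 2^{i_t - 1} H_t
    return x, max_iter

x_final, iterations = tensor_method(3.0)
```

Because the accepted constant is halved in Step 3, the estimate $H_{t}$ can decrease as well as increase, which is what allows the scheme to adapt to the local smoothness of the objective.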

Remark 1.

If $\nu$ is unknown, by (3.11) we set $\alpha=1$ in Algorithm 1. The resulting algorithm is a universal scheme that can be viewed as a generalization of the universal second-order method (6.10) in [17]. Moreover, it is worth mentioning that for $p=3$ and $\alpha=\nu$, one can use Gradient Methods with Bregman distance [20] to approximately solve (3.12) in the sense of (3.13).

For both cases ( $\nu$ known or unknown), Algorithm 1 is a particular instance of Algorithm 1 in [19] in which $M_{t}=2^{i_{t}}H_{t}$ for all $t\geq 0$ . Let us define the following function of $\epsilon$ :

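$N_{\nu}(\epsilon)=\left\{\begin{array}{ll}\max\left\{\dfrac{3H_{f,p}(\nu)}{2},3\theta(p-1)!\right\},&\text{if }\nu\text{ is known},\\ \max\left\{\theta,\left(\dfrac{3H_{f,p}(\nu)}{2}\right)^{\frac{p}{p+\nu-1}}4^{\frac{1-\nu}{p+\nu-1}}\right\}\epsilon^{-\frac{1-\nu}{p+\nu-1}},&\text{if }\nu\text{ is unknown}.\end{array}\right.$

(3.15)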

The next lemma provides upper bounds on $M_{t}$ and on the number of calls of the oracle in Algorithm 1.

Lemma 3.1.

Suppose that H1 holds. Given $\epsilon\in(0,1)$ , assume that $\left\{x_{t}\right\}_{t=0}^{T}$ is a sequence generated by Algorithm 1 such that

$\|\nabla f(x_{t})\|_{*}>\epsilon,\quad t=0,\ldots,T.$

(3.16)

Then,

$H_{t}\leq\max\left\{H_{0},N_{\nu}(\epsilon)\right\},\quad\text{for }t=0,\ldots,T,$

(3.17)

and, consequently,

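$M_{t}=2^{i_{t}}H_{t}\leq 2\max\left\{H_{0},N_{\nu}(\epsilon)\right\},\quad\text{for }t=0,\ldots,T-1.$

(3.18)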

Moreover, the number $O_{T}$ of calls of the oracle after $T$ iterations is bounded as follows:

$O_{T}\leq 2T+\log_{2}\max\left\{H_{0},N_{\nu}(\epsilon)\right\}-\log_{2}H_{0}.$

Proof.

Let us prove (3.17) by induction. Clearly it holds for $t=0$ . Assume that (3.17) is true for some $t$ , $0\leq t\leq T-1$ . If $\nu$ is known, then by (3.11) we have $\alpha=\nu$ . Thus, it follows from H1 and Lemma A.2 in [19] that the final value of $2^{i_{t}}H_{t}$ cannot be bigger than $2\max\left\{(3/2)H_{f,p}(\nu),3\theta(p-1)!\right\}$ , since otherwise we should stop the line search earlier. Therefore,

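$H_{t+1}=\frac{1}{2}2^{i_{t}}H_{t}\leq\max\left\{\dfrac{3H_{f,p}(\nu)}{2},3\theta(p-1)!\right\}=N_{\nu}(\epsilon)\leq\max\left\{H_{0},N_{\nu}(\epsilon)\right\},$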

that is, (3.17) holds for $t+1$. On the other hand, if $\nu$ is unknown, we have $\alpha=1$. In view of (3.16), Corollary A.5 in [19] and $\epsilon\in(0,1)$, we must have

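$2^{i_{t}}H_{t}\leq 2\max\left\{\theta,\left(\dfrac{3H_{f,p}(\nu)}{2}\right)^{\frac{p}{p+\nu-1}}\left(\dfrac{4}{\epsilon}\right)^{\frac{1-\nu}{p+\nu-1}}\right\}\leq 2N_{\nu}(\epsilon).$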

Consequently, it follows that

$H_{t+1}=\frac{1}{2}2^{i_{t}}H_{t}\leq N_{\nu}(\epsilon)\leq\max\left\{H_{0},N_{\nu}(\epsilon)\right\},$

that is, (3.17) holds for $t+1$ . This completes the induction argument. Using (3.17), for $t=0,\ldots,T-1$ we get $M_{t}=2H_{t+1}\leq 2\max\left\{H_{0},N_{\nu}(\epsilon)\right\}$ . Finally, note that at the $t$ th iteration of Algorithm 1, the oracle is called $i_{t}+1$ times. Since $H_{t+1}=2^{i_{t}-1}H_{t}$ , it follows that $i_{t}-1=\log_{2}H_{t+1}-\log_{2}H_{t}$ . Thus, by (3.17) we get

[TABLE]

and the proof is complete. ∎
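The last step of the proof can be written out explicitly. The display below is a reconstruction from the two relations stated in the text (the oracle is called $i_{t}+1$ times at iteration $t$, and $i_{t}-1=\log_{2}H_{t+1}-\log_{2}H_{t}$) together with the bound on $H_{T}$; it is not quoted from the paper.

```latex
\begin{aligned}
O_{T} &= \sum_{t=0}^{T-1}(i_{t}+1)
       = \sum_{t=0}^{T-1}\left(2+\log_{2}H_{t+1}-\log_{2}H_{t}\right) \\
      &= 2T+\log_{2}H_{T}-\log_{2}H_{0}
       \leq 2T+\log_{2}\max\left\{H_{0},N_{\nu}(\epsilon)\right\}-\log_{2}H_{0}.
\end{aligned}
```

The sum telescopes, so only $H_{T}$ and $H_{0}$ survive, and the final inequality uses $H_{T}\leq\max\left\{H_{0},N_{\nu}(\epsilon)\right\}$.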

Let us consider the additional assumption:

H2

The level sets of $f$ are bounded, that is, $\max_{x\in\mathcal{L}(x_{0})}\|x-x^{*}\|\leq D_{0}\in(1,+\infty)$ for $\mathcal{L}(x_{0})\equiv\left\{x\in\mathbb{E}\,:\,f(x)\leq f(x_{0})\right\}$ , with $x_{0}$ being the starting point.

The next theorem gives global convergence rates for Algorithm 1 in terms of the functional residual.

Theorem 3.2.

Suppose that H1 and H2 are true and let $\left\{x_{t}\right\}_{t=0}^{T}$ be a sequence generated by Algorithm 1 such that, for $t=0,\ldots,T$ , we have

$\|\nabla f(x_{t,i}^{+})\|_{*}>\epsilon,\quad i=0,\ldots,i_{t}.$

Let $m$ be the first iteration number such that

[TABLE]

and assume that $m<T$ . Then

[TABLE]

and, for all $k$ , $m<k\leq T$ , we have

[TABLE]

Proof.

By Lemma 3.1, this result follows from Theorem 3.1 in [19] with $M_{\nu}=2\max\left\{H_{0},N_{\nu}(\epsilon)\right\}$ . ∎

Now, we can derive global convergence rates for Algorithm 1 in terms of the norm of the gradient.

Theorem 3.3.

Under the same assumptions as Theorem 3.2, if $T=m+3s$ for some $s\geq 1$, then

[TABLE]

Consequently,

[TABLE]

with

[TABLE]

Proof.

By Theorem 3.2, we have

[TABLE]

for all $k$ , $m<k\leq T$ . In particular, it follows from (3.14) and (3.24) that

[TABLE]

Therefore,

[TABLE]

and so (3.22) holds. By assumption, we have $g_{T}^{*}>\epsilon$ . Thus, by (3.22) we get

[TABLE]

Finally, by analyzing separately the cases in which $\nu$ is known and unknown, it follows from (3.25) and (3.15) that (3.23) is true. ∎

Remark 2.

Suppose that the objective $f$ in (2.4) is nonconvex and bounded from below by $f^{*}$ . Then, it follows from (3.14) and (3.18) that

[TABLE]

Summing up these inequalities, we get

[TABLE]

and so, by (3.15), we obtain $T\leq\mathcal{O}\left(\epsilon^{-\frac{p+\nu}{p+\nu-1}}\right)$ . This bound generalizes the bound of $\mathcal{O}\left(\epsilon^{-\frac{2+\nu}{1+\nu}}\right)$ proved in [17] for $p=2$ . It agrees in order with the complexity bounds proved in [24] and [10] for different universal tensor methods.
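One way to spell out the argument of Remark 2 is sketched below. It assumes that $\|\nabla f(x_{t})\|_{*}>\epsilon$ for $t=0,\ldots,T$, so that every accepted step satisfies (3.14), and it uses the bound $M_{t}\leq 2\max\left\{H_{0},N_{\nu}(\epsilon)\right\}$. This is a reconstruction, and the constants may be arranged differently in the paper.

```latex
\begin{aligned}
f(x_{t})-f(x_{t+1})
  &\geq \frac{\|\nabla f(x_{t+1})\|_{*}^{\frac{p+\alpha}{p+\alpha-1}}}
             {8(p+1)!\,M_{t}^{\frac{1}{p+\alpha-1}}}
   \geq \frac{\epsilon^{\frac{p+\alpha}{p+\alpha-1}}}
             {8(p+1)!\left(2\max\left\{H_{0},N_{\nu}(\epsilon)\right\}\right)^{\frac{1}{p+\alpha-1}}},
   \quad t=0,\ldots,T-1, \\
f(x_{0})-f^{*}
  &\geq \sum_{t=0}^{T-1}\left[f(x_{t})-f(x_{t+1})\right]
   \geq \frac{T\,\epsilon^{\frac{p+\alpha}{p+\alpha-1}}}
             {8(p+1)!\left(2\max\left\{H_{0},N_{\nu}(\epsilon)\right\}\right)^{\frac{1}{p+\alpha-1}}}.
\end{aligned}
```

If $\nu$ is known, then $\alpha=\nu$ and $N_{\nu}(\epsilon)$ does not depend on $\epsilon$, which gives the exponent $(p+\nu)/(p+\nu-1)$ directly. If $\nu$ is unknown, then $\alpha=1$ and $N_{\nu}(\epsilon)^{1/p}$ is proportional to $\epsilon^{-(1-\nu)/[p(p+\nu-1)]}$; adding the exponents gives $\frac{p+1}{p}+\frac{1-\nu}{p(p+\nu-1)}=\frac{p+\nu}{p+\nu-1}$, the same order.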

4 Accelerated Tensor Schemes

The schemes presented here generalize the procedures described in [30] for $p=1$ and $p=2$. Specifically, our general scheme is obtained by adding Step 2 of Algorithm 1 at the end of Algorithm 4 in [19], in order to relate the functional decrease with the norm of the gradient of $f$ at suitable points:

Algorithm 2. Adaptive Accelerated Tensor Method

Step 0. Choose $x_{0}\in\mathbb{E}$ , $\tilde{H}_{0},H_{0}>0$ , $\theta\geq 0$ and $\epsilon\in(0,1)$ . Set $\alpha$ by (3.11) and define function $\psi_{0}(x)=\frac{1}{p+\alpha}\|x-x_{0}\|^{p+\alpha}$ . Set $v_{0}=z_{0}=x_{0}$ , $A_{0}=0$ and $t:=0$ .

Step 1. If $\min\left\{\|\nabla f(x_{t})\|_{*},\|\nabla f(z_{t})\|_{*}\right\}\leq\epsilon$ , STOP.

Step 2. Set $i:=0$ .

Step 2.1. Compute the coefficient $a_{t,i}>0$ by solving equation

$a_{t,i}^{p+\alpha}=\dfrac{1}{2^{(3p-1)}}\left[\dfrac{(p-1)!}{2^{i}\tilde{H}_{t}}\right](A_{t}+a_{t,i})^{p+\alpha-1}.$

Step 2.2. Set $\gamma_{t,i}=\dfrac{a_{t,i}}{A_{t}+a_{t,i}}$ and compute $y_{t,i}=(1-\gamma_{t,i})x_{t}+\gamma_{t,i}v_{t}$ .

Step 2.3. Compute an approximate solution $x_{t,i}^{+}$ to $\min_{x\in\mathbb{E}}\,\Omega_{y_{t,i},p,2^{i}\tilde{H}_{t}}^{(\alpha)}(x)$, such that

$\Omega_{y_{t,i},p,2^{i}\tilde{H}_{t}}^{(\alpha)}(x_{t,i}^{+})\leq f(y_{t,i})\quad\text{and}\quad\|\nabla\Omega_{y_{t,i},p,2^{i}\tilde{H}_{t}}^{(\alpha)}(x_{t,i}^{+})\|_{*}\leq\theta\|x_{t,i}^{+}-y_{t,i}\|^{p+\alpha-1}.$

Step 2.4. If either condition $\|\nabla f(x_{t,i}^{+})\|_{*}\leq\epsilon$ or

$\langle\nabla f(x_{t,i}^{+}),y_{t,i}-x_{t,i}^{+}\rangle\geq\dfrac{1}{4}\left[\dfrac{(p-1)!}{2^{i}\tilde{H}_{t}}\right]^{\frac{1}{p+\alpha-1}}\|\nabla f(x_{t,i}^{+})\|_{*}^{\frac{p+\alpha}{p+\alpha-1}}$

holds, set $i_{t}:=i$ and go to Step 3. Otherwise, set $i:=i+1$ and go back to Step 2.1.

Step 3. Set $x_{t+1}=x_{t,i_{t}}^{+}$ and $\tilde{H}_{t+1}=2^{i_{t}-1}\tilde{H}_{t}$ .

Step 4. Define $\psi_{t+1}(x)=\psi_{t}(x)+a_{t}\left[f(x_{t+1})+\langle\nabla f(x_{t+1}),x-x_{t+1}\rangle\right]$ and compute $v_{t+1}=\arg\min_{x\in\mathbb{E}}\,\psi_{t+1}(x)$ .

Step 5. Set $\bar{z}_{t}=\arg\min\left\{f(y)\,:\,y\in\left\{z_{t},x_{t+1}\right\}\right\}$.

Step 6. Set $j:=0$ .

Step 6.1 Compute an approximate solution $z_{t,j}^{+}$ to $\min_{y\in\mathbb{E}}\,\Omega_{\bar{z}_{t},p,2^{j}H_{t}}^{(\alpha)}(y)$ such that

$\Omega_{\bar{z}_{t},p,2^{j}H_{t}}^{(\alpha)}(z_{t,j}^{+})\leq f(\bar{z}_{t})\quad\text{and}\quad\|\nabla\Omega_{\bar{z}_{t},p,2^{j}H_{t}}^{(\alpha)}(z_{t,j}^{+})\|_{*}\leq\theta\|z_{t,j}^{+}-\bar{z}_{t}\|^{p+\alpha-1}.$

Step 6.2 If either $\|\nabla f(z_{t,j}^{+})\|_{*}\leq\epsilon$ or

$f(\bar{z}_{t})-f(z_{t,j}^{+})\geq\dfrac{1}{8(p+1)!(2^{j}H_{t})^{\frac{1}{p+\alpha-1}}}\|\nabla f(z_{t,j}^{+})\|_{*}^{\frac{p+\alpha}{p+\alpha-1}}$

(4.26)

holds, set $j_{t}:=j$ and go to Step 7. Otherwise, set $j:=j+1$ and go to Step 6.1.

Step 7. Set $z_{t+1}=z_{t,j_{t}}^{+}$ , $H_{t+1}=2^{j_{t}-1}H_{t}$ , $t:=t+1$ and go to Step 1.
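Step 2.1 of Algorithm 2 asks for the positive root of a scalar equation in $a_{t,i}$. The paper does not say how to solve it; the sketch below uses plain bisection, which is an assumption of this example, as are the function name and the tolerance-free fixed iteration count. Writing $q=p+\alpha-1$ and $c=\frac{1}{2^{3p-1}}\frac{(p-1)!}{2^{i}\tilde{H}_{t}}$, the equation $a^{p+\alpha}=c\,(A_{t}+a)^{q}$ is equivalent to $a\,(a/(A_{t}+a))^{q}=c$, whose left-hand side is increasing in $a>0$, so the positive root is unique.

```python
import math

def solve_step_coefficient(A, H_tilde, i, p, alpha, iterations=200):
    """Positive root a of a^(p+alpha) = c * (A + a)^(p+alpha-1) by bisection.

    c = (p-1)! / (2^(3p-1) * 2^i * H_tilde), as in Step 2.1 of Algorithm 2.
    """
    q = p + alpha - 1.0
    c = math.factorial(p - 1) / (2.0 ** (3 * p - 1) * 2.0 ** i * H_tilde)

    def residual(a):
        # Increasing in a > 0; zero exactly at the root of the original equation.
        return a * (a / (A + a)) ** q - c

    lo, hi = 0.0, max(c, 1.0)
    while residual(hi) < 0.0:        # enlarge the bracket until it contains the root
        hi *= 2.0
    for _ in range(iterations):      # residual is only evaluated at points > 0
        mid = 0.5 * (lo + hi)
        if residual(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# First iteration (A_0 = 0): the equation reduces to a = c.
a_first = solve_step_coefficient(A=0.0, H_tilde=1.0, i=0, p=2, alpha=1.0)
# A later iteration with A_t = 1 (illustrative value).
a_later = solve_step_coefficient(A=1.0, H_tilde=1.0, i=0, p=2, alpha=1.0)
```

With $A_{0}=0$ the root is simply $a_{0,i}=c$, which gives a quick sanity check for any solver used in this step.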

Let us define the following function of $\epsilon$ :

[TABLE]

In Algorithm 2, note that $\left\{x_{t}\right\}$ is independent of $\left\{z_{t}\right\}$ . The next theorem establishes global convergence rates for the functional residual with respect to $\left\{x_{t}\right\}$ .

Theorem 4.1.

Assume that H1 holds and let the sequence $\left\{x_{t}\right\}_{t=0}^{T}$ be generated by Algorithm 2 such that, for $t=0,\ldots,T$ we have

[TABLE]

Then,

[TABLE]

for $t=2,\ldots,T$ .

Proof.

As in the proof of Lemma 3.1, it follows from (4.28), (4.27) and Lemmas A.6 and A.7 in [19] that

[TABLE]

which gives

[TABLE]

Then, (4.29) follows directly from Theorem 4.2 in [19] with $M_{\nu}=2\max\left\{\tilde{H}_{0},\tilde{N}_{\nu}(\epsilon)\right\}$ . ∎

Now we can obtain global convergence rates for Algorithm 2 in terms of the norm of the gradient.

Theorem 4.2.

Suppose that H1 holds and let sequences $\left\{x_{t}\right\}_{t=0}^{T}$ and $\left\{z_{t}\right\}_{t=0}^{T}$ be generated by Algorithm 2. Assume that, for $t=0,\ldots,T$ , we have

[TABLE]

If $T=2s$ for some $s>1$ , then

[TABLE]

where

[TABLE]

with $N_{\nu}(\epsilon)$ and $\tilde{N}_{\nu}(\epsilon)$ defined in (3.15) and (4.27), respectively. Consequently,

[TABLE]

if $\nu$ is known (i.e., $\alpha=\nu$ ), and

[TABLE]

if $\nu$ is unknown (i.e., $\alpha=1$ ).

Proof.

By Theorem 4.1, we have

[TABLE]

for $t=2,\ldots,T$ . On the other hand, as in Lemma 3.1, by (4.30) we get

[TABLE]

where $N_{\nu}(\epsilon)$ is defined in (3.15). Then, in view of (4.26), it follows that

[TABLE]

for $t=0,\ldots,T-1$ . In particular, $f(z_{t+1})\leq f(\bar{z}_{t})$ for $t=0,\ldots,T-1$ . Moreover, by the definition of $\bar{z}_{t}$ , we get $f(\bar{z}_{t})\leq f(x_{t+1})$ and $f(\bar{z}_{t})\leq f(z_{t})$ . Therefore

[TABLE]

and

[TABLE]

Now, since $T=2s$ , summing up (4.37), we get

[TABLE]

Thus,

[TABLE]

and so (4.31) holds. By assumption, we have $g_{T}^{*}>\epsilon$ . Thus, it follows from (4.31) that

[TABLE]

If $\nu$ is known, by (3.15) and (4.27) we have $\max\left\{N_{\nu}(\epsilon),\tilde{N}_{\nu}(\epsilon)\right\}\leq 3p\left(H_{f,p}(\nu)+\theta(p-1)!\right)$ . Then,

[TABLE]

and so

[TABLE]

Combining (4.38), (4.39) and $\frac{p+\nu}{(p+\nu-1)(p+\nu+1)}\leq 1$ , we obtain (4.32). If $\nu$ is unknown, it follows from (3.15) and (4.27) that

[TABLE]

Then,

[TABLE]

and so

[TABLE]

Combining (4.38), (4.40) and $\frac{p+1}{p(p+2)}<\frac{1}{2}$ , we obtain (4.33). ∎

Remark 3.

When $\nu=1$ , bounds (4.32) and (4.33) have the same dependence on $\epsilon$ . However, when $\nu\neq 1$ , the bound of $\mathcal{O}\left(\epsilon^{-(p+1)/[(p+\nu-1)(p+2)]}\right)$ obtained for the universal scheme (i.e., $\alpha=1$ ) is worse than the bound of $\mathcal{O}\left(\epsilon^{-(p+\nu)/[(p+\nu-1)(p+\nu+1)]}\right)$ obtained for the non-universal scheme (i.e., $\alpha=\nu$ ). In both cases, these complexity bounds are better than the bound of $\mathcal{O}\left(\epsilon^{-1/(p+\nu-1)}\right)$ proved for Algorithm 1.
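The comparison in Remark 3 can be verified with exact arithmetic. The short sketch below (function name hypothetical) computes the exponents $e$ for which the three bounds read $\mathcal{O}(\epsilon^{-e})$ and checks their ordering for a few values of $p\geq 2$ and $\nu\in[0,1]$.

```python
from fractions import Fraction

def exponents(p, nu):
    """Exponents e such that the iteration bounds read O(eps^{-e})."""
    p, nu = Fraction(p), Fraction(nu)
    non_accelerated = 1 / (p + nu - 1)                        # Algorithm 1
    accelerated = (p + nu) / ((p + nu - 1) * (p + nu + 1))    # alpha = nu
    universal = (p + 1) / ((p + nu - 1) * (p + 2))            # alpha = 1
    return non_accelerated, accelerated, universal

for p in (2, 3, 4):
    for nu in (Fraction(0), Fraction(1, 2), Fraction(1)):
        plain, acc, uni = exponents(p, nu)
        # Both accelerated bounds improve on the non-accelerated one ...
        assert acc <= uni < plain
        # ... and they coincide exactly when nu = 1.
        assert (acc == uni) == (nu == 1)
```

For instance, with $p=2$ and $\nu=1/2$ the exponents are $2/3$ (non-accelerated), $10/21$ (accelerated with known $\nu$) and $1/2$ (universal).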

5 Composite Minimization

From now on, we will assume that $\nu$ and $H_{f,p}(\nu)$ are known. In this setting, we can consider the composite minimization problem:

[TABLE]

where $f:\mathbb{E}\to\mathbb{R}$ is a convex function satisfying H1 (see Section 3), and $\varphi:\mathbb{E}\to\mathbb{R}\cup\left\{+\infty\right\}$ is a simple closed convex function whose effective domain has nonempty relative interior, that is, $\text{ri}\left({\rm dom\,}\varphi\right)\neq\emptyset$. We assume that there exists at least one optimal solution $x^{*}\in\mathbb{E}$ for (5.41). By (2.6), if $H\geq H_{f,p}(\nu)$ we have

[TABLE]

This motivates the following class of models of $\tilde{f}$ around a fixed point $x\in\mathbb{E}$ :

[TABLE]

where $\Omega_{x,p,H}^{(\nu)}(\,.\,)$ is defined in (2.10). The next lemma gives a sufficient condition for function $\Omega_{x,p,H}^{(\nu)}(\,.\,)$ to be convex. Its proof is an adaptation of the proof of Theorem 1 in [31].

Lemma 5.1.

Suppose that H1 holds for some $p\geq 2$ . Then, for any $x,y\in\mathbb{E}$ we have

[TABLE]

Moreover, if $H\geq(p-1)H_{f,p}(\nu)$ , then function $\Omega_{x,p,H}^{(\nu)}(\,.\,)$ is convex for any $x\in\mathbb{E}$ .

Proof.

For any $u\in\mathbb{E}$ , it follows from (2.8) that

[TABLE]

Since $u\in\mathbb{E}$ is arbitrary, we get (5.43).

Now, suppose that $H\geq(p-1)H_{f,p}(\nu)$ . Then, by (5.43) we have

[TABLE]

Therefore, $\Omega_{x,p,H}^{(\nu)}(y)$ is convex. ∎

From Lemma 5.1, if $H\geq(p-1)H_{f,p}(\nu)$ it follows that $\tilde{\Omega}_{x,p,H}^{(\nu)}(\,.\,)$ is also convex. In this case, since $\text{ri}\left({\rm dom\,}\varphi\right)\neq\emptyset$ , any solution $x^{+}$ of

$\min_{y\in\mathbb{E}}\,\tilde{\Omega}_{x,p,H}^{(\nu)}(y)$

(5.44)

satisfies the first-order optimality condition:

[TABLE]

Therefore, there exists $g_{\varphi}(x^{+})\in\partial\varphi(x^{+})$ such that

$\nabla\Omega_{x,p,H}^{(\nu)}(x^{+})+g_{\varphi}(x^{+})=0.$

(5.46)

Instead of solving (5.44) exactly, in our algorithms we consider inexact solutions $x^{+}$ satisfying conditions (5.47) below. These conditions have already been used in [21] and are the composite analogue of the conditions proposed in [2]. It is worth mentioning that, for $p=3$ and $\nu=1$, the tensor model $\Omega_{x,p,M}^{(\nu)}(\,.\,)$ has very nice relative smoothness properties (see [31]) which allow the approximate solution of (5.44) by Bregman Proximal Gradient Algorithms [23, 3]. Specifically, we require

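$\tilde{\Omega}_{x,p,H}^{(\nu)}(x^{+})\leq\tilde{f}(x)\quad\text{and}\quad\|\nabla\Omega_{x,p,H}^{(\nu)}(x^{+})+g_{\varphi}(x^{+})\|_{*}\leq\theta\|x^{+}-x\|^{p+\nu-1},$

(5.47)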

for some $g_{\varphi}(x^{+})\in\partial\varphi(x^{+})$ and $\theta\geq 0$ . For such points $x^{+}$ , we define

$\nabla\tilde{f}(x^{+})\equiv\nabla f(x^{+})+g_{\varphi}(x^{+}),$

(5.48)

with $g_{\varphi}(x^{+})$ satisfying (5.47). Clearly, we have $\nabla\tilde{f}(x^{+})\in\partial\tilde{f}(x^{+})$ .

Lemma 5.2.

Suppose that H1 holds and let $x^{+}$ be an approximate solution of (5.44) such that (5.47) holds for some $x\in\mathbb{E}$ . If

[TABLE]

then

[TABLE]

Proof.

By (5.48), (2.7), (2.10), (5.47) and (5.49) we have

[TABLE]

where the last inequality is due to $p\geq 2$ . On the other hand, by (2.6), (5.42), (5.49), we have

[TABLE]

Note that $H\geq pH_{f,p}(\nu)\geq\left(\frac{p+1}{p}\right)H_{f,p}(\nu)$ . Thus,

[TABLE]

Finally, combining (5.51) and (5.52), we get (5.50). ∎

In this composite context, let us consider the following scheme:

Algorithm 3. Tensor Method for Composite Minimization

Step 0. Choose $x_{0}\in\mathbb{E}$ and $\theta\geq 0$ . Set $M=\max\left\{pH_{f,p}(\nu),3\theta(p-1)!\right\}$ and $t:=0$ .

Step 1. Compute an approximate solution $x_{t+1}$ to $\min_{y\in\mathbb{E}}\tilde{\Omega}_{x_{t},p,M}^{(\nu)}(y)$ such that

$\tilde{\Omega}_{x_{t},p,M}^{(\nu)}(x_{t+1})\leq\tilde{f}(x_{t})\quad\text{and}\quad\|\nabla\Omega_{x_{t},p,M}^{(\nu)}(x_{t+1})+g_{\varphi}(x_{t+1})\|_{*}\leq\theta\|x_{t+1}-x_{t}\|^{p+\nu-1},$

for some $g_{\varphi}(x_{t+1})\in\partial\varphi(x_{t+1})$ .

Step 2. Set $t:=t+1$ and go back to Step 1.

For $p=3$ , point $x_{t+1}$ at Step 1 can be computed by Algorithm 2 in [20], which is linearly convergent. As far as we know, the development of efficient algorithms to approximately solve (5.44)-(5.42) with $p>3$ is still an open problem.
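To make the structure of Algorithm 3 concrete, the following minimal sketch treats the simplest possible setting: dimension one, $p=2$, $\nu=1$, $\varphi\equiv 0$ and exact model minimization ($\theta=0$). It assumes that the model has the form $\Omega_{x,p,M}^{(\nu)}(y)=\Phi_{x,p}(y)+\frac{M}{p!}\|y-x\|^{p+\nu}$, where $\Phi_{x,p}$ is the Taylor polynomial of degree $p$ (definition (2.10) is not reproduced in this section, so this form is an assumption). The test function $f(x)=\log\cosh(x)$, for which $|f'''(x)|\leq 4/(3\sqrt{3})$, and all function names are hypothetical choices made for illustration.

```python
import math


def grad(x):
    # f(x) = log(cosh(x)), so f'(x) = tanh(x).
    return math.tanh(x)


def hess(x):
    # f''(x) = 1 / cosh(x)^2 >= 0, hence f is convex.
    return 1.0 / math.cosh(x) ** 2


def tensor_step(x, M):
    """Exact minimizer of s -> g*s + h*s^2/2 + (M/2)*|s|^3 (p = 2, nu = 1)."""
    g, h = grad(x), hess(x)
    if g == 0.0:
        return x
    # Stationarity: g + h*s + (3M/2)*|s|*s = 0, with sign(s) = -sign(g).
    # Writing r = |s| gives (3M/2)*r^2 + h*r - |g| = 0.
    r = (-h + math.sqrt(h * h + 6.0 * M * abs(g))) / (3.0 * M)
    return x - math.copysign(r, g)


def algorithm3_1d(x0, eps, H=4.0 / (3.0 * math.sqrt(3.0)), max_iter=100):
    """Iterate until |f'(x)| <= eps; returns the last point and all iterates."""
    M = 2.0 * H  # M = max{p*H, 3*theta*(p-1)!} with p = 2 and theta = 0
    x, history = x0, [x0]
    for _ in range(max_iter):
        if abs(grad(x)) <= eps:
            break
        x = tensor_step(x, M)
        history.append(x)
    return x, history


if __name__ == "__main__":
    x_final, iterates = algorithm3_1d(3.0, 1e-8)
    print(len(iterates) - 1, x_final, abs(grad(x_final)))
```

In one dimension the model minimizer is available in closed form, which is why no inner solver is needed here; for $p\geq 3$ or in higher dimensions the subproblem has to be solved approximately, as discussed above.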

Theorem 5.3.

Suppose that H1 holds and that $\tilde{f}$ is bounded from below by $\tilde{f}^{*}$ . Given $\epsilon\in(0,1)$ , assume that $\left\{x_{t}\right\}_{t=0}^{T}$ is a sequence generated by Algorithm 3 such that $\|\nabla\tilde{f}(x_{t})\|_{*}>\epsilon$ for $t=0,\ldots,T$ . Then,

[TABLE]

Proof.

By Lemma 5.2, bound (5.53) follows as in Remark 2. ∎

5.1 Extended Accelerated Scheme

Let us consider the following variant of Algorithm 2 for composite minimization:

Algorithm 4. Two-Phase Accelerated Tensor Method

Step 0. Choose $x_{0}\in\mathbb{E}$ , $\theta\geq 0$ and $\epsilon\in(0,1)$ . Define $\psi_{0}(x)=\frac{1}{p+\nu}\|x-x_{0}\|^{p+\nu}$ . Set $v_{0}=z_{0}=x_{0}$ , $A_{0}=0$ , $M=p\left(H_{f,p}(\nu)+3\theta(p-1)!\right)$ and $t:=0$ .

Step 1. If $t>0$ and $\min\left\{\|\nabla\tilde{f}(x_{t})\|_{*},\|\nabla\tilde{f}(z_{t})\|_{*}\right\}\leq\epsilon$ , STOP.

Step 2. Compute the coefficient $a_{t}>0$ by solving the equation

$a_{t}^{p+\nu}=\dfrac{1}{2^{(3p-1)}}\left[\dfrac{(p-1)!}{M}\right](A_{t}+a_{t})^{p+\nu-1}.$

Step 3. Set $y_{t}=(1-\gamma_{t})x_{t}+\gamma_{t}v_{t}$ , with $\gamma_{t}=a_{t}/[A_{t}+a_{t}]$ .

Step 4. Compute an approximate solution $x_{t+1}$ to $\min_{x\in\mathbb{E}}\tilde{\Omega}_{y_{t},p,M}^{(\nu)}(x)$ such that

$\tilde{\Omega}_{y_{t},p,M}^{(\nu)}(x_{t+1})\leq\tilde{f}(y_{t})\quad\text{and}\quad\|\nabla\Omega_{y_{t},p,M}^{(\nu)}(x_{t+1})+g_{\varphi}(x_{t+1})\|_{*}\leq\theta\|x_{t+1}-y_{t}\|^{p+\nu-1},$

(5.54)

for some $g_{\varphi}(x_{t+1})\in\partial\varphi(x_{t+1})$ .

Step 5. Define $\psi_{t+1}(x)=\psi_{t}(x)+a_{t}\left[f(x_{t+1})+\langle\nabla f(x_{t+1}),x-x_{t+1}\rangle+\varphi(x)\right]$ and compute $v_{t+1}=\arg\min_{x\in\mathbb{E}}\,\psi_{t+1}(x)$ .

Step 6. Set $\bar{z}_{t}=\arg\min\left\{\tilde{f}(y)\,:\,y\in\left\{z_{t},x_{t+1}\right\}\right\}$ and $j:=0$ .

Step 7. Compute an approximate solution $z_{t+1}$ to $\min_{x\in\mathbb{E}}\tilde{\Omega}_{\bar{z}_{t},p,M}^{(\nu)}(x)$ such that

$\tilde{\Omega}_{\bar{z}_{t},p,M}^{(\nu)}(z_{t+1})\leq\tilde{f}(\bar{z}_{t})\quad\text{and}\quad\|\nabla\Omega_{\bar{z}_{t},p,M}^{(\nu)}(z_{t+1})+g_{\varphi}(z_{t+1})\|_{*}\leq\theta\|z_{t+1}-\bar{z}_{t}\|^{p+\nu-1},$

(5.55)

for some $g_{\varphi}(z_{t+1})\in\partial\varphi(z_{t+1})$ .

Step 8. Set $t:=t+1$ and go to Step 1.
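Step 2 of Algorithm 4 requires the positive root of a scalar equation. The paper does not prescribe a solver, so the following is only a minimal sketch of one possibility, using bisection; the function name `solve_coefficient`, the bracketing strategy and the tolerance are assumptions made for illustration.

```python
import math


def solve_coefficient(A, p, nu, M, tol=1e-14, max_iter=200):
    """Positive root a of a^(p+nu) = c * (A + a)^(p+nu-1),
    where c = (p-1)! / (2^(3p-1) * M)."""
    q = p + nu
    c = math.factorial(p - 1) / (2.0 ** (3 * p - 1) * M)
    if A == 0.0:
        return c  # the equation reduces to a = c

    def residual(a):
        return a ** q - c * (A + a) ** (q - 1.0)

    # residual(0) < 0 and a^q / (A + a)^(q-1) is increasing in a,
    # so the positive root is unique and can be bracketed by doubling.
    lo, hi = 0.0, max(c, 1.0)
    while residual(hi) < 0.0:
        hi *= 2.0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if residual(mid) < 0.0:
            lo = mid
        else:
            hi = mid
        if hi - lo <= tol * hi:
            break
    return 0.5 * (lo + hi)
```

Since $A_{0}=0$, the first coefficient is simply $a_{0}=\frac{(p-1)!}{2^{3p-1}M}$; the bisection is only needed for $t\geq 1$.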

The next theorem gives the global convergence rate for Algorithm 4 in terms of the norm of the gradient. Its proof is a direct adaptation of the proof of Theorem 4.2.

Theorem 5.4.

Suppose that H1 holds. Assume that $\left\{z_{t}\right\}_{t=0}^{T}$ is a sequence generated by Algorithm 4 such that

[TABLE]

If $T=2s$ for some $s>1$ , then

[TABLE]

Consequently,

[TABLE]

Proof.

In view of Theorem A.2, we have

[TABLE]

for $t=2,\ldots,T$ . On the other hand, by (5.55) and Lemma 5.2, we have

[TABLE]

for $t=0,\ldots,T-1$. Thus, $\tilde{f}(z_{t+1})\leq\tilde{f}(\bar{z}_{t})\leq\min\left\{\tilde{f}(x_{t+1}),\tilde{f}(z_{t})\right\}$ and, consequently,

[TABLE]

and

[TABLE]

Since $T=2s$ , combining (5.59), (5.61) and (5.62), we obtain

[TABLE]

where $\tilde{g}_{T}=\min_{0\leq k\leq T}\|\nabla\tilde{f}(z_{k})\|_{*}$. Therefore,

[TABLE]

which gives (5.57). Finally, by (5.56) we have $\tilde{g}_{T}>\epsilon$ . Thus, (5.58) follows directly from (5.57). ∎

5.2 Regularization Approach

Now, let us consider the ideal situation in which $\nu$ , $H_{f,p}(\nu)$ and $R\geq\|x_{0}-x^{*}\|$ are known. In this case, a complexity bound with a better dependence on $\epsilon$ can be obtained by repeatedly applying an accelerated algorithm to a suitable regularization of $\tilde{f}$ . Specifically, given $\delta>0$ , consider the regularized problem

[TABLE]

for

[TABLE]

Lemma 5.5.

Given $x_{0}\in\mathbb{E}$ and $\nu\in[0,1]$ , let $d_{p+\nu}:\mathbb{E}\to\mathbb{R}$ be defined by $d_{p+\nu}(x)=\|x-x_{0}\|^{p+\nu}$ , where $\|\,.\,\|$ is the Euclidean norm defined in (1.1). Then,

[TABLE]

where $C_{p,\nu}=2\prod_{i=1}^{p}(\nu+i)$.

Proof.

See [34]. ∎

As a consequence of the lemma above, we have the following property.

Lemma 5.6.

If H1 holds, then the $p$ th derivative of $F_{\delta}(\,.\,)$ in (5.64) is $\nu$ -Hölder continuous with constant $H_{F_{\delta},p}(\nu)=H_{f,p}(\nu)+\frac{\delta}{p+\nu}C_{p,\nu}$ .

In view of Lemma 5.6, to solve (5.63) we can use the following instance of Algorithm A (see Appendix A):

Algorithm 5. Accelerated Tensor Method for Problem (5.63)

Step 0. Choose $x_{0}\in\mathbb{E}$ , $\theta\geq 0$ and $\epsilon\in(0,1)$ . Define function

$\psi_{0}(x)=\frac{1}{p+\nu}\|x-x_{0}\|^{p+\nu}$ . Set

$H_{\delta}=p\left(H_{F_{\delta},p}(\nu)+3\theta(p-1)!\right),$

(5.65)

$v_{0}=x_{0}$ , $A_{0}=0$ and $t:=0$ .

Step 1. Compute the coefficient $a_{t}>0$ by solving equation

$a_{t}^{p+\nu}=\dfrac{1}{2^{(3p-1)}}\left[\dfrac{(p-1)!}{H_{\delta}}\right](A_{t}+a_{t})^{p+\nu-1}.$

Step 2. Compute $y_{t}=(1-\gamma_{t})x_{t}+\gamma_{t}v_{t}$ , with $\gamma_{t}=a_{t}/[A_{t}+a_{t}]$ .

Step 3. Compute an approximate solution $x_{t+1}$ to $\min_{x\in\mathbb{E}}\tilde{\Omega}_{y_{t},p,M}^{(\nu)}(x)$ such that

$\tilde{\Omega}_{y_{t},p,M}^{(\nu)}(x_{t+1})\leq\tilde{F}_{\delta}(y_{t})\quad\text{and}\quad\|\nabla\Omega_{y_{t},p,M}^{(\nu)}(x_{t+1})+g_{\varphi}(x_{t+1})\|_{*}\leq\theta\|x_{t+1}-y_{t}\|^{p+\nu-1},$

for some $g_{\varphi}(x_{t+1})\in\partial\varphi(x_{t+1})$ .

Step 4. Define $\psi_{t+1}(x)=\psi_{t}(x)+a_{t}\left[F_{\delta}(x_{t+1})+\langle\nabla F_{\delta}(x_{t+1}),x-x_{t+1}\rangle+\varphi(x)\right]$ and compute $v_{t+1}=\arg\min_{x\in\mathbb{E}}\,\psi_{t+1}(x)$ .

Step 5. Set $t:=t+1$ and go back to Step 1.

Let us consider the following restart procedure based on Algorithm 5.

Algorithm 6. Accelerated Regularized Tensor Method

Step 0. Choose $x_{0}\in\mathbb{E}$ , $\epsilon\in(0,1)$ , $\theta\geq 0$ and $\delta>0$ . Define

$m=1+\left\lceil\left(\dfrac{2^{4p+\nu-2}(p+\nu)^{p+\nu}H_{\delta}}{\delta(p-1)!}\right)^{\frac{1}{p+\nu}}\right\rceil,$

(5.66)

for $H_{\delta}$ defined in (5.65). Set $y_{0}=x_{0}$ , $u_{0}=x_{0}$ and $k:=0$ .

Step 1. If $k>0$ and $\|\nabla\tilde{F}_{\delta}(u_{k})\|_{*}\leq\epsilon/2$ , STOP.

Step 2. By applying Algorithm 5 to problem (5.63), with $x_{0}^{(k)}=y_{k}$ , compute the first $m$ iterates $\left\{x_{t}^{(k)}\right\}_{t=0}^{m}$ .

Step 3. Set $y_{k+1}=x_{m}^{(k)}$ and compute $u_{k+1}\in\mathbb{E}$ such that

$\tilde{\Omega}_{y_{k+1},p,M}^{(\nu)}(u_{k+1})\leq\tilde{F}_{\delta}(y_{k+1})\quad\text{and}\quad\|\nabla\Omega_{y_{k+1},p,M}^{(\nu)}(u_{k+1})+g_{\varphi}(u_{k+1})\|_{*}\leq\theta\|u_{k+1}-y_{k+1}\|^{p+\nu-1},$

(5.67)

for some $g_{\varphi}(u_{k+1})\in\partial\varphi(u_{k+1})$ .

Step 4. Set $k:=k+1$ and go back to Step 1.
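The constants that drive Algorithm 6 can be evaluated directly from Lemma 5.5, Lemma 5.6, (5.65) and (5.66). The following minimal sketch does so; the function name `restart_parameters` and the sample inputs are hypothetical, and the code simply transcribes the formulas as stated above.

```python
import math


def restart_parameters(p, nu, H_f, theta, delta):
    """Return (H_F, H_delta, m) following Lemma 5.6, (5.65) and (5.66)."""
    q = p + nu
    # C_{p,nu} = 2 * prod_{i=1}^{p} (nu + i), as in Lemma 5.5.
    C = 2.0 * math.prod(nu + i for i in range(1, p + 1))
    # Hoelder constant of the p-th derivative of F_delta (Lemma 5.6).
    H_F = H_f + delta * C / q
    # Regularization constant used by Algorithm 5, see (5.65).
    H_delta = p * (H_F + 3.0 * theta * math.factorial(p - 1))
    # Number of inner iterations per restart, see (5.66).
    ratio = (2.0 ** (4 * p + nu - 2) * q ** q * H_delta
             / (delta * math.factorial(p - 1)))
    m = 1 + math.ceil(ratio ** (1.0 / q))
    return H_F, H_delta, m


if __name__ == "__main__":
    print(restart_parameters(2, 1.0, 1.0, 0.0, 1.0))
    print(restart_parameters(2, 1.0, 1.0, 0.0, 1e-3))
```

Decreasing $\delta$ increases $m$, which reflects the trade-off between the accuracy of the regularized problem and the cost of each restart.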

Theorem 5.7.

Suppose that H1 holds and let $\left\{u_{k}\right\}_{k=0}^{T}$ be a sequence generated by Algorithm 6 such that

[TABLE]

Then,

[TABLE]

Proof.

Let $x_{\delta}^{*}=\arg\min_{x\in\mathbb{E}}\,\tilde{F}_{\delta}(x)$ . By Theorem A.2 and (5.66), we have

[TABLE]

On the other hand, by Lemma 5 in [13] and Lemma 1 in [29], function $F_{\delta}(\,.\,)$ is uniformly convex of degree $p+\nu$ with parameter $\delta 2^{-(p+\nu-2)}$. Thus,

[TABLE]

Combining (5.70) and (5.71), we obtain $\|y_{k+1}-x_{\delta}^{*}\|^{p+\nu}\leq\dfrac{1}{2}\|y_{k}-x_{\delta}^{*}\|^{p+\nu}$ , and so

[TABLE]

Thus, it follows from (5.70) and (5.72) that

[TABLE]

In view of Lemma 5.2, by (5.67) and (5.65), we get

[TABLE]

Then, combining (5.73) and (5.74), it follows that

[TABLE]

In particular, for $k=T-1$ , it follows from (5.68) that

[TABLE]

Since $\tilde{F}_{\delta}(x_{\delta}^{*})\leq\tilde{F}_{\delta}(x^{*})$ , it follows that $\|x_{0}-x_{\delta}^{*}\|\leq\|x_{0}-x^{*}\|\leq R$ . Thus, combining this with (5.75), we get (5.69). ∎
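The uniform convexity property invoked in the proof above can be probed numerically for the regularizer alone. The sketch below assumes the usual definition (a differentiable function $d$ is uniformly convex of degree $q$ with parameter $\sigma$ if $d(y)\geq d(x)+\langle\nabla d(x),y-x\rangle+\frac{\sigma}{q}\|y-x\|^{q}$ for all $x,y$), takes $B=I$, and tests the inequality for $d(x)=\frac{1}{q}\|x\|^{q}$ with $\sigma=2^{-(q-2)}$ on random pairs. All names are hypothetical, and a passing check is only a sanity test, not a proof.

```python
import math
import random


def norm(v):
    return math.sqrt(sum(c * c for c in v))


def d(x, q):
    # d(x) = ||x||^q / q
    return norm(x) ** q / q


def grad_d(x, q):
    # grad d(x) = ||x||^(q-2) * x, with the convention grad d(0) = 0.
    r = norm(x)
    return [0.0 for _ in x] if r == 0.0 else [r ** (q - 2.0) * c for c in x]


def uniform_convexity_gap(x, y, q):
    """d(y) - d(x) - <grad d(x), y - x> - (sigma/q) * ||y - x||^q, sigma = 2^(2-q)."""
    diff = [b - a for a, b in zip(x, y)]
    lin = sum(g * s for g, s in zip(grad_d(x, q), diff))
    sigma = 2.0 ** (2.0 - q)
    return d(y, q) - d(x, q) - lin - sigma / q * norm(diff) ** q


def min_gap(q, dim=5, trials=2000, seed=0):
    """Smallest gap observed over random pairs in [-1, 1]^dim."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(trials):
        x = [rng.uniform(-1, 1) for _ in range(dim)]
        y = [rng.uniform(-1, 1) for _ in range(dim)]
        worst = min(worst, uniform_convexity_gap(x, y, q))
    return worst
```

A nonnegative value of `min_gap(q)` (up to rounding) is consistent with the stated parameter; for $q=2$ the gap vanishes identically, so only rounding errors remain.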

Corollary 5.8.

Suppose that H1 holds and that $R\geq 1$ . Then, Algorithm 6 with

[TABLE]

performs at most

[TABLE]

iterations of Algorithm 5 in order to generate $u_{T}$ such that $\|\nabla\tilde{f}(u_{T})\|_{*}\leq\epsilon$ .

Proof.

By Theorem 5.7, we can obtain $\|\nabla\tilde{F}_{\delta}(u_{T})\|_{*}\leq\epsilon/2$ with

[TABLE]

Moreover, it follows from (5.65), (5.76), the definition of $H_{F_{\delta},p}(\nu)$ in Lemma 5.6, $\epsilon\in(0,1)$ and $R\geq 1$ that

[TABLE]

Combining (5.78), (5.79) and (5.76), we have

[TABLE]

At this point $u_{T}$ , we have

[TABLE]

Since $\tilde{F}_{\delta}(\,.\,)$ is uniformly convex of degree $p+\nu$ with parameter $\delta 2^{-(p+\nu-2)}$, it follows from (5.74) and (5.73) that

[TABLE]

Therefore, $\|u_{T}-x_{\delta}^{*}\|\leq\|x_{0}-x_{\delta}^{*}\|$ , and so

[TABLE]

Now, combining (5.81), (5.83) and (5.76), we obtain

[TABLE]

The conclusion is obtained by noticing that, for $\delta$ given in (5.76) we have

[TABLE]

Thus, (5.77) follows from multiplying (5.80) and (5.85). ∎

Suppose now that $S\geq\tilde{f}(x_{0})-\tilde{f}(x^{*})$ is known. In this case, we have the following variant of Theorem 5.7.

Theorem 5.9.

Suppose that H1 holds and let $\left\{u_{k}\right\}_{k=0}^{T}$ be a sequence generated by Algorithm 6 such that

[TABLE]

Then,

[TABLE]

Proof.

By (5.75), we have

[TABLE]

Since $\tilde{F}_{\delta}(\,.\,)$ is uniformly convex of degree $p+\nu$ with parameter $\delta 2^{-(p+\nu-2)}$ we have

[TABLE]

Combining (5.88) and (5.89) we get (5.87). ∎

Corollary 5.10.

Suppose that H1 holds and that $S\geq 1$ . Then, Algorithm 6 with

[TABLE]

performs at most

[TABLE]

iterations of Algorithm 5 in order to generate $u_{T}$ such that $\|\nabla\tilde{f}(u_{T})\|_{*}\leq\epsilon$ .

Proof.

By Theorem 5.9, we can obtain $\|\nabla\tilde{F}_{\delta}(u_{T})\|_{*}\leq\epsilon/2$ with

[TABLE]

In view of (5.90), $\epsilon\in(0,1)$ and $S\geq 1$ , we also have

[TABLE]

Thus, from (5.92) and (5.93) it follows that

[TABLE]

At this point $u_{T}$ we have

[TABLE]

By (5.82) and (5.89),

[TABLE]

Thus, it follows from (5.2), (5.95) and (5.90) that

[TABLE]

Finally, by (5.66) and (5.90) we have

[TABLE]

Thus, (5.91) follows by multiplying (5.94) by the upper bound on $m$ given above. ∎

6 Lower complexity bounds under Hölder condition

In this section we derive lower complexity bounds for $p$ -order tensor methods applied to the problem (2.4) in terms of the norm of the gradient of $f$ , where the objective $f$ is convex and $H_{f,p}(\nu)<+\infty$ for some $\nu\in[0,1]$ .

For simplicity, assume that $\mathbb{E}=\mathbb{R}^{n}$ and $B=I_{n}$ . Given an approximation $\bar{x}$ for the solution of (2.4), we consider $p$ -order methods that compute trial points of the form $x^{+}=\bar{x}+\bar{h}$ , where the search direction $\bar{h}$ is the solution of an auxiliary problem of the form

[TABLE]

with $a\in\mathbb{R}^{p}$ , $\gamma>0$ and $q>1$ . Denote by $\Gamma_{\bar{x},f}(a,\gamma,q)$ the set of all stationary points of function $\phi_{a,\gamma,q}(\,.\,)$ and define the linear subspace

[TABLE]

More specifically, we consider the class of $p$ -order tensor methods characterized by the following assumption.

Assumption 1. Given $x_{0}\in\mathbb{R}^{n}$ , the method generates a sequence of test points $\left\{x_{k}\right\}_{k\geq 0}$ such that

[TABLE]

Given $\nu\in[0,1]$ , we consider the same family of difficult problems discussed in [19], namely:

[TABLE]

The next lemma establishes that for each $f_{k}(\,.\,)$ we have $H_{f_{k},p}(\nu)<+\infty$ .

Lemma 6.1.

Given an integer $k\in[2,n]$ , the $p$ th derivative of $f_{k}(\,.\,)$ is $\nu$ -Hölder continuous with

[TABLE]

Proof.

See Lemma 5.1 in [19]. ∎

The next lemma provides additional properties of $f_{k}(\,.\,)$ .

Lemma 6.2.

Given an integer $k\in[2,n]$ , let function $f_{k}(\,.\,)$ be defined by (6.99). Then, $f_{k}(\,.\,)$ has a unique global minimizer $x_{k}^{*}$ . Moreover,

[TABLE]

Proof.

See Lemma 5.2 in [19]. ∎

Our goal is to understand the behavior of the tensor methods specified by Assumption 1 when applied to the minimization of $f_{k}(\,.\,)$ with a suitable $k$ . For that, let us consider the following subspaces:

[TABLE]

Lemma 6.3.

For any $q\geq 0$ and $x\in\mathbb{R}_{k}^{n}$ , $f_{k+q}(x)=f_{k}(x)$ .

Proof.

It follows directly from (6.99). ∎

Lemma 6.4.

Let $\mathcal{M}$ be a $p$ -order tensor method satisfying Assumption 1. If $\mathcal{M}$ is applied to the minimization of $f_{t}(\,.\,)$ ( $2\leq t\leq n$ ) starting from $x_{0}=0$ , then the sequence $\left\{x_{k}\right\}_{k\geq 0}$ of test points generated by $\mathcal{M}$ satisfies

[TABLE]

Proof.

See Lemma 2 in [31]. ∎

The next lemma gives a lower bound for the norm of the gradient of $f_{t}(\,.\,)$ on suitable points.

Lemma 6.5.

Let $k$ be an integer in the interval $[1,t-1)$ , with $t+1\leq n$ . If $x\in\mathbb{R}^{n}_{k}$ , then $\|\nabla f_{t}(x)\|_{*}\geq\frac{1}{\sqrt{k+1}}$ .

Proof.

In view of (6.99) we have

[TABLE]

where

[TABLE]

and

[TABLE]

By (6.104) and (6.103), we have

[TABLE]

Since $x\in\mathbb{R}^{n}_{k}$ , it follows that $x^{(i)}=0$ for $i>k$ . Therefore,

[TABLE]

which means that $\nabla\eta_{p+\nu}(A_{t}x)\in\mathbb{R}^{n}_{k}$ . Then, from (6.102), we obtain

[TABLE]

By (6.104), we have

[TABLE]

Consequently,

[TABLE]

and

[TABLE]

From (6.111), it can be checked that

[TABLE]

with

[TABLE]

Now, combining (6.110) and (6.111)–(6.112), we get

[TABLE]

Then, it follows from (6.109) and (6.114) that

[TABLE]

Finally, by (6.108) and (6.123) we have

[TABLE]

and the proof is complete. ∎

The next theorem establishes a lower bound for the rate of convergence of $p$ -order tensor methods with respect to the initial functional residual $(f(x_{0})-f^{*})$ .

Theorem 6.6.

Let $\mathcal{M}$ be a $p$ -order tensor method satisfying Assumption 1. Assume that for any function $f$ with $H_{f,p}(\nu)<+\infty$ this method ensures the rate of convergence:

[TABLE]

where $\left\{x_{k}\right\}_{k\geq 0}$ is the sequence generated by method $\mathcal{M}$ and $f^{*}$ is the optimal value of $f$ . Then, for all $t\geq 2$ such that $t+1\leq n$ we have

[TABLE]

Proof.

Suppose that method $\mathcal{M}$ is applied to minimize function $f_{t}(\,.\,)$ with initial point $x_{0}=0$ . By Lemma 6.4, we have $x_{k}\in\mathbb{R}^{n}_{k}$ for all $k$ , $1\leq k\leq t-1$ . Thus, from Lemma 6.5 it follows that

[TABLE]

Then, combining (6.124), (6.126), Lemma 6.1 and Lemma 6.2 we get

[TABLE]

where constant $D_{p,\nu}$ is given in (6.125). ∎

Remark 4.

Theorem 6.6 gives a lower bound of $\mathcal{O}\left(\left(\frac{1}{k}\right)^{\frac{3(p+\nu)-2}{2(p+\nu)}}\right)$ for the rate of convergence of tensor methods with respect to the initial functional residual. For first-order methods in the Lipschitz case (i.e., $p=\nu=1$), we have $\mathcal{O}\left(\frac{1}{k}\right)$. This gives a lower complexity bound of $\mathcal{O}(\epsilon^{-1})$ iterations for finding $\epsilon$-stationary points of convex functions using first-order methods, which coincides with the lower bound (8a) in [6]. Moreover, in view of Corollary 5.10, Algorithm 6 is suboptimal in terms of the initial residual, with a complexity gap that increases as $p$ grows.
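The passage from a rate to an iteration count used in this remark can be spelled out as follows. This is a schematic calculation in which $C>0$ stands for all factors that do not depend on $k$ (a placeholder, not a constant taken from the paper).

```latex
\[
\|\nabla f(x_{k})\|_{*}\geq C\left(\frac{1}{k}\right)^{r}
\ \text{ and }\
\|\nabla f(x_{k})\|_{*}\leq\epsilon
\quad\Longrightarrow\quad
k\geq\left(\frac{C}{\epsilon}\right)^{1/r},
\qquad
r=\frac{3(p+\nu)-2}{2(p+\nu)} .
\]
\[
p=\nu=1:\qquad
r=\frac{3\cdot 2-2}{2\cdot 2}=1
\quad\Longrightarrow\quad
k\geq\frac{C}{\epsilon}.
\]
```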

Now, we obtain a lower bound for the rate of convergence of $p$ -order tensor methods with respect to the distance $\|x_{0}-x^{*}\|$ .

Theorem 6.7.

Let $\mathcal{M}$ be a $p$ -order tensor method satisfying Assumption 1. Assume that for any function $f$ with $H_{f,p}(\nu)<+\infty$ this method ensures the rate of convergence:

[TABLE]

where $\left\{x_{k}\right\}_{k\geq 0}$ is the sequence generated by method $\mathcal{M}$ and $x^{*}$ is a global minimizer of $f$ . Then, for all $t\geq 2$ such that $t+1\leq n$ we have

[TABLE]

Proof.

Let us apply method $\mathcal{M}$ for minimizing function $f_{t}(\,.\,)$ starting from point $x_{0}=0$ . By Lemma 6.4, we have $x_{k}\in\mathbb{R}^{n}_{k}$ for all $k$ , $1\leq k\leq t-1$ . Thus, from Lemma 6.5 it follows that

[TABLE]

Then, combining (6.127), (6.129), Lemma 6.1 and Lemma 6.2 we get

[TABLE]

where constant $L_{p,\nu}$ is given in (6.128). ∎

Remark 5.

Theorem 6.7 establishes that the lower bound for the rate of convergence of tensor methods in terms of the norm of the gradient is also of $\mathcal{O}\left(\left(\frac{1}{k}\right)^{\frac{3(p+\nu)-2}{2}}\right)$. For first-order methods in the Lipschitz case (i.e., $p=\nu=1$) we have $\mathcal{O}\left(\frac{1}{k^{2}}\right)$. This gives a lower complexity bound of $\mathcal{O}(\epsilon^{-\frac{1}{2}})$ for finding $\epsilon$-stationary points of convex functions using first-order methods, which coincides with the lower bound (8b) in [6].

Remark 6.

The rate of $\mathcal{O}\left((\frac{1}{k})^{\frac{3(p+\nu)-2}{2}}\right)$ corresponds to a worst-case complexity bound of $\mathcal{O}\left(\epsilon^{-2/[3(p+\nu)-2]}\right)$ iterations necessary to ensure $\|\nabla f(x_{k})\|_{*}\leq\epsilon$ . Note that, for $\epsilon\in(0,1)$ , we have

[TABLE]

Thus, by increasing the power of the oracle (i.e., the order $p$ ), our non-universal schemes become nearly optimal. For example, if $\epsilon=10^{-6}$ and $p\geq 4$ , we have $\left(\frac{1}{\epsilon}\right)^{\frac{1}{p+\nu-1}}\leq 10\left(\frac{1}{\epsilon}\right)^{\frac{2}{3(p+\nu)-2}}.$
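The numerical example at the end of this remark is easy to reproduce. The following minimal sketch computes the ratio between the upper bound $\epsilon^{-1/(p+\nu-1)}$ and the lower bound $\epsilon^{-2/[3(p+\nu)-2]}$, ignoring all constants; the function name `gap_factor` is a hypothetical choice.

```python
def gap_factor(eps, p, nu):
    """Ratio eps^(-1/(p+nu-1)) / eps^(-2/(3(p+nu)-2)), constants ignored."""
    q = p + nu
    upper = (1.0 / eps) ** (1.0 / (q - 1.0))       # non-universal scheme
    lower = (1.0 / eps) ** (2.0 / (3.0 * q - 2.0))  # lower complexity bound
    return upper / lower


if __name__ == "__main__":
    for p in (2, 3, 4, 5):
        for nu in (0.0, 0.5, 1.0):
            print(p, nu, gap_factor(1e-6, p, nu))
```

The ratio equals $\epsilon^{-(p+\nu)/[(p+\nu-1)(3(p+\nu)-2)]}$, which decreases as $p+\nu$ grows; this is the sense in which the gap closes for higher-order oracles.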

7 Conclusion

In this paper, we presented $p$-order methods that can find $\epsilon$-approximate stationary points of convex functions that are $p$-times differentiable with $\nu$-Hölder continuous $p$th derivatives. For the universal and the non-universal schemes without acceleration, we established iteration complexity bounds of $\mathcal{O}\left(\epsilon^{-1/(p+\nu-1)}\right)$ for finding $\bar{x}$ such that $\|\nabla f(\bar{x})\|_{*}\leq\epsilon$. For the case in which $\nu$ is known, we obtained improved complexity bounds of $\mathcal{O}\left(\epsilon^{-(p+\nu)/[(p+\nu-1)(p+\nu+1)]}\right)$ and $\mathcal{O}\left(|\log(\epsilon)|\epsilon^{-1/(p+\nu)}\right)$ for the corresponding accelerated schemes. For the case in which $\nu$ is unknown, we obtained a bound of $\mathcal{O}\left(\epsilon^{-(p+1)/[(p+\nu-1)(p+2)]}\right)$ for a universal accelerated scheme. Similar bounds were also obtained for tensor schemes adapted to the minimization of composite convex functions. A lower complexity bound of $\mathcal{O}(\epsilon^{-2/[3(p+\nu)-2]})$ was obtained for the referred problem class. Therefore, in practice, our non-universal schemes become nearly optimal as we increase the order $p$.

As an additional result, we showed that Algorithm 6 takes at most $\mathcal{O}\left(\log(\epsilon^{-1})\right)$ iterations to find $\epsilon$-stationary points of uniformly convex functions of degree $p+\nu$ in the form (5.64). Notice that strongly convex functions are uniformly convex of degree 2. Thus, our result generalizes the known bound of $\mathcal{O}\left(\log(\epsilon^{-1})\right)$ obtained for first-order schemes ($p=1$) applied to strongly convex functions with Lipschitz continuous gradients ($\nu=1$). At this point, it is not clear to us how $p$-order methods (with $p\geq 2$) behave when the objective function is strongly convex with $\nu$-Hölder continuous $p$th derivatives. Nevertheless, from the remarks made in [13, p. 6] for $p=2$, it appears that in our case the class of uniformly convex functions of degree $p+\nu$ is the most suitable for $p$-order methods from a physical point of view.

Acknowledgments

The authors are very grateful to an anonymous referee, whose comments helped to improve the first version of this paper.

Funding

G.N. Grapiglia was supported by the National Council for Scientific and Technological Development - Brazil (grant 406269/2016-5) and by the European Research Council Advanced Grant 788368. Yu. Nesterov was supported by the European Research Council Advanced Grant 788368.

Appendix A Accelerated Scheme for Composite Minimization

To solve problem (5.41), we can apply the following modification of Algorithm 3 in [19]:

Algorithm A. Accelerated Tensor Method for Composite Minimization

Step 0. Choose $x_{0}\in{\rm dom\,}\varphi$ , $\theta\geq 0$ and define $\psi_{0}(x)=\frac{1}{p+\nu}\|x-x_{0}\|^{p+\nu}$ . Set $M\geq(p+\nu-1)\left(H_{f,p}(\nu)+\theta(p-1)!\right)$ , $v_{0}=x_{0}$ , $A_{0}=0$ and $t:=0$ .

Step 1. Compute $a_{t}>0$ by solving the equation

$a_{t}^{p+\nu}=\dfrac{1}{2^{(3p-1)}}\left[\dfrac{(p-1)!}{M}\right](A_{t}+a_{t})^{p+\nu-1}.$

(A.130)

Step 2. Compute $y_{t}=(1-\gamma_{t})x_{t}+\gamma_{t}v_{t}$ with $\gamma_{t}=a_{t}/[A_{t}+a_{t}]$ .

Step 3. Compute an approximate solution $x_{t+1}$ to $\min_{x\in\mathbb{E}}\tilde{\Omega}_{y_{t},p,M}(x)$ such that

$\tilde{\Omega}_{y_{t},p,M}(x_{t+1})\leq\tilde{f}(y_{t})\quad\text{and}\quad\|\nabla\Omega_{y_{t},p,M}(x_{t+1})+g_{\varphi}(x_{t+1})\|_{*}\leq\theta\|x_{t+1}-y_{t}\|^{p+\nu-1},$

(A.131)

for some $g_{\varphi}(x_{t+1})\in\partial\varphi(x_{t+1})$ .

Step 4. Define $\psi_{t+1}(x)=\psi_{t}(x)+a_{t}\left[f(x_{t+1})+\langle\nabla f(x_{t+1}),x-x_{t+1}\rangle+\varphi(x)\right]$ and compute $v_{t+1}=\arg\min_{x\in\mathbb{E}}\,\psi_{t+1}(x)$.

Step 5. Set $t:=t+1$ and go to Step 1.
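The growth of the scaling coefficients $A_{t+1}=A_{t}+a_{t}$ generated by (A.130) governs the convergence rate of Algorithm A. The following minimal sketch generates the sequence numerically. With $q=p+\nu$ and $c=\frac{(p-1)!}{2^{3p-1}M}$, convexity of $s\mapsto s^{q}$ gives $A_{t+1}^{1/q}-A_{t}^{1/q}\geq c^{1/q}/q$, hence $A_{t}\geq c\,(t/q)^{q}$. This elementary estimate is added here for illustration and is not claimed to coincide with the paper's own inequality (A.143); the function names and the bisection solver are likewise assumptions.

```python
import math


def coefficient(A, q, c, iters=200):
    """Bisection for the positive root a of a^q = c * (A + a)^(q-1)."""
    lo, hi = 0.0, max(c, 1.0)
    while hi ** q < c * (A + hi) ** (q - 1.0):
        hi *= 2.0  # expand until the root is bracketed
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid ** q < c * (A + mid) ** (q - 1.0):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def scaling_sequence(p, nu, M, T):
    """Return ([A_0, ..., A_T], q, c) with A_0 = 0 and A_{t+1} = A_t + a_t."""
    q = p + nu
    c = math.factorial(p - 1) / (2.0 ** (3 * p - 1) * M)
    A = [0.0]
    for _ in range(T):
        A.append(A[-1] + coefficient(A[-1], q, c))
    return A, q, c


if __name__ == "__main__":
    A, q, c = scaling_sequence(3, 1.0, 1.0, 10)
    for t in range(1, len(A)):
        print(t, A[t], c * (t / q) ** q)
```

The printed pairs compare $A_{t}$ with the elementary lower estimate $c\,(t/q)^{q}$, illustrating the polynomial growth of order $t^{p+\nu}$.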

In order to establish a convergence rate for Algorithm A, we will need the following result.

Lemma A.1.

Suppose that H1 holds and let $x^{+}$ be an approximate solution to $\min_{y\in\mathbb{E}}\tilde{\Omega}_{x,p,H}^{(\nu)}(y)$ such that

[TABLE]

for some $g_{\varphi}(x^{+})\in\partial\varphi(x^{+})$ . If $H\geq(p+\nu-1)\left(H_{f,p}(\nu)+\theta(p-1)!\right)$ , then

[TABLE]

Proof.

Denote $r=\|x^{+}-x\|$ . Then,

[TABLE]

which gives

[TABLE]

From (A.134), the rest of the proof follows exactly as in the proof of Lemma A.6 in [19]. ∎

Theorem A.2.

Suppose that H1 holds and let the sequence $\left\{x_{t}\right\}_{t=0}^{T}$ be generated by Algorithm A. Then, for $t=2,\ldots,T$,

[TABLE]

Proof.

For all $t\geq 0$ , we have

[TABLE]

Indeed, (A.136) is true for $t=0$ because $A_{0}=0$ and $\psi_{0}(x)=\frac{1}{p+\nu}\|x-x_{0}\|^{p+\nu}$ . Suppose that (A.136) is true for some $t\geq 0$ . Then,

[TABLE]

Thus, (A.136) follows by induction. Now, let us prove that

[TABLE]

Again, using $A_{0}=0$ , we see that (A.137) is true for $t=0$ . Assume that (A.137) is true for some $t\geq 0$ . Note that $\psi_{t}(\,.\,)$ is uniformly convex of degree $p+\nu$ with parameter $2^{-(p+\nu-2)}$ . Thus, by the induction assumption

[TABLE]

Consequently,

[TABLE]

Since $f$ is convex and differentiable and $g_{\varphi}(x_{t+1})\in\partial\varphi(x_{t+1})$ , we have

[TABLE]

and

[TABLE]

Using (A.139) and (A.140) in (A.138), it follows that

[TABLE]

Note that $A_{t}x_{t}=A_{t+1}y_{t}-a_{t}v_{t}$ and $A_{t+1}x_{t+1}=A_{t}x_{t+1}+a_{t}x_{t+1}$ . Thus, combining (A.141) and Lemma A.1, we obtain

[TABLE]

where the last inequality follows from (A.130) exactly as in the proof of Theorem 4.2 in [17]. Thus, (A.137) also holds for $t+1$ , which completes the induction argument.

Now, combining (A.136) and (A.137) we have

[TABLE]

Once again, as in the proof of Theorem 4.2 in [19], it follows from (A.130) that

[TABLE]

Finally, (A.135) follows directly from (A.142) and (A.143). ∎

Bibliography

[1] Baes, M.: Estimate Sequence Methods: Extensions and Approximations. Optimization Online (2009)
[2] Birgin, E.G., Gardenghi, J.L., Martínez, J.M., Santos, S.A., Toint, Ph.L.: Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models. Mathematical Programming 163, 359–368 (2017)
[3] Bolte, J., Sabach, S., Teboulle, M., Vaisbourd, Y.: First Order Methods Beyond Convexity and Lipschitz Gradient Continuity with Applications to Quadratic Inverse Problems. SIAM Journal on Optimization 28, 2131–2151 (2018)
[4] Bouaricha, A.: Tensor methods for large, sparse unconstrained optimization. SIAM Journal on Optimization 7, 732–756 (1997)
[5] Bubeck, S., Jiang, Q., Lee, Y.T., Li, Y., Sidford, A.: Near-optimal method for highly smooth convex optimization. arXiv:1812.08026v2 [math.OC] (2019)
[6] Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Lower Bounds for Finding Stationary Points II: First-Order Methods. arXiv:1711.00841 [math.OC] (2017)
[7] Cartis, C., Gould, N.I.M., Toint, Ph.L.: Adaptive cubic regularization methods for unconstrained optimization. Part II: worst-case function- and derivative-evaluation complexity. Mathematical Programming 130, 295–319 (2011)
[8] Cartis, C., Gould, N.I.M., Toint, Ph.L.: Second-order optimality and beyond: Characterization and evaluation complexity in convexly constrained nonlinear optimization. Foundations of Computational Mathematics 18, 1073–1107 (2018)
