Tensor Methods for Minimizing Convex Functions with Hölder Continuous Higher-Order Derivatives

Geovani Nunes Grapiglia; Yurii Nesterov

arXiv:1904.12559·math.OC·June 7, 2021


TL;DR

This paper develops tensor-based optimization methods for convex functions with higher-order derivatives that are Hölder continuous, providing complexity bounds for both accelerated and universal schemes, advancing the theoretical understanding of such optimization problems.

Contribution

It introduces new tensor schemes with and without acceleration for convex minimization, establishing their iteration complexity bounds and a universal scheme for unknown Hölder parameters.

Findings

01

Accelerated tensor schemes achieve improved complexity bounds.

02

Universal scheme works without knowing Hölder continuity parameter.

03

A lower complexity bound shows that the accelerated nonuniversal scheme is nearly optimal.

Abstract

In this paper we study $p$ -order methods for unconstrained minimization of convex functions that are $p$ -times differentiable ( $p \geq 2$ ) with $ν$ -H\"{o}lder continuous $p$ th derivatives. We propose tensor schemes with and without acceleration. For the schemes without acceleration, we establish iteration complexity bounds of $O (ϵ^{- 1/ (p + ν - 1)})$ for reducing the functional residual below a given $ϵ \in (0, 1)$ . Assuming that $ν$ is known, we obtain an improved complexity bound of $O (ϵ^{- 1/ (p + ν)})$ for the corresponding accelerated scheme. For the case in which $ν$ is unknown, we present a universal accelerated tensor scheme with iteration complexity of $O (ϵ^{- p / [(p + 1) (p + ν - 1)]})$ . A lower complexity bound of $O (ϵ^{- 2/ [3 (p + ν) - 2]})$ is also obtained for this…

Equations

\|x\|=\langle Bx,x\rangle^{1/2},\;x\in\mathbb{E},\quad\|s\|_{*}=\langle s,B^{-1}s\rangle^{1/2},\;s\in\mathbb{E}^{*}.

D^{p}f(x)[h_{1},\ldots,h_{p}]

Df(x)[h_{1}]=\langle\nabla f(x),h_{1}\rangle\quad\text{and}\quad D^{2}f(x)[h_{1},h_{2}]=\langle\nabla^{2}f(x)h_{1},h_{2}\rangle.

f(x+h)=\Phi_{x,p}(x+h)+o(\|h\|^{p}),\quad x+h\in\text{dom}\,f,

\Phi_{x,p}(y)\equiv f(x)+\sum_{i=1}^{p}\frac{1}{i!}D^{i}f(x)[y-x]^{i},\quad y\in\mathbb{E}.

\|D^{p}f(x)\|=\max_{h_{1},\ldots,h_{p}}\left\{|D^{p}f(x)[h_{1},\ldots,h_{p}]|\,:\,\|h_{i}\|\leq 1,\;i=1,\ldots,p\right\}.

\|D^{p}f(x)\|=\max_{h}\left\{|D^{p}f(x)[h]^{p}|\,:\,\|h\|\leq 1\right\}.

\|D^{p}f(x)-D^{p}f(y)\|=\max_{h}\left\{|D^{p}f(x)[h]^{p}-D^{p}f(y)[h]^{p}|\,:\,\|h\|\leq 1\right\}.

\min_{x\in\mathbb{E}}f(x),

H_{f,p}(\nu)\equiv\sup_{x,y\in\mathbb{E}}\left\{\frac{\|D^{p}f(x)-D^{p}f(y)\|}{\|x-y\|^{\nu}}\,:\,x\neq y\right\},\quad 0\leq\nu\leq 1.

|f(y)-\Phi_{x,p}(y)|\leq\frac{H_{f,p}(\nu)}{p!}\|y-x\|^{p+\nu},

\|\nabla f(y)-\nabla\Phi_{x,p}(y)\|_{*}\leq\frac{H_{f,p}(\nu)}{(p-1)!}\|y-x\|^{p+\nu-1},

f(y)\leq\Phi_{x,p}(y)+\frac{H}{p!}\|y-x\|^{p+\nu},\quad y\in\mathbb{E}.

\Omega_{x,p,H}^{(\alpha)}(y)=\Phi_{x,p}(y)+\frac{H}{p!}\|y-x\|^{p+\alpha},\quad\alpha\in[0,1].

f(y)\leq\Omega_{x,p,H}^{(\nu)}(y),\quad y\in\mathbb{E}.

\alpha=\left\{\begin{array}{ll}\nu,&\text{if}\,\,\nu\,\,\text{is known},\\ 1,&\text{if}\,\,\nu\,\,\text{is unknown}.\end{array}\right.

\min_{y\in\mathbb{E}}\Omega_{x_{t},p,M_{t}}^{(\alpha)}(y),

\Omega_{x_{t},p,M_{t}}^{(\alpha)}(x_{t}^{+})\leq f(x_{t})\quad\text{and}\quad\|\nabla\Omega_{x_{t},p,M_{t}}^{(\alpha)}(x_{t}^{+})\|_{*}\leq\theta\|x_{t}^{+}-x_{t}\|^{p+\alpha-1},

f(x_{t})-f(x_{t}^{+})\geq\frac{1}{8(p+1)!M_{t}^{\frac{1}{p+\alpha-1}}}\|\nabla f(x_{t}^{+})\|_{*}^{\frac{p+\alpha}{p+\alpha-1}},

\|\nabla f(y_{k})\|_{*}>\epsilon\quad\text{and}\quad\|\nabla\Omega_{x_{t},3,M_{t}}^{(\nu)}(y_{k})\|_{*}>\theta\|y_{k}-x_{0}\|^{2+\nu},\quad\text{for}\;k=0,\ldots,T,

T\leq\left\{\begin{array}{ll}\mathcal{O}\left(\epsilon^{-(3+\nu)}\right),&\text{if}\,\,M_{t}<[6/(3+\nu)]H_{f,3}(\nu)\,\,\text{and}\,\,\nu\neq 0,\\ \mathcal{O}\left(\epsilon^{-\frac{(3+\nu)}{2}}\right),&\text{if}\,\,M_{t}=[6/(3+\nu)]H_{f,3}(\nu)\,\,\text{and}\,\,\nu\neq 0,\\ \mathcal{O}\left(\log(\epsilon^{-1})\right),&\text{if}\,\,M_{t}>[6/(3+\nu)]H_{f,3}(\nu).\end{array}\right.

f(x_{m})-f(x^{*})\leq 2[8(p+1)!]^{p+\alpha-1}M_{\nu}D_{0}^{p+\alpha},

m\leq\frac{1}{\ln\left(\frac{p+\alpha}{p+\alpha-1}\right)}\ln\max\left\{1,\log_{2}\frac{f(x_{0})-f(x^{*})}{[8(p+1)!]^{p+\alpha-1}M_{\nu}D_{0}^{p+\alpha}}\right\}

f(x_{k})-f(x^{*})\leq\frac{[24p(p+1)!]^{p+\alpha-1}M_{\nu}D_{0}^{p+\alpha}}{(k-m)^{p+\alpha-1}}.

M_{k}\leq M_{\nu},\quad k=0,\ldots,T-1.

f(x_{k})-f(x_{k+1})\geq

\delta_{k}=\frac{f(x_{k})-f(x^{*})}{[8(p+1)!]^{p+\alpha-1}M_{\nu}D_{0}^{p+\alpha}}

\ln 2\leq\ln\delta_{m-1}\leq\left(\frac{p+\alpha-1}{p+\alpha}\right)^{m-1}\ln\delta_{0}\;\Longrightarrow\;\left(\frac{p+\alpha}{p+\alpha-1}\right)^{m-1}\ln 2\leq\ln\delta_{0}

\Longrightarrow\;\left(\frac{p+\alpha}{p+\alpha-1}\right)^{m-1}\leq\frac{\ln\delta_{0}}{\ln 2}=\log_{2}\delta_{0}.

\delta_{k}\leq\left[\frac{1+\delta_{m}^{u-1}}{(u-1)(k-m)}\right]^{\frac{1}{u-1}}


Full text

SIAM Journal on Optimization, Vol. 30, No. 4, pp. 2750-2779

Tensor Methods for Minimizing Convex Functions with Hölder Continuous Higher-Order Derivatives

G.N. Grapiglia Departamento de Matemática, Universidade Federal do Paraná, Centro Politécnico, Cx. postal 19.081, 81531-980, Curitiba, Paraná, Brazil ([email protected]). This author was supported by the National Council for Scientific and Technological Development - Brazil (grants 401288/2014-5 and 406269/2016-5) and by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 788368).

Yu. Nesterov Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 34 voie du Roman Pays, 1348 Louvain-la-Neuve, Belgium ([email protected]). This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 788368).

(November 28, 2019)

Abstract

In this paper we study $p$ -order methods for unconstrained minimization of convex functions that are $p$ -times differentiable ( $p\geq 2$ ) with $\nu$ -Hölder continuous $p$ th derivatives. We propose tensor schemes with and without acceleration. For the schemes without acceleration, we establish iteration complexity bounds of $\mathcal{O}\left(\epsilon^{-1/(p+\nu-1)}\right)$ for reducing the functional residual below a given $\epsilon\in(0,1)$ . Assuming that $\nu$ is known, we obtain an improved complexity bound of $\mathcal{O}\left(\epsilon^{-1/(p+\nu)}\right)$ for the corresponding accelerated scheme. For the case in which $\nu$ is unknown, we present a universal accelerated tensor scheme with iteration complexity of $\mathcal{O}\left(\epsilon^{-p/[(p+1)(p+\nu-1)]}\right)$ . A lower complexity bound of $\mathcal{O}\left(\epsilon^{-2/[3(p+\nu)-2]}\right)$ is also obtained for this problem class.

keywords:

unconstrained minimization, high-order methods, tensor methods, Hölder condition, worst-case global complexity bounds

AMS:

49M15, 49M37, 58C15, 90C25, 90C30

1 Introduction

1.1 Motivation

In [13], it was shown that a suitable cubic regularization of the Newton method (CNM) takes at most $\mathcal{O}(\epsilon^{-1/2})$ iterations to reduce the functional residual below a given precision $\epsilon>0$ when the objective is a twice-differentiable convex function with a Lipschitz continuous Hessian. A better complexity bound of $\mathcal{O}(\epsilon^{-1/3})$ was shown in [14] for an accelerated version of CNM. Auxiliary problems in these methods consist in the minimization of a third-order regularization of the second-order Taylor approximation of the objective function around the current iterate. A natural generalization is to consider auxiliary problems in which one minimizes a $(p+1)$ -order regularization of the $p$ th-order Taylor approximation of the objective function, resulting in tensor methods. Unconstrained optimization by tensor methods is not a new subject (see, for example, [17, 5]). In the context of convex optimization, accelerated tensor methods (as described above) were first considered by Baes [2]. However, the author did not realize that under a proper choice of the regularization coefficient the auxiliary problems become convex. This important observation was made in a recent paper [15], where tensor methods with and without acceleration were proposed for unconstrained minimization of $p$ -times differentiable convex functions with Lipschitz continuous $p$ th derivatives. An iteration complexity bound of $\mathcal{O}(\epsilon^{-1/p})$ was proved for the method without acceleration, while an improved bound of $\mathcal{O}(\epsilon^{-1/(p+1)})$ was proved for the accelerated tensor method.

In the present paper, we study tensor methods (with and without acceleration) that can handle convex functions with $\nu$ -Hölder continuous $p$ th derivatives ( $p\geq 2$ ) and allow the inexact solution of auxiliary problems (in the sense of [4]). Specifically, our contribution is threefold:

1. For the schemes without acceleration, we establish iteration complexity bounds of $\mathcal{O}\left(\epsilon^{-1/(p+\nu-1)}\right)$ for reducing the functional residual below a given $\epsilon\in(0,1)$ .

2. Assuming that $\nu$ is known, we obtain an improved complexity bound of $\mathcal{O}\left(\epsilon^{-1/(p+\nu)}\right)$ for the corresponding accelerated scheme. For the case in which $\nu$ is unknown, we present a universal accelerated tensor scheme with iteration complexity of $\mathcal{O}\left(\epsilon^{-p/[(p+1)(p+\nu-1)]}\right)$ .

3. A lower complexity bound of $\mathcal{O}\left(\epsilon^{-2/[3(p+\nu)-2]}\right)$ is also obtained, from which we conclude that our accelerated nonuniversal scheme is nearly optimal.
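To compare the three contributions numerically, the following minimal sketch evaluates the exponents of $\epsilon^{-1}$ in the bounds stated above for a given pair $(p,\nu)$. The function name `complexity_exponents` and the dictionary keys are illustrative choices, not notation from the paper.

```python
from fractions import Fraction

def complexity_exponents(p, nu):
    """Return the exponents e such that each stated bound reads O(eps**(-e))."""
    p = Fraction(p)
    nu = Fraction(nu)
    return {
        "non_accelerated": 1 / (p + nu - 1),
        "accelerated": 1 / (p + nu),
        "universal_accelerated": p / ((p + 1) * (p + nu - 1)),
        "lower_bound": 2 / (3 * (p + nu) - 2),
    }

# Example: third-order method (p = 3) with Hoelder exponent nu = 1/2.
exps = complexity_exponents(3, Fraction(1, 2))
for name, e in exps.items():
    print(f"{name}: eps^(-{e})")
```

A smaller exponent means fewer iterations for the same accuracy. For $\nu=1$ the universal accelerated exponent reduces to $1/(p+1)$, the same as for the accelerated scheme with known $\nu$.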

The methods and results presented here extend in a significant way the contributions in [2, 8, 9, 15]. Indeed, [8, 9] deal only with second-order schemes ( $p=2$ ) which require the exact solution of the auxiliary problems. On the other hand, the $p$ -order methods proposed in [2, 15] are restricted to the Lipschitz case ( $\nu=1$ ), assuming that the Lipschitz constant is known and that the auxiliary problems are solved exactly. We believe that the development of $p$ -order methods with affordable trial steps and automatic adjustment to the objective’s function class (universality) constitutes a fundamental step towards implementable high-order methods for convex optimization.

1.2 Contents

The paper is organized as follows. In Section 2, we define our problem. In Section 3, we present tensor methods without acceleration and establish their convergence properties. In Section 4, we present complexity results for accelerated schemes. Finally, in Section 5 we obtain lower complexity bounds for tensor methods under the Hölder condition. All necessary auxiliary results are included in Appendix A.

1.3 Notations and Generalities

In what follows, we denote by $\mathbb{E}$ a finite-dimensional real vector space, and by $\mathbb{E}^{*}$ its dual space, composed of linear functionals on $\mathbb{E}$ . The value of function $s\in\mathbb{E}^{*}$ at point $x\in\mathbb{E}$ is denoted by $\langle s,x\rangle$ . Given a self-adjoint positive definite operator $B:\mathbb{E}\to\mathbb{E}^{*}$ (notation $B\succ 0$ ), we can endow these spaces with conjugate Euclidean norms:

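$\|x\|=\langle Bx,x\rangle^{1/2},\;x\in\mathbb{E},\quad\|s\|_{*}=\langle s,B^{-1}s\rangle^{1/2},\;s\in\mathbb{E}^{*}.$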
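As a small illustration of the conjugate norms $\|x\|=\langle Bx,x\rangle^{1/2}$ and $\|s\|_{*}=\langle s,B^{-1}s\rangle^{1/2}$, the sketch below assumes $\mathbb{E}=\mathbb{R}^{3}$ and a diagonal operator $B$ (an assumption made only to keep the example free of linear-algebra libraries). It checks the inequality $|\langle s,x\rangle|\leq\|s\|_{*}\|x\|$, which becomes an equality for $s=Bx$.

```python
import math

def primal_norm(b_diag, x):
    # ||x|| = <Bx, x>^{1/2} for B = diag(b_diag)
    return math.sqrt(sum(b * xi * xi for b, xi in zip(b_diag, x)))

def dual_norm(b_diag, s):
    # ||s||_* = <s, B^{-1} s>^{1/2} for B = diag(b_diag)
    return math.sqrt(sum(si * si / b for b, si in zip(b_diag, s)))

def pairing(s, x):
    # <s, x>: value of the linear functional s at the point x
    return sum(si * xi for si, xi in zip(s, x))

b_diag = [2.0, 0.5, 1.0]          # hypothetical positive definite diagonal B
x = [1.0, -2.0, 3.0]
s = [0.5, 1.0, -1.0]
s_aligned = [b * xi for b, xi in zip(b_diag, x)]   # s = Bx

print(abs(pairing(s, x)), dual_norm(b_diag, s) * primal_norm(b_diag, x))
print(pairing(s_aligned, x), dual_norm(b_diag, s_aligned) * primal_norm(b_diag, x))
```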

For a smooth function $f:\text{dom}\,f\to\mathbb{R}$ with convex and open domain $\text{dom}\,f\subset\mathbb{E}$ , denote by $\nabla f(x)$ its gradient and by $\nabla^{2}f(x)$ its Hessian evaluated at point $x\in\text{dom}\,f$ . Note that $\nabla f(x)\in\mathbb{E}^{*}$ and $\nabla^{2}f(x)h\in\mathbb{E}^{*}$ for $x\in\text{dom}\,f$ and $h\in\mathbb{E}$ .

For any integer $p\geq 1$ , denote by

$D^{p}f(x)[h_{1},\ldots,h_{p}]$

the directional derivative of function $f$ at $x$ along directions $h_{i}\in\mathbb{E}$ , $i=1,\ldots,p$ . In particular, for any $x\in\text{dom}\,f$ and $h_{1},h_{2}\in\mathbb{E}$ we have

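$Df(x)[h_{1}]=\langle\nabla f(x),h_{1}\rangle\quad\text{and}\quad D^{2}f(x)[h_{1},h_{2}]=\langle\nabla^{2}f(x)h_{1},h_{2}\rangle.$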

For $h_{1}=\ldots=h_{p}=h\in\mathbb{E}$ , we use notation $D^{p}f(x)[h]^{p}$ . Then the $p$ th-order Taylor approximation of function $f$ at $x\in\text{dom}\,f$ can be written as follows:

$f(x+h)=\Phi_{x,p}(x+h)+o(\|h\|^{p}),\quad x+h\in\text{dom}\,f,$

where

$\Phi_{x,p}(y)\equiv f(x)+\sum_{i=1}^{p}\dfrac{1}{i!}D^{i}f(x)[y-x]^{i},\quad y\in\mathbb{E}.$

Note that $D^{p}f(x)[\,.\,]$ is a symmetric $p$ -linear form. Its norm is defined by

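$\|D^{p}f(x)\|=\max_{h_{1},\ldots,h_{p}}\left\{|D^{p}f(x)[h_{1},\ldots,h_{p}]|\,:\,\|h_{i}\|\leq 1,\;i=1,\ldots,p\right\}.$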

In fact, it can be shown that (see, e.g., [3])

$\|D^{p}f(x)\|=\max_{h}\left\{|D^{p}f(x)[h]^{p}|\,:\,\|h\|\leq 1\right\}.$

Similarly, since $D^{p}f(x)[.\,,\ldots,\,.]-D^{p}f(y)[.,\ldots,.]$ is also a symmetric $p$ -linear form for fixed $x,y\in\text{dom}\,f$ , we can define

$\|D^{p}f(x)-D^{p}f(y)\|=\max_{h}\left\{|D^{p}f(x)[h]^{p}-D^{p}f(y)[h]^{p}|\,:\,\|h\|\leq 1\right\}.$

2 Problem Statement

In this paper we consider methods for solving the following minimization problem

$\min_{x\in\mathbb{E}}f(x),\qquad(2.3)$

where $f:\mathbb{E}\to\mathbb{R}$ is a convex $p$ -times differentiable function ( $p\geq 2$ ). We assume that there exists at least one optimal solution $x^{*}\in\mathbb{E}$ for problem (2.3). Let us characterize the level of smoothness of the objective $f$ by the system of Hölder constants

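$H_{f,p}(\nu)\equiv\sup_{x,y\in\mathbb{E}}\left\{\dfrac{\|D^{p}f(x)-D^{p}f(y)\|}{\|x-y\|^{\nu}}\,:\,x\neq y\right\},\quad 0\leq\nu\leq 1.\qquad(2.4)$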
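To make the Hölder constant concrete, here is a rough numerical sketch for a hypothetical one-dimensional example with $p=2$: for $f(x)=|x|^{2+\nu}$ one has $f''(x)=(2+\nu)(1+\nu)|x|^{\nu}$, and since $\big||x|^{\nu}-|y|^{\nu}\big|\leq|x-y|^{\nu}$ the supremum of the ratio is $(2+\nu)(1+\nu)$, attained at $y=0$. The code estimates the supremum over a finite grid, so in general it only gives a lower estimate; the function names are illustrative.

```python
def second_derivative(x, nu):
    # f(x) = |x|**(2 + nu)  =>  f''(x) = (2 + nu) * (1 + nu) * |x|**nu
    return (2 + nu) * (1 + nu) * abs(x) ** nu

def estimate_holder_constant(d2f, nu, points):
    """Largest ratio |f''(x) - f''(y)| / |x - y|**nu over all pairs of grid points."""
    best = 0.0
    for i, x in enumerate(points):
        for y in points[i + 1:]:
            ratio = abs(d2f(x) - d2f(y)) / abs(x - y) ** nu
            best = max(best, ratio)
    return best

nu = 0.5
grid = [k / 10.0 for k in range(-20, 21)]   # distinct points, includes 0
estimate = estimate_holder_constant(lambda t: second_derivative(t, nu), nu, grid)
print(estimate, (2 + nu) * (1 + nu))
```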

Then, from (2.4) and from the integral form of the mean-value theorem, it follows that

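$|f(y)-\Phi_{x,p}(y)|\leq\dfrac{H_{f,p}(\nu)}{p!}\|y-x\|^{p+\nu},\qquad(2.5)$

$\|\nabla f(y)-\nabla\Phi_{x,p}(y)\|_{*}\leq\dfrac{H_{f,p}(\nu)}{(p-1)!}\|y-x\|^{p+\nu-1},\qquad(2.6)$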

for all $x,y\in\mathbb{E}$ . Given $x\in\mathbb{E}$ , if $H_{f,p}(\nu)<+\infty$ and $H\geq H_{f,p}(\nu)$ , by (2.5) we have

$f(y)\leq\Phi_{x,p}(y)+\dfrac{H}{p!}\|y-x\|^{p+\nu},\quad y\in\mathbb{E}.\qquad(2.7)$

This property motivates the use of the following class of models of $f$ around $x\in\mathbb{E}$ :

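$\Omega_{x,p,H}^{(\alpha)}(y)=\Phi_{x,p}(y)+\dfrac{H}{p!}\|y-x\|^{p+\alpha},\quad\alpha\in[0,1].\qquad(2.8)$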

In particular, as long as $H\geq H_{f,p}(\nu)$ , by (2.7) we have

$f(y)\leq\Omega_{x,p,H}^{(\nu)}(y),\quad y\in\mathbb{E}.\qquad(2.9)$
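The upper-model property $f(y)\leq\Omega_{x,p,H}^{(\nu)}(y)$ can be checked on a toy example. The sketch below assumes the one-dimensional function $f(x)=x^{4}$ with $p=3$ and $\nu=1$: its third derivative $24x$ is Lipschitz with constant $24$, so $H=24$ is admissible. The helper names `taylor_model` and `regularized_model` are illustrative.

```python
import math

def taylor_model(derivs_at_x, x, y):
    """Phi_{x,p}(y) in one dimension; derivs_at_x = [f(x), f'(x), ..., f^(p)(x)]."""
    h = y - x
    return sum(d * h ** i / math.factorial(i) for i, d in enumerate(derivs_at_x))

def regularized_model(derivs_at_x, x, y, H, alpha):
    """Omega^{(alpha)}_{x,p,H}(y) = Phi_{x,p}(y) + (H / p!) * |y - x|**(p + alpha)."""
    p = len(derivs_at_x) - 1
    return taylor_model(derivs_at_x, x, y) + H / math.factorial(p) * abs(y - x) ** (p + alpha)

def f(x):
    return x ** 4

def derivs(x):
    # value and first three derivatives of x**4
    return [x ** 4, 4 * x ** 3, 12 * x ** 2, 24 * x]

x = 1.0
ys = [x + k / 4.0 for k in range(-12, 13)]
gaps = [regularized_model(derivs(x), x, y, 24.0, 1.0) - f(y) for y in ys]
print(min(gaps))   # the model lies above f on the whole grid
```

For this particular function the Taylor remainder is exactly $(y-x)^{4}$, so the gap equals $3(y-x)^{4}$ and vanishes only at $y=x$.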

3 Tensor schemes without acceleration

If we assume that $H_{f,p}(\nu)<+\infty$ for some $\nu\in[0,1]$ , there are two possible situations: either $\nu$ is known, or $\nu$ is unknown. We cover both cases in a single framework by introducing parameter

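$\alpha=\left\{\begin{array}{ll}\nu,&\text{if}\,\,\nu\,\,\text{is known},\\ 1,&\text{if}\,\,\nu\,\,\text{is unknown}.\end{array}\right.\qquad(3.10)$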

Let $\epsilon\in(0,1)$ be the target precision. At the beginning of the $t$ th iteration one has an estimate $x_{t}$ for the solution of (2.3) and a scaling coefficient $M_{t}$ . A trial point $x_{t}^{+}$ is computed as an approximate solution to the auxiliary problem

$\min_{y\in\mathbb{E}}\Omega_{x_{t},p,M_{t}}^{(\alpha)}(y),\qquad(3.11)$

with $\alpha$ given by (3.10). Similarly to [4], the trial point $x^{+}_{t}$ must satisfy the following conditions:

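$\Omega_{x_{t},p,M_{t}}^{(\alpha)}(x_{t}^{+})\leq f(x_{t})\quad\text{and}\quad\|\nabla\Omega_{x_{t},p,M_{t}}^{(\alpha)}(x_{t}^{+})\|_{*}\leq\theta\|x_{t}^{+}-x_{t}\|^{p+\alpha-1},\qquad(3.12)$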

where $\theta\geq 0$ is a user-defined parameter. When (3.11) is not convex, then $x_{t}^{+}$ is not necessarily an approximation of its global solution. If the descent condition

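$f(x_{t})-f(x_{t}^{+})\geq\dfrac{1}{8(p+1)!M_{t}^{\frac{1}{p+\alpha-1}}}\|\nabla f(x_{t}^{+})\|_{*}^{\frac{p+\alpha}{p+\alpha-1}},\qquad(3.13)$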

holds, then $x_{t}^{+}$ is accepted and we define $x_{t+1}=x_{t}^{+}$ . Otherwise, constant $M_{t}$ is increased until the corresponding trial point $x_{t}^{+}$ is accepted. We will see that this process is well defined in the sense that there exists $M_{\nu}>0$ such that $M_{t}\leq M_{\nu}$ for all $t$ . This general scheme can be summarized in the following way.

Algorithm 1. Tensor Method

Step 0. Choose $x_{0}\in\mathbb{E}$ and $\theta\geq 0$ . Set $\alpha$ by (3.10) and $t:=0$ .

Step 1. Find $0<M_{t}\leq M_{\nu}$ such that (3.13) holds for an approximate solution $x_{t}^{+}$ to (3.11) satisfying conditions (3.12).

Step 2. Set $x_{t+1}=x_{t}^{+}$ .

Step 3. Set $t:=t+1$ and go back to Step 1.

Remark 1.

Regarding the approximate solution of the auxiliary problems, it is easy to see that $x_{t}^{+}$ satisfying (3.12) can be computed by any monotone optimization scheme that drives the gradient of the objective to zero. In [10], we investigated the possible use of gradient methods with Bregman distance for the case $p=3$ and $\alpha=\nu$ . Specifically, under assumptions H1 and H2 below, we showed that if $\left\{y_{k}\right\}_{k=0}^{T}$ is a sequence generated by these methods applied to $\min_{y\in\mathbb{E}}\Omega_{x_{t},3,M_{t}}^{(\nu)}(y)$ with

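$\|\nabla f(y_{k})\|_{*}>\epsilon\quad\text{and}\quad\|\nabla\Omega_{x_{t},3,M_{t}}^{(\nu)}(y_{k})\|_{*}>\theta\|y_{k}-x_{0}\|^{2+\nu},\quad\text{for}\;k=0,\ldots,T,$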

then

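$T\leq\left\{\begin{array}{ll}\mathcal{O}\left(\epsilon^{-(3+\nu)}\right),&\text{if}\,\,M_{t}<[6/(3+\nu)]H_{f,3}(\nu)\,\,\text{and}\,\,\nu\neq 0,\\ \mathcal{O}\left(\epsilon^{-\frac{(3+\nu)}{2}}\right),&\text{if}\,\,M_{t}=[6/(3+\nu)]H_{f,3}(\nu)\,\,\text{and}\,\,\nu\neq 0,\\ \mathcal{O}\left(\log(\epsilon^{-1})\right),&\text{if}\,\,M_{t}>[6/(3+\nu)]H_{f,3}(\nu).\end{array}\right.$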

For more details, see Theorems 3.8 and 3.10 in [10].

To analyze the convergence of Algorithm 1, we introduce the following assumptions:

H1

$H_{f,p}(\nu)<+\infty$ for some $\nu\in[0,1]$ .

H2

The level sets of $f$ are bounded, that is, $\max_{x\in\mathcal{L}(x_{0})}\|x-x^{*}\|\leq D_{0}\in(1,+\infty)$ for $\mathcal{L}(x_{0})\equiv\left\{x\in\mathbb{E}\,:\,f(x)\leq f(x_{0})\right\}$ , with $x_{0}$ being the starting point.

The next theorem establishes global convergence rate for Algorithm 1.

Theorem 2.

Suppose that H1 and H2 are true and let $\left\{x_{t}\right\}_{t=0}^{T}$ be a sequence generated by Algorithm 1. Denote by $m$ the first iteration number such that

$f(x_{m})-f(x^{*})\leq 2[8(p+1)!]^{p+\alpha-1}M_{\nu}D_{0}^{p+\alpha},$

and assume that $m<T$ . Then

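$m\leq\dfrac{1}{\ln\left(\frac{p+\alpha}{p+\alpha-1}\right)}\ln\max\left\{1,\log_{2}\dfrac{f(x_{0})-f(x^{*})}{[8(p+1)!]^{p+\alpha-1}M_{\nu}D_{0}^{p+\alpha}}\right\}\qquad(3.14)$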

and, for all $k$ , $m<k\leq T$ , we have

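$f(x_{k})-f(x^{*})\leq\dfrac{[24p(p+1)!]^{p+\alpha-1}M_{\nu}D_{0}^{p+\alpha}}{(k-m)^{p+\alpha-1}}.\qquad(3.15)$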

Proof.

By Step 1 in Algorithm 1, we have

$M_{k}\leq M_{\nu},\quad k=0,\ldots,T-1.\qquad(3.16)$

Thus, in view of (3.13), (3.16), and H2, for $k=0,\ldots,T-1$ we have

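$f(x_{k})-f(x_{k+1})\geq\dfrac{1}{8(p+1)!M_{\nu}^{\frac{1}{p+\alpha-1}}}\|\nabla f(x_{k+1})\|_{*}^{\frac{p+\alpha}{p+\alpha-1}}\geq\dfrac{1}{8(p+1)!M_{\nu}^{\frac{1}{p+\alpha-1}}}\left[\dfrac{f(x_{k+1})-f(x^{*})}{D_{0}}\right]^{\frac{p+\alpha}{p+\alpha-1}},$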

where the last inequality is due to the convexity of $f$ . Now, denoting

$\delta_{k}=\dfrac{f(x_{k})-f(x^{*})}{[8(p+1)!]^{p+\alpha-1}M_{\nu}D_{0}^{p+\alpha}}$

we see from (3.18) that this sequence satisfies condition (1.1) of Lemma 1.1 in [8] with $u=\frac{p+\alpha}{p+\alpha-1}$ . Note that $m$ is the first iteration for which $\delta_{m}\leq 2$ . Using Lemma 1.1 in [8], this inequality allows us to obtain the upper bound (3.14) for $m$ and also simplifies our final bound for the functional residual. Indeed, if $m>0$ , then $\delta_{0}>2$ and, in view of inequality (1.2) of Lemma 1.1 in [8], we have

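$\ln 2\leq\ln\delta_{m-1}\leq\left(\dfrac{p+\alpha-1}{p+\alpha}\right)^{m-1}\ln\delta_{0}\;\Longrightarrow\;\left(\dfrac{p+\alpha}{p+\alpha-1}\right)^{m-1}\ln 2\leq\ln\delta_{0}$

$\Longrightarrow\;\left(\dfrac{p+\alpha}{p+\alpha-1}\right)^{m-1}\leq\dfrac{\ln\delta_{0}}{\ln 2}=\log_{2}\delta_{0}.$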

Thus, $m\leq\dfrac{\ln\,\delta_{0}}{\ln\left(\frac{p+\alpha}{p+\alpha-1}\right)}$ , and so (3.14) holds. Consequently, from inequality (1.3) of Lemma 1.1 in [8] we get the following rate of convergence:

[TABLE]

that is,

[TABLE]

Therefore,

[TABLE]

∎
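To get a feeling for the constants in Theorem 2, the sketch below evaluates the residual bound $[24p(p+1)!]^{p+\alpha-1}M_{\nu}D_{0}^{p+\alpha}/(k-m)^{p+\alpha-1}$ and inverts it to find how many iterations beyond $m$ make the bound fall below $\epsilon$. The numerical values of $M_{\nu}$, $D_{0}$ and $\epsilon$ are hypothetical, and the function names are illustrative.

```python
import math

def residual_bound(p, alpha, M_nu, D0, k_minus_m):
    """Right-hand side of the bound on f(x_k) - f(x*) for k > m."""
    q = p + alpha - 1
    return (24 * p * math.factorial(p + 1)) ** q * M_nu * D0 ** (p + alpha) / k_minus_m ** q

def iterations_for_accuracy(p, alpha, M_nu, D0, eps):
    """Smallest k - m for which the bound above is at most eps."""
    q = p + alpha - 1
    return math.ceil(24 * p * math.factorial(p + 1) * (M_nu * D0 ** (p + alpha) / eps) ** (1.0 / q))

# Hypothetical data: p = 2, alpha = 1, M_nu = 1, D0 = 2, eps = 1e-2.
n = iterations_for_accuracy(2, 1, 1.0, 2.0, 1e-2)
print(n, residual_bound(2, 1, 1.0, 2.0, n))
```

The count scales like $\epsilon^{-1/(p+\alpha-1)}$, which is the rate quoted for the schemes without acceleration. The constant $24p(p+1)!$ is a worst-case factor and can be very pessimistic in practice.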

If we assume that $\nu$ and $H_{f,p}(\nu)$ are known, by Lemma 17, we can set

[TABLE]

Here, by (3.10) the corresponding version of Algorithm 1 takes at most

$\mathcal{O}(\epsilon^{-1/(p+\nu-1)})$ iterations to generate $x_{k}$ such that $f(x_{k})-f(x^{*})\leq\epsilon$ for a given $\epsilon\in(0,1)$ . However, in most practical problems, $H_{f,p}(\nu)$ is not known. To deal with this situation, we can consider the following adaptive version of Algorithm 1:

Algorithm 2. Adaptive Tensor Method

Step 0. Choose $x_{0}\in\mathbb{E}$ , $H_{0}>0$ and $\theta\geq 0$ . Set $\alpha$ by (3.10) and $t:=0$ .

Step 1. Set $i:=0$ .

Step 1.1. Compute an approximate solution $x_{t,i}^{+}$ to $\min\limits_{y\in\mathbb{E}}\,\Omega^{(\alpha)}_{x_{t},p,2^{i}H_{t}}(y)$ , such that

$\Omega^{(\alpha)}_{x_{t},p,2^{i}H_{t}}(x_{t,i}^{+})\leq f(x_{t})\quad\text{and}\quad\|\nabla\Omega^{(\alpha)}_{x_{t},p,2^{i}H_{t}}(x_{t,i}^{+})\|_{*}\leq\theta\|x_{t,i}^{+}-x_{t}\|^{p+\alpha-1}.$

Step 1.2. If

$f(x_{t})-f(x_{t,i}^{+})\geq\dfrac{1}{8(p+1)!(2^{i}H_{t})^{\frac{1}{p+\alpha-1}}}\|\nabla f(x_{t,i}^{+})\|_{*}^{\frac{p+\alpha}{p+\alpha-1}}$

holds, set $i_{t}:=i$ and go to Step 2. Otherwise, set $i:=i+1$ and go to Step 1.1.

Step 2. Set $x_{t+1}=x_{t,i_{t}}^{+}$ and $H_{t+1}=2^{i_{t}-1}H_{t}$ .

Step 3. Set $t:=t+1$ and go to Step 1.
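The steps of Algorithm 2 can be sketched in one dimension for $p=2$ and $\alpha=1$, where the auxiliary problem is a cubic regularization of the Newton model and can be solved exactly, so that the conditions of Step 1.1 hold with $\theta=0$. The test function $f(x)=\sqrt{1+x^{2}}$, the stopping tolerance `tol`, and the caps `max_iter` and `max_doublings` are assumptions added for this illustration; they are not part of the algorithm as stated.

```python
import math

def f(x):   return math.sqrt(1.0 + x * x)
def df(x):  return x / math.sqrt(1.0 + x * x)
def d2f(x): return (1.0 + x * x) ** -1.5

def cubic_model_step(g, c, M):
    """Exact minimizer of g*h + c*h**2/2 + (M/2)*|h|**3 over h (needs c >= 0, M > 0)."""
    if g == 0.0:
        return 0.0
    # Positive root of 1.5*M*r**2 + c*r - |g| = 0, written in a cancellation-free form.
    r = 2.0 * abs(g) / (c + math.sqrt(c * c + 6.0 * M * abs(g)))
    return -math.copysign(r, g)

def adaptive_tensor_method(x0, H0=1.0, tol=1e-6, max_iter=50, max_doublings=60):
    p, alpha = 2, 1.0
    q = p + alpha - 1.0
    x, H = x0, H0
    history = [f(x)]
    for _ in range(max_iter):
        if abs(df(x)) <= tol:                 # practical stopping rule (assumption)
            break
        for i in range(max_doublings):        # Steps 1.1-1.2: try M = 2^i * H_t
            M = (2.0 ** i) * H
            x_trial = x + cubic_model_step(df(x), d2f(x), M)
            g_trial = abs(df(x_trial))
            threshold = g_trial ** ((p + alpha) / q) / (
                8.0 * math.factorial(p + 1) * M ** (1.0 / q))
            if f(x) - f(x_trial) >= threshold:
                break
        else:
            raise RuntimeError("line search on H did not terminate")
        x, H = x_trial, M / 2.0               # Step 2: H_{t+1} = 2^{i_t - 1} * H_t
        history.append(f(x))
    return x, history

x_final, history = adaptive_tensor_method(3.0)
print(x_final, len(history) - 1)
```

Halving the accepted coefficient in Step 2 lets the estimate decrease again when the local behaviour of $f$ allows longer steps; the acceptance test guarantees that the recorded function values never increase.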

Let us define the following function of $\epsilon>0$ :

[TABLE]

where

[TABLE]

The next lemma provides upper bounds on $H_{t}$ and on the number of calls of the oracle (by calls of the oracle we mean the joint computation of $f$ and its derivatives).

Lemma 3.

Suppose that H1 and H2 are true. Given $\epsilon>0$ , assume that $\left\{x_{t}\right\}_{t=0}^{T}$ is a sequence generated by Algorithm 2 such that

[TABLE]

Then,

[TABLE]

Moreover, the number $O_{T}$ of calls of the oracle after $T$ iterations is bounded as follows:

[TABLE]

Proof.

Let us prove (3.23) by induction. Clearly it holds for $t=0$ . Assume that (3.23) is true for some $t$ , $0\leq t\leq T-1$ . If $\nu$ is known, then by (3.10) we have $\alpha=\nu$ . Thus, by H1 and Lemma 17 the final value of $2^{i_{t}}H_{t}$ cannot exceed

[TABLE]

since otherwise we should stop the line search earlier. Therefore,

[TABLE]

that is, (3.23) holds for $t+1$ .

On the other hand, if $\nu$ is unknown, we have $\alpha=1$ . In view of (3.20), (3.21) and H2, it follows that

[TABLE]

Thus, by (3.22) and Lemma A.5 in [9] we have $\|\nabla f(x_{t+1})\|_{*}\geq\dfrac{\epsilon}{R(\epsilon)}.$ In this case, it follows from Corollary 20 with $\delta=\epsilon/R(\epsilon)$ that

[TABLE]

Consequently, we also have

[TABLE]

that is, (3.23) holds for $t+1$ . This completes the induction argument.

Finally, note that at the $k$ th iteration of Algorithm 2, the oracle is called $i_{k}+1$ times. Since $H_{k+1}=2^{i_{k}-1}H_{k}$ , it follows that $i_{k}-1=\log_{2}H_{k+1}-\log_{2}H_{k}$ . Thus, by (3.23) we get

[TABLE]

∎
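The counting argument at the end of the proof relies on a telescoping sum: since $i_{k}+1=2+\log_{2}H_{k+1}-\log_{2}H_{k}$, the total number of oracle calls after $T$ iterations equals $2T+\log_{2}H_{T}-\log_{2}H_{0}$. The following sketch checks this identity on a hypothetical sequence of accepted indices $i_{k}$; the list `inner` is made up for the illustration.

```python
import math

def simulate_oracle_calls(H0, inner_indices):
    """inner_indices[k] = i_k, the accepted line-search index at iteration k."""
    H = H0
    calls = 0
    for i_k in inner_indices:
        calls += i_k + 1              # trial points i = 0, ..., i_k
        H = 2.0 ** (i_k - 1) * H      # H_{k+1} = 2^{i_k - 1} * H_k
    return calls, H

H0 = 1.0
inner = [0, 3, 1, 0, 2, 0, 0, 1]      # hypothetical accepted indices
calls, H_T = simulate_oracle_calls(H0, inner)
T = len(inner)
closed_form = 2 * T + math.log2(H_T) - math.log2(H0)
print(calls, closed_form)
```

Combined with an upper bound on $H_{T}$, the identity shows that the line search costs on average about two oracle calls per iteration plus a logarithmic term.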

From Lemma 3, we see that Algorithm 2 is a particular case of Algorithm 1 in which

[TABLE]

Thus, combining Theorem 2 and Lemma 3, we obtain the following result.

Theorem 4.

Suppose that H1 and H2 are true. Given $\epsilon\in(0,1)$ , assume that $\left\{x_{t}\right\}_{t=0}^{T}$ is a sequence generated by Algorithm 2 such that

[TABLE]

Denote by $m$ the first iteration number such that

[TABLE]

and assume that $m<T$ . Then

[TABLE]

and

[TABLE]

Consequently,

[TABLE]

where

[TABLE]

Proof.

As mentioned above, by Lemma 3 we have

[TABLE]

Then, (3.28) and (3.29) follow directly from Theorem 2 with

[TABLE]

Now, combining (3.27) and (3.29), we obtain

[TABLE]

and so,

[TABLE]

If $\nu$ is known, then $\alpha=\nu$ and, by (3.19), we have

[TABLE]

Thus, combining (3.31) and (3.32), we get (3.30). On the other hand, if $\nu$ is unknown, then $\alpha=1$ and, by (3.19), (3.25) and $\epsilon\in(0,1)$ , we have

[TABLE]

In this case, combining (3.31) and (3.33) we also get (3.30). ∎

Remark 5.

Note that for any $\theta$ in the interval

[TABLE]

the corresponding right-hand side in (3.21) has the same value as for $\theta=0$ .

Note that Algorithm 2 with $\alpha=1$ is a universal scheme: it works for any Hölder parameter $\nu\in[0,1]$ without using it explicitly. This algorithm can be viewed as a generalization of the universal method (6.10) in [8]. Looking at the efficiency bound (3.30), for $\nu$ known and $\nu$ unknown, we see that the universal scheme ensures the same dependence on the accuracy $\epsilon$ as the nonuniversal scheme ( $\alpha=\nu\neq 1$ ). Remarkably, this is not true for the accelerated schemes obtained from the standard estimating sequences technique, as we will see in the next section.

4 Accelerated tensor schemes

Similarly to Section 3, we shall consider a general accelerated tensor method parametrized by the constant $\alpha$ given in (3.10). Specifically, at the beginning of the $t$ th iteration ( $t>0$ ) one has an estimate $x_{t}$ for the solution of (2.3), an auxiliary vector $v_{t}$ and constants $A_{t},M_{t}>0$ . A new vector $y_{t}$ is computed as a convex combination of $x_{t}$ and $v_{t}$ :

[TABLE]

where

[TABLE]

with $a_{t}>0$ being computed from the equation

[TABLE]

Then, a trial point $x_{t}^{+}$ is computed as an approximate solution to the auxiliary problem

[TABLE]

such that

[TABLE]

where $\theta\geq 0$ is a user-defined parameter. If the descent condition

[TABLE]

is satisfied, then $x_{t}^{+}$ is accepted, and we define $x_{t+1}=x_{t}^{+}$ . Otherwise, constant $M_{t}$ is increased until the corresponding trial point $x_{t}^{+}$ is accepted. As in Algorithm 1, we assume that there exists $M_{\nu}>0$ such that $M_{t}\leq M_{\nu}$ for all $t$ . After obtaining $x_{t+1}$ , we set $A_{t+1}=A_{t}+a_{t}$ and compute

[TABLE]

where

[TABLE]

To initialize, we choose $x_{0}$ and we set $v_{0}=x_{0}$ , $A_{0}=0$ , and $\psi_{0}(x)=\frac{1}{p+\alpha}\|x-x_{0}\|^{p+\alpha}$ . This general scheme can be summarized in the following way.

Algorithm 3. Accelerated Tensor Method

Step 0. Choose $x_{0}\in\mathbb{E}$ , $H_{0}>0$ . Set $\alpha$ by (3.10), $v_{0}=x_{0}$ , $A_{0}=0$ , and $t:=0$ .

Step 1. Find $0<M_{t}\leq M_{\nu}$ such that (4.39) holds for an approximate solution $x_{t}^{+}$ to (4.37) satisfying (4.38), with $y_{t}$ being defined by (4.34)-(4.36).

Step 2. Set $x_{t+1}=x_{t}^{+}$ and $A_{t+1}=A_{t}+a_{t}$ , with $a_{t}>0$ obtained from (4.36).

Step 3. Define $\psi_{t+1}(.)$ by (4.41) and compute $v_{t+1}$ by (4.40).

Step 4. Set $t:=t+1$ and go back to Step 1.

Remark 6.

From the expression of $\psi_{t+1}(\,.\,)$ we can see that $\min_{x\in\mathbb{E}}\psi_{t+1}(x)$ admits a closed form solution, namely,

[TABLE]
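A closed form of this kind can be sketched under two explicit assumptions that are made here only for illustration: the estimating function has the shape $\psi(x)=\frac{1}{r}\|x-x_{0}\|^{r}+\langle s,x\rangle+\text{const}$ with $r=p+\alpha$ and $s$ the accumulated weighted gradients, and the norm is the standard Euclidean one ($B=I$). Setting the gradient to zero then gives $x=x_{0}-s/\|s\|^{(r-2)/(r-1)}$. The function names are hypothetical.

```python
import math

def argmin_psi(x0, s, r):
    """Minimizer of (1/r)*||x - x0||**r + <s, x> for the Euclidean norm, r > 1."""
    norm_s = math.sqrt(sum(si * si for si in s))
    if norm_s == 0.0:
        return list(x0)
    scale = norm_s ** (1.0 / (r - 1.0) - 1.0)   # equals ||s||^{-(r-2)/(r-1)}
    return [xi - scale * si for xi, si in zip(x0, s)]

def grad_psi(x, x0, s, r):
    """Gradient of (1/r)*||x - x0||**r + <s, x>, used to verify stationarity."""
    d = [xi - x0i for xi, x0i in zip(x, x0)]
    norm_d = math.sqrt(sum(di * di for di in d))
    return [norm_d ** (r - 2.0) * di + si for di, si in zip(d, s)]

x0 = [1.0, -1.0]
s = [3.0, 4.0]          # hypothetical accumulated linear term
r = 4.0                 # p = 3, alpha = 1
v = argmin_psi(x0, s, r)
print(v, grad_psi(v, x0, s, r))
```

Because the function is strictly convex for $r>1$, a vanishing gradient is enough to certify the minimizer.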

Regarding the computation of $a_{t}$ , for $t=0$ , (4.36) gives

[TABLE]

For $t>0$ , we have $A_{t}>0$ . Thus, the computation of $a_{t}$ requires the solution of a univariate nonlinear equation of the form

[TABLE]

Denoting $g(x)=x^{p+\alpha}-B(A+x)^{p+\alpha-1}$ , it is easy to see that

[TABLE]

where $c=\max\left\{A,2^{p+\alpha-1}B\right\}$ . Since $g(\,.\,)$ is continuous, we can use the bisection method to compute an approximation to $a^{*}\in(0,c]$ such that $g(a^{*})=0$ . As can be seen in the proof of Theorem 8, our convergence results only require

[TABLE]
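The bisection procedure described in this remark can be sketched as follows for the case $A>0$. The bracket $[0,c]$ with $c=\max\{A,2^{p+\alpha-1}B\}$ is verified directly in the code: $g(0)=-BA^{p+\alpha-1}<0$, and $g(c)\geq 0$ because $(A+c)^{p+\alpha-1}\leq(2c)^{p+\alpha-1}$ and $2^{p+\alpha-1}B\leq c$. The tolerance, the iteration cap and the sample values of $A$ and $B$ are assumptions for the illustration.

```python
def solve_for_a(A, B, p, alpha, tol=1e-14, max_iter=200):
    """Bisection for a root of g(x) = x**(p+alpha) - B*(A + x)**(p+alpha-1), with A, B > 0."""
    q = p + alpha - 1.0

    def g(x):
        return x ** (q + 1.0) - B * (A + x) ** q

    lo, hi = 0.0, max(A, 2.0 ** q * B)
    if not (g(lo) < 0.0 <= g(hi)):
        raise ValueError("expected g(0) < 0 <= g(c); check that A > 0 and B > 0")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid        # root lies in (mid, hi]
        else:
            hi = mid        # root lies in (lo, mid]
        if hi - lo <= tol:
            break
    return hi               # returned point satisfies g(hi) >= 0

# Hypothetical data: A = 1, B = 0.5, p = 3, alpha = 1, so c = max(1, 4) = 4.
a = solve_for_a(A=1.0, B=0.5, p=3, alpha=1.0)
print(a, a ** 4 - 0.5 * (1.0 + a) ** 3)
```

Returning the upper end of the final bracket keeps $g$ nonnegative at the computed value, up to rounding.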

The next result establishes the relationship between the estimating functions $\psi_{t}(\cdot)$ and the objective function $f(\cdot)$ .

Lemma 7.

For all $t\geq 0$ ,

[TABLE]

Proof.

We prove this result by induction in $t$ . Since $A_{0}=0$ , for all $x\in\mathbb{E}$ ,

[TABLE]

that is, (4.42) is true for $t=0$ . Suppose that (4.42) is true for some $t\geq 0$ . Then (4.41) and the convexity of $f$ imply that, for all $x\in\mathbb{E}$ ,

[TABLE]

Thus, (4.42) is also true for $t+1$ , and the proof is completed. ∎

The theorem below establishes the global convergence rate for Algorithm 3.

Theorem 8.

Assume that H1 is true and let the sequence $\left\{x_{t}\right\}_{t=0}^{T}$ be generated by Algorithm 3. Then, for $t=2,\ldots,T$ ,

[TABLE]

Proof.

Let us prove by induction that

[TABLE]

Since $A_{0}=0$ , we have $A_{0}f(x_{0})=0=\min_{x\in E}\psi_{0}(x)$ . Thus, (4.44) is true for $t=0$ . Assume that it is true for some $t\geq 0$ . Note that for any $x\in E$ we have

[TABLE]

Note that $\ell_{t}(x)$ is a linear function. Moreover, by Lemma 4 in [14], function

${1\over(p+\alpha)}\|x-x_{0}\|^{p+\alpha}$ is uniformly convex of degree $p+\alpha$ with parameter $2^{-(p+\alpha-2)}$ . Thus, $\psi_{t}(x)$ is also a uniformly convex function of degree $p+\alpha$ with parameter $2^{-(p+\alpha-2)}$ . Therefore, Lemma A.2 in [9] and the induction assumption imply that

[TABLE]

Thus,

[TABLE]

Since $f$ is convex and differentiable, we have

[TABLE]

Then, substituting this inequality above, we obtain

[TABLE]

Note that $y_{t}=(1-\gamma_{t})x_{t}+\gamma_{t}v_{t}={A_{t}\over A_{t+1}}x_{t}+{a_{t}\over A_{t+1}}v_{t}$ . Therefore, $A_{t}x_{t}=A_{t+1}y_{t}-a_{t}v_{t}$ , and

[TABLE]

Moreover, $A_{t+1}x_{t+1}=A_{t}x_{t+1}+a_{t}x_{t+1}$ , and so

[TABLE]

where the last inequality is due to (4.39). Thus, to prove that (4.44) is true for $t+1$ , it is enough to show that

[TABLE]

for all $x\in\mathbb{E}$ . Using Lemma 2 in [14] with $r=p+\alpha$ , $s=a_{t}\nabla f(x_{t+1})$ and $\omega=2^{-(p+\alpha-1)}$ , we see that a sufficient condition for (4.45) is

[TABLE]

which is equivalent to

[TABLE]

Note that, $2\left({p+\alpha\over p+\alpha-1}\right)^{p+\alpha-1}\left(1\over 8\right)^{p+\alpha-1}\geq{1\over 2^{(3p-1)}}$ . Therefore, by (4.36) we have

[TABLE]

Thus (4.44) is true for $t+1$ , completing the induction argument.

Let us now estimate the growth of the coefficients $A_{t}$ . Since $M_{t}\leq M_{\nu}$ for all $t=0,\ldots,T$ , by (4.36) we get $a_{t}^{p+\alpha}\geq\dfrac{1}{\tilde{M}}(A_{t}+a_{t})^{p+\alpha-1}$ with

[TABLE]

Consequently,

[TABLE]

Now, denoting $B_{t}=\tilde{M}A_{t}$ for all $t\geq 0$ , it follows from (4.47) that

[TABLE]

Then, by Lemma A.4 in [9], we have

[TABLE]

Note that $A_{1}\geq{1\over 2M}$ . Thus, $B_{1}\geq 1$ and consequently

[TABLE]

Therefore, for all $t\geq 2$ , we have

[TABLE]

Finally, by (4.44) and Lemma 7, for $t\geq 0$ , we have

[TABLE]

Hence, $A_{t}(f(x_{t})-f(x^{*}))\leq{1\over p+\alpha}\|x^{*}-x_{0}\|^{p+\alpha}$ , and (4.43) follows immediately from (4.46) and (4.48). ∎

If we assume that $\nu$ and $H_{f,p}(\nu)$ are known, then, by Lemma 21, we can set

[TABLE]

Here, by (3.10) the corresponding version of Algorithm 3 takes at most

$\mathcal{O}(\epsilon^{-1/(p+\nu)})$ iterations to generate $x_{t}$ such that $f(x_{t})-f(x^{*})\leq\epsilon.$ For problems in which $H_{f,p}(\nu)$ is not known, let us consider the following adaptive version of Algorithm 3.

Algorithm 4. Adaptive Accelerated Tensor Method

Step 0. Choose $x_{0}\in\mathbb{E}$ , $H_{0}>0$ , and $\theta\geq 0$ . Set $\alpha$ by (3.10) and define function $\psi_{0}(x)=\frac{1}{p+\alpha}\|x-x_{0}\|^{p+\alpha}$ . Set $v_{0}=x_{0}$ , $A_{0}=0$ , and $t:=0$ .

Step 1. Set $i:=0$ .

Step 1.1. Compute the coefficient $a_{t,i}>0$ by solving equation

$a_{t,i}^{p+\alpha}=\dfrac{1}{2^{(3p-1)}}\left[\dfrac{(p-1)!}{2^{i}H_{t}}\right](A_{t}+a_{t,i})^{p+\alpha-1}.$

Step 1.2. Set $\gamma_{t,i}=\dfrac{a_{t,i}}{A_{t}+a_{t,i}}$ and compute vector $y_{t,i}=(1-\gamma_{t,i})x_{t}+\gamma_{t,i}v_{t}$ .

Step 1.3. Compute an approximate solution $x_{t,i}^{+}$ to $\min_{x\in\mathbb{E}}\Omega_{y_{t,i},p,2^{i}H_{t}}^{(\alpha)}(x),$ such that

$\Omega_{y_{t,i},p,2^{i}H_{t}}^{(\alpha)}(x_{t,i}^{+})\leq f(y_{t,i})\quad\text{and}\quad\|\nabla\Omega_{y_{t,i},p,2^{i}H_{t}}^{(\alpha)}(x_{t,i}^{+})\|_{*}\leq\theta\|x_{t,i}^{+}-y_{t,i}\|^{p+\alpha-1}.$

Step 1.4. If condition

$\langle\nabla f(x_{t,i}^{+}),y_{t,i}-x_{t,i}^{+}\rangle\geq\dfrac{1}{4}\left[\dfrac{(p-1)!}{2^{i}H_{t}}\right]^{\frac{1}{p+\alpha-1}}\|\nabla f(x_{t,i}^{+})\|_{*}^{\frac{p+\alpha}{p+\alpha-1}},$

set $i_{t}:=i$ and go to Step 2. Otherwise, set $i:=i+1$ and go back to Step 1.1.

Step 2. Set $x_{t+1}=x_{t,i_{t}}^{+}$ , $y_{t}=y_{t,i_{t}}$ , $a_{t}=a_{t,i_{t}}$ and $\gamma_{t}=\gamma_{t,i_{t}}$ . Define $A_{t+1}=A_{t}+a_{t}$ and $H_{t+1}=2^{i_{t}-1}H_{t}$ .

Step 3. Define $\psi_{t+1}(.)$ by (4.41) and compute $v_{t+1}$ by (4.40).

Step 4. Set $t:=t+1$ and go back to Step 1.
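The scalar equation in Step 1.1 and the doubling loop of Steps 1 and 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the root is found by plain bisection (any scalar root finder would do), and `accepts` is a hypothetical callback standing in for Steps 1.2 to 1.4 (building $y_{t,i}$, solving the auxiliary problem, and checking the test of Step 1.4), which are not implemented here.

```python
import math


def solve_coefficient(A, H, p, alpha, i=0, rel_tol=1e-12):
    """Positive root a of a**q = c * (A + a)**(q - 1), as in Step 1.1,
    with q = p + alpha and c = (p-1)! / (2**(3p-1) * 2**i * H)."""
    q = p + alpha
    c = math.factorial(p - 1) / (2.0 ** (3 * p - 1) * 2.0 ** i * H)
    if A == 0.0:
        return c  # a**q = c * a**(q-1) gives a = c

    def residual(a):
        return a ** q - c * (A + a) ** (q - 1)

    # The root is at least c because (A + a)**(q-1) >= a**(q-1).
    lo, hi = c, 2.0 * c
    while residual(hi) < 0.0:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if residual(mid) < 0.0:
            lo = mid
        else:
            hi = mid
        if hi - lo <= rel_tol * hi:
            break
    return 0.5 * (lo + hi)


def line_search(A, H, p, alpha, accepts, max_trials=60):
    """Skeleton of Steps 1 and 2.

    `accepts(a, gamma, H_trial)` is a HYPOTHETICAL callback that should
    carry out Steps 1.2-1.4 and return True when the test of Step 1.4 holds.
    Returns (a_t, gamma_t, A_{t+1}, H_{t+1}, i_t).
    """
    for i in range(max_trials):
        a = solve_coefficient(A, H, p, alpha, i)
        gamma = a / (A + a)
        if accepts(a, gamma, 2.0 ** i * H):
            return a, gamma, A + a, 2.0 ** (i - 1) * H, i
    raise RuntimeError("no acceptable trial point found")


if __name__ == "__main__":
    # Toy acceptance rule: accept once the trial constant reaches 8.
    print(line_search(0.0, 1.0, 2, 0.5, lambda a, g, Ht: Ht >= 8.0))
```

Since the left-hand side of the equation, divided by $(A_{t}+a)^{p+\alpha-1}$, is strictly increasing in $a$, the positive root is unique and bisection is a safe default choice.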

Note that Algorithm 4 is a particular case of Algorithm 3 in which

[TABLE]

Let us define the following function of $\epsilon>0$ :

[TABLE]

The next lemma provides upper bounds on $H_{t}$ and on the number of calls of the oracle in Algorithm 4.

Lemma 9.

Suppose that H1 and H2 are true. Given $\epsilon>0$ , assume that $\left\{x_{t}\right\}_{t=0}^{T}$ is a sequence generated by Algorithm 4 such that

[TABLE]

and

[TABLE]

Then

[TABLE]

Moreover, the number $O_{T}$ of calls of the oracle after $T$ iterations is bounded as follows:

[TABLE]

Proof.

Let us prove by induction that the scaling coefficients $H_{t}$ in Algorithm 4 satisfy (4.52). This is obvious for $t=0$ . Assume that (4.52) is true for some $t\geq 0$ . If $\alpha=\nu$ , it follows from Lemma 21 that the final value $2^{i_{t}}H_{t}$ cannot be bigger than

[TABLE]

since otherwise the line search would have stopped earlier. Thus,

[TABLE]

that is, (4.52) holds for $t+1$ . On the other hand, suppose that $\alpha=1$ . In view of Lemma A.5 in [9], at any trial point $x_{t,i}^{+}$ we have

[TABLE]

Thus, it follows from Lemma 22 that

[TABLE]

Consequently, we also have $H_{t+1}\leq\max\left\{M_{\nu}(\epsilon),H_{0}\right\}$ ; i.e., (4.52) holds for $t+1$ . This completes the induction argument. Finally, as in the proof of Lemma 3, from (4.52) we get (4.53). ∎

Now we can prove the following convergence result for Algorithm 4.

Theorem 10.

Suppose that H1 and H2 are true. Given $\epsilon\in(0,1)$ , assume that $\left\{x_{t}\right\}_{t=0}^{T}$ is a sequence generated by Algorithm 4 such that (4.50) and (4.51) hold. Then

[TABLE]

Consequently,

[TABLE]

if $\nu$ is known (i.e., $\alpha=\nu$ ), and

[TABLE]

if $\nu$ is unknown (i.e., $\alpha=1$ ).

Proof.

By Lemma 9, we have

[TABLE]

Then (4.54) follows directly from Theorem 8 with

[TABLE]

Now, combining (4.54) and (4.51) for $k=T$ , we obtain

[TABLE]

and so,

[TABLE]

If $\nu$ is known, then $\alpha=\nu$ and, by (4.49), we have

[TABLE]

Thus, combining (4.57) and (4.58), we get (4.55). On the other hand, if $\nu$ is unknown, then $\alpha=1$ and, by (4.49), (3.25) and $\epsilon\in(0,1)$ , we have

[TABLE]

In this case, combining (4.57) and (4.59) we get (4.56). ∎

When $\nu=1$ , bounds (4.55) and (4.56) have the same dependence on $\epsilon$ . However, when $\nu\neq 1$ , the bound of $\mathcal{O}\left(\epsilon^{-p/[(p+1)(p+\nu-1)]}\right)$ obtained for the universal scheme (i.e., Algorithm 4 with $\alpha=1$ ) is worse than the bound of $\mathcal{O}\left(\epsilon^{-1/(p+\nu)}\right)$ obtained for the nonuniversal scheme ( $\alpha=\nu$ ). For high-order methods ( $p\geq 2$ ), to the best of our knowledge, there is no simple procedure by which one can identify the level of smoothness $\nu$ of the $p$ th derivatives (in general). Therefore, despite this gap in the complexity bounds, we believe that the automatic choice of the best function subclass in the universal scheme is a very attractive feature. Moreover, in the nonuniversal scheme, for any $\theta>0$ with $(p+\nu-1)(H_{f,p}(\nu)+\theta(p-1)!)>H_{0}$ , the corresponding right-hand side of (4.22) has an additional term

[TABLE]

in comparison to its value when $\theta=0$ . In contrast, in the accelerated universal scheme, for any $\theta$ in the interval

[TABLE]

the corresponding right-hand side in (4.23) is the same as for $\theta=0$ . In this sense, it appears that the accelerated universal scheme is more robust than the accelerated nonuniversal scheme in terms of the inexact solution of the auxiliary problems.
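The gap between the exponents of the universal and nonuniversal accelerated schemes mentioned above can be made explicit. The following short computation uses only the two exponents stated in the text, $p/[(p+1)(p+\nu-1)]$ and $1/(p+\nu)$ :

$$
\frac{p}{(p+1)(p+\nu-1)}-\frac{1}{p+\nu}
=\frac{p(p+\nu)-(p+1)(p+\nu-1)}{(p+1)(p+\nu-1)(p+\nu)}
=\frac{1-\nu}{(p+1)(p+\nu-1)(p+\nu)}\;\geq\;0 .
$$

The difference vanishes exactly when $\nu=1$ , where both exponents equal $1/(p+1)$ , and for fixed $\nu<1$ it decays like $1/p^{3}$ as $p$ grows.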

5 Lower complexity bounds under Hölder condition

In this section we investigate how much the convergence rates of our tensor methods can be improved with respect to problems satisfying H1. Specifically, we derive lower complexity bounds for $p$ -order tensor methods applied to the problem (2.3), where the objective $f$ is convex and $H_{f,p}(\nu)<+\infty$ for some $\nu\in[0,1]$ .

5.1 Hard functions and Lower Complexity Bounds

For simplicity, let us consider $\mathbb{E}=\mathbb{R}^{n}$ and $B=I_{n}$ . Given an approximation $\bar{x}$ for the solution of (2.3), $p$ -order methods usually compute the next test point as $x^{+}=\bar{x}+\bar{h}$ , where the search direction $\bar{h}$ is the solution of an auxiliary problem of the form

[TABLE]

with $a\in\mathbb{R}^{p}$ , $\gamma>0$ , and $m>1$ . Denote by $\Gamma_{\bar{x},f}(a,\gamma,m)$ the set of all stationary points of function $\phi_{a,\gamma,m}(\,.\,)$ , and define the linear subspace

[TABLE]

With this notation, we can characterize the class of $p$ -order tensor methods by the following assumption.

Assumption 1. Given $x_{0}\in\mathbb{R}^{n}$ , the method generates a sequence of test points $\left\{x_{k}\right\}_{k\geq 0}$ such that

[TABLE]

Given $\nu\in[0,1]$ , our parametric family of difficult functions for $p$ -order tensor methods is defined as

[TABLE]

The next lemma establishes that for each $f_{k}(\,.\,)$ we have $H_{f_{k},p}(\nu)<+\infty$ .

Lemma 11.

Given an integer $k\in[2,n]$ , the $p$ th derivative of $f_{k}(\,.\,)$ is $\nu$ -Hölder continuous with

[TABLE]

Proof.

In view of (5.63), we have

[TABLE]

where

[TABLE]

It can be shown that (see page 13 in [15])

[TABLE]

On the other hand, for any $x,h\in\mathbb{R}^{n}$ , we have

[TABLE]

Therefore, for all $x,y,h\in\mathbb{R}^{n}$ , it follows that

[TABLE]

Consequently, for all $x,d,h\in\mathbb{R}^{n}$ , we have

[TABLE]

Note that

[TABLE]

and, by (5.68), that

[TABLE]

Thus, combining (5.69)-(5.71), we get

[TABLE]

∎

The next lemma provides additional properties of $f_{k}(\,.\,)$ .

Lemma 12.

Given an integer $k\in[2,n]$ , let function $f_{k}(\,.\,)$ be defined by (5.63). Then, $f_{k}(\,.\,)$ has a unique global minimizer $x_{k}^{*}$ . Moreover,

[TABLE]

Proof.

The existence and uniqueness of $x_{k}^{*}$ follows from the fact that $f_{k}(\,.\,)$ is uniformly convex. In view of (5.65), it follows from the first-order optimality condition that

[TABLE]

Therefore, $A_{k}x_{k}^{*}=y_{k}^{*}$ , where $y_{k}^{*}$ satisfies

[TABLE]

with $\bar{e}_{k}\in\mathbb{R}^{k}$ being the vector of all ones and $0_{n-k}$ being the origin in $\mathbb{R}^{n-k}$ . Note that

[TABLE]

Consequently, (5.73) is equivalent to

[TABLE]

Thus,

[TABLE]

and so

[TABLE]

where $(\tau)_{+}=\max\left\{0,\tau\right\}$ . Finally, combining (5.65), (5.66), (5.75) and (5.76) we get

[TABLE]

∎

Our goal is to understand the behavior of the tensor methods specified by Assumption 1 when applied to the minimization of $f_{k}(\,.\,)$ with a suitable $k$ . For that, let us consider the following subspaces:

[TABLE]

Lemma 13.

For any $q\geq 0$ and $x\in\mathbb{R}_{k}^{n}$ , $f_{k+q}(x)=f_{k}(x)$ .

Proof.

It follows directly from (5.63). ∎

Lemma 14.

Let $\mathcal{M}$ be a $p$ -order tensor method satisfying Assumption 1. If $\mathcal{M}$ is applied to the minimization of $f_{t}(\,.\,)$ starting from $x_{0}=0$ , then the sequence $\left\{x_{k}\right\}_{k\geq 0}$ of test points generated by $\mathcal{M}$ satisfies

[TABLE]

Proof.

See Lemma 2 in [15]. ∎

Now, we can prove the lower complexity bound for $p$ -order tensor methods applied to the minimization of functions with $\nu$ -Hölder continuous $p$ th derivatives.

Theorem 15.

Let $\mathcal{M}$ be a $p$ -order tensor method satisfying Assumption 1. Assume that for any function $f$ with $H_{f,p}(\nu)<+\infty$ this method ensures the rate of convergence:

[TABLE]

where $\left\{x_{k}\right\}_{k\geq 0}$ is the sequence generated by method $\mathcal{M}$ and $x^{*}$ is a global minimizer of $f$ . Then, for all $t\geq 1$ such that $2t+1\leq n$ we have

[TABLE]

where

[TABLE]

Proof.

Let us apply method $\mathcal{M}$ for minimizing function $f_{2t+1}(\,.\,)$ starting from point $x_{0}=0$ . By Lemma 14 we have $x_{i}\in\mathbb{R}_{t}^{n}$ for all $i$ , $0\leq i\leq t$ . Moreover, by Lemma 13 we have

[TABLE]

Thus, from (5.77), (5.80), Lemma 11 and Lemma 12 we get

[TABLE]

where constant $C_{p,\nu}$ is given by (5.79). ∎

5.2 Discussion

Theorem 15 establishes that the lower bound for the rate of convergence of tensor methods applied to functions with $\nu$ -Hölder continuous $p$ th derivatives is of $\mathcal{O}\left((\frac{1}{k})^{\frac{3(p+\nu)-2}{2}}\right)$ . In the Lipschitz case (i.e., $\nu=1$ ) we have $\mathcal{O}\left((\frac{1}{k})^{\frac{3p+1}{2}}\right)$ , which coincides with the bounds in [1, 15]. On the other hand, for first-order methods (i.e., $p=1$ ) we have $\mathcal{O}\left((\frac{1}{k})^{\frac{1+3\nu}{2}}\right)$ , which is the bound in [12].

The rate of $\mathcal{O}\left((\frac{1}{k})^{\frac{3(p+\nu)-2}{2}}\right)$ corresponds to a worst-case complexity bound of $\mathcal{O}\left(\epsilon^{-2/[3(p+\nu)-2]}\right)$ iterations necessary to ensure $f(x_{k})-f(x^{*})\leq\epsilon$ . This means that the nonuniversal accelerated schemes proposed in this paper (e.g., Algorithm 4 with $\alpha=\nu$ ) are nearly optimal tensor methods. In fact, for $\epsilon\in(0,1)$ , note that

[TABLE]

In particular, if $\epsilon=10^{-6}$ , we have $\left(\frac{1}{\epsilon}\right)^{\frac{1}{p+\nu}}\leq 6\left(\frac{1}{\epsilon}\right)^{\frac{2}{3(p+\nu)-2}}.$ Thus, in practice, the complexity bounds of our accelerated nonuniversal methods differ from the lower bound just by a small constant factor.
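The size of this factor can be checked numerically. The sketch below compares only the $\epsilon$ -dependent parts of the two bounds, $(1/\epsilon)^{1/(p+\nu)}$ and $(1/\epsilon)^{2/[3(p+\nu)-2]}$ , ignoring all problem-dependent constants; the grid of $(p,\nu)$ values and the function names are arbitrary choices made for illustration.

```python
def upper_exponent(p, nu):
    """Exponent of 1/eps in the accelerated nonuniversal bound."""
    return 1.0 / (p + nu)


def lower_exponent(p, nu):
    """Exponent of 1/eps in the lower complexity bound."""
    return 2.0 / (3.0 * (p + nu) - 2.0)


def gap_factor(eps, p, nu):
    """Ratio (1/eps)**upper / (1/eps)**lower, constants ignored."""
    return (1.0 / eps) ** (upper_exponent(p, nu) - lower_exponent(p, nu))


def worst_gap(eps, p_max=20, nu_steps=100):
    """Largest ratio over p = 2..p_max and a uniform grid of nu in [0, 1]."""
    return max(
        gap_factor(eps, p, j / nu_steps)
        for p in range(2, p_max + 1)
        for j in range(nu_steps + 1)
    )


if __name__ == "__main__":
    eps = 1e-6
    print("worst factor on the grid:", worst_gap(eps))
    print("factor for p = 2, nu = 0:", gap_factor(eps, 2, 0.0))
```

On this grid the factor stays well below the constant 6 quoted above, and it equals 1 at $p=2$ , $\nu=0$ , where both exponents are $1/2$ .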

Notice that the lower bound described in Theorem 15 is only valid while the iteration counter $t$ satisfies $t\leq\frac{1}{2}(n-1)$ , where $n$ is the dimension of the domain of the objective. The same condition on $t$ appears in other lower bounds in the literature for the case $p=1$ and $\nu=1$ (see, e.g., Theorem 2.1.7 in [16]).

6 Conclusion

In this paper, we presented $p$ -order methods for unconstrained minimization of convex functions that are $p$ -times differentiable with $\nu$ -Hölder continuous $p$ th derivatives. For the universal and the nonuniversal schemes without acceleration, we established iteration complexity bounds of $\mathcal{O}\left(\epsilon^{-1/(p+\nu-1)}\right)$ for reducing the functional residual below a given $\epsilon\in(0,1)$ . Assuming that $\nu$ is known, we obtained an improved complexity bound of $\mathcal{O}\left(\epsilon^{-1/(p+\nu)}\right)$ for the corresponding accelerated scheme. For the case in which $\nu$ is unknown, we presented an accelerated universal tensor scheme with an iteration complexity of $\mathcal{O}\left(\epsilon^{-p/[(p+1)(p+\nu-1)]}\right)$ .

Finally, a lower complexity bound of $\mathcal{O}(\epsilon^{-2/[3(p+\nu)-2]})$ was also obtained for the referred problem class. This means that, in practice, our accelerated nonuniversal schemes are nearly optimal. Remarkably, the complexity bound obtained for the accelerated universal schemes is slightly worse than the bound obtained for the nonuniversal accelerated schemes. Up to now, it is not clear whether the estimating sequences technique can be modified to provide an accelerated universal $p$ -order method with a complexity bound of $\mathcal{O}\left(\epsilon^{-1/(p+\nu)}\right)$ .

It is worth mentioning that the study of high-order methods is still at its early stages, with the majority of recent works in this area focusing on the derivation of global complexity bounds (see, e.g., [2, 4, 6, 7, 11, 15]). These bounds predict that high-order methods with $p\geq 3$ may require significantly fewer calls of the oracle than second-order methods. As pointed out in [7, 15], the computation of high-order derivatives may be affordable for structured objectives (such as separable functions). Moreover, at least for $p=3$ and $\alpha=\nu$ , the auxiliary problems can be solved using Bregman gradient methods that also take into account their particular structure [10, 15]. Nevertheless, the practical impact of high-order methods is yet to be seen.

Appendix A Auxiliary Results

In all algorithms described in this paper, the acceptance of new points is conditional on achieving a sufficient decrease of the objective function value. In the nonaccelerated schemes, the sufficient decrease condition is specified by (3.4), while for accelerated schemes, it is specified by (4.6). In this Appendix we present auxiliary results from which we conclude that these conditions are satisfied when the regularization parameter is sufficiently large.

A.1 Results for schemes without acceleration

Our first lemma gives a lower bound for the functional decrease in terms of a suitable power of the norm of the displacement, when $\nu$ is known.

Lemma 16.

Let $H_{f,p}(\nu)<+\infty$ for some $\nu\in[0,1]$ and assume that $x^{+}$ satisfies

[TABLE]

for some $\bar{x}\in\mathbb{E}$ and $H>0$ . If $H\geq\frac{3}{2}H_{f,p}(\nu)$ , then

[TABLE]

Proof.

In view of (2.9) and (A.81), we have

[TABLE]

which gives

[TABLE]

Since $H\geq\frac{3}{2}H_{f,p}(\nu)\geq\dfrac{p+1}{p}H_{f,p}(\nu)$ for all $p\geq 2$ , it follows that

[TABLE]

∎

The next lemma provides a lower bound for the functional decrease in terms of a suitable power of the norm of the gradient when $\nu$ is known.

Lemma 17.

Let $H_{f,p}(\nu)<+\infty$ for some $\nu\in[0,1]$ , and assume that $x^{+}\in\mathbb{E}$ satisfies (A.81) and

[TABLE]

for some $\bar{x}\in\mathbb{E}$ , $H>0$ , and $\theta\geq 0$ . If

[TABLE]

then

[TABLE]

Proof.

By (2.6), (2.8), (A.83), and (A.84), we have

[TABLE]

Thus,

[TABLE]

On the other hand, by (A.81) and (A.84) it follows from Lemma 16 that

[TABLE]

Then, combining (A.86), and (A.87) we get (A.85). ∎

The lemma below gives lower bounds for powers of the norm of the displacement when $\nu$ is unknown.

Lemma 18.

Let $H_{f,p}(\nu)<+\infty$ for some $\nu\in[0,1]$ , and assume that $x^{+}$ satisfies

[TABLE]

for some $\bar{x}\in\mathbb{E}$ , $H>0$ and $\theta\geq 0$ . If for some $\delta>0$ we have

[TABLE]

with constant $C\geq 1$ , then

[TABLE]

and, consequently,

[TABLE]

Proof.

For $\nu=1$ , (A.90) is obvious. Thus, assume that $\nu\in[0,1)$ and denote $r=\|x^{+}-\bar{x}\|$ . Then, by (2.6), (2.8), and (A.88), we have

[TABLE]

Assume by contradiction that (A.90) is not true, i.e., $Hr^{1-\nu}<CH_{f,p}(\nu)$ . Since $H\geq\theta$ and $C\geq 1$ , it follows that

[TABLE]

This implies that $H<(CH_{f,p}(\nu))^{\frac{p}{p+\nu-1}}\left(\dfrac{4}{\delta}\right)^{\frac{1-\nu}{p+\nu-1}}$ contradicting the second inequality in (A.89). Therefore, (A.90) holds.

Finally, let us prove (A.91). In view of inequality (A.90), we have

[TABLE]

Thus, it follows from (A.92) that

[TABLE]

∎

Now, using Lemma 18, we obtain a lower bound for the functional decrease in terms of a computable power of the norm of the displacement, when $\nu$ is unknown.

Lemma 19.

Let $H_{f,p}(\nu)<+\infty$ for some $\nu\in[0,1]$ and assume that $x^{+}\in\mathbb{E}$ satisfies

[TABLE]

and

[TABLE]

for some $\bar{x}\in\mathbb{E}$ , $H>0$ and $\theta\geq 0$ . If for some $\delta>0$ we have

[TABLE]

with constant $C\geq\frac{3}{2}$ , then

[TABLE]

Proof.

In view of (2.5), (2.8), and (A.93), we have

[TABLE]

and so

[TABLE]

Assume by contradiction that (A.96) is not true, i.e.,

[TABLE]

Then, combining (A.97) and (A.98), we obtain

[TABLE]

which implies that

[TABLE]

By (A.94) and (A.95), the conclusions of Lemma 18 hold. In particular, we have

[TABLE]

and so

[TABLE]

Then it follows from (A.99) and (A.100) that

[TABLE]

contradicting the second inequality in (A.95). Therefore, (A.96) is true. ∎

Finally, the next corollary gives a lower bound for the functional decrease in terms of a computable power of the norm of the gradient when $\nu$ is unknown.

Corollary 20.

Let $H_{f,p}(\nu)<+\infty$ for some $\nu\in[0,1]$ , and assume that $x^{+}\in\mathbb{E}$ satisfies (A.93) and (A.94) for some $\bar{x}\in\mathbb{E}$ , $H>0$ , and $\theta\geq 0$ . Given $\delta>0$ , define

[TABLE]

If $\|\nabla f(x^{+})\|_{*}\geq\delta$ and $H\geq\xi_{\nu}(\delta)$ , then

[TABLE]

Proof.

From inequality (A.91) in Lemma 18 we have

[TABLE]

which implies that

[TABLE]

Then, it follows from inequality (A.96) in Lemma 19 that

[TABLE]

∎

A.2 Results for accelerated schemes

For the case in which $\nu$ is known, the lemma below establishes that (4.6) is achievable when the regularization parameter is sufficiently large.

Lemma 21.

Let $H_{f,p}(\nu)<+\infty$ for some $\nu\in[0,1]$ , and assume that $x^{+}$ satisfies

[TABLE]

for some $\bar{x}\in\mathbb{E}$ , $H>0$ and $\theta\geq 0$ . If

[TABLE]

then

[TABLE]

Proof.

Denote $r=\|x^{+}-\bar{x}\|$ . Then, by (2.6), (2.8), and (A.102), we have

[TABLE]

Thus, we obtain

[TABLE]

which implies that

[TABLE]

For $\nu=0$ , (A.105) leads to the desired relation. Let us assume that $\nu>0$ . Denote $g=\|\nabla f(x^{+})\|_{*}$ and $\Delta^{2}=1-\left(\dfrac{H_{f,p}(\nu)+\theta(p-1)!}{H}\right)^{2}.$ By (A.103), we have

[TABLE]

Consider the right-hand side of inequality (A.105) as a function of $r$ :

[TABLE]

Since $\Delta^{2}>0$ , $h$ is a convex function for $r>0$ . Thus, let us find the optimal $r_{*}$ as a solution to the first-order optimality condition for function $h$ :

[TABLE]

Solving this equation for $r_{*}$ , we obtain $r_{*}^{p+\nu-1}=\dfrac{g(p-1)!}{H\Delta}\sqrt{\dfrac{p+\nu-2}{p+\nu}}.$ Consequently,

[TABLE]

Now, using (A.106) we obtain

[TABLE]

Note that

[TABLE]

Thus, $h(r_{*})\geq\dfrac{1}{3}\left[\dfrac{(p-1)!}{H}\right]^{\frac{1}{p+\nu-1}}g^{\frac{p+\nu}{p+\nu-1}}$ and so by (A.105) we get (A.104). ∎
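The closed form for $r_{*}$ in this proof can be sanity-checked numerically. Since the displayed definition of $h$ is not reproduced here, the sketch below ASSUMES the form $h(r)=\frac{(p-1)!\,g^{2}}{2H}r^{-(p+\nu-2)}+\frac{H\Delta^{2}}{2(p-1)!}r^{p+\nu}$ , which is convex for $r>0$ and whose stationary point matches the stated expression for $r_{*}$ ; it should be read as an illustration under that assumption, not as the paper's exact formula. The numerical values in the example are arbitrary.

```python
import math


def h(r, g, H, delta, p, nu):
    """ASSUMED form of the right-hand side of (A.105) as a function of r.

    This expression is an assumption made for illustration; it is chosen
    so that its stationary point agrees with the stated formula for r_*.
    """
    s = p + nu
    c = math.factorial(p - 1)
    return c * g ** 2 / (2.0 * H) * r ** (-(s - 2.0)) \
        + H * delta ** 2 / (2.0 * c) * r ** s


def r_star(g, H, delta, p, nu):
    """Stationary point from the text:
    r_*^(p+nu-1) = g (p-1)! / (H Delta) * sqrt((p+nu-2) / (p+nu)).
    Requires p + nu > 2 (the case nu = 0, p = 2 is handled separately)."""
    s = p + nu
    c = math.factorial(p - 1)
    return (g * c / (H * delta) * math.sqrt((s - 2.0) / s)) ** (1.0 / (s - 1.0))


if __name__ == "__main__":
    g, H, delta, p, nu = 2.0, 5.0, 0.8, 3, 0.5
    r = r_star(g, H, delta, p, nu)
    # h should not decrease when moving away from r_* in either direction.
    print(h(0.9 * r, g, H, delta, p, nu) - h(r, g, H, delta, p, nu))
    print(h(1.1 * r, g, H, delta, p, nu) - h(r, g, H, delta, p, nu))
```

Both printed differences should be positive if the assumed $h$ is indeed minimized at $r_{*}$ .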

Finally, for the case in which $\nu$ is unknown, the next lemma establishes that (4.6) is also achievable when the regularization parameter is sufficiently large.

Lemma 22.

Let $H_{f,p}(\nu)<+\infty$ for some $\nu\in[0,1]$ , and assume that $x^{+}$ satisfies

[TABLE]

for some $\bar{x}\in\mathbb{E}$ , $H>0$ , and $\theta\geq 0$ . If for some $\delta>0$ we have

[TABLE]

with $C\geq 4$ , then

[TABLE]

Proof.

Denote $r=\|x^{+}-\bar{x}\|$ . Then, by (2.6), (2.8), and (A.107) we have

[TABLE]

Therefore,

[TABLE]

which gives

[TABLE]

Since $H\geq C\theta(p-1)!$ , it follows that

[TABLE]

Because $C\geq 4$ , we have

[TABLE]

and so

[TABLE]

Therefore,

[TABLE]

Denote $g=\|\nabla f(x^{+})\|_{*}$ and consider the right-hand side of (A.110) as a function of $r$ :

[TABLE]

Let us find the optimal $r_{*}$ as a solution to the first-order optimality condition for function $h$ :

[TABLE]

Solving this equation for $r_{*}$ , we obtain

[TABLE]

Consequently,

[TABLE]

Therefore, (A.109) holds. ∎

Acknowledgments

The authors are very grateful to the two anonymous referees, whose comments helped to improve the first version of this paper.

Bibliography

[1] Arjevani, Y., Shamir, O., Shiff, R.: Oracle complexity of second-order methods for smooth convex optimization. Mathematical Programming 178, 327–360 (2019)
[2] Baes, M.: Estimate Sequence Methods: Extensions and Approximations. Optimization Online (2009)
[3] Banach, S.: Über homogene Polynome in ($L^{2}$). Studia Math. 7, 36–44 (1938)
[4] Birgin, E.G., Gardenghi, J.L., Martínez, J.M., Santos, S.A., Toint, Ph.L.: Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models. Mathematical Programming 163, 359–368 (2017)
[5] Bouaricha, A.: Tensor methods for large, sparse unconstrained optimization. SIAM Journal on Optimization 7, 732–756 (1997)
[6] Cartis, C., Gould, N.I.M., Toint, Ph.L.: Universal regularized methods - varying the power, the smoothness, and the accuracy. SIAM Journal on Optimization 29, 595–615 (2019)
[7] Chen, X., Toint, Ph.L., Wang, H.: Partially separable convexly-constrained optimization with non-Lipschitzian singularities and its complexity. SIAM Journal on Optimization 29, 874–903 (2019)
[8] Grapiglia, G.N., Nesterov, Yu.: Regularized Newton Methods for minimizing functions with Hölder continuous Hessians. SIAM Journal on Optimization 27, 478–506 (2017)
