Oracle Complexity of Second-Order Methods for Smooth Convex Optimization

Yossi Arjevani; Ohad Shamir; Ron Shiff

arXiv:1705.07260·math.OC·August 18, 2017·Math. Program.

TL;DR

This paper establishes tight bounds on the number of iterations second-order methods need to optimize smooth convex functions, clarifying their advantages and limitations compared to gradient-based methods.

Contribution

It provides the first tight bounds on the oracle complexity of second-order methods for smooth convex optimization, including higher-order generalizations.

Findings

01

Second-order methods can match or outperform gradient methods under certain conditions.

02

Tight bounds reveal when second-order methods are advantageous or limited.

03

Results extend to higher-order optimization techniques.

Abstract

Second-order methods, which utilize gradients as well as Hessians to optimize a given function, are of major importance in mathematical optimization. In this work, we prove tight bounds on the oracle complexity of such methods for smooth convex functions, or equivalently, the worst-case number of iterations required to optimize such functions to a given accuracy. In particular, these bounds indicate when such methods can or cannot improve on gradient-based methods, whose oracle complexity is much better understood. We also provide generalizations of our results to higher-order methods.

Equations

$$\min_{\mathbf{w}\in\mathbb{R}^{d}} f(\mathbf{w}),$$

$$\Theta\left(\sqrt{\frac{\mu_{1}D^{2}}{\epsilon}}\right),$$

$$\Theta\left(\sqrt{\frac{\mu_{1}}{\lambda}}\cdot\log\left(\frac{\mu_{1}D^{2}}{\epsilon}\right)\right).$$

$$\Omega\left(\min\left\{\sqrt{\frac{\mu_{1}D^{2}}{\epsilon}},\left(\frac{\mu_{2}D^{3}}{\epsilon}\right)^{2/7}\right\}\right).$$

$$\Omega\left(\min\left\{\sqrt{\frac{\mu_{1}}{\lambda}},\left(\frac{\mu_{2}}{\lambda}D\right)^{2/7}\right\}+\log\log_{18}\left(\frac{\lambda^{3}/\mu_{2}^{2}}{\epsilon}\right)\right).$$

$$\Omega\left(\left(\frac{\mu_{k}D^{k+1}}{(k+1)!\,k\,\epsilon}\right)^{2/(3k+1)}\right).$$

$$\mathcal{O}\left(\frac{\mu_{1}^{2}\mu_{2}^{2}}{\lambda^{5}}\left(f(\mathbf{w}_{1})-f(\mathbf{w}^{*})\right)+\log\log_{2}\left(\frac{\lambda^{3}/\mu_{2}^{2}}{\epsilon}\right)\right),$$

$$\mathcal{O}\left(\left(f(\mathbf{w}_{1})-f(\mathbf{w}^{*})\right)+\log\log_{2}\left(\frac{1}{\epsilon}\right)\right),$$

$$\mathcal{O}\left(\left(\frac{\mu_{2}}{\lambda}D\right)^{1/3}+\log\log_{2}\left(\frac{\lambda^{3}/\mu_{2}^{2}}{\epsilon}\right)\right),$$

$$\mathcal{O}\left(\left(\frac{\mu_{2}}{\lambda}D\right)^{2/7}\log\left(\frac{\mu_{1}\mu_{2}^{2}D^{2}}{\lambda^{3}}\right)+\log\log_{2}\left(\frac{\lambda^{3}/\mu_{2}^{2}}{\epsilon}\right)\right).$$

$$\mathcal{O}\left(\sqrt{\frac{\mu_{1}}{\lambda}}\cdot\log\left(\frac{\mu_{1}\mu_{2}^{2}D^{2}}{\lambda^{3}}\right)+\log\log_{2}\left(\frac{\lambda^{3}/\mu_{2}^{2}}{\epsilon}\right)\right),$$

$$\mathcal{O}\left(\left(\frac{\mu_{2}D^{3}}{\epsilon}\right)^{1/3}\right).$$

$$\mathcal{O}\left(\left(\frac{\mu_{2}D^{3}}{\epsilon}\right)^{2/7}\right).$$

$$\mathcal{O}\left(\sqrt{\frac{\mu_{1}D^{2}}{\epsilon}}\right).$$

$$c\cdot\left(\min\left\{\sqrt{\frac{\mu_{1}}{\lambda}},\left(\frac{\mu_{2}}{\lambda}D\right)^{2/7}\right\}+\log\log_{18}\left(\frac{\lambda^{3}/\mu_{2}^{2}}{\epsilon}\right)\right)$$

$$c\cdot\min\left\{\sqrt{\frac{\mu_{1}D^{2}}{\epsilon}},\left(\frac{\mu_{2}D^{3}}{\epsilon}\right)^{2/7}\right\}$$

$$\mathcal{O}\left(\min\left\{\sqrt{\frac{\mu_{1}}{\lambda}},\left(\frac{\mu_{2}D}{\lambda}\right)^{2/7}\right\}\cdot\log\left(\frac{\mu_{1}\mu_{2}^{2}D^{2}}{\lambda^{3}}\right)+\log\log_{2}\left(\frac{\lambda^{3}/\mu_{2}^{2}}{\epsilon}\right)\right).$$

$$\mathcal{O}\left(\min\left\{\sqrt{\frac{\mu_{1}D^{2}}{\epsilon}},\left(\frac{\mu_{2}D^{3}}{\epsilon}\right)^{2/7}\right\}\right).$$

$$c\left(\frac{\mu_{k}D^{k+1}}{(k+1)!\,k\,\epsilon}\right)^{2/(3k+1)},$$

$$\mathcal{O}\left(k\left(\frac{f(\mathbf{w}_{1})-f(\mathbf{w}^{*})}{\epsilon}+\frac{\mu_{k}D^{k+1}}{(k+1)!\,\epsilon}\right)^{1/(k+1)}\right).$$

$$f_{T}(\mathbf{w})=f_{T}(w_{1},w_{2},\ldots)=w_{1}^{2}+\sum_{j=1}^{T-1}(w_{j}-w_{j+1})^{2}+w_{T}^{2}-w_{1}$$

$$f_{T}(\mathbf{w})=|w_{1}|^{k+1}+\sum_{j=1}^{T-1}|w_{j}-w_{j+1}|^{k+1}+|w_{T}|^{k+1}-w_{1}.$$

$$f_{T}(\mathbf{w})=|w_{1}|^{3}+\sum_{j=1}^{T-1}|w_{j}-w_{j+1}|^{3}+|w_{T}|^{3}-3\gamma\cdot w_{1},$$

$$w_{1}^{2}+(w_{1}-w_{2})^{2}=\gamma,\qquad w_{T}^{2}=(w_{T-1}-w_{T})^{2}$$

$$\forall j=2,3,\ldots,T-1,\quad (w_{j-1}-w_{j})^{2}=(w_{j}-w_{j+1})^{2}.$$

$$f_{T}(\mathbf{w})=T(T^{2}+1)\left(\frac{\gamma}{T^{2}+1}\right)^{3/2}-3\frac{\gamma^{3/2}T}{\sqrt{T^{2}+1}}=-\frac{2\gamma^{3/2}T}{\sqrt{T^{2}+1}}.$$

$$2\gamma^{3/2}\left(\frac{2T}{\sqrt{4T^{2}+1}}-\frac{T}{\sqrt{T^{2}+1}}\right)=2\gamma^{3/2}\left(\frac{1}{\sqrt{1+\frac{1}{4T^{2}}}}-\frac{1}{\sqrt{1+\frac{1}{T^{2}}}}\right).$$

$$g(x)=\begin{cases}\frac{1}{3}|x|^{3} & |x|\leq\Delta\\ \Delta x^{2}-\Delta^{2}|x|+\frac{1}{3}\Delta^{3} & |x|>\Delta,\end{cases}$$

$$f(\mathbf{w})=\frac{\mu_{2}}{12}\sum_{i=1}^{\tilde{T}-1}g\left(\langle\mathbf{v}_{i},\mathbf{w}\rangle-\langle\mathbf{v}_{i+1},\mathbf{w}\rangle\right)-\gamma\langle\mathbf{v}_{1},\mathbf{w}\rangle+\frac{\lambda}{2}\|\mathbf{w}\|^{2},$$

$$\tilde{f}(\mathbf{w})=\frac{1}{3}\sum_{i=1}^{\tilde{T}-1}|w_{i}-w_{i+1}|^{3}+\frac{\tilde{\lambda}}{2}\|\mathbf{w}\|^{2}-\gamma\cdot w_{1},$$

Full text

Oracle Complexity of Second-Order Methods

for Smooth Convex Optimization

Yossi Arjevani Ohad Shamir Ron Shiff

Department of Computer Science and Applied Mathematics

Weizmann Institute of Science

{yossi.arjevani,ohad.shamir}@weizmann.ac.il

[email protected]

1 Introduction

We consider an unconstrained optimization problem of the form

$$\min_{\mathbf{w}\in\mathbb{R}^{d}} f(\mathbf{w}), \tag{1}$$

where $f$ is a generic smooth and convex function. A natural and fundamental question is how efficiently we can optimize such functions.

We study this question through the well-known framework of oracle complexity (Nemirovsky and Yudin, 1983), which focuses on iterative methods relying on local information. Specifically, it is assumed that the algorithm’s access to the function $f$ is limited to an oracle, which given a point $\mathbf{w}$ , returns the values and derivatives of the function $f$ at $\mathbf{w}$ . This naturally models standard optimization approaches to unstructured problems such as Eq. (1), and allows one to study their efficiency, by bounding the number of oracle calls required to reach a given optimization error. Different classes of methods can be distinguished by the type of oracle they use. For example, gradient-based methods (such as gradient descent or accelerated gradient descent) rely on a first-order oracle, which returns gradients, whereas methods such as the Newton method rely on a second-order oracle, which returns gradients as well as Hessians.
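To make the oracle model concrete, here is a minimal Python sketch. It is not taken from the paper: the `SecondOrderOracle` wrapper, the one-dimensional test function $f(w)=w^{4}/4+w^{2}/2$, and the number of queries are all hypothetical choices made for illustration. The wrapper counts queries, and the plain Newton iteration driving it only ever uses past oracle answers, as the framework requires.

```python
from typing import Callable, Tuple


class SecondOrderOracle:
    """Hypothetical wrapper: answers a query with (value, gradient, Hessian) and counts calls."""

    def __init__(self, f: Callable[[float], float],
                 grad: Callable[[float], float],
                 hess: Callable[[float], float]) -> None:
        self._f, self._grad, self._hess = f, grad, hess
        self.calls = 0

    def query(self, w: float) -> Tuple[float, float, float]:
        self.calls += 1
        return self._f(w), self._grad(w), self._hess(w)


def newton(oracle: SecondOrderOracle, w1: float, num_queries: int) -> float:
    """Plain 1-D Newton iteration; each new point depends only on past oracle answers."""
    w = w1
    for _ in range(num_queries):
        _, g, h = oracle.query(w)
        w = w - g / h
    return w


# Illustrative smooth, strongly convex function (an assumption, not from the paper).
oracle = SecondOrderOracle(
    f=lambda w: w ** 4 / 4 + w ** 2 / 2,
    grad=lambda w: w ** 3 + w,
    hess=lambda w: 3 * w ** 2 + 1,
)
w_final = newton(oracle, w1=1.0, num_queries=6)
print(oracle.calls, abs(w_final))
```

Oracle complexity is the quantity tracked by `oracle.calls`: it ignores how expensive each Newton step is to compute and counts only how many times the function is queried.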

The theory of first-order oracle complexity is quite well developed (Nemirovsky and Yudin, 1983; Nesterov, 2004; Nemirovski, 2005). For example, if the dimension is unrestricted, $f$ in Eq. (1) has $\mu_{1}$ -Lipschitz gradients, and the algorithm makes its first oracle query at a point $\mathbf{w}_{1}$ , then the worst-case number of queries $T$ required to attain a point $\mathbf{w}_{T}$ satisfying $f(\mathbf{w}_{T})-\min_{\mathbf{w}}f(\mathbf{w})\leq\epsilon$ is

$$\Theta\left(\sqrt{\frac{\mu_{1}D^{2}}{\epsilon}}\right), \tag{2}$$

where $D$ is an upper bound on the distance between $\mathbf{w}_{1}$ and the nearest minimizer of $f$. Moreover, if the function $f$ is also $\lambda$-strongly convex for some $\lambda>0$ (assuming $f$ is twice-differentiable, this corresponds to $\nabla^{2}f(\mathbf{w})\succeq\lambda I$ uniformly for all $\mathbf{w}$), then the oracle complexity bound is

$$\Theta\left(\sqrt{\frac{\mu_{1}}{\lambda}}\cdot\log\left(\frac{\mu_{1}D^{2}}{\epsilon}\right)\right). \tag{3}$$

Both bounds are achievable using accelerated gradient descent (Nesterov, 1983).

However, these bounds do not capture the attainable performance of second-order methods, which rely on gradient as well as Hessian information. This is a central class of optimization methods, including the well-known Newton method and its many variants. Clearly, since these methods rely on Hessians as well as gradients, their oracle complexity can be no worse than that of first-order methods. On the flip side, the per-iteration computational complexity is generally higher, since the additional Hessian information must be processed (especially in high-dimensional problems where the Hessian matrix may be very large). Thus, it is natural to ask how much this added per-iteration complexity pays off in terms of oracle complexity.

To answer this question, one needs good oracle complexity lower bounds for second-order methods, which establish the limits of attainable performance using any such algorithm. Perhaps surprisingly, such results do not seem to currently exist in the literature, and clarifying the oracle complexity of such methods was posed as an important open question (see for example Nesterov, 2008). The goal of this paper is to address this gap.

Specifically, we prove that when the dimension is sufficiently large, for the class of convex functions with $\mu_{1}$ -Lipschitz gradients and $\mu_{2}$ -Lipschitz Hessians, the worst-case oracle complexity of any deterministic algorithm is

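$$\Omega\left(\min\left\{\sqrt{\frac{\mu_{1}D^{2}}{\epsilon}},\left(\frac{\mu_{2}D^{3}}{\epsilon}\right)^{2/7}\right\}\right). \tag{4}$$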

This bound is tight up to constants, as it is matched by a combination of existing methods in the literature (see discussion below). Moreover, if we restrict ourselves to functions which are $\lambda$ -strongly convex, we prove an oracle complexity lower bound of

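$$\Omega\left(\min\left\{\sqrt{\frac{\mu_{1}}{\lambda}},\left(\frac{\mu_{2}}{\lambda}D\right)^{2/7}\right\}+\log\log_{18}\left(\frac{\lambda^{3}/\mu_{2}^{2}}{\epsilon}\right)\right). \tag{5}$$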

Moreover, we establish that this bound is tight up to logarithmic factors (independent of $\epsilon$ ), utilizing a novel adaptation of the A-NPE algorithm proposed in Monteiro and Svaiter (2013) (see Appendix A). These new lower bounds have several implications:

• Perhaps unexpectedly, Eq. (5) establishes that one cannot avoid in general a polynomial dependence on geometry-dependent “condition numbers” of the form $\mu_{1}/\lambda$ or $\mu_{2}D/\lambda$, even with second-order methods. This is despite the ability of such methods to favorably alter the geometry of the problem (for example, the Newton method is well-known to be affine invariant).

• To improve on the oracle complexity of first-order methods for strongly-convex problems (Eq. (3)) by more than logarithmic factors, one cannot avoid a polynomial dependence on the initial distance $D$ to the optimum. This is despite the fact that the dependence on $D$ with first-order methods is only logarithmic. In fact, when $D$ is sufficiently large (of order $\frac{\mu_{1}^{7/4}}{\mu_{2}\lambda^{3/4}}$ or larger), second-order methods cannot improve on the oracle complexity of first-order methods by more than logarithmic factors.

• In the convex case, second-order methods are again no better than first-order methods in certain parameter regimes (i.e., when $\mu_{2}\geq\mu_{1}^{7/4}\sqrt{D}/\epsilon^{3/4}$), despite the availability of more information.

Finally, we show how our proof techniques can be generalized, to establish lower bounds for methods employing higher-order derivatives. In particular, for methods using all derivatives up to order $k$ , we show that for convex functions with $\mu_{k}$ -Lipschitz k-th order derivatives, the oracle complexity is

$$\Omega\left(\left(\frac{\mu_{k}D^{k+1}}{(k+1)!\,k\,\epsilon}\right)^{2/(3k+1)}\right).$$

Note that this directly generalizes Eq. (2) for $k=1$ , and Eq. (4) when $k=2$ and $\mu_{1}$ is unrestricted.

Related Work

Below, we review some pertinent results in the context of second-order methods. Related results in the context of $k$-th order methods are discussed in Subsection 2.2.

Perhaps the most well-known and fundamental second-order method is the Newton method, which relies on iterations of the form $\mathbf{w}_{t+1}=\mathbf{w}_{t}-(\nabla^{2}f(\mathbf{w}_{t}))^{-1}\nabla f(\mathbf{w}_{t})$ (see e.g., Boyd and Vandenberghe (2004)). It is well-known that this method exhibits local quadratic convergence, in the sense that if $f$ is strictly convex, and the method is initialized close enough to the optimum $\mathbf{w}^{*}=\arg\min_{\mathbf{w}}f(\mathbf{w})$, then $\mathcal{O}(\log\log(1/\epsilon))$ iterations suffice to reach a solution $\mathbf{w}$ such that $f(\mathbf{w})-f(\mathbf{w}^{*})\leq\epsilon$. However, in order to get global convergence (starting from an arbitrary point not necessarily close to the optimum), one needs to make some algorithmic modifications, such as introducing a step size parameter or line search, employing trust region methods, or adding various types of regularization (see for example Conn et al. (2000) and references therein). Despite the huge literature on the subject, the worst-case global convergence behavior of these methods is not well understood (Nesterov and Polyak, 2006). For the Newton method with a line search, the number of iterations can be upper bounded by

$$\mathcal{O}\left(\frac{\mu_{1}^{2}\mu_{2}^{2}}{\lambda^{5}}\left(f(\mathbf{w}_{1})-f(\mathbf{w}^{*})\right)+\log\log_{2}\left(\frac{\lambda^{3}/\mu_{2}^{2}}{\epsilon}\right)\right),$$

where $\mu_{1},\mu_{2}$ are the Lipschitz parameters of the gradients and Hessians respectively, and assuming the function is $\lambda$-strongly convex (Kantorovich (1948), see also Boyd and Vandenberghe (2004)). Note that the first term captures the initial phase required to get sufficiently close to $\mathbf{w}^{*}$, whereas the second term captures the quadratically convergent phase. Although the final convergence is rapid, the first phase is the dominant one in the bound (unless $\epsilon$ is exceedingly small). If $f$ is self-concordant (that is, for any vectors $\mathbf{v},\mathbf{w}$, the function $g(t)=f(\mathbf{w}+t\mathbf{v})$ satisfies $|g^{\prime\prime\prime}(t)|\leq 2g^{\prime\prime}(t)^{3/2}$), this can be improved to

$$\mathcal{O}\left(\left(f(\mathbf{w}_{1})-f(\mathbf{w}^{*})\right)+\log\log_{2}\left(\frac{1}{\epsilon}\right)\right),$$

independent of the strong convexity and Lipschitz parameters (Nesterov and Nemirovskii (1994)). Unfortunately, not all practically relevant objective functions are self-concordant. For example, loss functions common in machine learning applications, such as the logistic loss $x\mapsto\log(1+\exp(-x))$, are not self-concordant, and our own results utilize the simple but not self-concordant function $x\mapsto|x|^{3}$. (Such loss functions can often be made self-concordant by re-scaling, smoothing and adding regularization, e.g. Bach (2010), but even when possible, these modifications strongly affect the $f(\mathbf{w}_{1})-f(\mathbf{w}^{*})$ term in the bound, and prevent it from being independent of the strong convexity and Lipschitz parameters.)
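To see concretely why $x\mapsto|x|^{3}$ fails the self-concordance condition $|g^{\prime\prime\prime}(t)|\leq 2g^{\prime\prime}(t)^{3/2}$, the following short calculation compares the two sides. It is a worked check added for illustration, restricted to $x>0$ and using the auxiliary name $h$ for the one-dimensional function:

```latex
\begin{align*}
h(x)&=x^{3}\ (x>0),\qquad h''(x)=6x,\qquad h'''(x)=6,\\
|h'''(x)|\leq 2\,h''(x)^{3/2}
&\iff 6\leq 2\,(6x)^{3/2}
\iff x\geq \frac{3^{2/3}}{6}\approx 0.347 .
\end{align*}
```

The inequality is therefore violated for every $x$ in $(0,3^{2/3}/6)$, that is, in a neighborhood of the origin where the third derivative stays at 6 while the second derivative vanishes.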

Returning to our setting of generic convex and smooth functions, and focusing on strongly convex functions for now, the best existing upper bounds (we are aware of) were obtained for cubic-regularized variants of the Newton method, where at each iteration one essentially minimizes a quadratic approximation of the function at the current point, regularized by a cubic term (Nesterov and Polyak, 2006; Nesterov, 2008). The existing analysis (in section 6 of Nesterov (2008)) implies an oracle complexity bound of at most

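$$\mathcal{O}\left(\left(\frac{\mu_{2}}{\lambda}D\right)^{1/3}+\log\log_{2}\left(\frac{\lambda^{3}/\mu_{2}^{2}}{\epsilon}\right)\right),$$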

where $D=\|\mathbf{w}_{1}-\mathbf{w}^{*}\|$ is the distance from the initialization point $\mathbf{w}_{1}$ to the optimum $\mathbf{w}^{*}$ (see section 6 in Nesterov (2008), as well as Cartis et al. (2012) for another treatment of such cubic-regularized methods). However, as we show in Appendix A, a better oracle complexity bound can be obtained, by adapting the A-NPE method proposed in (Monteiro and Svaiter, 2013) and analyzed for convex functions, to the strongly convex case. The resulting complexity upper bound is

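$$\mathcal{O}\left(\left(\frac{\mu_{2}}{\lambda}D\right)^{2/7}\log\left(\frac{\mu_{1}\mu_{2}^{2}D^{2}}{\lambda^{3}}\right)+\log\log_{2}\left(\frac{\lambda^{3}/\mu_{2}^{2}}{\epsilon}\right)\right). \tag{6}$$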

An alternative to the above is to use a hybrid scheme, starting with accelerated gradient descent (which is an optimal first-order method for strongly convex functions with Lipschitz gradients) and, when close enough to the optimal solution, switching to a cubic-regularized Newton method, which is quadratically converging in that region. (Instead of cubic-regularized Newton, one can also use the standard Newton method, although the resulting bound using the existing analysis will have slightly worse logarithmic factors.) The required number of iterations is then

$$\mathcal{O}\left(\sqrt{\frac{\mu_{1}}{\lambda}}\cdot\log\left(\frac{\mu_{1}\mu_{2}^{2}D^{2}}{\lambda^{3}}\right)+\log\log_{2}\left(\frac{\lambda^{3}/\mu_{2}^{2}}{\epsilon}\right)\right), \tag{7}$$

where $D=\|\mathbf{w}_{1}-\mathbf{w}^{*}\|$ (see Nesterov (2004, 2008)). Clearly, by taking the best of Eq. (6) and Eq. (7) (depending on the parameters), one can theoretically attain an oracle complexity which is the minimum of Eq. (6) and Eq. (7). This minimum matches (up to logarithmic factors) the lower bound in Eq. (5), which we establish in this paper.

It is interesting to note that the bounds in Eq. (6) and Eq. (7) are not directly comparable: The first bound has a polynomial dependence on $\mu_{2}/\lambda$ and $\|\mathbf{w}_{1}-\mathbf{w}^{*}\|$ , and a logarithmic dependence on $\mu_{1}$ , whereas the second bound has a polynomial dependence on $\mu_{1}/\lambda$ , logarithmic dependence on $\|\mathbf{w}_{1}-\mathbf{w}^{*}\|$ , and a logarithmic dependence on $\mu_{2}$ . In a rather wide parameter regime (e.g. when $D$ is reasonably large, as often occurs in practice), the bound of the hybrid scheme can be better than that of pure second-order methods. In light of this, Nesterov (2008) raised the question of whether second-order schemes are indeed useful at the initial stage of the optimization process, for these types of problems. Our results indicate that indeed, in certain parameter regimes, this is not the case.

Analogous results can be obtained for convex (not necessarily strongly convex) smooth functions. Using an appropriate analysis of the accelerated cubic-regularized Newton method (Nesterov, 2008), one can attain a bound of

$$\mathcal{O}\left(\left(\frac{\mu_{2}D^{3}}{\epsilon}\right)^{1/3}\right).$$

More recently, Monteiro and Svaiter (2013) proposed an accelerated hybrid proximal extragradient method, denoted as A-NPE, which attains a better bound of

$$\mathcal{O}\left(\left(\frac{\mu_{2}D^{3}}{\epsilon}\right)^{2/7}\right). \tag{8}$$

In addition, using an optimal first-order method (such as accelerated gradient descent), one can attain a bound of

$$\mathcal{O}\left(\sqrt{\frac{\mu_{1}D^{2}}{\epsilon}}\right). \tag{9}$$

Clearly, by taking the best of the last two approaches (depending on the problem parameters), one can attain an oracle complexity equal to the minimum of the two bounds in Eq. (8) and Eq. (9). This is matched (up to constants) by the lower bound in Eq. (4), which we establish in this paper.

Finally, we discuss the few existing lower bounds known for second-order methods. If $\mu_{2}$ is not bounded (i.e., the Hessians are not Lipschitz), it is easy to show that Hessian information is not useful. Specifically, the lower bound of Eq. (2) for first-order methods will then also apply to second-order methods, and in fact, to any method based on local information (see Nemirovsky and Yudin (1983, section 7.2.6) and Arjevani and Shamir (2016b)). Of course, this lower bound does not apply to second-order methods when $\mu_{2}$ is bounded. In our setting, it is also possible to prove an $\Omega(\log\log(1/\epsilon))$ lower bound, even in one dimension (Nemirovsky and Yudin, 1983, section 8.1.1), but this does not capture the dependence on the strong convexity and Lipschitz parameters. Some algorithm-specific lower bounds in the context of non-convex optimization are provided in Cartis et al. (2010). Finally, we were recently informed of a new work (Agarwal and Hazan (2017), yet unpublished at the time of writing), which uses a clean and elegant smoothing approach to derive second- and higher-order oracle lower bounds directly from known first-order oracle lower bounds, as well as extensions to randomized algorithms. However, the resulting bounds are not as tight as ours.

2 Main Results

In this section, we formally present our main results, starting with second-order oracle complexity bounds (Subsection 2.1), and then discussing extensions to higher-order oracles (Subsection 2.2).

2.1 Second-order Oracle

We consider a second-order oracle, which given a point $\mathbf{w}$ returns the function’s value $f(\mathbf{w})$ , its gradient $\nabla f(\mathbf{w})$ and its Hessian $\nabla^{2}f(\mathbf{w})$ , and algorithms, which produce a sequence of points $\mathbf{w}_{1},\mathbf{w}_{2},...,\mathbf{w}_{T}$ , with each $\mathbf{w}_{t}$ being some deterministic function of the oracle’s responses at $\mathbf{w}_{1},\ldots,\mathbf{w}_{t-1}$ . Our main results (for strongly convex and convex functions $f$ respectively) are provided below.

Theorem 1.

For any positive $\mu_{1},\mu_{2},\lambda,D,\epsilon$ such that $\frac{\mu_{1}}{\lambda}\geq c_{1},\frac{\mu_{2}}{\lambda}D\geq c_{2}$ and $\epsilon<\frac{c_{3}\lambda^{3}}{\mu_{2}^{2}}$ (for some positive universal constants $c_{1},c_{2},c_{3}$ ), and any algorithm as above with initialization point $\mathbf{w}_{1}$ , there exists a function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ (for some finite $d$ ) such that

• $f$ is $\lambda$-strongly convex, twice-differentiable, has $\mu_{1}$-Lipschitz gradients and $\mu_{2}$-Lipschitz Hessians, and has a global minimum $\mathbf{w}^{*}$ satisfying $\|\mathbf{w}_{1}-\mathbf{w}^{*}\|\leq D$.

• The index $T$ required to ensure $f(\mathbf{w}_{T})-f(\mathbf{w}^{*})\leq\epsilon$ is at least

$$c\cdot\left(\min\left\{\sqrt{\frac{\mu_{1}}{\lambda}},\left(\frac{\mu_{2}}{\lambda}D\right)^{2/7}\right\}+\log\log_{18}\left(\frac{\lambda^{3}/\mu_{2}^{2}}{\epsilon}\right)\right)$$

for some universal constant $c>0$.

Theorem 2.

For any positive $\mu_{1},\mu_{2},D,\epsilon$ and any algorithm as above with initialization point $\mathbf{w}_{1}$ , there exists a function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ (for some finite $d$ ) such that

• $f$ is convex, twice-differentiable, has $\mu_{1}$-Lipschitz gradients and $\mu_{2}$-Lipschitz Hessians, and has a global minimum $\mathbf{w}^{*}$ satisfying $\|\mathbf{w}_{1}-\mathbf{w}^{*}\|\leq D$.

• The index $T$ required to ensure $f(\mathbf{w}_{T})-f(\mathbf{w}^{*})\leq\epsilon$ is at least

$$c\cdot\min\left\{\sqrt{\frac{\mu_{1}D^{2}}{\epsilon}},\left(\frac{\mu_{2}D^{3}}{\epsilon}\right)^{2/7}\right\}$$

for some universal constant $c>0$.

We emphasize that the theorems focus on the high-dimensional setting, where the dimension $d$ is not necessarily fixed and may depend on other problem parameters. Also, we note that the parameter constraints in Thm. 1 are purely for technical reasons (they imply that the different terms in the bound are at least some positive constant), and can probably be relaxed somewhat.

Let us compare these theorems to the upper bounds discussed in the related work section, which are

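$$\mathcal{O}\left(\min\left\{\sqrt{\frac{\mu_{1}}{\lambda}},\left(\frac{\mu_{2}D}{\lambda}\right)^{2/7}\right\}\cdot\log\left(\frac{\mu_{1}\mu_{2}^{2}D^{2}}{\lambda^{3}}\right)+\log\log_{2}\left(\frac{\lambda^{3}/\mu_{2}^{2}}{\epsilon}\right)\right)$$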

in the strongly convex case, and

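$$\mathcal{O}\left(\min\left\{\sqrt{\frac{\mu_{1}D^{2}}{\epsilon}},\left(\frac{\mu_{2}D^{3}}{\epsilon}\right)^{2/7}\right\}\right)$$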

in the convex case. Our bound in the convex case is tight up to constants, and in the strongly convex case, up to a $\log(\mu_{1}\mu_{2}^{2}D^{2}/\lambda^{3})$ factor. We conjecture that some such logarithmic factor (possibly a smaller one) is indeed necessary, in order to get a tight interpolation to the $\Omega(\sqrt{\mu_{1}/\lambda}\cdot\log(\mu_{1}D^{2}/\epsilon))$ lower bound of first-order methods as $\mu_{2}\rightarrow\infty$ (see Nemirovsky and Yudin (1983, section 7.2.6) and Arjevani and Shamir (2016b)), and that it can be recovered with a more careful analysis of our construction. However, this involves some non-trivial technical challenges, which we leave to future work.

Comparing the lower and upper bounds in the strongly convex case, one can make the following observations:

• The lower bound captures the two phases common in second-order methods such as the Newton method: An initial slow convergence from the initialization point to the local neighborhood of the optimum (captured by the $\min\left\{\sqrt{\frac{\mu_{1}}{\lambda}},\left(\frac{\mu_{2}}{\lambda}D\right)^{2/7}\right\}$ term), followed by a fast local quadratic convergence to the optimum (captured by the second term, which is doubly-logarithmic in the accuracy $\epsilon$).

• Unless $\epsilon$ is exceedingly small, the oracle complexity is dominated by the geometry-dependent terms $\mu_{1}/\lambda$ and $\mu_{2}D/\lambda$. This is despite the fact that second-order methods can use Hessian information to alter the geometry of the problem (for example, the Newton method is well-known to be affine invariant).

• If $\mu_{2}D/\lambda$ is sufficiently large (specifically, if $D$ is of order $\frac{\mu_{1}^{7/4}}{\mu_{2}\lambda^{3/4}}$ or larger), then the lower bound becomes at least $\sqrt{\mu_{1}/\lambda}$, which is no better than what can be obtained with first-order methods up to logarithmic factors (see Eq. (3)). Since $D$ often scales inversely with the strong convexity of the problem (e.g. since the strong convexity is due to a regularization term), this is a rather broad and reasonable regime.

• On the other hand, if $\mu_{2}D/\lambda$ is smaller than $\sqrt{\mu_{1}/\lambda}$, then the oracle complexity can be significantly better than that of first-order methods, but this still comes at the inevitable price of a polynomial dependence on the distance $D$ from the optimum. In contrast, first-order methods have only a logarithmic dependence on $D$ (see Eq. (3)).

In the convex case, one can draw conclusions similar to those of the strongly convex case regarding the behavior of first- and second-order methods. Namely, if $\mu_{2}D^{3}/\epsilon$ is large enough (specifically, if $\mu_{2}\geq\mu_{1}^{7/4}\sqrt{D}/\epsilon^{3/4}$), the complexity of second-order methods is not significantly better than what can be obtained with first-order methods.
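The two crossover conditions quoted in this subsection can be checked numerically. The sketch below is an illustration only: it drops all universal constants and logarithmic factors, the function names are hypothetical, and the parameter values are arbitrary. It evaluates the two terms inside each $\min$ at the stated thresholds $\mu_{2}=\mu_{1}^{7/4}\sqrt{D}/\epsilon^{3/4}$ (convex case) and $D=\mu_{1}^{7/4}/(\mu_{2}\lambda^{3/4})$ (strongly convex case).

```python
import math
from typing import Tuple


def convex_terms(mu1: float, mu2: float, D: float, eps: float) -> Tuple[float, float]:
    """Two terms inside the min of the convex-case bound, constants dropped."""
    return math.sqrt(mu1 * D ** 2 / eps), (mu2 * D ** 3 / eps) ** (2 / 7)


def strongly_convex_terms(mu1: float, mu2: float, lam: float, D: float) -> Tuple[float, float]:
    """Two terms inside the min of the strongly convex bound, constants dropped."""
    return math.sqrt(mu1 / lam), (mu2 * D / lam) ** (2 / 7)


def mu2_crossover(mu1: float, D: float, eps: float) -> float:
    """Convex case: value of mu2 at which both terms coincide."""
    return mu1 ** 1.75 * math.sqrt(D) / eps ** 0.75


def D_crossover(mu1: float, mu2: float, lam: float) -> float:
    """Strongly convex case: value of D at which both terms coincide."""
    return mu1 ** 1.75 / (mu2 * lam ** 0.75)


# Hypothetical parameter values, chosen only for illustration.
mu1, lam, D, eps = 4.0, 0.01, 10.0, 1e-4
mu2_star = mu2_crossover(mu1, D, eps)
first, second = convex_terms(mu1, mu2_star, D, eps)
print(first, second)  # equal up to rounding at the crossover

mu2 = 2.0
D_star = D_crossover(mu1, mu2, lam)
first_sc, second_sc = strongly_convex_terms(mu1, mu2, lam, D_star)
print(first_sc, second_sc)  # equal up to rounding at the crossover
```

Above either threshold the first-order term is the smaller one, so the lower bound already matches what first-order methods achieve; below it, the $2/7$-power term is smaller and second-order information can help.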

2.2 Higher Order Oracles

In addition to first-order and second-order oracles, it is of interest to understand what can be achieved with methods employing higher order derivatives. It turns out that the techniques we use to establish our second-order lower bounds can be easily generalized to such higher-order methods.

More explicitly, we consider methods which can be modelled as interacting with a k-th order oracle, which given a point $\mathbf{w}$ returns the function’s value and all of its derivatives up to order $k$ , namely, $f(\mathbf{w}),\nabla f(\mathbf{w}),\nabla^{2}f(\mathbf{w}),\ldots,\nabla^{k}f(\mathbf{w})$ . Given access to such an oracle, the method produces a sequence of points $\mathbf{w}_{1},\mathbf{w}_{2},\ldots,\mathbf{w}_{T}$ as before (where each $\mathbf{w}_{t}$ is a deterministic function of the previous oracle responses). For simplicity, we will focus here on the case of convex functions (not necessarily strongly convex), where the $k$ -th order derivative is Lipschitz continuous.

Theorem 3.

For any positive integer $k$ , positive $\mu_{k},D,\epsilon$ , and algorithm based on a $k$ -th order oracle as above, there exists a function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ (for some finite $d$ ) such that

• $f$ is convex, $k$ times differentiable, $k$-order smooth (i.e., $\|\nabla^{k}f(\mathbf{u})-\nabla^{k}f(\mathbf{v})\|\leq\mu_{k}\|\mathbf{u}-\mathbf{v}\|$) and has a global minimum $\mathbf{w}^{*}$ satisfying $\|\mathbf{w}_{1}-\mathbf{w}^{*}\|\leq D$.

• The index $T$ required to ensure $f(\mathbf{w}_{T})-f(\mathbf{w}^{*})\leq\epsilon$ is at least

$$c\left(\frac{\mu_{k}D^{k+1}}{(k+1)!\,k\,\epsilon}\right)^{2/(3k+1)},$$

for some universal constant $c>0$.

Note that this result directly generalizes existing results for first-order oracles ( $k=1$ ), as well as our results for second-order oracles ( $k=2$ , when $\mu_{1}$ is unrestricted).
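As a quick sanity check of this specialization (a worked instance written out here, with the universal constant $c$ dropped), substituting $k=1$ and $k=2$ into the bound $\left(\frac{\mu_{k}D^{k+1}}{(k+1)!\,k\,\epsilon}\right)^{2/(3k+1)}$ gives:

```latex
\begin{align*}
k=1:&\quad \left(\frac{\mu_{1}D^{2}}{2!\cdot 1\cdot\epsilon}\right)^{2/4}
      =\sqrt{\frac{\mu_{1}D^{2}}{2\epsilon}}
      =\Omega\left(\sqrt{\frac{\mu_{1}D^{2}}{\epsilon}}\right),\\
k=2:&\quad \left(\frac{\mu_{2}D^{3}}{3!\cdot 2\cdot\epsilon}\right)^{2/7}
      =\left(\frac{\mu_{2}D^{3}}{12\,\epsilon}\right)^{2/7}
      =\Omega\left(\left(\frac{\mu_{2}D^{3}}{\epsilon}\right)^{2/7}\right).
\end{align*}
```

The first line has the same form as the first-order bound in Eq. (2), and the second line is the Hessian-dependent term of Theorem 2.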

Finally, we compare our lower bound to the best upper bound we are aware of, established by Baes (2009) using a high-order method with oracle complexity of

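$$\mathcal{O}\left(k\left(\frac{f(\mathbf{w}_{1})-f(\mathbf{w}^{*})}{\epsilon}+\frac{\mu_{k}D^{k+1}}{(k+1)!\,\epsilon}\right)^{1/(k+1)}\right).$$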

Note that the upper bound contains an additional $(f(\mathbf{w}_{1})-f(\mathbf{w}^{*}))/\epsilon$ term, and moreover, the exponent (as a function of $k$ ) is larger than ours ( $1/(k+1)$ vs. $2/(3k+1)$ ). Based on our results, we know that this upper bound is loose in the $k=2$ case, so we conjecture that it is indeed loose for all $k$ , and can be improved.

3 Proof Ideas

The proofs of our theorems are based on a careful modification of a standard lower bound construction for first-order methods (see Nesterov (2004)). That construction uses quadratic functions, which in the convex case and ignoring various parameters, have a basic structure of the form

$$f_{T}(\mathbf{w})=f_{T}(w_{1},w_{2},\ldots)=w_{1}^{2}+\sum_{j=1}^{T-1}(w_{j}-w_{j+1})^{2}+w_{T}^{2}-w_{1}$$

(more precisely, one considers $f_{T}(V\mathbf{w})$ for a certain orthogonal matrix $V$, and uses additional parameters depending on the smoothness). A crucial ingredient of the proof is that the function $x\mapsto x^{2}$ has a value and derivative of zero at the origin, which allows us to construct a function which “hides” information from an algorithm relying solely on values and gradients. This can be shown to lead to an optimization error lower bound of the form $\min_{\mathbf{w}}f_{T}(\mathbf{w})-\min_{\mathbf{w}}f_{2T}(\mathbf{w})$ after $T$ oracle queries, which for first-order methods leads to an $\Omega(\mu_{1}D^{2}/T^{2})$ lower bound on the error, translating to an $\Omega(\sqrt{\mu_{1}D^{2}/\epsilon})$ lower bound on $T$. However, this construction leads to trivial bounds for second-order methods, since given the Hessian and a gradient of a quadratic function at just a single point, one can already compute the exact minimizer.

Our approach to handle second-order (and more generally, $k$ -order) methods is quite simple: Instead of $x\mapsto x^{2}$ , we rely on mappings of the form $x\mapsto|x|^{k+1}$ , and use functions with the basic structure

$$f_{T}(\mathbf{w})=|w_{1}|^{k+1}+\sum_{j=1}^{T-1}|w_{j}-w_{j+1}|^{k+1}+|w_{T}|^{k+1}-w_{1}.$$

The intuition is that $x\mapsto|x|^{k+1}$ has a value and first k derivatives of zero at the origin, and therefore variants of the function above can be used to “hide” information from the algorithm, even if it can receive Hessians or higher-order derivatives of the function. Another motivation for choosing such functions is that they are generally not self-concordant, and therefore the upper bounds relevant to self-concordant functions do not apply. We rely on this construction and arguments similar to those of first-order oracle lower bounds, to get our results.
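The “hiding” effect can be observed directly for $k=2$. The following sketch is an illustration under assumptions, not the paper's construction in full: it uses the cubic chain $|w_{1}|^{3}+\sum_{j}|w_{j}-w_{j+1}|^{3}+|w_{T}|^{3}-3\gamma w_{1}$ without the orthogonal rotation or smoothness parameters, assumes $T\geq 2$, and the helper names are hypothetical. It computes the exact gradient and Hessian and reports the largest coordinate index that a second-order oracle answer touches.

```python
from typing import List, Tuple


def grad_and_hess(w: List[float], gamma: float) -> Tuple[List[float], List[List[float]]]:
    """Gradient and Hessian of |w_1|^3 + sum_j |w_j - w_{j+1}|^3 + |w_T|^3 - 3*gamma*w_1 (T >= 2)."""
    T = len(w)
    grad = [0.0] * T
    hess = [[0.0] * T for _ in range(T)]
    # Boundary terms: d/du |u|^3 = 3*u*|u| and d^2/du^2 |u|^3 = 6*|u|.
    for idx in (0, T - 1):
        grad[idx] += 3 * w[idx] * abs(w[idx])
        hess[idx][idx] += 6 * abs(w[idx])
    # Chain terms |w_j - w_{j+1}|^3 couple neighbouring coordinates.
    for j in range(T - 1):
        u = w[j] - w[j + 1]
        grad[j] += 3 * u * abs(u)
        grad[j + 1] -= 3 * u * abs(u)
        hess[j][j] += 6 * abs(u)
        hess[j + 1][j + 1] += 6 * abs(u)
        hess[j][j + 1] -= 6 * abs(u)
        hess[j + 1][j] -= 6 * abs(u)
    grad[0] -= 3 * gamma  # linear term -3*gamma*w_1
    return grad, hess


def last_active_index(grad: List[float], hess: List[List[float]]) -> int:
    """Largest 1-based coordinate touched by a nonzero gradient or Hessian entry (0 if none)."""
    T = len(grad)
    active = [i for i in range(T)
              if grad[i] != 0.0 or any(hess[i][j] != 0.0 for j in range(T))]
    return max(active) + 1 if active else 0


T, gamma = 8, 1.0
g0, H0 = grad_and_hess([0.0] * T, gamma)   # query at the origin
w = [0.5, 0.2, 0.1] + [0.0] * (T - 3)      # query supported on the first 3 coordinates
g3, H3 = grad_and_hess(w, gamma)
print(last_active_index(g0, H0), last_active_index(g3, H3))
```

At the origin the Hessian is identically zero and the gradient involves only the first coordinate; a query supported on the first three coordinates yields information about at most the fourth. Each query can therefore uncover at most one new coordinate, which is the mechanism the lower bound exploits.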

In the derivation of our results for second-order methods, there are two technical challenges that need to be overcome: The first is that $f_{T}$ , as defined above (for $k=2$ ), can be shown to have globally Lipschitz Hessians, but not globally Lipschitz gradients as required by our theorems. To tackle this, we replace the mapping $x\mapsto|x|^{3}$ by a more complicated mapping, which is cubic close to the origin and quadratic further away. This necessarily complicates the proof. The second challenge is that due to the cubic terms, computing the minimizer of $f_{T}$ and its minimal value is more challenging than in first-order lower bounds, especially in the strongly convex case (where we are unable to even find a closed-form expression for the minimizer, and resort to bounds instead). Again, this makes the analysis more complicated.
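One mapping of this kind is the piecewise function $g$ listed among the page's equations: $\frac{1}{3}|x|^{3}$ for $|x|\leq\Delta$ and $\Delta x^{2}-\Delta^{2}|x|+\frac{1}{3}\Delta^{3}$ for $|x|>\Delta$. The sketch below is a numerical check added for illustration, with hand-derived derivative formulas and an arbitrary $\Delta$ as assumptions. It confirms that the two pieces glue together smoothly and that the second derivative is capped at $2\Delta$, which is what restores globally Lipschitz gradients.

```python
def g(x: float, delta: float) -> float:
    """Cubic for |x| <= delta, quadratic beyond."""
    a = abs(x)
    if a <= delta:
        return a ** 3 / 3
    return delta * x ** 2 - delta ** 2 * a + delta ** 3 / 3


def g_prime(x: float, delta: float) -> float:
    """First derivative of g (derived by hand from the two branches)."""
    a = abs(x)
    sign = 1.0 if x >= 0 else -1.0
    if a <= delta:
        return sign * a ** 2
    return sign * (2 * delta * a - delta ** 2)


def g_second(x: float, delta: float) -> float:
    """Second derivative of g: 2|x| near the origin, constant 2*delta further away."""
    return 2 * min(abs(x), delta)


delta = 0.5  # arbitrary illustrative value
grid = [i / 100 for i in range(-300, 301)]
print(g(delta, delta), g_prime(delta, delta), max(g_second(x, delta) for x in grid))
```

Since the second derivative equals $2\min\{|x|,\Delta\}$, the first derivative of $g$ is $2\Delta$-Lipschitz and the second derivative is $2$-Lipschitz, while $g$ still has zero value, slope and curvature at the origin.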

We conclude this section by sketching how our bounds can be derived in the case of second-order methods, in the simplest possible setting, where we wish to obtain an $\Omega((D^{3}/\epsilon)^{2/7})$ lower bound for the class of convex functions with Lipschitz Hessians (and no assumptions on the Lipschitz parameter of the gradients), assuming the algorithm makes its first query at the origin. In that case, consider the function $f_{T}$ in this class of the form

[TABLE]

where $\gamma$ is a parameter to be chosen later. Computing the derivatives and setting to zero, and arguing that the minimizer must have non-negative coordinates, we get that the optimum satisfies

[TABLE]

and

[TABLE]

It can be verified that this is satisfied by $w_{j}=(T+1-j)\sqrt{\frac{\gamma}{T^{2}+1}}$ for all $j=1,2,\ldots,T$ , and that this is the unique minimizer of $f_{T}$ as a function on $\mathbb{R}^{T}$ . Moreover, assuming $\gamma\leq D^{2}/T$ , the norm of this minimizer (and hence the initial distance to it from the algorithm’s first query point, by assumption) is at most $D$ as required. Plugging this $\mathbf{w}$ into $f_{T}$ , we get that $\min_{\mathbf{w}}f_{T}(\mathbf{w})$ equals

[TABLE]

Now, using arguments very similar to those in first-order oracle complexity lower bounds (Nesterov, 2004), it is possible to construct a function for which the optimization error of the algorithm is lower bounded by $\min_{\mathbf{w}}f_{T}(\mathbf{w})-\min_{\mathbf{w}}f_{2T}(\mathbf{w})$ . By the calculations above, this in turn equals

[TABLE]

Using the fact that $\frac{1}{\sqrt{1+x}}\approx 1-\frac{1}{2}x$ for small $x$ , this equals $\Omega(\gamma^{3/2}/T^{2})$ . Choosing $\gamma$ on the order of $D^{2}/T$ (as required earlier to satisfy the norm constraint on the minimizer), we get a lower bound of $\Omega(D^{3}/T^{7/2})$ on the optimization error $\epsilon$ , or equivalently, a lower bound of $\Omega((D^{3}/\epsilon)^{2/7})$ on $T$ .
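The sketch above can be checked numerically. The displayed definition of $f_{T}$ is not reproduced here, so the code below assumes the form $f_{T}(\mathbf{w})=|w_{1}|^{3}+\sum_{i=1}^{T-1}|w_{i}-w_{i+1}|^{3}+|w_{T}|^{3}-3\gamma w_{1}$. This is an assumption, chosen because its stationarity conditions are solved by the stated minimizer $w_{j}=(T+1-j)\sqrt{\gamma/(T^{2}+1)}$; the constants in the paper's definition may differ. All function names are illustrative.

```python
import math


def f_T(w, gamma):
    """Assumed hard function: |w_1|^3 + sum_i |w_i - w_{i+1}|^3 + |w_T|^3 - 3*gamma*w_1."""
    chain = sum(abs(a - b) ** 3 for a, b in zip(w, w[1:]))
    return abs(w[0]) ** 3 + chain + abs(w[-1]) ** 3 - 3.0 * gamma * w[0]


def grad_f_T(w, gamma):
    """Gradient of the assumed f_T; d/dx |x|^3 = 3*x*|x|."""
    d3 = lambda x: 3.0 * x * abs(x)
    g = [0.0] * len(w)
    g[0] += d3(w[0]) - 3.0 * gamma
    for i in range(len(w) - 1):
        s = d3(w[i] - w[i + 1])
        g[i] += s
        g[i + 1] -= s
    g[-1] += d3(w[-1])
    return g


def claimed_minimizer(T, gamma):
    """w_j = (T + 1 - j) * sqrt(gamma / (T^2 + 1)), as stated in the text."""
    delta = math.sqrt(gamma / (T * T + 1))
    return [(T + 1 - j) * delta for j in range(1, T + 1)]


def gap(T, gamma):
    """min f_T - min f_{2T}, evaluated at the claimed minimizers."""
    return f_T(claimed_minimizer(T, gamma), gamma) - f_T(claimed_minimizer(2 * T, gamma), gamma)


D = 1.0
for T in (10, 100):
    gamma = D * D / T  # the choice gamma ~ D^2 / T from the text
    w = claimed_minimizer(T, gamma)
    grad_inf = max(abs(x) for x in grad_f_T(w, gamma))
    norm = math.sqrt(sum(x * x for x in w))
    # If the gap scales as D^3 / T^(7/2), this rescaled value stays roughly constant.
    print(T, grad_inf, norm, gap(T, gamma) * T ** 3.5 / D ** 3)
```

Under this assumed form, the gradient vanishes at the claimed minimizer, its norm stays below $D$, and the rescaled gap stays close to a constant, in line with the $\Omega(D^{3}/T^{7/2})$ scaling.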

4 Proof of Thm. 1

We will assume without loss of generality that the algorithm initializes at $\mathbf{w}_{1}=\mathbf{0}$ (if that is not the case, one can simply replace the “hard” function $f(\mathbf{w})$ below by $f(\mathbf{w}-\mathbf{w}_{1})$ , and the same proof holds verbatim). Thus, the theorem requires that our function has a minimizer $\mathbf{w}^{*}$ satisfying $\|\mathbf{w}^{*}\|\leq D$ .

Let $\Delta,\gamma$ be parameters to be chosen later. Define $g:\mathbb{R}\mapsto\mathbb{R}$ as

[TABLE]

which is easily verified to be convex and twice continuously differentiable, and let $\mathbf{v}_{1},\mathbf{v}_{2},\ldots,\mathbf{v}_{\tilde{T}}$ be orthogonal unit vectors in $\mathbb{R}^{d}$ which will be specified later. Letting the number of iterations $T$ be fixed, we consider the function

[TABLE]

where $\tilde{T}\geq\max\left\{4\gamma\left(\frac{\mu_{2}}{6\lambda}\right)^{2}+1,2T,\frac{\gamma\mu_{2}}{6\lambda}+1\right\}$ is some sufficiently large number, and the dimension $d$ is at least $2\tilde{T}$ .

The proof is constructed of several parts: First, we analyze properties of the global minimum of $f$ (Subsection 4.1). Then, we prove the oracle complexity lower bound in Subsection 4.2 (depending on $\Delta,\gamma$ ), and finally, in Subsection 4.3, we choose the parameters so that $f$ indeed has the various geometric properties specified in the theorem.

4.1 Minimizer of $f$

The goal of this subsection is to prove the following proposition, which characterizes key properties of the global minimum of $f$ :

Proposition 1.

Suppose that $\gamma\geq 10^{4}\left(\frac{\lambda}{\mu_{2}}\right)^{2}$ and $\Delta\geq\sqrt{\gamma}$ . Then $f$ has a unique minimizer $\mathbf{w}^{*}$ which satisfies the following:

1. For any $t\in\{1,2,\ldots,\tilde{T}\}$, it holds that $\langle\mathbf{v}_{t},\mathbf{w}^{*}\rangle\geq\max\left\{0,\ \frac{\gamma^{3/4}}{7\sqrt{12\lambda/\mu_{2}}}+\sqrt{\gamma}\left(\frac{1}{2}-t\right)\right\}$.

2. There exists some $t_{0}\leq\tilde{T}/2$ such that for all indices $k\in\{0,1,\ldots,\tilde{T}-t_{0}\}$, it holds that $\langle\mathbf{v}_{t_{0}+k},\mathbf{w}^{*}\rangle\geq\frac{108\lambda}{\mu_{2}}\cdot(18)^{-2^{k}}$.

3. $\|\mathbf{w}^{*}\|^{2}\leq\frac{2\gamma^{7/4}}{(12\lambda/\mu_{2})^{3/2}}$.

Since $f$ is strongly convex, its global minimizer is unique and well-defined. To prove the proposition, we will consider the simpler strongly-convex function

[TABLE]

where

[TABLE]

and prove that its minimizer $\tilde{\mathbf{w}}^{*}$ satisfies the following:

1. For any $t\in\{1,2,\ldots,\tilde{T}\}$, it holds that $\tilde{w}^{*}_{t}\geq\max\left\{0,\ \frac{\gamma^{3/4}}{7\sqrt{\tilde{\lambda}}}+\sqrt{\gamma}\left(\frac{1}{2}-t\right)\right\}$ (Lemma 2).

2. There exists some $t_{0}\leq\tilde{T}/2$ such that for all $k\in\{0,1,\ldots,\tilde{T}-t_{0}\}$, it holds that $\tilde{w}^{*}_{t_{0}+k}\geq 9\tilde{\lambda}\cdot(18)^{-2^{k}}$ (Lemma 3).

3. $\sum_{i=1}^{\tilde{T}}(\tilde{w}^{*}_{i})^{2}\leq\frac{2\gamma^{7/4}}{\tilde{\lambda}^{3/2}}$ (Lemma 4).

We then argue that the minimizer $\mathbf{w}^{*}$ of $f$ satisfies $\langle\mathbf{v}_{i},\mathbf{w}^{*}\rangle=\tilde{w}^{*}_{i}$ for all $i=1,2,\ldots,\tilde{T}$ (Lemma 5), and that $\|\mathbf{w}^{*}\|^{2}=\sum_{i=1}^{\tilde{T}}\langle\mathbf{v}_{i},\mathbf{w}^{*}\rangle^{2}$ (Lemma 6), from which Proposition 1 follows.

We begin with the following key technical result:

Lemma 1.

It holds that $\tilde{w}^{*}_{1}\geq\tilde{w}^{*}_{2}\geq\cdots\geq\tilde{w}^{*}_{\tilde{T}}\geq 0$ , and

[TABLE]

Moreover, $\sum_{j=1}^{\tilde{T}}\tilde{w}^{*}_{j}=\frac{\gamma}{\tilde{\lambda}}$ .

Proof.

We begin by showing that $\tilde{w}^{*}_{j}\geq 0$ for all $j$, first for $j=1$ and then for $j>1$. To do so, note that $\tilde{f}(\mathbf{0})=0$ yet $\nabla\tilde{f}(\mathbf{0})=-\gamma\cdot\mathbf{e}_{1}\neq\mathbf{0}$, and therefore $\mathbf{0}$ is a sub-optimal point. Thus, we must have $\tilde{f}(\tilde{\mathbf{w}}^{*})<0$. The only negative term in the definition of $\tilde{f}(\cdot)$ is $-\gamma\cdot w_{1}$, so we must have $\tilde{w}^{*}_{1}>0$. We now argue that $\tilde{w}^{*}_{j}\geq 0$ for all $j>1$: Let $\mathbf{w}$ be the vector with coordinates $w_{j}=|\tilde{w}^{*}_{j}|$ for all $j$, and note that $w_{1}=\tilde{w}^{*}_{1}$ since we just showed $\tilde{w}^{*}_{1}>0$. Based on this, it is easily verified that

[TABLE]

which means that $\mathbf{w}$ is the (unique) minimum of $\tilde{f}$ , hence $\mathbf{w}=\tilde{\mathbf{w}}^{*}$ . By definition of $\mathbf{w}$ , this implies $\tilde{w}^{*}_{j}=|\tilde{w}^{*}_{j}|$ for all $j$ , hence $\tilde{w}^{*}_{j}\geq 0$ for all $j$ .

We now turn to prove that $\tilde{w}^{*}_{j}$ is monotonically decreasing in $j$ . Suppose on the contrary that this is not the case, and let $j_{0}$ be the smallest index for which $\tilde{w}^{*}_{j_{0}}<\tilde{w}^{*}_{j_{0}+1}$ , and let $\delta:=\tilde{w}^{*}_{j_{0}+1}-\tilde{w}^{*}_{j_{0}}>0$ . Define the vector $\mathbf{w}$ to be

[TABLE]

Note that this vector must be different from $\tilde{\mathbf{w}}^{*}$, as $w_{j_{0}+1}=\max\{0,\tilde{w}^{*}_{j_{0}+1}-\delta\}=\max\{0,\tilde{w}^{*}_{j_{0}}\}=\tilde{w}^{*}_{j_{0}}=w_{j_{0}}$, hence $w_{j_{0}+1}=w_{j_{0}}$ yet $\tilde{w}^{*}_{j_{0}+1}>\tilde{w}^{*}_{j_{0}}$ by assumption. On the other hand, it is easily verified that $|w_{i}-w_{i+1}|^{3}\leq|\tilde{w}^{*}_{i}-\tilde{w}^{*}_{i+1}|^{3}$ and $w_{i}^{2}\leq(\tilde{w}^{*}_{i})^{2}$ for all $i$ (see the footnote that follows this paragraph), and therefore $\tilde{f}(\mathbf{w})\leq\tilde{f}(\tilde{\mathbf{w}}^{*})$. But since $\tilde{\mathbf{w}}^{*}$ is the unique global minimizer and $\mathbf{w}\neq\tilde{\mathbf{w}}^{*}$, we get a contradiction, so we must have $\tilde{w}^{*}_{j}$ monotonically decreasing for all $j$.

(Footnote 5: This is trivially true for $i<j_{0}$. For $i=j_{0}$, we have $|w_{j_{0}}-w_{j_{0}+1}|^{3}=0<|\tilde{w}^{*}_{j_{0}}-\tilde{w}^{*}_{j_{0}+1}|^{3}$ and $w_{j_{0}}^{2}=(\tilde{w}^{*}_{j_{0}})^{2}$. For $i>j_{0}$, we have $|w_{i}-w_{i+1}|^{3}=|\max\{0,\tilde{w}^{*}_{i}-\delta\}-\max\{0,\tilde{w}^{*}_{i+1}-\delta\}|^{3}\leq|(\tilde{w}^{*}_{i}-\delta)-(\tilde{w}^{*}_{i+1}-\delta)|^{3}=|\tilde{w}^{*}_{i}-\tilde{w}^{*}_{i+1}|^{3}$, and moreover, $w_{i}^{2}=\max\{0,\tilde{w}^{*}_{i}-\delta\}^{2}$, which is $0$ (hence $\leq(\tilde{w}^{*}_{i})^{2}$) if $\tilde{w}^{*}_{i}\leq\delta$ and less than $(\tilde{w}^{*}_{i})^{2}$ if $\tilde{w}^{*}_{i}>\delta$.)

We now turn to prove the recursive relation $\tilde{w}^{*}_{t+1}=\tilde{w}^{*}_{t}-\sqrt{\gamma-\tilde{\lambda}\sum_{j=1}^{t}\tilde{w}^{*}_{j}}$. By differentiating $\tilde{f}$ and setting to zero (and using the fact that $\tilde{w}^{*}_{j}$ is monotonically decreasing in $j$), we get that

[TABLE]

and

[TABLE]

By unrolling this recursive form, we get

[TABLE]

from which the equation

[TABLE]

follows, again using the monotonicity of $\tilde{w}^{*}_{t}$ in $t$ .

It remains to prove that $\sum_{j=1}^{\tilde{T}}\tilde{w}^{*}_{j}=\frac{\gamma}{\tilde{\lambda}}$ . By summing both sides of Eq. (12) from $t=2$ to $t=\tilde{T}-1$ we have that:

[TABLE]

So by using Eq. (11) we get the desired equality. ∎

Lemma 2.

For any $t\in\{1,2,\ldots,\tilde{T}\}$ ,

[TABLE]

Proof.

By the displayed equation in Lemma 1, we clearly have $\tilde{w}^{*}_{t+1}\geq\tilde{w}^{*}_{t}-\sqrt{\gamma}$ for all $t\leq\tilde{T}-1$ , and therefore

[TABLE]

Using the facts that $\tilde{w}^{*}_{t}$ is also always non-negative, that $\tilde{T}\geq\frac{\gamma\mu_{2}}{6\lambda}+1=\frac{\gamma}{\tilde{\lambda}}+1\geq\tilde{w}^{*}_{1}+1$ , and by Lemma 1,

[TABLE]

which implies that $(\tilde{w}^{*}_{1})^{2}-\sqrt{\gamma}\cdot\tilde{w}^{*}_{1}-\frac{2\gamma^{3/2}}{\tilde{\lambda}}\leq 0$, which implies in turn

[TABLE]

On the other hand, again by Lemma 1, we know that

[TABLE]

and hence

[TABLE]

Let $t_{0}\in\{1,2,\ldots,\tilde{T}\}$ be the smallest index such that $\sum_{j=1}^{t_{0}}\tilde{w}^{*}_{j}>\frac{3\gamma}{4\tilde{\lambda}}$ (such an index must exist since $\sum_{j=1}^{\tilde{T}}\tilde{w}^{*}_{j}=\frac{\gamma}{\tilde{\lambda}}$ ). Since $\frac{3\gamma}{4\tilde{\lambda}}<\sum_{j=1}^{t_{0}}\tilde{w}^{*}_{j}\leq t_{0}\tilde{w}^{*}_{1}\leq t_{0}\left(\sqrt{\gamma}+\sqrt{2\gamma^{3/2}/\tilde{\lambda}}\right)$ by Eq. (15), it follows that

[TABLE]

According to Eq. (16) and the fact that $\tilde{w}^{*}_{t_{0}}\geq 0$ , it follows that

[TABLE]

and hence

[TABLE]

Using this and Eq. (14), it follows that for all $t\leq\tilde{T}$ ,

[TABLE]

Since we assumed $\gamma\geq 10^{4}(\lambda/\mu_{2})^{2}>(12\lambda/\mu_{2})^{2}=\tilde{\lambda}^{2}$ , we have $\tilde{\lambda}<\sqrt{\gamma^{1/2}\tilde{\lambda}}$ , so the above can be lower bounded by the simpler expression $\gamma^{3/4}/7\sqrt{\tilde{\lambda}}+\sqrt{\gamma}(1/2-t)$ . Since we also know that $\tilde{w}^{*}_{t}$ is non-negative, the result follows.

∎

Lemma 3.

There exists an index $t_{0}\leq\tilde{T}/2$ such that

[TABLE]

Proof.

By Lemma 1, it holds for any $t\in\{1,2,\ldots,\tilde{T}-1\}$ that

[TABLE]

In particular, since $\tilde{w}^{*}_{j}\geq 0$ for all $j\leq\tilde{T}$ , it follows that $\tilde{w}^{*}_{t}\geq\sqrt{\tilde{\lambda}\sum_{j=t+1}^{\tilde{T}}\tilde{w}^{*}_{j}}\geq\sqrt{\tilde{\lambda}\tilde{w}^{*}_{t+1}}$ , and therefore

[TABLE]

Let $t\leq\tilde{T}-1$ be any index such that $\tilde{w}^{*}_{t+1}\leq\frac{\tilde{\lambda}}{2}$. (Such an index must exist: By assumption, $\tilde{T}\geq 2\gamma\left(\frac{\mu_{2}}{6\lambda}\right)^{2}=\frac{2\gamma}{\tilde{\lambda}^{2}}$, so by Lemma 1, $\frac{\gamma}{\tilde{\lambda}}=\sum_{t=1}^{\tilde{T}}\tilde{w}^{*}_{t}\geq\tilde{T}\tilde{w}^{*}_{\tilde{T}}\geq\frac{2\gamma}{\tilde{\lambda}^{2}}\tilde{w}^{*}_{\tilde{T}}$, hence $\tilde{w}^{*}_{\tilde{T}}\leq\tilde{\lambda}/2$.) By Eq. (18), this implies that

[TABLE]

Using the inequality above together with Eq. (17) and the monotonicity of $\tilde{w}^{*}_{t}$ , we get that for all $t\leq\tilde{T}-1$ such that $\tilde{w}^{*}_{t+1}\leq\frac{\tilde{\lambda}}{2}$ ,

[TABLE]

This chain of inequalities implies that

[TABLE]

Let $t_{0}\leq\tilde{T}/2$ denote the unique index that satisfies $\tilde{w}^{*}_{t_{0}}>\frac{\tilde{\lambda}}{2}$, as well as $\tilde{w}^{*}_{t+1}\leq\frac{\tilde{\lambda}}{2}$ for all $t$ between $t_{0}$ and $\tilde{T}-1$. (Since $\tilde{w}^{*}_{t}$ monotonically decreases in $t$, such an index must exist: On the one hand, $\tilde{w}^{*}_{1}$ can be verified to be at least $\tilde{\lambda}>\tilde{\lambda}/2$, by Lemma 2 and the assumption $\gamma\geq 10^{4}(\lambda/\mu_{2})^{2}$, hence $\gamma\geq 277\tilde{\lambda}^{2}$. On the other hand, if we let $t_{1}$ be the largest index $\leq\tilde{T}$ satisfying $\tilde{w}^{*}_{t_{1}}>\tilde{\lambda}/2$, we have by Lemma 1 that $\frac{\gamma}{\tilde{\lambda}}\geq\sum_{t=1}^{t_{1}}\tilde{w}^{*}_{t}\geq t_{1}\tilde{w}_{t_{1}}^{*}>\frac{t_{1}\tilde{\lambda}}{2}$, which implies that $t_{1}\leq\frac{2\gamma}{\tilde{\lambda}^{2}}$, which is less than $\tilde{T}/2$ by the assumption on $\tilde{T}$ being large enough. Therefore, $t_{0}$ is at most $\tilde{T}/2$ as well.) Using the displayed inequality above, we get that for any $k\leq\tilde{T}-t_{0}$,

[TABLE]

so we get $\tilde{w}^{*}_{t_{0}+k}\geq 9\tilde{\lambda}\cdot(18)^{-2^{k}}$ as required. ∎

Lemma 4.

$\sum_{i=1}^{\tilde{T}}(\tilde{w}^{*}_{i})^{2}\leq 2\gamma^{7/4}/\tilde{\lambda}^{3/2}$.

Proof.

We need to upper bound the squared Euclidean norm of $(\tilde{w}^{*}_{1},\ldots,\tilde{w}^{*}_{\tilde{T}})$ . Note that for any vector $\mathbf{w}$ , we have $\|\mathbf{w}\|^{2}=\sum_{i}w_{i}^{2}\leq(\max_{i}|w_{i}|)\sum_{i}|w_{i}|=\|\mathbf{w}\|_{\infty}\|\mathbf{w}\|_{1}$ . Thus, by Lemma 1, Eq. (15), and the assumption that $\gamma\geq 10^{4}(\lambda/\mu_{2})^{2}>277\tilde{\lambda}^{2}$ , the squared norm is at most

[TABLE]

which is at most $2\sqrt{\gamma^{1/2}/\tilde{\lambda}}\cdot\gamma^{3/2}/\tilde{\lambda}=2\gamma^{7/4}/\tilde{\lambda}^{3/2}$. ∎

Lemma 5.

$\mathbf{w}^{*}=\arg\min_{\mathbf{w}}f(\mathbf{w})$ satisfies $\langle\mathbf{v}_{i},\mathbf{w}^{*}\rangle=\tilde{w}_{i}^{*}$ for all $i=1,\ldots,\tilde{T}$, where $\tilde{\mathbf{w}}^{*}=\arg\min_{\mathbf{w}}\tilde{f}(\mathbf{w})$.

Proof.

First, we argue that $\tilde{\mathbf{w}}^{*}$, which minimizes

[TABLE]

also minimizes

[TABLE]

To see this, note that $\tilde{f}$ and $\hat{f}$ differ only in that $g(x)$ is replaced by $\frac{1}{3}|x|^{3}$. By definition of $g$, we have that $g(x)$ and $\frac{1}{3}|x|^{3}$ coincide for any $|x|\leq\Delta$, from which it is easily verified that $\tilde{f}$ and $\hat{f}$ have the same values and gradients at any $\mathbf{w}$ for which $|w_{i}-w_{i+1}|\leq\Delta$ for all $i\leq\tilde{T}-1$. By Lemma 1 and the assumption $\Delta\geq\sqrt{\gamma}$, the global minimizer $\tilde{\mathbf{w}}^{*}$ of $\tilde{f}$ belongs to this set, and therefore $\nabla\tilde{f}(\tilde{\mathbf{w}}^{*})=\nabla\hat{f}(\tilde{\mathbf{w}}^{*})=\mathbf{0}$. But $\hat{f}$ is strongly convex, hence has a unique point (the global minimizer) at which the gradient of $\hat{f}$ is zero, hence $\tilde{\mathbf{w}}^{*}$ is indeed the global minimizer of $\hat{f}$.

Next, since the global minimizer is invariant to multiplying the function by a fixed positive factor, we get that $\tilde{\mathbf{w}}^{*}$ is also the global minimizer of

[TABLE]

where in the last step we used the fact that $\tilde{\lambda}=12\lambda/\mu_{2}$ . Recalling that

[TABLE]

and that $\mathbf{v}_{1},\mathbf{v}_{2},\ldots$ are orthogonal, we can write $f(\mathbf{w})$ as $\frac{\mu_{2}}{12}\cdot\hat{f}(V\mathbf{w})$, where $V$ is any orthogonal matrix whose first $\tilde{T}$ rows are $\mathbf{v}_{1},\ldots,\mathbf{v}_{\tilde{T}}$. Therefore, the minimizer $\mathbf{w}^{*}$ of $f$ satisfies $V\mathbf{w}^{*}=(\langle\mathbf{v}_{1},\mathbf{w}^{*}\rangle,\langle\mathbf{v}_{2},\mathbf{w}^{*}\rangle,\ldots)=\tilde{\mathbf{w}}^{*}$. ∎

Lemma 6.

$\|\mathbf{w}^{*}\|^{2}=\sum_{i=1}^{\tilde{T}}\langle\mathbf{v}_{i},\mathbf{w}^{*}\rangle^{2}$.

Proof.

$f(\mathbf{w})$ is a function which can be written in the form $h(\langle\mathbf{v}_{1},\mathbf{w}\rangle,\langle\mathbf{v}_{2},\mathbf{w}\rangle,\ldots,\langle\mathbf{v}_{\tilde{T}},\mathbf{w}\rangle)+\frac{\lambda}{2}\|\mathbf{w}\|^{2}$ , so by the Representer theorem, its minimizer $\mathbf{w}^{*}$ must lie in the span of $\mathbf{v}_{1},\mathbf{v}_{2},\ldots,\mathbf{v}_{\tilde{T}}$ . Moreover, since these vectors are orthogonal and of unit norm, we have $\mathbf{w}^{*}=\sum_{i=1}^{\tilde{T}}\langle\mathbf{v}_{i},\mathbf{w}^{*}\rangle\mathbf{v}_{i}$ , and thus

[TABLE]

∎

4.2 Oracle Complexity Lower Bound

In this subsection, we prove the following oracle complexity lower bound, depending on the free parameter $\gamma$ :

Proposition 2.

Assume that $\epsilon<\min\left\{\frac{108^{2}\cdot\lambda^{3}}{\mu_{2}^{2}},\frac{\gamma\lambda}{8}\right\}$ . Under the conditions of Proposition 1, it is possible to choose the vectors $\mathbf{v}_{1},\mathbf{v}_{2},\ldots,\mathbf{v}_{\tilde{T}}$ in the function $f$ , such that the number of iterations $T$ required to have $f(\mathbf{w}_{T})-f(\mathbf{w}^{*})\leq\epsilon$ is at least

[TABLE]

To prove the proposition, we will need the following key lemma, which establishes that the oracle information at certain points $\mathbf{w}$ does not leak any information on some of the vectors $\mathbf{v}_{1},\mathbf{v}_{2},\ldots$.

Lemma 7.

For any $\mathbf{w}\in\mathbb{R}^{d}$ orthogonal to $\mathbf{v}_{t},\mathbf{v}_{t+1},\ldots,\mathbf{v}_{\tilde{T}}$ , it holds that $f(\mathbf{w}),\nabla f(\mathbf{w}),\nabla^{2}f(\mathbf{w})$ do not depend on $\mathbf{v}_{t+1},\mathbf{v}_{t+2},\ldots,\mathbf{v}_{\tilde{T}}$ .

Proof.

Since the regularization term $\frac{\lambda}{2}\|\mathbf{w}\|^{2}$ doesn’t depend on $\mathbf{v}_{t+1},\mathbf{v}_{t+2},\ldots,\mathbf{v}_{\tilde{T}}$ we can define $h(\mathbf{w})\triangleq f(\mathbf{w})-\frac{\lambda}{2}\|\mathbf{w}\|^{2}$ and prove the result on $h(\mathbf{w})$ . Using the definition of $h$ and differentiating, we have that

[TABLE]

By the assumption $\langle\mathbf{v}_{t},\mathbf{w}\rangle=\langle\mathbf{v}_{t+1},\mathbf{w}\rangle=\ldots=0$ , and the fact that $g(0)=g^{\prime}(0)=g^{\prime\prime}(0)=0$ , we have that $g(\langle\mathbf{v}_{i}-\mathbf{v}_{i+1},\mathbf{w}\rangle)=g^{\prime}(\langle\mathbf{v}_{i}-\mathbf{v}_{i+1},\mathbf{w}\rangle)=g^{\prime\prime}(\langle\mathbf{v}_{i}-\mathbf{v}_{i+1},\mathbf{w}\rangle)=0$ for all $i\in\{t,t+1,\ldots,\tilde{T}-1\}$ . Therefore, it is easily verified that the expressions above indeed do not depend on $\mathbf{v}_{t+1},\mathbf{v}_{t+2},\ldots,\mathbf{v}_{\tilde{T}}$ . ∎
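The effect described in Lemma 7 can be seen on a toy chain function $\sum_{i}g(w_{i}-w_{i+1})$ in the standard basis, where $\mathbf{v}_{i}=\mathbf{e}_{i}$ so that "depending on $\mathbf{v}_{j}$" corresponds to coordinate $j$ carrying a nonzero gradient or Hessian entry. The sketch below is illustrative only: it takes $g(x)=\frac{1}{3}|x|^{3}$ (so $g'(x)=x|x|$ and $g''(x)=2|x|$) in place of the paper's piecewise $g$, drops the linear and regularization terms, and compares against the quadratic choice $g(x)=x^{2}$. All function names are hypothetical.

```python
def chain_gradient(w, g1):
    """Gradient of sum_i g(w[i] - w[i+1]), given g1 = g'."""
    grad = [0.0] * len(w)
    for i in range(len(w) - 1):
        s = g1(w[i] - w[i + 1])
        grad[i] += s
        grad[i + 1] -= s
    return grad


def chain_hessian(w, g2):
    """Hessian of sum_i g(w[i] - w[i+1]), given g2 = g''."""
    n = len(w)
    hess = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        c = g2(w[i] - w[i + 1])
        hess[i][i] += c
        hess[i + 1][i + 1] += c
        hess[i][i + 1] -= c
        hess[i + 1][i] -= c
    return hess


def last_touched(w, g1, g2):
    """Largest 1-based coordinate with a nonzero gradient or Hessian entry (0 if none)."""
    grad, hess = chain_gradient(w, g1), chain_hessian(w, g2)
    touched = [j + 1 for j in range(len(w))
               if grad[j] != 0.0 or any(x != 0.0 for x in hess[j])]
    return max(touched, default=0)


# Query point supported on the first two coordinates, i.e. orthogonal to e_3, ..., e_6 (t = 3).
w = [0.7, 0.3, 0.0, 0.0, 0.0, 0.0]
cubic = last_touched(w, lambda x: x * abs(x), lambda x: 2.0 * abs(x))
quadratic = last_touched(w, lambda x: 2.0 * x, lambda x: 2.0)
print(cubic, quadratic)
```

With the cubic link the oracle output involves only coordinates up to $t=3$, whereas with the quadratic link the constant Hessian exposes every coordinate of the chain.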

Let us now fix any number of iterations $T\leq\tilde{T}$ . Using the previous results, we can provide a way to pick $\mathbf{v}_{1},\mathbf{v}_{2},\ldots,\mathbf{v}_{\tilde{T}}$ for any deterministic algorithm, such that we can provide a lower bound for the number of second-order oracle calls.

• First, we compute $\mathbf{w}_{1}$ (which is possible since the algorithm is deterministic and $\mathbf{w}_{1}$ is chosen before any oracle calls are made).

• We pick $\mathbf{v}_{1}$ to be some unit vector orthogonal to $\mathbf{w}_{1}$. Assuming $\mathbf{v}_{2},\ldots,\mathbf{v}_{\tilde{T}}$ will also be orthogonal to $\mathbf{w}_{1}$ (which will be ensured by the construction which follows), we have by Lemma 7 that the information $f(\mathbf{w}_{1}),\nabla f(\mathbf{w}_{1}),\nabla^{2}f(\mathbf{w}_{1})$ provided by the oracle to the algorithm does not depend on $\{\mathbf{v}_{2},\ldots,\mathbf{v}_{\tilde{T}}\}$, and thus depends only on $\mathbf{v}_{1}$ which was already fixed. Since the algorithm is deterministic, this fixes the next query point $\mathbf{w}_{2}$.

• For $t=2,3,\ldots,T-1$, we repeat the process above: We compute $\mathbf{w}_{t}$, and pick $\mathbf{v}_{t}$ to be some unit vector orthogonal to $\mathbf{w}_{1},\mathbf{w}_{2},\ldots,\mathbf{w}_{t}$, as well as all previously constructed $\mathbf{v}$’s (this is always possible since the dimension is sufficiently large). By Lemma 7, as long as all vectors thus constructed are orthogonal to $\mathbf{w}_{t}$, the information $\{f(\mathbf{w}_{t}),\nabla f(\mathbf{w}_{t}),\nabla^{2}f(\mathbf{w}_{t})\}$ provided to the algorithm does not depend on $\mathbf{v}_{t+1},\ldots,\mathbf{v}_{\tilde{T}}$, and only depends on $\mathbf{v}_{1},\ldots,\mathbf{v}_{t}$ which were already determined. Therefore, the next query point $\mathbf{w}_{t+1}$ is fixed.

• At the end of the process, we pick $\mathbf{v}_{T},\mathbf{v}_{T+1},\ldots,\mathbf{v}_{\tilde{T}}$ to be some unit vectors orthogonal to all previously chosen $\mathbf{v}$’s as well as $\mathbf{w}_{1},\ldots,\mathbf{w}_{T}$ (this is possible since the dimension is large enough).
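The selection step used in the construction above, namely picking a unit vector orthogonal to all query points and all previously chosen $\mathbf{v}$'s, can be sketched with Gram-Schmidt orthogonalization. This is only an illustration of that single step: the query points below are a fixed, hypothetical list rather than the output of an actual algorithm, and the function names are not from the paper.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def subtract_projections(v, basis):
    """Remove from v its components along each vector of an orthonormal basis."""
    r = list(v)
    for b in basis:
        c = dot(r, b)
        r = [ri - c * bi for ri, bi in zip(r, b)]
    return r


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt; (near-)zero or linearly dependent vectors are skipped."""
    basis = []
    for v in vectors:
        r = subtract_projections(v, basis)
        n = math.sqrt(dot(r, r))
        if n > tol:
            basis.append([ri / n for ri in r])
    return basis


def pick_orthogonal_unit_vector(avoid, d, tol=1e-8):
    """Return a unit vector in R^d orthogonal to every vector in `avoid`."""
    basis = orthonormal_basis(avoid)
    for k in range(d):
        e_k = [1.0 if i == k else 0.0 for i in range(d)]
        r = subtract_projections(e_k, basis)
        n = math.sqrt(dot(r, r))
        if n > tol:
            return [ri / n for ri in r]
    raise ValueError("dimension too small to find an orthogonal direction")


d = 6
queries = [[0.0] * d, [1.0, 2.0, 0.0, 0.0, 0.0, 0.0]]  # hypothetical w_1, w_2
vs = []
for t in range(1, len(queries) + 1):
    # v_t must be orthogonal to w_1, ..., w_t and to v_1, ..., v_{t-1}
    vs.append(pick_orthogonal_unit_vector(queries[:t] + vs, d))
print(vs)
```

The requirement that the dimension be sufficiently large shows up here as the `ValueError` branch: once the avoided vectors span all of $\mathbb{R}^{d}$, no admissible direction remains.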

Using the facts that $\mathbf{v}_{1},\mathbf{v}_{2},\ldots,\mathbf{v}_{\tilde{T}}$ are orthogonal (and thus act as an orthonormal basis to a subspace of $\mathbb{R}^{d}$ ), that $\mathbf{w}_{T}$ is orthogonal to $\mathbf{v}_{T},\mathbf{v}_{T+1},\ldots\mathbf{v}_{\tilde{T}}$ , and that $t_{0}+T\leq\frac{\tilde{T}}{2}+\frac{\tilde{T}}{2}=\tilde{T}$ (where $t_{0}$ is as defined in Proposition 1), we have

[TABLE]

By Proposition 1, we can lower bound the above by

[TABLE]

Using the strong convexity of $f$ , we therefore get

[TABLE]

To make the right-hand side smaller than $\epsilon$ , $T$ must satisfy

[TABLE]

which is equivalent to

[TABLE]

Assuming $\epsilon<\frac{108^{2}\cdot\lambda^{3}}{\mu_{2}^{2}}$ , then

[TABLE]

We now turn to argue that we can also lower bound $T$ by $\frac{\gamma^{1/4}}{7\sqrt{12\lambda/\mu_{2}}}$. Suppose, by way of contradiction, that we can have $f(\mathbf{w}_{T})-f(\mathbf{w}^{*})\leq\epsilon$ for some $T<\frac{\gamma^{1/4}}{7\sqrt{12\lambda/\mu_{2}}}$. From Proposition 1 we know that

[TABLE]

so as before, we have that

[TABLE]

and thus

[TABLE]

To make the right-hand side smaller than $\epsilon$ , $T$ must satisfy

[TABLE]

or equivalently

[TABLE]

But since we assume $\epsilon<\frac{\gamma\lambda}{8}$ , this is at least $\frac{\gamma^{1/4}}{7\sqrt{12\lambda/\mu_{2}}}$ , contradicting our earlier assumption.

Overall, we showed that $T$ is lower bounded by both $\frac{\gamma^{1/4}}{7\sqrt{12\lambda/\mu_{2}}}$ , as well as $\log_{2}\log_{18}\left(\frac{108^{2}\cdot\lambda^{3}}{\mu_{2}^{2}\epsilon}\right)-1$ , hence proving Proposition 2.
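To get a feel for the two terms in this bound, the following sketch evaluates $\frac{\gamma^{1/4}}{7\sqrt{12\lambda/\mu_{2}}}$ and $\log_{2}\log_{18}\left(\frac{108^{2}\lambda^{3}}{\mu_{2}^{2}\epsilon}\right)-1$ and returns the larger one. The function name and the parameter values are illustrative; the code only checks the assumptions on $\epsilon$ and $\gamma$ stated in Propositions 1 and 2.

```python
import math


def prop2_lower_bound(gamma, lam, mu2, eps):
    """Larger of the two lower bounds on T stated at the end of Subsection 4.2."""
    # Assumptions of Proposition 1 (on gamma) and Proposition 2 (on eps).
    if gamma < 1e4 * (lam / mu2) ** 2:
        raise ValueError("requires gamma >= 10^4 (lambda / mu_2)^2")
    if eps >= min(108 ** 2 * lam ** 3 / mu2 ** 2, gamma * lam / 8):
        raise ValueError("requires eps < min{108^2 lambda^3 / mu_2^2, gamma lambda / 8}")
    poly_term = gamma ** 0.25 / (7.0 * math.sqrt(12.0 * lam / mu2))
    inner = 108 ** 2 * lam ** 3 / (mu2 ** 2 * eps)  # > 1 by the check above
    loglog_term = math.log2(math.log(inner, 18)) - 1.0
    return max(poly_term, loglog_term)


# Moderate gamma: the log-log term dominates.
print(prop2_lower_bound(gamma=1e4, lam=1.0, mu2=1.0, eps=1e-6))
# Very large gamma: the gamma^(1/4) term dominates.
print(prop2_lower_bound(gamma=1e12, lam=1.0, mu2=1.0, eps=1e-6))
```

The log-log term grows extremely slowly in $1/\epsilon$, so for large $\gamma$ the bound is driven by the $\gamma^{1/4}$ term.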

4.3 Setting the $\gamma,\Delta$ Parameters

In the following lemma, we establish the strong convexity and smoothness parameters of $f$ (depending on the parameter $\Delta$ which is still free at this point).

Lemma 8.

$f$ is $\lambda$-strongly convex and twice-differentiable, with $\mu_{2}$-Lipschitz Hessians and $\left(\frac{2\mu_{2}\Delta}{3}+\lambda\right)$-Lipschitz gradients.

Proof.

Since $f$ is a sum of convex, twice-differentiable functions and the $\lambda$ -strongly convex function $\frac{\lambda}{2}\|\mathbf{w}\|^{2}$ , it is clearly $\lambda$ -strongly convex and twice-differentiable. Thus, it only remains to calculate the Lipschitz parameter of the gradients and Hessians.

To simplify the proof, we note that Lipschitz smoothness is a property invariant to the coordinate system used, so we can assume without loss of generality that $\mathbf{v}_{1},\mathbf{v}_{2},\ldots,\mathbf{v}_{\tilde{T}}$ correspond to the standard basis $\mathbf{e}_{1},\mathbf{e}_{2},\ldots,\mathbf{e}_{\tilde{T}}$ , and consider the Lipschitz properties of the function

[TABLE]

By definition of $g$ , it is easily verified that

[TABLE]

which is a $2$ -Lipschitz function bounded in $[0,2\Delta]$ . This implies that $g^{\prime}(x)$ is $2\Delta$ -Lipschitz. Letting $\mathbf{r}_{i}:=\mathbf{e}_{i}-\mathbf{e}_{i+1}$ , we can write $\hat{f}$ as

[TABLE]

Differentiating twice, we get

[TABLE]

Since this is a sum of positive-semidefinite matrices with non-negative coefficients (as we showed that $g^{\prime\prime}(x)\in[0,2\Delta]$ for all $x$ ), it follows that its spectral norm is at most

[TABLE]

and the first term equals

[TABLE]

Overall, we showed that $\|\nabla^{2}\hat{f}(\mathbf{w})\|\leq\frac{2\mu_{2}\Delta}{3}+\lambda$ , so the gradients of $f$ are $\left(\frac{2\mu_{2}\Delta}{3}+\lambda\right)$ -Lipschitz.

It remains to show that $\nabla^{2}\hat{f}(\mathbf{w})$ is $\mu_{2}$ -Lipschitz. Using the formula for $\nabla^{2}\hat{f}(\mathbf{w})$ , and recalling that $g^{\prime\prime}(x)$ is $2$ -Lipschitz, and $\|\mathbf{r}_{i}\|=\sqrt{2}$ by definition, we have that for any $\mathbf{w},\tilde{\mathbf{w}}$ ,

[TABLE]

Using the same calculations as earlier, we have $\left\|\sum_{i=1}^{\tilde{T}-1}\mathbf{r}_{i}\mathbf{r}_{i}^{\top}\right\|\leq 4$ , and therefore we showed overall that

[TABLE]

hence $\nabla^{2}\hat{f}(\mathbf{w})$ is $\mu_{2}$ -Lipschitz. ∎
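The bound $\left\|\sum_{i}\mathbf{r}_{i}\mathbf{r}_{i}^{\top}\right\|\leq 4$ used in this proof follows from $\mathbf{w}^{\top}\left(\sum_{i}\mathbf{r}_{i}\mathbf{r}_{i}^{\top}\right)\mathbf{w}=\sum_{i}(w_{i}-w_{i+1})^{2}\leq 2\sum_{i}(w_{i}^{2}+w_{i+1}^{2})\leq 4\|\mathbf{w}\|^{2}$. A minimal numerical sanity check of this inequality is sketched below; the dimension, the number of trials, and the function names are arbitrary choices made for the illustration.

```python
import random


def chain_quadratic_form(w):
    """w^T (sum_i r_i r_i^T) w with r_i = e_i - e_{i+1}, i.e. sum_i (w_i - w_{i+1})^2."""
    return sum((a - b) ** 2 for a, b in zip(w, w[1:]))


def rayleigh_quotient(w):
    return chain_quadratic_form(w) / sum(x * x for x in w)


rng = random.Random(0)
n = 50
# Random directions never exceed the bound of 4 ...
worst_random = max(
    rayleigh_quotient([rng.gauss(0.0, 1.0) for _ in range(n)]) for _ in range(1000)
)
# ... while an alternating-sign vector shows that 4 is nearly attained.
alternating = rayleigh_quotient([(-1.0) ** i for i in range(n)])
print(worst_random, alternating)
```

The alternating vector gives $4(n-1)/n$, so the constant $4$ cannot be improved by more than a factor vanishing with the dimension.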

We now collect the ingredients necessary to fix $\gamma,\Delta$ and hence prove our theorem. Combining the previous lemma, Proposition 1 and Proposition 2, and recalling that we want $f$ to have $\mu_{1}$ -Lipschitz gradients and $\mu_{2}$ -Lipschitz Hessians, with an optimizer $\mathbf{w}^{*}$ satisfying $\|\mathbf{w}^{*}\|\leq D$ , we have an oracle complexity lower bound of the form

[TABLE]

assuming the following conditions:

[TABLE]

Picking $\Delta=\sqrt{\gamma}$ , using the fact that $\mu_{1}\geq\lambda$ (as any $\lambda$ -strongly convex function must have gradients with Lipschitz parameter at least $\lambda$ ), and rewriting the last two conditions, this is equivalent to

[TABLE]

Since the first condition needs to hold anyway, we can allow ourselves to make the second condition stronger, by substituting $10^{4}(\lambda/\mu_{2})^{2}$ in lieu of $\gamma$ in the second condition. Doing this, simplifying, and merging the last two conditions, the set of conditions above is implied by requiring

[TABLE]

Clearly, to make the lower bound in Eq. (19) as large as possible, we should pick the largest possible $\gamma$ , namely $\gamma=\min\left\{\left(\frac{3(\mu_{1}-\lambda)}{2\mu_{2}}\right)^{2},\sqrt[7]{\frac{D^{8}(12\lambda)^{6}}{2^{4}\mu_{2}^{6}}}\right\}$ , and to ensure that the other conditions hold, require that

[TABLE]

Simplifying a bit, these two conditions are implied by requiring

[TABLE]

Finally, let us plug our choice of $\gamma=\min\left\{\left(\frac{3(\mu_{1}-\lambda)}{2\mu_{2}}\right)^{2},\sqrt[7]{\frac{D^{8}(12\lambda)^{6}}{2^{4}\mu_{2}^{6}}}\right\}$ into the lower bound in Eq. (19). We thus get an oracle complexity lower bound of

[TABLE]

under the conditions of Eq. (20).

To simplify the bound a bit, we note that we can lower bound $\mu_{1}-\lambda$ by $\frac{67}{68}\mu_{1}$ (possible by Eq. (20)), and lower bound $\log_{2}\log_{18}\left(\frac{108^{2}\cdot\lambda^{3}}{\mu_{2}^{2}\epsilon}\right)-1$ by $\frac{1}{2}\log\log_{18}\left(\frac{\lambda^{3}}{\mu_{2}^{2}\epsilon}\right)$ , by assuming that $\epsilon\leq c\lambda^{3}/\mu_{2}^{2}$ for some small enough $c$ (in other words, increasing the constant in the third condition in Eq. (20)). Finally, using the fact that $\max\{a,b\}\geq(a+b)/2$ , the result in the theorem follows.

5 Proof of Thm. 2

Similarly to the strongly convex case, we will assume without loss of generality that the algorithm initializes at $\mathbf{w}_{1}=\mathbf{0}$ , since otherwise one can simply replace the “hard” function $f(\mathbf{w})$ below by $f(\mathbf{w}-\mathbf{w}_{1})$ , and the same proof holds verbatim. Thus, the theorem requires that our function has a minimizer $\mathbf{w}^{*}$ satisfying $\|\mathbf{w}^{*}\|\leq D$ .

Define $g:\mathbb{R}\mapsto\mathbb{R}$ as

[TABLE]

where $\Delta\triangleq\frac{3\mu_{1}}{2\mu_{2}}$ . $g$ can be easily verified to be twice continuously differentiable. Assume that $d\geq 2T$ , and let $\mathbf{v}_{1},\mathbf{v}_{2},\ldots,\mathbf{v}_{T}$ be orthogonal unit vectors in $\mathbb{R}^{d}$ which will be specified later. Given $T$ , and letting $\gamma>0$ be a parameter to be specified later, define the function $f_{T}$ as

[TABLE]

This function is easily shown to be convex and twice-differentiable, with $\mu_{1}$ -Lipschitz gradients and $\mu_{2}$ -Lipschitz Hessians (the proof is identical to the proof of Lemma 8). Our goal will be to show a lower bound on the optimization error using this type of function.

5.1 Minimizer of $f_{T}$

In this subsection, we analyze the properties of a minimizer of $f_{T}$ . To that end, we introduce the following function in $\mathbb{R}^{T}$ :

[TABLE]

It is easily verified that the minimal values of $\frac{\mu_{2}}{12}\hat{f}_{T}$ and $f_{T}$ are the same, and moreover, if $\hat{\mathbf{w}}^{*}\in\mathbb{R}^{T}$ is a minimizer of $\hat{f}_{T}$, then $\mathbf{w}^{*}=\sum_{j=1}^{T}\hat{w}^{*}_{j}\cdot\mathbf{v}_{j}\in\mathbb{R}^{d}$ is a minimizer of $f_{T}$, with the same Euclidean norm as $\hat{\mathbf{w}}^{*}$.

We begin with the following technical lemma:

Lemma 9.

$\hat{f}_{T}$ has a unique minimizer $\hat{\mathbf{w}}^{*}\in\mathbb{R}^{T}$, which satisfies

[TABLE]

for all $t=1,2,\ldots,T$, where $\delta$ is non-negative and independent of $t$. Moreover,

[TABLE]

Proof.

Taking the derivative and setting to zero, we get that

[TABLE]

as well as

[TABLE]

for all $j\in\{2,3,\ldots,T-1\}$ . By definition of $g$ , it is easily verified that $g^{\prime}$ is a strictly monotonic (hence invertible) function, so the above implies $\hat{w}^{*}_{j-1}-\hat{w}^{*}_{j}=\hat{w}^{*}_{j}-\hat{w}^{*}_{j+1}$ for all $j\in\{2,3,\ldots,T-1\}$ , as well as $\hat{w}^{*}_{T-1}-\hat{w}^{*}_{T}=\hat{w}^{*}_{T}$ . From this, it follows by straightforward induction that $\hat{w}^{*}_{T+1-t}=t\cdot\hat{w}^{*}_{T}$ , from which the first displayed equation in the lemma follows. This also implies $g^{\prime}(T\hat{w}^{*}_{T})+g^{\prime}(\hat{w}^{*}_{T})=\gamma$ , and since $g^{\prime}$ is strictly monotonic, we have that $\hat{w}^{*}_{T}$ is uniquely defined, and since the other coordinates of $\hat{\mathbf{w}}^{*}$ are also uniquely defined given $\hat{w}^{*}_{T}$ , we get that $\hat{\mathbf{w}}^{*}$ is unique. Finally, $\delta$ (and hence $\hat{w}_{t}^{*}$ for all $t$ ) is necessarily non-negative, since otherwise $\hat{w}^{*}_{1}$ is negative, which would imply $\hat{f}_{T}(\hat{\mathbf{w}}^{*})>0$ , even though $\hat{f}_{T}(\mathbf{0})=0$ , violating the fact that $\hat{\mathbf{w}}^{*}$ minimizes $\hat{f}_{T}$ . ∎

The main technical result in this subsection is the following proposition, which characterizes $\|\hat{\mathbf{w}}^{*}\|$ and $\hat{f}_{T}(\hat{\mathbf{w}}^{*})$ under various parameter regimes. By the discussion above and definition of $f_{T}$ , we have

[TABLE]

which will be used in the remainder of the proof of our theorem.

Proposition 3.

The function $\hat{f}_{T}$ and its minimizer $\hat{\mathbf{w}}^{*}$ have the following properties, depending on the values of $\gamma,\Delta,T$:

[TABLE]

Proof.

To prove the proposition, we will consider three regimes, depending on $T,\delta,\Delta$ : Namely, $T\delta\leq\Delta$ , $\frac{\Delta}{T}<\delta\leq\Delta$ and $\delta>\Delta$ . We will show that each regime corresponds to one of the three regimes specified in the proposition, and prove the relevant bounds.

Case 1: $T\delta\leq\Delta$ . In that case, $\hat{w}^{*}_{1},\hat{w}^{*}_{T}$ as well as $\hat{w}^{*}_{i}-\hat{w}^{*}_{i+1}$ for all $i=2,\ldots,T-1$ in the definition of $\hat{f}_{T}$ all lie in the interval where $g$ is a cubic function. Using Lemma 9,

[TABLE]

hence

[TABLE]

and

[TABLE]

Therefore, our condition $T\delta\leq\Delta$ is exactly equivalent to $\gamma\leq\frac{\Delta^{2}\left(1+T^{2}\right)}{T^{2}}$ , namely the first regime discussed in the proposition. We now establish the relevant bounds:

[TABLE]

and

[TABLE]

where in the calculation above we used the fact that $\sum_{t=1}^{T}t^{2}\leq\int_{1}^{T+1}t^{2}dt<\frac{(T+1)^{3}}{3}$.

Case 2: $\frac{\Delta}{T}<\delta\leq\Delta$. In this case, by Lemma 9, $\hat{w}^{*}_{T}\leq\Delta$ but $\hat{w}^{*}_{1}>\Delta$. Therefore, in the definition of $\hat{f}_{T}(\hat{\mathbf{w}}^{*})$, $g(\hat{w}^{*}_{1})$ lies in the quadratic region of $g$, whereas $g(\hat{w}^{*}_{T})$ and $g(\hat{w}^{*}_{i}-\hat{w}^{*}_{i+1})$ for all $i$ lie in the cubic region of $g$. As a result,

[TABLE]

Plugging in $\hat{w}^{*}_{T}=\delta$ and $\hat{w}^{*}_{1}=T\cdot\delta$, we get

[TABLE]

and therefore (using the fact $\delta\geq 0$ , see Lemma 9),

[TABLE]

This, plus the assumption $\frac{\Delta}{T}<\delta\leq\Delta$ , is equivalent to $\frac{\Delta^{2}\left(1+T^{2}\right)}{T^{2}}<\gamma\leq 2\Delta^{2}T$ , hence showing that we are indeed in the second regime as specified in our proposition. Turning to calculate the relevant bounds, we have

[TABLE]

Moreover,

[TABLE]

which by definition of $\delta$ above and the inequality $\sqrt{1+x}\leq 1+\frac{1}{2}x$ for all $x\geq 0$ , is at most $\frac{\left(\gamma+\Delta^{2}\right)^{2}(T+1)^{3}}{12\Delta^{2}T^{2}}$ .
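The last bound can be unpacked as follows. This is a sketch that assumes the displayed quantity is $\|\hat{\mathbf{w}}^{*}\|^{2}$ and uses the shorthand $\alpha=\frac{\gamma+\Delta^{2}}{\Delta^{2}T^{2}}$ (introduced here for brevity), together with $\delta=-\Delta T+\Delta T\sqrt{1+\alpha}$ and $\hat{w}^{*}_{T+1-t}=t\cdot\delta$ from Lemma 9:

```latex
\[
\delta=\Delta T\left(\sqrt{1+\alpha}-1\right)
\le \Delta T\cdot\frac{\alpha}{2}
=\frac{\gamma+\Delta^{2}}{2\Delta T},
\]
\[
\|\hat{\mathbf{w}}^{*}\|^{2}
=\delta^{2}\sum_{t=1}^{T}t^{2}
<\delta^{2}\cdot\frac{(T+1)^{3}}{3}
\le\frac{\left(\gamma+\Delta^{2}\right)^{2}(T+1)^{3}}{12\Delta^{2}T^{2}} .
\]
```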

Case 3: $\delta>\Delta$. In this case, by Lemma 9, we have $\hat{w}^{*}_{1}>\hat{w}^{*}_{T}=\hat{w}^{*}_{i}-\hat{w}^{*}_{i+1}>\Delta$, which implies that in the definition of $\hat{f}_{T}(\hat{\mathbf{w}}^{*})$, these terms all lie in the quadratic region of $g$. Therefore,

[TABLE]

and thus

[TABLE]

or equivalently

[TABLE]

Note that this, plus our assumption $\delta>\Delta$ , is equivalent to $\gamma>2\Delta^{2}T$ , which shows that we are indeed in the third regime as specified in our proposition. Turning to calculate $\|\hat{\mathbf{w}}^{*}\|$ and $\hat{f}_{T}(\hat{\mathbf{w}}^{*})$ , we have

[TABLE]

and

[TABLE]

∎
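The three regimes of the proof can be summarized in a short sketch that returns the regime index and the corresponding closed form for $\delta$. The second-regime expression is the one used for $\delta_{T}$ in Subsection 5.3. The first- and third-regime expressions are derived here under the assumption that $g^{\prime}(x)=x^{2}$ on $[0,\Delta]$ and $g^{\prime}(x)=2\Delta x-\Delta^{2}$ for $x>\Delta$, so they should be read as a consistency check rather than as the paper's displayed formulas. The names `delta_by_regime`, `stationarity_residual`, and `cap` (standing for $\Delta$) are illustrative.

```python
import math


def delta_by_regime(gamma, cap, T):
    """Return (regime, delta) using a closed form for each of the three regimes."""
    if gamma <= cap ** 2 * (1 + T ** 2) / T ** 2:
        # every argument is in the cubic part: T^2 d^2 + d^2 = gamma
        return 1, math.sqrt(gamma / (1 + T ** 2))
    if gamma <= 2 * cap ** 2 * T:
        # only w_1 is in the quadratic part: d^2 + 2*cap*T*d - (gamma + cap^2) = 0
        x = (gamma + cap ** 2) / (cap ** 2 * T ** 2)
        return 2, cap * T * (math.sqrt(1 + x) - 1)
    # every argument is in the quadratic part: 2*cap*(T + 1)*d - 2*cap^2 = gamma
    return 3, (gamma + 2 * cap ** 2) / (2 * cap * (T + 1))


def g_prime(x, cap):
    """Assumed derivative of g, for x >= 0 only."""
    return x * x if x <= cap else 2 * cap * x - cap * cap


def stationarity_residual(gamma, cap, T):
    """Value of g'(T*delta) + g'(delta) - gamma, which should be zero."""
    _, d = delta_by_regime(gamma, cap, T)
    return g_prime(T * d, cap) + g_prime(d, cap) - gamma


for gamma in (1.0, 2.0, 20.0):
    regime, d = delta_by_regime(gamma, cap=1.0, T=5)
    print(gamma, regime, d, stationarity_residual(gamma, 1.0, 5))
```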

5.2 Oracle Complexity Lower Bound

Given the expressions for the optimal value of $\hat{f}_{T}$ derived in the previous subsection, we now explain how the oracle complexity lower bound is derived. The argument is very similar to the strongly convex case (proof of Thm. 1, Subsection 4.2). Specifically, consider the function $f_{2T}$, given by

[TABLE]

Given an algorithm, we choose $\mathbf{v}_{1},\mathbf{v}_{2},\ldots,\mathbf{v}_{T}$ to be orthogonal unit vectors, so that each $\mathbf{v}_{t}$ is orthogonal to the first $t$ points $\mathbf{w}_{1},\mathbf{w}_{2},\ldots,\mathbf{w}_{t}$ computed by the algorithm (this is possible, since the gradients and Hessians of $f_{2T}$ at each $\mathbf{w}_{t}$ reveal no information on future $\mathbf{v}_{t}$'s; see Lemma 7). Also, we let $\mathbf{v}_{T+1},\ldots,\mathbf{v}_{2T}$ equal $\mathbf{v}_{T}$.

With this choice, it is easily verified that

[TABLE]

which is clearly no better than $\min_{\mathbf{w}}f_{T}(\mathbf{w})$ , where $f_{T}$ is defined with the same $\mathbf{v}_{1},\ldots,\mathbf{v}_{T}$ . Therefore, we can lower bound the optimization error $f_{2T}(\mathbf{w}_{T})-\min_{\mathbf{w}}f_{2T}(\mathbf{w})$ by $\min_{\mathbf{w}}f_{T}(\mathbf{w})-\min_{\mathbf{w}}f_{2T}(\mathbf{w})$ . Moreover, by Eq. (21), this equals

[TABLE]

Using Proposition 3, we can now plug in these minimal values, depending on the various parameter regimes, and get an oracle complexity lower bound. Computing these bounds and parameter regimes (while picking the free parameter $\gamma$ appropriately) is performed in the next subsection.
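The choice of the vectors $\mathbf{v}_{1},\ldots,\mathbf{v}_{T}$ described earlier in this subsection can be illustrated with a small Gram-Schmidt sketch. It is a simplification: in the actual argument the points $\mathbf{w}_{t}$ are produced adaptively by the algorithm, whereas here they are passed in as a fixed list, and the dimension is assumed to be at least $2T$ so that a valid direction always exists. All function names are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit_vector_orthogonal_to(constraints, dim, rng):
    """Return a unit vector in R^dim orthogonal to every vector in `constraints`."""
    basis = []                       # orthonormal basis of span(constraints)
    for c in constraints:
        r = list(c)
        for b in basis:              # modified Gram-Schmidt step
            coef = dot(r, b)
            r = [ri - coef * bi for ri, bi in zip(r, b)]
        norm = math.sqrt(dot(r, r))
        if norm > 1e-12:
            basis.append([ri / norm for ri in r])
    if len(basis) >= dim:
        raise ValueError("dimension too small to find an orthogonal direction")
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for _ in range(2):           # project twice for numerical stability
            for b in basis:
                coef = dot(v, b)
                v = [vi - coef * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(dot(v, v))
        if norm > 1e-8:
            return [vi / norm for vi in v]


def choose_directions(queries, dim, seed=0):
    """Pick v_1..v_T with v_t orthogonal to w_1..w_t and to v_1..v_{t-1}."""
    rng = random.Random(seed)
    directions = []
    for t in range(1, len(queries) + 1):
        constraints = queries[:t] + directions
        directions.append(unit_vector_orthogonal_to(constraints, dim, rng))
    return directions
```

Each new direction is drawn at random and then projected away from the span of the constraints, which leaves a nonzero vector with probability one whenever that span is a proper subspace.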

5.3 Setting the $\gamma$ Parameter

To simplify notation, we let $\hat{f}^{*}_{T}$ and $\hat{f}^{*}_{2T}$ be shorthand for $\min_{\mathbf{w}}\hat{f}_{T}(\mathbf{w})$ and $\min_{\mathbf{w}}\hat{f}_{2T}(\mathbf{w})$ respectively, with minimizers $\hat{\mathbf{w}}^{*}_{T}$ and $\hat{\mathbf{w}}^{*}_{2T}$ . We will consider three regimes, depending on the relationships between $D,\Delta,T$ .

5.3.1 Case 1: $\frac{D^{2}}{48\Delta^{2}T^{3}}\leq\frac{1}{T^{2}}$

In this setting, we choose

[TABLE]

Using this and the assumption on the parameters, we get that $\gamma\leq\Delta^{2}<\frac{\Delta^{2}\left(1+4T^{2}\right)}{4T^{2}}<\frac{\Delta^{2}\left(1+T^{2}\right)}{T^{2}}$, and therefore, we are in the first regime for both $f_{T}$ and $f_{2T}$ as specified in Proposition 3. Plugging in the bound on $\|\hat{\mathbf{w}}^{*}\|^{2}$ in that regime, and using the fact that $\Delta^{2}\leq\gamma$ by the assumption above, we have

[TABLE]

as required.

Using the results from Proposition 3 for the first regime, we can compute the optimization error bound

[TABLE]

where in the first inequality we used the fact that $1-\frac{1}{2}x\leq\frac{1}{\sqrt{1+x}}\leq 1-\frac{1}{2}x+\frac{3}{8}x^{2}$ for all $x\geq 0$, and for the last inequality we assumed that $T\geq 2$. In the case that $T=1$, the final result still holds. Hence, the suboptimality is at least $\frac{\mu_{2}D^{3}}{16000T^{7/2}}$.

5.3.2 Case 2: $\frac{1}{T^{2}}<\frac{D^{2}}{48\Delta^{2}T^{3}}\leq 1$

In this setting, we choose

[TABLE]

Using this and the assumption on the parameters, we get that $\frac{\Delta^{2}\left(1+T^{2}\right)}{T^{2}}<\gamma<2\Delta^{2}T$, and therefore, we are in the second regime for both $f_{T}$ and $f_{2T}$ as specified in Proposition 3. Plugging in the bound on $\|\hat{\mathbf{w}}^{*}\|^{2}$ in that regime, and using the fact that $\Delta^{2}<\gamma$ by the assumption above, we have

[TABLE]

as required.

Turning to compute the optimization error bound, and letting $\delta_{T},\delta_{2T}$ denote the quantity $\delta$ in Proposition 3 for $\hat{f}_{T}$ and $\hat{f}_{2T}$ respectively, we have

[TABLE]

To continue, we use the following auxiliary lemma:

Lemma 10.

$\left(2\delta_{2T}-\delta_{T}\right)\left(T\left(\Delta^{2}+\gamma\right)-\Delta T^{2}\left(2\delta_{2T}+\delta_{T}\right)\right)\geq 0$.

Proof.

First we will prove that $T\left(\Delta^{2}+\gamma\right)-\Delta T^{2}\left(2\delta_{2T}+\delta_{T}\right)\geq 0$ .

Since $\delta_{T}=-\Delta T+\Delta T\sqrt{1+\frac{\gamma+\Delta^{2}}{\Delta^{2}T^{2}}}$ and using $\sqrt{1+x}\leq 1+\frac{1}{2}x$ for $x\geq 0$ we have that:

[TABLE]

So

[TABLE]

To complete the proof, it remains to show that $2\delta_{2T}-\delta_{T}\geq 0$ . We have

[TABLE]

Define $\alpha:=\frac{\gamma+\Delta^{2}}{\Delta^{2}T^{2}}\geq 0$ . Hence, we need to prove:

[TABLE]

which is true since $\sqrt{1+\alpha}\leq 1+\frac{1}{2}\alpha$. ∎
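For completeness, the last step of the proof can be written out as follows. This sketch assumes, in line with the expression for $\delta_{T}$ above, that $\delta_{2T}$ is obtained by replacing $T$ with $2T$, so that $\delta_{T}=\Delta T\left(\sqrt{1+\alpha}-1\right)$ and $\delta_{2T}=2\Delta T\left(\sqrt{1+\alpha/4}-1\right)$:

```latex
\[
2\delta_{2T}-\delta_{T}
=\Delta T\left(4\sqrt{1+\tfrac{\alpha}{4}}-3-\sqrt{1+\alpha}\right)
=\Delta T\left(2\sqrt{4+\alpha}-3-\sqrt{1+\alpha}\right),
\]
\[
2\sqrt{4+\alpha}\ \geq\ 3+\sqrt{1+\alpha}
\iff 16+4\alpha\ \geq\ 10+\alpha+6\sqrt{1+\alpha}
\iff 1+\tfrac{1}{2}\alpha\ \geq\ \sqrt{1+\alpha}.
\]
```

Squaring is legitimate in the middle step because both sides are non-negative.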

With this lemma, we can lower bound the optimization error in Eq. (23) by

[TABLE]

To continue, we note that by definition of $\delta_{T},\delta_{2T}$ and the fact that $1+\frac{1}{2}x-\frac{1}{8}x^{2}\leq\sqrt{1+x}\leq 1+\frac{1}{2}x$ , we have

[TABLE]

Therefore,

[TABLE]

Using this inequality, and the fact $(a-b)^{3}\leq a^{3}-b^{3}$ for $a\geq b\geq 0$ , we can lower bound Eq. (24) by

[TABLE]

Hence, the suboptimality is at least $\frac{\mu_{2}D^{3}}{30000T^{7/2}}$ .

5.3.3 Case 3: $\frac{D^{2}}{48\Delta^{2}T^{3}}>1$

In this setting, we choose

[TABLE]

Using this and the assumption on the parameters, we get that $\gamma>4\Delta^{2}T$, and therefore, we are in the third regime for both $f_{T}$ and $f_{2T}$ as specified in Proposition 3. Plugging in the bound on $\|\hat{\mathbf{w}}^{*}_{2T}\|^{2}$ in that regime, and using the fact that $2\Delta^{2}<\gamma$ by the assumption above, we have

[TABLE]

Now, by the assumption that $T\Delta^{3}<\frac{\Delta D^{2}}{48T^{2}}$ and by using the fact that $1-x\leq\frac{1}{1+x}\leq 1-x+x^{2}$ for all $x\geq 0$, the optimization error bound is

[TABLE]

In the last inequality we assumed that $T\geq 3$. For $T=1,2$ it can be easily verified that the inequality holds. Hence, using $\Delta=\frac{3\mu_{1}}{2\mu_{2}}$, the suboptimality is at least $\frac{\mu_{1}D^{2}}{576T^{2}}$.

5.4 Wrapping Up

Combining the three cases from the previous subsection, we see that we get the following lower bound

[TABLE]

Thus, we get that

[TABLE]

Equating these bounds to $\epsilon$ , and solving for $T$ , the theorem follows.
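The step of equating the bounds to $\epsilon$ and solving for $T$ can be made concrete with a small sketch. It assumes that the combined suboptimality bound is the minimum of the per-case expressions $\frac{\mu_{2}D^{3}}{30000T^{7/2}}$ and $\frac{\mu_{1}D^{2}}{576T^{2}}$, matching the $\Omega(\min\{\cdot,\cdot\})$ form of the theorem; the function name is illustrative and the constants are only those appearing in the case analysis above.

```python
def iterations_lower_bound(mu1, mu2, D, eps):
    """Smallest T at which min(second-order bound, first-order bound) can drop to eps."""
    # mu2 * D^3 / (30000 * T^(7/2)) <= eps  <=>  T >= (mu2 * D^3 / (30000 * eps))^(2/7)
    t_second_order = (mu2 * D ** 3 / (30000.0 * eps)) ** (2.0 / 7.0)
    # mu1 * D^2 / (576 * T^2) <= eps  <=>  T >= (mu1 * D^2 / (576 * eps))^(1/2)
    t_first_order = (mu1 * D ** 2 / (576.0 * eps)) ** 0.5
    # the error is only guaranteed to exceed the smaller of the two bounds
    return min(t_first_order, t_second_order)


print(iterations_lower_bound(mu1=1.0, mu2=1.0, D=1.0, eps=1e-6))
```

Taking the minimum reflects that, for a given $T$, only one of the parameter regimes applies, so only the weaker of the two guarantees can be claimed in general.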

6 Proof of Thm. 3

The proof of Thm. 3 follows the same outline as the proof of Thm. 2. We again assume without loss of generality that $\mathbf{w}_{1}=0$, and thus require that $\|\mathbf{w}^{*}\|\leq D$ (see the discussion in the proof of Thm. 2). We define $g:\mathbb{R}\mapsto\mathbb{R}$ as

[TABLE]

and

[TABLE]

By the following lemma, $f_{T}(\mathbf{w})$ is $k$-times differentiable, with a $\mu_{k}$-Lipschitz $k$-th order derivative tensor.

Lemma 11.

$f_{T}(\mathbf{w})$ is $k$-times differentiable, with a $\mu_{k}$-Lipschitz $k$-th order derivative tensor.

Proof.

Similarly to Lemma 8, we can assume without loss of generality that the vectors $\mathbf{v}_{1},\mathbf{v}_{2},\ldots,\mathbf{v}_{T}$ correspond to the standard basis vectors $\mathbf{e}_{1},\mathbf{e}_{2},\ldots,\mathbf{e}_{T}$, so we can examine the Lipschitz property of

[TABLE]

Let

[TABLE]

By differentiating $k$ times, we have that

[TABLE]

where

[TABLE]

Since $g^{(k)}(x)=k!x$

[TABLE]

Note that for a $k$-th order symmetric tensor $T$, the operator norm equals (see e.g. Mu et al. [2015]):

[TABLE]

So,

[TABLE]

where in the first inequality we used that $\|\mathbf{r}_{i}\|\leq\sqrt{2}$ for all $i$.

Plugging that into Eq. (25) we have that

[TABLE]

as required. ∎

6.1 Minimizer of $f_{T}$

In order to derive the complexity bound, we will first analyze $\hat{f}_{T}$, which is a simplified version of $f_{T}$, as defined in Subsection 5.1. It is easily verified that $\min_{\mathbf{w}}f_{T}(\mathbf{w})=\frac{\mu_{k}}{k!2^{\frac{k+3}{2}}}\cdot\min_{\hat{\mathbf{w}}}\hat{f}_{T}(\hat{\mathbf{w}})$, and moreover, if $\hat{\mathbf{w}}^{*}\in\mathbb{R}^{T}$ is a minimizer of $\hat{f}_{T}$, then $\mathbf{w}^{*}=\sum_{j=1}^{T}\hat{w}^{*}_{j}\cdot\mathbf{v}_{j}\in\mathbb{R}^{d}$ is a minimizer of $f_{T}$ with the same Euclidean norm as $\hat{\mathbf{w}}^{*}$.

By a proof identical to that of Lemma 9, $\hat{f}_{T}$ has a unique minimizer $\hat{\mathbf{w}}^{*}\in\mathbb{R}^{T}$, which satisfies

[TABLE]

for some $\delta>0$ and all $t=1,2,\ldots,T$ and

[TABLE]

hence,

[TABLE]

Plugging in the minimizer, we have that

[TABLE]

and,

[TABLE]

where we used $\sum_{t=1}^{T}\left(1-\frac{t}{1+T}\right)^{2}\leq\frac{1}{3}(1+T)$ as in Proposition 3.
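The sum bound used here reduces to the one from the proof of Proposition 3 after the change of index $s=T+1-t$. Written out as a short worked step:

```latex
\[
\sum_{t=1}^{T}\left(1-\frac{t}{1+T}\right)^{2}
=\frac{1}{(1+T)^{2}}\sum_{s=1}^{T}s^{2}
<\frac{1}{(1+T)^{2}}\cdot\frac{(1+T)^{3}}{3}
=\frac{1+T}{3}.
\]
```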

6.2 Oracle Complexity Lower Bound

The derivation of the lower complexity bound will be exactly the same as in Subsection 5.2.

In Subsection 5.2 we showed that we can lower bound the optimization error $f_{2T}(\mathbf{w}_{T})-\min_{\mathbf{w}}f_{2T}(\mathbf{w})$ by $\min_{\mathbf{w}}f_{T}(\mathbf{w})-\min_{\mathbf{w}}f_{2T}(\mathbf{w})$ . Using the fact that

[TABLE]

this equals

[TABLE]

Letting $f_{T}^{*}$ and $\hat{f}_{T}^{*}$ be the minimal values of $f_{T}$ and $\hat{f}_{T}$ respectively, and using Eq. (6.1), we get

[TABLE]

The last inequality holds for $k=1,T\geq 3$ , $k=2,T\geq 2$ or $k\geq 3,T\geq 1$ . It can be verified that for the other cases, the inequality above holds.

Since we want $f_{T}^{*}-f_{2T}^{*}$ to be as large as possible, we will set $\gamma$ to be as large as possible, under the constraint that $\|\mathbf{w}^{*}_{2T}\|\leq D$ . By Eq. (6.1) we can choose

[TABLE]

Thus, according to the discussion in Subsection 6.2, the final bound is

[TABLE]

and the number of iterations $T_{\epsilon}$ required to have $\min_{\mathbf{w}}f_{T}(\mathbf{w})-\min_{\mathbf{w}}f_{2T}(\mathbf{w})<\epsilon$ must satisfy

[TABLE]

where $c=\left(\frac{1}{12}\right)^{\frac{1}{5}}\cdot\left(\frac{\sqrt{2}}{3}\right)^{\frac{4}{5}}$.

Acknowledgments

We thank Yurii Nesterov for several helpful comments on a preliminary version of this paper, as well as Naman Agarwal, Elad Hazan and Zeyuan Allen-Zhu for informing us about the A-NPE algorithm of Monteiro and Svaiter (2013).

Appendix A An Improved Second-Order Oracle Complexity Bound for Strongly Convex Functions

In this section, we show how the A-NPE algorithm of Monteiro and Svaiter [2013], which is a second-order method analyzed for smooth convex functions, can be used to yield near-optimal performance if the function is also strongly convex. Rather than directly adapting their analysis, which is non-trivial, we use a simple restarting scheme, which allows one to convert an algorithm for the convex setting to an algorithm for the strongly convex setting. (We note that the reverse direction, of adapting strongly convex optimization algorithms to the convex case, is more common in the literature, and can be achieved using regularization or more sophisticated approaches [Allen-Zhu and Hazan, 2016].)

Our algorithm is described as follows: In the first phase, we apply a generic restarting scheme (based on [Arjevani and Shamir, 2016a, Subsection 4.2]), where we repeatedly run A-NPE for a bounded number of steps, followed by restarting the algorithm, running it from the last iterate obtained. By strong convexity, we show that each such epoch reduces the suboptimality by a constant factor. Once we reach a point sufficiently close to the global optimum, we switch to the second phase, where we use the cubic-regularized Newton method to get a quadratic convergence rate.
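The restarting scheme of the first phase can be sketched generically as a wrapper around any base method. In the sketch below the base method is plain gradient descent on the one-dimensional quadratic $f(w)=\frac{\lambda}{2}w^{2}$, which is only a stand-in chosen to keep the example runnable; it is not A-NPE, and the names `restart_scheme` and `make_gradient_descent` are hypothetical.

```python
def restart_scheme(run_base, w0, tau, num_epochs):
    """Run `run_base` for `tau` steps per epoch, warm-starting from the last iterate."""
    w = w0
    history = [w]
    for _ in range(num_epochs):
        # any internal state of the base method is reset; only the iterate carries over
        w = run_base(w, tau)
        history.append(w)
    return history


def make_gradient_descent(lam, step):
    """Stand-in base method (NOT A-NPE): gradient descent on f(w) = 0.5 * lam * w**2."""
    def run(w, steps):
        for _ in range(steps):
            w = w - step * lam * w
        return w
    return run


def suboptimality(w, lam):
    """f(w) - f(w*) for the stand-in quadratic, whose minimizer is w* = 0."""
    return 0.5 * lam * w * w


history = restart_scheme(make_gradient_descent(lam=1.0, step=0.5), w0=8.0, tau=1, num_epochs=3)
print(history)
```

In this toy setting each epoch shrinks the suboptimality by a constant factor, which is the property the first phase relies on; the switch to cubic-regularized Newton steps in the second phase is not modeled here.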

To formalize this, let us first analyze the convergence rate of the first phase. We assume that we use the algorithm described in Monteiro and Svaiter [2013, Subsection 7.4]. (Specifically, since in our framework we do not limit computational resources, we assume that the minimization problem in Eq. (6.1) of Monteiro and Svaiter [2013] can be solved exactly.) By [Monteiro and Svaiter, 2013, Theorem 6.4 and Theorem 3.10], we have that the $t$-th iterate satisfies

[TABLE]

where $\mu_{2}$ is the Lipschitz constant of $\nabla^{2}f$ , $\mathbf{w}_{1}$ is the initialization point, $\mathbf{w}^{*}$ is the unique minimizer (due to strong convexity) of $f$ , $D$ bounds $\|\mathbf{w}_{1}-\mathbf{w}^{*}\|$ from above, and $c>0$ is some universal constant. Since $f$ is also assumed to be $\lambda$ -strongly convex, we have

[TABLE]

hence

[TABLE]

Thus, running the algorithm for

[TABLE]

iterations, we see that $f(\mathbf{w}_{t})-f(\mathbf{w}^{*})\leq{(f(\mathbf{w}_{1})-f(\mathbf{w}^{*}))}/{2}$. Now, since the distance from $\mathbf{w}_{t}$ to $\mathbf{w}^{*}$ is also smaller than $D$, we may initialize the algorithm at the last iterate returned by the previous run and run it for $\tau$ iterations to reduce $f(\mathbf{w}_{t})-f(\mathbf{w}^{*})$ by, yet again, a factor of 2. Applying the algorithm for $T$ iterations (and restarting the algorithmic parameters after every $\tau$ iterations) yields

[TABLE]

Equivalently, to obtain an $\epsilon$ -optimal solution, we need at most

[TABLE]

oracle calls (note that this restarting scheme can also be applied to uniformly convex functions of any order; see, e.g., Vladimirov et al. [1978]).
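The halving argument behind the first phase can be sketched as follows. The sketch assumes that the guarantee for the $t$-th iterate has the form $c\mu_{2}\|\mathbf{w}_{1}-\mathbf{w}^{*}\|^{3}/t^{7/2}$; this form is an assumption consistent with the $2/7$ exponents appearing elsewhere in the paper, and the constants in the paper's displayed expressions may differ. It combines that rate with $\|\mathbf{w}_{1}-\mathbf{w}^{*}\|\leq D$ and the strong convexity inequality $\frac{\lambda}{2}\|\mathbf{w}_{1}-\mathbf{w}^{*}\|^{2}\leq f(\mathbf{w}_{1})-f(\mathbf{w}^{*})$:

```latex
\[
f(\mathbf{w}_{t})-f(\mathbf{w}^{*})
\le\frac{c\,\mu_{2}\,\|\mathbf{w}_{1}-\mathbf{w}^{*}\|^{3}}{t^{7/2}}
\le\frac{c\,\mu_{2}\,D}{t^{7/2}}\cdot\frac{2\left(f(\mathbf{w}_{1})-f(\mathbf{w}^{*})\right)}{\lambda},
\]
\[
\text{so}\quad
t\ \geq\ \left(\frac{4c\,\mu_{2}D}{\lambda}\right)^{2/7}
\ \Longrightarrow\
f(\mathbf{w}_{t})-f(\mathbf{w}^{*})\le\tfrac{1}{2}\left(f(\mathbf{w}_{1})-f(\mathbf{w}^{*})\right).
\]
```

Under this assumption, roughly $\log_{2}\left(\left(f(\mathbf{w}_{1})-f(\mathbf{w}^{*})\right)/\epsilon\right)$ epochs of that length suffice to reach an $\epsilon$-optimal point.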

Next, after performing a number of iterations sufficiently large to obtain high accuracy solutions, we proceed to the second phase of the algorithm where cubic-regularized Newton steps are applied (see Nesterov [2008]). According to that analysis, after reducing the optimization error to below $\lambda^{3}/4\mu_{2}^{2}$ , the number of cubic-regularized Newton steps required to achieve an $\epsilon$ -suboptimal solution is

[TABLE]

Thus, using the $\mu_{1}$ -Lipschitzness of the gradient to bound $f(\mathbf{w}_{1})-f(\mathbf{w}^{*})$ from above by $\mu_{1}D^{2}/2$ , we get that the overall number of iterations is at most

[TABLE]

Bibliography

Agarwal and Hazan [2017] Naman Agarwal and Elad Hazan. Lower bounds for higher-order optimization. Working draft, 2017.
Allen-Zhu and Hazan [2016] Zeyuan Allen-Zhu and Elad Hazan. Optimal black-box reductions between optimization objectives. In Advances in Neural Information Processing Systems, pages 1614–1622, 2016.
Arjevani and Shamir [2016a] Yossi Arjevani and Ohad Shamir. On the iteration complexity of oblivious first-order optimization algorithms. In International Conference on Machine Learning, pages 908–916, 2016a.
Arjevani and Shamir [2016b] Yossi Arjevani and Ohad Shamir. Oracle complexity of second-order methods for finite-sum problems. arXiv preprint arXiv:1611.04982, 2016b.
Bach [2010] Francis Bach. Self-concordant analysis for logistic regression. Electronic Journal of Statistics, 4:384–414, 2010.
Baes [2009] Michel Baes. Estimate sequence methods: extensions and approximations. Institute for Operations Research, ETH, Zürich, Switzerland, 2009.
Boyd and Vandenberghe [2004] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
Cartis et al. [2010] Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. On the complexity of steepest descent, Newton's and regularized Newton's methods for nonconvex unconstrained optimization problems. SIAM Journal on Optimization, 20(6):2833–2852, 2010.
