Analysis of Optimization Algorithms via Sum-of-Squares

Sandra S. Y. Tan; Antonios Varvitsiotis; Vincent Y. F. Tan

arXiv:1906.04648·math.OC·June 23, 2021


TL;DR

This paper introduces a sum-of-squares (SOS) based framework to analyze and certify convergence rates of first-order convex optimization algorithms, unifying existing methods and providing new bounds especially for noisy gradient descent.

Contribution

It establishes a hierarchy of semidefinite programs for convergence analysis, connecting SOS proofs with the Performance Estimation Problem framework, and derives new bounds for noisy gradient descent.

Findings

01

First level SOS hierarchy corresponds to the PEP framework.

02

New convergence bounds for noisy gradient descent with inexact line search.

03

SOS framework offers a systematic approach to certify improved convergence rates.


Equations

$$
\begin{aligned}
t_{*}=\ \text{minimize}\quad & t\\
\text{subject to}\quad & f_{k+1}-f_{*}\leq t\,(f_{k}-f_{*}),\\
& \mathbf{x}_{k+1}=\mathcal{A}(\mathbf{x}_{0},\dots,\mathbf{x}_{k};f_{0},\dots,f_{k};\mathbf{g}_{0},\dots,\mathbf{g}_{k}),\\
& \text{for all } f\in\mathcal{F},
\end{aligned}
$$

$$K=\{\mathbf{z}:\ h_{i}(\mathbf{z})\geq 0,\ i\in[m],\ v_{j}(\mathbf{z})=0,\ j\in[m^{\prime}]\},$$

$$t(f_{k}-f_{*})-(f_{k+1}-f_{*})=\sigma_{0}(\mathbf{z})+\sum_{i=1}^{m}\sigma_{i}(\mathbf{z})h_{i}(\mathbf{z})+\sum_{j=1}^{m^{\prime}}\theta_{j}(\mathbf{z})v_{j}(\mathbf{z}),$$

$$
\begin{aligned}
\underset{f,\,\mathbf{x}_{0},\dots,\mathbf{x}_{N},\,\mathbf{x}_{*}}{\text{maximize}}\quad & f(\mathbf{x}_{N})-f(\mathbf{x}_{*})\\
\text{subject to}\quad & f\in\mathcal{F},\\
& \mathbf{x}_{k+1}=\mathcal{A}(\mathbf{x}_{0},\dots,\mathbf{x}_{k};f_{0},\dots,f_{k};\nabla f(\mathbf{x}_{0}),\dots,\nabla f(\mathbf{x}_{k})),\quad 0\leq k\leq N-1,\\
& \mathbf{x}_{*}\text{ is a minimizer of } f \text{ on } \mathbb{R}^{n},\quad \|\mathbf{x}_{0}-\mathbf{x}_{*}\|\leq R,\\
& \mathbf{x}_{0},\dots,\mathbf{x}_{N},\mathbf{x}_{*}\in\mathbb{R}^{n},
\end{aligned}
$$

$$
\begin{aligned}
\underset{\{\mathbf{x}_{i},\mathbf{g}_{i},f_{i}\}_{i\in I}}{\text{maximize}}\quad & f_{N}-f_{*}\\
\text{subject to}\quad & \exists f\in\mathcal{F}\text{ such that } f_{i}=f(\mathbf{x}_{i}),\ \mathbf{g}_{i}=\nabla f(\mathbf{x}_{i})\text{ for all } i\in I,\\
& \mathbf{x}_{k+1}=\mathcal{A}(\mathbf{x}_{0},\dots,\mathbf{x}_{k};f_{0},\dots,f_{k};\mathbf{g}_{0},\dots,\mathbf{g}_{k}),\quad k=0,\dots,N-1,\\
& \mathbf{g}_{*}=0,\quad \|\mathbf{x}_{0}-\mathbf{x}_{*}\|\leq R,
\end{aligned}
$$

$$\langle c_{i},\mathbf{f}\rangle+\langle C_{i},G\rangle\geq a_{i}\quad\text{or}\quad\langle d_{j},\mathbf{f}\rangle+\langle D_{j},G\rangle=b_{j},$$

$$\langle c_{i},\mathbf{z}_{0}\rangle+\sum_{\ell=1}^{n}\mathbf{z}_{\ell}^{\top}C_{i}\mathbf{z}_{\ell}\geq a_{i}\quad\text{or}\quad\langle d_{j},\mathbf{z}_{0}\rangle+\sum_{\ell=1}^{n}\mathbf{z}_{\ell}^{\top}D_{j}\mathbf{z}_{\ell}=b_{j},$$

$$f_{k+1}-f_{*}\leq\left[1-\frac{4\mu\epsilon(1-\delta)^{2}}{\eta L}\left(\frac{1-\delta}{(1+\delta)^{2}}-\epsilon\right)\right](f_{k}-f_{*}),$$

$$f_{k+1}-f_{*}\leq\left(1-\frac{4\mu\epsilon(1-\delta)^{2}}{L}\left[\frac{1-\delta}{(1+\delta)^{2}}-(1-\epsilon)\right]\right)(f_{k}-f_{*}),$$

$$f_{k+1}-f_{*}\leq\left(1-\frac{2\mu c_{1}(1-c_{2})}{L}\right)(f_{k}-f_{*}),$$

$$p(\mathbf{z})=\sigma_{0}(\mathbf{z})+\sum_{i=1}^{m}\sigma_{i}(\mathbf{z})h_{i}(\mathbf{z})+\sum_{j=1}^{m^{\prime}}\theta_{j}(\mathbf{z})v_{j}(\mathbf{z}),$$

$$\mathbf{z}:=(f_{*},f_{k},f_{k+1},\mathbf{x}_{*},\mathbf{x}_{k},\mathbf{x}_{k+1},\mathbf{g}_{*},\mathbf{g}_{k},\mathbf{g}_{k+1})\in\mathbb{R}^{6n+3}$$

$$K:=\{\mathbf{z}:\ h_{i}(\mathbf{z})\geq 0,\ i\in[m],\ v_{j}(\mathbf{z})=0,\ j\in[m^{\prime}]\},$$

$$p_{t}(\mathbf{z}):=t(f_{k}-f_{*})-(f_{k+1}-f_{*})$$

$$t_{poly}:=\inf\{t:\ p_{t}(\mathbf{z})\geq 0\ \ \forall\mathbf{z}\in K,\ t\in(0,1)\},$$

$$
\begin{aligned}
t_{d}:=\ \text{minimize}\quad & t\\
\text{subject to}\quad & p_{t}(\mathbf{z})=\sigma_{0}(\mathbf{z})+\sum_{i=1}^{m}\sigma_{i}(\mathbf{z})h_{i}(\mathbf{z})+\sum_{j=1}^{m^{\prime}}\theta_{j}(\mathbf{z})v_{j}(\mathbf{z}),\\
& t\in(0,1),\\
& \sigma_{0}(\mathbf{z}):\ \text{SOS polynomial with }\deg(\sigma_{0}(\mathbf{z}))\leq 2d,\\
& \sigma_{i}(\mathbf{z}):\ \text{SOS polynomial with }\deg(\sigma_{i}(\mathbf{z})h_{i}(\mathbf{z}))\leq 2d,\\
& \theta_{j}(\mathbf{z}):\ \text{arbitrary polynomial with }\deg(\theta_{j}(\mathbf{z})v_{j}(\mathbf{z}))\leq 2d,
\end{aligned}
$$

$$t_{*}\leq t_{poly}\leq\ldots\leq t_{d+1}\leq t_{d},\quad\text{for all } d\in\mathbb{N}.$$

$$\|\mathbf{g}_{1}-\mathbf{g}_{2}\|\leq L\|\mathbf{x}_{1}-\mathbf{x}_{2}\|,\quad\forall\,\mathbf{x}_{1},\mathbf{x}_{2}\in\mathbb{R}^{n},\ \mathbf{g}_{1}=\nabla f(\mathbf{x}_{1}),\ \mathbf{g}_{2}=\nabla f(\mathbf{x}_{2}),$$

$$f_{i}-f_{j}-\mathbf{g}_{j}^{\top}(\mathbf{x}_{i}-\mathbf{x}_{j})\geq\frac{L}{2(L-\mu)}\left(\frac{1}{L}\|\mathbf{g}_{i}-\mathbf{g}_{j}\|^{2}+\mu\|\mathbf{x}_{i}-\mathbf{x}_{j}\|^{2}-2\frac{\mu}{L}(\mathbf{g}_{j}-\mathbf{g}_{i})^{\top}(\mathbf{x}_{j}-\mathbf{x}_{i})\right).$$

$$
\begin{aligned}
f_{k}-f_{k+1}-\mathbf{g}_{k+1}^{\top}(\mathbf{x}_{k}-\mathbf{x}_{k+1})-\alpha\left(\tfrac{1}{L}\|\mathbf{g}_{k}-\mathbf{g}_{k+1}\|^{2}+\mu\|\mathbf{x}_{k}-\mathbf{x}_{k+1}\|^{2}-2\tfrac{\mu}{L}(\mathbf{g}_{k+1}-\mathbf{g}_{k})^{\top}(\mathbf{x}_{k+1}-\mathbf{x}_{k})\right)&\geq 0\\
f_{k}-f_{*}-\mathbf{g}_{*}^{\top}(\mathbf{x}_{k}-\mathbf{x}_{*})-\alpha\left(\tfrac{1}{L}\|\mathbf{g}_{k}-\mathbf{g}_{*}\|^{2}+\mu\|\mathbf{x}_{k}-\mathbf{x}_{*}\|^{2}-2\tfrac{\mu}{L}(\mathbf{g}_{*}-\mathbf{g}_{k})^{\top}(\mathbf{x}_{*}-\mathbf{x}_{k})\right)&\geq 0\\
f_{k+1}-f_{k}-\mathbf{g}_{k}^{\top}(\mathbf{x}_{k+1}-\mathbf{x}_{k})-\alpha\left(\tfrac{1}{L}\|\mathbf{g}_{k+1}-\mathbf{g}_{k}\|^{2}+\mu\|\mathbf{x}_{k+1}-\mathbf{x}_{k}\|^{2}-2\tfrac{\mu}{L}(\mathbf{g}_{k}-\mathbf{g}_{k+1})^{\top}(\mathbf{x}_{k}-\mathbf{x}_{k+1})\right)&\geq 0\\
f_{k+1}-f_{*}-\mathbf{g}_{*}^{\top}(\mathbf{x}_{k+1}-\mathbf{x}_{*})-\alpha\left(\tfrac{1}{L}\|\mathbf{g}_{k+1}-\mathbf{g}_{*}\|^{2}+\mu\|\mathbf{x}_{k+1}-\mathbf{x}_{*}\|^{2}-2\tfrac{\mu}{L}(\mathbf{g}_{*}-\mathbf{g}_{k+1})^{\top}(\mathbf{x}_{*}-\mathbf{x}_{k+1})\right)&\geq 0\\
f_{*}-f_{k}-\mathbf{g}_{k}^{\top}(\mathbf{x}_{*}-\mathbf{x}_{k})-\alpha\left(\tfrac{1}{L}\|\mathbf{g}_{*}-\mathbf{g}_{k}\|^{2}+\mu\|\mathbf{x}_{*}-\mathbf{x}_{k}\|^{2}-2\tfrac{\mu}{L}(\mathbf{g}_{k}-\mathbf{g}_{*})^{\top}(\mathbf{x}_{k}-\mathbf{x}_{*})\right)&\geq 0\\
f_{*}-f_{k+1}-\mathbf{g}_{k+1}^{\top}(\mathbf{x}_{*}-\mathbf{x}_{k+1})-\alpha\left(\tfrac{1}{L}\|\mathbf{g}_{*}-\mathbf{g}_{k+1}\|^{2}+\mu\|\mathbf{x}_{*}-\mathbf{x}_{k+1}\|^{2}-2\tfrac{\mu}{L}(\mathbf{g}_{k+1}-\mathbf{g}_{*})^{\top}(\mathbf{x}_{k+1}-\mathbf{x}_{*})\right)&\geq 0.
\end{aligned}
$$


Full text

Analysis of Optimization Algorithms via Sum-of-Squares

Sandra S. Y. Tan

Antonios Varvitsiotis

Vincent Y. F. Tan

Sandra S. Y. Tan was with the Department of Electrical and Computer Engineering, National University of Singapore ([email protected]). Antonios Varvitsiotis was with the Department of Electrical and Computer Engineering and Department of Industrial Systems Engineering and Management, National University of Singapore ([email protected]). Vincent Tan is with the Department of Electrical and Computer Engineering and Department of Mathematics, National University of Singapore ([email protected]).

Abstract

We introduce a new framework for unifying and systematizing the performance analysis of first-order black-box optimization algorithms for unconstrained convex minimization. The low-cost iteration complexity enjoyed by first-order algorithms renders them particularly relevant for applications in machine learning and large-scale data analysis. Relying on sum-of-squares (SOS) optimization, we introduce a hierarchy of semidefinite programs that give increasingly better convergence bounds for higher levels of the hierarchy. Alluding to the power of the SOS hierarchy, we show that the (dual of the) first level corresponds to the Performance Estimation Problem (PEP) introduced by Drori and Teboulle [Math. Program., 145(1):451–482, 2014], a powerful framework for determining convergence rates of first-order optimization algorithms. Consequently, many results obtained within the PEP framework can be reinterpreted as degree-1 SOS proofs, and thus, the SOS framework provides a promising new approach for certifying improved rates of convergence by means of higher-order SOS certificates. To determine analytical rate bounds, in this work we use the first level of the SOS hierarchy and derive new results for noisy gradient descent with inexact line search methods (Armijo, Wolfe, and Goldstein).

1 Introduction

The pervasiveness of machine learning and big-data analytics throughout most academic fields and industrial domains has triggered renewed interest in convex optimization, the subfield of mathematical optimization that is concerned with minimizing a convex objective function over a convex set of decision variables. Of particular relevance for solving large-scale convex optimization problems with low accuracy requirements are first-order algorithms, defined as iterative algorithms that only use (sub)gradient information.

There exists extensive literature on the convergence analysis of first-order optimization algorithms with respect to various performance metrics; see, e.g., [2, 4, 5, 6] and the references therein. However, existing convergence results typically rely on case-by-case analyses and cannot be understood by a common guiding principle. In this work we introduce a unified framework for deriving worst-case upper bounds on the convergence rates of first-order optimization algorithms, through the use of sum-of-squares (SOS) optimization.

SOS optimization is an active research area with important practical applications; see, e.g., [3, 29, 21, 22, 30]. The key idea underlying SOS optimization is to use semidefinite programming (SDP) relaxations for certifying the nonnegativity of a polynomial over a set defined by polynomial (in)equalities. This allows one to construct hierarchies of SDPs that approximate the optimal value of arbitrary polynomial optimization problems.

To illustrate the main ingredients of our approach, consider the problem of minimizing a convex function $f:\mathbb{R}^{n}\to\mathbb{R}$ over $\mathbb{R}^{n}$, i.e., $\min_{\mathbf{x}\in\mathbb{R}^{n}}f(\mathbf{x})$, and let $\mathbf{x}_{*}$ be a global minimizer. Any solution strategy entails choosing a black-box algorithm $\mathcal{A}$ that generates a sequence of iterates $\{\mathbf{x}_{k}\}_{k\geq 1}$. Our goal is then to estimate the worst-case convergence rate of $\mathcal{A}$ with respect to a fixed family of functions $\mathcal{F}$ and an appropriate measure of performance (e.g., distance to optimality $\|\mathbf{x}_{k}-\mathbf{x}_{*}\|$ or objective function accuracy $f(\mathbf{x}_{k})-f(\mathbf{x}_{*})$). For concreteness, using as performance metric the objective function accuracy and a first-order algorithm $\mathcal{A}$ that does not increase the objective function value at each step, we seek to solve the following optimization problem:

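$$
\begin{aligned}
t_{*}=\ \text{minimize}\quad & t\\
\text{subject to}\quad & f_{k+1}-f_{*}\leq t\,(f_{k}-f_{*}),\\
& \mathbf{x}_{k+1}=\mathcal{A}(\mathbf{x}_{0},\dots,\mathbf{x}_{k};f_{0},\dots,f_{k};\mathbf{g}_{0},\dots,\mathbf{g}_{k}),\\
& \text{for all } f\in\mathcal{F},
\end{aligned}\tag{1}
$$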

where we set $f_{k}=f(\operatorname{\mathbf{x}}_{k})$ and $\operatorname{\mathbf{g}}_{k}=\nabla f(\operatorname{\mathbf{x}}_{k})$ for all $k\geq 1$ . As the optimization problem (1) is hard in general, we relax it into a tractable convex program (in fact, an SDP), in two steps. In the first step, we derive necessary conditions that are expressed as polynomial inequalities ${h_{1}(\operatorname{\mathbf{z}})\geq 0},\ldots,{h_{m}(\operatorname{\mathbf{z}})\geq 0}$ and equalities $v_{1}(\operatorname{\mathbf{z}})=0,\ldots,v_{m^{\prime}}(\operatorname{\mathbf{z}})=0$ , in terms of the variables in $\operatorname{\mathbf{z}}=(f_{*},f_{k},f_{k+1},\operatorname{\mathbf{x}}_{*},\operatorname{\mathbf{x}}_{k},\operatorname{\mathbf{x}}_{k+1},\operatorname{\mathbf{g}}_{*},\operatorname{\mathbf{g}}_{k},\operatorname{\mathbf{g}}_{k+1})$ , which are dictated by the choice of the algorithm and the corresponding class of functions. Having identified these necessary polynomial constraints, the first relaxation of the optimization problem (1) is to find the minimum $t\in(0,1)$ such that the polynomial $t(f_{k}-f_{*})-(f_{k+1}-f_{*})$ is nonnegative over the semi-algebraic set

$$K=\{\mathbf{z}:\ h_{i}(\mathbf{z})\geq 0,\ i\in[m],\ v_{j}(\mathbf{z})=0,\ j\in[m^{\prime}]\},$$

where here and throughout we use the notation $[m]=\{1,\dots,m\}$ . Nevertheless, as this second problem is also hard in general, in the second step we further relax this constraint by demanding that the nonnegativity of the polynomial $t(f_{k}-f_{*})-(f_{k+1}-f_{*})$ over $K$ is certified by an SOS decomposition:

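$$t(f_{k}-f_{*})-(f_{k+1}-f_{*})=\sigma_{0}(\mathbf{z})+\sum_{i=1}^{m}\sigma_{i}(\mathbf{z})h_{i}(\mathbf{z})+\sum_{j=1}^{m^{\prime}}\theta_{j}(\mathbf{z})v_{j}(\mathbf{z}),\tag{2}$$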

where the $\sigma_{i}(\operatorname{\mathbf{z}})$ ’s are SOS polynomials and the $\theta_{j}(\operatorname{\mathbf{z}})$ ’s are arbitrary polynomials. Clearly, expression (2) certifies that $t(f_{k}-f_{*})-(f_{k+1}-f_{*})$ is nonnegative over the semi-algebraic set $K$ . Furthermore, once the degree of the $\sigma_{i}$ ’s and the $\theta_{j}$ ’s has been fixed, the problem of finding the least $t\in(0,1)$ such that (2) holds is an instance of an SDP, and thus, it can be solved efficiently.
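To make the quantity $t_{*}$ in (1) concrete, the following sketch probes it from below by sampling. It is an illustration only: the function family (one-dimensional quadratics $f(x)=\tfrac{a}{2}x^{2}$ with $\mu\leq a\leq L$), the fixed step size $1/L$, and the helper name `sampled_contraction` are assumptions of this sketch and are not taken from the paper. Sampling a subfamily can only under-estimate the worst case, whereas the relaxations described above bound it from above.

```python
import random

def sampled_contraction(mu, L, trials=10_000, seed=0):
    """Largest observed ratio (f_{k+1} - f_*) / (f_k - f_*) for one
    gradient step of size 1/L on f(x) = a*x^2/2, with a drawn from [mu, L]."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        a = rng.uniform(mu, L)        # curvature of the sampled quadratic
        x = rng.uniform(0.5, 1.5)     # nonzero starting point
        x_next = x - (a * x) / L      # gradient of f at x is a*x
        f = 0.5 * a * x * x           # f_* = 0 for this family
        f_next = 0.5 * a * x_next * x_next
        worst = max(worst, f_next / f)
    return worst

mu, L = 1.0, 10.0
estimate = sampled_contraction(mu, L)
closed_form = (1.0 - mu / L) ** 2     # supremum of the ratio over this quadratic family
print(estimate, closed_form)
```

For this toy family the ratio equals $(1-a/L)^{2}$, so the sampled value approaches $(1-\mu/L)^{2}$ from below; nothing here certifies an upper bound over all of $\mathcal{F}$.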

1.1 Related Work

Performance Estimation Problem.

Our work was motivated by the recent framework introduced by Drori and Teboulle [10] that casts the search for worst-case rate bounds as an infinite-dimensional optimization problem:

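$$
\begin{aligned}
\underset{f,\,\mathbf{x}_{0},\dots,\mathbf{x}_{N},\,\mathbf{x}_{*}}{\text{maximize}}\quad & f(\mathbf{x}_{N})-f(\mathbf{x}_{*})\\
\text{subject to}\quad & f\in\mathcal{F},\\
& \mathbf{x}_{k+1}=\mathcal{A}(\mathbf{x}_{0},\dots,\mathbf{x}_{k};f_{0},\dots,f_{k};\nabla f(\mathbf{x}_{0}),\dots,\nabla f(\mathbf{x}_{k})),\quad 0\leq k\leq N-1,\\
& \mathbf{x}_{*}\text{ is a minimizer of } f \text{ on } \mathbb{R}^{n},\quad \|\mathbf{x}_{0}-\mathbf{x}_{*}\|\leq R,\\
& \mathbf{x}_{0},\dots,\mathbf{x}_{N},\mathbf{x}_{*}\in\mathbb{R}^{n},
\end{aligned}\tag{PEP}
$$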

called the Performance Estimation Problem (PEP). A series of recent works has highlighted the PEP as an extremely useful tool in various settings, including the study of worst-case guarantees for first-order optimization algorithms [10, 39, 38, 7, 36, 37, 8], the design of optimal methods [10, 20, 19, 11, 9], and the study of worst-case guarantees for solving monotone inclusion problems [32, 18, 16, 24]. The PEP captures the worst-case objective function accuracy over all functions within $\operatorname{\mathcal{F}}$ , after $N$ iterations of the algorithm $\operatorname{\mathcal{A}}$ from any starting point $\operatorname{\mathbf{x}}_{0}$ , which is within distance $R$ from some minimizer $\operatorname{\mathbf{x}}_{*}$ .

Although the PEP is infinite-dimensional (as its search space includes all functions in the class $\operatorname{\mathcal{F}}$ ), it can be transformed into an equivalent finite-dimensional problem using the (smooth) convex interpolation approach introduced in [38]. Following [38], the functional constraint $f\in\operatorname{\mathcal{F}}$ is discretized by introducing $2(N+2)$ additional variables capturing the value and the gradient of the function at the points $\operatorname{\mathbf{x}}_{0},\ldots,\operatorname{\mathbf{x}}_{N},\operatorname{\mathbf{x}}_{*}$ . Specifically, setting $I=\{0,1,\ldots,N,*\}$ , the finite-dimensional problem

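$$
\begin{aligned}
\underset{\{\mathbf{x}_{i},\mathbf{g}_{i},f_{i}\}_{i\in I}}{\text{maximize}}\quad & f_{N}-f_{*}\\
\text{subject to}\quad & \exists f\in\mathcal{F}\text{ such that } f_{i}=f(\mathbf{x}_{i}),\ \mathbf{g}_{i}=\nabla f(\mathbf{x}_{i})\text{ for all } i\in I,\\
& \mathbf{x}_{k+1}=\mathcal{A}(\mathbf{x}_{0},\dots,\mathbf{x}_{k};f_{0},\dots,f_{k};\mathbf{g}_{0},\dots,\mathbf{g}_{k}),\quad k=0,\dots,N-1,\\
& \mathbf{g}_{*}=0,\quad \|\mathbf{x}_{0}-\mathbf{x}_{*}\|\leq R,
\end{aligned}\tag{f-PEP}
$$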

with decision variables $\{\operatorname{\mathbf{x}}_{i},\operatorname{\mathbf{g}}_{i},f_{i}\}_{i\in I}$ , is equivalent to the PEP in the sense that their optimal values coincide and an optimal solution to the PEP can be transformed to an optimal solution to the f-PEP (and conversely).

The seemingly simple step of reformulating the PEP into f-PEP by discretizing and introducing interpolability constraints leads naturally to a powerful approach for evaluating (or upper bounding) the value of the f-PEP. Specifically, if interpolability with respect to $\operatorname{\mathcal{F}}$ and the iterates generated by $\mathcal{A}$ satisfy conditions that are linear in $\operatorname{\mathbf{f}}=(f_{0},\ldots,f_{N},f_{*})$ and the entries of the Gram matrix $G=X^{\top}X$ , where $X=(\operatorname{\mathbf{x}}_{0}\ \dots\ \operatorname{\mathbf{x}}_{N}\ \operatorname{\mathbf{x}}_{*}\ \operatorname{\mathbf{g}}_{0}\ \dots$ $\ \operatorname{\mathbf{g}}_{N}\ \operatorname{\mathbf{g}}_{*})\in\operatorname{\mathbb{R}}^{n\times 2(N+2)}$ , the value of the f-PEP is upper bounded by the SDP defined by all necessary functional and algorithmic constraints, as well as the appropriate reformulations in terms of $\operatorname{\mathbf{f}}$ and $G$ of the optimality condition $\operatorname{\mathbf{g}}_{*}=0$ and the initialization condition $\|\operatorname{\mathbf{x}}_{0}-\operatorname{\mathbf{x}}_{*}\|\leq R$ .

Moreover, in the case where interpolability with respect to $\operatorname{\mathcal{F}}$ and the first-order method under consideration are both linearly Gram-representable, i.e., exactly characterized by a finite number of constraints that are linear in $\operatorname{\mathbf{f}}$ and in the entries of $G$ , the corresponding SDP relaxation of f-PEP is tight, for large enough values of $n$ .

Interpolability conditions have been formulated exactly for various function classes, including the class of $L$-smooth and $\mu$-strongly convex functions [38, Theorem 5], indicator and support functions [37, Section 3.3], and smooth and nonconvex functions [37, Section 3.4]. In terms of the tightness of the SDP relaxation of the f-PEP, when $\mathcal{F}$ is one of the aforementioned function classes and the corresponding algorithm is a fixed-step linear first-order method as defined in [37, Definition 2.11], the SDP relaxation is tight as long as $2(N+1)\leq n$ [37, Proposition 2.6].

Integral Quadratic Constraints.

A competing approach that uses SDPs to analyze iterative optimization algorithms was introduced in [23]. In this setting, the minimizers of the function of interest are mapped to the fixed points of a discrete-time linear dynamical system with a nonlinear feedback law, whose convergence is then analyzed using integral quadratic constraints (IQCs). The IQC approach allows one to derive analytical and numerical upper bounds on the convergence rates for various algorithms by solving small SDPs. For instance, the algorithms considered in [23] include the gradient method, the heavy-ball method, and Nesterov’s accelerated method (and related variants), applied to smooth and strongly convex functions.

The line of research initiated in [23] has been generalized further in various directions. Some notable examples include the convergence analysis of the ADMM method [27], the case of non-strongly convex objective functions [13], the generalization to stochastic algorithms [17], and the design of first-order optimization algorithms [12]. In addition, an approach drawing upon ideas from both the PEP and IQC frameworks, together with a comparison between the two, was proposed in [35].

1.2 Summary of Results

In most instances where the PEP framework was applied in the literature, close inspection of the proofs of the analytic worst-case bounds reveals that they can be reinterpreted as simple, i.e., low-degree SOS certificates; see, e.g., [39, Appendix A], [38, Section 3.6], and [7, Section 4.1]. This observation is the point of departure for our work, whose aim is to unify the aforementioned results, and additionally, to make the search for the underlying SOS certificates explicit.

As it turns out, the connection between the PEP and the SOS framework is an instance of SDP duality. Specifically, we have mentioned that relaxing the f-PEP into an SDP requires the functional and algorithmic constraints to imply linear constraints of the form

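$$\langle c_{i},\mathbf{f}\rangle+\langle C_{i},G\rangle\geq a_{i}\quad\text{or}\quad\langle d_{j},\mathbf{f}\rangle+\langle D_{j},G\rangle=b_{j},\tag{3}$$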

for appropriate vectors $c_{i},d_{j}$, matrices $C_{i},D_{j}$ and scalars $a_{i},b_{j}$. Nevertheless, notice that an equivalent way of expressing the constraints in (3) is as polynomial constraints in the variables $f_{*},f_{k},f_{k+1},\mathbf{x}_{*},\mathbf{x}_{k},\mathbf{x}_{k+1},\mathbf{g}_{*},\mathbf{g}_{k},\mathbf{g}_{k+1}$. Specifically, setting $\mathbf{z}_{0}=(f_{*},f_{k},f_{k+1})$ and $\mathbf{z}_{\ell}=(x_{*}(\ell),x_{k}(\ell),x_{k+1}(\ell),g_{*}(\ell),g_{k}(\ell),g_{k+1}(\ell))$, where we use $x_{*}(\ell)$ to denote the $\ell$th coordinate of $\mathbf{x}_{*}$ for $\ell\in[n]$, the constraints in (3) may be equivalently expressed as

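$$\langle c_{i},\mathbf{z}_{0}\rangle+\sum_{\ell=1}^{n}\mathbf{z}_{\ell}^{\top}C_{i}\mathbf{z}_{\ell}\geq a_{i}\quad\text{or}\quad\langle d_{j},\mathbf{z}_{0}\rangle+\sum_{\ell=1}^{n}\mathbf{z}_{\ell}^{\top}D_{j}\mathbf{z}_{\ell}=b_{j},\tag{4}$$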

i.e., as polynomials in the variables $\mathbf{z}_{0},\mathbf{z}_{1},\ldots,\mathbf{z}_{n}$, to which we apply the SOS framework. Formalizing this connection, in Theorem 3 we show that the dual of the first level of the SOS hierarchy is equivalent to the PEP when the functional and algorithmic constraints are linearly Gram-representable. This allows us to reinterpret existing rate bounds derived within the PEP framework as degree-1 SOS certificates.
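The passage from the Gram-matrix form to the polynomial form rests on the identity $\langle C,G\rangle=\sum_{\ell=1}^{n}\mathbf{z}_{\ell}^{\top}C\,\mathbf{z}_{\ell}$, where $G=X^{\top}X$ and $\mathbf{z}_{\ell}$ is the $\ell$th row of $X$. The following minimal sketch checks this identity exactly on random integer data; the helper names (`gram`, `frobenius_inner`, `quadratic_form`) and the test matrices are hypothetical and serve only as an illustration.

```python
import random

def gram(X):
    """G = X^T X, with X given as a list of n rows (one row per coordinate)."""
    cols = len(X[0])
    return [[sum(row[a] * row[b] for row in X) for b in range(cols)]
            for a in range(cols)]

def frobenius_inner(C, G):
    """<C, G> = sum over all entries of C[a][b] * G[a][b]."""
    m = len(C)
    return sum(C[a][b] * G[a][b] for a in range(m) for b in range(m))

def quadratic_form(C, z):
    """z^T C z for a vector z."""
    m = len(z)
    return sum(z[a] * C[a][b] * z[b] for a in range(m) for b in range(m))

rng = random.Random(0)
n, m = 4, 6  # n coordinates; columns: x_*, x_k, x_{k+1}, g_*, g_k, g_{k+1}
X = [[rng.randint(-5, 5) for _ in range(m)] for _ in range(n)]
A = [[rng.randint(-3, 3) for _ in range(m)] for _ in range(m)]
C = [[A[a][b] + A[b][a] for b in range(m)] for a in range(m)]  # symmetric C

lhs = frobenius_inner(C, gram(X))            # linear in the entries of G
rhs = sum(quadratic_form(C, z) for z in X)   # quadratic in z_1, ..., z_n
print(lhs == rhs)
```

Because the data are integers, the comparison is exact: a constraint that is linear in $G$ is the same constraint written as a quadratic polynomial in the coordinates.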

Nevertheless, despite its many successful applications, the PEP framework does not offer a systematic way by which the SDP relaxation can be strengthened when the function class under consideration or the employed algorithm is not linearly Gram-representable. Indeed, recall that to go from the PEP to an SDP we take two relaxation steps. In the first step we extract necessary (quadratic) conditions that are dictated by the interpolability with respect to $\mathcal{F}$ and the algorithm $\mathcal{A}$. In the second step, we use the identified conditions to formulate an SDP, which gives the desired rate bounds. Now, it is clear that if the first relaxation step is loose, then the value of the SDP is not necessarily equal to the value of the PEP. In such a setting, there is no systematic way to strengthen the PEP-SDP, whereas the sum-of-squares approach clearly provides a solution: just consider a higher level of the hierarchy. This is exactly why we believe that the sum-of-squares approach is an interesting and complementary approach to the Gram matrix approach of Taylor et al. [38].

On the other hand, the SOS approach provides a systematic framework for finding better (i.e., smaller) bounds on the worst-case contraction factor of descent algorithms, by using higher levels of the SOS hierarchy. It is worth noting though that this flexibility comes at a computational cost, in the sense that the SDPs obtained via the SOS hierarchy are dimension-dependent, i.e., any performance certificate generated by the model only applies to functions over a domain with a fixed dimension $n$ .

To overcome this issue, we show in Theorem 2 that in the specific setting studied in this work (cf. Section 2.3), a degree-1 certificate for the univariate case (i.e., $n=1$) can be lifted to a degree-1 certificate for the general case ($n>1$). Nevertheless, we have been unable to extend this lifting procedure to higher-order SOS certificates. As our goal is to identify analytic rates, the inability to work with general $n$ has forced us to only consider degree-1 certificates. We leave the consideration of higher degree certificates to future work.

In terms of using the SOS hierarchy to derive new convergence results, we focus on gradient descent applied to $L$ -smooth, $\mu$ -strongly convex functions, where the step size is chosen using inexact line search methods. Specifically, in Theorem 4, Theorem 5 and Theorem 6 we respectively study the Armijo, Wolfe, and Goldstein conditions with step size selection in both the noisy and noiseless settings. Denoting by $\delta\in[0,1)$ the noise level in the gradient estimation (see (22)), our main results are the following rate bounds:

Gradient descent with Armijo-terminated line search:

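$$f_{k+1}-f_{*}\leq\left[1-\frac{4\mu\epsilon(1-\delta)^{2}}{\eta L}\left(\frac{1-\delta}{(1+\delta)^{2}}-\epsilon\right)\right](f_{k}-f_{*}),$$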

which is valid for any noise level $\delta\in[0,1)$ , algorithm parameters $\epsilon\in\left(0,\frac{1-\delta}{(1+\delta)^{2}}\right)$ and $\eta>1$ .

Gradient descent with Goldstein-terminated line search:

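$$f_{k+1}-f_{*}\leq\left(1-\frac{4\mu\epsilon(1-\delta)^{2}}{L}\left[\frac{1-\delta}{(1+\delta)^{2}}-(1-\epsilon)\right]\right)(f_{k}-f_{*}),$$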

which is valid for noise levels $\delta\in[0,\sqrt{5}-2)$ and algorithm parameter $\epsilon\in\left(1-\frac{1-\delta}{(1+\delta)^{2}},\frac{1}{2}\right)$.

Gradient descent with Wolfe-terminated line search:

$$f_{k+1}-f_{*}\leq\left(1-\frac{2\mu c_{1}(1-c_{2})}{L}\right)(f_{k}-f_{*}),$$

which is valid for any algorithm parameters $0<c_{1}<c_{2}<1$ .

We show that the bound for GD with Armijo-terminated line search rule is an improvement upon two existing bounds in the literature, see [26, Proposition 3.3.5] and [25, Page 239]. On the other hand, our results for GD with Goldstein or Wolfe-terminated line search are, to the best of our knowledge, new.
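As a numerical sanity check of the Armijo rate in the noiseless case $\delta=0$, where the contraction factor reads $1-\frac{4\mu\epsilon(1-\epsilon)}{\eta L}$, the sketch below runs one gradient step on random diagonal quadratics and compares the observed ratio with that factor. Several ingredients are assumptions of this sketch rather than details confirmed by the text above: the termination rule (accept $\gamma$ when the Armijo test passes at $\gamma$ and fails at $\eta\gamma$), the test functions, the parameter values, and the helper names `armijo_bound` and `one_armijo_step`.

```python
import random

def armijo_bound(mu, L, eps, eta, delta=0.0):
    """Contraction factor of the Armijo rate bound (delta = 0: noiseless)."""
    return 1.0 - (4.0 * mu * eps * (1.0 - delta) ** 2 / (eta * L)) * (
        (1.0 - delta) / (1.0 + delta) ** 2 - eps)

def one_armijo_step(lams, x, eps, eta):
    """Ratio f(x_next)/f(x) for f(x) = 0.5 * sum(lams[i] * x[i]^2), f_* = 0.

    Assumed rule: f(x - gamma*g) <= f(x) - eps*gamma*||g||^2 holds at gamma
    and fails at eta*gamma.
    """
    f = lambda y: 0.5 * sum(l * v * v for l, v in zip(lams, y))
    g = [l * v for l, v in zip(lams, x)]
    g2 = sum(v * v for v in g)
    step = lambda gam: [v - gam * gv for v, gv in zip(x, g)]
    accepted = lambda gam: f(step(gam)) <= f(x) - eps * gam * g2
    gamma = (1.0 - eps) / max(lams)   # small enough to pass the Armijo test
    while accepted(eta * gamma):      # grow until eta*gamma is rejected
        gamma *= eta
    return f(step(gamma)) / f(x)

rng = random.Random(0)
mu, L, eps, eta = 1.0, 10.0, 0.3, 2.0
worst = 0.0
for _ in range(2000):
    lams = [mu, L] + [rng.uniform(mu, L) for _ in range(3)]  # eigenvalues in [mu, L]
    x = [rng.uniform(-1.0, 1.0) for _ in lams]
    worst = max(worst, one_armijo_step(lams, x, eps, eta))
print(worst, armijo_bound(mu, L, eps, eta))
```

A sampled worst case below the bound is consistent with the stated rate but does not prove it; the proof in the paper is an SOS certificate.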

The interested reader may find the code for numerically and symbolically verifying the results at https://github.com/sandratsy/SumsOfSquares.

Paper Organization.

The paper is organized as follows: Section 2 introduces the SOS technique, explains how it is applied to derive worst-case bounds and describes the function class and algorithms we examine within this work. Furthermore, we determine a procedure for lifting degree-1 certificates from the univariate to the multivariate case and also prove the relation between PEP and SOS. In Section 3 we use the SOS framework to determine new convergence results for noisy gradient descent with inexact line search methods (Armijo, Wolfe, Goldstein). Lastly, Section 4 contains concluding remarks and suggests avenues for future work.

Note.

A preliminary version of this paper was presented at the Signal Processing with Adaptive Sparse Structured Representations (SPARS) workshop in Toulouse, France in July 2019 [34]. Moreover, several additional convergence results obtained via the SOS approach including GD with constant step size and exact line search, and proximal gradient with constant step size and exact line search can be found in the M. Eng. thesis of the first author [33]. These results have not been included in this manuscript as the exact same rates have been also derived via the PEP framework, which as already discussed, is equivalent to degree-1 SOS proofs.

2 Description of our Approach

2.1 Background on Sum-of-Squares

Before we provide the details of our approach, we need to introduce some necessary notation and definitions. For any $\operatorname{\mathbf{a}}\in\operatorname{\mathbb{N}}^{n}$ , where $\operatorname{\mathbb{N}}$ is the set of nonnegative integers, we denote by $\operatorname{\mathbf{z}}^{\operatorname{\mathbf{a}}}$ the monomial $z_{1}^{a_{1}}\dots z_{n}^{a_{n}}$ . The degree of the monomial $\operatorname{\mathbf{z}}^{\operatorname{\mathbf{a}}}$ is defined to be $|\operatorname{\mathbf{a}}|=\sum_{i=1}^{n}a_{i}$ . Let $\operatorname{\mathbb{R}}[\operatorname{\mathbf{z}}]_{n,d}$ denote the set of polynomials in $n$ variables $z_{1},\ldots,z_{n}$ , of degree at most $d$ . Any polynomial $p(\operatorname{\mathbf{z}})\in\operatorname{\mathbb{R}}[\operatorname{\mathbf{z}}]_{n,d}$ can be written as a linear combination of monomials of degree at most $d$ , i.e., $p(\operatorname{\mathbf{z}})=\sum_{|\operatorname{\mathbf{a}}|\leq d}p_{\operatorname{\mathbf{a}}}\operatorname{\mathbf{z}}^{\operatorname{\mathbf{a}}}.$ An (even-degree) polynomial $p(\operatorname{\mathbf{z}})$ is called a sum-of-squares (SOS) if there exist polynomials $q_{1}(\operatorname{\mathbf{z}}),\dots,q_{m}(\operatorname{\mathbf{z}})$ satisfying $p(\operatorname{\mathbf{z}})=\sum_{i=1}^{m}q_{i}^{2}(\operatorname{\mathbf{z}}).$ Note that if the degree of $p(\operatorname{\mathbf{z}})$ is equal to $2d$ , all polynomials $q_{i}(\operatorname{\mathbf{z}})$ will necessarily have degree at most $d$ . It is instructive to think of the existence of an SOS decomposition as a tractable certificate for the global nonnegativity of $p(\operatorname{\mathbf{z}})$ . Indeed, it is clear that any SOS polynomial $p(\operatorname{\mathbf{z}})$ is also globally nonnegative, i.e., $p(\operatorname{\mathbf{z}})\geq 0$ for all $\operatorname{\mathbf{z}}\in\operatorname{\mathbb{R}}^{n}$ . 
Furthermore, although less obvious, it is well-known that checking the existence of an SOS decomposition can be done efficiently using SDPs [30].
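The link between SOS decompositions and SDPs is the Gram-matrix characterization: $p$ is SOS exactly when $p(\mathbf{z})=m(\mathbf{z})^{\top}Q\,m(\mathbf{z})$ for some positive semidefinite $Q$, with $m(\mathbf{z})$ a vector of monomials. The sketch below checks this on a standard textbook polynomial that does not appear in the paper; the matrix `Q`, the monomial basis and all helper names are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations
import random

def det(M):
    """Determinant by Laplace expansion (adequate for tiny matrices)."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([r[:j] + r[j + 1:] for r in M[1:]])
               for j in range(len(M)))

def is_psd(Q):
    """A symmetric matrix is PSD iff all its principal minors are >= 0."""
    n = len(Q)
    return all(det([[Q[i][j] for j in idx] for i in idx]) >= 0
               for k in range(1, n + 1) for idx in combinations(range(n), k))

Q = [[2, -3, 1], [-3, 5, 0], [1, 0, 5]]  # Gram matrix in the basis (x^2, y^2, x*y)

def p(x, y):
    return 2 * x**4 + 2 * x**3 * y - x**2 * y**2 + 5 * y**4

def gram_form(x, y):
    m = [x * x, y * y, x * y]
    return sum(m[i] * Q[i][j] * m[j] for i in range(3) for j in range(3))

def sos_form(x, y):
    half = Fraction(1, 2)
    return (half * (2 * x * x - 3 * y * y + x * y) ** 2
            + half * (y * y + 3 * x * y) ** 2)

rng = random.Random(0)
points = [(Fraction(rng.randint(-9, 9), rng.randint(1, 9)),
           Fraction(rng.randint(-9, 9), rng.randint(1, 9))) for _ in range(50)]
agree = all(p(x, y) == gram_form(x, y) == sos_form(x, y) for x, y in points)
print(is_psd(Q), agree)
```

Here the check that `Q` is positive semidefinite is exact, while agreement at sample points is only a sanity check of the polynomial identity; an SDP solver would instead search over all `Q` whose entries match the coefficients of $p$.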

Moving beyond the problem of certifying global nonnegativity, a more general problem is to certify the nonnegativity of a polynomial $p(\mathbf{z})$ over a (basic) closed semi-algebraic set $K=\left\{\mathbf{z}\in\mathbb{R}^{n}:\ h_{i}(\mathbf{z})\geq 0,\ i\in[m],\ v_{j}(\mathbf{z})=0,\ j\in[m^{\prime}]\right\},$ i.e., to certify that $p(\mathbf{z})\geq 0$ for all $\mathbf{z}\in K$. Analogously to the case of global nonnegativity, we look for certificates that can be found efficiently using SDPs. One such choice is given by Putinar-type certificates [31]:

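$$p(\mathbf{z})=\sigma_{0}(\mathbf{z})+\sum_{i=1}^{m}\sigma_{i}(\mathbf{z})h_{i}(\mathbf{z})+\sum_{j=1}^{m^{\prime}}\theta_{j}(\mathbf{z})v_{j}(\mathbf{z}),\tag{5}$$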

where the $\sigma_{i}$’s are themselves SOS polynomials and the $\theta_{j}$’s are arbitrary (i.e., not necessarily SOS) polynomials. Clearly, the expression (5) serves as a certificate that $p(\mathbf{z})\geq 0$ for all $\mathbf{z}\in K$ and moreover, the existence of such a representation (for a fixed degree $d$) can be checked using SDPs, e.g., see [30].
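A one-variable toy instance may help fix ideas; it is not taken from the paper. On $K=\{z:\ 1-z^{2}\geq 0\}$ the polynomial $p(z)=1-z$ is nonnegative, and a certificate of the above form is $1-z=\tfrac{1}{2}(1-z)^{2}+\tfrac{1}{2}(1-z^{2})$, i.e., $\sigma_{0}=\tfrac{1}{2}(1-z)^{2}$ and $\sigma_{1}=\tfrac{1}{2}$. The sketch below verifies the identity with exact rational arithmetic; the coefficient-list representation and the helper names are assumptions of this illustration.

```python
from fractions import Fraction

def poly_mul(a, b):
    """Product of two polynomials given as coefficient lists (lowest degree first)."""
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def poly_add(a, b):
    """Sum of two coefficient lists, with trailing zeros removed."""
    n = max(len(a), len(b))
    a = a + [Fraction(0)] * (n - len(a))
    b = b + [Fraction(0)] * (n - len(b))
    out = [u + v for u, v in zip(a, b)]
    while len(out) > 1 and out[-1] == 0:
        out.pop()
    return out

half = Fraction(1, 2)
p = [Fraction(1), Fraction(-1)]                # p(z)   = 1 - z
h1 = [Fraction(1), Fraction(0), Fraction(-1)]  # h_1(z) = 1 - z^2
q = [Fraction(1), Fraction(-1)]                # q(z)   = 1 - z
sigma0 = [half * c for c in poly_mul(q, q)]    # sigma_0 = (1/2) q^2, an SOS
sigma1 = [half]                                # sigma_1 = 1/2, a constant SOS

rhs = poly_add(sigma0, poly_mul(sigma1, h1))   # sigma_0 + sigma_1 * h_1
print(rhs == p)
```

Both $\sigma_{0}$ and $\sigma_{1}h_{1}$ have degree 2 here, so this is a certificate with $2d=2$.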

2.2 Algorithm Analysis Using SOS Certificates

Fixing a family of functions $\operatorname{\mathcal{F}}$ and a first-order algorithm $\operatorname{\mathcal{A}}$ —one that uses only gradient information—our goal is to find the best (smallest) contraction factor $t\in(0,1)$ that is valid over all functions in $\operatorname{\mathcal{F}}$ and all sequences of iterates that can be generated using the algorithm $\operatorname{\mathcal{A}}$ . Concretely, for any fixed $k$ , we want to estimate the minimum $t\in(0,1)$ satisfying $f_{k+1}-f_{*}\leq t(f_{k}-f_{*}),$ for all $f\in\operatorname{\mathcal{F}}$ and $\operatorname{\mathbf{x}}_{k+1}=\operatorname{\mathcal{A}}\left(\operatorname{\mathbf{x}}_{k},f_{k},\operatorname{\mathbf{g}}_{k}\right)$ . We address this question using SOS certificates. To employ an SOS approach, we first need to identify polynomial inequalities $h_{i}(\operatorname{\mathbf{z}})\geq 0$ and polynomial equalities $v_{j}(\operatorname{\mathbf{z}})=0$ in the variables

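$$\mathbf{z}:=(f_{*},f_{k},f_{k+1},\mathbf{x}_{*},\mathbf{x}_{k},\mathbf{x}_{k+1},\mathbf{g}_{*},\mathbf{g}_{k},\mathbf{g}_{k+1})\in\mathbb{R}^{6n+3}\tag{6}$$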

that should be necessarily satisfied following the choice of the class of functions $\operatorname{\mathcal{F}}$ and the first-order algorithm $\operatorname{\mathcal{A}}$ . Setting $K$ to be the semi-algebraic set defined by the identified polynomial equalities and inequalities, i.e.,

$$K:=\{\mathbf{z}:\ h_{i}(\mathbf{z})\geq 0,\ i\in[m],\ v_{j}(\mathbf{z})=0,\ j\in[m^{\prime}]\},$$

it follows immediately that if the polynomial

$$p_{t}(\mathbf{z}):=t(f_{k}-f_{*})-(f_{k+1}-f_{*})$$

is nonnegative over the set $K$ for some $t\in(0,1)$ , then $t$ also serves as an upper bound on the worst-case rate $t_{*}$ , or, in other words, $t_{*}$ is upper bounded by the value of the following optimization problem

$$t_{poly}:=\inf\{t:\ p_{t}(\mathbf{z})\geq 0\ \ \forall\mathbf{z}\in K,\ t\in(0,1)\},\tag{7}$$

where the decision variable is the scalar $t$ . As the optimization problem (7) involves a polynomial nonnegativity constraint (over a semi-algebraic set) it is in general hard—in fact, strongly NP-hard [1]. To obtain tractable upper bounds, we replace the constraint that $p_{t}(\operatorname{\mathbf{z}})$ is nonnegative over $K$ by asking that $p_{t}(\operatorname{\mathbf{z}})$ admits an SOS certificate of the form (5), which clearly certifies nonnegativity over $K$ . Concretely, for any $d\geq 0$ and $n\geq 1$ , we get the SDP:

[TABLE]

where $\operatorname{\mathbf{z}}\in\operatorname{\mathbb{R}}^{6n+3}$ . For any fixed integer $d\geq 0$ and $n\geq 1$ , the optimization problem (8) is an SDP, and consequently, it can be solved in polynomial-time to any desired accuracy. Furthermore, for a fixed $n$ , it follows immediately from the definitions that

[TABLE]

In other words, as $d$ increases, the SDPs given in (8) give increasingly better—more precisely, no worse—upper bounds on the worst-case ratio $t_{*}$ . On the negative side, the sizes of these SDPs grow as $\mathcal{O}(n^{d}),$ so in practice, working with large values of $d$ is computationally prohibitive. Summarizing, our strategy for estimating the worst-case rate consists of the following steps:

1. Identify polynomial inequality and equality constraints $h_{i}(\operatorname{\mathbf{z}})\geq 0$, $v_{j}(\operatorname{\mathbf{z}})=0$ in the variable $\operatorname{\mathbf{z}}$ (recall (6)) that are implied by choosing a function class and an algorithm.

2. Fix a degree $d\in\operatorname{\mathbb{N}}$ for the SOS certificate, i.e., for the degrees of the polynomials $\sigma_{i}$ and $\theta_{j}$. Higher-degree certificates allow for tighter bounds but are more difficult to find due to the increase in size of the SDP.

3. Numerically solve the SDP in (8) using degree-$d$ SOS certificates multiple times, varying the parameters corresponding to the algorithm and the function class. This allows us to “guess” the analytic form of the optimal variables for (8).

4. Lastly, verify that the solution identified in step 3 is indeed feasible for (8). Establishing feasibility gives an analytic upper bound on the best contraction factor $t_{d}$ that can be certified using degree-$d$ SOS certificates.
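To give a feel for the trade-off in step 2, the following minimal sketch counts the monomials of degree at most $d$ in the $6n+3$ variables of $\operatorname{\mathbf{z}}$. It assumes the standard Gram-matrix formulation, in which the PSD matrix representing $\sigma_{0}$ is indexed by these monomials; the function name `gram_matrix_size` is hypothetical and is not taken from the authors' code.

```python
from math import comb


def gram_matrix_size(n: int, d: int) -> int:
    """Number of monomials of degree <= d in the 6n + 3 variables of z.

    Assumption: sigma_0 is represented by a PSD Gram matrix indexed by all
    monomials of degree at most d, so this count is its side length.
    """
    num_vars = 6 * n + 3
    return comb(num_vars + d, d)


# Side length of the Gram matrix for a few (n, d) pairs.
for n, d in [(1, 1), (1, 2), (2, 1), (10, 3)]:
    print(n, d, gram_matrix_size(n, d))
```

For fixed $d$ the count is a polynomial of degree $d$ in $n$, which is consistent with the $\mathcal{O}(n^{d})$ growth noted above.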

Implementation Details.

Throughout this paper, we restrict our attention to degree-1 SOS certificates, as our main goal is to derive rates symbolically (see Section 2.3). The derivation of the affine constraints defining the feasible region of the SDP (8) was done by matching coefficients in (5). The SDP (8) was solved with CVX [15, 14], using the supported SDP solver SDPT3 [40, 41]. Fortunately, there are many SOS optimization toolboxes such as YALMIP that automate the process of matching coefficients and constructing the SDP. Finally, verification of the identified solution was done through MATLAB’s Symbolic Math Toolbox and Mathematica [42]. Mathematica was used to first verify that the optimal matrices are PSD, before we found their corresponding SOS decompositions analytically. For the interested reader, the codes for implementation of the SDPs and verification of the solutions may be found at https://github.com/sandratsy/SumsOfSquares.

2.3 Choices Specific to this Work

Function classes of interest.

Consider parameters $0\leq\mu<L<+\infty$ . In this work, we only consider the class of $L$ -smooth, $\mu$ -strongly convex functions—also known as $(\mu,L)$ -smooth functions—with domain $\operatorname{\mathbb{R}}^{n}$ , which we denote by $\operatorname{\mathcal{F}}_{\mu,L}(\operatorname{\mathbb{R}}^{n})$ . Recall that a proper, closed, convex function ${f:\operatorname{\mathbb{R}}^{n}\rightarrow\operatorname{\mathbb{R}}\cup\{+\infty\}}$ is called $L$ -smooth if

[TABLE]

and $\mu$ -strongly convex if the function $f(\operatorname{\mathbf{x}})-\frac{\mu}{2}\norm{\operatorname{\mathbf{x}}}^{2}$ is convex, where $\|\cdot\|$ denotes the usual Euclidean norm.

Throughout this work, we use the following set of necessary and sufficient conditions developed in [38] for the existence of a function in $\operatorname{\mathcal{F}}_{\mu,L}(\operatorname{\mathbb{R}}^{n})$ generating data triples $\{(\operatorname{\mathbf{x}}_{i},f_{i},\operatorname{\mathbf{g}}_{i})\}_{i\in I}$ .

Theorem 1.

Given a set $\{(\operatorname{\mathbf{x}}_{i},f_{i},\operatorname{\mathbf{g}}_{i})\}_{i\in I}$ , there exists $f\in\operatorname{\mathcal{F}}_{\mu,L}(\operatorname{\mathbb{R}}^{n})$ where $f_{i}=f(\operatorname{\mathbf{x}}_{i})$ and $\operatorname{\mathbf{g}}_{i}=\nabla f(\operatorname{\mathbf{x}}_{i})$ for all $i\in I$ , if and only if, for all $i\neq j\in I$ :

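$$f_{i}-f_{j}-\operatorname{\mathbf{g}}_{j}^{\top}(\operatorname{\mathbf{x}}_{i}-\operatorname{\mathbf{x}}_{j})\geq\frac{1}{2(1-\mu/L)}\left(\frac{1}{L}\|\operatorname{\mathbf{g}}_{i}-\operatorname{\mathbf{g}}_{j}\|^{2}+\mu\|\operatorname{\mathbf{x}}_{i}-\operatorname{\mathbf{x}}_{j}\|^{2}-\frac{2\mu}{L}(\operatorname{\mathbf{g}}_{j}-\operatorname{\mathbf{g}}_{i})^{\top}(\operatorname{\mathbf{x}}_{j}-\operatorname{\mathbf{x}}_{i})\right).$$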

Applying Theorem 1 to the data triples $(\operatorname{\mathbf{x}}_{k},f_{k},\operatorname{\mathbf{g}}_{k})$ , $(\operatorname{\mathbf{x}}_{k+1},f_{k+1},\operatorname{\mathbf{g}}_{k+1})$ , and $(\operatorname{\mathbf{x}}_{*},f_{*},\operatorname{\mathbf{g}}_{*})$ we get six polynomial constraints that we denote throughout this paper by $h_{1}(\operatorname{\mathbf{z}})\geq 0,\ldots,h_{6}(\operatorname{\mathbf{z}})\geq 0.$ Specifically, setting $\alpha:=\frac{1}{2(1-\mu/L)}$ , the six $\operatorname{\mathcal{F}}_{\mu,L}$ -interpolability conditions are:

[TABLE]
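As a sanity check on how the six conditions arise, the sketch below evaluates the interpolation inequality for every ordered pair among three data triples. It assumes the inequality in the form known from [38], restated in the code comment; the helper names and the pair ordering are hypothetical and need not match the numbering $h_{1},\ldots,h_{6}$ used in the paper.

```python
from itertools import permutations


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sub(u, v):
    return [a - b for a, b in zip(u, v)]


def interpolation_gap(ti, tj, mu, L):
    """LHS minus RHS of the assumed interpolation inequality for the pair (i, j):

    f_i - f_j - g_j.(x_i - x_j)
        >= alpha * (|g_i - g_j|^2 / L + mu |x_i - x_j|^2
                    - 2 (mu / L) (g_j - g_i).(x_j - x_i)),
    with alpha = 1 / (2 (1 - mu / L)).
    """
    (xi, fi, gi), (xj, fj, gj) = ti, tj
    alpha = 1.0 / (2.0 * (1.0 - mu / L))
    dx, dg = sub(xi, xj), sub(gi, gj)
    lhs = fi - fj - dot(gj, dx)
    rhs = alpha * (dot(dg, dg) / L + mu * dot(dx, dx) - 2.0 * (mu / L) * dot(dg, dx))
    return lhs - rhs


def six_gaps(triples, mu, L):
    """Gaps for all ordered pairs among three triples (six values)."""
    return [interpolation_gap(ti, tj, mu, L) for ti, tj in permutations(triples, 2)]


def quadratic_triple(x, a):
    """Data triple (x, f(x), grad f(x)) for f(x) = (a / 2) |x|^2."""
    return (x, 0.5 * a * dot(x, x), [a * xi for xi in x])


# Three points playing the roles of x_k, x_{k+1} and x_* = 0.
points = [[1.0, -2.0], [0.5, 0.25], [0.0, 0.0]]
mu, L = 1.0, 4.0
gaps_inside = six_gaps([quadratic_triple(p, 2.0) for p in points], mu, L)   # mu <= a <= L
gaps_outside = six_gaps([quadratic_triple(p, 8.0) for p in points], mu, L)  # a > L
print(min(gaps_inside), max(gaps_outside))
```

For the quadratic $f(\operatorname{\mathbf{x}})=\frac{a}{2}\|\operatorname{\mathbf{x}}\|^{2}$ each gap has the same sign as $-(a-\mu)(a-L)$, so all six conditions hold when $\mu\leq a\leq L$ and are violated when $a>L$.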

2.4 Lifting Univariate Certificates

As already mentioned, we restrict our attention to degree-1 SOS certificates (recall (8)). In this setting, $\sigma_{0}(\operatorname{\mathbf{z}})$ is an SOS of linear polynomials and, since the polynomials $h_{i}(\operatorname{\mathbf{z}})$ and $v_{j}(\operatorname{\mathbf{z}})$ we consider are degree-2, the $\sigma_{i}(\operatorname{\mathbf{z}})$ ’s need to be degree-0 SOS polynomials and the $\theta_{j}(\operatorname{\mathbf{z}})$ ’s degree-0 polynomials. The SOS certificate can be thus expressed as:

[TABLE]

where $\sigma_{i}\in\operatorname{\mathbb{R}}_{+}$ and $\theta_{j}\in\operatorname{\mathbb{R}}$ . We claim that the form of the polynomials $p_{t}(\operatorname{\mathbf{z}})$ , $h_{i}(\operatorname{\mathbf{z}})$ ’s and $v_{j}(\operatorname{\mathbf{z}})$ ’s, combined with the specific choice of SOS certificates under consideration (i.e., degree-1 certificates) allow us to only consider the univariate case $n=1$ . Concretely, we show in the rest of this section that an SOS certificate for some contraction factor $t\in(0,1)$ in the univariate case, induces an SOS certificate for the same contraction factor in the multivariate case ( $n>1$ ). To see this, first we rearrange the variable $\operatorname{\mathbf{z}}=(f_{*},f_{k},f_{k+1},\operatorname{\mathbf{x}}_{*},\operatorname{\mathbf{x}}_{k},\operatorname{\mathbf{x}}_{k+1},\operatorname{\mathbf{g}}_{*},\operatorname{\mathbf{g}}_{k},\operatorname{\mathbf{g}}_{k+1})$ as $\operatorname{\mathbf{z}}=(\operatorname{\mathbf{z}}_{0},\operatorname{\mathbf{z}}_{1},\ldots,\operatorname{\mathbf{z}}_{n}),$ where

[TABLE]

where $x_{*}(\ell)$ denotes the $\ell$ th coordinate of $\operatorname{\mathbf{x}}_{*}$ for $\ell\in[n]$ .

Theorem 2.

Assume that the performance measure polynomial and the constraint functions are separable with respect to the blocks of variables $\operatorname{\mathbf{z}}_{0},\operatorname{\mathbf{z}}_{1},\ldots,\operatorname{\mathbf{z}}_{n}$ , they are invariant with respect to permutations of the blocks of variables $\operatorname{\mathbf{z}}_{1},\ldots,\operatorname{\mathbf{z}}_{n}$ , and that they have no constant terms. Then, a degree-1 SOS certificate for a rate $t\in(0,1)$ in the univariate case (i.e., $n=1$ ) can be lifted to degree-1 certificate for the general case (i.e., $n>1$ ).

Note that the structural assumptions on the performance measure polynomial and the constraint functions imply that they have the form

[TABLE]

for some polynomials $p_{t}^{0},p_{t}^{1},h_{i}^{0},h_{i}^{1},v_{j}^{0}$ and $v_{j}^{1}$ .

Note that all the performance measure polynomials (e.g., $t(f_{k}-f_{*})-(f_{k+1}-f_{*})$ and $t\norm{\operatorname{\mathbf{x}}_{k}-\operatorname{\mathbf{x}}_{*}}^{2}-\norm{\operatorname{\mathbf{x}}_{k+1}-\operatorname{\mathbf{x}}_{*}}^{2}$) and the constraint functions encountered thus far have the form (12). Moreover, (12) is satisfied whenever the polynomial constraints take the form given in (4), i.e., the constraints are linear in the $f$’s and in the inner products of the $\operatorname{\mathbf{x}}_{i}$’s and $\operatorname{\mathbf{g}}_{i}$’s.

Proof.

To prove the theorem, note that an SOS certificate $\left\{Q,\{\sigma_{i}\}_{i},\{\theta_{j}\}_{j}\right\}$ , (i.e., $Q$ is a PSD matrix, $\{\sigma_{i}\}_{i}\subseteq\operatorname{\mathbb{R}}_{+}$ and $\{\theta_{j}\}_{j}\subseteq\operatorname{\mathbb{R}}$ ) for a rate $t\in(0,1)$ in the general case $n>1$ has the following form:

[TABLE]

As the polynomials have no constant terms, it follows immediately that $Q_{11}=0$. Furthermore, as the polynomials are separable with respect to the blocks of variables $(\operatorname{\mathbf{z}}_{0},\operatorname{\mathbf{z}}_{1},\ldots,\operatorname{\mathbf{z}}_{n})$, $Q$ is block diagonal. Using these two observations, (12) and (13) imply that:

[TABLE]

Lastly, assume there exists an SOS certificate for a rate $t\in(0,1)$ in the univariate case, i.e., a PSD matrix $\tilde{Q}$ and scalars $\{\tilde{\sigma}_{i}\}_{i}\subseteq\operatorname{\mathbb{R}}_{+},\{\tilde{\theta}_{j}\}_{j}\subseteq\operatorname{\mathbb{R}}$ where

[TABLE]

As before, this may be decomposed into

[TABLE]

Comparing equation (14) with (16) and equation (15) with (17), we see that $Q_{0}=\tilde{Q}_{0}$ , $Q_{\ell}=\tilde{Q}_{1},\ \ell\in[n]$ , $\sigma_{i}=\tilde{\sigma}_{i},\ i\in[m]$ , $\theta_{j}=\tilde{\theta}_{j},\ j\in[m^{\prime}]$ is a valid certificate for the multivariate case. ∎
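The lifting step can be illustrated with a minimal sketch: given blocks $\tilde{Q}_{0}$ and $\tilde{Q}_{1}$ from a univariate certificate, the multivariate matrix is block diagonal with one copy of $\tilde{Q}_{0}$ and $n$ copies of $\tilde{Q}_{1}$. The block sizes (3 and 6) follow the variable count $6n+3$; the constant row and column (where $Q_{11}=0$) are omitted, the numerical blocks are toy values chosen only for illustration, and the function names are hypothetical.

```python
def block_diag(blocks):
    """Assemble square blocks into one block-diagonal matrix (list of lists)."""
    size = sum(len(b) for b in blocks)
    out = [[0.0] * size for _ in range(size)]
    offset = 0
    for b in blocks:
        for r, row in enumerate(b):
            for c, val in enumerate(row):
                out[offset + r][offset + c] = val
        offset += len(b)
    return out


def lift(Q0, Q1, n):
    """Q = diag(Q0, Q1, ..., Q1) with n copies of the univariate block Q1."""
    return block_diag([Q0] + [Q1] * n)


def quad_form(Q, z):
    return sum(z[r] * Q[r][c] * z[c] for r in range(len(z)) for c in range(len(z)))


# Toy univariate blocks: Q0 = 0 (3 x 3) and Q1 = v v^T (6 x 6, rank one, PSD).
v = [1, -1, 0, 2, 0, 1]
Q0 = [[0] * 3 for _ in range(3)]
Q1 = [[vi * vj for vj in v] for vi in v]

n = 3
Q = lift(Q0, Q1, n)

# z = (z_0, z_1, ..., z_n): one block of length 3, then n blocks of length 6.
z0 = [1, 2, 3]
z_blocks = [[1, 0, 2, -1, 3, 0], [0, 1, 1, 1, -2, 4], [2, 2, 0, 0, 1, -1]]
z = z0 + [entry for block in z_blocks for entry in block]

lifted_value = quad_form(Q, z)
blockwise_value = sum(sum(vi * zi for vi, zi in zip(v, block)) ** 2 for block in z_blocks)
print(len(Q), lifted_value, blockwise_value)
```

Because a block-diagonal matrix with PSD blocks is PSD, the lifted quadratic form is simply the sum of the univariate forms over the coordinate blocks.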

Lastly, we note that for higher-degree SOS certificates (beyond degree-1), it is not immediately apparent how to verify that a certificate for the univariate case induces one for the multivariate case.

2.5 Dual of the SOS Hierarchy

In this section we determine the exact relationship between the PEP and the SOS hierarchy introduced in this work. Specifically, we show that:

Theorem 3.

If the functional and algorithmic constraints are linearly Gram-representable (i.e., (3) holds), the 1-step PEP applied to a contractive algorithm is equivalent to the first-level of the SOS hierarchy.

Proof.

For concreteness, we consider the 1-step PEP (i.e., where we only take 1 step using algorithm $\mathcal{A}$ ) with respect to the performance metric given by the objective function accuracy. Similar arguments apply when the performance is measured using the distance from optimality or the residual gradient norm.

The corresponding optimization problem is given by:

[TABLE]

If the $\operatorname{\mathcal{F}}$ -interpolability and algorithmic conditions are of the form given in (3) with $a_{i}=b_{i}=0$ , the equivalent (f-PEP) may be relaxed into an SDP of the following form:

[TABLE]

where we recall that $\operatorname{\mathbf{f}}=(f_{0},f_{1},f_{*})$ , $X=(\operatorname{\mathbf{x}}_{0}\ \operatorname{\mathbf{x}}_{1}\ \operatorname{\mathbf{x}}_{*}\ \operatorname{\mathbf{g}}_{0}\ \operatorname{\mathbf{g}}_{1}\ \operatorname{\mathbf{g}}_{*})\in\operatorname{\mathbb{R}}^{n\times 6}$ , $G=X^{\top}X$ and $I=\{0,1,*\}$ . Note that $f_{1}-f_{*}$ may be expressed as $\langle(0,1,-1),\operatorname{\mathbf{f}}\rangle$ and $f_{0}-f_{*}$ as $\langle(1,0,-1),\operatorname{\mathbf{f}}\rangle$ . Setting $\sigma_{i}$ , $\theta_{j}$ and $t$ to be the Lagrange multipliers of the three sets of constraints in (18) respectively, the dual of (18) is

[TABLE]

On the other hand, using the SOS approach and restricting our attention to degree-1 certificates, the SOS-SDP defined in (8) is given by:

[TABLE]

where we define as before (recall (11)) $\operatorname{\mathbf{z}}=(\operatorname{\mathbf{z}}_{0},\operatorname{\mathbf{z}}_{1},\ldots,\operatorname{\mathbf{z}}_{n}),$ with $\operatorname{\mathbf{z}}_{0}=(f_{k},f_{k+1},f_{*})$ and

[TABLE]

and furthermore, the polynomial constraints $h_{i}(\operatorname{\mathbf{z}})\geq 0$ and $v_{j}(\operatorname{\mathbf{z}})=0$ have the form given in (4) with $a_{i}=b_{i}=0$ , i.e.,

[TABLE]

Note that polynomials of the form (20) satisfy the requirement identified in (12). In particular, this implies that the matrix $Q$ is block-diagonal. Based on this, the constraint

[TABLE]

is equivalent to the equality of the following polynomials in the variables $\operatorname{\mathbf{z}}=(\operatorname{\mathbf{z}}_{0},\operatorname{\mathbf{z}}_{1},\ldots,\operatorname{\mathbf{z}}_{n})$ :

[TABLE]

which is in turn equivalent to:

[TABLE]

Thus, the SOS-SDP may be expressed as:

[TABLE]

Finally, we note that the constraint $t<1$ can be dropped if the algorithm is a descent algorithm. The SDP induced by the 1-PEP (19) and the SDP induced by the degree-1 SOS problem (21) are hence equivalent problems. ∎
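The notion of a linearly Gram-representable quantity used in this proof can be made concrete with a small sketch: any inner product among the columns of $X=(\operatorname{\mathbf{x}}_{0}\ \operatorname{\mathbf{x}}_{1}\ \operatorname{\mathbf{x}}_{*}\ \operatorname{\mathbf{g}}_{0}\ \operatorname{\mathbf{g}}_{1}\ \operatorname{\mathbf{g}}_{*})$ is a linear function of $G=X^{\top}X$. The numerical vectors below are arbitrary illustrative values, and the helper names are hypothetical.

```python
def gram(columns):
    """G = X^T X, where `columns` lists the columns of X."""
    return [[sum(u * v for u, v in zip(ci, cj)) for cj in columns] for ci in columns]


def bilinear(G, u, v):
    """u^T G v for coefficient vectors u, v over the six columns of X."""
    return sum(u[r] * G[r][c] * v[c] for r in range(len(u)) for c in range(len(v)))


# Columns ordered as (x_0, x_1, x_*, g_0, g_1, g_*), here with n = 2.
x0, x1, xs = [1, 2], [0, 1], [0, 0]
g0, g1, gs = [2, 4], [0, 2], [0, 0]
G = gram([x0, x1, xs, g0, g1, gs])

# |x_0 - x_*|^2 as a quadratic form in G with coefficients (1, 0, -1, 0, 0, 0).
c = [1, 0, -1, 0, 0, 0]
dist_sq = bilinear(G, c, c)

# g_0^T (x_1 - x_0) as a bilinear form in G.
e_g0 = [0, 0, 0, 1, 0, 0]
step = [-1, 1, 0, 0, 0, 0]
inner = bilinear(G, e_g0, step)

# f_1 - f_* as the linear functional <(0, 1, -1), f>.
f_vals = [5.0, 1.0, 0.0]
gap = sum(w * fv for w, fv in zip([0, 1, -1], f_vals))
print(dist_sq, inner, gap)
```

Since both formulations only see the data through $\operatorname{\mathbf{f}}$ and $G$, the dimension $n$ enters only through the rank of $G$.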

3 Using the SOS Hierarchy to Obtain New Convergence Bounds

In this section, we consider a few variants of GD with inexact line search under both the noisy and noiseless settings. In “noisy” GD, the update step is given by $\operatorname{\mathbf{x}}_{k+1}=\operatorname{\mathbf{x}}_{k}+\gamma_{k}\operatorname{\mathbf{d}}_{k}$ , where the error (i.e., the difference between the descent direction $\operatorname{\mathbf{d}}_{k}$ and negative gradient) is bounded relative to the gradient:

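$$\|-\operatorname{\mathbf{g}}_{k}-\operatorname{\mathbf{d}}_{k}\|\leq\delta\|\operatorname{\mathbf{g}}_{k}\|\qquad(22)$$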

for some noise level $\delta\in[0,1)$ . This assumption ensures that the next step taken remains in a descent direction, i.e., $-\operatorname{\mathbf{g}}_{k}^{\top}\operatorname{\mathbf{d}}_{k}>0$ , e.g. see [2, Page 38].

We begin by deriving some inequalities that will be used throughout this section. We note that

[TABLE]

where the last inequality follows by (22). By a similar argument, we have

[TABLE]

Squaring and expanding (22), we have

[TABLE]

which, combined with (24), implies that

[TABLE]

Furthermore, by the triangle inequality, we have $\norm{-\operatorname{\mathbf{g}}_{k}}\leq\norm{\operatorname{\mathbf{d}}_{k}}+\norm{-\operatorname{\mathbf{d}}_{k}-\operatorname{\mathbf{g}}_{k}}$ and thus

[TABLE]
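The consequences of the relative-error assumption can be checked numerically. The sketch below samples directions $\operatorname{\mathbf{d}}_{k}$ satisfying $\|-\operatorname{\mathbf{g}}_{k}-\operatorname{\mathbf{d}}_{k}\|\leq\delta\|\operatorname{\mathbf{g}}_{k}\|$ and verifies three facts that follow from the Cauchy-Schwarz and triangle inequalities: $-\operatorname{\mathbf{g}}_{k}^{\top}\operatorname{\mathbf{d}}_{k}\geq(1-\delta)\|\operatorname{\mathbf{g}}_{k}\|^{2}$ and $(1-\delta)\|\operatorname{\mathbf{g}}_{k}\|\leq\|\operatorname{\mathbf{d}}_{k}\|\leq(1+\delta)\|\operatorname{\mathbf{g}}_{k}\|$. Whether these coincide in form with the numbered inequalities above is not assumed; the sampling scheme and function names are hypothetical.

```python
import math
import random


def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))


def noisy_direction(g, delta, rng):
    """Return d = -g + e with a random error e of norm at most delta * |g|."""
    raw = [rng.gauss(0.0, 1.0) for _ in g]
    scale = rng.random() * delta * norm(g) / norm(raw)
    return [-gi + scale * ri for gi, ri in zip(g, raw)]


def satisfies_bounds(g, d, delta, tol=1e-9):
    g_norm, d_norm = norm(g), norm(d)
    error_norm = norm([gi + di for gi, di in zip(g, d)])
    descent = -sum(gi * di for gi, di in zip(g, d))
    return (
        error_norm <= delta * g_norm + tol
        and descent >= (1 - delta) * g_norm ** 2 - tol
        and (1 - delta) * g_norm - tol <= d_norm <= (1 + delta) * g_norm + tol
    )


rng = random.Random(0)
g = [3.0, -4.0, 12.0]
delta = 0.5
all_ok = all(satisfies_bounds(g, noisy_direction(g, delta, rng), delta) for _ in range(1000))
print(all_ok)
```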

3.1 The Armijo Rule

Using Armijo-terminated line search, the step size $\gamma_{k}$ is chosen so that

[TABLE]

for some $\epsilon\in(0,1)$ and $\eta>1$ , e.g., see [26, Section 2.4.1] and [2, Page 29]. In the noisy setting, the gradient is not available. Substituting $-\operatorname{\mathbf{d}}_{k}$ for $\operatorname{\mathbf{g}}_{k}$ in (28)-(29) we obtain:

[TABLE]
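A minimal sketch of an Armijo-terminated search along a direction $\operatorname{\mathbf{d}}_{k}$ is given below. It assumes the two termination conditions take the form $f(\operatorname{\mathbf{x}}_{k}+\gamma\operatorname{\mathbf{d}}_{k})\leq f_{k}-\epsilon\gamma\|\operatorname{\mathbf{d}}_{k}\|^{2}$ and $f(\operatorname{\mathbf{x}}_{k}+\eta\gamma\operatorname{\mathbf{d}}_{k})\geq f_{k}-\epsilon\eta\gamma\|\operatorname{\mathbf{d}}_{k}\|^{2}$, i.e., the step $\gamma$ gives sufficient decrease while the longer step $\eta\gamma$ does not give strictly more than sufficient decrease. The search strategy (shrink or grow by the factor $\eta$) and the name `armijo_step` are hypothetical choices, not the authors' implementation.

```python
def armijo_step(f, x, d, eps=0.25, eta=2.0, gamma0=1.0, max_iter=200):
    """Find gamma with sufficient decrease at gamma but not at eta * gamma."""
    d_sq = sum(di * di for di in d)
    fx = f(x)

    def sufficient_decrease(gamma):
        trial = [xi + gamma * di for xi, di in zip(x, d)]
        return f(trial) <= fx - eps * gamma * d_sq

    gamma = gamma0
    for _ in range(max_iter):
        if not sufficient_decrease(gamma):
            gamma /= eta      # step too long: shrink
        elif sufficient_decrease(eta * gamma):
            gamma *= eta      # a longer step still decreases enough: grow
        else:
            return gamma      # both termination conditions hold
    raise RuntimeError("no Armijo step found within max_iter iterations")


def quadratic(x):
    """f(x) = 0.5 * (x_1^2 + 4 x_2^2), so mu = 1 and L = 4."""
    return 0.5 * (x[0] ** 2 + 4.0 * x[1] ** 2)


x_k = [1.0, 1.0]
d_k = [-1.0, -4.0]  # noiseless case: d_k = -grad f(x_k)
gamma_k = armijo_step(quadratic, x_k, d_k)
x_next = [xi + gamma_k * di for xi, di in zip(x_k, d_k)]
print(gamma_k, quadratic(x_k), quadratic(x_next))
```

In this example the trial steps $1$ and $0.5$ fail the sufficient-decrease test, and $\gamma_{k}=0.25$ is accepted.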

When noisy GD with Armijo-terminated line search is applied to an $L$ -smooth function, we are able to show the validity of the following inequality:

[TABLE]

for $\delta\in[0,1)$ , $\epsilon\in\left(0,\frac{1-\delta}{(1+\delta)^{2}}\right)$ and $\eta>1$ . Indeed, as $f$ is $L$ -smooth we have that

[TABLE]

Substituting $\operatorname{\mathbf{x}}=\operatorname{\mathbf{x}}_{k}$ and $\operatorname{\mathbf{y}}=\operatorname{\mathbf{x}}_{k}+\eta\gamma_{k}\operatorname{\mathbf{d}}_{k}$ we get

[TABLE]

Furthermore, combining (31) and (33), we have

[TABLE]

In turn, this implies

[TABLE]

and as we require $\gamma_{k}>0$ , we need that $\epsilon<\frac{1-\delta}{(1+\delta)^{2}}$ . Substituting (34) into (30), we have

[TABLE]

where the last inequality follows by (27). We note that for $\delta=0$ , the above inequality reduces to [26, Equation (3.3.9)].

For the next theorem we use polynomial constraints $h_{1}(\operatorname{\mathbf{z}})\geq 0,\dots,{h_{6}(\operatorname{\mathbf{z}})\geq 0}$ given in (9), and $h_{7}(\operatorname{\mathbf{z}})\geq 0$ given in (32), and search for a degree-1 SOS certificate as described in (10). Constructing and solving the appropriate SDP, we obtain the following result.

Theorem 4.

For any $\delta\in[0,1)$, $\epsilon\in\left(0,\frac{1-\delta}{(1+\delta)^{2}}\right)$ and $\eta>1$, given a $(\mu,L)$-smooth function $f:\operatorname{\mathbb{R}}^{n}\to\operatorname{\mathbb{R}}$ and any sequence of iterates $\{\operatorname{\mathbf{x}}_{k}\}_{k\geq 1}$ generated using noisy GD with Armijo-terminated line search, the bound

[TABLE]

admits an SOS certificate of degree-1.

Proof.

Defining

[TABLE]

we have that $t(f_{k}-f_{*})-(f_{k+1}-f_{*})$ is equal to

[TABLE]

The first term in the right-hand-side of equation (35) is strictly positive since $\epsilon<\frac{1-\delta}{(1+\delta)^{2}}$ . In addition, as previously discussed, any sequence of iterates $\{\operatorname{\mathbf{x}}_{k}\}_{k\geq 1}$ generated by noisy GD with Armijo-terminated line search for minimizing a function $f\in\operatorname{\mathcal{F}}_{\mu,L}(\operatorname{\mathbb{R}}^{n})$ satisfies $h_{5}(\operatorname{\mathbf{z}})\geq 0$ and $h_{7}(\operatorname{\mathbf{z}})\geq 0$ . Since $\sigma_{5},\sigma_{7}\geq 0$ , overall the right-hand-side of (35) is positive. Hence, the left-hand-side of equation (35) is also positive, concluding the proof. ∎

From Theorem 4, we also get a rate bound for GD with Armijo-terminated line search in the noiseless case (i.e., $\delta=0$ ), which is given by

[TABLE]

As a consequence of (36), for all $N\geq 1$ we have

[TABLE]

where $\kappa=L/\mu$ is the condition number of $f$ . To the best of our knowledge, the best bounds for a function $f\in\operatorname{\mathcal{F}}_{\mu,L}(\operatorname{\mathbb{R}}^{n})$ minimized by GD with Armijo rule were given by Luenberger and Ye [25, Page 239] and Nemirovski [26, Proposition 3.3.5]. For any $\epsilon<0.5$ and $\eta>1$ , Luenberger and Ye (LY) showed that

[TABLE]

while for any $\epsilon\geq 0.5$ and $\eta\geq 1$ , Nemirovski showed that

[TABLE]

To compare these convergence rates, we consider the three contraction factors

[TABLE]

Since Luenberger and Ye’s bound only holds for $\epsilon\in(0,0.5)$ whereas ours holds for $\epsilon\in(0,1)$, we compare $t_{\operatorname{new}}$ and $t_{\operatorname{LY}}$ within the common range $0<\epsilon<0.5$. On the other hand, Nemirovski’s bound only holds for $\epsilon\in[0.5,1)$, hence we compare $t_{\operatorname{new}}$ and $t_{\operatorname{nemi}}$ within this range.

Using simple arguments we now show that $t_{\operatorname{new}}<t_{\operatorname{LY}}$ and $t_{\operatorname{new}}\leq t_{\operatorname{nemi}}$ within each range of comparison. Thus, our contraction factor is no larger than those of Luenberger and Ye and of Nemirovski. Indeed, to show $t_{\operatorname{new}}<t_{\operatorname{LY}}$ note that as $\epsilon<0.5$, we have

[TABLE]

Next, we show that $t_{\operatorname{nemi}}\geq t_{\operatorname{new}}.$ Since $0.5\leq\epsilon$ , we have

[TABLE]

Since $\eta>1$ , $\kappa>1$ and $1-\epsilon>0$ , this implies

[TABLE]

Since $4\epsilon(1-\epsilon)^{2}>0$ , we can subtract it from the right-hand side:

[TABLE]

and add $\epsilon(\eta\kappa)^{2}$ to each side and factorize:

[TABLE]

Rearranging the equation, we obtain

[TABLE]

and the proof that $t_{\operatorname{nemi}}\geq t_{\operatorname{new}}$ is concluded.

Figures 2 and 2 compare $t_{\operatorname{new}}$ with $t_{\operatorname{LY}}$ and $t_{\operatorname{nemi}}$ respectively for various values of $\kappa$ and $\eta$ .

We note that when $\eta\to 1^{+}$ and $\kappa\to 1^{+}$ and $\epsilon=0.5$ , our contraction factor $t_{\operatorname{new}}$ tends to 0, whereas Nemirovski’s contraction factor $t_{\operatorname{nemi}}$ tends to 0.5. In fact, when $\kappa\to 1^{+}$ , the function $f$ behaves roughly as a quadratic. Combining this with the fact that $\eta\to 1^{+}$ and the updates as shown in (28), it can be verified that GD with Armijo rule takes only a single step to attain the optimal solution. Hence, the contraction factor we derived, i.e., $t_{\operatorname{new}}=0$ , is tight in this limiting scenario.
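The single-step claim can be verified by a short calculation. The derivation below is a sketch under two assumptions: that the noiseless Armijo conditions read $f(\operatorname{\mathbf{x}}_{k}-\gamma\operatorname{\mathbf{g}}_{k})\leq f_{k}-\epsilon\gamma\|\operatorname{\mathbf{g}}_{k}\|^{2}$ and $f(\operatorname{\mathbf{x}}_{k}-\eta\gamma\operatorname{\mathbf{g}}_{k})\geq f_{k}-\epsilon\eta\gamma\|\operatorname{\mathbf{g}}_{k}\|^{2}$, and that in the limit $\kappa\to 1^{+}$ the function is exactly the quadratic $f(\operatorname{\mathbf{x}})=f_{*}+\frac{L}{2}\|\operatorname{\mathbf{x}}-\operatorname{\mathbf{x}}_{*}\|^{2}$.

```latex
% With g_k = L (x_k - x_*), a step of length \gamma gives
f(\mathbf{x}_k - \gamma \mathbf{g}_k)
  = f_k - \gamma \Bigl(1 - \tfrac{L\gamma}{2}\Bigr) \|\mathbf{g}_k\|^2 .
% Sufficient decrease at \gamma and its failure (or equality) at \eta\gamma give
1 - \tfrac{L\gamma}{2} \ge \epsilon
\quad\text{and}\quad
1 - \tfrac{L\eta\gamma}{2} \le \epsilon
\;\Longleftrightarrow\;
\frac{2(1-\epsilon)}{\eta L} \le \gamma \le \frac{2(1-\epsilon)}{L} .
% For \epsilon = 1/2 this reads 1/(\eta L) \le \gamma \le 1/L, so \gamma \to 1/L as \eta \to 1^+, and
\mathbf{x}_{k+1} = \mathbf{x}_k - \tfrac{1}{L}\, L\, (\mathbf{x}_k - \mathbf{x}_*) = \mathbf{x}_* .
```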

Conversely, Nemirovski’s contraction factor $t_{\operatorname{nemi}}=0.5$ is loose. The looseness of Nemirovski’s analysis can be attributed to the fact that he only applies the condition

[TABLE]

to select iterates (i.e., discretizing the condition), which is not sufficient to guarantee $\operatorname{\mathcal{F}}_{\mu,L}$ -interpolability. On the other hand, we make use of the condition from Theorem 1, which constitutes a necessary and sufficient condition for a function to be $\operatorname{\mathcal{F}}_{\mu,L}$ -interpolable.

3.2 The Goldstein Rule

Using Goldstein-terminated line search, the step size is chosen so that

[TABLE]

for some $\epsilon\in(0,1/2)$ [26, Section 2.4.2]. The Goldstein rule was proposed earlier than the Armijo rule, and encapsulates the same principle of sufficient decrease as the Armijo rule [2, Page 32].

To examine the performance of the Goldstein-terminated line search in the noisy setting, we again substitute $-\operatorname{\mathbf{d}}_{k}$ for $\operatorname{\mathbf{g}}_{k}$ in (38) to get:

[TABLE]
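A Goldstein-terminated search can be sketched by bisection. The sketch assumes the noisy conditions take the two-sided form $f_{k}-(1-\epsilon)\gamma\|\operatorname{\mathbf{d}}_{k}\|^{2}\leq f(\operatorname{\mathbf{x}}_{k}+\gamma\operatorname{\mathbf{d}}_{k})\leq f_{k}-\epsilon\gamma\|\operatorname{\mathbf{d}}_{k}\|^{2}$, and that the initial upper end `hi` of the bracket is a step that is too long; both the bisection strategy and the name `goldstein_step` are hypothetical.

```python
def goldstein_step(f, x, d, eps=0.25, hi=1.0, max_iter=200):
    """Bisect on [0, hi] for a step satisfying both Goldstein inequalities."""
    d_sq = sum(di * di for di in d)
    fx = f(x)
    lo = 0.0
    for _ in range(max_iter):
        gamma = 0.5 * (lo + hi)
        change = f([xi + gamma * di for xi, di in zip(x, d)]) - fx
        if change > -eps * gamma * d_sq:
            hi = gamma    # not enough decrease: step too long
        elif change < -(1.0 - eps) * gamma * d_sq:
            lo = gamma    # decrease exceeds the lower envelope: step too short
        else:
            return gamma  # both inequalities hold
    raise RuntimeError("no Goldstein step found within max_iter iterations")


def quadratic(x):
    """f(x) = 0.5 * (x_1^2 + 4 x_2^2), so mu = 1 and L = 4."""
    return 0.5 * (x[0] ** 2 + 4.0 * x[1] ** 2)


x_k = [1.0, 1.0]
d_k = [-1.0, -4.0]  # noiseless case: d_k = -grad f(x_k)
gamma_k = goldstein_step(quadratic, x_k, d_k)
print(gamma_k)
```

For this quadratic the admissible steps form an interval of roughly $[0.131, 0.392]$, and the second bisection midpoint $\gamma_{k}=0.25$ lies inside it.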

When noisy GD with the Goldstein rule is applied to an $L$ -smooth function, the following polynomial constraint:

[TABLE]

holds for $\delta\in[0,\sqrt{5}-2)$ and $\epsilon\in\left(1-\frac{1-\delta}{(1+\delta)^{2}},\frac{1}{2}\right)$ . Indeed, as $f$ is $L$ -smooth we get

[TABLE]

which combined with the first inequality in (39) gives

[TABLE]

In turn, this implies that

[TABLE]

Since we require $\gamma_{k}>0$ , we need that $\epsilon>1-\frac{1-\delta}{(1+\delta)^{2}}$ . Together with the condition that $\epsilon\in(0,1/2)$ , this implies that the Goldstein rule only works in the case where $\delta<\sqrt{5}-2$ and $\epsilon\in\left(1-\frac{1-\delta}{(1+\delta)^{2}},\frac{1}{2}\right)$ . Substituting (41) into the second inequality in (39), we have

[TABLE]
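The noise threshold $\delta<\sqrt{5}-2$ stated above follows from requiring that the admissible interval for $\epsilon$ be nonempty. A short worked derivation, valid for $\delta\geq 0$, is:

```latex
1 - \frac{1-\delta}{(1+\delta)^2} < \frac{1}{2}
\;\Longleftrightarrow\;
2(1-\delta) > (1+\delta)^2
\;\Longleftrightarrow\;
\delta^2 + 4\delta - 1 < 0
\;\Longleftrightarrow\;
\delta < \sqrt{5} - 2 .
% Similarly, multiplying \epsilon > 1 - (1-\delta)/(1+\delta)^2 by (1+\delta)^2 > 0 gives
\epsilon (1+\delta)^2 > (1+\delta)^2 - (1-\delta) = \delta^2 + 3\delta .
```

The second line is the implication invoked in the proof of Theorem 5.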

The polynomial constraints we use in the search of a degree-1 SOS certificate (10) are as follows: $h_{1}(\operatorname{\mathbf{z}})\geq 0,\dots,h_{6}(\operatorname{\mathbf{z}})\geq 0$ given in (9), as well as (40), denoted by $h_{7}(\operatorname{\mathbf{z}})\geq 0$ . As before, after constructing and solving the appropriate SDP we derive the following result:

Theorem 5.

For any $\delta\in[0,\sqrt{5}-2)$ and $\epsilon\in\left(1-\frac{1-\delta}{(1+\delta)^{2}},\frac{1}{2}\right)$, given a $(\mu,L)$-smooth function $f:\operatorname{\mathbb{R}}^{n}\to\operatorname{\mathbb{R}}$ and any sequence of iterates $\{\operatorname{\mathbf{x}}_{k}\}_{k\geq 1}$ generated using noisy GD with the Goldstein rule, the bound

[TABLE]

admits an SOS certificate of degree-1.

Proof.

Defining $\sigma_{5}=\frac{2\mu\epsilon(1-\delta)^{2}}{L}\left(\frac{1-\delta}{(1+\delta)^{2}}-(1-\epsilon)\right)$ , $\sigma_{7}=1$ and

[TABLE]

we have that $t(f_{k}-f_{*})-(f_{k+1}-f_{*})$ is equal to

[TABLE]

The first term in the right-hand-side of equation (42) is strictly positive since $\epsilon>1-\frac{1-\delta}{(1+\delta)^{2}}$ implies $\epsilon(1+\delta)^{2}>\delta^{2}+3\delta$ . In addition, any sequence of iterates $\{\operatorname{\mathbf{x}}_{k}\}_{k\geq 1}$ generated by noisy GD with the Goldstein rule for minimizing a function $f\in\operatorname{\mathcal{F}}_{\mu,L}(\operatorname{\mathbb{R}}^{n})$ satisfies $h_{5}(\operatorname{\mathbf{z}})\geq 0$ and $h_{7}(\operatorname{\mathbf{z}})\geq 0$ . Since $\sigma_{5},\sigma_{7}\geq 0$ , the right-hand-side of (42) is positive. Hence, the left-hand-side of equation (42) is also positive, concluding the proof. ∎

Lastly, from Theorem 5, we recover the bound for GD with the Goldstein rule in the noiseless case (i.e., $\delta=0$ ), which is given by

[TABLE]

3.3 The Wolfe Conditions

A step size chosen using the Wolfe conditions must satisfy the following two inequalities:

[TABLE]

where $\operatorname{\mathbf{d}}_{k}$ is a descent direction and $0<c_{1}<c_{2}<1$ , e.g. see [28, Page 37].

For the Wolfe conditions, we only have results for noiseless GD. In this setting, $\operatorname{\mathbf{d}}_{k}=-\operatorname{\mathbf{g}}_{k}$ and hence equations (43)-(44) become

[TABLE]

Next, we show that the following polynomial inequality holds when GD with the Wolfe conditions is applied to an $L$ -smooth function:

[TABLE]

Indeed, as $f$ is $L$ -smooth we have that

[TABLE]

which for $\operatorname{\mathbf{x}}=\operatorname{\mathbf{x}}_{k+1}=\operatorname{\mathbf{x}}_{k}-\gamma_{k}\operatorname{\mathbf{g}}_{k}$ and $\operatorname{\mathbf{y}}=\operatorname{\mathbf{x}}_{k}$ gives

[TABLE]

In turn, this implies that

[TABLE]

which shows that

[TABLE]

Lastly, (48) together with (45), gives us

[TABLE]

In the next theorem, the polynomials we use in the search of a degree-1 SOS certificate (10) are $h_{1}(\operatorname{\mathbf{z}})\geq 0,\dots,h_{6}(\operatorname{\mathbf{z}})\geq 0$ given in (9), as well as (47), denoted by $h_{7}(\operatorname{\mathbf{z}})\geq 0$ . Constructing and solving the appropriate SDP, we obtain the following result.

Theorem 6.

For any $0<c_{1}<c_{2}<1$, given a $(\mu,L)$-smooth function $f:\operatorname{\mathbb{R}}^{n}\to\operatorname{\mathbb{R}}$ and any sequence of iterates $\{\operatorname{\mathbf{x}}_{k}\}_{k\geq 1}$ generated using GD with an inexact line search satisfying the Wolfe conditions, the bound

[TABLE]

admits an SOS certificate of degree-1.

Proof.

Defining $\sigma_{5}=\frac{2\mu c_{1}(1-c_{2})}{L}$ , $\sigma_{7}=1$ and $t=1-\frac{2\mu c_{1}(1-c_{2})}{L}$ , we have

[TABLE]

Since $\sigma_{5}$ and $\sigma_{7}$ are both nonnegative, and since any sequence of iterates $\{\operatorname{\mathbf{x}}_{k}\}_{k\geq 1}$ generated by GD with the Wolfe conditions for minimizing a function $f\in\operatorname{\mathcal{F}}_{\mu,L}(\operatorname{\mathbb{R}}^{n})$ satisfies $h_{5}(\operatorname{\mathbf{z}})\geq 0$ and $h_{7}(\operatorname{\mathbf{z}})\geq 0$, the right-hand side of (49) is nonnegative. Hence, the left-hand side of (49) is also nonnegative, which concludes the proof. ∎
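The contraction factor $t=1-\frac{2\mu c_{1}(1-c_{2})}{L}$ from the proof can be checked empirically on a simple instance. The sketch below runs GD on a diagonal quadratic with eigenvalues $\mu$ and $L$, picks each step by a bisection search for the noiseless Wolfe conditions (assumed here in the standard form $f_{k+1}\leq f_{k}-c_{1}\gamma_{k}\|\operatorname{\mathbf{g}}_{k}\|^{2}$ and $\operatorname{\mathbf{g}}_{k+1}^{\top}\operatorname{\mathbf{g}}_{k}\leq c_{2}\|\operatorname{\mathbf{g}}_{k}\|^{2}$), and tests the one-step bound. The search routine and all names are hypothetical; a single instance illustrates the bound and does not establish it.

```python
def f(a, x):
    """f(x) = 0.5 * sum(a_i x_i^2); minimizer x_* = 0 with f_* = 0."""
    return 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))


def grad(a, x):
    return [ai * xi for ai, xi in zip(a, x)]


def wolfe_step(a, x, c1, c2, max_iter=200):
    """Bisection-style search for a step satisfying both Wolfe conditions."""
    g = grad(a, x)
    g_sq = sum(gi * gi for gi in g)
    lo, hi, gamma = 0.0, None, 1.0
    for _ in range(max_iter):
        trial = [xi - gamma * gi for xi, gi in zip(x, g)]
        if f(a, trial) > f(a, x) - c1 * gamma * g_sq:
            hi = gamma   # sufficient decrease fails: step too long
        elif sum(gn * gi for gn, gi in zip(grad(a, trial), g)) > c2 * g_sq:
            lo = gamma   # curvature condition fails: step too short
        else:
            return gamma, trial
        gamma = 2.0 * lo if hi is None else 0.5 * (lo + hi)
    raise RuntimeError("no Wolfe step found within max_iter iterations")


mu, L = 1.0, 4.0
a = [mu, L]
c1, c2 = 0.1, 0.5
t = 1.0 - 2.0 * mu * c1 * (1.0 - c2) / L  # contraction factor from the proof

x = [1.0, 1.0]
bound_holds = []
for _ in range(5):
    gamma, x_next = wolfe_step(a, x, c1, c2)
    # One-step bound f_{k+1} - f_* <= t (f_k - f_*), with f_* = 0.
    bound_holds.append(f(a, x_next) <= t * f(a, x) + 1e-12)
    x = x_next
print(t, bound_holds)
```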

4 Conclusions and Future Work

This paper proposes a new technique for bounding the convergence rates for various algorithms—namely, by searching for SOS certificates. This leads to a hierarchy of SDPs, for which the first level of the hierarchy is dual to the SDP induced by the one-step PEP as discussed in Section 2.5. Furthermore, using the first level of the SOS hierarchy, we derive new bounds for gradient descent with three popular inexact line search methods.

However, our technique does not necessarily produce tight bounds, since it entails two relaxation steps. First, the constraints characterizing the function class or algorithm may be relaxed. Second, we relax the constraint that $p(\operatorname{\mathbf{z}})$ be nonnegative to the constraint that $p(\operatorname{\mathbf{z}})$ is an SOS. Recall that while SOS implies nonnegativity, the converse is not necessarily true. Proving the tightness of the derived bounds will have to be done via other means.

At present, we have only utilized the first level of the proposed hierarchy by searching for degree-1 certificates. In future work, we look to apply the SOS framework to broader function classes, for which exact $\operatorname{\mathcal{F}}$ -interpolability conditions have not been formulated. In this setting, the SDP formulation of the f-PEP (dual to our degree-1 SOS-SDP) would, in general, not be tight, and going higher up the hierarchy may produce tighter contraction factors (as the degree of the SOS-SDP increases). Finally, in the instances where SOS certificates cannot be found, it would be desirable to examine why the technique fails to better understand the scope for which this approach may be applied.

Acknowledgements

The authors are supported by a Singapore National Research Foundation (NRF) Fellowship (R-263-000-D02-281).

Bibliography

[1] A. A. Ahmadi. Sum of squares (SOS) techniques: An introduction. http://www.princeton.edu/~amirali/Public/Teaching/ORF523/S16/ORF523_S16_Lec15.pdf.
[2] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, second edition, 1999.
[3] G. Blekherman, P. A. Parrilo, and R. R. Thomas. Semidefinite Optimization and Convex Algebraic Geometry, volume 13. MOS-SIAM Series on Optimization, 2012.
[4] S. P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2009.
[5] A. I. Cohen. Rate of convergence of several conjugate gradient algorithms. SIAM Journal on Numerical Analysis, 9(2):248–259, 1972.
[6] Y.-H. Dai. Nonlinear conjugate gradient methods. Wiley Encyclopedia of Operations Research and Management Science, 2011.
[7] E. de Klerk, F. Glineur, and A. B. Taylor. On the worst-case complexity of the gradient method with exact line search for smooth strongly convex functions. Optimization Letters, 11(7):1185–1199, 2017.
[8] E. de Klerk, F. Glineur, and A. B. Taylor. Worst-case convergence analysis of gradient and Newton methods through semidefinite programming performance estimation. Technical report, https://arxiv.org/abs/1709.05191, 2017.
