First-order algorithms converge faster than $O(1/k)$ on convex problems

Ching-pei Lee; Stephen J. Wright

arXiv:1812.08485·math.OC·May 15, 2019·ICML

TL;DR

This paper proves that first-order algorithms like gradient descent and coordinate descent can achieve convergence rates faster than $O(1/k)$ for convex problems, improving known bounds.

Contribution

It establishes that several first-order methods attain an $o(1/k)$ convergence rate, surpassing the traditional $O(1/k)$ rate, and shows this is the best possible improvement.

Findings

01

Gradient descent achieves $o(1/k)$ convergence rate.

02

Proximal methods also attain $o(1/k)$ rate.

03

The $o(1/k)$ rate is tight and cannot be improved to $O(1/k^{1+\theta})$ for any $\theta>0$.

Abstract

It is well known that both gradient descent and stochastic coordinate descent achieve a global convergence rate of $O(1/k)$ in the objective value, when applied to a scheme for minimizing a Lipschitz-continuously differentiable, unconstrained convex function. In this work, we improve this rate to $o(1/k)$. We extend the result to proximal gradient and proximal coordinate descent on regularized problems to show similar $o(1/k)$ convergence rates. The result is tight in the sense that a rate of $O(1/k^{1+\epsilon})$ is not generally attainable for any $\epsilon>0$, for any of these methods.

Equations

$$\min_{x} \; f(x), \tag{1}$$

$$x_{k+1} \coloneqq x_{k} - \alpha_{k} \nabla f(x_{k}), \quad k = 0, 1, 2, \dots, \tag{2}$$

$$f(x_{k}) - f^{*} \leq \frac{\mbox{\rm dist}(x_{0},\Omega)^{2}}{2\bar{\alpha}k}, \quad k = 1, 2, \dots, \tag{3}$$

$$k\,(f(x_{k}) - f^{*}) \leq \sum_{T=1}^{k} (f(x_{T}) - f^{*}) \leq \frac{1}{2\bar{\alpha}} \mbox{\rm dist}(x_{0},\Omega)^{2}, \quad k = 1, 2, \dots, \tag{4}$$

$$s_{k+1} = (k+1)\Delta_{k+1} \leq k\Delta_{k} + \Delta_{k+1} \leq s_{k} + \Delta_{k}. \tag{5}$$

$$u_{k+1} = s_{k+1} + \sum_{i=k+1}^{\infty} \Delta_{i} \leq s_{k} + \Delta_{k} + \sum_{i=k+1}^{\infty} \Delta_{i} = s_{k} + \sum_{i=k}^{\infty} \Delta_{i} = u_{k},$$

$$M(\alpha) \coloneqq \alpha - \frac{1}{2} L \alpha^{2}. \tag{6}$$

$$f(x - \alpha \nabla f(x)) \leq f(x) - \nabla f(x)^{\top} (\alpha \nabla f(x)) + \frac{L}{2} \|\alpha \nabla f(x)\|^{2} = f(x) - M(\alpha) \|\nabla f(x)\|^{2}. \tag{7}$$

$$\alpha \in \left(0, \frac{1}{L}\right] \;\Rightarrow\; M(\alpha) \geq \frac{1}{2}\alpha > 0, \tag{8}$$

$$\|\nabla f(x)\|^{2} \leq \frac{1}{M(\alpha)} \left(f(x) - f(x - \alpha \nabla f(x))\right) \leq \frac{2}{\alpha} \left(f(x) - f(x - \alpha \nabla f(x))\right). \tag{9}$$

$$\|x_{T+1} - \bar{x}\|^{2} = \|x_{T} - \alpha_{T} \nabla f(x_{T}) - \bar{x}\|^{2} = \|x_{T} - \bar{x}\|^{2} + \alpha_{T}^{2} \|\nabla f(x_{T})\|^{2} - 2\alpha_{T} \nabla f(x_{T})^{\top} (x_{T} - \bar{x}). \tag{10}$$

$$\|x_{T+1} - \bar{x}\|^{2} \leq \|x_{T} - \bar{x}\|^{2} + 2\alpha_{T} (f(x_{T}) - f(x_{T+1})) + 2\alpha_{T} (f^{*} - f(x_{T})). \tag{11}$$

$$f(x_{T+1}) - f^{*} \leq \frac{1}{2\bar{\alpha}} \left(\|x_{T} - \bar{x}\|^{2} - \|x_{T+1} - \bar{x}\|^{2}\right). \tag{12}$$

$$\alpha_{k} \in [C_{2}, C_{1}], \quad C_{2} \in \left(0, \frac{2-\gamma}{L}\right], \quad C_{1} \geq C_{2}, \tag{13a}$$

$$f(x_{k} - \alpha_{k} \nabla f(x_{k})) \leq f(x_{k}) - \frac{\gamma\alpha_{k}}{2} \|\nabla f(x_{k})\|^{2}, \tag{13b}$$

$$\|x_{k+1} - \bar{x}\|^{2} \leq \|x_{0} - \bar{x}\|^{2} + \frac{2C_{1}}{\gamma} (f(x_{0}) - f^{*}).$$

$$\|x_{T+1} - \bar{x}\|^{2} - \|x_{T} - \bar{x}\|^{2} \leq \frac{2\alpha_{T}}{\gamma} (f(x_{T}) - f(x_{T+1})) + 2\alpha_{T} (f^{*} - f(x_{T})). \tag{16}$$

$$\|x_{T+1} - \bar{x}\|^{2} - \|x_{T} - \bar{x}\|^{2} \leq \frac{2C_{1}}{\gamma} (f(x_{T}) - f(x_{T+1})) + 2C_{2} (f^{*} - f(x_{T})). \tag{17}$$

$$2C_{2} \sum_{T=0}^{k} \Delta_{T} \leq \|x_{0} - \bar{x}\|^{2} - \|x_{k+1} - \bar{x}\|^{2} + \frac{2C_{1}}{\gamma} \Delta_{0} \leq \|x_{0} - \bar{x}\|^{2} + \frac{2C_{1}}{\gamma} \Delta_{0}.$$

$$\Delta_{T} \leq \nabla f(x_{T})^{\top} (x_{T} - \bar{x}_{T}) \leq \|\nabla f(x_{T})\| \, \mbox{\rm dist}(x_{T},\Omega),$$

$$\|\nabla f(x_{T})\| \geq \frac{\Delta_{T}}{\mbox{\rm dist}(x_{T},\Omega)}. \tag{18}$$

$$\Delta_{T+1} \leq \Delta_{T} - \frac{C_{2}\gamma\Delta_{T}^{2}}{2\,\mbox{\rm dist}(x_{T},\Omega)^{2}}.$$

$$\frac{1}{\Delta_{T+1}} \geq \frac{1}{\Delta_{T}} + \frac{C_{2}\gamma\Delta_{T}}{2\,\mbox{\rm dist}(x_{T},\Omega)^{2}\Delta_{T+1}} \geq \frac{1}{\Delta_{T}} + \frac{C_{2}\gamma}{2\,\mbox{\rm dist}(x_{T},\Omega)^{2}}. \tag{19}$$

$$\frac{1}{\Delta_{k}} \geq \frac{1}{\Delta_{0}} + \sum_{T=0}^{k-1} \frac{C_{2}\gamma}{2\,\mbox{\rm dist}(x_{T},\Omega)^{2}} \;\Rightarrow\; \Delta_{k} \leq \frac{1}{\sum_{T=0}^{k-1} \frac{C_{2}\gamma}{2\,\mbox{\rm dist}(x_{T},\Omega)^{2}}}. \tag{20}$$

$$\sum_{T=0}^{k-1} \frac{1}{\mbox{\rm dist}(x_{T},\Omega)^{2}} \geq \frac{k}{R_{0}^{2}}. \tag{21}$$

$$\lim_{k\to\infty} \mbox{\rm dist}(x_{k},\Omega) = 0. \tag{22}$$

$$\mbox{\rm dist}(x_{k_{i}},\Omega) \geq \epsilon, \quad i = 1, 2, \dots. \tag{23}$$

$$\frac{1}{\Delta_{k_{i+1}}} \geq \frac{1}{\Delta_{k_{i}}} + \frac{C_{2}\gamma}{2\epsilon^{2}},$$

$$\lim_{k\to\infty} \frac{\dfrac{1}{\frac{C_{2}\gamma}{2} \sum_{T=0}^{k-1} \frac{1}{\mbox{\rm dist}(x_{T},\Omega)^{2}}}}{\dfrac{1}{k}} = 0, \tag{24}$$

$$\lim_{k\to\infty} \frac{k}{\sum_{T=0}^{k-1} \frac{1}{\mbox{\rm dist}(x_{T},\Omega)^{2}}} = 0.$$

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Markov Chains and Monte Carlo Methods

Full text

First-Order Algorithms Converge Faster than $O(1/k)$ on Convex Problems

Ching-pei Lee

Stephen J. Wright

Department of Computer Sciences and Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, WI

Email: {ching-pei,swright}@cs.wisc.edu

This work was supported by NSF Awards IIS-1447449, 1628384, 1634597, and 1740707; Subcontract 8F-30039 from Argonne National Laboratory; and Award N660011824020 from the DARPA Lagrange Program.

Keywords: Gradient descent methods · Coordinate descent methods · Proximal gradient methods · Convex optimization · Complexity

1 Introduction

Consider the unconstrained optimization problem

$$\min_{x} \; f(x), \tag{1}$$

where $f$ has domain in an inner-product space and is convex and $L$-Lipschitz continuously differentiable for some $L>0$. We assume throughout that the solution set $\Omega$ is non-empty. (Elementary arguments based on the convexity and continuity of $f$ show that $\Omega$ is a closed convex set.) Classical convergence theory for gradient descent on this problem indicates an $O(1/k)$ global convergence rate in the function value. Specifically, if

$$x_{k+1} \coloneqq x_{k} - \alpha_{k} \nabla f(x_{k}), \quad k = 0, 1, 2, \dots, \tag{2}$$

and $\alpha_{k}\equiv\bar{\alpha}\in(0,1/L]$ , we have

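$$f(x_{k}) - f^{*} \leq \frac{\mbox{\rm dist}(x_{0},\Omega)^{2}}{2\bar{\alpha}k}, \quad k = 1, 2, \dots, \tag{3}$$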

where $f^{*}$ is the optimal objective value and $\mbox{\rm dist}(x,\Omega)$ denotes the distance from $x$ to the solution set. The proof of (3) relies on showing that

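$$k\,(f(x_{k}) - f^{*}) \leq \sum_{T=1}^{k} (f(x_{T}) - f^{*}) \leq \frac{1}{2\bar{\alpha}} \mbox{\rm dist}(x_{0},\Omega)^{2}, \quad k = 1, 2, \dots, \tag{4}$$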

where the first inequality utilizes the fact that gradient descent is a descent method (yielding a nonincreasing sequence of function values $\{f(x_{k})\}$). We demonstrate in this paper that the bound (3) is not tight, in the sense that $k(f(x_{k})-f^{*})\to 0$, and thus $f(x_{k})-f^{*}=o(1/k)$. This result is a consequence of the following technical lemma.

Lemma 1

Let $\{\Delta_{k}\}$ be a nonnegative sequence satisfying the following conditions:

1. $\{\Delta_{k}\}$ is monotonically decreasing;

2. $\{\Delta_{k}\}$ is summable, that is, $\sum_{k=0}^{\infty}\Delta_{k}<\infty$.

Then $k\Delta_{k}\to 0$ , so that $\Delta_{k}=o(1/k)$ .

Proof

The proof uses simplified elements of the proofs of Lemmas 2 and 9 of Section 2.2.1 from Pol87a . Define $s_{k}\coloneqq k\Delta_{k}$ and $u_{k}\coloneqq s_{k}+\sum_{i=k}^{\infty}\Delta_{i}$ . Note that

$$s_{k+1} = (k+1)\Delta_{k+1} \leq k\Delta_{k} + \Delta_{k+1} \leq s_{k} + \Delta_{k}. \tag{5}$$

From (5) we have

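$$u_{k+1} = s_{k+1} + \sum_{i=k+1}^{\infty} \Delta_{i} \leq s_{k} + \Delta_{k} + \sum_{i=k+1}^{\infty} \Delta_{i} = s_{k} + \sum_{i=k}^{\infty} \Delta_{i} = u_{k},$$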

so that $\{u_{k}\}$ is a monotonically decreasing nonnegative sequence. Thus there is $u\geq 0$ such that $u_{k}\to u$ , and since $\lim_{k\to\infty}\sum_{i=k}^{\infty}\Delta_{i}=0$ , we have $s_{k}\to u$ also.

Assuming for contradiction that $u>0$ , there exists $k_{0}>0$ such that $s_{k}\geq u/2>0$ for all $k\geq k_{0}$ , so that $\Delta_{k}\geq{u}/{(2k)}$ for all $k\geq k_{0}$ . This contradicts the summability of $\{\Delta_{k}\}$ . Therefore we have $u=0$ , so that $k\Delta_{k}=s_{k}\to 0$ , proving the result. ∎
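As a small numerical illustration of Lemma 1 (not part of the original argument), the sketch below evaluates $k\Delta_{k}$ for one particular monotonically decreasing, summable sequence. The choice $\Delta_{k}=1/((k+2)\log^{2}(k+2))$ and the helper names `delta` and `scaled_terms` are assumptions made only for this example.

```python
import math


def delta(k: int) -> float:
    """Illustrative sequence: 1 / ((k + 2) * log(k + 2)^2).

    It is positive, monotonically decreasing, and summable (integral test),
    so it satisfies both conditions of Lemma 1.
    """
    return 1.0 / ((k + 2) * math.log(k + 2) ** 2)


def scaled_terms(ks):
    """Return k * delta(k) for each k in ks."""
    return [k * delta(k) for k in ks]


ks = [10 ** j for j in range(1, 7)]
values = scaled_terms(ks)
for k, v in zip(ks, values):
    # Lemma 1 predicts that these products tend to zero.
    print(f"k = {k:>8d}   k * delta_k = {v:.6f}")
```

For this sequence the products shrink only logarithmically, which is consistent with the lemma giving $o(1/k)$ without any guaranteed polynomial improvement.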

Our claim about the fixed-step gradient descent method follows immediately by setting $\Delta_{k}=f(x_{k})-f^{*}$ in Lemma 1. We state the result formally as follows, and prove it at the start of Section 2.

Theorem 1.1

Consider (1) with $f$ convex and $L$ -Lipschitz continuously differentiable and nonempty solution set $\Omega$ . If the step sizes satisfy $\alpha_{k}\equiv\bar{\alpha}\in(0,1/L]$ for all $k$ , then gradient descent (2) generates objective values $f(x_{k})$ that converge to $f^{*}$ at an asymptotic rate of $o(1/k)$ .

This result shows that the $o(1/k)$ rate for gradient descent with a fixed short step size is universal on convex problems, without any additional requirements such as the boundedness of $\Omega$ assumed in (Ber16a, Proposition 1.3.3). In the remainder of the paper, we show that this faster rate holds for several other smooth optimization algorithms, including gradient descent with fixed steps in the larger range $(0,2/L)$, gradient descent with various line-search strategies, and stochastic coordinate descent with arbitrary sampling strategies. We then extend the result to algorithms for regularized convex optimization problems, including proximal gradient and stochastic proximal coordinate descent.

Except for the cases of coordinate descent and proximal coordinate descent which require a finite-dimensional space so that all the coordinates can be processed, our results apply to any inner-product spaces. Assumptions such as bounded solution set, bounded level set, or bounded distance to the solution set, which are commonly assumed in the literature, are all unnecessary. We can remove these assumptions because an implicit regularization property causes the iterates to stay within a bounded area.

In our description, the Euclidean norm is used for simplicity, but our results can be extended directly to any norm induced by an inner product (that is, given an inner product $\langle\cdot,\cdot\rangle$, the norm $\|\cdot\|$ is defined as $\|x\|\coloneqq\sqrt{\langle x,x\rangle}$), provided that Lipschitz continuity of $\nabla f$ is defined with respect to the corresponding norm and its dual norm.

Related Work.

Our work was inspired by (PenZZ18a, Corollary 2) and (Ber16a, Proposition 1.3.3), which improve the convergence rate of certain algorithms on convex problems in a Euclidean space from $O(1/k)$ to $o(1/k)$ when the level set is compact. Our paper develops improved convergence rates for several algorithms on convex problems without the assumption on the level set, with most of our results applying to non-Euclidean Hilbert spaces. The main proof techniques in this work are somewhat different from those in the works cited here.

For an accelerated version of proximal gradient on convex problems, it is proved in AttP16a that the convergence rate can be improved from $O(1/k^{2})$ to $o(1/k^{2})$ . Accelerated proximal gradient is a more complicated algorithm than the nonaccelerated versions we discuss, and thus AttP16a require a more complicated analysis that is quite different from ours.

DenLPY17a have stated a version of Lemma 1 with a proof different from the proof that we present, using it to show the convergence rate of the quantity $\|x_{k}-x_{k+1}\|$ of a version of the alternating-directions method of multipliers (ADMM). Our work differs in the range of algorithms considered and the nature of the convergence. We also provide a discussion of the tightness of the $o(1/k)$ convergence rate.

2 Main Results on Unconstrained Smooth Problems

We start by detailing the procedure for obtaining (4), to complete the proof of Theorem 1.1. First, we define

$$M(\alpha) \coloneqq \alpha - \frac{1}{2} L \alpha^{2}. \tag{6}$$

From the Lipschitz continuity of $\nabla f$ , we have for any point $x$ and any real number $\alpha$ that

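$$f(x - \alpha \nabla f(x)) \leq f(x) - \nabla f(x)^{\top} (\alpha \nabla f(x)) + \frac{L}{2} \|\alpha \nabla f(x)\|^{2} = f(x) - M(\alpha) \|\nabla f(x)\|^{2}. \tag{7}$$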

Clearly,

$$\alpha \in \left(0, \frac{1}{L}\right] \;\Rightarrow\; M(\alpha) \geq \frac{1}{2}\alpha > 0, \tag{8}$$

so in this case, we have by rearranging (7) that

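$$\|\nabla f(x)\|^{2} \leq \frac{1}{M(\alpha)} \left(f(x) - f(x - \alpha \nabla f(x))\right) \leq \frac{2}{\alpha} \left(f(x) - f(x - \alpha \nabla f(x))\right). \tag{9}$$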

Considering any solution $\bar{x}\in\Omega$ and any $T\geq 0$ , we have for gradient descent (2) that

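$$\|x_{T+1} - \bar{x}\|^{2} = \|x_{T} - \alpha_{T} \nabla f(x_{T}) - \bar{x}\|^{2} = \|x_{T} - \bar{x}\|^{2} + \alpha_{T}^{2} \|\nabla f(x_{T})\|^{2} - 2\alpha_{T} \nabla f(x_{T})^{\top} (x_{T} - \bar{x}). \tag{10}$$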

Since $\alpha_{T}\in(0,1/L]$ in (10), from (9) and the convexity of $f$ (implying $\nabla f(x_{T})^{\top}(\bar{x}-x_{T})\leq f^{*}-f(x_{T})$), we have

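$$\|x_{T+1} - \bar{x}\|^{2} \leq \|x_{T} - \bar{x}\|^{2} + 2\alpha_{T} (f(x_{T}) - f(x_{T+1})) + 2\alpha_{T} (f^{*} - f(x_{T})). \tag{11}$$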

By rearranging (11) and using $\alpha_{T}\equiv\bar{\alpha}\in(0,1/L]$ ,

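$$f(x_{T+1}) - f^{*} \leq \frac{1}{2\bar{\alpha}} \left(\|x_{T} - \bar{x}\|^{2} - \|x_{T+1} - \bar{x}\|^{2}\right). \tag{12}$$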

We then obtain (4) by summing (12) from $T=0$ to $T=k-1$ and noticing that $\bar{x}$ is arbitrary in $\Omega$ .
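The bounds (3) and (4) can be checked numerically on a toy problem. The sketch below is only an illustration under stated assumptions: the test function $f(x)=x^{4}/4$, the starting point $x_{0}=1$, and the helper name `gradient_descent_gaps` are choices made for this example. On $[-1,1]$, where the iterates of this descent method remain, $f''(x)=3x^{2}\leq 3$, so $L=3$ is used, with $\Omega=\{0\}$ and $f^{*}=0$.

```python
def gradient_descent_gaps(x0: float, step: float, num_iters: int):
    """Run x_{k+1} = x_k - step * f'(x_k) for f(x) = x**4 / 4.

    Returns [f(x_0) - f*, ..., f(x_K) - f*] with f* = 0.
    """
    x = x0
    gaps = [x ** 4 / 4.0]
    for _ in range(num_iters):
        x = x - step * x ** 3  # f'(x) = x**3
        gaps.append(x ** 4 / 4.0)
    return gaps


L = 3.0                      # valid on [-1, 1] (assumption of this example)
alpha_bar = 1.0 / L          # fixed short step in (0, 1/L]
x0 = 1.0
gaps = gradient_descent_gaps(x0, alpha_bar, 10_000)

# Right-hand side of (4): dist(x_0, Omega)^2 / (2 * alpha_bar), with Omega = {0}.
bound = x0 ** 2 / (2.0 * alpha_bar)

for k in (1, 10, 100, 1000, 10_000):
    partial_sum = sum(gaps[1:k + 1])
    print(f"k = {k:>6d}  k*gap = {k * gaps[k]:.3e}  "
          f"sum = {partial_sum:.4f}  bound = {bound:.4f}")
```

In this run the partial sums stay below the constant bound while $k(f(x_{k})-f^{*})$ keeps shrinking, which is the behavior that Lemma 1 turns into the $o(1/k)$ rate.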

Theorem 1.1 applies to step sizes in the range $(0,1/L]$ only, but it is known that gradient descent converges at the rate of $O(1/k)$ for both the fixed step size scheme with $\bar{\alpha}\in(0,2/L)$ and line-search schemes. Next, we show that $o(1/k)$ rates hold for these variants too. We then extend the result to stochastic coordinate descent with arbitrary sampling of coordinates.

2.1 Gradient Descent with Longer Steps

In this subsection, we allow the steplengths $\alpha_{k}$ for (2) to vary from iteration to iteration, according to the following conditions, for some $\gamma\in(0,1]$ :

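$$\alpha_{k} \in [C_{2}, C_{1}], \quad C_{2} \in \left(0, \frac{2-\gamma}{L}\right], \quad C_{1} \geq C_{2}, \tag{13a}$$

$$f(x_{k} - \alpha_{k} \nabla f(x_{k})) \leq f(x_{k}) - \frac{\gamma\alpha_{k}}{2} \|\nabla f(x_{k})\|^{2}. \tag{13b}$$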

Note that these conditions encompass a fixed-steplength strategy with $\alpha_{k}\equiv C_{2}$ as a special case, by setting $C_{1}=C_{2}$ , and noting that condition (13b) is a consequence of (7). (Note too that $\alpha_{k}\equiv C_{2}\in(0,(2-\gamma)/L]$ can be almost twice as large as the bound $1/L$ considered above.)
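The step behind this remark can be spelled out as a short worked calculation, using only the definition $M(\alpha)=\alpha-\tfrac{1}{2}L\alpha^{2}$ and the descent inequality (7). For any $\alpha>0$,

$$M(\alpha) = \alpha\left(1 - \frac{L\alpha}{2}\right) \geq \frac{\gamma\alpha}{2} \;\Longleftrightarrow\; 1 - \frac{L\alpha}{2} \geq \frac{\gamma}{2} \;\Longleftrightarrow\; \alpha \leq \frac{2-\gamma}{L}.$$

Hence, whenever $\alpha_{k}\in(0,(2-\gamma)/L]$, inequality (7) gives $f(x_{k}-\alpha_{k}\nabla f(x_{k}))\leq f(x_{k})-M(\alpha_{k})\|\nabla f(x_{k})\|^{2}\leq f(x_{k})-\tfrac{\gamma\alpha_{k}}{2}\|\nabla f(x_{k})\|^{2}$, which is the sufficient-decrease condition (13b). As $\gamma\downarrow 0$ the admissible fixed step approaches $2/L$.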

The main result for this subsection is as follows.

Theorem 2.1

Consider (1) with $f$ convex and $L$ -Lipschitz continuously differentiable and nonempty solution set $\Omega$ . If the step sizes $\alpha_{k}$ satisfy (13), then gradient descent (2) generates objective values $f(x_{k})$ converging to $f^{*}$ at an asymptotic rate of $o(1/k)$ .

We give two alternative proofs of this result to provide different insights. The first proof is similar to the one we presented for Theorem 1.1 at the start of this section. The second proof holds only for Euclidean spaces. This proof improves the standard proof of (Nes04a, Section 2.1.5).

We start from the following lemma, which verifies that the iterates remain in a bounded set and is used in both proofs.

Lemma 2

Consider algorithm (2) with any initial point $x_{0}$ , and assume that $f$ is convex and $L$ -Lipschitz-continuously differentiable for some $L>0$ . Then when the sequence of steplengths $\alpha_{k}$ is chosen to satisfy (13), all iterates $x_{k}$ lie in a bounded set. In particular, for any $\bar{x}\in\Omega$ and any $k\geq 0$ , we have that

$$\|x_{k+1} - \bar{x}\|^{2} \leq \|x_{0} - \bar{x}\|^{2} + \frac{2C_{1}}{\gamma} (f(x_{0}) - f^{*}).$$

Proof

By (13b) and the convexity of $f$ , (10) further implies that for any $T\geq 0$ ,

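$$\|x_{T+1} - \bar{x}\|^{2} - \|x_{T} - \bar{x}\|^{2} \leq \frac{2\alpha_{T}}{\gamma} (f(x_{T}) - f(x_{T+1})) + 2\alpha_{T} (f^{*} - f(x_{T})). \tag{16}$$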

We know that the first term is nonnegative from (13b), while the second term is nonpositive from the optimality of $f^{*}$ . Therefore, (16) implies

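$$\|x_{T+1} - \bar{x}\|^{2} - \|x_{T} - \bar{x}\|^{2} \leq \frac{2C_{1}}{\gamma} (f(x_{T}) - f(x_{T+1})) + 2C_{2} (f^{*} - f(x_{T})). \tag{17}$$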

We then obtain (14) by summing (17) for $T=0,1,\dotsc,k$ and telescoping. By noting that $f(x_{k})\geq f^{*}$ for all $k$ , (15) follows. ∎

The first proof of Theorem 2.1 is as follows.

Proof (First Proof of Theorem 2.1)

We again consider Lemma 1 with $\Delta_{k}\coloneqq f(x_{k})-f^{*}$ , which is always nonnegative from the optimality of $f^{*}$ . Monotonicity is clear from (13b), so we just need to show summability. By rearranging (14) and noting $f(x_{k+1})\geq f^{*}$ , we obtain

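$$2C_{2} \sum_{T=0}^{k} \Delta_{T} \leq \|x_{0} - \bar{x}\|^{2} - \|x_{k+1} - \bar{x}\|^{2} + \frac{2C_{1}}{\gamma} \Delta_{0} \leq \|x_{0} - \bar{x}\|^{2} + \frac{2C_{1}}{\gamma} \Delta_{0}.$$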

For the second proof of Theorem 2.1, we first outline the analysis from (Nes04a, Section 2.1.5) and then show how it can be modified to produce the desired $o(1/k)$ rate. Denote by $\bar{x}_{T}$ the projection of $x_{T}$ onto $\Omega$ (which is well defined because $\Omega$ is nonempty, closed, and convex). We can utilize the convexity of $f$ to obtain

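$$\Delta_{T} \leq \nabla f(x_{T})^{\top} (x_{T} - \bar{x}_{T}) \leq \|\nabla f(x_{T})\| \, \mbox{\rm dist}(x_{T},\Omega),$$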

so that

$$\|\nabla f(x_{T})\| \geq \frac{\Delta_{T}}{\mbox{\rm dist}(x_{T},\Omega)}. \tag{18}$$

By subtracting $f^{*}$ from both sides of (13b) and using $\alpha_{k}\geq C_{2}$ and (18), we obtain

$$\Delta_{T+1} \leq \Delta_{T} - \frac{C_{2}\gamma\Delta_{T}^{2}}{2\,\mbox{\rm dist}(x_{T},\Omega)^{2}}.$$

By dividing both sides of this expression by $\Delta_{T}\Delta_{T+1}$ and using $\Delta_{T+1}\leq\Delta_{T}$ , we obtain

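$$\frac{1}{\Delta_{T+1}} \geq \frac{1}{\Delta_{T}} + \frac{C_{2}\gamma\Delta_{T}}{2\,\mbox{\rm dist}(x_{T},\Omega)^{2}\Delta_{T+1}} \geq \frac{1}{\Delta_{T}} + \frac{C_{2}\gamma}{2\,\mbox{\rm dist}(x_{T},\Omega)^{2}}. \tag{19}$$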

By summing (19) over $T=0,1,\dotsc,k-1$ , we obtain

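$$\frac{1}{\Delta_{k}} \geq \frac{1}{\Delta_{0}} + \sum_{T=0}^{k-1} \frac{C_{2}\gamma}{2\,\mbox{\rm dist}(x_{T},\Omega)^{2}} \;\Rightarrow\; \Delta_{k} \leq \frac{1}{\sum_{T=0}^{k-1} \frac{C_{2}\gamma}{2\,\mbox{\rm dist}(x_{T},\Omega)^{2}}}. \tag{20}$$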

An $O(1/k)$ rate is obtained by noting from Lemma 2 that $\mbox{\rm dist}(x_{T},\Omega)\leq R_{0}$ for some $R_{0}>0$ and all $T$, so that

$$\sum_{T=0}^{k-1} \frac{1}{\mbox{\rm dist}(x_{T},\Omega)^{2}} \geq \frac{k}{R_{0}^{2}}. \tag{21}$$

Our alternative proof uses the fact that (21) is a loose bound for Euclidean spaces and that an improved result can be obtained by working directly with (20). We first use the Bolzano-Weierstrass theorem (a bounded and closed set is sequentially compact in a Euclidean space) together with Lemma 2, to show that the sequence $\{x_{k}\}$ approaches the solution set $\Omega$ .

Lemma 3

Assume the conditions in Lemma 2 and in addition that $f$ has domain in a Euclidean space $f:\Re^{n}\rightarrow\Re$ . We have

$$\lim_{k\to\infty} \mbox{\rm dist}(x_{k},\Omega) = 0. \tag{22}$$

Proof

The proof is similar to (PenZZ18a, Proposition 1). Assume for contradiction that (22) does not hold. Then there are $\epsilon>0$ and an infinite increasing sequence $\{k_{i}\}$, $i=1,2,\dotsc$, such that

$$\mbox{\rm dist}(x_{k_{i}},\Omega) \geq \epsilon, \quad i = 1, 2, \dots. \tag{23}$$

From Lemma 2 and the fact that $\{x_{k_{i}}\}\subset\Re^{n}$, we see that the sequence $\{x_{k_{i}}\}$ lies in a compact set and therefore has an accumulation point $x^{*}$. From (19), we have

$$\frac{1}{\Delta_{k_{i+1}}} \geq \frac{1}{\Delta_{k_{i}}} + \frac{C_{2}\gamma}{2\epsilon^{2}},$$

so that $1/\Delta_{k}\uparrow\infty$ and hence $\Delta_{k}\downarrow 0$ . By continuity of $f$ , it follows that $f(x^{*})=f^{*}$ , so that $x^{*}\in\Omega$ by definition, contradicting (23). ∎

We note that a result similar to Lemma 3 has been given in BurGIS95a using a more complicated argument with more restricted choices of $\alpha$ .

Proof (Second Proof of Theorem 2.1, for Euclidean Spaces)

We start with (20) and show that

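$$\lim_{k\to\infty} \frac{\dfrac{1}{\frac{C_{2}\gamma}{2} \sum_{T=0}^{k-1} \frac{1}{\mbox{\rm dist}(x_{T},\Omega)^{2}}}}{\dfrac{1}{k}} = 0, \tag{24}$$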

or, equivalently,

$$\lim_{k\to\infty} \frac{k}{\sum_{T=0}^{k-1} \frac{1}{\mbox{\rm dist}(x_{T},\Omega)^{2}}} = 0.$$

From the arithmetic-mean / harmonic-mean inequality (which says that for any real numbers $a_{1},\dotsc,a_{n}>0$, their harmonic mean does not exceed their arithmetic mean, namely $\frac{n}{\sum_{i=1}^{n}a_{i}^{-1}}\leq\frac{\sum_{i=1}^{n}a_{i}}{n}$), we have that

[TABLE]

Lemma 3 shows that $\mbox{\rm dist}(x_{T},\Omega)\to 0$, so by the Stolz-Cesàro theorem (see, for example, Mur09a), the right-hand side of (25) converges to $0$. Therefore, from the sandwich lemma, (24) holds. ∎

2.2 Coordinate Descent

We now extend Theorem 1.1 to the case of randomized coordinate descent. Our results can extend immediately to block-coordinate descent with fixed blocks. Our analysis for coordinate descent requires Euclidean spaces so that coordinate descent can go through all coordinates.

The standard short-step coordinate descent procedure requires knowledge of coordinate-wise Lipschitz constants. Denoting by $e_{i}$ the $i$ th unit vector, we denote by $L_{i}\geq 0$ the constants such that:

[TABLE]

where $\nabla_{i}f(\cdot)$ denotes the $i$ th coordinate of the gradient. Note that if $\nabla f(x)$ is $L$ -Lipschitz continuous, there always exist $L_{1},\dotsc,L_{n}\in[0,L]$ such that (26) holds. Without loss of generality, we assume $L_{i}>0$ for all $i$ . Given parameters $\{\bar{L}_{i}\}_{i=1}^{n}$ such that $\bar{L}_{i}\geq L_{i}$ for all $i$ , the coordinate descent update is

[TABLE]

where $i_{k}$ is the coordinate selected for updating at the $k$ th iteration. We consider the general case of stochastic coordinate descent in which each $i_{k}$ is independently identically distributed following a fixed prespecified probability distribution $p_{1},\dotsc,p_{n}$ satisfying

[TABLE]

for some constant $p_{\min}>0$ . Nesterov Nes12a proves that stochastic coordinate descent has a $O(1/k)$ convergence rate (in expectation of $f$ ) on convex problems. We show below that this rate can be improved to $o(1/k)$ .

Theorem 2.2

Consider (1) with $f$ convex and nonempty solution set $\Omega$ , and that componentwise-Lipschitz continuous differentiability (26) holds with some $L_{1},\dotsc,L_{n}>0$ . If we apply coordinate descent (27) and at each iteration, $i_{k}$ is independently picked at random following a probability distribution satisfying (28), then the expected objective $\mathbb{E}_{i_{0},i_{1},\dotsc,i_{k-1}}[f(x_{k})]$ converges to $f^{*}$ at an asymptotic rate of $o(1/k)$ .

Proof

From (26) and that $\bar{L}_{i}\geq L_{i}$ , by treating all other coordinates as non-variables, we have that for any $T\geq 0$ ,

[TABLE]

showing that the algorithm decreases $f$ at each iteration. Consider any $\bar{x}\in\Omega$ , by defining

[TABLE]

we have from (27) that

[TABLE]

By taking expectation over $i_{T}$ on both sides of the above expression, we obtain from the convexity of $f$ and (29) that

[TABLE]

By taking expectation over $i_{0},i_{1},\dotsc,i_{T-1}$ on (31) and summing (31) over $T=0,1,\dotsc,k$ , we obtain

[TABLE]

The result now follows from Lemma 1. ∎
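To make the setting of Theorem 2.2 concrete, here is a minimal sketch of randomized coordinate descent with non-uniform sampling. It assumes the usual short-step form $x_{k+1}=x_{k}-\bar{L}_{i_{k}}^{-1}\nabla_{i_{k}}f(x_{k})\,e_{i_{k}}$ for the update (27), which is not reproduced in the text above. The test function $f(x)=(x_{1}+x_{2})^{4}/4$, the sampling probabilities, and the helper name `coordinate_descent` are likewise choices made only for this illustration. The solution set here is the unbounded line $x_{1}+x_{2}=0$, and on the initial level set the coordinate-wise Lipschitz constants can be taken as $L_{1}=L_{2}=3$.

```python
import random


def coordinate_descent(x0, grad_i, L_bar, probs, num_iters, seed=0):
    """Randomized coordinate descent with i.i.d. sampling from `probs`.

    Each step updates one coordinate: x[i] -= grad_i(x, i) / L_bar[i].
    Returns the list of iterates (as tuples), including x0.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    x = list(x0)
    indices = range(len(x))
    history = [tuple(x)]
    for _ in range(num_iters):
        i = rng.choices(indices, weights=probs, k=1)[0]
        x[i] -= grad_i(x, i) / L_bar[i]
        history.append(tuple(x))
    return history


def f(x):
    """Illustrative objective with f* = 0 on the line x[0] + x[1] = 0."""
    return (x[0] + x[1]) ** 4 / 4.0


def grad_i(x, i):
    """Partial derivative of f; it is the same for both coordinates."""
    return (x[0] + x[1]) ** 3


history = coordinate_descent([0.5, 0.5], grad_i, L_bar=[3.0, 3.0],
                             probs=[0.8, 0.2], num_iters=5000)
values = [f(x) for x in history]
for k in (50, 500, 5000):
    print(f"k = {k:>5d}   f(x_k) = {values[k]:.3e}   k*f(x_k) = {k * values[k]:.3e}")
```

Because the solution set is unbounded in this example, it also gives a feel for why no bounded-level-set assumption is needed: the objective gap times $k$ still shrinks along the run.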

3 Regularized Problems

We turn now to regularized optimization in an inner-product space:

[TABLE]

where both terms are convex, $f$ is $L$ -Lipschitz-continuously differentiable, and $\psi$ is extended-valued, proper, and closed, but possibly nondifferentiable. We also assume that $\psi$ is such that the prox-operator can be applied easily, by solving the following problem for any given $y$ and any $\lambda>0$ :

[TABLE]

We assume further that the solution set $\Omega$ of (32) is nonempty, and denote by $F^{*}$ the value of $F$ for all $x\in\Omega$ . We discuss two algorithms to show how our techniques can be extended to regularized problems. They are proximal gradient (both with and without line search) and stochastic proximal coordinate descent with arbitrary sampling.

3.1 Short-Step Proximal Gradient

Given $\bar{L}\geq L$ , the $k$ th step of the proximal gradient algorithm is defined as follows:

[TABLE]

Note that $d_{k}$ is uniquely defined here, since the subproblem is strongly convex. It is shown in BecT09a and Nes13a that $F(x_{k})$ converges to $F^{*}$ at a rate of $O(1/k)$ for this algorithm, under our assumptions. We prove that an $o(1/k)$ rate can be attained.

Theorem 3.1

Consider (32) with $f$ convex and $L$ -Lipschitz continuously differentiable, $\psi$ convex, and nonempty solution set $\Omega$ . Given any $\bar{L}\geq L$ , the proximal gradient method (33) generates iterates whose objective value converges to $F^{*}$ at a $o(1/k)$ rate.

Proof

The method (33) can be shown to be a descent method from the Lipschitz continuity of $\nabla f$ and the fact that $\bar{L}\geq L$ . From the optimality of the solution to (33) and that $x_{k+1}=x_{k}+d_{k}$ ,

[TABLE]

where $\partial\psi$ denotes the subdifferential of $\psi$ . Consider any $\bar{x}\in\Omega$ . We have from (33) that for any $T\geq 0$ , the following chain of relationships holds:

[TABLE]

where in the last inequality, we have used

[TABLE]

By rearranging (35) we obtain

[TABLE]

The result follows by summing both sides of this expression over $T=0,1,\dotsc,k-1$ and applying Lemma 1. ∎
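A minimal sketch of short-step proximal gradient may help fix ideas. It assumes the standard form of the step, $x_{k+1}=\mathrm{prox}_{\psi/\bar{L}}\bigl(x_{k}-\nabla f(x_{k})/\bar{L}\bigr)$, which is equivalent to the usual subproblem definition but is stated here as an assumption because (33) is not reproduced above. The problem instance ($f(x)=\tfrac{1}{2}(x_{1}+x_{2}-1)^{2}$, $\psi(x)=\lambda\|x\|_{1}$ with $\lambda=0.1$, so $L=2$) and the helper names are illustrative choices; for this instance the optimal value is $F^{*}=0.095$, attained whenever $x\geq 0$ and $x_{1}+x_{2}=0.9$.

```python
def soft_threshold(v: float, tau: float) -> float:
    """Prox of tau * |.| at v (the soft-thresholding operator)."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


def proximal_gradient(x0, grad_f, L_bar, lam, num_iters):
    """Proximal gradient for f(x) + lam * ||x||_1 with fixed parameter L_bar >= L."""
    x = list(x0)
    iterates = [tuple(x)]
    for _ in range(num_iters):
        g = grad_f(x)
        # Gradient step on f, then the prox of (lam / L_bar) * ||.||_1, coordinate-wise.
        x = [soft_threshold(xi - gi / L_bar, lam / L_bar) for xi, gi in zip(x, g)]
        iterates.append(tuple(x))
    return iterates


LAM = 0.1


def grad_f(x):
    """Gradient of f(x) = 0.5 * (x[0] + x[1] - 1)**2."""
    r = x[0] + x[1] - 1.0
    return [r, r]


def F(x):
    """Regularized objective F = f + LAM * ||x||_1."""
    return 0.5 * (x[0] + x[1] - 1.0) ** 2 + LAM * (abs(x[0]) + abs(x[1]))


iterates = proximal_gradient([2.0, -1.0], grad_f, L_bar=2.0, lam=LAM, num_iters=200)
for k in (0, 10, 20, 50, 200):
    print(f"k = {k:>3d}   F(x_k) = {F(iterates[k]):.6f}")
```

The printed values are nonincreasing, matching the descent property used at the start of the proof.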

3.2 Proximal Gradient with Line Search

We discuss a line-search variant of proximal gradient, where the update is defined as follows:

[TABLE]

where $\alpha_{k}$ is chosen such that for given $\gamma\in(0,1]$ and $C_{1}\geq C_{2}>0$ defined as in (13a), we have

[TABLE]

This framework is a generalization of that in Section 2.1, and includes the SpaRSA algorithm of WriNF09a , which obtains an initial choice of $\alpha_{k}$ from a Barzilai-Borwein approach and adjusts it until (38) holds. The approach of the previous subsection can also be seen as a special case of (37)-(38) through the following elementary result, whose proof is omitted.

Lemma 4

Consider a convex function $\psi$ , a positive scalar $a>0$ and two vectors $b$ and $x$ . If $d$ is the unique solution of the strictly convex problem

[TABLE]

then

[TABLE]

By setting $b=\nabla f(x)$ , $1/\alpha_{k}\equiv a=\bar{L}>0$ (where $\bar{L}\geq L$ ), this lemma together with (36) implies that (38) holds for any $\gamma\in(0,1]$ . Moreover, it also implies that for any $k\geq 0$ ,

[TABLE]

Therefore, for any $\gamma\in(0,1]$ , (38) holds whenever

[TABLE]

or equivalently

[TABLE]

which is how the upper bound for $C_{2}$ is set.

We show now that this approach also has a $o(1/k)$ convergence rate on convex problems.

Theorem 3.2

Consider (32) with $f$ convex and $L$ -Lipschitz continuously differentiable, $\psi$ convex, and nonempty solution set $\Omega$ . Given some $\gamma\in(0,1]$ and $C_{2}$ and $C_{1}$ such that $C_{1}\geq C_{2}$ and $C_{2}\in(0,(2-\gamma)/L]$ , then the algorithm (37) with $\alpha_{k}$ satisfying (38) generates iterates $\{x_{k}\}$ whose objective values converge to $F^{*}$ at a rate of $o(1/k)$ . Moreover, the sequence of iterates is bounded.

Proof

From the optimality conditions of (37), we have

[TABLE]

Now consider any $\bar{x}\in\Omega$ . We have from (37) that for any $T\geq 0$ , the following chain of relationships holds:

[TABLE]

By rearranging this inequality, we obtain

[TABLE]

and by summing both sides and using telescoping sums, we find that $\sum_{T=0}^{\infty}(F(x_{T+1})-F^{*})<\infty$ , thus the conditions of Lemma 1 are satisfied by $\Delta_{T}:=F(x_{T})-F^{*}$ , and the $o(1/k)$ rate follows.

By summing the inequality above finitely over $T=0,1,\dotsc,k-1$ , we obtain

[TABLE]

By rearranging this inequality, we obtain a uniform upper bound on $\|x_{k}-\bar{x}\|$ , thus showing that the sequence $\{x_{k}\}$ is bounded. ∎

3.3 Proximal Coordinate Descent

We now discuss the extension of coordinate descent to (32), with the assumption (26) on $f$ , Euclidean domain of dimension $n$ , sampling weighted according to (28) as in Section 2.2, and the additional assumption of separability of the regularizer $\psi$ , that is,

[TABLE]

where each $\psi_{i}$ is convex, extended valued, and possibly nondifferentiable. As in our discussion of Section 2.2, the results in this subsection can be extended directly to the case of block-coordinate descent.

Given the component-wise Lipschitz constants $L_{1},L_{2},\dotsc,L_{n}$ and algorithmic parameters $\bar{L}_{1},\bar{L}_{2},\dotsc,\bar{L}_{n}$ with $\bar{L}_{i}\geq L_{i}$ for all $i$ , proximal coordinate descent updates have the form

[TABLE]

With $p_{i}\equiv 1/n$ for all $i$ , LuX15a showed that the expected objective value converges to $F^{*}$ at a $O(1/k)$ rate. When arbitrary sampling (28) is considered, (43) is a special case of the general algorithmic framework described in LeeW18b . The latter paper shows the same $O(1/k)$ rate for convex problems under the additional assumption that for any $x_{0}$ , we have

[TABLE]

We show here that with arbitrary sampling according to (28), (43) produces $o(1/k)$ convergence rates for the expected objective on convex problems, without the assumption (44).

The following result makes use of the quantity $r_{k}$ defined in (30).

Theorem 3.3

Consider (32) with $f$ and $\psi$ convex and nonempty solution set $\Omega$ . Assume further that (42) is true, and that (26) holds with some $L_{1},L_{2},\dotsc,L_{n}>0$ . Given $\{\bar{L}_{i}\}_{i=1}^{n}$ with $\bar{L}_{i}\geq L_{i}$ for all $i$ , suppose that proximal coordinate descent defines iterates according to (43), with $i_{k}$ chosen i.i.d. according to a probability distribution satisfying (28). Then $\mathbb{E}_{i_{0},i_{1},\dotsc,i_{k-1}}[F(x_{k})]$ converges to $F^{*}$ at an asymptotic rate of $o(1/k)$ . Moreover, given any $\bar{x}\in\Omega$ , the sequence of $\mathbb{E}_{i_{0},\dotsc,i_{k-1}}r_{k}^{2}$ is bounded.

Proof

From (26), we first notice that in the update (43),

[TABLE]

From Lemma 4, the method defined by (43) is a descent method. Optimality of the subproblem in (43) yields

[TABLE]

By taking any $\bar{x}\in\Omega$ , and using the definition (30), we have:

[TABLE]

By taking expectation over $i_{T}$ on both sides of (47) and using the convexity of $f$ together with (45), we obtain

[TABLE]

where in (48a) we used the fact that (43) is a descent method. By taking expectation over $i_{0},\dotsc,i_{k}$ on (48b), summing over $T=0,\dotsc,k$ , and applying Lemma 1, we obtain the result.

Boundedness of $\mathbb{E}_{i_{0},\dotsc,i_{k-1}}[r_{k}^{2}]$ follows from the same telescoping sum and the fact that $F(x_{k})$ decreases monotonically with $k$ . ∎

Our result shows that, similar to gradient descent and proximal gradient, proximal coordinate descent and coordinate descent also provide a form of implicit regularization in that the expected value of $r_{k}$ is bounded. Since $r_{k}$ can be viewed as a weighted Euclidean norm, this observation implies that the iterates are also in a sense expected to lie within a bounded region.

Our analysis here improves the rates of Lu and Xiao [8] and Lee and Wright [7] in terms of the dependency on $k$, and removes assumption (13a) of Lee and Wright [7]. Even aside from the improvement from $O(1/k)$ to $o(1/k)$, Theorem 3.3 is the first convergence rate for proximal stochastic coordinate descent with arbitrary sampling of the coordinates that is proven without additional assumptions such as (44). By manipulating (48b), one can also observe how different probability distributions affect the upper bound, and it might be possible to get better upper bounds by using norms different from (30).
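As a concrete illustration of the setting of Theorem 3.3, the following sketch runs proximal coordinate descent with non-uniform i.i.d. sampling on a small lasso problem. It is a minimal sketch under stated assumptions, not the update (43) verbatim: we assume $f(x)=\tfrac{1}{2}\|Ax-b\|^{2}$, $\psi(x)=\lambda\|x\|_{1}$, $\bar{L}_{i}=L_{i}=\|A_{:,i}\|^{2}$, and a sampling distribution in which every coordinate has positive probability. The data, the function names, and the probabilities are hypothetical.

```python
import random


def soft_threshold(v, tau):
    """Proximal operator of tau * |.| evaluated at v."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


def objective(A, b, x, lam):
    """F(x) = 0.5 * ||Ax - b||^2 + lam * ||x||_1."""
    res = [sum(a * xj for a, xj in zip(row, x)) - bi for row, bi in zip(A, b)]
    return 0.5 * sum(r * r for r in res) + lam * sum(abs(v) for v in x)


def prox_coordinate_descent(A, b, lam, probs, iters, seed=0):
    """Randomized proximal coordinate descent; returns (x, objective history)."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    # Assumed coordinate-wise Lipschitz constants: L_bar_i = ||A[:, i]||^2.
    L_bar = [sum(A[r][i] ** 2 for r in range(m)) for i in range(n)]
    x = [0.0] * n
    res = [-bi for bi in b]  # residual Ax - b, updated incrementally
    history = [objective(A, b, x, lam)]
    for _ in range(iters):
        i = rng.choices(range(n), weights=probs)[0]  # i.i.d. coordinate sampling
        grad_i = sum(A[r][i] * res[r] for r in range(m))
        new_xi = soft_threshold(x[i] - grad_i / L_bar[i], lam / L_bar[i])
        delta = new_xi - x[i]
        for r in range(m):
            res[r] += A[r][i] * delta
        x[i] = new_xi
        history.append(objective(A, b, x, lam))
    return x, history


# Hypothetical data and sampling probabilities, chosen only for illustration.
A = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [2.0, 1.0, 1.0]]
b = [1.0, 0.0, 2.0, 1.0]
x, history = prox_coordinate_descent(A, b, lam=0.1, probs=[0.6, 0.3, 0.1], iters=2000)
print(history[0], history[-1])
```

Changing `probs` while keeping every entry positive alters how quickly the objective decreases, but the recorded objective values remain non-increasing, which reflects the descent property used in the proof.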

4 Tightness of the $o(1/k)$ Estimate

We demonstrate that the $o(1/k)$ estimate of convergence of $\{f(x_{k})\}$ is tight by showing that for any $\epsilon\in(0,1]$ , there is a convex smooth function for which the sequence of function values generated by gradient descent with a fixed step size converges slower than $O(1/k^{1+\epsilon})$ . The example problem we provide is a simple one-dimensional function, so it serves also as a special case of stochastic coordinate descent and the proximal methods (where $\psi\equiv 0$ ) as well. Thus, this example shows tightness of our analysis for all methods without line search considered in this paper.

Consider the one-dimensional real convex function

$$f(x)=x^{p},\tag{49}$$

where $p$ is an even integer greater than $2$ . The minimizer of this function is clearly at $x^{*}=0$ , for which $f(0)=f^{*}=0$ . Suppose that the gradient descent method is applied starting from $x_{0}=1$ . For any descent method, the iterates $x_{k}$ are confined to $[-1,1]$ and we have

$$\nabla^{2}f(x)=p(p-1)x^{p-2}\leq p(p-1)\quad\text{for all }x\in[-1,1],$$

so we set $L=p(p-1)$ . Suppose that $\bar{\alpha}\in(0,2/L)$ as above. Then the iteration formula is

$$x_{k+1}=x_{k}-\bar{\alpha}\,p\,x_{k}^{p-1},\tag{50}$$

and by Lemma 2, all iterates lie in a bounded set: the level set $[-1,1]$ defined by $x_{0}$ . In fact, since $p\geq 4$ and $\bar{\alpha}\in(0,2/L)$ , we have that

$$0<\bar{\alpha}\,p\,x_{k}^{p-2}<\frac{2}{p-1}\leq\frac{2}{3},$$

so that $x_{k+1}\in\left(\tfrac{1}{3}x_{k},x_{k}\right)$ and the value of $L$ remains valid for all iterates.

We show by an informal argument that there exists a constant $C$ such that

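$$x_{k}\approx\frac{C}{k^{1/(p-2)}},\quad\text{so that}\quad f(x_{k})\approx\frac{C^{p}}{k^{p/(p-2)}}.\tag{51}$$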

From (50) we have

$$\frac{x_{k+1}}{x_{k}}=1-\bar{\alpha}\,p\,x_{k}^{p-2}.\tag{52}$$

By substituting the hypothesis (51) into (52), and taking $k$ to be large, we obtain the following sequence of equivalent approximate equalities:

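$$\begin{aligned}
\frac{C}{(k+1)^{1/(p-2)}}&\approx\frac{C}{k^{1/(p-2)}}\left(1-\frac{\bar{\alpha}\,p\,C^{p-2}}{k}\right)\\
\Leftrightarrow\quad\left(\frac{k}{k+1}\right)^{1/(p-2)}&\approx 1-\frac{\bar{\alpha}\,p\,C^{p-2}}{k}\\
\Leftrightarrow\quad 1-\frac{1}{(p-2)k}&\approx 1-\frac{\bar{\alpha}\,p\,C^{p-2}}{k}.
\end{aligned}$$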

This last expression is approximately satisfied for large $k$ if $C$ satisfies the expression

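$$\bar{\alpha}\,p\,C^{p-2}=\frac{1}{p-2},\quad\text{that is,}\quad C=\left[\frac{1}{\bar{\alpha}\,p\,(p-2)}\right]^{1/(p-2)}.$$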

Stated another way, our result (51) indicates that a convergence rate faster than $O(1/k^{1+\epsilon})$ is not possible when steepest descent with fixed steplength is applied to the function $f(x)=x^{p}$ provided that

$$\frac{p}{p-2}\leq 1+\epsilon,$$

that is,

$$p\geq 2+\frac{2}{\epsilon}.$$
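The informal argument can be checked numerically. The sketch below runs fixed-step gradient descent on $f(x)=x^{p}$ from $x_{0}=1$ and compares $k^{1/(p-2)}x_{k}$ with the constant $\left(\bar{\alpha}\,p\,(p-2)\right)^{-1/(p-2)}$, which is the value obtained by matching the leading $1/k$ terms in the argument above. The function names and the choices $p=4$, $\bar{\alpha}=1/L$, and $k=10^{5}$ are illustrative assumptions.

```python
def gradient_descent_xp(p, alpha, iters, x0=1.0):
    """Fixed-step gradient descent on f(x) = x**p; returns the final iterate."""
    x = x0
    for _ in range(iters):
        x -= alpha * p * x ** (p - 1)  # x_{k+1} = x_k - alpha * f'(x_k)
    return x


def predicted_constant(p, alpha):
    """Constant C in the ansatz x_k ~ C / k**(1/(p-2))."""
    return (alpha * p * (p - 2)) ** (-1.0 / (p - 2))


p = 4                 # even integer greater than 2 (illustrative choice)
L = p * (p - 1)       # bound on f'' over [-1, 1]
alpha = 1.0 / L       # a fixed step size inside (0, 2/L)
k = 100_000

x_k = gradient_descent_xp(p, alpha, k)
C = predicted_constant(p, alpha)

# If the ansatz holds, both scaled quantities approach constants.
scaled_iterate = k ** (1.0 / (p - 2)) * x_k
scaled_objective = k ** (p / (p - 2)) * x_k ** p
print(scaled_iterate, C)
print(scaled_objective, C ** p)
```

With $p=4$ the objective decays like $1/k^{2}$, so multiplying $f(x_{k})$ by $k^{1+\epsilon}$ with any $\epsilon>1$ would give a sequence that grows without bound, in line with the condition $p\geq 2+2/\epsilon$.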

We follow Attouch et al. [1] to provide a continuous-time analysis of the same objective function, using a gradient flow argument. For the function $f$ defined by (49), consider the following differential equation:

$$\dot{x}(t)=-\alpha\nabla f(x(t))=-\alpha\,p\,x(t)^{p-1}.\tag{53}$$

Suppose that

$$x(t)=t^{-\theta}\tag{54}$$

for some $\theta>0$ , which implies that, starting from any $t>0$ , $x(t)$ remains in a bounded region. Substituting (54) into (53), we obtain

$$-\theta\,t^{-\theta-1}=-\alpha\,p\,t^{-\theta(p-1)},$$

which holds true if and only if the following equations are satisfied:

$$\theta=\alpha p,\qquad\theta+1=\theta(p-1),$$

from which we obtain

$$\theta=\frac{1}{p-2},\qquad\alpha=\frac{1}{p(p-2)}.$$

Since $x$ decreases monotonically to zero, for all $t\geq(p-1)/(p-2)$ ,

$$L=p(p-1)\,x\!\left(\tfrac{p-1}{p-2}\right)^{p-2}=p(p-2)$$

is an appropriate value for a bound on $\|\nabla^{2}f(x)\|$ . These values of $\alpha$ and $L$ satisfy $0<\alpha\leq\frac{1}{L}$ , making $\alpha$ a valid step size. The objective value is $f(x(t))=t^{-p/(p-2)}$ , matching the rate of (51).
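The closed-form solution can be sanity-checked by integrating the gradient flow numerically. The sketch below assumes the flow has the form $\dot{x}=-\alpha\,p\,x^{p-1}$ with $\theta=1/(p-2)$ and $\alpha=1/(p(p-2))$, the values obtained by matching exponents and coefficients after substituting $x(t)=t^{-\theta}$. The classical Runge-Kutta integrator, the choice $p=4$, and the time horizon are illustrative assumptions.

```python
def rk4_gradient_flow(p, alpha, t0, x0, t_end, steps):
    """Integrate dx/dt = -alpha * p * x**(p-1) from t0 to t_end with classical RK4."""
    def rhs(x):
        return -alpha * p * x ** (p - 1)

    h = (t_end - t0) / steps
    x = x0
    for _ in range(steps):
        k1 = rhs(x)
        k2 = rhs(x + 0.5 * h * k1)
        k3 = rhs(x + 0.5 * h * k2)
        k4 = rhs(x + h * k3)
        x += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return x


p = 4                              # illustrative even integer greater than 2
theta = 1.0 / (p - 2)
alpha = 1.0 / (p * (p - 2))
t0 = (p - 1) / (p - 2)             # starting time used for the bound on f''
t_end = 50.0

# Start on the closed-form trajectory and integrate forward in time.
x_numeric = rk4_gradient_flow(p, alpha, t0, t0 ** (-theta), t_end, steps=5000)
x_closed_form = t_end ** (-theta)

# Second derivative of f at x(t0); it bounds f'' for all later times.
curvature_bound = p * (p - 1) * (t0 ** (-theta)) ** (p - 2)
print(x_numeric, x_closed_form)
print(curvature_bound, 1.0 / alpha)
```

For these values the curvature bound equals $1/\alpha$, so $\alpha$ sits exactly at the upper end of the range $0<\alpha\leq 1/L$.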

Acknowledgment

The authors thank Yixin Tao for a discussion that helped us to improve the clarity of this work.

Bibliography

[1] Attouch, H., Chbani, Z., Peypouquet, J., Redont, P.: Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity. Mathematical Programming 168(1-2), 123–175 (2018)
[2] Attouch, H., Peypouquet, J.: The rate of convergence of Nesterov's accelerated forward-backward method is actually faster than $1/k^{2}$. SIAM Journal on Optimization 26(3), 1824–1834 (2016)
[3] Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1), 183–202 (2009)
[4] Bertsekas, D.P.: Nonlinear Programming, 3rd edn. Athena Scientific, Belmont (2016)
[5] Burachik, R., Graña Drummond, L., Iusem, A.N., Svaiter, B.: Full convergence of the steepest descent method with inexact line searches. Optimization 32(2), 137–146 (1995)
[6] Deng, W., Lai, M.J., Peng, Z., Yin, W.: Parallel multi-block ADMM with $o(1/k)$ convergence. Journal of Scientific Computing 71(2), 712–736 (2017)
[7] Lee, C.p., Wright, S.J.: Inexact variable metric stochastic block-coordinate descent for regularized optimization. Tech. rep. (2018). URL http://www.optimization-online.org/DB_HTML/2018/08/6753.html
[8] Lu, Z., Xiao, L.: On the complexity analysis of randomized block-coordinate descent methods. Mathematical Programming 152(1-2), 615–642 (2015)
