Approximate and Stochastic Greedy Optimization

Nan Ye; Peter Bartlett

arXiv:1705.09396·math.OC·November 19, 2018


TL;DR

This paper analyzes approximate and stochastic greedy algorithms for convex optimization, establishing convergence conditions, rates, and equivalences, and demonstrating their effectiveness on smooth and nonsmooth functions.

Contribution

It provides a unified convergence analysis for approximate greedy algorithms, introduces stochastic variants with proven convergence, and compares their performance on different convex functions.

Findings

01

Approximate greedy algorithms converge under certain conditions.

02

Replacing the full gradient with a single stochastic gradient can fail even on smooth convex functions, while suitably batched stochastic variants converge.

03

New stochastic FW algorithm converges for nonsmooth convex functions.


Equations

$$\min_{w\in W}f(w),$$

$$(\eta_{k},d_{k})=\operatorname*{arg\,min}_{\eta\in[0,1],\,d\in S}f\left((1-\eta)w_{k}+\eta d\right).$$

$$d_{k}=\operatorname*{arg\,min}_{d\in S}\nabla f(w_{k})^{\top}d,$$

$$f\left((1-\eta_{k})w_{k}+\eta_{k}d_{k}\right)\leq\min_{d\in S}f\left((1-\eta_{k})w_{k}+\eta_{k}d\right)+\epsilon_{k}\eta_{k}.$$

$$\nabla f(w_{k})^{\top}d_{k}\leq\min_{d\in S}\nabla f(w_{k})^{\top}d+\epsilon_{k},$$

$$f(w^{\prime})\geq f(w)+\nabla f(w)^{\top}(w^{\prime}-w),\quad\text{for all }w^{\prime},w\in W,$$

$$\sup_{w\in W,\,d\in S,\,\eta\in(0,1)}\frac{2}{\eta^{2}}D_{f}\left((1-\eta)w+\eta d,\,w\right)<\infty,$$

$$f(w)-f(w^{*})\leq\max_{d\in S}\nabla f(w)^{\top}(w-d).$$

$$f\left((1-\eta)w+\eta d\right)$$

$$f(w)-f(w^{*})\leq\nabla f(w)^{\top}(w-w^{*})\leq\max_{d\in W}\nabla f(w)^{\top}(w-d)\leq\max_{d\in S}\nabla f(w)^{\top}(w-d).$$

$$f\left((1-\eta)w+\eta d\right)=f(w)+\eta\nabla f(w)^{\top}(d-w)+D_{f}\left((1-\eta)w+\eta d,\,w\right).$$

$$e_{k+1}\leq(1-\eta_{k})e_{k}+\epsilon^{\prime}_{k},$$

$$e_{k+1}\leq e_{k}+\eta_{k}(\epsilon^{\prime}_{k}/\eta_{k}-e_{k})\leq e_{k}-\eta_{k}\delta/2.$$

$$f(w_{k})-f(w^{*})\leq\frac{2M+4c}{k+2}.$$

$$f(w_{k+1})\leq\min_{\eta\in[0,1],\,d\in S}f\left((1-\eta)w_{k}+\eta d\right)+\frac{4c}{(k+2)^{2}},$$

$$e_{k+1}=(1-\eta_{k})e_{k}+C\eta_{k}^{2},$$

$$e_{k+1}\geq e_{k}\left(1-\frac{e_{k}}{4C}\right)\geq\frac{a}{k+2}\left(1-\frac{a/4C}{k+2}\right)\geq\frac{a(k+2-a/4C)}{(k+2)^{2}}\geq\frac{a}{k+3},$$

$$(k+2-a/4C)(k+3)\geq(k+2-1/4)(k+3)\geq k^{2}+\frac{19}{4}k+\frac{21}{4}\geq(k+2)^{2}.$$

$$\tilde{f}_{k}(w)=\frac{1}{b_{k}}\sum_{i\in I_{k}}f_{i}(w).$$

$$\tilde{f}_{k}\left((1-\eta_{k})w_{k}+\eta_{k}d_{k}\right)\leq\min_{d\in S}\tilde{f}_{k}\left((1-\eta_{k})w_{k}+\eta_{k}d\right)+\eta_{k}\epsilon_{k}.$$

$$\mathbb{E}f(\bar{w}_{t})-f(w^{*})\leq\frac{f(w_{1})-f(w^{*})+DL+M+c}{\sqrt{t}},$$

$$\mathbb{E}f(\bar{w}_{t})-f(w^{*})\leq\frac{f(w_{1})-f(w^{*})+(DL+M+c)(\ln t+1)}{\sqrt{t}}.$$

$$\nabla\tilde{f}_{k}(w_{k})^{\top}d_{k}\leq\min_{d\in S}\nabla\tilde{f}_{k}(w_{k})^{\top}d+\epsilon_{k}.$$

$$\langle\mathbb{E}(d_{k}),\nabla f(w_{k})\rangle\leq\min_{d\in S}\langle d,\nabla f(w_{k})\rangle+c\eta_{k},$$

$$\mathbb{E}f(\bar{w}_{t})-f(w^{*})\leq\left(3L\sqrt{\frac{2K}{\rho}}+\frac{2cL^{2}}{\rho}\right)\frac{\sum_{k=1}^{t}\sigma_{k}\sqrt{\eta_{k}}}{\sum_{k=1}^{t}\sigma_{k}}+\frac{R^{2}}{\sum_{k=1}^{t}\sigma_{k}}.$$

$$\mathbb{E}f(\bar{w}_{t})-f(w^{*})\leq\left(3L\sqrt{\frac{2K}{\rho}}+\frac{2cL^{2}}{\rho}\right)\frac{\ln t+1}{t^{1/4}}+\frac{R^{2}}{ct^{1/4}}.$$

$$\mathbb{E}f(\bar{w}_{t})-f(w^{*})\leq\left(4L\sqrt{\frac{2K}{\rho}}+\frac{8cL^{2}}{3\rho}+\frac{R^{2}}{c}\right)\frac{1}{t^{1/4}}.$$

$$\sum_{k=1}^{t}\sigma_{k}g_{k}^{\top}(w_{k}^{*}-w)\leq\sum_{k=1}^{t}\frac{2\sigma_{k}^{2}\lVert g_{k}\rVert_{2}^{2}}{\rho}+R^{2}.$$

$$e_{k+1}$$

$$\sum_{k=1}^{t}\sigma_{k}\left(\tilde{h}_{k}(w_{k}^{*})-\tilde{h}_{k}(w^{*})\right)\leq\sum_{k=1}^{t}\sigma_{k}g_{k}^{\top}(w_{k}^{*}-w^{*})\leq\sum_{k=1}^{t}\frac{2\sigma_{k}^{2}\lVert g_{k}\rVert_{2}^{2}}{\rho}+R^{2}.$$


Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Advanced Bandit Algorithms Research

Full text

Approximate and Stochastic Greedy Optimization

Nan Ye (QUT & ACEMS)

Peter Bartlett (UC Berkeley & QUT & ACEMS)

Abstract

We consider two greedy algorithms for minimizing a convex function in a bounded convex set: an algorithm by Jones (1992) and the Frank-Wolfe (FW) algorithm. We first consider approximate versions of these algorithms. For smooth convex functions, we give sufficient conditions for convergence, a unified analysis for the well-known convergence rate of $O(1/k)$ together with a result showing that this rate is the best obtainable from the proof technique, and an equivalence result for the two algorithms. We also consider approximate stochastic greedy algorithms for minimizing expectations. We show that replacing the full gradient by a single stochastic gradient can fail even on smooth convex functions. We give a convergent approximate stochastic Jones algorithm and a convergent approximate stochastic FW algorithm for smooth convex functions. In addition, we give a convergent approximate stochastic FW algorithm for nonsmooth convex functions. Convergence rates for these algorithms are given and proved.

1 Introduction

Consider the following problem of minimizing a convex function over a convex set,

$$\min_{w\in W}f(w),$$

where $W$ is the convex hull of a set of atoms $S$ in a linear vector space. Such problems occur frequently in machine learning and engineering (Boyd and Vandenberghe, 2004). We consider greedy algorithms which start with some $w_{1}\in W$ , and then iteratively find $w_{k+1}=(1-\eta_{k})w_{k}+\eta_{k}d_{k}$ , where $\eta_{k}$ and/or $d_{k}\in S$ are greedily chosen according to a certain criterion. An attractive feature of such algorithms is that the iterates are sparse, because each iteration adds at most one new atom in $S$ .

Two greedy algorithms are well-known: an algorithm originally studied by Jones (1992), and the Frank-Wolfe (FW) algorithm (Frank and Wolfe, 1956). Jones’ algorithm chooses

$$(\eta_{k},d_{k})=\operatorname*{arg\,min}_{\eta\in[0,1],\,d\in S}f\left((1-\eta)w_{k}+\eta d\right).$$

This has been studied in various contexts, such as function approximation in the Hilbert space (Jones, 1992; Barron, 1993; Lee et al., 1996), $\ell_{p}$ regression (Donahue et al., 1997), density estimation (Li and Barron, 1999), and is closely related to boosting (Zhang, 2003). The FW algorithm chooses

$$d_{k}=\operatorname*{arg\,min}_{d\in S}\nabla f(w_{k})^{\top}d,$$

and chooses $\eta_{k}$ by line search or a priori. The FW algorithm has recently attracted significant interest due to its projection-free property and the ability to handle structural constraints (Jaggi, 2013). In contrast to solving quadratic programs for projection in projected gradient descent and for the proximal map in the proximal algorithms, the FW algorithm solves a linear program at each step, which is often computationally more tractable (Jaggi et al., 2010; Lacoste-Julien and Jaggi, 2013). Approximate versions of Jones’ algorithm and the FW algorithm have also been studied, for example, see (Zhang, 2003; Jaggi, 2013).
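To make the FW update concrete, here is a minimal sketch under stated assumptions: $W$ is the probability simplex, so the atoms in $S$ are the unit vectors and the linear minimization reduces to picking the coordinate with the smallest gradient entry. The step size $\eta_{k}=2/(k+2)$ is fixed a priori. The function name `frank_wolfe_simplex` and the quadratic test objective are illustrative choices, not taken from the paper.

```python
def frank_wolfe_simplex(grad, w0, steps):
    """Exact FW over the probability simplex; atoms are the unit vectors."""
    w = list(w0)
    for k in range(steps):
        g = grad(w)
        # Linear minimization over S: the best atom is the unit vector
        # at the coordinate where the gradient is smallest.
        j = min(range(len(w)), key=lambda i: g[i])
        eta = 2.0 / (k + 2)                 # a priori step size
        w = [(1 - eta) * wi for wi in w]    # w_{k+1} = (1 - eta) w_k + eta d_k
        w[j] += eta
    return w

# Hypothetical test problem: f(w) = 0.5 * ||w - target||^2, target in the simplex.
target = [0.2, 0.5, 0.3]
grad = lambda w: [wi - ti for wi, ti in zip(w, target)]
f = lambda w: 0.5 * sum((wi - ti) ** 2 for wi, ti in zip(w, target))

w = frank_wolfe_simplex(grad, [1.0, 0.0, 0.0], 200)
print(f(w), w)
```

Each iteration touches a single atom, which is the sparsity property mentioned in the introduction: after $k$ steps the iterate is a convex combination of at most $k+1$ atoms.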

In this paper, we first consider approximate versions of Jones’ algorithm and the FW algorithm, with a more general approximate version for Jones’ algorithm. We focus on smooth convex functions in our analysis, and give a sufficient convergence condition for both algorithms. Building on previous results on the $O(1/k)$ convergence rates for both algorithms, we present a unified analysis for the $O(1/k)$ convergence rate, and also show that this rate is the best that can be obtained with the proof technique. We also show that the approximate Jones’ algorithm and the approximate FW algorithm are equivalent.

We then consider stochastic versions of these approximate greedy algorithms for the stochastic approximation problem, where $f$ is an expectation ${\mathbb{E}}f_{z}(w)$ over some random variable $z$ . We show that some stochastic versions fail even on smooth convex functions. We give an approximate stochastic Jones algorithm that has error $\epsilon$ using $O(\epsilon^{-4})$ random $f_{z}(w)$ for smooth convex functions. We also give an approximate stochastic FW algorithm that has an error $\epsilon$ using $O(\epsilon^{-4})$ stochastic gradients and $O(\epsilon^{-2})$ linear optimizations. In addition, we give an approximate stochastic Frank-Wolfe algorithm that has error $\epsilon$ using $O(\epsilon^{-4})$ stochastic gradients for nonsmooth convex functions. The algorithms also apply to the finite-sum setting where $f(w)=\frac{1}{n}\sum_{i=1}^{n}f_{i}(w)$ . The finite-sum form occurs when performing empirical risk minimization in machine learning, or when performing M-estimation in statistics. In both cases, each $f_{i}$ measures how well a model fits an example.

Stochastic algorithms originated in the 1950s (Robbins and Monro, 1951), and have attracted much interest in recent years, mainly due to their ability to scale up to large datasets. We note that stochastic FW algorithms have recently been considered for smooth functions by Reddi et al. (2016) and Hazan and Luo (2016). Reddi et al. (2016) considered the non-convex setting, and showed that one can achieve an error of $\epsilon$ with $O(\epsilon^{-4})$ stochastic gradients and $O(\epsilon^{-2})$ linear optimizations. When $f$ is a finite sum, the number of stochastic gradients needed can be reduced to $O(n+n^{1/3}\epsilon^{-2})$ . Hazan and Luo (2016) considered the convex setting, and showed that one can achieve an error of $\epsilon$ with $O(\ln\epsilon^{-1})$ full gradients, $O(\epsilon^{-2})$ stochastic gradients, and $O(\epsilon^{-1})$ linear optimizations. The number of stochastic gradients can be reduced to $O(\ln\epsilon^{-1})$ if $f$ is strongly convex. Both works use recent variance reduction techniques in convex optimization, such as the works of Johnson and Zhang (2013); Mahdavi et al. (2013); Defazio et al. (2014). Hazan and Luo (2016) additionally used Nesterov (1983)’s acceleration technique. Both works use exact greedy steps, instead of approximate greedy steps as in this paper.

For the non-stochastic case, faster rates for FW are known with additional assumptions (Lacoste-Julien and Jaggi, 2015; Garber and Hazan, 2015, 2016). We refer the readers to the works of Hazan and Luo (2016) and Reddi et al. (2016) for further related works.

2 Approximate Greedy Optimization

We consider the approximate Jones’ algorithm in Algorithm 1. At each iteration, the algorithm solves the optimization problem $\min_{d\in S}f\left((1-\eta_{k})w_{k}+\eta_{k}d\right)$ with an error of $\epsilon_{k}\eta_{k}$ . We call this an $\epsilon_{k}$ -approximate Jones’ algorithm, and we say the algorithm is a $c$ -Jones algorithm if there is a constant $c\geq 0$ such that $\epsilon_{k}\leq c\eta_{k}$ for all $k$ .

We leave the choice of $\eta_{k}$ unspecified, and thus this includes algorithms which fix $\eta_{k}$ a priori, or choose $\eta_{k}$ and $d_{k}$ jointly at each iteration. Similarly, $\epsilon_{k}$ may be chosen a priori or chosen adaptively.

An algorithm is called an $\epsilon_{k}$ -approximate FW algorithm, if given $w_{k}\in W$ , the algorithm yields $(\eta_{k},d_{k})\in[0,1]\times S$ such that

$$\nabla f(w_{k})^{\top}d_{k}\leq\min_{d\in S}\nabla f(w_{k})^{\top}d+\epsilon_{k},$$

and we say the algorithm is a $c$ -FW algorithm for some $c\geq 0$ if $\epsilon_{k}\leq c\eta_{k}$ .

2.1 Assumptions

In this section, we assume $f$ is convex with bounded curvature, that is,

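$$f(w^{\prime})\geq f(w)+\nabla f(w)^{\top}(w^{\prime}-w),\quad\text{for all }w^{\prime},w\in W,$$

$$\sup_{w\in W,\,d\in S,\,\eta\in(0,1)}\frac{2}{\eta^{2}}D_{f}\left((1-\eta)w+\eta d,\,w\right)<\infty,$$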

where $D_{f}(w,y)=f(w)-f(y)-\operatorname{\nabla}f(y)^{\top}(w-y)$ is the Bregman divergence of $f$ , and the LHS of the second equation is called the curvature of $f$ in $W$ . This definition of curvature is the same as that in (Jaggi, 2013), except that Jaggi (2013) takes the supremum over $d\in W$ . If $f$ is $L$ -smooth, that is, $\lVert\operatorname{\nabla}f(w^{\prime})-\operatorname{\nabla}f(w)\rVert_{2}\leq L\lVert w^{\prime}-w\rVert_{2}$ for all $w^{\prime},w\in W$ , then the curvature of $f$ is not more than $L\operatorname{diam}(S)^{2}$ . Thus a smooth function has bounded curvature. The curvature of $f$ is also not more than $\sup_{w\in W,d\in S,\eta\in(0,1)}\frac{\partial^{2}f((1-\eta)w+\eta d)}{\partial\eta^{2}}$ , assuming the second-order derivative exists.
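As a quick sanity check of the curvature definition, the following worked example (our illustration, not taken from the paper) computes the curvature of $f(w)=\frac{1}{2}\lVert w\rVert_{2}^{2}$ , which is $1$ -smooth, and shows that the bound $L\operatorname{diam}(S)^{2}$ is attained.

```latex
% Bregman divergence of f(w) = (1/2)||w||_2^2:
D_f(x, w) = \tfrac{1}{2}\lVert x\rVert_2^2 - \tfrac{1}{2}\lVert w\rVert_2^2 - w^{\top}(x - w)
          = \tfrac{1}{2}\lVert x - w\rVert_2^2 .
% With x = (1-\eta) w + \eta d we have x - w = \eta (d - w), hence
\frac{2}{\eta^{2}} D_f\bigl((1-\eta) w + \eta d,\; w\bigr)
  = \frac{2}{\eta^{2}} \cdot \frac{\eta^{2}}{2} \lVert d - w\rVert_2^2
  = \lVert d - w\rVert_2^2 .
% Taking the supremum over w in W and d in S gives
\sup_{w \in W,\, d \in S} \lVert d - w\rVert_2^2 = \operatorname{diam}(S)^{2} = L \operatorname{diam}(S)^{2}
\quad \text{with } L = 1 .
```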

The following are two basic bounds needed in our analysis.

Lemma 1.

(a) (Duality bound) If $f$ is convex on $W$ , $w^{*}=\operatorname*{arg\,min}_{w\in W}f(w)$ , then for any $w\in W$ ,

$$f(w)-f(w^{*})\leq\max_{d\in S}\nabla f(w)^{\top}(w-d).$$

(b) (Curvature inequality) If $f$ has curvature at most $M$ , then for any $w\in W$ , $d\in S$ , $\eta\in[0,1]$ ,

$$f\left((1-\eta)w+\eta d\right)\leq f(w)+\eta\nabla f(w)^{\top}(d-w)+\frac{M}{2}\eta^{2}.$$

Proof.

(a) Using the definitions, we have

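$$f(w)-f(w^{*})\leq\nabla f(w)^{\top}(w-w^{*})\leq\max_{d\in W}\nabla f(w)^{\top}(w-d)\leq\max_{d\in S}\nabla f(w)^{\top}(w-d).$$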

(b) From the definition of Bregman divergence, we have

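$$f\left((1-\eta)w+\eta d\right)=f(w)+\eta\nabla f(w)^{\top}(d-w)+D_{f}\left((1-\eta)w+\eta d,\,w\right).$$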

Apply the definition of curvature, then the desired inequality follows. ∎

In general, we cannot improve the quadratic term to a higher-order one in the curvature inequality. For example, if $f$ is $m$ -strongly convex, then we can show that $\sup_{w\in W,d\in S,\eta\in(0,1)}\frac{1}{\eta^{3}}D_{f}\left((1-\eta)w+\eta d,w\right)$ is infinity.

2.2 A Sufficient Condition for Convergence

The core of our convergence analysis for Jones’ algorithm and the FW algorithm is the following recurrence inequality for the error $e_{k}=f(w_{k})-f(w^{*})$ .

Lemma 2.

Let $f$ be convex with curvature at most $M$ . Then for both $\epsilon_{k}$ -approximate Jones’ algorithm and $\epsilon_{k}$ -approximate FW algorithm the error $e_{k}=f(w_{k})-f(w^{*})$ satisfies

$$e_{k+1}\leq(1-\eta_{k})e_{k}+\epsilon^{\prime}_{k},$$

where $\epsilon^{\prime}_{k}=\eta_{k}\epsilon_{k}+\frac{M}{2}\eta_{k}^{2}$ .

Owing to space limits, we omit the proof of this lemma and a few other proofs from the main text, and give them in the supplementary material.

The above lemma leads to a general convergence result for Jones’ algorithm and the FW algorithm.

Theorem 1.

Let $f$ be convex with curvature at most $M$ . For an $\epsilon_{k}$ -approximate Jones’ algorithm or an $\epsilon_{k}$ -approximate FW algorithm, if $\eta_{k}$ ’s and $\epsilon_{k}$ ’s are chosen such that $\sum_{k}\eta_{k}$ diverges, $\eta_{k}\to 0$ and $\epsilon_{k}\to 0$ as $k\to\infty$ , then $f(w_{k})\to f(w^{*})$ as $k\to\infty$ .

Proof.

From Lemma 2, it suffices to show that under the given conditions on $\eta_{k}$ and $\epsilon_{k}$ , any sequence satisfying the recurrence $e_{k+1}\leq(1-\eta_{k})e_{k}+\epsilon^{\prime}_{k}$ satisfies $e_{k}\to 0$ .

For any $\delta$ such that $0<\delta<1$ , there exists $K$ such that for all $k>K$ , we have $\eta_{k}<\delta$ , $\epsilon^{\prime}_{k}/\eta_{k}<\delta/2$ , because $\eta_{k}\to 0$ and $\epsilon^{\prime}_{k}/\eta_{k}=\epsilon_{k}+\frac{M}{2}\eta_{k}\to 0$ . For any $k>K$ , if $e_{k}>\delta$ , then we have

$$e_{k+1}\leq e_{k}+\eta_{k}(\epsilon^{\prime}_{k}/\eta_{k}-e_{k})\leq e_{k}-\eta_{k}\delta/2.$$

Since $\sum_{k}\eta_{k}$ diverges, $e_{k}$ cannot stay above $\delta$ for all $k>K$ (otherwise the error would decrease without bound, contradicting $e_{k}\geq 0$ ), so there exists $N>K$ such that $e_{N}\leq\delta$ . We show by induction that for all $k\geq N$ , we have $e_{k}\leq\delta$ . This is true for $k=N$ . For the inductive case, assume $e_{k}\leq\delta$ . If $e_{k}\geq\delta/2$ , then $\epsilon^{\prime}_{k}/\eta_{k}-e_{k}\leq 0$ , and thus $e_{k+1}\leq e_{k}+\eta_{k}(\epsilon^{\prime}_{k}/\eta_{k}-e_{k})\leq e_{k}\leq\delta$ . If $e_{k}\leq\delta/2$ , then $e_{k+1}\leq\frac{\delta}{2}+\delta(\frac{\delta}{2}-0)\leq\delta$ . We have thus proved that for any $\delta\in(0,1)$ , there exists $N$ such that for all $k\geq N$ , $e_{k}\leq\delta$ . Thus $e_{k}\to 0$ as $k\to\infty$ . ∎

2.3 Convergence Rate

We now show that with proper choices of $\eta_{k}$ ’s and $\epsilon_{k}$ ’s, we can obtain a convergence rate of $O(1/k)$ for Jones’ algorithm and the FW algorithm.

Theorem 2.

Let $f$ be convex with curvature at most $M$ , $\eta_{k}=\frac{2}{k+2}$ for $k\geq 0$ . Then for the iterates $(w_{k})$ obtained using a $c$ -Jones algorithm or a $c$ -FW algorithm, when $k\geq 1$ ,

$$f(w_{k})-f(w^{*})\leq\frac{2M+4c}{k+2}.$$

The constant in the rate can be improved in some cases. For example, if the minimizer is an algebraic interior point, then we can get a smaller constant using an argument similar to that in (Zhang, 2003).
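The $O(1/k)$ rate can be checked numerically on the worst case of the recurrence $e_{k+1}\leq(1-\eta_{k})e_{k}+\eta_{k}\epsilon_{k}+\frac{M}{2}\eta_{k}^{2}$ with $\epsilon_{k}=c\eta_{k}$ and $\eta_{k}=\frac{2}{k+2}$ . The sketch below iterates that recurrence with equality and compares $e_{k+1}$ with $\frac{2M+4c}{k+3}$ ; the values of `M`, `c` and the starting error are arbitrary illustrative choices.

```python
M, c = 3.0, 0.5            # hypothetical curvature bound and approximation constant
C = 2 * M + 4 * c          # constant in the claimed bound C / (k + 2)
e = 10.0 * C               # deliberately large starting error e_0
worst_ratio = 0.0
for k in range(1000):
    eta = 2.0 / (k + 2)
    # Worst case of the recurrence: e_{k+1} = (1 - eta) e_k + c eta^2 + (M / 2) eta^2
    e = (1 - eta) * e + c * eta ** 2 + 0.5 * M * eta ** 2
    # Compare e_{k+1} with the claimed bound C / ((k + 1) + 2)
    worst_ratio = max(worst_ratio, e * (k + 3) / C)
print(worst_ratio)
```

Because $\eta_{0}=1$ , the first step erases the dependence on the starting error, which is why the bound does not involve $f(w_{0})$ .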

A careful look at the analysis shows that if our update rule is guaranteed to generate a new iterate whose objective value is no larger than that of the iterate generated by a $c$ -FW algorithm or a $c$ -Jones algorithm with step size $\eta_{k}=2/(k+2)$ , then we still get an $O(1/k)$ convergence rate. This also implies that we can mix $c$ -FW steps and $c$ -Jones steps to get an $O(1/k)$ convergence rate. In addition, we can obtain the following result from Zhang (2003) as a special case.

Corollary 1.

Let $f$ be convex with curvature at most $M$ . If $w_{k+1}\in W$ is chosen such that

$$f(w_{k+1})\leq\min_{\eta\in[0,1],\,d\in S}f\left((1-\eta)w_{k}+\eta d\right)+\frac{4c}{(k+2)^{2}},$$

where $c>0$ is some constant, then for $k\geq 1$ , we have $f(w_{k})-f(w^{*})\leq\frac{2M+4c}{k+2}$ .

The key idea in the above analysis is to show that $e_{k+1}\leq(1-\eta_{k})e_{k}+C\eta_{k}^{2}$ , and then use induction to show that $e_{k}\in O(\frac{1}{k})$ when $\eta_{k}=\frac{2}{k+2}$ . Can we tune $\eta_{k}$ to obtain a bound $O(\frac{1}{k^{p}})$ for some $p>1$ ? It turns out that $p=1$ is the best obtainable.

Theorem 3.

Consider a sequence $(e_{k})$ satisfying

$$e_{k+1}=(1-\eta_{k})e_{k}+C\eta_{k}^{2},$$

with $e_{0}\leq 2C$ , then for any choice of $\eta_{k}$ , we have $e_{k}\geq\frac{a}{k+2}$ for $a=\min\{e_{0},C\}$ .

Proof.

Clearly $e_{0}\geq\frac{a}{2}$ holds. Now we show by induction that if $e_{k}\geq\frac{a}{k+2}$ , then $e_{k+1}\geq\frac{a}{k+3}$ . Note that $(1-\eta_{k})e_{k}+C\eta_{k}^{2}$ is minimized when $\eta_{k}=\frac{e_{k}}{2C}$ , with minimum value $e_{k}(1-\frac{e_{k}}{4C})$ , which is an increasing function of $e_{k}$ when $e_{k}\in[\frac{a}{k+2},2C]$ . This implies that when the $\eta_{k}$ ’s are chosen to minimize the $e_{k}$ ’s, the $e_{k}$ ’s form a decreasing sequence. Since $e_{0}\leq 2C$ , this also implies that the minimum $e_{k}\leq 2C$ . Hence we have

$$e_{k+1}\geq e_{k}\left(1-\frac{e_{k}}{4C}\right)\geq\frac{a}{k+2}\left(1-\frac{a/4C}{k+2}\right)\geq\frac{a(k+2-a/4C)}{(k+2)^{2}}\geq\frac{a}{k+3},$$

where the last inequality holds because

$$(k+2-a/4C)(k+3)\geq(k+2-1/4)(k+3)\geq k^{2}+\frac{19}{4}k+\frac{21}{4}\geq(k+2)^{2}.$$

∎
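The lower bound can be illustrated numerically. Setting the derivative $-e_{k}+2C\eta$ of $(1-\eta)e_{k}+C\eta^{2}$ to zero gives the greedy step size $\eta=\frac{e_{k}}{2C}$ ; the sketch below iterates the recurrence with this best-possible step and checks that $e_{k}\geq\frac{a}{k+2}$ throughout. The value of `C` is an arbitrary illustrative choice.

```python
C = 2.0
e = 2.0 * C                # e_0 = 2C, the largest starting error the theorem allows
a = min(e, C)
ok = True
for k in range(10000):
    ok = ok and e >= a / (k + 2)        # claimed lower bound on e_k
    eta = e / (2 * C)                   # minimizer of (1 - eta) * e + C * eta**2
    e = (1 - eta) * e + C * eta ** 2    # equals e * (1 - e / (4 * C))
print(ok, e)
```

Even with the step size tuned optimally at every iteration, the error decays no faster than order $1/k$ , so any faster rate has to come from a different argument than this recurrence.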

2.4 An Equivalence Result

We have already seen that a few results hold for both the approximate Jones’ algorithm and the approximate FW algorithm. The following theorem shows that we can view these two algorithms as equivalent algorithms.

Theorem 4.

Assume $f$ is convex with curvature at most $M$ .

(a)

An $\epsilon_{k}$ -approximate Jones’ algorithm with step sizes $(\eta_{k})$ is $(\epsilon_{k}+\frac{M}{2}\eta_{k})$ -FW. In particular, a $c$ -Jones algorithm is $\frac{M+2c}{2}$ -FW with the same step sizes.

(b)

An $\epsilon_{k}$ -approximate FW algorithm with step sizes $(\eta_{k})$ is an $(\epsilon_{k}+\frac{M}{2}\eta_{k})$ -approximate Jones’ algorithm. In particular, a $c$ -FW algorithm is $\frac{M+2c}{2}$ -Jones with the same step sizes.

An immediate consequence of this result is that if any $c$ -Jones algorithm converges at $O(1/k)$ rate, then any $c$ -FW algorithm converges at $O(1/k)$ rate too.
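Part (b) can be verified in a few lines from the curvature inequality and convexity. The derivation below is our own sketch of that argument, written for a fixed iterate $w_{k}$ and an arbitrary atom $d^{\prime}\in S$ ; it may differ in presentation from the proof in the supplementary material.

```latex
% Assume the eps_k-FW condition: \nabla f(w_k)^T d_k <= \nabla f(w_k)^T d' + eps_k for all d' in S.
\begin{aligned}
f\bigl((1-\eta_k) w_k + \eta_k d_k\bigr)
  &\le f(w_k) + \eta_k \nabla f(w_k)^{\top}(d_k - w_k) + \tfrac{M}{2}\eta_k^{2}
     && \text{(curvature inequality)} \\
  &\le f(w_k) + \eta_k \nabla f(w_k)^{\top}(d' - w_k) + \eta_k \epsilon_k + \tfrac{M}{2}\eta_k^{2}
     && \text{(FW condition)} \\
  &\le f\bigl((1-\eta_k) w_k + \eta_k d'\bigr) + \eta_k\bigl(\epsilon_k + \tfrac{M}{2}\eta_k\bigr)
     && \text{(convexity of } f\text{)} .
\end{aligned}
% Minimizing the right-hand side over d' in S gives the
% (eps_k + (M/2) eta_k)-approximate Jones condition.
```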

3 Approximate Stochastic Greedy Optimization

We consider approximate stochastic versions of Jones’ algorithm and the FW algorithm for optimizing a function $f(w)={\mathbb{E}}f_{z}(w)$ , where the expectation is over a random variable. Without loss of generality, we work with the finite-sum case where $f(w)=\frac{1}{n}\sum_{i=1}^{n}f_{i}(w)$ to ease presentation.

3.1 Stochastic Jones’ Algorithm

A natural stochastic version of Jones’ algorithm is obtained by replacing the function $f$ with a sampled approximation $\tilde{f}_{k}$ at iteration $k$ .

We show that ASJ is over-greedy when $b_{k}=1$ and the minimization problem at each iteration is solved exactly. The iterates can jump randomly from one vertex to another, leading to divergence. This differs from the nonstochastic case where exact minimization leads to smaller errors.

Proposition 1.

Let $b_{k}=1$ , $\epsilon_{k}=0$ and $\eta_{k}$ jointly optimized with $d_{k}$ in ASJ, then there exists a function $f(w)=\frac{1}{n}\sum_{i=1}^{n}f_{i}(w)$ with each $f_{i}$ being convex and smooth, such that ${\mathbb{E}}f(w_{k})-f(w^{*})$ does not converge to 0 as $k\to\infty$ .

On the other hand, we can get a convergent algorithm by using increasingly large batch sizes. In essence, the theorem below shows that when we choose a batch size of $k$ at iteration $k$ with a step size of $1/\sqrt{k}$ , we can get an error of $O(1/\sqrt{t})$ , up to a logarithmic factor, at any iteration $t$ . Taking $b_{k}$ as a measure of the computational complexity of the $k$ -th problem, then to get an error of $\epsilon$ , the complexity of the algorithm is $O(\epsilon^{-4})$ .

Theorem 5.

Assume that the diameter of $W$ is $D$ , each $f_{i}(w)$ is convex with curvature at most $M$ , and $\lVert\operatorname{\nabla}f_{i}(w)\rVert_{2}\leq L$ for all $i$ and $w\in W$ . Let $\bar{w}_{k}=\sum_{i=1}^{k}\eta_{i}w_{i}/\sum_{i=1}^{k}\eta_{i}$ . In ASJ, when $b_{k}=t$ , $\eta_{k}=t^{-1/2}$ and $\epsilon_{k}=c\eta_{k}$ for all $k$ , we have

$$\mathbb{E}f(\bar{w}_{t})-f(w^{*})\leq\frac{f(w_{1})-f(w^{*})+DL+M+c}{\sqrt{t}},$$

When $b_{k}=k$ , and $\eta_{k}=k^{-1/2}$ , we have

$$\mathbb{E}f(\bar{w}_{t})-f(w^{*})\leq\frac{f(w_{1})-f(w^{*})+(DL+M+c)(\ln t+1)}{\sqrt{t}}.$$

3.2 Approximate Stochastic Versions of Frank-Wolfe

For FW, we can also sample a mini-batch estimation of the function $f(w)$ and use the gradient of the estimation to replace the gradient of $f$ , as shown in Algorithm 3.

We can show that if there exists a constant $c>0$ , for all $k\geq 0$ , we have

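$$\langle\mathbb{E}(d_{k}),\nabla f(w_{k})\rangle\leq\min_{d\in S}\langle d,\nabla f(w_{k})\rangle+c\eta_{k},\tag{17}$$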

then ${\mathbb{E}}(f(w_{k}))-f(w^{*})$ is of the order $O(1/k)$ for $k\geq 1$ . The above recursive property is a sufficient but not necessary condition for ASFW to have $O(1/k)$ convergence rate. Indeed, there are cases where the above recursive property does not hold, but ASFW converges.

Proposition 2.

Let $b_{k}=1$ , $\epsilon_{k}=0$ , $\eta_{k}=\frac{2}{k+2}$ in ASFW. There exists a function $f(w)=\frac{1}{n}\sum_{i=1}^{n}f_{i}(w)$ with each $f_{i}$ being convex and smooth, such that ${\mathbb{E}}f(w_{k})\to f(w^{*})$ as $k\to\infty$ but Eq. 17 is not satisfied.

Proposition 3.

Let $b_{k}=1$ , $\epsilon_{k}=0$ , and $\eta_{k}$ be arbitrarily chosen in ASFW. There exists a convex and smooth $f$ such that $\lim_{k\to\infty}{\mathbb{E}}f(w_{k})$ exists, but the limit is larger than $f(w^{*})$ .

Reddi et al. (2016) considered the exact version of ASFW, that is, the case with $\epsilon_{k}=0$ . They showed that for smooth nonconvex $f$ , with a suitable choice of $b_{k}$ and $\eta_{k}$ , one can achieve an error of $\epsilon$ with $O(\epsilon^{-4})$ stochastic gradients and $O(\epsilon^{-2})$ linear optimizations. We remark here that we can generalize their results to the approximate case: if we choose $b_{k}$ , $\epsilon_{k}$ , $\eta_{k}$ as in Theorem 5, then we get the same kind of bound as for ASJ, with differences only in the constants. This result applies to both the smooth convex case and the smooth nonconvex case, with the cost in the nonconvex case having the form of the duality bound.
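A mini-batch FW loop with the growing batch size discussed above can be sketched as follows. This is an illustration under several assumptions that are not from the paper: $W$ is the probability simplex, the linear minimization is exact ( $\epsilon_{k}=0$ ), batches are drawn with replacement, and the components are hypothetical least-squares terms $f_{i}(w)=\frac{1}{2}(x_{i}^{\top}w-y_{i})^{2}$ . The settings $b_{k}=k$ and $\eta_{k}=k^{-1/2}$ and the $\eta$ -weighted average follow Theorem 5; the names `make_grad` and `asfw_simplex` are ours.

```python
import random

def make_grad(x, y):
    """Gradient of the hypothetical component f_i(w) = 0.5 * (x.w - y)**2."""
    def grad(w):
        r = sum(xj * wj for xj, wj in zip(x, w)) - y
        return [r * xj for xj in x]
    return grad

def asfw_simplex(grads, w1, steps, seed=0):
    """Mini-batch FW over the simplex; returns the eta-weighted average iterate."""
    rng = random.Random(seed)
    n, dim = len(grads), len(w1)
    w = list(w1)
    avg, total = [0.0] * dim, 0.0
    for k in range(1, steps + 1):
        eta = k ** -0.5                               # eta_k = k^{-1/2}
        avg = [a + eta * wi for a, wi in zip(avg, w)]  # accumulate eta_k * w_k
        total += eta
        g = [0.0] * dim
        for _ in range(k):                            # batch size b_k = k, with replacement
            gi = grads[rng.randrange(n)](w)
            g = [a + b / k for a, b in zip(g, gi)]    # running mini-batch mean gradient
        j = min(range(dim), key=lambda i: g[i])       # exact linear minimization over atoms
        w = [(1 - eta) * wi for wi in w]
        w[j] += eta
    return [a / total for a in avg]

xs = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0)]     # hypothetical data
ys = [0.2, 0.5, 0.3, 0.7]
grads = [make_grad(x, y) for x, y in zip(xs, ys)]
w_bar = asfw_simplex(grads, [1.0, 0.0, 0.0], steps=200)
print(w_bar)
```

The total number of stochastic gradients after $t$ iterations is $\sum_{k=1}^{t}k=O(t^{2})$ , which is where the $O(\epsilon^{-4})$ gradient complexity for an $O(1/\sqrt{t})$ error comes from.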

We consider the nonsmooth convex case, and give a stochastic version that has error $\epsilon$ using $O(\epsilon^{-4})$ stochastic gradients and $O(\epsilon^{-4})$ linear optimizations. The algorithm aggregates past stochastic gradients to construct a proxy $\bar{g}_{k}+\operatorname{\nabla}\Phi(w_{k})$ for the full gradient. The component $\bar{g}_{k}$ is a weighted sum of the stochastic gradients from past iterations. The term $\operatorname{\nabla}\Phi(w_{k})$ has a regularizing effect of encouraging alignment of $d$ with $w_{k}-w_{1}$ when $\Phi$ is strongly convex with $\Phi(w_{1})=0$ . This is because $\langle\operatorname{\nabla}\Phi(w)-\operatorname{\nabla}\Phi(w^{\prime}),w-w^{\prime}\rangle\geq\rho\lVert w-w^{\prime}\rVert_{2}^{2}$ . Without loss of generality, assume $\Phi(w)$ is $\rho$ -strongly convex and 1-smooth. One possible choice of $\Phi$ is $\Phi(w)=\frac{1}{2}\lVert w-w_{1}\rVert_{2}^{2}$ .

A similar algorithm has been used in online learning by Hazan and Kale (2012); Hazan et al. (2016). They used fixed instead of variable $\eta_{k}$ , and they performed exact instead of approximate minimization at each step.

Theorem 6.

Let $\Phi(w)$ be a $\rho$ -strongly convex 1-smooth function, $R^{2}=\max_{w\in W}\Phi(w)-\Phi(w_{1})$ , $\eta_{k}=\frac{1}{k^{p}}$ , $\epsilon_{k}=\frac{(\lambda-1)R^{2}}{\rho}\eta_{k}$ , and $\sigma_{k}\leq c\eta_{k}^{3/2}$ , where $\lambda\geq 1$ , $c>0$ and $p\in[0,1]$ are constants. Assume $\lVert\operatorname{\nabla}f_{i}(w)\rVert_{2}\leq L$ for all $i$ and $w$ . Let $\bar{w}_{t}=\sum_{k=1}^{t}\sigma_{k}w_{k}/\sum_{k=1}^{t}\sigma_{k}$ , $K=\left(\sqrt{\frac{1}{2\rho}}\frac{cL}{1-p}+\sqrt{\frac{\lambda R^{2}+c^{2}L^{2}}{\rho(1-p)}+\frac{c^{2}L^{2}}{2\rho(1-p)^{2}}}\right)^{2}$ , then we have

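$$\mathbb{E}f(\bar{w}_{t})-f(w^{*})\leq\left(3L\sqrt{\frac{2K}{\rho}}+\frac{2cL^{2}}{\rho}\right)\frac{\sum_{k=1}^{t}\sigma_{k}\sqrt{\eta_{k}}}{\sum_{k=1}^{t}\sigma_{k}}+\frac{R^{2}}{\sum_{k=1}^{t}\sigma_{k}}.\tag{18}$$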

In particular, when $p=\frac{1}{2}$ , for any $t\geq 1$ ,

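$$\mathbb{E}f(\bar{w}_{t})-f(w^{*})\leq\left(3L\sqrt{\frac{2K}{\rho}}+\frac{2cL^{2}}{\rho}\right)\frac{\ln t+1}{t^{1/4}}+\frac{R^{2}}{ct^{1/4}}.\tag{19}$$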

In addition, if $\sigma_{k}=\sigma=\frac{c}{t^{3/4}}$ , then

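$$\mathbb{E}f(\bar{w}_{t})-f(w^{*})\leq\left(4L\sqrt{\frac{2K}{\rho}}+\frac{8cL^{2}}{3\rho}+\frac{R^{2}}{c}\right)\frac{1}{t^{1/4}}.\tag{20}$$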

We state two lemmas and then prove this theorem.

Lemma 3.

Let $\Phi(w)$ be a $\rho$ -strongly convex function, $F_{k}(w)=\sum_{i=1}^{k-1}\sigma_{i}g_{i}^{\top}w+\Phi(w)$ , $w^{*}_{k}=\operatorname*{arg\,min}_{w\in X}F_{k}(w)$ , $R^{2}=\max_{w\in W}\Phi(w)-\Phi(w^{*}_{1})$ . Then for any $w\in X$ ,

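$$\sum_{k=1}^{t}\sigma_{k}g_{k}^{\top}(w_{k}^{*}-w)\leq\sum_{k=1}^{t}\frac{2\sigma_{k}^{2}\lVert g_{k}\rVert_{2}^{2}}{\rho}+R^{2}.$$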

Lemma 4.

Let $\eta_{k}=\frac{1}{k^{p}}$ , and $\sigma_{k}\leq c\eta_{k}^{3/2}$ , where $c$ is a positive constant, and $p$ a positive constant in $(0,1)$ . Let $R$ , $\rho$ and $L$ be positive constants, and $K$ as defined in Theorem 6. If $e_{1}\leq K\eta_{1}$ , and

[TABLE]

then $e_{k}\leq K\eta_{k}$ for any $k\geq 1$ .

Proof of Theorem 6.

Let $F_{k}(w)=\sum_{i=1}^{k-1}\sigma_{i}g_{i}^{\top}w+\Phi(w)$ , then $F_{k}$ is $\rho$ -strongly convex and 1-smooth. Let $w^{*}_{k}=\operatorname*{arg\,min}_{w\in W}F_{k}(w)$ , $h_{k}(w)=f_{i_{k}}(w)$ , and $\tilde{h}_{k}(w)=h_{k}(w-(w^{*}_{k}-w_{k}))$ . Then we have $g_{k}=\operatorname{\nabla}\tilde{h}_{k}(w^{*}_{k})$ . Using the convexity of $\tilde{h}_{k}$ and Lemma 3, we have

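$$\sum_{k=1}^{t}\sigma_{k}\left(\tilde{h}_{k}(w_{k}^{*})-\tilde{h}_{k}(w^{*})\right)\leq\sum_{k=1}^{t}\sigma_{k}g_{k}^{\top}(w_{k}^{*}-w^{*})\leq\sum_{k=1}^{t}\frac{2\sigma_{k}^{2}\lVert g_{k}\rVert_{2}^{2}}{\rho}+R^{2}.$$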

We have $|h_{k}(w)-\tilde{h}_{k}(w)|\leq L\lVert w_{k}-w^{*}_{k}\rVert_{2}$ for any $w\in W$ because $h_{k}$ is $L$ -Lipschitz. Hence

[TABLE]

We have $\frac{\rho}{2}\lVert w_{k}-w^{*}_{k}\rVert_{2}^{2}\leq e_{k}=F_{k}(w_{k})-F_{k}(w^{*}_{k})$ , because $F_{k}$ is $\rho$ -strongly convex.

Note that $\operatorname{\nabla}F_{k}(w_{k})=g_{k}+\operatorname{\nabla}\Phi(w_{k})$ , thus $w_{k+1}$ is obtained by doing an $\frac{(\lambda-1)R^{2}}{\rho}$ -FW step on $F_{k}$ . On the other hand $F_{k}$ has curvature at most $\frac{2R^{2}}{\rho}$ because $F_{k}$ is 1-smooth, and $\lVert w-w^{\prime}\rVert_{2}^{2}\leq\frac{2}{\rho}(\Phi(w)-\Phi(w^{\prime}))\leq\frac{2R^{2}}{\rho}$ for any $w,w^{\prime}\in W$ due to the $\rho$ -strong convexity of $\Phi$ . Using Lemma 2, we have $F_{k}(w_{k+1})-F_{k}(w^{*})\leq(1-\eta_{k})e_{k}+\frac{\lambda R^{2}}{\rho}\eta_{k}^{2}$ . Thus we have

[TABLE]

Using Lemma 4, we have $e_{k}\leq K\eta_{k}$ . Thus $\lVert w_{k}-w_{k}^{*}\rVert_{2}\leq\sqrt{\frac{2K}{\rho}\eta_{k}}$ . Hence we have

[TABLE]

We used the fact that $\sigma_{k}^{2}\leq\sigma_{k}c\eta_{k}^{3/2}\leq c\sigma_{k}\sqrt{\eta_{k}}$ in the last inequality. Now observe that we have

[TABLE]

where the first equality holds due to linearity of expectation, the second equality holds because we take expectation with respect to $i_{k}$ (but not $w_{k}$ ), the third equality holds due to linearity of expectation, and the last inequality holds due to the convexity of $f$ . From Eq. 22 and Eq. 23, we obtain Eq. 18.

When $p=\frac{1}{2}$ and $\sigma_{k}=c\eta_{k}^{3/2}=ck^{-3/4}$ , observe that $\sum_{k=1}^{t}\sigma_{k}=\sum_{k=1}^{t}ck^{-3/4}\geq ct^{1/4}$ and $\sum_{k=1}^{t}\sigma_{k}\sqrt{\eta_{k}}=c\sum_{k=1}^{t}k^{-1}\leq c(\ln t+1)$ , then using Eq. 18 we obtain Eq. 19.

When $\sigma_{k}=\sigma=ct^{-3/4}$ , using Eq. 22 and Eq. 23 and observe that $\sum_{k=1}^{t}\sqrt{\eta_{k}}\leq\frac{4}{3}t^{3/4}$ , we obtain Eq. 20. ∎
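The three elementary sum bounds used in the last two steps, namely $\sum_{k=1}^{t}k^{-3/4}\geq t^{1/4}$ , $\sum_{k=1}^{t}k^{-1}\leq\ln t+1$ and $\sum_{k=1}^{t}k^{-1/4}\leq\frac{4}{3}t^{3/4}$ (the last with $\eta_{k}=k^{-1/2}$ ), can be spot-checked numerically. The horizon `t` below is an arbitrary illustrative value.

```python
import math

t = 10_000
s1 = sum(k ** -0.75 for k in range(1, t + 1))   # sum of k^{-3/4}
s2 = sum(1.0 / k for k in range(1, t + 1))      # harmonic sum
s3 = sum(k ** -0.25 for k in range(1, t + 1))   # sum of sqrt(eta_k) with eta_k = k^{-1/2}

print(s1 >= t ** 0.25)
print(s2 <= math.log(t) + 1)
print(s3 <= 4.0 / 3.0 * t ** 0.75)
```

The first bound holds because every term is at least $t^{-3/4}$ ; the other two follow by comparing the sums with the integrals of $x^{-1}$ and $x^{-1/4}$ .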

4 Conclusion

We have given a unified analysis of two approximate greedy algorithms, and presented new results on convergence and their connections. In addition, we studied their stochastic versions and demonstrated these algorithms can be robust against the optimization error in each iteration.

There are a few questions for further exploration. From recent results in FW and the equivalence result in Theorem 4, it is natural to ask whether Jones’ algorithm converges at faster rates under suitable additional assumptions, and whether more efficient stochastic Jones’ algorithms can be obtained. For stochastic FW, the nonsmooth case seems to be harder than the smooth case. Results on complexity lower bounds will lead to a better understanding of the greedy algorithms and these problems.

Supplementary Material

See Lemma 2

Proof.

First consider the $\epsilon_{k}$ -greedy algorithm. We have

[TABLE]

Let $e_{k}=f(w_{k})-f(w^{*})$ , and subtract both sides of the above inequality by $f(w^{*})$ , we obtain

[TABLE]

For the $\epsilon_{k}$ -greedy FW algorithm, we have

[TABLE]

Subtracting both sides of the inequality by $f(w^{*})$ , we obtain

[TABLE]

∎

See Theorem 2

Proof.

Let $C=2M+4c$ . From 2, for both $c$ -greedy and $c$ -FW algorithms, we have

[TABLE]

We prove the bound by induction. Taking $k=0$ in the above inequality, we obtain $f(w_{1})-f(w^{*})\leq\frac{C}{4}<\frac{C}{3}$, so the bound $e_{k}\leq\frac{C}{k+2}$ holds for $k=1$.

For the inductive step, assume the bound holds for $k$, that is, $e_{k}\leq\frac{C}{k+2}$. Then we have

[TABLE]

where the last inequality holds because $(k+1)(k+3)=k^{2}+4k+3<(k+2)^{2}$ . ∎
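The induction can also be checked numerically. The sketch below is not part of the original proof. It assumes that the recurrence has the form $e_{k+1}\leq(1-\eta_{k})e_{k}+\frac{C}{4}\eta_{k}^{2}$ with $\eta_{k}=\frac{2}{k+2}$, which is consistent with the base case $e_{1}\leq\frac{C}{4}$ but is an assumption here because the displayed inequality is not reproduced in this text. The function name and the values of `C` and `e0` are hypothetical.

```python
from fractions import Fraction

def worst_case_errors(C, e0, steps):
    """Iterate e_{k+1} = (1 - eta_k) e_k + (C/4) eta_k^2 with eta_k = 2/(k+2).

    The recurrence is taken with equality, i.e. the worst case allowed by the
    assumed inequality. Exact rational arithmetic avoids rounding issues.
    """
    e = Fraction(e0)
    errors = []
    for k in range(steps):
        eta = Fraction(2, k + 2)
        e = (1 - eta) * e + Fraction(C) / 4 * eta ** 2
        errors.append((k + 1, e))   # e now holds e_{k+1}
    return errors

errors = worst_case_errors(C=8, e0=100, steps=200)
# Check e_k <= C / (k + 2) for every k >= 1.
bound_holds = all(e <= Fraction(8, k + 2) for k, e in errors)
```

Because $\eta_{0}=1$, the initial error `e0` is forgotten after the first step, which is why the bound does not depend on it.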

See Theorem 4

Proof.

(a) Suppose the current iterate is $w_{k}\in W$. If an $\epsilon_{k}$-greedy algorithm yields $(\eta_{k},d_{k})\in[0,1]\times S$, then

[TABLE]

That is, we have

[TABLE]

Simplifying the above inequality, we have (assuming $\eta_{k}\neq 0$ if $w_{k}$ is not an optimal solution)

[TABLE]

The case for $c$-greedy algorithms follows easily.

(b) Suppose the current iterate is $w_{k}\in W$. If an algorithm is $\epsilon_{k}$-FW, then it gives a pair $(\eta_{k},d_{k})\in[0,1]\times S$ such that for any $d^{\prime}\in S$

[TABLE]

The case for $c$-greedy FW algorithms follows easily. ∎

See Proposition 1

Proof.

Consider least squares regression $\min_{w\in W}\sum_{i=1}^{3}(x_{i}^{\top}w-y_{i})^{2}$ , where $x_{1}=(0,1)$ , $x_{2}=(-\frac{\sqrt{3}}{2},-\frac{1}{2})$ , $x_{3}=(\frac{\sqrt{3}}{2},-\frac{1}{2})$ , $y_{1}=1,y_{2}=1,y_{3}=1$ , and $S=\{x_{1},x_{2},x_{3}\}$ .

It can be shown that $(\eta_{k},d_{k})=(1,x_{i})$ , that is, $w_{k+1}=x_{i}$ . ∎

See Theorem 5

Proof.

Let $u_{k}=\operatorname*{arg\,min}_{d\in S}f((1-\eta_{k})w_{k}+\eta_{k}d)$. We have

[TABLE]

We have

[TABLE]

where the first inequality is due to the curvature assumption, the second inequality due to Eq. 24, the third due to the duality bound, and the last due to Cauchy-Schwarz and $\lVert d_{k}-u_{k}\rVert_{2}\leq D$ . Now using convexity and telescoping the above inequality over $k$ , we have

[TABLE]

Using $f(w_{t+1})\geq f(w^{*})$, taking expectation, and using ${\mathbb{E}}\lVert\operatorname{\nabla}f(w_{k})-\operatorname{\nabla}\tilde{f}_{k}(w_{k})\rVert_{2}\leq\frac{L}{\sqrt{b}}$ (this is because $\left({\mathbb{E}}\lVert\operatorname{\nabla}f(w_{k})-\operatorname{\nabla}\tilde{f}_{k}(w_{k})\rVert_{2}\right)^{2}\leq{\mathbb{E}}\lVert\operatorname{\nabla}f(w_{k})-\operatorname{\nabla}\tilde{f}_{k}(w_{k})\rVert_{2}^{2}=\frac{1}{b}{\mathbb{E}}\lVert\operatorname{\nabla}f_{i}(w_{k})-\operatorname{\nabla}f(w_{k})\rVert_{2}^{2}\leq\frac{1}{b}{\mathbb{E}}\lVert\operatorname{\nabla}f_{i}(w_{k})\rVert_{2}^{2}\leq\frac{L^{2}}{b}$, where $i$ is randomly drawn from $[n]$), we obtain

[TABLE]

When $b_{k}=t$ and $\eta_{k}=t^{-1/2}$ for all $k$ , we have

[TABLE]

When $b_{k}=k$ , and $\eta_{k}=k^{-1/2}$ , we have

[TABLE]

∎
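The minibatch bound ${\mathbb{E}}\lVert\operatorname{\nabla}f(w_{k})-\operatorname{\nabla}\tilde{f}_{k}(w_{k})\rVert_{2}\leq\frac{L}{\sqrt{b}}$ used in this proof can be illustrated on a toy instance. The sketch below is not from the paper. It assumes that $f$ is the average of the $f_{i}$ and that the minibatch is drawn uniformly with replacement, and it enumerates all $n^{b}$ minibatches so that the expectations are exact. The per-example gradient values are hypothetical.

```python
import itertools
import math

# Hypothetical per-example gradients grad f_i(w_k) in R^2, for n = 4 examples.
grads = [(1.0, 0.0), (0.0, 2.0), (-1.0, 1.0), (2.0, -1.0)]
n = len(grads)
full = tuple(sum(g[j] for g in grads) / n for j in range(2))   # grad f(w_k)
L = max(math.hypot(*g) for g in grads)                          # bound on ||grad f_i||

def minibatch_stats(b):
    """Exact E||err|| and E||err||^2 over all n^b minibatches (with replacement)."""
    mean_norm, mean_sq = 0.0, 0.0
    for batch in itertools.product(range(n), repeat=b):
        est = tuple(sum(grads[i][j] for i in batch) / b for j in range(2))
        err = math.hypot(est[0] - full[0], est[1] - full[1])
        mean_norm += err / n ** b
        mean_sq += err ** 2 / n ** b
    return mean_norm, mean_sq

single_sq = minibatch_stats(1)[1]   # E||grad f_i - grad f||^2
checks = []
for b in (1, 2, 3, 4):
    mean_norm, mean_sq = minibatch_stats(b)
    checks.append(mean_norm ** 2 <= mean_sq + 1e-12           # Jensen's inequality
                  and abs(mean_sq - single_sq / b) < 1e-9     # variance of a mean
                  and mean_norm <= L / math.sqrt(b) + 1e-12)  # final bound
```

The three conditions mirror the three steps of the argument: Jensen's inequality, the $\frac{1}{b}$ scaling of the variance of an average of independent samples, and the bound of the variance by $L^{2}$.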

See Proposition 2

Proof.

Consider $\min_{\lVert w\rVert_{2}\leq r}\sum_{i=1}^{2}(x_{i}^{\top}w-y_{i})^{2}$ , where $x_{1}=x_{2}=x=(1,1)$ , $y_{1}=1$ , $y_{2}=-1$ , and $r=1/2$ . Here we can take $S=\{w:\lVert w\rVert_{2}=r\}$ .

We first prove convergence. Let $X_{i}$ be the random variable taking value 1 when $(x_{1},y_{1})$ is sampled at iteration $i$, and value -1 otherwise. Define $Y_{k+1}=(1-\gamma_{k})Y_{k}+\gamma_{k}X_{k}$; then it can be verified that $w_{k}=Y_{k}x$. In addition, we can show that $Y_{k}$ converges in probability to 0, which implies that $Y_{k}x$ converges in probability to a minimizer $w^{*}=(0,0)$ of $f(w)$, and thus ${\mathbb{E}}(f(w_{k}))-f(w^{*})$ converges to 0.

We prove the concentration result for $Y_{k}$ in the more general case where the $X_{i}$ are drawn i.i.d. from a distribution on $[a,b]$ with mean $\mu$, instead of from the uniform distribution on $\{-1,1\}$. First we have $Y_{k}=\sum_{i=0}^{k}w_{i}X_{i}$, where $w_{i}=\frac{2}{i+2}\frac{i+1}{i+3}\ldots\frac{k}{k+2}\leq\frac{2}{k+2}$. By Hoeffding’s inequality, we have

[TABLE]

Since each $w_{i}\leq\frac{2}{k+2}$ , we have

[TABLE]
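The weights $w_{i}$ can be checked by unrolling the averaging recursion. The sketch below is not part of the original proof. It assumes $\gamma_{j}=\frac{2}{j+2}$ (the step size $\frac{2}{k+2}$ that appears later in this proof), under which $1-\gamma_{j}=\frac{j}{j+2}$. The function name is hypothetical.

```python
from fractions import Fraction

def averaging_weights(k):
    """Weights of X_0..X_k after unrolling Y_{j+1} = (1 - g_j) Y_j + g_j X_j,
    with the assumed step size g_j = 2 / (j + 2)."""
    weights = []
    for i in range(k + 1):
        w = Fraction(2, i + 2)             # g_i, the weight X_i enters with
        for j in range(i + 1, k + 1):
            w *= Fraction(j, j + 2)        # later steps shrink it by 1 - g_j
        weights.append(w)
    return weights

k = 20
ws = averaging_weights(k)
total = sum(ws)        # the weights form a convex combination
largest = max(ws)      # attained at i = k
```

Under this assumption the product telescopes to $w_{i}=\frac{2(i+1)}{(k+1)(k+2)}$, so the weights sum to 1 and the largest one is exactly $\frac{2}{k+2}$, which is the bound used before applying Hoeffding’s inequality.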

We have $\operatorname{\nabla}f_{i}(w_{k})=2(x_{i}^{\top}w_{k}-y_{i})x_{i}$, and $\operatorname{\nabla}f(w_{k})=4(x^{\top}w_{k})x$. In addition, let $d_{k,i}=\operatorname*{arg\,min}_{d\in S}\langle d,\operatorname{\nabla}f_{i}(w_{k})\rangle$, then

[TABLE]

where $\operatorname{sgn}(x_{i}^{\top}w_{k}-y_{i})=\operatorname{sgn}(y_{i})$ because $x_{i}^{\top}w_{k}$ is not large enough to change the sign of $y_{i}$. Hence we have

[TABLE]

On the other hand, we have

[TABLE]

We thus obtain

[TABLE]

Note that $w_{k}$ has nonzero probability of being $x$ , thus there is a nonzero probability that the above difference equals $4r|x^{\top}w_{k}|\lVert x\rVert_{2}=4\sqrt{2}$ . However $\eta_{k}=2/(k+2)$ converges to 0, thus there is no constant $c$ such that

[TABLE]

for all $k$ and all $w_{k}$ . ∎

See Proposition 3

Proof.

Consider least squares regression $\min_{\lVert w\rVert_{2}\leq r}\sum_{i=1}^{3}(x_{i}^{\top}w-y_{i})^{2}$, where $r<1$, $x_{1}=(0,1)$, $x_{2}=(-\frac{\sqrt{3}}{2},-\frac{1}{2})$, $x_{3}=(\frac{\sqrt{3}}{2},-\frac{1}{2})$, and $y_{1}=1,y_{2}=-1,y_{3}=-1$.

It can be shown that $w_{k}$ converges in probability to $(0,2r/3)$ as $k\to\infty$ . However, the optimal solution is $(0,r)$ . ∎
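The claim that the optimal solution is $(0,r)$ can be checked by brute force. The sketch below is not from the paper and does not simulate the stochastic algorithm. It only evaluates the objective on a polar grid over the feasible disk, with $r=0.5$ chosen as one illustrative value satisfying $r<1$. The function names are hypothetical.

```python
import math

xs = [(0.0, 1.0), (-math.sqrt(3) / 2, -0.5), (math.sqrt(3) / 2, -0.5)]
ys = [1.0, -1.0, -1.0]

def f(w):
    """Least squares objective sum_i (x_i^T w - y_i)^2."""
    return sum((x[0] * w[0] + x[1] * w[1] - y) ** 2 for x, y in zip(xs, ys))

def best_on_disk(r, n_angles=720, n_radii=50):
    """Brute-force minimizer of f over a polar grid on the disk ||w||_2 <= r."""
    best = (f((0.0, 0.0)), (0.0, 0.0))
    for a in range(n_angles):
        theta = 2 * math.pi * a / n_angles
        for s in range(1, n_radii + 1):
            rho = r * s / n_radii
            w = (rho * math.cos(theta), rho * math.sin(theta))
            best = min(best, (f(w), w))
    return best

r = 0.5                                # illustrative value with r < 1
value, w_best = best_on_disk(r)        # grid minimizer, expected near (0, r)
gap = f((0.0, 2 * r / 3)) - value      # suboptimality of the limit point (0, 2r/3)
```

For these data the objective simplifies to $\frac{3}{2}\lVert w\rVert_{2}^{2}-4w_{2}+3$, whose unconstrained minimizer $(0,\frac{4}{3})$ lies outside the disk, so the constrained minimizer is on the boundary at $(0,r)$ and the point $(0,2r/3)$ is strictly suboptimal.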

See Lemma 3

Proof.

We have

[TABLE]

Hence we have $\lVert w^{*}_{k}-w^{*}_{k+1}\rVert_{2}\leq\frac{2\sigma_{k}\lVert g_{k}\rVert_{2}}{\rho}$ , and this implies

[TABLE]

We claim that

[TABLE]

This is equivalent to

[TABLE]

This holds when $s=0$ by the definition of $w^{*}_{1}$ . If this holds for some $s$ , then this holds for $s+1$ as follows,

[TABLE]

where the first inequality uses the inductive assumption, and the second one holds by the definition of $w^{*}_{s+2}$ .

Combining Eq. 27 and Eq. 28, we have

[TABLE]

∎

See Lemma 4

Proof.

We first transform the recurrence in Eq. 21 in the form $e_{k+1}\leq h(e_{k})$ for some function $h$ . For nonnegative numbers $A,B,C$ to satisfy $A\leq B+C\sqrt{A}$ , we need to have $(\sqrt{A}-\frac{C}{2})^{2}\leq B+\frac{C^{2}}{4}$ , or $\sqrt{A}\leq\sqrt{B+\frac{C^{2}}{4}}+\frac{C}{2}$ . This implies $A\leq B+\frac{C^{2}}{2}+C\sqrt{B+\frac{C^{2}}{4}}$ . Applying this transformation to the recurrence in Eq. 21, we have

[TABLE]

where $A^{\prime}=(1-\eta_{k})e_{k}+\frac{\lambda R^{2}}{\rho}\eta_{k}^{2}+\frac{c^{2}L^{2}}{\rho}\eta_{k}^{2}$ . The second inequality holds by observing that $\frac{\sigma_{k}^{2}L^{2}}{\rho}$ and $\frac{\sigma_{k}^{2}L^{2}}{2\rho}$ in the first inequality are smaller than $\frac{c^{2}L^{2}}{\rho}\eta_{k}^{2}$ , because $\sigma_{k}^{2}=c^{2}\eta_{k}^{3}\leq c^{2}\eta_{k}^{2}$ .
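The transformation from $A\leq B+C\sqrt{A}$ to an explicit bound on $A$ can be sanity-checked numerically. The sketch below is not part of the original proof. It draws random nonnegative $B$ and $C$ (the sampling range is arbitrary), computes the largest admissible $A$ from the completed square, and confirms that it agrees with the expanded expression. The function name is hypothetical.

```python
import math
import random

def largest_A(B, C):
    """Largest A >= 0 satisfying A <= B + C*sqrt(A), from completing the square."""
    return (math.sqrt(B + C * C / 4) + C / 2) ** 2

rng = random.Random(0)   # fixed seed so the check is reproducible
ok = True
for _ in range(1000):
    B, C = rng.uniform(0, 10), rng.uniform(0, 10)
    A_max = largest_A(B, C)
    expanded = B + C * C / 2 + C * math.sqrt(B + C * C / 4)
    above = 1.01 * A_max + 1e-9
    # The squared form and the expanded form agree.
    ok = ok and math.isclose(A_max, expanded, rel_tol=1e-12)
    # The inequality is tight at A_max ...
    ok = ok and math.isclose(A_max, B + C * math.sqrt(A_max), rel_tol=1e-9)
    # ... and fails for values slightly above A_max.
    ok = ok and above > B + C * math.sqrt(above)
```

The last check reflects that $A-C\sqrt{A}-B$ is increasing once $\sqrt{A}>\frac{C}{2}$, so every $A$ satisfying the original inequality is at most `A_max`.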

Now we find a value of $K$ by determining a sufficient condition on $K$ such that $e_{k+1}\leq K\eta_{k+1}$ holds when $e_{k}\leq K\eta_{k}$. Assume $e_{k}\leq K\eta_{k}$ for some $k$. Then

[TABLE]

where $M=\frac{\lambda R^{2}+c^{2}L^{2}}{\rho}+\sqrt{\frac{2}{\rho}}cL\sqrt{K}$ .

• When Eq. 30 holds, the recurrence for $e_{k+1}$ derived above gives $e_{k+1}\leq K\eta_{k+1}$.

• When Eq. 31 holds, we have $K\eta_{k}>K\eta_{k+1}>A^{\prime}$. Thus $A^{\prime}+\sqrt{\frac{2}{\rho}}cL\sqrt{K}\eta_{k}^{2}=A^{\prime}+\sqrt{\frac{2}{\rho}}c\eta_{k}^{3/2}L\sqrt{K\eta_{k}}\geq A^{\prime}+\sqrt{\frac{2}{\rho}}\sigma_{k}L\sqrt{A^{\prime}}$.

• Both Eq. 32 and Eq. 33 are just rewritings of the previous inequality.

• To show that Eq. 34 implies Eq. 33, it suffices to show that $\frac{\eta_{k}^{2}}{\eta_{k}-\eta_{k+1}}\geq\frac{1}{p}$ for any $k\geq 1$. Using calculus, we have $(1+\frac{1}{k})^{p}\leq 1+\frac{p}{k}\leq 1+\frac{p}{k^{p}}$ for $k\geq 1$. Hence we have $\eta_{k}-\eta_{k+1}=\frac{1}{k^{p}}-\frac{1}{(k+1)^{p}}=\frac{1}{(k+1)^{p}}\left(\frac{(k+1)^{p}}{k^{p}}-1\right)=\frac{1}{(k+1)^{p}}\left((1+\frac{1}{k})^{p}-1\right)\leq\frac{1}{(k+1)^{p}}\frac{p}{k^{p}}$. It follows that $\frac{\eta_{k}^{2}}{\eta_{k}-\eta_{k+1}}\geq\frac{1}{p}\frac{(k+1)^{p}}{k^{p}}\geq\frac{1}{p}$.
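The step-size inequality $\frac{\eta_{k}^{2}}{\eta_{k}-\eta_{k+1}}\geq\frac{1}{p}$ for $\eta_{k}=k^{-p}$ can be checked directly. The sketch below is not part of the original proof. It assumes $p\in(0,1]$, the range in which $(1+\frac{1}{k})^{p}\leq 1+\frac{p}{k}$ holds, and the tested values of $p$ and $k$ are arbitrary. The function name is hypothetical.

```python
def ratio(k, p):
    """eta_k^2 / (eta_k - eta_{k+1}) for the step size eta_k = k^(-p)."""
    eta_k, eta_next = k ** -p, (k + 1) ** -p
    return eta_k ** 2 / (eta_k - eta_next)

ps = (0.1, 0.25, 0.5, 0.75, 1.0)
# The ratio should be at least 1/p for every tested p and every k >= 1.
ratios_ok = all(ratio(k, p) >= 1.0 / p for p in ps for k in range(1, 2001))
```

For $p=1$ the ratio equals $\frac{k+1}{k}$ exactly, which shows that the bound $\frac{1}{p}$ is approached as $k$ grows but never attained.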

Now we solve Eq. 34. This is equivalent to $K=\frac{M}{1-p}=\frac{\lambda R^{2}+c^{2}L^{2}}{\rho(1-p)}+\sqrt{\frac{2}{\rho}}\frac{cL}{1-p}\sqrt{K}$ , which is equivalent to $(\sqrt{K}-\sqrt{\frac{1}{2\rho}}\frac{cL}{1-p})^{2}=\frac{\lambda R^{2}+c^{2}L^{2}}{\rho(1-p)}+\frac{c^{2}L^{2}}{2\rho(1-p)^{2}}$ , or

[TABLE]
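The fixed-point equation for $K$ can be solved and verified numerically. The sketch below is not from the paper. It treats the equation as $K=b+a\sqrt{K}$ with $a=\sqrt{\frac{2}{\rho}}\frac{cL}{1-p}$ and $b=\frac{\lambda R^{2}+c^{2}L^{2}}{\rho(1-p)}$, takes the positive root of the quadratic in $\sqrt{K}$, and checks $K=\frac{M}{1-p}$. All parameter values and the function name `solve_K` are hypothetical.

```python
import math

def solve_K(lam, R, c, L, rho, p):
    """Solve K = b + a*sqrt(K) for K > 0, with a and b as defined in the lead-in."""
    a = math.sqrt(2.0 / rho) * c * L / (1.0 - p)
    b = (lam * R ** 2 + c ** 2 * L ** 2) / (rho * (1.0 - p))
    sqrt_K = a / 2.0 + math.sqrt(b + a ** 2 / 4.0)   # positive root in sqrt(K)
    return sqrt_K ** 2

# Hypothetical parameter values, chosen only to exercise the formula.
params = dict(lam=0.1, R=2.0, c=0.5, L=3.0, rho=1.0, p=0.5)
K = solve_K(**params)

# M as defined in the proof, evaluated at the computed K.
M = ((params["lam"] * params["R"] ** 2 + params["c"] ** 2 * params["L"] ** 2) / params["rho"]
     + math.sqrt(2.0 / params["rho"]) * params["c"] * params["L"] * math.sqrt(K))
residual = K - M / (1.0 - params["p"])   # should vanish up to rounding
```

Only the positive root is relevant because $\sqrt{K}$ must be nonnegative and the other root of the quadratic is nonpositive.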

To complete the proof it suffices to show that $e_{1}\leq K\eta_{1}$ . This holds because

[TABLE]

∎

Bibliography

1. Barron [1993] Andrew R Barron. Universal approximation bounds for superpositions of a sigmoidal function. Information Theory, IEEE Transactions on, 39(3):930–945, 1993.
2. Boyd and Vandenberghe [2004] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
3. Defazio et al. [2014] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
4. Donahue et al. [1997] Michael J Donahue, C Darken, Leonid Gurvits, and Eduardo Sontag. Rates of convex approximation in non-Hilbert spaces. Constructive Approximation, 13(2):187–220, 1997.
5. Frank and Wolfe [1956] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval research logistics quarterly, 3(1-2):95–110, 1956.
6. Garber and Hazan [2015] Dan Garber and Elad Hazan. Faster rates for the Frank-Wolfe method over strongly-convex sets. In Proceedings of the 32nd International Conference on Machine Learning, volume 951, pages 541–549, 2015.
7. Garber and Hazan [2016] Dan Garber and Elad Hazan. A Linearly Convergent Variant of the Conditional Gradient Algorithm under Strong Convexity, with Applications to Online and Stochastic Optimization. SIAM Journal on Optimization, 26(3):1493–1528, 2016.
8. Hazan and Kale [2012] Elad Hazan and Satyen Kale. Projection-free Online Learning. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 521–528, 2012.
