Approximate and Stochastic Greedy Optimization
Nan Ye, Peter Bartlett

TL;DR
This paper analyzes approximate and stochastic greedy algorithms for convex optimization, establishing convergence conditions, rates, and equivalences, and demonstrating their effectiveness on smooth and nonsmooth functions.
Contribution
It provides a unified convergence analysis for approximate greedy algorithms, introduces stochastic variants with proven convergence, and compares their performance on different convex functions.
Findings
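Approximate Jones and FW algorithms converge on smooth convex functions under sufficient conditions on the step sizes and approximation errors, with an O(1/k) rate for suitable choices.
Replacing the full gradient with a single stochastic gradient can fail even on smooth convex functions; suitably designed stochastic variants converge.
A new approximate stochastic FW algorithm converges for nonsmooth convex functions.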
Nan Ye
QUT & ACEMS
Peter Bartlett
UC Berkeley & QUT & ACEMS
Abstract
We consider two greedy algorithms for minimizing a convex function in a bounded convex set: an algorithm by Jones (1992) and the Frank-Wolfe (FW) algorithm. We first consider approximate versions of these algorithms. For smooth convex functions, we give sufficient conditions for convergence, a unified analysis for the well-known convergence rate of $O(1/k)$ together with a result showing that this rate is the best obtainable from the proof technique, and an equivalence result for the two algorithms. We also consider approximate stochastic greedy algorithms for minimizing expectations. We show that replacing the full gradient by a single stochastic gradient can fail even on smooth convex functions. We give a convergent approximate stochastic Jones algorithm and a convergent approximate stochastic FW algorithm for smooth convex functions. In addition, we give a convergent approximate stochastic FW algorithm for nonsmooth convex functions. Convergence rates for these algorithms are given and proved.
1 Introduction
Consider the following problem of minimizing a convex function $f$ over a convex set,
$$\min_{x \in \mathrm{co}(A)} f(x),$$
where $\mathrm{co}(A)$ is the convex hull of a set $A$ of atoms in a linear vector space. Such problems occur frequently in machine learning and engineering (Boyd and Vandenberghe, 2004). We consider greedy algorithms which start with some initial point, and then iteratively find $x_{k+1} = (1 - \alpha_k) x_k + \alpha_k a_k$, where the atom $a_k \in A$ and/or the step size $\alpha_k \in [0, 1]$ are greedily chosen according to a certain criterion. An attractive feature of such algorithms is that the iterates are sparse, because each iteration adds at most one new atom in $A$.
Two greedy algorithms are well-known: an algorithm originally studied by Jones (1992), and the Frank-Wolfe (FW) algorithm (Frank and Wolfe, 1956). Jones’ algorithm chooses
$$a_k \in \arg\min_{a \in A} f\big((1 - \alpha_k) x_k + \alpha_k a\big).$$
This has been studied in various contexts, such as function approximation in the Hilbert space (Jones, 1992; Barron, 1993; Lee et al., 1996), regression (Donahue et al., 1997), density estimation (Li and Barron, 1999), and is closely related to boosting (Zhang, 2003). The FW algorithm chooses
$$a_k \in \arg\min_{a \in A} \langle \nabla f(x_k), a \rangle,$$
and chooses the step size $\alpha_k$ by line search or a priori. The FW algorithm has recently attracted significant interest due to its projection-free property and the ability to handle structural constraints (Jaggi, 2013). In contrast to solving quadratic programs for projection in projected gradient descent and for the proximal map in the proximal algorithms, the FW algorithm solves a linear program at each step, which is often computationally more tractable (Jaggi et al., 2010; Lacoste-Julien and Jaggi, 2013). Approximate versions of Jones’ algorithm and the FW algorithm have also been studied; see, for example, Zhang (2003) and Jaggi (2013).
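As a concrete illustration of the two selection rules, the following minimal sketch runs exact Jones and FW steps on a toy problem. The problem is an assumption made for illustration and is not taken from the paper: the atoms are the standard basis vectors (so the feasible set is the probability simplex), the objective is $f(x) = \frac{1}{2}\|x - t\|^2$ for a hypothetical target $t$, and the step size is fixed a priori as $2/(k+2)$. The helper names (`fw_atom`, `jones_atom`, `run`) are likewise hypothetical.

```python
from typing import List

Vector = List[float]


def objective(x: Vector, target: Vector) -> float:
    """Toy objective f(x) = 0.5 * ||x - target||^2 (smooth and convex)."""
    return 0.5 * sum((xi - ti) ** 2 for xi, ti in zip(x, target))


def gradient(x: Vector, target: Vector) -> Vector:
    """Gradient of the toy objective."""
    return [xi - ti for xi, ti in zip(x, target)]


def atoms(dim: int) -> List[Vector]:
    """Standard basis vectors; their convex hull is the probability simplex."""
    return [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]


def mix(x: Vector, a: Vector, alpha: float) -> Vector:
    """Greedy update (1 - alpha) * x + alpha * a."""
    return [(1.0 - alpha) * xi + alpha * ai for xi, ai in zip(x, a)]


def fw_atom(x: Vector, target: Vector, A: List[Vector]) -> Vector:
    """FW rule: the atom minimizing the linearization <grad f(x), a>."""
    g = gradient(x, target)
    return min(A, key=lambda a: sum(gi * ai for gi, ai in zip(g, a)))


def jones_atom(x: Vector, target: Vector, A: List[Vector], alpha: float) -> Vector:
    """Jones rule: the atom minimizing f at the updated point, for a given step size."""
    return min(A, key=lambda a: objective(mix(x, a, alpha), target))


def run(rule: str, target: Vector, iters: int) -> Vector:
    """Run `iters` exact greedy steps with the a priori step size 2 / (k + 2)."""
    A = atoms(len(target))
    x = A[0]
    for k in range(iters):
        alpha = 2.0 / (k + 2)
        if rule == "fw":
            a = fw_atom(x, target, A)
        else:
            a = jones_atom(x, target, A, alpha)
        x = mix(x, a, alpha)
    return x
```

For example, `run("fw", [0.5, 0.3, 0.2], 200)` and `run("jones", [0.5, 0.3, 0.2], 200)` both return points on the simplex close to the target. In this sketch the Jones rule evaluates $f$ once per atom, whereas the FW rule needs only one gradient and a linear minimization.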
In this paper, we first consider approximate versions of Jones’ algorithm and the FW algorithm, with a more general approximate version for Jones’ algorithm. We focus on smooth convex functions in our analysis, and give a sufficient convergence condition for both algorithms. Building on previous results on the convergence rates for both algorithms, we present a unified analysis for the $O(1/k)$ convergence rate, and also show that this rate is the best that can be obtained with the proof technique. We also show that the approximate Jones’ algorithm and the approximate FW algorithm are equivalent.
We then consider stochastic versions of these approximate greedy algorithms for the stochastic approximation problem, where is an expectation over some random variable . We show that some stochastic versions fail even on smooth convex functions. We give an approximate stochastic Jones algorithm that has error using random for smooth convex functions. We also give an approximate stochastic FW algorithm that has an error using stochastic gradients and linear optimizations. In addition, we give an approximate stochastic Frank-Wolfe algorithm that has error using stochastic gradients for nonsmooth convex functions. The algorithms also apply to the finite-sum setting where . The finite-sum form occurs when performing empirical risk minimization in machine learning, or when performing M-estimation in statistics. In both cases, each measures how well a model fits an example.
Stochastic algorithms originated in the 1950s (Robbins and Monro, 1951), and have attracted much interest in recent years, mainly due to their ability to scale up to large datasets. We note that stochastic FW algorithms have recently been considered for smooth functions by Reddi et al. (2016) and Hazan and Luo (2016). Reddi et al. (2016) considered the non-convex setting, and showed that one can achieve an error of with stochastic gradients and linear optimizations. When $f$ is a finite sum, the number of stochastic gradients needed can be reduced to . Hazan and Luo (2016) considered the convex setting, and showed that one can achieve an error of with full gradients, stochastic gradients, and linear optimizations. The number of stochastic gradients can be reduced to if $f$ is strongly convex. Both works use recent variance reduction techniques in convex optimization, such as the works of Johnson and Zhang (2013); Mahdavi et al. (2013); Defazio et al. (2014). Hazan and Luo (2016) additionally use Nesterov (1983)’s acceleration technique. They use exact greedy steps, instead of approximate greedy steps as in this paper.
For the non-stochastic case, faster rates for FW are known with additional assumptions (Lacoste-Julien and Jaggi, 2015; Garber and Hazan, 2015, 2016). We refer the readers to the works of Hazan and Luo (2016) and Reddi et al. (2016) for further related works.
2 Approximate Greedy Optimization
We consider the approximate Jones’ algorithm in Algorithm 1. At each iteration, the algorithm solves the optimization problem with an error of . We call this an -approximate Jones’ algorithm, and we say the algorithm is a -Jones algorithm if there is a constant such that for all .
We leave the choice of unspecified, and thus this includes algorithms which fix a priori, or choose and jointly at each iteration. Similarly, may be chosen a priori or chosen adaptively.
An algorithm is called an -approximate FW algorithm, if given , the algorithm yields such that
[TABLE]
and we say the algorithm is a -FW algorithm for some if .
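To make the idea of an approximate FW step concrete, here is a minimal sketch of an approximate linear minimization oracle. It assumes the approximation is additive in the linearized objective, that is, the returned atom $a_k$ satisfies $\langle \nabla f(x_k), a_k \rangle \le \min_{a \in A} \langle \nabla f(x_k), a \rangle + \epsilon_k$; this form, the symbol $\epsilon_k$, and the function name `approx_fw_atom` are assumptions for illustration. The oracle deliberately returns a random admissible atom to mimic an inexact solver.

```python
import random
from typing import List, Sequence

Vector = List[float]


def inner(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean inner product."""
    return sum(ui * vi for ui, vi in zip(u, v))


def approx_fw_atom(grad: Vector, A: List[Vector], eps: float,
                   rng: random.Random) -> Vector:
    """Return some atom whose linearized value is within eps of the best one.

    Any atom a with <grad, a> <= min_a' <grad, a'> + eps is admissible,
    so one is drawn uniformly at random from the admissible set.
    """
    scores = [inner(grad, a) for a in A]
    best = min(scores)
    admissible = [a for a, s in zip(A, scores) if s <= best + eps]
    return rng.choice(admissible)
```

With `eps = 0.0` the oracle reduces to an exact FW step (up to ties); larger values of `eps` admit more atoms and model a cheaper, less accurate inner solver.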
2.1 Assumptions
In this section, we assume is convex with bounded curvature, that is,
[TABLE]
where is the Bregman divergence of , and the LHS of the second equation is called the curvature of in . This definition of curvature is the same as that in (Jaggi, 2013), except that Jaggi (2013) takes supremum over . If is -smooth, that is, for all , then the curvature of is not more than . Thus a smooth function has bounded curvature. The curvature of is also not more than , assuming the second-order derivative exists.
The following are two basic bounds needed in our analysis.
Lemma 1.
(a) (Duality bound) If is convex on , , then for any ,
[TABLE]
(b) (Curvature inequality) If has curvature at most , then for any , , ,
[TABLE]
Proof.
(a) Using the definitions, we have
[TABLE]
(b) From the definition of Bregman divergence, we have
[TABLE]
Apply the definition of curvature, then the desired inequality follows. ∎
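For reference, these quantities are usually written as follows (cf. Jaggi, 2013). The notation is an assumption chosen for illustration ($A$ for the atom set, $x^*$ for a minimizer over $\mathrm{co}(A)$, $C$ for the curvature bound, $y = (1-\alpha)x + \alpha a$ for the updated point), and the exact set over which the supremum defining the curvature is taken may differ from the one used here.

```latex
\begin{align*}
% Bregman divergence of a differentiable function f
D_f(y, x) &= f(y) - f(x) - \langle \nabla f(x),\, y - x \rangle, \\
% bounded curvature, with y = (1 - \alpha) x + \alpha a and \alpha \in (0, 1]
\frac{2}{\alpha^2}\, D_f(y, x) &\le C, \\
% (a) duality bound: convexity, then a linear function on co(A) is maximized at an atom
f(x) - f(x^*) &\le \langle \nabla f(x),\, x - x^* \rangle
               \le \sup_{a \in A} \langle \nabla f(x),\, x - a \rangle, \\
% (b) curvature inequality: expand f(y) via D_f, then apply the curvature bound
f(y) &= f(x) + \alpha \langle \nabla f(x),\, a - x \rangle + D_f(y, x)
      \le f(x) + \alpha \langle \nabla f(x),\, a - x \rangle + \frac{C}{2}\,\alpha^2 .
\end{align*}
```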
In general, we cannot improve the quadratic term to a higher-order one in the curvature inequality. For example, if is -strongly convex, then we can show that is infinity.
2.2 A Sufficient Condition for Convergence
The core to our convergence analysis for Jones’ algorithm and the FW algorithm is the following recurrence equation for the error .
Lemma 2.
Let be convex with curvature at most . Then for both -approximate Jones’ algorithm and -approximate FW algorithm the error satisfies
[TABLE]
where .
We defer the proof of this lemma, and a few other proofs omitted from the main text, to the supplementary material due to space limitations.
The above lemma leads to a general convergence result for Jones’ algorithm and the FW algorithm.
Theorem 1.
Let be convex with curvature at most . For an -approximate Jones’ algorithm or an -approximate FW algorithm, if ’s and ’s are chosen such that diverges, and as , then as .
Proof.
From Lemma 2, it suffices to show that under the given conditions on and , the solution to the recurrence equation satisfies .
For any such that , there exists such that for all , we have , , because and . For any , if , then we have
[TABLE]
Since diverges, if , then there exists such that . We show by induction that for all , we have . This is true for . For the inductive case, assume . If , then , and thus . If , then . We have thus proved that for any , there exists such that for all , . Thus as . ∎
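The behaviour described by this convergence result can be checked numerically on a scalar recurrence. The sketch below assumes the recurrence has the commonly used form $e_{k+1} = (1 - \alpha_k) e_k + \frac{C}{2}\alpha_k^2 + \epsilon_k$, taken with equality as a worst case, and uses a hypothetical schedule $\alpha_k = 1/\sqrt{k+1}$, $\epsilon_k = \alpha_k^2$, for which $\sum_k \alpha_k$ diverges while $\alpha_k \to 0$ and $\epsilon_k / \alpha_k \to 0$. The form of the recurrence, the schedule, and the name `simulate` are assumptions for illustration.

```python
import math


def simulate(e0: float, curvature: float, num_iters: int) -> float:
    """Iterate e <- (1 - alpha) e + (C / 2) alpha^2 + eps and return the final error."""
    e = e0
    for k in range(num_iters):
        alpha = 1.0 / math.sqrt(k + 1)   # step sizes: divergent sum, tending to 0
        eps = alpha ** 2                 # per-step optimization error, eps / alpha -> 0
        e = (1.0 - alpha) * e + 0.5 * curvature * alpha ** 2 + eps
    return e
```

With `curvature = 2.0`, a short induction shows that the error after $K$ steps is at most $4/\sqrt{K+1}$ regardless of `e0`, so the error tends to 0, although more slowly than with the $2/(k+2)$ schedule.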
2.3 Convergence Rate
We now show that with proper choices of $\alpha_k$’s and $\epsilon_k$’s, we can obtain a convergence rate of $O(1/k)$ for Jones’ algorithm and the FW algorithm.
Theorem 2.
Let be convex with curvature at most , for . Then for the iterates obtained using a -Jones algorithm or a -FW algorithm, when ,
[TABLE]
The constant in the rate can be improved in some cases. For example, if the minimizer is an algebraic interior point, then we can get a smaller constant using an argument similar to that in (Zhang, 2003).
A careful look at the analysis shows that if our update rule is guaranteed to generate a new iterate whose objective value is not more than that of the iterate generated by a -FW algorithm or a -Jones algorithm with step size , then we can get an $O(1/k)$ convergence rate. This also implies that we can mix -FW steps and -Jones steps to get an $O(1/k)$ convergence rate. In addition, we can obtain the following result from Zhang (2003) as a special case.
Corollary 1.
Let be convex with curvature at most . If is chosen such that
[TABLE]
where is some constant, then for , we have .
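The induction behind the $O(1/k)$ rate can be verified numerically in the simplest setting. The sketch below assumes exact greedy steps and the worst-case recurrence $e_{k+1} = (1 - \alpha_k) e_k + \frac{C}{2}\alpha_k^2$ with the step size $\alpha_k = 2/(k+2)$, indexed from $k = 0$; under these assumptions the standard bound is $e_k \le 2C/(k+2)$ for all $k \ge 1$. The indexing, the absence of an approximation term, and the name `worst_case_errors` are illustrative choices.

```python
from typing import List


def worst_case_errors(e0: float, curvature: float, num_iters: int) -> List[float]:
    """Return [e_0, e_1, ..., e_num_iters] for the worst-case recurrence."""
    errors = [e0]
    for k in range(num_iters):
        alpha = 2.0 / (k + 2)            # alpha_0 = 1 erases the initial error
        errors.append((1.0 - alpha) * errors[-1] + 0.5 * curvature * alpha ** 2)
    return errors
```

Because $\alpha_0 = 1$, the first step gives $e_1 = C/2$ whatever the starting error is, and the inductive step uses $(k+1)(k+3) \le (k+2)^2$ to pass from $2C/(k+2)$ to $2C/(k+3)$.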
The key idea in the above analysis is to show that , and then use induction to show that when . Can we tune to obtain a bound for some ? It turns out that $O(1/k)$ is the best obtainable.
Theorem 3.
Consider a sequence satisfying
[TABLE]
with , then for any choice of , we have for .
Proof.
Clearly holds. Now we show by induction that if , then . Note that is minimized when , with minimum value , which is an increasing function of when . This implies that when ’s are chosen to minimize ’s, then ’s form a decreasing sequence. Since , this also implies the minimum . Hence we have
[TABLE]
where the last inequality holds because
[TABLE]
∎
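The lower bound can also be observed numerically. The sketch below assumes the recurrence has the form $e_{k+1} = (1 - \alpha_k) e_k + c\,\alpha_k^2$ for a constant $c > 0$, and picks at every step the $\alpha_k$ that minimizes the right-hand side, namely $\alpha_k = e_k / (2c)$, which gives $e_{k+1} = e_k - e_k^2/(4c)$. The constant $c$, the starting value $e_1 = c$, and the name `greedy_step_errors` are assumptions for illustration.

```python
from typing import List


def greedy_step_errors(e1: float, c: float, num_iters: int) -> List[float]:
    """Return [e_1, e_2, ...] when each step size minimizes (1 - a) e + c a^2 over [0, 1]."""
    errors = [e1]
    for _ in range(num_iters):
        e = errors[-1]
        alpha = min(1.0, e / (2.0 * c))  # unconstrained minimizer, clipped to [0, 1]
        errors.append((1.0 - alpha) * e + c * alpha ** 2)
    return errors
```

Starting from $e_1 = c$, the product $k\,e_k$ stays between $c$ and $4c$ for every $k$, so even the best possible step sizes for this recurrence cannot yield a rate faster than order $1/k$.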
2.4 An Equivalence Result
We have already seen that a few results hold for both the approximate Jones’ algorithm and the approximate FW algorithm. The following theorem shows that we can view these two algorithms as equivalent algorithms.
Theorem 4.
Assume is convex with curvature at most .
- (a) An -approximate Jones’ algorithm with step sizes is -FW. In particular, a -Jones algorithm is -FW with the same step sizes.
- (b) An -approximate FW algorithm with step sizes is an -approximate Jones’ algorithm. In particular, a -FW algorithm is -Jones with the same step sizes.
An immediate consequence of this result is that if any -Jones algorithm converges at rate, then any -FW algorithm converges at rate too.
3 Approximate Stochastic Greedy Optimization
We consider approximate stochastic versions of Jones’ algorithm and the FW algorithm for optimizing a function , where the expectation is over a random variable. Without loss of generality, we work with the finite-sum case where to ease presentation.
3.1 Stochastic Jones’ Algorithm
A natural stochastic version of Jones’ algorithm is obtained by replacing the function with a sampled approximation at iteration .
We show that the approximate stochastic Jones algorithm (ASJ) is over-greedy when and the minimization problem at each iteration is solved exactly. The iterates can jump randomly from one vertex to another, so that the error does not converge to 0. This differs from the nonstochastic case where exact minimization leads to smaller errors.
Proposition 1.
Let , and jointly optimized with in ASJ, then there exists a function with each being convex and smooth, such that does not converge to 0 as .
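The over-greedy behaviour is easy to reproduce on a one-dimensional toy problem. The example below is a hypothetical construction for illustration and is not the counterexample used in the proof: the atoms are $\{-1, 1\}$, the objective is $f(x) = \frac{1}{2}\big[(x+1)^2 + (x-1)^2\big]$ with minimizer $x = 0$, and each iteration samples a single component $f_i(x) = (x - y_i)^2$ and minimizes it exactly and jointly over the atom and the step size.

```python
import random
from typing import List

ATOMS = [-1.0, 1.0]      # co(ATOMS) is the interval [-1, 1]
TARGETS = [-1.0, 1.0]    # component functions f_i(x) = (x - TARGETS[i]) ** 2


def full_objective(x: float) -> float:
    """f(x) = average of the component functions; minimized at x = 0."""
    return sum((x - y) ** 2 for y in TARGETS) / len(TARGETS)


def exact_single_sample_step(x: float, y: float) -> float:
    """Minimize (x' - y)^2 over x' = (1 - alpha) x + alpha a, a in ATOMS, alpha in [0, 1]."""
    best_x, best_val = x, (x - y) ** 2
    for a in ATOMS:
        if a == x:
            alpha = 0.0  # moving towards the current point changes nothing
        else:
            alpha = min(1.0, max(0.0, (y - x) / (a - x)))  # clipped exact line search
        cand = (1.0 - alpha) * x + alpha * a
        val = (cand - y) ** 2
        if val < best_val:
            best_x, best_val = cand, val
    return best_x


def run_single_sample_jones(num_iters: int, seed: int) -> List[float]:
    """Return the errors f(x_k) - min f along the run, starting at the minimizer."""
    rng = random.Random(seed)
    x = 0.0
    f_star = full_objective(0.0)
    errors = []
    for _ in range(num_iters):
        x = exact_single_sample_step(x, rng.choice(TARGETS))
        errors.append(full_objective(x) - f_star)
    return errors
```

Even though the run starts at the minimizer, every iterate lands on the vertex favoured by the sampled component, so the error equals 1 at every iteration and never decreases.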
On the other hand, we can get a convergent algorithm using increasingly larger batch sizes. In essence, the theorem below shows that when we choose a batch size of at iteration with a step size , we can get an error of at any iteration . Taking as a measure of the computational complexity of the -th problem, the complexity of the algorithm to get an error of is .
Theorem 5.
Assume that the diameter of is , each is convex with curvature at most , and for all and . Let . In ASJ, when , and for all , we have
[TABLE]
When , and , we have
[TABLE]
3.2 Approximate Stochastic Versions of Frank-Wolfe
For FW, we can also sample a mini-batch estimate of the function and use the gradient of the estimate to replace the gradient of , giving the approximate stochastic FW algorithm (ASFW) shown in Algorithm 3.
We can show that if there exists a constant , for all , we have
[TABLE]
then is of the order for . The above recursive property is a sufficient but not necessary condition for ASFW to have convergence rate. Indeed, there are cases where the above recursive property does not hold, but ASFW converges.
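A mini-batch stochastic FW iteration of this kind can be sketched as follows. The setup is an assumption for illustration rather than the paper's Algorithm 3: the feasible set is the probability simplex, the components are $f_i(x) = \frac{1}{2}\|x - b_i\|^2$ for four hypothetical data points $b_i$ whose mean lies in the simplex, the batch size grows as $k+1$, the step size is $2/(k+2)$, and the linear minimization is solved exactly.

```python
import random
from typing import List

Vector = List[float]

MEAN = [0.5, 0.3, 0.2]   # mean of the data; lies in the simplex, so it minimizes f
DELTA = 0.02             # spread of the data around the mean
SHIFTS = [[1, -1, 0], [-1, 1, 0], [0, 1, -1], [0, -1, 1]]   # columns sum to zero
DATA = [[m + DELTA * s for m, s in zip(MEAN, shift)] for shift in SHIFTS]


def minibatch_gradient(x: Vector, batch: List[Vector]) -> Vector:
    """Average of grad f_i(x) = x - b_i over the sampled data points."""
    return [xi - sum(b[j] for b in batch) / len(batch) for j, xi in enumerate(x)]


def error(x: Vector) -> float:
    """f(x) - min f, which here equals 0.5 * ||x - MEAN||^2 (up to rounding)."""
    return 0.5 * sum((xi - mi) ** 2 for xi, mi in zip(x, MEAN))


def stochastic_fw(num_iters: int, seed: int) -> Vector:
    """Run FW with mini-batch gradients and an assumed growing batch size k + 1."""
    rng = random.Random(seed)
    dim = len(MEAN)
    x = [1.0] + [0.0] * (dim - 1)                 # start at the first vertex
    for k in range(num_iters):
        batch = [rng.choice(DATA) for _ in range(k + 1)]
        g = minibatch_gradient(x, batch)
        j = min(range(dim), key=lambda i: g[i])   # best basis vector for <g, a>
        alpha = 2.0 / (k + 2)
        x = [(1.0 - alpha) * xi for xi in x]      # x <- (1 - alpha) x + alpha e_j
        x[j] += alpha
    return x
```

In this toy setting the gradient noise is bounded by `DELTA` in every coordinate, so each step is at worst a `2 * DELTA`-approximate FW step, and the error after $K$ iterations is at most $4/(K+2) + 2\,\texttt{DELTA}$ for any random seed. Larger batches shrink the noise further on average, which is the motivation for letting the batch size grow.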
Proposition 2.
Let , , in ASFW. There exists a function with each being convex and smooth, such that as but Eq. 17 is not satisfied.
Proposition 3.
Let , , and be arbitrarily chosen in ASFW. There exists a convex and smooth such that exists, but the limit is larger than .
Reddi et al. (2016) considered the exact version of ASFW, that is, the case with . They showed that for smooth nonconvex $f$, with suitable choice of and , one can achieve an error of with stochastic gradients and linear optimizations. We remark here that we can generalize their results to the approximate case: if we choose , , as in Theorem 5, then we get the same kind of bound as for ASJ, differing only in the constants. This result applies to both the smooth convex case and the smooth nonconvex case, with the cost in the nonconvex case having the form of the duality bound.
We consider the nonsmooth convex case, and give a stochastic version that has error using stochastic gradients and linear optimizations. The algorithm aggregates past stochastic gradients to construct a proxy for the full gradient. The component is a weighted sum of the stochastic gradients from past iterations. The term has a regularizing effect of encouraging alignment of with when is strongly convex with . This is because . Without loss of generality, assume is -strongly convex and 1-smooth. One possible choice of is .
A similar algorithm has been used in online learning by Hazan and Kale (2012); Hazan et al. (2016). They used fixed instead of variable , and they performed exact instead of approximate minimization at each step.
Theorem 6.
Let be a -strongly convex 1-smooth function, , , , and , where , and are constants. Assume for all and . Let , , then we have
[TABLE]
In particular, when , for any ,
[TABLE]
In addition, if , then
[TABLE]
We state two lemmas and then prove this theorem.
Lemma 3.
Let be a -strongly convex function, , , . Then for any ,
[TABLE]
Lemma 4.
Let , and , where is a positive constant, and a positive constant in . Let , and be positive constants, and as defined in Theorem 6. If , and
[TABLE]
then for any .
Proof of Theorem 6.
Let , then is -strongly convex and 1-smooth. Let , , and . Then we have . Using the convexity of and Lemma 3, we have
[TABLE]
We have for any because is -Lipschitz. Hence
[TABLE]
We have , because is -strongly convex.
Note that , thus is obtained by doing an -FW step on . On the other hand has curvature at most because is 1-smooth, and for any due to the -strong convexity of . Using Lemma 2, we have . Thus we have
[TABLE]
Using Lemma 4, we have . Thus . Hence we have
[TABLE]
We used the fact that in the last inequality. Now observe that we have
[TABLE]
where the first equality holds due to linearity of expectation, the second equality holds because we take expectation with respect to (but not ), the third equality holds due to linearity of expectation, and the last inequality holds due to the convexity of . From Eq. 22 and Eq. 23, we obtain Eq. 18.
When , observe that , , then using Eq. 18 we obtain Eq. 19.
When , using Eq. 22 and Eq. 23 and observing that , we obtain Eq. 20. ∎
4 Conclusion
We have given a unified analysis of two approximate greedy algorithms, and presented new results on convergence and their connections. In addition, we studied their stochastic versions and demonstrated these algorithms can be robust against the optimization error in each iteration.
There are a few questions for further exploration. From recent results on FW and the equivalence result in Theorem 4, it is natural to ask whether Jones’ algorithm converges at faster rates under suitable additional assumptions, and whether more efficient stochastic Jones algorithms can be obtained. For stochastic FW, the nonsmooth case seems to be harder than the smooth case. Results on complexity lower bounds will lead to a better understanding of the greedy algorithms and these problems.
Supplementary Material
Proof of Lemma 2.
First consider the -greedy algorithm. We have
[TABLE]
Let , and subtract both sides of the above inequality by , we obtain
[TABLE]
For the -greedy FW algorithm, we have
[TABLE]
Subtracting both sides of the inequality by , we obtain
[TABLE]
∎
Proof of Theorem 2.
Let . From Lemma 2, for both -greedy and -FW algorithms, we have
[TABLE]
We prove the bound by induction. Taking , we obtain .
For the inductive step, assume the bound holds for , that is, , then we have
[TABLE]
where the last inequality holds because . ∎
Proof of Theorem 4.
(a) Suppose the current iterate is . If a -greedy algorithm yields , then
[TABLE]
That is, we have
[TABLE]
Simplifying the above inequality, we have (assuming if is not an optimal solution)
[TABLE]
The case for -greedy algorithms follows easily.
(b) Suppose the current iterate is . If an algorithm is -FW, then it gives a such that for any
[TABLE]
The case for -greedy FW algorithms follows easily. ∎
Proof of Proposition 1.
Consider least squares regression , where , , , , and .
It can be shown that , that is, . ∎
Proof of Theorem 5.
Let . We have
[TABLE]
We have
[TABLE]
where the first inequality is due to the curvature assumption, the second inequality due to Eq. 24, the third due to the duality bound, and the last due to Cauchy-Schwarz and . Now using convexity and telescoping the above inequality over , we have
[TABLE]
Using , taking expectation, and using (this is because , where is randomly drawn from ), we obtain
[TABLE]
When and for all , we have
[TABLE]
When , and , we have
[TABLE]
∎
Proof of Proposition 2.
Consider , where , , , and . Here we can take .
We first prove convergence. Let be the random variable taking value 1 when is sampled at iteration , and value -1 otherwise. Define , then it can be verified that . In addition, we can show that converges in probability to 0, which implies that converges in probability to a minimizer of , and thus converges to 0.
We prove the concentration result of for the more general case where ’s are i.i.d. drawn from a distribution on with mean , instead of from the uniform distribution on {-1, 1}. First we have , where . By Hoeffding’s inequality, we have
[TABLE]
Since each , we have
[TABLE]
We have , and . In addition, let , then
[TABLE]
where the because is not large enough to change the sign of . Hence we have
[TABLE]
On the other hand, we have
[TABLE]
We thus obtain
[TABLE]
Note that has nonzero probability of being , thus there is a nonzero probability that the above difference equals . However converges to 0, thus there is no constant such that
[TABLE]
for all and all . ∎
Proof of Proposition 3.
Consider least squares regression , where , , , , and .
It can be shown that converges in probability to as . However, the optimal solution is . ∎
Proof of Lemma 3.
We have
[TABLE]
Hence we have , and this implies
[TABLE]
We claim that
[TABLE]
This is equivalent to
[TABLE]
This holds when by the definition of . If this holds for some , then this holds for as follows,
[TABLE]
where the first inequality uses the inductive assumption, and the second one holds by the definition of .
Combining Eq. 27 and Eq. 28, we have
[TABLE]
∎
Proof of Lemma 4.
We first transform the recurrence in Eq. 21 in the form for some function . For nonnegative numbers to satisfy , we need to have , or . This implies . Applying this transformation to the recurrence in Eq. 21, we have
[TABLE]
where . The second inequality holds by observing that and in the first inequality are smaller than , because .
Now we find a value of by determining a sufficient condition on such that holds when . Assume for some , then
[TABLE]
where .
- When Eq. 30 holds, from the transformed recurrence above, we have .
- When Eq. 31 holds, we have . Thus .
- Both Eq. 32 and Eq. 33 are just rewritings of the previous inequality.
- To show that Eq. 34 implies Eq. 33, it suffices to show that for any . Using calculus, we have for . Hence we have . It follows that .
Now we solve Eq. 34. This is equivalent to , which is equivalent to , or
[TABLE]
To complete the proof it suffices to show that . This holds because
[TABLE]
∎
- Barron [1993] Andrew R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.
- Boyd and Vandenberghe [2004] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
- Defazio et al. [2014] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
- Donahue et al. [1997] Michael J. Donahue, C. Darken, Leonid Gurvits, and Eduardo Sontag. Rates of convex approximation in non-Hilbert spaces. Constructive Approximation, 13(2):187–220, 1997.
- Frank and Wolfe [1956] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 1956.
- Garber and Hazan [2015] Dan Garber and Elad Hazan. Faster rates for the Frank-Wolfe method over strongly-convex sets. In Proceedings of the 32nd International Conference on Machine Learning, volume 951, pages 541–549, 2015.
- Garber and Hazan [2016] Dan Garber and Elad Hazan. A linearly convergent variant of the conditional gradient algorithm under strong convexity, with applications to online and stochastic optimization. SIAM Journal on Optimization, 26(3):1493–1528, 2016.
- Hazan and Kale [2012] Elad Hazan and Satyen Kale. Projection-free online learning. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 521–528, 2012.
