Acceleration of SVRG and Katyusha X by Inexact Preconditioning
Yanli Liu, Fei Feng, and Wotao Yin

TL;DR
This paper introduces an inexact preconditioning technique to accelerate SVRG and Katyusha X algorithms, achieving faster convergence and practical speedups in empirical risk minimization tasks.
Contribution
It proposes a novel inexact preconditioning approach with fixed preconditioners that enhances convergence of SVRG and Katyusha X without increasing memory requirements.
Findings
Achieves better iteration and gradient complexity.
Provides theoretical convergence guarantees.
Demonstrates 8x iteration and 7x runtime speedups in experiments.
Abstract
Empirical risk minimization is an important class of optimization problems with many popular machine learning applications, and stochastic variance reduction methods are popular choices for solving them. Among these methods, SVRG and Katyusha X (a Nesterov accelerated SVRG) achieve fast convergence without substantial memory requirement. In this paper, we propose to accelerate these two algorithms by inexact preconditioning. The proposed methods employ fixed preconditioners. Although the subproblem in each epoch becomes harder, it suffices to apply a fixed number of simple subroutines to solve it inexactly, without losing the overall convergence. As a result, this inexact preconditioning strategy gives provably better iteration complexity and gradient complexity over SVRG and Katyusha X. We also allow each function in the finite sum to be nonconvex while the sum is strongly convex. In our numerical experiments, we observe on average an $8\times$ speedup in the number of iterations and a $7\times$ speedup in runtime.
SVRG, Katyusha X, inexact preconditioning
1 Introduction
Empirical risk minimization is an important class of optimization problems that has many applications in machine learning, especially in the large-scale setting. In this paper, we formulate it as the minimization of the following objective
$$\min_{x\in\mathbb{R}^d}\; F(x) = f(x) + \psi(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x) + \psi(x), \tag{1.1}$$
where the finite sum $f$ is strongly convex, each $f_i$ in the finite sum is smooth (a function is said to be smooth if its gradient is Lipschitz continuous) and can be nonconvex, and the regularizer $\psi$ is proper, closed, and convex, but may be nonsmooth. A nonzero $\psi$ is desirable in many applications, for example, $\ell_1$ regularization that induces sparsity in the solution. Allowing $f_i$ to be nonconvex is also necessary in some applications, e.g., the shift-and-invert approach to solving PCA (Saad, 1992).
1.1 Related Work
To obtain a high-quality approximate solution of (1.1), stochastic variance reduction algorithms are a preferable class of methods in the large-scale setting where $n$ is huge. If each $f_i$ is $\sigma$-strongly convex and $L$-smooth, and $\psi = 0$, then SVRG (Johnson & Zhang, 2013), SAGA (Defazio et al., 2014a), SAG (Roux et al., 2012), SARAH (Nguyen et al., 2017), SDCA (Shalev-Shwartz & Zhang, 2013), SDCA without duality (Shalev-Shwartz, 2016), and Finito/MISO (Defazio et al., 2014b; Mairal, 2013) can find an $\varepsilon$-suboptimal solution within $\mathcal{O}\big((n+\frac{L}{\sigma})\ln(\frac{1}{\varepsilon})\big)$ evaluations of component gradients $\nabla f_i$, while vanilla gradient descent needs $\mathcal{O}\big(n\frac{L}{\sigma}\ln(\frac{1}{\varepsilon})\big)$ evaluations. Recently, SCSG improved this complexity to $\mathcal{O}\big((n\wedge\frac{L}{\sigma\varepsilon}+\frac{L}{\sigma})\ln\frac{1}{\varepsilon}\big)$, where $a\wedge b$ denotes $\min\{a,b\}$. When $\psi \neq 0$, many of these algorithms can be extended accordingly and the same gradient complexity is preserved (Xiao & Zhang, 2014; Defazio et al., 2014a; Shalev-Shwartz & Zhang, 2016). Among these methods, SVRG has been a popular choice due to its low memory cost.
When the condition number is large, the performance of these variance reduction methods may degrade considerably. In view of this, many schemes incorporate second-order information into variance reduction. In (Gonen et al., 2016), the problem data is first transformed by linear sketching in order to decrease the condition number, and then SVRG is applied. However, the strategy is only proposed for ridge regression, and it is unclear whether it can be applied to other problems.
A larger family of algorithms, called Stochastic Quasi-Newton (SQN) methods, applies to more general settings. The idea is to first sample one or a few Hessian-vector products, then perform an L-BFGS type update on the approximate Hessian inverse $H_t$ (Byrd et al., 2016; Moritz et al., 2016; Gower et al., 2016); $H_t$ is then applied to the SVRG-type stochastic gradient as a preconditioner. That is,
$$w_{t+1} = w_t - \eta H_t\tilde{\nabla}_t,$$
where $\eta$ is a step size and $\tilde{\nabla}_t$ is a variance-reduced stochastic gradient.
Linear convergence is established and competitive numerical performance is observed for SQN methods. However, the theoretical linear rate depends on the condition number of the approximate Hessian, which in turn depends poorly on the condition number of the objective, so it is not clear whether they are faster than SVRG in general. Furthermore, they do not support nondifferentiable regularizers or nonconvexity of individual $f_i$. Recently, the first issue was partially resolved in (Lin et al., 2016), where the algorithm is at least as fast as SVRG. To deal with the second issue, (Wang et al., 2018) applied a preconditioned proximal mapping of the regularizer after the preconditioner is applied to the variance-reduced stochastic gradient; however, in order to evaluate this mapping efficiently, the preconditioner is required to take a symmetric rank-one update form. Such a preconditioner is still ill-conditioned, and therefore only a gradient complexity of order $\mathcal{O}\big((n+\kappa\frac{1}{\varepsilon})\ln(\frac{1}{\varepsilon})\big)$ can be guaranteed.
Another way of exploiting second-order information is to cyclically calculate one individual Hessian (or an approximation of it) (Rodomanov & Kropotov, 2016; Mokhtari et al., 2018); linear and locally superlinear convergence are established for these methods. However, they require a substantial amount of memory to store the local variables when $n$ is large.
Aside from exploiting second-order information, it is also possible to apply Nesterov-type acceleration to SVRG. Recently, Katyusha (Allen-Zhu, 2017) and Katyusha X (Allen-Zhu, 2018) were developed in this spirit. Katyusha X also applies to the sum-of-nonconvex setting where each $f_i$ can be nonconvex. There are also "Catalyst" accelerated methods (Lin et al., 2015), where a small amount of strong convexity is added to the objective, which is then minimized inexactly at each step before Nesterov acceleration is applied. However, Catalyst methods have an additional logarithmic factor in gradient complexity over Katyusha and Katyusha X.
1.2 Our Contributions
1. We propose to accelerate SVRG and Katyusha X by a fixed preconditioner, as opposed to the time-varying preconditioners in SQN methods. The subproblems are solved with a fixed number of simple subroutines.
2. If the preconditioner $M$ captures the second-order information of $f$, then there will be significant accelerations. With a good preconditioner $M$, Algorithm 1 and Algorithm 2 are provably faster than SVRG and Katyusha X, respectively, in terms of gradient complexity, with speedup factors that depend on the regime of the problem (Theorems 3 and 4). We also demonstrate these accelerations for Lasso and logistic regression.
3. Our acceleration applies to the sum-of-nonconvex setting, where $f$ in (1.1) is strongly convex, but each individual $f_i$ can be nonconvex. We also allow a nondifferentiable regularizer $\psi$.
2 Preliminaries and Assumptions
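Throughout this paper, we use $\|\cdot\|$ for the $\ell_2$ norm and $\langle\cdot,\cdot\rangle$ for the dot product; $\|\cdot\|_1$ denotes the $\ell_1$ norm.
The preconditioner $M$ is a symmetric, positive definite matrix. We write $\lambda_{\min}(M)$ and $\lambda_{\max}(M)$ as the smallest and the largest eigenvalues of $M$, respectively, and $\kappa(M)=\lambda_{\max}(M)/\lambda_{\min}(M)$ as the condition number of $M$. For $x, y\in\mathbb{R}^d$, let $\|x\|_M$ and $\langle x,y\rangle_M$ denote the norm and inner product induced by $M$, respectively, i.e., $\|x\|_M=\sqrt{x^TMx}$ and $\langle x,y\rangle_M=x^TMy$.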
We use $\lceil\cdot\rceil$ to denote the ceiling function. For $p\in(0,1)$, Geom$(p)$ denotes a random variable that obeys the geometric distribution with parameter $p$.
Definition 1.
We say that is smooth, if it is differentiable and satisfies
[TABLE]
We say that is smooth under , if it is differentiable and satisfies
[TABLE]
Definition 2.
We say that is strongly convex, if
[TABLE]
We say that is strongly convex under , if
[TABLE]
smoothness under is equivalent to . Also, strong convexity is equivalent to . Cf. Section 2 of (Shalev-Shwartz & Zhang, 2016).
Definition 3.
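We define the condition number of $f$ under $\|\cdot\|_M$ as $\kappa_f^M = L_f^M/\sigma_f^M$, where $L_f^M$ and $\sigma_f^M$ are the smoothness and strong convexity constants of $f$ under $\|\cdot\|_M$.
When $M = I$, we have $\kappa_f^M = \kappa_f$, the usual condition number of $f$.
In this paper, we will choose $M$ such that $\kappa_f^M \ll \kappa_f$. For example, if $f(x) = \frac{1}{2}x^TQx$ where $Q \succ 0$ is ill-conditioned, by choosing $M = Q$ we have
$$f(x) = \frac{1}{2}x^TQx = \frac{1}{2}\|x\|_M^2,$$
which tells us that $L_f^M = \sigma_f^M = 1$ and $\kappa_f^M = 1$, while $\kappa_f = \kappa(Q) \gg 1$. That is, under the $\|\cdot\|_M$ metric, $f$ has a much smaller condition number and can be minimized easily.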
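To make the effect of the metric concrete, the following minimal sketch computes the condition number of a quadratic $f(x)=\frac{1}{2}x^TQx$ measured under a preconditioner $M$. It assumes diagonal $Q$ and $M$, a simplification chosen so that no linear algebra library is needed: in that case the eigenvalues of $M^{-1/2}QM^{-1/2}$ are just the ratios $q_j/m_j$. The function name and the test matrices are illustrative and not taken from the paper.

```python
def condition_number_under(q_diag, m_diag):
    """Condition number of f(x) = 0.5 * x^T Q x in the metric induced by M.

    Assumes Q and M are diagonal with positive entries, so the spectrum of
    M^{-1/2} Q M^{-1/2} is simply {q_j / m_j}.
    """
    ratios = [q / m for q, m in zip(q_diag, m_diag)]
    return max(ratios) / min(ratios)


q_diag = [1e-4, 1e-2, 1.0, 1e2]        # ill-conditioned quadratic
identity = [1.0] * len(q_diag)          # M = I: the usual condition number
exact = list(q_diag)                    # M = Q: ideal preconditioner
rough = [3e-4, 2e-2, 0.5, 80.0]         # hypothetical crude estimate of Q

kappa_plain = condition_number_under(q_diag, identity)
kappa_exact = condition_number_under(q_diag, exact)
kappa_rough = condition_number_under(q_diag, rough)
print(kappa_plain, kappa_exact, kappa_rough)
```

Even the crude diagonal estimate brings the condition number down by several orders of magnitude in this toy case, which is the behavior a fixed preconditioner is meant to exploit.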
Definition 4.
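For a proper closed convex function $\psi$, its subdifferential at $x$ is written as
$$\partial\psi(x) = \big\{g\in\mathbb{R}^d : \psi(y) \geq \psi(x) + \langle g, y-x\rangle \ \text{for all } y\in\mathbb{R}^d\big\}.$$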
Definition 5.
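For a proper closed convex function $\psi$, its preconditioned proximal mapping with step size $\eta > 0$ is defined by
$$\mathrm{prox}^{M}_{\eta\psi}(x) = \operatorname*{arg\,min}_{y\in\mathbb{R}^d}\Big\{\psi(y) + \frac{1}{2\eta}\|y-x\|_M^2\Big\}.$$
When $M = I$, this reduces to the classical proximal mapping.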
Finally, let us list the assumptions that will be effective throughout this paper.
Assumption 1.
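In the objective function (1.1),
1. Each $f_i$ is $L_f$ smooth and $L_f^M$ smooth under $\|\cdot\|_M$.
2. $f$ is $\sigma_f$ strongly convex, and $\sigma_f^M$ strongly convex under $\|\cdot\|_M$, where $\sigma_f > 0$ and $\sigma_f^M > 0$.
3. The regularization term $\psi$ is proper closed convex and its proximal mapping $\mathrm{prox}_{\eta\psi}$ is easy to compute.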
Remark 1.
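1. In Assumption 1, we only require $f$ to be strongly convex, while each $f_i$ can be nonconvex.
2. Several common choices of regularizers have simple proximal mappings. For example, when $\psi(x) = \lambda\|x\|_1$ with $\lambda > 0$, $\mathrm{prox}_{\eta\psi}(x)$ can be computed component-wise as
$$\big(\mathrm{prox}_{\eta\psi}(x)\big)_j = \mathrm{sign}(x_j)\max\{|x_j| - \eta\lambda,\, 0\}.$$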
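As an illustration of a regularizer with a simple proximal mapping, the sketch below implements the classical proximal mapping of an $\ell_1$ penalty (soft-thresholding) in plain Python. The names `prox_l1` and `prox_objective`, and the symbols `eta` (step size) and `lam` (penalty weight), are choices made for this example rather than notation confirmed by the paper.

```python
def prox_l1(x, eta, lam):
    """Classical proximal mapping of psi(y) = lam * ||y||_1 with step size eta."""
    threshold = eta * lam
    out = []
    for xj in x:
        if xj > threshold:
            out.append(xj - threshold)
        elif xj < -threshold:
            out.append(xj + threshold)
        else:
            out.append(0.0)   # small entries are set exactly to zero
    return out


def prox_objective(y, x, eta, lam):
    """lam * ||y||_1 + ||y - x||^2 / (2 * eta): the function prox_l1 minimizes."""
    l1 = lam * sum(abs(v) for v in y)
    quad = sum((a - b) ** 2 for a, b in zip(y, x)) / (2.0 * eta)
    return l1 + quad


x = [1.5, -0.2, 0.05, -3.0]
y = prox_l1(x, eta=0.5, lam=1.0)
print(y)
```

Each coordinate is handled independently, so the cost is linear in the dimension.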
3 Proposed Algorithms
As discussed in Sec. 1, SVRG and Katyusha X suffer from ill-conditioning like other first-order methods. In this section, we propose to accelerate them by applying inexact preconditioning. Let us illustrate the idea as follows:
1. We would like to apply a preconditioner $M$ to the gradient descent step in SVRG, i.e.,
$$w_{t+1} = \operatorname*{arg\,min}_{y\in\mathbb{R}^d}\Big\{\psi(y) + \frac{1}{2\eta}\|y-w_t\|_M^2 + \langle\tilde{\nabla}_t, y\rangle\Big\}, \tag{3.1}$$
where $\tilde{\nabla}_t$ is a variance-reduced stochastic gradient. When $\psi = 0$ and this minimization is solved exactly, we have $w_{t+1} = w_t - \eta M^{-1}\tilde{\nabla}_t$, which is a preconditioned gradient update.
2. However, solving (3.1) exactly may be expensive and impractical. In fact, it suffices to solve it highly inexactly by a fixed number of simple subroutines.
We summarize the resulting algorithm in Algorithm 1 and call it Inexact Preconditioned SVRG (iPreSVRG). Compared to SVRG, the only difference lies in the update line, where this preconditioned subproblem is solved.
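For readers less familiar with SVRG, the variance-reduced stochastic gradient it uses has the standard form $\nabla f_i(w_t) - \nabla f_i(w_0) + \nabla f(w_0)$, where $w_0$ is the snapshot point of the current epoch. The sketch below builds this estimator for a hypothetical least-squares problem with $f_i(w)=\frac{1}{2}(a_i^Tw-b_i)^2$ (the data and helper names are assumptions made for illustration) and shows that averaging it over all indices recovers the full gradient, i.e., that it is unbiased.

```python
def grad_i(a, b, w):
    """Gradient of f_i(w) = 0.5 * (a^T w - b)^2."""
    residual = sum(aj * wj for aj, wj in zip(a, w)) - b
    return [residual * aj for aj in a]


def full_grad(A, b, w):
    """Gradient of f(w) = (1/n) * sum_i f_i(w)."""
    n = len(A)
    g = [0.0] * len(w)
    for a_i, b_i in zip(A, b):
        g = [gj + gij / n for gj, gij in zip(g, grad_i(a_i, b_i, w))]
    return g


def vr_grad(A, b, w, w_snap, mu_snap, i):
    """SVRG estimator: grad f_i(w) - grad f_i(w_snap) + grad f(w_snap)."""
    gi_w = grad_i(A[i], b[i], w)
    gi_s = grad_i(A[i], b[i], w_snap)
    return [x - y + z for x, y, z in zip(gi_w, gi_s, mu_snap)]


A = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
b = [1.0, 0.0, -2.0]
w, w_snap = [0.3, -0.7], [0.0, 0.1]
mu_snap = full_grad(A, b, w_snap)      # computed once per epoch

# Average of the estimator over a uniformly random index i.
n = len(A)
mean_est = [0.0, 0.0]
for i in range(n):
    est = vr_grad(A, b, w, w_snap, mu_snap, i)
    mean_est = [m + e / n for m, e in zip(mean_est, est)]
print(mean_est, full_grad(A, b, w))
```

At the snapshot itself the estimator equals the full gradient for every index, which is why its variance shrinks as the iterates approach the snapshot.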
Remark 2.
1. In Algorithm 1, the epoch length obeys a geometric distribution. This is for the purpose of simplifying the analysis (motivated by (Lei & Jordan, 2017; Allen-Zhu, 2018)); in practice, one can simply use a fixed epoch length. In our experiments, this still brings significant accelerations.
2. The choice of the epoch length affects the performance. Intuitively, a larger epoch length means more gradient evaluations per epoch, but also more progress per epoch. Theoretically, we show that a suitable choice, which depends on the number of subroutines used to solve the subproblem, gives faster convergence than SVRG.
3. In the sampling step, one can also sample a batch of gradients instead of one. It is straightforward to generalize our convergence results in Sec. 4 to this setting.
4. If $M = I$, the update line reduces to
$$w_{t+1} = \mathrm{prox}_{\eta\psi}\big(w_t - \eta\tilde{\nabla}_t\big),$$
and Algorithm 1 reduces to SVRG.
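For $M \neq I$, the update line contains an optimization problem that may not have a closed-form solution:
$$w_{t+1} = \operatorname*{arg\,min}_{y\in\mathbb{R}^d}\Big\{\psi(y) + \frac{1}{2\eta}\|y-w_t\|_M^2 + \langle\tilde{\nabla}_t, y\rangle\Big\}. \tag{3.2}$$
To solve it inexactly, we propose to apply a fixed number of iterations of some simple subroutine, initialized at $w_t$. This procedure is summarized in Procedure 1.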
Remark 3.
In Procedure 1, there are many choices for the subproblem iterator; for example, one can use proximal gradient, FISTA (Beck & Teboulle, 2009) (or equivalently, Nesterov acceleration (Nesterov, 2013)), and FISTA with restart (O’donoghue & Candes, 2015). Under these choices, each iteration is easy to compute. For example, when the iterator is the proximal gradient step, the iteration of Procedure 1 becomes
[TABLE]
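The inexact solve can be sketched as follows. The code assumes the subproblem has the form $\min_y\ \lambda\|y\|_1 + \frac{1}{2\eta}\|y-w\|_M^2 + \langle g, y\rangle$ with an $\ell_1$ regularizer, and runs a fixed number `p` of proximal gradient steps warm-started at `w`. The inner step size `gamma`, the Gershgorin bound used in place of the largest eigenvalue of $M$, and all function names are assumptions of this sketch, not details taken from Procedure 1.

```python
def soft_threshold(v, t):
    """Component-wise proximal mapping of t * ||.||_1."""
    return [max(x - t, 0.0) - max(-x - t, 0.0) for x in v]


def matvec(M, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]


def subproblem_value(y, w, g, M, eta, lam):
    d = [yj - wj for yj, wj in zip(y, w)]
    quad = sum(dj * mdj for dj, mdj in zip(d, matvec(M, d))) / (2.0 * eta)
    return lam * sum(abs(yj) for yj in y) + quad + sum(gj * yj for gj, yj in zip(g, y))


def inexact_preconditioned_step(w, g, M, eta, lam, p):
    """Run p proximal gradient steps on the preconditioned subproblem."""
    # Gershgorin row-sum bound: an upper bound on lambda_max(M) for symmetric M.
    lmax_bound = max(sum(abs(e) for e in row) for row in M)
    gamma = eta / lmax_bound           # 1/L for the smooth part of the subproblem
    y = list(w)                        # warm start at the current iterate
    for _ in range(p):
        d = [yj - wj for yj, wj in zip(y, w)]
        grad = [mdj / eta + gj for mdj, gj in zip(matvec(M, d), g)]
        y = soft_threshold([yj - gamma * gr for yj, gr in zip(y, grad)], gamma * lam)
    return y


M = [[2.0, 1.0], [1.0, 3.0]]
w, g, eta = [0.0, 0.0], [1.0, -1.0], 0.1
y_accurate = inexact_preconditioned_step(w, g, M, eta, lam=0.0, p=200)
y_rough = inexact_preconditioned_step(w, g, M, eta, lam=0.05, p=3)
print(y_accurate, y_rough)
```

With `lam=0.0` and many inner steps the result approaches the preconditioned gradient update $w - \eta M^{-1}g$; with $M = I$ a single inner step already reproduces the usual proximal gradient update.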
Now, let us also apply the inexact preconditioning idea to Katyusha X (Algorithm 2 of (Allen-Zhu, 2018)). Similar to Katyusha X, we first apply a momentum step, then one epoch of iPreSVRG (i.e., the inner loop of Algorithm 1). We call the resulting method iPreKatX and summarize it in Algorithm 2.
Remark 4.
1. For a particular value of the momentum weight, Algorithm 2 reduces to Algorithm 1.
2. When $M = I$ and the proximal mapping is solved exactly, Algorithm 2 reduces to Katyusha X.
3. The convergence of Algorithm 2 is established for a specific choice of the momentum weight. In practice, we found that many other choices also work.
4 Main Theory
In this section, we proceed to establish the convergence of Algorithm 1 and Algorithm 2. The key idea is that when the preconditioned proximal gradient update in (3.2) is solved inexactly as in Procedure 1, the error in its optimality condition can be bounded, and under this bound we can still establish the overall convergence of Algorithm 1 and Algorithm 2. Combining this with the fixed number of simple subroutines in Procedure 1, we obtain a much lower gradient complexity under the conditions of Theorems 3 and 4.
All the proofs in this section are deferred to the supplementary material.
First, let us analyze the error in the optimality condition of (3.2) when it is solved inexactly by FISTA with restart as in Procedure 1. Specifically, splitting the objective of (3.2) into the regularizer and the remaining strongly convex, smooth part, the subproblem (3.2) can be written as
[TABLE]
Therefore, FISTA with restart applied to (3.2) can be summarized in the following algorithm.
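A minimal sketch of FISTA with periodic restart, applied to a subproblem of the form $\min_y\ \lambda\|y\|_1 + \frac{1}{2\eta}\|y-w\|_M^2 + \langle g, y\rangle$, is given below for orientation. It is not the paper's algorithm listing: the $\ell_1$ regularizer, the step size $\eta$ divided by a Gershgorin bound on $\lambda_{\max}(M)$, and the parameter names `steps_per_cycle` and `cycles` are all assumptions of this example.

```python
import math


def soft_threshold(v, t):
    """Component-wise proximal mapping of t * ||.||_1."""
    return [max(x - t, 0.0) - max(-x - t, 0.0) for x in v]


def fista_restart(w, g, M, eta, lam, steps_per_cycle, cycles):
    """FISTA on the preconditioned subproblem, with momentum reset every cycle."""
    lmax_bound = max(sum(abs(e) for e in row) for row in M)  # >= lambda_max(M)
    gamma = eta / lmax_bound

    def grad_smooth(y):
        d = [yj - wj for yj, wj in zip(y, w)]
        return [sum(mij * dj for mij, dj in zip(row, d)) / eta + gj
                for row, gj in zip(M, g)]

    x = list(w)                               # warm start at w
    for _ in range(cycles):
        z, t = list(x), 1.0                   # restart: discard the momentum
        for _ in range(steps_per_cycle):
            gz = grad_smooth(z)
            x_new = soft_threshold(
                [zj - gamma * gj for zj, gj in zip(z, gz)], gamma * lam)
            t_new = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
            beta = (t - 1.0) / t_new          # extrapolation weight
            z = [xn + beta * (xn - xo) for xn, xo in zip(x_new, x)]
            x, t = x_new, t_new
    return x


M = [[2.0, 1.0], [1.0, 3.0]]
w, g, eta = [0.0, 0.0], [1.0, -1.0], 0.1
x_smooth = fista_restart(w, g, M, eta, lam=0.0, steps_per_cycle=10, cycles=30)
x_sparse = fista_restart(w, g, M, eta, lam=2.0, steps_per_cycle=10, cycles=3)
print(x_smooth, x_sparse)
```

Because the subproblem is strongly convex, resetting the momentum periodically turns the sublinear FISTA guarantee into a linear rate on the iterates, which is the property the analysis relies on.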
Lemma 1.
Take Assumption 1. Suppose in Procedure 1, we choose as the iterator of FISTA with restart every steps, with step size , and restart it times (that is, iterations in total). (FISTA with restart can be replaced with any iterator with Q-linear convergence on the iterates. In our experiments, FISTA also works, and a simple choice of is enough.) Then, is an approximate solution to (3.2) that satisfies
[TABLE]
where
[TABLE]
and
[TABLE]
With Lemma 1, the overall convergence of Algorithms 1 and 2 can be established. The analysis is similar to that of (Allen-Zhu, 2018).
Theorem 1.
Under Assumption 1, let , , , and . Then the iPreSVRG in Algorithm 1 satisfies
[TABLE]
Theorem 2.
Under Assumption 1, let , , , , and . Then the iPreKatX in Algorithm 2 satisfies
[TABLE]
Remark 5.
When $M = I$ and the subproblem is solved exactly, Theorems 1 and 2 recover Theorems D.1 and 4.3 of (Allen-Zhu, 2018).
In Theorems 1 and 2, we need the number of simple subroutines to be large enough; the following lemma provides a sufficient condition for this.
Lemma 2.
If the subproblem iterator in Procedure 1 is FISTA with restart every steps, and with step size , then, in order for to hold, it suffices to choose
[TABLE]
where .
With (4.3), (4.4), and (4.5), we can now calculate the gradient complexities of Algorithm 1 and Algorithm 2, but let us first do that for SVRG and Katyusha X.
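In Assumption 1, we have assumed that the proximal mapping of $\psi$ is cheap to evaluate; therefore, each epoch of SVRG needs one full gradient ($n$ component gradients) plus a constant number of component gradients per inner iteration, which is also true for Katyusha X. As a result, the gradient complexities for SVRG and Katyusha X to reach $\varepsilon$-suboptimality are: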
[TABLE]
For Algorithm 1 and Algorithm 2, each iteration in Procedure 1 is at most as expensive as gradient computations (for each iteration of Procedure 1, the most expensive step is multiplying $M$ by some vector, which is often cheaper than gradient computations) and is performed times; therefore, one epoch of iPreSVRG/iPreKatX needs at most gradient computations.
Consequently, we can write the gradient complexities for Algorithm 1 and Algorithm 2 to reach $\varepsilon$-suboptimality as:
[TABLE]
Remark 6.
1. According to Lemma 2, when the subproblem iterator is FISTA with restart, it suffices to choose the number of iterations by (4.5).
2. When the preconditioner is chosen appropriately, the step size in (4.8) and (4.9) can be much larger than that of (4.6) and (4.7).
Finally, we can compare the gradient complexities (4.8) and (4.9) with (4.6) and (4.7), respectively. It turns out that there is a significant speedup under the conditions of Theorems 3 and 4 below.
Theorem 3.
Take Assumption 1. Let the iterator in Procedure 1 be FISTA with restart, and an appropriate preconditioner is chosen such that and are of the same order, and is small compared to them, then
1. if and , then
[TABLE]
2. if and , then
[TABLE]
Theorem 4.
Take Assumption 1. Let the iterator in Procedure 1 be FISTA with restart, and an appropriate preconditioner is chosen such that and are of the same order, and is small compared to them, then
1. if and , then
[TABLE]
2. if and , then
[TABLE]
In Section 5, we provide practical choices of for Lasso and Logistic regression.
5 Experiments
To investigate the practical performance of Algorithms 1 and 2, we test on three problems: Lasso, logistic regression, and a synthetic sum-of-nonconvex problem. For the first two, each function in the finite sum is convex. To guarantee that the objective is strongly convex, a small $\ell_2$ regularization is added to Lasso and logistic regression.
In the following, we compare SVRG, iPreSVRG, Katyusha X, and iPreKatX on four datasets from LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/): w1a.t (47272 samples, 300 features), protein (17766 samples, 357 features), cod-rna.t (271617 samples, 8 features), and australian (690 samples, 14 features), as well as one synthetic dataset. The implementation settings are listed below:
1. We choose the epoch length in all experiments, since we found that the choices need more gradient evaluations.
2. For iPreSVRG and iPreKatX, we use FISTA as the subproblem iterator. If the preconditioner is diagonal, then the number of subroutines for solving the subproblem is ; if not, then we set .
3. In all the experiments, we tune the step size and momentum weight to their optimal values.
4. All algorithms are initialized at .
5. All algorithms are implemented in Matlab R2015b. To be fair, except for the subproblem routines for inexact preconditioning, the other parts of the code are identical in all algorithms. The experiments are conducted on a Windows system with an Intel Core i7 2.6 GHz CPU. The code is available at:
[TABLE]
5.1 Lasso
We formulate Lasso as
[TABLE]
where are feature vectors and are labels. Note that the first term is equivalent to , where and .
For Lasso as in (5.1), we provide two choices of preconditioner ,
1. When is small, we choose
[TABLE]
which is the exact Hessian of the smooth part of the objective.
2. When is large and is diagonally dominant, we choose
[TABLE]
where . In this case, the subproblem (3.2) can be solved exactly with iteration.
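To see why a diagonal preconditioner makes the subproblem cheap, note that with $M=\mathrm{diag}(m_1,\dots,m_d)$ and an $\ell_1$ regularizer the problem $\min_y\ \lambda\|y\|_1 + \frac{1}{2\eta}\|y-w\|_M^2 + \langle g, y\rangle$ decouples across coordinates, and each coordinate is a soft-thresholding of $w_j - \eta g_j/m_j$ with threshold $\eta\lambda/m_j$. The sketch below implements this closed form. The function names and test data are hypothetical, and the form of the subproblem is an assumption stated here rather than quoted from (3.2).

```python
def diag_preconditioned_step(w, g, m_diag, eta, lam):
    """Exact minimizer of lam*||y||_1 + ||y - w||_M^2/(2*eta) + <g, y>, M = diag(m_diag)."""
    out = []
    for wj, gj, mj in zip(w, g, m_diag):
        center = wj - eta * gj / mj        # per-coordinate preconditioned gradient step
        thresh = eta * lam / mj            # per-coordinate soft-threshold level
        if center > thresh:
            out.append(center - thresh)
        elif center < -thresh:
            out.append(center + thresh)
        else:
            out.append(0.0)
    return out


def diag_subproblem_value(y, w, g, m_diag, eta, lam):
    l1 = lam * sum(abs(yj) for yj in y)
    quad = sum(mj * (yj - wj) ** 2 for yj, wj, mj in zip(y, w, m_diag)) / (2.0 * eta)
    lin = sum(gj * yj for gj, yj in zip(g, y))
    return l1 + quad + lin


w = [0.4, -0.3, 0.01]
g = [0.5, -2.0, 0.1]
m_diag = [10.0, 0.5, 2.0]
y_star = diag_preconditioned_step(w, g, m_diag, eta=0.2, lam=1.0)
print(y_star)
```

Coordinates with a large diagonal entry take smaller steps and are thresholded less aggressively, which is how the diagonal metric rescales an ill-conditioned problem at the cost of a single pass over the coordinates.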
Our numerical results are presented in the following figures. We did not observe significant accelerations of Katyusha X over SVRG, or of iPreKatX over iPreSVRG; we suspect the reason is that and the optimal choices of step size make or , so that the complexities in (4.7) and (4.9) are not better than those in (4.6) and (4.8), respectively.
5.2 Logistic Regression
We formulate Logistic regression as
[TABLE]
where again are feature vectors and are labels.
For Logistic regression as in (5.2), the Hessian of the smooth part can be expressed as
[TABLE]
where . Inspired by this, we provide two choices of preconditioner . (Here is a heuristic justification: by Definition 1 we know that ; since $\frac{\exp(-b_{i}a_{i}^{T}x)}{\big(1+\exp(-b_{i}a_{i}^{T}x)\big)^{2}}\rightarrow 0$ only when is unbounded, we know that if the iterates of our algorithms are bounded, then for some , which gives according to Definition 2. When is not too small, one can expect .)
1. When is small, we choose
[TABLE]
2. When is large and is diagonally dominant, we choose
[TABLE]
where . In this case, the subproblem (3.2) can be solved exactly with iteration.
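The role of the scalar weights $\frac{\exp(-b_ia_i^Tx)}{(1+\exp(-b_ia_i^Tx))^2}$ in the logistic Hessian can be checked numerically. The sketch below assumes the standard logistic loss $f_i(x)=\ln(1+\exp(-b_ia_i^Tx))$ with labels $b_i\in\{-1,+1\}$ and no regularization term (the paper's exact formulation (5.2) may include additional terms). It verifies on a tiny hypothetical dataset that each weight is at most $1/4$ and that the weighted sum of $a_ia_i^T$ matches a finite-difference Hessian.

```python
import math


def logistic_weight(a, b, x):
    """exp(-z) / (1 + exp(-z))^2 with z = b * a^T x, evaluated stably."""
    z = b * sum(aj * xj for aj, xj in zip(a, x))
    e = math.exp(-abs(z))              # the weight is even in z
    return e / (1.0 + e) ** 2


def logistic_grad(A, labels, x):
    """Gradient of (1/n) * sum_i ln(1 + exp(-b_i a_i^T x))."""
    n = len(A)
    g = [0.0] * len(x)
    for a, b in zip(A, labels):
        z = b * sum(aj * xj for aj, xj in zip(a, x))
        coeff = -b / (1.0 + math.exp(z))
        g = [gj + coeff * aj / n for gj, aj in zip(g, a)]
    return g


def logistic_hessian(A, labels, x):
    """(1/n) * sum_i weight_i * a_i a_i^T, valid because b_i^2 = 1."""
    n, d = len(A), len(x)
    H = [[0.0] * d for _ in range(d)]
    for a, b in zip(A, labels):
        wgt = logistic_weight(a, b, x)
        for r in range(d):
            for c in range(d):
                H[r][c] += wgt * a[r] * a[c] / n
    return H


A = [[1.0, 2.0], [-0.5, 1.0], [3.0, -1.0]]
labels = [1.0, -1.0, 1.0]
x = [0.2, -0.1]
H = logistic_hessian(A, labels, x)
weights = [logistic_weight(a, b, x) for a, b in zip(A, labels)]
print(weights, H)
```

Since every weight lies in $(0, 1/4]$, the Hessian is bounded above by a fixed multiple of the data Gram matrix, which is one way to motivate a fixed, data-dependent preconditioner for this problem.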
Our results are presented in the following figures. Again, we did not observe a significant acceleration of Katyusha X over SVRG, or of iPreKatX over iPreSVRG, for the same reason mentioned in the last subsection.
5.3 Sum-of-nonconvex Example
Similar to (Allen-Zhu & Yuan, 2016), we generate a sum-of-nonconvex example by the following procedure:
We take normalized random vector , and also vectors of the form , where the nonzero element is at th coordinate.
And the sum-of-nonconvex problem is given by
[TABLE]
where , and .
[TABLE]
[TABLE]
Since the sum of ’s is [math], they do not affect the condition number of the whole problem. However, it makes most of the first half of to be highly nonconvex. Overall, the condition number of this problem is equal to that of , which is approximately 10000 in our tested data.
Since is diagonally dominant, we select as the preconditioner. Our algorithms also have significant acceleration in this sum-of-nonconvex setting.
6 Conclusions and Future Work
In this paper, we propose to accelerate SVRG and Katyusha X by inexact preconditioning. With an appropriate preconditioner, both can be provably accelerated in terms of iteration complexity and gradient complexity. Our algorithms admit a nondifferentiable regularizer, as well as nonconvexity of individual functions. We confirm our theoretical results on Lasso, logistic regression, and a sum-of-nonconvex example, where simple choices of preconditioners lead to significant accelerations.
There are still open questions left for us to address in the future: (a) Do we have a theoretical guarantee when the subproblem iterator is chosen as a faster scheme such as APCG (Lin et al., 2014), NU_ACDM (Allen-Zhu et al., 2016), or A2BCD (Hannah et al., 2018a)? (b) In general, how should one choose a simple preconditioner that can greatly reduce the condition number of the problem? (c) Is it possible to apply this inexact preconditioning technique to other stochastic algorithms?
Acknowledgements
We would like to thank Yunbei Xu for helpful discussions on the idea of inexact preconditioning. We also thank the reviewers for their valuable comments.
This work is supported in part by the National Key R&D Program of China 2017YFB02029, AFOSR MURI FA9550-18-1-0502, NSF DMS-1720237, and ONR N0001417121.
Appendix A Proof of Lemma 1
In this section, we prove the results on the error generated when solving the subproblem (3.2) inexactly by Procedure 1. Before proving Lemma 1, we will first prove a simpler case in Lemma 3, where the subproblem iterator is the proximal gradient step.
Lemma 3.
Take Assumption 1. Suppose in Procedure 1, we choose as the proximal gradient step with step size , and repeat it times, where . Then, is an approximate solution to (3.2) that satisfies
[TABLE]
where
[TABLE]
and .
Proof of Lemma 3.
The optimization problem in (3.2) is of the form
[TABLE]
for and With our choice of as the proximal gradient descent step, the iterations in Procedure 1 are
[TABLE]
where . From the definition of , we have
[TABLE]
Comparing this with (A.1) gives
[TABLE]
To bound the right hand side, let be the solution of (A.3), , and . Then is convex and is -strongly convex and -Lipschitz differentiable. Consequently, Prop. 26.16(ii) of (Bauschke et al., 2017) gives
[TABLE]
where .
Let . Then, . We can derive
[TABLE]
On the other hand, we have
[TABLE]
Combining these two equations yields
[TABLE]
where
[TABLE]
Finally, let the eigenvalues of be , with orthonormal eigenvectors . Let and be decomposed by
[TABLE]
then
[TABLE]
Combining these two inequalities with (A.4), we arrive at
[TABLE]
where
[TABLE]
∎
Now, we are ready to prove Lemma 1; the techniques are similar to those in the proof of Lemma 3.
Proof of Lemma 1.
We want to find such that
[TABLE]
Take and , then the optimality condition of the problem in line 5 of Algorithm 3 is
[TABLE]
Comparing this with (A.7), we have
[TABLE]
where
[TABLE]
As a result,
[TABLE]
Let the solution of (3.2) be . By Theorem 4.4 of (Beck & Teboulle, 2009), for any and we have
[TABLE]
On the other hand, the strong convexity of gives
[TABLE]
Therefore,
[TABLE]
Now, let us use (A.11) repeatedly to bound the right hand side of (A.10). For example, the first term can be bounded as
[TABLE]
Similarly, the rest of the terms can be bounded as follows,
[TABLE]
[TABLE]
[TABLE]
where in the first and third estimate we have used . On the other hand, we have
[TABLE]
As a result, taking , , and yields
[TABLE]
where
[TABLE]
Similar to the end of proof of Lemma 3, we have
[TABLE]
Now, let us choose such that is minimized, a simple calculation yields
[TABLE]
In order for to be an integer, we can take
[TABLE]
then
[TABLE]
Finally, let us show that in (A.12) can be bounded by , and the desired bound (A.8) on follows.
First, we have
[TABLE]
and
[TABLE]
On the other hand, a simple calculation shows that is decreasing in , therefore
[TABLE]
Similarly, one can show that
[TABLE]
Combining these two inequalities with (B.2) yields
[TABLE]
∎
Appendix B Proof of Theorem 1
In this section, we proceed to establish the convergence of inexact preconditioned SVRG as in Algorithm 1. The proof is similar to that of Theorem D.1 of (Allen-Zhu, 2018).
Before proving Theorem 1, let us first prove several lemmas.
First, the inexact optimality condition (4.1) gives the following descent:
Lemma 4.
Under Assumption 1, suppose that (4.1) holds. Then, for any we have
[TABLE]
Proof.
First, let us rewrite the left hand side as
[TABLE]
By (4.1) and the definition of subdifferential we have
[TABLE]
Combining these two gives
[TABLE]
where in the last equality we have applied
[TABLE]
∎
Based on Lemma 4, we have
Lemma 5.
Under Assumption 1, if the iterator in Procedure 1 is proximal gradient descent or FISTA with restart, then, for any , , and we have
[TABLE]
Proof.
We have
[TABLE]
where the first and second inequalities are due to the strong convexity and smoothness under in Assumption 1, respectively. The last equality is due to .
On the other hand, recall that Lemma 4 gives
[TABLE]
For the last term we can apply Cauchy-Schwarz as follows,
[TABLE]
From Lemma 3 and Lemma 1 we know that
[TABLE]
Therefore, by Young’s inequality, we have for any that
[TABLE]
Applying this to Lemma 4 yields
[TABLE]
Applying this to (B.2), we arrive at
[TABLE]
where in the second inequality we have applied
[TABLE]
Finally, since , we have , which gives the desired result.
∎
Lemma 6.
Under Assumption 1, we have
[TABLE]
Proof.
We have
[TABLE]
where in the first inequality, we have applied with $\xi=M^{-\frac{1}{2}}\big(\nabla f_{i_{t}}(w_{t})-\nabla f_{i_{t}}(w_{0})\big)$, and the second inequality follows from Assumption 1. ∎
Lemma 7. (Fact 2.3 of (Allen-Zhu, 2018)).
Let be a sequence of numbers, and Geom, then
, and 2. 2.
**
Lemma 8.
Under Assumption 1, if and , then, for any we have
[TABLE]
Proof.
By Lemmas 5 and 6, we know that
[TABLE]
Let Geom as in Algorithm 1 and take , then
[TABLE]
where the first equality follows from item 1 of Lemma 7 with , the second inequality follows from item 2 with , item 2 with , and item 1 with , the third inequality makes use of , and the fourth inequality makes use of .
∎
Now, let us proceed to prove Theorem 1. With Lemma 8, it can be proved in a similar way as Theorem 3 of (Hannah et al., 2018b).
Proof of Theorem 1.
Without loss of generality, we can assume and
According to Lemma 8, for any , and we have
[TABLE]
or equivalently,
[TABLE]
In the following proof, we will omit .
Setting and yields the following two inequalities:
[TABLE]
[TABLE]
Define , multiply to (B.3), then add it to (B.5) yields
[TABLE]
Multiplying both sides by gives
[TABLE]
Summing over , we have
[TABLE]
Since , we have
[TABLE]
By the strong convexity of , we have , therefore
[TABLE]
Finally, recall that can be chosen arbitrarily, so we can take
[TABLE]
and
[TABLE]
[TABLE]
In order for the choice of in (B.7) to be possible, we need
[TABLE]
to have one solution at least, which requires
[TABLE]
under which satisfy (B.8). As a result, makes (B.7) into
[TABLE]
and the desired convergence result follows from (B.6). ∎
Appendix C Proof of Lemma 2
Proof.
From Lemma 1, we know that
[TABLE]
where
[TABLE]
Therefore, in order for , we need
[TABLE]
which is equivalent to
[TABLE]
Thus, it suffices to require that
[TABLE]
which gives
[TABLE]
∎
Appendix D Proof of Theorem 2
The proof of Theorem 2 is similar to that of Theorem 4.3 of (Allen-Zhu, 2018), so we provide a proof sketch here and omit the details.
1. In (Allen-Zhu, 2018), the proof of Theorem 4.3 is based on Lemma 3.3; here the proof of Theorem 2 is based on Lemma 8, which is an analog of Lemma 3.3 in our setting.
2. Based on Lemma 8, the proof of Theorem 2 follows in nearly the same way as Theorem 4.3 of (Allen-Zhu, 2018); the only difference is that one needs to replace by .
3. By setting
[TABLE]
and
[TABLE]
as in the proof of Theorem 1, the in Theorem 4.3 of (Allen-Zhu, 2018) becomes , and the convergence result of Theorem 2 follows.
Appendix E Proof of Theorems 3 and 4
Proof of Theorem 3.
From Remark 5, we know that the gradient complexity of SVRG can be expressed as
[TABLE]
Taking the largest possible step size as in Theorem 1, we have
[TABLE]
Let us first find the optimal for SVRG, let
[TABLE]
then
[TABLE]
Taking the derivative of the numerator gives
[TABLE]
Therefore, is given by . Let , then
[TABLE]
Since for , we know that , therefore, .
Let where , we would like to have , i.e.,
[TABLE]
so that .
Since , we have , on the other hand, we have
[TABLE]
Therefore, it suffices to have
[TABLE]
As a result, we have , and
[TABLE]
where in the second equality we have used .
For our iPreSVRG in Algorithm 1, we have
[TABLE]
thanks to Lemma 2, can be chosen as
[TABLE]
furthermore, we can take due to Theorem 1.
Under these settings, we have
[TABLE]
Let us take .
If , or equivalently , then
[TABLE]
Since $p=\mathcal{O}\Big(\sqrt{\kappa(M)}\ln\big(\sqrt{\kappa^{M}_{f}}\kappa(M)\big)\Big)$, we know that when , or equivalently , we have
[TABLE]
therefore
[TABLE]
and
[TABLE]
If , or equivalently , then and
[TABLE]
therefore
[TABLE]
Since , this ratio becomes ∎
Proof of Theorem 4.
The proof of Theorem 4 is similar and is omitted. ∎
References
- Allen-Zhu, Z. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1200–1205. ACM, 2017.
- Allen-Zhu, Z. Katyusha X: Practical Momentum Method for Stochastic Sum-of-Nonconvex Optimization. In ICML, 2018.
- Allen-Zhu, Z. and Yuan, Y. Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. In International Conference on Machine Learning, pp. 1080–1089, 2016.
- Allen-Zhu, Z., Qu, Z., Richtárik, P., and Yuan, Y. Even faster accelerated coordinate descent using non-uniform sampling. In International Conference on Machine Learning, pp. 1110–1119, 2016.
- Bauschke, H. H., Combettes, P. L., et al. Convex Analysis and Monotone Operator Theory in Hilbert Spaces, volume 2011. Springer, 2017.
- Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
- Byrd, R. H., Hansen, S. L., Nocedal, J., and Singer, Y. A stochastic quasi-Newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008–1031, 2016.
- Defazio, A., Bach, F., and Lacoste-Julien, S. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems 27, pp. 1646–1654, 2014a.
