SSRGD: Simple Stochastic Recursive Gradient Descent for Escaping Saddle Points
Zhize Li

TL;DR
This paper introduces SSRGD, a simple perturbed stochastic recursive gradient descent algorithm that efficiently finds second-order stationary points in nonconvex optimization, outperforming more complex methods in simplicity and analysis.
Contribution
The paper presents SSRGD, a straightforward perturbation-based method for escaping saddle points, with near-optimal stochastic gradient complexity and simpler analysis compared to existing algorithms.
Findings
SSRGD finds second-order stationary points with near-optimal complexity.
The algorithm also efficiently finds first-order stationary points.
Results extend from finite-sum to online nonconvex problems.
Table 1: Stochastic gradient complexity of optimization algorithms for the nonconvex finite-sum problem (1)

| Algorithm | Stochastic gradient complexity | Guarantee | Negative-curvature search subroutine |
|---|---|---|---|
| GD (Nesterov, 2004) | | 1st-order | No |
| SVRG (Reddi et al., 2016), (Allen-Zhu and Hazan, 2016); SCSG (Lei et al., 2017); SVRG+ (Li and Li, 2018) | | 1st-order | No |
| SNVRG (Zhou et al., 2018b); SPIDER (Fang et al., 2018); SpiderBoost (Wang et al., 2018); SARAH (Pham et al., 2019) | | 1st-order | No |
| SSRGD (this paper) | | 1st-order | No |
| PGD (Jin et al., 2017) | | 2nd-order | No |
| Neon2+FastCubic/CDHS (Agarwal et al., 2016; Carmon et al., 2016) | | 2nd-order | Needed |
| Neon2+SVRG (Allen-Zhu and Li, 2018) | | 2nd-order | Needed |
| Stabilized SVRG (Ge et al., 2019) | | 2nd-order | No |
| SNVRG++Neon2 (Zhou et al., 2018a) | | 2nd-order | Needed |
| SPIDER-SFO+(+Neon2) (Fang et al., 2018) | | 2nd-order | Needed |
| SSRGD (this paper) | | 2nd-order | No |

Table 2: Stochastic gradient complexity of optimization algorithms for the nonconvex online (expectation) problem (2)

| Algorithm | Stochastic gradient complexity | Guarantee | Negative-curvature search subroutine |
|---|---|---|---|
| SGD (Ghadimi et al., 2016) | | 1st-order | No |
| SCSG (Lei et al., 2017); SVRG+ (Li and Li, 2018) | | 1st-order | No |
| SNVRG (Zhou et al., 2018b); SPIDER (Fang et al., 2018); SpiderBoost (Wang et al., 2018); SARAH (Pham et al., 2019) | | 1st-order | No |
| SSRGD (this paper) | | 1st-order | No |
| Perturbed SGD (Ge et al., 2015) | poly | 2nd-order | No |
| CNC-SGD (Daneshmand et al., 2018) | | 2nd-order | No |
| Neon2+SCSG (Allen-Zhu and Li, 2018) | | 2nd-order | Needed |
| Neon2+Natasha2 (Allen-Zhu, 2018) | | 2nd-order | Needed |
| SNVRG++Neon2 (Zhou et al., 2018a) | | 2nd-order | Needed |
| SPIDER-SFO+(+Neon2) (Fang et al., 2018) | | 2nd-order | Needed |
| SSRGD (this paper) | | 2nd-order | No |
IIIS, Tsinghua University
Abstract
We analyze stochastic gradient algorithms for optimizing nonconvex problems. In particular, our goal is to find local minima (second-order stationary points) instead of just finding first-order stationary points, which may be bad unstable saddle points. We show that a simple perturbed version of the stochastic recursive gradient descent algorithm (called SSRGD) can find an $(\epsilon,\delta)$-second-order stationary point with $\widetilde{O}(\sqrt{n}/\epsilon^{2}+\sqrt{n}/\delta^{4}+n/\delta^{3})$ stochastic gradient complexity for nonconvex finite-sum problems. As a by-product, SSRGD finds an $\epsilon$-first-order stationary point with $O(n+\sqrt{n}/\epsilon^{2})$ stochastic gradients. These results are almost optimal since Fang et al. (2018) provided a lower bound $\Omega(\sqrt{n}/\epsilon^{2})$ for finding even just an $\epsilon$-first-order stationary point. We emphasize that the SSRGD algorithm for finding second-order stationary points is as simple as the one for finding first-order stationary points, just by adding a uniform perturbation sometimes, while all other algorithms for finding second-order stationary points with similar gradient complexity need to be combined with a negative-curvature search subroutine (e.g., Neon2 (Allen-Zhu and Li, 2018)). Moreover, the simple SSRGD algorithm admits a simpler analysis. Besides, we also extend our results from nonconvex finite-sum problems to nonconvex online (expectation) problems, and prove the corresponding convergence results.
1 Introduction
Nonconvex optimization is ubiquitous in machine learning applications, especially for deep neural networks. For convex optimization, every local minimum is a global minimum and it can be achieved by any first-order stationary point, i.e., $\nabla f(x)=0$. However, for nonconvex problems, a point with zero gradient can be a local minimum, a local maximum or a saddle point. To avoid converging to bad saddle points (including local maxima), we want to find a second-order stationary point, i.e., $\nabla f(x)=0$ and $\nabla^{2}f(x)\succeq 0$ (this is a necessary condition for $x$ to be a local minimum). All second-order stationary points are indeed local minima if the function $f$ satisfies the strict saddle property (Ge et al., 2015). Note that finding the global minimum in nonconvex problems is NP-hard in general. Also note that all local minima have been shown to be global minima for some nonconvex problems, e.g., matrix sensing (Bhojanapalli et al., 2016), matrix completion (Ge et al., 2016), and some neural networks (Ge et al., 2017). Thus, our goal in this paper is to find an approximate second-order stationary point (local minimum) with proved convergence.
There has been extensive research on finding an $\epsilon$-first-order stationary point (i.e., $\|\nabla f(x)\|\le\epsilon$), e.g., GD, SGD and SVRG. See Table 1 for an overview. Xu et al. (2018) and Allen-Zhu and Li (2018) independently proposed reduction algorithms Neon/Neon2 that can be combined with previous $\epsilon$-first-order stationary point finding algorithms to find an $(\epsilon,\delta)$-second-order stationary point (i.e., $\|\nabla f(x)\|\le\epsilon$ and $\lambda_{\min}(\nabla^{2}f(x))\ge-\delta$). However, algorithms obtained by this reduction are very complicated in practice, and they need to extract negative curvature directions from the Hessian to escape saddle points by using a negative curvature search subroutine: given a point $x$, find an approximate smallest eigenvector of $\nabla^{2}f(x)$. This also involves a more complicated analysis. Note that in practice, standard first-order stationary point finding algorithms can often work (escape bad saddle points) in the nonconvex setting without a negative curvature search subroutine. The reason may be that the saddle points are usually not very stable. So there is a natural question: "Is there any simple modification that allows first-order stationary point finding algorithms to get a theoretical second-order guarantee?" For gradient descent (GD), Jin et al. (2017) showed that a simple perturbation step is enough to escape saddle points for finding a second-order stationary point, and this is necessary (Du et al., 2017). Very recently, Ge et al. (2019) showed that a simple perturbation step is also enough to find a second-order stationary point for the SVRG algorithm (Li and Li, 2018). Moreover, Ge et al. (2019) also developed a stabilized trick to further improve the dependency on the Hessian Lipschitz parameter.
Note:
- Guarantee (see Definition 1): $\epsilon$-first-order stationary point $\|\nabla f(x)\|\le\epsilon$; $(\epsilon,\delta)$-second-order stationary point $\|\nabla f(x)\|\le\epsilon$ and $\lambda_{\min}(\nabla^{2}f(x))\ge-\delta$.
- In the classical setting where $\delta=\sqrt{\rho\epsilon}$ (Nesterov and Polyak, 2006; Jin et al., 2017), our simple SSRGD is always (no matter what $n$ and $\epsilon$ are) not worse than all other algorithms (in both Tables 1 and 2) except FastCubic/CDHS (which need to compute Hessian-vector products) and SPIDER-SFO+. Moreover, our simple SSRGD is not worse than FastCubic/CDHS if $n\ge 1/\epsilon$ and is better than SPIDER-SFO+ if $\delta$ is very small (e.g., $\delta\le 1/\sqrt{n}$) in Table 1.
1.1 Our Contributions
In this paper, we propose a simple SSRGD algorithm (described in Algorithm 1) and show that a simple perturbation step is enough to find a second-order stationary point for the stochastic recursive gradient descent algorithm. Our results and previous results are summarized in Tables 1 and 2. We would like to highlight the following points:
- We improve the result in (Ge et al., 2019) to the almost optimal one (i.e., from $n^{2/3}/\epsilon^{2}$ to $n^{1/2}/\epsilon^{2}$) since Fang et al. (2018) provided a lower bound $\Omega(\sqrt{n}/\epsilon^{2})$ for finding even just an $\epsilon$-first-order stationary point. Note that the other two algorithms (i.e., SNVRG+ and SPIDER-SFO+) both need the negative curvature search subroutine (e.g., Neon2) and thus are more complicated in practice and in analysis compared with their first-order guarantee algorithms (SNVRG and SPIDER), while our SSRGD is as simple as its first-order guarantee algorithm, just by adding a uniform perturbation sometimes.
- For the more general nonconvex online (expectation) problem (2), we obtain the first algorithm for finding a second-order stationary point that is as simple as finding first-order stationary points, with a similar state-of-the-art convergence result. See the last column of Table 2.
- Our simple SSRGD algorithm admits a simpler analysis. Also, the result for finding a first-order stationary point is a by-product of our analysis. We also give a clear interpretation in Section 5.1 of why our analysis for the SSRGD algorithm can improve the original SVRG from $n^{2/3}$ to $n^{1/2}$. We believe it is very useful for better understanding these two algorithms.
2 Preliminaries
Notation: Let $[n]$ denote the set $\{1,2,\ldots,n\}$ and $\|\cdot\|$ denote the Euclidean norm for a vector and the spectral norm for a matrix. Let $\langle u,v\rangle$ denote the inner product of two vectors $u$ and $v$. Let $\lambda_{\min}(A)$ denote the smallest eigenvalue of a symmetric matrix $A$. Let $\mathbb{B}_{x}(r)$ denote a Euclidean ball with center $x$ and radius $r$. We use $O(\cdot)$ to hide the constant and $\widetilde{O}(\cdot)$ to hide the polylogarithmic factor.
In this paper, we consider two types of nonconvex problems. The finite-sum problem has the form
$$\min_{x\in\mathbb{R}^{d}} f(x):=\frac{1}{n}\sum_{i=1}^{n}f_{i}(x), \tag{1}$$
where $f(x)$ and all individual $f_{i}(x)$ are possibly nonconvex. This form usually models the empirical risk minimization in machine learning problems.
The online (expectation) problem has the form
$$\min_{x\in\mathbb{R}^{d}} f(x):=\mathbb{E}_{\zeta\sim D}[F(x,\zeta)], \tag{2}$$
where $f(x)$ and $F(x,\zeta)$ are possibly nonconvex. This form usually models the population risk minimization in machine learning problems.
Now, we make standard smoothness assumptions for these two problems.
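**Assumption 1 (Gradient Lipschitz)**
1. For finite-sum problem (1), each $f_{i}(x)$ is differentiable and has $L$-Lipschitz continuous gradient, i.e.,
$$\|\nabla f_{i}(x_{1})-\nabla f_{i}(x_{2})\|\le L\|x_{1}-x_{2}\|,\quad\forall x_{1},x_{2}\in\mathbb{R}^{d}. \tag{3}$$
2. For online problem (2), $F(x,\zeta)$ is differentiable and has $L$-Lipschitz continuous gradient, i.e.,
$$\|\nabla F(x_{1},\zeta)-\nabla F(x_{2},\zeta)\|\le L\|x_{1}-x_{2}\|,\quad\forall x_{1},x_{2}\in\mathbb{R}^{d}. \tag{4}$$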
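**Assumption 2 (Hessian Lipschitz)**
1. For finite-sum problem (1), each $f_{i}(x)$ is twice-differentiable and has $\rho$-Lipschitz continuous Hessian, i.e.,
$$\|\nabla^{2}f_{i}(x_{1})-\nabla^{2}f_{i}(x_{2})\|\le\rho\|x_{1}-x_{2}\|,\quad\forall x_{1},x_{2}\in\mathbb{R}^{d}. \tag{5}$$
2. For online problem (2), $F(x,\zeta)$ is twice-differentiable and has $\rho$-Lipschitz continuous Hessian, i.e.,
$$\|\nabla^{2}F(x_{1},\zeta)-\nabla^{2}F(x_{2},\zeta)\|\le\rho\|x_{1}-x_{2}\|,\quad\forall x_{1},x_{2}\in\mathbb{R}^{d}. \tag{6}$$
These two assumptions are standard for finding first-order stationary points (Assumption 1) and second-order stationary points (Assumptions 1 and 2) for all algorithms in both Tables 1 and 2.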
Now we define the approximate first-order stationary points and approximate second-order stationary points.
Definition 1
*$x$ is an $\epsilon$-first-order stationary point for a differentiable function $f$ if*
$$\|\nabla f(x)\|\le\epsilon. \tag{7}$$
*$x$ is an $(\epsilon,\delta)$-second-order stationary point for a twice-differentiable function $f$ if*
$$\|\nabla f(x)\|\le\epsilon\quad\text{and}\quad\lambda_{\min}(\nabla^{2}f(x))\ge-\delta. \tag{8}$$
The definition of an $(\epsilon,\delta)$-second-order stationary point is the same as in (Allen-Zhu and Li, 2018; Daneshmand et al., 2018; Zhou et al., 2018a; Fang et al., 2018) and it generalizes the classical version where $\delta=\sqrt{\rho\epsilon}$ used in (Nesterov and Polyak, 2006; Jin et al., 2017; Ge et al., 2019).
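To make the two notions in Definition 1 concrete, the following is a minimal sketch that checks them on a two-dimensional toy function. The function $f(x,y)=x^{2}-y^{2}$, the helper names, and the tolerances are assumptions chosen for illustration; they do not come from the paper.

```python
import math

def lambda_min_2x2(h11, h12, h22):
    """Smallest eigenvalue of the symmetric matrix [[h11, h12], [h12, h22]]."""
    mean = 0.5 * (h11 + h22)
    radius = math.hypot(0.5 * (h11 - h22), h12)
    return mean - radius

def is_first_order_stationary(grad, eps):
    """Check the gradient-norm condition ||grad f(x)|| <= eps."""
    return math.hypot(grad[0], grad[1]) <= eps

def is_second_order_stationary(grad, hess, eps, delta):
    """Check ||grad f(x)|| <= eps and lambda_min(Hessian) >= -delta."""
    h11, h12, h22 = hess
    return (is_first_order_stationary(grad, eps)
            and lambda_min_2x2(h11, h12, h22) >= -delta)

# Hypothetical toy function: f(x, y) = x^2 - y^2 has a strict saddle at the origin.
def grad_f(x, y):
    return (2.0 * x, -2.0 * y)

def hess_f(x, y):
    return (2.0, 0.0, -2.0)  # entries (h11, h12, h22)

origin = (0.0, 0.0)
saddle_is_fosp = is_first_order_stationary(grad_f(*origin), eps=1e-3)
saddle_is_sosp = is_second_order_stationary(
    grad_f(*origin), hess_f(*origin), eps=1e-3, delta=0.1)
```

In this sketch the origin passes the first-order test but fails the second-order one, since the Hessian has an eigenvalue of $-2$, which is the kind of point the perturbation step is meant to escape.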
3 Simple Stochastic Recursive Gradient Descent
In this section, we propose the simple stochastic recursive gradient descent algorithm called SSRGD. The high-level description (which omits the stop condition details in Line 10) of this algorithm is in Algorithm 1 and the full algorithm (containing the stop condition) is described in Algorithm 2. Note that we call each outer loop an epoch, i.e., iterations $t$ from $sm$ to $sm+m$ for an epoch $s$. We call the iterations between the beginning of perturbation and the end of perturbation a super epoch.
The SSRGD algorithm is based on the stochastic recursive gradient descent which is introduced in (Nguyen et al., 2017) for convex optimization. In particular, Nguyen et al. (2017) want to save the storage of past gradients in SAGA (Defazio et al., 2014) by using the recursive gradient. However, this stochastic recursive gradient descent is widely used in recent work for nonconvex optimization such as SPIDER (Fang et al., 2018), SpiderBoost (Wang et al., 2018) and some variants of SARAH (e.g., ProxSARAH (Pham et al., 2019)).
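Recall that in the well-known SVRG algorithm, Johnson and Zhang (2013) reused a fixed snapshot full gradient $\nabla f(\tilde{x})$ (which is computed at the beginning of each epoch) in the gradient estimator:
$$v_{t}\leftarrow\frac{1}{b}\sum_{i\in I_{b}}\big(\nabla f_{i}(x_{t})-\nabla f_{i}(\tilde{x})\big)+\nabla f(\tilde{x}), \tag{9}$$
while the stochastic recursive gradient descent uses a recursive update form (more timely update):
$$v_{t}\leftarrow\frac{1}{b}\sum_{i\in I_{b}}\big(\nabla f_{i}(x_{t})-\nabla f_{i}(x_{t-1})\big)+v_{t-1}. \tag{10}$$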
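The difference between the snapshot-based SVRG estimator and the recursive estimator can be sketched in a few lines. This is an illustrative sketch only: the scalar toy finite sum $f_i(x)=\tfrac{1}{2}a_i(x-c_i)^2$, the constants, and the function names are assumptions, not the paper's code.

```python
import random

# Hypothetical toy finite sum (not from the paper): f_i(x) = 0.5 * a_i * (x - c_i)^2.
A = [1.0, 2.0, 0.5, 3.0]
C = [0.0, 1.0, -2.0, 0.5]
N = len(A)

def grad_i(i, x):
    return A[i] * (x - C[i])

def full_grad(x):
    return sum(grad_i(i, x) for i in range(N)) / N

def svrg_estimator(x_t, snapshot, snapshot_grad, batch):
    """SVRG form: correct the fixed snapshot full gradient."""
    diff = sum(grad_i(i, x_t) - grad_i(i, snapshot) for i in batch) / len(batch)
    return diff + snapshot_grad

def recursive_estimator(x_t, x_prev, v_prev, batch):
    """Recursive (SARAH-style) form: correct the previous estimate v_{t-1}."""
    diff = sum(grad_i(i, x_t) - grad_i(i, x_prev) for i in batch) / len(batch)
    return diff + v_prev

rng = random.Random(0)
eta = 0.1
x_prev = 2.0
v_prev = full_grad(x_prev)  # both methods start an epoch from a full gradient
x_t = x_prev - eta * v_prev
batch = [rng.randrange(N) for _ in range(2)]
v_svrg = svrg_estimator(x_t, x_prev, v_prev, batch)
v_rec = recursive_estimator(x_t, x_prev, v_prev, batch)

# One step later the two estimators use different reference points.
x_next = x_t - eta * v_rec
batch2 = [rng.randrange(N) for _ in range(2)]
v_svrg_2 = svrg_estimator(x_next, x_prev, v_prev, batch2)  # still anchored at the snapshot
v_rec_2 = recursive_estimator(x_next, x_t, v_rec, batch2)  # anchored at the previous iterate
```

On the first step of an epoch the two estimators coincide, because the snapshot is also the previous iterate; from the second step on, the recursive form measures gradient differences over a single step rather than over the whole distance to the snapshot.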
4 Convergence Results
Similar to the perturbed GD (Jin et al., 2017) and perturbed SVRG (Ge et al., 2019), we add simple perturbations to the stochastic recursive gradient descent algorithm to escape saddle points efficiently. Besides, we also consider the more general online case. In the following theorems, we provide the convergence results of SSRGD for finding an $\epsilon$-first-order stationary point and an $(\epsilon,\delta)$-second-order stationary point for both the nonconvex finite-sum problem (1) and the online problem (2). The proofs are provided in Appendix B. We give an overview of the proofs in Section 5.
4.1 Nonconvex Finite-sum Problem
Theorem 1
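Under Assumption 1 (i.e., (3)), let $\Delta f:=f(x_{0})-f^{*}$, where $x_{0}$ is the initial point and $f^{*}$ is the optimal value of $f$. By letting step size $\eta\le\frac{\sqrt{5}-1}{2L}$, epoch length $m=\sqrt{n}$ and minibatch size $b=\sqrt{n}$, SSRGD will find an $\epsilon$-first-order stationary point in expectation using
$$O\Big(n+\frac{L\Delta f\sqrt{n}}{\epsilon^{2}}\Big)$$
stochastic gradients for nonconvex finite-sum problem (1).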
Theorem 2
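Under Assumptions 1 and 2 (i.e., (3) and (5)), let $\Delta f:=f(x_{0})-f^{*}$, where $x_{0}$ is the initial point and $f^{*}$ is the optimal value of $f$. By letting step size $\eta=\widetilde{O}(\frac{1}{L})$, epoch length $m=\sqrt{n}$, minibatch size $b=\sqrt{n}$, perturbation radius $r=\widetilde{O}\big(\min(\frac{\delta^{3}}{\rho^{2}\epsilon},\frac{\delta^{3/2}}{\rho\sqrt{L}})\big)$, threshold gradient $g_{\mathrm{thres}}=\epsilon$, threshold function value $f_{\mathrm{thres}}=\widetilde{O}(\frac{\delta^{3}}{\rho^{2}})$ and super epoch length $t_{\mathrm{thres}}=\widetilde{O}(\frac{1}{\eta\delta})$, SSRGD will at least once get to an $(\epsilon,\delta)$-second-order stationary point with high probability using
$$\widetilde{O}\Big(\frac{L\Delta f\sqrt{n}}{\epsilon^{2}}+\frac{L\rho^{2}\Delta f\sqrt{n}}{\delta^{4}}+\frac{\rho^{2}\Delta f n}{\delta^{3}}\Big)$$
stochastic gradients for nonconvex finite-sum problem (1).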
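The overall loop structure behind these theorems can be sketched as follows. This is a simplified illustration and not the paper's Algorithm 2: it omits the function-value stop condition and the random stopping of epochs, and the toy objective, parameter values, and names (`ssrgd_sketch`, `uniform_ball_2d`) are all assumptions made for the example.

```python
import math
import random

# Hypothetical toy finite sum (an assumption, not from the paper):
#   f_i(x, y) = 0.5 * a_i * x^2 + 0.25 * y^4 - 0.5 * y^2, with mean(a_i) = 1,
# so f has a strict saddle at (0, 0) and local minima at (0, 1) and (0, -1).
A = [0.5, 1.5, 1.0, 1.0]
N = len(A)

def grad_i(i, w):
    x, y = w
    return (A[i] * x, y ** 3 - y)

def full_grad(w):
    gs = [grad_i(i, w) for i in range(N)]
    return (sum(g[0] for g in gs) / N, sum(g[1] for g in gs) / N)

def uniform_ball_2d(r, rng):
    """Draw a point uniformly from the 2-D Euclidean ball of radius r."""
    gx, gy = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    scale = r * math.sqrt(rng.random()) / math.hypot(gx, gy)
    return (gx * scale, gy * scale)

def ssrgd_sketch(w0, eta, m, b, r, g_thres, t_thres, epochs, rng):
    w = tuple(w0)
    super_left = 0  # remaining iterations of the current super epoch
    for _ in range(epochs):
        v = full_grad(w)  # full gradient at the start of each epoch
        if super_left == 0 and math.hypot(v[0], v[1]) <= g_thres:
            xi = uniform_ball_2d(r, rng)  # perturb only when the gradient is small
            w = (w[0] + xi[0], w[1] + xi[1])
            v = full_grad(w)
            super_left = t_thres
        for _ in range(m):
            w_new = (w[0] - eta * v[0], w[1] - eta * v[1])
            batch = [rng.randrange(N) for _ in range(b)]
            dx = sum(grad_i(i, w_new)[0] - grad_i(i, w)[0] for i in batch) / b
            dy = sum(grad_i(i, w_new)[1] - grad_i(i, w)[1] for i in batch) / b
            v = (v[0] + dx, v[1] + dy)  # recursive gradient estimator
            w = w_new
            super_left = max(0, super_left - 1)
    return w

# Start on the line y = 0, where unperturbed descent converges to the saddle.
w_final = ssrgd_sketch((1.0, 0.0), eta=0.2, m=4, b=2, r=0.01,
                       g_thres=1e-3, t_thres=20, epochs=200, rng=random.Random(0))
```

Started on the line $y=0$, the unperturbed iteration would converge to the saddle at the origin; in this sketch the occasional small perturbation is the only ingredient that lets the iterate drift toward one of the minima at $(0,\pm 1)$.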
4.2 Nonconvex Online (Expectation) Problem
For the nonconvex online problem (2), one usually needs the following bounded variance assumption. For notational convenience, we also consider this online case as the finite-sum form by letting $f_{i}(x):=F(x,\zeta_{i})$ and thinking of $n$ as infinity (infinite data samples). Although we write it in finite-sum form, the convergence analysis of optimization methods in this online case is a little different from the finite-sum case.
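**Assumption 3 (Bounded Variance)**
For $\forall x\in\mathbb{R}^{d}$, $\mathbb{E}_{i}[\|\nabla f_{i}(x)-\nabla f(x)\|^{2}]\le\sigma^{2}$, where $\sigma>0$ is a constant.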
Note that this assumption is standard and necessary for this online case since the full gradients are not available (see e.g., (Ghadimi et al., 2016; Lei et al., 2017; Li and Li, 2018; Zhou et al., 2018b; Fang et al., 2018; Wang et al., 2018; Pham et al., 2019)). Moreover, we need to modify the full gradient computation step at the beginning of each epoch to a large batch stochastic gradient computation step (similar to (Lei et al., 2017; Li and Li, 2018)), i.e., change $v_{sm}\leftarrow\nabla f(x_{sm})$ (Line 8 of Algorithm 2) to
$$v_{sm}\leftarrow\frac{1}{B}\sum_{j\in I_{B}}\nabla f_{j}(x_{sm}), \tag{11}$$
where $I_{B}$ are i.i.d. samples with $|I_{B}|=B$. We call $B$ the batch size and $b$ the minibatch size. Also, we need to change $\|\nabla f(x_{sm})\|\le g_{\mathrm{thres}}$ (Line 3 of Algorithm 2) to $\|v_{sm}\|\le g_{\mathrm{thres}}$.
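A minimal sketch of the large-batch replacement for the full gradient in the online case is given below. The stochastic oracle (a one-dimensional quadratic with Gaussian samples), the batch size, and the function names are assumptions made purely for illustration.

```python
import random

# Hypothetical stochastic oracle (an assumption): F(x, zeta) = 0.5 * (x - zeta)^2 with
# zeta ~ N(0, 1), so grad F(x, zeta) = x - zeta, grad f(x) = x, and sigma^2 = 1.
def sample_grad(x, rng):
    return x - rng.gauss(0.0, 1.0)

def large_batch_grad(x, batch_size, rng):
    """Average batch_size i.i.d. stochastic gradients in place of the full gradient."""
    return sum(sample_grad(x, rng) for _ in range(batch_size)) / batch_size

rng = random.Random(0)
x = 2.0
v_epoch_start = large_batch_grad(x, batch_size=10_000, rng=rng)
error = abs(v_epoch_start - x)  # the exact gradient of f at x is x in this toy model
```

Under the bounded variance assumption, the variance of this average shrinks like $\sigma^{2}/B$, which is why a sufficiently large batch can stand in for the unavailable full gradient at the start of each epoch.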
Theorem 3
Under Assumption 1 (i.e., (4)) and Assumption 3, let $\Delta f:=f(x_{0})-f^{*}$, where $x_{0}$ is the initial point and $f^{*}$ is the optimal value of $f$. By letting step size , batch size , minibatch size and epoch length , SSRGD will find an $\epsilon$-first-order stationary point in expectation using
[TABLE]
stochastic gradients for nonconvex online problem (2).
For achieving a high probability result of finding second-order stationary points in this online case (i.e., Theorem 4), we need a stronger version of Assumption 3 as in the following Assumption 4.
**Assumption 4 (Bounded Variance)**
For $\forall i,x$, $\|\nabla f_{i}(x)-\nabla f(x)\|^{2}\le\sigma^{2}$, where $\sigma>0$ is a constant.
We want to point out that Assumption 4 can be relaxed such that has sub-Gaussian tail, i.e., , for . Then it is sufficient for us to get a high probability bound by using Hoeffding bound on these sub-Gaussian variables. Note that Assumption 4 (or the relaxed sub-Gaussian version) is also standard in online case for second-order stationary point finding algorithms (see e.g., (Allen-Zhu and Li, 2018; Zhou et al., 2018a; Fang et al., 2018)).
Theorem 4
Under Assumptions 1 and 2 (i.e., (4) and (6)) and Assumption 4, let $\Delta f:=f(x_{0})-f^{*}$, where $x_{0}$ is the initial point and $f^{*}$ is the optimal value of $f$. By letting step size , batch size , minibatch size , epoch length , perturbation radius $r=\widetilde{O}\big(\min(\frac{\delta^{3}}{\rho^{2}\epsilon},\frac{\delta^{3/2}}{\rho\sqrt{L}})\big)$, threshold gradient , threshold function value and super epoch length , SSRGD will at least once get to an $(\epsilon,\delta)$-second-order stationary point with high probability using
[TABLE]
stochastic gradients for nonconvex online problem (2).
5 Overview of the Proofs
5.1 Finding First-order Stationary Points
In this section, we first show why the SSRGD algorithm can improve previous SVRG-type algorithms (see e.g., (Li and Li, 2018; Ge et al., 2019)) from $n^{2/3}/\epsilon^{2}$ to $n^{1/2}/\epsilon^{2}$. Then we give a simple high-level proof for achieving the $n^{1/2}/\epsilon^{2}$ convergence result (i.e., Theorem 1).
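Why it can be improved from $n^{2/3}/\epsilon^{2}$ to $n^{1/2}/\epsilon^{2}$: First, we need a key relation between $f(x_{t})$ and $f(x_{t-1})$, where $x_{t}:=x_{t-1}-\eta v_{t-1}$,
$$f(x_{t})\le f(x_{t-1})-\frac{\eta}{2}\|\nabla f(x_{t-1})\|^{2}-\Big(\frac{1}{2\eta}-\frac{L}{2}\Big)\|x_{t}-x_{t-1}\|^{2}+\frac{\eta}{2}\|\nabla f(x_{t-1})-v_{t-1}\|^{2}, \tag{12}$$
where (12) holds since $f$ has $L$-Lipschitz continuous gradient (Assumption 1). The details for obtaining (12) can be found in Appendix B.1 (see (25)).
Note that (12) is very meaningful and also very important for the proofs. The first term $-\frac{\eta}{2}\|\nabla f(x_{t-1})\|^{2}$ indicates that the function value will decrease a lot if the gradient $\nabla f(x_{t-1})$ is large. The second term $-\big(\frac{1}{2\eta}-\frac{L}{2}\big)\|x_{t}-x_{t-1}\|^{2}$ indicates that the function value will also decrease a lot if the moving distance $x_{t}-x_{t-1}$ is large (note that here we require the step size $\eta\le\frac{1}{L}$). The additional third term $+\frac{\eta}{2}\|\nabla f(x_{t-1})-v_{t-1}\|^{2}$ exists since we use $v_{t-1}$ as an estimator of the actual gradient $\nabla f(x_{t-1})$ (i.e., $x_{t}:=x_{t-1}-\eta v_{t-1}$). So it may increase the function value if $v_{t-1}$ is a bad direction in this step.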
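The three-term decrease relation discussed above (gradient term, moving-distance term, estimator-error term) can be sanity-checked numerically. The sketch below uses $f(x)=\sin x$, which has a 1-Lipschitz gradient, and treats the estimator $v$ as an arbitrary number; the test function and sampling ranges are assumptions for illustration.

```python
import math
import random

L = 1.0  # sin has a 1-Lipschitz continuous gradient

def f(x):
    return math.sin(x)

def grad(x):
    return math.cos(x)

def descent_bound(x_prev, v, eta):
    """Return (f(x_t), right-hand side of the key relation) for x_t = x_prev - eta * v."""
    x_t = x_prev - eta * v
    g = grad(x_prev)
    rhs = (f(x_prev)
           - 0.5 * eta * g ** 2                                    # large-gradient term
           - (1.0 / (2.0 * eta) - L / 2.0) * (x_t - x_prev) ** 2   # moving-distance term
           + 0.5 * eta * (g - v) ** 2)                             # estimator-error term
    return f(x_t), rhs

rng = random.Random(0)
gaps = []
for _ in range(10_000):
    lhs, rhs = descent_bound(rng.uniform(-5.0, 5.0),   # previous iterate
                             rng.uniform(-3.0, 3.0),   # arbitrary (possibly bad) estimator
                             rng.uniform(0.01, 1.0))   # step size eta <= 1/L
    gaps.append(rhs - lhs)
worst_gap = min(gaps)
```

The bound holds for any estimator value because the right-hand side is just the standard smoothness upper bound rewritten with the identity $-\langle a,b\rangle=\tfrac12(\|a-b\|^{2}-\|a\|^{2}-\|b\|^{2})$; the quality of $v_{t-1}$ only affects how large the last term is.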
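To get an $\epsilon$-first-order stationary point, we want to cancel the last two terms in (12). First, we want to bound the last variance term. Recall the variance bound (see Equation (29) in (Li and Li, 2018)) for the SVRG algorithm, i.e., estimator (9), where $\tilde{x}$ denotes the snapshot point of the current epoch:
$$\mathbb{E}\big[\|\nabla f(x_{t-1})-v_{t-1}\|^{2}\big]\le\frac{L^{2}}{b}\mathbb{E}\big[\|x_{t-1}-\tilde{x}\|^{2}\big]. \tag{13}$$
In order to connect the last two terms in (12), we use Young's inequality for the second term $\|x_{t}-x_{t-1}\|^{2}$, i.e., $-\|x_{t}-x_{t-1}\|^{2}\le\frac{1}{\alpha}\|x_{t-1}-\tilde{x}\|^{2}-\frac{1}{1+\alpha}\|x_{t}-\tilde{x}\|^{2}$ (for any $\alpha>0$). By plugging this Young's inequality and (13) into (12), we can cancel the last two terms in (12) by summing up (12) for each epoch, i.e., for each epoch $s$ (i.e., iterations $sm+1\le t\le sm+m$), we have (see Equation (35) in (Li and Li, 2018))
$$\mathbb{E}[f(x_{(s+1)m})]\le\mathbb{E}[f(x_{sm})]-\frac{\eta}{2}\sum_{j=sm+1}^{sm+m}\mathbb{E}\big[\|\nabla f(x_{j-1})\|^{2}\big]. \tag{14}$$
However, due to Young's inequality, we need to let $b\ge m^{2}$ to cancel the last two terms in (12) for obtaining (14), where $b$ denotes the minibatch size and $m$ denotes the epoch length. According to (14), it is not hard to see that $\hat{x}$ is an $\epsilon$-first-order stationary point in expectation (i.e., $\mathbb{E}[\|\nabla f(\hat{x})\|]\le\epsilon$) if $\hat{x}$ is chosen uniformly at random from $\{x_{t-1}\}_{t\in[T]}$ and the number of iterations is $T=Sm=\frac{2(f(x_{0})-f^{*})}{\eta\epsilon^{2}}$. Note that for each iteration we need to compute $b+\frac{n}{m}$ stochastic gradients, where we amortize the full gradient computation of the beginning point of each epoch ($n$ stochastic gradients) into each iteration in its epoch (i.e., $n/m$) for simple presentation. Thus, the convergence result is $T(b+\frac{n}{m})\ge O(\frac{n^{2/3}}{\epsilon^{2}})$ since $b\ge m^{2}$, where equality holds if $b=m^{2}=n^{2/3}$. Note that here we ignore the factors of $f(x_{0})-f^{*}$ and $\eta=O(1/L)$.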
However, for the stochastic recursive gradient descent estimator (10), we can bound the last variance term in (12) as (see Equation (31) in Appendix B.1):
$$\mathbb{E}\big[\|\nabla f(x_{t-1})-v_{t-1}\|^{2}\big]\le\frac{L^{2}}{b}\sum_{j=sm+1}^{t-1}\mathbb{E}\big[\|x_{j}-x_{j-1}\|^{2}\big]. \tag{15}$$
Now, the advantage of (15) compared with (13) is that it is already connected to the second term in (12), i.e., the moving distances $\{\|x_{t}-x_{t-1}\|^{2}\}_{t}$. Thus we do not need an additional Young's inequality to transform the second term as before. This makes the function value decrease bound tighter. Similarly, we plug (15) into (12) and sum it up for each epoch to cancel the last two terms in (12), i.e., for each epoch $s$, we have (see Equation (33) in Appendix B.1)
$$\mathbb{E}[f(x_{(s+1)m})]\le\mathbb{E}[f(x_{sm})]-\frac{\eta}{2}\sum_{j=sm+1}^{sm+m}\mathbb{E}\big[\|\nabla f(x_{j-1})\|^{2}\big]. \tag{16}$$
Compared with (14) (which requires $b\ge m^{2}$), here (16) only requires $b\ge m$ due to the tighter function value decrease bound, since it does not involve the additional Young's inequality.
High-level proof for achieving the $n^{1/2}/\epsilon^{2}$ result: Now, according to (16), we can use the same SVRG arguments as above to show the convergence result of SSRGD, i.e., $\hat{x}$ is an $\epsilon$-first-order stationary point in expectation (i.e., $\mathbb{E}[\|\nabla f(\hat{x})\|]\le\epsilon$) if $\hat{x}$ is chosen uniformly at random from $\{x_{t-1}\}_{t\in[T]}$ and the number of iterations is $T=Sm=\frac{2(f(x_{0})-f^{*})}{\eta\epsilon^{2}}$. Also, for each iteration, we compute $b+\frac{n}{m}$ stochastic gradients. The only difference is that now the convergence result is $T(b+\frac{n}{m})=O(\frac{L\Delta f\sqrt{n}}{\epsilon^{2}})$ since $b\ge m$ (rather than $b\ge m^{2}$), where we let $b=\sqrt{n}$, $m=\sqrt{n}$ and $\eta=O(1/L)$. Moreover, it is optimal since it matches the lower bound $\Omega(\frac{L\Delta f\sqrt{n}}{\epsilon^{2}})$ provided by (Fang et al., 2018).
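The effect of the two minibatch constraints on the per-iteration cost can be checked with a short brute-force computation. This is only an arithmetic illustration under the amortized cost model "minibatch size plus $n$ divided by epoch length"; the helper names and the choice $n=10^{6}$ are assumptions.

```python
def per_iteration_cost(n, m, b):
    """Amortized stochastic gradients per iteration: minibatch plus share of the full gradient."""
    return b + n / m

def best_cost(n, batch_rule):
    """Minimize the per-iteration cost over integer epoch lengths m, with b = batch_rule(m)."""
    return min((per_iteration_cost(n, m, batch_rule(m)), m) for m in range(1, n + 1))

n = 10 ** 6
# SVRG-type analysis: minibatch at least the square of the epoch length (taken with equality).
svrg_cost, svrg_m = best_cost(n, lambda m: m * m)
# Recursive-estimator analysis: minibatch at least the epoch length (taken with equality).
ssrgd_cost, ssrgd_m = best_cost(n, lambda m: m)

ratio = svrg_cost / ssrgd_cost
```

With the iteration count fixed at order $1/\epsilon^{2}$, the per-iteration cost scales like $n^{2/3}$ under the first constraint and like $n^{1/2}$ under the second, up to constant factors, which is the gap described in this section.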
5.2 Finding Second-order Stationary Points
In this section, we give the high-level proof ideas for finding a second-order stationary point with high probability. Note that our proof is different from that in (Ge et al., 2019) due to the different estimators (9) and (10). Ge et al. (2019) used the estimator (9) and thus their proof is based on the first-order analysis in (Li and Li, 2018). Here, our SSRGD uses the estimator (10). The difference in the first-order analysis between estimator (9) (Li and Li, 2018) and estimator (10) (this paper) is already discussed in Section 5.1. For the second-order analysis, since the estimator (10) in our SSRGD is more correlated than (9), we use martingales to handle it. Besides, different estimators incur more differences in the detailed proofs of the second-order guarantee analysis than in those of the first-order guarantee analysis.
We divide the proof into two situations, i.e., large gradients and around saddle points. According to (16), a natural way to prove the convergence result is to show that the function value will decrease at a desired rate with high probability. Note that the amount of function value decrease is at most $\Delta f:=f(x_{0})-f^{*}$.
Large gradients:
In this situation, due to the large gradients, it is sufficient to adjust the first-order analysis to show that the function value will decrease a lot in an epoch. Concretely, we want to show that the function value decrease bound (16) holds with high probability by using the Azuma-Hoeffding inequality. Then, according to (16), it is not hard to see that the desired rate of function value decrease is $\widetilde{O}(\eta g_{\mathrm{thres}}^{2})=\widetilde{O}(\frac{\epsilon^{2}}{L})$ per iteration in this situation (recall the parameters $g_{\mathrm{thres}}=\epsilon$ and $\eta=\widetilde{O}(1/L)$ in our Theorem 2). Also note that we compute $b+\frac{n}{m}=2\sqrt{n}$ stochastic gradients at each iteration (recall $m=b=\sqrt{n}$ in our Theorem 2). Here we amortize the full gradient computation of the beginning point of each epoch ($n$ stochastic gradients) into each iteration in its epoch (i.e., $n/m$) for simple presentation (we will analyze this more rigorously in the detailed proofs in the appendices). Thus the number of stochastic gradient computations is at most $\widetilde{O}(\sqrt{n}\frac{\Delta f}{\epsilon^{2}/L})=\widetilde{O}(\frac{L\Delta f\sqrt{n}}{\epsilon^{2}})$ for this large gradients situation.
For the proof, to show that the function value decrease bound (16) holds with high probability, we need to show that the bound for the variance term ($\|v_{k}-\nabla f(x_{k})\|^{2}$) holds with high probability. Note that the gradient estimator $v_{k}$ defined in (10) is correlated with the previous $v_{k-1}$. Fortunately, let $y_{k}:=v_{k}-\nabla f(x_{k})$; then it is not hard to see that $\{y_{k}\}$ is a martingale vector sequence with respect to a filtration $\{\mathscr{F}_{k}\}$ such that $\mathbb{E}[y_{k}|\mathscr{F}_{k-1}]=y_{k-1}$. Moreover, let $\{z_{k}\}$ denote the associated martingale difference sequence with respect to the filtration $\{\mathscr{F}_{k}\}$, i.e., $z_{k}:=y_{k}-\mathbb{E}[y_{k}|\mathscr{F}_{k-1}]=y_{k}-y_{k-1}$ and $\mathbb{E}[z_{k}|\mathscr{F}_{k-1}]=0$. Thus to bound the variance term $\|v_{k}-\nabla f(x_{k})\|^{2}=\|y_{k}\|^{2}$ with high probability, it is sufficient to bound the martingale sequence $\{y_{k}\}$. This can be bounded with high probability by using the martingale Azuma-Hoeffding inequality. Note that in order to apply the Azuma-Hoeffding inequality, we first need to use the Bernstein inequality to bound the associated difference sequence $\{z_{k}\}$. In sum, we will get the high probability function value decrease bound by applying these two inequalities (see (42) in Appendix B.1).
Note that (42) only guarantees function value decrease when the summation of gradients in this epoch is large. However, in order to connect the guarantees between first situation (large gradients) and second situation (around saddle points), we need to show guarantees that are related to the gradient of the starting point of each epoch (see Line 3 of Algorithm 2). Similar to (Ge et al., 2019), we achieve this by stopping the epoch at a uniformly random point (see Line 16 of Algorithm 2). We use the following lemma to connect these two situations (large gradients and around saddle points):
**Lemma 1 (Connection of Two Situations)**
For any epoch $s$, let $x_{t}$ be a point uniformly sampled from this epoch and choose the step size (where ) and the minibatch size $b\ge m$. Then for any $g_{\mathrm{thres}}$, we have two cases:
1. If at least half of the points in this epoch have gradient norm no larger than $g_{\mathrm{thres}}$, then $\|\nabla f(x_{(s+1)m})\|\le g_{\mathrm{thres}}$ holds with probability at least $1/2$;
2. Otherwise, we know holds with probability at least

Moreover, holds with high probability no matter which case happens.
Note that if Case 2 happens, the function value already decreases a lot in this epoch (as we already discussed at the beginning of this situation). Otherwise Case 1 happens: the starting point of the next epoch is $x_{(s+1)m}=x_{t}$ (i.e., Line 19 of Algorithm 2), so we know $\|\nabla f(x_{(s+1)m})\|\le g_{\mathrm{thres}}$. Then we will start a super epoch (see Line 3 of Algorithm 2). This corresponds to the following second situation (around saddle points). Note that if $\lambda_{\min}(\nabla^{2}f(x_{(s+1)m}))\ge-\delta$, this point is already an $(\epsilon,\delta)$-second-order stationary point (recall $g_{\mathrm{thres}}=\epsilon$ in our Theorem 2).
Around saddle points: $\|\nabla f(\tilde{x})\|\le g_{\mathrm{thres}}$ and $\lambda_{\min}(\nabla^{2}f(\tilde{x}))\le-\delta$ at the initial point $\tilde{x}$ of a super epoch
In this situation, we want to show that the function value will decrease a lot in a super epoch (instead of an epoch as in the first situation) with high probability by adding a random perturbation at the initial point $\tilde{x}$. To simplify the presentation, we use $x_{0}:=\tilde{x}+\xi$ to denote the starting point of the super epoch after the perturbation, where $\xi$ is drawn uniformly from $\mathbb{B}_{0}(r)$ and the perturbation radius is $r$ (see Line 6 in Algorithm 2). Following the classical and widely used two-point analysis developed in (Jin et al., 2017), we consider two coupled points $x_{0}$ and $x_{0}'$ with $w_{0}:=x_{0}-x_{0}'=r_{0}e_{1}$, where $r_{0}$ is a scalar and $e_{1}$ denotes the smallest eigenvector direction of the Hessian $\nabla^{2}f(\tilde{x})$. Then we get two coupled sequences $\{x_{t}\}$ and $\{x_{t}'\}$ by running SSRGD update steps (Line 8–12 of Algorithm 2) with the same choice of minibatches (i.e., $I_{b}$'s in Line 12 of Algorithm 2) for a super epoch. We will show that at least one of these two coupled sequences will decrease the function value a lot (escape the saddle point) with high probability, i.e.,
$$\exists t\le t_{\mathrm{thres}},\ \text{such that}\ \max\{f(x_{0})-f(x_{t}),\ f(x_{0}')-f(x_{t}')\}\ge 2f_{\mathrm{thres}}. \tag{17}$$
Similar to the classical argument in (Jin et al., 2017), according to (17), we know that in the random perturbation ball, the stuck points can only be a short interval in the $e_{1}$ direction, i.e., at least one of two points in the $e_{1}$ direction will escape the saddle point if their distance is larger than $r_{0}=\frac{\zeta' r}{\sqrt{d}}$. Thus, we know that the probability of the starting point $x_{0}=\tilde{x}+\xi$ (where $\xi$ is drawn uniformly from $\mathbb{B}_{0}(r)$) being located in the stuck region is less than $\zeta'$ (see (48) in Appendix B.1). By a union bound ($x_{0}$ is not in a stuck region and (17) holds), with high probability, we have
$$\exists t\le t_{\mathrm{thres}},\quad f(x_{0})-f(x_{t})\ge 2f_{\mathrm{thres}}. \tag{18}$$
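Note that the initial point of this super epoch is $\tilde{x}$ before the perturbation (see Line 6 of Algorithm 2), thus we also need to show that the perturbation step $x_{0}=\tilde{x}+\xi$ (where $\xi$ is drawn uniformly from $\mathbb{B}_{0}(r)$) does not increase the function value a lot, i.e.,
$$f(x_{0})\le f(\tilde{x})+\langle\nabla f(\tilde{x}),x_{0}-\tilde{x}\rangle+\frac{L}{2}\|x_{0}-\tilde{x}\|^{2}\le f(\tilde{x})+g_{\mathrm{thres}}\cdot r+\frac{L}{2}r^{2}\le f(\tilde{x})+f_{\mathrm{thres}}, \tag{19}$$
where the second inequality holds since the initial point $\tilde{x}$ satisfies $\|\nabla f(\tilde{x})\|\le g_{\mathrm{thres}}$ and the perturbation radius is $r$, and the last step holds by letting the perturbation radius $r$ be small enough. By combining (18) and (19), we obtain with high probability
$$\exists t\le t_{\mathrm{thres}},\quad f(\tilde{x})-f(x_{t})\ge f_{\mathrm{thres}}. \tag{20}$$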
Now, we obtain that the desired rate of function value decrease in this situation is $\frac{f_{\mathrm{thres}}}{t_{\mathrm{thres}}}=\widetilde{O}(\frac{\delta^{4}}{L\rho^{2}})$ per iteration (recall the parameters $f_{\mathrm{thres}}=\widetilde{O}(\frac{\delta^{3}}{\rho^{2}})$, $t_{\mathrm{thres}}=\widetilde{O}(\frac{1}{\eta\delta})$ and $\eta=\widetilde{O}(1/L)$ in our Theorem 2). Same as before, we compute $b+\frac{n}{m}=2\sqrt{n}$ stochastic gradients at each iteration (recall $m=b=\sqrt{n}$ in our Theorem 2). Thus the number of stochastic gradient computations is at most $\widetilde{O}(\sqrt{n}\frac{\Delta f}{\delta^{4}/(L\rho^{2})})=\widetilde{O}(\frac{L\rho^{2}\Delta f\sqrt{n}}{\delta^{4}})$ for this around saddle points situation.
Now, the remaining task is to prove (17). It can be proved by contradiction. Assume the contrary, $f(x_{0})-f(x_{t})<2f_{\mathrm{thres}}$ and $f(x_{0}')-f(x_{t}')<2f_{\mathrm{thres}}$. First, we show that if the function value does not decrease a lot, then all iteration points are not far from the starting point with high probability.
**Lemma 2 (Localization)**
Let denote the sequence by running SSRGD update steps (Line 8–12 of Algorithm 2) from . Moreover, let the step size and minibatch size , with probability , we have
[TABLE]
where .
Then we show that the stuck region is relatively small in the random perturbation ball, i.e., at least one of $x_{t}$ and $x_{t}'$ will go far away from their starting points $x_{0}$ and $x_{0}'$ with high probability.
**Lemma 3 (Small Stuck Region)**
If the initial point $\tilde{x}$ satisfies $\lambda_{\min}(\nabla^{2}f(\tilde{x}))\le-\delta$, then let $\{x_{t}\}$ and $\{x_{t}'\}$ be two coupled sequences by running SSRGD update steps (Line 8–12 of Algorithm 2) with the same choice of minibatches (i.e., $I_{b}$'s in Line 12) from $x_{0}$ and $x_{0}'$ with $w_{0}:=x_{0}-x_{0}'=r_{0}e_{1}$, where , , and $e_{1}$ denotes the smallest eigenvector direction of the Hessian $\nabla^{2}f(\tilde{x})$. Moreover, let the super epoch length , the step size $\eta\le\min\big(\frac{1}{8\log(\frac{8\delta\sqrt{d}}{C_{1}\rho\zeta' r})L},\frac{1}{4C_{2}L\log t_{\mathrm{thres}}}\big)=\widetilde{O}(\frac{1}{L})$, minibatch size and the perturbation radius , then with probability , we have
[TABLE]
where and .
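The intuition behind the coupled two-point argument can be illustrated on a pure quadratic saddle with exact gradient steps. This is a deliberate simplification and an assumption of the sketch: the lemma itself concerns the stochastic recursive estimator and functions with Lipschitz Hessians, and the constants below are hypothetical.

```python
import math

def gd_step(w, eta, hess_diag):
    """One gradient step on f(w) = 0.5 * sum(h_i * w_i^2), whose gradient is (h_i * w_i)."""
    return [wi - eta * h * wi for wi, h in zip(w, hess_diag)]

eta, delta = 0.1, 0.5
hess_diag = [1.0, -delta]  # smallest eigenvalue is -delta, with eigenvector e_1 = (0, 1)
r0 = 1e-6                  # initial separation of the coupled points along e_1
w, w_prime = [0.3, 0.0], [0.3, r0]

steps = 0
while abs(w_prime[1] - w[1]) < 1.0:
    w, w_prime = gd_step(w, eta, hess_diag), gd_step(w_prime, eta, hess_diag)
    steps += 1

# The separation along e_1 grows by a factor (1 + eta * delta) per step, so the number
# of steps needed to reach distance 1 is logarithmic in 1 / r0.
predicted = math.ceil(math.log(1.0 / r0) / math.log(1.0 + eta * delta))
```

Because the separation grows geometrically along the negative-curvature direction, two starting points that differ by even a tiny amount in that direction cannot both stay near the saddle for long, which is why the stuck region is a thin slab of the perturbation ball.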
Based on these two lemmas, we are ready to show that (17) holds with high probability. Without loss of generality, we assume in (22) (note that (21) holds for both and ), then by plugging it into (21) to obtain
[TABLE]
where the last inequality is due to and the first equality holds by letting (recall the parameters and in our Theorem 2). Now, the high-level proof for this situation is finished.
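In sum, the number of stochastic gradient computations is at most $\widetilde{O}(\frac{L\Delta f\sqrt{n}}{\epsilon^{2}})$ for the large gradients situation and is at most $\widetilde{O}(\frac{L\rho^{2}\Delta f\sqrt{n}}{\delta^{4}})$ for the around saddle points situation. Moreover, for the classical version where $\delta=\sqrt{\rho\epsilon}$ (Nesterov and Polyak, 2006; Jin et al., 2017), we have $\widetilde{O}(\frac{L\rho^{2}\Delta f\sqrt{n}}{\delta^{4}})=\widetilde{O}(\frac{L\Delta f\sqrt{n}}{\epsilon^{2}})$, i.e., both situations get the same stochastic gradient complexity. This also matches the convergence result for finding first-order stationary points (see our Theorem 1) if we ignore the logarithmic factor. More importantly, it also almost matches the lower bound $\Omega(\frac{L\Delta f\sqrt{n}}{\epsilon^{2}})$ provided by (Fang et al., 2018) for finding even just an $\epsilon$-first-order stationary point.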
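As a quick check of the claim that both situations give the same complexity in the classical setting, substituting $\delta=\sqrt{\rho\epsilon}$ into the saddle-point bound derived earlier in this section gives the following one-line calculation:

```latex
\delta^{4}=(\sqrt{\rho\epsilon})^{4}=\rho^{2}\epsilon^{2}
\quad\Longrightarrow\quad
\widetilde{O}\Big(\frac{L\rho^{2}\Delta f\sqrt{n}}{\delta^{4}}\Big)
=\widetilde{O}\Big(\frac{L\rho^{2}\Delta f\sqrt{n}}{\rho^{2}\epsilon^{2}}\Big)
=\widetilde{O}\Big(\frac{L\Delta f\sqrt{n}}{\epsilon^{2}}\Big).
```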
Finally, we point out that there is an extra term $\frac{\rho^{2}\Delta f n}{\delta^{3}}$ in Theorem 2 beyond the two terms obtained from the above two situations. The reason is that we amortize the full gradient computation of the beginning point of each epoch ($n$ stochastic gradients) into each iteration in its epoch (i.e., $n/m$) for simple presentation. We will analyze this more rigorously in the appendices, which incurs the term $\frac{\rho^{2}\Delta f n}{\delta^{3}}$. For the more general online problem (2), the high-level proofs are almost the same as for the finite-sum problem (1). The difference is that we need to use more concentration bounds in the detailed proofs since the full gradients are not available in the online case.
6 Conclusion
In this paper, we focus on developing simple algorithms that have theoretical second-order guarantee for nonconvex finite-sum problems and more general nonconvex online problems. Concretely, we propose a simple perturbed version of stochastic recursive gradient descent algorithm (called SSRGD), which is as simple as its first-order stationary point finding algorithm (just by adding a random perturbation sometimes) and thus can be simply applied in practice for escaping saddle points (finding local minima). Moreover, the theoretical convergence results of SSRGD for finding second-order stationary points (local minima) almost match the theoretical results for finding first-order stationary points and these results are near-optimal as they almost match the lower bound.
Acknowledgments
The author would like to thank Rong Ge since the author learned a lot under his genuine guidance during the visit at Duke.
Appendix A Tools
In this appendix, we recall some classical concentration bounds for matrices and vectors.
Proposition 1 (Bernstein Inequality [Tropp, 2012])
Consider a finite sequence of independent, random matrices with dimension . Assume that each random matrix satisfies
[TABLE]
Define
[TABLE]
Then, for all ,
[TABLE]
In our proof, we only need its special case vector version as follows, where .
Proposition 2 (Bernstein Inequality [Tropp, 2012])
Consider a finite sequence of independent, random vectors with dimension . Assume that each random vector satisfies
[TABLE]
Define
[TABLE]
Then, for all ,
[TABLE]
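As a numerical sanity check of the vector form, the sketch below draws independent zero-mean vectors with norm at most R and compares the empirical tail of the norm of their sum with the bound (d+1)·exp(−(t²/2)/(σ² + Rt/3)). This is the standard statement from Tropp (2012) specialised to d×1 matrices, and it is assumed here to be the form used above. The sampling distribution (uniform on a sphere of radius R), the helper names and all numeric values are illustrative assumptions.

```python
import math
import random


def random_sphere_vector(d, radius, rng):
    """Zero-mean random vector, uniform on the sphere of the given radius."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    g_norm = math.sqrt(sum(x * x for x in g))
    return [radius * x / g_norm for x in g]


def bernstein_bound(t, d, sigma2, R):
    """Assumed vector Bernstein tail bound: (d + 1) * exp(-(t^2 / 2) / (sigma^2 + R t / 3))."""
    return (d + 1) * math.exp(-(t * t / 2.0) / (sigma2 + R * t / 3.0))


def empirical_tail(t, d, n_terms, R, trials, rng):
    """Fraction of trials in which the norm of the sum of n_terms vectors is at least t."""
    hits = 0
    for _ in range(trials):
        s = [0.0] * d
        for _ in range(n_terms):
            z = random_sphere_vector(d, R, rng)
            s = [a + b for a, b in zip(s, z)]
        if math.sqrt(sum(x * x for x in s)) >= t:
            hits += 1
    return hits / trials


rng = random.Random(0)
d, n_terms, R = 5, 50, 1.0
sigma2 = n_terms * R ** 2          # sum_k E||z_k||^2, since ||z_k|| = R exactly
t = 3.0 * math.sqrt(sigma2)
tail = empirical_tail(t, d, n_terms, R, trials=2000, rng=rng)
bound = bernstein_bound(t, d, sigma2, R)
print(f"empirical tail = {tail:.4f}, Bernstein bound = {bound:.4f}")
```

The bound is loose for this distribution, which is expected: it holds for every bounded zero-mean sequence with the same R and σ².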
Moreover, we also need martingale concentration bounds, i.e., the Azuma-Hoeffding inequality. We state only the vector version and do not repeat the more general matrix version.
Proposition 3 (Azuma-Hoeffding Inequality [Hoeffding, 1963, Tropp, 2011])
Consider a martingale vector sequence with dimension , and let denote the associated martingale difference sequence with respect to a filtration , i.e., and . Suppose that satisfies
[TABLE]
Then, for all ,
[TABLE]
However, the assumption that in (23) holds with probability one sometimes fails. Fortunately, the Azuma-Hoeffding inequality still holds, with some slack, if with high probability.
Proposition 4 (Azuma-Hoeffding Inequality with High Probability [Chung and Lu, 2006, Tao and Vu, 2015])
Consider a martingale vector sequence with dimension , and let denote the associated martingale difference sequence with respect to a filtration , i.e., and . Suppose that satisfies
[TABLE]
Then, for all ,
[TABLE]
Appendix B Missing Proofs
In this appendix, we provide the detailed proofs for Theorems 1–4.
B.1 Proofs for Finite-sum Problem
In this section, we provide the detailed proofs for the nonconvex finite-sum problem (1) (i.e., Theorems 1–2).
First, we obtain the relation between and as follows similar to [Li and Li, 2018, Ge et al., 2019], where we let and ,
[TABLE]
where (24) holds since has -Lipschitz continuous gradient (Assumption 1). Now, we bound the variance term as follows, where we take expectations with the history:
[TABLE]
where (26) and (27) use the law of total expectation and if are independent and of mean zero, (28) uses the fact , and (29) holds due to the gradient Lipschitz Assumption 1.
Note that for in (29), we can reuse the same computation above. Thus we can sum up (29) from the beginning of this epoch to the point ,
[TABLE]
where (31) holds since we compute the full gradient at the beginning point of this epoch, i.e., (see Line 5 of Algorithm 1). Now, we take expectations for (25) and then sum it up from the beginning of this epoch , i.e., iterations from to , by plugging the variance (31) into them to get:
[TABLE]
where (32) holds if the minibatch size (note that here ), and (33) holds if the step size .
Proof of Theorem 1. Let and step size , then (33) holds. Now, the proof is directly obtained by summing up (33) for all epochs as follows:
[TABLE]
where (34) holds by choosing uniformly from and letting . Note that the total number of stochastic gradient computations equals
[TABLE]
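The epoch structure analysed above can be sketched as follows: a full gradient at the start of each epoch, recursive minibatch updates inside it, and a running count of stochastic gradient evaluations. This is only an illustration on a toy least-squares finite sum. The problem, the choices b = m = ⌊√n⌋ and η = 1/(2L), sampling with replacement, and all function names are assumptions of the sketch rather than the paper's exact parameter setting.

```python
import math
import random


def grad_i(x, a, y):
    """Gradient of the toy component f_i(x) = 0.5 * (a . x - y)^2."""
    residual = sum(aj * xj for aj, xj in zip(a, x)) - y
    return [residual * aj for aj in a]


def full_grad(x, data):
    """Gradient of f(x) = (1/n) * sum_i f_i(x)."""
    g = [0.0] * len(x)
    for a, y in data:
        g = [gj + gij for gj, gij in zip(g, grad_i(x, a, y))]
    return [gj / len(data) for gj in g]


def norm(v):
    return math.sqrt(sum(vj * vj for vj in v))


def ssrgd_first_order(data, x0, eta, b, m, epochs, rng):
    """Recursive-gradient epochs without perturbation (first-order variant)."""
    x, x_prev, v = list(x0), list(x0), None
    n_grads = 0
    for _ in range(epochs):
        for k in range(m):
            if k == 0:
                v = full_grad(x, data)           # full gradient at the epoch start
                n_grads += len(data)
            else:
                batch = [rng.choice(data) for _ in range(b)]
                for a, y in batch:               # v_k = v_{k-1} + minibatch gradient difference
                    new, old = grad_i(x, a, y), grad_i(x_prev, a, y)
                    v = [vj + (p - q) / b for vj, p, q in zip(v, new, old)]
                n_grads += 2 * b                 # two gradients per sampled index
            x_prev = x
            x = [xj - eta * vj for xj, vj in zip(x, v)]
    return x, n_grads


rng = random.Random(0)
n, d = 64, 3
x_true = [1.0, -2.0, 0.5]
data = []
for _ in range(n):
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]
    data.append((a, sum(aj * xj for aj, xj in zip(a, x_true))))
L = max(sum(aj * aj for aj in a) for a, _ in data)   # each f_i is ||a_i||^2-smooth
b = m = math.isqrt(n)
epochs = 50
x0 = [0.0] * d
x_out, n_grads = ssrgd_first_order(data, x0, 1.0 / (2.0 * L), b, m, epochs, rng)
print(norm(full_grad(x0, data)), norm(full_grad(x_out, data)), n_grads)
```

With b = m ≈ √n, each epoch costs n + 2b(m − 1) = O(n) stochastic gradients, which is the amortisation used when counting the total complexity.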
B.1.1 Proof of Theorem 2
For proving the second-order guarantee, we divide the proof into two situations. The first situation (large gradients) is almost the same as the above arguments for first-order guarantee, where the function value will decrease a lot since the gradients are large (see (33)). For the second situation (around saddle points), we will show that the function value can also decrease a lot by adding a random perturbation. The reason is that saddle points are usually unstable and the stuck region is relatively small in a random perturbation ball.
Large Gradients: First, we need a high probability bound for the variance term instead of the expectation one (31). Then we use it to get a high probability bound of (33) for function value decrease. Recall that $v_k = \frac{1}{b}\sum_{i\in I_b}\big(\nabla f_i(x_k) - \nabla f_i(x_{k-1})\big) + v_{k-1}$ (see Line 9 of Algorithm 1). We let and . It is not hard to verify that is a martingale sequence and is the associated martingale difference sequence. In order to apply the Azuma-Hoeffding inequalities to get a high probability bound, we first need to bound the difference sequence . We use the Bernstein inequality to bound the differences as follows.
[TABLE]
We define , and then we have
[TABLE]
where the last inequality holds due to the gradient Lipschitz Assumption 1. Then, consider the variance term
[TABLE]
where the first inequality uses the fact , and the last inequality uses the gradient Lipschitz Assumption 1. According to (36) and (37), we can bound the difference by Bernstein inequality (Proposition 2) as
[TABLE]
where the last equality holds by letting , where . Now, we have a high probability bound for the difference sequence , i.e.,
[TABLE]
Now, we are ready to get a high probability bound for our original variance term (31) by using the martingale Azuma-Hoeffding inequality. Consider in a specific epoch , i.e., iterations from to the current , where is less than (note that we only need to consider the current epoch since each epoch starts with ). We use a union bound for the difference sequence by letting such that
[TABLE]
Then according to Azuma-Hoeffding inequality (Proposition 4) and noting that , we have
[TABLE]
where the last equality holds by letting , where . Recall that and at the beginning point of this epoch due to (see Line 5 of Algorithm 1), thus we have
[TABLE]
with probability , where belongs to .
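The martingale structure used in this bound can be checked numerically. The sketch below assumes the martingale is the estimator error y_k = v_k − ∇f(x_k), with differences z_k = y_k − y_{k−1}, and verifies on a toy one-dimensional nonconvex finite sum that the error telescopes into the sum of the differences since the epoch start, and that each difference is at most 2L|x_k − x_{k−1}| in absolute value. The component functions, constants and variable names are illustrative assumptions.

```python
import math
import random

rng = random.Random(1)
n, b, eta, steps = 100, 10, 0.05, 20
a = [rng.uniform(-1.0, 1.0) for _ in range(n)]
c = [rng.uniform(0.5, 1.5) for _ in range(n)]
# f_i(x) = a_i * sin(x) + 0.5 * c_i * x^2 has an L-Lipschitz gradient with:
L = max(abs(ai) + ci for ai, ci in zip(a, c))


def grad_i(i, x):
    return a[i] * math.cos(x) + c[i] * x


def full_grad(x):
    return sum(grad_i(i, x) for i in range(n)) / n


x = 2.0
v = full_grad(x)                     # epoch start: the estimator error is zero
errors, diffs, step_lengths = [], [], []
for _ in range(steps):
    x_prev, x = x, x - eta * v
    batch = [rng.randrange(n) for _ in range(b)]
    minibatch_diff = sum(grad_i(i, x) - grad_i(i, x_prev) for i in batch) / b
    v = v + minibatch_diff           # recursive estimator update
    z = minibatch_diff - (full_grad(x) - full_grad(x_prev))   # martingale difference
    diffs.append(z)
    errors.append(v - full_grad(x))  # y_k = v_k - grad f(x_k)
    step_lengths.append(abs(x - x_prev))

partial_sums = [sum(diffs[: k + 1]) for k in range(steps)]
print(max(abs(e - s) for e, s in zip(errors, partial_sums)))
```

Because the error is a sum of differences whose size scales with the step length, the Azuma-Hoeffding bound on the error is controlled by the squared step lengths accumulated within the epoch.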
Now, we use this high probability version (40) instead of the expectation one (31) to obtain the high probability bound for function value decrease (see (33)). We sum up (25) from the beginning of this epoch , i.e., iterations from to , by plugging (40) into them to get:
[TABLE]
where (41) holds if the minibatch size (note that here ), and (42) holds if the step size .
Note that (42) only guarantees function value decrease when the summation of gradients in this epoch is large. However, in order to connect the guarantees between first situation (large gradients) and second situation (around saddle points), we need to show guarantees that are related to the gradient of the starting point of each epoch (see Line 3 of Algorithm 2). Similar to [Ge et al., 2019], we achieve this by stopping the epoch at a uniformly random point (see Line 16 of Algorithm 2).
Now we recall Lemma 1 to connect these two situations (large gradients and around saddle points):
Lemma 1 (Connection of Two Situations)
For any epoch , let be a point uniformly sampled from this epoch and choose the step size (where ) and the minibatch size . Then for any , we have two cases:
1. If at least half of points in this epoch have gradient norm no larger than , then holds with probability at least ;
2. Otherwise, we know holds with probability at least .
Moreover, holds with high probability no matter which case happens.
Proof of Lemma 1. There are two cases in this epoch:
1. If at least half of points of in this epoch have gradient norm no larger than , then it is easy to see that a uniformly sampled point has gradient norm with probability at least .
2. Otherwise, at least half of points have gradient norm larger than . Then, as long as the sampled point falls into the last quarter of , we know . This holds with probability at least since is uniformly sampled. Then combining with (42), i.e., , we obtain the function value decrease . Note that (42) holds with high probability if we choose the minibatch size and the step size . By a union bound, the function value decreases with probability at least .
Again according to (42), always holds with high probability.
Note that if Case 2 happens, the function value already decreases a lot in this epoch (corresponding to the first situation large gradients). Otherwise Case 1 happens, we know the starting point of the next epoch (i.e., Line 19 of Algorithm 2), then we know . Then we will start a super epoch (corresponding to the second situation around saddle points). Note that if , this point is already an -second-order stationary point (recall that in our Theorem 2).
Around Saddle Points and : In this situation, we will show that the function value decreases a lot in a super epoch (instead of an epoch as in the first situation) with high probability by adding a random perturbation at the initial point . To simplify the presentation, we use to denote the starting point of the super epoch after the perturbation, where uniformly and the perturbation radius is (see Line 6 in Algorithm 2). Following the classical widely used two-point analysis developed in [Jin et al., 2017], we consider two coupled points and with , where is a scalar and denotes the smallest eigenvector direction of Hessian . Then we get two coupled sequences and by running SSRGD update steps (Line 8–12 of Algorithm 2) with the same choice of minibatches (i.e., ’s in Line 12 of Algorithm 2) for a super epoch. We will show that at least one of these two coupled sequences will decrease the function value a lot (escape the saddle point), i.e.,
[TABLE]
We will prove (43) by contradiction. Assume the contrary, and . First, we show that if function value does not decrease a lot, then all iteration points are not far from the starting point with high probability. Then we will show that the stuck region is relatively small in the random perturbation ball, i.e., at least one of and will go far away from their starting point and with high probability. Thus there is a contradiction. We recall these two lemmas here and their proofs are deferred to the end of this section.
Lemma 2 (Localization)
Let denote the sequence by running SSRGD update steps (Line 8–12 of Algorithm 2) from . Moreover, let the step size and minibatch size , with probability , we have
[TABLE]
where .
Lemma 3 (Small Stuck Region)
If the initial point satisfies , then let and be two coupled sequences by running SSRGD update steps (Line 8–12 of Algorithm 2) with the same choice of minibatches (i.e., ’s in Line 12) from and with , where , , and denotes the smallest eigenvector direction of Hessian . Moreover, let the super epoch length , the step size $\eta \leq \min\big(\frac{1}{8\log(\frac{8\delta\sqrt{d}}{C_1\rho\zeta' r})L}, \frac{1}{4C_2 L\log t_{\mathrm{thres}}}\big) = \widetilde{O}(\frac{1}{L})$, minibatch size and the perturbation radius , then with probability , we have
[TABLE]
where and .
Based on these two lemmas, we are ready to show that (43) holds with high probability. Without loss of generality, we assume in (45) (note that (44) holds for both and ), then plugging it into (44) to obtain
[TABLE]
where the last inequality is due to and (46) holds by letting . Thus, we have proved that at least one of the sequences and escapes the saddle point with high probability, i.e.,
[TABLE]
if their starting points and satisfying , where and denotes the smallest eigenvector direction of Hessian . Similar to the classical argument in [Jin et al., 2017], we know that in the random perturbation ball, the stuck points can only be a short interval in the direction, i.e., at least one of two points in the direction will escape the saddle point if their distance is larger than . Thus, we know that the probability of the starting point (where uniformly ) located in the stuck region is less than
[TABLE]
where denotes the volume of a Euclidean ball with radius in dimension, and the first inequality holds due to Gautschi’s inequality. By a union bound for (48) and (46) (holds with high probability if is not in a stuck region), we know
[TABLE]
with high probability. Note that the initial point of this super epoch is before the perturbation (see Line 6 of Algorithm 2), thus we need to show that the perturbation step (where uniformly ) does not increase the function value a lot, i.e.,
[TABLE]
where the last inequality holds by letting the perturbation radius .
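To illustrate the perturbation step, the sketch below samples ξ uniformly from a Euclidean ball of radius r and checks the standard smoothness bound f(x̃ + ξ) − f(x̃) ≤ ‖∇f(x̃)‖·r + L·r²/2 on a toy quadratic saddle. The quadratic, the point x̃, the radius and the helper names are illustrative assumptions, and the inequality is the generic consequence of an L-Lipschitz gradient rather than a restatement of (50).

```python
import math
import random


def norm(v):
    return math.sqrt(sum(vj * vj for vj in v))


def sample_ball(d, r, rng):
    """Uniform sample from the d-dimensional Euclidean ball of radius r."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    g_norm = norm(g)
    radius = r * rng.random() ** (1.0 / d)   # radial law of the uniform ball
    return [radius * gj / g_norm for gj in g]


# Toy saddle f(x) = 0.5 * sum_j h_j x_j^2 with mixed-sign curvature.
h = [-0.5, 1.0, 0.3, -0.2]
L = max(abs(hj) for hj in h)                 # gradient Lipschitz constant


def f(x):
    return 0.5 * sum(hj * xj * xj for hj, xj in zip(h, x))


def grad(x):
    return [hj * xj for hj, xj in zip(h, x)]


rng = random.Random(2)
x_tilde = [0.01, -0.02, 0.0, 0.01]           # point with a small gradient
r = 0.1
increase_bound = norm(grad(x_tilde)) * r + 0.5 * L * r * r
samples = [sample_ball(len(h), r, rng) for _ in range(1000)]
increases = [
    f([xj + sj for xj, sj in zip(x_tilde, xi)]) - f(x_tilde) for xi in samples
]
print(max(increases), increase_bound)
```

A smaller radius r tightens the bound, which is why the perturbation radius has to be chosen small relative to the function value threshold.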
Now we combine with (49) and (50) to obtain with high probability
[TABLE]
Thus we have finished the proof for the second situation (around saddle points), i.e., we show that the function value decreases a lot () in a super epoch (recall that ) by adding a random perturbation at the initial point .
Combining these two situations (large gradients and around saddle points) to prove Theorem 2: First, we recall Theorem 2 here since we want to recall the parameter setting.
Theorem 2
Under Assumptions 1 and 2 (i.e., (3) and (5)), let , where is the initial point and is the optimal value of . By letting step size , epoch length , minibatch size , perturbation radius $r = \widetilde{O}\big(\min(\frac{\delta^{3}}{\rho^{2}\epsilon}, \frac{\delta^{3/2}}{\rho\sqrt{L}})\big)$, threshold gradient , threshold function value and super epoch length , SSRGD will at least once get to an -second-order stationary point with high probability using
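The stuck-region probability used in this argument comes from comparing a thin slab with the whole perturbation ball. The sketch below evaluates the ratio r₀·Vol_{d−1}(r)/Vol_d(r) and compares it with the bound that follows from Gautschi's inequality, Γ(d/2+1)/Γ(d/2+1/2) < √(d/2+1). The slab width r₀ = ζ′r/√d is an assumption of this sketch, chosen so that the resulting probability is at most ζ′. The function names are illustrative.

```python
import math


def log_ball_volume(d, r):
    """Logarithm of the volume of a d-dimensional Euclidean ball of radius r."""
    return 0.5 * d * math.log(math.pi) + d * math.log(r) - math.lgamma(0.5 * d + 1.0)


def stuck_probability_bound(d, r, r0):
    """Slab of width r0 with cross-section Vol_{d-1}(r), relative to Vol_d(r)."""
    return r0 * math.exp(log_ball_volume(d - 1, r) - log_ball_volume(d, r))


def gautschi_bound(d, r, r0):
    """Upper bound on the same ratio via Gautschi's inequality."""
    return r0 / (r * math.sqrt(math.pi)) * math.sqrt(0.5 * d + 1.0)


r, zeta_prime = 0.3, 0.05
table = []
for d in (1, 2, 10, 100, 10000):
    r0 = zeta_prime * r / math.sqrt(d)       # assumed slab width
    table.append((d, stuck_probability_bound(d, r, r0), gautschi_bound(d, r, r0)))

for d, exact, upper in table:
    print(f"d={d:6d}  ratio={exact:.5f}  gautschi={upper:.5f}  zeta'={zeta_prime}")
```

The ratio depends on r₀ and r only through r₀/r, so the √d factor in the slab width is what keeps the probability dimension-free.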
[TABLE]
stochastic gradients for nonconvex finite-sum problem (1).
Proof of Theorem 2. Now, we prove this theorem by distinguishing the epochs into three types as follows:
1. Type-1 useful epoch: If at least half of points in this epoch have gradient norm larger than (Case 2 of Lemma 1);
2. Wasted epoch: If at least half of points in this epoch have gradient norm no larger than and the starting point of the next epoch has gradient norm larger than (it means that this epoch does not guarantee decreasing the function value a lot as the large gradients situation, also it cannot connect to the second super epoch situation since the starting point of the next epoch has gradient norm larger than );
3. Type-2 useful super epoch: If at least half of points in this epoch have gradient norm no larger than and the starting point of the next epoch (here we denote this point as ) has gradient norm no larger than (i.e., ) (Case 1 of Lemma 1), according to Line 3 of Algorithm 2, we will start a super epoch. So here we denote this epoch along with its following super epoch as a type-2 useful super epoch.
First, it is easy to see that the probability that a wasted epoch happens is less than due to the random stop (see Case 1 of Lemma 1 and Line 16 of Algorithm 2), and different wasted epochs are independent. Thus, with high probability, at most wasted epochs happen before a type-1 useful epoch or a type-2 useful super epoch. Now, we use and to denote the numbers of type-1 useful epochs and type-2 useful super epochs that the algorithm needs. Recall that , where is the initial point and is the optimal value of . Also recall that, with high probability, the function value never increases (see Lemma 1).
For type-1 useful epoch, according to Case 2 of Lemma 1, we know that the function value decreases by at least with probability at least . Using a standard concentration bound, we know that with high probability type-1 useful epochs decrease the function value by at least ; note that the function value can decrease by at most . So , and we get .
For type-2 useful super epoch, first we know that the starting point of the super epoch has gradient norm . Now if , then is already a -second-order stationary point. Otherwise, and , which is exactly our second situation (around saddle points). According to (51), we know that the function value decrease () is at least with high probability. Similar to type-1 useful epoch, we know by a union bound (so we change to , anyway we also have ).
Now, we are ready to compute the convergence results to finish the proof for Theorem 2.
[TABLE]
Now, the only remaining thing is to prove Lemmas 2 and 3. We provide these two proofs as follows.
Lemma 2 (Localization)
Let denote the sequence by running SSRGD update steps (Line 8–12 of Algorithm 2) from . Moreover, let the step size and minibatch size , with probability , we have
[TABLE]
where .
Proof of Lemma 2. First, we assume the variance bound (40) holds for all (this is true with high probability using a union bound by letting ). Then, according to (41), we know for any in some epoch
[TABLE]
where the last inequality holds since the step size and assuming . Now, we sum up (53) for all epochs before iteration ,
[TABLE]
Then, the proof is finished as
[TABLE]
Lemma 3 (Small Stuck Region)
If the initial point satisfies , then let and be two coupled sequences by running SSRGD update steps (Line 8–12 of Algorithm 2) with the same choice of minibatches (i.e., ’s in Line 12) from and with , where , , and denotes the smallest eigenvector direction of Hessian . Moreover, let the super epoch length , the step size $\eta \leq \min\big(\frac{1}{8\log(\frac{8\delta\sqrt{d}}{C_1\rho\zeta' r})L}, \frac{1}{4C_2 L\log t_{\mathrm{thres}}}\big) = \widetilde{O}(\frac{1}{L})$, minibatch size and the perturbation radius , then with probability , we have
[TABLE]
where and .
Proof of Lemma 3. We prove this lemma by contradiction. Assume the contrary,
[TABLE]
We will show that the distance between these two coupled sequences will grow exponentially since they have a gap in the direction at the beginning, i.e., , where and denotes the smallest eigenvector direction of Hessian . However, according to (54) and the perturbation radius . It is not hard to see that the exponential increase will break this upper bound, thus we get a contradiction.
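The exponential growth of the gap between the coupled sequences is easiest to see on an exact quadratic, where the error terms vanish. The sketch below runs plain gradient descent (a deterministic stand-in for the SSRGD updates, used only for illustration) from two starting points that differ by r₀ along the most negative curvature direction, and checks that the gap equals (1 + ηγ)ᵗ·r₀, where −γ is the smallest Hessian eigenvalue. The Hessian, step size, r₀ and the escape radius are illustrative assumptions.

```python
import math


def norm(v):
    return math.sqrt(sum(vj * vj for vj in v))


def gd_step(x, h, eta):
    """One gradient step on f(x) = 0.5 * sum_j h_j x_j^2."""
    return [xj - eta * hj * xj for xj, hj in zip(x, h)]


h = [-0.1, 1.0, 0.5]          # diagonal Hessian; e_1 is the negative-curvature direction
gamma = -min(h)               # gamma = 0.1
eta = 0.5                     # satisfies eta <= 1/L with L = 1
r0 = 1e-6
x = [0.0, 0.01, -0.02]
x_prime = [x[0] - r0, x[1], x[2]]     # w_0 = x - x' = r0 * e_1

steps = 200
gaps = []
for t in range(steps + 1):
    gaps.append(norm([a - b for a, b in zip(x, x_prime)]))
    x, x_prime = gd_step(x, h, eta), gd_step(x_prime, h, eta)

predicted = [(1.0 + eta * gamma) ** t * r0 for t in range(steps + 1)]

# Number of steps until the gap exceeds a given localization radius.
escape_radius = 1e-2
t_escape = math.ceil(math.log(escape_radius / r0) / math.log(1.0 + eta * gamma))
print(t_escape, gaps[t_escape])
```

The escape time grows only logarithmically in escape_radius/r₀ and scales like 1/(ηγ), which is consistent with a super epoch length of order Õ(1/(ηδ)).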
In the following, we prove the exponential increase of by induction. First, we need the expression of (recall that (see Line 11 of Algorithm 2)):
[TABLE]
where and . Note that the first term of (55) is in the direction and is exponential with respect to , i.e., , where . To prove the exponential increase of , it is sufficient to show that the first term of (55) will dominate the second term. We inductively prove the following two bounds
2. 2.
First, check the base case , and . Assume they hold for all , we now prove they hold for one by one. For Bound 1, it is enough to show the second term of (55) is dominated by half of the first term.
[TABLE]
where (56) uses the induction for with , (57) uses the definition , (58) follows from due to (54) and the perturbation radius , (59) holds by letting the perturbation radius , (60) holds since , and (61) holds by letting .
[TABLE]
where (62) uses the induction for with , (63) holds since , (64) holds (recall ), and (65) holds by letting .
Combining (61) and (65), we proved the second term of (55) is dominated by half of the first term. Note that the first term of (55) is . Thus, we have
[TABLE]
Now, the remaining thing is to prove the second bound . First, we write the concrete expression of :
[TABLE]
where (67) is due to the definition of the estimator (see Line 12 of Algorithm 2). We further define the difference . It is not hard to verify that is a martingale sequence and is the associated martingale difference sequence. We will apply the Azuma-Hoeffding inequalities to get an upper bound for and then we prove based on that upper bound. In order to apply the Azuma-Hoeffding inequalities for martingale sequence , we first need to bound the difference sequence . We use the Bernstein inequality to bound the differences as follows.
[TABLE]
We define $u_i := \big(\nabla f_i(x_t) - \nabla f_i(x_t')\big) - \big(\nabla f_i(x_{t-1}) - \nabla f_i(x_{t-1}')\big) - \big(\nabla f(x_t) - \nabla f(x_t')\big) + \big(\nabla f(x_{t-1}) - \nabla f(x_{t-1}')\big)$, and then we have
[TABLE]
where (69) holds since we define and , and the last inequality holds due to the gradient Lipschitz Assumption 1 and Hessian Lipschitz Assumption 2 (recall ). Then, consider the variance term
[TABLE]
where the first inequality uses the fact , and the last inequality uses the gradient Lipschitz Assumption 1 and Hessian Lipschitz Assumption 2. According to (70) and (71), we can bound the difference by Bernstein inequality (Proposition 2) as (where and )
[TABLE]
where the last equality holds by letting , where .
Now, we have a high probability bound for the difference sequence , i.e.,
[TABLE]
Now, we are ready to get an upper bound for by using the martingale Azuma-Hoeffding inequality. Note that we only need to consider the current epoch that contains the iteration since each epoch starts with . Let denote the current epoch, i.e., iterations from to the current , where is no larger than . According to the Azuma-Hoeffding inequality (Proposition 4) and letting , we have
[TABLE]
where the last equality is due to , where . Recall that and at the beginning point of this epoch due to and (see Line 5 of Algorithm 1), thus we have
[TABLE]
with probability , where belongs to . Note that we can further relax the parameter in (73) to (see (74)) for making sure the above arguments hold with probability for all by using a union bound for ’s:
[TABLE]
Now, we will show how to bound the right-hand-side of (74) to finish the proof, i.e., prove the remaining second bound .
First, we show that the last two terms in the right-hand-side of (74) can be bounded as
[TABLE]
where the first inequality follows from the induction of and the already proved in (66), and the last inequality holds by letting the perturbation radius .
Now, we show that the first term of right-hand-side of (74) can be bounded as
[TABLE]
where the first equality follows from (55), (76) holds from the following (82),
[TABLE]
where (82) holds due to Hessian Lipschitz Assumption 2, (54) and the perturbation radius (recall that , and ), (77) holds due to , (78) holds by plugging the induction and , (79) follows from (82), the induction and (hold for all ), (80) holds by letting the perturbation radius , and the last inequality holds due to (recall ).
By plugging (75) and (81) into (74), we have
[TABLE]
where the second inequality holds due to , and the last inequality holds by letting and . Recall that is enough to let the arguments in this proof hold with probability for all .
From (66) and (83), we know that the two induction bounds hold for . We recall the first induction bound here:
Thus, we know that . However, according to (54) and the perturbation radius . The last inequality is due to the perturbation radius (we already used this condition in the previous arguments). This will give a contradiction for (54) if and it will happen if .
So the proof of this lemma is finished by contradiction if we let , i.e., we have
[TABLE]
B.2 Proofs for Online Problem
In this section, we provide the detailed proofs for the online problem (2) (i.e., Theorems 3–4). We will reuse some parts of our previous proofs for the finite-sum problem (1) in Section B.1.
First, we recall the previous key relation (25) between and as follows (recall ):
[TABLE]
Next, we recall the previous bound (29) for the variance term:
[TABLE]
Now, the following bound for the variance term will be different from the previous finite-sum case. Similar to (30), we sum up (85) from the beginning of this epoch to the point ,
[TABLE]
where (86) is the same as (30), (87) uses the modification (11) (i.e., instead of the full gradient computation in the finite-sum case), and the last inequality (88) follows from the bounded variance Assumption 3.
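The difference from the finite-sum case can be illustrated with the epoch-start estimator alone. In the sketch below, the starting estimate is a batch mean of B stochastic gradients rather than a full gradient, and its mean squared error is compared with σ²/B, the standard consequence of a bounded-variance assumption for independent samples. The objective f(x) = ½‖x‖², the noise model (uniform on a sphere of radius σ) and all names are illustrative assumptions.

```python
import math
import random


def norm(v):
    return math.sqrt(sum(vj * vj for vj in v))


def true_grad(x):
    """Gradient of the toy objective f(x) = 0.5 * ||x||^2."""
    return list(x)


def stochastic_grad(x, sigma, rng):
    """Unbiased gradient sample whose noise has norm exactly sigma."""
    u = [rng.gauss(0.0, 1.0) for _ in x]
    u_norm = norm(u)
    return [gj + sigma * uj / u_norm for gj, uj in zip(true_grad(x), u)]


def batch_grad(x, B, sigma, rng):
    """Epoch-start estimate: mean of B independent stochastic gradients."""
    total = [0.0] * len(x)
    for _ in range(B):
        total = [tj + gj for tj, gj in zip(total, stochastic_grad(x, sigma, rng))]
    return [tj / B for tj in total]


rng = random.Random(3)
x0 = [1.0, -1.0, 2.0]
sigma, B, trials = 1.0, 16, 2000
squared_errors = []
for _ in range(trials):
    v0 = batch_grad(x0, B, sigma, rng)
    squared_errors.append(norm([vj - gj for vj, gj in zip(v0, true_grad(x0))]) ** 2)
mse = sum(squared_errors) / trials
print(mse, sigma ** 2 / B)
```

Unlike the finite-sum case, this starting error does not vanish, so it appears as an extra additive term in the variance bound and has to be controlled by choosing the batch size B large enough.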
Now, we take expectations for (84) and then sum it up from the beginning of this epoch , i.e., iterations from to , by plugging the variance (88) into them to get:
[TABLE]
where (89) holds if the minibatch size (note that here ), (90) holds if the step size .
Proof of Theorem 3. Let and step size , then (90) holds. Now, the proof is directly obtained by summing up (90) for all epochs as follows:
[TABLE]
where (91) holds by choosing uniformly from and letting and . Note that the total number of stochastic gradient computations equals
[TABLE]
B.2.1 Proof of Theorem 4
Similar to the proof of Theorem 2, for proving the second-order guarantee, we will divide the proof into two situations. The first situation (large gradients) is also almost the same as the above arguments for first-order guarantee, where the function value will decrease a lot since the gradients are large (see (90)). For the second situation (around saddle points), we will show that the function value can also decrease a lot by adding a random perturbation. The reason is that saddle points are usually unstable and the stuck region is relatively small in a random perturbation ball.
Large Gradients: First, we need a high probability bound for the variance term instead of the expectation one (88). Then we use it to get a high probability bound of (90) for function value decrease. Note that in this online case, at the beginning of each epoch (see (11)) instead of in the previous finite-sum case. Thus we first need a high probability bound for . According to Assumption 4, we have
[TABLE]
By applying Bernstein inequality (Proposition 2), we get the high probability bound for as follows:
[TABLE]
where the last equality holds by letting , where . Now, we have a high probability bound for , i.e.,
[TABLE]
Now we obtain a high probability bound for the variance term of the other points beyond the starting points. Recall that $v_k = \frac{1}{b}\sum_{i\in I_b}\big(\nabla f_i(x_k) - \nabla f_i(x_{k-1})\big) + v_{k-1}$ (see Line 9 of Algorithm 1). We let and . It is not hard to verify that is a martingale sequence and is the associated martingale difference sequence. In order to apply the Azuma-Hoeffding inequalities to get a high probability bound, we first need to bound the difference sequence . We use the Bernstein inequality to bound the differences as follows.
[TABLE]
We define , and then we have
[TABLE]
where the last inequality holds due to the gradient Lipschitz Assumption 1. Then, consider the variance term
[TABLE]
where the first inequality uses the fact , and the last inequality uses the gradient Lipschitz Assumption 1. According to (94) and (95), we can bound the difference by Bernstein inequality (Proposition 2) as
[TABLE]
where the last equality holds by letting , where . Now, we have a high probability bound for the difference sequence , i.e.,
[TABLE]
Now, we are ready to get a high probability bound for our original variance term (88) by using the martingale Azuma-Hoeffding inequality. Consider in a specific epoch , i.e., iterations from to the current , where is less than . According to the Azuma-Hoeffding inequality (Proposition 4) and letting , we have
[TABLE]
where the last equality holds by letting , where . Recall that and at the beginning point of this epoch with probability , where (see (92)). Combining with (92) and using a union bound, we have
[TABLE]
with probability , where belongs to .
Now, we use this high probability version (97) instead of the expectation one (88) to obtain the high probability bound for function value decrease (see (90)). We sum up (84) from the beginning of this epoch , i.e., iterations from to , by plugging (97) into them to get:
[TABLE]
where (98) holds if the minibatch size (note that here ), and (99) holds if the step size .
Similar to the previous finite-sum case, (99) only guarantees function value decrease when the summation of gradients in this epoch is large. However, in order to connect the guarantees between first situation (large gradients) and second situation (around saddle points), we need to show guarantees that are related to the gradient of the starting point of each epoch (see Line 3 of Algorithm 2). As we discussed in previous Section B.1.1, we achieve this by stopping the epoch at a uniformly random point (see Line 16 of Algorithm 2).
We want to point out that the second situation will have a little difference due to (11), i.e., the full gradient of the starting point is not available (see Line 3 of Algorithm 2). Thus some modifications are needed for previous Lemma 1, we use the following lemma to connect these two situations (large gradients and around saddle points):
Lemma 4 (Connection of Two Situations)
For any epoch , let be a point uniformly sampled from this epoch and choose the step size (where ) and the minibatch size . Then for any , by letting batch size (where ), we have two cases:
1. If at least half of points in this epoch have gradient norm no larger than , then and hold with probability at least ;
2. Otherwise, we know holds with probability at least .
Moreover, holds with high probability no matter which case happens.
Proof of Lemma 4. There are two cases in this epoch:
1. If at least half of points of in this epoch have gradient norm no larger than , then it is easy to see that a uniformly sampled point has gradient norm with probability at least . Moreover, note that the starting point of the next epoch (i.e., Line 19 of Algorithm 2), thus we have with probability . According to (92), we have with probability , where . By a union bound, with probability at least , we have
[TABLE]
2. Otherwise, at least half of points have gradient norm larger than . Then, as long as the sampled point falls into the last quarter of , we know . This holds with probability at least since is uniformly sampled. Then by combining with (99), we obtain the function value decrease
[TABLE]
where the last inequality is due to . Note that (99) holds with high probability if we choose the minibatch size and the step size . By a union bound, the function value decrease with probability at least .
Again according to (99), always holds with high probability.
Note that if Case 2 happens, the function value already decreases a lot in this epoch (corresponding to the first situation large gradients). Otherwise Case 1 happens, we know the starting point of the next epoch (i.e., Line 19 of Algorithm 2), then we know and . Then we will start a super epoch (corresponding to the second situation around saddle points). Note that if , this point is already an -second-order stationary point (recall that in our Theorem 4).
Around Saddle Points and : In this situation, we will show that the function value decreases a lot in a super epoch (instead of an epoch as in the first situation) with high probability by adding a random perturbation at the initial point . To simplify the presentation, we use to denote the starting point of the super epoch after the perturbation, where uniformly and the perturbation radius is (see Line 6 in Algorithm 2). Following the classical, widely used two-point analysis developed in [Jin et al., 2017], we consider two coupled points and with , where is a scalar and denotes the smallest eigenvector direction of Hessian . Then we get two coupled sequences and by running SSRGD update steps (Line 8–12 of Algorithm 2) with the same choice of batches and minibatches (i.e., ’s (see (11) and Line 8) and ’s (see Line 12)) for a super epoch. We will show that at least one of these two coupled sequences will decrease the function value a lot (escape the saddle point), i.e.,
[TABLE]
We will prove (100) by contradiction. Assume the contrary, and . First, we show that if function value does not decrease a lot, then all iteration points are not far from the starting point with high probability. Then we will show that the stuck region is relatively small in the random perturbation ball, i.e., at least one of and will go far away from their starting point and with high probability. Thus there is a contradiction. Similar to Lemma 2 and Lemma 3, we need the following two lemmas. Their proofs are deferred to the end of this section.
Lemma 5 (Localization)
Let denote the sequence by running SSRGD update steps (Line 8–12 of Algorithm 2) from . Moreover, let the step size and minibatch size , with probability , we have
[TABLE]
where and .
Lemma 6 (Small Stuck Region)
If the initial point satisfies , then let and be two coupled sequences by running SSRGD update steps (Line 8–12 of Algorithm 2) with the same choice of batches and minibatches (i.e., ’s (see (11) and Line 8) and ’s (see Line 12)) from and with , where , , and denotes the smallest eigenvector direction of Hessian . Moreover, let the super epoch length , the step size $\eta \leq \min\big(\frac{1}{16\log(\frac{8\delta\sqrt{d}}{C_1\rho\zeta' r})L}, \frac{1}{8C_2 L\log t_{\mathrm{thres}}}\big) = \widetilde{O}(\frac{1}{L})$, minibatch size , batch size and the perturbation radius , then with probability , we have
[TABLE]
where , and .
Based on these two lemmas, we are ready to show that (100) holds with high probability. Without loss of generality, we assume in (102) (note that (101) holds for both and ), then plugging it into (101) to obtain
[TABLE]
where (103) is due to and (104) holds by letting . Recall that and . Thus, we have proved that at least one of the sequences and escapes the saddle point with high probability, i.e.,
[TABLE]
if their starting points and satisfy , where and denotes the smallest eigenvector direction of Hessian . Similar to the classical argument in [Jin et al., 2017], we know that in the random perturbation ball, the stuck points can only lie in a short interval in the direction, i.e., at least one of two points in the direction will escape the saddle point if their distance is larger than . Thus, we know that the probability that the starting point (where uniformly ) is located in the stuck region is less than
[TABLE]
where denotes the volume of a Euclidean ball with radius in dimension, and the first inequality holds due to Gautschi’s inequality. By a union bound for (106) and (104) (holds with high probability if is not in a stuck region), we know
[TABLE]
with high probability. Note that the initial point of this super epoch is before the perturbation (see Line 6 of Algorithm 2), thus we need to show that the perturbation step (where uniformly ) does not increase the function value a lot, i.e.,
[TABLE]
where the last inequality holds by letting the perturbation radius .
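The perturbation step can be illustrated with a small sketch. It samples a point uniformly from a Euclidean ball and evaluates the generic smoothness bound f(x + ξ) ≤ f(x) + ‖∇f(x)‖·r + L·r²/2, which follows from the gradient Lipschitz Assumption 1 for any ‖ξ‖ ≤ r. The function names, the toy quadratic used in testing, and the symbols `grad_norm`, `L`, `r` are assumptions made for illustration; the exact constants of (108) are not reproduced here.

```python
import numpy as np

def sample_uniform_ball(rng, d, r):
    """Draw one point uniformly from the d-dimensional Euclidean ball of radius r."""
    g = rng.normal(size=d)
    direction = g / np.linalg.norm(g)          # uniform direction on the sphere
    radius = r * rng.uniform() ** (1.0 / d)    # radius with density proportional to s^(d-1)
    return radius * direction

def perturbation_increase_bound(grad_norm, L, r):
    """Upper bound on f(x + xi) - f(x) for an L-smooth f and any ||xi|| <= r."""
    return grad_norm * r + 0.5 * L * r ** 2
```

The bound shrinks to zero as r does, which is why a sufficiently small perturbation radius cannot undo the decrease obtained during the super epoch.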
Now we combine with (107) and (108) to obtain with high probability
[TABLE]
Thus we have finished the proof for the second situation (around saddle points), i.e., we show that the function value decreases a lot () in a super epoch (recall that ) by adding a random perturbation at the initial point .
Combining these two situations (large gradients and around saddle points) to prove Theorem 4: First, we restate Theorem 4 here to recall the parameter setting.
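The volume-ratio step behind the stuck-region probability can be checked numerically. The sketch below uses only standard facts: the volume of a d-dimensional ball of radius r is π^{d/2} r^d / Γ(d/2 + 1), so V_{d-1}(r) / V_d(r) = Γ(d/2 + 1) / (√π · r · Γ(d/2 + 1/2)), and Gautschi's inequality gives Γ(d/2 + 1) / Γ(d/2 + 1/2) < √(d/2 + 1). The function names are illustrative, and the specific constants of (106) are not reproduced.

```python
import math

def ball_volume_ratio(d, r):
    """V_{d-1}(r) / V_d(r), computed via log-gamma for numerical stability."""
    log_gamma_ratio = math.lgamma(d / 2 + 1) - math.lgamma(d / 2 + 0.5)
    return math.exp(log_gamma_ratio) / (math.sqrt(math.pi) * r)

def gautschi_upper_bound(d, r):
    """Upper bound on the ratio obtained from Gautschi's inequality."""
    return math.sqrt(d / 2 + 1) / (math.sqrt(math.pi) * r)
```

Multiplying the ratio by the width of the stuck interval bounds the probability that a uniformly drawn starting point lands in the stuck region; since √(d/2 + 1)/√π ≤ √d for every d ≥ 1, the bound grows at most like √d divided by r.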
**Theorem 4**
Under Assumptions 1 and 2 (i.e., (4) and (6)) and Assumption 4, let , where is the initial point and is the optimal value of . By letting step size , batch size , minibatch size , epoch length , perturbation radius $r=\widetilde{O}\big(\min(\frac{\delta^{3}}{\rho^{2}\epsilon},\frac{\delta^{3/2}}{\rho\sqrt{L}})\big)$, threshold gradient , threshold function value and super epoch length , SSRGD will at least once get to an -second-order stationary point with high probability using
[TABLE]
stochastic gradients for nonconvex online problem (2).
Proof of Theorem 4. Now, we prove this theorem by distinguishing the epochs into three types as follows:
1. Type-1 useful epoch: If at least half of points in this epoch have gradient norm larger than (Case 2 of Lemma 4);
2. Wasted epoch: If at least half of points in this epoch have gradient norm no larger than and the starting point of the next epoch has estimated gradient norm larger than (it means that this epoch does not guarantee decreasing the function value a lot as the large gradients situation, also it cannot connect to the second super epoch situation since the starting point of the next epoch has estimated gradient norm larger than );
3. Type-2 useful super epoch: If at least half of points in this epoch have gradient norm no larger than and the starting point of the next epoch (here we denote this point as ) has estimated gradient norm no larger than (i.e., ) (Case 1 of Lemma 4), according to Line 3 of Algorithm 2, we will start a super epoch. So here we denote this epoch along with its following super epoch as a type-2 useful super epoch.
First, it is easy to see that the probability that a wasted epoch occurs is less than due to the random stop (see Case 1 of Lemma 4 and Line 16 of Algorithm 2), and different wasted epochs are independent. Thus, with high probability, at most wasted epochs occur before a type-1 useful epoch or type-2 useful super epoch. Now, we use and to denote the number of type-1 useful epochs and type-2 useful super epochs that the algorithm needs. Recall that , where is the initial point and is the optimal value of .
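The count of wasted epochs is a geometric-tail argument, sketched below. If each epoch is wasted independently with probability at most p, then k consecutive wasted epochs occur with probability at most p^k, so k = ⌈log ζ / log p⌉ consecutive wasted epochs happen with probability at most ζ. Here `p` and `zeta` are placeholder names for the per-epoch bound and the target failure probability; their values in the paper are not restated.

```python
import math

def max_consecutive_wasted_epochs(p, zeta):
    """Smallest k with p**k <= zeta, for 0 < p < 1 and 0 < zeta < 1."""
    return math.ceil(math.log(zeta) / math.log(p))

# Illustrative values only: p = 0.5 and zeta = 1e-3 are assumptions.
k = max_consecutive_wasted_epochs(0.5, 1e-3)
```

Because k depends on ζ only logarithmically, the wasted epochs add just a logarithmic factor to the total count of epochs.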
For type-1 useful epoch, according to Case 2 of Lemma 4, we know that the function value decreases at least with probability at least . Using a standard concentration argument, we know that with high probability type-1 useful epochs will decrease the function value at least , note that the function value can decrease at most . So , we get .
For type-2 useful super epoch, first we know that the starting point of the super epoch has gradient norm and estimated gradient norm . Now if , then is already a -second-order stationary point. Otherwise, and , this is exactly our second situation (around saddle points). According to (109), we know that the function value decrease () is at least with high probability. Similar to type-1 useful epoch, we know by a union bound (so we change to , anyway we also have ).
Now, we are ready to compute the convergence results to finish the proof for Theorem 4.
[TABLE]
Now, the only remaining thing is to prove Lemmas 5 and 6. We provide these two proofs as follows.
Lemma 5 (Localization)
Let denote the sequence by running SSRGD update steps (Line 8–12 of Algorithm 2) from . Moreover, let the step size and minibatch size , with probability , we have
[TABLE]
where and .
Proof of Lemma 5. First, we assume the variance bound (97) holds for all (this is true with high probability using a union bound by letting and ). Then, according to (98), we know for any in some epoch
[TABLE]
where the last inequality holds since the step size and assuming . Now, we sum up (112) for all epochs before iteration ,
[TABLE]
Then, the proof is finished as
[TABLE]
Lemma 6 (Small Stuck Region)
If the initial point satisfies , then let and be two coupled sequences by running SSRGD update steps (Lines 8–12 of Algorithm 2) with the same choice of batches and minibatches (i.e., ’s (see (11) and Line 8) and ’s (see Line 12)) from and with , where , , and denotes the smallest eigenvector direction of Hessian . Moreover, let the super epoch length , the step size $\eta\leq\min\big(\frac{1}{16\log(\frac{8\delta\sqrt{d}}{C_{1}\rho\zeta^{\prime}r})L},\frac{1}{8C_{2}L\log t_{\mathrm{thres}}}\big)=\widetilde{O}(\frac{1}{L})$, minibatch size , batch size and the perturbation radius , then with probability , we have
[TABLE]
where , and .
Proof of Lemma 6. We prove this lemma by contradiction. Assume the contrary,
[TABLE]
We will show that the distance between these two coupled sequences will grow exponentially since they have a gap in the direction at the beginning, i.e., , where and denotes the smallest eigenvector direction of Hessian . However, according to (113) and the perturbation radius . It is not hard to see that the exponential increase will break this upper bound, thus we get a contradiction.
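The exponential growth of the gap can be seen on a toy problem. The sketch below is a deliberately simplified, noise-free stand-in: it runs plain gradient descent (not the SSRGD estimator) on a quadratic saddle f(x) = ½ xᵀHx with one negative eigenvalue −γ, from two starting points that differ by r₀ along the negative-curvature direction. The names `gamma`, `eta`, `r0` and the choice of H are assumptions for illustration. In this idealized setting the gap is exactly (1 + ηγ)^t · r₀.

```python
import numpy as np

def coupled_gap(gamma=0.5, eta=0.1, r0=1e-6, d=5, steps=50, seed=0):
    """Distance between two coupled GD sequences on a quadratic saddle."""
    rng = np.random.default_rng(seed)
    # Hessian with smallest eigenvalue -gamma along the first coordinate.
    H = np.diag(np.concatenate(([-gamma], np.ones(d - 1))))
    e1 = np.zeros(d)
    e1[0] = 1.0
    x = 1e-3 * rng.normal(size=d)   # first starting point
    xp = x + r0 * e1                # coupled point, shifted along e1
    gaps = [np.linalg.norm(x - xp)]
    for _ in range(steps):
        x = x - eta * (H @ x)       # gradient of 0.5 * x^T H x is H x
        xp = xp - eta * (H @ xp)
        gaps.append(np.linalg.norm(x - xp))
    return np.array(gaps)
```

In the actual proof the stochastic estimator adds an error term to this recursion, and the induction shows that the error stays dominated by the exponentially growing term.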
In the following, we prove the exponential increase of by induction. First, we need the expression of (recall that (see Line 11 of Algorithm 2)):
[TABLE]
where and . Note that the first term of (114) is in the direction and is exponential with respect to , i.e., , where . To prove the exponential increase of , it is sufficient to show that the first term of (114) will dominate the second term. We inductively prove the following two bounds
2. 2.
First, check the base case , holds for Bound 1. However, for Bound 2, we use Bernstein inequality (Proposition 2) to show that . According to (11), we know that and (recall that these two coupled sequences and use the same choice of batches and minibatches (i.e., ’s and ’s)). Now, we have
[TABLE]
We first bound each individual term of (115):
[TABLE]
where the inequality holds due to the gradient Lipschitz Assumption 1. Then, consider the variance term of (115):
[TABLE]
where the first inequality uses the fact , and the last inequality uses the gradient Lipschitz Assumption 1. According to (116) and (117), we can bound by Bernstein inequality (Proposition 2) as
[TABLE]
where the last equality holds by letting , where . Note that we can further relax the parameter to for making sure the above arguments hold with probability for all epoch starting points with . Thus, we have with probability ,
[TABLE]
where the last inequality holds due to (recall that and ).
Now, we know that Bound 1 and Bound 2 hold for the base case with high probability. Assume they hold for all , we now prove they hold for one by one. For Bound 1, it is enough to show the second term of (114) is dominated by half of the first term.
[TABLE]
where (119) uses the induction for with , (120) uses the definition , (121) follows from due to (113) and the perturbation radius , (122) holds by letting the perturbation radius , (123) holds since , and (124) holds by letting .
[TABLE]
where (125) uses the induction for with , (126) holds since , (127) holds (recall ), and (128) holds by letting .
Combining (124) and (128), we proved the second term of (114) is dominated by half of the first term. Note that the first term of (114) is . Thus, we have
[TABLE]
Now, the remaining thing is to prove the second bound . First, we write the concrete expression of :
[TABLE]
where (130) is due to the definition of the estimator (see Line 12 of Algorithm 2). We further define the difference . It is not hard to verify that is a martingale sequence and is the associated martingale difference sequence. We will apply the Azuma-Hoeffding inequalities to get an upper bound for and then we prove based on that upper bound. In order to apply the Azuma-Hoeffding inequalities for martingale sequence , we first need to bound the difference sequence . We use the Bernstein inequality to bound the differences as follows.
[TABLE]
We define $u_{i}:=\big(\nabla f_{i}(x_{t})-\nabla f_{i}(x_{t}^{\prime})\big)-\big(\nabla f_{i}(x_{t-1})-\nabla f_{i}(x_{t-1}^{\prime})\big)-\big(\nabla f(x_{t})-\nabla f(x_{t}^{\prime})\big)+\big(\nabla f(x_{t-1})-\nabla f(x_{t-1}^{\prime})\big)$, and then we have
[TABLE]
where (132) holds since we define and , and the last inequality holds due to the gradient Lipschitz Assumption 1 and Hessian Lipschitz Assumption 2 (recall ). Then, consider the variance term
[TABLE]
where the first inequality uses the fact , and the last inequality uses the gradient Lipschitz Assumption 1 and Hessian Lipschitz Assumption 2. According to (133) and (134), we can bound the difference by Bernstein inequality (Proposition 2) as (where and )
[TABLE]
where the last equality holds by letting , where .
Now, we have a high probability bound for the difference sequence , i.e.,
[TABLE]
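The martingale structure comes from the recursive form of the gradient estimator. The sketch below tracks the estimation error v_t − ∇f(x_t) of a recursive (SARAH-style) estimator on a toy finite-sum least-squares problem; the proof works with the difference of two such errors along the coupled sequences. The problem, the function names, and the use of a full gradient at the start of the epoch are assumptions for illustration, not the exact setting of Algorithm 2.

```python
import numpy as np

def minibatch_grad(A, y, x, idx):
    """Average gradient of f_i(x) = 0.5 * (a_i^T x - y_i)^2 over indices idx."""
    residual = A[idx] @ x - y[idx]
    return A[idx].T @ residual / len(idx)

def estimator_errors(A, y, xs, minibatches):
    """Errors v_t - grad f(x_t) of the recursive estimator along a trajectory xs.

    v_0 is the full gradient at xs[0]; afterwards
    v_t = grad_I(x_t) - grad_I(x_{t-1}) + v_{t-1}, with I = minibatches[t-1].
    """
    full = np.arange(A.shape[0])
    v = minibatch_grad(A, y, xs[0], full)
    errors = [v - minibatch_grad(A, y, xs[0], full)]
    for t in range(1, len(xs)):
        idx = minibatches[t - 1]
        v = minibatch_grad(A, y, xs[t], idx) - minibatch_grad(A, y, xs[t - 1], idx) + v
        errors.append(v - minibatch_grad(A, y, xs[t], full))
    return np.array(errors)
```

Each step adds a fresh increment whose conditional mean is zero when the minibatch is sampled uniformly, so the errors form a martingale within an epoch. When every minibatch is the full index set, the increments vanish and the error stays at zero.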
Now, we are ready to get an upper bound for by using the martingale Azuma-Hoeffding inequality. Note that we only need to focus on the current epoch that contains the iteration since the martingale sequence starts with a new point for each epoch due to the estimator . Also note that the starting point can be bounded with the same upper bound (118) for all epochs . Let denote the current epoch, i.e., iterations from to current , where is no larger than . According to Azuma-Hoeffding inequality (Proposition 4) and letting , we have
[TABLE]
where the last equality is due to , where . Recall that and at the beginning point of this epoch with probability (see (118)). Combining with (118) and using a union bound, we have
[TABLE]
with probability , where belongs to . Note that we can further relax the parameter in (136) to (see (137)) for making sure the above arguments hold with probability for all by using a union bound for ’s:
[TABLE]
where belongs to .
Now, we will show how to bound the right-hand-side of (137) to finish the proof, i.e., prove the remaining second bound .
First, we show that the last two terms in the first term of right-hand-side of (137) can be bounded as
[TABLE]
where the first inequality follows from the induction of and the already proved in (129), and the last inequality holds by letting the perturbation radius .
Now, we show that the first term in (137) can be bounded as
[TABLE]
where the first equality follows from (114), (139) holds from the following (145),
[TABLE]
where (145) holds due to Hessian Lipschitz Assumption 2, (113) and the perturbation radius (recall that , and ), (140) holds due to , (141) holds by plugging the induction and , (142) follows from (145), the induction and (hold for all ), (143) holds by letting the perturbation radius , and the last inequality holds due to (recall ).
By plugging (138) and (144) into (137), we have
[TABLE]
where the second inequality holds due to , and the last inequality holds by letting and . Recall that is enough to let the arguments in this proof hold with probability for all .
From (129) and (146), we know that the two induction bounds hold for . We recall the first induction bound here:
Thus, we know that . However, according to (113) and the perturbation radius . The last inequality is due to the perturbation radius (we already used this condition in the previous arguments). This will give a contradiction for (113) if and it will happen if .
So the proof of this lemma is finished by contradiction if we let , i.e., we have
[TABLE]
References
- Agarwal et al. [2016] Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima for nonconvex optimization in linear time. arXiv preprint arXiv:1611.01146, 2016.
- Allen-Zhu [2018] Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. In Advances in Neural Information Processing Systems, pages 2680–2691, 2018.
- Allen-Zhu and Hazan [2016] Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. In International Conference on Machine Learning, pages 699–707, 2016.
- Allen-Zhu and Li [2018] Zeyuan Allen-Zhu and Yuanzhi Li. Neon2: Finding local minima via first-order oracles. In Advances in Neural Information Processing Systems, pages 3720–3730, 2018.
- Bhojanapalli et al. [2016] Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Global optimality of local search for low rank matrix recovery. In Advances in Neural Information Processing Systems, pages 3873–3881, 2016.
- Carmon et al. [2016] Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Accelerated methods for non-convex optimization. arXiv preprint arXiv:1611.00756, 2016.
- Chung and Lu [2006] Fan Chung and Linyuan Lu. Concentration inequalities and martingale inequalities: a survey. Internet Mathematics, 3(1):79–127, 2006.
- Daneshmand et al. [2018] Hadi Daneshmand, Jonas Kohler, Aurelien Lucchi, and Thomas Hofmann. Escaping saddles with stochastic gradients. arXiv preprint arXiv:1803.05999, 2018.
