Behavior of Accelerated Gradient Methods Near Critical Points of Nonconvex Functions
Michael O'Neill, Stephen J. Wright

TL;DR
This paper analyzes how accelerated gradient methods, especially the heavy-ball method, behave near saddle points in nonconvex optimization, showing they tend to avoid convergence to saddle points and can diverge faster than gradient descent.
Contribution
It provides a theoretical analysis demonstrating that accelerated methods are unlikely to converge to strict saddle points and can diverge more rapidly than gradient descent near these points.
Findings
Heavy-ball method unlikely to converge to strict saddle points
Accelerated methods diverge faster than steepest descent near saddle points
Stable manifold theorem used to analyze convergence behavior
Table 1: Iteration counts for the trials described in Section 5.
| Method | Av. Iters | Max. Iters |
|---|---|---|
| Steepest Descent | 379 | 518 |
| Accelerated Gradient | 71 | 87 |
| Divergence Rate | 46 | 59 |
| Steepest Descent | 3855 | 5603 |
| Accelerated Gradient | 242 | 299 |
| Divergence Rate | 155 | 194 |
| Steepest Descent | 582 | 773 |
| Accelerated Gradient | 99 | 116 |
| Divergence Rate | 71 | 85 |
| Steepest Descent | 5775 | 8240 |
| Accelerated Gradient | 332 | 399 |
| Divergence Rate | 235 | 282 |
Funding: This work was supported by NSF Awards IIS-1447449, 1628384, 1634597, and 1740707; AFOSR Award FA9550-13-1-0138; and Subcontract 3F-30222 from Argonne National Laboratory. Part of this work was done while the second author was visiting the Simons Institute for the Theory of Computing, and partially supported by the DIMACS/Simons Collaboration on Bridging Continuous and Discrete Optimization through NSF Award CCF-1740425.
Michael O’Neill, Computer Sciences Department, University of Wisconsin, Madison, WI 53706.
Stephen J. Wright, Computer Sciences Department, University of Wisconsin, Madison, WI 53706.
Abstract
We examine the behavior of accelerated gradient methods in smooth nonconvex unconstrained optimization, focusing in particular on their behavior near strict saddle points. Accelerated methods are iterative methods that typically step along a direction that is a linear combination of the previous step and the gradient of the function evaluated at a point at or near the current iterate. (The previous step encodes gradient information from earlier stages in the iterative process.) We show by means of the stable manifold theorem that the heavy-ball method is unlikely to converge to strict saddle points, which are points at which the gradient of the objective is zero but the Hessian has at least one negative eigenvalue. We then examine the behavior of the heavy-ball method and other accelerated gradient methods in the vicinity of a strict saddle point of a nonconvex quadratic function, showing that both methods can diverge from this point more rapidly than the steepest-descent method.
Keywords: Accelerated Gradient Methods, Nonconvex Optimization
AMS subject classification: 90C26
1 Introduction
We consider methods for the smooth unconstrained optimization problem
$$\min_{x \in \mathbb{R}^n} \; f(x), \tag{1}$$
where $f: \mathbb{R}^n \to \mathbb{R}$ is a twice continuously differentiable function. We say that $x^*$ is a critical point of (1) if $\nabla f(x^*) = 0$. Critical points that are not local minimizers are of little interest in the context of the optimization problem (1), so a desirable property of any algorithm for solving (1) is that it not be attracted to such a point. Specifically, we focus on functions with strict saddle points, that is, functions where the Hessian at each saddle point has at least one negative eigenvalue.
Our particular interest here is in methods that use gradients and momentum to construct steps. In many such methods, each step is a linear combination of two components: the gradient evaluated at a point at or near the latest iterate, and a momentum term, which is the step between the current iterate and the previous iterate. There are rich convergence theories for these methods in the case in which $f$ is convex or strongly convex, along with extensive numerical experience in some important applications. However, although these methods are applied frequently to nonconvex functions, little is known from a mathematical viewpoint about their behavior in such settings. Early results showed that a certain modified accelerated gradient method achieves the same order of convergence on a nonconvex problem as gradient descent [7] [10], rather than the faster rate attained in the convex setting.
The heavy-ball method was studied in the nonconvex setting in [17]. From an argument based on a Lyapunov function, this work shows that heavy-ball converges to some set of stationary points when short step sizes are used. Their result also implies that with these shorter step sizes, heavy-ball converges to these stationary points at a sublinear rate, just as gradient descent does in the nonconvex case. Another work studied the continuous-time heavy-ball method [2]. For Morse functions (functions where all critical points have a nonsingular Hessian matrix), this paper shows that the set of initial conditions from which heavy-ball converges to a local minimizer is open and dense. We present a similar result for a larger class of functions, using techniques like those of [9], where the authors show that gradient descent, started from a random initial point, converges to a strict saddle point with probability zero. We show that the discrete heavy-ball method essentially shares this property. We also study whether momentum methods can “escape” strict saddle points more rapidly than gradient descent. Experience with nonconvex quadratics indicates that, when started close to the (measure-zero) set of points from which convergence to the saddle point occurs, momentum methods do indeed escape more quickly.
After submission of our paper, [8] described a method that combines accelerated gradient, perturbation at points with small gradients and explicit negative curvature detection to attain a method with worst-case complexity guarantees.
Notation
For compactness, we sometimes use the notation to denote the vector , for and .
2 Heavy-Ball is Unlikely to Converge to Strict Saddle Points
We show in this section that the heavy-ball method is not attracted to strict saddle points unless it is initialized in a very particular way, one that cannot occur if the starting point is chosen at random and the algorithm is modified slightly. Following [9], our proof is based on the stable manifold theorem.
We make the following assumption throughout this section.
Assumption 1.
The function $f$ is $r+1$ times continuously differentiable, for some integer $r \ge 1$, and $\nabla f$ has Lipschitz constant $L$.
Under this assumption, the eigenvalues of the Hessian $\nabla^2 f(x)$ are bounded in magnitude by $L$.
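The heavy-ball method is a prototypical momentum method (see [13]), which proceeds as follows from a starting point $x^0$:
$$x^{k+1} = x^k - \alpha \nabla f(x^k) + \beta (x^k - x^{k-1}), \quad \text{with } x^{-1} = x^0. \tag{2}$$
Following [13], we can write (2) as follows:
$$\begin{bmatrix} x^{k+1} \\ x^k \end{bmatrix} = \begin{bmatrix} x^k - \alpha \nabla f(x^k) + \beta (x^k - x^{k-1}) \\ x^k \end{bmatrix}. \tag{3}$$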
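As a minimal sketch of the iteration above, assuming the standard heavy-ball update $x^{k+1} = x^k - \alpha \nabla f(x^k) + \beta (x^k - x^{k-1})$ with the conventional start $x^{-1} = x^0$, the method can be written in a few lines of Python. The function name `heavy_ball`, the optional `x_prev` argument, and the small test quadratic are illustrative choices and do not come from the original text.

```python
import numpy as np

def heavy_ball(grad, x0, alpha, beta, num_iters, x_prev=None):
    """Iterate x^{k+1} = x^k - alpha*grad(x^k) + beta*(x^k - x^{k-1})."""
    x = np.asarray(x0, dtype=float)
    # Conventional start x^{-1} = x^0 unless a different value is supplied.
    x_prev = x.copy() if x_prev is None else np.asarray(x_prev, dtype=float)
    for _ in range(num_iters):
        x_next = x - alpha * grad(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x

# Illustrative strongly convex quadratic f(x) = 0.5 * sum(d * x**2).
d = np.array([1.0, 0.1])
x_final = heavy_ball(lambda x: d * x, x0=[1.0, 1.0],
                     alpha=1.0, beta=0.5, num_iters=200)
```

Passing an explicit `x_prev` makes it possible to experiment with a slightly perturbed value of $x^{-1}$, the variant analyzed later in this section.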
Convergence for this method is known for the special case in which is a strongly convex quadratic. Denote by the positive lower bound on the eigenvalues of the Hessian of this quadratic, and recall that is the upper bound. For the settings
[TABLE]
a rigorous version of the eigenvalue-based argument in [13, Section 3.2] can be applied to show R-linear convergence with rate constant , which is approximately when the ratio is large. This suggests a complexity of iterations to reduce the error by a factor of (where is the unique solution). Such rates are typical of accelerated gradient methods. They contrast with the rates attained by the steepest-descent method on such functions.
We note that the eigenvalue-based argument that is “sketched” by [13] does not extend rigorously beyond strongly convex quadratic functions. A more sophisticated argument based on Lyapunov functions is needed, like the one presented for Nesterov’s accelerated gradient method in [14, Chapter 4].
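The key to our argument for non-convergence to strict saddle points lies in formulating the heavy-ball method as a mapping whose fixed points are stationary points of $f$ and to which we can apply the stable manifold theorem. Following (3), we define this mapping to be
$$G(z_1, z_2) = \begin{bmatrix} z_1 - \alpha \nabla f(z_1) + \beta (z_1 - z_2) \\ z_1 \end{bmatrix}, \quad (z_1, z_2) \in \mathbb{R}^n \times \mathbb{R}^n. \tag{5}$$
Note that
$$DG(z_1, z_2) = \begin{bmatrix} (1+\beta) I - \alpha \nabla^2 f(z_1) & -\beta I \\ I & 0 \end{bmatrix}.$$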
We have the following elementary result about the relationship of critical points for (1) to fixed points for the mapping .
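Lemma 2.1.
If $x^*$ is a critical point of $f$, then $(x^*, x^*)$ is a fixed point for $G$. Conversely, if $(z_1^*, z_2^*)$ is a fixed point for $G$, then $z_1^* = z_2^*$ is a critical point for $f$.
Proof 2.2.
The first claim is obvious by substitution into (5). For the second claim, we have that if $(z_1^*, z_2^*)$ is a fixed point for $G$, then
$$\begin{bmatrix} z_1^* \\ z_2^* \end{bmatrix} = G(z_1^*, z_2^*) = \begin{bmatrix} z_1^* - \alpha \nabla f(z_1^*) + \beta (z_1^* - z_2^*) \\ z_1^* \end{bmatrix},$$
from which we have $z_2^* = z_1^*$ and $\nabla f(z_1^*) = 0$, giving the result.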
We now establish that is a diffeomorphic mapping, a property needed for application of the stable manifold result.
Lemma 2.3.
*Suppose that Assumption 1 holds. Then the mapping defined in (5) is a diffeomorphism. *
Proof 2.4.
We need to show that is injective and surjective, and that and its inverse are times continuously differentiable.
To show injectivity of , suppose that . Then, we have
[TABLE]
Therefore, , and so
[TABLE]
demonstrating injectivity. To show that is surjective, we construct its inverse explicitly. Let be such that
[TABLE]
Then . From the first partition in (9), we obtain , which after substitution of leads to
[TABLE]
*Thus, is a bijection. Both and are continuously differentiable one time less than , so by Assumption 1, is a -diffeomorphism. *
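The explicit inverse used in this proof can be checked numerically. The sketch below assumes the heavy-ball map has the form $G(z_1, z_2) = (z_1 - \alpha \nabla f(z_1) + \beta (z_1 - z_2),\; z_1)$, so that solving $G(z_1, z_2) = (y_1, y_2)$ gives $z_1 = y_2$ and $z_2 = y_2 + (y_2 - \alpha \nabla f(y_2) - y_1)/\beta$, which requires $\beta \neq 0$. The helper name `make_G` and the indefinite test quadratic are hypothetical.

```python
import numpy as np

def make_G(grad, alpha, beta):
    """Return the assumed heavy-ball map G and its explicit inverse."""
    def G(z1, z2):
        return z1 - alpha * grad(z1) + beta * (z1 - z2), z1

    def G_inv(y1, y2):
        # Second block of G gives z1 = y2; the first block is then solved for z2.
        z1 = y2
        z2 = y2 + (y2 - alpha * grad(y2) - y1) / beta
        return z1, z2

    return G, G_inv

# Illustrative indefinite quadratic with a strict saddle at the origin.
H = np.diag([1.0, -0.1])
G, G_inv = make_G(lambda x: H @ x, alpha=1.0, beta=0.5)

z1 = np.array([0.3, -0.7])
z2 = np.array([1.2, 0.4])
w1, w2 = G_inv(*G(z1, z2))                 # round trip should recover (z1, z2)
f1, f2 = G(np.zeros(2), np.zeros(2))       # the critical point is a fixed point
```

The round trip recovering `(z1, z2)` illustrates bijectivity on this example, and the last line illustrates the correspondence between critical points of $f$ and fixed points of the map.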
We are now ready to state the stable manifold theorem, which provides tools to let us characterize the set of escaping points.
Theorem 2.5 (Theorem III.7 of [15]).
Let $0$ be a fixed point for the $C^r$ local diffeomorphism $\phi: U \to E$, where $U$ is a neighborhood of $0$ in the Banach space $E$. Suppose that $E = E_{cs} \oplus E_u$, where $E_{cs}$ is the invariant subspace corresponding to the eigenvalues of $D\phi(0)$ whose magnitude is less than or equal to $1$, and $E_u$ is the invariant subspace corresponding to eigenvalues of $D\phi(0)$ whose magnitude is greater than $1$. Then there exists a $C^r$ embedded disc $W^{cs}_{loc}$ that is tangent to $E_{cs}$ at $0$, called the local stable center manifold. Additionally, there exists a neighborhood $B$ of $0$ such that $\phi(W^{cs}_{loc}) \cap B \subset W^{cs}_{loc}$, and if $z$ is a point such that $\phi^k(z) \in B$ for all $k \ge 0$, then $z \in W^{cs}_{loc}$.
This statement of the stable manifold theorem is similar to the one found in [9], except that, since we have to deal with complex eigenvalues here, we emphasize that the decomposition is between the eigenvalues whose magnitude is less than or equal to $1$ and those whose magnitude is greater than $1$. It guarantees the existence of a stable center manifold of dimension equal to the number of eigenvalues of the Jacobian at the critical point whose magnitude is less than or equal to $1$.
We show now that the Jacobian has the properties required for application of this result, for values of and similar to the choices (4). (Note that the conditions on and in this result hold when and , where is the Lipschitz constant from Assumption 1.) For purposes of this and future results in this section, we assume that at the point we have and that the eigenvalue decomposition of can be written as
[TABLE]
where the eigenvalues have
[TABLE]
for some with , where , and where , are the orthonormal set of eigenvectors that correspond to the eigenvalues in (12). The matrix is orthogonal.
Theorem 2.6.
Suppose that Assumption 1 holds. Let be a critical point for at which has negative eigenvalues, where . Consider the mapping defined by (5) where
[TABLE]
*where is the largest positive eigenvalue of . Then there are matrices and such that (a) the matrix is nonsingular; (b) the columns of span an invariant subspace of corresponding to eigenvalues of whose magnitude is less than or equal to ; (c) the columns of span an invariant subspace of corresponding to eigenvalues of whose magnitude is greater than . *
Proof 2.7.
Since
[TABLE]
we have from (11) that
[TABLE]
By performing a symmetric permutation on this matrix, interleaving rows/columns from the first block with rows/columns from the second block, we obtain a block diagonal matrix with blocks of the following form on the diagonals, that is,
[TABLE]
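where
$$M_i := \begin{bmatrix} (1+\beta) - \alpha \lambda_i & -\beta \\ 1 & 0 \end{bmatrix}, \quad i = 1, 2, \dots, n. \tag{15}$$
The eigenvalues of $M_i$ are obtained from the following quadratic in $\mu$:
$$\det(M_i - \mu I) = \big( (1+\beta) - \alpha \lambda_i - \mu \big)(-\mu) + \beta = 0,$$
that is,
$$\mu^2 - (1 + \beta - \alpha \lambda_i)\, \mu + \beta = 0, \tag{17}$$
for which the roots are
$$\mu_i = \frac{1}{2} \left[ (1 + \beta - \alpha \lambda_i) \pm \sqrt{(1 + \beta - \alpha \lambda_i)^2 - 4 \beta} \right]. \tag{18}$$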
We examine first the matrices for which . We have
[TABLE]
so both roots in (18) are real. Since is convex quadratic, with and , one root is in and the other is in . We can thus write
[TABLE]
where is the eigenvalue of in the range and is the eigenvalue of in the range . (This claim can be verified by direct calculation of the product (19a).)
Consider now the matrices for which . From (18), we have that the roots are and , which are distinct, since . The eigenvalue decompositions of these matrices have the form
[TABLE]
and the are nonsingular matrices.
When , we show that the eigenvalues of both have magnitude less than , under the given conditions on and . Both roots in (18) are complex exactly when the term under the square root is negative, and in this case the magnitude of both roots is
[TABLE]
which is less than by assumption. When both roots are real, we have , and we require the following to be true to ensure that both are less than in absolute value:
[TABLE]
We deal with the right-hand inequality in (21) first. By rearranging, we show that this is implied by the following sequence of equivalent inequalities:
[TABLE]
where the last is clearly true, because of and . Thus the right-hand inequality in (21) is satisfied.
For the left-hand inequality in (21), we have
[TABLE]
and the last condition holds because of the assumption that . This completes our proof of the claim (21). Thus our assumptions on and suffice to ensure that both eigenvalues of defined in (15) have magnitude less than when .
By defining
[TABLE]
where , are the matrices defined in (19), we have from (14) that
[TABLE]
We now define another -dimensional permutation matrix that sorts the entries of the diagonal matrices , into those whose magnitude is greater than one and those whose magnitude is less than or equal to one, to obtain
[TABLE]
where
[TABLE]
We now define
[TABLE]
which is a nonsingular matrix, by nonsingularity of and orthogonality of , , and . As in the statement of the theorem, we define to be the first columns of and to be the last columns. These define invariant spaces. For the stable space, we have
[TABLE]
where all eigenvalues of have magnitude less than or equal to . For the unstable space, we have
[TABLE]
*where is a diagonal matrix with all diagonal elements greater than . *
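The eigenvalue dichotomy established in this proof can be illustrated numerically. The sketch below assumes each diagonal block has the form $M_i = \begin{bmatrix} 1+\beta-\alpha\lambda_i & -\beta \\ 1 & 0 \end{bmatrix}$, which is what the block structure of the heavy-ball Jacobian yields for a Hessian eigenvalue $\lambda_i$. The parameter values ($L = 1$, $\alpha = 1/L$, $\beta = 0.5$) and the sample eigenvalues are illustrative assumptions only.

```python
import numpy as np

def block_eigs(lam, alpha, beta):
    """Eigenvalues of the assumed 2x2 block for Hessian eigenvalue lam."""
    M = np.array([[1.0 + beta - alpha * lam, -beta],
                  [1.0, 0.0]])
    return np.linalg.eigvals(M)

L = 1.0                      # assumed Lipschitz constant of the gradient
alpha, beta = 1.0 / L, 0.5   # illustrative step and momentum parameters

# Largest eigenvalue magnitude per block, for positive and negative curvature.
pos = [np.max(np.abs(block_eigs(lam, alpha, beta))) for lam in (0.05, 0.3, 1.0)]
neg = [np.max(np.abs(block_eigs(lam, alpha, beta))) for lam in (-0.01, -0.5)]
```

For these choices, every block associated with positive curvature has spectral radius below one, while every block associated with negative curvature has an eigenvalue of magnitude above one, matching the split into stable and unstable subspaces.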
We find a basis for the eigenspace that corresponds to the eigenvalues of that are greater than (that is, the column space of ) in the following result.
Corollary 2.8.
Suppose that the assumptions of Theorem 2.6 hold. Then the eigenvector of that corresponds to the unstable eigenvalue , defined in (18) is
[TABLE]
*where is an eigenvector of that corresponds to . The set of such vectors forms an orthogonal basis for the subspace of corresponding to the eigenvalues of whose magnitude is greater than . *
Proof 2.9.
We have from (13) that
[TABLE]
so the result holds provided that
[TABLE]
*But this is true because of (17), so (24) is an eigenvector of corresponding to the eigenvalue . Since the vectors form an orthogonal set, so do the vectors (24) for , completing the proof. *
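A quick numerical check of this eigenvector structure is sketched below. It assumes that the unstable eigenvector takes the stacked form $(\mu v, v)$, where $v$ is a Hessian eigenvector with negative eigenvalue $\lambda$ and $\mu$ is the larger root of $\mu^2 - (1+\beta-\alpha\lambda)\mu + \beta = 0$. This form is an assumption inferred from the block structure of the Jacobian, and the test matrix and parameters are hypothetical.

```python
import numpy as np

n = 3
H = np.diag([1.0, 0.4, -0.2])   # illustrative Hessian with one negative eigenvalue
alpha, beta = 1.0, 0.5          # illustrative parameters

# Assumed Jacobian of the heavy-ball map at the critical point.
DG = np.block([[(1.0 + beta) * np.eye(n) - alpha * H, -beta * np.eye(n)],
               [np.eye(n), np.zeros((n, n))]])

lam = -0.2
v = np.array([0.0, 0.0, 1.0])   # eigenvector of H for lam

# Larger root of mu^2 - (1 + beta - alpha*lam) * mu + beta = 0.
t = 1.0 + beta - alpha * lam
mu = 0.5 * (t + np.sqrt(t * t - 4.0 * beta))

w = np.concatenate([mu * v, v])           # candidate unstable eigenvector
residual = np.linalg.norm(DG @ w - mu * w)
```

The residual vanishes up to rounding exactly because $\mu$ satisfies the quadratic, which is the same reasoning the proof relies on.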
Our next result is similar to [9, Theorem 4.1]. It is for a modified version of the heavy-ball method in which the initial value for is perturbed from its usual choice of .
Theorem 2.10.
Suppose that the assumptions of Theorem 2.6 hold. Suppose that the heavy-ball method is applied from an initial point of , where and are random vectors with elements, and is small. We then have
[TABLE]
*where the probability is taken over the starting vectors and . *
Proof 2.11.
Our proof tracks that of [9, Theorem 4.1]. As there, we define the stable set for to be
[TABLE]
*For the neighborhood of promised by Theorem 2.5, we have for all that there is some such that for all , and therefore by Theorem 2.5 we must have . Thus is the set of points such that for some finite . From Theorem 2.5, is tangent to the subspace at , and the dimension of is , by Theorem 2.6 (since is the space spanned by the columns of ). This subspace has measure zero in , since . Since diffeomorphisms map sets of measure zero to sets of measure zero, and countable unions of measure zero sets have measure zero, we conclude that has measure zero. Thus the initialization strategy we have outlined produces a starting vector in with probability zero. *
Theorem 2.10 does not guarantee that once the iterates leave the neighborhood of , they never return. It does not exclude the possibility that the sequence returns infinitely often to a neighborhood of .
We note that the tweak of taking slightly different from does not affect practical performance of the heavy-ball method, and has in fact been proposed before [17]. It also does not disturb the theory that exists for this method, which for the case of quadratic discussed in [13] rests on an argument based on the eigendecomposition of the (linear) operator , which is not affected by the modified starting point. We note too that the accelerated gradient methods to be considered in the next section can also allow without significantly affecting the convergence theory. A Lyapunov-function-based convergence analysis of this method (see, for example [14, Chapter 4], based on arguments in [16]) requires only trivial modification to accommodate .
For the variant of heavy-ball method in which , we could consider a random choice of and ask whether there is zero probability of belonging to the measure-zero set defined by (25). The problem is of course that lies in the -dimensional subspace , and we would need to establish that the intersection has measure zero in . In other words, we need that the set has measure zero in . We have a partial result in this regard, pertaining to the set , which is the local counterpart of . This result also makes use of the subspace , defined as in Theorem 2.5, which is the invariant subspace corresponding to eigenvalues of whose magnitudes are less than or equal to one.
Theorem 2.12.
*Suppose that the assumptions of Theorem 2.6 hold. Then any vector of the form where lies in the stable subspace only if that is, the span of eigenvectors of that correspond to nonnegative eigenvalues of this matrix. *
Proof 2.13.
We write for some coefficients , , and show that for .
We first show that
[TABLE]
where and , . To derive recurrences for and , we consider the multiplication by that takes us from stages to . We have
[TABLE]
By matching terms, we have
[TABLE]
where is defined in (15). Using the factorization (19), we have
[TABLE]
By substitution from (19), we obtain
[TABLE]
Because , it follows from this formula that
[TABLE]
*so if has any component in the span of , (that is, if ), repeated multiplications of by will lead to divergence, so cannot be in the subspace . *
A consequence of this theorem is that for a random choice of , there is probability zero that , which is tangential to at . Thus for close to , there is probability zero that is in the measure-zero set . Successive iterations of (2) are locally similar to repeated multiplications of by the matrix , that is, for small, we have
[TABLE]
Under the probability-one event that , this suggests divergence of the iteration (2) away from .
On the other hand, we can show that if the sequence passes sufficiently close to a point such that satisfies second-order sufficient conditions to be a solution of (1), it subsequently converges to . For this result we need the following variant of the stable manifold theorem.
Theorem 2.14 (Theorem III.7 of [15]).
Let $0$ be a fixed point for the $C^r$ local diffeomorphism $\phi: U \to E$, where $U$ is a neighborhood of $0$ in the Banach space $E$. Suppose that $E_s$ is the invariant subspace corresponding to the eigenvalues of $D\phi(0)$ whose magnitude is strictly less than $1$. Then there exists a $C^r$ embedded disc $W^{s}_{loc}$ that is tangent to $E_s$ at $0$, and a neighborhood $B$ of $0$ such that $\phi(W^{s}_{loc}) \cap B \subset W^{s}_{loc}$, and for all $z \in W^{s}_{loc}$, we have $\phi^k(z) \to 0$ at a linear rate.
When satisfies second-order conditions for (1), all eigenvalues of are strictly positive. It follows from the proof of Theorem 2.6 that under the assumptions of this theorem, all eigenvalues of have magnitude strictly less than . Thus, the invariant subspace in Theorem 2.14 is the full space (in our case, ), so is a neighborhood of . It follows that there is some such that if for some , the sequence for converges to at a linear rate.
3 Speed of Divergence on a Toy Problem
In this section, we investigate the rate of divergence of an accelerated method on a simple nonconvex objective function, the quadratic with defined by
[TABLE]
Obviously, this function is unbounded below with a saddle point at . Its gradient has Lipschitz constant . Despite being a trivial problem, it captures the behavior of gradient algorithms near strict saddle points for indefinite quadratics of arbitrary dimension, as is apparent from the analysis below.
We have described the heavy-ball method in (2). The steepest-descent method, by contrast, takes steps of the form
[TABLE]
for some . When has Lipschitz constant , the choice leads to decrease in at each iteration that is consistent with convergence of to zero at a sublinear rate when is bounded below [12]. (The classical theory for gradient descent says little about the case in which is unbounded below, as in this example.)
The gradient descent and heavy-ball methods will converge to the saddle point $0$ for (27) only from starting points of the form $(x_1, 0)$ for any $x_1 \in \mathbb{R}$. (In the case of heavy-ball, this claim follows from Theorem 2.12, using the fact that $(1, 0)^T$ is the eigenvector of $\nabla^2 f$ that corresponds to the positive eigenvalue.) From any other starting point, both methods will diverge, with function values going to $-\infty$. When the starting point is very close to (but not on) the $x_1$ axis, the typical behavior is that these algorithms pass close to $0$ before diverging along the $x_2$ axis. We are interested in the question: Does the heavy-ball method diverge away from $0$ significantly faster than the steepest-descent method? The answer is “yes,” as we show in this section.
We consider a starting point that is just off the horizontal axis, that is,
[TABLE]
For the steepest-descent method with constant steplength, we have
[TABLE]
so that
[TABLE]
One measure of repulsion from the saddle point is the number of iterations required to obtain . Here it suffices for to be large enough that , for which (using the usual bound ) a sufficient condition is that
[TABLE]
Making the standard choice of steplength , we obtain
[TABLE]
Consider now the heavy-ball method. Following (2), the iteration has the form:
[TABLE]
(For this quadratic problem, the operator defined by (5) is linear, so that is constant.) We can partition this recursion into and components, and write
[TABLE]
where
[TABLE]
The eigenvalues of these two matrices are given by (18), by setting and , respectively. For and satisfying the conditions of Theorem 2.6, which translate here to
[TABLE]
both eigenvalues of are less than in magnitude (as we show in the proof of Theorem 2.6), so the components converge to zero. Again referring to the proof of Theorem 2.6, the eigenvalues of are both real, with one of them greater than , suggesting divergence in the component.
To understand rigorously the behavior of the sequence, we make some specific choices of and . Consider
[TABLE]
for some parameter . Note that for small and , these choices are consistent with (35). By substituting into (18), we see that the two eigenvalues of are
[TABLE]
For reasonable choices of , we have that for a modest positive value of . For specificity (and simplicity) let us consider and , for which we have
[TABLE]
The formula (19) yields , where and
[TABLE]
From (33), and setting , we have
[TABLE]
By substituting for and , we obtain
[TABLE]
where we simply drop the term involving in the final step and use . It follows that
[TABLE]
It follows from this bound, by a standard argument, that a sufficient condition for is
[TABLE]
Thus we have confirmed that divergence from the saddle point occurs in
iterations for heavy-ball, versus iterations for gradient descent.
For larger values of $\delta$, the divergence of both the steepest-descent and heavy-ball methods is rapid. For appropriate choices of $\alpha$ and $\beta$, the iterates generated by both algorithms leave the vicinity of the saddle point quickly.
Figure 1 illustrates the divergence behavior of steepest descent and heavy-ball on the function (27) with . We set for both steepest descent and heavy-ball. For heavy-ball, we chose . Both methods were started from . We see that the trajectory traced by steepest descent approaches the saddle point quite closely before diverging slowly along the axis. The heavy-ball method “overshoots” the axis (because of the momentum term) but quickly returns to diverging along the direction at a faster rate than for steepest descent.
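The comparison in this section can be reproduced in spirit with a short simulation. The sketch below is built on several assumptions that are not taken from the text: the toy objective is taken to be $f(x) = \tfrac{1}{2}(x_1^2 - \delta x_2^2)$, the starting point is $(1, \epsilon)$, escape is declared when $|x_2| \ge 1$, and the heavy-ball momentum is set to $\beta = 1 - \sqrt{\delta}$. The values of $\delta$ and $\epsilon$ and the function name `escape_iters` are likewise illustrative.

```python
import numpy as np

def escape_iters(alpha, beta, delta, eps, max_iters=100_000):
    """Count iterations until |x_2| >= 1 on the assumed toy quadratic."""
    lam = np.array([1.0, -delta])        # Hessian eigenvalues of the toy problem
    x = np.array([1.0, eps])             # start just off the x_1 axis
    x_prev = x.copy()                    # conventional start x^{-1} = x^0
    for k in range(max_iters):
        if abs(x[1]) >= 1.0:
            return k
        # Gradient of the quadratic is lam * x; beta = 0 gives steepest descent.
        x_next = x - alpha * lam * x + beta * (x - x_prev)
        x_prev, x = x, x_next
    return max_iters

delta, eps = 0.01, 1e-6
gd_iters = escape_iters(alpha=1.0, beta=0.0, delta=delta, eps=eps)
hb_iters = escape_iters(alpha=1.0, beta=1.0 - np.sqrt(delta), delta=delta, eps=eps)
```

With these settings the steepest-descent component grows by a factor $1 + \alpha\delta$ per step, whereas the heavy-ball growth factor is the larger root of $\mu^2 - (1 + \beta + \alpha\delta)\mu + \beta = 0$, so `hb_iters` comes out several times smaller than `gd_iters`.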
4 General Accelerated Gradient Methods Applied to Quadratic Functions
Here we analyze the rate at which a general class of accelerated gradient methods escape the saddle point of an -dimensional quadratic function:
[TABLE]
where is a symmetric matrix with eigenvalues satisfying (12). We assume without loss of generality that is in fact diagonal, that is,
[TABLE]
The Lipschitz constant for is .
As in Section 3, gradient descent with satisfies
[TABLE]
It follows that for all , for which , gradient descent diverges in that component at a rate of .
Algorithm 1 describes a general accelerated gradient framework, including gradient descent when , heavy-ball when and , and accelerated gradient methods when . With defined by (38), the update formula can be written as
[TABLE]
which because of (39) is equivalent to
[TABLE]
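One common way to write such a family of momentum methods is sketched below. It is an assumed parametrization rather than a transcription of Algorithm 1: an extrapolated point $y^k = x^k + \gamma (x^k - x^{k-1})$ is used for the gradient, and the step is $x^{k+1} = x^k + \beta (x^k - x^{k-1}) - \alpha \nabla f(y^k)$. Under this assumption, $\gamma = \beta = 0$ gives gradient descent, $\gamma = 0$ with $\beta > 0$ gives heavy-ball, and $\gamma = \beta > 0$ gives an accelerated gradient step. The function names and the test problem are hypothetical.

```python
import numpy as np

def accelerated_step(grad, x, x_prev, alpha, beta, gamma):
    """One step of the assumed general momentum framework."""
    step = x - x_prev
    y = x + gamma * step                     # point at which the gradient is taken
    return x + beta * step - alpha * grad(y)

def run(grad, x0, alpha, beta, gamma, num_iters):
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()                        # conventional start x^{-1} = x^0
    for _ in range(num_iters):
        x_prev, x = x, accelerated_step(grad, x, x_prev, alpha, beta, gamma)
    return x

# Illustrative diagonal quadratic with one negative eigenvalue.
lam = np.array([1.0, -0.05])
grad = lambda x: lam * x
x0 = [1.0, 1e-3]

gd = run(grad, x0, alpha=1.0, beta=0.0, gamma=0.0, num_iters=50)
hb = run(grad, x0, alpha=1.0, beta=0.5, gamma=0.0, num_iters=50)
ag = run(grad, x0, alpha=1.0, beta=0.5, gamma=0.5, num_iters=50)
```

On this example the component along the negative-curvature direction grows fastest for the accelerated variant and slowest for gradient descent, in line with the ordering analyzed in the remainder of this section.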
The following theorem describes the dynamics of in (41) when .
Theorem 4.1.
For all such that , we have from (41) that
[TABLE]
where
[TABLE]
In addition if and for all then,
[TABLE]
Proof 4.2.
We begin by showing that (43) holds for and . The case for is trivial as . In addition, for , the update formula (41) becomes
[TABLE]
Thus because , we can make this consistent with (42) by setting which is exactly (43) for .
Now assume that (42) holds for all . From (41), using the inductive hypothesis for and , we need to show
[TABLE]
by the given definition of in (43). Dividing both sides by , this is equivalent to
[TABLE]
which is true because
[TABLE]
is (43) with , as required.
Now we assume that and holds for all and show by induction that holds for all . This is clearly true for since . Assume now that holds for all . We have
[TABLE]
*where the second inequality above follows from , and . *
Since for all , Theorem 4.1 shows that Algorithm 1 diverges at a faster rate than gradient descent when at least one of or is true. Now we explore the rate of divergence by finding a limit for the sequence .
Theorem 4.3.
*Let and hold for all and denote and . Then, for all such that , we have
, where is defined by*
[TABLE]
Proof 4.4.
We can write (41) as follows:
[TABLE]
Recall from Theorem 4.1 that . By substituting into the equation above, we have
[TABLE]
Using Theorem 4.1 again, we have
[TABLE]
By matching this expression with (47), we obtain
[TABLE]
which after division by yields
[TABLE]
Now assume for contradiction that the nondecreasing sequence has no finite limit, that is, . Recalling that and have a finite limit (as they are nondecreasing sequences restricted to the interval ), we have by taking the limit as in (49) that the left-hand side approaches , while the right-hand side approaches , a contradiction. Thus, the nondecreasing sequence has a finite limit, which we denote by .
To find the value for , we take limits as in (48) to obtain
[TABLE]
By solving this quadratic for , we obtain
[TABLE]
*By Theorem 4.1, we know that for all , so that . Therefore, satisfies (46), as claimed. *
We apply Theorem 4.3 to parameter choices that typically appear in accelerated gradient methods.
Corollary 4.5.
Let the assumptions of Theorem 4.3 hold, let hold for all and let . Then,
[TABLE]
Proof 4.6.
By direct computation with , we have
[TABLE]
The above corollary gives a rate of divergence for many standard choices of the extrapolation parameters found in the accelerated gradient literature. In particular, it includes the sequence where and
[TABLE]
which was used in a seminal work by Nesterov [11]. (For completeness, we provide a proof in the appendix that , so that the assumptions of Corollary 4.5 hold for this sequence.) Another setting is used in recent works [1] [3] [5]. For proper choices of , this scheme has a number of impressive properties, such as fast convergence of iterates for accelerated proximal gradient, as well as achieving a of convergence in the weakly convex case.
We can also use Theorem 4.3 to derive a bound for the heavy-ball method. If we target the -th eigenvalue and set and for all , simple manipulation shows that , which gives us an equivalent rate to that derived in (37). Note that for defined in (50) we also have .
The divergence rates for accelerated gradient and heavy-ball methods are significantly faster than the per-iteration rate of obtained for steepest descent.
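The limiting per-iteration growth factor can be explored numerically. The sketch below assumes an accelerated update of the form $x^{k+1} = y^k - \alpha \nabla f(y^k)$ with $y^k = x^k + \beta_k (x^k - x^{k-1})$, applied to a single coordinate with eigenvalue $\lambda < 0$. Writing $c = 1 - \alpha\lambda$ and $r_k = x^{k+1}/x^k$, this gives the recursion $r_k = c(1 + \beta_k) - c\beta_k / r_{k-1}$ with $r_0 = c$, and when $\beta_k \to 1$ the fixed point is $\bar r = c\,(1 + \sqrt{1 - 1/c})$. The momentum sequence uses the familiar recursion $t_{k+1} = \tfrac{1}{2}(1 + \sqrt{1 + 4 t_k^2})$, $\beta_k = (t_k - 1)/t_{k+1}$; the exact indexing used in the paper may differ, and all names and values below are illustrative.

```python
import numpy as np

def nesterov_betas(num):
    """Momentum weights beta_k = (t_k - 1) / t_{k+1} with t_0 = 1."""
    t, betas = 1.0, []
    for _ in range(num):
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        betas.append((t - 1.0) / t_next)
        t = t_next
    return betas

def growth_ratios(lam, alpha, betas):
    """Ratios r_k = x^{k+1} / x^k along one eigen-direction of a quadratic."""
    c = 1.0 - alpha * lam
    r = c                       # first step is a plain gradient step (x^{-1} = x^0)
    ratios = [r]
    for b in betas[1:]:
        r = c * (1.0 + b) - c * b / r
        ratios.append(r)
    return ratios

lam, alpha = -0.01, 1.0          # illustrative negative eigenvalue and step size
ratios = growth_ratios(lam, alpha, nesterov_betas(2000))

c = 1.0 - alpha * lam
limit = c * (1.0 + np.sqrt(1.0 - 1.0 / c))   # fixed point of the recursion at beta = 1
```

In this sketch the ratios increase monotonically from the gradient-descent factor $c$ toward `limit`, which is the behavior the theorems in this section describe for the sequence of growth factors.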
5 Experiments
Some computational experiments verify that accelerated gradient methods escape saddle points on nonconvex quadratics faster than steepest descent.
We apply these methods to a quadratic with diagonal Hessian, with and a single negative eigenvalue, . The nonnegative eigenvalues are i.i.d. from the uniform distribution on , and starting vector is drawn from a uniform distribution on the unit ball. Figure 2 plots the norm of the component of in the direction of the negative eigenvector at each iteration , for accelerated gradient, heavy-ball, and steepest descent. It also shows the divergence that would be attained if the theoretical limit from Theorem 4.3 applied at every iteration. Steepest descent and heavy-ball were run with . Heavy-ball uses (36) to calculate , yielding in the case of . Accelerated gradient is run with and where is defined in (51).
It is clear from Figure 2 that accelerated gradient and heavy-ball diverge at a significantly faster rate than steepest descent. In addition, there is only a small discrepancy between applying accelerated gradient and its limiting rate that is derived in Corollary 4.5, suggesting that approaches rapidly as .
Next we investigate how these methods behave for various dimensions and various distributions of the eigenvalues. For two values of ( and ), we generate random matrices with eigenvalues uniformly distributed in the interval , with the negative eigenvalues uniformly distributed in . The starting vector is uniformly distributed on the unit ball. Algorithmic constants were the same as those used to generate Figure 2. Each trial was run until the norm of the projection of the current iterate into the negative eigenspace of the Hessian was greater than the dimension . The results of these trials are shown in Table 1.
As expected, accelerated gradient outperforms steepest descent in all respects. All convergence results are slightly faster for than for , because the random choice of will, in expectation, have a smaller component in the span of the negative eigenvectors in the latter case. The eigenvalue spectrum has a much stronger effect on the divergence rate. For steepest descent, an order of magnitude decrease in the absolute value of the negative eigenvalues corresponds to an order of magnitude increase in iterations, whereas Nesterov’s accelerated gradient sees significantly less growth in the iteration count. While the accelerated gradient method diverges at a slightly slower rate than the theoretical limit, the relative difference between the two does not change much as the dimensions change. Thus, Theorem 46 provides a strong indication of the practical behavior of Nesterov’s method on these problems.
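The stopping rule used in these trials can be sketched as follows. For a diagonal Hessian the coordinates decouple, so the projection onto the negative eigenspace depends only on the coordinates with negative eigenvalues, and it suffices to iterate on those. This is a minimal sketch under assumptions made here: the function name `escape_iterations`, the step `alpha`, the starting values, the threshold, and the standard Nesterov momentum sequence are illustrative and are not the settings behind Table 1.

```python
import math


def escape_iterations(neg_eigs, x0_neg, alpha, threshold, accelerated,
                      max_iters=10**6):
    """Count iterations until the norm of the negative-eigenspace component
    exceeds `threshold`. Only negative-curvature coordinates are tracked."""
    x_prev, x = list(x0_neg), list(x0_neg)
    t = 1.0
    for k in range(max_iters):
        if math.sqrt(sum(v * v for v in x)) > threshold:
            return k
        if accelerated:
            # Standard Nesterov momentum sequence (assumed form).
            t_next = 0.5 * (1.0 + math.sqrt(1.0 + 4.0 * t * t))
            beta, t = (t - 1.0) / t_next, t_next
        else:
            beta = 0.0  # plain steepest descent
        y = [xi + beta * (xi - xp) for xi, xp in zip(x, x_prev)]
        x_prev, x = x, [yi - alpha * li * yi for yi, li in zip(y, neg_eigs)]
    return max_iters


# Illustrative comparison: shrink the negative eigenvalues by a factor of 10.
x0_neg = [0.1, 0.1]
for scale in (1.0, 0.1):
    neg_eigs = [-0.01 * scale, -0.02 * scale]
    sd = escape_iterations(neg_eigs, x0_neg, 1.0, 10.0, accelerated=False)
    ag = escape_iterations(neg_eigs, x0_neg, 1.0, 10.0, accelerated=True)
    print(f"scale={scale}: steepest descent {sd}, accelerated gradient {ag}")
```

For steepest descent the count is roughly the logarithm of the ratio of the threshold to the initial component, divided by `alpha` times the largest negative eigenvalue magnitude, which explains the order-of-magnitude scaling noted above.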
6 Conclusion
We have derived several results about the behavior of accelerated gradient methods on nonconvex problems, in the vicinity of critical points at which at least one of the eigenvalues of the Hessian is negative. Section 2 shows that the heavy-ball method does not converge to such a point when started randomly, while Sections 3 and 4 show that when is an indefinite quadratic, momentum methods diverge faster than the steepest-descent method.
It would be interesting to extend the results on speed of divergence to non-quadratic smooth functions . It would also be interesting to know what can be proved about the complexity of convergence to a point satisfying second-order necessary conditions, for unadorned accelerated gradient methods. A recent work [6] shows that gradient descent can take exponential time to escape from a set of saddle points. We believe that a similar result holds for accelerated methods as well. The report [8], which appeared after this paper was submitted, describes an accelerated gradient method that adds noise selectively to some iterates and exploits negative curvature search directions when they are detected in the course of the algorithm. This approach is shown to have the rate that characterizes the best known gradient-based algorithms for finding second-order necessary points of smooth nonconvex functions.
Acknowledgments
We are grateful to Bin Hu for his advice and suggestions on the manuscript. We are also grateful to the referees and editor for helpful suggestions.
Appendix A Properties of the Sequence Defined By (51)
In this appendix we show that the following two properties hold for the sequence defined by (51):
[TABLE]
and
[TABLE]
We begin by noting two well known properties of the sequence (see for example [4, Section 3.7.2]):
[TABLE]
and
[TABLE]
To prove that is monotonically increasing, we need
[TABLE]
Since (which follows immediately from (51)), it is sufficient to prove that
[TABLE]
By manipulating this expression and using (54), we obtain the equivalent expression
[TABLE]
By definition of , we have
[TABLE]
Thus (56) holds, so the claim (52) is proved. The sequence is nonnegative, since .
Now we prove (53). We can lower-bound as follows:
[TABLE]
For an upper bound, we have from that
[TABLE]
Since (because of (55)), it follows from the lower bound above and (58) that (53) holds.
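The properties used in this appendix can be checked numerically. The sketch below assumes that (51) is the standard Nesterov recurrence discussed in [4, Section 3.7.2], written here as `t_next = (1 + sqrt(1 + 4 t^2)) / 2` with momentum `beta_k = (t_k - 1) / t_{k+1}` and initial value `t0 = 1.0`. The indexing, the initial value, and the names are assumptions that may differ from the paper's conventions, so this is a sanity check of the generic sequence rather than a verification of (52) and (53) as stated.

```python
import math


def nesterov_sequence(num_terms, t0=1.0):
    """Return [t_0, ..., t_num_terms] with t_{k+1} = (1 + sqrt(1 + 4 t_k^2)) / 2."""
    t = [t0]
    for _ in range(num_terms):
        t.append(0.5 * (1.0 + math.sqrt(1.0 + 4.0 * t[-1] ** 2)))
    return t


t = nesterov_sequence(50)
beta = [(t[k] - 1.0) / t[k + 1] for k in range(len(t) - 1)]

# Identity: t_{k+1} is the positive root of s^2 - s - t_k^2 = 0.
identity_ok = all(abs(t[k + 1] ** 2 - t[k + 1] - t[k] ** 2) < 1e-9 * t[k + 1] ** 2
                  for k in range(len(t) - 1))
# Linear growth: sqrt(1 + 4 t^2) >= 2 t gives t_{k+1} >= t_k + 1/2.
growth_ok = all(t[k] >= t[0] + 0.5 * k - 1e-12 for k in range(len(t)))
# Momentum coefficients: nonnegative, below 1, and monotonically increasing.
range_ok = all(0.0 <= b < 1.0 for b in beta)
monotone_ok = all(beta[k + 1] > beta[k] for k in range(len(beta) - 1))

print(identity_ok, growth_ok, range_ok, monotone_ok)
```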
References
- [1] H. Attouch and A. Cabot. Convergence rates of inertial forward-backward algorithms. SIAM Journal on Optimization, 28(1):849–874, 2018.
- [2] H. Attouch, X. Goudou, and P. Redont. The heavy ball with friction method, I. The continuous dynamical system: global exploration of the local minima of a real-valued function by asymptotic analysis of a dissipative dynamical system. Communications in Contemporary Mathematics, 2(01):1–34, 2000.
- [3] H. Attouch and J. Peypouquet. The rate of convergence of Nesterov’s accelerated forward-backward method is actually faster than $1/k^2$. SIAM Journal on Optimization, 26(3):1824–1834, 2016.
- [4] S. Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.
- [5] A. Chambolle and Ch. Dossal. On the convergence of the iterates of the “fast iterative shrinkage/thresholding algorithm”. Journal of Optimization Theory and Applications, 166(3):968–982, 2015.
- [6] S. S. Du, C. Jin, J. D. Lee, M. I. Jordan, A. Singh, and B. Poczos. Gradient descent can take exponential time to escape saddle points. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 1067–1077. Curran Associates, Inc., 2017.
- [7] S. Ghadimi and G. Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1-2):59–99, 2016.
- [8] C. Jin, P. Netrapalli, and M. I. Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. arXiv preprint arXiv:1711.10456, 2017.
