Distributed Gradient Descent: Nonconvergence to Saddle Points and the Stable-Manifold Theorem
Brian Swenson, Ryan Murray, H. Vincent Poor, and Soummya Kar

TL;DR
This paper extends the stable-manifold theorem to distributed gradient descent, demonstrating that under certain conditions, DGD almost always converges to local minima rather than saddle points, addressing a key challenge in nonconvex optimization.
Contribution
It develops a novel stable-manifold theorem tailored for distributed gradient descent, showing convergence to saddle points is highly unlikely in nonconvex problems.
Findings
DGD typically converges to local minima, not saddle points
Convergence to saddle points occurs only on a low-dimensional stable manifold
Under certain assumptions, DGD almost always avoids saddle points
Abstract
The paper studies a distributed gradient descent (DGD) process and considers the problem of showing that in nonconvex optimization problems, DGD typically converges to local minima rather than saddle points. The paper considers unconstrained minimization of a smooth objective function. In centralized settings, the problem of demonstrating nonconvergence to saddle points of gradient descent (and variants) is typically handled by way of the stable-manifold theorem from classical dynamical systems theory. However, the classical stable-manifold theorem is not applicable in distributed settings. The paper develops an appropriate stable-manifold theorem for DGD showing that convergence to saddle points may only occur from a low-dimensional stable manifold. Under appropriate assumptions (e.g., coercivity), this result implies that DGD typically converges to local minima and not to saddle…
Brian Swenson†, Ryan Murray⋆, H. Vincent Poor†, and Soummya Kar‡
This work was partially supported by the Air Force Office of Scientific Research under MURI Grant FA9550-18-1-0502.
†Department of Electrical Engineering, Princeton University, Princeton, NJ 08540 ([email protected] and [email protected]),
⋆Department of Mathematics, North Carolina State University, Raleigh, NC 27695 ([email protected])
‡Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213 ([email protected])
Abstract
The paper studies continuous-time distributed gradient descent (DGD) and considers the problem of showing that in nonconvex optimization problems, DGD typically converges to local minima rather than saddle points. In centralized settings, the problem of demonstrating nonconvergence to saddle points is typically handled by way of the stable-manifold theorem from classical dynamical systems theory. However, the classical stable-manifold theorem is not applicable in the distributed setting. The paper develops an appropriate stable-manifold theorem for DGD. This shows that convergence to saddle points may only occur from a low-dimensional stable manifold. Under appropriate assumptions (e.g., coercivity), the result implies that DGD almost always converges to local minima.
Index Terms:
Distributed optimization, nonconvex optimization, gradient descent, multi-agent systems, saddle points, stable-manifold theorem
I Introduction
Suppose a group of $N$ agents may communicate over a network. Each agent $n$ possesses some local function $f_n$, and it is desired to optimize the sum function given by
$$f(x) = \sum_{n=1}^{N} f_n(x). \tag{1}$$
In applications, the function $f_n$ is typically generated from local information available only to agent $n$, and (1) represents some collective objective a system designer would like to optimize [1, 2, 3, 4, 5]. We are interested in the use of distributed gradient descent processes to compute local optima of (1), wherein agents may only exchange information with neighboring agents (see, e.g., [6]).
In this paper we focus on the case where the local functions may be nonconvex. This framework encompasses a wide range of applications including, for example, empirical risk minimization [7], target localization [8], robust regression [9], distributed coverage control [10], power allocation in wireless ad hoc networks [11], and others [12].
Assuming the objective is smooth, basic convergence results in nonconvex optimization typically ensure that algorithms converge to critical points. This set consists, of course, of local and global minima and saddle points. Global minima can be difficult to compute and, for practical purposes, local minima are often sufficient in applications [13]. Thus, global optima aside, the main difficulty in proving that an algorithm has desirable convergence properties typically lies in understanding the behavior near saddle points, and, in particular, showing nonconvergence to saddle points [14, 15, 16].
For classical (centralized) gradient descent, the problem of showing nonconvergence to saddle points is handled using the well-known “stable-manifold theorem” from dynamical systems theory [17, 18, 14]. In short, the stable-manifold theorem says that gradient descent (along with many other first-order algorithms [15]) can only converge to a saddle point if initialized on some low-dimensional hypersurface (referred to as the stable manifold). (The stable-manifold theorem deals with unstable points of general dynamical systems, not just gradient-type systems. However, restricted to gradient-type systems, this is the main implication of the result.) Any process initialized on the stable manifold will remain on the stable manifold thereafter, eventually converging to the saddle point of interest. On the other hand, any process not initialized on the stable manifold will be repelled from the saddle point (eventually converging to some local minimum, assuming, for example, that $f$ is coercive). In this way, the problem of understanding (non)convergence to saddle points in classical settings is completely resolved by the stable-manifold theorem.
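To make the centralized picture concrete, the following minimal sketch runs plain gradient descent on a toy objective of our own choosing, f(x, y) = x²/2 + y⁴/4 − y²/2, which has a nondegenerate saddle at the origin and local minima at (0, ±1). For this particular function the stable manifold of the saddle is the line y = 0. The objective, step size, and iteration count are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def grad_f(p):
    """Gradient of the toy objective f(x, y) = x**2/2 + y**4/4 - y**2/2."""
    x, y = p
    return np.array([x, y**3 - y])

def gradient_descent(p0, step=0.1, iters=2000):
    """Plain (centralized) gradient descent from the initial point p0."""
    p = np.array(p0, dtype=float)
    for _ in range(iters):
        p = p - step * grad_f(p)
    return p

# Initialized exactly on the stable manifold {y = 0}: stays on it.
on_manifold = gradient_descent([1.0, 0.0])
# Initialized a tiny distance off the manifold: repelled from the saddle.
off_manifold = gradient_descent([1.0, 1e-6])

print("start on the stable manifold  ->", on_manifold)
print("start off the stable manifold ->", off_manifold)
```

In this toy run, only the initialization with y exactly zero ends at the saddle; the perturbed initialization ends at the local minimum (0, 1).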
In the distributed setting, this is not the case. The classical stable-manifold theorem does not generally apply and specialized stable-manifold theorem results do not exist. Several recent works, including [11, 12, 9, 19, 20], have considered gradient-descent type algorithms for distributed nonconvex optimization. These have shown convergence to critical points, but have not dealt with the issue of nonconvergence to saddle points. The recent work [21] considered discrete-time distributed gradient descent with constant step size and demonstrated convergence to a neighborhood of a second-order stationary point under relatively mild assumptions.
In this work we focus on continuous-time dynamics and consider the problem of characterizing the stable manifold for the distributed gradient descent process
$$\dot{x}_n(t) = \beta_t \sum_{\ell \in \Omega_n} \big(x_\ell(t) - x_n(t)\big) - \alpha_t \nabla f_n(x_n(t)), \tag{2}$$
$n = 1, \ldots, N$, where $x_n(t)$ is the state of agent $n$, $\beta_t$ and $\alpha_t$ are time-varying (decaying) weight parameters, and $\Omega_n$ is the set of agents neighboring agent $n$ in the underlying communication graph. Intuitively, the dynamics (2) may be understood as follows. The consensus term $\beta_t \sum_{\ell \in \Omega_n} (x_\ell(t) - x_n(t))$ encourages agents to seek agreement with neighboring agents. The innovation term $-\alpha_t \nabla f_n(x_n(t))$ encourages each agent to descend the gradient of its local objective function. By appropriately controlling the decay rates of $\beta_t$ and $\alpha_t$ one can balance the dual objectives of ensuring that agents reach asymptotic consensus while simultaneously seeking optima of (1). The process (2) is a consensus + innovations variant of gradient descent [22].
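The interplay between the consensus and innovation terms can be seen in a small simulation. The sketch below applies a forward-Euler discretization to dynamics of the consensus + innovations form, writing the consensus weight as `beta(t)` and the innovation weight as `alpha(t)` (our notation). Everything concrete here is a hypothetical choice made for illustration: the ring network, the scalar double-well local objectives, the decay exponents, the step size, and the horizon. In particular, the exponents are not taken from Assumption 5.

```python
import numpy as np

# Ring graph on N = 4 agents (hypothetical example network).
N = 4
A = np.zeros((N, N))
for n in range(N):
    A[n, (n + 1) % N] = A[(n + 1) % N, n] = 1.0
L = np.diag(A.sum(axis=1)) - A  # graph Laplacian D - A

# Hypothetical local objectives f_n(x) = (x**2 - 1)**2 / 4 + c_n * x.
# Since sum(c) = 0, the sum function is N * (x**2 - 1)**2 / 4,
# with local minima at x = +1 and x = -1.
c = np.array([0.5, -0.5, 0.3, -0.3])

def local_grads(x):
    """Entry n is the derivative of f_n evaluated at agent n's state x[n]."""
    return x**3 - x + c

def alpha(t):
    """Innovation weight (illustrative decay exponent)."""
    return (1.0 + t) ** -0.5

def beta(t):
    """Consensus weight (illustrative, decays more slowly than alpha)."""
    return (1.0 + t) ** -0.1

def simulate_dgd(x0, h=0.05, steps=20000):
    """Forward-Euler integration of the consensus + innovations dynamics."""
    x = np.array(x0, dtype=float)
    for k in range(steps):
        t = k * h
        consensus = -beta(t) * (L @ x)   # entry n: beta_t * sum_l (x_l - x_n)
        innovation = -alpha(t) * local_grads(x)
        x = x + h * (consensus + innovation)
    return x

x_final = simulate_dgd([0.3, -0.2, 1.5, 0.8])
print("final agent states:", x_final)
print("disagreement (max - min):", x_final.max() - x_final.min())
```

In this toy run the agents end close to one another and close to a minimizer of the sum function, even though no individual f_n is minimized there.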
We remark that closely related discrete-time variants of distributed gradient descent were studied in [6, 23, 24] for distributed optimization of a convex function. This was extended to the distributed nonconvex setting in [11], where convergence to critical points was shown. The work [19] considered a distributed simulated annealing algorithm that ensures convergence to the set of global minima. However, the algorithm requires careful control of the annealing noise. We also remark that the recent work [25] considered a discrete-time primal-dual algorithm for distributed nonconvex optimization and showed convergence to second-order stationary points, but did not consider distributed gradient descent.
Our first main result will be to show that the dynamics (2) converge to critical points of $f$ (see Theorem 1). Our second main result will be to prove a stable-manifold theorem for (2) that characterizes nonconvergence to saddle points (see Theorem 2). Together, these results show that (under appropriate assumptions) the dynamics (2) typically converge to local minima of (1).
I-A Main Results
I-A1 Assumptions
We will make the following general assumptions.
The first assumption pertains to the communication network.
Assumption 1.
The graph is undirected and connected.
(See Section II for further discussion of the communication network.) The next three assumptions apply to the local objectives , .
Assumption 2.
 is of class .
Assumption 3.
 is Lipschitz continuous.
Assumption 4.
 is coercive.
We refer to the time-varying weights and in (2) as the consensus and innovation potentials respectively. We assume the consensus and innovation potentials take the following form.
Assumption 5.
 and , with .
When developing our stable-manifold theorem for (2) we will consider the behavior of the dynamics near some fixed saddle point . We will assume that the saddle point satisfies the following non-degeneracy assumption.
Assumption 6.
 is a nondegenerate saddle point of . That is, the Hessian is nonsingular.
I-A2 Main Results
We now state the main results of the paper. First, we show that the dynamics (2) converge to the set of critical points of (1).
Theorem 1.
Suppose $(x_n(t))_{n=1}^{N}$ is a solution to (2) with arbitrary initial condition and suppose that Assumptions 1–5 hold. Then for each $n = 1, \ldots, N$,
- (i)
Agents achieve consensus in the sense that $\lim_{t \to \infty} \|x_n(t) - x_\ell(t)\| = 0$, for $\ell = 1, \ldots, N$.
- (ii)
$x_n(t)$ converges to the set of critical points of $f$.
Our second main result will refine this convergence guarantee. The next result shows that the critical point reached by (2) will not typically be a saddle point. We show the following stable-manifold theorem for (2).
Theorem 2.
Suppose that Assumptions 1–5 hold and suppose that is a saddle point of satisfying Assumption 6. Let denote the number of negative eigenvalues of the Hessian . Then for all sufficiently large there exists a manifold with dimension such that the following holds: A solution to (2) converges to in the sense that for some , if and only if is initialized on , i.e., with .
When we say that has dimension we mean that is the graph of a continuous function from a dimensional domain. Note that in the above theorem, since we deal with a nondegenerate saddle point of , we must have . Thus, has dimension at most and is indeed a “low-dimensional surface.” The initial time in the above theorem depends on the weight processes and . This time may be equivalently taken to be zero by using alternate weight sequences and .
The value of Theorems 1 and 2 together is that they allow us to conclude that the dynamics (2) “typically” converge to local minima of $f$ (assuming Assumptions 1–5 hold and every saddle point of $f$ satisfies Assumption 6). More precisely, Theorem 1 tells us that the dynamics (2) will converge to critical points of $f$. Theorem 2 tells us that this limit point must be a local minimum (or, in the event that the solution does not have a unique limit, a connected set of local minima) unless the solution is initialized from the special set of initial conditions , where the (countable) union is taken over the set of all saddle points, and each is the low-dimensional stable manifold associated with the saddle point .
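The word “typically” can be probed numerically. The sketch below is a toy experiment under our own assumptions, not a reproduction of anything in the paper. It integrates consensus + innovations dynamics by forward Euler for four agents on a ring, with hypothetical local objectives whose sum has a nondegenerate saddle at the origin and local minima at (0, ±1), starting from 20 random initializations. The weights, step size, horizon, and random seed are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, R = 4, 2, 20  # agents, state dimension, number of random initializations

# Ring graph Laplacian (hypothetical network).
A = np.zeros((N, N))
for n in range(N):
    A[n, (n + 1) % N] = A[(n + 1) % N, n] = 1.0
L = np.diag(A.sum(axis=1)) - A

# Hypothetical local objectives
#   f_n(x) = x1**2/2 + x2**4/4 - x2**2/2 + <c_n, x>,  with sum_n c_n = 0,
# so the sum function has a saddle at (0, 0) and minima at (0, +1), (0, -1).
C = np.array([[0.5, 0.3], [-0.5, -0.3], [0.2, -0.4], [-0.2, 0.4]])

def local_grads(X):
    """X has shape (R, N, d); returns grad f_n at each agent's state."""
    G = np.empty_like(X)
    G[..., 0] = X[..., 0]
    G[..., 1] = X[..., 1] ** 3 - X[..., 1]
    return G + C

def run_dgd(X0, h=0.05, steps=8000):
    """Forward-Euler integration, vectorized over the R initializations."""
    X = X0.copy()
    for k in range(steps):
        t = k * h
        a = (1.0 + t) ** -0.5  # innovation weight (illustrative)
        b = (1.0 + t) ** -0.1  # consensus weight (illustrative)
        X = X + h * (-b * (L @ X) - a * local_grads(X))
    return X

X_final = run_dgd(rng.uniform(-1.5, 1.5, size=(R, N, d)))

minima = np.array([[0.0, 1.0], [0.0, -1.0]])
dist_to_minima = np.min(
    np.linalg.norm(X_final[:, :, None, :] - minima, axis=-1), axis=-1
)
dist_to_saddle = np.linalg.norm(X_final, axis=-1)
print("largest distance from an agent to its nearest minimum:", dist_to_minima.max())
print("smallest distance from an agent to the saddle:", dist_to_saddle.min())
```

A finite sample of random starts cannot prove a measure-zero statement, but it is consistent with the picture that initializations leading to the saddle are exceptional.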
It is also important to remark that a shortcoming of Theorem 2 is that it does not show that is a smooth surface. This will be the subject of future work.
The remainder of the paper is organized as follows. Section II sets up notation and reviews background material. Section III proves Theorem 1. Section IV proves Theorem 2. Finally, Section V concludes the paper.
II Notation
Let denote the set of all -times continuously differentiable functions from to . When the dimensions of domain and codomain are clear, we will simply say that a function belongs to . Given a function , we let denote the gradient of and let denote the Hessian. Unless otherwise stated, refers to the standard Euclidean norm. Given a point and let denote the open ball of radius about . We use the notation to denote the identity matrix. Given a matrix , denotes the nullspace of . Given a set of numbers let be the diagonal matrix with diagonal entries .
We say that a continuous mapping , over some interval , , is a solution to an ODE with initial condition at time if , satisfies the ODE for all , and . We note that under Assumption 3, solutions to (2) exist and are unique [18].
In Assumption 1 we assume that the inter-agent communication graph may be described by an undirected graph $G = (V, E)$, where $V = \{1, \ldots, N\}$ denotes the set of nodes (or agents) and $E$ denotes the set of communication links (edges) between agents. The pair $(n, \ell) \in E$ if and only if there exists an edge between nodes $n$ and $\ell$. In this paper we will consider simple graphs, i.e., graphs devoid of self-loops and multiple edges. The set of neighbors of node $n$ is given by
$$\Omega_n = \{\ell \in V : (n, \ell) \in E\}. \tag{3}$$
The degree of node $n$ is given by $d_n = |\Omega_n|$. The adjacency matrix of the graph is the $N \times N$ matrix $A = (A_{n\ell})$, with $A_{n\ell} = 1$ if $(n, \ell) \in E$, and $A_{n\ell} = 0$ otherwise. The degree matrix is given by the diagonal matrix $D = \mathrm{diag}(d_1, \ldots, d_N)$. The positive semidefinite matrix $L = D - A$ is referred to as the graph Laplacian matrix. The eigenvalues of $L$ can be ordered as $0 = \lambda_1(L) \le \lambda_2(L) \le \cdots \le \lambda_N(L)$. A graph is said to be connected if there exists a path between each pair of nodes. If the graph is connected then $\lambda_2(L) > 0$ [26].
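The relationship between connectivity and the second-smallest Laplacian eigenvalue is easy to check numerically. The following sketch builds the Laplacian D − A from an edge list and compares a connected path graph with a disconnected graph. The helper names `graph_laplacian` and `algebraic_connectivity` are ours, introduced only for illustration.

```python
import numpy as np

def graph_laplacian(num_nodes, edges):
    """Laplacian D - A of a simple undirected graph given as an edge list."""
    A = np.zeros((num_nodes, num_nodes))
    for n, l in edges:
        A[n, l] = 1.0
        A[l, n] = 1.0
    D = np.diag(A.sum(axis=1))
    return D - A

def algebraic_connectivity(L):
    """Second-smallest eigenvalue of the symmetric matrix L."""
    return np.linalg.eigvalsh(L)[1]  # eigvalsh returns ascending eigenvalues

L_path = graph_laplacian(3, [(0, 1), (1, 2)])  # connected path 0 - 1 - 2
L_split = graph_laplacian(3, [(0, 1)])         # node 2 is isolated

lam2_path = algebraic_connectivity(L_path)
lam2_split = algebraic_connectivity(L_split)
print("connected path graph, second eigenvalue:", lam2_path)
print("disconnected graph, second eigenvalue:  ", lam2_split)
```

For the path graph on three nodes the Laplacian eigenvalues are 0, 1, and 3, so the second eigenvalue is strictly positive, whereas the disconnected graph has a repeated zero eigenvalue.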
II-A Stochastic Approximation and Perturbed Solutions
Some of our proof techniques will utilize results on perturbed solutions to differential equations from the theory of stochastic approximation. We briefly review relevant results from the literature now.
We will be interested in studying (possibly perturbed) solutions of the differential equation
[TABLE]
where is . We will consider the following notion of a perturbed solution.
Definition 3 (Perturbed Solution).
A continuous function will be called a perturbed solution to (4) if:
- (i)
 is absolutely continuous,
- (ii)
There exists a locally integrable function such that for every there holds
- (a)
[TABLE]
- (b)
[TABLE]
for almost every .
Let ; we say that a continuous function is a Lyapunov function for if for any solution of (4) , for and for .
The following result (see Theorem 3.6 and Proposition 3.27 in [27]) characterizes the asymptotic behavior of perturbed solutions to ODEs admitting a Lyapunov function.
Theorem 4.
Suppose is a perturbed solution to (4). Suppose also that is a Lyapunov function for and that has empty interior. Then the limit set of , given by is contained in .
III Convergence to Critical Points
In this section we will prove Theorem 1. We begin by showing the following preliminary lemma which shows that under the dynamics (2) agents reach asymptotic consensus.
Lemma 5.
If is a solution to (2) then for all .
Proof.
The dynamics (2) may be expressed compactly as
[TABLE]
where and are as in Assumption 5.
Let and let denote the inverse of . Let . Using this time change we have the equivalent ODE
[TABLE]
where as . Using the explicit form of and in Assumption 5 it is readily verified that for some .
We will refer to the set
[TABLE]
as the consensus subspace. Consider the linear system
[TABLE]
Because is positive semidefinite with nullspace equal to , solutions to (7) converge to and hence for all .
Let denote a fundamental matrix solution of the linear system (7). By variation of parameters [28], the solution of (6) with initial condition may be expressed as
[TABLE]
where . Using Assumptions 3 and 4 we see that for some constant .
Let
[TABLE]
Using (8) we have
[TABLE]
where we have used the notation to indicate extracting the vector of coordinates in corresponding to agent . Using the previous bound on we get
[TABLE]
for some . The first term on the right hand side above goes to zero since is a solution to (7). Recalling that , the second term above is bounded as
[TABLE]
for some , where is the second smallest eigenvalue of . Since , this converges to zero as . ∎
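The role of the second-smallest Laplacian eigenvalue in the consensus estimate can be illustrated in a simplified setting. The sketch below considers the constant-weight linear consensus flow dz/dt = −Lz with scalar agent states, which is a simplification we adopt for illustration and not the time-varying system analyzed in the proof. It checks that the disagreement from the average decays at least as fast as exp(−λ₂t). The function name `consensus_flow` and the example data are hypothetical.

```python
import numpy as np

# Laplacian of the path graph 0 - 1 - 2 (hypothetical example network).
L = np.array([[ 1.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  1.0]])

def consensus_flow(L, z0, t):
    """Exact solution z(t) = exp(-L t) z0 of dz/dt = -L z for symmetric L."""
    lam, V = np.linalg.eigh(L)
    return V @ (np.exp(-lam * t) * (V.T @ z0))

z0 = np.array([1.0, 0.0, -4.0])
lam2 = np.linalg.eigvalsh(L)[1]  # second-smallest eigenvalue

results = []
for t in (0.5, 1.0, 2.0, 5.0):
    z = consensus_flow(L, z0, t)
    disagreement = np.linalg.norm(z - z.mean())
    bound = np.exp(-lam2 * t) * np.linalg.norm(z0 - z0.mean())
    results.append((t, disagreement, bound))
    print(f"t = {t}: disagreement = {disagreement:.3e}, bound = {bound:.3e}")
```

The average of the states is preserved by the flow because the all-ones vector lies in the nullspace of L, so only the disagreement component decays.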
We now prove Theorem 1.
Proof (Theorem 1).
Part (i) of the theorem follows from Lemma 5. We now prove part (ii) of the theorem. Let and let denote the inverse of so that . Letting we have
[TABLE]
, where as . Since (17) is equivalent to (2) up to a time change, we will prove the result for solutions to (17).
By Lemma 5, it is sufficient to show that the mean process, , converges to the set of critical points of . Noting that (because is undirected), the average dynamics may be expressed as
[TABLE]
[TABLE]
where $\mathbf{r}(t) = -\frac{1}{N}\sum_{n=1}^{N}\big(\nabla f_n(\mathbf{y}_n(t)) - \nabla f_n(\mathbf{y}_{\textup{avg}}(t))\big)$.
By Assumptions 3 and 4 we see that as . Recalling Definition 3, solutions to (23) may be viewed as perturbed solutions of the ODE
[TABLE]
Let denote the set of critical points of . Since , Sard’s theorem implies that has empty interior. By Theorem 4, solutions to (23) converge to the critical points set of . ∎
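The averaging step in the proof above can be written out explicitly. The following is a sketch under the assumption that the time-changed dynamics take the form $\dot{\mathbf{y}}_n = \gamma_t \sum_{\ell \in \Omega_n} (\mathbf{y}_\ell - \mathbf{y}_n) - \nabla f_n(\mathbf{y}_n)$. The symbols $\gamma_t$ (rescaled consensus weight), $\Omega_n$ (neighbor set of agent $n$), and $K$ (a common Lipschitz constant for the gradients $\nabla f_n$) are our notation for this illustration.

```latex
\begin{align*}
\dot{\mathbf{y}}_{\mathrm{avg}}(t)
  &= \frac{1}{N}\sum_{n=1}^{N} \dot{\mathbf{y}}_n(t) \\
  &= \frac{\gamma_t}{N}\sum_{n=1}^{N}\sum_{\ell\in\Omega_n}
       \big(\mathbf{y}_\ell(t)-\mathbf{y}_n(t)\big)
     - \frac{1}{N}\sum_{n=1}^{N}\nabla f_n(\mathbf{y}_n(t)) \\
  &= -\frac{1}{N}\sum_{n=1}^{N}\nabla f_n(\mathbf{y}_n(t))
     && \text{(each undirected edge contributes }
        (\mathbf{y}_\ell-\mathbf{y}_n)+(\mathbf{y}_n-\mathbf{y}_\ell)=0\text{)} \\
  &= -\frac{1}{N}\nabla f\big(\mathbf{y}_{\mathrm{avg}}(t)\big) + \mathbf{r}(t),
  \qquad
  \mathbf{r}(t) = -\frac{1}{N}\sum_{n=1}^{N}
     \Big(\nabla f_n(\mathbf{y}_n(t))-\nabla f_n(\mathbf{y}_{\mathrm{avg}}(t))\Big), \\
  \|\mathbf{r}(t)\|
  &\le \frac{K}{N}\sum_{n=1}^{N}
     \big\|\mathbf{y}_n(t)-\mathbf{y}_{\mathrm{avg}}(t)\big\|.
\end{align*}
```

Under these assumptions, the last bound shows that the perturbation vanishes whenever the agents reach asymptotic consensus, which is what allows the average process to be treated as a perturbed solution of a gradient flow.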
IV Nonconvergence to Saddle Points
IV-A Generalized Problem Setup
It will simplify the presentation and proofs if we consider a slight generalization of the distributed optimization framework. Namely, we will consider the distributed optimization problem as a special case of subspace constrained optimization. To this end, let denote the dimension of the ambient space, let be a function, and let be a positive semidefinite matrix. Consider the following optimization problem
[TABLE]
and the following dynamics for addressing this problem
[TABLE]
where is some pre-specified weight function of class satisfying as .
Note that the dynamics (27) may be viewed as , i.e., as , is forced towards the constraint set.
Under Assumptions 1–5, (2) is a special case of (27). To see this, first observe that (2) (or rather, (5)) is equivalent to the following ODE after a time change
[TABLE]
where . This fits the template of (27) where we let , let be given by the sum function (note that this differs from (1) in that we permit the arguments of the local functions to differ), and let .
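To see how a consensus constraint can be encoded as a subspace constraint, the sketch below assumes the penalty matrix is the Kronecker product of the graph Laplacian with the d × d identity, acting on the stacked vector of agent states. That form is our assumption for illustration. Under it, the quadratic form equals the sum of squared disagreements across edges, and the nullspace is exactly the set of stacked vectors in which all agents agree. The variable names are hypothetical.

```python
import numpy as np

N, d = 4, 3
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]  # ring graph (hypothetical network)

A = np.zeros((N, N))
for n, l in edges:
    A[n, l] = A[l, n] = 1.0
L = np.diag(A.sum(axis=1)) - A

# Assumed form of the penalty matrix acting on stacked states in R^(N*d).
Q = np.kron(L, np.eye(d))

rng = np.random.default_rng(1)
X = rng.normal(size=(N, d))  # row n is the state of agent n
x = X.reshape(-1)            # stacked vector (x_1, ..., x_N)

quad_form = x @ Q @ x
edge_energy = sum(np.sum((X[n] - X[l]) ** 2) for n, l in edges)

# A consensus vector (all agents hold the same state v) is annihilated by Q.
v = rng.normal(size=d)
residual = np.linalg.norm(Q @ np.tile(v, N))
nullity = N * d - np.linalg.matrix_rank(Q)

print("x^T Q x                 :", quad_form)
print("sum of edge disagreements:", edge_energy)
print("||Q x|| at consensus     :", residual)
print("dimension of nullspace   :", nullity)
```

For a connected graph the nullspace has dimension d, so driving the quadratic penalty to zero forces agreement among agents without constraining the common value.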
Within this generalized framework, we would like to capture Assumption 6. To this end, let ; we say that a point is a critical point of the restricted function if , where is taken with respect to some orthonormal basis of , and . Let denote the Hessian of taken with respect to some orthonormal basis of . We say that is a nondegenerate saddle point of if , and has at least one positive and one negative eigenvalue.
The following theorem demonstrates the existence of stable manifolds for (27).
Theorem 6.
Suppose , and . Suppose 0 is a nondegenerate saddle point of and let denote the number of negative eigenvalues of . Then for all sufficiently large there exists a manifold with dimension such that the following holds: A solution to (27) converges to 0 if and only if is initialized on , i.e., .
Since, under Assumptions 1–5, (2) is a special case of (27) this implies Theorem 2.
IV-B Proof of Theorem 6
- (Recenter) By the implicit function theorem, there exists a function such that, for each , is a critical point of the penalized function and as .
Letting we see that is a solution to (27) if and only if is a solution to
[TABLE]
where denotes the vector . For let
[TABLE]
and let so that we may express (28) as
[TABLE]
- (Diagonalize) For each , let be a unitary matrix that diagonalizes , so that , where is diagonal. Since we may construct as a differentiable function of . Changing coordinates again, let so that is a solution to (29) if and only if is a solution to
[TABLE]
Letting , the above is equivalent to
[TABLE]
Note that and for . Consequently, for any there exists an and such that for all and we have
[TABLE]
- (Compute Stable Solutions) Let denote the eigenvalues of . We may assume the eigenvalues are ordered so each varies smoothly in . For sufficiently large, the sign of remains constant for all , for each . Without loss of generality assume that the first diagonal entries (eigenvalues) of are negative and the remaining diagonal entries are positive for all sufficiently large. Let be decomposed as
[TABLE]
where denotes the ‘stable’ diagonal submatrix and denotes the ‘unstable’ diagonal submatrix. Let
[TABLE]
[TABLE]
By construction we have , . Hence, we may choose an such that for and all sufficiently large. We may also choose constants and such that the following estimates hold
[TABLE]
where . Now, suppose and consider the integral equation
[TABLE]
where . Note that if is continuous and solves (39) then, is differentiable and solves (33) with componentwise initialization for . This may be verified using the variation of parameters formula [28].
Given , let
[TABLE]
We remark that is finite for all and for any we may choose sufficiently large so that for all .
Suppose and let and be chosen so that (34) holds for all and for all . By Lemma 7, if and , then the right-hand side of (39) is a contraction on the space
[TABLE]
equipped with norm , where is defined in (42). Since this space is complete, there exists a unique solving (39).
- (Construct Stable Manifold) We now construct the stable set corresponding to the ODE (33). Let . For each let be the (unique) solution to (39) in . For each define the component map by
[TABLE]
and let . The stable manifold (with respect to (33)) is given by
[TABLE]
By construction, for any initialization , the corresponding solution of (33) with satisfies . Moreover, by Lemma 8 we see that contains all stable initializations . That is, if is a solution to (33) with and , then .
Having constructed (the stable manifold for (33)) the stable manifold for (27), denoted here by , is obtained by an appropriate change of coordinates,
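The fixed-point construction used in the (Compute Stable Solutions) step can be illustrated on a much simpler system. The sketch below is a toy of our own and not the time-varying system treated in the proof. It takes the autonomous planar system du/dt = −u + uv, dv/dt = v + u², fixes the stable coordinate u(0) = a, and iterates the corresponding integral operator on a truncated time grid. The value v(0) at the fixed point is the unstable coordinate of the point on the stable manifold above a. The grid, horizon, and iteration count are illustrative choices. A formal power-series calculation for this toy system gives v(0) ≈ −a²/3 for small a, which serves as a sanity check.

```python
import numpy as np

def stable_manifold_point(a, T=12.0, n=2401, iters=15):
    """Fixed-point iteration for the bounded solution with u(0) = a.

    Integral equations (toy system du/dt = -u + u*v, dv/dt = v + u**2):
        u(t) = exp(-t) * a + int_0^t   exp(-(t - s)) * u(s) * v(s) ds
        v(t) =             - int_t^inf exp(t - s)    * u(s)**2     ds
    The infinite integral is truncated at T and both integrals use the
    trapezoid rule on a uniform grid.
    """
    t = np.linspace(0.0, T, n)
    dt = t[1] - t[0]
    decay = np.exp(-dt)
    u = a * np.exp(-t)  # initial guess: linear stable solution
    v = np.zeros(n)     # initial guess: zero unstable component
    for _ in range(iters):
        g = u * v       # nonlinearity in the stable equation
        q = u ** 2      # nonlinearity in the unstable equation
        S = np.zeros(n)  # S[i] approximates int_0^{t_i} exp(-(t_i - s)) g(s) ds
        for i in range(n - 1):
            S[i + 1] = decay * S[i] + 0.5 * dt * (decay * g[i] + g[i + 1])
        U = np.zeros(n)  # U[i] approximates int_{t_i}^{T} exp(t_i - s) q(s) ds
        for i in range(n - 2, -1, -1):
            U[i] = decay * U[i + 1] + 0.5 * dt * (q[i] + decay * q[i + 1])
        u = a * np.exp(-t) + S
        v = -U
    return v[0]

a = 0.1
v0 = stable_manifold_point(a)
print("unstable coordinate on the stable manifold:", v0)
print("leading-order prediction -a**2/3:          ", -a**2 / 3)
```

The unstable component is integrated backward from infinity rather than forward from zero, which is what selects the one initial value of v for which the solution stays bounded.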
V Conclusion
We have considered the distributed gradient descent dynamics (2) for nonconvex optimization. We showed that the dynamics converge to the set of critical points of the nonconvex objective (Theorem 1). Furthermore, the dynamics may only converge to a saddle point of the objective if initialized from some special low-dimensional stable manifold.
Appendix
This appendix contains some intermediate results required for the proof of Theorem 6.
The following lemma shows that the right-hand side of (39) is a contraction. Before presenting the lemma, we define a few useful quantities. Given , let be given by
[TABLE]
where, for convenience, we suppress the argument previously used in .
Lemma 7 ( is a contraction).
Let , , and be chosen so that (37) is satisfied. Let , and let and be chosen so that (34) holds and holds for all . Let with . Then is a contraction.
Proof.
First, we claim that if and , then . To see this, note that
[TABLE]
where in the last line we use the assumptions made on , , and in the statement of the lemma.
Suppose now that , with . Let . For we have
[TABLE]
Given our choice of we have , hence, is a contraction. ∎
Lemma 8 ( contains all stable initializations).
Let , , and be chosen as in the construction of . Let , with , let and suppose that is a solution to (33) with and . If as then .
Proof.
By variation of constants we see that
[TABLE]
where . Note that the integral in converges by (35) and the fact that . Every term on the right-hand side of (60) is uniformly bounded in , except possibly the term . In particular, if , , then . Since the left-hand side of (60) is bounded, it follows that the right-hand side is bounded and thus all , must be zero and hence .
This implies that is a solution to the integral equation (39) given . By Lemma 7 we see that is the unique continuous solution of (39) given . By the definitions of and we thus see that . ∎
References
- [1] M. Rabbat and R. Nowak, “Distributed optimization in sensor networks,” in Proceedings of the 3rd International Symposium on Information Processing in Sensor Networks, 2004, pp. 20–27.
- [2] J. Chen and A. H. Sayed, “Diffusion adaptation strategies for distributed optimization and learning over networks,” IEEE Transactions on Signal Processing, vol. 60, no. 8, pp. 4289–4305, 2012.
- [3] Y. Cao, W. Yu, W. Ren, and G. Chen, “An overview of recent progress in the study of distributed multi-agent coordination,” IEEE Transactions on Industrial Informatics, vol. 9, no. 1, pp. 427–438, 2012.
- [4] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends® in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
- [5] S. Kar and B. Swenson, “Clustering with distributed data,” 2019, submitted for publication. Online: https://arxiv.org/abs/1901.00214.
- [6] A. Nedić and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, p. 48, 2009.
- [7] C. Lee, C. H. Lim, and S. J. Wright, “A distributed quasi-Newton algorithm for empirical risk minimization with nonsmooth regularization,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1646–1655.
- [8] P. Di Lorenzo and G. Scutari, “Distributed nonconvex optimization over networks,” in Proceedings of the 6th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), 2015, pp. 229–232.
