Distributed Gradient Descent: Nonconvergence to Saddle Points and the Stable-Manifold Theorem

Brian Swenson; Ryan Murray; H. Vincent Poor; and Soummya Kar

arXiv:1908.02747·math.OC·October 24, 2019

TL;DR

This paper extends the stable-manifold theorem to distributed gradient descent, demonstrating that under certain conditions, DGD almost always converges to local minima rather than saddle points, addressing a key challenge in nonconvex optimization.

Contribution

It develops a novel stable-manifold theorem tailored for distributed gradient descent, showing convergence to saddle points is highly unlikely in nonconvex problems.

Findings

01. DGD typically converges to local minima, not saddle points

02. Convergence to a saddle point can occur only from initializations on a low-dimensional stable manifold

03. Under certain assumptions, DGD almost always avoids saddle points

Abstract

The paper studies a distributed gradient descent (DGD) process and considers the problem of showing that in nonconvex optimization problems, DGD typically converges to local minima rather than saddle points. The paper considers unconstrained minimization of a smooth objective function. In centralized settings, the problem of demonstrating nonconvergence to saddle points of gradient descent (and variants) is typically handled by way of the stable-manifold theorem from classical dynamical systems theory. However, the classical stable-manifold theorem is not applicable in distributed settings. The paper develops an appropriate stable-manifold theorem for DGD showing that convergence to saddle points may only occur from a low-dimensional stable manifold. Under appropriate assumptions (e.g., coercivity), this result implies that DGD typically converges to local minima and not to saddle…

Equations

$$f(x):=\sum_{n=1}^{N}f_{n}(x).$$

$$\dot{\bf x}_{n}(t)=\beta_{t}\sum_{\ell\in\Omega_{n}}\big({\bf x}_{\ell}(t)-{\bf x}_{n}(t)\big)-\alpha_{t}\nabla f_{n}({\bf x}_{n}(t)),$$

$$\Omega_{n}=\{l\in V\mid(n,l)\in E\}.$$

$$\dot{\bf x}=F({\bf x}),$$

$$\lim_{t\to\infty}\sup_{0\leq v\leq T}\Big|\int_{t}^{t+v}U(s)\,ds\Big|=0$$

$$\frac{d}{dt}{\bf y}(t)-U(t)=F({\bf y}(t))$$

$$\dot{\bf x}=-\beta_{t}(L\otimes I_{d}){\bf x}-\alpha_{t}\big(\nabla f_{n}({\bf x}_{n})\big)_{n=1}^{N},$$

$$\dot{\bf y}(t)=-(L\otimes I_{d}){\bf y}(t)-\gamma_{t}\big(\nabla f_{n}({\bf y}_{n}(t))\big)_{n=1}^{N},$$

$$\mathcal{C}:=\{x\in\mathbb{R}^{Nd}:x={\bf 1}_{N}\otimes a,\ \text{for some }a\in\mathbb{R}^{d}\}$$

$$\dot{\bf y}=-(L\otimes I_{d}){\bf y}.$$

$${\bf y}(t)=\Phi(t)x_{0}+\int_{0}^{t}\Phi(t-s)b(s)\,ds,$$

$${\bf y}_{\textup{avg}}(t):=\frac{1}{N}\sum_{n=1}^{N}{\bf y}_{n}(t).$$

$$\begin{aligned}{\bf y}^{\perp}(t)&=\Phi(t)x_{0}-({\bf 1}_{N}\otimes I_{d})\frac{1}{N}\sum_{n=1}^{N}[\Phi(t)x_{0}]_{n}\\&\quad+\int_{0}^{t}\Phi(t-s)\Big(b(s)-({\bf 1}_{N}\otimes I_{d})\frac{1}{N}\sum_{n=1}^{N}[b(s)]_{n}\Big)\,ds,\end{aligned}$$

$$\|{\bf y}^{\perp}(t)\|\leq\Big\|\Phi(t)x_{0}-({\bf 1}_{N}\otimes I_{d})\frac{1}{N}\sum_{n=1}^{N}[\Phi(t)x_{0}]_{n}\Big\|+C\int_{0}^{t}\|\Phi(t-s)\gamma_{s}\|\,ds,$$

$$\int_{0}^{t}\|\Phi(t-s)\gamma_{s}\|\,ds\leq C\int_{0}^{t}e^{-\lambda_{2}(t-s)}s^{-\tau_{\gamma}}\,ds,$$

$$\dot{\bf y}_{n}(t)=\gamma_{t}\sum_{\ell\in\Omega_{n}}\big({\bf y}_{\ell}(t)-{\bf y}_{n}(t)\big)-\nabla f_{n}({\bf y}_{n}(t)),$$

$$\begin{aligned}\dot{\bf y}_{\textup{avg}}(t)&=\frac{1}{N}\sum_{n=1}^{N}\Big(\gamma_{t}\sum_{\ell\in\Omega_{n}}\big({\bf y}_{\ell}(t)-{\bf y}_{n}(t)\big)-\nabla f_{n}({\bf y}_{n}(t))\Big)\\&=-\frac{1}{N}\sum_{n=1}^{N}\Big(\big(\nabla f_{n}({\bf y}_{n}(t))-\nabla f_{n}({\bf y}_{\textup{avg}}(t))\big)+\nabla f_{n}({\bf y}_{\textup{avg}}(t))\Big)\\&=-\frac{1}{N}\sum_{n=1}^{N}\nabla f_{n}({\bf y}_{\textup{avg}}(t))+{\bf r}(t)\\&=-\frac{1}{N}\nabla f({\bf y}_{\textup{avg}}(t))+{\bf r}(t),\end{aligned}$$

$$\dot{\bf y}=-\nabla f({\bf y}).$$

$$\min_{x\in\mathbb{R}^{M}}h(x)\quad\text{subject to}\quad x\in\mathcal{N}(Q),$$

$$\dot{\bf x}(t)=-\nabla h({\bf x}(t))-\beta_{t}Q{\bf x}(t),$$

$$\dot{\bf x}=-\beta_{t}(L\otimes I_{d}){\bf x}-\big(\nabla f_{n}({\bf x}_{n})\big)_{n=1}^{N},$$

$$\dot{\bf y}=-\nabla h({\bf y}+g(\beta_{t}))-\beta_{t}Q({\bf y}+g(\beta_{t}))-g^{\prime}(\beta_{t})\dot{\beta}_{t},$$

$$A(t):=-\nabla^{2}_{x}\Big(h(x)+\tfrac{\beta_{t}}{2}x^{T}Qx\Big)\Big|_{x=g(\beta_{t})}$$

$$\dot{\bf y}(t)=A(t){\bf y}(t)+F({\bf y}(t),t)-g^{\prime}(\beta_{t})\dot{\beta}_{t}.$$

$$\dot{\bf z}(t)=\ldots$$


Full text

Distributed Gradient Descent: Nonconvergence to Saddle Points and the Stable-Manifold Theorem

Brian Swenson†, Ryan Murray⋆, H. Vincent Poor†, and Soummya Kar‡

This work was partially supported by the Air Force Office of Scientific Research under MURI Grant FA9550-18-1-0502.

†Department of Electrical Engineering, Princeton University, Princeton, NJ 08540

⋆Department of Mathematics, North Carolina State University, Raleigh, NC 27695

‡Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213

Abstract

The paper studies continuous-time distributed gradient descent (DGD) and considers the problem of showing that in nonconvex optimization problems, DGD typically converges to local minima rather than saddle points. In centralized settings, the problem of demonstrating nonconvergence to saddle points is typically handled by way of the stable-manifold theorem from classical dynamical systems theory. However, the classical stable-manifold theorem is not applicable in the distributed setting. The paper develops an appropriate stable-manifold theorem for DGD. This shows that convergence to saddle points may only occur from a low-dimensional stable manifold. Under appropriate assumptions (e.g., coercivity), the result implies that DGD almost always converges to local minima.

Index Terms:

Distributed optimization, nonconvex optimization, gradient descent, multi-agent systems, saddle points, stable-manifold theorem

I Introduction

Suppose a group of $N$ agents may communicate over a network. Each agent possesses some local function $f_{n}:\mathbb{R}^{d}\to\mathbb{R}$ and it is desired to optimize the sum function $f:\mathbb{R}^{d}\to\mathbb{R}$ given by

$$f(x):=\sum_{n=1}^{N}f_{n}(x).\tag{1}$$

In applications, the function $f_{n}$ is typically generated from local information available only to agent $n$, and (1) represents some collective objective a system designer would like to optimize [1, 2, 3, 4, 5]. We are interested in the use of distributed gradient descent processes to compute local optima of (1), wherein agents may only exchange information with neighboring agents (see, e.g., [6]).

In this paper we focus on the case where the local functions $f_{n}$ may be nonconvex. This framework encompasses a wide range of applications including, for example, empirical risk minimization [7], target localization [8], robust regression [9], distributed coverage control [10], power allocation in wireless ad hoc networks [11], and others [12].

Assuming the objective is smooth, basic convergence results in nonconvex optimization typically ensure that algorithms converge to critical points. This set consists, of course, of local and global minima and saddle points. Global minima can be difficult to compute and, for practical purposes, local minima are often sufficient in applications [13]. Thus, global optima aside, the main difficulty in proving that an algorithm has desirable convergence properties typically lies in understanding the behavior near saddle points, and, in particular, showing nonconvergence to saddle points [14, 15, 16].

For classical (centralized) gradient descent, the problem of showing nonconvergence to saddle points is handled using the well-known “stable-manifold theorem” from dynamical systems theory [17, 18, 14]. In short, the stable-manifold theorem says that gradient descent (along with many other first-order algorithms [15]) can only converge to a saddle point if initialized on some low-dimensional hypersurface (referred to as the stable manifold). (Footnote 1: The stable-manifold theorem deals with unstable points of general dynamical systems, not just gradient-type systems. However, restricted to gradient-type systems, this is the main implication of the result.) Any process initialized on the stable manifold will remain on the stable manifold thereafter, eventually converging to the saddle point of interest. On the other hand, any process not initialized on the stable manifold will be repelled from the saddle point (eventually converging to some local minimum, assuming, for example, that $f$ is coercive). In this way, the problem of understanding (non)convergence to saddle points in classical settings is completely resolved by the stable-manifold theorem.
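To make the classical picture concrete, the following minimal sketch integrates centralized gradient flow for an illustrative function that is not taken from the paper, $f(x)=x_{1}^{4}/4-x_{1}^{2}/2+x_{2}^{2}/2$. Its only saddle point is the origin, whose stable manifold is the line $x_{1}=0$. The function names, step size, and horizon are assumptions chosen for illustration, and forward Euler stands in for the continuous-time flow.

```python
import numpy as np

def grad_f(x):
    """Gradient of the illustrative function f(x) = x1**4/4 - x1**2/2 + x2**2/2."""
    return np.array([x[0] ** 3 - x[0], x[1]])

def gradient_flow(x0, step=0.01, horizon=40.0):
    """Forward-Euler approximation of the centralized flow x' = -grad f(x)."""
    x = np.array(x0, dtype=float)
    for _ in range(int(round(horizon / step))):
        x = x - step * grad_f(x)
    return x

# Initialized exactly on the stable manifold {x1 = 0}: the flow stays on it
# and approaches the saddle at the origin.
on_manifold = gradient_flow([0.0, 1.0])

# A tiny perturbation off the manifold is repelled from the saddle and
# ends up near the local minimum at (1, 0).
off_manifold = gradient_flow([1e-6, 1.0])

print(on_manifold, off_manifold)
```

The two initial conditions differ by $10^{-6}$ in the first coordinate only, which is enough to change the limit from the saddle to a local minimum.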

In the distributed setting, this is not the case. The classical stable-manifold theorem does not generally apply and specialized stable-manifold theorem results do not exist. Several recent works, including [11, 12, 9, 19, 20], have considered gradient-descent type algorithms for distributed nonconvex optimization. These have shown convergence to critical points, but have not dealt with the issue of nonconvergence to saddle points. The recent work [21] considered discrete-time distributed gradient descent with constant step size and demonstrated convergence to a neighborhood of a second-order stationary point under relatively mild assumptions.

In this work we focus on continuous-time dynamics and consider the problem of characterizing the stable manifold for the distributed gradient descent process

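$$\dot{\bf x}_{n}(t)=\beta_{t}\sum_{\ell\in\Omega_{n}}\big({\bf x}_{\ell}(t)-{\bf x}_{n}(t)\big)-\alpha_{t}\nabla f_{n}({\bf x}_{n}(t)),\tag{2}$$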

$n=1,\ldots,N$, where $\alpha_{t}$ and $\beta_{t}$ are time-varying (decaying) weight parameters, and $\Omega_{n}$ is the set of agents neighboring agent $n$ in the underlying communication graph. Intuitively, the dynamics (2) may be understood as follows. The consensus term $\beta_{t}\sum_{\ell\in\Omega_{n}}({\bf x}_{\ell}(t)-{\bf x}_{n}(t))$ encourages agents to seek agreement with neighboring agents. The innovation term $-\alpha_{t}\nabla f_{n}({\bf x}_{n}(t))$ encourages each agent to descend the gradient of its local objective function. By appropriately controlling the decay rates of $\alpha_{t}$ and $\beta_{t}$ one can balance the dual objectives of ensuring that agents reach asymptotic consensus while simultaneously seeking optima of (1). The process (2) is a consensus + innovations variant of gradient descent [22].
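As a rough illustration of how the consensus and innovation terms interact, the sketch below applies a forward-Euler discretization to (2). Everything specific in it is an assumption made for the example rather than something taken from the paper: three agents on a complete graph, $d=2$, local objectives $f_{n}(x)=x_{1}^{4}/4-x_{1}^{2}/2+x_{2}^{2}/2+c_{n}^{T}x$ with $\sum_{n}c_{n}=0$ (so $f$ has a saddle at the origin and local minima at $(\pm 1,0)$), and the weights $\tau_{\alpha}=0.5$, $\tau_{\beta}=0$.

```python
import numpy as np

# Hypothetical setup: complete graph on N = 3 agents, states in R^2.
ADJ = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
# Linear tilts c_n of the local objectives; each column sums to zero.
C = np.array([[-0.1, 0.05], [0.0, -0.1], [0.1, 0.05]])

def local_grads(x):
    """Stack (grad f_n(x_n))_n for agent states x of shape (N, d)."""
    g = np.empty_like(x)
    g[:, 0] = x[:, 0] ** 3 - x[:, 0]
    g[:, 1] = x[:, 1]
    return g + C

def simulate_dgd(x0, tau_alpha=0.5, tau_beta=0.0, step=0.01, horizon=400.0):
    """Forward-Euler discretization of the continuous-time dynamics (2)."""
    x = np.array(x0, dtype=float)
    deg = ADJ.sum(axis=1, keepdims=True)
    t = 0.0
    for _ in range(int(round(horizon / step))):
        alpha = (t + 1.0) ** (-tau_alpha)   # innovation weight
        beta = (t + 1.0) ** (-tau_beta)     # consensus weight
        consensus = ADJ @ x - deg * x       # sum over neighbors of (x_l - x_n)
        x = x + step * (beta * consensus - alpha * local_grads(x))
        t += step
    return x

rng = np.random.default_rng(0)
x_final = simulate_dgd(rng.uniform(-1.0, 1.0, size=(3, 2)))
print(x_final)
```

With a generic random initialization the agents end up in near agreement close to one of the two local minima of $f$, even though no agent's local objective has its minimum exactly there.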

We remark that closely related discrete-time variants of distributed gradient descent were studied in [6, 23, 24] for distributed optimization of a convex function. This was extended to the distributed nonconvex setting in [11], where convergence to critical points was shown. The work [19] considered a distributed simulated annealing algorithm that ensures convergence to the set of global minima. However, the algorithm requires careful control of the annealing noise. We also remark that the recent work [25] considered a discrete-time primal-dual algorithm for distributed nonconvex optimization and showed convergence to second-order stationary points, but did not consider distributed gradient descent.

Our first main result will be to show that the dynamics (2) converge to critical points of $f$ (see Theorem 1). Our second main result will be to prove a stable-manifold theorem for (2) that characterizes nonconvergence to saddle points (see Theorem 2). Together, these results show that (under appropriate assumptions) the dynamics (2) typically converge to local minima of (1).

I-A Main Results

I-A1 Assumptions

We will make the following general assumptions.

The first assumption pertains to the communication network.

Assumption 1.

The graph $G=(V,E)$ is undirected and connected.

(See Section II for further discussion of the communication network.) The next three assumptions apply to the local objectives $f_{n}$ , $n=1,\ldots,N$ .

Assumption 2.

$f_{n}:\mathbb{R}^{d}\to\mathbb{R}$ is of class $C^{2}$.

Assumption 3.

$\nabla f_{n}$ is Lipschitz continuous.

Assumption 4.

$f_{n}$ is coercive.

We refer to the time-varying weights $\beta_{t}$ and $\alpha_{t}$ in (2) as the consensus and innovation potentials respectively. We assume the consensus and innovation potentials take the following form.

Assumption 5.

$\alpha_{t}=(t+1)^{-\tau_{\alpha}}$ and $\beta_{t}=(t+1)^{-\tau_{\beta}}$, with $0\leq\tau_{\beta}<\tau_{\alpha}\leq 1$.

When developing our stable-manifold theorem for (2) we will consider the behavior of the dynamics near some fixed saddle point $x^{*}$ . We will assume that the saddle point satisfies the following non-degeneracy assumption.

Assumption 6.

$x^{*}$ is a nondegenerate saddle point of $f$. That is, the Hessian $\nabla^{2}f(x^{*})$ is nonsingular.

I-A2 Main Results

We now state the main results of the paper. First, we show that the dynamics (2) converge to the set of critical points of (1).

Theorem 1.

Suppose $({\bf x}_{n}(t))_{n=1}^{N}$ is a solution to (2) with arbitrary initial condition and suppose that Assumptions 1–5 hold. Then for each $n=1,\ldots,N$ ,

(i) Agents achieve consensus in the sense that $\lim_{t\to\infty}\|{\bf x}_{n}(t)-{\bf x}_{\ell}(t)\|=0$, for $\ell=1,\ldots,N$.

(ii) ${\bf x}_{n}(t)$ converges to the set of critical points of $f$.

Our second main result will refine this convergence guarantee. The next result shows that the critical point reached by (2) will not typically be a saddle point. We show the following stable-manifold theorem for (2).

Theorem 2.

Suppose that Assumptions 1–5 hold and suppose that $x^{*}$ is a saddle point of $f$ satisfying Assumption 6. Let $p$ denote the number of negative eigenvalues of the Hessian $\nabla^{2}f(x^{*})$. Then for all $t_{0}$ sufficiently large there exists a manifold $S\subset\mathbb{R}^{Nd}$ with dimension $(Nd-p)$ such that the following holds: A solution $({\bf x}_{n}(t))_{n=1}^{N}$ to (2) converges to $x^{*}$, in the sense that ${\bf x}_{n}(t)\to x^{*}$ for some $n$, if and only if $({\bf x}_{n}(t))_{n=1}^{N}$ is initialized on $S$, i.e., $({\bf x}_{n}(t_{0}))_{n=1}^{N}=x_{0}\in\mathbb{R}^{Nd}$ with $x_{0}\in S$.

When we say that $S$ has dimension $Nd-p$ we mean that $S$ is the graph of a continuous function from an $(Nd-p)$-dimensional domain. Note that in the above theorem, since we deal with a nondegenerate saddle point of $f$, we must have $p\geq 1$. Thus, $S$ has dimension at most $Nd-1$ and is indeed a “low-dimensional surface.” The initial time $t_{0}$ in the above theorem depends on the weight processes $\alpha_{t}$ and $\beta_{t}$. This time may be equivalently taken to be zero by using the alternate weight sequences $\hat{\alpha}_{t}=\alpha_{t+t_{0}}$ and $\hat{\beta}_{t}=\beta_{t+t_{0}}$.

The value of Theorems 1 and 2 together is that they allow us to conclude that the dynamics (2) “typically” converge to local minima of $f$ (assuming Assumptions 1–5 hold and every saddle point of $f$ satisfies Assumption 6). More precisely, Theorem 1 tells us that the dynamics (2) will converge to critical points of $f$. Theorem 2 tells us that this limit point must be a local minimum unless $({\bf x}_{n}(t))_{n=1}^{N}$ is initialized from the special set of initial conditions $\bigcup_{x^{*}}S_{x^{*}}$, where the (countable) union is taken over the set of all saddle points, and each $S_{x^{*}}$ is the low-dimensional stable manifold associated with the saddle point $x^{*}$. (Footnote 2: In the event that ${\bf x}(t)$ does not have a unique limit, it converges to a connected set of local minima.)

It is also important to remark that a shortcoming of Theorem 2 is that it does not show that $S$ is a smooth $C^{1}$ surface. This will be the subject of future work.

The remainder of the paper is organized as follows. Section II sets up notation and reviews background material. Section III proves Theorem 1. Section IV proves Theorem 2. Finally, Section V concludes the paper.

II Notation

Let $C^{k}(\mathbb{R}^{d_{1}};\mathbb{R}^{d_{2}})$ denote the set of all $k$ -times continuously differentiable functions from $\mathbb{R}^{d_{1}}$ to $\mathbb{R}^{d_{2}}$ . When the dimensions of domain and codomain are clear, we will simply say that a function belongs to $C^{k}$ . Given a function $f\in C^{2}$ , we let $\nabla f(x)$ denote the gradient of $f$ and let $\nabla^{2}f(x)$ denote the Hessian. Unless otherwise stated, $\|\cdot\|$ refers to the standard Euclidean norm. Given a point $x\in\mathbb{R}^{m}$ and $r>0$ let $B_{r}(x)$ denote the open ball of radius $r>0$ about $x$ . We use the notation $I_{m}$ to denote the $m\times m$ identity matrix. Given a matrix $A$ , $\mathcal{N}(A)$ denotes the nullspace of $A$ . Given a set of numbers $\{a_{1},\ldots,a_{m}\}$ let $\text{diag}(a_{1},\ldots,a_{m})$ be the $m\times m$ diagonal matrix with diagonal entries $a_{1},\ldots,a_{m}$ .

We say that a continuous mapping ${\bf x}:I\to\mathbb{R}^{d}$ , over some interval $I=[0,T)$ , $0<T\leq\infty$ , is a solution to an ODE with initial condition $x_{0}$ at time $t_{0}$ if ${\bf x}\in C^{1}$ , ${\bf x}$ satisfies the ODE for all $t\in I$ , and ${\bf x}(t_{0})=x_{0}$ . We note that under Assumption 3, solutions to (2) exist and are unique [18].

In Assumption 1 we assume that the inter-agent communication graph may be described by an undirected graph $G=(V,E)$, where $V=\{1,\ldots,N\}$ denotes the set of nodes (or agents) and $E$ denotes the set of communication links (edges) between agents. The pair $(n,l)\in E$ if and only if there exists an edge between nodes $n$ and $l$. In this paper we will consider simple graphs, i.e., graphs devoid of self-loops and multiple edges. The set of neighbors of node $n$ is given by

$$\Omega_{n}=\{l\in V\mid(n,l)\in E\}.\tag{3}$$

The degree of node $n$ is given by $d_{n}=|\Omega_{n}|$ . The adjacency matrix of the graph $G$ is the $N\times N$ matrix $A=\left[A_{nl}\right]$ , with $A_{nl}=1$ , if $(n,l)\in E$ , $A_{nl}=0$ , otherwise. The degree matrix is given by the diagonal matrix $D=\mbox{diag}\left(d_{1},\ldots,d_{N}\right)$ . The positive semidefinite matrix $L=D-A$ is referred to as the graph Laplacian matrix. The eigenvalues of $L$ can be ordered as $0=\lambda_{1}(L)\leq\lambda_{2}(L)\leq\cdots\leq\lambda_{N}(L)$ . A graph is said to be connected if there exists a path between each pair of nodes. If the graph $G$ is connected then $\lambda_{2}(L)>0$ [26].
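The definitions above translate directly into a few lines of NumPy. The sketch below is illustrative only: the helper name `graph_laplacian`, the zero-based node labels, and the two example graphs are assumptions. It builds $L=D-A$ and compares $\lambda_{2}(L)$ for a connected and a disconnected graph.

```python
import numpy as np

def graph_laplacian(num_nodes, edges):
    """Return L = D - A for a simple undirected graph given as (n, l) pairs."""
    adj = np.zeros((num_nodes, num_nodes))
    for n, l in edges:
        adj[n, l] = adj[l, n] = 1.0          # undirected: symmetric adjacency
    return np.diag(adj.sum(axis=1)) - adj    # degree matrix minus adjacency

# Connected example: path graph 0 - 1 - 2 - 3.
L_path = graph_laplacian(4, [(0, 1), (1, 2), (2, 3)])
eigs_path = np.linalg.eigvalsh(L_path)       # eigenvalues in ascending order

# Disconnected example: two separate edges 0 - 1 and 2 - 3.
L_split = graph_laplacian(4, [(0, 1), (2, 3)])
eigs_split = np.linalg.eigvalsh(L_split)

print(eigs_path[1], eigs_split[1])           # lambda_2 for each graph
```

For the path graph $\lambda_{2}(L)$ is strictly positive, while for the disconnected graph the eigenvalue zero has multiplicity two, so $\lambda_{2}(L)=0$ up to rounding.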

II-A Stochastic Approximation and Perturbed Solutions

Some of our proof techniques will utilize results on perturbed solutions to differential equations from the theory of stochastic approximation. We briefly review relevant results from the literature now.

We will be interested in studying (possibly perturbed) solutions of the differential equation

$$\dot{\bf x}=F({\bf x}),\tag{4}$$

where the vector field $F:\mathbb{R}^{d}\to\mathbb{R}^{d}$ is $C^{1}$. We will consider the following notion of a perturbed solution.

Definition 3 (Perturbed Solution).

A continuous function ${\bf y}:[0,\infty)\to\mathbb{R}^{d}$ will be called a perturbed solution to (4) if:

1. ${\bf y}$ is absolutely continuous, and

2. there exists a locally integrable function $t\mapsto U(t)$ such that

(a) for every $T>0$ there holds

$$\lim_{t\to\infty}\sup_{0\leq v\leq T}\Big|\int_{t}^{t+v}U(s)\,ds\Big|=0,$$

(b) and

$$\frac{d}{dt}{\bf y}(t)-U(t)=F({\bf y}(t))$$

for almost every $t>0$.

Let $\Lambda\subset\mathbb{R}^{d}$ ; we say that a continuous function $V:\mathbb{R}^{d}\to\mathbb{R}$ is a Lyapunov function for $\Lambda$ if for any solution ${\bf x}:\mathbb{R}\to\mathbb{R}^{d}$ of (4) , $\frac{d}{dt}V({\bf x}(t))=0$ for ${\bf x}(t)\in\Lambda$ and $\frac{d}{dt}V({\bf x}(t))<0$ for ${\bf x}(t)\notin\Lambda$ .

The following result (see Theorem 3.6 and Proposition 3.27 in [27]) characterizes the asymptotic behavior of perturbed solutions to ODEs admitting a Lyapunov function.

Theorem 4.

Suppose ${\bf y}$ is a perturbed solution to (4). Suppose also that $V$ is a Lyapunov function for $\Lambda$ and that $V(\Lambda)$ has empty interior. Then the limit set of ${\bf y}$ , given by $L({\bf y}):=\cap_{t\geq 0}\textup{cl}(\{{\bf y}(s):s\geq t\})$ is contained in $\Lambda$ .

III Convergence to Critical Points

In this section we will prove Theorem 1. We begin with a preliminary lemma showing that, under the dynamics (2), agents reach asymptotic consensus.

Lemma 5.

If $({\bf x}_{n}(t))_{n=1}^{N}$ is a solution to (2) then $\lim_{t\to\infty}\|{\bf x}_{n}(t)-{\bf x}_{\ell}(t)\|=0$ for all $\ell,n=1,\ldots,N$ .

Proof.

The dynamics (2) may be expressed compactly as

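$$\dot{\bf x}=-\beta_{t}(L\otimes I_{d}){\bf x}-\alpha_{t}\big(\nabla f_{n}({\bf x}_{n})\big)_{n=1}^{N},\tag{5}$$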

where $\alpha_{t}$ and $\beta_{t}$ are as in Assumption 5.

Let $S(\tau)=\int_{0}^{\tau}\beta_{r}\,dr$ and let $T(t)$ denote the inverse of $S(\tau)$ . Let ${\bf y}(t)={\bf x}(T(t))$ . Using this time change we have the equivalent ODE

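$$\dot{\bf y}(t)=-(L\otimes I_{d}){\bf y}(t)-\gamma_{t}\big(\nabla f_{n}({\bf y}_{n}(t))\big)_{n=1}^{N},\tag{6}$$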

where $\gamma_{t}=\frac{\alpha_{T(t)}}{\beta_{T(t)}}\to 0$ as $t\to\infty$. Using the explicit form of $\alpha_{t}$ and $\beta_{t}$ in Assumption 5 it is readily verified that $\gamma_{t}\leq(t+1)^{-\tau_{\gamma}}$ for some $\tau_{\gamma}>0$.
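Under Assumption 5 the time change can be written in closed form, which gives one way to check the bound on $\gamma_{t}$. In the sketch below the function names and the sample exponents are assumptions. Integrating $\beta_{r}=(r+1)^{-\tau_{\beta}}$ gives $S(\tau)=((\tau+1)^{1-\tau_{\beta}}-1)/(1-\tau_{\beta})$, and the check uses $\tau_{\gamma}=\tau_{\alpha}-\tau_{\beta}$, which is one admissible choice by Bernoulli's inequality, not necessarily the constant the authors have in mind.

```python
import numpy as np

def S(tau, tau_beta):
    """S(tau) = integral_0^tau (r + 1)**(-tau_beta) dr, for 0 <= tau_beta < 1."""
    return ((tau + 1.0) ** (1.0 - tau_beta) - 1.0) / (1.0 - tau_beta)

def T(t, tau_beta):
    """Inverse of S, so that T(S(tau)) = tau."""
    return (1.0 + (1.0 - tau_beta) * t) ** (1.0 / (1.0 - tau_beta)) - 1.0

def gamma(t, tau_alpha, tau_beta):
    """gamma_t = alpha_{T(t)} / beta_{T(t)} = (T(t) + 1)**(-(tau_alpha - tau_beta))."""
    return (T(t, tau_beta) + 1.0) ** (-(tau_alpha - tau_beta))

tau_alpha, tau_beta = 0.8, 0.3               # sample exponents satisfying Assumption 5
grid = np.linspace(0.0, 1000.0, 2001)

inverse_ok = np.allclose(T(S(grid, tau_beta), tau_beta), grid)
bound_ok = np.all(gamma(grid, tau_alpha, tau_beta)
                  <= (grid + 1.0) ** (-(tau_alpha - tau_beta)) + 1e-12)
print(inverse_ok, bound_ok)
```

When $\tau_{\beta}=0$ the time change is the identity and $\gamma_{t}=\alpha_{t}$.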

We will refer to the set

$$\mathcal{C}:=\{x\in\mathbb{R}^{Nd}:x={\bf 1}_{N}\otimes a,\ \text{for some }a\in\mathbb{R}^{d}\}$$

as the consensus subspace. Consider the linear system

$$\dot{\bf y}=-(L\otimes I_{d}){\bf y}.\tag{7}$$

Because $(L\otimes I_{d})$ is positive semidefinite with nullspace equal to $\mathcal{C}$ , solutions to (7) converge to $\mathcal{C}$ and hence $\lim_{t\to\infty}\|{\bf y}_{n}(t)-{\bf y}_{\ell}(t)\|=0$ for all $n,\ell=1,\ldots,N$ .
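The convergence of the linear consensus system to $\mathcal{C}$ can be checked numerically. The following sketch, whose helper name `consensus_flow` and example graph are assumptions, forms $e^{-(L\otimes I_{d})t}=e^{-Lt}\otimes I_{d}$ from the eigendecomposition of the symmetric matrix $L$ and applies it to a stacked initial state.

```python
import numpy as np

def consensus_flow(L, d, t):
    """Return exp(-(L kron I_d) t), using exp(-(L kron I_d) t) = exp(-L t) kron I_d."""
    eigvals, eigvecs = np.linalg.eigh(L)                     # L is symmetric
    exp_minus_Lt = (eigvecs * np.exp(-eigvals * t)) @ eigvecs.T
    return np.kron(exp_minus_Lt, np.eye(d))

# Laplacian of the path graph on 3 nodes (illustrative choice).
L_path = np.array([[1.0, -1.0, 0.0],
                   [-1.0, 2.0, -1.0],
                   [0.0, -1.0, 1.0]])
num_agents, dim = 3, 2

x0 = np.arange(6, dtype=float)                # stacked state (x_1; x_2; x_3), x_n in R^2
y = consensus_flow(L_path, dim, 20.0) @ x0
agents = y.reshape(num_agents, dim)           # row n holds the state of agent n
average = x0.reshape(num_agents, dim).mean(axis=0)
print(agents, average)
```

Because ${\bf 1}_{N}^{T}L=0$, the flow preserves the average of the agents' states, so every agent approaches the initial average at a rate governed by $\lambda_{2}(L)$.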

Let $\Phi(t)=e^{-(L\otimes I_{d})t}$ denote a fundamental matrix solution of the linear system (7). By variation of parameters [28], the solution ${\bf y}(t)$ of (6) with initial condition $x_{0}\in\mathbb{R}^{Nd}$ may be expressed as

$${\bf y}(t)=\Phi(t)x_{0}+\int_{0}^{t}\Phi(t-s)b(s)\,ds,\tag{8}$$

where $b(s)=-\gamma_{s}(\nabla f_{n}({\bf y}_{n}(s)))_{n=1}^{N}$ . Using Assumptions 3 and 4 we see that $\|b(s)\|\leq\gamma_{s}C$ for some constant $C>0$ .

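Let

$${\bf y}_{\textup{avg}}(t):=\frac{1}{N}\sum_{n=1}^{N}{\bf y}_{n}(t)$$

and let ${\bf y}^{\perp}(t):={\bf y}(t)-({\bf 1}_{N}\otimes I_{d}){\bf y}_{\textup{avg}}(t)$ denote the deviation of ${\bf y}(t)$ from consensus.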

Using (8) we have

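$$\begin{aligned}{\bf y}^{\perp}(t)&=\Phi(t)x_{0}-({\bf 1}_{N}\otimes I_{d})\frac{1}{N}\sum_{n=1}^{N}[\Phi(t)x_{0}]_{n}\\&\quad+\int_{0}^{t}\Phi(t-s)\Big(b(s)-({\bf 1}_{N}\otimes I_{d})\frac{1}{N}\sum_{n=1}^{N}[b(s)]_{n}\Big)\,ds,\end{aligned}$$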

where we have used the notation $[\cdot]_{n}$ to indicate extracting the vector of coordinates in $\mathbb{R}^{d}$ corresponding to agent $n$ . Using the previous bound on $[b(t)]_{n}$ we get

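$$\|{\bf y}^{\perp}(t)\|\leq\Big\|\Phi(t)x_{0}-({\bf 1}_{N}\otimes I_{d})\frac{1}{N}\sum_{n=1}^{N}[\Phi(t)x_{0}]_{n}\Big\|+C\int_{0}^{t}\|\Phi(t-s)\gamma_{s}\|\,ds,$$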

for some $C>0$ . The first term on the right hand side above goes to zero since $\Phi(t)x_{0}$ is a solution to (7). Recalling that $\Phi(t)=e^{-(L\otimes I_{d})t}$ , the second term above is bounded as

$$\int_{0}^{t}\|\Phi(t-s)\gamma_{s}\|\,ds\leq C\int_{0}^{t}e^{-\lambda_{2}(t-s)}s^{-\tau_{\gamma}}\,ds,$$

for some $C>0$ , where $\lambda_{2}>0$ is the second smallest eigenvalue of $L$ . Since $\tau_{\gamma}>0$ , this converges to zero as $t\to\infty$ . ∎
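The decay of the final integral can be seen numerically. The sketch below is a sanity check, not part of the proof. The values $\lambda_{2}=1$ and $\tau_{\gamma}=0.5$ are assumptions, and the integrand uses $(s+1)^{-\tau_{\gamma}}$, matching the bound on $\gamma_{t}$ stated in the proof of Lemma 5, which also avoids the integrable singularity of $s^{-\tau_{\gamma}}$ at $s=0$.

```python
import numpy as np

def convolution_bound(t, lam2=1.0, tau_gamma=0.5, num_points=200001):
    """Trapezoidal approximation of int_0^t exp(-lam2*(t - s)) * (s + 1)**(-tau_gamma) ds."""
    s = np.linspace(0.0, t, num_points)
    integrand = np.exp(-lam2 * (t - s)) * (s + 1.0) ** (-tau_gamma)
    # Composite trapezoid rule written out explicitly.
    return float(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(s)))

values = [convolution_bound(t) for t in (10.0, 100.0, 1000.0)]
print(values)
```

The exponential kernel concentrates the integral near $s=t$, where the polynomial factor is of order $(t+1)^{-\tau_{\gamma}}$, so the values shrink as $t$ grows.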

We now prove Theorem 1.

Proof (Theorem 1).

Part (i) of the theorem follows from Lemma 5. We now prove part (ii) of the theorem. Let $S(\tau)=\int_{0}^{\tau}\alpha_{r}\,dr$ and let $T(t)$ denote the inverse of $S(\tau)$ so that $T(S(\tau))=\tau$ . Letting ${\bf y}_{n}(t)={\bf x}_{n}(T(t))$ we have

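$$\dot{\bf y}_{n}(t)=\gamma_{t}\sum_{\ell\in\Omega_{n}}\big({\bf y}_{\ell}(t)-{\bf y}_{n}(t)\big)-\nabla f_{n}({\bf y}_{n}(t)),\tag{17}$$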

$n=1,\ldots,N$ , where $\gamma_{t}=\frac{\beta_{T(t)}}{\alpha_{T(t)}}\to\infty$ as $t\to\infty$ . Since (17) is equivalent to (2) up to a time change, we will prove the result for solutions to (17).

By Lemma 5, it is sufficient to show that the mean process, ${\bf y}_{\textup{avg}}(t)$ , converges to the set of critical points of $f$ . Noting that $\sum_{n=1}^{N}\sum_{\ell\in\Omega_{n}}({\bf y}_{\ell}(t)-{\bf y}_{n}(t))=0$ (because $G$ is undirected), the average dynamics may be expressed as

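$$\begin{aligned}\dot{\bf y}_{\textup{avg}}(t)&=\frac{1}{N}\sum_{n=1}^{N}\Big(\gamma_{t}\sum_{\ell\in\Omega_{n}}\big({\bf y}_{\ell}(t)-{\bf y}_{n}(t)\big)-\nabla f_{n}({\bf y}_{n}(t))\Big)\\&=-\frac{1}{N}\sum_{n=1}^{N}\Big(\big(\nabla f_{n}({\bf y}_{n}(t))-\nabla f_{n}({\bf y}_{\textup{avg}}(t))\big)+\nabla f_{n}({\bf y}_{\textup{avg}}(t))\Big)\\&=-\frac{1}{N}\sum_{n=1}^{N}\nabla f_{n}({\bf y}_{\textup{avg}}(t))+{\bf r}(t)\\&=-\frac{1}{N}\nabla f({\bf y}_{\textup{avg}}(t))+{\bf r}(t),\end{aligned}\tag{23}$$

where ${\bf r}(t)=-\frac{1}{N}\sum_{n=1}^{N}\Big(\nabla f_{n}({\bf y}_{n}(t))-\nabla f_{n}({\bf y}_{\textup{avg}}(t))\Big)$. The positive constant $1/N$ changes neither the critical points of $f$ nor the direction of the flow; it can be absorbed by the time rescaling $t\mapsto t/N$, and we do so in what follows.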

By Lemma 5 together with Assumptions 3 and 4 we see that ${\bf r}(t)\to 0$ as $t\to\infty$. Recalling Definition 3, solutions to (23) may be viewed as perturbed solutions of the ODE

$$\dot{\bf y}=-\nabla f({\bf y}).\tag{24}$$

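Let $\Lambda$ denote the set of critical points of $f$. Since $f\in C^{2}$, Sard’s theorem implies that $f(\Lambda)$ has empty interior. Moreover, $f$ is a Lyapunov function for $\Lambda$, since $\frac{d}{dt}f({\bf y}(t))=-\|\nabla f({\bf y}(t))\|^{2}$ along solutions of the unperturbed gradient flow. By Theorem 4 (with $V=f$), solutions to (23) converge to the set of critical points of $f$. ∎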

IV Nonconvergence to Saddle Points

IV-A Generalized Problem Setup

It will simplify the presentation and proofs if we consider a slight generalization of the distributed optimization framework. Namely, we will consider the distributed optimization problem as a special case of subspace constrained optimization. To this end, let $M\geq 1$ denote the dimension of the ambient space, let $h:\mathbb{R}^{M}\to\mathbb{R}$ be a $C^{2}$ function, and let $Q\in\mathbb{R}^{M\times M}$ be a positive semidefinite matrix. Consider the following optimization problem

$$\min_{x\in\mathbb{R}^{M}}\ h(x)\tag{25}$$

$$\text{subject to}\quad x\in\mathcal{N}(Q),\tag{26}$$

and the following dynamics for addressing this problem

$$\dot{\bf x}(t)=-\nabla h({\bf x}(t))-\beta_{t}Q{\bf x}(t),\tag{27}$$

where $\beta_{t}$ is some pre-specified weight function of class $C^{1}$ satisfying $\beta_{t}\to\infty$ as $t\to\infty$ .

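Note that, since $Q$ is symmetric, the dynamics (27) may be viewed as $\dot{\bf x}(t)=-\nabla_{x}\left(h({\bf x}(t))+\tfrac{\beta_{t}}{2}{\bf x}(t)^{T}Q{\bf x}(t)\right)$, i.e., as gradient descent on a penalized objective. As $\beta_{t}\to\infty$, ${\bf x}(t)$ is forced towards the constraint set $\mathcal{N}(Q)$.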

Under Assumptions 1–5, (2) is a special case of (27). To see this, first observe that (2) (or rather, (5)) is equivalent to the following ODE after a time change

$$\dot{\bf x}=-\beta_{t}(L\otimes I_{d}){\bf x}-\big(\nabla f_{n}({\bf x}_{n})\big)_{n=1}^{N},$$

where $\beta_{t}\to\infty$. This fits the template of (27) where we let $M=Nd$, let $h:\mathbb{R}^{Nd}\to\mathbb{R}$ be given by the sum function $h(x)=\sum_{n=1}^{N}f_{n}(x_{n})$, and let $Q=L\otimes I_{d}$. (Footnote 3: Note that this differs from (1) in that we permit the arguments of $f_{n}$ to differ.)

Within this generalized framework, we would like to capture Assumption 6. To this end, let $\mathcal{C}=\mathcal{N}(Q)$ ; we say that a point $x^{*}\in\mathcal{C}$ is a critical point of the restricted function $h|_{\mathcal{C}}$ if $\nabla h|_{\mathcal{C}}(x^{*})=0$ , where $\nabla h|_{\mathcal{C}}(x^{*})\in\mathbb{R}^{m}$ is taken with respect to some orthonormal basis of $\mathcal{C}$ , and $m=\dim\mathcal{C}$ . Let $\nabla^{2}h|_{\mathcal{C}}(x^{*})\in\mathbb{R}^{m\times m}$ denote the Hessian of $h|_{\mathcal{C}}$ taken with respect to some orthonormal basis of $\mathcal{C}$ . We say that $x^{*}$ is a nondegenerate saddle point of $h|_{\mathcal{C}}$ if $\det\nabla^{2}h|_{\mathcal{C}}(x^{*})\not=0$ , and $\nabla^{2}h|_{\mathcal{C}}(x^{*})$ has at least one positive and one negative eigenvalue.
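For the distributed special case $Q=L\otimes I_{d}$, the restricted Hessian can be computed explicitly. The sketch below is a hedged illustration: the helper `nullspace_basis`, the complete graph on three agents, and the local Hessians at the origin are all assumptions. It builds an orthonormal basis $B$ of $\mathcal{N}(Q)$ and forms $B^{T}\nabla^{2}h(0)B$.

```python
import numpy as np

def nullspace_basis(Q, tol=1e-10):
    """Orthonormal basis (as columns) of N(Q) for a symmetric PSD matrix Q."""
    eigvals, eigvecs = np.linalg.eigh(Q)
    return eigvecs[:, np.abs(eigvals) < tol]

num_agents, dim = 3, 2
# Laplacian of the complete graph on 3 nodes, and Q = L kron I_d.
L_complete = num_agents * np.eye(num_agents) - np.ones((num_agents, num_agents))
Q = np.kron(L_complete, np.eye(dim))

# Hypothetical local Hessians of f_n at the origin; note that f_2 is convex there.
local_hessians = [np.diag([-3.0, 1.0]), np.diag([1.0, 1.0]), np.diag([-1.0, 1.0])]

# Hessian of h(x) = sum_n f_n(x_n) is block diagonal in the stacked coordinates.
H = np.zeros((num_agents * dim, num_agents * dim))
for n, H_n in enumerate(local_hessians):
    H[n * dim:(n + 1) * dim, n * dim:(n + 1) * dim] = H_n

B = nullspace_basis(Q)                   # spans the consensus subspace
H_restricted = B.T @ H @ B               # Hessian of h restricted to N(Q)
eigs = np.linalg.eigvalsh(H_restricted)
p = int(np.sum(eigs < 0))                # number of negative eigenvalues
print(eigs, p)
```

In this example the restricted Hessian is orthogonally similar to $\frac{1}{N}\sum_{n}\nabla^{2}f_{n}(0)$, so it has one negative and one positive eigenvalue and the origin is a nondegenerate saddle of $h|_{\mathcal{C}}$ with $p=1$, even though one of the local Hessians is positive definite.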

The following theorem demonstrates the existence of stable manifolds for (27).

Theorem 6.

Suppose $h\in C^{2}$, $\beta_{t}\in C^{1}$, and $\dim\mathcal{N}(Q)\geq 2$. Suppose 0 is a nondegenerate saddle point of $h|_{\mathcal{C}}$ and let $p$ denote the number of negative eigenvalues of $\nabla^{2}h|_{\mathcal{C}}(0)$. Then for all $t_{0}$ sufficiently large there exists a manifold $S_{t_{0}}\subset\mathbb{R}^{M}$ with dimension $M-p$ such that the following holds: A solution ${\bf x}$ to (27) converges to 0 if and only if ${\bf x}$ is initialized on $S_{t_{0}}$, i.e., ${\bf x}(t_{0})=x_{0}\in S_{t_{0}}$.

Since, under Assumptions 1–5, (2) is a special case of (27) this implies Theorem 2.

IV-B Proof of Theorem 6

(Recenter) By the implicit function theorem, there exists a function $g\in C^{1}([0,\infty);\mathbb{R}^{M})$ such that, for each $\beta\geq 0$, $g(\beta)$ is a critical point of the penalized function $h(x)+\tfrac{\beta}{2}x^{T}Qx$, i.e., $\nabla h(g(\beta))+\beta Qg(\beta)=0$, and $g(\beta)\to 0$ as $\beta\to\infty$.

Letting ${\bf y}(t)={\bf x}(t)-g(\beta_{t})$ we see that ${\bf x}$ is a solution to (27) if and only if ${\bf y}$ is a solution to

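$$\dot{\bf y}=-\nabla h({\bf y}+g(\beta_{t}))-\beta_{t}Q({\bf y}+g(\beta_{t}))-g^{\prime}(\beta_{t})\dot{\beta}_{t},\tag{28}$$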

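where $g^{\prime}$ denotes the vector $(\frac{\partial g_{i}}{\partial\beta})_{i=1}^{M}$. For $t\geq 0$ let

$$A(t):=-\nabla^{2}_{x}\Big(h(x)+\tfrac{\beta_{t}}{2}x^{T}Qx\Big)\Big|_{x=g(\beta_{t})}$$

denote the Jacobian with respect to $y$, evaluated at $y=0$, of the first two terms on the right-hand side of (28), and let $F(y,t):=-\nabla h(y+g(\beta_{t}))-\beta_{t}Q(y+g(\beta_{t}))-A(t)y$ so that we may express (28) as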

$$\dot{\bf y}(t)=A(t){\bf y}(t)+F({\bf y}(t),t)-g^{\prime}(\beta_{t})\dot{\beta}_{t}.\tag{29}$$

(Diagonalize) For each $t\geq 0$, let $U(t)$ be an orthogonal matrix that diagonalizes the symmetric matrix $A(t)$, so that $U(t)A(t)U(t)^{T}=\Lambda(t)$, where $\Lambda(t)$ is diagonal. Since $\beta_{t}\in C^{1}$ we may construct $U(t)$ as a differentiable function of $t$. Changing coordinates again, let ${\bf z}(t)=U(t){\bf y}(t)$ so that ${\bf y}$ is a solution to (29) if and only if ${\bf z}$ is a solution to

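$$\begin{aligned}\dot{\bf z}(t)&=\dot{U}(t){\bf y}(t)+U(t)\dot{\bf y}(t)\\&=\Lambda(t){\bf z}(t)+U(t)F(U(t)^{T}{\bf z}(t),t)+\dot{U}(t)U(t)^{T}{\bf z}(t)-U(t)g^{\prime}(\beta_{t})\dot{\beta}_{t}.\end{aligned}$$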

Letting $\tilde{F}(z,t):=U(t)F(U(t)^{T}z,t)+\dot{U}(t)U(t)^{T}z$, the above is equivalent to

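$$\dot{\bf z}(t)=\Lambda(t){\bf z}(t)+\tilde{F}({\bf z}(t),t)-U(t)g^{\prime}(\beta_{t})\dot{\beta}_{t}.\tag{33}$$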
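At a fixed time $t$ the diagonalizing change of coordinates can be computed with a symmetric eigendecomposition. The sketch below is an illustration under assumptions: the helper name `diagonalize` and the sample matrix are hypothetical. It covers a single time instant only and does not address the smooth dependence of $U(t)$ on $t$ used in the proof.

```python
import numpy as np

def diagonalize(A):
    """Return (U, Lam) with U orthogonal and U @ A @ U.T = Lam diagonal, for symmetric A."""
    eigvals, eigvecs = np.linalg.eigh(A)    # A = eigvecs @ diag(eigvals) @ eigvecs.T
    return eigvecs.T, np.diag(eigvals)      # eigenvalues come out in ascending order

# Arbitrary symmetric example with one negative and one positive eigenvalue.
A = np.array([[-2.0, 1.0],
              [1.0, 0.5]])
U, Lam = diagonalize(A)

y = np.array([0.3, -0.4])
z = U @ y                                   # new coordinates z = U y
print(np.diag(Lam), z)
```

Because `numpy.linalg.eigh` returns eigenvalues in ascending order, the negative (stable) directions appear first on the diagonal of $\Lambda$.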

Note that $F(0,t)=0$ and, since $h\in C^{2}$, $F(y,t)=o(|y|)$ as $y\to 0$ for $t\geq 0$. Consequently, for any $\epsilon>0$ there exists an $r>0$ and $T\geq 0$ such that for all $t\geq T$ and $z,\tilde{z}\in B_{r}(0)$ we have

$$|\tilde{F}(z,t)-\tilde{F}(\tilde{z},t)|\leq\epsilon|z-\tilde{z}|.\tag{34}$$

(Compute Stable Solutions) Let $\lambda_{1}(t),\ldots,\lambda_{M}(t)$ denote the eigenvalues of $\Lambda(t)$ . We may assume the eigenvalues are ordered so each $\lambda_{i}(t)$ varies smoothly in $t$ . For $T$ sufficiently large, the sign of $\lambda_{i}(t)$ remains constant for all $t\geq T$ , for each $i$ . Without loss of generality assume that the first $k<M$ diagonal entries (eigenvalues) of $\Lambda(t)$ are negative and the remaining diagonal entries are positive for all $t$ sufficiently large. Let $\Lambda(t)$ be decomposed as

$$\Lambda(t)=\begin{pmatrix}\Lambda^{s}(t)&0\\0&\Lambda^{u}(t)\end{pmatrix},$$

where $\Lambda^{s}(t)\in\mathbb{R}^{k\times k}$ denotes the ‘stable’ diagonal submatrix and $\Lambda^{u}(t)\in\mathbb{R}^{(M-k)\times(M-k)}$ denotes the ‘unstable’ diagonal submatrix. Let

[TABLE]

By construction we have $\limsup_{t\to\infty}\lambda_{j}(t)<0$ , $j=1,\ldots,k$ . Hence, we may choose an $\alpha>0$ such that $\lambda_{j}(t)<-\alpha<0$ for $j=1,\ldots,k$ and all $t$ sufficiently large. We may also choose constants $\sigma>0$ and $K>0$ such that the following estimates hold

[TABLE]

where $t_{1},t_{2}\geq t_{0}$ . Now, suppose $a^{s}\in\mathbb{R}^{k}$ and consider the integral equation

[TABLE]

where ${\bf u}:[t_{0},\infty)\times\mathbb{R}^{k}\to\mathbb{R}^{M}$ . Note that if $t\mapsto{\bf u}(t,a^{s})$ is continuous and solves (39) then, ${\bf u}(t,a^{s})$ is differentiable and solves (33) with componentwise initialization ${\bf u}_{i}(t_{0},a^{s})=a^{s}_{i}$ for $i=1,\ldots,k$ . This may be verified using the variation of parameters formula [28].

Given $t_{0}\geq 0$ , let

[TABLE]

We remark that $c(t)$ is finite for all $t\geq t_{0}$ and for any $\eta>0$ we may choose $t_{0}$ sufficiently large so that $|c(t)|<\eta$ for all $t\geq t_{0}$ .

Suppose $\varepsilon<\sigma/(6K)$ and let $r$ and $T$ be chosen so that (34) holds for all $t\geq T$ and $|c(t)|\leq r/3$ for all $t\geq T$. By Lemma 7, if $|a^{s}|<r/3$ and $t_{0}\geq T$, then the right-hand side of (39) is a contraction on the space

[TABLE]

equipped with norm $\|\cdot\|_{\infty}$ , where $c(t)$ is defined in (42). Since this space is complete, there exists a unique ${\bf u}(\cdot,a^{s})\in X_{t_{0},a^{s}}$ solving (39).

(Construct Stable Manifold) We now construct the stable set $\mathcal{S}$ corresponding to the ODE (33). Let $t_{0}\geq T$. For each $z_{0}^{s}\in B_{\frac{r}{3}}(0)\subset\mathbb{R}^{k}$ let ${\bf u}(\cdot,z_{0}^{s})$ be the (unique) solution to (39) in $X_{t_{0},z_{0}^{s}}$. For each $t\in[t_{0},\infty)$ define the component map $\psi_{j}:\mathbb{R}\times\mathbb{R}^{k}\to\mathbb{R}$ by

[TABLE]

and let $\psi=(\psi_{j})_{j=k+1}^{M}$ . The stable manifold (with respect to (33)) is given by

[TABLE]

By construction, for any initialization $(t_{0},z_{0}^{s},z_{0}^{u})\in\mathcal{S}$ , the corresponding solution ${\bf z}$ of (33) with ${\bf z}(t_{0})=(z_{0}^{s},z_{0}^{u})$ satisfies ${\bf z}(t)\to 0$ . Moreover, by Lemma 8 we see that $\mathcal{S}$ contains all stable initializations $(t_{0},z_{0})$ . That is, if ${\bf z}$ is a solution to (33) with ${\bf z}(t_{0})=z_{0}$ and ${\bf z}(t)\to 0$ , then $(t_{0},z_{0})\in\mathcal{S}$ .

Having constructed $\mathcal{S}$ (the stable manifold for (33)), the stable manifold for (27), denoted here by $\tilde{\mathcal{S}}$, is obtained by an appropriate change of coordinates, $\tilde{\mathcal{S}}:=\{(t,x)\in\mathbb{R}\times\mathbb{R}^{M}:U(t)(x-g(\beta_{t}))\in\mathcal{S}\}.$

V Conclusion

We have considered the distributed gradient descent dynamics (2) for nonconvex optimization. We showed that the dynamics converge to the set of critical points of the nonconvex objective (Theorem 1). Furthermore, the dynamics may only converge to a saddle point of the objective if initialized from some special low-dimensional stable manifold.
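The qualitative behavior summarized above can be observed in a small simulation. The following is a minimal sketch under stated assumptions: the update rule (each agent takes a consensus step toward its neighbors and a gradient step on its local function) is an assumed standard form of DGD, since (2) is not restated here; the two local objectives, the constant consensus weight `beta`, and the decaying step size `0.2 / sqrt(k + 1)` are illustrative choices that have not been checked against the paper's assumptions. The summed objective $f(x,y)=x^{2}/2+y^{4}/4-y^{2}/2$ has a saddle point at the origin and local minima at $(0,\pm 1)$.

```python
import numpy as np

def grad_f1(p):
    """Gradient of the hypothetical local objective f_1(x, y) = x^2/2 + y^4/4."""
    x, y = p
    return np.array([x, y**3])

def grad_f2(p):
    """Gradient of the hypothetical local objective f_2(x, y) = -y^2/2."""
    x, y = p
    return np.array([0.0, -y])

def run_dgd(x0, grads, adjacency, num_steps=20000, beta=0.3):
    """Assumed DGD form: x_n <- x_n + beta * sum_l (x_l - x_n) - alpha_k * grad f_n(x_n)."""
    X = np.array(x0, dtype=float)            # one row per agent
    A = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(A.sum(axis=1)) - A   # graph Laplacian of the network
    for k in range(num_steps):
        alpha = 0.2 / np.sqrt(k + 1.0)       # decaying gradient step size
        G = np.array([g(X[n]) for n, g in enumerate(grads)])
        X = X - beta * (laplacian @ X) - alpha * G
    return X

# Two agents connected by a single edge, initialized near the saddle in y.
adjacency = [[0, 1], [1, 0]]
x0 = [[1.0, 0.05], [-0.5, 0.02]]
X_final = run_dgd(x0, [grad_f1, grad_f2], adjacency)
print("final agent states:\n", X_final)
print("network average:", X_final.mean(axis=0))
```

In this run the agents reach approximate consensus near one of the local minima rather than the saddle at the origin. Initializations with both $y$-coordinates exactly zero would remain on the line $y=0$ and converge to the saddle, which is the toy analogue of starting on the stable manifold.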

Appendix

This appendix contains some intermediate results required for the proof of Theorem 6.

The following lemma shows that the right-hand side of (39) is a contraction. Before presenting the lemma, we define a few useful quantities. Given $a^{s}\in\mathbb{R}^{k}$, let $\mathcal{T}:X_{t_{0},a^{s}}\to X_{t_{0},a^{s}}$ be given by

[TABLE]

where, for convenience, we suppress the argument $a^{s}$ previously used in ${\bf u}$ .

Lemma 7 ( $\mathcal{T}$ is a contraction).

Let $\sigma$, $\alpha$, and $K$ be chosen so that (37) is satisfied. Let $0<\varepsilon<\frac{\sigma}{6K}$, and let $r$ and $T$ be chosen so that (34) holds and $|c(t)|\leq r/3$ for all $t\geq T$. Let $t_{0}\geq T$ and let $a^{s}\in\mathbb{R}^{k}$ with $|a^{s}|<\frac{r}{3}$. Then $\mathcal{T}:X_{t_{0},a^{s}}\to X_{t_{0},a^{s}}$ is a contraction.

Proof.

First, we claim that if ${\bf u}\in X_{t_{0},a^{s}}$ and $\|{\bf u}\|_{\infty}\leq r$, then $\|\mathcal{T}({\bf u})\|_{\infty}\leq r$. To see this, note that

[TABLE]

where in the last line we use the assumptions made on $|a^{s}|$ , $\varepsilon$ , and $t_{0}$ in the statement of the lemma.

Suppose now that ${\bf u},\hat{\bf u}\in X_{t_{0},a^{s}}$, with $\|{\bf u}\|_{\infty},\|\hat{\bf u}\|_{\infty}\leq r$. Let $M=\|{\bf u}-\hat{\bf u}\|_{\infty}$. For $t\geq t_{0}$ we have

[TABLE]

Given our choice of $\varepsilon<\frac{\sigma}{6K}$ we have $\frac{2\varepsilon K}{\sigma}<\frac{1}{3}<1$; hence, $\mathcal{T}$ is a contraction. ∎

Lemma 8 ( $\mathcal{S}$ contains all stable initializations).

Let $\varepsilon$, $r$, and $T$ be chosen as in the construction of $\mathcal{S}$. Let $a^{s}\in\mathbb{R}^{k}$ with $|a^{s}|<r/3$, let $t_{0}\geq T$, and suppose that ${\bf z}$ is a solution to (33) with ${\bf z}(t_{0})=z_{0}=(z_{0}^{s},z_{0}^{u})$ and $z_{0}^{s}=a^{s}$. If ${\bf z}(t)\to 0$ as $t\to\infty$, then $(t_{0},z_{0})\in\mathcal{S}$.

Proof.

By variation of constants we see that

[TABLE]

where $c={\bf z}(t_{0})+\int_{t_{0}}^{\infty}V^{u}(t_{0},\tau)\left(\tilde{F}({\bf z}(\tau))-U(\tau)g^{\prime}(\tau)\dot{\beta}_{\tau}\right)\,d\tau$. Note that the integral in $c$ converges by (35) and the fact that $\int_{t_{0}}^{\infty}U(\tau)g^{\prime}(\tau)\dot{\beta}_{\tau}\,d\tau<\infty$. Every term on the right-hand side of (60) is uniformly bounded in $t$, except possibly the term $V^{u}(t,t_{0})c$. In particular, if $c_{j}\not=0$ for some $j>k$, then $|V^{u}(t,t_{0})c|\to\infty$. Since the left-hand side of (60) is bounded, the right-hand side is bounded as well; thus all $c_{j}$, $j>k$, must be zero and hence $V^{u}(t,t_{0})c=0$.

This implies that ${\bf u}(\cdot,a^{s})={\bf z}$ is a solution to the integral equation (39) given $a^{s}$ . By Lemma 7 we see that ${\bf u}(t,a^{s})$ is the unique continuous solution of (39) given $a^{s}$ . By the definitions of $\mathcal{S}$ and $\psi$ we thus see that $(t_{0},z_{0})\in\mathcal{S}$ . ∎

Bibliography

[1] M. Rabbat and R. Nowak, “Distributed optimization in sensor networks,” in Proceedings of the 3rd International Symposium on Information Processing in Sensor Networks, 2004, pp. 20–27.
[2] J. Chen and A. H. Sayed, “Diffusion adaptation strategies for distributed optimization and learning over networks,” IEEE Transactions on Signal Processing, vol. 60, no. 8, pp. 4289–4305, 2012.
[3] Y. Cao, W. Yu, W. Ren, and G. Chen, “An overview of recent progress in the study of distributed multi-agent coordination,” IEEE Transactions on Industrial Informatics, vol. 9, no. 1, pp. 427–438, 2012.
[4] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends® in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
[5] S. Kar and B. Swenson, “Clustering with distributed data,” 2019, submitted for publication. Online: https://arxiv.org/abs/1901.00214.
[6] A. Nedić and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, p. 48, 2009.
[7] C. Lee, C. H. Lim, and S. J. Wright, “A distributed quasi-Newton algorithm for empirical risk minimization with nonsmooth regularization,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1646–1655.
[8] P. Di Lorenzo and G. Scutari, “Distributed nonconvex optimization over networks,” in Proceedings of the 6th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), 2015, pp. 229–232.
