Analysis and Design of First-Order Distributed Optimization Algorithms over Time-Varying Graphs

Akhil Sundararajan; Bryan Van Scoy; Laurent Lessard

arXiv:1907.05448·math.OC·February 17, 2020·IEEE Trans. Control. Netw. Syst.

TL;DR

This paper provides a unified analysis framework for first-order distributed optimization algorithms over time-varying graphs and introduces a new algorithm, SVL, with a faster worst-case convergence rate than existing methods.

Contribution

It offers a computationally efficient analysis method and proposes the SVL algorithm, which achieves a faster worst-case convergence rate than all other known algorithms.

Findings

1. Unified analysis yields worst-case linear convergence rate

2. Analysis framework involves a small fixed-size semidefinite program

3. SVL algorithm achieves faster convergence than existing algorithms

Abstract

This work concerns the analysis and design of distributed first-order optimization algorithms over time-varying graphs. The goal of such algorithms is to optimize a global function that is the average of local functions using only local computations and communications. Several different algorithms have been proposed that achieve linear convergence to the global optimum when the local functions are strongly convex. We provide a unified analysis that yields the worst-case linear convergence rate as a function of the condition number of the local functions, the spectral gap of the graph, and the parameters of the algorithm. The framework requires solving a small semidefinite program whose size is fixed; it does not depend on the number of local functions or the dimension of their domain. The result is a computationally efficient method for distributed algorithm analysis that enables the…

Tables

Table 1: Algorithm parameters in the form of (3) for a variety of different distributed optimization algorithms. Algorithms can be tuned by choosing stepsize and overrelaxation parameters $\alpha$ and $\mu$, respectively. Algorithms are organized based on how many internal states they have (columns) and how many variables must be communicated in each iteration (block rows).

	Algorithms with 2 states		Algorithms with 3 states
1 communicated variable	SVL template (present work) See Section 4 for derivation of $(α, β, γ, δ)$	$[\begin{matrix} 1 & β & - α & - γ \\ 0 & 1 & 0 & - 1 \\ \hdashline 1 & 0 & 0 & - δ \\ \hdashline 1 & 0 & 0 & 0 \\ \hdashline 0 & 1 & 0 \end{matrix}]$	EXTRA [23]	$[\begin{matrix} 2 & - 1 & α & - α & - μ \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ \hdashline 1 & 0 & 0 & 0 & 0 \\ \hdashline 1 & - \frac{1}{2} & 0 & 0 & 0 \\ \hdashline 1 & - 1 & α & 0 \end{matrix}]$
1 communicated variable	Exact Diffusion (ExDIFF) [35, 36]	$[\begin{matrix} 2 & - 1 & - α & - μ \\ 1 & 0 & - α & - \frac{1}{2} μ \\ \hdashline 1 & 0 & - \frac{1}{2} μ & 0 \\ \hdashline 1 & 0 & 0 & 0 \\ \hdashline 1 & - 1 & 0 \end{matrix}]$	NIDS [13]	$[\begin{matrix} 2 & - 1 & α & - α & - μ \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ \hdashline 1 & 0 & 0 & 0 & 0 \\ \hdashline 1 & - \frac{1}{2} & \frac{α}{2} & - \frac{α}{2} & 0 \\ \hdashline 1 & - 1 & α & 0 \end{matrix}]$
2 communicated variables	Unified DIGing (uDIG) [8]	$[\begin{matrix} 1 & - α & - α & - μ & 0 \\ 0 & 1 & 0 & 0 & - μ \\ \hdashline 1 & 0 & 0 & 0 & 0 \\ \hdashline 1 & 0 & 0 & 0 & 0 \\ - \frac{L + m}{2} & 1 & 1 & 0 & 0 \\ \hdashline 0 & 1 & 0 \end{matrix}]$	DIGing [15, 19]	$[\begin{matrix} 1 & - α & 0 & 0 & - μ & 0 \\ 0 & 1 & - 1 & 1 & 0 & - μ \\ 0 & 0 & 0 & 1 & 0 & 0 \\ \hdashline 1 & - α & 0 & 0 & - μ & 0 \\ \hdashline 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ \hdashline 0 & 1 & - 1 & 0 \end{matrix}]$
2 communicated variables	Unified EXTRA (uEXTRA) [8]	$[\begin{matrix} 1 & - α & - α & - μ & 0 \\ 0 & 1 & 0 & 0 & - μ \\ \hdashline 1 & 0 & 0 & 0 & 0 \\ \hdashline 1 & 0 & 0 & 0 & 0 \\ - L & 1 & 1 & L μ & 0 \\ \hdashline 0 & 1 & 0 \end{matrix}]$	AugDGM [34]	$[\begin{matrix} 1 & - α & 0 & 0 & - μ & α μ \\ 0 & 1 & - 1 & 1 & 0 & - μ \\ 0 & 0 & 0 & 1 & 0 & 0 \\ \hdashline 1 & - α & 0 & 0 & - μ & α μ \\ \hdashline 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ \hdashline 0 & 1 & - 1 & 0 \end{matrix}]$

Equations

\min_{x\in\mathbb{R}^{d}}\; f(x),\quad\text{where}\quad f(x)\colonequals\frac{1}{n}\sum_{i=1}^{n}f_{i}(x),

x_{i}^{1}

x_{i}^{k+2}

x^{k}\colonequals\begin{bmatrix}x_{1}^{k}\\ \vdots\\ x_{n}^{k}\end{bmatrix}\quad\text{and}\quad x^{\star}=\begin{bmatrix}x_{1}^{\star}\\ \vdots\\ x_{n}^{\star}\end{bmatrix}.

\bigl(\nabla\!f_{i}(y)-\nabla\!f_{i}(y_{\text{opt}})-m\,(y-y_{\text{opt}})\bigr)^{\mathsf{T}}\bigl(\nabla\!f_{i}(y)-\nabla\!f_{i}(y_{\text{opt}})-L\,(y-y_{\text{opt}})\bigr)\leq 0

\begin{bmatrix}x_{i}^{k+1}\\ y_{i}^{k}\\ z_{i}^{k}\end{bmatrix}=\begin{bmatrix}A&B_{u}&B_{v}\\ C_{y}&D_{yu}&D_{yv}\\ C_{z}&D_{zu}&D_{zv}\end{bmatrix}\begin{bmatrix}x_{i}^{k}\\ u_{i}^{k}\\ v_{i}^{k}\end{bmatrix},

u_{i}^{k}=\nabla\!f_{i}(y_{i}^{k}),\qquad v_{i}^{k}=\sum_{j=1}^{n}\mathcal{L}_{ij}^{k}z_{j}^{k},

\sum_{j=1}^{n}(F_{x}x_{j}^{k}+F_{u}u_{j}^{k})=0.

x^{k+1}=\mathbf{prox}_{\lambda f}(x^{k})\colonequals\operatorname*{\arg\min}_{x}\bigl(\lambda f(x)+\tfrac{1}{2}\lVert{x-x^{k}}\rVert^{2}\bigr)

x^{k+1}

\begin{bmatrix}D_{yu}&D_{yv}\\ D_{zu}&D_{zv}\end{bmatrix}=\begin{bmatrix}0&D_{yv}\\ 0&0\end{bmatrix}\quad\text{or}\quad\begin{bmatrix}0&0\\ D_{zu}&0\end{bmatrix}.

x^{1}

x^{k+2}

\left[\begin{array}{c:c:c}A&B_{u}&B_{v}\\ \hdashline C_{y}&D_{yu}&D_{yv}\\ \hdashline C_{z}&D_{zu}&D_{zv}\\ \hdashline F_{x}&F_{u}&\end{array}\right]

s^{0}

x^{k+1}

s^{k+1}

(I-\Pi)\,y^{\star}=0\quad\text{and}\quad 1^{\mathsf{T}}u^{\star}=0.

(I-\Pi)\,z^{\star}=0\quad\text{and}\quad v^{\star}=0.

1^{\mathsf{T}}y^{\star}\ \text{and}\ (I-\Pi)\,u^{\star}\ \text{unconstrained}.

null (A - I) \cap row (C_{y}) \cap null (F_{x}) \neq = {0}

\text{and}\quad\begin{bmatrix}B_{u}\\ D_{yu}\\ D_{zu}\end{bmatrix}\in\operatorname{col}\begin{bmatrix}A-I\\ C_{y}\\ C_{z}\end{bmatrix}.

M_{0}\colonequals\begin{bmatrix}-2mL&L+m\\ L+m&-2\end{bmatrix}\quad\text{and}\quad M_{1}\colonequals\begin{bmatrix}\sigma^{2}-1&1\\ 1&-1\end{bmatrix}.

\Psi^{\mathsf{T}}\left[\begin{array}{cc}A&B_{u}\\ I&0\\ \hdashline C_{y}&D_{yu}\\ 0&I\end{array}\right]^{\mathsf{T}}\left[\begin{array}{cc:c}P&0&0\\ 0&-\rho^{2}P&0\\ \hdashline 0&0&M_{0}\end{array}\right]\left[\begin{array}{cc}A&B_{u}\\ I&0\\ \hdashline C_{y}&D_{yu}\\ 0&I\end{array}\right]\Psi

\left[\begin{array}{ccc}A&B_{u}&B_{v}\\ I&0&0\\ \hdashline C_{y}&D_{yu}&D_{yv}\\ 0&I&0\\ \hdashline C_{z}&D_{zu}&D_{zv}\\ 0&0&I\end{array}\right]^{\mathsf{T}}\left[\begin{array}{cc:c:c}Q&0&0&0\\ 0&-\rho^{2}Q&0&0\\ \hdashline 0&0&M_{0}&0\\ \hdashline 0&0&0&M_{1}\otimes R\end{array}\right]\left[\begin{array}{ccc}A&B_{u}&B_{v}\\ I&0&0\\ \hdashline C_{y}&D_{yu}&D_{yv}\\ 0&I&0\\ \hdashline C_{z}&D_{zu}&D_{zv}\\ 0&0&I\end{array}\right]

\lVert{x_{i}^{k}-x_{i}^{\star}}\rVert\leq c\,\rho^{k}

\lVert{u_{i}^{k}-u_{i}^{\star}}\rVert

V^{k}\colonequals(x^{k}-x^{\star})^{\mathsf{T}}\bigl(\Pi\otimes P+(I-\Pi)\otimes Q\bigr)(x^{k}-x^{\star})

\alpha

\bigl(2\beta-(1-\rho)(\kappa+1)\bigr)(\beta-1+\rho^{2})

\rho^{2}\,\biggl(\frac{\beta-1+\rho^{2}}{\beta-1+\rho}\biggr)\biggl(\frac{2-\eta-2\beta}{2\rho^{2}\beta-(1-\rho^{2})\eta}\biggr)\biggl(\frac{(2\rho^{2}+\eta)\beta-(1-\rho^{2})\eta}{(1+\rho)(\eta-2\eta\rho+2\rho^{2})-(2\rho^{2}+\eta)\beta}\biggr)

\frac{\textrm{d}\sigma^{2}}{\textrm{d}\beta}=0\quad\implies\quad\bigl(\beta\bigl(1-\kappa+2\rho(1+\rho)\bigr)-\eta(1-\rho^{2})\bigr)\bigl(s_{0}+s_{1}\beta+s_{2}\beta^{2}+s_{3}\beta^{3}\bigr)=0,

s_{0}

s_{1}


Full text

Abstract

This work concerns the analysis and design of distributed first-order optimization algorithms over time-varying graphs. The goal of such algorithms is to optimize a global function that is the average of local functions using only local computations and communications. Several different algorithms have been proposed that achieve linear convergence to the global optimum when the local functions are strongly convex. We provide a unified analysis that yields the worst-case linear convergence rate as a function of the condition number of the local functions, the spectral gap of the graph, and the parameters of the algorithm. The framework requires solving a small semidefinite program whose size is fixed; it does not depend on the number of local functions or the dimension of their domain. The result is a computationally efficient method for distributed algorithm analysis that enables the rapid comparison, selection, and tuning of algorithms. Finally, we propose a new algorithm, which we call SVL, that is easily implementable and achieves a faster worst-case convergence rate than all other known algorithms.

1 Introduction

In distributed optimization, a network of agents, such as computing nodes, robots, or mobile sensors, work collaboratively to optimize a global objective. Specifically, each agent $i\in\{1,\dots,n\}$ has access to a local function $f_{i}$ and must minimize the average of all agents’ local functions

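$$\min_{x\in\mathbb{R}^{d}}\; f(x),\quad\text{where}\quad f(x)\colonequals\frac{1}{n}\sum_{i=1}^{n}f_{i}(x),\tag{1}$$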

by querying its local gradient $\nabla\!f_{i}$ , exchanging information with neighboring agents, and performing local computations.

This work aims to study the reliability of distributed optimization algorithms in the presence of a time-varying communication graph. Such a scenario could occur if communication links fail due to interference, mobile agents move out of range, or an adversary is jamming communications.

Distributed optimization is relevant in many application areas. For example, in large-scale machine learning [7, 9], $n$ could represent the number of computing units available for training a large data set. Each $f_{i}$ then denotes the loss function corresponding to the training examples assigned to unit $i$ . Another example is sensor networks [20], where each sensor may have a limited power budget, communication bandwidth, or sensing capability. The goal is to aggregate all local data without having a single point of failure. Other applications include distributed spectrum sensing [2] and resource allocation across geographic regions [21].

Distributed optimization generalizes both average consensus and centralized optimization, as we now explain.

Consensus

If each agent uses the initial value $x_{i}^{0}$ and local objective $f_{i}(x)=\lVert{x-x_{i}^{0}}\rVert^{2}$, distributed optimization reduces to average consensus [27, 29]. The unique optimizer of (1) is then the average of all initial states: $x^{\star}=\frac{1}{n}\sum_{i=1}^{n}x_{i}^{0}$. Using a gossip update of the form $x_{i}^{k+1}=\sum_{j=1}^{n}W_{ij}x_{j}^{k}$ where $W$ is carefully chosen, such methods converge exponentially: $\lVert{x_{i}^{k}-x^{\star}}\rVert\leq\rho^{k}$ with $\rho\in(0,1)$ that depends on $W$ [30]. This is called a linear rate in the optimization community.
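The consensus special case can be checked numerically. The sketch below is illustrative only: the ring graph, the gossip matrix $W=I-\tfrac{1}{3}\mathcal{L}$, and the initial values are assumptions chosen for the example, not taken from the paper. It iterates the gossip update and compares the error with the contraction factor $\lVert W-\Pi\rVert$.

```python
import numpy as np

def ring_laplacian(n):
    """Laplacian of an undirected ring graph on n >= 3 nodes (illustrative choice)."""
    lap = 2.0 * np.eye(n)
    for i in range(n):
        lap[i, (i + 1) % n] -= 1.0
        lap[i, (i - 1) % n] -= 1.0
    return lap

def gossip(x0, W, iters):
    """Iterate x^{k+1} = W x^k and record the largest agent error at each step."""
    x = np.asarray(x0, dtype=float)
    target = x.mean()                      # average of the initial values
    errors = []
    for _ in range(iters):
        errors.append(np.max(np.abs(x - target)))
        x = W @ x
    return x, np.array(errors)

n = 5
W = np.eye(n) - ring_laplacian(n) / 3.0    # assumed gossip matrix, symmetric and doubly stochastic
x0 = np.array([1.0, 4.0, -2.0, 0.5, 6.5])
x_final, errors = gossip(x0, W, iters=60)

# Contraction factor on the disagreement subspace: spectral norm of W - (1/n) 1 1^T
rho = np.linalg.norm(W - np.ones((n, n)) / n, 2)
```

Every agent ends at the average of `x0`, and the recorded errors decay at least as fast as `rho**k` times the initial disagreement.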

Optimization

If $n=1$ or if all $f_{i}$ are identical, we recover the standard centralized optimization setup. Linear convergence can be guaranteed in certain cases. For example, the gradient descent method $x_{i}^{k+1}=x_{i}^{k}-\alpha\nabla\!f_{i}(x_{i}^{k})$ achieves linear convergence if $f_{i}$ is continuously differentiable, smooth, and strongly convex (formally stated in Assumption 1) [16].
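As a small numerical illustration of this linear rate, the sketch below runs gradient descent on a quadratic whose Hessian eigenvalues lie in $[m,L]$. The test function and the stepsize $\alpha=2/(L+m)$ are assumptions made for the example (a standard textbook choice, not a recommendation from the paper); with this stepsize the per-step contraction is $(\kappa-1)/(\kappa+1)$ with $\kappa=L/m$.

```python
import numpy as np

m, L = 1.0, 10.0
H = np.diag([m, 3.0, L])                 # Hessian with eigenvalues in [m, L]

def grad(x):
    """Gradient of the assumed test function f(x) = 0.5 * x^T H x (minimizer at 0)."""
    return H @ x

alpha = 2.0 / (L + m)                    # assumed stepsize
kappa = L / m
rho = (kappa - 1.0) / (kappa + 1.0)      # expected contraction factor

x = np.array([1.0, -2.0, 0.5])
norms = [np.linalg.norm(x)]
for _ in range(50):
    x = x - alpha * grad(x)              # gradient descent step
    norms.append(np.linalg.norm(x))

# Ratio of successive distances to the optimizer
ratios = np.array(norms[1:]) / np.array(norms[:-1])
```

Each entry of `ratios` is bounded by `rho`, which is the sense in which the iterates converge linearly.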

A linear convergence rate for the general case was first achieved by the exact first-order algorithm (EXTRA) [23]. This algorithm requires storing the previous state in memory:

[TABLE]

where $W$ and $\widetilde{W}$ are gossip matrices that satisfy certain technical conditions and $\alpha$ is sufficiently small. Several additional linear-rate algorithms have since been proposed, including AugDGM [34], DIGing [15, 19], Exact Diffusion [35, 36], NIDS [13], and a unified method [8]. Each of these methods has updates similar to (2) in that agents must store previous iterates or gradients.

Although linear convergence rates were obtained for the algorithms above, each algorithm differs in the nature and strength of its convergence analysis guarantees. For example, some works show (non-constructively) the existence of a linear rate [32] whereas others provide specific tuning recommendations with associated analytic rate bounds (which may be conservative) [23, 13]. Numerical simulations are also frequently used [31], but can be misleading because algorithm performance depends on the graph topology, choice of functions, algorithm initialization, and algorithm tuning.

The present work makes an effort to systematize the analysis and design of distributed optimization algorithms. We now summarize our main contributions.

Analysis framework. We present a universal analysis framework that provides an upper bound on the worst-case linear convergence rate $\rho$ of a wide range of distributed algorithms as a function of the parameters $\kappa$ (local function conditioning) and $\sigma$ (network connectedness). Our main result, Theorem 10, is a semidefinite program (SDP) parameterized by $(\kappa,\sigma)$ whose solution yields an upper bound on $\rho$ . The SDP has a small fixed size that does not depend on the number of agents $n$ or the dimension of the function domains and is efficiently solvable. Our SDP yields robust performance guarantees when the graph is allowed to vary (even adversarially) at each iteration. Fig. 2 compares the worst-case linear rate $\rho$ for 8 different algorithms.

Algorithm design. We present a new distributed algorithm, which we name SVL (the authors’ initials). SVL is derived by optimizing the SDP from our analysis framework and provides the fastest known convergence rate to date for this time-varying graph setting. The rate depends explicitly on $\kappa$ and $\sigma$ , so no tuning is required if these parameters are known or estimated in advance. When the graph is well-connected, SVL recovers the performance of gradient descent, which is optimal in this time-varying graph setting.

Worst-case examples. Although our analysis technique only provides upper bounds on the worst-case convergence rate for distributed algorithms, we outline a computationally tractable optimization procedure that finds numerically matching lower bounds by constructing worst-case trajectories, suggesting the bounds found via our analysis technique are tight.

Remark 1 (Accelerated rates).

Distributed algorithms that achieve accelerated [18, 31, 33] or optimal [22] linear rates have also been proposed. It turns out such methods are not guaranteed to achieve acceleration when the graph is time-varying. We discuss this phenomenon in Section 2.5, where we derive lower bounds for the time-varying setting.

The paper is organized as follows. We describe notation and assumptions in Section 2. We state and prove our main result for certifying worst-case rate bounds in Section 3. We present our SVL algorithm and discuss interpretations in Section 4. Finally, we demonstrate the tightness of our bounds by generating worst-case trajectories in Section 5.

2 Preliminaries

2.1 Notation

Let $I_{n}$ be the identity matrix in $\mathbb{R}^{n\times n}$ . The symbol $1_{n}$ denotes the column vector of all ones in $\mathbb{R}^{n}$ . $\Pi\colonequals\frac{1}{n}1_{n}1_{n}^{\mathsf{T}}$ is the projection matrix onto $1_{n}$ . We will sometimes omit subscripts when dimensions are clear from context. Unless otherwise indicated, Greek letters denote scalar parameters, lower-case letters denote column vectors, and upper-case letters denote matrices. Exceptions include the scalars $m$ and $L$ , which we use in Assumption 1 to conform with convention. The symbol $\otimes$ denotes the Kronecker matrix product. $\lVert{x}\rVert$ denotes the standard Euclidean norm of a vector $x$ , and $\lVert{A}\rVert\colonequals\sup_{x\neq 0}\lVert{Ax}\rVert/\lVert{x}\rVert$ is the spectral norm of a matrix $A$ . Unless otherwise indicated, subscripts refer to individual agents while superscripts refer to iteration count. For brevity, we write the symmetric quadratic form $x^{\mathsf{T}}Qx$ as $\begin{bmatrix}\star\end{bmatrix}^{\mathsf{T}}Qx$ .

Define the graph $\mathcal{G}\colonequals(\mathcal{V},\mathcal{E})$ where $\mathcal{V}\colonequals\{1,\dots,n\}$ is the set of agents and $\mathcal{E}$ is the set of pairs of agents $(i,j)$ that are connected. $\mathcal{L}\in\mathbb{R}^{n\times n}$ is a Laplacian matrix associated with $\mathcal{G}$ if $\mathcal{L}1_{n}=0$ and $\mathcal{L}_{ij}=0$ if $(i,j)\notin\mathcal{E}$ . The spectral gap of $\mathcal{L}$ is defined as the second-smallest eigenvalue magnitude of $\mathcal{L}$ . Since we consider time-varying graphs, we let $\mathcal{L}^{k}$ denote a Laplacian matrix associated with $\mathcal{G}^{k}$ . We denote a symbol on agent $i$ at iteration $k$ by $x_{i}^{k}$ along with its associated fixed point $x_{i}^{\star}$ . For all such symbols, we denote their aggregation over all agents as

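$$x^{k}\colonequals\begin{bmatrix}x_{1}^{k}\\ \vdots\\ x_{n}^{k}\end{bmatrix}\quad\text{and}\quad x^{\star}=\begin{bmatrix}x_{1}^{\star}\\ \vdots\\ x_{n}^{\star}\end{bmatrix}.$$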

We denote the associated local and global error coordinates as $\tilde{x}_{i}^{k}\colonequals x_{i}^{k}-x_{i}^{\star}$ and $\tilde{x}^{k}\colonequals x^{k}-x^{\star}$ , respectively.

2.2 Function and Graph Assumptions

We assume that the local function gradients satisfy the following sector bound.

Assumption 1.

Given $0<m\leq L$, the local objective functions $f_{i}$ are continuously differentiable and each satisfy

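$$\bigl(\nabla\!f_{i}(y)-\nabla\!f_{i}(y_{\text{opt}})-m\,(y-y_{\text{opt}})\bigr)^{\mathsf{T}}\bigl(\nabla\!f_{i}(y)-\nabla\!f_{i}(y_{\text{opt}})-L\,(y-y_{\text{opt}})\bigr)\leq 0$$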

for all $y\in\mathbb{R}^{d}$ , where $y_{\text{opt}}$ satisfies $\sum_{i=1}^{n}\nabla\!f_{i}(y_{\text{opt}})=0$ .

Remark 2.

One way to satisfy Assumption 1 is if the local functions $f_{i}$ have $L$-Lipschitz continuous gradients and are $m$-strongly convex, though in general, Assumption 1 is much weaker.

We define the condition ratio as $\kappa\colonequals L/m$ . This quantity captures how much the curvature of the objective function varies. If $f$ is twice differentiable, $\kappa$ is an upper bound on the condition number of the Hessian $\nabla^{2}f$ . In general, as $\kappa\to\infty$ , the functions become poorly conditioned and more difficult to optimize using first-order methods.
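The sector bound of Assumption 1 can be checked numerically for simple functions. The sketch below uses hypothetical scalar quadratics $f_{i}(y)=\tfrac{1}{2}a_{i}(y-b_{i})^{2}$ with $a_{i}\in[m,L]$ (the values of `a` and `b` are made up for illustration). For these functions the left-hand side of the bound equals $(a_{i}-m)(a_{i}-L)(y-y_{\text{opt}})^{2}$, which is never positive.

```python
import numpy as np

m, L = 1.0, 4.0
a = np.array([1.0, 2.5, 4.0])        # hypothetical curvatures, all in [m, L]
b = np.array([-1.0, 0.0, 3.0])       # hypothetical local minimizers

def grad_i(i, y):
    """Gradient of the assumed local function f_i(y) = 0.5 * a_i * (y - b_i)^2."""
    return a[i] * (y - b[i])

# Global optimizer: the point where the local gradients sum to zero
y_opt = np.sum(a * b) / np.sum(a)

def sector_value(i, y):
    """Left-hand side of the sector bound in Assumption 1 (scalar case)."""
    d = grad_i(i, y) - grad_i(i, y_opt)
    return (d - m * (y - y_opt)) * (d - L * (y - y_opt))

rng = np.random.default_rng(0)
samples = rng.uniform(-10.0, 10.0, size=200)
worst = max(sector_value(i, y) for i in range(3) for y in samples)
kappa = L / m                        # condition ratio
```

Note that the bound is anchored at the global optimizer `y_opt`, not at the minimizer `b[i]` of each local function.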

The graph associated with the network of agents can change at each step of the algorithm, so we assume the following about the sequence of graph Laplacian matrices $\{\mathcal{L}^{k}\}$ .

Assumption 2.

The following properties hold at each step of the algorithm.

1. The graph is connected: there always exists a path between any two nodes in $\mathcal{G}^{k}$. This implies that the zero eigenvalue of $\mathcal{L}^{k}$ has a multiplicity of one for all $k$.

2. The graph is balanced: every node has equal in-degree and out-degree. This means that $1_{n}^{\mathsf{T}}\mathcal{L}^{k}=0$ for all $k$.

3. The spectral gap of the time-varying graph is uniformly bounded. In particular, we assume there exists $\sigma\in[0,1)$ such that $\lVert{I-\Pi-\mathcal{L}^{k}}\rVert\leq\sigma$ for all $k$. Since the spectral radius of a matrix is always upper-bounded by its spectral norm, this implies that $\sigma$ is a uniform bound on the spectral gap of each Laplacian matrix in $\{\mathcal{L}^{k}\}$.
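The three properties in Assumption 2 can be tested for a candidate Laplacian with a few lines of linear algebra. This is a minimal sketch, not part of the paper: the helper name `check_assumption_2`, the scaled ring Laplacian, and the use of the zero-eigenvalue multiplicity as a proxy for connectedness are all assumptions made for illustration.

```python
import numpy as np

def ring_laplacian(n):
    """Laplacian of an undirected ring graph on n >= 3 nodes (illustrative choice)."""
    lap = 2.0 * np.eye(n)
    for i in range(n):
        lap[i, (i + 1) % n] -= 1.0
        lap[i, (i - 1) % n] -= 1.0
    return lap

def check_assumption_2(lap, tol=1e-9):
    """Return (connected, balanced, sigma) for one candidate Laplacian matrix."""
    n = lap.shape[0]
    ones = np.ones(n)
    proj = np.outer(ones, ones) / n                      # Pi = (1/n) 1 1^T
    # Balanced: L 1 = 0 and 1^T L = 0
    balanced = bool(np.allclose(lap @ ones, 0.0, atol=tol)
                    and np.allclose(ones @ lap, 0.0, atol=tol))
    # Proxy for connectedness: the zero eigenvalue has multiplicity one
    eigs = np.linalg.eigvals(lap)
    connected = int(np.sum(np.abs(eigs) < 1e-8)) == 1
    # sigma: spectral norm of I - Pi - L
    sigma = np.linalg.norm(np.eye(n) - proj - lap, 2)
    return connected, balanced, sigma

connected, balanced, sigma = check_assumption_2(ring_laplacian(5) / 3.0)
```

For a time-varying sequence $\{\mathcal{L}^{k}\}$, the uniform bound is the largest `sigma` over all matrices in the sequence, and the assumption requires that value to be strictly less than one.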

Remark 3.

The assumption that $\mathcal{G}^{k}$ must be connected for all $k$ is a strong assumption. Works that consider directed or time-varying graphs typically make weaker assumptions, such as a joint spectrum property or $B$-connectedness [15]. Nevertheless, our setting (which is equivalent to $B$-connectedness with $B=1$) is still weaker than assuming a constant graph. Indeed, NIDS [13] converges for any $\sigma$ when the graph is constant, but in Section 5.2, we construct a sequence of graphs that drives NIDS to instability.

2.3 Algorithm Form

In this paper, we consider the broad class of distributed optimization algorithms that satisfy the algebraic equations

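$$\begin{bmatrix}x_{i}^{k+1}\\ y_{i}^{k}\\ z_{i}^{k}\end{bmatrix}=\begin{bmatrix}A&B_{u}&B_{v}\\ C_{y}&D_{yu}&D_{yv}\\ C_{z}&D_{zu}&D_{zv}\end{bmatrix}\begin{bmatrix}x_{i}^{k}\\ u_{i}^{k}\\ v_{i}^{k}\end{bmatrix},\tag{3a}$$

$$u_{i}^{k}=\nabla\!f_{i}(y_{i}^{k}),\qquad v_{i}^{k}=\sum_{j=1}^{n}\mathcal{L}_{ij}^{k}z_{j}^{k},\tag{3b}$$

$$\sum_{j=1}^{n}(F_{x}x_{j}^{k}+F_{u}u_{j}^{k})=0.\tag{3c}$$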

Equation (3a) describes how agent $i$ ’s state $x_{i}^{k}$ evolves with iteration $k$ . The local gradient $\nabla\!f_{i}$ is evaluated at $y_{i}^{k}$ and the quantity $z_{i}^{k}$ is transmitted to neighboring agents in (3b). Finally, we allow for linear state-input invariants to be enforced in (3c). Such invariants typically arise from requiring a particular initialization for the algorithm.

The matrices $A$ , $D_{yu}$ , and $D_{zv}$ are square, and the other matrices have compatible dimensions. The dimension of $A$ is the number of local states on each agent, the dimension of $D_{yu}$ is one, and the dimension of $D_{zv}$ is the number of variables that each agent transmits with neighbors at each iteration.

Remark 4 (Dimension reduction).

To simplify notation, we assume the objective function is one-dimensional ( $d=1$ ). We can recover the general $d$ case by replacing each scalar symbol with a $1\times d$ row vector (e.g., $u_{i}^{k}\in\mathbb{R}^{1\times d}$ ) and interpreting each local gradient $\nabla\!f_{i}$ as a map from $\mathbb{R}^{1\times d}$ to $\mathbb{R}^{1\times d}$ .

Remark 5 (Implementation).

Not all instances of (3) are efficiently implementable. For example, if ${D_{yu}\neq 0}$ , then $y_{i}^{k}$ depends on $u_{i}^{k}$ , which then depends on $y_{i}^{k}$ . Such circular dependencies arise naturally in proximal algorithms, where an inner optimization problem must be solved at each iteration. For instance, given a convex differentiable $f$ and parameter $\lambda>0$ , the proximal algorithm

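$$x^{k+1}=\mathbf{prox}_{\lambda f}(x^{k})\colonequals\operatorname*{\arg\min}_{x}\bigl(\lambda f(x)+\tfrac{1}{2}\lVert{x-x^{k}}\rVert^{2}\bigr)$$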

satisfies the optimality condition $\lambda\nabla\!f(x^{k+1})+x^{k+1}-x^{k}=0$ and can therefore be expressed in the form of (3) as follows:

[TABLE]
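To make the circular dependency concrete: writing $y^{k}=x^{k+1}$ and $u^{k}=\nabla\!f(y^{k})$, the optimality condition gives $y^{k}=x^{k}-\lambda u^{k}$, so the point $y^{k}$ depends on the gradient evaluated at $y^{k}$ itself. The sketch below resolves this in closed form for a hypothetical scalar quadratic $f(x)=\tfrac{1}{2}a(x-b)^{2}$ (an assumption for illustration; a general $f$ would need an inner solver) and checks the optimality condition at every step.

```python
def prox_quadratic(x, lam, a, b):
    """prox_{lam f}(x) for the assumed f(x) = 0.5 * a * (x - b)^2, in closed form.

    Solves lam * a * (p - b) + p - x = 0 for p.
    """
    return (x + lam * a * b) / (1.0 + lam * a)

def grad_f(x, a, b):
    """Gradient of the assumed quadratic."""
    return a * (x - b)

lam, a, b = 0.5, 3.0, 2.0            # illustrative values
x = 10.0
max_residual = 0.0
for _ in range(100):
    x_next = prox_quadratic(x, lam, a, b)
    # Optimality condition from the text: lam * grad f(x^{k+1}) + x^{k+1} - x^k = 0
    residual = lam * grad_f(x_next, a, b) + x_next - x
    max_residual = max(max_residual, abs(residual))
    x = x_next
```

Each step shrinks the distance to the minimizer `b` by the factor `1 / (1 + lam * a)`, so the iterates converge linearly for this example.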

In the forthcoming analysis, we treat implementability and analysis separately. That is, we derive convergence rate bounds for general algorithms of the form (3), regardless of whether they can be efficiently implemented. However, we note that a sufficient condition for avoiding circular dependencies is if the feedthrough term satisfies

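$$\begin{bmatrix}D_{yu}&D_{yv}\\ D_{zu}&D_{zv}\end{bmatrix}=\begin{bmatrix}0&D_{yv}\\ 0&0\end{bmatrix}\quad\text{or}\quad\begin{bmatrix}0&0\\ D_{zu}&0\end{bmatrix}.\tag{4}$$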

Putting a distributed optimization algorithm into the form of (3) is a straightforward algebraic exercise, which we now demonstrate for two recently proposed algorithms. These algorithms are parameterized by a stepsize $\alpha$ and a gossip matrix $W$ . To relate the gossip matrix to the Laplacian matrix, we set $W=I-\mu\mathcal{L}$ for some scalar $\mu\neq 0$ . This provides an additional tuning parameter, and is akin to the method of successive overrelaxation used in the numerical solutions of linear systems of equations [17].

EXTRA.

The EXTRA algorithm (2) has a state that depends on two previous timesteps. Using the authors’ recommendation of $\widetilde{W}=\tfrac{1}{2}(I+W)$ together with $W=I-\mu\mathcal{L}^{k}$ , the equations become

[TABLE]

Define the state $(x^{k+1},x^{k},\nabla\!f(x^{k}))$ . The outputs are now functions of the state: $y^{k}\colonequals x^{k+1}$ and $z^{k}\colonequals x^{k+1}-\tfrac{1}{2}x^{k}$ . Finally, summing across agents (left-multiplying by $1^{\mathsf{T}}$ ) and using $1^{\mathsf{T}}\mathcal{L}^{k}=0$ , we find that $1^{\mathsf{T}}\left(x^{k+1}-x^{k}+\alpha\nabla\!f(x^{k})\right)$ is independent of $k$ , and identically zero thanks to how $x^{1}$ is initialized. The parameters that characterize EXTRA are shown below and in Table 1.

[TABLE]

DIGing.

The DIGing algorithm [15, 19] is an example of a gradient tracking algorithm. It begins with an arbitrary $x^{0}$ and has two update equations:

[TABLE]

Using the authors’ recommendation of $\widetilde{W}=W$, defining $W=I-\mu\mathcal{L}^{k}$ as before, and defining the state as $(x^{k},s^{k},\nabla\!f(x^{k}))$, we find that the output is $y^{k}\colonequals x^{k+1}$, that two quantities must be communicated between agents, $z^{k}\colonequals(x^{k},s^{k})$, and that the invariant is $1^{\mathsf{T}}(s^{k}-\nabla\!f(x^{k}))=0$. The parameters that characterize DIGing are shown in Table 1.

A similar derivation can be applied to a variety of algorithms. Table 1 summarizes the parameterizations for 8 recently proposed algorithms.
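A parameterization in the form of (3) can be simulated directly. The sketch below is a minimal simulator for the scalar case $d=1$ that assumes all feedthrough terms are zero, instantiated with the EXTRA entries of Table 1. The local quadratics, the ring graph, and the values of $\alpha$ and $\mu$ are illustrative assumptions, and the initialization $x^{1}=Wx^{0}-\alpha\nabla\!f(x^{0})$ is taken from the original EXTRA paper rather than from the text above.

```python
import numpy as np

def ring_laplacian(n):
    """Laplacian of an undirected ring graph on n >= 3 nodes (illustrative choice)."""
    lap = 2.0 * np.eye(n)
    for i in range(n):
        lap[i, (i + 1) % n] -= 1.0
        lap[i, (i - 1) % n] -= 1.0
    return lap

def simulate(A, Bu, Bv, Cy, Cz, grads, laplacian_at, x_init, iters):
    """Run form (3) with d = 1, assuming every feedthrough term D_* is zero.

    x_init has one row of internal states per agent; laplacian_at(k) returns
    the n-by-n Laplacian used at iteration k.
    """
    x = np.array(x_init, dtype=float)
    n = x.shape[0]
    for k in range(iters):
        y = x @ Cy                                   # gradient evaluation points
        z = x @ Cz                                   # transmitted quantities
        u = np.array([grads[i](y[i]) for i in range(n)])
        v = laplacian_at(k) @ z                      # v_i = sum_j L_ij z_j
        x = x @ A.T + np.outer(u, Bu) + np.outer(v, Bv)
    return x

# EXTRA entries from Table 1, with illustrative (untuned) alpha and mu
alpha, mu = 0.1, 1.0 / 3.0
A = np.array([[2.0, -1.0, alpha], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
Bu = np.array([-alpha, 0.0, 1.0])
Bv = np.array([-mu, 0.0, 0.0])
Cy = np.array([1.0, 0.0, 0.0])
Cz = np.array([1.0, -0.5, 0.0])
Fx = np.array([1.0, -1.0, alpha])

# Hypothetical local functions f_i(y) = 0.5 * a_i * (y - b_i)^2
a = np.array([1.0, 1.5, 2.0, 1.2, 1.8])
b = np.array([-1.0, 0.0, 3.0, 2.0, 1.0])
grads = [lambda y, i=i: a[i] * (y - b[i]) for i in range(5)]
y_opt = np.sum(a * b) / np.sum(a)

lap = ring_laplacian(5)
x0 = np.zeros(5)
g0 = a * (x0 - b)
x1 = x0 - mu * (lap @ x0) - alpha * g0               # assumed EXTRA initialization
x_init = np.column_stack([x1, x0, g0])               # state (x^{k+1}, x^k, grad f(x^k))

final = simulate(A, Bu, Bv, Cy, Cz, grads, lambda k: lap, x_init, iters=3000)
```

Here the graph is held constant so that convergence follows from known results for EXTRA. Passing a `laplacian_at` that changes with `k` gives the time-varying setting; whether a given tuning still converges there is the question the analysis in Section 3 addresses.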

2.4 Existence of a Fixed Point

Not all instances of algorithm (3) solve the distributed optimization problem (1). For an algorithm to be valid, (i) there must exist a fixed point corresponding to the optimal solution, and (ii) the iterates must converge to the fixed point. We address convergence to a fixed point in our main result of Section 3. In this section, however, we provide simple conditions for verifying the existence of such a fixed point.

A distributed algorithm of the form (3) has a fixed point $(x^{\star},y^{\star},z^{\star},u^{\star},v^{\star})$ corresponding to the optimal solution of (1) for all functions satisfying Assumption 1 and all graphs satisfying Assumption 2 if the following conditions hold.

[TABLE]

• Consensus and Optimality: All agents must achieve consensus on the point at which the gradient is evaluated, and the point must be a stationary (first-order optimal) point of $f$. This means that the fixed point must satisfy $y_{1}^{\star}=\ldots=y_{n}^{\star}$ and $u_{1}^{\star}+\dots+u_{n}^{\star}=0$, or in vector form,

$$(I-\Pi)\,y^{\star}=0\quad\text{and}\quad 1^{\mathsf{T}}u^{\star}=0.$$

• Robustness to Graph: The fixed point must not depend on the sequence of graphs $\{\mathcal{L}^{k}\}$, so $z_{1}^{\star}=\ldots=z_{n}^{\star}$ and $v_{1}^{\star}=\dots=v_{n}^{\star}=0$, or in vector form,

$$(I-\Pi)\,z^{\star}=0\quad\text{and}\quad v^{\star}=0.$$

• Robustness to Functions: The fixed point must satisfy $y_{1}^{\star}=\ldots=y_{n}^{\star}=y_{\text{opt}}$ and $u_{i}^{\star}=\nabla\!f_{i}(y_{\text{opt}})$, where $y_{\text{opt}}$ is the optimizer of (1). For these to hold for any objective function $f$, we need

$$1^{\mathsf{T}}y^{\star}\ \text{and}\ (I-\Pi)\,u^{\star}\ \text{unconstrained}.$$

The following proposition characterizes algorithms with such a fixed point, which we prove in Appendix A.1.

Proposition 6 (Existence of fixed point).

An algorithm of the form (3) has a fixed point $(x^{\star},y^{\star},z^{\star},u^{\star},v^{\star})$ that satisfies the conditions in (5) if and only if

[TABLE]

Here, “null”, “col”, and “row” denote the nullspace, column space, and row space, respectively. Both EXTRA and DIGing as derived above satisfy the conditions in (6) and therefore have a fixed point corresponding to the optimal solution of (1).

Remark 7.

Proposition 6 guarantees that any instance of algorithm (3) satisfying (5) has a desirable fixed point in the presence of a time-varying graph; all agents agree on a common stationary point of (1). However, Proposition 6 does not ensure that the algorithm necessarily converges to this fixed point, nor does it characterize the rate of convergence. These questions will be explored in Section 3.

2.5 Lower Bounds on Worst-Case Convergence Rates

We now construct simple lower bounds on the worst-case asymptotic convergence rate of the iterates for any valid algorithm of the form (3). We do so by separately considering the two specific instances discussed in Section 1.

Consensus

Consider the scalar local quadratic functions $f_{i}(y)=\tfrac{L}{2}\,(y-r_{i})^{2}$ . Then Assumption 1 holds with $m=L$ and $y_{\text{opt}}=\tfrac{1}{n}\sum_{i=1}^{n}r_{i}$ .

Optimization

Consider the case $n=1$ . For the graph to satisfy Assumption 2, the Laplacian matrix must be $\mathcal{L}^{k}=0$ , which has spectral gap $\sigma=0$ .

In both cases above, the algorithm reduces to a linear system in feedback with sector-bounded nonlinearity: in the sector $(1-\sigma,1+\sigma)$ for consensus and $(m,L)$ for optimization. Further, the linear part of the system is strictly proper (since the algorithm is implementable) and must contain an integrator (due to the fixed-point conditions). Then using the lower bound for such systems in [12], we obtain the following.

Proposition 8.

There does not exist an algorithm of the form (3) that satisfies the implementability conditions (4) and fixed-point conditions (6) and such that, for all objective functions and Laplacian matrices satisfying Assumptions 1 and 2, there exists a constant $c>0$ such that the bound $\|x_{i}^{k}-y_{\text{opt}}\|\leq c\,\rho_{\text{lb}}^{k}$ holds for all agents $i\in\{1,\ldots,n\}$ and all iterations $k\geq 0$ , where $\rho_{\text{lb}}=\max\bigl{\{}\tfrac{\kappa-1}{\kappa+1},\,\sigma\bigr{\}}$ .

Remark 9 (Accelerated rates).

These lower bounds, which are achieved by ordinary gradient descent, imply that accelerated algorithms such as the recently proposed SSDA [22] or distributed versions of heavy-ball [33] or Nesterov acceleration [18, 31] do not in fact achieve accelerated rates in the worst case in our time-varying setting.

3 Main Result

Our main theorem, Theorem 10, consists of a small convex semidefinite program (SDP) whose feasibility guarantees the linear convergence of a distributed algorithm in the form of (3). The algorithm parameters, problem data $(\kappa,\sigma)$ , and candidate linear rate $\rho$ all appear as parameters in the SDP. Furthermore, the SDP has a fixed size that does not depend on $n$ (the number of agents) or $d$ (the dimension of the domain of $f$ ) and can thus be efficiently solved using a variety of established solvers.

Theorem 10 (Analysis result).

Consider the distributed optimization problem (1) solved using algorithm (3). Suppose Assumptions 1 and 2 hold and further assume the algorithm satisfies the fixed point conditions (6). Define the matrices

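$$M_{0}\colonequals\begin{bmatrix}-2mL&L+m\\ L+m&-2\end{bmatrix}\quad\text{and}\quad M_{1}\colonequals\begin{bmatrix}\sigma^{2}-1&1\\ 1&-1\end{bmatrix}.$$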

Let $\Psi$ be a matrix whose columns form a basis for the nullspace of $\begin{bmatrix}F_{x}&F_{u}\end{bmatrix}$ . If there exist $P\succ 0$ , $Q\succ 0$ , and $R\succeq 0$ of appropriate sizes such that

[TABLE]

then there exists a constant $c>0$ independent of $i$ and $k$ such that for all agents $i\in\{1,\dots,n\}$ and all iterations $k\geq 0$ ,

$$\lVert{x_{i}^{k}-x_{i}^{\star}}\rVert\leq c\,\rho^{k}$$

for some fixed point $(x_{i}^{\star},y_{i}^{\star},z_{i}^{\star},u_{i}^{\star},v_{i}^{\star})$ that satisfies (5).

For fixed algorithm parameters $A,B_{u},B_{v},C_{y},C_{z},D_{yu},D_{yv},D_{zu},D_{zv},F_{x},F_{u}$, function parameters $m$ and $L$, graph parameter $\sigma$, and candidate rate $\rho$, the SDP (7) is a linear matrix inequality (LMI) in the variables $(P,Q,R)$, and therefore convex. Indeed, (7a) and (7b) are decoupled and their feasibility may be checked separately. To find the best (smallest) upper bound, we observe that feasibility of (7) for some $\rho_{0}$ implies feasibility for all $\rho\geq\rho_{0}$. A bisection search on $\rho$ is then guaranteed to find the minimal $\rho$, even though (7) is not jointly convex in $(P,Q,R,\rho)$. While our result is only a sufficient condition for convergence, we provide empirical evidence in Section 5.2 that suggests that it is in fact tight.
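The bisection idea can be sketched on the simplest possible instance. The code below is not the full SDP (7): it assumes a single agent running gradient descent ($A=1$, $B_{u}=-\alpha$, $C_{y}=1$, $D_{yu}=0$, no invariant, so $\Psi=I$), takes $P=p>0$ to be a scalar, assumes the LMI has the form "matrix $\preceq 0$" (the sign that the Lyapunov argument requires), and replaces an SDP solver with a grid search over $p$. The function names and grid are hypothetical.

```python
import numpy as np

def lmi_feasible(rho, alpha, m, L, p_grid):
    """Grid-search feasibility of the scalar LMI for x+ = x - alpha * grad f(x).

    For scalar P = p the matrix is
        p * [[1 - rho^2, -alpha], [-alpha, alpha^2]] + M0,
    and it must be negative semidefinite for some p > 0.
    """
    N = np.array([[1.0 - rho**2, -alpha], [-alpha, alpha**2]])
    M0 = np.array([[-2.0 * m * L, L + m], [L + m, -2.0]])
    stack = p_grid[:, None, None] * N + M0           # one 2x2 matrix per candidate p
    max_eigs = np.linalg.eigvalsh(stack)[:, -1]      # largest eigenvalue of each matrix
    return bool(np.any(max_eigs <= 1e-9))

def bisect_rate(alpha, m, L, tol=1e-4):
    """Bisection on rho, relying on feasibility being monotone in rho."""
    p_grid = np.logspace(-3, 3, 6001)
    lo, hi = 0.0, 1.0
    if not lmi_feasible(hi, alpha, m, L, p_grid):
        return None                                  # no rate certified on this grid
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if lmi_feasible(mid, alpha, m, L, p_grid):
            hi = mid
        else:
            lo = mid
    return hi

m, L = 1.0, 2.0
rate = bisect_rate(2.0 / (L + m), m, L)
```

With $m=1$, $L=2$, and $\alpha=2/(L+m)$, the certified rate is close to $(\kappa-1)/(\kappa+1)=1/3$, up to the resolution of the grid. For the distributed case, the grid search would be replaced by an SDP solver over $(P,Q,R)$ while the outer bisection stays the same.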

Remark 11.

Our main theorem provides conditions under which the state converges to a fixed point linearly with rate $\rho$ . However, when the algorithm also satisfies the conditions in (4) for being efficiently implementable, then under the conditions of Theorem 10, there exist constants $c_{u}$ , $c_{v}$ , $c_{y}$ , and $c_{z}$ such that for all agents $i$ and all iterations $k$ ,

[TABLE]

for some fixed point $(x_{i}^{\star},y_{i}^{\star},z_{i}^{\star},u_{i}^{\star},v_{i}^{\star})$ that satisfies (5). In particular, the output sequence $y_{i}^{k}$ of each agent converges to the optimizer $y_{\text{opt}}$ of (1) linearly with rate $\rho$ .

The core idea behind Theorem 10 is to posit a quadratic Lyapunov candidate of the form

[TABLE]

for some appropriate choice of $P,Q\succ 0$ . Feasibility of (7) can be shown to imply $V^{k+1}\leq\rho^{2}V^{k}$ , which ensures linear convergence of the distributed optimization algorithm when $\rho<1$ . A preliminary (and less concise) version of Theorem 10 appeared in [25]. The proof of Theorem 10 is given in Appendix A.2.

4 Algorithm Design

We now use Theorem 10 to design a distributed optimization algorithm, which we name SVL. Our guiding principle is to seek the fastest possible rate bound guarantee while keeping the algorithm as simple as possible. Therefore, we seek an algorithm with two states that only requires one state to be communicated at every timestep. Inspired by our previous work in which we developed a canonical form for distributed algorithms over time-invariant graphs [26], we restrict our search to algorithms of the form (3) with

[TABLE]

As long as $\beta\neq 0$ , this algorithm satisfies the fixed point conditions of Proposition 6. Moreover, the update equations satisfy (4) and therefore do not contain circular dependencies, so we can implement the algorithm in a straightforward fashion as in Algorithm 1. To motivate the structure of our algorithm, we show how it corresponds to an inexact version of the alternating direction method of multipliers (ADMM), as well as how it reduces to well-known consensus and optimization algorithms in special cases. But first, we show how to use the SDP (7) to choose the algorithm parameters.

4.1 Choosing the Algorithm Parameters

The problem of minimizing the worst-case convergence rate $\rho$ over the algorithm parameters $(\alpha,\beta,\gamma,\delta)$ and SDP solution $(P,Q,R)$ subject to the SDP being feasible is difficult due to the nonlinear matrix inequalities (7). Instead, we show that for a particular choice of $(\alpha,\gamma,\delta)$ , the remaining parameters $(\beta,\rho)$ can be chosen such that the SDP is feasible, where the matrix in (7ac) is rank one. We have performed extensive numerical optimizations of the SDP, suggesting that the optimal parameters do in fact have this structure. We now state our main design result, which describes the convergence rate of the SVL algorithm. We prove the result in Appendix A.3.

Theorem 12 (SVL).

Consider applying Algorithm 1 to the distributed optimization problem (1), and suppose Assumptions 1 and 2 hold with $0<m<L$ and $0\leq\sigma<1$ . Define $\eta\colonequals 1+\rho-\kappa\,(1-\rho)$ and choose the parameters

[TABLE]

where $\beta$ and $\rho\in\bigl{[}\tfrac{L-m}{L+m},1\bigr{)}$ satisfy the constraints

[TABLE]

Then there exists a constant $c>0$ independent of $i$ and $k$ such that for all agents $i\in\{1,\dots,n\}$ and all iterations $k\geq 0$ , $\lVert{y_{i}^{k}-y_{\text{opt}}}\rVert\leq c\,\rho^{k}$ where $y_{\text{opt}}\in\mathbb{R}^{d}$ is the optimizer of (1).

Theorem 12 provides conditions on parameters $(\alpha,\beta,\gamma,\delta)$ of Algorithm 1 such that the algorithm converges with rate at least $\rho$ . The theorem, however, does not address the problem of optimizing the convergence rate since $\beta$ and $\rho$ must only be chosen to satisfy the constraints (21). This is because the optimal parameters do not admit a closed-form solution for the convergence rate $\rho$ as a function of the spectral gap $\sigma$ and function parameters $m$ and $L$ . However, we now provide a systematic method for computing the optimal parameters.

The parameters must satisfy (21b), but this equation does not have a closed-form solution for $\rho$ . Instead, we consider fixing the rate $\rho$ and maximizing the corresponding spectral gap. We can then choose $\beta$ to maximize $\sigma^{2}$ in (21b). Setting the derivative equal to zero, we find that the value of $\beta$ which maximizes $\sigma^{2}$ for a fixed convergence rate $\rho$ satisfies

[TABLE]

where the coefficients $s_{i}$ are given by

[TABLE]

Solving the first factor for $\beta$ , we find that it does not satisfy the inequality (21a) and is therefore not a valid solution. The optimal $\beta$ must then make the second factor zero. Therefore, we can do a bisection search over $\rho$ , where at each iteration of the bisection search we solve the cubic equation

[TABLE]

to find the unique real solution $\beta$ that satisfies (21a). Substituting this value for $\beta$ into (21b), we can solve for the corresponding value of $\sigma$ . If this computed value is less than the actual spectral gap $\sigma$ , we increase $\rho$ ; otherwise, we decrease $\rho$ . We then repeat this procedure until the computed value is sufficiently close to the spectral gap. We summarize this procedure for finding the parameters $\beta$ and $\rho$ that optimize the worst-case convergence rate in Algorithm 2; we refer to Algorithm 1 using these parameters along with those in (20) as SVL.
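Structurally, this search is a bisection on $\rho$ with an inner solve for $\beta$. The sketch below shows only that control flow. The callables `beta_for_rate` (standing in for the real root of the cubic equation that satisfies (21a)) and `sigma_for` (standing in for solving (21b) for $\sigma$) are hypothetical placeholders, and the toy closures at the bottom are not the paper's formulas; they are chosen only so that the loop can be run. The sketch also assumes, as the update rule implies, that the computed $\sigma$ grows with $\rho$.

```python
from typing import Callable, Tuple


def tune_beta_and_rho(
    sigma: float,
    rho_min: float,
    beta_for_rate: Callable[[float], float],
    sigma_for: Callable[[float, float], float],
    tol: float = 1e-9,
) -> Tuple[float, float]:
    """Bisect on rho until the sigma computed from (beta, rho) matches sigma."""
    lo, hi = rho_min, 1.0
    while hi - lo > tol:
        rho = 0.5 * (lo + hi)
        beta = beta_for_rate(rho)         # inner solve for beta at this rate
        if sigma_for(beta, rho) < sigma:  # computed value is too small
            lo = rho                      # so increase rho
        else:
            hi = rho                      # otherwise decrease rho
    rho = hi
    return beta_for_rate(rho), rho


# Toy placeholders (assumptions for illustration only, not (21b) or the cubic).
toy_beta = lambda rho: 1.0 - rho
toy_sigma = lambda beta, rho: rho ** 2

beta_hat, rho_hat = tune_beta_and_rho(0.49, 0.5, toy_beta, toy_sigma)
```

In an actual implementation the two placeholders would be replaced by a cubic root finder and by the expression for $\sigma^{2}$ in (21b).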

Using this procedure for computing the worst-case convergence rate of SVL, Fig. 1 displays $\rho$ as a function of the spectral gap $\sigma$ and the centralized gradient rate $\tfrac{\kappa-1}{\kappa+1}$ . One of the remarkable aspects of the SVL algorithm is that it actually achieves the same worst-case convergence rate as centralized gradient descent if the spectral gap is sufficiently small. In this case, there is sufficient mixing among the agents so that the convergence rate is limited by the difficulty of the optimization problem and not the problem of having agents agree on the solution (i.e., consensus). This corresponds to the horizontal lines for small values of $\sigma$ in the top panel of Fig. 1. Viewed another way, the convergence rate is limited by the difficulty of the optimization problem when the problem is ill-conditioned (i.e., $\kappa$ is large), which corresponds to the curves approaching the straight line at $\rho=\tfrac{\kappa-1}{\kappa+1}$ in the bottom panel of Fig. 1.

Remark 13 (Optimality).

We conjecture that the SVL parameters $(\alpha,\beta,\gamma,\delta)$ produce the fastest worst-case convergence rate over all algorithms in the form of Algorithm 1 that is certifiable using Theorem 10. However, we make no formal claims of optimality of the SVL algorithm in this paper.

4.2 Interpretation of SVL as Inexact ADMM

To motivate the structure of SVL, we show how SVL can be interpreted as an inexact version of the alternating direction method of multipliers (ADMM). Using the formulation in [4, Section 7.1], the problem (1) can be solved using ADMM:

[TABLE]

where $(x_{i}^{k},y_{i}^{k},z_{i}^{k})$ are the variables associated with agent $i$ at time $k$ , and $\beta$ is the ADMM parameter. To implement this algorithm, however, each agent must solve the local optimization problem (23a) exactly as well as compute the exact average (23b) at each iteration. Instead, we consider a variant where the computations and communications are inexact. Specifically, we replace the exact minimization (23a) with a single gradient step with initial condition $y_{i}^{k}$ and stepsize $\alpha>0$ , and we replace the exact averaging step (23b) with a single gossip step using the Laplacian matrix $\mathcal{L}^{k}$ . This gives the following inexact version of ADMM:

[TABLE]

Defining the state $w_{i}^{k}\colonequals-\tfrac{\alpha}{\beta}z_{i}^{k-1}$ , this algorithm is equivalent to Algorithm 1 with $\gamma=1+\beta$ and $\delta=1$ . In other words, SVL corresponds to an inexact version of ADMM, where $\alpha$ is the stepsize of the gradient step and $\beta$ is the ADMM parameter. See [24, 5] for other distributed ADMM variants.

4.3 Special Cases

We now show how the SVL algorithm reduces to well-known consensus and optimization algorithms in special cases.

$n=1$ :

With only one agent, the distributed optimization problem (1) is equivalent to centralized optimization. In this case, the Laplacian matrix is simply the scalar $\mathcal{L}^{k}=0$ , so $v_{1}^{k}=0$ for all $k\geq 0$ . Algorithm 1 then simplifies to

[TABLE]

which is ordinary gradient descent with stepsize $\alpha$ . The fastest possible gradient rate of $\rho=\frac{\kappa-1}{\kappa+1}$ is achieved when $\alpha=\frac{2}{L+m}$ .
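A quick numerical check of this special case, assuming an illustrative two-dimensional quadratic $f(x)=\tfrac{1}{2}(m\,x_{1}^{2}+L\,x_{2}^{2})$ with $m=1$ and $L=10$ (the test function and these values are not from the paper):

```python
import math

# Illustrative problem data (assumed, not from the paper).
m, L = 1.0, 10.0
kappa = L / m
alpha = 2.0 / (L + m)                 # stepsize from the text
rate = (kappa - 1.0) / (kappa + 1.0)  # predicted contraction factor


def grad(x):
    """Gradient of f(x) = 0.5 * (m * x[0]**2 + L * x[1]**2)."""
    return [m * x[0], L * x[1]]


x = [1.0, 1.0]             # the minimizer is the origin
errors = [math.hypot(*x)]  # distance to the minimizer at each iteration
for _ in range(20):
    g = grad(x)
    x = [xi - alpha * gi for xi, gi in zip(x, g)]
    errors.append(math.hypot(*x))

# Per-step contraction of the distance to the minimizer.
ratios = [errors[k + 1] / errors[k] for k in range(20)]
```

Every entry of `ratios` equals $\tfrac{\kappa-1}{\kappa+1}=\tfrac{9}{11}$ up to rounding, because this stepsize scales the two coordinates by $+\tfrac{9}{11}$ and $-\tfrac{9}{11}$ at each step.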

$\kappa=1$ :

When the condition ratio is unity (i.e., $m=L$ ), the distributed optimization problem (1) is equivalent to average consensus. In this case, the parameters of SVL are simply $\alpha=\tfrac{1}{L}$ , $\beta=1$ , $\gamma=2$ , and $\delta=1$ . Also, the objective functions are quadratic, so we may assume without loss of generality that they have the form $f_{i}^{k}(x)=\tfrac{L}{2}\|x-r_{i}^{k}\|^{2}$ , where $r_{i}^{k}\in\mathbb{R}^{d}$ is a parameter on agent $i\in\{1,\ldots,n\}$ at iteration $k$ . The SVL algorithm then simplifies to

[TABLE]

which is a dynamic average consensus algorithm since the reference signals are continually injected into the dynamics [10]. When the objective functions are constant, the $r_{i}$ terms cancel from the iterations and only affect the initial conditions. This case is referred to as static average consensus [27], and the worst-case rate of convergence is $\rho=\sigma$ [29].
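The static consensus rate $\rho=\sigma$ can be illustrated on a small example. The sketch below assumes the standard gossip iteration $x^{k+1}=x^{k}-\mathcal{L}x^{k}$ on a complete graph with three agents and a hypothetical edge weight $w=0.25$; it is not the simplified SVL update itself. For this graph, $\mathcal{L}$ has eigenvalue $0$ on the all-ones vector and $nw$ on its orthogonal complement, so $\sigma=|1-nw|$.

```python
import math

n = 3     # number of agents (illustrative)
w = 0.25  # edge weight (hypothetical choice)

# Weighted Laplacian of the complete graph on n nodes.
lap = [[(n - 1) * w if i == j else -w for j in range(n)] for i in range(n)]
sigma = abs(1.0 - n * w)  # contraction factor on the disagreement subspace


def gossip(x):
    """One gossip step x <- x - lap @ x."""
    return [x[i] - sum(lap[i][j] * x[j] for j in range(n)) for i in range(n)]


def disagreement(x):
    """Euclidean distance between x and its average."""
    avg = sum(x) / n
    return math.sqrt(sum((xi - avg) ** 2 for xi in x))


x = [1.0, 2.0, 6.0]
target = sum(x) / n  # the average is preserved by every gossip step
history = [disagreement(x)]
for _ in range(10):
    x = gossip(x)
    history.append(disagreement(x))
```

Each entry of `history` is $\sigma$ times the previous one, and the running average stays at `target`, which is the behavior the rate $\rho=\sigma$ describes.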

5 Numerical Results

In this section, we compare the worst-case performance of SVL with that of other first-order distributed algorithms.

5.1 Algorithm Comparison (Upper Bounds)

Theorem 10 provides an upper bound on the worst-case convergence rate. We used this result to compare all algorithms in Table 1, including SVL. The results are shown in Fig. 2. For each algorithm, we used a bisection search to find the smallest rate $\rho$ that yielded a feasible solution to the SDP (7). We implemented the SDP in Julia [3] with the JuMP [6] modeling package and the Mosek interior point solver [1]. In an outer loop, we performed a parameter search for each algorithm to find the step size $\alpha$ and overrelaxation parameter $\mu$ that yielded the smallest possible $\rho$ . Specifically, we used Brent’s method and the Nelder–Mead method, respectively, as implemented in the Optim package [14] as $\sigma$ ranged from 0 to 1.

As shown in Fig. 2, optimizing over $\mu$ further improves worst-case performance. Our proposed SVL algorithm outperforms all methods we tested. Also shown in Fig. 2 is the lower bound described in Section 2.5, namely $\rho\geq\max\{\tfrac{\kappa-1}{\kappa+1},\sigma\}$ , which holds for any distributed algorithm.
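The lower bound is simple enough to evaluate directly. The helper below is a hypothetical convenience function, and the sample values of $\kappa$ and $\sigma$ are illustrative only; it shows which of the two terms is active in a given regime.

```python
def rate_lower_bound(kappa: float, sigma: float) -> float:
    """Return max{(kappa - 1) / (kappa + 1), sigma}."""
    if kappa < 1.0:
        raise ValueError("kappa = L/m must be at least 1")
    if not 0.0 <= sigma < 1.0:
        raise ValueError("sigma must lie in [0, 1)")
    return max((kappa - 1.0) / (kappa + 1.0), sigma)


# With kappa = 10 the gradient term is 9/11, roughly 0.818.
well_connected = rate_lower_bound(kappa=10.0, sigma=0.5)    # gradient term active
poorly_connected = rate_lower_bound(kappa=10.0, sigma=0.9)  # consensus term active
```

In the first call the bound is set by the conditioning of the objective, and in the second by the graph.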

5.2 Approximate Worst-Case Examples (Lower Bounds)

In an effort to show that the upper bounds for each algorithm in Fig. 2 were likely tight, we searched for signals $\{x^{k},u^{k},v^{k},y^{k},z^{k}\}$ that satisfied (3) for some choice of $f_{i}$ and $\mathcal{L}^{k}$ satisfying Assumptions 1 and 2, respectively.

We first solved a relaxed version of the problem, where we replaced Assumptions 1 and 2 by the weaker conditions (26) and (27), respectively. We used the following greedy heuristic. For a given algorithm and rate $\rho$ , we solved (7) to obtain $(P,Q,R)$ . At each time step $k$ , we then maximized the Lyapunov increment $V^{k+1}-\rho^{2}V^{k}$ , where $V^{k}$ is defined in (9). We solved the following optimization problem for $k\geq 0$ .

[TABLE]

For $k=0$ , we also included $x^{0}$ as an optimization variable and the normalization $V^{0}=1$ . For $k\geq 1$ , we solved (24) using the $x^{k}$ found at the previous iteration and warm-starting $u^{k},v^{k}$ . We used the Ipopt [28] local solver with default settings since (24) is a nonconvex quadratically constrained quadratic program. Note that we must choose parameters $n$ and $d$ .

Our relaxed heuristic using $n=d=2$ was successful in constructing trajectories that matched the worst-case bounds from (7). To illustrate, we simulated EXTRA, NIDS, DIGing, and SVL with $\kappa=10$ and a few values of $\sigma$ in Fig. 3. For each trajectory, we plotted $\|y^{k}-y^{\star}\|$ together with the corresponding upper bound $\rho$ found from Theorem 10. We obtained similar results for the other algorithms from Table 1.

Since we used the relaxation (27) to construct $z^{k}$ and $v^{k}$ , there is no guarantee that there will exist a linear Laplacian $\mathcal{L}^{k}$ such that $v^{k}=\mathcal{L}^{k}z^{k}$ . However, finding whether such an $\mathcal{L}^{k}$ exists amounts to solving a convex optimization problem:

[TABLE]

If (25) is feasible and its optimal value is less than or equal to $\sigma$ , then the associated $\mathcal{L}^{k}$ is a valid Laplacian matrix at timestep $k$ . While there is no guarantee that (25) will even be feasible, we reasoned that since there are $n^{2}$ variables and $2n+ndc$ linear constraints, where $d$ and $c$ are the number of rows of $C_{y}$ and $C_{z}$ , respectively, we could increase our chances of finding feasible $\mathcal{L}^{k}$ with $n$ large and $d$ and $c$ small.
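The counting argument can be made concrete with a little arithmetic. The function name below is hypothetical, and the sample values $n=15$, $d=1$, $c=1$ are chosen to match the NIDS experiment reported in the text, on the assumption that the $d$ quoted there plays the role of $d$ in the count.

```python
from typing import Tuple


def laplacian_problem_size(n: int, d: int, c: int) -> Tuple[int, int]:
    """Return (number of unknowns, number of linear constraints) for (25).

    Uses the counts quoted in the text: n^2 entries of the Laplacian and
    2n + n*d*c linear constraints.
    """
    return n * n, 2 * n + n * d * c


unknowns, constraints = laplacian_problem_size(n=15, d=1, c=1)
slack = unknowns - constraints  # unknowns left over after the constraints
```

With these values there are 225 unknowns against 45 linear constraints, which is the regime (large $n$, small $d$ and $c$) that the heuristic aims for.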

In Figure 4, we show a successful construction for the NIDS algorithm, which has $c=1$ . We solved (24) with $n=15$ and $d=1$ , and solved (25) at each timestep. An optimal cost for (25) of $\sigma$ was always achieved. This result indicates that the upper bound for NIDS in Fig. 2 is likely tight, and that NIDS is not robustly stable in the time-varying setting. In other words, the network-independent rate bound enjoyed by NIDS in the constant-graph setting [13, Thm. 2] does not carry over to the time-varying setting.

Remark 14.

There may be other approaches to finding a worst-case $\mathcal{L}^{k}$ that perform better. For example, one might try alternating convex optimizations or including $\mathcal{L}^{k}$ directly as an optimization variable in a nonlinear program.

6 Conclusion

We presented a universal analysis framework for a broad class of first-order distributed optimization algorithms over time-varying graphs. The framework provides worst-case certificates of linear convergence via semidefinite programming, and we show empirically that our rate bounds are likely tight. Optimizing the SDP from our analysis framework, we designed a novel distributed algorithm, SVL, which outperforms all known algorithms in this time-varying setting.

Appendix A Appendix

A.1 Proof of Proposition 6

Suppose (6) holds, and denote the optimizer of (1) by $y_{\text{opt}}$ . Then there exist vectors $p$ and $q$ such that

[TABLE]

For all $i\in\{1,\ldots,n\}$ , use these vectors to define the points

[TABLE]

This is a fixed point of algorithm (3), and the fixed point satisfies the conditions in (5) since $y_{\text{opt}}$ is the optimizer of (1).

Now suppose $(x^{\star},y^{\star},z^{\star},u^{\star},v^{\star})$ is a fixed point of (3) satisfying (5). Let $p=(1/n)\sum_{i=1}^{n}x_{i}^{\star}$ . Since $1^{\mathsf{T}}u^{\star}=0$ , $v^{\star}=0$ , and $1^{\mathsf{T}}y^{\star}$ is unconstrained, we have from (3a) and (3c) that $p\neq 0$ is in the set (6a). Now let $v$ be any nonzero vector such that $v^{\mathsf{T}}1=0$ . Then from (3a), we have that

[TABLE]

Since this must hold for arbitrary $v^{\mathsf{T}}u^{\star}$ , this implies (6b).

A.2 Proof of Theorem 10

Assumptions 1 and 2 lead to quadratic inequalities that will be useful in proving our main result. These are stated in the following propositions.

Proposition 15.

Suppose Assumption 1 holds for the local objective functions $f_{i}$ . Let $(y_{i}^{k},u_{i}^{k})$ satisfy (3b), and let $(y_{i}^{\star},u_{i}^{\star})$ be a fixed point that satisfies (5). Then

[TABLE]

Proof. Using the definition of $M_{0}$ , the quadratic form is

[TABLE]

Since the fixed point satisfies (5), Assumption 1 implies that this is nonnegative with $y_{\text{opt}}=y_{1}^{\star}=\ldots=y_{n}^{\star}$ .

Proposition 16.

Suppose Assumption 2 holds for the graph $\mathcal{G}^{k}$ at each iteration. Let $(z_{i}^{k},v_{i}^{k})$ satisfy (3b), and let $(z_{i}^{\star},v_{i}^{\star})$ be a fixed point that satisfies (5). Then for all $R\succeq 0$ ,

[TABLE]

Proof. From the definition of the matrix norm and Assumption 2, we have that

[TABLE]

Without loss of generality, $y=\Pi\eta+(I-\Pi)\,\phi$ , where $\eta$ and $\phi$ are arbitrary. By orthogonality, $\lVert{y}\rVert^{2}=\lVert{\Pi\eta}\rVert^{2}+\lVert{(I-\Pi)\,\phi}\rVert^{2}$ . Substituting the decomposition of $y$ into the above inequality,

[TABLE]

where the last two steps follow because the maximum is attained with $\eta=0$ , and $\mathcal{L}^{k}\Pi=\Pi\mathcal{L}^{k}=\mathbf{0}$ . Squaring both sides and rewriting as a quadratic form yields

[TABLE]

for all $\phi\in\mathbb{R}^{n}$ . Now let $p$ denote the dimension of $z_{i}^{k}$ . Then since $R\succeq 0$ , it has the decomposition

[TABLE]

where $w_{\ell}\in\mathbb{R}^{p}$ and $\mu_{\ell}\geq 0$ . Then using that $\tilde{v}^{k}=(\mathcal{L}^{k}\otimes I_{p})\,\tilde{z}^{k}$ , the quadratic form is

[TABLE]

which is nonnegative from (28) with $\phi\leftarrow(I\otimes w_{\ell}^{\mathsf{T}})\,\tilde{z}^{k}$ .

Let $(x^{k},y^{k},z^{k},u^{k},v^{k})$ denote a trajectory of algorithm (3). Since the algorithm satisfies the fixed point conditions (6) (by assumption), we have from Proposition 6 that there exists a fixed point $(x^{\star},y^{\star},z^{\star},u^{\star},v^{\star})$ satisfying (5). The global optimizer is unique from Assumption 1, so the fixed point conditions (5a) imply that $y_{1}^{\star}=\ldots=y_{n}^{\star}=y_{\text{opt}}$ with $y_{\text{opt}}$ the optimizer of (1).

Since the trajectory satisfies the invariant (3c) and the columns of $\Psi$ form a basis for the nullspace of $\begin{bmatrix}F_{x}&F_{u}\end{bmatrix}$ , there exists a vector $\tilde{s}^{k}$ such that

[TABLE]

Multiplying the matrix in (7l) on the right and left by $\tilde{s}^{k}$ and its transpose, respectively, we obtain the consensus inequality

[TABLE]

where we used that $\{w_{i}\}_{i=1}^{n}$ form an orthonormal basis for $\mathbb{R}^{n}$ . Summing the inequalities in (29), we obtain

[TABLE]

where $V^{k}$ is defined in (9). The quadratic forms in the last two terms are nonnegative from Propositions 15 and 16, which implies $V^{k+1}\leq\rho^{2}\,V^{k}$ . We then apply this inequality iteratively to obtain $V^{k}\leq\rho^{2k}\,V^{0}$ for all $k\geq 0$ . Now define

[TABLE]

and note that $T\succ 0$ since $P$ and $Q$ are positive definite. Then letting $\operatorname{cond}(T)=\lambda_{\text{max}}(T)/\lambda_{\text{min}}(T)$ denote the condition number of $T$ , we have the bound

[TABLE]

Therefore, the bound (8) holds with $c=\sqrt{\operatorname{cond}(T)\,V^{0}}$ .

A.3 Proof of Theorem 12

Substituting the template (19) into the LMI (7l) reduces it to

[TABLE]

which is satisfied with $\alpha=(1-\rho)/m$ and $P_{11}=\tfrac{m\,(L-m)}{\rho\,(1-\rho)}$ . Note that this LMI is known to describe the convergence rate of centralized gradient descent; see [11, Section 4.4].
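As a sanity check on this choice of stepsize, the following worked computation uses the elementary fact that, on a quadratic whose curvature lies in $[m,L]$, a gradient step with stepsize $\alpha$ contracts the error by $\max\{|1-\alpha m|,|1-\alpha L|\}$. It is a sketch under that assumption with $\kappa=L/m$, not a substitute for the LMI argument.

```latex
\begin{align*}
\alpha=\frac{1-\rho}{m}
\;&\Longrightarrow\;
1-\alpha m=\rho,
\qquad
1-\alpha L=1-\kappa\,(1-\rho),\\
1-\kappa\,(1-\rho)\leq\rho
\;&\Longleftrightarrow\;
(1-\rho)(1-\kappa)\leq 0
\quad\text{(true since }\kappa\geq 1,\ \rho<1\text{)},\\
1-\kappa\,(1-\rho)\geq-\rho
\;&\Longleftrightarrow\;
\eta=1+\rho-\kappa\,(1-\rho)\geq 0
\;\Longleftrightarrow\;
\rho\geq\frac{\kappa-1}{\kappa+1}=\frac{L-m}{L+m}.
\end{align*}
```

Hence $\max\{|1-\alpha m|,|1-\alpha L|\}=\rho$ exactly when $\rho$ lies in the interval $\bigl[\tfrac{L-m}{L+m},1\bigr)$ required by Theorem 12, and the condition $\eta\geq 0$ is the same requirement.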

Now consider the potential solution to (7ac) given by

[TABLE]

Using these values along with the value for $\sigma^{2}$ in (21b), the matrix in (7ac) is equal to the rank-one matrix $-\tfrac{1}{t_{2}t_{4}}zz^{\mathsf{T}}$ , where

[TABLE]

In order for this to be a valid solution, we must have $t_{3}>0$ and $t_{1}/t_{4}>0$ (so that $Q\succ 0$ ), $t_{5}/t_{2}\geq 0$ (so that $R\succeq 0$ ), and $t_{2}t_{4}>0$ (so that (7ac) holds). All of these inequalities hold if and only if (21a) holds. Therefore, the SDP has a rank-one solution using the parameters in (20) if $\beta$ and $\rho$ satisfy (21). The convergence bound then follows from Theorem 10 and Remark 11.

1 Wisconsin Institute for Discovery, WI 53715, USA. 2 Department of Electrical and Computer Engineering, University of Wisconsin–Madison, WI 53706, USA. Emails: {asundararaja,vanscoy,laurent.lessard}@wisc.edu

Bibliography

[1] APS Mosek. The MOSEK optimization software, 2010. Online at http://www.mosek.com.
[2] J. A. Bazerque and G. B. Giannakis. Distributed spectrum sensing for cognitive radio networks by exploiting sparsity. IEEE Transactions on Signal Processing, 58(3):1847–1862, 2009.
[3] J. Bezanson, A. Edelman, S. Karpinski, and V. B. Shah. Julia: A fresh approach to numerical computing. SIAM Review, 59(1):65–98, 2017.
[4] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers, volume 3. Foundations and Trends in Machine Learning, 2010.
[5] T. Chang, M. Hong, and X. Wang. Multi-agent distributed optimization via inexact consensus ADMM. IEEE Transactions on Signal Processing, 63(2):482–497, 2015.
[6] I. Dunning, J. Huchette, and M. Lubin. JuMP: A modeling language for mathematical optimization. SIAM Review, 59(2):295–320, 2017.
[7] P. A. Forero, A. Cano, and G. B. Giannakis. Consensus-based distributed support vector machines. Journal of Machine Learning Research, 11:1663–1707, 2010.
[8] D. Jakovetić. A unification and generalization of exact distributed first-order methods. IEEE Transactions on Signal and Information Processing over Networks, 5(1):31–46, 2018.
