Asynchronous Distributed Optimization over Lossy Networks via Relaxed ADMM: Stability and Linear Convergence

Nicola Bastianello; Ruggero Carli; Luca Schenato; Marco Todescato

arXiv:1901.09252·math.OC·July 24, 2020·IEEE Trans. Autom. Control.


TL;DR

This paper introduces a modified relaxed ADMM algorithm for distributed convex optimization over lossy, asynchronous networks, proving its almost sure convergence and linear convergence under certain conditions, with numerical validation.

Contribution

It proposes a novel asynchronous, lossy network-compatible ADMM variant with proven convergence properties and convergence rate bounds, extending distributed optimization theory.

Findings

01. Almost sure convergence under general loss and activation models

02. Linear convergence in mean to a neighborhood of the optimum

03. Numerical results demonstrating effectiveness in various scenarios

Abstract

In this work we focus on the problem of minimizing the sum of convex cost functions in a distributed fashion over a peer-to-peer network. In particular, we are interested in the case in which communications between nodes are prone to failures and the agents are not synchronized among themselves. We address the problem proposing a modified version of the relaxed ADMM, which corresponds to the Peaceman-Rachford splitting method applied to the dual. By exploiting results from operator theory, we are able to prove the almost sure convergence of the proposed algorithm under general assumptions on the distribution of communication loss and node activation events. By further assuming the cost functions to be strongly convex, we prove the linear convergence of the algorithm in mean to a neighborhood of the optimal solution, and provide an upper bound to the convergence rate. Finally, we present…


Tables (3)

Table I: Comparison of R-ADMM formulations.

| Requirement | Operator R-ADMM | Lagrangian R-ADMM |
|---|---|---|
| Memory | $d_{i} + 1$ | $2 d_{i} + 1$ |
| Transmit | $d_{i}$ | $2 d_{i} + 1$ |

Table II: Comparison of convergence results for different distributed ADMM formulations.

| Class | Reference | Formulation | $α$ | Linear convergence | Asynchronous updates | Packet loss |
|---|---|---|---|---|---|---|
| Augmented Lagrangian ADMM | Shi et al. [31] | node-based | $1 / 2$ | global | ✗ | ✗ |
| Augmented Lagrangian ADMM | Makhdoumi & Ozdaglar [32] | node-based | $1 / 2$ | global | ✗ | ✗ |
| Augmented Lagrangian ADMM | Majzoobi et al. [35] | node-, edge-based | $1 / 2$ | ✗ | ✗ | ✓ (uniform distr.) |
| Augmented Lagrangian ADMM | Chang et al. [34] | master-slave$^{†}$ | $1 / 2$ | global | ✓ (for master) | ✗ |
| Augmented Lagrangian ADMM | Iutzeler et al. [33] | clustered | $1 / 2$ | local | ✓ (for clusters) | ✗ |
| Splitting ADMM | Bianchi et al. [30] | edge-based | $1 / 2$ | ✗ | ✓ | ✗ |
| Splitting ADMM | Giselsson & Boyd [23] | master-slave$^{†}$ | $(0, 1)$ | global | ✗ | ✗ |
| Splitting ADMM | Combettes & Pesquet [38] | master-slave$^{†}$ | $(0, 1)$ | global | ✓ | ✓ |
| Splitting ADMM | Peng et al. [26] | edge-based | $(0, 1)$ | ✗ | ✓ (one node at a time) | ✗ |
| Splitting ADMM | This paper | edge-based | $(0, 1)$ | local | ✓ | ✓ |

$†$ The results presented in these works do not explicitly address master-slave architectures, see Remark 5.

Table III: Difference between empirical and theoretical convergence rate.

| Quantity | Maximum | Minimum | Mean $\pm$ Std |
|---|---|---|---|
| $\frac{\| \hat{γ} - {\bar{γ}}_{M} \|}{{\bar{γ}}_{M}}$ | $4.9 \times 10^{- 5}$ | $8.28 \times 10^{- 11}$ | $(1.1 \pm 2.8) \times 10^{- 6}$ |

Equations (293)

$$\min_{x}\ \sum_{i=1}^{N}f_{i}(x)$$

$$\operatorname{prox}_{\rho f}(x)=\operatorname*{arg\,min}_{y}\left\{f(y)+\frac{1}{2\rho}\lVert y-x\rVert^{2}\right\}$$

$$x(k+1)=(1-\alpha)x(k)+\alpha\mathcal{T}x(k)$$

$$\min_{x\in\mathbb{R}^{n}}\ \{f(x)+g(x)\}$$

$$\mathcal{T}_{\mathrm{PR}}=\operatorname{refl}_{\rho g}\circ\operatorname{refl}_{\rho f}$$

$$z(k+1)=(1-\alpha)z(k)+\alpha\mathcal{T}_{\mathrm{PR}}z(k),\quad k\in\mathbb{N}$$

$x(k+1)$

$y(k+1)$

$z(k+1)$

$$\min_{x\in\mathbb{R}^{n},\,y\in\mathbb{R}^{m}}\ \{f(x)+g(y)\}\quad\text{s.t.}\quad Ax+By=c$$

$$\min_{w\in\mathbb{R}^{p}}\ \{d_{f}(w)+d_{g}(w)\}$$

$$d_{f}(w)=f^{*}(A^{\top}w)\quad\text{and}\quad d_{g}(w)=g^{*}(B^{\top}w)-\langle w,c\rangle$$

$x(k+1)$

$w(k+1)$

$y(k+1)$

$+\frac{\rho}{2}\left\lVert By-c\right\rVert^{2}\Big\}$

$v(k+1)$

$z(k+1)$

$$\min_{x\in\mathbb{R}^{n}}\ \sum_{i=1}^{N}f_{i}(x)$$

$$\min_{x_{i},\,i\in\mathcal{V}}\ \sum_{i=1}^{N}f_{i}(x_{i})\quad\text{s.t.}\quad x_{i}=x_{j}\ \ \forall(i,j)\in\mathcal{E}$$

$$x_{i}=y_{ij},\quad x_{j}=y_{ji}\quad\text{and}\quad y_{ij}=y_{ji}\quad\forall(i,j)\in\mathcal{E}$$

$$\mathbold{A}\mathbold{x}+\mathbold{B}\mathbold{y}=0\quad\text{and}\quad\mathbold{y}=\mathbold{P}\mathbold{y}$$

$$\mathbold{A}=\begin{bmatrix}\mathbf{1}_{d_{1}}&\mathbf{0}_{d_{1}}&\cdots&\mathbf{0}_{d_{1}}\\ \mathbf{0}_{d_{2}}&\mathbf{1}_{d_{2}}&\cdots&\mathbf{0}_{d_{2}}\\ \vdots&&\ddots&\vdots\\ \mathbf{0}_{d_{N}}&\mathbf{0}_{d_{N}}&\cdots&\mathbf{1}_{d_{N}}\end{bmatrix}\otimes I_{n}\in\mathbb{R}^{nM\times nN}$$

$$\min_{\mathbold{x},\mathbold{y}}\ \{f(\mathbold{x})+\iota_{(\mathbold{I}-\mathbold{P})}(\mathbold{y})\}\quad\text{s.t.}\quad\mathbold{A}\mathbold{x}-\mathbold{y}=0$$

$x_{i}(k+1)$

$-\Big\langle\sum_{j\in\mathcal{N}_{i}}z_{ij}(k),x_{i}\Big\rangle+\frac{\rho d_{i}}{2}\left\lVert x_{i}\right\rVert^{2}\bigg\}$

$z_{ij}(k+1)$

$$x_{i}(k+1)=\operatorname{prox}_{f_{i}/(\rho d_{i})}\left([\mathbold{A}^{\top}\mathbold{z}(k)]_{i}/(\rho d_{i})\right)$$

$$q_{j\to i}=-z_{ji}(k)+2\rho x_{j}(k+1)$$

$$z_{ij}(k+1)=(1-\alpha)z_{ij}(k)+\alpha q_{j\to i}$$

$$\lim_{k\to\infty}x_{i}(k)=\bar{x},\quad\forall i\in\mathcal{V}$$

$$\mathbold{z}(k+1)=\mathbold{T}\mathbold{z}(k)+\mathbold{u}+\mathbold{o}^{\prime}(\mathbold{x}(k+1)-\mathbold{x}^{*})$$

$$\mathbold{T}=(1-\alpha)\mathbold{I}-\alpha\mathbold{P}+2\alpha\rho\mathbold{P}\mathbold{A}\mathbold{H}^{-1}\mathbold{A}^{\top}$$

$$\frac{\lVert\mathbold{o}^{\prime}(\mathbold{x}(k+1)-\mathbold{x}^{*})\rVert}{\lVert\mathbold{x}(k+1)-\mathbold{x}^{*}\rVert}\to 0$$

$$\lVert x_{i}(k)-x^{*}\rVert\leq C\gamma^{k},\quad i\in\mathcal{V}$$


Full text

Asynchronous Distributed Optimization over Lossy Networks via Relaxed ADMM:

Stability and Linear Convergence

Nicola Bastianello, Ruggero Carli, Luca Schenato, and Marco Todescato

N. Bastianello, R. Carli and L. Schenato are with the Department of Information Engineering (DEI), University of Padova, Italy. [bastian4|carlirug|schenato]@dei.unipd.it. M. Todescato is with Bosch Center for Artificial Intelligence, Renningen, Germany. [email protected]

This work has received funding from the Italian Ministry of Education, University and Research (MIUR) through the PRIN project no. 2017NS9FEY entitled “Realtime Control of 5G Wireless Networks: Taming the Complexity of Future Transmission and Computation Challenges”. The views and opinions expressed in this work are those of the authors and do not necessarily reflect those of the funding institution.

Abstract

In this work we focus on the problem of minimizing the sum of convex cost functions in a distributed fashion over a peer-to-peer network. In particular, we are interested in the case in which communications between nodes are prone to failures and the agents are not synchronized among themselves. We address the problem proposing a modified version of the relaxed ADMM, which corresponds to the Peaceman-Rachford splitting method applied to the dual. By exploiting results from operator theory, we are able to prove the almost sure convergence of the proposed algorithm under general assumptions on the distribution of communication loss and node activation events. By further assuming the cost functions to be strongly convex, we prove the linear convergence of the algorithm in mean to a neighborhood of the optimal solution, and provide an upper bound to the convergence rate. Finally, we present numerical results testing the proposed method in different scenarios.

Index Terms:

distributed optimization, ADMM, asynchronous update, lossy communications, operator theory, Peaceman-Rachford splitting

I Introduction

From classical control theory to more recent machine learning applications, many problems can be cast as optimization problems [1] and, in particular, as large-scale optimization problems, given the increasing importance of cyber-physical systems in engineering applications. Stemming from classical optimization theory, in order to break down the computational complexity, parallel and distributed optimization methods have been the focus of a wide branch of research [2]. Within this vast topic, typical applications foresee computing nodes to cooperate, through local information exchanges, in order to achieve a desired common goal such as

$$\min_{x}\ \sum_{i=1}^{N}f_{i}(x)\qquad(1)$$

where, usually, each $f_{i}$ is stored and known by one single node only.

While parallel optimization methods usually rely on a shared memory architecture to implement the communication among agents, distributed systems employ a message passing architecture, in which each agent can exchange transmissions with a subset of the other agents. The message passing (or peer-to-peer) architecture, however, introduces some issues due to the implementation of the transmission protocols. Indeed, distributed systems may suffer from communication failures, delays, and noise, on top of the possible asynchronism of the agents’ activations. In this paper we are interested in solving the distributed problem (1) in the presence of communication (or packet) losses and asynchronism.

A first class of algorithms that has been proposed to solve distributed optimization problems is that of (sub)gradient- and Newton-based methods.

Distributed gradient descent algorithms in general combine local gradient descent steps with consensus averaging, see for example [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]. These algorithms can handle many different scenarios, with smooth and non-smooth costs, over both fixed and time-varying topologies, and over directed and undirected graphs. In general, the convergence of gradient-based methods is sub-linear for convex costs and linear for strongly convex costs. The only method that can handle both packet losses and asynchronous activations of the nodes is [14]; however, it requires a decreasing step-size and thus, implicitly, that the agents be synchronized.

Newton-based distributed algorithms have been introduced in [15, 16, 17, 18] in synchronous and lossless scenarios. Recently the scheme in [17] has been extended in [19] to asynchronous and lossy scenarios. However, in [19], the convergence is proved only locally and no characterization of the convergence rate is provided.

Other widely studied algorithms for solving distributed optimization problems are the alternating direction method of multipliers (ADMM) and the more general relaxed ADMM (R-ADMM). This class of algorithms can be defined either as augmented Lagrangian methods [20, 21], or, within an operator theoretical framework, as the dual of the (relaxed) Peaceman-Rachford splitting [22, 23]. The latter formulation will be employed in this paper and we refer to [24, 25] for a background on operator theory and its applications to convex optimization. Typically, the ADMM is derived from a Lagrangian-based formulation, while the R-ADMM is derived in an operator theoretical framework. However, it is known, see [26], that the ADMM can be seen as a particular instance of the R-ADMM, obtained by setting one of the free parameters equal to a specific value. This slightly reduces the complexity of the updating equations, but the higher flexibility of the R-ADMM allows one to obtain better convergence properties.

The convergence of ADMM and R-ADMM for convex optimization problems is in general sub-linear, see e.g. [20, 22], and the same applies for distributed optimization. In an asynchronous scenario, sub-linear convergence can be similarly proved, adopting both the augmented Lagrangian, see [27, 28, 29], and the splitting operator formulations, see [26, 30]. Remarkably, assuming the functional costs are strongly convex, the distributed implementations of ADMM introduced in [31, 32, 33], have been shown to attain linear and global convergence when communications are synchronous and reliable. These results have been extended to asynchronous schemes in [29, 34], though the proposed analysis is limited to master-slave architectures. To the best of our knowledge, [35] is the only paper proving convergence of the synchronous ADMM in presence of lossy communications, modeled as i.i.d. binary random variables. However, no characterization of the convergence rate is provided.

The authors of [26] have derived the R-ADMM within the framework of the ARock algorithm, introduced in the context of parallel computing where agents share a common memory. In [26] and [36], it is shown that the ARock framework successfully handles asynchronous updates and delayed information, attaining a sub-linear rate of convergence. However, due to the reliance of the convergence proof on the common memory, ARock is not suitable to deal with unreliable communications. In [37, 23], the general R-ADMM algorithm is shown to be linearly and globally convergent, provided that the dual problem is strongly convex. This result has been extended to randomized scenarios in [30, 38]. Unfortunately, in the networked optimization scenario of interest, strong convexity of the dual problem is satisfied only if master-slave architectures are employed, thus preventing the use of fully distributed schemes.

In this paper we present and analyze a modified version of the R-ADMM algorithm which is amenable to distributed implementation in peer-to-peer networks with unreliable communications and asynchronous operations of the agents. The theoretical contribution is twofold:

• Deriving the R-ADMM as an application of the Peaceman-Rachford splitting, we are able to exploit recent results on randomized nonexpansive operators to establish the almost sure convergence of the proposed algorithm, provided that mild assumptions on the asynchronous and lossy nature of the network are satisfied.

• Further assuming that the local costs are strongly convex and twice differentiable, we show that the convergence is locally linear in mean, and provide an upper bound to the convergence rate.

A preliminary version of this paper has appeared in [39], where however no asynchronous updates are considered, and no convergence rate analysis is provided.

The remainder of the paper is organized as follows. Section II reviews some concepts in operator theory, and the R-ADMM. Section III describes the distributed implementation of R-ADMM to solve (1) and its convergence. Section IV analyzes the convergence properties of R-ADMM under asynchronous updates and communication failures. Finally, Section V presents some numerical results and Section VI concludes the paper.

II Preliminaries

This Section collects some preliminary definitions in graph theory [40] and convex analysis [41], as well as a brief review of the necessary background regarding operator theory [25] and the R-ADMM [20, 23].

II-A Notation and useful definitions

We denote by $\otimes$ the Kronecker product, by $\Lambda(\mathbold{M})$ the spectrum of a matrix $\mathbold{M}$ , and by $\operatorname{dist}(x,\mathbb{D})=\inf_{y\in\mathbb{D}}\left\lVert x-y\right\rVert$ the distance between point $x\in{\mathbb{R}}^{n}$ and the set $\mathbb{D}\subset{\mathbb{R}}^{n}$ . ${\mathbf{1}}_{n}$ (resp. ${\mathbf{0}}_{n}$ ) denotes the $n$ -dimensional vector of all ones (resp. zeros). By $M\succ 0$ we denote that the symmetric matrix $M$ is positive definite.

We denote a graph by $\mathcal{G}=(\mathcal{V},\mathcal{E})$ , where $\mathcal{V}$ is the set of $N$ vertices, labeled $1$ through $N$ , and $\mathcal{E}$ is the set of undirected edges. For $i\in\mathcal{V}$ , by $\mathcal{N}_{i}$ we denote the set of neighbors of node $i$ in $\mathcal{G}$ , namely, $\mathcal{N}_{i}=\left\{j\in\mathcal{V}\,:\,(i,j)\in\mathcal{E}\right\}$ . The degree of each node $i\in\mathcal{V}$ is denoted by $d_{i}=|\mathcal{N}_{i}|$ . Moreover, in the following we write $M=2|\mathcal{E}|$ , i.e., $M$ counts twice the number of edges in the network.

Consider the scalar function $f:{\mathbb{R}}^{n}\to{\mathbb{R}}\cup\{+\infty\}$ . Then $f$ is said to be closed if $\forall a\in{\mathbb{R}}$ the set $\{x\in\operatorname{dom}(f)\ |\ f(x)\leq a\}$ is closed, and it is proper if it does not attain $-\infty$ , see [25]. We denote by $\Gamma_{0}({\mathbb{R}}^{n})$ the class of convex, closed and proper functions from ${\mathbb{R}}^{n}$ to ${\mathbb{R}}\cup\{+\infty\}$ . We define the convex conjugate of $f\in\Gamma_{0}({\mathbb{R}}^{n})$ as $f^{*}(w)=\sup_{x\in{\mathbb{R}}^{n}}\{\langle w,x\rangle-f(x)\}$ for $w\in{\mathbb{R}}^{n}$ . The convex conjugate belongs to $\Gamma_{0}({\mathbb{R}}^{n})$ . Finally, a function $f\in\Gamma_{0}({\mathbb{R}}^{n})$ is said to be $m$ -strongly convex, $m>0$ , if $f-(m/2)\left\lVert\cdot\right\rVert^{2}$ is convex. If $f$ is twice differentiable, $f\in\mathcal{C}^{2}$ , then strong convexity implies that $\nabla^{2}f(x)\succ mI$ for all $x\in{\mathbb{R}}^{n}$ .

II-B Notions on operator theory

By operator on ${\mathbb{R}}^{n}$ we mean a map ${\mathcal{T}}:{\mathbb{R}}^{n}\to{\mathbb{R}}^{n}$ that assigns to each point $x$ in ${\mathbb{R}}^{n}$ the corresponding point ${\mathcal{T}}x\in{\mathbb{R}}^{n}$ (strictly speaking, the term mapping should be used, but in the literature the two are usually employed interchangeably). Given an operator ${\mathcal{T}}$ , by $\operatorname{fix}({\mathcal{T}})$ we denote the set of its fixed points, that is, $\operatorname{fix}({\mathcal{T}})=\{\bar{x}\in{\mathbb{R}}^{n}\ |\ \bar{x}={\mathcal{T}}\bar{x}\}$ .

An operator ${\mathcal{T}}$ is Lipschitz continuous if there exists $\zeta\geq 0$ such that $\left\lVert{\mathcal{T}}x-{\mathcal{T}}y\right\rVert\leq\zeta\left\lVert x-y\right\rVert$ holds for any two $x,y\in{\mathbb{R}}^{n}$ . In particular, ${\mathcal{T}}$ is said to be nonexpansive if $\zeta=1$ , and contractive if $\zeta\in[0,1)$ . An operator ${\mathcal{T}}$ is averaged if there exist $\alpha\in(0,1)$ and $\mathcal{R}$ nonexpansive such that we can write ${\mathcal{T}}=(1-\alpha)\mathcal{I}+\alpha\mathcal{R}$ . Notice that $\operatorname{fix}({\mathcal{T}})=\operatorname{fix}(\mathcal{R})$ .

An operator ${\mathcal{T}}$ is said to be affine if there exist $T\in{\mathbb{R}}^{n\times n}$ and $u\in{\mathbb{R}}^{n}$ such that we can write ${\mathcal{T}}x=Tx+u$ , $x\in{\mathbb{R}}^{n}$ .

Given a function $f\in\Gamma_{0}({\mathbb{R}}^{n})$ , we define the corresponding proximal operator as

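$$\operatorname{prox}_{\rho f}(x)=\operatorname*{arg\,min}_{y}\left\{f(y)+\frac{1}{2\rho}\lVert y-x\rVert^{2}\right\},$$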

where $\rho>0$ is called the penalty parameter, and the reflective operator as $\operatorname{refl}_{\rho f}(x)=2\operatorname{prox}_{\rho f}(x)-x$ . The proximal operator is $1/2$ -averaged (a property also called firm nonexpansiveness), while the reflective operator is nonexpansive. Observe that the fixed points of $\operatorname{prox}_{\rho f}$ and $\operatorname{refl}_{\rho f}$ coincide with the minimizers of $f$ . In general, given ${\mathcal{T}}$ nonexpansive, a standard algorithm for finding its fixed points is the Krasnosel’skii-Mann (KM) iteration, see [25],

$$x(k+1)=(1-\alpha)x(k)+\alpha\mathcal{T}x(k).$$
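As a minimal sketch of the proximal operator, the reflective operator and the KM iteration, the Python snippet below uses the scalar quadratic cost $f(y)=\frac{a}{2}y^{2}-by$, for which the proximal operator has a closed form. The function names and the numerical values are illustrative assumptions and are not taken from the paper.

```python
def prox_quadratic(x, a, b, rho):
    """prox_{rho f}(x) for the assumed cost f(y) = 0.5*a*y**2 - b*y, a > 0.

    Setting the derivative of f(y) + (y - x)**2 / (2*rho) to zero
    gives the closed form returned below.
    """
    return (x + rho * b) / (1.0 + rho * a)


def refl_quadratic(x, a, b, rho):
    """Reflective operator refl_{rho f}(x) = 2*prox_{rho f}(x) - x."""
    return 2.0 * prox_quadratic(x, a, b, rho) - x


def km_iteration(operator, x0, alpha, num_iters):
    """KM iteration x(k+1) = (1 - alpha)*x(k) + alpha*T(x(k))."""
    x = x0
    for _ in range(num_iters):
        x = (1.0 - alpha) * x + alpha * operator(x)
    return x


# Illustrative values (assumptions): f(y) = y**2 - 6*y, minimized at b/a = 3.
a, b, rho, alpha = 2.0, 6.0, 1.0, 0.5
x_bar = km_iteration(lambda x: refl_quadratic(x, a, b, rho), 10.0, alpha, 200)
```

In this example the iterate `x_bar` approaches the fixed point of the reflective operator, which is the minimizer $b/a$ of the assumed cost.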

Consider now the convex optimization problem

$$\min_{x\in\mathbb{R}^{n}}\ \{f(x)+g(x)\}\qquad(3)$$

with $f,g\in\Gamma_{0}({\mathbb{R}}^{n})$ . Let us define the Peaceman-Rachford operator

$$\mathcal{T}_{\mathrm{PR}}=\operatorname{refl}_{\rho g}\circ\operatorname{refl}_{\rho f}$$

such that the minimizers of the optimization problem are given by $\operatorname{prox}_{\rho f}(\operatorname{fix}({\mathcal{T}_{\mathrm{PR}}}))$ . The Krasnosel’skii-Mann iteration applied to ${\mathcal{T}_{\mathrm{PR}}}$ on the auxiliary variable $z$ yields the so-called Peaceman-Rachford splitting (PRS):

$$z(k+1)=(1-\alpha)z(k)+\alpha\mathcal{T}_{\mathrm{PR}}z(k),\quad k\in\mathbb{N}\qquad(4)$$

which is guaranteed to converge to a fixed point of ${\mathcal{T}_{\mathrm{PR}}}$ if $\alpha\in(0,1)$ and $\rho>0$ , see [25]; a minimizer $\bar{x}$ to (3) is recovered from the limit $\bar{z}$ of the iterate $z(k)$ by computing $\bar{x}=\operatorname{prox}_{\rho f}(\bar{z})$ , consistently with the characterization of the minimizers given above. As shown in [25], the iteration (4) can be conveniently implemented by the following updates

[TABLE]

where $y$ is an additional auxiliary variable.
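The relaxed PRS iteration (4) can be sketched directly from the definition of ${\mathcal{T}_{\mathrm{PR}}}$ as a composition of reflective operators. The Python snippet below is a hedged illustration on an assumed scalar problem, $f(x)=\frac{1}{2}(x-c)^{2}$ and $g(x)=\lambda|x|$, whose minimizer is the soft-thresholding of $c$ at level $\lambda$; the test problem, the names and the parameter values are assumptions made for the example.

```python
def soft_threshold(x, t):
    """Proximal operator of t*|.| at x (scalar soft-thresholding)."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def peaceman_rachford(prox_f, prox_g, z0, alpha, num_iters):
    """Relaxed PRS: z(k+1) = (1 - alpha)*z(k) + alpha*T_PR(z(k)),
    where T_PR = refl_g o refl_f and refl = 2*prox - identity."""
    z = z0
    for _ in range(num_iters):
        r = 2.0 * prox_f(z) - z        # refl_{rho f}(z)
        t_pr = 2.0 * prox_g(r) - r     # refl_{rho g}(refl_{rho f}(z))
        z = (1.0 - alpha) * z + alpha * t_pr
    return z


# Assumed test problem: f(x) = 0.5*(x - c)**2 and g(x) = lam*|x|.
c, lam, rho, alpha = 3.0, 1.0, 0.5, 0.5
prox_f = lambda v: (v + rho * c) / (1.0 + rho)   # closed form for this f
prox_g = lambda v: soft_threshold(v, rho * lam)
z_bar = peaceman_rachford(prox_f, prox_g, 0.0, alpha, 300)
x_bar = prox_f(z_bar)   # minimizer recovered through prox_{rho f}
```

For these values the recovered point should agree with the soft-thresholding of $c=3$ at level $\lambda=1$, that is, $2$.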

II-C Relaxed ADMM

Consider the following optimization problem

$$\min_{x\in\mathbb{R}^{n},\,y\in\mathbb{R}^{m}}\ \{f(x)+g(y)\}\quad\text{s.t.}\quad Ax+By=c\qquad(6)$$

where $f\in\Gamma_{0}({\mathbb{R}}^{n})$ , $g\in\Gamma_{0}({\mathbb{R}}^{m})$ , $A\in{\mathbb{R}}^{p\times n}$ , $B\in{\mathbb{R}}^{p\times m}$ and $c\in{\mathbb{R}}^{p}$ . We assume that (6) admits a finite solution. The dual problem of (6) is (see [26])

$$\min_{w\in\mathbb{R}^{p}}\ \{d_{f}(w)+d_{g}(w)\}\qquad(7)$$

where

$$d_{f}(w)=f^{*}(A^{\top}w)\quad\text{and}\quad d_{g}(w)=g^{*}(B^{\top}w)-\langle w,c\rangle.$$

The relaxed alternating direction method of multipliers (R-ADMM) can be derived applying the PRS (4) to solve (7). In [22], it has been shown that an efficient implementation of (5) is characterized by the following updates, which involve the primal variables $x$ and $y$ , see also Appendix A:

[TABLE]

where (8a), (8b) implements (5a), while (8c), (8d) implements (5b). The convergence of the PRS guarantees, in turn, the convergence of $\{x(k)\}_{k\in{\mathbb{N}}}$ and $\{y(k)\}_{k\in{\mathbb{N}}}$ to an optimal solution of the primal (6). Indeed problem (6) is convex with linear constraints and strong duality holds.

The R-ADMM is a generalized version of the classical ADMM described e.g. in [20]; indeed it is possible to see that when $\alpha=1/2$ the former recovers the latter, see Remark 3 in Section III.

III R-ADMM for Distributed Optimization

In this Section we formulate the distributed optimization problem of interest and we show how the R-ADMM is suited to solve it.

III-A Problem formulation

Consider the undirected, connected graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ with $N$ nodes. We are interested in solving

$$\min_{x\in\mathbb{R}^{n}}\ \sum_{i=1}^{N}f_{i}(x)\qquad(9)$$

over the network $\mathcal{G}$ where the cost function $f_{i}\in\Gamma_{0}({\mathbb{R}}^{n})$ is known only to the $i$ -th node, and nodes can communicate only with their neighbors. We assume that (9) admits at least one finite solution.

In order to apply the R-ADMM to problem (9) we reformulate it as follows. First, a local copy $x_{i}\in{\mathbb{R}}^{n}$ of the decision variable $x$ is assigned to each agent. Therefore, as long as $\mathcal{G}$ is connected, problem (9) is equivalent to

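$$\min_{x_{i},\,i\in\mathcal{V}}\ \sum_{i=1}^{N}f_{i}(x_{i})\quad\text{s.t.}\quad x_{i}=x_{j}\ \ \forall(i,j)\in\mathcal{E}.\qquad(10)$$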

Indeed the consensus constraints $x_{i}=x_{j}$ impose that any optimal solution of (10) satisfies $x_{1}=\ldots=x_{N}=\bar{x}$ , with $\bar{x}$ a solution to (9). Introducing the bridge variables $y_{ij}$ and $y_{ji}$ for each edge $(i,j)\in\mathcal{E}$ , the consensus constraints can be equivalently rewritten as

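$$x_{i}=y_{ij},\quad x_{j}=y_{ji}\quad\text{and}\quad y_{ij}=y_{ji}\quad\forall(i,j)\in\mathcal{E}.\qquad(11)$$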

Defining the vectors ${\mathbold{x}}=[x_{1}^{\top},\ldots,x_{N}^{\top}]^{\top}\in{\mathbb{R}}^{nN}$ and ${\mathbold{y}}=[\ldots,\{y_{ij}^{\top}\}_{j\in\mathcal{N}_{i}},\ldots]^{\top}\in{\mathbb{R}}^{nM}$ , the constraints in (11) can be compactly rewritten as (hereafter, boldface letters denote vectors and matrices built by stacking local quantities)

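$$\mathbold{A}\mathbold{x}+\mathbold{B}\mathbold{y}=0\quad\text{and}\quad\mathbold{y}=\mathbold{P}\mathbold{y}$$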

where

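$$\mathbold{A}=\begin{bmatrix}\mathbf{1}_{d_{1}}&\mathbf{0}_{d_{1}}&\cdots&\mathbf{0}_{d_{1}}\\ \mathbf{0}_{d_{2}}&\mathbf{1}_{d_{2}}&\cdots&\mathbf{0}_{d_{2}}\\ \vdots&&\ddots&\vdots\\ \mathbf{0}_{d_{N}}&\mathbf{0}_{d_{N}}&\cdots&\mathbf{1}_{d_{N}}\end{bmatrix}\otimes I_{n}\in\mathbb{R}^{nM\times nN},$$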

${\mathbold{B}}=-{\mathbold{I}}_{nM}$ , ${\mathbold{P}}$ is a permutation matrix that swaps $y_{ij}$ with $y_{ji}$ . We remark that ${\mathbold{A}}$ in general is not full row rank.
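To make the stacked constraint matrices concrete, the following Python sketch builds ${\mathbold{A}}$ and ${\mathbold{P}}$ for scalar variables ($n=1$) from an edge list. It is only an illustration: nodes are labeled from $0$ rather than $1$, the bridge variables are ordered node by node with neighbors sorted, and the helper names are hypothetical.

```python
def build_consensus_matrices(num_nodes, edges):
    """Return (A, P, index) for scalar variables (n = 1).

    index maps the ordered pair (i, j) to the row of the bridge variable
    y_ij; rows are ordered node by node (an assumed convention).
    """
    neighbors = {i: [] for i in range(num_nodes)}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    index = {}
    for i in range(num_nodes):
        for j in sorted(neighbors[i]):
            index[(i, j)] = len(index)
    m = len(index)  # M = 2|E|
    A = [[0] * num_nodes for _ in range(m)]
    P = [[0] * m for _ in range(m)]
    for (i, j), row in index.items():
        A[row][i] = 1               # row of A x that must equal y_ij
        P[row][index[(j, i)]] = 1   # (P y)_ij = y_ji
    return A, P, index


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


# Path graph 0 - 1 - 2 (illustrative): M = 4 bridge variables.
A, P, index = build_consensus_matrices(3, [(0, 1), (1, 2)])
y_consensus = matvec(A, [5, 5, 5])   # all local copies agree
y_disagree = matvec(A, [1, 2, 3])    # local copies differ
```

With ${\mathbold{y}}={\mathbold{A}}{\mathbold{x}}$, the condition ${\mathbold{y}}={\mathbold{P}}{\mathbold{y}}$ holds for `y_consensus` and fails for `y_disagree`, which is how the two constraints together encode consensus on a connected graph.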

Finally, define $f({\mathbold{x}})=\sum_{i=1}^{N}f_{i}(x_{i})$ and $g({\mathbold{y}})=\iota_{({\mathbold{I}}-{\mathbold{P}})}({\mathbold{y}})$ , where the indicator function $\iota_{({\mathbold{I}}-{\mathbold{P}})}({\mathbold{y}})$ is equal to $0$ if $({\mathbold{I}}-{\mathbold{P}}){\mathbold{y}}={\mathbf{0}}$ and $+\infty$ otherwise. Hence problem (10) can be equivalently formulated as

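$$\min_{\mathbold{x},\mathbold{y}}\ \{f(\mathbold{x})+\iota_{(\mathbold{I}-\mathbold{P})}(\mathbold{y})\}\quad\text{s.t.}\quad\mathbold{A}\mathbold{x}-\mathbold{y}=0.\qquad(12)$$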

Problem (12) is in the form of (6) and thus we can apply the R-ADMM algorithm to solve it.

III-B Distributed R-ADMM

The particular separable structure of the functions $f({\mathbold{x}})$ , $g({\mathbold{y}})$ , and of matrices ${\mathbold{A}}$ , ${\mathbold{B}}$ , allows us to derive simplified equations for the R-ADMM algorithm that involve only the update of the ${\mathbold{x}}$ and ${\mathbold{z}}$ variables, and that are amenable to distributed implementation.

Indeed, it can be shown that the equations (8) applied to problem (12) reduce to

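$$x_{i}(k+1)=\operatorname*{arg\,min}_{x_{i}}\bigg\{f_{i}(x_{i})-\Big\langle\sum_{j\in\mathcal{N}_{i}}z_{ij}(k),x_{i}\Big\rangle+\frac{\rho d_{i}}{2}\left\lVert x_{i}\right\rVert^{2}\bigg\}\qquad(13a)$$

$$z_{ij}(k+1)=(1-\alpha)z_{ij}(k)-\alpha z_{ji}(k)+2\alpha\rho\,x_{j}(k+1)\qquad(13b)$$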

for all $i\in\mathcal{V}$ and $j\in\mathcal{N}_{i}$ . See Appendix C-A for the derivation. Observe that, since ${\mathbold{B}}=-{\mathbold{I}}$ , the dimension of ${\mathbold{z}}$ is equal to the dimension of ${\mathbold{y}}$ , i.e., for all $(i,j)\in\mathcal{E}$ there are the variables $z_{ij}$ and $z_{ji}$ . Interestingly, one can see that (13a) can be rewritten as

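$$x_{i}(k+1)=\operatorname{prox}_{f_{i}/(\rho d_{i})}\left([\mathbold{A}^{\top}\mathbold{z}(k)]_{i}/(\rho d_{i})\right),\qquad(14)$$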

where $[{\mathbold{A}}^{\top}{\mathbold{z}}(k)]_{i}$ denotes the $i$ -th component of the vector $[{\mathbold{A}}^{\top}{\mathbold{z}}(k)]$ , see Lemma 1 in Appendix C-B. A straightforward implementation of (13) has the node $i$ storing and updating $x_{i}$ and $\{z_{ij}\}_{j\in\mathcal{N}_{i}}$ . Notice that, while (13a) can be computed using only local information, i.e., the local cost $f_{i}$ and $\{z_{ij}\}_{j\in\mathcal{N}_{i}}$ , update (13b) requires communication with $i$ ’s neighbors, that is, transmission of $z_{ji}(k)$ and $x_{j}$ from node $j\in\mathcal{N}_{i}$ . In particular we assume that node $j$ sends to node $i$ the packet

$$q_{j\to i}=-z_{ji}(k)+2\rho x_{j}(k+1)$$

and, consequently, node $i$ performs the update

$$z_{ij}(k+1)=(1-\alpha)z_{ij}(k)+\alpha q_{j\to i}.$$

Algorithm 1 describes the implementation of the distributed R-ADMM.
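The local update (14), the packet $q_{j\to i}$ and the update of $z_{ij}$ can be sketched in a few lines of Python. The snippet below is a hedged, synchronous and lossless illustration rather than the paper's Algorithm 1 itself: it assumes scalar quadratic costs $f_{i}(x)=\frac{a_{i}}{2}(x-b_{i})^{2}$, for which the proximal step has the closed form $x_{i}=(a_{i}b_{i}+\sum_{j\in\mathcal{N}_{i}}z_{ij})/(a_{i}+\rho d_{i})$, and all names and numerical values are assumptions.

```python
def distributed_radmm(neighbors, a, b, rho, alpha, num_iters):
    """Synchronous, lossless distributed R-ADMM sketch for the assumed
    scalar costs f_i(x) = 0.5*a[i]*(x - b[i])**2."""
    nodes = list(neighbors)
    # Node i stores x_i and one auxiliary variable z_ij per neighbor j.
    z = {(i, j): 0.0 for i in nodes for j in neighbors[i]}
    x = {i: 0.0 for i in nodes}
    for _ in range(num_iters):
        # Local proximal update (closed form for quadratic costs).
        for i in nodes:
            d_i = len(neighbors[i])
            z_sum = sum(z[(i, j)] for j in neighbors[i])
            x[i] = (a[i] * b[i] + z_sum) / (a[i] + rho * d_i)
        # Packet sent by node j to node i: q_{j->i} = -z_ji(k) + 2*rho*x_j(k+1).
        q = {(j, i): -z[(j, i)] + 2.0 * rho * x[j]
             for j in nodes for i in neighbors[j]}
        # Relaxed update of the auxiliary variables held by node i.
        for i in nodes:
            for j in neighbors[i]:
                z[(i, j)] = (1.0 - alpha) * z[(i, j)] + alpha * q[(j, i)]
    return x


# Illustrative ring of four nodes (labels start at 0 in this sketch).
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
a = [1.0, 2.0, 3.0, 4.0]
b = [4.0, -1.0, 2.0, 0.5]
x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(a)  # minimizer of the sum
x_final = distributed_radmm(neighbors, a, b, rho=1.0, alpha=0.5, num_iters=2000)
```

For these quadratic costs the minimizer of the sum is the weighted average `x_star`, so every entry of `x_final` should be close to it.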

The following convergence result is a direct consequence of the convergence of the Peaceman-Rachford splitting, proved e.g. in [25, Th. 26.11].

Proposition 1.

Consider problem (9) with $f_{i}\in\Gamma_{0}({\mathbb{R}}^{n})$ , and let $0<\alpha<1$ and $\rho>0$ . Then, for any initial condition ${\mathbold{z}}(0)\in{\mathbb{R}}^{nM}$ , the trajectories $k\mapsto x_{i}(k)$ , $i\in\mathcal{V}$ , generated by Algorithm 1, converge to an optimal solution $\bar{x}$ of (9), i.e.,

$$\lim_{k\to\infty}x_{i}(k)=\bar{x},\quad\forall i\in\mathcal{V}.$$

Remark 1.

Notice that the statement of Proposition 1 considers only the initial condition of variable ${\mathbold{z}}$ and not of ${\mathbold{x}}$ . The reason is related to update (13a) where it is clear that ${\mathbold{x}}(1)$ depends only on ${\mathbold{z}}(0)$ and not on ${\mathbold{x}}(0)$ .

Remark 2 (Comparison with ARock [26]).

The formulation of the R-ADMM presented in Algorithm 1 is derived using the same idea employed in [26] of interpreting the R-ADMM as an application of the PRS to the dual problem. Also the R-ADMM algorithm proposed in [26] to solve problem (1) (see section 2.6.2), involves only the use of variables $x_{i}$ , $z_{ij}$ but the actual implementation differs from Algorithm 1. Additionally, it is worth mentioning that the authors of [26] have derived the R-ADMM within the framework of the ARock algorithm, introduced in the context of parallel computing where agents share a common memory. In [26] and [36] it is shown that the ARock framework successfully handles asynchronous updates and (possibly unbounded) delayed information. However, due to the reliance of the convergence proof on the common memory, ARock is not suitable to deal with the lossy and asynchronous framework of interest in this paper.

III-C Linear local convergence for strongly convex costs

In this section we prove the local linear convergence of the distributed R-ADMM under the assumption that the local costs are strongly convex and twice continuously differentiable. Notice that, under these assumptions there exists a unique minimizer $x^{*}$ for problem (9).

It is worth stressing that for particular distributed and centralized formulations of the R-ADMM (see [31, 32, 23]), especially of the classical ADMM, it is actually possible to prove global linear convergence under milder assumptions than the ones made in this section. However the results in [31, 32, 23] can be applied only partially to the scenario of our interest and we refer to Remark 5 for a detailed discussion.

The idea behind the result of Proposition 2 below is that, in a neighborhood of the optimal solution, the strong convexity and double continuous differentiability of the local costs allow us to rewrite the Peaceman-Rachford splitting applied to the dual of the distributed problem as a perturbed affine operator. Indeed, it is possible to write the update of the auxiliary variables in compact form as

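$$\mathbold{z}(k+1)=\mathbold{T}\mathbold{z}(k)+\mathbold{u}+\mathbold{o}^{\prime}(\mathbold{x}(k+1)-\mathbold{x}^{*})\qquad(16)$$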

where ${\mathbold{x}}^{*}={\mathbf{1}}_{N}\otimes x^{*}$ , and ${\mathbold{T}}\in{\mathbb{R}}^{nM\times nM}$ is such that

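$$\mathbold{T}=(1-\alpha)\mathbold{I}-\alpha\mathbold{P}+2\alpha\rho\mathbold{P}\mathbold{A}\mathbold{H}^{-1}\mathbold{A}^{\top},$$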

with ${\mathbold{H}}=\operatorname{blk\,diag}\left\{\rho d_{i}I_{n}+\nabla^{2}f_{i}(x^{*})\right\}$ , ${\mathbold{u}}\in{\mathbb{R}}^{nM}$ is a constant vector depending on the gradient and Hessian of $f({\mathbold{x}})$ evaluated at ${\mathbold{x}}^{*}$ , and ${\mathbold{o}}^{\prime}:{\mathbb{R}}^{nN}\to{\mathbb{R}}^{nM}$ is a vanishing function for ${\mathbold{x}}$ approaching the optimum, that is,

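$$\frac{\lVert\mathbold{o}^{\prime}(\mathbold{x}(k+1)-\mathbold{x}^{*})\rVert}{\lVert\mathbold{x}(k+1)-\mathbold{x}^{*}\rVert}\to 0$$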

as ${\mathbold{x}}(k+1)\to{\mathbold{x}}^{*}$ . All the details can be found in Appendix C-B.

In Lemma 2 in Appendix C-B, it is established that the eigenvalues of ${\mathbold{T}}$ are either equal to $1$ or strictly inside the unit circle, with the eigenvalues equal to $1$ all being semi-simple. The largest (in absolute value) eigenvalue of ${\mathbold{T}}$ different from $1$ is an upper bound to the convergence rate of the $x_{i}$ ’s trajectories toward the optimum. This fact is formally stated in the following Proposition.

Proposition 2.

Assume that the local costs $f_{i}$ are strongly convex and twice continuously differentiable. Then there exists $\epsilon>0$ such that, if $\operatorname{dist}({\mathbold{z}}(0),\operatorname{fix}({\mathcal{T}_{\mathrm{PR}}}))\leq\epsilon$ , then Algorithm 1 converges linearly fast, i.e.

$$\lVert x_{i}(k)-x^{*}\rVert\leq C\gamma^{k},\quad i\in\mathcal{V},$$

with $C>0$ , and $0\leq\gamma\leq\gamma_{\mathrm{M}}<1$ , where $\gamma_{\mathrm{M}}$ is the largest eigenvalue of ${\mathbold{T}}$ different from one, i.e.

[TABLE]

Proof.

See Appendix C-B. ∎

While we refer to Appendix C-B for the proof of this result, hereafter we comment on some interesting details. These results will also play an important role in the analysis performed in Section IV in the presence of asynchronous updates and lossy communications, which is the main novelty of this paper.

The proof of Proposition 2 relies on the following two facts:

(i) the set $\operatorname{fix}({\mathcal{T}_{\mathrm{PR}}})$ is an affine space such that, given any two fixed points $\bar{{\mathbold{z}}},\bar{{\mathbold{z}}}^{\prime}\in\operatorname{fix}({\mathcal{T}_{\mathrm{PR}}})$ , it holds $\bar{{\mathbold{z}}}^{\prime}-\bar{{\mathbold{z}}}\in\ker({\mathbold{A}}^{\top})$ ;

(ii) exploiting (14) and the fact that $\operatorname{prox}_{f_{i}/(\rho d_{i})}$ , $i\in\mathcal{V}$ , is contractive, it is possible to upper bound the primal error ${\mathbold{x}}(k+1)-{\mathbold{x}}^{*}$ with the auxiliary error ${\mathbold{z}}(k)-\bar{{\mathbold{z}}}$ , for any $\bar{{\mathbold{z}}}\in\operatorname{fix}({\mathcal{T}_{\mathrm{PR}}})$ . Specifically, we have that

[TABLE]

where $\zeta\in(0,1)$ is a suitable constant depending on the curvature of the $f_{i}$ ’s and on the topology of $\mathcal{G}$ ; for more details see (45) and (46) in Appendix C-B.

Since ${\mathbold{z}}(k)$ converges to a fixed point then, from (18) and fact (ii), we have ${\mathbold{x}}(k)\to{\mathbold{x}}^{*}$ . As observed in Appendix C-B, if all the $f_{i}$ were quadratic functions, then (16) would reduce to the linear update ${\mathbold{z}}(k+1)={\mathbold{T}}{\mathbold{z}}(k)+{\mathbold{u}}$ . The convergence would thus be global, linear and with rate upper bounded by $\gamma_{\mathrm{M}}$ .

In the proof of Proposition 2 we show that this linear convergence is not deteriorated by the presence of the nonlinear term ${\mathbold{o}}^{\prime}({\mathbold{x}}(k+1)-{\mathbold{x}}^{*})$ , though the price to be paid is that the convergence is not guaranteed to be global but only local.

Some remarks are now in order to better cast Algorithm 1 within the existing literature. In particular Remarks 3 and 4 discuss the connection with Augmented Lagrangian-based and node-/edge-based formulations, respectively. Remark 5 provides a further discussion on the convergence rate.

Remark 3 (Lagrangian based R-ADMM).

The R-ADMM described in section II-C can be interpreted in the framework of augmented Lagrangian methods. Indeed, define the augmented Lagrangian

[TABLE]

where ${\mathbold{w}}$ is the Lagrange multipliers’ vector. Then, the R-ADMM in (8) is equivalent to the following updates, see [22] and Appendix B:

[TABLE]

In particular, given ${\mathbold{x}}(0),{\mathbold{y}}(0),{\mathbold{w}}(0)$ , if ${\mathbold{z}}(0)={\mathbold{w}}(0)-\rho(2\alpha-1)({\mathbold{A}}{\mathbold{x}}(0)+{\mathbold{B}}{\mathbold{y}}(0)-{\mathbold{c}})-\rho({\mathbold{B}}{\mathbold{y}}(0)-{\mathbold{c}})$ , then the ${\mathbold{x}}$ and ${\mathbold{y}}$ trajectories generated by (8) and (20) coincide, see Appendix B. Observe that, if $\alpha=1/2$ , we recover the classical ADMM described e.g. in [20]. The choice of analyzing the more general R-ADMM relies on the fact that, by properly tuning the parameter $\alpha$ , we can achieve better performance than the classical ADMM as observed e.g. in [42], proved in [23], and evidenced by the numerical results in Section V. Interestingly, the augmented Lagrangian-based R-ADMM in (20) is also amenable to a distributed implementation when applied to (12), which is described by the following updates

[TABLE]

Clearly, this formulation requires each node $i\in\mathcal{V}$ to store the variables $x_{i}$ , $\{y_{ij}\}_{j\in\mathcal{N}_{i}}$ and $\{w_{ij}\}_{j\in\mathcal{N}_{i}}$ , and to update them exchanging information only with its neighbors. Therefore in terms of storage requirement Algorithm 1 is better than the augmented Lagrangian formulation, as evidenced by Table I.

Remark 4 (Node- and edge-based ADMM).

In Algorithm 1 and in the corresponding Lagrangian-based formulation, the number of variables that each node stores scales with $d_{i}$ , see Table I. This is due to the fact that each node stores an auxiliary variable for each of the edges it is part of – hence the name edge-based – incurring a worst-case memory requirement of $O(N)$ . A different formulation of the R-ADMM, called node-based, can be given, in order to guarantee that the local storage requirement is constant, i.e. $O(1)$ . Node-based formulations of the classical ADMM are employed e.g. in [28, 31, 32]. Notice that Algorithm 1 can be reformulated as a node-based method if each node $i\in\mathcal{V}$ stores and updates the variables $z_{i}^{\prime}=\sum_{j\in\mathcal{N}_{i}}z_{ij}$ and $z_{i}^{\prime\prime}=\sum_{j\in\mathcal{N}_{i}}z_{ji}$ instead of the $\{z_{ij}\}_{j\in\mathcal{N}_{i}}$ auxiliary variables.

However, as observed in [35], in general node-based ADMM formulations are not robust to packet losses. Indeed, the convergence of a node-based ADMM is guaranteed only if, at each iteration $k$ , the graph resulting from the removal of faulty edges is still connected. Edge-based formulations are instead necessary in order to remove this (rather demanding) assumption, such as the one proposed in [35] to handle (uniformly distributed) packet losses, and Algorithm 2 proposed in this paper. Intuitively, the use of bridge variables is necessary in order to keep track of the packets received at any given time from each of the neighbors.

Remark 5 (Further discussion on the convergence rate).

In recent years there has been an increasing interest in characterizing the convergence rate of both centralized and distributed implementations of R-ADMM and classical ADMM. A special effort has been devoted to providing conditions under which the convergence is guaranteed to be linear. It is worth summarizing some of the main results and comparing them with Proposition 2. Interestingly, the distributed implementations of the standard ADMM introduced in [31] and [32] have been shown to attain global and linear convergence provided that the Lagrange multipliers satisfy some particular initializations and under the assumptions that the local costs are strongly convex and with Lipschitz continuous gradient. Though these assumptions are milder than those made in Proposition 2, the results in [31] and [32], when interpreted in the context of the more general R-ADMM, are valid only for the case $\alpha=1/2$ , while the result in section III-C holds true for any $\alpha$ within the interval $(0,1)$ . Moreover, for the case $\alpha=1/2$ , the analysis performed in [31] could be mimicked for the distributed Lagrangian-based implementation described in (21), thus obtaining a global linear rate when $\alpha=1/2$ also for the algorithm proposed in this paper.

Concerning the general Peaceman-Rachford splitting applied to (6), the authors of [23] have shown that the R-ADMM algorithm converges linearly to an optimal solution provided that the matrix ${\mathbold{A}}$ is full row rank, which guarantees that the Peaceman-Rachford operator is contractive. Under the same assumption, the linear convergence extends also to randomized updates [38]. However, in the distributed optimization scenario of interest ${\mathbold{A}}$ is not full row rank, since $M>N$ and therefore the aforementioned convergence results cannot be applied. In particular, the loss of row rank for ${\mathbold{A}}$ implies that the dual function $d_{f}$ is only convex but not strongly convex, and, hence that the Peaceman-Rachford operator is only nonexpansive. A notable exception arises when adopting master-slave architectures, which are characterized by a master node connected to $N$ other nodes (the slaves). Indeed, in this setting only one bridge variable is introduced for any edge, and thus ${\mathbold{A}}={\mathbold{I}}_{nN}$ , which is full row rank. The implementation of the ADMM in this setup envisions the slave nodes performing local updates of the primal variables, and the master node updating the $N$ dual variables.

IV Asynchronous Distributed R-ADMM over Lossy Networks

Algorithm 1 works under the standing assumption that the communication channels are reliable and that the nodes update synchronously. The goal of this Section is to relax these requirements and to show how Algorithm 1 can be modified to still guarantee convergence, under probabilistic assumptions on communication failures and asynchronous updates, and to characterize its linear convergence in mean.

IV-A Robust and Asynchronous R-ADMM

Consider Algorithm 1, and notice that node $i$ at iteration $k$ receives the packet $q_{j\to i}$ from $j\in\mathcal{N}_{i}$ only if the two following conditions are satisfied: (i) node $j$ performs an update of $x_{j}$ at iteration $k$ ; and (ii) the packet $q_{j\to i}$ is not lost.

Now, for any $k=0,1,\ldots$ , let us define the set of random variables $\{\mu_{i}(k)\}_{i\in\mathcal{V}}$ , such that the realization of $\mu_{i}(k)$ is $1$ if node $i$ performs an update during the $k$ -th iteration, $0$ otherwise. Similarly, provided that $\mu_{i}(k)=1$ , we define the set of variables $\{\lambda_{i\to j}(k)\}_{i\in\mathcal{V},j\in\mathcal{N}_{i}}$ such that the realization of $\lambda_{i\to j}(k)$ is $0$ if $q_{i\to j}$ is delivered to $j$ , $1$ otherwise. Within this formalism, we see that node $i$ can carry out an update of $z_{ij}$ at iteration $k$ provided that $\mu_{j}(k)=1$ and $\lambda_{j\to i}(k)=0$ . To simplify the theoretical analysis, we define the set of random variables $\{\beta_{ij}\}_{i\in\mathcal{V},j\in\mathcal{N}_{i}}$ such that

[TABLE]

We make the following probabilistic Assumption on the variables $\beta_{ij}$ .

Assumption 1.

The random variables $\{\beta_{ij}(k)\,:\,\,i\in\mathcal{V},\,j\in\mathcal{N}_{i},\,k\in{\mathbb{N}}\}$ are mutually independent over $k$ , namely, given $\beta_{ij}(k)$ and $\beta_{hl}(\ell)$ for any $(i,j),(h,l)\in\mathcal{E}$ , they are independent if $k\neq\ell$ . Moreover, there exists an $M$ -tuple $\{p_{ij}\,:\,i\in\mathcal{V},\,j\in\mathcal{N}_{i},\,0<p_{ij}<1\}$ such that

[TABLE]

for all $k\in{\mathbb{N}}$ .

Observe that Assumption 1 requires only independence over time, but not among the random variables at the same iteration $k$ . Moreover, as a consequence of (22), each variable $z_{ij}$ has a nonzero probability of being updated at each iteration $k$ . Assumption 1 could have been stated equivalently in terms of $\mu_{i}$ and $\lambda_{i\to j}$ , assuming nonzero probabilities for the occurrence of update and packet delivery events, and mutual independence over time.

Remark 6 (Uniform probabilities).

Assume the random variables $\{\mu_{i}(k)\}_{i\in\mathcal{V}}$ and $\{\lambda_{i\to j}(k)\}_{i\in\mathcal{V},j\in\mathcal{N}_{i}}$ are i.i.d., such that $\mathbb{E}\left[\mu_{i}(k)\right]=p_{\mu}$ and $\mathbb{E}\left[\lambda_{i\to j}(k)\right]=p_{\lambda}$ , for all $i\in\mathcal{V}$ , $j\in\mathcal{N}_{i}$ , and $k\in{\mathbb{N}}$ . Then $\{\beta_{ij}(k)\}_{i\in\mathcal{V},j\in\mathcal{N}_{i}}$ are uniformly distributed with probability $p_{\beta}=p_{\mu}(1-p_{\lambda})$ , but in general are not independent, since $\{\beta_{ij}(k)\}_{j\in\mathcal{N}_{i}}$ all depend on $\mu_{i}(k)$ .
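The dependence noted in Remark 6 can be checked by exact enumeration. The following is only an illustrative sketch: it assumes two packets sent by the same node, so that the two indicators share a single activation variable, and it takes each indicator to be $1$ exactly when the sender is active and the packet is not lost. The function name `beta_probabilities` and the numerical values are hypothetical and do not come from the paper.

```python
from itertools import product


def beta_probabilities(p_mu, p_lambda):
    """Exact marginal and joint probabilities for two packets sharing one sender.

    mu = 1 means the sender is active (probability p_mu); lam = 1 means the
    packet is lost (probability p_lambda). An indicator equals 1 when the
    sender is active and the corresponding packet is delivered.
    """
    p_first = 0.0  # P(beta_1 = 1)
    p_both = 0.0   # P(beta_1 = 1 and beta_2 = 1)
    for mu, lam1, lam2 in product((0, 1), repeat=3):
        prob = p_mu if mu else 1.0 - p_mu
        prob *= p_lambda if lam1 else 1.0 - p_lambda
        prob *= p_lambda if lam2 else 1.0 - p_lambda
        beta1 = mu * (1 - lam1)
        beta2 = mu * (1 - lam2)
        p_first += prob * beta1
        p_both += prob * beta1 * beta2
    return p_first, p_both


p_beta, p_joint = beta_probabilities(p_mu=0.8, p_lambda=0.3)
```

With these values the marginal equals $p_{\mu}(1-p_{\lambda})=0.56$ , while the joint probability is $p_{\mu}(1-p_{\lambda})^{2}=0.392$ , which differs from $0.56^{2}=0.3136$ ; the two indicators are therefore identically distributed but not independent.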

In Algorithm 2 we describe the modified version of Algorithm 1 that can handle asynchronous updates and packet losses. If node $i$ at iteration $k$ is selected, then it updates $x_{i}$ and computes the variables $q_{i\to j}$ , $j\in\mathcal{N}_{i}$ , transmitting them to its neighbors. If node $j$ receives $q_{i\to j}$ then it updates the variable $z_{ji}$ , otherwise it leaves it unchanged.

Notice that node $i$ updates the variable $z_{ij}$ only if it receives the packet $q_{j\to i}$ . Making use of the random variables $\beta_{ij}$ , we can thus describe the update step for the auxiliary variables in the following compact form

[TABLE]
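The behavior just described can be sketched for scalar quadratic costs. The code below is not the paper's Algorithm 2 verbatim: it assumes costs $f_{i}(x)=(a_{i}/2)x^{2}-b_{i}x$ , and it assumes that the packet is $q_{i\to j}=-z_{ij}+2\rho x_{i}$ and that the receiver sets $z_{ji}\leftarrow(1-\alpha)z_{ji}+\alpha q_{i\to j}$ , a form that is consistent with the matrix ${\mathbold{T}}$ given in Appendix C-B. The function name `async_radmm`, the graph, and all numerical values are hypothetical.

```python
import random


def async_radmm(edges, a, b, rho=1.0, alpha=0.5, p_mu=0.8, p_lambda=0.3,
                iters=3000, seed=0):
    """Scalar costs f_i(x) = 0.5*a[i]*x**2 - b[i]*x on an undirected graph."""
    rng = random.Random(seed)
    nbrs = {i: [] for i in range(len(a))}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    # z[i, j] is the auxiliary variable stored by node i for neighbor j.
    z = {(i, j): 0.0 for i in nbrs for j in nbrs[i]}
    x = [0.0] * len(a)
    for _ in range(iters):
        delivered = {}
        for i in nbrs:
            if rng.random() >= p_mu:
                continue  # node i is not active at this iteration
            d_i = len(nbrs[i])
            # Closed-form minimizer of f_i(x) - (sum_j z_ij) x + (rho d_i / 2) x^2.
            x[i] = (b[i] + sum(z[i, j] for j in nbrs[i])) / (a[i] + rho * d_i)
            for j in nbrs[i]:
                if rng.random() >= p_lambda:  # delivered with prob. 1 - p_lambda
                    delivered[i, j] = -z[i, j] + 2.0 * rho * x[i]
        # Receivers update only the variables whose packet arrived.
        for (i, j), packet in delivered.items():
            z[j, i] = (1.0 - alpha) * z[j, i] + alpha * packet
    return x


edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
a, b = [1.0, 2.0, 3.0, 4.0], [1.0, 0.0, -1.0, 2.0]
x_final = async_radmm(edges, a, b)
x_star = sum(b) / sum(a)  # minimizer of the sum of the quadratic costs
```

All packets are computed from the auxiliary variables available at the start of the iteration and applied afterwards, so that simultaneous updates of several nodes do not interfere with each other.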

The following convergence result holds as a consequence of the convergence of the Peaceman-Rachford splitting with random coordinate updates, see [43, 30].

Proposition 3.

Consider problem (9) with $f_{i}\in\Gamma_{0}({\mathbb{R}}^{n})$ . Assume Assumption 1 holds, and let $0<\alpha<1$ and $\rho>0$ . Then for any initial condition ${\mathbold{z}}(0)\in{\mathbb{R}}^{nM}$ , the trajectories $k\mapsto x_{i}(k)$ , $i\in\mathcal{V}$ , generated by Algorithm 2 converge almost surely to an optimal solution $\bar{x}$ of (9), that is,

[TABLE]

Proof.

See Appendix D-A. ∎

IV-B Mean linear convergence for strongly convex costs

In this section, under the assumption that the local functions $f_{i}$ are strongly convex, we prove that the mean convergence of Algorithm 2 is locally linear. Moreover we provide an upper bound to the convergence rate.

In the scenario of Assumption 1, it is not guaranteed that the auxiliary variables $z_{ij}$ are updated at each iteration $k$ . Indeed, by introducing the random diagonal matrix ${\mathbold{B}}(k)\,\in\,{\mathbb{R}}^{nM\times nM}$ such that

[TABLE]

we can rewrite (16) as

[TABLE]

where $\hat{{\mathbold{T}}}(k):={\mathbold{I}}-{\mathbold{B}}(k)({\mathbold{I}}-{\mathbold{T}})$ , with ${\mathbold{T}}$ defined in (17). This allows us to interpret Algorithm 2 as the application of a randomized and perturbed affine operator.

The goal of this section is to evaluate the behavior of the mean error $\mathbb{E}[\left\lVert{\mathbold{x}}(k)-{\mathbold{x}}^{*}\right\rVert]$ as $k\to\infty$ and, in particular, showing that it converges to zero linearly. The following inequality holds, see Appendix D-B for the proof:

[TABLE]

Next we show that the square of the first term on the right-hand side of (IV-B) converges to zero linearly as $k\to\infty$ . Notice that this term can be rewritten as

[TABLE]

if $k\geq 1$ , otherwise ${\mathbold{\Delta}}(0)={\mathbold{A}}{\mathbold{A}}^{\top}$ . A simple recursive argument shows that

[TABLE]

that is, ${\mathbold{\Delta}}(k)$ is the evolution of a linear dynamical system which can be written in the form

[TABLE]

where $\mathcal{L}\,:\,{\mathbb{R}}^{nM\times nM}\to{\mathbb{R}}^{nM\times nM}$ is defined by

[TABLE]

The spectral properties of $\mathcal{L}$ have been characterized in Lemma 3 in Appendix D-C. In particular the following two facts have been established. First, if ${\mathbold{T}}$ has $H$ semi-simple eigenvalues in $1$ then $\mathcal{L}$ has $H^{2}$ semi-simple eigenvalues in $1$ , while all the other eigenvalues are strictly inside the unit circle. Second, the matrix ${\mathbold{A}}{\mathbold{A}}^{\top}$ belongs to the eigenspace generated by the eigenvectors corresponding to the eigenvalues strictly smaller than $1$ . These two facts directly imply the following result.

Proposition 4.

Consider (26) with ${\mathbold{\Delta}}(0)={\mathbold{A}}{\mathbold{A}}^{\top}$ . Then, there exists $C^{\prime}>0$ such that

[TABLE]

where

[TABLE]

The previous Proposition states that the convergence rate to zero of $\boldsymbol{\Delta}(k)$ is upper bounded by the largest eigenvalue in absolute value of $\mathcal{L}$ different from $1$ . One can show that $\bar{\gamma}_{\mathrm{M}}$ is a suitable upper-bound also for the convergence rate to zero of the second term in the right-hand side of (IV-B). Indeed, the following Proposition holds true in a neighborhood of the optimal solution.

Proposition 5.

Assume that the local costs $f_{i}$ are strongly convex and twice continuously differentiable, and that Assumption 1 holds. Then there exists $\epsilon>0$ such that, if $\operatorname{dist}({\mathbold{z}}(0),\operatorname{fix}({\mathcal{T}_{\mathrm{PR}}}))\leq\epsilon$ , then Algorithm 2 converges linearly – in mean – to the optimal solution, i.e.,

[TABLE]

with $C>0$ , and $0\leq\gamma\leq\sqrt{\bar{\gamma}_{\mathrm{M}}}<1$ .

Proof.

See Appendix D-C. ∎

All the details of the derivation can be found in Appendix D-C. The next Proposition provides a matrix characterization of the operator $\mathcal{L}$ that can be used to compute $\bar{\gamma}_{\mathrm{M}}$ .

Proposition 6.

The linear operator $\mathcal{L}$ can be equivalently described by the following matrix

[TABLE]

which is equal to

[TABLE]

Hence

[TABLE]

Proof.

See Appendix D-D. ∎

Observe that, from Assumption 1, it follows that $\mathbb{E}[{\mathbold{B}}(0)]$ and $\mathbb{E}[{\mathbold{B}}(0)\otimes{\mathbold{B}}(0)]$ are both diagonal matrices. In particular, when considering the uniform scenario introduced in Remark 6, the computation of ${\mathbold{L}}$ simplifies to

[TABLE]

since $\mathbb{E}[{\mathbold{B}}(0)]=p_{\beta}{\mathbold{I}}$ , and where the diagonal elements of $\mathbb{E}[{\mathbold{B}}(0)\otimes{\mathbold{B}}(0)]$ are given by

[TABLE]

We conclude this Section with the following Remarks that emphasize some interesting properties of Algorithm 2.

Remark 7 (Quadratic case: global linear convergence).

If the local costs are quadratic, then the linear convergence results of Propositions 2 and 5 hold globally. This is a consequence of the fact that the auxiliary variable update (24) characterizing the proposed algorithm becomes a (randomized) affine update:

[TABLE]
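A small numerical sketch may help to visualize the randomized affine update. It assumes the update takes the form ${\mathbold{z}}(k+1)=\hat{{\mathbold{T}}}(k){\mathbold{z}}(k)+{\mathbold{B}}(k){\mathbold{u}}$ with $\hat{{\mathbold{T}}}(k)={\mathbold{I}}-{\mathbold{B}}(k)({\mathbold{I}}-{\mathbold{T}})$ , which follows from the definition of $\hat{{\mathbold{T}}}(k)$ when the residual term vanishes. The matrix `T`, the vector `u` and the function names are hypothetical placeholders with $n=1$ , not the quantities of any specific network.

```python
def affine_step(T, u, z):
    """Full update z -> T z + u, with T given as a list of rows."""
    return [sum(t * zc for t, zc in zip(row, z)) + ur for row, ur in zip(T, u)]


def randomized_step(T, u, z, beta):
    """Update only the coordinates with beta[r] == 1 and keep the others."""
    full = affine_step(T, u, z)
    return [f if b_r else z_r for f, z_r, b_r in zip(full, z, beta)]


def randomized_step_matrix(T, u, z, beta):
    """Same step written as T_hat z + B u, with T_hat = I - B (I - T)."""
    n = len(z)
    eye = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    T_hat = [[eye[r][c] - beta[r] * (eye[r][c] - T[r][c]) for c in range(n)]
             for r in range(n)]
    Bu = [beta[r] * u[r] for r in range(n)]
    return affine_step(T_hat, Bu, z)


T = [[0.5, 0.25], [0.25, 0.5]]
u = [1.0, -1.0]
z = [2.0, 4.0]
beta = [1, 0]  # only the first coordinate receives its packet

z_coordinate = randomized_step(T, u, z, beta)
z_matrix = randomized_step_matrix(T, u, z, beta)
```

Both functions return the same vector: the first coordinate moves to its value under ${\mathbold{T}}{\mathbold{z}}+{\mathbold{u}}$ , while the second one, whose packet is missing, is left unchanged.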

Remark 8 (Convergence of randomized (R-)ADMM).

Owing to the operator theoretical interpretation of the R-ADMM, its convergence in the presence of asynchronous updates and packet losses can be guaranteed almost surely for all choices of initial conditions and of the free parameters $\alpha$ and $\rho$ . On the other hand, proving convergence of the augmented Lagrangian-based interpretation of R-ADMM (see Remark 3) is not as straightforward. To the best of our knowledge, [35] is the only paper proving convergence of the standard ADMM in the presence of packet losses. Interestingly, building on the framework established by [31], the authors of [35] have proved global convergence of the ADMM in the presence of uniformly distributed packet losses. However no linear convergence has been established and, in turn, no characterization of the rate of convergence has been provided. Moreover, asynchronous scenarios have not been analyzed. On the other hand, in [34] global linear convergence of ADMM is shown in the presence of asynchronous updates, but the results hold only for master-slave architectures.

Concerning the general R-ADMM algorithm, results for lossy and asynchronous scenarios have been obtained in [26, 36, 43, 30, 38]. More precisely, in [26, 36], it is proved that the ARock algorithm converges sub-linearly with asynchronous updates and (possibly unbounded) transmission delays. However convergence is guaranteed only if a single agent updates at each iteration, while Algorithm 2 is fully parallel, i.e. guarantees convergence when an arbitrary number of agents updates simultaneously. Moreover, as already stressed in Remark 2, the presence of a common memory among the nodes makes the ARock framework not suitable to theoretically analyze the convergence properties of Algorithm 2.

In [30, 43] the convergence of the general R-ADMM has been shown in the presence of randomized coordinate updates. Furthermore, [38] proves the global, linear convergence of the randomized R-ADMM under the assumption that ${\mathbold{A}}$ is full row rank. We remark that the linear convergence result of [38] applies to the distributed setup of interest only for master-slave topologies, for which ${\mathbold{A}}$ is full row rank [cf. Remark 5].

The literature review provided in this Remark and in Remark 5 has been conveniently summarized in Table II.

Remark 9 (Stable parameter pairs).

Observe that both Proposition 1, for the case of reliable communications, and Proposition 3, for the randomized updating scenario, establish convergence provided that $0<\alpha<1$ and $\rho>0$ . However, these conditions are only sufficient and not necessary and, in particular, the convergence might hold also for values of $\alpha\geq 1$ . This fact, proved in [23] under the assumption that ${\mathbold{A}}$ is full row rank, can be empirically observed in Section V where, for the case of quadratic functions $f_{i},\ i\in\mathcal{V}$ , the region of attraction in parameter space is larger. Moreover, despite what the intuition would suggest, the larger the packet loss uniform probability $p_{\lambda}$ , the larger the region of convergence. However, this increased region of stability is counterbalanced by a slower convergence rate of the algorithm.

V Simulations

In this Section we present numerical results that showcase the convergence properties of the proposed Algorithm 2 in different scenarios.

V-A Error trajectories

We consider a random geometric graph with $N=25$ nodes, and quadratic costs $f_{i}(x_{i})=(1/2)x_{i}^{\top}Q_{i}x_{i}-\langle r_{i},x_{i}\rangle$ , with $Q_{i}=Q_{i}^{\top}\succ 0$ , and $n=5$ . We performed a set of Monte Carlo simulations, each $500$ iterations long and averaging over $100$ realizations of the uniformly distributed packet loss and update random variables.

Fig. 1 depicts the logarithmic error $\log\left\lVert{\mathbold{x}}(k)-{\mathbold{x}}^{*}\right\rVert$ for different values of $\alpha$ when both packet losses and asynchronous updates are present. First of all, we notice that the convergence is linear. Moreover, the closer $\alpha$ is to $1$ , the faster the convergence is; notice that, although Proposition 3 does not guarantee convergence for $\alpha=1$ , this is nonetheless achieved. This result suggests that the R-ADMM is advantageous w.r.t. the standard ADMM, thus justifying its choice.

Fig. 2 depicts the logarithmic error for different values of the packet loss probability $p_{\lambda}$ . The result is that the larger $p_{\lambda}$ is, the slower the convergence, since the number of updates performed at each iteration decreases.

Finally, Fig. 3 depicts the stable pairs of values $(\rho,\alpha)$ for different packet loss probabilities and $p_{\mu}=1$ . A pair $(\rho,\alpha)$ is considered stable if it leads to convergence of Algorithm 2 over all of the Monte Carlo iterations. The curves in Fig. 3 represent the upper bound to the value of $\alpha$ that gives stable pairs.

An interesting feature of the proposed algorithm is that the larger the packet loss probability is, the larger the stability region. This is however counterbalanced by the fact that the convergence becomes slower as the packet loss probability grows larger [cf. Fig. 2].

V-B Convergence rate

We consider now a random geometric graph with $N=5$ nodes and $n=2$ , the same quadratic cost for each agent, and for simplicity $p_{\mu}=1$ . We evaluate the empirical convergence rate of the R-ADMM $\hat{\gamma}$ , computed as the slope of the logarithmic error trajectory averaged over $100$ Monte Carlo simulations, each $1000$ iterations long. Fig. 4 depicts the results.
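One way to extract an empirical rate from an error trajectory is sketched below. The paper does not specify the fitting procedure or the base of the logarithm, so the least-squares fit, the natural logarithm and the function name `empirical_rate` are assumptions made only for illustration; the synthetic trajectory is not data from the paper.

```python
import math


def empirical_rate(errors):
    """Least-squares slope of log(error) versus k, mapped back to a rate."""
    logs = [math.log(e) for e in errors]
    n = len(logs)
    k_mean = (n - 1) / 2.0
    log_mean = sum(logs) / n
    num = sum((k - k_mean) * (l - log_mean) for k, l in enumerate(logs))
    den = sum((k - k_mean) ** 2 for k in range(n))
    return math.exp(num / den)


# Synthetic trajectory decaying as C * gamma^k with C = 3 and gamma = 0.9.
errors = [3.0 * 0.9 ** k for k in range(200)]
gamma_hat = empirical_rate(errors)
```

For a trajectory of the form $C\gamma^{k}$ the fitted slope is $\log\gamma$ , so the estimate recovers $\gamma$ regardless of the constant $C$ .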

Notice that, as evidenced also by Fig. 1 above, the larger $\alpha$ is, the smaller the convergence rate $\hat{\gamma}$ , that is, the faster the convergence. On the other hand, in this particular scenario the larger $\rho$ is, the worse the convergence rate.

Moreover, for each choice of $\alpha\in(0,1)$ , $\rho\in[0.5,10]$ and $p_{\lambda}\in\{0,0.2,0.4,0.6\}$ , we computed $\bar{\gamma}_{\mathrm{M}}$ , which by Proposition 5 gives a bound to the convergence rate that holds in mean. Indeed, as evidenced by Tab. III, the bound appears to be extremely tight, with a maximum difference that is less than $1$ ‰.

Finally, Fig. 5 depicts the empirical rate for complete graphs with different numbers of nodes, and for different packet loss probabilities. The cost is quadratic and equal for all the agents.

As remarked above, the larger $p_{\lambda}$ is, the larger $\hat{\gamma}$ ; and, in this particular case, the convergence rate degrades monotonically with the number of nodes in the graph.

V-C Quartic function

In order to study the effect of curvature on the convergence of the algorithm, we considered a random graph with $N=10$ , $n=1$ , and all costs equal to

[TABLE]

Fig. 6 depicts the logarithmic error for different values of the parameter $q$ . Clearly, the curvature of the cost may deeply affect the rate of convergence.

Notice moreover that the convergence is globally linear, although the theoretical results guarantee only local linear convergence.

VI Conclusions and Future Directions

In this paper we addressed distributed convex optimization problems over peer-to-peer networks with both unreliable communications and asynchronous updates of the nodes. We proposed a modified version of the relaxed ADMM that, exploiting operator theoretical results, can be shown to converge almost surely. Moreover, by further assuming the local costs to be strongly convex, we proved local linear mean convergence of the proposed algorithm.

We have cast the proposed algorithm in the context of the literature and discussed its novelty. Finally, we have presented numerical results that showcase the resilience and robustness of the proposed algorithm.

Appendix A Derivation of (8)

The Peaceman-Rachford splitting (5) applied to the dual problem (7) is characterized by the updates

[TABLE]

We show now that (28a) is equivalent to (8a) and (8b); the same argument can be applied to (28b).

By the definition of proximal operator and of $d_{f}$ it holds

[TABLE]

where we applied the definition of convex conjugate. Consider now the minimum in (29), we have

[TABLE]

and by the first optimality condition for the innermost minimization it must hold:

[TABLE]

which is exactly (8b). Substituting (31) into (30) yields

[TABLE]

and changing the sign of the cost function we obtain (8a). $\square$

Appendix B Proof of equivalence between (8) and (20)

We show that we can derive (20) from (8), thus proving their equivalence.

We derive first some equalities that will be useful in the following. By (8b) we have

[TABLE]

and using this fact into (8d) yields

[TABLE]

Moreover, substituting (33) into (8d) we obtain

[TABLE]

Consider now (8c), the following chain of equalities holds

[TABLE]

where we derived the first by subtracting $\langle w(k+1),Ax(k+1)-c\rangle$ , the second by adding $f(x(k+1))$ , $(\rho/2)\left\lVert Ax(k+1)\right\rVert^{2}$ , and $\rho\langle Ax(k+1),-c\rangle$ , which is allowed since they do not depend on $y$ . The third equality holds by definition of square norm and augmented Lagrangian (19). We have thus derived (20c).

Substituting (33) and (32) into (8e) yields

[TABLE]

where the second equation was obtained adding and subtracting $\rho(By(k+1)-c)$ . Finally, evaluating (35) at iteration $k$ and using (32) we get (20b).

From (8a) and using (35) at iteration $k$ we can write

[TABLE]

where the second equality was derived adding $g(y(k))$ , $-\langle w(k),By(k)-c\rangle$ , and $(\rho/2)\left\lVert By(k)-c\right\rVert^{2}$ which do not depend on $x$ , and the third by the definitions of square norm and augmented Lagrangian. This proves equivalence of (8a) with (20a).

Finally, by the results we derived above, we can see that if the initial conditions satisfy

[TABLE]

then the trajectories for $x$ and $y$ generated by the splitting R-ADMM and the Lagrangian R-ADMM coincide. $\square$

Appendix C Proofs of Section III

C-A Derivation of (13)

Following the derivation in [26], we show that applying the R-ADMM of (8) to the distributed problem of interest yields (13).

$x$ update

Using the particular structure of ${\mathbold{A}}$ we can see that in update (8a):

[TABLE]

since each $x_{i}$ appears in $d_{i}$ constraints, i.e. rows of ${\mathbold{A}}$ . Moreover

[TABLE]

since the $i$ -th row of ${\mathbold{A}}^{\top}$ sums over the auxiliary variables stored by $i$ . Therefore (8a) becomes

[TABLE]

which is clearly separable over the single components. Moreover, node $i$ has all the information necessary to compute $x_{i}(k+1)$ .

$y$ update

Update (8c) in the distributed scenario becomes

[TABLE]

whose KKT conditions are

[TABLE]

where $\mathbold{\nu}$ is the vector of Lagrange multipliers. Plugging (36a) into (36b) yields

[TABLE]

where the second equality was derived using ${\mathbold{P}}^{2}={\mathbold{I}}$ , which implies ${\mathbold{P}}({\mathbold{I}}-{\mathbold{P}})={\mathbold{P}}-{\mathbold{P}}^{2}={\mathbold{P}}-{\mathbold{I}}=-({\mathbold{I}}-{\mathbold{P}})$ . Finally, summing (36a) and (37) we get

[TABLE]

which means that $y_{ij}(k+1)=y_{ji}(k+1)$ for any $(i,j)\in\mathcal{E}$ and $k\in{\mathbb{N}}$ .

$z$ update

Using (8b), and substituting (38) into (8d), the auxiliary update (8e) becomes

[TABLE]

and using the definitions of ${\mathbold{A}}$ , ${\mathbold{P}}$ proves (13b). $\square$

C-B Proof of Proposition 2

The proof is divided into the following steps: (i) write the auxiliary update of Algorithm 1 as a perturbed affine operator; (ii) bound the primal error with the error on the auxiliary variable; (iii) show that the primal error converges linearly for a quadratic approximation; (iv) extend the result to the general case.

(i) Perturbed affine operator

From the first order optimality condition for (13a) it must hold for any $i\in\mathcal{V}$

[TABLE]

Therefore using the Taylor expansion of the gradient $\nabla f_{i}$ around $x^{*}$ we have

[TABLE]

for $x\in\mathcal{B}_{x^{*}}$ , where $o:{\mathbb{R}}^{n}\to{\mathbb{R}}^{n}$ is such that $\left\lVert o(x_{i}-x^{*})\right\rVert/\left\lVert x_{i}-x^{*}\right\rVert\to 0$ as $x_{i}\to x^{*}$ . Combining (40) and (41), the latter evaluated in $x_{i}(k+1)$ , yields

[TABLE]

and solving for $x_{i}(k+1)$ we get

[TABLE]

Stacking the updates (42) for $i\in\mathcal{V}$ we can write

[TABLE]

where ${\mathbold{H}}=\operatorname{blk\,diag}\left\{\rho d_{i}I_{n}+\nabla^{2}f_{i}(x^{*})\right\}$ , ${\mathbold{g}}$ and ${\mathbold{o}}$ stack $\nabla^{2}f_{i}(x^{*})x^{*}-\nabla f_{i}(x^{*})$ and $o(x_{i}(k+1)-x^{*})$ , respectively.

Using the auxiliary update (39) and (43) we can write

[TABLE]

where ${\mathbold{T}}=(1-\alpha){\mathbold{I}}-\alpha{\mathbold{P}}+2\alpha\rho{\mathbold{P}}{\mathbold{A}}{\mathbold{H}}^{-1}{\mathbold{A}}^{\top}$ , ${\mathbold{u}}=2\alpha\rho{\mathbold{P}}{\mathbold{A}}{\mathbold{H}}^{-1}{\mathbold{g}}$ , and ${\mathbold{o}}^{\prime}:{\mathbb{R}}^{nN}\to{\mathbb{R}}^{nM}$ , ${\mathbold{o}}^{\prime}(\cdot)=2\alpha\rho{\mathbold{P}}{\mathbold{A}}{\mathbold{H}}^{-1}\,{\mathbold{o}}(\cdot)$ , decays faster than the argument.

Remark 10.

Using the particular structure of ${\mathbold{A}}$ we see that ${\mathbold{A}}{\mathbold{H}}^{-1}{\mathbold{A}}^{\top}=\operatorname{blk\,diag}\left\{{\mathbf{1}}_{d_{i}\times d_{i}}\otimes(\rho d_{i}I_{n}+\nabla^{2}f_{i}(x^{*}))^{-1}\right\}$ . But since $f_{i}\in\mathcal{C}^{2}$ for any $i$ , the Hessians $\nabla^{2}f_{i}(x^{*})$ are symmetric and thus ${\mathbold{T}}$ is symmetric as well.

(ii) Primal error bound

We start by stating the following result.

Lemma 1.

The update (13a) can be rewritten as

[TABLE]

Proof.

Adding the term $\left\lVert[{\mathbold{A}}^{\top}{\mathbold{z}}(k)]_{i}\right\rVert^{2}/(2\rho d_{i})$ – which does not depend on $x_{i}$ – to the objective function in (13a) and using the definition of norm we can write

[TABLE]

where the second equality follows from definition of proximal operator, see section II-B. ∎

Denote by $m_{i}$ the strong convexity modulus of function $f_{i}$ , then we know that the proximal operator $\operatorname{prox}_{f_{i}/(\rho d_{i})}$ is $1/(1+m_{i}/(\rho d_{i}))$ -contractive [44], which implies

[TABLE]

for any $\bar{{\mathbold{z}}}\in\operatorname{fix}({\mathcal{T}_{\mathrm{PR}}})$ . Then

[TABLE]

where

[TABLE]
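The contraction factor of the proximal operator can be verified in closed form for a scalar quadratic cost. The sketch assumes the convention $\operatorname{prox}_{f/(\rho d)}(v)=\arg\min_{x}\{f(x)+(\rho d/2)(x-v)^{2}\}$ , which is consistent with Lemma 1, and a cost $f(x)=(m/2)x^{2}-bx$ whose strong convexity modulus is $m$ . The function names and the numerical values are hypothetical.

```python
def prox_quadratic(v, m, b, rho, d):
    """prox_{f/(rho*d)}(v) for f(x) = 0.5*m*x**2 - b*x with m > 0 (scalar x).

    Minimizes f(x) + 0.5*rho*d*(x - v)**2, whose stationarity condition is
    m*x - b + rho*d*(x - v) = 0.
    """
    return (rho * d * v + b) / (m + rho * d)


def contraction_factor(m, rho, d):
    """Factor 1 / (1 + m / (rho * d)) quoted for an m-strongly convex cost."""
    return 1.0 / (1.0 + m / (rho * d))


m, b, rho, d = 2.0, 1.0, 1.0, 3
v1, v2 = 0.7, -1.9
gap_out = abs(prox_quadratic(v1, m, b, rho, d) - prox_quadratic(v2, m, b, rho, d))
gap_in = abs(v1 - v2)
factor = contraction_factor(m, rho, d)
```

For a quadratic cost the bound is attained with equality, that is, the distance between the two outputs equals the factor times the distance between the two inputs.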

(iii) Linear convergence (quadratic case)

Assume that the functions $f_{i}$ are quadratic and, more specifically, (41) holds true with the residual equal to $0$ . In this case the Peaceman-Rachford operator is affine and averaged, and the auxiliary update becomes ${\mathbold{z}}(k+1)={\mathbold{T}}{\mathbold{z}}(k)+{\mathbold{u}}$ . The following result characterizes the spectral properties of ${\mathbold{T}}$ .

Lemma 2.

The eigenvalues of ${\mathbold{T}}$ are either equal to $1$ or strictly inside the unit circle. Moreover the eigenvalues in $1$ are all semi-simple. In addition the following property holds

[TABLE]

Proof.

For affine averaged operators the eigenvalues of ${\mathbold{T}}$ all lie inside the circle in the complex plane with center $1-\alpha+i0$ and radius $\alpha$ [45]. This implies that the only eigenvalues of ${\mathbold{T}}$ with unit absolute value are equal to $1$ , and by convergence of the Krasnosel’skiĭ-Mann iteration they are semi-simple.

Now let $\ker({\mathbold{I}}-{\mathbold{T}})=\operatorname{span}\{{\mathbold{v}}_{1},\ldots,{\mathbold{v}}_{H}\}$ where $H$ is the algebraic (and geometric) multiplicity of $1$ . Notice that, given $\bar{{\mathbold{z}}}\in\operatorname{fix}({\mathcal{T}_{\mathrm{PR}}})$ and ${\mathbold{v}}_{h}\in\ker({\mathbold{I}}-{\mathbold{T}})$ , then $\bar{{\mathbold{z}}}+c{\mathbold{v}}_{h}\in\operatorname{fix}({\mathcal{T}_{\mathrm{PR}}})$ for any $c\in{\mathbb{R}}$ . For any $\bar{{\mathbold{z}}}\in\operatorname{fix}({\mathcal{T}_{\mathrm{PR}}})$ from (43) we have ${\mathbold{x}}^{*}={\mathbold{H}}^{-1}[{\mathbold{A}}^{\top}\bar{{\mathbold{z}}}+{\mathbold{g}}]$ , which, by uniqueness of ${\mathbold{x}}^{*}$ , implies

[TABLE]

By the nonsingularity of ${\mathbold{H}}$ this condition implies ${\mathbold{v}}_{h}\in\ker({\mathbold{A}}^{\top})$ , $h=1,\ldots,H$ , and thus $\ker({\mathbold{I}}-{\mathbold{T}})\subset\ker({\mathbold{A}}^{\top})$ . ∎
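The disc argument used in the proof above can be illustrated numerically. The following is a minimal sketch under stated assumptions: `T` is built as an $\alpha$-averaged linear map $(1-\alpha){\mathbold{I}}+\alpha{\mathbold{R}}$ from a hypothetical nonexpansive matrix `R`, and the dimension `n` and the value of `alpha` are arbitrary choices, not the paper's operator.

```python
import numpy as np

rng = np.random.default_rng(1)
n, alpha = 6, 0.7                          # hypothetical dimension and averaging parameter
G = rng.standard_normal((n, n))
R = G / np.linalg.norm(G, 2)               # spectral norm 1, hence nonexpansive
T = (1 - alpha) * np.eye(n) + alpha * R    # alpha-averaged linear map

eigs = np.linalg.eigvals(T)
# Every eigenvalue is (1 - alpha) + alpha * mu with |mu| <= 1,
# so it lies in the disc centered at 1 - alpha with radius alpha.
dist_to_center = np.abs(eigs - (1 - alpha))
# That disc touches the unit circle only at the point 1.
on_unit_circle = eigs[np.isclose(np.abs(eigs), 1.0)]

print("max distance from center:", dist_to_center.max(), "radius:", alpha)
```

Because the disc is internally tangent to the unit circle at $1$, any eigenvalue of unit modulus must coincide with $1$, which is the first claim of the lemma.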

Now by iterating ${\mathbold{z}}(k+1)={\mathbold{T}}{\mathbold{z}}(k)+{\mathbold{u}}$ we have

[TABLE]

which is satisfied also by any $\bar{{\mathbold{z}}}\in\operatorname{fix}({\mathcal{T}_{\mathrm{PR}}})$ . Thus ${\mathbold{z}}(k)-\bar{{\mathbold{z}}}={\mathbold{T}}^{k}({\mathbold{z}}(0)-\bar{{\mathbold{z}}}),$ and combining this fact with (45) yields

[TABLE]

Since $\ker({\mathbold{I}}-{\mathbold{T}})\subset\ker({\mathbold{A}}^{\top})$, this implies that $\left\lVert{\mathbold{A}}^{\top}{\mathbold{T}}^{k}\right\rVert\leq C^{\prime}\gamma_{\mathrm{M}}^{k}$, where $\gamma_{\mathrm{M}}$ is the largest modulus among the eigenvalues of ${\mathbold{T}}$ different from $1$ and $C^{\prime}>0$, see [45]. This proves the linear convergence of the primal error in the quadratic case.
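The geometric decay of $\lVert{\mathbold{A}}^{\top}{\mathbold{T}}^{k}\rVert$ can be seen on a small example. This sketch makes simplifying assumptions that are not part of the paper: `T` is taken symmetric with hand-picked eigenvalues `lam`, and `At` is a random stand-in for ${\mathbold{A}}^{\top}$ forced to vanish on $\ker({\mathbold{I}}-{\mathbold{T}})$. In this symmetric setting the constant $C^{\prime}$ can be taken as the spectral norm of `At`.

```python
import numpy as np

rng = np.random.default_rng(2)
n, H = 5, 2
V, _ = np.linalg.qr(rng.standard_normal((n, n)))   # orthonormal eigenvectors
lam = np.array([1.0, 1.0, 0.8, -0.5, 0.3])         # H semi-simple eigenvalues equal to 1
T = V @ np.diag(lam) @ V.T

# Projector onto the orthogonal complement of ker(I - T) = span of the first H columns of V.
P = np.eye(n) - V[:, :H] @ V[:, :H].T
At = rng.standard_normal((3, n)) @ P               # ker(I - T) is contained in ker(A^T)

gamma_M = np.max(np.abs(lam[H:]))                  # largest modulus among eigenvalues != 1
C = np.linalg.norm(At, 2)
norms = [np.linalg.norm(At @ np.linalg.matrix_power(T, k), 2) for k in range(30)]
bounds = [C * gamma_M**k for k in range(30)]

for k in (0, 10, 20, 29):
    print(k, norms[k], bounds[k])
```

Although ${\mathbold{T}}^{k}$ itself does not vanish (its eigenvalues at $1$ persist), the product with `At` does, because `At` filters out exactly the non-decaying directions.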

(iv) Linear convergence (general case)

Since the PRS operator is nonexpansive, it follows that $\|{\mathbold{z}}(k+1)-{\mathbold{z}}^{*}\|\leq\|{\mathbold{z}}(k)-{\mathbold{z}}^{*}\|$ and, in turn, that also the sequence $\left\lVert{\mathbold{x}}(k)-{\mathbold{x}}^{*}\right\rVert$ is bounded. Since $\lim_{k\to\infty}\left\lVert{\mathbold{x}}(k+1)-{\mathbold{x}}^{*}\right\rVert=0$ , by the definition of ${\mathbold{o}}^{\prime}$ we can argue that there exists a sequence of positive numbers $\{\delta_{k}\}_{k\geq 1}$ such that $\delta_{k+1}\leq\delta_{k}$ , $\lim_{k\to\infty}\delta_{k}=0$ and

[TABLE]

Therefore, by iterating (44) and exploiting (45) the primal error, for $k\geq 1$ , can be bounded as

[TABLE]

where $C^{\prime\prime}=\zeta C^{\prime}\left\lVert{\mathbold{z}}(0)-\bar{{\mathbold{z}}}\right\rVert$ . Consider now the sequence $\{e(k)\}_{k\geq 1}$ such that $e(1)=\left\lVert{\mathbold{x}}(1)-{\mathbold{x}}^{*}\right\rVert$ , $e(2)=C^{\prime\prime}\gamma_{\mathrm{M}}+\zeta C^{\prime}\delta_{1}\left\lVert{\mathbold{x}}(1)-{\mathbold{x}}^{*}\right\rVert$ and

[TABLE]

Recalling the definition of ${\mathbold{o}}^{\prime}$, we know that there exists a ball $\mathcal{B}_{x^{*}}$ centered at $x^{*}$ such that, if $x_{i}(1)$ belongs to $\mathcal{B}_{x^{*}}$ for all $i\in\mathcal{V}$, then $\left\lVert{\mathbold{o}}({\mathbold{x}}(1)-{\mathbold{x}}^{*})\right\rVert\leq\delta_{1}\left\lVert{\mathbold{x}}(1)-{\mathbold{x}}^{*}\right\rVert$ with $\delta_{1}$ such that $\gamma_{\mathrm{M}}+\zeta C^{\prime}\delta_{1}<1$, i.e., $\delta_{1}<(1-\gamma_{\mathrm{M}})/(\zeta C^{\prime})$. In this case, by a standard inductive argument, one can show that $\left\lVert{\mathbold{x}}(k)-{\mathbold{x}}^{*}\right\rVert\leq e(k)$ for all $k\geq 1$. Notice that, in view of (18), there exists $\epsilon>0$ such that if $\operatorname{dist}({\mathbold{z}}(0),\operatorname{fix}({\mathcal{T}_{\mathrm{PR}}}))\leq\epsilon$ then ${\mathbold{x}}(1)\in\mathcal{B}_{x^{*}}$. Now from (48)

[TABLE]

from which, since $|\gamma_{\mathrm{M}}+\zeta C^{\prime}\delta_{\ell}|<1$ for all $\ell$ , we get that $\lim_{k\to\infty}e(k)=0$ . We conclude the proof by observing that, for any $\xi>0$ , it holds

[TABLE]

Appendix D Proofs of Section IV

D-A Proof of Proposition 3

As discussed in Section IV, Algorithm 2 accounts for transmission losses and asynchronous updates by updating an auxiliary variable $z$ only if new information is available. Since the R-ADMM is the Peaceman-Rachford splitting applied to the dual of (6), we can interpret Algorithm 2 as a randomized Peaceman-Rachford iteration in which each coordinate of the vector ${\mathbold{z}}$ is randomly updated with nonzero probability. Therefore the convergence results of [30, Theorem 3] or (a particular case of) [43, Theorem 3.2] can be applied to prove almost sure convergence to the dual solution and, in turn, by strong duality, to the primal solution. $\square$

D-B Derivation of (IV-B)

By (18) we have $\mathbb{E}\left[\left\lVert{\mathbold{x}}(k+1)-{\mathbold{x}}^{*}\right\rVert\right]\leq\zeta\mathbb{E}\left[\left\lVert{\mathbold{A}}^{\top}({\mathbold{z}}(k+1)-\bar{{\mathbold{z}}})\right\rVert\right]$ , and our goal is to find a bound for the right-hand side. Iterating (24) we get

[TABLE]

Multiplying by ${\mathbold{A}}^{\top}$ , taking the norm of (49) and using triangle inequality and submultiplicativity then yields

[TABLE]

Now, taking the expectation we get:

[TABLE]

Finally, by Jensen’s inequality for concave functions (as the square root is), we know that

[TABLE]

and substituting this into (51) yields (IV-B).
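The Jensen step used in this derivation, namely that the mean of a norm is at most the square root of the mean squared norm, can be sketched numerically as follows. Everything here is a hypothetical stand-in chosen for illustration: `At` plays the role of ${\mathbold{A}}^{\top}$, `v` that of an initial error vector, and `Ms` are arbitrary random matrices replacing the random operator products.

```python
import numpy as np

rng = np.random.default_rng(3)
n, samples = 4, 500
At = rng.standard_normal((3, n))                         # stand-in for A^T
v = rng.standard_normal(n)                               # stand-in for an initial error vector
Ms = rng.standard_normal((samples, n, n)) / np.sqrt(n)   # stand-in random matrices

# Left-hand side: empirical mean of the norm ||A^T M v||.
norms = np.array([np.linalg.norm(At @ M @ v) for M in Ms])
lhs = norms.mean()

# Right-hand side: square root of v' E[M' A A^T M] v, the mean squared norm.
second_moment = np.mean([M.T @ At.T @ At @ M for M in Ms], axis=0)
rhs = np.sqrt(v @ second_moment @ v)

print(f"E[norm] = {lhs:.4f} <= sqrt(E[norm^2]) = {rhs:.4f}")
```

The inequality holds exactly for the empirical distribution as well, so the comparison does not depend on the number of samples. Its practical value is that the right-hand side depends only on a second-moment matrix, which evolves linearly.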

D-C Proof of Proposition 5

The proof consists of the following steps: (i) derive the auxiliary update in Algorithm 2 as a perturbed, randomized affine operator, and characterize its properties; (ii) bound the mean primal error for the quadratic approximation with the auxiliary error, which converges linearly; (iii) extend the result to the general case.

(i) Randomized perturbed affine operator

Observe that, from (23) and (44), and recalling the definition of ${\mathbold{B}}(k)$, we can write

[TABLE]

where $\hat{{\mathbold{T}}}(k):={\mathbold{I}}-{\mathbold{B}}(k)({\mathbold{I}}-{\mathbold{T}})$ . Let $\bar{{\mathbold{z}}}\in\operatorname{fix}({\mathcal{T}_{\mathrm{PR}}})$ , then since $\bar{{\mathbold{z}}}={\mathbold{T}}\bar{{\mathbold{z}}}+{\mathbold{u}}$ , we have $\bar{{\mathbold{z}}}=\hat{{\mathbold{T}}}(k)\bar{{\mathbold{z}}}+{\mathbold{B}}(k){\mathbold{u}}$ for any $k\in{\mathbb{N}}$ . Thus iterating (52) and subtracting $\bar{{\mathbold{z}}}$ yields

[TABLE]

where by convention $\prod_{\ell=k+1}^{k}\hat{{\mathbold{T}}}(\ell)={\mathbold{I}}$. Let us now consider the quadratic case, that is, assume that (41) holds with zero residual. In this case (53) becomes

[TABLE]

By Proposition 3 we know that ${\mathbold{z}}(k+1)$ converges with probability one to a fixed point $\bar{{\mathbold{z}}}^{\prime}\in\operatorname{fix}({\mathcal{T}_{\mathrm{PR}}})$ , in general different from $\bar{{\mathbold{z}}}$ , which implies from (54) that

[TABLE]

Notice that, given any two fixed points $\bar{{\mathbold{z}}},\bar{{\mathbold{z}}}^{\prime}\in\operatorname{fix}({\mathcal{T}_{\mathrm{PR}}})$ , it holds $\bar{{\mathbold{z}}}^{\prime}-\bar{{\mathbold{z}}}\in\ker({\mathbold{I}}-{\mathbold{T}})$ , thus for (55) to be true there must exist $c_{1},\ldots,c_{H}$ random variables such that

[TABLE]

where we recall that $\ker({\mathbold{I}}-{\mathbold{T}})=\operatorname{span}\{{\mathbold{v}}_{1},\ldots,{\mathbold{v}}_{H}\}$. The realizations of $c_{h}$, $h=1,\ldots,H$, depend on the realizations of $\hat{{\mathbold{T}}}(\ell)$, $\ell\in{\mathbb{N}}$, and on the initial condition ${\mathbold{z}}(0)$. In general ${\mathbold{z}}(0)-\bar{{\mathbold{z}}}\not\in\ker({\mathbold{I}}-{\mathbold{T}})$, which implies that there exist random vectors ${\mathbold{\epsilon}}_{1},\ldots,{\mathbold{\epsilon}}_{H}$ such that

[TABLE]
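The construction in step (i) can be sketched in a few lines. The example below is a simplified illustration, not the paper's setting: it assumes a symmetric contraction `T` (so the fixed point `z_bar` is unique, whereas in the paper the fixed-point set may be a subspace), and a diagonal 0/1 matrix ${\mathbold{B}}(k)$ whose entries are independent Bernoulli variables with a common hypothetical probability `p`. It checks that $\bar{{\mathbold{z}}}=\hat{{\mathbold{T}}}(k)\bar{{\mathbold{z}}}+{\mathbold{B}}(k){\mathbold{u}}$ for a sampled ${\mathbold{B}}(k)$ and then runs the randomized iteration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, alpha, p = 5, 0.7, 0.6                      # hypothetical size, relaxation, update probability
G = rng.standard_normal((n, n))
S = (G + G.T) / 2
R = 0.5 * S / np.linalg.norm(S, 2)             # symmetric, spectral norm 0.5
T = (1 - alpha) * np.eye(n) + alpha * R
u = rng.standard_normal(n)
z_bar = np.linalg.solve(np.eye(n) - T, u)      # unique fixed point of z -> T z + u

def sample_B():
    """Diagonal 0/1 matrix: coordinate i is updated with probability p."""
    return np.diag((rng.random(n) < p).astype(float))

# Fixed points of the deterministic map are invariant under every randomized map.
B = sample_B()
T_hat = np.eye(n) - B @ (np.eye(n) - T)
invariance_gap = np.linalg.norm(T_hat @ z_bar + B @ u - z_bar)

# Randomized iteration: only the coordinates selected by B(k) are updated.
z = rng.standard_normal(n)
for _ in range(300):
    B = sample_B()
    z = (np.eye(n) - B @ (np.eye(n) - T)) @ z + B @ u
final_error = np.linalg.norm(z - z_bar)

print("invariance gap:", invariance_gap, "final error:", final_error)
```

Coordinates not selected by `B` keep their previous value, which is how a lost packet or an inactive node is modeled in this sketch.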

(ii) Mean error bound

By iterating (54), taking the expectation, exploiting (45) and Jensen’s inequality, we can write

[TABLE]

where

[TABLE]

if $k\geq 1$ , otherwise ${\mathbold{\Delta}}(0)={\mathbold{A}}{\mathbold{A}}^{\top}$ . Therefore, with a simple recursive argument, we can characterize the bound for the primal error in terms of the evolution of the linear system

[TABLE]

with initial condition ${\mathbold{\Delta}}(0)={\mathbold{A}}{\mathbold{A}}^{\top}$ , and where the second equality holds since, by Assumption 1, the $\hat{{\mathbold{T}}}(k)$ are independent and identically distributed.

Lemma 3.

The eigenvalues of the operator $\mathcal{L}$ are all either strictly inside the unit circle or equal to $1$. In particular, let ${\mathbold{\epsilon}}_{1},\ldots,{\mathbold{\epsilon}}_{H}$ be the random vectors such that (56) holds. Then the eigenspace of $\mathcal{L}$ relative to $1$ is $H^{2}$-dimensional and is generated by $\mathbb{E}[{\mathbold{\epsilon}}_{i}{\mathbold{\epsilon}}_{j}^{\top}]$, $i,j=1,\ldots,H$.

Proof.

Taking the limit for $k\to\infty$ of (57) and exploiting the property (56) proved above yields, for any initial condition ${\mathbold{\Delta}}(0)$ :

[TABLE]

This proves that the eigenspace of $\mathcal{L}$ relative to the eigenvalue $1$ is $H^{2}$ dimensional and it is generated by $\mathbb{E}[{\mathbold{\epsilon}}_{i}{\mathbold{\epsilon}}_{j}^{\top}]$ , $i,j=1,\ldots,H$ . ∎

By Lemma 2 we have $\ker({\mathbold{I}}-{\mathbold{T}})\subset\ker({\mathbold{A}}^{\top})$, and so (58) with initial condition ${\mathbold{\Delta}}(0)={\mathbold{A}}{\mathbold{A}}^{\top}$ implies that ${\mathbold{A}}{\mathbold{A}}^{\top}$ is orthogonal to the eigenspace generated by the eigenvectors relative to $1$. Thus we have $\lim_{k\to\infty}\mathcal{L}^{k}({\mathbold{\Delta}}(0))=0$, which proves that the primal error converges linearly to zero. The rate of convergence is characterized by the largest eigenvalue of the linear system $\mathcal{L}$ strictly inside the unit circle, that is, by $\bar{\gamma}_{\mathrm{M}}$ as defined in (27).

(iii) General case

By the results of step (ii) we can write

[TABLE]

for some $C^{\prime}>0$ . Moreover, in the general case, iterating (53) and exploiting the primal error bound

[TABLE]

we obtain (IV-B). Now, using (59) and the fact that ${\mathbold{B}}(k)$ are independent and identically distributed, we have

[TABLE]

Similarly to the proof of Proposition 2, we can argue that there exists a sequence of positive numbers $\{\delta_{k}\}_{k\geq 1}$ such that $\delta_{k+1}\leq\delta_{k}$ , $\lim_{k\to\infty}\delta_{k}=0$ and

[TABLE]

Hence we have the following inequality

[TABLE]

and with the same argument employed in Appendix C-B the proof of Proposition 5 is complete. $\square$

Remark 11.

Point (ii) of this proof extends to the distributed optimization scenario the results reported in [46] for the randomized consensus problem.

D-D Proof of Proposition 6

The idea is to introduce a matrix representation of $\mathcal{L}$ and then compute its largest eigenvalue strictly inside the unit circle.

Let $\operatorname{vect}(\cdot)$ be the vectorization operator that, given a matrix $\mathbold{M}\in{\mathbb{R}}^{K\times K}$, returns the vector $\operatorname{vect}(\mathbold{M})\in{\mathbb{R}}^{K^{2}}$ obtained by stacking the columns of $\mathbold{M}$, that is, having $[\mathbold{M}]_{i,j}$ in position $(j-1)K+i$. A useful property of $\operatorname{vect}$ is that for a triplet of matrices of suitable dimensions we can write $\operatorname{vect}(\mathbold{ABC})=(\mathbold{C}^{\top}\otimes\mathbold{A})\operatorname{vect}(\mathbold{B})$.
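The vectorization identity can be checked directly. The sketch below assumes that `vect` stacks the columns of its argument (NumPy's `order="F"`), which is the convention under which $\operatorname{vect}(\mathbold{ABC})=(\mathbold{C}^{\top}\otimes\mathbold{A})\operatorname{vect}(\mathbold{B})$ holds in this form. If rows are stacked instead, the Kronecker factors swap, as the last lines show. The matrices are arbitrary random test data.

```python
import numpy as np

rng = np.random.default_rng(5)
K = 3
A, B, C = (rng.standard_normal((K, K)) for _ in range(3))

def vect(M):
    """Stack the columns of M into a single vector (column-major order)."""
    return M.reshape(-1, order="F")

# Column stacking: vect(ABC) = (C^T kron A) vect(B).
lhs = vect(A @ B @ C)
rhs = np.kron(C.T, A) @ vect(B)

# Row stacking (NumPy's default reshape): the factors appear as A kron C^T.
row_lhs = (A @ B @ C).reshape(-1)
row_rhs = np.kron(A, C.T) @ B.reshape(-1)

print(np.allclose(lhs, rhs), np.allclose(row_lhs, row_rhs))
```

For a map of the form $\mathbold{\Delta}\mapsto\mathbold{M}\mathbold{\Delta}\mathbold{M}^{\top}$ both conventions give the same matrix $\mathbold{M}\otimes\mathbold{M}$, so the choice does not affect the resulting spectrum.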

Vectorizing the linear system $\mathcal{L}$ we obtain

[TABLE]

where $\bar{\gamma}_{\mathrm{M}}$ of $\mathcal{L}$ coincides with the largest eigenvalue of ${\mathbold{L}}$ strictly inside the unit circle.

Using Assumption 1 we now give an explicit formula for ${\mathbold{L}}$ in terms of ${\mathbold{T}}$ and the expectation of ${\mathbold{B}}(0)$ . The symmetry of ${\mathbold{T}}$ , see Remark 10, implies that of $\hat{{\mathbold{T}}}(0)$ and thus of ${\mathbold{L}}$ . Therefore, omitting the dependence on time in ${\mathbold{B}}(0)$ , we have:

[TABLE]

and by linearity of the expectation we can focus on each term separately. The first term is deterministic and therefore unchanged by the expectation, while we have $\mathbb{E}[{\mathbold{I}}\otimes{\mathbold{B}}]={\mathbold{I}}\otimes\mathbb{E}[{\mathbold{B}}]$ and, similarly, $\mathbb{E}[{\mathbold{B}}\otimes{\mathbold{I}}]=\mathbb{E}[{\mathbold{B}}]\otimes{\mathbold{I}}$.

The remaining terms can be computed using the following property of the Kronecker product $(\mathbold{AC}\otimes\mathbold{BD})=(\mathbold{A}\otimes\mathbold{B})(\mathbold{C}\otimes\mathbold{D})$ for matrices $\mathbold{A,B,C,D}$ of suitable dimensions:

• $\mathbb{E}[{\mathbold{I}}\otimes{\mathbold{B}}{\mathbold{T}}]=\mathbb{E}[({\mathbold{I}}\otimes{\mathbold{B}})({\mathbold{I}}\otimes{\mathbold{T}})]=({\mathbold{I}}\otimes\mathbb{E}[{\mathbold{B}}])({\mathbold{I}}\otimes{\mathbold{T}})={\mathbold{I}}\otimes\mathbb{E}[{\mathbold{B}}]{\mathbold{T}}$,

• $\mathbb{E}[{\mathbold{B}}{\mathbold{T}}\otimes{\mathbold{I}}]=\mathbb{E}[({\mathbold{B}}\otimes{\mathbold{I}})({\mathbold{T}}\otimes{\mathbold{I}})]=(\mathbb{E}[{\mathbold{B}}]\otimes{\mathbold{I}})({\mathbold{T}}\otimes{\mathbold{I}})=\mathbb{E}[{\mathbold{B}}]{\mathbold{T}}\otimes{\mathbold{I}}$,

• $\mathbb{E}[{\mathbold{B}}\otimes{\mathbold{B}}{\mathbold{T}}]=\mathbb{E}[({\mathbold{B}}\otimes{\mathbold{B}})({\mathbold{I}}\otimes{\mathbold{T}})]=\mathbb{E}[{\mathbold{B}}\otimes{\mathbold{B}}]({\mathbold{I}}\otimes{\mathbold{T}})$,

• $\mathbb{E}[{\mathbold{B}}{\mathbold{T}}\otimes{\mathbold{B}}]=\mathbb{E}[({\mathbold{B}}\otimes{\mathbold{B}})({\mathbold{T}}\otimes{\mathbold{I}})]=\mathbb{E}[{\mathbold{B}}\otimes{\mathbold{B}}]({\mathbold{T}}\otimes{\mathbold{I}})$,

• $\mathbb{E}[{\mathbold{B}}{\mathbold{T}}\otimes{\mathbold{B}}{\mathbold{T}}]=\mathbb{E}[({\mathbold{B}}\otimes{\mathbold{B}})({\mathbold{T}}\otimes{\mathbold{T}})]=\mathbb{E}[{\mathbold{B}}\otimes{\mathbold{B}}]({\mathbold{T}}\otimes{\mathbold{T}})$.
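The term-by-term expectations listed above can be verified by exhaustive enumeration on a small instance. The sketch rests on assumptions that are mine rather than the paper's: the matrix representation is taken to be $\mathbb{E}[\hat{{\mathbold{T}}}\otimes\hat{{\mathbold{T}}}]$ with $\hat{{\mathbold{T}}}={\mathbold{I}}-{\mathbold{B}}({\mathbold{I}}-{\mathbold{T}})$, ${\mathbold{B}}$ is diagonal with independent Bernoulli entries of hypothetical probabilities `p`, and `T` is an arbitrary symmetric test matrix.

```python
import itertools
import numpy as np

rng = np.random.default_rng(6)
n = 3
p = np.array([0.9, 0.6, 0.4])                  # hypothetical per-coordinate update probabilities
G = rng.standard_normal((n, n))
T = (G + G.T) / 2                              # arbitrary symmetric test matrix
I = np.eye(n)

def realizations():
    """Yield (probability, B) over all 0/1 diagonal matrices with independent entries."""
    for bits in itertools.product([0.0, 1.0], repeat=n):
        b = np.array(bits)
        prob = np.prod(np.where(b == 1.0, p, 1.0 - p))
        yield prob, np.diag(b)

EB = sum(pr * B for pr, B in realizations())                  # E[B]
EBB = sum(pr * np.kron(B, B) for pr, B in realizations())     # E[B kron B]

# Direct expectation of T_hat kron T_hat, with T_hat = I - B + B T.
L_direct = sum(pr * np.kron(I - B + B @ T, I - B + B @ T) for pr, B in realizations())

# Same quantity assembled term by term via the mixed-product property.
L_terms = (np.kron(I, I) - np.kron(I, EB) - np.kron(EB, I)
           + np.kron(I, EB @ T) + np.kron(EB @ T, I)
           + EBB - EBB @ np.kron(I, T) - EBB @ np.kron(T, I) + EBB @ np.kron(T, T))

print("max discrepancy:", np.abs(L_direct - L_terms).max())
```

Only $\mathbb{E}[{\mathbold{B}}]$ and $\mathbb{E}[{\mathbold{B}}\otimes{\mathbold{B}}]$ are needed to assemble the matrix, so the enumeration over all $2^{n}$ realizations is used here only as an independent check.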

Summing and rearranging the terms (exploiting the properties of the Kronecker product) we thus prove Proposition 6. $\square$

Bibliography


[1] K. Slavakis, G. B. Giannakis, and G. Mateos, “Modeling and Optimization for Big Data Analytics: (Statistical) learning tools for our era of data deluge,” IEEE Signal Process. Mag., vol. 31, no. 5, pp. 18–31, 2014.
[2] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and distributed computation: numerical methods. Prentice Hall, Englewood Cliffs, NJ, 1989, vol. 23.
[3] A. Nedić and A. Ozdaglar, “Distributed Subgradient Methods for Multi-Agent Optimization,” IEEE Trans. Autom. Control, vol. 54, no. 1, pp. 48–61, 2009.
[4] ——, Convex Optimization in Signal Processing and Communications. Cambridge University Press, 2010, ch. Cooperative Distributed Multi-Agent Optimization, pp. 340–386.
[5] A. Nedić, A. Ozdaglar, and P. Parrilo, “Constrained consensus and optimization in multi-agent networks,” IEEE Trans. Autom. Control, vol. 55, no. 4, pp. 922–938, 2010.
[6] I. Lobel and A. Ozdaglar, “Distributed Subgradient Methods for Convex Optimization Over Random Networks,” IEEE Trans. Autom. Control, vol. 56, no. 6, pp. 1291–1306, 2011.
[7] S. Lee and A. Nedić, “Distributed Random Projection Algorithm for Convex Optimization,” IEEE J. Sel. Topics Signal Process., vol. 7, no. 2, pp. 221–229, 2013.
[8] D. Jakovetic, J. M. F. Xavier, and J. M. F. Moura, “Convergence Rates of Distributed Nesterov-Like Gradient Methods on Random Networks,” IEEE Trans. Signal Process., vol. 62, no. 4, pp. 868–882, 2014.
