Distributed Optimization Using the Primal-Dual Method of Multipliers

G. Zhang; R. Heusdens

arXiv:1702.00841·cs.DC·February 6, 2017

TL;DR

This paper introduces PDMM, a novel primal-dual algorithm for distributed convex optimization over graphs, demonstrating convergence and robustness under various update schemes and network conditions.

Contribution

The paper develops PDMM, a new distributed optimization method that effectively handles graph-structured problems with convergence guarantees and resilience to communication failures.

Findings

01. Converges at rate O(1/K) for convex functions.

02. Effective under both synchronous and asynchronous updates.

03. Resilient to transmission failures in distributed averaging.

Abstract

In this paper, we propose the primal-dual method of multipliers (PDMM) for distributed optimization over a graph. In particular, we optimize a sum of convex functions defined over a graph, where every edge in the graph carries a linear equality constraint. In designing the new algorithm, an augmented primal-dual Lagrangian function is constructed which smoothly captures the graph topology. It is shown that a saddle point of the constructed function provides an optimal solution of the original problem. Further under both the synchronous and asynchronous updating schemes, PDMM has the convergence rate of O(1/K) (where K denotes the iteration index) for general closed, proper and convex functions. Other properties of PDMM such as convergence speeds versus different parameter-settings and resilience to transmission failure are also investigated through the experiments of distributed…

Tables

Table 1. TABLE I: Synchronous PDMM, where for each $i\in\mathcal{V}$ , $\boldsymbol{P}_{d,ij}=\boldsymbol{P}_{p,ij}^{-1}$ .

Initialization: Properly initialize $\{\boldsymbol{x}_{i}\}$ and $\{\boldsymbol{\lambda}_{i|j}\}$

Repeat

for all $i\in\mathcal{V}$ do

$$\hat{\boldsymbol{x}}_{i}^{k+1}=\arg\min_{\boldsymbol{x}_{i}}\Big[f_{i}(\boldsymbol{x}_{i})-\boldsymbol{x}_{i}^{T}\Big(\sum_{j\in\mathcal{N}_{i}}\boldsymbol{A}_{ij}^{T}\hat{\boldsymbol{\lambda}}_{j|i}^{k}\Big)+\sum_{j\in\mathcal{N}_{i}}\frac{1}{2}\big\|\boldsymbol{A}_{ij}\boldsymbol{x}_{i}+\boldsymbol{A}_{ji}\hat{\boldsymbol{x}}_{j}^{k}-\boldsymbol{c}_{ij}\big\|_{\boldsymbol{P}_{p,ij}}^{2}\Big]$$

end for

for all $i\in\mathcal{V}$ and $j\in\mathcal{N}_{i}$ do

$$\hat{\boldsymbol{\lambda}}_{i|j}^{k+1}=\hat{\boldsymbol{\lambda}}_{j|i}^{k}+\boldsymbol{P}_{p,ij}\big(\boldsymbol{c}_{ij}-\boldsymbol{A}_{ji}\hat{\boldsymbol{x}}_{j}^{k}-\boldsymbol{A}_{ij}\hat{\boldsymbol{x}}_{i}^{k+1}\big)$$

end for

$k\leftarrow k+1$

Until some stopping criterion is met
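The two updates in Table I can be sketched for one concrete instance. The sketch below assumes the distributed-averaging problem with $f_{i}(x_{i})=\frac{1}{2}(x_{i}-t_{i})^{2}$, scalar consensus constraints $x_{i}-x_{j}=0$ (so $A_{ij}=1$ if $i<j$ and $-1$ otherwise, $c_{ij}=0$), and a scalar weight $P_{p,ij}=\gamma$. Under these assumptions the node minimization has a closed form. The function name `pdmm_average`, the 0-based node labels, the sign convention and the zero initialization are illustrative choices, not taken from the paper.

```python
import numpy as np

def pdmm_average(values, edges, gamma=1.0, iters=100):
    """Synchronous PDMM sketch for averaging: f_i(x) = 0.5*(x - values[i])**2."""
    values = np.asarray(values, dtype=float)
    m = len(values)
    nbrs = {i: [] for i in range(m)}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)

    def a(i, j):
        # Assumed sign convention: A_ij = +1 if i < j, else -1, so that
        # A_ij x_i + A_ji x_j = 0 encodes x_i = x_j (c_ij = 0).
        return 1.0 if i < j else -1.0

    x = np.zeros(m)
    # lam[(i, j)] stores lambda_{i|j}: owned by node i, related to neighbour j.
    lam = {(i, j): 0.0 for i in range(m) for j in nbrs[i]}

    for _ in range(iters):
        # Node update: closed-form minimiser of the quadratic subproblem.
        x_new = np.empty(m)
        for i in range(m):
            pull = sum(a(i, j) * lam[(j, i)] + gamma * x[j] for j in nbrs[i])
            x_new[i] = (values[i] + pull) / (1.0 + gamma * len(nbrs[i]))
        # Dual update: uses the neighbour's previous x and the node's new x.
        lam_new = {
            (i, j): lam[(j, i)] - gamma * (a(j, i) * x[j] + a(i, j) * x_new[i])
            for (i, j) in lam
        }
        x, lam = x_new, lam_new
    return x

# Three nodes on a path graph; every entry should approach the mean of the inputs.
estimate = pdmm_average([1.0, 2.0, 6.0], edges=[(0, 1), (1, 2)])
```

Each node only reads the `x` and `lam` entries of its neighbours, which mirrors the message passing implied by the table. How fast the iterates settle depends on `gamma` and on the graph.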

Table 2. TABLE II: Average execution times (per iteration) and their standard deviations for the four methods.

	one-node PDMM	two-node PDMM	ADMM	broadcast	gossip	PDMM (syn)	ADMM (syn)
ave. ($\mu s$)	5.46	8.92	6.54	2.10	0.24	380	384
std ($10^{-6}$)	5.04	8.58	8.09	4.55	1.73	216	285

Equations

$$\min_{\{\boldsymbol{x}_{i}\}}\ \sum_{i\in\mathcal{V}}f_{i}(\boldsymbol{x}_{i})+\sum_{(i,j)\in\mathcal{E}}f_{ij}(\boldsymbol{x}_{i},\boldsymbol{x}_{j}),$$

$$\min_{\{\boldsymbol{x}_{i}\}}\ \sum_{i\in\mathcal{V}}f_{i}(\boldsymbol{x}_{i})+\sum_{(i,j)\in\mathcal{E}}I_{\boldsymbol{A}_{ij}\boldsymbol{x}_{i}+\boldsymbol{A}_{ji}\boldsymbol{x}_{j}=\boldsymbol{c}_{ij}}(\boldsymbol{x}_{i},\boldsymbol{x}_{j}),$$

$$\min_{\boldsymbol{x},\boldsymbol{z}}\ f(\boldsymbol{x})+g(\boldsymbol{z})\quad\text{subject to}\quad\boldsymbol{A}\boldsymbol{x}+\boldsymbol{B}\boldsymbol{z}=\boldsymbol{c},$$

$$h^{\ast}(\boldsymbol{\delta})\stackrel{\Delta}{=}\max_{\boldsymbol{y}}\ \boldsymbol{\delta}^{T}\boldsymbol{y}-h(\boldsymbol{y}),$$

$$\boldsymbol{\delta}^{\prime}\in\partial_{\boldsymbol{y}}h(\boldsymbol{y}^{\prime}),$$

$$h(\boldsymbol{y}^{\prime})=\ldots$$

$$\min_{\boldsymbol{x}}\ \sum_{i\in\mathcal{V}}f_{i}(\boldsymbol{x}_{i})\quad\text{s. t.}\quad\boldsymbol{A}_{ij}\boldsymbol{x}_{i}+\boldsymbol{A}_{ji}\boldsymbol{x}_{j}=\boldsymbol{c}_{ij}\quad\forall(i,j)\in\mathcal{E},$$

$$L_{p}(\boldsymbol{x},\boldsymbol{\delta})=\sum_{i\in\mathcal{V}}f_{i}(\boldsymbol{x}_{i})+\sum_{(i,j)\in\mathcal{E}}\boldsymbol{\delta}_{ij}^{T}(\boldsymbol{c}_{ij}-\boldsymbol{A}_{ij}\boldsymbol{x}_{i}-\boldsymbol{A}_{ji}\boldsymbol{x}_{j}),$$

$$L_{p}(\boldsymbol{x}^{\star},\boldsymbol{\delta})\leq L_{p}(\boldsymbol{x}^{\star},\boldsymbol{\delta}^{\star})\leq L_{p}(\boldsymbol{x},\boldsymbol{\delta}^{\star}).$$

$$\sum_{j\in\mathcal{N}_{i}}\boldsymbol{A}_{ij}^{T}\boldsymbol{\delta}_{ij}^{\star}\in\partial f_{i}(\boldsymbol{x}_{i}^{\star})$$

$$\boldsymbol{A}_{ji}\boldsymbol{x}_{j}^{\star}+\boldsymbol{A}_{ij}\boldsymbol{x}_{i}^{\star}=\boldsymbol{c}_{ij}$$

$$\begin{aligned}\max_{\boldsymbol{\delta}}\min_{\boldsymbol{x}}L_{p}(\boldsymbol{x},\boldsymbol{\delta})&=\max_{\boldsymbol{\delta}}\sum_{i\in\mathcal{V}}\min_{\boldsymbol{x}_{i}}\Big(f_{i}(\boldsymbol{x}_{i})-\sum_{j\in\mathcal{N}_{i}}\boldsymbol{\delta}_{ij}^{T}\boldsymbol{A}_{ij}\boldsymbol{x}_{i}\Big)+\sum_{(i,j)\in\mathcal{E}}\boldsymbol{\delta}_{ij}^{T}\boldsymbol{c}_{ij}\\&=\max_{\boldsymbol{\delta}}\sum_{i\in\mathcal{V}}-f_{i}^{\ast}\Big(\sum_{j\in\mathcal{N}_{i}}\boldsymbol{A}_{ij}^{T}\boldsymbol{\delta}_{ij}\Big)+\sum_{(i,j)\in\mathcal{E}}\boldsymbol{\delta}_{ij}^{T}\boldsymbol{c}_{ij},\end{aligned}$$

$$f_{i}(\boldsymbol{x}_{i})+f_{i}^{\ast}\Big(\sum_{j\in\mathcal{N}_{i}}\boldsymbol{A}_{ij}^{T}\boldsymbol{\delta}_{ij}\Big)\geq\sum_{j\in\mathcal{N}_{i}}\boldsymbol{\delta}_{ij}^{T}\boldsymbol{A}_{ij}\boldsymbol{x}_{i}.$$

$$\max_{\boldsymbol{\delta},\{\boldsymbol{\lambda}_{i}\}}\ \ldots\quad\text{s. t.}\quad\boldsymbol{\lambda}_{i|j}=\boldsymbol{\lambda}_{j|i}=\boldsymbol{\delta}_{ij}\quad\forall(i,j)\in\mathcal{E},$$

$$\boldsymbol{A}_{i}^{T}\boldsymbol{\lambda}_{i}=\sum_{j\in\mathcal{N}_{i}}\boldsymbol{A}_{ij}^{T}\boldsymbol{\lambda}_{i|j}.$$

$$L_{d}^{\prime}(\boldsymbol{\delta},\boldsymbol{\lambda},\boldsymbol{y})=\ldots+\sum_{(i,j)\in\mathcal{E}}\big[\boldsymbol{y}_{i|j}^{T}(\boldsymbol{\delta}_{ij}-\boldsymbol{\lambda}_{i|j})+\boldsymbol{y}_{j|i}^{T}(\boldsymbol{\delta}_{ij}-\boldsymbol{\lambda}_{j|i})\big],$$

$$0=\partial_{\boldsymbol{\lambda}_{i|j}}\big[f_{i}^{\ast}(\boldsymbol{A}_{i}^{T}\boldsymbol{\lambda}_{i}^{\star})\big]+\boldsymbol{A}_{ji}\boldsymbol{x}_{j}^{\star}-\boldsymbol{c}_{ij}\quad\forall[i,j]\in\vec{\mathcal{E}}.$$

$$L_{d}(\boldsymbol{\delta},\boldsymbol{\lambda},\boldsymbol{x})=\ldots-\sum_{(i,j)\in\mathcal{E}}\boldsymbol{\delta}_{ij}^{T}(\boldsymbol{c}_{ij}-\boldsymbol{A}_{ij}\boldsymbol{x}_{i}-\boldsymbol{A}_{ji}\boldsymbol{x}_{j}).$$

$$\begin{aligned}L_{pd}(\boldsymbol{x},\boldsymbol{\lambda})&=L_{p}(\boldsymbol{x},\boldsymbol{\delta})+L_{d}(\boldsymbol{\delta},\boldsymbol{\lambda},\boldsymbol{x})\\&=\sum_{i\in\mathcal{V}}\Big[f_{i}(\boldsymbol{x}_{i})-\sum_{j\in\mathcal{N}_{i}}\boldsymbol{\lambda}_{j|i}^{T}(\boldsymbol{A}_{ij}\boldsymbol{x}_{i}-\boldsymbol{c}_{ij})-f_{i}^{\ast}(\boldsymbol{A}_{i}^{T}\boldsymbol{\lambda}_{i})\Big].\end{aligned}$$

$$\begin{aligned}L_{pd}(\boldsymbol{x}^{\star},\boldsymbol{\lambda})&\leq L_{p}(\boldsymbol{x}^{\star},\boldsymbol{\delta}^{\star})+L_{d}(\boldsymbol{\delta}^{\star},\boldsymbol{\lambda}^{\star},\boldsymbol{x}^{\star})\\&=L_{pd}(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\star})\\&\leq L_{p}(\boldsymbol{x},\boldsymbol{\delta}^{\star})+L_{d}(\boldsymbol{\delta}^{\star},\boldsymbol{\lambda}^{\star},\boldsymbol{x})\end{aligned}$$

$$\min_{x_{1},x_{2}}\ f_{1}(x_{1})+f_{2}(x_{2})\quad\text{s.t.}\quad x_{1}-x_{2}=0,$$

$$\text{where}\qquad f_{1}(x_{1})=f_{2}(-x_{1})=\begin{cases}x_{1}-1&x_{1}\geq 1\\0&\text{otherwise}\end{cases}.$$

$$f_{1}^{\ast}(\delta_{12})=f_{2}^{\ast}(-\delta_{12})=\begin{cases}\delta_{12}&0\leq\delta_{12}\leq 1\\+\infty&\text{otherwise}\end{cases}.$$

$$L_{pd}(\boldsymbol{x},\boldsymbol{\lambda})=\ldots-x_{1}\lambda_{2|1}+x_{2}\lambda_{1|2}.$$

$$L_{\mathcal{P}}(\boldsymbol{x},\boldsymbol{\lambda})=\ldots$$

Taxonomy

Topics: Distributed Control Multi-Agent Systems · Cooperative Communication and Network Coding · Advanced MIMO Systems Optimization

Full text

Distributed Optimization Using the Primal-Dual Method of Multipliers

Guoqiang Zhang and Richard Heusdens

G. Zhang is with both the School of Computing and Communications, University of Technology, Sydney, Australia, and the Department of Microelectronics, Circuits and Systems group, Delft University of Technology, The Netherlands. Email: [email protected]. R. Heusdens is with the Department of Microelectronics, Circuits and Systems group, Delft University of Technology, The Netherlands. Email: [email protected]. Part of the work has been published at ICASSP 2015, in a paper titled Bi-Alternating Direction Method of Multipliers over Graphs. After careful consideration, we decided to change the name of our algorithm from bi-alternating direction method of multipliers (BiADMM) in [1] and [2] to primal-dual method of multipliers (PDMM).

Abstract

In this paper, we propose the primal-dual method of multipliers (PDMM) for distributed optimization over a graph. In particular, we optimize a sum of convex functions defined over a graph, where every edge in the graph carries a linear equality constraint. In designing the new algorithm, an augmented primal-dual Lagrangian function is constructed which smoothly captures the graph topology. It is shown that a saddle point of the constructed function provides an optimal solution of the original problem. Further under both the synchronous and asynchronous updating schemes, PDMM has the convergence rate of $O(1/K)$ (where $K$ denotes the iteration index) for general closed, proper and convex functions. Other properties of PDMM such as convergence speeds versus different parameter-settings and resilience to transmission failure are also investigated through the experiments of distributed averaging.

Index Terms:

Distributed optimization, ADMM, PDMM, sublinear convergence.

I Introduction

In recent years, distributed optimization has drawn increasing attention due to the demand for big-data processing and easy access to ubiquitous computing units (e.g., a computer, a mobile phone or a sensor equipped with a CPU). The basic idea is to have a set of computing units collaborate with each other in a distributed way to complete a complex task. Popular applications include telecommunication [3, 4], wireless sensor networks [5], cloud computing and machine learning [6]. The research challenge is on the design of efficient and robust distributed optimization algorithms for those applications.

To the best of our knowledge, almost all the optimization problems in those applications can be formulated as optimization over a graphic model $G=(\mathcal{V},\mathcal{E})$ :

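$$\min_{\{\boldsymbol{x}_{i}\}}\ \sum_{i\in\mathcal{V}}f_{i}(\boldsymbol{x}_{i})+\sum_{(i,j)\in\mathcal{E}}f_{ij}(\boldsymbol{x}_{i},\boldsymbol{x}_{j}),\tag{1}$$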

where $\{f_{i}|i\in\mathcal{V}\}$ and $\{f_{ij}|(i,j)\in\mathcal{E}\}$ are referred to as node and edge-functions, respectively. For instance, for the application of distributed quadratic optimization, all the node and edge-functions are in the form of scalar quadratic functions (see [7, 8, 9]).

In the literature, a large number of applications (see [10]) require that every edge function $f_{ij}(\boldsymbol{x}_{i},\boldsymbol{x}_{j})$ , $(i,j)\in\mathcal{E}$ , is essentially a linear equality constraint in terms of $\boldsymbol{x}_{i}$ and $\boldsymbol{x}_{j}$ . Mathematically, we use $\boldsymbol{A}_{ij}\boldsymbol{x}_{i}+\boldsymbol{A}_{ji}\boldsymbol{x}_{j}=\boldsymbol{c}_{ij}$ to formulate the equality constraint for each $(i,j)\in\mathcal{E}$ , as demonstrated in Fig. 1. In this situation, (1) can be described as

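$$\min_{\{\boldsymbol{x}_{i}\}}\ \sum_{i\in\mathcal{V}}f_{i}(\boldsymbol{x}_{i})+\sum_{(i,j)\in\mathcal{E}}I_{\boldsymbol{A}_{ij}\boldsymbol{x}_{i}+\boldsymbol{A}_{ji}\boldsymbol{x}_{j}=\boldsymbol{c}_{ij}}(\boldsymbol{x}_{i},\boldsymbol{x}_{j}),\tag{2}$$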

where $I_{(\cdot)}$ denotes the indicator or characteristic function defined as $I_{\mathcal{C}}(\boldsymbol{x})=0$ if $\boldsymbol{x}\in\mathcal{C}$ and $I_{\mathcal{C}}(\boldsymbol{x})=\infty$ if $\boldsymbol{x}\notin\mathcal{C}$ . In this paper, we focus on convex optimization of form (2), where every node-function $f_{i}$ is closed, proper and convex.

Most recent research has focused on a specialized form of the convex problem (2), where every edge-function $f_{ij}$ reduces to $I_{\boldsymbol{x}_{i}=\boldsymbol{x}_{j}}(\boldsymbol{x}_{i},\boldsymbol{x}_{j})$ . This problem is commonly known as the consensus problem in the literature. Classic methods include the dual-averaging algorithm [11], the subgradient algorithm [12], and the diffusion adaptation algorithm [13]. For the special case that $\{f_{i}|i\in\mathcal{V}\}$ are scalar quadratic functions (referred to as the distributed averaging problem), the most popular methods are the randomized gossip algorithm [5] and the broadcast algorithm [14]. See [15] for an overview of the literature for solving the distributed averaging problem.

The alternating-direction method of multipliers (ADMM) can be applied to solve the general convex optimization (2). The key step is to decompose each equality constraint $\boldsymbol{A}_{ij}\boldsymbol{x}_{i}+\boldsymbol{A}_{ji}\boldsymbol{x}_{j}=\boldsymbol{c}_{ij}$ into two constraints such as $\boldsymbol{A}_{ij}\boldsymbol{x}_{i}+\boldsymbol{z}_{ij}=\boldsymbol{c}_{ij}$ and $\boldsymbol{z}_{ij}=\boldsymbol{A}_{ji}\boldsymbol{x}_{j}$ with the help of the auxiliary variable $\boldsymbol{z}_{ij}$ . As a result, (2) can be reformulated as

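$$\min_{\boldsymbol{x},\boldsymbol{z}}\ f(\boldsymbol{x})+g(\boldsymbol{z})\quad\text{subject to}\quad\boldsymbol{A}\boldsymbol{x}+\boldsymbol{B}\boldsymbol{z}=\boldsymbol{c},\tag{3}$$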

where $f(\boldsymbol{x})=\sum_{i\in\mathcal{V}}f_{i}(\boldsymbol{x}_{i})$ , $g(\boldsymbol{z})=0$ and $\boldsymbol{z}$ is a vector obtained by stacking up $\boldsymbol{z}_{ij}$ one after another. See [16] for using ADMM to solve the consensus problem of (2) (with edge-function $I_{\boldsymbol{x}_{i}=\boldsymbol{x}_{j}}(\boldsymbol{x}_{i},\boldsymbol{x}_{j})$ ). The graphic structure is implicitly embedded in the two matrices $(\boldsymbol{A},\boldsymbol{B})$ and the vector $\boldsymbol{c}$ . The reformulation essentially converts the problem on a general graph with many nodes (2) to a graph with only two nodes (3), allowing the application of ADMM. Based on (3), ADMM then constructs and optimizes an augmented Lagrangian function iteratively with respect to $(\boldsymbol{x},\boldsymbol{z})$ and a set of Lagrangian multipliers. We refer to the above procedure as synchronous ADMM as it updates all the variables at each iteration. Recently, the work of [17] proposed asynchronous ADMM, which optimizes the same function over a subset of the variables at each iteration.

We note that besides solving (2), ADMM has found many successful applications in the fields of signal processing and machine learning (see [10] for an overview). For instance, in [18] and [19], variants of ADMM have been proposed to solve a (possibly nonconvex) optimization problem defined over a graph with a star topology, which is motivated by big data applications. The work of [20] considers solving the consensus problem of (2) (with edge-function $I_{\boldsymbol{x}_{i}=\boldsymbol{x}_{j}}(\boldsymbol{x}_{i},\boldsymbol{x}_{j})$ ) over a general graph, where each node function $f_{i}$ is further expressed as a sum of two component functions. The authors of [20] propose a new algorithm which includes ADMM as a special case when one component function is zero. In general, ADMM and its variants are quite simple and often provide satisfactory results after a reasonable number of iterations, which has made them popular in recent years.

In this paper, we tackle the convex problem (2) directly instead of relying on the reformulation (3). Specifically, we construct an augmented primal-dual Lagrangian function for (2) without introducing the auxiliary variable $\boldsymbol{z}$ as is required by ADMM. We show that solving (2) is equivalent to searching for a saddle point of the augmented primal-dual Lagrangian. We then propose the primal-dual method of multipliers (PDMM) to iteratively approach one saddle point of the constructed function. It is shown that for both the synchronous and asynchronous updating schemes, the PDMM converges with the rate of $\mathcal{O}(1/K)$ for general closed, proper and convex functions.

Further, we evaluate PDMM through experiments on distributed averaging. Firstly, it is found that the parameters of PDMM should be selected by a rule (see Section VI-C1) for fast convergence. Secondly, when there are transmission failures in the graph, transmission losses only slow down the convergence speed of PDMM. Finally, experimental comparison suggests that PDMM outperforms ADMM and the two gossip algorithms in [5] and [14].

This work is mainly devoted to the theoretical analysis of PDMM. In the literature, PDMM has already been successfully applied to a few other problems. The work of [21] investigates the efficiency of ADMM and PDMM for distributed dictionary learning. In [22], we have used both ADMM and PDMM for training a support vector machine (SVM). In the above examples it is found that PDMM outperforms ADMM in terms of convergence rate. In [23], the authors describe an application of the linearly constrained minimum variance (LCMV) beamformer for use in acoustic wireless sensor networks. The proposed algorithm computes the optimal beamformer output at each node in the network without the need for sharing raw data within the network. PDMM has been successfully applied to perform distributed beamforming. This suggests that PDMM is not only theoretically interesting but also might be powerful in real applications.

II Problem Setting

In this section, we first introduce basic notations needed in the rest of the paper. We then make a proper assumption about the existence of optimal solutions of the problem. Finally, we derive the dual problem to (2) and its Lagrangian function, which will be used for constructing the augmented primal-dual Lagrangian function in Section III.

II-A Notations and functional properties

We first introduce notations for a graphic model. We denote a graph as $G=(\mathcal{V},\mathcal{E})$ , where $\mathcal{V}=\{1,\ldots,m\}$ represents the set of nodes and $\mathcal{E}=\{(i,j)|i,j\in\mathcal{V}\}$ represents the set of edges in the graph, respectively. We use $\vec{\mathcal{E}}$ to denote the set of all directed edges. Therefore, $|\vec{\mathcal{E}}|=2|\mathcal{E}|$ . The directed edge $[i,j]$ starts from node $i$ and ends with node $j$ . We use $\mathcal{N}_{i}$ to denote the set of all neighboring nodes of node $i$ , i.e., $\mathcal{N}_{i}=\{j|(i,j)\in\mathcal{E}\}$ . Given a graph $G=(\mathcal{V},\mathcal{E})$ , only neighboring nodes are allowed to communicate with each other directly.
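The graph notation above can be illustrated with a small data-structure sketch. The function name `build_graph` and the 0-based node labels are illustrative assumptions (the paper labels nodes $1,\ldots,m$). The sketch builds the neighbour sets $\mathcal{N}_{i}$ and the directed edge set, and shows why the latter has twice as many elements as $\mathcal{E}$.

```python
def build_graph(m, edges):
    """Return neighbour sets N_i and the directed edges [i, j] of an undirected graph."""
    neighbours = {i: set() for i in range(m)}
    for i, j in edges:
        neighbours[i].add(j)
        neighbours[j].add(i)
    # Every undirected edge (i, j) contributes the two directed edges [i, j] and [j, i].
    directed = [(i, j) for i in range(m) for j in sorted(neighbours[i])]
    return neighbours, directed

neighbours, directed = build_graph(4, [(0, 1), (1, 2), (2, 3)])
```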

Next we introduce notations for mathematical description in the remainder of the paper. We use bold small letters to denote vectors and bold capital letters to denote matrices. The notation $\boldsymbol{M}\succeq 0$ (or $\boldsymbol{M}\succ 0$ ) represents a symmetric positive semi-definite matrix (or a symmetric positive definite matrix). The superscript $(\cdot)^{T}$ represents the transpose operator. Given a vector $\boldsymbol{y}$ , we use $\|\boldsymbol{y}\|$ to denote its $l_{2}$ norm.

Finally, we introduce the conjugate function. Suppose $h:\mathbb{R}^{n}\rightarrow\mathbb{R}\cup\{+\infty\}$ is a closed, proper and convex function. Then the conjugate of $h(\cdot)$ is defined as [24, Definition 2.1.20]

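$$h^{\ast}(\boldsymbol{\delta})\stackrel{\Delta}{=}\max_{\boldsymbol{y}}\ \boldsymbol{\delta}^{T}\boldsymbol{y}-h(\boldsymbol{y}),\tag{4}$$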

where the conjugate function $h^{\ast}$ is again a closed, proper and convex function. Let $\boldsymbol{y}^{\prime}$ be the optimal solution for a particular $\boldsymbol{\delta}^{\prime}$ in (4). We then have

$$\boldsymbol{\delta}^{\prime}\in\partial_{\boldsymbol{y}}h(\boldsymbol{y}^{\prime}),\tag{5}$$

where $\partial_{\boldsymbol{y}}h(\boldsymbol{y}^{\prime})$ represents the set of all subgradients of $h(\cdot)$ at $\boldsymbol{y}^{\prime}$ (see [24, Definition 2.1.23]). As a consequence, since $h^{\ast\ast}=h$ , we have

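$$h(\boldsymbol{y}^{\prime})=\boldsymbol{y}^{\prime T}\boldsymbol{\delta}^{\prime}-h^{\ast}(\boldsymbol{\delta}^{\prime})=\max_{\boldsymbol{\delta}}\ \boldsymbol{y}^{\prime T}\boldsymbol{\delta}-h^{\ast}(\boldsymbol{\delta}),\tag{6}$$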

and we conclude that $\boldsymbol{y}^{\prime}\in\partial_{\boldsymbol{\delta}}h^{\ast}(\boldsymbol{\delta}^{\prime})$ as well.
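As a small numerical illustration of the conjugate definition, the sketch below approximates $h^{\ast}(\delta)$ on a grid for the scalar function $h(y)=\frac{1}{2}y^{2}$, whose conjugate is known in closed form to be $\frac{1}{2}\delta^{2}$. The helper name `conjugate_on_grid`, the grid range and the test value $\delta^{\prime}=1.5$ are illustrative assumptions.

```python
import numpy as np

def conjugate_on_grid(h, delta, ys):
    """Approximate h*(delta) = max_y delta*y - h(y) over the grid ys (scalar case)."""
    values = delta * ys - h(ys)
    k = int(np.argmax(values))
    return float(values[k]), float(ys[k])

def h(y):
    return 0.5 * y ** 2

ys = np.linspace(-5.0, 5.0, 10001)
h_star, y_opt = conjugate_on_grid(h, 1.5, ys)
# For this h the maximiser y' satisfies delta' = h'(y') = y', which is the
# subgradient relation stated in the text, and h*(delta') = 0.5 * delta'**2.
```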

II-B Problem assumption

With the notation $G=(\mathcal{V},\mathcal{E})$ for a graph, we first reformulate the convex problem (2) as

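$$\min_{\boldsymbol{x}}\ \sum_{i\in\mathcal{V}}f_{i}(\boldsymbol{x}_{i})\quad\text{s. t.}\quad\boldsymbol{A}_{ij}\boldsymbol{x}_{i}+\boldsymbol{A}_{ji}\boldsymbol{x}_{j}=\boldsymbol{c}_{ij}\quad\forall(i,j)\in\mathcal{E},\tag{7}$$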

where each function $f_{i}:\mathbb{R}^{n_{i}}\rightarrow\mathbb{R}\cup\{+\infty\}$ is assumed to be closed, proper and convex, and $\boldsymbol{x}=[\boldsymbol{x}_{1}^{T},\boldsymbol{x}_{2}^{T},\ldots,\boldsymbol{x}_{m}^{T}]^{T}$ . For every edge $(i,j)\in\mathcal{E}$ , we let $(\boldsymbol{c}_{ij},\boldsymbol{A}_{ij},\boldsymbol{A}_{ji})\in(\mathbb{R}^{n_{ij}},\mathbb{R}^{n_{ij}\times n_{i}},\mathbb{R}^{n_{ij}\times n_{j}})$ . The vector $\boldsymbol{x}$ is thus of dimension $n_{\boldsymbol{x}}=\sum_{i\in\mathcal{V}}n_{i}$ . In general, $\boldsymbol{A}_{ij}$ and $\boldsymbol{A}_{ji}$ are two different matrices. The matrix $\boldsymbol{A}_{ij}$ operates on $\boldsymbol{x}_{i}$ in the linear constraint of edge $(i,j)\in\mathcal{E}$ . The notation s. t. in (7) stands for “subject to”. We take the reformulation (7) as the primal problem in terms of $\boldsymbol{x}$ .

The primal Lagrangian for (7) can be constructed as

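$$L_{p}(\boldsymbol{x},\boldsymbol{\delta})=\sum_{i\in\mathcal{V}}f_{i}(\boldsymbol{x}_{i})+\sum_{(i,j)\in\mathcal{E}}\boldsymbol{\delta}_{ij}^{T}(\boldsymbol{c}_{ij}-\boldsymbol{A}_{ij}\boldsymbol{x}_{i}-\boldsymbol{A}_{ji}\boldsymbol{x}_{j}),\tag{8}$$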

where $\boldsymbol{\delta}_{ij}$ is the Lagrangian multiplier (or the dual variable) for the corresponding edge constraint in (7), and the vector $\boldsymbol{\delta}$ is obtained by stacking all the dual variables $\boldsymbol{\delta}_{ij}$ , $(i,j)\in\mathcal{E}$ , on top of one another. Therefore, $\boldsymbol{\delta}$ is of dimension $n_{\boldsymbol{\delta}}=\sum_{(i,j)\in\mathcal{E}}n_{ij}$ . The Lagrangian function is convex in $\boldsymbol{x}$ for fixed $\boldsymbol{\delta}$ , and concave in $\boldsymbol{\delta}$ for fixed $\boldsymbol{x}$ . Throughout the rest of the paper, we will make the following (common) assumption:

Assumption 1.

There exists a saddle point $(\boldsymbol{x}^{\star},\boldsymbol{\delta}^{\star})$ to the Lagrangian function $L_{p}(\boldsymbol{x},\boldsymbol{\delta})$ such that for all $\boldsymbol{x}\in\mathbb{R}^{n_{\boldsymbol{x}}}$ and $\boldsymbol{\delta}\in\mathbb{R}^{n_{\boldsymbol{\delta}}}$ we have

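$$L_{p}(\boldsymbol{x}^{\star},\boldsymbol{\delta})\leq L_{p}(\boldsymbol{x}^{\star},\boldsymbol{\delta}^{\star})\leq L_{p}(\boldsymbol{x},\boldsymbol{\delta}^{\star}).\tag{9}$$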

Or equivalently, the following optimality (KKT) conditions hold for $(\boldsymbol{x}^{\star},\boldsymbol{\delta}^{\star})$ :

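$$\begin{aligned}\sum_{j\in\mathcal{N}_{i}}\boldsymbol{A}_{ij}^{T}\boldsymbol{\delta}_{ij}^{\star}&\in\partial f_{i}(\boldsymbol{x}_{i}^{\star})&&\forall i\in\mathcal{V},\\\boldsymbol{A}_{ji}\boldsymbol{x}_{j}^{\star}+\boldsymbol{A}_{ij}\boldsymbol{x}_{i}^{\star}&=\boldsymbol{c}_{ij}&&\forall(i,j)\in\mathcal{E}.\end{aligned}\tag{10}$$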

II-C Dual problem and its Lagrangian function

We first derive the dual problem to (7). Optimizing $L_{p}(\boldsymbol{x},\boldsymbol{\delta})$ over $\boldsymbol{\delta}$ and $\boldsymbol{x}$ yields

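$$\begin{aligned}\max_{\boldsymbol{\delta}}\min_{\boldsymbol{x}}L_{p}(\boldsymbol{x},\boldsymbol{\delta})&=\max_{\boldsymbol{\delta}}\sum_{i\in\mathcal{V}}\min_{\boldsymbol{x}_{i}}\Big(f_{i}(\boldsymbol{x}_{i})-\sum_{j\in\mathcal{N}_{i}}\boldsymbol{\delta}_{ij}^{T}\boldsymbol{A}_{ij}\boldsymbol{x}_{i}\Big)+\sum_{(i,j)\in\mathcal{E}}\boldsymbol{\delta}_{ij}^{T}\boldsymbol{c}_{ij}\\&=\max_{\boldsymbol{\delta}}\sum_{i\in\mathcal{V}}-f_{i}^{\ast}\Big(\sum_{j\in\mathcal{N}_{i}}\boldsymbol{A}_{ij}^{T}\boldsymbol{\delta}_{ij}\Big)+\sum_{(i,j)\in\mathcal{E}}\boldsymbol{\delta}_{ij}^{T}\boldsymbol{c}_{ij},\end{aligned}\tag{11}$$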

where $f_{i}^{\ast}(\cdot)$ is the conjugate function of $f_{i}(\cdot)$ as defined in (4), satisfying Fenchel’s inequality

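$$f_{i}(\boldsymbol{x}_{i})+f_{i}^{\ast}\Big(\sum_{j\in\mathcal{N}_{i}}\boldsymbol{A}_{ij}^{T}\boldsymbol{\delta}_{ij}\Big)\geq\sum_{j\in\mathcal{N}_{i}}\boldsymbol{\delta}_{ij}^{T}\boldsymbol{A}_{ij}\boldsymbol{x}_{i}.\tag{12}$$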

Under Assumption 1, the dual problem (11) is equivalent to the primal problem (7). That is, suppose $(\boldsymbol{x}^{\star},\boldsymbol{\delta}^{\star})$ is a saddle point of $L_{p}$ . Then $\boldsymbol{x}^{\star}$ solves the primal problem (7) and $\boldsymbol{\delta}^{\star}$ solves the dual problem (11).
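The equivalence of the primal and dual problems can be checked numerically on a toy instance. The sketch below assumes a two-node graph with $f_{i}(x_{i})=\frac{1}{2}(x_{i}-t_{i})^{2}$ and the single constraint $x_{1}-x_{2}=0$ (so $A_{12}=1$, $A_{21}=-1$, $c_{12}=0$). For this choice $f_{i}^{\ast}(u)=\frac{1}{2}u^{2}+t_{i}u$, and the dual function reduces to $-f_{1}^{\ast}(\delta)-f_{2}^{\ast}(-\delta)$. The function names and the grid search over $\delta$ are illustrative assumptions.

```python
import numpy as np

def primal_value(t1, t2):
    # With x1 = x2 = x, the sum of the two quadratics is minimised at the average.
    x = 0.5 * (t1 + t2)
    return 0.5 * (x - t1) ** 2 + 0.5 * (x - t2) ** 2

def f_conj(u, t):
    # Conjugate of f(x) = 0.5 * (x - t)**2.
    return 0.5 * u ** 2 + t * u

def dual_function(delta, t1, t2):
    # -f_1*(A_12 * delta) - f_2*(A_21 * delta) with A_12 = 1, A_21 = -1, c_12 = 0.
    return -f_conj(delta, t1) - f_conj(-delta, t2)

def dual_value(t1, t2):
    deltas = np.linspace(-10.0, 10.0, 20001)
    return float(np.max(dual_function(deltas, t1, t2)))

gap = primal_value(1.0, 4.0) - dual_value(1.0, 4.0)
```

Up to the grid resolution, `gap` should be zero, and the maximizing $\delta$ satisfies the stationarity condition $A_{12}\delta^{\star}=x^{\star}-t_{1}$ from the optimality conditions.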

At this point, we need to introduce auxiliary variables to decouple the node dependencies in (11). Indeed, every $\boldsymbol{\delta}_{ij}$ , associated to edge $(i,j)$ , is used by two conjugate functions $f_{i}^{\ast}$ and $f_{j}^{\ast}$ . As a consequence, all conjugate functions in (11) are dependent on each other. To decouple the conjugate functions, we introduce for each edge $(i,j)\in\mathcal{E}$ two auxiliary node variables $\boldsymbol{\lambda}_{i|j}\in\mathbb{R}^{n_{ij}}$ and $\boldsymbol{\lambda}_{j|i}\in\mathbb{R}^{n_{ij}}$ , one for each node $i$ and $j$ , respectively. The node variable $\boldsymbol{\lambda}_{i|j}$ is owned by and updated at node $i$ and is related to neighboring node $j$ . Hence, at every node $i$ we introduce $|\mathcal{N}_{i}|$ new node variables. With this, we can reformulate the original dual problem as

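$$\max_{\boldsymbol{\delta},\{\boldsymbol{\lambda}_{i}\}}\ \sum_{i\in\mathcal{V}}-f_{i}^{\ast}(\boldsymbol{A}_{i}^{T}\boldsymbol{\lambda}_{i})+\sum_{(i,j)\in\mathcal{E}}\boldsymbol{\delta}_{ij}^{T}\boldsymbol{c}_{ij}\quad\text{s. t.}\quad\boldsymbol{\lambda}_{i|j}=\boldsymbol{\lambda}_{j|i}=\boldsymbol{\delta}_{ij}\quad\forall(i,j)\in\mathcal{E},\tag{13}$$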

where $\boldsymbol{\lambda}_{i}$ is obtained by vertically concatenating all $\boldsymbol{\lambda}_{i|j}$ , $j\in\mathcal{N}_{i}$ , and $\boldsymbol{A}_{i}^{T}$ is obtained by horizontally concatenating all $\boldsymbol{A}_{ij}^{T}$ , $j\in\mathcal{N}_{i}$ . To clarify, the product $\boldsymbol{A}_{i}^{T}\boldsymbol{\lambda}_{i}$ in (13) equals

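$$\boldsymbol{A}_{i}^{T}\boldsymbol{\lambda}_{i}=\sum_{j\in\mathcal{N}_{i}}\boldsymbol{A}_{ij}^{T}\boldsymbol{\lambda}_{i|j}.\tag{14}$$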

Consequently, we let $\boldsymbol{\lambda}=[\boldsymbol{\lambda}_{1}^{T},\boldsymbol{\lambda}_{2}^{T},\ldots,\boldsymbol{\lambda}_{m}^{T}]^{T}$ . In the above reformulation (13), each conjugate function $f_{i}^{\ast}(\cdot)$ only involves the node variable $\boldsymbol{\lambda}_{i}$ , facilitating distributed optimization.
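The stacking convention for $\boldsymbol{A}_{i}^{T}$ and $\boldsymbol{\lambda}_{i}$ can be checked with a few lines of NumPy. The dimensions, the neighbour labels and the random data below are hypothetical and serve only to show that horizontal concatenation of the $\boldsymbol{A}_{ij}^{T}$ together with vertical concatenation of the $\boldsymbol{\lambda}_{i|j}$ reproduces the sum over neighbours.

```python
import numpy as np

rng = np.random.default_rng(0)
n_i = 3                       # dimension of x_i (hypothetical)
edge_dims = {1: 2, 2: 4}      # n_ij for two hypothetical neighbours j = 1, 2

# A[j] plays the role of A_ij (shape n_ij x n_i), lam[j] the role of lambda_{i|j}.
A = {j: rng.standard_normal((n_ij, n_i)) for j, n_ij in edge_dims.items()}
lam = {j: rng.standard_normal(n_ij) for j, n_ij in edge_dims.items()}

A_i_T = np.hstack([A[j].T for j in sorted(A)])         # shape (n_i, sum of n_ij)
lam_i = np.concatenate([lam[j] for j in sorted(lam)])  # shape (sum of n_ij,)

stacked = A_i_T @ lam_i
summed = sum(A[j].T @ lam[j] for j in sorted(A))
```

Both `stacked` and `summed` are vectors of length `n_i`, so each conjugate function $f_{i}^{\ast}$ receives an argument of the same dimension as $\boldsymbol{x}_{i}$.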

Next we tackle the equality constraints in (13). To do so, we construct a (dual) Lagrangian function for the dual problem (13), which is given by

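$$L_{d}^{\prime}(\boldsymbol{\delta},\boldsymbol{\lambda},\boldsymbol{y})=\sum_{i\in\mathcal{V}}-f_{i}^{\ast}(\boldsymbol{A}_{i}^{T}\boldsymbol{\lambda}_{i})+\sum_{(i,j)\in\mathcal{E}}\boldsymbol{\delta}_{ij}^{T}\boldsymbol{c}_{ij}+\sum_{(i,j)\in\mathcal{E}}\big[\boldsymbol{y}_{i|j}^{T}(\boldsymbol{\delta}_{ij}-\boldsymbol{\lambda}_{i|j})+\boldsymbol{y}_{j|i}^{T}(\boldsymbol{\delta}_{ij}-\boldsymbol{\lambda}_{j|i})\big],\tag{15}$$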

where $\boldsymbol{y}$ is obtained by concatenating all the Lagrangian multipliers $\boldsymbol{y}_{i|j}$ , $[i,j]\in\vec{\mathcal{E}}$ , one after another.

We now argue that each Lagrangian multiplier $\boldsymbol{y}_{i|j}$ , $[i,j]\in\vec{\mathcal{E}}$ , in (15) can be replaced by an affine function of $\boldsymbol{x}_{j}$ . Suppose $(\boldsymbol{x}^{\star},\boldsymbol{\delta}^{\star})$ is a saddle point of $L_{p}$ . By letting $\boldsymbol{\lambda}_{i|j}^{\star}=\boldsymbol{\delta}_{ij}^{\star}$ for every $[i,j]\in\vec{\mathcal{E}}$ , Fenchel’s inequality (12) must hold with equality at $(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\star})$ from which we derive that

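$$0=\partial_{\boldsymbol{\lambda}_{i|j}}\big[f_{i}^{\ast}(\boldsymbol{A}_{i}^{T}\boldsymbol{\lambda}_{i}^{\star})\big]+\boldsymbol{A}_{ji}\boldsymbol{x}_{j}^{\star}-\boldsymbol{c}_{ij}\quad\forall[i,j]\in\vec{\mathcal{E}}.$$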

One can then show that $(\boldsymbol{\delta}^{\star},\boldsymbol{\lambda}^{\star},\boldsymbol{y}^{\star})$ where $\boldsymbol{y}_{i|j}^{\star}=\boldsymbol{A}_{ji}\boldsymbol{x}_{j}^{\star}-\boldsymbol{c}_{ij}$ for every $[i,j]\in\vec{\mathcal{E}}$ , is a saddle point of $L_{d}^{\prime}$ . We therefore restrict the Lagrangian multiplier $\boldsymbol{y}_{i|j}$ to be of the form $\boldsymbol{y}_{i|j}=\boldsymbol{A}_{ji}\boldsymbol{x}_{j}-\boldsymbol{c}_{ij}$ so that the dual Lagrangian becomes

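$$L_{d}(\boldsymbol{\delta},\boldsymbol{\lambda},\boldsymbol{x})=\sum_{i\in\mathcal{V}}\Big[-f_{i}^{\ast}(\boldsymbol{A}_{i}^{T}\boldsymbol{\lambda}_{i})-\sum_{j\in\mathcal{N}_{i}}\boldsymbol{\lambda}_{j|i}^{T}(\boldsymbol{A}_{ij}\boldsymbol{x}_{i}-\boldsymbol{c}_{ij})\Big]-\sum_{(i,j)\in\mathcal{E}}\boldsymbol{\delta}_{ij}^{T}(\boldsymbol{c}_{ij}-\boldsymbol{A}_{ij}\boldsymbol{x}_{i}-\boldsymbol{A}_{ji}\boldsymbol{x}_{j}).\tag{16}$$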

We summarize the result in a lemma below:

Lemma 1.

If $(\boldsymbol{x}^{\star},\boldsymbol{\delta}^{\star})$ is a saddle point of $L_{p}(\boldsymbol{x},\boldsymbol{\delta})$ , then $(\boldsymbol{\delta}^{\star},\boldsymbol{\lambda}^{\star},\boldsymbol{x}^{\star})$ is a saddle point of $L_{d}(\boldsymbol{\delta},\boldsymbol{\lambda},\boldsymbol{x})$ , where $\boldsymbol{\lambda}_{i|j}^{\star}=\boldsymbol{\delta}_{ij}^{\star}$ for every $[i,j]\in\vec{\mathcal{E}}$ .

We note that $L_{d}(\boldsymbol{\delta},\boldsymbol{\lambda},\boldsymbol{x})$ might not be equivalent to $L_{d}^{\prime}(\boldsymbol{\delta},\boldsymbol{\lambda},\boldsymbol{y})$ . By inspection of the optimality conditions of (16), not every saddle point $(\boldsymbol{\delta}^{\star},\boldsymbol{\lambda}^{\star},\boldsymbol{x}^{\star})$ of $L_{d}$ necessarily satisfies $\{\boldsymbol{\lambda}_{i|j}^{\star}=\boldsymbol{\lambda}_{j|i}^{\star},(i,j)\in\mathcal{E}\}$ , due to the generality of the matrices $\{\boldsymbol{A}_{ij},[i,j]\in\vec{\mathcal{E}}\}$ . In the next section we will introduce quadratic penalty functions w.r.t. $\boldsymbol{\lambda}$ to implicitly enforce the equality constraints $\{\boldsymbol{\lambda}_{i|j}^{\star}=\boldsymbol{\lambda}_{j|i}^{\star},(i,j)\in\mathcal{E}\}$ .

To briefly summarize, one can alternatively solve the dual problem (13) instead of the primal problem. Further, by replacing $\boldsymbol{y}$ with an affine function of $\boldsymbol{x}$ in (15), the dual Lagrangian $L_{d}(\boldsymbol{\delta},\boldsymbol{\lambda},\boldsymbol{x})$ shares two variables $\boldsymbol{x}$ and $\boldsymbol{\delta}$ with the primal Lagrangian $L_{p}(\boldsymbol{x},\boldsymbol{\delta})$ . We will show in the next section that the special form of $L_{d}$ in (16) plays a crucial role in constructing the augmented primal-dual Lagrangian.

III Augmented Primal-Dual Lagrangian

In this section, we first build and investigate a primal-dual Lagrangian from $L_{p}$ and $L_{d}$ . We show that a saddle point of the primal-dual Lagrangian does not always lead to an optimal solution of the primal or the dual problem.

To address the above issue, we then construct an augmented primal-dual Lagrangian by introducing two additional penalty functions. We show that any saddle point of the augmented primal-dual Lagrangian leads to an optimal solution of the primal and the dual problem, respectively.

III-A Primal-dual Lagrangian

By inspection of (8) and (16), we see that in both $L_{p}$ and $L_{d}$ , the edge variables $\boldsymbol{\delta}_{ij}$ are related to the terms $\boldsymbol{c}_{ij}-\boldsymbol{A}_{ij}\boldsymbol{x}_{i}-\boldsymbol{A}_{ji}\boldsymbol{x}_{j}$ . As a consequence, if we add the primal and dual Lagrangian functions, the edge variables $\boldsymbol{\delta}_{ij}$ will cancel out and the resulting function contains node variables $\boldsymbol{x}$ and $\boldsymbol{\lambda}$ only.

We hereby define the new function as the primal-dual Lagrangian below:

Definition 1.

The primal-dual Lagrangian is defined as

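$$\begin{aligned}L_{pd}(\boldsymbol{x},\boldsymbol{\lambda})&=L_{p}(\boldsymbol{x},\boldsymbol{\delta})+L_{d}(\boldsymbol{\delta},\boldsymbol{\lambda},\boldsymbol{x})\\&=\sum_{i\in\mathcal{V}}\Big[f_{i}(\boldsymbol{x}_{i})-\sum_{j\in\mathcal{N}_{i}}\boldsymbol{\lambda}_{j|i}^{T}(\boldsymbol{A}_{ij}\boldsymbol{x}_{i}-\boldsymbol{c}_{ij})-f_{i}^{\ast}(\boldsymbol{A}_{i}^{T}\boldsymbol{\lambda}_{i})\Big].\end{aligned}$$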

$L_{pd}(\boldsymbol{x},\boldsymbol{\lambda})$ is convex in $\boldsymbol{x}$ for fixed $\boldsymbol{\lambda}$ and concave in $\boldsymbol{\lambda}$ for fixed $\boldsymbol{x}$ , suggesting that it is essentially a saddle-point problem (see [25], [26] for solving different saddle point problems). For each edge $(i,j)\in\mathcal{E}$ , the node variables $\boldsymbol{\lambda}_{i|j}$ and $\boldsymbol{\lambda}_{j|i}$ take over the role of the edge variable $\boldsymbol{\delta}_{ij}$ . The removal of $\boldsymbol{\delta}_{ij}$ makes it possible to design a distributed algorithm that only involves node-oriented optimization (see the next section for PDMM).

Next we study the properties of saddle points of $L_{pd}(\boldsymbol{x},\boldsymbol{\lambda})$ :

Lemma 2.

If $\boldsymbol{x}^{\star}$ solves the primal problem (7), then there exists a $\boldsymbol{\lambda}^{\star}$ such that $(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\star})$ is a saddle point of $L_{pd}(\boldsymbol{x},\boldsymbol{\lambda})$ .

Proof.

If $\boldsymbol{x}^{\star}$ solves the primal problem (7), then there exists a $\boldsymbol{\delta}^{\star}$ such that $(\boldsymbol{x}^{\star},\boldsymbol{\delta}^{\star})$ is a saddle point of $L_{p}(\boldsymbol{x},\boldsymbol{\delta})$ and by Lemma 1, there exist $\boldsymbol{\lambda}_{i|j}^{\star}=\boldsymbol{\delta}_{ij}^{\star}$ for every $[i,j]\in\vec{\mathcal{E}}$ so that $(\boldsymbol{\delta}^{\star},\boldsymbol{\lambda}^{\star},\boldsymbol{x}^{\star})$ is a saddle point of $L_{d}(\boldsymbol{\delta},\boldsymbol{\lambda},\boldsymbol{x})$ . Hence

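$$\begin{aligned}L_{pd}(\boldsymbol{x}^{\star},\boldsymbol{\lambda})&\leq L_{p}(\boldsymbol{x}^{\star},\boldsymbol{\delta}^{\star})+L_{d}(\boldsymbol{\delta}^{\star},\boldsymbol{\lambda}^{\star},\boldsymbol{x}^{\star})\\&=L_{pd}(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\star})\\&\leq L_{p}(\boldsymbol{x},\boldsymbol{\delta}^{\star})+L_{d}(\boldsymbol{\delta}^{\star},\boldsymbol{\lambda}^{\star},\boldsymbol{x})\\&=L_{pd}(\boldsymbol{x},\boldsymbol{\lambda}^{\star}).\end{aligned}$$

∎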

The fact that $(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\star})$ is a saddle point of $L_{pd}(\boldsymbol{x},\boldsymbol{\lambda})$ , however, is not sufficient to show that $\boldsymbol{x}^{\star}$ is optimal for the primal problem (7) or that $\boldsymbol{\lambda}^{\star}$ is optimal for the dual problem (13).

Example 1 ( $\boldsymbol{x}^{\star}$ not optimal).

Consider the following problem

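$$\min_{x_{1},x_{2}}\ f_{1}(x_{1})+f_{2}(x_{2})\quad\text{s.t.}\quad x_{1}-x_{2}=0,$$

$$\text{where}\qquad f_{1}(x_{1})=f_{2}(-x_{1})=\begin{cases}x_{1}-1&x_{1}\geq 1\\0&\text{otherwise}\end{cases}.$$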

With this, the primal Lagrangian is given by $L_{p}(\boldsymbol{x},\delta_{12})=f_{1}(x_{1})+f_{2}(x_{2})+\delta_{12}(x_{2}-x_{1})$ , so that the dual function is given by $-f_{1}^{\ast}(\delta_{12})-f_{2}^{\ast}(-\delta_{12})$ , where

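$$f_{1}^{\ast}(\delta_{12})=f_{2}^{\ast}(-\delta_{12})=\begin{cases}\delta_{12}&0\leq\delta_{12}\leq 1\\+\infty&\text{otherwise}\end{cases}.$$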

Hence, the optimal solution for the primal and dual problem is $x_{1}^{\star}=x_{2}^{\star}\in[-1,1]$ and $\delta_{12}^{\star}=0$ , respectively. The primal-dual Lagrangian in this case is given by

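$$L_{pd}(\boldsymbol{x},\boldsymbol{\lambda})=f_{1}(x_{1})+f_{2}(x_{2})-f_{1}^{\ast}(\lambda_{1|2})-f_{2}^{\ast}(-\lambda_{2|1})-x_{1}\lambda_{2|1}+x_{2}\lambda_{1|2}.$$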

One can show that every point $(x_{1}^{\prime},x_{2}^{\prime},\lambda_{1|2}^{\prime},\lambda_{2|1}^{\prime})\in\{(x_{1},x_{2},0,0)|-1\leq x_{1},x_{2}\leq 1\}$ is a saddle point of $L_{pd}(\boldsymbol{x},\boldsymbol{\lambda})$ , which does not necessarily lead to $x_{1}^{\prime}=x_{2}^{\prime}$ .

It is clear from Example 1 that finding a saddle point of $L_{pd}$ does not necessarily solve the primal problem (7). Similarly, one can also build another example illustrating that a saddle point of $L_{pd}$ does not necessarily solve the dual problem (13).
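The claim in Example 1 can be sanity-checked numerically. The sketch below evaluates the primal-dual Lagrangian of the example, written out from Definition 1 with $A_{12}=1$, $A_{21}=-1$ and $c_{12}=0$, and tests the saddle-point inequalities on a finite grid. The function names, the grid ranges and the tolerance are illustrative assumptions, and a grid check is only evidence, not a proof.

```python
import numpy as np

def f1(x):
    return max(x - 1.0, 0.0)

def f2(x):
    return f1(-x)

def f1_conj(d):
    return d if 0.0 <= d <= 1.0 else np.inf

def L_pd(x1, x2, lam12, lam21):
    # f2*(-lam21) equals f1*(lam21) because f2(x) = f1(-x).
    return (f1(x1) + f2(x2) - f1_conj(lam12) - f1_conj(lam21)
            - x1 * lam21 + x2 * lam12)

def is_saddle(x1, x2, lam12=0.0, lam21=0.0, tol=1e-12):
    """Grid check of L_pd(x', lam) <= L_pd(x', lam') <= L_pd(x, lam')."""
    centre = L_pd(x1, x2, lam12, lam21)
    xs = np.linspace(-3.0, 3.0, 61)
    lams = np.linspace(0.0, 1.0, 21)   # outside [0, 1] the value is -inf
    max_over_lam = max(L_pd(x1, x2, a, b) for a in lams for b in lams)
    min_over_x = min(L_pd(a, b, lam12, lam21) for a in xs for b in xs)
    return max_over_lam <= centre + tol and centre <= min_over_x + tol

infeasible_but_saddle = is_saddle(0.5, -0.5)   # x1 != x2 violates the constraint
```

In this check the point $(0.5,-0.5,0,0)$ passes both inequalities even though $x_{1}\neq x_{2}$, which is the behaviour the example describes.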

III-B Augmented primal-dual Lagrangian

The problem that not every saddle point of $L_{pd}(\boldsymbol{x},\boldsymbol{\lambda})$ leads to an optimal point of the primal or dual problem can be solved by adding two quadratic penalty terms to $L_{pd}(\boldsymbol{x},\boldsymbol{\lambda})$ as

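$$L_{\mathcal{P}}(\boldsymbol{x},\boldsymbol{\lambda})=L_{pd}(\boldsymbol{x},\boldsymbol{\lambda})+h_{\mathcal{P}_{p}}(\boldsymbol{x})-h_{\mathcal{P}_{d}}(\boldsymbol{\lambda}),$$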

where $h_{\mathcal{P}_{p}}(\boldsymbol{x})$ and $h_{\mathcal{P}_{d}}(\boldsymbol{\lambda})$ are defined as

[TABLE]

where $\mathcal{P}=\mathcal{P}_{p}\cup\mathcal{P}_{d}$ and

[TABLE]

The set $\mathcal{P}$ of $2|\mathcal{E}|$ positive definite matrices remains to be specified.

Let ${X}=\{\boldsymbol{x}|\boldsymbol{A}_{ij}\boldsymbol{x}_{i}+\boldsymbol{A}_{ji}\boldsymbol{x}_{j}=\boldsymbol{c}_{ij},\forall(i,j)\in\mathcal{E}\}$ and ${\Lambda}=\{\boldsymbol{\lambda}|\boldsymbol{\lambda}_{i|j}=\boldsymbol{\lambda}_{j|i},\forall(i,j)\in\mathcal{E}\}$ denote the primal and dual feasible sets, respectively. It is clear that $h_{\mathcal{P}_{p}}(\boldsymbol{x})\geq 0$ (or $-h_{\mathcal{P}_{d}}(\boldsymbol{\lambda})\leq 0$ ) with equality if and only if $\boldsymbol{x}\in X$ (or $\boldsymbol{\lambda}\in\Lambda$ ). The introduction of the two penalty functions essentially prevents infeasible $\boldsymbol{x}$ and/or $\boldsymbol{\lambda}$ from corresponding to saddle points of $L_{\mathcal{P}}(\boldsymbol{x},\boldsymbol{\lambda})$ . As a consequence, we have a saddle point theorem for $L_{\mathcal{P}}$ which states that $\boldsymbol{x}^{\star}$ solves the primal problem (7) if and only if there exists $\boldsymbol{\lambda}^{\star}$ such that $(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\star})$ is a saddle point of $L_{\mathcal{P}}(\boldsymbol{x},\boldsymbol{\lambda})$ . To prove this result, we need the following lemma.

Lemma 3.

Let $(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\star})$ and $(\boldsymbol{x}^{\prime},\boldsymbol{\lambda}^{\prime})$ be two saddle points of $L_{\mathcal{P}}(\boldsymbol{x},\boldsymbol{\lambda})$ . Then

[TABLE]

Further, $(\boldsymbol{x}^{\prime},\boldsymbol{\lambda}^{\star})$ and $(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\prime})$ are two saddle points of $L_{\mathcal{P}}(\boldsymbol{x},\boldsymbol{\lambda})$ as well.

Proof.

Since $(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\star})$ and $(\boldsymbol{x}^{\prime},\boldsymbol{\lambda}^{\prime})$ are two saddle points of $L_{\mathcal{P}}(\boldsymbol{x},\boldsymbol{\lambda})$ , we have

[TABLE]

Combining the above two inequality chains produces (29). In order to show that $(\boldsymbol{x}^{\prime},\boldsymbol{\lambda}^{\star})$ is a saddle point, we have $L_{\mathcal{P}}(\boldsymbol{x}^{\prime},\boldsymbol{\lambda})\leq L_{\mathcal{P}}(\boldsymbol{x}^{\prime},\boldsymbol{\lambda}^{\prime})=L_{\mathcal{P}}(\boldsymbol{x}^{\prime},\boldsymbol{\lambda}^{\star})=L_{\mathcal{P}}(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\star})\leq L_{\mathcal{P}}(\boldsymbol{x},\boldsymbol{\lambda}^{\star})$ . The proof for $(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\prime})$ is similar. ∎
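For readers who want the intermediate step spelled out, the two chains can be sketched as below. This assumes the usual saddle-point definition, $L_{\mathcal{P}}(\boldsymbol{x}^{\star},\boldsymbol{\lambda})\leq L_{\mathcal{P}}(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\star})\leq L_{\mathcal{P}}(\boldsymbol{x},\boldsymbol{\lambda}^{\star})$ for all $(\boldsymbol{x},\boldsymbol{\lambda})$, and that (29) is the resulting chain of equalities; the exact layout of the original display is not recoverable here.

```latex
\begin{align*}
L_{\mathcal{P}}(\boldsymbol{x}^{\star},\boldsymbol{\lambda}')
  &\leq L_{\mathcal{P}}(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\star})
   \leq L_{\mathcal{P}}(\boldsymbol{x}',\boldsymbol{\lambda}^{\star}), \\
L_{\mathcal{P}}(\boldsymbol{x}',\boldsymbol{\lambda}^{\star})
  &\leq L_{\mathcal{P}}(\boldsymbol{x}',\boldsymbol{\lambda}')
   \leq L_{\mathcal{P}}(\boldsymbol{x}^{\star},\boldsymbol{\lambda}').
\end{align*}
% Chaining the two lines gives a loop that starts and ends at
% L_P(x^*, lambda'), so every inequality must hold with equality:
\begin{equation*}
L_{\mathcal{P}}(\boldsymbol{x}',\boldsymbol{\lambda}')
 = L_{\mathcal{P}}(\boldsymbol{x}^{\star},\boldsymbol{\lambda}')
 = L_{\mathcal{P}}(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\star})
 = L_{\mathcal{P}}(\boldsymbol{x}',\boldsymbol{\lambda}^{\star}).
\end{equation*}
```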

We are ready to prove the saddle point theorem for $L_{\mathcal{P}}(\boldsymbol{x},\boldsymbol{\lambda})$ .

Theorem 1.

If $\boldsymbol{x}^{\star}$ solves the primal problem (7), there exists $\boldsymbol{\lambda}^{\star}$ such that $(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\star})$ is a saddle point of $L_{\mathcal{P}}(\boldsymbol{x},\boldsymbol{\lambda})$ . Conversely, if $(\boldsymbol{x}^{\prime},\boldsymbol{\lambda}^{\prime})$ is a saddle point of $L_{\mathcal{P}}(\boldsymbol{x},\boldsymbol{\lambda})$ , then $\boldsymbol{x}^{\prime}$ and $\boldsymbol{\lambda}^{\prime}$ solve the primal and the dual problem, respectively. Equivalently, the following optimality conditions hold

[TABLE]

Proof.

If $\boldsymbol{x}^{\star}$ solves the primal problem, then there exists a $\boldsymbol{\lambda}^{\star}$ such that $(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\star})$ is a saddle point of $L_{pd}$ by Lemma 2. Since $\boldsymbol{x}^{\star}\in X$ and $\boldsymbol{\lambda}^{\star}\in\Lambda$ , we have $h_{\mathcal{P}_{p}}(\boldsymbol{x}^{\star})-h_{\mathcal{P}_{d}}(\boldsymbol{\lambda}^{\star})=0$ , $\partial_{\boldsymbol{x}}h_{\mathcal{P}_{p}}(\boldsymbol{x}^{\star})=\boldsymbol{0}$ and $\partial_{\boldsymbol{\lambda}}h_{\mathcal{P}_{d}}(\boldsymbol{\lambda}^{\star})=\boldsymbol{0}$ , from which we conclude that $(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\star})$ is a saddle point of $L_{\mathcal{P}}(\boldsymbol{x},\boldsymbol{\lambda})$ as well.

Conversely, let $(\boldsymbol{x}^{\prime},\boldsymbol{\lambda}^{\prime})$ be a saddle point of $L_{\mathcal{P}}(\boldsymbol{x},\boldsymbol{\lambda})$ . We first show that $\boldsymbol{x}^{\prime}$ solves the primal problem. We have from Lemma 3 that $L_{\mathcal{P}}(\boldsymbol{x}^{\prime},\boldsymbol{\lambda}^{\star})=L_{\mathcal{P}}(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\star})$ , which can be simplified as

[TABLE]

from which we conclude that $h_{\mathcal{P}_{p}}(\boldsymbol{x}^{\prime})=L_{p}(\boldsymbol{x}^{\star},\boldsymbol{\delta}^{\star})-L_{p}(\boldsymbol{x}^{\prime},\boldsymbol{\delta}^{\star})\leq 0$ and thus $h_{\mathcal{P}_{p}}(\boldsymbol{x}^{\prime})=0$ so that $\boldsymbol{x}^{\prime}\in X$ . In addition, since $(\boldsymbol{x}^{\prime},\boldsymbol{\lambda}^{\star})$ is a saddle point of $L_{\mathcal{P}}(\boldsymbol{x},\boldsymbol{\lambda})$ by Lemma 3, we have

[TABLE]

and we conclude that $\boldsymbol{x}^{\prime}$ solves the primal problem as required. Similarly, one can show that $\boldsymbol{\lambda}^{\prime}$ solves the dual problem.

Based on the above analysis, we conclude that the optimality conditions for $(\boldsymbol{x}^{\prime},\boldsymbol{\lambda}^{\prime})$ being a saddle point of $L_{\mathcal{P}}$ are given by (30)-(32). The set of optimality conditions $\{\boldsymbol{c}_{ij}-\boldsymbol{A}_{ji}\boldsymbol{x}_{j}^{\prime}\in\partial_{\boldsymbol{\lambda}_{i|j}}\left[f_{i}^{\ast}(\boldsymbol{A}_{i}^{T}\boldsymbol{\lambda}_{i}^{\prime})\right]|[i,j]\in\vec{\mathcal{E}}\}$ is redundant and can be derived from (30)-(32) (see (4)-(6) for the argument). ∎

Theorem 1 states that instead of solving the primal problem (7) or the dual problem (13), one can alternatively search for a saddle point of $L_{\mathcal{P}}(\boldsymbol{x},\boldsymbol{\lambda})$ . To briefly summarize, we consider solving the following min-max problem in the rest of the paper

[TABLE]

We will explain in the next section how to iteratively approach the saddle point $(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\star})$ in a distributed manner.

IV Primal-Dual Method of Multipliers

In this section, we present a new algorithm named primal-dual method of multipliers (PDMM) to iteratively approach a saddle point of $L_{\mathcal{P}}(\boldsymbol{x},\boldsymbol{\lambda})$ . We propose both the synchronous and asynchronous PDMM for solving the problem.

IV-A Synchronous updating scheme

The synchronous updating scheme refers to the operation in which, at each iteration, all the variables over the graph update their estimates using the most recent estimates of their neighbors from the previous iteration. Suppose $(\hat{\boldsymbol{x}}^{k},\hat{\boldsymbol{\lambda}}^{k})$ is the estimate obtained from the $(k-1)$th iteration, where $k\geq 1$ . We compute the new estimate $(\hat{\boldsymbol{x}}^{k+1},\hat{\boldsymbol{\lambda}}^{k+1})$ at iteration $k$ as

[TABLE]

By inserting the expression (26) for $L_{\mathcal{P}}(\boldsymbol{x},\boldsymbol{\lambda})$ into (34), the updating expression can be further simplified as

[TABLE]

Eq. (35)-(36) suggest that at iteration $k$ , every node $i$ performs parameter-updating independently once the estimates $\{\hat{\boldsymbol{x}}_{j}^{k},\hat{\boldsymbol{\lambda}}_{j|i}^{k}|j\in\mathcal{N}_{i}\}$ of its neighboring variables are available. In addition, the computation of $\hat{\boldsymbol{x}}_{i}^{k+1}$ and $\hat{\boldsymbol{\lambda}}_{i}^{k+1}$ can be carried out in parallel since $\boldsymbol{x}_{i}$ and $\boldsymbol{\lambda}_{i}$ are not directly related in $L_{\mathcal{P}}(\boldsymbol{x},\boldsymbol{\lambda})$ . We refer to (35)-(36) as node-oriented computation.

In order to run PDMM over the graph, each iteration should consist of two steps. Firstly, every node $i$ computes $(\hat{\boldsymbol{x}}_{i},\hat{\boldsymbol{\lambda}}_{i})$ by following (35)-(36), accounting for information-fusion. Secondly, every node $i$ sends $(\hat{\boldsymbol{x}}_{i},\hat{\boldsymbol{\lambda}}_{i|j})$ to its neighboring node $j$ for all neighbors, accounting for information-spread. We take $\hat{\boldsymbol{x}}_{i}$ as the common message to all neighbors of node $i$ and $\hat{\boldsymbol{\lambda}}_{i|j}$ as a node-specific message only to neighbor $j$ . In some applications, it may be preferable to exploit broadcast transmission rather than point-to-point transmission in order to save energy. We will explain in Subsection IV-C that the transmission of $\hat{\boldsymbol{\lambda}}_{i|j}$ , $j\in\mathcal{N}_{i}$ , can be replaced by broadcast transmission of an intermediate quantity.

Finally, we consider terminating the iterates (35)-(36). One can check if the estimate $(\hat{\boldsymbol{x}},\hat{\boldsymbol{\lambda}})$ becomes stable over consecutive iterates (see Corollary 1 for theoretical support).

IV-B Asynchronous updating scheme

The asynchronous updating scheme refers to the operation that at each iteration, only the variables associated with one node in the graph update their estimates while all other variables keep their estimates fixed. Suppose node $i$ is selected at iteration $k$ . We then compute $(\hat{\boldsymbol{x}}_{i}^{k+1},\hat{\boldsymbol{\lambda}}_{i}^{k+1})$ by optimizing $L_{\mathcal{P}}$ based on the most recent estimates $\{\hat{\boldsymbol{x}}_{j}^{k},\hat{\boldsymbol{\lambda}}_{j|i}^{k}|j\in\mathcal{N}_{i}\}$ from its neighboring nodes. At the same time, the estimates $(\hat{\boldsymbol{x}}_{j}^{k},\hat{\boldsymbol{\lambda}}_{j}^{k})$ , $j\neq i$ , remain the same. By following the above computational instruction, $(\hat{\boldsymbol{x}}^{k+1},\hat{\boldsymbol{\lambda}}^{k+1})$ can be obtained as

[TABLE]

Similarly to (35)-(36), $\hat{\boldsymbol{x}}_{i}^{k+1}$ and $\hat{\boldsymbol{\lambda}}_{i}^{k+1}$ can also be computed separately in (37). Once the update at node $i$ is complete, the node sends the common message $\hat{\boldsymbol{x}}_{i}^{k+1}$ and node-specific messages $\{\hat{\boldsymbol{\lambda}}_{i|j}^{k+1},j\in\mathcal{N}_{i}\}$ to its neighbors. We will explain in the next subsection how to exploit broadcast transmission to replace point-to-point transmission.

In practice, the nodes in the graph can either be randomly activated or follow a predefined order for asynchronous parameter-updating. One scheme for realizing random node-activation is that after a node finishes parameter-updating, it randomly activates one of its neighbors for the next iteration. Another scheme is to introduce a clock at each node which ticks at the times of a (random) Poisson process (see [5] for detailed information). Each node is activated only when its clock ticks. As for node-activation in a predefined order, a cyclic updating scheme is probably the most straightforward. Once node $i$ finishes parameter-updating, it informs node $i+1$ for the next iteration. If nodes $i$ and $i+1$ are not neighbors, the path from node $i$ to $i+1$ can be pre-stored at node $i$ to facilitate the process. In Subsection V-D, we provide convergence analysis only for the cyclic updating scheme. We leave the analysis of other asynchronous schemes for future investigation.

Remark 1.

To briefly summarize, the synchronous PDMM scheme allows faster information-spread over the graph through parallel parameter-updating, while the asynchronous PDMM scheme requires less node-coordination effort in the graph. In practice, the choice of scheme should depend on the graph (or network) properties such as the feasibility of parallel computation, the complexity of node-coordination and the lifetime of nodes.

IV-C Simplifying node-based computations and transmissions

It is clear that for both the synchronous and asynchronous schemes, each activated node $i$ has to perform two minimizations: one for $\hat{\boldsymbol{x}}_{i}$ and the other one for $\hat{\boldsymbol{\lambda}}_{i}$ . In this subsection, we show that the computations for the two minimizations can be simplified. We will also study how the point-to-point transmission can be replaced with broadcast transmission. To do so, we will consider two scenarios:

IV-C1 Avoiding conjugate functions

In the first scenario, we consider using $f_{i}(\cdot)$ instead of $f_{i}^{\ast}(\cdot)$ to update ${\hat{\boldsymbol{\lambda}}}_{i}$ . Our goal is to simplify computations by avoiding the derivation of $f_{i}^{\ast}(\cdot)$ .

By using the definition of $f_{i}^{\ast}$ in (4), the computation (36) for $\hat{\boldsymbol{\lambda}}_{i}^{k+1}$ (which also holds for asynchronous PDMM) can be rewritten as

[TABLE]

We denote the optimal solution for $\boldsymbol{w}_{i}$ in (39) as $\boldsymbol{w}_{i}^{k+1}$ . The optimality conditions for $\hat{\boldsymbol{\lambda}}_{i|j}^{k+1}$ , $j\in\mathcal{N}_{i}$ , and $\boldsymbol{w}_{i}^{k+1}$ can then be derived from (39) as

[TABLE]

where (14) is used in deriving (41). Since $\boldsymbol{P}_{d,ij}$ is a nonsingular matrix, (41) defines a mapping from $\boldsymbol{w}_{i}^{k+1}$ to $\hat{\boldsymbol{\lambda}}_{i|j}^{k+1}$ :

[TABLE]

With this mapping, (40) can then be reformulated as

[TABLE]

By inspection of (43), it can be shown that (43) is in fact an optimality condition for the following optimization problem

[TABLE]

The above analysis suggests that $\hat{\boldsymbol{\lambda}}_{i}^{k+1}$ can be alternatively computed through an intermediate quantity $\boldsymbol{w}_{i}^{k+1}$ . We summarize the result in a proposition below.

Proposition 1.

Considering a node $i\in\mathcal{V}$ at iteration $k$ , the new estimate $\hat{\boldsymbol{\lambda}}_{i|j}^{k+1}$ for each $j\in\mathcal{N}_{i}$ can be obtained by following (42), where ${\boldsymbol{w}}_{i}^{k+1}$ is computed by (44).

Proposition 1 suggests that the estimate $\hat{\boldsymbol{\lambda}}_{i}^{k+1}$ can be easily computed from ${\boldsymbol{w}}_{i}^{k+1}$ . We argue in the following that the point-to-point transmission of $\left\{\hat{\boldsymbol{\lambda}}_{i|j}^{k+1},j\in\mathcal{N}_{i}\right\}$ can be replaced with broadcast transmission of ${\boldsymbol{w}}_{i}^{k+1}$ .

We see from (42) that the computation of the node-specific message $\hat{\boldsymbol{\lambda}}_{i|j}^{k+1}$ (from node $i$ to node $j$ ) only consists of the quantities ${\boldsymbol{w}}_{i}^{k+1}$ , $\hat{\boldsymbol{\lambda}}_{j|i}^{k}$ and $\hat{\boldsymbol{x}}_{j}^{k}$ . Since $\hat{\boldsymbol{\lambda}}_{j|i}^{k}$ and $\hat{\boldsymbol{x}}_{j}^{k}$ are available at node $j$ , the message $\hat{\boldsymbol{\lambda}}_{i|j}^{k+1}$ can therefore be computed at node $j$ once the common message ${\boldsymbol{w}}_{i}^{k+1}$ is received. In other words, it is sufficient for node $i$ to broadcast both $\hat{\boldsymbol{x}}_{i}^{k+1}$ and ${\boldsymbol{w}}_{i}^{k+1}$ to all its neighbors. Every node-specific message $\hat{\boldsymbol{\lambda}}_{i|j}^{k+1}$ , $j\in\mathcal{N}_{i}$ , can then be computed at node $j$ alone.

Finally, in order for the broadcast transmission to work, we assume there is no transmission failure between neighboring nodes. The assumption ensures that there is no estimate inconsistency between neighboring nodes, making the broadcast transmission reliable.

IV-C2 Reducing two minimizations to one

In the second scenario, we study under what conditions the two minimizations (35)-(36) (which also hold for asynchronous PDMM) reduce to one minimization.

Proposition 2.

Considering a node $i\in\mathcal{V}$ at iteration $k$ , if the matrix $\boldsymbol{P}_{d,ij}$ for every neighbor $j\in\mathcal{N}_{i}$ is chosen to be $\boldsymbol{P}_{d,ij}=\boldsymbol{P}_{p,ij}^{-1}$ , then there is $\hat{\boldsymbol{x}}_{i}^{k+1}={\boldsymbol{w}}_{i}^{k+1}$ . As a result,

[TABLE]

Proof.

The proof is trivial. By inspection of (35) and (44) under $\boldsymbol{P}_{d,ij}=\boldsymbol{P}_{p,ij}^{-1}$ , $j\in\mathcal{N}_{i}$ , we obtain $\hat{\boldsymbol{x}}_{i}^{k+1}={\boldsymbol{w}}_{i}^{k+1}$ . ∎

Similarly to the first scenario, broadcast transmission is also applicable for the second scenario. Since $\hat{\boldsymbol{x}}_{i}^{k+1}={\boldsymbol{w}}_{i}^{k+1}$ , node $i$ only has to broadcast the estimate $\hat{\boldsymbol{x}}_{i}^{k+1}$ to all its neighbors. Each message $\hat{\boldsymbol{\lambda}}_{i|j}^{k+1}$ from node $i$ to node $j$ can then be computed at node $j$ directly by applying (45). See Table I for the procedure of synchronous PDMM.

V Convergence Analysis

In this section, we analyze the convergence rates of PDMM for both the synchronous and asynchronous schemes. Inspired by the convergence analysis of ADMM [27, 28], we construct a special inequality (presented in V-B) for $L_{\mathcal{P}}(\boldsymbol{x},\boldsymbol{\lambda})$ and then exploit it to analyze both synchronous PDMM (presented in V-C) and asynchronous PDMM (presented in V-D).

Before constructing the inequality, we first study how to properly choose the matrices in the set $\mathcal{P}$ (presented in V-A) in order to enable convergence analysis.

V-A Parameter setting

In order to analyze the algorithm convergence later on, we first have to select the matrix set $\mathcal{P}$ properly. We impose a condition on each pair of matrices $(\boldsymbol{P}_{p,ij}\succ\boldsymbol{0},\boldsymbol{P}_{d,ij}\succ\boldsymbol{0})$ , $(i,j)\in\mathcal{E}$ , in $L_{\mathcal{P}}$ :

Condition 1.

In the function $L_{\mathcal{P}}$ , each matrix $\boldsymbol{P}_{d,ij}$ can be represented in terms of $\boldsymbol{P}_{p,ij}$ as

[TABLE]

where $\Delta\boldsymbol{P}_{d,ij}\succeq\boldsymbol{0}$ .

Eq. (46) implies that $\boldsymbol{P}_{p,ij}$ and $\boldsymbol{P}_{d,ij}$ cannot be chosen arbitrarily for our convergence analysis. If $\boldsymbol{P}_{p,ij}$ is small, then $\boldsymbol{P}_{d,ij}$ has to be chosen large enough to make (46) hold, and vice versa. One special setup for $(\boldsymbol{P}_{p,ij},\boldsymbol{P}_{d,ij})$ is to let $\boldsymbol{P}_{d,ij}=\boldsymbol{P}_{p,ij}^{-1}$ , or equivalently, $\Delta\boldsymbol{P}_{d,ij}=\boldsymbol{0}$ . This leads to the application of Proposition 2, which reduces two minimizations to one minimization for each activated node.

One simple setup in Condition 1 is to let all the matrices in $\mathcal{P}$ take scalar form. That is, we set $(\boldsymbol{P}_{p,ij},\boldsymbol{P}_{d,ij})$ , $(i,j)\in\mathcal{E}$ , to be identity matrices multiplied by positive parameters:

[TABLE]

where $\gamma_{p,ij}>0$ , $\gamma_{d,ij}>0$ and $\gamma_{d,ij}\gamma_{p,ij}\geq 1$ . It is worth noting that matrix form of $(\boldsymbol{P}_{p,ij},\boldsymbol{P}_{d,ij})$ might lead to faster convergence for some optimization problems.
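The link between the scalar constraint $\gamma_{d,ij}\gamma_{p,ij}\geq 1$ and Condition 1 can be checked in one line. The sketch below assumes that (46) has the form $\boldsymbol{P}_{d,ij}=\boldsymbol{P}_{p,ij}^{-1}+\Delta\boldsymbol{P}_{d,ij}$ (as suggested by the special case $\Delta\boldsymbol{P}_{d,ij}=\boldsymbol{0}$ above) and that (47) sets $\boldsymbol{P}_{p,ij}=\gamma_{p,ij}\boldsymbol{I}$ and $\boldsymbol{P}_{d,ij}=\gamma_{d,ij}\boldsymbol{I}$.

```latex
\begin{equation*}
\Delta\boldsymbol{P}_{d,ij}
 = \boldsymbol{P}_{d,ij}-\boldsymbol{P}_{p,ij}^{-1}
 = \left(\gamma_{d,ij}-\gamma_{p,ij}^{-1}\right)\boldsymbol{I}
 = \frac{\gamma_{d,ij}\gamma_{p,ij}-1}{\gamma_{p,ij}}\,\boldsymbol{I}
 \succeq \boldsymbol{0}
 \quad\Longleftrightarrow\quad
 \gamma_{d,ij}\gamma_{p,ij}\geq 1 .
\end{equation*}
```

For instance, under these assumptions $(\gamma_{p,ij},\gamma_{d,ij})=(2,0.5)$ gives $\Delta\boldsymbol{P}_{d,ij}=\boldsymbol{0}$ (the setting of Proposition 2), whereas $(2,1)$ gives $\Delta\boldsymbol{P}_{d,ij}=0.5\,\boldsymbol{I}$.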

V-B Constructing an inequality

Before introducing the inequality, we first define a new function which involves $\{f_{i},i\in\mathcal{V}\}$ and their conjugates:

[TABLE]

By studying (7) and (13) at a saddle point $(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\star})$ of $L_{\mathcal{P}}$ , one can show that $p(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\star})=0$ .

With $p(\boldsymbol{x},\boldsymbol{\lambda})$ , the inequality for $L_{\mathcal{P}}$ can be described as:

Lemma 4.

Let $(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\star})$ be a saddle point of $L_{\mathcal{P}}$ . Then for any $(\boldsymbol{x},\boldsymbol{\lambda})$ , there is

[TABLE]

where equality holds if and only if $(\boldsymbol{x},\boldsymbol{\lambda})$ satisfies

[TABLE]

Proof.

Given a saddle point $(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\star})$ of $L_{\mathcal{P}}$ , the right hand side of the inequality (49) can be reformulated as

[TABLE]

where the last equality is obtained by using $(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\star})\in(X,\Lambda)$ . Using Fenchel’s inequalities (12), we conclude that for any $i\in\mathcal{V}$ , the following two inequalities hold

[TABLE]

Finally, combining (52)-(54) and the fact that $p(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\star})=0$ produces the inequality (49). The equality holds if and only if (53)-(54) hold, of which the optimality conditions are given by (50)-(51) (see (4)-(6) for the argument). ∎

Lemma 4 shows that the quantity on the right hand side of (49) is always lower-bounded by zero. In the next two subsections, we will construct proper upper bounds for the quantity by replacing $(\boldsymbol{x},\boldsymbol{\lambda})$ with the actual estimates produced by PDMM. The algorithmic convergence will be established by showing that the upper bounds approach zero as the iteration index increases.

The conditions (50)-(51) in Lemma 4 are not sufficient for showing that $(\boldsymbol{x},\boldsymbol{\lambda})$ is a saddle point of $L_{\mathcal{P}}$ . The primal and dual feasibilities $\boldsymbol{x}\in X$ and $\boldsymbol{\lambda}\in\Lambda$ are also required to complete the argument, as shown in Lemmas 5, 6 and 7 below. Lemmas 5 and 6 are preliminary steps towards showing that $(\boldsymbol{x},\boldsymbol{\lambda})$ is a saddle point of $L_{\mathcal{P}}$ , as presented in Lemma 7. These three lemmas will be used in the next two subsections for convergence analysis.

Lemma 5.

Let $(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\star})$ be a saddle point of $L_{\mathcal{P}}$ . Given $\boldsymbol{x}=\boldsymbol{x}^{\prime}$ which satisfies (51) and $\boldsymbol{x}^{\prime}\in X$ , then $(\boldsymbol{x}^{\prime},\boldsymbol{\lambda}^{\star})$ is a saddle point of $L_{\mathcal{P}}$ .

Proof.

By using (51) and the fact that $\boldsymbol{x}^{\prime}\in X$ and $\boldsymbol{\lambda}^{\star}\in\Lambda$ , it is immediate from (30)-(32) that $(\boldsymbol{x}^{\prime},\boldsymbol{\lambda}^{\star})$ is a saddle point of $L_{\mathcal{P}}$ . ∎

Lemma 6.

Let $(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\star})$ be a saddle point of $L_{\mathcal{P}}$ . Given $\boldsymbol{\lambda}=\boldsymbol{\lambda}^{\prime}$ which satisfies (50) and $\boldsymbol{\lambda}^{\prime}\in\Lambda$ , then $(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\prime})$ is a saddle point of $L_{\mathcal{P}}$ .

Proof.

The proof is similar to that for Lemma 5. ∎

Lemma 7.

Let $(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\star})$ be a saddle point of $L_{\mathcal{P}}$ . Given $(\boldsymbol{x},\boldsymbol{\lambda})=(\boldsymbol{x}^{\prime},\boldsymbol{\lambda}^{\prime})$ which satisfy (50)-(51) and $(\boldsymbol{x}^{\prime},\boldsymbol{\lambda}^{\prime})\in(X,\Lambda)$ , then $(\boldsymbol{x}^{\prime},\boldsymbol{\lambda}^{\prime})$ is a saddle point of $L_{\mathcal{P}}$ .

Proof.

It is known from Lemmas 5 and 6 that in addition to $(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\star})$ , $(\boldsymbol{x}^{\prime},\boldsymbol{\lambda}^{\star})$ and $(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\prime})$ are also saddle points of $L_{\mathcal{P}}$ . By using an argument similar to the one for Lemma 3, one can show that $(\boldsymbol{x}^{\prime},\boldsymbol{\lambda}^{\prime})$ is a saddle point of $L_{\mathcal{P}}$ . ∎

V-C Synchronous PDMM

In this subsection, we show that the synchronous PDMM converges with the sub-linear rate $\mathcal{O}(K^{-1})$ . In order to obtain the result, we need the following two lemmas.

Lemma 8.

Let $(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\star})$ be a saddle point of $L_{\mathcal{P}}$ . The estimate $(\hat{\boldsymbol{x}}^{k+1},\hat{\boldsymbol{\lambda}}^{k+1})$ is obtained by performing (35)-(36) under Condition 1. Then there is

[TABLE]

where $d_{i|j}^{k+1}$ is given by

[TABLE]

where $\boldsymbol{P}_{p,ij}=\boldsymbol{P}_{p,ij}^{\frac{1}{2}}\boldsymbol{P}_{p,ij}^{\frac{1}{2}}$ and $\Delta\boldsymbol{P}_{d,ij}=\Delta\boldsymbol{P}_{d,ij}^{\frac{1}{2}}\Delta\boldsymbol{P}_{d,ij}^{\frac{1}{2}}$ .

Proof.

See the proof in Appendix A. ∎

Lemma 9.

Every pair of estimates $(\hat{\boldsymbol{x}}_{i}^{k+1},\hat{\boldsymbol{\lambda}}_{i|j}^{k+1})$ , $i\in\mathcal{V}$ , $j\in\mathcal{N}_{i}$ , $k\geq 0$ , in Lemma 8 is upper bounded by a constant $M$ under a squared error criterion:

[TABLE]

Proof.

One can first prove (57) for $k=0$ by performing algebra on (55)-(56). The inequality (57) for $k>0$ can then be proved recursively. ∎

Having obtained the results in Lemmas 8 and 9, we are ready to present the convergence rate of synchronous PDMM.

Theorem 2.

Let $(\hat{\boldsymbol{x}}^{k},\hat{\boldsymbol{\lambda}}^{k})$ , $k=1,\ldots,K$ , be obtained by performing (35)-(36) under Condition 1. The average estimate ${(\bar{\boldsymbol{x}}^{K},\bar{\boldsymbol{\lambda}}^{K})=(\frac{1}{K}\sum_{k=1}^{K}\hat{\boldsymbol{x}}^{k},\frac{1}{K}\sum_{k=1}^{K}\hat{\boldsymbol{\lambda}}^{k})}$ satisfies

[TABLE]

Proof.

Summing (55) over $k$ and simplifying the expression yields

[TABLE]

Finally, since the left hand side of (60) is a convex function of $(\boldsymbol{x},\boldsymbol{\lambda})$ , applying Jensen’s inequality to (60) and using the inequality of Lemma 4 yields (58). Similarly, applying Jensen’s inequality to (60) and using the upper-bound result of Lemma 9 yields the asymptotic result (59). ∎

Finally, we use the results of Theorem 2 to show that as $K$ goes to infinity, the average estimate $(\bar{\boldsymbol{x}}^{K},\bar{\boldsymbol{\lambda}}^{K})$ converges to a saddle point of $L_{\mathcal{P}}$ .

Theorem 3.

The average estimate $(\bar{\boldsymbol{x}}^{K},\bar{\boldsymbol{\lambda}}^{K})$ of Theorem 2 converges to a saddle point $(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\star})$ of $L_{\mathcal{P}}$ as $K$ increases.

Proof.

The basic idea of the proof is to investigate if $(\bar{\boldsymbol{x}}^{K},\bar{\boldsymbol{\lambda}}^{K})$ satisfies all the conditions of Lemma 7. By investigation of Lemma 4 and (58), it is clear that the average estimate $(\bar{\boldsymbol{x}}^{K},\bar{\boldsymbol{\lambda}}^{K})$ asymptotically satisfies the conditions (50)-(51) by letting $(\boldsymbol{x},\boldsymbol{\lambda})=(\bar{\boldsymbol{x}}^{K},\bar{\boldsymbol{\lambda}}^{K})$ .

Next we show that as $K$ increases, $\bar{\boldsymbol{x}}^{K}$ asymptotically converges to an element of the primal feasible set $X$ and so does $\bar{\boldsymbol{\lambda}}^{K}$ to an element of the dual feasible set $\Lambda$ . To do so, we reconsider (59) for each pair of directed edges $[i,j]$ and $[j,i]$ , which can be expressed as

[TABLE]

Combining the above two expressions produces

[TABLE]

It is straightforward from Lemma 7 that $(\bar{\boldsymbol{x}}^{K},\bar{\boldsymbol{\lambda}}^{K})$ converges to a saddle point of $L_{\mathcal{P}}$ as $K$ increases. ∎

Further we have the following result from Theorem 3:

Corollary 1.

If for certain $i\in\mathcal{V}$ , the estimate $\hat{\boldsymbol{x}}_{i}^{k}$ in Theorem 2 converges to a fixed point $\boldsymbol{x}_{i}^{\prime}$ ( $\lim_{k\rightarrow\infty}\hat{\boldsymbol{x}}_{i}^{k}=\boldsymbol{x}_{i}^{\prime}$ ), we have $\boldsymbol{x}_{i}^{\prime}=\boldsymbol{x}_{i}^{\star}$ which is the $i$ th component of the optimal solution $\boldsymbol{x}^{\star}$ in Theorem 3. Similarly, if the estimate $\hat{\boldsymbol{\lambda}}_{i|j}^{k}$ converges to a point $\boldsymbol{\lambda}_{i|j}^{\prime}$ , we have $\boldsymbol{\lambda}_{i|j}^{\prime}=\boldsymbol{\lambda}_{i|j}^{\star}$ .

V-D Asynchronous PDMM

In this subsection, we characterize the convergence rate of asynchronous PDMM. In order to facilitate the analysis, we consider a predefined node-activation strategy (no randomness is involved). We suppose at each iteration $k$ , the node $i=\textrm{mod}(k,m)+1$ is activated for parameter-updating, where $m=|\mathcal{V}|$ and $\textrm{mod}(\cdot,\cdot)$ stands for the modulus operation. Then naturally, after a segment of $m$ consecutive iterations, all the nodes will be activated sequentially, one node at each iteration.

To be able to derive the convergence rate, we consider segments of iterations, i.e., $k\in\{lm,lm+1,\ldots,(l+1)m-1\}$ , where $l\geq 0$ . Each segment $l$ consists of $m$ iterations. With the mapping $i=\textrm{mod}(k,m)+1$ , it is immediate that $k=lm$ activates node 1 and $k=(l+1)m-1$ activates node $m$ . Based on the above analysis, we have the following result.

Lemma 10.

Let $k_{1},k_{2}$ be two iteration indices within a segment $\{lm,lm+1,\ldots,(l+1)m-1\}$ . If $k_{1}<k_{2}$ , then $i_{1}<i_{2}$ , where the node-index $i_{q}=\textrm{mod}(k_{q},m)+1$ , $q=1,2$ .
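The cyclic schedule and the ordering property of Lemma 10 can be illustrated with a few lines of Python. This is only a sketch: the helper names `activated_node` and `segment` are hypothetical and introduced here for illustration.

```python
def activated_node(k, m):
    """1-based node index activated at iteration k, i.e. i = mod(k, m) + 1."""
    return k % m + 1


def segment(l, m):
    """Iteration indices of segment l: lm, lm + 1, ..., (l + 1)m - 1."""
    return range(l * m, (l + 1) * m)


m = 4
# Within any segment the activated node index increases with k (Lemma 10).
nodes = [activated_node(k, m) for k in segment(2, m)]
print(nodes)  # [1, 2, 3, 4]
```

Every segment visits node 1 first and node $m$ last, so within a segment a smaller iteration index always corresponds to a smaller node index.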

Upon introducing Lemma 10, we are ready to perform convergence analysis.

Lemma 11.

Let $(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\star})$ be a saddle point of $L_{\mathcal{P}}$ . A segment of estimates $\{(\hat{\boldsymbol{x}}^{k+1},\hat{\boldsymbol{\lambda}}^{k+1})|k=lm,\ldots,(l+1)m-1\}$ , is obtained by performing (37)-(38) under Condition 1. Then there is

[TABLE]

where $d_{uv}^{l+1}$ is given by

[TABLE]

Proof.

See the proof in Appendix B. Lemma 10 will be used in the proof to simplify the mathematical derivations. ∎

Remark 2.

We note that Lemma 11 corresponds to Lemma 8 which is for synchronous PDMM. The right hand side of (61) consists of $|\mathcal{E}|$ quantities $\{d_{uv}^{l+1}\}$ (one for each edge $(u,v)\in\mathcal{E}$ ) as opposed to that of (55) which consists of $|\vec{\mathcal{E}}|$ quantities $\{d_{i|j}^{k+1}\}$ (one for each directed edge $[i,j]\in\vec{\mathcal{E}}$ ).

Lemma 12.

Every pair of estimates $(\hat{\boldsymbol{x}}_{v}^{(l+1)m},\hat{\boldsymbol{\lambda}}_{v|u}^{(l+1)m})$ , $(u,v)\in\mathcal{E}$ , $u<v$ , $l\geq 0$ , in Lemma 11 is upper bounded by a constant $M$ under a squared error criterion:

[TABLE]

Theorem 4.

Let the $K\geq 1$ segments of estimates $\{(\hat{\boldsymbol{x}}^{k+1},\hat{\boldsymbol{\lambda}}^{k+1})|k=0,\ldots,Km-1\}$ be obtained by performing (37)-(38) under Condition 1. The average estimate $(\check{\boldsymbol{x}}^{K},\check{\boldsymbol{\lambda}}^{K})=(\frac{1}{K}\sum_{l=1}^{K}\hat{\boldsymbol{x}}^{lm},\frac{1}{K}\sum_{l=1}^{K}\hat{\boldsymbol{\lambda}}^{lm})$ satisfies

[TABLE]

Proof.

The proof is similar to that for Theorem 2. ∎

Similarly to synchronous PDMM, by using the results of Theorem 4, we can conclude that:

Theorem 5.

The average estimate $(\check{\boldsymbol{x}}^{K},\check{\boldsymbol{\lambda}}^{K})$ of Theorem 4 converges to a saddle point $(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\star})$ of $L_{\mathcal{P}}$ as $K$ increases.

Corollary 2.

If for certain $u\in\mathcal{V}$ , the estimate $\hat{\boldsymbol{x}}_{u}^{lm}$ in Theorem 4 converges to a fixed point $\boldsymbol{x}_{u}^{\prime}$ ( $\lim_{l\rightarrow\infty}\hat{\boldsymbol{x}}_{u}^{lm}=\boldsymbol{x}_{u}^{\prime}$ ), we have ${\boldsymbol{x}}_{u}^{\prime}=\boldsymbol{x}_{u}^{\star}$ which is the $u$ th component of the optimal solution $\boldsymbol{x}^{\star}$ in Theorem 5. Similarly, if the estimate $\hat{\boldsymbol{\lambda}}_{u|v}^{lm}$ converges to a point $\boldsymbol{\lambda}_{u|v}^{\prime}$ , we have $\boldsymbol{\lambda}_{u|v}^{\prime}=\boldsymbol{\lambda}_{u|v}^{\star}$ .

VI Application to Distributed Averaging

In this section, we consider solving the problem of distributed averaging by using PDMM. Distributed averaging is one of the basic and important operations for advanced distributed signal processing [5, 15].

VI-A Problem formulation

Suppose every node $i$ in a graph $G=(\mathcal{V},\mathcal{E})$ carries a scalar parameter, denoted as $t_{i}$ . $t_{i}$ may represent a measurement of the environment, such as temperature, humidity or darkness. The problem is to compute the average value $t_{ave}=\frac{1}{m}\sum_{i\in\mathcal{V}}t_{i}$ iteratively only through message-passing between neighboring nodes in the graph.

The above averaging problem can be formulated as a quadratic optimization over the graph as

[TABLE]

The optimal solution is $x_{1}^{\star}=\ldots=x_{m}^{\star}=t_{ave}$ , which is exactly the average value.

The quadratic problem (66) is in line with (7) by letting

[TABLE]

In the next subsection, we apply PDMM to distributed averaging.

VI-B Parameter computations and transmissions

Before deriving the updating expressions for PDMM, we first configure the set $\mathcal{P}$ in $L_{\mathcal{P}}$ . For distributed averaging, all the matrices in $\mathcal{P}$ become scalars. For simplicity, we set the value of the primal scalars and the dual scalars as

[TABLE]

where the two parameters $\gamma_{p}>0$ and $\gamma_{d}>0$ .

We start with the synchronous PDMM. By inserting (67)-(69) into (35), (42) and (44), the updating expression for $(\hat{\boldsymbol{x}}^{k+1},\hat{\boldsymbol{\lambda}}^{k+1})$ at iteration $k$ can be derived as

[TABLE]

where

[TABLE]

For the case that $\gamma_{d}=\gamma_{p}^{-1}$ , it is immediate from (70) and (72) that $\hat{x}_{i}^{k+1}={w}_{i}^{k+1}$ , which coincides with Proposition 2.
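For the case $\gamma_{d}=\gamma_{p}^{-1}$, the synchronous iteration can be sketched in a few lines of Python. This is an illustrative sketch, not a transcription of (70)-(72): it assumes local costs $\frac{1}{2}(x_{i}-t_{i})^{2}$, edge constraints $x_{i}-x_{j}=0$ encoded by $A_{ij}=+1$ for $i<j$ and $A_{ij}=-1$ otherwise, and the single-minimization form of the dual update. The names `sync_pdmm_average`, `nbrs` and `sign` are hypothetical.

```python
import numpy as np

def sync_pdmm_average(t, nbrs, gamma_p=1.0, iters=200):
    """Synchronous PDMM sketch for averaging, assuming gamma_d = 1 / gamma_p.

    t    : node measurements t_i
    nbrs : dict mapping node i -> list of neighbors N_i
    """
    t = np.asarray(t, dtype=float)
    x = t.copy()                                        # x_i^0 = t_i
    lam = {(i, j): 0.0 for i in nbrs for j in nbrs[i]}  # lam[(i, j)] = lambda_{i|j}
    sign = lambda i, j: 1.0 if i < j else -1.0          # assumed A_ij
    for _ in range(iters):
        x_new = np.empty_like(x)
        for i, Ni in nbrs.items():
            # Closed-form minimizer of the local quadratic subproblem.
            s = sum(sign(i, j) * lam[(j, i)] + gamma_p * x[j] for j in Ni)
            x_new[i] = (t[i] + s) / (1.0 + gamma_p * len(Ni))
        # Dual update: uses the neighbor's previous estimate x[j].
        lam = {(i, j): lam[(j, i)] - gamma_p * sign(i, j) * (x_new[i] - x[j])
               for (i, j) in lam}
        x = x_new
    return x

# Three-node path graph 0 - 1 - 2 with measurements 0, 3, 6 (average 3).
path = {0: [1], 1: [0, 2], 2: [1]}
x_path = sync_pdmm_average([0.0, 3.0, 6.0], path, gamma_p=1.0, iters=50)
```

On this small path graph the estimates settle at the average after a couple of iterations; larger graphs need more iterations, and the speed depends on `gamma_p`.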

The asynchronous PDMM only activates one node per iteration. Suppose node $i$ is activated at iteration $k$ . Node $i$ then updates $\hat{{x}}_{i}$ and $\hat{{\lambda}}_{i|j}$ , $j\in\mathcal{N}_{i}$ , by following (70)-(71) while all other nodes remain silent. After computation, node $i$ then sends $(\hat{x}_{i},\hat{\lambda}_{i|j})$ to its neighboring node $j$ for all neighbors.

As described in Subsection IV-C, if no transmission fails in the graph, the transmission of $\hat{\lambda}_{i|j}$ , $j\in\mathcal{N}_{i}$ , can be replaced by broadcast transmission of $w_{i}$ as given by (72). Once $w_{i}$ is received by a neighboring node $j$ , $\hat{\lambda}_{i|j}$ can be easily computed by node $j$ alone using $w_{i}$ , $\hat{x}_{j}$ and $\hat{\lambda}_{j|i}$ (see Eq. (71)). If instead the transmission is not reliable, we have to return to point-to-point transmission.

VI-C Experimental results

We conducted three experiments for PDMM applied to distributed averaging. In the first experiment, we evaluated how different parameter-settings w.r.t. $(\gamma_{p},\gamma_{d})$ affect the convergence rates of PDMM. In the second experiment, we tested PDMM over non-perfect channels, a setting for which no convergence guarantee is available at the moment. Finally, we compared the convergence rates of PDMM, ADMM and two gossip algorithms.

The tested graph in the three experiments was a $10\times 10$ two-dimensional grid (corresponding to $m=100$ ), implying that each node may have two, three or four neighbors. The mean squared error (MSE) $\frac{1}{m}\|\hat{\boldsymbol{x}}-t_{ave}\boldsymbol{1}\|_{2}^{2}$ was employed as performance measurement.
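The test setup can be reproduced with a short sketch. The helper names `grid_neighbors` and `mse` are hypothetical, and nodes are indexed from 0 in row-major order, which is a choice made here rather than something the text prescribes.

```python
import numpy as np

def grid_neighbors(rows, cols):
    """Return {node: [neighbors]} for a rows x cols two-dimensional grid."""
    nbrs = {}
    for r in range(rows):
        for c in range(cols):
            candidates = ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
            nbrs[r * cols + c] = [rr * cols + cc for rr, cc in candidates
                                  if 0 <= rr < rows and 0 <= cc < cols]
    return nbrs

def mse(x_hat, t):
    """Mean squared error (1/m) * ||x_hat - t_ave * 1||_2^2."""
    x_hat = np.asarray(x_hat, dtype=float)
    t_ave = float(np.mean(t))
    return float(np.mean((x_hat - t_ave) ** 2))

nbrs = grid_neighbors(10, 10)                 # m = 100 nodes
degrees = sorted({len(v) for v in nbrs.values()})
```

Corner nodes have two neighbors, border nodes three and interior nodes four, so `degrees` contains exactly the three values mentioned above.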

VI-C1 performance for different parameter settings

In this experiment, we evaluated the performance of PDMM by testing different parameter-settings for $(\gamma_{p},\gamma_{d})$ . Both synchronous and asynchronous updating schemes were investigated.

At each iteration, the synchronous PDMM activated all the nodes for parameter-updating. As for the asynchronous PDMM, the nodes were activated sequentially by following the mapping $i=\textrm{mod}(k,m)+1$ , where the iteration $k\geq 0$ (See Subsection V-D). As a result, after every segment of $m=100$ iterations, all the nodes were activated once. In the experiment, we counted the number of iterations for the synchronous PDMM and the number of segments (each segment consists of $m$ iterations) for the asynchronous PDMM.
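The sequential activation rule can be sketched as follows. As with any illustration of this kind, the update formulas are an assumption: they use local costs $\frac{1}{2}(x_{i}-t_{i})^{2}$, $\gamma_{d}=\gamma_{p}^{-1}$, and $A_{ij}=+1$ for $i<j$ and $-1$ otherwise. Nodes are indexed from 0, so the mapping $i=\textrm{mod}(k,m)+1$ becomes `k % m`. The function name `async_pdmm_average` is hypothetical.

```python
def async_pdmm_average(t, nbrs, gamma_p=1.0, segments=10):
    """Asynchronous PDMM sketch: one node per iteration, in cyclic order."""
    m = len(t)
    x = [float(v) for v in t]                           # x_i^0 = t_i
    lam = {(i, j): 0.0 for i in nbrs for j in nbrs[i]}  # lam[(i, j)] = lambda_{i|j}
    sign = lambda i, j: 1.0 if i < j else -1.0          # assumed A_ij
    for k in range(segments * m):                       # one segment = m iterations
        i = k % m                                       # 0-based version of mod(k, m) + 1
        Ni = nbrs[i]
        # Only node i computes; it reads the latest values held for its neighbors.
        s = sum(sign(i, j) * lam[(j, i)] + gamma_p * x[j] for j in Ni)
        x[i] = (t[i] + s) / (1.0 + gamma_p * len(Ni))
        for j in Ni:
            lam[(i, j)] = lam[(j, i)] - gamma_p * sign(i, j) * (x[i] - x[j])
    return x

# Three-node path graph 0 - 1 - 2 with measurements 0, 3, 6 (average 3).
path = {0: [1], 1: [0, 2], 2: [1]}
x_async = async_pdmm_average([0.0, 3.0, 6.0], path, gamma_p=1.0, segments=10)
```

Counting progress in segments rather than iterations, as done in the experiment, makes the asynchronous scheme comparable to the synchronous one, since both then perform one update per node per counted unit.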

For each parameter-setting, we initialized $(\hat{{x}}_{i}^{0},\hat{\boldsymbol{\lambda}}_{i}^{0})=(t_{i},\boldsymbol{0})$ for every $i\in\mathcal{V}$ . The algorithm stops when the squared error is below $10^{-4}$ .

Fig. 2 displays the numbers of iterations (or segments) of PDMM under different parameter-settings. Each $\circ$ or $\square$ symbol represents a particular setting for $(\gamma_{p},\gamma_{d})$ . The settings denoted by $\square$ are for the case that $\gamma_{p}\gamma_{d}<1$ while the ones by $\circ$ are for the case that $\gamma_{p}\gamma_{d}\geq 1$ .

It is seen from the figure that a large $\gamma_{p}$ or $\gamma_{d}$ only makes the algorithm converge more slowly. The optimal parameter-setting that leads to the fastest convergence lies on the curve $\gamma_{d}\gamma_{p}=1$ for both the synchronous and the asynchronous updating schemes. Further, it appears that the optimal settings for the two updating schemes lie close to each other.

Finally, we note that the settings denoted by $\square$ correspond to the situation that $\gamma_{p}\gamma_{d}<1$. The experiment for those settings demonstrates that Condition 1 is sufficient but not necessary for algorithmic convergence. We also tested the setting $\gamma_{p}=\gamma_{d}=0.5$ and found that it led to divergence for both the synchronous and asynchronous schemes. This phenomenon suggests that $\gamma_{p}$ and $\gamma_{d}$ cannot be chosen arbitrarily in practice.

VI-C2 performance with transmission failure

In this experiment, we studied how transmission failure affects the performance of PDMM, given that no convergence guarantee has been derived for this case at the moment. As discussed in Subsection IV-C, we could not use broadcast transmission in the case of transmission loss. Instead, each activated node $i$ has to perform point-to-point transmission of $\hat{\lambda}_{i|j}$ from node $i$ to node $j\in\mathcal{N}_{i}$.

Due to transmission failure, PDMM was initialized differently from the first experiment. Each time the algorithm was tested, the initial estimate $(\hat{{\boldsymbol{x}}}^{0},\hat{\boldsymbol{\lambda}}^{0})$ was set as

[TABLE]

The above initialization guarantees that every node in the graph has access to the initial estimates of neighboring nodes without transmission.

Fig. 3 demonstrates the performance of PDMM under three transmission-loss rates: 0%, 20% and 40%. Subplots (a) and (b) are for the asynchronous and synchronous schemes, respectively. Each curve in the two subplots was obtained by averaging over 100 simulations to mitigate the effect of random transmission losses. It is seen that transmission failure only slows down the convergence speed of the algorithm. The above property is highly desirable in real applications because transmission losses might be inevitable in some networks (e.g., see [29] for an investigation of packet loss over wireless sensor networks in different environments).

Finally, it is observed that for each transmission-loss rate in subplot (a), the error goes up in the first few hundred iterations before decreasing. This may be because of the special initialization (73). We also tested the initialization $\{\hat{x}_{i}^{0}=t_{i}\}$ for 0% transmission loss, in which case the MSE decreases monotonically with the iterations.

VI-C3 performance comparison

In this experiment, we investigated the convergence speeds of four algorithms under the condition of no transmission failure. Besides PDMM, we also implemented the broadcast-based algorithm in [14] (referred to as broadcast), the randomized gossip algorithm in [5] (referred to as gossip) and ADMM. Unlike PDMM and ADMM that can work either synchronously or asynchronously, both broadcast and gossip algorithms can only work asynchronously. While broadcast algorithm randomly activates one node per iteration, gossip algorithm randomly activates one edge per iteration for parameter-updating.
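For readers unfamiliar with the gossip baseline, the edge-activation idea can be sketched as pairwise averaging. This is a simplified illustration: it draws an edge uniformly at random, whereas the activation probabilities used in [5] may differ, and the name `pairwise_gossip` is hypothetical.

```python
import random

def pairwise_gossip(t, edges, iters, seed=0):
    """Randomized pairwise gossip sketch: one random edge is activated per iteration."""
    rng = random.Random(seed)
    x = [float(v) for v in t]
    for _ in range(iters):
        i, j = rng.choice(edges)           # assumed: uniform choice over edges
        x[i] = x[j] = 0.5 * (x[i] + x[j])  # both endpoints take their pairwise average
    return x

# Three-node path graph 0 - 1 - 2 with measurements 0, 3, 6 (average 3).
x_gossip = pairwise_gossip([0.0, 3.0, 6.0], [(0, 1), (1, 2)], iters=200)
```

Each pairwise average leaves the sum of all node values unchanged, which is why the estimates can only agree on the true average.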

Similarly to the first experiment, we also evaluated PDMM for both the synchronous and asynchronous schemes. For the asynchronous scheme, we tested all four algorithms introduced above, while for the synchronous scheme, we focused on PDMM and ADMM. The implementations of the synchronous and asynchronous ADMM follow [10] and [17], respectively. The asynchronous ADMM [17] is similar to the gossip algorithm in the sense that both algorithms activate one edge per iteration.

We note that the asynchronous ADMM essentially activates two neighboring nodes per iteration. To make a fair comparison between PDMM and ADMM, we implemented two versions of PDMM for the asynchronous scheme. The first version follows Subsection IV-B, where each iteration randomly activates one node, as in the broadcast algorithm; it is referred to as one-node PDMM. The second version randomly activates two neighboring nodes per iteration, as in the gossip algorithm; it is referred to as two-node PDMM.

Both PDMM and ADMM have some parameters to be specified. To simplify the implementation, we let $\gamma_{p}=\gamma_{d}=1$ in PDMM (which is not the optimal setting from Fig. 2). Similarly, we set the parameter in ADMM to be 1.

In the experiment, the gossip and broadcast algorithms were initialized according to [5] and [14], respectively. The initialization for PDMM was the same as in the first experiment. The estimates of ADMM were initialized similarly as for PDMM.

Fig. 4 displays the MSE trajectories for the four methods while Table II lists the average execution times (per iteration) and their standard deviations. Similarly to the second experiment, the performance of each method for the asynchronous scheme was obtained by averaging over 100 simulations to mitigate the effect of randomness introduced by node or edge activation. We now focus on the asynchronous scheme. It is seen that the two-node PDMM converges the fastest in terms of the number of iterations while the gossip algorithm requires the least execution time on average. The above results suggest that for applications where signal transmission is more expensive than local computation (w.r.t. energy consumption), PDMM might be a good candidate as it may reduce the number of iterations.

Fig. 4 (b) demonstrates the MSE performance of PDMM and ADMM for the synchronous scheme. Both algorithms appear to have linear convergence rates. This may be because the objective functions in (66) are strongly convex and have gradients which are Lipschitz continuous. It is seen from Table II that both methods take roughly the same execution time per iteration. By combining the above results, we conclude that under the synchronous scheme, PDMM converges faster than ADMM w.r.t. the execution time, which may be due to the fact that PDMM avoids the auxiliary variable $\boldsymbol{z}$ used in ADMM.

VII Conclusion

In this paper, we have proposed PDMM for iterative optimization over a general graph. The augmented primal-dual Lagrangian function is constructed of which a saddle point provides an optimal solution of the original problem, which leads to the design of PDMM. PDMM performs broadcast transmission under perfect channel and point-to-point transmission under non-perfect channel. We have shown that both the synchronous and asynchronous PDMMs possess a convergence rate of $\mathcal{O}(1/K)$ for general closed, proper and convex functions defined over the graph. As an example, we have applied PDMM for distributed averaging, through which properties of PDMM such as proper parameter-selection and resilience against transmission failure are further investigated.

We note that PDMM is natural when performing node-oriented optimization over a graph as compared to ADMM which involves computing the edge variable $\boldsymbol{z}$ introduced in (3). A few applications in [21], [22] and [23] suggest that PDMM is practically promising. While convergence properties of ADMM under different conditions (e.g., strong convexity and/or the gradients being Lipschitz continuous) are well understood, the convergence properties of PDMM for those conditions remain to be discovered.

Appendix A Proof for Lemma 8

Before presenting the proof, we first introduce a basic inequality, which is described in a lemma below:

Lemma 13.

Let $f_{1}(\boldsymbol{x})$ and $f_{2}(\boldsymbol{x})$ be two arbitrary closed, proper and convex functions, and let $\boldsymbol{x}^{\star}$ minimize the sum of the two functions, i.e., $\boldsymbol{x}^{\star}=\arg\min_{\boldsymbol{x}}(f_{1}(\boldsymbol{x})+f_{2}(\boldsymbol{x}))$. Then we have

[TABLE]

where $\boldsymbol{r}(\boldsymbol{x}^{\star})\in\partial_{\boldsymbol{x}}f_{2}(\boldsymbol{x}^{\star})$ .
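Inequalities of this type usually follow from a short subgradient argument. The sketch below shows that standard argument under the assumption that the subdifferential sum rule applies to $f_{1}+f_{2}$; it is given for orientation only and is not a transcription of (74).

```latex
% Optimality of x^* for f_1 + f_2 (assuming the sum rule holds):
%   0 \in \partial f_1(x^*) + \partial f_2(x^*),
% so there is some r(x^*) \in \partial f_2(x^*) with -r(x^*) \in \partial f_1(x^*).
% The subgradient inequality for f_1 at x^* then gives, for every x,
f_{1}(\boldsymbol{x}) \;\geq\; f_{1}(\boldsymbol{x}^{\star})
  - (\boldsymbol{x}-\boldsymbol{x}^{\star})^{T}\boldsymbol{r}(\boldsymbol{x}^{\star})
\quad\Longleftrightarrow\quad
f_{1}(\boldsymbol{x}) - f_{1}(\boldsymbol{x}^{\star})
  + (\boldsymbol{x}-\boldsymbol{x}^{\star})^{T}\boldsymbol{r}(\boldsymbol{x}^{\star}) \;\geq\; 0 .
```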

The above inequality is widely exploited in the convergence analysis of ADMM and its variants [27, 28, 10]. We will also use the inequality in our proof.

Applying (74) to the updating equations (35)-(36) for $(\hat{\boldsymbol{x}}^{k+1},\hat{\boldsymbol{\lambda}}^{k+1})$ , we obtain a set of inequalities for all $(\boldsymbol{x},\boldsymbol{\lambda})\in(\mathbb{R}^{\sum n_{i}},\mathbb{R}^{2\sum n_{ij}})$ as

[TABLE]

Adding (75)-(76) over all $i\in\mathcal{V}$ , and substituting $(\boldsymbol{x},\boldsymbol{\lambda})=(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\star})$ , the saddle point of $L_{\mathcal{P}}$ , yields

[TABLE]

where the last equality follows from the two optimality conditions (31)-(32).

To further simplify (77), one can first insert the alternative expression (46) for every $\boldsymbol{P}_{d,ij}$ into (77). After that, the expression (55) can be obtained by simplifying the new expression using (31)-(32) and the following identity

[TABLE]

Appendix B Proof of Lemma 11

The basic idea for the proof is similar to that for Lemma 8 as presented in Appendix A. However, since asynchronous PDMM activates one node $i\in\mathcal{V}$ per iteration, it is difficult to tell which neighbors of $i$ have been recently activated and which have not yet. The above difficulty requires careful treatment in the convergence analysis. We sketch the proof in the following for reference.

We focus on the parameter-updating for a particular segment of iterations $k\in\{ml,ml+1,\ldots,ml+m-1\}$ , where $l\geq 0$ . For simplicity, we denote the activated node $i$ at iteration $k$ as $i(k)$ . To start with, we apply (74) to the updating equation (37) for the estimate $(\hat{\boldsymbol{x}}_{i(k)}^{k+1},\hat{\boldsymbol{\lambda}}_{i(k)}^{k+1})$ of node $i(k)$ . In order to do so, we first have to consider the estimates of its neighbors. It may happen that some neighbors have already been activated within the segment while others are still waiting to be activated. If a neighbor $j\in\mathcal{N}_{i(k)}$ is still waiting, we then have $(\hat{\boldsymbol{x}}_{j}^{k},\hat{\boldsymbol{\lambda}}_{j}^{k})=(\hat{\boldsymbol{x}}_{j}^{lm},\hat{\boldsymbol{\lambda}}_{j}^{lm})$ . Conversely, if a neighbor $j\in\mathcal{N}_{i(k)}$ has already been activated, we then have $(\hat{\boldsymbol{x}}_{j}^{k},\hat{\boldsymbol{\lambda}}_{j}^{k})=(\hat{\boldsymbol{x}}_{j}^{(l+1)m},\hat{\boldsymbol{\lambda}}_{j}^{(l+1)m})$ . From Lemma 10, it is clear that if $j<i(k)$ (or $j>i(k)$ ), then the neighbor $j$ has been activated (not yet activated). For simplicity, we use a function $s(k,j)$ to denote the value $lm$ or $(l+1)m$ for a neighbor $j\in\mathcal{N}_{i(k)}$ at iteration $k$

$$s(k,j)=\begin{cases}(l+1)m, & j<i(k)\\ lm, & j>i(k)\end{cases}\qquad(80)$$

As for the activated node $i(k)$ , we have $(\hat{\boldsymbol{x}}_{i(k)}^{k+1},\hat{\boldsymbol{\lambda}}_{i(k)}^{k+1})=(\hat{\boldsymbol{x}}_{i(k)}^{(l+1)m},\hat{\boldsymbol{\lambda}}_{i(k)}^{(l+1)m})$ . As a result, the two inequalities for $\hat{\boldsymbol{x}}_{i(k)}^{k+1}$ and $\hat{\boldsymbol{\lambda}}_{i(k)}^{k+1}$ are given by

[TABLE]

where $lm\leq k<(l+1)m$ .

Next adding (81)-(82) over all $lm\leq k<(l+1)m$ and substituting $(\boldsymbol{x},\boldsymbol{\lambda})=(\boldsymbol{x}^{\star},\boldsymbol{\lambda}^{\star})$ yields

[TABLE]

where the function $g(k,i(k),j)$ is defined as

[TABLE]

where $lm\leq k<(l+1)m$ and $j\in\mathcal{N}_{i(k)}$ .

Now we are in a position to analyze the right hand side of (83). By using the fact that each node $i$ has $|\mathcal{N}_{i}|$ different functions $g(k,i(k),j)$ , we can conclude that each edge $(u,v)\in\mathcal{E}$ is associated with two functions $g(k_{1},u(k_{1}),v)$ and $g(k_{2},v(k_{2}),u)$ , where iteration $k_{1}$ and $k_{2}$ activate $u$ and $v$ , respectively. From (83), it is clear that each edge $(u,v)$ is also associated with the other two functions $\|\boldsymbol{c}_{uv}-\boldsymbol{A}_{uv}\hat{\boldsymbol{x}}_{u}^{(l+1)m}-\boldsymbol{A}_{vu}\hat{\boldsymbol{x}}_{v}^{(l+1)m}\|_{\boldsymbol{P}_{p,uv}}^{2}$ and $\|\hat{\boldsymbol{\lambda}}_{v|u}^{(l+1)m}-\hat{\boldsymbol{\lambda}}_{u|v}^{(l+1)m}\|_{\boldsymbol{P}_{d,uv}}^{2}$ . We show in the following that the combination of the above four functions for every edge $(u,v)\in\mathcal{E}$ is independent of $k_{1}$ and $k_{2}$ . In order to do so, we assume $k_{1}<k_{2}$ (or equivalently, $u<v$ from Lemma 10). From (80), we know that $s(k_{1},v)=lm$ and $s(k_{2},u)=(l+1)m$ . Based on the above information, the four functions for $(u,v)\in\mathcal{E}$ can be simplified as

[TABLE]

where $d_{uv}^{l+1}$ is given by (62), of which the derivation is similar to that for $d_{i|j}^{k+1}$ in (56). The term $u(k_{1})$ in (84) is simplified as $u$ since we already assume that at iteration $k_{1}$ , node $u$ is activated. The quantity $d_{uv}^{l+1}$ is a function of $m$ and $l$ instead of $k_{1}$ . Finally, combining (83) and (85) produces (61).

Bibliography

[1] G. Zhang, R. Heusdens, and W. B. Kleijn, “On the Convergence Rate of the Bi-Alternating Direction Method of Multipliers,” in Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2014, pp. 3897–3901.
[2] G. Zhang and R. Heusdens, “Bi-Alternating Direction Method of Multipliers over Graphs,” in Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), April 2015.
[3] T. Richardson and R. Urbanke, Modern Coding Theory. Cambridge University Press, 2008.
[4] G. Zhang, R. Heusdens, and W. B. Kleijn, “Large Scale LP Decoding with Low Complexity,” IEEE Communications Letters, vol. 17, no. 11, pp. 2152–2155, 2013.
[5] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah, “Randomized Gossip Algorithms,” IEEE Trans. Information Theory, vol. 52, no. 6, pp. 2508–2530, 2006.
[6] D. Sontag, A. Globerson, and T. Jaakkola, “Introduction to Dual Decomposition for Inference,” in Optimization for Machine Learning. MIT Press, 2011.
[7] Y. Zeng and R. Heusdens, “Linear Coordinate-Descent Message-Passing for Quadratic Optimization,” Neural Computation, vol. 24, no. 12, pp. 3340–3370, 2012.
[8] C. C. Moallemi and B. V. Roy, “Convergence of Min-Sum Message Passing for Quadratic Optimization,” IEEE Trans. Inf. Theory, vol. 55, no. 5, pp. 2413–2423, 2009.
