Distributed Optimization Using the Primal-Dual Method of Multipliers
G. Zhang, R. Heusdens

TL;DR
This paper introduces PDMM, a novel primal-dual algorithm for distributed convex optimization over graphs, demonstrating convergence and robustness under various update schemes and network conditions.
Contribution
The paper develops PDMM, a new distributed optimization method that effectively handles graph-structured problems with convergence guarantees and resilience to communication failures.
Findings
Converges at rate O(1/K) for convex functions.
Effective under both synchronous and asynchronous updates.
Resilient to transmission failures in distributed averaging.
Abstract
In this paper, we propose the primal-dual method of multipliers (PDMM) for distributed optimization over a graph. In particular, we optimize a sum of convex functions defined over a graph, where every edge in the graph carries a linear equality constraint. In designing the new algorithm, an augmented primal-dual Lagrangian function is constructed which smoothly captures the graph topology. It is shown that a saddle point of the constructed function provides an optimal solution of the original problem. Further under both the synchronous and asynchronous updating schemes, PDMM has the convergence rate of O(1/K) (where K denotes the iteration index) for general closed, proper and convex functions. Other properties of PDMM such as convergence speeds versus different parameter-settings and resilience to transmission failure are also investigated through the experiments of distributed…
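Table I: Procedure of synchronous PDMM with $\mathbf{P}_{d,ij} = \mathbf{P}_{p,ij}^{-1}$
| Initialization: Properly initialize $\{\mathbf{x}_i\}$ and $\{\boldsymbol{\lambda}_{i\mid j}\}$ |
| Repeat |
| for all $i \in V$ do |
| $\mathbf{x}_i^{k+1} = \arg\min_{\mathbf{x}_i} \big[ f_i(\mathbf{x}_i) - \mathbf{x}_i^T \sum_{j\in N_i} \mathbf{A}_{i\mid j}^T \boldsymbol{\lambda}_{j\mid i}^{k} + \sum_{j\in N_i} \tfrac{1}{2} \| \mathbf{A}_{i\mid j}\mathbf{x}_i + \mathbf{A}_{j\mid i}\mathbf{x}_j^{k} - \mathbf{c}_{ij} \|^2_{\mathbf{P}_{p,ij}} \big]$ |
| end for |
| for all $i \in V$ and $j \in N_i$ do |
| $\boldsymbol{\lambda}_{i\mid j}^{k+1} = \boldsymbol{\lambda}_{j\mid i}^{k} + \mathbf{P}_{p,ij}\big(\mathbf{c}_{ij} - \mathbf{A}_{j\mid i}\mathbf{x}_j^{k} - \mathbf{A}_{i\mid j}\mathbf{x}_i^{k+1}\big)$ |
| end for |
| Until some stopping criterion is met |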
| ADMM | broadcast | gossip | |||||
|---|---|---|---|---|---|---|---|
| ave. () | 5.46 | 8.92 | 6.54 | 2.10 | 0.24 | 380 | 384 |
| std () | 5.04 | 8.58 | 8.09 | 4.55 | 1.73 | 216 | 285 |
Distributed Optimization Using the Primal-Dual Method of Multipliers
Guoqiang Zhang and Richard Heusdens
G. Zhang is with both the School of Computing and Communications, University of Technology, Sydney, Australia, and the Department of Microelectronics, Circuits and Systems group, Delft University of Technology, The Netherlands. Email: [email protected]. R. Heusdens is with the Department of Microelectronics, Circuits and Systems group, Delft University of Technology, The Netherlands. Email: [email protected].
Part of the work has been published at ICASSP, 2015, with the paper titled Bi-Alternating Direction Method of Multipliers over Graphs. After careful consideration, we have decided to change the name of our algorithm from bi-alternating direction method of multipliers (BiADMM) in [1] and [2] to primal-dual method of multipliers (PDMM).
Abstract
In this paper, we propose the primal-dual method of multipliers (PDMM) for distributed optimization over a graph. In particular, we optimize a sum of convex functions defined over a graph, where every edge in the graph carries a linear equality constraint. In designing the new algorithm, an augmented primal-dual Lagrangian function is constructed which smoothly captures the graph topology. It is shown that a saddle point of the constructed function provides an optimal solution of the original problem. Further, under both the synchronous and asynchronous updating schemes, PDMM has the convergence rate of $O(1/K)$ (where $K$ denotes the iteration index) for general closed, proper and convex functions. Other properties of PDMM such as convergence speeds versus different parameter-settings and resilience to transmission failure are also investigated through the experiments of distributed averaging.
Index Terms:
Distributed optimization, ADMM, PDMM, sublinear convergence.
I Introduction
In recent years, distributed optimization has drawn increasing attention due to the demand for big-data processing and easy access to ubiquitous computing units (e.g., a computer, a mobile phone or a sensor equipped with a CPU). The basic idea is to have a set of computing units collaborate with each other in a distributed way to complete a complex task. Popular applications include telecommunication [3, 4], wireless sensor networks [5], cloud computing and machine learning [6]. The research challenge is on the design of efficient and robust distributed optimization algorithms for those applications.
To the best of our knowledge, almost all the optimization problems in those applications can be formulated as optimization over a graphic model $G=(V,E)$:
$$\min_{\{\mathbf{x}_i\}} \sum_{i\in V} f_i(\mathbf{x}_i) + \sum_{(i,j)\in E} f_{ij}(\mathbf{x}_i,\mathbf{x}_j), \tag{1}$$
where $\{f_i \mid i\in V\}$ and $\{f_{ij} \mid (i,j)\in E\}$ are referred to as node and edge-functions, respectively. For instance, for the application of distributed quadratic optimization, all the node and edge-functions are in the form of scalar quadratic functions (see [7, 8, 9]).
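In the literature, a large number of applications (see [10]) require that every edge function $f_{ij}(\mathbf{x}_i,\mathbf{x}_j)$, $(i,j)\in E$, is essentially a linear equality constraint in terms of $\mathbf{x}_i$ and $\mathbf{x}_j$. Mathematically, we use $\mathbf{A}_{i|j}\mathbf{x}_i + \mathbf{A}_{j|i}\mathbf{x}_j = \mathbf{c}_{ij}$ to formulate the equality constraint for each $(i,j)\in E$, as demonstrated in Fig. 1. In this situation, (1) can be described as
$$\min_{\{\mathbf{x}_i\}} \sum_{i\in V} f_i(\mathbf{x}_i) + \sum_{(i,j)\in E} I_{\mathbf{A}_{i|j}\mathbf{x}_i + \mathbf{A}_{j|i}\mathbf{x}_j = \mathbf{c}_{ij}}(\mathbf{x}_i,\mathbf{x}_j), \tag{2}$$
where $I(\cdot)$ denotes the indicator or characteristic function defined as $I_{\mathcal{C}}(\mathbf{x}) = 0$ if $\mathbf{x}\in\mathcal{C}$ and $I_{\mathcal{C}}(\mathbf{x}) = \infty$ if $\mathbf{x}\notin\mathcal{C}$. In this paper, we focus on convex optimization of form (2), where every node-function $f_i$ is closed, proper and convex.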
The majority of recent research has focused on a specialized form of the convex problem (2), where every edge-function $f_{ij}$ reduces to $I_{\mathbf{x}_i = \mathbf{x}_j}(\mathbf{x}_i,\mathbf{x}_j)$. The above problem is commonly known as the consensus problem in the literature. Classic methods include the dual-averaging algorithm [11], the subgradient algorithm [12], and the diffusion adaptation algorithm [13]. For the special case that $\{f_i \mid i\in V\}$ are scalar quadratic functions (referred to as the distributed averaging problem), the most popular methods are the randomized gossip algorithm [5] and the broadcast algorithm [14]. See [15] for an overview of the literature for solving the distributed averaging problem.
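The alternating-direction method of multipliers (ADMM) can be applied to solve the general convex optimization (2). The key step is to decompose each equality constraint $\mathbf{A}_{i|j}\mathbf{x}_i + \mathbf{A}_{j|i}\mathbf{x}_j = \mathbf{c}_{ij}$ into two constraints such as $\mathbf{A}_{i|j}\mathbf{x}_i + \mathbf{z}_{ij} = \mathbf{c}_{ij}$ and $\mathbf{z}_{ij} = \mathbf{A}_{j|i}\mathbf{x}_j$ with the help of the auxiliary variable $\mathbf{z}_{ij}$. As a result, (2) can be reformulated as
$$\min_{\mathbf{x},\mathbf{z}} f(\mathbf{x}) + g(\mathbf{z}) \quad \text{s.t.} \quad \mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{z} = \mathbf{c}, \tag{3}$$
where $f(\mathbf{x}) = \sum_{i\in V} f_i(\mathbf{x}_i)$, $g(\mathbf{z}) = 0$ and $\mathbf{z}$ is a vector obtained by stacking up $\mathbf{z}_{ij}$ one after another. See [16] for using ADMM to solve the consensus problem of (2) (with edge-function $I_{\mathbf{x}_i=\mathbf{x}_j}(\cdot)$). The graphic structure is implicitly embedded in the two matrices $(\mathbf{A},\mathbf{B})$ and the vector $\mathbf{c}$. The reformulation essentially converts the problem on a general graph with many nodes (2) to a graph with only two nodes (3), allowing the application of ADMM. Based on (3), ADMM then constructs and optimizes an augmented Lagrangian function iteratively with respect to $(\mathbf{x},\mathbf{z})$ and a set of Lagrangian multipliers. We refer to the above procedure as synchronous ADMM as it updates all the variables at each iteration. Recently, the work of [17] proposed asynchronous ADMM, which optimizes the same function over a subset of the variables at each iteration.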
We note that besides solving (2), ADMM has found many successful applications in the fields of signal processing and machine learning (see [10] for an overview). For instance, in [18] and [19], variants of ADMM have been proposed to solve a (possibly nonconvex) optimization problem defined over a graph with a star topology, which is motivated by big data applications. The work of [20] considers solving the consensus problem of (2) (with edge-function $I_{\mathbf{x}_i=\mathbf{x}_j}(\cdot)$) over a general graph, where each node function is further expressed as a sum of two component functions. The authors of [20] propose a new algorithm which includes ADMM as a special case when one component function is zero. In general, ADMM and its variants are quite simple and often provide satisfactory results after a reasonable number of iterations, which has made them popular in recent years.
In this paper, we tackle the convex problem (2) directly instead of relying on the reformulation (3). Specifically, we construct an augmented primal-dual Lagrangian function for (2) without introducing the auxiliary variable $\mathbf{z}$ as is required by ADMM. We show that solving (2) is equivalent to searching for a saddle point of the augmented primal-dual Lagrangian. We then propose the primal-dual method of multipliers (PDMM) to iteratively approach one saddle point of the constructed function. It is shown that for both the synchronous and asynchronous updating schemes, PDMM converges with the rate of $O(1/K)$ for general closed, proper and convex functions.
Further we evaluate PDMM through the experiments of distributed averaging. Firstly, it is found that the parameters of PDMM should be selected by a rule (see VI-C1) for fast convergence. Secondly, when there are transmission failures in the graph, transmission losses only slow down the convergence speed of PDMM. Finally, experimental comparison suggests that PDMM outperforms ADMM and the two gossip algorithms in [5] and [14].
This work is mainly devoted to the theoretical analysis of PDMM. In the literature, PDMM has already been successfully applied for solving a few other problems. The work of [21] investigates the efficiency of ADMM and PDMM for distributed dictionary learning. In [22], we have used both ADMM and PDMM for training a support vector machine (SVM). In the above examples it is found that PDMM outperforms ADMM in terms of convergence rate. In [23], the authors describe an application of the linearly constrained minimum variance (LCMV) beamformer for use in acoustic wireless sensor networks. The proposed algorithm computes the optimal beamformer output at each node in the network without the need for sharing raw data within the network. PDMM has been successfully applied to perform distributed beamforming. This suggests that PDMM is not only theoretically interesting but also might be powerful in real applications.
II Problem Setting
In this section, we first introduce basic notations needed in the rest of the paper. We then make a proper assumption about the existence of optimal solutions of the problem. Finally, we derive the dual problem to (2) and its Lagrangian function, which will be used for constructing the augmented primal-dual Lagrangian function in Section III.
II-A Notations and functional properties
We first introduce notations for a graphic model. We denote a graph as $G=(V,E)$, where $V=\{1,\ldots,m\}$ represents the set of nodes and $E=\{(i,j) \mid i,j\in V\}$ represents the set of edges in the graph, respectively. We use $\vec{E}$ to denote the set of all directed edges. Therefore, $|\vec{E}| = 2|E|$. The directed edge $[i,j]$ starts from node $i$ and ends with node $j$. We use $N_i$ to denote the set of all neighboring nodes of node $i$, i.e., $N_i = \{j \mid (i,j)\in E\}$. Given a graph $G=(V,E)$, only neighboring nodes are allowed to communicate with each other directly.
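The graph notation above can be sketched in a few lines of Python. This is only an illustration: the function name `build_graph`, the 0-based node indexing and the 4-node ring are assumptions made here, not part of the paper.

```python
def build_graph(num_nodes, edges):
    """Return the neighbor sets N_i and the list of directed edges [i, j]."""
    neighbors = {i: set() for i in range(num_nodes)}
    for i, j in edges:
        # An undirected edge (i, j) makes i and j neighbors of each other.
        neighbors[i].add(j)
        neighbors[j].add(i)
    # Every undirected edge contributes the two directed edges [i, j] and [j, i].
    directed = [(i, j) for i in range(num_nodes) for j in sorted(neighbors[i])]
    return neighbors, directed


# Hypothetical example: a ring with four nodes (0-based indices).
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
neighbors, directed = build_graph(4, edges)
```

With this construction the number of directed edges is twice the number of undirected edges, matching the counting relation stated in the text.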
Next we introduce notations for mathematical description in the remainder of the paper. We use bold small letters to denote vectors and bold capital letters to denote matrices. The notation $\mathbf{M}\succeq 0$ (or $\mathbf{M}\succ 0$) represents a symmetric positive semi-definite matrix (or a symmetric positive definite matrix). The superscript $(\cdot)^T$ represents the transpose operator. Given a vector $\mathbf{y}$, we use $\|\mathbf{y}\|$ to denote its $\ell_2$ norm.
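Finally, we introduce the conjugate function. Suppose $h:\mathbb{R}^n\to\mathbb{R}\cup\{+\infty\}$ is a closed, proper and convex function. Then the conjugate of $h(\cdot)$ is defined as [24, Definition 2.1.20]
$$h^*(\boldsymbol{\delta}) \triangleq \max_{\mathbf{y}}\; \boldsymbol{\delta}^T\mathbf{y} - h(\mathbf{y}), \tag{4}$$
where the conjugate function $h^*$ is again a closed, proper and convex function. Let $\mathbf{y}'$ be the optimal solution for a particular $\boldsymbol{\delta}'$ in (4). We then have
$$\boldsymbol{\delta}' \in \partial_{\mathbf{y}} h(\mathbf{y}'), \tag{5}$$
where $\partial_{\mathbf{y}} h(\mathbf{y}')$ represents the set of all subgradients of $h(\cdot)$ at $\mathbf{y}'$ (see [24, Definition 2.1.23]). As a consequence, since $h^{**} = h$, we have
$$h(\mathbf{y}') = \mathbf{y}'^T\boldsymbol{\delta}' - h^*(\boldsymbol{\delta}') = \max_{\boldsymbol{\delta}}\; \mathbf{y}'^T\boldsymbol{\delta} - h^*(\boldsymbol{\delta}), \tag{6}$$
and we conclude that $\mathbf{y}' \in \partial_{\boldsymbol{\delta}} h^*(\boldsymbol{\delta}')$ as well.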
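As a small numerical illustration of the conjugate function and of Fenchel's inequality, the sketch below uses a scalar quadratic $h(y) = \tfrac{1}{2}(y-t)^2$, whose conjugate has the closed form $h^*(\delta) = \tfrac{1}{2}\delta^2 + t\delta$. The function name `conjugate_on_grid`, the value of `t` and the grid are assumptions chosen for the example; the grid maximization is only an approximation of the maximization in the definition.

```python
import numpy as np


def conjugate_on_grid(h, delta, grid):
    """Approximate h*(delta) = max_y (delta * y - h(y)) by a maximum over grid points."""
    return float(np.max(delta * grid - h(grid)))


t = 1.5                                      # hypothetical data value
h = lambda y: 0.5 * (y - t) ** 2             # scalar quadratic function
h_star = lambda d: 0.5 * d ** 2 + t * d      # its conjugate in closed form

grid = np.linspace(-10.0, 10.0, 200001)
delta = 0.7
approx = conjugate_on_grid(h, delta, grid)   # should be close to h_star(delta)

# Fenchel's inequality: h(y) + h*(delta) >= delta * y for any y,
# with equality at the maximizer y = t + delta of the conjugate definition.
y = 0.3
fenchel_gap = h(y) + h_star(delta) - delta * y
gap_at_maximizer = h(t + delta) + h_star(delta) - delta * (t + delta)
```

The gap is nonnegative for any `y` and vanishes at the maximizer, which is the equality case used repeatedly in the later derivations.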
II-B Problem assumption
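With the notation for a graph, we first reformulate the convex problem (2) as
$$\min_{\mathbf{x}} \sum_{i\in V} f_i(\mathbf{x}_i) \quad \text{s. t.} \quad \mathbf{A}_{i|j}\mathbf{x}_i + \mathbf{A}_{j|i}\mathbf{x}_j = \mathbf{c}_{ij} \quad \forall (i,j)\in E, \tag{7}$$
where each function $f_i:\mathbb{R}^{n_i}\to\mathbb{R}\cup\{+\infty\}$ is assumed to be closed, proper and convex, and $\mathbf{x} = [\mathbf{x}_1^T,\ldots,\mathbf{x}_m^T]^T$. For every edge $(i,j)\in E$, we let $(\mathbf{c}_{ij},\mathbf{A}_{i|j},\mathbf{A}_{j|i}) \in (\mathbb{R}^{n_{ij}},\mathbb{R}^{n_{ij}\times n_i},\mathbb{R}^{n_{ij}\times n_j})$. The vector $\mathbf{x}$ is thus of dimension $n_{\mathbf{x}} = \sum_{i\in V} n_i$. In general, $\mathbf{A}_{i|j}$ and $\mathbf{A}_{j|i}$ are two different matrices. The matrix $\mathbf{A}_{i|j}$ operates on $\mathbf{x}_i$ in the linear constraint of edge $(i,j)$. The notation s. t. in (7) stands for “subject to”. We take the reformulation (7) as the primal problem in terms of $\mathbf{x}$.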
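The primal Lagrangian for (7) can be constructed as
$$L_p(\mathbf{x},\boldsymbol{\delta}) = \sum_{i\in V} f_i(\mathbf{x}_i) + \sum_{(i,j)\in E} \boldsymbol{\delta}_{ij}^T\big(\mathbf{c}_{ij} - \mathbf{A}_{i|j}\mathbf{x}_i - \mathbf{A}_{j|i}\mathbf{x}_j\big), \tag{8}$$
where $\boldsymbol{\delta}_{ij}$ is the Lagrangian multiplier (or the dual variable) for the corresponding edge constraint in (7), and the vector $\boldsymbol{\delta}$ is obtained by stacking all the dual variables $\boldsymbol{\delta}_{ij}$, $(i,j)\in E$, on top of one another. Therefore, $\boldsymbol{\delta}$ is of dimension $n_{\boldsymbol{\delta}} = \sum_{(i,j)\in E} n_{ij}$. The Lagrangian function is convex in $\mathbf{x}$ for fixed $\boldsymbol{\delta}$, and concave in $\boldsymbol{\delta}$ for fixed $\mathbf{x}$. Throughout the rest of the paper, we will make the following (common) assumption:
Assumption 1.
There exists a saddle point $(\mathbf{x}^*,\boldsymbol{\delta}^*)$ to the Lagrangian function $L_p(\mathbf{x},\boldsymbol{\delta})$ such that for all $\mathbf{x}\in\mathbb{R}^{n_{\mathbf{x}}}$ and $\boldsymbol{\delta}\in\mathbb{R}^{n_{\boldsymbol{\delta}}}$ we have
$$L_p(\mathbf{x}^*,\boldsymbol{\delta}) \le L_p(\mathbf{x}^*,\boldsymbol{\delta}^*) \le L_p(\mathbf{x},\boldsymbol{\delta}^*). \tag{9}$$
Or equivalently, the following optimality (KKT) conditions hold for $(\mathbf{x}^*,\boldsymbol{\delta}^*)$:
$$\sum_{j\in N_i} \mathbf{A}_{i|j}^T\boldsymbol{\delta}_{ij}^* \in \partial f_i(\mathbf{x}_i^*) \;\; \forall i\in V, \qquad \mathbf{A}_{j|i}\mathbf{x}_j^* + \mathbf{A}_{i|j}\mathbf{x}_i^* = \mathbf{c}_{ij} \;\; \forall (i,j)\in E. \tag{10}$$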
II-C Dual problem and its Lagrangian function
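We first derive the dual problem to (7). Optimizing $L_p(\mathbf{x},\boldsymbol{\delta})$ over $\boldsymbol{\delta}$ and $\mathbf{x}$ yields
$$\max_{\boldsymbol{\delta}} \min_{\mathbf{x}} L_p(\mathbf{x},\boldsymbol{\delta}) = \max_{\boldsymbol{\delta}} \sum_{i\in V} -f_i^*\Big(\sum_{j\in N_i}\mathbf{A}_{i|j}^T\boldsymbol{\delta}_{ij}\Big) + \sum_{(i,j)\in E}\boldsymbol{\delta}_{ij}^T\mathbf{c}_{ij}, \tag{11}$$
where $f_i^*(\cdot)$ is the conjugate function of $f_i(\cdot)$ as defined in (4), satisfying Fenchel’s inequality
$$f_i(\mathbf{x}_i) + f_i^*\Big(\sum_{j\in N_i}\mathbf{A}_{i|j}^T\boldsymbol{\delta}_{ij}\Big) \ge \sum_{j\in N_i}\boldsymbol{\delta}_{ij}^T\mathbf{A}_{i|j}\mathbf{x}_i. \tag{12}$$
Under Assumption 1, the dual problem (11) is equivalent to the primal problem (7). That is, suppose $(\mathbf{x}^*,\boldsymbol{\delta}^*)$ is a saddle point of $L_p$. Then $\mathbf{x}^*$ solves the primal problem (7) and $\boldsymbol{\delta}^*$ solves the dual problem (11).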
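At this point, we need to introduce auxiliary variables to decouple the node dependencies in (11). Indeed, every $\boldsymbol{\delta}_{ij}$, associated to edge $(i,j)$, is used by two conjugate functions $f_i^*$ and $f_j^*$. As a consequence, all conjugate functions in (11) are dependent on each other. To decouple the conjugate functions, we introduce for each edge $(i,j)\in E$ two auxiliary node variables $\boldsymbol{\lambda}_{i|j}\in\mathbb{R}^{n_{ij}}$ and $\boldsymbol{\lambda}_{j|i}\in\mathbb{R}^{n_{ij}}$, one for each node $i$ and $j$, respectively. The node variable $\boldsymbol{\lambda}_{i|j}$ is owned by and updated at node $i$ and is related to neighboring node $j$. Hence, at every node $i$ we introduce $|N_i|$ new node variables. With this, we can reformulate the original dual problem as
$$\max_{\boldsymbol{\delta},\{\boldsymbol{\lambda}_i\}} \sum_{i\in V} -f_i^*(\mathbf{A}_i^T\boldsymbol{\lambda}_i) + \sum_{(i,j)\in E}\boldsymbol{\delta}_{ij}^T\mathbf{c}_{ij} \quad \text{s. t.} \quad \boldsymbol{\lambda}_{i|j} = \boldsymbol{\lambda}_{j|i} = \boldsymbol{\delta}_{ij} \quad \forall (i,j)\in E, \tag{13}$$
where $\boldsymbol{\lambda}_i$ is obtained by vertically concatenating all $\boldsymbol{\lambda}_{i|j}$, $j\in N_i$, and $\mathbf{A}_i^T$ is obtained by horizontally concatenating all $\mathbf{A}_{i|j}^T$, $j\in N_i$. To clarify, the product $\mathbf{A}_i^T\boldsymbol{\lambda}_i$ in (13) equals
$$\mathbf{A}_i^T\boldsymbol{\lambda}_i = \sum_{j\in N_i}\mathbf{A}_{i|j}^T\boldsymbol{\lambda}_{i|j}. \tag{14}$$
Consequently, we let $\boldsymbol{\lambda} = [\boldsymbol{\lambda}_1^T,\boldsymbol{\lambda}_2^T,\ldots,\boldsymbol{\lambda}_m^T]^T$. In the above reformulation (13), each conjugate function $f_i^*(\cdot)$ only involves the node variable $\boldsymbol{\lambda}_i$, facilitating distributed optimization.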
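Next we tackle the equality constraints in (13). To do so, we construct a (dual) Lagrangian function for the dual problem (13), which is given by
$$L'_d(\boldsymbol{\delta},\boldsymbol{\lambda},\mathbf{y}) = \sum_{i\in V} -f_i^*(\mathbf{A}_i^T\boldsymbol{\lambda}_i) + \sum_{(i,j)\in E}\boldsymbol{\delta}_{ij}^T\mathbf{c}_{ij} + \sum_{i\in V}\sum_{j\in N_i}\mathbf{y}_{i|j}^T\big(\boldsymbol{\delta}_{ij} - \boldsymbol{\lambda}_{i|j}\big), \tag{15}$$
where $\mathbf{y}$ is obtained by concatenating all the Lagrangian multipliers $\mathbf{y}_{i|j}$, $[i,j]\in\vec{E}$, one after another.
We now argue that each Lagrangian multiplier $\mathbf{y}_{i|j}$, $[i,j]\in\vec{E}$, in (15) can be replaced by an affine function of $\mathbf{x}_j$. Suppose $(\mathbf{x}^*,\boldsymbol{\delta}^*)$ is a saddle point of $L_p$. By letting $\boldsymbol{\lambda}^*_{i|j} = \boldsymbol{\delta}^*_{ij}$ for every $[i,j]\in\vec{E}$, Fenchel’s inequality (12) must hold with equality at $(\mathbf{x}^*,\boldsymbol{\lambda}^*)$ from which we derive that
[TABLE]
One can then show that $(\boldsymbol{\delta}^*,\boldsymbol{\lambda}^*,\mathbf{y}^*)$, where $\mathbf{y}^*_{i|j} = \mathbf{A}_{j|i}\mathbf{x}_j^* - \mathbf{c}_{ij}$ for every $[i,j]\in\vec{E}$, is a saddle point of $L'_d$. We therefore restrict the Lagrangian multiplier $\mathbf{y}_{i|j}$ to be of the form $\mathbf{y}_{i|j} = \mathbf{A}_{j|i}\mathbf{x}_j - \mathbf{c}_{ij}$ so that the dual Lagrangian becomes
$$L_d(\boldsymbol{\delta},\boldsymbol{\lambda},\mathbf{x}) = \sum_{i\in V} -f_i^*(\mathbf{A}_i^T\boldsymbol{\lambda}_i) - \sum_{i\in V}\sum_{j\in N_i}\boldsymbol{\lambda}_{i|j}^T\big(\mathbf{A}_{j|i}\mathbf{x}_j - \mathbf{c}_{ij}\big) - \sum_{(i,j)\in E}\boldsymbol{\delta}_{ij}^T\big(\mathbf{c}_{ij} - \mathbf{A}_{i|j}\mathbf{x}_i - \mathbf{A}_{j|i}\mathbf{x}_j\big). \tag{16}$$
We summarize the result in a lemma below:
Lemma 1.
If $(\mathbf{x}^*,\boldsymbol{\delta}^*)$ is a saddle point of $L_p(\mathbf{x},\boldsymbol{\delta})$, then $(\boldsymbol{\delta}^*,\boldsymbol{\lambda}^*,\mathbf{x}^*)$ is a saddle point of $L_d(\boldsymbol{\delta},\boldsymbol{\lambda},\mathbf{x})$, where $\boldsymbol{\lambda}^*_{i|j} = \boldsymbol{\delta}^*_{ij}$ for every $[i,j]\in\vec{E}$.
We note that $L_d$ might not be equivalent to $L'_d$. By inspection of the optimality conditions of (16), not every saddle point $(\boldsymbol{\delta}^*,\boldsymbol{\lambda}^*,\mathbf{x}')$ of $L_d$ might lead to $\mathbf{x}' = \mathbf{x}^*$ due to the generality of the matrices $\{\mathbf{A}_{i|j}\}$. In the next section we will introduce quadratic penalty functions w.r.t. $\mathbf{x}$ to implicitly enforce the equality constraints $\mathbf{A}_{i|j}\mathbf{x}_i + \mathbf{A}_{j|i}\mathbf{x}_j = \mathbf{c}_{ij}$.
To briefly summarize, one can alternatively solve the dual problem (13) instead of the primal problem. Further, by replacing $\mathbf{y}$ with an affine function of $\mathbf{x}$ in (15), the dual Lagrangian $L_d$ shares two variables $\mathbf{x}$ and $\boldsymbol{\delta}$ with the primal Lagrangian $L_p$. We will show in the next section that the special form of $L_d$ in (16) plays a crucial role for constructing the augmented primal-dual Lagrangian.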
III Augmented Primal-Dual Lagrangian
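In this section, we first build and investigate a primal-dual Lagrangian from $L_p$ and $L_d$. We show that a saddle point of the primal-dual Lagrangian does not always lead to an optimal solution of the primal or the dual problem.
To address the above issue, we then construct an augmented primal-dual Lagrangian by introducing two additional penalty functions. We show that any saddle point of the augmented primal-dual Lagrangian leads to an optimal solution of the primal and the dual problem, respectively.
III-A Primal-dual Lagrangian
By inspection of (8) and (16), we see that in both $L_p$ and $L_d$, the edge variables $\boldsymbol{\delta}$ are related to the terms $\{\mathbf{c}_{ij} - \mathbf{A}_{i|j}\mathbf{x}_i - \mathbf{A}_{j|i}\mathbf{x}_j\}$. As a consequence, if we add the primal and dual Lagrangian functions, the edge variables $\boldsymbol{\delta}$ will cancel out and the resulting function contains node variables $\mathbf{x}$ and $\boldsymbol{\lambda}$ only.
We hereby define the new function as the primal-dual Lagrangian below:
Definition 1.
The primal-dual Lagrangian is defined as
$$L_{pd}(\mathbf{x},\boldsymbol{\lambda}) = L_p(\mathbf{x},\boldsymbol{\delta}) + L_d(\boldsymbol{\delta},\boldsymbol{\lambda},\mathbf{x}) = \sum_{i\in V}\Big[f_i(\mathbf{x}_i) - \sum_{j\in N_i}\boldsymbol{\lambda}_{j|i}^T\big(\mathbf{A}_{i|j}\mathbf{x}_i - \mathbf{c}_{ij}\big) - f_i^*(\mathbf{A}_i^T\boldsymbol{\lambda}_i)\Big]. \tag{17}$$
$L_{pd}$ is convex in $\mathbf{x}$ for fixed $\boldsymbol{\lambda}$ and concave in $\boldsymbol{\lambda}$ for fixed $\mathbf{x}$, suggesting that it is essentially a saddle-point problem (see [25], [26] for solving different saddle point problems). For each edge $(i,j)\in E$, the node variables $\boldsymbol{\lambda}_{i|j}$ and $\boldsymbol{\lambda}_{j|i}$ substitute the role of the edge variable $\boldsymbol{\delta}_{ij}$. The removal of $\boldsymbol{\delta}$ makes it possible to design a distributed algorithm that only involves node-oriented optimization (see next section for PDMM).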
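Next we study the properties of saddle points of $L_{pd}$:
Lemma 2.
If $\mathbf{x}^*$ solves the primal problem (7), then there exists a $\boldsymbol{\lambda}^*$ such that $(\mathbf{x}^*,\boldsymbol{\lambda}^*)$ is a saddle point of $L_{pd}$.
Proof.
If $\mathbf{x}^*$ solves the primal problem (7), then there exists a $\boldsymbol{\delta}^*$ such that $(\mathbf{x}^*,\boldsymbol{\delta}^*)$ is a saddle point of $L_p$ and by Lemma 1, there exist $\boldsymbol{\lambda}^*_{i|j} = \boldsymbol{\delta}^*_{ij}$ for every $[i,j]\in\vec{E}$ so that $(\boldsymbol{\delta}^*,\boldsymbol{\lambda}^*,\mathbf{x}^*)$ is a saddle point of $L_d$. Hence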
[TABLE]
∎
The fact that $(\mathbf{x}',\boldsymbol{\lambda}')$ is a saddle point of $L_{pd}$, however, is not sufficient for showing $\mathbf{x}'$ (or $\boldsymbol{\lambda}'$) being optimal for solving the primal problem (7) (for solving the dual problem (13)).
Example 1 ($\mathbf{x}'$ not optimal).
Consider the following problem
[TABLE]
[TABLE]
With this, the primal Lagrangian is given by , so that the dual function is given by , where
[TABLE]
Hence, the optimal solution for the primal and dual problem is and , respectively. The primal-dual Lagrangian in this case is given by
[TABLE]
One can show that every point is a saddle point of , which does not necessarily lead to .
It is clear from Example 1 that finding a saddle point of $L_{pd}$ does not necessarily solve the primal problem (7). Similarly, one can also build another example illustrating that a saddle point of $L_{pd}$ does not necessarily solve the dual problem (13).
III-B Augmented primal-dual Lagrangian
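The problem that not every saddle point of $L_{pd}$ leads to an optimal point of the primal or dual problem can be solved by adding two quadratic penalty terms to $L_{pd}$ as
$$L_{\mathcal{P}}(\mathbf{x},\boldsymbol{\lambda}) = L_{pd}(\mathbf{x},\boldsymbol{\lambda}) + h_{\mathcal{P}_p}(\mathbf{x}) - h_{\mathcal{P}_d}(\boldsymbol{\lambda}), \tag{26}$$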
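where $h_{\mathcal{P}_p}(\mathbf{x})$ and $h_{\mathcal{P}_d}(\boldsymbol{\lambda})$ are defined as
$$h_{\mathcal{P}_p}(\mathbf{x}) = \sum_{(i,j)\in E}\tfrac{1}{2}\big\|\mathbf{A}_{i|j}\mathbf{x}_i + \mathbf{A}_{j|i}\mathbf{x}_j - \mathbf{c}_{ij}\big\|^2_{\mathbf{P}_{p,ij}}, \qquad h_{\mathcal{P}_d}(\boldsymbol{\lambda}) = \sum_{(i,j)\in E}\tfrac{1}{2}\big\|\boldsymbol{\lambda}_{i|j} - \boldsymbol{\lambda}_{j|i}\big\|^2_{\mathbf{P}_{d,ij}},$$
with the weighted norm $\|\mathbf{y}\|^2_{\mathbf{P}} = \mathbf{y}^T\mathbf{P}\mathbf{y}$,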
where and
[TABLE]
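The set $\mathcal{P}$ of positive definite matrices remains to be specified.
Let $X = \{\mathbf{x} \mid \mathbf{A}_{i|j}\mathbf{x}_i + \mathbf{A}_{j|i}\mathbf{x}_j = \mathbf{c}_{ij},\ \forall (i,j)\in E\}$ and $\Lambda = \{\boldsymbol{\lambda} \mid \boldsymbol{\lambda}_{i|j} = \boldsymbol{\lambda}_{j|i},\ \forall (i,j)\in E\}$ denote the primal and dual feasible set, respectively. It is clear that $h_{\mathcal{P}_p}(\mathbf{x})\ge 0$ (or $h_{\mathcal{P}_d}(\boldsymbol{\lambda})\ge 0$) with equality if and only if $\mathbf{x}\in X$ (or $\boldsymbol{\lambda}\in\Lambda$). The introduction of the two penalty functions essentially prevents non-feasible $\mathbf{x}$ and/or $\boldsymbol{\lambda}$ from corresponding to saddle points of $L_{\mathcal{P}}$. As a consequence, we have a saddle point theorem for $L_{\mathcal{P}}$ which states that $\mathbf{x}^*$ solves the primal problem (7) if and only if there exists $\boldsymbol{\lambda}^*$ such that $(\mathbf{x}^*,\boldsymbol{\lambda}^*)$ is a saddle point of $L_{\mathcal{P}}$. To prove this result, we need the following lemma.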
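The role of the two penalty functions can be illustrated numerically. The sketch below is written under simplifying assumptions that are not taken from the paper: scalar node variables, consensus constraints $x_i - x_j = 0$ on every edge, and scalar weights `gamma_p` and `gamma_d` standing in for the positive definite matrices. The function names are hypothetical.

```python
def primal_penalty(x, edges, gamma_p):
    """Quadratic penalty on the edge constraints x_i - x_j = 0 (scalar case)."""
    return sum(0.5 * gamma_p * (x[i] - x[j]) ** 2 for i, j in edges)


def dual_penalty(lam, edges, gamma_d):
    """Quadratic penalty on the disagreement between lam[(i, j)] and lam[(j, i)]."""
    return sum(0.5 * gamma_d * (lam[(i, j)] - lam[(j, i)]) ** 2 for i, j in edges)


edges = [(0, 1), (1, 2)]                       # hypothetical 3-node path graph

x_feasible = [2.0, 2.0, 2.0]                   # all nodes agree: primal feasible
x_infeasible = [2.0, 1.0, 2.0]                 # node 1 disagrees with its neighbors

lam_feasible = {(0, 1): 0.3, (1, 0): 0.3, (1, 2): -1.0, (2, 1): -1.0}
lam_infeasible = {(0, 1): 0.3, (1, 0): 0.1, (1, 2): -1.0, (2, 1): -1.0}

hp_ok = primal_penalty(x_feasible, edges, gamma_p=1.0)
hp_bad = primal_penalty(x_infeasible, edges, gamma_p=1.0)
hd_ok = dual_penalty(lam_feasible, edges, gamma_d=1.0)
hd_bad = dual_penalty(lam_infeasible, edges, gamma_d=1.0)
```

Both penalties vanish at feasible points and are strictly positive otherwise, which is the property the text relies on to exclude infeasible saddle points.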
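Lemma 3.
Let $(\mathbf{x}^*,\boldsymbol{\lambda}^*)$ and $(\mathbf{x}',\boldsymbol{\lambda}')$ be two saddle points of $L_{\mathcal{P}}$. Then
[TABLE]
Further, $(\mathbf{x}',\boldsymbol{\lambda}^*)$ and $(\mathbf{x}^*,\boldsymbol{\lambda}')$ are two saddle points of $L_{\mathcal{P}}$ as well.
Proof.
Since $(\mathbf{x}^*,\boldsymbol{\lambda}^*)$ and $(\mathbf{x}',\boldsymbol{\lambda}')$ are two saddle points of $L_{\mathcal{P}}$, we have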
[TABLE]
Combining the above two inequality chains produces (29). In order to show that is a saddle point, we have . The proof for is similar. ∎
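We are ready to prove the saddle point theorem for $L_{\mathcal{P}}$.
Theorem 1.
If $\mathbf{x}^*$ solves the primal problem (7), there exists $\boldsymbol{\lambda}^*$ such that $(\mathbf{x}^*,\boldsymbol{\lambda}^*)$ is a saddle point of $L_{\mathcal{P}}$. Conversely, if $(\mathbf{x}',\boldsymbol{\lambda}')$ is a saddle point of $L_{\mathcal{P}}$, then $\mathbf{x}'$ and $\boldsymbol{\lambda}'$ solve the primal and the dual problem, respectively. Or equivalently, the following optimality conditions hold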
[TABLE]
Proof.
If solves the primal problem, then there exists a such that is a saddle point of by Lemma 2. Since and , we have , and , from which we conclude that is a saddle point of as well.
Conversely, let be a saddle point of . We first show that solves the primal problem. We have from Lemma 3 that , which can be simplified as
[TABLE]
from which we conclude that and thus so that . In addition, since is a saddle point of by Lemma 3, we have
[TABLE]
and we conclude that solves the primal problem as required. Similarly, one can show that solves the dual problem.
Based on the above analysis, we conclude that the optimality conditions for being a saddle point of are given by (30)-(32). The set of optimality conditions is redundant and can be derived from (30)-(32) (see (4)-(6) for the argument). ∎
Theorem 1 states that instead of solving the primal problem (7) or the dual problem (13), one can alternatively search for a saddle point of $L_{\mathcal{P}}$. To briefly summarize, we consider solving the following min-max problem in the rest of the paper
$$\min_{\mathbf{x}}\max_{\boldsymbol{\lambda}} L_{\mathcal{P}}(\mathbf{x},\boldsymbol{\lambda}). \tag{33}$$
We will explain in the next section how to iteratively approach the saddle point in a distributed manner.
IV Primal-Dual Method of Multipliers
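In this section, we present a new algorithm named primal-dual method of multipliers (PDMM) to iteratively approach a saddle point of $L_{\mathcal{P}}$. We propose both the synchronous and asynchronous PDMM for solving the problem.
IV-A Synchronous updating scheme
The synchronous updating scheme refers to the operation that at each iteration, all the variables over the graph update their estimates by using the most recent estimates from their neighbors from the last iteration. Suppose $(\mathbf{x}^k,\boldsymbol{\lambda}^k)$ is the estimate obtained from the $k$th iteration. We compute the new estimate $(\mathbf{x}^{k+1},\boldsymbol{\lambda}^{k+1})$ at iteration $k+1$ as
$$\mathbf{x}_i^{k+1} = \arg\min_{\mathbf{x}_i} L_{\mathcal{P}}\big([\ldots,\mathbf{x}_{i-1}^k,\mathbf{x}_i,\mathbf{x}_{i+1}^k,\ldots],\boldsymbol{\lambda}^k\big), \quad \boldsymbol{\lambda}_i^{k+1} = \arg\max_{\boldsymbol{\lambda}_i} L_{\mathcal{P}}\big(\mathbf{x}^k,[\ldots,\boldsymbol{\lambda}_{i-1}^k,\boldsymbol{\lambda}_i,\boldsymbol{\lambda}_{i+1}^k,\ldots]\big), \quad i\in V. \tag{34}$$
By inserting the expression (26) for $L_{\mathcal{P}}$ into (34), the updating expression can be further simplified as
$$\mathbf{x}_i^{k+1} = \arg\min_{\mathbf{x}_i}\Big[f_i(\mathbf{x}_i) - \mathbf{x}_i^T\sum_{j\in N_i}\mathbf{A}_{i|j}^T\boldsymbol{\lambda}_{j|i}^k + \sum_{j\in N_i}\tfrac{1}{2}\big\|\mathbf{A}_{i|j}\mathbf{x}_i + \mathbf{A}_{j|i}\mathbf{x}_j^k - \mathbf{c}_{ij}\big\|^2_{\mathbf{P}_{p,ij}}\Big], \tag{35}$$
$$\boldsymbol{\lambda}_i^{k+1} = \arg\min_{\boldsymbol{\lambda}_i}\Big[f_i^*(\mathbf{A}_i^T\boldsymbol{\lambda}_i) - \sum_{j\in N_i}\boldsymbol{\lambda}_{i|j}^T\big(\mathbf{c}_{ij} - \mathbf{A}_{j|i}\mathbf{x}_j^k\big) + \sum_{j\in N_i}\tfrac{1}{2}\big\|\boldsymbol{\lambda}_{i|j} - \boldsymbol{\lambda}_{j|i}^k\big\|^2_{\mathbf{P}_{d,ij}}\Big]. \tag{36}$$
Eq. (35)-(36) suggest that at iteration $k+1$, every node $i$ performs parameter-updating independently once the estimates $\{\mathbf{x}_j^k,\boldsymbol{\lambda}_{j|i}^k \mid j\in N_i\}$ of its neighboring variables are available. In addition, the computation of $\mathbf{x}_i^{k+1}$ and $\boldsymbol{\lambda}_i^{k+1}$ can be carried out in parallel since $\mathbf{x}_i$ and $\boldsymbol{\lambda}_i$ are not directly related in $L_{\mathcal{P}}$. We refer to (35)-(36) as node-oriented computation.
In order to run PDMM over the graph, each iteration should consist of two steps. Firstly, every node $i$ computes $(\mathbf{x}_i^{k+1},\boldsymbol{\lambda}_i^{k+1})$ by following (35)-(36), accounting for information-fusion. Secondly, every node $i$ sends $(\mathbf{x}_i^{k+1},\boldsymbol{\lambda}_{i|j}^{k+1})$ to its neighboring node $j$ for all neighbors, accounting for information-spread. We take $\mathbf{x}_i^{k+1}$ as the common message to all neighbors of node $i$ and $\boldsymbol{\lambda}_{i|j}^{k+1}$ as a node-specific message only to neighbor $j$. In some applications, it may be preferable to exploit broadcast transmission rather than point-to-point transmission in order to save energy. We will explain in Subsection IV-C that the transmission of $\boldsymbol{\lambda}_{i|j}^{k+1}$, $j\in N_i$, can be replaced by broadcast transmission of an intermediate quantity.
Finally, we consider terminating the iterates (35)-(36). One can check if the estimate $(\mathbf{x}^k,\boldsymbol{\lambda}^k)$ becomes stable over consecutive iterates (see Corollary 1 for theoretical support).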
IV-B Asynchronous updating scheme
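The asynchronous updating scheme refers to the operation that at each iteration, only the variables associated with one node in the graph update their estimates while all other variables keep their estimates fixed. Suppose node $i$ is selected at iteration $k+1$. We then compute $(\mathbf{x}_i^{k+1},\boldsymbol{\lambda}_i^{k+1})$ by optimizing $L_{\mathcal{P}}$ based on the most recent estimates $\{\mathbf{x}_j^k,\boldsymbol{\lambda}_{j|i}^k \mid j\in N_i\}$ from its neighboring nodes. At the same time, the estimates $(\mathbf{x}_j^k,\boldsymbol{\lambda}_j^k)$, $j\neq i$, remain the same. By following the above computational instruction, $(\mathbf{x}^{k+1},\boldsymbol{\lambda}^{k+1})$ can be obtained as
$$\mathbf{x}_i^{k+1} = \arg\min_{\mathbf{x}_i} L_{\mathcal{P}}\big([\ldots,\mathbf{x}_{i-1}^k,\mathbf{x}_i,\mathbf{x}_{i+1}^k,\ldots],\boldsymbol{\lambda}^k\big), \quad \boldsymbol{\lambda}_i^{k+1} = \arg\max_{\boldsymbol{\lambda}_i} L_{\mathcal{P}}\big(\mathbf{x}^k,[\ldots,\boldsymbol{\lambda}_{i-1}^k,\boldsymbol{\lambda}_i,\boldsymbol{\lambda}_{i+1}^k,\ldots]\big), \tag{37}$$
while $(\mathbf{x}_j^{k+1},\boldsymbol{\lambda}_j^{k+1}) = (\mathbf{x}_j^k,\boldsymbol{\lambda}_j^k)$ for all $j\neq i$.
Similarly to (35)-(36), $\mathbf{x}_i^{k+1}$ and $\boldsymbol{\lambda}_i^{k+1}$ can also be computed separately in (37). Once the update at node $i$ is complete, the node sends the common message $\mathbf{x}_i^{k+1}$ and node-specific messages $\{\boldsymbol{\lambda}_{i|j}^{k+1}, j\in N_i\}$ to its neighbors. We will explain in the next subsection how to exploit broadcast transmission to replace point-to-point transmission.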
In practice, the nodes in the graph can either be randomly activated or follow a predefined order for asynchronous parameter-updating. One scheme for realizing random node-activation is that after a node finishes parameter-updating, it randomly activates one of its neighbors for the next iteration. Another scheme is to introduce a clock at each node which ticks at the times of a (random) Poisson process (see [5] for detailed information). Each node is activated only when its clock ticks. As for node-activation in a predefined order, the cyclic updating scheme is probably the most straightforward. Once node $i$ finishes parameter-updating, it informs node $i+1$ for the next iteration. For the case that nodes $i$ and $i+1$ are not neighbors, the path from node $i$ to $i+1$ can be pre-stored at node $i$ to facilitate the process. In Subsection V-D, we provide convergence analysis only for the cyclic updating scheme. We leave the analysis for other asynchronous schemes for future investigation.
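The two activation mechanisms mentioned above, cyclic order and per-node Poisson clocks, can be sketched as schedule generators. This only illustrates which node is activated at each iteration; it does not perform any PDMM update. The function names, the 0-based indexing, the `rate`, `horizon` and `seed` parameters are assumptions made for the example.

```python
import numpy as np


def cyclic_schedule(num_nodes, num_iters):
    """Activate nodes in the fixed order 0, 1, ..., num_nodes - 1, 0, 1, ..."""
    return [k % num_nodes for k in range(num_iters)]


def poisson_clock_schedule(num_nodes, horizon, rate=1.0, seed=0):
    """Each node owns an independent Poisson clock; a node is activated at its ticks."""
    rng = np.random.default_rng(seed)
    events = []
    for i in range(num_nodes):
        # Inter-tick times of a Poisson process are exponential with mean 1 / rate.
        tick = rng.exponential(1.0 / rate)
        while tick < horizon:
            events.append((tick, i))
            tick += rng.exponential(1.0 / rate)
    events.sort()                       # global activation order follows tick times
    return [node for _, node in events]


cyc = cyclic_schedule(3, 7)
rnd = poisson_clock_schedule(num_nodes=5, horizon=10.0, rate=1.0, seed=0)
```

In the cyclic case every node is visited equally often by construction, whereas with Poisson clocks the order is random and only the expected activation frequencies are equal when all clocks share the same rate.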
Remark 1.
To briefly summarize, synchronous PDMM scheme allows faster information-spread over the graph through parallel parameter-updating while asynchronous PDMM scheme requires less effort from node-coordination in the graph. In practice, the scheme-selection should depend on the graph (or network) properties such as the feasibility of parallel computation, the complexity of node-coordination and the life time of nodes.
IV-C Simplifying node-based computations and transmissions
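It is clear that for both the synchronous and asynchronous schemes, each activated node $i$ has to perform two minimizations: one for $\mathbf{x}_i^{k+1}$ and the other one for $\boldsymbol{\lambda}_i^{k+1}$. In this subsection, we show that the computations for the two minimizations can be simplified. We will also study how the point-to-point transmission can be replaced with broadcast transmission. To do so, we will consider two scenarios:
IV-C1 Avoiding conjugate functions
In the first scenario, we consider using $f_i(\cdot)$ instead of $f_i^*(\cdot)$ to update $\boldsymbol{\lambda}_i$. Our goal is to simplify computations by avoiding the derivation of $f_i^*(\cdot)$.
By using the definition of $f_i^*$ in (4), the computation (36) for $\boldsymbol{\lambda}_i^{k+1}$ (which also holds for asynchronous PDMM) can be rewritten as
$$\boldsymbol{\lambda}_i^{k+1} = \arg\min_{\boldsymbol{\lambda}_i}\Big[\max_{\mathbf{w}_i}\big(\mathbf{w}_i^T\mathbf{A}_i^T\boldsymbol{\lambda}_i - f_i(\mathbf{w}_i)\big) - \sum_{j\in N_i}\boldsymbol{\lambda}_{i|j}^T\big(\mathbf{c}_{ij} - \mathbf{A}_{j|i}\mathbf{x}_j^k\big) + \sum_{j\in N_i}\tfrac{1}{2}\big\|\boldsymbol{\lambda}_{i|j} - \boldsymbol{\lambda}_{j|i}^k\big\|^2_{\mathbf{P}_{d,ij}}\Big]. \tag{39}$$
We denote the optimal solution for $\mathbf{w}_i$ in (39) as $\mathbf{w}_i^{k+1}$. The optimality conditions for $\boldsymbol{\lambda}_{i|j}^{k+1}$, $j\in N_i$, and $\mathbf{w}_i^{k+1}$ can then be derived from (39) as
$$\mathbf{A}_i^T\boldsymbol{\lambda}_i^{k+1} \in \partial f_i(\mathbf{w}_i^{k+1}), \tag{40}$$
$$\mathbf{A}_{i|j}\mathbf{w}_i^{k+1} - \big(\mathbf{c}_{ij} - \mathbf{A}_{j|i}\mathbf{x}_j^k\big) + \mathbf{P}_{d,ij}\big(\boldsymbol{\lambda}_{i|j}^{k+1} - \boldsymbol{\lambda}_{j|i}^k\big) = \mathbf{0} \quad \forall j\in N_i, \tag{41}$$
where (14) is used in deriving (41). Since $\mathbf{P}_{d,ij}$ is a nonsingular matrix, (41) defines a mapping from $\mathbf{w}_i^{k+1}$ to $\boldsymbol{\lambda}_{i|j}^{k+1}$:
$$\boldsymbol{\lambda}_{i|j}^{k+1} = \boldsymbol{\lambda}_{j|i}^k + \mathbf{P}_{d,ij}^{-1}\big(\mathbf{c}_{ij} - \mathbf{A}_{j|i}\mathbf{x}_j^k - \mathbf{A}_{i|j}\mathbf{w}_i^{k+1}\big) \quad \forall j\in N_i. \tag{42}$$
With this mapping, (40) can then be reformulated as
$$\mathbf{0} \in \partial f_i(\mathbf{w}_i^{k+1}) - \sum_{j\in N_i}\mathbf{A}_{i|j}^T\Big[\boldsymbol{\lambda}_{j|i}^k + \mathbf{P}_{d,ij}^{-1}\big(\mathbf{c}_{ij} - \mathbf{A}_{j|i}\mathbf{x}_j^k - \mathbf{A}_{i|j}\mathbf{w}_i^{k+1}\big)\Big]. \tag{43}$$
By inspection of (43), it can be shown that (43) is in fact an optimality condition for the following optimization problem
$$\mathbf{w}_i^{k+1} = \arg\min_{\mathbf{w}_i}\Big[f_i(\mathbf{w}_i) - \mathbf{w}_i^T\sum_{j\in N_i}\mathbf{A}_{i|j}^T\boldsymbol{\lambda}_{j|i}^k + \sum_{j\in N_i}\tfrac{1}{2}\big\|\mathbf{A}_{i|j}\mathbf{w}_i + \mathbf{A}_{j|i}\mathbf{x}_j^k - \mathbf{c}_{ij}\big\|^2_{\mathbf{P}_{d,ij}^{-1}}\Big]. \tag{44}$$
The above analysis suggests that $\boldsymbol{\lambda}_i^{k+1}$ can be alternatively computed through an intermediate quantity $\mathbf{w}_i^{k+1}$. We summarize the result in a proposition below.
Proposition 1.
Considering a node $i\in V$ at iteration $k+1$, the new estimate $\boldsymbol{\lambda}_{i|j}^{k+1}$ for each $j\in N_i$ can be obtained by following (42), where $\mathbf{w}_i^{k+1}$ is computed by (44).
Proposition 1 suggests that the estimate $\boldsymbol{\lambda}_i^{k+1}$ can be easily computed from $\mathbf{w}_i^{k+1}$. We argue in the following that the point-to-point transmission of $\{\boldsymbol{\lambda}_{i|j}^{k+1}, j\in N_i\}$ can be replaced with broadcast transmission of $\mathbf{w}_i^{k+1}$.
We see from (42) that the computation of the node-specific message $\boldsymbol{\lambda}_{i|j}^{k+1}$ (from node $i$ to node $j$) only consists of the quantities $\mathbf{w}_i^{k+1}$, $\boldsymbol{\lambda}_{j|i}^k$ and $\mathbf{x}_j^k$. Since $\boldsymbol{\lambda}_{j|i}^k$ and $\mathbf{x}_j^k$ are available at node $j$, the message $\boldsymbol{\lambda}_{i|j}^{k+1}$ can therefore be computed at node $j$ once the common message $\mathbf{w}_i^{k+1}$ is received. In other words, it is sufficient for node $i$ to broadcast both $\mathbf{x}_i^{k+1}$ and $\mathbf{w}_i^{k+1}$ to all its neighbors. Every node-specific message $\boldsymbol{\lambda}_{i|j}^{k+1}$, $j\in N_i$, can then be computed at node $j$ alone.
Finally, in order for the broadcast transmission to work, we assume there is no transmission failure between neighboring nodes. The assumption ensures that there is no estimate inconsistency between neighboring nodes, making the broadcast transmission reliable.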
IV-C2 Reducing two minimizations to one
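In the second scenario, we study under what conditions the two minimizations (35)-(36) (which also hold for asynchronous PDMM) reduce to one minimization.
Proposition 2.
Considering a node $i\in V$ at iteration $k+1$, if the matrix $\mathbf{P}_{d,ij}$ for every neighbor $j\in N_i$ is chosen to be $\mathbf{P}_{d,ij} = \mathbf{P}_{p,ij}^{-1}$, then there is $\mathbf{w}_i^{k+1} = \mathbf{x}_i^{k+1}$. As a result,
$$\boldsymbol{\lambda}_{i|j}^{k+1} = \boldsymbol{\lambda}_{j|i}^k + \mathbf{P}_{p,ij}\big(\mathbf{c}_{ij} - \mathbf{A}_{j|i}\mathbf{x}_j^k - \mathbf{A}_{i|j}\mathbf{x}_i^{k+1}\big) \quad \forall j\in N_i. \tag{45}$$
Proof.
The proof is trivial. By inspection of (35) and (44) under $\mathbf{P}_{d,ij} = \mathbf{P}_{p,ij}^{-1}$, $j\in N_i$, we obtain $\mathbf{w}_i^{k+1} = \mathbf{x}_i^{k+1}$. ∎
Similarly to the first scenario, broadcast transmission is also applicable for the second scenario. Since $\mathbf{w}_i^{k+1} = \mathbf{x}_i^{k+1}$, node $i$ only has to broadcast the estimate $\mathbf{x}_i^{k+1}$ to all its neighbors. Each message $\boldsymbol{\lambda}_{i|j}^{k+1}$ from node $i$ to node $j$ can then be computed at node $j$ directly by applying (45). See Table I for the procedure of synchronous PDMM.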
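The synchronous procedure with one minimization per node can be sketched for distributed averaging. The sketch assumes, for illustration only, scalar node functions $f_i(x_i) = \tfrac{1}{2}(x_i - t_i)^2$, edge constraints $x_i - x_j = 0$ encoded with $A_{i|j} = +1$ if $i<j$ and $-1$ otherwise, $c_{ij} = 0$, and a scalar primal weight `gamma` with the dual weight fixed to `1 / gamma`. Under these assumptions the node minimization has a closed form obtained from its first-order condition. The function name `pdmm_average`, the data `t` and the ring graph are hypothetical.

```python
import numpy as np


def pdmm_average(t, edges, gamma=1.0, iters=200):
    """Synchronous PDMM sketch for averaging the scalars t_i over a connected graph."""
    t = np.asarray(t, dtype=float)
    n = len(t)
    nbrs = {i: [] for i in range(n)}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)

    def a(i, j):
        # Sign convention for the edge constraint: A_{i|j} = +1 if i < j, else -1.
        return 1.0 if i < j else -1.0

    x = np.zeros(n)
    lam = {(i, j): 0.0 for i in range(n) for j in nbrs[i]}  # lam[(i, j)] = lambda_{i|j}

    for _ in range(iters):
        # Node update: closed-form minimizer of the quadratic node objective.
        x_new = np.empty(n)
        for i in range(n):
            s = sum(a(i, j) * lam[(j, i)] + gamma * x[j] for j in nbrs[i])
            x_new[i] = (t[i] + s) / (1.0 + gamma * len(nbrs[i]))
        # Dual update: uses the neighbor's old x_j and the node's new x_i, with c_ij = 0.
        lam_new = {
            (i, j): lam[(j, i)] - gamma * (a(j, i) * x[j] + a(i, j) * x_new[i])
            for (i, j) in lam
        }
        x, lam = x_new, lam_new
    return x


ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]           # hypothetical 5-node ring
x_hat = pdmm_average([1.0, 4.0, 2.0, 8.0, 5.0], ring, gamma=1.0, iters=2000)
```

Note that the dual update for `lam[(i, j)]` only needs `x_new[i]` plus quantities already held by node `j`, which is why broadcasting the node estimate is sufficient in this setting. The choice `gamma=1.0` is arbitrary and not tuned.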
V Convergence Analysis
In this section, we analyze the convergence rates of PDMM for both the synchronous and asynchronous schemes. Inspired by the convergence analysis of ADMM [27, 28], we construct a special inequality (presented in V-B) for $L_{\mathcal{P}}$ and then exploit it to analyze both synchronous PDMM (presented in V-C) and asynchronous PDMM (presented in V-D).
Before constructing the inequality, we first study how to properly choose the matrices in the set $\mathcal{P}$ (presented in V-A) in order to enable convergence analysis.
V-A Parameter setting
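In order to analyze the algorithm convergence later on, we first have to select the matrix set $\mathcal{P}$ properly. We impose a condition on each pair of matrices $(\mathbf{P}_{p,ij},\mathbf{P}_{d,ij})$, $(i,j)\in E$, in $\mathcal{P}$:
Condition 1.
In the function $L_{\mathcal{P}}$, each matrix $\mathbf{P}_{d,ij}$ can be represented in terms of $\mathbf{P}_{p,ij}$ as
$$\mathbf{P}_{d,ij} = \mathbf{P}_{p,ij}^{-1} + \Delta\mathbf{P}_{d,ij} \quad \forall (i,j)\in E, \tag{46}$$
where $\Delta\mathbf{P}_{d,ij}\succeq 0$.
Eq. (46) implies that $\mathbf{P}_{p,ij}$ and $\mathbf{P}_{d,ij}$ cannot be chosen arbitrarily for our convergence analysis. If $\mathbf{P}_{p,ij}$ is small, then $\mathbf{P}_{d,ij}$ has to be chosen big enough to make (46) hold, and vice versa. One special setup for $\mathbf{P}_{d,ij}$ is to let $\Delta\mathbf{P}_{d,ij} = \mathbf{0}$, or equivalently, $\mathbf{P}_{d,ij} = \mathbf{P}_{p,ij}^{-1}$. This leads to the application of Proposition 2, which reduces two minimizations to one minimization for each activated node.
One simple setup in Condition 1 is to let all the matrices in $\mathcal{P}$ take scalar form. That is, setting $(\mathbf{P}_{p,ij},\mathbf{P}_{d,ij})$, $(i,j)\in E$, to be identity matrices multiplied by positive parameters:
$$\mathbf{P}_{p,ij} = \gamma_{p,ij}\mathbf{I}, \qquad \mathbf{P}_{d,ij} = \gamma_{d,ij}\mathbf{I} \quad \forall (i,j)\in E, \tag{47}$$
where $\gamma_{p,ij}>0$, $\gamma_{d,ij}>0$ and $\gamma_{p,ij}\gamma_{d,ij}\ge 1$. It is worth noting that matrix form of $\mathcal{P}$ might lead to faster convergence for some optimization problems.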
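A quick numerical check of Condition 1 can be sketched as follows, reading the condition as "the dual weight minus the inverse of the primal weight is positive semi-definite". The function name `satisfies_condition_1`, the tolerance and the example weights are assumptions made for illustration.

```python
import numpy as np


def satisfies_condition_1(P_p, P_d, tol=1e-10):
    """Return True if P_d - inv(P_p) is positive semi-definite (up to tol)."""
    delta = P_d - np.linalg.inv(P_p)
    delta = 0.5 * (delta + delta.T)          # symmetrize against round-off
    return bool(np.min(np.linalg.eigvalsh(delta)) >= -tol)


I2 = np.eye(2)
gamma_p = 2.0
ok_equal = satisfies_condition_1(gamma_p * I2, 0.5 * I2)    # gamma_p * gamma_d = 1
ok_larger = satisfies_condition_1(gamma_p * I2, 0.6 * I2)   # gamma_p * gamma_d > 1
too_small = satisfies_condition_1(gamma_p * I2, 0.4 * I2)   # gamma_p * gamma_d < 1
```

In the scalar setup the check reduces to requiring that the product of the two parameters is at least one, so a small primal weight forces a large dual weight and vice versa.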
V-B Constructing an inequality
Before introducing the inequality, we first define a new function which involves and their conjugates:
[TABLE]
By studying (7) and (13) at a saddle point of , one can show that .
With , the inequality for can be described as:
Lemma 4.
Let be a saddle point of . Then for any , there is
[TABLE]
where equality holds if and only if satisfies
[TABLE]
Proof.
Given a saddle point of , the right hand side of the inequality (49) can be reformulated as
[TABLE]
where the last equality is obtained by using . Using Fenchel’s inequalities (12), we conclude that for any , the following two inequalities hold
[TABLE]
Finally, combining (52)-(54) and the fact that produces the inequality (49). The equality holds if and only if (53)-(54) hold, of which the optimality conditions are given by (50)-(51) (see (4)-(6) for the argument). ∎
Lemma 4 shows that the quantity on the right hand side of (49) is always lower-bounded by zero. In the next two subsections, we will construct proper upper bounds for the quantity by replacing $(\mathbf{x},\boldsymbol{\lambda})$ with real estimates of PDMM. The algorithmic convergence will be established by showing that the upper bounds approach zero as the iteration index increases.
The conditions (50)-(51) in Lemma 4 are not sufficient for showing that $(\mathbf{x}',\boldsymbol{\lambda}')$ is a saddle point of $L_{\mathcal{P}}$. The primal and dual feasibilities $\mathbf{x}'\in X$ and $\boldsymbol{\lambda}'\in\Lambda$ are also required to complete the argument, as shown in Lemmas 5, 6 and 7 below. Lemmas 5 and 6 are preliminary to showing that $(\mathbf{x}',\boldsymbol{\lambda}')$ is a saddle point of $L_{\mathcal{P}}$ as presented in Lemma 7. These three lemmas will be used in the next two subsections for convergence analysis.
Lemma 5.
Let be a saddle point of . Given which satisfies (51) and , then is a saddle point of .
Proof.
By using (51) and the fact that and , it is immediate from (30)-(32) that is a saddle point of . ∎
Lemma 6.
Let be a saddle point of . Given which satisfies (50) and , then is a saddle point of .
Proof.
The proof is similar to that for Lemma 5. ∎
Lemma 7.
Let be a saddle point of . Given which satisfy (50)-(51) and , then is a saddle point of .
Proof.
It is known from Lemma 5 and 6 that in addition to , and are also the saddle points of . By using a similar argument as the one for Lemma 3, one can show that is a saddle point of . ∎
V-C Synchronous PDMM
In this subsection, we show that the synchronous PDMM converges with the sub-linear rate $O(1/K)$. In order to obtain the result, we need the following two lemmas.
Lemma 8.
Let be a saddle point of . The estimate is obtained by performing (35)-(36) under Condition 1. Then there is
[TABLE]
where is given by
[TABLE]
where and .
Proof.
See the proof in Appendix A. ∎
Lemma 9.
Every pair of estimates , , , , in Lemma 8 is upper bounded by a constant under a squared error criterion:
[TABLE]
Proof.
One can first prove (57) for by performing algebra on (55)-(56). The inequality (57) for can then be proved recursively. ∎
Upon obtaining the results in Lemmas 8 and 9, we are ready to present the convergence rate of synchronous PDMM.
Theorem 2.
Let , , be obtained by performing (35)-(36) under Condition 1. The average estimate satisfies
[TABLE]
[TABLE]
Proof.
Summing (55) over and simplifying the expression yields
[TABLE]
Finally, since the left hand side of (60) is a convex function of , applying Jensen’s inequality to (60) and using the inequality of Lemma 4 yields (58). Similarly, applying Jensen’s inequality to (60) and using the upper-bound result of Lemma 9 yields the asymptotic result (59). ∎
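The rate in Theorem 2 is stated for the average of the iterates rather than for the last iterate. As a minimal sketch, assuming the average estimate is the plain arithmetic mean of the first K iterates (the exact definition is not reproduced in this section), the average can be maintained incrementally without storing the history. The function name `running_average` is hypothetical.

```python
def running_average(iterates):
    """Return the running arithmetic means of a sequence of scalar iterates.

    Assumption: the averaged estimate after K iterations is
    (x^1 + ... + x^K) / K.
    """
    averages = []
    avg = 0.0
    for k, x in enumerate(iterates, start=1):
        # Incremental form of the mean: avg_K = avg_{K-1} + (x^K - avg_{K-1}) / K
        avg += (x - avg) / k
        averages.append(avg)
    return averages
```

The same update applies component-wise to vector-valued estimates.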
Finally, we use the results of Theorem 2 to show that as goes to infinity, the average estimate converges to a saddle point of .
Theorem 3.
The average estimate of Theorem 2 converges to a saddle point of as increases.
Proof.
The basic idea of the proof is to investigate if satisfies all the conditions of Lemma 7. By investigation of Lemma 4 and (58), it is clear that the average estimate asymptotically satisfies the conditions (50)-(51) by letting .
Next we show that as increases, asymptotically converges to an element of the primal feasible set and so does to an element of the dual feasible set . To do so, we reconsider (59) for each pair of directed edges and , which can be expressed as
[TABLE]
Combining the above two expressions produces
[TABLE]
It is straightforward from Lemma 7 that converges to a saddle point of as increases. ∎
Further we have the following result from Theorem 3:
Corollary 1.
If for certain , the estimate in Theorem 2 converges to a fixed point (), we have which is the th component of the optimal solution in Theorem 3. Similarly, if the estimate converges to a point , we have .
V-D Asynchronous PDMM
In this subsection, we characterize the convergence rate of asynchronous PDMM. In order to facilitate the analysis, we consider a predefined node-activation strategy (no randomness is involved). We suppose at each iteration , the node is activated for parameter-updating, where and stands for the modulus operation. Then naturally, after a segment of consecutive iterations, all the nodes will be activated sequentially, one node at each iteration.
To be able to derive the convergence rate, we consider segments of iterations, i.e., , where . Each segment consists of iterations. With the mapping , it is immediate that activates node 1 and activates node . Based on the above analysis, we have the following result.
Lemma 10.
Let be two iteration indices within a segment . If , then , where the node-index , .
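The cyclic activation rule and the segment structure can be sketched as follows. This is an illustration under stated assumptions: the mapping is taken to be i = mod(k, m) + 1 with iterations counted from k = 0 and m nodes, which matches the statement that the first iteration of a segment activates node 1 and the last one activates node m. The names `activated_node` and `segment_iterations` are hypothetical.

```python
def activated_node(k, m):
    """Node (1-based) activated at iteration k >= 0 in a graph of m nodes.

    Assumption: the mapping is i = mod(k, m) + 1.
    """
    return k % m + 1


def segment_iterations(l, m):
    """Iteration indices lm, lm + 1, ..., lm + m - 1 of segment l >= 0."""
    return range(l * m, (l + 1) * m)


# Within one segment the activated node index grows with the iteration
# index, which is the monotonicity property stated in Lemma 10.
nodes_in_segment = [activated_node(k, 4) for k in segment_iterations(2, 4)]
```

For m = 4, segment 2 covers iterations 8 to 11 and activates nodes 1, 2, 3, 4 in that order.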
Upon introducing Lemma 10, we are ready to perform convergence analysis.
Lemma 11.
Let be a saddle point of . A segment of estimates , is obtained by performing (37)-(38) under Condition 1. Then there is
[TABLE]
where is given by
[TABLE]
Proof.
See the proof in Appendix B. Lemma 10 will be used in the proof to simplify mathematical derivations. ∎
Remark 2.
We note that Lemma 11 corresponds to Lemma 8 which is for synchronous PDMM. The right hand side of (61) consists of quantities (one for each edge ) as opposed to that of (55) which consists of quantities (one for each directed edge ).
Lemma 12.
Every pair of estimates , , , , in Lemma 11 is upper bounded by a constant under a squared error criterion:
[TABLE]
Theorem 4.
Let the segments of estimates be obtained by performing (37)-(38) under Condition 1. The average estimate satisfies
[TABLE]
Proof.
The proof is similar to that for Theorem 2. ∎
Similarly to synchronous PDMM, by using the results of Theorem 4, we can conclude that:
Theorem 5.
The average estimate of Theorem 4 converges to a saddle point of as increases.
Corollary 2.
If for certain , the estimate in Theorem 4 converges to a fixed point (), we have which is the th component of the optimal solution in Theorem 5. Similarly, if the estimate converges to a point , we have .
VI Application to Distributed Averaging
In this section, we consider solving the problem of distributed averaging by using PDMM. Distributed averaging is one of the basic and important operations for advanced distributed signal processing [5, 15].
VI-A Problem formulation
Suppose every node in a graph carries a scalar parameter, denoted as . may represent a measurement of the environment, such as temperature, humidity or darkness. The problem is to compute the average value iteratively only through message-passing between neighboring nodes in the graph.
The above averaging problem can be formulated as a quadratic optimization over the graph as
[TABLE]
The optimal solution equals , which is the same as the average value.
The quadratic problem (66) is in line with (7) by letting
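To see why the optimum is the average, assume (66) has the usual least-squares consensus form, with $t_i$ denoting the value held by node $i$, $m$ the number of nodes and $E$ the edge set (these symbols are assumptions, since the equation itself is not reproduced here). On a connected graph the edge constraints force all $x_i$ to share one value $x$, which reduces the problem to a scalar one whose stationarity condition gives the mean:

```latex
\min_{x_1,\dots,x_m} \sum_{i=1}^{m} \tfrac{1}{2}(x_i - t_i)^2
\quad \text{s.t.} \quad x_i - x_j = 0 \;\; \forall (i,j) \in E
\;\;\Longrightarrow\;\;
\min_{x} \sum_{i=1}^{m} \tfrac{1}{2}(x - t_i)^2 ,
\qquad
\sum_{i=1}^{m} (x^\ast - t_i) = 0
\;\;\Longrightarrow\;\;
x^\ast = \frac{1}{m}\sum_{i=1}^{m} t_i .
```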
[TABLE]
In the next subsection, we apply PDMM for distributed averaging.
VI-B Parameter computations and transmissions
Before deriving the updating expressions for PDMM, we first configure the set in . For distributed averaging, all the matrices in become scalars. For simplicity, we set the value of the primal scalars and the dual scalars as
[TABLE]
where the two parameters and .
We start with the synchronous PDMM. By inserting (67)-(69) into (35), (42) and (44), the updating expression for at iteration can be derived as
[TABLE]
where
[TABLE]
For the case that , it is immediate from (70) and (72) that , which coincides with Proposition 2.
The asynchronous PDMM only activates one node per iteration. Suppose node is activated at iteration . Node then updates and , , by following (70)-(71) while all other nodes remain silent. After computation, node then sends to its neighboring node for all neighbors.
As described in Subsection IV-C, if no transmission fails in the graph, the transmission of , , can be replaced by broadcast transmission of as given by (72). Once is received by a neighboring node , can be easily computed by node alone using , and (see Eq. (71)). If instead the transmission is not reliable, we have to return to point-to-point transmission.
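For illustration, the following sketch runs a synchronous PDMM-style iteration for distributed averaging on a small grid. It is not a transcription of (70)-(72): it uses the single-parameter, node-based form of PDMM common in later literature, with one penalty parameter `c` instead of the two parameters introduced above, and with zero-initialized dual variables. All names (`grid_neighbors`, `sign`, `pdmm_average`) and the value `c = 0.4` are assumptions made for this example.

```python
def grid_neighbors(rows, cols):
    """Neighbor lists of a rows-by-cols grid; nodes are numbered row by row."""
    nbrs = []
    for r in range(rows):
        for col in range(cols):
            cand = [(r - 1, col), (r + 1, col), (r, col - 1), (r, col + 1)]
            nbrs.append([rr * cols + cc for rr, cc in cand
                         if 0 <= rr < rows and 0 <= cc < cols])
    return nbrs


def sign(i, j):
    """Edge coefficient of node i on edge (i, j): +1 if i < j, else -1."""
    return 1.0 if i < j else -1.0


def pdmm_average(t, nbrs, c=0.4, iterations=500):
    """Synchronous node-based PDMM sketch for min sum_i 0.5 * (x_i - t_i)^2
    subject to x_i = x_j on every edge (assumed single-parameter form)."""
    n = len(t)
    # z[(i, j)] is the dual variable held by node i for the edge to neighbor j.
    z = {(i, j): 0.0 for i in range(n) for j in nbrs[i]}
    x = [0.0] * n
    for _ in range(iterations):
        # Primal update: closed-form minimizer of the local quadratic.
        for i in range(n):
            s = sum(sign(i, j) * z[(i, j)] for j in nbrs[i])
            x[i] = (t[i] - s) / (1.0 + c * len(nbrs[i]))
        # Dual update and exchange: node i computes the value that its
        # neighbor j will hold for this edge in the next iteration.
        z = {(j, i): z[(i, j)] + 2.0 * c * sign(i, j) * x[i] for (i, j) in z}
    return x


t = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]
x = pdmm_average(t, grid_neighbors(2, 3))
```

In this form every node only needs its own measurement and the dual values received from its neighbors, which mirrors the node-oriented message passing described above.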
VI-C Experimental results
We conducted three experiments for PDMM applied to distributed averaging. In the first experiment, we evaluated how different parameter-settings w.r.t. affect the convergence rates of PDMM. In the second experiment, we tested PDMM over non-perfect channels, a setting for which no convergence guarantee is available at the moment. Finally, we evaluated the convergence rates of PDMM, ADMM and two gossip algorithms.
The tested graph in the three experiments was a two-dimensional grid (corresponding to ), implying that each node may have two, three or four neighbors. The mean squared error (MSE) was employed as performance measurement.
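A grid topology and the error measure can be set up as in the sketch below. The grid size used in the experiments is not reproduced here, so the dimensions are left as parameters. The MSE definition, the mean over nodes of the squared deviation from the true average, is an assumption, as are the names `grid_neighbors` and `mse`.

```python
def grid_neighbors(rows, cols):
    """Neighbor lists of a rows-by-cols grid; nodes are numbered row by row."""
    nbrs = []
    for r in range(rows):
        for col in range(cols):
            cand = [(r - 1, col), (r + 1, col), (r, col - 1), (r, col + 1)]
            nbrs.append([rr * cols + cc for rr, cc in cand
                         if 0 <= rr < rows and 0 <= cc < cols])
    return nbrs


def mse(x, t):
    """Mean squared deviation of the node estimates x from the average of t."""
    target = sum(t) / len(t)
    return sum((xi - target) ** 2 for xi in x) / len(x)


# Corner, border and interior nodes have two, three and four neighbors.
degrees = {len(n) for n in grid_neighbors(3, 3)}
```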
VI-C1 Performance for different parameter settings
In this experiment, we evaluated the performance of PDMM by testing different parameter-settings for . Both synchronous and asynchronous updating schemes were investigated.
At each iteration, the synchronous PDMM activated all the nodes for parameter-updating. As for the asynchronous PDMM, the nodes were activated sequentially by following the mapping , where the iteration (see Subsection V-D). As a result, after every segment of iterations, all the nodes were activated once. In the experiment, we counted the number of iterations for the synchronous PDMM and the number of segments (each segment consists of iterations) for the asynchronous PDMM.
For each parameter-setting, we initialized for every . The algorithm stops when the squared error is below .
Fig. 2 displays the numbers of iterations (or segments) of PDMM under different parameter-settings. Each or symbol represents a particular setting for . The settings denoted by are for the case that while the ones by are for the case that .
It is seen from the figure that large or can only make the algorithm converge slowly. The optimal parameter-setting that leads to the fastest convergence lies on the curve for both the synchronous and the asynchronous updating schemes. Further, it appears that the optimal settings for the two updating schemes lie close to each other.
Finally, we note that the settings denoted by correspond to the situation that . The experiment for those settings demonstrates that Condition 1 is only sufficient for algorithmic convergence. We also tested the setting . We found that the above setting led to divergence for both the synchronous and asynchronous schemes. This phenomenon suggests that and cannot be chosen arbitrarily in practice.
VI-C2 Performance with transmission failure
In this experiment, we studied how transmission failure affects the performance of PDMM, given that no convergence guarantee has been derived at the moment. As discussed in Subsection IV-C, we could not use broadcast transmission in the case of transmission loss. Instead, each activated node has to perform point-to-point transmission for from node to node .
Due to transmission failure, PDMM was initialized differently from the first experiment. Each time the algorithm was tested, the initial estimate was set as
[TABLE]
The above initialization guarantees that every node in the graph has access to the initial estimates of neighboring nodes without transmission.
Fig. 3 demonstrates the performance of PDMM under three transmission losses: 0%, 20% and 40%. Subplots (a) and (b) are for the asynchronous and synchronous schemes, respectively. Each curve in the two subplots was obtained by averaging over 100 simulations to mitigate the effect of random transmission losses. It is seen that transmission failure only slows down the convergence speed of the algorithm. The above property is highly desirable in real applications because transmission losses might be inevitable in some networks (e.g., see [29] for investigation of packet-loss over wireless sensor networks in different environments).
Finally, it is observed that for each transmission-loss in subplot (a), the error goes up in the first few hundred iterations before decreasing. This may be because of the special initialization (73). We have tested the initialization for 0% transmission loss, where the MSE decreases monotonically with the iterations.
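A lossy point-to-point link and the averaging of error curves over repeated runs can be modelled as in the sketch below. How a lost message is handled is specified in Subsection IV-C, which lies outside this section; the rule used here, in which the receiver keeps the last value it successfully received, is an assumption, as are the names `lossy_send` and `average_curves`.

```python
import random


def lossy_send(new_value, last_received, loss_prob, rng):
    """Return the value the receiver holds after one transmission attempt.

    Assumption: on a loss, the receiver keeps its previously received value.
    """
    if rng.random() < loss_prob:
        return last_received  # packet lost: stale copy is kept
    return new_value          # packet delivered


def average_curves(curves):
    """Pointwise mean of equally long error curves, one per simulation run."""
    runs = len(curves)
    return [sum(vals) / runs for vals in zip(*curves)]


rng = random.Random(0)
delivered = lossy_send(1.0, 0.0, 0.0, rng)  # loss probability 0: always delivered
dropped = lossy_send(1.0, 0.0, 1.0, rng)    # loss probability 1: always lost
```

Applying `lossy_send` independently to every directed message with `loss_prob` set to 0.2 or 0.4 would correspond to the 20% and 40% loss settings.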
VI-C3 Performance comparison
In this experiment, we investigated the convergence speeds of four algorithms under the condition of no transmission failure. Besides PDMM, we also implemented the broadcast-based algorithm in [14] (referred to as broadcast), the randomized gossip algorithm in [5] (referred to as gossip) and ADMM. Unlike PDMM and ADMM that can work either synchronously or asynchronously, both broadcast and gossip algorithms can only work asynchronously. While broadcast algorithm randomly activates one node per iteration, gossip algorithm randomly activates one edge per iteration for parameter-updating.
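As a point of reference for the comparison, the pairwise-averaging form of randomized gossip can be sketched as follows: at each iteration one edge is drawn uniformly at random and its two end nodes replace their values by the pairwise mean. This is the standard textbook form of the scheme and is not claimed to match every detail of the implementation in [5]; the function name, the path graph and the seed are illustrative assumptions.

```python
import random


def randomized_gossip(t, edges, iterations, seed=0):
    """Pairwise-averaging gossip: one randomly chosen edge per iteration."""
    rng = random.Random(seed)
    x = list(t)
    for _ in range(iterations):
        i, j = rng.choice(edges)
        # Both end nodes adopt the pairwise mean; the network sum is preserved.
        x[i] = x[j] = 0.5 * (x[i] + x[j])
    return x


t = [1.0, 2.0, 3.0, 6.0]
edges = [(0, 1), (1, 2), (2, 3)]  # a path graph on four nodes
x = randomized_gossip(t, edges, iterations=2000)
```

Because each update preserves the sum of the node values, the only consensus point the iteration can reach is the true average.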
Similarly to the first experiment, we also evaluated PDMM for both the synchronous and asynchronous schemes. For the asynchronous scheme, we tested all the four algorithms introduced above while for the synchronous scheme, we focused on PDMM and ADMM. The implementation of the synchronous/asynchronous ADMM follows from [10] and [17], respectively. The asynchronous ADMM [17] is similar to the gossip algorithm in the sense that both algorithms activate one edge per iteration.
We note that the asynchronous ADMM essentially activates two neighboring nodes per iteration. To make a fair comparison between PDMM and ADMM, we implemented two versions of PDMM for the asynchronous scheme. The first version follows Subsection IV-B, where each iteration randomly activates one node, as in the broadcast algorithm; it is referred to as one-node PDMM. The second version randomly activates two neighboring nodes per iteration, as in the gossip algorithm; it is referred to as two-node PDMM.
Both PDMM and ADMM have some parameters to be specified. To simplify the implementation, we let in PDMM (which is not the optimal setting from Fig. 2). Similarly, we set the parameter in ADMM to be 1.
In the experiment, the gossip and broadcast algorithms were initialized according to [5] and [14], respectively. The initialization for PDMM was the same as in the first experiment. The estimates of ADMM were initialized similarly as for PDMM.
Fig. 4 displays the MSE trajectories for the four methods while Table II lists the average execution times (per iteration) and their standard deviations. Similarly to the second experiment, the performance of each method for the asynchronous scheme was obtained by averaging over 100 simulations to mitigate the effect of randomness introduced in node or edge-activation. We now focus on the asynchronous scheme. It is seen that the two-node PDMM converges the fastest in terms of number of iterations while the gossip algorithm requires the least execution time on average. The above results suggest that for applications where signal transmission is more expensive than local computation (w.r.t. energy consumption), PDMM might be a good candidate as it may reduce the number of iterations.
Fig. 4 (b) demonstrates the MSE performance of PDMM and ADMM for the synchronous scheme. Both algorithms appear to have linear convergence rates. This may be because the objective functions in (66) are strongly convex and have gradients which are Lipschitz continuous. It is seen from Table II that both methods take roughly the same execution time. By combining the above results, we conclude that under the synchronous scheme, PDMM converges faster than ADMM w.r.t. the execution time, which may be due to the fact that PDMM avoids the auxiliary variable used in ADMM.
VII Conclusion
In this paper, we have proposed PDMM for iterative optimization over a general graph. An augmented primal-dual Lagrangian function is constructed whose saddle point provides an optimal solution of the original problem, which leads to the design of PDMM. PDMM performs broadcast transmission under perfect channels and point-to-point transmission under non-perfect channels. We have shown that both the synchronous and asynchronous PDMMs possess a convergence rate of O(1/K) for general closed, proper and convex functions defined over the graph. As an example, we have applied PDMM to distributed averaging, through which properties of PDMM such as proper parameter-selection and resilience against transmission failure are further investigated.
We note that PDMM is natural when performing node-oriented optimization over a graph as compared to ADMM which involves computing the edge variable introduced in (3). A few applications in [21], [22] and [23] suggest that PDMM is practically promising. While convergence properties of ADMM under different conditions (e.g., strong convexity and/or the gradients being Lipschitz continuous) are well understood, the convergence properties of PDMM for those conditions remain to be discovered.
Appendix A Proof for Lemma 8
Before presenting the proof, we first introduce a basic inequality, which is described in a lemma below:
Lemma 13.
Let and be two arbitrary closed, proper and convex functions. minimizes the sum of the two functions, i.e., . Then, there is
[TABLE]
where .
The above inequality is widely exploited for the convergence analysis of ADMM and its variants [27, 28, 10]. We will also use the inequality in our proof.
Applying (74) to the updating equations (35)-(36) for , we obtain a set of inequalities for all as
[TABLE]
Adding (75)-(76) over all , and substituting , the saddle point of , yields
[TABLE]
where the last equality follows from the two optimality conditions (31)-(32).
To further simplify (77), one can first insert the alternative expression (46) for every into (77). After that, the expression (55) can be obtained by simplifying the new expression using (31)-(32) and the following identity
[TABLE]
Appendix B Proof of Lemma 11
The basic idea for the proof is similar to that for Lemma 8 as presented in Appendix A. However, since asynchronous PDMM activates one node per iteration, it is difficult to tell which neighbors of have been recently activated and which have not yet. The above difficulty requires careful treatment in the convergence analysis. We sketch the proof in the following for reference.
We focus on the parameter-updating for a particular segment of iterations , where . For simplicity, we denote the activated node at iteration as . To start with, we apply (74) to the updating equation (37) for the estimate of node . In order to do so, we first have to consider the estimates of its neighbors. It may happen that some neighbors have already been activated within the segment while others are still waiting to be activated. If a neighbor is still waiting, we then have . Conversely, if a neighbor has already been activated, we then have . From Lemma 10, it is clear that if (or ), then the neighbor has been activated (not yet activated). For simplicity, we use a function to denote the value or for a neighbor at iteration
[TABLE]
As for the activated node , we have . As a result, the two inequalities for and are given by
[TABLE]
where .
Next adding (81)-(82) over all and substituting yields
[TABLE]
where the function is defined as
[TABLE]
where and .
Now we are in a position to analyze the right hand side of (83). By using the fact that each node has different functions , we can conclude that each edge is associated with two functions and , where iteration and activate and , respectively. From (83), it is clear that each edge is also associated with the other two functions and . We show in the following that the combination of the above four functions for every edge is independent of and . In order to do so, we assume (or equivalently, from Lemma 10). From (80), we know that and . Based on the above information, the four functions for can be simplified as
[TABLE]
where is given by (62), of which the derivation is similar to that for in (56). The term in (84) is simplified as since we already assume that at iteration , node is activated. The quantity is a function of and instead of . Finally, combining (83) and (85) produces (61).
References
- [1] G. Zhang, R. Heusdens, and W. B. Kleijn, “On the Convergence Rate of the Bi-Alternating Direction Method of Multipliers,” in Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2014, pp. 3897–3901.
- [2] G. Zhang and R. Heusdens, “Bi-Alternating Direction Method of Multipliers over Graphs,” in Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), April 2015.
- [3] T. Richardson and R. Urbanke, Modern Coding Theory. Cambridge University Press, 2008.
- [4] G. Zhang, R. Heusdens, and W. B. Kleijn, “Large Scale LP Decoding with Low Complexity,” IEEE Communications Letters, vol. 17, no. 11, pp. 2152–2155, 2013.
- [5] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah, “Randomized Gossip Algorithms,” IEEE Trans. Information Theory, vol. 52, no. 6, pp. 2508–2530, 2006.
- [6] D. Sontag, A. Globerson, and T. Jaakkola, “Introduction to Dual Decomposition for Inference,” in Optimization for Machine Learning. MIT Press, 2011.
- [7] Y. Zeng and R. Heusdens, “Linear Coordinate-Descent Message-Passing for Quadratic Optimization,” Neural Computation, vol. 24, no. 12, pp. 3340–3370, 2012.
- [8] C. C. Moallemi and B. V. Roy, “Convergence of Min-Sum Message Passing for Quadratic Optimization,” IEEE Trans. Inf. Theory, vol. 55, no. 5, pp. 2413–2423, 2009.
