Stochastic Bregman Parallel Direction Method of Multipliers for Distributed Optimization
Yue Yu, Behçet Açıkmeşe

TL;DR
This paper introduces a stochastic variant of the Bregman parallel direction method of multipliers for distributed optimization, reducing computational load and enabling larger network applications while maintaining convergence guarantees.
Contribution
It generalizes BPDMM to a stochastic setting with convergence proofs, facilitating scalable distributed optimization in multi-agent systems.
Findings
Achieves global convergence of stochastic BPDMM.
Establishes an O(1/T) iteration complexity.
Demonstrates effectiveness through numerical examples.
Abstract
Bregman parallel direction method of multipliers (BPDMM) efficiently solves distributed optimization over a network, which arises in a wide spectrum of collaborative multi-agent learning applications. In this paper, we generalize BPDMM to stochastic BPDMM, where each iteration only solves local optimization on a randomly selected subset of nodes rather than all the nodes in the network. Such generalization reduces the need for computational resources and allows applications to larger-scale networks. We establish both the global convergence and the \(O(1/T)\) iteration complexity of stochastic BPDMM. We demonstrate our results via numerical examples.
The authors are with the Department of Aeronautics and Astronautics, University of Washington, Seattle, WA, 98195; emails: {yueyu,behcet}@uw.edu
I Introduction
Distributed optimization over a connected undirected network is defined as follows
[TABLE]
where is a closed convex set, is the Cartesian product of copies of , each is a convex function accessible by node only. The global optimality is achieved by local optimization on each node and efficient communication between neighboring nodes. In addition to classical applications such as formation control [1], distributed tracking [2] and estimation [3, 4], problem (1) also arises in collaborative learning scenarios [5, 6], where problem (1) represents distributed learning from data collected by multiple agents.
There has been an increasing interest in applying multiplier methods to solve problem (1) [7, 8, 9]. At each iteration of such methods, every primal variable is updated by optimizing a quadratic augmented Lagrangian; every dual variable is updated by numerically integrating local disagreement. Recently, the Bregman parallel direction method of multipliers (BPDMM) generalized the quadratic augmentation in local optimization to Bregman augmentation, which better exploits the structure of constraint set , and hence leads to significant improvement in convergence speed [10, 11].
One challenge in implementing multiplier methods for problem (1) is that a local optimization problem needs to be solved on every node in parallel at each iteration, which demands substantial computational resources when applied to large-scale networks. A popular approach to address this challenge is stochastic multiplier methods [12, 13, 14], which combine multiplier methods with the idea of stochastic block coordinate descent [15, 16]. At each iteration, stochastic multiplier methods solve local optimization problems only on a randomly selected subset of nodes rather than on all the nodes. Such algorithms guarantee global convergence to the optimum in expectation via proper choice of algorithm parameters. However, to the best of our knowledge, all existing stochastic multiplier methods use quadratic augmentation. In other words, there is no stochastic extension of multiplier methods based on Bregman augmentation.
In this paper, we close this gap in the literature by proposing stochastic BPDMM, which combines the benefits of BPDMM and stochastic multiplier methods. Compared with BPDMM [11], it only requires solving local optimization on a randomly selected subset of nodes, which allows application to larger-scale networks; compared with existing stochastic multiplier methods [12, 13, 14], it extends the quadratic augmented Lagrangian to a Bregman augmented Lagrangian, which improves the convergence speed by better exploiting the constraint structure. We establish the global convergence and iteration complexity of stochastic BPDMM, and demonstrate its effectiveness and efficiency via numerical examples.
The rest of the paper is organized as follows. Section II covers necessary background and reformulates problem (1) with consensus constraints. Section III develops the stochastic BPDMM, whose convergence proof is established in Section IV. Section V presents numerical examples and demonstrates the advantages of stochastic BPDMM over prior work. Section VI concludes and comments on future directions.
II Preliminaries and Background
II-A Notation
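Let \(\mathbb{R}\) (\(\mathbb{R}_+\)) denote the set of (nonnegative) real numbers, \(\mathbb{R}^n\) (\(\mathbb{R}^n_+\)) the set of \(n\)-dimensional (elementwise nonnegative) vectors. Let \(\geq\) (\(\leq\)) denote elementwise inequality when applied to vectors and matrices. Let \(\langle\cdot,\cdot\rangle\) denote the dot product. Let \(I_n\) denote the \(n\)-dimensional identity matrix, \(\mathbf{1}_n\) the \(n\)-dimensional vector of all \(1\)s. Given matrix \(A\), let \(A_{ij}\) denote its \((i,j)\) entry; \(A^\top\) denotes its transpose. Let \(\otimes\) denote the Kronecker product.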
II-B Subgradients
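Let \(f:\mathbb{R}^n\to\mathbb{R}\) be a convex function. Then \(g\in\mathbb{R}^n\) is a subgradient of \(f\) at \(x\in\mathbb{R}^n\) if and only if for any \(y\in\mathbb{R}^n\) one has
\[f(y)-f(x)\geq\langle g, y-x\rangle. \tag{2}\]
We denote by \(\partial f(x)\) the set of subgradients of \(f\) at \(x\). An important case of subdifferential is the case of the indicator function \(\delta_C\) of a non-empty convex set \(C\), defined as \(\delta_C(x)=0\) if \(x\in C\) and \(\delta_C(x)=+\infty\) otherwise. We will use the following result.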
Lemma 1.
[17, Theorem 27.4] Given a closed convex set and closed, convex, proper function , then if and only if .
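The subgradient inequality \(f(y)-f(x)\geq\langle g, y-x\rangle\) can be checked numerically on a simple non-smooth function. The sketch below is only an illustration: the function \(f(x)=|x|\), the grid of test points and the helper name `is_subgradient` are assumptions made here, not part of the paper. At \(x=0\), every \(g\in[-1,1]\) should pass the test and any \(g\) outside that interval should fail.

```python
def f(x):
    # Convex function that is not differentiable at 0.
    return abs(x)


def is_subgradient(g, x, test_points, tol=1e-12):
    # Check f(y) - f(x) >= g * (y - x) at every test point y.
    return all(f(y) - f(x) >= g * (y - x) - tol for y in test_points)


# Grid of test points on [-5, 5] (an arbitrary choice for illustration).
ys = [k / 10.0 for k in range(-50, 51)]

# Candidates inside [-1, 1] satisfy the inequality at x = 0.
inside = [is_subgradient(g, 0.0, ys) for g in (-1.0, -0.3, 0.0, 0.7, 1.0)]

# Candidates outside [-1, 1] violate it for some y.
outside = [is_subgradient(g, 0.0, ys) for g in (-1.5, 1.2)]
```

A finite grid can only refute a candidate, not certify it, so this is a sanity check rather than a proof.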
II-C Mirror maps and Bregman divergence
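Let \(\mathcal{D}\subseteq\mathbb{R}^n\) be a convex open set. We say that \(\phi:\mathcal{D}\to\mathbb{R}\) is a mirror map [18, p.298] if it satisfies: 1) \(\phi\) is differentiable and strictly convex, 2) \(\nabla\phi\) takes all possible values, and 3) \(\nabla\phi\) diverges on the boundary of the closure of \(\mathcal{D}\), i.e., \(\lim_{x\to\partial\bar{\mathcal{D}}}\|\nabla\phi(x)\|=\infty\), where \(\|\cdot\|\) is an arbitrary norm on \(\mathbb{R}^n\). The Bregman divergence \(B_\phi:\mathcal{D}\times\mathcal{D}\to\mathbb{R}_+\) is defined as [19, Sec. 2.1]
\[B_\phi(x,y)=\phi(x)-\phi(y)-\langle\nabla\phi(y), x-y\rangle. \tag{3}\]
Note that \(B_\phi(x,y)\geq 0\) and \(B_\phi(x,y)=0\) only if \(x=y\). \(B_\phi\) also satisfies the following three-point identity,
\[\langle\nabla\phi(x)-\nabla\phi(y), x-z\rangle=B_\phi(x,y)+B_\phi(z,x)-B_\phi(z,y). \tag{4}\]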
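As a concrete illustration of a Bregman divergence, the sketch below uses the negative entropy \(\phi(x)=\sum_k x_k\log x_k\), for which the divergence between two points of the probability simplex equals the Kullback-Leibler divergence. It also checks the standard three-point identity \(\langle\nabla\phi(x)-\nabla\phi(y), x-z\rangle=B(x,y)+B(z,x)-B(z,y)\) on random points. The function names and the random test points are assumptions made for this example.

```python
import math
import random


def neg_entropy(x):
    # phi(x) = sum_k x_k * log(x_k), defined for positive entries.
    return sum(v * math.log(v) for v in x)


def neg_entropy_grad(x):
    # Gradient of the negative entropy: log(x_k) + 1.
    return [math.log(v) + 1.0 for v in x]


def bregman(x, y):
    # B(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>.
    g = neg_entropy_grad(y)
    inner = sum(gk * (xk - yk) for gk, xk, yk in zip(g, x, y))
    return neg_entropy(x) - neg_entropy(y) - inner


def random_simplex_point(n, rng):
    # Strictly positive weights normalized to sum to one.
    w = [rng.random() + 0.1 for _ in range(n)]
    s = sum(w)
    return [v / s for v in w]


rng = random.Random(0)
x, y, z = (random_simplex_point(4, rng) for _ in range(3))

# On the simplex, the entropy-induced divergence equals KL(x || y).
kl = sum(a * math.log(a / b) for a, b in zip(x, y))

# Both sides of the three-point identity.
gx, gy = neg_entropy_grad(x), neg_entropy_grad(y)
lhs = sum((a - b) * (p - q) for a, b, p, q in zip(gx, gy, x, z))
rhs = bregman(x, y) + bregman(z, x) - bregman(z, y)
```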
II-D Graphs and distributed optimization
An undirected connected graph contains a vertex set and an edge set such that if and only if for all . Denote the set of neighbors of node such that if .
Consider a symmetric stochastic matrix defined on the graph such that implies that . Such a matrix can be constructed, for example, by the graph Laplacian [1, Proposition 3.18]. If is irreducible [20, Lem. 8.4.1], then is a simple eigenvalue of with eigenvectors spanned by .
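One simple way to obtain a symmetric stochastic matrix from a graph Laplacian is sketched below. The particular scaling (identity minus the Laplacian divided by the maximum degree plus one) and the four-node example graph are assumptions chosen for illustration; the source does not say which construction is used. Because the graph is connected and the diagonal is positive, repeated averaging with this matrix drives all entries of a vector to their mean, which reflects the simple eigenvalue 1 with the all-ones eigenvector.

```python
def laplacian_weights(n, edges):
    # Build P = I - L / (d_max + 1), where L is the graph Laplacian.
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    eps = 1.0 / (max(deg) + 1)
    P = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        P[i][j] = eps
        P[j][i] = eps
    for i in range(n):
        # Diagonal entry makes each row sum to one.
        P[i][i] = 1.0 - eps * deg[i]
    return P


def average_once(P, x):
    # One round of weighted averaging with neighbors: x <- P x.
    return [sum(pij * xj for pij, xj in zip(row, x)) for row in P]


# Hypothetical undirected connected graph on four nodes.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
P = laplacian_weights(4, edges)

x = [4.0, 0.0, -2.0, 6.0]
mean = sum(x) / len(x)
for _ in range(200):
    x = average_once(P, x)
```

This construction does not by itself guarantee positive semi-definiteness, so any additional requirement on the matrix would have to be checked separately.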
Let denote the underlying graph over which problem (1) is defined. A common approach to solve problem is to create local copies of the design variable and impose the consensus constraints: for all [21, 22]. Many different consensus constraints have been proposed [7, 23, 24, 25]. In this paper, we consider consensus constraints of the form:
[TABLE]
where , is a symmetric, stochastic and irreducible matrix defined on . We will focus on the following reformulation of problem (1),
[TABLE]
III Stochastic Bregman Parallel Direction Method of Multipliers
In this section, we first review BPDMM in Algorithm 1, then combine it with the stochastic node update in [13] and propose sBPDMM in Algorithm 2.
BPDMM [11] solves problem (6) with Algorithm 1, which combines the ideas of PDMM [8] and the Bregman augmented Lagrangian [10]. Each iteration of the algorithm includes the following steps:
- (a) Mirror averaging. Step (8a) computes a nodal mirror average of neighboring nodes’ variables, and can be further decomposed as follows:
[TABLE]
where . Therefore this step is equivalent to first applying to , then running an averaging step, followed by , and finally a projection step. See Fig. 1 for an illustration.
- (b) Local optimization. Step (8b) optimizes a nodal augmented Lagrangian. In particular, the Bregman divergence term in the objective of (8b) augments the nodal Lagrangian by penalizing the difference from the nodal mirror average.
- (c) Disagreement integration. Step (9) is a discrete integration of the disagreement between neighboring nodes. Such integration is equivalent to a spring dynamics among neighboring nodes and improves the disturbance rejection performance of the algorithm. See [26, 27] for a detailed discussion.
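To make the mirror averaging step more tangible, the sketch below works it out for the negative entropy on the probability simplex. It assumes that the mirror average minimizes a weighted sum of Kullback-Leibler divergences from the neighbors' variables, with weights summing to one; this is an assumption made for illustration and is not taken from step (8a) itself. Under that assumption the four stages (gradient map, averaging, inverse map, projection) collapse to a normalized weighted geometric mean. All names and numbers are hypothetical.

```python
import math


def mirror_average(weights, points):
    # Stage 1 and 2: apply log (the entropy gradient up to a constant)
    # and take the weighted average coordinate by coordinate.
    n = len(points[0])
    log_avg = [
        sum(w * math.log(p[k]) for w, p in zip(weights, points))
        for k in range(n)
    ]
    # Stage 3: map back with exp (inverse of the gradient map).
    unnorm = [math.exp(v) for v in log_avg]
    # Stage 4: entropy projection onto the simplex is a normalization.
    total = sum(unnorm)
    return [v / total for v in unnorm]


def kl(x, y):
    return sum(a * math.log(a / b) for a, b in zip(x, y))


def objective(y, weights, points):
    # Weighted sum of divergences that the mirror average minimizes.
    return sum(w * kl(y, p) for w, p in zip(weights, points))


# Hypothetical neighbor variables on the 3-dimensional simplex.
points = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]
weights = [0.5, 0.25, 0.25]
y_bar = mirror_average(weights, points)
```

Comparing `objective(y_bar, weights, points)` with the value at other simplex points, such as the arithmetic mean, gives a quick check that the geometric mean is the better candidate for this objective.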
Both the mirror averaging step (8a) and the disagreement integration step (9) have closed-form updates when the constraint set is structured, e.g., is or the probability simplex [11]. On the other hand, the local optimization step (8b) typically requires an iterative algorithm itself, e.g., the mirror descent method [28]. Hence the main computational effort of implementing Algorithm 1 comes from the local optimization step (8b). At each iteration, Algorithm 1 requires at least processors, one assigned to each node, to solve optimization (8b) in parallel. Such requirements are computationally demanding for large-scale networks.
In order to address this challenge, we propose Algorithm 2, which uses a stochastic node update [12, 13, 14]. Compared with Algorithm 1, each iteration of Algorithm 2 only executes the local optimization step on a set of randomly selected nodes, which requires fewer processors running in parallel. This flexibility reduces the requirements on the total computation power of the network, and allows BPDMM to be applicable to much larger-scale networks.
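The stochastic node update can be pictured with a short skeleton. This is a minimal sketch under stated assumptions: nodes are drawn uniformly without replacement, non-selected nodes keep their previous value, and `toy_local_update` is a hypothetical stand-in for the local optimization step. None of these choices is confirmed by Algorithm 2, which is not reproduced here.

```python
import random


def stochastic_sweep(x, local_update, m, rng):
    # Update only m randomly chosen nodes; all others keep their value.
    selected = rng.sample(range(len(x)), m)
    new_x = list(x)
    for i in selected:
        # Each selected node reads the old iterate x, so the updates
        # could run in parallel on m processors.
        new_x[i] = local_update(i, x)
    return new_x, sorted(selected)


def toy_local_update(i, x):
    # Hypothetical placeholder for the local optimization at node i.
    return 0.5 * x[i]


rng = random.Random(1)
x0 = [float(i + 1) for i in range(10)]
x1, chosen = stochastic_sweep(x0, toy_local_update, m=3, rng=rng)
```

With ten nodes and `m=3`, only three local problems are solved per iteration, which is the source of the reduced processor count discussed above.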
Although the generalization from Algorithm 1 to Algorithm 2 seems straightforward, the generalization in the corresponding convergence proof requires more careful treatment. In particular, the convergence proof of Algorithm 1 in [11] hinges on a monotonically non-increasing non-negative Lyapunov function for the full primal update in (8) with carefully chosen algorithm parameters. In order to generalize this proof to Algorithm 2, we need to answer the following questions:
- How to find a monotonically non-increasing non-negative Lyapunov function for the stochastic partial primal update in (10)?
- How does the randomly selected node set affect the choice of algorithm parameters?
In the sequel, we aim to answer these questions and establish the convergence proof of Algorithm 2.
IV Convergence
In this section, we prove the global convergence as well as the iteration complexity of Algorithm 2. All detailed proofs in this section can be found in the Appendix.
We first group our assumptions in Assumption 1.
Assumption 1.
- (a) Functions are closed, proper and convex for all .
- (b) Set is closed and convex. There exists a saddle point such that and
[TABLE]
for all .
- (c) Function is a mirror map, where is an open convex set such that is included in its closure. In addition, function is -strongly convex with respect to -norm, i.e., for any ,
[TABLE]
- (d) Matrix is symmetric, stochastic, irreducible and positive semi-definite.
- (e) At each iteration , we assume
Now we start to construct the convergence proof of Algorithm 2 under Assumption 1. The optimality condition of (10b) is that for all ,
[TABLE]
Define the residuals of optimality conditions (14) at iteration as
[TABLE]
where and Lagrangian is defined as
[TABLE]
Using (12) and (2) we can show the following
[TABLE]
Hence defines a running duality gap that measures distance to optimality [8]. Notice that given , is a random variable that depends only on , and implies that and for all , i.e., both optimality and consensus are achieved.
In order to show , we define the following Lyapunov function of Algorithm 2
[TABLE]
where
[TABLE]
with and .
Compared with the one used in [11], the Lyapunov function defined by (18) contains a generalized Lagrangian , which renders the positive definiteness of unclear. The following lemma shows that is indeed positive definite, and lower bounded by a Bregman divergence to the optimum.
Lemma 2.
Suppose Assumption 1 holds. If
[TABLE]
where , and are defined in (13), then the Lyapunov function defined in (18) satisfies
[TABLE]
The sketch of the proof is as follows. Using equations (12b) and (11), we can show
[TABLE]
In addition, equation (11) and Assumption 1, particularly the assumptions on function and matrix , ensure that
[TABLE]
Substituting these two inequalities into (18) and using (13), we can show , which, due to the assumption in (20), finally reduces to (21). Then positive definiteness of follows from the positive definiteness of the Bregman divergence and the fact when .
Notice that is a random variable whose value depends on the realization of , which is the history of selected node sets, i.e., . The following theorem shows that the expected value of conditioned on , i.e., is monotonically non-increasing with respect to .
Theorem 1 (Global convergence).
Suppose that Assumption 1 holds. Let the sequence be generated by Algorithm 2. Let and be defined as in (15) and (18), respectively. If satisfy (20), then we have the following monotonicity relation
[TABLE]
The sketch of the proof is as follows. We substitute the subgradient in (14) into (2) and obtain an inequality. Using the three-point property (4), we can split the right-hand side of this inequality into three parts, which contribute to and , respectively. Taking the expectation over the realization of conditioned on the value of , we obtain the following relation
[TABLE]
where the assumptions in Assumption 1 and (20) ensure that all intermediate terms cancel each other. Taking the expectation over the realization of on both sides of (22), we reach the inequality in Theorem 1.
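The effect of taking an expectation over the randomly selected node set can be illustrated with exact enumeration. The sketch assumes that the node set is drawn uniformly among all subsets of a fixed size \(m\) out of \(N\) nodes; the sampling rule of the paper is part of Assumption 1 and is not reproduced here, so this is only a plausible special case. Under that assumption, a sum over the selected nodes has expectation \(m/N\) times the sum over all nodes. The values below are arbitrary.

```python
from fractions import Fraction
from itertools import combinations


def expected_partial_sum(values, m):
    # Exact average of sum_{i in S} values[i] over all subsets S of size m.
    subsets = list(combinations(range(len(values)), m))
    total = sum(sum(values[i] for i in S) for S in subsets)
    return Fraction(total, len(subsets))


# Hypothetical per-node quantities (integers keep the arithmetic exact).
values = [3, -1, 4, 1, 5]
m = 2

lhs = expected_partial_sum(values, m)
rhs = Fraction(m, len(values)) * sum(values)
```

Here `lhs` and `rhs` agree exactly, because each node belongs to the same fraction \(m/N\) of the subsets.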
Summing the inequality in Theorem 1 from the case of to we have
[TABLE]
Since for all , inequality (23) implies that as , which establishes the global convergence of Algorithm 2. In addition, if we apply Jensen’s inequality to (23), we obtain the following corollary, which shows the iteration complexity of Algorithm 2 in an ergodic sense.
Corollary 1 (Iteration complexity).
Suppose that Assumption 1 holds. Let the sequence be generated by Algorithm 2. Let be defined as in (18), . If satisfy (20), then
[TABLE]
The bound on running duality gap was used in [8].
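The generic mechanism behind an ergodic \(O(1/T)\) bound can be written in two steps. The block below is a sketch in hypothetical notation, not the paper's own statement: \(r_t\geq 0\) stands for a per-iteration residual, \(V_0\) for a finite constant bounding the summed residuals, \(h\) for a convex function, and \(\bar{x}_T\) for the running average of the iterates \(x_t\).

```latex
% Hypothetical symbols: r_t >= 0, V_0 finite, h convex, \bar{x}_T the average iterate.
\sum_{t=0}^{T-1} \mathbb{E}[r_t] \le V_0
\quad\Longrightarrow\quad
\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}[r_t] \le \frac{V_0}{T},
\qquad
h(\bar{x}_T) = h\Big(\frac{1}{T}\sum_{t=0}^{T-1} x_t\Big)
\le \frac{1}{T}\sum_{t=0}^{T-1} h(x_t)
\quad \text{(Jensen's inequality)}.
```

If \(r_t=h(x_t)\), combining the two relations gives \(\mathbb{E}[h(\bar{x}_T)]\leq V_0/T\), which is the sense in which averaging the iterates yields an \(O(1/T)\) rate.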
V Numerical examples
In this section, we demonstrate the effectiveness and efficiency of Algorithm 2 via numerical examples.
Consider an instance of problem (1) where and is the probability simplex, is an undirected connected communication graph. Such optimization can model, for example, multi-agent decision making, where is the cost of agent for choosing policy .
We generate an instance of this optimization where entries of are sampled from the standard normal distribution. is a randomly generated graph with and edge probability [1, p. 90]. Matrix is obtained by minimizing its second largest eigenvalue (in this case, ) while preserving graph adjacency constraints. We choose the following parameters in Algorithm 2:
- , where denotes the -th element of vector . Then the assumption in (13) is satisfied by (see Remark 1 in [10]).
- . Notice that the assumptions in (20) are satisfied with .
With these parameters, the mirror averaging step (10a) and local optimization step (10b) reduce to the following (see Section 4.3 in [18] for details)
[TABLE]
where multiplication, power and exponential operations on vectors are all elementwise, and for all . Update (24) amounts to elementwise operations that allow massively parallel implementation.
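The flavor of such an elementwise update can be seen on a simplified sub-problem. The sketch assumes a linear local cost with vector `c` and an entropy penalty with weight `rho` around a reference point `y`, and it omits the dual-variable terms of step (10b); it is therefore not update (24) itself, only a closely related closed form. For this sub-problem the minimizer over the simplex is an elementwise exponential reweighting of `y` followed by normalization. All names and numbers are hypothetical.

```python
import math


def entropic_prox(c, y, rho):
    # Closed-form minimizer of <c, x> + rho * KL(x || y) over the simplex:
    # elementwise reweighting of y by exp(-c / rho), then normalization.
    unnorm = [yk * math.exp(-ck / rho) for ck, yk in zip(c, y)]
    total = sum(unnorm)
    return [v / total for v in unnorm]


def objective(x, c, y, rho):
    # Linear cost plus entropy penalty around the reference point y.
    lin = sum(ck * xk for ck, xk in zip(c, x))
    kl = sum(xk * math.log(xk / yk) for xk, yk in zip(x, y))
    return lin + rho * kl


# Hypothetical cost vector, reference point and penalty weight.
c = [1.0, 0.2, -0.5]
y = [0.2, 0.5, 0.3]
rho = 2.0
x_star = entropic_prox(c, y, rho)
```

Because every operation acts coordinate by coordinate apart from one normalizing sum, updates of this kind map naturally onto vectorized or GPU implementations.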
We demonstrate the convergence performance of Algorithm 2 in Fig. 2 and Fig. 3, where and are the objective function values achieved at iteration and at optimality, respectively. In particular, Fig. 2 shows that as increases, the convergence of Algorithm 2 becomes faster and less oscillatory, because more nodes get updated at each iteration. Fig. 3 shows that when we choose as the negative entropy function rather than a quadratic function, the convergence speed improves dramatically. This is because, compared with a quadratic function, the negative entropy function exploits the structure of the probability simplex much better. Such improvement demonstrates the advantage of Algorithm 2 over stochastic multiplier methods based on quadratic augmentation [12, 13, 14].
VI Conclusions
In this paper, we generalize BPDMM [11] to stochastic BPDMM, where each iteration only solves local optimization on a randomly selected subset of nodes rather than all the nodes in the network. Such generalization requires fewer processors running in parallel, and hence allows application to much larger-scale networks. Future directions include generalization to directed and time-varying networks.
APPENDIX
For notational simplicity, we let . Suppose Assumption 1 holds; then the nullspace of is spanned by . In addition, Assumption 1 and update rule (10) ensure that
[TABLE]
for all . We will need the following lemmas.
Lemma 3.
Let
[TABLE]
for all . Then for any ,
[TABLE]
Proof.
Equation (26) holds if and only if: for any ,
[TABLE]
Using three point property (4), we have
[TABLE]
Summing (28) over all completes the proof. ∎
Lemma 4.
Suppose Assumption 1 holds. Then
[TABLE]
for all , where denotes norm and .
Proof.
First, observe that if is symmetric, stochastic, irreducible and positive semi-definite, is positive semi-definite [20, Theorem 8.4.4]. Since , we can show the following
[TABLE]
Hence (29) holds due to the fact that
[TABLE]
for all , and that for all where . ∎
VI-A Lemma 2
Proof.
Using (25a) and (16), we can show that
[TABLE]
Substituting (30) into (19), we have
[TABLE]
where the last step is due to . Therefore, substituting (31) into (18), we have
[TABLE]
Since for all , we have
[TABLE]
Substituting the above inequality into (32), we obtain (21). ∎
VI-B Theorem 1
Proof.
Let be the -th column of . Since is convex, the subgradient in (14) satisfies the following
[TABLE]
where we use (25b).
The first term on the RHS of (33) can be rewritten as
[TABLE]
To simplify the second term on the RHS of (33), notice that
[TABLE]
Substituting (34) and (35) into (33), we have
[TABLE]
In addition, notice that
[TABLE]
Substituting (36) into (37), we have
[TABLE]
where we use the definition in (15).
Taking the expectation of (38) over conditioned on , we have the following
[TABLE]
where we use (25a). Here we assume is computed as in (8a) for all nodes in , even though Algorithm 1 only requires computation on nodes in . Substituting (25b) into (16), we have
[TABLE]
[TABLE]
Using (4) and (11) we can show
[TABLE]
Substituting (42) into (41) and using the definition in (18), we have
[TABLE]
Since
[TABLE]
substituting (42) into (41), we have
[TABLE]
Taking the expectation of (41) over the realization of , we obtain the desired result. ∎
References
- [1] M. Mesbahi and M. Egerstedt, Graph Theoretic Methods in Multiagent Networks. Princeton University Press, 2010.
- [2] D. Li, K. D. Wong, Y. H. Hu, and A. M. Sayeed, “Detection, classification, and tracking of targets,” IEEE Signal Process. Mag., vol. 19, no. 2, pp. 17–29, 2002.
- [3] B. Açıkmeşe, M. Mandić, and J. L. Speyer, “Decentralized observers with consensus filters for distributed discrete-time linear systems,” Automatica, vol. 50, no. 4, pp. 1037–1052, 2014.
- [4] V. Lesser, C. L. Ortiz Jr, and M. Tambe, Distributed Sensor Networks: A Multiagent Perspective. Springer Science & Business Media, 2012, vol. 9.
- [5] B. Gholami, S. Yoon, and V. Pavlovic, “Decentralized approximate Bayesian inference for distributed sensor network,” in AAAI Conf. Artificial Intell., 2016, pp. 1582–1588.
- [6] A. Yahya, A. Li, M. Kalakrishnan, Y. Chebotar, and S. Levine, “Collective robot reinforcement learning with distributed asynchronous guided policy search,” in Int. Conf. Intell. Robots Syst. IEEE, 2017, pp. 79–86.
- [7] E. Wei and A. Ozdaglar, “Distributed alternating direction method of multipliers,” in Proc. IEEE Conf. Decision Control, 2012, pp. 5445–5450.
- [8] D. Meng, M. Fazel, and M. Mesbahi, “Proximal alternating direction method of multipliers for distributed optimization on weighted graphs,” in Proc. IEEE Conf. Decision Control, 2015, pp. 1396–1401.
