Stochastic Bregman Parallel Direction Method of Multipliers for Distributed Optimization

Yue Yu; Behçet Açıkmeşe

arXiv:1902.09695·math.OC·March 5, 2019·CDC

TL;DR

This paper introduces a stochastic variant of the Bregman parallel direction method of multipliers for distributed optimization, reducing computational load and enabling larger network applications while maintaining convergence guarantees.

Contribution

It generalizes BPDMM to a stochastic setting with convergence proofs, facilitating scalable distributed optimization in multi-agent systems.

Findings

01. Achieves global convergence of stochastic BPDMM.

02. Establishes an O(1/T) iteration complexity.

03. Demonstrates effectiveness through numerical examples.

Abstract

Bregman parallel direction method of multipliers (BPDMM) efficiently solves distributed optimization over a network, which arises in a wide spectrum of collaborative multi-agent learning applications. In this paper, we generalize BPDMM to stochastic BPDMM, where each iteration only solves local optimization on a randomly selected subset of nodes rather than all the nodes in the network. This generalization reduces the need for computational resources and allows application to larger-scale networks. We establish both the global convergence and the $O(1/T)$ iteration complexity of stochastic BPDMM. We demonstrate our results via numerical examples.

Equations

$$\begin{array}{ll}\underset{x\in\mathcal{X}^{|\mathcal{V}|}}{\mbox{minimize}}&\sum\limits_{i\in\mathcal{V}}f_{i}(x_{i})\\ \mbox{subject to}&x_{i}=x_{j},\enskip\forall\{i,j\}\in\mathcal{E}\end{array} \tag{1}$$

$$f(v)-f(u)\geq\langle g,v-u\rangle. \tag{2}$$

$$B_{\phi}(u,v)=\phi(u)-\phi(v)-\langle\nabla\phi(v),u-v\rangle. \tag{3}$$

$$\langle\nabla\phi(u)-\nabla\phi(v),w-u\rangle=B_{\phi}(w,v)-B_{\phi}(w,u)-B_{\phi}(u,v). \tag{4}$$

$$(P\otimes I_{n})x=x, \tag{5}$$

$$\begin{array}{ll}\underset{x\in\mathcal{X}^{|\mathcal{V}|}}{\mbox{minimize}}&\sum\limits_{i\in\mathcal{V}}f_{i}(x_{i})\\ \mbox{subject to}&(P\otimes I_{n})x=x.\end{array} \tag{6}$$

$$\nabla\Phi(z^{t})=(P\otimes I_{n})\nabla\Phi(x^{t}),\qquad y^{t}=\mathop{\rm argmin}_{y\in\mathcal{X}^{|\mathcal{V}|}}B_{\Phi}(y,z^{t}), \tag{7}$$

Algorithm 1 (BPDMM):

$$y_{i}^{t}=\mathop{\rm argmin}_{y_{i}\in\mathcal{X}}\sum_{j\in\mathcal{N}(i)}P_{ij}B_{\phi}(y_{i},x_{j}^{t}),\quad\forall i\in\mathcal{V} \tag{8a}$$

$$x_{i}^{t+1}=\mathop{\rm argmin}_{x_{i}\in\mathcal{X}}f_{i}(x_{i})+\Big\langle x_{i},\mu_{i}^{t}-\sum_{j\in\mathcal{N}(i)}P_{ij}\mu_{j}^{t}\Big\rangle+\rho B_{\phi}(x_{i},y_{i}^{t}),\quad\forall i\in\mathcal{V} \tag{8b}$$

$$\mu_{i}^{t+1}=\mu_{i}^{t}+\tau x_{i}^{t+1}-\tau\sum_{j\in\mathcal{N}(i)}P_{ij}x_{j}^{t+1},\quad\forall i\in\mathcal{V} \tag{9}$$

Algorithm 2 (stochastic BPDMM):

$$y_{i}^{t}=\mathop{\rm argmin}_{y_{i}\in\mathcal{X}}\sum_{j\in\mathcal{N}(i)}P_{ij}B_{\phi}(y_{i},x_{j}^{t}),\quad\forall i\in\mathcal{S}_{t+1} \tag{10a}$$

$$x_{i}^{t+1}=\mathop{\rm argmin}_{x_{i}\in\mathcal{X}}f_{i}(x_{i})+\Big\langle x_{i},\mu_{i}^{t}-\sum_{j\in\mathcal{N}(i)}P_{ij}\mu_{j}^{t}\Big\rangle+\rho B_{\phi}(x_{i},y_{i}^{t}),\quad\forall i\in\mathcal{S}_{t+1} \tag{10b}$$

$$x_{i}^{t+1}=x_{i}^{t},\quad\forall i\in\mathcal{V}\setminus\mathcal{S}_{t+1} \tag{10c}$$

$$\mu_{i}^{t+1}=\mu_{i}^{t}+\tau x_{i}^{t+1}-\tau\sum_{j\in\mathcal{N}(i)}P_{ij}x_{j}^{t+1},\quad\forall i\in\mathcal{V} \tag{11}$$

$$x_{i}^{\star}=\sum_{j\in\mathcal{V}}P_{ij}x_{j}^{\star}, \tag{12a}$$

$$-\mu_{i}^{\star}+\sum_{j\in\mathcal{V}}P_{ij}\mu_{j}^{\star}\in\partial(f_{i}+\delta_{\mathcal{X}})(x_{i}^{\star}) \tag{12b}$$

$$B_{\phi}(u,v)\geq\frac{\alpha}{2}\left\lVert u-v\right\rVert_{p}^{2}. \tag{13}$$

$$-\mu_{i}^{t}+\sum_{j\in\mathcal{V}}P_{ij}\mu_{j}^{t}-\rho\big(\nabla\phi(x_{i}^{t+1})-\nabla\phi(y_{i}^{t})\big)\in\partial(f_{i}+\delta_{\mathcal{X}})(x_{i}^{t+1}). \tag{14}$$

$$R(t+1)\coloneqq\omega\big(L(x^{t},\mu^{\star})-L(x^{\star},\mu^{\star})\big)+\rho\sum_{i\in\mathcal{S}_{t+1}}B_{\phi}(x_{i}^{t+1},y_{i}^{t})+\frac{\gamma\rho}{2}\left\lVert((I_{|\mathcal{V}|}-P)\otimes I_{n})x^{t}\right\rVert_{2}^{2}, \tag{15}$$

$$L(x,\mu)=\sum_{i\in\mathcal{V}}(f_{i}+\delta_{\mathcal{X}})(x_{i})+\langle\mu,((I_{|\mathcal{V}|}-P)\otimes I_{n})x\rangle. \tag{16}$$

$$L(x^{t},\mu^{\star})-L(x^{\star},\mu^{\star})\geq 0 \tag{17}$$

$$V(t)\coloneqq\cdots+\rho\sum_{i\in\mathcal{V}}B_{\phi}(x_{i}^{\star},x_{i}^{t}). \tag{18}$$

$$H(x^{t},\mu^{t})=L(x^{t},\mu^{t})-L(x^{\star},\mu^{\star})-\tau\left\lVert(Q\otimes I_{n})x^{t}\right\rVert_{2}^{2} \tag{19}$$

$$\tau\leq\frac{\rho(\omega\alpha\sigma-\gamma)}{2-\omega},\quad 0<\gamma<\omega\alpha\sigma, \tag{20}$$

$$V(t)\geq\frac{(1-\omega)\omega\alpha\sigma\rho+\gamma\rho}{(2-\omega)\omega\alpha\sigma}\sum_{i\in\mathcal{V}}B_{\phi}(x_{i}^{\star},x_{i}^{t}). \tag{21}$$

$$H(x^{t},\mu^{t})\geq-\frac{\omega}{2\tau}\left\lVert\mu^{t-1}-\mu^{\star}\right\rVert_{2}^{2}-\frac{1}{2\omega\tau}\left\lVert\mu^{t}-\mu^{t-1}\right\rVert_{2}^{2}.$$

$$-\frac{1}{2\omega\tau}\left\lVert\mu^{t}-\mu^{t-1}\right\rVert_{2}^{2}+\frac{\tau}{2\omega\sigma}\sum_{i\in\mathcal{V}}B_{\phi}(x_{i}^{\star},x_{i}^{t})\geq 0.$$

$$\mathds{E}_{\mathcal{S}_{1:t}}[V(t)]-\mathds{E}_{\mathcal{S}_{1:t+1}}[V(t+1)]\geq\mathds{E}_{\mathcal{S}_{1:t+1}}[R(t+1)].$$

$$\mathds{E}_{\mathcal{S}_{t+1}}[R(t+1)]\leq V(t)-\mathds{E}_{\mathcal{S}_{t+1}}[V(t+1)], \tag{22}$$

$$\sum_{t=1}^{T}\mathds{E}_{\mathcal{S}_{1:t}}[R(t)]\leq V(0). \tag{23}$$

Full text

Stochastic Bregman Parallel Direction Method of Multipliers for Distributed Optimization

Yue Yu and Behçet Açıkmeşe

The authors are with the Department of Aeronautics and Astronautics, University of Washington, Seattle, WA, 98195; emails: {yueyu,behcet}@uw.edu

I Introduction

Distributed optimization over a connected undirected network $\mathcal{G}=(\mathcal{V},\mathcal{E})$ is defined as follows

$$\begin{array}{ll}\underset{x\in\mathcal{X}^{|\mathcal{V}|}}{\mbox{minimize}}&\sum\limits_{i\in\mathcal{V}}f_{i}(x_{i})\\ \mbox{subject to}&x_{i}=x_{j},\enskip\forall\{i,j\}\in\mathcal{E}\end{array} \tag{1}$$

where $\mathcal{X}\subset\mathbb{R}^{n}$ is a closed convex set, $\mathcal{X}^{|\mathcal{V}|}$ is the Cartesian product of $|\mathcal{V}|$ copies of $\mathcal{X}$, and each $f_{i}$ is a convex function accessible by node $i$ only. Global optimality is achieved by local optimization on each node and efficient communication between neighboring nodes. In addition to classical applications such as formation control [1], distributed tracking [2] and estimation [3, 4], problem (1) also arises in collaborative learning scenarios [5, 6], where problem (1) represents distributed learning from data collected by multiple agents.

There has been an increasing interest in applying multiplier methods to solve problem (1) [7, 8, 9]. At each iteration of such methods, every primal variable is updated by optimizing a quadratic augmented Lagrangian; every dual variable is updated by numerically integrating local disagreement. Recently, the Bregman parallel direction method of multipliers (BPDMM) generalized the quadratic augmentation in local optimization to Bregman augmentation, which better exploits the structure of the constraint set $\mathcal{X}$, and hence leads to significant improvement in convergence speed [10, 11].

One challenge in implementing multiplier methods for problem (1) is that a local optimization problem needs to be solved on every node in parallel at each iteration, which demands substantial computational resources when applied to large-scale networks. A popular approach to address this challenge is stochastic multiplier methods [12, 13, 14], which combine multiplier methods with the idea of stochastic block coordinate descent [15, 16]. At each iteration, stochastic multiplier methods solve local optimization problems only on a randomly selected subset of nodes, rather than on all the nodes. Such algorithms guarantee global convergence to the optimum in expectation via proper choice of algorithm parameters. However, to the best of our knowledge, all existing stochastic multiplier methods use quadratic augmentation. In other words, there is no stochastic extension of multiplier methods based on Bregman augmentation.

In this paper, we close this gap in the literature by proposing stochastic BPDMM, which combines the benefits of BPDMM and stochastic multiplier methods. Compared with BPDMM [11], it only requires solving local optimization on a randomly selected subset of nodes, which allows application to larger-scale networks; compared with existing stochastic multiplier methods [12, 13, 14], it extends the quadratic augmented Lagrangian to a Bregman augmented Lagrangian, which improves the convergence speed by better exploiting the constraint structure. We establish the global convergence and $O(1/T)$ iteration complexity of stochastic BPDMM, and demonstrate its effectiveness and efficiency via numerical examples.

The rest of the paper is organized as follows. Section II covers necessary background and reformulates problem (1) with consensus constraints. Section III develops the stochastic BPDMM, whose convergence proof is established in Section IV. Section V presents numerical examples and demonstrates the advantages of stochastic BPDMM over prior work. Section VI concludes and comments on future directions.

II Preliminaries and Background

II-A Notation

Let $\mathbb{R}$ ( $\mathbb{R}_{+}$ ) denote the set of (nonnegative) real numbers, $\mathbb{R}^{n}$ ( $\mathbb{R}^{n}_{+}$ ) the set of $n$ -dimensional (elementwise nonnegative) vectors. Let $\geq(\leq)$ denote elementwise inequality when applied to vectors and matrices. Let $\langle\cdot,\cdot\rangle$ denote the dot product. Let $I_{n}\in\mathbb{R}^{n\times n}$ denote the $n$ -dimensional identity matrix, $\mathbf{1}_{n}\in\mathbb{R}^{n}$ the $n$ -dimensional vector of all $1$ s. Given matrix $A\in\mathbb{R}^{n\times n}$ , let $A_{ij}$ denote its $(i,j)$ entry; $A^{\top}$ denotes its transpose. Let $\otimes$ denote the Kronecker product.

II-B Subgradients

Let $f:\mathbb{R}^{n}\to\mathbb{R}$ be a convex function. Then $g\in\mathbb{R}^{n}$ is a subgradient of $f$ at $u\in\mathbb{R}^{n}$ if and only if for any $v\in\mathbb{R}^{n}$ one has

$$f(v)-f(u)\geq\langle g,v-u\rangle. \tag{2}$$

We denote by $\partial f(u)$ the set of subgradients of $f$ at $u$. An important special case is the subdifferential of the indicator function of a non-empty convex set $\mathcal{X}$, defined as $\delta_{\mathcal{X}}(x)=0$ if $x\in\mathcal{X}$ and $\infty$ otherwise. We will use the following result.

Lemma 1.

[17, Theorem 27.4] Given a closed convex set $\mathcal{X}\subseteq\mathbb{R}^{n}$ and a closed, convex, proper function $f:\mathbb{R}^{n}\to\mathbb{R}$, $u^{\star}=\mathop{\rm argmin}_{u\in\mathcal{X}}\,f(u)$ if and only if $0\in\partial(f+\delta_{\mathcal{X}})(u^{\star})$.

II-C Mirror maps and Bregman divergence

Let $\mathcal{D}\subseteq\mathbb{R}^{n}$ be a convex open set. We say that $\phi:\mathcal{D}\to\mathbb{R}$ is a mirror map [18, p.298] if it satisfies: 1) $\phi$ is differentiable and strictly convex, 2) $\nabla\phi$ takes all possible values, and 3) $\nabla\phi$ diverges on the boundary of the closure of $\mathcal{D}$ , i.e., $\lim_{u\to\partial\bar{\mathcal{D}}}\left\lVert\nabla\phi(u)\right\rVert=\infty$ , where $\left\lVert\cdot\right\rVert$ is an arbitrary norm on $\mathbb{R}^{n}$ . The Bregman divergence $B_{\phi}:\mathcal{D}\times\mathcal{D}\to\mathbb{R}_{+}$ is defined as [19, Sec. 2.1]

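$$B_{\phi}(u,v)=\phi(u)-\phi(v)-\langle\nabla\phi(v),u-v\rangle. \tag{3}$$

Note that $B_{\phi}(u,v)\geq 0$ and $B_{\phi}(u,v)=0$ if and only if $u=v$. $B_{\phi}$ also satisfies the following three-point identity:

$$\langle\nabla\phi(u)-\nabla\phi(v),w-u\rangle=B_{\phi}(w,v)-B_{\phi}(w,u)-B_{\phi}(u,v). \tag{4}$$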
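As a quick numerical illustration of the definition of $B_{\phi}$ and of the three-point identity, the sketch below evaluates both for two common choices of $\phi$: the negative entropy and half the squared Euclidean norm. The function names and test vectors are hypothetical and chosen only for illustration; the identity checked is the standard one, $\langle\nabla\phi(u)-\nabla\phi(v),w-u\rangle=B_{\phi}(w,v)-B_{\phi}(w,u)-B_{\phi}(u,v)$.

```python
import math

def neg_entropy(u):
    """phi(u) = sum_k u[k] * ln u[k], for elementwise positive u."""
    return sum(uk * math.log(uk) for uk in u)

def grad_neg_entropy(u):
    """Gradient of the negative entropy: 1 + ln u[k]."""
    return [1.0 + math.log(uk) for uk in u]

def half_sq_norm(u):
    """phi(u) = 0.5 * ||u||_2^2."""
    return 0.5 * sum(uk * uk for uk in u)

def grad_half_sq_norm(u):
    return list(u)

def bregman(phi, grad_phi, u, v):
    """B_phi(u, v) = phi(u) - phi(v) - <grad phi(v), u - v>."""
    g = grad_phi(v)
    return phi(u) - phi(v) - sum(gk * (uk - vk) for gk, uk, vk in zip(g, u, v))

def three_point_gap(phi, grad_phi, u, v, w):
    """Left-hand side minus right-hand side of the three-point identity."""
    gu, gv = grad_phi(u), grad_phi(v)
    lhs = sum((a - b) * (wk - uk) for a, b, wk, uk in zip(gu, gv, w, u))
    rhs = (bregman(phi, grad_phi, w, v)
           - bregman(phi, grad_phi, w, u)
           - bregman(phi, grad_phi, u, v))
    return lhs - rhs

# Three points on the probability simplex (illustrative values).
u = [0.2, 0.3, 0.5]
v = [0.6, 0.3, 0.1]
w = [0.1, 0.1, 0.8]

# With the negative entropy, B_phi restricted to the simplex is the KL divergence.
kl = bregman(neg_entropy, grad_neg_entropy, u, v)
gap = three_point_gap(neg_entropy, grad_neg_entropy, u, v, w)
```

With the quadratic choice, `bregman(half_sq_norm, grad_half_sq_norm, u, v)` reduces to half the squared Euclidean distance, which is the quadratic augmentation that the Bregman term generalizes.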

II-D Graphs and distributed optimization

An undirected connected graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ contains a vertex set $\mathcal{V}=\{1,2,\ldots,m\}$ and an edge set $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$ such that $(i,j)\in\mathcal{E}$ if and only if $(j,i)\in\mathcal{E}$ for all $i,j\in\mathcal{V}$ . Denote $\mathcal{N}(i)$ the set of neighbors of node $i$ such that $j\in\mathcal{N}(i)$ if $(i,j)\in\mathcal{E}$ .

Consider a symmetric stochastic matrix $P\in\mathbb{R}^{|\mathcal{V}|\times|\mathcal{V}|}$ defined on the graph $\mathcal{G}$ such that $P_{ij}>0$ implies that $j\in\mathcal{N}(i)$ . Such a matrix $P$ can be constructed, for example, by the graph Laplacian [1, Proposition 3.18]. If $P$ is irreducible [20, Lem. 8.4.1], then $1$ is a simple eigenvalue of $P$ with eigenvectors spanned by $\mathbf{1}_{|\mathcal{V}|}$ .
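One simple way to obtain such a matrix from the graph Laplacian $L_{\mathcal{G}}$ is $P=I-\epsilon L_{\mathcal{G}}$ with a small enough step $\epsilon$. The sketch below is an illustrative assumption, not necessarily the construction of [1, Proposition 3.18]: it uses $\epsilon=1/(2d_{\max})$, where $d_{\max}$ is the largest node degree, so that $P$ is symmetric, stochastic, and also positive semi-definite (the eigenvalues of the Laplacian are at most $2d_{\max}$). The function name and the example graph are hypothetical.

```python
def laplacian_weight_matrix(num_nodes, edges):
    """Build P = I - eps * Laplacian with eps = 1 / (2 * max degree).

    `edges` lists each undirected edge (i, j) once, with i != j.
    """
    deg = [0] * num_nodes
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    eps = 1.0 / (2 * max(deg))
    P = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i, j in edges:
        # Off-diagonal weight eps on every edge, mirrored for symmetry.
        P[i][j] = eps
        P[j][i] = eps
    for i in range(num_nodes):
        # The diagonal absorbs the remaining mass so each row sums to one.
        P[i][i] = 1.0 - eps * deg[i]
    return P

# A small connected graph on four nodes (illustrative).
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
P = laplacian_weight_matrix(4, edges)
```

Note that this choice places positive weight on the diagonal, so each node also averages with its own variable.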

Let $\mathcal{G}=(\mathcal{V},\mathcal{E})$ denote the underlying graph over which problem (1) is defined. A common approach to solve problem (1) is to create local copies of the design variable $\{x_{1},x_{2},\ldots,x_{|\mathcal{V}|}\}$ and impose the consensus constraints: $x_{i}=x_{j}$ for all $(i,j)\in\mathcal{E}$ [21, 22]. Many different consensus constraints have been proposed [7, 23, 24, 25]. In this paper, we consider consensus constraints of the form:

$$(P\otimes I_{n})x=x, \tag{5}$$

where $x=[x^{\top}_{1},x^{\top}_{2},\ldots,x^{\top}_{|\mathcal{V}|}]^{\top}$ and $P$ is a symmetric, stochastic and irreducible matrix defined on $\mathcal{G}$. We will focus on the following reformulation of problem (1):

$$\begin{array}{ll}\underset{x\in\mathcal{X}^{|\mathcal{V}|}}{\mbox{minimize}}&\sum\limits_{i\in\mathcal{V}}f_{i}(x_{i})\\ \mbox{subject to}&(P\otimes I_{n})x=x.\end{array} \tag{6}$$

III Stochastic Bregman Parallel Direction Method of Multipliers

In this section, we first review BPDMM in Algorithm 1, then combine it with the stochastic node update in [13] and propose sBPDMM in Algorithm 2.

BPDMM [11] solves problem (6) with Algorithm 1, which combines the ideas of PDMM [8] and the Bregman augmented Lagrangian [10]. Each iteration of the algorithm includes the following steps:

(a) Mirror averaging. Step (8a) computes a nodal mirror average of neighboring nodes’ variables, and can be further decomposed as follows:

$$\nabla\Phi(z^{t})=(P\otimes I_{n})\nabla\Phi(x^{t}),\qquad y^{t}=\mathop{\rm argmin}_{y\in\mathcal{X}^{|\mathcal{V}|}}B_{\Phi}(y,z^{t}), \tag{7}$$

where $\Phi(x)=\sum_{i\in\mathcal{V}}\phi(x_{i})$. Therefore this step is equivalent to first applying $\nabla\Phi$ to $x^{t}$, then running an averaging step, followed by $(\nabla\Phi)^{-1}$, and finally a projection step. See Fig. 1 for an illustration.

(b) Local optimization. Step (8b) optimizes a nodal augmented Lagrangian. In particular, the Bregman divergence term in the objective of (8b) augments the nodal Lagrangian by penalizing the difference from the nodal mirror average.

(c) Disagreement integration. Step (9) is a discrete integration of the disagreement between neighboring nodes. Such integration is equivalent to a spring dynamics among neighboring nodes and improves the disturbance rejection performance of the algorithm. See [26, 27] for a detailed discussion.
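The four-stage reading of the mirror averaging step (apply $\nabla\Phi$, average, apply $(\nabla\Phi)^{-1}$, project) can be sketched for one node as follows. This is a minimal illustration under the assumption that $\phi$ is the negative entropy and $\mathcal{X}$ is the probability simplex, in which case the Bregman projection is an $l_{1}$ normalization; the function name and the sample data are hypothetical.

```python
import math

def mirror_average_entropy(P_row, xs):
    """Mirror average at one node for phi = negative entropy on the simplex.

    P_row: averaging weights P_ij for this node (nonnegative, summing to 1).
    xs:    the corresponding elementwise positive vectors x_j.
    """
    n = len(xs[0])
    # 1) Map to the dual space: grad phi(x) = 1 + ln x.
    duals = [[1.0 + math.log(xk) for xk in x] for x in xs]
    # 2) Weighted average of the dual variables.
    avg = [sum(p * d[k] for p, d in zip(P_row, duals)) for k in range(n)]
    # 3) Map back to the primal space: (grad phi)^{-1}(g) = exp(g - 1).
    z = [math.exp(g - 1.0) for g in avg]
    # 4) Bregman (KL) projection onto the simplex is an l1 normalization.
    total = sum(z)
    return [zk / total for zk in z]

# Illustrative weights and neighbor variables.
P_row = [0.5, 0.25, 0.25]
xs = [[0.2, 0.3, 0.5], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8]]
y = mirror_average_entropy(P_row, xs)
```

For this choice of $\phi$ the result coincides with a normalized weighted geometric mean of the neighbors' variables, which is why the step admits a closed form on the simplex.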

Both the mirror averaging step (8a) and the disagreement integration step (9) have closed-form updates when the constraint set $\mathcal{X}$ is structured, e.g., $\mathcal{X}$ is $\mathbb{R}^{n}$ or the probability simplex [11]. On the other hand, the local optimization step (8b) typically requires an iterative algorithm itself, e.g., the mirror descent method [28]. Hence the main computational effort of implementing Algorithm 1 is caused by the local optimization step (8b). At each iteration, Algorithm 1 requires at least $|\mathcal{V}|$ processors, one assigned to each node, to solve optimization (8b) in parallel. Such requirements are computationally demanding for large-scale networks.

In order to address this challenge, we propose Algorithm 2, which uses a stochastic node update [12, 13, 14]. Compared with Algorithm 1, each iteration of Algorithm 2 only executes the local optimization step on a set of randomly selected nodes, which requires fewer processors running in parallel. This flexibility reduces the requirements on the total computation power of the network, and allows BPDMM to be applicable to much larger-scale networks.

Although the generalization from Algorithm 1 to Algorithm 2 seems straightforward, the generalization in the corresponding convergence proof requires more careful treatment. In particular, the convergence proof of Algorithm 1 in [11] hinges on a monotonically non-increasing non-negative Lyapunov function for the full primal update in (8) with carefully chosen algorithm parameters. In order to generalize this proof to Algorithm 2, we need to answer the following questions:

• How to find a monotonically non-increasing non-negative Lyapunov function for the stochastic partial primal update in (10)?

• How does the randomly selected node set $\mathcal{S}_{t+1}$ affect the choice of algorithm parameters?

In the sequel, we aim to answer these questions and establish the convergence proof of Algorithm 2.

IV Convergence

In this section, we prove the global convergence as well as the $O(1/T)$ iteration complexity of Algorithm 2. All detailed proofs in this section can be found in the Appendix.

We first group our assumptions in Assumption 1.

Assumption 1.

(a) Functions $f_{i}:\mathbb{R}^{n}\to\mathbb{R}\cup\{+\infty\}$ are closed, proper and convex for all $i\in\mathcal{V}$.

(b) Set $\mathcal{X}\subset\mathbb{R}^{n}$ is closed and convex. There exists a saddle point $(x^{\star},\mu^{\star})$ such that $x^{\star}_{i}\in\mathcal{X}$ and

$$x_{i}^{\star}=\sum_{j\in\mathcal{V}}P_{ij}x_{j}^{\star}, \tag{12a}$$

$$-\mu_{i}^{\star}+\sum_{j\in\mathcal{V}}P_{ij}\mu_{j}^{\star}\in\partial(f_{i}+\delta_{\mathcal{X}})(x_{i}^{\star}) \tag{12b}$$

for all $i\in\mathcal{V}$.

(c) Function $\phi:\mathcal{D}\to\mathbb{R}$ is a mirror map, where $\mathcal{D}$ is an open convex set such that $\mathcal{X}$ is included in its closure. In addition, function $\phi$ is $\alpha$-strongly convex with respect to the $l_{p}$-norm, i.e., for any $u,v\in\mathcal{X}$,

$$B_{\phi}(u,v)\geq\frac{\alpha}{2}\left\lVert u-v\right\rVert_{p}^{2}. \tag{13}$$

(d) Matrix $P$ is symmetric, stochastic, irreducible and positive semi-definite.

(e) At each iteration $t+1$, we assume $|\mathcal{S}_{t+1}|/|\mathcal{V}|=\omega$, $0<\omega<1$.

Now we start to construct the convergence proof of Algorithm 2 under Assumption 1. The optimality condition of (10b) is that for all $i\in\mathcal{S}_{t+1}$ ,

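$$-\mu_{i}^{t}+\sum_{j\in\mathcal{V}}P_{ij}\mu_{j}^{t}-\rho\big(\nabla\phi(x_{i}^{t+1})-\nabla\phi(y_{i}^{t})\big)\in\partial(f_{i}+\delta_{\mathcal{X}})(x_{i}^{t+1}). \tag{14}$$

Define the residuals of optimality conditions (14) at iteration $t$ as

$$R(t+1)\coloneqq\omega\big(L(x^{t},\mu^{\star})-L(x^{\star},\mu^{\star})\big)+\rho\sum_{i\in\mathcal{S}_{t+1}}B_{\phi}(x_{i}^{t+1},y_{i}^{t})+\frac{\gamma\rho}{2}\left\lVert((I_{|\mathcal{V}|}-P)\otimes I_{n})x^{t}\right\rVert_{2}^{2}, \tag{15}$$

where $\gamma>0$ and the Lagrangian $L(x,\mu)$ is defined as

$$L(x,\mu)=\sum_{i\in\mathcal{V}}(f_{i}+\delta_{\mathcal{X}})(x_{i})+\langle\mu,((I_{|\mathcal{V}|}-P)\otimes I_{n})x\rangle. \tag{16}$$

Using (12) and (2) we can show the following:

$$L(x^{t},\mu^{\star})-L(x^{\star},\mu^{\star})\geq 0. \tag{17}$$

Hence $L(x^{t},\mu^{\star})-L(x^{\star},\mu^{\star})$ defines a running duality gap that measures distance to optimality [8]. Notice that given $x^{t}$, $R(t+1)$ is a random variable that depends only on $\mathcal{S}_{t+1}$, and $\mathds{E}_{\mathcal{S}_{t+1}}\left[R(t+1)\right]=0$ implies that $L(x^{t},\mu^{\star})=L(x^{\star},\mu^{\star})$ and $x_{i}^{t}=x_{j}^{t}$ for all $i,j\in\mathcal{V}$, i.e., both optimality and consensus are achieved.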

In order to show $\mathds{E}_{\mathcal{S}_{t+1}}\left[R(t+1)\right]=0$ , we define the following Lyapunov function of Algorithm 2

[TABLE]

where

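$$H(x^{t},\mu^{t})=L(x^{t},\mu^{t})-L(x^{\star},\mu^{\star})-\tau\left\lVert(Q\otimes I_{n})x^{t}\right\rVert_{2}^{2} \tag{19}$$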

with $Q=I_{|\mathcal{V}|}-P$ and $\mu^{-1}\coloneqq\mu^{0}-\tau((I_{|\mathcal{V}|}-P)\otimes I_{n})x^{0}$ .

Compared with the one used in [11], the Lyapunov function $V(t)$ defined by (18) contains a generalized Lagrangian $H(x^{t},\mu^{t})$ , which renders the positive definiteness of $V(t)$ unclear. The following lemma shows that $V(t)$ is indeed positive definite, and lower bounded by a Bregman divergence to the optimum.

Lemma 2.

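Suppose Assumption 1 holds. If

$$\tau\leq\frac{\rho(\omega\alpha\sigma-\gamma)}{2-\omega},\quad 0<\gamma<\omega\alpha\sigma, \tag{20}$$

where $\sigma=\min\{1,n^{\frac{2}{p}-1}\}$, and $p$ and $\alpha$ are defined in (13), then the Lyapunov function defined in (18) satisfies

$$V(t)\geq\frac{(1-\omega)\omega\alpha\sigma\rho+\gamma\rho}{(2-\omega)\omega\alpha\sigma}\sum_{i\in\mathcal{V}}B_{\phi}(x_{i}^{\star},x_{i}^{t}). \tag{21}$$

The sketch of the proof is as follows. Using equations (12b) and (11), we can show

$$H(x^{t},\mu^{t})\geq-\frac{\omega}{2\tau}\left\lVert\mu^{t-1}-\mu^{\star}\right\rVert_{2}^{2}-\frac{1}{2\omega\tau}\left\lVert\mu^{t}-\mu^{t-1}\right\rVert_{2}^{2}.$$

In addition, equation (11) and Assumption 1, particularly the assumptions on function $\phi$ and matrix $P$, ensure that

$$-\frac{1}{2\omega\tau}\left\lVert\mu^{t}-\mu^{t-1}\right\rVert_{2}^{2}+\frac{\tau}{2\omega\sigma}\sum_{i\in\mathcal{V}}B_{\phi}(x_{i}^{\star},x_{i}^{t})\geq 0.$$

Substituting these two inequalities into (18) and using (13), we can show $V(t)\geq(\rho-\frac{\tau}{\omega\alpha\sigma})\sum_{i\in\mathcal{V}}B_{\phi}(x_{i}^{\star},x_{i}^{t})$, which, due to the assumption in (20), finally reduces to (21). Then positive definiteness of $V(t)$ follows from the positive definiteness of the Bregman divergence and the fact that $\frac{(1-\omega)\omega\alpha\sigma\rho+\gamma\rho}{(2-\omega)\omega\alpha\sigma}>0$ when $0<\omega<1$.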
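The parameter condition (20) and the coefficient in (21) are easy to evaluate numerically. The sketch below assumes condition (20) reads $\tau\leq\rho(\omega\alpha\sigma-\gamma)/(2-\omega)$ with $0<\gamma<\omega\alpha\sigma$; the helper names are hypothetical and the sample values are chosen only for illustration.

```python
def sigma(n, p):
    """sigma = min{1, n^(2/p - 1)} for dimension n and a finite l_p norm."""
    return min(1.0, n ** (2.0 / p - 1.0))

def max_tau(rho, omega, alpha, sig, gamma):
    """Largest tau permitted by the bound tau <= rho*(omega*alpha*sig - gamma)/(2 - omega)."""
    if not 0.0 < omega < 1.0:
        raise ValueError("omega must lie strictly between 0 and 1")
    if not 0.0 < gamma < omega * alpha * sig:
        raise ValueError("gamma must satisfy 0 < gamma < omega * alpha * sigma")
    return rho * (omega * alpha * sig - gamma) / (2.0 - omega)

def lyapunov_coeff(rho, omega, alpha, sig, gamma):
    """Coefficient multiplying the sum of Bregman divergences in the lower bound on V(t)."""
    num = (1.0 - omega) * omega * alpha * sig * rho + gamma * rho
    den = (2.0 - omega) * omega * alpha * sig
    return num / den

# Illustrative values: entropy mirror map (alpha = 1, p = 1) in dimension 100,
# half of the nodes updated per iteration, gamma = omega / 2, rho = 1.
omega, rho, alpha = 0.5, 1.0, 1.0
sig = sigma(100, 1)
gamma = omega / 2.0
tau = max_tau(rho, omega, alpha, sig, gamma)
coeff = lyapunov_coeff(rho, omega, alpha, sig, gamma)
```

For these values the largest admissible step equals $\omega/(4-2\omega)$, and the coefficient equals $\rho-\tau/(\omega\alpha\sigma)$ evaluated at that step, which is the substitution made in the last step of the proof sketch.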

Notice that $V(t)$ is a random variable whose value depends on the realization of $\mathcal{S}_{1:t}$ , which is the history of selected node sets, i.e., $\{\mathcal{S}_{1},\mathcal{S}_{2},\ldots,\mathcal{S}_{t}\}$ . The following theorem shows that the expected value of $V(t)$ conditioned on $\mathcal{S}_{1:t}$ , i.e., $\mathds{E}_{\mathcal{S}_{1:t}}[V(t)]$ is monotonically non-increasing with respect to $t$ .

Theorem 1 (Global convergence).

Suppose that Assumption 1 holds. Let the sequence $\{y^{t},x^{t},\mu^{t}\}$ be generated by Algorithm 2. Let $R(t+1)$ and $V(t)$ be defined as in (15) and (18), respectively. If $\rho,\tau,\gamma,\omega$ satisfy (20), then we have the following monotonicity relation:

$$\mathds{E}_{\mathcal{S}_{1:t}}[V(t)]-\mathds{E}_{\mathcal{S}_{1:t+1}}[V(t+1)]\geq\mathds{E}_{\mathcal{S}_{1:t+1}}[R(t+1)].$$

The sketch of the proof is as follows. We substitute the subgradient in (14) into (2) and obtain an inequality. Using the three-point identity (4), we can split the right-hand side of this inequality into three parts, which contribute to $R(t+1)$, $V(t)$ and $V(t+1)$, respectively. Taking the expectation over the realization of $\mathcal{S}_{t+1}$ conditioned on the value of $x^{t}$, we obtain the following relation:

$$\mathds{E}_{\mathcal{S}_{t+1}}[R(t+1)]\leq V(t)-\mathds{E}_{\mathcal{S}_{t+1}}[V(t+1)], \tag{22}$$

where the assumptions in Assumption 1 and (20) ensure that all intermediate terms cancel each other. Taking the expectation over the realization of $\mathcal{S}_{1:t}$ on both sides of (22), we reach the inequality in Theorem 1.

Summing the inequality in Theorem 1 from the case of $t=0$ to $t={T-1}$ we have

$$\sum_{t=1}^{T}\mathds{E}_{\mathcal{S}_{1:t}}[R(t)]\leq V(0). \tag{23}$$

Since $\mathds{E}_{\mathcal{S}_{1:t}}[R(t)]\geq 0$ for all $t$, inequality (23) implies that $\mathds{E}_{\mathcal{S}_{1:t}}[R(t)]\to 0$ as $t\to\infty$, which establishes the global convergence of Algorithm 2. In addition, if we apply Jensen’s inequality to (23), we obtain the following corollary, which shows the $O(1/T)$ iteration complexity of Algorithm 2 in an ergodic sense.

Corollary 1 (Iteration complexity).

Suppose that Assumption 1 holds. Let the sequence $\{y^{t},x^{t},\mu^{t}\}$ be generated by Algorithm 2. Let $V(t)$ be defined as in (18), $\overline{x}^{T}=\frac{1}{T}\sum_{t=0}^{T-1}x^{t}$ . If $\rho,\tau,\gamma,\omega$ satisfy (20), then

[TABLE]

The bound on running duality gap was used in [8].
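The telescoping step behind (23) can be spelled out as follows. This is a short worked derivation under the stated assumptions (the monotonicity relation of Theorem 1 and $V(T)\geq 0$ from Lemma 2). The last line, a bound on the smallest expected residual, is an elementary consequence added here for illustration; it is not the statement of Corollary 1.

```latex
\begin{align*}
\sum_{t=1}^{T}\mathds{E}_{\mathcal{S}_{1:t}}[R(t)]
 &= \sum_{t=0}^{T-1}\mathds{E}_{\mathcal{S}_{1:t+1}}[R(t+1)] \\
 &\leq \sum_{t=0}^{T-1}\Big(\mathds{E}_{\mathcal{S}_{1:t}}[V(t)]
      -\mathds{E}_{\mathcal{S}_{1:t+1}}[V(t+1)]\Big) \\
 &= V(0)-\mathds{E}_{\mathcal{S}_{1:T}}[V(T)] \;\leq\; V(0), \\
\min_{1\leq t\leq T}\mathds{E}_{\mathcal{S}_{1:t}}[R(t)]
 &\leq \frac{1}{T}\sum_{t=1}^{T}\mathds{E}_{\mathcal{S}_{1:t}}[R(t)]
 \;\leq\; \frac{V(0)}{T}.
\end{align*}
```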

V Numerical examples

In this section, we demonstrate the effectiveness and efficiency of Algorithm 2 via numerical examples.

Consider an instance of problem (1) where $f_{i}(x_{i})=\langle c_{i},x_{i}\rangle$, $\mathcal{X}=\{u\in\mathbb{R}^{n}_{+}|\left\lVert u\right\rVert_{1}=1\}$ is the probability simplex, and $\mathcal{G}=(\mathcal{V},\mathcal{E})$ is an undirected connected communication graph. Such an optimization can model, for example, multi-agent decision making, where $c_{i}$ is the cost of agent $i$ for choosing policy $x_{i}$.

We generate an instance of this optimization where the entries of $c_{1},\ldots,c_{|\mathcal{V}|}\in\mathbb{R}^{100}$ are sampled from the standard normal distribution. $\mathcal{G}$ is randomly generated with $|\mathcal{V}|=100$ and edge probability $0.2$ [1, p. 90]. Matrix $P$ is obtained by minimizing its second largest eigenvalue (in this case, $\lambda_{2}(P)=0.4786$) while preserving graph adjacency constraints. We choose the following parameters in Algorithm 2:

• $\phi(u)=\sum_{k=1}^{n}u[k]\ln u[k]$, where $u[k]$ denotes the $k$-th element of vector $u$. Then the assumption in (13) is satisfied by $\alpha=1,p=1$ (see Remark 1 in [10]).

• $\rho=1,\tau=\omega/(4-2\omega)$. Notice that the assumptions in (20) are satisfied with $\gamma=\omega/2$.

With these choices, the mirror averaging step (10a) and the local optimization step (10b) reduce to the following (see Section 4.3 in [18] for details)

[TABLE]

where multiplication, power and exponential operation on vectors are all elementwise, and $\text{Proj}[u]=u/\left\lVert u\right\rVert_{1}$ for all $u\in\mathbb{R}^{n}$ . Update (24) amounts to elementwise operation that allows massive parallel implementation.
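A small end-to-end sketch of this setting is given below. It is an illustration, not a transcription of the update (24): the closed forms used here are the ones obtained by working out the mirror averaging step (10a) and the local optimization step (10b) for the negative entropy and a linear cost, namely a normalized weighted geometric mean followed by a normalized multiplicative update. The function names, the ring graph, the problem size and the choice to store logarithms for numerical robustness are all assumptions made for this example. The sums run over every $j$ with $P_{ij}>0$, including $j=i$.

```python
import math
import random

def log_normalize(v):
    """Return v - logsumexp(v): the log of exp(v) after l1 normalization."""
    top = max(v)
    lse = top + math.log(sum(math.exp(vk - top) for vk in v))
    return [vk - lse for vk in v]

def sbpdmm_simplex(P, c, omega, rho=1.0, iters=200, seed=0):
    """Stochastic node updates for linear costs <c_i, x_i> on the simplex."""
    rng = random.Random(seed)
    m, n = len(c), len(c[0])
    tau = omega / (4.0 - 2.0 * omega)          # step size quoted in the text
    batch = max(1, round(omega * m))           # number of nodes updated per iteration
    logx = [[-math.log(n)] * n for _ in range(m)]   # uniform starting point
    mu = [[0.0] * n for _ in range(m)]
    for _ in range(iters):
        selected = rng.sample(range(m), batch)
        new_logx = [row[:] for row in logx]    # unselected nodes keep their value
        for i in selected:
            # Mirror averaging: normalized weighted geometric mean, in log form.
            logy = log_normalize(
                [sum(P[i][j] * logx[j][k] for j in range(m)) for k in range(n)])
            # Local optimization: multiplicative update driven by cost and multipliers.
            drift = [c[i][k] + mu[i][k] - sum(P[i][j] * mu[j][k] for j in range(m))
                     for k in range(n)]
            new_logx[i] = log_normalize([logy[k] - drift[k] / rho for k in range(n)])
        logx = new_logx
        x = [[math.exp(v) for v in row] for row in logx]
        # Disagreement integration on every node, using the new primal variables.
        mu = [[mu[i][k] + tau * (x[i][k] - sum(P[i][j] * x[j][k] for j in range(m)))
               for k in range(n)] for i in range(m)]
    return [[math.exp(v) for v in row] for row in logx], mu

# Ring of six nodes with self-weight 1/2 and weight 1/4 on each neighbor
# (symmetric, stochastic and positive semi-definite).
m, n = 6, 4
P = [[0.0] * m for _ in range(m)]
for i in range(m):
    P[i][i] = 0.5
    P[i][(i + 1) % m] = 0.25
    P[i][(i - 1) % m] = 0.25
data_rng = random.Random(1)
c = [[data_rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
x, mu = sbpdmm_simplex(P, c, omega=0.5, iters=300)
```

Because every operation is elementwise apart from the neighbor sums, the per-node work parallelizes naturally. Since $P$ is doubly stochastic, the multipliers also keep a zero sum across nodes when started from zero, which is a convenient sanity check.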

We demonstrate the convergence performance of Algorithm 2 in Fig. 2 and Fig. 3, where $f^{t}$ and $f^{\star}$ are the objective function values achieved at iteration $t$ and at optimality, respectively. In particular, Fig. 2 shows that as $\omega$ increases, the convergence of Algorithm 2 becomes faster and less oscillatory, because more nodes get updated at each iteration. Fig. 3 shows that when we choose $\phi$ as the negative entropy function rather than a quadratic function, the convergence speed is improved dramatically. This is because, compared with a quadratic function, the negative entropy function exploits the structure of the probability simplex much better. Such improvement demonstrates the advantage of Algorithm 2 over stochastic multiplier methods based on quadratic augmentation [12, 13, 14].

VI Conclusions

In this paper, we generalize BPDMM [11] to stochastic BPDMM, where each iteration only solves local optimization on a randomly selected subset of nodes rather than all the nodes in the network. This generalization requires fewer processors running in parallel, and hence allows application to much larger-scale networks. Future directions include generalization to directed and time-varying networks.

APPENDIX

For notational simplicity, we let $Q\coloneqq I_{|\mathcal{V}|}-P$. Suppose Assumption 1 holds; then the nullspace of $I_{|\mathcal{V}|}-P$ is spanned by $\mathbf{1}_{|\mathcal{V}|}$. In addition, Assumption 1 and update rule (10) ensure that

[TABLE]

for all $t$ . We will need the following lemmas.

Lemma 3.

Let

[TABLE]

for all $i\in\mathcal{V}$ . Then for any $u\in\mathcal{X}$ ,

[TABLE]

Proof.

Equation (26) holds if and only if: for any $u\in\mathcal{X}$ ,

[TABLE]

Using three point property (4), we have

[TABLE]

Summing (28) over all $i\in\mathcal{V}$ completes the proof. ∎

Lemma 4.

Suppose Assumption 1 holds. Then

[TABLE]

for all $u,v\in\mathcal{X}^{|\mathcal{V}|}$, where $\left\lVert\cdot\right\rVert_{p}$ denotes the $l_{p}$ norm and $\sigma=\min\{1,n^{\frac{2}{p}-1}\}$.

Proof.

First, observe that if $P$ is symmetric, stochastic, irreducible and positive semi-definite, $P-P^{2}$ is positive semi-definite [20, Theorem 8.4.4]. Since $P\mathbf{1}_{|\mathcal{V}|}=P^{\top}\mathbf{1}_{|\mathcal{V}|}=\mathbf{1}_{|\mathcal{V}|}$ , we can show the following

[TABLE]

Hence (29) holds due to the fact that

[TABLE]

for all $i\in\mathcal{V}$ , and that $\left\lVert w\right\rVert_{2}^{2}\leq 1/\sigma\left\lVert w\right\rVert_{p}^{2}$ for all $w\in\mathbb{R}^{n}$ where $\sigma=\min\{1,n^{\frac{2}{p}-1}\}$ . ∎

VI-A Lemma 2

Proof.

Using (25a) and (16) we can show that

[TABLE]

Substitute (30) into (19) we have

[TABLE]

where the last step is due to $2\langle a,b\rangle\geq-\left\lVert a\right\rVert_{2}^{2}-\left\lVert b\right\rVert_{2}^{2}$ . Therefore, substitute (31) into (18) we have

[TABLE]

Since $x^{\star}_{i}=x_{j}^{\star}$ for all $i,j\in\mathcal{V}$ , we have

[TABLE]

Substitute the above inequality into (32) we obtain (21). ∎

VI-B Theorem 1

Proof.

Let $q_{i}$ be the $i$-th column of $Q$. Since $f+\delta_{\mathcal{X}}$ is convex, the subgradient in (14) satisfies the following

[TABLE]

where we use (25b).

The first term on the RHS of (33) can be rewritten as

[TABLE]

To simplify the second term on the RHS of (33), notice that

[TABLE]

Substitute (34) and (35) into (33), we have

[TABLE]

In addition, notice that

[TABLE]

Substitute (36) into (37), we have

[TABLE]

where we use the definition in (15).

Taking the expectation of (38) over $\mathcal{S}_{t+1}$ conditioned on $x^{t}$ , we have the following

[TABLE]

where we use (25a). Here we assume $y^{t}_{i}$ is computed as in (8a) for all nodes in $\mathcal{V}$, even though Algorithm 2 only requires computation on nodes in $\mathcal{S}_{t+1}$. Substituting (25b) into (16), we have

[TABLE]

Combine (39) and (40) we have

[TABLE]

Using (4) and (11) we can show

[TABLE]

Substituting (42) into (41) and using the definition in (18), we have

[TABLE]

Since

[TABLE]

Substitute (42) into (41) we have

[TABLE]

Taking the expectation of (41) over realization of $\mathcal{S}_{1:t}$ we obtain the desired results. ∎

Bibliography

[1] M. Mesbahi and M. Egerstedt, Graph Theoretic Methods in Multiagent Networks. Princeton University Press, 2010.
[2] D. Li, K. D. Wong, Y. H. Hu, and A. M. Sayeed, “Detection, classification, and tracking of targets,” IEEE Signal Process. Mag., vol. 19, no. 2, pp. 17–29, 2002.
[3] B. Açıkmeşe, M. Mandić, and J. L. Speyer, “Decentralized observers with consensus filters for distributed discrete-time linear systems,” Automatica, vol. 50, no. 4, pp. 1037–1052, 2014.
[4] V. Lesser, C. L. Ortiz Jr, and M. Tambe, Distributed Sensor Networks: A Multiagent Perspective. Springer Science & Business Media, 2012, vol. 9.
[5] B. Gholami, S. Yoon, and V. Pavlovic, “Decentralized approximate Bayesian inference for distributed sensor network,” in AAAI Conf. Artificial Intell., 2016, pp. 1582–1588.
[6] A. Yahya, A. Li, M. Kalakrishnan, Y. Chebotar, and S. Levine, “Collective robot reinforcement learning with distributed asynchronous guided policy search,” in Int. Conf. Intell. Robots Syst. IEEE, 2017, pp. 79–86.
[7] E. Wei and A. Ozdaglar, “Distributed alternating direction method of multipliers,” in Proc. IEEE Conf. Decision Control, 2012, pp. 5445–5450.
[8] D. Meng, M. Fazel, and M. Mesbahi, “Proximal alternating direction method of multipliers for distributed optimization on weighted graphs,” in Proc. IEEE Conf. Decision Control, 2015, pp. 1396–1401.
