Linear Convergence of Primal-Dual Gradient Methods and their Performance in Distributed Optimization

Sulaiman A. Alghunaim; Ali H. Sayed

arXiv:1904.01196·math.OC·January 17, 2020·Autom.


TL;DR

This paper proves the linear convergence of primal-dual gradient methods for smooth strongly-convex problems and analyzes how augmented Lagrangian penalties affect distributed optimization performance.

Contribution

It provides a concise proof of exponential convergence and explores the impact of augmented Lagrangian terms in distributed settings.

Findings

01

Proves linear convergence of primal-dual gradient methods for strongly-convex functions.

02

Analyzes the effect of augmented Lagrangian penalties on distributed optimization.

03

Relates incremental and non-incremental implementations of the algorithm.

Abstract

In this work, we revisit a classical incremental implementation of the primal-descent dual-ascent gradient method used for the solution of equality constrained optimization problems. We provide a short proof that establishes the linear (exponential) convergence of the algorithm for smooth strongly-convex cost functions and study its relation to the non-incremental implementation. We also study the effect of the augmented Lagrangian penalty term on the performance of distributed optimization algorithms for the minimization of aggregate cost functions over multi-agent networks.


Equations

$$\underset{w\in{\mathbb{R}}^{M}}{\text{minimize}}\quad J(w),\quad{\rm s.t.}\quad Bw=b$$

$$\min_{w\in{\mathbb{R}}^{M}}\ \max_{\lambda\in{\mathbb{R}}^{E}}\ L_{\rho}(w,\lambda)$$

$$L_{\rho}(w,\lambda)\;\stackrel{\Delta}{=}\;J(w)+{\rho\over 2}\|Bw-b\|^{2}+\lambda^{\mathsf{T}}(Bw-b)$$

$$\begin{aligned}w_{i}&=w_{i-1}-\mu_{w}\big({\nabla}J_{\rho}(w_{i-1})+B^{\mathsf{T}}\lambda_{i-1}\big)\\ \lambda_{i}&=\lambda_{i-1}+\mu_{\lambda}(Bw_{i}-b)\end{aligned}$$

$$(x-w^{\star})^{\mathsf{T}}\big({\nabla}J_{\rho}(x)-{\nabla}J_{\rho}(w^{\star})\big)\geq\nu_{\rho}\|x-w^{\star}\|^{2},\quad\forall\ x$$

$$\begin{aligned}{\nabla}J(w^{\star})+B^{\mathsf{T}}\lambda^{\star}&=0\\ Bw^{\star}-b&=0\end{aligned}$$

$$\|B^{\mathsf{T}}\lambda_{x}\|^{2}\geq\underline{\sigma}^{2}(B)\|\lambda_{x}\|^{2}$$

$$\|B^{\mathsf{T}}\lambda_{x}\|^{2}=u^{\mathsf{T}}\Sigma_{r}u\geq\underline{\sigma}^{2}(B)\|u\|^{2}=\underline{\sigma}^{2}(B)\,x^{\mathsf{T}}U_{r}\Sigma_{r}U_{r}^{\mathsf{T}}x$$

$$\mu_{w}<{1\over\delta_{\rho}},\quad\mu_{\lambda}\leq{\nu_{\rho}\over\sigma^{2}_{\max}(B)}$$

$$\|\widetilde{w}_{i}\|_{c_{w}}^{2}+\|\widetilde{\lambda}_{i}\|_{c_{\lambda}}^{2}\leq\gamma\left(\|\widetilde{w}_{i-1}\|_{c_{w}}^{2}+\|\widetilde{\lambda}_{i-1}\|_{c_{\lambda}}^{2}\right)$$

$$\gamma\;\stackrel{\Delta}{=}\;\max\left\{1-\mu_{w}\nu_{\rho}(1-\mu_{w}\delta_{\rho}),\ 1-\mu_{w}\mu_{\lambda}\underline{\sigma}^{2}(B)\right\}<1$$

$$\begin{aligned}\widetilde{w}_{i}&=\widetilde{w}_{i-1}-\mu_{w}\big({\nabla}J_{\rho}(w_{i-1})-{\nabla}J_{\rho}(w^{\star})\big)-\mu_{w}B^{\mathsf{T}}\widetilde{\lambda}_{i-1}\\ \widetilde{\lambda}_{i}&=\widetilde{\lambda}_{i-1}+\mu_{\lambda}B\widetilde{w}_{i}\end{aligned}$$

$$\begin{aligned}\|\widetilde{w}_{i}\|^{2}&=\|\widetilde{w}_{i-1}-\mu_{w}\big({\nabla}J_{\rho}(w_{i-1})-{\nabla}J_{\rho}(w^{\star})\big)\|^{2}\\ &\quad-2\mu_{w}\widetilde{\lambda}_{i-1}^{\mathsf{T}}B\left(\widetilde{w}_{i-1}-\mu_{w}\big({\nabla}J_{\rho}(w_{i-1})-{\nabla}J_{\rho}(w^{\star})\big)\right)+\mu_{w}^{2}\|B^{\mathsf{T}}\widetilde{\lambda}_{i-1}\|^{2}\end{aligned}$$

$$\begin{aligned}\|\widetilde{\lambda}_{i}\|^{2}&=\|\widetilde{\lambda}_{i-1}\|^{2}+\mu_{\lambda}^{2}\|B\widetilde{w}_{i}\|^{2}-2\mu_{\lambda}\mu_{w}\|B^{\mathsf{T}}\widetilde{\lambda}_{i-1}\|^{2}\\ &\quad+2\mu_{\lambda}\widetilde{\lambda}_{i-1}^{\mathsf{T}}B\left(\widetilde{w}_{i-1}-\mu_{w}\big({\nabla}J_{\rho}(w_{i-1})-{\nabla}J_{\rho}(w^{\star})\big)\right)\end{aligned}$$

$$\|\widetilde{w}_{i}\|_{c_{w}}^{2}+\|\widetilde{\lambda}_{i}\|_{c_{\lambda}}^{2}\leq\|\widetilde{w}_{i-1}-\mu_{w}\big({\nabla}J_{\rho}(w_{i-1})-{\nabla}J_{\rho}(w^{\star})\big)\|^{2}+\|\widetilde{\lambda}_{i-1}\|_{c_{\lambda}}^{2}-\mu_{w}^{2}\|B^{\mathsf{T}}\widetilde{\lambda}_{i-1}\|^{2}$$

$$\|\widetilde{w}_{i}\|_{c_{w}}^{2}+\|\widetilde{\lambda}_{i}\|_{c_{\lambda}}^{2}\leq\|\widetilde{w}_{i-1}-\mu_{w}\big({\nabla}J_{\rho}(w_{i-1})-{\nabla}J_{\rho}(w^{\star})\big)\|^{2}+\big(1-\mu_{w}\mu_{\lambda}\underline{\sigma}^{2}(B)\big)\|\widetilde{\lambda}_{i-1}\|_{c_{\lambda}}^{2}$$

$$\|\widetilde{w}_{i-1}-\mu_{w}\big({\nabla}J_{\rho}(w_{i-1})-{\nabla}J_{\rho}(w^{\star})\big)\|^{2}\leq\big(1-\mu_{w}\nu_{\rho}(2-\mu_{w}\delta_{\rho})\big)\|\widetilde{w}_{i-1}\|^{2}$$

$$\begin{aligned}\big(1-\mu_{w}\nu_{\rho}(2-\mu_{w}\delta_{\rho})\big)\|\widetilde{w}_{i-1}\|^{2}&=\gamma_{1}\|\widetilde{w}_{i-1}\|^{2}-\mu_{w}\nu_{\rho}\|\widetilde{w}_{i-1}\|^{2}\\ &=\gamma_{1}\|\widetilde{w}_{i-1}\|_{c_{w}}^{2}-\mu_{w}\big(\nu_{\rho}-\mu_{\lambda}\sigma^{2}_{\max}(B)\gamma_{1}\big)\|\widetilde{w}_{i-1}\|^{2}\\ &\leq\gamma_{1}\|\widetilde{w}_{i-1}\|_{c_{w}}^{2}\end{aligned}$$

$$\mu_{w}\mu_{\lambda}<{\nu_{\rho}\over\delta_{\rho}\sigma^{2}_{\max}(B)}\leq{1\over\sigma^{2}_{\max}(B)}$$

$$\begin{aligned}w_{i}&=w_{i-1}-\mu_{w}\big({\nabla}J_{\eta}(w_{i-1})+B^{\mathsf{T}}\lambda^{\prime}_{i-1}\big)\\ \lambda^{\prime}_{i}&=\lambda^{\prime}_{i-1}+\mu_{\lambda}(Bw_{i-1}-b)\end{aligned}$$

$$w_{i}=w_{i-1}-\mu_{w}\big({\nabla}J_{\rho}(w_{i-1})+B^{\mathsf{T}}\lambda_{i-1}\big)$$

$$\mu_{w}<{1\over\delta-\mu_{\lambda}\sigma^{2}_{\min}(B)},\quad\mu_{\lambda}\leq{\nu\over 2\sigma^{2}_{\max}(B)}$$

$$w_{i}=w_{i-1}-\mu_{w}\big({\nabla}J(w_{i-1})+B^{\mathsf{T}}\lambda^{\prime}_{i-1}\big)$$


Full text


Sulaiman A. Alghunaim and Ali H. Sayed, Fellow, IEEE

This work was supported in part by grant 205121-184999 from the Swiss National Science Foundation. S. A. Alghunaim is with the ECE Department, University of California at Los Angeles (UCLA). Email: [email protected]. A. H. Sayed is with the Ecole Polytechnique Federale de Lausanne (EPFL), School of Engineering, CH-1015 Lausanne, Switzerland. Email: [email protected].


Index Terms:

Primal-dual methods, linear convergence, Arrow-Hurwicz, augmented Lagrangian, distributed optimization.

I Introduction

Consider the constrained optimization problem:

$$\underset{w\in{\mathbb{R}}^{M}}{\text{minimize}}\quad J(w),\quad{\rm s.t.}\quad Bw=b \tag{1}$$

where $J(w):{\mathbb{R}}^{M}\rightarrow{\mathbb{R}}$ is a smooth function assumed to satisfy Assumption 1 further ahead, $B\in{\mathbb{R}}^{E\times M}$ , and $b\in{\mathbb{R}}^{E}$ . Consider also the saddle point problem:

$$\min_{w\in{\mathbb{R}}^{M}}\ \max_{\lambda\in{\mathbb{R}}^{E}}\ L_{\rho}(w,\lambda) \tag{2}$$

where

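$$L_{\rho}(w,\lambda)\;\stackrel{\Delta}{=}\;J(w)+{\rho\over 2}\|Bw-b\|^{2}+\lambda^{\mathsf{T}}(Bw-b) \tag{3}$$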

is the augmented Lagrangian of problem (1), $\lambda$ is a dual variable, and $\rho\geq 0$ is the augmented Lagrangian penalty parameter. Note that for $\rho=0$ , $L_{0}(w,\lambda)$ becomes the classical Lagrangian of problem (1). If a point $(w^{\star},\lambda^{\star})$ exists that solves (2), then $w^{\star}$ is an optimal solution to the constrained problem (1) when strong duality holds, which is the case under our assumptions [1]. A classical algorithm that solves (2) is the primal-dual (PD) gradient algorithm (4):

$$\begin{aligned}w_{i}&=w_{i-1}-\mu_{w}\big({\nabla}J_{\rho}(w_{i-1})+B^{\mathsf{T}}\lambda_{i-1}\big)&&\text{(4a)}\\ \lambda_{i}&=\lambda_{i-1}+\mu_{\lambda}(Bw_{i}-b)&&\text{(4b)}\end{aligned}$$

In this algorithm, ${\nabla}J_{\rho}(w)$ denotes the gradient of $J_{\rho}(w)=J(w)+{\rho\over 2}\|Bw-b\|^{2}$ evaluated at $w$ and $(\mu_{w},\mu_{\lambda})$ are positive step-sizes (learning rates) chosen by the designer. The updates in (4) are primal-descent dual-ascent steps applied to (3), and they subsume the classical Lagrangian implementation when $\rho=0$ and the augmented Lagrangian implementation when $\rho>0$ . Note that the updates in (4) are incremental since the dual update (4b) uses the most recent primal variable $w_{i}$ and not $w_{i-1}$ . If the dual update uses the previous primal iterate $w_{i-1}$ , then we refer to the update as non-incremental.
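The incremental recursion described above (a primal gradient step on $J_{\rho}$ plus $B^{\mathsf{T}}\lambda$, followed by a dual step that uses the new primal iterate) can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function name `pd_gradient_incremental`, the toy cost $J(w)={1\over 2}\|w-c\|^{2}$, the constraint $w_{1}+w_{2}=1$, and the step-sizes are assumptions chosen for the example, not taken from the paper.

```python
def pd_gradient_incremental(grad_J, B, b, w0, lam0, mu_w, mu_lam, rho=0.0, iters=200):
    """Sketch of the incremental primal-dual gradient recursion.

    grad_J: callable returning the gradient of J at w (list of floats).
    B: E x M matrix given as a list of rows; b: list of length E.
    """
    E, M = len(B), len(B[0])

    def residual(w):
        # Constraint residual B w - b.
        return [sum(B[e][m] * w[m] for m in range(M)) - b[e] for e in range(E)]

    w, lam = list(w0), list(lam0)
    for _ in range(iters):
        r = residual(w)
        g = grad_J(w)
        # Primal descent direction: grad of J_rho(w) = J(w) + rho/2 ||Bw - b||^2,
        # plus B^T lam.
        step = [g[m] + sum(B[e][m] * (rho * r[e] + lam[e]) for e in range(E))
                for m in range(M)]
        w = [w[m] - mu_w * step[m] for m in range(M)]
        # Dual ascent: incremental, so it uses the freshly updated w.
        r_new = residual(w)
        lam = [lam[e] + mu_lam * r_new[e] for e in range(E)]
    return w, lam


# Hypothetical toy problem: J(w) = 0.5 * ||w - c||^2 subject to w_1 + w_2 = 1.
c = [1.0, 2.0]
w_final, lam_final = pd_gradient_incremental(
    grad_J=lambda w: [w[m] - c[m] for m in range(2)],
    B=[[1.0, 1.0]], b=[1.0],
    w0=[0.0, 0.0], lam0=[0.0],
    mu_w=0.5, mu_lam=0.5, rho=0.0,
)
print(w_final, lam_final)
```

For this toy problem the saddle point can be found by hand: $w^{\star}=(0,1)$ and $\lambda^{\star}=1$ satisfy both the constraint and ${\nabla}J(w^{\star})+B^{\mathsf{T}}\lambda^{\star}=0$, so the printed iterates should be close to these values.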

This work provides a concise proof that establishes the linear convergence of recursion (4) and studies its relation to the non-incremental implementation. We also study the effect of the penalty term ${\rho\over 2}\|Bw-b\|^{2}$ on the performance of multi-agent consensus optimization algorithms. Algorithms of the form (4) have been applied in various applications including wireless systems [2], power systems [3], reinforcement learning [4], and network utility maximization [5].

I-A Related Works

There exists a large body of literature on primal-dual saddle-point algorithms – see [6, 7, 8, 9, 10, 11, 5, 12, 13] and the references therein, including the seminal work [6], which proposed recursions of the type (4) and established their convergence. These works focus on proving convergence to an optimal solution without providing convergence rates, provide sub-linear convergence rates (e.g., ${1\over i}$ where $i$ is the iteration index), or show linear convergence from a starting point that is sufficiently close to a solution (local convergence). Some other works examined global linear convergence under different settings.

The works [14, 15] focus on continuous versions of the primal-dual gradient dynamics and establish linear convergence for augmented Lagrangian implementations (i.e., they require the presence of the augmented Lagrangian term ${\rho\over 2}\|Bw-b\|^{2}$ , where $\rho$ is strictly positive). They also require $B$ to have full row rank. Similarly, the work [16] establishes linear convergence for continuous primal-dual gradient dynamics for full row rank $B$ , but it does not require the presence of the augmented Lagrangian term. Moreover, it was shown in [16] that if the continuous dynamics are discretized using Euler discretization, then the discrete version converges linearly under small enough step-sizes. However, no upper bound is given on the step-sizes. Moreover, Euler discretization uses identical step-sizes for the primal and dual updates (i.e., $\mu_{w}=\mu_{\lambda}$ ) and results in non-incremental primal-dual dynamics. Therefore, the results in [14, 15, 16] are not directly applicable to the discrete incremental implementation (4) and do not provide clear bounds on the step-sizes.

We remark that linear convergence for various monotone operator methods has been established, albeit under other conditions that are not satisfied in our setup. For example, the linear convergence results in [17] and [18, Proposition 25.9] for forward-backward splitting methods would require the saddle-point problem (2) to be both strongly-convex with respect to $w$ and strongly-concave with respect to $\lambda$ . This holds for example for problems with Lagrangian $L(w,\lambda)=J(w)+\lambda^{\mathsf{T}}Bw-g(\lambda)$ where $J(w)$ and $g(\lambda)$ are both strongly-convex functions. Similarly, the conditions used in [19, 20, 21] require the saddle-point problem (2) to be strongly-convex with respect to $w$ and strongly-concave with respect to $\lambda$ . In our setup, $L_{\rho}(w,\lambda)$ is not strongly-concave with respect to $\lambda$ .

The work [22] showed that for saddle point problems with $L(w,\lambda)=J(w)+\lambda^{\mathsf{T}}Bw-g(\lambda)$ , linear convergence is possible without requiring the Lagrangian to be both strongly-convex and strongly-concave. In particular, it established linear convergence when the primal function $J(w)$ is smooth and convex, the dual function $-g(\lambda)$ is smooth and strongly-concave, and the additional assumption that $B$ is a full column rank matrix. Unlike the current work, the algorithm analyzed in [22] is non-incremental; moreover, particular fixed step-sizes are needed to establish linear convergence – [22, Theorem 3.1].

Now, in the distributed optimization literature, various incremental primal-dual gradient algorithms have been proposed to solve multi-agent consensus optimization problems – see [23, 24, 25, 26, 27] and references therein, which are mostly based on augmented Lagrangian (AL) formulations. They have been shown to achieve linear convergence under strong-convexity even though the consensus constraint matrix is not full rank. However, the analysis techniques used to establish the convergence of these methods either depend on the particular consensus constraint matrix and/or require the AL term to be strictly positive. Unlike these works, our analysis does not require $\rho$ to be strictly positive. Moreover, due to our unified Lagrangian and AL framework, we clarify the effect of the AL penalty term on the performance of these types of distributed algorithms. Note that the work [28] studied non-incremental primal-dual methods with identical step-sizes for quadratic distributed optimization. It was found in [28] that unlike AL methods, Lagrangian methods suffer from stability issues when the individual costs are not strongly-convex. Unlike [28], we study the effect of the AL penalty on the convergence rate of distributed algorithms.

I-B Contribution

Given the above, this work has two main contributions: I) Through an original proof, we establish the linear convergence of the incremental implementation (4). Moreover, we show how the non-incremental implementation is related to the incremental one and establish its linear convergence while providing explicit upper bounds on the step-sizes. Our proof technique does not require the AL parameter to be strictly positive nor do we require $B$ to have full row rank. II) We show the effect of the AL penalty term on the performance of distributed multi-agent optimization algorithms. Depending on the condition number of the agents’ costs, we provide scenarios where the AL term is beneficial and other scenarios where it is not beneficial.

Notation and Terminology: For a matrix $A\in{\mathbb{R}}^{M\times N}$ , $\sigma_{\max}(A)$ denotes the maximum singular value of $A$ , $\sigma_{\min}(A)$ denotes the minimum singular value of $A$ , and $\underline{\sigma}(A)$ denotes the smallest non-zero singular value. For a vector $x\in{\mathbb{R}}^{M}$ and a positive constant $c>0$ , we let $\|x\|_{c}^{2}$ denote the weighted norm $c\|x\|^{2}$ . For any positive semidefinite matrix $A\in{\mathbb{R}}^{M\times M}$ the square root $A^{1\over 2}$ is the solution of $X^{2}=A$ . A function $f(x):{\mathbb{R}}^{M}\rightarrow{\mathbb{R}}$ is $\delta$ -smooth if $\|{\nabla}f(x)-{\nabla}f(y)\|\leq\delta\|x-y\|$ for any $x,y$ and some $\delta>0$ . A smooth function $f(x)$ is $\nu$ -strongly-convex if $(x-y)^{\mathsf{T}}\big{(}{\nabla}f(x)-{\nabla}f(y)\big{)}\geq\nu\|x-y\|^{2}$ for any $x,y$ and some $\nu>0$ .

II Auxiliary Results

This section gives the auxiliary results leading to the main convergence result. We start with the following condition on the cost function.

Assumption 1.

(Cost function): It is assumed that a unique solution $w^{\star}$ exists for problem (1) and the cost function $J(w)$ is convex. It is also assumed that $J(w)$ is $\delta$ -smooth, consequently, $J_{\rho}(w)=J(w)+{\rho\over 2}\|Bw-b\|^{2}$ is $\delta_{\rho}$ -smooth with $\delta_{\rho}=\delta+\rho\sigma^{2}_{\max}(B)$ . Moreover, the cost $J_{\rho}(w)$ is $\nu_{\rho}$ -strongly-convex with respect to $w^{\star}$ , namely,

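$$(x-w^{\star})^{\mathsf{T}}\big({\nabla}J_{\rho}(x)-{\nabla}J_{\rho}(w^{\star})\big)\geq\nu_{\rho}\|x-w^{\star}\|^{2},\quad\forall\ x \tag{5}$$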

The scalars satisfy $0<\nu_{\rho}\leq\delta_{\rho}$ for any $\rho\geq 0$ . $\Box$

Remark 1 (Strong-convexity).

If $J(w)$ is $\nu$ -strongly-convex, then $w^{\star}$ is unique [1, Example 5.4] and condition (5) will be satisfied with $\nu_{\rho}=\nu$ . We remark that condition (5) does not necessarily imply that $J(w)$ is strongly-convex w.r.t. $w^{\star}$ unless $\rho=0$ . This condition is used instead of typical strong-convexity to be consistent with the conditions used to study the effect of the augmented Lagrangian term on the performance of distributed algorithms in Section V. $\Box$

It is known that a pair $(w^{\star},\lambda^{\star})$ is an optimal solution to (2) if, and only if, it satisfies the optimality conditions [1]:

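$$\begin{aligned}{\nabla}J(w^{\star})+B^{\mathsf{T}}\lambda^{\star}&=0&&\text{(6a)}\\ Bw^{\star}-b&=0&&\text{(6b)}\end{aligned}$$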

From (6a) and uniqueness of $w^{\star}$ , $\lambda^{\star}$ will be unique if $B$ has full row rank. In general $\lambda^{\star}$ is not necessarily unique. Motivated by [29], we will characterize a particular dual solution that we later show convergence to. For that result and later analysis, we need the following result.

Lemma 1.

If $\lambda_{x}$ is in the range space of $B\in{\mathbb{R}}^{E\times M}$ , then it holds that:

$$\|B^{\mathsf{T}}\lambda_{x}\|^{2}\geq\underline{\sigma}^{2}(B)\|\lambda_{x}\|^{2} \tag{7}$$

Proof.

Introduce the truncated singular value decomposition [30] of the positive semi-definite matrix $B^{\mathsf{T}}B=U_{r}\Sigma_{r}U_{r}^{\mathsf{T}}$ , where $U_{r}\in{\mathbb{R}}^{M\times r}$ ( $r$ denotes the rank of $B^{\mathsf{T}}B$ ) with $U_{r}^{\mathsf{T}}U_{r}=I_{r}$ and $\Sigma_{r}>0$ is a diagonal matrix with entries equal to the non-zero eigenvalues of $B^{\mathsf{T}}B$ ( i.e., the squared non-zero singular values of $B$ ). Since $\lambda_{x}$ is in the range space of $B$ , it holds that $\lambda_{x}=Bx$ for some $x$ . Thus, if we let $u=\Sigma^{1\over 2}_{r}U_{r}^{\mathsf{T}}x$ , then

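$$\|B^{\mathsf{T}}\lambda_{x}\|^{2}=u^{\mathsf{T}}\Sigma_{r}u\geq\underline{\sigma}^{2}(B)\|u\|^{2}=\underline{\sigma}^{2}(B)\,x^{\mathsf{T}}U_{r}\Sigma_{r}U_{r}^{\mathsf{T}}x \tag{8}$$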

The result follows since $x^{\mathsf{T}}U_{r}\Sigma_{r}U_{r}^{\mathsf{T}}x=\|\lambda_{x}\|^{2}$ . The inequality follows since $\underline{\sigma}^{2}(B)$ is the smallest eigenvalue (or diagonal entry) of $\Sigma_{r}$ – see [1, Appendix A.5.2]. ∎
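As a small worked illustration of why the range-space condition in Lemma 1 matters, consider the following rank-deficient matrix (a hypothetical example chosen here, not one from the paper):

$$B=\begin{bmatrix}1&1\\ 1&1\end{bmatrix},\qquad B^{\mathsf{T}}B=\begin{bmatrix}2&2\\ 2&2\end{bmatrix}.$$

The eigenvalues of $B^{\mathsf{T}}B$ are $4$ and $0$, so $\underline{\sigma}^{2}(B)=4$ while $\sigma_{\min}(B)=0$. Any $\lambda_{x}=Bx$ has the form $s\,[1,\ 1]^{\mathsf{T}}$ with $s=x_{1}+x_{2}$, and

$$\|B^{\mathsf{T}}\lambda_{x}\|^{2}=\|2s\,[1,\ 1]^{\mathsf{T}}\|^{2}=8s^{2}=4\cdot 2s^{2}=\underline{\sigma}^{2}(B)\|\lambda_{x}\|^{2},$$

so the bound of Lemma 1 holds (here with equality). In contrast, $\lambda=[1,\ -1]^{\mathsf{T}}$ lies in ${\rm Null}(B^{\mathsf{T}})$: it gives $B^{\mathsf{T}}\lambda=0$ while $\|\lambda\|^{2}=2$, so the bound fails outside the range space of $B$.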

Lemma 2.

(Particular dual $\lambda^{\star}_{b}$ ): There exists a unique optimal dual variable, denoted by $\lambda^{\star}_{b}$ , lying in the range space of $B$ .

Proof.

The argument is motivated by [29]. Any solution $\lambda^{\star}$ of the linear system of equations given in (6a) can be decomposed into two parts $\lambda^{\star}=\lambda^{\star}_{b}+\lambda_{n}^{\star}$ , where $\lambda^{\star}_{b}\in{\rm Range}(B)$ and $\lambda_{n}^{\star}\in{\rm Null}(B^{\mathsf{T}})$ – see [30]. Therefore, if $(w^{\star},\lambda^{\star})$ satisfies (6), then $(w^{\star},\lambda^{\star}_{b})$ also satisfies (6). We now show $\lambda^{\star}_{b}$ is unique by contradiction. Assume we have two distinct dual solutions $\lambda^{\star}_{b_{1}}=Bx_{1}$ and $\lambda^{\star}_{b_{2}}=Bx_{2}$ lying in the range space of $B$ . Then, substituting into (6a) and subtracting, we get $B^{\mathsf{T}}B(x_{1}-x_{2})=0$ . It follows that $\|B(x_{1}-x_{2})\|^{2}=0$ and, consequently, $B(x_{1}-x_{2})=0$ . This means that $\lambda^{\star}_{b_{1}}=Bx_{1}=Bx_{2}=\lambda^{\star}_{b_{2}}$ , which is a contradiction. ∎

Note that if $\lambda_{i-1}$ belongs to the range space of $B$ (i.e., $\lambda_{i-1}=Bx$ for some $x$ ) or $\lambda_{i-1}=0$ , then from $b=Bw^{\star}$ and (4b) we know that $\lambda_{i}=\lambda_{i-1}+\mu_{\lambda}(Bw_{i}-b)=B\big{(}x+\mu_{\lambda}(w_{i}-w^{\star})\big{)}$ will remain in the range space of $B$ . Thus, $\{\lambda_{i}\}_{i\geq 0}$ will always remain in the range space of $B$ if $\lambda_{-1}$ belongs to the range space of $B$ or $\lambda_{-1}=0$ . This observation will allow us to utilize the bound (8) to establish linear convergence to the particular saddle-point $(w^{\star},\lambda^{\star}_{b})$ without requiring a rank condition on the matrix $B$ .

III Linear Convergence Result

We are now ready to establish our main result. Let $\widetilde{w}_{i}\;\stackrel{{\scriptstyle\Delta}}{{=}}\;w_{i}-w^{\star}$ and $\widetilde{\lambda}_{i}\;\stackrel{{\scriptstyle\Delta}}{{=}}\;\lambda_{i}-\lambda^{\star}_{b}$ denote the primal and dual errors, respectively.

Theorem 1.

(Linear convergence): Let Assumption 1 hold and assume the step-sizes are positive and satisfy:

$$\mu_{w}<{1\over\delta_{\rho}},\quad\mu_{\lambda}\leq{\nu_{\rho}\over\sigma^{2}_{\max}(B)} \tag{9}$$

If $\lambda_{-1}=0$ , then algorithm (4) converges linearly to the particular saddle-point $(w^{\star},\lambda^{\star}_{b})$ , namely, it holds that

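$$\|\widetilde{w}_{i}\|_{c_{w}}^{2}+\|\widetilde{\lambda}_{i}\|_{c_{\lambda}}^{2}\leq\gamma\left(\|\widetilde{w}_{i-1}\|_{c_{w}}^{2}+\|\widetilde{\lambda}_{i-1}\|_{c_{\lambda}}^{2}\right) \tag{10}$$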

where $c_{\lambda}>0$ , $c_{w}=1-\mu_{w}\mu_{\lambda}\sigma^{2}_{\max}(B)>0$ , and

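$$\gamma\;\stackrel{\Delta}{=}\;\max\left\{1-\mu_{w}\nu_{\rho}(1-\mu_{w}\delta_{\rho}),\ 1-\mu_{w}\mu_{\lambda}\underline{\sigma}^{2}(B)\right\}<1$$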

Proof.

Subtracting $w^{\star}$ and $\lambda^{\star}_{b}$ from both sides of (4) and using the optimality conditions (6) we get the coupled error recursion:

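$$\begin{aligned}\widetilde{w}_{i}&=\widetilde{w}_{i-1}-\mu_{w}\big({\nabla}J_{\rho}(w_{i-1})-{\nabla}J_{\rho}(w^{\star})\big)-\mu_{w}B^{\mathsf{T}}\widetilde{\lambda}_{i-1}&&\text{(11a)}\\ \widetilde{\lambda}_{i}&=\widetilde{\lambda}_{i-1}+\mu_{\lambda}B\widetilde{w}_{i}&&\text{(11b)}\end{aligned}$$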

Squaring both sides of (11a) and (11b) we get

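$$\begin{aligned}\|\widetilde{w}_{i}\|^{2}&=\|\widetilde{w}_{i-1}-\mu_{w}\big({\nabla}J_{\rho}(w_{i-1})-{\nabla}J_{\rho}(w^{\star})\big)\|^{2}\\ &\quad-2\mu_{w}\widetilde{\lambda}_{i-1}^{\mathsf{T}}B\left(\widetilde{w}_{i-1}-\mu_{w}\big({\nabla}J_{\rho}(w_{i-1})-{\nabla}J_{\rho}(w^{\star})\big)\right)+\mu_{w}^{2}\|B^{\mathsf{T}}\widetilde{\lambda}_{i-1}\|^{2}\end{aligned} \tag{12}$$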

and

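$$\begin{aligned}\|\widetilde{\lambda}_{i}\|^{2}&=\|\widetilde{\lambda}_{i-1}\|^{2}+\mu_{\lambda}^{2}\|B\widetilde{w}_{i}\|^{2}+2\mu_{\lambda}\widetilde{\lambda}_{i-1}^{\mathsf{T}}B\widetilde{w}_{i}\\ &\overset{\text{(11a)}}{=}\|\widetilde{\lambda}_{i-1}\|^{2}+\mu_{\lambda}^{2}\|B\widetilde{w}_{i}\|^{2}-2\mu_{\lambda}\mu_{w}\|B^{\mathsf{T}}\widetilde{\lambda}_{i-1}\|^{2}\\ &\quad+2\mu_{\lambda}\widetilde{\lambda}_{i-1}^{\mathsf{T}}B\left(\widetilde{w}_{i-1}-\mu_{w}\big({\nabla}J_{\rho}(w_{i-1})-{\nabla}J_{\rho}(w^{\star})\big)\right)\end{aligned} \tag{13}$$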

Using the bound $\|B\widetilde{w}_{i}\|^{2}\leq\sigma^{2}_{\max}(B)\|\widetilde{w}_{i}\|^{2}$ , multiplying equation (13) by $c_{\lambda}\;\stackrel{{\scriptstyle\Delta}}{{=}}\;\mu_{w}/\mu_{\lambda}$ and adding to (12) gives:

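$$\|\widetilde{w}_{i}\|_{c_{w}}^{2}+\|\widetilde{\lambda}_{i}\|_{c_{\lambda}}^{2}\leq\|\widetilde{w}_{i-1}-\mu_{w}\big({\nabla}J_{\rho}(w_{i-1})-{\nabla}J_{\rho}(w^{\star})\big)\|^{2}+\|\widetilde{\lambda}_{i-1}\|_{c_{\lambda}}^{2}-\mu_{w}^{2}\|B^{\mathsf{T}}\widetilde{\lambda}_{i-1}\|^{2} \tag{14}$$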

where $c_{w}\;\stackrel{{\scriptstyle\Delta}}{{=}}\;1-\mu_{w}\mu_{\lambda}\sigma^{2}_{\max}(B)$ . Note that from Lemma 2, $\lambda^{\star}_{b}$ lies in the range space of $B$ . Moreover, since $\lambda_{-1}=0$ , then we know that $\widetilde{\lambda}_{i}$ will always lie in the range space of $B$ . Thus, from (7) it holds that $\|B^{\mathsf{T}}\widetilde{\lambda}_{i-1}\|^{2}\geq\underline{\sigma}^{2}(B)\|\widetilde{\lambda}_{i-1}\|^{2}$ . Using this bound in (14), we get:

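$$\|\widetilde{w}_{i}\|_{c_{w}}^{2}+\|\widetilde{\lambda}_{i}\|_{c_{\lambda}}^{2}\leq\|\widetilde{w}_{i-1}-\mu_{w}\big({\nabla}J_{\rho}(w_{i-1})-{\nabla}J_{\rho}(w^{\star})\big)\|^{2}+\big(1-\mu_{w}\mu_{\lambda}\underline{\sigma}^{2}(B)\big)\|\widetilde{\lambda}_{i-1}\|_{c_{\lambda}}^{2} \tag{15}$$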

Since $J_{\rho}(w)$ is $\delta_{\rho}$ -smooth, it holds that [31, Theorem 2.1.5]:

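$$\widetilde{w}_{i-1}^{\mathsf{T}}\big({\nabla}J_{\rho}(w_{i-1})-{\nabla}J_{\rho}(w^{\star})\big)\geq{1\over\delta_{\rho}}\|{\nabla}J_{\rho}(w_{i-1})-{\nabla}J_{\rho}(w^{\star})\|^{2} \tag{16}$$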

Thus

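$$\|\widetilde{w}_{i-1}-\mu_{w}\big({\nabla}J_{\rho}(w_{i-1})-{\nabla}J_{\rho}(w^{\star})\big)\|^{2}\leq\big(1-\mu_{w}\nu_{\rho}(2-\mu_{w}\delta_{\rho})\big)\|\widetilde{w}_{i-1}\|^{2}$$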

for $\mu_{w}<2/\delta_{\rho}$ . This follows directly by expanding the square and using the bounds (5) and (16). Let $\gamma_{1}=1-\mu_{w}\nu_{\rho}(1-\mu_{w}\delta_{\rho})$ . Since $c_{w}=1-\mu_{w}\mu_{\lambda}\sigma^{2}_{\max}(B)$ , it holds that:

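$$\begin{aligned}\big(1-\mu_{w}\nu_{\rho}(2-\mu_{w}\delta_{\rho})\big)\|\widetilde{w}_{i-1}\|^{2}&=\gamma_{1}\|\widetilde{w}_{i-1}\|^{2}-\mu_{w}\nu_{\rho}\|\widetilde{w}_{i-1}\|^{2}\\ &=\gamma_{1}\|\widetilde{w}_{i-1}\|_{c_{w}}^{2}-\mu_{w}\big(\nu_{\rho}-\mu_{\lambda}\sigma^{2}_{\max}(B)\gamma_{1}\big)\|\widetilde{w}_{i-1}\|^{2}\\ &\leq\gamma_{1}\|\widetilde{w}_{i-1}\|_{c_{w}}^{2}\end{aligned}$$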

where in the last step we used the fact that the second term is non-positive under the conditions $\mu_{w}<{1\over\delta_{\rho}}$ and $\mu_{\lambda}\leq\nu_{\rho}/\sigma^{2}_{\max}(B)$ . We conclude that equation (10) holds by using the previous two equations in (15). Note that for positive step-sizes it holds that $c_{\lambda}={\mu_{w}\over\mu_{\lambda}}>0$ . Moreover, $c_{w}=1-\mu_{w}\mu_{\lambda}\sigma^{2}_{\max}(B)>0$ and $0<1-\mu_{w}\mu_{\lambda}\underline{\sigma}^{2}(B)<1$ if $\mu_{w}\mu_{\lambda}<{1\over\sigma^{2}_{\max}(B)}$ . This requirement is satisfied under condition (9) because then we have

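$$\mu_{w}\mu_{\lambda}<{\nu_{\rho}\over\delta_{\rho}\sigma^{2}_{\max}(B)}\leq{1\over\sigma^{2}_{\max}(B)}$$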

where the last inequality holds because $\nu_{\rho}\leq\delta_{\rho}$ . ∎
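The contraction claimed in Theorem 1 can be checked numerically on a small instance. The sketch below is an assumption-laden illustration: it hard-codes a hypothetical toy problem ($J(w)={1\over 2}\|w-c\|^{2}$ with $B=[1\ \ 1]$, $b=1$, $\rho=0$, so that $\delta_{\rho}=\nu_{\rho}=1$ and $\sigma^{2}_{\max}(B)=\underline{\sigma}^{2}(B)=2$), computes $\gamma$, $c_{w}$, and $c_{\lambda}$ as defined in the theorem, and records the per-iteration ratio of the weighted error.

```python
def gamma_bound(mu_w, mu_lam, nu_rho, delta_rho, sig_nz_min_sq):
    # Contraction factor from Theorem 1:
    # max{1 - mu_w*nu_rho*(1 - mu_w*delta_rho), 1 - mu_w*mu_lam*sigma_underline^2(B)}.
    return max(1 - mu_w * nu_rho * (1 - mu_w * delta_rho),
               1 - mu_w * mu_lam * sig_nz_min_sq)


# Hypothetical toy problem (values chosen for illustration only).
c = (1.0, 2.0)
w_star, lam_star = (0.0, 1.0), 1.0      # saddle point, found by hand
mu_w, mu_lam = 0.5, 0.5                 # satisfy mu_w < 1 and mu_lam <= 1/2
sig_max_sq = 2.0                        # B = [1, 1] has one singular value, sqrt(2)

gamma = gamma_bound(mu_w, mu_lam, nu_rho=1.0, delta_rho=1.0, sig_nz_min_sq=2.0)
c_w = 1 - mu_w * mu_lam * sig_max_sq
c_lam = mu_w / mu_lam


def weighted_error(w, lam):
    # c_w * ||w - w_star||^2 + c_lam * (lam - lam_star)^2
    return (c_w * sum((w[m] - w_star[m]) ** 2 for m in range(2))
            + c_lam * (lam - lam_star) ** 2)


w, lam = [0.0, 0.0], 0.0                # dual variable initialized at zero
ratios = []
for _ in range(30):
    v_prev = weighted_error(w, lam)
    # Primal step (rho = 0): gradient of J is w - c, and B^T lam = lam * [1, 1].
    w = [w[m] - mu_w * ((w[m] - c[m]) + lam) for m in range(2)]
    # Incremental dual step uses the updated w.
    lam = lam + mu_lam * (w[0] + w[1] - 1.0)
    ratios.append(weighted_error(w, lam) / v_prev)

print(gamma, max(ratios))
```

The bound is a worst-case guarantee, so the observed ratios may be noticeably smaller than $\gamma$; the check only confirms that none of them exceeds it.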

Theorem 1 shows that under conditions (9), the incremental algorithm (4) converges linearly. We will show how to utilize this result to establish the linear convergence of the classical non-incremental (Arrow-Hurwicz) method [6].

IV Non-incremental PD Gradient Method

Consider the non-incremental update (Arrow-Hurwicz):

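$$\begin{aligned}w_{i}&=w_{i-1}-\mu_{w}\big({\nabla}J_{\eta}(w_{i-1})+B^{\mathsf{T}}\lambda^{\prime}_{i-1}\big)&&\text{(19a)}\\ \lambda^{\prime}_{i}&=\lambda^{\prime}_{i-1}+\mu_{\lambda}(Bw_{i-1}-b)&&\text{(19b)}\end{aligned}$$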

where $J_{\eta}(w)\;\stackrel{{\scriptstyle\Delta}}{{=}}\;J(w)+{\eta\over 2}\|Bw-b\|^{2}$ and $\eta\geq 0$ . Different from (4), recursion (19b) uses $w_{i-1}$ in the dual update instead of $w_{i}$ . We will see that these two different implementations are equivalent for particular choices of $\eta$ and $\rho$ .

Lemma 3.

(Equivalence of (4) and (19b)): The primal iterates of the non-incremental recursion (19b) are equivalent to the primal iterates of the incremental recursion (4) if $\eta=\rho+\mu_{\lambda}$ and $\lambda^{\prime}_{-1}=\lambda_{-1}-\mu_{\lambda}(Bw_{-1}-b)$ .

Proof.

Let $\eta=\rho+\mu_{\lambda}$ . It holds that $J_{\eta}(w)=J_{\rho}(w)+{\mu_{\lambda}\over 2}\|Bw-b\|^{2}$ so that ${\nabla}J_{\eta}(w)={\nabla}J_{\rho}(w)+\mu_{\lambda}B^{\mathsf{T}}(Bw-b)$ . Thus, for $\eta=\rho+\mu_{\lambda}$ step (19a) can be rewritten as:

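$$\begin{aligned}w_{i}&=w_{i-1}-\mu_{w}\big({\nabla}J_{\rho}(w_{i-1})+\mu_{\lambda}B^{\mathsf{T}}(Bw_{i-1}-b)+B^{\mathsf{T}}\lambda^{\prime}_{i-1}\big)\\ &=w_{i-1}-\mu_{w}\big({\nabla}J_{\rho}(w_{i-1})+B^{\mathsf{T}}\lambda_{i-1}\big)\end{aligned}$$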

where we introduced the change of variable $\lambda_{i}\;\stackrel{{\scriptstyle\Delta}}{{=}}\;\lambda^{\prime}_{i}+\mu_{\lambda}(Bw_{i}-b)$ . Adding $\mu_{\lambda}(Bw_{i}-b)$ to both sides of (19b) and using $\lambda_{i}\;\stackrel{{\scriptstyle\Delta}}{{=}}\;\lambda^{\prime}_{i}+\mu_{\lambda}(Bw_{i}-b)$ , we can directly rewrite (19b) as in (4b). Thus, the primal iterates of recursion (19b) are equivalent to the primal iterates of recursion (4) if $\lambda^{\prime}_{-1}=\lambda_{-1}-\mu_{\lambda}(Bw_{-1}-b)$ . ∎
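The equivalence in Lemma 3 can also be checked numerically. The following sketch runs both recursions on a hypothetical toy problem ($J(w)={1\over 2}\|w-c\|^{2}$, $B=[1\ \ 1]$, $b=1$; all numerical values are assumptions made for illustration) with $\eta=\rho+\mu_{\lambda}$ and $\lambda^{\prime}_{-1}=\lambda_{-1}-\mu_{\lambda}(Bw_{-1}-b)$, and compares the primal iterates.

```python
c = (1.0, 2.0)                      # hypothetical cost J(w) = 0.5 * ||w - c||^2
rho, mu_w, mu_lam = 0.5, 0.2, 0.3   # illustrative parameter choices


def residual(w):
    # B w - b for B = [1, 1], b = 1.
    return w[0] + w[1] - 1.0


def grad_J_pen(w, pen):
    # Gradient of J(w) + pen/2 * (Bw - b)^2; note B^T r = r * [1, 1].
    r = residual(w)
    return [w[m] - c[m] + pen * r for m in range(2)]


def incremental(w, lam, steps):
    # Dual update uses the new primal iterate w_i.
    out = []
    for _ in range(steps):
        g = grad_J_pen(w, rho)
        w = [w[m] - mu_w * (g[m] + lam) for m in range(2)]
        lam = lam + mu_lam * residual(w)
        out.append(w)
    return out


def non_incremental(w, lam_p, steps):
    # Dual update uses the previous primal iterate w_{i-1}; penalty eta = rho + mu_lam.
    eta = rho + mu_lam
    out = []
    for _ in range(steps):
        g = grad_J_pen(w, eta)
        w_new = [w[m] - mu_w * (g[m] + lam_p) for m in range(2)]
        lam_p = lam_p + mu_lam * residual(w)
        w = w_new
        out.append(w)
    return out


w_init, lam_init = [0.0, 0.0], 0.0
lam_p_init = lam_init - mu_lam * residual(w_init)   # initialization from Lemma 3

w_inc = incremental(w_init, lam_init, 50)
w_non = non_incremental(w_init, lam_p_init, 50)
max_gap = max(abs(a[m] - b_[m]) for a, b_ in zip(w_inc, w_non) for m in range(2))
print(max_gap)
```

The two trajectories should agree up to floating-point rounding, since the only differences are the order in which the same terms are added.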

Lemma 3 implies that the non-incremental implementation (19b) is an instance of the incremental implementation with $\rho=\eta-\mu_{\lambda}$ . Recall that in algorithm (4) we assume that $\rho\geq 0$ . Therefore, if $\eta=\rho+\mu_{\lambda}\geq\mu_{\lambda}$ , the linear convergence of (19b) follows from Theorem 1 with $\rho=\eta-\mu_{\lambda}\geq 0$ . The case $0\leq\eta<\mu_{\lambda}$ implies that $\rho=\eta-\mu_{\lambda}<0$ . This case can also be analyzed using the exact same technique as in Theorem 1. To show that, it suffices to consider the classical case $\eta=0$ .

Corollary 1.

(Non-Incremental $\eta=0$ ): If the cost $J(w)$ is $\delta$ -smooth and $\nu$ -strongly-convex and the step-sizes satisfy:

$$\mu_{w}<{1\over\delta-\mu_{\lambda}\sigma^{2}_{\min}(B)},\quad\mu_{\lambda}\leq{\nu\over 2\sigma^{2}_{\max}(B)}$$

then recursion (19b) with $\eta=0$ converges linearly to the optimal saddle-point if $\lambda^{\prime}_{-1}=0$ .

Proof.

See Appendix A. ∎

By relating recursion (19b) to (4), we are able to establish its linear convergence and provide explicit upper bounds on the step-sizes as well. The works [16] and [22] also established the linear convergence of the non-incremental recursion (19b) with $\eta=0$ . However, these works do not provide explicit upper bounds on the step-sizes [16] or require particular fixed step-sizes to establish their result [22].

Remark 2 (Forward-Backward Method).

Assume $b=0$ and consider the forward-backward gradient algorithm [32]:

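$$\begin{aligned}w_{i}&=w_{i-1}-\mu_{w}\big({\nabla}J(w_{i-1})+B^{\mathsf{T}}\lambda^{\prime}_{i-1}\big)&&\text{(22a)}\\ \lambda^{\prime}_{i}&=\lambda^{\prime}_{i-1}+\mu_{\lambda}B(2w_{i}-w_{i-1})&&\text{(22b)}\end{aligned}$$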

By using a change of variable trick, the analysis of (22b) directly follows from Theorem 1 with $\rho=\mu_{\lambda}$ . In particular, by adding and subtracting $\mu_{w}\mu_{\lambda}B^{\mathsf{T}}Bw_{i-1}$ to the R.H.S. of (22a), letting $\lambda_{i}\;\stackrel{{\scriptstyle\Delta}}{{=}}\;\lambda^{\prime}_{i}-\mu_{\lambda}Bw_{i}$ , and rearranging (22b), recursion (22b) can be equivalently written as recursion (4) (with $b=0$ ) with $\rho=\mu_{\lambda}$ . $\Box$

V Application: Distributed Optimization

In this section, we study the benefit of the AL penalty term for distributed consensus optimization problems.

Consider a network of $K$ agents that are connected through some network and interested in the following problem:

[TABLE]

where $J_{k}(w):{\mathbb{R}}^{M}\rightarrow{\mathbb{R}}$ is a local cost function associated with agent $k$ . In order to derive the algorithm that solves (23) in a distributed manner, we will rewrite (23) in an equivalent constrained form. We introduce a combination matrix $A=[a_{sk}]$ associated with the network. The entry $a_{sk}$ is the weight used by agent $k$ to scale information arriving from agent $s$ with $a_{sk}=0$ if $s$ is not a direct neighbor of agent $k$ , i.e., there is no edge connecting them.

Assumption 2.

The network is static, undirected, and the matrix $A$ is assumed to be primitive, i.e., there exists some integer $p>0$ such that all entries of $A^{p}$ are positive. We also assume $A$ to be symmetric, and doubly stochastic. $\Box$

There exist many rules for choosing $A$ , such as the Metropolis rule – see [33], that satisfy Assumption 2 as long as the network is connected. Under this assumption, it holds that $I_{K}-A$ is positive semi-definite and $(I_{K}-A)x=0$ if, and only if, $x=c\mathds{1}_{K}$ for some $c\in{\mathbb{R}}$ – see [26]. Therefore, if we let $w_{k}\in{\mathbb{R}}^{M}$ denote a local copy of $w$ available at agent $k$ and introduce the network quantities:

[TABLE]

then it holds that ${\mathcal{B}}{\scriptstyle{\mathcal{W}}}=0$ if, and only if, $w_{k}=w_{s}\ \forall\ k,s$ – see [26]. Thus, problem (23) is equivalent to the following constrained problem:

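$$\underset{{\scriptstyle{\mathcal{W}}}}{\text{minimize}}\quad{\mathcal{J}}({\scriptstyle{\mathcal{W}}}),\quad{\rm s.t.}\quad{\mathcal{B}}{\scriptstyle{\mathcal{W}}}=0 \tag{26}$$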

A direct application of (4) to problem (26) gives:

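$$\begin{aligned}{\scriptstyle{\mathcal{W}}}_{i}&={\scriptstyle{\mathcal{W}}}_{i-1}-\mu_{w}\big({\nabla}{\mathcal{J}}_{\rho}({\scriptstyle{\mathcal{W}}}_{i-1})+{\mathcal{B}}\lambda_{i-1}\big)&&\text{(27a)}\\ \lambda_{i}&=\lambda_{i-1}+\mu_{\lambda}{\mathcal{B}}{\scriptstyle{\mathcal{W}}}_{i}&&\text{(27b)}\end{aligned}$$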

where ${\mathcal{J}}_{\rho}({\scriptstyle{\mathcal{W}}})\;\stackrel{{\scriptstyle\Delta}}{{=}}\;{\mathcal{J}}({\scriptstyle{\mathcal{W}}})+{\rho\over 2}\|{\mathcal{B}}{\scriptstyle{\mathcal{W}}}\|^{2}$ with $\rho\geq 0$ . Recursion (27) is not distributed yet because ${\mathcal{B}}$ need not have the network structure. However, this can be easily handled by a change of variable. Letting ${\scriptstyle{\mathcal{Y}}}_{i}={\mathcal{B}}\lambda_{i}$ and multiplying (27b) by ${\mathcal{B}}$ gives:

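$$\begin{aligned}{\scriptstyle{\mathcal{W}}}_{i}&={\scriptstyle{\mathcal{W}}}_{i-1}-\mu_{w}\big({\nabla}{\mathcal{J}}_{\rho}({\scriptstyle{\mathcal{W}}}_{i-1})+{\scriptstyle{\mathcal{Y}}}_{i-1}\big)&&\text{(28a)}\\ {\scriptstyle{\mathcal{Y}}}_{i}&={\scriptstyle{\mathcal{Y}}}_{i-1}+\mu_{\lambda}{\mathcal{B}}^{2}{\scriptstyle{\mathcal{W}}}_{i}&&\text{(28b)}\end{aligned}$$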

Since ${\mathcal{B}}^{2}=(I_{K}-A)\otimes I_{M}$ has the network structure, then the $k$ -th block of ${\mathcal{B}}^{2}{\scriptstyle{\mathcal{W}}}_{i}={\rm col}\{u_{k,i}\}_{k=1}^{K}$ has the distributed form $u_{k,i}=w_{k,i}-\sum_{s\in{\mathcal{N}}_{k}}a_{sk}w_{s,i}$ where ${\mathcal{N}}_{k}$ denotes the neighbors of agent $k$ , including agent $k$ . Therefore, recursion (28) is distributed and agent $k$ can locally update its corresponding $k$ -th blocks in ${\scriptstyle{\mathcal{W}}}_{i}$ and ${\scriptstyle{\mathcal{Y}}}_{i}$ .
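To make the distributed form concrete, the sketch below simulates the recursion in the variables $({\scriptstyle{\mathcal{W}}}_{i},{\scriptstyle{\mathcal{Y}}}_{i})$ for scalar agents ($M=1$). Everything specific here is an assumption made for illustration: a ring of $K=4$ agents, a symmetric doubly-stochastic combination matrix with self-weight $1/2$ and neighbor weights $1/4$, local costs $J_{k}(w)={1\over 2}(w-c_{k})^{2}$, and the step-sizes. It also assumes ${\mathcal{B}}$ is symmetric, so that ${\nabla}{\mathcal{J}}_{\rho}({\scriptstyle{\mathcal{W}}})={\nabla}{\mathcal{J}}({\scriptstyle{\mathcal{W}}})+\rho{\mathcal{B}}^{2}{\scriptstyle{\mathcal{W}}}$. Each agent only combines values from its ring neighbors.

```python
def run_distributed(c, A, mu_w, mu_lam, rho=0.0, iters=400):
    """Sketch of the distributed primal-dual recursion for scalar agents.

    c: list of local targets (agent k holds J_k(w) = 0.5 * (w - c[k])^2).
    A: K x K symmetric doubly-stochastic combination matrix (list of rows).
    """
    K = len(c)

    def consensus_residual(v):
        # k-th entry of (I - A) v: uses only agent k's value and its neighbors'.
        return [v[k] - sum(A[s][k] * v[s] for s in range(K)) for k in range(K)]

    w = [0.0] * K          # local primal copies w_k
    y = [0.0] * K          # local dual-related variables, initialized at zero
    for _ in range(iters):
        u = consensus_residual(w)
        # Primal step: local gradient + rho * (B^2 W)_k + y_k.
        w = [w[k] - mu_w * ((w[k] - c[k]) + rho * u[k] + y[k]) for k in range(K)]
        # Dual-type step uses the updated primal copies (incremental form).
        u_new = consensus_residual(w)
        y = [y[k] + mu_lam * u_new[k] for k in range(K)]
    return w


# Hypothetical ring of four agents.
A_ring = [
    [0.50, 0.25, 0.00, 0.25],
    [0.25, 0.50, 0.25, 0.00],
    [0.00, 0.25, 0.50, 0.25],
    [0.25, 0.00, 0.25, 0.50],
]
targets = [1.0, 2.0, 3.0, 6.0]
w_net = run_distributed(targets, A_ring, mu_w=0.5, mu_lam=0.5, rho=0.0)
print(w_net)
```

For these quadratic costs the minimizer of the sum is the average of the targets, so every local copy should approach that common value. Re-running with a positive `rho` is one way to experiment with the effect of the penalty term on this toy network.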

V-A Relation to Other Algorithms

Before we establish convergence of recursion (28) and show the influence of the AL penalty term on its performance, we show how the derivation of recursions (27) and (28) is related to some state-of-the-art algorithms.

V-A1 EXTRA [34]

Note that the saddle point interpretation of EXTRA appeared in the work [35]. If we choose $\mu_{w}=\mu$ , $\mu_{\lambda}={1\over 2\mu}$ , and $\rho={1\over 2\mu}$ in algorithm (27) we get:

[TABLE]

where $\bar{{\mathcal{A}}}\;\stackrel{{\scriptstyle\Delta}}{{=}}\;I-{1\over 2}{\mathcal{B}}^{2}={1\over 2}(I+{\mathcal{A}})$ and ${\mathcal{A}}=A\otimes I_{M}$ . By eliminating the dual-variable (see, e.g., [27]), the above algorithm can be shown to be equivalent to the EXTRA algorithm in [34], which requires communicating the primal variable once per iteration.

V-A2 Exact diffusion [26]

Consider the following update:

[TABLE]

which differs from EXTRA (29) in the primal update where the gradient is also multiplied by $\bar{{\mathcal{A}}}$ . By eliminating the dual-variable, the above algorithm can be shown to be equivalent to the exact-diffusion algorithm from [26]. Different from the traditional gradient primal-descent step (29a) that was used to derive EXTRA, exact diffusion uses incremental gradient descent steps – see [26] for details. Exact diffusion enjoys a wider stability range for the step-size $\mu$ and better convergence performance compared to EXTRA – see [36].

It is worth mentioning that if we consider the penalized unconstrained problem $\min_{{\scriptstyle{\mathcal{W}}}}\ {\mathcal{J}}_{\rho}({\scriptstyle{\mathcal{W}}})={\mathcal{J}}({\scriptstyle{\mathcal{W}}})+{\rho\over 2}\|{\mathcal{B}}{\scriptstyle{\mathcal{W}}}\|^{2}$ and apply two incremental gradient descent steps for the two terms in the penalized cost with step-size $\mu$ and $\rho={1\over\mu}$ , we arrive at:

[TABLE]

which is the diffusion algorithm [33, 26]. The bias that arises from solving the penalized problem, rather than the original problem, can be corrected by employing exact diffusion [26].

V-A3 DIGing [25]

If we choose a different penalty function ${\mathcal{J}}_{\rho}({\scriptstyle{\mathcal{W}}})={\mathcal{J}}({\scriptstyle{\mathcal{W}}})+{\rho\over 2}\|{\scriptstyle{\mathcal{W}}}\|_{I-{\mathcal{A}}^{2}}^{2}$ and set ${\mathcal{B}}^{2}\leftarrow{\mathcal{B}}$ in algorithm (27) with $\mu_{w}=\mu$ , $\mu_{\lambda}={1\over\mu}$ , and $\rho={1\over\mu}$ we get:

[TABLE]

By eliminating the dual variable, this algorithm can be shown to be equivalent to DIGing – see [25, Section 2.2]. We see that the main difference from the EXTRA derivation is in the choice of the constraint and penalty matrices.

V-A4 Linearized ADMM [23]

Consider an instance of the decentralized linearized ADMM (DLM) method from [23], where we let $\tilde{d}_{k}=2cd_{k}+\rho=d$ in the DLM from [23]:

[TABLE]

where $d,c>0$ . The matrix ${\mathcal{L}}$ is the oriented Laplacian matrix chosen such that the $k$ -th block of ${\mathcal{L}}{\scriptstyle{\mathcal{W}}}_{i}$ is equal to $\sum_{s\in{\mathcal{N}}_{k}}(w_{k,i}-w_{s,i})$ . Recursion (33) is equivalent to (28) with ${\mathcal{B}}^{2}$ replaced by ${\mathcal{L}}$ , $\mu_{\lambda}=\rho=c$ , and $\mu_{w}=1/d$ .

Remark 3 (Generalized Framework).

Based on the previous derivations, one can rewrite problem (26) more generally as

[TABLE]

where ${\mathcal{C}}$ and $\bar{{\mathcal{C}}}$ are general consensus matrices satisfying ${\mathcal{C}}{\scriptstyle{\mathcal{W}}}=0$ if, and only if, $\bar{{\mathcal{C}}}{\scriptstyle{\mathcal{W}}}=0$ if, and only if, $w_{1}=\cdots=w_{K}$ . Various algorithms can be derived by proper choices of ${\mathcal{C}}$ and $\bar{{\mathcal{C}}}$ and by using more general primal-dual algorithms. For works focusing on unifying distributed algorithms, we refer interested readers to [27, 37]. $\Box$

Remark 4 (Augmented Lagrangian Term).

We notice that most state-of-the-art algorithms are based on augmented Lagrangian (AL) formulations (i.e., they require $\rho$ to be strictly positive). However, it is unclear whether the AL term is always beneficial. Unlike previous works, we reveal the influence of the AL penalty term on the convergence rate of distributed algorithms compared to the classical Lagrangian case ( $\rho=0$ ). $\Box$

V-B AL Penalty Term Influence

To reveal the influence of the AL penalty parameter on the performance of distributed algorithms, we study the linear convergence properties of (28a)–(28b). To do that, we let $({\scriptstyle{\mathcal{W}}}^{\star},\lambda^{\star}_{b})$ be the point satisfying the optimality conditions of problem (26) where $\lambda^{\star}_{b}$ lies in the range space of ${\mathcal{B}}$ . First, we recall the following result from [34, Proposition 3.6].

Lemma 4 (AL Penalized Cost).

Let $\rho>0$ . If each cost $J_{k}(w)$ is convex and $\delta$ -smooth, and the aggregate cost ${1\over K}\sum_{k=1}^{K}J_{k}(w)$ is $\bar{\beta}$ -strongly convex, then the penalized augmented cost ${\mathcal{J}}({\scriptstyle{\mathcal{W}}})+{\rho\over 2}\|{\scriptstyle{\mathcal{W}}}\|_{{\mathcal{B}}^{2}}^{2}$ is $\nu_{\rho}$ -strongly-convex with respect to ${\scriptstyle{\mathcal{W}}}^{\star}$ where

[TABLE]

and $\nu_{\rho}\rightarrow\bar{\beta}$ as $\rho\rightarrow\infty$ . $\Box$

Note that even if the aggregate cost ${1\over K}\sum_{k=1}^{K}J_{k}(w)$ is strongly-convex, the individual costs $J_{k}(w)$ are not necessarily strongly-convex. For example, with $M=K$ , the cost $J_{k}(w)=\big{(}w(k)\big{)}^{2}$ , where $w(k)$ is the $k$ -th entry of $w\in{\mathbb{R}}^{K}$ , is not strongly-convex with respect to $w$ , but ${1\over K}\sum_{k=1}^{K}J_{k}(w)={1\over K}\|w\|^{2}$ is strongly-convex. The previous lemma allows us to reveal the effect of the AL term through the following result.
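The example can be made explicit through the Hessians. The following is a short worked check, taking $M=K$ and writing $e_{k}$ for the $k$ -th standard basis vector of ${\mathbb{R}}^{K}$ (notation introduced here for illustration only):

```latex
\nabla^{2}J_{k}(w) = 2\,e_{k}e_{k}^{\mathsf{T}}
\quad \text{(rank one, hence singular for } K\geq 2\text{)},
\qquad
\nabla^{2}\Big({1\over K}\sum_{k=1}^{K}J_{k}(w)\Big)
= {2\over K}\sum_{k=1}^{K}e_{k}e_{k}^{\mathsf{T}}
= {2\over K}\,I_{K} \succ 0 .
```

Each individual Hessian has $K-1$ zero eigenvalues, so no $J_{k}$ is strongly-convex on its own, while the aggregate cost is strongly-convex with constant $2/K$ .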

Corollary 2.

Assume that each cost $J_{k}(w)$ is convex and $\delta$ -smooth and let Assumption 2 hold. Then, the following results hold:

• If $\rho>0$ , the aggregate cost ${1\over K}\sum_{k=1}^{K}J_{k}(w)$ is $\bar{\beta}$ -strongly convex, and $\mu_{w}<{1\over\delta_{\rho}},\quad\mu_{\lambda}\leq{\nu_{\rho}\over\sigma^{2}_{\max}({\mathcal{B}})}$ , then recursion (28a)–(28b) with ${\scriptstyle{\mathcal{Y}}}_{-1}=0$ converges linearly and the convergence rate is upper bounded by:

[TABLE]

where $\delta_{\rho}=\delta+\rho\sigma^{2}_{\max}({\mathcal{B}})$ and $\nu_{\rho}>0$ is defined in (35).

• If $\rho=0$ , each cost $J_{k}(w)$ is $\beta_{k}$ -strongly-convex, and $\mu_{w}<{1\over\delta},\quad\mu_{\lambda}\leq{\nu_{0}\over\sigma^{2}_{\max}({\mathcal{B}})}$ , then recursion (28a)–(28b) with ${\scriptstyle{\mathcal{Y}}}_{-1}=0$ converges linearly and the convergence rate is upper bounded by:

[TABLE]

where $\nu_{0}=\min_{k}\beta_{k}$ .

Proof.

See Appendix B. ∎

From the previous result, we see that for $\rho>0$ , we only require the aggregate cost ${1\over K}\sum_{k=1}^{K}J_{k}(w)$ to be strongly-convex to establish linear convergence, since Lemma 4 guarantees that the penalized augmented cost ${\mathcal{J}}_{\rho}({\scriptstyle{\mathcal{W}}})$ is then strongly-convex w.r.t. ${\scriptstyle{\mathcal{W}}}^{\star}$ . However, for linear convergence in the case $\rho=0$ , we require the stronger condition that each individual cost is strongly-convex. This is because the cost ${\mathcal{J}}({\scriptstyle{\mathcal{W}}}):{\mathbb{R}}^{MK}\rightarrow{\mathbb{R}}$ is strongly-convex if, and only if, each individual cost is strongly-convex – see the argument in the proof of Corollary 2. Thus, the AL term is beneficial if the aggregate cost is strongly-convex but the individual costs are not – see the simulation section. However, if each individual cost $J_{k}(w)$ is $\beta_{k}$ -strongly-convex, then the presence of the AL term ( $\rho>0$ ) can either degrade or improve the performance compared to $\rho=0$ , as we now explain.

From the step-size conditions in Corollary 2, the convergence rates $\gamma_{L}$ and $\gamma_{AL}$ have the form

[TABLE]

for some $0<c<1$ , where $\kappa_{L}\;\stackrel{{\scriptstyle\Delta}}{{=}}\;\delta/\nu_{0}$ and $\kappa_{AL}\;\stackrel{{\scriptstyle\Delta}}{{=}}\;\delta_{\rho}/\nu_{\rho}$ are the condition numbers of ${\mathcal{J}}({\scriptstyle{\mathcal{W}}})$ and ${\mathcal{J}}_{\rho}({\scriptstyle{\mathcal{W}}})$ . Note that $\nu_{\rho}\approx\bar{\beta}$ (for large enough $\rho$ ). If the condition number of the aggregate cost is much smaller than the condition number of the individual costs (e.g., $\bar{\beta}\gg\min_{k}\beta_{k}$ ), then the AL method will have a faster convergence rate since $\kappa_{AL}<\kappa_{L}$ and, consequently, $\gamma_{AL}<\gamma_{L}<1$ . However, when the individual costs are well conditioned (e.g., $\beta_{k}\approx\bar{\beta}$ ), then $\kappa_{AL}\approx\kappa_{L}$ and the AL penalty term is not that beneficial. Moreover, for large $\rho$ we can have $\kappa_{AL}>\kappa_{L}$ ; hence $\gamma_{L}<\gamma_{AL}$ and the AL term slows down convergence.
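The comparison of the two condition numbers can be illustrated with a few lines of arithmetic. This sketch uses the large- $\rho$ approximation $\nu_{\rho}\approx\bar{\beta}$ mentioned above in place of the exact expression for $\nu_{\rho}$ ; the function names and all numerical values are hypothetical and chosen only to show the two regimes.

```python
def kappa_lagrangian(delta, betas):
    """kappa_L = delta / nu_0 with nu_0 = min_k beta_k."""
    return delta / min(betas)

def kappa_augmented(delta, beta_bar, rho, sigma_max_sq):
    """kappa_AL = delta_rho / nu_rho, with the approximation nu_rho ~ beta_bar (large rho)."""
    delta_rho = delta + rho * sigma_max_sq
    return delta_rho / beta_bar

# Hypothetical constants, for illustration only.
delta, sigma_max_sq, rho = 8.0, 1.0, 10.0

# Regime 1: one individual cost is poorly conditioned (min_k beta_k << beta_bar).
ill_L = kappa_lagrangian(delta, [0.01, 2.0, 4.0])
ill_AL = kappa_augmented(delta, beta_bar=2.0, rho=rho, sigma_max_sq=sigma_max_sq)

# Regime 2: all individual costs are well conditioned (beta_k close to beta_bar).
well_L = kappa_lagrangian(delta, [6.0, 6.0, 6.0])
well_AL = kappa_augmented(delta, beta_bar=6.0, rho=rho, sigma_max_sq=sigma_max_sq)

print("ill-conditioned locals:  kappa_L =", ill_L, " kappa_AL =", ill_AL)
print("well-conditioned locals: kappa_L =", well_L, " kappa_AL =", well_AL)
```

In the first regime the penalized cost has the smaller condition number, while in the second the extra term $\rho\sigma^{2}_{\max}({\mathcal{B}})$ in $\delta_{\rho}$ makes $\kappa_{AL}$ the larger of the two.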

VI Simulation

To illustrate the influence of the AL term on the performance of distributed algorithms, we consider the distributed optimization problem (23) with quadratic costs $J_{k}(w)=w^{\mathsf{T}}R_{k}w+r_{k}^{\mathsf{T}}w$ where $w\in{\mathbb{R}}^{20}$ , $R_{k}\in{\mathbb{R}}^{20\times 20}$ , and $r_{k}\in{\mathbb{R}}^{20}$ . We randomly generated a network of $K=20$ agents, shown on the right side of Fig. 1. The matrix $A$ is generated using the Metropolis rule [33]. Each vector $r_{k}$ is randomly generated with its entries drawn uniformly from $[0,2]$ . Note that the condition number of the cost $J_{k}(w)=w^{\mathsf{T}}R_{k}w+r_{k}^{\mathsf{T}}w$ is the ratio of the largest to the smallest eigenvalue of $R_{k}$ . In our simulations, we construct the matrix $R_{k}$ under three different scenarios:

VI-1 Well conditioned costs $J_{k}(w)$

The matrix $R_{k}$ is a randomly generated diagonal matrix with integer diagonal entries, each chosen from $[6,8]$ . In this case, each $J_{k}(w)$ is well conditioned because $8/6$ is not very large. The result for this scenario is shown in the left plot of Fig. 1. In all results, "PD distributed" refers to (28) with $\rho=0$ and "AL PD distributed" refers to (28) with $\rho>0$ ; we also include the EXTRA algorithm from [34] and exact diffusion from [26]. The step-sizes are manually chosen to get the best possible convergence rate for each algorithm. We notice that for this case, increasing $\rho$ degrades the performance compared to the case $\rho=0$ . In this scenario, we do not see any advantage of AL methods over the Lagrangian method ( $\rho=0$ ), for the reasons mentioned in the previous section. Note that EXTRA (29) and exact diffusion (30) converge more slowly since they require $\rho=1/2\mu$ , which cannot be tuned independently of the step-size $\mu$ .

VI-2 Ill conditioned costs $J_{k}(w)$

We now construct $R_{k}$ so that the local costs become ill-conditioned. To do that, we let $R_{k}$ be a diagonal matrix whose $(k,k)$ -th diagonal entry, $R_{k}(k,k)$ , is chosen randomly from $[2,8]$ and whose other diagonal entries are chosen uniformly from $(0,1)$ . In this case, the ratio of the largest diagonal entry to the smallest can be very large, making each $J_{k}(w)$ ill-conditioned. However, the aggregate cost ${1\over K}\sum_{k=1}^{K}(w^{\mathsf{T}}R_{k}w+r_{k}^{\mathsf{T}}w)$ is better conditioned than the individual costs. This is because, by construction, the condition number of $R=\sum_{k=1}^{K}R_{k}$ is smaller than the condition number of $R_{k}$ . The left plot of Fig. 2 shows the result for this case. The step-sizes are manually chosen to get the best possible convergence rate for each algorithm. In this case, we see that the Lagrangian method performs poorly compared to the AL PD method, EXTRA, and exact diffusion.
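The first two constructions of $R_{k}$ can be reproduced and their condition numbers compared with a short script. This is a sketch under stated assumptions: the random seed, the use of NumPy's generator, and the helper names `well_conditioned`, `ill_conditioned`, and `cond` are choices made here and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed (assumption)
K = M = 20                      # 20 agents, w in R^20

def well_conditioned():
    """Diagonal R_k with integer diagonal entries drawn from {6, 7, 8}."""
    return [np.diag(rng.integers(6, 9, size=M).astype(float)) for _ in range(K)]

def ill_conditioned():
    """Diagonal R_k with R_k(k,k) in [2, 8] and all other diagonal entries in (0, 1)."""
    mats = []
    for k in range(K):
        d = rng.uniform(0.0, 1.0, size=M)
        d[k] = rng.uniform(2.0, 8.0)
        mats.append(np.diag(d))
    return mats

def cond(R):
    """Condition number of a positive-definite diagonal matrix."""
    d = np.diag(R)
    return d.max() / d.min()

well = well_conditioned()
ill = ill_conditioned()
for name, mats in [("well conditioned", well), ("ill conditioned", ill)]:
    local = [cond(R) for R in mats]
    print(name, "| best local:", min(local), "| worst local:", max(local),
          "| aggregate:", cond(sum(mats)))
```

In the ill-conditioned scenario every diagonal entry of the sum $R$ receives one large contribution from its "own" agent plus many small ones, which is what keeps the aggregate condition number small.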

VI-3 Non-convex costs $J_{k}(w)$

We now consider the case where the individual costs $J_{k}(w)$ are non-convex but the aggregate cost $\sum_{k=1}^{K}J_{k}(w)$ is strongly-convex. To do that, we let $R_{k}$ be a diagonal matrix with the $(k,k)$ -th diagonal entry, $R_{k}(k,k)$ , chosen randomly from $[2,8]$ and with the entry $R_{k}(k-1,k-1)=-{R_{k-1}(k-1,k-1)/2}$ for all $k\geq 2$ . In this case, the individual costs $\{J_{k}(w)\}_{k\geq 2}$ are non-convex since the matrices $\{R_{k}\}_{k\geq 2}$ have negative diagonal entries. However, the aggregate cost ${1\over K}\sum_{k=1}^{K}(w^{\mathsf{T}}R_{k}w+r_{k}^{\mathsf{T}}w)$ is strongly convex since $R=\sum_{k=1}^{K}R_{k}$ is positive-definite by construction. The result of this set-up is shown in the right plot of Fig. 2. The step-sizes are manually chosen to get the best possible convergence rate for each algorithm. We see that the AL-based methods still converge linearly. However, the PD distributed method diverges even under small step-sizes. This is because the cost ${\mathcal{J}}({\scriptstyle{\mathcal{W}}})=\sum_{k=1}^{K}(w_{k}^{\mathsf{T}}R_{k}w_{k}+r_{k}^{\mathsf{T}}w_{k})$ is non-convex since the Hessian ${\nabla}^{2}{\mathcal{J}}({\scriptstyle{\mathcal{W}}})={\rm blkdiag}\{R_{k}\}_{k=1}^{K}$ is indefinite. In contrast, the cost ${\mathcal{J}}_{\rho}({\scriptstyle{\mathcal{W}}})$ is strongly-convex for large $\rho$ .
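The non-convex construction can be checked numerically in the same spirit. The text does not specify the remaining diagonal entries of $R_{k}$ , so this sketch sets them to zero; that choice, the seed, and the variable names are assumptions made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed (assumption)
K = M = 20

# a[k] plays the role of R_k(k,k), drawn from [2, 8] (0-based agent index k).
a = rng.uniform(2.0, 8.0, size=K)

R = []
for k in range(K):
    d = np.zeros(M)                  # assumption: unspecified diagonal entries are zero
    d[k] = a[k]
    if k >= 1:                       # corresponds to k >= 2 in the 1-based text
        d[k - 1] = -a[k - 1] / 2.0   # R_k(k-1,k-1) = -R_{k-1}(k-1,k-1)/2
    R.append(np.diag(d))

local_min_eigs = [np.linalg.eigvalsh(Rk).min() for Rk in R]
aggregate_min_eig = np.linalg.eigvalsh(sum(R)).min()

print("agents with an indefinite R_k:", sum(e < 0 for e in local_min_eigs))
print("smallest eigenvalue of sum_k R_k:", aggregate_min_eig)
```

Under this assumption each negative entry $-R_{k-1}(k-1,k-1)/2$ is outweighed in the sum by the positive entry it was derived from, so $R$ is positive-definite even though the block-diagonal matrix ${\rm blkdiag}\{R_{k}\}$ has negative eigenvalues.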

VII Concluding Remarks

In this work, we studied the linear convergence of the classical incremental primal-dual gradient algorithm (4). We provided an original proof that is applicable to both the Lagrangian and augmented Lagrangian implementations. Moreover, we proved the linear convergence of the non-incremental implementation (19b) by relating it to the incremental one. Finally, we studied algorithm (4) in distributed multi-agent optimization problems. The effect of the AL term on the performance of distributed algorithms is illustrated in theory and validated by means of simulation.

Appendix A Proof of Corollary 1

For $\eta=0$ , we know from Lemma 3 that recursion (19b) is equivalent to the incremental implementation with $\rho=-\mu_{\lambda}$ , namely,

[TABLE]

where $J^{\prime}(w)=J_{-\mu_{\lambda}}(w)=J(w)-{\mu_{\lambda}\over 2}\|Bw-b\|^{2}$ . The above recursion is exactly (4) with $\rho=0$ and cost $J^{\prime}(w)$ instead of $J(w)$ . Therefore, its analysis follows from Theorem 1 as long as $J^{\prime}(w)$ is $\delta^{\prime}$ -smooth and $\nu^{\prime}$ -strongly-convex for some $\delta^{\prime}\geq\nu^{\prime}>0$ . It holds that:

[TABLE]

where the first inequality holds from Cauchy-Schwarz and $\|B(w_{1}-w_{2})\|^{2}\geq\sigma^{2}_{\min}(B)\|w_{1}-w_{2}\|^{2}$ . The last inequality holds since $J(w)$ is $\delta$ -smooth. The above inequality is equivalent to the cost $J_{-\mu_{\lambda}}(w)$ being $\delta^{\prime}$ -smooth with $\delta^{\prime}=\delta-\mu_{\lambda}\sigma^{2}_{\min}(B)$ – see [31, Theorem 2.1.5]. Moreover, from strong-convexity condition (5), it also holds that

[TABLE]

Hence, the cost $J^{\prime}(w)=J_{-\mu_{\lambda}}(w)$ is $\nu^{\prime}$ -strongly-convex with $\nu^{\prime}=\nu-\mu_{\lambda}\sigma^{2}_{\max}(B)>0$ if $\mu_{\lambda}<\nu/\sigma^{2}_{\max}(B)$ . By replacing $\delta$ and $\nu$ with $\delta^{\prime}$ and $\nu^{\prime}$ in (9) and setting $\rho=0$ , we get conditions (21).
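For a quadratic cost, the constants $\delta^{\prime}$ and $\nu^{\prime}$ can be checked directly against the eigenvalues of the Hessian of $J^{\prime}(w)$ . The sketch below is a sanity check on a randomly generated instance, not part of the proof; the quadratic $J(w)={1\over 2}w^{\mathsf{T}}Qw$ , the dimensions, and the seed are assumptions made here. It takes $\sigma^{2}_{\min}(B)$ and $\sigma^{2}_{\max}(B)$ to be the smallest and largest eigenvalues of $B^{\mathsf{T}}B$ , which are the quantities for which the bounds on $\|B(w_{1}-w_{2})\|^{2}$ hold for every pair $w_{1},w_{2}$ .

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed (assumption)
M, E = 6, 3

# Hypothetical quadratic cost J(w) = 0.5 * w^T Q w with Q positive-definite.
G = rng.standard_normal((M, M))
Q = G @ G.T + np.eye(M)
B = rng.standard_normal((E, M))

eig_Q = np.linalg.eigvalsh(Q)            # eigenvalues in ascending order
nu, delta = eig_Q[0], eig_Q[-1]          # strong-convexity / smoothness constants of J
eig_BtB = np.linalg.eigvalsh(B.T @ B)
s_min_sq, s_max_sq = eig_BtB[0], eig_BtB[-1]

mu_lam = 0.5 * nu / s_max_sq             # satisfies mu_lam < nu / sigma_max^2(B)

# Hessian of J'(w) = J(w) - (mu_lam / 2) * ||B w - b||^2
H = Q - mu_lam * (B.T @ B)
eig_H = np.linalg.eigvalsh(H)

nu_prime = nu - mu_lam * s_max_sq
delta_prime = delta - mu_lam * s_min_sq
tol = 1e-9
print("nu' > 0:", nu_prime > 0)
print("lambda_min(H) >= nu':", eig_H[0] >= nu_prime - tol)
print("lambda_max(H) <= delta':", eig_H[-1] <= delta_prime + tol)
```

Both eigenvalue bounds are instances of Weyl's inequality applied to $Q-\mu_{\lambda}B^{\mathsf{T}}B$ .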

Appendix B Proof of Corollary 2

Note that if $\lambda_{-1}=0$ and ${\scriptstyle{\mathcal{Y}}}_{-1}=0$ , then from (27b) and (28b) it holds that ${\scriptstyle{\mathcal{Y}}}_{i}={\mathcal{B}}\lambda_{i}$ for all $i\geq-1$ . Since $\lambda_{i}$ lies in the range space of ${\mathcal{B}}$ , it follows from Lemma 1 that ${\scriptstyle{\mathcal{Y}}}_{i}=0\iff\lambda_{i}=0$ . Thus, the primal iterates (27a) and (28a) are equivalent if $\lambda_{-1}=0$ and ${\scriptstyle{\mathcal{Y}}}_{-1}=0$ . Moreover, if recursion (27a)–(27b) converges linearly to $({\scriptstyle{\mathcal{W}}}^{\star},\lambda^{\star}_{b})$ , then recursion (28a)–(28b) converges linearly to $({\scriptstyle{\mathcal{W}}}^{\star},{\mathcal{B}}\lambda^{\star}_{b})$ and its convergence properties follow from Theorem 1. It remains to verify that the conditions in Theorem 1 hold for the two cases $\rho=0$ and $\rho>0$ . For $\rho>0$ , it holds that the cost ${\mathcal{J}}_{\rho}({\scriptstyle{\mathcal{W}}})={\mathcal{J}}({\scriptstyle{\mathcal{W}}})+{\rho\over 2}\|{\mathcal{B}}{\scriptstyle{\mathcal{W}}}\|^{2}$ is $\delta_{\rho}$ -smooth with $\delta_{\rho}=\delta+\rho\sigma^{2}_{\max}({\mathcal{B}})$ . Moreover, since the aggregate cost ${1\over K}\sum_{k=1}^{K}J_{k}(w):{\mathbb{R}}^{M}\rightarrow{\mathbb{R}}$ is $\bar{\beta}$ -strongly-convex, it holds from Lemma 4 that the augmented penalized cost ${\mathcal{J}}_{\rho}({\scriptstyle{\mathcal{W}}})$ is $\nu_{\rho}$ -strongly convex with respect to ${\scriptstyle{\mathcal{W}}}^{\star}$ . For $\rho=0$ , the augmented cost ${\mathcal{J}}_{0}({\scriptstyle{\mathcal{W}}})={\mathcal{J}}({\scriptstyle{\mathcal{W}}})=\sum_{k=1}^{K}J_{k}(w_{k})$ is separable in $\{w_{k}\}$ so that ${\nabla}{\mathcal{J}}_{0}({\scriptstyle{\mathcal{W}}})={\rm col}\{{\nabla}J_{k}(w_{k})\}_{k=1}^{K}$ . Thus, ${\mathcal{J}}({\scriptstyle{\mathcal{W}}})$ is strongly-convex if, and only if, each individual cost is strongly-convex. Since $J_{k}(w)$ is $\delta$ -smooth and $\beta_{k}$ -strongly-convex, it can be verified that ${\mathcal{J}}_{0}({\scriptstyle{\mathcal{W}}})$ is $\delta$ -smooth and $\nu_{0}$ -strongly-convex where $\nu_{0}=\min_{k}\beta_{k}$ .
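The verification for the case $\rho=0$ can be spelled out as follows. This is a sketch of the standard argument, writing ${\scriptstyle{\mathcal{X}}}={\rm col}\{x_{k}\}_{k=1}^{K}$ and ${\scriptstyle{\mathcal{Z}}}={\rm col}\{z_{k}\}_{k=1}^{K}$ for two arbitrary points in ${\mathbb{R}}^{MK}$ (notation introduced here for illustration only):

```latex
\begin{aligned}
({\scriptstyle{\mathcal{X}}}-{\scriptstyle{\mathcal{Z}}})^{\mathsf{T}}
\big(\nabla\mathcal{J}_{0}({\scriptstyle{\mathcal{X}}})-\nabla\mathcal{J}_{0}({\scriptstyle{\mathcal{Z}}})\big)
&= \sum_{k=1}^{K}(x_{k}-z_{k})^{\mathsf{T}}\big(\nabla J_{k}(x_{k})-\nabla J_{k}(z_{k})\big) \\
&\geq \sum_{k=1}^{K}\beta_{k}\|x_{k}-z_{k}\|^{2}
\;\geq\; \Big(\min_{k}\beta_{k}\Big)\|{\scriptstyle{\mathcal{X}}}-{\scriptstyle{\mathcal{Z}}}\|^{2}, \\[4pt]
\big\|\nabla\mathcal{J}_{0}({\scriptstyle{\mathcal{X}}})-\nabla\mathcal{J}_{0}({\scriptstyle{\mathcal{Z}}})\big\|^{2}
&= \sum_{k=1}^{K}\big\|\nabla J_{k}(x_{k})-\nabla J_{k}(z_{k})\big\|^{2}
\;\leq\; \delta^{2}\sum_{k=1}^{K}\|x_{k}-z_{k}\|^{2}
= \delta^{2}\|{\scriptstyle{\mathcal{X}}}-{\scriptstyle{\mathcal{Z}}}\|^{2}.
\end{aligned}
```

The first chain gives $\nu_{0}=\min_{k}\beta_{k}$ and the second gives $\delta$ -smoothness; both rely only on the separability of ${\mathcal{J}}_{0}$ across the blocks $\{w_{k}\}$ .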

Bibliography

[1] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[2] J. Chen and V. K. Lau, “Convergence analysis of saddle point problems in time varying wireless systems—control theoretical approach,” IEEE Transactions on Signal Processing, vol. 60, no. 1, pp. 443–452, 2012.
[3] A. Cherukuri and J. Cortes, “Initialization-free distributed coordination for economic dispatch under varying loads and generator commitment,” Automatica, vol. 74, pp. 183–193, 2016.
[4] S. V. Macua, J. Chen, S. Zazo, and A. H. Sayed, “Distributed policy evaluation under multiple behavior strategies,” IEEE Transactions on Automatic Control, vol. 60, no. 5, pp. 1260–1274, 2015.
[5] D. Feijer and F. Paganini, “Stability of primal–dual gradient dynamics and applications to network optimization,” Automatica, vol. 46, no. 12, pp. 1974–1981, 2010.
[6] K. J. Arrow, L. Hurwicz, and H. Uzawa, Studies in Linear and Nonlinear Programming. Stanford University Press, Palo Alto, 1958.
[7] T. Kose, “Solutions of saddle value problems by differential equations,” Econometrica, Journal of the Econometric Society, pp. 59–70, 1956.
[8] B. Polyak, “Iterative methods using Lagrange multipliers for solving extremal problems with constraints of the equation type,” USSR Computational Mathematics and Mathematical Physics, vol. 10, no. 5, pp. 42–52, 1970.
