This paper develops decentralized algorithms for multi-agent convex optimization over dynamic networks, providing convergence guarantees and analyzing the impact of network topology on performance.
Contribution
It introduces new algorithms for strongly convex functions with convergence analysis over time-varying directed networks.
Findings
Convergence rates are established for suboptimality, infeasibility, and consensus violation.
Algorithms work with non-smooth convex functions and private conic constraints.
Abstract
We consider cooperative multi-agent consensus optimization problems over both static and time-varying communication networks, where only local communications are allowed. The objective is to minimize the sum of agent-specific, possibly non-smooth, composite convex functions over agent-specific private conic constraint sets; hence, the optimal consensus decision should lie in the intersection of these private sets. Assuming the sum function is strongly convex, we provide convergence rates in suboptimality, infeasibility and consensus violation, and examine the effect of the underlying network topology on the convergence rates of the proposed decentralized algorithms.
Full text
Multi-agent constrained optimization of a strongly convex function over time-varying directed networks
Erfan Yazdandoost Hamedani and Necdet Serhat Aybat
We consider cooperative multi-agent consensus optimization problems over both static and time-varying communication networks, where only local communications are allowed. The objective is to minimize the sum of agent-specific, possibly non-smooth, composite convex functions over agent-specific private conic constraint sets; hence, the optimal consensus decision should lie in the intersection of these private sets. Assuming the sum function is strongly convex, we provide convergence rates in suboptimality, infeasibility and consensus violation, and examine the effect of the underlying network topology on the convergence rates of the proposed decentralized algorithms.
I Introduction
Decentralized optimization over communication networks has various applications: i) distributed parameter estimation in wireless sensor networks [1, 2]; ii) multi-agent cooperative control and coordination in multirobot networks [3, 4]; iii) distributed spectrum sensing in cognitive radio networks [5, 6]; iv) processing distributed big-data in (online) machine learning [7, 8, 9, 10, 11]; v) power control problem in cellular networks [12], to name a few application areas.
In many of these network applications the communication network may be directed, i.e., communication links can be unidirectional, and/or the network in the wireless setting may be time-varying, e.g., communication links can be on/off over time due to failures, or the links may exist among agents depending on their inter-distances. In the context of decentralized optimization, time-varying directed networks can also arise in wired networks, as uni-directional asynchronous protocols are preferred over bi-directional communication protocols, which create deadlocks due to the lack of an enforcement rule to block a third node when the other two neighbors are exchanging local variables between themselves [7]. In the majority of the applications discussed above, regardless of whether the topology is time-invariant (static) or time-varying, or whether the network has undirected or directed links, one common characteristic shared by today’s big-data networks is that the network size is usually prohibitively large for centralized optimization, which requires a fusion center that collects the physically distributed data and runs a centralized optimization method. This process has expensive communication overhead, requires enough memory to store and process the data, and may also violate data privacy when agents are not willing to share their data even though they are collaborative agents [13, 14].
In this paper, from a broader perspective, we aim to study constrained distributed optimization of a strongly convex function over static or time-varying communication networks $\mathcal{G}^t=(\mathcal{N},\mathcal{E}^t)$ for $t\geq 0$; in particular, from an application perspective, we are motivated to design an efficient decentralized solution method for constrained LASSO (C-LASSO) problems [15] with distributed data. C-LASSO, having the generic form $\min_x\{\lambda\|x\|_1+\|Cx-d\|_2^2:\ Ax\leq b\}$, is an important class of problems in statistics, which includes fused LASSO, constrained regression, and generalized LASSO problems as special cases [16, 15, 17], to name a few. In the rest, we provide our results for a more general setting of constrained decentralized optimization. We assume that i) each node $i\in\mathcal{N}$ has a local conic convex constraint set $\chi_i$, for which projections are not easy to compute, and a local convex objective function $\varphi_i$ (possibly non-smooth) such that $\sum_{i\in\mathcal{N}}\varphi_i(x)$ is strongly convex, and ii) nodes are willing to collaborate, without sharing their private data defining $\chi_i$ and $\varphi_i$, to compute an optimal consensus decision minimizing the sum of local functions and satisfying all local constraints; moreover, iii) nodes are only allowed to communicate with the neighboring nodes over the links in the network.
Although we assume that $\sum_{i\in\mathcal{N}}\varphi_i(x)$ is strongly convex, it is possible that none of the local functions $\{\varphi_i\}_{i\in\mathcal{N}}$ are strongly convex. This kind of structure arises in LASSO problems; in particular, let $\varphi_i:\mathbb{R}^n\to\mathbb{R}$ such that $\varphi_i(x)=\lambda\|x\|_1+\|C_ix-d_i\|_2^2$ for $C_i\in\mathbb{R}^{m_i\times n}$ and $d_i\in\mathbb{R}^{m_i}$ for $i\in\mathcal{N}$. Note that while $\varphi_i$ is merely convex for all $i\in\mathcal{N}$, $\sum_{i\in\mathcal{N}}\varphi_i(x)$ is strongly convex when $m_i<n$ for $i\in\mathcal{N}$ and $\mathrm{rank}(C)=n\leq\sum_{i\in\mathcal{N}}m_i\triangleq m$, where $C=[C_i]_{i\in\mathcal{N}}\in\mathbb{R}^{m\times n}$. Therefore, it is important to note that in the centralized formulation of this problem, $\min_x\sum_{i\in\mathcal{N}}\varphi_i(x)$, the objective is strongly convex; however, this is not the case in the decentralized formulation, where we minimize $\sum_{i\in\mathcal{N}}\varphi_i(x_i)$ while imposing consensus among the local variables $\{x_i\}_{i\in\mathcal{N}}$. In the numerical section, we consider a distributed C-LASSO problem under a similar strong convexity setting.
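The gap between local and global strong convexity can be checked numerically. The following is a minimal sketch, assuming randomly generated blocks $C_i$ (hypothetical data, not from the paper) with $m_i<n$: the Hessian $2C_i^\top C_i$ of each local quadratic term is singular, while the Hessian $2C^\top C$ of the sum is positive definite once the stacked matrix has rank $n$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_agents, m_i = 6, 4, 2   # m_i < n, so every C_i has a nontrivial null space

# Hypothetical local data blocks C_i (random, for illustration only).
C_blocks = [rng.standard_normal((m_i, n)) for _ in range(num_agents)]
C = np.vstack(C_blocks)        # stacked matrix C = [C_i], shape (8, 6)

def min_hessian_eig(mat):
    """Smallest eigenvalue of 2 * mat^T mat, the Hessian of x -> ||mat x - d||_2^2."""
    return np.linalg.eigvalsh(2.0 * mat.T @ mat).min()

local_moduli = [min_hessian_eig(Ci) for Ci in C_blocks]   # numerically zero
global_modulus = min_hessian_eig(C)                       # strictly positive

print("local moduli :", np.round(local_moduli, 12))
print("global modulus:", global_modulus)
```

The $\ell_1$ term contributes no curvature, so the printed values are the convexity moduli of the smooth parts only.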
Many of the real-life application problems discussed above are special cases of the generic conic constrained decentralized optimization framework discussed in this paper. With the motivation of designing an efficient decentralized solution method for the distributed conic constrained problem over static or time-varying communication networks, as we briefly described above, we propose distributed primal-dual algorithms: DPDA for static and DPDA-TV for time-varying communication networks. DPDA and DPDA-TV are both based on the primal-dual algorithm (PDA), recently proposed in [18] for convex-concave saddle-point problems which for sake of completeness will be discussed in detail in Section I-A.
Problem Description. Let {Gt}t∈R+ denote a time-varying graph of N computing nodes. More precisely, for all t≥0, the graph has the form Gt=(N,Et), where N≜{1,…,N} is the set of nodes and Et⊆N×N is the set of (possibly directed) edges at time t. Suppose that each node i∈N has a private (local) cost function φi:Rn→R∪{+∞} such that
$$\varphi_i(x)\triangleq\rho_i(x)+f_i(x), \tag{1}$$
where ρi:Rn→R∪{+∞} is a possibly non-smooth convex function, and fi:Rn→R is a smooth convex function. We assume fi is differentiable on an open set containing domρi with a Lipschitz continuous gradient ∇fi, of which Lipschitz constant is Li; and the prox map of ρi,
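$$\mathrm{prox}_{\rho_i}(x)\triangleq\operatorname*{argmin}_{y\in\mathbb{R}^n}\Big\{\rho_i(y)+\tfrac{1}{2}\|y-x\|^2\Big\}, \tag{2}$$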
is efficiently computable for i∈N, where ∥.∥ denotes the Euclidean norm.
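As an illustration of an efficiently computable prox map, consider $\rho_i(x)=t\|x\|_1$, the non-smooth term of the LASSO objective. Its prox map is componentwise soft-thresholding. The sketch below uses illustrative function names of our own and compares the closed form against the prox objective at perturbed points.

```python
import numpy as np

def prox_l1(x, t):
    """Prox map of t * ||.||_1 at x: componentwise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_objective(y, x, t):
    """Objective minimized by the prox map: t*||y||_1 + 0.5*||y - x||^2."""
    return t * np.abs(y).sum() + 0.5 * np.sum((y - x) ** 2)

x = np.array([3.0, -0.5, 1.0])
p = prox_l1(x, 1.0)            # entries with |x_j| <= t are set to zero

rng = np.random.default_rng(0)
perturbed = [p + 0.1 * rng.standard_normal(3) for _ in range(100)]
is_minimizer = all(prox_objective(p, x, 1.0) <= prox_objective(y, x, 1.0) for y in perturbed)
```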
Consider the following minimization problem:
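$$x^*\in\operatorname*{argmin}_{x\in\mathbb{R}^n}\ \bar{\varphi}(x)\triangleq\sum_{i\in\mathcal{N}}\varphi_i(x)\quad\text{s.t.}\quad A_ix-b_i\in\mathcal{K}_i,\quad i\in\mathcal{N}, \tag{3}$$
where $A_i\in\mathbb{R}^{m_i\times n}$, $b_i\in\mathbb{R}^{m_i}$ and $\mathcal{K}_i\subseteq\mathbb{R}^{m_i}$ is a closed, convex cone. Suppose that projections onto $\mathcal{K}_i$ can be computed efficiently, while the projection onto the preimage $\chi_i\triangleq A_i^{-1}(\mathcal{K}_i+b_i)$ is assumed to be impractical, e.g., when $\mathcal{K}_i$ is the positive semidefinite cone, projection onto the preimage requires solving an SDP.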
Assumption I.1.
The duality gap for (3) is zero, and a primal-dual solution to (3) exists.
A sufficient condition is the existence of a Slater point, i.e., there exists xˉ∈relint(domφˉ) such that Aixˉ−bi∈int(Ki) for i∈N, where domφˉ=∩i∈Ndomφi.
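Definition 1.
A differentiable function $f:\mathbb{R}^n\to\mathbb{R}$ is strongly convex with modulus $\mu>0$ if the following inequality holds:
$$f(x)\geq f(\bar{x})+\langle\nabla f(\bar{x}),x-\bar{x}\rangle+\frac{\mu}{2}\|x-\bar{x}\|^2\quad\forall\,x,\bar{x}\in\mathbb{R}^n.$$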
Assumption I.2.
Suppose $\bar{f}(x)\triangleq\sum_{i\in\mathcal{N}}f_i(x)$ is strongly convex with modulus $\bar{\mu}>0$; and each $f_i$ is strongly convex with modulus $\mu_i\geq 0$ for $i\in\mathcal{N}$, and define $\underline{\mu}\triangleq\min_{i\in\mathcal{N}}\{\mu_i\}\geq 0$.
Remark I.1.
Clearly μˉ≥∑i∈Nμi is always true, and it is possible that μi=0 for all i∈N but still μˉ>0; moreover, μˉ>0 implies that x∗ is the unique optimal solution to (3).
Previous Work. Consider $\min_{x\in\mathbb{R}^n}\{\bar{\varphi}(x):\ x\in\cap_{i\in\mathcal{N}}\chi_i\}$ over a communication network of computing agents $\mathcal{N}$, where $\bar{\varphi}(x)=\sum_{i\in\mathcal{N}}\varphi_i(x)$. Although unconstrained consensus optimization, i.e., $\chi_i=\mathbb{R}^n$, is well studied for static or time-varying networks (see [19, 20] and the references therein), the constrained case is still an area of active research, e.g., [19, 20, 21, 22, 23, 24, 25, 26, 27, 28].
Our focus in this paper is on the case where φˉ is strongly convex such that each φi=ρi+fi is composite convex, and χi has the form Ai−1(Ki+bi) for i∈N. In this section, we briefly review the existing work related to our setup.
Unconstrained minimization of a strongly convex objective function $\bar{f}(x)\triangleq\sum_{i\in\mathcal{N}}f_i(x)$ in the multi-agent setting has been investigated in many papers, e.g., [29, 30, 31, 32, 33] considered static communication networks $\mathcal{G}=(\mathcal{N},\mathcal{E})$ while [34, 35] studied time-varying networks. In the rest, suppose that $\mu_i\geq 0$ denotes the convexity modulus of $f_i$ for $i\in\mathcal{N}$. In [29], Makhdoumi and Ozdaglar proposed a distributed ADMM to solve $\min_x\bar{f}(x)$ over a time-invariant (static), undirected network; they show that when $f_i$ has a Lipschitz continuous gradient with constant $L_i$ and when $\mu_i>0$ for each $i\in\mathcal{N}$, the local iterates at all nodes are within an $\epsilon$-ball of the optimal solution after at most $O(\kappa\log(1/\epsilon))$ iterations, where $\kappa=L_{\max}/\underline{\mu}$, $L_{\max}\triangleq\max_{i\in\mathcal{N}}L_i$ and $\underline{\mu}\triangleq\min_{i\in\mathcal{N}}\{\mu_i\}$; on the other hand, since each iteration requires exact minimization of an augmented function involving $f_i$ at each $i\in\mathcal{N}$, iterations can be very costly depending on $f_i$. In [36], Chang et al. considered the composite convex minimization problem, $\min_x\sum_{i\in\mathcal{N}}\rho_i(x)+f_i(C_ix)$, over a static undirected network $\mathcal{G}$, where $\rho_i$ is merely convex and $f_i$ is strongly convex with a Lipschitz continuous gradient for $i\in\mathcal{N}$. A method based on ADMM taking proximal-gradient steps, IC-ADMM, is proposed to reduce the computational work of ADMM due to the exact minimizations required in each iteration. Under the assumption that the smallest eigenvalue of the un-oriented Laplacian of $\mathcal{G}$ is known at all agents, it is shown that the IC-ADMM sequence converges when each $f_i$ is strongly convex; no rate result is provided for this case. On the other hand, linear convergence is established in the absence of the merely convex (possibly non-smooth) term $\rho_i$ and assuming each $C_i$ has full column-rank in addition to the previous assumptions required for establishing the convergence result.
In a similar spirit, to overcome the costly exact minimizations required in ADMM, an exact first-order algorithm (EXTRA) is proposed in [30] for minimizing $\bar{f}$ over an undirected static network $\mathcal{G}$. When $\bar{f}$ is smooth and strongly convex with modulus $\bar{\mu}>0$, it is shown that the algorithm has linear convergence without assuming each $f_i$ to be strongly convex, provided that the step-size $\alpha>0$, constant among all the nodes, is sufficiently small, i.e., $\alpha=O(\bar{\mu}/L_{\max}^2)$. In a follow-up work, Extra-Push [31] has been proposed, which extends EXTRA to handle strongly connected, directed static networks using the push-sum protocol. Convergence of Extra-Push, without providing any rate, has been shown under a boundedness assumption on the iterate sequence; moreover, under the assumption that the stationary distribution, $\phi\in\mathbb{R}^{|\mathcal{N}|}$, of the column-stochastic mixing matrix that represents the static directed network is known, i.e., each node $i\in\mathcal{N}$ knows $\phi_i>0$, they relax the boundedness assumption on the iterate sequence, and show that a variant of Extra-Push converges at a linear rate if each $f_i$ is smooth and strongly convex with $\mu_i>0$ for $i\in\mathcal{N}$; note that assuming each node $i\in\mathcal{N}$ knows $\phi_i$ exactly is a fairly strong assumption in a decentralized optimization setting. In [32], Xi et al. also combined EXTRA with the push-sum protocol to obtain DEXTRA to minimize strongly convex $\bar{f}$ over a static directed network. In addition to the assumptions on $\{f_i\}_{i\in\mathcal{N}}$ in [31], by further assuming that $\nabla f_i$ is bounded over $\mathbb{R}^n$ for $i\in\mathcal{N}$, which implies boundedness of the iterate sequence,
it is shown that the iterate sequence converges linearly when the constant step-size α, fixed for all i∈N, is chosen carefully belonging to a non-trivial interval [αmin,αmax] such that αmin>0 – note that the boundedness on each ∇fi is a strong requirement and clearly it is not satisfied by commonly used quadratic loss function. In a follow up paper [33], Xi and Khan proposed Accelerated Distributed Directed Optimization (ADD-OPT) where they improved on the nontrivial step-size condition of DEXTRA and showed that the iterates converge linearly when the constant step-size α is chosen sufficiently small – assuming that the directed network topology is static and each fi is strongly convex with Lipschitz continuous gradients (without assuming boundedness as in [32]). In a more general setting, Nedić and Olshevsky [34] proposed a stochastic (sub)gradient-push for minimizing strongly convex fˉ on time-varying directed graphs without assuming differentiability when the stochastic error in subgradient samples has zero mean and bounded standard deviation.
When μi>0 for all i∈N, choosing a diminishing step-size sequence, they were able to show O(log(k)/k) rate result provided that the iterate sequence stays bounded – the boundedness assumption on the iterate sequence can be removed by assuming that functions are smooth, having Lipschitz continuous gradients.
In [35], Nedić et al. proposed distributed inexact gradient methods referred to as DIGing and Push-DIGing for time-varying undirected and directed networks, respectively. Assuming fi is strongly convex with Lipschitz continuous gradient for each i∈N, it is shown that the iterate sequence converges linearly provided that the constant step-size α, fixed for all i∈N, is chosen sufficiently small.
For constrained consensus optimization, other than few exceptions, e.g., [23, 24, 25, 26, 27, 28], the existing methods require that each node compute a projection on the local set χi in addition to consensus and (sub)gradient steps, e.g., [21, 22]. Moreover, among those few exceptions, only [25, 26, 27, 28] can handle agent-specific constraints without assuming global knowledge of the constraints by all agents. However, no rate results in terms of suboptimality, local infeasibility, and consensus violation exist for the primal-dual distributed methods in [25, 26, 27] when implemented for the agent-specific conic constraint sets χi={x:Aix−bi∈Ki} studied in this paper. In [25], a consensus-based distributed primal-dual perturbation (PDP) algorithm using a diminishing step-size sequence is proposed. The objective is to minimize a composition of a global network function (smooth) with the sum of local objective functions (smooth), i.e., F(∑i∈Nfi(x)), subject to local compact sets and inequality constraints on the summation of agent specific constrained functions, i.e., ∑i∈Ngi(x)≤0, over a time-varying directed network. They showed that the local primal-dual iterate sequence converges to a global optimal primal-dual solution; however, no rate result was provided. The proposed PDP method can also handle non-smooth constraints with similar convergence guarantees.
In a recent work [26], the authors proposed a distributed algorithm on time-varying directed networks for solving saddle-point problems subject to consensus constraints. The algorithm can also solve consensus optimization problems with inequality constraints that can be written as a summation of local convex functions of local and global variables. It is shown that using a carefully selected decreasing step-size sequence, the ergodic average of the primal-dual sequence converges with $O(1/\sqrt{k})$... 
Contribution. To the best of our knowledge, only a handful of methods, e.g., [25, 26, 27, 28], can handle consensus problems similar to (3), with agent-specific local constraint sets $\{\chi_i\}_{i\in\mathcal{N}}$, without requiring each agent $i\in\mathcal{N}$ to project onto $\chi_i$. However, no rate results in terms of suboptimality, local infeasibility, and consensus violation exist for the distributed methods in [25, 26, 27] when implemented for the conic sets $\{\chi_i\}_{i\in\mathcal{N}}$ studied in this paper; moreover, none of these four methods exploits the strong convexity of the sum function $\bar{\varphi}=\sum_{i\in\mathcal{N}}\varphi_i$. We believe DPDA and DPDA-TV proposed in this paper are among the first decentralized algorithms to solve (3) with an $O(1/k^2)$ ergodic rate guarantee on both suboptimality and infeasibility. More precisely, we show that when $\bar{\varphi}$ is strongly convex and each $\varphi_i$ is composite convex with smooth $f_i$ for $i\in\mathcal{N}$, our proposed method reduces the suboptimality and infeasibility with $O(1/k^2)$ rate as $k$, the number of primal-dual iterations, increases, and it requires $O(k)$ and $O(k\log(k))$ local communications for all $k$ iterations in total when the network topologies are static and time-varying, respectively. To the best of our knowledge, this is the best rate result for our setting. Moreover, the proposed methods do not require the agents to know any global parameter depending on the entire network topology, e.g., the second smallest eigenvalue of the Laplacian.
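Notation. Throughout, $\|\cdot\|$ denotes either the Euclidean norm or the spectral norm, and $\langle\theta,w\rangle\triangleq\theta^\top w$ for $\theta,w\in\mathbb{R}^n$. Given a convex set $\mathcal{S}$, let $\sigma_{\mathcal{S}}(\cdot)$ denote its support function, i.e., $\sigma_{\mathcal{S}}(\theta)\triangleq\sup_{w\in\mathcal{S}}\langle\theta,w\rangle$, let $\mathbb{I}_{\mathcal{S}}(\cdot)$ denote the indicator function of $\mathcal{S}$, i.e., $\mathbb{I}_{\mathcal{S}}(w)=0$ for $w\in\mathcal{S}$ and equal to $+\infty$ otherwise, and let $P_{\mathcal{S}}(w)\triangleq\operatorname{argmin}\{\|v-w\|:\ v\in\mathcal{S}\}$ denote the projection onto $\mathcal{S}$. For a closed convex set $\mathcal{S}$, we define the distance function as $d_{\mathcal{S}}(w)\triangleq\|P_{\mathcal{S}}(w)-w\|$. Given a convex cone $\mathcal{K}\subseteq\mathbb{R}^m$, let $\mathcal{K}^*$ denote its dual cone, i.e., $\mathcal{K}^*\triangleq\{\theta\in\mathbb{R}^m:\ \langle\theta,w\rangle\geq 0\ \ \forall w\in\mathcal{K}\}$, and let $\mathcal{K}^\circ\triangleq-\mathcal{K}^*$ denote the polar cone of $\mathcal{K}$. Note that for a given cone $\mathcal{K}\subseteq\mathbb{R}^m$, $\sigma_{\mathcal{K}}(\theta)=0$ for $\theta\in\mathcal{K}^\circ$ and equal to $+\infty$ if $\theta\notin\mathcal{K}^\circ$, i.e., $\sigma_{\mathcal{K}}(\theta)=\mathbb{I}_{\mathcal{K}^\circ}(\theta)$ for all $\theta\in\mathbb{R}^m$.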
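Given a convex function $g:\mathbb{R}^n\to\mathbb{R}\cup\{+\infty\}$, its convex conjugate is defined as $g^*(w)\triangleq\sup_{\theta\in\mathbb{R}^n}\langle w,\theta\rangle-g(\theta)$. $\otimes$ denotes the Kronecker product, $\mathbf{1}_n\in\mathbb{R}^n$ is the vector of all ones, and $I_n$ is the $n\times n$ identity matrix. $\mathbb{S}^n_{++}$ ($\mathbb{S}^n_+$) denotes the cone of symmetric positive (semi)definite matrices. For $Q\succ 0$, i.e., $Q\in\mathbb{S}^n_{++}$, the $Q$-norm is defined as $\|z\|_Q\triangleq\sqrt{z^\top Qz}$. Given $Q\in\mathbb{S}^n_+$, $\lambda^+_{\min}(Q)$ denotes the smallest positive eigenvalue of $Q$. $\Pi$ denotes the Cartesian product. Finally, for $\theta\in\mathbb{R}^n$, we adopt $(\theta)_+\in\mathbb{R}^n_+$ to denote $\max\{\theta,\mathbf{0}\}$, where the max is computed componentwise.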
I-A Preliminary
Let X and Y be finite-dimensional vector spaces. In a recent paper, Chambolle and Pock [18] proposed a primal-dual algorithm (PDA) for the following convex-concave saddle-point problem:
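$$\min_{x\in\mathcal{X}}\max_{y\in\mathcal{Y}}\ \mathcal{L}(x,y)\triangleq\Phi(x)+\langle Tx,y\rangle-h(y), \tag{4}$$
where $\Phi(x)\triangleq\rho(x)+g(x)$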
is a strongly convex function with modulus μ such that ρ and h are possibly non-smooth convex functions, g is convex and has a Lipschitz continuous gradient defined on domρ with Lipschitz constant L, and T is a linear map. Given some positive step-size sequences {τk,κk,ηk}k≥0 and the initial iterates x0,y0, PDA consists of two proximal-gradient steps:
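$$\begin{aligned}
y^{k+1}&\leftarrow\operatorname*{argmin}_{y}\ h(y)-\big\langle T\big(x^k+\eta^k(x^k-x^{k-1})\big),y\big\rangle+D_k(y,y^k), &\text{(5a)}\\
x^{k+1}&\leftarrow\operatorname*{argmin}_{x}\ \rho(x)+\langle\nabla g(x^k),x-x^k\rangle+\langle Tx,y^{k+1}\rangle+\frac{1}{2\tau^k}\|x-x^k\|^2, &\text{(5b)}
\end{aligned}$$
where $D_k$ is a Bregman distance function such that $D_k(y,\bar{y})\geq\frac{1}{2\kappa^k}\|y-\bar{y}\|^2$ for any $y$ and $\bar{y}$ and $k\geq 0$.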
In [18], a simple proof for the ergodic convergence is provided for (5); indeed, it is shown that
when the convexity modulus for ρ and g are μ and [math], resp., and
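if $\tau^k,\kappa^k,\eta^k>0$ are chosen such that $\frac{1}{\tau^k}+\mu\geq\frac{1}{\tau^{k+1}\eta^{k+1}}$, $\big(\frac{1}{\tau^k}-L\big)\geq\|T\|^2\kappa^k$, and $\kappa^k=\kappa^{k+1}\eta^{k+1}$ for all $k\geq 0$, then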
[TABLE]
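for all $(x,y)\in\mathcal{X}\times\mathcal{Y}$, where $N_K\triangleq\sum_{k=1}^K\frac{\kappa^{k-1}}{\kappa^0}$, $\bar{x}^K\triangleq N_K^{-1}\sum_{k=1}^K\frac{\kappa^{k-1}}{\kappa^0}x^k$ and $\bar{y}^K\triangleq N_K^{-1}\sum_{k=1}^K\frac{\kappa^{k-1}}{\kappa^0}y^k$ for all $K\geq 1$. In [18], it is shown that $\{\tau^k,\kappa^k,\eta^k\}_{k\geq 0}$ can be chosen such that $N_k=O(k^2)$, $\tau^k=O(1/k)$ and $\kappa^k=O(k)$ for $k\geq 0$.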
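One way to see that the three step-size conditions are compatible is to generate a sequence and check them numerically. The sketch below uses an acceleration-style update rule that is our own illustrative assumption (it is not claimed to be the exact choice made in [18]): $\eta^{k+1}=1/\sqrt{1+\mu\tau^k}$, $\tau^{k+1}=\eta^{k+1}\tau^k$, $\kappa^{k+1}=\kappa^k/\eta^{k+1}$, started from $\tau^0=(1-c)/L$ and $\kappa^0=c/(\tau^0\|T\|^2)$ for a hypothetical constant $c\in(0,1)$.

```python
import math

def pda_step_sizes(mu, L, norm_T, num_iters, c=0.5):
    """Illustrative (assumed) rule producing tau^k, kappa^k, eta^k for k = 0..num_iters."""
    tau = (1.0 - c) / L if L > 0 else 1.0
    kappa = c / (tau * norm_T ** 2)          # keeps tau^k * kappa^k * ||T||^2 = c < 1
    taus, kappas, etas = [tau], [kappa], [1.0]
    for _ in range(num_iters):
        eta = 1.0 / math.sqrt(1.0 + mu * tau)
        tau, kappa = eta * tau, kappa / eta
        taus.append(tau); kappas.append(kappa); etas.append(eta)
    return taus, kappas, etas

def conditions_hold(taus, kappas, etas, mu, L, norm_T, rtol=1e-9):
    """Check the three step-size conditions up to floating-point tolerance."""
    for k in range(len(taus) - 1):
        c1 = 1.0 / taus[k] + mu >= (1.0 - rtol) / (taus[k + 1] * etas[k + 1])
        c2 = 1.0 / taus[k] - L >= (1.0 - rtol) * norm_T ** 2 * kappas[k]
        c3 = math.isclose(kappas[k], kappas[k + 1] * etas[k + 1], rel_tol=rtol)
        if not (c1 and c2 and c3):
            return False
    return True

def ergodic_weight_sum(kappas, K):
    """N_K = sum_{k=1}^K kappa^{k-1} / kappa^0."""
    return sum(kappas[k - 1] for k in range(1, K + 1)) / kappas[0]

taus, kappas, etas = pda_step_sizes(mu=1.0, L=5.0, norm_T=2.0, num_iters=2000)
ratio = ergodic_weight_sum(kappas, 2000) / ergodic_weight_sum(kappas, 1000)
print(conditions_hold(taus, kappas, etas, 1.0, 5.0, 2.0), ratio)
```

Under this rule $\tau^k$ decays like $1/k$ and $\kappa^k$ grows like $k$, so doubling $K$ roughly quadruples $N_K$, in line with $N_K=O(K^2)$.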
First, in Section II, we discuss a special case of (4), which will help us develop a decentralized primal-dual algorithm, DPDA, for the consensus optimization problem in (3) when the communication network topology is static, and we provide the main results for the static case in Theorem II.2.
Next, in Section III, we propose a decentralized algorithm DPDA-TV to solve (3) when the network topology is time-varying, and we extend our convergence results to time-varying case in Theorem III.2. Finally, in Section IV, we test the performance of the proposed methods for solving distributed constrained LASSO problems.
II A distributed method for a static network topology
In this section we discuss how PDA, stated in (5), can be implemented to compute an ϵ-optimal solution to (3) in a distributed way, using only O(1/ϵ) local communications over a static communication network G.
Let G=(N,E) denote a connected undirected graph of N computing nodes, where N≜{1,…,N} and E⊆N×N denotes the set of edges – without loss of generality assume that (i,j)∈E implies i<j. Suppose nodes i and j can exchange information only if (i,j)∈E. Let Ni≜{j∈N:(i,j)∈E or (j,i)∈E} denote the set of neighboring nodes of i∈N, and di≜∣Ni∣ is the degree of node i∈N.
Let xi∈Rn denote the local decision vector of node i∈N. By taking advantage of the fact that G is connected, we can reformulate (3) as a consensus optimization problem:
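$$\min_{x_i\in\mathbb{R}^n,\ i\in\mathcal{N}}\ \sum_{i\in\mathcal{N}}\varphi_i(x_i)\quad\text{s.t.}\quad x_i=x_j:\ \lambda_{ij},\ \ \forall(i,j)\in\mathcal{E},\qquad A_ix_i-b_i\in\mathcal{K}_i:\ \theta_i,\ \ \forall i\in\mathcal{N}, \tag{7}$$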
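where $\lambda_{ij}\in\mathbb{R}^n$ and $\theta_i\in\mathbb{R}^{m_i}$ are the corresponding dual variables. Let $\mathbf{x}=[x_i]_{i\in\mathcal{N}}\in\mathbb{R}^{n|\mathcal{N}|}$. The consensus constraints $x_i=x_j$ for $(i,j)\in\mathcal{E}$ can be formulated as $M\mathbf{x}=0$, where $M\in\mathbb{R}^{n|\mathcal{E}|\times n|\mathcal{N}|}$ is a block matrix such that $M=H\otimes I_n$, where $H$ is the oriented edge-node incidence matrix, i.e., the entry $H_{(i,j),l}$, corresponding to edge $(i,j)\in\mathcal{E}$ and node $l\in\mathcal{N}$, is equal to $1$ if $l=i$, $-1$ if $l=j$, and $0$ otherwise.
Note that $M^\top M=H^\top H\otimes I_n=\Omega\otimes I_n$, where $\Omega\in\mathbb{R}^{|\mathcal{N}|\times|\mathcal{N}|}$ denotes the graph Laplacian of $\mathcal{G}$, i.e., $\Omega_{ii}=d_i$, $\Omega_{ij}=-1$ if $(i,j)\in\mathcal{E}$ or $(j,i)\in\mathcal{E}$, and equal to $0$ otherwise.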
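The identities relating $H$, $M$ and $\Omega$ are easy to verify on a small example. The sketch below builds the oriented incidence matrix for a hypothetical 4-node cycle (0-based node labels, edges stored with $i<j$) and checks that $H^\top H$ is the graph Laplacian and that $M$ annihilates any consensus vector $\mathbf{1}\otimes\bar{x}$.

```python
import numpy as np

def incidence_matrix(num_nodes, edges):
    """Oriented edge-node incidence matrix: the row of edge (i, j) has +1 at i and -1 at j."""
    H = np.zeros((len(edges), num_nodes))
    for row, (i, j) in enumerate(edges):
        H[row, i], H[row, j] = 1.0, -1.0
    return H

edges = [(0, 1), (1, 2), (2, 3), (0, 3)]   # hypothetical 4-node cycle
num_nodes, n = 4, 2
H = incidence_matrix(num_nodes, edges)
M = np.kron(H, np.eye(n))                  # M = H (x) I_n
Omega = H.T @ H                            # graph Laplacian: degrees on the diagonal

# A consensus vector 1 (x) x_bar stacks the same local decision at every node.
x_consensus = np.kron(np.ones(num_nodes), np.array([1.5, -2.0]))
print(Omega)
print(np.linalg.norm(M @ x_consensus))
```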
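Since $x^*$ is the unique solution to (3) and (7), and since $\mathbf{x}^*\triangleq\mathbf{1}\otimes x^*$ satisfies $(\Omega\otimes I_n)\mathbf{x}^*=0$, one can reformulate (7) as a saddle point problem. Indeed, let $\mathbf{x}=[x_i]_{i\in\mathcal{N}}$, $\mathbf{y}=[\theta^\top\lambda^\top]^\top$ such that $\theta=[\theta_i]_{i\in\mathcal{N}}$ and $\lambda=[\lambda_{ij}]_{(i,j)\in\mathcal{E}}$, then for any $\alpha\geq 0$, one can compute a primal-dual optimal solution to (3) through solving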
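$$\min_{\mathbf{x}}\max_{\mathbf{y}}\ \mathcal{L}(\mathbf{x},\mathbf{y})\triangleq\sum_{i\in\mathcal{N}}\Big(\varphi_i(x_i)+\langle\theta_i,A_ix_i-b_i\rangle-\sigma_{\mathcal{K}_i}(\theta_i)\Big)+\langle\lambda,M\mathbf{x}\rangle+\frac{\alpha}{2}\|\mathbf{x}\|^2_{\Omega\otimes I_n}. \tag{8}$$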
Next, we consider implementation of PDA in (5) to solve (8) for some α≥0.
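Definition 2.
Let $\mathcal{X}\triangleq\Pi_{i\in\mathcal{N}}\mathbb{R}^n$ and $\mathcal{X}\ni\mathbf{x}=[x_i]_{i\in\mathcal{N}}$; $\mathcal{Y}\triangleq\Pi_{i\in\mathcal{N}}\mathbb{R}^{m_i}\times\mathbb{R}^{n|\mathcal{E}|}$, $\mathcal{Y}\ni\mathbf{y}=[\theta^\top\lambda^\top]^\top$ such that $\theta=[\theta_i]_{i\in\mathcal{N}}\in\mathbb{R}^m$ and $\lambda=[\lambda_{ij}]_{(i,j)\in\mathcal{E}}\in\mathbb{R}^{m_0}$, where $m\triangleq\sum_{i\in\mathcal{N}}m_i$ and $m_0\triangleq n|\mathcal{E}|$. Given parameters $\gamma^k>0$ and $\kappa_i^k>0$ for $i\in\mathcal{N}$, let $D_{\gamma^k}\triangleq\frac{1}{\gamma^k}I_{m_0}$, $D_{\kappa^k}\triangleq\mathrm{diag}\big(\big[\frac{1}{\kappa_i^k}I_{m_i}\big]_{i\in\mathcal{N}}\big)$, and $D_{\kappa^k,\gamma^k}\triangleq\begin{bmatrix}D_{\kappa^k}&0\\0&D_{\gamma^k}\end{bmatrix}$.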
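Definition 3.
Let $\Phi,\varphi:\mathcal{X}\to\mathbb{R}\cup\{\infty\}$ such that $\Phi(\mathbf{x})=\rho(\mathbf{x})+g(\mathbf{x})$ and $\varphi(\mathbf{x})=\rho(\mathbf{x})+f(\mathbf{x})$, where $\rho(\mathbf{x})\triangleq\sum_{i\in\mathcal{N}}\rho_i(x_i)$, $f(\mathbf{x})\triangleq\sum_{i\in\mathcal{N}}f_i(x_i)$, and $g(\mathbf{x})\triangleq f(\mathbf{x})+\frac{\alpha}{2}\|\mathbf{x}\|^2_{\Omega\otimes I_n}$, and let $h:\mathcal{Y}\to\mathbb{R}\cup\{\infty\}$ such that $h(\mathbf{y})\triangleq\sum_{i\in\mathcal{N}}\sigma_{\mathcal{K}_i}(\theta_i)+\langle b_i,\theta_i\rangle$. Define the block-diagonal matrix $A\triangleq\mathrm{diag}([A_i]_{i\in\mathcal{N}})\in\mathbb{R}^{m\times n|\mathcal{N}|}$ and $T=[A^\top\ M^\top]^\top$.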
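Given some positive parameters $\gamma^k,\tau^k>0$ and $\kappa_i^k>0$ for $i\in\mathcal{N}$ (we shortly discuss how to select them), we define the Bregman function $D_k(\mathbf{y},\bar{\mathbf{y}})=\frac{1}{2}\|\mathbf{y}-\bar{\mathbf{y}}\|^2_{D_{\kappa^k,\gamma^k}}$ for each $k\geq 0$. Hence, given $\Phi$, $h$ and $T$ as in Definition 3, and the initial iterates $\mathbf{x}^0$ and $\mathbf{y}^0=[{\theta^0}^\top{\lambda^0}^\top]^\top$, the PDA iterations given in (5) take the following form: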
[TABLE]
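Since $\mathcal{K}_i$ is a cone, $\mathrm{prox}_{\kappa_i^k\sigma_{\mathcal{K}_i}}(\cdot)=P_{\mathcal{K}_i^\circ}(\cdot)$; hence, $\theta_i^{k+1}$ can be written in closed form as
$$\theta_i^{k+1}=P_{\mathcal{K}_i^\circ}\Big(\theta_i^k+\kappa_i^k\big(A_i(x_i^k+\eta^k(x_i^k-x_i^{k-1}))-b_i\big)\Big),\quad i\in\mathcal{N}.$$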
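To make the closed-form dual update concrete, consider the C-LASSO type constraint $A_ix\leq b_i$, i.e., $A_ix-b_i\in\mathcal{K}_i$ with $\mathcal{K}_i$ the nonpositive orthant. Its polar cone is the nonnegative orthant, so $P_{\mathcal{K}_i^\circ}$ is the componentwise operation $(\cdot)_+$. The sketch below assumes the $\theta$-step is the proximal step on $\sigma_{\mathcal{K}_i}(\theta_i)+\langle b_i,\theta_i\rangle$ evaluated at the extrapolated primal point, as the PDA template suggests; the function name and the data are hypothetical.

```python
import numpy as np

def theta_step(theta, x, x_prev, A, b, kappa, eta):
    """Dual update for A x - b in K, with K the nonpositive orthant (assumed setting).

    The polar cone of K is the nonnegative orthant, so the projection is max(., 0).
    """
    x_extrap = x + eta * (x - x_prev)          # extrapolated primal point
    return np.maximum(theta + kappa * (A @ x_extrap - b), 0.0)

A = np.eye(2)
b = np.array([1.0, 1.0])
x = np.array([2.0, 0.0])                       # violates the first constraint only
theta_next = theta_step(np.zeros(2), x, x, A, b, kappa=0.5, eta=1.0)
print(theta_next)
```

The multiplier of the violated constraint becomes positive, while the multiplier of the satisfied constraint stays at zero.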
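Using recursion in (9c), we can write $\lambda^{k+1}$ as a partial summation of primal iterates $\{\mathbf{x}^\ell\}_{\ell=0}^k$, i.e., $\lambda^{k+1}=\lambda^0+\sum_{\ell=0}^k\gamma^\ell M\big(\mathbf{x}^\ell+\eta^\ell(\mathbf{x}^\ell-\mathbf{x}^{\ell-1})\big)$ for $k\geq 0$. Let $\lambda^0\leftarrow\mathbf{0}$, and define $\{\mathbf{s}^k\}_{k\geq 0}$ such that $\mathbf{s}^0=\mathbf{0}$ and $\mathbf{s}^{k+1}=\mathbf{s}^k+\gamma^k\big(\mathbf{x}^k+\eta^k(\mathbf{x}^k-\mathbf{x}^{k-1})\big)$ for $k\geq 0$; hence, $\lambda^k=M\mathbf{s}^k$ for $k\geq 0$. Using the fact that $M^\top M=\Omega\otimes I_n$, we obtain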
[TABLE]
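The change of variables from the edge-based multiplier $\lambda^k$ to the node-based running sum $\mathbf{s}^k$ can be sanity-checked numerically. The sketch below uses random stand-ins for the primal iterates and for the parameters $\gamma^k,\eta^k$ (hypothetical values), and assumes the convention $\mathbf{x}^{-1}=\mathbf{x}^0$. It verifies $\lambda^k=M\mathbf{s}^k$ and $M^\top\lambda^k=(\Omega\otimes I_n)\mathbf{s}^k$, which is the reason nodes only need to exchange $\mathbf{s}^k$ with their neighbors.

```python
import numpy as np

rng = np.random.default_rng(1)
H = np.array([[1.0, -1.0, 0.0], [0.0, 1.0, -1.0]])   # path graph on 3 nodes
n, K = 2, 6
M = np.kron(H, np.eye(n))
Omega = H.T @ H

gammas = rng.uniform(0.5, 1.5, size=K)                # stand-ins for gamma^k
etas = rng.uniform(0.0, 1.0, size=K)                  # stand-ins for eta^k
xs = [rng.standard_normal(3 * n) for _ in range(K)]   # stand-ins for x^0, ..., x^{K-1}

lam = np.zeros(M.shape[0])   # lambda^0 = 0 (one block per edge)
s = np.zeros(M.shape[1])     # s^0 = 0 (one block per node)
x_prev = xs[0]               # assumed convention x^{-1} = x^0
for k in range(K):
    step = xs[k] + etas[k] * (xs[k] - x_prev)
    lam = lam + gammas[k] * (M @ step)   # edge-based recursion
    s = s + gammas[k] * step             # node-based running sum
    x_prev = xs[k]

print(np.linalg.norm(lam - M @ s))
```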
Thus, PDA iterations given in (9) for the static graph G can be computed in a decentralized way, via the node-specific computations as in time-invariant distributed primal dual algorithm displayed in Fig. 1 below.
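Definition 4.
Let $W\in\mathbb{S}^{|\mathcal{N}|}$ such that $W_{ij}=W_{ji}<0$ for $(i,j)\in\mathcal{E}$, $W_{ij}=W_{ji}=0$ for $(i,j)\notin\mathcal{E}$, and $W_{ii}=-\sum_{j\in\mathcal{N}_i}W_{ij}$ for $i\in\mathcal{N}$.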
Remark II.1.
According to Assumption I.2, when $\underline{\mu}>0$, $f(\mathbf{x})=\sum_{i\in\mathcal{N}}f_i(x_i)$ is strongly convex with modulus $\underline{\mu}$. That said, as emphasized in the introduction, although $\bar{f}(x)=\sum_{i\in\mathcal{N}}f_i(x)$ is strongly convex with modulus $\bar{\mu}>0$, it is possible that $f$ is not strongly convex when $\underline{\mu}=0$.
Inspired by Proposition 3.6 in [30], we show that by suitably regularizing $f$, one can obtain a strongly convex function when $\underline{\mu}=0$.
Lemma II.1.
Consider $f(\mathbf{x})=\sum_{i\in\mathcal{N}}f_i(x_i)$ under Assumption I.2, and suppose $\underline{\mu}=0$. Given $\alpha>0$ and $W$ as in Definition 4, let $f_\alpha(\mathbf{x})\triangleq f(\mathbf{x})+\alpha r(\mathbf{x})$, where $r(\mathbf{x})\triangleq\frac{1}{2}\|\mathbf{x}\|^2_{W\otimes I_n}$. Then $f_\alpha$ is strongly convex with modulus $\mu_\alpha\triangleq\frac{\bar{\mu}/|\mathcal{N}|+\alpha\lambda_2}{2}-\Big(\big(\frac{\bar{\mu}/|\mathcal{N}|-\alpha\lambda_2}{2}\big)^2+4\bar{L}^2\Big)^{1/2}>0$ for any $\alpha>\frac{4}{\lambda_2\bar{\mu}}\sum_{i\in\mathcal{N}}L_i^2$, where $\bar{L}=\sqrt{\frac{\sum_{i\in\mathcal{N}}L_i^2}{|\mathcal{N}|}}$ and $\lambda_2=\lambda^+_{\min}(W)$.
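The modulus in Lemma II.1 can be evaluated directly, and the threshold on $\alpha$ is exactly the point where $\mu_\alpha$ changes sign. The sketch below evaluates both on a hypothetical instance of our own choosing: scalar quadratics $f_i(x)=\frac{1}{2}q_ix^2$ with $q=(1,0,0)$ on a 3-node path graph, so that $\underline{\mu}=0$, $\bar{\mu}=1$ and $L_i=q_i$. It also compares $\mu_\alpha$ with the smallest eigenvalue of the Hessian of $f_\alpha$.

```python
import numpy as np

def alpha_threshold(mu_bar, lipschitz, lam2):
    """Lower bound on alpha from Lemma II.1: 4 * sum_i L_i^2 / (lam2 * mu_bar)."""
    return 4.0 * np.sum(np.square(lipschitz)) / (lam2 * mu_bar)

def mu_alpha(alpha, mu_bar, lipschitz, lam2):
    """Strong convexity modulus of f_alpha as stated in Lemma II.1."""
    num_nodes = len(lipschitz)
    L_bar_sq = np.sum(np.square(lipschitz)) / num_nodes
    a, b = mu_bar / num_nodes, alpha * lam2
    return 0.5 * (a + b) - np.sqrt((0.5 * (a - b)) ** 2 + 4.0 * L_bar_sq)

# Hypothetical instance: f_i(x) = 0.5 * q_i * x^2 with only one curved local function.
q = np.array([1.0, 0.0, 0.0])
W = np.array([[1.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 1.0]])  # as in Definition 4
eigs = np.linalg.eigvalsh(W)
lam2 = eigs[eigs > 1e-10].min()          # smallest positive eigenvalue of W
mu_bar = q.sum()                         # modulus of the sum function

threshold = alpha_threshold(mu_bar, q, lam2)
alpha = 2.0 * threshold
modulus = mu_alpha(alpha, mu_bar, q, lam2)
hess_min_eig = np.linalg.eigvalsh(np.diag(q) + alpha * W).min()  # Hessian of f_alpha
print(threshold, modulus, hess_min_eig)
```

On this instance the true curvature of $f_\alpha$ exceeds $\mu_\alpha$, which is consistent with the lemma providing a lower bound rather than the exact modulus.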
Remark II.2.
When $\underline{\mu}>0$, i.e., all $f_i$’s are strongly convex, the parameter $\alpha$ can be set to zero; hence, $g(\mathbf{x})=f(\mathbf{x})$ is strongly convex with modulus $\mu_g=\underline{\mu}$. Otherwise, when $\underline{\mu}=0$, $\alpha$ should be chosen according to Lemma II.1; hence, $g(\mathbf{x})=f_\alpha(\mathbf{x})$ is strongly convex with modulus $\mu_g=\mu_\alpha$. The condition $\alpha>\frac{4}{\bar{\mu}\lambda^+_{\min}(W)}\sum_{i\in\mathcal{N}}L_i^2$ is similar to the one in [30], where they also have a parameter $W\in\mathbb{S}^{|\mathcal{N}|}_+$ for their algorithm and $\alpha$ should be greater than $\frac{|\mathcal{N}|L_{\max}^2}{2\bar{\mu}\lambda^+_{\min}(W)}$, where $L_{\max}=\max_{i\in\mathcal{N}}L_i$.
Next, we quantify the suboptimality and infeasibility of the DPDA iterate sequence.
Theorem II.2.
Suppose Assumption I.1 holds.
Let $\{\mathbf{x}^k,\theta^k\}_{k\geq 0}$ be the sequence generated by Algorithm DPDA, displayed in Fig. 1, initialized from an arbitrary $\mathbf{x}^0$ and $\theta^0=\mathbf{0}$. Then $\{\mathbf{x}^k\}_{k\geq 0}$ converges to $\mathbf{x}^*=\mathbf{1}\otimes x^*$ such that $x^*$ is the optimal solution to (3);
moreover, the following error bounds hold for all K≥1:
[TABLE]
where $\Theta_0\triangleq\frac{1}{2\gamma^0}+\sum_{i\in\mathcal{N}}\Big[\frac{1}{2\tau^0}\|x_i^0-x^*\|^2+\frac{2}{\kappa_i^0}\|\theta_i^*\|^2\Big]$, $\bar{\mathbf{x}}^K=N_K^{-1}\sum_{k=1}^K\gamma^{k-1}\mathbf{x}^k$, and $N_K=\sum_{k=1}^K\gamma^{k-1}=O(K^2)$. Moreover, $\tilde{\tau}^K/\gamma^K=O(1/K^2)$.
Remark II.3.
Note that the result in Theorem II.2 can be extended to weighted graphs by replacing the Laplacian matrix $\Omega$ in $g(\mathbf{x})=f(\mathbf{x})+\frac{\alpha}{2}\|\mathbf{x}\|^2_{\Omega\otimes I_n}$ with a weighted Laplacian $W$ as in Definition 4, and also replacing the consensus constraint $M\mathbf{x}=0$ in (7) with $(W\otimes I_n)\mathbf{x}=0$.
III A distributed method for a time-varying communication network
In this section we develop a distributed primal-dual algorithm for solving (3) when the communication network topology is time-varying. We will adopt the following definition and assumption for the time-varying network model.
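Definition 5.
Given $t\geq 0$, for an undirected graph $\mathcal{G}^t=(\mathcal{N},\mathcal{E}^t)$, let $\mathcal{N}_i^t\triangleq\{j\in\mathcal{N}:\ (i,j)\in\mathcal{E}^t\ \text{or}\ (j,i)\in\mathcal{E}^t\}$ denote the set of neighboring nodes of $i\in\mathcal{N}$, and let $d_i^t\triangleq|\mathcal{N}_i^t|$ represent the degree of node $i\in\mathcal{N}$ at time $t$; for a directed graph $\mathcal{G}^t=(\mathcal{N},\mathcal{E}^t)$, let $\mathcal{N}_i^{t,\mathrm{in}}\triangleq\{j\in\mathcal{N}:\ (j,i)\in\mathcal{E}^t\}\cup\{i\}$ and $\mathcal{N}_i^{t,\mathrm{out}}\triangleq\{j\in\mathcal{N}:\ (i,j)\in\mathcal{E}^t\}\cup\{i\}$ denote the in-neighbors and out-neighbors of node $i$ at time $t$, respectively; and let $d_i^t\triangleq|\mathcal{N}_i^{t,\mathrm{out}}|$ be the out-degree of node $i$.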
Assumption III.1.
Suppose that {Gt}t∈R+ is a collection of either all directed or all undirected graphs. When Gt is an undirected graph, node i∈N can send and receive data to and from j∈N at time t only if j∈Nit, i.e., (i,j)∈Et or (j,i)∈Et; on the other hand, when Gt is a directed graph, node i∈N can receive data from j∈N only if j∈Nit,in, i.e., (j,i)∈Et, and can send data to j∈N only if j∈Nit,out, i.e., (i,j)∈Et.
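We assume a compact domain, i.e., let $\Delta_i\triangleq\max_{x_i,x_i'\in\mathrm{dom}\,\rho_i}\|x_i-x_i'\|$ and $\Delta\triangleq\max_{i\in\mathcal{N}}\Delta_i<\infty$. Let $\mathcal{B}_0\triangleq\{x\in\mathbb{R}^n:\ \|x\|\leq 2\Delta\}$ and $\mathcal{B}\triangleq\Pi_{i\in\mathcal{N}}\mathcal{B}_0$; and let $\mathcal{C}$ and $\tilde{\mathcal{C}}$ be the sets of consensus and bounded consensus decisions, respectively: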
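$$\mathcal{C}\triangleq\{\mathbf{x}\in\mathbb{R}^{n|\mathcal{N}|}:\ \exists\,\bar{x}\in\mathbb{R}^n\ \text{s.t.}\ x_i=\bar{x}\ \ \forall i\in\mathcal{N}\},\qquad\tilde{\mathcal{C}}\triangleq\mathcal{C}\cap\mathcal{B}.$$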
Since x∗ is the unique solution to (3) and since x∗≜1⊗x∗ satisfies PC(x∗)=0, one can reformulate (3) as a saddle point problem using C. Indeed, let x=[xi]i∈N∈Rn∣N∣, y=[θ⊤λ⊤]⊤ such that θ=[θi]i∈N and λ∈Rn∣N∣, then for any α≥0, one can compute a primal-dual optimal solution to (3) through solving
[TABLE]
Next, we consider a slightly different implementation of PDA in (5) to solve (11).
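Definition 6.
Let $\mathcal{X}\triangleq\Pi_{i\in\mathcal{N}}\mathbb{R}^n$ and $\mathcal{X}\ni\mathbf{x}=[x_i]_{i\in\mathcal{N}}$; $\mathcal{Y}\triangleq\Pi_{i\in\mathcal{N}}\mathbb{R}^{m_i}\times\mathbb{R}^{m_0}$, $\mathcal{Y}\ni\mathbf{y}=[\theta^\top\lambda^\top]^\top$ and $\theta=[\theta_i]_{i\in\mathcal{N}}\in\mathbb{R}^m$, where $m\triangleq\sum_{i\in\mathcal{N}}m_i$ and $m_0\triangleq n|\mathcal{N}|$. Given parameters $\gamma^k>0$ and $\kappa_i^k>0$ for $i\in\mathcal{N}$, let $D_{\gamma^k}\triangleq\frac{1}{\gamma^k}I_{m_0}$, $D_{\kappa^k}\triangleq\mathrm{diag}\big(\big[\frac{1}{\kappa_i^k}I_{m_i}\big]_{i\in\mathcal{N}}\big)$, and $D_{\kappa^k,\gamma^k}\triangleq\begin{bmatrix}D_{\kappa^k}&0\\0&D_{\gamma^k}\end{bmatrix}$.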
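Definition 7.
Let $\Phi,\varphi:\mathcal{X}\to\mathbb{R}\cup\{\infty\}$ such that $\Phi(\mathbf{x})=\rho(\mathbf{x})+g(\mathbf{x})$ and $\varphi(\mathbf{x})=\rho(\mathbf{x})+f(\mathbf{x})$, where $\rho(\mathbf{x})\triangleq\sum_{i\in\mathcal{N}}\rho_i(x_i)$, $g(\mathbf{x})\triangleq f(\mathbf{x})+\frac{\alpha}{2}d^2_{\mathcal{C}}(\mathbf{x})$ and $f(\mathbf{x})\triangleq\sum_{i\in\mathcal{N}}f_i(x_i)$, and let $h:\mathcal{Y}\to\mathbb{R}\cup\{\infty\}$ such that $h(\mathbf{y})\triangleq\sigma_{\mathcal{C}}(\lambda)+\sum_{i\in\mathcal{N}}\sigma_{\mathcal{K}_i}(\theta_i)+\langle b_i,\theta_i\rangle$. Define the block-diagonal matrix $A\triangleq\mathrm{diag}([A_i]_{i\in\mathcal{N}})\in\mathbb{R}^{m\times n|\mathcal{N}|}$ and $T=[A^\top\ I_{n|\mathcal{N}|}]^\top$.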
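Given some positive parameters $\gamma^k,\tau^k>0$ and $\kappa_i^k>0$ for $i\in\mathcal{N}$ (we shortly discuss how to select them), we define the Bregman function $D_k(\mathbf{y},\bar{\mathbf{y}})=\frac{1}{2}\|\mathbf{y}-\bar{\mathbf{y}}\|^2_{D_{\kappa^k,\gamma^k}}$ for each $k\geq 0$. Hence, given $\Phi$, $h$ and $T$ as in Definition 7, and the initial iterates $\mathbf{x}^0$ and $\mathbf{y}^0=[{\theta^0}^\top{\lambda^0}^\top]^\top$, the PDA iterations given in (5) take the following form for $k\geq 0$: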
[TABLE]
where $\xi^{-1}=\xi^0=\mathbf{x}^0$ and $\nu^0=\lambda^0$. For $k\geq 0$, using the extended Moreau decomposition for proximal operators, $\lambda^{k+1}$ in (12b) can be computed as
[TABLE]
where $\omega^k\triangleq\frac{1}{\gamma^k}\nu^k+\xi^k+\eta^k(\xi^k-\xi^{k-1})$ for $k\geq 0$. Moreover, $\nabla g$ for the $\mathbf{x}$-step in (12d) can be computed as
[TABLE]
For any $\mathbf{x}=[x_i]_{i\in\mathcal{N}}\in\mathcal{X}$, $P_{\mathcal{C}}(\mathbf{x})$ and $P_{\tilde{\mathcal{C}}}(\mathbf{x})$ can be computed as
[TABLE]
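where $p(\mathbf{x})\triangleq\frac{1}{|\mathcal{N}|}\sum_{i\in\mathcal{N}}x_i$, $P_{\mathcal{B}}(\mathbf{x})=[P_{\mathcal{B}_0}(x_i)]_{i\in\mathcal{N}}$ and $P_{\mathcal{B}_0}(x_i)=x_i\min\{1,\frac{2\Delta}{\|x_i\|}\}$ for $i\in\mathcal{N}$. Equivalently, $P_{\mathcal{C}}(\mathbf{x})=(W\otimes I_n)\mathbf{x}$ for $W\triangleq\frac{1}{|\mathcal{N}|}\mathbf{1}\mathbf{1}^\top\in\mathbb{R}^{|\mathcal{N}|\times|\mathcal{N}|}$.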
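Both projections have simple closed forms when all local decisions are available in one place. The centralized sketch below is for illustration only, since the point of the section is that the average cannot be computed this way in a network. It stores $\mathbf{x}$ as an array with one row per node (our own layout choice) and checks that averaging the rows agrees with multiplying by $W\otimes I_n$.

```python
import numpy as np

def project_consensus(x_blocks):
    """P_C: replace every row x_i by the network average p(x)."""
    avg = x_blocks.mean(axis=0)
    return np.tile(avg, (x_blocks.shape[0], 1))

def project_ball_blocks(x_blocks, Delta):
    """P_B: scale each row x_i by min{1, 2*Delta/||x_i||} (rows at the origin are unchanged)."""
    norms = np.linalg.norm(x_blocks, axis=1, keepdims=True)
    scale = np.minimum(1.0, 2.0 * Delta / np.maximum(norms, 1e-300))
    return x_blocks * scale

x = np.array([[3.0, 4.0], [0.0, 0.0], [0.3, 0.4]])   # three nodes, n = 2
num_nodes, n = x.shape
W = np.ones((num_nodes, num_nodes)) / num_nodes

pc = project_consensus(x)
pc_kron = np.kron(W, np.eye(n)) @ x.reshape(-1)      # (W (x) I_n) x on the stacked vector
pb = project_ball_blocks(x, Delta=1.0)               # only the first row is rescaled
```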
Although θ-step of the PDA implementation in (12) can be computed locally at each node, computing x-step and λ-step require communication among the nodes to evaluate PC(ωk) and PC(ξk). Indeed, evaluating the average operator p(.) is not a simple operation in a decentralized computational setting which only allows for communication among the neighbors. In order to overcome this issue, we will approximate the average operator p(.) using multi-communication rounds, and analyze the resulting iterations as an inexact primal-dual algorithm.
We define a communication round at time t as an operation over Gt such that every node simultaneously sends and receives data to and from its neighboring nodes according to Assumption III.1 – the details of this operation will be discussed shortly. We assume that communication among neighbors occurs instantaneously, and nodes operate synchronously; and we further assume that for each PDA iteration k≥0, there exists an approximate averaging operator Rk(⋅) which can be computed in a decentralized fashion and approximate PC(⋅) with decreasing approximation error as k, the number of PDA iterations, increases. This inexact version of PDA using approximate averaging operator Rk(⋅) and running on time-varying communication network {Gt} will be called DPDA-TV, of which details will be explained next.
Assumption III.2.
Given a time-varying network {Gt}t∈R+ such that Gt=(N,Et) for t≥0, suppose that there is a global clock known to all i∈N. Assume that the local operations required to compute ΠKi as in (12a), and proxρi and ∇fi as in (12e), can be completed between two ticks of the clock for all i∈N and k≥0; and every time the clock ticks, a communication round with instantaneous messaging between neighboring nodes takes place subject to Assumption III.1. Suppose that for each k≥0 there exists Rk(⋅)=[Rik(⋅)]i∈N such that Rik(⋅) can be computed with local information available to node i∈N, and decentralized computation of Rk requires qk communication rounds. Furthermore, we assume that there exist Γ>0 and β∈(0,1) such that for all k≥0, Rk satisfies
[TABLE]
Now we briefly talk about such operators. Let Vt∈R∣N∣×∣N∣ be a matrix encoding the topology of Gt=(N,Et) in some way for t∈Z+. We define Wt,s≜VtVt−1...Vs+1 for any t,s∈Z+ such that t≥s+1. For directed time-varying graph Gt, set Vt∈R∣N∣×∣N∣ as:
[TABLE]
Let tk∈Z+ be the total number of communication rounds done before the k-th iteration of DPDA-TV, and let qk∈Z+ be the number of communication rounds to be performed within the k-th iteration while evaluating Rk. For x=[xi]i∈N∈X such that xi∈Rn for i∈N, define
[TABLE]
to approximate PC(⋅). Note that Rk(⋅) can be computed in a distributed fashion requiring qk communication rounds; Rk is nothing but the push-sum protocol [37]. Assuming that the digraph sequence {Gt}t∈Z+ is uniformly strongly connected (M-strongly connected), it follows from [37, 38] that Rk satisfies Assumption III.2. When {Gt}t∈Z+ is an undirected time-varying network, then choosing {Vt} according to Metropolis weights, one can show that
[TABLE]
satisfies Assumption III.2 under certain conditions, e.g., see [39].
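To make the push-sum idea concrete, here is a minimal sketch of the standard push-sum (ratio consensus) iteration on a directed graph. It is not the paper's exact operator: the weight rule (17) and the definition (18) are not reproduced in this text, so the sketch assumes the usual choice in which each node splits its mass equally among its out-neighbors and itself. The function name and the example digraph are hypothetical.

```python
def push_sum(values, out_neighbors_per_round):
    """Run one push-sum round per entry of out_neighbors_per_round.

    values[i] is the scalar held by node i; out_neighbors_per_round[t][j]
    lists the nodes that j sends to in round t (j itself excluded).
    Returns each node's estimate of the network average.
    """
    n = len(values)
    s = list(values)      # running weighted sums
    w = [1.0] * n         # running weights
    for out_nbrs in out_neighbors_per_round:
        new_s, new_w = [0.0] * n, [0.0] * n
        for j in range(n):
            # Assumed rule: equal split over out-neighbors and self,
            # which makes the mixing matrix column stochastic.
            share = 1.0 / (len(out_nbrs[j]) + 1)
            for i in list(out_nbrs[j]) + [j]:
                new_s[i] += share * s[j]
                new_w[i] += share * w[j]
        s, w = new_s, new_w
    return [s[i] / w[i] for i in range(n)]

# Strongly connected digraph with unequal out-degrees, held fixed for 60 rounds.
digraph = [[1, 2], [2], [3], [0]]
estimates = push_sum([4.0, 8.0, 0.0, 4.0], [digraph] * 60)
print(estimates)   # every entry is close to the true average 4.0
```

Dividing the sum by the weight is what corrects for the mixing matrix being only column stochastic; passing a different neighbor list per round models a time-varying network.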
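Note that for $\tilde{\mathcal{R}}^k(\cdot)\triangleq\mathcal{P}_{\mathcal{B}}(\mathcal{R}^k(\cdot))$, we have $\tilde{\mathcal{R}}^k(\mathbf{w})\in\mathcal{B}$ and $\|\tilde{\mathcal{R}}^k(\mathbf{w})-\mathcal{P}_{\tilde{\mathcal{C}}}(\mathbf{w})\|\leq N\Gamma\beta^{q_k}\|\mathbf{w}\|$ for $\mathbf{w}\in\mathbb{R}^{m_0}$, due to the non-expansivity of $\mathcal{P}_{\mathcal{B}}$.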
Consider the k-th iteration of PDA as shown in (12). Instead of setting νk+1 to λk+1 and ξk+1 to xk+1, which require computing PC, we propose replacing these assignment operations in (12c) and (12e) with an operation that uses the inexact averaging operator Rk to approximate PC. This way, we obtain inexact variant of (12) replacing (12c) and (12e) with
[TABLE]
Thus, the PDA iterations given in (12) can be computed inexactly, but in a decentralized way for a time-varying connectivity network {Gt}t≥0, via the node-specific computations of the time-varying distributed primal-dual algorithm displayed in Fig. 2 below. Indeed, the iterate sequence {ξk,νk,θk}k≥0 generated by DPDA-TV displayed in Fig. 2 is the same sequence generated by the recursion in (12a), (20a), and (20b). The sequences {xk}k≥0 and {λk}k≥0 will not be explicitly computed; instead, we will use them in the analysis of the inexact algorithm.
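Recall from Remark II.1 that it is possible to have $\underline{\mu}=0$. In the next lemma, similar to Lemma II.1, we generalize the result in Proposition 3.6 of [30], making it suitable for time-varying topology, and show that by suitably regularizing f, one can obtain a strongly convex function when $\underline{\mu}=0$.
Lemma III.1.
Consider $f(\mathbf{x})=\sum_{i\in\mathcal{N}}f_i(x_i)$ under Assumption I.2 and suppose $\underline{\mu}=0$. Given $\alpha>0$, let $f_\alpha(\mathbf{x})\triangleq f(\mathbf{x})+\alpha r(\mathbf{x})$, where $r(\mathbf{x})\triangleq\frac{1}{2}d_{\mathcal{C}}^2(\mathbf{x})$.
Then $f_\alpha$ is strongly convex with modulus $\mu_\alpha\triangleq\frac{\bar{\mu}/|\mathcal{N}|+\alpha}{2}-\sqrt{\left(\frac{\bar{\mu}/|\mathcal{N}|-\alpha}{2}\right)^2+4\bar{L}^2}>0$ for any $\alpha>\frac{4}{\bar{\mu}}\sum_{i\in\mathcal{N}}L_i^2$, where $\bar{L}=\sqrt{\sum_{i\in\mathcal{N}}L_i^2/|\mathcal{N}|}$.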
Next, we quantify the suboptimality and infeasibility of the DPDA-TV iterate sequence. Recall that if $\underline{\mu}>0$, then we set $\alpha=0$ and $g=f$; otherwise, when $\underline{\mu}=0$, it follows from Lemma III.1 that for any $\alpha>\frac{4}{\bar{\mu}}\sum_{i\in\mathcal{N}}L_i^2$, $f_\alpha$ is strongly convex with modulus $\mu_\alpha>0$; hence, we set $g=f_\alpha$. See also Remark II.1.
Theorem III.2.
Suppose Assumptions I.1, I.2, III.1 and III.2 hold.
Starting from $\nu^0=\mathbf{0}$, $\theta^0=\mathbf{0}$, and an arbitrary $\mathbf{x}^0$, let $\{\xi^k,\theta^k,\nu^k\}_{k\geq 0}$ be the iterate sequence generated by Algorithm DPDA-TV, displayed in Fig. 2, using $q_k\geq(5+c)\log_{1/\beta}(k+1)$ communication rounds for the k-th iteration for $k\geq 0$. Then $\{\xi^k\}_{k\geq 0}$ converges to $\mathbf{x}^*=\mathbf{1}\otimes x^*$ such that $x^*$ is the optimal solution to (3).
Moreover, the following bounds hold for all K≥1:
[TABLE]
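and the parameters satisfy $N_K=\mathcal{O}(K^2)$ and $\tilde{\tau}^K/\gamma^K=\mathcal{O}(1/K^2)$, where $N_K=\sum_{k=1}^K\gamma^{k-1}$, $\bar{\mathbf{x}}^K=N_K^{-1}\sum_{k=1}^K\gamma^{k-1}\mathbf{x}^k$, and $\Theta(K)=\mathcal{O}\big(\sum_{k=1}^{K}\beta^{q_{k-1}}k^{4}\big)$; hence, $\sup_{K\in\mathbb{Z}_+}\Theta(K)<\infty$.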
Remark III.1.
Note that, at the K-th iteration, the suboptimality, infeasibility and consensus violation are $\mathcal{O}\big(\frac{1}{N_K}\Theta(K)\big)$ in the ergodic sense, and the distance of the iterates to $\mathbf{x}^*$ is $\mathcal{O}\big(\frac{\tilde{\tau}^K}{\gamma^K}\Theta(K)\big)$, where $\Theta(K)$ denotes the error accumulation due to average approximation.
Moreover, $\Theta(K)$ can be bounded above for all $K\geq 1$ as $\Theta_2(K)\leq C_1\sum_{k=1}^K\beta^{q_{k-1}}k^4$; therefore, for any $c>0$, choosing $\{q_k\}_{k\in\mathbb{Z}_+}$ as stated in Theorem III.2 ensures that $\sum_{k=1}^{\infty}\beta^{q_{k-1}}k^4<1+\frac{1}{c}$.
Moreover, for any $c>0$, setting $q_k=(5+c)\log_{1/\beta}(k+1)$ for $k\geq 0$ implies that the total number of communication rounds right before the K-th iteration is equal to $t_K=\sum_{k=0}^{K-1}q_k\leq(5+c)K\log_{1/\beta}(K)$.
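The bound on the accumulated error can be checked numerically. The sketch below, with hypothetical function names and illustrative values of β and c, rounds the schedule up to an integer number of rounds and evaluates the partial sums of $\beta^{q_{k-1}}k^4$. Since $\beta^{q_{k-1}}\leq k^{-(5+c)}$, each term is at most $k^{-(1+c)}$, which is why the series stays below $1+1/c$.

```python
import math

def comm_rounds(k, beta, c):
    """Smallest integer q_k with q_k >= (5 + c) * log_{1/beta}(k + 1)."""
    return math.ceil((5 + c) * math.log(k + 1) / math.log(1.0 / beta))

def error_sum(K, beta, c):
    """Partial sum of beta**q_{k-1} * k**4 for k = 1, ..., K."""
    return sum(beta ** comm_rounds(k - 1, beta, c) * k ** 4
               for k in range(1, K + 1))

beta, c = 0.5, 1.0   # illustrative values only
print([comm_rounds(k, beta, c) for k in range(5)])   # rounds for iterations 0..4
print(error_sum(10_000, beta, c), "<", 1 + 1 / c)
```

The number of rounds grows only logarithmically in k, so the communication cost per iteration stays modest while the averaging error remains summable.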
IV Numerical Section
In this section, we illustrate the performance of DPDA and DPDA-TV for solving synthetic C-LASSO problems. We first test the effect of network topology on the performance of proposed algorithms, and then we compare DPDA and DPDA-TV with other distributed primal-dual algorithms, DPDA-S and DPDA-D, proposed in [28] for solving (3) – it is shown in [28] that both DPDA-S and DPDA-D converge
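with $\mathcal{O}(1/K)$ ergodic rate when $\bar{\varphi}$ is merely convex. In fact, when $\bar{\varphi}$ is strongly convex with modulus $\mu>0$, using the fact that $\varphi(\bar{x}^K)-\varphi(x^*)\geq\frac{\mu}{2}\|\bar{x}^K-x^*\|^2$, it immediately follows that $\|\bar{x}^K-x^*\|^2\leq\mathcal{O}(1/K)$.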
We consider an isotonic C-LASSO problem over network Gt=(N,Et) for t≥0. This problem can be formulated in a centralized form as
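$x^*\triangleq\operatorname{argmin}_{x\in\mathbb{R}^n}\left\{\frac{1}{2}\|Cx-d\|^2+\lambda\|x\|_1:\ Ax\leq 0\right\}$,
where the matrix $C=[C_i]_{i\in\mathcal{N}}\in\mathbb{R}^{m|\mathcal{N}|\times n}$, $d=[d_i]_{i\in\mathcal{N}}\in\mathbb{R}^{m|\mathcal{N}|}$, and $A\in\mathbb{R}^{(n-1)\times n}$. In fact, the matrix A captures the isotonic feature of the vector $x^*$, and can be written explicitly as $A(\ell,\ell)=1$ and $A(\ell,\ell+1)=-1$ for $1\leq\ell\leq n-1$, with all other entries equal to zero.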
Each agent i has access to Ci, di, and A; hence, by making local copies of x, the decentralized formulation can be expressed as
In the rest, we set $n=20$, $m=n+2$, $\lambda=0.05$ and $\mathcal{K}_i=-\mathbb{R}^{n-1}_+$ for $i\in\mathcal{N}$. Moreover, for each $i\in\mathcal{N}$, we generate $C_i\in\mathbb{R}^{m\times n}$ as follows: after $mn$ entries i.i.d. with Gaussian distribution are sampled, the condition number of $C_i$ is normalized by sampling the singular values from [1,3] uniformly at random. We generate the first 5 and the last 5 components of $x^*$ by sampling from [−10,0] and [0,10] uniformly at random in ascending order, respectively, and the other middle 10 components are set to zero; hence, $[x^*]_j\leq[x^*]_{j+1}$ for $j=1,\ldots,n-1$. Finally, we set $d_i=C_i(x^*+\epsilon_i)$, where $\epsilon_i\in\mathbb{R}^n$ is a random vector with i.i.d. components following a Gaussian distribution with zero mean and standard deviation $10^{-3}$.
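The isotonic structure can be reproduced in a few lines. The sketch below builds the matrix A described above and samples a ground truth with the stated pattern (five sorted negative entries, ten zeros, five sorted positive entries), then confirms that $Ax^*\leq 0$. The function names and the fixed seed are illustrative; generating $C_i$ with prescribed singular values is omitted because it needs a linear algebra library.

```python
import random

def isotonic_matrix(n):
    """(n-1) x n matrix with A[l][l] = 1 and A[l][l+1] = -1, zeros elsewhere."""
    A = [[0.0] * n for _ in range(n - 1)]
    for l in range(n - 1):
        A[l][l], A[l][l + 1] = 1.0, -1.0
    return A

def sample_x_star(rng):
    """Five sorted draws from [-10, 0], ten zeros, five sorted draws from [0, 10]."""
    head = sorted(rng.uniform(-10.0, 0.0) for _ in range(5))
    tail = sorted(rng.uniform(0.0, 10.0) for _ in range(5))
    return head + [0.0] * 10 + tail

rng = random.Random(0)   # fixed seed, chosen only for reproducibility
x_star = sample_x_star(rng)
A = isotonic_matrix(len(x_star))
Ax = [sum(a * x for a, x in zip(row, x_star)) for row in A]
print(all(v <= 0.0 for v in Ax))   # True: x_star is nondecreasing
```

Row l of $Ax$ equals $x_\ell-x_{\ell+1}$, so the constraint $Ax\leq 0$ is exactly the requirement that the entries of x be nondecreasing.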
Generating static undirected network: $\mathcal{G}=(\mathcal{N},\mathcal{E})$ is generated as a random small-world network. Given $|\mathcal{N}|$ and the desired number of edges $|\mathcal{E}|$, we choose $|\mathcal{N}|$ edges creating a random cycle over the nodes, and then the remaining $|\mathcal{E}|-|\mathcal{N}|$ edges are selected uniformly at random.
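A minimal sketch of this construction follows. The function name, the representation of an undirected edge as a `frozenset`, and the seed are illustrative choices; the sketch assumes at least three nodes so that the cycle has $|\mathcal{N}|$ distinct edges.

```python
import random

def small_world(n_nodes, n_edges, rng):
    """Random cycle through all nodes plus extra edges drawn uniformly at random."""
    order = list(range(n_nodes))
    rng.shuffle(order)
    # The random cycle guarantees that the graph is connected.
    edges = {frozenset((order[i], order[(i + 1) % n_nodes]))
             for i in range(n_nodes)}
    candidates = [frozenset((i, j))
                  for i in range(n_nodes) for j in range(i + 1, n_nodes)
                  if frozenset((i, j)) not in edges]
    edges.update(rng.sample(candidates, n_edges - n_nodes))
    return edges

G = small_world(10, 15, random.Random(1))
print(len(G))   # 15
```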
Generating time-varying undirected network: Given $|\mathcal{N}|$ and the desired number of edges $|\mathcal{E}_0|$ for the initial graph, we generate a random small-world $\mathcal{G}_0=(\mathcal{N},\mathcal{E}_0)$ as described above. Given $M\in\mathbb{Z}_+$ and $p\in(0,1)$, for each $k\in\mathbb{Z}_+$, we generate $\mathcal{G}^t=(\mathcal{N},\mathcal{E}^t)$, the communication network at time $t\in\{(k-1)M,\ldots,kM-2\}$, by sampling $\lceil p|\mathcal{E}_0|\rceil$ edges of $\mathcal{G}_0$ uniformly at random, and we set $\mathcal{E}^{kM-1}=\mathcal{E}_0\setminus\bigcup_{t=(k-1)M}^{kM-2}\mathcal{E}^t$. In all experiments, we set $M=5$, $p=0.8$, and the number of communications per iteration is set to $q_k=10\ln(k+1)$.
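The window construction above can be sketched as follows. The function name, the tuple representation of edges, and the small example graph are illustrative. In each window of M rounds, the first M−1 rounds sample edges of the base graph and the last round collects whatever was not used, so the union over any window recovers the base edge set.

```python
import math
import random

def time_varying_edges(E0, M, p, n_windows, rng):
    """Return the list of per-round edge sets, M rounds per window."""
    base = sorted(E0)
    n_sample = math.ceil(p * len(base))
    rounds = []
    for _ in range(n_windows):
        # First M-1 rounds: sample ceil(p*|E0|) edges of the base graph.
        window = [set(rng.sample(base, n_sample)) for _ in range(M - 1)]
        # Last round: edges of E0 not used earlier in this window (may be empty).
        window.append(set(base) - set().union(*window))
        rounds.extend(window)
    return rounds

E0 = {(0, 1), (1, 2), (2, 3), (0, 3), (0, 2)}
rounds = time_varying_edges(E0, M=5, p=0.8, n_windows=2, rng=random.Random(0))
print([len(Et) for Et in rounds])
```

Because every edge of the connected base graph appears at least once per window, the union graph over any M consecutive rounds of a window is connected.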
IV-A Effect of Network Topology
In this section, we test the performance of DPDA and DPDA-TV on undirected communication networks. To illustrate the effect of network topology, we consider four scenarios in which the number of nodes $|\mathcal{N}|\in\{10,40\}$ and the average number of edges per node ($|\mathcal{E}|/|\mathcal{N}|$) is either ≈1.5 or ≈4.5. For each scenario, we plot both the relative error, i.e., $\max_{i\in\mathcal{N}}\|x_i^k-x^*\|/\|x^*\|$, and the infeasibility, i.e., $\max_{i\in\mathcal{N}}d_{\mathcal{K}_i}(A\bar{x}_i^k)=\max_{i\in\mathcal{N}}\|(A\bar{x}_i^k)_+\|$, versus iteration number k. All the plots show the average statistics over 25 randomly generated replications.
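The two statistics can be computed as in the sketch below, which is an illustration with hypothetical function names and toy data rather than the authors' evaluation code. For the cone $-\mathbb{R}^{n-1}_+$, the distance of a vector to the cone is the norm of its positive part, and row l of $Ax$ equals $x_\ell-x_{\ell+1}$ for the isotonic matrix.

```python
import math

def norm(v):
    return math.sqrt(sum(t * t for t in v))

def relative_error(x_nodes, x_star):
    """max_i ||x_i - x*|| / ||x*|| over the local copies x_i."""
    worst = max(norm([a - b for a, b in zip(xi, x_star)]) for xi in x_nodes)
    return worst / norm(x_star)

def infeasibility(x_nodes):
    """max_i ||(A x_i)_+|| for the isotonic A, using (A x)_l = x_l - x_{l+1}."""
    def violation(xi):
        return norm([max(xi[l] - xi[l + 1], 0.0) for l in range(len(xi) - 1)])
    return max(violation(xi) for xi in x_nodes)

x_star = [-1.0, 0.0, 2.0]
x_nodes = [[-1.0, 0.0, 2.0], [0.0, -1.0, 2.0]]   # second copy violates the ordering
print(relative_error(x_nodes, x_star), infeasibility(x_nodes))
```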
Testing DPDA on static undirected communication networks: We generated the static small-world networks G=(N,E) as described above for (∣N∣,∣E∣)∈{(10,15),(10,45),(40,60),(40,180)} and solve the saddle-point formulation (8) corresponding to (22) using DPDA. For DPDA, displayed in Fig. 1, we chose δ1=maxi∈Ndi=dmax and δ2=2maxi∈NLi=2Lmax, which lead to the initial step-sizes as γ0=32dmaxLmax, τ0=3Lmax1, and κ0=32∥A∥2Lmax.
In Fig. 3, we plot the $\max_{i\in\mathcal{N}}\|x_i^k-x^*\|/\|x^*\|$ and $\max_{i\in\mathcal{N}}\|(A\bar{x}_i^k)_+\|$ statistics for DPDA versus iteration number k. Note that, compared to the average edge density, the network size has more influence on the convergence rate, i.e., the smaller the network, the faster the convergence. On the other hand, for a network of fixed size, as expected, the higher the density, the faster the convergence.
Testing DPDA-TV on time-varying undirected communication networks: We first generated an undirected graph $\mathcal{G}_u=(\mathcal{N},\mathcal{E}_u)$ as in the static case, and let $\mathcal{G}_0=\mathcal{G}_u$. Next, we generated $\{\mathcal{G}^t\}_{t\geq 1}$ as described above by setting $M=5$ and $p=0.8$. For each consensus round $t\geq 1$, $V^t$ is formed according to Metropolis weights, i.e., for each $i\in\mathcal{N}$, $V_{ij}^t=1/(\max\{d_i,d_j\}+1)$ if $j\in\mathcal{N}_i^t$, $V_{ii}^t=1-\sum_{j\in\mathcal{N}_i^t}V_{ij}^t$, and $V_{ij}^t=0$ otherwise; see (19) for our choice of $\mathcal{R}^k$.
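The Metropolis rule can be sketched as follows; the function name and the four-node example graph are illustrative. The resulting matrix is symmetric with rows summing to one, hence doubly stochastic, which is what makes repeated multiplication by it drive the local values toward the network average.

```python
def metropolis_weights(n_nodes, edges):
    """V_ij = 1/(max(d_i, d_j) + 1) on edges, V_ii = 1 - sum_j V_ij, else 0."""
    nbrs = [set() for _ in range(n_nodes)]
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
    V = [[0.0] * n_nodes for _ in range(n_nodes)]
    for i in range(n_nodes):
        for j in nbrs[i]:
            V[i][j] = 1.0 / (max(len(nbrs[i]), len(nbrs[j])) + 1)
        # The diagonal entry is still zero here, so sum(V[i]) is the off-diagonal mass.
        V[i][i] = 1.0 - sum(V[i])
    return V

V = metropolis_weights(4, [(0, 1), (1, 2), (2, 3), (1, 3)])
for row in V:
    print([round(v, 4) for v in row])
```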
For DPDA-TV, displayed in Fig. 2, we chose δ1=δ2=1, which lead to the initial step-sizes as γ0=21, τ0=Lmax+11, and κ0=2∥A∥21.
In Fig. 4, we plot the $\max_{i\in\mathcal{N}}\|\xi_i^k-x^*\|/\|x^*\|$ and $\max_{i\in\mathcal{N}}\|(A\bar{\xi}_i^k)_+\|$ statistics for DPDA-TV versus iteration number k; we used $\{\xi^k\}$ to compute the error statistics instead of $\{\mathbf{x}^k\}$, as $\mathbf{x}^k$ is never actually computed for DPDA-TV. Note that network size and average edge density have the same impact on the rate as in the static case.
IV-B Comparison with other methods
We also compared our methods with DPDA-S and DPDA-D, in terms of the relative error and infeasibility of the ergodic iterate sequence, i.e., $\max_{i\in\mathcal{N}}\|\bar{x}_i^k-x^*\|/\|x^*\|$ and $\max_{i\in\mathcal{N}}\|(A\bar{x}_i^k)_+\|$. We further report the performance of our algorithms in terms of the relative error of the actual iterate sequence, i.e., $\max_{i\in\mathcal{N}}\|x_i^k-x^*\|/\|x^*\|$. For DPDA-D and DPDA-TV, we used the $\{\xi^k\}$ sequence to compute the error statistics instead of $\{\mathbf{x}^k\}$, as $\mathbf{x}^k$ is never actually computed. In this section, we fix the number of nodes to $|\mathcal{N}|=10$ and the average edge density to $|\mathcal{E}|/|\mathcal{N}|=4.5$; we observed the same convergence behavior for the other network scenarios discussed in the previous section.
Static undirected network: We generated G=(N,E) and chose the algorithm parameters as in the previous section. Moreover, the step-sizes of DPDA-S are set to the initial step-sizes of DPDA. As can be seen in Fig. 5, DPDA converges faster than DPDA-S.
Time-varying undirected network: We generated the network sequence {Gt}t≥0 and chose the parameters as in the previous section. Moreover, the step-sizes of DPDA-D are set to the initial step-sizes of DPDA-TV. As can be seen in Fig. 6, DPDA-TV converges faster than DPDA-D.
Time-varying directed network: In this scenario, we generated time-varying communication networks similar to [35]. Let Gd=(N,Ed) be the directed graph shown in Fig. 8, which has ∣N∣=12 nodes and ∣Ed∣=12 directed edges. We set G0=Gd, and we generate {Gt}t≥0 as in the undirected case with parameters M=5 and p=0.8; hence, {Gt}t≥0 is M-strongly-connected. Moreover, the communication weight matrices Vt are formed according to rule (17). We chose the initial step-sizes for DPDA-TV as in the time-varying undirected case, and the constant step-sizes of DPDA-D are set to the initial step-sizes of DPDA-TV.
In Fig. 7 we compare DPDA-TV against DPDA-D. We observe that over time-varying directed networks DPDA-TV again outperforms DPDA-D for both statistics.
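Let $\mathbf{x}^*=\mathbf{1}_{|\mathcal{N}|}\otimes x^*$, where $x^*$ is the unique optimal solution to (3); according to Assumption I.2, $\bar{f}$ is strongly convex with modulus $\bar{\mu}>0$. Note that any W as given in Definition 4 is positive semidefinite, and Null(W)=Span{1}. In the rest, we will use these properties of W. Fix some arbitrary $\alpha>\frac{4}{\lambda_2\bar{\mu}}\sum_{i\in\mathcal{N}}L_i^2$ and $\mathbf{x}\in\mathbb{R}^{n|\mathcal{N}|}$.
$\mathbf{x}$ can be decomposed into $\mathbf{u}\in\mathrm{Span}\{\mathbf{1}\}$ and $\mathbf{v}\in\mathrm{Span}\{\mathbf{1}\}^\perp$, where $\mathbf{x}=\mathbf{u}+\mathbf{v}$ and $\|\mathbf{x}\|^2=\|\mathbf{u}\|^2+\|\mathbf{v}\|^2$. From the definition of $f_\alpha$, we have that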
[TABLE]
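Let $N\triangleq|\mathcal{N}|$ and $\bar{L}\triangleq\sqrt{\sum_{i\in\mathcal{N}}L_i^2/N}$. The inner product $\langle\nabla f(\mathbf{x})-\nabla f(\mathbf{x}^*),\mathbf{x}-\mathbf{x}^*\rangle$ can be bounded by using the following inequalities: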
[TABLE]
which follow from convexity, Lipschitz differentiability, and strong convexity of f. Summing above inequalities leads to,
[TABLE]
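Hence, strong convexity of $f_\alpha$ follows from (23) and (25). Indeed, it follows from $W\in\mathbb{S}^N_+$ and Null(W)=Span{1} that we have $\|\mathbf{x}-\mathbf{x}^*\|^2_{W\otimes\mathbf{I}_n}=\mathbf{v}^\top(W\otimes\mathbf{I}_n)\mathbf{v}\geq\lambda_2\|\mathbf{v}\|^2$, where $\lambda_2=\lambda_{\min}^+(W)$ is the second smallest eigenvalue of W. Therefore,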
[TABLE]
Next, fix some arbitrary ω≥0. Then either (i)∥v∥≤ω∥u−x∗∥, or (ii)∥v∥≥ω∥u−x∗∥ holds. If (i) is true, then (26) implies
[TABLE]
on the other hand, if (ii) is true, then (26) implies
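Since $\omega\geq 0$ is arbitrary, $f_\alpha$ is strongly convex with modulus $\mu_\alpha=\max_{\omega\geq 0}\min\left\{\frac{\bar{\mu}}{N}-2\bar{L}\omega,\ \alpha\lambda_2-\frac{2\bar{L}}{\omega}\right\}$.
Note that $\mu_\alpha$ is attained for $\omega_\alpha\geq 0$ such that $\frac{\bar{\mu}}{N}-2\bar{L}\omega_\alpha=\alpha\lambda_2-\frac{2\bar{L}}{\omega_\alpha}$, which implies that $\omega_\alpha=\frac{1}{2}\left(\frac{\bar{\mu}/N-\alpha\lambda_2}{2\bar{L}}+\sqrt{\left(\frac{\bar{\mu}/N-\alpha\lambda_2}{2\bar{L}}\right)^2+4}\right)$. Moreover, $\mu_\alpha=\frac{\bar{\mu}}{N}-2\bar{L}\omega_\alpha$ is the value given in the statement of the lemma, and we have $\frac{\bar{\mu}}{N}>\mu_\alpha>0$ for any $\alpha>\frac{4}{\lambda_2\bar{\mu}}\sum_{i\in\mathcal{N}}L_i^2$. It is worth mentioning that $\mu_\alpha$ is a concave increasing function of α over $\mathbb{R}_{++}$, and $\sup_{\alpha>0}\mu_\alpha=\lim_{\alpha\nearrow\infty}\mu_\alpha=\frac{\bar{\mu}}{N}$.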
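Let $T=[A^\top\ M^\top]^\top$ for $A\triangleq\mathrm{diag}([A_i]_{i\in\mathcal{N}})\in\mathbb{R}^{m\times n|\mathcal{N}|}$. Let $\alpha,\mu,\delta_1>0$, and let $\{\tau^k\},\{\gamma^k\}\subset\mathbb{R}_{++}$ and $\{\kappa_i^k\}_{k\geq 0}\subset\mathbb{R}_{++}$ for $i\in\mathcal{N}$ be arbitrary sequences. For $k\geq 0$, define $\mathbf{D}_{\tau^k}\triangleq\frac{1}{\gamma^k}\mathbf{I}_{n|\mathcal{N}|}$, $\tilde{\mathbf{D}}_{\tau^k}\triangleq(\frac{1}{\tau^k}-\mu)\mathbf{I}_{n|\mathcal{N}|}$, $\bar{\mathbf{D}}_{\tau^k}\triangleq\mathrm{diag}([(\frac{1}{\tau^k}-(L_i+2\alpha d_i))\mathbf{I}_n]_{i\in\mathcal{N}})$, and
$\bar{\mathbf{Q}}^k\triangleq\begin{bmatrix}\mathbf{A}^k & -\eta^kT^\top\\ -\eta^kT & \mathbf{D}_{\kappa^k,\gamma^k}\end{bmatrix}$, where $\mathbf{A}^k\triangleq(\eta^k)^2\gamma^k\,\mathrm{diag}([(2d_i+\delta_1)\mathbf{I}_n]_{i\in\mathcal{N}})\succ 0$, and $\mathbf{D}_{\kappa^k,\gamma^k}$ is defined in Definition 2.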
In order to prove Theorem II.2, we first prove Lemma V.3 below, which helps us to appropriately bound $\mathcal{L}(\bar{\mathbf{x}}^K,\mathbf{y})-\mathcal{L}(\mathbf{x}^*,\bar{\mathbf{y}}^K)$ for any $\mathbf{y}\in\mathcal{Y}$, and $\|\mathbf{x}^K-\mathbf{x}^*\|$. In order to prove Lemma V.3, we first need the following two lemmas, Lemma V.1 and Lemma V.2, describing a proper choice for the step size sequences.
Lemma V.1.
Given $\delta_1>0$, for any $k\geq 0$, $\bar{\mathbf{Q}}^k\succeq 0$ if $\eta^k>0$ and the positive numbers $\{\kappa_i^k\}_{i\in\mathcal{N}}$ and $\gamma^k$ are chosen such that
[TABLE]
Proof.
Let Dκk,γk be as in Definition 2. Since Dγk≻0, Schur complement condition implies that Qˉk⪰0 if and only if
[TABLE]
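Moreover, since $\mathbf{D}_{\kappa^k}\succ 0$, again using the Schur complement and the fact that $M^\top M=\Omega\otimes\mathbf{I}_n$, one can conclude that (31) holds if and only if $\frac{1}{(\eta^k)^2}\mathbf{A}^k-\gamma^k\,\Omega\otimes\mathbf{I}_n-A^\top\mathbf{D}_{\kappa^k}^{-1}A\succeq 0$. Moreover, by definition $\Omega=\mathrm{diag}([d_i]_{i\in\mathcal{N}})-E$, where $E_{ii}=0$ for all $i\in\mathcal{N}$ and $E_{ij}=E_{ji}=1$ if $(i,j)\in\mathcal{E}$ or $(j,i)\in\mathcal{E}$. Note that $\mathrm{diag}([d_i]_{i\in\mathcal{N}})+E\succeq 0$ since it is diagonally dominant. Therefore, $\Omega\preceq 2\,\mathrm{diag}([d_i]_{i\in\mathcal{N}})$. Hence, one can conclude that (31) holds if $\frac{1}{(\eta^k)^2}\mathbf{A}^k-2\gamma^k\,\mathrm{diag}([d_i\mathbf{I}_n]_{i\in\mathcal{N}})-A^\top\mathbf{D}_{\kappa^k}^{-1}A\succeq 0$. This condition holds if (30) is true.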
∎
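The bound on Ω used in the proof can also be verified directly through a quadratic form, without appealing to diagonal dominance. The short derivation below assumes, as in the proof, that $d_i$ is the degree of node i and that E is the adjacency matrix of the undirected graph; the sum runs over unordered edges.

```latex
\begin{align*}
x^\top\big(\mathrm{diag}([d_i]_{i\in\mathcal{N}})+E\big)x
  &= \sum_{i\in\mathcal{N}} d_i x_i^2 + 2\sum_{\{i,j\}\in\mathcal{E}} x_i x_j \\
  &= \sum_{\{i,j\}\in\mathcal{E}} \big(x_i^2 + x_j^2 + 2 x_i x_j\big)
   = \sum_{\{i,j\}\in\mathcal{E}} (x_i + x_j)^2 \;\geq\; 0, \\
2\,\mathrm{diag}([d_i]_{i\in\mathcal{N}}) - \Omega
  &= \mathrm{diag}([d_i]_{i\in\mathcal{N}}) + E \;\succeq\; 0 .
\end{align*}
```

The second equality in the first display uses the fact that each node i appears in exactly $d_i$ edges.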
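Lemma V.2.
Let $\mathbf{D}_{\kappa^k,\gamma^k}$ be as given in Definition 2, and $\mathbf{D}_{\tau^k}$, $\tilde{\mathbf{D}}_{\tau^k}$, $\bar{\mathbf{D}}_{\tau^k}$ and $\bar{\mathbf{Q}}^k$ be as in Definition 8 for $\alpha\geq 0$ chosen according to Lemma II.1 and Remark II.2, and $\mu\in(0,\max\{\underline{\mu},\mu_\alpha\}]$.
Suppose $\{\tau^k\},\{\eta^k\},\{\gamma^k\}\subset\mathbb{R}_{++}$ and $\{\kappa_i^k\}_{k\geq 0}\subset\mathbb{R}_{++}$ for $i\in\mathcal{N}$ are chosen as in DPDA displayed in Fig. 1; then the following relations hold for all $k\geq 0$: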
[TABLE]
Moreover, $\eta^k\in(0,1)$, $0<\frac{1}{\tilde{\tau}^k}<\frac{1}{\tau^k}=\mathcal{O}(k)$, and $0<\gamma^k=\mathcal{O}(k)$.
Proof.
It is trivial to check that the parameter sequence constructed in Fig. 1 satisfies (32). Indeed, Lemma V.1 shows that (32a) is true since $\kappa_i^k$ for $i\in\mathcal{N}$ and $\gamma^k$ as chosen in Fig. 1 satisfy (30) for all $k\geq 0$. This specific choice of parameters satisfies (32b), (32c), (32d), and (32e) with equality. Moreover, one can use induction to show (32f) using the relations $\tilde{\tau}^k>\tau^k$, $\tilde{\tau}^k>\tilde{\tau}^{k+1}$, and $\gamma^k\tilde{\tau}^k=\gamma^{k+1}\tilde{\tau}^{k+1}$ for all $k\geq 0$.
∎
Lemma V.3.
For any y∈Y, the iterate sequence {xk,yk}k≥1 generated using Algorithm DPDA as in Fig. 1, where yk=[θk⊤λk⊤]⊤, satisfies for all k≥0,
[TABLE]
Proof.
Note that the x-subproblem in (5b) is separable in the local decisions {xi}i∈N, and for each i∈N the local subproblem over xi is strongly convex with constant 1/τk. Indeed, let pk=T⊤yk and define {pik}i∈N such that pik is the subvector corresponding to the components of xi, i.e., pk=[pik]i∈N. In addition, ∇g(xk)=[∇gi(xk)]i∈N where ∇gi(xk)≜∇fi(xik)+[(Ω⊗In)xk]i and [(Ω⊗In)xk]i=∑j∈Ni(xik−xjk). Thus, for all i∈N
[TABLE]
Therefore, for i∈N, the strong convexity of the objective in local subproblem (34) implies
[TABLE]
Now, we show that ∇g is Lipschitz continuous. First, recall that as we discussed in the proof of Lemma V.1, we have Ω⪯2diag([di]i∈N). Second, since ∥x∥Ω⊗In2 is a quadratic term, for any xˉ we have
[TABLE]
In addition, since each fi has a Lipschitz continuous gradient, we have for any x and xˉ that
[TABLE]
Let Lg≜diag([(Li+2diα)In]i∈N)∈Sn∣N∣. Summing (V-B) and (37), for any x and xˉ, we have
[TABLE]
By the strong convexity of $\bar{f}$, choosing $\alpha\geq 0$ according to Lemma II.1 and Remark II.2, we conclude that for any $\mu\in(0,\max\{\underline{\mu},\mu_\alpha\})$ we have
[TABLE]
Since ∑i∈N⟨pik+1,x∗⟩=⟨Tx∗,yk+1⟩, first summing (35) over i∈N, next summing the resulting inequality with (V-B), and then adding g(xk) to both sides, we get
[TABLE]
Similarly, let qk≜T(xk+ηk(xk−xk−1)) and define q0k∈Rm0 and qik∈Rmi for i∈N such that q0k is the subvector corresponding to the components of λ, and qik is the subvector corresponding to the components of θi for i∈N, i.e., qk=[q1k⊤…qNk⊤q0k⊤]⊤. Thus, from (9a) and (9b), we have
[TABLE]
Using the strong convexity of these subproblems, for any y=[θ⊤,λ⊤]⊤, we get
[TABLE]
Since ⟨q0k,λ⟩+∑i∈N⟨qik,θi⟩=⟨T(xk+ηk(xk−xk−1)),y⟩ for all y, summing the second inequality over i∈N and then summing the resulting inequality with the first one, we get
[TABLE]
Next, summing (V-B), (V-B), and rearranging the terms, we obtain
[TABLE]
Note that we have
[TABLE]
moreover, using (32a), i.e., Qˉk⪰0, the last term can be bounded as follows:
[TABLE]
Then, combining (V-B), (43) and (44) gives the desired result.
∎
Under Assumption I.1, a saddle point (x∗,y∗) for minx∈Xmaxy∈YL(x,y) in (8) exists, where y∗=[θ∗⊤,λ∗⊤]⊤; moreover, any saddle point (x∗,θ∗,λ∗) satisfies that x∗=1⊗x∗ such that (x∗,θ∗) is a primal-dual solution to (3). Thus, θi∗∈Ki∘ and L(x∗,θ∗,λ∗)=Φ(x∗). Recall Definition 3, since ∥x∗∥Ω⊗In2=0, we have g(x∗)=f(x∗); hence, Φ(x∗)=φ(x∗)=∑i∈Nφi(x∗). Therefore, L(x∗,θ∗,λ∗)=φ(x∗). Moreover, note that if (x∗,θ∗,λ∗) is a saddle point of L such that λ∗=0, then it trivially follows that (x∗,θ∗,0) is another saddle point of L.
Multiplying both sides of (V.3) by $\frac{\gamma^k}{\gamma^0}$ and using Lemma V.2, we get
[TABLE]
Next, we sum (V-C) from $k=0$ to $K-1$; using Jensen's inequality and the following facts: $\bar{\mathbf{Q}}^K\succeq 0$ and $\mathbf{x}^{-1}=\mathbf{x}^0$, we get
[TABLE]
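where $\mathbf{z}^K\triangleq[(\mathbf{x}^K-\mathbf{x}^{K-1})^\top\ (\mathbf{y}-\mathbf{y}^K)^\top]^\top$, $N_K=\sum_{k=1}^K\frac{\gamma^{k-1}}{\gamma^0}$, $\bar{\mathbf{x}}^K=N_K^{-1}\sum_{k=1}^K\frac{\gamma^{k-1}}{\gamma^0}\mathbf{x}^k$, and $\bar{\mathbf{y}}^K=N_K^{-1}\sum_{k=1}^K\frac{\gamma^{k-1}}{\gamma^0}\mathbf{y}^k$. Since $\|\mathbf{z}^K\|^2_{\bar{\mathbf{Q}}^K}\geq 0$ and $\tilde{\tau}^k>\tau^k$ for $k\geq 0$, we get the following bounds for all $K\geq 1$: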
[TABLE]
Under Assumption I.1, one can construct a saddle point $(\mathbf{x}^*,\theta^*,\lambda^*)$ for $\mathcal{L}$ in (8) such that $\lambda^*=\mathbf{0}$; hence, $\mathcal{L}(\mathbf{x}^*,\theta^*,\lambda^*)=\varphi(\mathbf{x}^*)$ and $\theta_i^*\in\mathcal{K}_i^\circ$ for $i\in\mathcal{N}$. Define $\tilde{\theta}=[\tilde{\theta}_i]_{i\in\mathcal{N}}$ such that $\tilde{\theta}_i\triangleq 2\|\theta_i^*\|\big(\|\mathcal{P}_{\mathcal{K}_i^\circ}(A_i\bar{x}_i^K-b_i)\|\big)^{-1}\mathcal{P}_{\mathcal{K}_i^\circ}(A_i\bar{x}_i^K-b_i)\in\mathcal{K}_i^\circ$, which implies
[TABLE]
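Similarly, define $\tilde{\lambda}\triangleq M\bar{\mathbf{x}}^K/\|M\bar{\mathbf{x}}^K\|$; hence, $\langle M\bar{\mathbf{x}}^K,\tilde{\lambda}\rangle=\|M\bar{\mathbf{x}}^K\|$. Together with (47), we get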
[TABLE]
Note that for any i∈N, θˉiK∈Ki∘; hence, σKi(θˉiK)=0. In addition, since θˉiK∈Ki∘ and Aix∗−bi∈Ki, we have
[TABLE]
Therefore, using (49) and the fact that Mx∗=0 we get that,
[TABLE]
Thus, (48), (50) and (46) together with the definitions of θ~, λ~ and the fact that λ0=0 and θ0=0 imply that
[TABLE]
Since (x∗,θ∗,λ∗) is a saddle-point for L in (8), we have L(ξˉK,θ∗,λ∗)−L(x∗,θ∗,λ∗)≥0; therefore,
[TABLE]
Using the conic decomposition of AixˉiK−bi and the fact that θi∗∈Ki∘, we immediately get
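Finally, combining inequalities (51) and (53) immediately implies the desired result. Moreover, the bound on $\|\mathbf{x}^*-\mathbf{x}^K\|$ follows from (46). In fact, possibly a tighter bound can be derived using Θ(x∗,θ∗,λ∗) for λ∗=0.
Let $\mathbf{x}^*=\mathbf{1}_{|\mathcal{N}|}\otimes x^*$, where $x^*$ is the unique optimal solution to (3); according to Assumption I.2, $\bar{f}$ is strongly convex with modulus $\bar{\mu}>0$. Fix some arbitrary $\alpha>\frac{4}{\bar{\mu}}\sum_{i\in\mathcal{N}}L_i^2$ and $\mathbf{x}\in\mathbb{R}^{n|\mathcal{N}|}$. Since $\mathcal{C}$ is a closed convex cone, $\mathbf{x}$ can be decomposed into $\mathbf{u}=\mathcal{P}_{\mathcal{C}}(\mathbf{x})$ and $\mathbf{v}=\mathcal{P}_{\mathcal{C}^\circ}(\mathbf{x})$, i.e., $\mathbf{x}=\mathbf{u}+\mathbf{v}$ and $\|\mathbf{x}\|^2=\|\mathbf{u}\|^2+\|\mathbf{v}\|^2$. From the definition of $f_\alpha$,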
[TABLE]
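which follows from the fact that $\nabla r(\mathbf{x})=\mathbf{x}-\mathcal{P}_{\mathcal{C}}(\mathbf{x})$; hence $\nabla r(\mathbf{x}^*)=\mathbf{0}$. Let $N\triangleq|\mathcal{N}|$ and $\bar{L}\triangleq\sqrt{\sum_{i\in\mathcal{N}}L_i^2/N}$. Since $\mathbf{x}^*,\mathbf{u}\in\mathcal{C}$ and f is convex, Lipschitz differentiable, and strongly convex, the same inequalities as in (24) imply: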
[TABLE]
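Note that $\mathbf{u}-\mathbf{x}^*\in\mathcal{C}$; hence, $\langle\mathbf{u}-\mathbf{x}^*,\mathbf{v}\rangle=0$ since $\mathbf{v}\in\mathcal{C}^\circ$. Thus, $\langle\mathbf{x}-\mathbf{x}^*,\mathbf{v}\rangle=\|\mathbf{v}\|^2$; this together with (54) and (55) implies that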
[TABLE]
Next, fix some arbitrary ω≥0. Then either (i)∥v∥≤ω∥u−x∗∥, or (ii)∥v∥≥ω∥u−x∗∥ holds. Using the same arguments to obtain (V-A), (V-A) and (29), we can conclude that
[TABLE]
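Since $\omega\geq 0$ is arbitrary, $f_\alpha$ is restricted strongly convex with respect to $\mathbf{x}^*$ with modulus $\mu_\alpha=\max_{\omega\geq 0}\min\left\{\frac{\bar{\mu}}{N}-2\bar{L}\omega,\ \alpha-\frac{2\bar{L}}{\omega}\right\}$. Note that $\mu_\alpha$ is attained for $\omega_\alpha\geq 0$ such that $\frac{\bar{\mu}}{N}-2\bar{L}\omega_\alpha=\alpha-\frac{2\bar{L}}{\omega_\alpha}$, which implies that $\omega_\alpha=\frac{\bar{\mu}/N-\alpha}{4\bar{L}}+\sqrt{\left(\frac{\bar{\mu}/N-\alpha}{4\bar{L}}\right)^2+1}$. Moreover, $\mu_\alpha=\frac{\bar{\mu}}{N}-2\bar{L}\omega_\alpha$ is the value given in the statement of the lemma, and we have $\frac{\bar{\mu}}{N}>\mu_\alpha>0$ for any $\alpha>\frac{4}{\bar{\mu}}\sum_{i\in\mathcal{N}}L_i^2$. It is worth mentioning that $\mu_\alpha$ is a concave increasing function of α over $\mathbb{R}_{++}$, and $\sup_{\alpha>0}\mu_\alpha=\lim_{\alpha\nearrow\infty}\mu_\alpha=\frac{\bar{\mu}}{N}$.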
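The closed-form expressions above can be sanity-checked numerically. The sketch below uses hypothetical Lipschitz constants and a hypothetical modulus, picks α above the threshold, and confirms that the two branches of the min coincide at $\omega_\alpha$ and agree with the closed-form modulus from Lemma III.1. All names and numerical values are illustrative.

```python
import math

def omega_alpha(mu_bar, N, L_bar, alpha):
    """Positive root of mu_bar/N - 2*L_bar*w = alpha - 2*L_bar/w."""
    c = (mu_bar / N - alpha) / (4.0 * L_bar)
    return c + math.sqrt(c * c + 1.0)

def mu_alpha(mu_bar, N, L_bar, alpha):
    """Closed-form modulus from the statement of Lemma III.1."""
    a = mu_bar / N
    return (a + alpha) / 2.0 - math.sqrt(((a - alpha) / 2.0) ** 2 + 4.0 * L_bar ** 2)

L = [1.0, 2.0, 0.5, 1.5]                  # hypothetical Lipschitz constants L_i
N, mu_bar = len(L), 0.8                   # hypothetical strong convexity modulus
L_bar = math.sqrt(sum(l * l for l in L) / N)
threshold = 4.0 * sum(l * l for l in L) / mu_bar
alpha = 1.5 * threshold                   # 50% above the threshold
w = omega_alpha(mu_bar, N, L_bar, alpha)
branch_u = mu_bar / N - 2.0 * L_bar * w   # first argument of the min
branch_v = alpha - 2.0 * L_bar / w        # second argument of the min
print(branch_u, branch_v, mu_alpha(mu_bar, N, L_bar, alpha))
```

The three printed numbers agree up to rounding and lie strictly between 0 and $\bar{\mu}/N$; taking α below the threshold makes the closed-form value negative, which is why the lower bound on α is needed.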
We first define the proximal error sequences {e1k}k≥1, {e2k}k≥1, and {e3k}k≥1 which will be used for analyzing the convergence of Algorithm DPDA-TV displayed in Fig. 2. For k≥0, let
[TABLE]
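where $\omega^k=\frac{1}{\gamma^k}\nu^k+\xi^k+\eta^k(\xi^k-\xi^{k-1})$ and $\tilde{\mathcal{R}}^k(\mathbf{x})=\mathcal{P}_{\mathcal{B}}(\mathcal{R}^k(\mathbf{x}))$, i.e., $\tilde{\mathcal{R}}^k(\mathbf{x})=[\tilde{\mathcal{R}}_i^k(\mathbf{x})]_{i\in\mathcal{N}}$ and $\tilde{\mathcal{R}}_i^k(\mathbf{x})=\mathcal{P}_{\mathcal{B}_0}(\mathcal{R}_i^k(\mathbf{x}))$, for $\mathbf{x}\in\mathcal{X}$. Thus, for $k\geq 0$, $\nu^{k+1}=\lambda^{k+1}+\gamma^ke_1^{k+1}$ since (12c) is replaced with (20a), and $\xi^{k+1}=\mathbf{x}^{k+1}+e_3^{k+1}$ since (12e) is replaced with (20b). In the rest, we set $\nu^0$ to $\mathbf{0}$.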
The following observation will also be useful to prove error bounds for DPDA-TV iterate sequence. Note that (20a) implies for each i∈N,
[TABLE]
Thus, we trivially get the following bound on νk+1:
[TABLE]
Moreover, we will also need the following relation: for any ν and λ we have that
[TABLE]
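Definition 9.
Let $T=[A^\top\ \mathbf{I}_{n|\mathcal{N}|}]^\top$ for $A\triangleq\mathrm{diag}([A_i]_{i\in\mathcal{N}})\in\mathbb{R}^{m\times n|\mathcal{N}|}$. Given $\alpha,\mu,\delta_1>0$, and arbitrary sequences $\{\tau^k\},\{\gamma^k\}\subset\mathbb{R}_{++}$ and $\{\kappa_i^k\}_{k\geq 0}\subset\mathbb{R}_{++}$ for $i\in\mathcal{N}$, define $\mathbf{D}_{\tau^k}\triangleq\frac{1}{\tau^k}\mathrm{diag}([\mathbf{I}_n]_{i\in\mathcal{N}})$, $\tilde{\mathbf{D}}_{\tau^k}\triangleq\mathrm{diag}([(\frac{1}{\tau^k}-\mu)\mathbf{I}_n]_{i\in\mathcal{N}})$, $\bar{\mathbf{D}}_{\tau^k}\triangleq\mathrm{diag}([(\frac{1}{\tau^k}-(L_i+\alpha))\mathbf{I}_n]_{i\in\mathcal{N}})$, and
$\bar{\mathbf{Q}}^k\triangleq\begin{bmatrix}\mathbf{A}^k & -\eta^kT^\top\\ -\eta^kT & \mathbf{D}_{\kappa^k,\gamma^k}\end{bmatrix}$ for $k\geq 0$, where $\mathbf{A}^k\triangleq(\eta^k)^2\gamma^k(1+\delta_1)\mathbf{I}_{n|\mathcal{N}|}\succ 0$ and $\mathbf{D}_{\kappa^k,\gamma^k}$ is defined in Definition 6.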
In order to prove Theorem III.2, we first prove Lemma V.6 below, which helps us to appropriately bound $\mathcal{L}(\bar{\mathbf{x}}^K,\mathbf{y})-\mathcal{L}(\mathbf{x}^*,\bar{\mathbf{y}}^K)$ for any $\mathbf{y}\in\mathcal{Y}$, and $\|\xi^K-\mathbf{x}^*\|$. That said, to show the result in Lemma V.6, we need the following two lemmas, Lemma V.4 and Lemma V.5, describing a proper choice for the primal-dual step size sequences.
Lemma V.4.
Given $\delta_1>0$, for any $k\geq 0$, $\bar{\mathbf{Q}}^k\succeq 0$ if $\eta^k>0$ and the positive numbers $\{\kappa_i^k\}_{i\in\mathcal{N}}$ and $\gamma^k$ are chosen such that
[TABLE]
Proof.
Let Dγk and Dκk be as in Definition 6. Since Dγk≻0, Schur complement condition implies that Qˉk⪰0 if and only if
[TABLE]
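Moreover, since $\mathbf{D}_{\kappa^k}\succ 0$, again using the Schur complement one can conclude that (62) holds if and only if $\frac{1}{(\eta^k)^2}\mathbf{A}^k-\gamma^k\mathbf{I}_{n|\mathcal{N}|}-A^\top\mathbf{D}_{\kappa^k}^{-1}A\succeq 0$. This condition holds if (61) is true.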
∎
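Lemma V.5.
Let $\mathbf{D}_{\tau^k}$ and $\mathbf{D}_{\kappa^k,\gamma^k}$ be as given in Definition 6, and $\tilde{\mathbf{D}}_{\tau^k}$, $\bar{\mathbf{D}}_{\tau^k}$ and $\bar{\mathbf{Q}}^k$ be as in Definition 9 for $\alpha>0$ chosen according to Lemma III.1 and Remark II.2, and $\mu\in(0,\max\{\underline{\mu},\mu_\alpha\})$. Suppose $\{\tau^k\},\{\eta^k\},\{\gamma^k\}\subset\mathbb{R}_{++}$ and $\{\kappa_i^k\}_{k\geq 0}\subset\mathbb{R}_{++}$ for $i\in\mathcal{N}$ are chosen as in DPDA-TV displayed in Fig. 2; then the following relations hold for all $k\geq 0$: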
[TABLE]
Moreover, $\eta^k\in(0,1)$, $0<\frac{1}{\tilde{\tau}^k}<\frac{1}{\tau^k}=\mathcal{O}(k)$, and $0<\gamma^k=\mathcal{O}(k)$.
Proof.
Using the result of Lemma V.4, it is trivial to check that the parameter sequence constructed in Fig. 2 satisfies (63) – see also the discussion in the proof of Lemma V.2.
∎
In order to prove Theorem III.2, we need Lemma V.6, which helps us to appropriately bound $\mathcal{L}(\bar{\mathbf{x}}^K,\mathbf{y})-\mathcal{L}(\mathbf{x}^*,\bar{\mathbf{y}}^K)$ for all $\mathbf{y}\in\mathcal{Y}$, and $\|\xi^K-\mathbf{x}^*\|$ for all $K\geq 1$. In particular, Lemma V.6 is similar to Lemma V.3 for the static case, but it also accounts for the approximation errors in the time-varying case, which arise from the use of $\mathcal{R}^k$.
Lemma V.6.
Let {ξk,yk}k≥0 be the iterate sequence generated using Algorithm DPDA-TV as displayed in Fig. 2 which is initialized from an arbitrary x0 and y0, where yk=[θk⊤νk⊤]⊤ for k≥0; and let {e1k}k≥1 and {e2k}k≥1 be the error sequence defined as in (58). For any y∈Y, the iterate sequence {ξk,yk}k≥0 satisfies for all k≥0,
[TABLE]
*where E1k+1(ν)≜∥ek+1∥(4γkNΔ+∥ν−νk+1∥), and E2k+1≜e3k+1(τk2NΔ+αe2k+1) for k≥0.
*
Proof.
Fix y=[θ⊤ν⊤]⊤∈Y. For k≥0, let qk≜ξk+ηk(ξk−ξk−1) and define qik∈Rn for i∈N such that qk=[q1k⊤…qNk⊤]⊤. It follows from (12b) that using strong convexity of σC(ν)−⟨qk,ν⟩+2γk1∥ν−νk∥22 in ν and the fact that λk+1 is its minimizer, we conclude that
[TABLE]
According to (58), νk+1=λk+1+γke1k+1 for all k≥1; hence, from (60) we have
[TABLE]
where the error term S1k+1(ν) is defined as
[TABLE]
Note that for all k≥0, we have νk+γkqk=γkωk, νk+1=λk+1+γke1k+1, and λk+1=γk(ωk−PC(ωk)). Using these we get νk+γkqk−νk+1=γk(PC(ωk)−e1k+1); therefore, (66) can be written as
[TABLE]
where the inequality follows from Cauchy-Schwarz and ∥PC(ωk)∥≤2NΔ since PC(ωk)∈C.
Moreover, it follows from the strong convexity of the objective in (12a) that
[TABLE]
Summing the above inequality over i∈N, then summing the resulting inequality with (65) and using (67), we get
[TABLE]
Let pk=T⊤yk for k≥1. Strong convexity of the objective in (12d) implies that
[TABLE]
where ∇g(ξk)=∇f(ξk)+α(ξk−PC(ξk)). Also, the optimality condition of (20b) implies that, there exist sk+1∈∂ρ(ξk+1) such that
Moreover, since ρ(⋅) is a convex function and sk+1∈∂ρ(ξk+1), using (58) we obtain
[TABLE]
Now, using (58) and (71) within (69), we conclude that
[TABLE]
where the error term S2k+1 is given as follows
[TABLE]
Note that using (70), the definition of S2k+1 can be simplified:
[TABLE]
where we used the fact that xk+1≤NΔ.
In addition, since each fi has a Lipschitz continuous gradient with constant Li and 21dC2(x) has a Lipschitz continuous gradient with constant 1, we have for any x and xˉ that
[TABLE]
Define $\mathbf{L}_g=\mathrm{diag}([(L_i+\alpha)\mathbf{I}_n]_{i\in\mathcal{N}})$. By the strong convexity of $\bar{f}$, choosing $\alpha\geq 0$ according to Lemma III.1 and Remark II.2, we conclude that for any $\mu\in(0,\max\{\underline{\mu},\mu_\alpha\})$ we have
[TABLE]
where the last inequality follows from (75). Next, summing inequalities (V-E) and (V-E), and using (74), we get
[TABLE]
Next, summing (V-E) and (V-E), and rearranging terms, we obtain
[TABLE]
Note that we have,
[TABLE]
moreover, the last term can be bounded using the fact that Qˉk⪰0 as follows:
[TABLE]
Combining (V-E) and (79) gives the desired result.
∎
Under Assumption I.1, a saddle point (x∗,y∗) for minx∈Xmaxy∈YL(x,y) in (11) exists, where y∗=[θ∗⊤,λ∗⊤]⊤; moreover, any saddle point (x∗,θ∗,λ∗) satisfies that x∗=1⊗x∗ such that (x∗,θ∗) is a primal-dual solution to (3). Thus, θi∗∈Ki∘ and L(x∗,θ∗,λ∗)=Φ(x∗). Recall Definition 7, we have g(x∗)=f(x∗) since dC(x∗)=0; hence, Φ(x∗)=φ(x∗)=∑i∈Nφi(x∗). Therefore, L(x∗,θ∗,λ∗)=φ(x∗). Indeed, this implies ⟨x∗,λ∗⟩−σC(λ∗)=0 which leads to ∑i∈Nλi∗=0, i.e., λ∗∈C∘. Therefore, we have 0=⟨x∗,λ∗⟩=σC(λ∗). In the rest of the proof, we provide the error bounds for a saddle point (x∗,θ∗,λ∗) of L such that λ∗=0. Note that if (x∗,θ∗,λ∗) is a saddle point of L such that λ∗=0, then it trivially follows that (x∗,θ∗,0) is another saddle point of L.
Multiplying both sides of (64) by γ0γk and using Lemma V.5, we get
[TABLE]
Next, we sum (80) over $k=0$ to $K-1$; using Jensen's inequality and the following facts: $\bar{\mathbf{Q}}^K\succeq 0$ and $\xi^{-1}=\xi^0=\mathbf{x}^0$, we get
[TABLE]
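where $\mathbf{z}^K=[(\xi^K-\xi^{K-1})^\top\ (\mathbf{y}-\mathbf{y}^K)^\top]^\top$, $N_K=\sum_{k=1}^K\frac{\gamma^{k-1}}{\gamma^0}$, $\bar{\xi}^K=N_K^{-1}\sum_{k=1}^K\frac{\gamma^{k-1}}{\gamma^0}\xi^k$ and $\bar{\mathbf{y}}^K=N_K^{-1}\sum_{k=1}^K\frac{\gamma^{k-1}}{\gamma^0}\mathbf{y}^k$ for $\mathbf{y}^k=[{\theta^k}^\top\ {\nu^k}^\top]^\top$ for $k\geq 0$.
Note that $E_1^{k+1}(\nu)$ and $E_2^{k+1}$ appearing in (V-F) are the error terms due to approximating $\mathcal{P}_{\mathcal{C}}$ with $\mathcal{R}^k$ in the k-th iteration of the algorithm for $k\geq 0$. Furthermore, since $\|\mathbf{z}^K\|^2_{\bar{\mathbf{Q}}^K}\geq 0$ and $\tilde{\tau}^k>\tau^k$ for $k\geq 0$, (V-F) can be written more explicitly as follows: for any $[\theta^\top,\nu^\top]^\top\in\mathcal{Y}$ and for all $K\geq 1$, we have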
[TABLE]
Under Assumption I.1, one can construct a saddle point $(\mathbf{x}^*,\theta^*,\lambda^*)$ for $\mathcal{L}$ in (11) such that $\lambda^*=\mathbf{0}$; hence, $\mathcal{L}(\mathbf{x}^*,\theta^*,\lambda^*)=\varphi(\mathbf{x}^*)$ and $\theta_i^*\in\mathcal{K}_i^\circ$ for $i\in\mathcal{N}$. Define $\tilde{\theta}=[\tilde{\theta}_i]_{i\in\mathcal{N}}$ such that $\tilde{\theta}_i\triangleq 2\|\theta_i^*\|\big(\|\mathcal{P}_{\mathcal{K}_i^\circ}(A_i\bar{\xi}_i^K-b_i)\|\big)^{-1}\mathcal{P}_{\mathcal{K}_i^\circ}(A_i\bar{\xi}_i^K-b_i)\in\mathcal{K}_i^\circ$, which implies
[TABLE]
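Note that $\mathcal{C}$ is a closed convex cone, and the projection is $\mathcal{P}_{\mathcal{C}}(\mathbf{x})=\mathbf{1}\otimes p(\mathbf{x})$; see (15). Similarly, define $\tilde{\nu}=\mathcal{P}_{\mathcal{C}^\circ}(\bar{\xi}^K)/\|\mathcal{P}_{\mathcal{C}^\circ}(\bar{\xi}^K)\|\in\mathcal{C}^\circ$, where $\mathcal{C}^\circ$ denotes the polar cone of $\mathcal{C}$. Hence, it can be verified that $\langle\tilde{\nu},\bar{\xi}^K\rangle=d_{\mathcal{C}}(\bar{\xi}^K)$. Note that $\tilde{\nu}\in\mathcal{C}^\circ$ implies that $\sigma_{\mathcal{C}}(\tilde{\nu})=0$; moreover, we also have $\tilde{\mathcal{C}}\subseteq\mathcal{C}$; hence, $\sigma_{\tilde{\mathcal{C}}}(\tilde{\nu})\leq\sigma_{\mathcal{C}}(\tilde{\nu})=0$. Therefore, we can conclude that $\sigma_{\tilde{\mathcal{C}}}(\tilde{\nu})=0$ since $\mathbf{0}\in\tilde{\mathcal{C}}$. Together with (83), we get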
[TABLE]
Since x∗∈C, we also have that
[TABLE]
Note that for any $i\in\mathcal{N}$, $\bar{\theta}_i^K\in\mathcal{K}_i^\circ$; hence, $\sigma_{\mathcal{K}_i}(\bar{\theta}_i^K)=0$. In addition, since $\bar{\theta}_i^K\in\mathcal{K}_i^\circ$ and $A_ix^*-b_i\in\mathcal{K}_i$, we have
Provided that we show $\Theta_1+\sum_{k=0}^{K-1}\frac{\gamma^k}{\gamma^0}\big(E_1^{k+1}(\tilde{\nu})+E_2^{k+1}\big)\leq\Theta(K)$ for some $\Theta(K)=\mathcal{O}\big(\sum_{k=0}^{K-1}\beta^{q_k}k^4\big)$, the desired result in (21) follows from (88) and (90). Moreover, the bound on $\|\xi^K-\mathbf{x}^*\|$ follows from (82). In fact, possibly a tighter bound can be derived using Θ(x∗,θ∗,λ∗) for λ∗=0. In the rest of the proof, we construct the $\Theta(K)$ bound with the properties specified above.
Note that using (16) and the non-expansivity of projection, PB(⋅), we conclude that
[TABLE]
Moreover, since we assumed that each ρi has a compact domain with diameter at most Δ, we immediately conclude that xk≤NΔ and ξk≤NΔ for k≥1. Hence, from (58) and nonexpansivity of prox operator we obtain
[TABLE]
Let qk=ξk+ηk(ξk−ξk−1) for k≥0. Note that for {ηk} as specified in Algorithm DPDA-TV displayed in Fig. 2, we have ηk≤1. Therefore, it follows from (58) and (59) that
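Therefore, by letting $\Theta(K)=\Theta_1+\Theta_2(K)+\Theta_3(K)$, it is easy to see that $\Theta(K)=\mathcal{O}\big(\sum_{k=0}^{K-1}\beta^{q_k}k^4\big)$; thus, $\sup_{K\in\mathbb{Z}_+}\Theta(K)<\infty$ due to our choice of $\{q_k\}$, and this completes the proof.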
Bibliography
[1] Joel B Predd, SB Kulkarni, and H Vincent Poor. Distributed learning in wireless sensor networks. IEEE Signal Processing Magazine, 23(4):56–69, 2006.
[2] Ioannis D Schizas, Alejandro Ribeiro, and Georgios B Giannakis. Consensus in ad hoc WSNs with noisy links - Part I: Distributed estimation of deterministic signals. IEEE Transactions on Signal Processing, 56(1):350–364, 2008.
[3] Ke Zhou and Stergios I Roumeliotis. Multirobot active target tracking with combinations of relative observations. IEEE Transactions on Robotics, 27(4):678–695, 2011.
[4] Yongcan Cao, Wenwu Yu, Wei Ren, and Guanrong Chen. An overview of recent progress in the study of distributed multi-agent coordination. IEEE Transactions on Industrial Informatics, 9(1):427–438, 2013.
[5] Juan Andrés Bazerque and Georgios B Giannakis. Distributed spectrum sensing for cognitive radio networks by exploiting sparsity. IEEE Transactions on Signal Processing, 58(3):1847–1862, 2010.
[6] Juan Andrés Bazerque, Gonzalo Mateos, and Georgios B Giannakis. Group-lasso on splines for spectrum cartography. IEEE Transactions on Signal Processing, 59(10):4648–4663, 2011.
[7] Konstantinos I Tsianos, Sean Lawlor, and Michael G Rabbat. Consensus-based distributed optimization: Practical issues and applications in large-scale machine learning. In Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference on, pages 1543–1550. IEEE, 2012.
[8] John C Duchi, Alekh Agarwal, and Martin J Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.