Distributed Optimization for Smart Cyber-Physical Networks

Giuseppe Notarstefano; Ivano Notarnicola; Andrea Camisa

arXiv:1906.10760·eess.SY·October 28, 2020


TL;DR

This paper surveys distributed optimization methods enabling smart cyber-physical network agents to cooperatively solve large-scale problems through local computation and communication without central coordination.

Contribution

It formalizes principal distributed optimization approaches and reviews recent extensions, providing a comprehensive introduction to the field.

Findings

01 Consensus-based, duality-based, and constraint exchange methods are key approaches.

02 Basic schemes are analyzed for effectiveness and efficiency.

03 State-of-the-art extensions improve scalability and robustness.





Full text

Distributed Optimization for Smart Cyber-Physical Networks

Notarstefano, Giuseppe

Università di Bologna, Bologna (Italy)

Notarnicola, Ivano

Università di Bologna, Bologna (Italy)

Camisa, Andrea

Università di Bologna, Bologna (Italy)

Abstract

The presence of embedded electronics and communication capabilities as well as sensing and control in smart devices has given rise to the novel concept of cyber-physical networks, in which agents aim at cooperatively solving complex tasks by local computation and communication. Numerous estimation, learning, decision and control tasks in smart networks involve the solution of large-scale, structured optimization problems in which network agents have only a partial knowledge of the whole problem. Distributed optimization aims at designing local computation and communication rules for the network processors allowing them to cooperatively solve the global optimization problem without relying on any central unit. The purpose of this survey is to provide an introduction to distributed optimization methodologies. Principal approaches, namely (primal) consensus-based, duality-based and constraint exchange methods, are formalized. An analysis of the basic schemes is supplied, and state-of-the-art extensions are reviewed.

Introduction
1 Distributed Optimization Framework
1.1 Distributed Computation Model
1.2 Optimization Set-ups
1.2.1 Cost-Coupled Optimization
1.2.2 Common Cost Optimization
1.2.3 Constraint-Coupled Optimization
1.3 Optimization Set-ups for Learning and Control
1.3.1 Regression for Data Analytics
1.3.2 Classification via Logistic Regression
1.3.3 Classification via Support Vector Machine (SVM)
1.3.4 Target Localization in Sensor Networks
1.3.5 Task allocation/assignment
1.3.6 Cooperative Distributed Model Predictive Control
2 Consensus-Based Primal Methods
2.1 Distributed Subgradient Method
2.2 Gradient Tracking Algorithm
2.3 Variants and Extensions of the Basic Gradient Tracking
2.4 Discussion and References
2.5 Numerical Example
3 Distributed Dual Methods
3.1 Fenchel Duality and Graph Duality
3.1.1 Two-Agent Example
3.1.2 Fenchel Duality
3.1.3 Graph Duality
3.2 Distributed Dual Decomposition for Cost-Coupled Problems
3.3 Distributed ADMM for Cost-Coupled Problems
3.4 Distributed Dual Methods for Constraint-Coupled Problems
3.4.1 Connections between Cost-Coupled and Constraint-Coupled Problems via Duality
3.4.2 Distributed Dual Subgradient Algorithm
3.4.3 Relaxation and Successive Distributed Decomposition
3.5 Discussion and References
3.6 Numerical Example
3.6.1 Cost-coupled Example
3.6.2 Constraint-coupled Example
4 Constraint Exchange Methods
4.1 Constraints Consensus applied to Linear Programs
4.1.1 Algorithm description
4.1.2 Convergence Analysis
4.1.3 Distributed Simplex
4.2 Constraints Consensus for Convex and Abstract Programs
4.3 Extensions
4.3.1 Cutting-plane Consensus
4.3.2 Distributed Mixed-Integer Linear Programming via Cut Generation and Constraint Exchange
4.3.3 Other extensions
4.4 Numerical Example
Concluding Remarks
A Centralized Optimization Methods
A.1 Gradient Method
A.2 Subgradient Method
A.3 Lagrangian Duality and Dual Subgradient Method
A.4 ADMM Algorithm
B Consensus Over Networks
B.1 Average Consensus over Static Networks
B.2 Push-sum Consensus over Directed Networks
B.3 Dynamic Average Consensus Algorithm
C Linear Programming

Introduction

Motivation

In recent years, breakthroughs in embedded electronics have made it possible to include computation and communication capabilities in almost any device, in domains as diverse as factories, farms, buildings, grids and cities. Communication among devices has opened a number of new challenges in turning smart devices into smart (cooperating) systems. The keyword “cyber-physical networks” is being adopted to refer to this pervasive reality, whose distinctive feature is that a great advantage can be obtained if its interconnected, complex nature is exploited. A novel peer-to-peer distributed computational framework is emerging as a new opportunity in which peer processors, communicating over a network, cooperatively solve a task without resorting to a unique provider that knows and owns all the data.

Several challenges arising in cyber-physical networks can be stated as optimization problems. Examples are estimation, decision, learning and control applications. To solve optimization problems over cyber-physical networks, it is not possible to apply the classical optimization algorithms (which we call centralized), since they require the data to be managed by a single entity. In fact, the problem data are spread over the network, and it is undesirable (or even impossible) to collect them at a single node. Parallel computing serves as a source of inspiration here. In order to speed up the solution of large-scale optimization problems, considerable effort has been devoted to designing parallel algorithms that split the computational burden among several processors. However, typical parallel optimization algorithms require a central coordinating node, and the communication topology is designed ad hoc. In distributed computation the communication topology cannot be thought of as a design parameter. Rather, it is a given part of the problem. Thus, in cyber-physical networks, the goal is to design algorithms, based on the exchange of information among the processors, that take advantage of the aggregated computational power. All the agents must be treated as peers: each of them performs the same tasks and no “master” node is present. Moreover, information privacy is often a requirement (i.e., private problem data at each node must not be shared with the other nodes). These challenges call for tailored strategies and have given rise to a novel, growing research branch termed distributed optimization.

Scope of the Monograph

The purpose of this survey is to give a comprehensive overview of the most common approaches used to design distributed optimization algorithms, together with the theoretical analysis of the main schemes in their basic version. We identify and formalize classes of problem set-ups that arise in motivating application scenarios. For each set-up, in order to give the main tools for analysis, we review tailored distributed algorithms in simplified cases. Extensions and generalizations of the basic schemes are also discussed at the end of each chapter. The algorithms have been developed by combining mathematical tools from optimization theory (e.g., duality) and network control theory (e.g., average consensus). For some of the discussed algorithms, we will also present parallel algorithms that serve as a starting point for the development of distributed methods.

We focus on three main categories of distributed optimization approaches: (i) primal consensus-based methods, i.e., methods combining classical gradient or subgradient steps with local averaging schemes; (ii) dual methods, i.e., methods which employ the Lagrangian dual of suitable equivalent formulations of the target problem to obtain a distributed routine; (iii) constraint exchange methods, which are based on the exchange of (active) constraints among agents to compute a solution of the considered problem.

Several survey papers on distributed optimization are available in the literature. An early survey paper presenting a broad class of relevant optimization problems in control is [1]. It also discusses tailored parallel and distributed optimization algorithms based on decomposition techniques, including the distributed subgradient method. Recent surveys thoroughly analyze average consensus [2] and the distributed subgradient method [2, 3, 4], with a literature review on other distributed optimization techniques. The book [5] provides parallel and distributed asynchronous optimization algorithms, including gradient tracking techniques. Some of the latest advances in distributed optimization are collected in [6].

Organization

In Chapter 1, we introduce the relevant problem set-ups, that we call cost-coupled, constraint-coupled and common cost, along with several motivating applications of interest arising in estimation, learning, decision and control. In Chapter 2 we provide an overview of primal approaches to solve cost-coupled problems, namely the distributed subgradient algorithm and the gradient tracking algorithm. In Chapter 3, a discussion on relevant duality forms for distributed optimization is first provided, and then distributed algorithms relying on Lagrangian approaches are reviewed. Namely, for cost-coupled problems, distributed dual decomposition and distributed ADMM algorithms are considered, while for constraint-coupled problems, a distributed dual subgradient algorithm and a method based on relaxation and successive distributed decomposition are presented. In Chapter 4, we focus on constraint exchange methods. We introduce the Constraints Consensus algorithm applied to common-cost problems, along with its most relevant extensions.

We also provide illustrative numerical examples to highlight significant properties of the considered distributed optimization methods. Since the described algorithms are designed for different problem set-ups, different, relevant simulation scenarios are considered in each chapter.

Chapter 1 Distributed Optimization Framework

In this chapter we introduce the conceptual framework for distributed optimization in peer-to-peer networks. First, we describe the network model we will consider throughout the survey. Then we present and motivate the main optimization set-ups that are of interest in smart networks.

In a distributed scenario, we consider $N$ units, called agents or processors, that have both communication and computation capabilities. Communication among agents is modeled by means of graph theory. Informally, given a graph $\mathcal{G}$ with $N$ nodes, one for each agent, an agent $i$ can send (receive) data to (from) another agent $j$ when the graph $\mathcal{G}$ contains an edge connecting $i$ to $j$ ($j$ to $i$). In a distributed algorithm, agents initialize their local states and then start an iterative procedure in which communication and computation steps are performed repeatedly, with all the nodes performing the same actions. In particular, local states are updated by using only information received from in-neighbors.

In this survey we consider a distributed framework in which agents cooperatively solve an optimization problem. The basic assumption we make is that each agent $i$ has only a partial knowledge of the entire problem, e.g., only a portion of the cost and/or a portion of the constraints is locally available. In the rest of the chapter, depending on the specific optimization set-up, we will clarify what we mean by cooperation among agents for the solution of a given optimization problem.

Remark 1.

We point out that, regardless of the optimization problem structure, our standing assumption is that the distributed framework is made up of cooperative agents. There is another strand of research on non-cooperative set-ups with applications to game-theoretic problems. A non-exhaustive list of early references is [7, 8, 9]. $\square$

1.1 Distributed Computation Model

In this section we formally define the communication model for a distributed algorithm. A network is modeled as a (possibly time-dependent) directed graph $\mathcal{G}^{t}=(\{1,\ldots,N\},\mathcal{E}^{t})$, where $t\in{\mathbb{N}}$ is a universal (slotted) time, $\{1,\ldots,N\}$ is the (fixed) set of agent identifiers and $\mathcal{E}^{t}\subseteq\{1,\ldots,N\}\times\{1,\ldots,N\}$, for all $t\geq 0$, is the (time-dependent) set of (directed) edges over the vertices $\{1,\ldots,N\}$, which represents the communication links. A graphical representation of a time-varying network is given in Figure 1.1.

At each (universal) time instant $t$, a communication structure, i.e., a graph $\mathcal{G}^{t}$, is active. The time-varying edge set $\mathcal{E}^{t}$ models the communication in the sense that at time $t$ there is an edge from node $i$ to node $j$ in $\mathcal{E}^{t}$ if and only if processor $i$ transmits information to processor $j$ at time $t$. Given an edge $(i,j)\in\mathcal{E}^{t}$, $i$ is called an in-neighbor of $j$ and $j$ an out-neighbor of $i$ at time $t$. When the edge set $\mathcal{E}^{t}$ does not depend on $t$, i.e., $\mathcal{G}^{t}\equiv\mathcal{G}$ for all $t$, we say that the network is fixed; otherwise the network is time-varying. Moreover, when for every pair of nodes $i$ and $j$ the edge $(i,j)$ is in $\mathcal{E}^{t}$ if and only if the edge $(j,i)$ is in $\mathcal{E}^{t}$, the graph is undirected. An example of a directed and of an undirected graph is depicted in Figure 1.2.

Given a fixed graph $\mathcal{G}$, connectivity properties can be stated.

Definition 2.

A fixed directed graph $\mathcal{G}$ is said to be strongly connected if for every pair of nodes $(i,j)$ there exists a path of directed edges that goes from $i$ to $j$. If $\mathcal{G}$ is undirected, we say that $\mathcal{G}$ is connected. $\square$
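As a minimal sketch of Definition 2 (not taken from the survey), strong connectivity of a fixed directed graph can be checked with two breadth-first searches from the same node: one on the graph itself and one on the graph with every edge reversed. The function names `reachable` and `is_strongly_connected`, and the representation of $\mathcal{E}$ as a list of pairs, are assumptions made for illustration.

```python
from collections import deque


def reachable(n, edges, source):
    """Return the set of nodes reachable from `source` along directed edges."""
    adjacency = {i: [] for i in range(1, n + 1)}
    for i, j in edges:
        adjacency[i].append(j)
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def is_strongly_connected(n, edges):
    """True if a directed path exists between every ordered pair of nodes."""
    all_nodes = set(range(1, n + 1))
    # Node 1 reaches everyone, and everyone reaches node 1 (reversed edges).
    forward = reachable(n, edges, 1)
    backward = reachable(n, [(j, i) for i, j in edges], 1)
    return forward == all_nodes and backward == all_nodes


ring = [(1, 2), (2, 3), (3, 1)]   # directed cycle
path = [(1, 2), (2, 3)]           # node 3 cannot reach node 1
print(is_strongly_connected(3, ring))  # True
print(is_strongly_connected(3, path))  # False
```

Checking a single root in both directions suffices because any path from $i$ to $j$ can be routed through that root.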

Connectivity properties can also be stated for time-varying topologies (we only consider directed graphs).

Definition 3.

A time-varying directed graph $\mathcal{G}^{t}$, $t\in{\mathbb{N}}$, is said to be

• jointly strongly connected if the graph $\mathcal{G}_{\infty}^{t}\triangleq(\{1,\ldots,N\},\mathcal{E}_{\infty}^{t})$, with $\mathcal{E}_{\infty}^{t}=\bigcup_{\tau=t}^{\infty}\mathcal{E}^{\tau}$, is strongly connected for all $t\geq 0$.

• $T$-strongly connected (or uniformly jointly strongly connected) if there exists a scalar $T>0$ such that the graph $\mathcal{G}_{T}^{t}\triangleq(\{1,\ldots,N\},\mathcal{E}_{T}^{t})$, with $\mathcal{E}_{T}^{t}=\bigcup_{\tau=0}^{T-1}\mathcal{E}^{t+\tau}$, is strongly connected for every $t\geq 0$. $\square$
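The window-based condition in Definition 3 can be sketched in code under one simplifying assumption that is not part of the definition: the edge sets repeat periodically, so only the windows starting within one period need to be checked. The names `strongly_connected` and `is_T_strongly_connected` and the list-of-edge-lists input are hypothetical.

```python
def strongly_connected(n, edges):
    """Check strong connectivity of a directed graph on nodes 1..n."""
    def reaches_all(adjacency):
        seen, stack = {1}, [1]
        while stack:
            u = stack.pop()
            for v in adjacency[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return len(seen) == n

    out_adj = {i: [] for i in range(1, n + 1)}
    in_adj = {i: [] for i in range(1, n + 1)}
    for i, j in edges:
        out_adj[i].append(j)
        in_adj[j].append(i)
    return reaches_all(out_adj) and reaches_all(in_adj)


def is_T_strongly_connected(n, edge_sets, T):
    """edge_sets holds one period E^0, ..., E^{P-1} of a periodic sequence."""
    period = len(edge_sets)
    for t in range(period):
        # Union of the edge sets active in the window t, ..., t + T - 1.
        union = set()
        for tau in range(T):
            union.update(edge_sets[(t + tau) % period])
        if not strongly_connected(n, union):
            return False
    return True


# Three agents, a single directed link active in each time slot.
edge_sets = [[(1, 2)], [(2, 3)], [(3, 1)]]
print(is_T_strongly_connected(3, edge_sets, 3))  # True
print(is_T_strongly_connected(3, edge_sets, 2))  # False
```

In this example no single graph $\mathcal{G}^{t}$ is strongly connected, yet every window of three consecutive slots contains the full directed ring.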

Given a network topology, agents can run distributed algorithms according to several communication protocols. When the steps of the algorithm explicitly depend on the value of $t$, we say that the algorithm is synchronous, i.e., agents must be aware of the current value of $t$ and, thus, their local operations must be synchronized to a global clock. We will also consider a communication protocol in which agents are not aware of any global time information, i.e., their updates do not depend on $t$, and we term these algorithms asynchronous. In fact, if a distributed algorithm is designed to run over a jointly strongly connected graph, and the local computation steps do not depend on $t$, then the algorithm can also be implemented in an asynchronous network.

1.2 Optimization Set-ups

In this section we describe three general optimization set-ups that comprise several estimation, learning, decision and control application scenarios in smart networks. A distributed optimization algorithm for such classes of problems consists of an iterative procedure based on the distributed computation model introduced in Section 1.1. The goal for the agents is to eventually obtain a solution of the investigated problem. In each considered optimization set-up, this goal translates to different statements that will be formally specified next.

For an optimization algorithm, the aim is to minimize a scalar objective function (or cost function), usually denoted as $f(\mathbf{x})$, where $\mathbf{x}\in\mathbb{R}^{d}$ is the decision variable. We may need to restrict the minimizer of $f$ to a given constraint set $X\subseteq\mathbb{R}^{d}$ (or feasible set). From now on, we use the symbol $\min$ to denote that we want to minimize $f(\mathbf{x})$ subject to the constraints, and we compactly write the overall optimization problem as

$$
\begin{aligned}
\min_{\mathbf{x}} \;\; & f(\mathbf{x}) \\
\text{subj. to} \;\; & \mathbf{x}\in X.
\end{aligned}
$$

The generic constraint set $X$ can also be expressed by means of equalities or inequalities as, e.g., $h_{j}(\mathbf{x})=0$ for $j\in\{1,\ldots,p\}$, or $g_{k}(\mathbf{x})\leq 0$ for $k\in\{1,\ldots,q\}$, for some functions $h_{j}$ and $g_{k}$. The equality and inequality constraints are usually compactly denoted as $\mathbf{h}(\mathbf{x})=\mathbf{0}$ or $\mathbf{g}(\mathbf{x})\leq\mathbf{0}$. Centralized methods to approach this problem can be found in [10, 11].

In the remainder of this section, we introduce three structured versions of the above general optimization problem.

1.2.1 Cost-Coupled Optimization

We start by introducing an optimization set-up in which the cost function is expressed as the sum of local contributions $f_{i}$, all of which depend on a common optimization variable $\mathbf{x}$. Formally, the set-up is

$$
\begin{aligned}
\min_{\mathbf{x}\in\mathbb{R}^{d}} \;\; & \sum_{i=1}^{N} f_{i}(\mathbf{x}) \\
\text{subj. to} \;\; & \mathbf{x}\in X,
\end{aligned}
\tag{1.1}
$$

where $\mathbf{x}\in\mathbb{R}^{d}$ and $X\subseteq\mathbb{R}^{d}$. The global constraint set $X$ is assumed to be common to all agents, while $f_{i}:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is assumed to be known by agent $i$ only, for all $i\in\{1,\ldots,N\}$. Figure 1.3 provides a graphical representation of how problem information is spread over the network.

More general versions of this optimization set-up assume that the constraint set is more structured, e.g., $X=\bigcap_{i=1}^{N}X_{i}$, where each $X_{i}$ is known by agent $i$ only.

Let $\mathbf{x}^{\star}$ denote an optimal solution of problem (1.1). For this optimization set-up, the goal is to design a distributed algorithm where each agent updates a local estimate $\mathbf{x}_{i}^{t}$ that converges (asymptotically or in finite time) to $\mathbf{x}^{\star}$, by means of local computation and neighboring communication only. An illustrative scheme is depicted in Figure 1.4.
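As a simple illustration of the cost-coupled set-up (an assumed toy instance, not an example from the survey), take $X=\mathbb{R}^{d}$ and quadratic local costs built from hypothetical private data $\mathbf{a}_{i}\in\mathbb{R}^{d}$ held by agent $i$. Setting the gradient of the sum to zero gives

$$
f_{i}(\mathbf{x})=\tfrac{1}{2}\|\mathbf{x}-\mathbf{a}_{i}\|^{2},
\qquad
\sum_{i=1}^{N}\nabla f_{i}(\mathbf{x}^{\star})=\sum_{i=1}^{N}(\mathbf{x}^{\star}-\mathbf{a}_{i})=\mathbf{0}
\;\;\Longrightarrow\;\;
\mathbf{x}^{\star}=\frac{1}{N}\sum_{i=1}^{N}\mathbf{a}_{i}.
$$

No agent can compute $\mathbf{x}^{\star}$ from its own $f_{i}$ alone, since the minimizer is the network-wide average of the $\mathbf{a}_{i}$; in this special case the distributed optimization problem reduces to an averaging problem.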

Remark 4.

An interesting optimization set-up arising in several applications is the so-called partitioned, or partition-based, set-up, first introduced in [12]. The problem is in the form (1.1), but the cost function and the constraints of each agent do not involve all the components of the decision variable; rather, they depend only on some of its components. This sparsity in the problem can be modeled using a graph. Formally, the partitioned optimization set-up is

$$
\begin{aligned}
\min_{\mathbf{x}} \;\; & \sum_{i=1}^{N} f_{i}\big(\mathbf{x}_{i},\{\mathbf{x}_{j}\}_{j\in\mathcal{N}_{i}}\big) \\
\text{subj. to} \;\; & \big(\mathbf{x}_{i},\{\mathbf{x}_{j}\}_{j\in\mathcal{N}_{i}}\big)\in X_{i}, \qquad i\in\{1,\ldots,N\},
\end{aligned}
$$

where $\mathbf{x}$ denotes the vector stacking $(\mathbf{x}_{1},\ldots,\mathbf{x}_{N})$, while the notation $f_{i}(\mathbf{x}_{i},\{\mathbf{x}_{j}\}_{j\in\mathcal{N}_{i}})$ highlights the fact that $f_{i}$ actually depends only on the components of $\mathbf{x}$ indexed by $\{i\}\cup\mathcal{N}_{i}$. Distributed algorithms have been developed to solve partitioned problems. Remark 20 discusses how to tailor algorithms based on dual decomposition in order to take into account the partitioned structure. $\square$

1.2.2 Common Cost Optimization

Another important set-up arising in several applications is given by

$$
\begin{aligned}
\min_{\mathbf{x}\in\mathbb{R}^{d}} \;\; & f(\mathbf{x}) \\
\text{subj. to} \;\; & \mathbf{x}\in\bigcap_{i=1}^{N}X_{i},
\end{aligned}
\tag{1.2}
$$

where $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is known by all the agents while each constraint $X_{i}\subseteq\mathbb{R}^{d}$ is known by agent $i$ only, for all $i\in\{1,\ldots,N\}$. Figure 1.5 provides a graphical representation of how information is spread over the network.

The common cost set-up (1.2) is somewhat similar to the cost-coupled set-up (1.1), since in both cases the optimization variable is shared among the processors. However, in the common cost set-up (1.2), the cost function is shared, and the coupling among the agents is due to the fact that the optimization variable must belong to all the local constraint sets. It is possible to think of problem (1.2) as a special case of the cost-coupled set-up (1.1) (with $X=\bigcap_{i=1}^{N}X_{i}$) by setting each $f_{i}(\mathbf{x})=1/N\cdot f(\mathbf{x})$. However, notice that a commonly known cost function explicitly allows for tailored distributed optimization algorithms such as, e.g., constraint exchange methods (cf. Chapter 4).

Let $\mathbf{x}^{\star}$ denote an optimal solution of problem (1.2). For this optimization set-up, the goal is to design a distributed algorithm where each agent updates a local estimate $\mathbf{x}_{i}^{t}$ that converges (asymptotically or in finite time) to $\mathbf{x}^{\star}$, by means of local computation and neighboring communication only (cf. Figure 1.4).

1.2.3 Constraint-Coupled Optimization

In this subsection, we present a different set-up, which we call constraint-coupled. Agents in a network want to minimize the sum of local cost functions, each one depending only on a local vector satisfying local constraints. The decision vectors are then coupled to each other by means of separable coupling constraints. This structure easily leads to so-called big-data problems, in which the decision variable is high-dimensional and grows with the network size. However, since agents are typically interested in computing only their (small) portion of an optimal solution, novel tailored methods need to be developed to address these challenges.

Formally, the constraint-coupled optimization problem is

$$
\begin{aligned}
\min_{\mathbf{x}_{1},\ldots,\mathbf{x}_{N}} \;\; & \sum_{i=1}^{N} f_{i}(\mathbf{x}_{i}) \\
\text{subj. to} \;\; & \mathbf{x}_{i}\in X_{i}, \qquad i\in\{1,\ldots,N\}, \\
& \sum_{i=1}^{N}\mathbf{g}_{i}(\mathbf{x}_{i})\leq\mathbf{0},
\end{aligned}
\tag{1.3}
$$

where $(\mathbf{x}_{1},\ldots,\mathbf{x}_{N})$ is the global optimization vector stacking all the local variables, and $X_{i}\subseteq\mathbb{R}^{d_{i}}$, $f_{i}:\mathbb{R}^{d_{i}}\rightarrow\mathbb{R}$ and $\mathbf{g}_{i}:\mathbb{R}^{d_{i}}\rightarrow\mathbb{R}^{S}$ are known by agent $i$ only, for all $i\in\{1,\ldots,N\}$. Notice that problem (1.3) is challenging because of the coupling constraints $\sum_{i=1}^{N}\mathbf{g}_{i}(\mathbf{x}_{i})\leq\mathbf{0}$. If there were no coupling constraints, the optimization would trivially split into $N$ independent problems. Figure 1.6 provides a graphical representation of how information is spread over the network.

Let $(\mathbf{x}_{1}^{\star},\ldots,\mathbf{x}_{N}^{\star})$ denote an optimal solution of problem (1.3). The goal is to design a distributed algorithm where each agent updates a local estimate $\mathbf{x}_{i}^{t}$ that converges (asymptotically or in finite time) to $\mathbf{x}_{i}^{\star}$, the $i$-th portion of $(\mathbf{x}_{1}^{\star},\ldots,\mathbf{x}_{N}^{\star})$, by means of local computation and neighboring communication only. An illustrative scheme is depicted in Figure 1.7.

A special instance of this set-up has been investigated in the context of resource allocation, where the coupling constraint is linear, e.g., $\sum_{i=1}^{N}\mathbf{x}_{i}=b$, and there are no local constraints. In this survey, we consider more general problems where the coupling may be nonlinear and local constraints are explicitly taken into account.

Remark 5 (Comparison with the cost-coupled set-up).

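We notice that problem (1.1) can be cast as (1.3) by introducing copies $\mathbf{x}_{1},\ldots,\mathbf{x}_{N}$ of the decision vector $\mathbf{x}$ and appropriate coherence (coupling) constraints, i.e.,

$$
\begin{aligned}
\min_{\mathbf{x}_{1},\ldots,\mathbf{x}_{N}} \;\; & \sum_{i=1}^{N} f_{i}(\mathbf{x}_{i}) \\
\text{subj. to} \;\; & \mathbf{x}_{i}\in X, \qquad i\in\{1,\ldots,N\}, \\
& \mathbf{x}_{1}=\mathbf{x}_{2}, \\
& \qquad\vdots \\
& \mathbf{x}_{N-1}=\mathbf{x}_{N}.
\end{aligned}
$$

However, it is worth noticing that the coupling constraint of such reformulation enjoys a special, sparse structure while the constraints in (1.3) are more general (since they involve all the agents in the network). $\square$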

1.3 Optimization Set-ups for Learning and Control

In this section, we motivate the study of the optimization set-ups introduced in Section 1.2 by describing important application scenarios that are of interest in control and robotics as well as communication and signal processing.

1.3.1 Regression for Data Analytics

Let us consider an important task for several applications, namely the linear regression problem, in which we assume that a set of points in a training dataset is used to estimate the parameters of a model (assumed to be linear in the parameters). The model can be exploited, e.g., to predict newly generated samples. Figure 1.8 gives a pictorial representation of a simple scenario in $\mathbb{R}^{2}$.

Nowadays, especially in big-data contexts, a natural scenario is to assume that the training data are not (or cannot be) gathered at a main collection center. Rather, it is reasonable to assume that the samples are (spatially) distributed in a network, as shown in Figure 1.9.

Now, let us focus on Least Squares (LS), a popular regression approach. Assume that $N$ processors in a network want to solve a regression problem, where $\mathbf{x}\in^{d}$ denotes the parameter vector that has to be estimated, and each agent $i$ has $n_{i}$ observations. The (unweighted) LS problem can be formulated as

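$$\min_{\mathbf{x}\in\mathbb{R}^{d}}\ \sum_{i=1}^{N}\|\mathbf{D}_{i}\mathbf{x}-\mathbf{b}_{i}\|^{2}, \tag{1.4}$$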

where, for all $i\in\{1,\ldots,N\}$, $\mathbf{D}_{i}\in\mathbb{R}^{n_{i}\times d}$ is the regression matrix and $\mathbf{b}_{i}\in\mathbb{R}^{n_{i}}$ is the label vector.

A typical challenge arising in regression problems is due to the fact that problem (1.4) may be ill-posed and can easily lead to over-fitting phenomena. A viable technique to prevent over-fitting consists in adding a suitable regularization term $r(\mathbf{x})$ in the cost function, leading to

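$$\min_{\mathbf{x}\in\mathbb{R}^{d}}\ \sum_{i=1}^{N}\|\mathbf{D}_{i}\mathbf{x}-\mathbf{b}_{i}\|^{2}+r(\mathbf{x}),$$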

where $r:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is assumed to be known by all the agents in the network. Several choices of the regularizer $r(\mathbf{x})$ are possible. For instance, by using the $\ell_{1}$-norm, we obtain the so-called LASSO (Least Absolute Shrinkage and Selection Operator) problem, i.e.,

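$$\min_{\mathbf{x}\in\mathbb{R}^{d}}\ \sum_{i=1}^{N}\|\mathbf{D}_{i}\mathbf{x}-\mathbf{b}_{i}\|^{2}+\rho\,\|\mathbf{x}\|_{1}, \tag{1.5}$$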

where $\rho$ is a positive scalar used to strengthen or weaken the effects of the regularizer. Problem (1.5) can be classified as cost-coupled, i.e., of the form (1.1), with $X=\mathbb{R}^{d}$ and local functions given by $f_{i}(\mathbf{x})=\|\mathbf{D}_{i}\mathbf{x}-\mathbf{b}_{i}\|^{2}+\rho/N\cdot\|\mathbf{x}\|_{1}$.
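As a quick sanity check of this splitting, the following minimal Python sketch evaluates the local costs $f_{i}$ on a toy instance and compares their sum with the global LASSO cost. The data, the value of $\rho$ and all function names are illustrative assumptions, not taken from the survey.

```python
# Sketch: the local costs f_i(x) = ||D_i x - b_i||^2 + (rho/N) ||x||_1
# should add up to the global LASSO cost. Toy data, hypothetical names.

def residual_sq(D, b, x):
    """Return ||D x - b||^2 for a matrix D given as a list of rows."""
    return sum((sum(dk * xk for dk, xk in zip(row, x)) - bj) ** 2
               for row, bj in zip(D, b))

def l1_norm(x):
    return sum(abs(xk) for xk in x)

def local_cost(D_i, b_i, x, rho, N):
    """Local cost of agent i: data fit plus an equal share of the regularizer."""
    return residual_sq(D_i, b_i, x) + rho / N * l1_norm(x)

def global_cost(data, x, rho):
    """Global cost: sum of all data-fit terms plus rho * ||x||_1."""
    return sum(residual_sq(D, b, x) for D, b in data) + rho * l1_norm(x)

# Toy instance: N = 2 agents, d = 2 parameters.
data = [
    ([[1.0, 0.0], [0.0, 1.0]], [1.0, 2.0]),  # agent 1: n_1 = 2 observations
    ([[1.0, 1.0]], [3.0]),                   # agent 2: n_2 = 1 observation
]
rho = 0.5
x = [0.5, -1.0]
N = len(data)
sum_of_local = sum(local_cost(D, b, x, rho, N) for D, b in data)
total = global_cost(data, x, rho)
print(sum_of_local, total)
```

Splitting the regularizer equally among the agents is only one possible choice; any split whose shares sum to $\rho\|\mathbf{x}\|_{1}$ gives the same global problem.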

This problem will be used to test duality-based methods for cost-coupled problems and a numerical example is shown in Section 3.6.

1.3.2 Classification via Logistic Regression

Regression problems can also be set up for a classification scenario. We recall a set-up in which linear models are trained by minimizing the so-called logistic loss functions. Suppose each agent has $m_{i}$ points $p_{i,1},\ldots,p_{i,m_{i}}\in\mathbb{R}^{d}$ (which represent training samples in a feature space) and suppose they are associated with binary labels, i.e., each point $p_{i,j}$ is labeled with $\ell_{i,j}\in\{-1,1\}$, for all $j\in\{1,\ldots,m_{i}\}$ and $i\in\{1,\ldots,N\}$. The problem consists of building a linear classification model from the training samples by maximizing the a-posteriori probability of each class. In particular, we look for a separating hyperplane of the form $\{z\in\mathbb{R}^{d}\mid w^{\top}z+b=0\}$, whose parameters ($w$ and $b$) can be determined by solving the convex optimization problem

[TABLE]

where $C>0$ is a parameter affecting regularization. We now make some observations on problem (1.6). First, we see that it is an unconstrained optimization problem, so that an optimal solution can always be found (even though it may be meaningless for the classification problem). Second, we point out that the cost function is strictly convex, so that the optimal solution is unique. Finally, notice that the problem is cost-coupled, i.e., it is of the form (1.1), with $X=\mathbb{R}^{d}$ and each $f_{i}$ is given by

[TABLE]

In a distributed setting, the goal is to make agents agree on a common solution $(w^{\star},b^{\star})$, so that all of them can compute the separating hyperplane as $\{z\in\mathbb{R}^{d}\mid(w^{\star})^{\top}z+b^{\star}=0\}$.

This problem is suited for the application of consensus-based primal methods (cf. Section 2) and a numerical example is shown in Section 2.5.
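To make the cost-coupled structure concrete, here is a minimal Python sketch of a local logistic cost. It assumes the standard logistic loss $\log(1+e^{-\ell_{i,j}(w^{\top}p_{i,j}+b)})$ and an equal split $\frac{C}{2N}\|w\|^{2}$ of the regularizer among the agents; both the exact form of the loss and the split are assumptions made for illustration, and the function names are hypothetical.

```python
import math

def log1p_exp(s):
    """Numerically stable evaluation of log(1 + exp(s))."""
    if s > 0:
        return s + math.log1p(math.exp(-s))
    return math.log1p(math.exp(s))

def local_logistic_cost(points, labels, w, b, C, N):
    """Assumed local cost of one agent: logistic losses over its own
    samples plus a 1/N share of the regularizer (C/2) * ||w||^2."""
    loss = 0.0
    for p, ell in zip(points, labels):
        margin = ell * (sum(wk * pk for wk, pk in zip(w, p)) + b)
        loss += log1p_exp(-margin)
    return loss + C / (2 * N) * sum(wk * wk for wk in w)

# With w = 0 and b = 0 every sample contributes log(2).
value = local_logistic_cost([[1.0, 2.0], [0.0, -1.0]], [1, -1],
                            [0.0, 0.0], 0.0, C=1.0, N=2)
print(value)
```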

1.3.3 Classification via Support Vector Machine (SVM)

Support Vector Machines (SVMs) are a popular tool used in (supervised) learning to build classification models. Suppose we have $N$ points $p_{1},\ldots,p_{N}\in\mathbb{R}^{d}$ (which represent training samples in a feature space) and suppose they are associated with binary labels, i.e., each $p_{i}$ is labeled with $\ell_{i}\in\{-1,1\}$, for all $i\in\{1,\ldots,N\}$. For simplicity, we consider linear SVM (more complex set-ups can be handled with appropriate transformations [13]). The problem consists of building a classification model from the training samples. In particular, we look for a separating hyperplane of the form $\{z\in\mathbb{R}^{d}\mid w^{\top}z+b=0\}$ such that it separates all the points with $\ell_{i}=-1$ from all the points with $\ell_{i}=1$. In symbols:

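$$w^{\top}p_{i}+b>0\ \ \text{for all } i \text{ such that } \ell_{i}=1,\qquad w^{\top}p_{i}+b<0\ \ \text{for all } i \text{ such that } \ell_{i}=-1.$$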

In Figure 1.10, a classification example is shown.

In order to maximize the distance of the separating hyperplane from the training points, one can solve the following (convex) quadratic program:

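$$\min_{w,b}\ \frac{1}{2}\|w\|^{2}\quad\text{subj. to}\quad \ell_{i}(w^{\top}p_{i}+b)\geq 1,\ \ i\in\{1,\ldots,N\}. \tag{1.7}$$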

Problem (1.7) is known in the literature as hard-margin SVM problem, and can be solved only if a separating hyperplane exists. However, if problem (1.7) is infeasible (e.g., when there are outliers), one can solve a soft-margin SVM problem in which some of the training samples are allowed to be on the “wrong side” of the hyperplane. Formally, we consider the following relaxation of problem (1.7):

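$$\min_{w,b,\boldsymbol{\xi}}\ \frac{1}{2}\|w\|^{2}+C\sum_{i=1}^{N}\xi_{i}\quad\text{subj. to}\quad \ell_{i}(w^{\top}p_{i}+b)\geq 1-\xi_{i},\ \ \xi_{i}\geq 0,\ \ i\in\{1,\ldots,N\}, \tag{1.8}$$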

where we denote by $\boldsymbol{\xi}$ the vector stacking the violations $\xi_{1},\ldots,\xi_{N}$ and $C>0$ weighs the effect of the relaxation. Notice that problem (1.8) can be viewed either as a cost-coupled problem of the form (1.1), or as a common cost problem of the form (1.2).

In a distributed set-up, problem (1.8) must be solved by agents in a network. We suppose that each agent $i$ is assigned exactly one training tuple $(p_{i},\ell_{i})$, so that each agent knows one constraint of the optimization problem. Agents eventually agree on an optimal solution $(w^{\star},b^{\star},\boldsymbol{\xi}^{\star})$, so that the separating hyperplane can be computed as $\{z\in\mathbb{R}^{d}\mid(w^{\star})^{\top}z+b^{\star}=0\}$.

This problem is suited, e.g., for the application of constraint exchange methods (cf. Section 4.2) and a numerical example is shown in Section 4.4.
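The role of the violations $\xi_{i}$ can be illustrated with a small Python sketch. It assumes the standard soft-margin constraints $\ell_{i}(w^{\top}p_{i}+b)\geq 1-\xi_{i}$, $\xi_{i}\geq 0$, under which the smallest admissible violation for a fixed hyperplane is $\max\{0,1-\ell_{i}(w^{\top}p_{i}+b)\}$. The toy samples and function names are hypothetical.

```python
def min_slack(p_i, ell_i, w, b):
    """Smallest xi_i >= 0 such that ell_i * (w^T p_i + b) >= 1 - xi_i."""
    margin = ell_i * (sum(wk * pk for wk, pk in zip(w, p_i)) + b)
    return max(0.0, 1.0 - margin)

def soft_margin_cost(samples, w, b, C):
    """Soft-margin cost for a fixed (w, b), using the minimal violations."""
    slacks = [min_slack(p, ell, w, b) for p, ell in samples]
    cost = 0.5 * sum(wk * wk for wk in w) + C * sum(slacks)
    return cost, slacks

# Each tuple plays the role of the single constraint known by one agent.
samples = [([2.0, 0.0], 1), ([-2.0, 0.0], -1), ([0.5, 0.0], -1)]  # last one is an outlier
cost, slacks = soft_margin_cost(samples, w=[1.0, 0.0], b=0.0, C=10.0)
print(cost, slacks)
```

Only the outlier incurs a positive violation, which is what allows the relaxed problem to stay feasible when no separating hyperplane exists.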

1.3.4 Target Localization in Sensor Networks

An interesting application in the field of sensor and robotic networks is the problem of estimating the position of a target, while having information on the position of sensors that can detect the unknown target within their field of sensing. A representational example of the problem is given in Figure 1.11.

Formally, we suppose that $N$ sensors are used to estimate in a distributed way the position of a target. Each sensor $i$ knows its position $\mathbf{v}_{i}\in\mathbb{R}^{2}$ and the unknown target position is denoted by $\mathbf{x}\in\mathbb{R}^{2}$. We assume that sensors in the network detect the presence of the unknown target with two sensing mechanisms: (i) laser transmitters which scan through some angle, leading to a bounded cone set that can be expressed by three linear constraints, two bounding the angle and one bounding the distance, compactly written as $A_{i}\mathbf{x}\leq b_{i}$, with $A_{i}\in\mathbb{R}^{3\times 2}$ and $b_{i}\in\mathbb{R}^{3}$, and (ii) the range of the RF transmitter, leading to circular constraints of the form $\|\mathbf{x}-\mathbf{v}_{i}\|_{2}\leq r_{i}$, where $r_{i}$ denotes the maximum sensing distance. Depending on the sensing mechanisms that each sensor $i$ is equipped with, it is possible to bound the position of the unknown target to be contained in the intersection of convex sets $X_{i}$, each one known only by agent $i$, defined as $X_{i}\triangleq\{\mathbf{x}\mid\|\mathbf{x}-\mathbf{v}_{i}\|_{2}\leq r_{i}\}$ if the constraint is a disk, $X_{i}\triangleq\{\mathbf{x}\mid A_{i}\mathbf{x}\leq b_{i}\}$ if the constraint is a cone, $X_{i}\triangleq\{\mathbf{x}\mid A_{i}\mathbf{x}\leq b_{i},\>\|\mathbf{x}-\mathbf{v}_{i}\|_{2}\leq r_{i}\}$ if the constraint is a quadrant.

Now, the goal for the agents is to compute the smallest bounding box $\{\mathbf{x}\in\mathbb{R}^{2}\mid\mathbf{x}^{L}\leq\mathbf{x}\leq\mathbf{x}^{U}\}$, for suitable $\mathbf{x}^{L},\mathbf{x}^{U}\in\mathbb{R}^{2}$, that is guaranteed to contain the unknown position of the target. This can be addressed by solving four optimization problems, one for each component of $\mathbf{x}^{L},\mathbf{x}^{U}$. For instance, to compute the first component of $\mathbf{x}^{L}$, agents define the objective vector $c=[1,0]^{\top}$ and they cooperatively solve the optimization problem

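$$\min_{\mathbf{x}\in\mathbb{R}^{2}}\ c^{\top}\mathbf{x}\quad\text{subj. to}\quad \mathbf{x}\in\bigcap_{i=1}^{N}X_{i}, \tag{1.9}$$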

which is in the common cost form (1.2). After an optimal solution $\mathbf{x}^{\star}$ is found, each agent computes the first component of $\mathbf{x}^{L}$ by using the first component of $\mathbf{x}^{\star}$ , and similarly for the other coordinates.
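The local sets $X_{i}$ are simple to test numerically. The following Python sketch checks membership of a candidate position in a disk, a cone, or their intersection; the sensor data and function names are hypothetical and chosen only for illustration.

```python
import math

def in_disk(x, v_i, r_i):
    """Check ||x - v_i||_2 <= r_i."""
    return math.hypot(x[0] - v_i[0], x[1] - v_i[1]) <= r_i

def in_cone(x, A_i, b_i):
    """Check A_i x <= b_i componentwise (A_i has three rows)."""
    return all(row[0] * x[0] + row[1] * x[1] <= bk for row, bk in zip(A_i, b_i))

def in_local_set(x, v_i=None, r_i=None, A_i=None, b_i=None):
    """Membership in X_i: disk, cone, or both, depending on the data provided."""
    ok = True
    if r_i is not None:
        ok = ok and in_disk(x, v_i, r_i)
    if A_i is not None:
        ok = ok and in_cone(x, A_i, b_i)
    return ok

# Hypothetical sensor at the origin with range 2 and a cone |x_2| <= x_1 <= 3.
v = [0.0, 0.0]
A_cone = [[-1.0, 1.0], [-1.0, -1.0], [1.0, 0.0]]
b_cone = [0.0, 0.0, 3.0]
print(in_local_set([1.0, 0.5], v_i=v, r_i=2.0, A_i=A_cone, b_i=b_cone))
print(in_local_set([-1.0, 0.0], v_i=v, r_i=2.0, A_i=A_cone, b_i=b_cone))
```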

1.3.5 Task Allocation/Assignment

Task allocation is a building block for decision making problems in which a certain number of agents must be assigned given tasks. The goal is to find the best matching of agents and tasks according to a given performance criterion. Here, we consider $N$ agents and $N$ tasks and we look for a one-to-one assignment. Define the variable $x_{i\kappa}$, which is $1$ if agent $i$ is assigned to task $\kappa$ and $0$ otherwise. Also, define the set $E_{A}$, which contains the tuple $(i,\kappa)$ if agent $i$ can be assigned to task $\kappa$. Finally, let $c_{i\kappa}$ be the cost occurring if agent $i$ is assigned to task $\kappa$. In Figure 1.12, we show an illustrative example of the set-up.

Since the objective is to minimize the total cost, the task allocation problem can be formulated as an integer program. However, as pointed out in [14], integrality constraints can be dropped to obtain the linear program

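$$\min_{\mathbf{x}}\ \sum_{(i,\kappa)\in E_{A}}c_{i\kappa}x_{i\kappa}\quad\text{subj. to}\quad 0\leq\mathbf{x}\leq\mathbf{1},\quad \sum_{\kappa\mid(i,\kappa)\in E_{A}}x_{i\kappa}=1\ \ \forall\, i,\quad \sum_{i\mid(i,\kappa)\in E_{A}}x_{i\kappa}=1\ \ \forall\,\kappa, \tag{1.10}$$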

where $\mathbf{x}$ is the variable stacking all $x_{i\kappa}$ . If problem (1.10) is feasible, it can be shown that it admits an optimal solution such that $x_{i\kappa}\in\{0,1\}$ for all $(i,\kappa)\in E_{A}$ (see, e.g., [14]). Moreover, all the optimal assignments belong to the optimal solution set of problem (1.10).

Problem (1.10) can be cast into the constraint-coupled form (1.3). To see this, let us define $K_{i}$ as the number of tasks that agent $i$ can perform (i.e., $K_{i}=|\{\kappa\mid(i,\kappa)\in E_{A}\}|$). We assume that agent $i$ deals with the variable $\mathbf{x}_{i}\in\mathbb{R}^{K_{i}}$, stacking the $x_{i\kappa}$ for all $\kappa$ such that $(i,\kappa)\in E_{A}$. Then, the local sets $X_{i}$ can be written as

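$$X_{i}=\{\mathbf{x}_{i}\in\mathbb{R}^{K_{i}}\mid 0\leq\mathbf{x}_{i}\leq\mathbf{1},\ \mathbf{1}^{\top}\mathbf{x}_{i}=1\}.$$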

The coupling constraints can be written by defining, for all $i\in\{1,\ldots,N\}$, the matrix $H_{i}\in\mathbb{R}^{N\times K_{i}}$, obtained by extracting from the $N\times N$ identity matrix the subset of columns corresponding to the tasks that agent $i$ can perform. Problem (1.10) becomes

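$$\min_{\mathbf{x}_{1},\ldots,\mathbf{x}_{N}}\ \sum_{i=1}^{N}c_{i}^{\top}\mathbf{x}_{i}\quad\text{subj. to}\quad \sum_{i=1}^{N}H_{i}\mathbf{x}_{i}=\mathbf{1},\quad \mathbf{x}_{i}\in X_{i},\ \ i\in\{1,\ldots,N\}, \tag{1.11}$$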

where each $c_{i}$ stacks the costs $c_{i\kappa}$ , for all $\kappa$ such that $(i,\kappa)\in E_{A}$ . Notice that problem (1.10) can be also tackled by resorting to its dual, which can be solved by using distributed optimization algorithms for common-cost problems.

In a distributed context, the goal for the agents is to find an optimal solution $\mathbf{x}^{\star}$, but each agent $i$ is only interested in computing its portion $\mathbf{x}_{i}^{\star}$ of the optimal solution, which contains only one entry $x_{i\kappa}=1$, corresponding to the task $\kappa$ that agent $i$ is eventually assigned.
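The construction of the matrices $H_{i}$ can be sketched in a few lines of Python. The example below builds each $H_{i}$ from a hypothetical admissibility set $E_{A}$ (with zero-based agent and task indices, an implementation convenience) and checks that a one-to-one assignment satisfies the coupling constraint $\sum_{i=1}^{N}H_{i}\mathbf{x}_{i}=\mathbf{1}$. All names and data are illustrative assumptions.

```python
def build_H(i, E_A, N):
    """Columns of the N x N identity matching the tasks agent i can perform."""
    tasks = sorted(k for (a, k) in E_A if a == i)
    H = [[1 if row == k else 0 for k in tasks] for row in range(N)]
    return H, tasks

def coupling_lhs(E_A, x, N):
    """Return sum_i H_i x_i, where x[i] stacks x_{i kappa} over admissible tasks."""
    total = [0] * N
    for i in range(N):
        H, _ = build_H(i, E_A, N)
        for row in range(N):
            total[row] += sum(h * xv for h, xv in zip(H[row], x[i]))
    return total

N = 3
E_A = {(0, 0), (0, 1), (1, 1), (1, 2), (2, 0), (2, 2)}
# Agent 0 takes task 0, agent 1 takes task 1, agent 2 takes task 2.
x = {0: [1, 0], 1: [1, 0], 2: [0, 1]}
lhs = coupling_lhs(E_A, x, N)
print(lhs)
```

Each row of the coupling constraint corresponds to one task, so the check above confirms that every task is assigned exactly once, while the local sets $X_{i}$ make sure every agent picks exactly one task.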

1.3.6 Cooperative Distributed Model Predictive Control

Model Predictive Control (MPC) is a widely studied technique in the control community, and is also used in distributed contexts. The goal is to design an optimization-based feedback control law for a (spatially distributed) network of dynamical systems. The leading idea is the principle of receding horizon control, which informally speaking consists of solving at each time step an optimization problem (usually termed optimal control problem), in which the system model is used to predict the system trajectory over a fixed time window. After an optimal solution of the optimal control problem is found, the input associated to the current time instant is applied and the process is repeated (for a survey on MPC methods, see, e.g., [15]).

Now, we describe a typical distributed MPC framework applied to a network of linear systems with linear coupling constraints. Formally, assume we have $N$ discrete-time linear dynamical systems with independent dynamics of the form $\mathbf{z}_{i}(s+1)=A_{i}\mathbf{z}_{i}(s)+B_{i}\mathbf{u}_{i}(s)$, where $s\in\mathbb{Z}$ represents time; $\mathbf{z}_{i}(s)\in\mathbb{R}^{q_{i}}$ is the system state at time $s$; $\mathbf{u}_{i}(s)\in\mathbb{R}^{m_{i}}$ is the input fed to the system at time $s$; and $A_{i},B_{i}$ are given matrices of appropriate dimensions, for all $i\in\{1,\ldots,N\}$. We suppose that the states and the inputs must satisfy local constraints $\mathbf{z}_{i}(s)\in Z_{i}$ and $\mathbf{u}_{i}(s)\in U_{i}$ for all $i\in\{1,\ldots,N\}$, and that the agents’ states are coupled to each other by means of coupling constraints of the form $\sum_{i=1}^{N}H_{i}\mathbf{z}_{i}(s)\leq h$, for a given $h\in\mathbb{R}^{P}$. Given the initial conditions of the systems $\mathbf{z}_{1}^{0},\ldots,\mathbf{z}_{N}^{0}$, the optimal control problem to be solved is

[TABLE]

where $S$ is the prediction horizon, $\mathbf{z}_{i}=[\mathbf{z}_{i}(0)^{\top},\ldots,\mathbf{z}_{i}(S)^{\top}]^{\top}$ and $\mathbf{u}_{i}=[\mathbf{u}_{i}(0)^{\top},\ldots,\mathbf{u}_{i}(S-1)^{\top}]^{\top}$ are the optimization vectors, $\ell_{i}:\mathbb{R}^{q_{i}+m_{i}}\rightarrow\mathbb{R}$ is the stage cost and $V_{i}:\mathbb{R}^{q_{i}}\rightarrow\mathbb{R}$ is the terminal cost, for all $i\in\{1,\ldots,N\}$. Problem (1.12) can be fit into the constraint-coupled set-up (1.3) by setting

[TABLE]

for all $i\in\{1,\ldots,N\}$ , with the local optimization variables being $\mathbf{x}_{i}=\big{[}\mathbf{z}_{i}^{\top},\mathbf{u}_{i}^{\top}\big{]}^{\top}$ and the local constraint set $X_{i}$ being

[TABLE]

for all $i\in\{1,\ldots,N\}$ .

Next, we describe an example of a microgrid control scenario that fits into our distributed optimization framework. A microgrid consists of several generators, controllable loads, storage devices and a connection to the main grid. In the following, we use the notational convention that energy generation corresponds to positive variables, while energy consumption corresponds to negative variables.

Generators are collected in the set GEN. At each time instant $s$ in a given horizon $[0,S]$, they generate power, denoted by $p_{\texttt{gen},i}^{s}$, that must satisfy magnitude and rate bounds, i.e., for given positive scalars $\underline{p}$, $\bar{p}$, $\underline{r}$ and $\bar{r}$, it must hold, for all $i\in\texttt{GEN}$, $\underline{p}\leq p_{\texttt{gen},i}^{s}\leq\bar{p}$, with $s\in[0,S]$, and $\underline{r}\leq p_{\texttt{gen},i}^{s+1}-p_{\texttt{gen},i}^{s}\leq\bar{r}$, with $s\in[0,S-1]$. The cost to produce power by a generator is modeled as a quadratic function $f_{\texttt{gen},i}^{s}=\alpha_{1}p_{\texttt{gen},i}^{s}+\alpha_{2}(p_{\texttt{gen},i}^{s})^{2}$ with $\alpha_{1}$ and $\alpha_{2}$ positive scalars.

Storage devices are collected in STOR and their power is denoted by $p_{\texttt{stor},i}^{s}$ and satisfies bounds and a dynamical constraint given by $-d_{\texttt{stor}}\leq p_{\texttt{stor},i}^{s}\leq c_{\texttt{stor}}$, $s\in[0,S]$, $q_{\texttt{stor},i}^{s+1}=q_{\texttt{stor},i}^{s}+p_{\texttt{stor},i}^{s}$, $s\in[0,S-1]$, and $0\leq q_{\texttt{stor},i}^{s}\leq q_{\text{max}}$, $s\in[0,S]$, where the initial capacity $q_{\texttt{stor},i}^{0}$ is given and $d_{\texttt{stor}}$, $c_{\texttt{stor}}$ and $q_{\text{max}}$ are positive scalars. There are no costs associated with the stored power.

Controllable loads are collected in CONL and their power is denoted by $p_{\texttt{conl},i}^{s}$. The power must satisfy box constraints, i.e., $-P\leq p^{s}_{\texttt{conl},i}\leq P$, $s\in[0,S]$. A desired load profile $p_{\texttt{des},i}^{s}$ for $p_{\texttt{conl},i}^{s}$ is given and the controllable load incurs a cost $f_{\texttt{conl},i}^{s}=\beta\max\{0,p_{\texttt{des},i}^{s}-p_{\texttt{conl},i}^{s}\}$, $\beta\geq 0$, if the desired load is not satisfied.

Finally, the device $i=N$ is the connection node with the main grid; its power is denoted as $p_{\texttt{tr}}^{s}$ and must satisfy $|p_{\texttt{tr}}^{s}|\leq E$, $s\in[0,S]$. The power-trading cost is modeled as $f_{\texttt{tr}}^{s}=-c_{1}p_{\texttt{tr}}^{s}+c_{2}|p_{\texttt{tr}}^{s}|$, with $c_{1}$ and $c_{2}$ positive scalars corresponding to the price and a general transaction cost respectively.
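The device models above translate directly into code. The following Python sketch implements the three stage costs and a feasibility check for a storage trajectory; the function names and the numerical values used in the example calls are hypothetical.

```python
def gen_cost(p_gen, alpha1, alpha2):
    """Quadratic generation cost alpha1 * p + alpha2 * p^2."""
    return alpha1 * p_gen + alpha2 * p_gen ** 2

def conl_cost(p_conl, p_des, beta):
    """Cost beta * max{0, p_des - p_conl} of a controllable load."""
    return beta * max(0.0, p_des - p_conl)

def trade_cost(p_tr, c1, c2):
    """Power-trading cost -c1 * p + c2 * |p| at the connection node."""
    return -c1 * p_tr + c2 * abs(p_tr)

def storage_feasible(p_stor, q0, d_stor, c_stor, q_max):
    """Check the power bounds for s = 0..S and the capacity bounds along
    the recursion q^{s+1} = q^s + p^s, s = 0..S-1 (p_stor has S+1 entries)."""
    if any(p < -d_stor or p > c_stor for p in p_stor):
        return False
    q = q0
    for p in p_stor[:-1]:
        if not 0.0 <= q <= q_max:
            return False
        q += p
    return 0.0 <= q <= q_max

print(gen_cost(2.0, 1.0, 0.5), conl_cost(-3.0, -2.0, 2.0), trade_cost(1.0, 3.0, 0.5))
print(storage_feasible([1.0, -1.0, 0.0], 0.0, 1.0, 1.0, 1.5))
```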

The power network must provide at least a given power demand $D^{s}$ , which can be modeled by a coupling constraint among the units

[TABLE]

for all $s\in[0,S]$ . Reasonably, we assume $D^{s}$ to be known only by the connection node tr.

Notice that the microgrid control problem can be cast in the constraint-coupled form (1.3). To this end, we let each $\mathbf{x}_{i}$ be the whole trajectory over the prediction horizon $[0,S]$ , i.e.,

[TABLE]

for all the generators $i\in\texttt{GEN}$ and, consistently, for the other device types. As for the cost functions, we define

[TABLE]

for all the generators $i\in\texttt{GEN}$ and, consistently, for the other device types. The local constraint sets $X_{i}$ are given by

[TABLE]

for all the generators $i\in\texttt{GEN}$ and, consistently, for the other device types. The coupling constraints are as in (1.13).

This problem is suited, e.g., for the application of duality-based methods for constraint-coupled problems (cf. Section 3.4) and a numerical example is shown in Section 3.6.

Chapter 2 Consensus-Based Primal Methods

In this chapter we focus on primal approaches to design distributed algorithms for cost-coupled problems. We start by describing the so-called distributed subgradient method, based on a combination of the average consensus protocol with the subgradient method. Then, we present a recent improvement of this consensus-based scheme, named gradient tracking, that relies on the idea of tracking the gradient of the global cost function via a dynamic consensus scheme. Next, we provide extensions to the basic schemes. Finally, we show a numerical example to compare the two presented algorithms.

2.1 Distributed Subgradient Method

In this section we review the distributed subgradient method that has been proposed in the pioneering works [16, 17] (see also the tutorial papers [2, 3, 4]). In this survey, we report a proof based on the analysis proposed in the references above.

As already described in Section 1.2.1, we consider a network of $N$ agents that aim to cooperatively solve the cost-coupled problem

$$\min_{\mathbf{x}\in\mathbb{R}^{d}}\ \sum_{i=1}^{N}f_{i}(\mathbf{x}), \tag{2.1}$$

where each cost function $f_{i}:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is known by agent $i$ only, for all $i\in\{1,\ldots,N\}$.

A natural way to devise a distributed algorithm for problem (2.1) is to study how it would be solved through a centralized gradient-based approach. We recall that a subgradient method applied to (2.1) consists of an iterative procedure in which the current solution estimate, denoted by $\mathbf{x}^{t}$ , is updated according to

$$\mathbf{x}^{t+1}=\mathbf{x}^{t}-\gamma^{t}\sum_{i=1}^{N}\widetilde{\nabla}f_{i}(\mathbf{x}^{t}),$$

where $\gamma^{t}$ is the step-size and $\sum_{i=1}^{N}\widetilde{\nabla}f_{i}(\mathbf{x}^{t})$ is a subgradient of the cost function at $\mathbf{x}^{t}$. The initial value $\mathbf{x}^{0}$ can be set to any element of $\mathbb{R}^{d}$.

Next, we introduce the distributed subgradient algorithm proposed in [16, 17]. Each agent $i$ maintains its own estimate $\mathbf{x}_{i}^{t}$ of the decision variable $\mathbf{x}$, initialized to any value in $\mathbb{R}^{d}$ and iteratively updated until it eventually converges to an optimal solution of (2.1). The distributed subgradient algorithm is based on the combination of a consensus protocol (cf. Appendix B) with the subgradient optimization method (cf. Appendix A.1) to move each local solution estimate toward an optimal (common) solution of problem (2.1). Algorithm 1 summarizes the distributed subgradient algorithm from the perspective of node $i$.

For presentation purposes, in this survey we consider a simplified network configuration, so that the core idea of the scheme can be easily grasped. That is, the network is modeled as a fixed, connected and undirected graph $\mathcal{G}=(\{1,\ldots,N\},\mathcal{E})$. The weights $a_{ij}$ in (2.2a) inherit the typical assumptions on consensus protocols, formally reported next.

Assumption 6.

Let the weights $a_{ij}$, $i,j\in\{1,\ldots,N\}$, be nonnegative entries of $A\in\mathbb{R}^{N\times N}$ that match the graph $\mathcal{G}$, i.e., $a_{ij}\neq 0$ for all $(i,j)\in\mathcal{E}$ and $a_{ij}=0$ otherwise. Moreover, they satisfy

• $\sum_{j=1}^{N}a_{ij}=1$, for all $i\in\{1,\ldots,N\}$;

• $\sum_{i=1}^{N}a_{ij}=1$, for all $j\in\{1,\ldots,N\}$;

• $a_{ii}>0$, for all $i\in\{1,\ldots,N\}$. $\square$
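One common way to obtain weights with these properties on an undirected graph is the Metropolis-Hastings rule. This rule is not prescribed by the survey and is used here only as an illustrative assumption: $a_{ij}=1/(1+\max\{d_{i},d_{j}\})$ for neighboring nodes, where $d_{i}$ is the degree of node $i$, and $a_{ii}$ collects the remaining mass. A minimal Python sketch with hypothetical names:

```python
def metropolis_weights(N, edges):
    """Doubly stochastic weights for an undirected graph (Metropolis-Hastings rule)."""
    nbrs = {i: set() for i in range(N)}
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
    A = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in nbrs[i]:
            A[i][j] = 1.0 / (1 + max(len(nbrs[i]), len(nbrs[j])))
        A[i][i] = 1.0 - sum(A[i])  # self-weight takes the remaining mass
    return A

A = metropolis_weights(4, [(0, 1), (1, 2), (2, 3)])  # path graph on 4 nodes
for row in A:
    print(row)
```

The resulting matrix is symmetric, so row and column sums coincide, and every self-weight is strictly positive because the off-diagonal entries of row $i$ sum to at most $d_{i}/(1+d_{i})$.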

We point out that one may also consider strongly connected directed graphs that admit a doubly-stochastic weighted adjacency matrix. Detailed convergence analyses of distributed subgradient schemes have been provided, e.g., in [16, 17, 2, 4, 3]. For the sake of completeness, this survey provides a proof for the convergence of Algorithm 1 that is mainly inspired by the references above.

We start by stating the condition on the step-size $\gamma^{t}$ used in the update (2.2b). As in the standard (centralized) subgradient method, it must satisfy a diminishing property.

Assumption 7.

The step-size sequence $\{\gamma^{t}\}_{t\geq 0}$ , with $\gamma^{t}\geq 0$ , satisfies the conditions $\sum_{t=0}^{\infty}\gamma^{t}=\infty$ , $\sum_{t=0}^{\infty}\big{(}\gamma^{t}\big{)}^{2}<\infty$ . $\square$

As a consequence of the square summability in Assumption 7, the step-size vanishes as the algorithm proceeds, i.e., $\lim_{t\to\infty}\gamma^{t}=0$.
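The following Python sketch illustrates the scheme on a scalar toy problem. It assumes the two-step update that appears in the convergence proof, namely a consensus step $\mathbf{v}_{i}^{t+1}=\sum_{j}a_{ij}\mathbf{x}_{j}^{t}$ followed by a local subgradient step $\mathbf{x}_{i}^{t+1}=\mathbf{v}_{i}^{t+1}-\gamma^{t}\widetilde{\nabla}f_{i}(\mathbf{v}_{i}^{t+1})$, with the step-size $\gamma^{t}=1/(t+1)$ as one choice compatible with Assumption 7. The costs $f_{i}(x)=|x-c_{i}|$, the weights and all names are illustrative assumptions.

```python
def distributed_subgradient(A, subgrads, x0, num_iters):
    """Scalar sketch: consensus step followed by a local subgradient step."""
    N = len(x0)
    x = list(x0)
    for t in range(num_iters):
        gamma = 1.0 / (t + 1)  # diminishing: not summable, square summable
        v = [sum(A[i][j] * x[j] for j in range(N)) for i in range(N)]
        x = [v[i] - gamma * subgrads[i](v[i]) for i in range(N)]
    return x

def abs_subgradient(c):
    """A subgradient of f(x) = |x - c| (returns -1, 0 or 1)."""
    return lambda x: (x > c) - (x < c)

# Doubly stochastic weights on a complete graph with 3 nodes.
A = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]]
targets = [1.0, 2.0, 4.0]  # the sum of |x - c_i| is minimized at the median, 2
x_final = distributed_subgradient(
    A, [abs_subgradient(c) for c in targets], [0.0, 0.0, 0.0], 5000)
print(x_final)
```

The local costs have subgradients bounded by $1$, so the toy problem is consistent with the bounded-subgradient requirement used in the analysis. The estimates approach the minimizer slowly, which reflects the vanishing step-size.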

Next, we state regularity requirements for problem (2.1).

Assumption 8.

Let the following conditions hold:

(i) each $f_{i}:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is convex and has bounded subgradients, i.e., there exists a scalar $C_{i}>0$ such that $\|\widetilde{\nabla}f_{i}(\mathbf{x})\|\leq C_{i}$ for any subgradient $\widetilde{\nabla}f_{i}(\mathbf{x})$ of $f_{i}$ at any $\mathbf{x}\in\mathbb{R}^{d}$;

(ii) problem (2.1) has at least one optimal solution, i.e., the optimal solution set $X^{\star}=\{\mathbf{x}\in\mathbb{R}^{d}\mid f(\mathbf{x})=f^{\star}\}$ is nonempty, where $f^{\star}$ denotes the optimal value of problem (2.1). $\square$

Usually, in the analysis of consensus-based algorithms, it is useful to introduce the average of the quantities that are required to be asymptotically consensual. Here, we introduce the average of the current solution estimates, i.e., for all $t\geq 0$ we define

$$\bar{\mathbf{x}}^{t}\triangleq\frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_{i}^{t}.$$

We point out that $\bar{\mathbf{x}}^{t}\in\mathbb{R}^{d}$ has the same dimension as the local solution estimates $\mathbf{x}_{i}^{t}$ and is introduced only for the sake of analysis: no agent can compute it, nor does any agent need to know it. We observe that $\bar{\mathbf{x}}^{t}$ evolves according to its own dynamics, which can be obtained by combining the local updates of the agents. Formally, it holds

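$$\bar{\mathbf{x}}^{t+1}=\frac{1}{N}\sum_{i=1}^{N}\Big(\sum_{j=1}^{N}a_{ij}\mathbf{x}_{j}^{t}-\gamma^{t}\widetilde{\nabla}f_{i}(\mathbf{v}_{i}^{t+1})\Big)=\bar{\mathbf{x}}^{t}-\frac{\gamma^{t}}{N}\sum_{i=1}^{N}\widetilde{\nabla}f_{i}(\mathbf{v}_{i}^{t+1}),$$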

where we used the stochasticity of the weights $a_{ij}$, namely $\sum_{i=1}^{N}a_{ij}=1$ for all $j\in\{1,\ldots,N\}$ (cf. Assumption 6).

The following result (see [18] for a proof) is an important building block for the forthcoming proof of the convergence of Algorithm 1.

Lemma 9.

Let $\{Y^{t}\}_{t\geq 0}$, $\{W^{t}\}_{t\geq 0}$, and $\{Z^{t}\}_{t\geq 0}$ be three scalar sequences such that $W^{t}$ is nonnegative for all $t$. Assume the following

$$Y^{t+1}\leq Y^{t}-W^{t}+Z^{t},\quad t\geq 0,\qquad\text{and}\qquad\sum_{t=0}^{\infty}Z^{t}<\infty.$$

Then either $\lim_{t\to\infty}Y^{t}=-\infty$ or else $\{Y^{t}\}_{t\geq 0}$ converges to a finite value and $\sum_{t=0}^{\infty}W^{t}<\infty$. $\square$

The following theorem, also provided, e.g., in [16, 17, 2, 3, 4], formally states the convergence properties of Algorithm 1. For ease of notation, we consider a scalar optimization problem, i.e., $d=1$ .

Theorem 10.

Let Assumptions 6, 7 and 8 hold and let the communication graph be undirected and connected. Then, the sequences of local solution estimates $\{\mathbf{x}_{i}^{t}\}_{t\geq 0}$ , $i\in\{1,\ldots,N\}$ , generated by Algorithm 1, converge to a (common) solution of problem (2.1), i.e., for all $i\in\{1,\ldots,N\}$ , it holds

$$\lim_{t\to\infty}\|\mathbf{x}_{i}^{t}-\mathbf{x}^{\star}\|=0,$$

for some $\mathbf{x}^{\star}\in X^{\star}$ .

Proof.

The proof provided in this manuscript is mainly based on the ones given in [16, 17, 2, 3, 4]. It is based on showing the following three steps:

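1. asymptotic consensus of the local solution estimates to their average, i.e.,

$$\lim_{t\to\infty}\|\mathbf{x}_{i}^{t}-\bar{\mathbf{x}}^{t}\|=0,$$

for all $i\in\{1,\ldots,N\}$;

2. summability of the consensus error weighted by the step-size, i.e.,

$$\sum_{t=0}^{\infty}\gamma^{t}\|\mathbf{x}_{i}^{t}-\bar{\mathbf{x}}^{t}\|<\infty,\quad i\in\{1,\ldots,N\};$$

3. convergence of the average sequence $\{\bar{\mathbf{x}}^{t}\}_{t\geq 0}$ to an optimal solution of problem (2.1), i.e.,

$$\lim_{t\to\infty}\|\bar{\mathbf{x}}^{t}-\mathbf{x}^{\star}\|=0,$$

for some $\mathbf{x}^{\star}\in X^{\star}$.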

Let $\mathbf{x}^{t}$ be the vector stacking the local solution estimates $\mathbf{x}_{i}^{t}$ . Then, the consensus error evolution is given by

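$$\mathbf{x}^{t+1}-\bar{\mathbf{x}}^{t+1}\mathbf{1}=\Big(A-\frac{\mathbf{1}\mathbf{1}^{\top}}{N}\Big)\big(\mathbf{x}^{t}-\bar{\mathbf{x}}^{t}\mathbf{1}\big)+\Big(I-\frac{\mathbf{1}\mathbf{1}^{\top}}{N}\Big)\boldsymbol{\epsilon}^{t},$$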

where $\boldsymbol{\epsilon}^{t}$ denotes the vector stacking all the $\boldsymbol{\epsilon}_{i}^{t}$ with the short-hand $\boldsymbol{\epsilon}_{i}^{t}$ for $-\gamma^{t}\widetilde{\nabla}f_{i}(\mathbf{v}_{i}^{t+1})$ , for all $i\in\{1,\ldots,N\}$ .

Taking the norm of both sides in the last equation and applying the triangle inequality leads to

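$$\|\mathbf{x}^{t+1}-\bar{\mathbf{x}}^{t+1}\mathbf{1}\|\leq\sigma_{A}\|\mathbf{x}^{t}-\bar{\mathbf{x}}^{t}\mathbf{1}\|+\|\boldsymbol{\epsilon}^{t}\|,$$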

where we used the sub-multiplicative property of the $2$ -norm, we set $\sigma_{A}=\|A-\mathbf{1}\mathbf{1}^{\top}/N\|$ (i.e., the contraction factor associated to the matrix $A$ , cf. Appendix B.1), and we used the bound $\|I-\mathbf{1}\mathbf{1}^{\top}/N\|\leq 1$ .
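For a given weight matrix, the contraction factor $\sigma_{A}$ can be estimated numerically. The Python sketch below uses power iteration on $M^{\top}M$ with $M=A-\mathbf{1}\mathbf{1}^{\top}/N$, a generic numerical technique that is not part of the survey; the start vector is arbitrary and is assumed not to be orthogonal to the dominant singular direction. The matrix and names are illustrative.

```python
import math

def contraction_factor(A, num_iters=200):
    """Estimate sigma_A = ||A - 1 1^T / N||_2 by power iteration on M^T M."""
    N = len(A)
    M = [[A[i][j] - 1.0 / N for j in range(N)] for i in range(N)]
    z = [float(i + 1) for i in range(N)]  # arbitrary nonzero start vector
    sigma = 0.0
    for _ in range(num_iters):
        norm_z = math.sqrt(sum(zk * zk for zk in z))
        z = [zk / norm_z for zk in z]
        Mz = [sum(M[i][j] * z[j] for j in range(N)) for i in range(N)]
        w = [sum(M[i][j] * Mz[i] for i in range(N)) for j in range(N)]  # M^T (M z)
        norm_w = math.sqrt(sum(wk * wk for wk in w))
        if norm_w == 0.0:
            return 0.0
        sigma = math.sqrt(norm_w)  # ||M^T M z|| approaches sigma_A^2 for unit z
        z = w
    return sigma

A = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]]
sigma_A = contraction_factor(A)
print(sigma_A)
```

For this matrix the eigenvalues of $A$ are $1$, $0.25$ and $0.25$, so removing the consensus direction leaves $\sigma_{A}=0.25$, strictly inside $(0,1)$ as required in the proof.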

By using the explicit solution for the free evolution and the forced evolution of a linear time-invariant system, the term $\|\mathbf{x}^{t}-\bar{\mathbf{x}}^{t}\mathbf{1}\|$ can be bounded as follows

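$$\|\mathbf{x}^{t}-\bar{\mathbf{x}}^{t}\mathbf{1}\|\leq\sigma_{A}^{t}\|\mathbf{x}^{0}-\bar{\mathbf{x}}^{0}\mathbf{1}\|+\sum_{\tau=0}^{t-1}\sigma_{A}^{t-1-\tau}\|\boldsymbol{\epsilon}^{\tau}\|.$$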

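Since by assumption $\|\boldsymbol{\epsilon}^{\tau}\|\to 0$ (cf. Assumptions 7 and 8(i)) and $\sigma_{A}\in(0,1)$, it can be proven that

$$\lim_{t\to\infty}\sum_{\tau=0}^{t-1}\sigma_{A}^{t-1-\tau}\|\boldsymbol{\epsilon}^{\tau}\|=0,$$

which, in turn, implies that

$$\lim_{t\to\infty}\|\mathbf{x}^{t}-\bar{\mathbf{x}}^{t}\mathbf{1}\|=0.$$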

Next we show the summability condition. It holds

[TABLE]

for some finite $\kappa$, where in (a) we rearranged terms; in (b) the first term is bounded due to geometric series properties (cf. Assumption 7 and recall $\sigma_{A}\in(0,1)$), while the second one can be shown to be bounded by using Young’s inequality (for all $a\geq 0$ and $b\geq 0$, it holds $2ab\leq a^{2}+b^{2}$) to write

[TABLE]

and, then, exploiting subgradient boundedness (cf. Assumption 8), geometric series properties (recall $\sigma_{A}\in(0,1)$ ) and the step-size properties (cf. square summability of $\gamma^{t}$ in Assumption 7).

Finally, we study convergence to the optimum by showing that a proper candidate function, say $V^{t}$ , decreases along the algorithmic evolution. Let $V^{t}$ be a measure of the distance between the local solution estimates $\mathbf{x}_{i}^{t}$ , $i\in\{1,\ldots,N\}$ , and an optimal solution to problem (2.1), i.e.,

$$V^{t}\triangleq\sum_{i=1}^{N}\|\mathbf{x}_{i}^{t}-\mathbf{x}^{\star}\|^{2}, \tag{2.12}$$

where $\mathbf{x}^{\star}\in X^{\star}$ . Due to convexity of each $f_{i}$ and subgradient boundedness (cf. Assumption 8), it follows that

[TABLE]

where in (a) we exploited convexity of the square norm $\|\cdot\|^{2}$ and weights properties (cf. Assumption 6) to write $\sum_{i=1}^{N}\|\mathbf{v}_{i}^{t+1}-\mathbf{x}^{\star}\|^{2}=\sum_{i=1}^{N}\|\sum_{j\in\mathcal{N}_{i}}a_{ij}(\mathbf{x}_{j}^{t}-\mathbf{x}^{\star})\|^{2}\leq\sum_{j=1}^{N}(\sum_{i=1}^{N}a_{ij})\|\mathbf{x}_{j}^{t}-\mathbf{x}^{\star}\|^{2}$; subgradient boundedness (cf. Assumption 8); and the subgradient definition (cf. Appendix A.2). Compactly, it holds

[TABLE]

where $C^{2}\triangleq\sum_{i=1}^{N}C_{i}^{2}$ . Adding and subtracting $2\,\gamma^{t}\sum_{i=1}^{N}f_{i}(\bar{\mathbf{x}}^{t+1})$ yields

[TABLE]

where in (a) we used subgradient boundedness to write

[TABLE]

Notice that $\sum_{i=1}^{N}(f_{i}(\bar{\mathbf{x}}^{t+1})-f_{i}(\mathbf{x}^{\star}))\geq 0$ since $\mathbf{x}^{\star}$ is a minimum of (2.1). Using Lemma 9 we can conclude that:

•

the sequence $\{V^{t}\}_{t\geq 0}$ converges to a finite value, say $\bar{V}$, for every $\mathbf{x}^{\star}\in X^{\star}$, and

•

the average sequence $\{\bar{\mathbf{x}}^{t}\}_{t\geq 0}$ satisfies

[TABLE]

Since the sequence $\{V^{t}\}_{t\geq 0}$ (cf. its definition in (2.12)) converges, then also the sequence $\{\sum_{i=1}^{N}\|\mathbf{x}_{i}^{t}-\mathbf{x}^{\star}\|\}_{t\geq 0}$ converges, for every $\mathbf{x}^{\star}\in X^{\star}$. Moreover, recall that by consensus achievement (2.10), it holds $\lim_{t\to\infty}\|\mathbf{x}_{i}^{t}-\bar{\mathbf{x}}^{t}\|=0$. Therefore, also $\{\|\bar{\mathbf{x}}^{t}-\mathbf{x}^{\star}\|\}_{t\geq 0}$ must converge.

In view of (2.15) and of continuity of $f$ (due to its convexity), one of the limit points of $\{\bar{\mathbf{x}}^{t}\}_{t\geq 0}$ must belong to $X^{\star}$; thus, consider a subsequence $\{\bar{\mathbf{x}}^{t_{k}}\}_{k\geq 0}$ of $\{\bar{\mathbf{x}}^{t}\}_{t\geq 0}$ converging to an optimum, i.e., such that $\lim_{k\to\infty}\|\bar{\mathbf{x}}^{t_{k}}-\mathbf{x}^{\infty}\|=0$, with $\mathbf{x}^{\infty}\in X^{\star}$. Convergence of $\bar{\mathbf{x}}^{t_{k}}$ together with the asymptotic consensus property (cf. eq. (2.10)) implies that also $\lim_{k\to\infty}\|\mathbf{x}_{i}^{t_{k}}-\mathbf{x}^{\infty}\|=0$, for all $i\in\{1,\ldots,N\}$. Since $V^{t}=\sum_{i=1}^{N}\|\mathbf{x}_{i}^{t}-\mathbf{x}^{\star}\|^{2}$ converges for every $\mathbf{x}^{\star}\in X^{\star}$, and in particular for $\mathbf{x}^{\star}=\mathbf{x}^{\infty}$, where it vanishes along the subsequence, the (entire) sequence $\{\mathbf{x}_{i}^{t}\}_{t\geq 0}$ converges to $\mathbf{x}^{\infty}\in X^{\star}$. ∎

It is worth mentioning that convergence of the distributed subgradient algorithm to an optimum can only be guaranteed with a diminishing step-size. This is mainly due to the fact that at each iteration, each agent $i$ considers an update direction depending only on its local objective function $f_{i}$ , rather than on the entire cost function $\sum_{i=1}^{N}f_{i}$ .

Convergence rates have been proven for the distributed subgradient method and its variants. In [19], a convergence rate of $\mathcal{O}(\ln{t}/\sqrt{t})$ is proved for an extension of the distributed subgradient algorithm for directed graphs.

2.2 Gradient Tracking Algorithm

In this section we review a recent method for cost-coupled problems (cf. Section 1.2.1) that exhibits a faster convergence rate because it allows for the use of a constant step-size. The underlying idea of this novel approach is to implement a distributed consensus-based mechanism to track the gradient of the whole cost function. Thanks to this tracking mechanism, a linear convergence rate has been shown for this scheme, matching the rate of the centralized gradient method.

Formally, we consider a cost-coupled problem in the form (2.1), where the cost functions $f_{i}$ satisfy suitable regularity properties that will be specified next.

In order to understand the concept underlying the gradient tracking algorithm, let us consider the (centralized) gradient method applied to (2.1). If we denote by $\mathbf{x}^{t}$ the (centralized) solution estimate, the method reads

$$\mathbf{x}^{t+1}=\mathbf{x}^{t}-\gamma\sum_{h=1}^{N}\nabla f_{h}(\mathbf{x}^{t}). \tag{2.16}$$

In a distributed context, each agent $i$ has its own version $\mathbf{x}_{i}^{t}$ of the current solution estimate $\mathbf{x}^{t}$ . Thus, the gradient scheme (2.16) can be adapted as follows

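$$\mathbf{x}_{i}^{t+1}=\sum_{j\in\mathcal{N}_{i}}a_{ij}\mathbf{x}_{j}^{t}-\gamma\sum_{h=1}^{N}\nabla f_{h}(\mathbf{x}_{h}^{t}),$$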

where the consensus iteration $\sum_{j\in\mathcal{N}_{i}}a_{ij}\mathbf{x}_{j}^{t}$ is meant to enforce an agreement among the agents. However, still the descent direction $\sum_{h=1}^{N}\nabla f_{h}(\mathbf{x}_{h}^{t})$ requires a global knowledge that is not locally available. To overcome this issue, the exact (centralized) descent direction is replaced by a local descent direction, say $\mathbf{y}_{i}^{t}$, which is updated through a dynamic average consensus iteration to eventually track $\sum_{h=1}^{N}\nabla f_{h}(\mathbf{x}_{h}^{t})$. Informally, the dynamic average consensus is a distributed algorithm in which each agent $i$ has access only to its local (possibly time-varying) signal, say $\mathbf{r}_{i}^{t}$, and wants to track the (time-varying) average signal $1/N\cdot\sum_{h=1}^{N}\mathbf{r}_{h}^{t}$ by exchanging information only with neighbors. See Appendix B.3 for further details. In the context of gradient tracking, each agent’s signal is the local gradient at the current estimate, i.e., $\mathbf{r}_{i}^{t}=\nabla f_{i}(\mathbf{x}_{i}^{t})$. The following table (Algorithm 2) formally summarizes the gradient tracking algorithm from the perspective of agent $i$, where eq. (2.18) describes the dynamic average consensus iteration for the tracking of $\sum_{h=1}^{N}\nabla f_{h}(\mathbf{x}_{h}^{t})$, the local solution estimate $\mathbf{x}_{i}^{t}$ is initialized to any vector in $\mathbb{R}^{d}$ and the gradient tracker $\mathbf{y}_{i}^{t}$ is initialized to $\nabla f_{i}(\mathbf{x}_{i}^{0})$.
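A minimal Python sketch of the scheme on a scalar toy problem is given next. It assumes the commonly used form of the updates, $\mathbf{x}_{i}^{t+1}=\sum_{j}a_{ij}\mathbf{x}_{j}^{t}-\gamma\,\mathbf{y}_{i}^{t}$ and $\mathbf{y}_{i}^{t+1}=\sum_{j}a_{ij}\mathbf{y}_{j}^{t}+\nabla f_{i}(\mathbf{x}_{i}^{t+1})-\nabla f_{i}(\mathbf{x}_{i}^{t})$, which may differ in details from Algorithm 2. With the initialization $\mathbf{y}_{i}^{0}=\nabla f_{i}(\mathbf{x}_{i}^{0})$, the trackers in this sketch follow the average of the local gradients, which differs from their sum only by the factor $N$ and can be absorbed into the step-size. The quadratic costs, weights, step-size and names are illustrative assumptions.

```python
def gradient_tracking(A, grads, x0, gamma, num_iters):
    """Scalar sketch: consensus plus descent along a tracked gradient."""
    N = len(x0)
    x = list(x0)
    y = [grads[i](x[i]) for i in range(N)]  # y_i^0 = grad f_i(x_i^0)
    for _ in range(num_iters):
        x_new = [sum(A[i][j] * x[j] for j in range(N)) - gamma * y[i]
                 for i in range(N)]
        # Dynamic average consensus on the local gradients.
        y = [sum(A[i][j] * y[j] for j in range(N))
             + grads[i](x_new[i]) - grads[i](x[i]) for i in range(N)]
        x = x_new
    return x

def quad_grad(q, c):
    """Gradient of the strongly convex cost f(x) = 0.5 * q * (x - c)^2."""
    return lambda x: q * (x - c)

A = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]]
q = [1.0, 2.0, 1.5]
c = [1.0, -1.0, 3.0]
x_final = gradient_tracking(
    A, [quad_grad(qi, ci) for qi, ci in zip(q, c)], [0.0, 0.0, 0.0],
    gamma=0.1, num_iters=2000)
x_star = sum(qi * ci for qi, ci in zip(q, c)) / sum(q)  # minimizer of the sum
print(x_final, x_star)
```

In contrast with the distributed subgradient sketch, the step-size here is constant, and the local estimates still reach the exact minimizer of the sum of the costs.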

Gradient tracking algorithms have been proposed with several names and versions in the literature, but with a common underlying idea. Early works [20, 21, 22] propose the novel idea of distributively tracking a Newton-Raphson direction by means of suitable average consensus ratios. In [23] the same approach has been extended to deal with directed, asynchronous networks with lossy communication. More recently, the idea of gradient tracking has been independently proposed by several research groups. In [24, 25] the authors consider constrained nonsmooth and nonconvex problems, while in [26, 27] strongly convex, unconstrained, smooth optimization problems are addressed with agent-specific stepsizes. Works [28, 29] extend the algorithms to (possibly) time-varying digraphs (still in a nonconvex setting). A convergence rate analysis of the scheme was later developed in [30, 31, 32, 33, 27], where [30, 31] consider time-varying (directed) graphs. Several other recent works investigate the same scheme under numerous variants, such as [34, 35, 36, 37].

In order to highlight the key tools needed for the analysis of this class of algorithms, in this survey we investigate a simplified scenario, which is characterized next.

Assumption 11.

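For all $i\in\{1,\ldots,N\}$ , each cost function $f_{i}:\mathbb{R}^{d}\rightarrow\mathbb{R}$ satisfies the following conditions

• it is $\alpha$ -strongly convex, i.e.,

$$\big(\nabla f_{i}(\mathbf{w})-\nabla f_{i}(\mathbf{z})\big)^{\top}(\mathbf{w}-\mathbf{z})\geq\alpha\|\mathbf{w}-\mathbf{z}\|^{2}$$

for all $\mathbf{w},\mathbf{z}\in\mathbb{R}^{d}$ and some $\alpha>0$ ;

• it has Lipschitz continuous gradient with constant $L>0$ , i.e.,

$$\|\nabla f_{i}(\mathbf{w})-\nabla f_{i}(\mathbf{z})\|\leq L\|\mathbf{w}-\mathbf{z}\|$$

for all $\mathbf{w},\mathbf{z}\in\mathbb{R}^{d}$ . $\square$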

Since each $f_{i}$ is strongly convex, their sum is strongly convex as well. Thus, under Assumption 11, problem (2.1) has a unique optimal solution, denoted by $\mathbf{x}^{\star}$ . Notice that $\alpha\leq L$ necessarily holds. We point out that one can also consider a more general case in which each $f_{i}$ has $L_{i}$ -Lipschitz continuous gradient. The results proved next still hold by setting in the analysis $L=\sum_{i=1}^{N}L_{i}$ .

Similarly to the distributed subgradient algorithm in Section 2.1, we consider a simple network scenario modeled as a fixed, connected and undirected graph $\mathcal{G}=(\{1,\ldots,N\},\mathcal{E})$ . We assume the weights $a_{ij}$ satisfy a double stochasticity property as formalized in Assumption 6.

The gradient tracking scheme has been proposed in [25, 29] with a diminishing step-size $\gamma^{t}$ . As in the distributed subgradient algorithm (cf. Section 2.1), this choice allows one to decouple the convergence analysis in two independent parts, i.e., consensus achievement and asymptotic convergence of the consensual value to the optimum. When a constant step-size $\gamma$ is used, as done in this survey, the proof cannot be split in two parts anymore, but consensus and optimality need to be handled simultaneously.

Since the gradient tracking algorithm is a consensus-based scheme, it is convenient to introduce average quantities of local agent variables. Namely, we define the average of the solution estimates and the average of the trackers as

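$$\bar{\mathbf{x}}^{t}=\frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_{i}^{t},\qquad\bar{\mathbf{y}}^{t}=\frac{1}{N}\sum_{i=1}^{N}\mathbf{y}_{i}^{t},$$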

for all $t\geq 0$ . Using simple algebraic manipulations, it can be shown that the average quantities evolve as the following linear dynamical system

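$$\bar{\mathbf{x}}^{t+1}=\bar{\mathbf{x}}^{t}-\gamma\,\bar{\mathbf{y}}^{t},\tag{2.19}$$

$$\bar{\mathbf{y}}^{t+1}=\bar{\mathbf{y}}^{t}+\frac{1}{N}\sum_{i=1}^{N}\big(\nabla f_{i}(\mathbf{x}_{i}^{t+1})-\nabla f_{i}(\mathbf{x}_{i}^{t})\big).\tag{2.20}$$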

By exploiting the (column) stochasticity of consensus weights (cf. Assumption 6) and the initialization of the trackers, i.e., $\mathbf{y}_{i}^{0}=\nabla f_{i}(\mathbf{x}_{i}^{0})$ , one can show that a conservation property for the tracker average $\bar{\mathbf{y}}^{t}$ holds. That is

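$$\bar{\mathbf{y}}^{t+1}-\frac{1}{N}\sum_{i=1}^{N}\nabla f_{i}(\mathbf{x}_{i}^{t+1})=\bar{\mathbf{y}}^{t}-\frac{1}{N}\sum_{i=1}^{N}\nabla f_{i}(\mathbf{x}_{i}^{t})=\cdots=\bar{\mathbf{y}}^{0}-\frac{1}{N}\sum_{i=1}^{N}\nabla f_{i}(\mathbf{x}_{i}^{0})=0,\tag{2.21}$$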

which implies $\bar{\mathbf{y}}^{t}=1/N\cdot\sum_{i=1}^{N}\nabla f_{i}(\mathbf{x}_{i}^{t})$ , for all $t\geq 0$ .
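The conservation property can also be checked numerically. The following sketch runs a few gradient tracking iterations, assuming the standard form of the updates, and records the gap between the tracker average and the average of the local gradients. The four-agent cycle, its weights, the local costs $f_{i}(x)=\log(1+e^{x-c_{i}})+x^{2}/2$ and the step-size are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4
# Doubly stochastic weights on a 4-cycle (hypothetical example).
A = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
c = rng.normal(size=N)

def grad(x):
    # Stacked gradients of f_i(x) = log(1 + exp(x - c_i)) + 0.5 * x^2.
    return 1.0 / (1.0 + np.exp(-(x - c))) + x

x = rng.normal(size=N)
y = grad(x)                 # initialization required by the conservation property
gaps = []
for _ in range(50):
    x_next = A @ x - 0.1 * y
    y = A @ y + grad(x_next) - grad(x)
    x = x_next
    # gap between the tracker average and the average of the local gradients
    gaps.append(abs(y.mean() - grad(x).mean()))
max_gap = max(gaps)
print(max_gap)
```

Up to rounding errors the gap stays at zero at every iteration, independently of how far the estimates are from consensus. Initializing the trackers to anything other than the local gradients would break this invariant.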

The analysis we propose is mainly a detailed version of the proof provided in [37] for the above simplified scenario. In addition, we consider a scalar optimization problem, i.e., we set $d=1$ .

The proof starts by characterizing the interconnection among the following quantities:

• consensus error $\|\mathbf{x}^{t}-\bar{\mathbf{x}}^{t}\mathbf{1}\|$ , where $\mathbf{x}^{t}$ stacks all the $\mathbf{x}_{i}^{t}$ ;

• gradient tracking error $\|\mathbf{y}^{t}-\bar{\mathbf{y}}^{t}\mathbf{1}\|$ , where $\mathbf{y}^{t}$ stacks all the $\mathbf{y}_{i}^{t}$ ;

• distance from optimality of the average $\|\bar{\mathbf{x}}^{t}-\mathbf{x}^{\star}\|$ , where $\mathbf{x}^{\star}$ is the optimal solution of problem (2.1).

We first recall a preliminary result which relies on Lipschitz continuity of the cost gradients.

Lemma 12.

Let $\nabla f(\mathbf{x}^{t})$ denote the vector stacking all the gradients $\nabla f_{i}(\mathbf{x}_{i}^{t})$ , $i\in\{1,\ldots,N\}$ . Under Assumptions 11 and 6, it holds that

[TABLE]

where $L$ is the Lipschitz constant of $\nabla f_{i}$ , $i\in\{1,\ldots,N\}$ . $\square$

The previous lemma can be easily shown by exploiting the basic algebraic property $\sum_{i=1}^{N}\|\theta_{i}\|_{2}\leq\sqrt{N}\|[\theta_{1},\ldots,\theta_{N}]^{\top}\|_{2}$ , which follows by concavity of the square root function.
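For completeness, the algebraic property can be derived in one line. The following is a sketch of the standard argument, which applies Jensen's inequality to the concave square root function.

```latex
\begin{align*}
\frac{1}{N}\sum_{i=1}^{N}\|\theta_{i}\|_{2}
  = \frac{1}{N}\sum_{i=1}^{N}\sqrt{\|\theta_{i}\|_{2}^{2}}
  \leq \sqrt{\frac{1}{N}\sum_{i=1}^{N}\|\theta_{i}\|_{2}^{2}}
  = \frac{1}{\sqrt{N}}\,\big\|[\theta_{1},\ldots,\theta_{N}]^{\top}\big\|_{2}.
\end{align*}
```

Multiplying both sides by $N$ gives $\sum_{i=1}^{N}\|\theta_{i}\|_{2}\leq\sqrt{N}\|[\theta_{1},\ldots,\theta_{N}]^{\top}\|_{2}$ .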

Next, we provide a list of intermediate results that will be used in the convergence theorem. They explicitly provide linear upper bounds for the three quantities introduced above. The following lemma characterizes the consensus error.

Lemma 13.

Under Assumption 11, it holds

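$$\|\mathbf{x}^{t+1}-\bar{\mathbf{x}}^{t+1}\mathbf{1}\|\leq\sigma_{A}\|\mathbf{x}^{t}-\bar{\mathbf{x}}^{t}\mathbf{1}\|+\gamma\|\mathbf{y}^{t}-\bar{\mathbf{y}}^{t}\mathbf{1}\|$$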

for all $t\geq 0$ , where $\sigma_{A}\in(0,1)$ .

Proof.

From (2.17) and (2.19), we can write

[TABLE]

where we used the triangle inequality and $\sigma_{A}$ is the contraction factor associated to the consensus matrix $A$ (cf. Appendix B.1). ∎
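The role of the contraction factor can be illustrated numerically. The sketch below assumes that $\sigma_{A}$ is the spectral norm of $A-\mathbf{1}\mathbf{1}^{\top}/N$ , which is our reading of the definition in Appendix B.1, and checks that one consensus step shrinks the distance from the average by at least that factor. The six-agent ring with uniform weights is a hypothetical example.

```python
import numpy as np

N = 6
A = np.zeros((N, N))
for i in range(N):
    A[i, [i, (i - 1) % N, (i + 1) % N]] = 1.0 / 3.0   # ring with uniform weights

avg = np.ones((N, N)) / N
sigma_A = np.linalg.norm(A - avg, 2)   # largest singular value of A - 11^T / N

rng = np.random.default_rng(1)
x = rng.normal(size=N)
# A is doubly stochastic, so the average of A @ x equals the average of x.
lhs = np.linalg.norm(A @ x - x.mean())
rhs = sigma_A * np.linalg.norm(x - x.mean())
print(sigma_A, lhs <= rhs + 1e-12)
```

For this ring the eigenvalues of $A$ are $1/3+(2/3)\cos(2\pi k/6)$ , so the computed value should be $2/3$ .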

Next, we bound the distance of the average $\bar{\mathbf{x}}^{t}$ from $\mathbf{x}^{\star}$ , optimal solution of problem (2.1).

Lemma 14.

Under Assumptions 6 and 11, it holds that

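$$\|\bar{\mathbf{x}}^{t+1}-\mathbf{x}^{\star}\|\leq\theta\|\bar{\mathbf{x}}^{t}-\mathbf{x}^{\star}\|+\frac{\gamma L}{\sqrt{N}}\|\mathbf{x}^{t}-\bar{\mathbf{x}}^{t}\mathbf{1}\|,$$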

where $\theta=\max(|1-\alpha\gamma/N|,|1-L\gamma/N|)$ , with $L$ and $\alpha$ being the Lipschitz constant of $\nabla f_{i}$ and the strong convexity parameter of $f_{i}$ , respectively, $i\in\{1,\ldots,N\}$ .

Proof.

Using (2.19), we can write

[TABLE]

where in (a) we added and subtracted $\gamma/N\cdot\sum_{i=1}^{N}\nabla f_{i}(\bar{\mathbf{x}}^{t})$ , in (b) we used the triangle inequality, in (c) we exploited the convergence rate result for a gradient iteration applied to a smooth and strongly convex function (see the footnote below), and (d) follows by the conservation property of the tracker (cf. (2.21)), the Lipschitz continuity of each $\nabla f_{i}$ and the algebraic property $\sum_{i=1}^{N}\|\xi_{i}\|_{2}\leq\sqrt{N}\|[\xi_{1},\ldots,\xi_{N}]^{\top}\|_{2}$ . ∎

Footnote 2: We recall that a (centralized) gradient iteration applied to the minimization of an $L_{\varphi}$ -smooth and $\alpha_{\varphi}$ -strongly convex function $\varphi(z)$ satisfies (for a sufficiently small $\gamma>0$ ) $\|z-\gamma\nabla\varphi(z)-z^{\star}\|\leq\theta_{\varphi}\|z-z^{\star}\|$ , where $\theta_{\varphi}=\max(|1-\alpha_{\varphi}\gamma|,|1-L_{\varphi}\gamma|)$ and $z^{\star}$ is the minimizer of $\varphi$ .
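The contraction property of the centralized gradient step used in (c) can be verified on a simple instance. The sketch below uses a hypothetical diagonal quadratic $\varphi(z)=\frac{1}{2}(z-z^{\star})^{\top}Q(z-z^{\star})$ , for which the smallest and largest eigenvalues of $Q$ play the role of $\alpha_{\varphi}$ and $L_{\varphi}$ , and compares the observed contraction with $\theta_{\varphi}$ over random points.

```python
import numpy as np

alpha_phi, L_phi = 0.5, 4.0
Q = np.diag([alpha_phi, 2.0, L_phi])      # Hessian of the quadratic test function
z_star = np.array([1.0, -2.0, 0.5])       # its minimizer
gamma = 0.3
theta = max(abs(1 - alpha_phi * gamma), abs(1 - L_phi * gamma))

rng = np.random.default_rng(0)
ratios = []
for _ in range(1000):
    z = z_star + rng.normal(size=3)
    z_next = z - gamma * (Q @ (z - z_star))          # one gradient step
    ratios.append(np.linalg.norm(z_next - z_star) / np.linalg.norm(z - z_star))
worst_ratio = max(ratios)
print(theta, worst_ratio)
```

For a quadratic the bound holds for any step-size, since the error is multiplied by $I-\gamma Q$ , whose eigenvalues lie between $1-L_{\varphi}\gamma$ and $1-\alpha_{\varphi}\gamma$ .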

Finally, we provide an upper bound for the tracking error.

Lemma 15.

Under Assumptions 6 and 11, it holds

[TABLE]

for all $t\geq 0$ , where $\sigma_{A}$ is the contraction factor associated to the consensus matrix $A$ , $I$ is the identity matrix and $L$ is the Lipschitz constant of $\nabla f_{i}$ , $i\in\{1,\ldots,N\}$ .

Proof.

Under Lipschitz continuity of $\nabla f$ , and using (2.20), it follows

[TABLE]

where in (a) we used (2.18) and (2.20), in (b) we rearranged the terms and we used the triangle inequality, in (c) we used the contraction property of the consensus matrix $A$ (cf. Appendix B.1) and the sub-multiplicativity of $2$ -norm and finally in (d) we used the fact that $\|I-\mathbf{1}\mathbf{1}^{\top}/N\|\leq 1$ and the Lipschitz continuity of $\nabla f$ together with the update law (2.17).

Next, we make further modifications on the terms as follows

[TABLE]

where in (a) we added and subtracted $\bar{\mathbf{x}}^{t}$ and we exploited row stochasticity of $A$ , and in (b) we used the sub-multiplicativity of $2$ -norm and the triangle inequality. Adding and subtracting $\bar{\mathbf{y}}^{t}$ and using the triangle inequality we can write $\|\mathbf{y}^{t}\|\leq\|\mathbf{y}^{t}-\bar{\mathbf{y}}^{t}\mathbf{1}\|+\|\bar{\mathbf{y}}^{t}\mathbf{1}\|$ , which plugged into the last equation yields

[TABLE]

Finally, let us manipulate the last term in (2.24) as

[TABLE]

where in (a) we added $\sum_{i=1}^{N}\nabla f_{i}(\mathbf{x}^{\star})=0$ (which holds by optimality of $\mathbf{x}^{\star}$ ), in (b) we exploited the Lipschitz continuity of each $\nabla f_{i}$ , in (c) we used the algebraic property $\sum_{i=1}^{N}\|\xi_{i}\|_{2}\leq\sqrt{N}\|[\xi_{1},\ldots,\xi_{N}]^{\top}\|_{2}$ , and in (d) we added and subtracted $\bar{\mathbf{x}}^{t}\mathbf{1}$ and used the triangle inequality. Combining (2.24) with (2.25) the proof follows. ∎

The following theorem states the convergence result for Algorithm 2.

Theorem 16.

Let Assumptions 6 and 11 hold and let the communication graph be undirected and connected. Then, there exists a constant $\bar{\gamma}\in(0,N/L)$ such that for all $\gamma\in(0,\bar{\gamma})$ the sequences of local solution estimates $\{\mathbf{x}_{i}^{t}\}_{t\geq 0}$ , $i\in\{1,\ldots,N\}$ , generated by Algorithm 2 are asymptotically consensual to the optimal solution $\mathbf{x}^{\star}$ of problem (2.1), i.e.,

$$\lim_{t\to\infty}\|\mathbf{x}_{i}^{t}-\mathbf{x}^{\star}\|=0,$$

for all $i\in\{1,\ldots,N\}$ . Moreover, the convergence rate is linear (see the footnote below).

Footnote 3: A (convergent) sequence $\{\mathbf{z}^{t}\}_{t\geq 0}$ is said to converge linearly (or geometrically) to $\mathbf{z}^{\star}$ if there exists $\rho\in(0,1)$ such that $\|\mathbf{z}^{t+1}-\mathbf{z}^{\star}\|\leq\rho\|\mathbf{z}^{t}-\mathbf{z}^{\star}\|$ , for all $t\geq 0$ .

Proof.

The proof is based on showing a (strict) contraction property along the algorithmic evolution. Let us introduce the following vector

[TABLE]

Then, combining the results given in Lemmas 13, 14 and 15 it holds

[TABLE]

where the matrix $J(\gamma)$ is defined as

[TABLE]

Recall that $\theta=\max(|1-\alpha\gamma/N|,|1-L\gamma/N|)$ . Since $\alpha\leq L$ and $\gamma\leq N/L$ , it follows that $\theta=1-\alpha\gamma/N$ . Thus, we can express $J(\gamma)$ as the sum of two structured matrices as follows

[TABLE]

Since $\sigma_{A}<1$ and due to the triangular structure of the left matrix, we can conclude that it has spectral radius equal to $1$ . Since the eigenvalues of a matrix are a continuous function of its entries, we can use a continuity argument to assert that for sufficiently small positive $\gamma$ the spectral radius of $J(\gamma)$ becomes strictly less than $1$ (see [37, Theorem 1] for a more comprehensive discussion). Hence, we have $\mathbf{v}^{t+1}\leq\rho\,\mathbf{v}^{t}$ with $\rho\in(0,1)$ . Thus, $\|\mathbf{v}^{t}-[0,0,0]^{\top}\|\to 0$ as $t\to\infty$ with linear rate, and the proof follows. ∎

2.3 Variants and Extensions of the Basic Gradient Tracking

Several extensions of the gradient tracking scheme (described in Section 2.2) have been proposed in the literature. We present some of them following a purely conceptual flow rather than their historical development.

A first enhancement deals with optimization problems including both composite cost functions (i.e., with regularizers) and a common convex constraint. The main idea is to compute a feasible descent direction rather than a pure descent direction. Thus, let us consider a constrained cost-coupled optimization problem

[TABLE]

with $r$ being a convex regularizer and $X$ a convex constraint set. The modified algorithm reads as follows

[TABLE]

where $\tau>0$ and $\beta\in(0,1]$ are parameters to be suitably tuned. Notice that $N\mathbf{y}_{i}^{t}$ represents a local estimate of $\sum_{j=1}^{N}\nabla f_{j}(\mathbf{x}_{j}^{t})$ that is used to build a linear approximation of $\sum_{j=1}^{N}f_{j}(\mathbf{x}_{j}^{t})$ about the current iterate. Moreover, notice that $\Delta\mathbf{x}_{i}^{t}\in X$ , so that, provided that $\mathbf{x}_{j}^{t}\in X$ , then $\mathbf{x}_{i}^{t+1}$ stays feasible. This constrained version of the gradient tracking has been proposed and analyzed in [24, 25, 28, 29, 38, 5]. We notice that in these works, the authors consider a more general nonconvex optimization setting and propose more general approximation schemes than a simple linearization. Indeed, using successive convex approximations, the proposed distributed algorithms are able to solve also nonconvex instances of problem (2.28), which are of great interest in learning and estimation applications.

The gradient tracking has been extended also to time-varying and directed networks by means of the push-sum protocol (cf. Appendix B.2) in both the consensus and the tracking iterations. Formally, the algorithm reads

[TABLE]

with $\phi_{i}^{0}=1$ , for all $i\in\{1,\ldots,N\}$ , and where the time-varying weights $b_{ij}^{t}$ are entries of column stochastic matrices $B^{t}\in\mathbb{R}^{N\times N}$ , for all $t\geq 0$ . This extension has been studied in [24, 25, 28, 29, 38, 31, 37, 39, 34, 30]. Notice that the previous extensions have been combined in some of the mentioned works to design gradient tracking algorithms over time-varying networks for convex and nonconvex problems. Recently, a block-wise implementation of the gradient tracking algorithm has been proposed in [40, 41, 42].

2.4 Discussion and References

Early consensus-based algorithms for distributed optimization and estimation have been proposed and analyzed in [43, 44, 16, 45, 17, 46, 47, 48]. A push-sum version of the subgradient algorithm has been proposed in [19] to deal with time-varying networks. Extensions to the stochastic set-up are provided in [49, 50]. A distributed algorithm using a constant step-size has been proposed in [51], with proved convergence rate $\mathcal{O}(1/t)$ (which can be strengthened to linear for strongly convex problems). The algorithmic framework has been extended to regularized problems in [52], and a detailed convergence rate analysis has been proposed in [53]. Its extension to directed graphs is proposed in [54]. Distributed schemes to solve nonconvex optimization problems are proposed in [55, 56, 57].

As regards gradient tracking algorithms, the interested reader can find relevant up-to-date references in Section 2.2. Second-order approaches have been investigated in [58, 59, 60, 61]. Newton-Raphson distributed approaches have been proposed and analyzed in [20, 22]. An extension to networks with packet loss is given in [62].

Distributed schemes working under asynchronous communication protocols are studied in [63, 64, 65, 66, 67, 68]. A randomized block-coordinate descent algorithm for convex optimization problems with linear constraints is proposed in [69]. In [70] an asynchronous distributed algorithm working also with communication delays is proposed.

As regards continuous-time optimization, a purely primal approach is designed in [71]. A prediction-correction approach for online distributed optimization has been proposed in [72]. It is also worth mentioning the works in [73, 74, 75, 76, 77], where a control perspective is employed to analyze distributed optimization algorithms. A distributed scenario with a variable number of working nodes is proposed in [78]. A novel methodology to design continuous-time distributed optimization algorithms using techniques from geometric control theory is investigated in [79, 80].

Among the most recent contributions, a Frank-Wolfe decomposition approach for convex and nonconvex problems is analyzed in [81]. A distributed algorithm based on the proximal minimization is proposed in [82] to solve convex constrained problems. In [83], a distributed scheme using a Bregman penalization has been proposed. A distributed optimization algorithm for convex optimization with local inequality constraints has been studied in [84]. An asynchronous distributed algorithm with heterogeneous regularizations and normalizations is proposed in [85]. A specialized version of the distributed subgradient algorithm for convex feasibility problems, which allows for an infinite number of constraint sets, is presented in [86].

2.5 Numerical Example

In this section we provide a numerical study to show the behavior of the distributed optimization algorithms presented in this chapter.

We consider a network of $N=30$ agents communicating over a fixed, undirected, connected graph generated according to an Erdős-Rényi random model with parameter $p=0.2$ . Agents are equipped with a doubly stochastic matrix built according to the Metropolis-Hastings rule [87], i.e.,

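$$a_{ij}=\begin{cases}\dfrac{1}{1+\max\{d_{i},d_{j}\}} & \text{if }(i,j)\in\mathcal{E}\text{ and }i\neq j,\\[4pt] 1-\sum_{h\in\mathcal{N}_{i}\setminus\{i\}}a_{ih} & \text{if }i=j,\\[4pt] 0 & \text{otherwise,}\end{cases}$$

where $d_{i}$ denotes the number of neighbors of agent $i$ (excluding $i$ itself).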
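A possible way to set up the network of this example is sketched below. It assumes the usual form of the Metropolis-Hastings rule, in which the weight of an edge is $1/(1+\max\{d_{i},d_{j}\})$ and each self-weight completes the row to one. The helper names `metropolis_weights`, `is_connected` and `erdos_renyi`, as well as the random seed, are hypothetical.

```python
import numpy as np

def metropolis_weights(adj):
    """Metropolis-Hastings weights for a symmetric 0/1 adjacency matrix without self-loops."""
    adj = np.asarray(adj)
    N = adj.shape[0]
    deg = adj.sum(axis=1)
    A = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            if i != j and adj[i, j]:
                A[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        A[i, i] = 1.0 - A[i].sum()    # the diagonal entry is still zero when summing
    return A

def is_connected(adj):
    """Graph search from node 0; True if every node is reached."""
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j in np.flatnonzero(adj[i]):
            if int(j) not in seen:
                seen.add(int(j))
                stack.append(int(j))
    return len(seen) == adj.shape[0]

def erdos_renyi(N, p, rng):
    """Undirected Erdos-Renyi graph: each edge is present independently with probability p."""
    upper = np.triu(rng.random((N, N)) < p, k=1)
    return (upper | upper.T).astype(int)

rng = np.random.default_rng(0)
adj = erdos_renyi(30, 0.2, rng)
while not is_connected(adj):          # resample until the graph is connected
    adj = erdos_renyi(30, 0.2, rng)
A = metropolis_weights(adj)
```

The resulting matrix is symmetric with nonnegative entries and unit row sums, hence doubly stochastic, and each agent can compute its own row knowing only its degree and the degrees of its neighbors.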

We focus on the logistic regression problem introduced in Section 1.3.2, where we suppose that each agent has $m_{1}=\ldots=m_{N}=10$ samples with feature space dimension $d=5$ . We generate the points $p_{i,j}$ according to a normal distribution with zero mean and variance equal to $2$ and we generate the binary labels $\ell_{i,j}$ from a standard Bernoulli distribution. Agents must agree on the optimal solution of problem (1.6), recalled here

[TABLE]

The regularization parameter $C$ is assumed to be equal to $0.01$ . We compare the distributed subgradient method (cf. Section 2.1), with diminishing step-size $\gamma^{t}=(1/t)^{0.8}$ , and the gradient tracking algorithm (cf. Section 2.2), with constant step-size $\gamma=10^{-3}$ .

In Figure 2.1 we compare the convergence rate of Algorithm 1 and Algorithm 2. That is, we plot the absolute value of the difference between the optimal cost $f^{\star}$ and the sum of local costs $\sum_{i=1}^{N}f_{i}(\mathbf{x}_{i}^{t})$ . From the theoretical analysis, the cost error of both algorithms is known to asymptotically converge to zero. However, the gradient tracking algorithm has a linear convergence rate and converges more quickly than the distributed subgradient method (see Figure 2.1).

In Figure 2.2 and 2.3, we show the total consensus error of the local solution estimates (for both algorithms) and of the gradient trackers (for the gradient tracking algorithm), respectively.

Chapter 3 Distributed Dual Methods

In this chapter we describe distributed optimization methods based on Lagrangian approaches. We start by discussing an illustrative example and then we present two relevant duality forms to show how duality can be exploited to reformulate cost-coupled problems as constraint-coupled problems and vice versa. We describe algorithms for cost-coupled problems based on a decomposition technique known as dual decomposition and on the Alternating Direction Method of Multipliers (ADMM). Then, we illustrate duality-based approaches to solve constraint-coupled problems. To conclude, we give an up-to-date set of references and we provide numerical examples to highlight the main features of the discussed algorithms.

3.1 Fenchel Duality and Graph Duality

In this section we show how a cost-coupled optimization problem can be manipulated to obtain alternative (decoupled) problem formulations that are amenable for distributed computation. First, we present a simplified scenario with two agents to illustrate how duality can be exploited in designing a distributed optimization algorithm. Then, we recall a classical duality form known as Fenchel duality (see [88]), that paved the way for a number of parallel algorithms. Finally, we introduce an alternative and effective approach, that we term graph duality, tailored for the distributed framework.

Consider a cost-coupled problem

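$$\min_{\mathbf{x}}\ \sum_{i=1}^{N}f_{i}(\mathbf{x})\quad\text{subj. to }\ \mathbf{x}\in\bigcap_{i=1}^{N}X_{i},\tag{3.1}$$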

where, for all $i\in\{1,\ldots,N\}$ , the cost function $f_{i}$ is convex and the constraint set $X_{i}$ is convex and bounded. These regularity assumptions are standard and guarantee that dual methods apply, i.e., that strong duality holds (cf. Appendix A.3). We will denote by $f^{\star}$ the optimal cost of problem (3.1).

3.1.1 Two-Agent Example

We start by considering a simple “network” of $2$ agents and informally discuss how duality allows for a suitable decomposition of a cost-coupled problem. All the technical details will be provided in the forthcoming sections.

We assume that both agents cooperate to solve the cost-coupled optimization problem

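$$\min_{\mathbf{x}}\ f_{1}(\mathbf{x})+f_{2}(\mathbf{x})\quad\text{subj. to }\ \mathbf{x}\in X_{1}\cap X_{2},\tag{3.2}$$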

where $f_{1},f_{2}:\mathbb{R}^{d}\rightarrow\mathbb{R}$ and $X_{1},X_{2}\subseteq\mathbb{R}^{d}$ . Recall that for such cost-coupled set-up, each agent is assumed to know only its own cost function and constraint (e.g., agent $1$ knows only $f_{1}$ and $X_{1}$ ).

The aim is to decompose problem (3.2) by exploiting Lagrangian duality. Specifically, we would like to obtain two symmetric subproblems so that each agent can solve its subproblem independently. To this end, we recast problem (3.2) into an equivalent formulation by introducing two copies, say $\mathbf{x}_{1}$ and $\mathbf{x}_{2}$ , of the decision variable $\mathbf{x}$ and a coherence constraint to obtain

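$$\min_{\mathbf{x}_{1},\mathbf{x}_{2}}\ f_{1}(\mathbf{x}_{1})+f_{2}(\mathbf{x}_{2})\quad\text{subj. to }\ \mathbf{x}_{1}\in X_{1},\ \mathbf{x}_{2}\in X_{2},\ \mathbf{x}_{1}=\mathbf{x}_{2}.\tag{3.3}$$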

This reformulation exhibits a convenient structure since the cost function of each agent depends only on its copy of the decision variable, while the coupling in the problem is due only to the coherence constraint $\mathbf{x}_{1}=\mathbf{x}_{2}$ . Now we write the dual of problem (3.3) (cf. Appendix A.3). Let us introduce the Lagrangian of (3.3), i.e.,

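$$\mathcal{L}(\mathbf{x}_{1},\mathbf{x}_{2},\boldsymbol{\lambda})=f_{1}(\mathbf{x}_{1})+f_{2}(\mathbf{x}_{2})+\boldsymbol{\lambda}^{\top}(\mathbf{x}_{1}-\mathbf{x}_{2}),$$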

where $\boldsymbol{\lambda}\in\mathbb{R}^{d}$ is the multiplier associated to the constraint $\mathbf{x}_{1}=\mathbf{x}_{2}$ . As will become clear from the forthcoming discussion, the presence of a single $\boldsymbol{\lambda}$ in $\mathcal{L}$ does not allow for a symmetric decomposition. Thus, let us follow an alternative approach, more suited for distributed computation. Formally, we add another, redundant constraint and rewrite (3.3) as

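$$\min_{\mathbf{x}_{1},\mathbf{x}_{2}}\ f_{1}(\mathbf{x}_{1})+f_{2}(\mathbf{x}_{2})\quad\text{subj. to }\ \mathbf{x}_{1}\in X_{1},\ \mathbf{x}_{2}\in X_{2},\ \mathbf{x}_{1}=\mathbf{x}_{2},\ \mathbf{x}_{2}=\mathbf{x}_{1},\tag{3.4}$$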

which is trivially equivalent to problem (3.3). For this problem, the Lagrangian becomes

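$$\begin{aligned}\mathcal{L}(\mathbf{x}_{1},\mathbf{x}_{2},\boldsymbol{\lambda}_{12},\boldsymbol{\lambda}_{21})&=f_{1}(\mathbf{x}_{1})+f_{2}(\mathbf{x}_{2})+\boldsymbol{\lambda}_{12}^{\top}(\mathbf{x}_{1}-\mathbf{x}_{2})+\boldsymbol{\lambda}_{21}^{\top}(\mathbf{x}_{2}-\mathbf{x}_{1})\\&\stackrel{(a)}{=}\Big(f_{1}(\mathbf{x}_{1})+\mathbf{x}_{1}^{\top}(\boldsymbol{\lambda}_{12}-\boldsymbol{\lambda}_{21})\Big)+\Big(f_{2}(\mathbf{x}_{2})+\mathbf{x}_{2}^{\top}(\boldsymbol{\lambda}_{21}-\boldsymbol{\lambda}_{12})\Big),\end{aligned}\tag{3.5}$$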

where $\boldsymbol{\lambda}_{12}$ and $\boldsymbol{\lambda}_{21}$ are the multipliers associated to the constraints $\mathbf{x}_{1}=\mathbf{x}_{2}$ and $\mathbf{x}_{2}=\mathbf{x}_{1}$ respectively, and in (a) we use the problem symmetry to rearrange $\mathcal{L}$ in two similar terms, each one depending only on a single primal variable, i.e., on $\mathbf{x}_{1}$ and $\mathbf{x}_{2}$ respectively. The dual function of problem (3.4) is obtained by minimizing the Lagrangian (3.5) with respect to the primal variables. Formally,

[TABLE]

Finally, we can pose the dual problem as

[TABLE]

Under suitable regularity assumptions on the primal problem (3.2), problem (3.6) has the same optimal cost. Thus, by solving (3.6), a dual optimal solution can be exploited to recover a primal optimal solution. In Section 3.1.3, we describe the extended approach for a general set-up with $N$ agents.

The distributed dual decomposition algorithm consists of an iterative procedure to solve problem (3.6) by means of a subgradient algorithm (cf. Appendix A.1), and ultimately to obtain a solution of the original problem (3.2). Here we are slightly abusing terminology: subgradients are defined for convex functions, while the dual function $q$ is concave, so the notation $\widetilde{\nabla}q$ stands for the opposite of a subgradient of $-q$ . With this convention, the choice of solving (3.6) with such an algorithm is convenient since a subgradient of the dual function at a given $(\bar{\boldsymbol{\lambda}}_{12},\bar{\boldsymbol{\lambda}}_{21})$ can be computed, in a distributed way, as

[TABLE]

where

[TABLE]

We assume that agent $1$ maintains and updates $\mathbf{x}_{1}$ and $\boldsymbol{\lambda}_{12}$ , while agent $2$ maintains and updates $\mathbf{x}_{2}$ and $\boldsymbol{\lambda}_{21}$ . At the beginning, they initialize $\boldsymbol{\lambda}_{12}^{0}$ and $\boldsymbol{\lambda}_{21}^{0}$ to arbitrary values. Then, at each iteration $t\geq 0$ of the algorithm, agents exchange their current value of $\boldsymbol{\lambda}_{12}^{t}$ and $\boldsymbol{\lambda}_{21}^{t}$ and compute a local estimate of the solution as

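$$\mathbf{x}_{1}^{t+1}\in\mathop{\rm argmin}_{\mathbf{x}_{1}\in X_{1}}\ f_{1}(\mathbf{x}_{1})+\mathbf{x}_{1}^{\top}(\boldsymbol{\lambda}_{12}^{t}-\boldsymbol{\lambda}_{21}^{t}),\qquad\mathbf{x}_{2}^{t+1}\in\mathop{\rm argmin}_{\mathbf{x}_{2}\in X_{2}}\ f_{2}(\mathbf{x}_{2})+\mathbf{x}_{2}^{\top}(\boldsymbol{\lambda}_{21}^{t}-\boldsymbol{\lambda}_{12}^{t}).$$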

Then, they exchange the updated value of $\mathbf{x}_{1}^{t+1}$ and $\mathbf{x}_{2}^{t+1}$ to adjust their local dual variable as

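$$\boldsymbol{\lambda}_{12}^{t+1}=\boldsymbol{\lambda}_{12}^{t}+\gamma^{t}(\mathbf{x}_{1}^{t+1}-\mathbf{x}_{2}^{t+1}),\qquad\boldsymbol{\lambda}_{21}^{t+1}=\boldsymbol{\lambda}_{21}^{t}+\gamma^{t}(\mathbf{x}_{2}^{t+1}-\mathbf{x}_{1}^{t+1}),$$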

where $\gamma^{t}$ denotes the step-size of the subgradient method. An illustration of how communication and computation interleave is shown in Figure 3.1.
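The two-agent scheme can be simulated with a few lines of code. The sketch below assumes that each agent minimizes its own part of the rearranged Lagrangian and then takes a dual ascent step along the disagreement between the two copies. The scalar quadratic costs, the interval constraints, the step-size $\gamma^{t}=1/(t+1)$ and the helper `local_min` are hypothetical choices made for illustration.

```python
import numpy as np

# Hypothetical scalar instance: f_i(x) = 0.5 * a_i * (x - c_i)^2, X_i = [lo, hi].
a1, c1 = 1.0, 1.0
a2, c2 = 2.0, 3.0
lo, hi = -5.0, 5.0

def local_min(a, c, mu):
    """Minimizer over [lo, hi] of 0.5*a*(x - c)^2 + mu*x (closed form, then clipped)."""
    return float(np.clip(c - mu / a, lo, hi))

lam12, lam21 = 0.0, 0.0              # agent 1 owns lam12, agent 2 owns lam21
for t in range(500):
    gamma_t = 1.0 / (t + 1)
    # after exchanging the multipliers, each agent minimizes its part of the Lagrangian
    x1 = local_min(a1, c1, lam12 - lam21)
    x2 = local_min(a2, c2, lam21 - lam12)
    # after exchanging the primal estimates, each agent updates its own multiplier
    lam12 += gamma_t * (x1 - x2)
    lam21 += gamma_t * (x2 - x1)

x_opt = (a1 * c1 + a2 * c2) / (a1 + a2)   # minimizer of f_1 + f_2 (interior here)
print(x1, x2, x_opt)
```

In this instance both local estimates should agree on the minimizer of $f_{1}+f_{2}$ . Note that each agent only evaluates its own cost and constraint, while the coupling is handled entirely through the exchanged multipliers and estimates.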

In Section 3.2 we will present and analyze the general case with $N$ agents and prove that the local solution estimates are asymptotically consensual and converge to an optimal solution of the primal problem.

3.1.2 Fenchel Duality

A classical approach to manipulate problem (3.1) consists in writing its Fenchel dual [10]. To this end, let us introduce copies $\mathbf{x}_{i}\in\mathbb{R}^{d}$ of the optimization variable $\mathbf{x}$ and an auxiliary variable $\mathbf{z}\in\mathbb{R}^{d}$ needed to enforce coherence among all the copies. Then, problem (3.1) can be equivalently recast as

[TABLE]

The Fenchel-dual problem of (3.1) is defined as the (standard) dual of (3.9). To this end, consider the Lagrangian function of (3.9), i.e.,

[TABLE]

The minimization of $\mathcal{L}$ with respect to the primal variables gives the dual function

[TABLE]

Then, the Fenchel-dual problem of (3.1) is given by the maximization of $q$ over its domain, i.e.,

[TABLE]

where each $q_{i}$ is defined as

[TABLE]

Problems in the form (3.10) are often referred to as resource allocation problems. We point out that (3.10) has a constraint-coupled structure, similar to problem (1.3) in Section 1.2.3. A (centralized) projected gradient method applied to (3.10) reads as follows,

[TABLE]

where in (a) we exploited the (recursive) feasibility of the previous iterate $(\boldsymbol{\lambda}_{1}^{t},\ldots,\boldsymbol{\lambda}_{N}^{t})$ .

Algorithm (3.11) is also known as parallel dual decomposition. Notice that we used properties of dual subgradients involving the local primal minimizers to write the dual update (cf. Appendix A.3). Figure 3.2 shows the algorithmic flow of parallel dual decomposition.
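A minimal sketch of parallel dual decomposition is reported below. It assumes that the Fenchel dual couples the multipliers through the constraint $\sum_{i=1}^{N}\boldsymbol{\lambda}_{i}=0$ and that each $q_{i}$ is obtained by minimizing $f_{i}(\mathbf{x}_{i})+\boldsymbol{\lambda}_{i}^{\top}\mathbf{x}_{i}$ over $X_{i}$ , so that the projected step reduces to subtracting the average of the local minimizers. The scalar quadratic costs, the interval constraints and the constant step-size are hypothetical choices, and the update is our reading of the scheme rather than a transcription of (3.11).

```python
import numpy as np

# Hypothetical scalar instance: f_i(x) = 0.5 * (x - c_i)^2, X_i = [-10, 10].
c = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
N = c.size
lam = np.zeros(N)            # feasible start: the multipliers sum to zero
gamma = 0.5                  # constant step, used here because the costs are strongly convex

for _ in range(200):
    # local minimizers of f_i(x_i) + lam_i * x_i over X_i, computed in parallel
    x = np.clip(c - lam, -10.0, 10.0)
    # ascent step followed by projection onto {sum_i lam_i = 0}; since the
    # current multipliers already sum to zero, only the mean of x is removed
    lam = lam + gamma * (x - x.mean())

print(x, lam.sum())
```

The only non-local quantity is the average of the local minimizers, which is why the scheme is parallel (it needs a central aggregation step) rather than distributed.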

Notice that problem (3.9) can also be solved using ADMM (cf. Appendix A.4). The formal updates can be derived as done for the parallel dual decomposition by considering the so-called augmented Lagrangian. It can be shown (see [89]) that the resulting algorithm is

[TABLE]

where $\rho$ is the positive penalty parameter of the augmented Lagrangian. It is worth noting that algorithm (3.12) enjoys a parallel structure similarly to the dual decomposition case.

3.1.3 Graph Duality

A powerful method to decouple a cost-coupled problem (3.1) into a convenient structure, amenable to distributed computation, is to introduce suitable graph-induced constraints that result in an appropriate dual problem. We term this methodology graph duality to stress that it combines the classical duality theory with the network structure. Indeed, the resulting dual problem heavily depends on the specific network as will be detailed next. The method that we now formalize is the general form of the approach used in Section 3.1.1.

Let a fixed, undirected and connected graph $\mathcal{G}=(\{1,\ldots,N\},\mathcal{E})$ be given, then we define the $\mathcal{G}$ -dual of (3.1) as follows. Introduce $N$ copies, say $\mathbf{x}_{1},\ldots,\mathbf{x}_{N}$ , of the decision variable $\mathbf{x}$ and coherence constraints of the copies matching the graph structure, i.e., $\mathbf{x}_{i}=\mathbf{x}_{j}$ for all $(i,j)\in\mathcal{E}$ . Then, problem (3.1) becomes

[TABLE]

Since the graph $\mathcal{G}$ is connected, the equivalence of problems (3.1) and (3.13) is guaranteed.

Let $\boldsymbol{\lambda}_{ij}\in\mathbb{R}^{d}$ be the multiplier associated to the constraint $\mathbf{x}_{i}=\mathbf{x}_{j}$ , then the Lagrangian of (3.13) is

[TABLE]

where the variable $\boldsymbol{\Lambda}$ stacks all the $|\mathcal{E}|$ multipliers $\boldsymbol{\lambda}_{ij}$ .

Notice that, since the communication graph is undirected, for each term $\boldsymbol{\lambda}_{ij}^{\top}(\mathbf{x}_{i}-\mathbf{x}_{j})$ in (3.14) there is also a symmetric counterpart $\boldsymbol{\lambda}_{ji}^{\top}(\mathbf{x}_{j}-\mathbf{x}_{i})$ . Thus, the Lagrangian (3.14) can be rearranged so as to isolate the primal variables $\mathbf{x}_{i}$ , $i\in\{1,\ldots,N\}$ , as

[TABLE]

At this point, the dual function of (3.13) is obtained by minimizing the Lagrangian $\mathcal{L}$ with respect to the primal variables, leading to a separable function. Finally, the $\mathcal{G}$ -dual of (3.1) is the (standard) dual of (3.13), which is given by

[TABLE]

where the $i$ -th term $q_{i}$ of the dual function $q$ is defined as

[TABLE]

for all $i\in\{1,\ldots,N\}$ . We notice that problem (3.15) exhibits interesting features for a distributed computation framework. First, it is an unconstrained optimization problem with cost function expressed, similarly to the starting problem, as the sum of local terms $q_{i}$ . However, differently from the original problem (3.13), in the $\mathcal{G}$ -dual (3.15) the $i$ -th cost function depends only on the variables of agent $i$ and of its neighbors, rather than on the entire stack of decision vectors. In Section 3.2, we will derive a distributed algorithm that exploits the special structure of problem (3.15), known in the literature as partitioned optimization (cf. Remark 4).

3.2 Distributed Dual Decomposition for Cost-Coupled Problems

In this section, we review an algorithm, known as distributed dual decomposition, that relies on duality to solve cost-coupled problems in a distributed way. Decomposition techniques based on duality have been introduced in [88, 90, 9]. Typically, they are used to obtain parallel algorithms to speed up the computation. However, the distributed extensions of those techniques are only partially discussed in the mentioned references, while in the following we provide a comprehensive and constructive analysis for this scenario.

We consider $N$ agents in a network that want to cooperatively solve a cost-coupled problem (3.1) (cf. Section 1.2.1) that satisfies the following regularity properties.

Assumption 17.

For all $i\in\{1,\ldots,N\}$ , each $f_{i}$ is a convex function and each $X_{i}$ is a compact, convex set. Moreover, there exists a vector $\mathbf{x}$ such that $\mathbf{x}\in\mathop{\rm relint}{X_{i}}$ for all $i\in\{1,\ldots,N\}$ , where, given a set $X\subset\mathbb{R}^{d}$ , we denote by $\mathop{\rm relint}{X}$ its relative interior. $\square$

The latter part of Assumption 17 is known in the literature as Slater’s constraint qualification, and is a sufficient condition to ensure that strong duality holds.

Agent $i$ maintains a primal solution estimate $\mathbf{x}_{i}^{t}$ , and dual solution estimates $\boldsymbol{\lambda}_{ij}^{t},j\in\mathcal{N}_{i}$ . The distributed dual decomposition algorithm is based on a subgradient method applied to the $\mathcal{G}$ -dual of (3.1) (see Section 3.1.3), i.e.,

[TABLE]

A subgradient of the dual function $q(\boldsymbol{\Lambda})$ at a given $\bar{\boldsymbol{\Lambda}}$ (stacking all the $\bar{\boldsymbol{\lambda}}_{ij}$ ) can be computed in a distributed way as follows. The component of $\widetilde{\nabla}q$ corresponding to the variable $\boldsymbol{\lambda}_{ij}$ is equal to (cf. Appendix A.3)

[TABLE]

where $\bar{\mathbf{x}}_{i}$ is computed as

[TABLE]

and, consistently, for $\bar{\mathbf{x}}_{j}$ . Due to the sparse computation of dual subgradients, a subgradient method applied to the $\mathcal{G}$ -dual of (3.1) turns out to be a distributed algorithm. Formally, each agent $i$ initializes $\boldsymbol{\lambda}_{ij}^{0}$ for $j\in\mathcal{N}_{i}$ to any vector in $\mathbb{R}^{d}$ . At each iteration $t$ , each agent $i$ collects from its neighbors $j\in\mathcal{N}_{i}$ the updated dual variables $\boldsymbol{\lambda}_{ji}^{t}$ and performs a primal minimization

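$$\mathbf{x}_{i}^{t+1}\in\mathop{\rm argmin}_{\mathbf{x}_{i}\in X_{i}}\ f_{i}(\mathbf{x}_{i})+\mathbf{x}_{i}^{\top}\sum_{j\in\mathcal{N}_{i}}(\boldsymbol{\lambda}_{ij}^{t}-\boldsymbol{\lambda}_{ji}^{t}).$$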

Then, agents exchange their updated primal solution estimates and perform a subgradient method step on the dual variables according to

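$$\boldsymbol{\lambda}_{ij}^{t+1}=\boldsymbol{\lambda}_{ij}^{t}+\gamma^{t}(\mathbf{x}_{i}^{t+1}-\mathbf{x}_{j}^{t+1}),\qquad j\in\mathcal{N}_{i},$$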

where $\gamma^{t}$ is the step-size sequence.
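The two steps described above can be sketched for a network of agents as follows. The sketch assumes that agent $i$ minimizes $f_{i}(\mathbf{x}_{i})+\mathbf{x}_{i}^{\top}\sum_{j\in\mathcal{N}_{i}}(\boldsymbol{\lambda}_{ij}-\boldsymbol{\lambda}_{ji})$ over $X_{i}$ and then moves each $\boldsymbol{\lambda}_{ij}$ along $\mathbf{x}_{i}-\mathbf{x}_{j}$ . The scalar quadratic costs, the interval constraints, the five-agent ring and the diminishing step-size (non-summable and square-summable, which is our reading of the kind of condition required by Assumption 7) are hypothetical choices.

```python
import numpy as np

# Hypothetical scalar instance: f_i(x) = 0.5 * (x - c_i)^2, X_i = [-10, 10], ring of 5 agents.
c = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
N = c.size
adj = np.zeros((N, N))
for i in range(N):
    adj[i, (i - 1) % N] = adj[i, (i + 1) % N] = 1.0

Lam = np.zeros((N, N))       # Lam[i, j] stores lambda_ij, owned by agent i (used only on edges)
for t in range(2000):
    gamma_t = 0.25 / (1.0 + 0.01 * t)
    # agent i combines its own lambda_ij with the lambda_ji received from its neighbors
    mu = np.sum(adj * (Lam - Lam.T), axis=1)
    # local Lagrangian minimization: argmin over X_i of 0.5*(x - c_i)^2 + mu_i * x
    x = np.clip(c - mu, -10.0, 10.0)
    # after exchanging the primal estimates, each agent updates its multipliers
    Lam += gamma_t * adj * (x[:, None] - x[None, :])

print(x, c.mean())
```

Since the costs in this instance are strongly convex, the local estimates themselves should reach consensus on the minimizer of the aggregate cost, here the mean of the $c_{i}$ . For merely convex costs only the dual cost is guaranteed to converge.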

Figure 3.3 shows the algorithmic flow of the distributed dual decomposition while the following table (Algorithm 3) summarizes the algorithm from the perspective of each agent $i$ .

Next, we provide the convergence result for Algorithm 3.

Theorem 18.

Let Assumption 17 hold. Moreover, let the communication graph be undirected and connected and let the step-size $\gamma^{t}$ satisfy Assumption 7. Then, the dual variable sequence $\{\boldsymbol{\Lambda}^{t}\}_{t\geq 0}$ generated by Algorithm 3 satisfies

[TABLE]

where $f^{\star}$ is the optimal cost of problem (3.1).

Proof (Sketch).

The proof heavily relies on the constructive derivation we carried out in this section. We have proven that the distributed dual decomposition algorithm is a subgradient method iteration on the $\mathcal{G}$ -dual (3.16). Since the primal cost functions $f_{i}$ are convex and the local sets $X_{i}$ are compact, it is possible to show that the dual function $q$ has bounded subgradients. Thus, by Proposition 37, and since the dual function $q$ is concave, every limit point of $\{\boldsymbol{\Lambda}^{t}\}_{t\geq 0}$ is an optimal solution of problem (3.16). Therefore, by continuity of $q$ and by strong duality, it holds

[TABLE]

∎

Notice that nothing can be said about the convergence of the primal sequence $\{\mathbf{x}_{i}^{t}\}_{t\geq 0}$ generated by Algorithm 3. In fact, due to the lack of strict convexity of the cost functions, there is no guarantee of feasibility of the solutions retrieved by the Lagrangian minimization. This problem has been addressed by introducing averaging mechanisms, i.e., let the sequence $\{\widehat{\mathbf{x}}_{i}^{t}\}_{t\geq 0}$ be defined as $\widehat{\mathbf{x}}_{i}^{t}=1/t\sum_{\tau=0}^{t}\mathbf{x}_{i}^{\tau}$ , for all $t$ . Then, it holds

[TABLE]

where $\mathbf{x}^{\star}$ and $f^{\star}$ denote an optimal solution and the optimal cost of problem (3.1), respectively.
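As a side note on implementation, the running average does not require storing past iterates. Multiplying the definition of $\widehat{\mathbf{x}}_{i}^{t}$ given above by $t$ and subtracting the same identity at $t-1$ gives the recursion below, which each agent can evaluate locally (a direct consequence of the definition as written, not an additional result of the survey):

$$
t\,\widehat{\mathbf{x}}_{i}^{t}=\sum_{\tau=0}^{t}\mathbf{x}_{i}^{\tau}=(t-1)\,\widehat{\mathbf{x}}_{i}^{t-1}+\mathbf{x}_{i}^{t}
\quad\Longrightarrow\quad
\widehat{\mathbf{x}}_{i}^{t}=\frac{t-1}{t}\,\widehat{\mathbf{x}}_{i}^{t-1}+\frac{1}{t}\,\mathbf{x}_{i}^{t},\qquad t\geq 2,
$$

initialized with $\widehat{\mathbf{x}}_{i}^{1}=\mathbf{x}_{i}^{0}+\mathbf{x}_{i}^{1}$.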

Remark 19.

If each cost function $f_{i}$ in problem (3.1) is strongly convex then it is possible to improve the result. Specifically, under primal strong convexity the dual function $q$ becomes smooth (i.e., differentiable with Lipschitz continuous gradient) so that a gradient method with constant step-size can be applied to solve the dual problem (3.16). Moreover, since strong convexity implies strict convexity, also primal convergence can be established, i.e., $\lim_{t\to\infty}\|\mathbf{x}_{i}^{t}-\mathbf{x}^{\star}\|=0$ for all $i$ with $\mathbf{x}^{\star}$ the optimal solution of (3.1). This follows since the Lagrangian minimization admits a unique solution at each iteration $t$ . $\square$

As for the rate of convergence of the dual iterates, the algorithm directly inherits the convergence rate of the standard subgradient method, which is sublinear. If more regular problems are considered (e.g., strongly convex cost functions), then the dual function becomes smooth, and therefore the linear convergence rate of the gradient method is obtained.

Remark 20.

Distributed dual decomposition can also be applied to partitioned optimization problems (cf. Remark 4). To efficiently exploit the partitioned structure of the problem, one can work on copies of the relevant portions of the global decision vector. This gives rise to tailored distributed dual decomposition algorithms, see, e.g., [91, 92]. The same procedure has been employed for distributed ADMM (cf. the following section) in [12, 93, 94]. $\square$

In the following section we describe a distributed algorithm that can solve convex optimization problems and guarantees asymptotic primal feasibility without resorting to averaging mechanisms.

3.3 Distributed ADMM for Cost-Coupled Problems

In this section we review a distributed algorithm based on the popular Alternating Direction Method of Multipliers (ADMM, cf. Appendix A.4). References for the approach described in this section are, e.g., [95, 96, 97, 98].

We consider a network of $N$ agents that aim to cooperatively solve a cost-coupled problem in the form (3.1). Similarly to distributed dual decomposition, in order to distribute the computation we include sparsity in problem (3.1) by introducing a set of copies of $\mathbf{x}$ and proper coherence constraints matching the sparsity of the communication graph $\mathcal{G}$ . That is, problem (3.1) can be equivalently stated as

[TABLE]

This problem reformulation is different from the one used for distributed dual decomposition and is tailored for the ADMM approach which makes use of the augmented Lagrangian. Let us introduce $|\mathcal{E}|+N$ multipliers associated to the coherence constraints. The augmented Lagrangian is

[TABLE]

where $\mathbf{X}$ and $\mathbf{Z}$ denote the vectors stacking all the primal variables $\mathbf{x}_{i}$ and $\mathbf{z}_{i}$ , respectively, and $\boldsymbol{\Lambda}$ denotes the vector stacking all the multipliers.

The ADMM algorithm described in Appendix A.4 can be applied to problem (3.19) using the following identifications. The decision variables $\mathbf{x}$ and $\mathbf{z}$ of (A.11) are $\mathbf{X}$ and $\mathbf{Z}$ , respectively. As for the cost functions, we set

[TABLE]

As for the constraints, $C_{1}=X_{1}\times\cdots\times X_{N}$ while $C_{2}\equiv\mathbb{R}^{N\cdot d}$ . Finally, the linear constraints can be stated as

[TABLE]

and $c$ equal to zero, where Adj is the adjacency matrix of $\mathcal{G}$ (without self-loops) while $I_{N\cdot d}$ and $I_{d}$ are $Nd\times Nd$ and $d\times d$ identity matrices, respectively.

Remark 21.

An alternative formulation of problem (3.1) has been largely used in the literature and it is known as consensus-ADMM formulation (see, e.g., [99]). Formally, the following equivalent formulation of problem (3.1) is considered,

[TABLE]

The resulting ADMM algorithm is derived by following the same steps performed for problem (3.19). However, notice that problem (3.20) has $|\mathcal{E}|+N$ variables and $2\cdot|\mathcal{E}|$ coherence constraints, while problem (3.19) has only $2\cdot N$ variables and $|\mathcal{E}|+N$ coherence constraints. $\square$

[TABLE]

for all $j\in\mathcal{N}_{i}$ and $i\in\{1,\ldots,N\}$ .

It is possible to rephrase the $\mathbf{z}$ -minimization in (3.21b) by noticing that it is an unconstrained quadratic program. The first-order necessary condition of optimality is

[TABLE]

Thus, the explicit solution of (3.21b) is given by

[TABLE]

Figure 3.4 shows the algorithmic flow of distributed ADMM, while in Algorithm 4 we summarize the distributed ADMM algorithm from the perspective of agent $i$ . As for the initialization, each agent $i$ can initialize $\boldsymbol{\lambda}_{ij}^{0}$ for $j\in\mathcal{N}_{i}$ , $\boldsymbol{\lambda}_{ii}^{0}$ and $\mathbf{z}_{i}^{0}$ to arbitrary vectors in $\mathbb{R}^{d}$ .
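For intuition, the three updates can be sketched in Python on a toy instance. The sketch rests on assumptions that should be checked against (3.19) and (3.21): the coherence constraints are taken to be $\mathbf{x}_{i}=\mathbf{z}_{j}$ for $j\in\mathcal{N}_{i}$ and for $j=i$ (consistent with the $|\mathcal{E}|+N$ multipliers mentioned above), the variables are scalar, and the local costs are the illustrative quadratics $f_{i}(x)=\tfrac{1}{2}q_{i}(x-r_{i})^{2}$ on $X_{i}=[\texttt{lo},\texttt{hi}]$. All names (`distributed_admm`, `q`, `r`, `neighbors`) are hypothetical.

```python
def distributed_admm(q, r, neighbors, lo, hi, rho=1.0, iters=2000):
    """Sketch of distributed ADMM for scalar quadratic costs.

    Assumed (illustrative) data: f_i(x) = 0.5*q[i]*(x - r[i])**2, X_i = [lo, hi],
    and coherence constraints x_i = z_j for every j in N_i and for j = i.
    """
    n = len(q)
    nb = [list(neighbors[i]) + [i] for i in range(n)]   # N_i plus the agent itself
    lam = [{j: 0.0 for j in nb[i]} for i in range(n)]   # lam[i][j] pairs with x_i = z_j
    z = [0.0] * n
    x = [0.0] * n
    for _ in range(iters):
        # x-update: minimizer of the augmented Lagrangian in x_i (closed form, then clipped).
        for i in range(n):
            num = (q[i] * r[i] - sum(lam[i][j] for j in nb[i])
                   + rho * sum(z[j] for j in nb[i]))
            x[i] = min(hi, max(lo, num / (q[i] + rho * len(nb[i]))))
        # z-update: unconstrained quadratic program, solved explicitly by averaging.
        for i in range(n):
            z[i] = sum(x[j] + lam[j][i] / rho for j in nb[i]) / len(nb[i])
        # Multiplier update along each coherence constraint.
        for i in range(n):
            for j in nb[i]:
                lam[i][j] += rho * (x[i] - z[j])
    return x, z


# Illustrative data: a ring of four agents.
ring = [[1, 3], [0, 2], [1, 3], [2, 0]]
x_admm, z_admm = distributed_admm([1.0, 2.0, 1.0, 2.0], [1.0, 2.0, 3.0, 4.0],
                                  ring, -10.0, 10.0)
```

The clipping in the $\mathbf{x}$-update is exact only because the assumed problem is scalar and strongly convex; in general each agent would call a local solver for the constrained minimization.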

Next, we establish convergence of the distributed ADMM algorithm.

Theorem 22.

Let Assumption 17 hold and let the communication graph be undirected and connected. Then, the sequences of local solution estimates $\{\mathbf{x}_{i}^{t}\}_{t\geq 0}$ , $i\in\{1,\ldots,N\}$ , generated by Algorithm 4 are asymptotically consensual to an optimal solution $\mathbf{x}^{\star}$ of problem (3.1), i.e.,

[TABLE]

Proof (Sketch).

The proof heavily relies on the constructive derivation we carried out in this section. We have shown that Algorithm 4 is an instance of the ADMM algorithm (cf. (A.10) in Appendix A.4) applied to problem (3.19). Thus, by Proposition 39, it follows that the primal variable sequence $\{(\mathbf{x}_{1}^{t},\ldots,\mathbf{x}_{N}^{t})\}_{t\geq 0}$ converges to an optimal (hence feasible) solution of problem (3.19). Recalling that problem (3.19) is an equivalent formulation of (3.1), the proof follows. ∎

3.4 Distributed Dual Methods for Constraint-Coupled Problems

In this section, we consider a constraint-coupled optimization problem (cf. Section 1.2.3). We describe how duality can be exploited to develop distributed optimization algorithms for this problem class. Notice that the methods discussed in Section 3.2 and Section 3.3 are designed for a different problem set-up.

3.4.1 Connections between Cost-Coupled and Constraint-Coupled Problems via Duality

In Section 3.1, we have shown that the Fenchel-dual problem (3.10) of a cost-coupled problem is a constraint-coupled problem. Next, we show that there exists a more general symmetry between these two set-ups. In the following, we discuss how duality can be employed to express constraint-coupled problems as cost-coupled ones. Consider a constraint-coupled problem

[TABLE]

where all the quantities have been introduced in Section 1.2.3.

To derive the dual problem of (3.25), let us introduce a multiplier $\boldsymbol{\mu}\in\mathbb{R}^{S}$ associated to the coupling constraint $\sum_{i=1}^{N}\mathbf{g}_{i}(\mathbf{x}_{i})\leq\mathbf{0}$ . Thus, the Lagrangian reads

[TABLE]

The dual of problem (3.25) is

[TABLE]

where the $i$ -th term $q_{i}$ of the dual function $q$ is defined as

[TABLE]

It is easy to see that (3.26) is a cost-coupled problem.

We consider $N$ agents in a network modeled as a connected, fixed and undirected graph, which aim to cooperatively solve a constraint-coupled problem (3.25) satisfying the following assumption.

Assumption 23.

For all $i\in\{1,\ldots,N\}$ : each function $f_{i}$ is convex, each constraint $X_{i}$ is a non-empty, compact and convex set; each function $\mathbf{g}_{i}$ is a component-wise convex function. Moreover, there exist $\bar{\mathbf{x}}_{1}\in X_{1},\ldots,\bar{\mathbf{x}}_{N}\in X_{N}$ such that $\sum_{i=1}^{N}\mathbf{g}_{i}(\bar{\mathbf{x}}_{i})<\mathbf{0}$ . $\square$

The latter part of Assumption 23 is Slater’s constraint qualification and ensures that strong duality holds.

We recall that each agent $i$ aims to compute only its portion $\mathbf{x}_{i}^{\star}$ of the entire optimal solution $(\mathbf{x}_{1}^{\star},\ldots,\mathbf{x}_{N}^{\star})$ (cf. Section 1.3). In the following, we introduce two distributed algorithms that solve problem (3.25) by means of problem (3.26).

3.4.2 Distributed Dual Subgradient Algorithm

A (centralized) subgradient method (cf. Appendix A.2) applied to the maximization of the concave problem (3.26) reads

[TABLE]

Notice that, as discussed in Appendix A.3, a subgradient of $q_{i}$ at $\boldsymbol{\mu}^{t}$ can be computed by evaluating the dualized constraints $\mathbf{g}_{i}$ at the minimizer of the Lagrangian, i.e.,

[TABLE]

so that $\widetilde{\nabla}q_{i}(\boldsymbol{\mu}^{t})=\mathbf{g}_{i}(\mathbf{x}_{i}^{t+1})$ . The method described by (3.28) suggests that the distributed subgradient algorithm (cf. Section 2.1) can be applied to solve problem (3.26).

In the following, we describe the distributed dual subgradient algorithm. Each node $i$ maintains a local dual variable estimate $\boldsymbol{\mu}_{i}^{t}$ that is iteratively updated according to a distributed subgradient iteration described by (3.30), and a local primal variable $\mathbf{x}_{i}^{t}$ , computed by minimizing the $i$ -th term of the Lagrangian as in (3.29). Nodes initialize their local dual variables $\boldsymbol{\mu}_{i}^{0}$ to any vector in the positive orthant. Algorithm 5 formally summarizes the distributed dual subgradient algorithm for a constraint-coupled optimization problem (from the perspective of agent $i$ ).
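The iteration just described can be illustrated with a short Python sketch. It assumes the usual form of the update (a weighted average of the neighbors' dual estimates, a local Lagrangian minimization, then a projected subgradient step), which should be checked against (3.29) and (3.30). The toy problem is also an assumption: scalar variables, $f_{i}(x)=\tfrac{1}{2}(x-r_{i})^{2}$, $X_{i}=[\texttt{lo},\texttt{hi}]$, and a single coupling constraint $\sum_{i}x_{i}\leq b$ split as $g_{i}(x)=x-b/N$. Names such as `dual_subgradient` and `weights` are hypothetical.

```python
def dual_subgradient(r, b, weights, lo=0.0, hi=10.0, iters=5000,
                     step=lambda t: 1.0 / (t + 1) ** 0.6):
    """Sketch of a distributed dual subgradient method on a toy problem.

    Assumed problem: min sum_i 0.5*(x_i - r[i])**2
                     s.t. sum_i x_i <= b, x_i in [lo, hi],
    with the coupling constraint split as g_i(x_i) = x_i - b/N.
    weights is a doubly stochastic matrix matching the communication graph.
    """
    n = len(r)
    mu = [0.0] * n            # local dual estimates, started in the nonnegative orthant
    x = [0.0] * n
    for t in range(iters):
        # Consensus step on the dual estimates received from neighbors.
        v = [sum(weights[i][j] * mu[j] for j in range(n)) for i in range(n)]
        for i in range(n):
            # Local Lagrangian minimization: argmin 0.5*(x - r_i)^2 + v_i*(x - b/N).
            x[i] = min(hi, max(lo, r[i] - v[i]))
            # Projected subgradient ascent step; g_i(x_i) is the subgradient of q_i.
            mu[i] = max(0.0, v[i] + step(t) * (x[i] - b / n))
    return x, mu


# Illustrative data: ring of four agents with uniform weights 1/3.
W = [[1/3, 1/3, 0.0, 1/3],
     [1/3, 1/3, 1/3, 0.0],
     [0.0, 1/3, 1/3, 1/3],
     [1/3, 0.0, 1/3, 1/3]]
x_ds, mu_ds = dual_subgradient([1.0, 2.0, 3.0, 4.0], 8.0, W)
```

For this data the dual optimum is $\mu^{\star}=0.5$ and the primal optimum is $x_{i}^{\star}=r_{i}-0.5$. Since the assumed costs are strictly convex, the local minimizers approach the primal optimum without averaging; with a diminishing step-size the approach is slow, which is typical of subgradient schemes.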

Since Algorithm 5 is a distributed subgradient method (cf. Algorithm 1), the usual convergence properties (discussed in Chapter 2) apply (there, we gave the analysis for unconstrained problems; however, the algorithm can be extended to a constrained set-up, see, e.g., [17]). Consider the same network framework as in Section 2.1 and let Assumption 23 hold. We now state the convergence result of the distributed dual subgradient algorithm.

Theorem 24.

Let Assumption 23 hold. Let the communication graph be undirected and connected with weights $a_{ij}$ satisfying Assumption 6 and let the step-size $\gamma^{t}$ satisfy Assumption 7. Then, the sequence of dual variables $\{\boldsymbol{\mu}_{1}^{t},\ldots,\boldsymbol{\mu}_{N}^{t}\}_{t\geq 0}$ generated by Algorithm 5 satisfies

[TABLE]

where $\boldsymbol{\mu}^{\star}$ is an optimal solution of problem (3.26), the dual of problem (3.25). Moreover, let the sequence $\{\widehat{\mathbf{x}}_{i}^{t}\}_{t\geq 0}$ be defined as $\widehat{\mathbf{x}}_{i}^{t}=1/t\sum_{\tau=0}^{t}\mathbf{x}_{i}^{\tau}$ , for all $t$ . Then, it holds

[TABLE]

where $\mathbf{x}^{\star}$ and $f^{\star}$ denote an optimal solution and the optimal cost of problem (3.25), respectively. $\square$

A proof of the statement is provided in [100] for time-varying networks using a proximal minimization perspective. Notice that Theorem 24 does not state any convergence property for the primal variables $\mathbf{x}_{i}^{t}$ . To this end, as done in Section 3.2, it is useful to employ a local running average (i.e., $\widehat{\mathbf{x}}_{i}^{t}$ ). When the cost function of problem (3.25) is strictly convex, problem (3.25) has a unique optimal solution. In this scenario, convergence of $\mathbf{x}_{i}^{t}$ is guaranteed in any case, so that no primal recovery issues arise and no local running average is necessary.

The distributed dual subgradient algorithm enjoys appealing features: (i) local computations at each node involve only the local decision variable and, thus, scale nicely with respect to the dimension of the decision vector, (ii) privacy is preserved since agents do not communicate, and thus disclose, their estimates of the local decision variable, cost or constraints.

3.4.3 Relaxation and Successive Distributed Decomposition

Next, we present a distributed algorithm, named Relaxation and Successive Distributed Decomposition (RSDD), to solve constraint-coupled problems of the form (3.25), which has been proposed and analyzed in [101, 102]. The main ideas behind the algorithmic development are: (i) to solve the (cost-coupled) dual problem (3.26) by means of distributed dual decomposition, and (ii) to handle infeasibility of local problems, occurring during the algorithmic evolution, via a suitable relaxation. The combination of relaxation and duality steps gives rise to a simple and efficient distributed algorithm that overcomes some limitations of the distributed dual subgradient algorithm (cf. Section 3.4.2) related to primal recovery.

Algorithm 6 formally states the RSDD distributed algorithm from the perspective of node $i$ .

Informally, the RSDD algorithm consists of an iterative two-step procedure. Each node $i$ stores a set of variables $((\mathbf{x}_{i},\rho_{i}),\boldsymbol{\mu}_{i})$ , obtained as a primal-dual optimal solution pair of problem (3.31). The vector $\boldsymbol{\mu}_{i}$ is the multiplier associated to the local inequality constraint $\mathbf{g}_{i}(\mathbf{x}_{i})+\sum_{j\in\mathcal{N}_{i}}(\boldsymbol{\lambda}_{ij}^{t}-\boldsymbol{\lambda}_{ji}^{t})\leq\rho_{i}\mathbf{1}$ . Notice that problem (3.31) mimics a local version of the original problem (3.25), where the coupling with the other nodes is replaced by a local term depending only on neighboring variables $\boldsymbol{\lambda}_{ij}$ and $\boldsymbol{\lambda}_{ji}$ , $j\in\mathcal{N}_{i}$ . Moreover, this local version of the coupling constraint is also relaxed, i.e., a positive violation $\rho_{i}\mathbf{1}$ is allowed. Finally, instead of minimizing only the local function $f_{i}$ , the (scaled) violation $M\rho_{i}$ , $M>0$ , enters the cost function as well. The auxiliary variables $\boldsymbol{\lambda}_{ij}$ , $j\in\mathcal{N}_{i}$ , are updated in a second step according to a linear law which combines neighboring $\boldsymbol{\mu}_{i}$ as shown in (3.32). Nodes initialize their variables $\boldsymbol{\lambda}_{ij}^{0}$ , $j\in\mathcal{N}_{i}$ , to arbitrary values.

The RSDD algorithm enjoys the same appealing features of the distributed dual subgradient algorithm mentioned in Section 3.4.2, i.e., nicely scaling local computation and preservation of information privacy. Moreover, a peculiarity of RSDD is that an estimate of a primal optimal solution component is directly computed by each agent without any averaging mechanism, which results in a faster algorithm.

Consider the same network framework as in Section 2.1 and let Assumption 23 hold. We now present the convergence result of RSDD.

Theorem 25.

Let Assumption 23 hold. Let the communication graph be undirected and connected and let the step-size $\gamma^{t}$ satisfy Assumption 7. Moreover, letting $\boldsymbol{\mu}{}^{\star}$ be an optimal solution of the dual of problem (3.25), assume that $M$ is sufficiently large, namely $M>\|\boldsymbol{\mu}{}^{\star}\|_{1}$ . Consider a sequence $\big{\{}\mathbf{x}_{i}^{t},\rho_{i}^{t}\big{\}}_{t\geq 0}$ , $i\in\{1,\ldots,N\}$ , generated by Algorithm 6. Then, the following holds:

(i) the sequence $\big{\{}\sum_{i=1}^{N}\big{(}f_{i}(\mathbf{x}_{i}^{t})+M\rho_{i}^{t}\big{)}\big{\}}_{t\geq 0}$ converges to the optimal cost $f^{\star}$ of (3.25);

(ii) every limit point of $\big{\{}\mathbf{x}_{i}^{t}\big{\}}_{t\geq 0}$ , $i\in\{1,\ldots,N\}$ , is a primal optimal (feasible) solution of (3.25). $\square$

The proof of Theorem 25 can be found in [102].

In [103], Algorithm 6 has been interpreted as a distributed primal decomposition method and has been used to solve mixed-integer linear programs by means of a suitable coupling constraint restriction. The main challenge is due to the presence of local constraint sets $X_{i}$ that are mixed-integer polyhedra (i.e., with some of the components constrained to be integer, see also Section 4.3.2).

Remark 26.

Another important optimization set-up in smart grid applications arises in so-called Demand Side Management (DSM) programs [104]. As an example, a cooperative DSM task has the goal of reducing the hourly and daily variations and peaks of electric demand by optimizing generation, storage and consumption. A widely adopted objective in DSM programs is Peak-to-Average Ratio (PAR), which gives rise to the following min-max optimization problem

[TABLE]

where $p\in\mathbb{R}$ represents the peak value that agents want to shave. A duality-based approach similar to the one leading to the RSDD distributed algorithm has been proposed and analyzed in [105, 106] for solving problem (3.33). $\square$

3.5 Discussion and References

Early popular tutorials on parallel and distributed optimization based on duality are [90, 9, 89]. Distributed algorithms based on the Alternating Direction Method of Multipliers are proposed in [95, 96, 97, 107, 108, 109]. Convergence rates for ADMM-based algorithms are provided in [110, 98, 111, 112]. A distributed algorithm combining a linearization approach with ADMM has been proposed in [113], while quadratic approximations have been explored in [114]. A fast distributed ADMM algorithm for quadratic problems is devised in [115]. A more general ADMM framework is considered in [116], where an explicit convergence rate has been provided. An application of the distributed ADMM algorithm to an online optimization scenario (i.e., with time-varying cost function) is analyzed in [117]. An asynchronous version of the distributed ADMM algorithm is proposed in [118].

Primal-dual algorithms for constrained optimization over networks are given in [119, 120]. A primal-dual perturbation approach is explored in the paper [121]. An asynchronous version of such algorithm class is provided in [122]. Augmented Lagrangian algorithms for directed gossip networks are analyzed in [123]. Continuous-time, Lagrangian-based, distributed algorithms are investigated in [124, 125, 126, 127, 128, 129]. A distributed saddle-point algorithm for robust linear programs is proposed in [130]. A saddle-point method for distributed, continuous-time, online optimization is proposed in [131]. An asynchronous, primal-dual, cloud-based algorithm for distributed convex optimization is provided in [132]. An asynchronous algorithm which allows the presence of local nonconvex constraints is presented in [133].

A dual averaging approach for distributed optimization is proposed in [134]. A push-sum version for directed networks is analyzed in [135], while an extension for online optimization is given in [136]. A fully distributed dual gradient algorithm to minimize linearly constrained separable convex problems, with linear convergence rate, is given in [137]. A distributed dual fast gradient algorithm, with sublinear rate, is proposed in [138] for linearly constrained separable convex optimization problems. An asynchronous version of the distributed dual decomposition with composite costs is proposed in [139]. An extension to a partitioned set-up is provided in [92]. A time-varying distributed algorithm based on Fenchel duality is provided in [140]. Papers [141, 100] investigate distributed dual subgradient methods for constraint-coupled optimization. In [142] an ADMM approach for the same set-up is proposed in which multiple consensus steps are needed. Dual decomposition techniques applied to control problems are proposed in [143, 144]. In [145] a distributed Jacobi algorithm for convex optimization problems, arising in distributed model predictive control, is presented. A fast dual gradient algorithm for network utility maximization is proposed in [146].

3.6 Numerical Example

In this section, we provide numerical examples of the algorithms presented in this chapter. Since we considered algorithms for both the cost-coupled set-up and the constraint-coupled set-up, we analyze the examples in two separate subsections.

As done in Chapter 2, we consider a network of $N=10$ agents communicating over a fixed, undirected, connected graph generated according to an Erdős-Rényi random model with parameter $p=0.2$ . For the algorithms embedded with consensus iterations, we assume agents are equipped with a doubly stochastic matrix built according to the Metropolis-Hastings rule [87], i.e.,

[TABLE]
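A minimal Python sketch of how such weights can be built is given below. It assumes the commonly used form of the Metropolis-Hastings rule, $a_{ij}=1/(1+\max\{d_{i},d_{j}\})$ for neighboring agents (with $d_{i}$ the degree of agent $i$), $a_{ii}=1-\sum_{j\in\mathcal{N}_{i}}a_{ij}$, and zero otherwise; the function name and the adjacency-list input are hypothetical.

```python
def metropolis_weights(neighbors):
    """Build a weight matrix from adjacency lists of an undirected graph.

    Assumed rule: a_ij = 1 / (1 + max(d_i, d_j)) for neighbors, a_ii = 1 - sum_j a_ij.
    neighbors[i] must not contain i itself (no self-loops).
    """
    n = len(neighbors)
    deg = [len(nb) for nb in neighbors]
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in neighbors[i]:
            A[i][j] = 1.0 / (1 + max(deg[i], deg[j]))
        # The diagonal entry absorbs the remaining mass so that each row sums to one.
        A[i][i] = 1.0 - sum(A[i])
    return A


# Illustrative data: a path graph 0 - 1 - 2.
A_path = metropolis_weights([[1], [0, 2], [1]])
```

Since the rule is symmetric in $i$ and $j$, the resulting matrix is symmetric with unit row sums, hence doubly stochastic.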

3.6.1 Cost-coupled Example

In this subsection, we assume that $N$ agents aim to cooperatively solve the cost-coupled quadratic program

[TABLE]

where each $Q_{i}\in\mathbb{R}^{5\times 5}$ is randomly generated such that its eigenvalues are drawn uniformly from $[1,10]$ . We compare distributed ADMM (cf. Algorithm 4) with $\rho=0.1$ , and distributed dual decomposition (cf. Algorithm 3) with diminishing step-size $\gamma^{t}=(1/t)^{0.7}$ .

As for distributed ADMM, in Figure 3.5 we show the cost convergence rate, i.e., the evolution of $|\sum_{i=1}^{N}f_{i}(\mathbf{x}_{i}^{t})-f^{\star}|/|f^{\star}|$ .

In Figure 3.6, we show the consensus error of the local solution estimates, i.e., $\|\mathbf{x}_{i}^{t}-\bar{\mathbf{x}}^{t}\|$ for all $i$ , where $\bar{\mathbf{x}}^{t}=1/N\cdot\sum_{i=1}^{N}\mathbf{x}_{i}^{t}$ .
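The two performance metrics used in these plots are straightforward to compute from the local estimates. The following Python helpers are a sketch with hypothetical names (`relative_cost_error`, `consensus_errors`); the local estimates are assumed to be given as lists of floats.

```python
import math


def relative_cost_error(local_costs, f_star):
    """Return |sum_i f_i(x_i) - f_star| / |f_star| given the local cost values."""
    return abs(sum(local_costs) - f_star) / abs(f_star)


def consensus_errors(xs):
    """Return ||x_i - x_bar|| for each local estimate x_i, with x_bar the average."""
    n, d = len(xs), len(xs[0])
    x_bar = [sum(x[k] for x in xs) / n for k in range(d)]
    return [math.sqrt(sum((x[k] - x_bar[k]) ** 2 for k in range(d))) for x in xs]


# Illustrative data: two agents in the plane.
errors = consensus_errors([[0.0, 0.0], [2.0, 0.0]])
```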

As regards distributed dual decomposition, in Figure 3.7 we show cost convergence. That is, we plot the evolution of primal and dual cost error, i.e., $|\sum_{i=1}^{N}f_{i}(\mathbf{x}_{i}^{t})-f^{\star}|/|f^{\star}|$ and $|q(\boldsymbol{\Lambda}^{t})-f^{\star}|/|f^{\star}|$ . As expected for a dual method, dual cost converges faster than primal cost.

Finally, in Figure 3.8 we show consensus error of the local solution estimates, i.e., $\|\mathbf{x}_{i}^{t}-\bar{\mathbf{x}}^{t}\|$ for all $i$ , where $\bar{\mathbf{x}}^{t}=1/N\cdot\sum_{i=1}^{N}\mathbf{x}_{i}^{t}$ .

3.6.2 Constraint-coupled Example

In this subsection, we consider the microgrid control problem introduced in Section 1.3.6, where we assume we have a heterogeneous network of $N=10$ units with $4$ generators, $3$ storage devices, $2$ controllable loads and $1$ connection to the main grid. We assume that in the distributed MPC scheme each unit predicts its power generation strategy over a horizon of $S=12$ slots. The optimization problem to be solved has the form (cf. Section 1.3.6)

[TABLE]

We compare RSDD (cf. Algorithm 6) and distributed dual subgradient (cf. Algorithm 5). For both algorithms, we use the diminishing step-size $\gamma^{t}=0.1\cdot(1/t)^{0.7}$ . For RSDD, we set $M=10\cdot\|\boldsymbol{\mu}^{\star}\|_{1}$ , where $\boldsymbol{\mu}^{\star}$ is a dual optimal solution of the problem (3.35) computed by a centralized solver.
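As a quick check on this step-size choice, assume that Assumption 7 is the usual requirement of a non-summable, square-summable sequence (the assumption is stated in Chapter 2 and is not repeated here). Then any step-size of the form $\gamma^{t}=c\,(1/t)^{0.7}$ with $c>0$ qualifies, by comparison with $p$-series:

$$
\sum_{t=1}^{\infty}\gamma^{t}=c\sum_{t=1}^{\infty}\frac{1}{t^{0.7}}=\infty\quad(0.7\leq 1),
\qquad
\sum_{t=1}^{\infty}(\gamma^{t})^{2}=c^{2}\sum_{t=1}^{\infty}\frac{1}{t^{1.4}}<\infty\quad(1.4>1).
$$

The same argument covers $\gamma^{t}=(1/t)^{0.7}$, used in the cost-coupled example, with $c=1$.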

In Figure 3.9 we compare the convergence rate of RSDD and of distributed dual subgradient. In particular, for the RSDD algorithm, we plot the difference between the optimal cost $f^{\star}$ and the sum of local costs $\sum_{i=1}^{N}f_{i}(\mathbf{x}_{i}^{t})$ , normalized by $f^{\star}$ . For the distributed dual subgradient algorithm, we plot the difference between the optimal cost $f^{\star}$ and the sum of local costs $\sum_{i=1}^{N}f_{i}(\widehat{\mathbf{x}}_{i}^{t})$ , normalized by $f^{\star}$ , where $\widehat{\mathbf{x}}_{i}^{t}$ denotes the $i$ -th running average of the local Lagrangian minimizers $\mathbf{x}_{i}^{t}$ .

For the RSDD algorithm, in Figure 3.10, we show the algorithmic evolution of the sum of the penalty parameters $\rho_{i}^{t}$ and the maximum violation of the coupling constraint at each iteration $t$ .

Finally, in Figure 3.11 we show how $\sum_{j\in\mathcal{N}_{i}}(\boldsymbol{\lambda}_{ij}^{t}-\boldsymbol{\lambda}_{ji}^{t})$ compares with the unknown part of the coupling constraint of each agent $i$ , namely $\sum_{j\neq i}\mathbf{g}_{j}(\mathbf{x}_{j}^{t})$ . Specifically, for all $i$ , we plot the quantity

[TABLE]

The picture highlights that $\sum_{j\in\mathcal{N}_{i}}(\boldsymbol{\lambda}_{ij}^{t}-\boldsymbol{\lambda}_{ji}^{t})$ acts as a “tracker” of the maximum of the contribution in the coupling constraint due to all the other agents $j\neq i$ in the network.

Chapter 4 Constraint Exchange Methods

In this chapter, we present distributed optimization algorithms based on the exchange of constraints among agents. These algorithms are structurally different from the ones described in Chapters 2 and 3, since the information exchanged by agents (encoding the local solution estimate) amounts to constraints rather than decision variables. We start by introducing the so-called Constraints Consensus algorithm for convex and abstract programs (abstract programs are a generalization of linear programs, see, e.g., [147, 148]). Following the same approach as in the previous chapter, we present and analyze the algorithm for a simplified optimization set-up, namely linear programs, and then discuss how it in fact applies to general convex and abstract programs. Then, we present other methods based on the constraint exchange approach that generalize Constraints Consensus. Finally, we provide a numerical example to show the main characteristics of the presented methods.

4.1 Constraints Consensus applied to Linear Programs

In this section, we present and analyze a simplified version, applied to Linear Programs (LPs), of the Constraints Consensus algorithm [148]. First, we give some intuition on the algorithm together with its formal description. Then, we provide a convergence analysis, and we briefly mention a variant of the algorithm in which agents exchange “columns” instead of constraints.

4.1.1 Algorithm description

Consider a network of $N$ agents that aim to solve the linear program

[TABLE]

where $\mathbf{x}\in\mathbb{R}^{d}$ is the optimization variable, $c\in\mathbb{R}^{d}$ is the cost vector, and $a_{i}\in\mathbb{R}^{d}$ and $b_{i}\in\mathbb{R}$ , $i\in\{1,\ldots,N\}$ . Notice that problem (4.1) is an instance of the common cost set-up described in Section 1.2.2. For ease of presentation, we suppose that each agent $i$ knows only the constraint $a_{i}^{\top}\mathbf{x}\leq b_{i}$ , and we say that this is the initial constraint of agent $i$ . Also, we make the standing assumption that the number of agents is greater than the dimension of the variable, i.e., $N>d$ .

To convey the idea underlying the Constraints Consensus algorithm, let us elaborate on optimization problems in the form of (4.1). It is known from linear programming theory (cf. Appendix C) that the feasible set of problem (4.1) is polyhedral and that, if $\mathbf{x}^{\star}$ is an optimal vertex (i.e., an optimal solution attained at a vertex of the feasible set), then there exists a basis, consisting of exactly $d$ linearly independent inequality constraints $a_{\ell_{1}}^{\top}\mathbf{x}\leq b_{\ell_{1}},\ldots,a_{\ell_{d}}^{\top}\mathbf{x}\leq b_{\ell_{d}}$ , for some indices $\{\ell_{1},\ldots,\ell_{d}\}\subseteq\{1,\ldots,N\}$ . Such a basis allows for the computation of $\mathbf{x}^{\star}$ as the (unique) optimal vertex of the linear program

[TABLE]

obtained as a relaxation of problem (4.1) by considering the constraints in the basis only. Roughly speaking, in the Constraints Consensus algorithm, each agent iteratively solves a relaxation of problem (4.1), with constraints given by its initial constraint and constraints collected from neighbors, and computes an optimal solution with its corresponding basis. Then, the basis is broadcast to neighbors and the process is repeated until convergence. A natural question arising at this point is how to handle problems with multiple optimal solutions. For such problems, in order to guarantee convergence of the scheme, it is necessary for agents to select a common solution. A possible approach to guarantee agent agreement is to employ a lexicographic criterion (see Appendix C for a formal description), i.e., agents compute the lexicographically minimal optimal solution, termed lex-optimal solution, of the LPs through an appropriate local lexicographic solver. Thus, in the remainder of this section, we will stick to the following definition of basis.

Definition 27.

Let $\mathbf{x}^{\star}$ be the lex-optimal solution of a linear program in the form (4.1). A collection of $d$ inequality constraints $a_{\ell_{1}}^{\top}\mathbf{x}\leq b_{\ell_{1}},\ldots,a_{\ell_{d}}^{\top}\mathbf{x}\leq b_{\ell_{d}}$ , for some indices $\{\ell_{1},\ldots,\ell_{d}\}\subseteq\{1,\ldots,N\}$ , is called a basis of (4.1) if $\mathbf{x}^{\star}$ is the lex-optimal solution of

[TABLE]

This definition is specifically tailored for linear programs. In fact, it can be obtained as a special version of a more general definition of basis that holds for so-called abstract programs, see, e.g., [147, 148]. From now on, we compactly denote a basis as $(P,q)$ , where $P\in\mathbb{R}^{d\times d}$ is the matrix obtained by stacking the row vectors $a_{\ell_{h}}^{\top}$ and $q\in\mathbb{R}^{d}$ is the vector obtained by stacking the scalars $b_{\ell_{h}}$ , i.e.,

[TABLE]

Notice that, even if the lex-optimal solution of an LP is unique, there might be several bases associated to the problem. In Figure 4.1, an example scenario in $\mathbb{R}^{2}$ is graphically represented.

Next, we describe the Constraints Consensus algorithm applied to problem (4.1). We assume that the agents communicate according to a jointly strongly connected (time-varying) directed graph $\mathcal{G}^{t}$ (cf. Section 1.1), and we denote by $\mathcal{N}_{i}^{t}$ the in-neighbors of agent $i$ at communication round $t$ (in a synchronous algorithm the term iteration is better suited; since the Constraints Consensus algorithm can also be implemented in an asynchronous setting, we prefer the term communication round). Each agent $i$ maintains a local solution estimate $\mathbf{x}_{i}^{t}$ and a local basis $(P_{i}^{t},q_{i}^{t})$ . The basis is initialized to $(a_{i}^{\top},b_{i})$ and is incrementally filled as the agent collects information from neighbors during the algorithm evolution. At each communication round $t$ , agent $i$ first gathers the bases from its neighbors, then it constructs a (small) local LP with constraints given by the aggregation of: (i) the old basis, (ii) the collected bases from neighbors, and (iii) its initial constraint. Then, the agent finds a basis for the local LP to update its state. Finally, the updated basis is broadcast to neighbors. Notice that the local LP can be unbounded. Thus, an artificial (sufficiently large) bounding box, denoted as $-M\mathbf{1}\leq\mathbf{x}\leq M\mathbf{1}$ , with $M>0$ , is added to ensure that the algorithm is well posed at each communication round; as a consequence, the bounding box can become part of the local bases. If $M$ is sufficiently large, the lex-optimal solution of problem (4.1) is contained in the bounding box and the bounding box will eventually leave the local bases. Algorithm 7 formally summarizes the Constraints Consensus algorithm applied to linear programs from the perspective of node $i$ .
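To make the flow of a communication round concrete, the following Python sketch runs the scheme on a toy LP in $\mathbb{R}^{2}$. It is only an illustration under several assumptions that are not part of the algorithm statement: the local LP is solved by brute-force vertex enumeration (not by a proper lexicographic solver), the data are non-degenerate so that the two constraints active at the optimal vertex form a basis, the graph is a fixed directed ring, and rounds are synchronous. All names (`solve_lp_2d`, `constraints_consensus`, `init`, `in_neighbors`) are hypothetical.

```python
from itertools import combinations


def solve_lp_2d(c, cons, tol=1e-9):
    """Brute-force 2-D LP: min c.x subject to a.x <= b for every (a, b) in cons.

    Enumerates the vertices defined by pairs of constraints and returns the best
    feasible one together with the pair defining it (used here as the basis).
    Cost ties are broken lexicographically on x.
    """
    best = None
    for (a1, b1), (a2, b2) in combinations(cons, 2):
        det = a1[0] * a2[1] - a1[1] * a2[0]
        if abs(det) < tol:
            continue                      # parallel constraints define no vertex
        x = ((b1 * a2[1] - b2 * a1[1]) / det, (a1[0] * b2 - a2[0] * b1) / det)
        if any(a[0] * x[0] + a[1] * x[1] > b + 1e-7 for a, b in cons):
            continue                      # vertex violates some constraint
        key = (round(c[0] * x[0] + c[1] * x[1], 9), x)
        if best is None or key < best[0]:
            best = (key, x, [(a1, b1), (a2, b2)])
    if best is None:
        raise ValueError("local LP is infeasible")
    return best[1], best[2]


def constraints_consensus(c, init, in_neighbors, M=100.0, rounds=10):
    """Synchronous rounds of basis exchange; init[i] is the constraint of agent i."""
    box = [((1.0, 0.0), M), ((-1.0, 0.0), M), ((0.0, 1.0), M), ((0.0, -1.0), M)]
    n = len(init)
    bases = [[h] for h in init]           # each basis starts from the initial constraint
    x = [None] * n
    for _ in range(rounds):
        new_bases = []
        for i in range(n):
            # Local LP: old basis, initial constraint, bounding box, neighbors' bases.
            cons = list(bases[i]) + [init[i]] + box
            for j in in_neighbors[i]:
                cons += bases[j]
            cons = list(dict.fromkeys(cons))   # drop duplicates, keep order
            x[i], basis = solve_lp_2d(c, cons)
            new_bases.append(basis)
        bases = new_bases                 # all agents update simultaneously
    return x


# Illustrative LP: maximize x1 + 2*x2 over x1 <= 3, x2 <= 2, x1 + x2 <= 4, x >= 0.
c = (-1.0, -2.0)
init = [((1.0, 0.0), 3.0), ((0.0, 1.0), 2.0), ((1.0, 1.0), 4.0),
        ((-1.0, 0.0), 0.0), ((0.0, -1.0), 0.0)]
ring = [[4], [0], [1], [2], [3]]          # agent i receives from agent i - 1
estimates = constraints_consensus(c, init, ring)
```

In this instance the optimal vertex is $(2,2)$, defined by the constraints $x_{2}\leq 2$ and $x_{1}+x_{2}\leq 4$. The early local LPs are bounded only thanks to the bounding box, whose constraints leave the local bases once enough constraints have been collected.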

In Section 4.1.2, we analyze the convergence of Algorithm 7.

Let us now highlight the differences of the constraint exchange approach with respect to the other approaches discussed in this survey. First, note that in primal methods (cf. Chapter 2), consensus of the agents on a common optimal solution is enforced by means of consensus iterations that steer the local quantities to a common value, whereas in Algorithm 7, consensus follows because eventually the lex-optimal solution of the local problems (4.3) is the same. Second, the communication network assumptions of constraint exchange methods are generally very weak. For instance, Algorithm 7 only requires joint strong connectivity, which allows for an asynchronous implementation of the algorithm (cf. Section 1.1), and allows for unreliable communication links (e.g., subject to packet loss). Also, it is worth mentioning that if the network consists of a large number of agents with relatively small in-degree, the local optimization problem (4.3) solved at each iteration is much smaller than the original problem (4.1), so that the algorithm scales nicely with the network size. This is also corroborated by the fact that the communication is bounded: each exchanged basis always consists of $d$ constraints, except in the early stages of the algorithm in which fewer than $d$ constraints are available. Finally, note that Algorithm 7 does not require global tuning parameters (e.g., the step-size).

4.1.2 Convergence Analysis

In this subsection, we analyze the convergence of Algorithm 7. The proof reported in this survey is different from the one in [148], which was devised for general abstract programs. Here we present a new proof inspired by the arguments used in [149]. Let us make the following assumption on problem (4.1).

Assumption 28.

Problem (4.1) is feasible and the lex-optimal solution exists (for a discussion on the existence of the lex-optimal solution, see Appendix C). $\square$

In the following, we prove that Algorithm 7 enjoys finite-time convergence. The line of proof relies on three facts, namely (i) (finite-time) convergence of the solution estimates computed by each agent (Lemma 29), (ii) consensus of the solution estimates at convergence (Lemma 30), (iii) optimality of the consensual solution estimates.

In the next lemma we prove that the quantities computed by each agent converge in finite time.

Lemma 29 (Local convergence).

Let Assumption 28 hold. Then, for all $i\in\{1,\ldots,N\}$ ,

(i) the cost sequence $\{c^{\top}\mathbf{x}_{i}^{t}\}_{t\geq 0}$ is monotonically non-decreasing and converges in finite time, i.e., there exist $T_{i}>0$ and $\bar{J}_{i}\in\mathbb{R}$ such that

$$c^{\top}\mathbf{x}_{i}^{t}=\bar{J}_{i},\quad\text{for all }t\geq T_{i};$$

(ii) the solution estimate sequence $\{\mathbf{x}_{i}^{t}\}_{t\geq 0}$ converges in finite time to a vector satisfying the initial constraint of agent $i$, i.e., there exist $T_{i}^{\prime}>0$ and $\bar{\mathbf{x}}_{i}$ such that

$$\mathbf{x}_{i}^{t}=\bar{\mathbf{x}}_{i},\quad\text{for all }t\geq T_{i}^{\prime},\qquad\text{with }a_{i}^{\top}\bar{\mathbf{x}}_{i}\leq b_{i}.$$

Proof.

For the sake of analysis, let us denote by $J_{i}^{t}\triangleq c^{\top}\mathbf{x}_{i}^{t}$ the cost associated to $\mathbf{x}_{i}^{t}$ . To prove (i), we consider problem (4.3) at consecutive communication rounds, say $t$ and $t+1$ . The lex-optimal solution of problem (4.3) is $\mathbf{x}_{i}^{t+1}$ , with cost $J_{i}^{t+1}=c^{\top}\mathbf{x}_{i}^{t+1}$ and $(P_{i}^{t+1},q_{i}^{t+1})$ an associated basis. Thus, $\mathbf{x}_{i}^{t+1}$ is the lex-optimal solution of

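$$\min_{\mathbf{x}}\;c^{\top}\mathbf{x}\quad\text{subj. to }\;P_{i}^{t+1}\mathbf{x}\leq q_{i}^{t+1},\tag{4.4}$$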

with optimal cost $J_{i}^{t+1}$ . At the successive communication round $t+1$ , the lex-optimal solution of the local problem (4.3) does not violate any constraint of problem (4.4). Thus, it holds $J_{i}^{t+2}\geq J_{i}^{t+1}$ . Therefore, we conclude that the cost sequence is monotonically non-decreasing, i.e., for all $t\geq 0$ ,

$$J_{i}^{t+1}\geq J_{i}^{t}.$$

Also, because of the bounding box, the feasible set of problem (4.3) is bounded, so that the monotone sequence $\{J_{i}^{t}\}_{t\geq 0}$ is bounded and hence converges. Finally, since there is a finite number of constraints in the network, $J_{i}^{t}$ can only assume a finite number of values (corresponding to all the possible combinations of constraints). Thus, $\{J_{i}^{t}\}_{t\geq 0}$ converges in finite time, i.e., there exist $T_{i}>0$ and $\bar{J}_{i}\in\mathbb{R}$ such that

$$J_{i}^{t}=\bar{J}_{i},\quad\text{for all }t\geq T_{i}.$$

To prove (ii), let us consider the sequence of the first component of $\mathbf{x}_{i}^{t}$ for $t\geq T_{i}$ , i.e., $\{\mathbf{x}_{i,1}^{t}\}_{t\geq T_{i}}$ . First, notice that the cost associated to such sequence is identically equal to $\bar{J}_{i}$ , i.e., $c^{\top}\mathbf{x}_{i}^{t}=\bar{J}_{i}$ for all $t\geq T_{i}$ . In the following, we apply ideas similar to (i), namely we consider problem (4.3) at consecutive communication rounds, say $t$ and $t+1$ , with $t\geq T_{i}$ . The lex-optimal solution of problem (4.3) is $\mathbf{x}_{i}^{t+1}$ , with first component $\mathbf{x}_{i,1}^{t+1}$ and $(P_{i}^{t+1},q_{i}^{t+1})$ an associated basis. Thus, $\mathbf{x}_{i}^{t+1}$ is the lex-optimal solution of

[TABLE]

At the successive communication round $t+1$ , the optimal cost stays equal to $\bar{J}_{i}$ and the lex-optimal solution of the local problem (4.3) does not violate any constraint of problem (4.5). Thus, since the local lexicographic solver selects the optimal solution with minimal first component, it follows that $\mathbf{x}_{i,1}^{t+2}\geq\mathbf{x}_{i,1}^{t+1}$ . Therefore, we conclude that the sequence $\{\mathbf{x}_{i,1}^{t}\}_{t\geq T_{i}}$ is monotonically non-decreasing, i.e., for all $t\geq T_{i}$ ,

$$\mathbf{x}_{i,1}^{t+1}\geq\mathbf{x}_{i,1}^{t}.$$

Also, because of the bounding box, the feasible set of problem (4.3) is bounded, so that $\{\mathbf{x}_{i,1}^{t}\}_{t\geq 0}$ converges. Finally, since there is a finite number of constraints in the network, $\mathbf{x}_{i,1}^{t}$ can only assume a finite number of values (corresponding to all the possible combinations of constraints). Thus, $\{\mathbf{x}_{i,1}^{t}\}_{t\geq 0}$ converges in finite time, i.e., there exist $T_{i}^{\prime}>0$ and $\bar{\mathbf{x}}_{i,1}\in\mathbb{R}$ such that

$$\mathbf{x}_{i,1}^{t}=\bar{\mathbf{x}}_{i,1},\quad\text{for all }t\geq T_{i}^{\prime}.$$

By repeating the same arguments for each of the subsequent components of $\mathbf{x}_{i}^{t}$ for $t\geq T_{i}^{\prime}$ , we are able to conclude that $\{\mathbf{x}_{i}^{t}\}_{t\geq 0}$ converges in finite time to some $\bar{\mathbf{x}}_{i}$ , which by construction satisfies $a_{i}^{\top}\bar{\mathbf{x}}_{i}\leq b_{i}$ . ∎

In the following lemma, we prove that the solution estimates to which agents converge are consensual.

Lemma 30 (Consensus).

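Let the communication graph be jointly strongly connected. Moreover, assume that the sequences computed by agents have converged, i.e., there exists $T_{0}>0$ such that for all $i\in\{1,\ldots,N\}$ it holds

$$c^{\top}\mathbf{x}_{i}^{t}=\bar{J}_{i},\qquad\mathbf{x}_{i}^{t}=\bar{\mathbf{x}}_{i},\qquad\text{for all }t\geq T_{0},$$

for some $\bar{J}_{i}\in\mathbb{R}$ and $\bar{\mathbf{x}}_{i}\in\mathbb{R}^{d}$. Then, it holds

$$\bar{J}_{1}=\ldots=\bar{J}_{N},\qquad\bar{\mathbf{x}}_{1}=\ldots=\bar{\mathbf{x}}_{N}.$$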

Proof.

For the sake of analysis, let us denote by $J_{i}^{t}\triangleq c^{\top}\mathbf{x}_{i}^{t}$ the cost associated to $\mathbf{x}_{i}^{t}$ . By contradiction, assume that there exist two different agents $i$ and $j$ such that $\bar{J}_{i}\neq\bar{J}_{j}$ . Without loss of generality, let $\bar{J}_{j}>\bar{J}_{i}$ .

By finite-time convergence of the cost sequences, there exists $T_{0}>0$ such that $J_{j}^{t}=\bar{J}_{j}>\bar{J}_{i}=J_{i}^{t}$ for all $t\geq T_{0}$ . Moreover, since the communication graph is jointly strongly connected, for all $t\geq T_{0}$ and each pair of agents $(i,j)$ , there exists a sequence of time instants $\tau_{1},\ldots,\tau_{k}$ , with $t\leq\tau_{1}<\ldots<\tau_{k}$ , and a sequence of nodes $\nu_{1},\ldots,\nu_{k-1}$ , such that the directed edges $(j,\nu_{1}),(\nu_{1},\nu_{2}),\ldots,(\nu_{k-1},i)$ belong to the digraph at times $\tau_{1},\ldots,\tau_{k}$ (cf. [148]).

At communication round $\tau_{1}$, agent $\nu_{1}$ computes $\mathbf{x}_{\nu_{1}}^{\tau_{1}+1}$ by minimizing $c^{\top}\mathbf{x}$ over a feasible set whose constraints include the basis associated to $\mathbf{x}_{j}^{\tau_{1}}$ (by construction), so that $J_{\nu_{1}}^{\tau_{1}+1}\geq J_{j}^{\tau_{1}}$. Similarly, at communication round $\tau_{2}$, agent $\nu_{2}$ computes $\mathbf{x}_{\nu_{2}}^{\tau_{2}+1}$ by minimizing $c^{\top}\mathbf{x}$ over a feasible set whose constraints include the basis associated to $\mathbf{x}_{\nu_{1}}^{\tau_{2}}$. Thus, it holds

$$J_{\nu_{2}}^{\tau_{2}+1}\geq J_{\nu_{1}}^{\tau_{2}}.$$

Since the cost sequences have converged, it follows that $\bar{J}_{\nu_{1}}=J_{\nu_{1}}^{\tau_{2}}=J_{\nu_{1}}^{\tau_{1}+1}$ . Thus, it holds

$$J_{\nu_{2}}^{\tau_{2}+1}\geq J_{\nu_{1}}^{\tau_{2}}=J_{\nu_{1}}^{\tau_{1}+1}\geq J_{j}^{\tau_{1}}.$$

The argument can be iterated to conclude that $J_{i}^{\tau_{k}+1}\geq J_{j}^{\tau_{1}}$ . Therefore, for all $t>T_{0}$ there exists $\theta_{ij}>0$ such that

$$J_{i}^{t+\theta_{ij}}\geq J_{j}^{t},$$

contradicting the assumption $\bar{J}_{j}>\bar{J}_{i}$ . Thus, $\bar{J}_{1}=\ldots=\bar{J}_{N}$ , which concludes the first part of the proof. To prove consensus of the solutions, we note that for all $t\geq T_{0}$ , $c^{\top}\mathbf{x}_{1}^{t}=\ldots=c^{\top}\mathbf{x}_{N}^{t}$ . Then, it is possible to apply arguments similar to the first part to each component of the solution vector (in lexicographic order, see proof of Lemma 29 (ii)). ∎

With Lemma 29 and Lemma 30 at hand, we are now ready to prove the convergence of Algorithm 7.

Theorem 31.

Let Assumption 28 hold and let the communication graph be jointly strongly connected. Moreover, let $\mathbf{x}^{\star}$ be the lex-optimal solution of problem (4.1) and assume $M>0$ is sufficiently large. Consider the sequences $\{\mathbf{x}_{i}^{t}\}_{t\geq 0},i\in\{1,\ldots,N\}$ , generated by Algorithm 7. Then, for all $i\in\{1,\ldots,N\}$ , the following holds:

(i) the cost sequence $\{c^{\top}\mathbf{x}_{i}^{t}\}_{t\geq 0}$ converges in finite time to the optimal cost $J^{\star}$ of (4.1);

(ii) the solution sequence $\{\mathbf{x}_{i}^{t}\}_{t\geq 0}$ converges in finite time to $\mathbf{x}^{\star}$.

Proof.

For the sake of analysis, let us denote by $J_{i}^{t}\triangleq c^{\top}\mathbf{x}_{i}^{t}$ the cost associated to $\mathbf{x}_{i}^{t}$ . By Lemma 29, the cost sequences $\{J_{i}^{t}\}_{t\geq 0}$ and the solution sequences $\{\mathbf{x}_{i}^{t}\}_{t\geq 0}$ converge in finite time to $\bar{J}_{i}$ and $\bar{\mathbf{x}}_{i}$ respectively, and by construction it holds

$$a_{i}^{\top}\bar{\mathbf{x}}_{i}\leq b_{i},\quad\text{for all }i\in\{1,\ldots,N\}.$$

By Lemma 30, there exist a common scalar $\bar{J}\in\mathbb{R}$ and a common vector $\bar{\mathbf{x}}$ such that $\bar{J}_{i}=\bar{J}$ and $\bar{\mathbf{x}}_{i}=\bar{\mathbf{x}}$ for all $i\in\{1,\ldots,N\}$. Therefore, $\bar{\mathbf{x}}$ is feasible for problem (4.1), since $a_{i}^{\top}\bar{\mathbf{x}}\leq b_{i}$ for all $i$. To prove that $\bar{J}=J^{\star}$, we first note that $\bar{J}\leq J^{\star}$, since each agent builds up the local LP as a relaxation (i.e., with fewer constraints) of the original problem (4.1), and the bounding box is sufficiently large (thus, we can assume that $M>\|\mathbf{x}^{\star}\|_{\infty}$). On the other hand, since $\bar{\mathbf{x}}$ is feasible for problem (4.1), then $J^{\star}\leq c^{\top}\bar{\mathbf{x}}=\bar{J}$, thus implying $\bar{J}=J^{\star}$.

Since we have shown that $\bar{\mathbf{x}}$ is feasible and cost-optimal, so that $\bar{\mathbf{x}}$ is an optimal solution of (4.1), we only have to show that it is the lexicographic minimum among all the minima (i.e., $\bar{\mathbf{x}}=\mathbf{x}^{\star}$ ). By contradiction, suppose it is not. Then, $\mathbf{x}^{\star}\stackrel{{\scriptstyle L}}{{<}}\bar{\mathbf{x}}$ , where the symbol $\stackrel{{\scriptstyle L}}{{<}}$ means that $\mathbf{x}^{\star}$ is lexicographically smaller than $\bar{\mathbf{x}}$ (cf. Appendix C). Now, since $\bar{\mathbf{x}}$ is computed by each agent as the lex-optimal solution to the local problem, there exists a basis $(\bar{P},\bar{q})$ , made up of constraints of problem (4.1), such that $\bar{\mathbf{x}}$ is the lex-optimal solution to

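$$\min_{\mathbf{x}}\;c^{\top}\mathbf{x}\quad\text{subj. to }\;\bar{p}_{h}^{\top}\mathbf{x}\leq\bar{q}_{h},\quad h\in\{1,\ldots,d\},\tag{4.6}$$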

where $\bar{p}_{h}^{\top}\in\mathbb{R}^{1\times d}$ denotes the $h$-th row of $\bar{P}$ and $\bar{q}_{h}\in\mathbb{R}$ denotes the $h$-th entry of $\bar{q}$. But this means that $\mathbf{x}^{\star}$ must be infeasible for problem (4.6), otherwise the lex-optimal solution of (4.6) would be $\mathbf{x}^{\star}$ instead of $\bar{\mathbf{x}}$. Therefore, one of the constraints in (4.6) is violated by $\mathbf{x}^{\star}$, i.e., there exists $h\in\{1,\ldots,d\}$ such that $\bar{p}_{h}^{\top}\mathbf{x}^{\star}>\bar{q}_{h}$. But since the constraints in (4.6) are drawn from problem (4.1), this contradicts the fact that $\mathbf{x}^{\star}$ is feasible for the original LP (4.1). Thus, $\bar{\mathbf{x}}=\mathbf{x}^{\star}$ and the proof follows. ∎

A few remarks on the Constraints Consensus algorithm are in order. In the algorithm analysis we did not prove that the local bases are consensual at convergence. Indeed, agents may compute different bases associated to the lex-optimal solution. A sufficient condition for consensus of bases is the so-called non-degeneracy of problem (4.1) (see also [148]). Finally, a remarkable property of the algorithm is that a fully distributed halting condition can be obtained. Indeed, if the communication graph is fixed, each agent can halt the execution of the algorithm as soon as the locally computed solution stays constant for $2\mathop{\rm diam}(\mathcal{G})+1$ communication rounds [148, Theorem IV.4]. If the communication graph is time-varying and $T$ -strongly connected (cf. Section 1.1), it can be seen that each agent can halt the execution of the algorithm as soon as the locally computed solution stays constant for $2NT+1$ communication rounds.
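The halting rule for a fixed graph can be sketched as follows. This is an illustrative fragment, not code from [148]: the names `diameter` and `HaltingMonitor` are hypothetical, and the convention that "constant for $k$ rounds" means $k$ consecutive updates that leave the estimate unchanged is an assumption.

```python
from collections import deque

def diameter(out_neighbors):
    """Largest shortest-path length (in hops) over all ordered pairs of nodes."""
    diam = 0
    for source in out_neighbors:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            u = queue.popleft()
            for v in out_neighbors[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) < len(out_neighbors):
            raise ValueError("graph is not strongly connected")
        diam = max(diam, max(dist.values()))
    return diam

class HaltingMonitor:
    """Signals a halt once the local estimate is unchanged for 2*diam + 1 rounds."""

    def __init__(self, diam, tol=1e-9):
        self.required = 2 * diam + 1
        self.tol = tol
        self.last = None
        self.unchanged = 0  # consecutive rounds that left the estimate unchanged

    def update(self, x):
        same = self.last is not None and all(
            abs(a - b) <= self.tol for a, b in zip(x, self.last)
        )
        self.unchanged = self.unchanged + 1 if same else 0
        self.last = tuple(x)
        return self.unchanged >= self.required

# Directed ring with 4 nodes (hypothetical example): the diameter is 3, so 7 rounds are required.
RING_OUT = {0: [1], 1: [2], 2: [3], 3: [0]}
MONITOR = HaltingMonitor(diameter(RING_OUT))
TRACE = [(2.0, 100.0), (0.0, 4.0)] + [(1.0, 3.0)] * 8
FLAGS = [MONITOR.update(x) for x in TRACE]
```

Each agent would run its own monitor on its local estimate, so the test uses only locally available quantities once a bound on the diameter is known.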

4.1.3 Distributed Simplex

In this section, we briefly mention a variant of the Constraints Consensus algorithm applied to LPs, namely the Distributed Simplex algorithm [150]. We consider a network of $N$ agents that aim to cooperatively solve linear programs in the so-called standard form, i.e.,

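$$\min_{\mathbf{x}}\;c^{\top}\mathbf{x}\quad\text{subj. to }\;A\mathbf{x}=b,\;\;\mathbf{x}\geq\mathbf{0},\tag{4.7}$$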

where $A\in\mathbb{R}^{d\times N}$, $b\in\mathbb{R}^{d}$ and $c\in\mathbb{R}^{N}$ are the problem data and $\mathbf{x}\in\mathbb{R}^{N}$ is the decision vector. A column of problem (4.7) is defined as the vector

$$\begin{bmatrix}c_{i}\\ a_{i}\end{bmatrix}\in\mathbb{R}^{d+1},$$

where $c_{i}\in\mathbb{R}$ is the $i$-th entry of the vector $c$ and $a_{i}\in\mathbb{R}^{d}$ is the $i$-th column of the matrix $A$.

From a centralized perspective, in the classical simplex method, a set of columns (which for problems in standard form are treated as a basis) is iteratively updated until an optimal solution of problem (4.7) is found. At each iteration, a leaving column exits the basis and is replaced by an entering column. The Distributed Simplex algorithm extends the (centralized) simplex method. Agents are assumed to initially know only a subset of the problem columns. Informally, at every communication round, each agent builds up a (small) local LP with a subset of the problem columns (namely, the old basis and the bases collected from neighbors). Then, the local LP is solved, and a basis associated to the optimal solution is found and sent to neighbors. It can be shown that the evolution of the Distributed Simplex algorithm applied to problem (4.7) is tightly linked to the evolution of the Constraints Consensus algorithm applied to the dual of problem (4.7) (see [150, Proposition 5.3]).

4.2 Constraints Consensus for Convex and Abstract Programs

In this section, we describe the Constraints Consensus algorithm for more general set-ups than problem (4.1). Formally, assume $N$ agents aim to cooperatively solve the convex program

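$$\min_{\mathbf{x}}\;c^{\top}\mathbf{x}\quad\text{subj. to }\;\mathbf{x}\in\bigcap_{i=1}^{N}X_{i},\tag{4.8}$$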

where $c\in\mathbb{R}^{d}$ is the cost vector and the sets $X_{i}$ are subsets of $\mathbb{R}^{d}$, for all $i\in\{1,\ldots,N\}$. Problem (4.8) is in the common-cost form (cf. Section 1.2.2), and we suppose that, for all $i$, the set $X_{i}$ is known by agent $i$ only and that the cost vector $c$ is globally known. Notice that the linear cost function in problem (4.8) results in no loss of generality, as discussed in Remark 34. We make the following assumption.

Assumption 32.

Problem (4.8) is feasible and the sets $X_{i}$ are convex and compact, for all $i\in\{1,\ldots,N\}$ . $\square$

The Constraints Consensus algorithm applied to problem (4.8) can be formalized by extending the concept of basis (cf. Definition 27) so as to consider the (possible) nonlinear nature of the local constraints $X_{i}$ . Formally, let $\mathbf{x}^{\star}$ be the lex-optimal solution of problem (4.8). Then, the collection of $\delta$ constraints $X_{\ell_{1}},\ldots,X_{\ell_{\delta}}$ , for some indices $\{\ell_{1},\ldots,\ell_{\delta}\}\subseteq\{1,\ldots,N\}$ , are a basis of (4.8) if $\mathbf{x}^{\star}$ is the lex-optimal solution of

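$$\min_{\mathbf{x}}\;c^{\top}\mathbf{x}\quad\text{subj. to }\;\mathbf{x}\in\bigcap_{h=1}^{\delta}X_{\ell_{h}},$$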

and if the collection of $\delta$ constraints is minimal (i.e., removing a constraint from the previous problem implies that the lex-optimal solution changes). We compactly denote the basis as the set $B=\bigcap_{h=1}^{\delta}X_{\ell_{h}}$ . For feasible convex problems in the form (4.8), it holds $\delta\leq d$ , whereas for linear programs, it holds $\delta=d$ (cf. Definition 27). The maximum $\delta$ for a given problem is called the combinatorial dimension of the problem. A more comprehensive discussion can be found in [147, 148].

Next, we describe the Constraints Consensus algorithm applied to convex programs in Algorithm 8, from the perspective of node $i$ . Each agent $i$ maintains a local solution estimate $\mathbf{x}_{i}^{t}$ and a local basis $B_{i}^{t}$ , initialized to $X_{i}$ . The algorithm looks similar to Algorithm 7, where the main difference is that general convex constraints are considered, instead of linear ones.

Note that, as in Algorithm 7, we ask processors to use a lexicographic solver to handle possible non-uniqueness of the optimal solution. Algorithm 8 enjoys the same convergence properties as Algorithm 7, formalized next.

Theorem 33.

Let Assumption 32 hold and let the communication graph be jointly strongly connected. Moreover, let $\mathbf{x}^{\star}$ be the lex-optimal solution of problem (4.8). Consider the sequences $\{\mathbf{x}_{i}^{t}\}_{t\geq 0},i\in\{1,\ldots,N\}$ , generated by Algorithm 8. Then, for all $i\in\{1,\ldots,N\}$ , the following holds:

(i) the cost sequence $\{c^{\top}\mathbf{x}_{i}^{t}\}_{t\geq 0}$ converges in finite time to the optimal cost $J^{\star}$ of (4.8);

(ii) the solution sequence $\{\mathbf{x}_{i}^{t}\}_{t\geq 0}$ converges in finite time to $\mathbf{x}^{\star}$. $\square$

Theorem 33 can be proven by using arguments similar to the ones in Theorem 31, thus we omit the proof.

We highlight that, in practice, Algorithm 8 can be implemented when the constraint sets $X_{i}$ are easy to communicate (e.g., when all of them have the same parametric form and they only differ for the parameters). In more difficult set-ups, polyhedral approximations of the local sets $X_{i}$ can be communicated instead (cf. Section 4.3.1).

Remark 34.

Algorithm 8 can be properly adapted to handle problems with nonlinear cost in the form

$$\min_{\mathbf{x}}\;f(\mathbf{x})\quad\text{subj. to }\;\mathbf{x}\in\bigcap_{i=1}^{N}X_{i},\tag{4.10}$$

with $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ a convex cost function. By resorting to the epigraph form of (4.10), which is in the form (4.8), it can be shown that Algorithm 8 can be implemented by simply replacing the linear function in the local problem (4.9) with the nonlinear one and by increasing the maximum number of sets in the bases to $d+1$. $\square$

The Constraints Consensus algorithm can handle more general problems than (4.8). Indeed, in [148], the algorithm has been formulated for general abstract programs (or LP-type problems), which include, as a special case, problems (4.1) and (4.8). We do not give the technical details of abstract programs, but we only mention that they are a generalization of linear programs, which capture numerous geometric optimization problems such as, e.g., computation of the smallest enclosing ball of a set of points. When the combinatorial dimension of the problem is known, the distributed algorithm [148] can be applied directly. Otherwise, if the Helly number of the problem is known, one can use the results in [151] to compute the combinatorial dimension of the problem.

4.3 Extensions

In this section, we discuss extensions of the Constraints Consensus algorithm.

4.3.1 Cutting-plane Consensus

Let us consider again the convex program (4.8). The Cutting-plane Consensus algorithm [152] is an extension of Algorithm 8, in which outer approximations of the local constraint sets $X_{i}$ are communicated (instead of the sets $X_{i}$ themselves). There are several situations in which this approach is desirable, such as, e.g., (i) when privacy must be preserved (so that agents do not want to share their own constraint with the other nodes), (ii) when it is expensive to send $X_{i}$ , (iii) when there are infinitely many local constraints (e.g., robust, semi-infinite programming).

The Cutting-plane Consensus algorithm is based on a successive refinement of polyhedral approximations of the local sets $X_{i}$ . In particular, agents repeatedly solve linear programs of the form

$$\min_{\mathbf{x}}\;c^{\top}\mathbf{x}\quad\text{subj. to }\;A\mathbf{x}\leq b,\tag{4.11}$$

where the feasible set $\{\mathbf{x}\in\mathbb{R}^{d}\mid A\mathbf{x}\leq b\}$ is a (polyhedral) outer-approximation of $\bigcap_{i=1}^{N}X_{i}$. It is constructed by generating and exchanging a particular type of constraints, called cutting planes. (A cutting plane is a half-space $h\triangleq\{\mathbf{x}\in\mathbb{R}^{d}\mid a^{\top}\mathbf{x}\leq b\}$ separating a query point $\mathbf{x}_{q}\in\mathbb{R}^{d}$ from a set $X$, i.e., such that $X\subset h$ and $\mathbf{x}_{q}\notin h$.)
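As a concrete instance of the cutting-plane definition, the following sketch generates a cut when the set $X$ is a Euclidean ball. The ball-shaped set and the function name `ball_cutting_plane` are assumptions chosen for illustration; the oracle used in [152] depends on the actual form of the local sets $X_{i}$.

```python
import math

def ball_cutting_plane(center, radius, x_q):
    """Return (a, b) with X = {x : ||x - center|| <= radius} inside {x : a^T x <= b}
    and a^T x_q > b, or None if the query point x_q already belongs to X."""
    diff = [q - c for q, c in zip(x_q, center)]
    dist = math.sqrt(sum(v * v for v in diff))
    if dist <= radius:
        return None  # x_q is in X: no cut is needed
    a = [v / dist for v in diff]  # unit normal pointing from the center to x_q
    # max of a^T x over the ball is a^T center + radius, so the ball satisfies the cut
    b = sum(ai * ci for ai, ci in zip(a, center)) + radius
    return a, b

# Query point (3, 4) against the unit ball centered at the origin (hypothetical data).
A_CUT, B_CUT = ball_cutting_plane((0.0, 0.0), 1.0, (3.0, 4.0))
```

The half-space is tangent to the ball at the point closest to the query point, which makes it the deepest cut available for this set.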

The evolution of the Cutting-plane Consensus algorithm can be roughly summarized as follows. Each agent $i$ first solves problem (4.11) and finds an optimal solution $\mathbf{x}_{q}$ . Then, it checks whether $\mathbf{x}_{q}$ belongs to its own constraint set $X_{i}$ . If so, it sends to neighbors a basis associated to $\mathbf{x}_{q}$ (in terms of the approximated constraints). If not, it generates a new cutting plane, it computes an optimal solution of the new approximate problem and sends a basis to neighbors.

Differently from the Constraints Consensus algorithm, the Cutting-plane Consensus algorithm does not enjoy finite-time convergence, but instead it converges asymptotically. Also, we point out that the tie-break rule used in [152] (in case problem (4.11) has multiple optimal solutions) consists of the minimal 2-norm solution, instead of the lex-optimal solution.

4.3.2 Distributed Mixed-Integer Linear Programming via Cut Generation and Constraint Exchange

Mixed-integer linear programs (MILPs) are linear programs in which some of the variables are constrained to be integer, i.e.,

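$$\min_{\mathbf{x}}\;c^{\top}\mathbf{x}\quad\text{subj. to }\;a_{i}^{\top}\mathbf{x}\leq b_{i},\;\;i\in\{1,\ldots,N\},\quad\mathbf{x}\in\mathbb{Z}^{d_{Z}}\times\mathbb{R}^{d_{R}},\tag{4.12}$$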

where $d_{Z}$ and $d_{R}$ are the dimensions of the integer and real variables, $d=d_{Z}+d_{R}$, $c\in\mathbb{R}^{d}$ and $a_{i}\in\mathbb{R}^{d}$, $b_{i}\in\mathbb{R}$ for all $i\in\{1,\ldots,N\}$.

It is well known that MILPs are NP-hard problems, which makes problem (4.12) difficult to solve. In [153] and in [149] distributed algorithms are proposed, with finite-time convergence, for the solution of problem (4.12). They are based on a constraint exchange approach as in Constraints Consensus, but appropriate additional constraints (cutting planes, cf. also Section 4.3.1) are generated throughout the algorithm evolution.

Let $P\triangleq\{\mathbf{x}\in\mathbb{R}^{d}\mid a_{i}^{\top}\mathbf{x}\leq b_{i}\text{ for all }i\}$ denote the polyhedron described by the inequality constraints of problem (4.12) and let $P_{I}\triangleq P\cap({\mathbb{Z}}^{d_{Z}}\times\mathbb{R}^{d_{R}})$ denote the feasible set of problem (4.12). An important feature of problem (4.12) is that it has the same optimal cost as the linear program

$$\min_{\mathbf{x}}\;c^{\top}\mathbf{x}\quad\text{subj. to }\;\mathbf{x}\in\text{conv}(P_{I}),\tag{4.13}$$

where $\text{conv}(P_{I})$ is the convex hull of $P_{I}$ . Moreover, the optimal solution set of problem (4.12) is contained in the optimal solution set of (4.13). In order to solve the original MILP (4.12), the algorithms in [149] produce successive approximations of $\text{conv}(P_{I})$ by generating two types of cutting planes: (i) mixed-integer Gomory cuts and (ii) cost-based cuts. We do not provide the technical details on the algorithms, but we only point out that, as in Constraints Consensus, the algorithms work under asynchronous and unreliable communication and enjoy finite-time convergence.

4.3.3 Other extensions

In this subsection, we briefly mention other extensions of the algorithms presented in this chapter.

Robust optimization is the field of optimization that considers problems in which the problem data is uncertain. Typical approaches to tackle an uncertain problem consider the worst case of the uncertain parameters, giving rise to a semi-infinite optimization problem, i.e., with an infinite number of constraints. In [154], a distributed robust optimization algorithm is proposed, which is a randomized extension of the Constraints Consensus algorithm, to solve linear programs where the problem data is subject to uncertainty. The algorithm relies on a verification step (in which each agent randomly samples its local uncertain constraint set), and on the deterministic solution of a local version of the global semi-infinite problem.

In [155], the authors considered a big-data quadratic programming set-up emerging in several learning problems for cyber-physical networks, where the big-data keyword is due to the very high dimension of the optimization variable and of the training samples. For this class of big-data quadratic optimization problems, they proposed a distributed algorithm, obtained as an extension of the Constraints Consensus algorithm, which solves the problem up to an arbitrary tolerance $\epsilon$ . The algorithm is based on the notion of core-set used in geometric optimization to approximate the value function of a given set of points with a smaller subset of points. From an optimization point of view, a subset of active constraints is identified, whose number depends on the tolerance $\epsilon$ . The resulting approximate solution is such that an $\epsilon$ -relaxation of the constraints guarantees no constraint violation.

Submodular optimization is a special class of combinatorial optimization (in which the cost function is actually a set function) arising in several machine learning problems, but also in cooperative control of complex systems. In [156], a submodular minimization problem is considered. Agents can evaluate the cost function only for those sets including the agent itself. Then, by relying on a proper linear programming reformulation of the submodular problem (involving a huge number of variables), it is possible to devise a distributed algorithm based on a column generation approach, in which columns are generated through a local greedy algorithm.

4.4 Numerical Example

In this section, we provide a numerical example of the Constraints Consensus algorithm to highlight its main features.

We consider a network of $N=30$ agents communicating over a fixed, directed, strongly connected graph generated according to an Erdős-Rényi random model with parameter $p=0.1$ .
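One way to obtain such a graph is sketched below. The source does not state how strong connectivity was ensured, so the rejection loop (resample until the digraph is strongly connected), the seed, and all function names are assumptions.

```python
import random

def erdos_renyi_digraph(n, p, rng):
    """Each directed edge (i, j), i != j, is present independently with probability p."""
    return {i: [j for j in range(n) if j != i and rng.random() < p] for i in range(n)}

def reachable(adj, source):
    """Set of nodes reachable from source via depth-first search."""
    seen = {source}
    stack = [source]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def is_strongly_connected(adj):
    """Node 0 must reach every node in the graph and in the reversed graph."""
    reverse = {i: [] for i in adj}
    for u, targets in adj.items():
        for v in targets:
            reverse[v].append(u)
    n = len(adj)
    return len(reachable(adj, 0)) == n and len(reachable(reverse, 0)) == n

def sample_strongly_connected(n=30, p=0.1, seed=0, max_tries=10000):
    """Resample the random digraph until a strongly connected one is found."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        graph = erdos_renyi_digraph(n, p, rng)
        if is_strongly_connected(graph):
            return graph
    raise RuntimeError("no strongly connected sample found")

GRAPH = sample_strongly_connected()
```

With $p=0.1$ and $N=30$ a single sample is often not strongly connected (some node may have no incoming or no outgoing edge), which is why the sketch retries.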

We focus on the soft-margin SVM problem introduced in Section 1.3.3, where we consider a two-dimensional space, i.e., $d=2$, and we recall that each agent $i$ is assigned one training sample $(p_{i},\ell_{i})\in\mathbb{R}^{2}\times\{-1,1\}$. We suppose that the training samples are randomly picked from two bivariate Gaussian distributions with covariance matrix equal to the identity matrix. A total of $15$ agents are assigned to the first distribution, which has zero mean and is associated to the label $\ell_{i}=1$, while the remaining agents are assigned to the second distribution, associated to the label $\ell_{i}=-1$ and with mean equal to $[3,2]^{\top}$.

The goal for agents is to agree on an optimal solution of problem (1.8), which we recall here

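$$\min_{w,b,\boldsymbol{\xi}}\;\frac{1}{2}\|w\|^{2}+C\sum_{i=1}^{N}\xi_{i}\quad\text{subj. to }\;\ell_{i}(w^{\top}p_{i}+b)\geq 1-\xi_{i},\;\;\xi_{i}\geq 0,\quad i\in\{1,\ldots,N\},\tag{1.8}$$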

where the parameter $C$ is set to $100$. In the following we also denote the vector stacking all the optimization variables with $\mathbf{x}$. As discussed in Remark 34, in order to solve problem (1.8) with the Constraints Consensus algorithm, we implement the local optimization problems in Algorithm 8 with the cost function $f(\mathbf{x})=\frac{1}{2}w^{\top}w+C\sum_{i=1}^{N}\xi_{i}$ and we allow up to $d+1$ constraints in the bases. To solve the $\mathop{\rm lexmin}$ optimization in Algorithm 8, we solve a total of $d+1$ problems as follows. First, we obtain the optimal cost $f^{\star}$ of the problem. Then we add to the problem the constraint $f(\mathbf{x})=f^{\star}$ (in order to force the optimal cost) and we minimize the first component of the decision variable. We continue this procedure, fixing each minimized component before moving to the next one, until we obtain the lex-optimal solution. Moreover, artificial box constraints $-M\mathbf{1}\leq w,b,\boldsymbol{\xi}\leq M\mathbf{1}$, with $M=10$ (which we verified to be sufficiently large for this problem), are added to problem (1.8) in order to satisfy Assumption 32.
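The sequential procedure can be written out as a chain of problems. The display below is a sketch in which $X_{\mathrm{loc}}$ is a hypothetical symbol for the feasible set of the local problem (collected bases, the agent's own constraint, and the box constraints), and $x_{k}$ denotes the $k$-th component of $\mathbf{x}$.

```latex
\begin{align*}
f^{\star} &= \min_{\mathbf{x}\in X_{\mathrm{loc}}} f(\mathbf{x}), \\
x_{1}^{\star} &= \min_{\mathbf{x}\in X_{\mathrm{loc}}} x_{1}
  \quad \text{subj. to } f(\mathbf{x}) = f^{\star}, \\
x_{k}^{\star} &= \min_{\mathbf{x}\in X_{\mathrm{loc}}} x_{k}
  \quad \text{subj. to } f(\mathbf{x}) = f^{\star},\;
  x_{h} = x_{h}^{\star} \text{ for all } h < k, \qquad k = 2, 3, \ldots
\end{align*}
```

Since no point of $X_{\mathrm{loc}}$ has cost below $f^{\star}$, the equality $f(\mathbf{x})=f^{\star}$ can be replaced by the convex inequality $f(\mathbf{x})\leq f^{\star}$ without changing the feasible set of the later problems, which keeps each of them a convex program.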

In our simulation, agents reached consensus on the lex-optimal solution of problem (1.8) in $10$ communication rounds, as expected from the finite-time result of Theorem 33. In Figure 4.2 we show the convergence rate of Algorithm 8. In particular, we plot the difference between the cost of the solution estimates and the optimal cost $J^{\star}$ of problem (1.8), i.e., $f(\mathbf{x}_{i}^{t})-J^{\star}$ , for all $i$ . Note that all the lines eventually approach zero.

In Figure 4.3 we show the maximum constraint value associated to the local solution estimates, i.e., for all $i$ we plot the quantity

[TABLE]

Notice that the algorithm evolves in an outer-approximation fashion, that is, the solution estimates are infeasible for problem (1.8) until the optimal solution is found. This can also be seen by noting in Figure 4.2 that the costs associated to the intermediate solution estimates are lower than the optimal cost of the problem.

In Figure 4.4 we show the distance of the local solution estimates from the lex-optimal solution $\mathbf{x}^{\star}$ of problem (1.8), i.e., $\|\mathbf{x}_{i}^{t}-\mathbf{x}^{\star}\|$ , for all $i$ .

Concluding Remarks

In this survey, we considered a distributed optimization framework arising in modern cyber-physical networks, in which computing units have only a partial knowledge of a global optimization problem and must solve it through local computation and communication without any central coordinator. First, we introduced the main optimization set-ups addressed in distributed optimization (i.e., cost-coupled, common-cost, and constraint-coupled), and motivated them with relevant estimation, learning, decision and control applications arising in smart networks. Then, we reviewed three main approaches to design distributed optimization algorithms, namely (primal) consensus-based, duality-based and constraint-exchange methods, and provided a theoretical analysis under simplified communication assumptions and/or problem set-ups. To highlight the behavior of the presented algorithms, the theoretical results are complemented by numerical examples.

Appendix A Centralized Optimization Methods

A.1 Gradient Method

Consider the following unconstrained optimization problem

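$$\min_{\mathbf{x}\in\mathbb{R}^{d}}\;f(\mathbf{x}),\tag{A.1}$$

where $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$. The gradient method is an iterative algorithm given by

$$\mathbf{x}^{t+1}=\mathbf{x}^{t}-\gamma^{t}\nabla f(\mathbf{x}^{t}),\tag{A.2}$$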

where $t\geq 0$ denotes the iteration counter and $\gamma^{t}$ is the step-size. The following result states the convergence of the gradient method for constant step-size.

Proposition 35 ([10, Proposition 1.2.3]).

Assume that $f$ is a $\mathcal{C}^{1}$ function with Lipschitz continuous gradient $\nabla f$ with constant $L$ . Let the step-size be constant, i.e., $\gamma^{t}=\gamma$ , for all $t\geq 0$ , and such that $0<\gamma<2/L$ . Then, every limit point of the sequence $\{\mathbf{x}^{t}\}_{t\geq 0}$ generated by the gradient method (A.2), is a stationary point of problem (A.1), i.e., there exists a subset of indices $\mathcal{K}\subseteq{\mathbb{N}}$ such that

$$\lim_{\mathcal{K}\ni t\rightarrow\infty}\mathbf{x}^{t}=\bar{\mathbf{x}},$$

where $\bar{\mathbf{x}}$ is a stationary point of (A.1). $\square$

The previous result can be extended in several ways, e.g., with different step-size rules and adapted to constrained problems. We refer the interested reader to [10] and references therein.
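As a small illustration of the constant step-size condition $0<\gamma<2/L$, the following sketch runs the gradient method on a quadratic function. The test function, the starting point, and the name `gradient_method` are assumptions chosen for illustration.

```python
def gradient_method(grad, x0, step, iters):
    """Iterate x^{t+1} = x^t - step * grad(x^t) with a constant step-size."""
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

def quadratic_grad(x):
    """Gradient of f(x) = 0.5 * (x1^2 + 4 * x2^2) - x1, whose gradient is Lipschitz with L = 4."""
    return [x[0] - 1.0, 4.0 * x[1]]

# step = 0.4 satisfies 0 < step < 2 / L = 0.5; the unique stationary point is (1, 0).
X_GRAD = gradient_method(quadratic_grad, [5.0, -3.0], step=0.4, iters=200)

# step = 0.6 violates the bound: the second coordinate is multiplied by -1.4 at every step.
X_DIVERGED = gradient_method(quadratic_grad, [5.0, -3.0], step=0.6, iters=50)
```

For this function the error in each coordinate is multiplied by $1-\gamma$ and $1-4\gamma$ respectively, so the bound $\gamma<2/L$ is exactly what keeps both factors below one in absolute value.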

A.2 Subgradient Method

Consider the following constrained optimization problem

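$$\min_{\mathbf{x}}\;f(\mathbf{x})\quad\text{subj. to }\;\mathbf{x}\in X,\tag{A.3}$$

with $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ a convex function and $X\subseteq\mathbb{R}^{d}$ a closed, convex set.

A vector $\widetilde{\nabla}f(\mathbf{x})\in\mathbb{R}^{d}$ is called a subgradient of the convex function $f$ at $\mathbf{x}\in\mathbb{R}^{d}$ if

$$f(\mathbf{y})\geq f(\mathbf{x})+\widetilde{\nabla}f(\mathbf{x})^{\top}(\mathbf{y}-\mathbf{x})$$

for all $\mathbf{y}\in\mathbb{R}^{d}$. The (projected) subgradient method is the iterative algorithm given by

$$\mathbf{x}^{t+1}=\mathcal{P}_{X}\big(\mathbf{x}^{t}-\gamma^{t}\widetilde{\nabla}f(\mathbf{x}^{t})\big),\tag{A.4}$$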

where $t\geq 0$ denotes the iteration counter, $\gamma^{t}$ is the step-size, $\widetilde{\nabla}f(\mathbf{x}^{t})$ denotes a subgradient of $f$ at $\mathbf{x}^{t}$ , and $\mathcal{P}_{X}(\,\cdot\,)$ is the Euclidean projection onto $X$ .

Assumption 36 (Diminishing Step-size).

The step-size sequence $\{\gamma^{t}\}_{t\geq 0}$ is such that $\gamma^{t}\geq 0$ and satisfies

$$\sum_{t=0}^{\infty}\gamma^{t}=\infty,\qquad\sum_{t=0}^{\infty}(\gamma^{t})^{2}<\infty.$$

The following proposition formally states the convergence of the subgradient method (A.4).

Proposition 37 ([157, Proposition 3.2.6]).

Assume that all the subgradients of $f$ are bounded at each $\mathbf{x}\in X$ . Moreover, assume the optimal solution set of problem (A.3) is not empty. Let the step-size $\gamma^{t}$ satisfy Assumption 36. Then, the sequence $\{\mathbf{x}^{t}\}_{t\geq 0}$ generated by the subgradient method (A.4) converges to an optimal solution $\mathbf{x}^{\star}$ of problem (A.3), i.e.,

$$\lim_{t\rightarrow\infty}\mathbf{x}^{t}=\mathbf{x}^{\star}.$$
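A minimal sketch of the projected subgradient method with a diminishing step-size is given below. The cost $f(\mathbf{x})=\|\mathbf{x}-\mathbf{p}\|_{1}$, the box constraint set, the step-size $\gamma^{t}=1/(t+1)$, and all names are assumptions chosen for illustration.

```python
def sign(v):
    """Sign of v, with sign(0) = 0 (a valid subgradient of |.| at zero)."""
    return (v > 0) - (v < 0)

def projected_subgradient(p, lower, upper, x0, iters):
    """Minimize f(x) = sum_k |x_k - p_k| over the box [lower, upper]^d."""
    x = list(x0)
    for t in range(iters):
        step = 1.0 / (t + 1)  # sum of steps diverges, sum of squared steps is finite
        subgrad = [sign(xk - pk) for xk, pk in zip(x, p)]
        moved = [xk - step * gk for xk, gk in zip(x, subgrad)]
        x = [min(upper, max(lower, v)) for v in moved]  # Euclidean projection onto the box
    return x

# Target p = (0.3, -2) and box [-1, 1]^2: the constrained minimizer is (0.3, -1).
X_SUB = projected_subgradient((0.3, -2.0), -1.0, 1.0, [1.0, 1.0], iters=5000)
```

The first coordinate oscillates around $0.3$ with an amplitude bounded by the current step-size, which shows why a diminishing step-size is needed for the iterates themselves to converge.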

A.3 Lagrangian Duality and Dual Subgradient Method

Consider a constrained optimization problem, addressed as primal problem, having the form

[TABLE]

where $X\subseteq^{d}$ is a convex, compact set, $f:^{d}\rightarrow$ is a convex function and $\mathbf{g}:^{d}\rightarrow^{S}$ is such that each component $\mathbf{g}_{s}:^{d}\rightarrow$ , $s\in\{1,\ldots,S\}$ , is a convex (scalar) function.

The following optimization problem

$$\max_{\boldsymbol{\mu}\geq\mathbf{0}}\; q(\boldsymbol{\mu}) \tag{A.6}$$

is called the dual of problem (A.5), where $q:\mathbb{R}^{S}\rightarrow\mathbb{R}$ is obtained by minimizing with respect to $\mathbf{x}\in X$ the Lagrangian function $\mathcal{L}(\mathbf{x},\boldsymbol{\mu})=f(\mathbf{x})+\boldsymbol{\mu}^{\top}\mathbf{g}(\mathbf{x})$ , i.e., $q(\boldsymbol{\mu})=\min_{\mathbf{x}\in X}\mathcal{L}(\mathbf{x},\boldsymbol{\mu})$ . It can be shown that the domain of $q$ (i.e., the set of $\boldsymbol{\mu}$ such that $q(\boldsymbol{\mu})>-\infty$ ) is convex and that $q$ is concave on its domain. A vector $\bar{\boldsymbol{\mu}}\in\mathbb{R}^{S}$ is said to be a Lagrange multiplier if $\bar{\boldsymbol{\mu}}\geq\mathbf{0}$ and

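$$\inf_{\mathbf{x}\in X}\mathcal{L}(\mathbf{x},\bar{\boldsymbol{\mu}})=\inf_{\mathbf{x}\in X,\;\mathbf{g}(\mathbf{x})\leq\mathbf{0}}f(\mathbf{x}).$$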

It can be shown that the following inequality holds [10]

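$$\inf_{\mathbf{x}\in X}\,\sup_{\boldsymbol{\mu}\geq\mathbf{0}}\,\mathcal{L}(\mathbf{x},\boldsymbol{\mu})\;\geq\;\sup_{\boldsymbol{\mu}\geq\mathbf{0}}\,\inf_{\mathbf{x}\in X}\,\mathcal{L}(\mathbf{x},\boldsymbol{\mu}), \tag{A.7}$$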

which is called weak duality. When equality holds in (A.7), we say that strong duality holds; solving the primal problem (A.5) is then equivalent to solving its dual formulation (A.6). In this case, the right-hand-side problem in (A.7) is referred to as the saddle-point problem of (A.5).

Definition 38.

A pair $(\mathbf{x}^{\star},\boldsymbol{\mu}^{\star})$ is called a primal-dual optimal solution of problem (A.5) if $\mathbf{x}^{\star}\in X$ and $\boldsymbol{\mu}^{\star}\geq\mathbf{0}$ , and $(\mathbf{x}^{\star},\boldsymbol{\mu}^{\star})$ is a saddle point of the Lagrangian, i.e.,

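$$\mathcal{L}(\mathbf{x}^{\star},\boldsymbol{\mu})\leq\mathcal{L}(\mathbf{x}^{\star},\boldsymbol{\mu}^{\star})\leq\mathcal{L}(\mathbf{x},\boldsymbol{\mu}^{\star})$$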

for all $\mathbf{x}\in X$ and $\boldsymbol{\mu}\geq\mathbf{0}$ . $\square$

Given the dual function $q$ , an important property is as follows. A subgradient of $-q$ at a given $\bar{\boldsymbol{\mu}}$ can be efficiently computed as $-\mathbf{g}(\bar{\mathbf{x}})$ , where $\bar{\mathbf{x}}\in\mathop{\rm argmin}_{\mathbf{x}\in X}\>f(\mathbf{x})+\bar{\boldsymbol{\mu}}^{\top}\mathbf{g}(\mathbf{x})$ (see [10, Section 6] for further details); equivalently, $\mathbf{g}(\bar{\mathbf{x}})$ is a supergradient of the concave function $q$ at $\bar{\boldsymbol{\mu}}$ . Then, a subgradient method to solve the dual problem (A.6) reads

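$$\begin{aligned}
\mathbf{x}^{t+1}&\in\mathop{\rm argmin}_{\mathbf{x}\in X}\; f(\mathbf{x})+(\boldsymbol{\mu}^{t})^{\top}\mathbf{g}(\mathbf{x}),\\
\boldsymbol{\mu}^{t+1}&=\mathcal{P}_{\boldsymbol{\mu}\geq\mathbf{0}}\big(\boldsymbol{\mu}^{t}+\gamma^{t}\,\mathbf{g}(\mathbf{x}^{t+1})\big),
\end{aligned}$$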

where $\gamma^{t}$ is a suitable step-size and $\boldsymbol{\mu}^{0}\geq 0$ is arbitrary.
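The dual subgradient iteration can be sketched on a scalar toy problem. The instance below (minimize $x^{2}$ over $X=[-2,2]$ subject to $1-x\leq 0$), the step-size $\gamma^{t}=1/(t+1)$ and all function names are assumptions made for illustration. For this instance the primal-dual optimal pair is $(x^{\star},\mu^{\star})=(1,2)$, and the inner minimization of the Lagrangian has a closed form.

```python
def clip(v, lo, hi):
    """Projection of a scalar onto the interval [lo, hi]."""
    return min(max(v, lo), hi)


def primal_minimizer(mu):
    """argmin over x in [-2, 2] of x^2 + mu * (1 - x).

    The Lagrangian is a convex quadratic in x, so its constrained minimizer
    is the unconstrained stationary point mu / 2 clipped to the interval.
    """
    return clip(mu / 2.0, -2.0, 2.0)


def g(x):
    """Assumed inequality constraint g(x) = 1 - x <= 0."""
    return 1.0 - x


def dual_subgradient(mu0=0.0, iterations=10000):
    """Projected (sub)gradient ascent on the dual function q."""
    mu = mu0
    for t in range(iterations):
        gamma = 1.0 / (t + 1)
        x = primal_minimizer(mu)             # inner Lagrangian minimization
        mu = max(0.0, mu + gamma * g(x))     # ascent step, projected onto mu >= 0
    return primal_minimizer(mu), mu


x_hat, mu_hat = dual_subgradient()
print(x_hat, mu_hat)
```

With this diminishing step-size the dual error decays only on the order of $1/\sqrt{t}$ in this example, so the iterates approach $(1,2)$ slowly; this is typical of subgradient schemes.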

A.4 ADMM Algorithm

In this section, we review the Alternating Direction Method of Multipliers (ADMM) following [88, Section 3.4]. Consider the following optimization problem

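$$\min_{\mathbf{x}}\; G_{1}(\mathbf{x})+G_{2}(A\mathbf{x}) \quad \text{subj. to}\;\; \mathbf{x}\in C_{1},\; A\mathbf{x}\in C_{2}, \tag{A.8}$$

where $G_{1}:\mathbb{R}^{d}\rightarrow\mathbb{R}$ and $G_{2}:\mathbb{R}^{S}\rightarrow\mathbb{R}$ are convex functions, $A$ is an ${S\times d}$ matrix, and $C_{1}\subseteq\mathbb{R}^{d}$ and $C_{2}\subseteq\mathbb{R}^{S}$ are nonempty, closed convex sets. We assume that the optimal solution set $X^{\star}$ of problem (A.8) is nonempty. Furthermore, we assume that either $C_{1}$ is bounded or $A^{\top}A$ is invertible.

Problem (A.8) can be equivalently rewritten as

$$\min_{\mathbf{x},\mathbf{z}}\; G_{1}(\mathbf{x})+G_{2}(\mathbf{z}) \quad \text{subj. to}\;\; A\mathbf{x}=\mathbf{z},\; \mathbf{x}\in C_{1},\; \mathbf{z}\in C_{2}. \tag{A.9}$$

Let $\boldsymbol{\lambda}\in\mathbb{R}^{S}$ be a multiplier associated to the equality constraint $A\mathbf{x}=\mathbf{z}$ and introduce the augmented Lagrangian of problem (A.9)

$$\mathcal{L}_{\rho}(\mathbf{x},\mathbf{z},\boldsymbol{\lambda})=G_{1}(\mathbf{x})+G_{2}(\mathbf{z})+\boldsymbol{\lambda}^{\top}(A\mathbf{x}-\mathbf{z})+\frac{\rho}{2}\|A\mathbf{x}-\mathbf{z}\|^{2},$$

where $\rho>0$ is a penalty parameter. The ADMM algorithm is an iterative procedure in which, at each iteration $t\geq 0$ , the following steps are performed

$$\begin{aligned}
\mathbf{x}^{t+1}&\in\mathop{\rm argmin}_{\mathbf{x}\in C_{1}}\;\mathcal{L}_{\rho}(\mathbf{x},\mathbf{z}^{t},\boldsymbol{\lambda}^{t}),\\
\mathbf{z}^{t+1}&\in\mathop{\rm argmin}_{\mathbf{z}\in C_{2}}\;\mathcal{L}_{\rho}(\mathbf{x}^{t+1},\mathbf{z},\boldsymbol{\lambda}^{t}),\\
\boldsymbol{\lambda}^{t+1}&=\boldsymbol{\lambda}^{t}+\rho\,(A\mathbf{x}^{t+1}-\mathbf{z}^{t+1}),
\end{aligned} \tag{A.10}$$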

where the initialization of the variables $\mathbf{z}^{0}$ and $\boldsymbol{\lambda}^{0}$ can be arbitrary.

The ADMM algorithm is very similar to dual ascent and to the Method of Multipliers (MM): it consists of an $\mathbf{x}$ -minimization, a $\mathbf{z}$ -minimization, and a dual variable update. As in the method of multipliers, the dual variable update uses a step-size equal to the augmented Lagrangian parameter $\rho$ . In the MM, the augmented Lagrangian $\mathcal{L}_{\rho}$ is minimized jointly with respect to the two primal variables. In ADMM, on the other hand, $\mathbf{x}$ and $\mathbf{z}$ are updated in an alternating or sequential fashion, which accounts for the term alternating direction.

Proposition 39 ([88, Proposition 4.2]).

Consider a sequence $\{\mathbf{x}^{t},\mathbf{z}^{t},\boldsymbol{\lambda}^{t}\}_{t\geq 0}$ generated by the ADMM algorithm (A.10). Then, the generated sequence is bounded and every limit point of $\{\mathbf{x}^{t}\}_{t\geq 0}$ is an optimal solution of problem (A.8). Furthermore, the sequence $\{\boldsymbol{\lambda}^{t}\}_{t\geq 0}$ converges to an optimal solution of the dual of problem (A.8). $\square$
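To make the three ADMM steps concrete, the following minimal sketch applies them to an assumed instance of the problem: $G_{1}(\mathbf{x})=\frac{1}{2}\|\mathbf{x}-\mathbf{a}\|^{2}$ , $G_{2}(\mathbf{z})=\|\mathbf{z}\|_{1}$ , $A=I$ and $C_{1}=C_{2}=\mathbb{R}^{d}$ . This instance, the data vector `a` and the function names are hypothetical choices, not taken from [88]. Both minimizations then have closed forms (a linear solve per component and a soft-thresholding), and the known solution is the soft-thresholding of $\mathbf{a}$ with threshold $1$.

```python
def soft_threshold(v, kappa):
    """Minimizer of |z| + (1 / (2 * kappa)) * (z - v)^2 over scalar z."""
    if v > kappa:
        return v - kappa
    if v < -kappa:
        return v + kappa
    return 0.0


def admm(a, rho=1.0, iterations=100):
    """ADMM for min 0.5*||x - a||^2 + ||z||_1 subject to x = z (assumed instance)."""
    n = len(a)
    x = [0.0] * n
    z = [0.0] * n      # arbitrary initialization
    lam = [0.0] * n    # arbitrary initialization
    for _ in range(iterations):
        # x-step: stationarity of the augmented Lagrangian in x,
        # (x - a) + lam + rho * (x - z) = 0
        x = [(a[i] - lam[i] + rho * z[i]) / (1.0 + rho) for i in range(n)]
        # z-step: |z| + (rho / 2) * (z - (x + lam / rho))^2 is minimized by soft-thresholding
        z = [soft_threshold(x[i] + lam[i] / rho, 1.0 / rho) for i in range(n)]
        # dual step with step-size rho
        lam = [lam[i] + rho * (x[i] - z[i]) for i in range(n)]
    return x, z, lam


x, z, lam = admm([3.0, -0.5, 1.5])
print(x, z, lam)
```

The sign convention of the multiplier follows the term $\boldsymbol{\lambda}^{\top}(A\mathbf{x}-\mathbf{z})$ in the augmented Lagrangian, so at convergence $\boldsymbol{\lambda}=\mathbf{a}-\mathbf{x}$ for this instance.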

In [89] a more general problem set-up for ADMM is considered. Specifically, let us consider a two-variable problem defined as

[TABLE]

with $A\in\mathbb{R}^{p\times d}$ , $B\in\mathbb{R}^{p\times S}$ and $c\in\mathbb{R}^{p\times 1}$ . Then, the ADMM algorithm applied to problem (A.11) reads as follows

[TABLE]

where the augmented Lagrangian is defined as

[TABLE]

Appendix B Consensus Over Networks

Consensus and distributed averaging are fundamental building blocks in distributed optimization.

The consensus problem for a group of $N$ agents concerns the conditions under which, using a given message-passing protocol, the local variables of all agents converge to the same value. Several results establish such convergence under various information exchange protocols among agents.

B.1 Average Consensus over Static Networks

One of the most used models for consensus is based on the following discrete-time iteration: to generate an estimate at iteration $t+1$ , agent $i$ forms a convex combination of its current estimate $\mathbf{z}_{i}^{t}$ with the estimates received from other agents as

$$\mathbf{z}_{i}^{t+1}=\sum_{j\in\mathcal{N}_{i}}a_{ij}\,\mathbf{z}_{j}^{t}, \tag{B.1}$$

where $a_{ij}$ denotes a (positive) weight that agent $i$ assigns to each neighbor $j$ , and we recall that $\mathcal{N}_{i}$ is the set of neighbors of agent $i$ in the (static) undirected communication graph. The weights $a_{ij}$ are set to zero if $i$ and $j$ are not neighbors in the communication graph $\mathcal{G}$ and are doubly stochastic, i.e., they satisfy $\sum_{j=1}^{N}a_{ij}=1$ , for all $i\in\{1,\ldots,N\}$ , and $\sum_{i=1}^{N}a_{ij}=1$ , for all $j\in\{1,\ldots,N\}$ .

The consensus algorithm can be written in an aggregate form by stacking all the agents’ estimates in a single variable which evolves according to

[TABLE]

where $A$ is a matrix whose $(i,j)$ -th entry is $a_{ij}$ for all $i,j\in\{1,\ldots,N\}$ .

A useful property of doubly stochastic matrices is the following. Given $A$ doubly stochastic, it holds

[TABLE]

where $\bar{\mathbf{z}}\triangleq\frac{1}{N}\sum_{i=1}^{N}\mathbf{z}_{i}$ and $\sigma_{A}$ is the spectral radius of $A-\mathbf{1}\mathbf{1}^{\top}/N$ . It can be proven (see [87]) that if the graph is connected and $A$ is doubly stochastic, then $\sigma_{A}\in(0,1)$ , and specifically $\sigma_{A}=\max\{|\lambda_{2}|,|\lambda_{N}|\}$ , where $\lambda_{h}$ denotes the $h$ -th largest eigenvalue of $A$ .

Theorem 40.

Let $\mathcal{G}$ be a connected graph and let $a_{ij}$ , $i,j\in\{1,\ldots,N\}$ be doubly stochastic weights matching the graph. Then, the sequences $\{\mathbf{z}_{i}^{t}\}_{t\geq 0}$ , $i\in\{1,\ldots,N\}$ , generated by (B.1) satisfy

$$\lim_{t\rightarrow\infty}\|\mathbf{z}_{i}^{t}-\bar{\mathbf{z}}^{0}\|=0$$

for all $i\in\{1,\ldots,N\}$ , where $\bar{\mathbf{z}}^{0}=\frac{1}{N}\sum_{i=1}^{N}\mathbf{z}_{i}^{0}$ . $\square$
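The following minimal sketch simulates average consensus on a small undirected graph. The path graph with four agents, the scalar initial values and the use of Metropolis weights (one common way to obtain doubly stochastic weights matching an undirected graph) are assumptions made for the example; the function names are hypothetical.

```python
def metropolis_weights(neighbors):
    """Build a symmetric, doubly stochastic weight matrix for an undirected graph.

    neighbors[i] lists the neighbors of agent i (excluding i itself).
    """
    n = len(neighbors)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in neighbors[i]:
            A[i][j] = 1.0 / (1 + max(len(neighbors[i]), len(neighbors[j])))
        A[i][i] = 1.0 - sum(A[i])  # self-weight makes each row sum to one
    return A


def consensus(A, z0, iterations=200):
    """Iterate z_i <- sum_j a_ij * z_j for all agents simultaneously."""
    n = len(z0)
    z = list(z0)
    for _ in range(iterations):
        z = [sum(A[i][j] * z[j] for j in range(n)) for i in range(n)]
    return z


# Path graph 0 - 1 - 2 - 3 (connected, undirected).
neighbors = [[1], [0, 2], [1, 3], [2]]
A = metropolis_weights(neighbors)
z_initial = [1.0, 5.0, 3.0, 7.0]
z_final = consensus(A, z_initial)
print(z_final, sum(z_initial) / len(z_initial))
```

Because the weights are doubly stochastic, the average of the estimates is preserved at every iteration, which is why the common limit is the average of the initial values.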

Several extensions of the basic consensus algorithm (B.1) exist. For instance, one can consider time-varying networks that have some long-term connectivity properties. The consensus algorithm needs to be adapted to accommodate the time-varying network by considering time-varying weights $a_{ij}^{t}$ . Also, it is possible to design a consensus algorithm that works under delays and is robust to packet losses. See [158] for a recent survey on this topic. Next, we describe another extension in which the consensus algorithm is tailored for directed networks.

B.2 Push-sum Consensus over Directed Networks

In this section we describe how the average consensus algorithm can be adapted to work on directed networks. This algorithm is known as the push-sum algorithm and was introduced in [159].

In directed networks it is not always possible to construct a doubly stochastic matrix $A$ , while a column stochastic matrix is always available. We use $B$ to denote a column stochastic matrix, i.e., such that $\mathbf{1}^{\top}B=\mathbf{1}^{\top}$ . Formally, the push-sum consensus reads

[TABLE]

with the initial values $\phi_{i}^{0}=1$ for all $i\in\{1,\ldots,N\}$ .

The convergence of this scheme has been proven in [159], i.e., the sequences $\{\mathbf{z}_{i}^{t}\}_{t\geq 0}$ , $i\in\{1,\ldots,N\}$ , generated by (B.3) satisfy

$$\lim_{t\rightarrow\infty}\|\mathbf{z}_{i}^{t}-\bar{\mathbf{z}}^{0}\|=0$$

for all $i\in\{1,\ldots,N\}$ , where $\bar{\mathbf{z}}^{0}=\frac{1}{N}\sum_{i=1}^{N}\mathbf{z}_{i}^{0}$ .
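A minimal simulation of push-sum is sketched below, assuming its standard form: each agent keeps a running sum and a scalar weight $\phi_{i}$ initialized to $1$, mixes both with the same column stochastic matrix, and takes their ratio as its estimate $\mathbf{z}_{i}^{t}$ . The name `s` for the running sum, the equal splitting of mass among out-neighbors (which yields a column stochastic $B$) and the example digraph are assumptions, not notation from the reference.

```python
def push_sum(out_neighbors, z0, iterations=300):
    """Push-sum averaging on a directed graph.

    out_neighbors[j] lists the agents that receive messages from agent j
    (excluding j itself). Each agent splits its mass equally among itself and
    its out-neighbors, so the implied mixing matrix is column stochastic.
    """
    n = len(z0)
    s = list(z0)        # running sums, initialized with the local values
    phi = [1.0] * n     # push-sum weights, initialized to one
    for _ in range(iterations):
        s_new = [0.0] * n
        phi_new = [0.0] * n
        for j in range(n):
            receivers = [j] + list(out_neighbors[j])
            share = 1.0 / len(receivers)
            for i in receivers:
                s_new[i] += share * s[j]
                phi_new[i] += share * phi[j]
        s, phi = s_new, phi_new
    # the ratio corrects the bias introduced by the lack of row stochasticity
    return [si / pi for si, pi in zip(s, phi)]


# Strongly connected digraph: 0->1, 0->2, 1->2, 2->3, 3->0.
out_neighbors = [[1, 2], [2], [3], [0]]
estimates = push_sum(out_neighbors, [1.0, 5.0, 3.0, 7.0])
print(estimates)
```

In this graph the implied matrix is not row stochastic (agent 2 receives more mass than it sends), so plain consensus would not converge to the average; the ratio of the two sequences does.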

B.3 Dynamic Average Consensus Algorithm

In this section, we present a distributed algorithm to achieve dynamic average consensus that has been proposed in [160]. See also [161] for a very recent tutorial.

We consider a network of $N$ agents in which each agent $i$ is able to measure a local discrete-time signal $\{\mathbf{r}_{i}^{t}\}_{t\geq 0}$ . The goal is to design a distributed algorithm that enables the agents to eventually track the average of their signals $\mathbf{r}_{i}^{t}$ , $i\in\{1,\ldots,N\}$ , by means of local communication only.

The dynamic consensus algorithm proposed in [160] consists of a consensus-based procedure in which each agent maintains a local estimate $\mathbf{z}_{i}^{t}$ of the average. The local estimate is iteratively updated according to

[TABLE]

where $a_{ij}$ are entries of a doubly stochastic matrix.

If the input signals $\mathbf{r}_{i}^{t}$ asymptotically converge to a constant value, then the dynamic average consensus algorithm in (B.4) is guaranteed to converge, i.e., for all $i\in\{1,\ldots,N\}$ , it holds

$$\lim_{t\rightarrow\infty}\|\mathbf{z}_{i}^{t}-\bar{\mathbf{r}}^{t}\|=0,$$

where $\bar{\mathbf{r}}^{t}=\frac{1}{N}\sum_{i=1}^{N}\mathbf{r}_{i}^{t}$ for all $t\geq 0$ .

The interested reader can find a rigorous treatment and a more comprehensive discussion on this class of algorithms in [160, 161].
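As a sketch of dynamic average consensus, the simulation below assumes the standard form of the update, $\mathbf{z}_{i}^{t+1}=\sum_{j}a_{ij}\mathbf{z}_{j}^{t}+\mathbf{r}_{i}^{t+1}-\mathbf{r}_{i}^{t}$ with initialization $\mathbf{z}_{i}^{0}=\mathbf{r}_{i}^{0}$ ; this form, the weight matrix (Metropolis weights on a path graph with four agents) and the exponentially settling scalar signals are assumptions made for illustration.

```python
# Doubly stochastic weights for the path graph 0 - 1 - 2 - 3 (assumed example).
A = [
    [2 / 3, 1 / 3, 0.0, 0.0],
    [1 / 3, 1 / 3, 1 / 3, 0.0],
    [0.0, 1 / 3, 1 / 3, 1 / 3],
    [0.0, 0.0, 1 / 3, 2 / 3],
]

LIMITS = [1.0, 5.0, 3.0, 7.0]     # constant values the signals converge to
OFFSETS = [2.0, -1.0, 4.0, 0.0]   # transient amplitudes


def signal(i, t):
    """Local signal r_i^t, converging to LIMITS[i] as t grows."""
    return LIMITS[i] + OFFSETS[i] * 0.5 ** t


def dynamic_average_consensus(iterations=200):
    n = len(A)
    z = [signal(i, 0) for i in range(n)]  # initialization z_i^0 = r_i^0
    for t in range(iterations):
        mixed = [sum(A[i][j] * z[j] for j in range(n)) for i in range(n)]
        # consensus step plus the local signal increment
        z = [mixed[i] + signal(i, t + 1) - signal(i, t) for i in range(n)]
    return z


z_final = dynamic_average_consensus()
print(z_final, sum(LIMITS) / len(LIMITS))
```

With this initialization the sum of the estimates equals the sum of the signals at every iteration, so once the signals settle the consensus step drives all estimates to their average.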

Appendix C Linear Programming

A Linear Program (LP) is an optimization problem with linear cost function and linear constraints:

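$$\min_{\mathbf{x}}\; c^{\top}\mathbf{x} \quad \text{subj. to}\;\; a_{k}^{\top}\mathbf{x}\leq b_{k},\quad k\in\{1,\ldots,K\}, \tag{C.1}$$

where $c\in\mathbb{R}^{d}$ is the cost vector and $a_{k}\in\mathbb{R}^{d}$ and $b_{k}\in\mathbb{R}$ describe $K$ inequality constraints. In the subsequent discussion, we assume that $d\leq K$ . The feasible set $\mathcal{X}$ of problem (C.1) is the set of vectors satisfying all the constraints, i.e.,

$$\mathcal{X}\triangleq\{\mathbf{x}\in\mathbb{R}^{d}\mid a_{k}^{\top}\mathbf{x}\leq b_{k}\text{ for all }k\in\{1,\ldots,K\}\}.$$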

Note that $\mathcal{X}$ is a polyhedron, for which the following definition of vertex can be given.

Definition 41.

A vector $\tilde{\mathbf{x}}\in\mathbb{R}^{d}$ is a vertex of $\mathcal{X}$ if there exists some $c\in\mathbb{R}^{d}$ such that $c^{\top}\tilde{\mathbf{x}}<c^{\top}\mathbf{x}$ for all $\mathbf{x}\in\mathcal{X}$ with $\mathbf{x}\neq\tilde{\mathbf{x}}$ . $\square$

If problem (C.1) admits an optimal solution, it can be shown that there exists an optimal vertex, i.e., a vertex which is an optimal solution of the problem (see, e.g., [162, Theorem 2.7]). Let $\mathbf{x}^{\star}$ be an optimal vertex of problem (C.1). Then, it is a standard result in linear programming theory that there exists an index set $\{\ell_{1},\ldots,\ell_{d}\}\subset\{1,\ldots,K\}$ , with cardinality $d$ , such that $\mathbf{x}^{\star}$ is the unique optimal vertex of the problem

$$\min_{\mathbf{x}}\; c^{\top}\mathbf{x} \quad \text{subj. to}\;\; a_{\ell_{h}}^{\top}\mathbf{x}\leq b_{\ell_{h}},\quad h\in\{1,\ldots,d\},$$

which is a relaxed version of problem (C.1) in which only $d$ constraints are considered. In addition, the vectors $a_{\ell_{h}},h\in\{1,\ldots,d\}$ are linearly independent, so that they form a basis of $\mathbb{R}^{d}$ . By analogy, the constraints $a_{\ell_{h}}^{\top}\mathbf{x}\leq b_{\ell_{h}},h\in\{1,\ldots,d\}$ are called a basis of the point $\mathbf{x}^{\star}$ . Due to the optimality of $\mathbf{x}^{\star}$ , we also call it a basis of problem (C.1). To compactly denote such a basis, we introduce a matrix $P\in\mathbb{R}^{d\times d}$ , obtained by stacking the row vectors $a_{\ell_{h}}^{\top}$ , and a vector $q\in\mathbb{R}^{d}$ , obtained by stacking the scalars $b_{\ell_{h}}$ , i.e.,

$$P=\begin{bmatrix}a_{\ell_{1}}^{\top}\\ \vdots\\ a_{\ell_{d}}^{\top}\end{bmatrix},\qquad q=\begin{bmatrix}b_{\ell_{1}}\\ \vdots\\ b_{\ell_{d}}\end{bmatrix}.$$

Then, $\mathbf{x}^{\star}=P^{-1}q$ , and we say that the tuple $(P,q)$ is a basis of (C.1).
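The relation $\mathbf{x}^{\star}=P^{-1}q$ can be illustrated on a small LP. The sketch below uses hypothetical data ($d=2$, $K=5$) and, purely for illustration, finds an optimal vertex by brute-force enumeration of all candidate index sets of cardinality $d$; this enumeration is not an algorithm from the reference and is only viable for tiny problems.

```python
from itertools import combinations


def solve_linear_system(P, q):
    """Solve P x = q by Gauss-Jordan elimination with partial pivoting."""
    n = len(q)
    M = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(P, q)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("constraint vectors are not linearly independent")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [mr - factor * mc for mr, mc in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


# Hypothetical LP: minimize c^T x subject to a_k^T x <= b_k, k = 0, ..., 4.
c = [-1.0, -2.0]
a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b = [2.0, 3.0, 4.0, 0.0, 0.0]
d = len(c)


def is_feasible(x, tol=1e-9):
    return all(dot(a_k, x) <= b_k + tol for a_k, b_k in zip(a, b))


best = None
for idx in combinations(range(len(b)), d):
    P = [a[k] for k in idx]   # stack the selected rows a_k^T
    q = [b[k] for k in idx]   # stack the selected right-hand sides
    try:
        x = solve_linear_system(P, q)   # candidate point x = P^{-1} q
    except ValueError:
        continue                        # rows not linearly independent
    if is_feasible(x) and (best is None or dot(c, x) < best[0]):
        best = (dot(c, x), x, idx)

best_cost, x_star, basis_idx = best
print(x_star, basis_idx, best_cost)
```

In this example the optimal vertex is $(1,3)$ and the selected pair consists of the constraints $x_{2}\leq 3$ and $x_{1}+x_{2}\leq 4$ , which are the only two constraints active there.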

If problem (C.1) has multiple optimal solutions, we say that the LP is dual degenerate. In the presence of dual degeneracy, it is not trivial to guarantee convergence of distributed algorithms to the same optimal solution. In order to overcome this issue, it is possible to rely on the lexicographic ordering of vectors. We now give some definitions.

Definition 42.

A vector $\mathbf{v}\in\mathbb{R}^{n}$ is said to be lexicographically positive (or lex-positive) if $\mathbf{v}\neq\mathbf{0}$ and the first non-zero component of $\mathbf{v}$ is positive. In symbols:

$$\mathbf{v}\stackrel{L}{>}\mathbf{0}.$$

A vector $\mathbf{u}\in\mathbb{R}^{n}$ is said to be lexicographically larger (resp. smaller) than another vector $\mathbf{v}\in\mathbb{R}^{n}$ if $\mathbf{u}-\mathbf{v}$ is lex-positive (resp. $\mathbf{v}-\mathbf{u}$ is lex-positive), or, equivalently, if $\mathbf{u}\neq\mathbf{v}$ and the first nonzero component of $\mathbf{u}-\mathbf{v}$ is positive (resp., negative). In symbols:

$$\mathbf{u}\stackrel{L}{>}\mathbf{v}\quad(\text{resp. }\mathbf{u}\stackrel{L}{<}\mathbf{v}).$$

Given a set of vectors $\{\mathbf{v}_{1},\ldots,\mathbf{v}_{r}\}$ , the lexicographic minimum is the element $\mathbf{v}_{i}$ such that $\mathbf{v}_{j}\stackrel{L}{>}\mathbf{v}_{i}$ for all $j\neq i$ . In symbols:

$$\mathbf{v}_{i}=\mathop{\rm lexmin}\{\mathbf{v}_{1},\ldots,\mathbf{v}_{r}\}.$$

Now, consider the optimal solution set of problem (C.1), i.e., $\mathcal{X}^{\star}\triangleq\{\mathbf{x}\in\mathcal{X}\mid c^{\top}\mathbf{x}\leq c^{\top}\mathbf{x}^{\prime}\text{ for all }\mathbf{x}^{\prime}\in\mathcal{X}\}\subseteq\mathcal{X}$ , where $\mathcal{X}$ is the feasible set of problem (C.1). Among all the optimal solutions in $\mathcal{X}^{\star}$ , it is possible to compute the lexicographically minimal one, i.e., $\mathop{\rm lexmin}(\mathcal{X}^{\star})$ . It turns out that finding $\mathop{\rm lexmin}(\mathcal{X}^{\star})$ is equivalent to finding the (unique) optimal solution to a modified (non dual-degenerate) version of the original problem (C.1), where the cost vector $c$ is perturbed to $c^{\prime}=c+\Delta$ , with $\Delta$ a lexicographic perturbation vector:

[TABLE]

for a sufficiently small $\Delta_{0}>0$ (see [163]). Therefore, the lex-optimal solution of problem (C.1) is the unique optimal solution of the problem with perturbed cost

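$$\min_{\mathbf{x}}\; (c+\Delta)^{\top}\mathbf{x} \quad \text{subj. to}\;\; a_{k}^{\top}\mathbf{x}\leq b_{k},\quad k\in\{1,\ldots,K\}. \tag{C.2}$$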

Thus, the lex-optimal solution of problem (C.1) exists if and only if problem (C.2) admits an optimal solution. Moreover, the optimal solution of (C.2) is attained at a vertex of (C.1), therefore it is an optimal vertex of problem (C.1).
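The lexicographic definitions and the effect of the cost perturbation can be checked numerically on a small example. In the sketch below, the two vectors are the optimal vertices of a hypothetical dual-degenerate LP (minimize $-x_{1}-x_{2}$ subject to $x_{1}\leq 2$ , $x_{2}\leq 3$ , $x_{1}+x_{2}\leq 4$ , $\mathbf{x}\geq\mathbf{0}$ ), and the perturbation is assumed to have the form $\Delta=[\Delta_{0},\Delta_{0}^{2},\ldots,\Delta_{0}^{d}]^{\top}$ , a common choice; both the example and this form of $\Delta$ are assumptions made for illustration.

```python
def is_lex_positive(v):
    """True if v is nonzero and its first nonzero component is positive."""
    for component in v:
        if component != 0:
            return component > 0
    return False  # the zero vector is not lex-positive


def is_lex_larger(u, v):
    """True if u is lexicographically larger than v, i.e. u - v is lex-positive."""
    return is_lex_positive([ui - vi for ui, vi in zip(u, v)])


def lexmin(vectors):
    """Lexicographic minimum of a nonempty list of vectors."""
    best = vectors[0]
    for v in vectors[1:]:
        if is_lex_larger(best, v):
            best = v
    return best


def cost(c_vec, x):
    return sum(ci * xi for ci, xi in zip(c_vec, x))


# Optimal vertices of the assumed dual-degenerate LP (both have cost -4).
optimal_vertices = [[2.0, 2.0], [1.0, 3.0]]
c = [-1.0, -1.0]

# Assumed perturbation Delta = (Delta_0, Delta_0^2, ..., Delta_0^d).
delta0 = 1e-3
c_perturbed = [ci + delta0 ** (k + 1) for k, ci in enumerate(c)]

lex_solution = lexmin(optimal_vertices)
perturbed_solution = min(optimal_vertices, key=lambda x: cost(c_perturbed, x))
print(lex_solution, perturbed_solution)
```

Both selections return the same vertex: the perturbation breaks the tie between the two optimal vertices in favor of the one with the smaller first component, which is the lexicographically minimal one.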

Acknowledgements

This work is part of a project that has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 638992 - OPT4SMART).

Bibliography

[1] I. Necoara, V. Nedelcu, and I. Dumitrache, “Parallel and distributed optimization methods for estimation and control in networks,” Journal of Process Control, vol. 21, no. 5, pp. 756–766, 2011.
[2] A. Nedić, “Convergence rate of distributed averaging dynamics and optimization in networks,” Foundations and Trends® in Systems and Control, vol. 2, no. 1, pp. 1–100, 2015.
[3] A. Nedić and J. Liu, “Distributed optimization for control,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 1, pp. 77–103, 2018.
[4] A. Nedić, A. Olshevsky, and M. G. Rabbat, “Network topology and communication-computation tradeoffs in decentralized optimization,” Proceedings of the IEEE, vol. 106, no. 5, pp. 953–976, 2018.
[5] A. Nedić, J.-S. Pang, G. Scutari, and Y. Sun, Multi-agent Optimization. Springer, 2018, vol. 2224.
[6] P. Giselsson and A. Rantzer, Large-scale and Distributed Optimization. Springer, 2018, vol. 2227.
[7] M. S. Stankovic, K. H. Johansson, and D. M. Stipanovic, “Distributed seeking of Nash equilibria with applications to mobile sensor networks,” IEEE Transactions on Automatic Control, vol. 57, no. 4, pp. 904–919, 2011.
[8] N. Li and J. R. Marden, “Designing games for distributed optimization,” IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 2, pp. 230–242, 2013.
