Adaptation and learning over networks under subspace constraints -- Part I: Stability Analysis

Roula Nassif; Stefan Vlaski; Ali H. Sayed

arXiv:1905.08750·cs.MA·April 22, 2020


TL;DR

This paper introduces a distributed adaptive algorithm for network optimization under subspace constraints, proving its stability and small estimation errors in the small step-size regime, with further performance analysis in the sequel.

Contribution

It develops a distributed implementation of projected gradient descent for subspace-constrained network optimization, ensuring stability and performance close to centralized solutions.

Findings

1. Distributed algorithm achieves small estimation errors for small step-sizes.

2. The proposed method generalizes consensus optimization to broader subspace constraints.

3. Part II will analyze steady-state performance considering noise and data characteristics.

Abstract

This paper considers optimization problems over networks where agents have individual objectives to meet, or individual parameter vectors to estimate, subject to subspace constraints that require the objectives across the network to lie in low-dimensional subspaces. This constrained formulation includes consensus optimization as a special case, and allows for more general task relatedness models such as smoothness. While such formulations can be solved via projected gradient descent, the resulting algorithm is not distributed. Starting from the centralized solution, we propose an iterative and distributed implementation of the projection step, which runs in parallel with the stochastic gradient descent update. We establish in this Part I of the work that, for small step-sizes $μ$ , the proposed distributed adaptive strategy leads to small estimation errors on the order of $μ$ . We…

Figures (5)


Tables (2)

Table I: Definition of some variables used throughout the analysis. $\mathcal{I}$ is a permutation matrix defined by (36).

Variable	Real data case	Complex data case
Data-type variable $h$	1	2
Gradient vector	$\nabla_{w_{k}^{⊤}} J_{k} (w_{k})$	$\nabla_{w_{k}^{*}} J_{k} (w_{k})$
Error vector ${\tilde{𝒘}}_{k, i}^{e}$	${\tilde{𝒘}}_{k, i}$ from (53)	$[\begin{matrix} {\tilde{𝒘}}_{k, i} \\ {({\tilde{𝒘}}_{k, i}^{*})}^{⊤} \end{matrix}]$
Gradient noise $𝒔_{k, i}^{e} (w)$	$𝒔_{k, i} (w)$ from (26)	$[\begin{matrix} 𝒔_{k, i} (w) \\ {(𝒔_{k, i}^{*} (w))}^{⊤} \end{matrix}]$
Bias vector $b_{k}^{e}$	$b_{k}$ from (56)	$[\begin{matrix} b_{k} \\ {(b_{k}^{*})}^{⊤} \end{matrix}]$
$(k, ℓ)$ -th block of $𝒜^{e}$	$A_{k ℓ}$	$[\begin{matrix} A_{k ℓ} & 0 \\ 0 & {(A_{k ℓ}^{*})}^{⊤} \end{matrix}]$
Matrix $𝒰^{e}$	$𝒰$	$ℐ^{⊤} [\begin{matrix} 𝒰 & 0 \\ 0 & {(𝒰^{*})}^{⊤} \end{matrix}]$
Matrix $𝒥_{ϵ}^{e}$	$𝒥_{ϵ}$ from (49)	$[\begin{matrix} 𝒥_{ϵ} & 0 \\ 0 & {(𝒥_{ϵ}^{*})}^{⊤} \end{matrix}]$
Matrix $𝒱_{R, ϵ}^{e}$	$𝒱_{R, ϵ}$ from (49)	$ℐ^{⊤} [\begin{matrix} 𝒱_{R, ϵ} & 0 \\ 0 & {(𝒱_{R, ϵ}^{*})}^{⊤} \end{matrix}]$
Matrix ${(𝒱_{L, ϵ}^{e})}^{*}$	$𝒱_{L, ϵ}^{*}$ from (49)	$[\begin{matrix} 𝒱_{L, ϵ}^{*} & 0 \\ 0 & 𝒱_{L, ϵ}^{⊤} \end{matrix}] ℐ$

Table II: Distributed beamforming settings for uniform linear arrays of $N$ antennas ($1\leq\nu\leq N-1$).

Neighboring set $𝒩_{k}$	Parameter vector $w_{k}$	Regressor $𝒖_{k, i}$
$\{\max\{1, k - ν\}, \dots, \min\{k + ν, N\}\}$	$\text{col}\{h_{m}\}_{m = \max\{1, k - ν\}}^{\min\{k + ν, N\}}$	$\text{col}\{|𝒩_{m}|^{- \frac{1}{2}} 𝒙_{m} (i)\}_{m = \max\{1, k - ν\}}^{\min\{k + ν, N\}}$

Equations (257)

w^{o}=\arg\min_{w}\sum_{k=1}^{N}J_{k}(w),

{\scriptstyle\mathcal{W}}^{o}=\arg\min_{{\scriptscriptstyle\mathcal{W}}}~J^{\text{glob}}({\scriptstyle\mathcal{W}})\triangleq\sum_{k=1}^{N}J_{k}(w_{k}),\quad\text{subject to}~{\scriptstyle\mathcal{W}}\in\mathcal{R}(\mathcal{U}),

{\scriptstyle\mathcal{W}}_{i}=\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}}\left({\scriptstyle\mathcal{W}}_{i-1}-\mu\,\text{col}\left\{\nabla_{w_{k}^{*}}J_{k}(w_{k,i-1})\right\}_{k=1}^{N}\right),\quad i\geq 0,

\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}}=\mathcal{U}(\mathcal{U}^{*}\mathcal{U})^{-1}\mathcal{U}^{*},

\widehat{\nabla_{w_{k}^{*}}J_{k}}(w_{k})=\nabla_{w_{k}^{*}}Q_{k}(w_{k};\boldsymbol{x}_{k,i}),

\psi_{k,i}=w_{k,i-1}-\mu\nabla_{w_{k}^{*}}J_{k}(w_{k,i-1}).

\lim_{i\rightarrow\infty}\mathcal{A}^{i}=\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}},

A_{k\ell}=[\mathcal{A}]_{k\ell}=0,\quad\text{if }\ell\notin\mathcal{N}_{k}\text{ and }k\neq\ell,

\left\{\begin{array}[]{rl}\boldsymbol{\psi}_{k,i}=&\boldsymbol{w}_{k,i-1}-\mu\widehat{\nabla_{w_{k}^{*}}J_{k}}(\boldsymbol{w}_{k,i-1}),\\ \boldsymbol{w}_{k,i}=&\sum\limits_{\ell\in\mathcal{N}_{k}}A_{k\ell}\boldsymbol{\psi}_{\ell,i},\end{array}\right.

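\mathcal{A}\,\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}}=\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}},

\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}}\,\mathcal{A}=\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}},

\rho(\mathcal{A}-\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}})<1,

\mathcal{A}\,\mathcal{U}=\mathcal{U},

\mathcal{U}^{*}\mathcal{A}=\mathcal{U}^{*},

a_{k\ell}\geq 0,\quad A\mathds{1}_{N}=\mathds{1}_{N},\quad\mathds{1}_{N}^{\top}A=\mathds{1}_{N}^{\top},\quad a_{k\ell}=0\text{ if }\ell\notin\mathcal{N}_{k}\text{ and }k\neq\ell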

\left\{\begin{array}[]{rl}\boldsymbol{\psi}_{k,i}=&\boldsymbol{w}_{k,i-1}-\mu\widehat{\nabla_{w^{*}_{k}}J_{k}}(\boldsymbol{w}_{k,i-1}),\\ \boldsymbol{w}_{k,i}=&\sum\limits_{\ell\in\mathcal{N}_{k}}a_{k\ell}\boldsymbol{\psi}_{\ell,i}.\end{array}\right.

{\scriptstyle\mathcal{W}}^{o}=\arg\min_{{\scriptscriptstyle\mathcal{W}}}\sum_{k=1}^{N}J_{k}(w_{k}),\quad\text{subject to}~\mathcal{D}^{*}{\scriptstyle\mathcal{W}}=d,

{\scriptstyle\mathcal{W}}_{i}=\mathcal{P}_{\scriptstyle\mathcal{D}}\left({\scriptstyle\mathcal{W}}_{i-1}-\mu\,\text{col}\left\{\nabla_{w_{k}^{*}}J_{k}(w_{k,i-1})\right\}_{k=1}^{N}\right)+d_{\scriptstyle\mathcal{D}},\quad i\geq 0,

\left\{\begin{array}[]{rl}\boldsymbol{\psi}_{k,i}=&\boldsymbol{w}_{k,i-1}-\mu\widehat{\nabla_{w^{*}_{k}}J_{k}}(\boldsymbol{w}_{k,i-1}),\\ \boldsymbol{w}_{k,i}=&\sum\limits_{\ell\in\mathcal{N}_{k}}A_{k\ell}\boldsymbol{\psi}_{\ell,i}+d_{\mathcal{D},k},\end{array}\right.

S({\scriptstyle\mathcal{W}})={\scriptstyle\mathcal{W}}^{\top}\mathcal{L}_{c}{\scriptstyle\mathcal{W}}=\frac{1}{2}\sum_{k=1}^{N}\sum_{\ell\in\mathcal{N}_{k}}c_{k\ell}\|w_{k}-w_{\ell}\|^{2},

S({\scriptstyle\mathcal{W}})=\overline{{\scriptstyle\mathcal{W}}}^{\top}(\Lambda\otimes I_{L})\overline{{\scriptstyle\mathcal{W}}}=\sum_{m=1}^{N}\lambda_{m}\|\overline{w}_{m}\|^{2},

\begin{array}[]{cl}\text{find}&\mathcal{A}\\ \text{such that}&\mathcal{A}\,\mathcal{U}=\mathcal{U},~{}~{}\mathcal{U}^{*}\mathcal{A}=\mathcal{U}^{*},\\ &\rho(\mathcal{A}-\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}})<1,\\ &[\mathcal{A}]_{k\ell}=0,~{}\text{if }\ell\notin\mathcal{N}_{k}\text{ and }\ell\neq k,\end{array}

\text{MSD}\triangleq\mu\lim_{\mu\rightarrow 0}\left(\limsup_{i\rightarrow\infty}\frac{1}{\mu}\mathbb{E}\left(\frac{1}{N}\|{\scriptstyle\mathcal{W}}^{o}-\boldsymbol{{\scriptstyle\mathcal{W}}}_{i}\|^{2}\right)\right),

\left\{\begin{array}[]{rl}\big{(}\boldsymbol{\psi}_{k,i}^{*})^{\top}=&\big{(}\boldsymbol{w}_{k,i-1}^{*}\big{)}^{\top}-\mu\widehat{\nabla_{w^{\top}_{k}}J_{k}}(\boldsymbol{w}_{k,i-1}),\\[2.79857pt] \big{(}\boldsymbol{w}_{k,i}^{*}\big{)}^{\top}=&\sum\limits_{\ell\in\mathcal{N}_{k}}(A_{k\ell}^{*})^{\top}\big{(}\boldsymbol{\psi}_{\ell,i}^{*}\big{)}^{\top}.\end{array}\right.

\left\{\begin{split}\left[\begin{array}[]{c}\boldsymbol{\psi}_{k,i}\\ \big{(}\boldsymbol{\psi}_{k,i}^{*})^{\top}\end{array}\right]&=\left[\begin{array}[]{c}\boldsymbol{w}_{k,i-1}\\ \big{(}\boldsymbol{w}_{k,i-1}^{*})^{\top}\end{array}\right]-\mu\left[\begin{array}[]{c}\widehat{\nabla_{w_{k}^{*}}J_{k}}(\boldsymbol{w}_{k,i-1})\\[2.15277pt] \widehat{\nabla_{w^{\top}_{k}}J_{k}}(\boldsymbol{w}_{k,i-1})\end{array}\right]\\ \left[\begin{array}[]{c}\boldsymbol{w}_{k,i}\\ \big{(}\boldsymbol{w}_{k,i}^{*})^{\top}\end{array}\right]&=\sum_{\ell\in\mathcal{N}_{k}}\left[\begin{array}[]{cc}A_{k\ell}&0\\ 0&(A_{k\ell}^{*})^{\top}\end{array}\right]\left[\begin{array}[]{c}\boldsymbol{\psi}_{\ell,i}\\ \big{(}\boldsymbol{\psi}_{\ell,i}^{*})^{\top}\end{array}\right].\end{split}\right.

\boldsymbol{s}_{k,i}(w)\triangleq\nabla_{w_{k}^{*}}J_{k}(w)-\widehat{\nabla_{w_{k}^{*}}J_{k}}(w).

H_{k}(w_{k})=\left\{\begin{array}{l}\nabla_{w_{k}^{\top}}[\nabla_{w_{k}}J_{k}(w_{k})],\qquad\text{when the data is real }(M_{k}\times M_{k})\\ \left[\begin{array}{c|c}\nabla_{w_{k}^{*}}[\nabla_{w_{k}}J_{k}(w_{k})]&(\nabla_{w_{k}^{\top}}[\nabla_{w_{k}}J_{k}(w_{k})])^{*}\\[2.15277pt] \hline \nabla_{w_{k}^{\top}}[\nabla_{w_{k}}J_{k}(w_{k})]&(\nabla_{w_{k}^{*}}[\nabla_{w_{k}}J_{k}(w_{k})])^{\top}\end{array}\right]\qquad\text{when the data is complex }(2M_{k}\times 2M_{k})\end{array}\right.

H (W)

\nabla_{{\scriptscriptstyle\mathcal{W}}}^{2}J^{\text{glob}}({\scriptstyle\mathcal{W}})=\left\{\begin{array}[]{ll}\mathcal{H}({\scriptstyle\mathcal{W}}),&\text{when the data is real }\\ \mathcal{I}\mathcal{H}({\scriptstyle\mathcal{W}})\mathcal{I}^{\top},&\text{when the data is complex }\end{array}\right.

\mathcal{I}\triangleq\left[\begin{array}[]{cccccccc}I_{M_{1}}&0&0&0&\ldots&0&0&0\\ 0&0&I_{M_{2}}&0&\ldots&0&0&0\\ &&&&\ddots&&&\\ 0&0&0&0&\ldots&0&I_{M_{N}}&0\\ \hline\cr 0&I_{M_{1}}&0&0&\ldots&0&0&0\\ 0&0&0&I_{M_{2}}&\ldots&0&0&0\\ &&&&\ddots&&&\\ 0&0&0&0&\ldots&0&0&I_{M_{N}}\end{array}\right].

[\mathcal{I}]_{mn}\triangleq\left\{\begin{array}[]{ll}I_{M_{k}},&\text{if }m=k,n=2(k-1)+1\\ I_{M_{k}},&\text{if }m=k+N,n=2k\\ 0,&\text{otherwise}\\ \end{array}\right.

\frac{ν _{k}}{h} I_{h M_{k}} \leq H_{k} (w_{k}) \leq \frac{δ _{k}}{h} I_{h M_{k}},

Full text

Adaptation and learning over networks under subspace constraints – Part I: Stability Analysis

Roula Nassif†, Stefan Vlaski†,‡, Ali H. Sayed†

† Institute of Electrical Engineering, EPFL, Switzerland

‡ Electrical Engineering Department, UCLA, USA

This work was supported in part by NSF grant CCF-1524250. A short conference version of this work appears in [1].

Abstract

This paper considers optimization problems over networks where agents have individual objectives to meet, or individual parameter vectors to estimate, subject to subspace constraints that require the objectives across the network to lie in low-dimensional subspaces. This constrained formulation includes consensus optimization as a special case, and allows for more general task relatedness models such as smoothness. While such formulations can be solved via projected gradient descent, the resulting algorithm is not distributed. Starting from the centralized solution, we propose an iterative and distributed implementation of the projection step, which runs in parallel with the stochastic gradient descent update. We establish in this Part I of the work that, for small step-sizes $\mu$ , the proposed distributed adaptive strategy leads to small estimation errors on the order of $\mu$ . We examine in the accompanying Part II [2] the steady-state performance. The results will reveal explicitly the influence of the gradient noise, data characteristics, and subspace constraints, on the network performance. The results will also show that in the small step-size regime, the iterates generated by the distributed algorithm achieve the centralized steady-state performance.

Index Terms:

Distributed optimization, subspace projection, gradient noise, stability analysis.

I Introduction

Distributed inference allows a collection of interconnected agents to perform parameter estimation tasks from streaming data by relying solely on local computations and interactions with immediate neighbors. Most prior literature focuses on consensus problems, where agents with separate objective functions need to agree on a common parameter vector corresponding to the minimizer of the aggregate sum of the individual costs, namely,

$$w^{o}=\arg\min_{w}\sum_{k=1}^{N}J_{k}(w), \tag{1}$$

where $J_{k}(\cdot)$ is the cost function at agent $k$ , $N$ is the number of agents in the network, and $w\in\mathbb{C}^{L}$ is the global parameter vector, which all agents need to agree upon–see Fig. 1 (middle). Each agent seeks to estimate $w^{o}$ through local computations and communications among neighboring agents without the need to know any of the costs besides their own. Among many useful strategies that have been proposed in the literature [3, 4, 5, 6, 7, 8, 9, 10], diffusion strategies [3, 4, 5] are particularly attractive since they are scalable, robust, and enable continuous learning and adaptation in response to drifts in the location of the minimizer.

However, many network applications require more complex models and more flexible algorithms than consensus implementations, since their agents may need to estimate and track multiple distinct, though related, objectives. For instance, in distributed power system state estimation, the local state vectors to be estimated at neighboring control centers may overlap partially since the areas in a power system are interconnected [11, 12]. Likewise, in monitoring applications, agents need to track the movement of multiple correlated targets and to exploit the correlation profile in the data for enhanced accuracy [13, 14]. Problems of this kind, where nodes need to infer multiple, though related, parameter vectors, are referred to as multitask problems. Existing strategies to address multitask problems generally exploit prior knowledge on how the tasks across the network relate to each other [15, 11, 13, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 12, 26, 27, 28, 29, 14]. For example, one way to model relationships among tasks is to formulate convex optimization problems with appropriate co-regularizers between neighboring agents [13, 16, 17, 18, 19]. Graph spectral regularization can also be used in order to leverage more thoroughly the graph spectral information and improve the multitask network performance [20]. In other applications, it may happen that the parameter vectors to be estimated at neighboring agents are related according to a set of linear equality constraints [22, 23, 24, 21, 26, 12, 25].

However, in this paper, and the accompanying Part II [2], we consider multitask inference problems where each agent seeks to minimize an individual cost (expressed as the expectation of some loss function), and where the collection of parameter vectors to be estimated across the network is required to lie in a low-dimensional subspace–see Fig. 1 (left). That is, we let $w_{k}\in{\mathbb{C}^{M_{k}}}$ denote some parameter vector at node $k$ and let ${\scriptstyle\mathcal{W}}=\text{col}\{w_{1},\ldots,w_{N}\}$ denote the collection of parameter vectors from across the network. We associate with each agent $k$ a differentiable convex cost $J_{k}(w_{k}):{\mathbb{C}^{M_{k}}}\rightarrow\mathbb{R}$ , which is expressed as the expectation of some loss function $Q_{k}(\cdot)$ and written as $J_{k}(w_{k})=\mathbb{E}Q_{k}(w_{k};\boldsymbol{x}_{k})$ , where $\boldsymbol{x}_{k}$ denotes the random data. The expectation is computed over the distribution of the data. Let $M=\sum_{k=1}^{N}M_{k}$ . We consider constrained problems of the form:

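$${\scriptstyle\mathcal{W}}^{o}=\arg\min_{{\scriptscriptstyle\mathcal{W}}}~J^{\text{glob}}({\scriptstyle\mathcal{W}})\triangleq\sum_{k=1}^{N}J_{k}(w_{k}),\quad\text{subject to}~{\scriptstyle\mathcal{W}}\in\mathcal{R}(\mathcal{U}), \tag{2}$$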

where $\mathcal{R}(\cdot)$ denotes the range space operator, and $\mathcal{U}$ is an $M\times P$ full-column rank matrix with $P\ll M$ . Each agent $k$ is interested in estimating the $k$ -th $M_{k}\times 1$ subvector $w^{o}_{k}$ of ${\scriptstyle\mathcal{W}}^{o}=\text{col}\{w^{o}_{1},\ldots,w^{o}_{N}\}$ . In order to solve (2) iteratively, the gradient projection method can be applied [30]:

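$${\scriptstyle\mathcal{W}}_{i}=\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}}\left({\scriptstyle\mathcal{W}}_{i-1}-\mu\,\text{col}\left\{\nabla_{w_{k}^{*}}J_{k}(w_{k,i-1})\right\}_{k=1}^{N}\right),\quad i\geq 0, \tag{3}$$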

where ${\scriptstyle\mathcal{W}}_{i}=\text{col}\{w_{1,i},\ldots,w_{N,i}\}$ with $w_{k,i}$ the estimate of $w^{o}_{k}$ at iteration $i$ and agent $k$ , $\mu>0$ is a small step-size parameter, $\nabla_{w_{k}^{*}}J_{k}(\cdot)$ is the (Wirtinger) complex gradient [4, Appendix A] of $J_{k}(\cdot)$ relative to $w^{*}_{k}$ (complex conjugate of $w_{k}$ ), and $\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}}$ is the projector onto the $P$ -dimensional subspace of $\mathbb{C}^{M}$ spanned by the columns of $\mathcal{U}$ :

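$$\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}}=\mathcal{U}(\mathcal{U}^{*}\mathcal{U})^{-1}\mathcal{U}^{*}, \tag{4}$$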

where we used the fact that $\mathcal{U}$ is a full-column rank matrix.
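As a minimal sketch of recursion (3) with the projector (4), the NumPy snippet below runs centralized projected gradient descent for real data. The quadratic cost $\|{\scriptstyle\mathcal{W}}-T\|^{2}$, the stacked target vector `T`, the dimensions, and the step size are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def projector(U):
    """Orthogonal projector onto the range of a full-column-rank U, as in (4)."""
    return U @ np.linalg.solve(U.conj().T @ U, U.conj().T)

def projected_gradient(U, grad, W0, mu, num_iter):
    """Iterate W_i = P_U (W_{i-1} - mu * grad(W_{i-1})), as in (3)."""
    P = projector(U)
    W = W0.copy()
    for _ in range(num_iter):
        W = P @ (W - mu * grad(W))
    return W

rng = np.random.default_rng(0)
M, P_dim = 12, 3                      # hypothetical sizes with P < M
U = rng.standard_normal((M, P_dim))   # full column rank with probability one
T = rng.standard_normal(M)            # hypothetical stacked targets

# Illustrative cost J_glob(W) = ||W - T||^2, whose gradient is 2 (W - T).
W_hat = projected_gradient(U, lambda W: 2.0 * (W - T), np.zeros(M), mu=0.1, num_iter=200)

P_U = projector(U)
W_opt = P_U @ T   # constrained minimizer for this particular cost
```

For this specific cost the constrained minimizer is the projection of `T` onto the range of `U`, which gives a closed-form point to compare `W_hat` against.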

We are particularly interested in solving the problem in the stochastic setting when the distribution of the data $\boldsymbol{x}_{k}$ is generally unknown. This means that the risks $J_{k}(\cdot)$ and their gradients $\nabla_{w_{k}^{*}}J_{k}(\cdot)$ are unknown. As such, approximate gradient vectors need to be employed. A common construction in stochastic approximation theory is to employ the following approximation at iteration $i$ :

$$\widehat{\nabla_{w_{k}^{*}}J_{k}}(w_{k})=\nabla_{w_{k}^{*}}Q_{k}(w_{k};\boldsymbol{x}_{k,i}), \tag{5}$$

where $\boldsymbol{x}_{k,i}$ represents the data observed at iteration $i$ . The difference between the true gradient and its approximation is called gradient noise. This noise will seep into the operation of the algorithm and one main challenge is to show that despite its presence, agent $k$ is still able to approach $w^{o}_{k}$ asymptotically.
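To make the notion of gradient noise concrete, the sketch below assumes a real-data mean-square-error loss $Q(w;\{d,u\})=(d-u^{\top}w)^{2}$ with white regressors. This loss, the model `w_o`, and the noise level are hypothetical choices made only for illustration. The instantaneous gradient fluctuates from sample to sample, but its average approaches the true gradient.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 3
w_o = np.array([1.0, -1.0, 0.5])   # hypothetical model generating the data
w = np.zeros(M)                    # point at which the gradients are evaluated
sigma_v = 0.1                      # hypothetical measurement-noise standard deviation

def instantaneous_gradient(w, u, d):
    """Gradient of Q(w; {d, u}) = (d - u^T w)^2; u may hold one sample or a batch of rows."""
    residual = np.asarray(d - u @ w)
    return -2.0 * u * residual[..., None]

# True gradient of J(w) = E Q(w; x) when E[u u^T] = I: 2 (w - w_o).
true_grad = 2.0 * (w - w_o)

n = 50_000
U_data = rng.standard_normal((n, M))                       # white regressors
d_data = U_data @ w_o + sigma_v * rng.standard_normal(n)   # noisy linear measurements
approx = instantaneous_gradient(w, U_data, d_data)         # one approximation per sample

noise = true_grad - approx          # gradient-noise realizations
mean_noise = noise.mean(axis=0)     # close to zero: the noise averages out
```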

Although the gradient update in (3) and (5) can be performed locally at agent $k$ , the projection operation requires a fusion center. To see this, let us introduce an intermediate variable $\psi_{k,i}$ at node $k$ :

$$\psi_{k,i}=w_{k,i-1}-\mu\nabla_{w_{k}^{*}}J_{k}(w_{k,i-1}). \tag{6}$$

After evaluating $\psi_{k,i}$ locally, each agent at each iteration needs to send its estimate $\psi_{k,i}$ to a fusion center, which performs the projection operation in (3) by computing ${\scriptstyle\mathcal{W}}_{i}=\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}}\text{col}\{\psi_{1,i},\ldots,\psi_{N,i}\}$ , and then sends the resulting estimates $w_{k,i}$ back to the agents. While centralized solutions can be powerful, decentralized solutions are more attractive since they are more robust and respect the privacy policy at each agent [4]. Thus, a second challenge we face in this paper is how to carry out the projection through a distributed network where each node performs local computations and exchanges information only with its neighbors.

We propose in Section II an adaptive and distributed iterative algorithm allowing each agent $k$ to converge, in the mean-square-error sense, within $O(\mu)$ from the solution $w^{o}_{k}$ of (2), for sufficiently small $\mu$ . Conditions on the network topology and signal subspace ensuring the feasibility of a distributed implementation are provided. We also show how some well-known network optimization problems, such as consensus optimization [3, 4, 5] and multitask smooth optimization [16], can be recast in the form (2) and addressed with the strategy proposed in this paper. The analysis in Section III of this Part I shows that, for sufficiently small $\mu$ , the proposed adaptive strategy leads to small estimation errors on the order of the small step-size. Building on the results of this Part I, we shall derive in Part II [2] a closed-form expression for the steady-state network mean-square-error performance. This closed form expression will reveal explicitly the influence of the data characteristics (captured by the second-order properties of the costs and second-order moments of the gradient noises) and subspace constraints (captured by $\mathcal{U}$ ), on the network performance. The results will also show that, in the small step-size regime, the iterates generated by the distributed implementation achieve the centralized steady-state performance. For illustration purposes, distributed sub-optimal beamforming is considered in Section IV of this Part I.

Notation: All vectors are column vectors. Random quantities are denoted in boldface. Matrices are denoted in capital letters while vectors and scalars are denoted in lower-case letters. We use the symbol $(\cdot)^{\top}$ to denote matrix transpose, the symbol $(\cdot)^{*}$ to denote matrix complex-conjugate transpose, and the symbol $\text{Tr}(\cdot)$ to denote trace operator. The symbol $\text{diag}\{\cdot\}$ forms a matrix from block arguments by placing each block immediately below and to the right of its predecessor. The operator $\text{col}\{\cdot\}$ stacks the column vector entries on top of each other. The symbol $\otimes$ denotes the Kronecker product. The $M\times M$ identity matrix is denoted by $I_{M}$ .

II Distributed inference under subspace constraints

We now propose and study a distributed solution to (2) with a continuous adaptation mechanism. The solution must rely on local computations and communications with immediate neighbors, and operate in real time on streaming data. One of the challenges we face is that the projection in (3) requires non-local exchange of information. Our strategy is to replace the $M\times M$ projection matrix $\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}}$ in (3) by an $M\times M$ matrix $\mathcal{A}$ that satisfies the following conditions:

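$$\lim_{i\rightarrow\infty}\mathcal{A}^{i}=\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}}, \tag{7}$$

$$A_{k\ell}=[\mathcal{A}]_{k\ell}=0,\quad\text{if }\ell\notin\mathcal{N}_{k}\text{ and }k\neq\ell, \tag{8}$$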

where $[\mathcal{A}]_{k\ell}$ denotes the $(k,\ell)$ -th block of $\mathcal{A}$ of dimension $M_{k}\times M_{\ell}$ and $\mathcal{N}_{k}$ denotes the neighborhood of agent $k$ , i.e., the set of nodes connected to agent $k$ by an edge. The sparsity condition (8) characterizes the network topology and ensures local exchange of information at each time instant $i$ . By replacing the projector $\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}}$ in (3) by $\mathcal{A}$ and the true gradients $\nabla_{w_{k}^{*}}J_{k}(\cdot)$ by their stochastic approximations, we obtain the following distributed adaptive solution at each agent $k$ :

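$$\left\{\begin{array}{rl}\boldsymbol{\psi}_{k,i}=&\boldsymbol{w}_{k,i-1}-\mu\widehat{\nabla_{w_{k}^{*}}J_{k}}(\boldsymbol{w}_{k,i-1}),\\ \boldsymbol{w}_{k,i}=&\sum\limits_{\ell\in\mathcal{N}_{k}}A_{k\ell}\boldsymbol{\psi}_{\ell,i},\end{array}\right. \tag{9}$$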

where we used condition (8), and where $\boldsymbol{\psi}_{k,i}$ is an intermediate estimate and $\boldsymbol{w}_{k,i}$ is the estimate of $w^{o}_{k}$ at agent $k$ and iteration $i$ . As we shall see in Section III, condition (7) helps ensure convergence toward the optimum. Necessary and sufficient conditions for the matrix equation (7) to hold are given in the following lemma.
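The two steps of (9) can be sketched per agent as below. This is only an illustration under stated assumptions: real data, exact gradients of hypothetical quadratic costs $\|w_{k}-t_{k}\|^{2}$, a fully connected three-agent network (so that the sparsity condition is trivially met), neighborhoods taken to include the agent itself, and a hand-built $\mathcal{A}=\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}}+0.3(I-\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}})$.

```python
import numpy as np

def distributed_step(w, grads, A_blocks, neighbors, mu):
    """One iteration of (9): local gradient step, then combination over neighbors.

    w         : list of per-agent estimates w_{k,i-1}
    grads     : list of callables; grads[k](w_k) returns a gradient (approximation)
    A_blocks  : dict mapping (k, l) to the block A_{kl}
    neighbors : dict mapping k to N_k (assumed here to include k itself)
    """
    psi = [w[k] - mu * grads[k](w[k]) for k in range(len(w))]
    return [sum(A_blocks[(k, l)] @ psi[l] for l in neighbors[k]) for k in range(len(w))]

rng = np.random.default_rng(2)
N, L = 3, 2                                    # hypothetical network and block sizes
neighbors = {k: [0, 1, 2] for k in range(N)}   # fully connected, self included
U = rng.standard_normal((N * L, 2))
P_U = U @ np.linalg.solve(U.T @ U, U.T)
A = P_U + 0.3 * (np.eye(N * L) - P_U)          # A U = U, U^T A = U^T, rho(A - P_U) = 0.3
A_blocks = {(k, l): A[k * L:(k + 1) * L, l * L:(l + 1) * L]
            for k in range(N) for l in neighbors[k]}

targets = [rng.standard_normal(L) for _ in range(N)]
grads = [lambda wk, t=t: 2.0 * (wk - t) for t in targets]   # gradient of ||w_k - t_k||^2

w = [np.zeros(L) for _ in range(N)]
mu = 0.01
for _ in range(2000):
    w = distributed_step(w, grads, A_blocks, neighbors, mu)

W_opt = P_U @ np.concatenate(targets)          # centralized solution for this cost
error = np.linalg.norm(np.concatenate(w) - W_opt)
```

Because $\mathcal{A}$ differs from the exact projector, the iterates in this example settle at a small distance from `W_opt` that shrinks with `mu`, rather than at `W_opt` itself.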

Lemma 1. (Necessary and sufficient conditions for (7)). The matrix equation (7) holds if and only if the following conditions on the projector $\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}}$ and the matrix $\mathcal{A}$ are satisfied:

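$$\mathcal{A}\,\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}}=\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}}, \tag{10}$$

$$\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}}\,\mathcal{A}=\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}}, \tag{11}$$

$$\rho(\mathcal{A}-\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}})<1, \tag{12}$$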

where $\rho(\cdot)$ denotes the spectral radius of its matrix argument. It follows that any $\mathcal{A}$ satisfying condition (7) has one as an eigenvalue with multiplicity $P$ , and all other eigenvalues are strictly less than one in magnitude.

Proof.

See Appendix A. The arguments are along the lines developed in [31] for distributed averaging with proper adjustments to handle general subspace constraints. ∎

Note that conditions (10)–(12) appeared previously (with proof omitted) in the context of distributed denoising in wireless sensor networks [29]. In such problems, the $N$ sensors observe an $N$-dimensional signal, with each entry of the signal corresponding to one sensor. Using the prior knowledge that the observed signal belongs to a low-dimensional subspace, the sensor task is to denoise the corresponding entry of the signal by projecting in a distributed iterative manner onto the signal subspace in order to improve the error variance. However, in this work, we consider the more general problem of distributed inference over networks.

If we replace $\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}}$ by (4), multiply both sides of (10) by $\mathcal{U}$ , and multiply both sides of (11) by $\mathcal{U}^{*}$ , conditions (10) and (11) reduce to:

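$$\mathcal{A}\,\mathcal{U}=\mathcal{U}, \tag{13}$$

$$\mathcal{U}^{*}\mathcal{A}=\mathcal{U}^{*}. \tag{14}$$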

Conditions (13) and (14) state that the $P$ columns of $\mathcal{U}$ are right and left eigenvectors of $\mathcal{A}$ associated with the eigenvalue $1$ . Together with these two conditions, condition (12) means that $\mathcal{A}$ has $P$ eigenvalues at one, and that all other eigenvalues are strictly less than one in magnitude.
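The conditions of Lemma 1 can be checked numerically. The sketch below builds a hypothetical complex-valued $\mathcal{A}=\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}}+QBQ$ with $Q=I-\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}}$ and $\|B\|_{2}=0.5$ (a construction chosen for illustration that ignores any sparsity requirement), then verifies (13), (14), (12), and the limit (7).

```python
import numpy as np

def check_conditions(A, U, tol=1e-10):
    """Return whether (13) and (14) hold, plus the spectral radius appearing in (12)."""
    P_U = U @ np.linalg.solve(U.conj().T @ U, U.conj().T)
    right = np.allclose(A @ U, U, atol=tol)                    # (13): A U = U
    left = np.allclose(U.conj().T @ A, U.conj().T, atol=tol)   # (14): U^* A = U^*
    rho = np.max(np.abs(np.linalg.eigvals(A - P_U)))           # (12): must be below one
    return right, left, rho

rng = np.random.default_rng(3)
M, P = 8, 2                                   # hypothetical dimensions
U = rng.standard_normal((M, P)) + 1j * rng.standard_normal((M, P))
P_U = U @ np.linalg.solve(U.conj().T @ U, U.conj().T)
Q = np.eye(M) - P_U                           # projector onto the orthogonal complement
B = rng.standard_normal((M, M))
B *= 0.5 / np.linalg.norm(B, 2)               # scale so that the spectral norm is 0.5
A = P_U + Q @ B @ Q                           # illustrative, non-Hermitian choice

right, left, rho = check_conditions(A, U)
limit = np.linalg.matrix_power(A, 200)        # should be numerically equal to P_U
eigs = np.linalg.eigvals(A)
num_unit_eigs = int(np.sum(np.abs(eigs - 1.0) < 1e-6))   # expected to equal P
```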

In the following, we discuss how some well-known network optimization problems can be recast in the form (2) and addressed with strategies in the form of (9).

Remark 1. (Distributed consensus optimization). Let $M_{k}=L$ for all agents. If we set in (2) $P=L$ and $\mathcal{U}=\frac{1}{\sqrt{N}}(\mathds{1}_{N}\otimes I_{L})$ where $\mathds{1}_{N}$ is the $N\times 1$ vector of all ones, then solving problem (2) will be equivalent to solving the well-known consensus problem (1). Different algorithms for solving (1) over strongly-connected networks have been proposed [3, 4, 5, 7, 6, 8, 9]. By picking any $N\times N$ doubly-stochastic matrix $A=[a_{k\ell}]$ satisfying:

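$$a_{k\ell}\geq 0,\quad A\mathds{1}_{N}=\mathds{1}_{N},\quad\mathds{1}_{N}^{\top}A=\mathds{1}_{N}^{\top},\quad a_{k\ell}=0\text{ if }\ell\notin\mathcal{N}_{k}\text{ and }k\neq\ell, \tag{15}$$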

the diffusion strategy for instance takes the form [3, 4, 5]:

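$$\left\{\begin{array}{rl}\boldsymbol{\psi}_{k,i}=&\boldsymbol{w}_{k,i-1}-\mu\widehat{\nabla_{w^{*}_{k}}J_{k}}(\boldsymbol{w}_{k,i-1}),\\ \boldsymbol{w}_{k,i}=&\sum\limits_{\ell\in\mathcal{N}_{k}}a_{k\ell}\boldsymbol{\psi}_{\ell,i}.\end{array}\right. \tag{16}$$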

Observe that this strategy can be written in the form of (9) with $A_{k\ell}=a_{k\ell}I_{L}$ and $\mathcal{A}=A\otimes I_{L}$ . It can be verified that, when $A$ satisfies (15) over a strongly connected network, the matrix $\mathcal{A}$ will satisfy (8), (13), (14), and (12). ∎
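The consensus special case can be verified numerically. The sketch below assumes a five-agent ring and uses the Metropolis rule to build a doubly-stochastic $A$. Both the topology and the rule are illustrative choices, since the text only requires some doubly-stochastic matrix satisfying (15).

```python
import numpy as np

def metropolis_weights(adj):
    """Doubly-stochastic weights from a symmetric 0/1 adjacency matrix without self-loops."""
    N = adj.shape[0]
    deg = adj.sum(axis=1)                      # number of neighbors, self excluded
    A = np.zeros((N, N))
    for k in range(N):
        for l in range(N):
            if k != l and adj[k, l]:
                A[k, l] = 1.0 / (1.0 + max(deg[k], deg[l]))
        A[k, k] = 1.0 - A[k].sum()             # remaining weight stays on the agent
    return A

N, L = 5, 2                                    # hypothetical ring of 5 agents, L = 2
adj = np.zeros((N, N), dtype=int)
for k in range(N):
    adj[k, (k + 1) % N] = adj[(k + 1) % N, k] = 1

A = metropolis_weights(adj)
A_cal = np.kron(A, np.eye(L))                             # block matrix A (x) I_L
U = np.kron(np.ones((N, 1)), np.eye(L)) / np.sqrt(N)      # consensus subspace basis
P_U = U @ U.T                                             # U has orthonormal columns
rho = np.max(np.abs(np.linalg.eigvals(A_cal - P_U)))      # spectral radius in (12)
```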

Remark 2. (Distributed coupled optimization). Similarly, with a proper selection of $\mathcal{U}$ , multitask inference problems with overlapping parameter vectors [23, 22, 24] can also be recast in the form (2). This scenario is illustrated in Fig. 1 (right). In this example, agent $k$ is influenced by only a subset of the entries of a global $w=[w^{1},w^{2},w^{3}]$ and seeks to estimate $w_{k}=[w^{2},w^{3}]$ . For a given variable $w^{\ell}$ and any two arbitrary agents containing $w^{\ell}$ in their costs, it is assumed that the network topology is such that there exists at least one path linking one agent to the other [23]. By properly selecting the matrix $\mathcal{U}$ , the network vector ${\scriptstyle\mathcal{W}}=\text{col}\{w_{1},\ldots,w_{N}\}$ can be written as ${\scriptstyle\mathcal{W}}=\mathcal{U}w$ and, therefore, distributed coupled optimization can be recast in the form (2). It can be verified that the coupled diffusion strategy proposed in [23] for solving this problem can be written in the form of (9) and that the (doubly-stochastic) matrix $\mathcal{A}$ in [23] satisfies conditions (7) and (8). ∎

Remark 3. (Distributed optimization under affine constraints). Several existing works consider (distributed or offline) variations of the following problem [27, 26, 28]:

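$${\scriptstyle\mathcal{W}}^{o}=\arg\min_{{\scriptscriptstyle\mathcal{W}}}\sum_{k=1}^{N}J_{k}(w_{k}),\quad\text{subject to}~\mathcal{D}^{*}{\scriptstyle\mathcal{W}}=d, \tag{17}$$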

where $\mathcal{D}$ is an $M\times(M-P)$ full-column rank matrix and $d$ is an $(M-P)\times 1$ column vector. It turns out that the online distributed strategy proposed in this work can be used to solve (17) for general constraints that are not necessarily local. To see this, we first note that the gradient projection method can be applied to solve (17) [30]:

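$${\scriptstyle\mathcal{W}}_{i}=\mathcal{P}_{\scriptstyle\mathcal{D}}\left({\scriptstyle\mathcal{W}}_{i-1}-\mu\,\text{col}\left\{\nabla_{w_{k}^{*}}J_{k}(w_{k,i-1})\right\}_{k=1}^{N}\right)+d_{\scriptstyle\mathcal{D}},\quad i\geq 0, \tag{18}$$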

where $\mathcal{P}_{\scriptstyle\mathcal{D}}\triangleq I_{M}-\mathcal{D}(\mathcal{D}^{*}\mathcal{D})^{-1}\mathcal{D}^{*}$ and $d_{\scriptstyle\mathcal{D}}\triangleq\mathcal{D}(\mathcal{D}^{*}\mathcal{D})^{-1}d$. Since $\mathcal{P}_{\scriptstyle\mathcal{D}}$ is a projection matrix, it can be decomposed as $\mathcal{P}_{\scriptstyle\mathcal{D}}=\sum_{p=1}^{P}u_{p}u^{*}_{p}$ with $\{u_{p}\}$ the orthonormal eigenvectors of $\mathcal{P}_{\scriptstyle\mathcal{D}}$ associated with the $P$ eigenvalues at one, and thus $\mathcal{P}_{\scriptstyle\mathcal{D}}$ can be replaced by $\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}}=\mathcal{U}\mathcal{U}^{*}$ with $\mathcal{U}=[u_{1},\ldots,u_{P}]$. Therefore, solution (18) has a form similar to the earlier solution (3) with the rightmost term $d_{\scriptstyle\mathcal{D}}$ in (18) absent from (3). Following the same line of reasoning that led to (9), we can similarly obtain the following distributed adaptive solution for solving (17):

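$$\left\{\begin{array}{rl}\boldsymbol{\psi}_{k,i}=&\boldsymbol{w}_{k,i-1}-\mu\widehat{\nabla_{w^{*}_{k}}J_{k}}(\boldsymbol{w}_{k,i-1}),\\ \boldsymbol{w}_{k,i}=&\sum\limits_{\ell\in\mathcal{N}_{k}}A_{k\ell}\boldsymbol{\psi}_{\ell,i}+d_{\mathcal{D},k},\end{array}\right. \tag{19}$$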

where $d_{\mathcal{D},k}$ is the $k$-th sub-vector of $(I_{M}-\mathcal{A})d_{\mathcal{D}}$ corresponding to node $k$ (see Appendix B), and $\mathcal{A}=[A_{k\ell}]$ is a properly selected matrix satisfying conditions (7) and (8). Although algorithm (19) differs from (9) due to the presence of the constant term $d_{\mathcal{D},k}$, the mean-square-error analyses of both algorithms are the same, as we shall see in Section III. In Section IV, we shall apply (19) to solve linearly constrained beamforming [32, 33]. ∎
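The quantities in Remark 3 can be computed directly. The sketch below uses a random real-valued $\mathcal{D}$ and $d$ with hypothetical dimensions and an illustrative cost $\|{\scriptstyle\mathcal{W}}\|^{2}$, none of which come from the paper. It recovers $\mathcal{U}$ from the eigenvectors of $\mathcal{P}_{\scriptstyle\mathcal{D}}$ and checks that one centralized step of the form (18) returns a point satisfying the affine constraint.

```python
import numpy as np

rng = np.random.default_rng(4)
M, P = 6, 2                                   # hypothetical dimensions
D = rng.standard_normal((M, M - P))           # full column rank with probability one
d = rng.standard_normal(M - P)

G = np.linalg.inv(D.T @ D)
P_D = np.eye(M) - D @ G @ D.T                 # projector onto the null space of D^T
d_D = D @ G @ d                               # particular solution of D^T W = d

# Eigenvectors of P_D for the P unit eigenvalues (eigh sorts eigenvalues in ascending order).
eigvals, V = np.linalg.eigh((P_D + P_D.T) / 2)
U = V[:, -P:]

# One centralized step of the form (18) with the illustrative cost ||W||^2 (gradient 2 W).
W_prev = rng.standard_normal(M)
mu = 0.1
W_next = P_D @ (W_prev - mu * 2.0 * W_prev) + d_D
```

Since `d_D` lies in the range of `D`, it is orthogonal to the range of `P_D`, which is why every iterate of this form satisfies the constraint exactly.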

Remark 4. (Distributed inference under smoothness). Let $M_{k}=L$ for all agents. In such problems, each agent $k$ in the network has an individual cost $J_{k}(w_{k})$ to minimize subject to a smoothness condition over the graph. The smoothness requirement softens the transition in the tasks $\{w_{k}\}$ among neighboring nodes and can be measured in terms of a quadratic form of the graph Laplacian [16]:

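$$S({\scriptstyle\mathcal{W}})={\scriptstyle\mathcal{W}}^{\top}\mathcal{L}_{c}{\scriptstyle\mathcal{W}}=\frac{1}{2}\sum_{k=1}^{N}\sum_{\ell\in\mathcal{N}_{k}}c_{k\ell}\|w_{k}-w_{\ell}\|^{2}, \tag{20}$$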

where $\mathcal{L}_{c}=L_{c}\otimes I_{L}$ with $L_{c}=\text{diag}\{C\mathds{1}_{N}\}-C$ denoting the graph Laplacian. The matrix $C=[c_{k\ell}]$ is an $N\times N$ symmetric weighted adjacency matrix with $c_{k\ell}\geq 0$ if $\ell\in\mathcal{N}_{k}$ and $c_{k\ell}=0$ otherwise. The smaller $S({\scriptstyle\mathcal{W}})$ is, the smoother the signal ${\scriptstyle\mathcal{W}}$ on the graph is. Since $L_{c}$ is symmetric positive semi-definite, it can be decomposed as $L_{c}=V\Lambda V^{\top}$ where $\Lambda=\text{diag}\{\lambda_{1},\ldots,\lambda_{N}\}$ with $\lambda_{m}$ the non-negative eigenvalues ordered as $0=\lambda_{1}\leq\lambda_{2}\leq\ldots\leq\lambda_{N}$ and $V=[v_{1},\ldots,v_{N}]$ is the matrix of orthonormal eigenvectors. When the graph is connected, there is only one zero eigenvalue with corresponding eigenvector $v_{1}=\frac{1}{\sqrt{N}}\mathds{1}_{N}$ [34]. Using the eigenvalue decomposition $\mathcal{L}_{c}=(V\Lambda V^{\top})\otimes I_{L}$, $S({\scriptstyle\mathcal{W}})$ can be written as:

[TABLE]

where $\overline{{\scriptstyle\mathcal{W}}}=(V^{\top}\otimes I_{L}){\scriptstyle\mathcal{W}}$ and $\overline{w}_{m}=(v_{m}^{\top}\otimes I_{L}){\scriptstyle\mathcal{W}}$ . Given that $\lambda_{m}\geq 0$ , the above expression shows that ${\scriptstyle\mathcal{W}}$ is considered to be smooth if $\|\overline{w}_{m}\|^{2}$ corresponding to large $\lambda_{m}$ is negligible. Thus, for a smooth ${\scriptstyle\mathcal{W}}$ , $S({\scriptstyle\mathcal{W}})$ will be equal to $\sum_{m=1}^{p}\lambda_{m}\|\overline{w}_{m}\|^{2}$ with $p\ll N$ . By choosing $\mathcal{U}=U\otimes I_{L}$ where $U=[v_{1},\ldots,v_{p}]$ , the smooth signal ${\scriptstyle\mathcal{W}}$ will be in the range space of $\mathcal{U}$ since it can be written as ${\scriptstyle\mathcal{W}}=\mathcal{U}s$ with $s=\text{col}\{\overline{w}_{1},\ldots,\overline{w}_{p}\}$ . Therefore, distributed inference problems under smoothness can be recast in the form (2).∎
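The spectral form of the smoothness measure can be checked on a small graph. The sketch below assumes that $S({\scriptstyle\mathcal{W}})$ is the quadratic form ${\scriptstyle\mathcal{W}}^{\top}\mathcal{L}_{c}{\scriptstyle\mathcal{W}}$ (real data) and uses a hypothetical unweighted ring of `N = 6` nodes; neither choice comes from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
N, L, p = 6, 2, 2  # hypothetical sizes: nodes, entries per node, kept modes

# Symmetric adjacency matrix of a ring, which is a connected graph.
C = np.zeros((N, N))
for k in range(N):
    C[k, (k + 1) % N] = C[(k + 1) % N, k] = 1.0

L_c = np.diag(C @ np.ones(N)) - C  # graph Laplacian
lam, V = np.linalg.eigh(L_c)       # eigenvalues in ascending order

# Compare the quadratic form with its spectral expression.
W = rng.standard_normal(N * L)
S_direct = W @ np.kron(L_c, np.eye(L)) @ W
W_bar = np.kron(V.T, np.eye(L)) @ W
S_spectral = sum(lam[m] * np.sum(W_bar[m * L:(m + 1) * L] ** 2) for m in range(N))

# A smooth signal lives in the range of U (x) I_L, with U the first p eigenvectors.
calU = np.kron(V[:, :p], np.eye(L))
W_smooth = calU @ rng.standard_normal(p * L)

print(np.isclose(S_direct, S_spectral))
print(np.allclose(calU @ calU.T @ W_smooth, W_smooth))
```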

Before proceeding, note that, in some cases, one may find a family of matrices $\mathcal{A}$ satisfying conditions (12), (13), and (14) under the sparsity constraints (8). For example, in consensus optimization described in Remark 1 where $\mathcal{U}=\frac{1}{\sqrt{N}}(\mathds{1}_{N}\otimes I_{L})$ , by ensuring that the underlying graph is strongly connected and by choosing any doubly-stochastic $A$ satisfying the sparsity constraints, the resulting matrix $\mathcal{A}=A\otimes I_{L}$ will satisfy the required conditions. The same observation holds for coupled optimization problems described in Remark 2. Several policies for designing locally doubly-stochastic matrices have been proposed in the literature [3, 4, 5]. For more general $\mathcal{U}$ , designing an $\mathcal{A}$ satisfying conditions (7) and (8) can be written as the following feasibility problem:

[TABLE]

which is challenging in general. Not all network topologies satisfying (8) guarantee the existence of an $\mathcal{A}$ satisfying condition (7). The higher the dimension of the signal subspace is, the greater the graph connectivity has to be. In the works [1, 29], it is assumed that the sparsity constraints (8) and the signal subspace lead to a feasible problem. That is, it is assumed that problem (22) admits at least one solution. As a remedy when this assumption is violated, one may increase the network connectivity by increasing the transmit power of each node, i.e., adding more links [29]. In the accompanying Part II [2], we shall relax the feasibility assumption by considering the problem of finding an $\mathcal{A}$ that minimizes the number of edges to be added to the original topology while satisfying the constraints (12), (13), and (14). In this case, if the original topology leads to a feasible solution, then no links will be added. Otherwise, we assume that the designer is able to add some links to make the problem feasible.
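For the consensus special case mentioned above, the required properties of $\mathcal{A}=A\otimes I_{L}$ can be verified directly. The sketch below assumes a hypothetical ring of `N = 5` agents and a Metropolis-type rule for the doubly-stochastic $A$; the paper does not prescribe this particular rule.

```python
import numpy as np

N, L = 5, 2  # hypothetical network size and per-agent dimension

# Ring topology; each neighborhood includes the agent itself.
neighbors = {k: {k, (k - 1) % N, (k + 1) % N} for k in range(N)}

# Metropolis-type weights give a symmetric doubly-stochastic matrix.
A = np.zeros((N, N))
for k in range(N):
    for l in neighbors[k] - {k}:
        A[k, l] = 1.0 / max(len(neighbors[k]), len(neighbors[l]))
    A[k, k] = 1.0 - A[k].sum()

calA = np.kron(A, np.eye(L))
calU = np.kron(np.ones((N, 1)) / np.sqrt(N), np.eye(L))
P_U = calU @ calU.T

rho = max(abs(np.linalg.eigvals(calA - P_U)))
print(np.allclose(calA @ calU, calU))      # A U = U
print(np.allclose(calU.T @ calA, calU.T))  # U^* A = U^*
print(rho < 1)                             # spectral radius condition
print(np.allclose(np.linalg.matrix_power(calA, 200), P_U))
```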

In the following section, we consider that a feasible $\mathcal{A}$ (topology) is computed by the designer and that its blocks $\{A_{k\ell}\}_{\ell\in\mathcal{N}_{k}}$ are provided to agent $k$ in order to run algorithm (9). We shall study the performance of (9) in the mean-square-error sense. We shall consider the general complex case, in addition to the real case, since complex-valued combination matrix $\mathcal{A}$ and data $\boldsymbol{x}_{k,i}$ are important in several applications, as will be the case in the distributed beamforming application considered later in Section IV.

III Stability analysis

In this Part I, we shall establish mean-square-error stability by showing that, for each agent $k$ , the error variance relative to $w^{o}_{k}$ enters a bounded region whose size is on the order of $\mu$ , namely, $\limsup_{i\rightarrow\infty}\mathbb{E}\|w^{o}_{k}-\boldsymbol{w}_{k,i}\|^{2}=O(\mu)$ . Then, building on this result, we will assess in the accompanying Part II [2] the size of this mean-square error by deriving a closed-form expression for the network mean-square-deviation (MSD) defined by [4]:

[TABLE]

where $\boldsymbol{{\scriptstyle\mathcal{W}}}_{i}\triangleq\text{col}\{\boldsymbol{w}_{k,i}\}_{k=1}^{N}$ . In this way, we will be able to conclude that distributed strategies of the form (9) with small step-sizes lead to reliable performance even in the presence of gradient noise. We will also be able to conclude that the iterates generated by the distributed implementation achieve the centralized steady-state performance.

As explained in [4, Chap. 8], in the general case where $J_{k}(w_{k})$ are not necessarily quadratic in the (complex) variable $w_{k}$ , we need to track the evolution of both quantities $\boldsymbol{w}_{k,i}$ and $(\boldsymbol{w}_{k,i}^{*})^{\top}$ in order to examine how the network is performing. Since $J_{k}(w_{k})$ is real valued, the evolution of the complex conjugate iterates $(\boldsymbol{w}_{k,i}^{*})^{\top}$ is given by:

[TABLE]

Representations (9) and (24) can be grouped together into a single set of equations by introducing extended vectors of dimensions $2M_{k}\times 1$ as follows:

[TABLE]

Therefore, when the data is complex, extended vectors and matrices need to be introduced in order to analyze the network evolution. The arguments and results presented in the analysis are applicable to both cases of real and complex data through the use of data-type variable $h$ defined in Table I. When the data is real-valued, the complex conjugate transposition should be replaced by the real transposition. Table I lists a couple of variables and symbols that will be used in the sequel for both real and complex data cases. The superscript “ $e$ ” is used to refer to extended quantities. Although in the real data case no extended quantities should be introduced, we use the superscript “ $e$ ” for both data cases for compactness of notation.

III-A Modeling conditions

We analyze (9) under conditions (8), (12), (13), and (14) on $\mathcal{A}$ , and the following assumptions on the risks $\{J_{k}(\cdot)\}$ and on the gradient noise processes $\{\boldsymbol{s}_{k,i}(\cdot)\}$ defined as:

[TABLE]

Before proceeding, we introduce the Hermitian Hessian matrix functions [4, Appendix B]:

[TABLE]

Note that, when $J^{\text{glob}}({\scriptstyle\mathcal{W}})=\sum_{k=1}^{N}J_{k}(w_{k})$ , we have:

[TABLE]

where $\mathcal{I}$ is a permutation matrix given by:

[TABLE]

This matrix consists of $2N\times 2N$ blocks with $(m,n)$ -th block given by:

[TABLE]

for $m,n=1,\ldots,2N$ and $k=1,\ldots,N$ .

Assumption 1.

(Conditions on aggregate and individual costs).

The individual costs $J_{k}(w_{k})\in\mathbb{R}$ are assumed to be twice differentiable and convex such that:

[TABLE]

where $\nu_{k}\geq 0$ for $k=1,\ldots,N$ . It is further assumed that, for any ${\scriptstyle\mathcal{W}}$ , $\mathcal{H}({\scriptstyle\mathcal{W}})$ satisfies:

[TABLE]

for some positive parameters $\nu\leq\delta$ . The data-type variable $h$ and the matrix $\mathcal{U}^{e}$ are defined in Table I.

Condition (38) ensures that problem (2), which can be rewritten as:

[TABLE]

has a unique minimizer ${\scriptstyle\mathcal{W}}^{o}$ . This is due to the fact that the Hessian of $f(s)$ , which is given by:

[TABLE]

is positive definite under condition (38).

Assumption 2.

(Conditions on gradient noise).

The gradient noise process defined in (26) satisfies for any $\boldsymbol{w}\in\boldsymbol{\cal{F}}_{i-1}$ and for all $k,\ell=1,\ldots,N$ :

[TABLE]

for some $\beta_{k}^{2}\geq 0$ , $\sigma^{2}_{s,k}\geq 0$ , and where $\boldsymbol{\cal{F}}_{i-1}$ denotes the filtration generated by the random processes $\{\boldsymbol{w}_{\ell,j}\}$ for all $\ell=1,\ldots,N$ and $j\leq i-1$ .

As explained in [3, 4, 5], these conditions are satisfied by many objective functions of interest in learning and adaptation such as quadratic and logistic risks. Condition (44) essentially states that the gradient vector approximation should be unbiased conditioned on the past data, which is a reasonable condition to require. Condition (47) states that the second-order moment of the gradient noise process should get smaller for better estimates, since it is bounded by the squared-norm of the iterate. Conditions (45) and (46) state that the gradient noises across the agents are uncorrelated and second-order circular.

Without loss of generality, we shall introduce the following assumption on the matrix $\mathcal{U}$ . This assumption is not restrictive since, for any full-column rank matrix $\mathcal{U}^{\prime}=[u_{1}^{\prime},\ldots,u^{\prime}_{P}]$ with $P\leq M$ , we can generate by using, for example, the Gram-Schmidt process [35, p. 15], a semi-unitary matrix $\mathcal{U}=[u_{1},\ldots,u_{P}]$ that spans the same $P$ -dimensional subspace of $\mathbb{C}^{M}$ as $\mathcal{U}^{\prime}$ , i.e., $\mathcal{R}(\mathcal{U})=\mathcal{R}(\mathcal{U}^{\prime})$ .

Assumption 3.

(Condition on $\mathcal{U}$ ).

The full-column rank matrix $\mathcal{U}$ is assumed to be semi-unitary, i.e., its column vectors are orthonormal and $\mathcal{U}^{*}\mathcal{U}=I_{P}$ .
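A semi-unitary basis for a given subspace can be obtained numerically in one call. The sketch below uses a reduced QR factorization in place of the Gram-Schmidt process (both yield an orthonormal basis of the same range space); the sizes and the random matrix `U_prime` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
M, P = 8, 3  # hypothetical ambient and subspace dimensions

# Any full-column-rank matrix spanning the desired subspace.
U_prime = rng.standard_normal((M, P)) + 1j * rng.standard_normal((M, P))

# Reduced QR: U is M x P with orthonormal columns and the same range as U_prime.
U, _ = np.linalg.qr(U_prime)

print(np.allclose(U.conj().T @ U, np.eye(P)))         # U^* U = I_P
print(np.allclose(U @ (U.conj().T @ U_prime), U_prime))  # same range space
```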

Before proceeding, we introduce an $N\times N$ block matrix $\mathcal{A}^{e}$ whose $(k,\ell)$ -th block is defined in Table I. This matrix will appear in our subsequent study. Observe that in the real data case, $\mathcal{A}^{e}=\mathcal{A}$ , and that in the complex data case, $\mathcal{A}^{e}$ can be seen as an extended version of the combination matrix $\mathcal{A}$ . The next statement exploits the eigen-structure of $\mathcal{A}^{e}$ that will be useful for establishing the mean-square stability.

Lemma 2.

(Jordan canonical decomposition).

Under Assumption 3, the $M\times M$ combination matrix $\mathcal{A}$ satisfying conditions (12), (13), and (14) admits a Jordan canonical decomposition of the form:

[TABLE]

with:

[TABLE]

where $\mathcal{J}_{\epsilon}$ is a Jordan matrix with the eigenvalues (which may be complex but have magnitude less than one) on the diagonal and $\epsilon>0$ on the super-diagonal. It follows that the $hM\times hM$ matrix $\mathcal{A}^{e}$ defined in Table I admits a Jordan decomposition of the form:

[TABLE]

with

[TABLE]

where $\mathcal{U}^{e},\mathcal{J}^{e}_{\epsilon},\mathcal{V}^{e}_{R,\epsilon}$ , and $(\mathcal{V}_{L,\epsilon}^{e})^{*}$ are defined in Table I. Since $(\mathcal{V}_{\epsilon}^{e})^{-1}\mathcal{V}_{\epsilon}^{e}=I_{hM}$ , the following relations hold:

[TABLE]

Proof.

See Appendix C. ∎

III-B Network error vector recursion

Let $\widetilde{\boldsymbol{w}}_{k,i}$ denote the error vector at node $k$ :

[TABLE]

Consider first the complex data case. Using (26) and the mean-value theorem [36, p. 24], [4, Appendix D], we can express the stochastic gradient vectors appearing in (25) as follows:

[TABLE]

where:

[TABLE]

and $\widetilde{\boldsymbol{w}}_{k,i}^{e}$ , $\boldsymbol{s}_{k,i}^{e}(\boldsymbol{w}_{k,i-1})$ , and $b_{k}^{e}$ are defined in Table I with:

[TABLE]

Subtracting $(w^{o}_{k})^{e}=\text{col}\{w^{o}_{k},((w^{o}_{k})^{*})^{\top}\}$ from both sides of (25) and by introducing the following extended vectors and matrices, which collect quantities from across the network:

[TABLE]

we can show that the network weight error vector $\widetilde{\boldsymbol{{\scriptstyle\mathcal{W}}}}^{e}_{i}$ in (57) evolves according to the following dynamics:

[TABLE]

where $\mathcal{A}^{e}$ is defined in Table I and where we used (54) and the fact that

[TABLE]

since ${\scriptstyle\mathcal{W}}^{o}$ is the solution of problem (2), and thus:

[TABLE]

For real data, the model can be simplified since we do not need to track the evolution of the complex conjugate $(\boldsymbol{w}_{k,i}^{*})^{\top}$ . Although we use the notation “ $e$ ” for the quantities in the above recursion, it is to be understood that the extended quantities $\{\widetilde{\boldsymbol{w}}^{e}_{k,i},b_{k}^{e},\boldsymbol{s}_{k,i}^{e},\mathcal{A}^{e}\}$ should be replaced by the quantities $\{\widetilde{\boldsymbol{w}}_{k,i},b_{k},\boldsymbol{s}_{k,i},\mathcal{A}\}$ as in Table I.

The stability analysis of recursion (62) is facilitated by transforming it to a convenient basis using the Jordan decomposition of $\mathcal{A}^{e}$ in Lemma 2. Multiplying both sides of (62) from the left by $(\mathcal{V}_{\epsilon}^{e})^{-1}$ and introducing the transformed iterates and variables:

[TABLE]

we obtain from Lemma 2:

[TABLE]

where

[TABLE]

The two transformed recursions above can be written more compactly as:

[TABLE]

The zero entry in (77) is due to the fact that

[TABLE]

since the constrained optimization problem (2) can be written alternatively as:

[TABLE]

The Lagrangian associated with problem (90) is given by:

[TABLE]

where $\gamma$ is the $M\times 1$ vector of Lagrange multipliers. From the optimality conditions, we obtain the following condition on ${\scriptstyle\mathcal{W}}^{o}$ :

[TABLE]

where we used the fact that $\sum_{k=1}^{N}J_{k}(w_{k})$ is real valued and where $b^{e}$ and $\mathcal{I}$ are given by (61) and (36), respectively. In the real data case, by multiplying both sides of the previous relation by $(\mathcal{U}^{e})^{*}=\mathcal{U}^{\top}$ , we obtain $(\mathcal{U}^{e})^{*}b^{e}=0$ . For complex data, by multiplying both sides of the previous equation by $(\mathcal{U}^{e})^{*}\mathcal{I}^{\top}$ with $\mathcal{U}^{e}$ defined in Table I, we obtain $(\mathcal{U}^{e})^{*}b^{e}=0$ . Now, considering both real and complex data cases, we arrive at (89).

Remark 5. Regarding algorithm (19), it can be verified that the weight error vector $\widetilde{\boldsymbol{{\scriptstyle\mathcal{W}}}}_{i}^{e}$ will end up evolving according to recursion (62). The constant driving terms $\{d_{\scriptstyle\mathcal{D},k}\}$ will disappear when subtracting $w^{o}_{k}$ from both sides of (19) since $w^{o}_{k}$ satisfies the following relation:

[TABLE]

where we used the fact that the optimal solution ${\scriptstyle\mathcal{W}}^{o}$ in (17) verifies:

[TABLE]

By rewriting the constraint in (17) as $(I_{M}-\mathcal{P}_{\scriptscriptstyle\mathcal{U}}){\scriptstyle\mathcal{W}}=d_{\scriptstyle\mathcal{D}}$ and repeating arguments similar to (90)–(100), we can show that $(\mathcal{U}^{e})^{*}b^{e}=0$ . Therefore, the transformed iterates $\overline{\boldsymbol{{\scriptstyle\mathcal{W}}}}_{i}^{e}$ and $\check{\boldsymbol{{\scriptstyle\mathcal{W}}}}_{i}^{e}$ in (69) will continue to evolve according to the same transformed recursions derived above for algorithm (9). ∎

In the following, we shall establish the mean-square-error stability of algorithm (9). In the accompanying Part II, we will derive a closed-form expression for the network MSD defined by (23). The derivation is demanding. However, the arguments are along the lines developed in [4, Chaps. 9–11] for standard diffusion (16), with proper adjustments to handle possibly complex-valued block matrices $\{A_{k\ell}\}$ satisfying conditions (7) and (8) and the subspace constraints.

III-C Mean-square-error stability

Theorem 1.

(Network mean-square-error stability).

Consider a network of $N$ agents running the distributed strategy (9) with a matrix $\mathcal{A}$ satisfying conditions (12), (13), and (14) and $\mathcal{U}$ satisfying Assumption 3. Assume the individual costs, $J_{k}(w_{k})$ , satisfy the conditions in Assumption 1. Assume further that the first and second-order moments of the gradient noise process satisfy the conditions in Assumption 2. Then, the network is mean-square-error stable for sufficiently small step-sizes, namely, it holds that:

[TABLE]

for small enough $\mu$ .

Proof.

See Appendix D. ∎
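The $O(\mu)$ behavior stated in the theorem can be observed in a small simulation. The sketch below is illustrative only: it assumes an adapt-then-combine form for the distributed strategy (a local stochastic-gradient step followed by combination with the blocks $A_{k\ell}$ ), the consensus special case with $A_{k\ell}=a_{k\ell}I_{L}$ on a hypothetical ring of 5 agents, real data, and least-mean-squares costs with a common minimizer `w_o`. The function name `run_atc` and all numerical values are assumptions.

```python
import numpy as np

def run_atc(mu, n_iter=20000, seed=0):
    """Return the time-averaged network MSD over the second half of the run."""
    rng = np.random.default_rng(seed)
    N, L = 5, 2
    w_o = np.array([1.0, -0.5])  # hypothetical common minimizer
    # Doubly-stochastic combination weights on a ring with self-loops.
    A = np.zeros((N, N))
    for k in range(N):
        A[k, (k - 1) % N] = A[k, (k + 1) % N] = A[k, k] = 1.0 / 3.0
    W = np.zeros((N, L))         # row k holds the iterate of agent k
    sq_err = []
    for i in range(n_iter):
        U = rng.standard_normal((N, L))                      # regressors
        d = U @ w_o + np.sqrt(0.1) * rng.standard_normal(N)  # noisy measurements
        err = d - np.sum(U * W, axis=1)
        psi = W + mu * err[:, None] * U                      # adaptation step
        W = A @ psi                                          # combination step
        if i >= n_iter // 2:
            sq_err.append(np.mean(np.sum((W - w_o) ** 2, axis=1)))
    return float(np.mean(sq_err))

msd_large = run_atc(mu=0.02)
msd_small = run_atc(mu=0.005)
print(msd_large, msd_small)
```

Under these assumptions the steady-state error should shrink as the step-size is reduced, in line with the theorem; the exact values depend on the random seed and on the assumed data model.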

IV Distributed linearly constrained minimum variance (LCMV) beamformer

Consider a uniform linear array (ULA) of $N=14$ antennas, as shown in Fig. 2. A desired narrow-band signal $\boldsymbol{s}_{0}(i)\in\mathbb{C}$ from the far field impinges on the array from a known direction of arrival (DOA) $\theta_{0}=30^{\circ}$ along with two uncorrelated interfering signals $\{\boldsymbol{s}_{1}(i),\boldsymbol{s}_{2}(i)\}\in\mathbb{C}$ from DOAs $\{\theta_{1}=-60^{\circ},\theta_{2}=60^{\circ}\}$ , respectively. We assume that the DOA of $\boldsymbol{s}_{2}(i)$ is roughly known. The received signal at the array is therefore modeled as:

[TABLE]

where $\boldsymbol{x}_{i}=\text{col}\{\boldsymbol{x}_{1}(i),\ldots,\boldsymbol{x}_{N}(i)\}$ is an $N\times 1$ vector that collects the received signals at the antenna elements, $\{a(\theta_{n})\}_{n=0}^{2}$ are $N\times 1$ array manifold vectors (steering vectors) for the desired and interference signals, and $\boldsymbol{v}_{i}=\text{col}\{\boldsymbol{v}_{1}(i),\ldots,\boldsymbol{v}_{N}(i)\}$ is the additive noise vector at time $i$ . With the first element as the reference point, the $N\times 1$ array manifold vector $a(\theta_{n})$ is given by $a(\theta_{n})=\text{col}\left\{1,e^{-j\tau_{n}},e^{-j2\tau_{n}},\ldots,e^{-j(N-1)\tau_{n}}\right\}$ [32], with $\tau_{n}=\frac{2\pi d}{\lambda}\sin(\theta_{n})$ where $d$ denotes the spacing between two adjacent antenna elements, and $\lambda$ denotes the wavelength of the carrier signal. The antennas are assumed spaced half a wavelength apart, i.e., $d=\lambda/2$ .
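The array manifold vector defined above is straightforward to compute. The following sketch implements the stated formula with half-wavelength spacing as the default; the function name `steering_vector` and its parameter names are hypothetical.

```python
import numpy as np

def steering_vector(theta_deg, N=14, d_over_lambda=0.5):
    """ULA manifold vector with the first element as the phase reference."""
    tau = 2 * np.pi * d_over_lambda * np.sin(np.deg2rad(theta_deg))
    return np.exp(-1j * tau * np.arange(N))

a0 = steering_vector(30.0)   # desired signal direction
a1 = steering_vector(-60.0)  # first interferer
a2 = steering_vector(60.0)   # second interferer

# For theta = 30 degrees and d = lambda/2, tau = pi/2, so the second entry is -j.
print(a0[:2])
```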

Beamforming problems generally deal with the design of a weight vector $h=\text{col}\{h_{1},\ldots,h_{N}\}\in\mathbb{C}^{N\times 1}$ in order to recover the desired signal $\boldsymbol{s}_{0}(i)$ from the received data $\boldsymbol{x}_{i}$ [32, 33]. The narrowband beamformer output can be expressed as $\boldsymbol{y}(i)=h^{*}\boldsymbol{x}_{i}$ . Among many possible criteria, we use the linearly-constrained-minimum-variance (LCMV) design, namely,

[TABLE]

where $D$ is an $N\times P$ matrix and $b$ is a $P\times 1$ vector, in order to suppress the influence of the perturbation $\boldsymbol{z}_{i}$ on the output $\boldsymbol{y}(i)$ while preserving the signal component. Since the DOA of $\boldsymbol{s}_{0}(i)$ is known and the DOA of $\boldsymbol{s}_{2}(i)$ is roughly known, matrix $D$ can be chosen as $D=[a(30^{\circ})~{}a(58.5^{\circ})~{}a(61.5^{\circ})]$ , and the vector $b$ as $b=\text{col}\{1,0.01,0.01\}$ . In this way, we set unit response to the direction of the desired signal so that $\boldsymbol{s}_{0}(i)$ passes through the array without distortion.
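For reference, a centralized LCMV design with this choice of $D$ and $b$ can be computed in closed form. The sketch below assumes that the design problem is $\min_{h}h^{*}R_{x}h$ subject to $D^{*}h=b$ , whose standard solution is $h^{o}=R_{x}^{-1}D(D^{*}R_{x}^{-1}D)^{-1}b$ , and it assumes unit-power sources and noise standard deviation 0.7 (the values used in the simulation setup of this section). The helper `steering_vector` is hypothetical.

```python
import numpy as np

def steering_vector(theta_deg, N=14):
    # Half-wavelength spacing: tau = pi * sin(theta).
    tau = np.pi * np.sin(np.deg2rad(theta_deg))
    return np.exp(-1j * tau * np.arange(N))

N, sigma_v = 14, 0.7
a = [steering_vector(t, N) for t in (30.0, -60.0, 60.0)]

# Assumed covariance: three unit-power sources plus white noise.
R_x = sum(np.outer(an, an.conj()) for an in a) + sigma_v ** 2 * np.eye(N)

D = np.column_stack([steering_vector(t, N) for t in (30.0, 58.5, 61.5)])
b = np.array([1.0, 0.01, 0.01])

# Closed-form LCMV solution, using linear solves instead of explicit inverses.
R_inv_D = np.linalg.solve(R_x, D)
h_o = R_inv_D @ np.linalg.solve(D.conj().T @ R_inv_D, b)

print(np.allclose(D.conj().T @ h_o, b))  # constraints are met
print(np.real(h_o.conj() @ R_x @ h_o))   # minimum output variance
```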

In a distributed setting, the objective of agent (antenna element) $k$ is to estimate $h^{o}_{k}$ , the $k$ -th component of $h^{o}$ in (105). Neighboring agents are allowed to exchange their observations $\boldsymbol{x}_{\ell}(i)$ . To each agent $k$ , we associate a neighborhood set $\mathcal{N}_{k}$ , an $M_{k}\times 1$ parameter vector $w_{k}$ , and an $M_{k}\times 1$ regression vector $\boldsymbol{u}_{k,i}$ , defined in Table II depending on the node location on the array. Observe that the parameter $\nu$ controls the network topology. For example, $\nu=N-1$ corresponds to a fully connected network setting. We associate with each agent $k$ a cost $J_{k}(w_{k})\triangleq w^{*}_{k}\mathbb{E}[\boldsymbol{u}_{k,i}\boldsymbol{u}_{k,i}^{*}]w_{k}.$ Instead of solving (105), we propose to solve:

[TABLE]

where the equality constraint $\mathcal{D}^{*}{\scriptstyle\mathcal{W}}=d$ merges the equality constraint in (105) and the equality constraints that need to be imposed on the parameter vectors at neighboring nodes in order to achieve equality between common entries (see Table II). Let $E$ denote the binary connection matrix with $[E]_{k\ell}=1$ if $\ell\in\mathcal{N}_{k}$ , and $[E]_{k\ell}=0$ otherwise. Under the consensus constraints, it can be shown that:

[TABLE]

where $F$ is an $N\times N$ matrix with $[F]_{k\ell}=\frac{[E^{2}]_{k\ell}}{\sqrt{|\mathcal{N}_{k}||\mathcal{N}_{\ell}|}}$ and $\circ$ is the element-wise product. Therefore, collecting observations from neighboring nodes allows partial covariance matrix computation, which will be used in optimization. For the partial covariance $F\circ R_{x}$ to coincide with the true covariance $R_{x}$ in (105), we need to set $\nu=N-1$ in order to have $F=\mathds{1}_{N}\mathds{1}_{N}^{\top}$ . Note that two main classes of distributed beamforming appear in the literature [37]. In the first class, which is considered here, the covariance matrix is approximated to form distributed implementations [38, 39, 40, 37] leading to sub-optimal beamformers. In the second class, the proposed beamformers obtain statistical optimality but do so at the expense of restricting the topology of the underlying network [41]. Different from [38], the current distributed solution preserves convexity and is scalable since nodes exchange and compute $M_{\ell}\times 1$ sub-vectors $\{w_{\ell}\}$ with $M_{\ell}=|\mathcal{N}_{\ell}|<N$ instead of $N\times 1$ vectors.
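The masking matrix $F$ can be computed from the connection matrix $E$ as follows. Since Table II is not reproduced here, the sketch assumes a hypothetical neighborhood rule in which agent $k$ is connected to every antenna at most `nu` positions away on the array, itself included; only the formula for $F$ is taken from the text.

```python
import numpy as np

def partial_covariance_mask(N, nu):
    """Return (E, F) for a line array; the neighborhood rule is an assumption."""
    idx = np.arange(N)
    E = (np.abs(idx[:, None] - idx[None, :]) <= nu).astype(float)
    sizes = E.sum(axis=1)  # |N_k| for each agent
    F = (E @ E) / np.sqrt(np.outer(sizes, sizes))
    return E, F

N = 14
_, F_full = partial_covariance_mask(N, nu=N - 1)  # fully connected
_, F_sparse = partial_covariance_mask(N, nu=4)

print(np.allclose(F_full, np.ones((N, N))))  # F reduces to the all-ones matrix
print(F_sparse[0, N - 1])                    # distant antennas share no neighbors
```

With full connectivity the mask leaves $R_{x}$ untouched, while with a sparse topology the entries of $R_{x}$ linking antennas without common neighbors are zeroed out.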

Algorithm (9) can be applied to solve (106). The signals $\{\boldsymbol{s}_{n}(i)\}_{n=0}^{2}$ are i.i.d. zero-mean complex Gaussian random variables with variance $\sigma^{2}_{s,n}=1,\forall n$ . The additive noise $\boldsymbol{v}_{i}$ is zero-mean complex Gaussian with covariance $\mathbb{E}\boldsymbol{v}_{i}\boldsymbol{v}_{i}^{*}=\sigma_{v}^{2}I_{N}$ ( $\sigma_{v}=0.7$ ). We set $\nu=4$ . The complex combination matrix $\mathcal{A}$ is set as the solution of the feasibility problem (22) with the constraint $\rho(\mathcal{A}-\mathcal{P}_{\scriptscriptstyle\mathcal{U}})<1$ replaced by $\rho(\mathcal{A}-\mathcal{P}_{\scriptscriptstyle\mathcal{U}})\leq 1-\epsilon$ ( $\epsilon=0.01$ ) and the constraint $\mathcal{A}=\mathcal{A}^{*}$ added. These changes make the problem convex; see [2, Sec. 3] for further details. The resulting problem is solved via the CVX package [42]. Note that the distributed implementation is feasible in this example. We set $\mu=0.005$ . The output signal-to-interference-plus-noise ratio (SINR) given by $\mathbb{E}\left[\frac{\sigma^{2}_{s,0}|\boldsymbol{h}_{i}^{*}a(\theta_{0})|^{2}}{\boldsymbol{h}_{i}^{*}R_{z}\boldsymbol{h}_{i}}\right]$ with $R_{z}=\sum_{n=1}^{2}\sigma^{2}_{s,n}a(\theta_{n})a^{*}(\theta_{n})+\sigma^{2}_{v}I_{N}$ is illustrated in Fig. 2 (right). The dashed black curve is the beampattern obtained by the centralized algorithm, also known as constrained LMS [33] ( $\mu=0.001$ ). The results are averaged over $1000$ Monte-Carlo runs. We observe that the distributed solution performs well compared to the centralized implementation.
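As a point of reference for the SINR curves, the output SINR of a fixed beamformer can be evaluated directly from $R_{z}$ . The sketch below assumes the usual power-ratio definition (squared magnitude of the response in the numerator) and evaluates it at a centralized closed-form LCMV solution rather than at the adaptive iterates $\boldsymbol{h}_{i}$ ; the helper names `steering_vector` and `sinr` are hypothetical.

```python
import numpy as np

def steering_vector(theta_deg, N=14):
    tau = np.pi * np.sin(np.deg2rad(theta_deg))  # half-wavelength spacing
    return np.exp(-1j * tau * np.arange(N))

def sinr(h, a0, R_z, sigma_s0_sq=1.0):
    """Output SINR of a fixed weight vector h (linear scale)."""
    return sigma_s0_sq * abs(h.conj() @ a0) ** 2 / np.real(h.conj() @ R_z @ h)

N, sigma_v = 14, 0.7
a0, a1, a2 = (steering_vector(t, N) for t in (30.0, -60.0, 60.0))
R_z = np.outer(a1, a1.conj()) + np.outer(a2, a2.conj()) + sigma_v ** 2 * np.eye(N)
R_x = np.outer(a0, a0.conj()) + R_z

# Centralized LCMV weights under the constraints D^* h = b.
D = np.column_stack([steering_vector(t, N) for t in (30.0, 58.5, 61.5)])
b = np.array([1.0, 0.01, 0.01])
R_inv_D = np.linalg.solve(R_x, D)
h_o = R_inv_D @ np.linalg.solve(D.conj().T @ R_inv_D, b)

sinr_db = 10 * np.log10(sinr(h_o, a0, R_z))
print(sinr_db)
```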

V Conclusion

In this work, we considered inference problems over networks where agents have individual parameter vectors to estimate subject to subspace constraints that require the parameters across the network to lie in low-dimensional subspaces. Based on the gradient projection algorithm, we proposed an iterative and distributed implementation of the projection step, which runs in parallel with the stochastic gradient descent update. We showed that, for a sufficiently small step-size parameter $\mu$ , the network is mean-square-error stable and approaches the minimizer of the constrained problem with estimation errors on the order of $\mu$ .

Appendix A Proof of Lemma 1

First we prove sufficiency by proving that if $\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}}$ is a projection matrix and $\mathcal{A}$ satisfies conditions (10), (11), and (12), then the matrix equation (7) holds. If $\mathcal{A}$ satisfies (10) and (11), then:

[TABLE]

where we used the fact that $(I-\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}})=(I-\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}})^{i}$ since $(I-\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}})$ is a projector. Applying condition (12) and using the fact that for any matrix $B$ , $\lim_{i\rightarrow\infty}B^{i}=0$ if and only if $\rho(B)<1$ , we obtain the desired convergence (7).
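The sufficiency argument can be checked numerically on a synthetic example. The sketch below builds a hypothetical real matrix $\mathcal{A}=\mathcal{P}_{\scriptscriptstyle\mathcal{U}}+(I-\mathcal{P}_{\scriptscriptstyle\mathcal{U}})B(I-\mathcal{P}_{\scriptscriptstyle\mathcal{U}})$ with $\|B\|=0.9$ , which satisfies the three conditions by construction but ignores the sparsity constraints (8).

```python
import numpy as np

rng = np.random.default_rng(3)
M, P = 6, 2  # hypothetical dimensions

U, _ = np.linalg.qr(rng.standard_normal((M, P)))
P_U = U @ U.T
Q = np.eye(M) - P_U

B = rng.standard_normal((M, M))
B *= 0.9 / np.linalg.norm(B, 2)  # spectral norm 0.9, so rho(Q B Q) < 1
A = P_U + Q @ B @ Q

rho = max(abs(np.linalg.eigvals(A - P_U)))
A_pow = np.linalg.matrix_power(A, 300)

print(np.allclose(A @ P_U, P_U), np.allclose(P_U @ A, P_U))
print(rho < 1)
print(np.allclose(A_pow, P_U))   # A^i converges to the projector
```

The identity $\mathcal{A}^{i}-\mathcal{P}_{\scriptscriptstyle\mathcal{U}}=(\mathcal{A}-\mathcal{P}_{\scriptscriptstyle\mathcal{U}})^{i}$ that drives the proof also holds for this example at any finite power.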

To prove necessity, we shall prove that, whenever (7) holds, $\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}}$ is a projection matrix and conditions (10), (11), and (12) on $\mathcal{A}$ are satisfied. We use the fact that $\lim_{i\rightarrow\infty}\mathcal{A}^{i}$ exists if, and only if, there is a nonsingular matrix $\mathcal{V}$ such that [43]:

[TABLE]

where the spectral radius of $\mathcal{J}$ is less than one. Let $v_{1},\ldots,v_{M}$ be the columns of $\mathcal{V}$ and $y_{1}^{*},\ldots,y_{M}^{*}$ be the rows of $\mathcal{V}^{-1}$ . Then, we have:

[TABLE]

From the left-hand side of (7) and (119), we obtain:

[TABLE]

Observe from (113) that one is an eigenvalue of $\mathcal{A}$ with multiplicity $K$ and $\{v_{m},y_{m}\}_{m=1}^{K}$ are the associated right and left eigenvectors. Thus, from (122), we obtain:

[TABLE]

and equations (10) and (11) hold. Moreover, from (113) and (122), we obtain:

[TABLE]

which is condition (12). Finally, from (122), we have:

[TABLE]

Thus, $\mathcal{P}_{\scriptscriptstyle\mathcal{U}}$ is a projector, which completes the necessity proof.

Since each $v_{m}y_{m}^{*}$ is a rank-one matrix and their sum $\sum_{m=1}^{M}v_{m}y_{m}^{*}=\mathcal{V}\mathcal{V}^{-1}=I$ has rank $M$ , the matrix $\sum_{m=1}^{K}v_{m}y_{m}^{*}$ must have rank $K$ . Since the rank of $\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}}$ is equal to $P$ , we obtain from (122) $K=P$ . Thus, the matrix $\mathcal{A}$ has $P=K$ eigenvalues at one and all other eigenvalues have magnitude strictly less than one.

Appendix B Driving term in algorithm (19)

Let ${\scriptstyle\mathcal{W}}_{0}$ denote an $M\times 1$ vector distributed across the network. In order to justify the choice of $(I-\mathcal{A})d_{\scriptstyle\mathcal{D}}$ in (19), we consider the problem of finding the projection ${\scriptstyle\mathcal{W}}^{o}=\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}}{\scriptstyle\mathcal{W}}_{0}+d_{\scriptstyle\mathcal{D}}$ in a distributed and iterative manner through a linear iteration of the form:

[TABLE]

where $\mathcal{A}$ satisfies (7), (8) and $\mathcal{B}$ is a properly chosen matrix ensuring convergence toward ${\scriptstyle\mathcal{W}}^{o}$ . Starting from ${\scriptstyle\mathcal{W}}_{0}$ and iterating the above recursion, we obtain:

[TABLE]

If we let $i\rightarrow\infty$ on both sides of (132), we find:

[TABLE]

For ${\scriptstyle\mathcal{W}}_{\infty}$ to be equal to ${\scriptstyle\mathcal{W}}^{o}$ , $\mathcal{B}$ in (131) must be chosen such that $\sum_{j=0}^{\infty}\mathcal{A}^{j}\mathcal{B}=I$ . In the following, we show that $\mathcal{B}=I-\mathcal{A}+\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}}$ ensures convergence. From the Jordan canonical form of $\mathcal{A}$ introduced in (49), we have:

[TABLE]

If we multiply both terms and compute the infinite sum in (133), we obtain:

[TABLE]

where we used the fact that $\sum_{j=0}^{\infty}\mathcal{J}_{\epsilon}^{j}=(I-\mathcal{J}_{\epsilon})^{-1}$ since $\rho(\mathcal{J}_{\epsilon})<1$ .
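The matrix geometric (Neumann) series for a matrix with spectral radius less than one sums to $(I-\mathcal{J}_{\epsilon})^{-1}$ , which can be confirmed on a single block. The sketch below uses a hypothetical $3\times 3$ block with eigenvalue 0.6 and $\epsilon=0.1$ .

```python
import numpy as np

lam, eps = 0.6, 0.1  # hypothetical eigenvalue and super-diagonal value
J_eps = lam * np.eye(3) + eps * np.eye(3, k=1)

# Accumulate the partial sums of the series I + J + J^2 + ...
partial_sum = np.zeros((3, 3))
term = np.eye(3)
for _ in range(300):
    partial_sum += term
    term = term @ J_eps

limit = np.linalg.inv(np.eye(3) - J_eps)
print(np.allclose(partial_sum, limit))
```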

Now, since $\mathcal{P}_{{\scriptscriptstyle\mathcal{U}}}=\mathcal{P}_{\scriptstyle\mathcal{D}}$ and $\mathcal{P}_{\scriptstyle\mathcal{D}}d_{\scriptstyle\mathcal{D}}=0$ , we obtain $\mathcal{B}d_{\scriptstyle\mathcal{D}}=(I-\mathcal{A})d_{\scriptstyle\mathcal{D}}$ , which justifies the choice in (19).
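The role of the driving term $(I-\mathcal{A})d_{\scriptstyle\mathcal{D}}$ can be illustrated by running the linear iteration to convergence. The sketch below assumes the recursion has the form ${\scriptstyle\mathcal{W}}_{i}=\mathcal{A}{\scriptstyle\mathcal{W}}_{i-1}+(I-\mathcal{A})d_{\scriptstyle\mathcal{D}}$ , uses real data, and builds a hypothetical dense $\mathcal{A}$ that satisfies (7) but not the sparsity constraints (8).

```python
import numpy as np

rng = np.random.default_rng(4)
M, Qc = 6, 2  # hypothetical sizes: unknowns and constraints

D = rng.standard_normal((M, Qc))
d = rng.standard_normal(Qc)
G_inv = np.linalg.inv(D.T @ D)
P_U = np.eye(M) - D @ G_inv @ D.T  # equals P_D
d_D = D @ G_inv @ d

# Hypothetical combination matrix with A^i -> P_U.
Q_perp = np.eye(M) - P_U
B = rng.standard_normal((M, M))
B *= 0.5 / np.linalg.norm(B, 2)
A = P_U + Q_perp @ B @ Q_perp

W0 = rng.standard_normal(M)
W = W0.copy()
for _ in range(200):
    W = A @ W + (np.eye(M) - A) @ d_D  # assumed form of the recursion

W_o = P_U @ W0 + d_D
print(np.allclose(W, W_o))      # converges to the affine projection
print(np.allclose(D.T @ W, d))  # the limit satisfies the constraint
```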

Appendix C Proof of Lemma 2

We start by noting that the $M\times M$ matrix $\mathcal{A}$ satisfying conditions (13), (14), and (12) admits a Jordan canonical decomposition of the form:

[TABLE]

where the matrix $\mathcal{J}$ consists of Jordan blocks, with each one of them having the form (say for a Jordan block of size $3\times 3$ ):

[TABLE]

where the eigenvalue $\lambda$ may be complex but has magnitude less than one. Let $\mathcal{E}=\text{diag}\{I_{P},\epsilon,\epsilon^{2},\ldots,\epsilon^{M-P}\}$ with $\epsilon>0$ any small positive number independent of $\mu$ . The matrix $\mathcal{A}$ in (148) can be written alternatively as:

[TABLE]

where

[TABLE]

and where the matrix $\mathcal{J}_{\epsilon}$ consists of Jordan blocks, with each one of them having a form similar to (152) with $\epsilon>0$ appearing on the upper diagonal instead of $1$ , and where the eigenvalue $\lambda$ may be complex but has magnitude less than one. Since $\mathcal{V}_{\epsilon}^{-1}\mathcal{V}_{\epsilon}=I_{M}$ , it holds that:

[TABLE]

and where $\mathcal{U}^{*}\mathcal{U}=I_{P}$ from Assumption 3.
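The effect of the diagonal scaling by powers of $\epsilon$ can be seen on a single $3\times 3$ Jordan block. The sketch below applies the similarity transformation $E^{-1}JE$ with $E=\text{diag}\{\epsilon,\epsilon^{2},\epsilon^{3}\}$ ; the eigenvalue and $\epsilon$ are hypothetical values.

```python
import numpy as np

lam, eps = 0.5 + 0.3j, 0.1  # hypothetical eigenvalue (|lam| < 1) and epsilon

# Standard Jordan block with ones on the super-diagonal.
J = lam * np.eye(3) + np.eye(3, k=1)

# Scaling by increasing powers of epsilon replaces each 1 by epsilon.
E = np.diag([eps, eps ** 2, eps ** 3])
J_eps = np.linalg.inv(E) @ J @ E

expected = lam * np.eye(3) + eps * np.eye(3, k=1)
print(np.allclose(J_eps, expected))
print(np.linalg.norm(J_eps, 2) <= abs(lam) + eps)  # 2-norm close to |lam|
```

Because the transformation is a similarity, the eigenvalues are unchanged, while the 2-induced norm of the scaled block can be brought arbitrarily close to $|\lambda|$ by shrinking $\epsilon$ .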

Now, let us consider the extended version of the matrix $\mathcal{A}$ , namely, $\mathcal{A}^{e}$ , which is an $N\times N$ block matrix whose $(k,\ell)$ -th block is defined in Table I. In the real data case, we have $\mathcal{A}^{e}=\mathcal{A}$ . In the complex data case, it can be verified that $\mathcal{A}^{e}$ is similar to the $2\times 2$ block diagonal matrix:

[TABLE]

according to:

[TABLE]

where $\mathcal{I}$ is the permutation matrix defined by (36). Using (153), we can rewrite the second block in (163) as:

[TABLE]

where $(\Lambda_{\epsilon}^{*})^{\top}=\text{diag}\{I_{P},(\mathcal{J}_{\epsilon}^{*})^{\top}\}$ and

[TABLE]

Now, by replacing (153) and (165) into (163), and by introducing the extended $2\times 2$ block diagonal matrices:

[TABLE]

we find that the $2M\times 2M$ matrix $\mathcal{A}^{d}$ has a Jordan decomposition of the form:

[TABLE]

Let us again introduce a permutation matrix $\mathcal{I}^{\prime}$ given by:

[TABLE]

The matrix $\mathcal{A}^{d}$ in (172) can be written alternatively as:

[TABLE]

where the matrix $\Lambda_{\epsilon}^{e}$ is block diagonal defined in (51). Returning now to $\mathcal{A}^{e}$ , and using (164) and (178), we find that the matrix $\mathcal{A}^{e}$ has a Jordan decomposition of the form:

[TABLE]

where $\mathcal{V}^{e}_{\epsilon}$ and $(\mathcal{V}^{e}_{\epsilon})^{-1}$ are defined by:

[TABLE]

in terms of the permutation matrices $\mathcal{I}$ and $\mathcal{I}^{\prime}$ in (36), (177) and the block diagonal matrix $\mathcal{V}_{\epsilon}^{d}$ in (170). By properly evaluating these matrices, we arrive at (51).

Appendix D Proof of Theorem 1

We consider the transformed variables $\overline{\boldsymbol{{\scriptstyle\mathcal{W}}}}_{i}^{e}$ and $\check{\boldsymbol{{\scriptstyle\mathcal{W}}}}_{i}^{e}$ in (69). Conditioning both sides on $\boldsymbol{\cal{F}}_{i-1}$ , computing the conditional second-order moments, using the conditions from Assumption 2 on the gradient noise process, and computing the expectations again, we get:

[TABLE]

and

[TABLE]

By applying Jensen’s inequality to the convex function $\|x\|^{2}$ , we can bound the first term on the RHS of (181) as follows:

[TABLE]

for any arbitrary positive number $t\in(0,1)$ . By Assumption 1, the Hermitian matrix $\boldsymbol{H}_{k,i-1}$ defined in (55) can be bounded as follows:

[TABLE]

Using the fact that the integral of a matrix is the matrix of the integrals, and the linear property of integration, the Hermitian block $\boldsymbol{\cal{D}}_{11,i-1}$ in (78) can be rewritten as:

[TABLE]

and, therefore, from Assumption 1, $\boldsymbol{\cal{D}}_{11,i-1}$ can be bounded as follows:

[TABLE]

for some positive constants $\nu$ and $\delta$ that are independent of $\mu$ and $i$ . In terms of the $2$-induced matrix norm (i.e., maximum singular value), we obtain:

[TABLE]

and, therefore,

[TABLE]

for some positive constant $\sigma_{11}$ that is independent of $\mu$ and $i$ .

Similarly, using the $2$-induced matrix norm (i.e., maximum singular value), we can bound $\|\boldsymbol{\cal{D}}_{12,i-1}\|^{2}$ as follows:

[TABLE]

for some positive constant $\sigma_{12}$ and where we used the fact that $\|(\mathcal{U}^{e})^{*}\|=\sigma_{\max}((\mathcal{U}^{e})^{*})=\sqrt{\lambda_{\max}(\mathcal{U}^{e}(\mathcal{U}^{e})^{*})}=1$ .

Substituting the bound on the first term on the RHS, obtained above via Jensen’s inequality, into (181), and using (187), (188), we get:

[TABLE]

We select $t=\sigma_{11}\mu$ (for sufficiently small $\mu$ ). Then, the previous inequality can be written as:

[TABLE]

We repeat similar arguments for the second variance relation (182). Using Jensen’s inequality again, we obtain:

[TABLE]

for any $t\in(0,1)$. In (a) we used the fact that the block diagonal matrix $\mathcal{J}^{e}_{\epsilon}$ defined in Table I satisfies:

[TABLE]

Expression (192) can be established by using similar arguments as in [4, p. 516] and the fact that $\lambda_{\max}\left(\mathcal{J}_{\epsilon}^{\top}(\mathcal{J}_{\epsilon}^{*})^{\top}\right)=\lambda_{\max}\left((\mathcal{J}_{\epsilon}^{*}\mathcal{J}_{\epsilon})^{\top}\right)=\lambda_{\max}(\mathcal{J}_{\epsilon}^{*}\mathcal{J}_{\epsilon})$. In (b), we used the fact that $\rho(\mathcal{J}_{\epsilon})\in(0,1)$, and thus $\epsilon$ can be selected small enough to ensure $\rho(\mathcal{J}_{\epsilon})+\epsilon\in(0,1)$. We then selected $t=\rho(\mathcal{J}_{\epsilon})+\epsilon$.
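As a numerical illustration of why $\rho(\mathcal{J}_{\epsilon})+\epsilon$ is a natural choice for $t$, the sketch below checks, for a single real Jordan-type block, that the squared $2$-induced norm does not exceed $(|\lambda|+\epsilon)^{2}$. The block layout (eigenvalue on the diagonal, $\epsilon$ on the first lower sub-diagonal), the block size, the values of `lam` and `eps`, and the helper names are all assumptions made for illustration; the paper's $\mathcal{J}_{\epsilon}$ is in general complex and block diagonal.

```python
import math

def jordan_block(lam, eps, n):
    """n x n block: lam on the diagonal, eps on the first lower sub-diagonal."""
    return [[lam if r == c else (eps if r == c + 1 else 0.0)
             for c in range(n)] for r in range(n)]

def spectral_norm_sq(M, iters=5000):
    """Estimate lambda_max(M^T M) for a real matrix M by power iteration."""
    n = len(M[0])
    Mt = [list(col) for col in zip(*M)]       # transpose of M
    x = [1.0 / math.sqrt(n)] * n              # unit-norm starting vector
    est = 0.0
    for _ in range(iters):
        Mx = [sum(m * v for m, v in zip(row, x)) for row in M]
        y = [sum(m * v for m, v in zip(row, Mx)) for row in Mt]
        est = math.sqrt(sum(v * v for v in y))  # ||M^T M x|| with ||x|| = 1
        x = [v / est for v in y]
    return est

lam, eps = 0.9, 0.05                  # hypothetical eigenvalue modulus and epsilon
norm_sq = spectral_norm_sq(jordan_block(lam, eps, 4))
bound = (lam + eps) ** 2              # claimed upper bound on the squared norm
t = lam + eps                         # the choice of t made in step (b)
print(norm_sq, bound, bound / t)      # bound / t collapses to lam + eps
```

With $t=\rho(\mathcal{J}_{\epsilon})+\epsilon$, the factor $(\rho(\mathcal{J}_{\epsilon})+\epsilon)^{2}/t$ reduces to $\rho(\mathcal{J}_{\epsilon})+\epsilon$, which is the leading term of the coefficient $d$ that appears later in the proof.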

Using Jensen’s inequality, the second term on the RHS of (191) can be bounded as follows:

[TABLE]

Following similar arguments as in (188), we can show that:

[TABLE]

for some positive constants $\sigma_{21}$ and $\sigma_{22}$ . Substituting (193) into (191) and (191) into (182), and using (194), we obtain:

[TABLE]

From (77), we have:

[TABLE]

where we used the fact that $(\mathcal{V}_{L,\epsilon}^{e})^{*}\mathcal{A}^{e}=\mathcal{J}_{\epsilon}^{e}(\mathcal{V}_{L,\epsilon}^{e})^{*}$ from Lemma 2. Since $b^{e}$ in (61) is defined in terms of the gradient $\nabla_{w^{*}_{k}}J_{k}(w^{o}_{k})$ and since $J_{k}(w_{k})$ is twice differentiable, $\|b^{e}\|^{2}$ is bounded and we obtain $\|\widecheck{b}^{e}\|^{2}=O(\mu^{2})$. For the noise terms $\mathbb{E}\|\overline{\boldsymbol{s}}_{i}^{e}\|^{2}$ in (190) and $\mathbb{E}\|\widecheck{\boldsymbol{s}}_{i}^{e}\|^{2}$ in (195), we have:

[TABLE]

where $v_{1}$ is a positive constant independent of $\mu$ and given by $v_{1}\triangleq\|(\mathcal{V}_{\epsilon}^{e})^{-1}\mathcal{A}^{e}\|=\|\Lambda_{\epsilon}^{e}(\mathcal{V}_{\epsilon}^{e})^{-1}\|$ . In terms of the variances of the individual noise processes, $\mathbb{E}\|\boldsymbol{s}_{k,i}\|^{2}$ , we have $\mathbb{E}\|\boldsymbol{s}_{i}^{e}\|^{2}=\sum_{k=1}^{N}\mathbb{E}\|\boldsymbol{s}^{e}_{k,i}\|^{2}=2\sum_{k=1}^{N}\mathbb{E}\|\boldsymbol{s}_{k,i}\|^{2}$ . For each $\boldsymbol{s}_{k,i}(\boldsymbol{w}_{k,i-1})$ , we have from Assumption 2 and Jensen’s inequality:

[TABLE]

where $\bar{\beta}_{k}^{2}\triangleq 2(\beta_{k}^{2}/h^{2})$ and $\bar{\sigma}_{s,k}^{2}\triangleq 2(\beta_{k}^{2}/h^{2})\|w^{o}_{k}\|^{2}+\sigma^{2}_{s,k}$ . The term $\mathbb{E}\|\boldsymbol{s}_{i}^{e}\|^{2}$ can thus be bounded as follows:

[TABLE]

where $\beta^{2}_{\max}\triangleq\max_{1\leq k\leq N}\bar{\beta}_{k}^{2}$ , $\sigma^{2}_{s}\triangleq 2\sum_{k=1}^{N}\bar{\sigma}_{s,k}^{2}$ , and $v_{2}\triangleq\|\mathcal{V}_{\epsilon}^{e}\|$ . Substituting into (197), we get:

[TABLE]
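The Jensen step behind the per-agent noise bound can be made explicit. This is a sketch under two assumptions that are not stated in this appendix: that Assumption 2 bounds the conditional noise variance in the form $\mathbb{E}[\|\boldsymbol{s}_{k,i}(\boldsymbol{w})\|^{2}\,|\,\boldsymbol{\cal{F}}_{i-1}]\leq(\beta_{k}/h)^{2}\|\boldsymbol{w}\|^{2}+\sigma_{s,k}^{2}$, which is the form suggested by the constants $\bar{\beta}_{k}^{2}$ and $\bar{\sigma}_{s,k}^{2}$, and that the error vector is written as $\widetilde{\boldsymbol{w}}_{k,i-1}=w^{o}_{k}-\boldsymbol{w}_{k,i-1}$. Jensen's inequality with equal weights gives:

$$\|\boldsymbol{w}_{k,i-1}\|^{2}=\|w^{o}_{k}-\widetilde{\boldsymbol{w}}_{k,i-1}\|^{2}\leq 2\|\widetilde{\boldsymbol{w}}_{k,i-1}\|^{2}+2\|w^{o}_{k}\|^{2},$$

so that, after taking expectations,

$$\mathbb{E}\|\boldsymbol{s}_{k,i}\|^{2}\leq\frac{\beta_{k}^{2}}{h^{2}}\,\mathbb{E}\|\boldsymbol{w}_{k,i-1}\|^{2}+\sigma_{s,k}^{2}\leq\bar{\beta}_{k}^{2}\,\mathbb{E}\|\widetilde{\boldsymbol{w}}_{k,i-1}\|^{2}+\bar{\sigma}_{s,k}^{2}.$$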

Using this bound in (190) and (195), we obtain:

[TABLE]

We can combine (200) and (201) into a single inequality recursion:

[TABLE]

where $\Gamma$ is given by:

[TABLE]

and where $a=1-O(\mu)$, $b=O(\mu)$, $c=O(\mu^{2})$, $d=\rho(\mathcal{J}_{\epsilon})+\epsilon+O(\mu^{2})$, $e=O(\mu^{2})$, and $f=O(\mu^{2})$. Now, using the property that the spectral radius of a matrix is upper bounded by its $1$-norm, we obtain:

[TABLE]

Since $\rho(\mathcal{J}_{\epsilon})<1$ is independent of $\mu$ , and since $\epsilon$ and $\mu$ are small positive numbers that can be chosen arbitrarily small and independently of each other, it is clear that the RHS of the above expression can be made strictly smaller than one for sufficiently small $\epsilon$ and $\mu$ . In that case $\rho(\Gamma)<1$ so that $\Gamma$ is stable. Moreover, it holds that:

[TABLE]
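The stability argument for $\Gamma$ can be illustrated numerically. The sketch below builds a $2\times 2$ matrix whose entries follow the stated orders of $a$, $b$, $c$, and $d$, then compares its spectral radius with its $1$-norm (maximum absolute column sum). The constants `s11`, `kb`, `kc`, `kd` and the value `rho_eps` (standing for $\rho(\mathcal{J}_{\epsilon})+\epsilon$) are hypothetical; the paper only specifies the orders in $\mu$.

```python
import math

def gamma(mu, rho_eps=0.9, s11=1.0, kb=2.0, kc=3.0, kd=4.0):
    """Hypothetical instance of Gamma = [[a, b], [c, d]] with the stated orders."""
    a = 1.0 - s11 * mu            # a = 1 - O(mu)
    b = kb * mu                   # b = O(mu)
    c = kc * mu ** 2              # c = O(mu^2)
    d = rho_eps + kd * mu ** 2    # d = rho(J_eps) + eps + O(mu^2)
    return [[a, b], [c, d]]

def one_norm(M):
    """Induced 1-norm: maximum absolute column sum."""
    return max(sum(abs(row[col]) for row in M) for col in range(len(M[0])))

def spectral_radius_2x2(M):
    """Largest eigenvalue modulus of a real 2 x 2 matrix, in closed form."""
    (a, b), (c, d) = M
    tr, det = a + d, a * d - b * c
    disc = tr * tr - 4.0 * det
    if disc >= 0.0:
        root = math.sqrt(disc)
        return max(abs((tr + root) / 2.0), abs((tr - root) / 2.0))
    return math.sqrt(det)         # complex-conjugate pair: |lambda|^2 = det

for mu in (1e-2, 1e-3):
    G = gamma(mu)
    print(mu, spectral_radius_2x2(G), one_norm(G))
```

For these particular constants, the $1$-norm drops below one only once $\mu$ is small enough (for instance, $\mu=0.1$ is still too large), which mirrors the "sufficiently small $\mu$" requirement in the argument.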

Now, by iterating (208) we arrive at:

[TABLE]

from which we conclude that $\limsup_{i\rightarrow\infty}\mathbb{E}\|\overline{\boldsymbol{{\scriptstyle\mathcal{W}}}}^{e}_{i}\|^{2}=O(\mu)$ and $\limsup_{i\rightarrow\infty}\mathbb{E}\|\widecheck{\boldsymbol{{\scriptstyle\mathcal{W}}}}^{e}_{i}\|^{2}=O(\mu^{2})$. Therefore,

[TABLE]
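The $O(\mu)$ and $O(\mu^{2})$ steady-state orders can be checked on a toy version of the recursion. The sketch below iterates a two-dimensional recursion with equality, using coefficients and driving terms that follow the stated orders of $a$ through $f$. All constants (`s11`, `kb`, `kc`, `kd`, `ke`, `kf`, `rho_eps`), the initial values, and the iteration count are hypothetical. Because the coefficients are nonnegative, the equality recursion upper bounds any sequence that satisfies the corresponding inequality from the same starting point.

```python
def iterate_recursion(mu, rho_eps=0.9, s11=1.0, kb=2.0, kc=3.0,
                      kd=4.0, ke=5.0, kf=6.0):
    """Iterate [x; y] <- Gamma [x; y] + [e; f] until the transient has died out."""
    a, b = 1.0 - s11 * mu, kb * mu            # a = 1 - O(mu), b = O(mu)
    c, d = kc * mu ** 2, rho_eps + kd * mu ** 2  # c = O(mu^2), d = rho + O(mu^2)
    e, f = ke * mu ** 2, kf * mu ** 2         # driving terms, both O(mu^2)
    x, y = 1.0, 1.0                           # arbitrary initial mean-square values
    for _ in range(int(40.0 / (s11 * mu))):   # many time constants of the slow mode
        x, y = a * x + b * y + e, c * x + d * y + f
    return x, y

for mu in (1e-2, 1e-3):
    x, y = iterate_recursion(mu)
    # x / mu and y / mu^2 stay bounded as mu shrinks
    print(mu, x / mu, y / mu ** 2)
```

In this toy setting the first component settles at a value proportional to $\mu$ and the second at a value proportional to $\mu^{2}$, in line with the two limits stated above.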

Bibliography

[1] R. Nassif, S. Vlaski, and A. H. Sayed, “Distributed inference over networks under subspace constraints,” in Proc. IEEE Int. Conf. Acoust., Speech, and Signal Process., Brighton, UK, May 2019, pp. 1–5.
[2] R. Nassif, S. Vlaski, and A. H. Sayed, “Adaptation and learning over networks under subspace constraints – Part II: Performance analysis,” submitted for publication, May 2019.
[3] A. H. Sayed, S. Y. Tu, J. Chen, X. Zhao, and Z. J. Towfic, “Diffusion strategies for adaptation and learning over networks,” IEEE Signal Process. Mag., vol. 30, no. 3, pp. 155–171, 2013.
[4] A. H. Sayed, “Adaptation, learning, and optimization over networks,” Foundations and Trends in Machine Learning, vol. 7, no. 4-5, pp. 311–801, 2014.
[5] A. H. Sayed, “Adaptive networks,” Proc. IEEE, vol. 102, no. 4, pp. 460–497, Apr. 2014.
[6] D. Bertsekas, “A new class of incremental gradient methods for least squares problems,” SIAM J. Optim., vol. 7, no. 4, pp. 913–926, 1997.
[7] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Trans. Autom. Control, vol. 54, no. 1, pp. 48–61, Jan. 2009.
[8] A. G. Dimakis, S. Kar, J. M. F. Moura, M. G. Rabbat, and A. Scaglione, “Gossip algorithms for distributed signal processing,” Proc. IEEE, vol. 98, no. 11, pp. 1847–1864, Nov. 2010.
