Localized Linear Regression in Networked Data

Alexander Jung; Nguyen Tran

arXiv:1903.11178·cs.LG·July 24, 2019

TL;DR

This paper analyzes the statistical properties of the network Lasso (nLasso) for localized linear regression on networked data, providing conditions for accurate learning from limited labels and an implementation via primal-dual methods.

Contribution

It offers a theoretical analysis of nLasso's ability to learn localized linear models with few labels and presents a specialized implementation using primal-dual optimization.

Findings

01 Identifies sufficient conditions on network structure and labels for accurate nLasso learning.

02 Provides a scalable primal-dual algorithm for localized linear regression with nLasso.

03 Demonstrates the effectiveness of nLasso in networked data scenarios.

Equations

$$y^{(i)}=\big(\overline{\mathbf{w}}^{(i)}\big)^{T}\mathbf{x}^{(i)}+\varepsilon^{(i)}$$

$$\hat{y}^{(i)}:=\big(\widehat{\mathbf{w}}^{(i)}\big)^{T}\mathbf{x}^{(i)}$$

$$\mathcal{W}:=\{\mathbf{w}:\mathcal{V}\rightarrow\mathbb{R}^{p}:i\mapsto\mathbf{w}^{(i)}\}$$

$$E(\mathbf{w})$$

$$\|\mathbf{w}\|_{\rm TV}$$

$$\mathbf{w}$$

$$\mathbf{w}=\big(\big(\mathbf{w}^{(1)}\big)^{T},\ldots,\big(\mathbf{w}^{(n)}\big)^{T}\big)^{T}\in\mathbb{R}^{pn}$$

$$\mathbf{D}_{e,i}=\begin{cases}A_{ij}\mathbf{I}_{p}&e=\{i,j\}\in\mathcal{E},\ i<j\\-A_{ij}\mathbf{I}_{p}&e=\{i,j\}\in\mathcal{E},\ i>j\\\mathbf{0}&\mbox{otherwise}\end{cases}$$

$$\widehat{\mathbf{w}}\in\operatorname*{arg\,min}_{\mathbf{w}\in\mathbb{R}^{pn}}h(\mathbf{w})+g(\mathbf{D}\mathbf{w})$$

$$h(\mathbf{w})$$

$$\mbox{with }\mathbf{u}=\big(\big(\mathbf{u}^{(1)}\big)^{T},\ldots,\big(\mathbf{u}^{(q)}\big)^{T}\big)^{T}\in\mathbb{R}^{pq}$$

$$\min_{\mathbf{w}\in\mathbb{R}^{pn}}\max_{\mathbf{u}\in\mathbb{R}^{pq}}\mathbf{u}^{T}\mathbf{D}\mathbf{w}+h(\mathbf{w})-g^{*}(\mathbf{u})$$

$$-\mathbf{D}^{T}\widehat{\mathbf{u}}\in\partial h(\widehat{\mathbf{w}})\mbox{, and }\mathbf{D}\widehat{\mathbf{w}}\in\partial g^{*}(\widehat{\mathbf{u}})$$

$$\widehat{\mathbf{w}}-\mathbf{T}\mathbf{D}^{T}\widehat{\mathbf{u}}\in(\mathbf{I}+\mathbf{T}\partial h)(\widehat{\mathbf{w}})\mbox{, }\widehat{\mathbf{u}}+\boldsymbol{\Sigma}\mathbf{D}\widehat{\mathbf{w}}\in(\mathbf{I}+\boldsymbol{\Sigma}\partial g^{*})(\widehat{\mathbf{u}})$$

$$\boldsymbol{\Sigma}={\rm diag}\{\sigma^{(e)}\mathbf{I}_{p}\}_{e=1}^{q}\mbox{ and }\mathbf{T}={\rm diag}\{\tau^{(i)}\mathbf{I}_{p}\}_{i=1}^{n}$$

$$\mathbf{w}_{k+1}$$

$$\mathbf{u}_{k+1}$$

$$(\mathbf{I}+\boldsymbol{\Sigma}\partial g^{*})^{-1}(\mathbf{u})=\operatorname*{arg\,min}_{\mathbf{u}'\in\mathbb{R}^{pq}}g^{*}(\mathbf{u}')+(1/2)\|\mathbf{u}'-\mathbf{u}\|^{2}_{\boldsymbol{\Sigma}^{-1}}$$

$$\mathbf{c}=\big(\big(\mathbf{c}^{(1)}\big)^{T},\ldots,\big(\mathbf{c}^{(q)}\big)^{T}\big)^{T}\mbox{, }\mathbf{c}^{(e)}:=\mathcal{T}^{(\lambda)}\big(\mathbf{u}^{(e)}\big)$$

$$\mathbf{w}^{(i)}=\mathbf{w}_{k}^{(i)}-\sum_{j>i}\tau^{(j)}A_{i,j}\mathbf{u}_{k}^{(j)}+\sum_{i>j}\tau^{(j)}A_{i,j}\mathbf{u}_{k}^{(j)}$$

$$\mathbf{v}^{(i)}$$

$$+\big(\mathbf{I}-(1/\|\mathbf{x}^{(i)}\|^{2})\mathbf{x}^{(i)}\big(\mathbf{x}^{(i)}\big)^{T}\big)\mathbf{w}^{(i)}$$

$$\|\boldsymbol{\Sigma}^{1/2}\mathbf{D}\mathbf{T}^{1/2}\|^{2}<1$$

$$\overline{\mathbf{w}}^{(i)}=\sum_{l=1}^{F}\mathbf{a}^{(l)}\mathcal{I}_{\mathcal{C}^{(l)}}[i]$$

$$K\sum_{i\in\mathcal{M}}\big|\big(\mathbf{x}^{(i)}\big)^{T}\mathbf{w}^{(i)}\big|+\|\mathbf{w}\|_{\overline{\partial\mathcal{F}}}\geq(L/\sqrt{p})\|\mathbf{w}\|_{\partial\mathcal{F}}$$

$$\|\widehat{\mathbf{w}}-\overline{\mathbf{w}}\|_{\rm TV}\leq K\big(1+4\sqrt{p}/(L-\sqrt{p})\big)\sum_{i\in\mathcal{M}}|\varepsilon^{(i)}|$$

$$\sum_{i\in\mathcal{M}}|\hat{y}^{(i)}-y^{(i)}|+\lambda\|\widehat{\mathbf{w}}\|_{\rm TV}\leq\sum_{i\in\mathcal{M}}|\varepsilon^{(i)}|+\lambda\|\overline{\mathbf{w}}\|_{\rm TV}$$

$$\sum_{i\in\mathcal{M}}|\hat{y}^{(i)}-y^{(i)}|+\lambda\|\widetilde{\mathbf{w}}\|_{\overline{\partial\mathcal{F}}}\leq\sum_{i\in\mathcal{M}}|\varepsilon^{(i)}|+\lambda\|\overline{\mathbf{w}}\|_{\partial\mathcal{F}}-\lambda\|\widehat{\mathbf{w}}\|_{\partial\mathcal{F}}$$

$$\sum_{i\in\mathcal{M}}|\hat{y}^{(i)}-y^{(i)}|+\lambda\|\widetilde{\mathbf{w}}\|_{\overline{\partial\mathcal{F}}}\leq\sum_{i\in\mathcal{M}}|\varepsilon^{(i)}|+\lambda\|\widetilde{\mathbf{w}}\|_{\partial\mathcal{F}}$$

$$\|\widetilde{\mathbf{w}}\|_{\overline{\partial\mathcal{F}}}$$

$$\sum_{i\in\mathcal{M}}|\hat{y}^{(i)}-y^{(i)}|$$

$$\geq\sum_{i\in\mathcal{M}}\big|\big(\mathbf{x}^{(i)}\big)^{T}\widetilde{\mathbf{w}}^{(i)}\big|-\sum_{i\in\mathcal{M}}|\varepsilon^{(i)}|$$

Code & Models

Repository: alexjungaalto/ResearchPublic (official)

Taxonomy

Methods: Linear Regression

Full text

Localized Linear Regression in Networked Data

Alexander Jung and Nguyen Tran

The authors are with the Department of Computer Science, Aalto University, Finland; firstname.lastname(at)aalto.fi

Abstract

The network Lasso (nLasso) has been proposed recently as an efficient learning algorithm for massive networked data sets (big data over networks). It extends the well-known least absolute shrinkage and selection operator (Lasso) from learning sparse (generalized) linear models to network models. Efficient implementations of the nLasso have been obtained using convex optimization methods which lend themselves to scalable message passing protocols. In this paper, we analyze the statistical properties of nLasso when applied to localized linear regression problems involving networked data. Our main result is a sufficient condition on the network structure and available label information such that nLasso accurately learns a localized linear regression model from a few labeled data points. We also provide an implementation of nLasso for localized linear regression by specializing a primal-dual method for solving the convex (non-smooth) nLasso problem.

I Introduction

The data arising in many important application domains can be modeled efficiently using some network structure. Examples of such networked data are found in signal processing, where signal samples can be arranged as a chain, in image processing, with pixels arranged on a grid, and in wireless sensor networks, where measurements conform to sensor proximity [1, 2, 3, 4]. Networks are also used to organize data in knowledge bases (graphs) whose items are linked by relations [5, 6].

In what follows, we will represent networked data using an undirected “empirical graph”. The nodes of the empirical graph represent individual data points (e.g., one image out of an entire collection) which are connected by edges according to some notion of similarity. This similarity might be induced by domain knowledge (e.g., friendship relations in social networks) or via probabilistic models [7, 8].

Besides their network structure, data points are typically characterized by features and labels. The features of data points are quantities that can be measured or computed efficiently (in an automated fashion). In contrast, the labels of data points are costly to acquire, involving human expert labor.

We consider regression problems within which data points are characterized by features and a numeric label (or target). The goal is to learn an accurate predictor which maps the features of a data point to a predicted label. The learning of the predictor is based on the availability of a few data points with known labels. Facing partially labeled data is common since the acquisition of reliable label information is often costly (involving human expert labor).

Accurate learning is particularly challenging in the high-dimensional regime [9, 10]. Here, a key obstacle is the lack of a sufficient number of samples which can be considered i.i.d. Using a network structure then allows us to borrow statistical strength from different “groups” of samples which are not exactly i.i.d., but still statistically similar to some extent.

The learning of an accurate predictor from a small number of labeled data points is enabled by exploiting the tendency of well-connected data points to have similar statistical properties. Such a clustering assumption, which underlies most (semi-) supervised machine learning methods [11, 12], requires any reasonable predictor to be nearly constant over well-connected subsets (clusters) of data points. The clustering assumption motivates the network Lasso (nLasso) as a form of empirical risk minimization [13].

Contribution. While several implementations of nLasso have been proposed and analyzed (see [13, 14]), little is known about the accuracy of nLasso in regression problems. The main contribution of this paper is a sufficient condition on the network topology and available label information such that the nLasso accurately learns a predictor from a small number of labeled data points. To this end, we apply (an extension of) the network compatibility condition (NCC) introduced in [15].

We demonstrate theoretically and empirically that the NCC guarantees that nLasso learns an accurate predictor which conforms with the clustering hypothesis. Our theoretical findings help to design sampling schemes which identify those data points whose labels would provide the most information about the labels of the other data points [16, 4].

Notation. The identity matrix of size $d\!\times\!d$ is denoted $\mathbf{I}_{d}$ . The positive part of some real number $a\!\in\!\mathbb{R}$ is $(a)_{+}\!=\!\max\{a,0\}$ . The Euclidean norm of a vector $\mathbf{x}\!=\!(x_{1},\ldots,x_{p})^{T}$ is $\|\mathbf{x}\|\!:=\!\sqrt{\sum_{r=1}^{p}x_{r}^{2}}$ . For a positive definite matrix $\mathbf{C}$ , we define the induced norm $\|\mathbf{x}\|_{\mathbf{C}}:=\sqrt{\mathbf{x}^{T}\mathbf{C}\mathbf{x}}$ . We will need the vector-valued clipping function $\mathcal{T}^{(\lambda)}(\mathbf{x}):=\lambda\mathbf{x}/\|\mathbf{x}\|$ for $\|\mathbf{x}\|\geq\lambda$ and $\mathcal{T}^{(\lambda)}(\mathbf{x}):=\mathbf{x}$ otherwise. The soft-thresholding operator is $\mathcal{S}(x;\tau):={\rm sign}(x)(|x|-\tau)_{+}$ .
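The two operators defined in the notation above are easy to state in code. The following is a minimal sketch using NumPy; the function names `clip_block` and `soft_threshold` are illustrative choices, not names used in the paper.

```python
import numpy as np

def clip_block(x, lam):
    """Vector clipping T^(lam): rescale x to norm lam if ||x|| >= lam, else keep x."""
    x = np.asarray(x, dtype=float)
    norm = np.linalg.norm(x)
    if norm >= lam and norm > 0.0:
        return lam * x / norm
    return x.copy()

def soft_threshold(x, tau):
    """Soft-thresholding S(x; tau) = sign(x) * (|x| - tau)_+ (works element-wise)."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

clipped = clip_block([3.0, 4.0], lam=1.0)     # norm 5 is reduced to norm 1
kept = clip_block([0.3, 0.4], lam=1.0)        # norm 0.5 is left unchanged
shrunk = soft_threshold([3.0, -0.5], tau=1.0)
```

Clipping acts on a whole vector (it is the projection onto the Euclidean ball of radius $\lambda$), whereas soft-thresholding acts on a scalar or element-wise.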

II Problem Formulation

We consider networked data modelled by an undirected “empirical” graph $\mathcal{G}\!=\!(\mathcal{V},\mathcal{E},\mathbf{A})$ whose nodes $\mathcal{V}\!=\!\{1,\ldots,n\}$ represent individual data points. The undirected edges $\mathcal{E}$ encode some domain-specific notion of similarity between data points. The similarity between nodes $i,j\!\in\!\mathcal{V}$ connected by the edge $\{i,j\}\!\in\!\mathcal{E}$ is quantified by a positive edge weight $A_{ij}$ . We collect the weights (with $A_{ij}\!=\!0$ if nodes $i,j\!\in\!\mathcal{V}$ are not connected by an edge), into the weight matrix $\mathbf{A}\in\mathbb{R}_{+}^{n\times n}$ .

In addition to the graph structure $\mathcal{G}$ , datasets typically convey additional information about the data points. Let us assume that each individual data point $i\in\mathcal{V}$ is characterized by a feature vector $\mathbf{x}^{(i)}\in\mathbb{R}^{p}$ and a numeric label $y^{(i)}\in\mathbb{R}$ . The features $\mathbf{x}^{(i)}$ can be determined easily for any data point. In contrast, acquisition of labels $y^{(i)}$ is difficult (requiring human expert labor). Our approach requires access only to the labels of a small training set $\mathcal{M}=\{i_{1},\ldots,i_{m}\}\subseteq\mathcal{V}$ .

We relate features $\mathbf{x}^{(i)}$ and labels $y^{(i)}$ using the linear model

$$y^{(i)}=\big(\overline{\mathbf{w}}^{(i)}\big)^{T}\mathbf{x}^{(i)}+\varepsilon^{(i)},\tag{1}$$

with some (unknown) weight vector $\overline{\mathbf{w}}^{(i)}$ for each node $i\in\mathcal{V}$ . The noise component $\varepsilon^{(i)}$ in (1) summarizes any labeling or modeling errors.

Thus, we assign to each data point an individual linear model (1). For high-dimensional data (large feature vector length $p$ ) this would result in overfitting unless we leverage the information contained in the network structure relating different data points. As we demonstrate theoretically and empirically, enforcing the (estimates of the) weight vectors $\overline{\mathbf{w}}^{(i)}$ to be similar for well-connected data points allows us to accurately learn the linear models (1) for the entire dataset.

We will apply nLasso to the available labels $y^{(i)}$ for the training set to obtain an estimate $\widehat{\mathbf{w}}^{(i)}$ for the weight vector $\mathbf{w}^{(i)}$ at each node $i\in\mathcal{V}$ . The estimates $\widehat{\mathbf{w}}^{(i)}$ define a predictor which maps the node $i\in\mathcal{V}$ to the predicted label

$$\hat{y}^{(i)}:=\big(\widehat{\mathbf{w}}^{(i)}\big)^{T}\mathbf{x}^{(i)}.\tag{2}$$

The predictions $\hat{y}^{(i)}$ will be accurate, i.e., the prediction error $\hat{y}^{(i)}-y^{(i)}$ will be small, if the estimation error $\overline{\mathbf{w}}^{(i)}\!-\!\widehat{\mathbf{w}}^{(i)}$ is small. Our main result (see Theorem 2) provides a sufficient condition on the structure of the empirical graph $\mathcal{G}$ and the training set $\mathcal{M}$ such that the estimation error is small.

We interpret the weight vectors $\mathbf{w}^{(i)}$ as the values of a graph signal $\mathbf{w}:\mathcal{V}\rightarrow\mathbb{R}^{p}$ which assigns node $i\!\in\!\mathcal{V}$ the vector $\mathbf{w}^{(i)}\!\in\!\mathbb{R}^{p}$ . The set of all vector-valued graph signals is denoted

$$\mathcal{W}:=\{\mathbf{w}:\mathcal{V}\rightarrow\mathbb{R}^{p}:i\mapsto\mathbf{w}^{(i)}\}.\tag{3}$$

Each graph signal $\widehat{\mathbf{w}}\in\mathcal{W}$ represents a predictor which maps a node with features $\mathbf{x}^{(i)}$ to the predicted label (2).

Given partially labeled networked data, we aim at learning a predictor $\widehat{\mathbf{w}}\in\mathcal{W}$ whose predictions (2) agree with the labels $y^{(i)}$ of labeled data points in the training set $\mathcal{M}$ . In particular, we aim at learning a predictor having a small training error

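$$\widehat{E}(\widehat{\mathbf{w}}):=\sum_{i\in\mathcal{M}}\big|y^{(i)}-\big(\widehat{\mathbf{w}}^{(i)}\big)^{T}\mathbf{x}^{(i)}\big|.\tag{4}$$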

We use the absolute value loss since it somewhat simplifies our analysis. However, we expect no big challenges in extending our analysis to nLasso using different loss functions, such as the squared error loss. The absolute value loss is actually preferred for learning linear regression models (1) when the noise $\varepsilon^{(i)}$ is expected to contain only a few large values, known as “salt and pepper” noise in image processing [17].

III Network Lasso

The criterion (4) by itself is not enough for guiding the learning of a predictor $\widehat{\mathbf{w}}$ since (4) completely ignores the weights $\widehat{\mathbf{w}}^{(i)}$ at unlabeled nodes $i\in\mathcal{V}\setminus\mathcal{M}$ . Therefore, we need to impose some additional structure on the predictor $\widehat{\mathbf{w}}$ . To this end, we require the predictor $\widehat{\mathbf{w}}$ to conform with the cluster structure of the empirical graph $\mathcal{G}$ [18, 19].

The extent to which a predictor $\widehat{\mathbf{w}}\!\in\!\mathcal{W}$ conforms with $\mathcal{G}$ can be measured by the total variation (TV)

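$$\|\mathbf{w}\|_{\rm TV}:=\sum_{\{i,j\}\in\mathcal{E}}A_{ij}\|\mathbf{w}^{(j)}-\mathbf{w}^{(i)}\|.\tag{5}$$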

If the weights $\mathbf{w}^{(i)}$ are approximately constant over well-connected subsets of nodes, the predictor $\mathbf{w}\!\in\!\mathcal{W}$ has small TV $\|\mathbf{w}\|_{\rm TV}$ . The restriction of (5) to a subset $\mathcal{S}\!\subseteq\!\mathcal{E}$ of edges is denoted $\|\mathbf{w}\|_{\mathcal{S}}\!:=\!\sum_{\{i,j\}\in\mathcal{S}}A_{ij}\|\mathbf{w}^{(j)}-\mathbf{w}^{(i)}\|$ .
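To make the TV concrete, the sketch below evaluates the weighted sum of edge-wise differences for a toy chain graph. The data layout (an `(n, p)` array of weight vectors, a list of edges with 0-based node indices, one weight per edge) and the function name are assumptions made for illustration.

```python
import numpy as np

def total_variation(w, edges, weights):
    """Sum over edges {i, j} of A_ij * ||w^(j) - w^(i)|| for a graph signal w of shape (n, p)."""
    return sum(a * np.linalg.norm(w[j] - w[i]) for (i, j), a in zip(edges, weights))

# Chain of four nodes; the signal is constant on {0, 1} and on {2, 3}.
edges = [(0, 1), (1, 2), (2, 3)]
weights = [1.0, 0.5, 1.0]
w = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

tv = total_variation(w, edges, weights)
```

Only the edge `(1, 2)` joining the two constant pieces contributes, so the TV equals its weight times the jump size, $0.5\cdot\sqrt{2}$. Passing a subset of the edges yields the restricted quantity $\|\mathbf{w}\|_{\mathcal{S}}$.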

We are led naturally to learning a predictor $\widehat{\mathbf{w}}$ via the regularized empirical risk minimization (ERM)

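$$\widehat{\mathbf{w}}\in\operatorname*{arg\,min}_{\mathbf{w}\in\mathcal{W}}\widehat{E}(\mathbf{w})+\lambda\|\mathbf{w}\|_{\rm TV},\tag{6}$$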

which is a special case of nLasso [13]. The parameter $\lambda>0$ in (6) allows us to trade a small TV $\|\widehat{\mathbf{w}}\|_{\rm TV}$ against a small error $\widehat{E}(\widehat{\mathbf{w}})$ (4). The choice of $\lambda$ can be guided by cross validation [20] or, alternatively, by our analysis of the nLasso estimation error (see discussion after Theorem 2).

Note that nLasso (6) does not enforce the labels $y^{(i)}$ themselves to be clustered. Instead, it requires the predictor $\widehat{\mathbf{w}}$ , which is used to obtain predictions (2), to be clustered.

It will be convenient to reformulate (6) using vector notation. To this end, we represent a graph signal $\mathbf{w}\in\mathcal{W}$ as the vector

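$$\mathbf{w}=\big(\big(\mathbf{w}^{(1)}\big)^{T},\ldots,\big(\mathbf{w}^{(n)}\big)^{T}\big)^{T}\in\mathbb{R}^{pn}\tag{7}$$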

and define the block matrix $\mathbf{D}\!\in\!\mathbb{R}^{pq\times pn}$ (with $q\!=\!|\mathcal{E}|$ )

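$$\mathbf{D}_{e,i}=\begin{cases}A_{ij}\mathbf{I}_{p}&e=\{i,j\}\in\mathcal{E},\ i<j\\-A_{ij}\mathbf{I}_{p}&e=\{i,j\}\in\mathcal{E},\ i>j\\\mathbf{0}&\mbox{otherwise.}\end{cases}\tag{8}$$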

Applying the matrix $\mathbf{D}$ to a graph signal vector $\mathbf{w}$ (7) results in a partitioned vector $\mathbf{D}\mathbf{w}$ whose $e$ th block is given by $A_{ij}(\mathbf{w}^{(i)}-\mathbf{w}^{(j)})$ (see (5)). Using (7) and (8), we can reformulate the nLasso (6) as

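$$\widehat{\mathbf{w}}\in\operatorname*{arg\,min}_{\mathbf{w}\in\mathbb{R}^{pn}}h(\mathbf{w})+g(\mathbf{D}\mathbf{w}).\tag{9}$$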

Here,

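$$h(\mathbf{w}):=\sum_{i\in\mathcal{M}}\big|y^{(i)}-\big(\mathbf{w}^{(i)}\big)^{T}\mathbf{x}^{(i)}\big|\mbox{ and }g(\mathbf{u}):=\lambda\sum_{e=1}^{q}\|\mathbf{u}^{(e)}\|\mbox{ with }\mathbf{u}=\big(\big(\mathbf{u}^{(1)}\big)^{T},\ldots,\big(\mathbf{u}^{(q)}\big)^{T}\big)^{T}\in\mathbb{R}^{pq}.\tag{10}$$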
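The block structure of $\mathbf{D}$ can be checked numerically. The sketch below builds $\mathbf{D}$ for a toy graph and confirms that summing the block norms of $\mathbf{D}\mathbf{w}$ reproduces the TV, so that $g(\mathbf{D}\mathbf{w})=\lambda\|\mathbf{w}\|_{\rm TV}$ when $g$ is $\lambda$ times the sum of block norms. Node indices are 0-based here (the paper counts from 1), and the function name and data layout are illustrative assumptions.

```python
import numpy as np

def block_incidence(n, p, edges, weights):
    """Build D of shape (p*q, p*n): row block e holds +A_ij I_p at the smaller
    node index of edge e and -A_ij I_p at the larger one."""
    q = len(edges)
    D = np.zeros((p * q, p * n))
    eye = np.eye(p)
    for e, ((i, j), a) in enumerate(zip(edges, weights)):
        lo, hi = min(i, j), max(i, j)
        D[e * p:(e + 1) * p, lo * p:(lo + 1) * p] = a * eye
        D[e * p:(e + 1) * p, hi * p:(hi + 1) * p] = -a * eye
    return D

edges = [(0, 1), (1, 2), (2, 3)]
weights = [1.0, 0.5, 1.0]
w = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])  # shape (n, p)
n, p = w.shape

D = block_incidence(n, p, edges, weights)
Dw_blocks = (D @ w.reshape(-1)).reshape(len(edges), p)   # e-th row is A_ij (w^(i) - w^(j))
tv_from_D = np.linalg.norm(Dw_blocks, axis=1).sum()
```

Flattening the `(n, p)` array row by row gives exactly the stacked vector of (7), which is why `w.reshape(-1)` can be multiplied by `D` directly.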

IV Primal-Dual Method

The nLasso (9) is a convex optimization problem with a non-smooth objective function which rules out the use of gradient descent methods [21]. However, the objective function is highly structured since it is the sum of two components $h(\mathbf{w})$ and $g(\mathbf{D}\mathbf{w})$ , which can be optimized efficiently when considered separately. Such composite functions can be optimized efficiently using proximal splitting methods [22, 23, 24].

We apply the proximal method proposed in [25] which is based on reformulating (9) as a saddle-point problem

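$$\min_{\mathbf{w}\in\mathbb{R}^{pn}}\max_{\mathbf{u}\in\mathbb{R}^{pq}}\mathbf{u}^{T}\mathbf{D}\mathbf{w}+h(\mathbf{w})-g^{*}(\mathbf{u}),\tag{11}$$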

with the convex conjugate $g^{*}$ of $g$ [24].

Solutions $(\widehat{\mathbf{w}},\widehat{\mathbf{u}})$ of (11) are characterized by [26, Thm 31.3]

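$$-\mathbf{D}^{T}\widehat{\mathbf{u}}\in\partial h(\widehat{\mathbf{w}})\mbox{, and }\mathbf{D}\widehat{\mathbf{w}}\in\partial g^{*}(\widehat{\mathbf{u}}).\tag{12}$$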

The coupled conditions (12) are, in turn, equivalent to

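$$\widehat{\mathbf{w}}-\mathbf{T}\mathbf{D}^{T}\widehat{\mathbf{u}}\in(\mathbf{I}+\mathbf{T}\partial h)(\widehat{\mathbf{w}})\mbox{, }\widehat{\mathbf{u}}+\boldsymbol{\Sigma}\mathbf{D}\widehat{\mathbf{w}}\in(\mathbf{I}+\boldsymbol{\Sigma}\partial g^{*})(\widehat{\mathbf{u}}),\tag{13}$$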

with positive definite matrices $\boldsymbol{\Sigma}\!\in\!\mathbb{R}^{pq\times pq},\mathbf{T}\!\in\!\mathbb{R}^{pn\times pn}$ . In principle, the matrices $\boldsymbol{\Sigma},\mathbf{T}$ in (13) can be chosen arbitrarily. It will prove convenient to choose them as

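$$\boldsymbol{\Sigma}={\rm diag}\{\sigma^{(e)}\mathbf{I}_{p}\}_{e=1}^{q}\mbox{ and }\mathbf{T}={\rm diag}\{\tau^{(i)}\mathbf{I}_{p}\}_{i=1}^{n}\tag{14}$$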

with scalars $\big{\{}\sigma^{(e)}\big{\}}_{e\!=\!1}^{q}$ and $\big{\{}\tau^{(i)}\big{\}}_{i\in\mathcal{V}}$ as specified below.

The optimality condition (13) for nLasso (9) lends itself naturally to the following coupled fixed point iterations [25]

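$$\mathbf{w}_{k+1}:=(\mathbf{I}+\mathbf{T}\partial h)^{-1}(\mathbf{w}_{k}-\mathbf{T}\mathbf{D}^{T}\mathbf{u}_{k}),\tag{15}$$

$$\mathbf{u}_{k+1}:=(\mathbf{I}+\boldsymbol{\Sigma}\partial g^{*})^{-1}(\mathbf{u}_{k}+\boldsymbol{\Sigma}\mathbf{D}(2\mathbf{w}_{k+1}-\mathbf{w}_{k})).\tag{16}$$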

The update (16) involves the resolvent operator

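$$(\mathbf{I}+\boldsymbol{\Sigma}\partial g^{*})^{-1}(\mathbf{u})=\operatorname*{arg\,min}_{\mathbf{u}'\in\mathbb{R}^{pq}}g^{*}(\mathbf{u}')+(1/2)\|\mathbf{u}'-\mathbf{u}\|^{2}_{\boldsymbol{\Sigma}^{-1}}.\tag{17}$$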

The convex conjugate $g^{*}$ of $g$ (see (10)) can be decomposed as $g^{*}(\mathbf{v})\!=\!\sum\limits_{e\!=\!1}^{q}g_{2}^{*}(\mathbf{v}^{(e)})$ with the convex conjugate $g_{2}^{*}$ of $g_{2}(\mathbf{z}):=\lambda\|\mathbf{z}\|$ . Combining the fact that $\boldsymbol{\Sigma}$ is a block diagonal matrix with the Moreau decomposition [27, Sec. 6.5], it can be shown that $\mathbf{c}=(\mathbf{I}_{pq}\!+\!\boldsymbol{\Sigma}\partial g^{*})^{-1}(\mathbf{u})$ (see (17)) with

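$$\mathbf{c}=\big(\big(\mathbf{c}^{(1)}\big)^{T},\ldots,\big(\mathbf{c}^{(q)}\big)^{T}\big)^{T}\mbox{, }\mathbf{c}^{(e)}:=\mathcal{T}^{(\lambda)}\big(\mathbf{u}^{(e)}\big).\tag{18}$$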

Like the update (16), the update (15) decomposes into independent updates of the weight vectors

[TABLE]

yielding the updated weight vectors $\mathbf{w}^{(i)}_{k+1}=\mathbf{v}^{(i)}$ for each node $i\in\mathcal{V}$ . In particular, for unlabeled nodes $i\notin\mathcal{M}$ , the update (15) reduces to $\mathbf{v}^{(i)}=\mathbf{w}^{(i)}$ . For labeled nodes $i\in\mathcal{M}$ , using elementary sub-gradient calculus, we obtain

[TABLE]

with $\tilde{y}:=y^{(i)}/\|\mathbf{x}^{(i)}\|^{2}$ and $\tilde{w}:=\big{(}\mathbf{w}^{(i)}\big{)}^{T}\mathbf{x}^{(i)}/\|\mathbf{x}^{(i)}\|^{2}$ . Inserting (19) and (18) into the fixed point iteration (15), (16) results in Alg. 1 for solving the nLasso (9).
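The pieces described above can be assembled into a short NumPy sketch of a primal-dual nLasso solver. It is not the paper's Alg. 1 verbatim: it assumes the standard diagonally preconditioned primal-dual iteration with the over-relaxation step $2\mathbf{w}_{k+1}-\mathbf{w}_{k}$ in the dual update, and it uses a closed-form labeled-node update derived here from the resolvent definition (moving along $\mathbf{x}^{(i)}$ until the label is fitted or the step length $\tau^{(i)}$ is used up). The function names, the 0-based node indices, and the defaults `n_iter` and `eta` are illustrative assumptions.

```python
import numpy as np

def prox_abs_loss(w, x, y, tau):
    """Minimise |y - x^T v| + ||v - w||^2 / (2 tau) over v (closed form)."""
    residual = y - x @ w
    step = np.clip(residual / (x @ x), -tau, tau)   # move along x, by at most tau
    return w + step * x

def nlasso_primal_dual(X, y, labeled, edges, weights, lam, n_iter=3000, eta=0.9):
    """Sketch of a preconditioned primal-dual iteration for the nLasso.

    X: (n, p) features; y: (n,) labels (only entries in `labeled` are read);
    edges: list of (i, j) with 0-based node indices; weights: A_ij per edge.
    Assumes every node has at least one edge and labeled features are non-zero.
    """
    n, p = X.shape
    A = np.asarray(weights, dtype=float)
    lo = np.array([min(i, j) for i, j in edges])
    hi = np.array([max(i, j) for i, j in edges])
    deg = np.zeros(n)
    np.add.at(deg, lo, A)
    np.add.at(deg, hi, A)
    tau = eta / deg            # primal step sizes tau^(i) = eta / d^(i)
    sigma = 1.0 / (2.0 * A)    # dual step sizes sigma^(e) = 1 / (2 A_e)

    w = np.zeros((n, p))
    u = np.zeros((len(edges), p))
    for _ in range(n_iter):
        # D^T u: edge e = {lo, hi} contributes +A_e u^(e) at lo and -A_e u^(e) at hi
        DTu = np.zeros((n, p))
        np.add.at(DTu, lo, A[:, None] * u)
        np.add.at(DTu, hi, -A[:, None] * u)
        v = w - tau[:, None] * DTu
        for i in labeled:      # unlabeled nodes keep v^(i) unchanged
            v[i] = prox_abs_loss(v[i], X[i], y[i], tau[i])
        w_relaxed = 2.0 * v - w
        w = v
        # dual ascent step, then block-wise clipping to norm <= lam
        z = u + (sigma * A)[:, None] * (w_relaxed[lo] - w_relaxed[hi])
        norms = np.maximum(np.linalg.norm(z, axis=1), 1e-12)
        u = z * np.minimum(1.0, lam / norms)[:, None]
    return w

# Toy check: complete graph on 4 nodes forming one cluster, two labeled nodes.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
weights = [1.0] * 6
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.8, -0.6]])
w_true = np.array([2.0, -1.0])
y = X @ w_true
labeled = [0, 1]
lam = 1.0
w_hat = nlasso_primal_dual(X, y, labeled, edges, weights, lam)
```

In this toy problem the constant signal equal to `w_true` attains objective value zero, so the iterates are expected to approach it at every node, including the two unlabeled ones.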

If the matrices $\boldsymbol{\Sigma}$ and $\mathbf{T}$ used in (15) and (16) satisfy

$$\|\boldsymbol{\Sigma}^{1/2}\mathbf{D}\mathbf{T}^{1/2}\|^{2}<1,\tag{20}$$

the sequences obtained from iterating (15) and (16) converge to a saddle point of the problem (11) [25, Thm. 1]. The condition (20) is ensured by choosing $\boldsymbol{\Sigma}$ and $\mathbf{T}$ according to (14) using $\sigma^{(e)}\!=\!1/(2A_{e})$ and $\tau^{(i)}\!:=\!\eta/d^{(i)}$ , with (weighted) node degree $d^{(i)}\!=\!\sum_{j\!\neq\!i}A_{i,j}$ and some constant $\eta\!<\!1$ [25, Lem. 2].
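The step-size rule can be sanity-checked numerically on a small graph. The sketch below builds $\mathbf{D}$ for $p=1$ (for $p>1$ the matrix is the Kronecker product of this one with $\mathbf{I}_{p}$, which has the same spectral norm), applies $\sigma^{(e)}=1/(2A_{e})$ and $\tau^{(i)}=\eta/d^{(i)}$, and evaluates the squared spectral norm appearing in (20). The graph, the edge weights and the value $\eta=0.9$ are arbitrary illustrative choices.

```python
import numpy as np

edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
weights = np.array([1.0, 2.0, 0.5, 1.5])
n, eta = 4, 0.9

# Incidence matrix for p = 1: +A_e at the smaller node index, -A_e at the larger.
D = np.zeros((len(edges), n))
for e, ((i, j), a) in enumerate(zip(edges, weights)):
    D[e, min(i, j)] = a
    D[e, max(i, j)] = -a

deg = np.abs(D).sum(axis=0)        # weighted node degrees d^(i)
sigma = 1.0 / (2.0 * weights)      # one dual step size per edge
tau = eta / deg                    # one primal step size per node

M = np.diag(np.sqrt(sigma)) @ D @ np.diag(np.sqrt(tau))
spec = np.linalg.norm(M, 2) ** 2   # squared largest singular value
```

With this choice the squared norm is at most $\eta$, which is why any $\eta<1$ is enough for the strict inequality in (20).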

Another instance of a proximal method is the alternating direction method of multipliers (ADMM) [27, 28], which has been applied to (a more general formulation of) the nLasso in [13]. In contrast to the primal-dual method used in Alg. 1, the ADMM implementation involves a tuning parameter. The optimum choice for this tuning parameter is non-trivial and typically requires a grid search [29]. However, we expect Alg. 1 and the ADMM implementation of [13] (when specialized to (4)) to have similar computational requirements.

V Error Analysis for nLasso

In order to analyze the statistical properties of Alg. 1 we need to understand the structure of the solutions to the nLasso problem (9). To this end, we will use a simple but useful model of piece-wise constant weight vectors

$$\overline{\mathbf{w}}^{(i)}=\sum_{l=1}^{F}\mathbf{a}^{(l)}\mathcal{I}_{\mathcal{C}^{(l)}}[i],\tag{21}$$

with fixed vectors $\mathbf{a}^{(l)}\in\mathbb{R}^{p}$ , for $l=1,\ldots,F$ , and the indicator function $\mathcal{I}_{\mathcal{C}}[i]\in\{0,1\}$ with $\mathcal{I}_{\mathcal{C}}[i]=1$ if and only if $i\in\mathcal{C}\subseteq\mathcal{V}$ . Here, we use a partition $\mathcal{F}=\{\mathcal{C}^{(1)},\ldots,\mathcal{C}^{(F)}\}$ of the nodes $\mathcal{V}$ in the empirical graph into disjoint subsets (clusters) $\mathcal{C}^{(l)}$ .

The model (21), which generalizes the piece-wise constant signal model (see [30, 31]), embodies a clustering assumption that well-connected nodes in the empirical graph should have similar relations between features and labels [19, 18].

Note that our analysis allows for an arbitrary choice of clusters $\mathcal{C}^{(l)}$ in (21). However, our results are most useful when the sets $\mathcal{C}^{(l)}$ reflect the intrinsic cluster structure of the empirical graph $\mathcal{G}$ such that the TV $\|\overline{\mathbf{w}}\|_{\rm TV}$ (see (5)) is small.

We now introduce the network compatibility condition (NCC), which generalizes the compatibility conditions for Lasso type estimators [32] of ordinary sparse signals. Our main contribution is to show that the NCC guarantees the accuracy of the nLasso (9) solutions, as obtained using Alg. 1.

Definition 1.

Consider a networked dataset with empirical graph $\mathcal{G}=(\mathcal{V},\mathcal{E},\mathbf{A})$ . The nodes are characterized by feature vectors $\mathbf{x}^{(i)}\in\mathbb{R}^{p}$ and grouped according to a fixed partition $\mathcal{F}=\{\mathcal{C}^{(1)},\ldots,\mathcal{C}^{(F)}\}$ . The labels $y^{(i)}$ of nodes are observed only on the training set $\mathcal{M}\subseteq\mathcal{V}$ . The training set is said to satisfy NCC, with constants $K,L>0$ , if

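$$K\sum_{i\in\mathcal{M}}\big|\big(\mathbf{x}^{(i)}\big)^{T}\mathbf{w}^{(i)}\big|+\|\mathbf{w}\|_{\overline{\partial\mathcal{F}}}\geq(L/\sqrt{p})\|\mathbf{w}\|_{\partial\mathcal{F}}\tag{22}$$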

for any graph signal $\mathbf{w}\in\mathcal{W}$ (see (3)).

We highlight that the NCC (constants) depend jointly on the training set $\mathcal{M}$ and the network structure of $\mathcal{G}$ . While enlarging the training set can only improve the NCC constants (smaller $K$ ), the precise quantification of this improvement is difficult.

As shown in [15, 33], the NCC is satisfied if there exists a sufficiently large network flow between sampled nodes. Thus, given a dataset with empirical graph $\mathcal{G}$ , the NCC can be verified using network flow algorithms (see Section VI and [34]).

Our main theoretical result is that if the sampling set satisfies the NCC (see Definition 1), any solution of (6) is close to the true underlying weight vectors (see (1), (21)).

Theorem 2.

Consider a partially labeled networked dataset with empirical graph $\mathcal{G}$ with features $\mathbf{x}^{(i)}$ known for all nodes and labels $y^{(i)}$ which are known only for the nodes $i\in\mathcal{M}$ . We assume a linear model (1) with true weights $\overline{\mathbf{w}}^{(i)}$ piece-wise constant (21). If the sampling set $\mathcal{M}$ satisfies NCC with parameters $L>\sqrt{p}$ and $K>0$ , then any solution $\widehat{\mathbf{w}}$ of nLasso (9) with the choice $\lambda:=1/K$ satisfies

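$$\|\widehat{\mathbf{w}}-\overline{\mathbf{w}}\|_{\rm TV}\leq K\big(1+4\sqrt{p}/(L-\sqrt{p})\big)\sum_{i\in\mathcal{M}}|\varepsilon^{(i)}|.\tag{23}$$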

According to Theorem 2, the choice for the nLasso parameter $\lambda$ in (9) can be based on the NCC constant $K$ (see (22)) via setting $\lambda\!=\!1/K$ . For this choice, given the training set $\mathcal{M}$ satisfies the NCC with parameters $K$ and $L$ , the nLasso error $\widehat{\mathbf{w}}\!-\!\overline{\mathbf{w}}$ is bounded according to (23).
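A small worked example shows how the NCC constants enter the bound. It takes the constant in (23) to be $K\big(1+4\sqrt{p}/(L-\sqrt{p})\big)$, the form consistent with the condition $L>\sqrt{p}$, and uses purely hypothetical values for $p$, $K$, $L$ and the total noise.

```latex
% Hypothetical values: p = 2, K = 7 (so \lambda = 1/K = 1/7), L = 3\sqrt{2},
% and total label noise \sum_{i \in \mathcal{M}} |\varepsilon^{(i)}| = 0.1.
\begin{align*}
\frac{4\sqrt{p}}{L-\sqrt{p}} &= \frac{4\sqrt{2}}{3\sqrt{2}-\sqrt{2}} = 2, \\
K\Big(1+\frac{4\sqrt{p}}{L-\sqrt{p}}\Big) &= 7\cdot 3 = 21, \\
\|\widehat{\mathbf{w}}-\overline{\mathbf{w}}\|_{\rm TV} &\leq 21\cdot 0.1 = 2.1 .
\end{align*}
```

As $L$ grows the factor approaches $K$, while it blows up as $L$ approaches $\sqrt{p}$ from above; for noiseless labels the bound is zero regardless of the constants.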

Note that the bound (23) explicitly involves neither the size $m\!=\!|\mathcal{M}|$ of the training set $\mathcal{M}$ nor the overall size $n$ of the empirical graph (or dataset). However, the relative size $m/n$ of the training set will influence the probability that the NCC is satisfied (such that the bound (23) applies at all).

We highlight that the nLasso (6) does not require the partition $\mathcal{F}$ used for our signal model (21). This partition is only used for the analysis of nLasso (6). Moreover, if the true underlying graph signal is of the form (21) and nLasso accurately learns this signal, we can obtain the partition $\mathcal{F}$ by thresholding the edge-wise differences $\|\mathbf{w}^{(i)}\!-\!\mathbf{w}^{(j)}\|$ for $\{i,j\}\!\in\!\mathcal{E}$ [35].
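The thresholding step mentioned above can be sketched as follows: edges whose endpoint weight vectors differ by more than a threshold are cut, and the connected components of the remaining graph are taken as clusters. The threshold value, the union-find helper and the function name are illustrative assumptions; the procedure in [35] may differ in its details.

```python
import numpy as np

def clusters_from_weights(w, edges, threshold):
    """Label nodes by connected components of the graph that keeps only edges
    {i, j} with ||w^(i) - w^(j)|| <= threshold."""
    n = w.shape[0]
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    for i, j in edges:
        if np.linalg.norm(w[i] - w[j]) <= threshold:
            parent[find(i)] = find(j)

    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]

edges = [(0, 1), (1, 2), (2, 3)]
w = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
labels = clusters_from_weights(w, edges, threshold=0.1)
```

For an exactly piece-wise constant signal any threshold below the smallest jump across a boundary edge separates the clusters; with noisy estimates the threshold has to be chosen relative to the estimation error.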

VI Numerical Experiments

In order to verify our theoretical findings (see Theorem 2), we have applied Alg. 1 to two particular datasets. The first dataset is synthetically generated based on an empirical graph which consists of two well-connected clusters. We also consider a dataset obtained from temperature measurements at various locations in Finland. The source code for our numerical experiments can be found at https://github.com/alexjungaalto/ResearchPublic/tree/master/LocalizedLinReg.

Two-Cluster Dataset. We generate the empirical graph $\mathcal{G}$ ( $n\!=\!80$ ) by sparsely connecting two random graphs $\mathcal{C}^{(1)}$ and $\mathcal{C}^{(2)}$ , each of size $n/2$ and with average degree $10$ . The nodes of $\mathcal{G}$ are assigned feature vectors $\mathbf{x}^{(i)}\in\mathbb{R}^{2}$ obtained by i.i.d. random vectors uniformly distributed on the unit sphere $\{\mathbf{x}\in\mathbb{R}^{2}:\|\mathbf{x}\|=1\}$ . The labels $y^{(i)}$ of the nodes $i\in\mathcal{V}$ are generated according to the linear model (1) with zero noise $\varepsilon^{(i)}=0$ and piecewise constant weight vectors $\mathbf{w}^{(i)}$ (see (21)). We assume that the labels $y^{(i)}$ are known for the nodes in the training set which includes three data points from each cluster, i.e., $|\mathcal{M}\cap\mathcal{C}^{(1)}|=|\mathcal{M}\cap\mathcal{C}^{(2)}|=3$ .
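A dataset of this kind can be generated along the following lines. The sketch fills in several details the text leaves open, all of which are assumptions: the random graphs are Erdős–Rényi graphs with edge probability chosen to give the stated average degree, the clusters are joined by a fixed number of random boundary edges, and the two cluster weight vectors in `a` are arbitrary. Edge weights are not specified here either; unit weights would be a natural choice.

```python
import numpy as np

def two_cluster_dataset(n=80, avg_degree=10, n_boundary=2, n_labeled=3, seed=0):
    rng = np.random.default_rng(seed)
    half = n // 2
    prob = avg_degree / (half - 1)          # Erdos-Renyi edge probability (assumed model)
    edges = set()
    for offset in (0, half):                # one random graph per cluster
        for i in range(offset, offset + half):
            for j in range(i + 1, offset + half):
                if rng.random() < prob:
                    edges.add((i, j))
    boundary = set()
    while len(boundary) < n_boundary:       # sparse links between the two clusters
        boundary.add((int(rng.integers(0, half)), int(rng.integers(half, n))))
    edges |= boundary

    angles = rng.uniform(0.0, 2.0 * np.pi, size=n)
    X = np.column_stack([np.cos(angles), np.sin(angles)])   # uniform on the unit circle
    a = np.array([[2.0, 2.0], [-2.0, 2.0]])  # hypothetical cluster weight vectors
    w_true = np.repeat(a, half, axis=0)      # piece-wise constant over the clusters
    y = np.sum(X * w_true, axis=1)           # noiseless linear model
    labeled = np.concatenate([
        rng.choice(half, size=n_labeled, replace=False),
        half + rng.choice(half, size=n_labeled, replace=False),
    ])
    return sorted(edges), sorted(boundary), X, y, w_true, labeled

edges, boundary, X, y, w_true, labeled = two_cluster_dataset()
```

Nodes `0` to `n/2 - 1` form the first cluster and the remaining nodes the second, so the boundary edges are exactly those with one endpoint in each half.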

Using [15, Lemma 6] it can be shown that the training set $\mathcal{M}$ satisfies NCC with $L\!>\!\sqrt{p}\!=\!\sqrt{2}$ if there exists a sufficiently large network flow between the labeled nodes $i\!\in\!\mathcal{C}^{(l)}\!\cap\!\mathcal{M}$ and the boundary edges $\partial:=\{\{i,j\}\in\mathcal{E}:i\in\mathcal{C}^{(1)},j\in\mathcal{C}^{(2)}\}$ between the two clusters. In particular, let $\rho^{(l)}$ denote the flow value between the labeled nodes in cluster $\mathcal{C}^{(l)}$ and the cluster boundary, normalized by the boundary size $|\partial|$ . The NCC is satisfied with $L\!>\!\sqrt{2}$ if $\rho^{(l)}\!>\!\sqrt{2}$ for $l\!=\!1,2$ .

In Fig. 1, we depict the normalized mean squared error (NMSE) $\varepsilon\!:=\!\|\overline{\mathbf{w}}\!-\!\widehat{\mathbf{w}}\|^{2}_{2}/\|\overline{\mathbf{w}}\|^{2}_{2}$ incurred by Alg. 1 (averaged over $10$ i.i.d. simulation runs) for varying connectivity, as measured by the empirical average $\bar{\rho}$ of $\rho^{(1)}$ and $\rho^{(2)}$ (having same distribution). Note that Fig. 1 agrees with Theorem 2 which predicts Alg. 1 is accurate if NCC holds ( $\bar{\rho}\!>\!\sqrt{2}$ ).
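The NMSE defined above is a one-liner in NumPy. The sketch below assumes both graph signals are stored as `(n, p)` arrays; the function name is an illustrative choice.

```python
import numpy as np

def nmse(w_hat, w_true):
    """Normalized squared error ||w_true - w_hat||^2 / ||w_true||^2 over all nodes."""
    w_hat = np.asarray(w_hat, dtype=float)
    w_true = np.asarray(w_true, dtype=float)
    return np.sum((w_true - w_hat) ** 2) / np.sum(w_true ** 2)

w_true = np.array([[2.0, 2.0], [2.0, 2.0], [-2.0, 2.0]])
err_perfect = nmse(w_true, w_true)
err_zero = nmse(np.zeros_like(w_true), w_true)
```

A value of 0 means exact recovery, and the all-zero estimate gives a value of 1, which is a convenient reference point when reading a plot of the error. Averaging over simulation runs amounts to taking the mean of this quantity across independently generated datasets.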

Weather Data. In this experiment, we consider a networked dataset whose empirical graph $\mathcal{G}$ represents Finnish weather stations (see Fig. 2), which are initially connected by an edge to their $K=3$ nearest neighbors. The feature vector $\mathbf{x}^{(i)}\!\in\!\mathbb{R}^{3}$ of node $i\!\in\!\mathcal{V}$ contains the local (daily mean) temperature for the preceding three days. The label $y^{(i)}\in\mathbb{R}$ is the current day-average temperature.
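The graph and feature construction described above can be sketched as follows. The sketch assumes that station positions are given as planar coordinates, that nearness is Euclidean distance, and that temperatures are stored as a `(stations, days)` array; none of these details are stated in the text, and the function names are illustrative.

```python
import numpy as np

def knn_edges(coords, k=3):
    """Undirected edge list connecting every node to its k nearest neighbours."""
    coords = np.asarray(coords, dtype=float)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)          # a node is not its own neighbour
    edges = set()
    for i in range(len(coords)):
        for j in np.argsort(dist[i])[:k]:
            edges.add((min(i, int(j)), max(i, int(j))))
    return sorted(edges)

def lag_features(temps, day, n_lags=3):
    """Features: daily means of the n_lags days before `day`; label: mean on `day`."""
    temps = np.asarray(temps, dtype=float)
    return temps[:, day - n_lags:day], temps[:, day]

coords = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 0.0], [11.0, 0.0]]
edges = knn_edges(coords, k=1)

temps = np.arange(10.0).reshape(2, 5)       # 2 stations, 5 days of made-up values
X, y = lag_features(temps, day=3)
```

Because the neighbour relation is symmetrized, a node can end up with more than `k` edges, which is one reason the text speaks of stations being "initially" connected to their nearest neighbours.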

We use Alg. 1 to learn the weight vectors $\mathbf{w}^{(i)}$ for a localized linear model (1). For the sake of illustration we focus on the weather stations in the capital region around Helsinki (indicated by a red cross in Fig. 2). These stations are represented by nodes $\mathcal{C}\!=\!\{23,18,22,15,12,13,9,7,5\}$ and we assume that labels $y^{(i)}$ are available for all nodes outside $\mathcal{C}$ and for the nodes $i\!\in\!\{12,13,15\}\!\subseteq\!\mathcal{C}$ . Thus, for more than half of the nodes in $\mathcal{C}$ we do not know the labels $y^{(i)}$ but predict them via (2) with the weight vectors $\widehat{\mathbf{w}}^{(i)}$ obtained from Alg. 1 (using $\lambda\!=1/7$ and a fixed number of $10^{4}$ iterations). The normalized average squared prediction error is $\approx 10^{-1}$ and only slightly larger than the prediction error incurred by fitting a single linear model to the cluster $\mathcal{C}$ using a least absolute deviation regression method [28, Sec. 6.1].

Acknowledgments

We thank Roope Tervo from the Finnish Meteorological Institute for helping with gathering the weather data.

VII Proof of Theorem 2

In order to prove Theorem 2, we consider an arbitrary but fixed nLasso solution $\widehat{\mathbf{w}}=\big{(}\big{(}\widehat{\mathbf{w}}^{(1)}\big{)}^{T},\ldots,\big{(}\widehat{\mathbf{w}}^{(n)}\big{)}^{T}\big{)}^{T}$ (see (9)) and denote the estimation error between $\widehat{\mathbf{w}}^{(i)}$ and the true underlying weights $\overline{\mathbf{w}}^{(i)}$ (see (1)) as $\widetilde{\mathbf{w}}^{(i)}:=\widehat{\mathbf{w}}^{(i)}-\overline{\mathbf{w}}^{(i)}$ .

By the definition of nLasso (6),

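$$\sum_{i\in\mathcal{M}}|\hat{y}^{(i)}-y^{(i)}|+\lambda\|\widehat{\mathbf{w}}\|_{\rm TV}\leq\sum_{i\in\mathcal{M}}|\varepsilon^{(i)}|+\lambda\|\overline{\mathbf{w}}\|_{\rm TV}.\tag{24}$$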

Since the true weight vectors $\overline{\mathbf{w}}^{(i)}$ are piece-wise constant (see (21)), $\|\overline{\mathbf{w}}\|_{\overline{\partial\mathcal{F}}}=0$ and $\|\widetilde{\mathbf{w}}\|_{\overline{\partial\mathcal{F}}}=\|\widehat{\mathbf{w}}\|_{\overline{\partial\mathcal{F}}}$ . Using the decomposition property and triangle inequality for the TV in (24),

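$$\sum_{i\in\mathcal{M}}|\hat{y}^{(i)}-y^{(i)}|+\lambda\|\widetilde{\mathbf{w}}\|_{\overline{\partial\mathcal{F}}}\leq\sum_{i\in\mathcal{M}}|\varepsilon^{(i)}|+\lambda\|\overline{\mathbf{w}}\|_{\partial\mathcal{F}}-\lambda\|\widehat{\mathbf{w}}\|_{\partial\mathcal{F}}$$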

and, in turn,

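$$\sum_{i\in\mathcal{M}}|\hat{y}^{(i)}-y^{(i)}|+\lambda\|\widetilde{\mathbf{w}}\|_{\overline{\partial\mathcal{F}}}\leq\sum_{i\in\mathcal{M}}|\varepsilon^{(i)}|+\lambda\|\widetilde{\mathbf{w}}\|_{\partial\mathcal{F}}.\tag{25}$$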

We conclude from (25) that

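$$\|\widetilde{\mathbf{w}}\|_{\overline{\partial\mathcal{F}}}\leq(1/\lambda)\sum_{i\in\mathcal{M}}|\varepsilon^{(i)}|+\|\widetilde{\mathbf{w}}\|_{\partial\mathcal{F}}.\tag{26}$$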

Thus, for small noise $\varepsilon^{(i)}$ (see (1)), the nLasso estimation error $\widetilde{\mathbf{w}}$ is piece-wise constant. However, it remains to control the size of the error, for which we will invoke the NCC (22).

We can develop the LHS of (25) as

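$$\sum_{i\in\mathcal{M}}|\hat{y}^{(i)}-y^{(i)}|=\sum_{i\in\mathcal{M}}\big|\big(\mathbf{x}^{(i)}\big)^{T}\widetilde{\mathbf{w}}^{(i)}-\varepsilon^{(i)}\big|\geq\sum_{i\in\mathcal{M}}\big|\big(\mathbf{x}^{(i)}\big)^{T}\widetilde{\mathbf{w}}^{(i)}\big|-\sum_{i\in\mathcal{M}}|\varepsilon^{(i)}|,\tag{27}$$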

where we have used the triangle inequality in the last step. Combining (27) with (25),

[TABLE]

Since we assume NCC holds for $\mathcal{M}$ , (22) yields

[TABLE]

Inserting (29) into (28) and using $\lambda:=1/K$ , yields

[TABLE]

Combining (26) with (30) yields

[TABLE]
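The chain of inequalities in this proof can be retraced in a few lines. What follows is a sketch under stated assumptions, not a verbatim copy of the paper's displays: it reads $\partial\mathcal{F}$ as the set of boundary edges joining different clusters of $\mathcal{F}$ and $\overline{\partial\mathcal{F}}$ as the remaining intra-cluster edges, so that $\|\cdot\|_{\rm TV}=\|\cdot\|_{\partial\mathcal{F}}+\|\cdot\|_{\overline{\partial\mathcal{F}}}$, and it introduces the shorthand $S$ for the total label noise.

```latex
% Shorthand (not from the paper): S := \sum_{i \in \mathcal{M}} |\varepsilon^{(i)}|, and \lambda = 1/K.
% Starting point, (25):
%   \sum_{i \in \mathcal{M}} |\hat{y}^{(i)} - y^{(i)}| + \lambda \|\widetilde{w}\|_{\overline{\partial F}}
%     \leq S + \lambda \|\widetilde{w}\|_{\partial F}.
\begin{align*}
% (a) drop the non-negative training error in (25) and divide by \lambda = 1/K
\|\widetilde{\mathbf{w}}\|_{\overline{\partial\mathcal{F}}}
  &\leq K S + \|\widetilde{\mathbf{w}}\|_{\partial\mathcal{F}}, \\
% (b) lower-bound the training error by the triangle inequality, insert into (25), multiply by K
K\sum_{i\in\mathcal{M}}\big|\big(\mathbf{x}^{(i)}\big)^{T}\widetilde{\mathbf{w}}^{(i)}\big|
  + \|\widetilde{\mathbf{w}}\|_{\overline{\partial\mathcal{F}}}
  &\leq 2 K S + \|\widetilde{\mathbf{w}}\|_{\partial\mathcal{F}}, \\
% (c) the NCC (22), applied to \widetilde{w}, lower-bounds the left-hand side of (b)
(L/\sqrt{p})\,\|\widetilde{\mathbf{w}}\|_{\partial\mathcal{F}}
  \leq 2 K S + \|\widetilde{\mathbf{w}}\|_{\partial\mathcal{F}}
  \;&\Longrightarrow\;
  \|\widetilde{\mathbf{w}}\|_{\partial\mathcal{F}} \leq \frac{2\sqrt{p}}{L-\sqrt{p}}\,K S, \\
% (d) split the TV into boundary and intra-cluster parts, then use (a) and (c)
\|\widetilde{\mathbf{w}}\|_{\rm TV}
  = \|\widetilde{\mathbf{w}}\|_{\partial\mathcal{F}} + \|\widetilde{\mathbf{w}}\|_{\overline{\partial\mathcal{F}}}
  \leq K S + 2\|\widetilde{\mathbf{w}}\|_{\partial\mathcal{F}}
  &\leq K\Big(1+\frac{4\sqrt{p}}{L-\sqrt{p}}\Big) S .
\end{align*}
```

Step (c) is where the requirement $L>\sqrt{p}$ is used: it makes the factor $L/\sqrt{p}-1$ positive so that the boundary term can be isolated. The last line is the bound claimed in Theorem 2.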

Bibliography

[1] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, “The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains,” IEEE Signal Processing Magazine, vol. 30, no. 3, pp. 83–98, May 2013.
[2] S. G. Mallat, A Wavelet Tour of Signal Processing – The Sparse Way, 3rd ed. San Diego, CA: Academic Press, 2009.
[3] A. V. Oppenheim, R. W. Schafer, and J. R. Buck, Discrete-Time Signal Processing, 2nd ed. Englewood Cliffs, NJ: Prentice Hall, 1998.
[4] L. F. O. Chamon and A. Ribeiro, “Greedy sampling of graph signals,” vol. 66, no. 1, pp. 34–47, 2018.
[5] D. Vrandečić and M. Krötzsch, “Wikidata: A free collaborative knowledgebase,” Commun. ACM, vol. 57, no. 10, pp. 78–85, Sep. 2014.
[6] A. Sadeghi, C. Lange, M. Vidal, and S. Auer, “Communication metadata using knowledge graphs,” in Lecture Notes in Computer Science. Springer, 2017.
[7] N. Q. Tran and A. Jung, “Learning conditional independence structure for high-dimensional uncorrelated vector processes,” New Orleans (LA), 2017, pp. 5920–5924.
[8] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques, ser. Adaptive computation and machine learning. MIT Press, 2009.
