Structure Learning of Sparse GGMs over Multiple Access Networks

Mostafa Tavassolipour, Armin Karamzade, Reza Mirzaeifard, Seyed Abolfazl Motahari, and Mohammad-Taghi Manzuri Shalmani

arXiv:1812.10437·cs.LG·December 27, 2018


TL;DR

This paper investigates structure learning of sparse Gaussian Graphical Models over distributed data in a wireless network, proposing two communication strategies and analyzing their theoretical and empirical performance.

Contribution

It introduces two novel methods, Signs and Uncoded, for reliable GGM structure learning over noisy wireless channels with limited power and bandwidth.

Findings

1. Both methods can recover the structure with high probability given sufficient samples.

2. The Signs method outperforms the Uncoded method in various scenarios.

3. Theoretical analysis confirms the effectiveness of the proposed strategies.


Equations

$$f(\mathbf{x};\Theta)=\frac{1}{\sqrt{\det\left(2\pi\Theta^{-1}\right)}}\exp\left\{-\frac{1}{2}\mathbf{x}^{T}\Theta\mathbf{x}\right\}.$$

$$g(\Theta)=\mathrm{tr}(\Theta S_{x})-\log\det(\Theta),$$

$$\widehat{\Theta}_{x}=\arg\min_{\Theta\succ 0}\left\{g(\Theta)+\lambda_{n}\|\Theta\|_{1,\mathrm{off}}\right\},$$

$$\Gamma=\nabla_{\Theta}^{2}g(\Theta)\Big|_{\Theta=\Theta_{x}}=\Theta_{x}^{-1}\otimes\Theta_{x}^{-1}.$$

$$S(\Theta_{x})=\left\{(j,k)\in\mathcal{V}^{2}\mid(\Theta_{x})_{jk}\neq 0\right\}.$$

$$|\!|\!|X|\!|\!|_{\infty}=\max_{i=1,\dots,d}\sum_{j=1}^{d}|X_{ij}|.$$

$$|\!|\!|\Gamma_{S^{c}S}(\Gamma_{SS})^{-1}|\!|\!|_{\infty}\leq(1-\alpha).$$

$$|\!|\!|Q_{x}|\!|\!|_{\infty}$$

$$|\!|\!|\Gamma_{SS}^{-1}|\!|\!|_{\infty}$$

$$\frac{1}{n}\sum_{i=1}^{n}\left(s_{j}^{(i)}\right)^{2}\leq p,$$

$$\mathbf{y}=H\mathbf{s}+\mathbf{z},$$

$$\sum_{k\in S}R_{k}\leq\lg\det\left(\frac{p}{\sigma_{z}^{2}}H_{S}^{H}H_{S}+I_{|S|}\right),\quad\forall S\subseteq\{1,\dots,d\},$$

$$\begin{array}{c|cc}\hat{x}_{j}\backslash\hat{x}_{k}&-1&+1\\ \hline -1&\beta_{jk}/2&(1-\beta_{jk})/2\\ +1&(1-\beta_{jk})/2&\beta_{jk}/2\end{array}$$

$$\beta_{jk}=\frac{1}{2}+\frac{\arcsin(\rho_{jk})}{\pi}.$$

$$\rho_{jk}=\sin\left(\pi\left(\beta_{jk}-\frac{1}{2}\right)\right)=-\cos(\pi\beta_{jk}).$$

$$\hat{\beta}_{jk}=\frac{1}{n}\sum_{i=1}^{n}\mathbb{I}\left(\hat{x}_{j}^{(i)}\hat{x}_{k}^{(i)}=1\right),$$

$$\hat{\rho}_{jk}=-\cos(\pi\hat{\beta}_{jk}),$$

$$\Pr\left(|\hat{\rho}_{jk}-\rho_{jk}|\geq\delta\right)\leq 2\exp\left(-\frac{2}{\pi^{2}}n\delta^{2}\right),$$

$$\Pr\left(|\hat{\rho}_{jk}-\rho_{jk}|\geq\delta\right)\leq\Pr\left(\pi|\hat{\beta}_{jk}-\beta_{jk}|\geq\delta\right)=\Pr\left(|\hat{\beta}_{jk}-\beta_{jk}|\geq\frac{\delta}{\pi}\right).$$

$$\Pr\left(|\hat{\beta}_{jk}-\beta_{jk}|\geq\frac{\delta}{\pi}\right)$$

$$\widehat{S}_{x}=-\cos(\pi B),$$

$$\theta_{\mathrm{min}}=\min_{(i,j)\in\mathcal{E}(\Theta_{x})}|(\Theta_{x})_{ij}|.$$

$$n>C_{\mathrm{sign}}^{2}\,\Delta^{2}\left(1+\frac{8}{\alpha}\right)^{2}\ln\frac{2}{\epsilon},$$

$$C_{\mathrm{sign}}=32\pi\max\left\{\kappa_{\Sigma}\kappa_{\Gamma},\kappa_{\Sigma}^{3}\kappa_{\Gamma}^{2}\right\},$$

$$n>T_{\mathrm{sign}}^{2}\left(1+\frac{8}{\alpha}\right)^{2}\ln\frac{2}{\epsilon},$$

$$T_{\mathrm{sign}}=2\pi\max\left\{\kappa_{\Gamma}\theta_{\mathrm{min}}^{-1},\,3\Delta\max\left\{\kappa_{\Sigma}\kappa_{\Gamma},\kappa_{\Sigma}^{3}\kappa_{\Gamma}^{2}\right\}\right\},$$

$$\Pr\left(\mathcal{M}(\Theta_{x};\widehat{\Theta}_{x})\right)\geq 1-d^{2}\epsilon.$$

$$\mathbf{y}=(H_{R}+jH_{I})(\mathbf{s}_{R}+j\mathbf{s}_{I})+(\mathbf{z}_{R}+j\mathbf{z}_{I}),$$

$$\begin{bmatrix}\mathbf{y}_{R}\\ \mathbf{y}_{I}\end{bmatrix}=\begin{bmatrix}H_{R}&-H_{I}\\ H_{I}&H_{R}\end{bmatrix}\begin{bmatrix}\mathbf{s}_{R}\\ \mathbf{s}_{I}\end{bmatrix}+\begin{bmatrix}\mathbf{z}_{R}\\ \mathbf{z}_{I}\end{bmatrix},$$

$$Q_{\tilde{x}}=\begin{bmatrix}Q_{x}&0\\ 0&Q_{x}\end{bmatrix}.$$

Code & Models

Repositories

ArminKaramzade/distributed-sparse-GGM (official)


Full text

Structure Learning of Sparse GGMs over Multiple Access Networks

Mostafa Tavassolipour, Armin Karamzade, Reza Mirzaeifard, Seyed Abolfazl Motahari, and Mohammad-Taghi Manzuri Shalmani

Abstract

A central machine is interested in estimating the underlying structure of a sparse Gaussian Graphical Model (GGM) from datasets distributed across multiple local machines. The local machines can communicate with the central machine through a wireless multiple access channel. In this paper, we are interested in designing effective strategies where reliable learning is feasible under power and bandwidth limitations. Two approaches are proposed: Signs and Uncoded methods. In Signs method, the local machines quantize their data into binary vectors and an optimal channel coding scheme is used to reliably send the vectors to the central machine where the structure is learned from the received data. In Uncoded method, data symbols are scaled and transmitted through the channel. The central machine uses the received noisy symbols to recover the structure. Theoretical results show that both methods can recover the structure with high probability for large enough sample size. Experimental results indicate the superiority of Signs method over Uncoded method under several circumstances.

Index Terms:

Structure learning, Gaussian graphical model, distributed learning

I Introduction

In recent years, with the explosion in the volume of training data, distributed machine learning has become more important than ever. Many modern large datasets are distributed over several hosting machines that are connected to each other via communication links. In such systems, distributed learning algorithms must be designed carefully so that they exploit the available resources efficiently, taking into account both the computational workload and the amount of data communicated.

This paper focuses on the problem of structure learning in Gaussian Graphical Models (GGMs) in distributed environments. A GGM for a $d$ -dimensional random vector $\mathbf{x}=(x_{1},\cdots,x_{d})^{T}\in\mathbb{R}^{d}$ is specified by a graph $\mathcal{G}(\mathcal{V},\mathcal{E})$ where $\mathcal{V}=\{1,\cdots,d\}$ is the set of vertices and $\mathcal{E}\subseteq\mathcal{V}^{2}$ is the set of edges. The model comprises all $d$ -dimensional normal distributions $\mathcal{N}(\mu,\Theta^{-1})$ where $\Theta$ is the precision matrix with $\Theta_{jk}\neq 0$ iff $(j,k)\in\mathcal{E}$ . It is worth mentioning that a GGM is indeed a Markov Random Field (MRF).

In our problem setting, we assume that the data are distributed over multiple local machines so that each one contains a single dimension of the whole dataset. The local machines are connected to a central machine via a wireless medium. The central machine is responsible for inferring the conditional dependencies between the gathered data by the local machines. The system’s block diagram is depicted in Figure 1. We also assume that the communication links between the local and the central machine are bandwidth limited implying that transmission of the whole local datasets to the central machine is impossible. Due to this constraint, each local machine transmits some information from its local dataset to the central machine. Then, the central machine estimates the underlying graph structure using received information from the local machines.

In this paper, we consider a wireless multiple access channel between the local machines and the central node. Each local machine is equipped with a single antenna while the central node is equipped with multiple antennas. Hence, the overall channel between the local and central machines is modeled as a single-input multiple-output multiple access channel (SIMO-MAC).

We propose two communication schemes for transmitting information from the local machines to the central machine. In the first scheme, we separate the source coding from the channel coding: we quantize the source samples into single bits and assume that there exists a channel code such that the bits can be sent through the channel reliably. We refer to this scheme as the Signs method.

In the second scheme, we do not use any source and channel coding. In this scheme, we put the source samples into the channel without any encoding. At the central machine, we directly estimate the underlying graph structure using received data from the channel. We refer to this scheme as Uncoded method.

We show through theoretical analysis and experiments that, by transmitting only 1 bit per sample, the central machine can reliably recover the underlying graph structure. More precisely, we show theoretically that with only 1 bit per sample, under some mild conditions, the central machine can perfectly recover the graph structure with high probability. Moreover, the true signs of the edge weights of the graph are obtained.

The paper is organized as follows. In Section II, we provide a brief review of structure learning of GGMs. Section III describes the details of our models for the source and the communication channel. In Sections IV and V, we describe the Signs and Uncoded methods, respectively. Section VI provides experimental results to compare and evaluate the proposed methods. Finally, Section VII concludes the paper.

II Related Work

The problem of structure learning of GGMs from data samples has applications in many fields including biology and social networks. For example, it has been used for gene regulatory network reconstruction in [1, 2] and for the analysis of user relationships in social networks in [3]. Many studies address this problem from various perspectives [4, 5, 6].

The Chow-Liu algorithm obtains the maximum likelihood estimate of the structure if the underlying graph is a tree [7]. Although this algorithm is applicable for discrete random variables, it can be used for tree structured GGMs in a similar manner [8]. Tavassolipour et al. [9] proposed a distributed version of the Chow-Liu algorithm and proved it can recover the underlying tree structure with high probability. Tan et al. in [10] and [11] provided an analysis of the error exponent of the Chow-Liu algorithm on tree-structured GGMs.

GGMs have the property that the neighbors of each variable can be obtained by solving a linear regression problem for the corresponding variable on the other variables. This approach is referred to as neighborhood selection in the literature. For sparse structures, some methods penalize the linear regression problem with the $\ell_{1}$ norm of the coefficient vector [12, 13].

Among the proposed methods for sparse structure estimation of GGMs, the $\ell_{1}$ -regularized maximum likelihood approaches are the most popular [6, 12, 14]. This class of models has been analyzed in several articles (see [4] and references therein). For instance, Ravikumar et al. [4] analyzed the performance of the $\ell_{1}$ -regularized maximum likelihood estimator (MLE) under high-dimensional scaling. They showed that, with probability converging to one, the estimated structure correctly specifies the zero pattern of the true precision matrix. A similar study was conducted in [15], which analyzed the consistency of the $\ell_{1}$ -regularized MLE in the Frobenius norm.

Besides the lasso-type estimators, thresholding-based estimators have been proposed for sparse recovery of the precision matrix [5, 8]. For example, Sojoudi [5] proposed a simple thresholding method and showed that, under certain conditions, the resulting structure is identical to the structure obtained by the lasso.

In the distributed setting, several studies address the problem of covariance/precision matrix estimation [16, 17, 18]. Arroyo and Hou [18] studied the problem of sparse precision matrix estimation in the situation where the samples are distributed among several machines. Their work differs from our setting in that we assume the data are split across dimensions, whereas they split the data across samples. Meng et al. [16] addressed estimation of the precision matrix in a distributed manner where the zero pattern of the precision matrix is known in advance.

III Problem Formulation

We are given $n$ i.i.d. random vectors drawn from a $d$ -dimensional zero mean normal distribution $\mathcal{N}(\mathbf{0},Q_{x})$ with $(Q_{x})_{jj}=1$ . The focus of this paper is the problem of estimating the zero pattern of the sparse precision matrix $\Theta_{x}=Q_{x}^{-1}$ in a situation where the data is stored in $d$ separate local machines such that each machine possesses one dimension of the sample vectors.

Denoting the whole gathered data by $\{\mathbf{x}^{(1)},\cdots,\mathbf{x}^{(n)}\}$ , the $j$ th machine captures the $j$ -th dimension of the sample vectors. We denote the $j$ -th dimension of the $i$ -th sample by $x_{j}^{(i)}$ , i.e. $\mathbf{x}^{(i)}=[x^{(i)}_{1},\cdots,x^{(i)}_{d}]^{T}$ . Hence, the local data at the $j$ -th local machine is $\{x_{j}^{(1)},\cdots,x_{j}^{(n)}\}$ .

The probability density function of the normal distribution is given by

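$$f(\mathbf{x};\Theta)=\frac{1}{\sqrt{\det\left(2\pi\Theta^{-1}\right)}}\exp\left\{-\frac{1}{2}\mathbf{x}^{T}\Theta\mathbf{x}\right\}. \tag{1}$$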

The negative log-likelihood of $n$ i.i.d. samples is given by

$$g(\Theta)=\mathrm{tr}(\Theta S_{x})-\log\det(\Theta), \tag{2}$$

where $S_{x}=\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}^{(i)}\left(\mathbf{x}^{(i)}\right)^{T}$ is referred to as the sample covariance matrix. One of the well-known methods for estimating the sparse precision matrix is to solve an $\ell_{1}$ -regularized maximum likelihood problem stated as

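$$\widehat{\Theta}_{x}=\arg\min_{\Theta\succ 0}\left\{g(\Theta)+\lambda_{n}\|\Theta\|_{1,\mathrm{off}}\right\}, \tag{3}$$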

where $\|\Theta\|_{1,\mathrm{off}}:=\sum_{j\neq k}|\Theta_{jk}|$ , and $\lambda_{n}$ is a user defined regularization parameter. There is an efficient algorithm known as glasso [19] for solving the above log-determinant program.
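As a minimal sketch, the penalized objective in (3) can be evaluated for a candidate precision matrix as follows. The function names `cholesky_logdet` and `penalized_objective` are hypothetical helpers introduced here for illustration; they are not part of the glasso package, and the sketch only evaluates the objective rather than minimizing it.

```python
import math


def cholesky_logdet(theta):
    """Return log det(theta) for a symmetric positive definite matrix.

    theta is a list of lists. A ValueError is raised when the Cholesky
    factorization fails, i.e. when theta is not positive definite.
    """
    d = len(theta)
    lower = [[0.0] * d for _ in range(d)]
    logdet = 0.0
    for i in range(d):
        for j in range(i + 1):
            acc = theta[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                if acc <= 0.0:
                    raise ValueError("theta is not positive definite")
                lower[i][j] = math.sqrt(acc)
                logdet += 2.0 * math.log(lower[i][j])
            else:
                lower[i][j] = acc / lower[j][j]
    return logdet


def penalized_objective(theta, s_x, lam):
    """Evaluate tr(theta S_x) - log det(theta) + lam * ||theta||_{1,off}."""
    d = len(theta)
    trace_term = sum(theta[i][k] * s_x[k][i] for i in range(d) for k in range(d))
    l1_off = sum(abs(theta[j][k]) for j in range(d) for k in range(d) if j != k)
    return trace_term - cholesky_logdet(theta) + lam * l1_off
```

With `theta` and `s_x` both equal to the identity, the trace term equals $d$ and the other two terms vanish, which gives a quick sanity check of the implementation.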

In [4], it is stated that the Hessian matrix of (2) is given by

$$\Gamma=\nabla_{\Theta}^{2}g(\Theta)\Big|_{\Theta=\Theta_{x}}=\Theta_{x}^{-1}\otimes\Theta_{x}^{-1},$$

where $\otimes$ is the Kronecker matrix product. The entry $\Gamma_{(j,k),(l,m)}$ corresponds to the second derivative $\frac{\partial^{2}g}{\partial\Theta_{jk}\partial\Theta_{lm}}$ , evaluated at $\Theta=\Theta_{x}$ and $\Gamma_{(j,k),(l,m)}=\mathrm{cov}\{x_{j}x_{k},x_{l}x_{m}\}$ .

We define the set of non-zero entries in the true precision matrix $\Theta_{x}$ as

$$S(\Theta_{x})=\left\{(j,k)\in\mathcal{V}^{2}\mid(\Theta_{x})_{jk}\neq 0\right\}.$$

Recall that $\mathcal{V}=\{1,\cdots,d\}$ . Note that $S(\Theta_{x})$ includes the diagonal entries. Let $S^{c}(\Theta_{x})$ be the complement set of $S(\Theta_{x})$ which includes all pairs $(j,k)$ where $(\Theta_{x})_{jk}=0$ . For any two subsets $T$ and $T^{\prime}$ of $V\times V$ , the notation $\Gamma_{TT^{\prime}}$ denotes a $|T|\times|T^{\prime}|$ sub-matrix of $\Gamma$ with rows and columns indexed by $T$ and $T^{\prime}$ , respectively.

We adopt the incoherence condition used by Ravikumar et al. [4] to obtain error bounds on consistency of the solutions. We define max-row-sum norm of a $d$ by $d$ matrix $X$ as

$$|\!|\!|X|\!|\!|_{\infty}=\max_{i=1,\dots,d}\sum_{j=1}^{d}|X_{ij}|.$$

Assumption 1.

*(Incoherence Condition [4])

There exists some $\alpha\in(0,1]$ such that*

$$|\!|\!|\Gamma_{S^{c}S}(\Gamma_{SS})^{-1}|\!|\!|_{\infty}\leq(1-\alpha).$$

An implication of the above condition is that the non-edge pairs cannot have strong influence on the edges.

Assumption 2.

*(Covariance Control [4])

There exist constants $\kappa_{\Sigma},\kappa_{\Gamma}<\infty$ such that*

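$$|\!|\!|Q_{x}|\!|\!|_{\infty}\leq\kappa_{\Sigma},\qquad|\!|\!|\Gamma_{SS}^{-1}|\!|\!|_{\infty}\leq\kappa_{\Gamma}.$$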

The above assumptions imply that the covariance elements along any row of $Q_{x}$ and $\Gamma_{SS}^{-1}$ have bounded $\ell_{1}$ norms.
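The max-row-sum norm that appears in Assumptions 1 and 2 is straightforward to compute. The helper `max_row_sum_norm` below is a hypothetical name used only for this sketch; applied to a covariance matrix $Q_{x}$ it returns the smallest value that $\kappa_{\Sigma}$ could take.

```python
def max_row_sum_norm(x):
    """Return the maximum over rows of the sum of absolute entries."""
    return max(sum(abs(value) for value in row) for row in x)


# Illustrative covariance of a 3-node star with unit variances.
q_x = [
    [1.00, 0.25, 0.25],
    [0.25, 1.00, 0.00],
    [0.25, 0.00, 1.00],
]
kappa_sigma_lower = max_row_sum_norm(q_x)  # row 0 dominates: 1 + 0.25 + 0.25
```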

Having no access to the original data, the central machine finds the solution of the optimization problem (3) using received data from the local machines. The likelihood function $g(\Theta)$ in (2) depends on the samples via the sample covariance matrix $S_{x}$ . Thus, obtaining an appropriate approximate of the sample covariance matrix at the central machine would result in a good estimate of the underlying structure.

III-A Channel Model

There is a wireless channel between the local machines and the central node. In this paper, we assume that the channel can be modeled as a single-input multiple-output Gaussian multiple access channel (SIMO-GMAC) with additive white noise. The number of antennas at the receiver is assumed to be $m$ . All transmitters have equal transmit power which is denoted by $p$ . This implies, for all $j$ , the following constraint on the transmit symbols should be satisfied:

$$\frac{1}{n}\sum_{i=1}^{n}\left(s_{j}^{(i)}\right)^{2}\leq p,$$

where $s_{j}^{(i)}$ ’s are the channel inputs at the local machine $j$ . Denoting the transmit symbols by vector $\mathbf{s}$ , the channel output is modeled by

$$\mathbf{y}=H\mathbf{s}+\mathbf{z}, \tag{11}$$

where $H\in\mathbb{C}^{m\times d}$ is assumed to be an invertible complex matrix. In fading environments, the channel gains are drawn independently from a circularly symmetric complex Gaussian distribution. The additive noise $\mathbf{z}$ is an independent circularly symmetric Gaussian vector with covariance matrix $\sigma_{z}^{2}I_{d}$ .

IV Signs Method

We assume that each local machine applies a sign function on its local dataset to obtain binary data. More precisely, given samples $\{x_{j}^{(1)},\cdots,x_{j}^{(n)}\}$ at local machine $j$ , it obtains the signs dataset $\{\hat{x}_{j}^{(1)},\cdots,\hat{x}_{j}^{(n)}\}$ where $\hat{x}_{j}=\mathrm{sign}(x_{j})$ . Then, it transmits the binary data to the central machine with the rate of $1$ bit per sample using a channel encoder and decoder.

Denoting the bit rate of the channel at local machine $j$ by $R_{j}$ , the achievable bit rates for all machines are characterized by [20]

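$$\sum_{k\in S}R_{k}\leq\lg\det\left(\frac{p}{\sigma_{z}^{2}}H_{S}^{H}H_{S}+I_{|S|}\right),\quad\forall S\subseteq\{1,\dots,d\}, \tag{12}$$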

where $H_{S}$ is the sub-matrix of $H$ that includes the rows and columns indexed by $S$ , $H^{H}$ is the Hermitian of $H$ , $I_{|S|}$ is the $|S|\times|S|$ identity matrix, and $\lg(\cdot)$ is the logarithm function in base 2. Throughout this section, we assume that $R_{j}\geq 1$ , for all $j=1,\cdots,d$ . Therefore, there exists a channel encoding to transmit 1 bit per sample at each local machine.

At the central machine, our goal is to solve the optimization problem (3) on the binary data received from all local machines. In order to obtain a solution that is close to the one obtained from the original data, we seek a suitable approximation for the sample covariance matrix $S_{x}$ in (3). Thus, our goal is to estimate the sample covariance matrix as accurately as possible using the received sign data. In this section, we propose an estimator for the sample covariance matrix and show theoretically that its error probability decays exponentially in the sample size $n$ .

Let $x_{j}$ and $x_{k}$ be jointly normal with zero means, unit variances and correlation coefficient $\rho_{jk}$ . If $\hat{x}_{j}$ and $\hat{x}_{k}$ are the corresponding sign variables, then the joint probability mass function (pmf) of $\hat{x}_{j}$ and $\hat{x}_{k}$ is given by [21]

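$$\begin{array}{c|cc}\hat{x}_{j}\backslash\hat{x}_{k}&-1&+1\\ \hline -1&\beta_{jk}/2&(1-\beta_{jk})/2\\ +1&(1-\beta_{jk})/2&\beta_{jk}/2\end{array} \tag{13}$$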

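where $\beta_{jk}\in[0,1]$ is the probability that the two signs agree, given by

$$\beta_{jk}=\frac{1}{2}+\frac{\arcsin(\rho_{jk})}{\pi}. \tag{14}$$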

Equation (14) can be rewritten as

$$\rho_{jk}=\sin\left(\pi\left(\beta_{jk}-\frac{1}{2}\right)\right)=-\cos(\pi\beta_{jk}). \tag{15}$$

Thus, by proposing an estimator for $\beta_{jk}$ , using (15) we can obtain an estimator for $\rho_{jk}$ . The following estimator for $\beta_{jk}$ is optimal in the sense that it is unbiased and has minimum variance (UMVE) [22],

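$$\hat{\beta}_{jk}=\frac{1}{n}\sum_{i=1}^{n}\mathbb{I}\left(\hat{x}_{j}^{(i)}\hat{x}_{k}^{(i)}=1\right), \tag{16}$$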

where $\mathbb{I}(\cdot)$ is the indicator function. Note that $n\hat{\beta}_{jk}$ is a binomial random variable with $n$ trials and success probability $\beta_{jk}$ . By substituting $\hat{\beta}_{jk}$ in (15), we use

$$\hat{\rho}_{jk}=-\cos(\pi\hat{\beta}_{jk}), \tag{17}$$

as an estimator for $\rho_{jk}$ . Although $\hat{\rho}_{jk}$ is biased, it is a consistent estimator for $\rho_{jk}$ . The following lemma gives an error bound on this estimator.

Lemma 1.

Let $x_{j}$ and $x_{k}$ be two jointly normal variables with zero means, unit variances and correlation coefficient $\rho_{jk}$ . Then, for the estimator (17), we have

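$$\Pr\left(|\hat{\rho}_{jk}-\rho_{jk}|\geq\delta\right)\leq 2\exp\left(-\frac{2}{\pi^{2}}n\delta^{2}\right),$$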

where $\delta\geq 0$ .

Proof.

Since the function $\cos(\cdot)$ is a 1-Lipschitz function, i.e. $|\cos(x)-\cos(y)|\leq|x-y|$ , we have

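$$\Pr\left(|\hat{\rho}_{jk}-\rho_{jk}|\geq\delta\right)\leq\Pr\left(\pi|\hat{\beta}_{jk}-\beta_{jk}|\geq\delta\right)=\Pr\left(|\hat{\beta}_{jk}-\beta_{jk}|\geq\frac{\delta}{\pi}\right).$$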

Since $\hat{\beta}_{jk}$ is the average of $n$ independent Bernoulli random variables, applying the Hoeffding inequality yields

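$$\Pr\left(|\hat{\beta}_{jk}-\beta_{jk}|\geq\frac{\delta}{\pi}\right)\leq 2\exp\left(-\frac{2}{\pi^{2}}n\delta^{2}\right),$$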

which completes the proof. ∎
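The estimator (16)-(17) and the bound of Lemma 1 can be illustrated with a small simulation. This is a sketch under stated assumptions: the function names are hypothetical, the correlated Gaussian pair is generated as $x_{k}=\rho x_{j}+\sqrt{1-\rho^{2}}\,v$ with independent standard normal $x_{j}$ and $v$, and a zero sample is arbitrarily mapped to $+1$.

```python
import math
import random


def sign(value):
    """Map a real sample to +1 or -1 (zero is mapped to +1 by convention here)."""
    return 1 if value >= 0 else -1


def estimate_rho_from_signs(signs_j, signs_k):
    """Estimate the correlation from two equally long sign sequences."""
    n = len(signs_j)
    agreements = sum(1 for a, b in zip(signs_j, signs_k) if a * b == 1)
    beta_hat = agreements / n                 # fraction of agreeing signs, as in (16)
    return -math.cos(math.pi * beta_hat)      # plug-in estimator, as in (17)


def lemma1_bound(n, delta):
    """Upper bound of Lemma 1 on Pr(|rho_hat - rho| >= delta)."""
    return 2.0 * math.exp(-2.0 * n * delta ** 2 / math.pi ** 2)


def simulate(rho, n, seed=0):
    """Draw n correlated Gaussian pairs, keep only their signs, and estimate rho."""
    rng = random.Random(seed)
    signs_j, signs_k = [], []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)
        v = rng.gauss(0.0, 1.0)
        x_j = u
        x_k = rho * u + math.sqrt(1.0 - rho ** 2) * v
        signs_j.append(sign(x_j))
        signs_k.append(sign(x_k))
    return estimate_rho_from_signs(signs_j, signs_k)
```

For example, `simulate(0.5, 20000)` uses one bit per sample and per machine, and `lemma1_bound(20000, 0.1)` bounds the probability that the resulting estimate is off by 0.1 or more.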

Lemma 1 shows that the error probability of the proposed estimator $\hat{\rho}_{jk}$ decays exponentially in the number of samples. Using this estimator, we can obtain an estimator for the sample covariance matrix as follows

$$\widehat{S}_{x}=-\cos(\pi B), \tag{19}$$

where $B_{jk}=\hat{\beta}_{jk}$ , and the $\cos(\cdot)$ function is applied to the input matrix element-wise, i.e. $(\widehat{S}_{x})_{ij}=-\cos(\pi\hat{\beta}_{ij})$ , in agreement with the estimator (17). By substituting the above sample covariance matrix into (3), we can solve the regularized maximum likelihood problem.

Note that the sample covariance matrix $\widehat{S}_{x}$ defined in (19) is not necessarily positive semi-definite. However, this affects neither the convexity of (3) nor the uniqueness of its solution: Ravikumar et al. [4] proved that problem (3) is convex and has a unique solution for any sample covariance matrix with strictly positive diagonal elements, which holds for $\widehat{S}_{x}$ in (19).

By combining Lemma 1 with Theorems 1 and 2 in [4], we can conclude that the precision matrix obtained by solving (3) with the sample covariance matrix $\widehat{S}_{x}$ in (19) recovers the true structure with high probability. Moreover, the proposed method correctly recovers the signs of the edges with high probability.

More precisely, the event $\mathcal{M}(\Theta_{x};\widehat{\Theta}_{x})$ indicates that $\Theta_{x}$ and $\widehat{\Theta}_{x}$ agree on the zero entries and have the same sign on the nonzero entries. Theorems 1 and 2 state that the event $\mathcal{M}(\Theta_{x};\widehat{\Theta}_{x})$ occurs with high probability. Before stating the theorems, we define some properties of the underlying GGM. We denote the maximum degree of the underlying graph structure by $\Delta$ . The minimum absolute value of the edge weights in the precision matrix is denoted by $\theta_{\mathrm{min}}$ , which is

$$\theta_{\mathrm{min}}=\min_{(i,j)\in\mathcal{E}(\Theta_{x})}|(\Theta_{x})_{ij}|.$$

Theorem 1.

Consider a normal distribution satisfying Assumptions 1 and 2 with parameter $\alpha\in(0,1]$ . Let $\widehat{\Theta}_{x}$ be the solution of the log-determinant program (3) with sample covariance $\widehat{S}_{x}$ in (19) and regularization parameter $\lambda_{n}=(8\pi/\alpha)\sqrt{\frac{1}{2n}\ln\frac{2}{\epsilon}}$ for some $0<\epsilon\leq d^{-2}$ . Then,

(a)

If the sample size is lower bounded as

$$n>C_{\mathrm{sign}}^{2}\,\Delta^{2}\left(1+\frac{8}{\alpha}\right)^{2}\ln\frac{2}{\epsilon},$$

where

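$$C_{\mathrm{sign}}=32\pi\max\left\{\kappa_{\Sigma}\kappa_{\Gamma},\kappa_{\Sigma}^{3}\kappa_{\Gamma}^{2}\right\},$$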

then with probability at least $1-d^{2}\epsilon$ , the edge set specified by $\widehat{\Theta}_{x}$ is a subset of the true edge set.

(b)

If the sample size satisfies the lower bound

$$n>T_{\mathrm{sign}}^{2}\left(1+\frac{8}{\alpha}\right)^{2}\ln\frac{2}{\epsilon},$$

where

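$$T_{\mathrm{sign}}=2\pi\max\left\{\kappa_{\Gamma}\theta_{\mathrm{min}}^{-1},\,3\Delta\max\left\{\kappa_{\Sigma}\kappa_{\Gamma},\kappa_{\Sigma}^{3}\kappa_{\Gamma}^{2}\right\}\right\},$$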

then,

$$\Pr\left(\mathcal{M}(\Theta_{x};\widehat{\Theta}_{x})\right)\geq 1-d^{2}\epsilon.$$
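To get a feeling for the quantities in Theorem 1, the following sketch evaluates the two sample-size thresholds and the regularization parameter $\lambda_{n}$ from given model constants. The function names are hypothetical, and the constants are taken exactly as they appear in the theorem statement; the values of $\kappa_{\Sigma}$, $\kappa_{\Gamma}$, $\alpha$ and $\theta_{\mathrm{min}}$ are assumed to be known, which is rarely the case in practice.

```python
import math


def kappa_term(kappa_sigma, kappa_gamma):
    """max{kappa_Sigma * kappa_Gamma, kappa_Sigma^3 * kappa_Gamma^2}."""
    return max(kappa_sigma * kappa_gamma, kappa_sigma ** 3 * kappa_gamma ** 2)


def min_samples_subset(kappa_sigma, kappa_gamma, max_degree, alpha, eps):
    """Threshold of part (a): estimated edges form a subset of the true edges."""
    c_sign = 32.0 * math.pi * kappa_term(kappa_sigma, kappa_gamma)
    return (c_sign * max_degree * (1.0 + 8.0 / alpha)) ** 2 * math.log(2.0 / eps)


def min_samples_exact(kappa_sigma, kappa_gamma, max_degree, alpha, eps, theta_min):
    """Threshold of part (b): sign-consistent recovery of the structure."""
    t_sign = 2.0 * math.pi * max(
        kappa_gamma / theta_min,
        3.0 * max_degree * kappa_term(kappa_sigma, kappa_gamma),
    )
    return (t_sign * (1.0 + 8.0 / alpha)) ** 2 * math.log(2.0 / eps)


def regularization(n, alpha, eps):
    """lambda_n = (8 pi / alpha) * sqrt(ln(2 / eps) / (2 n))."""
    return (8.0 * math.pi / alpha) * math.sqrt(math.log(2.0 / eps) / (2.0 * n))
```

One way to read the choice of $\lambda_{n}$: it equals $(8/\alpha)\delta$, where $\delta=\pi\sqrt{\ln(2/\epsilon)/(2n)}$ is the deviation at which the bound of Lemma 1 equals $\epsilon$.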

Remark 1.

Note that the proposed sign method is applicable for any channel with capacity greater than or equal to 1 bit.

V Uncoded Method

In this section, we assume that each local machine puts its local data into the channel without any source or channel coding. The central machine estimates the underlying graph structure using received data from the channel. At the central machine, no source or channel decoding is used. It infers the structure directly from the output samples of the channel.

As described in Section III-A, we consider a SIMO-GMAC. Each local machine can transmit two consecutive samples per channel use. More precisely, denoting the channel input symbol by $\mathbf{s}=\mathbf{s}_{R}+j\mathbf{s}_{I}$ , each local machine can place two consecutive samples in the real and imaginary parts of the input symbol. Therefore, the central machine receives $n/2$ vectors, each of which is $2d$ -dimensional.

In this way, the equation (11) can be decomposed as

$$\mathbf{y}=(H_{R}+jH_{I})(\mathbf{s}_{R}+j\mathbf{s}_{I})+(\mathbf{z}_{R}+j\mathbf{z}_{I}),$$

where $j^{2}=-1$ . Hence, it can be rewritten in a block-matrix form as

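$$\begin{bmatrix}\mathbf{y}_{R}\\ \mathbf{y}_{I}\end{bmatrix}=\begin{bmatrix}H_{R}&-H_{I}\\ H_{I}&H_{R}\end{bmatrix}\begin{bmatrix}\mathbf{s}_{R}\\ \mathbf{s}_{I}\end{bmatrix}+\begin{bmatrix}\mathbf{z}_{R}\\ \mathbf{z}_{I}\end{bmatrix}, \tag{25}$$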

where $H_{R},H_{I}$ are $d\times d$ real matrices, and all the real and imaginary part vectors are $d$ -dimensional. In this way, two source samples can be transmitted per channel use: a sample is put into the real part and the other is put into the imaginary part. In particular, $\mathbf{s}_{R}+j\mathbf{s}_{I}=\sqrt{\frac{p}{2}}(\mathbf{x}_{R}+j\mathbf{x}_{I})$ , where $\mathbf{x}_{R}$ and $\mathbf{x}_{I}$ are two independent samples from the source. In this way, the transmit power constraints are satisfied.
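The packing of two real samples into one complex channel symbol can be sketched as below. The names `pack_uncoded` and `average_power` are hypothetical. Because the source samples have unit variance, the average of $|s|^{2}$ equals $p$ in expectation; the toy input here is chosen so that it holds exactly.

```python
import math


def pack_uncoded(samples, power):
    """Pack consecutive real samples into complex symbols scaled by sqrt(p / 2)."""
    if len(samples) % 2 != 0:
        raise ValueError("an even number of samples is required")
    scale = math.sqrt(power / 2.0)
    return [
        scale * complex(samples[i], samples[i + 1])
        for i in range(0, len(samples), 2)
    ]


def average_power(symbols):
    """Empirical average of |s|^2 over the transmitted symbols."""
    return sum(abs(symbol) ** 2 for symbol in symbols) / len(symbols)


# Four unit-magnitude samples occupy two channel uses with average power p = 3.
symbols = pack_uncoded([1.0, -1.0, -1.0, 1.0], power=3.0)
```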

The central machine estimates the conditional dependencies of the vectors $\mathbf{x}$ using the received vectors $\mathbf{y}$ .
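The equivalence between the complex channel model and its real-valued block form can be checked numerically. The sketch below uses plain Python lists and hypothetical helper names; it stacks real and imaginary parts and compares the result with the complex computation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


def complex_channel(h, s, z):
    """Channel output y = H s + z with complex entries."""
    return [hs + noise for hs, noise in zip(matvec(h, s), z)]


def real_block_channel(h, s, z):
    """Stacked real output [y_R; y_I] computed from the block matrix form."""
    h_r = [[entry.real for entry in row] for row in h]
    h_i = [[entry.imag for entry in row] for row in h]
    top = [r_row + [-v for v in i_row] for r_row, i_row in zip(h_r, h_i)]
    bottom = [i_row + r_row for r_row, i_row in zip(h_r, h_i)]
    block = top + bottom                       # [[H_R, -H_I], [H_I, H_R]]
    s_tilde = [v.real for v in s] + [v.imag for v in s]
    z_tilde = [v.real for v in z] + [v.imag for v in z]
    return [hs + noise for hs, noise in zip(matvec(block, s_tilde), z_tilde)]


# Small illustrative instance with d = m = 2.
h = [[1 + 2j, 0.5 - 1j], [-1j, 2 + 0j]]
s = [0.3 - 0.4j, 1 + 1j]
z = [0.1j, -0.2 + 0j]
y = complex_channel(h, s, z)
y_tilde = real_block_channel(h, s, z)
```

The first half of `y_tilde` holds the real parts of `y` and the second half holds the imaginary parts.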

V-A Approximating the Sample Covariance

We define the matrix $\widetilde{H}=\sqrt{\dfrac{p}{2}}\begin{bmatrix}H_{R}&-H_{I}\\ H_{I}&H_{R}\end{bmatrix}$ and the vectors $\tilde{\mathbf{x}}=\begin{bmatrix}\mathbf{x}_{R}\\ \mathbf{x}_{I}\end{bmatrix}$ and $\tilde{\mathbf{y}}=\begin{bmatrix}\mathbf{y}_{R}\\ \mathbf{y}_{I}\end{bmatrix}$ . When transmitting samples through the channel, if we put two independent samples as the real and imaginary parts of the vector $\tilde{\mathbf{x}}$ , then the covariance matrix of $\tilde{\mathbf{x}}$ is

$$Q_{\tilde{x}}=\begin{bmatrix}Q_{x}&0\\ 0&Q_{x}\end{bmatrix}.$$

On the other hand, according to equation (25), we have

[TABLE]

where $Q_{\tilde{x}}$ and $Q_{\tilde{y}}$ are the covariance matrices of $\tilde{\mathbf{x}}$ and $\tilde{\mathbf{y}}$ , respectively. By substituting the sample covariance matrix of $\tilde{\mathbf{y}}$ into the above expression, we obtain an approximation for the sample covariance matrix of $\tilde{\mathbf{x}}$ , as

[TABLE]

Lemma 2.

Given $n$ i.i.d. samples of the vector $\tilde{\mathbf{y}}$ , the sample covariance matrix $S_{\tilde{x}}$ in (28) satisfies

[TABLE]

where

[TABLE]

and $\lambda_{\mathrm{min}}(\widetilde{H})$ is the minimum eigenvalue of $\widetilde{H}$ .

Proof.

We define the random variable $\mathbf{w}$ as follows

[TABLE]

It is clear that $\mathbf{w}\sim\mathcal{N}(\mathbf{0},Q_{w})$ , where $Q_{w}=Q_{x}+\sigma_{z}^{2}\widetilde{H}^{-1}\widetilde{H}^{-T}$ . Denoting $\mathbf{w}=[w_{1},\cdots,w_{p}]^{T}$ , $w_{j}/\sqrt{(Q_{w})_{jj}}$ is a standard normal variable which is sub-Gaussian with parameter $1$ . Thus, according to Lemma 1 in [4], we have

[TABLE]

where $S_{w}=\widetilde{H}^{-1}S_{y}\widetilde{H}^{-T}$ is the sample covariance over $\mathbf{w}^{(i)}=\widetilde{H}^{-1}\mathbf{y}^{(i)}$ samples. On the other hand, we have

[TABLE]

By finding an upper bound on $\max_{i}(Q_{w})_{ii}^{2}$ , we can obtain an upper bound on the above probability.

[TABLE]

By decomposing $\widetilde{H}^{-1}\widetilde{H}^{-T}$ as $U\Lambda^{2}U^{T}$ , we have

[TABLE]

By combining the above bound and (32), and substituting into (31) we obtain the claimed bound in the lemma. ∎

Next, we define the approximate sample covariance matrix of $\mathbf{x}$ as

[TABLE]

where $0_{d}$ is a $d\times d$ zero matrix.

Lemma 3.

For $\hat{S}_{x}$ defined in (33), we have

[TABLE]

Proof.

From (33), it is clear that

[TABLE]

Thus, we have

[TABLE]

where the last inequality is obtained by the error bound of Lemma 2 for the sample size $n/2$ . ∎

Note that the matrix $\widehat{S}_{x}$ in (33) is not necessarily positive semi-definite. But this does not affect the convexity of the optimization problem (3).

By substituting $\widehat{S}_{x}$ from (33) into (3), we can solve the $\ell_{1}$ -regularized maximum likelihood problem and obtain a sparse solution for the precision matrix $\Theta_{x}$ . Similar to Theorem 1, we can guarantee that the Uncoded method correctly recovers the underlying graph structure with high probability.

Theorem 2.

Consider a normal distribution satisfying Assumptions 1 and 2 with parameter $\alpha\in(0,1]$ . Let $\widehat{\Theta}_{x}$ be the solution of the log-determinant program (3) with sample covariance $\widehat{S}_{x}$ in (33) and regularization parameter $\lambda_{n}=(8\pi/\alpha)\sqrt{\frac{1}{2n}\ln\frac{2}{\epsilon}}$ for some $0<\epsilon\leq d^{-2}$ .

(a)

If the sample size is lower bounded as

[TABLE]

where

[TABLE]

then with probability at least $1-d^{2}\epsilon$ , the edge set specified by $\widehat{\Theta}_{x}$ is a subset of the true edge set.

(b)

If the sample size satisfies the lower bound

[TABLE]

where

[TABLE]

then,

[TABLE]

Remark 2.

Since the channel has real and imaginary parts, 2 samples can be transmitted per channel use (i.e. each local machine can transmit $n$ samples in $n/2$ channel uses). Therefore, if the sample generation rate at the source is at most twice the channel’s rate, the machines can transmit all samples without any sample loss.

VI Experiments

In this section, we evaluate the performance of the proposed methods through several experiments. In our simulations (the source code is available at https://github.com/ArminKaramzade/distributed-sparse-GGM), the glasso package [19] is used to solve the $\ell_{1}$ -regularized MLE of the precision matrix. This package is based on the block coordinate descent algorithm proposed by [6].

In our experiments, a sparse random precision matrix is generated as follows. First, we generate a random sparse graph with a fixed edge probability, say $0.1$ , and restrict its maximum node degree to $\Delta=5$ . Then, we choose edge weights uniformly in $[-1,1]$ for the symmetric precision matrix $\Theta$ . Next, we make it positive definite by adding a scaled identity matrix. Finally, we normalize the precision matrix so that all variances equal $1$ . We also ensure that the generated matrix satisfies Assumption 1.
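The generation procedure described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code from the authors' repository: the function names are hypothetical, the identity is scaled by the largest absolute row sum plus a small `margin` so that the matrix is strictly diagonally dominant (hence positive definite), and the check of Assumption 1 is omitted.

```python
import math
import random


def invert(a):
    """Invert a small dense matrix by Gauss-Jordan elimination with pivoting."""
    d = len(a)
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(d)]
        for i, row in enumerate(a)
    ]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(d):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[d:] for row in aug]


def random_sparse_precision(d, edge_prob=0.1, max_degree=5, margin=0.1, seed=0):
    """Return a sparse precision matrix whose inverse has unit diagonal."""
    rng = random.Random(seed)
    theta = [[0.0] * d for _ in range(d)]
    degree = [0] * d
    for j in range(d):
        for k in range(j + 1, d):
            # Add an edge with the given probability while respecting the degree cap.
            if rng.random() < edge_prob and degree[j] < max_degree and degree[k] < max_degree:
                weight = rng.uniform(-1.0, 1.0)
                theta[j][k] = theta[k][j] = weight
                degree[j] += 1
                degree[k] += 1
    # Scaled identity: strict diagonal dominance guarantees positive definiteness.
    shift = max(sum(abs(v) for v in row) for row in theta) + margin
    for j in range(d):
        theta[j][j] = shift
    # Rescale so that the covariance (the inverse) has ones on its diagonal.
    q = invert(theta)
    scale = [math.sqrt(q[j][j]) for j in range(d)]
    return [[scale[j] * theta[j][k] * scale[k] for k in range(d)] for j in range(d)]
```

The rescaling uses $\Theta'=D\Theta D$ with $D=\mathrm{diag}(\sqrt{Q_{jj}})$, so that $(\Theta')^{-1}=D^{-1}QD^{-1}$ has unit diagonal while the zero pattern of $\Theta$ is preserved.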

We employ the True Positive Rate (TPR) and the False Positive Rate (FPR) as our performance measures. TPR is the percentage of true edges (non-zero off-diagonal entries in the precision matrix) that are correctly detected. Similarly, FPR is the percentage of true non-edges (zero entries in the precision matrix) that are incorrectly reported as edges.
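As a sketch of how these two measures can be computed from a true and an estimated precision matrix, the snippet below uses the standard definitions TPR = TP / (TP + FN) and FPR = FP / (FP + TN) over the off-diagonal pairs. The function name `tpr_fpr` and the threshold `tol` for treating an entry as zero are assumptions made for this illustration.

```python
def tpr_fpr(theta_true, theta_hat, tol=1e-8):
    """Return (TPR, FPR) over the upper-triangular off-diagonal entries."""
    d = len(theta_true)
    tp = fp = fn = tn = 0
    for j in range(d):
        for k in range(j + 1, d):
            is_edge = abs(theta_true[j][k]) > tol
            predicted = abs(theta_hat[j][k]) > tol
            if is_edge and predicted:
                tp += 1
            elif is_edge:
                fn += 1
            elif predicted:
                fp += 1
            else:
                tn += 1
    tpr = tp / (tp + fn) if tp + fn else float("nan")
    fpr = fp / (fp + tn) if fp + tn else float("nan")
    return tpr, fpr


# True edges: (0,1) and (1,2). Estimated edges: (0,1) and (0,2).
theta_true = [[1.0, 0.4, 0.0], [0.4, 1.0, -0.3], [0.0, -0.3, 1.0]]
theta_hat = [[1.0, 0.2, 0.1], [0.2, 1.0, 0.0], [0.1, 0.0, 1.0]]
rates = tpr_fpr(theta_true, theta_hat)
```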

We have experimentally observed that $\lambda_{n}^{\text{*signs}}\approx 4\lambda_{n}^{\text{*original}}$ and $\lambda_{n}^{\text{*uncoded}}\approx\frac{2}{3}\lambda_{n}^{\text{*original}}$ , where $\lambda_{n}^{\text{*signs}}$ , $\lambda_{n}^{\text{*uncoded}}$ , and $\lambda_{n}^{\text{*original}}$ are the best regularization parameters for the signs, uncoded, and original data, respectively.

To compare Signs and Uncoded methods fairly, we use identical channel parameters for both. More precisely, we assume $H=I$, which ensures that all the local machines have identical bit rates. We set the bit rate of each local machine, i.e. $R_{j}$, to 2 bits per channel use. Thus, in the proposed methods, each local machine can transmit $n$ bits in $n/2$ channel uses. According to (12), in order to achieve a bit rate of 2, the signal-to-noise ratio (SNR) should be set to 3 (i.e. $\frac{p}{\sigma^{2}_{z}}=3$).
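
As a quick check of this value, assume that (12) takes the standard Gaussian-channel form $R_{j}=\log_{2}\left(1+\frac{p}{\sigma_{z}^{2}}\right)$ per channel use when $H=I$. This form is an assumption about an equation stated earlier in the paper, restated here only for illustration. Then

$$R_{j}=2 \;\Longrightarrow\; \frac{p}{\sigma_{z}^{2}}=2^{R_{j}}-1=2^{2}-1=3 .$$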

In the first experiment, we evaluate the performance of our methods with respect to the dimension $d$ . Figure 2 shows TPR and FPR as a function of $d$ for sample sizes $n=1000$ and $10,000$ . In this experiment, the error curves are averaged over 20 different random graphs and for each graph the TPR and FPR are averaged over 10 different random samples. As can be seen from Figure 2, Signs method outperforms Uncoded method. However, all three methods have approximately the same FPR.

Figure 3 shows the performance of the methods as a function of the sample size $n$ for $d=50$ and $d=100$. As can be seen, the performance of all methods improves as the sample size $n$ grows. In this experiment, Signs method again outperforms the uncoded scheme.

In Figure 4, the probability of perfect structure recovery for a star-shaped graph is depicted. In this experiment, the underlying star graph consists of $d=70$ nodes. The precision matrix is generated as the inverse of a covariance matrix with $(Q_{x})_{ij}=\frac{1}{4}$ for all $(i,j)\in\mathcal{E}$, which satisfies Assumption 1. The probability of perfect recovery is estimated by running the proposed methods 100 times and counting the number of times that the structure is recovered exactly. As can be seen from the figure, all methods recover the structure exactly for large enough sample sizes, as claimed by Theorems 1 and 2.
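
A small sketch of this setup is given below. The text specifies only the hub-leaf covariances; the unit variances and the leaf-leaf covariance $\left(\frac{1}{4}\right)^{2}$ used here are assumptions, chosen because they are what a Gaussian tree model with hub-leaf covariance $\frac{1}{4}$ implies. The function names `star_covariance` and `exact_recovery_rate` and the threshold `1e-8` are likewise hypothetical.

```python
import numpy as np

def star_covariance(d, rho=0.25):
    """Covariance of a star graph with hub node 0 and leaves 1..d-1.

    Assumptions: unit variances, hub-leaf covariance rho, and leaf-leaf
    covariance rho**2 (leaves are independent given the hub).
    """
    cov = np.full((d, d), rho ** 2)
    cov[0, :] = rho
    cov[:, 0] = rho
    np.fill_diagonal(cov, 1.0)
    return cov

def exact_recovery_rate(true_support, estimated_supports):
    """Fraction of runs whose estimated edge set matches the true one exactly."""
    hits = sum(np.array_equal(true_support, s) for s in estimated_supports)
    return hits / len(estimated_supports)

cov = star_covariance(70)
theta = np.linalg.inv(cov)
# Edge set = non-zero off-diagonal entries of the precision matrix.
support = np.abs(theta) > 1e-8
np.fill_diagonal(support, False)
```

Inverting this covariance gives a precision matrix whose only non-zero off-diagonal entries connect the hub to the leaves, which is the star structure the experiment tries to recover. `exact_recovery_rate` would then be applied to the supports estimated over repeated runs.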

In Figure 5, we measure the TPR of Uncoded method for different values of the SNR. The experiment is performed on a random graph with $d=40$ nodes, maximum degree $\Delta=5$, and $n=10,000$ samples. In this experiment, we have generated the channel matrix $H$ with i.i.d. standard normal entries. The TPR curve is averaged over 100 different channel matrices. As can be seen from Figure 5, for SNR greater than 5, the TPR of Uncoded method is very close to the TPR obtained from the original data.

The FPR curve is not plotted, since the error values were negligible even for small SNR.

VII Conclusion

In this paper, we have studied the sparse structure learning of GGMs where the data are distributed across multiple local machines. Two methods are proposed to send information from the local machines to the central machine, namely, Signs and Uncoded methods. We have analytically and experimentally shown that the central machine can recover the underlying graph if large enough sample sizes are transmitted to the central machine.

Our experiments show that, under the same conditions, Signs method outperforms the uncoded scheme. Both methods have small FPR which is close to the FPR obtained by the original data.

VIII Acknowledgment

We thank Amir Najafi and Amir-Hossein Saberi for their valuable comments that greatly improved the manuscript.

Bibliography

[1] L. E. Chai, S. K. Loh, S. T. Low, M. S. Mohamad, S. Deris, and Z. Zakaria, “A review on the computational approaches for gene regulatory network construction,” Computers in Biology and Medicine, vol. 48, pp. 55–65, 2014.
[2] M. Hecker, S. Lambeck, S. Toepfer, E. Van Someren, and R. Guthke, “Gene regulatory network inference: data integration in dynamic models—a review,” Biosystems, vol. 96, no. 1, pp. 86–103, 2009.
[3] R. Xiang, J. Neville, and M. Rogati, “Modeling relationship strength in online social networks,” in Proceedings of the 19th International Conference on World Wide Web. ACM, 2010, pp. 981–990.
[4] P. Ravikumar, M. J. Wainwright, G. Raskutti, B. Yu et al., “High-dimensional covariance estimation by minimizing $\ell_{1}$-penalized log-determinant divergence,” Electronic Journal of Statistics, vol. 5, pp. 935–980, 2011.
[5] S. Sojoudi, “Equivalence of graphical lasso and thresholding for sparse graphs,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 3943–3963, 2016.
[6] O. Banerjee, L. E. Ghaoui, and A. d’Aspremont, “Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data,” Journal of Machine Learning Research, vol. 9, no. Mar, pp. 485–516, 2008.
[7] C. Chow and C. Liu, “Approximating discrete probability distributions with dependence trees,” IEEE Transactions on Information Theory, vol. 14, no. 3, pp. 462–467, 1968.
[8] A. Anandkumar, V. Y. Tan, F. Huang, and A. S. Willsky, “High-dimensional Gaussian graphical model selection: Walk summability and local separation criterion,” Journal of Machine Learning Research, vol. 13, no. Aug, pp. 2293–2337, 2012.
