Learning Quasi-Kronecker Product Graphical Models

Mattia Zorzi

arXiv:1901.10894 · math.OC · January 31, 2019 · CDC

TL;DR

This paper introduces a Bayesian hierarchical method for learning graphical models with support decomposable as a Kronecker product, effectively reducing hyperparameter complexity and avoiding overfitting.

Contribution

It presents a novel approach leveraging the Kronecker structure and Bayesian hierarchy to improve model learning efficiency and robustness.

Findings

01 Method successfully captures Kronecker-structured supports.

02 Reduces hyperparameter count compared to traditional models.

03 Demonstrates effectiveness through numerical experiments.

Abstract

We consider the problem of learning graphical models where the support of the concentration matrix can be decomposed as a Kronecker product. We propose a method that uses the Bayesian hierarchical modeling approach. Thanks to the particular structure of the graph, we use a number of hyperparameters that is small compared to the number of nodes in the graphical model. In this way, we avoid overfitting in the estimation of the hyperparameters. Finally, we test the effectiveness of the proposed method on a numerical example.

Equations

$$x_{j}\perp x_{k}\mid x_{l},\ l\neq j,k\iff(j,k)\notin\Omega.$$

$$x_{j}\perp x_{k}\mid x_{l},\ l\neq j,k\iff s_{jk}=0.$$

$$\hat{\Sigma}=\frac{1}{N}\sum_{k=1}^{N}\mathrm{x}_{k}\mathrm{x}_{k}^{T};$$

$$\hat{S}^{(h)}=\cdots+\hat{\gamma}^{(h-1)}\sum_{j,k=1}^{m}|s_{jk}|,\qquad\hat{\gamma}^{(h)}=\cdots$$

$$\hat{S}^{(h)}=\cdots+\sum_{j,k=1}^{m}\hat{\gamma}_{jk}^{(h-1)}|s_{jk}|,\qquad\hat{\gamma}_{jk}^{(h)}=\cdots$$

$$(E_{\Omega})_{jk}=\begin{cases}1,&\text{if }(j,k)\in\Omega\\0,&\text{otherwise.}\end{cases}$$

$$E_{\Omega}=E_{\Omega_{1}}\otimes E_{\Omega_{2}}$$

$$p(\mathrm{x}^{N}\mid S)\propto|S|^{N/2}\exp\left(-\frac{N}{2}\mathrm{tr}(S\hat{\Sigma})\right)$$

$$p(S\mid\Lambda,\Gamma)=\prod_{j,k=1}^{m_{1}}\prod_{i,l=1}^{m_{2}}p(s_{jk,il}\mid\lambda_{jk},\gamma_{il})$$

$$p(s_{jk,il}\mid\lambda_{jk},\gamma_{il})=\frac{\lambda_{jk}\gamma_{il}}{2}\exp\left(-\lambda_{jk}\gamma_{il}|s_{jk,il}|\right).$$

$$p(\Lambda)=\prod_{j,k=1}^{m_{1}}p(\lambda_{jk}),\qquad p(\Gamma)=\prod_{i,l=1}^{m_{2}}p(\gamma_{il})$$

$$p(\lambda_{jk})=\varepsilon_{1}\exp(-\varepsilon_{1}\lambda_{jk}),\qquad p(\gamma_{il})=\varepsilon_{2}\exp(-\varepsilon_{2}\gamma_{il})$$

$$\ell(\mathrm{x}^{N};S,\Lambda,\Gamma)=-\log p(\mathrm{x}^{N},S,\Lambda,\Gamma)$$

$$p(\mathrm{x}^{N},S,\Lambda,\Gamma)=p(\mathrm{x}^{N}\mid S)\,p(S\mid\Lambda,\Gamma)\,p(\Lambda)\,p(\Gamma),$$

$$\begin{aligned}\ell(\mathrm{x}^{N};S,\Lambda,\Gamma)&=-\log p(\mathrm{x}^{N}\mid S)-\log p(S\mid\Lambda,\Gamma)-\log p(\Lambda)-\log p(\Gamma)\\&\propto-\frac{N}{2}\log|S|+\frac{N}{2}\mathrm{tr}(S\hat{\Sigma})+\sum_{j,k=1}^{m_{1}}\sum_{i,l=1}^{m_{2}}\lambda_{jk}\gamma_{il}|s_{jk,il}|\\&\quad+\sum_{j,k=1}^{m_{1}}\left(\varepsilon_{1}\lambda_{jk}-m_{2}^{2}\log\lambda_{jk}\right)+\sum_{i,l=1}^{m_{2}}\left(\varepsilon_{2}\gamma_{il}-m_{1}^{2}\log\gamma_{il}\right)\end{aligned}$$

$$(\hat{S},\hat{\Lambda},\hat{\Gamma})=\underset{S,\Lambda,\Gamma}{\operatorname{argmin}}\ \ell(\mathrm{x}^{N};S,\Lambda,\Gamma)\quad\text{s.t. }S\succ 0,\ \lambda_{jk}\geq 0,\ \gamma_{il}\geq 0.$$

$$\hat{S}^{(h)}=\cdots,\qquad\hat{\Lambda}^{(h)}=\cdots,\qquad\hat{\Gamma}^{(h)}=\cdots$$

$$\hat{S}^{(h)}=\cdots+\sum_{j,k=1}^{m_{1}}\sum_{i,l=1}^{m_{2}}\hat{\lambda}_{jk}^{(h-1)}\hat{\gamma}_{il}^{(h-1)}|s_{jk,il}|,\qquad\hat{\lambda}_{jk}^{(h)}=\cdots,\qquad\hat{\gamma}_{il}^{(h)}=\cdots$$

$$W\otimes\mathbf{1}_{m_{2}}\mathbf{1}_{m_{2}}^{T}+\mathbf{1}_{m_{1}}\mathbf{1}_{m_{1}}^{T}\otimes Y\approx\log\left(\mathrm{abs}(\hat{\Sigma}^{-1})+\epsilon\mathbf{1}_{m_{1}m_{2}}\mathbf{1}_{m_{1}m_{2}}^{T}\right)$$

$$\mathrm{vec}(W\otimes I_{m_{2}}),\qquad\mathrm{vec}(I_{m_{1}}\otimes Y)$$

Taxonomy

Topics: Statistical Methods and Inference · Bayesian Modeling and Causal Inference · Bayesian Methods and Mixture Models

Full text

Learning Quasi-Kronecker Product Graphical Models

Mattia Zorzi

M. Zorzi is with the Department of Information Engineering, University of Padova, Padova, Italy.

I INTRODUCTION

Many modern applications are characterized by high-dimensional data sets from which it is important to discover the meaningful interactions among the variables rather than to find an accurate model. A powerful tool to analyze these interrelations is given by graphical models (i.e. Markov networks) [1, speed1986gaussian, 8378239]. The simplest version of the latter consists of a zero mean Gaussian random vector to which we attach an undirected graph: each node corresponds to a component of the random vector, and there is an edge between two nodes if and only if the corresponding components are conditionally dependent given the others.

In these applications, there is great interest in learning sparse graphical models (i.e. graphs with few edges) from data; indeed, these models are characterized by few conditional dependence relations among the components. Interestingly, a sparse graph corresponds to a covariance matrix whose inverse, called the concentration matrix, is sparse. The problem of learning sparse graphical models, sometimes called the covariance selection problem, can be tackled using regularization techniques [2, 3, 4, 5]. For instance, [3] proposed a regularized maximum-likelihood (ML) estimator for the covariance matrix in which an $\ell_{1}$ norm penalty is placed on the concentration matrix. Since the $\ell_{1}$ norm penalty induces sparsity, the estimated covariance matrix will have a sparse inverse. It is worth noting that these approaches can be extended to dynamic graphical models [6, 7] as well as factor models [valeCDC, ciccone2017factor, 7331087, 8264253].

These regularized estimators are known to be sensitive to the choice of the regularization parameter, i.e. the weight on the $\ell_{1}$ penalty, which is typically selected by cross-validation or theoretical derivation. To overcome this issue, a Bayesian hierarchical modeling approach has been considered [8]. Here, the concentration matrix is modeled as a random matrix whose prior is characterized by a regularization parameter (called hyperparameter). Then, the hyperparameter and the covariance matrix are jointly estimated. Since the $\ell_{1}$ norm shrinks all the entries towards zero, and thus introduces a bias, a further improvement is to consider a weighted $\ell_{1}$ norm, see [9], where the hyperparameter is a matrix whose dimension (in principle) coincides with the number of nodes in the graph. On the other hand, the introduction of a hyperparameter with many variables could lead to overfitting in the estimation of the hyperparameter matrix.

An important class of graphical models is represented by the so-called Kronecker Product (KP) graphical models [10, 11, 12, 13, 14], wherein it is required that the concentration matrix can be decomposed as a Kronecker product. KP graphical models find application in many fields: spatiotemporal MEG/EEG modeling [15]; recommendation systems like Netflix and gene expression analysis [16]; face recognition analysis [17]. In these applications the most important feature is the graphical structure, i.e. the fact that the support of the concentration matrix can be decomposed as a Kronecker product.

The contribution of the present paper is to address the problem of learning graphical models where the support of the concentration matrix can be decomposed as a Kronecker product. We call such models Quasi-Kronecker Product (QKP) graphical models. Note that the assumption that the support can be decomposed as a Kronecker product does not imply that the concentration matrix itself can. Therefore, QKP graphical models can be understood as a weaker version of KP graphical models, making the former class less restrictive than the latter. Adopting the Bayesian hierarchical modeling approach, in the spirit of [9], we introduce two hyperparameter matrices whose total number of variables is small compared to the number of nodes in the graph. In this way, we avoid overfitting in the estimation of the hyperparameters.

The paper is outlined as follows. In Section II we introduce graphical models and the problem of graphical model selection. In Section III we introduce QKP graphical models. In Section IV we propose a Bayesian procedure to learn QKP graphical models from data, while Section V is devoted to the initialization of the procedure. In Section VI we present a numerical example to show the effectiveness of the proposed method. Finally, Section VII draws the conclusions.

We warn the reader that the present paper only reports some preliminary results regarding the Bayesian estimation of QKP graphical models. In particular, all the proofs and most of the technical assumptions needed therein are omitted and will be published afterwards.

Notation: Given a symmetric matrix $S$, we write $S\succ 0$ ($S\succeq 0$) if $S$ is positive (semi-)definite. $x\sim\mathcal{N}(\mu,\Sigma)$ means that $x$ is a Gaussian random vector with mean $\mu$ and covariance matrix $\Sigma$. $\mathbb{E}[\cdot]$ denotes the expectation operator. Given two functions $f(x)$ and $g(x)$, $f\propto g$ means that the argmin with respect to $x$ of $f$ and $g$ coincide. Given a matrix $S$ of dimension $m\times m$, $s_{jk}$ denotes its entry in position $(j,k)$. Given a matrix $S$ of dimension $m_{1}m_{2}\times m_{1}m_{2}$, $s_{jk,il}$ denotes its entry in position $((j-1)m_{2}+i,(k-1)m_{2}+l)$. Given a matrix $S$, $\mathrm{vec}(S)$ denotes the vectorization of $S$. Given a matrix $S$ with positive entries, $\log(S)$ denotes the matrix with entry $\log(s_{jk})$ in position $(j,k)$. Given a matrix $S$, $\exp(S)$ is the matrix with entry $\exp(s_{jk})$ in position $(j,k)$ and $\mathrm{abs}(S)$ denotes the matrix with entry $|s_{jk}|$ in position $(j,k)$. $\mathbf{1}_{m}$ denotes the $m$-dimensional vector of ones.

II GRAPHICAL MODEL SELECTION

Let $x=[\,x_{1}\ldots x_{m}\,]^{T}$ be a zero mean Gaussian random vector taking values in $\mathbb{R}^{m}$ and with covariance matrix $\Sigma\succ 0$. Thus, this random vector is completely characterized by $\Sigma$. We can attach to $x$ an undirected graph $\mathcal{G}(V,\Omega)$ where $V$ denotes the set of its nodes, and $\Omega$ denotes the set of its edges. More precisely, each node corresponds to a component $x_{j}$, $j=1\ldots m$, of $x$ and there is an edge between nodes $j,k$ if and only if $x_{j}$ and $x_{k}$ are conditionally dependent given the other components, or equivalently, for any $j\neq k$:

$$x_{j}\perp x_{k}\mid x_{l},\ l\neq j,k\iff(j,k)\notin\Omega.$$

Thus, $\Omega$ describes the conditionally dependent pairs of $x$. The graph $\mathcal{G}$ is referred to as the graphical model of $x$; an example is provided in Figure 1.

Dempster proved that the conditional independence relations are encoded by the concentration matrix of $x$, i.e. $S:=\Sigma^{-1}$ [18]:

$$x_{j}\perp x_{k}\mid x_{l},\ l\neq j,k\iff s_{jk}=0.$$

Accordingly, sparsity of $S$ , i.e. $S$ with many entries equal to zero, reflects the fact that the graphical model $\mathcal{G}$ of $x$ is sparse, i.e. $\mathcal{G}$ has few edges.
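The link between zeros of $S$ and conditional independence can be checked numerically. The sketch below uses a hypothetical 3-node chain (the matrix values are chosen only for illustration) and computes the conditional covariance of $(x_1,x_3)$ given $x_2$ through the Schur complement.

```python
import numpy as np

# Hypothetical 3-node chain x1 - x2 - x3: the concentration matrix S has
# s_13 = 0, so x1 and x3 should be conditionally independent given x2.
S = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 2.0]])
Sigma = np.linalg.inv(S)

# Conditional covariance of (x1, x3) given x2 via the Schur complement.
a, b = [0, 2], [1]
cond_cov = (Sigma[np.ix_(a, a)]
            - Sigma[np.ix_(a, b)] @ np.linalg.inv(Sigma[np.ix_(b, b)]) @ Sigma[np.ix_(b, a)])

marginal_cov_13 = Sigma[0, 2]        # nonzero: x1 and x3 are marginally dependent
conditional_cov_13 = cond_cov[0, 1]  # zero up to rounding: conditionally independent
```

Here the covariance matrix is dense, while the zero pattern of $S$ alone carries the graph structure.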

In many applications, it is required to learn a sparse graphical model $\mathcal{G}$ from data. More precisely, given a sequence of data $\mathrm{x}^{N}:=\{\,\mathrm{x}_{1}^{T}\ldots\mathrm{x}_{N}^{T}\,\}$ generated by $x$, find a sparse graphical model $\mathcal{G}(V,\Omega)$ for $x$. The simplest idea is to compute the sample covariance from the data

$$\hat{\Sigma}=\frac{1}{N}\sum_{k=1}^{N}\mathrm{x}_{k}\mathrm{x}_{k}^{T};$$

then the graphical model is given by the support of $\hat{\Sigma}^{-1}$. However, the resulting graph is full even when the underlying system is well described by a sparse graphical model. In [8], a procedure based on a Bayesian hierarchical model has been proposed. More precisely, the entries of $S$ are assumed to be i.i.d. and Laplace distributed with hyperparameter $\gamma\geq 0$, i.e. the probability density function (pdf) of $s_{jk}$ is $p(s_{jk})=\frac{\gamma}{2}\exp\left(-\gamma|s_{jk}|\right)$. The resulting procedure is described in Algorithm 1. The main drawback of this approach is that it assigns a priori the same level of sparsity to each entry of $S$. This method has been extended to the case wherein only the entries in the same column of $S$ have the same distribution [9], allowing different levels of sparsity in the prior. A further extension is to assume that all the entries of $S$ may be distributed in a different way, while respecting the symmetry, that is $p(s_{jk})=p(s_{kj})=\frac{\gamma_{jk}}{2}\exp\left(-\gamma_{jk}|s_{jk}|\right)$ with $\gamma_{jk}\geq 0$. Using arguments similar to the ones in [9], it is not difficult to obtain the procedure described in Algorithm 2.
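The claim that the naive estimate yields a full graph is easy to reproduce. The following is a minimal sketch on a hypothetical 5-node chain model (sizes, seed and matrix values are assumptions made for illustration): the true concentration matrix has many exact zeros, while the inverse of the sample covariance has none.

```python
import numpy as np

rng = np.random.default_rng(0)
m, N = 5, 1000

# Hypothetical sparse concentration matrix (chain graph, tridiagonal).
S_true = 2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
Sigma_true = np.linalg.inv(S_true)

# Draw N samples from N(0, Sigma_true) and form the zero-mean sample covariance.
X = rng.multivariate_normal(np.zeros(m), Sigma_true, size=N)
Sigma_hat = X.T @ X / N
S_naive = np.linalg.inv(Sigma_hat)

true_zeros = int(np.sum(S_true == 0))    # off-chain entries of the true model
naive_zeros = int(np.sum(S_naive == 0))  # exact zeros in the naive estimate
```

This is why a sparsity-inducing prior (or penalty) is needed to recover the edge set from finite data.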

III QKP GRAPHICAL MODELS

Consider the undirected graph $\mathcal{G}(V,\Omega)$ and let $m$ denote the number of its nodes. Let $E_{\Omega}$ be the $m\times m$ binary matrix defined as follows:

$$(E_{\Omega})_{jk}=\begin{cases}1,&\text{if }(j,k)\in\Omega\\0,&\text{otherwise.}\end{cases}$$

We say that $\mathcal{G}(V,\Omega)$ is a Kronecker Product graph if there exist two graphs $\mathcal{G}_{1}(V_{1},\Omega_{1})$ and $\mathcal{G}_{2}(V_{2},\Omega_{2})$ with $m_{1}$ and $m_{2}$ nodes, respectively, such that:

$$E_{\Omega}=E_{\Omega_{1}}\otimes E_{\Omega_{2}}$$

where $m=m_{1}m_{2}$ . In shorthand notation we will write $\mathcal{G}=\mathcal{G}_{1}\otimes\mathcal{G}_{2}$ . In practice, in this graph $\mathcal{G}$ we can recognize modules containing $m_{2}$ nodes sharing the same graphical structure, described by $\Omega_{2}$ ; the interaction among those $m_{1}$ modules is described by $\Omega_{1}$ . An illustrative example is given in Figure 2.
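The module structure can be seen directly on a small case. The sketch below builds the support of a Kronecker Product graph from two hypothetical binary matrices `E1` and `E2` (chosen for illustration; setting their diagonals to 1 is an assumption so that the diagonal of the concentration matrix belongs to the support).

```python
import numpy as np

# Hypothetical module-level graph (m1 = 3 modules) and within-module graph (m2 = 3 nodes).
E1 = np.array([[1, 1, 0],
               [1, 1, 0],
               [0, 0, 1]])
E2 = np.array([[1, 1, 0],
               [1, 1, 1],
               [0, 1, 1]])

E = np.kron(E1, E2)  # support of the Kronecker Product graph, m = m1 * m2 = 9 nodes
m2 = E2.shape[0]

def block(j, k):
    """Return the m2 x m2 block of E coupling module j with module k (0-based)."""
    return E[j * m2:(j + 1) * m2, k * m2:(k + 1) * m2]
```

Every block of `E` is either a copy of `E2` (modules linked in `E1`) or identically zero (modules not linked), which is the recurrent pattern described above.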

Let $x=[\,x_{1}\ldots x_{m}\,]^{T}$ be a zero mean Gaussian random vector taking values in $\mathbb{R}^{m}$ and with inverse covariance matrix (i.e. concentration matrix) $S\succ 0$ whose support is $E_{\Omega}=E_{\Omega_{1}}\otimes E_{\Omega_{2}}$. We can attach to $x$ a Kronecker Product graph $\mathcal{G}(V,\Omega)=\mathcal{G}_{1}(V_{1},\Omega_{1})\otimes\mathcal{G}_{2}(V_{2},\Omega_{2})$. Accordingly, $\Omega_{1}$ characterizes the conditional dependence relations among the modules, while $\Omega_{2}$ characterizes the recurrent conditional dependence relations among the nodes in each module. This graphical model is referred to as a Quasi-Kronecker Product (QKP) graphical model to distinguish it from the Kronecker Product (KP) graphical model proposed in [13]. Indeed, in the latter both the concentration matrix and its support admit a Kronecker decomposition, i.e. $E_{\Omega}=E_{\Omega_{1}}\otimes E_{\Omega_{2}}$ and $S=S_{1}\otimes S_{2}$. In our graphical model, even if the support of the concentration matrix admits a Kronecker product decomposition, the concentration matrix itself need not.
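The difference between the two classes can be illustrated with a small numerical check. The sketch below is not part of the paper: it uses the standard rearrangement test (a matrix is a Kronecker product of an $m_1\times m_1$ and an $m_2\times m_2$ factor exactly when its rearranged form has rank one), and the matrices `S1`, `S2` and the diagonal perturbation are hypothetical choices.

```python
import numpy as np

def rearrange(S, m1, m2):
    """Rows indexed by (j, k), columns by (i, l): R[(j,k),(i,l)] = s_{jk,il}."""
    T = S.reshape(m1, m2, m1, m2)  # T[j, i, k, l] = S[j*m2 + i, k*m2 + l]
    return T.transpose(0, 2, 1, 3).reshape(m1 * m1, m2 * m2)

m1, m2 = 3, 3
S1 = np.array([[2.0, 1.0, 0.0],
               [1.0, 2.0, 0.0],
               [0.0, 0.0, 2.0]])
S2 = np.array([[2.0, -1.0, 0.0],
               [-1.0, 2.0, -1.0],
               [0.0, -1.0, 2.0]])

S_kp = np.kron(S1, S2)                                    # KP model: S itself factorizes
S_qkp = S_kp + np.diag(np.arange(m1 * m2, dtype=float))   # same support, no factorization

same_support = np.array_equal(S_kp != 0, S_qkp != 0)
rank_kp = np.linalg.matrix_rank(rearrange(S_kp, m1, m2))    # 1 for a Kronecker product
rank_qkp = np.linalg.matrix_rank(rearrange(S_qkp, m1, m2))  # larger than 1 here
min_eig_qkp = np.linalg.eigvalsh(S_qkp).min()               # positive: valid concentration matrix
```

Both matrices share the support $E_{\Omega_1}\otimes E_{\Omega_2}$, but only `S_kp` is a Kronecker product; `S_qkp` is a QKP model that is not a KP model.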

IV LEARNING QKP GRAPHICAL MODELS

We address the problem of learning a QKP graphical model $\mathcal{G}=\mathcal{G}_{1}\otimes\mathcal{G}_{2}$ from data. In many real applications the observed data are explained by a sparse graphical model because the latter allows a straightforward interpretation of the interaction among the variables involved in the application. Thus, in our case we require that $\mathcal{G}_{1}$ and $\mathcal{G}_{2}$ are sparse. The problem can be formalized as follows.

Problem 1

Let $x$ be a zero mean random vector of dimension $m=m_{1}m_{2}$, $m_{1},m_{2}\in\mathbb{N}$, with inverse covariance matrix $S$. Given a sequence of data $\mathrm{x}^{N}:=\{\,\mathrm{x}_{1}^{T}\ldots\mathrm{x}_{N}^{T}\,\}$ generated by $x$, find a QKP graphical model $\mathcal{G}(V,\Omega)=\mathcal{G}_{1}(V_{1},\Omega_{1})\otimes\mathcal{G}_{2}(V_{2},\Omega_{2})$ for $x$ where $\mathcal{G}_{1}$ and $\mathcal{G}_{2}$ are sparse and have $m_{1}$ and $m_{2}$ nodes, respectively.

To solve Problem 1, we adopt a Bayesian hierarchical modeling approach. $S$ is modeled as a random matrix whose prior depends on the hyperparameters $\Lambda$ and $\Gamma$. $\Lambda$ is an $m_{1}\times m_{1}$ symmetric random matrix with nonnegative entries and $\Gamma$ is an $m_{2}\times m_{2}$ symmetric random matrix with nonnegative entries. The hyperpriors for $\Lambda$ and $\Gamma$ depend on $\varepsilon_{1}$ and $\varepsilon_{2}$, respectively. The latter are deterministic positive quantities.

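We proceed to characterize the Bayesian model in detail. The conditional pdf of $\mathrm{x}^{N}$ under the model $x\sim\mathcal{N}(0,S^{-1})$ is:

$$p(\mathrm{x}^{N}\mid S)\propto|S|^{N/2}\exp\left(-\frac{N}{2}\mathrm{tr}(S\hat{\Sigma})\right)$$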

where the neglected terms do not depend on $S$. In what follows we assume that $\hat{\Sigma}\succ 0$. We model the entries of $S$ as independent random variables, so that the prior for $S$ is:

$$p(S\mid\Lambda,\Gamma)=\prod_{j,k=1}^{m_{1}}\prod_{i,l=1}^{m_{2}}p(s_{jk,il}\mid\lambda_{jk},\gamma_{il})$$

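where $s_{jk,il}$ is Laplace distributed

$$p(s_{jk,il}\mid\lambda_{jk},\gamma_{il})=\frac{\lambda_{jk}\gamma_{il}}{2}\exp\left(-\lambda_{jk}\gamma_{il}|s_{jk,il}|\right).\tag{10}$$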

$\Lambda$ and $\Gamma$ are independent random matrices, i.e. $p(\Lambda,\Gamma)=p(\Lambda)p(\Gamma)$. The entries of $\Lambda$ and $\Gamma$ are assumed to be independent

$$p(\Lambda)=\prod_{j,k=1}^{m_{1}}p(\lambda_{jk}),\qquad p(\Gamma)=\prod_{i,l=1}^{m_{2}}p(\gamma_{il})$$

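and with exponential distribution

$$p(\lambda_{jk})=\varepsilon_{1}\exp(-\varepsilon_{1}\lambda_{jk}),\qquad p(\gamma_{il})=\varepsilon_{2}\exp(-\varepsilon_{2}\gamma_{il})\tag{12}$$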

with $\varepsilon_{1},\varepsilon_{2}$ deterministic and positive quantities. At this point, some comments regarding the choice of the prior on $S$ and the hyperprior on $\Lambda,\Gamma$ are in order. From (10) it is clear that $s_{jk,il}$ takes values close to zero with high probability if the product $\lambda_{jk}\gamma_{il}$ is large. Moreover, if $\gamma_{il}$ is very large for some $(i,l)$ and $\lambda_{jk}\geq\epsilon>0$ for all $j,k=1\ldots m_{1}$, with $\gamma_{il}\epsilon$ large, then the entries $s_{jk,il}$ with $j,k=1\ldots m_{1}$ all take values close to zero with high probability. Accordingly, the different modules in the graph will have a similar sparsity pattern with high probability. Hence, prior (10) assigns high probability to QKP graphical models. The hyperprior in (12) guarantees that $\lambda_{jk}$ and $\gamma_{il}$ diverge with probability zero. As we will see, this assumption guarantees that the optimization procedure that we propose is well-posed.
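The rates of the Laplace prior can be collected in a single matrix, which makes the index convention $s_{jk,il}$ concrete. In the sketch below the hyperparameter values are random placeholders and the helper `random_symmetric_positive` is a hypothetical name introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
m1, m2 = 3, 4

def random_symmetric_positive(n):
    """Hypothetical helper: symmetric matrix with strictly positive entries."""
    A = rng.uniform(0.1, 2.0, size=(n, n))
    return (A + A.T) / 2

Lam = random_symmetric_positive(m1)  # entries lambda_{jk}
Gam = random_symmetric_positive(m2)  # entries gamma_{il}

# Rate of the Laplace prior on s_{jk,il}. With 0-based indices the entry sits at
# row j*m2 + i and column k*m2 + l, matching the (1-based) convention of the paper.
rate = np.kron(Lam, Gam)

j, k, i, l = 2, 0, 1, 3
rate_entry = rate[j * m2 + i, k * m2 + l]  # equals Lam[j, k] * Gam[i, l]

# Free hyperparameters (symmetric matrices): two small factors versus one full matrix.
n_factored = m1 * (m1 + 1) // 2 + m2 * (m2 + 1) // 2
n_full = (m1 * m2) * (m1 * m2 + 1) // 2
```

A large $\gamma_{il}$ raises the rate in position $(i,l)$ of every block at once, which is how the prior favors a sparsity pattern shared by all modules.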

Next, we characterize the maximum a posteriori (MAP) estimator of $S$ (and thus also the MAP estimator of the covariance matrix by the invariance principle). The latter minimizes the negative log-likelihood

$$\ell(\mathrm{x}^{N};S,\Lambda,\Gamma)=-\log p(\mathrm{x}^{N},S,\Lambda,\Gamma)$$

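where $p(\mathrm{x}^{N},S,\Lambda,\Gamma)$ is the joint pdf of $\mathrm{x}^{N}$, $S$, $\Lambda$ and $\Gamma$. Note that

$$p(\mathrm{x}^{N},S,\Lambda,\Gamma)=p(\mathrm{x}^{N}\mid S)\,p(S\mid\Lambda,\Gamma)\,p(\Lambda)\,p(\Gamma),\tag{14}$$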

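in particular, the negative log-likelihood contains the prior (10), which induces sparsity within and among the modules. Moreover, we have

$$\begin{aligned}\ell(\mathrm{x}^{N};S,\Lambda,\Gamma)&=-\log p(\mathrm{x}^{N}\mid S)-\log p(S\mid\Lambda,\Gamma)-\log p(\Lambda)-\log p(\Gamma)\\&\propto-\frac{N}{2}\log|S|+\frac{N}{2}\mathrm{tr}(S\hat{\Sigma})+\sum_{j,k=1}^{m_{1}}\sum_{i,l=1}^{m_{2}}\lambda_{jk}\gamma_{il}|s_{jk,il}|\\&\quad+\sum_{j,k=1}^{m_{1}}\left(\varepsilon_{1}\lambda_{jk}-m_{2}^{2}\log\lambda_{jk}\right)+\sum_{i,l=1}^{m_{2}}\left(\varepsilon_{2}\gamma_{il}-m_{1}^{2}\log\gamma_{il}\right)\end{aligned}\tag{15}$$

where the neglected terms do not depend on $S$, $\Lambda$ and $\Gamma$. It is clear that the MAP estimator of $S$ depends on $\Lambda$, $\Gamma$, $\varepsilon_{1}$ and $\varepsilon_{2}$. In what follows we assume $\varepsilon_{1}$ and $\varepsilon_{2}$ fixed. Then, a way to estimate $\Lambda$ and $\Gamma$ from the data is provided by the empirical Bayes approach: $\Lambda$ and $\Gamma$ are obtained by maximizing the marginal likelihood of $\mathrm{x}^{N}$, which is obtained by integrating out $S$ in (14) [19]. However, it is not easy to find an analytical expression for the marginal likelihood in this case. An alternative, simplified approach for optimizing $\Lambda$, $\Gamma$ is the generalized maximum likelihood (GML) method [20]. According to this method, $S$, $\Lambda$ and $\Gamma$ jointly minimize (15):

$$(\hat{S},\hat{\Lambda},\hat{\Gamma})=\underset{S,\Lambda,\Gamma}{\operatorname{argmin}}\ \ell(\mathrm{x}^{N};S,\Lambda,\Gamma)\quad\text{s.t. }S\succ 0,\ \lambda_{jk}\geq 0,\ \gamma_{il}\geq 0.$$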
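For concreteness, the joint objective (up to terms independent of $S$, $\Lambda$, $\Gamma$) can be evaluated as in the following sketch. The function name `neg_log_joint` and its argument names are hypothetical, the hyperparameter entries are assumed strictly positive so that the logarithms are finite, and positive definiteness of $S$ is tested through a Cholesky factorization.

```python
import numpy as np

def neg_log_joint(S, Lam, Gam, Sigma_hat, N, eps1, eps2):
    """Negative log joint density, up to additive constants (illustrative sketch)."""
    m1, m2 = Lam.shape[0], Gam.shape[0]
    try:
        L = np.linalg.cholesky(S)          # fails when S is not positive definite
    except np.linalg.LinAlgError:
        return np.inf
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    # Gaussian fit term: -(N/2) log|S| + (N/2) tr(S Sigma_hat)
    fit = -0.5 * N * logdet + 0.5 * N * np.trace(S @ Sigma_hat)
    # Weighted l1 term: sum of lambda_{jk} gamma_{il} |s_{jk,il}|
    penalty = np.sum(np.kron(Lam, Gam) * np.abs(S))
    # Hyperprior terms, including the log terms inherited from the Laplace normalization
    hyper = (np.sum(eps1 * Lam - m2**2 * np.log(Lam))
             + np.sum(eps2 * Gam - m1**2 * np.log(Gam)))
    return fit + penalty + hyper

# Small usage example with hypothetical values (m1 = m2 = 2).
value = neg_log_joint(np.eye(4), np.ones((2, 2)), np.ones((2, 2)), np.eye(4), 10, 1.0, 1.0)
infeasible = neg_log_joint(-np.eye(4), np.ones((2, 2)), np.ones((2, 2)), np.eye(4), 10, 1.0, 1.0)
```

Such a function is mainly useful for monitoring that an iterative scheme decreases the objective at each step.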

Since the joint optimization of the three variables is still a hard problem, we propose an iterative three-step procedure. At the $h$-th iteration we solve the following three optimization problems:

[TABLE]

It is possible to prove that Problems (17), (18) and (19) each admit a unique solution. The resulting procedure is illustrated in Algorithm 3.
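To see why the hyperparameter steps are well-posed, one can work out the minimizer of the joint objective with respect to a single hyperparameter while everything else is held fixed. The derivation below is a sketch under that assumption; it is not copied from Algorithm 3, whose exact update formulas and iterate ordering are not reproduced in this text, and the shorthand $c_{jk}$ is introduced here for convenience.

```latex
% Terms of the objective that involve \lambda_{jk}, with S and \Gamma held fixed
f(\lambda_{jk}) = \lambda_{jk}\, c_{jk} + \varepsilon_1 \lambda_{jk} - m_2^2 \log \lambda_{jk},
\qquad c_{jk} := \sum_{i,l=1}^{m_2} \gamma_{il}\, |s_{jk,il}| \ge 0 .

% Stationarity and strict convexity on \lambda_{jk} > 0
f'(\lambda_{jk}) = c_{jk} + \varepsilon_1 - \frac{m_2^2}{\lambda_{jk}} = 0
\;\Longrightarrow\;
\lambda_{jk} = \frac{m_2^2}{\varepsilon_1 + c_{jk}},
\qquad f''(\lambda_{jk}) = \frac{m_2^2}{\lambda_{jk}^2} > 0 .

% By the same argument, with S and \Lambda held fixed
\gamma_{il} = \frac{m_1^2}{\varepsilon_2 + \sum_{j,k=1}^{m_1} \lambda_{jk}\, |s_{jk,il}|} .
```

Since $\varepsilon_{1},\varepsilon_{2}>0$, the denominators stay strictly positive even when all the corresponding entries of $S$ vanish, so the minimizers are finite and unique. When $S$ and the other hyperparameter matrix are symmetric, the resulting matrix is symmetric as well.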

In the aforementioned algorithm the hyperparameter selection is performed iteratively through Step 5. It is worth noting that Algorithm 3 is similar to Algorithm 1 and Algorithm 2: the main difference is that in the proposed algorithm we have two types of hyperparameters. As a consequence, in the proposed algorithm we have three optimization steps instead of two.

V INITIAL CONDITIONS

In Algorithm 3 we have to fix the initial conditions for the hyperparameters, that is $\hat{\Lambda}^{(0)}$ and $\hat{\Gamma}^{(0)}$ . The idea is to approximate $\hat{\Sigma}^{-1}$ through a Kronecker product, then the two matrices of this product are used to initialize $\Lambda$ and $\Gamma$ .

Given $\hat{\Sigma}$, we want to find $\bar{W}$ and $\bar{Y}$ of dimension $m_{1}\times m_{1}$ and $m_{2}\times m_{2}$, respectively, with positive entries such that $\bar{W}\otimes\bar{Y}\approx\mathrm{abs}(\hat{\Sigma}^{-1})+\epsilon\mathbf{1}_{m_{1}m_{2}}\mathbf{1}_{m_{1}m_{2}}^{T}$ where $\epsilon>0$ is chosen sufficiently small. The presence of the term $\epsilon\mathbf{1}_{m_{1}m_{2}}\mathbf{1}_{m_{1}m_{2}}^{T}$ allows us to take the entrywise logarithm on both sides, obtaining

$$W\otimes\mathbf{1}_{m_{2}}\mathbf{1}_{m_{2}}^{T}+\mathbf{1}_{m_{1}}\mathbf{1}_{m_{1}}^{T}\otimes Y\approx\log\left(\mathrm{abs}(\hat{\Sigma}^{-1})+\epsilon\mathbf{1}_{m_{1}m_{2}}\mathbf{1}_{m_{1}m_{2}}^{T}\right)$$

where $W=\log\bar{W}$ and $Y=\log\bar{Y}$ . It is not difficult to see that the following relations hold:

[TABLE]

Then, we can write the approximation above as $Az\approx b$ where

[TABLE]

Accordingly, $z$ can be found by solving the least squares problem $\hat{z}=\operatorname*{argmin}_{z}\|Az-b\|$, therefore $\hat{z}=(A^{T}A)^{-1}A^{T}b$. From $\hat{z}$ we recover $W$ and $Y$. Note that $W$ and $Y$ computed from $\hat{z}$ are in general not symmetric matrices. Thus, to compute $\bar{W}$ and $\bar{Y}$ from $W$ and $Y$ we force the symmetric structure: $\bar{W}=(\exp(W)+\exp(W)^{T})/2$ and $\bar{Y}=(\exp(Y)+\exp(Y)^{T})/2$. At this point, it is worth noting that $\bar{w}_{jk}\bar{y}_{il}$ provides roughly the order of magnitude of $s_{jk,il}$. On the other hand, the hyperparameters $\lambda_{jk}$ and $\gamma_{il}$ provide the prior information about the order of magnitude of $s_{jk,il}$: the larger $\lambda_{jk}\gamma_{il}$ is, the closer $s_{jk,il}$ is to zero. Accordingly, we choose $\hat{\Lambda}^{(0)}$ and $\hat{\Gamma}^{(0)}$ such that:

[TABLE]
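The Kronecker approximation step of this section can be sketched as follows. This is an illustrative construction, not the paper's exact $A$ and $b$: it assumes the additive entrywise model $\log(\bar{w}_{jk}\bar{y}_{il})=w_{jk}+y_{il}$, and the function name `kron_log_fit` is hypothetical. In this particular construction $A$ has a one-dimensional null space (adding a constant to $W$ and subtracting it from $Y$ leaves $Az$ unchanged), so the sketch uses the minimum-norm least squares solution rather than $(A^{T}A)^{-1}A^{T}b$; the product $\bar{W}\otimes\bar{Y}$ is not affected by this ambiguity.

```python
import numpy as np

def kron_log_fit(M, m1, m2):
    """Fit log(M) entrywise by W (x) 11^T + 11^T (x) Y in the least squares sense.

    M must have strictly positive entries. Returns the symmetrized positive
    factors W_bar (m1 x m1) and Y_bar (m2 x m2).
    """
    b = np.log(M).ravel()
    J1, J2 = np.ones((m1, m1)), np.ones((m2, m2))
    cols = []
    for p in range(m1 * m1):                       # one column per entry of W
        Ew = np.zeros(m1 * m1)
        Ew[p] = 1.0
        cols.append(np.kron(Ew.reshape(m1, m1), J2).ravel())
    for q in range(m2 * m2):                       # one column per entry of Y
        Ey = np.zeros(m2 * m2)
        Ey[q] = 1.0
        cols.append(np.kron(J1, Ey.reshape(m2, m2)).ravel())
    A = np.column_stack(cols)
    # Minimum-norm least squares solution (A is rank deficient in this construction).
    z, *_ = np.linalg.lstsq(A, b, rcond=1e-10)
    W = z[:m1 * m1].reshape(m1, m1)
    Y = z[m1 * m1:].reshape(m2, m2)
    W_bar = (np.exp(W) + np.exp(W).T) / 2          # force symmetry, as in the text
    Y_bar = (np.exp(Y) + np.exp(Y).T) / 2
    return W_bar, Y_bar

# Sanity check on an exact Kronecker product of hypothetical positive symmetric factors.
W0 = np.array([[1.0, 0.5], [0.5, 2.0]])
Y0 = np.array([[1.0, 0.2, 0.3], [0.2, 1.5, 0.4], [0.3, 0.4, 2.0]])
M = np.kron(W0, Y0)
W_bar, Y_bar = kron_log_fit(M, 2, 3)
```

In the setting of this section one would call it with `M = np.abs(np.linalg.inv(Sigma_hat)) + eps` for a small `eps > 0`.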

VI SIMULATION RESULTS

We consider a Monte Carlo experiment structured as follows:

• We generate $60$ QKP graphical models with $m_{1}=6$ modules, each containing $m_{2}=10$ nodes. For each model, $\Omega_{1}$ and $\Omega_{2}$ are generated randomly. The fraction of edges is set equal to $20\%$ for both $\Omega_{1}$ and $\Omega_{2}$;

• For each model we generate a finite-length realization $\mathrm{x}^{N}:=\{\,\mathrm{x}_{1}\ldots\mathrm{x}_{N}\,\}$, with $N=1000$, and we compute the sample covariance $\hat{\Sigma}$;

• For each realization we consider the following estimators:

– S1 estimator: it computes a sparse graphical model by using Algorithm 1; in this case we have one scalar hyperparameter;

– S2 estimator: it computes a sparse graphical model by using Algorithm 2; in this case we have $m_{1}m_{2}(m_{1}m_{2}+1)/2=1830$ variables in the hyperparameter matrix;

– QKP estimator: it computes a QKP graphical model with $m_{1}=6$ modules, each with $m_{2}=10$ nodes; in this case the total number of variables in the hyperparameter matrices is $m_{1}(m_{1}+1)/2+m_{2}(m_{2}+1)/2=76$;

• For each realization, we compute the relative error in reconstructing the concentration matrix and the relative error in reconstructing the sparsity pattern using S1, S2 and QKP. For instance, the relative error in reconstructing the concentration matrix using the QKP estimator is

[TABLE]

where $S_{true}$ denotes the concentration matrix of the true model, while $\hat{S}_{QKP}$ is the estimated concentration matrix; here, $\|\cdot\|$ denotes the Frobenius norm. The relative error in reconstructing the sparsity pattern using QKP estimator is

[TABLE]

where $\Omega_{set}$ and $\hat{\Omega}_{QKP}$ denote the set of edges of the true graphical model and the one estimated, respectively.
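The bookkeeping of the experiment can be sketched as follows. The hyperparameter counts follow from the symmetry of the matrices involved; the function `relative_error` is a hypothetical helper whose normalization by the Frobenius norm of the true matrix is an assumption, since the exact error formula is not reproduced in this text.

```python
import numpy as np

m1, m2 = 6, 10
m = m1 * m2

# Number of free hyperparameters for each estimator (symmetric matrices).
n_s1 = 1                                                # one scalar weight
n_s2 = m * (m + 1) // 2                                 # one weight per entry of S
n_qkp = m1 * (m1 + 1) // 2 + m2 * (m2 + 1) // 2         # entries of Lambda plus Gamma

def relative_error(S_hat, S_true):
    """Relative Frobenius-norm error (assumed normalization by ||S_true||)."""
    return np.linalg.norm(S_hat - S_true, "fro") / np.linalg.norm(S_true, "fro")

example_error = relative_error(2.0 * np.eye(3), np.eye(3))
```

With these sizes S2 estimates 1830 hyperparameters and QKP only 76.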

Figure 3 depicts the set of edges estimated using the three estimators in a realization of the Monte Carlo experiment. As we see, only QKP provides a structure similar to the true model. Figure 4 shows the boxplot of the relative error in reconstructing the sparsity pattern (left panel) and the concentration matrix (right panel). As we can see, the worst performance is given by S1, while the best performance is achieved by QKP. In particular, the relative error of the estimated sparsity pattern for QKP is very small compared to the other two methods. The poor performance of S1 is due to the fact that a single scalar hyperparameter is not sufficient to capture the correct structure of the graph. By contrast, the poor performance of S2 is due to overfitting in the estimation of the hyperparameter matrix.

VII CONCLUSIONS

We have introduced Quasi-Kronecker Product graphical models wherein the nodes are grouped in modules having the same number of nodes. The interactions among the nodes of the same module, as well as the interactions among the nodes of two different modules, follow a common structure. Then, we have addressed the problem of learning QKP graphical models from data using a Bayesian hierarchical model. Finally, we have compared the proposed procedure with Bayesian learning techniques for estimating sparse graphical models: simulation evidence showed the effectiveness of the proposed method.

Bibliography

[1] S. Lauritzen, Graphical Models. Oxford: Oxford University Press, 1996.
[2] J. Z. Huang, N. Liu, M. Pourahmadi, and L. Liu, “Covariance matrix selection and estimation via penalised normal likelihood,” Biometrika, vol. 93, no. 1, pp. 85–98, 2006.
[3] O. Banerjee, L. E. Ghaoui, and A. d’Aspremont, “Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data,” Journal of Machine Learning Research, vol. 9, pp. 485–516, 2008.
[4] J. Friedman, T. Hastie, and R. Tibshirani, “Sparse inverse covariance estimation with the graphical lasso,” Biostatistics, vol. 9, no. 3, pp. 432–441, 2008.
[5] A. d’Aspremont, O. Banerjee, and L. El Ghaoui, “First-order methods for sparse covariance selection,” SIAM Journal on Matrix Analysis and Applications, vol. 30, no. 1, pp. 56–66, 2008.
[6] J. Songsiri and L. Vandenberghe, “Topology selection in graphical models of autoregressive processes,” J. Mach. Learning Res., vol. 11, pp. 2671–2705, 2010.
[7] M. Zorzi and R. Sepulchre, “AR identification of latent-variable graphical models,” IEEE Transactions on Automatic Control, vol. 61, pp. 2327–2340, Sep. 2016.
[8] N. B. Asadi, I. Rish, K. Scheinberg, D. Kanevsky, and B. Ramabhadran, “MAP approach to learning sparse Gaussian Markov networks,” in Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on, pp. 1721–1724, 2009.