Learning Quasi-Kronecker Product Graphical Models
Mattia Zorzi

TL;DR
This paper introduces a Bayesian hierarchical method for learning graphical models with support decomposable as a Kronecker product, effectively reducing hyperparameter complexity and avoiding overfitting.
Contribution
It presents a novel approach leveraging the Kronecker structure and Bayesian hierarchy to improve model learning efficiency and robustness.
Findings
Method successfully captures Kronecker-structured supports.
Reduces hyperparameter count compared to traditional models.
Demonstrates effectiveness through numerical experiments.
Abstract
We consider the problem of learning graphical models where the support of the concentration matrix can be decomposed as a Kronecker product. We propose a method that uses the Bayesian hierarchical modeling approach. Thanks to the particular structure of the graph, we use a number of hyperparameters which is small compared to the number of nodes in the graphical model. In this way, we avoid overfitting in the estimation of the hyperparameters. Finally, we test the effectiveness of the proposed method by a numerical example.
M. Zorzi is with the Department of Information Engineering, University of Padova, Padova, Italy; email: [email protected]
I INTRODUCTION
Many modern applications are characterized by high-dimensional data sets from which it is important to discover the meaningful interactions among the variables rather than to find an accurate model. A powerful tool to analyze these interrelations is given by graphical models (i.e., Markov networks) [1, speed1986gaussian, 8378239]. The simplest version of the latter consists of a zero mean Gaussian random vector to which we attach an undirected graph: each node corresponds to a component of the random vector and there is an edge between two nodes if and only if the corresponding components are conditionally dependent given the others.
In these applications, there is considerable interest in learning sparse graphical models (i.e., graphs with few edges) from data; indeed, these models are characterized by few conditional dependence relations among the components. Interestingly, a sparse graph corresponds to a covariance matrix whose inverse, called the concentration matrix, is sparse. The problem of learning sparse graphical models, sometimes called the covariance selection problem, can be tackled by using regularization techniques [2, 3, 4, 5]. For instance, [3] proposed a regularized maximum-likelihood (ML) estimator for the covariance matrix in which an $\ell_1$ norm penalty on the concentration matrix is considered. Since the $\ell_1$ norm penalty induces sparsity, the estimated covariance matrix will have a sparse inverse. It is worth noting that these approaches can be extended to dynamic graphical models [6, 7] as well as factor models [valeCDC, ciccone2017factor, 7331087, 8264253].
These regularized estimators are known to be sensitive to the choice of the regularization parameter, i.e. the weight on the penalty, which is typically selected by cross-validation or theoretical derivation. To overcome this issue, a Bayesian hierarchical modeling approach has been considered [8]. Here, the concentration matrix is modeled as a random matrix whose prior is characterized by a regularization parameter (called hyperparameter). Then, the hyperparameter and the covariance matrix are jointly estimated. Since the $\ell_1$ norm shrinks all the entries toward zero, and thus introduces a bias, a further improvement is to consider a weighted $\ell_1$ norm, see [9], where the hyperparameter is a matrix whose dimension (in principle) coincides with the number of nodes in the graph. On the other hand, the introduction of a hyperparameter with many variables could lead to overfitting in the estimation of the hyperparameter matrix.
An important class of graphical models is represented by the so-called Kronecker Product (KP) graphical models [10, 11, 12, 13, 14], wherein it is required that the concentration matrix can be decomposed as a Kronecker product. KP graphical models find application in many fields: spatiotemporal MEG/EEG modeling [15]; recommendation systems like Netflix and gene expression analysis [16]; face recognition analysis [17]. In these applications the most important feature is the graphical structure, i.e. the fact that the support of the concentration matrix can be decomposed as a Kronecker product.
The contribution of the present paper is to address the problem of learning graphical models where the support of the concentration matrix can be decomposed as a Kronecker product. We call such models Quasi-Kronecker Product (QKP) graphical models. Note that the assumption that the support can be decomposed as a Kronecker product does not imply that the concentration matrix can. Therefore, QKP graphical models can be understood as a weaker version of KP graphical models, making the former class less restrictive than the latter. Adopting the Bayesian hierarchical modeling approach, in the spirit of [9], we introduce two hyperparameter matrices whose total number of variables is small compared to the number of nodes in the graph. In this way, we avoid overfitting in the estimation of the hyperparameters.
The paper is outlined as follows. In Section II we introduce graphical models and the problem of graphical model selection. In Section III we introduce QKP graphical models. In Section IV we propose a Bayesian procedure to learn QKP graphical models from data, while Section V is devoted to the initialization of the procedure. In Section VI we present a numerical example to show the effectiveness of the proposed method. Finally, Section VII draws the conclusions.
We warn the reader that the present paper only reports some preliminary results regarding the Bayesian estimation of QKP graphical models. In particular, all the proofs and most of the technical assumptions needed therein are omitted and will be published afterwards.
Notation: Given a symmetric matrix $S$, we write $S \succ 0$ ($S \succeq 0$) if $S$ is positive (semi-)definite. $x \sim \mathcal{N}(\mu, \Sigma)$ means $x$ is a Gaussian random vector with mean $\mu$ and covariance matrix $\Sigma$. $\mathbb{E}[\cdot]$ denotes the expectation operator. Given two functions and , means that the argmin with respect to of and do coincide. Given a matrix of dimension , denotes its entry in position . Given a matrix of dimension , denotes its entry in position . Given a matrix $X$, $\mathrm{vec}(X)$ denotes the vectorization of matrix $X$. Given a matrix with positive entries, denotes the matrix with entry in position . Given a matrix , is the matrix with entry in position and denotes the matrix with entry in position . $\mathbf{1}_m$ denotes the $m$-dimensional vector of ones.
II GRAPHICAL MODEL SELECTION
Let $x$ be a zero mean Gaussian random vector taking values in $\mathbb{R}^m$ and with covariance matrix $\Sigma \succ 0$. Thus, this random vector is completely characterized by $\Sigma$. We can attach to $x$ an undirected graph $\mathcal{G} = (V, E)$ where $V$ denotes the set of its nodes, and $E \subseteq V \times V$ denotes the set of its edges. More precisely, each node $j \in V$ corresponds to a component $x_j$, $j = 1, \ldots, m$, of $x$, and there is an edge between nodes $j$ and $k$ if and only if $x_j$ and $x_k$ are conditionally dependent given the other components, or equivalently, for any $j \neq k$:
$$ x_j \perp x_k \mid \{x_l\}_{l \neq j,k} \iff (j,k) \notin E. $$
Thus, $E$ describes the conditionally dependent pairs of $x$. The graph $\mathcal{G}$ is referred to as the graphical model of $x$; an example is provided in Figure 1.
Dempster proved that the conditional independence relations are given by the concentration matrix of $x$, i.e. $\Sigma^{-1}$ [18]: for any $j \neq k$,
$$ (j,k) \notin E \iff (\Sigma^{-1})_{jk} = 0. $$
Accordingly, sparsity of $\Sigma^{-1}$, i.e. the presence of many entries equal to zero, reflects the fact that the graphical model of $x$ is sparse, i.e. $E$ contains few edges.
In many applications, it is required to learn a sparse graphical model from data. More precisely, given a sequence of data $x^{(1)}, \ldots, x^{(N)}$ generated by $x$, find a graphical model for $x$ whose concentration matrix $\Sigma^{-1}$ is sparse. The simplest idea is to compute the sample covariance from the data
$$ \hat{\Sigma} = \frac{1}{N} \sum_{t=1}^{N} x^{(t)} (x^{(t)})^{\top}, $$
then the graphical model is given by the support of $\hat{\Sigma}^{-1}$. However, the resulting graph is full even when the underlying system is well described by a sparse graphical model. In [8], a procedure based on a Bayesian hierarchical model has been proposed. More precisely, the entries of $\Sigma^{-1}$ are assumed to be i.i.d. and Laplace distributed with hyperparameter $\gamma > 0$, i.e. the probability density function (pdf) of an entry $s$ of $\Sigma^{-1}$ is $p(s) = \frac{\gamma}{2} e^{-\gamma |s|}$. The resulting procedure is described in Algorithm 1. The main drawback of this approach is that it assigns a priori the same level of sparsity to each entry of $\Sigma^{-1}$. This method has been extended to the case wherein only the entries in the same column of $\Sigma^{-1}$ have the same distribution [9], allowing different levels of sparsity in the prior. A further extension is to assume that all the entries of $\Sigma^{-1}$ may be distributed in a different way, but respecting the symmetry, that is, the hyperparameter $\gamma_{jk}$ of the entry in position $(j,k)$ satisfies $\gamma_{jk} = \gamma_{kj}$. Using arguments similar to the ones in [9], it is not difficult to derive the procedure described in Algorithm 2.
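As a small illustration of why the inverse of the sample covariance is of little use for reading off the graph, the following sketch (an illustrative example, not one of the paper's experiments) draws data whose components are independent, so that the true graph has no edges, and counts the nonzero off-diagonal entries of the inverse sample covariance. The sizes `m = 4` and `N = 200`, the seed, and all helper names are assumptions made for the example.

```python
import random

def sample_covariance(data):
    """Return (1/N) * sum_t x_t x_t^T for zero mean samples x_t."""
    n_samples, m = len(data), len(data[0])
    return [[sum(x[j] * x[k] for x in data) / n_samples for k in range(m)]
            for j in range(m)]

def invert(a):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

random.seed(0)
m, N = 4, 200  # hypothetical sizes
# Independent components: the true concentration matrix is the identity (no edges).
data = [[random.gauss(0.0, 1.0) for _ in range(m)] for _ in range(N)]
sigma_hat = sample_covariance(data)
k_hat = invert(sigma_hat)

# Count off-diagonal entries of the inverse sample covariance that are not exactly zero.
nonzero_offdiag = sum(1 for j in range(m) for k in range(m)
                      if j != k and k_hat[j][k] != 0.0)
print(nonzero_offdiag, "of", m * (m - 1), "off-diagonal entries are nonzero")
```

Because of sampling noise the off-diagonal entries are small but not zero, which is why a prior (or penalty) that promotes exact zeros is introduced.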
III QKP GRAPHICAL MODELS
Consider the undirected graph $\mathcal{G} = (V, E)$ and let $m$ denote the number of its nodes. Let $A_{\mathcal{G}}$ be the $m \times m$ binary matrix defined as follows:
$$ (A_{\mathcal{G}})_{jk} = \begin{cases} 1, & (j,k) \in E \\ 0, & \text{otherwise}. \end{cases} $$
We say that $\mathcal{G}$ is a Kronecker Product graph if there exist two graphs $\mathcal{G}_1$ and $\mathcal{G}_2$ with $m_1$ and $m_2$ nodes, respectively, such that:
$$ A_{\mathcal{G}} = A_{\mathcal{G}_1} \otimes A_{\mathcal{G}_2} $$
where $m = m_1 m_2$. In shorthand notation we will write $\mathcal{G} = \mathcal{G}_1 \otimes \mathcal{G}_2$. In practice, in this graph we can recognize $m_1$ modules containing $m_2$ nodes sharing the same graphical structure, described by $\mathcal{G}_2$; the interaction among those modules is described by $\mathcal{G}_1$. An illustrative example is given in Figure 2.
Let $x$ be a zero mean Gaussian random vector taking values in $\mathbb{R}^{m_1 m_2}$ and with inverse covariance matrix (i.e. concentration matrix) $\Sigma^{-1}$ whose support is $A_{\mathcal{G}_1} \otimes A_{\mathcal{G}_2}$. We can attach to $x$ a Kronecker Product graph $\mathcal{G} = \mathcal{G}_1 \otimes \mathcal{G}_2$. Accordingly, $\mathcal{G}_1$ characterizes the conditional dependence relations among the modules, while $\mathcal{G}_2$ characterizes the recurrent conditional dependence relations among the nodes in each module. This graphical model is referred to as Quasi-Kronecker Product (QKP) graphical model to distinguish it from the Kronecker Product (KP) graphical model proposed in [13]. Indeed, in the latter both the concentration matrix and its support admit a Kronecker decomposition. In our graphical model, although the support of the concentration matrix admits a Kronecker product decomposition, the concentration matrix itself need not.
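The distinction between a Kronecker-structured support and a Kronecker-structured matrix can be checked on a toy case. The sketch below is illustrative only: the two chain patterns, the numerical values and the helper names are assumptions. It builds a support pattern as a Kronecker product of two binary matrices, fills it with a symmetric, strictly diagonally dominant (hence positive definite) matrix, and verifies that two of its diagonal blocks are not proportional, which would be required if the matrix itself were a Kronecker product.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

def support(s, tol=0.0):
    """Binary matrix marking the nonzero entries of s."""
    return [[int(abs(v) > tol) for v in row] for row in s]

chain = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
e1, e2 = chain, chain        # hypothetical module-level and node-level patterns
e = kron(e1, e2)             # 9 x 9 support pattern
m2 = len(e2)
m = len(e)

# Symmetric, strictly diagonally dominant matrix whose support is e.
s = [[(4.0 + 0.1 * r) if r == c else -0.3 * e[r][c] for c in range(m)]
     for r in range(m)]

def block(mat, j, k):
    """Extract the m2 x m2 block in block-row j and block-column k."""
    return [row[k * m2:(k + 1) * m2] for row in mat[j * m2:(j + 1) * m2]]

b00, b11 = block(s, 0, 0), block(s, 1, 1)
# In a Kronecker product every block is a scalar multiple of the same matrix;
# the cross-ratio test below fails here, so s is not a Kronecker product.
is_proportional = abs(b00[0][0] * b11[0][1] - b00[0][1] * b11[0][0]) < 1e-12

print(support(s) == e, is_proportional)
```

Here the support of `s` coincides with `kron(e1, e2)` while `s` itself has no Kronecker factorization, which is exactly the situation covered by the QKP class and excluded by the KP class.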
IV LEARNING QKP GRAPHICAL MODELS
We address the problem of learning a QKP graphical model from data. In many real applications the observed data are explained by a sparse graphical model because the latter allows a straightforward interpretation of the interaction among the variables involved in the application. Thus, in our case we require that and are sparse. The problem can be formalized as follows.
Problem 1
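Let $x$ be a zero mean random vector of dimension $m$, $m = m_1 m_2$, with inverse covariance matrix $\Sigma^{-1}$. Given a sequence of data $x^{(1)}, \ldots, x^{(N)}$ generated by $x$, find a QKP graphical model $\mathcal{G}_1 \otimes \mathcal{G}_2$ for $x$ where $\mathcal{G}_1$ and $\mathcal{G}_2$ are sparse and have $m_1$ and $m_2$ nodes, respectively.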
To solve Problem 1, we adopt the Bayesian hierarchical modeling. is modeled as a random matrix whose prior depends on the hyperparameters and . is a symmetric random matrix with nonnegative entries and is a symmetric random matrix with nonnegative entries. The hyperprior for and depends on and , respectively. The latter are deterministic positive quantities.
We proceed to characterize the Bayesian model in detail. The conditional pdf of under model is:
[TABLE]
where the neglected terms do not depend on . In what follows we assume that . We model the entries of as independent random variables, so that the prior for is:
[TABLE]
where is Laplace distributed
[TABLE]
and are independent random matrices, i.e. . The entries of and are assumed to be independent
[TABLE]
and with exponential distribution
[TABLE]
with deterministic and positive quantities. At this point, some comments regarding the choice of the prior on and the hyperprior on are in order. From (10) it is clear that takes values close to zero with high probability if the product is large. Moreover, if is very large for some and , for all , such that is large then with take values close to zero with high probability. Accordingly, the different modules in the graph will have a similar sparsity pattern with high probability, so that prior (10) assigns high probability to QKP graphical models. The hyperprior in (12) guarantees that and diverge with probability zero. As we will see, this assumption guarantees that the optimization procedure that we propose is well-posed.
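The effect of a large product of hyperparameters can be made concrete with a short computation. The sketch below assumes that the entry of the concentration matrix in block (j, k), position (i, l), has a Laplace density with rate `lam[j][k] * gam[i][l]`; this indexing convention, the names `lam` and `gam`, and all numerical values are assumptions made for illustration. It also compares the number of free hyperparameters of two symmetric factor matrices with that of a single symmetric matrix having one hyperparameter per entry.

```python
import math

def laplace_rate(lam, gam, j, k, i, l):
    """Rate of the assumed Laplace prior for block (j, k), position (i, l)."""
    return lam[j][k] * gam[i][l]

def prob_abs_below(rate, t):
    """P(|s| <= t) for the density (rate / 2) * exp(-rate * |s|)."""
    return 1.0 - math.exp(-rate * t)

def free_entries(n):
    """Number of free entries of a symmetric n x n matrix."""
    return n * (n + 1) // 2

lam = [[1.0, 50.0], [50.0, 1.0]]   # hypothetical module-level hyperparameters
gam = [[1.0, 2.0], [2.0, 1.0]]     # hypothetical node-level hyperparameters

# Same position (0, 1) inside a diagonal block and inside an off-diagonal block.
p_within = prob_abs_below(laplace_rate(lam, gam, 0, 0, 0, 1), 0.1)   # rate 2
p_across = prob_abs_below(laplace_rate(lam, gam, 0, 1, 0, 1), 0.1)   # rate 100

m1, m2 = 4, 8                       # hypothetical sizes
n_full = free_entries(m1 * m2)      # 528: one hyperparameter per entry, up to symmetry
n_qkp = free_entries(m1) + free_entries(m2)   # 46: two small symmetric matrices
print(p_within, p_across, n_full, n_qkp)
```

A large entry of `lam` pushes every entry of the corresponding block toward zero at once, while the number of quantities to be estimated grows with the sizes of the two factors rather than with the size of the whole matrix.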
Next, we characterize the maximum a posteriori (MAP) estimator of (and thus also the MAP estimator of the covariance matrix by the invariance principle). The latter minimizes the negative log-likelihood
[TABLE]
where is the joint pdf of , , and . Note that,
[TABLE]
in particular the negative log-likelihood contains the prior (10) inducing sparsity on intra-group/modules. Moreover, we have
[TABLE]
where the neglected terms do not depend on , and . It is clear that the MAP estimator of depends on , , and . In what follows we assume and fixed. Then, a way to estimate and from the data is provided by the empirical Bayes approach: and are given by maximizing the marginal likelihood of which is obtained by integrating out in (14), [19]. However, it is not easy to find an analytical expression for the marginal likelihood in this case. An alternative simplified approach for optimizing , is the generalized maximum likelihood (GML) method, [20]. According to this method, , and jointly minimize the negative log-likelihood:
[TABLE]
Since the joint optimization of the three variables is still a hard problem, we propose an iterative three-step procedure. At each iteration we solve the following three optimization problems:
[TABLE]
It is possible to prove that Problems (17), (18) and (19) do admit a unique solution. Moreover, the resulting procedure is illustrated in Algorithm 3.
In the aforementioned algorithm the hyperparameters selection is performed iteratively through Step 5. It is worth noting that Algorithm 3 is similar to Algorithm 1 and Algorithm 2: the main difference is that in the proposed algorithm we have two types of hyperparameters. As a consequence, in the proposed algorithm we have three optimization steps instead of two.
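The update formulas of Algorithm 3 are not reproduced in this text, so the following is only a sketch of what one hyperparameter step can look like under assumptions that may differ from the paper: the entry in block (j, k), position (i, l), has a Laplace prior with rate `lam[j][k] * gam[i][l]`; the entries of `lam` have independent exponential hyperpriors with rate `eps1`; all entries of the matrix are counted, with no special treatment of symmetry. Under these assumptions the terms of the negative log posterior that depend on `lam[j][k]` are `-m2**2 * log(lam[j][k]) + lam[j][k] * (c + eps1)`, with `c` the `gam`-weighted sum of absolute values in the block, and the minimizer is `m2**2 / (c + eps1)`. All names and values are hypothetical.

```python
import math

def update_lam(s_abs, gam, m1, m2, eps1):
    """Closed-form minimizer over lam, with the concentration matrix and gam fixed."""
    lam = [[0.0] * m1 for _ in range(m1)]
    for j in range(m1):
        for k in range(m1):
            # gam-weighted sum of absolute values inside block (j, k)
            c = sum(gam[i][l] * s_abs[j * m2 + i][k * m2 + l]
                    for i in range(m2) for l in range(m2))
            lam[j][k] = m2 * m2 / (c + eps1)
    return lam

def lam_cost(x, c, m2, eps1):
    """Terms of the assumed negative log posterior that depend on one entry x of lam."""
    return -m2 * m2 * math.log(x) + x * (c + eps1)

m1, m2, eps1 = 2, 2, 0.1            # hypothetical sizes and hyperprior rate
gam = [[1.0, 1.0], [1.0, 1.0]]
# Entrywise absolute values of a block-diagonal concentration matrix.
s_abs = [[2.0, 0.5, 0.0, 0.0],
         [0.5, 2.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.2],
         [0.0, 0.0, 0.2, 1.0]]
lam = update_lam(s_abs, gam, m1, m2, eps1)
print(lam)
```

In this sketch the off-diagonal blocks are identically zero, so the corresponding entries of `lam` are as large as the hyperprior allows; with `eps1 = 0` they would be infinite, which is one way to read the remark that the hyperprior keeps the optimization well-posed. The analogous step for `gam` swaps the roles of the two index pairs.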
V INITIAL CONDITIONS
In Algorithm 3 we have to fix the initial conditions for the hyperparameters, that is and . The idea is to approximate through a Kronecker product, then the two matrices of this product are used to initialize and .
Given , we want to find and of dimension and , respectively, with positive entries such that where is chosen sufficiently small. The presence of the term allows us to take the entrywise logarithm on both sides, obtaining
[TABLE]
where and . It is not difficult to see that the following relations hold:
[TABLE]
Then, we can write (V) as where
[TABLE]
Accordingly, can be found by solving the least squares problem , therefore . From we recover and . Note that and computed from are not symmetric matrices. Thus, to compute and from and we force the symmetric structure: and . At this point, it is worth noting that provides roughly the order of magnitude of . On the other hand, the hyperparameters and provide the prior information about the order of magnitude of : the larger is, the closer is to zero. Accordingly, we choose and such that:
[TABLE]
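The log-domain least squares fit described in this section can be sketched as follows. Taking entrywise logarithms turns the Kronecker product into an additive model, whose least squares fitted values are "block mean + position mean - grand mean". How the constant is split between the two factors, the symmetrization by averaging with the transpose, and all names and test values are assumptions of this sketch; the final mapping from the fitted factors to the initial hyperparameters is not shown.

```python
import math

def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]

def symmetrize(a):
    """Average a square matrix with its transpose."""
    n = len(a)
    return [[0.5 * (a[i][j] + a[j][i]) for j in range(n)] for i in range(n)]

def log_kron_fit(y, m1, m2):
    """Fit y ~ kron(a, b) in the log domain; y must have positive entries."""
    logy = [[math.log(v) for v in row] for row in y]
    grand = sum(map(sum, logy)) / (m1 * m2) ** 2
    # Mean of log y over each block (j, k), minus half of the grand mean.
    u = [[sum(logy[j * m2 + i][k * m2 + l]
              for i in range(m2) for l in range(m2)) / m2 ** 2 - grand / 2
          for k in range(m1)] for j in range(m1)]
    # Mean of log y over each within-block position (i, l), minus half of the grand mean.
    v = [[sum(logy[j * m2 + i][k * m2 + l]
              for j in range(m1) for k in range(m1)) / m1 ** 2 - grand / 2
          for l in range(m2)] for i in range(m2)]
    a = [[math.exp(t) for t in row] for row in u]
    b = [[math.exp(t) for t in row] for row in v]
    return symmetrize(a), symmetrize(b)

# Hypothetical test case: an exact Kronecker product is recovered up to rounding.
a0 = [[1.0, 0.2], [0.2, 2.0]]
b0 = [[3.0, 0.5, 0.1], [0.5, 1.0, 0.4], [0.1, 0.4, 2.0]]
y = kron(a0, b0)
a_fit, b_fit = log_kron_fit(y, 2, 3)
y_fit = kron(a_fit, b_fit)
max_err = max(abs(p - q) for rp, rq in zip(y, y_fit) for p, q in zip(rp, rq))
print(max_err)
```

The individual factors are only determined up to a reciprocal scaling (multiplying one by a constant and dividing the other by it leaves the product unchanged), so only the product `kron(a_fit, b_fit)` should be compared with the target.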
VI SIMULATION RESULTS
We consider a Monte Carlo experiment structured as follows:
- We generate QKP graphical models with modules and each module contains nodes. For each model, and are generated randomly. The fraction of edges is set equal to for both and ;
- For each model we generate a finite-length realization , with , and we compute the sample covariance .
- For each realization we consider the following estimators:
  - S1 estimator: it computes a sparse graphical model by using Algorithm 1, in this case we have one scalar hyperparameter;
  - S2 estimator: it computes a sparse graphical model by using Algorithm 2, in this case we have variables in the hyperparameter matrix;
  - QKP estimator: it computes a QKP graphical model with modules and each module has nodes, in this case the total number of variables in the hyperparameters matrices is .
- For each realization, we compute the relative error in reconstructing the concentration matrix and the relative error in reconstructing the sparsity pattern using S1, S2 and QKP. For instance, the relative error in reconstructing concentration matrix using QKP estimator is
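$$ e_{\mathrm{QKP}} = \frac{\| \hat{S}_{\mathrm{QKP}} - S \|_F}{\| S \|_F} $$
where $S$ denotes the concentration matrix of the true model, while $\hat{S}_{\mathrm{QKP}}$ is the estimated concentration matrix; here, $\| \cdot \|_F$ denotes the Frobenius norm. The relative error in reconstructing the sparsity pattern using QKP estimator is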
[TABLE]
where and denote the set of edges of the true graphical model and the one estimated, respectively.
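The two performance indices can be computed along the following lines. The Frobenius-norm relative error is the standard one; the sparsity-pattern index used here (number of misclassified node pairs divided by the number of possible edges) is only a hypothetical stand-in, since the exact definition used in the paper is not recoverable from this text. The threshold `tol` and the toy matrices are assumptions as well.

```python
import math

def relative_error(s_true, s_hat):
    """Frobenius norm of the estimation error divided by that of the true matrix."""
    num = math.sqrt(sum((a - b) ** 2
                        for ra, rb in zip(s_true, s_hat) for a, b in zip(ra, rb)))
    den = math.sqrt(sum(a ** 2 for ra in s_true for a in ra))
    return num / den

def edge_set(s, tol=1e-8):
    """Pairs (j, k), j < k, whose entry exceeds tol in absolute value."""
    m = len(s)
    return {(j, k) for j in range(m) for k in range(j + 1, m) if abs(s[j][k]) > tol}

def sparsity_error(s_true, s_hat, tol=1e-8):
    """Hypothetical index: misclassified pairs over the number of possible edges."""
    m = len(s_true)
    mismatched = edge_set(s_true, tol) ^ edge_set(s_hat, tol)
    return len(mismatched) / (m * (m - 1) / 2)

s_true = [[2.0, -1.0, 0.0], [-1.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
s_hat = [[2.0, -0.5, 0.1], [-0.5, 2.0, 0.0], [0.1, 0.0, 1.0]]
err_matrix = relative_error(s_true, s_hat)
err_pattern = sparsity_error(s_true, s_hat)
print(err_matrix, err_pattern)
```

In the toy case the estimate has one spurious edge out of three possible pairs, so the pattern index is one third, while the matrix error reflects both the spurious entry and the shrunk true entry.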
Figure 3 depicts the set of edges estimated using the three estimators in a realization of the Monte Carlo experiment. As we see, only QKP provides a structure similar to the true model. Figure 4 shows the boxplot of the relative error in reconstructing the sparsity pattern (left panel) and the concentration matrix (right panel). As we can see, the worst performance is given by S1, while the best performance is achieved by QKP. In particular, the relative error of the estimated sparsity pattern for QKP is very small compared to the other two methods. The poor performance of S1 is due to the fact that a single scalar hyperparameter is not sufficient to capture the correct structure of the graph. In contrast, the poor performance of S2 is due to overfitting in the estimation of the hyperparameter matrix.
VII CONCLUSIONS
We have introduced Quasi-Kronecker Product graphical models wherein the nodes are grouped in modules having the same number of nodes. The interactions among the nodes of the same module as well as the interactions among the nodes of two modules follow a common structure. Then, we have addressed the problem of learning QKP graphical models from data using a Bayesian hierarchical model. Finally, we have compared the proposed procedure with Bayesian learning techniques for estimating sparse graphical models: simulation evidence showed the effectiveness of the proposed method.
REFERENCES
- [1] S. Lauritzen, Graphical Models. Oxford: Oxford University Press, 1996.
- [2] J. Z. Huang, N. Liu, M. Pourahmadi, and L. Liu, “Covariance matrix selection and estimation via penalised normal likelihood,” Biometrika, vol. 93, no. 1, pp. 85–98, 2006.
- [3] O. Banerjee, L. E. Ghaoui, and A. d’Aspremont, “Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data,” Journal of Machine Learning Research, vol. 9, pp. 485–516, 2008.
- [4] J. Friedman, T. Hastie, and R. Tibshirani, “Sparse inverse covariance estimation with the graphical lasso,” Biostatistics, vol. 9, no. 3, pp. 432–441, 2008.
- [5] A. d’Aspremont, O. Banerjee, and L. El Ghaoui, “First-order methods for sparse covariance selection,” SIAM Journal on Matrix Analysis and Applications, vol. 30, no. 1, pp. 56–66, 2008.
- [6] J. Songsiri and L. Vandenberghe, “Topology selection in graphical models of autoregressive processes,” J. Mach. Learning Res., vol. 11, pp. 2671–2705, 2010.
- [7] M. Zorzi and R. Sepulchre, “AR identification of latent-variable graphical models,” IEEE Transactions on Automatic Control, vol. 61, pp. 2327–2340, Sep. 2016.
- [8] N. B. Asadi, I. Rish, K. Scheinberg, D. Kanevsky, and B. Ramabhadran, “MAP approach to learning sparse Gaussian Markov networks,” in Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on, pp. 1721–1724, 2009.
