Combining Smoothing Spline with Conditional Gaussian Graphical Model for Density and Graph Estimation
Runfei Luo, Anna Liu, Yuedong Wang

TL;DR
This paper introduces a semiparametric framework combining smoothing splines and conditional Gaussian graphical models for joint density and graph estimation, enabling flexible modeling of high-dimensional data without assuming joint Gaussianity.
Contribution
It develops a novel combined approach using SS ANOVA and cGGM for simultaneous density and graph estimation with theoretical guarantees.
Findings
Effective in high-dimensional settings
Accurate edge selection via geometric inference
Competitive performance in simulations and real data
Abstract
Multivariate density estimation and graphical models play important roles in statistical learning. The estimated density can be used to construct a graphical model that reveals conditional relationships whereas a graphical structure can be used to build models for density estimation. Our goal is to construct a consolidated framework that can perform both density and graph estimation. Denote as the random vector of interest with density function . Splitting into two parts, and writing where is the density function of and is the conditional density of . We propose a semiparametric framework that models nonparametrically using a smoothing spline ANOVA (SS ANOVA) model and parametrically using a conditional Gaussian graphical model…
| KL | Method |
|
|
|
|
||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cSScGG | 0.030 (0.023) | 0.043 (0.021) | 0.062 (0.073) | 0.046 (0.026) | |||||||||
| SKDE | 0.208 (0.172) | 0.181 (0.190) | 0.503 (0.414) | 0.141 (0.091) | |||||||||
| QUIC | 0.225 (0.023) | 0.243 (0.012) | 3.313 (0.051) | 3.528 (0.021) | |||||||||
| MLE | 0.182 (0.021) | 0.225 (0.010) | 3.128 (0.047) | 3.487 (0.023) | |||||||||
| cSScGG_CV | 1.145 (0.202) | 1.118 (0.189) | 1.040 (0.138) | 1.078 (0.175) | |||||||||
| cSScGG_LOOKL | 1.143 (0.164) | 1.098 (0.167) | 1.14 (0.172) | 1.112 (0.161) | |||||||||
| SKDE | 1.621 (0.184) | 1.632 (0.218) | 1.425 (0.173) | 1.474 (0.215) | |||||||||
| QUIC | 1.196 (0.125) | 1.163 (0.147) | 1.179 (0.141) | 1.156 (0.139) | |||||||||
| MLE | 1.827 (0.245) | 1.613 (0.236) | 2.235 (0.413) | 1.607 (0.235) | |||||||||
| cSScGG_CV | 1.175 (0.205) | 1.161 (0.189) | 1.102 (0.155) | 1.124 (0.178) | |||||||||
| cSScGG_CV_NT | 1.262 (0.208) | 1.268 (0.173) | 1.032 (0.125) | 1.051 (0.120) | |||||||||
| cSScGG_LOOKL | 1.173 (0.167) | 1.141 (0.166) | 1.202 (0.186) | 1.158 (0.164) | |||||||||
| cSScGG_LOOKL_NT | 1.358 (0.241) | 1.325 (0.211) | 1.153 (0.184) | 1.122 (0.166) | |||||||||
| SKDE | 1.829 (0.256) | 1.813 (0.331) | 1.928 (0.455) | 1.615 (0.234) | |||||||||
| QUIC | 1.422 (0.132) | 1.405 (0.148) | 4.492 (0.15) | 4.684 (0.137) | |||||||||
| MLE | 2.009 (0.245) | 1.838 (0.237) | 5.363 (0.404) | 5.094 (0.232) |
| cSScGG | QUIC | NPN | |||||||||||||||||||||||||
| SPE | SEN | F1 | SPE | SEN | F1 | SPE | SEN | F1 | |||||||||||||||||||
| Among | |||||||||||||||||||||||||||
| n=200 |
|
|
|
|
|
|
|
|
|
||||||||||||||||||
| n=300 |
|
|
|
|
|
|
|
|
|
||||||||||||||||||
| Among | |||||||||||||||||||||||||||
| n=200 |
|
|
|
|
|
|
|
|
|
||||||||||||||||||
| n=300 |
|
|
|
|
|
|
|
|
|
||||||||||||||||||
| Between and | |||||||||||||||||||||||||||
| n=200 |
|
|
|
|
|
|
|
|
|
||||||||||||||||||
| n=300 |
|
|
|
|
|
|
|
|
|
||||||||||||||||||
| Overall | |||||||||||||||||||||||||||
| n=200 |
|
|
|
|
|
|
|
|
|
||||||||||||||||||
| n=300 |
|
|
|
|
|
|
|
|
|
||||||||||||||||||
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsStatistical Methods and Inference · Bayesian Methods and Mixture Models · Gaussian Processes and Bayesian Inference
\stackMath
Combining Smoothing Spline with Conditional Gaussian Graphical Model for
Density and Graph Estimation
Runfei Luo, Anna Liu and Yuedong Wang Runfei Luo (email: [email protected]) received her Ph.D. in statistics from University of California, Santa Barbara. She is now an applied scientist in Amazon Web Services. Anna Liu (email: [email protected]) is Associate Professor, Department of Mathematics and Statistics, University of Massachusetts, Amherst, Massachusetts 01002., Yuedong Wang (email: [email protected]) is Professor, Department of Statistics and Applied Probability, University of California, Santa Barbara, California 93106. Anna Liu’s research was supported by a grant from the National Science Foundation (DMS-1507078). Runfei Luo and Yuedong Wang’s research was supported by a grant from the National Science Foundation (DMS-1507620). We acknowledge support from the Center for Scientific Computing from the CNSI, MRL: an NSF MRSEC (DMR-1720256) for their support. Address for correspondence: Yuedong Wang, Department of Statistics and Applied Probability, University of California, Santa Barbara, California 93106.
Abstract
Multivariate density estimation and graphical models play important roles in statistical learning. The estimated density can be used to construct a graphical model that reveals conditional relationships whereas a graphical structure can be used to build models for density estimation. Our goal is to construct a consolidated framework that can perform both density and graph estimation. Denote as the random vector of interest with density function f(\mbox{\bm{z}}). Splitting into two parts, and writing f(\mbox{\bm{z}})=f(\mbox{\bm{x}})f(\mbox{\bm{y}}|\mbox{\bm{x}}) where f(\mbox{\bm{x}}) is the density function of and f(\mbox{\bm{y}}|\mbox{\bm{x}}) is the conditional density of \bm{Y}|\bm{X}=\mbox{\bm{x}}. We propose a semiparametric framework that models f(\mbox{\bm{x}}) nonparametrically using a smoothing spline ANOVA (SS ANOVA) model and f(\mbox{\bm{y}}|\mbox{\bm{x}}) parametrically using a conditional Gaussian graphical model (cGGM). Combining flexibility of the SS ANOVA model with succinctness of the cGGM, this framework allows us to deal with high-dimensional data without assuming a joint Gaussian distribution. We propose a backfitting estimation procedure for the cGGM with a computationally efficient approach for selection of tuning parameters. We also develop a geometric inference approach for edge selection. We establish asymptotic convergence properties for both the parameter and density estimation. The performance of the proposed method is evaluated through extensive simulation studies and two real data applications.
KEY WORDS: cross-validation, high dimensional data, penalized likelihood, reproducing kernel Hilbert space, smoothing spline ANOVA
1 Introduction
Density estimation has long been a subject of paramount interest in statistics. Many parametric, nonparametric, and semiparametric methods have been developed in the literature. Assuming a known distribution family with succinct representation and interpretable parameters, the parametric approach is in general statistically and computationally efficient [Kendall1987]. However, the parametric assumption may be too restrictive for some applications. The nonparametric approach, on the other hand, does not assume a specific form for the density function and allows its shape to be decided by data. Methods such as kernel estimation [parzen1962estimation, silverman2018density], local likelihood estimators [loader1996local], and smoothing splines [gu2013smoothing] work well for low dimensional multivariate density functions. When the dimension is moderate to large, existing nonparametric methods break down quickly due to the curse of dimensionality and/or computationally limitations. \citeasnounduong2007ks pointed out that the kernel density estimation is not applicable to random variables of dimension higher than six. To reduce the computational burden, \citeasnounjeon2006effective and \citeasnoungu2013nonparametric developed pseudo likelihood method for smoothing spline density estimation. However, our experience indicates that the computation become almost infeasible when the dimension is higher than twelve. Consequently, contrary to the univariate case, flexible methods for multivariate density estimation are rather limited when the dimension is large. Recent work using piecewise constant and Bayesian partitions represents a major breakthrough in this area [lu2013multivariate, liu2014multivariate, li2016density]. Nevertheless, these methods can handle moderate dimensions only, lead to non-smooth density estimates, and cannot be used to investigate the conditional relationship.
Some semiparametric methods have been proposed to take advantage of the parsimony of parametric models and the flexibility of nonparametric modeling. Semiparametric copula models consist of nonparametric marginal distributions and parametric copula functions [genest1995]. Projection pursuit density estimation overcomes the curse of dimensionality by representing the joint density as a product of some smooth univariate functions of carefully selected linear combinations of variables [friedman1984projection]. The regularized derivative expectation operator (rodeo) method assumes the joint density equals a product of a parametric component and a nonparametric function of an unknown subset of variables [liu2007sparse]. Other semiparametric/nonparametric methods for density estimation include mixture models [richardson1997bayesian], forest density [liu2011forest], density tree [Ram11], and geometric density estimation [Dunson16]. All existing semiparametric/nonparametric methods have strengths and limitations. We will develop a new semiparametric procedure for multivariate density estimation that explores the sparse graph structure in the parametric part of the model.
Graphical models are used to characterize conditional relationship between variables with a wide range of applications in natural sciences, social sciences, and economics [lauritzen1996graphical, fan2016overview, friedman2008sparse]. Gaussian graphical model (GGM) is one of the most popular models where conditional independence is reflected in the zero entries of the precision matrix [friedman2008sparse]. The resulting structure from a GGM can be erroneous when the true distribution is far from Gaussian. The dependence structure of non-Gaussian data has not received great attention until recent years. Robustified Gaussian and elliptical graphical models against possible outliers were studied by \citeasnounmiyamura2006robust, \citeasnounfinegold2011robust, \citeasnounvogel2011elliptical, and \citeasnounsun2012robust. Graphical models based on generalized linear models were proposed by \citeasnounlee2007efficient, \citeasnounhofling2009estimation, \citeasnounravikumar2010high, \citeasnounallen2012log, and \citeasnounyang2012graphical. Nonparametric and semiparametric approaches have also been considered. \citeasnounjeon2006effective and \citeasnoungu2013nonparametric applied SS ANOVA dendity models to estimate graphs (see Section 3 for details). The computation of this nonparametric approach becomes prohibitive for large dimensions. \citeasnounliu2009nonparanormal, \citeasnounliu2012high, and \citeasnounxue2012regularized developed an elegant nonparanormal model which assumes that there exists a monotone transformation to each variable such that the joint distribution after transformation is multivariate Gaussian. Then any established estimation methods for the GGM can be applied to the transformed variables. Other semiparametric/nonparametric methods include graphical random forests [fellinghauer2013stable], regularized score matching [lin2018methods], and kernel partial correlation [oh2017graphical].
The goal of this article is to build a semiparametric model that combines the GGM with the SS ANOVA density model. We are interested in both density and graph estimation. The remainder of the article is organized as follows. In Section 2 we introduce the semiparametric density model and methods for estimation and computation. We propose methods for graph estimation in Section 3. Sections 4 presents theoretical properties of our methods in term of both density and graph estimation. In Section 5 we evaluate our method using simulation studies. In Section 6 we present applications to two real datasets. Some technical details are gathered in the Appendix.
2 Density Estimation with SS ANOVA and cGGM
2.1 Semiparametric Density Models with SS ANOVA and cGGM
Consider the density estimation problem in which we are given a random sample of a random vector , and we wish to estimate the density function of . Let \bm{Z}=(\bm{X}\mbox{{}^{T}},\bm{Y}\mbox{{}^{T}})\mbox{{}^{T}} where is a -dimensional random vector for which the density function will be modeled nonparametrically and collects elements for which the conditional density will be modeled parametrically. The joint density function f(\mbox{\bm{z}}) can be decomposed into two components:
[TABLE]
We will model f(\mbox{\bm{x}}) and f(\mbox{\bm{y}}|\mbox{\bm{x}}) using SS ANOVA models and cGGMs respectively. We now provide details of these models.
Assume where which is an arbitrary set. To deal with the positivity and unity constraints of a density function, we consider the logistic transform f=e^{\eta}/\mbox{\int_{\mathcal{X}}}e^{\eta}d\mbox{\bm{x}} where \eta(\mbox{\bm{x}}) is referred to as the logistic density function [gu2013smoothing]. We construct a model space for using the tensor product of reproducing kernel Hilbert spaces (RKHS). The SS ANOVA decomposition of functions in the tensor product RKHS can be represented as
[TABLE]
where ’s are main effects, ’s are two-way interactions, and the rest are higher order interactions involving more than two variables. Higher order interactions are often removed in (2) for more tractable estimation and inference. An SS ANOVA model for the logistic density function assumes that belongs to an RKHS which contains a subset of components in the SS ANOVA decomposition (2). For a given SS ANOVA model, terms included in the model can be regrouped and the model space can be expressed as
[TABLE]
where is a finite dimensional space collecting all functions that are not going to be penalized, and are orthogonal RKHS’s with reproducing kernels (RK) for . Details about the SS ANOVA model can be found in \citeasnoungu2013smoothing and \citeasnounwang2011smoothing.
We assume a cGGM for f(\mbox{\bm{y}}|\mbox{\bm{x}}). Specifically, we assume that \bm{Y}|\bm{X}=\mbox{\bm{x}}\sim\text{N}(-\Lambda\mbox{{}^{-1}}\Theta\mbox{{}^{T}}\mbox{\bm{x}},\Lambda\mbox{{}^{-1}}) where is a precision matrix and is a matrix that parameterizes the conditional relationship between and [sohn2012joint, wytock2013sparse, yuan2014partial]. We note that the negative log likelihood function is convex under this parameterization. An alternative assumption \bm{Y}|\bm{X}=\mbox{\bm{x}}\sim\text{N}(\Psi\mbox{\bm{x}},\Lambda\mbox{{}^{-1}}) [yin2011sparse] may be used to model the conditional density f(\mbox{\bm{y}}|\mbox{\bm{x}}) where the negative log likelihood function is biconvex in and rather than jointly convex.
We will refer to the proposed semiparametric model as combined smoothing spline and conditional Gaussian graphical (cSScGG) model. The cSScGG model is closely related to the semiparametric kernel density estimation (SKDE) proposed by \citeasnounhoti2004semiparametric. The same decomposition in (1) was considered. Given an iid sample \bm{Z}_{i}=(\bm{X}_{i}\mbox{{}^{T}},\bm{Y}_{i}\mbox{{}^{T}})\mbox{{}^{T}}, , \citeasnounhoti2004semiparametric estimated f(\mbox{\bm{x}}) using the kernel density, \hat{f}(\mbox{\bm{x}})=n^{-1}\sum_{i=1}^{n}K_{h_{1}}(\mbox{\bm{x}}-\bm{X}_{i}), and f(\mbox{\bm{y}}|\mbox{\bm{x}}) using the conditional Gaussian density with mean \mu(\mbox{\bm{x}}) and covariance \Sigma(\mbox{\bm{x}}). Specifically, they estimated \mu(\mbox{\bm{x}}) and covariance \Sigma(\mbox{\bm{x}}) by \hat{\mu}(\mbox{\bm{x}})=\sum_{i=1}^{n}W_{h_{2}}(\mbox{\bm{x}}-\bm{X}_{i})\bm{Y}_{i} and \hat{\Sigma}(\mbox{\bm{x}})=\sum_{i=1}^{n}W_{h_{3}}(\mbox{\bm{x}}-\bm{X}_{i})(\bm{Y}_{i}-\hat{\mu}(\mbox{\bm{x}}))(\bm{Y}_{i}-\hat{\mu}(\mbox{\bm{x}}))\mbox{{}^{T}} respectively, where K_{h}(\mbox{\bm{x}})=h^{-d}K(\mbox{\bm{x}}/h), is the symmetric Gaussian kernel function, W_{h}(\bm{x}-\bm{X}_{i})=K_{h}(\mbox{\bm{x}}-\bm{X}_{i})/\sum_{j=1}^{n}K_{h}(\mbox{\bm{x}}-\bm{X}_{i}), and , , and are bandwidths. Selection of bandwidths can be difficult and the estimation of conditional mean and covariance can be poor when the dimension of is large. The authors focused on the classification problem. They set W_{h_{2}}(\mbox{\bm{x}}-\bm{X}_{i})=W_{h_{3}}(\mbox{\bm{x}}-\bm{X}_{i})=1/n in their simulations to make the computation feasible. Under these weights the estimated conditional density does not depend on at all. In contrast, we model using a cGGM which will allow us to explore sparsity in the conditional dependence structure. In addition, the domain in our model is an arbitrary set while the domain in the SKDE method is a subset of . While we focus on continuous in this paper, the discrete case is a natural extension of the current work.
2.2 Penalized Likelihood Estimation
A cSScGG model consists of three parameters: and matrices and where is an RKHS given in (3) and is positive definite. Given an iid sample \bm{Z}_{i}=(\bm{X}_{i}\mbox{{}^{T}},\bm{Y}_{i}\mbox{{}^{T}})\mbox{{}^{T}}, , let X=(\bm{X}_{1},\dots,\bm{X}_{n})\mbox{{}^{T}}, Y=(\bm{Y}_{1},\dots,\bm{Y}_{n})\mbox{{}^{T}}, \mbox{S_{xx}}=n^{-1}\mbox{X}\mbox{{}^{T}}\mbox{X}, \mbox{S_{yy}}=n^{-1}\mbox{Y}\mbox{{}^{T}}\mbox{Y}, and \mbox{S_{xy}}=n^{-1}\mbox{X}\mbox{{}^{T}}\mbox{Y}. Denote
[TABLE]
as the negative log pseudo likelihood and negative log likelihood functions based on and samples respectively, where some constants are ignored and is a known density for the pseudo likelihood [gu2013smoothing]. The function is continuous, convex and Fréchet differentiable [jeon2006effective], and the function l_{2}(\mbox{\Theta},\mbox{\Lambda}) is jointly convex in and .
We estimate , and as minimizers of the penalized likelihood:
[TABLE]
where is a semi-norm in that penalizes departure from the null space , denotes the elementwise -norm, denotes the elementwise -norm on off-diagonal entries, and indicates positive definiteness of . Together, \left\lVert\mbox{\Lambda}\right\rVert_{1,\text{off}} and \left\lVert\mbox{\Theta}\right\rVert_{1} encourage sparsity for the cGGM. We allow different tuning parameters for different penalties.
Note that the first part of the penalized likelihood depends on only and the second part depends on and only. Therefore, we can compute the penalized likelihood estimates by solving two optimization problems separately:
[TABLE]
and
[TABLE]
As in \citeasnoungu2013smoothing, we approximate the solution of (7) by a linear combination of basis functions in and a random subset of representers. Then the estimate can be calcuated using the Newton-Raphson algorithm. The smoothing parameter is selected as the minimizer of an approximated cross-validation estimate of the Kullback-Leibler (KL) divergence. Details can be found in \citeasnoungu2013smoothing, \citeasnoungu2013nonparametric, and \citeasnounLuothesis. In the next section we propose a new computational method for solving (8).
2.3 Backfitting Algorithm for cGGM
Instead of updating and simultaneously as in \citeasnounsohn2012joint, \citeasnounwytock2013sparse and \citeasnounyuan2014partial, we will consider a backfitting procedure to update them iteratively until convergence. We use the subscript to denote quantities calculated at iteration and to denote the -th element of a matrix .
At iteration , with being fixed at \mbox{\Lambda}_{(t)}, (8) reduces to the minimization of a quadratic function plus an penalty. Therefore, without needing to calculate the Hessian matrix, can be updated efficiently using the coordinate descent algorithm. The gradient . Denote as the covariance matrix. Then the th element is updated by
[TABLE]
where a_{\Theta}=2\mbox{\Sigma}_{jj,(t)}(\mbox{S_{xx}})_{ii}, b_{\Theta}=2(\mbox{S_{xy}})_{ij}+2(\mbox{S_{xx}}\Theta_{(t)}\mbox{\Sigma}_{(t)})_{ij}, , and is the soft-thresholding operator with threshold .
To update at iteration , we consider the approximate conditional distribution \text{N}(-\Lambda^{-1}_{(t)}\Theta\mbox{{}^{T}}_{(t)}\mbox{\bm{x}},\Lambda^{-1}) where both and in the conditional mean are fixed at their estimates from the -th iteration. The resulting negative log likelihood
[TABLE]
where terms independent of are dropped. We update by
[TABLE]
As in \citeasnounhsieh2011sparse, we will find the Newton direction by approximating using a quadratic function. Based on the second-order Taylor expansion of at where and ignoring terms independent of , we consider
[TABLE]
where \nabla h_{(t)}(\Lambda_{(t)})=\mbox{S_{yy}}+\mbox{\Sigma}_{(t)}\Theta_{(t)}^{T}\mbox{S_{xx}}\Theta_{(t)}\mbox{\Sigma}_{(t)}+2\mbox{\Sigma}_{(t)}\Theta_{(t)}^{T}\mbox{S_{xy}}-\mbox{\Sigma}_{(t)} and \nabla^{2}h_{(t)}(\Lambda_{(t)})=\mbox{\Sigma}_{(t)}\otimes\mbox{\Sigma}_{(t)} are gradient and Hessian matrices with respect to respectively, and represents the Kronecker product. The Newton direction for (11) can be written as the solution of the following regularized quadratic function [hsieh2011sparse]
[TABLE]
Equation (12) can be solved efficiently via the coordinate descent algorithm. Specifically, let be the initial value, and be the update at iteration . Then at iteration , the th element of is updated by
[TABLE]
where a_{\Lambda}=\mbox{\Sigma}^{2}_{ij,(t)}+\mbox{\Sigma}_{ii,(t)}\mbox{\Sigma}_{jj,(t)}, b_{\Lambda}=(\mbox{S_{yy}})_{ij}+\Big{(}\mbox{\Sigma}_{(t)}\Theta_{(t)}^{T}\mbox{S_{xx}}\Theta_{(t)}\mbox{\Sigma}_{(t)}\Big{)}_{ij}+2\Big{(}\mbox{\Sigma}_{(t)}\Theta_{(t)}^{T}\mbox{S_{xy}}\Big{)}_{ij}-\mbox{\Sigma}_{ij,(t)}+(\mbox{\Sigma}_{(t)}\Delta_{\Lambda,(s)}\mbox{\Sigma}_{(t)})_{ij} and . Denote the penalized objective function at the -th iteration as . We adopt the Armijo’s rule [armijo1966minimization] to find the step size . Specifically, with a constant decrease rate (typically ), step sizes for are tried until the smallest such that
[TABLE]
where is the backtracking termination threshold. After the step size is calculated, we update .
When , we use the maximum likelihood estimates \check{\Lambda}=(\mbox{S_{yy}}-S_{xy}^{T}S_{xx}^{-1}\mbox{S_{xy}})\mbox{{}^{-1}} and \check{\Theta}=-S_{xx}^{-1}\mbox{S_{xy}}\check{\Lambda} of and as initial values for and respectively [yin2011sparse]. In the high dimensional case when is not invertible, we use the identity and zero matrix as initial values for and respectively.
The regularized Newton step (12) via the coordinate descent algorithm described above is the most computational expansive part of the algorithm. Despite its efficiency for lasso type of problems, updating all variables in is costly. To relieve this problem, we divide the parameter set into an active set and a free set. As in \citeasnounhsieh2011sparse and \citeasnounwytock2013sparse, at the th iteration of the algorithm, we only update and over the active set defined by
[TABLE]
As the active set is relatively small due to sparsity induced by the regularization, this strategy provides a substantial speedup.
Tuning parameters and determine the sparsity of and . As a general selection tool, leave-one-out or -fold cross-validation can be used to select these tuning parameters. The leave-one-out cross-validation (LOOCV) can be computationally intensive and various approximations have been proposed in the literature. \citeasnounlian2011shrinkage and \citeasnounvujavcic2015computationally derived generalized approximate cross-validation (GACV) scores for selecting a single tuning parameter in the GGM. The BIC and k-fold CV have been used to select a single tuning parameter in the cGGM [yin2011sparse, sohn2012joint, wytock2013sparse, yuan2014partial, lee2012simultaneous]. LOOCV has not been used for the cGGM as it requires fitting the model times which is computationally intensive. To the best of our knowledge, there are no computationally efficient alternatives to LOOCV in the current cGGM literature. We propose a new criterion, Leave-One-Out KL (LOOKL), for selecting and involved in (8) as minimizers of
[TABLE]
where \mbox{S_{xx,k}}=\bm{X}_{k}^{T}\bm{X}_{k}, \mbox{S_{yy,k}}=\bm{Y}_{k}^{T}\bm{Y}_{k}, \mbox{S_{xy,k}}=\bm{Y}_{k}^{T}\bm{X}_{k}, \mbox{S_{xx}^{(-k)}}=1/n\sum_{i\neq k}S_{xx,i}, \mbox{S_{yy}^{(-k)}}=1/n\sum_{i\neq k}S_{yy,i}, \mbox{S_{xy}^{(-k)}}=1/n\sum_{i\neq k}S_{xy,i}, \mbox{\bm{u}}_{k}=\mbox{\mathrm{vec}}(\hat{\Lambda}\mbox{{}^{-1}}-\mbox{S_{yy,k}}+\hat{\Lambda}\mbox{{}^{-1}}\hat{\Theta}^{T}\mbox{S_{xx,k}}\mbox{\hat{\Theta}}\hat{\Lambda}\mbox{{}^{-1}}), \mbox{\bm{v}}_{xx,k}=\mbox{\mathrm{vec}}(\mbox{S_{xx}^{(-k)}}-\mbox{S_{xx}}), \mbox{\bm{v}}_{yy,k}=\mbox{\mathrm{vec}}(\mbox{S_{yy}^{(-k)}}-\mbox{S_{yy}}), \mbox{\bm{v}}_{xy,k}=\mbox{\mathrm{vec}}(\mbox{S_{xy}^{(-k)}}-\mbox{S_{xy}}), \mbox{\bm{w}}_{k}=\mbox{\mathrm{vec}}(-2\mbox{S_{xy,k}}-2\mbox{S_{xx,k}}\hat{\Theta}\mbox{\hat{\Lambda}}\mbox{{}^{-1}}), A=-2\mbox{\hat{\Lambda}}\mbox{{}^{-1}}\otimes\mbox{S_{xx}}, B=2\mbox{\hat{\Lambda}}\mbox{{}^{-1}}\otimes\mbox{S_{xx}}\mbox{\hat{\Theta}}\mbox{\hat{\Lambda}}\mbox{{}^{-1}}, C=-\mbox{\hat{\Lambda}}\mbox{{}^{-1}}\otimes(\mbox{\hat{\Lambda}}\mbox{{}^{-1}}+2\mbox{\hat{\Lambda}}\mbox{{}^{-1}}\hat{\Theta}^{T}\mbox{S_{xx}}\mbox{\hat{\Theta}}\mbox{\hat{\Lambda}}\mbox{{}^{-1}}), D=-2\mbox{\hat{\Lambda}}\mbox{{}^{-1}}\hat{\Theta}^{T}\otimes I_{d\times d}, and E=\mbox{\hat{\Lambda}}\mbox{{}^{-1}}\hat{\Theta}^{T}\otimes\mbox{\hat{\Lambda}}\mbox{{}^{-1}}\hat{\Theta}^{T}. The derivation is defered to Appendix A. Note that the GACV in \citeasnounlian2011shrinkage and KLCV in \citeasnounvujavcic2015computationally are special cases of LOOKL with . In the penalized case, we ignored the partial derivatives corresponding to the zero elements in and \citeasnounlian2011shrinkage, and showed that the LOOKL score remains the same. More details can be found in \citeasnounLuothesis. Therefore, we conjecture that the proposed score is more appropriate for density estimation, rather than model selection.
We note the proposed backfitting procedure and LOOKL method for selecting tuning parameters are new for the cGMM. When and are simultaneously updated using the second-order Taylor expansion over all parameters [wytock2013sparse], an expensive computation of the large Hessian matrix of size is required in each iteration. In contrast, our approach forms a second-order approximation of a function of which requires a Hessian matrix of size . The remaining set of parameters in can be updated easily using the simple coordinate descent algorithm. Moreover, compared to the method in \citeasnounmccarter2016large, our backfitting algorithm eliminates the need for computing the large matrix \Sigma\Theta\mbox{{}^{T}}\mbox{S_{xx}}\Theta\Sigma in time. Note that we always require to be positive-definite after each iteration, so the algorithm still has complexity flops due to the Cholesky factorization.
Some off-the-shelf packages are utilized to solve the optimization problem. Specifically, we use QUIC [hsieh2014quic] for updating , and gss [gu2014smoothing] for computing the smoothing spline estimate of f(\mbox{\bm{x}}). We write R code for updating using (9). We note that other penalities such as the smoothly clipped absolute deviation (SCAD) [fan2001variable] and adaptive lasso [zou2006adaptive] may be used to replace the penalty in the estimation of and . Details can be found in \citeasnounLuothesis.
3 Graph Estimation with cSScGG Models
In Section 2 we proposed the cSScGG model as a flexible framework for estimating the multivariate density in high-dimensional setting. In terms of the graph structure, the edges among are identified by , and edges between and are identified by [sohn2012joint, wytock2013sparse, yuan2014partial]. The remaining task is the identification of conditional independence within variables which is the target of this section.
We have assumed that the model space for the logistic density contains a subset of components in the SS ANOVA decomposition (2). The interactions are often truncated to overcome the curse of dimensionality and reduce the computational cost. As in \citeasnoungu2013smoothing and \citeasnoungu2013nonparametric, in this section we consider the SS ANOVA model with all main effects and two-way interactions as the model space for . We note that the SS ANOVA model allows pairwise nonparametric interactions as opposed to linear interactions in the GGM. \citeasnoungu2013smoothing and \citeasnoungu2013nonparametric proposed the squared error projection for accessing importance of each interaction term and subsequently identify edges. However, we cannot apply their method directly to to identify edges within since the cGGM for f(\mbox{\bm{y}}|\mbox{\bm{x}}) also includes interaction terms among variables in .
The logarithm of the joint density
[TABLE]
where is a constant independent of and . The main challenge in identifying conditional independence among comes from the fact that f(\mbox{\bm{y}}|\mbox{\bm{x}}) brings in an extra term, -\mbox{\bm{x}}\mbox{{}^{T}}\mbox{\Theta}\mbox{{}^{T}}\mbox{\Lambda}\mbox{{}^{-1}}\mbox{\Theta}\mbox{\bm{x}}/2, into the interactions among . Let
[TABLE]
where \hat{\Delta}(\mbox{\bm{x}})=-\mbox{\bm{x}}^{T}\hat{\mbox{\Theta}}\mbox{{}^{T}}\hat{\mbox{\Lambda}}\mbox{{}^{-1}}\hat{\mbox{\Theta}}\mbox{\bm{x}}/2. Define the functional
[TABLE]
and denote as . Let where collects functions whose contribution to the overall model is of question. The squared error projection of in is [gu2013smoothing]
[TABLE]
\tilde{V}(\mbox{\hat{\zeta}}-\zeta) can be regarded as a proxy of the symmetrized KL divergence [gu2013smoothing]. Assuming \mbox{\zeta_{u}}=-\log\mbox{\rho(\mbox{})}\in\mathcal{S}^{0}, it is easy to check that \tilde{V}(\mbox{\hat{\zeta}}-\mbox{\zeta_{u}})=\tilde{V}(\mbox{\hat{\zeta}}-\mbox{\tilde{\zeta}})+\tilde{V}(\mbox{\tilde{\zeta}}-\mbox{\zeta_{u}}). Then the ratio \tilde{V}(\mbox{\hat{\zeta}}-\mbox{\tilde{\zeta}})/\tilde{V}(\mbox{\hat{\zeta}}-\mbox{\zeta_{u}}) reflects the importance of functions in . The quantity \tilde{V}(\mbox{\hat{\zeta}}-\mbox{\zeta_{u}}) is readily computable while details for computing the squared error projection are given in Appendix B.
For any pair of variables and , consider the decomposition where is the subspace consisting of two-way interactions between and , and contains all functions in except the two-way interactions between and . Note that where \hat{\Delta}_{ij}=(\hat{\mbox{\Theta}}\mbox{{}^{T}}\hat{\mbox{\Lambda}}\mbox{{}^{-1}}\hat{\mbox{\Theta}})_{ij}. Compute the projection ratio in which is the squared error projection of in . The ratio indicates the importance of interactions between and , and we will add the interactions to the additive model sequentially according to the descending order of ’s.
Consider the space decomposition . We start with being the subspace spanned by all main effects. We calculate the projection ratio of in as where is the squared error projection in . If is larger than a threshold, is deemed important and we move the interaction with the largest from to . We then calculate the projection ratio with the updated and . The projection ratio decreases each time we move an interaction from to . Finally the process stops when falls below a cut-off value, at which time we denote the corresponding as . Let \Pi_{ij}=I\big{(}\zeta_{ij}\in\mathcal{S}_{s}^{0}\big{)} and remove the edge between and if . In our implementations, the cut-off value is set to be .
To summarize, the conditional independences among , between and , and among are characterized by the zero elements in , and , respectively. The whole procedure for edge identification is illustrated in Figure 1.
4 Theoretical Analysis
We list notations, assumptions, and theoretical results only. Proofs are given in Appendix C.
4.1 Notations and Assumptions
Given a matrix , let , and denote the operator norm , operator norm and Frobenius norm respectively, where represents the largest eigenvalue of . We assume that where and are the true parameters. Let , , , , , \mbox{C_{X}}=\max_{j=1,\dots,d}\left\lVert\bm{X}^{j}\right\rVert_{2}/\sqrt{n} where is the th columns of , denote the Hessian matrix evaluated at the true parameters, and . Let be the maximum number of non-zeros in any column of which represents the maximum degree of in the graph.
Denote and , then . In the following theoretical analysis, we assume that and . Similar arguments apply to the case of . The objective function can be rewritten as
[TABLE]
We make the following assumptions.
Assumption 1**.**
(Underlying Model) where has the maximum degree .
Assumption 2**.**
(Restricted Convexity) For any , let denote the nonzero indices of the -th column of (i.e., the edges between and ). We have , where denotes the smallest eigenvalue and represents the matrix with columns of indexed by .
Assumption 3**.**
(Mutual incoherence) Let denote the support set of in vector form where denotes the indicator function of whether an element is zero. Let denote the complement of . We have for some , where and represent the and sub-matrices of with entries in and respectively.
Assumption 4**.**
(Control of eigenvalues) There exists some constants , such that .
Assumption 1 provides the true underlying model. Assumption 2 ensures the solution of optimization problem (19) is restricted to the active set (nonzero entries in and ), which is also used in \citeasnounwainwright2009sharp and \citeasnounwytock2013sparse. Assumption 3 limits the influence of edges in inactive set () can have on the edges in active set (), and Assumption 4 bounds the eigenvalues of the precision matrix. Define V(f,g)=\mbox{\int_{\mathcal{X}}}f(\bm{x})g(\bm{x})\mbox{\rho(\mbox{})}d\bm{x} and V(f)=\mbox{\int_{\mathcal{X}}}f^{2}(\bm{x})\mbox{\rho(\mbox{})}d\bm{x}.
Assumption 5**.**
* is completely continuous with respect to and .*
Under the Assumption 5, there exists such that , , and , where is the Kronecker delta and is referred to as the eigenvalues of with respect to . Denote the Fourier series expansion of as where are the Fourier coefficients. Let where and \beta_{\nu}=n^{-1}\sum_{i=1}^{n}\{e^{-\eta_{0}(\bm{X}_{i})}\phi_{\nu}(\bm{X}_{i})-\mbox{\int_{\mathcal{X}}}\phi_{\nu}(\bm{x})\mbox{\rho(\mbox{})}d\bm{x}\}.
Assumption 6**.**
(a) The eigenvalues of with respect to satisfy for some and when is sufficiently large.
(b) There exists some constants , and such that holds uniformly for in a convex set around containing and , , and \mbox{\int_{\mathcal{X}}}\phi^{2}_{\nu}(\bm{x})\phi^{2}_{\mu}(\bm{x})e^{-\eta_{0}(\bm{x})}\mbox{\rho(\mbox{})}d\bm{x}<C_{1,4} for any and .
(c) There exists some such that .
Assumptions 5 and 6 are commonly used in smoothing spline literatures to study the convergence rate for nonparametric density estimation [gu2013smoothing] .
4.2 Asymptotic Consistency of the Estimated Parameters
The following theorem provides the estimation error bound and edge selection accuracy.
Theorem 1**.**
Suppose that the Assumptions 2 and 3 hold, , and and satisfy
[TABLE]
where , , and , then with probability greater than 1-\big{(}p^{-(\tau-2)}+(pd)^{-(\tau-1)}\big{)}, we have
The estimates satisfy the elementwise bound:
[TABLE] 2. 2.
All non-zero entries of the solution are a subset of the non-zero entries of . Furthermore, non-zero entries of includes all non-zero entries in that satisfy
[TABLE]
Remark 1: i) Theorem 1 indicates that a sample size larger than a constant times is enough for our estimation procedure to identify a subset of the true non-zero elements in the cGGM, and the resulting estimations are close to the true parameters in bound. The convergence rate is the same as that in \citeasnounwytock2013sparse, but the success probability of the primal-dual witness approach as well as the exact bounds for and are different. We also provide a lower bound for the sign consistency which is not included in \citeasnounwytock2013sparse.
ii) The convergence probability is smaller than that for the GGM [wainwright2009sharp] where only a precision matrix needs to be estimated. This is the price we pay for estimating extra parameters in .
Define as the total number of non-zero elements in off-diagonal positions of , and as the total number of non-zero elements in .
Corollary 1**.**
Under the same assumptions as in Theorem 1, with probability at least 1-\big{(}p^{-(\tau-2)}+(pd)^{-(\tau-1)}\big{)}, the estimates and satisfy
[TABLE]
Remark 2: The Frobenius norm was not studied in \citeasnounwytock2013sparse. We develop it as a building block for establishing the convergence rate for the density estimation in Section 4.3.
4.3 Convergence Rates for the Density Estimation
We first introduce a combined measure of divergence between the joint density and its estimate. Let f_{0}(\bm{x})=e^{\eta_{0}(\bm{x})}\rho(\bm{x})/\mbox{\int_{\mathcal{X}}}e^{\eta_{0}(\bm{x})}\rho(\bm{x})d\bm{x} and f_{0}(\mbox{\bm{y}}|\bm{x}) be the true densities of and with their estimates denoted as \hat{f}(\bm{x})=e^{\hat{\eta}(\bm{x})}\mbox{\rho(\mbox{})}/\mbox{\int_{\mathcal{X}}}e^{\hat{\eta}(\bm{x})}\mbox{\rho(\mbox{})}d\bm{x} and \hat{f}(\mbox{\bm{y}}|\bm{x}) respectively. The KL divergence between two density functions and are defined as . Then the symmetrized KL divergence between the true joint density f_{0}(\mbox{\bm{z}})=f_{0}(\bm{x})f_{0}(\mbox{\bm{y}}|\bm{x}) and its estimate \hat{f}(\mbox{\bm{z}})=\hat{f}(\bm{x})\hat{f}(\mbox{\bm{y}}|\bm{x}) can expressed as
[TABLE]
We will establish the asymptotic convergence rate under the following combined measure of divergence
[TABLE]
The difference between (24) and (23) lies in the divergence measures for the estimation of . Note that the rate in implies rate in , since and is a proxy of [gu2013smoothing].
We first establish the rate for \text{SKL}(f_{0}(\mbox{\bm{y}}|\mbox{\bm{x}}),\hat{f}(\mbox{\bm{y}}|\mbox{\bm{x}})) which has the explicit expression:
[TABLE]
where . We assume that the second moments of marginal densities of and exist.
Theorem 2**.**
Under the Assumption 4 and conditional on the event , we have
[TABLE]
For the smoothing spline ANOVA estimate of , under the Assumptions 5 and 6, \citeasnoungu2013smoothing showed that as and ,
[TABLE]
Finally, we have the convergence rate for the joint density estimate.
Theorem 3**.**
Suppose that the Assumptions 2-6 hold, , , , and and satisfy
[TABLE]
where , and , then with probability greater than 1-\big{(}p^{-(\tau-2)}+(pd)^{-(\tau-1)}\big{)} we have
[TABLE]
Remark 3: For low-dimensional (usually ), the computation of multivariate integrals are feasible. We may use the penalized likelihood instead of the pseudo likelihood to estimate the density function f(\mbox{\bm{x}}). This leads to f_{0}(\mbox{\bm{x}})=e^{\eta_{0}}/\mbox{\int_{\mathcal{X}}}e^{\eta_{0}}. Under similar conditions, \citeasnoungu2013smoothing has proved that the symmetrized KL divergence is also , where is the penalized likelihood estimate. If we also use the penalized likelihood to estimate in our model, then \text{SKL}\big{(}f_{0}(\mbox{\bm{z}}),\hat{f}(\mbox{\bm{z}})\big{)}=\mathcal{O}(n^{-5/2}p^{5/2}(\log pd)^{5/2}+n^{-1}p^{2}(\log pd)+n^{-1}\lambda_{1}^{-1/s}+\lambda_{1}^{q}).
5 Simulation Studies
We have conducted extensive simulation experiments to evaluate the performance of the cSScGG procedure, and compare it with some existing parametric and semiparametric/nonparametric methods. To save space, we present some simulation results and more comprehensive results can be found in \citeasnounLuothesis. We note that the cSScGG method can ourperform the maximum likelihood estimation (MLE) when is multivariate Gaussian and the cGGM for is sparse. Results for density and graph estimations are presented in Sections 5.1 and 5.2 respectively.
For density estimation, we use both LOOKL and CV (5-fold) methods to choose and . Tuning parameters involved in all other methods are chosen by 5-fold CV. For graph estimation, we select and in the cSScGG method as minimizers of the following BIC score
[TABLE]
where and \xi(\hat{\mbox{\Theta}}) are the number of non-zero off-diagonal elements in and the number of non-zero elements in \hat{\mbox{\Theta}} respectively. The degree of freedom is defined in the same way as in \citeasnounyin2011sparse. The BIC is also used to select tuning parameters in other methods for graph estimation. More details regarding comparison of various tuning parameter selection methods are included in \citeasnounLuothesis.
5.1 Density Estimation
We set , , and . We generate with , and . We consider four combinations of and : and . All results are reported based on 100 replications under each setting. In each replication, we first generate iid samples from the multivariate Gaussian mixtures, then ’s are generated from a cGGM. Specifically, we randomly create a precision matrix using the R-package huge [zhao2012huge], in which the probability of the off-diagonal elements being nonzero equals . The decomposition gives us and [yuan2014partial], so that we can sample from \mathcal{N}(-\Lambda\mbox{{}^{-1}}\Theta^{T}\bm{X}_{i},\Lambda\mbox{{}^{-1}}) for .
Since the division of non-Gaussian variables and Gaussian variables is typically unknown in practice, we consider two versions of the proposed method – plain cSScGG and cSScGG with normality test (denoted as NT). In the plain version, we assume that the true non-Gaussian components are known and apply cSScGG directly. In the NT version, we select variables with smallest p-values based on the Shapiro-Wilk test to all marginal variables as , and then apply the cSScGG method.
In addition to the cSScGG method, we estimate density using the SKDE [hoti2004semiparametric], MLE, and QUIC [hsieh2011sparse] methods. In the implementation of the SKDE method, we use the R-package ks [duong2007ks] to calculate the kernel density estimate for f(\mbox{\bm{x}}) with the bandwidth selected by the smoothed cross-validation selector with diagonal bandwidth matrices (Hscv.diag(x)) which provides the best overall performance. To avoid selecting the two extra bandwidths involved in SKDE, as in \citeasnounhoti2004semiparametric, we set f(\mbox{\bm{y}}|\mbox{\bm{x}})=f(\mbox{\bm{y}}) and use MLE to estimate f(\mbox{\bm{y}}). MLE and QUIC methods treat as multivariate normal across all settings, and the estimates from these two methods are further broken down into f(\mbox{\bm{x}}) and f(\mbox{\bm{y}}|\mbox{\bm{x}}) for comparison. Specifically, QUIC method learns the precision matrix of by forming a quadratic approximation of the log-likelihood, and the estimates are computed using the R-package QUIC [hsieh2014quic].
To evaluate the performance of different methods, we consider the KL divergence between the estimated density and the true density
[TABLE]
where is the true density, and the aggregated KL \text{E}_{\bm{X}}\Big{[}\text{KL}\Big{(}f_{0}(\mbox{\bm{y}}|\bm{X}),\hat{f}(\mbox{\bm{y}}|\bm{X})\Big{)}\Big{]} is approximated by the empirical aggregated KL divergence. Table 1 reports the overall KL divergence \text{KL}\Big{(}f_{0}(\mbox{\bm{z}}),\hat{f}(\mbox{\bm{z}})\Big{)}, the empirical aggregated KL divergence n^{-1}\sum_{i=1}^{n}\text{KL}\Big{(}f_{0}(\mbox{\bm{y}}|\bm{X}_{i}),\hat{f}(\mbox{\bm{y}}|\bm{X}_{i})\Big{)}, and \text{KL}\Big{(}f_{0}(\mbox{\bm{x}}),\hat{f}(\mbox{\bm{x}})\Big{)}. They provide evaluations for the estimation of f(\mbox{\bm{z}}), f(\mbox{\bm{y}}|\mbox{\bm{x}}), and f(\mbox{\bm{x}}), respectively.
Since cSScGG with normality test may identify different , we only include the overall KL divergence \text{KL}\Big{(}f_{0}(\mbox{\bm{z}}),\hat{f}(\mbox{\bm{z}})\Big{)} for comparison. Both versions of the cSScGG method enjoy superior performance relative to all other methods under all settings. When comparing the plain cSScGG with other methods, the differences mainly come from the estimation of f(\mbox{\bm{x}}), in which parametric methods MLE and QUIC cannot fit the data properly. The cSScGG performs much better than SKDE in both the estimation of f(\mbox{\bm{x}}) and f(\mbox{\bm{y}}|\mbox{\bm{x}}). When is fixed, the performance differences are larger under where the deviation from Gaussian is more severe. Furthermore, under a fixed , the superiority of the cSScGG methods is greater when where the deviation from Gaussian is more severe. Comparative results remain the same under other simulation settings [Luothesis].
5.2 Edge Detection
We do not consider the SKDE and MLE methods here because we they do not perform edge selection. In addition to QUIC which is parametric, we also include the nonparanormal (NPN) method [liu2009nonparanormal]. The NPN method is implemented with the R-package huge. When fitting the model, we use shrunken ECDF to transform the data first, then apply Glasso to the transformed data. The final NPN model is selected by the extended BIC score [foygel2010extended]. Given a fixed dimension , the model chosen by the EBIC method agrees with the model chosen by the BIC method. As the cSScGG method is formulated quite differently from the NPN, our main focus is to investigate the improvements that cSScGG can bring over the QUIC method which assumes normality for all variables including .
The performance is measured in three categories: among , among , and between and . Recall that for the cSScGG procedure, edges in the above categories are decided by , and , respectively (see Figure 1). We also report the overall performance based on the whole graph. All simulation results are based on replications.
We fix , , and consider two sample sizes and . We first generate both and from multivariate normals. Specifically, we first generate a sparse precision matrix , in which the probability of the off-diagonal elements being nonzero equals . Then i.i.d. samples are generated from \mathcal{N}(\bm{0},\Omega\mbox{{}^{-1}}). The decomposition leads to i.i.d. samples of and , and the decomposition leads to and . The results are presented in Table 2.
Overall, the cSScGG and QUIC methods perform better than the NPN. This is expected as the true distribution is Gaussian and the ECDF transformation leads to efficiency loss. Surprisingly, the cSScGG outperforms the QUIC in detecting edges within variables even when the normality assumption holds for the QUIC method. It suggests that the proposed projection ratio method learns the conditional independence within better than the parametric QUIC method with BIC. Furthermore, the cSScGG outperforms the QUIC in identifying edges among as well as edges between and , due to the fact that there are two penalty parameters in cSScGG, as opposed to one in QUIC. To conclude, the cSScGG method is more efficient even when the joint normality assumption holds.
6 Applications
6.1 Isoprenoid Gene Network in Arabidopsis Thaliana
We consider the gene expression data for Arabidopsis thaliana introduced by \citeasnounwille2004sparse. Arabidopsis thaliana is the first plant to have its genome sequenced, and is a popular model in the study of molecular biology and genetics. The dataset contains observations of Affymetrix GeneChip microarrays, in which the expression levels of genes are recorded. All values are preprocessed by log-transformation and standardization. This data has been analyzed by \citeasnounlafferty2012sparse to explore the structure using the nonparanormal model. As in \citeasnounlafferty2012sparse, we consider a subset of genes from the isoprenoid pathway 111The dataset was downloaded from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC545783/. We note that while there were genes in \citeasnounwille2004sparse and \citeasnounlafferty2012sparse, this dataset contains only..
Our goal is to construct a graph using the proposed cSScGG procedure and compare its structure with those from Glasso [friedman2008sparse] and nonparanormal (NPN). Let be the expression levels of genes. To apply the cSScGG procedure we first need to identify variables of which the density function may be non-Gaussian. A simple approach is to select elements in whose marginal distributions are non-Gaussian. We looked at histograms of all gene expression levels and found genes (MCT, GGPPS6 and GGPPS1mt) with marginal distribution far from Gaussian, as shown in Figure 2. Therefore, we set as gene expression levels of MCT, GGPPS6, and GGPPS1mt. We note that marginal distributions of these three genes have bi-/multiple modes, and monotone transformations cannot transfer them into Gaussian random variables. Therefore, the GGM and nonparanormal model may be inappropriate for this data.
As indicated by \citeasnounwille2004sparse, the GGM chosen by the BIC generally leads to a graph that is too dense for biologically relevant researches. Therefore in this study, we construct the graph by limiting the number of edges. Particularly, we tune the regularization parameters in the cSScGG method to fix . Results with can be found in \citeasnounLuothesis. Once the cSScGG fit is obtained, we scan the full regularization path of the Glasso estimates, compare the symmetric difference with the cSScGG estimate, and select the graph with smallest symmetric difference value as the Glasso graph. Specifically, the symmetric difference between two graphs is the set of edges which are in either of the graphs but not in their intersection. The same procedure is done for the NPN estimates. We implemented Glasso and NPN with R-packages glasso [friedman2014glasso] and huge [zhao2012huge] respectively.
Figures 3 presents graph topologies achieved from each method, along with the corresponding symmetric difference. We refer the symmetric difference between cSScGG and Glasso to as cSScGG vs Glasso, and the symmetric difference between cSScGG and NPN to as cSScGG vs NPN. Nodes with numbers , , and correspond to the non-Gaussian genes GGPPS1mt, GGPPS6, and MCT, respectively. Although the overall structures of different methods look similar, there are some interesting differences.
We focus on the two symmetric difference plots in Figure 3. Note that red edges are selected by the cSScGG only. Most of these edges are associated with the non-Gaussian nodes, for example, edges 32-1 and 32-39. This indicates that the cSScGG procedure is able to discover new interactions for the non-Gaussian variables. We further look at the red lines that appear only in one of the two symmetric difference plots. It is interesting to see that they all come from the cSScGG vs NPN plot, indicating that cSScGG is able to detect some edges selected by Glasso which are not selected by NPN. This is not surprising since the cSScGG method assumes a conditional Gaussian distribution for the parametric component. Finally, we note that as a trade-off for the newly identified interactions, there exists edges that are selected by both Glasso and NPN, but not by cSScGG. For instance, edge 10-33 with blue dashed line in Figure 3.
To summarize, in terms of the overall graph structure, the cSScGG procedure is capable of capturing a majority of edges that are detected by the Glasso method. By modeling the distributions of some genes that clearly violate the Gaussian assumption, the proposed method is capable of detecting interactions that are not selected by other methods. These interactions may provide potential research areas for biological study.
6.2 Conditional Relationship Between Clinical, Laboratory and Dialysis Variables from Hemodialysis Patients
We apply the cSScGG procedure to study the conditional relationships between some clinical, laboratory and dialysis variables collected from hemodialysis patients. All patients who underwent dialysis treatments at the Fresenius Medical Care - North America during 2010-2014 are considered. We include patients who stayed at the same facility throughout the treatments. To avoid large fluctuation in the first year on dialysis, we use the average measurements in the second year on dialysis from patients who survived longer than two years. For homogeneity, we include white, non-diabetic and non-Hispanic patients. After removing missing values, we have observations (patients) on the following variables in categories:
Clinical variables: age (years), height (cm), weight (kg), bmi (body mass index, kg/m2), sbp (systolic blood pressure, mmHg), dbp (diastolic blood pressure, mmHg), temp (temperature, Celsius);
Laboratory variables: albumin (g/dL), ferritin (ng/mL), hgb (hemoglobin, g/dL), lymphocytes (), neutrophils (), nlr (neutrophils to lymphocytes ratio, unitless), sna (serum sodium concentration, mEq/L or mmol/L), wbc (white blood cell, 1000/mc);
Dialysis variables: qb (blood flow, mL/min), qd (dialysis flow, mL/min), saline (mL), txttime (treatment time, min), olc (on-line clearance, unitless), idwg (interdialytic weight gain, kg), ufv (ultrafiltration volume, L), ufr (ultrafiltration rate, mL/hr/kg), epodose (erythropoietin dose, unit), volume (L), enpcr (equilibrated normalized protein catabolic rate, g/kg/day), ektv (equilibrated Kt/V, unitless).
Note that nlr and epodose have been transformed to make them close to Gaussian. In particular, nlr equals the logarithm of the neutrophils to lymphocytes ratio, and epodose represents the 1/4 power transformation of the actual erythropoietin dose.
The primary objective of this study is to discover the interactions between all these measurements. We first check the marginal distributions of all variables to investigate possible violation of the Gaussian assumption. We identify variables, age, qb, qd and epodose as non-Gaussian with very small p-values (less than ). Histograms in Figure 4 indicate that the distribution of age is skewed, and the distributions of qb, qd and epodose have multiple peaks. Note that despite the power transformation, the distribution of epodose is still far from normal due to the point mass at zero. Consequently, we specify these variables as to be estimated nonparametrically in the proposed cSScGG procedure.
We compare the cSScGG procedure with Glasso and NPN. For the NPN method, we use the shrunken ECDF to transform the data first, then apply Glasso to the transformed data. For each method, we tune the regularization parameters by BIC. The estimated graph structures are shown in Figure 5.
From the visual inspection, there is a large set of edges shared by cSScGG and Glasso, which is due to the fact that cSScGG assumes majority of the variables are conditionally normal. However, the graph of Glasso is much denser. To see how cSScGG differs from other two methods, Figure 6 shows edges detected by the cSScGG procedure only. It shows that the bmi is a hub node whose connections with other variables such as age, dbp, and wbc are not selected by other methods. Meanwhile, qb has multiple connections with nodes from the other two categories (Clinical and Laboratory). The value of these extra edges remains to be further explored from a clinical standpoint. We do not intend to claim that the graph obtained by the cSScGG procedure is the best as the underlying truth is unknown. Instead, with different model assumptions, the cSScGG procedure can identify potential links for further study.
\citationstyle
dcu
Appendix Appendix A Derivation of the LOOKL
Our derivation is similar to that in \citeasnounlian2011shrinkage and \citeasnounvujavcic2015computationally with adjustments to deal with complications brought by the conditional mean -\Lambda\mbox{{}^{-1}}\Theta^{T}\mbox{\bm{x}} and two tuning parameters. Recall that a cGGM assumes that \bm{Y}|\bm{X}=\mbox{\bm{x}}\sim\mathcal{N}(-\Lambda\mbox{{}^{-1}}\Theta^{T}\mbox{\bm{x}},\Lambda\mbox{{}^{-1}}). The log-likelihood based on the -th observation and is (ignoring constant terms)
[TABLE]
where \mbox{S_{yy,k}}=\bm{Y}_{k}^{T}\bm{Y}_{k}, \mbox{S_{xy,k}}=\bm{X}_{k}^{T}\bm{Y}_{k}, and \mbox{S_{xx,k}}=\bm{X}_{k}^{T}\bm{X}_{k} are the empirical variance/covariance matrices. Note that \mbox{S_{yy}}=n^{-1}\sum_{k=1}^{n}\mbox{S_{yy,k}}, \mbox{S_{xx}}=n^{-1}\sum_{k=1}^{n}\mbox{S_{xx,k}}, and \mbox{S_{xy}}=n^{-1}\sum_{k=1}^{n}\mbox{S_{xy,k}}.
Let and be the estimates of and based on the data excluding the -th observation. Directly calculating leave-one-out estimate of the KL distance is computationally costly. We now derive a score based on the fact that cross-validating the log-likelihood provides an estimate of the KL distance [yanagihara2006bias].
Consider the following function of five variables f(\mbox{S_{xx}},\mbox{S_{yy}},\mbox{S_{xy}},\Lambda,\Theta)=\log|\Lambda|-\mbox{\text{tr}}(\mbox{S_{yy}}\Lambda+2S^{T}_{xy}\Theta+\Lambda\mbox{{}^{-1}}\Theta^{T}S^{T}_{xx}\Theta). We have the identity \sum_{k=1}^{n}f(S_{xx,k},S_{yy,k},S_{xy,k},\Lambda,\Theta)=nf(\mbox{S_{xx}},\mbox{S_{yy}},\mbox{S_{xy}},\Lambda,\Theta). Letting \bm{S}=(\mbox{S_{xx}},\mbox{S_{yy}},\mbox{S_{xy}}) and , we denote f(\mbox{S_{xx}},\mbox{S_{yy}},\mbox{S_{xy}},\Lambda,\Theta) and
f(\mbox{S_{xx,k}},\mbox{S_{yy,k}},\mbox{S_{xy,k}},\Lambda,\Theta) as and in the rest of the derivation. The leave-one-out cross validation score [yanagihara2006bias]
[TABLE]
where \partial f(\bm{S}_{k},\hat{\Lambda},\hat{\Theta})/\partial\Lambda=\partial f(\bm{S}_{k},\hat{\Lambda},\hat{\Theta})/\partial\mbox{\mathrm{vec}}(\Lambda) and \partial f(\bm{S}_{k},\hat{\Lambda},\hat{\Theta})/\partial\Theta=\partial f(\bm{S}_{k},\hat{\Lambda},\hat{\Theta})/\partial\mbox{\mathrm{vec}}(\Theta) are and dimensional column vectors of partial derivatives given by
[TABLE]
Denoting as the version of without the -th observation, the Taylor expansions of the functions and at the point (\bm{S},\mbox{\hat{\Lambda}},\mbox{\hat{\Theta}}) are
[TABLE]
and
[TABLE]
where \partial^{2}f(\bm{S},\Lambda,\Theta)/\partial\Lambda^{2}=(\partial f(\bm{S},\Lambda,\Theta)/\partial\mbox{\mathrm{vec}}(\Lambda))/\partial\mbox{\mathrm{vec}}(\Lambda) is the Hessian matrix, \partial f(\bm{S},\mbox{\hat{\Lambda}},\mbox{\hat{\Theta}})/\partial\Lambda and \partial f(\bm{S},\mbox{\hat{\Lambda}},\mbox{\hat{\Theta}})/\partial\Theta denote partial derivative evaluated at and , and other second order derivatives are defined similarly. Note that \partial f(\bm{S},\mbox{\hat{\Lambda}},\mbox{\hat{\Theta}})/\partial\Lambda=\bm{0} and \partial f(\bm{S},\mbox{\hat{\Lambda}},\mbox{\hat{\Theta}})/\partial\Theta=\bm{0} because and are the maximum likelihood estimators, \partial^{2}f(\bm{S},\mbox{\hat{\Lambda}},\mbox{\hat{\Theta}})/\partial\Lambda\partial\mbox{S_{xy}}=\bm{0} because (A.3) is free of , and \partial^{2}f(\bm{S},\mbox{\hat{\Lambda}},\mbox{\hat{\Theta}})/\partial\Theta\partial\mbox{S_{yy}}=\bm{0} because (A.4) is free of . Let A=\partial^{2}f(\bm{S},\mbox{\hat{\Lambda}},\mbox{\hat{\Theta}})/\partial\Theta^{2}=-2\mbox{\hat{\Lambda}}\mbox{{}^{-1}}\otimes\mbox{S_{xx}}, B=\partial^{2}f(\bm{S},\mbox{\hat{\Lambda}},\mbox{\hat{\Theta}})/\partial\Theta\partial\Lambda=2\mbox{\hat{\Lambda}}\mbox{{}^{-1}}\otimes\mbox{S_{xx}}\mbox{\hat{\Theta}}\mbox{\hat{\Lambda}}\mbox{{}^{-1}}, C=\partial^{2}f(\bm{S},\mbox{\hat{\Lambda}},\mbox{\hat{\Theta}})/\partial\Lambda^{2}=-\mbox{\hat{\Lambda}}\mbox{{}^{-1}}\otimes(\mbox{\hat{\Lambda}}\mbox{{}^{-1}}+2\mbox{\hat{\Lambda}}\mbox{{}^{-1}}\hat{\Theta}^{T}\mbox{S_{xx}}\mbox{\hat{\Theta}}\mbox{\hat{\Lambda}}\mbox{{}^{-1}}), D=\partial^{2}f(\bm{S},\mbox{\hat{\Lambda}},\mbox{\hat{\Theta}})/\partial\Theta\partial\mbox{S_{xx}}=-2\mbox{\hat{\Lambda}}\mbox{{}^{-1}}\hat{\Theta}^{T}\otimes I_{d\times d}, and E=\partial^{2}f(\bm{S},\mbox{\hat{\Lambda}},\mbox{\hat{\Theta}})/\partial\Lambda\partial\mbox{S_{xx}}=\mbox{\hat{\Lambda}}\mbox{{}^{-1}}\hat{\Theta}^{T}\otimes\mbox{\hat{\Lambda}}\mbox{{}^{-1}}\hat{\Theta}^{T}. Solving (A.5) and (A.6) and plugging solutions into (A.2), we have
[TABLE]
For the Gaussian graphical model with , (A.7) reduces to
[TABLE]
which is the same as the GACV in \citeasnounlian2011shrinkage and KLCV in \citeasnounvujavcic2015computationally.
Appendix Appendix B Calculation of the Projection Ratio
Letting \hat{\zeta}(\mbox{\bm{x}})=\hat{\Delta}(\mbox{\bm{x}})+\hat{\eta}(\mbox{\bm{x}}), we construct the ratio where denotes the squared error projection of in . A small ratio indicates that may be removed. By definition,
[TABLE]
To obtain , one needs to find
[TABLE]
Let , where is a space spanned by known functions \{\varphi_{1}(\mbox{\bm{x}}),\cdots,\varphi_{m}(\mbox{\bm{x}})\} and is the orthogonal reproducing kernel Hilbert space with the reproducing kernel function . Let \mbox{\bm{\phi}}=\big{(}\varphi_{i}({\bf{X}}_{j})\big{)}_{i=1,\cdots,m}^{j=1,\cdots,n} and \mbox{\bm{\xi}}=\big{(}R({\bf{X}}_{i},{\bf{X}}_{j})\big{)}_{i=1,\cdots,n}^{j=1,\cdots,n}. Let \tilde{\zeta}=\mbox{\bm{\phi}}\mbox{{}^{T}}\tilde{\mbox{\bm{d}}}+\mbox{\bm{\xi}}\mbox{{}^{T}}\tilde{\mbox{\bm{c}}}, take derivatives with respect to \tilde{\mbox{\bm{d}}} and \tilde{\mbox{\bm{c}}}, and set them to zero. After rearrangements, we obtain the equation
[TABLE]
where for any vectors of functions and .
The right hand side of (A.10) contains some extra components involving . We compute solutions to (A.10) using the Cholesky decomposition implemented in the project() function in the R package gss [gu2014smoothing]. Once is computed, we have
[TABLE]
Appendix Appendix C Proofs of Theoretical Results
To prove Theorem 1, we first introduce a sequence of lemmas as in \citeasnounwytock2013sparse. Note that, different from \citeasnounwytock2013sparse, we allow different penalties for and . Lemma 1 below studies the decay rate of the gradients and in element-wise infinity operator norm as sample size increases.
Lemma 1**.**
Suppose that the Assumption 1 holds. Then
[TABLE]
for any \vartheta\in(0,40\mbox{C_{\sigma}}).
Proof.
\citeasnoun
wytock2013sparse proved (A.12) using the Chernoff bound for the Gaussian tail probability. \citeasnounravikumar2011high proved (A.13) in their Lemma 1. ∎
The next lemma extends the primal-dual witness approach proposed in \citeasnounwainwright2009sharp to our multi-penalties setting. Let . With a bit abuse of notation, let .
Lemma 2**.**
Suppose that the true parameter has support . We consider two optimization problems:
[TABLE]
[TABLE]
Let and . If the following conditions hold,
the solution is unique; 2. 2.
{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\big{(}\nabla^{2}_{\Gamma}l_{2}(\Gamma_{0})\big{)}_{\bar{S}S}\big{(}\nabla^{2}_{\Gamma}l_{2}(\Gamma_{0})\big{)}_{SS}^{-1}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{\infty}<1-\alpha* for ;* 3. 3.
;
then the two -regularized solutions are identical, .
Proof.
Define \Delta_{\Lambda}=\mbox{\tilde{\Lambda}}-\Lambda_{0}, \Delta_{\Theta}=\mbox{\tilde{\Theta}}-\Theta_{0} and . Let be the residual of second order Taylor expansion of the log-likelihood where
[TABLE]
Following the same arguments as in Lemma 3 in \citeasnounravikumar2011high, the optimization problem (A.14) satisfies
[TABLE]
where is the sub-differential of the penalty term evaluated at and , and
[TABLE]
[TABLE]
If we can verify the strict dual feasibility , then by Lemma 3 in \citeasnounravikumar2011high, the restricted solution is an optimal solution to the original problem, i.e., .
Denoting and for simplicity, the optimality condition of (A.16) in terms of and can be rewritten as
[TABLE]
Since is invertible, we have
[TABLE]
Plugging (A.18) back into the second equation in (A.17), we obtain
[TABLE]
Taking the norm of both sides gives
[TABLE]
∎
Based on Lemma 2, the solution is constructed as a witness to the original unrestricted solution . Then inherits many optimality properties from , in terms of the discrepancy to the true and the recovery of the signed sparsity pattern. Our next step is to bound the residual term in terms of .
Lemma 3** (Control of remainder).**
Suppose that , then
[TABLE]
Proof.
We describe the proof briefly since it follows the same steps as in \citeasnounwytock2013sparse. Denote second order Taylor expansion of a function in terms of its differentials
[TABLE]
By the definition of and the mean value theorem, there exists such that and similarly for . As expressions of the above second differentials are tedious, we do not include them here. However, we note that each term in and has a quadratic expression in and , with at most four terms, two terms and one term. Using the fact that
[TABLE]
for any matrices and \left\lVert\mbox{S_{xx}}\right\rVert_{\infty}\leq C_{X}^{2}, each term in the second differentials is bounded by
[TABLE]
For an invertible , since , it is easy to verify that
[TABLE]
Then
[TABLE]
Similarly, since , we have
[TABLE]
Combining with (A.19), we obtain
[TABLE]
∎
Lemma 4** (Control of ).**
Suppose that . Then
[TABLE]
Proof.
Recall that , , therefore . Our goal is to bound the deviation . By (A.18), we have . In the following, we use Brouwer’s fixed point theorem on a compact set to construct a ball that contains . Define the -ball and a continuous map such that
[TABLE]
Now it suffices to show F\big{(}\mathbb{B}(u)\big{)}\in\mathbb{B}(u), as this implies there is a solution to the above equation. By uniqueness of the optimal solution, we can thus conclude that belongs in this ball.
Taking infinity norm to (A.21), we have
[TABLE]
For any , by Lemma 3, the first term in (A.22) is bounded by
[TABLE]
By the definition of radius , the second term in (A.22) is bounded by
[TABLE]
Therefore, we have . ∎
Proof of Theorem 1 We first show that equals the solution to original objective function (19) with high probability. Then we proceed with the proof conditioning on this event.
By Lemma 1, we have the element-wise tail conditions for : , where f_{\Lambda}(n,\delta)=(1/4)\exp{\big{(}n\delta^{2}/(3200\mbox{C^{2}_{\sigma}})\big{)}}, and denotes the -th element in . For a fixed , denote
[TABLE]
Similarly, for each fixed , denote
[TABLE]
By the monotonicity of the function , it is easy to see that
[TABLE]
Appling Corollary 1 and Lemma 8 in \citeasnounravikumar2011high, for any , we have the control of sampling noise for
[TABLE]
where \bar{n}_{f_{\Lambda}}(\delta;p^{\tau})=3200\mbox{C^{2}{\sigma}}(\tau\log p+\log 4)/\delta^{2} and \bar{\delta}_{f_{\Lambda}}(n;p^{\tau})=\sqrt{3200\mbox{C^{2}{\sigma}}}\sqrt{(\tau\log p+\log 4)/n}. Now we develop the control of sampling noise for . Again, by Lemma 1 we have the element-wise tail probability for :
[TABLE]
where f_{\Theta}(n,\delta)=(1/2)\exp\big{(}n\delta^{2}/(8\mbox{C^{2}_{\sigma}}C_{X}^{2})\big{)}.
Define and similarly to (A.23) and (A.24). Applying the union bound over all entries of the gradient matrix, we obtain that
[TABLE]
Let \delta=\bar{\delta}_{f_{\Lambda}}\big{(}n;(pd)^{\tau}\big{)}, then for any ,
[TABLE]
The last equality follows the fact that f_{\Theta}\Big{(}n,\bar{\delta}_{f_{\Theta}}\big{(}n;(pd)^{\tau}\big{)}\Big{)}=(pd)^{\tau}, based on the definition of .
Straightforward calculation shows that \bar{n}_{f_{\Theta}}\big{(}\delta;(pd)^{\tau}\big{)} and \bar{\delta}_{f_{\Theta}}\big{(}n;(pd)^{\tau}\big{)} take the forms
[TABLE]
and
[TABLE]
Denote , , by (A.26) and (A.27) we have
[TABLE]
Specifically,
[TABLE]
where C_{X}^{\star}=\max\{\mbox{C_{X}}^{2},1\}.
Let denote the event that , (A.28) implies that \mathbb{P}(\mathcal{A})\geq 1-\big{(}p^{-(\tau-2)}+(pd)^{-(\tau-1)}\big{)}. Accordingly, we condition on the event in the following analysis.
Next, we verify that the third assumption in Lemma 2 holds. Choose the (larger) regularization penalty , then the first half is satisfied. It remains to establish the bound . We do so by using Lemmas A.20 and 3 consecutively. Choose
[TABLE]
by our choice of , the minimum bound on and the monotonicity property (A.25) , we have
[TABLE]
Applying Lemma A.20, we conclude that
[TABLE]
Then Lemma 3 gives
[TABLE]
where the final inequality follows from the lower bound on sample size , and the monotonicity property (A.25).
To summarize, we have shown that condition 3 in Lemma 2 holds. Furthermore, a finite implies condition 1, and condition 2 is assumed by the Assumption 3. These allow us to conclude that . By (A.29) and (A.30), the estimator satisfies the bound claimed in Theorem 1(a). Moreover, by the bound (A.20) and the definition of in Lemma A.20, the estimate cannot differ enough from to change sign when condition (21) is satisfied. This proves Theorem 1(b).
Proof of Corollary 1 Let \psi=2\kappa_{H}(1+8/\alpha)C_{\sigma}C_{X}^{\star}\sqrt{3200}\sqrt{\big{(}\tau\log(pd)+\log 4\big{)}/n}. From Theorem 1, we have \max\Big{\{}\left\lVert\hat{\Lambda}-\Lambda_{0}\right\rVert_{\infty},\left\lVert\hat{\Theta}-\Theta_{0}\right\rVert_{\infty}\Big{\}}\leq\psi with probability at least 1-\big{(}p^{-(\tau-2)}+(pd)^{-(\tau-1)}\big{)}. Since has at most non-zeros including diagonal elements and has at most non-zeros elements, we have
[TABLE]
Combining above two inequalities leads to the bound in (22).
Lemma 5**.**
Suppose that the Assumption 4 holds, then for positive definite matrices and ,
[TABLE]
Proof of Lemma 5 This proof is similar to that for Lemma A.1 in \citeasnounfan2011high. Under the event , for any vector with Euclidean norm , we have
[TABLE]
The inequality holds by the fact that for any . Therefore, .
Meanwhile,
[TABLE]
The first inequality holds because of submultiplicativity of the norm, and for any matrix .
Proof of Theorem 25 Recall that \text{SKL}\big{(}f_{0}(\mbox{\bm{y}}|\mbox{\bm{x}}),\hat{f}(\mbox{\bm{y}}|\mbox{\bm{x}})\big{)} has an explicit form:
[TABLE]
where . We now derive the upper bound for each of the four terms in (A.31) conditioning on the event and .
We first derive an bound for using the fact that I_{1}\leq 2^{-1}\int_{\mathcal{X}}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\hat{\Lambda}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}\left\lVert U\bm{x}\right\rVert^{2}f_{0}(\mbox{\bm{x}})d\mbox{\bm{x}}. Note that
[TABLE]
Furthermore, since the Frobenius norm for a vector equals its Euclidean norm, we have
[TABLE]
The last inequality holds from Lemma 5. Combined, we have the upper bound for
[TABLE]
where G=\max\{\int_{\mathcal{X}}\left\lVert\mathbf{x}\right\rVert_{2}^{2}f_{0}(\mbox{\bm{x}})d\mbox{\bm{x}},\int_{\mathcal{X}}\left\lVert\mathbf{x}\right\rVert_{2}^{2}\hat{f}(\mbox{\bm{x}})d\mbox{\bm{x}}\}\max\{C_{U},1\}\max\{D_{T}^{2},1\}/\min\{C_{L}^{4},1\} and C_{m}=\max\{\int_{\mathcal{X}}\bm{x}^{T}\bm{x}f_{0}(\mbox{\bm{x}})d\mbox{\bm{x}},\int_{\mathcal{X}}\bm{x}^{T}\bm{x}\hat{f}(\mbox{\bm{x}})d\mbox{\bm{x}}\}. As the only difference between and lies in whether the expectation is calculated with respect to the true or estimated density, this bound also applies to .
For , note that
[TABLE]
where the first inequality uses the fact that \mbox{\text{tr}}(A^{T}B) is an appropriate inner product for symmetric matrices and , and by the Cauchy-Schwarz inequality, \mbox{\text{tr}}(A^{T}B)\leq\mbox{\text{tr}}(A^{T}A)\mbox{\text{tr}}(B^{T}B)={\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{F}^{2}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|B\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{F}^{2}; and the second inequality holds by Lemma 5 with probability 1. Then
[TABLE]
For , following similar arguments as above,
[TABLE]
Then by Corollary 1 and Lemma 5, we have and on the order of \mathcal{O}\Big{(}n^{-5/2}p^{5/2}(\log pd)^{5/2}\Big{)}, and and on the order of \mathcal{O}\Big{(}n^{-1}p^{2}(\log pd)\Big{)}. This proves the claim.
Proof of Theorem 28 The bound of D\big{(}f_{0}(\mbox{\bm{z}}),\hat{f}(\mbox{\bm{z}})\big{)} in (28) comes straightforwardly by combing (25) and (26). However, as the parametric part (25) is conditioning on the event , a new lower bound for the sample size needs to be derived such that this condition is always satisfied.
By the RHS of upper bound (22) in Corollary 1, we have
[TABLE]
Combining (A.35) with (20) yields (27) after some simple algebra.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] \harvarditem Allen \harvardand Liu 2012 allen 2012 log Allen, G. I. \harvardand Liu, Z. \harvardyearleft 2012 \harvardyearright . A log-linear graphical model for inferring genetic networks from high-throughput sequencing data, Bioinformatics and Biomedicine (BIBM), 2012 IEEE International Conference on , IEEE, pp. 1–6.
- 2[2] \harvarditem Armijo 1966 armijo 1966 minimization Armijo, L. \harvardyearleft 1966 \harvardyearright . Minimization of functions having Lipschitz continuous first partial derivatives, Pacific Journal of Mathematics 16 : 1–3.
- 3[3] \harvarditem Duong 2007 duong 2007 ks Duong, T. \harvardyearleft 2007 \harvardyearright . ks: Kernel density estimation and kernel discriminant analysis for multivariate data in R, Journal of Statistical Software 21 : 1–16.
- 4[4] \harvarditem Fan \harvardand Li 2001 fan 2001 variable Fan, J. \harvardand Li, R. \harvardyearleft 2001 \harvardyearright . Variable selection via nonconcave penalized likelihood and its oracle properties, Journal of the American Statistical Association 96 : 1348–1360.
- 5[5] \harvarditem [Fan et al.]Fan, Liao \harvardand Liu 2016 fan 2016 overview Fan, J., Liao, Y. \harvardand Liu, H. \harvardyearleft 2016 \harvardyearright . An overview of the estimation of large covariance and precision matrices, The Econometrics Journal 19 : C 1–C 32.
- 6[6] \harvarditem [Fan et al.]Fan, Liao \harvardand Mincheva 2011 fan 2011 high Fan, J., Liao, Y. \harvardand Mincheva, M. \harvardyearleft 2011 \harvardyearright . High dimensional covariance matrix estimation in approximate factor models, Annals of Statistics 39 : 3320–3356.
- 7[7] \harvarditem [Fellinghauer et al.]Fellinghauer, Bühlmann, Ryffel, Von Rhein \harvardand Reinhardt 2013 fellinghauer 2013 stable Fellinghauer, B., Bühlmann, P., Ryffel, M., Von Rhein, M. \harvardand Reinhardt, J. D. \harvardyearleft 2013 \harvardyearright . Stable graphical model estimation with random forests for discrete, continuous, and mixed variables, Computational Statistics & Data Analysis 64 : 132–152.
- 8[8] \harvarditem Finegold \harvardand Drton 2011 finegold 2011 robust Finegold, M. \harvardand Drton, M. \harvardyearleft 2011 \harvardyearright . Robust graphical modeling of gene networks using classical and alternative t-distributions, The Annals of Applied Statistics 5 : 1057–1080.
