Combining Smoothing Spline with Conditional Gaussian Graphical Model for Density and Graph Estimation

Runfei Luo; Anna Liu; Yuedong Wang

arXiv:1904.00204·stat.ME·April 2, 2019


TL;DR

This paper introduces a semiparametric framework combining smoothing splines and conditional Gaussian graphical models for joint density and graph estimation, enabling flexible modeling of high-dimensional data without assuming joint Gaussianity.

Contribution

It develops a novel combined approach using SS ANOVA and cGGM for simultaneous density and graph estimation with theoretical guarantees.

Findings

01 Effective in high-dimensional settings

02 Accurate edge selection via geometric inference

03 Competitive performance in simulations and real data

Abstract

Multivariate density estimation and graphical models play important roles in statistical learning. The estimated density can be used to construct a graphical model that reveals conditional relationships whereas a graphical structure can be used to build models for density estimation. Our goal is to construct a consolidated framework that can perform both density and graph estimation. Denote $\bm{Z}$ as the random vector of interest with density function $f(\bm{z})$. We split $\bm{Z}$ into two parts, $\bm{Z}=(\bm{X}^{T},\bm{Y}^{T})^{T}$, and write $f(\bm{z})=f(\bm{x})f(\bm{y}|\bm{x})$, where $f(\bm{x})$ is the density function of $\bm{X}$ and $f(\bm{y}|\bm{x})$ is the conditional density of $\bm{Y}|\bm{X}=\bm{x}$. We propose a semiparametric framework that models $f(\bm{x})$ nonparametrically using a smoothing spline ANOVA (SS ANOVA) model and $f(\bm{y}|\bm{x})$ parametrically using a conditional Gaussian graphical model…

Tables2

Table 1: Averages and standard deviations (in parentheses) of the overall KL divergence $\text{KL}\big(f_{0}(\bm{z}),\hat{f}(\bm{z})\big)$ (denoted by $f(\bm{z})$), the empirical aggregated KL $n^{-1}\sum_{i=1}^{n}\text{KL}\big(f_{0}(\bm{y}|\bm{X}_{i}),\hat{f}(\bm{y}|\bm{X}_{i})\big)$ (denoted by $f(\bm{y}|\bm{x})$), and $\text{KL}\big(f_{0}(\bm{x}),\hat{f}(\bm{x})\big)$ (denoted by $f(\bm{x})$). cSScGG_CV (cSScGG_LOOKL) and cSScGG_CV_NT (cSScGG_LOOKL_NT) correspond to the cSScGG method without and with normality test respectively, and tuning parameters $\lambda_{2}$ and $\lambda_{3}$ are selected by the 5-fold CV (LOOKL).

| KL | Method | σ = 0.5, ω = 0.9 | σ = 0.5, ω = 0.5 | σ = 0.1, ω = 0.9 | σ = 0.1, ω = 0.5 |
|---|---|---|---|---|---|
| $f(\bm{x})$ | cSScGG | 0.030 (0.023) | 0.043 (0.021) | 0.062 (0.073) | 0.046 (0.026) |
| | SKDE | 0.208 (0.172) | 0.181 (0.190) | 0.503 (0.414) | 0.141 (0.091) |
| | QUIC | 0.225 (0.023) | 0.243 (0.012) | 3.313 (0.051) | 3.528 (0.021) |
| | MLE | 0.182 (0.021) | 0.225 (0.010) | 3.128 (0.047) | 3.487 (0.023) |
| $f(\bm{y}\|\bm{x})$ | cSScGG_CV | 1.145 (0.202) | 1.118 (0.189) | 1.040 (0.138) | 1.078 (0.175) |
| | cSScGG_LOOKL | 1.143 (0.164) | 1.098 (0.167) | 1.14 (0.172) | 1.112 (0.161) |
| | SKDE | 1.621 (0.184) | 1.632 (0.218) | 1.425 (0.173) | 1.474 (0.215) |
| | QUIC | 1.196 (0.125) | 1.163 (0.147) | 1.179 (0.141) | 1.156 (0.139) |
| | MLE | 1.827 (0.245) | 1.613 (0.236) | 2.235 (0.413) | 1.607 (0.235) |
| $f(\bm{z})$ | cSScGG_CV | 1.175 (0.205) | 1.161 (0.189) | 1.102 (0.155) | 1.124 (0.178) |
| | cSScGG_CV_NT | 1.262 (0.208) | 1.268 (0.173) | 1.032 (0.125) | 1.051 (0.120) |
| | cSScGG_LOOKL | 1.173 (0.167) | 1.141 (0.166) | 1.202 (0.186) | 1.158 (0.164) |
| | cSScGG_LOOKL_NT | 1.358 (0.241) | 1.325 (0.211) | 1.153 (0.184) | 1.122 (0.166) |
| | SKDE | 1.829 (0.256) | 1.813 (0.331) | 1.928 (0.455) | 1.615 (0.234) |
| | QUIC | 1.422 (0.132) | 1.405 (0.148) | 4.492 (0.15) | 4.684 (0.137) |
| | MLE | 2.009 (0.245) | 1.838 (0.237) | 5.363 (0.404) | 5.094 (0.232) |

Table 2: Averages and standard deviations (in parentheses) of specificity (SPE), sensitivity (SEN), and $F_{1}$ score when $p=25$ and $d=3$. $\bm{X}$ follows the multivariate normal distribution.

| Edges | n | cSScGG SPE | cSScGG SEN | cSScGG F₁ | QUIC SPE | QUIC SEN | QUIC F₁ | NPN SPE | NPN SEN | NPN F₁ |
|---|---|---|---|---|---|---|---|---|---|---|
| Among $\bm{X}$ | 200 | 0.881 (0.221) | 0.931 (0.24) | 0.732 (0.429) | 0.775 (0.245) | 0.97 (0.171) | 0.55 (0.471) | 0.839 (0.259) | 0.914 (0.27) | 0.699 (0.412) |
| | 300 | 0.932 (0.203) | 0.895 (0.278) | 0.827 (0.352) | 0.812 (0.293) | 0.989 (0.102) | 0.713 (0.428) | 0.803 (0.304) | 0.968 (0.17) | 0.765 (0.372) |
| Among $\bm{Y}$ | 200 | 0.821 (0.029) | 0.939 (0.039) | 0.707 (0.035) | 0.794 (0.03) | 0.946 (0.037) | 0.682 (0.034) | 0.819 (0.095) | 0.774 (0.333) | 0.564 (0.197) |
| | 300 | 0.858 (0.028) | 0.96 (0.028) | 0.761 (0.028) | 0.829 (0.027) | 0.963 (0.027) | 0.728 (0.031) | 0.79 (0.029) | 0.965 (0.024) | 0.689 (0.033) |
| Between $\bm{X}$ and $\bm{Y}$ | 200 | 0.828 (0.117) | 0.865 (0.163) | 0.687 (0.096) | 0.776 (0.06) | 0.942 (0.071) | 0.656 (0.077) | 0.821 (0.107) | 0.78 (0.337) | 0.574 (0.221) |
| | 300 | 0.799 (0.125) | 0.966 (0.063) | 0.707 (0.084) | 0.836 (0.051) | 0.969 (0.045) | 0.728 (0.062) | 0.786 (0.056) | 0.955 (0.059) | 0.678 (0.072) |
| Overall | 200 | 0.823 (0.029) | 0.926 (0.044) | 0.702 (0.033) | 0.79 (0.026) | 0.946 (0.036) | 0.677 (0.03) | 0.82 (0.094) | 0.776 (0.331) | 0.568 (0.194) |
| | 300 | 0.848 (0.026) | 0.96 (0.028) | 0.746 (0.029) | 0.831 (0.023) | 0.964 (0.026) | 0.729 (0.025) | 0.79 (0.025) | 0.963 (0.025) | 0.689 (0.027) |

Equations

$$f(\bm{z})=f(\bm{x},\bm{y})=f(\bm{x})f(\bm{y}|\bm{x}).$$

$$\eta(\bm{x})=c+\sum_{k=1}^{d}\eta_{k}(x_{k})+\sum_{k>l}\eta_{kl}(x_{k},x_{l})+\dots+\eta_{1\ldots d}(x_{1},\dots,x_{d}),$$

$$\mathcal{H}=\mathcal{H}^{0}\oplus\mathcal{H}^{1}\oplus\dots\oplus\mathcal{H}^{w},$$

$$l_{1}(\eta)$$

$$l_{2}(\Theta,\Lambda)$$

$$\{\hat{\eta},\hat{\Lambda},\hat{\Theta}\}=\operatorname*{arg\,min}_{\eta\in\mathcal{H},\Lambda\succ 0,\Theta}\left\{\Big[l_{1}(\eta)+\frac{\lambda_{1}}{2}J(\eta)\Big]+\Big[l_{2}(\Lambda,\Theta)+\lambda_{2}\left\lVert\Lambda\right\rVert_{1,\text{off}}+\lambda_{3}\left\lVert\Theta\right\rVert_{1}\Big]\right\},$$

$$\hat{\eta}=\operatorname*{arg\,min}_{\eta\in\mathcal{H}}\left\{\frac{1}{n}\sum_{i=1}^{n}e^{-\eta(\bm{X}_{i})}+\int_{\mathcal{X}}\eta(\bm{x})\rho(\bm{x})d\bm{x}+\frac{\lambda_{1}}{2}J(\eta)\right\},$$

$$\{\hat{\Theta},\hat{\Lambda}\}=$$

$$\Theta_{ij,(t+1)}\leftarrow S_{\lambda_{3}/a_{\Theta}}\Big(c_{\Theta}-\frac{b_{\Theta}}{a_{\Theta}}\Big),$$

$$h_{(t)}(\Lambda)=-\log|\Lambda|+\text{tr}\big(S_{yy}\Lambda+2S_{xy}^{T}\Theta_{(t)}\Lambda^{-1}_{(t)}\Lambda+\Lambda^{-1}_{(t)}\Lambda\Lambda^{-1}_{(t)}\Theta^{T}_{(t)}S_{xx}\Theta_{(t)}\big)$$

$$\Lambda_{(t+1)}=\operatorname*{arg\,min}_{\Lambda\succ 0}\Big\{h_{(t)}(\Lambda)+\lambda_{2}\left\lVert\Lambda\right\rVert_{1,\text{off}}\Big\}.$$

$$\bar{h}_{(t)}(\Delta_{\Lambda})=\text{vec}(\nabla h_{(t)}(\Lambda_{(t)}))^{T}\text{vec}(\Delta_{\Lambda})+\frac{1}{2}\text{vec}(\Delta_{\Lambda})^{T}\nabla^{2}h_{(t)}(\Lambda_{(t)})\text{vec}(\Delta_{\Lambda}),$$

$$D_{\Lambda,(t)}=\operatorname*{arg\,min}_{\Delta_{\Lambda}}\Big\{\bar{h}_{(t)}(\Delta_{\Lambda})+\lambda_{2}\left\lVert\Lambda_{(t)}+\Delta_{\Lambda}\right\rVert_{1,\text{off}}\Big\}.$$

$$(\Delta_{\Lambda})_{ij,(s+1)}\leftarrow(\Delta_{\Lambda})_{ij,(s)}-c_{\Lambda}+S_{\lambda_{2}/a_{\Lambda}}\Big(c_{\Lambda}-\frac{b_{\Lambda}}{a_{\Lambda}}\Big),$$

$$p_{(t)}(\Lambda_{(t)}+\alpha D_{\Lambda,(t)})\leq p_{(t)}(\Lambda_{(t)})+\alpha\sigma\Big\{\text{tr}(\nabla h_{(t)}(\Lambda_{(t)})D_{\Lambda,(t)})+\lambda_{2}\left\lVert\Lambda_{(t)}+D_{\Lambda,(t)}\right\rVert_{1,\text{off}}-\lambda_{2}\left\lVert\Lambda_{(t)}\right\rVert_{1,\text{off}}\Big\},$$

$$\mathcal{S}_{\Theta}=\{(i,j):\ |\big(\nabla_{\Theta}l_{2}(\Lambda_{(t)},\Theta_{(t)})\big)_{ij}|>\lambda_{3}\ \text{or}\ \Theta_{ij,(t)}\neq 0\},$$

$$\mathcal{S}_{\Lambda}=\{(i,j):\ |\big(\nabla h_{(t)}(\Lambda_{(t)})\big)_{ij}|>\lambda_{2}\ \text{or}\ \Lambda_{ij,(t)}\neq 0\}.$$

$$\text{LOOKL}(\lambda_{2},\lambda_{3})$$

$$\log f(\bm{z})=\eta(\bm{x})+\frac{1}{2}\big(-\bm{y}^{T}\Lambda\bm{y}-2\bm{x}^{T}\Theta\bm{y}-\bm{x}^{T}\Theta^{T}\Lambda^{-1}\Theta\bm{x}\big)+C,$$

$$\hat{\zeta}(\bm{x})=\hat{\Delta}(\bm{x})+\hat{\eta}(\bm{x}).$$

$$\tilde{V}(f,g)=\int_{\mathcal{X}}f(\bm{x})g(\bm{x})\rho(\bm{x})d\bm{x}-\Big\{\int_{\mathcal{X}}f(\bm{x})\rho(\bm{x})d\bm{x}\Big\}\Big\{\int_{\mathcal{X}}g(\bm{x})\rho(\bm{x})d\bm{x}\Big\}$$

$$\tilde{\zeta}=\operatorname*{arg\,min}_{\zeta\in\mathcal{S}^{0}}\big\{\tilde{V}(\hat{\zeta}-\zeta)\big\}.$$

$$\{\hat{\Theta},\hat{\Lambda}\}=\operatorname*{arg\,min}_{\Lambda\succ 0,\Theta}\big\{l_{2}(\Lambda,\Theta)+\lambda(\left\lVert\Lambda\right\rVert_{1}+r\left\lVert\Theta\right\rVert_{1})\big\}.$$

$$\max\Big\{\left\lVert\hat{\Lambda}-\Lambda_{0}\right\rVert_{\infty},\left\lVert\hat{\Theta}-\Theta_{0}\right\rVert_{\infty}\Big\}\leq 2\kappa_{H}(1+8\alpha^{-1})C_{\sigma}C_{X}^{\star}\sqrt{3200}\sqrt{\frac{\tau\log(pd)+\log 4}{n}}.$$

$$\min\{\Lambda_{0,ij},\Theta_{0,ij}\}>4\kappa_{H}(1+8\alpha^{-1})C_{\sigma}C_{X}^{\star}\sqrt{3200}\sqrt{\frac{\tau\log(pd)+\log 4}{n}}.$$

$$\max\Big\{\left\lVert\hat{\Lambda}-\Lambda_{0}\right\rVert_{F},\left\lVert\hat{\Theta}-\Theta_{0}\right\rVert_{F}\Big\}$$

$$\text{SKL}\Big(f_{0}(\bm{z}),\hat{f}(\bm{z})\Big)$$

$$D\big(f_{0}(\bm{z}),\hat{f}(\bm{z})\big)\triangleq\text{SKL}\big(f_{0}(\bm{y}|\bm{x}),\hat{f}(\bm{y}|\bm{x})\big)+(V+\lambda_{1}J)(\eta_{0}-\hat{\eta}).$$

$$\text{SKL}\big(f_{0}(\bm{y}|\bm{x}),\hat{f}(\bm{y}|\bm{x})\big)$$


Taxonomy

Topics: Statistical Methods and Inference · Bayesian Methods and Mixture Models · Gaussian Processes and Bayesian Inference

Full text


Runfei Luo, Anna Liu and Yuedong Wang

Runfei Luo received her Ph.D. in statistics from University of California, Santa Barbara. She is now an applied scientist in Amazon Web Services. Anna Liu is Associate Professor, Department of Mathematics and Statistics, University of Massachusetts, Amherst, Massachusetts 01002. Yuedong Wang is Professor, Department of Statistics and Applied Probability, University of California, Santa Barbara, California 93106. Anna Liu's research was supported by a grant from the National Science Foundation (DMS-1507078). Runfei Luo and Yuedong Wang's research was supported by a grant from the National Science Foundation (DMS-1507620). We acknowledge support from the Center for Scientific Computing from the CNSI, MRL: an NSF MRSEC (DMR-1720256). Address for correspondence: Yuedong Wang, Department of Statistics and Applied Probability, University of California, Santa Barbara, California 93106.

Abstract

Multivariate density estimation and graphical models play important roles in statistical learning. The estimated density can be used to construct a graphical model that reveals conditional relationships whereas a graphical structure can be used to build models for density estimation. Our goal is to construct a consolidated framework that can perform both density and graph estimation. Denote $\bm{Z}$ as the random vector of interest with density function $f(\bm{z})$. We split $\bm{Z}$ into two parts, $\bm{Z}=(\bm{X}^{T},\bm{Y}^{T})^{T}$, and write $f(\bm{z})=f(\bm{x})f(\bm{y}|\bm{x})$, where $f(\bm{x})$ is the density function of $\bm{X}$ and $f(\bm{y}|\bm{x})$ is the conditional density of $\bm{Y}|\bm{X}=\bm{x}$. We propose a semiparametric framework that models $f(\bm{x})$ nonparametrically using a smoothing spline ANOVA (SS ANOVA) model and $f(\bm{y}|\bm{x})$ parametrically using a conditional Gaussian graphical model (cGGM). Combining the flexibility of the SS ANOVA model with the succinctness of the cGGM, this framework allows us to deal with high-dimensional data without assuming a joint Gaussian distribution. We propose a backfitting estimation procedure for the cGGM with a computationally efficient approach for selection of tuning parameters. We also develop a geometric inference approach for edge selection. We establish asymptotic convergence properties for both the parameter and density estimation. The performance of the proposed method is evaluated through extensive simulation studies and two real data applications.

KEY WORDS: cross-validation, high dimensional data, penalized likelihood, reproducing kernel Hilbert space, smoothing spline ANOVA

1 Introduction

Density estimation has long been a subject of paramount interest in statistics. Many parametric, nonparametric, and semiparametric methods have been developed in the literature. Assuming a known distribution family with succinct representation and interpretable parameters, the parametric approach is in general statistically and computationally efficient [Kendall1987]. However, the parametric assumption may be too restrictive for some applications. The nonparametric approach, on the other hand, does not assume a specific form for the density function and allows its shape to be decided by data. Methods such as kernel estimation [parzen1962estimation, silverman2018density], local likelihood estimators [loader1996local], and smoothing splines [gu2013smoothing] work well for low dimensional multivariate density functions. When the dimension is moderate to large, existing nonparametric methods break down quickly due to the curse of dimensionality and/or computational limitations. [duong2007ks] pointed out that kernel density estimation is not applicable to random variables of dimension higher than six. To reduce the computational burden, [jeon2006effective] and [gu2013nonparametric] developed a pseudo likelihood method for smoothing spline density estimation. However, our experience indicates that the computation becomes almost infeasible when the dimension is higher than twelve. Consequently, contrary to the univariate case, flexible methods for multivariate density estimation are rather limited when the dimension is large. Recent work using piecewise constant and Bayesian partitions represents a major breakthrough in this area [lu2013multivariate, liu2014multivariate, li2016density]. Nevertheless, these methods can handle moderate dimensions only, lead to non-smooth density estimates, and cannot be used to investigate conditional relationships.

Some semiparametric methods have been proposed to take advantage of the parsimony of parametric models and the flexibility of nonparametric modeling. Semiparametric copula models consist of nonparametric marginal distributions and parametric copula functions [genest1995]. Projection pursuit density estimation overcomes the curse of dimensionality by representing the joint density as a product of some smooth univariate functions of carefully selected linear combinations of variables [friedman1984projection]. The regularized derivative expectation operator (rodeo) method assumes the joint density equals a product of a parametric component and a nonparametric function of an unknown subset of variables [liu2007sparse]. Other semiparametric/nonparametric methods for density estimation include mixture models [richardson1997bayesian], forest density [liu2011forest], density tree [Ram11], and geometric density estimation [Dunson16]. All existing semiparametric/nonparametric methods have strengths and limitations. We will develop a new semiparametric procedure for multivariate density estimation that explores the sparse graph structure in the parametric part of the model.

Graphical models are used to characterize conditional relationships between variables, with a wide range of applications in natural sciences, social sciences, and economics [lauritzen1996graphical, fan2016overview, friedman2008sparse]. The Gaussian graphical model (GGM) is one of the most popular models, in which conditional independence is reflected in the zero entries of the precision matrix [friedman2008sparse]. The resulting structure from a GGM can be erroneous when the true distribution is far from Gaussian. The dependence structure of non-Gaussian data did not receive great attention until recent years. Robustified Gaussian and elliptical graphical models against possible outliers were studied by [miyamura2006robust], [finegold2011robust], [vogel2011elliptical], and [sun2012robust]. Graphical models based on generalized linear models were proposed by [lee2007efficient], [hofling2009estimation], [ravikumar2010high], [allen2012log], and [yang2012graphical]. Nonparametric and semiparametric approaches have also been considered. [jeon2006effective] and [gu2013nonparametric] applied SS ANOVA density models to estimate graphs (see Section 3 for details). The computation of this nonparametric approach becomes prohibitive for large dimensions. [liu2009nonparanormal], [liu2012high], and [xue2012regularized] developed an elegant nonparanormal model which assumes that there exists a monotone transformation of each variable such that the joint distribution after transformation is multivariate Gaussian. Then any established estimation method for the GGM can be applied to the transformed variables. Other semiparametric/nonparametric methods include graphical random forests [fellinghauer2013stable], regularized score matching [lin2018methods], and kernel partial correlation [oh2017graphical].

The goal of this article is to build a semiparametric model that combines the GGM with the SS ANOVA density model. We are interested in both density and graph estimation. The remainder of the article is organized as follows. In Section 2 we introduce the semiparametric density model and methods for estimation and computation. We propose methods for graph estimation in Section 3. Section 4 presents theoretical properties of our methods in terms of both density and graph estimation. In Section 5 we evaluate our method using simulation studies. In Section 6 we present applications to two real datasets. Some technical details are gathered in the Appendix.

2 Density Estimation with SS ANOVA and cGGM

2.1 Semiparametric Density Models with SS ANOVA and cGGM

Consider the density estimation problem in which we are given a random sample of a random vector $\bm{Z}$, and we wish to estimate the density function $f(\bm{z})$ of $\bm{Z}$. Let $\bm{Z}=(\bm{X}^{T},\bm{Y}^{T})^{T}$ where $\bm{X}=(X_{1},\cdots,X_{d})^{T}$ is a $d$-dimensional random vector for which the density function will be modeled nonparametrically and $\bm{Y}=(Y_{1},\cdots,Y_{p})^{T}\in\mathbb{R}^{p}$ collects elements for which the conditional density will be modeled parametrically. The joint density function $f(\bm{z})$ can be decomposed into two components:

$$f(\bm{z})=f(\bm{x},\bm{y})=f(\bm{x})f(\bm{y}|\bm{x}). \tag{1}$$

We will model $f(\bm{x})$ and $f(\bm{y}|\bm{x})$ using SS ANOVA models and cGGMs respectively. We now provide details of these models.

Assume $\bm{X}\in\mathcal{X}=\mathcal{X}_{1}\times\dots\times\mathcal{X}_{d}$ where $X_{u}\in\mathcal{X}_{u}$ and each $\mathcal{X}_{u}$ is an arbitrary set. To deal with the positivity and unity constraints of a density function, we consider the logistic transform $f=e^{\eta}/\int_{\mathcal{X}}e^{\eta}d\bm{x}$ where $\eta(\bm{x})$ is referred to as the logistic density function [gu2013smoothing]. We construct a model space for $\eta$ using the tensor product of reproducing kernel Hilbert spaces (RKHS). The SS ANOVA decomposition of functions in the tensor product RKHS can be represented as

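$$\eta(\bm{x})=c+\sum_{k=1}^{d}\eta_{k}(x_{k})+\sum_{k>l}\eta_{kl}(x_{k},x_{l})+\dots+\eta_{1\ldots d}(x_{1},\dots,x_{d}), \tag{2}$$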

where the $\eta_{k}$'s are main effects, the $\eta_{kl}$'s are two-way interactions, and the rest are higher order interactions involving more than two variables. Higher order interactions are often removed in (2) for more tractable estimation and inference. An SS ANOVA model for the logistic density function assumes that $\eta$ belongs to an RKHS which contains a subset of components in the SS ANOVA decomposition (2). For a given SS ANOVA model, terms included in the model can be regrouped and the model space can be expressed as

$$\mathcal{H}=\mathcal{H}^{0}\oplus\mathcal{H}^{1}\oplus\dots\oplus\mathcal{H}^{w}, \tag{3}$$

where $\mathcal{H}^{0}$ is a finite dimensional space collecting all functions that are not going to be penalized, and $\mathcal{H}^{1},\dots,\mathcal{H}^{w}$ are orthogonal RKHS's with reproducing kernels (RK) $R^{v}$ for $v=1,\dots,w$. Details about the SS ANOVA model can be found in [gu2013smoothing] and [wang2011smoothing].
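As an illustration of the logistic transform $f=e^{\eta}/\int_{\mathcal{X}}e^{\eta}d\bm{x}$ introduced above, the following minimal Python sketch maps values of $\eta$ on a one-dimensional grid to a density. The grid, the trapezoidal quadrature, the function name `logistic_density`, and the particular $\eta$ are illustrative assumptions and are not taken from the authors' implementation.

```python
import numpy as np

def logistic_density(eta_vals, grid):
    """Map values of eta on a 1-D grid to f = exp(eta) / integral of exp(eta)."""
    # Subtract the maximum before exponentiating for numerical stability;
    # the shift cancels in the ratio.
    w = np.exp(eta_vals - eta_vals.max())
    # Trapezoidal rule approximates the normalizing integral over the grid.
    norm = np.sum(0.5 * (w[1:] + w[:-1]) * np.diff(grid))
    return w / norm

# Hypothetical eta on [0, 1]; any real-valued function yields a valid density.
grid = np.linspace(0.0, 1.0, 201)
eta = np.sin(2.0 * np.pi * grid) - 3.0 * (grid - 0.5) ** 2
f = logistic_density(eta, grid)
```

The transform is invariant to adding a constant to $\eta$, which is why the constant $c$ in the decomposition needs a side condition to be identifiable.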

We assume a cGGM for $f(\bm{y}|\bm{x})$. Specifically, we assume that $\bm{Y}|\bm{X}=\bm{x}\sim\text{N}(-\Lambda^{-1}\Theta^{T}\bm{x},\Lambda^{-1})$ where $\Lambda$ is a $p\times p$ precision matrix and $\Theta$ is a $d\times p$ matrix that parameterizes the conditional relationship between $\bm{X}$ and $\bm{Y}$ [sohn2012joint, wytock2013sparse, yuan2014partial]. We note that the negative log likelihood function is convex under this parameterization. An alternative assumption $\bm{Y}|\bm{X}=\bm{x}\sim\text{N}(\Psi\bm{x},\Lambda^{-1})$ [yin2011sparse] may be used to model the conditional density $f(\bm{y}|\bm{x})$ where the negative log likelihood function is biconvex in $\Psi$ and $\Lambda$ rather than jointly convex.
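To make the cGGM parameterization concrete, the sketch below computes the conditional mean $-\Lambda^{-1}\Theta^{T}\bm{x}$, the conditional covariance $\Lambda^{-1}$, and the conditional log density for given $\Lambda$ and $\Theta$. The function names and the numerical values of $\Lambda$, $\Theta$, and $\bm{x}$ are made up for illustration only.

```python
import numpy as np

def cggm_moments(Lam, Theta, x):
    """Conditional mean and covariance of Y | X = x under the cGGM."""
    Sigma = np.linalg.inv(Lam)       # covariance is the inverse precision matrix
    mean = -Sigma @ Theta.T @ x      # -Lambda^{-1} Theta^T x
    return mean, Sigma

def cggm_logpdf(y, x, Lam, Theta):
    """Log conditional density log f(y | x) of the cGGM."""
    p = Lam.shape[0]
    mean, _ = cggm_moments(Lam, Theta, x)
    r = y - mean
    _, logdet = np.linalg.slogdet(Lam)   # log|Lambda| for positive definite Lambda
    return 0.5 * logdet - 0.5 * p * np.log(2.0 * np.pi) - 0.5 * r @ Lam @ r

# Illustrative (made-up) parameters with d = 2 and p = 3.
Lam = np.array([[2.0, 0.5, 0.0],
                [0.5, 2.0, 0.5],
                [0.0, 0.5, 2.0]])
Theta = np.array([[1.0, 0.0, 0.0],
                  [0.0, 0.0, -0.5]])
x = np.array([0.3, -1.2])
mean, Sigma = cggm_moments(Lam, Theta, x)
```

Expanding the quadratic form shows that $\bm{x}$ and $\bm{y}$ interact in the log density only through $-\bm{x}^{T}\Theta\bm{y}$, so a zero entry $\Theta_{ij}$ removes the cross term between $x_{i}$ and $y_{j}$.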

We will refer to the proposed semiparametric model as the combined smoothing spline and conditional Gaussian graphical (cSScGG) model. The cSScGG model is closely related to the semiparametric kernel density estimation (SKDE) proposed by [hoti2004semiparametric], which considered the same decomposition as in (1). Given an iid sample $\bm{Z}_{i}=(\bm{X}_{i}^{T},\bm{Y}_{i}^{T})^{T}$, $i=1,\dots,n$, [hoti2004semiparametric] estimated $f(\bm{x})$ using the kernel density, $\hat{f}(\bm{x})=n^{-1}\sum_{i=1}^{n}K_{h_{1}}(\bm{x}-\bm{X}_{i})$, and $f(\bm{y}|\bm{x})$ using the conditional Gaussian density with mean $\mu(\bm{x})$ and covariance $\Sigma(\bm{x})$. Specifically, they estimated $\mu(\bm{x})$ and $\Sigma(\bm{x})$ by $\hat{\mu}(\bm{x})=\sum_{i=1}^{n}W_{h_{2}}(\bm{x}-\bm{X}_{i})\bm{Y}_{i}$ and $\hat{\Sigma}(\bm{x})=\sum_{i=1}^{n}W_{h_{3}}(\bm{x}-\bm{X}_{i})(\bm{Y}_{i}-\hat{\mu}(\bm{x}))(\bm{Y}_{i}-\hat{\mu}(\bm{x}))^{T}$ respectively, where $K_{h}(\bm{x})=h^{-d}K(\bm{x}/h)$, $K$ is the symmetric Gaussian kernel function, $W_{h}(\bm{x}-\bm{X}_{i})=K_{h}(\bm{x}-\bm{X}_{i})/\sum_{j=1}^{n}K_{h}(\bm{x}-\bm{X}_{j})$, and $h_{1}$, $h_{2}$, and $h_{3}$ are bandwidths. Selection of bandwidths can be difficult and the estimation of conditional mean and covariance can be poor when the dimension of $\bm{Y}$ is large. The authors focused on the classification problem. They set $W_{h_{2}}(\bm{x}-\bm{X}_{i})=W_{h_{3}}(\bm{x}-\bm{X}_{i})=1/n$ in their simulations to make the computation feasible. Under these weights the estimated conditional density $f(\bm{y}|\bm{x})$ does not depend on $\bm{x}$ at all. In contrast, we model $f(\bm{y}|\bm{x})$ using a cGGM which will allow us to explore sparsity in the conditional dependence structure. In addition, the domain $\mathcal{X}$ in our model is an arbitrary set while the domain in the SKDE method is a subset of $\mathbb{R}^{d}$. While we focus on continuous $\bm{X}$ in this paper, the discrete case is a natural extension of the current work.
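The kernel-weighted estimates $\hat{\mu}(\bm{x})$ and $\hat{\Sigma}(\bm{x})$ used by SKDE can be sketched as follows. This is only an illustration of the formulas quoted above with a Gaussian kernel; the function names are hypothetical and no claim is made about how the SKDE authors implemented them. The factor $h^{-d}$ cancels in the ratio defining $W_{h}$, so it is omitted.

```python
import numpy as np

def kernel_weights(x, X, h):
    """Weights W_h(x - X_i) = K_h(x - X_i) / sum_j K_h(x - X_j), Gaussian K."""
    logk = -0.5 * np.sum((X - x) ** 2, axis=1) / h ** 2
    logk -= logk.max()               # stabilize; the shift cancels in the ratio
    k = np.exp(logk)
    return k / k.sum()

def local_mean_cov(x, X, Y, h2, h3):
    """Kernel-weighted conditional mean and covariance of Y at X = x."""
    mu = kernel_weights(x, X, h2) @ Y
    w3 = kernel_weights(x, X, h3)
    R = Y - mu                       # residuals Y_i - mu_hat(x), one per row
    Sigma = (R * w3[:, None]).T @ R  # sum_i w_i r_i r_i^T
    return mu, Sigma
```

With very large bandwidths the weights approach $1/n$, and the estimates reduce to the ordinary sample mean and (biased) sample covariance of $\bm{Y}$, which no longer vary with $\bm{x}$.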

2.2 Penalized Likelihood Estimation

A cSScGG model consists of three parameters: $\eta\in\mathcal{H}$ and matrices $\Lambda$ and $\Theta$ where $\mathcal{H}$ is an RKHS given in (3) and $\Lambda$ is positive definite. Given an iid sample $\bm{Z}_{i}=(\bm{X}_{i}^{T},\bm{Y}_{i}^{T})^{T}$, $i=1,\dots,n$, let $X=(\bm{X}_{1},\dots,\bm{X}_{n})^{T}$, $Y=(\bm{Y}_{1},\dots,\bm{Y}_{n})^{T}$, $S_{xx}=n^{-1}X^{T}X$, $S_{yy}=n^{-1}Y^{T}Y$, and $S_{xy}=n^{-1}X^{T}Y$. Denote

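$$l_{1}(\eta)=\frac{1}{n}\sum_{i=1}^{n}e^{-\eta(\bm{X}_{i})}+\int_{\mathcal{X}}\eta(\bm{x})\rho(\bm{x})d\bm{x}, \tag{4}$$

$$l_{2}(\Theta,\Lambda)=-\log|\Lambda|+\text{tr}\big(S_{yy}\Lambda+2S_{xy}^{T}\Theta+\Lambda^{-1}\Theta^{T}S_{xx}\Theta\big) \tag{5}$$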

as the negative log pseudo likelihood and negative log likelihood functions based on $\bm{X}$ and $\bm{Y}$ samples respectively, where some constants are ignored and $\rho$ is a known density for the pseudo likelihood [gu2013smoothing]. The function $l_{1}(\eta)$ is continuous, convex and Fréchet differentiable [jeon2006effective], and the function $l_{2}(\Theta,\Lambda)$ is jointly convex in $\Lambda$ and $\Theta$.
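A small numerical sketch of $l_{2}$ may help. It assumes the form $l_{2}(\Theta,\Lambda)=-\log|\Lambda|+\text{tr}(S_{yy}\Lambda+2S_{xy}^{T}\Theta+\Lambda^{-1}\Theta^{T}S_{xx}\Theta)$, which follows from the cGGM density of Section 2.1 after dropping constants and rescaling by $2/n$; the function name `l2` and the simulated data are illustrative assumptions.

```python
import numpy as np

def l2(Theta, Lam, Sxx, Sxy, Syy):
    """Negative log likelihood of the cGGM, up to constants and a factor 2/n."""
    _, logdet = np.linalg.slogdet(Lam)
    Sig = np.linalg.inv(Lam)
    M = Syy @ Lam + 2.0 * Sxy.T @ Theta + Sig @ Theta.T @ Sxx @ Theta
    return -logdet + np.trace(M)

# Simulated (hypothetical) data, used only to form the sample moments.
rng = np.random.default_rng(1)
n, d, p = 50, 3, 4
X = rng.normal(size=(n, d))
Y = rng.normal(size=(n, p))
Sxx, Sxy, Syy = X.T @ X / n, X.T @ Y / n, Y.T @ Y / n

# Value at Theta = 0 and Lambda = I, where l2 reduces to tr(S_yy).
value = l2(np.zeros((d, p)), np.eye(p), Sxx, Sxy, Syy)
```

Joint convexity can be spot-checked numerically by comparing the value at the midpoint of two parameter pairs with the average of the two endpoint values.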

We estimate $\eta$ , $\Lambda$ and $\Theta$ as minimizers of the penalized likelihood:

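$$\{\hat{\eta},\hat{\Lambda},\hat{\Theta}\}=\operatorname*{arg\,min}_{\eta\in\mathcal{H},\Lambda\succ 0,\Theta}\left\{\Big[l_{1}(\eta)+\frac{\lambda_{1}}{2}J(\eta)\Big]+\Big[l_{2}(\Lambda,\Theta)+\lambda_{2}\left\lVert\Lambda\right\rVert_{1,\text{off}}+\lambda_{3}\left\lVert\Theta\right\rVert_{1}\Big]\right\}, \tag{6}$$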

where $J$ is a semi-norm in $\mathcal{H}$ that penalizes departure from the null space $\mathcal{H}^{0}$, $\left\lVert\cdot\right\rVert_{1}$ denotes the elementwise $\ell_{1}$-norm, $\left\lVert\cdot\right\rVert_{1,\text{off}}$ denotes the elementwise $\ell_{1}$-norm on off-diagonal entries, and $\Lambda\succ 0$ indicates positive definiteness of $\Lambda$. Together, $\left\lVert\Lambda\right\rVert_{1,\text{off}}$ and $\left\lVert\Theta\right\rVert_{1}$ encourage sparsity for the cGGM. We allow different tuning parameters for different penalties.

Note that the first part of the penalized likelihood depends on $\eta$ only and the second part depends on $\Theta$ and $\Lambda$ only. Therefore, we can compute the penalized likelihood estimates by solving two optimization problems separately:

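$$\hat{\eta}=\operatorname*{arg\,min}_{\eta\in\mathcal{H}}\left\{\frac{1}{n}\sum_{i=1}^{n}e^{-\eta(\bm{X}_{i})}+\int_{\mathcal{X}}\eta(\bm{x})\rho(\bm{x})d\bm{x}+\frac{\lambda_{1}}{2}J(\eta)\right\}, \tag{7}$$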

and

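$$\{\hat{\Theta},\hat{\Lambda}\}=\operatorname*{arg\,min}_{\Lambda\succ 0,\Theta}\Big\{l_{2}(\Lambda,\Theta)+\lambda_{2}\left\lVert\Lambda\right\rVert_{1,\text{off}}+\lambda_{3}\left\lVert\Theta\right\rVert_{1}\Big\}. \tag{8}$$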

As in [gu2013smoothing], we approximate the solution of (7) by a linear combination of basis functions in $\mathcal{H}^{0}$ and a random subset of representers. Then the estimate $\hat{\eta}$ can be calculated using the Newton-Raphson algorithm. The smoothing parameter $\lambda_{1}$ is selected as the minimizer of an approximated cross-validation estimate of the Kullback-Leibler (KL) divergence. Details can be found in [gu2013smoothing], [gu2013nonparametric], and [Luothesis]. In the next section we propose a new computational method for solving (8).

2.3 Backfitting Algorithm for cGGM

Instead of updating $\Lambda$ and $\Theta$ simultaneously as in [sohn2012joint], [wytock2013sparse] and [yuan2014partial], we consider a backfitting procedure that updates them in turn until convergence. We use the subscript $(t)$ to denote quantities calculated at iteration $t$ and $A_{ij}$ to denote the $(i,j)$ -th element of a matrix $A$ .

At iteration $t+1$ , with $\Lambda$ fixed at $\Lambda_{(t)}$ , (8) reduces to the minimization of a quadratic function plus an $\ell_{1}$ penalty. Therefore, without needing to calculate the Hessian matrix, $\Theta$ can be updated efficiently using the coordinate descent algorithm. The gradient is $\nabla_{\Theta}l_{2}(\Lambda,\Theta)=2S_{xy}+2S_{xx}\Theta\Lambda^{-1}$ . Let $\Sigma=\Lambda^{-1}$ denote the covariance matrix. Then the $(i,j)$ th element $\Theta_{ij}$ is updated by

[TABLE]

where $a_{\Theta}=2\mbox{$ \Sigma $}_{jj,(t)}(\mbox{$ S_{xx} $})_{ii}$ , $b_{\Theta}=2(\mbox{$ S_{xy} $})_{ij}+2(\mbox{$ S_{xx} $}\Theta_{(t)}\mbox{$ \Sigma $}_{(t)})_{ij}$ , $c_{\Theta}=\Theta_{ij,(t)}$ , and $S_{\omega}(x)=\text{sign}(x)\max(|x|-\omega,0)$ is the soft-thresholding operator with threshold $\omega$ .
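The soft-thresholding operator and a single coordinate update can be sketched in Python as follows. This is only an illustration: it assumes each coordinate step minimizes $\frac{1}{2}a\delta^{2}+b\delta+\lambda|c+\delta|$ over the increment $\delta$ (the usual form for this kind of coordinate descent), and the function names and numbers are hypothetical rather than taken from the authors' R code.

```python
def soft_threshold(x, omega):
    """S_omega(x) = sign(x) * max(|x| - omega, 0)."""
    if x > omega:
        return x - omega
    if x < -omega:
        return x + omega
    return 0.0


def coordinate_step(a, b, c, lam):
    """Minimizer over delta of 0.5*a*delta**2 + b*delta + lam*abs(c + delta).

    Assumed form of one coordinate update (requires a > 0): a is the
    curvature, b the gradient entry, c the current value of the coordinate,
    and lam the l1 penalty weight.
    """
    return -c + soft_threshold(c - b / a, lam / a)


# Illustrative numbers only (not from the paper).
a, b, c, lam = 2.0, -1.0, 0.3, 0.5
delta = coordinate_step(a, b, c, lam)
new_value = c + delta  # updated coordinate
```

Writing $u=c+\delta$ turns the problem into a scalar lasso in $u$, which is why the solution is a soft-thresholded value shifted back by $c$.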

To update $\Lambda$ at iteration $t+1$ , we consider the approximate conditional distribution $\text{N}(-\Lambda^{-1}_{(t)}\Theta^{T}_{(t)}\bm{x},\Lambda^{-1})$ where both $\Theta$ and $\Lambda$ in the conditional mean are fixed at their estimates from the $t$ -th iteration. The resulting negative log likelihood is

[TABLE]

where terms independent of $\Lambda$ are dropped. We update $\Lambda$ by

[TABLE]

As in [hsieh2011sparse], we find the Newton direction by approximating $h_{(t)}$ using a quadratic function. Based on the second-order Taylor expansion of $h_{(t)}(\Lambda)$ at $\Lambda_{(t)}$ where $\Lambda=\Lambda_{(t)}+\Delta_{\Lambda}$ and ignoring terms independent of $\Delta_{\Lambda}$ , we consider

[TABLE]

where $\nabla h_{(t)}(\Lambda_{(t)})=\mbox{$ S_{yy} $}+\mbox{$ \Sigma $}_{(t)}\Theta_{(t)}^{T}\mbox{$ S_{xx} $}\Theta_{(t)}\mbox{$ \Sigma $}_{(t)}+2\mbox{$ \Sigma $}_{(t)}\Theta_{(t)}^{T}\mbox{$ S_{xy} $}-\mbox{$ \Sigma $}_{(t)}$ and $\nabla^{2}h_{(t)}(\Lambda_{(t)})=\mbox{$ \Sigma $}_{(t)}\otimes\mbox{$ \Sigma $}_{(t)}$ are gradient and Hessian matrices with respect to $\Lambda$ respectively, and $\otimes$ represents the Kronecker product. The Newton direction $D_{\Lambda,(t)}$ for (11) can be written as the solution of the following regularized quadratic function [hsieh2011sparse]

[TABLE]

Equation (12) can be solved efficiently via the coordinate descent algorithm. Specifically, let $\Delta_{\Lambda,(0)}=0$ be the initial value, and $\Delta_{\Lambda,(s)}$ be the update at iteration $s$ . Then at iteration $s+1$ , the $(i,j)$ th element of $\Delta_{\Lambda,(s)}$ is updated by

[TABLE]

where $a_{\Lambda}=\Sigma^{2}_{ij,(t)}+\Sigma_{ii,(t)}\Sigma_{jj,(t)}$ , $b_{\Lambda}=(S_{yy})_{ij}+\Big(\Sigma_{(t)}\Theta_{(t)}^{T}S_{xx}\Theta_{(t)}\Sigma_{(t)}\Big)_{ij}+2\Big(\Sigma_{(t)}\Theta_{(t)}^{T}S_{xy}\Big)_{ij}-\Sigma_{ij,(t)}+(\Sigma_{(t)}\Delta_{\Lambda,(s)}\Sigma_{(t)})_{ij}$ and $c_{\Lambda}=\Lambda_{ij,(t)}+(\Delta_{\Lambda})_{ij,(s)}$ . Denote the penalized objective function at the $t$ -th iteration as $p_{(t)}(\Lambda)\triangleq h_{(t)}(\Lambda)+\lambda_{2}\left\lVert\Lambda\right\rVert_{1,\text{off}}$ . We adopt Armijo's rule [armijo1966minimization] to find the step size $\alpha$ . Specifically, with a constant decrease rate $0<\beta<1$ (typically $\beta=0.5$ ), step sizes $\alpha=\beta^{k}$ for $k\in\mathbb{N}$ are tried in turn, and we take the smallest $k$ such that

[TABLE]

where $0<\sigma<0.5$ is the backtracking termination threshold. After the step size is calculated, we update $\Lambda_{(t+1)}=\Lambda_{(t)}+\alpha D_{\Lambda,(t)}$ .
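The backtracking search can be illustrated with a generic Armijo rule. The sketch below is an assumption-laden simplification: it uses a smooth toy objective and the classical sufficient-decrease condition, whereas the paper's condition is stated for the penalized objective $p_{(t)}$ and also has to keep $\Lambda$ positive definite. The function name, the default $\sigma=0.25$, and the toy problem are illustrative choices.

```python
def armijo_step(f, x, direction, slope, beta=0.5, sigma=0.25, max_tries=50):
    """Return the first alpha = beta**k (k = 0, 1, ...) such that
    f(x + alpha * direction) <= f(x) + sigma * alpha * slope,
    where slope is the directional derivative at x (negative for descent)."""
    fx = f(x)
    alpha = 1.0
    for _ in range(max_tries):
        candidate = [xi + alpha * di for xi, di in zip(x, direction)]
        if f(candidate) <= fx + sigma * alpha * slope:
            return alpha
        alpha *= beta  # shrink the step and try again
    raise RuntimeError("no acceptable step size found")


# Toy objective f(x) = x1^2 + x2^2 with the steepest-descent direction.
f = lambda x: sum(xi * xi for xi in x)
x0 = [1.0, -1.0]
grad = [2.0 * xi for xi in x0]
d = [-g for g in grad]
slope = sum(g * di for g, di in zip(grad, d))  # directional derivative
alpha = armijo_step(f, x0, d, slope)
```

Here the full step overshoots the minimum, so the first trial is rejected and the halved step is accepted.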

When $n>\max(p,d)$ , we use the maximum likelihood estimates $\check{\Lambda}=(\mbox{$ S_{yy} $}-S_{xy}^{T}S_{xx}^{-1}\mbox{$ S_{xy} $})\mbox{$ {}^{-1} $}$ and $\check{\Theta}=-S_{xx}^{-1}\mbox{$ S_{xy} $}\check{\Lambda}$ of $\Lambda$ and $\Theta$ as initial values for $\Lambda$ and $\Theta$ respectively [yin2011sparse]. In the high dimensional case when $S_{xx}$ is not invertible, we use the identity and zero matrix as initial values for $\Lambda$ and $\Theta$ respectively.

The regularized Newton step (12) via the coordinate descent algorithm described above is the most computationally expensive part of the algorithm. Although coordinate descent is efficient for lasso-type problems, updating all $p(p+1)/2$ variables in $\Lambda$ is costly. To alleviate this problem, we divide the parameter set into an active set and a free set. As in [hsieh2011sparse] and [wytock2013sparse], at the $t$ th iteration of the algorithm, we only update $\Theta$ and $\Lambda$ over the active set defined by

[TABLE]

As the active set is relatively small due to sparsity induced by the $\ell_{1}$ regularization, this strategy provides a substantial speedup.
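As a rough illustration of this screening idea, the sketch below keeps a coordinate when it is currently non-zero or when its gradient exceeds the penalty level in absolute value. This rule follows the QUIC-style screening and is an assumption here; the paper's exact definition of the active set is the one given in the display above. The function name and the toy matrices are hypothetical.

```python
def active_set(grad, current, lam):
    """Indices (i, j) that stay in play: the entry is non-zero, or the
    absolute gradient exceeds the penalty level lam (assumed rule)."""
    selected = []
    for i, row in enumerate(current):
        for j, value in enumerate(row):
            if value != 0.0 or abs(grad[i][j]) > lam:
                selected.append((i, j))
    return selected


# Toy 3 x 3 example with made-up numbers.
grad = [[0.0, 0.8, 0.1],
        [0.8, 0.0, 0.05],
        [0.1, 0.05, 0.0]]
current = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 0.2],
           [0.0, 0.2, 1.0]]
kept = active_set(grad, current, lam=0.3)
```

Entries that are zero and have a small gradient would remain at zero after a soft-thresholded update, so skipping them loses nothing in that iteration.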

Tuning parameters $\lambda_{2}$ and $\lambda_{3}$ determine the sparsity of $\Lambda$ and $\Theta$ . As a general selection tool, leave-one-out or $k$ -fold cross-validation can be used to select these tuning parameters. Leave-one-out cross-validation (LOOCV) can be computationally intensive, and various approximations have been proposed in the literature. [lian2011shrinkage] and [vujavcic2015computationally] derived generalized approximate cross-validation (GACV) scores for selecting a single tuning parameter in the GGM. The BIC and $k$ -fold CV have been used to select a single tuning parameter in the cGGM [yin2011sparse, sohn2012joint, wytock2013sparse, yuan2014partial, lee2012simultaneous]. LOOCV has not been used for the cGGM as it requires fitting the model $n$ times, which is computationally intensive. To the best of our knowledge, there are no computationally efficient alternatives to LOOCV in the current cGGM literature. We propose a new criterion, Leave-One-Out KL (LOOKL), for selecting $\lambda_{2}$ and $\lambda_{3}$ involved in (8) as minimizers of

[TABLE]

where $S_{xx,k}=\bm{X}_{k}^{T}\bm{X}_{k}$ , $S_{yy,k}=\bm{Y}_{k}^{T}\bm{Y}_{k}$ , $S_{xy,k}=\bm{Y}_{k}^{T}\bm{X}_{k}$ , $S_{xx}^{(-k)}=1/n\sum_{i\neq k}S_{xx,i}$ , $S_{yy}^{(-k)}=1/n\sum_{i\neq k}S_{yy,i}$ , $S_{xy}^{(-k)}=1/n\sum_{i\neq k}S_{xy,i}$ , $\bm{u}_{k}=\mathrm{vec}(\hat{\Lambda}^{-1}-S_{yy,k}+\hat{\Lambda}^{-1}\hat{\Theta}^{T}S_{xx,k}\hat{\Theta}\hat{\Lambda}^{-1})$ , $\bm{v}_{xx,k}=\mathrm{vec}(S_{xx}^{(-k)}-S_{xx})$ , $\bm{v}_{yy,k}=\mathrm{vec}(S_{yy}^{(-k)}-S_{yy})$ , $\bm{v}_{xy,k}=\mathrm{vec}(S_{xy}^{(-k)}-S_{xy})$ , $\bm{w}_{k}=\mathrm{vec}(-2S_{xy,k}-2S_{xx,k}\hat{\Theta}\hat{\Lambda}^{-1})$ , $A=-2\hat{\Lambda}^{-1}\otimes S_{xx}$ , $B=2\hat{\Lambda}^{-1}\otimes S_{xx}\hat{\Theta}\hat{\Lambda}^{-1}$ , $C=-\hat{\Lambda}^{-1}\otimes(\hat{\Lambda}^{-1}+2\hat{\Lambda}^{-1}\hat{\Theta}^{T}S_{xx}\hat{\Theta}\hat{\Lambda}^{-1})$ , $D=-2\hat{\Lambda}^{-1}\hat{\Theta}^{T}\otimes I_{d\times d}$ , and $E=\hat{\Lambda}^{-1}\hat{\Theta}^{T}\otimes\hat{\Lambda}^{-1}\hat{\Theta}^{T}$ . The derivation is deferred to Appendix A. Note that the GACV in [lian2011shrinkage] and the KLCV in [vujavcic2015computationally] are special cases of LOOKL with $\Theta=0$ .
In the penalized case, we ignore the partial derivatives corresponding to the zero elements in $\Theta$ and $\Lambda$ , as in [lian2011shrinkage], and show that the LOOKL score remains the same. More details can be found in [Luothesis]. Because the score targets the KL divergence, we conjecture that it is more appropriate for density estimation than for model selection.

We note that the proposed backfitting procedure and the LOOKL method for selecting tuning parameters are new for the cGGM. When $\Theta$ and $\Lambda$ are simultaneously updated using the second-order Taylor expansion over all parameters [wytock2013sparse], an expensive computation of the large Hessian matrix of size $(p+d)\times(p+d)$ is required in each iteration. In contrast, our approach forms a second-order approximation of a function of $\Lambda$ which requires a Hessian matrix of size $p\times p$ . The remaining set of parameters in $\Theta$ can be updated easily using the simple coordinate descent algorithm. Moreover, compared to the method in [mccarter2016large], our backfitting algorithm eliminates the need for computing the large matrix $\Sigma\Theta^{T}S_{xx}\Theta\Sigma$ in $\mathcal{O}(npd+np^{2})$ time. Note that we always require $\Lambda$ to be positive definite after each iteration, so the algorithm still has complexity $\mathcal{O}(p^{3})$ flops due to the Cholesky factorization.

Some off-the-shelf packages are utilized to solve the optimization problem. Specifically, we use QUIC [hsieh2014quic] for updating $\Lambda$ , and gss [gu2014smoothing] for computing the smoothing spline estimate of $f(\bm{x})$ . We wrote R code for updating $\Theta$ using (9). We note that other penalties such as the smoothly clipped absolute deviation (SCAD) [fan2001variable] and adaptive lasso [zou2006adaptive] may be used to replace the $\ell_{1}$ penalty in the estimation of $\Theta$ and $\Lambda$ . Details can be found in [Luothesis].

3 Graph Estimation with cSScGG Models

In Section 2 we proposed the cSScGG model as a flexible framework for estimating the multivariate density in high-dimensional settings. In terms of the graph structure, the edges among $\bm{Y}$ are identified by $\hat{\Lambda}$ , and edges between $\bm{X}$ and $\bm{Y}$ are identified by $\hat{\Theta}$ [sohn2012joint, wytock2013sparse, yuan2014partial]. The remaining task is the identification of conditional independence within the $\bm{X}$ variables, which is the target of this section.

We have assumed that the model space for the logistic density $\eta$ contains a subset of components in the SS ANOVA decomposition (2). The interactions are often truncated to overcome the curse of dimensionality and reduce the computational cost. As in [gu2013smoothing] and [gu2013nonparametric], in this section we consider the SS ANOVA model with all main effects and two-way interactions as the model space for $\eta$ . We note that the SS ANOVA model allows pairwise nonparametric interactions as opposed to linear interactions in the GGM. [gu2013smoothing] and [gu2013nonparametric] proposed the squared error projection for assessing the importance of each interaction term and subsequently identifying edges. However, we cannot apply their method directly to $\hat{\eta}$ to identify edges within $\bm{X}$ since the cGGM for $f(\bm{y}|\bm{x})$ also includes interaction terms among variables in $\bm{X}$ .

The logarithm of the joint density is

[TABLE]

where $C$ is a constant independent of $\bm{x}$ and $\bm{y}$ . The main challenge in identifying conditional independence among $\bm{X}$ comes from the fact that $f(\mbox{$ \bm{y} $}|\mbox{$ \bm{x} $})$ brings in an extra term, $-\mbox{$ \bm{x} $}\mbox{$ {}^{T} $}\mbox{$ \Theta $}\mbox{$ {}^{T} $}\mbox{$ \Lambda $}\mbox{$ {}^{-1} $}\mbox{$ \Theta $}\mbox{$ \bm{x} $}/2$ , into the interactions among $\bm{X}$ . Let

[TABLE]

where $\hat{\Delta}(\mbox{$ \bm{x} $})=-\mbox{$ \bm{x} $}^{T}\hat{\mbox{$ \Theta $}}\mbox{$ {}^{T} $}\hat{\mbox{$ \Lambda $}}\mbox{$ {}^{-1} $}\hat{\mbox{$ \Theta $}}\mbox{$ \bm{x} $}/2$ . Define the functional

[TABLE]

and denote $\tilde{V}(f,f)$ as $\tilde{V}(f)$ . Let $\mathcal{H}=\mathcal{S}^{0}\oplus\mathcal{S}^{1}$ where $\mathcal{S}^{1}$ collects functions whose contribution to the overall model is in question. The squared error projection of $\hat{\zeta}$ in $\mathcal{S}^{0}$ is [gu2013smoothing]

[TABLE]

$\tilde{V}(\mbox{$ \hat{\zeta} $}-\zeta)$ can be regarded as a proxy of the symmetrized KL divergence [gu2013smoothing]. Assuming $\mbox{$ \zeta_{u} $}=-\log\mbox{$ \rho(\mbox{ $\bm{x}$ }) $}\in\mathcal{S}^{0}$ , it is easy to check that $\tilde{V}(\mbox{$ \hat{\zeta} $}-\mbox{$ \zeta_{u} $})=\tilde{V}(\mbox{$ \hat{\zeta} $}-\mbox{$ \tilde{\zeta} $})+\tilde{V}(\mbox{$ \tilde{\zeta} $}-\mbox{$ \zeta_{u} $})$ . Then the ratio $\tilde{V}(\mbox{$ \hat{\zeta} $}-\mbox{$ \tilde{\zeta} $})/\tilde{V}(\mbox{$ \hat{\zeta} $}-\mbox{$ \zeta_{u} $})$ reflects the importance of functions in $\mathcal{S}^{1}$ . The quantity $\tilde{V}(\mbox{$ \hat{\zeta} $}-\mbox{$ \zeta_{u} $})$ is readily computable while details for computing the squared error projection $\tilde{\zeta}$ are given in Appendix B.

For any pair of variables $X_{i}$ and $X_{j}$ , consider the decomposition $\mathcal{H}=\mathcal{S}^{0}_{ij}\oplus\mathcal{S}^{1}_{ij}$ where $\mathcal{S}^{1}_{ij}$ is the subspace consisting of two-way interactions between $X_{i}$ and $X_{j}$ , and $\mathcal{S}^{0}_{ij}$ contains all functions in $\mathcal{H}$ except the two-way interactions between $X_{i}$ and $X_{j}$ . Note that $\zeta_{ij}(x_{i},x_{j})\triangleq\eta_{ij}(x_{i},x_{j})+\hat{\Delta}_{ij}x_{i}x_{j}\in S_{ij}^{1}$ where $\hat{\Delta}_{ij}=(\hat{\mbox{$ \Theta $}}\mbox{$ {}^{T} $}\hat{\mbox{$ \Lambda $}}\mbox{$ {}^{-1} $}\hat{\mbox{$ \Theta $}})_{ij}$ . Compute the projection ratio $r_{ij}\triangleq\tilde{V}(\hat{\zeta}-\tilde{\zeta})/\tilde{V}(\hat{\zeta}-\zeta_{u})$ in which $\tilde{\zeta}$ is the squared error projection of $\hat{\zeta}$ in $\mathcal{S}^{0}_{ij}$ . The ratio $r_{ij}$ indicates the importance of interactions between $X_{i}$ and $X_{j}$ , and we will add the interactions to the additive model sequentially according to the descending order of $r_{ij}$ ’s.

Consider the space decomposition $\mathcal{H}=\mathcal{S}^{0}\oplus\mathcal{S}^{1}$ . We start with $\mathcal{S}^{0}$ being the subspace spanned by all main effects. We calculate the projection ratio of $\hat{\zeta}$ in $\mathcal{S}^{0}$ as $r=\tilde{V}(\hat{\zeta}-\tilde{\zeta})/\tilde{V}(\hat{\zeta}-\zeta_{u})$ where $\tilde{\zeta}$ is the squared error projection in $\mathcal{S}^{0}$ . If $r$ is larger than a threshold, $\mathcal{S}^{1}$ is deemed important and we move the interaction with the largest $r_{ij}$ from $\mathcal{S}^{1}$ to $\mathcal{S}^{0}$ . We then calculate the projection ratio $r$ with the updated $\mathcal{S}^{0}$ and $\mathcal{S}^{1}$ . The projection ratio decreases each time we move an interaction from $\mathcal{S}^{1}$ to $\mathcal{S}^{0}$ . Finally the process stops when $r$ falls below a cut-off value, at which time we denote the corresponding $\mathcal{S}^{0}$ as $\mathcal{S}_{s}^{0}$ . Let $\Pi_{ij}=I\big{(}\zeta_{ij}\in\mathcal{S}_{s}^{0}\big{)}$ and remove the edge between $X_{i}$ and $X_{j}$ if $\Pi_{ij}=0$ . In our implementations, the cut-off value is set to be $3\%$ .
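The sequential procedure above can be sketched as a greedy loop. In this sketch the callable `overall_ratio` is a hypothetical stand-in for the computation of the projection ratio $r$ (the actual computation relies on the squared error projection of Appendix B), and the toy ratios are made-up numbers used only to exercise the control flow.

```python
def select_interactions(pairs, pair_ratio, overall_ratio, cutoff=0.03):
    """Greedy forward selection of two-way interactions.

    pairs:         iterable of candidate (i, j) pairs
    pair_ratio:    dict mapping (i, j) -> r_ij
    overall_ratio: callable taking the set of pairs already moved to S^0
                   and returning the projection ratio r (hypothetical helper)
    """
    # Interactions enter in descending order of r_ij.
    remaining = sorted(pairs, key=lambda ij: pair_ratio[ij], reverse=True)
    selected = set()
    r = overall_ratio(selected)
    while r > cutoff and remaining:
        selected.add(remaining.pop(0))
        r = overall_ratio(selected)
    return selected, r


# Toy stand-in: r is the total ratio of the interactions still left in S^1.
r_ij = {(0, 1): 0.50, (0, 2): 0.30, (1, 2): 0.02}
toy_ratio = lambda chosen: sum(v for k, v in r_ij.items() if k not in chosen)
kept, final_r = select_interactions(r_ij.keys(), r_ij, toy_ratio)
Pi = {pair: int(pair in kept) for pair in r_ij}  # Pi_ij = 0 removes the edge
```

With the 3% cut-off, the two strongest interactions are retained and the pair (1, 2) is dropped, so no edge is drawn between those two variables.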

To summarize, the conditional independences among $\bm{Y}$ , between $\bm{X}$ and $\bm{Y}$ , and among $\bm{X}$ are characterized by the zero elements in $\hat{\Lambda}$ , $\hat{\Theta}$ and $\Pi$ , respectively. The whole procedure for edge identification is illustrated in Figure 1.
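One way to assemble the three pieces into a single graph is sketched below. It assumes that $\bm{Z}=(\bm{X}^{T},\bm{Y}^{T})^{T}$ is ordered with the $d$ variables in $\bm{X}$ first and that $\hat{\Theta}$ is stored as a $d\times p$ array, which is the dimension implied by the product $S_{xx}\Theta\Lambda^{-1}$ used earlier; the function name and the toy inputs are illustrative.

```python
def combined_adjacency(Pi, Theta_hat, Lambda_hat):
    """Adjacency matrix over Z = (X, Y), with X in the first d positions.

    Pi:         d x d 0/1 matrix for edges within X
    Theta_hat:  d x p matrix; non-zero entries give X-Y edges
    Lambda_hat: p x p matrix; non-zero off-diagonal entries give Y-Y edges
    """
    d, p = len(Theta_hat), len(Lambda_hat)
    A = [[0] * (d + p) for _ in range(d + p)]
    for i in range(d):
        for j in range(d):
            if i != j and Pi[i][j]:
                A[i][j] = 1
        for k in range(p):
            if Theta_hat[i][k] != 0:
                A[i][d + k] = A[d + k][i] = 1
    for k in range(p):
        for l in range(p):
            if k != l and Lambda_hat[k][l] != 0:
                A[d + k][d + l] = 1
    return A


# Toy inputs with d = 2 and p = 2 (made-up numbers).
Pi = [[0, 1], [1, 0]]
Theta_hat = [[0.0, 0.4], [0.0, 0.0]]
Lambda_hat = [[1.0, 0.0], [0.0, 1.0]]
A = combined_adjacency(Pi, Theta_hat, Lambda_hat)
```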

4 Theoretical Analysis

We list notations, assumptions, and theoretical results only. Proofs are given in Appendix C.

4.1 Notations and Assumptions

Given a matrix $U$ , let ${\left|\kern-1.07639pt\left|\kern-1.07639pt\left|U\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}=\sqrt{\lambda_{\text{max}}(U^{T}U)}$ , ${\left|\kern-1.07639pt\left|\kern-1.07639pt\left|U\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{\infty}=\max_{i=1,\dots,p}\sum_{j=1}^{p}|U_{ij}|$ and ${\left|\kern-1.07639pt\left|\kern-1.07639pt\left|U\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{F}=\sqrt{\sum_{i=1}^{p}\sum_{j=1}^{p}U_{ij}^{2}}$ denote the $\ell_{2}$ operator norm, $\ell_{\infty}$ operator norm and Frobenius norm respectively, where $\lambda_{\text{max}}(U^{T}U)$ represents the largest eigenvalue of $U^{T}U$ . We assume that $\bm{Y}|\bm{X}=\bm{x}\sim\text{N}(-\Lambda_{0}^{-1}\Theta_{0}^{T}\bm{x},\Lambda_{0}^{-1})$ where $\Lambda_{0}$ and $\Theta_{0}$ are the true parameters. Let $\Gamma_{0}=(\Lambda^{T}_{0},\Theta^{T}_{0})^{T}$ , $\Sigma_{0}=\Lambda_{0}^{-1}$ , $C_{\sigma}=\max_{i}\Sigma_{0,ii}$ , $C_{\Sigma}=\max_{i,j}|\Sigma_{0,ij}|$ , $C_{\Theta}=\max_{i,j}|\Theta_{0,ij}|$ , $C_{X}=\max_{j=1,\dots,d}\left\lVert\bm{X}^{j}\right\rVert_{2}/\sqrt{n}$ where $\bm{X}^{j}$ is the $j$ th column of $X$ , $H_{0}=\nabla^{2}_{\Lambda,\Theta}l_{2}(\Lambda_{0},\Theta_{0})$ denote the Hessian matrix evaluated at the true parameters, and $\kappa_{H}=\max_{i,j}|H^{-1}_{0,ij}|$ . Let $\gamma=\max_{1\leq j\leq p}\left\{\sum_{i=1}^{d+p}I(\Gamma_{0,ij}\neq 0)\right\}$ be the maximum number of non-zeros in any column of $\Gamma_{0}$ which represents the maximum degree of $\bm{Y}$ in the graph.
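For readers who want to check these norms numerically, a small standard-library sketch follows. The $\ell_{2}$ operator norm is approximated by power iteration on $U^{T}U$, which is a generic numerical choice made here for illustration (it can fail if the starting vector is orthogonal to the leading eigenvector); the function names are hypothetical.

```python
import math


def linf_operator_norm(U):
    """Maximum absolute row sum."""
    return max(sum(abs(u) for u in row) for row in U)


def frobenius_norm(U):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(u * u for row in U for u in row))


def l2_operator_norm(U, iterations=200):
    """sqrt(lambda_max(U^T U)), approximated by power iteration."""
    n_rows, n_cols = len(U), len(U[0])
    v = [1.0] * n_cols
    lam = 0.0
    for _ in range(iterations):
        Uv = [sum(U[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        w = [sum(U[i][j] * Uv[i] for i in range(n_rows)) for j in range(n_cols)]
        lam = math.sqrt(sum(x * x for x in w))  # norm of (U^T U) v
        if lam == 0.0:
            return 0.0
        v = [x / lam for x in w]
    return math.sqrt(lam)


U = [[1.0, -2.0], [3.0, 4.0]]
norms = (l2_operator_norm(U), linf_operator_norm(U), frobenius_norm(U))
```

For this matrix the row sums are 3 and 7, the squared entries add up to 30, and the largest eigenvalue of $U^{T}U$ is $15+\sqrt{125}$.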

Denote $\lambda=\max\{\lambda_{2},\lambda_{3}\}$ and $r=\min\{\lambda_{2},\lambda_{3}\}/\lambda$ , so that $r\leq 1$ . In the following theoretical analysis, we assume that $\lambda_{2}\geq\lambda_{3}$ and hence $r=\lambda_{3}/\lambda_{2}$ . Similar arguments apply to the case of $\lambda_{2}<\lambda_{3}$ . The objective function can be rewritten as

[TABLE]

We make the following assumptions.

Assumption 1.

(Underlying Model) $\bm{Y}|\bm{X}=\bm{x}\sim\text{N}(-\Lambda_{0}^{-1}\Theta_{0}^{T}\bm{x},\Lambda_{0}^{-1})$ where $\bm{Y}$ has the maximum degree $\gamma$ .

Assumption 2.

(Restricted Convexity) For any $i=1,\dots,p$ , let $S_{i}$ denote the nonzero indices of the $i$ -th column of $\Theta_{0}$ (i.e., the edges between $\mathbf{X}$ and $Y_{i}$ ). We have $\lambda_{\emph{min}}(1/nX_{S_{i}}^{T}X_{S_{i}})>0$ , where $\lambda_{\emph{min}}(\cdot)$ denotes the smallest eigenvalue and $X_{S_{i}}$ represents the $n\times|S_{i}|$ matrix with columns of $X$ indexed by $S_{i}$ .

Assumption 3.

(Mutual incoherence) Let $S$ denote the support set of $\Gamma_{0}$ in vector form $S=(\emph{vec}(\emph{supp}\{\Lambda_{0}\})^{T},\emph{vec}(\emph{supp}\{\Theta_{0}\})^{T})^{T}$ where $\emph{supp}\{\cdot\}$ denotes the indicator function of whether an element is nonzero. Let $\bar{S}$ denote the complement of $S$ . We have ${\left|\kern-1.07639pt\left|\kern-1.07639pt\left|H_{0,\bar{S}S}(H_{0,SS})^{-1}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{\infty}\leq 1-\alpha$ for some $\alpha\in(0,1)$ , where $H_{0,\bar{S}S}$ and $H_{0,SS}$ represent the $|\bar{S}|\times|S|$ and $|S|\times|S|$ sub-matrices of $H_{0}$ with entries in $\bar{S}\times S$ and $S\times S$ respectively.

Assumption 4.

(Control of eigenvalues) There exist constants $0<C_{L}\leq C_{U}<\infty$ , such that $C_{L}\leq\lambda_{\mathrm{min}}(\Lambda_{0})\leq\lambda_{\mathrm{max}}(\Lambda_{0})\leq C_{U}$ .

Assumption 1 provides the true underlying model. Assumption 2 ensures the solution of optimization problem (19) is restricted to the active set (nonzero entries in $\Lambda_{0}$ and $\Theta_{0}$ ), which is also used in [wainwright2009sharp] and [wytock2013sparse]. Assumption 3 limits the influence that edges in the inactive set ( $\bar{S}$ ) can have on the edges in the active set ( $S$ ), and Assumption 4 bounds the eigenvalues of the precision matrix. Define $V(f,g)=\int_{\mathcal{X}}f(\bm{x})g(\bm{x})\rho(\bm{x})d\bm{x}$ and $V(f)=\int_{\mathcal{X}}f^{2}(\bm{x})\rho(\bm{x})d\bm{x}$ .

Assumption 5.

$V$ is completely continuous with respect to $J$ and $J(\eta_{0})<\infty$ .

Under Assumption 5, there exist $\phi_{\nu}$ such that $V(\phi_{\nu},\phi_{\mu})=\delta_{\nu,\mu}$ , $J(\phi_{\nu},\phi_{\mu})=\rho_{\nu}\delta_{\nu,\mu}$ , and $0\leq\rho_{\nu}\uparrow\infty$ , where $\delta_{\nu,\mu}$ is the Kronecker delta and the $\rho_{\nu}$ are referred to as the eigenvalues of $J$ with respect to $V$ . Denote the Fourier series expansion of $\eta_{0}$ as $\eta_{0}=\sum_{\nu}\eta_{\nu,0}\phi_{\nu}$ where $\eta_{\nu,0}=V(\eta_{0},\phi_{\nu})$ are the Fourier coefficients. Let $\tilde{\eta}=\sum_{\nu}\tilde{\eta}_{\nu}\phi_{\nu}$ where $\tilde{\eta}_{\nu}=(\beta_{\nu}+\eta_{\nu,0})/(1+\lambda_{1}\rho_{\nu})$ and $\beta_{\nu}=n^{-1}\sum_{i=1}^{n}\{e^{-\eta_{0}(\bm{X}_{i})}\phi_{\nu}(\bm{X}_{i})-\int_{\mathcal{X}}\phi_{\nu}(\bm{x})\rho(\bm{x})d\bm{x}\}$ .

Assumption 6.

(a) The eigenvalues $\rho_{\nu}$ of $J$ with respect to $V$ satisfy $\rho_{\nu}>\beta\nu^{s}$ for some $\beta>0$ and $s>1$ when $\nu$ is sufficiently large.

(b) There exist constants $0<C_{1,1}<C_{1,2}<\infty$ , $C_{1,3}<\infty$ and $C_{1,4}<\infty$ such that $C_{1,1}<e^{\eta_{0}(\bm{x})-\eta(\bm{x})}<C_{1,2}$ holds uniformly for $\eta$ in a convex set around $\eta_{0}$ containing $\hat{\eta}$ and $\tilde{\eta}$ , $e^{-\eta_{0}(\bm{x})}<C_{1,3}$ , and $\int_{\mathcal{X}}\phi^{2}_{\nu}(\bm{x})\phi^{2}_{\mu}(\bm{x})e^{-\eta_{0}(\bm{x})}\rho(\bm{x})d\bm{x}<C_{1,4}$ for any $\nu$ and $\mu$ .

(c) There exists some $q\in[1,2]$ such that $\sum_{\nu}\rho_{\nu}^{q}\eta_{\nu,0}^{2}<\infty$ .

Assumptions 5 and 6 are commonly used in the smoothing spline literature to study the convergence rate for nonparametric density estimation [gu2013smoothing].

4.2 Asymptotic Consistency of the Estimated Parameters

The following theorem provides the estimation error bound and edge selection accuracy.

Theorem 1.

Suppose that Assumptions 2 and 3 hold, $\tau>2$ , and $n$ and $\lambda=\max\{\lambda_{2},\lambda_{3}\}$ satisfy

[TABLE]

where $C_{2,1}=\max\{12800,32C_{X}^{2}\}$ , $C_{2,2}=\kappa_{H}\max\{3C_{\Sigma}/\gamma,2/(C_{\Theta}\gamma),412C_{\Sigma}^{4}C_{\Theta}^{2}C_{X}^{2}\}$ , and $C_{X}^{\star}=\max\{C_{X}^{2},1\}$ , then with probability greater than $1-\big{(}p^{-(\tau-2)}+(pd)^{-(\tau-1)}\big{)}$ , we have

1. The estimates satisfy the elementwise $\ell_{\infty}$ bound:

[TABLE]

2. All non-zero entries of the solution $(\hat{\Lambda},\hat{\Theta})$ form a subset of the non-zero entries of $(\Lambda_{0},\Theta_{0})$ . Furthermore, the non-zero entries of $(\hat{\Lambda},\hat{\Theta})$ include all non-zero entries $(i,j)$ in $(\Lambda_{0},\Theta_{0})$ that satisfy

[TABLE]

Remark 1: i) Theorem 1 indicates that a sample size larger than a constant times $\gamma^{4}\log(pd)$ is enough for our estimation procedure to identify a subset of the true non-zero elements in the cGGM, and that the resulting estimates are close to the true parameters in the elementwise $\ell_{\infty}$ norm. The convergence rate is the same as that in [wytock2013sparse], but the success probability of the primal-dual witness approach as well as the exact bounds for $n$ and $\lambda$ are different. We also provide a lower bound for the sign consistency which is not included in [wytock2013sparse].

ii) The convergence probability is smaller than that for the GGM [wainwright2009sharp] where only a precision matrix needs to be estimated. This is the price we pay for estimating extra parameters in $\Theta$ .

Define $s_{\Lambda}$ as the total number of non-zero elements in off-diagonal positions of $\Lambda_{0}$ , and $s_{\Theta}$ as the total number of non-zero elements in $\Theta_{0}$ .

Corollary 1.

Under the same assumptions as in Theorem 1, with probability at least $1-\big{(}p^{-(\tau-2)}+(pd)^{-(\tau-1)}\big{)}$ , the estimates $\hat{\Lambda}$ and $\hat{\Theta}$ satisfy

[TABLE]

Remark 2: The Frobenius norm bound was not studied in [wytock2013sparse]. We derive it as a building block for establishing the convergence rate for the density estimation in Section 4.3.

4.3 Convergence Rates for the Density Estimation

We first introduce a combined measure of divergence between the joint density and its estimate. Let $f_{0}(\bm{x})=e^{\eta_{0}(\bm{x})}\rho(\bm{x})/\int_{\mathcal{X}}e^{\eta_{0}(\bm{x})}\rho(\bm{x})d\bm{x}$ and $f_{0}(\bm{y}|\bm{x})$ be the true densities of $\bm{X}$ and $\bm{Y}|\bm{X}=\bm{x}$ with their estimates denoted as $\hat{f}(\bm{x})=e^{\hat{\eta}(\bm{x})}\rho(\bm{x})/\int_{\mathcal{X}}e^{\hat{\eta}(\bm{x})}\rho(\bm{x})d\bm{x}$ and $\hat{f}(\bm{y}|\bm{x})$ respectively. The KL divergence between two density functions $f_{1}$ and $f_{2}$ is defined as $\text{KL}(f_{1},f_{2})=\text{E}_{f_{1}}[\log(f_{1}/f_{2})]$ . Then the symmetrized KL divergence between the true joint density $f_{0}(\bm{z})=f_{0}(\bm{x})f_{0}(\bm{y}|\bm{x})$ and its estimate $\hat{f}(\bm{z})=\hat{f}(\bm{x})\hat{f}(\bm{y}|\bm{x})$ can be expressed as

[TABLE]

We will establish the asymptotic convergence rate under the following combined measure of divergence

[TABLE]

The difference between (24) and (23) lies in the divergence measures for the estimation of $f(\bm{x})$ . Note that the rate in $V(\eta-\eta_{0})$ implies rate in $\tilde{V}(\eta-\eta_{0})$ , since $\tilde{V}(f)\leq V(f)$ and $\tilde{V}(\eta_{0}-\hat{\eta})$ is a proxy of $\text{SKL}(f_{0}(\bm{x}),\hat{f}(\bm{x}))$ [gu2013smoothing].

We first establish the rate for $\text{SKL}(f_{0}(\mbox{$ \bm{y} $}|\mbox{$ \bm{x} $}),\hat{f}(\mbox{$ \bm{y} $}|\mbox{$ \bm{x} $}))$ which has the explicit expression:

[TABLE]

where $\mathbf{a}=\hat{\Lambda}^{-1}\hat{\Theta}^{T}-\Lambda_{0}^{-1}\Theta_{0}^{T}$ . We assume that the second moments of marginal densities of $f_{0}$ and $\hat{f}$ exist.

Theorem 2.

Under Assumption 4 and conditional on the event ${\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\hat{\Lambda}-\Lambda_{0}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{F}\leq 0.5C_{L}$ , we have

[TABLE]

For the smoothing spline ANOVA estimate $\hat{\eta}$ of $\eta_{0}$ , under Assumptions 5 and 6, [gu2013smoothing] showed that as $\lambda_{1}\rightarrow 0$ and $n\lambda_{1}^{2/s}\rightarrow\infty$ ,

[TABLE]

Finally, we have the convergence rate for the joint density estimate.

Theorem 3.

Suppose that Assumptions 2-6 hold, $\tau>2$ , $\lambda_{1}\rightarrow 0$ , $n\lambda_{1}^{2/s}\rightarrow\infty$ , and $n$ and $\lambda$ satisfy

[TABLE]

where $C_{3,1}=C_{2,1}=\max\{12800,32C_{X}^{2}\}$ , and $C_{3,2}=\max\{C_{2,2},\kappa_{H}\sqrt{1600}/C_{L}\}=\kappa_{H}\max\{3C_{\Sigma}/\gamma,2/(C_{\Theta}\gamma),412C_{\Sigma}^{4}C_{\Theta}^{2}C_{X}^{2},\sqrt{1600}/C_{L}\}$ , then with probability greater than $1-\big(p^{-(\tau-2)}+(pd)^{-(\tau-1)}\big)$ we have

[TABLE]

Remark 3: For low-dimensional $\bm{X}$ (usually $d\leq 3$ ), the computation of multivariate integrals is feasible. We may use the penalized likelihood instead of the pseudo likelihood to estimate the density function $f(\bm{x})$ . This leads to $f_{0}(\bm{x})=e^{\eta_{0}}/\int_{\mathcal{X}}e^{\eta_{0}}$ . Under similar conditions, [gu2013smoothing] proved that the symmetrized KL divergence $\text{SKL}(f_{0}(\bm{x}),\hat{f}(\bm{x}))$ is also $\mathcal{O}(n^{-1}\lambda_{1}^{-1/s}+\lambda_{1}^{q})$ , where $\hat{f}(\bm{x})$ is the penalized likelihood estimate. If we also use the penalized likelihood to estimate $\eta$ in our model, then $\text{SKL}\big(f_{0}(\bm{z}),\hat{f}(\bm{z})\big)=\mathcal{O}(n^{-5/2}p^{5/2}(\log pd)^{5/2}+n^{-1}p^{2}(\log pd)+n^{-1}\lambda_{1}^{-1/s}+\lambda_{1}^{q})$ .

5 Simulation Studies

We have conducted extensive simulation experiments to evaluate the performance of the cSScGG procedure and to compare it with some existing parametric and semiparametric/nonparametric methods. To save space, we present selected simulation results; more comprehensive results can be found in [Luothesis]. We note that the cSScGG method can outperform the maximum likelihood estimation (MLE) when $\bm{Z}=(\bm{X}^{T},\bm{Y}^{T})^{T}$ is multivariate Gaussian and the cGGM for $\bm{Y}$ is sparse. Results for density and graph estimation are presented in Sections 5.1 and 5.2 respectively.

For density estimation, we use both LOOKL and CV (5-fold) methods to choose $\lambda_{2}$ and $\lambda_{3}$ . Tuning parameters involved in all other methods are chosen by 5-fold CV. For graph estimation, we select $\lambda_{2}$ and $\lambda_{3}$ in the cSScGG method as minimizers of the following BIC score

[TABLE]

where $\xi(\hat{\Lambda})$ and $\xi(\hat{\Theta})$ are the number of non-zero off-diagonal elements in $\hat{\Lambda}$ and the number of non-zero elements in $\hat{\Theta}$ respectively. The degrees of freedom are defined in the same way as in [yin2011sparse]. The BIC is also used to select tuning parameters in other methods for graph estimation. More details regarding comparison of various tuning parameter selection methods are included in [Luothesis].
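The two counts entering the BIC score can be computed as in the following sketch. Only the counting step is shown; the BIC formula itself is the one given in the display above and is not reproduced here. The function names and the tolerance argument are illustrative assumptions, and each symmetric pair of $\hat{\Lambda}$ is counted entrywise, following the wording "number of non-zero off-diagonal elements".

```python
def xi_lambda(Lambda_hat, tol=0.0):
    """Number of non-zero off-diagonal entries of a square matrix."""
    return sum(1 for i, row in enumerate(Lambda_hat)
               for j, v in enumerate(row) if i != j and abs(v) > tol)


def xi_theta(Theta_hat, tol=0.0):
    """Number of non-zero entries of a matrix."""
    return sum(1 for row in Theta_hat for v in row if abs(v) > tol)


# Toy estimates with made-up numbers.
Lambda_hat = [[2.0, 0.5, 0.0],
              [0.5, 2.0, 0.0],
              [0.0, 0.0, 1.0]]
Theta_hat = [[0.0, 0.3, 0.0],
             [0.0, 0.0, 0.0]]
counts = (xi_lambda(Lambda_hat), xi_theta(Theta_hat))
```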

5.1 Density Estimation

We set $n=200$ , $d=3$ , and $p=25$ . We generate $\bm{X}\sim\omega\mathcal{N}(\bm{\mu}_{1},\sigma^{2}I)+(1-\omega)\mathcal{N}(\bm{\mu}_{2},\sigma^{2}I)$ with $\bm{\mu}_{1}=(1,0,-1)^{T}$ , and $\bm{\mu}_{2}=(0,-1,1)^{T}$ . We consider four combinations of $\sigma$ and $\omega$ : $\sigma=0.5,0.1$ and $\omega=0.9,0.1$ . All results are reported based on 100 replications under each setting. In each replication, we first generate $n$ iid samples $\bm{X}_{1},\dots,\bm{X}_{n}$ from the multivariate Gaussian mixtures, then $\bm{Y}_{i}$ ’s are generated from a cGGM. Specifically, we randomly create a $(d+p)\times(d+p)$ precision matrix $\Omega$ using the R-package huge [zhao2012huge], in which the probability of the off-diagonal elements being nonzero equals $0.2$ . The decomposition $\Omega=\begin{bmatrix}\Omega_{xx}&\Omega_{xy}\\ \Omega_{yx}&\Omega_{yy}\end{bmatrix}$ gives us $\Theta=\Omega_{xy}$ and $\Lambda=\Omega_{yy}$ [yuan2014partial], so that we can sample $\bm{Y}_{i}$ from $\mathcal{N}(-\Lambda\mbox{$ {}^{-1} $}\Theta^{T}\bm{X}_{i},\Lambda\mbox{$ {}^{-1} $})$ for $i=1,\dots,n$ .
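The first stage of this design, drawing $\bm{X}$ from the two-component Gaussian mixture, can be sketched with the Python standard library as below. The sketch covers only the mixture draw; generating the precision matrix (done with the R-package huge in the paper) and sampling $\bm{Y}_{i}$ from the cGGM are not reproduced. The function name and the seed are illustrative.

```python
import random


def sample_x(n, omega, sigma, mu1=(1.0, 0.0, -1.0), mu2=(0.0, -1.0, 1.0), seed=0):
    """Draw n points from omega*N(mu1, sigma^2 I) + (1 - omega)*N(mu2, sigma^2 I)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        # Pick the mixture component, then add independent Gaussian noise.
        mu = mu1 if rng.random() < omega else mu2
        draws.append([rng.gauss(m, sigma) for m in mu])
    return draws


X = sample_x(n=200, omega=0.9, sigma=0.1)
```

With $\sigma=0.1$ the two components are well separated, so each draw sits close to one of the two mean vectors.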

Since the division of non-Gaussian variables $\bm{X}$ and Gaussian variables $\bm{Y}$ is typically unknown in practice, we consider two versions of the proposed method: plain cSScGG and cSScGG with normality test (denoted as NT). In the plain version, we assume that the true non-Gaussian components are known and apply cSScGG directly. In the NT version, we select as $\bm{X}$ the $d$ variables with the smallest p-values from the Shapiro-Wilk test applied to all $p+d$ marginal variables, and then apply the cSScGG method.

In addition to the cSScGG method, we estimate density using the SKDE [hoti2004semiparametric], MLE, and QUIC [hsieh2011sparse] methods. In the implementation of the SKDE method, we use the R-package ks [duong2007ks] to calculate the kernel density estimate for $f(\bm{x})$ with the bandwidth selected by the smoothed cross-validation selector with diagonal bandwidth matrices (Hscv.diag(x)) which provides the best overall performance. To avoid selecting the two extra bandwidths involved in SKDE, as in [hoti2004semiparametric], we set $f(\bm{y}|\bm{x})=f(\bm{y})$ and use MLE to estimate $f(\bm{y})$ . The MLE and QUIC methods treat $\bm{Z}^{T}=(\bm{X}^{T},\bm{Y}^{T})$ as multivariate normal across all settings, and the estimates from these two methods are further broken down into $f(\bm{x})$ and $f(\bm{y}|\bm{x})$ for comparison. Specifically, the QUIC method learns the precision matrix of $\bm{Z}$ by forming a quadratic approximation of the log-likelihood, and the estimates are computed using the R-package QUIC [hsieh2014quic].

To evaluate the performance of different methods, we consider the KL divergence between the estimated density and the true density

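$$\text{KL}\big(f_{0}(\bm{z}),\hat{f}(\bm{z})\big)=\text{KL}\big(f_{0}(\bm{x}),\hat{f}(\bm{x})\big)+\text{E}_{\bm{X}}\Big[\text{KL}\big(f_{0}(\bm{y}|\bm{X}),\hat{f}(\bm{y}|\bm{X})\big)\Big],$$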

where $f_{0}$ is the true density, and the aggregated KL $\text{E}_{\bm{X}}\big[\text{KL}\big(f_{0}(\bm{y}|\bm{X}),\hat{f}(\bm{y}|\bm{X})\big)\big]$ is approximated by the empirical aggregated KL divergence. Table 1 reports the overall KL divergence $\text{KL}\big(f_{0}(\bm{z}),\hat{f}(\bm{z})\big)$, the empirical aggregated KL divergence $n^{-1}\sum_{i=1}^{n}\text{KL}\big(f_{0}(\bm{y}|\bm{X}_{i}),\hat{f}(\bm{y}|\bm{X}_{i})\big)$, and $\text{KL}\big(f_{0}(\bm{x}),\hat{f}(\bm{x})\big)$. They provide evaluations for the estimation of $f(\bm{z})$, $f(\bm{y}|\bm{x})$, and $f(\bm{x})$, respectively.
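Because both the true and the fitted conditional densities are Gaussian under a cGGM, each term of the empirical aggregated KL divergence has a closed form. The sketch below is an illustration under that observation; the function names are hypothetical, and the direction of the divergence (true density first) is assumed from the notation $\text{KL}(f_{0},\hat{f})$.

```python
import numpy as np

def cggm_mean(X, lam, theta):
    # Row i is the transpose of the conditional mean -Lambda^{-1} Theta^T X_i.
    return -X @ theta @ np.linalg.inv(lam)

def empirical_aggregated_kl(X, lam0, theta0, lam_hat, theta_hat):
    """n^{-1} sum_i KL( N(m0_i, lam0^{-1}) || N(m1_i, lam_hat^{-1}) )."""
    p = lam0.shape[0]
    diff = cggm_mean(X, lam_hat, theta_hat) - cggm_mean(X, lam0, theta0)
    _, logdet0 = np.linalg.slogdet(lam0)
    _, logdet1 = np.linalg.slogdet(lam_hat)
    # Part of the Gaussian KL that does not depend on X_i.
    const = np.trace(lam_hat @ np.linalg.inv(lam0)) - p + logdet0 - logdet1
    # Quadratic form diff_i^T lam_hat diff_i for every observation.
    quad = np.einsum("ij,jk,ik->i", diff, lam_hat, diff)
    return 0.5 * np.mean(const + quad)
```

The value is zero when the fitted parameters coincide with the true ones and positive otherwise.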

Since cSScGG with normality test may identify different $\bm{X}$, we only include its overall KL divergence $\text{KL}\big(f_{0}(\bm{z}),\hat{f}(\bm{z})\big)$ for comparison. Both versions of the cSScGG method enjoy superior performance relative to all other methods under all settings. When comparing the plain cSScGG with other methods, the differences mainly come from the estimation of $f(\bm{x})$, for which the parametric methods MLE and QUIC cannot fit the data properly. The cSScGG performs much better than SKDE in the estimation of both $f(\bm{x})$ and $f(\bm{y}|\bm{x})$. When $\sigma$ is fixed, the performance differences are larger under $\omega=0.5$, where the deviation from Gaussian is more severe. Furthermore, under a fixed $\omega$, the superiority of the cSScGG methods is greater when $\sigma=0.1$, where the deviation from Gaussian is more severe. Comparative results remain the same under other simulation settings [Luothesis].

5.2 Edge Detection

We do not consider the SKDE and MLE methods here because they do not perform edge selection. In addition to QUIC, which is parametric, we also include the nonparanormal (NPN) method [liu2009nonparanormal]. The NPN method is implemented with the R-package huge. When fitting the model, we first use the shrunken ECDF to transform the data, then apply Glasso to the transformed data. The final NPN model is selected by the extended BIC score [foygel2010extended]. Given a fixed dimension $p$, the model chosen by the EBIC method agrees with the model chosen by the BIC method. As the cSScGG method is formulated quite differently from the NPN, our main focus is to investigate the improvements that cSScGG can bring over the QUIC method, which assumes normality for all variables including $\bm{X}$.

The performance is measured in three categories: among $\bm{X}$ , among $\bm{Y}$ , and between $\bm{X}$ and $\bm{Y}$ . Recall that for the cSScGG procedure, edges in the above categories are decided by $\Pi$ , $\Lambda$ and $\Theta$ , respectively (see Figure 1). We also report the overall performance based on the whole graph. All simulation results are based on $100$ replications.
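The three-category breakdown can be sketched as below. This is only an illustration of how edges are split by block; the function name and the choice to return raw confusion counts are assumptions made here, not the performance measures reported in Table 2. The first $d$ nodes are taken to be $\bm{X}$.

```python
import numpy as np

def edge_counts_by_block(true_adj, est_adj, d):
    """Confusion counts for edges among X, among Y, between X and Y,
    and for the whole graph. Nodes 0..d-1 are X, the rest are Y."""
    true_adj = np.asarray(true_adj, dtype=bool)
    est_adj = np.asarray(est_adj, dtype=bool)
    m = true_adj.shape[0]
    iu, ju = np.triu_indices(m, k=1)      # each undirected pair once
    is_x_i, is_x_j = iu < d, ju < d
    blocks = {
        "among_X": is_x_i & is_x_j,
        "among_Y": ~is_x_i & ~is_x_j,
        "between_XY": is_x_i ^ is_x_j,
        "overall": np.ones_like(is_x_i),
    }
    t, e = true_adj[iu, ju], est_adj[iu, ju]
    out = {}
    for name, mask in blocks.items():
        out[name] = {
            "tp": int(np.sum(t & e & mask)),
            "fp": int(np.sum(~t & e & mask)),
            "fn": int(np.sum(t & ~e & mask)),
            "tn": int(np.sum(~t & ~e & mask)),
        }
    return out
```

Rates such as sensitivity or specificity can then be computed per block from these counts.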

We fix $p=25$ , $d=3$ , and consider two sample sizes $n=200$ and $n=300$ . We first generate both $\bm{X}$ and $\bm{Y}$ from multivariate normals. Specifically, we first generate a $(d+p)\times(d+p)$ sparse precision matrix $\Omega$ , in which the probability of the off-diagonal elements being nonzero equals $0.2$ . Then $n$ i.i.d. samples $\bm{Z}_{1},\dots,\bm{Z}_{n}$ are generated from $\mathcal{N}(\bm{0},\Omega\mbox{$ {}^{-1} $})$ . The decomposition $\bm{Z}_{i}^{T}=(\bm{X}_{i}^{T},\bm{Y}_{i}^{T})$ leads to i.i.d. samples of $\bm{X}$ and $\bm{Y}$ , and the decomposition $\Omega=\begin{bmatrix}\Omega_{xx}&\Omega_{xy}\\ \Omega_{yx}&\Omega_{yy}\end{bmatrix}$ leads to $\Theta=\Omega_{xy}$ and $\Lambda=\Omega_{yy}$ . The results are presented in Table 2.

Overall, the cSScGG and QUIC methods perform better than the NPN. This is expected as the true distribution is Gaussian and the ECDF transformation leads to efficiency loss. Surprisingly, the cSScGG outperforms the QUIC in detecting edges within $\bm{X}$ variables even when the normality assumption holds for the QUIC method. It suggests that the proposed projection ratio method learns the conditional independence within $\bm{X}$ better than the parametric QUIC method with BIC. Furthermore, the cSScGG outperforms the QUIC in identifying edges among $\bm{Y}$ as well as edges between $\bm{X}$ and $\bm{Y}$ , due to the fact that there are two penalty parameters in cSScGG, as opposed to one in QUIC. To conclude, the cSScGG method is more efficient even when the joint normality assumption holds.

6 Applications

6.1 Isoprenoid Gene Network in Arabidopsis Thaliana

We consider the gene expression data for Arabidopsis thaliana introduced by [wille2004sparse]. Arabidopsis thaliana is the first plant to have its genome sequenced, and is a popular model in the study of molecular biology and genetics. The dataset contains $n=118$ observations of Affymetrix GeneChip microarrays, in which the expression levels of $795$ genes are recorded. All values are preprocessed by log-transformation and standardization. This data has been analyzed by [lafferty2012sparse] to explore the structure using the nonparanormal model. As in [lafferty2012sparse], we consider a subset of genes from the isoprenoid pathway. (Footnote 1: The dataset was downloaded from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC545783/. We note that while there were $40$ genes in [wille2004sparse] and [lafferty2012sparse], this dataset contains only $39$.)

Our goal is to construct a graph using the proposed cSScGG procedure and compare its structure with those from Glasso [friedman2008sparse] and nonparanormal (NPN). Let $\bm{Z}$ be the expression levels of the $39$ genes. To apply the cSScGG procedure, we first need to identify the variables $\bm{X}$ whose density function may be non-Gaussian. A simple approach is to select elements of $\bm{Z}$ whose marginal distributions are non-Gaussian. We looked at histograms of all $39$ gene expression levels and found $3$ genes (MCT, GGPPS6 and GGPPS1mt) with marginal distributions far from Gaussian, as shown in Figure 2. Therefore, we set $\bm{X}$ as the gene expression levels of MCT, GGPPS6, and GGPPS1mt. We note that the marginal distributions of these three genes are bimodal or multimodal, and monotone transformations cannot transform them into Gaussian random variables. Therefore, the GGM and nonparanormal model may be inappropriate for this data.

As indicated by [wille2004sparse], the GGM chosen by the BIC generally leads to a graph that is too dense for biologically relevant research. Therefore, in this study, we construct the graph by limiting the number of edges. In particular, we tune the regularization parameters in the cSScGG method to fix $|E|=18$. Results with $|E|=25$ can be found in [Luothesis]. Once the cSScGG fit is obtained, we scan the full regularization path of the Glasso estimates, compute the symmetric difference between each estimate and the cSScGG estimate, and select the graph with the smallest symmetric difference as the Glasso graph. Here the symmetric difference between two graphs is the set of edges that are in either of the graphs but not in their intersection. The same procedure is applied to the NPN estimates. We implemented Glasso and NPN with the R-packages glasso [friedman2014glasso] and huge [zhao2012huge], respectively.
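The selection rule along the regularization path can be sketched in a few lines. The function names and the representation of a path as a list of `(penalty, edge_list)` pairs are hypothetical choices for illustration; edges are treated as undirected.

```python
def symmetric_difference(edges_a, edges_b):
    """Edges that are in exactly one of the two undirected graphs."""
    a = {frozenset(e) for e in edges_a}
    b = {frozenset(e) for e in edges_b}
    return a ^ b

def closest_graph_on_path(reference_edges, path):
    """Return the (penalty, edge_list) entry on the path whose graph has the
    smallest symmetric difference with the reference graph."""
    return min(
        path,
        key=lambda item: len(symmetric_difference(reference_edges, item[1])),
    )
```

Here `reference_edges` would play the role of the cSScGG graph and `path` the sequence of Glasso (or NPN) graphs. Ties are broken in favor of the first entry on the path, which is a design choice of this sketch.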

Figure 3 presents the graph topologies obtained from each method, along with the corresponding symmetric differences. We refer to the symmetric difference between cSScGG and Glasso as cSScGG vs Glasso, and to the symmetric difference between cSScGG and NPN as cSScGG vs NPN. Nodes with numbers $13$, $18$, and $32$ correspond to the $3$ non-Gaussian genes GGPPS1mt, GGPPS6, and MCT, respectively. Although the overall structures from the different methods look similar, there are some interesting differences.

We focus on the two symmetric difference plots in Figure 3. Note that red edges are selected by the cSScGG only. Most of these edges are associated with the non-Gaussian nodes, for example, edges 32-1 and 32-39. This indicates that the cSScGG procedure is able to discover new interactions for the non-Gaussian variables. We further look at the red lines that appear in only one of the two symmetric difference plots. It is interesting to see that they all come from the cSScGG vs NPN plot, indicating that cSScGG is able to detect some edges selected by Glasso which are not selected by NPN. This is not surprising since the cSScGG method assumes a conditional Gaussian distribution for the parametric component. Finally, we note that as a trade-off for the newly identified interactions, there exist edges that are selected by both Glasso and NPN but not by cSScGG, for instance, edge 10-33, shown as a blue dashed line in Figure 3.

To summarize, in terms of the overall graph structure, the cSScGG procedure is capable of capturing a majority of edges that are detected by the Glasso method. By modeling the distributions of some genes that clearly violate the Gaussian assumption, the proposed method is capable of detecting interactions that are not selected by other methods. These interactions may provide potential research areas for biological study.

6.2 Conditional Relationship Between Clinical, Laboratory and Dialysis Variables from Hemodialysis Patients

We apply the cSScGG procedure to study the conditional relationships between some clinical, laboratory and dialysis variables collected from hemodialysis patients. All patients who underwent dialysis treatments at the Fresenius Medical Care - North America during 2010-2014 are considered. We include patients who stayed at the same facility throughout the treatments. To avoid large fluctuation in the first year on dialysis, we use the average measurements in the second year on dialysis from patients who survived longer than two years. For homogeneity, we include white, non-diabetic and non-Hispanic patients. After removing missing values, we have $n=2959$ observations (patients) on the following $27$ variables in $3$ categories:

Clinical variables: age (years), height (cm), weight (kg), bmi (body mass index, kg/m2), sbp (systolic blood pressure, mmHg), dbp (diastolic blood pressure, mmHg), temp (temperature, Celsius);

Laboratory variables: albumin (g/dL), ferritin (ng/mL), hgb (hemoglobin, g/dL), lymphocytes ( $\%$ ), neutrophils ( $\%$ ), nlr (neutrophils to lymphocytes ratio, unitless), sna (serum sodium concentration, mEq/L or mmol/L), wbc (white blood cell, 1000/mc);

Dialysis variables: qb (blood flow, mL/min), qd (dialysis flow, mL/min), saline (mL), txttime (treatment time, min), olc (on-line clearance, unitless), idwg (interdialytic weight gain, kg), ufv (ultrafiltration volume, L), ufr (ultrafiltration rate, mL/hr/kg), epodose (erythropoietin dose, unit), volume (L), enpcr (equilibrated normalized protein catabolic rate, g/kg/day), ektv (equilibrated Kt/V, unitless).

Note that nlr and epodose have been transformed to make them close to Gaussian. In particular, nlr equals the logarithm of the neutrophils to lymphocytes ratio, and epodose represents the 1/4 power transformation of the actual erythropoietin dose.

The primary objective of this study is to discover the interactions between all these measurements. We first check the marginal distributions of all $27$ variables to investigate possible violation of the Gaussian assumption. We identify $4$ variables, age, qb, qd and epodose as non-Gaussian with very small p-values (less than $2\times 10^{-16}$ ). Histograms in Figure 4 indicate that the distribution of age is skewed, and the distributions of qb, qd and epodose have multiple peaks. Note that despite the $1/4$ power transformation, the distribution of epodose is still far from normal due to the point mass at zero. Consequently, we specify these $4$ variables as $\bm{X}$ to be estimated nonparametrically in the proposed cSScGG procedure.

We compare the cSScGG procedure with Glasso and NPN. For the NPN method, we use the shrunken ECDF to transform the data first, then apply Glasso to the transformed data. For each method, we tune the regularization parameters by BIC. The estimated graph structures are shown in Figure 5.

From visual inspection, there is a large set of edges shared by cSScGG and Glasso, which is due to the fact that cSScGG assumes that the majority of the variables are conditionally normal. However, the Glasso graph is much denser. To see how cSScGG differs from the other two methods, Figure 6 shows the edges detected by the cSScGG procedure only. It shows that bmi is a hub node whose connections with other variables such as age, dbp, and wbc are not selected by the other methods. Meanwhile, qb has multiple connections with nodes from the other two categories (Clinical and Laboratory). The value of these extra edges remains to be further explored from a clinical standpoint. We do not intend to claim that the graph obtained by the cSScGG procedure is the best, as the underlying truth is unknown. Instead, with different model assumptions, the cSScGG procedure can identify potential links for further study.


Appendix A Derivation of the LOOKL

Our derivation is similar to that in [lian2011shrinkage] and [vujavcic2015computationally], with adjustments to deal with complications brought by the conditional mean $-\Lambda^{-1}\Theta^{T}\bm{x}$ and two tuning parameters. Recall that a cGGM assumes that $\bm{Y}|\bm{X}=\bm{x}\sim\mathcal{N}(-\Lambda^{-1}\Theta^{T}\bm{x},\Lambda^{-1})$. The log-likelihood based on the $k$-th observation $\bm{X}_{k}$ and $\bm{Y}_{k}$ is (ignoring constant terms)

[TABLE]

where $S_{yy,k}=\bm{Y}_{k}\bm{Y}_{k}^{T}$, $S_{xy,k}=\bm{X}_{k}\bm{Y}_{k}^{T}$, and $S_{xx,k}=\bm{X}_{k}\bm{X}_{k}^{T}$ are the empirical variance/covariance matrices based on the $k$-th observation. Note that $S_{yy}=n^{-1}\sum_{k=1}^{n}S_{yy,k}$, $S_{xx}=n^{-1}\sum_{k=1}^{n}S_{xx,k}$, and $S_{xy}=n^{-1}\sum_{k=1}^{n}S_{xy,k}$.

Let $\hat{\Lambda}^{(-k)}$ and $\hat{\Theta}^{(-k)}$ be the estimates of $\Lambda$ and $\Theta$ based on the data excluding the $k$ -th observation. Directly calculating leave-one-out estimate of the KL distance is computationally costly. We now derive a score based on the fact that cross-validating the log-likelihood provides an estimate of the KL distance [yanagihara2006bias].

Consider the following function of five variables: $f(S_{xx},S_{yy},S_{xy},\Lambda,\Theta)=\log|\Lambda|-\text{tr}(S_{yy}\Lambda+2S^{T}_{xy}\Theta+\Lambda^{-1}\Theta^{T}S^{T}_{xx}\Theta)$. Since $f$ is linear in $(S_{xx},S_{yy},S_{xy})$, we have the identity $\sum_{k=1}^{n}f(S_{xx,k},S_{yy,k},S_{xy,k},\Lambda,\Theta)=nf(S_{xx},S_{yy},S_{xy},\Lambda,\Theta)$. Letting $\bm{S}=(S_{xx},S_{yy},S_{xy})$ and $\bm{S}_{k}=(S_{xx,k},S_{yy,k},S_{xy,k})$, we denote $f(S_{xx},S_{yy},S_{xy},\Lambda,\Theta)$ and $f(S_{xx,k},S_{yy,k},S_{xy,k},\Lambda,\Theta)$ as $f(\bm{S},\Lambda,\Theta)$ and $f(\bm{S}_{k},\Lambda,\Theta)$ in the rest of the derivation. The leave-one-out cross-validation score [yanagihara2006bias] is

[TABLE]

where $\partial f(\bm{S}_{k},\hat{\Lambda},\hat{\Theta})/\partial\Lambda=\partial f(\bm{S}_{k},\hat{\Lambda},\hat{\Theta})/\partial\mbox{$ \mathrm{vec} $}(\Lambda)$ and $\partial f(\bm{S}_{k},\hat{\Lambda},\hat{\Theta})/\partial\Theta=\partial f(\bm{S}_{k},\hat{\Lambda},\hat{\Theta})/\partial\mbox{$ \mathrm{vec} $}(\Theta)$ are $p^{2}$ and $pd$ dimensional column vectors of partial derivatives given by

[TABLE]
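As a sanity check on the function $f$ defined above, the following sketch verifies numerically that $\sum_{k}f(\bm{S}_{k},\Lambda,\Theta)=nf(\bm{S},\Lambda,\Theta)$ and compares finite-difference gradients with closed forms. The gradient expressions in `grad_f` were derived here directly from the definition of $f$ (treating $\Lambda$ as an unstructured matrix evaluated at a symmetric point); they are an assumption of this sketch and are not copied from the displayed equations.

```python
import numpy as np

def f(S_xx, S_yy, S_xy, lam, theta):
    lam_inv = np.linalg.inv(lam)
    _, logdet = np.linalg.slogdet(lam)
    inner = S_yy @ lam + 2.0 * S_xy.T @ theta + lam_inv @ theta.T @ S_xx @ theta
    return logdet - np.trace(inner)

def grad_f(S_xx, S_yy, S_xy, lam, theta):
    # Closed forms derived from f above (valid for symmetric lam and S).
    lam_inv = np.linalg.inv(lam)
    g_lam = lam_inv - S_yy + lam_inv @ theta.T @ S_xx @ theta @ lam_inv
    g_theta = -2.0 * S_xy - 2.0 * S_xx @ theta @ lam_inv
    return g_lam, g_theta

def numeric_grad(fun, M, eps=1e-6):
    # Entry-wise central differences.
    G = np.zeros_like(M)
    for idx in np.ndindex(*M.shape):
        E = np.zeros_like(M)
        E[idx] = eps
        G[idx] = (fun(M + E) - fun(M - E)) / (2.0 * eps)
    return G

rng = np.random.default_rng(1)
n, d, p = 40, 2, 3
X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, p))
S_xx, S_yy, S_xy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n
B = rng.standard_normal((p, p))
lam = B @ B.T + p * np.eye(p)            # symmetric positive definite
theta = rng.standard_normal((d, p))

# Per-observation matrices are outer products of the k-th observation.
per_obs = sum(
    f(np.outer(x, x), np.outer(y, y), np.outer(x, y), lam, theta)
    for x, y in zip(X, Y)
)
identity_gap = abs(per_obs - n * f(S_xx, S_yy, S_xy, lam, theta))

g_lam, g_theta = grad_f(S_xx, S_yy, S_xy, lam, theta)
num_lam = numeric_grad(lambda L: f(S_xx, S_yy, S_xy, L, theta), lam)
num_theta = numeric_grad(lambda T: f(S_xx, S_yy, S_xy, lam, T), theta)
```

In this sketch `identity_gap` should be at the level of floating-point error, and the numerical and closed-form gradients should agree up to finite-difference error.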

Denoting $\bm{S}^{(-k)}$ as the version of $\bm{S}$ without the $k$ -th observation, the Taylor expansions of the functions $\partial f(\bm{S}^{(-k)},\hat{\Lambda}^{(-k)},\hat{\Theta}^{(-k)})/\partial\Lambda$ and $\partial f(\bm{S}^{(-k)},\hat{\Lambda}^{(-k)},\hat{\Theta}^{(-k)})/\partial\Theta$ at the point $(\bm{S},\mbox{$ \hat{\Lambda} $},\mbox{$ \hat{\Theta} $})$ are

[TABLE]

and

[TABLE]

where $\partial^{2}f(\bm{S},\Lambda,\Theta)/\partial\Lambda^{2}=(\partial f(\bm{S},\Lambda,\Theta)/\partial\mathrm{vec}(\Lambda))/\partial\mathrm{vec}(\Lambda)$ is the $p^{2}\times p^{2}$ Hessian matrix, $\partial f(\bm{S},\hat{\Lambda},\hat{\Theta})/\partial\Lambda$ and $\partial f(\bm{S},\hat{\Lambda},\hat{\Theta})/\partial\Theta$ denote partial derivatives evaluated at $\hat{\Lambda}$ and $\hat{\Theta}$, and other second order derivatives are defined similarly. Note that $\partial f(\bm{S},\hat{\Lambda},\hat{\Theta})/\partial\Lambda=\bm{0}$ and $\partial f(\bm{S},\hat{\Lambda},\hat{\Theta})/\partial\Theta=\bm{0}$ because $\hat{\Lambda}$ and $\hat{\Theta}$ are the maximum likelihood estimators, $\partial^{2}f(\bm{S},\hat{\Lambda},\hat{\Theta})/\partial\Lambda\partial S_{xy}=\bm{0}$ because (A.3) is free of $S_{xy}$, and $\partial^{2}f(\bm{S},\hat{\Lambda},\hat{\Theta})/\partial\Theta\partial S_{yy}=\bm{0}$ because (A.4) is free of $S_{yy}$.
Let $A=\partial^{2}f(\bm{S},\hat{\Lambda},\hat{\Theta})/\partial\Theta^{2}=-2\hat{\Lambda}^{-1}\otimes S_{xx}$, $B=\partial^{2}f(\bm{S},\hat{\Lambda},\hat{\Theta})/\partial\Theta\partial\Lambda=2\hat{\Lambda}^{-1}\otimes S_{xx}\hat{\Theta}\hat{\Lambda}^{-1}$, $C=\partial^{2}f(\bm{S},\hat{\Lambda},\hat{\Theta})/\partial\Lambda^{2}=-\hat{\Lambda}^{-1}\otimes(\hat{\Lambda}^{-1}+2\hat{\Lambda}^{-1}\hat{\Theta}^{T}S_{xx}\hat{\Theta}\hat{\Lambda}^{-1})$, $D=\partial^{2}f(\bm{S},\hat{\Lambda},\hat{\Theta})/\partial\Theta\partial S_{xx}=-2\hat{\Lambda}^{-1}\hat{\Theta}^{T}\otimes I_{d\times d}$, and $E=\partial^{2}f(\bm{S},\hat{\Lambda},\hat{\Theta})/\partial\Lambda\partial S_{xx}=\hat{\Lambda}^{-1}\hat{\Theta}^{T}\otimes\hat{\Lambda}^{-1}\hat{\Theta}^{T}$. Solving (A.5) and (A.6) and plugging the solutions into (A.2), we have

[TABLE]

For the Gaussian graphical model with $\bm{Y}\sim\mathcal{N}(\bm{0},\Lambda^{-1})$ , (A.7) reduces to

[TABLE]

which is the same as the GACV in [lian2011shrinkage] and the KLCV in [vujavcic2015computationally].
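The Kronecker forms of the Hessian blocks in this appendix rest on the identity $\mathrm{vec}(AXB)=(B^{T}\otimes A)\mathrm{vec}(X)$ with column-stacking $\mathrm{vec}$. As a small illustration (assuming that convention, which is not stated explicitly in the text), the block $A=-2\hat{\Lambda}^{-1}\otimes S_{xx}$ can be checked against the $\Theta$-dependent part of the gradient, $-2S_{xx}\Theta\Lambda^{-1}$:

```python
import numpy as np

def vec(M):
    # Column-stacking vectorization (Fortran order).
    return M.reshape(-1, order="F")

rng = np.random.default_rng(2)
d, p = 2, 3
Bx = rng.standard_normal((d, d))
S_xx = Bx @ Bx.T                          # symmetric d x d
Bl = rng.standard_normal((p, p))
lam_inv = np.linalg.inv(Bl @ Bl.T + p * np.eye(p))   # symmetric p x p
theta = rng.standard_normal((d, p))

A = -2.0 * np.kron(lam_inv, S_xx)         # (pd) x (pd) block
lhs = A @ vec(theta)
rhs = vec(-2.0 * S_xx @ theta @ lam_inv)
max_gap = np.max(np.abs(lhs - rhs))
```

With row-major stacking the roles of the two Kronecker factors would be swapped, so the convention matters when implementing the score.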

Appendix B Calculation of the Projection Ratio

Letting $\hat{\zeta}(\mbox{$ \bm{x} $})=\hat{\Delta}(\mbox{$ \bm{x} $})+\hat{\eta}(\mbox{$ \bm{x} $})$ , we construct the ratio $\tilde{V}(\hat{\zeta}-\tilde{\zeta})/\tilde{V}(\hat{\zeta}-\eta_{u})$ where $\tilde{\zeta}$ denotes the squared error projection of $\hat{\zeta}$ in $\mathcal{S}^{0}$ . A small ratio indicates that $\mathcal{S}^{1}$ may be removed. By definition,

[TABLE]

To obtain $\tilde{V}(\hat{\zeta}-\tilde{\zeta})$ , one needs to find

[TABLE]

Let $\mathcal{S}^{0}=\mathcal{H}^{0}\oplus\mathcal{H}^{1}$, where $\mathcal{H}^{0}$ is a space spanned by known functions $\{\varphi_{1}(\bm{x}),\cdots,\varphi_{m}(\bm{x})\}$ and $\mathcal{H}^{1}$ is the orthogonal reproducing kernel Hilbert space with the reproducing kernel function $R(\cdot,\cdot)$. Let $\bm{\phi}=\big(\varphi_{i}(\bm{X}_{j})\big)_{i=1,\cdots,m}^{j=1,\cdots,n}$ and $\bm{\xi}=\big(R(\bm{X}_{i},\bm{X}_{j})\big)_{i=1,\cdots,n}^{j=1,\cdots,n}$. Writing $\tilde{\zeta}=\bm{\phi}^{T}\tilde{\bm{d}}+\bm{\xi}^{T}\tilde{\bm{c}}$, we take derivatives with respect to $\tilde{\bm{d}}$ and $\tilde{\bm{c}}$ and set them to zero. After rearrangements, we obtain the equation

[TABLE]

where $\tilde{V}(\bm{a},\bm{b})={\{\tilde{V}(a_{i},b_{j})\}_{i=1}^{I}}_{j=1}^{J}$ for any vectors of functions $\bm{a}=(a_{1},\ldots,a_{I})^{T}$ and $\bm{b}=(b_{1},\ldots,b_{J})^{T}$ .

The right hand side of (A.10) contains some extra components involving $\hat{\Delta}$ . We compute solutions to (A.10) using the Cholesky decomposition implemented in the project() function in the R package gss [gu2014smoothing]. Once $\tilde{\zeta}$ is computed, we have

[TABLE]

Appendix C Proofs of Theoretical Results

To prove Theorem 1, we first introduce a sequence of lemmas as in [wytock2013sparse]. Note that, unlike [wytock2013sparse], we allow different penalties for $\Lambda$ and $\Theta$. Lemma 1 below studies the decay rate of the gradients $\nabla_{\Theta}l_{2}(\Lambda_{0},\Theta_{0})$ and $\nabla_{\Lambda}l_{2}(\Lambda_{0},\Theta_{0})$ in the element-wise $\ell_{\infty}$ norm as the sample size increases.

Lemma 1.

Suppose that Assumption 1 holds. Then

[TABLE]

for any $\vartheta\in(0,40\mbox{$ C_{\sigma} $})$ .

Proof.

[wytock2013sparse] proved (A.12) using the Chernoff bound for the Gaussian tail probability. [ravikumar2011high] proved (A.13) in their Lemma 1. ∎

The next lemma extends the primal-dual witness approach proposed in [wainwright2009sharp] to our multi-penalty setting. Let $\Gamma=(\Lambda^{T},\Theta^{T})^{T}$. With a slight abuse of notation, let $l_{2}(\Gamma)=l_{2}(\Lambda,\Theta)$.

Lemma 2.

Suppose that the true parameter $\Gamma_{0}$ has support $S$ . We consider two optimization problems:

[TABLE]

Let $\Delta=\tilde{\Gamma}-\Gamma_{0}$ and $R(\Delta)=\nabla_{\Gamma}^{2}l_{2}(\Gamma_{0})\Delta+\nabla_{\Gamma}l_{2}(\Gamma_{0})-\nabla_{\Gamma}l_{2}(\tilde{\Gamma})$ . If the following conditions hold,

1. the solution $\tilde{\Gamma}$ is unique;

2. ${|\!|\!|\big(\nabla^{2}_{\Gamma}l_{2}(\Gamma_{0})\big)_{\bar{S}S}\big(\nabla^{2}_{\Gamma}l_{2}(\Gamma_{0})\big)_{SS}^{-1}|\!|\!|}_{\infty}<1-\alpha$ for $0<\alpha<1$;

3. $\max\{\left\lVert\nabla_{\Gamma}l_{2}(\Gamma_{0})\right\rVert_{\infty},\left\lVert R(\Delta)\right\rVert_{\infty}\}\leq\frac{\alpha\lambda}{8}$;

then the two $\ell_{1}$ -regularized solutions are identical, $\tilde{\Gamma}=\hat{\Gamma}$ .
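Condition 2 of the lemma is an incoherence-type condition on the Hessian. For a given Hessian matrix and support set, the left-hand side can be evaluated as in the sketch below, where the function name and the boolean-mask representation of the support $S$ over the vectorized parameter are hypothetical, and the norm is taken to be the $\ell_{\infty}$ operator norm (maximum absolute row sum).

```python
import numpy as np

def incoherence(H, support):
    """Max absolute row sum of H[S^c, S] @ inv(H[S, S]).

    H: Hessian with respect to the vectorized parameter.
    support: boolean mask marking the entries in S.
    """
    support = np.asarray(support, dtype=bool)
    H_SS = H[np.ix_(support, support)]
    H_ScS = H[np.ix_(~support, support)]
    M = H_ScS @ np.linalg.inv(H_SS)
    return np.abs(M).sum(axis=1).max()
```

Condition 2 then holds for any $\alpha$ with `incoherence(H, support) < 1 - alpha`.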

Proof.

Define $\Delta_{\Lambda}=\mbox{$ \tilde{\Lambda} $}-\Lambda_{0}$ , $\Delta_{\Theta}=\mbox{$ \tilde{\Theta} $}-\Theta_{0}$ and $\Delta=(\Delta_{\Lambda}^{T},\Delta_{\Theta}^{T})^{T}$ . Let $R(\Delta)=(R_{\Lambda}^{T}(\Delta_{\Lambda},\Delta_{\Theta}),R_{\Theta}^{T}(\Delta_{\Lambda},\Delta_{\Theta}))^{T}$ be the residual of second order Taylor expansion of the log-likelihood where

[TABLE]

Following the same arguments as in Lemma 3 in [ravikumar2011high], the $\ell_{1}$ optimization problem (A.14) satisfies

[TABLE]

where $Z=(Z_{\Lambda}^{T},Z_{\Theta}^{T})^{T}$ is the sub-differential of the penalty term evaluated at $\Lambda$ and $\Theta$ , and

[TABLE]

If we can verify the strict dual feasibility $\left\lVert Z_{\bar{S}}\right\rVert_{\infty}<1$, then by Lemma 3 in [ravikumar2011high], the restricted solution $\tilde{\Gamma}$ is an optimal solution to the original $\ell_{1}$ problem, i.e., $\tilde{\Gamma}=\hat{\Gamma}$.

Denoting $H=\nabla^{2}_{\Gamma}l_{2}(\Gamma_{0})$ and $G=\nabla_{\Gamma}l_{2}(\Gamma_{0})$ for simplicity, the optimality condition of (A.16) in terms of $S$ and $\bar{S}$ can be rewritten as

[TABLE]

Since $H_{SS}$ is invertible, we have

[TABLE]

Plugging (A.18) back into the second equation in (A.17), we obtain

[TABLE]

Taking the $\ell_{\infty}$ norm of both sides gives

[TABLE]

∎

Based on Lemma 2, the solution $\tilde{\Gamma}$ is constructed as a witness to the original unrestricted solution $\hat{\Gamma}$. Then $\hat{\Gamma}$ inherits many optimality properties from $\tilde{\Gamma}$, in terms of the discrepancy to the true $\Gamma_{0}$ and the recovery of the signed sparsity pattern. Our next step is to bound the residual term $\left\lVert R(\Delta)\right\rVert_{\infty}$ in terms of $\left\lVert\Delta\right\rVert_{\infty}$.

Lemma 3 (Control of remainder).

Suppose that $\left\lVert\Delta\right\rVert_{\infty}\leq\gamma^{-1}\min\{1/(3C_{\Sigma}),C_{\Theta}/2\}$. Then

[TABLE]

Proof.

We describe the proof briefly since it follows the same steps as in [wytock2013sparse]. We express the second order Taylor expansion of a function in terms of its differentials:

[TABLE]

By the definition of $R(\Delta)$ and the mean value theorem, there exists $t\in(0,1)$ such that $R_{\Lambda}(\Delta_{\Lambda},\Delta_{\Theta})=d(\nabla_{\Lambda}l_{2}(\Lambda_{0}+t\Delta_{\Lambda},\Theta_{0}+t\Delta_{\Theta});\Delta_{\Lambda},\Delta_{\Theta})$ and similarly for $R_{\Theta}(\Delta_{\Lambda},\Delta_{\Theta})$ . As expressions of the above second differentials are tedious, we do not include them here. However, we note that each term in $R_{\Lambda}(\Delta_{\Lambda},\Delta_{\Theta})$ and $R_{\Theta}(\Delta_{\Lambda},\Delta_{\Theta})$ has a quadratic expression in $\Delta_{\Lambda}$ and $\Delta_{\Theta}$ , with at most four $(\Lambda_{0}+t\Delta_{\Lambda})^{-1}$ terms, two $(\Theta_{0}+t\Delta_{\Theta})$ terms and one $S_{xx}$ term. Using the fact that

[TABLE]

for any matrices $A,B,C$ and $\left\lVert\mbox{$ S_{xx} $}\right\rVert_{\infty}\leq C_{X}^{2}$ , each term in the second differentials is bounded by

[TABLE]

For an invertible $\Lambda_{0}$ , since $0<t<1$ , it is easy to verify that

[TABLE]

Then

[TABLE]

Similarly, since ${|\!|\!|\Delta_{\Theta}|\!|\!|}\leq\gamma\left\lVert\Delta\right\rVert_{\infty}$, we have

[TABLE]

Combining with (A.19), we obtain

[TABLE]

∎

Lemma 4 (Control of $\Delta$ ).

Suppose that $u\triangleq 2\kappa_{H}(\max\{\left\lVert\nabla_{\Theta}l_{2}(\Lambda_{0},\Theta_{0})\right\rVert_{\infty},\left\lVert\nabla_{\Lambda}l_{2}(\Lambda_{0},\Theta_{0})\right\rVert_{\infty}\}+\lambda)\leq\min\{1/(3C_{\Sigma}\gamma),C_{\Theta}/(2\gamma),1/(412\kappa_{H}C_{\Sigma}^{4}C_{\Theta}^{2}C_{X}^{2}\gamma^{2})\}$. Then

[TABLE]

Proof.

Recall that $\Delta=\tilde{\Gamma}-\Gamma_{0}$ and $\tilde{\Gamma}_{\bar{S}}=\Gamma_{0,\bar{S}}=0$; therefore $\left\lVert\Delta\right\rVert_{\infty}=\left\lVert\Delta_{S}\right\rVert_{\infty}$. Our goal is to bound the deviation $\Delta$. By (A.18), we have $\Delta_{S}=H^{-1}_{SS}(R(\Delta)_{S}-G_{S}-\lambda Z_{S})$. In the following, we use Brouwer's fixed point theorem on a compact set to construct a ball $\mathbb{B}(u)$ that contains $\Delta$. Define the closed $\ell_{\infty}$-ball $\mathbb{B}(u)=\{\Delta|\left\lVert\Delta_{S}\right\rVert_{\infty}\leq u\}$ and a continuous map $F:\Delta_{S}\rightarrow F(\Delta_{S})$ such that

[TABLE]

Now it suffices to show $F\big(\mathbb{B}(u)\big)\subseteq\mathbb{B}(u)$, as this implies that $F$ has a fixed point in $\mathbb{B}(u)$, i.e., there is a solution to the above equation in this ball. By uniqueness of the optimal solution, we can thus conclude that $\Delta$ belongs to this ball.

Taking the $\ell_{\infty}$ norm of both sides of (A.21), we have

[TABLE]

For any $\Pi\in\mathbb{B}(u)$ , by Lemma 3, the first term in (A.22) is bounded by

[TABLE]

By the definition of radius $u$ , the second term in (A.22) is bounded by

[TABLE]

Therefore, we have $\left\lVert F(\Pi)\right\rVert_{\infty}\leq u$ . ∎

Proof of Theorem 1. We first show that $\tilde{\Gamma}$ equals $\hat{\Gamma}$, the solution to the original objective function (19), with high probability. Then we proceed with the proof conditioning on this event.

By Lemma 1, we have the element-wise tail conditions for $\Lambda$ : $\mathbb{P}(\max_{i,j}|\nabla_{\Lambda,ij}l_{2}(\Lambda_{0},\Theta_{0})|>\delta)\leq 1/f_{\Lambda}(n,\delta)$ , where $f_{\Lambda}(n,\delta)=(1/4)\exp{\big{(}n\delta^{2}/(3200\mbox{$ C^{2}_{\sigma} $})\big{)}}$ , and $\nabla_{\Lambda,ij}l_{2}(\Lambda_{0},\Theta_{0})$ denotes the $(i,j)$ -th element in $\nabla_{\Lambda}l_{2}(\Lambda_{0},\Theta_{0})$ . For a fixed $n$ , denote

[TABLE]

Similarly, for each fixed $\delta>0$ , denote

[TABLE]

By the monotonicity of the function $f_{\Lambda}(n,\delta)$, it is easy to see that

[TABLE]

Applying Corollary 1 and Lemma 8 in [ravikumar2011high], for any $\tau>2$, we have the control of sampling noise for $\hat{\Lambda}$

[TABLE]

where $\bar{n}_{f_{\Lambda}}(\delta;p^{\tau})=3200C^{2}_{\sigma}(\tau\log p+\log 4)/\delta^{2}$ and $\bar{\delta}_{f_{\Lambda}}(n;p^{\tau})=\sqrt{3200C^{2}_{\sigma}}\sqrt{(\tau\log p+\log 4)/n}$. Now we develop the control of sampling noise for $\Theta$. Again, by Lemma 1 we have the element-wise tail probability for $\hat{\Theta}$:

[TABLE]

where $f_{\Theta}(n,\delta)=(1/2)\exp\big{(}n\delta^{2}/(8\mbox{$ C^{2}_{\sigma} $}C_{X}^{2})\big{)}$ .
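Both tail functions have the form $f(n,\delta)=a\exp(n\delta^{2}/K)$, with $(a,K)=(1/4,3200C_{\sigma}^{2})$ for $\Lambda$ and $(a,K)=(1/2,8C_{\sigma}^{2}C_{X}^{2})$ for $\Theta$. Assuming that the inverse functions $\bar{\delta}_{f}(n;r)$ and $\bar{n}_{f}(\delta;r)$ are defined by solving $f(n,\delta)=r$ (this matches the closed forms quoted above for $\Lambda$), they can be sketched as follows. The numerical constants $C_{\sigma}=C_{X}=1$ and $\tau=3$ are hypothetical and chosen only for illustration.

```python
import math

def tail_f(n, delta, a, K):
    # f(n, delta) = a * exp(n * delta^2 / K)
    return a * math.exp(n * delta ** 2 / K)

def delta_bar(n, r, a, K):
    # Solves tail_f(n, delta, a, K) = r for delta.
    return math.sqrt(K * math.log(r / a) / n)

def n_bar(delta, r, a, K):
    # Solves tail_f(n, delta, a, K) = r for n.
    return K * math.log(r / a) / delta ** 2

# Hypothetical constants for illustration only.
C_sigma, C_X, tau = 1.0, 1.0, 3.0
p, d, n = 25, 3, 200

a_lam, K_lam = 0.25, 3200.0 * C_sigma ** 2
a_th, K_th = 0.5, 8.0 * C_sigma ** 2 * C_X ** 2

delta_lam = delta_bar(n, p ** tau, a_lam, K_lam)
delta_th = delta_bar(n, (p * d) ** tau, a_th, K_th)
```

For $\Theta$ this gives $\bar{\delta}_{f_{\Theta}}\big(n;(pd)^{\tau}\big)=\sqrt{8C_{\sigma}^{2}C_{X}^{2}\big(\tau\log(pd)+\log 2\big)/n}$ under the stated assumption.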

Define $\bar{\delta}_{f_{\Theta}}(n;\omega)$ and $\bar{n}_{f_{\Theta}}(\delta;\omega)$ similarly to (A.23) and (A.24). Applying the union bound over all $pd$ entries of the gradient matrix, we obtain that

[TABLE]

Let $\delta=\bar{\delta}_{f_{\Theta}}\big(n;(pd)^{\tau}\big)$. Then for any $\tau>1$,

[TABLE]

The last equality follows from the fact that $f_{\Theta}\Big(n,\bar{\delta}_{f_{\Theta}}\big(n;(pd)^{\tau}\big)\Big)=(pd)^{\tau}$, based on the definition of $\bar{\delta}_{f_{\Theta}}$.

Straightforward calculation shows that $\bar{n}_{f_{\Theta}}\big{(}\delta;(pd)^{\tau}\big{)}$ and $\bar{\delta}_{f_{\Theta}}\big{(}n;(pd)^{\tau}\big{)}$ take the forms

[TABLE]

and

[TABLE]

Denote $\bar{n}_{f_{\Gamma}}=\max\{\bar{n}_{f_{\Lambda}},\bar{n}_{f_{\Theta}}\}$ , $\bar{\delta}_{f_{\Gamma}}=\max\{\bar{\delta}_{f_{\Lambda}},\bar{\delta}_{f_{\Theta}}\}$ , by (A.26) and (A.27) we have

[TABLE]

Specifically,

[TABLE]

where $C_{X}^{\star}=\max\{\mbox{$ C_{X} $}^{2},1\}$ .

Let $\mathcal{A}$ denote the event that $\max\{\left\lVert\nabla_{\Lambda}l_{2}(\Lambda_{0},\Theta_{0})\right\rVert_{\infty},\left\lVert\nabla_{\Theta}l_{2}(\Lambda_{0},\Theta_{0})\right\rVert_{\infty}\}<\bar{\delta}_{f_{\Gamma}}$. Then (A.28) implies that $\mathbb{P}(\mathcal{A})\geq 1-\big(p^{-(\tau-2)}+(pd)^{-(\tau-1)}\big)$. Accordingly, we condition on the event $\mathcal{A}$ in the following analysis.

Next, we verify that the third condition in Lemma 2 holds. Choose the (larger) regularization penalty $\lambda=(8/\alpha)\bar{\delta}_{f_{\Gamma}}$; then the first half, $\left\lVert\nabla_{\Gamma}l_{2}(\Gamma_{0})\right\rVert_{\infty}\leq\alpha\lambda/8$, is satisfied. It remains to establish the bound $\left\lVert R(\Delta)\right\rVert_{\infty}\leq\alpha\lambda/8$. We do so by using Lemmas 4 and 3 consecutively. Choose

[TABLE]

by our choice of $\lambda$, the minimum bound on $n$, and the monotonicity property (A.25), we have

[TABLE]

Applying Lemma 4, we conclude that

[TABLE]

Then Lemma 3 gives

[TABLE]

where the final inequality follows from the lower bound on sample size $n$ , and the monotonicity property (A.25).

To summarize, we have shown that condition 3 in Lemma 2 holds. Furthermore, a finite $C_{X}$ implies condition 1, and condition 2 is assumed in Assumption 3. These allow us to conclude that $\tilde{\Gamma}=\hat{\Gamma}$. By (A.29) and (A.30), the estimator $\hat{\Gamma}$ satisfies the $\ell_{\infty}$ bound claimed in Theorem 1(a). Moreover, by the bound (A.20) and the definition of $u$ in Lemma 4, the estimate $\tilde{\Gamma}_{ij}$ cannot differ enough from $\Gamma_{0,ij}$ to change sign when condition (21) is satisfied. This proves Theorem 1(b).

Proof of Corollary 1. Let $\psi=2\kappa_{H}(1+8/\alpha)C_{\sigma}C_{X}^{\star}\sqrt{3200}\sqrt{\big(\tau\log(pd)+\log 4\big)/n}$. From Theorem 1, we have $\max\Big\{\left\lVert\hat{\Lambda}-\Lambda_{0}\right\rVert_{\infty},\left\lVert\hat{\Theta}-\Theta_{0}\right\rVert_{\infty}\Big\}\leq\psi$ with probability at least $1-\big(p^{-(\tau-2)}+(pd)^{-(\tau-1)}\big)$. Since $\Lambda_{0}$ has at most $p+s_{\Lambda}$ nonzero elements, including the diagonal elements, and $\Theta_{0}$ has at most $s_{\Theta}$ nonzero elements, we have

[TABLE]

Combining the above two inequalities leads to the bound in (22).
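The elementary step behind passing from an elementwise bound to a Frobenius-norm bound is that a matrix with at most $s$ non-zero entries, each bounded by $\psi$ in absolute value, has Frobenius norm at most $\sqrt{s}\,\psi$. The sketch below checks this numerically. It assumes that the difference matrix is supported on a small set of entries (for instance, because the support of the estimate is contained in the true support); the helper name and the example matrix are hypothetical.

```python
import numpy as np


def frobenius_vs_elementwise(diff):
    """Return (Frobenius norm, sqrt(#nonzeros) * max absolute entry) for a matrix."""
    diff = np.asarray(diff, dtype=float)
    nnz = int(np.count_nonzero(diff))
    max_abs = float(np.abs(diff).max())
    fro = float(np.linalg.norm(diff, "fro"))
    return fro, float(np.sqrt(nnz) * max_abs)


# Hypothetical sparse symmetric difference: p diagonal entries plus 4 off-diagonal ones.
p = 5
rng = np.random.default_rng(0)
diff = np.diag(rng.uniform(-0.1, 0.1, size=p))
diff[0, 1] = diff[1, 0] = 0.05
diff[2, 4] = diff[4, 2] = -0.08

fro, bound = frobenius_vs_elementwise(diff)
print(fro, bound, fro <= bound)
```

The inequality is tight when all non-zero entries share the same magnitude, so the factor $\sqrt{s}$ cannot be improved without further structure.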

Lemma 5.

Suppose that Assumption 4 holds. Then, for positive definite matrices $\hat{\Lambda}$ and $\Lambda_{0}$,

[TABLE]

Proof of Lemma 5. This proof is similar to that of Lemma A.1 in Fan, Liao and Mincheva (2011). Under the event ${\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\hat{\Lambda}-\Lambda_{0}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{F}\leq 0.5C_{L}$, for any vector $\mathbf{v}\in\mathbb{R}^{p}$ with Euclidean norm $\left\lVert\mathbf{v}\right\rVert=1$, we have

[TABLE]

The inequality holds by the fact that ${\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}\leq{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{F}$ for any $A$ . Therefore, $\lambda_{\mathrm{min}}(\hat{\Lambda})\geq 0.5C_{L}$ .
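The eigenvalue bound can be traced through a short chain of inequalities. The following is a sketch, assuming that Assumption 4 supplies $\lambda_{\mathrm{min}}(\Lambda_{0})\geq C_{L}$ and writing $\lVert\cdot\rVert_{2}$ and $\lVert\cdot\rVert_{F}$ for the spectral and Frobenius matrix norms. For any unit vector $\mathbf{v}$,

$$
\mathbf{v}^{T}\hat{\Lambda}\mathbf{v}
=\mathbf{v}^{T}\Lambda_{0}\mathbf{v}+\mathbf{v}^{T}(\hat{\Lambda}-\Lambda_{0})\mathbf{v}
\geq\lambda_{\mathrm{min}}(\Lambda_{0})-\lVert\hat{\Lambda}-\Lambda_{0}\rVert_{2}
\geq C_{L}-\lVert\hat{\Lambda}-\Lambda_{0}\rVert_{F}
\geq C_{L}-0.5C_{L}=0.5C_{L}.
$$

Minimizing the left-hand side over unit vectors $\mathbf{v}$ gives the stated lower bound on $\lambda_{\mathrm{min}}(\hat{\Lambda})$.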

Meanwhile,

[TABLE]

The first inequality holds because of submultiplicativity of the $\ell_{2}$ norm, and ${\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{F}\leq\sqrt{p}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}$ for any matrix $A$ .
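Both conclusions used later in the proofs, $\lambda_{\mathrm{min}}(\hat{\Lambda})\geq 0.5C_{L}$ and the inverse bound with constant $2\sqrt{p}/C_{L}^{2}$, can be checked numerically. The sketch below does so on a randomly generated example, assuming that $C_{L}$ is a lower bound on the smallest eigenvalue of $\Lambda_{0}$; the function name, the matrices and the value of $C_{L}$ are hypothetical.

```python
import numpy as np


def check_lemma5(p=6, c_l=1.0, seed=0):
    """Return (lambda_min(Lambda_hat), ||inv diff||_F, (2 sqrt(p) / c_l^2) ||diff||_F)."""
    rng = np.random.default_rng(seed)
    # Hypothetical "true" precision matrix whose smallest eigenvalue is at least c_l.
    b = rng.standard_normal((p, p))
    lambda0 = b @ b.T + c_l * np.eye(p)
    # Symmetric perturbation rescaled so that its Frobenius norm equals 0.5 * c_l.
    e = rng.standard_normal((p, p))
    e = (e + e.T) / 2.0
    e *= 0.5 * c_l / np.linalg.norm(e, "fro")
    lambda_hat = lambda0 + e

    min_eig = float(np.linalg.eigvalsh(lambda_hat).min())
    lhs = float(np.linalg.norm(np.linalg.inv(lambda_hat) - np.linalg.inv(lambda0), "fro"))
    rhs = float(2.0 * np.sqrt(p) / c_l**2 * np.linalg.norm(e, "fro"))
    return min_eig, lhs, rhs


min_eig, lhs, rhs = check_lemma5()
print(min_eig >= 0.5, lhs <= rhs)
```

The check relies on the identity $\hat{\Lambda}^{-1}-\Lambda_{0}^{-1}=\hat{\Lambda}^{-1}(\Lambda_{0}-\hat{\Lambda})\Lambda_{0}^{-1}$, which is what makes the submultiplicativity argument applicable.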

Proof of Theorem 2. Recall that $\text{SKL}\big{(}f_{0}(\mbox{$ \bm{y} $}|\mbox{$ \bm{x} $}),\hat{f}(\mbox{$ \bm{y} $}|\mbox{$ \bm{x} $})\big{)}$ has an explicit form:

[TABLE]

where $U=\hat{\Lambda}^{-1}\hat{\Theta}^{T}-\Lambda_{0}^{-1}\Theta_{0}^{T}$ . We now derive the upper bound for each of the four terms in (A.31) conditioning on the event ${\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\hat{\Lambda}-\Lambda_{0}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{F}\leq 0.5C_{L}$ and ${\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\hat{\Lambda}^{-1}-\Lambda_{0}^{-1}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{F}\leq(2\sqrt{p}/C_{L}^{2}){\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\hat{\Lambda}-\Lambda_{0}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{F}$ .

We first derive a bound for $I_{1}$ using the fact that $I_{1}\leq 2^{-1}\int_{\mathcal{X}}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\hat{\Lambda}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{2}\left\lVert U\bm{x}\right\rVert^{2}f_{0}(\mbox{$ \bm{x} $})d\mbox{$ \bm{x} $}$. Note that

[TABLE]

Furthermore, since the Frobenius norm for a vector equals its Euclidean norm, we have

[TABLE]

The last inequality follows from Lemma 5. Combining these results, we obtain the upper bound for $I_{1}$

[TABLE]

where $G=\max\{\int_{\mathcal{X}}\left\lVert\mathbf{x}\right\rVert_{2}^{2}f_{0}(\mbox{$ \bm{x} $})d\mbox{$ \bm{x} $},\int_{\mathcal{X}}\left\lVert\mathbf{x}\right\rVert_{2}^{2}\hat{f}(\mbox{$ \bm{x} $})d\mbox{$ \bm{x} $}\}\max\{C_{U},1\}\max\{D_{T}^{2},1\}/\min\{C_{L}^{4},1\}$ and $C_{m}=\max\{\int_{\mathcal{X}}\bm{x}^{T}\bm{x}f_{0}(\mbox{$ \bm{x} $})d\mbox{$ \bm{x} $},\int_{\mathcal{X}}\bm{x}^{T}\bm{x}\hat{f}(\mbox{$ \bm{x} $})d\mbox{$ \bm{x} $}\}$ . As the only difference between $I_{1}$ and $I_{2}$ lies in whether the expectation is calculated with respect to the true or estimated density, this bound also applies to $I_{2}$ .

For $I_{3}$ , note that

[TABLE]

where the first inequality uses the fact that $\mbox{$ \text{tr} $}(A^{T}B)$ defines an inner product on symmetric matrices $A$ and $B$, so that by the Cauchy-Schwarz inequality, $\big(\mbox{$ \text{tr} $}(A^{T}B)\big)^{2}\leq\mbox{$ \text{tr} $}(A^{T}A)\mbox{$ \text{tr} $}(B^{T}B)={\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{F}^{2}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|B\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{F}^{2}$; and the second inequality holds by Lemma 5 with probability 1. Then

[TABLE]
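The trace form of the Cauchy-Schwarz inequality used for $I_{3}$ states that $|\text{tr}(A^{T}B)|$ is at most the product of the Frobenius norms of $A$ and $B$. The minimal sketch below verifies this on random symmetric matrices; the helper name and the matrices are hypothetical and serve only as a sanity check.

```python
import numpy as np


def trace_inner_product_and_bound(a, b):
    """Return (tr(A^T B), ||A||_F * ||B||_F) for two matrices of the same shape."""
    inner = float(np.trace(a.T @ b))
    bound = float(np.linalg.norm(a, "fro") * np.linalg.norm(b, "fro"))
    return inner, bound


rng = np.random.default_rng(1)
for _ in range(5):
    a = rng.standard_normal((4, 4))
    a = a + a.T  # symmetrize
    b = rng.standard_normal((4, 4))
    b = b + b.T
    inner, bound = trace_inner_product_and_bound(a, b)
    print(abs(inner) <= bound)
```

Note that the product of squared Frobenius norms bounds the square of the trace, not the trace itself; the two differ whenever the norms are smaller than one.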

For $I_{4}$ , following similar arguments as above,

[TABLE]

Then by Corollary 1 and Lemma 5, we have $I_{1}$ and $I_{2}$ on the order of $\mathcal{O}\Big{(}n^{-5/2}p^{5/2}(\log pd)^{5/2}\Big{)}$ , and $I_{3}$ and $I_{4}$ on the order of $\mathcal{O}\Big{(}n^{-1}p^{2}(\log pd)\Big{)}$ . This proves the claim.

Proof of Theorem 3. The bound on $D\big{(}f_{0}(\mbox{$ \bm{z} $}),\hat{f}(\mbox{$ \bm{z} $})\big{)}$ in (28) follows directly by combining (25) and (26). However, as the bound (25) for the parametric part is conditional on the event ${\left|\kern-1.07639pt\left|\kern-1.07639pt\left|\hat{\Lambda}-\Lambda_{0}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{F}\leq 0.5C_{L}$, a new lower bound for the sample size $n$ needs to be derived such that this condition is always satisfied.

By the right-hand side of the upper bound (22) in Corollary 1, we have

[TABLE]

Combining (A.35) with (20) yields (27) after some simple algebra.

Bibliography

[1] Allen, G. I. and Liu, Z. (2012). A log-linear graphical model for inferring genetic networks from high-throughput sequencing data, Bioinformatics and Biomedicine (BIBM), 2012 IEEE International Conference on, IEEE, pp. 1–6.
[2] Armijo, L. (1966). Minimization of functions having Lipschitz continuous first partial derivatives, Pacific Journal of Mathematics 16: 1–3.
[3] Duong, T. (2007). ks: Kernel density estimation and kernel discriminant analysis for multivariate data in R, Journal of Statistical Software 21: 1–16.
[4] Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties, Journal of the American Statistical Association 96: 1348–1360.
[5] Fan, J., Liao, Y. and Liu, H. (2016). An overview of the estimation of large covariance and precision matrices, The Econometrics Journal 19: C1–C32.
[6] Fan, J., Liao, Y. and Mincheva, M. (2011). High dimensional covariance matrix estimation in approximate factor models, Annals of Statistics 39: 3320–3356.
[7] Fellinghauer, B., Bühlmann, P., Ryffel, M., Von Rhein, M. and Reinhardt, J. D. (2013). Stable graphical model estimation with random forests for discrete, continuous, and mixed variables, Computational Statistics & Data Analysis 64: 132–152.
[8] Finegold, M. and Drton, M. (2011). Robust graphical modeling of gene networks using classical and alternative t-distributions, The Annals of Applied Statistics 5: 1057–1080.
