Bayesian Structure Learning in Graphical Models using Shrinkage priors

Sayantan Banerjee

arXiv:1908.02684·math.ST·August 8, 2019

TL;DR

This paper introduces a Bayesian method for high-dimensional graphical model structure learning using a novel shrinkage prior, with theoretical guarantees and a Gibbs sampling scheme.

Contribution

It proposes the DL-graphical prior for precision matrix estimation and provides posterior convergence guarantees with a Gibbs sampling algorithm.

Findings

01 Effective structure learning in high-dimensional settings

02 Theoretical posterior convergence guarantees

03 Gibbs sampling scheme for practical implementation

Abstract

We consider the problem of learning the structure of a high dimensional precision matrix under sparsity assumptions. We propose to use a shrinkage prior, called the DL-graphical prior based on the Dirichlet-Laplace prior used for the Gaussian mean problem. A posterior sampling scheme based on Gibbs sampling is also provided along with theoretical guarantees of the method by obtaining the posterior convergence rate of the precision matrix.

Equations

$$\mathcal{U}(s_{p})=\left\{\Omega\in\mathcal{M}_{p}^{+}:\#(\omega_{ij}\neq 0)\leq s_{p},\;i<j=1,\ldots,p\right\},$$

$\omega_{ii}$, $\omega_{ij}$, $\psi_{ij}$, $\phi$, $\tau$

$$p(\Omega\mid X,\psi,\phi,\tau)\propto\{\det(\Omega)\}^{n/2}\exp\left\{-\frac{1}{2}\mathrm{tr}(S\Omega)\right\}\prod_{i<j}\exp\left\{-\frac{\omega_{ij}^{2}}{2\psi_{ij}\phi_{ij}^{2}\tau^{2}}\right\}.$$

$$\Omega=\begin{pmatrix}\Omega_{-p,-p}&\omega_{-p,p}\\ \omega_{-p,p}^{\prime}&\omega_{pp}\end{pmatrix},\qquad S=\begin{pmatrix}S_{-p,-p}&s_{-p,p}\\ s_{-p,p}^{\prime}&s_{pp}\end{pmatrix}.$$

$$\Lambda=\begin{pmatrix}\Lambda_{-p,-p}&\lambda_{-p,p}\\ \lambda_{-p,p}^{\prime}&\lambda_{pp}\end{pmatrix}.$$

$$p(\omega_{-p,p},\omega_{pp}\mid\Omega_{-p,-p},X,\Lambda,\tau)$$

$$p(\theta,\eta\mid\Omega_{-p,-p},X,\Lambda,\tau)$$

$$\tilde{\psi}_{-i,i}\mid\phi,\tau,\omega\sim iG(\phi_{-i,i}\tau/|\omega_{-i,i}|,\,1),$$

$$T_{ij}\sim giG(a-1,\,1,\,2|\omega_{ij}|),$$

$$\tau\mid\phi,\omega\sim giG\Big(\nu a-\nu,\,1,\,2\sum_{i<j}|\omega_{ij}|/\phi_{ij}\Big),$$

$$P(|\omega_{ij}|<\delta)\geq 1-C\,\frac{\log(1/\delta)}{\Gamma(a)},$$

$$B(p_{\Omega_{0}},\epsilon_{n})=\left\{p_{\Omega}:K(p_{\Omega_{0}},p_{\Omega})\leq\epsilon_{n}^{2},\;V(p_{\Omega_{0}},p_{\Omega})\leq\epsilon_{n}^{2}\right\}.$$

$$B(p_{\Omega_{0}},\epsilon_{n})\supset\left\{p_{\Omega}:\|\Omega-\Omega_{0}\|_{\infty}\leq c\,\epsilon_{n}/p\right\}.$$

$$\Pi(\|\Omega-\Omega_{0}\|_{\infty}\leq c\,\epsilon_{n}/p)\gtrsim(c\,\epsilon_{n}/p)^{p+s_{p}}\left(1-\frac{C_{1}\log(p/c\,\epsilon_{n})}{p^{2}}\right)^{p_{0}}.$$

$$(p+s_{p})(\log p+\log\epsilon_{n}^{-1})+p_{0}\log\left(1-\frac{C_{1}(\log p+\log\epsilon_{n}^{-1})}{p^{2}}\right)\asymp n\epsilon_{n}^{2}.$$

$$\mathrm{supp}_{\delta}(\Omega)=\{(i,j):|\omega_{ij}|>\delta,\;i<j=1,\ldots,p\}$$

$$|\mathrm{supp}_{\delta}(\Omega)|<r<\frac{1}{2}{p\choose 2},$$

$$\lim_{n\to\infty}E_{\Omega_{0}}P(|\mathrm{supp}_{\delta_{p}}(\Omega)|>Ms_{p}\mid X)=0,$$

$$\Pi(|\omega_{ij}|>L)\leq\frac{M^{\prime}}{\Gamma(a)}\left\{C^{\prime}-\log(1-e^{-2L})\right\},$$

Full text

Bayesian structure learning in graphical models using shrinkage priors

Note: This is an extended abstract version of the ongoing work.

Sayantan Banerjee

Indian Institute of Management Indore

1 Introduction

We consider the problem of learning the structure of an undirected graphical model corresponding to a $p$ -dimensional Gaussian random variable based on an iid sample of size $n$ , where $p$ can be much larger than $n$ . A Gaussian graphical model captures the conditional independence structure of the underlying random variable, with absence of an edge signifying that the corresponding components of the random variable are conditionally independent given the rest. Thus the sparsity structure of the graphical model is exactly given by the sparsity structure of the precision matrix (inverse covariance matrix) of the random variable.
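As a small illustration of this correspondence, the sketch below builds a precision matrix for a chain graph on five nodes and reads the edges off its nonzero off-diagonal entries. The matrix values and the variable names (`omega`, `sigma`, `edges`, `partial_corr`) are chosen only for this example and do not come from the paper.

```python
import numpy as np

p = 5
# Tridiagonal precision matrix: a chain graph 0 - 1 - 2 - 3 - 4.
omega = 2.0 * np.eye(p) - 0.9 * (np.eye(p, k=1) + np.eye(p, k=-1))

# The covariance matrix is the inverse of the precision matrix.
sigma = np.linalg.inv(omega)

# Edges of the graph are the nonzero off-diagonal entries of omega.
edges = [(i, j) for i in range(p) for j in range(i + 1, p)
         if abs(omega[i, j]) > 1e-12]

# Partial correlation of components i and j given the rest:
# -omega_ij / sqrt(omega_ii * omega_jj).
d = np.sqrt(np.diag(omega))
partial_corr = -omega / np.outer(d, d)
np.fill_diagonal(partial_corr, 1.0)

print(edges)
print(np.round(sigma, 3))
```

The covariance matrix here has no zero entries even though the precision matrix is sparse, which is why the graph is read from $\Omega$ rather than from $\Sigma$.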

Standard statistical procedures such as the maximum likelihood estimator perform poorly, or do not even exist, when the dimension $p$ is large. Regularized or penalty-based estimators have been proposed to tackle the high-dimensional situation under sparsity assumptions. Bayesian techniques in this direction include placing sparse or spike-and-slab priors on individual elements of the precision matrix.
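The failure of maximum likelihood can be seen numerically. For mean-zero Gaussian data the maximum likelihood estimate of the precision matrix is the inverse of the sample covariance matrix, and that matrix has rank at most $n$, so it cannot be inverted when $p>n$. The following sketch uses simulated data with arbitrary sizes chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 30  # fewer observations than variables

X = rng.standard_normal((n, p))
S = X.T @ X / n  # sample covariance matrix for mean-zero data

rank_S = np.linalg.matrix_rank(S)
smallest_eig = np.linalg.eigvalsh(S)[0]  # eigenvalues come back in ascending order

# The rank cannot exceed n, so S is singular and has no inverse.
print(rank_S, smallest_eig)
```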

In this work, we focus on learning the structure of a Gaussian graphical model by estimating the precision matrix under continuous shrinkage priors. In the next section, we present the model assumptions and specify the prior distributions. We then derive the posterior distribution of the parameters and give a sampling scheme for it. Finally, we establish theoretical guarantees for the method by deriving the posterior convergence rate of the precision matrix.

2 Model assumptions and prior distribution

Consider multivariate Gaussian data $X_{1},\ldots,X_{n}\stackrel{iid}{\sim}N_{p}(0,\Sigma),$ where $\Sigma$ is a $p\times p$ positive definite matrix. Let $\Omega=\Sigma^{-1}$ denote the corresponding inverse covariance matrix, or precision matrix. We consider a high-dimensional situation in which $p\gg n$. Suppose the true precision matrix is sparse, that is, it belongs to the following class of positive definite matrices:

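$$\mathcal{U}(s_{p})=\left\{\Omega\in\mathcal{M}_{p}^{+}:\#(\omega_{ij}\neq 0)\leq s_{p},\;i<j=1,\ldots,p\right\},$$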

$\mathcal{M}_{p}^{+}$ being the cone of positive definite matrices of dimension $p$ .

We propose the following prior distribution on the elements of $\Omega=(\!(\omega_{ij})\!)$ .

[TABLE]

The above prior distribution is motivated by the Dirichlet-Laplace shrinkage priors introduced by Bhattacharya et al., (2015) for the sparse Gaussian mean problem. It is a global-local shrinkage prior: the parameter $\tau^{2}$ induces global shrinkage, while $\psi_{ij}$ and $\phi_{ij}$ allow the amount of shrinkage to deviate locally for individual elements.
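To give a feel for the global-local structure, here is a minimal sketch that draws off-diagonal elements from a Dirichlet-Laplace hierarchy. It assumes the hyperpriors take the form used by Bhattacharya et al., (2015) for the Gaussian mean problem, with $\nu={p\choose 2}$ playing the role of the number of parameters: $\psi_{ij}\sim\mathrm{Exp}(1/2)$, $\phi\sim\mathrm{Dir}(a,\ldots,a)$, $\tau\sim\mathrm{Gamma}(\nu a,1/2)$ and $\omega_{ij}\mid\psi,\phi,\tau\sim N(0,\psi_{ij}\phi_{ij}^{2}\tau^{2})$. The exact specification in this paper, including the prior on the diagonal, is not reproduced here, and the function name `draw_dl_offdiagonals` is hypothetical.

```python
import numpy as np

def draw_dl_offdiagonals(p, a, rng):
    """Draw the p(p-1)/2 off-diagonal elements from an assumed DL hierarchy."""
    nu = p * (p - 1) // 2
    psi = rng.exponential(scale=2.0, size=nu)   # Exp with rate 1/2 (mean 2)
    phi = rng.dirichlet(np.full(nu, a))         # local weights on the simplex
    tau = rng.gamma(shape=nu * a, scale=2.0)    # Gamma with rate 1/2
    # Conditionally Gaussian with standard deviation sqrt(psi) * phi * tau.
    omega_offdiag = rng.normal(0.0, np.sqrt(psi) * phi * tau)
    return omega_offdiag, psi, phi, tau

rng = np.random.default_rng(0)
p = 6
a = 1.0 / (p * (p - 1) // 2)  # illustrative choice of the Dirichlet parameter
omega_offdiag, psi, phi, tau = draw_dl_offdiagonals(p, a, rng)
print(np.round(np.sort(np.abs(omega_offdiag))[::-1], 4))
```

With a small value of `a`, most draws are typically very close to zero and a few are comparatively large. These draws do not by themselves produce a positive definite matrix; that constraint is a separate part of the model.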

3 Posterior distribution and sampling scheme

In this section, we provide the posterior distribution of the precision matrix $\Omega$ and devise a sampling scheme for the parameters. The conditional posterior density of $\Omega$ is given by

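$$p(\Omega\mid X,\psi,\phi,\tau)\propto\{\det(\Omega)\}^{n/2}\exp\left\{-\frac{1}{2}\mathrm{tr}(S\Omega)\right\}\prod_{i<j}\exp\left\{-\frac{\omega_{ij}^{2}}{2\psi_{ij}\phi_{ij}^{2}\tau^{2}}\right\}.$$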

We partition the precision matrix as

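$$\Omega=\begin{pmatrix}\Omega_{-p,-p}&\omega_{-p,p}\\ \omega_{-p,p}^{\prime}&\omega_{pp}\end{pmatrix},\qquad S=\begin{pmatrix}S_{-p,-p}&s_{-p,p}\\ s_{-p,p}^{\prime}&s_{pp}\end{pmatrix}.$$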

Also define $\Lambda=(\!(\lambda_{ij})\!),$ where $\lambda_{ij}=\psi_{ij}\phi_{ij}^{2}.$ Then partition $\Lambda$ as

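$$\Lambda=\begin{pmatrix}\Lambda_{-p,-p}&\lambda_{-p,p}\\ \lambda_{-p,p}^{\prime}&\lambda_{pp}\end{pmatrix}.$$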

Then, we have,

[TABLE]

where $\Lambda^{*}=\mathrm{diag}(\lambda_{-p,p}).$ Let $\theta=\omega_{-p,p},\;\eta=\omega_{pp}-\omega_{-p,p}^{\prime}\Omega_{-p,-p}^{-1}\omega_{-p,p}.$ Then,

[TABLE]

where $A=\{s_{pp}\Omega_{-p,-p}^{-1}+(\Lambda^{*}\tau^{2})^{-1}\}^{-1}.$
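A minimal sketch of this block update follows. It assumes the conditional posterior of $\Omega$ has exactly the form given in this section, with no additional term from the prior on the diagonal, and that $S$ is the scatter matrix $X^{\prime}X$, which the text does not state. Under those assumptions, $\det(\Omega)=\det(\Omega_{-p,-p})\,\eta$, and completing the square in $\theta$ gives a Gaussian with mean $-As_{-p,p}$ and covariance $A$, while $\eta$ is gamma distributed with shape $n/2+1$ and rate $s_{pp}/2$. The function name `update_last_column` is hypothetical.

```python
import numpy as np

def update_last_column(omega, S, lam, tau, n, rng):
    """Draw (theta, eta) and rebuild the last row and column of omega."""
    omega11_inv = np.linalg.inv(omega[:-1, :-1])
    s12, s_pp = S[:-1, -1], S[-1, -1]
    lam_star = lam[:-1, -1]  # diagonal entries of Lambda^*

    # A = {s_pp * Omega_{-p,-p}^{-1} + (Lambda^* tau^2)^{-1}}^{-1}
    A = np.linalg.inv(s_pp * omega11_inv + np.diag(1.0 / (lam_star * tau**2)))
    A = (A + A.T) / 2.0  # symmetrize against round-off

    theta = rng.multivariate_normal(-A @ s12, A)
    eta = rng.gamma(shape=n / 2.0 + 1.0, scale=2.0 / s_pp)  # rate s_pp / 2

    new = omega.copy()
    new[:-1, -1] = theta
    new[-1, :-1] = theta
    # Invert eta = omega_pp - theta' Omega_{-p,-p}^{-1} theta.
    new[-1, -1] = eta + theta @ omega11_inv @ theta
    return new

rng = np.random.default_rng(1)
n, p = 50, 4
X = rng.standard_normal((n, p))
S = X.T @ X                # assumed scatter matrix
omega = np.eye(p)          # current state of the chain
lam = np.ones((p, p))      # lambda_ij = psi_ij * phi_ij^2, set to 1 for illustration
new_omega = update_last_column(omega, S, lam, 1.0, n, rng)
print(np.round(new_omega, 3))
```

Because $\eta>0$ is the Schur complement of $\Omega_{-p,-p}$, the updated matrix stays positive definite. A full sweep would apply the same update to every column in turn after permuting it into the last position.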

So $\theta$ and $\eta$ can be simulated easily. For the rest of the parameters, we follow the same Gibbs sampler as proposed by Bhattacharya et al., (2015). That is, simulate

$$\tilde{\psi}_{-i,i}\mid\phi,\tau,\omega\sim iG(\phi_{-i,i}\tau/|\omega_{-i,i}|,\,1),$$

and then let $\psi_{-i,i}=1/\tilde{\psi}_{-i,i}.$

Simulate

$$T_{ij}\sim giG(a-1,\,1,\,2|\omega_{ij}|),$$

and then set $\phi_{ij}=T_{ij}/\sum_{i<j}T_{ij}.$

Finally, simulate

$$\tau\mid\phi,\omega\sim giG\Big(\nu a-\nu,\,1,\,2\sum_{i<j}|\omega_{ij}|/\phi_{ij}\Big),$$

where $\nu={p\choose 2}.$

In the above, $iG$ denotes the inverse Gaussian distribution and $giG$ denotes the generalised inverse Gaussian distribution.
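The three updates for $\psi$, $\phi$ and $\tau$ can be sketched as follows, operating on the vector of off-diagonal elements. The sketch assumes the convention of Bhattacharya et al., (2015) that $giG(\lambda,\rho,\chi)$ has density proportional to $y^{\lambda-1}e^{-(\rho y+\chi/y)/2}$, and it obtains such draws by rescaling SciPy's two-parameter `geninvgauss`. It also assumes all off-diagonal elements are nonzero, which holds almost surely for draws from a continuous posterior. The helper names `gig` and `update_local_global` are hypothetical.

```python
import numpy as np
from scipy.stats import geninvgauss

def gig(lam, rho, chi, rng):
    """One draw from giG(lam, rho, chi), density ~ y^(lam-1) exp{-(rho*y + chi/y)/2}."""
    scale = np.sqrt(chi / rho)
    return float(scale * geninvgauss.rvs(lam, np.sqrt(rho * chi), random_state=rng))

def update_local_global(omega_offdiag, phi, tau, a, rng):
    w = np.abs(omega_offdiag)  # assumed strictly positive
    nu = w.size
    # psi: draw psi_tilde ~ iG(mean = phi * tau / |omega|, shape = 1), then invert.
    psi = 1.0 / rng.wald(phi * tau / w, 1.0)
    # phi: draw T_ij ~ giG(a - 1, 1, 2|omega_ij|) and normalize to the simplex.
    T = np.array([gig(a - 1.0, 1.0, 2.0 * wi, rng) for wi in w])
    phi_new = T / T.sum()
    # tau: giG(nu * a - nu, 1, 2 * sum |omega_ij| / phi_ij).
    tau_new = gig(nu * a - nu, 1.0, 2.0 * np.sum(w / phi_new), rng)
    return psi, phi_new, tau_new

rng = np.random.default_rng(2)
omega_offdiag = np.array([0.5, -0.01, 0.04, 1.2, -0.3, 0.02])
phi0 = np.full(omega_offdiag.size, 1.0 / omega_offdiag.size)
psi, phi1, tau1 = update_local_global(omega_offdiag, phi0, 1.0, 1.0 / 6.0, rng)
print(np.round(phi1, 4), round(tau1, 4))
```

The inverse Gaussian draw uses NumPy's `Generator.wald`, whose two arguments are the mean and the shape parameter.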

4 Posterior convergence rate

In this section, we establish some theoretical guarantees of our proposed method. In particular, we show that, under certain sparsity assumptions, the posterior distribution of $\Omega$ concentrates around the true precision matrix, and we derive the posterior convergence rate.

4.1 Estimating prior concentration

Following Bhattacharya et al., (2015), we have,

$$P(|\omega_{ij}|<\delta)\geq 1-C\,\frac{\log(1/\delta)}{\Gamma(a)},$$

for some constant $C>0.$ Let us consider the set

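$$B(p_{\Omega_{0}},\epsilon_{n})=\left\{p_{\Omega}:K(p_{\Omega_{0}},p_{\Omega})\leq\epsilon_{n}^{2},\;V(p_{\Omega_{0}},p_{\Omega})\leq\epsilon_{n}^{2}\right\}.$$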

Following Banerjee and Ghosal, (2015), under the assumption that the eigenvalues of the precision matrices are bounded away from zero and infinity, we have,

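$$B(p_{\Omega_{0}},\epsilon_{n})\supset\left\{p_{\Omega}:\|\Omega-\Omega_{0}\|_{\infty}\leq c\,\epsilon_{n}/p\right\}.$$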

Now, we have, for the choice of $a={p\choose 2}^{-1},$

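$$\Pi(\|\Omega-\Omega_{0}\|_{\infty}\leq c\,\epsilon_{n}/p)\gtrsim(c\,\epsilon_{n}/p)^{p+s_{p}}\left(1-\frac{C_{1}\log(p/c\,\epsilon_{n})}{p^{2}}\right)^{p_{0}}.$$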

Matching with the prior concentration rate gives,

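$$(p+s_{p})(\log p+\log\epsilon_{n}^{-1})+p_{0}\log\left(1-\frac{C_{1}(\log p+\log\epsilon_{n}^{-1})}{p^{2}}\right)\asymp n\epsilon_{n}^{2}.$$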

Here we need to check the rate $\epsilon_{n}$ satisfying this relation, which comes out to be $n^{-1/2}(p+s_{p})^{1/2}(\log n)^{1/2}$.
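To see how this rate behaves, the short sketch below evaluates $\epsilon_{n}=n^{-1/2}(p+s_{p})^{1/2}(\log n)^{1/2}$ for a few sample sizes. The function name `posterior_rate` and the numerical values are illustrative only.

```python
import math

def posterior_rate(n, p, s_p):
    """Evaluate n^{-1/2} (p + s_p)^{1/2} (log n)^{1/2}."""
    return math.sqrt((p + s_p) * math.log(n) / n)

for n in (1_000, 10_000, 100_000):
    print(n, round(posterior_rate(n, p=50, s_p=100), 4))
```

As the formula shows, the rate tends to zero only when $(p+s_{p})\log n/n\to 0$.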

4.2 Choosing the sieve

The Dirichlet-Laplace prior is a shrinkage prior and does not set the value of any off-diagonal element of the precision matrix to be exactly zero. In this situation, we consider the sieve $\mathcal{P}_{n}$ to be the space of all densities $p_{\Omega}$ such that $|\mathrm{supp}_{\delta}(\Omega)|$ , where

$$\mathrm{supp}_{\delta}(\Omega)=\{(i,j):|\omega_{ij}|>\delta,\;i<j=1,\ldots,p\}$$

satisfies

$$|\mathrm{supp}_{\delta}(\Omega)|<r<\frac{1}{2}{p\choose 2},$$

for suitably chosen threshold $\delta,$ and each entry of $\Omega$ is at most $L$ in absolute value.
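The sieve condition can be checked mechanically for a given matrix. The sketch below computes the thresholded support $\mathrm{supp}_{\delta}(\Omega)$ and tests the two requirements, reading the upper bound on $r$ as half of ${p\choose 2}$, which is an assumption about the original display. The function names and the example matrix are hypothetical.

```python
import numpy as np
from math import comb

def thresholded_support(omega, delta):
    """Pairs (i, j), i < j, whose entry exceeds delta in absolute value."""
    p = omega.shape[0]
    return [(i, j) for i in range(p) for j in range(i + 1, p)
            if abs(omega[i, j]) > delta]

def in_sieve(omega, delta, r, L):
    """Check |supp_delta(omega)| < r and max |omega_ij| <= L."""
    p = omega.shape[0]
    if not r < 0.5 * comb(p, 2):
        raise ValueError("r must be below half the number of possible edges")
    small_support = len(thresholded_support(omega, delta)) < r
    bounded = bool(np.max(np.abs(omega)) <= L)
    return small_support and bounded

omega = 2.0 * np.eye(5)
for (i, j), value in {(0, 1): 0.5, (1, 3): 0.3, (0, 2): 1e-4}.items():
    omega[i, j] = omega[j, i] = value

print(thresholded_support(omega, delta=1e-3))
print(in_sieve(omega, delta=1e-3, r=3, L=5.0))
```

The entry of size $10^{-4}$ is nonzero but falls below the threshold, so it does not count towards the support. This is how a continuous shrinkage prior, which never produces exact zeros, can still be assigned an effective sparsity level.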

Now, from Theorem 3.2 in Bhattacharya et al., (2015), we have, for $s_{p}\gtrsim\log(p)$ and choice of $a=1/p^{2}$ , and for $\delta_{p}=s_{p}/p^{2}$ ,

$$\lim_{n\to\infty}E_{\Omega_{0}}P(|\mathrm{supp}_{\delta_{p}}(\Omega)|>Ms_{p}\mid X)=0,$$

for some constant $M>0.$ This result controls one part of the probability of the complement of the chosen sieve, namely the size of the support defined above. For the other part, the maximum absolute value of the elements, we can show that,

$$\Pi(|\omega_{ij}|>L)\leq\frac{M^{\prime}}{\Gamma(a)}\left\{C^{\prime}-\log(1-e^{-2L})\right\},$$

where $M^{\prime}$ and $C^{\prime}$ are constants independent of $L$ . It follows that the rate obtained using the prior concentration matches the one obtained using the above metric entropy calculations.

The metric entropy of the sieve can be verified along the same lines as in Banerjee and Ghosal, (2015), which gives the posterior convergence rate $\epsilon_{n}=n^{-1/2}(p+s_{p})^{1/2}(\log n)^{1/2}$.

Bibliography

Banerjee, S. and Ghosal, S. (2015). Bayesian structure learning in graphical models. Journal of Multivariate Analysis, 136:147–162.
Bhattacharya, A., Pati, D., Pillai, N. S., and Dunson, D. B. (2015). Dirichlet–Laplace priors for optimal shrinkage. Journal of the American Statistical Association, 110(512):1479–1490.