An Expectation Maximization Algorithm for High-Dimensional Model Selection for the Ising Model with Misclassified States

David G. Sinclair; Giles Hooker

arXiv:1704.05995·stat.ME·April 21, 2017


TL;DR

This paper introduces an EM algorithm for high-dimensional Ising model selection that accounts for misclassified binary states, improving accuracy in dependent binary data analysis.

Contribution

It extends existing model selection methods to handle misclassification, providing a new EM-based approach for more accurate graphical model identification.

Findings

1. The EM algorithm improves model selection accuracy with simulated data.

2. Application to fMRI data demonstrates practical effectiveness.

3. Theoretical guarantees for edge identification under misclassification.

Abstract

We propose the misclassified Ising Model, a framework for analyzing dependent binary data where the binary state is susceptible to error. We extend the theoretical results of the model selection method presented in Ravikumar et al. (2010) to show that the method will still correctly identify edges in the underlying graphical model under suitable misclassification settings. With knowledge of the misclassification process, an expectation maximization algorithm is developed that accounts for misclassification during model selection. We illustrate the increase in performance of the proposed expectation maximization algorithm with simulated data, and with data from a functional magnetic resonance imaging analysis.

Equations

$$P_{\theta^{*}}(x)=\frac{1}{Z(\theta^{*})}\exp\left\{\sum_{(s,t)\in E}\theta^{*}_{st}x_{s}x_{t}\right\}$$

$$\hat{\theta}_{\setminus r}=\arg\min_{\theta_{\setminus r}\in\mathbb{R}^{p-1}}\left\{-\frac{1}{n}\sum_{i=1}^{n}\log P_{\theta_{\setminus r}}(x_{r}^{(i)}\mid x_{\setminus r}^{(i)})+\lambda_{n,d,p}\|\theta_{\setminus r}\|_{1}\right\}$$

$$\hat{E}_{\ell_{1}}=\{(s,t);\ (\hat{\theta}_{\setminus s})_{t}\neq 0\text{ and }(\hat{\theta}_{\setminus t})_{s}\neq 0\}$$

$$S_{max}$$

$$\tilde{Q}_{r}^{*}$$

$$S_{max}\leq\frac{C_{min}^{2}\alpha^{2}}{400D_{max}d(2-\alpha)^{2}}$$

$$\lambda_{n}\geq\frac{16(2-\alpha)}{\alpha}\left(\frac{\log p}{n}+\frac{S_{max}}{4}\right)$$

$$n>Ld^{3}\log p$$

$$\begin{aligned}P(x_{r}=1,x_{t}=1\mid\tilde{x}_{s}=1)&=P(x_{s}=\tilde{x}_{s})P(x_{r}=1,x_{t}=1\mid x_{s}=1)+P(x_{s}\neq\tilde{x}_{s})P(x_{r}=1,x_{t}=1\mid x_{s}=-1)\\&=(1-\gamma_{s})P(x_{r}=1\mid x_{s}=1)P(x_{t}=1\mid x_{s}=1)+\gamma_{s}P(x_{r}=1\mid x_{s}=-1)P(x_{t}=1\mid x_{s}=-1)\\&\neq P(x_{r}=1\mid\tilde{x}_{s}=1)P(x_{t}=1\mid\tilde{x}_{s}=1)\end{aligned}$$

$$\theta_{\mathcal{U}\setminus r}$$

$$\theta_{V\setminus\mathcal{U}\setminus r}^{(k)}$$

$$\tilde{\theta}_{\setminus r}$$

$$L_{\lambda}(\theta_{\mathcal{U}\setminus r}\mid\theta_{V\setminus\mathcal{U}\setminus r}^{(k)},\tilde{X})=\frac{1}{n}\sum_{i=1}^{n}\log P_{\tilde{\theta}_{\setminus r}}(\tilde{x}_{r}^{(i)}\mid\tilde{x}_{\mathcal{U}\setminus r}^{(i)})-\lambda\|\tilde{\theta}_{\setminus r}\|_{1}$$

$$\hat{Q}_{r}(\theta_{\mathcal{U}\setminus r}\mid\theta^{(k)},\hat{\theta}_{\setminus r},X)=\frac{1}{n}\sum_{i=1}^{n}\sum_{z_{c}\in\Omega_{\mathcal{C}}}\left[P_{\theta^{(k)}}(X_{\mathcal{C}}=z_{c}\mid\tilde{X}_{\mathcal{U}}=\tilde{x}_{\mathcal{U}}^{(i)})\log P_{\tilde{\theta}_{\setminus r}}(\tilde{x}_{r}^{(i)},z_{c}\mid\tilde{x}_{\mathcal{U}\setminus r}^{(i)})\right]-\lambda\|\tilde{\theta}_{\setminus r}\|_{1}$$

$$\tilde{Q}_{r}(\theta_{\mathcal{U}\setminus r}\mid\theta^{(k)},\hat{\theta}_{\setminus r},X)$$

$$-\lambda\|\tilde{\theta}_{\setminus r}\|_{1}$$

$$\theta_{\setminus r}^{(k+1)}=\left(\arg\min_{\theta_{\mathcal{U}\setminus r}\in\mathbb{R}^{|\mathcal{U}|-1}}\hat{Q}_{r}(\theta_{\mathcal{U}\setminus r}\mid\theta^{(k)},\hat{\theta}_{\setminus r},\tilde{X})\right)\cup\hat{\theta}_{V\setminus\mathcal{U}\cup r}$$

$$\hat{E}_{EM}^{(k+1)}=\{(s,t);\ (\theta_{\setminus s}^{(k+1)})_{t}\neq 0\text{ and }(\theta_{\setminus t}^{(k+1)})_{s}\neq 0\}$$

$$P(\tilde{X}_{\mathcal{C}}=z_{c})=\prod_{s\in\mathcal{C}}\gamma_{s}$$

$$\Lambda_{min}((\tilde{Q}_{r}^{*})_{SS})$$

$$\Lambda_{max}(E_{\gamma,\theta^{*}}[X_{\setminus r}X_{\setminus r}^{T}])$$

$$\|\tilde{Q}_{S^{c}S}^{*}(\tilde{Q}_{SS}^{*})^{-1}\|_{\infty}\leq 1-\alpha$$

$$\tilde{Q}^{n}=-\hat{E}(\nabla W^{n}(\theta^{*}))$$

$$P\left(\|W^{n}\|_{\infty}\geq\frac{\lambda_{n}}{4}\right)=O(\exp(-K\tilde{\lambda}_{n}^{2}n))$$

$$\lambda_{n}d=\frac{16(2-\alpha)}{\alpha}\left(\frac{\log p}{n}+\frac{S_{max}}{4}\right)d<\frac{32C_{min}^{2}\alpha}{400D_{max}(2-\alpha)}<\frac{C_{min}^{2}}{10D_{max}}$$

$$P(|W_{u}^{n}-E(W_{u}^{n})|>\delta)\leq 2\exp\left(-\frac{n\delta^{2}}{8}\right)$$

$$P(|W_{u}^{n}|>\delta+|E(W_{u}^{n})|)\leq P(|W_{u}^{n}-E(W_{u}^{n})|>\delta)\leq 2\exp\left(-\frac{n\delta^{2}}{8}\right)$$

$$P\left(|W_{u}^{n}|>\frac{\lambda_{n}}{4}\right)\leq P\left(|W_{u}^{n}|>\frac{\alpha\lambda_{n}}{4(2-\alpha)}\right)$$

$$2\exp\left(-\frac{n\delta^{2}}{8}\right)=2\exp\left(-\frac{n}{8}\left[\frac{\alpha\lambda_{n}}{4(2-\alpha)}-|E(W_{u}^{n})|\right]^{2}\right)\leq 2\exp\left(-\frac{n}{8}\left[\frac{\alpha\lambda_{n}}{4(2-\alpha)}-S_{max}\right]^{2}\right)=2\exp\left(-\frac{n}{8}\left[\frac{\alpha\tilde{\lambda}_{n}}{4(2-\alpha)}\right]^{2}\right)$$

$$L_{\lambda}(\theta_{\mathcal{U}\setminus r}^{*}\mid\hat{\theta}_{V\setminus\mathcal{U}\setminus r},\tilde{X})\geq L_{\lambda}(\hat{\theta}_{\mathcal{U}\setminus r}\mid\hat{\theta}_{V\setminus\mathcal{U}\setminus r},\tilde{X})$$


Full text


David G. Sinclair and Giles Hooker

David Sinclair is a PhD Candidate, Department of Statistical Science, Cornell University, 301 Malott Hall, Ithaca, NY 14853 (Email: [email protected]). Giles Hooker is Associate Professor of Biological Statistics and Computational Biology, Cornell University, 1186 Comstock Hall, Ithaca, NY 14853 (Email: [email protected]). The authors gratefully acknowledge support from grants NSF DMS-1053252 and NSF DEB-1353039.


Keywords: graphical models; LASSO; variational methods; latent variables; fMRI

1 Introduction

This paper proposes an extension of estimation methods for graphical models to cases where node values are observed with error. In particular, motivated by data from functional magnetic resonance imaging (fMRI), we examine the consequences of misclassification noise in an Ising network model on estimation methods proposed in Ravikumar et al. (2010) and show that the estimated edge set can be improved by accounting for misclassification rates.

Graphical models have proven to be a useful tool in modeling a wide range of data, arising in fields such as neuroscience, genetics, social networks, image restoration, traffic models, and disease case modeling, among many. The graph structure provides a useful mathematical framework for representing complex dependencies among a large collection of objects.

In this paper we focus on undirected graphical models, which are specified by a graph $\mathcal{G}=(V,E)$ with node set $V=\{1,2,\dots,p\}$ and edge set $E\subset V\times V$. A random vector with this graph structure is assumed to follow the Markov property (Kindermann et al., 1980): the $i^{th}$ and $j^{th}$ elements of the vector are dependent conditional on the remaining nodes if and only if $(i,j)\in E$. Thus, we are concerned with uncovering the structure of the edge set $E$, and therefore the conditional dependencies within our dataset.

Further, we assume that our data are binary and that the dependencies are entirely captured by pairwise relationships. These assumptions correspond precisely to the Ising Model (Ising, 1925), detailed in Section 2. The Ising Model has proven useful in data analysis settings such as functional magnetic resonance imaging (fMRI) (Sinclair et al., 2017), image restoration (Kandes, 2008; Geman and Geman, 1984), spatial statistics (Banerjee et al., 2014), social network analysis (Montanari and Saberi, 2010), and genetics (Majewski et al., 2001).

Structure learning of the edge set in the Ising model is a well-studied problem in the statistics literature. Considerable attention has been given to finding information-theoretic bounds for learning Ising graph structures (Scarlett and Cevher, 2016; Tandon et al., 2014; Santhanam and Wainwright, 2012). Table 1 in Scarlett and Cevher (2016) gives a useful summary of the graphical assumptions for which these information-theoretic bounds are known.

Due to the computational intractability of the partition function $Z(\theta^{*})$ for the Ising distribution function given in Equation (1) (see Welsh, 1993), various approaches have been developed in order to perform sound statistical methodology under this practical constraint.

Barber et al. (2015) show an extended BIC method for uncovering the underlying graph in the Ising data setting with theoretical bounds. Bresler (2015) develops a greedy algorithm, which uses a structural property of mutual information associated with Ising models to prove asymptotic exact learning of the underlying graph. Ravikumar et al. (2010) show theoretical bounds for a neighborhood-based regularized logistic regression approach to model selection, analogous to the Meinshausen-Bühlmann approach for Gaussian graphical models (Meinshausen and Bühlmann, 2006).

One potential issue with categorical data is the possibility of misclassification. This arises in fMRI data, where the traditional General Linear Model approach attempts to find areas of the cortex that have been significantly activated, which corresponds to thresholding the association between the BOLD response and the HRF (Lindquist et al., 2008). When the cortex is reduced to specialized regions via a parcellation (Sinclair et al., 2017; Gordon et al., 2016), we can think of this procedure as assigning a latent label to each parcel, and may suspect misclassification when the association between the BOLD response and the HRF is close to the threshold. If there is a non-zero probability of misclassification, it can be shown that the data no longer follow an Ising distribution, and thus it is not clear whether current structure learning methods can still perform adequately.

In this paper we extend the theory behind the approach of Ravikumar et al. (2010) to handle misclassification and, conditional on this result, we develop a methodology for further boosting structure learning performance via an expectation maximization (EM) technique (Dempster et al., 1977) that can be used if there is knowledge of the misclassification process. Due to the inherent dependency in our data set, it is difficult to show that the EM method will always increase the marginal log likelihood. However, we show that if the learned structural dependency can predict a candidate state with high probability, the EM method can provide gains in efficiency.

In Section 2 of this paper the misclassified Ising model is defined, and theoretical guarantees are stated. In Section 3 the algorithm for incorporating misclassification information into an updated edge set estimate is described. Section 4 uses simulations to better understand the performance of this methodology. Section 5 shows how this methodology can be applied in an fMRI setting, with simulations indicating that the method should still increase structure learning accuracy.

2 Misclassified Ising Model and Theoretical Guarantees

In this section we develop the Misclassified Ising Model, and discuss theoretical guarantees for estimating the underlying edge set with this added noise assumption.

2.1 Ising Model

We focus on the special case of the Ising Model as described in Ravikumar et al. (2010), which we refer to as the $Ising(G,\theta^{*})$ distribution. Let $\mathfrak{X}=(x^{(1)},\dots,x^{(n)})$ be $n$ i.i.d. observations of $X=(x_{1},\dots,x_{p})\sim Ising(G,\theta^{*})$ in which $x_{s}\in\{-1,1\}$, and $\theta^{*}_{st}\in\mathbb{R}$ for each $s\in V$, with probability mass function

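$$P_{\theta^{*}}(x)=\frac{1}{Z(\theta^{*})}\exp\left\{\sum_{(s,t)\in E}\theta^{*}_{st}x_{s}x_{t}\right\}\tag{1}$$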

Here the partition function $Z(\theta^{*})$ ensures the distribution sums to one. Recall that $\theta^{*}_{st}\neq 0\iff(s,t)\in E$, and therefore our goal is to determine the support of $\theta^{*}$.
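To make the role of the partition function concrete, the following is a minimal sketch that evaluates the Ising probability mass function by brute-force enumeration. It is only feasible for very small $p$, since the sum defining $Z(\theta^{*})$ runs over $2^{p}$ states. The function name `ising_pmf`, the dictionary encoding of edge weights, and the 3-node chain are illustrative assumptions, not part of the paper.

```python
import itertools
import math

def ising_pmf(theta, p):
    """Return {state: probability} for an Ising model on p nodes.

    theta maps an edge (s, t) with s < t to its weight theta_st;
    pairs that are absent from the dict are treated as non-edges.
    """
    states = list(itertools.product([-1, 1], repeat=p))
    # Unnormalized weight exp(sum over edges of theta_st * x_s * x_t).
    weights = [
        math.exp(sum(w * x[s] * x[t] for (s, t), w in theta.items()))
        for x in states
    ]
    z = sum(weights)  # partition function Z(theta): a sum over 2**p states
    return {x: w / z for x, w in zip(states, weights)}

# Hypothetical 3-node chain 0 - 1 - 2 with both edge weights set to 0.5.
pmf = ising_pmf({(0, 1): 0.5, (1, 2): 0.5}, p=3)
```

With positive edge weights, states in which neighboring nodes agree receive more mass than states in which they disagree.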

Due to the computational intractability of the partition function (Welsh, 1993), a neighborhood-based likelihood method is adopted in Ravikumar et al. (2010), a technique akin to the Meinshausen and Bühlmann (2006) method for Gaussian graphical models (Lauritzen, 1996), in which model selection is undertaken to find the neighborhood of each node separately. The estimated edge set is then consolidated from the neighborhood sets.

2.2 $\ell_{1}$-regularized Neighborhood-based Model Selection

The Ising Model has the useful property that the conditional distribution of a node takes the form of a logistic regression with the canonical link function on all remaining nodes. Therefore, if we let $\theta^{*}_{\setminus r}=\{\theta^{*}_{ru};u\in V\setminus\{r\}\}$ be the edge weights associated with the node $r$, a model selection can be done via an $\ell_{1}$-regularized logistic regression on each node $r$ (Friedman et al., 2010):

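$$\hat{\theta}_{\setminus r}=\arg\min_{\theta_{\setminus r}\in\mathbb{R}^{p-1}}\left\{-\frac{1}{n}\sum_{i=1}^{n}\log P_{\theta_{\setminus r}}(x_{r}^{(i)}\mid x_{\setminus r}^{(i)})+\lambda_{n,d,p}\|\theta_{\setminus r}\|_{1}\right\}\tag{2}$$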

In this equation, $d$ is the maximal neighborhood size, and $P_{\theta}$ is the logistic regression function with a canonical link with response $\mathbf{1}(x_{r}^{(i)}=1)$, regression parameters $2\theta_{\setminus r}$, and predictors $x_{\setminus r}=\{x_{t}|t\in V\setminus\{r\}\}$. Doing this regularized regression over each node can give us an estimate for the edge set $E$ as follows:

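$$\hat{E}_{\ell_{1}}=\{(s,t);\ (\hat{\theta}_{\setminus s})_{t}\neq 0\text{ and }(\hat{\theta}_{\setminus t})_{s}\neq 0\}$$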

In this formulation of the estimated edge set, an edge is selected between two nodes only if each node appears in the estimated neighborhood of the other.

This method is shown in Ravikumar et al. (2010) to give a consistent estimate $\hat{E}_{\ell_{1}}$ in the sense that $P(\hat{E}_{\ell_{1}}=E)\rightarrow 1$ as $n\rightarrow\infty$, when $n=\Omega(d^{3}\log p)$ for appropriately chosen $\lambda_{n,d,p}$. We refer to the method for obtaining this edge set as RWL in recognition of its authors.
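The neighborhood-based procedure can be sketched in a few lines. The version below is an illustration under stated assumptions rather than the implementation used in the paper: it fits each node-conditional model $P(x_{r}\mid x_{\setminus r})=1/(1+\exp(-2x_{r}\sum_{t}\theta_{rt}x_{t}))$ by plain proximal gradient descent (a substitute for a dedicated solver such as glmnet), and then applies the rule that an edge is kept only when both endpoints select each other. The function names, the step size, the iteration count, and the toy data are all hypothetical.

```python
import math

def soft_threshold(v, t):
    """Proximal map of t * |v|: shrink v toward zero by t."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def fit_neighborhood(data, r, lam, iters=500):
    """l1-penalized node-conditional fit for node r by proximal gradient.

    Uses P(x_r | x_rest) = 1 / (1 + exp(-2 * x_r * sum_t theta_rt * x_t)).
    """
    n, p = len(data), len(data[0])
    others = [t for t in range(p) if t != r]
    theta = {t: 0.0 for t in others}
    step = 1.0 / (p - 1)  # safe step: the loss gradient is (p-1)-Lipschitz
    for _ in range(iters):
        grad = {t: 0.0 for t in others}
        for x in data:
            eta = sum(theta[t] * x[t] for t in others)
            resid = 1.0 / (1.0 + math.exp(2.0 * x[r] * eta))  # 1 - P(x_r | rest)
            for t in others:
                grad[t] -= 2.0 * x[r] * x[t] * resid / n
        for t in others:
            theta[t] = soft_threshold(theta[t] - step * grad[t], step * lam)
    return theta

def estimate_edges(data, lam):
    """AND rule: keep (s, t) only if each node selects the other."""
    p = len(data[0])
    fits = [fit_neighborhood(data, r, lam) for r in range(p)]
    return {(s, t) for s in range(p) for t in range(s + 1, p)
            if fits[s][t] != 0.0 and fits[t][s] != 0.0}

# Toy data: nodes 0 and 1 always agree, node 2 is balanced against both.
toy = [(1, 1, 1), (1, 1, -1), (-1, -1, 1), (-1, -1, -1)] * 5
edges = estimate_edges(toy, lam=0.1)
```

On this toy data set the only dependence is between nodes 0 and 1, so the estimated edge set contains that single pair.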

2.3 Misclassified Ising Model

Here we introduce a formalization of the Misclassified Ising Model, which will be defined hierarchically.

We continue to assume $X\sim Ising(G,\theta^{*})$, but define $\tilde{X}$ as the random vector such that $P(\tilde{X}\equiv Y|X)=\prod_{s\in V}P(\tilde{x}_{s}=y_{s}|x_{s})=\prod_{s\in V}(\gamma_{s}^{\mathbf{1}(y_{s}\neq x_{s})}(1-\gamma_{s})^{\mathbf{1}(y_{s}=x_{s})})$ for all $Y\in\{-1,1\}^{p}$. In this sense, each node is misclassified with some probability $\gamma_{s}$ and the misclassification is independent across nodes. As we only observe the misclassified nodes, $\tilde{X}$, we define their distribution unconditional on $X$ as the Misclassified Ising Model, $\tilde{X}\sim MIsing_{\gamma}(G,\theta^{*})$. The theoretical guarantees for RWL under this distribution shown in Section 2.4 do not directly assume independence of the misclassification probabilities; however, this assumption is used when completing the EM update algorithm in Section 3.

As with the Ising Model, let $\tilde{\mathfrak{X}}=(\tilde{x}^{(1)},\dots,\tilde{x}^{(n)})$ be $n$ i.i.d. observations of $\tilde{X}$.
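The hierarchical definition translates directly into a two-stage simulation: draw $X$ from the Ising model, then flip each coordinate independently. The sketch below covers only the second stage; the true state `x_true`, the rates in `gamma`, and the function name `misclassify` are hypothetical values chosen for illustration.

```python
import random

def misclassify(x, gamma, rng):
    """Flip each entry of x independently: node s flips with probability gamma[s]."""
    return tuple(-xs if rng.random() < g else xs for xs, g in zip(x, gamma))

rng = random.Random(0)          # fixed seed, purely for reproducibility
x_true = (1, -1, 1, 1)          # one hypothetical draw from the Ising model
gamma = (0.0, 0.0, 0.3, 1.0)    # hypothetical per-node misclassification rates
x_obs = misclassify(x_true, gamma, rng)
```

Nodes with $\gamma_{s}=0$ are always observed correctly and a node with $\gamma_{s}=1$ is always flipped; only the node with an intermediate rate is genuinely random.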

2.4 Theoretical Guarantees

In this section we show that when the extra noise due to misclassification is small, the estimated edge set $\hat{E}_{\ell_{1}}$ can still provide reasonable model selection. The amount that the added noise hinders our ability to detect edges is captured by the expectation of the score function for each node-conditional distribution for the (not misclassified) Ising Model, where the expectation is calculated over the true misclassified Ising Model. Indeed, as the misclassification probability goes to 0, the expectation of the score function goes to 0, which implies that there is no hindrance in obtaining the edges, as expected.

Formally, $W^{n}_{r}(\theta)=-\nabla\log P_{\theta_{\setminus r}}(\tilde{x}_{r}^{(i)}|\tilde{x}^{(i)}_{\setminus r})$ is the score function for $P_{\theta_{\setminus r}}$ defined in equation (2). We define the misclassified score and misclassified information as

[TABLE]

Note that both of these expectations are over the misclassified distribution. The misclassified score $S_{max}$ corresponds to the largest deviation of the expected score function over the misclassified distribution from 0.

The first two assumptions we make for our extension are very similar to those given in Ravikumar et al. (2010); however, they are made on the misclassified information matrix. These are stated explicitly in Appendix A, and are referred to as ($\tilde{A1}$) and ($\tilde{A2}$). The third assumption is stated here as:

($\tilde{A3}$) Misclassification Condition. For $C_{min}$ and $D_{max}$ as defined in ($\tilde{A1}$), and $\alpha$ as defined in ($\tilde{A2}$), we assume

$$S_{max}\leq\frac{C_{min}^{2}\alpha^{2}}{400D_{max}d(2-\alpha)^{2}}$$

If we make the same population assumptions as given in Ravikumar et al. (2010) on the underlying Ising Model (stated in Appendix A.1), then for $\alpha$ satisfying ($\tilde{A2}$) we have the following result that corresponds to Theorem 1 in Ravikumar et al. (2010).

Extended Theorem 1: Consider a Misclassified Ising graphical model, $MIsing_{\gamma}(G,\theta^{*})$, with parameter vector $\theta^{*}$ and associated edge set $E^{*}$ such that conditions $(\tilde{A}1)$ and $(\tilde{A}2)$ are satisfied by the misclassified information matrix $\tilde{Q}^{*}_{r}$ for all $r\in V$. Assume the misclassified score $S_{max}$ satisfies $(\tilde{A}3)$ and let $\tilde{\mathfrak{X}}$ be a set of $n$ i.i.d. samples from the misclassified Ising model. Suppose that the regularization parameter $\lambda_{n}$ is selected to satisfy

[TABLE]

Then there exist positive constants $L$ and $K$, independent of $(n,d,p)$, such that if

$$n>Ld^{3}\log p$$

then the following properties hold with probability at least $1-2\exp(-K\tilde{\lambda}^{2}_{n}n)$, where $\tilde{\lambda}_{n}=\lambda_{n}-\frac{4(2-\alpha)}{\alpha}S_{max}$.

(a) For each node $r\in V$, the $\ell_{1}$-regularized logistic regression has a unique solution and therefore uniquely specifies a neighborhood $\hat{N}(r)$.

(b) For each node $r\in V$, the estimated neighborhood $\hat{N}(r)$ correctly excludes all edges not in the true neighborhood. Moreover, it correctly includes all edges $(r,t)$ for which $|\theta^{*}_{rt}|\geq\frac{10}{C_{min}}\sqrt{d}\lambda_{n}$.

The proof of this result is located in Appendix A.

An interesting consequence of this result is that as $n\rightarrow\infty$ the tuning parameter does not go to 0 unless $S_{max}$ also goes to 0. By part (b), this means that some edges may never be correctly included with high probability, because the conditional independencies of the graphical model are overcome by the misclassification.
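The floor on the tuning parameter can be made explicit. As a sketch, write the lower bound on $\lambda_{n}$ as $\frac{16(2-\alpha)}{\alpha}\left(a_{n}+\frac{S_{max}}{4}\right)$, where $a_{n}\geq 0$ is shorthand introduced here for the sample-size-dependent term; this form is consistent with the definition of $\tilde{\lambda}_{n}$ above. Dropping $a_{n}$ gives

$$\lambda_{n}\geq\frac{16(2-\alpha)}{\alpha}\left(a_{n}+\frac{S_{max}}{4}\right)\geq\frac{4(2-\alpha)}{\alpha}S_{max},$$

and substituting into the inclusion threshold of part (b) gives

$$\frac{10}{C_{min}}\sqrt{d}\,\lambda_{n}\geq\frac{40(2-\alpha)\sqrt{d}}{\alpha C_{min}}S_{max}.$$

Under these assumptions, an edge whose weight $|\theta^{*}_{rt}|$ falls below the right-hand side is not covered by the inclusion guarantee for any sample size.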

3 EM Algorithm for Updating Edges of $\hat{E}_{\ell_{1}}$

We develop an EM algorithm for obtaining an updated edge set. In Section 2.3, all nodes could potentially have some misclassification probability; however, throughout the use of this update we assume that only a subset of nodes can be misclassified. The distinction does not affect the related proofs for the method, although for the method to be computationally tractable the number of potentially misclassified nodes must be relatively small.

Conditional on the initial RWL fit, resulting in edge set $\hat{E}_{\ell_{1}}$ and parameter $\hat{\theta}_{\setminus r}$ , we develop an EM-type algorithm for updating the neighborhood for certain nodes in our graphical model. The method is run on each node individually similar to RWL. In the usual EM approach the average joint log likelihood of the observed and latent variables is maximized in order to increase the likelihood marginally on the observed data. Due to the complexity of the distribution in the joint case, it is difficult to maximize the log likelihood over all possible latent states.

We instead show in Appendix B that maximizing the conditional distributions will still serve to increase the marginal likelihood given that the probability that a node is in the incorrect state is close to 1. By leveraging dependency information from the initial RWL fit, we show in simulations that this condition is satisfied and we are able to increase the marginal likelihood.

In doing our EM update we focus on neighborhoods surrounding nodes that have potentially been misclassified. To do this, we assume we have some knowledge of the probability of misclassification for each node. This probability can be an average misclassification rate over all observations for a given node, although the method performs better when misclassification probabilities are known for each observation. Misclassification probabilities can be estimated within each observation across nodes if, for example, a separate EM algorithm is used to determine the state of each node; the latent variable state probabilities then correspond to the probability of misclassification. In Sinclair et al. (2017), misclassification probabilities can be derived from the implicit mixture model for continuous signaling in fMRI.

With an appropriate update set of nodes, $\mathcal{U}$, we can then update the edge set to obtain $\hat{E}^{EM}_{\ell_{1}}$. In the following subsections we go over obtaining the update set $\mathcal{U}$ and completing the $E$ and $M$ steps.

3.1 Obtaining Update Set: $\mathcal{U}$

The update set will be a union of candidate nodes, $\mathcal{C}$, and participant nodes, $\mathcal{P}$. Candidate nodes are nodes that have potentially been misclassified, and participant nodes are nodes whose estimated neighborhood sets have potentially been affected by misclassification.

If $\hat{\gamma}_{s}$ is an estimate of the misclassification probability for each node, then for a given threshold $q$, a reasonable way to define the candidate set is $\mathcal{C}=\{s\in V:\hat{\gamma}_{s}>q\}$, although our method is not bound to any particular procedure for determining the candidate set.

To obtain the participant nodes, first consider the following example. Assume $(r,s)\in E$ and $(s,t)\in E$ but $(r,t)\not\in E$. If there were no misclassification in our data then $x_{r}|x_{s}\perp\!\!\!\!\perp x_{t}|x_{s}$, but if $x_{s}$ is a candidate node with some non-zero probability of misclassification, then we have

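$$\begin{aligned}P(x_{r}=1,x_{t}=1\mid\tilde{x}_{s}=1)&=P(x_{s}=\tilde{x}_{s})P(x_{r}=1,x_{t}=1\mid x_{s}=1)+P(x_{s}\neq\tilde{x}_{s})P(x_{r}=1,x_{t}=1\mid x_{s}=-1)\\&=(1-\gamma_{s})P(x_{r}=1\mid x_{s}=1)P(x_{t}=1\mid x_{s}=1)+\gamma_{s}P(x_{r}=1\mid x_{s}=-1)P(x_{t}=1\mid x_{s}=-1)\\&\neq P(x_{r}=1\mid\tilde{x}_{s}=1)P(x_{t}=1\mid\tilde{x}_{s}=1)\end{aligned}$$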

Thus the nodes are no longer independent as long as $\theta^{*}_{rs}\neq\theta^{*}_{st}$, and in the fitted network the edge $(r,t)$ may appear. On the other hand, if $x_{r}$ were a candidate node, then $P(x_{t}=1|x_{s}=1,x_{r}=1)=P(x_{t}=1|x_{s}=1)$. That is to say, if a node’s shortest path to a candidate node in the true network has length greater than or equal to 2, then that node’s neighbors will still be chosen independently of the misclassification. This is not only a useful heuristic for choosing an update set, but will also be a useful property when calculating weights for the EM fit.

Taking this into account, we set the update set to be $\mathcal{U}=N(N(\mathcal{C}))$, the neighbors of neighbors of the candidate nodes. From here we have the participant nodes as all nodes in $\mathcal{U}$ that are not in $\mathcal{C}$, i.e. $\mathcal{P}=\mathcal{U}\setminus\mathcal{C}$.
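A small sketch of how the update and participant sets could be computed from a fitted edge set. It assumes that $N(\cdot)$ denotes a closed neighborhood, that is, a set of nodes together with everything adjacent to it, so that $N(N(\mathcal{C}))$ contains every node within two hops of a candidate. The text does not spell this convention out, so treat it as an assumption of the example; the function names and the chain graph are likewise hypothetical.

```python
def closed_neighborhood(nodes, edges):
    """Nodes in `nodes` plus every node sharing an edge with one of them."""
    out = set(nodes)
    for s, t in edges:
        if s in nodes:
            out.add(t)
        if t in nodes:
            out.add(s)
    return out

def update_sets(candidates, edges):
    """Return (U, P): U = N(N(C)) and P = U minus C."""
    candidates = set(candidates)
    update = closed_neighborhood(closed_neighborhood(candidates, edges), edges)
    return update, update - candidates

# Hypothetical fitted chain 0 - 1 - 2 - 3 - 4 with node 0 as the only candidate.
fitted_edges = {(0, 1), (1, 2), (2, 3), (3, 4)}
U, P = update_sets({0}, fitted_edges)
```

For this chain the update set reaches two hops out from the candidate, and nodes 3 and 4 are left untouched.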

Lastly, let $s$ be the number of disjoint subgraphs induced by $\mathcal{U}$ and let $c_{max}$ be the largest number of candidate nodes in a single subgraph. The computational complexity of the method is $O(sn2^{c_{max}})$, which can be computationally tractable even with up to 20 candidate nodes in a single subgraph. For the rest of the document we assume $s=1$; for $s>1$ the $E$ and $M$ steps still hold, with a loop run over each disjoint subgraph.
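To give a sense of scale for this bound, here is a short worked count, taking $s=1$ and using $n=60$ purely as an illustrative sample size (both are assumptions of this example):

$$sn2^{c_{max}}=1\cdot 60\cdot 2^{3}=480\quad(c_{max}=3),\qquad sn2^{c_{max}}=1\cdot 60\cdot 2^{20}=62{,}914{,}560\quad(c_{max}=20).$$

Each additional candidate in a subgraph doubles the count, which is why the size of the largest candidate group, rather than the total number of candidates, drives the cost.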

3.2 E Step

For the $k^{th}$ step in the EM update, for node $r\in\mathcal{U}$, we take the expectation over the latent variables $x_{r}$. Define the following three sets of parameters

[TABLE]

$\theta_{\mathcal{U}\setminus r}$ corresponds to the neighborhood parameters for node $r$ that will be updated. For $s\not\in\mathcal{U}$, the corresponding edge parameter $\theta^{(k)}_{sr}$ will not be updated, and thus when running this update, the value $2\theta^{(k)}_{sr}x_{r}x_{s}$ is included as an offset in the logistic regression to account for their neighborhood effect.

We are interested in the penalized log likelihood

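$$L_{\lambda}(\theta_{\mathcal{U}\setminus r}\mid\theta_{V\setminus\mathcal{U}\setminus r}^{(k)},\tilde{X})=\frac{1}{n}\sum_{i=1}^{n}\log P_{\tilde{\theta}_{\setminus r}}(\tilde{x}_{r}^{(i)}\mid\tilde{x}_{\mathcal{U}\setminus r}^{(i)})-\lambda\|\tilde{\theta}_{\setminus r}\|_{1}$$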

By including the offset terms in the regularization term, we ensure that the log likelihood will increase for a fixed parameter $\lambda$ (Appendix B). Let $\Omega_{\mathcal{C}}=\{-1,+1\}^{|\mathcal{C}|}$, and for $z_{c}\in\Omega_{\mathcal{C}}$, let $\tilde{x}^{(i)}(z_{c})$ be the original observation with candidate nodes replaced by $z_{c}$. An estimate of the expectation of this log likelihood is

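$$\hat{Q}_{r}(\theta_{\mathcal{U}\setminus r}\mid\theta^{(k)},\hat{\theta}_{\setminus r},X)=\frac{1}{n}\sum_{i=1}^{n}\sum_{z_{c}\in\Omega_{\mathcal{C}}}\left[P_{\theta^{(k)}}(X_{\mathcal{C}}=z_{c}\mid\tilde{X}_{\mathcal{U}}=\tilde{x}_{\mathcal{U}}^{(i)})\log P_{\tilde{\theta}_{\setminus r}}(\tilde{x}_{r}^{(i)},z_{c}\mid\tilde{x}_{\mathcal{U}\setminus r}^{(i)})\right]-\lambda\|\tilde{\theta}_{\setminus r}\|_{1}$$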

However, the joint probability $P_{\tilde{\theta}_{\setminus r}}(\tilde{x}_{r}^{(i)}|\tilde{x}^{(i)}_{\mathcal{U}\setminus r},z_{c})$ is computationally intractable to maximize over unless $|\mathcal{C}|$ is very small. We instead look only at conditional distributions, and consider the following estimate of the expectation

[TABLE]

In Appendix B it is shown that for any set of observations $\tilde{X}$ and for any initial fit $\hat{\theta}$, there exists an open set of misclassification probabilities such that maximizing $\tilde{Q}_{r}$ will still result in an increase in the penalized likelihood $L_{\lambda}(\theta_{\mathcal{U}\setminus r}|\hat{\theta}_{V\setminus\mathcal{U}\setminus r})$.

The function $\tilde{Q}_{r}$ corresponds to an $\ell_{1}$-regularized weighted logistic regression. Each $P_{\theta^{(k)}}(\tilde{X}_{\mathcal{C}}=z_{c}|\tilde{X}_{\mathcal{P}}=\tilde{x}^{(i)}_{\mathcal{U}})$ can be calculated using factorizations of the Ising distribution, in which the partition function cancels due to conditioning. A derivation of these probabilities is located in Appendix C.
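The cancellation of the partition function can be illustrated with a small sketch that computes, for one observation, posterior weights over the true states of the candidate nodes. This is not the derivation of Appendix C, which is not reproduced here. It is a direct Bayes computation under the stated assumptions that non-candidate nodes are observed without error, that misclassification is independent across candidates, and that the current edge weights are given. The function name and the inputs are hypothetical.

```python
import itertools
import math

def candidate_weights(x_obs, candidates, theta, gamma):
    """Posterior weights over true candidate states z_c for one observation.

    theta maps (s, t) with s < t to an edge weight; gamma maps each candidate
    to its misclassification probability. Non-candidates are taken as observed
    without error (an assumption of this sketch).
    """
    cand = sorted(candidates)
    raw = {}
    for z in itertools.product([-1, 1], repeat=len(cand)):
        x = list(x_obs)
        for s, zs in zip(cand, z):
            x[s] = zs
        # Only edges touching a candidate vary with z; every other factor,
        # including the partition function, cancels in the normalization.
        energy = sum(w * x[s] * x[t] for (s, t), w in theta.items()
                     if s in candidates or t in candidates)
        noise = math.prod(gamma[s] if zs != x_obs[s] else 1.0 - gamma[s]
                          for s, zs in zip(cand, z))
        raw[z] = math.exp(energy) * noise
    total = sum(raw.values())
    return {z: w / total for z, w in raw.items()}

# Hypothetical chain 0 - 1 - 2 in which the middle node is the candidate.
theta = {(0, 1): 0.5, (1, 2): 0.5}
weights = candidate_weights((1, -1, 1), {1}, theta, {1: 0.3})
```

In this example the candidate is observed as $-1$ while both of its neighbors are $+1$, so the dependency information shifts most of the weight onto the flipped state $+1$ even though the misclassification rate alone would favor the observed state.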

3.3 M Step

Noting that $\tilde{Q}_{r}$ corresponds to a weighted penalized logistic regression with an offset, we complete the M step maximization using the glmnet package in R (Friedman et al., 2009). We obtain the updated edge parameter estimates as

[TABLE]

The updated edge set is then

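$$\hat{E}_{EM}^{(k+1)}=\{(s,t);\ (\theta_{\setminus s}^{(k+1)})_{t}\neq 0\text{ and }(\theta_{\setminus t}^{(k+1)})_{s}\neq 0\}$$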

We show through simulations that this methodology tends to increase model selection performance of the underlying graphical model.
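The M step described above is a weighted, offset, $\ell_{1}$-penalized logistic regression. The paper carries it out with glmnet in R; the sketch below is a stand-in that uses plain proximal gradient descent in Python so that the roles of the weights and the offset are visible. The row layout `(y, x, offset, weight)`, the function name, and the example rows are assumptions made for illustration.

```python
import math

def weighted_l1_logistic(rows, lam, iters=500):
    """Proximal-gradient fit of a weighted, offset, l1-penalized logistic loss.

    Each row is (y, x, offset, weight) with y in {-1, +1}; weights are assumed
    to sum to 1. The model is P(y | x) = 1 / (1 + exp(-2 * y * (theta.x + offset))).
    """
    k = len(rows[0][1])
    theta = [0.0] * k
    # With weights summing to 1, the gradient is Lipschitz with constant
    # at most max ||x||^2, so its reciprocal is a safe step size.
    step = 1.0 / max(sum(v * v for v in x) for _, x, _, _ in rows)
    for _ in range(iters):
        grad = [0.0] * k
        for y, x, offset, weight in rows:
            eta = sum(t * v for t, v in zip(theta, x)) + offset
            resid = 1.0 / (1.0 + math.exp(2.0 * y * eta))  # 1 - P(y | x)
            for j in range(k):
                grad[j] -= weight * 2.0 * y * x[j] * resid
        moved = [t - step * g for t, g in zip(theta, grad)]
        theta = [math.copysign(max(abs(m) - step * lam, 0.0), m) for m in moved]
    return theta

# Hypothetical single observation whose candidate predictor is +1 with
# E-step weight 0.76 and -1 with weight 0.24; 0.5 is the fixed offset
# contributed by neighbors outside the update set.
rows = [(1, [1.0], 0.5, 0.76), (1, [-1.0], 0.5, 0.24)]
fit = weighted_l1_logistic(rows, lam=0.1)
```

Each observation contributes one pseudo-row per candidate configuration, weighted by its E-step probability, which is why the cost of the update grows with $2^{|\mathcal{C}|}$.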

4 Simulations

The EM method uses information about the misclassification, and also leverages dependency/structure information which we have access to from the original fit as made formal in Section 2.4.

In the following simulation we demonstrate that candidate nodes will gain spurious connections due to misclassification, which can be overcome using the EM update.

One can also note that given misclassification information, a “prior” weight based solely on misclassification information (i.e. agnostic of any structural dependency information) can be calculated as

[TABLE]

The EM method updates these state probabilities given dependency information.

4.1 Simulation Parameters and Network Specification

We ran the method on a network of 12 nodes ($p=12$); Figure 1 shows the topological structure of the network over which we simulate. The intuition for this network topology is that the blue participant nodes will inform the red candidate nodes.

The nodes $L,D,H$ are each potentially misclassified in 50% of observations, where the probability of misclassification in these observations is 60%. We ran 1000 simulations with $n=60$ , and true edge parameters $\theta^{*}_{st}=\frac{1}{2}$ for $(s,t)\in E$ . All Ising observations were simulated using the IsingSampler package in R (Epskamp, 2014).

Although nodes L, D, H are only misclassified in half of observations, the distribution unconditional on knowledge of the misclassification process is still a Misclassified Ising Distribution with non-zero misclassification parameters equal to $\gamma_{L}=\gamma_{D}=\gamma_{H}=0.8$ .

4.2 Fitted Models

The models we fit are:

1. RWL - minimizing (2)

2. RWL Weighted - minimizing (2) with a weighted logistic regression using weights defined in (17)

3. RWL + EM - Running an EM update for edges selected in RWL

4. Weighted + EM - Running an EM update for edges selected in RWL Weighted

For the initial RWL and RWL Weighted fits, a range of tuning parameters was selected to obtain an ROC curve for candidate and participant nodes. For the EM fits, the initial dependency structure was taken from the fit whose tuning parameter maximized $P(True\,\,Positive)+(1-P(False\,\,Positive))$, and then a range of tuning parameters was used to analyze the EM fits.

The first set of simulations looks at only one EM update on our fit. We then investigate the effect of further EM updates. We look at $RWL+2EM$ and $RWL+3EM$, which correspond to running a second and third EM update on the $RWL$ fitted edge set.
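The selection criterion $P(True\,\,Positive)+(1-P(False\,\,Positive))$ can be sketched as follows for a path of fitted edge sets indexed by tuning parameter. The helper names, the 4-node graph, and the three fitted edge sets are hypothetical and chosen only to show the computation.

```python
def rates(estimated, true_edges, all_pairs):
    """True-positive and false-positive rates of an estimated edge set."""
    positives = set(true_edges)
    negatives = set(all_pairs) - positives
    tpr = len(set(estimated) & positives) / len(positives)
    fpr = len(set(estimated) & negatives) / len(negatives)
    return tpr, fpr

def best_lambda(path, true_edges, all_pairs):
    """Pick the lambda whose edge set maximizes TPR + (1 - FPR)."""
    def score(lam):
        tpr, fpr = rates(path[lam], true_edges, all_pairs)
        return tpr + (1.0 - fpr)
    return max(path, key=score)

# Hypothetical 4-node example: true chain 0 - 1 - 2 - 3 and three fitted edge sets.
pairs = {(s, t) for s in range(4) for t in range(s + 1, 4)}
truth = {(0, 1), (1, 2), (2, 3)}
path = {
    0.5: {(0, 1)},                           # heavy penalty: misses two edges
    0.2: {(0, 1), (1, 2), (2, 3)},           # recovers the chain exactly
    0.05: {(0, 1), (1, 2), (2, 3), (0, 2)},  # light penalty: one spurious edge
}
chosen = best_lambda(path, truth, pairs)
```

Sweeping the tuning parameter and recording the resulting rate pairs is also what traces out an ROC curve.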

4.3 Results

In Figure 2 the RWL + EM fit performs at least as well as any other method. Even when not changing the tuning parameter, an increase in classification performance is always observed. Specifically, the AUC for candidate nodes increases from 0.6608 to 0.6945, and for participant nodes the AUC increases from 0.8729 to 0.8770.

Interestingly, basing the initial fit on RWL seems to perform better than basing it on the weighted regularized logistic regression (RWL Weighted). This is consistent with the proof given in Appendix B, as the misclassification probability for a candidate node will be at most $P(X_{r}=\tilde{X}_{r})=0.5$ for RWL Weighted, and therefore this misclassification scenario is far from the open set $\mathbf{\Gamma}$ defined in Appendix B. The implication of this result is that misclassification information alone is not enough to provide a gain in model selection performance; dependency information must also be leveraged.

As shown in Section 2.4, some dependency information is obtained in the $RWL$ fit, from which we have that $P(X_{r}=\tilde{X}_{r}|RWL)\approx 0$ for multiple observations, and therefore the Regularized EM Theorem in Appendix B applies. Figure 2 demonstrates this theorized increase in performance, and, as shown in Appendix B, the increase will occur without needing to change the tuning parameter.

Figure 3 shows the simulation results for running the EM update multiple times. Note that between EM updates it is unlikely that the probability that a node is in a given state will change drastically; therefore the Regularized EM Theorem does not apply. This can be seen in Figure 3: by the third EM update, there is a small decrease in participant node detection. After the first EM update the participant node AUC is 0.8770, and it decreases to 0.8593 by the third update.

5 fMRI Data Example Simulations

Sinclair et al. (2017) documents a method for fitting an Ising model on task-fMRI data. Each node in the graph corresponds to a specialized region of the cortex, and the classification is a discretization of a fit parameter corresponding to blood flow. If the blood flow is above a certain threshold, the area of the cortex is considered active during the task. Due to the inherent noise in the data, misclassification is certainly present.

Figure 4 shows the fit example from Sinclair et al. (2017), using data from the Human Connectome Project (Van Essen et al., 2013); the nodes were obtained via the parcellation documented in Gordon et al. (2016). An estimate of the node’s state was obtained by investigating the p-values used for the classification procedure. 14 of the 111 regions were found to be close to the p-value threshold more often, being within 5% of the threshold over 12% of the time. In Figure 4, these regions are colored in red.

5.1 Choosing Update Set $\mathcal{U}$

A useful consequence of our network fit is that the update set as defined in Section 3.1 is a union of $s=4$ disjoint subgraphs. Therefore, we run our simulations on the largest of these subgraphs, denoted as the update set in Figure 4. This corresponds to the $p=20$ node network topology that we use for simulations.
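Splitting an update set into its disjoint subgraphs amounts to finding the connected components of the fitted graph restricted to $\mathcal{U}$ . The following sketch assumes the update set is given as a list of node labels and the fitted edges as pairs; the function name and the small example are hypothetical and do not correspond to the network in Figure 4.

```python
from collections import deque

def connected_components(nodes, edges):
    """Connected components of the graph induced on `nodes` by `edges`."""
    adjacency = {v: set() for v in nodes}
    for s, t in edges:
        # Ignore edges that leave the node set.
        if s in adjacency and t in adjacency:
            adjacency[s].add(t)
            adjacency[t].add(s)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:  # breadth-first search from `start`
            v = queue.popleft()
            component.append(v)
            for w in adjacency[v] - seen:
                seen.add(w)
                queue.append(w)
        components.append(component)
    return components

# Hypothetical update set with four disjoint pieces.
update_set = [1, 2, 3, 4, 5, 6, 7]
fitted_edges = [(1, 2), (2, 3), (4, 5)]
components = connected_components(update_set, fitted_edges)
largest = max(components, key=len)
print(len(components), sorted(largest))
```

Because the components share no edges, each one can be updated separately, which is what makes restricting attention to the largest component a natural choice for simulation.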

5.2 Simulation Parameters

We ran 500 simulations with $n=200$ , corresponding to the size of the original dataset. Edge parameters in the simulation were selected to correspond to edge parameters from the original fit; however, non-zero edges were smoothed towards the average of all edge parameters.

Participant nodes were then misclassified in 50% of observations with a misclassification probability of 75%. Thus, the overall misclassification rate is similar to the observed dataset.

Based on the results from Section 4, we compare only RWL + EM and RWL, where a range of tuning parameters is selected for each method.

5.3 Results

Figure 5 shows the True Positive vs False Positive relationship. A consistent increase in classification performance is observed for the first 13 nodes. The overall error rate for the neighborhood of candidate nodes drops from 21.1% to 10.0% when choosing the optimal tuning parameter for the EM fit. If the tuning parameter is not changed for the EM fit, we still see a decrease in the error from 21.1% to 14.6%. There does appear to be a small decrease in performance for participant nodes that were not a direct neighbor of a candidate node; however, this difference contributed to less than a 3% increase in false positives and false negatives.

Figure 6 orders the nodes by overall error rate across simulations for the two different methods. The error rate is consistently lower after running the EM fit.

Figure 7 plots the adjacency matrix for $\mathcal{U}$ . This plot has a few interesting characteristics. The red areas, which correspond to false edges that were selected often for the RWL fit, tend to correspond to edges between participant nodes that are highly connected to candidate nodes. The error rate is particularly high for nodes 104, 84, and 55. Figure 8 looks only at error rate, and focuses on nodes that had at least one candidate node as a neighbor.

6 Conclusion

In this paper we introduce the misclassified Ising model. We show that under suitable misclassification assumptions RWL can still be used as a model selection technique. We then show that RWL can be extended in order to account for misclassification. Sections 4 and 5 show simulation results for a symmetric network and for a network obtained from fMRI data.

The fMRI node states correspond to discretizations of a continuous variable and therefore provide a useful setting for discussing misclassification. Depending on the discretization method used to determine the latent state, acquiring an estimate for the probability of misclassification is potentially straightforward.

In both cases, the EM-based algorithm is shown to provide significant performance gains in model selection. Given a binary network data set with an estimated misclassification probability, one can therefore obtain more reliable connections between nodes within the update set $\mathcal{U}$ by performing this update.

The method is computationally constrained by the greatest number of candidate nodes within the largest disjoint subnetwork of the update set $\mathcal{U}$ . However, this computational complexity depends only linearly on the number of remaining nodes in the update set. Therefore even with a high degree dataset, if there are few candidate nodes, this method can still be tractable.

The analysis in this paper can be extended easily to the signed edge selection as discussed in Ravikumar et al. (2010). The EM approach can also be extended to the Potts model corresponding to multiple states per node, although this would serve to further increase the computational complexity. Future work within the misclassified Ising framework could be to understand the effect of dependent misclassification across nodes on the misclassified score and information functions.

Appendix A Proof of Extended Theorem 1

In this appendix we state the assumptions for Extended Theorem 1 and complete the proof.

A.1 Assumptions

In order to prove Extended Theorem 1 we need to make assumptions $\tilde{A1}$ , $\tilde{A2}$ , $\tilde{A3}$ . Assumptions $\tilde{A1}$ and $\tilde{A2}$ are analogous to those of Ravikumar et al. (2010), except that they are stated under the misclassified information matrix. Assumption $\tilde{A3}$ bounds the amount of misclassification in our data.

Define $S=\{(r,t)\in V\times V|t\in\mathcal{N}(r)\}$ .

Assume the following assumptions hold uniformly for all $r\in V$ :

( $\tilde{A1}$ ) Dependency Condition. For the misclassified information matrix and for the sample covariance matrix, there exist constants $C_{min},D_{max}>0$ such that

[TABLE]

( $\tilde{A2}$ ) Incoherence Condition. There exists $\alpha\in(0,1]$ such that

[TABLE]

( $\tilde{A3}$ ) Misclassification Condition. For $C_{min}$ and $D_{max}$ as defined in ( $\tilde{A1}$ ), and $\alpha$ as defined in ( $\tilde{A2}$ ), we assume

[TABLE]

A.2 Proof

Within this proof we drop the node-specific subscript $r$ . The proof is carried out for a single node, and a union bound is applied to obtain the result across nodes.

Define the sample misclassified information as

[TABLE]

Lemmas 5, 6, and 7 of Ravikumar et al. (2010) can be applied to show that if $\tilde{\mathfrak{X}}$ is such that $\tilde{A1}$ and $\tilde{A2}$ hold for $\tilde{Q}^{n}$ , then the assumptions will hold with high probability for $\tilde{Q}^{*}$ for $n=\Omega(d^{3}\log p)$ . These lemmas directly apply to the misclassified case since their only dependence on the Ising distribution is that $\tilde{Q}^{n}-\tilde{Q}^{*}$ can be written as an iid mean of bounded observations, which still holds.

Therefore, to complete the proof it suffices to show that Extended Theorem 1 is true only for observations where the event $M=\{\tilde{\mathfrak{X}}:\tilde{A1}\text{ and }\tilde{A2}\text{ hold for }\tilde{Q}^{n}\}$ occurs. This corresponds to Proposition 1 of Ravikumar et al. (2010).

Define $\tilde{\lambda_{n}}=\lambda_{n}-\frac{4(2-\alpha)}{\alpha}S_{max}$ . We can use Lemma 3 and Lemma 4 from Ravikumar et al. (2010) to show that Extended Theorem 1 holds when $M$ occurs. In order to utilize these lemmas we need to establish an upper bound for the misclassified score function with high probability, and we need to establish an upper bound for the quantity $\lambda_{n}d$ . The following lemma, proven in Appendix A.2.1, establishes an upper bound on the misclassified score function.

Lemma. For the specified incoherence parameter $\alpha\in(0,1]$ , we have

[TABLE]

for $K$ independent of $(n,d,p)$ and for $\lambda_{n}\geq\frac{16(2-\alpha)}{\alpha}\left(\sqrt{\frac{\log p}{n}}+\frac{S_{max}}{4}\right)$
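As a small numerical illustration of the condition on $\lambda_{n}$ in the lemma, the sketch below evaluates the lower bound $\frac{16(2-\alpha)}{\alpha}\left(\sqrt{\frac{\log p}{n}}+\frac{S_{max}}{4}\right)$ . The function name is hypothetical, $S_{max}$ is treated as a given number, and the values of $\alpha$ and $S_{max}$ used are illustrative assumptions rather than quantities estimated in the paper.

```python
import math

def lambda_lower_bound(n, p, alpha, s_max):
    """Smallest lambda_n allowed by the lemma's condition."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    return 16.0 * (2.0 - alpha) / alpha * (math.sqrt(math.log(p) / n) + s_max / 4.0)

# Illustrative values: n and p as in Section 4, alpha and s_max assumed.
lam_clean = lambda_lower_bound(n=60, p=12, alpha=0.5, s_max=0.0)
lam_noisy = lambda_lower_bound(n=60, p=12, alpha=0.5, s_max=0.01)
print(lam_clean, lam_noisy)
```

The difference between the two values is $\frac{4(2-\alpha)}{\alpha}S_{max}$ , which is the amount subtracted from $\lambda_{n}$ in the definition of $\tilde{\lambda_{n}}$ .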

In order to establish bounds for $\lambda_{n}d$ , set $n>\frac{400^{2}D^{2}_{max}}{C^{4}_{min}}\frac{(2-\alpha)^{4}}{\alpha^{4}}d^{2}\log p$ , then by applying assumption ( $\tilde{A3}$ ) on $S_{max}$ , and since $\frac{\alpha}{2-\alpha}\leq 1$ we have

[TABLE]

With these technical results we can complete the proof of Extended Theorem 1 as presented in Ravikumar et al. (2010).

A.2.1 Proof of Lemma

Let $W^{n}_{u}$ be the $u^{th}$ component of $W^{n}$ . Note that $W^{n}_{u}$ is the iid mean of $n$ random variables that are bounded in $[-2,2]$ . Therefore, by the Azuma-Hoeffding inequality (Hoeffding, 1963), we have

[TABLE]

for any $\delta>0$ . Note that for any $x,y,z\in\mathbb{R}$ we have, $|x|>|z|+|y|\Rightarrow|x-y|>|z|$ . Applying this to (25) gives

[TABLE]
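The elementary implication used above follows from the reverse triangle inequality. As a brief check, added here for completeness, suppose $|x|>|z|+|y|$ ; then

$$|x-y|\;\geq\;|x|-|y|\;>\;\left(|z|+|y|\right)-|y|\;=\;|z|.$$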

We can bound (26) from below by setting $\delta=\frac{\alpha\lambda_{n}}{4(2-\alpha)}-|E(W^{n}_{u})|$ , and noting that $\frac{\alpha}{2-\alpha}\leq 1$ . We get

[TABLE]

We bound (26) from above as follows

[TABLE]

Combining (26), (27), (28) finishes the proof of the lemma.

Appendix B Proof of Regularized EM Approach

In this appendix we show the following.

Regularized EM Theorem. For data $\tilde{\mathfrak{X}}$ , for $\hat{\theta}$ the parameter estimate from the RWL fit, and for $\theta^{*}$ the parameter estimate from the first EM update, there exists an open set of misclassification laws $\mathbf{\Gamma}$ such that for the marginal penalized likelihood of our data as defined in Equation (10) we have that

[TABLE]

For notational convenience, we suppress the parameters $\hat{\theta}_{V\setminus\mathcal{U}\setminus r}$ , and we refer to our parameters of interest simply as $\theta$ as they do not change throughout the proof.

For $z_{c}$ as the latent states, by following the proof of the EM algorithm given in Little and Rubin (2002) we have the following relationship for the marginal likelihoods, which still holds when the regularization parameter is added

[TABLE]

Here $A_{\Gamma}(\theta)$ and $B_{\Gamma}(\theta)$ correspond to the two large summations in equation (30). $\Gamma$ is included in the notation for these functions to emphasize their dependence on the misclassification scheme.

For $B_{\Gamma}(\theta)$ we have, by Gibbs' inequality, that $B_{\Gamma}(\theta)\geq B_{\Gamma}(\hat{\theta})$ for all $\theta$ and for all $\Gamma$ . Therefore $B_{\Gamma}(\theta)$ will increase at $\theta^{*}$ . Our goal is thus to show that $A(\theta)+\lambda\|\theta\|_{1}$ will increase.
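Gibbs' inequality states that, for probability vectors $p$ and $q$ on the same finite set, $-\sum_{i}p_{i}\log q_{i}\geq-\sum_{i}p_{i}\log p_{i}$ , with equality only when $p=q$ . The sketch below checks this numerically on two hypothetical distributions; it illustrates the inequality itself and does not reproduce $B_{\Gamma}$ , whose exact form is given in equation (30).

```python
import math

def cross_entropy(p, q):
    """-sum_i p_i * log(q_i), with the convention 0 * log(.) = 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical distributions over three latent configurations.
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
# The gap is the Kullback-Leibler divergence from q to p, which is nonnegative.
gap = cross_entropy(p, q) - cross_entropy(p, p)
print(gap > 0)
```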

Choose the misclassification setting $\Gamma^{\prime}$ such that $\prod_{s\in\mathcal{C}}P(z_{s}\neq\tilde{x}^{(i)}_{s})=1$ . Define $z^{(i)}_{\Gamma^{\prime}}$ component-wise as $(z^{(i)}_{\Gamma^{\prime}})_{s}=-\tilde{x}^{(i)}_{s}$ . Under this $\Gamma^{\prime}$ , we have the following representation for $A_{\Gamma^{\prime}}(\theta)$

[TABLE]

For this selection of $\Gamma^{\prime}$ we have that $\theta^{*}$ is chosen to maximize $A_{\Gamma^{\prime}}(\theta)+\lambda\|\theta\|_{1}$ , and therefore $A_{\Gamma^{\prime}}(\theta^{*})+B_{\Gamma^{\prime}}(\theta^{*})+\lambda\|\theta^{*}\|_{1}\geq A_{\Gamma^{\prime}}(\hat{\theta})+B_{\Gamma^{\prime}}(\hat{\theta})+\lambda\|\hat{\theta}\|_{1}$ . Since $A_{\Gamma}(\theta)+B_{\Gamma}(\theta)+\lambda\|\theta\|_{1}$ is continuous in $\Gamma$ , there exists an open set $\mathbf{\Gamma}$ such that if $\Gamma\in\mathbf{\Gamma}$ then $L_{\lambda}(\theta^{*}_{\mathcal{U}\setminus r}|\hat{\theta}_{V\setminus\mathcal{U}\setminus r},\tilde{\mathfrak{X}})\geq L_{\lambda}(\hat{\theta}_{\mathcal{U}\setminus r}|\hat{\theta}_{V\setminus\mathcal{U}\setminus r},\tilde{\mathfrak{X}})$ as needed.

Appendix C Calculating Weights for E-step

Here we calculate the weights $P_{\theta^{(k)}}(X_{\mathcal{C}}=z_{c}|\tilde{X}_{\mathcal{U}}=\tilde{x}^{(i)}_{\mathcal{U}})$ from equation (13). In these calculations we assume we have $\gamma^{i}_{s}$ corresponding to the misclassification probability for node $s$ at observation $i$ .

We remove the subscript for estimate $\theta^{(k)}$ , and superscript for observation ${(i)}$ for notational convenience. Rearranging conditional and joint probabilities gives us

[TABLE]

The conditional probability in (34) gives the proportion of the weight associated with the observed misclassification probability. This is calculated as

[TABLE]

The ratio of probabilities gives the weight of the observation associated with the estimated dependency structure. Define $A(x_{\mathcal{C}},x_{\mathcal{P}})=\sum_{(s,t)\in E_{\mathcal{U}}}\theta^{(k)}_{st}x_{s}x_{t}$ ; this corresponds to the association between nodes in $\mathcal{U}$ as it relates to the full distribution given in (1). Owing to the selection of $\mathcal{U}$ , the ratio of probabilities factors, allowing this calculation to ignore nodes outside of $\mathcal{U}$ .

[TABLE]

Here $B$ in the above equation corresponds to the potential from all nodes outside of $\mathcal{U}$ .
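The structure of this calculation, a misclassification term multiplied by a dependency term and normalized over all latent configurations, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: only candidate nodes are treated as latent, misclassification is assumed independent across nodes with flip probability $\gamma_{s}$ , the contribution $B$ from outside $\mathcal{U}$ is taken to cancel in the normalization, and the function name and the small example are hypothetical.

```python
import itertools
import math

def e_step_weights(x_obs, candidates, edges, theta, gamma):
    """Posterior weights over latent candidate states for one observation.

    x_obs: {node: observed state in {-1, +1}} for nodes in the update set
    candidates: nodes whose true state is treated as latent
    edges: (s, t) pairs within the update set
    theta: {(s, t): edge parameter}
    gamma: {node: misclassification probability} for candidate nodes
    """
    weights = {}
    # Enumerate all 2^|C| latent configurations of the candidate nodes.
    for z in itertools.product((-1, 1), repeat=len(candidates)):
        x = dict(x_obs)
        x.update(zip(candidates, z))
        # Misclassification term: P(observed | latent), independent across nodes.
        noise = 1.0
        for s, z_s in zip(candidates, z):
            noise *= gamma[s] if z_s != x_obs[s] else 1.0 - gamma[s]
        # Dependency term: exp of the association A within the update set.
        assoc = sum(theta[(s, t)] * x[s] * x[t] for s, t in edges)
        weights[z] = noise * math.exp(assoc)
    total = sum(weights.values())
    return {z: w / total for z, w in weights.items()}

# Hypothetical example: one candidate "c" observed as -1, two participants at +1.
x_obs = {"c": -1, "p1": 1, "p2": 1}
edges = [("c", "p1"), ("c", "p2")]
theta = {e: 0.5 for e in edges}
weights = e_step_weights(x_obs, ["c"], edges, theta, gamma={"c": 0.3})
print(weights)
```

In this example the two participant neighbors pull the weight toward the state $+1$ even though $-1$ was observed. The enumeration over $2^{|\mathcal{C}|}$ configurations also shows why the number of candidate nodes in a subnetwork drives the computational cost.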

Bibliography


Banerjee, S., B. P. Carlin, and A. E. Gelfand (2014). Hierarchical Modeling and Analysis for Spatial Data. CRC Press.
Barber, R. F., M. Drton, et al. (2015). High-dimensional Ising model selection with Bayesian information criteria. Electronic Journal of Statistics 9(1), 567–607.
Bresler, G. (2015). Efficiently learning Ising models on arbitrary graphs. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pp. 771–782. ACM.
Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 1–38.
Epskamp, S. (2014). IsingSampler: Sampling methods and distribution functions for the Ising model. R package version 0.1 1.
Friedman, J., T. Hastie, and R. Tibshirani (2009). glmnet: Lasso and elastic-net regularized generalized linear models. R package version 1(4).
Friedman, J., T. Hastie, and R. Tibshirani (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33(1), 1–22.
Geman, S. and D. Geman (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence (6), 721–741.
