Causal Discovery with a Mixture of DAGs

Eric V. Strobl

arXiv:1901.09475·stat.ML·September 8, 2020


TL;DR

This paper introduces an approach to causal discovery that uses a mixture of directed acyclic graphs (DAGs) to model cyclic, evolving, and population-specific causal processes, with an algorithm that leverages longitudinal data for inference.

Contribution

It proposes modeling causation with a mixture of DAGs and introduces the Causal Inference over Mixtures algorithm for longitudinal data analysis.

Findings

01

Improved causal inference performance over prior methods.

02

Effective modeling of cyclic, evolving, and population-specific causal processes.

03

Algorithm successfully infers causal relations from complex data.

Abstract

Causal processes in biomedicine may contain cycles, evolve over time or differ between populations. However, many graphical models cannot accommodate these conditions. We propose to model causation using a mixture of directed acyclic graphs (DAGs), where the joint distribution in a population follows a DAG at any single point in time but potentially different DAGs across time. We also introduce an algorithm called Causal Inference over Mixtures that uses longitudinal data to infer a graph summarizing the causal relations generated from a mixture of DAGs. Experiments demonstrate improved performance compared to prior approaches.


Tables

Table 1. (a)

$X_{1}$	$X_{2}$	$T$
0.21	-0.20	1.29
0.68	-0.47	7.30
1.05	-0.19	4.33
0.72	-1.40	0.10
0.13	-0.56	2.91

Table 2. (b)

$X_{1}$	$X_{3}$	$X_{4}$	$X_{7}$	$\dots$	$T_{3}$
0.31	-1.01	5	0	$\dots$	1.29
0.89	-0.58	6	0	$\dots$	7.30
1.11	-0.79	2	1	$\dots$	4.33
0.14	-1.23	5	0	$\dots$	0.10
0.21	-0.20	4	1	$\dots$	2.91
$⋮$	$⋮$	$⋮$	$⋮$	$⋮$	$⋮$

Equations

$$\textnormal{Anc}_{\mathbb{G}}(\bm{Y})$$

$$f(\bm{X})=\prod_{i=1}^{p}f(X_{i}|\textnormal{Pa}_{\mathbb{G}}(X_{i})).$$

$$f(\bm{X},T=t)=f(T=t)\prod_{i=1}^{p}f(X_{i}|\textnormal{Pa}_{t}(X_{i})),$$

$$f(\bm{Z})=\prod_{i=1}^{p+1}f(Z_{i}|\textnormal{Pa}_{T}(Z_{i}))$$

$$f(\bm{Z})=\prod_{i=1}^{p+s}f(Z_{i}|\textnormal{Pa}_{\bm{T}}(Z_{i})),$$

$$\prod_{i=1}^{p+s}f(Z_{i}|\textnormal{Pa}_{\bm{T}}(Z_{i}))=\prod_{i=1}^{p+s}f(Z_{i}|(\textnormal{Pa}_{\bm{T}}(Z_{i})\setminus\bm{U}_{i})\cup\bm{V}_{i}),$$

$$f(\bm{O}|\bm{S})=\sum_{\bm{L}}f(\bm{O},\bm{L}|\bm{S}).$$

$$\prod_{i=1}^{p+s}f(Z_{i}|\textnormal{Pa}_{\bm{T}}(Z_{i}))=\sum_{j=1}^{q}\mathbbm{1}_{\bm{T}\in\mathcal{T}^{j}}\prod_{i=1}^{p+s}f(Z_{i}|\textnormal{Pa}^{j}(Z_{i})),$$

$$f^{j}(\ddot{\bm{A}},\ddot{\bm{B}},\bm{C})=\prod_{e\in E^{j}\setminus E_{\ddot{\bm{B}}}^{j}}\gamma^{j}(e)\prod_{e\in E_{\ddot{\bm{B}}}^{j}}\gamma^{j}(e)=\gamma^{j}(\ddot{\bm{A}},\bm{C})\gamma^{j}(\ddot{\bm{B}},\bm{C}),$$

$$\begin{aligned}f(\bm{A},\bm{B},\bm{C})&=\sum_{[\ddot{\bm{A}}\cup\ddot{\bm{B}}]\setminus[\bm{A}\cup\bm{B}]}\sum_{j=1}^{q}\mathbbm{1}_{\bm{T}\in\mathcal{T}^{j}}f^{j}(\ddot{\bm{A}},\ddot{\bm{B}},\bm{C})\\&=\sum_{[\ddot{\bm{A}}\cup\ddot{\bm{B}}]\setminus[\bm{A}\cup\bm{B}]}\sum_{j=1}^{q}\mathbbm{1}_{\bm{U}\in\mathcal{U}^{j}}\mathbbm{1}_{\bm{V}\in\mathcal{V}^{j}}f^{j}(\ddot{\bm{A}},\ddot{\bm{B}},\bm{C})\\&=\sum_{[\ddot{\bm{A}}\cup\ddot{\bm{B}}]\setminus[\bm{A}\cup\bm{B}]}\gamma(\ddot{\bm{A}},\bm{C})\gamma(\ddot{\bm{B}},\bm{C}),\end{aligned}$$

$$\begin{aligned}\sum_{[\ddot{\bm{A}}\cup\ddot{\bm{B}}]\setminus[\bm{A}\cup\bm{B}]}\gamma(\ddot{\bm{A}},\bm{C})\gamma(\ddot{\bm{B}},\bm{C})&=\sum_{[\ddot{\bm{A}}\setminus\bm{A}]\cup[\ddot{\bm{B}}\setminus\bm{B}]}\gamma(\ddot{\bm{A}},\bm{C})\gamma(\ddot{\bm{B}},\bm{C})\\&=\Big[\sum_{[\ddot{\bm{B}}\setminus\bm{B}]}\Big[\sum_{[\ddot{\bm{A}}\setminus\bm{A}]}\gamma(\ddot{\bm{A}},\bm{C})\Big]\gamma(\ddot{\bm{B}},\bm{C})\Big]\\&=\sum_{[\ddot{\bm{A}}\setminus\bm{A}]}\gamma(\ddot{\bm{A}},\bm{C})\sum_{[\ddot{\bm{B}}\setminus\bm{B}]}\gamma(\ddot{\bm{B}},\bm{C})\\&=\gamma(\bm{A},\bm{C})\gamma(\bm{B},\bm{C}),\end{aligned}$$


Taxonomy

Methods: Causal inference

Full text

Causal Discovery with a Mixture of DAGs

Eric V. Strobl

Psychiatry & Behavioral Sciences

Vanderbilt University Medical Center

Tennessee, United States


keywords:

Causal discovery, Longitudinal data, Directed acyclic graph, Mixture of DAGs

1 Introduction

Causal discovery refers to the process of inferring causation from data. Investigators usually perform causal discovery in biomedicine using randomized controlled trials (RCTs). However, RCTs can be impractical or unethical to perform. For example, scientists cannot randomly administer illicit substances or traumatize healthy subjects. Many investigators therefore experiment with animals knowing that the derived results may not directly apply to humans.

In this paper, we develop an algorithm that discovers causation directly from human observational data, or data collected without randomization. Denote the variables in an observational dataset by $\bm{X}$ . We summarize the causal relations between variables in $\bm{X}$ using a directed graph, where the directed edge $X_{i}\rightarrow X_{j}$ with $X_{i},X_{j}\in\bm{X}$ means that $X_{i}$ is a direct cause of $X_{j}$ . Similarly, $X_{i}$ is a cause of $X_{j}$ if there exists a directed path, or a sequence of directed edges, from $X_{i}$ to $X_{j}$ . We want to recover the directed graph as best as possible using the observational dataset.

Directed graphs in nature often contain feedback loops, or cycles, where $X_{i}$ causes $X_{j}$ and $X_{j}$ causes $X_{i}$ . For example, Figure 1 (a) depicts a portion of the thyroid system where $X_{1}$ denotes the thyroid stimulating hormone (TSH) and $X_{2}$ the T4 hormone (T4). TSH released from the pituitary gland regulates T4 hormone release from the thyroid gland, while T4 feeds back to inhibit TSH release. Cycles such as these abound in biomedicine, so we must develop algorithms that can accommodate them in order to accurately model causal processes.

We propose to model a potentially cyclic causal process using multiple directed acyclic graphs (DAGs), or graphs with directed edges but no cycles. The causal process is represented as a DAG at any single point in time, but the DAG may change across time to accommodate feedback. We illustrate the idea by decomposing the cycle in Figure 1 (a) into two DAGs: TSH $\rightarrow$ T4 and T4 $\rightarrow$ TSH. For each sample, TSH first causes T4 release at time point $t_{1}$ and then T4 inhibits TSH release at time point $t_{2}>t_{1}$ . We however can only measure each sample at a single point in time, so the observational dataset in Figure 1(a) contains some samples in blue when TSH causes T4 and others in grey when T4 causes TSH. If we do not observe the time variable $T$ , then the observational dataset arises from a mixture of DAGs where the mixing occurs over time: $f(X_{1},X_{2})=\sum_{T}f(X_{1},X_{2}|T)f(T)$ . We must infer the directed graph in Figure 1 (a) using the samples from $X_{1}$ and $X_{2}$ alone. In practice, we observe more than two random variables without color coding and mixing occurs over a subset of variables $\bm{T}$ denoting entities such as time, gender, income and disease status. Figure 1(b) therefore depicts a more realistic dataset.
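The two-DAG decomposition of the thyroid cycle can be illustrated with a small simulation. The following is a minimal sketch rather than the procedure used in the paper: the linear Gaussian mechanisms, the coefficients of 0.8 and -0.8, the mixing proportion `p_first` and all function names are assumptions introduced only for illustration.

```python
import random

def sample_thyroid_mixture(n=1000, p_first=0.5, seed=0):
    """Draw n samples (x1, x2, t) from a mixture of two DAGs over TSH (X1) and T4 (X2)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # t = 0: the sample is measured while X1 -> X2 is active.
        # t = 1: the sample is measured while X2 -> X1 is active.
        t = 0 if rng.random() < p_first else 1
        e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
        if t == 0:
            x1 = e1
            x2 = 0.8 * x1 + e2      # hypothetical stimulation of T4 by TSH
        else:
            x2 = e2
            x1 = -0.8 * x2 + e1     # hypothetical inhibition of TSH by T4
        rows.append((x1, x2, t))
    return rows

def correlation(pairs):
    """Pearson correlation of a list of (a, b) pairs."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    return cov / (var_a * var_b) ** 0.5

samples = sample_thyroid_mixture()
# When T is latent we only see the pooled (x1, x2) pairs, without the color coding.
observed = [(x1, x2) for x1, x2, _ in samples]
```

Within each component the dependence has the sign of its own mechanism, whereas `observed` pools both components, which is the situation described by $f(X_{1},X_{2})=\sum_{T}f(X_{1},X_{2}|T)f(T)$ .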

We also develop a method for recovering a directed graph summarizing the causal relations arising from a mixture of DAGs. We do so by first reviewing related work in Section 2. We then provide background in Section 3. Section 4 introduces the mixture of DAGs framework. In Section 5, we detail the algorithm called Causal Inference over Mixtures (CIM) to infer causal relations using longitudinal data. We then report experimental results in Section 6 highlighting the superiority of CIM compared to prior approaches on both real and synthetic datasets. We finally conclude the paper in Section 7. We defer all proofs to the Appendix.

This paper improves upon a previous conference paper [1]. We made the following changes: (1) simplified the exposition, (2) improved the characterization of a mixture of DAGs, (3) corrected theoretical results, (4) enhanced the CIM algorithm, and (5) ran experiments with better evaluation metrics. This report therefore provides more convincing material compared to the conference paper.

2 Related Work

Several algorithms perform causal discovery with cycles. Most of these methods assume stationarity, or a stable distribution over time and populations. The Fast Causal Inference (FCI) algorithm for example infers causal relations when cycles exist [2, 3]. The algorithm was initially developed for the acyclic case, but it can infer the acyclic portions of a cyclic graph by ignoring the independence relations within cycles. Other algorithms attempt to recover within-cycle causal relations. The Cyclic Causal Discovery (CCD) algorithm for instance works well when no selection bias or latent variables exist. The Cyclic Causal Inference (CCI) algorithm extends CCD to handle selection bias and latent variables, but both algorithms require linear or discrete variables for correctness [4, 5, 6].

Investigators have extended FCI, CCD and CCI with answer set programming (ASP). ASP algorithms allow the user to easily incorporate prior knowledge and infer causal relations more accurately. These methods however only apply to datasets with fewer than 10-20 variables due to scalability issues on a conventional laptop [7, 8].

Another set of methods focuses on non-stationarity, but most of them require a single underlying directed graph [9, 10, 11, 12, 13]. Two methods exist for recovering causal processes with multiple graphs [14, 15], but they assume a mixture of parametric distributions. CIM improves upon all of these methods by allowing non-linearity, cycles, non-stationarity, non-parametric distributions, changing graphical structure, latent variables and selection bias.

3 Background

We now delve into the background material required to understand the proposed methodology.

3.1 Terminology

In addition to directed edges, we consider other edge types including: $\leftrightarrow$ (bidirected), — (undirected), ${\circ\hskip 0.85358pt\!\!\!\rightarrow}$ (partially directed), ${\circ\!-}$ (partially undirected) and ${\circ\hskip 1.13809pt\!\!\!-\hskip 1.13809pt\!\!\!\circ}$ (nondirected). The edges contain three endpoint types: arrowheads, tails and circles. We say that two vertices $X_{i}$ and $X_{j}$ are adjacent if there exists an edge between the two vertices. We refer to the triple $X_{i}*\!\!\rightarrow X_{j}\leftarrow\!\!*X_{k}$ as a collider or v-structure, where each asterisk corresponds to an arbitrary endpoint type. A collider or v-structure is said to be unshielded when $X_{i}$ and $X_{k}$ are non-adjacent. The triple $X_{i}*\!\!-\!\!*X_{j}*\!\!-\!\!*X_{k}$ is conversely a triangle if $X_{i}$ and $X_{k}$ are adjacent. Unless stated otherwise, a path is a sequence of edges without repeated vertices. $X_{i}$ is an ancestor of $X_{j}$ if there exists a directed path from $X_{i}$ to $X_{j}$ or $X_{i}=X_{j}$ . We write $X_{i}\in\textnormal{Anc}_{\mathbb{G}}(X_{j})$ when $X_{i}$ is an ancestor of $X_{j}$ in the graph $\mathbb{G}$ . We also apply the definition of an ancestor to a set of vertices $\bm{Y}\subseteq\bm{X}$ as follows:

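$$\textnormal{Anc}_{\mathbb{G}}(\bm{Y})=\{X_{i}|X_{i}\in\textnormal{Anc}_{\mathbb{G}}(X_{j})\textnormal{ for some }X_{j}\in\bm{Y}\}.$$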

If $\bm{A}$ , $\bm{B}$ and $\bm{C}$ are disjoint sets of vertices in $\bm{X}$ , then $\bm{A}$ and $\bm{B}$ are said to be d-connected by $\bm{C}$ in a directed graph $\mathbb{G}$ if there exists a path $\Pi$ between some vertex in $\bm{A}$ and some vertex in $\bm{B}$ such that, for any collider $X_{i}$ on $\Pi$ , $X_{i}$ is an ancestor of $\bm{C}$ and no non-collider on $\Pi$ is in $\bm{C}$ . We also say that $\bm{A}$ and $\bm{B}$ are d-separated by $\bm{C}$ if they are not d-connected by $\bm{C}$ . For shorthand, we write $\bm{A}\perp\!\!\!\perp_{d}\bm{B}|\bm{C}$ to denote d-separation and $\bm{A}\not\perp\!\!\!\perp_{d}\bm{B}|\bm{C}$ to denote d-connection. The set $\bm{C}$ is more specifically called a minimal separating set if we have $\bm{A}\perp\!\!\!\perp_{d}\bm{B}|\bm{C}$ but $\bm{A}\not\perp\!\!\!\perp_{d}\bm{B}|\bm{D}$ , where $\bm{D}$ denotes any proper subset of $\bm{C}$ .
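The d-separation definition above can be checked mechanically in a DAG. The sketch below uses the standard moralization criterion, which is equivalent to the path-based definition for DAGs: restrict the graph to the ancestors of $\bm{A}\cup\bm{B}\cup\bm{C}$ , connect the parents of every common child, drop edge directions, delete $\bm{C}$ and test whether $\bm{A}$ can still reach $\bm{B}$ . The function names and the dictionary encoding (child mapped to a list of parents) are assumptions made for illustration.

```python
from itertools import combinations

def ancestors(parents, nodes):
    """Return `nodes` together with every vertex that has a directed path into them."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        v = stack.pop()
        for p in parents.get(v, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(parents, A, B, C):
    """True if A and B are d-separated by C in the DAG encoded by `parents`."""
    A, B, C = set(A), set(B), set(C)
    keep = ancestors(parents, A | B | C)
    # Moralize the ancestral subgraph: link each vertex to its parents
    # and link every pair of parents that share a child.
    nbrs = {v: set() for v in keep}
    for v in keep:
        pa = list(parents.get(v, ()))
        for p in pa:
            nbrs[v].add(p)
            nbrs[p].add(v)
        for p, q in combinations(pa, 2):
            nbrs[p].add(q)
            nbrs[q].add(p)
    # Undirected search from A that is not allowed to enter C.
    seen, stack = set(A), list(A)
    while stack:
        v = stack.pop()
        if v in B:
            return False
        for u in nbrs[v]:
            if u not in seen and u not in C:
                seen.add(u)
                stack.append(u)
    return True

chain = {"X2": ["X1"], "X3": ["X2"]}                 # X1 -> X2 -> X3
collider = {"X2": ["X1", "X3"], "X4": ["X2"]}        # X1 -> X2 <- X3, X2 -> X4
```

In `chain`, conditioning on the non-collider `X2` blocks the path, while in `collider` the path is blocked until we condition on `X2` or its descendant `X4`, matching the requirement that a collider on an active path be an ancestor of $\bm{C}$ .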

A mixed graph contains edges with only arrowheads or tails, while a partially oriented mixed graph may also include circles. We focus on mixed graphs that contain at most one edge between any two vertices. We can associate a mixed graph $\mathbb{G}^{*}$ with a directed graph $\mathbb{G}$ as follows. We first partition $\bm{X}=\bm{O}\cup\bm{L}\cup\bm{S}$ denoting observed, latent and selection variables, respectively. We then consider a graph over $\bm{O}$ summarizing the ancestral relations in $\mathbb{G}$ with the following endpoint interpretations: $O_{i}*\!\!\rightarrow O_{j}$ in $\mathbb{G}^{*}$ if $O_{j}\not\in\textnormal{Anc}_{\mathbb{G}}(O_{i}\cup\bm{S})$ , and $O_{i}*\!\!\textnormal{---}O_{j}$ in $\mathbb{G}^{*}$ if $O_{j}\in\textnormal{Anc}_{\mathbb{G}}(O_{i}\cup\bm{S})$ .

3.2 Probabilistic Interpretation

We associate a density $f(\bm{X})$ to a DAG $\mathbb{G}$ by requiring that the density factorize into the product of conditional densities of each variable given its parents:

$$f(\bm{X})=\prod_{i=1}^{p}f(X_{i}|\textnormal{Pa}_{\mathbb{G}}(X_{i})).\tag{1}$$

Any distribution which factorizes as above also satisfies the global Markov property w.r.t. $\mathbb{G}$ where, if we have $\bm{A}\perp\!\!\!\perp_{d}\bm{B}|\bm{C}$ in $\mathbb{G}$ , then $\bm{A}$ and $\bm{B}$ are conditionally independent given $\bm{C}$ [16]. We denote the conditional independence (CI) as $\bm{A}\perp\!\!\!\perp\bm{B}|\bm{C}$ for short. We refer to the converse of the global Markov property as d-separation faithfulness. An algorithm is constraint-based if it utilizes CI testing to recover some aspects of $\mathbb{G}^{*}$ as a consequence of the global Markov property and d-separation faithfulness.

4 Mixture of DAGs

We introduce the framework with univariate $T$ and then generalize to multivariate $\bm{T}$ because the univariate case is simpler.

4.1 Univariate Case

We consider the set of vertices $\bm{Z}=\bm{X}\cup T$ . We divide $\bm{Z}$ into three non-overlapping sets $\bm{O}$ , $\bm{L}$ and $\bm{S}$ denoting observed, latent and selection variables, respectively. At each time point $t$ , we consider the joint density $f(\bm{X},T=t)$ and assume that it factorizes according to a DAG $\mathbb{G}^{t}$ over $\bm{Z}$ :

$$f(\bm{X},T=t)=f(T=t)\prod_{i=1}^{p}f(X_{i}|\textnormal{Pa}_{t}(X_{i})),$$

where $\textnormal{Pa}_{t}(Z_{i})$ refers to $\textnormal{Pa}_{\mathbb{G}^{t}}(Z_{i})$ for shorthand, the parent set of $Z_{i}$ at time point $t$ . We analyze the following density:

$$f(\bm{Z})=\prod_{i=1}^{p+1}f(Z_{i}|\textnormal{Pa}_{T}(Z_{i})),\tag{2}$$

where $\textnormal{Pa}_{T}(T)=\emptyset$ . The above equation differs from Equation (1) for a single DAG; the parent set $\textnormal{Pa}(Z_{i})$ remains constant over time in Equation (1), but the parent set $\textnormal{Pa}_{T}(Z_{i})$ may vary over time in Equation (2).

Let $\bm{R}\subseteq\bm{Z}$ correspond to all those variables in $\bm{Z}$ where $T\not\in\textnormal{Pa}_{T}(Z_{i})$ , so that $T\in\textnormal{Pa}_{T}(Z_{i})$ for all $Z_{i}\in[\bm{Z}\setminus\bm{R}]$ . We can then rewrite Equation (2):

[TABLE]

The left hand term corresponds to the stationary component and the right hand to the non-stationary component. We assume that we can sample from the mixture density $f(\bm{O}|\bm{S})$ :

[TABLE]

where mixing occurs over time $T$ in the integration if $T\in\bm{L}$ . We refer to the above equation as the mixture of DAGs framework.

4.2 Multivariate Case

We generalize the mixture of DAGs framework to a multivariate set of mutually independent mixture variables $\bm{T}$ . For example, we may let $\bm{T}=\{T_{1},T_{2}\}$ , where $T_{1}$ denotes time and $T_{2}$ gender. Gender is instantiated independent of time, but the causal process can change over time and differ by gender. We may also observe gender but not observe time so that $T_{2}\in\bm{O}$ but $T_{1}\in\bm{L}$ . The set $\bm{T}$ can therefore encompass a wide range of variables.

We consider the set of vertices $\bm{Z}=\bm{X}\cup\bm{T}$ instead of the original $\bm{X}\cup T$ . We divide $\bm{Z}$ into three non-overlapping sets $\bm{O}$ , $\bm{L}$ and $\bm{S}$ . We assume a joint density $f(\bm{X},\bm{T})$ that factorizes according to a DAG $\mathbb{G}^{\bm{T}}$ over $\bm{Z}$ :

$$f(\bm{Z})=\prod_{i=1}^{p+s}f(Z_{i}|\textnormal{Pa}_{\bm{T}}(Z_{i})),\tag{4}$$

where $\textnormal{Pa}_{\bm{T}}(\bm{T})=\emptyset$ . The above equation mirrors Equation (2).

For each $Z_{i}\in\bm{Z}$ , let $\bm{U}_{i}\subseteq\bm{T}$ denote the largest set such that $\bm{U}_{i}\cap\textnormal{Pa}_{\bm{T}}(Z_{i})=\emptyset$ . This implies $\bm{T}\cap\textnormal{Pa}_{\bm{T}}(Z_{i})=\bm{T}\setminus\bm{U}_{i}\triangleq\bm{V}_{i}$ . We then rewrite Equation (4):

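$$\prod_{i=1}^{p+s}f(Z_{i}|\textnormal{Pa}_{\bm{T}}(Z_{i}))=\prod_{i=1}^{p+s}f(Z_{i}|(\textnormal{Pa}_{\bm{T}}(Z_{i})\setminus\bm{U}_{i})\cup\bm{V}_{i}),\tag{5}$$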

so that $f(Z_{i}|\textnormal{Pa}_{\bm{T}}(Z_{i}))$ is stationary over $\bm{U}_{i}$ but non-stationary over $\bm{V}_{i}$ . Setting $\bm{U}_{i}=T$ and $\bm{V}_{i}=\emptyset$ for $Z_{i}\in\bm{R}$ and vice versa for $Z_{i}\in[\bm{Z}\setminus\bm{R}]$ recovers Equation (3). We finally sample from the mixture density $f(\bm{O}|\bm{S})$ :

$$f(\bm{O}|\bm{S})=\sum_{\bm{L}}f(\bm{O},\bm{L}|\bm{S}).$$
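As a small illustration of the sets $\bm{U}_{i}$ and $\bm{V}_{i}$ defined above, the sketch below splits the mixture variables for one vertex into the part its conditional density does not depend on and the part it does. The function name and the example parent set are hypothetical.

```python
def split_mixture_parents(mix_vars, parents_of_zi):
    """Return (U_i, V_i): mixture variables outside and inside the parent set of Z_i."""
    v_i = set(mix_vars) & set(parents_of_zi)   # V_i = T ∩ Pa_T(Z_i)
    u_i = set(mix_vars) - v_i                  # U_i = T \ V_i
    return u_i, v_i

# Hypothetical example: T1 is time, T2 is gender, and Z_i has parents {X1, T1}.
u_i, v_i = split_mixture_parents({"T1", "T2"}, {"X1", "T1"})
```

In this example the conditional density of $Z_{i}$ is stationary over gender ( $T_{2}\in\bm{U}_{i}$ ) but non-stationary over time ( $T_{1}\in\bm{V}_{i}$ ).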

4.3 Global Markov Property

The factorization in Equation (5) implies certain CI relations. In this section, we will identify the CI relations by deriving a global Markov property similar to the traditional DAG case.

There exists a DAG $\mathbb{G}_{\bm{T}}$ for each instantiation of $\bm{T}$ because $\textnormal{Pa}_{\bm{T}}(Z_{i})$ is defined for all $Z_{i}\in\bm{Z}$ . Consider the collection $\mathcal{G}$ consisting of all DAGs indexed by $\bm{T}$ . The number of DAGs over $\bm{Z}$ is finite, so $|\mathcal{G}|=q\in\mathbb{N}^{+}$ . Let $\mathcal{T}$ denote the values of $\bm{T}$ corresponding to any member of $\mathcal{G}$ , and $\mathcal{T}^{j}$ to those for $\mathbb{G}^{j}\in\mathcal{G}$ . We can then rewrite Equation (5) as:

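$$\prod_{i=1}^{p+s}f(Z_{i}|\textnormal{Pa}_{\bm{T}}(Z_{i}))=\sum_{j=1}^{q}\mathbbm{1}_{\bm{T}\in\mathcal{T}^{j}}\prod_{i=1}^{p+s}f(Z_{i}|\textnormal{Pa}^{j}(Z_{i})),$$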

where $\textnormal{Pa}^{j}(Z_{i})$ refers to the parents of $Z_{i}$ in $\mathbb{G}^{j}$ , and $\textnormal{Ch}^{j}(Z_{i})$ the children. If $\bm{A}\subseteq\bm{Z}$ , then we use $\bm{A}^{j}$ to more directly refer to the vertices corresponding to $\mathbb{G}^{j}$ . We let $\bm{A}^{\prime}=\cup_{j=1}^{q}\bm{A}^{j}$ .

We adopt the following procedure:

1. Plot each of the $q$ DAGs in $\mathcal{G}$ adjacent to each other.

2. Combine the vertices $T_{i}^{\prime}$ into a single vertex $T_{i}$ for each $T_{i}\in\bm{T}$ .

3. Add additional directed edges so that the children of $T_{i}$ correspond to $\textnormal{Ch}^{\prime}(T_{i})$ for each $T_{i}\in\bm{T}$ .

Denote the resultant graph as the mixture graph $\mathbb{M}$ . If $\bm{A}\subseteq\bm{T}$ , then $\bm{A}^{\prime}=\bm{A}$ in $\mathbb{M}$ due to step 2 above. We can read off the implied CI relations from $\mathbb{M}$ by utilizing d-separation across groups of vertices rather than just singletons.

Theorem 1.

(Global Markov property) Let $\bm{A},\bm{B},\bm{C}$ denote disjoint subsets of $\bm{Z}$ . If $\bm{A}^{\prime}\perp\!\!\!\perp_{d}\bm{B}^{\prime}|\bm{C}^{\prime}$ in $\mathbb{M}$ , then $\bm{A}\perp\!\!\!\perp\bm{B}|\bm{C}$ .

We refer to the reverse direction as d-separation faithfulness with respect to $\mathbb{M}$ . The result improves that of [17] (Appendix 8.3) and extends that of [18] when $\bm{T}$ is partially observed, continuous or multivariate. We provide an example in Figure 2. Suppose $\bm{T}=\{T_{1}\}$ indexes two DAGs. We plot the two DAGs next to each other in Figure 2 (a) and combine the vertices associated with $\bm{T}$ as in Figure 2 (b). We have $X_{1}^{1}\rightarrow X_{2}^{1}$ in the first DAG and $X_{2}^{2}\rightarrow X_{3}^{2}$ in the second; however, we do not have the directed path $X_{1}^{j}\rightarrow X_{2}^{j}\rightarrow X_{3}^{j}$ in either DAG. We also have the relation $\{X_{1}^{1},X_{1}^{2}\}\perp\!\!\!\perp_{d}\{X_{3}^{1},X_{3}^{2}\}$ , so $\mathbb{M}$ implies $X_{1}\perp\!\!\!\perp X_{3}$ per Theorem 1.
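The three-step construction of $\mathbb{M}$ and the d-separation statement in the example can be sketched in code. Several things here are assumptions rather than content of the paper: Figure 2 is not reproduced, so we assume that $T_{1}$ is a parent of $X_{2}$ and $X_{3}$ in both DAGs; we read $\textnormal{Ch}^{\prime}(T_{i})$ as every copy of any vertex that is a child of $T_{i}$ in at least one DAG; and all function names are hypothetical. d-separation is tested with the moralization criterion for DAGs.

```python
from itertools import combinations

def mixture_graph(dags, mix_vars):
    """Build M as {vertex: set(parents)} from a list of DAGs given as {child: [parents]}."""
    def label(v, j):
        return v if v in mix_vars else f"{v}^{j}"   # step 2: one shared vertex per T_i
    all_vars, t_children = set(), {t: set() for t in mix_vars}
    for dag in dags:
        for child, pas in dag.items():
            all_vars.add(child)
            all_vars.update(pas)
            for p in pas:
                if p in mix_vars:
                    t_children[p].add(child)
    parents = {}
    for j, dag in enumerate(dags, start=1):          # step 1: copy every DAG side by side
        for v in all_vars:
            parents.setdefault(label(v, j), set())
        for child, pas in dag.items():
            parents[label(child, j)].update(label(p, j) for p in pas)
    for t, children in t_children.items():           # step 3: T_i points to all copies
        for child in children:
            for j in range(1, len(dags) + 1):
                parents[label(child, j)].add(t)
    return parents

def copies(v, q):
    """All q copies of the vertex v, i.e. the primed set for a single variable."""
    return {f"{v}^{j}" for j in range(1, q + 1)}

def d_separated(parents, A, B, C):
    """Moralization test for d-separation of A and B by C in a DAG."""
    A, B, C = set(A), set(B), set(C)
    keep, stack = set(A | B | C), list(A | B | C)
    while stack:                                      # collect ancestors of A, B and C
        for p in parents.get(stack.pop(), ()):
            if p not in keep:
                keep.add(p)
                stack.append(p)
    nbrs = {v: set() for v in keep}
    for v in keep:
        pa = list(parents.get(v, ()))
        for p in pa:
            nbrs[v].add(p)
            nbrs[p].add(v)
        for p, r in combinations(pa, 2):              # marry parents of a common child
            nbrs[p].add(r)
            nbrs[r].add(p)
    seen, stack = set(A), list(A)
    while stack:
        v = stack.pop()
        if v in B:
            return False
        for u in nbrs[v]:
            if u not in seen and u not in C:
                seen.add(u)
                stack.append(u)
    return True

dags = [
    {"X2": ["X1", "T1"], "X3": ["T1"]},               # first DAG: X1 -> X2
    {"X2": ["T1"], "X3": ["X2", "T1"]},               # second DAG: X2 -> X3
]
M = mixture_graph(dags, {"T1"})
```

Under these assumptions, `d_separated(M, copies("X1", 2), copies("X3", 2), set())` holds, which corresponds to $X_{1}\perp\!\!\!\perp X_{3}$ via Theorem 1. Conditioning on both copies of $X_{2}$ opens the collider at $X_{2}^{1}$ through $T_{1}$ in this sketch, so no conditional independence given $X_{2}$ is implied.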

5 Causal Inference over Mixtures

5.1 Fused Graph

We construct a fused graph $\mathbb{F}$ as follows. Create a vertex for every variable in $\bm{Z}$ . Draw a directed edge $Z_{i}\rightarrow Z_{j}$ if and only if $Z_{i}^{\prime}$ is a direct cause of $Z_{j}^{\prime}$ in $\mathbb{M}$ , so that $\mathbb{F}$ may contain cycles. The fused graph is more intuitive than the mixture graph because $\mathbb{F}$ summarizes cycles in one directed graph. We will utilize the global Markov property of $\mathbb{M}$ in order to recover a mixed graph $\mathbb{F}^{*}$ summarizing the ancestral relations in $\mathbb{F}$ . For example, suppose we have the mixture graph drawn in Figure 3 (a). We consider a cycle involving $\{X_{1},X_{2},X_{4}\}$ and consider two slow causal relations: $X_{2}\rightarrow X_{4}$ and $X_{4}\rightarrow X_{1}$ . We thus have $X_{2}\rightarrow X_{4}$ in the first DAG in $\mathbb{M}$ , but $X_{4}$ is overwritten by this causal relation, so we do not observe $X_{4}\rightarrow X_{1}$ . Likewise, we have $X_{4}\rightarrow X_{1}$ in the second DAG, but $X_{2}$ is overwritten, so we do not observe $X_{2}\rightarrow X_{4}$ . We therefore cannot observe $X_{2}$ causing $X_{1}$ in either DAG due to the two rate limiting steps even though $X_{2}$ causes $X_{1}$ in the cycle involving $\{X_{1},X_{2},X_{4}\}$ . Moreover, if we intervene on the value of $X_{2}$ , then $X_{2}$ cannot be overwritten in the second DAG, so we would observe $X_{2}$ causing $X_{1}$ . Now we have also drawn out $\mathbb{F}$ in Figure 3 (b). $X_{2}$ is an ancestor of $X_{1}$ in $\mathbb{F}$ even though $\{X_{2}^{1},X_{2}^{2}\}$ is not an ancestor of $\{X_{1}^{1},X_{1}^{2}\}$ in $\mathbb{M}$ . Discovering $\mathbb{F}^{*}$ thus allows us to infer cycles that are not present within $\mathbb{M}$ but exist once the DAGs are combined in $\mathbb{F}$ .
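The fused graph can be sketched as the union of the directed edges of the component DAGs. In the example below, the edge $X_{1}\rightarrow X_{2}$ and its presence in both DAGs are assumptions added to close the cycle over $\{X_{1},X_{2},X_{4}\}$ , since Figure 3 is not reproduced here; the function names are also hypothetical.

```python
def fused_graph(dags):
    """Union the edges of all DAGs; the result {child: set(parents)} may contain cycles."""
    fused = {}
    for dag in dags:
        for child, pas in dag.items():
            fused.setdefault(child, set()).update(pas)
            for p in pas:
                fused.setdefault(p, set())
    return fused

def is_ancestor(parents, a, b):
    """True if a equals b or there is a directed path from a to b."""
    seen, stack = {b}, [b]
    while stack:
        v = stack.pop()
        if v == a:
            return True
        for p in parents.get(v, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return False

dag_1 = {"X2": ["X1"], "X4": ["X2"]}   # slow relation X2 -> X4 is observed
dag_2 = {"X2": ["X1"], "X1": ["X4"]}   # slow relation X4 -> X1 is observed
F = fused_graph([dag_1, dag_2])
```

Here `is_ancestor(F, "X2", "X1")` is true through $X_{2}\rightarrow X_{4}\rightarrow X_{1}$ , while the same query is false in `dag_1` and in `dag_2` taken separately, which mirrors the point that $\mathbb{F}$ exposes ancestral relations that no single DAG in $\mathbb{M}$ contains.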

5.2 Strategy

We unfortunately cannot recover non-ancestral relations using a CI oracle alone (Appendix 8.4), but we can recover ancestral relations. We therefore rely on additional time information to orient arrowheads by utilizing longitudinal data, or data arising from a longitudinal density. We can partition the observed variables into $w$ sets or waves so that $\bm{O}=\cup_{k=1}^{w}\tensor[^{k}]{\bm{O}}{}$ . We have the following definition:

Definition 1.

(Longitudinal density) A longitudinal density is a density $f(\cup_{k=1}^{w}\tensor[^{k}]{\bm{O}}{},\bm{L},\bm{S})$ that factorizes according to Equation (5) such that no variable in wave $j$ is an ancestor of a variable in wave $i<j$ and $w\geq 2$ .

Causation proceeds forward in time, so no variable in wave $j$ can be an ancestor of a variable in wave $i<j$ .

If $\bm{Y}\subseteq\bm{O}$ , then let $\tensor[^{a}]{\bm{Y}}{}$ and $\tensor[^{a}]{\bm{Y}}{}^{\prime}$ denote $\bm{Y}\cap\tensor[^{a}]{\bm{O}}{}$ and $[\tensor[^{a}]{\bm{Y}}{}]^{\prime}$ , respectively. We write $\tensor*[^{c}_{d}]{\textnormal{Adj}}{}_{\mathbb{F}^{*}}(\tensor[^{a}]{O}{{}_{i}})$ to mean those variables between waves $c$ and $d$ inclusive that are adjacent to $\tensor[^{a}]{O}{{}_{i}}$ in $\mathbb{F}^{*}$ . We will specifically construct $\mathbb{F}^{*}$ with the following adjacencies:

List 1.

(Adjacency Interpretations)

1. If we have $\tensor[^{a}]{O}{{}_{i}}*\!\!-\!\!*\tensor[^{b}]{O}{{}_{j}}$ (with possibly $a=b$ ), then $\tensor[^{a}]{O}{}_{i}^{\prime}\not\perp\!\!\!\perp_{d}\tensor[^{b}]{O}{}_{j}^{\prime}|\bm{W}^{\prime}\cup\bm{S}^{\prime}$ in $\mathbb{M}$ for all $\bm{W}\subseteq\tensor*[^{a}_{b}]{\textnormal{Adj}}{{}_{\mathbb{F}^{*}}}(\tensor[^{a}]{O}{{}_{i}})\setminus\tensor[^{b}]{O}{{}_{j}}$ and all $\bm{W}\subseteq\tensor*[^{a}_{b}]{\textnormal{Adj}}{{}_{\mathbb{F}^{*}}}(\tensor[^{b}]{O}{{}_{j}})\setminus\tensor[^{a}]{O}{{}_{i}}$ .

2. If we do not have $\tensor[^{a}]{O}{{}_{i}}*\!\!-\!\!*\tensor[^{b}]{O}{{}_{j}}$ (with possibly $a=b$ ), then $\tensor[^{a}]{O}{}_{i}^{\prime}\perp\!\!\!\perp_{d}\tensor[^{b}]{O}{}_{j}^{\prime}|\bm{W}^{\prime}\cup\bm{S}^{\prime}$ in $\mathbb{M}$ for some $\bm{W}\subseteq\bm{O}\setminus\{\tensor[^{a}]{O}{{}_{i}},\tensor[^{b}]{O}{{}_{j}}\}$ between waves $a$ and $b$ inclusive.

The endpoints of $\mathbb{F}^{*}$ have the following modified interpretations:

List 2.

(Endpoint Interpretations)

1. If $O_{i}*\!\!\rightarrow O_{j}$ , then $O_{j}\not\in\textnormal{Anc}_{\mathbb{F}}(O_{i})$ .

2. If $O_{i}*\!\!-O_{j}$ , then $O_{j}\in\textnormal{Anc}_{\mathbb{F}}(O_{i}\cup\bm{S})$ .

The arrowheads do not take into account selection variables because we often cannot a priori specify whether a variable is an ancestor of $\bm{S}$ in $\mathbb{F}$ using either wave information or other prior knowledge in practice. We draw an example of $\mathbb{M}$ in Figure 3 (a), its fused graph $\mathbb{F}$ in Figure 3 (b) and the corresponding mixed graph $\mathbb{F}^{*}$ in Figure 3 (c), where $\bm{O}=\bm{X}$ , $\bm{L}=\{T_{1},T_{2}\}$ , $\bm{S}=\emptyset$ and $w=2$ .

5.3 Algorithm

We cannot apply an existing constraint-based algorithm like FCI on data arising from a mixture of DAGs and expect to recover a partially oriented $\mathbb{F}^{*}$ (Appendix 8.5). We therefore propose a new algorithm called Causal Inference over Mixtures (CIM) which correctly recovers causal relations. We summarize the procedure in Algorithm 1.

The CIM algorithm works as follows. First, CIM runs a variant of PC-stable’s skeleton discovery procedure in order to discover adjacencies as well as minimal separating sets in Step 1 [19]. This step recovers the adjacencies with interpretations listed in List 1. The algorithm stores the minimal separating sets in the array Sep so that $\textnormal{Sep}(O_{i},O_{k})$ contains a minimal separating set of $O_{i}$ and $O_{k}$ , if such a set exists. CIM next adds arrowheads in Step 1 using wave information from a longitudinal dataset with the list $\mathcal{W}$ . If we have $\tensor[^{a}]{O}{{}_{i}}{\circ\hskip 1.13809pt\!\!\!-\hskip 1.13809pt\!\!\!\circ}\tensor[^{b}]{O}{{}_{j}}$ with $b>a$ , then CIM orients $\tensor[^{a}]{O}{{}_{i}}{\circ\hskip 0.85358pt\!\!\!\rightarrow}\tensor[^{b}]{O}{{}_{j}}$ because $\tensor[^{b}]{O}{{}_{j}}\not\in\textnormal{Anc}_{\mathbb{F}}(\tensor[^{a}]{O}{{}_{i}})$ . We can orient additional arrowheads using other prior knowledge $\mathcal{P}$ . Step 1 orients many arrowheads in practice, so long as we have at least two waves of data and repeated measurements.
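The wave-based orientation rule described above can be sketched as follows. This is an illustration of the single rule stated in the text, not a reproduction of Algorithm 1; the data structure (a dictionary holding the endpoint mark at the second vertex of each ordered pair) and the vertex names are assumptions.

```python
def orient_by_waves(adjacencies, wave):
    """Return marks[(u, v)], the endpoint mark at v on the edge between u and v."""
    marks = {}
    for u, v in adjacencies:
        # Every discovered adjacency starts as a nondirected edge u o-o v.
        marks[(u, v)] = "circle"
        marks[(v, u)] = "circle"
        # A later-wave variable cannot be an ancestor of an earlier-wave variable,
        # so the endpoint at the later-wave variable becomes an arrowhead.
        if wave[u] < wave[v]:
            marks[(u, v)] = "arrowhead"
        elif wave[v] < wave[u]:
            marks[(v, u)] = "arrowhead"
    return marks

# Hypothetical skeleton: O1 sits in wave 1, O2 and O3 in wave 2.
wave = {"O1": 1, "O2": 2, "O3": 2}
marks = orient_by_waves([("O1", "O2"), ("O3", "O2")], wave)
```

The edge between `O1` and `O2` becomes partially directed with an arrowhead at `O2`, while the within-wave edge between `O2` and `O3` keeps circles at both endpoints because wave information says nothing about it.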

For every triple $O_{i}*\!\!\rightarrow O_{j}*\!\!-\!\!*O_{k}$ with $O_{i}$ and $O_{k}$ non-adjacent, CIM then attempts to find a minimal separating set that contains $O_{j}$ in Step 1. These sets are important due to the following lemma which allows us to infer tails in Step 1:

Lemma 1.

Suppose $O_{i}^{\prime}\perp\!\!\!\perp_{d}O_{j}^{\prime}|\bm{W}^{\prime}\cup\bm{S}^{\prime}$ in $\mathbb{M}$ but $O_{i}^{\prime}\not\perp\!\!\!\perp_{d}O_{j}^{\prime}|\bm{V}^{\prime}\cup\bm{S}^{\prime}$ for every $\bm{V}\subset\bm{W}$ . If $O_{k}\in\bm{W}$ , then $O_{k}\in\textnormal{Anc}_{\mathbb{F}}(\{O_{i},O_{j}\}\cup\bm{S})$ .

CIM finally adds some additional tails in Step 1 due to transitivity of the tails. The algorithm has the same polynomial time complexity as PC-stable due to Step 1.
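One way to see how Lemma 1 yields a tail, using only the endpoint interpretations in List 2, is the following short derivation. It is a sketch of the reasoning for a triple $O_{i}*\!\!\rightarrow O_{j}*\!\!-\!\!*O_{k}$ whose minimal separating set for $O_{i}$ and $O_{k}$ contains $O_{j}$ , and it is not taken from the proofs in the Appendix.

$$\begin{aligned}O_{j}\in\textnormal{Sep}(O_{i},O_{k})&\implies O_{j}\in\textnormal{Anc}_{\mathbb{F}}(\{O_{i},O_{k}\}\cup\bm{S})&&\text{(Lemma 1)}\\O_{i}*\!\!\rightarrow O_{j}&\implies O_{j}\not\in\textnormal{Anc}_{\mathbb{F}}(O_{i})&&\text{(List 2, item 1)}\\&\implies O_{j}\in\textnormal{Anc}_{\mathbb{F}}(O_{k}\cup\bm{S})&&\text{(combine the two lines)}\end{aligned}$$

By item 2 of List 2, the last line is exactly the interpretation of a tail at $O_{j}$ on the edge between $O_{k}$ and $O_{j}$ .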

We now formally claim that Algorithm 1 is sound:

Theorem 2.

Suppose the longitudinal density $f(\cup_{k=1}^{w}\tensor[^{k}]{\bm{O}}{},\bm{L},\bm{S})$ factorizes according to Equation (5). Assume that all arrowheads deduced from $\mathcal{P}$ are correct. Then, under d-separation faithfulness w.r.t. $\mathbb{M}$ , the CIM algorithm returns the mixed graph $\mathbb{F}^{*}$ partially oriented.

6 Experiments

We had two overarching goals: (1) evaluate the performance of CIM against other constraint-based algorithms using real data, and (2) determine if we can reconstruct the real data results using synthetic data sampled from a mixture of DAGs. We therefore utilized the setup described below.

6.1 Algorithms

We compared the following five constraint-based algorithms in recovering the ancestral and nonancestral relations in $\mathbb{F}$ : CIM, PC, FCI, RFCI and CCI. We equipped all algorithms with a nonparametric CI test called GCM [20] and fixed $\alpha=0.01$ across all experiments. We gave all algorithms the same wave information during skeleton discovery in order to orient arrowheads between the waves. The algorithms perform much worse without the additional knowledge.

6.1.1 Metrics

Let tails refer to positives and arrowheads to negatives. CIM only infers tails, so we cannot compute the number of true and false negatives. We can however compute the number of true positives and false positives.

We therefore evaluated the algorithms using sensitivity and fallout. The sensitivity is defined as $TP/P$ , where $TP$ refers to true positives and $P$ to positives. The fallout is defined as $FP/N$ , where $FP$ refers to false positives and $N$ to negatives. A tail in place of an arrowhead corresponds to a false positive.

The receiver operating characteristic (ROC) curve plots sensitivity against the fallout. Perfect accuracy corresponds to a sensitivity of one and a fallout of zero at the upper left hand corner of the ROC curve. Constraint-based algorithms do not output a continuous score required to compute the area under the ROC curve, but we can assess overall performance using the Euclidean distance from the upper left hand corner [21].
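The three quantities used for evaluation can be computed directly from counts. The sketch below follows the definitions in the text; the function names and the example counts are hypothetical and are not results from the paper.

```python
import math

def sensitivity(tp, p):
    """TP / P, where tails are positives."""
    return tp / p

def fallout(fp, n):
    """FP / N, where arrowheads are negatives and a tail in place of an arrowhead is a false positive."""
    return fp / n

def distance_from_corner(sens, fall):
    """Euclidean distance from the upper left corner (fallout 0, sensitivity 1) of the ROC plane."""
    return math.hypot(fall, 1.0 - sens)

# Hypothetical counts: 6 of 8 true tails recovered, 2 tails placed on 10 true arrowheads.
sens = sensitivity(6, 8)
fall = fallout(2, 10)
overall = distance_from_corner(sens, fall)
```

A smaller `overall` value indicates better combined performance, with zero corresponding to perfect accuracy.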

6.2 Real Data

6.2.1 Framingham Heart Study

We first evaluated the algorithms on real data. We considered the Framingham Heart Study, where investigators measured cardiovascular changes across time in residents of Framingham, Massachusetts [22]. The dataset contains three waves of data with 8 variables in each wave. We obtained 2019 samples after removing patients with missing values.

The dataset contains the following known direct causal relations: (1) number of cigarettes per day causes heart rate via cardiac nicotinic acetylcholine receptors [23, 24, 25]; (2) age causes systolic blood pressure due to increased large artery stiffness [26, 27]; (3) age causes cholesterol levels due to changes in cholesterol and lipoprotein metabolism [28]; (4) BMI causes number of cigarettes per day because smoking cigarettes is a common weight loss strategy [29, 30]; (5) systolic blood pressure causes diastolic blood pressure and vice versa by definition, because both quantities refer to pressure in the same arteries at different points in time. We can compute sensitivity using this information.

We summarize the results over 50 bootstrapped datasets in Figures 4 (a, b, c). We first evaluated sensitivity by running the algorithms using the full wave information. RFCI, FCI and CCI oriented few tails overall, so they obtained lower sensitivity scores (Figure 4 (a)). PC and CIM had similar sensitivities (t=-0.80, p=0.43). We next combined waves 2 and 3, so that the algorithms could incorrectly orient tails backwards in time. CIM made fewer errors than PC as indicated by a lower fallout (Figure 4 (b), t=-11.85, p=5.37E-16). FCI, RFCI and CCI also achieved low fallout scores, but they again did not orient many tails to begin with. CIM therefore obtained the best overall score when we combined sensitivity and fallout (Figure 4 (c), t=-5.60, p=9.70E-7). We conclude that both CIM and PC orient many tails, but CIM makes fewer errors as evidenced by its high sensitivity and low fallout. We therefore prefer CIM in this dataset.

6.2.2 Sequenced Treatment Alternatives to Relieve Depression Trial

We next analyzed Level 1 of the Sequenced Treatment Alternatives to Relieve Depression (STAR∗D) trial [31]. Investigators gave patients an antidepressant called citalopram and then tracked their depressive symptoms using a standardized questionnaire called QIDS-SR-16. We analyzed the 9 QIDS-SR-16 sub-scores measuring components of depression at weeks 0, 2 and 4. We also included age and gender in the first wave. The dataset contains 2043 subjects after removing subjects with missing values.

The 9 QIDS-SR-16 subscores include sleep, mood, appetite, concentration, self-esteem, thoughts of death, interest, energy and psychomotor changes. We asked a psychiatrist to identify direct ground truth causal relations among the subscores before we ran the experiments. The ground truth includes: (1) sleep causes mood [32]; (2) energy causes psychomotor changes; (3) appetite causes energy; (4, 5) mood causes appetite and self-esteem [33]; (6) psychomotor changes cause concentration; (7, 8) mood and self-esteem cause thoughts of death [34].

We summarize the sensitivity, fallout and overall performance over 50 bootstrapped datasets in Figures 4 (d, e, f). CIM achieved higher sensitivity than all other algorithms (Figure 4 (d), t=5.66, p=7.86E-7). CIM also had a smaller fallout score compared to PC (Figure 4 (e), t=-19.19, p<2.20E-16). CIM therefore obtained the highest overall score compared to the other algorithms (Figure 4 (f), t=-14.95, p<2.20E-16). These results corroborate the superiority of CIM in a second real dataset.

6.3 Synthetic Data

We next sampled from a mixture of DAGs to see if we could replicate the real data results. We specifically instantiated a linear DAG with an expected neighborhood size of 2, $p=24$ vertices and linear coefficients drawn from Uniform( $[-1,-0.25]\cup[0.25,1]$ ). We then uniformly instantiated $q=5$ to 15 binary variables for $\bm{T}$ and block randomized the edges in the DAG to each element of $\bm{T}$ . We assigned the first 8 variables to wave 1, the second 8 to wave 2, and the third 8 to wave 3. We added a directed edge from the $n^{\textnormal{th}}$ variable in wave 1 to the $n^{\textnormal{th}}$ variable in wave 2, and similarly added the directed edges from wave 2 to wave 3 in order to model self-loops. We randomly selected a set of 0-2 latent common causes without replacement from $\bm{X}$ , which we placed in $\bm{L}$ in addition to the variables in $\bm{T}$ . We then selected a set of 0-2 selection variables $\bm{S}$ without replacement from the set $\bm{X}\setminus\bm{L}$ .

We uniformly instantiated the mixing probabilities $f(T_{i}=0)$ and $f(T_{i}=1)$ for each $T_{i}\in\bm{T}$ . We then generated 2000 samples as follows. For each sample, we drew an instantiation $\bm{T}=\bm{t}$ according to $\prod_{i=1}^{q}f(T_{i})$ and created a graph containing the union of the edges associated with those elements in $\bm{t}$ equal to one. We then sampled the resultant DAG using a multivariate Gaussian distribution. We finally removed the latent variables and introduced selection bias by removing the bottom $k^{\textnormal{th}}$ percentile for each selection variable, with $k$ chosen uniformly between 10 and 50.
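The sampling procedure above can be sketched roughly as follows. This is a simplified illustration, not the code used for the experiments: it omits the wave-to-wave edges and the latent variables, it assigns each edge to a single element of $\bm{T}$ as a stand-in for the block randomization, and the function names and small default sizes are assumptions chosen for readability (the text uses $p=24$, $q$ between 5 and 15, and 2000 samples).

```python
import random


def sample_mixture_of_dags(p=6, q=3, n_samples=5, edge_prob=0.3, seed=0):
    """Draw samples where each one follows the DAG selected by binary T."""
    rng = random.Random(seed)
    # Random DAG over X_0..X_{p-1}: edges only go from lower to higher index.
    edges = [(i, j) for i in range(p) for j in range(i + 1, p)
             if rng.random() < edge_prob]
    # Linear coefficients from Uniform([-1, -0.25] U [0.25, 1]).
    weight = {e: rng.choice([-1, 1]) * rng.uniform(0.25, 1.0) for e in edges}
    # Assumption: every edge belongs to exactly one element of T.
    owner = {e: rng.randrange(q) for e in edges}
    # Mixing probabilities f(T_i = 1), drawn uniformly.
    prob_one = [rng.random() for _ in range(q)]

    samples = []
    for _ in range(n_samples):
        t = [1 if rng.random() < prob_one[i] else 0 for i in range(q)]
        # Keep the union of the edges whose element of T equals one.
        active = [e for e in edges if t[owner[e]] == 1]
        x = [0.0] * p
        for j in range(p):  # index order is a topological order
            x[j] = rng.gauss(0.0, 1.0) + sum(
                weight[(i, k)] * x[i] for (i, k) in active if k == j)
        samples.append((t, x))
    return edges, samples


def apply_selection(samples, var_index, k):
    """Drop samples in the bottom k-th percentile of one selection variable."""
    values = sorted(x[var_index] for _, x in samples)
    cutoff = values[int(len(values) * k / 100)]
    return [(t, x) for t, x in samples if x[var_index] >= cutoff]


edges, samples = sample_mixture_of_dags(n_samples=20)
kept = apply_selection(samples, var_index=0, k=50)
```

Because each sample draws its own $\bm{t}$, different samples can follow different DAGs even though every individual sample is generated from an acyclic graph.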

We report the results in Figures 4 (g, h, i) after repeating the above process 50 times. We computed the sensitivity and fallout using the ground truth in waves 2 and 3. CIM achieved the highest sensitivity (Figure 4 (g), t=3.71, p=5.35E-4). PC obtained the second highest sensitivity, but CIM had a lower fallout than PC (Figure 4 (h), t=-4.63, p=2.72E-5). CIM ultimately achieved the best overall score (Figure 4 (i), t=-3.78, p=4.37E-4). We conclude that the synthetic data results mimic those seen with the real data.

7 Conclusion

We proposed to model causal processes in biomedicine using a mixture of DAGs to accommodate non-stationary distributions and cycles. We then introduced a constraint-based algorithm called CIM to infer causal relations from data even with latent variables and selection bias. The CIM algorithm outperforms existing constraint-based algorithms across multiple metrics and datasets. CIM thus advances the state of the art in causal discovery from biomedical data.

8 Appendix

8.1 Proofs

Theorem 1.

Let $\bm{A},\bm{B},\bm{C}$ denote disjoint subsets of $\bm{Z}$ . If $\bm{A}^{\prime}\perp\!\!\!\perp_{d}\bm{B}^{\prime}|\bm{C}^{\prime}$ in $\mathbb{M}$ , then $\bm{A}\perp\!\!\!\perp\bm{B}|\bm{C}$ .

Proof.

We first consider the moral graph of $\mathbb{M}$. Let $\overline{\mathbb{M}}$ denote the moral graph of $\overline{\textnormal{Anc}}_{\mathbb{M}}(\bm{A}^{\prime}\cup\bm{B}^{\prime}\cup\bm{C}^{\prime})$, the smallest ancestral set of $\bm{A}^{\prime}\cup\bm{B}^{\prime}\cup\bm{C}^{\prime}$ in $\mathbb{M}$ such that, if $Z_{i}\in\overline{\textnormal{Anc}}_{\mathbb{M}}(\bm{A}^{\prime}\cup\bm{B}^{\prime}\cup\bm{C}^{\prime})$, then $Z_{i}^{\prime}\subseteq\overline{\textnormal{Anc}}_{\mathbb{M}}(\bm{A}^{\prime}\cup\bm{B}^{\prime}\cup\bm{C}^{\prime})$. We then consider a partition of the vertices $\ddot{\bm{A}}\cup\ddot{\bm{B}}\cup\bm{C}^{\prime}=\overline{\textnormal{Anc}}_{\mathbb{M}}(\bm{A}^{\prime}\cup\bm{B}^{\prime}\cup\bm{C}^{\prime})$ so that $\bm{A}^{\prime}\subseteq\ddot{\bm{A}}$, $\bm{B}^{\prime}\subseteq\ddot{\bm{B}}$, and $\ddot{\bm{A}}$, $\ddot{\bm{B}}$ and $\bm{C}^{\prime}$ are disjoint sets of vertices. We require that $\ddot{\bm{A}}$ and $\ddot{\bm{B}}$ be separated by $\bm{C}^{\prime}$ in $\overline{\mathbb{M}}$; in other words, there does not exist an undirected path between $\ddot{\bm{A}}$ and $\ddot{\bm{B}}$ that is active given $\bm{C}^{\prime}$.

We now construct such a partition $(\ddot{\bm{A}},\ddot{\bm{B}})$. First set $\ddot{\bm{A}}$ to $\bm{A}^{\prime}$ and $\ddot{\bm{B}}$ to $\bm{B}^{\prime}$. We have $\bm{A}^{\prime}\perp\!\!\!\perp_{d}\bm{B}^{\prime}|\bm{C}^{\prime}$ in $\mathbb{M}$ if and only if $\bm{A}^{\prime}$ and $\bm{B}^{\prime}$ are separated by $\bm{C}^{\prime}$ in $\overline{\mathbb{M}}$ (Lemma 2 in [16]). $\ddot{\bm{A}}$ and $\ddot{\bm{B}}$ are therefore separated by $\bm{C}^{\prime}$ in $\overline{\mathbb{M}}$ at the moment. Now consider the set of vertices $\bm{H}^{\prime}=\overline{\textnormal{Anc}}_{\mathbb{M}}(\bm{A}^{\prime}\cup\bm{B}^{\prime}\cup\bm{C}^{\prime})\setminus(\bm{A}^{\prime}\cup\bm{B}^{\prime}\cup\bm{C}^{\prime})$. We will put subsets of $\bm{H}^{\prime}$ into either $\ddot{\bm{A}}$ or $\ddot{\bm{B}}$. We have two situations for each $H_{i}^{\prime}\subseteq\bm{H}^{\prime}$.

1. In $\overline{\mathbb{M}}$, there does not exist an undirected path between $H_{i}^{\prime}$ and $\bm{A}^{\prime}$ or an undirected path between $H_{i}^{\prime}$ and $\bm{B}^{\prime}$ (or both) that is active given $\bm{C}^{\prime}$. More specifically:

(a) If there does not exist an undirected path between $H_{i}^{\prime}$ and $\bm{A}^{\prime}$ that is active given $\bm{C}^{\prime}$, but such a path exists between $H_{i}^{\prime}$ and $\bm{B}^{\prime}$, then include $H_{i}^{\prime}$ into $\ddot{\bm{B}}$.

(b) If there does not exist an undirected path between $H_{i}^{\prime}$ and $\bm{B}^{\prime}$ that is active given $\bm{C}^{\prime}$, but such a path exists between $H_{i}^{\prime}$ and $\bm{A}^{\prime}$, then include $H_{i}^{\prime}$ into $\ddot{\bm{A}}$.

(c) If there does not exist an undirected path between $H_{i}^{\prime}$ and $\bm{A}^{\prime}$ that is active given $\bm{C}^{\prime}$ and there likewise does not exist such a path between $H_{i}^{\prime}$ and $\bm{B}^{\prime}$, then include $H_{i}^{\prime}$ into $\ddot{\bm{A}}$.

2. In $\overline{\mathbb{M}}$, there exists an undirected path between $H_{i}^{\prime}$ and $\bm{A}^{\prime}$ and an undirected path between $H_{i}^{\prime}$ and $\bm{B}^{\prime}$ that are both active given $\bm{C}^{\prime}$. We have two cases:

(a) There exists an undirected path between $H_{i}^{m}$ and $\bm{A}^{\prime}$ and an undirected path between $H_{i}^{m}$ and $\bm{B}^{\prime}$ that are both active given $\bm{C}^{\prime}$. But this implies that $\bm{A}^{\prime}$ and $\bm{B}^{\prime}$ are connected given $\bm{C}^{\prime}$ in $\overline{\mathbb{M}}$ via $H_{i}^{m}$ - a contradiction.

(b) There exists an undirected path between $H_{i}^{m}$ and $\bm{A}^{\prime}$ and an undirected path between $H_{i}^{n}$ ($m\not=n$) and $\bm{B}^{\prime}$ that are both active given $\bm{C}^{\prime}$ - denote these paths by $\Pi_{H_{i}^{m}\bm{A}^{\prime}}$ and $\Pi_{H_{i}^{n}\bm{B}^{\prime}}$, respectively. Note that there does not exist an undirected path between $H_{i}^{n}$ and $\bm{A}^{\prime}$ and an undirected path between $H_{i}^{m}$ and $\bm{B}^{\prime}$ that are both active given $\bm{C}^{\prime}$ per the argument in (a). We have two cases:

i. There does not exist a descendant of $\bm{T}$ on $\Pi_{H_{i}^{m}\bm{A}^{\prime}}$. Then $\Pi_{H_{i}^{m}\bm{A}^{\prime}}$ must be confined to the sub-graph $\mathbb{G}^{m}$ in $\overline{\mathbb{M}}$. But then an analogous undirected path $\Pi_{H_{i}^{n}\bm{A}^{\prime}}$ must be active in the sub-graph $\mathbb{G}^{n}$ - a contradiction. A similar argument also applies to $\Pi_{H_{i}^{n}\bm{B}^{\prime}}$.

ii. Let $\bm{R}$ denote all members of $\bm{T}$ that have a descendant on $\Pi_{H_{i}^{m}\bm{A}^{\prime}}$. We will construct an active path between $H_{i}^{n}$ and $\bm{A}^{\prime}$ in $\overline{\mathbb{M}}$ for a contradiction. We construct a path from $H_{i}^{n}$ in $\mathbb{G}^{n}$ using the corresponding vertices on $\Pi_{H_{i}^{m}\bm{A}^{\prime}}$ in $\mathbb{G}^{m}$ until we encounter the first child of $\bm{R}$, denoted by $X_{j}^{n}$. Consider the collection $\mathcal{P}=\{\Pi_{H_{i}^{n}X_{j}^{n}},X_{j}^{n}-R_{k}-X_{j}^{m},\Pi_{X_{j}^{m}\bm{A}^{\prime}}\}$, for some $R_{k}\in\bm{R}$, whose concatenation connects $H_{i}^{n}$ and $\bm{A}^{\prime}$ given $\bm{C}^{\prime}$ in $\overline{\mathbb{M}}$ - a contradiction. A similar argument also applies to $\Pi_{H_{i}^{n}\bm{B}^{\prime}}$.

We have exhausted all possibilities and therefore conclude that there cannot exist an undirected path between $H_{i}^{\prime}$ and $\bm{A}^{\prime}$ and an undirected path between $H_{i}^{\prime}$ and $\bm{B}^{\prime}$ that are both active given $\bm{C}^{\prime}$ .

We have constructed a disjoint partition of vertices $(\ddot{\bm{A}},\ddot{\bm{B}})$ such that $\ddot{\bm{A}}\cup\ddot{\bm{B}}\cup\bm{C}^{\prime}=\overline{\textnormal{Anc}}_{\mathbb{M}}(\bm{A}^{\prime}\cup\bm{B}^{\prime}\cup\bm{C}^{\prime})$. Moreover, $\ddot{\bm{A}}$ and $\ddot{\bm{B}}$ are separated given $\bm{C}^{\prime}$ in $\overline{\mathbb{M}}$ because, if an active path exists between $\ddot{\bm{A}}$ and $\ddot{\bm{B}}$ given $\bm{C}^{\prime}$, this implies the contradiction that there also exists an active path between $\ddot{\bm{A}}$ and $\bm{B}^{\prime}$ given $\bm{C}^{\prime}$.

We may then consider all of the cliques in $\overline{\mathbb{M}}$ corresponding to each vertex and its married parents. Denote this set of cliques as $\mathcal{E}$. Also let $\mathcal{E}_{\ddot{\bm{B}}}$ denote the set of cliques in $\mathcal{E}$ that have non-empty intersection with $\ddot{\bm{B}}$. Because $\ddot{\bm{A}}$ and $\ddot{\bm{B}}$ are separated given $\bm{C}^{\prime}$, the vertices $\ddot{\bm{A}}$ and $\ddot{\bm{B}}$ are also non-adjacent in $\overline{\mathbb{M}}$; this implies that no clique in $\mathcal{E}_{\ddot{\bm{B}}}$ can contain a member of $\ddot{\bm{A}}$. We also have $\ddot{\bm{B}}\cap e=\emptyset$ for all $e\in\mathcal{E}\setminus\mathcal{E}_{\ddot{\bm{B}}}$.

Consider an arbitrary graph $\mathbb{G}^{j}\in\mathcal{G}$ . We can write:

[TABLE]

where $\gamma^{j}$ is a placeholder for some non-negative function for $\mathbb{G}^{j}$ . Let $\bm{U}=\bm{T}\cap(\ddot{\bm{A}}\cup\bm{C})$ and $\bm{V}=\bm{T}\cap(\ddot{\bm{B}}\cup\bm{C})$ . Also let $\mathcal{U}$ denote the sub-sets of the sets in $\mathcal{T}$ corresponding to $\bm{U}$ - likewise for $\mathcal{V}$ . We then proceed by integrating out $[\ddot{\bm{A}}\cup\ddot{\bm{B}}]\setminus[\bm{A}\cup\bm{B}]$ :

[TABLE]

where $\gamma(\ddot{\bm{A}},\bm{C})=\gamma^{j}(\ddot{\bm{A}},\bm{C})$ when $\bm{U}\in\mathcal{U}^{j}$ and $\gamma(\ddot{\bm{B}},\bm{C})=\gamma^{j}(\ddot{\bm{B}},\bm{C})$ when $\bm{V}\in\mathcal{V}^{j}$ . We then finalize the integration:

[TABLE]

The third equality follows because $[\ddot{\bm{A}}\setminus\bm{A}]\cap[\ddot{\bm{B}}\setminus\bm{B}]=\emptyset$ by construction. The conclusion follows by the last equality. ∎
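The proof of Theorem 1 relies on the equivalence between d-separation and separation in the moral graph of an ancestral set (Lemma 2 in [16]). A minimal sketch of that standard criterion for an ordinary DAG is given below. It is an illustration only: the function names and the parent-dictionary representation are assumptions, and the sketch does not model the additional requirement used in the proof that the ancestral set be closed under the copies $Z_{i}^{\prime}$ of each variable.

```python
from collections import deque


def ancestors(parents, targets):
    """Return the targets together with all of their ancestors in a DAG."""
    seen, stack = set(targets), list(targets)
    while stack:
        v = stack.pop()
        for u in parents.get(v, ()):
            if u not in seen:
                seen.add(u)
                stack.append(u)
    return seen


def d_separated(parents, a, b, c):
    """Check A _||_d B | C for disjoint sets via the moral ancestral graph."""
    a, b, c = set(a), set(b), set(c)
    keep = ancestors(parents, a | b | c)
    # Moralize: link each vertex to its parents and marry the parents.
    nbrs = {v: set() for v in keep}
    for v in keep:
        pa = list(parents.get(v, ()))  # parents of a kept vertex are kept too
        for u in pa:
            nbrs[v].add(u)
            nbrs[u].add(v)
        for i, u in enumerate(pa):
            for w in pa[i + 1:]:
                nbrs[u].add(w)
                nbrs[w].add(u)
    # Breadth-first search from A that never steps onto a vertex in C.
    queue, reached = deque(a), set(a)
    while queue:
        v = queue.popleft()
        if v in b:
            return False
        for w in nbrs[v]:
            if w not in reached and w not in c:
                reached.add(w)
                queue.append(w)
    return True


# Collider X -> Z <- Y, written as a map from each vertex to its parents.
collider = {"Z": ["X", "Y"]}
marginal = d_separated(collider, {"X"}, {"Y"}, set())
given_z = d_separated(collider, {"X"}, {"Y"}, {"Z"})
```

In the collider example, `X` and `Y` are separated marginally because `Z` is not in the ancestral set, while conditioning on `Z` brings it in and marries its parents, which connects them.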

Lemma 1.

Suppose $O_{i}^{\prime}\perp\!\!\!\perp_{d}O_{j}^{\prime}|\bm{W}^{\prime}\cup\bm{S}^{\prime}$ in $\mathbb{M}$ but $O_{i}^{\prime}\not\perp\!\!\!\perp_{d}O_{j}^{\prime}|\bm{V}^{\prime}\cup\bm{S}^{\prime}$ for every $\bm{V}\subset\bm{W}$ . If $O_{k}\in\bm{W}$ , then $O_{k}\in\textnormal{Anc}_{\mathbb{F}}(\{O_{i},O_{j}\}\cup\bm{S})$ .

Proof.

We invoke Lemma 15 in [4] by setting $\bm{R}=\emptyset$ , $O_{i}=O_{i}^{\prime}$ , $O_{j}=O_{j}^{\prime}$ , $\bm{W}=\bm{W}^{\prime}$ and $\bm{S}=\bm{S}^{\prime}$ in that paper. We can then conclude that $O_{k}^{\prime}\in\textnormal{Anc}_{\mathbb{M}}(O_{i}^{\prime}\cup O_{j}^{\prime}\cup\bm{S}^{\prime})$ . Moreover, if $O_{k}^{\prime}\in\textnormal{Anc}_{\mathbb{M}}(O_{i}^{\prime}\cup O_{j}^{\prime}\cup\bm{S}^{\prime})$ , then $O_{k}\in\textnormal{Anc}_{\mathbb{F}}(O_{i}\cup O_{j}\cup\bm{S})$ by construction of $\mathbb{F}$ . ∎

Theorem 2.

Suppose the longitudinal density $f(\cup_{k=1}^{w}\tensor[^{k}]{\bm{O}}{},\bm{L},\bm{S})$ factorizes according to Equation (5). Assume that all arrowheads deduced from $\mathcal{P}$ are correct. Then, under d-separation faithfulness w.r.t. $\mathbb{M}$ , the CIM algorithm returns the mixed graph $\mathbb{F}^{*}$ partially oriented.

Proof.

Under d-separation faithfulness w.r.t. $\mathbb{M}$ , CI and d-separation w.r.t. $\mathbb{M}$ are equivalent by Theorem 1, so we can refer to them interchangeably. Algorithm 2 finds the adjacencies in List 1 because we must always have $\tensor*[^{a}_{b}]{\textnormal{Adj}}{}_{\mathbb{F}^{*}}(\tensor[^{a}]{O}{{}_{i}})\subseteq\tensor*[^{a}_{b}]{\textnormal{Adj}}{}_{\widehat{\mathbb{F}}^{*}}(\tensor[^{a}]{O}{{}_{i}})$ in Step 2 of Algorithm 2. Step 1 discovers the correct tails by Lemma 1. Step 1 follows directly by transitivity of the tails. ∎

8.2 Skeleton Discovery

We summarize CIM’s skeleton discovery procedure in Algorithm 2. The algorithm mimics PC-stable’s skeleton discovery procedure, but it incorporates wave information in the adjacency sets.
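Algorithm 2 is not reproduced in this section, so the following is only a sketch of the generic PC-stable skeleton search that it is said to mimic, run against a conditional independence oracle. The wave-specific restriction of the adjacency sets is not specified here; it is therefore left as an optional, hypothetical hook named `allow`, and all other names are likewise illustrative.

```python
from itertools import combinations


def pc_stable_skeleton(vertices, ci_test, allow=None):
    """Return (adjacency sets, separating sets) found with a CI oracle.

    ci_test(x, y, s) should return True when x and y are independent given s.
    allow(x, y, z) is a hypothetical hook that can exclude z from the
    conditioning candidates of the pair (x, y), for example by wave.
    """
    adj = {v: set(vertices) - {v} for v in vertices}
    sepset = {}
    level = 0
    while any(len(adj[v]) - 1 >= level for v in vertices):
        # PC-stable: freeze the adjacency sets at the start of each level so
        # that the result does not depend on the order of the vertices.
        frozen = {v: set(adj[v]) for v in vertices}
        for x in vertices:
            for y in sorted(frozen[x]):
                if y not in adj[x]:
                    continue  # edge already removed during this level
                candidates = frozen[x] - {y}
                if allow is not None:
                    candidates = {z for z in candidates if allow(x, y, z)}
                for s in combinations(sorted(candidates), level):
                    if ci_test(x, y, set(s)):
                        adj[x].discard(y)
                        adj[y].discard(x)
                        sepset[frozenset((x, y))] = set(s)
                        break
        level += 1
    return adj, sepset


def chain_oracle(x, y, s):
    """Oracle for the chain X - Y - Z: only X and Z separate, given Y."""
    return {x, y} == {"X", "Z"} and "Y" in s


adj, sepset = pc_stable_skeleton(["X", "Y", "Z"], chain_oracle)
```

On the three-vertex chain, the search keeps the edges between `X` and `Y` and between `Y` and `Z`, and removes the edge between `X` and `Z` with separating set `{"Y"}`.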

8.3 Comparison to Previous Global Markov Property

Spirtes [17] also characterized the global Markov property across a mixture of DAGs using $\mathbb{F}$. The fused graph, however, implies fewer CI relations than $\mathbb{M}$, as illustrated in Figure 2. We have drawn $\mathbb{F}$ in Figure 2 (c). $X_{1}$ and $X_{3}$ are d-connected in $\mathbb{F}$ even though $\{X_{1}^{1},X_{1}^{2}\}$ and $\{X_{3}^{1},X_{3}^{2}\}$ are d-separated in $\mathbb{M}$ in Figure 2 (b). We have established an instance where the mixture graph implies strictly more independence relations than the fused graph.

The mixture graph in fact always implies at least as many CI relations as the fused graph:

Proposition 1.

Let $\bm{A},\bm{B},\bm{C}$ denote disjoint subsets of $\bm{X}$ . If $\bm{A}\perp\!\!\!\perp_{d}\bm{B}|\bm{C}$ in $\mathbb{F}$ , then $\bm{A}^{\prime}\perp\!\!\!\perp_{d}\bm{B}^{\prime}|\bm{C}^{\prime}$ in $\mathbb{M}$ .

Proof.

We create $q$ copies of $\mathbb{F}$ and plot them adjacent to each other. Denote the resultant graph as $\mathbb{F}^{\prime}$ . As a result, we have $\bm{A}\perp\!\!\!\perp_{d}\bm{B}|\bm{C}$ in $\mathbb{F}$ if and only if $\bm{A}^{\prime}\perp\!\!\!\perp_{d}\bm{B}^{\prime}|\bm{C}^{\prime}$ in $\mathbb{F}^{\prime}$ . Create a new graph $\mathbb{F}^{\prime\prime}$ as follows. First set $\mathbb{F}^{\prime\prime}$ equal to $\mathbb{F}^{\prime}$ . Then remove $\bm{T}^{\prime}$ from $\mathbb{F}^{\prime\prime}$ and place $\bm{T}$ instead. Set $\textnormal{Ch}_{\mathbb{F}^{\prime\prime}}(T_{i})$ equal to $\textnormal{Ch}^{\prime}_{\mathbb{F}}(T_{i})$ for each $T_{i}\in\bm{T}$ . Denote the resultant graph as $\mathbb{F}^{\prime\prime}$ .

We will show that $\bm{A}^{\prime}\not\perp\!\!\!\perp_{d}\bm{B}^{\prime}|\bm{C}^{\prime}$ in $\mathbb{F}^{\prime\prime}$ implies $\bm{A}^{\prime}\not\perp\!\!\!\perp_{d}\bm{B}^{\prime}|\bm{C}^{\prime}$ in $\mathbb{F}^{\prime}$. Consider any active path $\Pi_{\bm{A}^{\prime}\bm{B}^{\prime}}$ between $\bm{A}^{\prime}$ and $\bm{B}^{\prime}$ given $\bm{C}^{\prime}$ in $\mathbb{F}^{\prime\prime}$. Denote the moral graph of $\textnormal{Anc}_{\mathbb{F}^{\prime\prime}}(\bm{A}^{\prime},\bm{B}^{\prime},\bm{C}^{\prime})$ by $\overline{\mathbb{F}}^{\prime\prime}$. Consider an active path $\Pi_{\bm{A}^{\prime}\bm{B}^{\prime}}$ between $\bm{A}^{\prime}$ and $\bm{B}^{\prime}$ given $\bm{C}^{\prime}$ in $\overline{\mathbb{F}}^{\prime\prime}$. We can replace an arbitrary vertex $Z_{i}^{m}$ on $\Pi_{\bm{A}^{\prime}\bm{B}^{\prime}}$ with $Z_{i}^{n}$ on $\mathbb{G}^{n}\in\mathcal{G}$. Repeating this process for every vertex on $\Pi_{\bm{A}^{\prime}\bm{B}^{\prime}}$ creates a non-simple path (i.e. with potentially repeated vertices) between $\bm{A}^{n}$ and $\bm{B}^{n}$ that does not contain any member of $\bm{C}^{n}$. There thus exists a simple path without repeated vertices between $\bm{A}^{n}$ and $\bm{B}^{n}$ that does not contain any member of $\bm{C}^{n}$ in $\overline{\mathbb{F}}^{\prime\prime}$. Hence $\bm{A}^{\prime}\not\perp\!\!\!\perp_{d}\bm{B}^{\prime}|\bm{C}^{\prime}$ in $\mathbb{F}^{\prime}$ by Lemma 2 in [16].

Note that all of the edges in $\mathbb{M}$ are contained within $\mathbb{F}^{\prime\prime}$ . The conclusion follows because we may write $\bm{A}^{\prime}\not\perp\!\!\!\perp_{d}\bm{B}^{\prime}|\bm{C}^{\prime}$ in $\mathbb{M}$ implies $\bm{A}^{\prime}\not\perp\!\!\!\perp_{d}\bm{B}^{\prime}|\bm{C}^{\prime}$ in $\mathbb{F}^{\prime\prime}$ , which implies $\bm{A}^{\prime}\not\perp\!\!\!\perp_{d}\bm{B}^{\prime}|\bm{C}^{\prime}$ in $\mathbb{F}^{\prime}$ , which implies $\bm{A}\not\perp\!\!\!\perp_{d}\bm{B}|\bm{C}$ in $\mathbb{F}$ .

∎

$\mathbb{M}$ is thus superior to $\mathbb{F}$ because (1) $\mathbb{M}$ implies at least as many CI relations as $\mathbb{F}$ , and (2) $\mathbb{M}$ implies strictly more CI relations in some cases.

8.4 Negative Result

We cannot infer arrowheads with a CI oracle alone:

Proposition 2.

There exist mixture and fused graph pairs $(\mathbb{M}_{1},\mathbb{F}_{1})$ and $(\mathbb{M}_{2},\mathbb{F}_{2})$ such that $O_{i}\not\in\textnormal{Anc}_{\mathbb{F}_{1}}(O_{j}\cup\bm{S})$ and $O_{i}\in\textnormal{Anc}_{\mathbb{F}_{2}}(O_{j})$ , but $O_{i}^{\prime}\perp\!\!\!\perp O_{j}^{\prime}|(\bm{W}^{\prime},\bm{S}^{\prime})$ in $\mathbb{M}_{1}$ if and only if $O_{i}^{\prime}\perp\!\!\!\perp O_{j}^{\prime}|(\bm{W}^{\prime},\bm{S}^{\prime})$ in $\mathbb{M}_{2}$ for any $O_{i},O_{j}\in\bm{O}$ and $\bm{W}\subseteq\bm{O}\setminus\{O_{i},O_{j}\}$ .

Proof.

Assume $O_{i}\not\in\textnormal{Anc}_{\mathbb{F}_{1}}(O_{j}\cup\bm{S})$ . Let $\mathcal{G}_{1}$ and $\mathcal{G}_{2}$ refer to the set of DAGs associated with $\mathbb{M}_{1}$ and $\mathbb{M}_{2}$ , respectively. Choose $\mathbb{M}_{1}$ arbitrarily and set $\mathcal{G}_{2}$ equal to $\mathcal{G}_{1}$ . If $|\mathcal{G}_{2}|=1$ , then add a second copy of the DAG into $\mathcal{G}_{2}$ . Add one new latent variable $L_{k}$ into each DAG in $\mathcal{G}_{2}$ as follows. For all but the last DAG, draw the directed edge $O_{i}\rightarrow L_{k}$ . For the last DAG, draw $L_{k}\rightarrow O_{j}$ . Next, introduce a new latent common cause $T_{l}$ for $L_{k}$ and $O_{j}$ into every DAG in $\mathcal{G}_{2}$ . The new paths do not introduce a d-connecting path between the observed vertices in any of the DAGs in $\mathcal{G}_{2}$ . As a result, $O_{i}^{\prime}\perp\!\!\!\perp O_{j}^{\prime}|(\bm{W}^{\prime},\bm{S}^{\prime})$ in $\mathbb{M}_{1}$ if and only if $O_{i}^{\prime}\perp\!\!\!\perp O_{j}^{\prime}|(\bm{W}^{\prime},\bm{S}^{\prime})$ in $\mathbb{M}_{2}$ , but $O_{i}\in\textnormal{Anc}_{\mathbb{F}_{2}}(O_{j})$ with the directed path $O_{i}\rightarrow L_{k}\rightarrow O_{j}$ . ∎

8.5 Failure of Other Constraint-Based Methods

We cannot apply an existing constraint-based algorithm on data arising from a mixture of DAGs and recover a partially oriented $\mathbb{F}^{*}$ . For example, FCI and CCI can make incorrect inferences if $\mathcal{G}$ contains more than one DAG. Consider the mixture graph in Figure 5 (a), where all variables lie in the same wave. $O_{2}$ is an ancestor of $O_{3}$ in $\mathbb{F}$ drawn in Figure 5 (b), but we have $\{O_{1}^{1},O_{1}^{2}\}$ $\perp\!\!\!\perp_{d}\{O_{3}^{1},O_{3}^{2}\}$ in $\mathbb{M}$ , so $O_{1}$ and $O_{3}$ are independent by Theorem 1. FCI and CCI therefore infer the incorrect collider $O_{1}*\!\!\rightarrow O_{2}\leftarrow\!\!*O_{3}$ in $\mathbb{F}^{*}$ during v-structure discovery. We thus require an alternative algorithm to correctly recover a partially oriented $\mathbb{F}^{*}$ .

Bibliography


[1] E. V. Strobl, Improved causal discovery from longitudinal data using a mixture of dags, in: T. D. Le, J. Li, K. Zhang, E. K. P. Cui, A. Hyvärinen (Eds.), Proceedings of Machine Learning Research, volume 104 of Proceedings of Machine Learning Research, PMLR, Anchorage, Alaska, USA, 2019, pp. 100–133. URL: http://proceedings.mlr.press/v104/strobl19a.html.
[2] P. Spirtes, C. Glymour, R. Scheines, Causation, Prediction, and Search, 2nd ed., MIT press, 2000.
[3] J. Zhang, On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias, Artif. Intell. 172 (2008) 1873–1896. URL: http://dx.doi.org/10.1016/j.artint.2008.08.001. doi:10.1016/j.artint.2008.08.001.
[4] E. V. Strobl, A constraint-based algorithm for causal discovery with cycles, latent variables and selection bias, International Journal of Data Science and Analytics (2018). URL: https://doi.org/10.1007/s41060-018-0158-2. doi:10.1007/s41060-018-0158-2.
[5] P. Forré, J. M. Mooij, Markov properties for graphical models with cycles and latent variables, arXiv.org preprint arXiv:1710.08775 [math.ST] (2017). URL: https://arxiv.org/abs/1710.08775.
[6] P. Forré, J. M. Mooij, Constraint-based causal discovery for non-linear structural causal models with cycles and latent confounders, in: Proceedings of the 34th Annual Conference on Uncertainty in Artificial Intelligence (UAI-18), 2018.
[7] A. Hyttinen, P. O. Hoyer, F. Eberhardt, M. Järvisalo, Discovering cyclic causal models with latent variables: A general sat-based procedure, in: Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, UAI 2013, Bellevue, WA, USA, August 11-15, 2013. URL: https://dslpitt.org/uai/displayArticleDetails.jsp?mmnu=1&smnu=2&article_id=2391&proceeding_id=29.
[8] A. Hyttinen, F. Eberhardt, M. Järvisalo, Constraint-based causal discovery: Conflict resolution with answer set programming, in: Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, UAI’14, AUAI Press, Arlington, Virginia, United States, 2014, pp. 340–349. URL: http://dl.acm.org/citation.cfm?id=3020751.3020787.
