Causal Dantzig: fast inference in linear structural equation models with hidden variables under additive interventions

Dominik Rothenhäusler; Peter Bühlmann; Nicolai Meinshausen

arXiv:1706.06159·stat.ME·June 19, 2018


TL;DR

Causal Dantzig offers a computationally efficient method for causal inference in linear structural equation models with hidden variables, leveraging invariance under specific interventions to handle large-scale data.

Contribution

It introduces a new approach using inner-product invariance for fast causal inference, addressing computational challenges and hidden confounders in large-scale linear models.

Findings

01

Addresses computational efficiency for large datasets

02

Provides asymptotic confidence intervals in low-dimensional settings

03

Offers predictive guarantees in non-identifiable cases

Abstract

Causal inference is known to be very challenging when only observational data are available. Randomized experiments are often costly and impractical and in instrumental variable regression the number of instruments has to exceed the number of causal predictors. It was recently shown in Peters et al. [2016] that causal inference for the full model is possible when data from distinct observational environments are available, exploiting that the conditional distribution of a response variable is invariant under the correct causal model. Two shortcomings of such an approach are the high computational effort for large-scale data and the assumed absence of hidden confounders. Here we show that these two shortcomings can be addressed if one is willing to make a more restrictive assumption on the type of interventions that generate different environments. Thereby, we look at a different notion…

Tables

Table 1: Consistency of the causal Dantzig and the instrumental variables approach. Consider a model $Y=\beta X+H+\eta_{y}$ and a structural equation model for $X$ as depicted in the table. The case on the left is a mean shift, whereas on the right-hand side the error variance changes between setting $e=1$ and $e=2$. We assume $\alpha\neq 0$, and that the random variables $e,\eta_{y},\eta_{x},H$ are independent and non-degenerate with $\mathbb{E}[H]=0$.

Consistency	$X=\alpha e+H+\eta_{x}$ (mean shift)	$X=H+(1+\alpha e)\eta_{x}$ (change in error distribution)
Instrumental variable regression	yes	no
Unregularized causal Dantzig	yes	yes

Table 2: The first two rows contain actual coverage and average length of confidence intervals of causal Dantzig for the first variable in SEM (A) of equation (31). The last row contains the actual coverage of ICP in these settings. The nominal coverage is 0.95 for causal Dantzig and at least 0.95 for ICP. For small sample sizes, the variance is relatively large. As discussed in Section 4, regularization can be helpful in these settings.

	$n=50$	$n=100$	$n=500$	$n=1000$
Coverage	0.93 $\pm$ 0.01	0.95 $\pm$ 0.01	0.96 $\pm$ 0.01	0.96 $\pm$ 0.01
Average length	65.79 $\pm$ 2918.53	4.11 $\pm$ 602.53	0.27 $\pm$ 0.62	0.18 $\pm$ 0.01
Coverage ICP	0.92 $\pm$ 0.01	0.84 $\pm$ 0.01	0.42 $\pm$ 0.02	0.3 $\pm$ 0.03

Table 3: Actual coverage and average length of confidence intervals for the first variable in SEM (B) of equation (31) with causal Dantzig. The nominal coverage is 0.95.

	$n=50$	$n=100$	$n=500$	$n=1000$
Coverage	0.95 $\pm$ 0.01	0.95 $\pm$ 0.01	0.96 $\pm$ 0.01	0.96 $\pm$ 0.01
Average length	11354.75 $\pm$ 2776.95	57.27 $\pm$ 28842.69	0.69 $\pm$ 7.28	0.39 $\pm$ 3.73

Table 4: Mean square error for varying $n$. The instrument is not weak.

	$n=20$	$n=50$	$n=100$
causal Dantzig	0.46 $\pm$ 0.41	0.03 $\pm$ 0.01	0.01 $\pm$ 0
ivreg	0.07 $\pm$ 0.01	0.02 $\pm$ 0	0.01 $\pm$ 0

Table 5: Mean square error for varying $n$. The instrument is weak, but causal Dantzig can leverage changes in variance.

	$n=20$	$n=50$	$n=100$
causal Dantzig	24.05 $\pm$ 90.75	0.03 $\pm$ 0	0.01 $\pm$ 0
ivreg	36634.21 $\pm$ 161096.94	4244.29 $\pm$ 15557.82	1862.7 $\pm$ 8171.14

Equations

$$X_{k}=\sum_{k^{\prime}\in\mathit{pa}(k)}A_{k,k^{\prime}}X_{k^{\prime}}+\eta_{k}\quad\text{for }k=1,\ldots,p+1,$$

$$Y:=X_{p+1}=\sum_{k=1}^{p}\beta^{0}_{k}X_{k}+\varepsilon.$$

$$Y^{e}\mid X^{e}_{\mathit{pa}(Y)}=x.$$

$$Y^{e}\mid X^{e}_{\mathit{pa}(Y)}=x\;\stackrel{d}{=}\;Y^{f}\mid X^{f}_{\mathit{pa}(Y)}=x.$$

$$\mathbb{E}\big[X^{e}_{k}(Y^{e}-X^{e}\beta^{0})\big]=\mathbb{E}\big[X^{f}_{k}(Y^{f}-X^{f}\beta^{0})\big]$$

$$X^{e}_{k}=\sum_{k^{\prime}\in\mathit{pa}(k)}A_{k,k^{\prime}}X^{e}_{k^{\prime}}+\eta^{e}_{k}\quad\text{for }k=1,\ldots,p+1.$$

$$\eta^{e}\stackrel{d}{=}\eta^{0}+\delta^{e}\quad\text{for all }e\in\mathcal{E}.$$

$$\mathbb{E}\big[X^{e}_{k}(Y^{e}-X^{e}\beta^{0})\big]=\mathbb{E}\big[X^{f}_{k}(Y^{f}-X^{f}\beta^{0})\big]$$

$$\tilde{Y}^{e}=Y^{e}+\zeta^{e}_{y}\quad\text{and}\quad\tilde{X}^{e}_{k}=X^{e}_{k}+\zeta^{e}_{k},\qquad e\in\mathcal{E},\;k=1,\ldots,p,$$

$$\begin{array}{lrcl}\text{latent variables }X_{1}\text{ and }Y\text{ with}&Y&=&2X_{1}+\varepsilon,\\ \text{observed variables }\tilde{X}_{1}\text{ and }\tilde{Y}\text{ with}&\tilde{X}_{1}&=&X_{1}+\zeta_{1},\\ \text{and}&\tilde{Y}&=&Y+\zeta_{y}.\end{array}$$

$$\mathbb{E}\big[\tilde{X}^{e}_{k}(\tilde{Y}^{e}-\tilde{X}^{e}\beta^{0})\big]=\mathbb{E}\big[\tilde{X}^{f}_{k}(\tilde{Y}^{f}-\tilde{X}^{f}\beta^{0})\big]$$

$$\hat{\mathbf{Z}}:=\frac{1}{n_{1}}(\mathbf{X}^{1})^{t}\mathbf{Y}^{1}-\frac{1}{n_{2}}(\mathbf{X}^{2})^{t}\mathbf{Y}^{2}\in\mathbb{R}^{p},\qquad\hat{\mathbf{G}}:=\frac{1}{n_{1}}(\mathbf{X}^{1})^{t}\mathbf{X}^{1}-\frac{1}{n_{2}}(\mathbf{X}^{2})^{t}\mathbf{X}^{2}\in\mathbb{R}^{p\times p}.$$

$$\mathbb{E}[\hat{\mathbf{Z}}-\hat{\mathbf{G}}\beta^{0}]=0.$$

$$\min_{\beta\in\mathbb{R}^{p}}\|\hat{\mathbf{Z}}-\hat{\mathbf{G}}\beta\|_{\infty}.$$

$$\hat{\beta}=\hat{\mathbf{G}}^{-1}\hat{\mathbf{Z}}.$$

$$\hat{\beta}_{LS}=(\mathbf{X}^{t}\mathbf{X})^{-1}\mathbf{X}^{t}\mathbf{Y}.$$

$$\min_{\beta\in\mathbb{R}^{p}}\max_{e\in\mathcal{E}}\|\hat{\mathbf{Z}}^{e}-\hat{\mathbf{G}}^{e}\beta\|_{\infty},$$

$$\hat{\mathbf{Z}}^{e}:=\frac{1}{n_{e}}(\mathbf{X}^{e})^{t}\mathbf{Y}^{e}-\frac{1}{|\mathcal{E}|-1}\sum_{\tilde{e}\neq e}\frac{1}{n_{\tilde{e}}}(\mathbf{X}^{\tilde{e}})^{t}\mathbf{Y}^{\tilde{e}}\in\mathbb{R}^{p},\qquad\hat{\mathbf{G}}^{e}:=\frac{1}{n_{e}}(\mathbf{X}^{e})^{t}\mathbf{X}^{e}-\frac{1}{|\mathcal{E}|-1}\sum_{\tilde{e}\neq e}\frac{1}{n_{\tilde{e}}}(\mathbf{X}^{\tilde{e}})^{t}\mathbf{X}^{\tilde{e}}\in\mathbb{R}^{p\times p}.$$

$$\Big(\frac{V^{1}}{n_{1}}+\frac{V^{2}}{n_{2}}\Big)^{-\frac{1}{2}}(\hat{\beta}-\beta^{0})\rightharpoonup\mathcal{N}_{p}(0,\mathrm{Id}_{p}).$$

$$I_{k}=\Big[\hat{\beta}_{k}-q\sqrt{\hat{V}_{kk}},\;\hat{\beta}_{k}+q\sqrt{\hat{V}_{kk}}\Big],$$

$$\mathbb{P}[\beta^{0}_{k}\in I_{k}]\to 1-\alpha\quad\text{for }n_{1},n_{2}\to\infty.$$

$$\Big(\frac{V^{1}}{n_{1}}+\frac{V^{2}}{n_{2}}\Big)^{-\frac{1}{2}}(\hat{\beta}-\beta^{0})\rightharpoonup\mathcal{N}(0,\mathrm{Id}_{p}),$$

$$-\hat{\mathbf{G}}^{-1}(\mathbf{X}^{1}_{i\cdot})^{t}\mathbf{X}^{1}_{i\cdot}\hat{\mathbf{G}}^{-1}\hat{\mathbf{Z}}+\hat{\mathbf{G}}^{-1}(\mathbf{X}^{1}_{i\cdot})^{t}\mathbf{Y}^{1}_{i},\quad i=1,\ldots,n_{1},$$

$$\left\{\begin{array}{rcrrrr}X_{2}^{e}&\leftarrow&&&\eta^{0}+&\sigma^{e}\eta_{2}\\ Y^{e}&\leftarrow&&X_{2}^{e}+&\eta^{0}+&\eta_{y}\\ X_{1}^{e}&\leftarrow&Y^{e}+&X_{2}^{e}+&&\sigma^{e}\eta_{1}\\ X_{3}^{e}&\leftarrow&&X_{1}^{e}+&\eta^{0}+&\sigma^{e}\eta_{3}\end{array}\right.,$$

$$\hat{\mathbf{G}}=\left(\begin{array}{ccc}15.9&6.5&16.1\\ 6.5&3.2&6.5\\ 16.1&6.5&19.1\end{array}\right),\quad\hat{\mathbf{Z}}=\left(\begin{array}{c}6.4\\ 3.2\\ 6.5\end{array}\right)\quad\Rightarrow\quad\hat{\beta}=\hat{\mathbf{G}}^{-1}\hat{\mathbf{Z}}=\left(\begin{array}{c}-0.04\\ 1.00\\ 0.03\end{array}\right),$$

$$\beta^{0}=\left(\begin{array}{c}0\\ 1\\ 0\end{array}\right).$$

$$Y^{e}-X^{e}\beta\stackrel{d}{=}Y^{f}-X^{f}\beta\quad\text{for all }e,f\in\mathcal{E}.$$

$$Y^{e}-X^{e}\beta\stackrel{d}{=}Y^{\tilde{e}}-X^{\tilde{e}}\beta\quad\text{for all }e\in\mathcal{E}.$$

$$\lim_{n\to\infty}\hat{\beta}_{IV}=\frac{\mathbb{E}[Y\mid e=1]-\mathbb{E}[Y\mid e=2]}{\mathbb{E}[X\mid e=1]-\mathbb{E}[X\mid e=2]}=\frac{\mathbb{E}[Y^{1}]-\mathbb{E}[Y^{2}]}{\mathbb{E}[X^{1}]-\mathbb{E}[X^{2}]}.$$

$$\lim_{n\to\infty}\hat{\beta}=\frac{\mathbb{E}[X^{1}\cdot Y^{1}]-\mathbb{E}[X^{2}\cdot Y^{2}]}{\mathbb{E}[(X^{1})^{2}]-\mathbb{E}[(X^{2})^{2}]}.$$

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Causal Dantzig: fast inference in linear structural equation models with hidden variables under additive interventions

Dominik Rothenhäusler, Peter Bühlmann, Nicolai Meinshausen

Seminar für Statistik, ETH Zürich, 8092 Zürich, Switzerland

Abstract

Causal inference is known to be very challenging when only observational data are available. Randomized experiments are often costly and impractical and in instrumental variable regression the number of instruments has to exceed the number of causal predictors. It was recently shown in Peters et al. (2016) that causal inference for the full model is possible when data from distinct observational environments are available, exploiting that the conditional distribution of a response variable is invariant under the correct causal model. Two shortcomings of such an approach are the high computational effort for large-scale data and the assumed absence of hidden confounders. Here we show that these two shortcomings can be addressed if one is willing to make a more restrictive assumption on the type of interventions that generate different environments. Thereby, we look at a different notion of invariance, namely inner-product invariance. By avoiding a computationally cumbersome reverse-engineering approach such as in Peters et al. (2016), it allows for large-scale causal inference in linear structural equation models. We discuss identifiability conditions for the causal parameter and derive asymptotic confidence intervals in the low-dimensional setting. In the case of non-identifiability we show that the solution set of causal Dantzig has predictive guarantees under certain interventions. We derive finite-sample bounds in the high-dimensional setting and investigate its performance on simulated datasets.

MSC subject classifications: 62J99, 62H99, 68T99.

Keywords: causal inference, structural equation models, high-dimensional consistency.

URL: http://stat.ethz.ch

1 Introduction

Using only observational data to infer causal relations is a challenging task and only possible under certain circumstances and assumptions. In the context of structural equation models (Bollen, 1989; Robins et al., 2000; Pearl, 2009), one possibility is to characterize the Markov equivalence class of graphs under the assumption of acyclicity and usually faithfulness (Verma and Pearl, 1991; Andersson et al., 1997; Tian and Pearl, 2001; Hauser and Bühlmann, 2012; Chickering, 2002). Based on the Markov equivalence class, some causal effects and often only bounds for them can be inferred, see for example Maathuis et al. (2009) and VanderWeele and Robins (2010). Other approaches exploit non-Gaussianity or nonlinearities, while making suitable assumptions about the causal model (Shimizu et al., 2006; Hoyer et al., 2009).

If both observational and interventional data are available and the targets and effects of the interventions are perfectly known, the task of inferring causal relationships becomes easier. Hauser and Bühlmann (2015), for example, modify the greedy equivalence search of Chickering (2002) to such a scenario. If an instrumental variable is available, then different forms of instrumental variable regression (Wright, 1928; Bowden and Turkington, 1990; Angrist et al., 1996; Didelez et al., 2010) can be used to infer the causal effect of a single variable on a target of interest.

Consider a setting where data are recorded in different environments. The environments can have an arbitrary and unknown intervention effect on all predictor variables. Invariant causal prediction (Peters et al., 2016) exploits that the conditional distribution of the target $Y$ of interest, given its causal parents, is invariant across environments under arbitrary interventions on all variables (excluding, just as in instrumental variable regression, direct interventions on the response or target $Y$). While it was demonstrated in Peters et al. (2016) that the method can infer a full causal model, there are two major shortcomings:

(i) It is assumed for invariant causal prediction (ICP) (Peters et al., 2016) that there are no hidden variables that influence $Y$ and its parents simultaneously.

(ii) ICP scans all potential subsets of variables and tests whether the conditional distribution of $Y$ given a subset of variables is invariant across all environments. This makes the method computationally prohibitively expensive as soon as the number of predictor variables starts to exceed one or two dozen.

We will show that both shortcomings can be addressed if we are willing to make a more specific assumption about the type of interventions that generate the different environments.

1.1 Setting and notation

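Assume we have $p+1$ variables $X_{1},\ldots,X_{p+1}$ from a linear Structural Equation Model (SEM) (Bollen, 1989; Robins et al., 2000; Pearl, 2009),

$$X_{k}=\sum_{k^{\prime}\in\mathit{pa}(k)}A_{k,k^{\prime}}X_{k^{\prime}}+\eta_{k}\quad\text{for }k=1,\ldots,p+1,\tag{1}$$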

where $\mathit{pa}(k):=\{k^{\prime}:A_{k,k^{\prime}}\neq 0\}\subseteq\{1,\ldots,p+1\}\setminus k$ is the set of parents of variable $k$ . For notational simplicity we set $A_{k,k}:=0$ for all $k$ . Deviating from convention, we allow dependence between the components of the noise contribution $\eta=(\eta_{1},\ldots,\eta_{p+1})$ which is equivalent to allowing for hidden variables as parents of the observed variables $X_{1},\ldots,X_{p+1}$ , see Figure 1 for an example. The variables form a directed graph $G=(V,E)$ , where the nodes $V=\{1,\ldots,p+1\}$ are given by the variables themselves and there is an edge from variable $k$ to $k^{\prime}$ if and only if $k\in\mathit{pa}(k^{\prime})$ . Furthermore, we allow the underlying graph to be cyclic. The values $(A_{k,k^{\prime}})$ for $k,k^{\prime}\in\{1,\ldots,p+1\}$ form a $(p+1)\times(p+1)$ -dimensional matrix that we denote by $A$ . We write $\text{Id}_{p+1}$ for the $(p+1)\times(p+1)$ -dimensional identity matrix. To make the distribution of $X_{1},...,X_{p+1}$ well defined in the presence of cycles, we assume that $\text{Id}_{p+1}-A$ is invertible. Note that this is always the case if $G$ is acyclic.

We consider inferring the structural equation for just one of the variables and we take variable $X_{p+1}$ without loss of generality and denote it by $Y$ . Note that $Y$ can be in the parental set of some (or all) of the variables $X_{1},\ldots,X_{p}$ , i.e. the matrix $A$ is not necessarily lower triangular. With slight abuse of notation we define $X:=(X_{1},\ldots,X_{p})$ , $\beta^{0}:=A_{p+1,1:p}$ and $\varepsilon:=\eta_{p+1}$ such that

$$Y:=X_{p+1}=\sum_{k=1}^{p}\beta^{0}_{k}X_{k}+\varepsilon.\tag{2}$$

Note that the vector $\beta^{0}$ has a causal interpretation as it is the coefficient vector $A_{p+1,1:p}$ in the structural equation model (2). The goal is to infer $\beta^{0}$ .
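The setting above can be made concrete with a small simulation. The following is a minimal sketch, not code from the paper: the coefficient matrix, the hidden variable and the helper name `sample_linear_sem` are hypothetical choices for illustration. It draws samples by solving $(\text{Id}_{p+1}-A)X=\eta$, which works for cyclic graphs as long as $\text{Id}_{p+1}-A$ is invertible, and lets the noise components share a hidden variable.

```python
import numpy as np

def sample_linear_sem(A, eta):
    """Solve X = A X + eta for every sample, i.e. X = (Id - A)^{-1} eta.

    A   : (p+1, p+1) coefficient matrix with zero diagonal; row k holds the
          coefficients of the parents of variable k. Cycles are allowed as
          long as Id - A is invertible.
    eta : (n, p+1) noise samples; columns may be dependent (hidden variables).
    """
    d = A.shape[0]
    return np.linalg.solve(np.eye(d) - A, eta.T).T

rng = np.random.default_rng(0)
p, n = 2, 1000

# Hypothetical cyclic example; the last index plays the role of Y = X_{p+1}.
A = np.array([
    [0.0, 0.4, 0.0],   # X1 <- 0.4 * X2
    [0.0, 0.0, 0.5],   # X2 <- 0.5 * Y
    [1.5, 0.0, 0.0],   # Y  <- 1.5 * X1
])

# A hidden variable entering the noise of X1 and Y makes eta_1 and eta_3 dependent.
hidden = rng.normal(size=(n, 1))
eta = rng.normal(size=(n, p + 1)) + hidden * np.array([1.0, 0.0, 1.0])

data = sample_linear_sem(A, eta)
X, Y = data[:, :p], data[:, p]
beta0 = A[p, :p]      # causal coefficients of the target
eps = eta[:, p]       # noise of the target
```

Here `Y` equals `X @ beta0 + eps` up to floating-point error, which is the structural equation for the target, even though `Y` is itself a parent of `X2`.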

1.2 Relation to other work

We have already mentioned major differences to invariant causal prediction (Peters et al., 2016) and the loose relation to the vast literature on instrumental variable regression (Didelez et al., 2010), which will be detailed in Section 3.6. Another method that relies on shift interventions has been published recently (Rothenhäusler et al., 2015). However, the authors exploit a different type of invariance as inner-product invariance does not hold in this setting. Lewbel (2012) uses heteroscedasticity to infer structural equations. While Lewbel (2012) uses cross-products between exogenous variables and error terms to identify structural equations, we directly exploit the covariance structure of endogenous variables and the error terms, resulting in a different method. The comparison in Figure 11, which concerns an application, has been published in Meinshausen et al. (2016). The concept of inner-product invariance, the causal Dantzig method and all its corresponding theory are entirely novel.

1.3 Overview

In Section 2 we introduce the notion of inner-product invariance and discuss under which assumptions this property is satisfied. In Section 3 we leverage this property to define the unregularized causal Dantzig and discuss identifiability, low-dimensional estimation and inference. Furthermore, in the case of non-identifiability we show that the solution set of causal Dantzig has predictive guarantees under certain interventions. We conclude with a comparison to instrumental variable regression and a discussion of inner-product invariance from the perspective of potential outcomes. In Section 4 we introduce the regularized causal Dantzig, examine its performance in high-dimensional estimation and show how it can achieve consistency under relaxed identifiability assumptions. Practical considerations for both the regularized and unregularized causal Dantzig can be found in Section 5. Numerical examples can be found in Section 6.

2 Conditional and inner-product invariance

In analogy to the setting of Peters et al. (2016) we assume that the data are recorded under different discrete environments or experimental conditions $e\in\mathcal{E}$ . The random variable $X$ in environment $e\in\mathcal{E}$ is denoted by $X^{e}$ and the distribution of $\eta$ by $\eta^{e}$ . We observe i.i.d. samples of $(X^{e},Y^{e})$ from each environment $e\in\mathcal{E}$ and for each sample $i$ we observe from which environment $e_{i}\in\mathcal{E}$ it was drawn. This variable $e_{i}$ can be deterministic or random.

The distribution of a variable can be different across environments due to specific or non-specific interventions. A change in the distribution of $X^{e},\eta^{e}$ can be caused by different intervention mechanisms such as do-interventions or noise-interventions, which can be randomized or not and known or partially known or unknown.

The type of intervention that generates the environments is arbitrary in Peters et al. (2016) with the exception that interventions on the target $Y$ itself are not allowed. The same requirement is also necessary for the instrumental variable approach and we will keep this requirement in the following. For possible relaxations see Rothenhäusler et al. (2015). Throughout the paper we assume that the distributions $(X^{e},Y^{e})$ are non-degenerate and that the Gram matrix of $(X^{e},Y^{e})$ is well-defined and positive definite for all $e\in\mathcal{E}$ .

2.1 Conditional invariance

The conditional distribution of the target variable $Y$ , given its parents $\mathit{pa}(Y)=\mathit{pa}(X_{p+1})$ is denoted by

$$Y^{e}\mid X^{e}_{\mathit{pa}(Y)}=x.$$

It was assumed in Peters et al. (2016) that the conditional distribution is invariant for all $x\in\mathbb{R}^{|\mathit{pa}(Y)|}$ where it is defined in the absence of hidden confounding (where absence of hidden confounding is fulfilled in (1) if all components of $\eta$ are independent). It then holds for all environments $e,f\in\mathcal{E}$ and all $x\in\mathbb{R}^{|\mathit{pa}(Y)|}$ for which the conditional distributions are well defined that

$$Y^{e}\mid X^{e}_{\mathit{pa}(Y)}=x\;\stackrel{d}{=}\;Y^{f}\mid X^{f}_{\mathit{pa}(Y)}=x.\tag{3}$$

This conditional invariance under the true parental set $\mathit{pa}(Y)$ is then exploited for inference by testing for all subsets of $\{1,\ldots,p\}$ whether the invariance of (3) can be rejected. The intersection of all subsets for which invariance cannot be rejected is then automatically a subset of the true parental set with controllable probability.

There are two shortcomings of this invariance approach (Peters et al., 2016) in certain contexts:

(i) The invariance (3) becomes invalid under hidden confounding between $Y$ and the parents of $Y$ as the conditional invariance of (3) can be violated even for the true parental set (Peters et al., 2016).

(ii) Testing each subset of $\{1,\ldots,p\}$ restricts the number of variables to roughly $p\leq 20$ in practice.

Both of these shortcomings can be addressed when using a different type of invariance.
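To give a sense of scale for shortcoming (ii), the number of subsets of $\{1,\ldots,p\}$ that an exhaustive scan has to test doubles with every additional predictor. A short illustrative count (the function name is hypothetical):

```python
def num_candidate_subsets(p: int) -> int:
    """Number of subsets of {1, ..., p} an exhaustive subset scan has to test."""
    return 2 ** p

for p in (10, 20, 30):
    print(p, num_candidate_subsets(p))
# 10 1024
# 20 1048576
# 30 1073741824
```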

2.2 Inner-product invariance

We show in the following that the invariance of the conditional distribution (3) can be replaced with an inner-product invariance under a more specific assumption on the mechanism that generates the different environments.

Definition 1.

Inner-product invariance under $\beta^{0}\in\mathbb{R}^{p}$ is fulfilled iff

$$\mathbb{E}\big[X^{e}_{k}(Y^{e}-X^{e}\beta^{0})\big]=\mathbb{E}\big[X^{f}_{k}(Y^{f}-X^{f}\beta^{0})\big]$$

for all $e,f\in\mathcal{E}$ and $k\in\{1,\ldots,p\}$ .

We will show that inner-product invariance is true for the causal vector $\beta^{0}$ under the assumption of additive interventions made precise in the following. A derivation of this result from potential outcome assumptions is discussed in Section 3.7. The concept of inner-product invariance will then later be exploited for computationally fast causal inference for both low- and high-dimensional data.

2.3 Additive interventions

We assume here that the structural equations (1) are constant across all environments and that the change in the distribution of $X^{e}$ between environments is caused by a shift in the distribution of $\eta^{e}$ between different environments.

Assumption 1.

Assume that the distributions of $(X_{1}^{e},...,X_{p+1}^{e})$ , $e\in\mathcal{E}$ , are generated by the linear SEM

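$$X^{e}_{k}=\sum_{k^{\prime}\in\mathit{pa}(k)}A_{k,k^{\prime}}X^{e}_{k^{\prime}}+\eta^{e}_{k}\quad\text{for }k=1,\ldots,p+1.$$

Assume that there exist random variables $\eta^{0},\delta^{e}\in\mathbb{R}^{p+1}$ with $\mathrm{Cov}(\eta^{0},\delta^{e})=0$ for all $e\in\mathcal{E}$ such that $\eta^{e}$ can be written as

$$\eta^{e}\stackrel{d}{=}\eta^{0}+\delta^{e}\quad\text{for all }e\in\mathcal{E}.$$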

We assume that $\delta^{e}_{p+1}\equiv 0$ for all $e\in\mathcal{E}$ and $\mathbb{E}[\eta^{0}]=0$ .

Note that the components of $\eta^{0}$ and of each vector $\delta^{e}$ , $e\in\mathcal{E}$ are allowed to be dependent to allow for hidden confounding. We call the random variables $\delta^{e}$ , $e\in\mathcal{E}$ , additive interventions as they are additive and specific to the environment $e\in\mathcal{E}$ . $\delta^{e}$ can for example be an additive contribution if $\mathbb{E}(\delta^{e}_{k})\neq 0$ for some variable $k\in\{1,\ldots,p\}$ or a noise contribution if $\mathrm{Var}(\delta^{e}_{k})\neq 0$ or both. If $\delta_{k}^{e}\equiv 0$ for some $e\in\mathcal{E}$ and $k\in\{1,\ldots,p\}$ we say that there is no intervention on variable $k$ in environment $e\in\mathcal{E}$ . The last part of the assumption ensures that the noise part $\delta^{e}$ that is specific to environment $e\in\mathcal{E}$ does not include an intervention on the target variable $Y$ itself and is a type of exclusion restriction (Pearl, 2009). Mathematically, the crucial property of Assumption 1 is that the covariance between the error of covariates and target variable is constant, i.e. that $\text{Cov}(\eta_{1:p}^{e},\eta_{p+1}^{e})$ is constant across environments $e\in\mathcal{E}$ . This allows us to obtain the following result.

Proposition 1.

Under Assumption 1, we have inner-product invariance under the true causal coefficients $\beta^{0}=(A_{p+1,k})_{k=1,\ldots,p}$ :

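$$\mathbb{E}\big[X^{e}_{k}(Y^{e}-X^{e}\beta^{0})\big]=\mathbb{E}\big[X^{f}_{k}(Y^{f}-X^{f}\beta^{0})\big]$$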

for all $e,f\in\mathcal{E}$ and $k\in\{1,\ldots,p\}$ .

The proof of this result can be found in the Appendix. A derivation of this result from potential outcome assumptions is discussed in Section 3.7. We will exploit inner-product invariance to infer the causal effects in linear SEMs in the following.
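Proposition 1 can be checked numerically. The sketch below is an illustration under assumed values, not the authors' code: the graph, the hidden confounder, the shift sizes and the function names are all hypothetical. Two environments share the same structural equations and the same $\eta^{0}$, the second one adds an independent $\delta^{e}$ with non-zero mean and variance on the covariates only, and the empirical inner products $\frac{1}{n}\sum_{i}X_{ik}(Y_{i}-X_{i}\beta)$ are compared across environments.

```python
import numpy as np

def sample_environment(A, n, shift_mean, shift_scale, rng):
    """Draw n samples of (X_1, ..., X_p, Y) with eta^e = eta^0 + delta^e."""
    d = A.shape[0]
    hidden = rng.normal(size=(n, 1))
    eta0 = rng.normal(size=(n, d)) + hidden       # dependent components, mean zero
    delta = shift_mean + shift_scale * rng.normal(size=(n, d))
    delta[:, -1] = 0.0                            # no intervention on the target Y
    return np.linalg.solve(np.eye(d) - A, (eta0 + delta).T).T

def inner_products(data, beta):
    """Empirical version of E[X_k (Y - X beta)] for k = 1, ..., p."""
    X, Y = data[:, :-1], data[:, -1]
    return X.T @ (Y - X @ beta) / len(Y)

rng = np.random.default_rng(1)
A = np.array([[0.0, 0.0, 0.0],    # X1 has no observed parents
              [0.0, 0.0, 0.5],    # X2 <- 0.5 * Y
              [1.0, 0.0, 0.0]])   # Y  <- 1.0 * X1
beta0 = A[-1, :-1]
n = 200_000

env1 = sample_environment(A, n, np.zeros(3), np.zeros(3), rng)
env2 = sample_environment(A, n, np.array([1.0, 0.5, 0.0]),
                          np.array([2.0, 1.0, 0.0]), rng)

# Gap between environments under the causal coefficients and under beta = 0.
gap_true = np.abs(inner_products(env1, beta0) - inner_products(env2, beta0))
gap_zero = np.abs(inner_products(env1, np.zeros(2)) - inner_products(env2, np.zeros(2)))
print(gap_true, gap_zero)
```

In this construction the gap under $\beta^{0}$ is sampling noise only, whereas the gap under a non-causal vector such as $\beta=0$ stays bounded away from zero, even though $Y$ and the covariates are confounded in both environments.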

2.4 Errors-in-variables

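In many real-world applications, we cannot directly observe $X_{1},\ldots,X_{p},Y$, but make a measurement error $\zeta$ when observing them. In other words, we measure

$$\tilde{Y}^{e}=Y^{e}+\zeta^{e}_{y}\quad\text{and}\quad\tilde{X}^{e}_{k}=X^{e}_{k}+\zeta^{e}_{k},\qquad e\in\mathcal{E},\;k=1,\ldots,p,\tag{4}$$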

where $\zeta_{y}^{e},\zeta_{k}^{e},e\in\mathcal{E},k=1,\ldots,p$ are centered, jointly independent and independent of $X^{e},Y^{e},e\in\mathcal{E}$ with finite variance. Furthermore, we make the assumption that the distributions of $\zeta_{k}^{e},k=1,\ldots,p$ are invariant for different settings $e\in\mathcal{E}$ . Note that we do not assume that the distribution of $\zeta_{y}^{e}$ is invariant for different settings $e\in\mathcal{E}$ . Errors-in-variables exhibit an effect called “regression dilution” or “attenuation”. As an example consider a Structural Equation Model of the following form:

$$\begin{array}{lrcl}\text{latent variables }X_{1}\text{ and }Y\text{ with}&Y&=&2X_{1}+\varepsilon,\\ \text{observed variables }\tilde{X}_{1}\text{ and }\tilde{Y}\text{ with}&\tilde{X}_{1}&=&X_{1}+\zeta_{1},\\ \text{and}&\tilde{Y}&=&Y+\zeta_{y}.\end{array}$$

For now, let us assume that there is no confounding between $X_{1}$ and $Y$. When regressing $\tilde{Y}$ on $\tilde{X}_{1}$ we obtain a smaller regression coefficient than when regressing $Y$ on $X_{1}$ due to the higher variance of $\tilde{X}_{1}$. The smaller regression coefficient is by definition the best linear prediction of $\tilde{Y}$ given $\tilde{X}_{1}$. In this sense attenuation can be ignored if one wants to make predictions based on $\tilde{X}_{1}$. However, in causal inference we are interested in knowing what happens when intervening on $X_{1}$, and this effect would be underestimated by regressing $\tilde{Y}$ on $\tilde{X}_{1}$. The following proposition shows that if inner-product invariance holds for $X_{1},\ldots,X_{p},Y$ then it also holds for proxy variables $\tilde{X}_{1},\ldots,\tilde{X}_{p},\tilde{Y}$.

Proposition 2.

Assume inner-product invariance holds for $X_{1}^{e},\ldots,X_{p}^{e},Y^{e}$ , $e\in\mathcal{E}$ , under $\beta^{0}$ . Assume we have an errors-in-variables model as defined in equation (4). Then inner-product invariance holds for $\tilde{X}_{1}^{e},\ldots,\tilde{X}_{p}^{e},\tilde{Y}^{e}$ , $e\in\mathcal{E}$ under $\beta^{0}$ :

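$$\mathbb{E}\big[\tilde{X}^{e}_{k}(\tilde{Y}^{e}-\tilde{X}^{e}\beta^{0})\big]=\mathbb{E}\big[\tilde{X}^{f}_{k}(\tilde{Y}^{f}-\tilde{X}^{f}\beta^{0})\big]$$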

for all $e,f\in\mathcal{E}$ and $k\in\{1,\ldots,p\}$ .

The proof of this result can be found in the Appendix. As a result, methods based on inner-product invariance will be robust with respect to errors-in-variables. Note that the analogous statement is true for instrumental variable regression. Now let us turn to the definition of the unregularized causal Dantzig.
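The two effects discussed in this section can be made explicit with a short calculation. This is a sketch under the stated assumptions on $\zeta$, not the proof given in the Appendix. In the example above, reading "no confounding" as $\mathrm{Cov}(X_{1},\varepsilon)=0$ (an assumption made here for the calculation), the population regression coefficient of $\tilde{Y}$ on $\tilde{X}_{1}$ is

$$\frac{\mathrm{Cov}(\tilde{X}_{1},\tilde{Y})}{\mathrm{Var}(\tilde{X}_{1})}=\frac{2\,\mathrm{Var}(X_{1})}{\mathrm{Var}(X_{1})+\mathrm{Var}(\zeta_{1})}<2\quad\text{whenever }\mathrm{Var}(\zeta_{1})>0,$$

so, for instance, $\mathrm{Var}(X_{1})=\mathrm{Var}(\zeta_{1})$ halves the coefficient to $1$. For the inner products, expanding $\tilde{Y}^{e}-\tilde{X}^{e}\beta^{0}=(Y^{e}-X^{e}\beta^{0})+\zeta^{e}_{y}-\sum_{j}\zeta^{e}_{j}\beta^{0}_{j}$ and using that the $\zeta$ terms are centered, jointly independent and independent of $(X^{e},Y^{e})$ gives

$$\mathbb{E}\big[\tilde{X}^{e}_{k}(\tilde{Y}^{e}-\tilde{X}^{e}\beta^{0})\big]=\mathbb{E}\big[X^{e}_{k}(Y^{e}-X^{e}\beta^{0})\big]-\beta^{0}_{k}\,\mathbb{E}\big[(\zeta^{e}_{k})^{2}\big].$$

The first term on the right does not depend on $e$ by inner-product invariance of the latent variables, and the second does not depend on $e$ because the distribution of $\zeta^{e}_{k}$ is invariant. The term $\zeta^{e}_{y}$ drops out entirely, which is consistent with the fact that its distribution is allowed to change between settings.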

3 Causal Dantzig without regularization

In this section we introduce the unregularized causal Dantzig, discuss its basic properties and an example. We introduce the unregularized causal Dantzig in Section 3.1. Asymptotic confidence intervals for low-dimensional estimation are discussed in Section 3.3. Section 3.4 provides an example and explains basic usage of the method causalDantzig in the R-package InvariantCausalPrediction (R Core Team, 2017). Identifiability and consistency issues are discussed in Section 3.5. We conclude with a comparison to instrumental variable regression in Section 3.6.

3.1 The estimator

Assume that we observe i.i.d. samples of $(X^{e},Y^{e})$ in two environments $e\in\mathcal{E}=\{1,2\}$ with $n_{1},n_{2}$ samples in each environment. Let $\mathbf{X}^{1}$ and $\mathbf{X}^{2}$ be the $n_{1}\times p$ and $n_{2}\times p$ -dimensional matrices that contain the realized values of the random variables $X^{e}$ in environment $e=1$ and $e=2$ respectively and let $\mathbf{Y}^{1}\in\mathbb{R}^{n_{1}}$ and $\mathbf{Y}^{2}\in\mathbb{R}^{n_{2}}$ be the respective measurements of the response variables. Define the differences between the two environments in inner-product and Gram matrices, the so-called Gram-shift matrices

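$$\hat{\mathbf{Z}}:=\frac{1}{n_{1}}(\mathbf{X}^{1})^{t}\mathbf{Y}^{1}-\frac{1}{n_{2}}(\mathbf{X}^{2})^{t}\mathbf{Y}^{2}\in\mathbb{R}^{p},\qquad\hat{\mathbf{G}}:=\frac{1}{n_{1}}(\mathbf{X}^{1})^{t}\mathbf{X}^{1}-\frac{1}{n_{2}}(\mathbf{X}^{2})^{t}\mathbf{X}^{2}\in\mathbb{R}^{p\times p}.$$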

Assuming inner-product invariance holds under $\beta^{0}$ ,

$$\mathbb{E}[\hat{\mathbf{Z}}-\hat{\mathbf{G}}\beta^{0}]=0.$$

A simple estimator of $\beta^{0}$ is the empirical minimizer of the $\ell_{\infty}$ -norm of the differences between $\hat{\mathbf{Z}}$ and $\hat{\mathbf{G}}\beta$ .

Definition 2 (Unregularized causal Dantzig).

The causal Dantzig estimator $\hat{\beta}$ is defined as a solution to the optimization problem

$$\min_{\beta\in\mathbb{R}^{p}}\|\hat{\mathbf{Z}}-\hat{\mathbf{G}}\beta\|_{\infty}.\tag{6}$$

The choice of how to center and scale variables deserves some attention. We will discuss this in Section 5.1. Causal Dantzig is uniquely defined if and only if $\hat{\mathbf{G}}$ is invertible and can in this case be written as

$$\hat{\beta}=\hat{\mathbf{G}}^{-1}\hat{\mathbf{Z}}.\tag{7}$$

Note that by equation (5) this estimator is closely related to least squares in linear regression. Recall that for observations $\mathbf{Y}\in\mathbb{R}^{n}$ and design matrix $\mathbf{X}\in\mathbb{R}^{n\times p}$ , the least squares estimator is defined as

$$\hat{\beta}_{LS}=(\mathbf{X}^{t}\mathbf{X})^{-1}\mathbf{X}^{t}\mathbf{Y}.$$

Causal Dantzig is strikingly similar, with the Gram matrices replaced by differences of Gram matrices in different settings. As such, it is straightforward to derive asymptotic confidence intervals for this estimator. However, many properties from linear regression do not carry over. For example, the causal Dantzig is only asymptotically unbiased.
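The closed form $\hat{\beta}=\hat{\mathbf{G}}^{-1}\hat{\mathbf{Z}}$ is short enough to write out. The following NumPy sketch is an illustration only and is not the `causalDantzig` implementation from the R package; the function names and the simulated data (a hidden confounder plus a variance shift on both covariates in the second environment) are assumptions chosen for the example.

```python
import numpy as np

def gram_shift(X1, Y1, X2, Y2):
    """Differences of inner products (Z_hat) and Gram matrices (G_hat)."""
    Z = X1.T @ Y1 / len(Y1) - X2.T @ Y2 / len(Y2)
    G = X1.T @ X1 / len(Y1) - X2.T @ X2 / len(Y2)
    return Z, G

def causal_dantzig_unregularized(X1, Y1, X2, Y2):
    """Solve G_hat beta = Z_hat; np.linalg.solve raises if G_hat is singular."""
    Z, G = gram_shift(X1, Y1, X2, Y2)
    return np.linalg.solve(G, Z)

rng = np.random.default_rng(2)
beta0 = np.array([1.0, -0.5])     # hypothetical causal coefficients

def simulate(n, shift_scale):
    hidden = rng.normal(size=(n, 1))                      # confounds X and Y
    X = hidden + rng.normal(size=(n, 2)) + shift_scale * rng.normal(size=(n, 2))
    Y = X @ beta0 + 2.0 * hidden[:, 0] + rng.normal(size=n)
    return X, Y

X1, Y1 = simulate(50_000, 0.0)    # environment 1: no intervention
X2, Y2 = simulate(50_000, 2.0)    # environment 2: additive noise intervention on X

beta_cd = causal_dantzig_unregularized(X1, Y1, X2, Y2)

# Pooled least squares for comparison; it is biased by the hidden confounder.
X_pool, Y_pool = np.vstack([X1, X2]), np.concatenate([Y1, Y2])
beta_ls = np.linalg.lstsq(X_pool, Y_pool, rcond=None)[0]
print(beta_cd, beta_ls)
```

In this simulation the Gram-shift estimate lands close to `beta0`, while pooled least squares does not, because the confounding term contributes equally to both environments and cancels in the differences $\hat{\mathbf{Z}}$ and $\hat{\mathbf{G}}$.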

3.2 More than two environments

There are two straightforward extensions to more than two environments $|\mathcal{E}|>2$. Pooling data from different environments preserves inner-product invariance. If some of the environments are “observational” and the others are “interventional”, one option for splitting the data into two environments $\mathcal{E}^{\prime}=\{1,2\}$ is pooling all observational data ($e^{\prime}=1$) and pooling all interventional data ($e^{\prime}=2$). Instead of splitting the data into two environments one can change the definition of the estimator to accommodate more than two settings, for example by defining $\hat{\beta}$ as a solution to the optimization problem

$$\min_{\beta\in\mathbb{R}^{p}}\max_{e\in\mathcal{E}}\|\hat{\mathbf{Z}}^{e}-\hat{\mathbf{G}}^{e}\beta\|_{\infty},\tag{8}$$

where

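$$\hat{\mathbf{Z}}^{e}:=\frac{1}{n_{e}}(\mathbf{X}^{e})^{t}\mathbf{Y}^{e}-\frac{1}{|\mathcal{E}|-1}\sum_{\tilde{e}\neq e}\frac{1}{n_{\tilde{e}}}(\mathbf{X}^{\tilde{e}})^{t}\mathbf{Y}^{\tilde{e}}\in\mathbb{R}^{p},\qquad\hat{\mathbf{G}}^{e}:=\frac{1}{n_{e}}(\mathbf{X}^{e})^{t}\mathbf{X}^{e}-\frac{1}{|\mathcal{E}|-1}\sum_{\tilde{e}\neq e}\frac{1}{n_{\tilde{e}}}(\mathbf{X}^{\tilde{e}})^{t}\mathbf{X}^{\tilde{e}}\in\mathbb{R}^{p\times p}.$$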

Note that for two environments, solutions of equation (8) coincide with equation (6). It depends on the type of interventions and the signal strength which of the two options mentioned above is better. If the data can be split into two environments $\mathcal{E}^{\prime}=\{1,2\}$ that are homogeneous, doing so is preferable as the estimators of $\mathbf{G}^{e^{\prime}}$ and $\mathbf{Z}^{e^{\prime}},e^{\prime}\in\mathcal{E}^{\prime}$ have low variance. If the environments $\mathcal{E}$ have different (strong) interventions, solving equation (8) can be preferable as the effect of several strong interventions might get “washed out” when averaging over many environments. We will return later to the case of more than two environments. For the following discussion we assume that there are two environments $\mathcal{E}=\{1,2\}$ .
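The per-environment quantities $\hat{\mathbf{Z}}^{e}$ and $\hat{\mathbf{G}}^{e}$ and the objective of the multi-environment problem can be sketched as follows. The function names and the random test data are hypothetical. The sketch only evaluates the objective; minimizing it is a linear program and would need an LP solver, which is not shown here.

```python
import numpy as np

def gram_shifts_per_environment(Xs, Ys):
    """Z^e and G^e: environment e minus the average over all other environments."""
    inner = [X.T @ Y / len(Y) for X, Y in zip(Xs, Ys)]
    gram = [X.T @ X / len(X) for X in Xs]
    m = len(Xs)
    Zs = [inner[e] - (sum(inner) - inner[e]) / (m - 1) for e in range(m)]
    Gs = [gram[e] - (sum(gram) - gram[e]) / (m - 1) for e in range(m)]
    return Zs, Gs

def minimax_objective(beta, Zs, Gs):
    """max over environments of the sup-norm of Z^e - G^e beta."""
    return max(np.max(np.abs(Z - G @ beta)) for Z, G in zip(Zs, Gs))

# Two environments of arbitrary test data, to compare with the two-sample definition.
rng = np.random.default_rng(4)
Xs = [rng.normal(size=(200, 3)) * s for s in (1.0, 2.0)]
Ys = [X @ np.array([1.0, 0.0, -1.0]) + rng.normal(size=len(X)) for X in Xs]
Zs, Gs = gram_shifts_per_environment(Xs, Ys)

Z_hat = Xs[0].T @ Ys[0] / 200 - Xs[1].T @ Ys[1] / 200
G_hat = Xs[0].T @ Xs[0] / 200 - Xs[1].T @ Xs[1] / 200
beta = rng.normal(size=3)
```

With two environments, `Zs[0]` equals `Z_hat` and `Zs[1]` equals `-Z_hat` (and likewise for the Gram shifts), so `minimax_objective(beta, Zs, Gs)` coincides with $\|\hat{\mathbf{Z}}-\hat{\mathbf{G}}\beta\|_{\infty}$ for every `beta`. This is the reason the two definitions have the same solutions in that case.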

3.3 Confidence intervals

In the settings described above $\hat{\beta}$ is in general only asymptotically unbiased. This bias is unknown as it depends on the unknown amount of confounding between $X^{e}$ and $Y^{e}$ . Hence we will only pursue asymptotic confidence intervals. We will show that the estimator (7) is under certain conditions asymptotically normally distributed, that is for $n_{1},n_{2}\rightarrow\infty$ ,

[TABLE]

The matrices $V^{1}$ and $V^{2}$ are positive definite under suitable assumptions and can be consistently estimated from the data as $\hat{V}^{1}$ and $\hat{V}^{2}$ as we will discuss later. We can then define asymptotically valid confidence intervals for $\beta^{0}_{k}$ as

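$$I_{k}:=\left[\hat{\beta}_{k}-q\sqrt{\hat{V}_{kk}},\ \hat{\beta}_{k}+q\sqrt{\hat{V}_{kk}}\right],\tag{11}$$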

where $\hat{V}_{kk}$ is the $k$ -th diagonal element of $\hat{V}=\hat{V}^{1}/n_{1}+\hat{V}^{2}/n_{2}$ and $q=\Phi^{-1}(1-\alpha/2)$ . Here, $\Phi$ denotes the distribution function of a standard Gaussian random variable. The interval $I_{k}$ has asymptotic coverage

$$\mathbb{P}[\beta^{0}_{k}\in I_{k}]\rightarrow 1-\alpha\quad\text{for }n_{1},n_{2}\rightarrow\infty.$$

The conditions for asymptotic normality (10) are fourth-moment conditions on the observed random variables as well as conditions that guarantee that $V^{1}$ and $V^{2}$ are invertible and that causal Dantzig is unique.

Theorem 1 (Asymptotic normality).

Let $(X^{1},Y^{1})$ and $(X^{2},Y^{2})$ have finite fourth moments and assume that inner product invariance holds under $\beta^{0}$ . Assume that $(\mathbf{X}^{1},\mathbf{Y}^{1})$ and $(\mathbf{X}^{2},\mathbf{Y}^{2})$ are independent. Define $\mathbf{G}:=\mathbb{E}[\hat{\mathbf{G}}]$ and $\mathbf{Z}:=\mathbb{E}[\hat{\mathbf{Z}}]$ and let $\mathbf{G}$ and the covariance matrix of $X^{e}\eta_{p+1}^{e}$ , $e\in\mathcal{E}$ be invertible. For $n_{1},n_{2}\rightarrow\infty$ ,

[TABLE]

where $V^{e}:=\mathrm{Cov}(\mathbf{G}^{-1}(X^{e})^{t}\eta_{p+1}^{e})$ , $e\in\{1,2\}$ are invertible. Note that we allow $n_{1}$ and $n_{2}$ to have different asymptotic growth rates.

Remark 1 (Estimation of $V^{1}$ and $V^{2}$ ).

The empirical covariance matrix of

[TABLE]

is a consistent estimator of $V^{1}$ . $V^{2}$ can be estimated analogously.

The proof of this result can be found in the Appendix. The assumption that $\mathbf{G}$ is invertible will be discussed further in Section 3.5. In Section 4 we will discuss how the regularized causal Dantzig can be consistent in some situations where the population $\mathbf{G}$ is not invertible. Asymptotic efficiency is discussed in Section 8.5 in the Appendix.
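The construction of the intervals can be sketched as follows. The sketch assumes a plug-in reading of Theorem 1 and Remark 1: $V^{e}$ is estimated by the empirical covariance of the vectors $\hat{\mathbf{G}}^{-1}x_{i}^{t}(y_{i}-x_{i}\hat{\beta})$ over the samples of environment $e$. The function names and the simulated toy model are illustrative assumptions.

```python
import numpy as np
from statistics import NormalDist

def causal_dantzig_ci(X1, Y1, X2, Y2, alpha=0.05):
    """Estimate, standard errors and asymptotic (1 - alpha) intervals."""
    n1, n2 = len(Y1), len(Y2)
    G_hat = X1.T @ X1 / n1 - X2.T @ X2 / n2
    Z_hat = X1.T @ Y1 / n1 - X2.T @ Y2 / n2
    beta = np.linalg.solve(G_hat, Z_hat)
    G_inv = np.linalg.inv(G_hat)

    def v_hat(X, Y):
        # Row i holds G_hat^{-1} x_i^t (y_i - x_i beta); V^e is estimated by
        # the empirical covariance of these rows (assumed plug-in form).
        resid = Y - X @ beta
        scores = (X * resid[:, None]) @ G_inv.T
        return np.atleast_2d(np.cov(scores, rowvar=False))

    V_hat = v_hat(X1, Y1) / n1 + v_hat(X2, Y2) / n2
    se = np.sqrt(np.diag(V_hat))
    q = NormalDist().inv_cdf(1 - alpha / 2)          # Gaussian quantile
    return beta, se, np.column_stack([beta - q * se, beta + q * se])

def simulate(n, scale, beta, rng):
    """Toy model (illustrative assumption): H confounds X and Y."""
    H = rng.standard_normal(n)
    X = scale * rng.standard_normal((n, len(beta))) + H[:, None]
    Y = X @ beta + H + rng.standard_normal(n)
    return X, Y

rng = np.random.default_rng(1)
beta0 = np.array([1.0, -0.5])
X1, Y1 = simulate(20_000, 1.0, beta0, rng)
X2, Y2 = simulate(20_000, 2.0, beta0, rng)
beta_hat, se, ci = causal_dantzig_ci(X1, Y1, X2, Y2)
print(np.column_stack([beta_hat, se, ci]))
```

The sample sizes enter only through $\hat{V}=\hat{V}^{1}/n_{1}+\hat{V}^{2}/n_{2}$, so the two environments may have very different sizes.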

3.4 Implementation and example

We use data generated according to a SEM with the structure given by Figure 1 as an example. Suppose the data are generated in two environments $\{1,2\}=\mathcal{E}$ according to

[TABLE]

where $(\eta^{0},\eta_{y},\eta_{1},\eta_{2},\eta_{3})$ is assumed to be drawn from $\mathcal{N}_{5}(0,\mathrm{Id}_{5})$ and the noise variances are $\sigma^{e}=1$ for environment $e=1$ and $\sigma^{e}=4$ for environment $e=2$ . We draw $1000$ i.i.d. samples from each environment and the corresponding pairwise scatterplots are shown in Figure 2. For one realization we obtain the estimate $\hat{\beta}$ via the difference in Gram matrices $\mathbf{G}$ and inner products with the target $\mathbf{Z}$ as

[TABLE]

where the correct vector of causal coefficients in this problem is

[TABLE]

Asymptotic confidence intervals can be computed via (11).

The procedure is implemented as method causalDantzig in the R-package InvariantCausalPrediction (R Core Team, 2017). The output for the example above is shown below, where $X$ is the matrix with predictor variables, $Y$ the outcome of interest and $E$ is an $n$ -dimensional vector with entries $1$ for samples from environment $e=1$ and entries $2$ for samples from environment $e=2$ .

    fit <- causalDantzig(X,Y,E,regularization=FALSE)
    print(fit)

    Unregularized causal Dantzig

    Call: causalDantzig(X = X, Y = Y, E = E, regularization = FALSE)

       Estimate StdErr p.value
    X1   -0.042  0.059   0.481
    X2    0.999  0.106  <2e-16 ***
    X3    0.035  0.042   0.403

    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Only the direct causal effect of the second variable turns out to be statistically significant. Note that in this setting, instrumental variables regression would fail. One problem is that the number of covariates exceeds the number of “instruments”. Additionally, the expectations of $X^{1}$ and $X^{2}$ are equal, implying that there is no mean shift due to the two environments. We will discuss these issues in more detail in Section 3.6.

3.5 Identifiability of $\beta^{0}$ and practical implications

In the simplest setting, the number of samples greatly exceeds the number of parameters, and the interventions $\delta^{e}$ , $e\in\mathcal{E}$ are sufficiently different to make the parameter $\beta^{0}$ identifiable. Theorem 2 gives conditions under which this is the case.

Theorem 2.

Consider a SEM that satisfies Assumption 1. Assume that there exists an “observational” environment, i.e. an environment $e\in\mathcal{E}$ with $\delta^{e}\equiv 0$ . Furthermore assume that all interventions $\delta^{e}$ are full-rank on their support, i.e. that the Gram matrix of $\delta_{S^{e}}^{e}$ is positive definite for $S^{e}=\{k:\delta_{k}^{e}\not\equiv 0\}$ .

1. The causal coefficient is identifiable in the population case if and only if for each $k=1,\ldots,p$ there exists $e\in\mathcal{E}$ such that $\delta_{k}^{e}\not\equiv 0$ .

2. If the condition in 1. holds then the solution of causal Dantzig as defined in equation (8) is unique in the population case and equal to $\beta^{0}$ .

The proof of this result can be found in the Appendix. Usually, there are many different SEMs satisfying Assumption 1 that can generate a given observed distribution of $(X^{e},Y^{e}),e\in\mathcal{E}$ . Theorem 2 gives a condition under which these SEMs all share the same direct causal effect $\beta^{0}$ from $X_{1},\ldots,X_{p}$ to $Y$ . If said condition is satisfied, the causal Dantzig has a unique solution in the population case that is equal to $\beta^{0}$ . Furthermore, it tells us that if this condition is not satisfied, there exist at least two SEMs satisfying Assumption 1 with different direct causal effects from $X_{1},\ldots,X_{p}$ to $Y$ that generate the given distribution. Without further assumptions it is then not possible to consistently estimate the direct causal effects, but only a set of potential causal effects. We will characterize this set later.

Note that Theorem 2 describes a rather strong condition for identifiability. Especially if $p$ is large it might be unrealistic to have nonzero interventions $\delta_{k}^{e}$ on each of the variables $X_{k},k=1,\ldots,p$ . However, making additional assumptions can help resolve these identifiability issues. If the interventions $\delta_{k}^{e}$ only act on a subset of the variables $X_{1},\ldots,X_{p}$ or when the number of covariates exceeds the sample size $p>n$ , the regularized causal Dantzig can be consistent under the additional assumption of sparsity. We discuss consistency of the regularized causal Dantzig in such scenarios in Section 4.2 and Section 4.3. Alternatively, it can be advisable to first run LASSO on the pooled dataset to select a subset of the variables. Under the assumption of faithfulness, it is sufficient to have nonzero interventions on the selected subset. Some justification for this approach can be found in Section 5.3.

If the assumptions for identifiability of $\beta^{0}$ are not fulfilled it should still be possible to guarantee predictive performance under certain new environments. The following theorem makes this intuition more precise. The proof can be found in the Appendix.

Theorem 3.

Consider a SEM that satisfies Assumption 1. Assume that there exists an “observational” environment, i.e. an environment $e\in\mathcal{E}$ with $\delta^{e}\equiv 0$ . Furthermore assume that all interventions $\delta^{e}$ are full-rank on their support, i.e. that the Gram matrix of $\delta_{S^{e}}^{e}$ is positive definite for $S^{e}=\{k:\delta_{k}^{e}\not\equiv 0\}$ . Let $\beta$ be a solution of causal Dantzig as defined in equation (8) in the population case.

1. Then the distribution of the residuals is invariant, i.e.

[TABLE]

2. For a new environment $\tilde{e}\not\in\mathcal{E}$ that satisfies Assumption 1 for $(X^{e},Y^{e})$ , $e\in\mathcal{E}\cup\{\tilde{e}\}$ with $\{k:\delta_{k}^{\tilde{e}}\not\equiv 0\}\subset\cup_{e\in\mathcal{E}}S^{e}$ , we have

[TABLE]

In words, solutions of causal Dantzig guarantee that the residuals have the same distribution across all environments $e\in\mathcal{E}$ . Perhaps more importantly, solutions of causal Dantzig are guaranteed to have the same predictive performance on new environments $\tilde{e}\not\in\mathcal{E}$ with arbitrarily large additive perturbations $\delta_{k}^{\tilde{e}}$ as long as these perturbations act on a subset of the variables $\cup_{e\in\mathcal{E}}S^{e}$ .

3.6 Comparison with instrumental variables

Consider a setting where the underlying DAG takes the following form:

[Figure: DAG on the nodes $Y$ , $H$ , $X$ and $e$ .]

We assume that $H$ is not observed and that $e$ takes values in $\{1,2\}$ . To be able to use the causal Dantzig, we have to define settings $\mathcal{E}$ . It is rather straightforward to write $(X^{1},Y^{1})$ for the variables $(X,Y)$ conditioned on $e=1$ and $(X^{2},Y^{2})$ for the variables $(X,Y)$ conditioned on $e=2$ . As $e$ is binary, the method of instrumental variables (IV) coincides with the Wald estimator (Wald, 1940). In the population case it can be written as

[TABLE]

Causal Dantzig leads to

[TABLE]

Both the IV approach and the causal Dantzig have different strengths and weaknesses in this setting. For example, equation (13) is based on means, whereas equation (14) is based on covariances. If, say, $X=e\cdot\eta_{x}+H$ , $Y=\beta X+H+\eta_{y}$ , with centered noise $\eta_{x},\eta_{y}$ independent of the centered confounder $H$ , then $\mathbb{E}[X|e=1]=\mathbb{E}[X|e=2]$ . Hence the IV estimator is not well-defined in the population case and one should use the causal Dantzig. If the instrument is weak, causal Dantzig can exhibit efficiency gains. An example of this can be found in Section 6.2. A more general comparison can be found in Table 1. It is also possible to construct examples where equation (14) is not well-defined. For this to happen, the second moments of $X^{1}$ and $X^{2}$ have to be equal.
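The example $X=e\cdot\eta_{x}+H$ , $Y=\beta X+H+\eta_{y}$ can be worked out explicitly. The following is a sketch under the stated independence assumptions. The symbols $\sigma_{x}^{2}>0$ and $\sigma_{H}^{2}$ for the variances of $\eta_{x}$ and $H$ are notation introduced here, and the population causal Dantzig for $p=1$ is taken to be the ratio of the difference of inner products to the difference of second moments.

```latex
\begin{align*}
\mathbb{E}[X\mid e] &= e\,\mathbb{E}[\eta_{x}]+\mathbb{E}[H]=0
  \quad\text{for } e\in\{1,2\},\\
\mathbb{E}[(X^{e})^{2}] &= e^{2}\sigma_{x}^{2}+\sigma_{H}^{2},\\
\mathbb{E}[X^{e}Y^{e}] &= \beta\,\mathbb{E}[(X^{e})^{2}]+\mathbb{E}[X^{e}H]+\mathbb{E}[X^{e}\eta_{y}]
  =\beta\,(e^{2}\sigma_{x}^{2}+\sigma_{H}^{2})+\sigma_{H}^{2},\\
\frac{\mathbb{E}[X^{1}Y^{1}]-\mathbb{E}[X^{2}Y^{2}]}
     {\mathbb{E}[(X^{1})^{2}]-\mathbb{E}[(X^{2})^{2}]}
  &=\frac{-3\beta\sigma_{x}^{2}}{-3\sigma_{x}^{2}}=\beta .
\end{align*}
```

The conditional means of $X$ (and hence of $Y$ ) coincide across the two settings, so a ratio of mean differences is of the form $0/0$ . The confounding term $\sigma_{H}^{2}$ is the same in both environments and cancels in the difference of inner products.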

A drawback of the IV approach is that the number of instruments has to equal or exceed the number of endogenous variables. However, this is not necessary for the causal Dantzig. Two settings $|\mathcal{E}|=2$ in our framework correspond to a single binary exogenous variable. In that case the number of endogenous variables $p$ can be arbitrarily large as long as $\mathbf{G}$ , the difference of Gram matrices, is invertible. On the other hand, for $p>2$ the number of endogenous variables exceeds the number of exogenous variables and the IV approach is bound to fail. We compare the performance of the IV approach and causal Dantzig on simulated datasets in Section 6.2.

3.7 Inner-product invariance in the potential outcome framework

In this section we will investigate the notion of inner-product invariance under potential outcome assumptions (Neyman, 1923; Rubin, 1974). Note that here, as in the rest of the paper, we consider a continuous exposure $X\in\mathbb{R}^{p}$ . In the following, we use a slightly different notation compared to the rest of the paper. We write $X(e)\in\mathbb{R}^{p}$ for the potential outcome of a continuous exposure if the environment $E$ takes value $e\in\mathcal{E}$ . Analogously we write $Y(x,e)\in\mathbb{R}$ for the potential outcome of the response of a unit if the exposure takes level $X=x$ and environment $E$ takes value $e\in\mathcal{E}$ . We assume that these quantities are well-defined. We make the following additional assumptions:

A1. Exclusion restriction:

[TABLE]

A2. Independence:

[TABLE]

A3. Constant confounding across environments $\mathcal{E}$ :

[TABLE]

A4. Treatment effect homogeneity and linearity:

[TABLE]

A5. The variables are normalized:

[TABLE]

Note that we did not make any cross-world assumptions (Richardson and Robins, 2013), i.e. we made no assumptions on the joint distribution of $Y(x)$ , $x\in\text{range}(X)$ or on the joint distribution of $X(e)$ , $e\in\mathcal{E}$ . Condition (A2) can be relaxed to an assumption on the cross-product between $X(e)$ and $Y(0)$ . Details can be found in the Appendix in the proof of Proposition 3. Condition (A3) is crucial: we allow for confounding (nonzero covariance of $X(e)$ and $Y(0)$ ), but we assume that the covariance is constant across environments. Loosely speaking, this can be seen as a non-interaction-assumption of environment and confounding. Condition (A4) ensures that the average treatment effect is the same within strata defined by $X$ and $E$ and allows the usage of a linear model. For a discussion of similar assumptions in the context of the IV framework, see Wang and Tchetgen Tchetgen (2017).

If these assumptions are fulfilled, then we have inner-product invariance under the average treatment effect $\beta^{0}$ .

Proposition 3.

Under assumptions (A1) - (A5) we have inner-product invariance under the vector $\beta^{0}\in\mathbb{R}^{p}$ which satisfies $\mathbb{E}[Y(x)-Y(0)]=x\beta^{0}$ , i.e.

[TABLE]

The proof of this result can be found in the Appendix. Using inner-product invariance, it is possible to consistently estimate the average treatment effect $\beta^{0}$ in cases in which two-stage least squares (or the Wald estimand) is degenerate, for example in settings where the dimension of the exposure variables $X$ exceeds the number of environments $|\mathcal{E}|$ or when $\mathbb{E}[Y-Xb|E=1]=\mathbb{E}[Y-Xb|E=0]$ for $\mathcal{E}=\{0,1\}$ . In the presence of weak instruments, causal Dantzig can exhibit efficiency gains compared to estimators based on conditional means of $X$ and $Y$ . This is investigated further in Section 3.6 and Section 6.

4 Causal Dantzig with regularization

In this section we introduce the regularized causal Dantzig, and discuss its theoretical properties. The estimator is motivated and introduced in Section 4.1. Section 4.2 contains finite sample bounds. The bounds presented in this section involve a quantity that we call the “causal cone invertibility factor”. The behavior of this quantity is discussed in Section 4.3.

4.1 The estimator

Weak interventions on some of the variables (i.e. $\mathbb{E}[(\delta_{k}^{e})^{2}]$ small) may lead to coefficient estimates with high variance in equation (7). Furthermore, if the number of predictors $p$ exceeds the total sample size $n$ , the matrix $\hat{\mathbf{G}}$ is not invertible and the solution to equation (6) is not unique. In such settings, regularization and shrinkage are desirable and can outperform unpenalized estimation procedures, see e.g. Bühlmann and van de Geer (2011). In particular, $\ell_{1}$ -penalized estimation procedures have attracted much interest in high-dimensional models. For linear models, Candes and Tao (2007) proposed an $\ell_{1}$ -minimization method called the Dantzig selector. Consider $\mathbf{Y}=\mathbf{X}\beta^{*}+\epsilon$ with $\mathbf{X}\in\mathbb{R}^{n\times p}$ , $\mathbf{Y}\in\mathbb{R}^{n}$ , $\beta^{*}\in\mathbb{R}^{p}$ . For a tuning parameter $\lambda\geq 0$ , the Dantzig selector is defined as a solution to the regularization problem

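$$\min_{\beta\in\mathbb{R}^{p}}\|\beta\|_{1}\quad\text{subject to}\quad\|\mathbf{X}^{t}\mathbf{Y}/n-\mathbf{X}^{t}\mathbf{X}\beta/n\|_{\infty}\leq\lambda.$$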

The geometry of the Dantzig selector is depicted in Figure 3. The $\ell_{1}$ -minimization favors sparse solutions, i.e. vectors in which many coefficients are exactly zero. This facilitates interpretation. Furthermore, if $\lambda$ gets larger, the Dantzig selector shrinks towards the zero vector. Choosing $\lambda$ is a trade-off: small values will generally result in larger variance of the estimator, but smaller bias. We propose the regularized causal Dantzig $\hat{\beta}^{\lambda}$ , which in analogy to equation (6) is defined as a solution to

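$$\min_{\beta\in\mathbb{R}^{p}}\|\beta\|_{1}\quad\text{subject to}\quad\|\hat{\mathbf{Z}}-\hat{\mathbf{G}}\beta\|_{\infty}\leq\lambda.\tag{16}$$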

On a superficial level, the difference to the Dantzig selector is merely that $\mathbf{X}^{t}\mathbf{Y}/n$ is replaced by $(\mathbf{X}^{1})^{t}\mathbf{Y}^{1}/n_{1}-(\mathbf{X}^{2})^{t}\mathbf{Y}^{2}/n_{2}$ and $\mathbf{X}^{t}\mathbf{X}/n$ is replaced by $(\mathbf{X}^{1})^{t}\mathbf{X}^{1}/n_{1}-(\mathbf{X}^{2})^{t}\mathbf{X}^{2}/n_{2}$ . Hence the geometry of the optimization problem is akin to the Dantzig selector and the causal Dantzig inherits its variable selection, shrinkage and regularization properties. Furthermore, the causal Dantzig can be cast as a linear program for fixed $\lambda$ . Details can be found in the Appendix, Section 8.3.6.
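The linear-programming view can be sketched with a standard reformulation: introduce auxiliary variables $u\geq 0$ with $-u\leq\beta\leq u$ and minimize $\sum_{k}u_{k}$ subject to the sup-norm constraint. This is one common way to write an $\ell_{1}$ problem as a linear program and may differ from the formulation in the Appendix. The function name `regularized_causal_dantzig` and the small input matrices are illustrative assumptions; the solver is SciPy's `linprog`.

```python
import numpy as np
from scipy.optimize import linprog

def regularized_causal_dantzig(G_hat, Z_hat, lam):
    """min ||beta||_1  subject to  ||Z_hat - G_hat @ beta||_inf <= lam."""
    p = len(Z_hat)
    I, O = np.eye(p), np.zeros((p, p))
    # Decision vector is (beta, u); only u enters the objective.
    c = np.concatenate([np.zeros(p), np.ones(p)])
    A_ub = np.block([
        [I, -I],        #  beta - u <= 0
        [-I, -I],       # -beta - u <= 0
        [-G_hat, O],    #  Z_hat - G_hat beta <= lam
        [G_hat, O],     # -(Z_hat - G_hat beta) <= lam
    ])
    b_ub = np.concatenate([np.zeros(2 * p), lam - Z_hat, lam + Z_hat])
    bounds = [(None, None)] * p + [(0, None)] * p   # beta free, u >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:p]

# Illustrative inputs; G_hat is a difference of Gram matrices and need not
# be positive definite.
G_hat = np.array([[-3.0, 0.0], [0.0, -3.0]])
Z_hat = np.array([-3.0, 0.15])
beta_lam = regularized_causal_dantzig(G_hat, Z_hat, lam=0.3)
print(beta_lam)
```

In this toy input the first coefficient is shrunk to the boundary of the feasible interval $[0.9,1.1]$ and the second is set exactly to zero, which illustrates the shrinkage and variable selection described above.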

4.2 Finite-sample bound

The regularized causal Dantzig is related to the Dantzig selector and enjoys similar properties. Notably, it attains the same rates of convergence under comparable regularity conditions. To this end, we introduce the quantity “causal cone invertibility factor”, similar to the “cone invertibility factor” for the Dantzig selector as defined in Ye and Zhang (2010). For ease of exposition we will first treat the case $\mathcal{E}=\{1,2\}$ . The treatment of the general case is sketched in Remark 2.

4.2.1 Causal Cone Invertibility Factor

Let $\hat{\boldsymbol{\Sigma}}$ denote the empirical covariance matrix of $\mathbf{X}$ and consider a set $S\subset\{1,\ldots,p\}$ . Later we will mainly be interested in the case where $S$ is the active set of $\beta^{0}$ . Ye and Zhang (2010) proved bounds for the Dantzig selector that involve the so-called cone invertibility factor (CIF). For the upper bound, the relevant quantity in Ye and Zhang (2010) is $\mathrm{CIF}_{q}(S)$ . Roughly speaking, the cone invertibility factor is a lower bound on the $\ell_{\infty}$ -norm of $\hat{\boldsymbol{\Sigma}}u$ , given that $u$ lies in the cone $\{u:\|u_{S^{c}}\|_{1}\leq\|u_{S}\|_{1}\}$ and has unit norm $\|u\|_{q}=1$ . To make the quantity comparable across different norms, it is scaled by a factor $|S|^{1/q}$ . To be more precise,

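$$\mathrm{CIF}_{q}(S):=\inf\left\{|S|^{1/q}\,\|\hat{\boldsymbol{\Sigma}}u\|_{\infty}\ :\ \|u_{S^{c}}\|_{1}\leq\|u_{S}\|_{1},\ \|u\|_{q}=1\right\}.$$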

Now we are ready to define the causal cone invertibility factor $\mathrm{CCIF}_{q}(S,\hat{\mathbf{G}})$ :

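$$\mathrm{CCIF}_{q}(S,\hat{\mathbf{G}}):=\inf\left\{|S|^{1/q}\,\|\hat{\mathbf{G}}u\|_{\infty}\ :\ \|u_{S^{c}}\|_{1}\leq\|u_{S}\|_{1},\ \|u\|_{q}=1\right\}.\tag{17}$$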

Analogously define $\mathrm{CCIF}_{q}(S,\mathbf{G})$ for $\mathbf{G}:=\mathbb{E}[\hat{\mathbf{G}}]$ . Here and in the following, notationally we do not treat the case $q=\infty$ separately. Instead, with small abuse of notation we set $|S|^{1/q}:=1$ for $q=\infty$ . In the new definition, the positive semi-definite matrix $\hat{\boldsymbol{\Sigma}}$ is replaced by the symmetric matrix $\hat{\mathbf{G}}$ . Like $\hat{\boldsymbol{\Sigma}}$ , the matrix $\hat{\mathbf{G}}$ is not positive definite in high-dimensional settings, and it is even indefinite in general. However, it can be shown that the CCIF behaves similarly to the CIF in several ways. This is further discussed in Section 4.3. For now, let us turn to the finite-sample bound of the causal Dantzig.

4.2.2 Finite sample bound

The finite-sample results of the causal Dantzig are analogous to the Dantzig selector while the issue of identifiability is now addressed by the causal cone invertibility factor $\mathrm{CCIF}_{q}(S,\hat{\mathbf{G}})$ . As in Ye and Zhang (2010), define $z^{*}:=\|\hat{\mathbf{Z}}-\hat{\mathbf{G}}\beta^{0}\|_{\infty}$ and let $S$ denote the active set of $\beta^{0}$ . The first result is purely algebraic and follows from the definitions of $\mathrm{CCIF}_{q}(S,\hat{\mathbf{G}})$ and the causal Dantzig.

Lemma 1.

On the event $z^{*}\leq\lambda$ we have

[TABLE]

The proof can be found in the Appendix. There are two terms on the right-hand side in equation (18) that deserve further attention. First, $\mathrm{CCIF}_{q}(S,\hat{\mathbf{G}})$ is bounded away from zero under certain assumptions, as discussed in Section 4.3. Secondly, it is crucial to understand the behavior of $z^{*}:=\|\hat{\mathbf{Z}}-\hat{\mathbf{G}}\beta^{0}\|_{\infty}$ . Using a union bound over the $p$ entries, it can be shown that with high probability, $z^{*}$ is of the order $\max_{e\in\mathcal{E}}\max(\log(p)/n_{e},\sqrt{\log(p)/n_{e}})$ :

Lemma 2.

Assume that inner-product invariance holds for $(X^{e},Y^{e}),e\in\{1,2\}$ under $\beta^{0}$ . Assume $X^{1},X^{2},\eta_{p+1}^{1},\eta_{p+1}^{2}$ are centered and multivariate Gaussian. Let $t\geq 0$ . Then, with probability exceeding $1-4\exp(-t)$ ,

[TABLE]

The proof can be found in the Appendix. This result can be extended to situations where $(X^{1},\eta_{p+1}^{1})$ and $(X^{2},\eta_{p+1}^{2})$ have subgaussian tails, see e.g. exercise 14.3 in Bühlmann and van de Geer (2011). By combining Lemma 1 and Lemma 2 we obtain the following theorem. The proof can be found in the Appendix.

Theorem 4.

Let $\lambda\asymp 5C\sqrt{\log(p)/\min_{e\in\{1,2\}}n_{e}}\rightarrow 0$ for a constant $C>0$ that satisfies $\sigma_{\varepsilon}\cdot\sigma_{\text{max}}^{e}\leq C<\infty$ for $e\in\{1,2\}$ . Under the assumptions mentioned in Lemma 2,

[TABLE]

with $\mathbb{P}\rightarrow 1$ for $n_{1},n_{2},p\rightarrow\infty$ .
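To get a feeling for the scale of the tuning parameter in Theorem 4, the rate $5C\sqrt{\log(p)/\min_{e}n_{e}}$ can be evaluated directly. This is a minimal sketch: the relation $\asymp$ only fixes the order of $\lambda$ , so taking the hidden constant equal to one is an assumption, as are the function name `theoretical_lambda` and the example values of $C$ , $p$ , $n_{1}$ and $n_{2}$ .

```python
import math

def theoretical_lambda(C, p, n1, n2):
    """5 * C * sqrt(log(p) / min(n1, n2)); hidden constant in the rate set to 1."""
    return 5.0 * C * math.sqrt(math.log(p) / min(n1, n2))

lam_small = theoretical_lambda(C=1.0, p=100, n1=1_000, n2=4_000)
lam_large = theoretical_lambda(C=1.0, p=100, n1=100_000, n2=400_000)
print(lam_small, lam_large)
```

Only the smaller of the two sample sizes matters, and multiplying both sample sizes by 100 divides the value by 10.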

Another consequence of these two Lemmata is the screening property of the causal Dantzig under a so-called betamin-condition. The short proof can be found in the Appendix.

Proposition 4.

Let $\hat{S}$ denote the active set of $\hat{\beta}^{\lambda}$ . Using the notation of Theorem 4, assume that

[TABLE]

Then under the assumptions mentioned in Theorem 4 for $q=\infty$ , we have

[TABLE]

Note that the convergence rate in Theorem 4 coincides with the usual rate of convergence in high-dimensional linear regression (Ye and Zhang (2010)) under comparable assumptions. For consistency in the $\ell_{2}$ norm in the regression setting it is required that $|S|\log(p)/n\rightarrow 0$ , that $\lambda\asymp C\sqrt{\log(p)/n}$ for constant $C>0$ large enough and that the population quantity $\mathrm{CIF}_{2}(S)$ is bounded away from zero. In our framework, if $n_{1}\asymp n_{2}$ , the assumptions on the asymptotic behavior of $n=n_{1}+n_{2},p,|S|$ and $\lambda$ stay essentially the same, but $\mathrm{CCIF}_{2}(S,\hat{\mathbf{G}})$ plays the role of $\mathrm{CIF}_{2}(S)$ . The next section aims to shed some light on the behavior of this quantity.

Remark 2.

The results of this section can be extended to more than two settings $|\mathcal{E}|>2$ . To be more precise, in the general case one can define the regularized causal Dantzig as a solution to

[TABLE]

where $\hat{\mathbf{Z}}^{e},\hat{\mathbf{G}}^{e},e\in\mathcal{E}$ are defined as in equation (9). The causal cone invertibility factor is then defined as

[TABLE]

With this notation, it is straightforward to obtain analogous results to Lemma 1-3, Theorem 4 and Proposition 4.

4.3 Behavior of the causal cone invertibility factor

In the preceding section we showed that the causal cone invertibility factor $\mathrm{CCIF}_{q}(S,\hat{\mathbf{G}})$ is a crucial quantity to understand the behavior of the regularized causal Dantzig. How do we guarantee that this quantity is bounded away from zero? There are two issues that we will treat separately. First, for $p>n=n_{1}+n_{2}$ , $\hat{\mathbf{G}}$ is not invertible. Secondly, the environments might not be sufficiently different to make the population version $\mathbf{G}$ invertible. In Section 4.3.1 we will discuss how to relate the empirical causal cone invertibility factor to the population causal cone invertibility factor. In Section 4.3.2 we consider the case where the environments are sufficiently different to make the population version $\mathbf{G}$ invertible. In Section 4.3.3 we examine a setting where the environments are not sufficiently different, i.e. where $\mathbf{G}$ is not invertible.

4.3.1 General properties

In this section we discuss how to relate the empirical causal cone invertibility factor $\mathrm{CCIF}_{q}(S,\hat{\mathbf{G}})$ to the population quantity $\mathrm{CCIF}_{q}(S,\mathbf{G})$ . The following Lemma gives a deterministic bound for these quantities. The proof can be found in the Appendix.

Lemma 3.

Let $q\geq 1$ . Then,

[TABLE]

where $\|A\|_{\infty}:=\max_{i,j}|A_{i,j}|$ denotes the matrix max norm.

Hence the problem is reduced to understanding the behavior of $\|\hat{\mathbf{G}}-\mathbf{G}\|_{\infty}$ . Let the rows of $\mathbf{X}^{e}$ consist of i.i.d. centered multivariate Gaussian random variables for $e\in\{1,2\}$ . It can be shown that with probability at least $1-4\exp(-t)$ ,

[TABLE]

This result can be extended to situations where $X^{1}$ and $X^{2}$ have subgaussian tails, see e.g. exercise 14.3 in Bühlmann and van de Geer (2011). Hence by Lemma 3, even if $\hat{\mathbf{G}}$ is not invertible, the quantity in equation (17) is well behaved for $\sqrt{\min(n_{1},n_{2})}\gg|S|\sqrt{\log(p)}$ , in the sense that it is strictly bounded away from zero if the same is true for the population quantity. The latter assumption is nontrivial and depends on the distribution of the interventions $\delta^{e},e\in\{1,2\}$ .

4.3.2 Population $\mathbf{G}$ invertible

Under the assumptions discussed in Section 4.3.1, $\mathrm{CCIF}_{q}(S,\hat{\mathbf{G}})$ is bounded away from zero if $\mathrm{CCIF}_{q}(S,\mathbf{G})$ is bounded away from zero. Hence, the problem is reduced to understanding the population quantity $\mathrm{CCIF}_{q}(S,\mathbf{G})$ . If $\mathbf{G}$ is invertible, then

[TABLE]

As

[TABLE]

this is a measure of the difference in the intervention strength $\delta^{e}$ between the two settings $e=1$ and $e=2$ . In this sense, this bound is similar to the discussion in Section 4.3.3. However, the bound fails to capture appropriately what happens if the interventions only act on a subset of the variables $X_{i},i=1,\ldots,p$ . In that case the bound in equation (20) is not useful as $\mathbf{G}$ is not invertible. The next section shows that in some of these settings it is still true that $\mathrm{CCIF}_{q}(S,\mathbf{G})>0$ .

4.3.3 Population $\mathbf{G}$ not invertible

The setting of Section 4.3.2 and the bound in equation (20) are rather restrictive. Consider a situation with a block structure in the Gram matrix, i.e. where $\mathbb{E}[X_{k}^{e}X_{k^{\prime}}^{e}]=0$ for all $k\leq k_{0}<k^{\prime}$ and $e\in\mathcal{E}$ . In this case, there might be no interventions on the variables $\{X_{k^{\prime}},k^{\prime}>k_{0}\}$ , i.e. $\delta_{k^{\prime}}^{e}\equiv 0$ for all $k^{\prime}>k_{0}$ . As a result, $\mathbf{G}$ might not be invertible. However, if $\mathbf{G}_{1:k_{0},1:k_{0}}$ is invertible and $S\subset\{1,\ldots,k_{0}\}$ , then

[TABLE]

Hence, under the assumptions discussed in Section 4.2.2, the causal Dantzig is a consistent estimator for $\beta^{0}$ . Generally speaking, the causal Dantzig tends to screen out variables that have not been affected by the intervention. In this light it is crucial that the interventions act on the variables in the active set of $\beta^{0}$ directly or indirectly.

5 Practical considerations

In this section we discuss practical considerations for the causal Dantzig. Recommendations are given for centering and scaling of the variables, choice of the regularization parameter $\lambda$ and a procedure for preselection.

5.1 Centering and scaling

Centering and scaling in the causal Dantzig setting is a bit more intricate than in a regression setting. Let $\hat{\mu}^{e}\in\mathbb{R}^{p+1}$ denote the empirical mean of $(\mathbf{X}^{e},\mathbf{Y}^{e})$ . For centering, we recommend subtracting $\frac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}}\hat{\mu}^{e}$ from each sample. By mean-centering globally (and not with an environment-specific intercept), the estimator is able to leverage changes in mean between environments. For scaling, define

[TABLE]

We recommend scaling the $k$ -th row of $\hat{\mathbf{Z}}^{e}$ and $\hat{\mathbf{G}}^{e}$ by approximately $1/\sqrt{c_{k,e}}$ for all $k=1,\ldots,p$ and $e\in\mathcal{E}$ . What is the motivation behind this scaling? In the following we will discuss the special case $\mathcal{E}=\{1,2\}$ . In absence of noise in equation (16), $\|\mathbf{Z}-\mathbf{G}\beta^{0}\|_{\infty}=0$ . By allowing for $\|\hat{\mathbf{Z}}-\hat{\mathbf{G}}\beta^{0}\|_{\infty}\leq\lambda$ , we account for the variance of $\hat{\mathbf{Z}}-\hat{\mathbf{G}}\beta^{0}$ . Since we work with a supremum bound and the same $\lambda$ for all components, we want all scaled components to have roughly the same variance. To be more precise, we want

[TABLE]

It can be challenging to scale according to equation (22) as the correlation between $X_{k}^{e}$ and $\eta_{p+1}^{e}=Y^{e}-X^{e}\beta^{0}$ is unknown and changes for different $k$ . However, in the absence of confounding and if $X_{k}$ and $X_{l}$ are not descendants of $Y$ in the graph $G$ , $\varepsilon=\eta_{p+1}^{e}$ is independent of $X_{k}^{e}$ and $X_{l}^{e}$ and the scaling of equation (21) implies

[TABLE]

where $\sigma_{\varepsilon}$ denotes the standard deviation of $\varepsilon=\eta_{p+1}^{e}$ . The scaling of equation (21) still has some theoretical justification in more general cases. In the presence of confounding and for general $k,l$ it depends on the joint distribution of $X_{k}^{e},X_{l}^{e}$ and $\varepsilon=\eta_{p+1}^{e}$ whether $\text{Var}\left(\frac{(\hat{\mathbf{Z}}-\hat{\mathbf{G}}\beta^{0})_{k}}{\sqrt{c_{k,1}}}\right)$ and $\text{Var}\left(\frac{(\hat{\mathbf{Z}}-\hat{\mathbf{G}}\beta^{0})_{l}}{\sqrt{c_{l,1}}}\right)$ are of the same order. Notably, if equation (21) holds with equality and if the variables $X_{k}^{e},k=1,\ldots,p$ and $\varepsilon=\eta_{p+1}^{e},e\in\{1,2\}$ are centered multivariate Gaussian, using moment inequalities,

[TABLE]

for $e\in\{1,2\},k\in\{1,\ldots,p\}$ . Using independence of samples from different environments $e\in\{1,2\}$ ,

[TABLE]

for all $k=1,\ldots,p$ . Using equation (21),

[TABLE]

Hence $\text{Var}\left(\frac{(\hat{\mathbf{Z}}-\hat{\mathbf{G}}\beta^{0})_{l}}{\sqrt{c_{l,1}}}\right)$ and $\text{Var}\left(\frac{(\hat{\mathbf{Z}}-\hat{\mathbf{G}}\beta^{0})_{k}}{\sqrt{c_{k,1}}}\right)$ are of the same order for all $k,l=1,\ldots,p$ .
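The centering recommendation at the start of this section can be sketched as follows. Note that the quantity subtracted is the unweighted average of the environment means, which differs from the pooled sample mean when the environments have different sizes. The function name `center_globally` and the toy data are illustrative assumptions.

```python
import numpy as np

def center_globally(data, E):
    """Subtract the unweighted average of the per-environment means from every row."""
    data, E = np.asarray(data, dtype=float), np.asarray(E)
    env_means = np.stack([data[E == e].mean(axis=0) for e in np.unique(E)])
    return data - env_means.mean(axis=0)

# Columns are (X, Y); environment 1 has two samples, environment 2 has one.
data = np.array([[0.0, 1.0], [2.0, 3.0], [10.0, 20.0]])
E = np.array([1, 1, 2])
centered = center_globally(data, E)
print(centered)
```

After this step the environment means are generally nonzero but sum to zero across environments, so mean shifts between environments remain available to the estimator.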

5.2 Choosing $\lambda$

Large segments of the regularization path of the causal Dantzig are usually poor estimates of $\beta^{0}$ . Hence it is crucial to use an appropriate value of the regularization parameter $\lambda$ . From a theoretical perspective one would choose $\lambda$ as in Theorem 4. However, $\sigma_{\varepsilon}$ and $\sigma_{\text{max}}^{e}$ are usually unknown in real-world datasets. Hence, in practice we propose to choose $\lambda$ by $k$ -fold cross-validation. Concretely, in each environment $e\in\mathcal{E}$ the samples are split into $k$ groups of approximately equal size. Denote by $\hat{\beta}^{\lambda,-i}$ the causal Dantzig estimator that is calculated on all samples except the samples from group $i$ . Let $\hat{\mathbf{Z}}^{i}$ and $\hat{\mathbf{G}}^{i}$ be defined as in equation (5), using the samples from group $i$ . Then we can choose $\hat{\lambda}^{\text{cv}}$ as a solution to

[TABLE]

We define the cross-validated causal Dantzig as $\hat{\beta}^{\text{cv}}:=\hat{\beta}^{\hat{\lambda}^{\text{cv}}}$ . Two exemplary regularization paths and the solution chosen by cross-validation are depicted in Figure 4.
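The cross-validation scheme can be sketched as follows. Two points are assumptions rather than statements from the text: the held-out criterion is taken to be the sup-norm $\|\hat{\mathbf{Z}}^{i}-\hat{\mathbf{G}}^{i}\hat{\beta}^{\lambda,-i}\|_{\infty}$ summed over the folds, and the estimator is passed in as a callable `fit`. The function `toy_fit` is only a stand-in that shrinks the unregularized solution; it is not the regularized causal Dantzig.

```python
import numpy as np

def gram_stats(X, Y, E):
    """G_hat and Z_hat for two environments labelled 1 and 2."""
    X1, Y1, X2, Y2 = X[E == 1], Y[E == 1], X[E == 2], Y[E == 2]
    G = X1.T @ X1 / len(Y1) - X2.T @ X2 / len(Y2)
    Z = X1.T @ Y1 / len(Y1) - X2.T @ Y2 / len(Y2)
    return G, Z

def assign_folds(E, k, rng):
    """Split the samples of each environment into k groups of near-equal size."""
    folds = np.empty(len(E), dtype=int)
    for e in np.unique(E):
        idx = rng.permutation(np.flatnonzero(E == e))
        folds[idx] = np.arange(len(idx)) % k
    return folds

def cv_lambda(X, Y, E, lambdas, fit, k=5, seed=0):
    """Return the lambda with the smallest summed held-out sup-norm (assumed criterion)."""
    folds = assign_folds(E, k, np.random.default_rng(seed))
    scores = []
    for lam in lambdas:
        total = 0.0
        for i in range(k):
            train, test = folds != i, folds == i
            beta = fit(X[train], Y[train], E[train], lam)     # fitted without group i
            G_i, Z_i = gram_stats(X[test], Y[test], E[test])  # statistics of group i
            total += np.max(np.abs(Z_i - G_i @ beta))
        scores.append(total)
    return lambdas[int(np.argmin(scores))]

def toy_fit(X, Y, E, lam):
    """Stand-in estimator: unregularized solution shrunk by 1 / (1 + lam)."""
    G, Z = gram_stats(X, Y, E)
    return np.linalg.solve(G, Z) / (1.0 + lam)

rng = np.random.default_rng(0)
n, beta0 = 5_000, np.array([1.0, -0.5])
E = np.repeat([1, 2], n)
H = rng.standard_normal(2 * n)                                  # hidden confounder
X = E[:, None] * rng.standard_normal((2 * n, 2)) + H[:, None]   # noise scale e
Y = X @ beta0 + H + rng.standard_normal(2 * n)
folds = assign_folds(E, 5, np.random.default_rng(0))
chosen = cv_lambda(X, Y, E, [0.0, 1.0, 10.0], toy_fit, k=5)
print(chosen)
```

Splitting within each environment keeps both environments represented in every fold, which is needed for the held-out differences $\hat{\mathbf{Z}}^{i}$ and $\hat{\mathbf{G}}^{i}$ to be defined.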

5.3 Preselection with hidden variables

An alternative to running the causal Dantzig directly on a high-dimensional dataset is preselection. In the first stage we recommend running Lasso on observational data, if available. If observational data are not available, one could run Lasso on the pooled dataset. In the second stage, one would run the causal Dantzig with or without regularization on the active set of the first stage. Ideally, the first stage would screen out as many variables as possible, except for the parental set of the target variable $Y$ . Quite often this will result in a set that is a superset of the parental set, implying a very useful dimensionality reduction. The following Lemma provides some justification for this approach.

Lemma 4.

Assume that the distribution of $X_{1},\ldots,X_{p},Y$ is generated by a linear acyclic Gaussian structural equation model with directed acyclic graph $D_{total}$ that consists of both the observed variables $X_{1},\ldots,X_{p},Y$ and (potentially) hidden confounders $H_{1},\ldots,H_{q}$ . Assume that the joint distribution of the variables $X_{1},\ldots,X_{p},Y,H_{1},\ldots,H_{q}$ is faithful (Pearl, 2009) to $D_{total}$ . Let $S$ denote the active set of regressing $Y$ on $X_{1},\ldots,X_{p}$ in the population case. Then,

[TABLE]

The proof can be found in the Appendix. We test this two-step procedure on real world data in Section 6.4. However, note that for valid $p$ -values (with the unregularized causal Dantzig) we would have a post-selection problem due to the screening step.

6 Numerical examples

Section 6.1 explores actual coverage and length of the asymptotic confidence intervals as defined in Section 3.3. In Section 6.2 we compare the causal Dantzig to instrumental variable regression for $p=1$ under different types of interventions. In Section 6.3 we evaluate the performance of parameter selection by cross-validation as defined in Section 5.2. Finally, in Section 6.4 we discuss an application to real-world data that has been published in Meinshausen et al. (2016).

6.1 Causal Dantzig in low dimensions: confidence intervals

In this section we explore the actual coverage and average length of the asymptotic confidence intervals constructed according to Theorem 1.

We simulate data from two linear SEMs shown in Figure 5. Specifically, the data are generated according to the equations

[TABLE]

where the noise distributions of $(\eta_{1},\eta_{2},\eta_{y})$ and $(\eta_{1},\eta_{2},\eta_{3},\eta_{4},\eta_{y})$ respectively depend on the environment. Specifically, for SEM (A), we assume a factor model for the noise

[TABLE]

where $(\varepsilon_{1},\varepsilon_{2},\varepsilon_{y})^{t}\sim\mathcal{N}(0,1_{3})$ , and the entries in both the factor loading matrix $A\in\mathbb{R}^{3\times 5}$ and the factor values $H\in\mathbb{R}^{5}$ are chosen i.i.d. standard normal. The 5-dimensional variable $H$ acts as a hidden confounder between the observed variables. The noise contribution $\sigma_{j}$ is chosen as 1 in environment $e=1$ and as $1+\kappa$ in environment $e=2$ . We call $\kappa$ , the difference between the value of $\sigma_{j}$ in environment $e=2$ and its value in environment $e=1$ , the intervention strength, as it measures the additional noise input in environment $e=2$ over environment $e=1$ . In our simulations it is chosen as $8$ . For SEM (B) we generate the data analogously with the dimension of the hidden variable $H$ being five.

We draw $n\in\{50,100,500,1000\}$ samples in total (across both environments) and compute the confidence intervals for the causal coefficients $\beta^{0}$ of $Y$ with the unregularized causal Dantzig. For SEM (A), the true causal coefficients for $Y$ are given by $\beta^{0}=(0,1)$ , and the actual coverage and average length of the intervals constructed at significance level 0.05 with the unregularized causal Dantzig are shown in the two upper rows of Table 2 for variable $X_{1}$ . The bottom row shows the coverage of the confidence intervals for invariant causal prediction (ICP). For large $n$ , ICP often (rightfully) rejects all models and outputs neither coefficient estimates nor confidence intervals. These cases were ignored in the table. ICP is not consistent and hence has incorrect coverage for growing sample size, as clearly visible in the table.

The causal Dantzig has approximately correct coverage for all sample sizes in this example. For small sample sizes, the variance of the causal Dantzig is large and consequently the average length of its confidence intervals is large, too. In such regimes, regularization is recommended, as discussed in Section 4. For larger sample sizes, the confidence intervals shrink considerably, at the $1/\sqrt{n}$ rate. For SEM (A), this effect is depicted in Table 2. Table 3 shows these effects for SEM (B). Note that also in this case the actual coverage of the causal Dantzig is approximately correct.
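A coverage experiment of this kind can be mimicked in a toy setting. The sketch below is an illustration under assumptions, not the simulation design of SEM (A) or SEM (B): it uses $p=1$, a hypothetical data-generating model with one hidden confounder, hypothetical helper names (`draw`, `causal_dantzig_ci`), and a plug-in standard error obtained from the delta-method argument $\hat{\beta}-\beta^{0}=\hat{\mathbf{G}}^{-1}(\hat{\mathbf{Z}}-\hat{\mathbf{G}}\beta^{0})$ , which for $p=1$ gives the variance $\big(\mathrm{Var}_{1}(X\eta_{y})/n_{1}+\mathrm{Var}_{2}(X\eta_{y})/n_{2}\big)/\mathbf{G}^{2}$ .

```python
import math
import random

def draw(n, extra_sd, rng, beta=1.0):
    """Toy p = 1 environment: hidden confounder H, additive noise intervention on X."""
    out = []
    for _ in range(n):
        h = rng.gauss(0, 1)
        x = h + rng.gauss(0, 1) + extra_sd * rng.gauss(0, 1)
        y = beta * x + h + rng.gauss(0, 1)
        out.append((x, y))
    return out

def causal_dantzig_ci(env1, env2, z_crit=1.96):
    """Unregularized p = 1 estimate with an assumed delta-method interval (nominal 95%)."""
    n1, n2 = len(env1), len(env2)
    g = sum(x * x for x, _ in env1) / n1 - sum(x * x for x, _ in env2) / n2
    z = sum(x * y for x, y in env1) / n1 - sum(x * y for x, y in env2) / n2
    b = z / g

    def var_of_products(env):
        # Sample variance of X * (Y - X * b), the plug-in for X * eta_y.
        prods = [x * (y - x * b) for x, y in env]
        m = sum(prods) / len(prods)
        return sum((p - m) ** 2 for p in prods) / (len(prods) - 1)

    se = math.sqrt(var_of_products(env1) / n1 + var_of_products(env2) / n2) / abs(g)
    return b - z_crit * se, b + z_crit * se

rng = random.Random(1)
reps, hits = 200, 0
for _ in range(reps):
    e1 = draw(500, 0.0, rng)
    e2 = draw(500, 3.0, rng)
    lower, upper = causal_dantzig_ci(e1, e2)
    hits += lower <= 1.0 <= upper   # true coefficient is 1.0
coverage = hits / reps
print(coverage)
```

With a strong intervention and moderately large samples the empirical coverage in this toy model should be close to the nominal level; for small $n$ or weak interventions the denominator $\hat{\mathbf{G}}$ is poorly determined and the intervals become long.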

6.2 Causal Dantzig and the instrumental variable approach

To compare the causal Dantzig to instrumental variables, consider a binary instrument $e\in\{0,1\}$ . To be more precise, we consider the model

[TABLE]

The corresponding DAG is depicted in Figure 6. In words, $X$ is a direct cause of $Y$ , there is a hidden confounder $H$ that causes both $X$ and $Y$ , and $e$ is an instrument for $X$ , meaning that $e$ is a root node and a direct cause of $X$ , but not of $H$ or $Y$ . Note that the conditional mean differs between settings, i.e. $\mathbb{E}[X|e=1]\neq\mathbb{E}[X|e=0]$ . Hence the IV approach is consistent for the true causal effect from $X$ to $Y$ , as discussed in Section 3.6.

For each environment $e\in\{0,1\}$ we generate $n$ samples and estimate the direct causal effect via causal Dantzig and instrumental variables regression using the function ivreg in the R-package AER. Table 4 shows the mean square error for both methods. For few observations, the causal Dantzig is relatively unstable.

For larger values of $n$ , this is not the case and the mean square error shrinks at the $\sqrt{n}$ -rate for both estimators. The instrumental variables (IV) approach outperforms the causal Dantzig in this example. This is due to the fact that IV is a fraction of conditional means, whereas the causal Dantzig is a fraction of conditional covariances. Estimating conditional means is statistically easier, but it comes at a certain price as we will see below.

For the second model, we change the edge function between $e$ and $X$ . Namely,

[TABLE]

Both the conditional variance $\text{Var}(X|e=\bullet),\bullet\in\{0,1\}$ and the conditional mean $\mathbb{E}[X|e=\bullet],\bullet\in\{0,1\}$ change between the environments. However, the conditional mean changes only slightly, imposing difficulties for the IV approach. Again, for each environment $e\in\{0,1\}$ we generate $n$ samples and estimate the direct causal effect via causal Dantzig and ivreg. As seen in Table 5, for very few observations, both ivreg and causal Dantzig are comparatively far from the target quantity. For larger values of $n$ , the causal Dantzig converges with the $\sqrt{n}$ -rate. The instrumental variables approach is consistent but unstable for these small sample sizes as the instrument is weak. It exhibits large MSE as it does not use the changing variance for inference.
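The contrast between the two estimators can be sketched for $p=1$ and a binary environment indicator. This is a hedged illustration, not the paper's simulation: the model (`sample_env`), the intervention sizes and the sample size are assumptions, the IV estimate is computed as the Wald ratio of differences in conditional means (which coincides with two-stage least squares for a single binary instrument) rather than with `ivreg`, and the causal Dantzig is computed as the ratio of differences in inner products. The variance-only setting is the extreme case in which the conditional mean of $X$ does not change at all.

```python
import random

def sample_env(n, mean_shift, extra_sd, rng, beta=1.0):
    """Toy model: Y = beta * X + H + eta_y, X = mean_shift + H + eta_x + extra noise."""
    out = []
    for _ in range(n):
        h = rng.gauss(0, 1)
        x = mean_shift + h + rng.gauss(0, 1) + extra_sd * rng.gauss(0, 1)
        y = beta * x + h + rng.gauss(0, 1)
        out.append((x, y))
    return out

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def wald_iv(env0, env1):
    """Ratio of differences in conditional means (IV with a binary instrument)."""
    num = mean(y for _, y in env1) - mean(y for _, y in env0)
    den = mean(x for x, _ in env1) - mean(x for x, _ in env0)
    return num / den

def causal_dantzig(env0, env1):
    """Ratio of differences in inner products (unregularized causal Dantzig, p = 1)."""
    z = mean(x * y for x, y in env1) - mean(x * y for x, y in env0)
    g = mean(x * x for x, _ in env1) - mean(x * x for x, _ in env0)
    return z / g

rng = random.Random(2)
n = 20000
obs = sample_env(n, 0.0, 0.0, rng)
shifted = sample_env(n, 2.0, 0.0, rng)   # mean shift: both estimators are applicable
noisy = sample_env(n, 0.0, 2.0, rng)     # variance shift only: the instrument is uninformative

iv_mean, cd_mean = wald_iv(obs, shifted), causal_dantzig(obs, shifted)
iv_var, cd_var = wald_iv(obs, noisy), causal_dantzig(obs, noisy)
print("mean shift:     IV", iv_mean, " causal Dantzig", cd_mean)
print("variance shift: IV", iv_var, " causal Dantzig", cd_var)
```

Under the mean shift both estimates should be close to the true coefficient 1. Under the pure variance shift the Wald denominator is a difference of two noisy estimates of the same mean, so the IV estimate is erratic, while the causal Dantzig still has a well-separated denominator.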

6.3 Causal Dantzig in high dimensions

We consider a structural equation model, where the variables $X_{1},\ldots,X_{p},Y$ form a chain and the distribution of the unobserved confounder $\eta$ changes between the environments. The corresponding directed acyclic graph is depicted in Figure 7.

To be more precise, the distribution of the observed variables $e,X$ and $Y$ is generated according to the following structural equation model:

[TABLE]

We assume that $z_{k}$ and $\eta_{k}$ are jointly independent. The regularization parameter $\lambda$ is chosen by $10$ -fold cross-validation. Figure 8 shows the regularization path for two different values of $p$ . Figure 9 shows the regularization path for varying intervention strength $\sigma$ . Finally, in Figure 10 the number of samples collected from each environment $n:=n_{0}=n_{1}$ is varied. In a nutshell, cross-validation seems to select a reasonable regularization parameter in most cases, estimation performance deteriorates with increasing $p$ , but improves with increasing $n$ and drastically so with increasing intervention strength $\sigma$ .

6.4 Gene knockout experiments

We outline here an application which has appeared in Meinshausen et al. (2016). The authors consider gene expression in yeast (Saccharomyces cerevisiae) under deletion of single genes (Kemmeren et al., 2014): $160$ samples are wild-type (observational); and $1,479$ samples are measured under the deletion of a single gene (intervention). For each of those observations, genome-wide mRNA expression levels were measured. We denote these measurements by $X_{1},\ldots,X_{p+1}$ , where $p+1=6170$ . The goal is to predict whether mRNA expression level $Y=X_{p+1}$ changes significantly under a new and unobserved gene-deletion $X_{j}$ , $j\neq p+1$ . Knocking out a gene is not always successful, and the measured activity of a gene is not constant (or zero) after knocking it out, i.e. the intervention is “noisy”. Overall, knockouts decrease the activity, which can be interpreted as a negative shift in the measured log-activity of a gene.

The data is split into training and validation data. To this end, the $1,479$ interventional samples are divided into five sets $B_{1},\ldots,B_{5}$ . For some $v\in\{1,\ldots,5\}$ , the training data consists of the four sets $\{B_{i}\}_{i\in\{1,2,3,4,5\}\setminus\{v\}}$ and the $160$ observational samples. The samples in $B_{v}$ are held out for validation. The interventional effects on the validation set $B_{v}$ were predicted using only training data. This procedure is carried out for all sets $B_{v},v=1,\ldots,5$ , i.e. each gene perturbation is excluded from the training set once.
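The hold-out scheme can be sketched as follows. This is only an illustration of the bookkeeping: the function name `five_fold_splits` is hypothetical, and the assignment of interventional samples to the sets $B_{1},\ldots,B_{5}$ is assumed here to be a random partition, which the text does not specify. The $160$ observational samples are part of every training set and are therefore not indexed here.

```python
import random

def five_fold_splits(n_interventional, n_folds=5, seed=0):
    """Partition interventional sample indices into folds B_1..B_5 (random, by assumption)."""
    idx = list(range(n_interventional))
    random.Random(seed).shuffle(idx)
    folds = [sorted(idx[i::n_folds]) for i in range(n_folds)]
    splits = []
    for v in range(n_folds):
        # Training uses all folds except B_v; B_v is held out for validation.
        train = sorted(i for j, fold in enumerate(folds) if j != v for i in fold)
        splits.append((train, folds[v]))
    return splits

splits = five_fold_splits(1479)
for v, (train, held_out) in enumerate(splits, start=1):
    print(f"B_{v}: {len(held_out)} held out, {len(train)} interventional training samples")
```

Each interventional sample appears in exactly one held-out set, so every gene perturbation is excluded from training exactly once.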

Preselection with the LASSO was used on the pooled data to screen for a superset of the parental set of variable $X_{p+1}$ . For some justification of this approach, see Section 5.3. Then, the causal Dantzig without regularization was used, setting $e=1$ for observational data and $e=2$ for interventional training data. Using the causal Dantzig without a screening step is computationally prohibitive due to the large number of variables and because the procedure is repeated for each possible target variable $X_{1},\ldots,X_{p+1}$ . The $s$ most often selected intervention predictions were compared to so-called “strong intervention effects” (SIEs) as defined in Meinshausen et al. (2016). SIEs are computed on the held-out data $B_{v}$ and are a measure of the total causal effect. The results are depicted in Figure 11. As an example, for the causal Dantzig the four most often selected intervention predictions correspond to SIEs.

Screening for causal effects is a very challenging problem in this setting, mainly due to the high-dimensionality of the dataset and the presence of hidden confounders. The ground truth is not perfectly known but good proxies (strong intervention effects) can be computed on hold-out interventional data. The strongest discoveries of InvariantCausalPrediction (ICP) and causalDantzig correspond very well to the benchmark. Assuming hidden confounding and shift interventions (causalDantzig) leads to a different ranking of genes compared to assuming the absence of confounding and allowing for arbitrary interventions (ICP). Interestingly, while both methods miss some important variables, making “wrong” assumptions such as linearity or absence of latent confounding does not seem to lead to false positives for the first few variables in the ranking. This form of validation and the comparison to other methods are further discussed in Meinshausen et al. (2016).

7 Discussion

Causal discovery is challenging, particularly in the presence of hidden confounders and feedback loops. However, hidden confounders can rarely be excluded and feedback loops are to be expected in many real-world applications (e.g., in biological systems). We introduced the notion of inner-product invariance and showed that inference in linear structural equation models under inner-product invariance is possible, both for low- and high-dimensional data.

The proposed methods have interesting parallels to widely-used statistical methods. For example, the functional form of the causal Dantzig estimator is similar to linear regression. The regularized causal Dantzig is similar to the Dantzig selector. For two environments ( $|\mathcal{E}|=2$ ) the causal Dantzig estimator can be compared with instrumental variable regression and is consistent in certain settings in which instrumental variable regression fails. Hence, we believe that the causal Dantzig will push the boundaries in the analysis of certain types of datasets, in particular in the analysis of datasets where potentially unknown interventions (or “perturbations”) change both the mean and the variance of the observed error distribution. Empirical results show state-of-the-art performance of our proposed estimator on a real-world dataset.

We investigated the identifiability of direct causal effects under the proposed model class. Furthermore, we showed that the regularized causal Dantzig can be consistent in the high-dimensional case even if not all covariates have been intervened on. The estimator can be obtained by solving a linear program and as such is feasible for large-scale causal inference. We derived asymptotic confidence intervals for the unregularized causal Dantzig, as well as guarantees for statistical accuracy for the regularized causal Dantzig.

The notion of inner-product invariance pushes the boundaries for the types of datasets we can leverage for causal discovery. We expect it to be useful for practitioners, in particular as a simple and fast tool for screening for potential direct causal effects. From a theoretical perspective, the regularized and unregularized causal Dantzig provide new perspectives on invariant causal prediction, on the instrumental variable approach and on classical theory for high-dimensional estimation.

8 Appendix

Remark 3 (Reminder of Assumption 1 and some of the notation).

We assume that the distributions of $(X_{1}^{e},...,X_{p+1}^{e})$ , $e\in\mathcal{E}$ , are generated by the linear SEM

[TABLE]

Assume that there exist random variables $\eta^{0},\delta^{e}\in\mathbb{R}^{p}$ with $\mathrm{Cov}(\eta^{0},\delta^{e})=0$ for all $e\in\mathcal{E}$ such that $\eta^{e}$ can be written as

[TABLE]

We assume that $\delta^{e}_{p+1}\equiv 0$ for all $e\in\mathcal{E}$ and $\mathbb{E}[\eta^{0}]=0$ .

We aim to infer the structural equation for variable $X_{p+1}$ , hence we denote it by $Y$ . Furthermore, for simplicity we write $\beta^{0}=(A_{p+1,k})_{k=1,\ldots,p}$ . The values $A_{k,k^{\prime}}$ form a $(p+1)\times(p+1)$ -dimensional matrix that we denote by $A$ .

8.1 Proofs for Section 2

8.1.1 Proof of Proposition 1

Proof.

Recall that $Y=X_{p+1}$ . We can write equation (1) under Assumption 1 more compactly as $X_{1:(p+1)}^{e}=AX_{1:(p+1)}^{e}+\eta^{e}$ , where $A$ is the matrix that contains the structural parameters $A_{k,k^{\prime}}$ . In other words,

[TABLE]

In the following, we denote the $k$ -th unit vector in $\mathbb{R}^{p}$ by $e^{(k)}$ , i.e.

[TABLE]

Recall that $\beta^{0}=(A_{p+1,k})_{k=1,\ldots,p}$ and $Y^{e}=X_{p+1}^{e}$ . By Assumption 1, $Y^{e}-X^{e}\beta^{0}=\eta_{p+1}^{e}$ . Hence,

[TABLE]

Now we can again use Assumption 1. Recall that $\eta^{e}=\eta^{0}+\delta^{e}$ , $\mathbb{E}\eta_{p+1}^{e}=0$ and that $\eta_{p+1}^{0}$ and $\delta^{e}$ are uncorrelated. Hence for $k=1,\ldots,p$ ,

[TABLE]

Note that this quantity is the same for all environments $e\in\mathcal{E}$ , which concludes the proof. ∎
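As a worked special case (an illustration under added simplifying assumptions, not part of the original argument), take $p=1$ without feedback, so that $X^{e}=\eta_{1}^{0}+\delta_{1}^{e}$ and $Y^{e}=X^{e}\beta^{0}+\eta_{2}^{0}$ with $\delta_{2}^{e}\equiv 0$ . Using $\mathrm{Cov}(\eta^{0},\delta^{e})=0$ and $\mathbb{E}[\eta^{0}]=0$ from Remark 3, the inner product of the covariate with the residual does not depend on the environment:

```latex
\begin{aligned}
\mathbb{E}\left[X^{e}\left(Y^{e}-X^{e}\beta^{0}\right)\right]
  &= \mathbb{E}\left[\left(\eta_{1}^{0}+\delta_{1}^{e}\right)\eta_{2}^{0}\right] \\
  &= \mathbb{E}\left[\eta_{1}^{0}\eta_{2}^{0}\right]
     + \mathrm{Cov}\left(\delta_{1}^{e},\eta_{2}^{0}\right)
     + \mathbb{E}\left[\delta_{1}^{e}\right]\mathbb{E}\left[\eta_{2}^{0}\right] \\
  &= \mathbb{E}\left[\eta_{1}^{0}\eta_{2}^{0}\right].
\end{aligned}
```

The hidden confounding enters only through $\mathbb{E}[\eta_{1}^{0}\eta_{2}^{0}]$ , which may be nonzero but is the same in every environment.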

8.1.2 Proof of Proposition 2

Proof.

For all $e,f\in\mathcal{E}$ and $k\in\{1,\ldots,p\}$ ,

[TABLE]

In the first line and third line we used that $\zeta_{1}^{e},\ldots,\zeta_{k}^{e},\zeta_{y}^{e}$ are centered and jointly independent for all $e\in\mathcal{E}$ . In the second line we used that we have inner product invariance for $(X^{e},Y^{e})$ , $e\in\mathcal{E}$ under $\beta^{0}$ and that $\mathbb{E}[(\zeta_{k}^{e})^{2}]=\mathbb{E}[(\zeta_{k}^{f})^{2}]$ for all $e,f\in\mathcal{E}$ and $k=1,\ldots,p$ . This proves that we also have inner-product invariance for $(\tilde{X}^{e},\tilde{Y}^{e}),e\in\mathcal{E}$ under $\beta^{0}$ .

∎

8.2 Proofs for Section 3

8.2.1 Proof of Theorem 1

Proof.

First note that $V^{1}$ and $V^{2}$ are invertible as $\mathbf{G}$ and the covariance matrix of $(X^{e})^{t}\eta_{p+1}^{e}$ , $e\in\{1,2\}$ are assumed to be invertible. Now note that by inner-product invariance of $(X^{e},Y^{e})$ under $\beta^{0}$ we have $\mathbf{G}^{-1}\mathbf{Z}=\beta^{0}$ and hence

[TABLE]

In particular,

[TABLE]

We denote $\mbox{GL}_{p}$ the set of real-valued invertible $p\times p$ matrices. Define the function $f:\mbox{GL}_{p}\times\mathbb{R}^{p}\rightarrow\mathbb{R}^{p}$ by

[TABLE]

By elementary matrix algebra, this function is continuously differentiable with derivative in direction $(D,d)\in\mathbb{R}^{p\times p}\times\mathbb{R}^{p}$

[TABLE]

As $(\hat{\mathbf{G}},\hat{\mathbf{Z}})-\left(\mathbf{G},\mathbf{Z}\right)=\mathcal{O}_{P}\left(\max\left(\frac{1}{\sqrt{n_{1}}},\frac{1}{\sqrt{n_{2}}}\right)\right)$ and $\beta^{0}=f(\mathbf{G},\mathbf{Z})$ , the delta method yields

[TABLE]

In the last line we used independence of the samples of environment $e=1$ and $e=2$ , the CLT and the definition of $V^{1}$ and $V^{2}$ together with equation (42).

∎

8.2.2 Proof of Theorem 2

Proof.

Part A: In this part we prove claim 2 and $"\Leftarrow"$ of claim 1. By Proposition 1 we have inner-product invariance for $\beta^{0}$ and hence the solution set of the population causal Dantzig contains $\beta^{0}$ . Assume that for each $k$ there exists $e\in\mathcal{E}$ such that $\delta_{k}^{e}\not\equiv 0$ . We want to show that under this assumption the causal Dantzig is unique in the population case. By Proposition 1 we have inner-product invariance for $\beta^{0}$ and hence each solution $\beta^{*}$ to the population causal Dantzig satisfies

[TABLE]

Denote $\tilde{e}$ the “observational” environment, i.e. the environment with $\delta_{k}^{\tilde{e}}\equiv 0$ for $k=1,\ldots,p$ . By inner-product invariance under $\beta^{0}$ ,

[TABLE]

By rearranging,

[TABLE]

As we want to show $\beta^{*}=\beta^{0}$ it suffices to show that $\mathbb{E}[\hat{\mathbf{G}}^{\tilde{e}}]$ is invertible. In the following, for notational brevity we write $(\mbox{Id}-A)_{1:p,1:p}^{-t}$ instead of $\left((\mbox{Id}-A)^{-t}\right)_{1:p,1:p}$ and $(\mbox{Id}-A)_{1:p,1:p}^{-1}$ instead of $\left((\mbox{Id}-A)^{-1}\right)_{1:p,1:p}$ . By definition, we have

[TABLE]

In the last line we used $\delta_{p+1}^{e}\equiv 0$ for all $e\in\mathcal{E}$ . As setting $\tilde{e}$ is “observational”, i.e. $\delta^{\tilde{e}}\equiv 0$ ,

[TABLE]

Now we want to show that $(\mbox{Id}-A)_{1:p,1:p}^{-1}$ is invertible. If this is not the case then there exists $\gamma\in\mathbb{R}^{p}\setminus\{0\}$ such that $\gamma^{t}(\mbox{Id}-A)_{1:p,1:p}^{-1}=0$ . As $(\mbox{Id}-A)^{-1}$ is invertible,

[TABLE]

In particular, $\gamma^{t}(\mbox{Id}-A)_{1:p,p+1}^{-1}\neq 0$ . As $(X^{e},Y^{e})^{t}=(\mbox{Id}-A)^{-1}(\eta^{e})^{t}$ , by equation (44) we have $\gamma^{t}(X^{e})^{t}=\gamma^{t}(\mbox{Id}-A)_{1:p,p+1}^{-1}\eta_{p+1}^{e}$ . As $Y^{e}=X_{p+1}^{e}=X^{e}\beta^{0}+\eta_{p+1}^{e}$ ,

[TABLE]

As we assumed that the Gram matrix of $(X^{e},Y^{e})$ is positive definite for all $e\in\mathcal{E}$ , this is a contradiction. Hence $(\mbox{Id}-A)_{1:p,1:p}^{-1}$ is invertible. Thus the matrix in equation (8.2.2) is invertible if and only if

[TABLE]

is invertible. Let $\xi\in\mathbb{R}^{p}\setminus\{0\}$ such that $\xi^{t}\mathbb{E}\left[\sum_{e\neq\tilde{e}}\left(\delta_{1:p}^{e}\right)^{t}\delta_{1:p}^{e}\right]\xi=0$ . We will lead this to a contradiction. As all matrices $\mathbb{E}\left[\left(\delta_{1:p}^{e}\right)^{t}\delta_{1:p}^{e}\right]$ , $e\neq\tilde{e}$ are positive semi-definite we have

[TABLE]

As $\xi\neq 0$ there exists $k$ such that $\xi_{k}\neq 0$ . Fix such a $k$ . By assumption there exists $e\neq\tilde{e}$ such that $\delta_{k}^{e}\not\equiv 0$ . Fix such an environment $e$ . Define $S=\{s:1\leq s\leq p\text{ such that }\delta_{s}^{e}\not\equiv 0\}$ , the support of $\delta_{1:p}^{e}$ . By definition,

[TABLE]

But by assumption, the matrix $\mathbb{E}\left[\left(\delta_{S}^{e}\right)^{t}\delta_{S}^{e}\right]$ is positive definite. Hence

[TABLE]

Contradiction! Thus $\mathbb{E}\left[\hat{\mathbf{G}}^{\tilde{e}}\right]$ is invertible and $\beta^{*}=\beta^{0}$ . This concludes the proof of part A.

Part B: In this part we prove $"\Rightarrow"$ of claim 1. Proof by contradiction. Assume there exists a $k$ such that $\delta_{k}^{e}\equiv 0$ for all $e\in\mathcal{E}$ . We want to show that there exists a second SEM with $\tilde{\beta}^{0}\neq\beta^{0}$ that satisfies Assumption 1 and generates the distributions of $(X^{e},Y^{e}),e\in\mathcal{E}$ . Fix a $k$ such that $\delta_{k}^{e}\equiv 0$ for all $e\in\mathcal{E}$ . As above, it is possible to show that for all $\tilde{e}\in\mathcal{E}$ ,

[TABLE]

As $\delta_{k}^{e}\equiv 0$ for all $e\in\mathcal{E}$ there exists $\Delta\in\mathbb{R}^{p}\setminus\{0\}$ such that $\mathbb{E}[\mathbf{G}^{\tilde{e}}]\Delta=0$ for all $\tilde{e}\in\mathcal{E}$ . Now we want to show that there exists a SEM with $\tilde{\beta}^{0}\neq\beta^{0}$ that generates the distributions of $(X^{e},Y^{e}),e\in\mathcal{E}$ and satisfies Assumption 1. For $X_{1:p}$ we keep the structural equations $\tilde{A}_{1:p,\bullet}=A_{1:p,\bullet}$ . For the variable $Y$ we define the new structural equation

[TABLE]

where we choose $\gamma$ small enough to make $\text{Id}-\tilde{A}$ invertible.

Furthermore define $(\tilde{\eta}^{0})^{t}=\left(\text{Id}-\tilde{A}\right)\left(\text{Id}-A\right)^{-1}(\eta^{0})^{t}$ and $\tilde{\delta}^{e}=\delta^{e}$ . Note that this SEM still satisfies that one environment $e$ is “observational”, i.e. $\tilde{\delta}^{e}\equiv 0$ and that all interventions $\tilde{\delta}^{e}$ are full-rank on its support as the same holds true for $\delta^{e}$ . Now we want to show that this SEM satisfies inner-product invariance under $\tilde{\beta}^{0}=\beta^{0}+\gamma\Delta$ . By inner-product invariance under $\beta^{0}$ , and as $\mathbb{E}[\mathbf{G}^{e}]\Delta=0$ for all $e\in\mathcal{E}$ ,

[TABLE]

Hence we also have inner-product invariance of $(X^{e},Y^{e}),e\in\mathcal{E}$ under $\tilde{\beta}^{0}$ . Now we want to show that the new SEM generates the distributions of $(X^{e},Y^{e}),e\in\mathcal{E}$ , i.e. we want to show that

[TABLE]

By definition,

[TABLE]

and again by definition we know $\tilde{\eta}^{0}=\left(\text{Id}-\tilde{A}\right)\left(\text{Id}-A\right)^{-1}\left(\eta^{0}\right)^{t}$ . Hence to prove equation (46) it suffices to show

[TABLE]

As we defined $\tilde{\delta}^{e}:=\delta^{e}$ it suffices to show

[TABLE]

Rearranging yields

[TABLE]

We know that there exists $\tilde{e}$ such that $\delta^{\tilde{e}}\equiv 0$ . Using equation (8.2.2) we obtain

[TABLE]

By construction $\mathbb{E}[\mathbf{G}^{\tilde{e}}]\Delta=0$ , which implies that $\Delta^{t}\mathbb{E}[\mathbf{G}^{\tilde{e}}]\Delta=0$ . Combining this fact with equation (48) yields

[TABLE]

Equivalently,

[TABLE]

Now we can prove equation (47):

[TABLE]

This proves equation (47) and hence the new SEM generates $(X^{e},Y^{e}),e\in\mathcal{E}$ . Hence $\beta^{0}$ is not identifiable. This concludes the proof of part B.

∎

8.2.3 Proof of Theorem 3

Proof.

It suffices to show that

[TABLE]

for all $e\in\mathcal{E}\cup\{\tilde{e}\}$ as the distribution on the right hand side of the data is the same across all environments $e\in\mathcal{E}\cup\{\tilde{e}\}$ . By Assumption 1,

[TABLE]

and hence

[TABLE]

Hence it suffices to show that

[TABLE]

for all $k\in\cup_{e\in\mathcal{E}}\{k^{\prime}:\delta_{k^{\prime}}^{e}\not\equiv 0\}$ . To this end, let $e^{\prime}$ denote the observational environment, i.e. the environment $e^{\prime}\in\mathcal{E}$ with $\delta^{e^{\prime}}\equiv 0$ . By Proposition 1,

[TABLE]

Hence also

[TABLE]

Using equation (49) and equation (50), with $\tilde{\beta}:=(\mathrm{Id}-A)_{1:p,p+1}^{-t}-(\mathrm{Id}-A)_{1:p,1:p}^{-t}\beta$ , equation (52) is equivalent to

[TABLE]

As shown in the proof of Theorem 2, $(\mathrm{Id}-A)_{1:p,1:p}^{-1}$ is invertible. Hence the preceding equation is equivalent to

[TABLE]

Analogously as in the proof of Theorem 2 we can use positive definiteness of $\mathbb{E}[(\delta_{S^{e}}^{e})^{t}\delta_{S^{e}}^{e}]$ to conclude that $\tilde{\beta}_{k}\equiv 0$ for all $k\in S^{e},e\in\mathcal{E},e\neq e^{\prime}$ . As

[TABLE]

we proved equation (51), which concludes the proof. ∎

8.2.4 Proof of Proposition 3

Proof.

First, recall that assumption (A4) says that

[TABLE]

We have

[TABLE]

In the second line we used (A1). In the fourth line we used equation (53). In the last line we used (A2). By assumption (A5) we have $\mathbb{E}[Y(0)]=\mathbb{E}[Y(0)-Y(X)]=\mathbb{E}[\mathbb{E}[Y(0)-Y(X)|X,E]]=\mathbb{E}[X]\beta^{0}=0$ and hence

[TABLE]

Using assumption (A3) concludes the proof. ∎

8.3 Proofs for Section 4

8.3.1 Proof of Lemma 1

Proof.

The proof follows the technique used in Ye and Zhang (2010). As $z^{*}\leq\lambda$ , $\beta^{0}\in\{\beta:\|\hat{\mathbf{Z}}-\hat{\mathbf{G}}\beta\|_{\infty}\leq\lambda\}$ . By definition of $\hat{\beta}^{\lambda}$ , we have $\|\hat{\beta}^{\lambda}\|_{1}\leq\|\beta^{0}\|_{1}$ . As the active set of $\beta^{0}$ is $S$ we have $\|(\hat{\beta}^{\lambda}-\beta^{0})_{S^{c}}\|_{1}=\|\hat{\beta}^{\lambda}\|_{1}-\|\hat{\beta}^{\lambda}_{S}\|_{1}\leq\|\beta^{0}\|_{1}-\|\hat{\beta}^{\lambda}_{S}\|_{1}\leq\|(\hat{\beta}^{\lambda}-\beta^{0})_{S}\|_{1}$ . Hence we can invoke the definition of $\mathrm{CCIF}_{q}(S,\hat{\mathbf{G}})$ to obtain

[TABLE]

To bound the right hand side of equation (54),

[TABLE]

Combining equation (54) and equation (55) concludes the proof. ∎

8.3.2 Proof of Lemma 2

Proof.

Using inner-product invariance of $(X^{e},Y^{e})$ under $\beta^{0}$ ,

[TABLE]

Now we can use that $\mathbf{X}^{e}_{ik}(\mathbf{Y}^{e}-\mathbf{X}^{e}_{i\bullet}\beta^{0})$ , $i=1,\ldots,n_{e}$ are i.i.d. with distribution $X_{k}^{e}\eta_{p+1}^{e}$ , $e\in\{1,2\}$ . By van de Geer and Bühlmann (2009), for all $t\geq 0$ , with probability exceeding $1-2\exp(-t)$ ,

[TABLE]

Taking a union bound over $k=1,\ldots,p$ , for all $t\geq 0$ , with probability exceeding $1-2\exp(-t)$ ,

[TABLE]

Using the bound for $e=1$ and $e=2$ and equation (56) yields the desired result.

∎

8.3.3 Proof of Theorem 4

Proof.

As $\sigma_{\varepsilon}\sigma_{\text{max}}^{e}\leq C$ and as $\sqrt{\log(p)/\min_{e\in\{1,2\}}n_{e}}\rightarrow 0$ for $n_{1},n_{2},p\rightarrow\infty$ , for $t=0.2\log p$ we have eventually

[TABLE]

As $\lambda\asymp 5C\sqrt{\log(p)/\min_{e\in\{1,2\}}n_{e}}$ , eventually

[TABLE]

Using Lemma 2 for $t=0.2\log(p)$ , the probability of the event $z^{*}\leq\lambda$ eventually exceeds $1-4\exp(-0.2\log(p))$ , which converges to $1$ for $p\rightarrow\infty$ . By Lemma 1, on the event $z^{*}\leq\lambda$ ,

[TABLE]

This concludes the proof. ∎

8.3.4 Proof of Proposition 4

Proof.

Using Theorem 4 for $q=\infty$ ,

[TABLE]

Using the betamin-condition,

[TABLE]

with $\mathbb{P}\rightarrow 1$ for $n_{1},n_{2},p\rightarrow\infty$ . Hence $\min_{k\in S}|\hat{\beta}_{k}^{\lambda}|>0$ with $\mathbb{P}\rightarrow 1$ for $n_{1},n_{2},p\rightarrow\infty$ . This concludes the proof. ∎

8.3.5 Proof of Lemma 3

Proof.

Consider a $u$ with $\|u_{S^{c}}\|_{1}\leq\|u_{S}\|_{1}$ . Then $\|u\|_{1}=\|u_{S^{c}}\|_{1}+\|u_{S}\|_{1}\leq 2\|u_{S}\|_{1}$ . Using this,

[TABLE]

In the last line we used that $q\geq 1$ . This concludes the proof. ∎

8.3.6 Causal Dantzig as a LP

For fixed $\lambda$ , the regularized causal Dantzig can be cast as a linear program. For notational simplicity, we will show this for the case $|\mathcal{E}|=2$ . Define

[TABLE]

Let $\Gamma^{\lambda}$ be the solution set of the linear program

[TABLE]

Let $B^{\lambda}$ be the solution set of (16). The following Lemma shows that $B^{\lambda}$ can easily be obtained from $\Gamma^{\lambda}$ .

Lemma 5.

$B^{\lambda}=\{\gamma_{1:p}-\gamma_{(p+1):2p}:\gamma\in\Gamma^{\lambda}\}$

Proof.

Let $\gamma\in\Gamma^{\lambda}$ . By constraint, all entries of $\gamma$ are non-negative. Furthermore, $\gamma_{k}$ and $\gamma_{p+k}$ cannot be nonzero at the same time: In that case, $\tilde{\gamma}$ defined as

[TABLE]

would satisfy $A\tilde{\gamma}=A\gamma\leq b$ , $\tilde{\gamma}\geq 0$ , $c^{t}\tilde{\gamma}<c^{t}\gamma$ , which contradicts the definition of $\gamma$ . As either $\gamma_{k}$ or $\gamma_{p+k}$ is equal to zero, $c^{t}\gamma=\|\gamma_{1:p}-\gamma_{(p+1):2p}\|_{1}$ . Analogously, one can show that any solution $\gamma$ to

[TABLE]

satisfies that either $\gamma_{i}=0$ or $\gamma_{i+p}=0$ . Hence $\Gamma^{\lambda}$ is also the solution set of

[TABLE]

By rewriting the constraint, this problem is equivalent to solving

[TABLE]

Now for each solution $\gamma$ of this problem we can define $\beta(\gamma):=\gamma_{1:p}-\gamma_{(p+1):2p}$ and $\beta(\gamma)$ satisfies the constraint $\|\hat{\mathbf{Z}}-\hat{\mathbf{G}}\beta\|_{\infty}\leq\lambda$ . Furthermore the objective functionals match, i.e. $\|\gamma_{1:p}-\gamma_{(p+1):2p}\|_{1}=\|\beta(\gamma)\|_{1}$ . On the other hand, for each solution $\beta$ of

[TABLE]

we can define $\gamma(\beta)\in\mathbb{R}^{2p}$ via $\gamma(\beta)_{1:p}=\max(\beta,0_{p})$ and $\gamma(\beta)_{(p+1):2p}=-\min(\beta,0_{p})$ . Note that by definition $\gamma(\beta)$ satisfies the constraints $\gamma(\beta)\geq 0$ , $\|\hat{\mathbf{Z}}-\hat{\mathbf{G}}(\gamma_{1:p}-\gamma_{(p+1):2p})\|_{\infty}\leq\lambda$ and again the objective functionals match, i.e. $\|\beta\|_{1}=\|\gamma(\beta)_{1:p}-\gamma(\beta)_{(p+1):2p}\|_{1}$ . Hence $B^{\lambda}=\{\gamma_{1:p}-\gamma_{(p+1):2p}:\gamma\in\Gamma^{\lambda}\}$ . This concludes the proof. ∎
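The correspondence between $\beta$ and $\gamma$ used in this proof can be checked numerically. The sketch below is an illustration only: the helper names (`split`, `merge`, `lp_constraints`) and the small numerical example are hypothetical, and the constraint matrix is assumed to take the standard form obtained by writing $\|\hat{\mathbf{Z}}-\hat{\mathbf{G}}(\gamma_{1:p}-\gamma_{(p+1):2p})\|_{\infty}\leq\lambda$ as two blocks of linear inequalities, which may differ in layout from the definitions of $A$ and $b$ above.

```python
import math

def split(beta):
    """gamma(beta): positive parts followed by negative parts, all entries >= 0."""
    return [max(b, 0.0) for b in beta] + [-min(b, 0.0) for b in beta]

def merge(gamma):
    """beta(gamma) = gamma_{1:p} - gamma_{(p+1):2p}."""
    p = len(gamma) // 2
    return [gamma[k] - gamma[p + k] for k in range(p)]

def sup_norm_residual(G, Z, beta):
    """|| Z - G beta ||_infinity for list-of-lists G."""
    return max(abs(Z[i] - sum(G[i][j] * beta[j] for j in range(len(beta))))
               for i in range(len(Z)))

def lp_constraints(G, Z, lam):
    """Assumed encoding: [G, -G; -G, G] gamma <= [lam + Z; lam - Z]."""
    A = [row + [-g for g in row] for row in G] + [[-g for g in row] + row for row in G]
    b = [lam + z for z in Z] + [lam - z for z in Z]
    return A, b

def lp_feasible(G, Z, lam, gamma):
    A, b = lp_constraints(G, Z, lam)
    rows_ok = all(sum(a * g for a, g in zip(row, gamma)) <= bound
                  for row, bound in zip(A, b))
    return rows_ok and all(g >= 0 for g in gamma)

G = [[2.0, 0.5], [0.5, 1.0]]
Z = [1.0, -0.3]
beta = [0.45, -0.5]
gamma = split(beta)

print(gamma, merge(gamma))
print(sum(gamma), sum(abs(b) for b in beta))      # LP objective vs. l1-norm of beta
print(sup_norm_residual(G, Z, beta))
print(lp_feasible(G, Z, 0.4, gamma), lp_feasible(G, Z, 0.3, gamma))
```

Because at most one of $\gamma_{k}$ and $\gamma_{p+k}$ is nonzero after `split`, the linear objective (the sum of the entries of $\gamma$) equals $\|\beta\|_{1}$ , and feasibility of $\gamma$ in the linear program coincides with the sup-norm constraint on $\beta$ .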

8.4 Proof for Section 5

8.4.1 Proof of Lemma 4

Proof.

Proof by contradiction. Let $X_{k}$ be a parent or child of $Y$ in $D_{total}$ with $k\not\in S$ . Without loss of generality let us assume that $Y\rightarrow X_{k}$ . As the regression coefficient of $X_{k}$ is zero and as $X_{1},\ldots,X_{p},Y$ are multivariate Gaussian, $Y$ is conditionally independent of $X_{k}$ given $X_{S}$ . As the distribution of $X_{1},...,X_{p},Y,H_{1},...,H_{q}$ is faithful to $D_{total}$ , $Y$ and $X_{k}$ are $d$ -separated by $X_{S}$ in $D_{total}$ , see e.g. Pearl (2009) for a reference. Hence the path $Y\rightarrow X_{k}$ is blocked by $X_{S}$ . But the path $Y\rightarrow X_{k}$ can only be blocked if $k\in S$ . Contradiction. This concludes the proof. ∎

8.5 Asymptotic efficiency

Assume that for $e\in\{1,2\}$ the variables $(X^{e},Y^{e})$ are centered (non-degenerate) Gaussian random variables that are generated from a structural equation model under Assumption 1. Intuitively, as the Gram matrices are asymptotically efficient estimators of $\mathbb{E}(X^{e})^{t}X^{e}$ and $\mathbb{E}(X^{e})^{t}Y^{e}$ one would expect the plug-in estimator $\hat{\beta}=\hat{\mathbf{G}}^{-1}\hat{\mathbf{Z}}$ to be efficient, too. That is still true in some sense, but we have to be a bit careful with the notion of efficiency. There are two issues that we have to take care of. First, we have the additional constraint that the data is generated by a specific SEM that satisfies inner-product invariance under the true causal coefficient $\beta^{0}$ . Can this constraint be exploited to lower asymptotic variance? Additionally, we have to deal with the fact that $n_{1}$ and $n_{2}$ may have different asymptotic growth rates. The following Lemma gives an answer to the first question if we allow for errors-in-variables as defined in equation (4).

Lemma 6.

Consider distributions $(\mathring{X}^{1},\mathring{Y}^{1})\sim\mathcal{N}(0,\mathring{\boldsymbol{\Sigma}}^{1})$ and $(\mathring{X}^{2},\mathring{Y}^{2})\sim\mathcal{N}(0,\mathring{\boldsymbol{\Sigma}}^{2})$ with inner-product invariance under $\beta^{0}$ that satisfy Assumption 1 and have errors-in-variables as defined in equation (4). For any distribution $(\tilde{X}^{1},\tilde{Y}^{1})\sim\mathcal{N}(0,\tilde{\boldsymbol{\Sigma}}^{1})$ with $\tilde{\boldsymbol{\Sigma}}^{1}$ sufficiently close to $\mathring{\boldsymbol{\Sigma}}^{1}$ and $(\tilde{X}^{2},\tilde{Y}^{2})\sim\mathcal{N}(0,\tilde{\boldsymbol{\Sigma}}^{2})$ with $\tilde{\boldsymbol{\Sigma}}^{2}$ sufficiently close to $\mathring{\boldsymbol{\Sigma}}^{2}$ , there exists a linear structural equation model with errors-in-variables that satisfies Assumption 1 and equation (4).

This Lemma shows that the fact that our model is generated by a Gaussian linear SEM with additive interventions and errors-in-variables that satisfies inner-product invariance does not restrict the distributions in a neighborhood of other models that satisfy these properties. Now let us turn to the question what statements can be made about the limit $n_{1}\rightarrow\infty$ , $n_{2}\rightarrow\infty$ . It is straightforward to model this the following way: for each sample $i=1,\ldots,n$ , first a coin is tossed. With probability $0<\pi<1$ we observe a sample from setting $e_{i}=1$ and with probability $1-\pi$ we observe a sample of setting $e_{i}=2$ . To be more precise, the corresponding log density can be written as

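$$\sum_{i=1}^{n}\Big[\mathbf{1}\{e_{i}=1\}\big(\log\pi+\log f_{\boldsymbol{\Sigma}^{1}}(x_{i},y_{i})\big)+\mathbf{1}\{e_{i}=2\}\big(\log(1-\pi)+\log f_{\boldsymbol{\Sigma}^{2}}(x_{i},y_{i})\big)\Big],$$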

where $f_{\boldsymbol{\Sigma}}$ denotes the density of a centered Gaussian distribution with covariance $\boldsymbol{\Sigma}\in\mathbb{R}^{(p+1)\times(p+1)}$. Hence, $(\mathbf{X}^{1},\mathbf{Y}^{1})$ is a sufficient statistic for $\boldsymbol{\Sigma}^{1}$ and $(\mathbf{X}^{2},\mathbf{Y}^{2})$ is a sufficient statistic for $\boldsymbol{\Sigma}^{2}$. By Anderson (1973), the Gram matrix of $(\mathbf{X}^{1},\mathbf{Y}^{1})$ is asymptotically efficient for estimating $\boldsymbol{\Sigma}^{1}$ and the Gram matrix of $(\mathbf{X}^{2},\mathbf{Y}^{2})$ is asymptotically efficient for estimating $\boldsymbol{\Sigma}^{2}$. The Fisher information matrix is block diagonal with blocks for $\boldsymbol{\Sigma}^{1}$, $\boldsymbol{\Sigma}^{2}$ and $\pi$. Thus, the Gram matrices of $(\mathbf{X}^{1},\mathbf{Y}^{1})$ and $(\mathbf{X}^{2},\mathbf{Y}^{2})$ are asymptotically efficient for jointly estimating $\boldsymbol{\Sigma}^{1}$ and $\boldsymbol{\Sigma}^{2}$. By the delta method, the plug-in estimator $\hat{\beta}=\hat{\mathbf{G}}^{-1}\hat{\mathbf{Z}}$ is asymptotically efficient for estimating $\beta^{0}$. Note that in the discussion above we have $n_{1}\sim\pi\cdot n$ and $n_{2}\sim(1-\pi)\cdot n$. Hence this is a “balanced” scenario, and this type of analysis does not work for, say, $n_{1}=o(n_{2})$. In the latter case, the asymptotic variance of estimating the Gram matrix in setting $e=1$ dominates that of estimating the Gram matrix in setting $e=2$. Hence it can be shown that $\hat{\beta}$ has the same asymptotic variance as an efficient estimator for $\beta^{0}$ that assumes the Gram matrix in setting $e=2$ is known.
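The balanced coin-toss scenario and the plug-in estimator can be illustrated with a small simulation. The sketch below is not the authors' implementation: it assumes that $\hat{\mathbf{G}}$ and $\hat{\mathbf{Z}}$ are the differences between the two settings of the normalized Gram matrices $(\mathbf{X}^{e})^{t}\mathbf{X}^{e}/n_{e}$ and $(\mathbf{X}^{e})^{t}\mathbf{Y}^{e}/n_{e}$, and it uses a hypothetical toy SEM (function `simulate`, coefficient vector `beta0`, mixing probability `pi`) in which a hidden variable confounds $X$ and $Y$ and the intervention adds independent noise to $X$ only.

```python
import numpy as np

def causal_dantzig_plugin(X1, Y1, X2, Y2):
    """Plug-in estimate G^{-1} Z (G and Z defined as assumed in the lead-in)."""
    n1, n2 = X1.shape[0], X2.shape[0]
    G = X1.T @ X1 / n1 - X2.T @ X2 / n2  # difference of Gram matrices of X
    Z = X1.T @ Y1 / n1 - X2.T @ Y2 / n2  # difference of cross moments of X and Y
    return np.linalg.solve(G, Z)

def simulate(n, shift_sd, beta, rng):
    """Hypothetical toy SEM: hidden H confounds X and Y; only X is intervened on."""
    p = beta.shape[0]
    H = rng.normal(size=n)                       # hidden confounder
    delta = shift_sd * rng.normal(size=(n, p))   # additive intervention on X
    X = rng.normal(size=(n, p)) + H[:, None] + delta
    Y = X @ beta + H + rng.normal(size=n)        # Y itself is not intervened on
    return X, Y

rng = np.random.default_rng(0)
beta0 = np.array([1.0, -0.5, 0.0])
n, pi = 200_000, 0.3

# Coin toss per sample: with probability pi the sample comes from setting e=1.
in_setting_1 = rng.random(n) < pi
n1 = int(in_setting_1.sum())
n2 = n - n1

X1, Y1 = simulate(n1, shift_sd=0.0, beta=beta0, rng=rng)  # setting e=1
X2, Y2 = simulate(n2, shift_sd=2.0, beta=beta0, rng=rng)  # setting e=2
beta_hat = causal_dantzig_plugin(X1, Y1, X2, Y2)
print(beta_hat)
```

In this toy model an ordinary least-squares fit would be biased by the hidden variable, whereas the plug-in estimate should land close to `beta0` because the cross moment of $X$ with the residual is the same in both settings.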

8.5.1 Proof of Lemma 6

Proof.

Choose $\tilde{\beta}^{0}$ such that $(\tilde{\boldsymbol{\Sigma}}^{1}-\tilde{\boldsymbol{\Sigma}}^{2})_{1:p,1:p}\tilde{\beta}^{0}=(\tilde{\boldsymbol{\Sigma}}^{1}-\tilde{\boldsymbol{\Sigma}}^{2})_{1:p,p+1}$ . By construction of $\tilde{\beta}^{0}$ the random variables satisfy

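$$\mathbb{E}\big[(\tilde{X}^{1})^{t}(\tilde{Y}^{1}-\tilde{X}^{1}\tilde{\beta}^{0})\big]=\mathbb{E}\big[(\tilde{X}^{2})^{t}(\tilde{Y}^{2}-\tilde{X}^{2}\tilde{\beta}^{0})\big].$$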

In other words, we have inner-product invariance under $\tilde{\beta}^{0}$. Now we want to show that the distribution of $(\tilde{X}^{e},\tilde{Y}^{e}),e\in\mathcal{E}$ can be generated by a structural equation model of the following form. We want to show that there exist independent random variables $\eta^{0},\delta^{e}\in\mathbb{R}^{p+1}$ and $\zeta_{1}^{e},\ldots,\zeta_{p}^{e},\zeta_{y}^{e},e\in\{1,2\}$, with $\eta^{e}=\eta^{0}+\delta^{e}$, such that $\eta^{0},\delta^{e},\eta^{e},e\in\{1,2\}$ satisfy Assumption 1 and such that $\zeta_{1}^{e},\ldots,\zeta_{p}^{e},\zeta_{y}^{e},e\in\{1,2\}$ satisfy the assumption mentioned after equation (4). Furthermore, with slight abuse of notation, we want that the following structural equation model with errors-in-variables

[TABLE]

generates the distribution of $(\tilde{X}^{e},\tilde{Y}^{e})$ , i.e. satisfies $(X^{e},Y^{e})\sim\mathcal{N}(0,\tilde{\boldsymbol{\Sigma}}^{e}),e\in\{1,2\}$ . As $\tilde{X}_{1}^{e},\ldots,\tilde{X}_{p}^{e},\tilde{Y}^{e},e\in\{1,2\}$ are centered multivariate Gaussian it suffices to show that the covariance matrix of $(\tilde{X}_{1}^{e},\ldots,\tilde{X}_{p}^{e},\tilde{Y}^{e}-\tilde{X}^{e}\tilde{\beta}^{0}),e\in\{1,2\}$ can be decomposed into

[TABLE]

with positive semi-definite matrices $\Sigma_{\eta},\Sigma_{\delta}^{e},\Sigma_{\zeta}^{e}$ satisfying

1. $(\Sigma_{\delta}^{e})_{p+1,\bullet}\equiv 0$,

2. $\Sigma_{\zeta}^{e},e\in\{1,2\}$ are diagonal matrices with $(\Sigma_{\zeta}^{1})_{k,k}=(\Sigma_{\zeta}^{2})_{k,k}$ for $k=1,\ldots,p$.

To this end, define $r^{1}=\tilde{Y}^{1}-\sum_{k=1}^{p}\tilde{X}^{1}_{k}\tilde{\beta}_{k}^{0}$ and $r^{2}=\tilde{Y}^{2}-\sum_{k=1}^{p}\tilde{X}_{k}^{2}\tilde{\beta}_{k}^{0}$. Define the matrices

[TABLE]

Here, $\mathring{\eta}_{p+1}^{e}$ denotes the noise contribution of $\mathring{X}_{p+1}^{e}=\mathring{Y}^{e}$ in the corresponding structural equation model. For $\tilde{\boldsymbol{\Sigma}}^{e}\rightarrow\mathring{\boldsymbol{\Sigma}}^{e},e\in\{1,2\}$ , $\text{Cov}(\tilde{X}_{1:p}^{e})$ converges to $\text{Cov}(\mathring{X}_{1:p}^{e})$ and $\text{Cov}(\tilde{X}_{1:p}^{e},r^{e})$ converges to $\text{Cov}(\mathring{X}_{1:p}^{e},\mathring{Y}^{e}-\mathring{X}^{e}\beta^{0})=\text{Cov}(\mathring{X}_{1:p}^{e},\mathring{\eta}_{p+1}^{e})$ . Recall that the covariance matrices of $(\mathring{X}_{1}^{e},\ldots,\mathring{X}_{p}^{e},\mathring{\eta}_{p+1}^{e}),e\in\{1,2\}$ are positive definite. Hence $S^{1}$ and $S^{2}$ are positive definite for $\tilde{\boldsymbol{\Sigma}}^{1}$ close to $\mathring{\boldsymbol{\Sigma}}^{1}$ , $\tilde{\boldsymbol{\Sigma}}^{2}$ close to $\mathring{\boldsymbol{\Sigma}}^{2}$ and $\epsilon>0$ small enough. Now we can define

[TABLE]

With this definition the covariance matrix of $(\tilde{X}_{1}^{e},\ldots,\tilde{X}_{p}^{e},\tilde{Y}^{e}-\tilde{X}^{e}\tilde{\beta}^{0}),e\in\{1,2\}$ can be decomposed as $S^{e}+\Sigma_{\zeta}^{e},e\in\{1,2\}$ .

For $\tilde{\boldsymbol{\Sigma}}^{1}$ close to $\mathring{\boldsymbol{\Sigma}}^{1}$ and $\tilde{\boldsymbol{\Sigma}}^{2}$ close to $\mathring{\boldsymbol{\Sigma}}^{2}$ , the matrices $\Sigma_{\zeta}^{e},e\in\{1,2\}$ , are positive semi-definite as $r^{e}$ has asymptotic variance $\text{Var}(\mathring{\eta}_{p+1}^{e}+\mathring{\zeta}_{y}^{e})$ , where $\mathring{\zeta}_{y}^{e}$ denotes the measurement error of $\mathring{Y}$ in environment $e$ in the corresponding structural equation model. Thus by equation (60) it suffices to show that $S^{e},e\in\{1,2\}$ can be decomposed into positive semi-definite matrices $\Sigma_{\eta}+\Sigma_{\delta}^{e}$ such that $(\Sigma_{\delta}^{e})_{p+1,\bullet}\equiv 0$ .

To this end let us define

[TABLE]

Now we want to show that

[TABLE]

are positive semi-definite for $e\in\{1,2\}$ . To this end take $v\in\mathbb{R}^{p}$ . Then for $e\in\{1,2\}$ ,

[TABLE]

Note that we used that $S^{e},e\in\{1,2\}$ are positive definite, that by Assumption 1 $\text{Var}(\mathring{\eta}_{p+1}^{1})=\text{Var}(\mathring{\eta}_{p+1}^{2})$, and that by inner-product invariance, $\text{Cov}(\tilde{X}_{1:p}^{1},r^{1})=\text{Cov}(\tilde{X}_{1:p}^{2},r^{2})$. Now by defining $\Sigma_{\eta}:=xx^{t}$ we obtain the decomposition $S^{e}=\Sigma_{\eta}+\Sigma_{\delta}^{e}$. Note that here we used again that by inner-product invariance under $\tilde{\beta}^{0}$, $\text{Cov}(\tilde{X}_{1:p}^{1},r^{1})=\text{Cov}(\tilde{X}_{1:p}^{2},r^{2})$. This completes the proof.

∎
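The first step of the proof, choosing $\tilde{\beta}^{0}$ from the two covariance matrices and reading off inner-product invariance, can be checked numerically. The following is a minimal sketch under stated assumptions: the helper names `beta_from_covariances` and `cov_x_residual` are hypothetical, the two covariance matrices are arbitrary positive definite inputs rather than perturbations of a particular SEM, and the block $(\tilde{\boldsymbol{\Sigma}}^{1}-\tilde{\boldsymbol{\Sigma}}^{2})_{1:p,1:p}$ is assumed to be invertible.

```python
import numpy as np

def beta_from_covariances(Sigma1, Sigma2):
    """Solve (Sigma1 - Sigma2)[1:p, 1:p] beta = (Sigma1 - Sigma2)[1:p, p+1]."""
    D = Sigma1 - Sigma2
    p = D.shape[0] - 1
    return np.linalg.solve(D[:p, :p], D[:p, p])

def cov_x_residual(Sigma, beta):
    """Cov(X, Y - X beta) for centered (X, Y) with joint covariance Sigma."""
    p = Sigma.shape[0] - 1
    return Sigma[:p, p] - Sigma[:p, :p] @ beta

rng = np.random.default_rng(1)
p = 3

# Two arbitrary positive definite covariance matrices of (X_1, ..., X_p, Y).
A1 = rng.normal(size=(p + 1, p + 1))
A2 = rng.normal(size=(p + 1, p + 1))
Sigma1 = A1 @ A1.T + np.eye(p + 1)
Sigma2 = A2 @ A2.T + 3.0 * np.eye(p + 1)

beta_tilde = beta_from_covariances(Sigma1, Sigma2)
c1 = cov_x_residual(Sigma1, beta_tilde)  # Cov(X^1, r^1)
c2 = cov_x_residual(Sigma2, beta_tilde)  # Cov(X^2, r^2)
print(np.max(np.abs(c1 - c2)))
```

The two covariance vectors agree up to numerical error, which is exactly the invariance that the remainder of the proof builds on when constructing $\Sigma_{\eta}$, $\Sigma_{\delta}^{e}$ and $\Sigma_{\zeta}^{e}$.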

Bibliography

1. Anderson [1973]. T.W. Anderson. Asymptotically efficient estimation of covariance matrices with linear structure. Annals of Statistics, pages 135–141, 1973.
2. Andersson et al. [1997]. S.A. Andersson, D. Madigan, and M.D. Perlman. A characterization of Markov equivalence classes for acyclic digraphs. Annals of Statistics, 25:505–541, 1997.
3. Angrist et al. [1996]. J.D. Angrist, G.W. Imbens, and D.B. Rubin. Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91:444–455, 1996.
4. Bollen [1989]. K.A. Bollen. Structural Equations with latent variables. John Wiley & Sons, 1989.
5. Bowden and Turkington [1990]. R.J. Bowden and D.A. Turkington. Instrumental variables, volume 8. Cambridge University Press, 1990.
6. Bühlmann and van de Geer [2011]. P. Bühlmann and S. van de Geer. Statistics for high-dimensional data: Methods, theory and applications. Springer, 2011.
7. Candes and Tao [2007]. E. Candes and T. Tao. The Dantzig selector: Statistical estimation when $p$ is much larger than $n$. Annals of Statistics, 35(6):2313–2351, 2007.
8. Chickering [2002]. D. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3:507–554, 2002.
