Causal Dantzig: fast inference in linear structural equation models with hidden variables under additive interventions
Dominik Rothenhäusler, Peter Bühlmann, Nicolai Meinshausen

| | Consistency (mean-shift) | Consistency (change in error distribution) |
|---|---|---|
| Instrumental variable regression | yes | no |
| Unregularized causal Dantzig | yes | yes |
| Coverage | 0.930.01 | 0.950.01 | 0.960.01 | 0.960.01 |
|---|---|---|---|---|
| Average length | 65.792918.53 | 4.11602.53 | 0.270.62 | 0.180.01 |
| Coverage ICP | 0.920.01 | 0.840.01 | 0.420.02 | 0.30.03 |
| Coverage | 0.950.01 | 0.950.01 | 0.960.01 | 0.960.01 |
|---|---|---|---|---|
| Average length | 11354.752776.95 | 57.2728842.69 | 0.697.28 | 0.393.73 |
| causal Dantzig | 0.460.41 | 0.030.01 | 0.010 |
|---|---|---|---|
| ivreg | 0.070.01 | 0.020 | 0.010 |
| causal Dantzig | 24.0590.75 | 0.030 | 0.010 |
|---|---|---|---|
| ivreg | 36634.21161096.94 | 4244.2915557.82 | 1862.78171.14 |
Seminar für Statistik, ETH Zürich, 8092 Zürich, Switzerland
Abstract
Causal inference is known to be very challenging when only observational data are available. Randomized experiments are often costly and impractical and in instrumental variable regression the number of instruments has to exceed the number of causal predictors. It was recently shown in Peters et al. (2016) that causal inference for the full model is possible when data from distinct observational environments are available, exploiting that the conditional distribution of a response variable is invariant under the correct causal model. Two shortcomings of such an approach are the high computational effort for large-scale data and the assumed absence of hidden confounders. Here we show that these two shortcomings can be addressed if one is willing to make a more restrictive assumption on the type of interventions that generate different environments. Thereby, we look at a different notion of invariance, namely inner-product invariance. By avoiding a computationally cumbersome reverse-engineering approach such as in Peters et al. (2016), it allows for large-scale causal inference in linear structural equation models. We discuss identifiability conditions for the causal parameter and derive asymptotic confidence intervals in the low-dimensional setting. In the case of non-identifiability we show that the solution set of causal Dantzig has predictive guarantees under certain interventions. We derive finite-sample bounds in the high-dimensional setting and investigate its performance on simulated datasets.
MSC classes: 62J99, 62H99, 68T99
Keywords: causal inference, structural equation models, high-dimensional consistency
URL: http://stat.ethz.ch
1 Introduction
Using only observational data to infer causal relations is a challenging task and only possible under certain circumstances and assumptions. In the context of structural equation models (Bollen, 1989; Robins et al., 2000; Pearl, 2009), one possibility is to characterize the Markov equivalence class of graphs under the assumption of acyclicity and usually faithfulness (Verma and Pearl, 1991; Andersson et al., 1997; Tian and Pearl, 2001; Hauser and Bühlmann, 2012; Chickering, 2002). Based on the Markov equivalence class, some causal effects and often only bounds for them can be inferred, see for example Maathuis et al. (2009) and VanderWeele and Robins (2010). Other approaches exploit non-Gaussianity or nonlinearities, while making suitable assumptions about the causal model (Shimizu et al., 2006; Hoyer et al., 2009).
If both observational and interventional data are available and the targets and effects of the interventions are perfectly known, the task of inferring causal relationships becomes easier. Hauser and Bühlmann (2015), for example, adapt the greedy equivalence search of Chickering (2002) to such a scenario. If an instrumental variable is available, then different forms of instrumental variable regression (Wright, 1928; Bowden and Turkington, 1990; Angrist et al., 1996; Didelez et al., 2010) can be used to infer the causal effect of a single variable on a target of interest.
Consider a setting where data are recorded in different environments. The environments can have an arbitrary and unknown intervention effect on all predictor variables. Invariant causal prediction (Peters et al., 2016) exploits that the conditional distribution of the target of interest, given its causal parents, is invariant across environments under arbitrary interventions on all variables (excluding, just as in instrumental variable regression, direct interventions on the response or target). While it was demonstrated in Peters et al. (2016) that the method can infer a full causal model, there are two major shortcomings:
- (i) It is assumed for invariant causal prediction (ICP) (Peters et al., 2016) that there are no hidden variables that influence the target and its parents simultaneously.
- (ii) ICP scans all potential subsets of variables and tests whether the conditional distribution of the target given a subset of variables is invariant across all environments. This makes the method computationally prohibitively expensive as soon as the number of predictor variables starts to exceed one or two dozen.
We will show that both shortcomings can be addressed if we are willing to make a more specific assumption about the type of interventions that generate the different environments.
1.1 Setting and notation
Assume we have variables from a linear Structural Equation Model (SEM) (Bollen, 1989; Robins et al., 2000; Pearl, 2009),
[TABLE]
where is the set of parents of variable . For notational simplicity we set for all . Deviating from convention, we allow dependence between the components of the noise contribution which is equivalent to allowing for hidden variables as parents of the observed variables , see Figure 1 for an example. The variables form a directed graph , where the nodes are given by the variables themselves and there is an edge from variable to if and only if . Furthermore, we allow the underlying graph to be cyclic. The values for form a -dimensional matrix that we denote by . We write for the -dimensional identity matrix. To make the distribution of well defined in the presence of cycles, we assume that is invertible. Note that this is always the case if is acyclic.
We consider inferring the structural equation for just one of the variables, the target, which we denote by $Y$; the remaining variables are the predictors $X$. Note that $Y$ can be in the parental set of some (or all) of the predictors, i.e. the matrix of coefficients is not necessarily lower triangular. With slight abuse of notation, the structural equation of $Y$ reads
[TABLE]
Note that the coefficient vector in the structural equation (2) has a causal interpretation. The goal is to infer this vector.
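For concreteness, the setting can be written out as follows. This is only a sketch under assumed notation: the symbols $A$, $\eta$, $\mathrm{pa}(k)$, $\beta^0$ and the index range $k = 1, \ldots, p+1$ (with the target placed last) are illustrative choices, not something fixed by the surrounding text.

```latex
% Assumed notation: p predictors plus one target, coefficient matrix A, noise vector \eta.
\begin{align*}
X_k &= \sum_{k' \in \mathrm{pa}(k)} A_{k,k'} X_{k'} + \eta_k, \qquad k = 1, \ldots, p+1, \\
(X_1, \ldots, X_{p+1})^\top &= (\mathrm{Id} - A)^{-1} \eta
  \qquad \text{(well defined if } \mathrm{Id} - A \text{ is invertible)}, \\
Y := X_{p+1} &= X \beta^0 + \eta_{p+1}, \qquad X := (X_1, \ldots, X_p), \quad
  \beta^0 := (A_{p+1,1}, \ldots, A_{p+1,p})^\top .
\end{align*}
```

The second line shows why invertibility is the natural requirement in the cyclic case: the observed variables are a fixed linear transformation of the (possibly dependent) noise vector.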
1.2 Relation to other work
We have already mentioned major differences to invariant causal prediction (Peters et al., 2016) and the loose relation to the vast literature on instrumental variable regression (Didelez et al., 2010), which will be detailed in Section 3.6. Another method that relies on shift interventions has been published recently (Rothenhäusler et al., 2015). However, the authors exploit a different type of invariance, as inner-product invariance does not hold in that setting. Lewbel (2012) uses heteroscedasticity to infer structural equations. While Lewbel (2012) uses cross-products between exogenous variables and error terms to identify structural equations, we directly exploit the covariance structure of endogenous variables and the error terms, resulting in a different method. The comparison on a real-data application shown in Figure 11 has been published in Meinshausen et al. (2016). The concept of inner-product invariance, the causal Dantzig method and all its corresponding theory are entirely novel.
1.3 Overview
In Section 2 we introduce the notion of inner-product invariance and discuss under which assumptions this property is satisfied. In Section 3 we leverage this property to define the unregularized causal Dantzig and discuss identifiability, low-dimensional estimation and inference. Furthermore, in the case of non-identifiability we show that the solution set of causal Dantzig has predictive guarantees under certain interventions. We conclude with a comparison to instrumental variable regression and a discussion of inner-product invariance from the perspective of potential outcomes. In Section 4 we introduce the regularized causal Dantzig, examine its performance in high-dimensional estimation and show how it can achieve consistency under relaxed identifiability assumptions. Practical considerations for both the regularized and unregularized causal Dantzig can be found in Section 5. Numerical examples can be found in Section 6.
2 Conditional and inner-product invariance
In analogy to the setting of Peters et al. (2016) we assume that the data are recorded under different discrete environments or experimental conditions . The random variable in environment is denoted by and the distribution of by . We observe i.i.d. samples of from each environment and for each sample we observe from which environment it was drawn. This variable can be deterministic or random.
The distribution of a variable can be different across environments due to specific or non-specific interventions. A change in the distribution of can be caused by different intervention mechanisms such as do-interventions or noise-interventions, which can be randomized or not and known or partially known or unknown.
The type of intervention that generates the environments is arbitrary in Peters et al. (2016) with the exception that interventions on the target itself are not allowed. The same requirement is also necessary for the instrumental variable approach and we will keep this requirement in the following. For possible relaxations see Rothenhäusler et al. (2015). Throughout the paper we assume that the distributions are non-degenerate and that the Gram matrix of is well-defined and positive definite for all .
2.1 Conditional invariance
The conditional distribution of the target variable , given its parents is denoted by
[TABLE]
It was assumed in Peters et al. (2016) that the conditional distribution is invariant for all where it is defined in the absence of hidden confounding (where absence of hidden confounding is fulfilled in (1) if all components of are independent). It then holds for all environments and all for which the conditional distributions are well defined that
[TABLE]
This conditional invariance under the true parental set is then exploited for inference by testing for all subsets of whether the invariance of (3) can be rejected. The intersection of all subsets for which invariance cannot be rejected is then automatically a subset of the true parental set with controllable probability.
There are two shortcomings of this invariance approach (Peters et al., 2016) in certain contexts:
- (i) The invariance (3) becomes invalid under hidden confounding between the target and its parents, as the conditional invariance of (3) can be violated even for the true parental set (Peters et al., 2016).
- (ii) Testing each subset of the predictor variables restricts, in practice, the number of variables that can be handled.
Both of these shortcomings can be addressed when using a different type of invariance.
2.2 Inner-product invariance
We show in the following that the invariance of the conditional distribution (3) can be replaced with an inner-product invariance under a more specific assumption on the mechanism that generates the different environments.
Definition 1.
Inner-product invariance under a coefficient vector $\beta$ is fulfilled iff
[TABLE]
for all and .
We will show that inner-product invariance is true for the causal vector under the assumption of additive interventions made precise in the following. A derivation of this result from potential outcome assumptions is discussed in Section 3.7. The concept of inner-product invariance will then later be exploited for computationally fast causal inference for both low- and high-dimensional data.
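As a hedged illustration, the condition in Definition 1 can be written out as below. The notation is assumed rather than taken from the text: $X^e$ is the row vector of the $p$ predictors and $Y^e$ the target in environment $e$, and $\mathcal{E}$ is the set of environments.

```latex
% Assumed notation: X^e (row vector of p predictors), Y^e (target), \mathcal{E} (set of environments).
\[
\mathbb{E}\big[ X^e_k \,(Y^e - X^e \beta) \big]
  \;=\;
\mathbb{E}\big[ X^f_k \,(Y^f - X^f \beta) \big]
\qquad \text{for all } e, f \in \mathcal{E} \text{ and } k = 1, \ldots, p .
\]
```

Read this way, each pair of environments contributes $p$ linear equations in $\beta$. This is the structural reason the approach can avoid a search over subsets of variables.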
2.3 Additive interventions
We assume here that the structural equations (1) are constant across all environments and that the change in the distribution of between environments is caused by a shift in the distribution of between different environments.
Assumption 1.
Assume that the distributions of , , are generated by the linear SEM
[TABLE]
Assume that there exist random variables with for all such that can be written as
[TABLE]
We assume that for all and .
Note that the components of and of each vector , are allowed to be dependent to allow for hidden confounding. We call the random variables , , additive interventions as they are additive and specific to the environment . can for example be an additive contribution if for some variable or a noise contribution if or both. If for some and we say that there is no intervention on variable in environment . The last part of the assumption ensures that the noise part that is specific to environment does not include an intervention on the target variable itself and is a type of exclusion restriction (Pearl, 2009). Mathematically, the crucial property of Assumption 1 is that the covariance between the error of covariates and target variable is constant, i.e. that is constant across environments . This allows us to obtain the following result.
Proposition 1.
Under Assumption 1, we have inner-product invariance under the true causal coefficients :
[TABLE]
for all and .
The proof of this result can be found in the Appendix. A derivation of this result from potential outcome assumptions is discussed in Section 3.7. We will exploit inner-product invariance to infer the causal effects in linear SEMs in the following.
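The following short calculation sketches why additive interventions lead to inner-product invariance. It is not a substitute for the proof in the Appendix. It assumes the decomposition $\eta^e = \eta^0 + \delta^e$ with $\delta^e_{p+1} = 0$ and $\mathbb{E}[\eta^0_k \delta^e_{k'}] = 0$ for all $k, k'$, where all of these symbols, and $B_k$ for the $k$-th row of $(\mathrm{Id} - A)^{-1}$, are assumed notation.

```latex
% Assumed: \eta^e = \eta^0 + \delta^e, \ \delta^e_{p+1} = 0, \ \mathbb{E}[\eta^0_k \delta^e_{k'}] = 0.
% B_k denotes the k-th row of (\mathrm{Id} - A)^{-1}.
\begin{align*}
Y^e - X^e \beta^0 &= \eta^e_{p+1} = \eta^0_{p+1}, \\
X^e_k &= B_k \,(\eta^0 + \delta^e), \\
\mathbb{E}\big[ X^e_k \,(Y^e - X^e \beta^0) \big]
  &= B_k \,\mathbb{E}\big[ \eta^0 \eta^0_{p+1} \big] + B_k \,\mathbb{E}\big[ \delta^e \eta^0_{p+1} \big]
   = B_k \,\mathbb{E}\big[ \eta^0 \eta^0_{p+1} \big] .
\end{align*}
```

The last expression does not involve $e$, so the inner product between each predictor and the residual under the causal coefficients is the same in every environment, even though the components of $\eta^0$ may be dependent.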
2.4 Errors-in-variables
In many real-world applications, we cannot directly observe , but make a measurement error when observing it. In other words, we measure
[TABLE]
where are centered, jointly independent and independent of with finite variance. Furthermore, we make the assumption that the distributions of are invariant for different settings . Note that we do not assume that the distribution of is invariant for different settings . Errors-in-variables exhibit an effect called “regression dilution” or “attenuation”. As an example consider a Structural Equation Model of the following form:
[TABLE]
For now, let us assume that there is no confounding in this model. When regressing the target on the noisy measurement of the covariate we obtain a smaller regression coefficient than when regressing on the covariate itself, due to the higher variance of the noisy measurement. The smaller regression coefficient is by definition the best linear prediction of the target given the noisy measurement. In this sense attenuation can be ignored if one wants to make predictions based on the noisy measurement. However, in causal inference we are interested in knowing what happens when intervening on the covariate itself, and this effect would be underestimated by regressing on the noisy measurement. The following proposition shows that if inner-product invariance holds for the true covariates then it also holds for the noisy proxy variables.
Proposition 2.
Assume inner-product invariance holds for , , under . Assume we have an errors-in-variables model as defined in equation (4). Then inner-product invariance holds for , under :
[TABLE]
for all and .
The proof of this result can be found in the Appendix. As a result, methods based on inner-product invariance will be robust with respect to errors-in-variables. Note that the analogous statement is true for instrumental variable regression. Now let us turn to the definition of the unregularized causal Dantzig.
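The attenuation effect and the robustness claim can be checked with population moments in a one-dimensional toy model. This is a minimal sketch under stated assumptions: a single centered covariate `X`, a target `Y = beta * X + eps`, a proxy `W = X + xi` with measurement-error variance `var_xi`, no confounding, and two environments that differ only in the variance of `X`. All function and parameter names are hypothetical.

```python
def population_moments(beta, var_x, var_xi):
    """Population moments of the proxy W = X + xi and target Y = beta * X + eps.

    Assumes X, xi and eps are centered and mutually independent.
    Returns (E[W * Y], E[W ** 2]).
    """
    cross = beta * var_x      # E[W Y] = beta * Var(X); xi and eps drop out
    gram = var_x + var_xi     # E[W^2] = Var(X) + Var(xi)
    return cross, gram


def regression_slope_on_proxy(beta, var_x, var_xi):
    """Least-squares slope of Y on the proxy W within one environment."""
    cross, gram = population_moments(beta, var_x, var_xi)
    return cross / gram       # attenuated towards zero whenever var_xi > 0


def gram_shift_slope_on_proxy(beta, var_x_env1, var_x_env2, var_xi):
    """Ratio of between-environment differences of E[W Y] and E[W^2]."""
    cross1, gram1 = population_moments(beta, var_x_env1, var_xi)
    cross2, gram2 = population_moments(beta, var_x_env2, var_xi)
    # The measurement-error variance is the same in both environments and cancels.
    return (cross1 - cross2) / (gram1 - gram2)


attenuated = regression_slope_on_proxy(beta=2.0, var_x=1.0, var_xi=1.0)
recovered = gram_shift_slope_on_proxy(beta=2.0, var_x_env1=4.0, var_x_env2=1.0, var_xi=1.0)
```

In this toy model the regression slope is shrunk by the factor `var_x / (var_x + var_xi)`, whereas the difference-based ratio is unaffected because the invariant measurement-error variance cancels in the denominator.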
3 Causal Dantzig without regularization
In this section we introduce the unregularized causal Dantzig, discuss its basic properties and an example. We introduce the unregularized causal Dantzig in Section 3.1. Asymptotic confidence intervals for low-dimensional estimation are discussed in Section 3.3. Section 3.4 provides an example and explains basic usage of the method causalDantzig in the R-package InvariantCausalPrediction (R Core Team, 2017). Identifiability and consistency issues are discussed in Section 3.5. We conclude with a comparison to instrumental variable regression in Section 3.6.
3.1 The estimator
Assume that we observe i.i.d. samples of in two environments with samples in each environment. Let and be the and -dimensional matrices that contain the realized values of the random variables in environment and respectively and let and be the respective measurements of the response variables. Define the differences between the two environments in inner-product and Gram matrices, the so-called Gram-shift matrices
[TABLE]
Assuming inner-product invariance holds under ,
[TABLE]
A simple estimator of the causal coefficient vector is the empirical minimizer of the $\ell_\infty$-norm of the difference between the two sides of this equation.
Definition 2 (Unregularized causal Dantzig).
The causal Dantzig estimator is defined as a solution to the optimization problem
[TABLE]
The choice of how to center and scale variables deserves some attention. We will discuss this in Section 5.1. Causal Dantzig is uniquely defined if and only if the Gram-shift matrix of the predictors is invertible and can in this case be written as
[TABLE]
Note that by equation (5) this estimator is closely related to least squares in linear regression. Recall that for observations and design matrix , the least squares estimator is defined as
[TABLE]
Causal Dantzig is strikingly similar, with the Gram matrices replaced by differences of Gram matrices in different settings. As such, it is straightforward to derive asymptotic confidence intervals for this estimator. However, many properties of linear regression do not carry over. For example, the causal Dantzig is only asymptotically unbiased.
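The closed form can be sketched in a few lines of plain Python. The sketch assumes that the Gram-shift quantities are the differences of the per-environment averages `X^T X / n` and `X^T Y / n`, and that the estimator solves the resulting linear system. The function names are hypothetical and this is not the implementation from the R package.

```python
def scaled_gram(X, Y):
    """Return (X^T X / n, X^T Y / n) for a list of rows X and a list of responses Y."""
    n, p = len(X), len(X[0])
    G = [[sum(row[i] * row[j] for row in X) / n for j in range(p)] for i in range(p)]
    z = [sum(row[i] * y for row, y in zip(X, Y)) / n for i in range(p)]
    return G, z


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    p = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(M[r][c]))
        if abs(M[pivot][c]) < 1e-12:
            raise ValueError("Gram-shift matrix is numerically singular")
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, p):
            factor = M[r][c] / M[c][c]
            for k in range(c, p + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * p
    for r in reversed(range(p)):
        x[r] = (M[r][p] - sum(M[r][k] * x[k] for k in range(r + 1, p))) / M[r][r]
    return x


def causal_dantzig_unregularized(X1, Y1, X2, Y2):
    """Solve (G1 - G2) beta = (z1 - z2), the assumed closed form for two environments."""
    G1, z1 = scaled_gram(X1, Y1)
    G2, z2 = scaled_gram(X2, Y2)
    p = len(z1)
    G = [[G1[i][j] - G2[i][j] for j in range(p)] for i in range(p)]
    z = [z1[i] - z2[i] for i in range(p)]
    return solve_linear_system(G, z)


# Noise-free toy data: the second environment is a rescaled copy of the first.
beta_true = [1.0, -2.0]
X1 = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
X2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y1 = [sum(a * b for a, b in zip(row, beta_true)) for row in X1]
Y2 = [sum(a * b for a, b in zip(row, beta_true)) for row in X2]
beta_hat = causal_dantzig_unregularized(X1, Y1, X2, Y2)
```

The toy data contain no noise, so the linear system is satisfied exactly by `beta_true`. With real data the same code applies, but the solution is only an estimate and the Gram-shift matrix can be close to singular when the environments are similar.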
3.2 More than two environments
There are two straightforward extensions to more than two environments. Pooling data from different environments preserves inner-product invariance. If some of the environments are “observational” and the others are “interventional”, one option is to split the data into two environments by pooling all observational data and pooling all interventional data. Instead of splitting the data into two environments, one can change the definition of the estimator to accommodate more than two settings, for example by defining it as a solution to the optimization problem
[TABLE]
where
[TABLE]
Note that for two environments, solutions of equation (8) coincide with equation (6). It depends on the type of interventions and the signal strength which of the two options mentioned above is better. If the data can be split into two environments that are homogeneous, doing so is preferable as the estimators of and have low variance. If the environments have different (strong) interventions, solving equation (8) can be preferable as the effect of several strong interventions might get “washed out” when averaging over many environments. We will return later to the case of more than two environments. For the following discussion we assume that there are two environments .
3.3 Confidence intervals
In the settings described above the estimator is in general only asymptotically unbiased. Its bias is unknown as it depends on the unknown amount of confounding between the predictors and the target. Hence we will only pursue asymptotic confidence intervals. We will show that the estimator (7) is, under certain conditions, asymptotically normally distributed, that is, as the sample sizes in both environments tend to infinity,
[TABLE]
The matrices and are positive definite under suitable assumptions and can be consistently estimated from the data as and as we will discuss later. We can then define asymptotically valid confidence intervals for as
[TABLE]
where is the -th diagonal element of and . Here, denotes the distribution function of a standard Gaussian random variable. The interval has asymptotic coverage
[TABLE]
The conditions for asymptotic normality (10) are fourth-moment conditions on the observed random variables as well as conditions that guarantee that and are invertible and that causal Dantzig is unique.
Theorem 1 (Asymptotic normality).
Let and have finite fourth moments and assume that inner product invariance holds under . Assume that and are independent. Define and and let and the covariance matrix of , be invertible. For ,
[TABLE]
where , are invertible. Note that we allow and to have different asymptotic growth rates.
Remark 1 (Estimation of the covariance matrices).
The empirical covariance matrix of
[TABLE]
is a consistent estimator of . can be estimated analogously.
The proof of this result can be found in the Appendix. The assumption that is invertible will be discussed further in Section 3.5. In Section 4 we will discuss how the regularized causal Dantzig can be consistent in some situations where population is not invertible. Asymptotic efficiency is discussed in Section 8.5 in the Appendix.
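For a single predictor, an interval of this type can be sketched as follows. The code assumes a sandwich-type variance: the per-environment empirical variances of `x_i * (y_i - x_i * beta_hat)`, divided by the respective sample sizes, summed, and scaled by the squared Gram shift. The simulated model (one hidden confounder, an additive noise intervention on `X` only) and all names and parameter values are illustrative assumptions.

```python
import math
import random


def simulate(n, shift_sd, beta, rng):
    """Draw n samples with a hidden confounder and an additive intervention on X only."""
    xs, ys = [], []
    for _ in range(n):
        h = rng.gauss(0.0, 1.0)                        # hidden confounder
        x = h + shift_sd * rng.gauss(0.0, 1.0)         # environment-specific additive noise
        y = beta * x + h + 0.5 * rng.gauss(0.0, 1.0)   # no intervention on the target
        xs.append(x)
        ys.append(y)
    return xs, ys


def mean(values):
    return sum(values) / len(values)


def sample_variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def causal_dantzig_interval(x1, y1, x2, y2, quantile=1.959964):
    """Scalar estimate, standard error and normal-approximation interval."""
    gram_shift = mean([a * a for a in x1]) - mean([a * a for a in x2])
    inner_shift = mean([a * b for a, b in zip(x1, y1)]) - mean([a * b for a, b in zip(x2, y2)])
    beta_hat = inner_shift / gram_shift
    # Empirical variances of x_i * (y_i - x_i * beta_hat) in each environment.
    v1 = sample_variance([a * (b - a * beta_hat) for a, b in zip(x1, y1)])
    v2 = sample_variance([a * (b - a * beta_hat) for a, b in zip(x2, y2)])
    se = math.sqrt(v1 / len(x1) + v2 / len(x2)) / abs(gram_shift)
    return beta_hat, se, (beta_hat - quantile * se, beta_hat + quantile * se)


rng = random.Random(0)
x1, y1 = simulate(20000, 2.0, 1.5, rng)   # environment with a strong intervention
x2, y2 = simulate(20000, 0.5, 1.5, rng)   # environment with a weak intervention
beta_hat, se, (lower, upper) = causal_dantzig_interval(x1, y1, x2, y2)
```

Because the confounder enters both `x` and `y`, a plain regression of `y` on `x` would be biased in each environment, while the difference-based estimate targets the structural coefficient `1.5`. The interval is only asymptotically justified and can be very wide when the Gram shift is close to zero.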
3.4 Implementation and example
We use data generated according to a SEM with the structure given by Figure 1 as an example. Suppose the data are generated in two environments according to
[TABLE]
where is assumed to be drawn from and the noise variances are for environment and for environment . We draw i.i.d. samples from each environment and the corresponding pairwise scatterplots are shown in Figure 2. For one realization we obtain the estimate via the difference in Gram matrices and inner products with the target as
[TABLE]
where the correct vector of causal coefficients in this problem is
[TABLE]
Asymptotic confidence intervals can be computed via (11).
The procedure is implemented as method causalDantzig in the R-package InvariantCausalPrediction (R Core Team, 2017). The output for the example above is shown below, where X is the matrix of predictor variables, Y the outcome of interest and E is a vector with one entry per sample that indicates the environment from which the sample was drawn.
    fit <- causalDantzig(X,Y,E,regularization=FALSE)
    print(fit)

    Unregularized causal Dantzig

    Call:
    causalDantzig(X = X, Y = Y, E = E, regularization = FALSE)

       Estimate StdErr p.value
    X1   -0.042  0.059   0.481
    X2    0.999  0.106  <2e-16 ***
    X3    0.035  0.042   0.403

    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Only the direct causal effect of the second variable turns out to be statistically significant. Note that in this setting, instrumental variable regression would fail. One problem is that the number of covariates exceeds the number of “instruments”. Additionally, the expectations of the variables are equal in the two environments, implying that there is no mean shift between the two environments. We will discuss these issues in more detail in Section 3.6.
3.5 Identifiability of and practical implications
In the simplest setting, the number of samples greatly exceeds the number of parameters, and the interventions , are sufficiently different to make the parameter identifiable. Theorem 2 gives conditions under which this is the case.
Theorem 2.
Consider a SEM that satisfies Assumption 1. Assume that there exists an “observational” environment, i.e. an environment with . Furthermore assume that all interventions are full-rank on its support, i.e. that the Gram matrix of is positive definite for .
1. The causal coefficient is identifiable in the population case if and only if for each predictor variable there exists an environment with a nonzero intervention on that variable.
2. If the condition in 1. holds then the solution of causal Dantzig as defined in equation (8) is unique in the population case and equal to the causal coefficient.
The proof of this result can be found in the Appendix. Usually, there are many different SEMs satisfying Assumption 1 that can generate a given observed distribution of . Theorem 2 gives a condition under which these SEMs all share the same direct causal effect from to . If said condition is satisfied, the causal Dantzig has a unique solution in the population case that is equal to . Furthermore, it tells us that if this condition is not satisfied, there exist at least two SEMs satisfying Assumption 1 with different direct causal effects from to that generate the given distribution. Without further assumptions it is then not possible to consistently estimate the direct causal effects, but only a set of potential causal effects. We will characterize this set later.
Note that Theorem 2 describes a rather strong condition for identifiability. Especially if is large it might be unrealistic to have nonzero interventions on each of the variables . However, making additional assumptions can help resolve these identifiability issues. If the interventions only act on a subset of the variables or when the number of covariates exceeds the sample size , the regularized causal Dantzig can be consistent under the additional assumption of sparsity. We discuss consistency of the regularized causal Dantzig in such scenarios in Section 4.2 and Section 4.3. Alternatively, it can be advisable to first run LASSO on the pooled dataset to select a subset of the variables. Under the assumption of faithfulness, it is sufficient to have nonzero interventions on the selected subset. Some justification for this approach can be found in Section 5.3.
If the assumptions for identifiability of are not fulfilled it should still be possible to guarantee predictive performance under certain new environments. The following theorem makes this intuition more precise. The proof can be found in the Appendix.
Theorem 3.
Consider a SEM that satisfies Assumption 1. Assume that there exists an “observational” environment, i.e. an environment with . Furthermore assume that all interventions are full-rank on its support, i.e. that the Gram matrix of is positive definite for . Let be a solution of causal Dantzig as defined in equation (8) in the population case.
1. Then the distribution of the residuals is invariant, i.e.
[TABLE]
2. For a new environment that satisfies Assumption 1 for , with , we have
[TABLE]
In words, solutions of causal Dantzig guarantee that the residuals have the same distribution across all environments . Perhaps more importantly, solutions of causal Dantzig are guaranteed to have the same predictive performance on new environments with arbitrary large additive perturbations as long as these perturbations act on a subset of the variables .
3.6 Comparison with instrumental variables
Consider a setting where the underlying DAG takes the following form:
[Diagram: graph on the nodes $e$, $X$, $H$ and $Y$]
We assume that $H$ is not observed and that $e$ is binary. To be able to use the causal Dantzig, we have to define the settings. It is rather straightforward to take, as the two settings, the variables $(X, Y)$ conditioned on each of the two values of $e$. As $e$ is binary, the method of instrumental variables (IV) coincides with the Wald estimator (Wald, 1940). In the population case it can be written as
[TABLE]
Causal Dantzig leads to
[TABLE]
Both the IV approach and the causal Dantzig have different strengths and weaknesses in this setting. For example, equation (13) is based on means, whereas equation (14) is based on covariances. If, say, , , with centered noise independent of the centered confounder , then . Hence the IV estimator is not well-defined in the population case and one should use the causal Dantzig. If the instrument is weak, causal Dantzig can exhibit efficiency gains. An example of this can be found in Section 6.2. A more general comparison can be found in Table 1. It is also possible to construct examples where equation (14) is not well-defined. For this to happen, the second moments of and have to be equal.
A drawback of the IV approach is that the number of instruments has to equal or exceed the number of endogenous variables. However, this is not necessary for the causal Dantzig. Two settings in our framework correspond to a single binary exogenous variable. In that case the number of endogenous variables can be arbitrarily large as long as , the difference of Gram matrices, is invertible. On the other hand, for the number of endogenous variables exceeds the number of exogenous variables and the IV approach is bound to fail. We compare the performance of the IV approach and causal Dantzig on simulated datasets in Section 6.2.
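The contrast between a mean-based and a covariance-based estimand can be made concrete with population moments in a toy model. The sketch below makes several assumptions that are not taken from the text: a scalar exposure `X = mu_e + H + s_e * N` with standard normal `H` and `N`, a target `Y = beta * X + H + noise`, the Wald estimand as a ratio of differences in means, and the causal Dantzig estimand as a ratio of differences in uncentered second moments. All names are hypothetical.

```python
def moments(beta, mu, s):
    """Population moments in one environment of the assumed toy model.

    X = mu + H + s * N and Y = beta * X + H + noise, with H, N standard normal,
    centered noise, and all three mutually independent.
    """
    mean_x = mu
    mean_y = beta * mu
    second_x = mu ** 2 + 1.0 + s ** 2      # E[X^2]
    cross_xy = beta * second_x + 1.0       # E[X Y] = beta * E[X^2] + E[X H]
    return mean_x, mean_y, second_x, cross_xy


def ratio(numerator, denominator):
    """Return numerator / denominator, or None if the estimand is not defined."""
    return None if denominator == 0 else numerator / denominator


def wald_estimand(beta, env1, env2):
    """Mean-based estimand: difference in E[Y] over difference in E[X]."""
    mx1, my1, _, _ = moments(beta, *env1)
    mx2, my2, _, _ = moments(beta, *env2)
    return ratio(my1 - my2, mx1 - mx2)


def causal_dantzig_estimand(beta, env1, env2):
    """Second-moment estimand: difference in E[X Y] over difference in E[X^2]."""
    _, _, sx1, cxy1 = moments(beta, *env1)
    _, _, sx2, cxy2 = moments(beta, *env2)
    return ratio(cxy1 - cxy2, sx1 - sx2)


# Environments are (mu, s) pairs: mean shift and noise scale of the intervention.
variance_change = ((0.0, 1.0), (0.0, 2.0))
mean_shift = ((0.0, 1.0), (1.0, 1.0))

wald_var = wald_estimand(1.5, *variance_change)
dantzig_var = causal_dantzig_estimand(1.5, *variance_change)
wald_mean = wald_estimand(1.5, *mean_shift)
dantzig_mean = causal_dantzig_estimand(1.5, *mean_shift)
```

In this toy model the mean-based ratio is undefined when the environments differ only in variance, while the second-moment ratio recovers `beta` in both scenarios. This is consistent with the yes/no pattern summarized in Table 1. Conversely, the second-moment ratio is undefined when the two environments have identical second moments of `X`.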
3.7 Inner-product invariance in the potential outcome framework
In this section we will investigate the notion of inner-product invariance under potential outcome assumptions (Neyman, 1923; Rubin, 1974). Note that here, as in the rest of the paper, we consider a continuous exposure . In the following, we use a slightly different notation compared to the rest of the paper. We write for the potential outcome of a continuous exposure if the environment takes value . Equivalently we write for the potential outcome of the response of a unit if the exposure takes level and environment takes value . We assume that these quantities are well-defined. We make the following additional assumptions:
- A1. Exclusion restriction:
[TABLE]
- A2. Independence:
[TABLE]
- A3. Constant confounding across environments :
[TABLE]
- A4. Treatment effect homogeneity and linearity:
[TABLE]
- A5. The variables are normalized:
[TABLE]
Note that we did not make any cross-world assumptions (Richardson and Robins, 2013), i.e. we made no assumptions on the joint distribution of , or on the joint distribution of , . Condition (A2) can be relaxed to an assumption on the cross-product between and . Details can be found in the Appendix in the proof of Proposition 3. Condition (A3) is crucial: we allow for confounding (nonzero covariance of and ), but we assume that the covariance is constant across environments. Loosely speaking, this can be seen as a non-interaction-assumption of environment and confounding. Condition (A4) ensures that the average treatment effect is the same within strata defined by and and allows the usage of a linear model. For a discussion of similar assumptions in the context of the IV framework, see Wang and Tchetgen Tchetgen (2017).
If these assumptions are fulfilled, then we have inner-product invariance under the average treatment effect .
Proposition 3.
Under assumptions (A1) - (A5) we have inner-product invariance under the vector which satisfies , i.e.
[TABLE]
The proof of this result can be found in the Appendix. Using inner-product invariance for estimating the average treatment effect , it is possible to obtain a consistent estimate in cases in which two-stage least squares (or the Wald estimand) is degenerate. This happens, for example, in settings where the dimension of the exposure variables exceeds the number of environments, or when for . In the presence of weak instruments, the causal Dantzig can exhibit efficiency gains compared to estimators based on conditional means of and . This is investigated further in Section 3.6 and Section 6.
4 Causal Dantzig with regularization
In this section we introduce the regularized causal Dantzig, and discuss its theoretical properties. The estimator is motivated and introduced in Section 4.1. Section 4.2 contains finite sample bounds. The bounds presented in this section involve a quantity that we call the “causal cone invertibility factor”. The behavior of this quantity is discussed in Section 4.3.
4.1 The estimator
Weak interventions on some of the variables (i.e. small) may lead to coefficient estimates with high variance in equation (7). Furthermore, if the number of predictors exceeds the total sample size , the matrix is not invertible and the solution to equation (6) is not unique. In such settings, regularization and shrinkage are desirable and can outperform unpenalized estimation procedures; see e.g. Bühlmann and van de Geer (2011). In particular, $\ell_1$-penalized estimation procedures have attracted much interest in high-dimensional models. For linear models, Candes and Tao (2007) proposed an $\ell_1$-minimization method called the Dantzig selector. Consider with , , . For a tuning parameter , the Dantzig selector is defined as a solution to the regularization problem
[TABLE]
The geometry of the Dantzig selector is depicted in Figure 3. The $\ell_1$-minimization favors sparse solutions, i.e. vectors in which many coefficients are exactly zero. This facilitates interpretation. Furthermore, if gets larger, the Dantzig selector shrinks towards the zero vector. Choosing is a trade-off: small values will generally result in larger variance of the estimator, but smaller bias. We propose the regularized causal Dantzig , which in analogy to equation (6) is defined as a solution to
[TABLE]
On a superficial level, the only difference from the Dantzig selector is that is replaced by and is replaced by . Hence the geometry of the optimization problem is akin to that of the Dantzig selector, and the causal Dantzig inherits its variable selection, shrinkage and regularization properties. Furthermore, the causal Dantzig can be cast as a linear program for fixed . Details can be found in the Appendix, Section 8.3.6.
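The linear-programming view can be sketched as follows. This is a minimal illustration and not the formulation from the Appendix: we assume the regularized problem minimizes the $\ell_1$-norm of the coefficient vector subject to a sup-norm constraint on the difference of inner products, and we use the standard split of the coefficient vector into non-negative positive and negative parts. The function name `regularized_causal_dantzig`, the toy matrix `G` and the vector `Z` are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def regularized_causal_dantzig(G, Z, lam):
    """Minimize ||beta||_1 subject to ||Z - G beta||_inf <= lam, written as an LP.

    Uses beta = b_plus - b_minus with b_plus, b_minus >= 0 (assumed reformulation).
    """
    p = len(Z)
    c = np.ones(2 * p)                      # objective: sum(b_plus) + sum(b_minus)
    A = np.block([[G, -G], [-G, G]])        # G beta - Z <= lam  and  Z - G beta <= lam
    b = np.concatenate([lam + Z, lam - Z])
    res = linprog(c, A_ub=A, b_ub=b, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:p] - res.x[p:]

# Toy example: a symmetric but indefinite G, and Z = G @ (1, 0, 0).
G = np.diag([2.0, -1.0, 3.0])
Z = G @ np.array([1.0, 0.0, 0.0])
beta_small = regularized_causal_dantzig(G, Z, lam=0.1)  # slightly shrunk first coefficient
beta_large = regularized_causal_dantzig(G, Z, lam=3.0)  # constraint is loose: all zeros
print(beta_small, beta_large)
```

For the diagonal toy matrix the problem separates by coordinate: with `lam=0.1` the first coefficient only needs to satisfy |2 - 2b| <= 0.1, so the smallest admissible value is 0.95, and with `lam=3.0` the zero vector is already feasible.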
4.2 Finite-sample bound
The regularized causal Dantzig is related to the Dantzig selector and enjoys similar properties. Notably, it attains the same rates of convergence under comparable regularity conditions. To this end, we introduce the quantity “causal cone invertibility factor”, similar to the “cone invertibility factor” for the Dantzig selector as defined in Ye and Zhang (2010). For ease of exposition we will first treat the case . The treatment of the general case is sketched in Remark 2.
4.2.1 Causal Cone Invertibility Factor
Let denote the empirical covariance matrix of and consider a set . Later we will mainly be interested in the case where is the active set of . Ye and Zhang (2010) proved bounds for the Dantzig selector that involve the so-called cone invertibility factor (CIF). For the upper bound, the relevant quantity in Ye and Zhang (2010) is . Roughly speaking, the cone invertibility factor is a lower bound on the -norm of , given that lies in the cone and has unit norm . To make the quantity comparable across different norms, it is scaled by a factor . To be more precise,
[TABLE]
Now we are ready to define the causal cone invertibility factor :
[TABLE]
Analogously define for . Here and in the following, notationally we do not treat the case separately. Instead, with small abuse of notation we set for . In the new definition, the positive semi-definite matrix is replaced by the symmetric matrix . As , the matrix is not positive definite in high-dimensional settings and even indefinite in general. However, it can be shown that the CCIF behaves similarly to the CIF in several ways. This is further discussed in Section 4.3. For now, let us turn to the finite-sample bound of the causal Dantzig.
4.2.2 Finite-sample bound
The finite-sample results for the causal Dantzig are analogous to those for the Dantzig selector, while the issue of identifiability is now addressed by the causal cone invertibility factor . As in Ye and Zhang (2010), define and let denote the active set of . The first result is purely algebraic and follows from the definitions of and the causal Dantzig.
Lemma 1.
On the event we have
[TABLE]
The proof can be found in the Appendix. There are two terms on the right-hand side in equation (18) that deserve further attention. First, is bounded away from zero under certain assumptions, as discussed in Section 4.3. Secondly, it is crucial to understand the behavior of . Using a union bound over the entries, it can be shown that with high probability, is of the order :
Lemma 2.
Assume that inner-product invariance holds for under . Assume are centered and multivariate Gaussian. Let . Then, with probability exceeding ,
[TABLE]
The proof can be found in the Appendix. This result can be extended to situations where and have subgaussian tails, see e.g. exercise 14.3 in Bühlmann and van de Geer (2011). By combining Lemma 1 and Lemma 2 we obtain the following theorem. The proof can be found in the Appendix.
Theorem 4.
Let for a constant that satisfies for . Under the assumptions mentioned in Lemma 2,
[TABLE]
with for .
Another consequence of these two Lemmata is the screening property of the causal Dantzig under a so-called betamin-condition. The short proof can be found in the Appendix.
Proposition 4.
Let denote the active set of . Using the notation of Theorem 4, assume that
[TABLE]
Then under the assumptions mentioned in Theorem 4 for , we have
[TABLE]
Note that the convergence rate in Theorem 4 coincides with the usual rate of convergence in high-dimensional linear regression (Ye and Zhang (2010)) under comparable assumptions. For consistency in the norm in the regression setting it is required that , that for constant large enough and that the population quantity is bounded away from zero. In our framework, if , the assumptions on the asymptotic behavior of and stay essentially the same, but plays the role of . The next section aims to shed some light on the behavior of this quantity.
Remark 2.
The results of this section can be extended to more than two settings . To be more precise, in the general case one can define the regularized causal Dantzig as a solution to
[TABLE]
where are defined as in equation (9). The causal cone invertibility factor is then defined as
[TABLE]
With this notation, it is straightforward to obtain analogous results to Lemma 1-3, Theorem 4 and Proposition 4.
4.3 Behavior of the causal cone invertibility factor
In the preceding section we showed that the causal cone invertibility factor is a crucial quantity for understanding the behavior of the regularized causal Dantzig. How do we guarantee that this quantity is bounded away from zero? There are two issues that we will treat separately. First, for , is not invertible. Secondly, the environments might not be sufficiently different to make the population version invertible. In Section 4.3.1 we will discuss how to relate the empirical causal cone invertibility factor to the population causal cone invertibility factor. In Section 4.3.2 we consider the case where the environments are sufficiently different to make the population version invertible. In Section 4.3.3 we examine a setting where the environments are not sufficiently different, i.e. where is not invertible.
4.3.1 General properties
In this section we discuss how to relate the empirical causal cone invertibility factor to the population quantity . The following Lemma gives a deterministic bound for these quantities. The proof can be found in the Appendix.
Lemma 3.
Let . Then,
[TABLE]
where denotes the matrix max norm.
Hence the problem is reduced to understanding the behavior of . Let the rows of consist of i.i.d. centered multivariate Gaussian random variables for . It can be shown that with probability at least ,
[TABLE]
This result can be extended to situations where and have subgaussian tails, see e.g. exercise 14.3 in Bühlmann and van de Geer (2011). Hence by Lemma 3, even if is not invertible, the quantity in equation (17) is well behaved for , in the sense that it is strictly bounded away from zero if the same is true for the population quantity. The latter assumption is nontrivial and depends on the distribution of the interventions .
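The concentration of the empirical difference of Gram matrices around its population counterpart can be checked numerically. The sketch below is only an illustration under assumptions of our own: two environments with independent Gaussian coordinates, and the usual square-root-of-log-dimension-over-sample-size scaling as the reference rate; the constant in the bound above is not reproduced. The function name `max_norm_error` and all settings are hypothetical.

```python
import numpy as np

def max_norm_error(n, p, sd2, rng):
    """Max-norm distance between the empirical and population Gram differences."""
    X1 = rng.normal(size=(n, p))            # environment 1: identity covariance
    X2 = sd2 * rng.normal(size=(n, p))      # environment 2: covariance sd2^2 * I
    G_hat = X1.T @ X1 / n - X2.T @ X2 / n
    G_pop = (1.0 - sd2 ** 2) * np.eye(p)
    return np.max(np.abs(G_hat - G_pop))

rng = np.random.default_rng(1)
p = 50
# Ratio of the observed error to sqrt(log(p) / n) for increasing sample sizes.
ratios = {n: max_norm_error(n, p, 1.5, rng) / np.sqrt(np.log(p) / n)
          for n in (500, 2000, 8000)}
print(ratios)
```

If the assumed rate is appropriate, the printed ratios stay of the same order of magnitude as the sample size grows.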
4.3.2 Population invertible
Under the assumptions discussed in Section 4.3.1, is bounded away from zero if is bounded away from zero. Hence, the problem is reduced to understanding the population quantity . If is invertible, then
[TABLE]
As
[TABLE]
this is a measure of the difference in the intervention strength between the two settings and . In this sense, this bound is similar to the discussion in Section 4.3.3. However, the bound fails to capture appropriately what happens if the interventions only act on a subset of the variables . In that case the bound in equation (20) is not useful as is not invertible. The next section shows that in some of these settings it is still true that .
4.3.3 Population not invertible
The setting of Section 4.3.2 and the bound in equation (20) are rather restrictive. Consider a situation with a block structure in the Gram matrix, i.e. where for all and . In this case, there might be no interventions on the variables , i.e. for all . As a result, might not be invertible. However, if is invertible and , then
[TABLE]
Hence, under the assumptions discussed in Section 4.2.2, the causal Dantzig is a consistent estimator for . Generally speaking, the causal Dantzig tends to screen out variables that have not been affected by the intervention. In this light it is crucial that the interventions act on the variables in the active set of directly or indirectly.
5 Practical considerations
In this section we discuss practical considerations for the causal Dantzig. Recommendations are given for centering and scaling of the variables, choice of the regularization parameter and a procedure for preselection.
5.1 Centering and scaling
Centering and scaling in the causal Dantzig setting are a bit more intricate than in a regression setting. Let denote the empirical mean of . For centering, we recommend subtracting from each sample. By mean-centering globally (and not with an environment-specific intercept), the estimator is able to leverage changes in mean between environments. For scaling, define
[TABLE]
We recommend scaling the -th row of and by approximately for all and . What is the motivation behind this scaling? In the following we will discuss the special case . In the absence of noise in equation (16), . By allowing for , we account for the variance of . Since we work with a supremum bound and the same for all components, we want all scaled components to have roughly the same variance. To be more precise, we want
[TABLE]
It can be challenging to scale according to equation (22), as the correlation between and is unknown and changes for different . However, in the absence of confounding, and if and are not descendants of in the graph , is independent of and , and the scaling of equation (21) implies
[TABLE]
where denotes the standard deviation of . The scaling of equation (21) still has some theoretical justification in more general cases. In the presence of confounding and for general it depends on the joint distribution of and whether and are of the same order. Notably, if equation (21) holds with equality and if the variables and are centered multivariate Gaussian, using moment inequalities,
[TABLE]
for . Using independence of samples from different environments ,
[TABLE]
for all . Using equation (21),
[TABLE]
Hence and are of the same order for all .
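The centering recommendation from the beginning of this section can be sketched in a few lines. We assume here that the global mean is the mean over all samples pooled across environments (rather than an average of environment-wise means); the function name `center_globally` and the toy data are hypothetical. The scaling step is not shown, since it depends on the quantities defined in equation (21).

```python
import numpy as np

def center_globally(X_by_env, Y_by_env):
    """Subtract the pooled (all-environment) mean from every sample."""
    x_mean = np.vstack(X_by_env).mean(axis=0)
    y_mean = np.concatenate(Y_by_env).mean()
    X_centered = [X - x_mean for X in X_by_env]
    Y_centered = [Y - y_mean for Y in Y_by_env]
    return X_centered, Y_centered

# Two toy environments whose means differ by 4 in the single covariate.
X_envs = [np.array([[0.0], [2.0]]), np.array([[4.0], [6.0]])]
Y_envs = [np.array([1.0, 3.0]), np.array([5.0, 7.0])]
Xc, Yc = center_globally(X_envs, Y_envs)
print([X.mean() for X in Xc])  # environment means remain different after centering
```

Because only one common mean is removed, the difference in means between the environments is preserved, which is the information an environment-specific intercept would discard.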
5.2 Choosing
Large segments of the regularization path of the causal Dantzig are usually poor estimates of . Hence it is crucial to use an appropriate value of the regularization parameter . From a theoretical perspective one would choose as in Theorem 4. However, and are usually unknown in real-world datasets. Hence, in practice we propose choosing by -fold cross-validation. Concretely, in each environment the samples are split into groups of approximately equal size. Denote by the causal Dantzig estimator that is calculated on all samples except the samples from group . Let and be defined as in equation (5), using the samples from group . Then we can choose as a solution to
[TABLE]
We define the cross-validated causal Dantzig as . Two exemplary regularization paths and the solution chosen by cross-validation are depicted in Figure 4.
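The cross-validation procedure can be sketched as follows for two environments. Several details are assumptions on our part: the hold-out criterion is taken to be the sup-norm of the held-out difference of inner products, summed over folds; the folds are drawn at random within each environment; and the estimator is computed with the standard linear-programming reformulation. The names `gram_diff`, `fit`, `cv_lambda` and the simulated data are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def gram_diff(X1, Y1, X2, Y2):
    """Differences of empirical Gram matrices and of X'Y inner products."""
    G = X1.T @ X1 / len(Y1) - X2.T @ X2 / len(Y2)
    Z = X1.T @ Y1 / len(Y1) - X2.T @ Y2 / len(Y2)
    return G, Z

def fit(G, Z, lam):
    """min ||beta||_1 s.t. ||Z - G beta||_inf <= lam; returns None if infeasible."""
    p = len(Z)
    A = np.block([[G, -G], [-G, G]])
    b = np.concatenate([lam + Z, lam - Z])
    res = linprog(np.ones(2 * p), A_ub=A, b_ub=b, bounds=(0, None), method="highs")
    return res.x[:p] - res.x[p:] if res.success else None

def cv_lambda(X1, Y1, X2, Y2, lambdas, K=5, seed=0):
    """Pick lambda by K-fold CV, splitting the samples within each environment."""
    rng = np.random.default_rng(seed)
    folds1 = np.array_split(rng.permutation(len(Y1)), K)
    folds2 = np.array_split(rng.permutation(len(Y2)), K)
    scores = []
    for lam in lambdas:
        total = 0.0
        for k in range(K):
            tr1 = np.concatenate([f for j, f in enumerate(folds1) if j != k])
            tr2 = np.concatenate([f for j, f in enumerate(folds2) if j != k])
            beta = fit(*gram_diff(X1[tr1], Y1[tr1], X2[tr2], Y2[tr2]), lam)
            if beta is None:            # treat an infeasible program as the worst score
                total = np.inf
                break
            te1, te2 = folds1[k], folds2[k]
            G_te, Z_te = gram_diff(X1[te1], Y1[te1], X2[te2], Y2[te2])
            total += np.max(np.abs(Z_te - G_te @ beta))  # assumed hold-out criterion
        scores.append(total)
    return lambdas[int(np.argmin(scores))], scores

def simulate(n, extra_sd, beta, rng):
    """Hypothetical SEM with a hidden confounder H and extra noise on X."""
    H = rng.normal(size=n)
    X = H[:, None] + rng.normal(size=(n, len(beta))) + extra_sd * rng.normal(size=(n, len(beta)))
    return X, X @ beta + H + rng.normal(size=n)

rng = np.random.default_rng(0)
beta0 = np.array([2.0, 0.0, 0.0, 0.0, 0.0])
X1, Y1 = simulate(2_000, 0.0, beta0, rng)
X2, Y2 = simulate(2_000, 2.0, beta0, rng)
lambdas = [0.01, 0.1, 1.0, 10.0]
best_lam, scores = cv_lambda(X1, Y1, X2, Y2, lambdas)
print(best_lam, scores)
```

In this toy example the largest candidate value shrinks the estimate to zero and therefore leaves a large held-out residual, so it should not be selected.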
5.3 Preselection with hidden variables
An alternative to running the causal Dantzig directly on a high-dimensional dataset is preselection. In the first stage we recommend running the Lasso on observational data, if available. If observational data are not available, one could run the Lasso on the pooled dataset. In the second stage, one would run the causal Dantzig with or without regularization on the active set of the first stage. Ideally, the first stage would screen out as many variables as possible, except for the parental set of the target variable . Quite often this will result in a superset of the parental set, implying a very useful dimensionality reduction. The following Lemma provides some justification for this approach.
Lemma 4.
Assume that the distribution is generated by a linear acyclic Gaussian structural equation model with directed acyclic graph that consists of both the observed variables and (potentially) hidden confounders . Assume that the joint distribution of the variables is faithful (Pearl, 2009) to . Let denote the active set of regressing on in the population case. Then,
[TABLE]
The proof can be found in the Appendix. We test this two-step procedure on real-world data in Section 6.4. However, note that valid $p$-values (with the unregularized causal Dantzig) would require addressing a post-selection problem caused by the screening step.
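The two-stage procedure can be sketched as follows. This is an illustration under assumptions of our own: the first stage uses a plain coordinate-descent Lasso on the pooled data with a fixed penalty `alpha`, the second stage uses the unregularized two-environment estimator on the selected columns, and the simulated model (a hidden confounder acting on the first three covariates) is hypothetical, as are all function names.

```python
import numpy as np

def lasso_cd(X, y, alpha, n_iter=100):
    """Coordinate-descent Lasso for (1/2n)||y - Xb||^2 + alpha * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.copy()                              # current residual y - X b
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]               # remove coordinate j from the fit
            rho = X[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_sq[j]  # soft threshold
            r -= X[:, j] * b[j]
    return b

def preselect_then_causal_dantzig(X1, Y1, X2, Y2, alpha):
    """Stage 1: Lasso on pooled data. Stage 2: unregularized estimate on the active set."""
    Xp, Yp = np.vstack([X1, X2]), np.concatenate([Y1, Y2])
    S = np.flatnonzero(lasso_cd(Xp, Yp, alpha))
    A1, A2 = X1[:, S], X2[:, S]
    G = A1.T @ A1 / len(Y1) - A2.T @ A2 / len(Y2)
    Z = A1.T @ Y1 / len(Y1) - A2.T @ Y2 / len(Y2)
    beta = np.zeros(X1.shape[1])
    beta[S] = np.linalg.solve(G, Z)
    return S, beta

def simulate(n, extra_sd, rng, p=20):
    """Hypothetical SEM: Y depends on X_0 and X_1; H confounds X_0, X_1, X_2 and Y."""
    H = rng.normal(size=n)
    X = rng.normal(size=(n, p)) + extra_sd * rng.normal(size=(n, p))
    X[:, :3] += H[:, None]
    Y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + H + rng.normal(size=n)
    return X, Y

rng = np.random.default_rng(0)
X1, Y1 = simulate(5_000, 0.0, rng)
X2, Y2 = simulate(5_000, 2.0, rng)
S, beta_hat = preselect_then_causal_dantzig(X1, Y1, X2, Y2, alpha=0.1)
print(S, np.round(beta_hat[:3], 2))
```

The first stage may retain non-parental variables that are merely predictive through the hidden confounder; the second stage is then expected to assign them coefficients close to zero.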
6 Numerical examples
Section 6.1 explores actual coverage and length of the asymptotic confidence intervals as defined in Section 3.3. In Section 6.2 we compare the causal Dantzig to instrumental variable regression for under different types of interventions. In Section 6.3 we evaluate the performance of parameter selection by cross-validation as defined in Section 5.2. Finally, in Section 6.4 we discuss an application to real-world data that has been published in Meinshausen et al. (2016).
6.1 Causal Dantzig in low dimensions: confidence intervals
In this section we explore the actual coverage and average length of the asymptotic confidence intervals constructed according to Theorem 1.
We simulate data from two linear SEMs shown in Figure 5. Specifically, the data are generated according to the equations
[TABLE]
where the noise distributions of and respectively depend on the environment. Specifically, for SEM (A), we assume a factor model for the noise
[TABLE]
where , and the entries in both the factor loading matrix and the factor values are chosen i.i.d. standard normal. The 5-dimensional variable acts as a hidden confounder between the observed variables. The noise contribution is chosen as 1 in environment and as in environment . We call the intervention strength as it measures the variance of the additional noise input in environment over environment . In our simulations it is chosen as . For SEM (B) we generate the data analogously with the dimension of the hidden variable being five.
We draw samples in total (across both environments) and compute the confidence intervals for the causal coefficients of with the unregularized causal Dantzig. For SEM (A), the true causal coefficients for are given by , and the actual coverage and average length of the intervals constructed with the unregularized causal Dantzig at significance level 0.05 (nominal coverage 0.95) are shown in the two upper rows of Table 2 for variable . The bottom row shows the coverage of the confidence intervals for invariant causal prediction (ICP). For large , ICP often (rightfully) rejects all models and outputs neither coefficient estimates nor confidence intervals. These cases were ignored in the table. ICP is not consistent and hence has incorrect coverage for growing sample size, as is clearly visible in the table.
The causal Dantzig has approximately correct coverage for all sample sizes in this example. For small sample sizes, the variance of the causal Dantzig is large and consequently the average length of the confidence intervals of the causal Dantzig is large, too. In such regimes, regularization is recommended, as discussed in Section 4. For larger sample sizes, the confidence intervals are shrinking considerably with the -rate. For SEM (A), this effect is depicted in Table 2. Table 3 shows these effects for SEM (B). Note that also in this case the actual coverage of the causal Dantzig is approximately correct.
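A plug-in construction of such intervals can be sketched as follows. The variance formula of Theorem 1 is not reproduced here; instead we assume a sandwich form that follows from the delta-method argument, in which the estimation error equals the inverse Gram difference applied to the difference of per-environment averages of the products of predictors and residuals. The function name `causal_dantzig_ci` and the simulated model are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def causal_dantzig_ci(X1, Y1, X2, Y2, level=0.95):
    """Two-environment estimate with plug-in sandwich confidence intervals (assumed form)."""
    n1, n2 = len(Y1), len(Y2)
    G = X1.T @ X1 / n1 - X2.T @ X2 / n2
    Z = X1.T @ Y1 / n1 - X2.T @ Y2 / n2
    beta = np.linalg.solve(G, Z)
    # Per-sample products x_i * (y_i - x_i' beta) in each environment.
    S1 = X1 * (Y1 - X1 @ beta)[:, None]
    S2 = X2 * (Y2 - X2 @ beta)[:, None]
    V = (np.atleast_2d(np.cov(S1, rowvar=False)) / n1
         + np.atleast_2d(np.cov(S2, rowvar=False)) / n2)
    G_inv = np.linalg.inv(G)
    se = np.sqrt(np.diag(G_inv @ V @ G_inv.T))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return beta, beta - z * se, beta + z * se

rng = np.random.default_rng(0)
beta0 = np.array([1.0, -2.0, 0.5])

def simulate(n, extra_sd):
    """Hypothetical SEM with a hidden confounder H and extra noise on X."""
    H = rng.normal(size=n)
    X = H[:, None] + rng.normal(size=(n, 3)) + extra_sd * rng.normal(size=(n, 3))
    return X, X @ beta0 + H + rng.normal(size=n)

beta_s, lo_s, hi_s = causal_dantzig_ci(*simulate(2_000, 0.0), *simulate(2_000, 2.0))
beta_l, lo_l, hi_l = causal_dantzig_ci(*simulate(20_000, 0.0), *simulate(20_000, 2.0))
print(np.mean(hi_s - lo_s), np.mean(hi_l - lo_l))
```

Increasing the sample size tenfold should shrink the average interval length by roughly a factor of the square root of ten, in line with the shrinking intervals reported above.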
6.2 Causal Dantzig and the instrumental variable approach
To compare the causal Dantzig to instrumental variables, consider a binary instrument . To be more precise, we consider the model
[TABLE]
The corresponding DAG is depicted in Figure 6. In words, is a direct cause of , there is a hidden confounder that causes both and , and is an instrument for , meaning that is a root node and a direct cause of , but not of or . Note that the conditional mean differs between settings, i.e. . Hence the IV approach is consistent for the true causal effect from to , as discussed in Section 3.6.
For each environment we generate samples and estimate the direct causal effect via causal Dantzig and instrumental variables regression using the function ivreg in the R-package AER. Table 4 shows the mean square error for both methods. For few observations, the causal Dantzig is relatively unstable.
For larger values of , this is not the case and the mean square error shrinks at the -rate for both estimators. The instrumental variables (IV) approach outperforms the causal Dantzig in this example. This is due to the fact that IV is a fraction of conditional means, whereas the causal Dantzig is a fraction of conditional covariances. Estimating conditional means is statistically easier, but it comes at a certain price as we will see below.
For the second model, we change the edge function between and . Specifically,
[TABLE]
Both the conditional variance and the conditional mean change between the environments. However, the conditional mean changes only slightly, imposing difficulties for the IV approach. Again, for each environment we generate samples and estimate the direct causal effect via causal Dantzig and ivreg. As seen in Table 5, for very few observations, both ivreg and causal Dantzig are comparatively far from the target quantity. For larger values of , the causal Dantzig converges with the -rate. The instrumental variables approach is consistent but unstable for these small sample sizes as the instrument is weak. It exhibits large MSE as it does not use the changing variance for inference.
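The contrast between the two estimators can be illustrated with a small simulation. The sketch below does not reproduce the models of this section or the `ivreg` function; it uses a hypothetical one-predictor model with a hidden confounder, the Wald ratio as a stand-in for instrumental variable regression with a binary instrument, and the one-dimensional causal Dantzig as a ratio of differences of second moments. No centering is applied, since the noise in the toy model has mean zero.

```python
import numpy as np

def wald_iv(x0, y0, x1, y1):
    """Wald ratio: difference in means of y over difference in means of x."""
    return (y1.mean() - y0.mean()) / (x1.mean() - x0.mean())

def causal_dantzig_1d(x0, y0, x1, y1):
    """Ratio of differences of second moments (one predictor, two environments)."""
    return (np.mean(x1 * y1) - np.mean(x0 * y0)) / (np.mean(x1 ** 2) - np.mean(x0 ** 2))

def simulate(n, mean_shift, extra_sd, beta, rng):
    """Hypothetical model: h confounds x and y; the environment shifts or perturbs x."""
    h = rng.normal(size=n)
    x = mean_shift + h + rng.normal(size=n) + extra_sd * rng.normal(size=n)
    y = beta * x + h + rng.normal(size=n)
    return x, y

rng = np.random.default_rng(0)
beta, n = 2.0, 100_000

# Scenario A: the environment shifts the mean of x.
x0, y0 = simulate(n, 0.0, 0.0, beta, rng)
x1, y1 = simulate(n, 2.0, 0.0, beta, rng)
iv_a, cd_a = wald_iv(x0, y0, x1, y1), causal_dantzig_1d(x0, y0, x1, y1)

# Scenario B: the environment mostly changes the variance of x (tiny mean shift).
x0, y0 = simulate(n, 0.0, 0.0, beta, rng)
x1, y1 = simulate(n, 0.05, 2.0, beta, rng)
iv_b, cd_b = wald_iv(x0, y0, x1, y1), causal_dantzig_1d(x0, y0, x1, y1)
print(iv_a, cd_a, iv_b, cd_b)
```

In scenario B the denominator of the Wald ratio is close to zero, which is the weak-instrument situation described above, while the difference of second moments remains well separated from zero.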
6.3 Causal Dantzig in high dimensions
We consider a structural equation model, where the variables form a chain and the distribution of the unobserved confounder changes between the environments. The corresponding directed acyclic graph is depicted in Figure 7.
To be more precise, the distribution of the observed variables and is generated according to the following structural equation model:
[TABLE]
We assume that and are jointly independent. The regularization parameter is chosen by -fold cross-validation. Figure 8 shows the regularization path for two different values of . Figure 9 shows the regularization path for varying intervention strength . Finally, in Figure 10 the number of samples collected from each environment is varied. In a nutshell, cross-validation seems to select a reasonable regularization parameter in most cases; estimation performance deteriorates with increasing , but improves with increasing and drastically so with increasing intervention strength .
6.4 Gene knockout experiments
We outline here an application which has appeared in Meinshausen et al. (2016). The authors consider gene expression in yeast (Saccharomyces cerevisiae) under deletion of single genes (Kemmeren et al., 2014): samples are wild-type (observational); and samples are measured under the deletion of a single gene (intervention). For each of those observations, genome-wide mRNA expression levels were measured. We denote these measurements by , where . The goal is to predict whether mRNA expression level changes significantly under a new and unobserved gene-deletion , . Knocking out a gene is not always successful, and the measured activity of a gene is not constant (or zero) after knocking it out, i.e. the intervention is “noisy”. Overall, knockouts decrease the activity, which can be interpreted as a negative shift in the measured log-activity of a gene.
The data is split into training and validation data. To this end, the interventional samples are divided into five sets . For some , the training data consists of the four sets and the observational samples. The samples in are held out for validation. The interventional effects on the validation set were predicted using only training data. This procedure is carried out for all sets , i.e. each gene perturbation is excluded from the training set once.
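The hold-out scheme can be sketched as follows. We assume here that the five sets are formed by a random partition of the interventional samples; the source does not state how the sets were chosen, and the function name `interventional_folds` and the sample count in the example are hypothetical.

```python
import numpy as np

def interventional_folds(n_int, n_folds=5, seed=0):
    """Yield (train_idx, valid_idx); each interventional sample is held out exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_int), n_folds)
    for k in range(n_folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != k])
        yield train, folds[k]

# Example with a hypothetical number of interventional samples.
splits = list(interventional_folds(23))
for train_idx, valid_idx in splits:
    # Training data: interventional samples in train_idx plus all observational samples.
    print(len(train_idx), len(valid_idx))
```

The observational samples are never held out; they are added to the training data of every fold.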
Preselection with the LASSO was used on the pooled data to screen for a superset of the parental set of variable . For some justification of this approach, see Section 5.3. Then, the causal Dantzig without regularization was used, with setting for observational data and for interventional training data. Using the causal Dantzig without a screening step is computationally prohibitive due to the large number of variables and because the procedure is repeated for each possible target variable . The most often selected intervention predictions were compared to so-called “strong intervention effects” (SIEs) as defined in Meinshausen et al. (2016). SIEs are computed on the held-out data and are a measure for the total causal effect. The results are depicted in Figure 11. As an example, for causal Dantzig the four most often selected intervention predictions correspond to SIEs.
Screening for causal effects is a very challenging problem in this setting, mainly due to the high-dimensionality of the dataset and the presence of hidden confounders. The ground truth is not perfectly known but good proxies (strong intervention effects) can be computed on hold-out interventional data. The strongest discoveries of InvariantCausalPrediction (ICP) and causalDantzig correspond very well to the benchmark. Assuming hidden confounding and shift interventions (causalDantzig) leads to a different ranking of genes compared to assuming the absence of confounding and allowing for arbitrary interventions (ICP). Interestingly, while both methods miss some important variables, making “wrong” assumptions such as linearity or absence of latent confounding does not seem to lead to false positives for the first few variables in the ranking. This form of validation and the comparison to other methods are further discussed in Meinshausen et al. (2016).
7 Discussion
Causal discovery is challenging, particularly in the presence of hidden confounders and feedback loops. However, hidden confounders can rarely be excluded and feedback loops are to be expected in many real-world applications (e.g., in biological systems). We introduced the notion of inner-product invariance and showed that inference in linear structural equation models under inner-product invariance is possible, both for low- and high-dimensional data.
The proposed methods have interesting parallels to widely used statistical methods. For example, the functional form of the causal Dantzig estimator is similar to linear regression. The regularized causal Dantzig is similar to the Dantzig selector. For two environments the causal Dantzig estimator can be compared with instrumental variable regression and is consistent in certain settings in which instrumental variable regression fails. Hence, we believe that the causal Dantzig will push the boundaries in the analysis of certain types of datasets, in particular in the analysis of datasets where potentially unknown interventions (or “perturbations”) change both the mean and the variance of the observed error distribution. Empirical results show state-of-the-art performance of our proposed estimator on a real-world dataset.
We investigated the identifiability of direct causal effects under the proposed model class. Furthermore, we showed that the regularized causal Dantzig can be consistent in the high-dimensional case even if not all covariates have been intervened on. The estimator can be obtained by solving a linear program and as such is feasible for large-scale causal inference. We derived asymptotic confidence intervals for the unregularized causal Dantzig, as well as guarantees for statistical accuracy for the regularized causal Dantzig.
The notion of inner-product invariance pushes the boundaries for the types of datasets we can leverage for causal discovery. We expect it to be useful for practitioners, in particular as a simple and fast tool for screening for potential direct causal effects. From a theoretical perspective, the regularized and unregularized causal Dantzig provide new perspectives on invariant causal prediction, on the instrumental variable approach and on classical theory for high-dimensional estimation.
8 Appendix
Remark 3 (Reminder of Assumption 1 and some of the notation).
We assume that the distributions of , , are generated by the linear SEM
[TABLE]
Assume that there exist random variables with for all such that can be written as
[TABLE]
We assume that for all and .
We aim to infer the structural equation for variable , hence we denote it by . Furthermore, for simplicity we write . The values form a -dimensional matrix that we denote by .
8.1 Proofs for Section 2
8.1.1 Proof of Proposition 1
Proof.
Recall that . We can write equation (1) under Assumption 1 more compactly as , where is the matrix that contains the structural parameters . In other words,
[TABLE]
In the following, we denote the -th unit vector in by , i.e.
[TABLE]
Recall that and . By Assumption 1, . Hence,
[TABLE]
Now we can again use Assumption 1. Recall that , and that and are uncorrelated. Hence for ,
[TABLE]
Note that this quantity is the same for all environments , which concludes the proof. ∎
8.1.2 Proof of Proposition 2
Proof.
For all and ,
[TABLE]
In the first and third lines we used that are centered and jointly independent for all . In the second line we used that we have inner-product invariance for , under and that for all and . This proves that we also have inner-product invariance for under .
∎
8.2 Proofs for Section 3
8.2.1 Proof of Theorem 1
Proof.
First note that and are invertible as and the covariance matrix of , are assumed to be invertible. Now note that by inner-product invariance of under we have and hence
[TABLE]
In particular,
[TABLE]
We denote the set of real-valued invertible matrices. Define the function by
[TABLE]
By elementary matrix algebra, this function is continuously differentiable with derivative in direction
[TABLE]
As and , the delta method yields
[TABLE]
In the last line we used independence of the samples of environment and , the CLT and the definition of and together with equation (42).
∎
8.2.2 Proof of Theorem 2
Proof.
Part A: In this part we prove claim 2 and of claim 1. By Proposition 1 we have inner-product invariance for and hence the solution set of the population causal Dantzig contains . Assume that for each there exists such that . We want to show that under this assumption the causal Dantzig is unique in the population case. By Proposition 1 we have inner-product invariance for and hence each solution to the population causal Dantzig satisfies
[TABLE]
Denote the “observational” environment, i.e. the environment with for . By inner-product invariance under ,
[TABLE]
By rearranging,
[TABLE]
As we want to show it suffices to show that is invertible. In the following, for notational brevity we write instead of and instead of . By definition, we have
[TABLE]
In the last line we used for all . As setting is “observational”, i.e. ,
[TABLE]
Now we want to show that is invertible. If this is not the case then there exists such that . As is invertible,
[TABLE]
In particular, . As , by equation (44) we have . As ,
[TABLE]
As we assumed that the Gram matrix of is positive definite for all , this is a contradiction. Hence is invertible. Thus the matrix in equation (8.2.2) is invertible if and only if
[TABLE]
is invertible. Let be such that . We will derive a contradiction. As all matrices , are positive semi-definite we have
[TABLE]
As there exists such that . Fix such a . By assumption there exists such that . Fix such an environment . Define , the support of . By definition,
[TABLE]
But by assumption, the matrix is positive definite. Hence
[TABLE]
Contradiction! Thus is invertible and . This concludes the proof of part A.
Part B: In this part we prove of claim 1. Proof by contradiction. Assume there exists a such that for all . We want to show that there exists a second SEM with that satisfies Assumption 1 and generates the distributions of . Fix a such that for all . As above, it is possible to show that for all ,
[TABLE]
As for all there exists such that for all . Now we want to show that there exists a SEM with that generates the distributions of and satisfies Assumption 1. For we keep the structural equations . For the variable we define the new structural equation
[TABLE]
where we choose small enough to make invertible.
Furthermore define and . Note that this SEM still satisfies that one environment is “observational”, i.e. and that all interventions are full-rank on its support as the same holds true for . Now we want to show that this SEM satisfies inner-product invariance under . By inner-product invariance under , and as for all ,
[TABLE]
Hence we also have inner-product invariance of under . Now we want to show that the new SEM generates the distributions of , i.e. we want to show that
[TABLE]
By definition,
[TABLE]
and again by definition we know . Hence to prove equation (46) it suffices to show
[TABLE]
As we defined it suffices to show
[TABLE]
Rearranging yields
[TABLE]
We know that there exists such that . Using equation (8.2.2) we obtain
[TABLE]
By construction , which implies that . Combining this fact with equation (48) yields
[TABLE]
Equivalently,
[TABLE]
Now we can prove equation (47):
[TABLE]
This proves equation (47) and hence the new SEM generates . Hence is not identifiable. This concludes the proof of part B.
∎
8.2.3 Proof of Theorem 3
Proof.
It suffices to show that
[TABLE]
for all as the distribution on the right hand side of the data is the same across all environments . By Assumption 1,
[TABLE]
and hence
[TABLE]
Hence it suffices to show that
[TABLE]
for all . To this end, let denote the observational environment, i.e. the environment with . By Proposition 1,
[TABLE]
Hence also
[TABLE]
Using equation (49) and equation (50), with , equation (52) is equivalent to
[TABLE]
As shown in the proof of Theorem 2, is invertible. Hence the preceding equation is equivalent to
[TABLE]
As in the proof of Theorem 2, we can use positive definiteness of to conclude that for all . As
[TABLE]
we proved equation (51), which concludes the proof. ∎
8.2.4 Proof of Proposition 3
Proof.
First, recall that assumption (A4) says that
[TABLE]
We have
[TABLE]
In the second line we used (A1). In the fourth line we used equation (53). In the last line we used (A2). By assumption (A5) we have and hence
[TABLE]
Using assumption (A3) concludes the proof. ∎
8.3 Proofs for Section 4
8.3.1 Proof of Lemma 1
Proof.
The proof follows the technique used in Ye and Zhang (2010). As , . By definition of , we have . As the active set of is we have . Hence we can invoke the definition of to obtain
[TABLE]
To bound the right hand side of equation (54),
[TABLE]
Combining equation (54) and equation (55) concludes the proof. ∎
8.3.2 Proof of Lemma 2
Proof.
Using inner-product invariance of under ,
[TABLE]
Now we can use that , are i.i.d. with distribution , . By van de Geer and Bühlmann (2009), for all , with probability exceeding ,
[TABLE]
Taking a union bound over , for all , with probability exceeding ,
[TABLE]
Using the bound for and and equation (56) yields the desired result.
∎
8.3.3 Proof of Theorem 4
Proof.
As and as for , for we have eventually
[TABLE]
As , eventually
[TABLE]
Using Lemma 3 for , the probability of the event eventually exceeds , which converges to for . By Lemma 2, on the event ,
[TABLE]
This concludes the proof. ∎
8.3.4 Proof of Proposition 4
Proof.
Using Theorem 4 for ,
[TABLE]
Using the betamin-condition,
[TABLE]
with for . Hence with for . This concludes the proof. ∎
8.3.5 Proof of Lemma 3
Proof.
Consider an with . Hence, . Using this,
[TABLE]
In the last line we used that . This concludes the proof. ∎
8.3.6 Causal Dantzig as a LP
For fixed , the regularized causal Dantzig can be cast as a linear program. For notational simplicity, we will show this for the case . Define
[TABLE]
Let be the solution set of the linear program
[TABLE]
Let be the solution set of (16). The following Lemma shows that can easily be obtained from .
Lemma 5.
Proof.
Let . By constraint, all entries of are non-negative. Furthermore, and cannot be nonzero at the same time: In that case, defined as
[TABLE]
would satisfy , , , which contradicts the definition of . As either or is equal to zero, . Analogously, one can show that any solution to
[TABLE]
satisfies that either or . Hence is also the solution set of
[TABLE]
By rewriting the constraint, this problem is equivalent to solving
[TABLE]
Now for each solution of this problem we can define and satisfies the constraint . Furthermore the objective functionals match, i.e. . On the other hand, for each solution of
[TABLE]
we can define via and . Note that by definition satisfies the constraints , and again the objective functionals match, i.e. . Hence . This concludes the proof. ∎
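The reformulation above can be sketched numerically. The snippet below is a minimal illustration, not the authors' implementation: it assumes that the regularized causal Dantzig in (16) has the Dantzig-selector form "minimize the l1-norm of beta subject to the sup-norm of Z - G beta being at most lambda", and it splits beta into non-negative parts as in the proof. The function name `regularized_causal_dantzig` and the toy inputs `G`, `Z`, `lam` are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def regularized_causal_dantzig(G, Z, lam):
    """Solve min ||beta||_1 s.t. ||Z - G beta||_inf <= lam as an LP.

    Assumption: this is the form of problem (16); the split
    beta = beta_plus - beta_minus with both parts >= 0 mirrors the proof.
    """
    p = G.shape[0]
    # Objective: sum of all entries of (beta_plus, beta_minus).
    cost = np.ones(2 * p)
    # Z - G(beta_plus - beta_minus) <= lam  and  -(Z - G(...)) <= lam.
    A_ub = np.block([[-G, G], [G, -G]])
    b_ub = np.concatenate([lam - Z, lam + Z])
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    beta_plus, beta_minus = res.x[:p], res.x[p:]
    return beta_plus - beta_minus


# Toy inputs (hypothetical): with G the identity the constraint decouples
# coordinate-wise, so the solution shrinks each entry of Z towards zero by lam.
G = np.eye(2)
Z = np.array([1.0, 0.05])
beta = regularized_causal_dantzig(G, Z, lam=0.1)
```

In this toy case the first coordinate is shrunk from 1.0 to 0.9 and the second is set to zero, which is the soft-thresholding behaviour one would expect from an l1 objective with a sup-norm constraint.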
8.4 Proof for Section 5
8.4.1 Proof of Lemma 4
Proof.
Proof by contradiction. Let be a parent or child of in with . Without loss of generality let us assume that . As the regression coefficient of is zero and as are multivariate Gaussian, is conditionally independent of given . As the distribution of is faithful to , and are d-separated by in , see e.g. Pearl (2009) for a reference. Hence the path is blocked by . But the path can only be blocked if . Contradiction. This concludes the proof. ∎
8.5 Asymptotic efficiency
Assume that for the variables are centered (non-degenerate) Gaussian random variables that are generated from a structural equation model under Assumption 1. Intuitively, as the Gram matrices are asymptotically efficient estimators of and one would expect the plug-in estimator to be efficient, too. That is still true in some sense, but we have to be a bit careful with the notion of efficiency. There are two issues that we have to take care of. First, we have the additional constraint that the data is generated by a specific SEM that satisfies inner-product invariance under the true causal coefficient . Can this constraint be exploited to lower asymptotic variance? Additionally, we have to deal with the fact that and may have different asymptotic growth rates. The following Lemma gives an answer to the first question if we allow for errors-in-variables as defined in equation (4).
Lemma 6.
Consider distributions and with inner-product invariance under that satisfy Assumption 1 and have errors-in-variables as defined in equation (4). For any distribution with sufficiently close to and with sufficiently close to , there exists a linear structural equation model with errors-in-variables that satisfies Assumption 1 and equation (4).
This lemma shows that requiring the model to be generated by a Gaussian linear SEM with additive interventions and errors-in-variables that satisfies inner-product invariance does not restrict the distributions in a neighborhood of other models that satisfy these properties. Now let us turn to the question of what statements can be made about the limit , . It is straightforward to model this in the following way: for each sample , first a coin is tossed. With probability we observe a sample from setting and with probability we observe a sample from setting . To be more precise, the corresponding log density can be written as
[TABLE]
where denotes the density of a centered Gaussian distribution with covariance . Hence, is a sufficient statistic for and is a sufficient statistic for . By Anderson (1973), the Gram matrix of is asymptotically efficient for estimating and the Gram matrix of is asymptotically efficient for estimating . The Fisher information matrix is block diagonal with blocks for , and . Thus, the Gram matrices of and are asymptotically efficient for jointly estimating and . By the delta method, the plug-in estimator is asymptotically efficient for estimating . Note that in the discussion above we have and . Hence this is a “balanced” scenario and this type of analysis does not work for, say, . In the latter case, the asymptotic variance of estimating the Gram matrix in setting dominates the asymptotic variance of estimating the Gram matrix in setting . Hence it can be shown that has the same asymptotic variance as an efficient estimator for assuming the Gram matrix in setting is known.
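To make the plug-in construction concrete, the following is a small simulation sketch under stated assumptions. It assumes the unregularized estimator solves a linear system built from between-environment differences of Gram matrices and of predictor-response inner products. The toy SEM (one hidden confounder `h`, additive Gaussian shifts on the predictors), the function names and all numeric settings are hypothetical choices for illustration only.

```python
import numpy as np


def simulate_environment(n, beta, shift_sd, rng):
    """Draw n samples from a toy linear SEM with a hidden confounder."""
    p = len(beta)
    h = rng.normal(size=n)                      # hidden confounder of X and Y
    delta = shift_sd * rng.normal(size=(n, p))  # additive intervention on X
    x = h[:, None] + rng.normal(size=(n, p)) + delta
    y = x @ beta + h + rng.normal(size=n)       # Y is not intervened on
    return x, y


def plug_in_estimate(x1, y1, x2, y2):
    """Solve G beta = Z with G, Z differences of empirical inner products."""
    g = x1.T @ x1 / len(y1) - x2.T @ x2 / len(y2)
    z = x1.T @ y1 / len(y1) - x2.T @ y2 / len(y2)
    return np.linalg.solve(g, z)


rng = np.random.default_rng(0)
beta = np.array([1.0, -0.5])
# Environment 1 is observational (no shift); environment 2 has shifted X.
x1, y1 = simulate_environment(100_000, beta, shift_sd=0.0, rng=rng)
x2, y2 = simulate_environment(100_000, beta, shift_sd=2.0, rng=rng)

beta_hat = plug_in_estimate(x1, y1, x2, y2)
# Least squares on the observational environment, for comparison.
ols_hat = np.linalg.lstsq(x1, y1, rcond=None)[0]
```

In this toy setup the inner product between the predictors and the residual is the same in both environments, so the differences cancel the confounding term and `beta_hat` lands close to `beta`, whereas `ols_hat` stays biased by the hidden confounder.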
8.5.1 Proof of Lemma 6
Proof.
Choose such that . By construction of the random variables satisfy
[TABLE]
In other words, we have inner-product invariance under . Now we want to show that the distribution of can be generated by a structural equation model of the following form. We want to show that there exist independent random variables with such that satisfy Assumption 1 and such that satisfy the assumption mentioned after equation (4). Furthermore, with slight abuse of notation we want the following structural equation model with errors-in-variables
[TABLE]
generates the distribution of , i.e. satisfies . As are centered multivariate Gaussian it suffices to show that the covariance matrix of can be decomposed into
[TABLE]
with positive semi-definite matrices satisfying
1. ,
2. are diagonal matrices with for .
To this end, define and . Define the matrices
[TABLE]
Here, denotes the noise contribution of in the corresponding structural equation model. For , converges to and converges to . Recall that the covariance matrices of are positive definite. Hence and are positive definite for close to , close to and small enough. Now we can define
[TABLE]
With this definition the covariance matrix of can be decomposed as .
For close to and close to , the matrices , are positive semi-definite as has asymptotic variance , where denotes the measurement error of in environment in the corresponding structural equation model. Thus by equation (60) it suffices to show that can be decomposed into positive semi-definite matrices such that .
To this end let us define
[TABLE]
Now we want to show that
[TABLE]
are positive semi-definite for . To this end take . Then for ,
[TABLE]
Note that we used that are positive definite, that by Assumption 1 , and that by inner-product invariance, . Now by defining we obtain the decomposition . Note that here we used again that by inner-product invariance under , . This completes the proof.
∎
References
- Anderson [1973] T.W. Anderson. Asymptotically efficient estimation of covariance matrices with linear structure. Annals of Statistics, pages 135–141, 1973.
- Andersson et al. [1997] S.A. Andersson, D. Madigan, and M.D. Perlman. A characterization of Markov equivalence classes for acyclic digraphs. Annals of Statistics, 25:505–541, 1997.
- Angrist et al. [1996] J.D. Angrist, G.W. Imbens, and D.B. Rubin. Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91:444–455, 1996.
- Bollen [1989] K.A. Bollen. Structural Equations with latent variables. John Wiley & Sons, 1989.
- Bowden and Turkington [1990] R.J. Bowden and D.A. Turkington. Instrumental variables, volume 8. Cambridge University Press, 1990.
- Bühlmann and van de Geer [2011] P. Bühlmann and S. van de Geer. Statistics for high-dimensional data: Methods, theory and applications. Springer, 2011.
- Candes and Tao [2007] E. Candes and T. Tao. The Dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics, 35(6):2313–2351, 2007.
- Chickering [2002] D. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3:507–554, 2002.
