Invariant Causal Prediction for Sequential Data

Niklas Pfister, Peter Bühlmann, Jonas Peters

arXiv:1706.08058·math.ST·May 29, 2018

Open Access

TL;DR

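This paper introduces a method for identifying the causal predictors of a response in sequential data by exploiting the fact that the conditional distribution of the response given its causal predictors remains invariant over time. This enables causal inference without prior knowledge of the environments, and the method is applied to a problem in macroeconomics.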

Contribution

It develops an invariant causal prediction method tailored for sequential data, allowing causal inference without known heterogeneity patterns or environments.

Findings

01

Method successfully detects causal predictors in time series data.

02

Provides statistical confidence bounds for causal inference.

03

Applied to macroeconomic data for policy analysis.

Tables

Table 1: Computational cost of a single evaluation of each test statistic. The symbol $*$ stands for either $\max$ or $\operatorname{sum}$.

test statistic	$T_{i}^{*,\mathcal{F}}$ with $i\in\{1,2,3\}$	$T_{i}^{*,\mathcal{F}}$ with $i\in\{1,2,3\}$	$T^{\text{HSIC}}$
complexity	$\mathcal{O}(\lvert\mathcal{F}\rvert\cdot\lvert S\rvert^{2}\cdot(n+\lvert S\rvert))$	$\mathcal{O}(\lvert\mathcal{F}\rvert\cdot n)$	$\mathcal{O}(n^{2})$

Table 2: Description of each variable in the data set.

variable	description
$Y$	log returns of end of month exchange rate Euro to Swiss Francs
$X^{1}$	change in average call money rate (no log transform as part of the values are negative)
$X^{2}$	log returns of end of month proportion of foreign currency investments from total assets on the balance sheet of the SNB
$X^{3}$	log returns of end of month proportion of reserve positions at International Monetary Fund (IMF) from total assets on the balance sheet of the SNB
$X^{4}$	log returns of end of month proportion of monetary assistance loans from total assets on the balance sheet of the SNB
$X^{5}$	log returns of end of month proportion of Swiss Franc securities from total assets on the balance sheet of the SNB
$X^{6}$	log returns of end of month proportion of remaining assets from total assets on the balance sheet of the SNB
$X^{7}$	log returns of Swiss GDP (in Euro) resulting from interpolation of quarterly (seasonally adjusted) data and adjusted using the monthly average exchange rate
$X^{8}$	log returns of Euro zone GDP resulting from an interpolation of quarterly (seasonally adjusted) GDP data
$X^{9}$	inflation rate for Switzerland computed from the monthly consumer price index (CPI)

Table 3: The three alternatives used in the simulations.

	alternative 1	alternative 2	alternative 3
$\lvert\beta_{e_{n}}-\beta_{f_{n}}\rvert$	$\frac{\log(n)}{20\cdot n^{1/2}}$	$\frac{\log(n)}{20\cdot n^{1/4}}$	$0$
$\lvert\sigma_{e_{n}}^{2}-\sigma_{f_{n}}^{2}\rvert$	$0$	$0$	$\frac{\log(n)}{n^{1/2}}$

Equations

$$Y_{e}\mid\big(X^{S^{*}}_{e}=x\big)\;\overset{d}{=}\;Y_{f}\mid\big(X^{S^{*}}_{f}=x\big).$$

$$H_{0,S}:\quad S\text{ is an invariant set with respect to }(\mathbf{Y},\mathbf{X}).$$

$$\tilde{S}\coloneqq\bigcap_{S\subseteq\{1,\dots,d\}:\,H_{0,S}\text{ is true}}S\;\subseteq\;S^{*},$$

$$\hat{S}(\varphi)\coloneqq\bigcap_{S\subseteq\{1,\dots,d\}:\,\varphi_{S}\text{ accepts }H_{0,S}}S.$$

$$X^{j}\leftarrow f^{j}(X^{\operatorname{PA}(j)},\varepsilon^{j}),$$

$$\varepsilon^{0}_{t}\perp\!\!\!\perp\{\varepsilon^{j}_{s}\mid\forall j\in\operatorname{AN}(0),\,s\in\{1,\dots,n\}\}.$$

$$\tilde{S}\subseteq\operatorname{AN}(Y),$$

$$H_{0,S}:\ \left\{\begin{array}{l}\exists\,\beta\in(\mathbb{R}\setminus\{0\})^{\lvert S\rvert},\ \sigma\in(0,\infty):\\ \mathbf{Y}=\mathbf{X}^{S}\beta+\boldsymbol{\varepsilon},\ \text{with }\boldsymbol{\varepsilon}\perp\!\!\!\perp\mathbf{X}^{S}\text{ and }\boldsymbol{\varepsilon}\sim\mathcal{N}(0,\sigma^{2}\mathbf{Id}),\end{array}\right.$$

$$\mathbf{Y}=\mathbf{X}^{S}\beta+\boldsymbol{\varepsilon},\quad\text{with }\boldsymbol{\varepsilon}\sim\mathcal{N}(0,\sigma^{2}\mathbf{Id}).$$

$$\widetilde{\mathbf{R}}^{S}=\frac{(\mathbf{Id}-\mathbf{P}^{S}_{\mathbf{X}})\mathbf{Y}}{\lVert(\mathbf{Id}-\mathbf{P}^{S}_{\mathbf{X}})\mathbf{Y}\rVert_{2}}=\frac{(\mathbf{Id}-\mathbf{P}^{S}_{\mathbf{X}})\boldsymbol{\varepsilon}}{\lVert(\mathbf{Id}-\mathbf{P}^{S}_{\mathbf{X}})\boldsymbol{\varepsilon}\rVert_{2}}=\frac{(\mathbf{Id}-\mathbf{P}^{S}_{\mathbf{X}})\tilde{\boldsymbol{\varepsilon}}}{\lVert(\mathbf{Id}-\mathbf{P}^{S}_{\mathbf{X}})\tilde{\boldsymbol{\varepsilon}}\rVert_{2}},$$

$$\varphi^{S}_{T,B}(\mathbf{Y},\mathbf{X})\coloneqq\mathds{1}_{\{\lvert T(\widetilde{\mathbf{R}}^{S})\rvert>c_{T,B}(\mathbf{X})\}},$$

$$c_{T,B}(\mathbf{x})\coloneqq\lceil B(1-\alpha)\rceil\text{-largest value of }\lvert T(\widetilde{\mathbf{R}}^{S,\mathbf{x}}_{1})\rvert,\dots,\lvert T(\widetilde{\mathbf{R}}^{S,\mathbf{x}}_{B})\rvert.$$

$$\lim_{B\to\infty}\mathbb{P}\big(\varphi^{S}_{T,B}(\mathbf{Y},\mathbf{X})=1\big)=\alpha.$$

$$\forall t,s\in\{1,\dots,n\}:\quad\mathbb{P}^{Y_{t}\mid X_{t}^{S}}=\mathbb{P}^{Y_{s}\mid X_{s}^{S}}.$$

$$Y_{t}=\beta_{t}X_{t}+\varepsilon_{t},\quad t\in\{1,\dots,200\}$$

$$e_{i}(\operatorname{CP})\coloneqq\begin{cases}\{1,\dots,t_{1}\}&\text{if }i=1,\\ \{t_{i-1}+1,\dots,t_{i}\}&\text{if }1<i\leq L,\\ \{t_{L}+1,\dots,n\}&\text{if }i=L+1.\end{cases}$$

$$(Y_{t},X_{t})_{t\in e}\overset{\text{iid}}{\sim}F_{e},$$

$$T^{\max,\mathcal{F}}_{i}(\widetilde{\mathbf{R}}^{S})\coloneqq\max_{(e,f)\in\mathcal{F}}\big\lvert T_{e,f}^{i}(\widetilde{\mathbf{R}}^{S})\big\rvert,\qquad\text{or by}\qquad T^{\operatorname{sum},\mathcal{F}}_{i}(\widetilde{\mathbf{R}}^{S})\coloneqq\sum_{(e,f)\in\mathcal{F}}\big\lvert T_{e,f}^{i}(\widetilde{\mathbf{R}}^{S})\big\rvert,$$

$$\mathcal{F}^{1}\coloneqq\{(e,f)\in\mathcal{E}\times\mathcal{E}\mid e\cap f=\emptyset\}.$$

$$\hat{\gamma}_{h,S}\coloneqq\big((\mathbf{X}_{h}^{S})^{\top}\mathbf{X}_{h}^{S}\big)^{-1}(\mathbf{X}_{h}^{S})^{\top}\widetilde{\mathbf{R}}_{h}^{S}\quad\text{and}\quad\hat{s}^{2}_{h,S}\coloneqq\frac{(\widetilde{\mathbf{R}}_{h}^{S}-\mathbf{X}_{h}^{S}\hat{\gamma}_{h,S})^{\top}(\widetilde{\mathbf{R}}_{h}^{S}-\mathbf{X}_{h}^{S}\hat{\gamma}_{h,S})}{\lvert h\rvert},$$

$$T_{e,f}^{1}(\widetilde{\mathbf{R}}^{S})$$

$$T_{e,f}^{2}(\widetilde{\mathbf{R}}^{S})$$

$$T_{e,f}^{3}(\widetilde{\mathbf{R}}^{S})\coloneqq\frac{(\widetilde{\mathbf{R}}_{e}^{S}-\mathbf{X}_{e}^{S}\hat{\gamma}_{f,S})^{\top}(\widetilde{\mathbf{R}}_{e}^{S}-\mathbf{X}_{e}^{S}\hat{\gamma}_{f,S})}{\hat{s}^{2}_{f,S}\,\lvert e\rvert}-1.$$

$$T_{e,f}^{4}(\widetilde{\mathbf{R}}^{S})$$

$$T_{e,f}^{5}(\widetilde{\mathbf{R}}^{S})$$

$$T^{\text{HSIC}}(\widetilde{\mathbf{R}}^{S})\coloneqq\operatorname{HSIC}(\widetilde{\mathbf{R}}^{S},\text{time}),$$

$$(Y_{n,t},X_{n,t})_{t\in e_{i}(\operatorname{CP}^{*}_{n})}\overset{\text{iid}}{\sim}F_{n,i},$$

$$Y_{t}=\mu_{e,S}+X_{t}^{S}\beta_{e,S}+\varepsilon_{t},\quad\text{with }\varepsilon_{t}\sim\mathcal{N}(0,\sigma_{e,S}^{2})\text{ and }X_{t}^{S}\perp\!\!\!\perp\varepsilon_{t}.$$

$$H_{A,S}^{n}(a,b)\coloneqq\big\{P\;\big|$$

$$\left.a=\big\lvert\sigma^{2}_{e_{i}(\operatorname{CP}^{*}_{n}),S}-\sigma^{2}_{e_{j}(\operatorname{CP}^{*}_{n}),S}\big\rvert>0\text{ and }b=\big\lVert\beta_{e_{i}(\operatorname{CP}^{*}_{n}),S}-\beta_{e_{j}(\operatorname{CP}^{*}_{n}),S}\big\rVert_{2}>0\right\}.$$

$$r_{n}=\min_{e_{n}\in\mathcal{E}_{n}}\lvert e_{n}\rvert\quad\text{and}\quad\lim_{n\to\infty}r_{n}=\infty,$$

$$0<c\leq\mathbb{E}\Big(\lambda_{\min}\big(\tfrac{1}{\lvert e_{n}\rvert}\mathbf{X}_{e_{n}}^{\top}\mathbf{X}_{e_{n}}\big)^{2k}\Big)\leq\mathbb{E}\Big(\lambda_{\max}\big(\tfrac{1}{\lvert e_{n}\rvert}\mathbf{X}_{e_{n}}^{\top}\mathbf{X}_{e_{n}}\big)^{2k}\Big)\leq C<\infty$$

Taxonomy

Topics: Italy: Economic History and Contemporary Issues · Statistical Methods and Inference

Full text

Invariant Causal Prediction for Sequential Data

Niklas Pfister, Peter Bühlmann and Jonas Peters

Abstract

We investigate the problem of inferring the causal predictors of a response $Y$ from a set of $d$ explanatory variables $(X^{1},\dots,X^{d})$ . Classical ordinary least squares regression includes all predictors that reduce the variance of $Y$ . Using only the causal predictors instead leads to models that have the advantage of remaining invariant under interventions; loosely speaking they lead to invariance across different “environments” or “heterogeneity patterns”. More precisely, the conditional distribution of $Y$ given its causal predictors remains invariant for all observations. Recent work exploits such a stability to infer causal relations from data with different but known environments. We show that even without having knowledge of the environments or heterogeneity pattern, inferring causal relations is possible for time-ordered (or any other type of sequentially ordered) data. In particular, this allows detecting instantaneous causal relations in multivariate linear time series which is usually not the case for Granger causality. Besides novel methodology, we provide statistical confidence bounds and asymptotic detection results for inferring causal predictors, and present an application to monetary policy in macroeconomics.

Keywords: causal structure learning, change point model, Chow statistic, instantaneous causal effects, monetary policy

1 Introduction

Detecting causal relations is a core problem in many scientific fields. Performing controlled randomized intervention experiments can be considered the gold standard for inferring causal relations (e.g., Peirce, 1883; Pearl, 2009; Imbens and Rubin, 2015; Peters et al., 2017). In many situations however, randomization and interventions are unethical, physically impossible or too costly. In addition, many datasets nowadays come from non-designed experiments: the question is then whether one can still infer causal relations. Assuming additional structure, this is indeed possible.

The field of causal structure learning attempts to infer causal relations from data. Many procedures in this field only use observational data (e.g., Spirtes et al., 2000; Chickering, 2002; Shimizu et al., 2006; Janzing et al., 2012; Peters et al., 2014; van de Geer and Bühlmann, 2013; Bühlmann et al., 2014). These methods are either based on strong assumptions (that are in particular violated for heterogeneous data) or they do not output a single causal graph estimate but a set of so-called Markov equivalent graphs. The latter occurs because of severe identifiability problems in general. Having access to interventional data improves identifiability for inferring causal relations and several methods have been proposed that exploit both observational and interventional data (e.g., Eaton and Murphy, 2007; Hauser and Bühlmann, 2012; Peters et al., 2016). Causal discovery, however, is an ambitious task, and most of the above methods only provide point estimates and lack statistical confidence guarantees.

The problem of identifying causal directions is greatly reduced in the time series setting, where the concept of Granger causality (Granger, 1969) plays a prominent role; see also structural vector auto-regressive models (SVAR) that are popular in econometrics (e.g., Lütkepohl, 2005). When excluding instantaneous effects, the time order allows applying regression techniques to infer causal relations between variables. In this paper we consider the more general problem of inferring also the time-instantaneous effects.

We consider a target variable $Y$ and covariates $X^{1},\ldots,X^{d}$ . Instead of reconstructing the full causal structure, we thus try to infer the set $S^{*}\subseteq\{1,\ldots,d\}$ of causal predictors of $Y$ (where the indices in $\{1,\ldots,d\}$ refer to the indices of the variables $X^{1},\ldots,X^{d}$ ). (Footnote: In the context of causal graphical models, it is more common to use the term “causal parents” instead of “causal predictors”. Here, we use the latter in order to emphasize the regression setting.) Our approach comes with the following two advantages: (1) Most importantly, it provides a statistical confidence guarantee: the method outputs an estimate $\hat{S}\subseteq\{1,\ldots,d\}$ of the set of causal predictors such that $\hat{S}\subseteq S^{*}$ with controllable (large) coverage probability $1-\alpha$ ; (2) The method does not need to model the interdependence of the predictor variables $X^{1},\ldots,X^{d}$ , but rather only the dependence of the target $Y$ on the causal variables from $S^{*}$ .

Our approach uses sequential data that are assumed to arise from a mix of observational and interventional settings. Given that type of data, we propose to look for invariant structures, i.e., conditionals that do not change over time. To do so, we do not need to know the nature or location of the intervention regimes. A framework that connects stability (or invariance) to causality has been recently formulated by Peters et al. (2016) under the name invariant causal prediction which we will summarize next. (Although the underlying principles coincide, Peters et al. (2016) do not consider sequential data, but assume knowledge of the location of the different regimes.)

Invariant causal prediction considers the situation where one observes the response and covariates $(Y_{e},X^{1}_{e},\ldots,X^{d}_{e})$ in different environments $e\in\mathcal{E}$ , that is, in each of these environments, we have an i.i.d. data set. The crucial assumption is that the conditional distribution of the response given the variables from $S^{*}$ is the same in all environments: more formally, we have for all $e,f\in\mathcal{E}$ and all $x$ that

$$Y_{e}\mid\big(X^{S^{*}}_{e}=x\big)\;\overset{d}{=}\;Y_{f}\mid\big(X^{S^{*}}_{f}=x\big).\tag{1.1}$$

This assumption is satisfied, for example, if the environments correspond to different intervention settings, that do not contain a direct intervention on the target variable $Y$ (Peters et al., 2016, Proposition 1). Here, we develop invariant causal prediction in an environment-free way, i.e., without knowing the different environments.

Summarizing the above comments, our method is applicable in the following situation. There is a target variable $Y_{t}$ and a set ${S^{*}}$ of causal predictors that satisfies the following property: for all $t$ , $Y_{t}=X_{t}^{S^{*}}\beta+\varepsilon_{t}$ for some Gaussian i.i.d. sequence $\varepsilon_{t}$ with $\varepsilon_{t}\perp\!\!\!\perp X_{t}^{S^{*}}$ (see Assumption 1 in Section 2.1 for details). Our method aims to estimate ${S^{*}}$ . Furthermore, we do not need to assume that the predictors $X_{t}$ are i.i.d. over $t$ as our method utilizes any changes in distribution.

1.1 Contribution and relation to other work

Peters et al. (2016) assume that the environments are known in order to exploit the invariance in (1.1). Without knowledge of the environments the task becomes more difficult. Naively estimating the environments from data and subsequently applying the existing methodology may lead to a loss of the method’s coverage guarantee or yield less powerful results. In particular, this is the case for recovering the environments by clustering (see Remark B.2) or by using change point detection methods (see discussion in Section 3.2.1 and Remark B.3). In contrast, our procedure does not estimate the environments but instead utilizes the existing non-invariances directly. It can thus be seen as a highly non-trivial generalization of invariant causal prediction when the environments are unknown.

From a technical perspective, we provide a new asymptotic analysis for the Chow test (Chow, 1960) for simultaneously testing equality of regression coefficients and homoscedasticity of the residuals. In particular, we show that it has a non-optimal rate for detecting differences of regression coefficients. As an alternative with better rates, we propose using a decoupled version that individually tests regression coefficients and residual variances, and combines them with a Bonferroni correction. Finally, we employ a bootstrap procedure due to Shah and Bühlmann (2017) which allows for efficient multiple testing over many smooth or block-wise time segments of the data.

The proposed causal inference methodology can be directly used for multivariate auto-regressive time series, allowing for detection of instantaneous causal predictors. The notion of “different environments” then translates to non-stationarity; in fact, it is the non-stationary nature of the system which allows for detection of instantaneous causal predictors, whereas a stationary process would not provide the required “perturbations” to identify causality. Our work is therefore different from the celebrated concept of Granger causality for non-instantaneous causal relations (e.g., Granger, 1969; Lütkepohl, 2005). Other methods that are able to identify instantaneous effects (e.g., Chu and Glymour, 2008; Hyvärinen et al., 2008; Peters et al., 2013) often require nonlinearities or non-Gaussianity. Additionally, they need to model the full network and do not come with any notion of causal significance. Another line of work starts from high-frequency data sampled from a Granger model without instantaneous effects and tries to infer the instantaneous effects appearing in low-frequency sub-sampled data (e.g., Gong et al., 2015; Tank et al., 2017). A further related area of research extends non-instantaneous effects models by allowing for time-dependent parameters (e.g., Talih and Hengartner, 2005; Siracusa and Fisher III, 2009). These methods usually work with stationary data, model the full causal system rather than one target variable, and do not come with significance statements on their causal findings.

The paper is structured as follows. Section 2 introduces invariant causal prediction in an environment-free way. Our method and the confidence guarantees for detecting causal predictors are described in Section 3. We establish consistency and detection rates of this method in Section 4. Algorithmic details are given in Section 5, with programming code available online as an R-package. In Section 6, we extend the framework to multivariate time series data, and Section 7 reports on numerical experiments and an application in macroeconomics for monetary policy.

2 Invariant causal prediction

Throughout this work, we assume that we are given data from a sequence $(Y_{t},X_{t})_{t\in\{1,\dots,n\}}$ , where $X_{t}\in\mathbb{R}^{1\times d}$ contains predictor variables and $Y_{t}\in\mathbb{R}$ is a target variable of interest. Moreover, let $\mathbf{Y}\coloneqq(Y_{1},\dots,Y_{n})^{\top}\in\mathbb{R}^{n\times 1}$ and $\mathbf{X}\coloneqq(X_{1}^{\top},\dots,X_{n}^{\top})^{\top}\in\mathbb{R}^{n\times d}$ denote the corresponding matrix quantities. We are interested in settings where the experimental conditions are allowed to change over time, as long as the structural dependence (predictor set and regression parameters) of $Y_{t}$ on $X_{t}$ remains fixed, which is the corresponding environment-free version of the invariance assumption given in (1.1). Ideally, we would like to make direct use of this assumption for structural inference. However, in order to have a reasonable amount of power to test such an assumption based on a finite sample it is useful to describe the dependence of $Y$ on its parents by a parametric function class. In this paper we focus on linear Gaussian models but our ideas potentially also extend to more complicated models.

2.1 Structural invariance

In this subsection we formalize the fixed structural dependence (predictor set and regression parameters) of $Y_{t}$ on $X_{t}$ to be linear Gaussian. We denote by $(\mathbf{Y},\mathbf{X})=(Y_{t},X_{t})_{t\in\{1,\dots,n\}}\in\mathbb{R}^{n\times(d+1)}$ the random vectors corresponding to the entire data and for any set $S\subseteq\{1,\dots,d\}$ , the vector $X^{S}\in\mathbb{R}^{1\times\lvert S\rvert}$ contains only the variables $\{X^{k};k\in S\}$ . We make the following definition.

Definition 2.1 (invariant set $S$ ).

A set $S\subseteq\{1,\dots,d\}$ is called invariant with respect to $(\mathbf{Y},\mathbf{X})$ if there exist parameters $\mu\in\mathbb{R}$ , $\beta\in(\mathbb{R}\setminus\{0\})^{\lvert S\rvert\times 1}$ and $\sigma\in\mathbb{R}_{>0}$ such that

(a) $\forall t\in\{1,\dots,n\}:\quad Y_{t}=\mu+X_{t}^{S}\beta+\varepsilon_{t}\text{ and }\varepsilon_{t}\perp\!\!\!\perp X_{t}^{S}$ ,

(b) $\varepsilon_{1},\dots,\varepsilon_{n}\overset{\text{\tiny iid}}{\sim}\mathcal{N}(0,\sigma^{2})$ .

Throughout the paper, the symbol $\perp\!\!\!\perp$ denotes independence and we neglect the intercept term $\mu$ as it can be added without loss of generality by including a constant term in $X$ . For an invariant set $S$ we thus have that conditionally, $Y_{t}\mid X_{t}^{S}$ ( $t=1,\dots,n$ ) are i.i.d. Gaussian random variables. It is crucial to observe that Definition 2.1 makes no restrictions on the distribution of the process $(X_{t})_{t\in\{1,\dots,n\}}$ , and also the distribution of $(\mathbf{Y},\mathbf{X})$ can be quite general. In particular, this allows for time dependencies and arbitrary changes in the distribution of $X_{t}$ . In Remark B.1 we discuss a potential extension that allows weakening the Gaussian linear assumption.
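As an illustration of Definition 2.1, the following Python sketch simulates a sequence in which the distribution of the predictors changes halfway through, while the linear Gaussian dependence of $Y_{t}$ on $X_{t}^{S}$ stays fixed for $S=\{1\}$. Everything specific in it is an assumption chosen for illustration and does not come from the paper: the helper `half_slopes`, the coefficient 1.5, the change point at $n/2$, and the second predictor, which is built as a noisy copy of $Y$ so that $S=\{2\}$ is not invariant.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
half = n // 2

# Predictor X^1: its mean and variance change at an (illustrative) change point.
x1 = np.concatenate([rng.normal(0.0, 1.0, half), rng.normal(2.0, 3.0, half)])

# Target: fixed linear Gaussian dependence on X^1, so S = {1} is invariant.
eps = rng.normal(0.0, 1.0, n)
y = 1.5 * x1 + eps

# Predictor X^2: a noisy copy of Y whose noise level changes over time,
# so the regression of Y on X^2 is not the same in both halves.
x2 = y + np.concatenate([rng.normal(0.0, 0.5, half), rng.normal(0.0, 3.0, half)])


def half_slopes(x, y, half):
    """OLS slope (with intercept) of y on x, fitted separately on each half."""
    first = np.polyfit(x[:half], y[:half], 1)[0]
    second = np.polyfit(x[half:], y[half:], 1)[0]
    return first, second


slopes_s1 = half_slopes(x1, y, half)
slopes_s2 = half_slopes(x2, y, half)
print("S = {1}:", slopes_s1)
print("S = {2}:", slopes_s2)
```

Under these settings the two slopes for $S=\{1\}$ should both be close to 1.5, whereas the slopes for $S=\{2\}$ should differ noticeably between the halves. Comparing slopes alone is a simplification: Definition 2.1 also requires the noise variance to be constant over time.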

Based on Definition 2.1, we can formalize an invariance assumption similar to Peters et al. (2016, Assumption 1), by requiring the existence of an invariant set $S^{*}$ .

Assumption 1 (structural invariance).

There exists a set $S^{*}\subseteq\{1,\dots,d\}$ which is invariant with respect to $(\mathbf{Y},\mathbf{X})$ .

The set $S^{*}$ can be seen as a set of predictor variables which shields off $Y$ from any interventions on the system other than interventions on $Y$ itself. In the setting of heterogeneous data, the set $S^{*}$ then corresponds to the set of predictors that can be safely included into a prediction model which works at all time points $t\in\{1,\dots,n\}$ .

2.2 Invariant prediction and coverage

In this section we recall some definitions and results related to invariant causal prediction from Peters et al. (2016). Assumption 1 enforces the existence of a set $S^{*}$ such that the structural dependence (predictor set and regression parameters) between $Y_{t}$ and $X_{t}^{S^{*}}$ remains fixed. Our goal is to estimate the set $S^{*}$ based on the observed data $(\mathbf{Y},\mathbf{X})$ . We build this estimate by taking the intersection of all sets $S\subseteq\{1,\dots,d\}$ which are invariant with respect to $(\mathbf{Y},\mathbf{X})$ , i.e., we consider all such sets $S$ and test the following null hypothesis

$$H_{0,S}:\quad S\text{ is an invariant set with respect to }(\mathbf{Y},\mathbf{X}).\tag{2.1}$$

Based on this null hypothesis we define the set of plausible causal predictors

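$$\tilde{S}\coloneqq\bigcap_{S\subseteq\{1,\dots,d\}:\,H_{0,S}\text{ is true}}S\;\subseteq\;S^{*},\tag{2.2}$$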

where we define the intersection over an empty index set as the empty set. The name plausible causal predictors is motivated by the underlying causal interpretation explained in Section 2.3. The property that $\tilde{S}$ is contained in $S^{*}$ follows immediately from Assumption 1. In general, this containment is strict since there could be several sets other than $S^{*}$ which satisfy the invariance condition across the considered interventions. Hence, if we change the interventions in such a way that the number of invariant sets decreases this leads to an increase in the size of the set $\tilde{S}$ . Intuitively, an increase in interventions results in an increase in the set $\tilde{S}$ .

Empirically, we can use this to construct an estimator based on an arbitrary family of hypothesis tests $\varphi=(\varphi_{S})_{S\subseteq\{1,\dots,d\}}$ , where $\varphi_{S}$ is the decision rule that either rejects $H_{0,S}$ ( $\varphi_{S}=1$ ) or accepts $H_{0,S}$ ( $\varphi_{S}=0$ ). We then estimate $\tilde{S}$ using the family of tests $\varphi$ by

$$\hat{S}(\varphi)\coloneqq\bigcap_{S\subseteq\{1,\dots,d\}:\,\varphi_{S}\text{ accepts }H_{0,S}}S.$$

It is obvious that this estimator has the following coverage property, given that the hypothesis tests achieve correct level.

Proposition 2.2 (coverage property (Peters et al., 2016, Theorem 1)).

Assume Assumption 1 and let $\varphi=(\varphi_{S})_{S\subseteq\{1,\dots,d\}}$ be a family of hypothesis tests for the null hypotheses $(H_{0,S})_{S\subseteq\{1,\dots,d\}}$ which achieve level $\alpha\in(0,1)$ . Then, $\mathbb{P}\left(\hat{S}(\varphi)\subseteq S^{*}\right)\geq 1-\alpha$ .
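The estimator $\hat{S}(\varphi)$ is simply the intersection of all subsets whose null hypothesis is accepted, and can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function name `estimate_plausible_predictors` and the callable `accepts` (a stand-in for any family of level-$\alpha$ tests $\varphi_{S}$) are hypothetical, and the toy oracle at the end is not a statistical test.

```python
from itertools import combinations
from typing import Callable, FrozenSet


def estimate_plausible_predictors(
    d: int, accepts: Callable[[FrozenSet[int]], bool]
) -> FrozenSet[int]:
    """Intersect all subsets S of {1, ..., d} for which the test accepts H_{0,S}."""
    accepted = [
        frozenset(subset)
        for size in range(d + 1)
        for subset in combinations(range(1, d + 1), size)
        if accepts(frozenset(subset))
    ]
    if not accepted:
        # The intersection over an empty index set is defined as the empty set.
        return frozenset()
    return frozenset.intersection(*accepted)


# Toy oracle (an assumption for illustration): accept exactly the sets containing 1.
def toy_accepts(subset: FrozenSet[int]) -> bool:
    return 1 in subset


print(sorted(estimate_plausible_predictors(3, toy_accepts)))  # [1]
```

The loop visits all $2^{d}$ subsets, so this direct enumeration is only practical for a moderate number of predictors.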

In the following paragraph we compare our framework with that of Peters et al. (2016). While most parts are similar, the main difference is that we have phrased our framework without the use of environments. Peters et al. (2016) is then contained as a special case, more precisely their Equation (10) is contained within our null hypothesis (2.1). Our more general formulation comes with three benefits: (1) It allows a mathematically rigorous treatment of data that are generated by systems which change between every observation (while satisfying Assumption 1). (2) It allows constructing tests for a wider class of alternative hypotheses (e.g., smoothly changing systems), while at the same time justifying any test based on environments (e.g., Peters et al., 2016, Method I and Method II). (3) In particular, it allows for a more straightforward justification of procedures for pooling environments, see Peters et al. (2016, discussion by Richardson and Robins on page 1003).

2.3 Relation to causality and discussion of assumptions

An insightful interpretation of the set $S^{*}$ is given in the context of causality. Under certain assumptions, the set $S^{*}$ corresponds to the set of direct causes (or parents) of the target variable $Y$ . This is best understood in the framework of structural causal models (SCMs) (e.g., Bollen, 1989; Pearl, 2009).

Definition 2.3 (structural causal models).

A vector of variables $(X^{0},X^{1},\dots,X^{d})\in\mathbb{R}^{d+1}$ ( $X^{0}$ plays later the role of the target variable $Y$ ) is said to satisfy a structural causal model (SCM) if for all $j\in\{0,1,\dots,d\}$ there exist functions $f^{j}:\mathbb{R}^{\lvert\operatorname{PA}(j)\rvert+1}\rightarrow\mathbb{R}$ and jointly independent noise variables $\varepsilon^{j}$ satisfying

$$X^{j}\leftarrow f^{j}(X^{\operatorname{PA}(j)},\varepsilon^{j}),$$

where the sets $\operatorname{PA}(j)$ denote the parents of the variable $X^{j}$ in the directed (possibly cyclic) graph corresponding to the structure of the SCM.

Recall that our procedure can be applied to any data generating process $(\mathbf{Y},\mathbf{X})$ which satisfies the structural invariance (Assumption 1). In the following example, we define a class of causal models, which satisfies this invariance and use it as an illustration of the types of assumptions necessary to fit to our framework.

Example 2.4 (SCM with linear Gaussian target).

Assume that the data $(\mathbf{Y},\mathbf{X})$ is generated in the following way. For all $t\in\{1,\dots,n\}$ the variables $(Y_{t},X_{t})=(Y_{t}=X^{0}_{t},X^{1}_{t},\dots,X^{d}_{t})$ are generated by potentially different SCMs such that the structural assignment of $Y$ is linear Gaussian, does not depend on $Y$ , i.e., there is no direct feedback loop, and is fixed across all time points. That is, there exists $\sigma\in(0,\infty)$ and $\beta\in\mathbb{R}^{|\operatorname{PA}(0)|}$ such that for all $t\in\{1,\dots,n\}$ it holds that $Y_{t}\leftarrow X^{\operatorname{PA}(0)}_{t}\beta+\varepsilon^{0}_{t}$ with $\varepsilon^{0}_{t}\sim\mathcal{N}(0,\sigma^{2})$ , and $0\notin\operatorname{AN}(0)$ , where $\operatorname{AN}(0)$ denotes the ancestors and $\operatorname{PA}(0)$ the parents of $Y$ . Furthermore, assume that for all $t\in\{1,\dots,n\}$ it holds that

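$$\varepsilon^{0}_{t}\perp\!\!\!\perp\{\varepsilon^{j}_{s}\mid\forall j\in\operatorname{AN}(0),\,s\in\{1,\dots,n\}\}.$$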

Then, similar to Peters et al. (2016, Proposition 1), Assumption 1 is satisfied for the parents of $Y$ , namely $S^{*}=\operatorname{PA}(0)$ . In particular, this motivates the term plausible causal predictors used in (2.2).

This causal model allows for any type of intervention on the predictor variables $(X^{1},\dots,X^{d})$ at any time. In particular, this means we do not need to worry about what or when changes occur on the predictors. In contrast, the restriction that the structural assignment as well as the noise of $Y$ are not allowed to change across time implies that no direct interventions on $Y$ are permitted. This is a reasonable assumption, for example, if $Y$ is a phenotype and the predictors are gene activities, measured in a time-course experiment, or if $Y$ is a macroeconomic indicator and the set of predictors contains all possible variables which could be used to intervene on $Y$ (see our monetary policy example in Section 7.2).
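The setting of Example 2.4 can be mimicked by a small simulation in which the structural assignments of the predictors change over time while the assignment of $Y$ stays fixed. The Python sketch below is an assumption-laden illustration, not taken from the paper: the graph ($X^{1}\to X^{2}$, $\{X^{1},X^{2}\}\to Y\to X^{3}$), all coefficients, the three regimes for $X^{1}$, and the function name `sample_sequence` are hypothetical.

```python
import numpy as np


def sample_sequence(n, rng):
    """Draw (Y_t, X_t^1, X_t^2, X_t^3) from SCMs that change with t.

    Only the assignment of Y is fixed: Y <- 1.0 * X^1 - 0.5 * X^2 + eps^0.
    """
    x = np.empty((n, 3))
    y = np.empty(n)
    for t in range(n):
        if t < n // 3:
            x1 = rng.normal(0.0, 1.0)   # observational regime
        elif t < 2 * n // 3:
            x1 = 5.0                    # hard intervention do(X^1 := 5)
        else:
            x1 = rng.normal(-2.0, 4.0)  # shifted noise distribution
        x2 = 0.8 * x1 + rng.normal(0.0, 1.0)
        # Fixed linear Gaussian assignment of the target.
        y[t] = 1.0 * x1 - 0.5 * x2 + rng.normal(0.0, 1.0)
        # X^3 is a child of Y whose noise level changes over time.
        noise_sd = 0.5 if t < n // 2 else 2.0
        x3 = y[t] + rng.normal(0.0, noise_sd)
        x[t] = (x1, x2, x3)
    return y, x


rng = np.random.default_rng(1)
n = 3000
y, x = sample_sequence(n, rng)

# Residuals of Y given its parents {1, 2}, using the true coefficients.
resid = y - (1.0 * x[:, 0] - 0.5 * x[:, 1])
block_sds = [resid[idx].std() for idx in np.array_split(np.arange(n), 3)]
print(block_sds)
```

In this sketch the residual standard deviation should stay close to 1 in every block, reflecting that the parents of $Y$ form an invariant set even though the predictors are intervened on at times that the analysis does not need to know.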

A feature of the invariant causal prediction procedure is that we expect it to be conservative with respect to violations of its assumptions. For example, if there was a direct intervention on $Y$ , it is expected that all sets are rejected, which results in the empty set. Conversely, the procedure also remains conservative in the absence of interventions on the predictors, as the empty set remains invariant in that case. The resulting output would therefore be correct (although uninformative), as expected by the coverage property in Proposition 2.2.

In causal network discovery, the goal is often to infer the entire causal graph. Our framework is different in this respect, as it aims to only infer the parents of a single target variable. This comes with the advantage that we only need to locally infer the dependence from $Y$ (one node in the graph) on its causal predictors. It further allows us to use heterogeneous data without worrying about the types of interventions it may contain, except that they are assumed to not directly affect the target variable $Y$ . The latter is the reason why we cannot simply apply the methodology to the full network: we need interventions to obtain informative answers, but these interventions are not allowed to act directly on the target variable $Y$ . Thus, we cannot use each variable in the graph as a target. However, when having the additional information on the time of the interventions and which variables they directly affect, our method can be iteratively applied to each variable (i.e., node) in the graph separately by removing all observations belonging to interventions on that specific variable; hence allowing us to recover the entire causal structure.

Hidden variables.

The invariant causal prediction framework used here is also robust to the presence of hidden variables. For example, any hidden variables that are not direct parents of the target $Y$ are permitted. More generally, it can be shown that for settings with arbitrary constellations of hidden variables (allowing for hidden confounding between $Y$ and the predictors) and given suitable assumptions (including a faithfulness assumption on the underlying causal graph) the plausible causal set estimator still satisfies the following (slightly weaker) coverage property

[TABLE]

where $\operatorname{AN}(Y)$ are the ancestors of $Y$. The precise result is taken from Proposition 5 in Peters et al. (2016). Selected ancestor variables can be interpreted as true rather than false positives. Thus, this result establishes a useful robustness property: the price to be paid for allowing hidden variables is a loss of detection power.

3 Tests for $H_{0,S}$ based on scaled residuals

In this section, we construct a general class of tests for $H_{0,S}$ , based on an exact resampling procedure. These tests rely on the linear Gaussian model that is assumed to exist for the invariant set $S^{*}$ given in Assumption 1.

3.1 Scaled residual tests

Consider a fixed invariant set $S\subseteq\{1,\dots,d\}$ . As a first step, observe that by reformulating Definition 2.1 in matrix notation it holds that

[TABLE]

where $\boldsymbol{\varepsilon}\coloneqq(\varepsilon_{1},\dots,\varepsilon_{n})$ . Whenever the set $S$ is not invariant, the dependence of $\mathbf{Y}$ on $\mathbf{X}^{S}$ is not given by the same linear function across all time points. We can therefore construct a test for $H_{0,S}$ by performing a goodness of fit test of the linear Gaussian model

[TABLE]

This motivates a two-step procedure: (1) use linear regression to fit a linear Gaussian model, (2) test whether the residuals are i.i.d. Gaussian distributed. Shah and Bühlmann (2017) give a general methodology for dealing with such tests. We adapt their method to our setting and notation and consider several specific choices of tests which apply to our problem. Define the projection matrix $\mathbf{P}^{S}_{\mathbf{X}}\coloneqq\mathbf{X}^{S}\left((\mathbf{X}^{S})^{\top}\mathbf{X}^{S}\right)^{-1}(\mathbf{X}^{S})^{\top}$ . Then the residuals resulting from an OLS-fit of the model (3.2) are given by $\mathbf{R}^{S}\coloneqq(\mathbf{Id}-\mathbf{P}^{S}_{\mathbf{X}})\mathbf{Y}$ . Furthermore, assuming model (3.2) is true, i.e., $H_{0,S}$ holds, the scaled residuals $\widetilde{\mathbf{R}}^{S}\coloneqq\mathbf{R}^{S}/\lVert\mathbf{R}^{S}\rVert_{2}$ can be expressed as

[TABLE]

where $\tilde{\boldsymbol{\varepsilon}}\coloneqq\boldsymbol{\varepsilon}/\lVert\boldsymbol{\varepsilon}\rVert_{2}$ is the scaled noise. Given that model (3.2) is true, one can thus sample from the distribution of $\widetilde{\mathbf{R}}^{S}\mid\mathbf{X}=\mathbf{x}$ by a resampling procedure: since the scaled residuals do not depend on the scale of the noise, it suffices to draw the noise from $\mathcal{N}(\mathbf{0},\mathbf{Id})$ and then project and normalize. More formally, assume we are given a measurable function $T:\mathbb{R}^{n}\rightarrow\mathbb{R}$ and let $B\in\mathbb{N}$ be the number of simulations. Then, we can use a resampling approach to construct a sequence of cut-off functions $c_{T,B}:\mathbb{R}^{n\times d}\rightarrow\mathbb{R}$ such that the sequence of hypothesis tests $(\varphi^{S}_{T,B})_{B\in\mathbb{N}}$ defined for all $B\in\mathbb{N}$ by

[TABLE]

achieves correct asymptotic level as $B$ goes to infinity. To see this, fix a significance level $\alpha\in(0,1)$ , let $\tilde{\boldsymbol{\varepsilon}}_{1},\tilde{\boldsymbol{\varepsilon}}_{2},\dots\overset{\text{\tiny iid}}{\sim}\mathcal{N}(\mathbf{0},\mathbf{Id})$ and for all $\mathbf{x}\in\mathbb{R}^{n\times d}$ and for all $b\in\mathbb{N}$ define the random variables $\widetilde{\mathbf{R}}_{b}^{S,\mathbf{x}}\coloneqq{(\mathbf{Id}-\mathbf{P}^{S}_{\mathbf{x}})\tilde{\boldsymbol{\varepsilon}}_{b}}/{\lVert(\mathbf{Id}-\mathbf{P}^{S}_{\mathbf{x}})\tilde{\boldsymbol{\varepsilon}}_{b}\rVert_{2}}$ , which are i.i.d. copies of $\widetilde{\mathbf{R}}^{S}\mid\mathbf{X}=\mathbf{x}$ . Moreover, for all $\mathbf{x}\in\mathbb{R}^{n\times d}$ define

[TABLE]

Based on the convergence of the empirical quantiles to the population quantiles, we get that the test in (3.4) has the following level guarantee.

Proposition 3.1 (level of the scaled residual test).

For all measurable $T:\mathbb{R}^{n}\rightarrow\mathbb{R}$ , based on the scaled residuals $\widetilde{\mathbf{R}}^{S}$ , and for all $(\mathbf{Y},\mathbf{X})$ with $\mathbb{P}^{(\mathbf{Y},\mathbf{X})}\in H_{0,S}$ , the hypothesis test $\varphi^{S}_{T,B}$ defined in (3.4) achieves level $\alpha$ as $B\rightarrow\infty$ , i.e.,

[TABLE]

More details on the proof can be found in Appendix C.5, where we prove a more general result including time lags. For any measurable test statistic $T$ we have, therefore, constructed a hypothesis test which achieves correct asymptotic level for testing the null hypothesis $H_{0,S}$ . It is clear that the power properties of such a test depend on the alternative and the form of the function $T$ . In the next section, we give specific choices of $T$ that allow us to detect alternatives for which $S$ is not an invariant set.
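The resampling construction above can be illustrated with a minimal sketch in plain Python. It is not the seqICP implementation: the function names, the use of Gram-Schmidt for the projection, the empirical $(1-\alpha)$-quantile as cut-off, and the example statistic `half_mean_gap` are all assumptions made for illustration.

```python
import math
import random

def orthonormal_basis(columns):
    """Gram-Schmidt basis for the span of the design columns (assumed independent)."""
    basis = []
    for col in columns:
        v = list(col)
        for q in basis:
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        basis.append([a / norm for a in v])
    return basis

def scaled_residuals(y, basis):
    """Compute (Id - P) y and rescale it to unit Euclidean norm."""
    r = list(y)
    for q in basis:
        dot = sum(a * b for a, b in zip(r, q))
        r = [a - dot * b for a, b in zip(r, q)]
    norm = math.sqrt(sum(a * a for a in r))
    return [a / norm for a in r]

def resampling_test(y, columns, statistic, B=500, alpha=0.05, seed=0):
    """Return (reject, observed, cutoff); the cut-off is an empirical quantile (assumption)."""
    rng = random.Random(seed)
    basis = orthonormal_basis(columns)
    observed = abs(statistic(scaled_residuals(y, basis)))
    null_draws = []
    for _ in range(B):
        # Standard Gaussian noise, projected and normalized like the data residuals.
        eps = [rng.gauss(0.0, 1.0) for _ in y]
        null_draws.append(abs(statistic(scaled_residuals(eps, basis))))
    null_draws.sort()
    cutoff = null_draws[min(B - 1, math.ceil((1 - alpha) * B) - 1)]
    return observed > cutoff, observed, cutoff

def half_mean_gap(r):
    """Hypothetical statistic: mean of the first half minus mean of the second half."""
    h = len(r) // 2
    return sum(r[:h]) / h - sum(r[h:]) / (len(r) - h)
```

Here `columns` would hold the intercept column and the predictors in $S$; any other function of the scaled residuals can be passed as `statistic` without changing the resampling step.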

3.2 Choosing test statistics

Recall that the invariance of a set $S$ (see Definition 2.1) corresponds to a time invariance of the conditional distribution, i.e.

[TABLE]

Violations of this invariance include conditional distributions that change at some, many or even at every time point, see Figure 1 for three examples.

These can, e.g., be generated by noise interventions in SCMs, i.e., interventions that change the mean or variance of the noise for the predictor variables. However, not all types of interventions are necessarily captured as a time dependence in the residual distribution alone.

Example 3.2 (non-detectable structure changes in residual distribution vs time).

Assume we are given data from the following generative model

[TABLE]

with $X_{t},\varepsilon_{t}\overset{\text{\tiny iid}}{\sim}\mathcal{N}(0,1)$ , $\beta_{t}=1$ for $t\in\{1,\dots,100\}$ and $\beta_{t}=-1$ for $t\in\{101,\dots,200\}$ . Then, the regression parameter resulting from a pooled ordinary least squares regression is given by $\hat{\beta}_{\operatorname{OLS}}\approx 0$ . In particular, this implies that it is impossible to detect the structure change in a residuals versus time plot. Instead, one can group the data into the environments $e_{1}=\{1,\dots,100\}$ and $e_{2}=\{101,\dots,200\}$ and then consider the residuals versus the predictors on each environment individually. The result is that on $e_{1}$ the residuals increase in the predictor with slope one and on $e_{2}$ they decrease with slope one, which clearly contradicts the invariance assumption. This shows that some violations are only detectable in the pooled residuals if we additionally use information contained in the ordering of the predictors rather than only using the time ordering.
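The effect described in Example 3.2 can be reproduced with a short simulation. The sketch below assumes the generative model $Y_{t}=\beta_{t}X_{t}+\varepsilon_{t}$ without intercept; the helper names and the seed are hypothetical choices.

```python
import random

def ols_slope(x, y):
    """Least-squares slope of y on x for a fit through the origin."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def simulate_example(n_per_block=100, seed=1):
    """Draw (x, y) with beta = 1 on the first block and beta = -1 on the second."""
    rng = random.Random(seed)
    n = 2 * n_per_block
    x = [rng.gauss(0, 1) for _ in range(n)]
    beta = [1.0] * n_per_block + [-1.0] * n_per_block
    y = [b * xi + rng.gauss(0, 1) for b, xi in zip(beta, x)]
    return x, y

x, y = simulate_example()
beta_pooled = ols_slope(x, y)                      # close to zero
residuals = [yi - beta_pooled * xi for xi, yi in zip(x, y)]
slope_e1 = ols_slope(x[:100], residuals[:100])     # roughly +1
slope_e2 = ols_slope(x[100:], residuals[100:])     # roughly -1
```

The pooled residuals carry no visible time pattern in their mean, while the per-environment slopes of the residuals on the predictor have opposite signs.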

The example illustrates that certain types of interventions (in particular if they change the structure) can lead to alternatives that are hard (or even impossible) to detect by looking only for changes in the distribution of the residuals from a pooled regression across time. This implies certain types of violations are only detected by also considering information from predictors, for example, by checking that the regression coefficients remain constant. In the following two subsections we consider specific types of alternative hypotheses and discuss which types of test statistics can be used to detect them. The choice of the test statistic only affects the power of our method, meaning that any of the following test statistics will result in tests which control the type I error as described in Section 3.1.

3.2.1 Change point alternatives

Throughout this subsection, we want to construct test statistics that focus power on detecting deviations from invariance where the interventions occur in a block-wise manner, i.e., non-invariances occur at specific change points. As described in the methodology in the previous sections, we are not interested in estimating the change points from data. To be more precise, we now introduce some notation related to change point models. Consider tuples of the form $\operatorname{CP}=(t_{1},\dots,t_{L})\in\{1,\dots,n-1\}^{L}$ satisfying $t_{i}<t_{j}$ for all $i<j$ . For every such tuple $\operatorname{CP}$ define for all $i\in\{1,\dots,L+1\}$ the following block-wise environments

[TABLE]

Moreover, denote by $\mathcal{E}(\operatorname{CP})\coloneqq\{e_{1}(\operatorname{CP}),\dots,e_{L+1}(\operatorname{CP})\}$ the collection of the $L+1$ environments. We will drop the tuple $\operatorname{CP}$ in the notation whenever it is clear from the context. We consider models described by a fixed set of change points at which changes in the experimental conditions can occur. The underlying change point model can then be specified by the existence of a fixed (unknown) tuple of change points $\operatorname{CP}^{*}=(t_{1}^{*},\dots,t_{L^{*}}^{*})$ such that for all environments $e\in\mathcal{E}(\operatorname{CP}^{*})$ it holds that

[TABLE]

where $F_{e}$ are fixed distributions depending only on the environments.
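As a small illustration of the block-wise environments, the following sketch builds them from a tuple of change points. It assumes that each change point $t_{i}$ is the last index of block $i$ (with the conventions $t_{0}=0$ and $t_{L+1}=n$); the function name is hypothetical.

```python
def block_environments(change_points, n):
    """Split {1, ..., n} into L + 1 consecutive blocks ending at the change points."""
    cps = list(change_points)
    if any(not 1 <= t <= n - 1 for t in cps):
        raise ValueError("change points must lie in {1, ..., n-1}")
    if any(a >= b for a, b in zip(cps, cps[1:])):
        raise ValueError("change points must be strictly increasing")
    bounds = [0] + cps + [n]
    return [list(range(lo + 1, hi + 1)) for lo, hi in zip(bounds, bounds[1:])]
```

For instance, `block_environments((3, 7), 10)` yields the three blocks $\{1,2,3\}$, $\{4,\dots,7\}$ and $\{8,9,10\}$.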

Given the true collection of change points $\operatorname{CP}^{*}$ , this reduces to the original ideas of invariant causal prediction introduced by Peters et al. (2016). Here, we are interested in the case when the change points are unknown and we no longer have the correct environments. A naive approach would be to use an existing change point detection method and plug the estimated segments into the invariant causal prediction (ICP) method from Peters et al. (2016). As discussed in Remark B.2, however, a change point detection method is only allowed to be used on data from the response variable $Y$ , since otherwise the coverage property of the procedure could be destroyed. This implies a major restriction: an example in Remark B.3 shows that changes in the covariates $X$ might be non-detectable in the response variable $Y$ and thus, any change point method applied to the response variable $Y$ might break down. Here, we instead propose a procedure which exploits changing structures among all the variables and directly optimizes the power to detect (non-)invariances. This is done by simultaneously testing for (non-)invariances over all potential environments based on a grid of potential change points, and encoding this multiple testing problem in the test statistic. Our resampling approach (see Section 3) ensures that the generally strong dependencies between these tests are taken into account and one only pays a small price for the somewhat higher degree of multiple testing adjustment.

Our goal is to construct test statistics $T=T(\widetilde{\mathbf{R}}^{S})$ , based on scaled residuals from a regression on the covariates $S$ which are capable of capturing potential violations in model (3.2) that can occur due to the underlying change points $\operatorname{CP}^{*}$ . Essentially, this means that $\lvert T(\widetilde{\mathbf{R}}^{S})\rvert$ should be small whenever the model (3.2) is true and large whenever it is false. Violations in the invariance occur due to differences in the structural form of model (3.2) between two different environments $e,f\in\mathcal{E}(\operatorname{CP}^{*})$ . Therefore, the idea is to take a collection of environments $\mathcal{E}\subseteq\mathcal{P}(\{1,\dots,n\})$ that makes use of the block-wise structure of the data and then combine all pairwise comparisons between these environments. To be more precise, for all $e,f\in\mathcal{E}$ we construct several test statistics $T_{e,f}^{i}$ which detect differences between the environments $e$ and $f$ . We combine them to single test statistics either by

[TABLE]

where $\mathcal{F}\subseteq\mathcal{E}\times\mathcal{E}$ . Details on how to construct the collections of environments $\mathcal{E}\subseteq\mathcal{P}(\{1,\dots,n\})$ and the corresponding collection $\mathcal{F}$ are given in Section 5.1. For the theory part we consider only

[TABLE]

The test statistics $T_{e,f}$ should be capable of detecting differences between two environments. In the following we consider two options: (1) test statistics that perform a regression step in order to incorporate information from the predictors, which are then capable (at least in the large sample limit) of detecting any violation of the invariance; (2) test statistics that only check for changes in the pooled residual distribution, which have the advantage of being computationally faster but are not capable of detecting all violations (see Example 3.2).

Detecting block-wise shifts in the regression of the scaled residuals on predictors

In the following we construct test statistics which are capable of detecting the following two types of violations of model (3.2) that can arise from an underlying change point model:

(i)

difference in the regression coefficients: $\beta_{e,S}\neq\beta_{f,S}$

(ii)

difference in the noise variance: $\sigma_{e,S}^{2}\neq\sigma_{f,S}^{2}$

where $\beta_{e,S}$ , $\beta_{f,S}$ , $\sigma_{e,S}$ and $\sigma_{f,S}$ are population least squares regression coefficients and residual variances on the environments $e,f\in\mathcal{E}(\operatorname{CP}^{*})$ when regressing $\mathbf{Y}$ on $\mathbf{X}^{S}$ restricted to environment $e$ and $f$ respectively. Both of these violations can be detected by regressing the scaled residuals $\widetilde{\mathbf{R}}^{S}$ on $\mathbf{X}^{S}$ for each of the two environments $e$ and $f$ . To this end, define for all possible environments $h\subseteq\{1,\dots,n\}$ the regression coefficient and biased sample variance of the scaled residuals regressed on $\mathbf{X}^{S}_{h}$ by

[TABLE]

respectively, where $\widetilde{\mathbf{R}}^{S}_{h}$ is the restriction of $\widetilde{\mathbf{R}}^{S}$ to environment $h$ . The idea is that both of the above violations (i) and (ii) lead to differences between the regressions of at least two environments $e,f\in\mathcal{E}(\operatorname{CP}^{*})$ . It is possible to test for either of the two violations individually using the test statistics

[TABLE]

for differences in the regression coefficients and for differences in the variance of the noise, respectively. The two resulting hypothesis tests can then be combined with a Bonferroni correction, which we refer to as decoupled test throughout this paper. A further option is to test for both potential violations simultaneously by using a test statistic similar to the Chow test (Chow, 1960) given by

[TABLE]

For the remainder of this paper we refer to the test based on $T_{e,f}^{3}$ as the combined test. Unlike the Chow test, we do not normalize the denominator, which means in particular that $T_{e,f}^{3}$ does not follow an F-distribution. Since we use an exact resampling approach, however, this is not necessary here.
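To make the regression-based comparison concrete, the sketch below computes the per-environment regression coefficient and biased residual variance of the scaled residuals for a single predictor without intercept. The two comparison functions are illustrative stand-ins, assumed here as an absolute coefficient difference and a variance ratio minus one; they are not claimed to match the exact form of the paper's statistics.

```python
def env_fit(x, r, env):
    """Slope and biased residual variance of r regressed on x within env (0-based indices)."""
    xs = [x[t] for t in env]
    rs = [r[t] for t in env]
    gamma = sum(a * b for a, b in zip(xs, rs)) / sum(a * a for a in xs)
    var = sum((b - gamma * a) ** 2 for a, b in zip(xs, rs)) / len(env)
    return gamma, var

def coefficient_shift(x, r, e, f):
    """Assumed stand-in: absolute difference of the two regression coefficients."""
    return abs(env_fit(x, r, e)[0] - env_fit(x, r, f)[0])

def variance_shift(x, r, e, f):
    """Assumed stand-in: deviation of the residual variance ratio from one."""
    return abs(env_fit(x, r, e)[1] / env_fit(x, r, f)[1] - 1.0)
```

Both functions take the scaled residuals `r`, the predictor `x`, and two index sets `e` and `f`, so they can be evaluated on every pair in a comparison set.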

Detecting block-wise shifts in the scaled residuals

As illustrated by Example 3.2 we are not capable of detecting all types of violations of the invariance by checking for shifts in the distribution of the pooled residuals across time. Nevertheless, many violations are in fact detectable in this fashion. An example in which the underlying model has two change points, leading to a block-wise time dependence of the (scaled) residuals, is illustrated in the left plot of Figure 1. Such block-wise shifts in mean and variance between two (true) environments $e,f\in\mathcal{E}(\operatorname{CP}^{*})$ can for example be detected using the following two test statistics

[TABLE]

where (3.12) detects shifts in mean and (3.13) detects shifts in variances. The main advantage of these two test statistics is that they do not require an extra regression step of the residuals on the predictors and are thus computationally faster.
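A minimal sketch of residual-only comparisons, together with the max or sum combination over a comparison set, is given below. The concrete forms (absolute mean difference, variance ratio minus one) are assumptions chosen for illustration, and all names are hypothetical.

```python
from statistics import fmean, pvariance

def mean_shift(r, e, f):
    """Assumed stand-in: absolute difference of the block means of r."""
    return abs(fmean(r[t] for t in e) - fmean(r[t] for t in f))

def variance_shift(r, e, f):
    """Assumed stand-in: deviation of the block variance ratio from one."""
    return abs(pvariance([r[t] for t in e]) / pvariance([r[t] for t in f]) - 1.0)

def combine(statistic, r, pairs, how="max"):
    """Aggregate the pairwise statistics over the comparison set by max or sum."""
    values = [statistic(r, e, f) for e, f in pairs]
    return max(values) if how == "max" else sum(values)
```

Since no regression on the predictors is needed, each evaluation only touches the residual vector once per environment, which is what makes these statistics cheap inside the resampling loop.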

3.2.2 Further alternatives

The test statistics constructed in Section 3.2.1 are tuned to detect alternatives arising from an underlying change point model. Depending on the setting, more natural alternatives might be gradual mean shifts (see center plot in Figure 1) or even more complicated shifts in the higher moments (see right plot of Figure 1). In the following, we give two potential choices of test statistics which focus power on these two latter alternatives. As discussed in Section 3.1, the level properties hold, even for finite samples, for any test statistic, allowing us to choose arbitrary test statistics and plug them into our methods described above.

Detecting gradual shifts in the scaled residuals

Assume that we want to detect gradual mean shifts across time as illustrated in the center plot in Figure 1. A natural idea is to use a non-linear (smooth) regression procedure to regress the scaled residuals $\widetilde{\mathbf{R}}$ given in (3.3) on time. This results in an estimator of the mean function $\mu_{t}=\mathbb{E}(\widetilde{\mathbf{R}}_{t})$ , which satisfies $\mu_{t}\equiv 0$ under the null hypothesis $H_{0,S}$ and captures the gradual shifts in the alternative. Essentially, the idea is to use a smoothing procedure that best approximates the expected gradual shifts in the alternative. For example, for very smooth shifts one could use generalized additive models (GAM) (Wood and Augustin, 2002), implemented in the R-package mgcv, to get the non-linear smoothing fit and then consider a measure of how far the smoother deviates from the horizontal line at $0$. Possible measures include the area under the smoother or the p-value corresponding to the hypothesis test which tests whether all coefficients are simultaneously zero. Along the same lines one can also detect shifts in the second moment by smoothing the squared scaled residuals $\widetilde{\mathbf{R}}^{2}$ across time.
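The smoothing idea can be sketched without a GAM by substituting a simple centered moving average for the smoother; this substitution, the window width, and the function names are assumptions made to keep the example self-contained. The statistic approximates the area between the smoother and the zero line by the average absolute smoothed value.

```python
def moving_average(values, window):
    """Centered moving average with windows truncated at the boundaries."""
    half = window // 2
    smooth = []
    for t in range(len(values)):
        lo, hi = max(0, t - half), min(len(values), t + half + 1)
        smooth.append(sum(values[lo:hi]) / (hi - lo))
    return smooth

def smooth_area_statistic(scaled_residuals, window=11):
    """Average absolute deviation of the smoothed residuals from zero."""
    smooth = moving_average(scaled_residuals, window)
    return sum(abs(s) for s in smooth) / len(smooth)
```

Applying the same function to the squared scaled residuals (after subtracting their overall mean) would give an analogous sketch for gradual shifts in the second moment.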

Detecting more complicated shifts in the scaled residuals

In case the alternatives we are interested in include nonlinear changes of higher moments or other more complicated variations across time, e.g., right plot in Figure 1, one option is to use the test statistic of a non-parametric independence test. For example, we could use the Hilbert-Schmidt independence criterion (HSIC) introduced by Gretton et al. (2007) and consider the test statistic

[TABLE]

where $\widehat{\operatorname{HSIC}}$ is the empirical version of HSIC. The use of HSIC is motivated by the property that it allows one to construct independence tests that are capable of capturing any type of dependence between random variables. An implementation of the Hilbert-Schmidt independence criterion is given in the R-package dHSIC (Pfister et al., 2017).
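For illustration, the biased (V-statistic) empirical HSIC with Gaussian kernels can be written in a few lines of plain Python. This is a sketch and not the dHSIC implementation; in particular, measuring dependence between the scaled residuals and rescaled time, and the fixed bandwidths, are assumptions made here.

```python
import math

def gaussian_gram(values, bandwidth):
    """Gram matrix of the Gaussian kernel exp(-(a - b)^2 / (2 * bandwidth^2))."""
    return [[math.exp(-((a - b) ** 2) / (2 * bandwidth ** 2)) for b in values]
            for a in values]

def hsic_biased(x, y, bw_x=1.0, bw_y=1.0):
    """Biased empirical HSIC between two real-valued samples of equal length."""
    n = len(x)
    K, L = gaussian_gram(x, bw_x), gaussian_gram(y, bw_y)
    k_rows = [sum(row) for row in K]
    l_rows = [sum(row) for row in L]
    term1 = sum(K[i][j] * L[i][j] for i in range(n) for j in range(n)) / n ** 2
    term2 = sum(k_rows) * sum(l_rows) / n ** 4
    term3 = 2 * sum(a * b for a, b in zip(k_rows, l_rows)) / n ** 3
    return term1 + term2 - term3

def hsic_time_statistic(scaled_residuals, bw_time=0.25):
    """Dependence between residuals and time t/n; bandwidths are arbitrary choices."""
    n = len(scaled_residuals)
    time = [t / n for t in range(1, n + 1)]
    return hsic_biased(scaled_residuals, time, bw_x=1 / math.sqrt(n), bw_y=bw_time)
```

The double loop costs $\mathcal{O}(n^{2})$ per evaluation, which is noticeably more expensive than the block-wise statistics when it is repeated $B$ times.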

4 Detection rates

While the assumption that a set $S$ is invariant in the sense of Definition 2.1 is sufficient for the scaled residual test to achieve correct level for arbitrary test statistics (see Proposition 3.1), we require additional constraints on the underlying model in order to phrase and prove results about the power. Additionally, any type of power analysis will rely on the form of the test statistic. In this section, we restrict ourselves to showing that the tests based on the statistics (3.9), (3.10) and (3.11) are able to detect a large class of alternatives resulting from an underlying change point model. In particular, we show that they are consistent in the sense that they have asymptotic power equal to one in the large sample limit, with additional results on the rate of convergence.

4.1 Asymptotic change point model

Since we are interested in analyzing the large sample behavior of our methods we need to formalize what a growing sample size means in our change point model. We restrict ourselves to the case of a fixed number of change points where additional data points are added in such a way that the relative positions of the change points are conserved. To this end, assume we are given data from a triangular array $((Y_{n,t},X_{n,t})_{t\in\{1,\dots,n\}})_{n\in\mathbb{N}}$ , which satisfies the following assumption.

Assumption 2 (asymptotic change point model).

There exists a fixed (unknown) collection of relative change points $\alpha^{*}_{1},\dots,\alpha^{*}_{L}\in(0,1)$ satisfying for $i\in\{1,\dots,L\}$ that $\lim_{n\rightarrow\infty}t^{*}_{n,i}/n=\alpha^{*}_{i}$ , where $\operatorname{CP}^{*}_{n}\coloneqq(t^{*}_{n,1},\dots,t^{*}_{n,L})$ is the true set of change points for $n$ data points. Moreover, for all $i\in\{1,\dots,L+1\}$ it holds that

[TABLE]

for some fixed distributions $F_{n,i}$ .

This, in particular, implies that each environment grows linearly as the sample size increases, i.e., for all $i\in\{1,\dots,L+1\}$ it holds that $\lvert e_{i}(\operatorname{CP}^{*}_{n})\rvert=\mathcal{O}(n)$ as $n\rightarrow\infty$ . We assume a finite number of asymptotic change points, and for any finite sample size, the position of these change points is unconstrained. Moreover, our results can be extended to settings where the number of change points increases with $n$ , as long as the size of the individual environments grows polynomially. Finally, we require one further assumption.

Assumption 3 (Multivariate normality).

For all $n\in\mathbb{N}$ and for all $e\in\mathcal{E}(\operatorname{CP}^{*}_{n})$ the random variable $(Y_{t},X_{t})_{t\in e}$ has a multivariate normal distribution.

This assumption together with the i.i.d. assumption for $(Y_{t},X_{t})_{t\in e}$ for any environment $e\in\mathcal{E}(\operatorname{CP}^{*}_{n})$ ensures that for any fixed set $S\subseteq\{1,\dots,d\}$ and for every $e\in\mathcal{E}(\operatorname{CP}^{*}_{n})$ there exist unique parameters $\beta_{e,S}$ , $\mu_{e,S}$ and $\sigma_{e,S}$ such that for all $t\in e$ it holds that

[TABLE]

The important part is the independence between $X^{S}$ and the noise $\varepsilon$ , which is no longer true if Assumption 3 is dropped.

4.2 Asymptotic results

Throughout this section, we assume that $(\mathbf{Y}_{n},\mathbf{X}_{n})_{n\in\mathbb{N}}$ satisfies Assumption 2 and Assumption 3. We show that for an appropriate choice of environments it is possible to prove consistency of our test, against the following alternatives,

[TABLE]

For all $n\in\mathbb{N}$ , let $\mathcal{E}_{n}\subseteq\mathcal{P}(\{1,\dots,n\})$ be a collection of pairwise disjoint non-empty environments. In order to obtain a consistency result we are interested in sequences of such collections $(\mathcal{E}_{n})_{n\in\mathbb{N}}$ satisfying the following three conditions:

(C1)

there exists a sequence $(r_{n})_{n\in\mathbb{N}}$ such that

[TABLE]

(C2)

for all $i\in\{1,\dots,L+1\}$ there exists a sequence $(f_{n})_{n\in\mathbb{N}}$ with $f_{n}\in\mathcal{E}_{n}$ and a constant $N\in\mathbb{N}$ such that for all $n\geq N$ it holds that $f_{n}\subseteq e_{i}(\operatorname{CP}^{*}_{n})$ and such that the sequences $(\sigma_{f_{n}}^{2})_{n\in\mathbb{N}}$ and $(\beta_{f_{n}})_{n\in\mathbb{N}}$ are convergent and $\lim_{n\rightarrow\infty}\sigma_{f_{n}}^{2}>0$ .

(C3,k)

for all $e_{n}\in\mathcal{E}_{n}$ the matrix $1/\lvert e_{n}\rvert\cdot\mathbf{X}_{e_{n}}^{\top}\mathbf{X}_{e_{n}}$ is $\mathbb{P}$ -a.s. invertible and there exist $c,C\in\mathbb{R}$ and $k\in\mathbb{N}$ such that for all $n\in\mathbb{N}$ it holds that

[TABLE]

Conditions (C1) and (C2) are in particular satisfied for $(\mathcal{E}^{G_{n}})_{n\in\mathbb{N}}$ where the environments are constructed using a grid as defined in (5.2) and given that the sequence of grids $(G_{n})_{n\in\mathbb{N}}$ becomes finer sufficiently fast as $n$ grows. Moreover, the moment condition (C3,k) is satisfied for any sequence of collections $(\mathcal{E}_{n})_{n\in\mathbb{N}}$ and any $k\in\mathbb{N}$ due to Assumption 3.

Based on these conditions we can prove consistency rates for the tests based on fixed sets $S$ , which in turn yield consistency of the estimator of the set $\tilde{S}$ .

4.2.1 Rate consistency of tests for fixed sets $S$

Consider a fixed non-invariant set $S$ . The following theorems show that we are capable of detecting the non-invariance with a rate depending on the type of test we use. We begin with the result for the decoupled test. Recall that the decoupled test $\varphi_{\operatorname{decoupled},B}$ combines the test statistics in (3.9) and (3.10) and adjusts the level with a Bonferroni correction, i.e., $\varphi_{\operatorname{decoupled},B}$ rejects the null hypothesis at level $\alpha$ if and only if at least one of the tests $\varphi_{T_{1}^{\max,\mathcal{F}^{1}(\mathcal{E}_{n})},B}$ or $\varphi_{T_{2}^{\max,\mathcal{F}^{1}(\mathcal{E}_{n})},B}$ rejects the null hypothesis at level $\alpha/2$ . The following theorem shows that it is consistent.

Theorem 4.1 (rate consistency of decoupled test).

Assume Assumption 2 and 3, let $S\subseteq\{1,\dots,d\}$ and let $(\mathcal{E}_{n})_{n\in\mathbb{N}}$ be a sequence of collections of pairwise disjoint non-empty environments with the properties (C1), (C2) and (C3,k) and assume that for all $n\in\mathbb{N}$ it holds that $(\mathbf{Y}_{n},\mathbf{X}_{n})\sim P_{n}\in H_{A,S}^{n}(a_{n},b_{n})$ , where $a_{n}$ and $b_{n}$ satisfy the following condition

[TABLE]

Then it holds that

[TABLE]

A proof of this result is given in Appendix C.2. A different option is to use the combined test based on the test statistic in (3.11) which tests for both shifts in regression coefficients and shifts in variance simultaneously. Surprisingly, this leads to a worse rate of detecting shifts in the regression coefficients than for the decoupled test.

Theorem 4.2 (rate consistency of combined test).

Assume Assumption 2 and 3, let $S\subseteq\{1,\dots,d\}$ and let $(\mathcal{E}_{n})_{n\in\mathbb{N}}$ be a sequence of collections of pairwise disjoint non-empty environments with the properties (C1), (C2) and (C3,k) and assume that for all $n\in\mathbb{N}$ it holds that $(\mathbf{Y}_{n},\mathbf{X}_{n})\sim P_{n}\in H_{A,S}^{n}(a_{n},b_{n})$ , where $a_{n}$ and $b_{n}$ satisfy the following condition

[TABLE]

Then it holds that

[TABLE]

A proof of this result is given in Appendix C.1.

Remark 4.3 (uniform consistency).

The results in Theorems 4.1 and 4.2 can be extended to be uniform across the following alternatives

[TABLE]

i.e., across all alternatives with a fixed minimum signal. Then, given the same rates for $\bar{a}_{n}$ and $\bar{b}_{n}$ in the Theorems 4.1 and 4.2 we get the result

[TABLE]

where $\varphi$ is either the combined or the decoupled test. The precise statement is given in Theorem C.7 in Appendix C.4. In order to extend the proofs we additionally assume that condition (C2) holds uniformly across $\bar{H}^{n}_{A,S}(\bar{a}_{n},\bar{b}_{n})$ . Further details on this extension are given in Appendix C.4.

4.2.2 Rate consistency of estimator $\hat{S}$

We can also show that the estimator for the plausible causal predictors $\hat{S}$ given in (2.3) converges to the set $\tilde{S}$ in (2.2) with the same rates as in the previous section.

Corollary 4.4 (rate consistency of estimator $\hat{S}$ (decoupled test)).

Assume Assumption 2 and 3, let $(\mathcal{E}_{n})_{n\in\mathbb{N}}$ be a sequence of collections of pairwise disjoint non-empty environments with the properties (C1), (C2) and (C3,k). Additionally assume that there exist positive sequences $(a_{n})_{n\in\mathbb{N}}$ and $(b_{n})_{n\in\mathbb{N}}$ satisfying for all $n\in\mathbb{N}$ and for all $S\subseteq\{1,\dots,d\}$ with $H_{0,S}$ false that $(\mathbf{Y}_{n},\mathbf{X}_{n})\sim P_{n}\in H_{A,S}^{n}(a_{n},b_{n})$ and

[TABLE]

Moreover, for all fixed sets $S\subseteq\{1,\dots,d\}$ denote by $\varphi^{S}_{n,B}$ the hypothesis test given by $\varphi_{\operatorname{decoupled},B}$ and define the family of tests $\varphi_{n,B}=(\varphi^{S}_{n,B})_{S\subseteq\{1,\dots,d\}}$ . Then it holds that

[TABLE]

A proof is given in Appendix C.3. Similar to Theorem 4.2 we get an analogous result (with worse detection rate) for the combined test. As explained in Section 2.3, the set $\tilde{S}$ is a subset of the parents of $Y$ (or in the case of hidden variables a subset of the ancestors of $Y$ ). Hence, this corollary shows that (under sufficient interventions on the predictors) we are able to recover the correct parents (or ancestors) with a known detection rate.

5 Implementation

Our methods are implemented in the R-package seqICP available on CRAN. The package in particular includes all the test statistics introduced in Section 3.2. In this section we discuss some additional details on the practical implementation of our methods. A rough outline of our block-wise procedures is given in Appendix B in Algorithm 1. In contrast to the block-wise procedure, our methods based on smoothing or general independence tests (see Section 3.2.2) do not require a separation into blocks of environments.

5.1 Choosing environments and comparison set

There are many reasonable ways in which the set of comparisons $\mathcal{F}$ can be chosen. The choice affects both empirical power properties and computational complexity. This in particular leads to a trade-off between the number of comparisons and the size of the environments. As shown in Section 4 this trade-off can be chosen in such a way that our methods become consistent.

We consider two options of choosing the comparison set $\mathcal{F}$ which work well in practice. The first option, which we already introduced in (3.8), is to use the choice from the theoretical results where we compare all pairs of non-intersecting environments, i.e.,

[TABLE]

A second computationally more efficient option is to not compare all environments pairwise but to rather compare each environment against its complement, i.e.,

[TABLE]

For each type of comparison we need to additionally choose the collection of environments $\mathcal{E}$ . A reasonable option is to pick a grid $G=(g_{1},\dots,g_{m})$ on $\{1,\dots,n\}$ (where $0<g_{1}<\cdots<g_{m}<n$ ) and then use

[TABLE]

This collection of environments is in particular larger than the one introduced in Section 3.2.1, i.e. $\mathcal{E}\left(\operatorname{CP}\right)\subseteq\mathcal{E}^{\operatorname{CP}}$ , where $\operatorname{CP}$ is the set of change points. Given that the set of change points is unknown one can simply take an equally spaced grid on $\{1,\dots,n\}$ . However, it is also possible to include some prior knowledge about the approximate locations of change points into the grid $G$ .

In order to achieve the consistency rates from the theory (Section 4) we could choose the size of the grid such that $m$ is of the order $\log(n)$ and the size of each of the $m+1$ environments tends to infinity as $n$ gets large. In particular, the comparison sets would then satisfy condition (C1). For example, we could choose an equidistant grid with $\log(n)$ grid points, then using the notation of Theorems 4.1 and 4.2 it holds that $\lvert\mathcal{E}_{n}\rvert=\mathcal{O}(\log(n)^{2})$ and $r_{n}=c\cdot n/\log(n)$ , for some $c>0$ . Hence, given that condition (C3,k) is satisfied for all $k\in\mathbb{N}$ (this is the case for Gaussian noise), we detect shifts in either the variance or the regression coefficients with a rate of $o((\log(n)/n)^{1/2})$ for the decoupled test. Whereas for the combined test the detection rate for shifts in the regression coefficients would only be $o((\log(n)/n)^{1/4})$ and for shifts in variance it would be $o((\log(n)/n)^{1/2})$ .
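The grid-based construction can be sketched as follows. The sketch assumes that the collection of environments consists of all intervals between pairs of grid points, with $0$ and $n$ added as end points, which matches the stated size of order $\log(n)^{2}$ but is otherwise an assumption; all function names are hypothetical.

```python
import math
from itertools import combinations

def equidistant_grid(n):
    """Roughly log(n) equally spaced grid points strictly between 0 and n (n >= 3)."""
    m = max(1, int(math.log(n)))
    return [round(k * n / (m + 1)) for k in range(1, m + 1)]

def grid_environments(grid, n):
    """All intervals {lo+1, ..., hi} between two points of the extended grid."""
    bounds = [0] + list(grid) + [n]
    return [range(lo + 1, hi + 1) for lo, hi in combinations(bounds, 2)]

def disjoint_pairs(envs):
    """First comparison set: all pairs of non-intersecting environments."""
    return [(e, f) for e, f in combinations(envs, 2)
            if e.stop <= f.start or f.stop <= e.start]

def complement_pairs(envs, n):
    """Second comparison set: each environment against its (non-empty) complement."""
    full = range(1, n + 1)
    return [(list(e), [t for t in full if t not in e]) for e in envs if len(e) < n]
```

For $n=100$ this gives four grid points and 15 environments; the pairwise comparison set is then considerably larger than the complement-based one, which reflects the computational trade-off discussed above.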

5.2 Computational complexity

In order to analyze the computational complexity of the procedure it is helpful to distinguish between the complexity of performing an invariance test for a single set $S$ and the complete procedure, which iterates over all such potential invariant sets.

A single test based on the scaled residual resampling introduced in Section 3.2 requires one step of ordinary least squares to compute the residuals and $B$ evaluations of the test statistic to approximate the null distribution. The computational cost of one evaluation of the various test statistics is given in Table 1.

For the comparison sets from the previous sections we have $\lvert\mathcal{F}^{1}(\mathcal{E})\rvert=\mathcal{O}(\lvert\mathcal{E}\rvert^{2})$ and $\lvert\mathcal{F}^{2}(\mathcal{E})\rvert=\mathcal{O}(\lvert\mathcal{E}\rvert)$ , which implies that if we choose $\mathcal{E}$ to contain of order $\log(n)$ sets the complete complexities of our change point based tests $T^{*,\mathcal{F}}_{i}$ are $\mathcal{O}(B\cdot n\log(n))$ in the low dimensional setting.

Depending on the number of potential predictor variables, an exhaustive search over all subsets can quickly become infeasible. For such settings we suggest reducing the number of potential predictor variables by using an appropriate pre-selection, for example the Lasso as also described in Peters et al. (2016). Additionally, there is often no need to compute all subsets due to the fact that the intersection in (2.3) is computed. For details see Peters et al. (2016).
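The exhaustive search and the shortcut offered by the intersection can be sketched as follows. The callable `accepts` is a hypothetical wrapper around a level-$\alpha$ invariance test for a given set; the convention of returning the empty set when every set is rejected follows the discussion of the empty output earlier in the paper.

```python
from itertools import combinations

def plausible_causal_predictors(d, accepts):
    """Intersect all subsets S of {1, ..., d} for which accepts(S) is True."""
    estimate = None
    for size in range(d + 1):
        for subset in combinations(range(1, d + 1), size):
            candidate = set(subset)
            if accepts(candidate):
                estimate = candidate if estimate is None else estimate & candidate
                if not estimate:
                    # An empty intersection cannot change any more, so stop early.
                    return set()
    return estimate if estimate is not None else set()
```

Iterating over small sets first means that an accepted empty set, or two accepted disjoint sets, ends the search after only a few tests.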

6 Instantaneous causal effects in multivariate time series

We have mentioned in Section 2 that the structural invariance defined in Definition 2.1 does not restrict the dependence structure between the predictor variables. This implies that dependence structures such as Model A in Figure 2 are included in our framework. Our framework can be adapted to also include time dependencies such as the ones in Model B and Model C in Figure 2. In this section, we show that this is possible whenever the dependence of $Y$ on $X$ and the past of $(Y,X)$ is linear Gaussian and higher order Markovian.

Consider a sequence $(Y_{t},X_{t})_{t\in\{1,\dots,n\}}$ as described at the beginning of Section 2 for which there exists $p\in\{0,\dots,n-2\}$ , $S^{*}\subseteq\{1,\dots,d\}$ , $\beta=(\beta_{1},\dots,\beta_{\lvert S^{*}\rvert})^{\top}\in(\mathbb{R}\setminus\{0\})^{\lvert S^{*}\rvert\times 1}$ and $B_{k}\in\mathbb{R}^{(d+1)\times 1}$ for $k\in\{1,\dots,p\}$ , satisfying for all $t\in\{p+1,\dots,n\}$ that

[TABLE]

where $\varepsilon_{p+1},\dots,\varepsilon_{n}\overset{\text{\tiny iid}}{\sim}\mathcal{N}(0,\sigma^{2})$ are independent noise variables. Such a condition is for example satisfied if $(Y_{t},X_{t})$ is a structural vector autoregressive process (SVAR) (see e.g., Lütkepohl, 2005, Chapter 9). However, this framework also allows for more complicated (e.g., non-linear) structures between the predictor variables. We can adapt our methodology from the previous sections to estimate the set of instantaneous effects $S^{*}$ for which the model (6.1) remains invariant. The central idea is to include the past $p$ lags of all variables in each regression step. To make this more precise, for each potential set of instantaneous effects $S\subseteq\{1,\dots,d\}$ we do not regress $Y_{t}$ on $X_{t}^{S}$, as in the previous sections, but on

[TABLE]

instead. We denote the corresponding scaled residuals by

[TABLE]

where $\mathbf{P}^{S,p}_{\mathbf{Z}}$ is the projection operator onto the linear span of $\mathbf{Z}^{S,p}=(Z_{p+1}^{S,p},\ldots,Z_{n}^{S,p})$ and, with a slight abuse of notation, $\mathbf{Y}=(Y_{p+1},\dots,Y_{n})$. Analogously to (3.1), we consider the following null hypothesis expressed in terms of $Z_{t}^{S,p}$.

[TABLE]

Then, the same reasoning as in Section 2 can be applied, given that the model in (6.1) remains invariant across time. The corresponding result is as follows.

Proposition 6.1 (level of the scaled residual test including time lags).

For all measurable functions $T:\mathbb{R}^{n-p}\rightarrow\mathbb{R}$ based on the scaled residuals $\widetilde{\mathbf{R}}^{S,p}$ in (6.2), and for all $(\mathbf{Y},\mathbf{X})\sim P\in\widetilde{H}_{0,S,p}$ , it holds that the hypothesis test $\varphi^{S,p}_{T,B}$ defined as in (3.4) (where instead of regressing $Y$ on $X^{S}$ we regress $Y$ on $Z^{S,p}$ ) achieves correct level as $B$ goes to infinity, i.e.,

[TABLE]

A proof is given in Appendix C.5. In practical applications, we usually do not know the number of time lags $p$. Essentially, there are three ways of dealing with this issue. Firstly, one can include a sufficiently large number of lags $p$ which accounts for enough of the existing time dependence. Since we then need to estimate more parameters this will, however, typically decrease the power of our invariance procedure. A second option would be to apply variable selection such as AIC or BIC. As we aim at finding invariant models rather than models that predict well, one may also base the variable selection on a criterion that optimizes this goal. For example, we could go over all reasonable lags and then select the $p$ which results in the largest causal set, i.e., $p=\operatorname*{arg\,max}_{k}\lvert\hat{S}(k)\rvert$ where $\hat{S}(k)$ is our estimator resulting from using $k$ lags. (The nature of this idea is similar to the one proposed in Mooij et al. (2009), in which the authors evaluate the goodness of a regression function by the independence between residuals and predictors rather than the residual variance. The independence of residuals is afterwards used for causal structure learning.) As with any variable selection procedure we need to be careful when interpreting the confidence statements due to post-selection issues. A third option which circumvents any post-selection issues is to use the set $\hat{S}=\cup_{k\in L}\hat{S}(k)$, for some set of potential lags $L\subset\mathbb{N}$, and then adjust the level using a Bonferroni adjustment of size $\lvert L\rvert$ to account for multiple testing.
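The lagged regression and the third option (union over lags with a Bonferroni adjustment) can be sketched as follows. The column ordering inside `lagged_design` and the callable `estimator`, which stands for a routine returning $\hat{S}(k)$ at a given level, are assumptions made for illustration only.

```python
import numpy as np

def lagged_design(Y, X, S, p):
    """Regressors for t = p+1,...,n: instantaneous X_t^S followed by p lags of (Y, X).

    S holds zero-based column indices of X. The column order is an assumption.
    """
    n = len(Y)
    YX = np.column_stack([Y, X])
    blocks = [X[p:, list(S)]]
    for k in range(1, p + 1):
        blocks.append(YX[p - k:n - k])   # row for time t contains (Y_{t-k}, X_{t-k})
    return np.column_stack(blocks), Y[p:]

def union_over_lags(Y, X, lags, estimator, alpha=0.05):
    """Union of the estimates S_hat(k) over k in lags, each run at level alpha/|lags|."""
    level = alpha / len(lags)            # Bonferroni adjustment of size |L|
    result = set()
    for k in lags:
        result |= set(estimator(Y, X, k, level))
    return result

# Small deterministic example: n = 10 observations, d = 2 predictors, p = 2 lags.
Y = np.arange(10.0)
X = np.arange(20.0).reshape(10, 2)
Z, Y_trimmed = lagged_design(Y, X, S=[0], p=2)
print(Z.shape)  # (n - p, |S| + p * (d + 1))
```

Each additional lag adds $d+1$ columns to the regression, which is the source of the power loss mentioned for the first option.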

Proposition 6.1 establishes a framework for dealing with instantaneous causal effects. This itself allows going beyond the concept of Granger causality which excludes instantaneous effects. Furthermore, the power of invariant causal prediction using a test as in Proposition 6.1 hinges on the amount of non-stationarity present in the multivariate time series to detect deviations from the null-hypothesis $\widetilde{H}_{0,S,p}$ ; that is, non-stationarity, which loosely relates to perturbations, is potentially beneficial for inferring causal time-instantaneous structures. Section A.3 illustrates this empirically.

7 Numerical experiments

We apply our methodology to both artificial data sets (based on SCMs) and real data. In Section 7.1, we summarize the findings from the numerical simulations and in Section 7.2 we apply our method to a real world monetary example.

7.1 Numerical simulations

We empirically verify the theoretical results we have developed in Section 4. In particular, we show that detecting the difference in regression coefficients and residual variance using the combined test based on (3.11) and the decoupled test based on (3.9) and (3.10) yield different convergence rates (Appendix A.1). In Appendix A.2, we compare the power of different choices of the test statistic, e.g., when combining the different environments using a sum or a maximum, see (3.7). One further loses only little power when the true underlying environments are unknown compared to the traditional approach that exploits the precise location of the change points. In fact, in some situations and for large sample sizes, it is beneficial to split the true environments into smaller sets, see Appendix A.2.1. Finally, in Appendix A.3, we consider the time series setting discussed in Section 6. Due to the time dependence, it is possible to infer the causal structure even if there is a shock in the dynamical system at a single time instance (leading to an environment of size one). Whereas for some practical applications, it might be difficult to distinguish a shock from an outlier, we show that in case of an outlier, our method remains conservative.

7.2 Monetary policy example

To illustrate the usefulness of our method for practical applications we apply it to a real world data set related to the monetary policy of the Swiss National Bank (SNB) (see Appendix E for details). Our data set consists of monthly data from January 1999 to January 2017 based on the variables given in Table 2. Our goal is to find the instantaneous monthly causal predictors that affect the log returns of the Euro - Swiss Franc exchange rate (variable $Y$). The predictors we selected can be grouped into two categories. Variables $X^{1}$ to $X^{6}$ are all related to the policies of the SNB whereas variables $X^{7}$ to $X^{9}$ describe the economic conditions in Switzerland and the 19 Euro zone countries. As the SNB cannot directly set the exchange rate it is reasonable to assume that any active influence on the exchange rate either occurs through one of the SNB variables or due to changes in the economic conditions. Since we expect a time dependence in the target variable $Y$, we apply our method by including lagged variables as described in Section 6. After regressing the target variable $Y$ on the past of $(Y,X)$, the mean as well as the regression coefficients remain fairly stable. Hence, none of our tests for shifts in either of these quantities is able to reject the empty set. In contrast, the residual variance is unstable and tests for such changes are indeed able to reject invariance of the empty set. In this example, we therefore only apply tests capable of detecting instabilities of the second moment.

In Figure 3, we plot the $p$ -values for different lags resulting from the block-wise variance test, the block-wise decoupled test, the block-wise combined test, the smoother based variance test and the HSIC based test, all of which are introduced in Sections 3.2.1 and 3.2.2. For comparison purposes, we also apply the following non-causal method: Fit a linear model including all instantaneous effects as well as all lagged effects and compute the $p$ -values from the standard t-test.

The results show that the predictors $X^{2}$ and $X^{7}$ appear to be causally significant for most methods, or at least they consistently lead to the lowest $p$ -values. From an economic viewpoint this also makes sense: Variable $X^{2}$ represents the foreign currency investments which are a known tool of the SNB to reduce the value of the Swiss Franc. Variable $X^{7}$ is the Swiss GDP, an important economic indicator; it seems plausible that this is a causal predictor as well. Additionally, the plots show that the $p$ -values tend to increase when adding more lags. This happens because including too many lags leads to models that heavily over-fit, in which case our tests lose power. Moreover, the results show that both the combined test and the HSIC based test have less power here. In particular, they are not able to reject the model that includes only one lag.

This example is only an illustration of potential applications to real world data sets. In practice, one might benefit from a more in-depth analysis and a careful a priori selection of the predictors that are included in the model. In our example, one could argue that, instead of taking GDP, it might be useful to use more specific indicators such as the purchasing managers index (PMI) or other economic measures that might be more directly linked to the exchange rate. However, due to the fact that we obtain economically plausible results which post-hoc validate our methodology at least to a certain extent, we do believe that the example illustrates the potential of our approach for practical applications.

8 Summary

We introduce a framework for inferring causal predictors of a target variable $Y$ from sequentially ordered data. In contrast to classical invariant prediction (Peters et al., 2016) we do not need the knowledge of different environments. Nonetheless, we are able to ensure exact type I error control (Propositions 3.1 and 6.1). Given that the data are generated by a change point model, we additionally prove rates of consistency of our block-wise procedures (Theorems 4.1 and 4.2); more precisely, they can detect any violation of invariance with a rate which is essentially as fast as $1/\sqrt{n}$. We furthermore show that our framework can be extended to include linear time dependencies (Section 6). This opens the door to go beyond the concept of Granger causality and also allows for instantaneous causal effects. From this perspective, our methods make use of non-stationarity (induced by interventions occurring throughout time) in multivariate time series and use it to infer instantaneous causal effects. The empirical performance of our methods is illustrated by simulations. Notably, we verify the convergence rates empirically and show that our methods have comparable power properties to classical invariant causal prediction without requiring knowledge about environments. In the case of time series data our methods are able to detect causal directions even from a single shock intervention at a specific point in time. Finally, we illustrate an application to a real data set about the monetary policy of the Swiss National Bank.

Acknowledgements

The authors thank Nicolai Meinshausen and anonymous reviewers for helpful discussions and constructive comments. NP was supported by a research grant (200021_153504) from the Swiss National Science Foundation (SNSF) and JP was supported by a research grant (18968) from VILLUM FONDEN.

Appendix A Detailed numerical simulations

This section consists of a detailed presentation of the numerical simulations. We begin in Section A.1 with an experiment to provide empirical evidence for the consistency results we have developed in Section 4. Section A.2 compares the power of different choices of the test statistic, e.g. when combining the different environments using a sum or a maximum, see (3.7). Finally, Section A.3 shows an experiment for the time series setting discussed in Section 6.

A.1 Comparison of combined and decoupled test statistic

In this section we empirically verify the convergence rates proved in Theorem 4.1 and Theorem 4.2. For our simulations we use various (even) sample sizes $n$ and simulate data from a linear Gaussian model of the form

[TABLE]

To verify the convergence rates we consider alternatives with one change point at $\frac{n}{2}$ , leading to two environments $e_{n}\coloneqq\{1,\dots,\frac{n}{2}\}$ and $f_{n}\coloneqq\{\frac{n}{2}+1,\dots,n\}$ on which the data is i.i.d. with fixed parameters $\beta_{e_{n}},\beta_{f_{n}},\sigma_{e_{n}}$ and $\sigma_{f_{n}}$ . In this simulation we consider the three alternatives specified in Table 3.

The resulting plots are given in Figure 4. For the alternatives 1 and 2 they show that the combined test based on $T_{e_{n},f_{n}}^{3}$ given in (3.11) is only able to detect changes in the regression coefficients with a rate of $n^{-1/4}$ , while the decoupled test based on $T_{e_{n},f_{n}}^{1}$ and $T_{e_{n},f_{n}}^{2}$ given in (3.9) and (3.10) has a rate of $n^{-1/2}$ . On the other hand alternative 3 shows that both tests are able to detect changes in the noise variance with a rate of $n^{-1/2}$ . This corresponds with what has been proved in Theorems 4.1 and 4.2. In particular, the simulations illustrate that the decoupled test appears (at least in these examples) to be more powerful even for finite sample sizes. This indicates both from a theoretical and an empirical point of view that it is preferable to use the decoupled test rather than the combined test.

A.2 Power comparison on simulated data

We now apply our methods to simulated data. As data generating process we use the linear Gaussian model given in Figure 5.

We perform our simulations for the sample sizes $n\in\{100,200,300,400,500\}$ and for each sample size we generate $1000$ data sets. For each of these repetitions, we randomly draw parameters of the structural causal model according to the following distributions $\beta^{j}\overset{\text{\tiny iid}}{\sim}\operatorname{Uniform}([0.5,1.5])$ , $\sigma_{j}^{2}\overset{\text{\tiny iid}}{\sim}\operatorname{Uniform}([0.1,0.3])$ and $\mu_{j}\overset{\text{\tiny iid}}{\sim}\operatorname{Uniform}([0,0.3])$ that are used to sample from the so-called observational distribution. In order to generate different environments, we randomly select two change points $t_{1}$ and $t_{2}$ in $\{1,\dots,n\}$ . This yields the following three environments.

• $e_{1}=\{1,\dots,t_{1}\}$: Here, we sample from the observational distribution.

• $e_{2}=\{t_{1}+1,\dots,t_{2}\}$: Here, we use the model as in the first environment but intervene on variable $X^{2}$, i.e., the structural assignment of $X^{2}$ is replaced by $X^{2}=\beta^{1}X^{1}+\tilde{N}^{2}$, where $\tilde{N}^{2}$ is a Gaussian random variable with mean sampled uniformly between $1$ and $1.5$ and variance sampled uniformly between $1$ and $1.5$.

• $e_{3}=\{t_{2}+1,\dots,n\}$: Again, we use the same model as in environment $e_{1}$ but this time, we intervene on $X^{3}$, i.e., the structural assignment of $X^{3}$ is replaced by $X^{3}=\tilde{N}^{3}$, where $\tilde{N}^{3}$ is a Gaussian random variable with mean sampled uniformly between $-1$ and $-0.5$ and the same variance as the noise $N^{3}$ from the observational setting.
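The three environments above can be generated along the following lines. The graph of Figure 5 is not reproduced in this section, so the structural assignments used here ($X^{1}\to X^{2}$, $X^{1},X^{2}\to Y$, $Y\to X^{3}$) are an assumption that is merely compatible with the text; the restriction of the change points to distinct values in $\{1,\dots,n-1\}$ is likewise a simplification to keep all environments non-empty.

```python
import numpy as np

def simulate_three_environments(n, rng=None):
    """Sketch of the A.2 setup under an ASSUMED graph X1 -> X2, (X1, X2) -> Y, Y -> X3."""
    rng = np.random.default_rng(rng)
    beta = rng.uniform(0.5, 1.5, size=4)      # edge coefficients
    sigma2 = rng.uniform(0.1, 0.3, size=4)    # noise variances for (N1, N2, N3, NY)
    mu = rng.uniform(0.0, 0.3, size=4)        # noise means
    # Two distinct change points (assumption: drawn from {1,...,n-1}).
    t1, t2 = np.sort(rng.choice(np.arange(1, n), size=2, replace=False))
    env = np.zeros(n, dtype=int)
    env[t1:t2] = 1                            # e2 = {t1+1,...,t2}
    env[t2:] = 2                              # e3 = {t2+1,...,n}
    noise = mu + np.sqrt(sigma2) * rng.standard_normal((n, 4))
    # Intervention noises as described in the list above.
    n2_tilde = rng.uniform(1.0, 1.5) + np.sqrt(rng.uniform(1.0, 1.5)) * rng.standard_normal(n)
    n3_tilde = rng.uniform(-1.0, -0.5) + np.sqrt(sigma2[2]) * rng.standard_normal(n)
    X1 = noise[:, 0]
    X2 = beta[0] * X1 + np.where(env == 1, n2_tilde, noise[:, 1])
    Y = beta[1] * X1 + beta[2] * X2 + noise[:, 3]
    X3 = np.where(env == 2, n3_tilde, beta[3] * Y + noise[:, 2])
    return Y, np.column_stack([X1, X2, X3]), env, (int(t1), int(t2))

Y_sim, X_sim, env_sim, change_points = simulate_three_environments(200, rng=0)
print(change_points, np.bincount(env_sim))
```

Note that the structural assignment of $Y$ is never modified, which is what makes the set of its parents invariant across the three environments.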

The results are shown in Figure 6. Here we compare our method based on unknown change points (seqICP, green) with our method based on known change points (cp known, red) and the original version of ICP (ICP, blue) from Peters et al. (2016). Since $X^{1}$ and $X^{2}$ are the true parents of $Y$ in the underlying model, we expect the methods to reject the hypothesis that these two variables are non-causal, at least with increasing sample size. The figure shows that the sum (triangle) works slightly better than the maximum (circle), see (3.7). Furthermore, providing the method with the true underlying change points and placing the grid points on those (red) improves over a larger grid (consisting of $10$ grid points) (green). The difference, however, is not very large and does not seem significant for the sum estimators. This stability property of the sum estimator might be due to a phenomenon we investigate in Section A.2.1 below. The maximum based test statistic works slightly worse than the sum based statistic if the sample size in the smallest segments of the grid is too small. This may be because the maximum is influenced heavily by the terms that correspond to such smallest environments, which we expect to have a large variance. This effect is not as prominent for the sum, in which one effectively averages many such terms. All the proposed methods outperform the original version of ICP due to the slightly improved test statistic.

A.2.1 Increasing power by splitting environments

In the example above, for each data set, there are two change points that have been used in the data generating process. That is, the data are i.i.d. within each of the three environments $e_{1}$ , $e_{2}$ and $e_{3}$ . Suppose now that we are given the change points and assume further that the third environment is large compared to the former two. We can then run our method using a grid that is placed on the known change points. The question arises whether one can benefit (in terms of power) by splitting the third environment, i.e., by placing another grid point after the second one. Intuitively, this should not be the case for the maximum based test statistic, which is focusing on the largest difference of distributions between any two environments that are constructed from the grid. We observe empirically, however, that it can be indeed the case for the sum based test statistic.

As an experiment, we use the same simulation procedure as in Section A.2 above but fix the sample size to be $n=200$ and fix the two change points at 15 and 30. Placing the grid on those two points yields identification of $X^{1}$ as a causal variable in roughly $20\%$ of the repetitions, see Figure 7, red line. Instead, we can also keep the location of the first two grid points and split the largest environment into smaller segments; this is done by introducing additional grid points between 30 and 200. Somewhat surprisingly, this can yield a significant increase of power, see the green line in Figure 7.

If one splits an existing segment, one obtains additional terms in the sum of test statistics, see the right-hand side of Equation (3.7). In the setting above, for example, after splitting environment $e_{3}$ into three segments, we have tripled the number of terms that measure the difference between environments $e_{1}$ and $e_{3}$. For rather large environments, in which the test statistic has relatively small variance even for the smaller environments, this can be seen as putting more weight on the corresponding term in the original sum. This may then ultimately yield an increase in power of the procedure. At some point, this effect levels off, of course. After introducing too many additional grid points, the variances of the individual test statistics are so large that one cannot detect any difference in distributions between the segments anymore. This is why the green line has to fall below the red line with increasing number of splits.
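The counting argument can be checked with a small example. It assumes that the sum ranges over all unordered pairs of grid segments, and the two additional grid points 87 and 144 are hypothetical values chosen only to split $\{31,\dots,200\}$ into three parts.

```python
from itertools import combinations

def pair_terms(bounds):
    """All unordered pairs of consecutive segments {bounds[i]+1,...,bounds[i+1]}."""
    envs = [(bounds[i] + 1, bounds[i + 1]) for i in range(len(bounds) - 1)]
    return list(combinations(envs, 2))

def terms_between_first_and_last_regime(bounds):
    """Number of pairs comparing {1,...,15} with a segment lying inside {31,...,200}."""
    first = (1, 15)
    return sum(1 for a, b in pair_terms(bounds) if a == first and b[0] > 30)

count_before = terms_between_first_and_last_regime([0, 15, 30, 200])
count_after = terms_between_first_and_last_regime([0, 15, 30, 87, 144, 200])
print(count_before, count_after)
```

The total number of pairs grows as well (from 3 to 10 here), so the comparison between the first and the third regime gains relative weight only as long as the new segments remain large enough to give stable statistics.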

A.3 Shocks in time series

In this section, we look at a time series example with three variables $X$ , $Y$ , and $Z$ with a linear autoregressive structure (with one lag) given by the DAG in Figure 8. More precisely, we use the following structural time series model

[TABLE]

The choice of parameters is such that all arrows in the DAG in Figure 8 correspond to non-zero coefficients.

For our simulations we use $n=200$ . We then draw the time point at which we intervene uniformly from $\{1,\ldots,n\}$ . Our intervention consists of setting the structural assignment of $X$ at this time point to the desired shock strength, i.e., the shock intervention happens only at this one particular instance in time and the structural assignment of $X$ is changed back to its original form in the next time step. Due to the time and structural dependence the shock propagates and spreads to the other variables. An example time series with a shock intervention of size $15$ at the time point $t=30$ is illustrated in Figure 8. In our simulations, we resample this model $1000$ times for each shock strength in $\{0,2,\ldots,30\}$ and apply our method using the decoupled test based on $T_{1}^{\operatorname{sum},\mathcal{F}^{2}(\mathcal{E}^{G})}$ and $T_{2}^{\operatorname{sum},\mathcal{F}^{2}(\mathcal{E}^{G})}$ , where $G=(20,40,\dots,180)$ and $\mathcal{F}^{2}$ is defined in (5.1). The results are illustrated in Figure 9.
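The shock mechanism described above can be sketched as follows. The structural equations and all coefficient values in this block are hypothetical placeholders (a chain $X_{t}\to Y_{t}\to Z_{t}$ with one lag of self-dependence), not the model used in the experiment; only the way the shock is applied at a single time point and then removed follows the description.

```python
import numpy as np

def simulate_shock(n=200, shock=15.0, t0=None, rng=None):
    """One-lag linear time series with a single shock on X at time t0 (1-based)."""
    rng = np.random.default_rng(rng)
    if t0 is None:
        t0 = int(rng.integers(1, n + 1))      # uniform on {1,...,n}
    X, Y, Z = np.zeros(n + 1), np.zeros(n + 1), np.zeros(n + 1)
    for t in range(1, n + 1):
        # HYPOTHETICAL structural assignments and coefficients.
        X[t] = 0.5 * X[t - 1] + rng.standard_normal()
        if t == t0:
            X[t] = shock                      # assignment of X replaced at t0 only
        Y[t] = 0.8 * X[t] + 0.3 * Y[t - 1] + rng.standard_normal()
        Z[t] = 0.6 * Y[t] + 0.4 * Z[t - 1] + rng.standard_normal()
    return X[1:], Y[1:], Z[1:], t0

X_s, Y_s, Z_s, t_shock = simulate_shock(t0=30, rng=1)
print(X_s[t_shock - 1], Y_s[t_shock - 1])
```

Because $X_{t_0}$ enters both $Y_{t_0}$ and $X_{t_0+1}$, the single perturbation spreads through the instantaneous and the lagged dependencies, which is the propagation effect exploited by the test.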

One might argue that in practical applications it might not be possible to distinguish between shock interventions and outliers. We therefore also analyze how our method behaves if instead of a shock intervention we simply set one value of $Y$ as an outlier, i.e., we sample the complete time series without any intervention and then set the value of $Y$ at a random time point to a fixed value. The results are illustrated in Figure 10. They show that with increasing outlier size one obtains a model misspecification. Our method therefore stays conservative and outputs the empty set. It does not give an informative answer but does not output a mistake either.

Appendix B Supporting material

Remark B.1 (violations of the linear Gaussian assumption).

The procedure described above relies on the assumption of a linear Gaussian model. An interesting extension for practical applications would be to allow

• for non-linear settings, i.e., by replacing the linear dependence in Definition 2.1 (a) by $Y_{t}=f(X_{t}^{S},\varepsilon_{t})$ where $f$ is in some general class of functions $\mathcal{F}$,

• and for non-Gaussian noise settings, i.e., by allowing for an arbitrary noise distribution $G_{\varepsilon}$ in Definition 2.1 (b).

One option is to use a permutation approach as follows: first, use a general regression procedure to estimate the function $f$ and compute residuals $R_{1},\dots,R_{n}$; then, in a second step, approximate the null distribution of a test statistic $T(R_{1},\dots,R_{n})$ by permuting the time index of the residuals. Given that our estimate of $f$ is very close to the true function, the residuals should be approximately i.i.d., hence (approximately) justifying a permutation approach. While we believe this approach is interesting from a practical viewpoint, it is only a heuristic. Moreover, it turns out to be rather difficult to get precise results about the asymptotic level of such a testing procedure.
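A minimal sketch of this heuristic is given below. The cubic polynomial fit is only a stand-in for the "general regression procedure", the predictor is assumed to be scalar, and the statistic `half_variance_gap` is a hypothetical example; like the approach itself, the sketch comes without level guarantees.

```python
import numpy as np

def permutation_pvalue(x, y, statistic, degree=3, B=500, rng=None):
    """Permutation test on residuals of a polynomial fit (stand-in for a general regression)."""
    rng = np.random.default_rng(rng)
    coeffs = np.polyfit(x, y, degree)          # step 1: estimate f
    resid = y - np.polyval(coeffs, x)          # residuals R_1,...,R_n in time order
    t_obs = statistic(resid)
    # Step 2: permute the time index of the residuals to mimic the null.
    t_perm = np.array([statistic(rng.permutation(resid)) for _ in range(B)])
    return (1 + np.sum(t_perm >= t_obs)) / (B + 1)

def half_variance_gap(r):
    """Hypothetical statistic: variance difference between first and second half."""
    half = len(r) // 2
    return abs(np.var(r[:half]) - np.var(r[half:]))

# Toy non-linear example whose noise level changes halfway through the sample.
rng = np.random.default_rng(0)
n = 300
x = rng.uniform(-2.0, 2.0, size=n)
sd = np.where(np.arange(n) < n // 2, 0.2, 1.0)
y = np.sin(x) + sd * rng.standard_normal(n)
p_perm = permutation_pvalue(x, y, half_variance_gap, B=200, rng=1)
print(p_perm)
```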

Remark B.2 (Obtaining environments by clustering).

It is tempting to use $(\mathbf{Y},\mathbf{X})$ to construct environments. For example, one could use a clustering procedure on one of the variables $X^{j}$ . In general, however, this can break the level guarantees of the test. To see this, assume we are given observations from the following Gaussian SCM

[TABLE]

Clearly, $Y$ has no parents, implying that the empty set is invariant. However, constructing two environments by clustering on the sign of $X$ results in a changing distribution of $Y$ across the two environments, hence breaking the invariance. A similar counterexample can be constructed by letting the noise of $Y$ be bi-modal; then the same problem occurs even if $X$ depends on $Y$ linearly. The problem is that the clustering is based on the noise of $Y$. One way of avoiding this is to only cluster using the ancestors of $Y$. Such a method is proposed in Heinze-Deml et al. (2017).
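The effect is easy to reproduce numerically. The SCM below ($Y$ a pure noise variable and $X$ a noisy child of $Y$) is a hypothetical stand-in chosen for illustration and need not coincide with the SCM displayed above; the point is only that splitting on the sign of a descendant of $Y$ changes the distribution of $Y$ across the resulting groups.

```python
import random
import statistics

random.seed(0)
n = 2000
# HYPOTHETICAL SCM: Y has no parents, X is a child of Y.
Y = [random.gauss(0.0, 1.0) for _ in range(n)]
X = [y + random.gauss(0.0, 0.5) for y in Y]

# "Environments" obtained by clustering on the sign of X.
y_pos = [y for x, y in zip(X, Y) if x > 0]
y_neg = [y for x, y in zip(X, Y) if x <= 0]
mean_pos = statistics.mean(y_pos)
mean_neg = statistics.mean(y_neg)
print(mean_pos, mean_neg)  # clearly different although the empty set is invariant
```

An invariance test applied to these two groups would therefore tend to reject the empty set, even though it is the true set of parents of $Y$.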

Remark B.3 (Comparison to change point methods).

While our proposed method also covers the case of smoothly varying shifts we have analyzed its power in a change point model. This might lead to the question of how our method relates to two-stage procedures which first identify change points and then proceed to infer the causal structure based on these environments. Most importantly, the difference is that our method directly optimizes the (non-)invariance required to infer causal relations, while a two-step procedure first solves a change point detection problem, which is only indirectly linked to (non-)invariances. A scenario which illustrates this is given by a model consisting of very many changes for which only very few actually lead to non-invariant models. A two-stage procedure will necessarily run into power issues due to the many small environments, while our method will not be affected in the same way. A second major problem with the two-step procedure is that the change point procedure is only allowed to be applied to the target variable $Y$ and not to the predictors $X$ , as one otherwise runs into the same problem discussed in Remark B.2. This, however, might lead to a loss in power, as the following (rather) artificial example given in Figure 11 illustrates.

In this example not all changes are visible in the distribution of $Y$ alone. This, in particular, means that any two-stage procedure must fail, since there are at most two distinct environments ($e_{1}=\{1,\dots,200\}$ and $e_{2}=\{201,\dots,400\}$) that can be detected in the distribution of $Y$. Applying our procedure to this three point grid and using test statistic $T^{1}$ given in (3.9) (this is essentially the same as standard ICP with two environments and a different hypothesis test), we are not able to reject the set $\{X^{1}\}$ ($p$-value of $0.554$). In contrast, our procedure (based on a fine grid) is able to exploit the differences in distribution in the first half of the data. Hence, we are able to reject the set $\{X^{1}\}$ with a $p$-value of $0.039$.

Appendix C Proofs

C.1 Theorem 4.2

In this section we give a proof of Theorem 4.2. The key step in the proof is based on Proposition C.2 and Proposition C.3. For notational convenience we drop the set $S$ in the notation, throughout this entire section.

Proof (Theorem 4.2).

The convergence of the Monte-Carlo approximation of the empirical distribution is well-established (see e.g. Lehmann and Romano, 2005, Example 11.2.13). It therefore holds $\mathbb{P}$ -a.s. that,

[TABLE]

where $c_{T^{\max,\mathcal{F}^{1}(\mathcal{E}_{n})}_{3},B}$ is defined in (3.5). Define, $\varphi^{*}_{T^{\max,\mathcal{F}^{1}(\mathcal{E}_{n})}_{3}}\coloneqq\mathds{1}_{\{\lvert T^{\max,\mathcal{F}^{1}(\mathcal{E}_{n})}_{3}\rvert>F_{T^{\max,\mathcal{F}^{1}(\mathcal{E}_{n})}_{3}}^{-1}(1-\alpha)\}}$ , then by the dominated convergence theorem it holds that

[TABLE]

It therefore remains to prove that the right-hand side of the above equation converges to $1$. Recall that for a real random variable $T$ and a constant $a\in\mathbb{R}$ it holds for all $q\in\mathbb{R}$ that

[TABLE]

where $F_{T}^{-1}(q)\coloneqq\inf\{t\in\mathbb{R}\,\rvert\,\mathbb{P}(T\leq t)\geq q\}$ is the generalized inverse distribution function. In order to simplify the notation we define

[TABLE]

Assume that $(\mathbf{Y}_{n},\mathbf{X}_{n})$ satisfies $\mathbb{P}^{(\mathbf{Y}_{n},\mathbf{X}_{n})}\in H_{0}$ , then Proposition C.2 implies that for all $\varepsilon>0$ there exists $M_{\varepsilon}>0$ such that for all $n\in\mathbb{N}$ it holds that $\mathbb{P}\left(\lvert T_{n}\rvert>M_{\varepsilon}\right)<\varepsilon$ . This in particular implies that

[TABLE]

Next, assume $(\mathbf{Y}_{n},\mathbf{X}_{n})$ satisfies $\mathbb{P}^{(\mathbf{Y}_{n},\mathbf{X}_{n})}\in H_{A}^{n}(a_{n},b_{n})$ . Then, there exist $i,j\in\{1,\dots,L+1\}$ such that $a_{n}=\lvert\sigma_{e_{i}(\operatorname{CP}^{*}_{n})}^{2}-\sigma_{e_{j}(\operatorname{CP}^{*}_{n})}^{2}\rvert$ and $b_{n}=\lVert\beta_{e_{i}(\operatorname{CP}^{*}_{n})}-\beta_{e_{j}(\operatorname{CP}^{*}_{n})}\rVert_{2}$ . Moreover, by (C2) there exist sequences of environments $(f_{n})_{n\in\mathbb{N}}$ and $(g_{n})_{n\in\mathbb{N}}$ with $f_{n},g_{n}\in\mathcal{E}_{n}$ such that for sufficiently large $n$ it holds that $f_{n}\subseteq e_{i}(\operatorname{CP}^{*}_{n})$ and $g_{n}\subseteq e_{j}(\operatorname{CP}^{*}_{n})$ . Additionally, we have that

[TABLE]

satisfies $\tfrac{1}{\sqrt{r_{n}}}=\mathcal{O}(\omega_{n})$ and by assumption also that at least one of the following two conditions is satisfied: $\omega_{n}=o(a_{n})$ or $\omega_{n}=o(b_{n}^{2})$. Thus we can apply Proposition C.3 to get that

[TABLE]

which completes the proof of Theorem 4.2. $\square$

C.1.1 Intermediate results

Lemma C.1 (representation of $T_{e_{1},e_{2}}^{3}$ for true change points).

Let $e_{1},e_{2}\in\mathcal{E}_{n}(\operatorname{CP}_{n}^{*})$ then it holds that

[TABLE]

A proof of this result is given in Appendix C.1.2.

The following proposition gives the asymptotic distribution of our test statistic under the null hypothesis $H_{0}$.

Proposition C.2 (asymptotic distribution under $H_{0}$ ).

Let $(\mathbf{Y}_{n},\mathbf{X}_{n})_{n\in\mathbb{N}}$ satisfy Assumption 2 and for all $n\in\mathbb{N}$ satisfy $\mathbb{P}^{(\mathbf{Y}_{n},\mathbf{X}_{n})}\in H_{0}^{n}$ , let $\widetilde{\mathbf{R}}_{n}$ be the scaled residuals defined in (3.3) corresponding to $(\mathbf{Y}_{n},\mathbf{X}_{n})$ , let $\mathcal{E}_{n}\subseteq\mathcal{P}(\{1,\dots,n\})$ be a sequence of collections of pairwise disjoint environments satisfying conditions (C1) and (C3,k). Then, it holds for all $e_{n},f_{n}\in\mathcal{E}_{n}$ that

[TABLE]

A proof of this result is given in Appendix C.1.2.

Next, we give the corresponding proposition for the asymptotic distribution of our test statistics under the alternative hypothesis $H_{A}$.

Proposition C.3 (asymptotic distribution under $H_{A}$ ).

Let $(\mathbf{Y}_{n},\mathbf{X}_{n})_{n\in\mathbb{N}}$ satisfy Assumption 2 and for all $n\in\mathbb{N}$ satisfy $\mathbb{P}^{(\mathbf{Y}_{n},\mathbf{X}_{n})}\in H_{A}^{n}(a_{n},b_{n})$ , let $\widetilde{\mathbf{R}}_{n}$ be the scaled residuals defined in (3.3) corresponding to $(\mathbf{Y}_{n},\mathbf{X}_{n})$ . Additionally, let $i,j\in\{1,\dots,L+1\}$ such that $a_{n}=\lvert\sigma^{2}_{e_{i}(\operatorname{CP}^{*})}-\sigma^{2}_{e_{j}(\operatorname{CP}^{*})}\rvert$ and $b_{n}=\lVert\beta_{e_{i}(\operatorname{CP}^{*})}-\beta_{e_{j}(\operatorname{CP}^{*})}\rVert_{2}$ . Then, assume that $f_{n}\subseteq e_{i}(\operatorname{CP}^{*}_{n})$ and $g_{n}\subseteq e_{j}(\operatorname{CP}^{*}_{n})$ are sequences satisfying that for $e_{n}\in\{f_{n},g_{n}\}$ the sequences $(\sigma_{e_{n}}^{2})_{n\in\mathbb{N}}$ and $(\beta_{e_{n}})_{n\in\mathbb{N}}$ are convergent and the limit of $\sigma_{e_{n}}^{2}$ is strictly positive and the sequence $(\{f_{n},g_{n}\})_{n\in\mathbb{N}}$ satisfies assumptions (C1) and (C3,k). Then it holds that

[TABLE]

Moreover, let $(\omega_{n})_{n\in\mathbb{N}}$ be a sequence which satisfies $\tfrac{1}{\sqrt{r_{n}}}=\mathcal{O}(\omega_{n})$ and additionally at least one of the two conditions $\omega_{n}=o(a_{n})$ or $\omega_{n}=o(b_{n}^{2})$ . Then it also holds for all $t\geq 0$ that

[TABLE]

A proof of this result is given in Appendix C.1.2.

C.1.2 Proofs of intermediate results

Proof (Lemma C.1).

The result is given by the following straightforward calculation,

[TABLE]

which completes the proof of Lemma C.1. $\square$

Proof (Proposition C.2).

Under $H_{0,S}$ for all $n\in\mathbb{N}$ there exist fixed $\beta_{n}\in\mathbb{R}^{d\times 1}$ and $\sigma_{n}^{2}\in\mathbb{R}_{>0}$ such that for all $e_{n}\in\mathcal{E}_{n}$ it holds that $\beta_{e_{n}}=\beta_{n}$ and $\sigma_{e_{n}}^{2}=\sigma_{n}^{2}$ . The main idea in the first part of the proof is to use the representation of $T_{e_{1},e_{2}}^{3}$ given in Lemma C.1, analyze the convergence of all terms individually and finally conclude by combining the convergences.

We begin by proving for all $e_{n},f_{n}\in\mathcal{E}_{n}$ the following estimates

(a) $\mathbb{E}\left(\left(\frac{1}{\lvert e_{n}\rvert}\boldsymbol{\varepsilon}_{e_{n}}^{\top}\boldsymbol{\varepsilon}_{e_{n}}-\sigma_{n}^{2}\right)^{2k}\right)\leq\frac{C_{1}}{\lvert e_{n}\rvert^{k}}$

(b) $\mathbb{E}\left(\left(\frac{1}{\lvert e_{n}\rvert}\boldsymbol{\varepsilon}_{e_{n}}^{\top}\mathbf{X}_{e_{n}}(\beta_{n}-\hat{\beta}_{f_{n}})\right)^{2k}\right)\leq\frac{C_{2}}{\lvert f_{n}\rvert^{k}}$

(c) $\mathbb{E}\left(\left(\frac{1}{\lvert e_{n}\rvert}(\beta_{n}-\hat{\beta}_{f_{n}})^{\top}\mathbf{X}_{e_{n}}^{\top}\mathbf{X}_{e_{n}}(\beta_{n}-\hat{\beta}_{f_{n}})\right)^{2k}\right)\leq\frac{C_{3}}{\lvert f_{n}\rvert^{2k}}$

To prove (a) consider the following calculation,

[TABLE]

where $C>0$ is a constant satisfying for all $n\in\mathbb{N}$ and $\ell\in\{1,\dots,k\}$ that $\mathbb{E}\left((\varepsilon_{i}^{2}-\sigma_{n}^{2})^{2\ell}\right)\leq C.$
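As a sanity check on the rate in (a), the case $k=1$ can be computed directly. The sketch below assumes that the entries of $\boldsymbol{\varepsilon}_{e_{n}}$ are i.i.d. $\mathcal{N}(0,\sigma_{n}^{2})$, so that the cross terms vanish and $\operatorname{Var}(\varepsilon_{i}^{2})=2\sigma_{n}^{4}$. It illustrates the bound and is not part of the original argument.

```latex
\[
\mathbb{E}\Bigl(\bigl(\tfrac{1}{\lvert e_{n}\rvert}
  \boldsymbol{\varepsilon}_{e_{n}}^{\top}\boldsymbol{\varepsilon}_{e_{n}}
  -\sigma_{n}^{2}\bigr)^{2}\Bigr)
= \frac{1}{\lvert e_{n}\rvert^{2}}
  \sum_{i\in e_{n}}\mathbb{E}\bigl((\varepsilon_{i}^{2}-\sigma_{n}^{2})^{2}\bigr)
= \frac{2\sigma_{n}^{4}}{\lvert e_{n}\rvert}.
\]
```

Under this assumption, and provided the variances $\sigma_{n}^{2}$ are bounded in $n$, the right-hand side is of the order $1/\lvert e_{n}\rvert$, in line with (a) for $k=1$.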

Next, we prove (b). An application of the Cauchy-Schwarz inequality leads to the following inequality

[TABLE]

We now distinguish between the two cases $e_{n}=f_{n}$ and $e_{n}\cap f_{n}=\varnothing$ . Begin with the case $e_{n}\cap f_{n}=\varnothing$ , then $\boldsymbol{\varepsilon}_{e_{n}}$ and $\mathbf{X}_{e_{n}}$ are both independent of $\hat{\beta}_{f_{n}}$ and hence (C.4) together with a spectral inequality imply that

[TABLE]

So together with Theorem D.1, the moment bounds on $\lambda_{\max}\left(\tfrac{1}{\lvert e_{n}\rvert}\mathbf{X}_{e_{n}}^{\top}\mathbf{X}_{e_{n}}\right)$ and the fact that $\boldsymbol{\varepsilon}_{n}$ is Gaussian distributed this implies that

[TABLE]

Next, assume that $e_{n}=f_{n}$ . Then, defining the projection matrix $\mathbf{P}_{e_{n}}\coloneqq\mathbf{X}_{e_{n}}(\mathbf{X}_{e_{n}}^{\top}\mathbf{X}_{e_{n}})^{-1}\mathbf{X}_{e_{n}}^{\top}$ we get that

[TABLE]

where in the last step we used the same argument as in (D.4). This completes the proof of (b).

Finally, in order to prove (c) we again distinguish between the two cases. Begin by assuming that $e_{n}=f_{n}$ , then together with Theorem D.1 we get that

[TABLE]

Furthermore, assume $e_{n}\cap f_{n}=\varnothing$ then, using a spectral inequality together with Theorem D.1 we get that

[TABLE]

which completes the proof of (c).

We can now combine these results to analyze the convergence properties of the test statistic $T^{\max,\mathcal{F}^{1}(\mathcal{E}_{n})}_{3}$ . As an intermediate step consider the following two statistics,

[TABLE]

and

[TABLE]

Using (a), (b) and (c) together with the Minkowski inequality we get that

[TABLE]

Hence, with a union bound and Chebyshev’s inequality we get for all constants $M>0$ that

[TABLE]

This implies,

[TABLE]

Similarly, for $V_{f_{n}}$ in (C.6) we get

[TABLE]

Again, using Chebyshev’s inequality we get for all constants $M>0$ that

[TABLE]

which implies that

[TABLE]

Hence, (C.7) and (C.8) together with Lemma D.2 imply that

[TABLE]

Since for any $M\geq 1$ it holds that

[TABLE]

we also have that

[TABLE]

This completes the proof of Proposition C.2. $\square$

Proof (Proposition C.3).

We divide the proof into two parts. In the first part we prove (C.2) and then in the second part we prove (C.3) using some results from the first part.

**Part 1:** Similar to the proof of Proposition C.2, the main idea in the proof is to use the representation of $T_{e_{1},e_{2}}^{3}$ given in Lemma C.1, analyze the convergence of all terms individually and finally conclude by combining the convergences.

We begin by proving the following inequalities

(a) for $e_{n}\in\{f_{n},g_{n}\}$ : $\big{\lVert}\frac{1}{\lvert e_{n}\rvert}\boldsymbol{\varepsilon}_{e_{n}}^{\top}\boldsymbol{\varepsilon}_{e_{n}}-\sigma_{e_{n}}^{2}\big{\rVert}_{L^{2}}\leq\frac{C}{\sqrt{\lvert e_{n}\rvert}}$

(b) $\big{\lVert}\frac{1}{\lvert f_{n}\rvert}\boldsymbol{\varepsilon}_{f_{n}}^{\top}\mathbf{X}_{f_{n}}(\beta_{f_{n}}-\hat{\beta}_{g_{n}})\big{\rVert}_{L^{2}}\leq\frac{C_{1}}{\sqrt{\lvert f_{n}\rvert\lvert g_{n}\rvert}}+\frac{C_{2}}{\sqrt{\lvert f_{n}\rvert}}b_{n}$

(c) $\big{\lVert}\frac{1}{\lvert g_{n}\rvert}\boldsymbol{\varepsilon}_{g_{n}}^{\top}\mathbf{X}_{g_{n}}(\beta_{g_{n}}-\hat{\beta}_{g_{n}})\big{\rVert}_{L^{2}}\leq\frac{C}{\sqrt{\lvert g_{n}\rvert}}$

(d) $\big{\lVert}\frac{1}{\lvert f_{n}\rvert}(\beta_{f_{n}}-\hat{\beta}_{g_{n}})^{\top}\mathbf{X}_{f_{n}}^{\top}\mathbf{X}_{f_{n}}(\beta_{f_{n}}-\hat{\beta}_{g_{n}})\big{\rVert}_{L^{2}}\leq\frac{C_{1}}{\lvert g_{n}\rvert}+C_{2}b_{n}^{2}$

(e) $\big{\lVert}\frac{1}{\lvert g_{n}\rvert}(\beta_{g_{n}}-\hat{\beta}_{g_{n}})^{\top}\mathbf{X}_{g_{n}}^{\top}\mathbf{X}_{g_{n}}(\beta_{g_{n}}-\hat{\beta}_{g_{n}})\big{\rVert}_{L^{2}}\leq\frac{C}{\lvert g_{n}\rvert}$

Let $e_{n}\in\{f_{n},g_{n}\}$ . We first show (a),

[TABLE]

Since the sequence $(\sigma_{e_{n}})_{n\in\mathbb{N}}$ is assumed to be convergent this proves (a). In order to prove (b) we use the independence of $\mathbf{X}$ and $\boldsymbol{\varepsilon}$ and the Cauchy-Schwarz inequality to get

[TABLE]

where in the last step we used that the sequence $(\sigma_{e_{n}})_{n\in\mathbb{N}}$ is convergent and $\mathbb{E}(\lVert\mathbf{X}_{i}\rVert^{2})$ (for $i\in f_{n}$ ) is bounded. A similar inequality, additionally using Theorem D.1, leads to

[TABLE]

Finally, combining (C.9) and (C.10) with the Minkowski inequality proves (b).

In order to prove (c), define the projection matrix $\mathbf{P}_{g_{n}}\coloneqq\mathbf{X}_{g_{n}}(\mathbf{X}_{g_{n}}^{\top}\mathbf{X}_{g_{n}})^{-1}\mathbf{X}_{g_{n}}^{\top}$ . Then we get that

[TABLE]

where in the last step we used the same argument as in (D.4).

Next, we prove (d). This is again a straightforward estimate using Minkowski’s inequality, the inequality $(a+b)^{2}\leq 2a^{2}+2b^{2}$ and Theorem D.1,

[TABLE]

Finally, (e) is an immediate consequence of Theorem D.1,

[TABLE]

As in the proof of Proposition C.2 consider the following two statistics,

[TABLE]

and

[TABLE]

Using (a), (b) and (d) together with the Minkowski inequality we get that

[TABLE]

Similarly, using (a), (c) and (e) we get for $V_{g_{n}}$ in (C.12) that

[TABLE]

Since $L^{2}$ convergence implies convergence in probability, (C.13) and (C.14) imply that

[TABLE]

Hence, using Lemma D.2 it holds that

[TABLE]

as $n\rightarrow\infty$ . This completes the first part of the proof.

**Part 2:** Next, we prove (C.3). Since $\sigma_{g_{n}}^{2}$ converges to a positive constant and $\omega_{n}$ converges to zero, there exist $c,C,\delta>0$ and $n_{0}\in\mathbb{N}$ such that for all $n\in\{n_{0},n_{0}+1,\dots\}$ it holds that

[TABLE]

Let $n\in\{n_{0},n_{0}+1,\dots\}$ , then on the event $\{\lvert V_{g_{n}}-\sigma_{g_{n}}^{2}\rvert\leq\delta\omega_{n}\}$ it holds that

[TABLE]

which in particular implies for all $t>0$ that

[TABLE]

Therefore, in order to prove (C.3) it is sufficient to show that $\lvert V_{g_{n}}-\sigma_{g_{n}}^{2}\rvert=o_{\mathbb{P}}(\omega_{n})$ and that for all $t\in\mathbb{R}$ it holds that

[TABLE]

However, by (C.14) and the fact that $L^{2}$ convergence implies convergence in probability we have already shown that $\lvert V_{g_{n}}-\sigma^{2}_{g_{n}}\rvert=o_{\mathbb{P}}(\omega_{n})$ . It thus remains to prove (C.15). To simplify the notation, we make the following definitions

[TABLE]

For the proof we require two more intermediate results. Let $Z_{n}^{A}$ and $Z_{n}^{C}$ be sequences satisfying $\tfrac{1}{a_{n}}Z_{n}^{A}\overset{\mathbb{P}}{\rightarrow}0$ and $\tfrac{1}{b_{n}^{2}}Z_{n}^{C}\overset{\mathbb{P}}{\rightarrow}0$ as $n\rightarrow\infty$ . We want to show that if $\omega_{n}=o(a_{n})$ it holds that

[TABLE]

and if $\omega_{n}=o(b_{n}^{2})$ it holds that

[TABLE]

Define $\tilde{b}_{n}\coloneqq\mathbb{E}\left(\lVert X_{f_{n}}(\beta_{f_{n}}-\beta_{g_{n}})\rVert_{\mathbb{R}^{\lvert f_{n}\rvert}}^{2}\right)$ . The proof of these two results relies on the Paley-Zygmund inequality (e.g. Weber, 2009, Section 8.3) and the following four inequalities,

(1) $\lvert\mathbb{E}(A_{n})\rvert=a_{n}$ ,

(2) $\mathbb{E}(\tilde{A}_{n}^{2})\leq\frac{c}{r_{n}}$ ,

(3) $\mathbb{E}\left(C_{n}\right)\geq\tilde{b}_{n}^{2}-\frac{cb_{n}}{\sqrt{r_{n}}}$ ,

(4) $\mathbb{E}(C_{n}^{2})\leq\tilde{b}_{n}^{2}+\frac{c}{\sqrt{r_{n}}}$ .
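For reference, the Paley-Zygmund inequality can be stated in its standard form as follows. The symbols $Z$ and $\theta$ in this sketch are generic and are not tied to the sequences defined above.

```latex
\[
\mathbb{P}\bigl(Z>\theta\,\mathbb{E}(Z)\bigr)
\;\geq\;(1-\theta)^{2}\,\frac{\mathbb{E}(Z)^{2}}{\mathbb{E}(Z^{2})},
\qquad Z\geq 0,\quad 0<\mathbb{E}(Z^{2})<\infty,\quad \theta\in[0,1].
\]
```

In particular, whenever $\mathbb{E}(Z^{2})/\mathbb{E}(Z)^{2}\rightarrow 1$ along a sequence, the lower bound tends to $(1-\theta)^{2}$, which can be made arbitrarily close to one by choosing $\theta$ small.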

We begin by proving (C.16). Hence, assume that $\omega_{n}=o(a_{n})$ , for any $\theta\in[0,1]$ we can use (1) together with Jensen’s inequality and the Paley-Zygmund inequality to get that

[TABLE]

Applying Jensen’s inequality once more together with (1) and (2) leads to

[TABLE]

This implies for all $\theta\in(0,1]$ that

[TABLE]

Furthermore, since we assumed that $\tfrac{1}{a_{n}}Z_{n}^{A}\overset{\mathbb{P}}{\rightarrow}0$ and that $\omega_{n}=o(a_{n})$ this in particular implies for all $\theta\in(0,1]$ that

[TABLE]

and since $\theta$ can be chosen independently of $n$ we have proved that

[TABLE]

Next, we use a similar reasoning to prove (C.17). Hence, assume that $\omega_{n}=o(b_{n}^{2})$ , for any $\theta\in[0,1]$ we can use (3) together with the Paley-Zygmund inequality to get that

[TABLE]

Making use of (3) and (4) leads to

[TABLE]

where in the last step we used that there exist $c,C>0$ such that $c\cdot b_{n}\leq\tilde{b}_{n}\leq C\cdot b_{n}$ . This implies for all $\theta\in(0,1]$ that

[TABLE]

Furthermore, since we assumed that $\tfrac{1}{b_{n}^{2}}Z_{n}^{C}\overset{\mathbb{P}}{\rightarrow}0$ and that $\omega_{n}=o(b_{n}^{2})$ this in particular implies for all $\theta\in(0,1]$ that

[TABLE]

and since $\theta$ can be chosen independently of $n$ we have proved that

[TABLE]

Finally, we are ready to prove (C.15). We begin by observing that,

[TABLE]

Since $\omega_{n}$ by definition has a slower (or equal) convergence rate than $\frac{1}{\sqrt{r_{n}}}$ , we can interpret it as the fastest rate at which the alternatives can converge without losing detectability. Keeping this intuition in mind, we distinguish the following three cases,

(case 1) variance and regression shifts are detectable, i.e. $\omega_{n}=o(a_{n})$ and $\omega_{n}=o(b_{n}^{2})$ ,

(case 2) only variance shifts are detectable, i.e. $\omega_{n}=o(a_{n})$ and $b_{n}^{2}=\mathcal{O}(\omega_{n})$ ,

(case 3) only regression shifts are detectable, i.e. $\omega_{n}=o(b_{n}^{2})$ and $a_{n}=\mathcal{O}(\omega_{n})$ .

• (case 1): Define $Z_{n}\coloneqq t\omega_{n}+\lvert B_{n}\rvert$ , then using $\omega_{n}=o(a_{n})$ and $\omega_{n}=o(b_{n}^{2})$ together with (b) it holds that $\tfrac{1}{a_{n}}Z_{n}\overset{\mathbb{P}}{\rightarrow}0$ and $\tfrac{1}{b_{n}^{2}}Z_{n}\overset{\mathbb{P}}{\rightarrow}0$ . Hence, (C.20) together with a union bound and (C.18) and (C.19) leads to

[TABLE]

• (case 2): Define $Z_{n}\coloneqq t\omega_{n}+\lvert B_{n}\rvert+C_{n}$ , then using $\omega_{n}=o(a_{n})$ and $b_{n}^{2}=\mathcal{O}(\omega_{n})$ together with (b) and (d) it holds that $\tfrac{1}{a_{n}}Z_{n}\overset{\mathbb{P}}{\rightarrow}0$ . Hence, (C.20) together with (C.18) leads to

[TABLE]

• (case 3): Define $Z_{n}\coloneqq t\omega_{n}+\lvert B_{n}\rvert+\lvert A_{n}\rvert$ , then using $\omega_{n}=o(b_{n}^{2})$ and $a_{n}=\mathcal{O}(\omega_{n})$ together with (a) and (d) it holds that $\tfrac{1}{b_{n}^{2}}Z_{n}\overset{\mathbb{P}}{\rightarrow}0$ . Hence, (C.20) together with (C.19) leads to

[TABLE]

Thus we have proved (C.15), which completes the proof of Proposition C.3. $\square$

C.2 Theorem 4.1

The proof of Theorem 4.1 is very similar to the proof of Theorem 4.2. Essentially, we use the same methods to show that the test statistic $T^{\max,\mathcal{F}^{1}(\mathcal{E}_{n})}_{1}$ (see (3.9)) is capable of detecting changes in the regression coefficients with a rate of $b_{n}$ and that $T^{\max,\mathcal{F}^{1}(\mathcal{E}_{n})}_{2}$ (see (3.10)) is capable of detecting changes in the residual variance with a rate of $a_{n}$ . Combining both results using a Bonferroni adjustment preserves these rates. In order to not repeat all the details we will therefore often refer to the proof in Section C.1.

Proof (Theorem 4.1).

Using the same arguments and notation as in the proof of Theorem 4.2 we can use Proposition C.5 and Proposition C.6 to show that

• for $\omega_{n}=o(b_{n})$ it holds that

[TABLE]

• and for $\omega_{n}=o(a_{n})$ it holds that

[TABLE]

Combining these tests using a Bonferroni adjustment preserves the consistency properties of each of the individual tests, which completes the proof of Theorem 4.1. $\square$
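The Bonferroni combination used in this argument can be illustrated with a minimal sketch. It assumes each individual test reports a p-value; the function names and the default level are hypothetical and chosen only for illustration.

```python
def bonferroni_combined_test(p_values, alpha=0.05):
    """Reject the joint null if any p-value is at most alpha / m.

    p_values -- p-values of the m individual tests (here m = 2 would
                correspond to one test per type of shift)
    alpha    -- overall significance level (illustrative default)
    """
    m = len(p_values)
    if m == 0:
        raise ValueError("need at least one p-value")
    threshold = alpha / m  # each test is run at level alpha / m
    return any(p <= threshold for p in p_values)


def bonferroni_adjusted_p_value(p_values):
    """Equivalent formulation: m times the smallest p-value, capped at 1."""
    m = len(p_values)
    if m == 0:
        raise ValueError("need at least one p-value")
    return min(1.0, m * min(p_values))
```

Because each individual test is only run at a level smaller by a constant factor, a test that rejects with probability tending to one at every fixed level keeps this property after the adjustment.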

C.2.1 Intermediate results

Lemma C.4 (representation of $T_{e_{1},e_{2}}^{1}$ and $T_{e_{1},e_{2}}^{2}$ for true change points).

Let $e_{1},e_{2}\in\mathcal{E}_{n}(\operatorname{CP}_{n}^{*})$ then it holds that

[TABLE]

The proof of this result is immediate using the same transformation as in the proof of Lemma C.1.

The following proposition gives the asymptotic distribution of our test statistics under the null hypothesis $H_{0}$ .

Proposition C.5 (asymptotic distribution under $H_{0}$ ).

Let $(\mathbf{Y}_{n},\mathbf{X}_{n})_{n\in\mathbb{N}}$ satisfy Assumption 2 and for all $n\in\mathbb{N}$ satisfy $\mathbb{P}^{(\mathbf{Y}_{n},\mathbf{X}_{n})}\in H_{0}^{n}$ , let $\widetilde{\mathbf{R}}_{n}$ be the scaled residuals defined in (3.3) corresponding to $(\mathbf{Y}_{n},\mathbf{X}_{n})$ , let $\mathcal{E}_{n}\subseteq\mathcal{P}(\{1,\dots,n\})$ be a sequence of collections of pairwise disjoint environments satisfying conditions (C1) and (C3,k). Then, it holds for all $e_{n},f_{n}\in\mathcal{E}_{n}$ that

[TABLE]

as well as

[TABLE]

A proof of this result is given in Appendix C.2.2.

Next, we give the corresponding proposition for the asymptotic distribution of our test statistics under the alternative hypothesis $H_{A}$ .

Proposition C.6 (asymptotic distribution under $H_{A}$ ).

Let $(\mathbf{Y}_{n},\mathbf{X}_{n})_{n\in\mathbb{N}}$ satisfy Assumption 2 and for all $n\in\mathbb{N}$ satisfy $\mathbb{P}^{(\mathbf{Y}_{n},\mathbf{X}_{n})}\in H_{A}^{n}(a_{n},b_{n})$ , let $\widetilde{\mathbf{R}}_{n}$ be the scaled residuals defined in (3.3) corresponding to $(\mathbf{Y}_{n},\mathbf{X}_{n})$ . Additionally, let $i,j\in\{1,\dots,L+1\}$ such that $a_{n}=\lvert\sigma^{2}_{e_{i}(\operatorname{CP}^{*})}-\sigma^{2}_{e_{j}(\operatorname{CP}^{*})}\rvert$ and $b_{n}=\lVert\beta_{e_{i}(\operatorname{CP}^{*})}-\beta_{e_{j}(\operatorname{CP}^{*})}\rVert_{2}$ . Then, assume that $f_{n}\subseteq e_{i}(\operatorname{CP}^{*}_{n})$ and $g_{n}\subseteq e_{j}(\operatorname{CP}^{*}_{n})$ are sequences satisfying that for $e_{n}\in\{f_{n},g_{n}\}$ the sequences $(\sigma_{e_{n}}^{2})_{n\in\mathbb{N}}$ and $(\beta_{e_{n}})_{n\in\mathbb{N}}$ are convergent and the limit of $\sigma_{e_{n}}^{2}$ is strictly positive and the sequence $(\{f_{n},g_{n}\})_{n\in\mathbb{N}}$ satisfies assumptions (C1) and (C3,k). Moreover, let $(\omega_{n})_{n\in\mathbb{N}}$ be a sequence which satisfies $\tfrac{1}{\sqrt{r_{n}}}=\mathcal{O}(\omega_{n})$ then if $\omega_{n}=o(b_{n})$ it holds for all $t\geq 0$ that

[TABLE]

and if $\omega_{n}=o(a_{n})$ it holds for all $t\geq 0$ that

[TABLE]

A proof of this result is given in Appendix C.2.2.

C.2.2 Proofs of intermediate results

Proof (Proposition C.5).

We begin with the results for the test statistic $T^{2}$ . Using the notation from the proof of Proposition C.2 and Lemma C.4 it holds that

[TABLE]

Moreover, similar computations show that

[TABLE]

Applying Lemma D.2 proves the desired results.

Next, we consider the test statistic $T^{1}$ . Using the expansion from Lemma C.4 together with the convergence of the OLS estimator (see Theorem D.1) it holds that

[TABLE]

As in the proof of Proposition C.2 we can now apply Chebyshev’s inequality together with a union bound to also get that

[TABLE]

This completes the proof of Proposition C.5. $\square$

Proof (Proposition C.6).

We begin by proving (C.23). By the representation in Lemma C.4 and since there exists a constant $c>0$ such that $\lVert\widetilde{\mathbf{R}}_{n}\rVert_{2}\geq c$ it holds for all $t\geq 0$ that

[TABLE]

Using the inequalities $\lVert\hat{\beta}_{f_{n}}-\hat{\beta}_{g_{n}}\rVert_{2}\geq b_{n}-\lVert\hat{\beta}_{f_{n}}-\beta_{f_{n}}\rVert_{2}-\lVert\hat{\beta}_{g_{n}}-\beta_{g_{n}}\rVert_{2}$ and $\lVert\hat{\beta}_{f_{n}}-\hat{\beta}_{g_{n}}\rVert_{2}\leq\lVert\hat{\beta}_{f_{n}}-\beta_{f_{n}}\rVert_{2}+\lVert\hat{\beta}_{g_{n}}-\beta_{g_{n}}\rVert_{2}+b_{n}$ we can derive the following two inequalities

[TABLE]

As in the proof of Proposition C.3 we can apply the Paley-Zygmund inequality and use that $\omega_{n}=o(b_{n})$ to show that

[TABLE]

This completes the proof of (C.23).

Next, we prove (C.24). Since $\sigma_{g_{n}}^{2}$ converges to a positive constant and $\omega_{n}$ converges to zero, there exist $c,C,\delta>0$ and $n_{0}\in\mathbb{N}$ such that for all $n\in\{n_{0},n_{0}+1,\dots\}$ it holds that

[TABLE]

Let $n\in\{n_{0},n_{0}+1,\dots\}$ , then on the event $\{\lvert V_{f_{n}}-\sigma_{f_{n}}^{2}\rvert\leq\delta\omega_{n}\}\cup\{\lvert V_{g_{n}}-\sigma_{g_{n}}^{2}\rvert\leq\delta\omega_{n}\}$ it holds that

[TABLE]

which in particular implies for all $t\geq 0$ that

[TABLE]

In the proof of Proposition C.3 (see (C.14)) we showed that $\lvert V_{f_{n}}-\sigma_{f_{n}}^{2}\rvert=o_{\mathbb{P}}(\omega_{n})$ and $\lvert V_{g_{n}}-\sigma_{g_{n}}^{2}\rvert=o_{\mathbb{P}}(\omega_{n})$ . Therefore, using the assumption $\omega_{n}=o(a_{n})$ this implies that

[TABLE]

which completes the proof of Proposition C.6. $\square$

C.3 Corollary 4.4

Proof (Corollary 4.4).

Based on the empirical coverage property given in Proposition 2.2 it holds that

[TABLE]

Moreover, using the union bound we get that

[TABLE]

Finally, using Theorem 4.1 and combining (C.25) with (C.26) it holds that

[TABLE]

which completes the proof of Corollary 4.4. $\square$
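To make the object of Corollary 4.4 concrete, the following is a minimal sketch of the estimator $\hat{S}$, assuming the usual invariant causal prediction definition as the intersection of all sets $S$ for which $H_{0,S}$ is accepted (this definition is not restated in this appendix). The callable `is_accepted` and the return convention for the case where every set is rejected are hypothetical.

```python
from itertools import combinations


def estimate_invariant_set(d, is_accepted):
    """Intersect all predictor sets S whose null hypothesis is accepted.

    d           -- number of predictors, indexed 1..d
    is_accepted -- hypothetical callable: frozenset S -> True if the test
                   for H_{0,S} does NOT reject
    Returns the intersection as a frozenset, or None if every set is
    rejected (a convention chosen for this sketch only).
    """
    accepted = [
        frozenset(subset)
        for size in range(d + 1)
        for subset in combinations(range(1, d + 1), size)
        if is_accepted(frozenset(subset))
    ]
    if not accepted:
        return None
    return frozenset.intersection(*accepted)
```

The sketch enumerates all $2^{d}$ subsets and is therefore only practical for a small number of predictors.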

C.4 Extension to uniform consistency

As discussed in Remark 4.3 our asymptotic consistency results can be extended to hold uniformly. The precise statement is given in the following theorem.

Theorem C.7 (uniform rate consistency).

Assume Assumptions 2 and 3 hold, let $S\subseteq\{1,\dots,d\}$ and let $(\mathcal{E}_{n})_{n\in\mathbb{N}}$ be a sequence of collections of pairwise disjoint non-empty environments with the properties (C1), (C2) and (C3,k) where condition (C2) is extended to ensure that the variances are uniformly bounded, i.e.,

[TABLE]

(the bounds in condition (C3,k) are already uniform across $\bar{H}^{n}_{A,S}(\bar{a}_{n},\bar{b}_{n})$ ). Moreover, assume that for all $n\in\mathbb{N}$ it holds that $(\mathbf{Y}_{n},\mathbf{X}_{n})\sim P_{n}\in\bar{H}_{A,S}^{n}(\bar{a}_{n},\bar{b}_{n})$ , where $\bar{a}_{n}$ and $\bar{b}_{n}$ satisfy the following condition

[TABLE]

Then it holds that

[TABLE]

where $\varphi$ is either the combined or the decoupled test.

The proof is a straightforward extension of the proofs given in Sections C.1 and C.2. We illustrate the changes only for the combined test. Similar arguments can be applied to the decoupled test. In order to prove the result for the combined test we can use the exact same proof as for Theorem 4.2 with the exception that we need to strengthen the result from Proposition C.3. In particular we need to extend equation (C.3) in the following way,

[TABLE]

This can be accomplished by ensuring that all constants appearing in the proof of Proposition C.3 hold uniformly across all potential alternatives in $\bar{H}^{n}_{A,S}(\bar{a}_{n},\bar{b}_{n})$ . In particular, we need to make sure this holds for all constants appearing in the inequalities (a), (b), (c), (d) and (e). This is, however, immediate given the additional assumption

[TABLE]

C.5 Proposition 6.1

Proof.

Throughout the proof we use the notation $\mathbf{Z}\coloneqq(Z_{p+1},Z_{p+2},\dots,Z_{n})^{\top}\in\mathbb{R}^{(n-p)\times(d+p(d+1))}$ where $Z_{t}=(X_{t},Y_{t-1},X_{t-1},\dots,Y_{t-p},X_{t-p})$ . For a fixed significance level $\alpha\in(0,1)$ we construct our test for the time series setting as follows. Let $\tilde{\boldsymbol{\varepsilon}}_{1},\tilde{\boldsymbol{\varepsilon}}_{2},\dots\overset{\text{\tiny iid}}{\sim}\mathcal{N}(\mathbf{0},\mathbf{Id}_{n-p})$ , then given that $(\mathbf{Y},\mathbf{X})$ satisfies $\mathbb{P}^{(\mathbf{Y},\mathbf{X})}\in\widetilde{H}_{0,S,p}$ it holds for all $\mathbf{z}\in\mathbb{R}^{(n-p)\times(d+p(d+1))}$ and for all $i\in\mathbb{N}$ that the random variables defined by

[TABLE]

are i.i.d. copies of $\widetilde{\mathbf{R}}^{S,p}\mid\mathbf{Z}=\mathbf{z}$ , where $\widetilde{\mathbf{R}}^{S,p}$ are the scaled residuals defined in (6.2). To see this, use the properties of the projection matrix $\mathbf{P}^{S,p}_{\mathbf{Z}}$ and the fact that $\mathbf{Y}=\mathbf{Z}^{S,p}\eta+\boldsymbol{\varepsilon}$ for some $\eta\in\mathbb{R}^{(\lvert S\rvert+p(d+1))\times 1}$ .

Next, for all $B\in\mathbb{N}$ define the cut-off functions $c_{T,B}:\mathbb{R}^{(n-p)\times(d+p(d+1))}\rightarrow\mathbb{R}$ given for all $\mathbf{z}\in\mathbb{R}^{(n-p)\times(d+p(d+1))}$ by

[TABLE]

Then, our hypothesis tests $(\varphi^{S,p}_{T,B})_{B\in\mathbb{N}}$ are defined for all $B\in\mathbb{N}$ by

[TABLE]

Using the well-established fact (see e.g. Lehmann and Romano, 2005, Example 11.2.13) that the quantiles of the empirical distribution converge to the quantiles of the true distribution, we get $\mathbb{P}$ -a.s. for all $\mathbf{z}\in\mathbb{R}^{(n-p)\times(d+p(d+1))}$ that

[TABLE]

where $F_{T(\widetilde{\mathbf{R}}^{S,p,\mathbf{z}}_{1})}^{-1}$ is the quantile function of the random variable $T(\widetilde{\mathbf{R}}^{S,p,\mathbf{z}}_{1})$ . Hence, conditioning on $\mathbf{Z}$ leads to

[TABLE]

which completes the proof of Proposition 6.1. $\square$
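The construction in this proof (simulate standard normal noise, form scaled residuals, take an empirical quantile as cut-off) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: scaled residuals are taken to be least squares residuals divided by their Euclidean norm, the cut-off uses NumPy's default empirical quantile rather than the exact definition in the displayed equation, and `variance_shift_statistic` is a hypothetical stand-in for the test statistic $T$. The design matrix `Z` is assumed to already contain the selected predictors and lagged variables.

```python
import numpy as np


def scaled_residuals(y, Z):
    """Least squares residuals of y on Z, scaled to unit Euclidean norm."""
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    residuals = y - Z @ coef
    return residuals / np.linalg.norm(residuals)


def variance_shift_statistic(r):
    """Hypothetical T: gap between mean squared residuals of two halves."""
    half = len(r) // 2
    return abs(np.mean(r[:half] ** 2) - np.mean(r[half:] ** 2))


def resampling_test(y, Z, statistic, alpha=0.05, B=500, seed=0):
    """Compare the observed statistic with a simulated (1 - alpha) quantile."""
    rng = np.random.default_rng(seed)
    observed = statistic(scaled_residuals(y, Z))
    # Under the null, scaled residuals of pure N(0, Id) noise projected with
    # the same design matrix have the same conditional distribution.
    simulated = np.array([
        statistic(scaled_residuals(rng.standard_normal(len(y)), Z))
        for _ in range(B)
    ])
    cutoff = np.quantile(simulated, 1 - alpha)
    return bool(observed > cutoff), observed, cutoff
```

Scaling by the norm removes the dependence on the unknown noise variance, which is why simulating with unit variance suffices in this sketch.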

Appendix D Auxiliary results

Theorem D.1 (convergence of OLS-estimates (random design)).

Let $(Y_{n,i},X_{n,i})_{i\in\{1,\dots,n\},n\in\mathbb{N}}\subseteq\mathbb{R}\times\mathbb{R}^{d}$ , $(\varepsilon_{n,i})_{i\in\{1,\dots,n\},n\in\mathbb{N}}\subseteq\mathbb{R}$ , $(\sigma_{n})_{n\in\mathbb{N}}\subseteq\mathbb{R}_{>0}$ and $(\beta_{n})_{n\in\mathbb{N}}\subseteq\mathbb{R}^{d}$ satisfy for all $n\in\mathbb{N}$ and for all $i\in\{1,\dots,n\}$ that

[TABLE]

where $\boldsymbol{\varepsilon}_{n}\sim\mathcal{N}(\mathbf{0},\sigma_{n}^{2}\mathbf{Id})$ and $\frac{1}{n}\mathbf{X}_{n}^{\top}\mathbf{X}_{n}$ is $\mathbb{P}$ -a.s. invertible for $n$ sufficiently large. Additionally, let $k\in\mathbb{N}$ and assume there exists a constant $c>0$ such that for all $n\in\mathbb{N}$ it holds that

[TABLE]

Moreover, let $\hat{\beta}_{n}$ denote the OLS-estimator of $\beta_{n}$ based on $(Y_{n,1},X_{n,1}),\dots,(Y_{n,n},X_{n,n})$ . Then, for all $n\in\mathbb{N}$ there exist constants $C_{1},C_{2}>0$ (depending on $k$ ) such that

[TABLE]

In particular, it holds for all $p\in\{1,\dots,2k\}$ that

[TABLE]

Proof.

First, note that given the above assumptions the OLS-estimator can be expressed as

[TABLE]

Using this combined with a spectral estimate we get that,

[TABLE]

where $\mathbf{P}_{n}\coloneqq\mathbf{X}_{n}\left(\mathbf{X}_{n}^{\top}\mathbf{X}_{n}\right)^{-1}\mathbf{X}_{n}^{\top}$ is the projection matrix onto the column space of $\mathbf{X}_{n}$ . Let $\mathbf{V}_{n}$ be a matrix where all columns together form an orthonormal basis of $\mathbb{R}^{n}$ and the first $d$ columns form an orthonormal basis of the column space of $\mathbf{X}_{n}$ . Then it in particular holds that

[TABLE]

Moreover, the orthogonality of $\mathbf{V}_{n}$ implies that the vector $\boldsymbol{\varepsilon}^{*}_{n}\coloneqq\mathbf{V}_{n}^{\top}\boldsymbol{\varepsilon}_{n}$ is again $\mathcal{N}(\mathbf{0},\sigma_{n}^{2}\mathbf{Id})$ distributed. We therefore get that

[TABLE]

where in the last step we used that all moments of a normal distribution up to a fixed order $2k$ can be bounded by a constant $C_{k}$ . Hence, combining (D.3) and (D.4) and using the properties of the conditional expectation we get that

[TABLE]

This proves the first inequality in (D.1).

Next, observe that again by (D.2) it holds that

[TABLE]

Thus, combining (D.5) with (D.4) proves the second inequality in (D.1).

The last part of the theorem is then an immediate consequence of Jensen’s inequality, which thus completes the proof of Theorem D.1. $\square$
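The kind of rate this theorem is concerned with can be checked numerically. The following sketch assumes a standard normal random design and Gaussian noise, and estimates the mean squared error of the OLS estimator by Monte Carlo, which one would expect to decay like $1/n$ (the case $k=1$). The design distribution, coefficient vector and all parameter defaults are illustrative choices, not taken from the paper.

```python
import numpy as np


def ols_mean_squared_error(n, d=3, sigma=1.0, reps=2000, seed=0):
    """Monte Carlo estimate of E||beta_hat - beta||_2^2 for sample size n."""
    rng = np.random.default_rng(seed)
    beta = np.arange(1, d + 1, dtype=float)  # arbitrary true coefficients
    errors = np.empty(reps)
    for i in range(reps):
        X = rng.standard_normal((n, d))              # random design
        y = X @ beta + sigma * rng.standard_normal(n)
        beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        errors[i] = np.sum((beta_hat - beta) ** 2)
    return errors.mean()


if __name__ == "__main__":
    # If the error decays like 1/n, the last column stays roughly constant.
    for n in (50, 100, 200, 400):
        mse = ols_mean_squared_error(n)
        print(n, mse, n * mse)
```

A roughly constant value of `n * mse` across sample sizes is consistent with a bound of order $1/n$ on the second moment of the estimation error.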

Lemma D.2 (convergence rate of fractions).

Let $(X_{n})_{n\in\mathbb{N}},(Y_{n})_{n\in\mathbb{N}}\subseteq\mathbb{R}$ be two sequences of random variables which satisfy that

[TABLE]

where $(x_{n})_{n\in\mathbb{N}}$ , $(y_{n})_{n\in\mathbb{N}}$ and $(a_{n})_{n\in\mathbb{N}}$ are strictly positive, deterministic and convergent sequences satisfying that $\lim_{n\rightarrow\infty}x_{n}>0$ , $\lim_{n\rightarrow\infty}y_{n}>0$ and $\lim_{n\rightarrow\infty}a_{n}=0$ . Then, it holds that

[TABLE]

Proof.

Fix $\varepsilon>0$ , by assumption there exist $M_{1},M_{2}>0$ and $n_{0}\in\mathbb{N}$ such that for all $n\in\{n_{0},n_{0}+1,\dots\}$ it holds that

[TABLE]

Using the convergence of the sequences $(x_{n})_{n\in\mathbb{N}}$ and $(y_{n})_{n\in\mathbb{N}}$ , there exist $x^{*},y^{*}>0$ and $n_{1}\in\{n_{0},n_{0}+1,\dots\}$ such that for all $n\in\{n_{1},n_{1}+1,\dots\}$ it holds that $y^{*}<y_{n}$ and $x_{n}<x^{*}$ . Moreover, fix $0<\delta<y^{*}$ , then since $\lim_{n\rightarrow\infty}a_{n}=0$ there exists $n_{2}\in\{n_{1},n_{1}+1,\dots\}$ such that for all $n\in\{n_{2},n_{2}+1,\dots\}$ it holds that $a_{n}M_{2}<\delta$ and thus in particular

[TABLE]

Next, observe that for any $M>0$ it holds that

[TABLE]

Therefore, combining (D.6), (D.7) and (D.8) with $M>\max\{\frac{2M_{1}}{y^{*}-\delta},\frac{2M_{2}x^{*}}{y^{*}(y^{*}-\delta)}\}$ we get for all $n\in\{n_{2},n_{2}+1,\dots\}$ that

[TABLE]

which completes the proof of Lemma D.2. $\square$
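The mechanism behind the lemma can be seen from an elementary identity. The following is a sketch in the notation of the lemma and is not a reproduction of the displayed equations of the proof; it assumes $Y_{n}\neq 0$.

```latex
\[
\frac{X_{n}}{Y_{n}}-\frac{x_{n}}{y_{n}}
=\frac{X_{n}-x_{n}}{Y_{n}}-\frac{x_{n}\,(Y_{n}-y_{n})}{y_{n}\,Y_{n}}.
\]
```

On an event where $\lvert X_{n}-x_{n}\rvert\leq a_{n}M_{1}$ and $\lvert Y_{n}-y_{n}\rvert\leq a_{n}M_{2}<\delta$ , and for $n$ large enough that $y^{*}<y_{n}$ and $x_{n}<x^{*}$ , we have $Y_{n}>y^{*}-\delta>0$ , so the two terms are bounded in absolute value by $\frac{a_{n}M_{1}}{y^{*}-\delta}$ and $\frac{a_{n}M_{2}x^{*}}{y^{*}(y^{*}-\delta)}$ respectively. This matches the choice of $M$ in the proof.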

Appendix E Monetary policy data set

The data set used in Section 7.2 and described in Table 2 has been gathered from three different sources, as follows:

• quarterly GDP data for Switzerland from Eurostat (2017b)

• quarterly GDP data for Euro states from Eurostat (2017a)

• monthly business confidence index (BCI) for Switzerland from OECD (2017a)

• monthly consumer price index (CPI) for Switzerland from OECD (2017b)

• monthly balance sheet data SNB from Swiss National Bank (2017)

• monthly call money rate SNB from Swiss National Bank (2017)

• monthly average exchange rates CHF from Swiss National Bank (2017).

From each of these we took data from January 1999 to January 2017 and performed the transformation described in Table 2.

Bibliography

Bollen, K. A. (1989). Structural Equations with Latent Variables. John Wiley & Sons, Inc.
Bühlmann, P., J. Peters, and J. Ernest (2014). CAM: Causal additive models, high-dimensional order search and penalized regression. Annals of Statistics 42(6), 2526–2556.
Chickering, D. M. (2002). Optimal structure identification with greedy search. Journal of Machine Learning Research 3(11), 507–554.
Chow, G. C. (1960). Tests of equality between sets of coefficients in two linear regressions. Econometrica 28(3), 591–605.
Chu, T. and C. Glymour (2008). Search for additive nonlinear time series causal models. Journal of Machine Learning Research 9, 967–991.
Eaton, D. and K. P. Murphy (2007). Exact Bayesian structure learning from uncertain interventions. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 107–114.
Eurostat (2017a). GDP (euro/ecu series) for euro area (19 countries) [EUNNGDP]. Retrieved from https://fred.stlouisfed.org/series/EUNNGDP (Accessed on March 15, 2017).
Eurostat (2017b). GDP for Switzerland [CPMNACSAB1GQCH]. Retrieved from https://fred.stlouisfed.org/series/CPMNACSAB1GQCH (Accessed on March 16, 2017).
