Causal mechanism of extreme river discharges in the upper Danube basin network

Linda Mhalla; Valérie Chavez-Demoulin; Debbie J. Dupuis

arXiv:1907.03555·stat.AP·April 2, 2020


TL;DR

This paper introduces CausEV, a novel method combining extreme value modeling and causal discovery to identify causal links between extreme river discharges, revealing significant causal relations in the upper Danube basin.

Contribution

It develops a new causal inference approach for extreme events using Kolmogorov complexity and the minimum description length principle, applied to hydrological data.

Findings

1. Identifies causal links between Danube and Lech river discharges.

2. Uncovers causal mechanisms underlying extreme hydrological events.

3. Demonstrates effectiveness of the CausEV method in hydrology.


Equations

$\Pr\left\{(M_{n}-b_{n})/a_{n}\leq y\right\}\to G(y),\quad n\to\infty,$

$\text{GEV}_{(\mu,\sigma,\xi)}(y)=\left\{\begin{array}{ll}\exp\left[-\left\{1+\xi(y-\mu)/\sigma\right\}^{-1/\xi}_{+}\right],&\xi\neq 0,\\ \exp\left[-\exp\{-(y-\mu)/\sigma\}\right],&\xi=0,\end{array}\right.$

$\Pr\left\{\left(Y_{n}-b_{n}\right)/a_{n}>u+y\mid\left(Y_{n}-b_{n}\right)/a_{n}>u\right\}\xrightarrow[n\to\infty]{}\left\{\begin{array}{ll}\left(1+\xi y/\tilde{\sigma}_{u}\right)_{+}^{-1/\xi},&\xi\neq 0,\\ \exp(-y/\tilde{\sigma}_{u}),&\xi=0,\end{array}\right.$

$V(\mathbf{z})=\int_{S_{d}}\max\left(\frac{w_{1}}{z_{1}},\ldots,\frac{w_{d}}{z_{d}}\right)\mathrm{d}H(\mathbf{w}),\quad\mathbf{z}\in(0,\infty)^{d},$

$\int_{S_{d}}w_{j}\,\mathrm{d}H(\mathbf{w})=1,\quad j=1,\ldots,d.$

$G(\bm{\alpha}_{k}\mathbf{z}+\bm{\beta}_{k})^{k}=G(\mathbf{z}),\quad\mathbf{z}=(z_{1},\ldots,z_{d})\in\mathbb{R}^{d},$

$C^{EV}(\mathbf{v})$

$C^{EV}(\mathbf{v})=\left\{C^{EV}(v_{1}^{1/k},\ldots,v_{d}^{1/k})\right\}^{k},\quad\mathbf{v}=(v_{1},\ldots,v_{d})\in[0,1]^{d},\quad k>0.$

$\chi_{u}=\Pr\left\{Y_{2}>F_{2}^{-1}(u)\mid Y_{1}>F_{1}^{-1}(u)\right\}\to\chi\geq 0,\quad u\to 1.$

$\partial_{v_{1}}C^{EV}(v_{1},v_{2})$

$\partial_{v_{2}}C^{EV}(v_{1},v_{2})$

$Y=h(X,\epsilon),\quad X\perp\!\!\!\perp\epsilon,$

$\lim_{n\to\infty}\frac{K(X^{n})}{n}=-\int f(x)\log_{2}f(x)\,\mathrm{d}x,\quad\text{a.s.}$

$K(F_{X,Y})=K(F_{X})+K(F_{Y\mid X})$

$K(F_{X})+K(F_{Y\mid X})\overset{+}{\leq}K(F_{Y})+K(F_{X\mid Y}),$

$CL_{\mathcal{M}_{X}}(X)+CL_{\mathcal{M}_{Y\mid X}}(Y\mid X)\leq CL_{\mathcal{M}_{Y}}(Y)+CL_{\mathcal{M}_{X\mid Y}}(X\mid Y),$

$\Pr(X^{\text{ext}}\leq x,Y^{\text{ext}}\leq y)$

$\Pr(X^{\text{ext}}\leq x\mid Y^{\text{ext}}=y>u_{Y})$

$\Pr(Y^{\text{ext}}\leq y\mid X^{\text{ext}}=x>u_{X})$

$Q_{X^{\text{ext}}}(\tau)$

$Q_{Y^{\text{ext}}}(\tau)$

$Q_{X^{\text{ext}}\mid Y^{\text{ext}}=y>u_{Y}}(\tau)$

$Q_{Y^{\text{ext}}\mid X^{\text{ext}}=x>u_{X}}(\tau)$

$(\partial_{v_{2}}C^{EV})^{-1}\{\tau,G(y)\}$

$(\partial_{v_{1}}C^{EV})^{-1}\{F(x),\tau\}$

$CL_{F_{X}}^{\tau}(X^{\text{ext}})=CL_{F_{X}}^{\tau}(\hat{F}_{X})+CL_{F_{X}}^{\tau}(\hat{\epsilon}_{X}\mid\hat{F}_{X}),$

$CL_{F_{X}}^{\tau}(\hat{F}_{X})=CL_{F_{Y}}^{\tau}(\hat{F}_{Y})=\frac{p}{2}\log_{2}(n_{u}),$

$L(\hat{\sigma}_{X},\hat{\xi}_{X})$

$L(\hat{\sigma}_{Y},\hat{\xi}_{Y})$

$L(\hat{\sigma}_{X},\hat{\sigma}_{Y},\hat{\xi}_{X},\hat{\xi}_{Y},\hat{C}^{EV})$

$\hat{S}_{X^{\text{ext}}}(\tau)$

$\hat{S}_{Y^{\text{ext}}}(\tau)$

$\hat{S}_{X^{\text{ext}}\mid Y^{\text{ext}}}(\tau)$

$\hat{S}_{Y^{\text{ext}}\mid X^{\text{ext}}}(\tau)$

$CL_{F_{X}}^{\tau}(X^{\text{ext}})$


Full text

Causal mechanism of extreme river discharges in the upper Danube basin network

Linda Mhalla

Department of Decision Sciences, HEC Montreal, Canada


Valérie Chavez-Demoulin

Faculty of Business and Economics, University of Lausanne, Switzerland

Debbie J. Dupuis

Department of Decision Sciences, HEC Montreal, Canada

Abstract

Extreme hydrological events in the Danube river basin may severely impact human populations, aquatic organisms, and economic activity. One often characterizes the joint structure of the extreme events using the theory of multivariate and spatial extremes and its asymptotically justified models. There is interest however in cascading extreme events and whether one event causes another. In this paper, we argue that an improved understanding of the mechanism underlying severe events is achieved by combining extreme value modelling and causal discovery. We construct a causal inference method relying on the notion of the Kolmogorov complexity of extreme conditional quantiles. Tail quantities are derived using multivariate extreme value models and causal-induced asymmetries in the data are explored through the minimum description length principle. Our CausEV, for Causality for Extreme Values, approach uncovers causal relations between summer extreme river discharges in the upper Danube basin and finds significant causal links between the Danube and its Alpine tributary Lech.

keywords:

Causal discovery; Conditional quantile; Extreme value copula; Minimum description length; River discharge

Address for correspondence: Linda Mhalla, HEC Montréal, 3000, chemin de la Côte-Sainte-Catherine, Montréal (Québec), Canada H3T 2A7.

1 Introduction

The upper Danube basin is regularly affected by flooding and has received much attention in the hydrological literature. A wealth of studies focus on understanding the flooding processes from hydrological and anthropogenic perspectives (e.g., Merz and Blöschl, 2008; Skublics et al., 2016), while others analyse the influence of flood impact variables on monetary flood damage (Thieken et al., 2005) or assess the flood risk management system in Germany (see Thieken et al. (2016) and references therein).

Extreme discharges in the upper Danube basin have also been studied from a statistical perspective. Asadi et al. (2015) develop models for spatial extremal dependence based on the hydrological and geographical properties of a network and apply their methods to the river discharges at the stations in Figure 1. Their method allows direct estimation and comparison of the influence of the Euclidean and river distances on the dependence between extreme river discharges. Using a parametric model for multivariate threshold exceedances, Engelke and Hitz (2020) develop graphical models for extremes, resulting in an estimate of an undirected graph structure on the river network in Figure 1. Their findings provide evidence of extremal dependence between some flow-unconnected gauging stations, attributed to the spatial extent of extreme precipitation events.

According to the Future Danube Model, an increase in the magnitude and frequency of floods in the Danube basin is expected in 2020–2049; see Hattermann et al. (2018). The projected increase in the frequency of the $100$-year flood is, moreover, more pronounced in the German part of the catchment than in the Austrian and Hungarian parts; see, e.g., Figure 6 of Hattermann et al. (2018). These increases will amplify current problems in the area, and long-term planning would benefit from a greater understanding of any causal mechanisms in the extremes of river discharges.

Several floods occurred in June 2013, and Blöschl et al. (2013) analyse the causal factors in terms of atmospheric situation, runoff generation and propagation of the flood wave along the Danube and its tributaries. Although the severity of the floods depends on characteristics such as rainfall duration and soil moisture, an analysis of the spatial causality structure of the severe floods is important for an improved understanding of the environment, for flood risk assessment and management, and for the identification of flood mitigation measures. In this paper, we develop a new method to assess the intrinsic physical causal relations between the extreme discharges in the upper Danube basin network. The effect of time’s arrow (Sugimoto et al., 2016; Müller et al., 2017; Koutsoyiannis, 2019) on streamflow dynamics will not be considered, and the resulting causal mechanisms should be interpreted in conjunction with the physical dynamics of the river network.

Several methods of causal inference from observational data at the average level of their values have been proposed (Maathuis and Nandy, 2016). While statistical association, possibly occurring without causation, describes settings where events take place jointly more often than they are expected to happen separately, causal relationships allow the prediction of the effect of interventions on the observed system. For instance, climate researchers study causal links between climate forcings and observed responses with the aim of attributing likely causes for a detected climate change (National Academies of Sciences, Engineering, and Medicine, 2016; Hannart et al., 2016; Naveau et al., 2018). If one wants to investigate the effect of manipulating a variable, such manipulations should be carried out in a controlled experiment and the results observed. However, it is often unethical (clinical studies) or physically impossible (environmental studies) to perform such experiments, so approaches to discovering causal knowledge based on purely observational data have been proposed. In contrast with Rubin’s causal model and the associated counterfactual framework (Rubin, 1974), these methods are derived from the notion of independence between the cause and the effect conditional on the cause, the causal Markov condition (Spirtes et al., 2000). This notion of conditional independence, often considered a postulate for causal conclusions, emanates from the common cause principle of Reichenbach (1956), known as “no correlation without causation”, which states that statistical dependence between two random variables must be due to a causal link where either one causes the other or there is a third variable causing both. Moreover, the causal Markov condition is at the heart of the theory of causation of Pearl (1995, 2009) based on structural causal models associated with directed acyclic graphs.
Finally, the Markov condition can be stated in a statistical or an algorithmic version (Janzing and Schölkopf, 2010; Lemeire and Janzing, 2013), the fundamental difference residing in the definition of the mutual information measuring the violation of conditional independence. In a statistical approach, the mutual information is measured using Shannon entropy, while Kolmogorov complexity is used in an algorithmic approach.

In this paper, we are interested in unveiling causal links between large observed values of two random variables. That is, we want to know whether an intervention on the tail distribution of one random variable affects the tail distribution of another random variable. Unlike bivariate causal discovery methods concerned with causal effects at the mean level (Peters and Schölkopf, 2014; Marx and Vreeken, 2018), our focus is on the conditional independence in the joint upper tails. Classical statistical techniques that usually provide a good description of data tend to fail in the joint upper tail due to the scarcity of observations. We turn to extreme value theory for suitable models to describe the observed and unobserved large events at a penultimate level. The proposed method is valid under the assumption of asymptotic dependence where tail properties are appropriately captured by non-degenerate multivariate extreme value models (de Haan and Ferreira, 2006). Inference for the causal structure in the tails relies on the algorithmic version of the causal Markov condition. Since the Kolmogorov complexity is non-computable (Cover and Thomas, 2006), we use the minimum description length principle to provide a well-founded approximation of this complexity, as in Tagasovska et al. (2018).

The rest of the paper is organized as follows. In Section 2 we review basic extreme value theory for the univariate and multivariate settings, with a focus on threshold methods for joint occurrences of extreme events. In Section 3 we detail our quantile-based method for distinguishing between cause and effect at extreme levels of a bivariate random vector. Our CausEV approach is assessed and compared to alternative non-extreme-based methods by simulation in Section 4. In Section 5, CausEV is used to uncover the causal mechanism between the extreme discharges of the 31 stations in Figure 1. All the directed edges of the resulting network coincide with the real flow direction between sites, and causal relationships between summer extreme discharges at Alpine stations and the Danube are uncovered. We conclude in Section 6.

2 Extreme value theory

In this section, we present only the main results in extreme value theory (EVT) needed for our development of causal discoveries at extreme levels. We refer the interested reader to Embrechts et al. (1997), Beirlant et al. (2004), and de Haan and Ferreira (2006) for a more comprehensive review of the subject.

2.1 Univariate extreme value theory

2.1.1 Maxima of iid random variables

Let $(Y_{i})_{i\geq 1}$ be a sequence of independent and identically distributed (iid) random variables with common distribution $F$. Let $M_{n}$ be the maximum of a sequence of $n$ such random variables, i.e., $M_{n}=\max\{Y_{1},\ldots,Y_{n}\}$. The Fisher–Tippett theorem (Fisher and Tippett, 1928; Gnedenko, 1943) states that if there exist sequences of constants $\{a_{n}>0\}$ and $\{b_{n}\}$ such that the normalized maximum $(M_{n}-b_{n})/a_{n}$ converges in distribution to a random variable with a non-degenerate distribution function $G$, i.e.,

$$\Pr\left\{(M_{n}-b_{n})/a_{n}\leq y\right\}\to G(y),\quad n\to\infty,\tag{1}$$

then $G$ belongs to the Generalized Extreme Value (GEV) family of distributions

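$$\text{GEV}_{(\mu,\sigma,\xi)}(y)=\left\{\begin{array}{ll}\exp\left[-\left\{1+\xi(y-\mu)/\sigma\right\}^{-1/\xi}_{+}\right],&\xi\neq 0,\\ \exp\left[-\exp\{-(y-\mu)/\sigma\}\right],&\xi=0,\end{array}\right.$$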

defined on $\{y:1+\xi(y-\mu)/\sigma>0\}$, with $-\infty<\mu,\xi<\infty$, $\sigma>0$, and $x_{+}=\max(x,0)$. The parameters $\mu$, $\sigma$, and $\xi$ correspond, respectively, to the location, scale and shape. The value of the shape determines the limiting distribution: Fréchet ($\xi>0$) with support bounded below $(\mu-\sigma/\xi,+\infty)$, reversed Weibull ($\xi<0$) with support bounded above $(-\infty,\mu-\sigma/\xi)$, and Gumbel ($\xi=0$) with support in $\mathbb{R}$ and exponential decay in the upper tail. The Fisher–Tippett theorem provides a justification for the approximation of the distribution of the maximum in a block of $n$ iid random variables by the GEV distribution, for sufficiently large $n$. Standard inference techniques include maximum likelihood estimation (MLE) methods where classical asymptotic results of the estimators hold under certain regularity conditions outlined in Smith (1985) and detailed in Bücher and Segers (2017) using the notion of differentiability in quadratic mean.
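The GEV distribution function above, and the quantile function obtained by inverting it, can be sketched in a few lines of standard-library Python. The function names and the parameter values are illustrative assumptions (they are not estimates from the Danube data); the $T$-year return level is taken here as the $1-1/T$ quantile of the distribution of annual maxima.

```python
import math

def gev_cdf(y, mu=0.0, sigma=1.0, xi=0.0):
    """GEV distribution function with location mu, scale sigma > 0, shape xi."""
    z = (y - mu) / sigma
    if xi == 0.0:
        return math.exp(-math.exp(-z))  # Gumbel case
    t = 1.0 + xi * z
    if t <= 0.0:
        # Outside the support: below the lower endpoint (xi > 0)
        # or above the upper endpoint (xi < 0).
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))

def gev_quantile(p, mu=0.0, sigma=1.0, xi=0.0):
    """Inverse of gev_cdf for 0 < p < 1."""
    if xi == 0.0:
        return mu - sigma * math.log(-math.log(p))
    return mu + sigma * ((-math.log(p)) ** (-xi) - 1.0) / xi

# Hypothetical parameters for a series of annual maxima.
mu, sigma, xi = 100.0, 20.0, 0.1
level_100 = gev_quantile(1.0 - 1.0 / 100.0, mu, sigma, xi)  # 100-year level
print(round(gev_cdf(level_100, mu, sigma, xi), 6))
```

In practice the three parameters would be estimated, for instance by maximum likelihood; readers using a statistical library should check its sign convention for the shape parameter before comparing results.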

2.1.2 Threshold exceedances

Extreme events being scarce by definition, the block maximum approach may be wasteful, as it discards events that are not as extreme as the block maximum but that should be informative about the behaviour in the tails. An alternative to the block maximum approach is the peaks-over-threshold approach, which focuses on the asymptotic distribution of the exceedances of a high fixed threshold. The following result (Balkema and de Haan, 1974) allows the approximation of the conditional distribution of the exceedances above a high threshold. If there exist normalizing sequences $\{a_{n}>0\}$ and $\{b_{n}\}$ such that (1) holds, i.e., $F$ is in the max-domain of attraction of a $\text{GEV}_{(\mu,\sigma,\xi)}$, then, for a sufficiently high threshold $u$, we can model the limiting distribution of the exceedances $Y-u\mid Y>u$ with a Generalized Pareto Distribution (GPD) $G_{(\tilde{\sigma}_{u},\xi)}$,

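$$\Pr\left\{\left(Y_{n}-b_{n}\right)/a_{n}>u+y\mid\left(Y_{n}-b_{n}\right)/a_{n}>u\right\}\xrightarrow[n\to\infty]{}\left\{\begin{array}{ll}\left(1+\xi y/\tilde{\sigma}_{u}\right)_{+}^{-1/\xi},&\xi\neq 0,\\ \exp(-y/\tilde{\sigma}_{u}),&\xi=0,\end{array}\right.\tag{2}$$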

where $\tilde{\sigma}_{u}=\sigma+\xi(u-\mu)$ and the shape parameter $\xi$ equals that of the corresponding GEV distribution. The limiting distribution function $G_{(\tilde{\sigma}_{u},\xi)}$ in (2) is defined on $\{y:y>0\ \text{and}\ (1+\xi y/\tilde{\sigma}_{u})>0\}$, and the case of $\xi=0$ is interpreted as the limit. By a slight abuse of notation, we say that $Y\mid Y>u\sim\text{GPD}(u,\tilde{\sigma}_{u},\xi)$ whenever $Y-u\mid Y>u\sim G_{(\tilde{\sigma}_{u},\xi)}$. That is, $\text{GPD}(u,\tilde{\sigma}_{u},\xi)$ is a Generalized Pareto Distribution $G_{(\tilde{\sigma}_{u},\xi)}$ shifted by the threshold $u$.

We use the asymptotic result (2) to justify the use of the GPD as the model for the exceedances $Y_{i}-u$ for $i$ such that $Y_{i}>u$ for some high threshold $u$. The threshold $u$ is chosen following a bias-variance trade-off: a low threshold yields more exceedances and thus decreases the variance, but it increases the bias, as the asymptotic approximation for the tail of the distribution will be poor. The GPD may be fitted using MLE (Coles, 2001) or other estimation methods; see Beirlant et al. (2004, Section 5.3).
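The GPD model for threshold excesses can be illustrated with a short standard-library sketch. It implements the GPD distribution and quantile functions, the threshold-dependent scale $\tilde{\sigma}_{u}=\sigma+\xi(u-\mu)$, and a numerical check of threshold stability: excesses of a GPD over a further level $d$ are again GPD with the same shape and scale $\sigma+\xi d$. All names and parameter values are illustrative assumptions, and the simulated sample only shows how the number of exceedances shrinks as the threshold rises.

```python
import math
import random

def gpd_cdf(y, sigma, xi):
    """Distribution function of the excess y = Y - u under G_(sigma, xi)."""
    if y <= 0.0:
        return 0.0
    if xi == 0.0:
        return 1.0 - math.exp(-y / sigma)
    t = 1.0 + xi * y / sigma
    return 1.0 if t <= 0.0 else 1.0 - t ** (-1.0 / xi)

def gpd_quantile(p, sigma, xi):
    """Inverse of gpd_cdf for 0 <= p < 1."""
    if xi == 0.0:
        return -sigma * math.log(1.0 - p)
    return sigma * ((1.0 - p) ** (-xi) - 1.0) / xi

def threshold_scale(u, mu, sigma, xi):
    """GPD scale at threshold u implied by GEV parameters (mu, sigma, xi)."""
    return sigma + xi * (u - mu)

# Threshold stability: survival ratio above d equals a GPD with scale sigma + xi*d.
sigma, xi, d, y = 2.0, 0.3, 1.5, 0.7
lhs = (1.0 - gpd_cdf(y + d, sigma, xi)) / (1.0 - gpd_cdf(d, sigma, xi))
rhs = 1.0 - gpd_cdf(y, sigma + xi * d, xi)

# Fewer exceedances (higher variance) as the threshold quantile increases.
random.seed(1)
sample = [gpd_quantile(random.random(), 1.0, 0.2) for _ in range(5000)]
counts = {q: sum(v > gpd_quantile(q, 1.0, 0.2) for v in sample)
          for q in (0.90, 0.95, 0.99)}
print(counts)
```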

2.2 Multivariate extreme value theory

2.2.1 Normalized componentwise maxima of iid random vectors

Let $(\mathbf{Y}_{i})_{i\geq 1}$ be iid copies of a $d$-dimensional random vector $\mathbf{Y}=(Y_{1},\ldots,Y_{d})$ with marginal distributions $F_{j}$, $j=1,\ldots,d$, and joint distribution $F$. We denote by $\mathbf{M}_{n}=(M_{n,1},\ldots,M_{n,d})$ the vector of componentwise maxima, where $M_{n,j}=\max_{i=1}^{n}Y_{i,j}$ is the sample maximum of the $j$-th component. Note that $\mathbf{M}_{n}$ is not necessarily observed as the componentwise maxima may occur at different times. As in Section 2.1.1, the vector $\mathbf{M}_{n}$ needs to be suitably normalized to avoid degeneracy of its limit law as $n\rightarrow\infty$. We suppose that there exist sequences $\{\mathbf{a}_{n}\}\subset\mathbb{R}^{d}_{+}$ and $\{\mathbf{b}_{n}\}\subset\mathbb{R}^{d}$ such that the normalized vector $(\mathbf{M}_{n}-\mathbf{b}_{n})/\mathbf{a}_{n}$ converges in distribution to a random vector $\mathbf{Z}=(Z_{1},\ldots,Z_{d})$ with joint distribution $G$ and non-degenerate margins $G_{j}$, $j=1,\ldots,d$. When the marginal distributions $F_{j}$ are unit Fréchet, i.e., $F_{j}(y)=\exp(-1/y)$, $y>0$, Pickands' representation theorem (Coles, 2001, Theorem 8.1) states that the standardized componentwise maxima $n^{-1}\mathbf{M}_{n}$ converge in distribution to a multivariate extreme value distribution (MEVD), $G(\mathbf{z})=\exp\left\{-V(\mathbf{z})\right\}$, with

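$$V(\mathbf{z})=\int_{S_{d}}\max\left(\frac{w_{1}}{z_{1}},\ldots,\frac{w_{d}}{z_{d}}\right)\mathrm{d}H(\mathbf{w}),\quad\mathbf{z}\in(0,\infty)^{d},$$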

for some positive finite measure $H$ on the unit simplex $S_{d}=\{(w_{1},\ldots,w_{d})\in[0,1]^{d}:w_{1}+\cdots+w_{d}=1\}$ obeying

$$\int_{S_{d}}w_{j}\,\mathrm{d}H(\mathbf{w})=1,\quad j=1,\ldots,d.\tag{3}$$

The class of MEVDs coincides with the class of max-stable distribution functions with non-degenerate margins (Beirlant et al., 2004, Chapter 8.2.1), with a $d$-variate distribution $G$ said to be max-stable if there exist vectors $\bm{\alpha}_{k}>\mathbf{0}$ and $\bm{\beta}_{k}$ such that

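$$G(\bm{\alpha}_{k}\mathbf{z}+\bm{\beta}_{k})^{k}=G(\mathbf{z}),\quad\mathbf{z}=(z_{1},\ldots,z_{d})\in\mathbb{R}^{d},$$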

for any integer $k>0$. Two results follow from this property. The margins $G_{j}$ must be max-stable, or equivalently $\text{GEV}_{(\mu_{j},\sigma_{j},\xi_{j})}$, which follows from the univariate EVT and the construction of $\mathbf{M}_{n}$, and the distribution function $G$ is max-infinitely divisible, i.e., $G^{1/k}$ is a distribution function for any integer $k>0$ (Balkema and de Haan, 1974). The max-stability of the limiting distribution $G$ implies that its associated copula,

[TABLE]

belongs to the large class of extreme value copulas $\mathcal{C}^{EV}$ satisfying

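$$C^{EV}(\mathbf{v})=\left\{C^{EV}(v_{1}^{1/k},\ldots,v_{d}^{1/k})\right\}^{k},\quad\mathbf{v}=(v_{1},\ldots,v_{d})\in[0,1]^{d},\quad k>0.\tag{5}$$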

The function $A$ in (4) is termed the Pickands dependence function and is a continuous convex function defined on the unit simplex $S_{d}$ and satisfying $\max(w_{1},\ldots,w_{d})\leq A(w_{1},\ldots,w_{d})\leq 1$, for all $(w_{1},\ldots,w_{d})\in S_{d}$. The Pickands dependence function describes the extremal dependence, i.e., the dependence structure in the limiting distribution of the normalized maxima. In the bivariate setting, a useful summary of the strength of tail dependence in $(Y_{1},Y_{2})$ is given by the coefficient of tail dependence $\chi$ (Coles et al., 1999) where

$$\chi_{u}=\Pr\left\{Y_{2}>F_{2}^{-1}(u)\mid Y_{1}>F_{1}^{-1}(u)\right\}\to\chi\geq 0,\quad u\to 1.$$

When $\chi>0$, the random vector $(Y_{1},Y_{2})$ is said to be asymptotically dependent, whereas $\chi=0$ characterizes the asymptotic independence regime where the limiting extreme value copula coincides with the independence copula.
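To make the Pickands dependence function and the coefficient $\chi$ concrete, the sketch below uses the bivariate logistic (Gumbel) model, chosen here purely as an illustration and not as the model used in this paper. It assumes the common bivariate convention $C^{EV}(v_{1},v_{2})=\exp\{\log(v_{1}v_{2})A(\omega)\}$ with $\omega=\log v_{1}/\log(v_{1}v_{2})$, under which $\chi=2-2A(1/2)$, and checks numerically the max-stability property of the copula and the bounds on $A$.

```python
import math

def pickands_logistic(w, alpha):
    """Pickands function of the bivariate logistic model, 0 < alpha <= 1."""
    return (w ** (1.0 / alpha) + (1.0 - w) ** (1.0 / alpha)) ** alpha

def ev_copula(v1, v2, alpha):
    """Extreme value copula built from the logistic Pickands function."""
    s = math.log(v1) + math.log(v2)          # log(v1 * v2) < 0 on (0, 1)^2
    return math.exp(s * pickands_logistic(math.log(v1) / s, alpha))

def chi(alpha):
    """Limiting tail dependence coefficient, chi = 2 - 2 A(1/2)."""
    return 2.0 * (1.0 - pickands_logistic(0.5, alpha))

def chi_u(u, alpha):
    """Pr(U2 > u | U1 > u) for uniform margins with copula ev_copula."""
    return (1.0 - 2.0 * u + ev_copula(u, u, alpha)) / (1.0 - u)

alpha, k = 0.5, 3
direct = ev_copula(0.3, 0.7, alpha)
rescaled = ev_copula(0.3 ** (1 / k), 0.7 ** (1 / k), alpha) ** k  # max-stability
print(direct, rescaled, chi(alpha), chi_u(0.999, alpha))
```

With `alpha = 1` the Pickands function is identically 1, the copula reduces to the independence copula and `chi` is 0; smaller values of `alpha` give stronger asymptotic dependence.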

2.2.2 Joint threshold exceedances

From Section 2.1.2, the GPD is suitable for modelling exceedances of a univariate random variable above a high threshold. When the interest is in joint exceedances above high thresholds of a random vector, it is thus reasonable to model their joint distribution with a distribution with GPD margins. When it comes to the dependence structure of this joint limiting distribution, Beirlant et al. (2004, Section 8.3.2) and McNeil et al. (2005, Section 7.6.1) argue that, assuming that the joint distribution of the random vector $\mathbf{Y}=(Y_{1},\ldots,Y_{d})$ is in the maximum domain of attraction of an MEVD, we can approximate, for $\mathbf{Y}\geq\mathbf{u}$ (with inequality holding componentwise), the dependence structure between the exceedances of the high multivariate threshold $\mathbf{u}$ by an extreme value copula satisfying (5).
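As a small illustration of what joint exceedances are in practice, the following sketch keeps the observations that exceed the threshold in both components. The helper names, the toy data and the use of an empirical quantile as threshold are assumptions made for this example only.

```python
def empirical_quantile(values, q):
    """Simple order-statistic quantile (no interpolation)."""
    ordered = sorted(values)
    index = min(int(q * len(ordered)), len(ordered) - 1)
    return ordered[index]

def joint_exceedances(pairs, u_x, u_y):
    """Observations exceeding the thresholds in both margins."""
    return [(x, y) for x, y in pairs if x > u_x and y > u_y]

# Toy bivariate sample (hypothetical values, not river discharges).
pairs = [(1, 5), (2, 1), (8, 9), (9, 2), (7, 8), (3, 3), (10, 10), (4, 7)]
u_x = empirical_quantile([x for x, _ in pairs], 0.5)
u_y = empirical_quantile([y for _, y in pairs], 0.5)
extremes = joint_exceedances(pairs, u_x, u_y)
print(u_x, u_y, extremes)
```

Only the pairs that are large in both components are retained, so the extreme sample is typically much smaller than the original one, which is why the choice of thresholds matters.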

2.2.3 Inference for extremal dependence

As opposed to the univariate case, in which a parametric family of distributions characterizes all the possible limiting distributions of suitably normalized maxima, the class of multivariate extreme value distributions yields an infinite-dimensional family of representations. The validity of a multivariate extreme value distribution relies solely on its associated measure $H$ satisfying the mean condition (3). For extremal dependence modelling and inference, one can either rely on flexible classes of parametric models (Beirlant et al., 2004, Section 9.2.2) or use non-parametric estimation of the extreme value copula or its associated Pickands dependence function.

In this paper, we make no assumption on the form of the extremal dependence in the random vector $\mathbf{Y}=(Y_{1},\ldots,Y_{d})$, and take a non-parametric inference approach. To do so, we use the min-projection approach of Mhalla et al. (2019) for inference on the extreme value copula describing the dependence structure between multivariate threshold exceedances, through non-parametric estimation of its associated Pickands dependence function. We compute the min-projection for a sequence of fixed directions in the unit simplex and regularize the Pickands function so that the resulting estimates of the extreme value copula as well as its derivatives are valid with respect to the convexity and boundary conditions. When the regularisation relies on the median smoothing approach of Ng and Maechler (2007), the resulting valid estimate of the Pickands function is a linear combination of B-spline basis functions. Equation (4) implies that the derivatives of the extreme value copula and the Pickands function are related, in the bivariate setting, through

[TABLE]

where $A(\omega)\equiv A(\omega,1-\omega)$ and $A^{\prime}(\omega)=dA(\omega)/d\omega$. Thus, inference for the extreme value copula $C^{EV}$ and its partial derivatives is conducted straightforwardly based on the B-spline representation of the Pickands estimator.

3 Pairwise causal discovery of extremes

In this section, we develop a quantile-based method for distinguishing between cause and effect at extreme levels of a bivariate random vector $(X,Y)$. Throughout this section, we assume that we observe a dataset $\{(X_{i},Y_{i})\}_{i=1}^{n}$ and that the resulting extreme events are defined as the observations exceeding sufficiently high thresholds in both margins, i.e., the $n_{u}$ observations $\{(X^{\text{ext}}_{i},Y^{\text{ext}}_{i})\}_{i=1}^{n_{u}}$ where for all $i=1,\ldots,n_{u}$, $X_{i}^{\text{ext}}>u_{X}$ and $Y_{i}^{\text{ext}}>u_{Y}$ for high thresholds $u_{X}$ and $u_{Y}$. We study the causal relationships between $X^{\text{ext}}$ and $Y^{\text{ext}}$, doing so based on the independence of cause and mechanism postulate (Daniušis et al., 2010):

Postulate 1

The mechanisms generating the random variable describing the cause, denoted by $X$, and generating the random variable describing the effect given the cause, denoted by $Y\mid X$, are independent, i.e., they contain no information about each other.

The independence of mechanisms is related to Pearl’s notion of stability (Pearl, 2009, Section 2.4) which states that the causal mechanism describing a variable given its cause must remain unchanged if one changes the mechanism generating the cause. For instance, consider the toy example where the altitude of a station modelled by a random variable $X$ and the temperature at this station modelled by the random variable $Y$ are causally related through the structural equation model (Pearl, 2009; Pearl et al., 2016)

$$Y=h(X,\epsilon),\quad X\perp\!\!\!\perp\epsilon,$$

where $h$ is the mechanism (function) describing $Y\mid X$ and which can be thought of as modelling the physical mechanism relating the temperature to the altitude, and $\epsilon$ is an error term. Then, any localised intervention on the altitude, e.g., by considering a nearby station, would result in a change in the temperature but would not affect the physical mechanism $h$. Therefore, the conditional random variable $Y\mid X$ described by $h$ provides no information about $X$.
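The altitude and temperature example can be simulated to see what invariance of the mechanism means. In the sketch below, the linear form of $h$, its coefficients, the noise level and the two altitude ranges are all hypothetical choices made for illustration. The distribution of the cause is changed (lowland versus Alpine stations), while the mechanism is left untouched, so the slope recovered by least squares is the same in both settings up to sampling error.

```python
import random

def temperature(altitude, noise):
    """Hypothetical mechanism h: temperature falls linearly with altitude."""
    return 15.0 - 0.0065 * altitude + noise

def simulate(altitudes, rng):
    """Draw (X, Y) pairs with Y = h(X, eps) and eps independent of X."""
    return [(x, temperature(x, rng.gauss(0.0, 1.0))) for x in altitudes]

def ols_slope(pairs):
    """Least-squares slope of y on x."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    return sxy / sxx

rng = random.Random(0)
# Two different mechanisms generating the cause X; the same mechanism h for Y | X.
lowland = simulate([rng.uniform(200.0, 600.0) for _ in range(5000)], rng)
alpine = simulate([rng.uniform(1500.0, 3000.0) for _ in range(5000)], rng)
slope_low, slope_alp = ols_slope(lowland), ols_slope(alpine)
print(slope_low, slope_alp)
```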

Janzing and Schölkopf (2010) formalize the notion of two mechanisms containing no information about each other using algorithmic information theory and more specifically the notion of Kolmogorov complexity, that is, the length of the shortest computer program that prints a sequence of the underlying random variable and halts (Kolmogorov, 1968; Li and Vitányi, 2008). The Kolmogorov complexity of a random sequence is closely related to the Shannon entropy (Shannon, 1948) of the underlying distribution. For a random sequence $X^{n}=\{X_{i}\}_{i=1}^{n}$, with $X_{i}$ drawn independently from a probability distribution $F$ with density $f$, Brudno (1983) shows that the Kolmogorov complexity $K(X^{n})$ of the sequence $X^{n}$ is linked to the Shannon entropy of $F$ through

$$\lim_{n\to\infty}\frac{K(X^{n})}{n}=-\int f(x)\log_{2}f(x)\,\mathrm{d}x,\quad\text{a.s.}\tag{7}$$

This result is also valid for discrete distributions where the Shannon entropy on the right-hand side of (7) is modified accordingly. For example, if $X^{n}\in\{0,1\}^{n}$ is drawn from $n$ independent Bernoulli random variables with known success probability $p$, then, with probability close to 1, $K(X^{n})$ is close to $n$ times the binary entropy, i.e., $-np\log_{2}(p)-n(1-p)\log_{2}(1-p)$.

In terms of the Kolmogorov complexities of the random variables $X$ and $Y$ , Postulate 1 is translated as follows:

Postulate 2

If the random variable $X$ is the cause of a random variable $Y$, then the distribution of $X$, $F_{X}$, and the distribution of $Y\mid X$, $F_{Y\mid X}$, are algorithmically independent, that is

$$K(F_{X,Y})=K(F_{X})+K(F_{Y\mid X}),$$

where $K(F_{Z})$ stands for the Kolmogorov complexity of a random variable $Z\sim F_{Z}$.

In a setting where $X$ and $Y$ are causally related, i.e., either $X$ causes $Y$ or $Y$ causes $X$, Postulate 2 implies that one should infer that $X$ causally influences $Y$ whenever

$$K(F_{X})+K(F_{Y\mid X})\overset{+}{\leq}K(F_{Y})+K(F_{X\mid Y}),\tag{8}$$

where $\overset{+}{\leq}$ denotes inequality up to an additive constant. This inequality stems from the definition of the algorithmic mutual information of two random variables $U$ and $V$, which is a positive quantity equal to $I(F_{U}:F_{V})=K(F_{V})-K(F_{U,V})+K(F_{U})$, and from the equivalence between $K(F_{U,V})$ and $K(F_{U,V\mid U})$; see Janzing and Schölkopf (2010, Section II-A) for details. One possible setting where $X$ and $Y$ are not causally related is where $X$ and $Y$ are not causally sufficient, i.e., there is a latent common cause or unobserved confounder to $X$ and $Y$. In the latter case, the joint distribution function $F_{X,Y}$ might have a smaller complexity $K(F_{X,Y})$ when considering its decomposition by the chain rule involving the marginal distribution of the confounder. We assume in this work the absence of common confounders and proceed with the weaker version of Postulate 2 given by (8).

3.1 Kolmogorov complexity and code length

Here we describe the link between the Kolmogorov complexity of a random variable and its code length through the minimum description length (MDL) of Rissanen (1978). The MDL translates Occam’s razor, i.e., the parsimony principle that complex models should not be used beyond necessity, using information theory and states that one should choose the model that provides the shortest description of the data. Random objects being incompressible, they cannot have a concise description (Turing, 1937) and their descriptive Kolmogorov complexity is therefore not computable (Cover and Thomas, 2006). Rissanen modified the concept of the Kolmogorov complexity by proposing the MDL, which focuses on the description length of probability distributions. This paves the way to new insights into many statistical procedures (Hansen and Yu, 2001; Davis et al., 2006; Aue et al., 2014). From Rissanen’s perspective and based on Shannon’s source coding theorem (Shannon, 1948), the MDL of the data is the descriptive power, also called the code length, of its underlying generating distribution. In the example of the random sequence $X^{n}\in\{0,1\}^{n}$ described above, one can encode every symbol of the sequence at a cost of $-\log_{2}(p)$ for a $1$ and $-\log_{2}(1-p)$ for a $0$, resulting in a total code length of $X^{n}$ equal to the negative log-likelihood of the Bernoulli model. We denote this quantity by $CL_{\mathcal{M}}(X^{n})$, where in this example $\mathcal{M}$ describes the Bernoulli model with known probability of success $p$.
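The Bernoulli example can be checked numerically. The sketch below simulates a binary sequence, computes its code length as the negative base-2 log-likelihood under the known success probability, and compares the per-symbol cost with the binary entropy. The function names, the seed, and the values of `p` and `n` are illustrative assumptions.

```python
import math
import random

def binary_entropy(p):
    """Shannon entropy, in bits, of a Bernoulli(p) variable with 0 < p < 1."""
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def code_length_known_p(bits, p):
    """Code length in bits: -log2(p) per 1 and -log2(1 - p) per 0."""
    ones = sum(bits)
    zeros = len(bits) - ones
    return -ones * math.log2(p) - zeros * math.log2(1.0 - p)

rng = random.Random(42)
p, n = 0.2, 10_000
bits = [1 if rng.random() < p else 0 for _ in range(n)]

cl = code_length_known_p(bits, p)
print(cl / n, binary_entropy(p))  # per-symbol code length vs. entropy
```

For a long sequence the two printed numbers are close, which is the discrete counterpart of the limit in (7) with the code length standing in for the non-computable Kolmogorov complexity.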

As the true distribution of the data is rarely known, we minimize the code length of many probability distributions from a specific model class with the same sample space as the data. Based on the MDL principle, the Kolmogorov complexity of a random variable can be practically approximated by its code length, which is defined relative to a model class (whether it contains the true underlying distribution or not) and computed using a specific coding scheme. This approximation allows us to formulate Postulate 2 in terms of code lengths leading to the following inequality whenever a random variable $X$ is the cause of a random variable $Y$ :

$$CL(X)+CL(Y\mid X)\leq CL(Y)+CL(X\mid Y),$$

where each quantity is to be understood as dependent on the observed dataset.

3.2 Causal discovery using quantile scoring

The MDL principle defines the “best” fitting model from a model class as the one within this class that produces the shortest code length that completely describes the observations. Different forms of the model-based code length have been devised using various coding algorithms (Hansen and Yu, 2001, Section 3) but we focus on the “two-stage” or “two-part” version, stating that the code length of an observed sequence of a random variable can be decomposed into a sum of two parts. The first represents the code length of the fitted model, that is the amount of space required to store the fitted model or, in a fully parametric setting, the corresponding estimated parameter. The second part of the MDL represents the code length of the data based on the fitted model and is equal to the negative log-likelihood of the model evaluated at the transmitted fitted parameter, i.e., the negative log-likelihood of the fitted model. In the example of the Bernoulli sequence $X^{n}$ , as the probability of success $p$ is known, we showed in Section 3.1 that the code length of the sequence is equal to the negative log-likelihood of the Bernoulli model, i.e., it is equal to the second part of the MDL. When the probability of success $p$ is unknown, the two-stage MDL takes the form of a penalised log-likelihood where the penalty, computed in its first part, is the complexity of estimating $p$ .
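The two-part decomposition can be sketched in Python for the Bernoulli example with unknown $p$. The sketch assumes the parameter cost of $\log_{2}(n)/2$ bits per real-valued ML estimate (the result of Rissanen (1989) invoked later in this section); the function name is hypothetical.

```python
import math

def two_stage_code_length(bits):
    """Return (model cost, data cost) in bits for a Bernoulli model fitted by ML."""
    n = len(bits)
    ones = sum(bits)
    p_hat = ones / n                      # maximum likelihood estimate of p
    # Part 1: cost of transmitting one real-valued ML estimate (assumed log2(n) / 2).
    model_cost = 0.5 * math.log2(n)
    # Part 2: negative log2-likelihood at p_hat; symbols with zero count cost nothing.
    data_cost = 0.0
    if ones > 0:
        data_cost -= ones * math.log2(p_hat)
    if ones < n:
        data_cost -= (n - ones) * math.log2(1.0 - p_hat)
    return model_cost, data_cost

model_cost, data_cost = two_stage_code_length([1, 0, 1, 1, 0, 1, 1, 1])
print(model_cost, data_cost, model_cost + data_cost)
```

The sum of the two parts plays the role of a penalised negative log-likelihood: the first part grows with the number of parameters, the second shrinks as the fit improves.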

We derive the code lengths of the marginal and conditional random variables in the dataset $\{(X^{\text{ext}}_{i},Y_{i}^{\text{ext}})\}_{i=1}^{n_{u}}$ using a quantile-based MDL obtained with respect to a specific model class. That is, we encode the dataset of the tail observations using their quantile function under the chosen model class rather than their distribution function. We use a quantile-based approach as it can provide an enhanced characterization of the distribution of an outcome variable by capturing features such as heteroskedasticity that mean-based approaches such as that in Peters et al. (2016) will fail to capture. In contrast with Tagasovska et al. (2018) who focus on causal discovery in the main body of the observations using the MDL principle in a nonparametric setting, we concentrate on causal discovery in the tail region relying on the MDL principle within the class of extreme value distributions.

As described in Section 2.2.2, the distribution function of the joint tails of the vector $(X,Y)$ can be approximated by a bivariate distribution with shifted GPD margins and an extreme value copula $C^{EV}\in\mathcal{C}^{EV}$ satisfying (4) for some valid Pickands dependence function $A$ . More specifically, denoting the marginal distributions $F=F_{(\sigma_{X},\xi_{X})}\sim\text{GPD}(u_{X},\sigma_{X},\xi_{X})$ and $G=G_{(\sigma_{Y},\xi_{Y})}\sim\text{GPD}(u_{Y},\sigma_{Y},\xi_{Y})$ , we have

[TABLE]

We consider a marginal model for the $\tau$ -th quantiles of $X^{\text{ext}}$ and $Y^{\text{ext}}$ ,

[TABLE]

as well as the following model for the conditional $\tau$ -th quantiles of $X^{\text{ext}}\mid Y^{\text{ext}}$ and $Y^{\text{ext}}\mid X^{\text{ext}}$

[TABLE]

where

[TABLE]

Here, the level $\tau$ ranges from a probability of $0$ to a probability of $1$, covering the entire joint tail distribution. We denote the corresponding classes of models by $\mathcal{M}_{X^{\text{ext}}}$, $\mathcal{M}_{Y^{\text{ext}}}$, $\mathcal{M}_{X^{\text{ext}}\mid Y^{\text{ext}}}$, and $\mathcal{M}_{Y^{\text{ext}}\mid X^{\text{ext}}}$ and any model from these classes by $\mathcal{F}_{X^{\text{ext}}}\in\mathcal{M}_{X^{\text{ext}}}$, $\mathcal{F}_{Y^{\text{ext}}}\in\mathcal{M}_{Y^{\text{ext}}}$, $\mathcal{F}_{X^{\text{ext}}\mid Y^{\text{ext}}}\in\mathcal{M}_{X^{\text{ext}}\mid Y^{\text{ext}}}$, and $\mathcal{F}_{Y^{\text{ext}}\mid X^{\text{ext}}}\in\mathcal{M}_{Y^{\text{ext}}\mid X^{\text{ext}}}$. The superscripts ext are omitted where no confusion can arise. If $CL_{\mathcal{F}_{X}}^{\tau}(X^{\text{ext}})$ denotes the code length of the random variable $X^{\text{ext}}$ under the $\tau$-th quantile model $\mathcal{F}_{X}\in\mathcal{M}_{X}$, then

[TABLE]

where $CL_{\mathcal{F}_{X}}^{\tau}(\hat{\mathcal{F}}_{X})$ is the code length of the fitted model $\hat{\mathcal{F}}_{X}$ and $CL_{\mathcal{F}_{X}}^{\tau}(\hat{\mathcal{\epsilon}}_{X}|\hat{\mathcal{F}}_{X})$ is the leftover information not captured by the transmitted model $\hat{\mathcal{F}}_{X}$, i.e., the code length of the residuals of $\hat{\mathcal{F}}_{X}$; see Davis et al. (2006) and Aue et al. (2014) for related approaches in the settings of autoregressive and quantile regression modelling, respectively. We first focus on encoding the first part of the “two-stage” version of the MDL of $X^{\text{ext}}$ and $Y^{\text{ext}}$. Since any $\hat{\mathcal{F}}_{X}$ or $\hat{\mathcal{F}}_{Y}$ is completely specified by the scale and shape parameters of the GPD, the code lengths for encoding the fitted models $\hat{\mathcal{F}}_{X}$ and $\hat{\mathcal{F}}_{Y}$ are equal to those of the estimates of the parameters $(\sigma_{X},\xi_{X})$ and $(\sigma_{Y},\xi_{Y})$, respectively. The fitting step is performed using maximum likelihood (ML) estimation, and we can use the result by Rissanen (1989), stating that the code length of an ML estimate of a real-valued parameter based on $n$ observations is equal to $\log_{2}(n)/2$, to obtain

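$$CL_{\mathcal{F}_{X}}^{\tau}(\hat{\mathcal{F}}_{X})=CL_{\mathcal{F}_{Y}}^{\tau}(\hat{\mathcal{F}}_{Y})=\frac{p}{2}\log_{2}(n_{u}),$$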

where $p=2$ is the number of parameters of the GPD. The fitted conditional models $\hat{\mathcal{F}}_{X\mid Y}$ and $\hat{\mathcal{F}}_{Y\mid X}$ are more complicated to encode, as the estimation procedure involves both the estimation of the margins, using ML, and the estimation of the dependence structure described by the extreme value copula $C^{EV}$, which is estimated non-parametrically as described in Section 2.2.3. As the fitted conditional models rely on the same marginal distributions and non-parametric extreme value copula, their code lengths $CL^{\tau}_{\mathcal{F}_{X\mid Y}}(\hat{\mathcal{F}}_{X\mid Y})$ and $CL^{\tau}_{\mathcal{F}_{Y\mid X}}(\hat{\mathcal{F}}_{Y\mid X})$ are equal, and we show below that an analytical expression of their complexity is not needed for causal discovery between $X^{\text{ext}}$ and $Y^{\text{ext}}$.

We now encode the residuals $\hat{\mathcal{\epsilon}}_{X}$ , $\hat{\mathcal{\epsilon}}_{Y}$ , $\hat{\mathcal{\epsilon}}_{X\mid Y}$ , and $\hat{\mathcal{\epsilon}}_{Y\mid X}$ of the fitted models. That is, we compute the second part of the “two-stage” MDL for the $\tau$ -th quantile models. As discussed above, this second part is equal to the negative log-likelihood of the model evaluated at the fitted parameters. We derive such quantities relying on the link between the asymmetric Laplace density and quantile regression (Komunjer, 2005). More precisely, following Geraci and Bottai (2007) and Aue et al. (2014), the code lengths of the innovations in our $\tau$ -th quantile models (10)–(13) are obtained through an application of the asymmetric Laplace likelihood functions

[TABLE]

respectively. The quantities $\hat{S}_{X^{\text{ext}}}(\tau)$ , $\hat{S}_{Y^{\text{ext}}}(\tau)$ , $\hat{S}_{X^{\text{ext}}\mid Y^{\text{ext}}}(\tau)$ , and $\hat{S}_{Y^{\text{ext}}\mid X^{\text{ext}}}(\tau)$ are the estimates of the expected quantile scores (multiplied by $n_{u}$ ) of the $\tau$ -th quantile forecasts (10)–(13) (Koenker and Machado, 1999; Gneiting and Raftery, 2007), and are given by

[TABLE]

where $\rho_{\tau}$ is a loss function given by the check function $\rho_{\tau}(t)=(\mathbf{1}_{\{t\geq 0\}}-\tau)t$ (Koenker and Machado, 1999), and $(\hat{\sigma}_{X},\hat{\sigma}_{Y},\hat{\xi}_{X},\hat{\xi}_{Y})\in\mathbb{R}^{2}_{+}\times\mathbb{R}^{2}$ and $\hat{C}^{EV}\in\mathcal{C}^{EV}$ are the transmitted fitted parameters.
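To make the role of the check function concrete, the sketch below evaluates a quantile score for a marginal GPD quantile forecast in Python. It assumes that the argument of $\rho_{\tau}$ is the forecast minus the observation, which is the convention under which the check function as written above is a proper quantile loss; the helper names are hypothetical and the sample is simulated.

```python
import random

def check_loss(t, tau):
    """Check function rho_tau(t) = (1{t >= 0} - tau) * t, as written in the text."""
    return ((1.0 if t >= 0 else 0.0) - tau) * t

def gpd_quantile(tau, u, sigma, xi):
    """tau-th quantile of a GPD with threshold u, scale sigma and shape xi != 0."""
    return u + sigma / xi * ((1.0 - tau) ** (-xi) - 1.0)

def quantile_score(observations, forecast, tau):
    """Sum of check losses of (forecast - observation); sign convention is an assumption."""
    return sum(check_loss(forecast - x, tau) for x in observations)

rng = random.Random(1)
u, sigma, xi = 2.0, 0.3, 0.1              # illustrative GPD parameters
# Inverse-transform sampling: plug uniform draws into the quantile function.
sample = [gpd_quantile(rng.random(), u, sigma, xi) for _ in range(2000)]

tau = 0.9
good = quantile_score(sample, gpd_quantile(tau, u, sigma, xi), tau)   # correct level
bad = quantile_score(sample, gpd_quantile(0.5, u, sigma, xi), tau)    # wrong level
print(good < bad)
```

A forecast at the correct level attains a smaller score than a forecast at the wrong level, which is why smaller scores translate into shorter code lengths for the residuals.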

Thus, summing both parts of the MDL, the code lengths of our random variables are

[TABLE]

Finally, as the complexities of the fitted conditional models are equal, we can formulate Postulate 2 in terms of the $\tau$ -th quantile scores through the equivalence between the inequalities

[TABLE]

The causality model at extreme levels is expected to be stable with respect to the quantiles, i.e., the inequality (14) is expected to hold for various levels $\tau$ . By defining $\hat{S}_{X^{\text{ext}}}=\int_{0}^{1}\hat{S}_{X^{\text{ext}}}(\tau)d\tau$ (a similar notation holds for $Y^{\text{ext}}$ , $X^{\text{ext}}\mid Y^{\text{ext}}$ , and $Y^{\text{ext}}\mid X^{\text{ext}}$ ), we modify our decision rule about causality from (14) to

[TABLE]

Further, we define the causal score of our CausEV method as

[TABLE]

and conclude that $X$ causes $Y$ at extreme levels whenever $S_{X\rightarrow Y}^{\text{ext}}>0.5$ , where equality stands for the non-identifiable setting (Peters and Schölkopf, 2014), i.e., a setting where one cannot identify the causal direction at extreme levels between $X$ and $Y$ .

4 Simulation study

We show the effectiveness of CausEV under different simulation scenarios and compare it to state-of-the-art methods for uncovering causality at the mean level, such as LINGAM (Shimizu et al., 2006), IGCI (Janzing et al., 2012), CAM (Bühlmann et al., 2014), and RESIT (Peters and Schölkopf, 2014).

4.1 Additive noise models

We consider a structural equation model with an additive noise (AN) structure between $X$ and $Y$ , i.e.,

$$Y=h(X)+\epsilon,\qquad(16)$$

where the deterministic structural function $h$ and the distribution functions of $X$ and $\epsilon$ , denoted $F_{X}$ and $F_{\epsilon}$ respectively, are such that the AN structure (16) holds in the joint upper tail of $(X,Y)$ . This avoids scenarios where large values of $X$ induce low values for $Y$ . Moreover, we require asymptotic dependence between $X$ and $Y$ , i.e., the joint distribution of $(X,Y)$ must be in the maximum domain of attraction of some dependent MEVD; see Section 2.2. Under the AN structure (16), the coefficient of tail dependence $\chi$ defined in (6), is equal to

[TABLE]

In the absence of the noise random variable $\epsilon$ , a monotonic increasing function $h$ ensures comonotonicity between $X$ and $Y$ and hence $\chi=1$ . We monitor the effect of the noise variable that should result in $\chi<1$ while maintaining asymptotic dependence. We consider the following settings for the AN structure (16):

Scenario 1. $X\sim\text{GPD}(2,0.3,0.1)$

(a) $h(x)=\log(x+10)+x^{6}$ and $\epsilon\sim t(\nu_{\epsilon})$, with $\nu_{\epsilon}\in[2.1,4]$;

(b) $h(x)=x^{3}+x$ and $\epsilon\sim\mathcal{N}(0,\sigma_{\epsilon}^{2})$, with $\sigma_{\epsilon}\in[0.1,20]$;

Scenario 2. $X\sim\mathcal{N}(1,0.4^{2})$

(c) $h(x)=\log(x+10)+x^{6}$ and $\epsilon\sim t(\nu_{\epsilon})$, with $\nu_{\epsilon}\in[2.1,4]$;

(d) $h(x)=x^{3}+x$ and $\epsilon\sim\mathcal{N}(0,\sigma_{\epsilon}^{2})$, with $\sigma_{\epsilon}\in[0.05,4]$;

Scenario 3. $X\sim\text{GEV}(-2.8,1,-0.1)$

(e) $h(x)=x^{3}+x$ and $\epsilon\sim\mathcal{N}(0,\sigma_{\epsilon}^{2})$, with $\sigma_{\epsilon}\in[0.05,4]$.

Figure 2 displays empirical estimates of $\chi_{0.95}$ computed under Scenarios 1–3. The resulting tail dependence coefficients hint at asymptotic dependence of $(X,Y)$ in all cases. We proceed with CausEV to distinguish the cause ( $X$ ) from the effect ( $Y$ ) in the upper quadrant $\{(X^{\text{ext}}_{i},Y^{\text{ext}}_{i})\}_{i=1}^{n_{u}}=\{(X_{i},Y_{i}):X_{i}>F^{-1}_{X}(0.95)\ \text{and }Y_{i}>F^{-1}_{Y}(0.95)\}$ . Experiments are based on $300$ repetitions where we fix the size of concurrent exceedances at $n_{u}=55$ .
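The following Python sketch shows how data from one of these settings could be simulated and how $\chi_{0.95}$ could be estimated empirically. It uses the setting with $X\sim\text{GPD}(2,0.3,0.1)$, $h(x)=x^{3}+x$ and Gaussian noise, and assumes the usual finite-level definition $\chi_{u}=\Pr\{F_{Y}(Y)>u\mid F_{X}(X)>u\}$. The sample size, seeds and function names are illustrative assumptions, not the authors' settings.

```python
import random

def gpd_sample(rng, u, sigma, xi):
    """Draw from GPD(u, sigma, xi), xi != 0, by inverting the distribution function."""
    return u + sigma / xi * ((1.0 - rng.random()) ** (-xi) - 1.0)

def empirical_quantile(values, level):
    """Simple order-statistic quantile (no interpolation)."""
    ordered = sorted(values)
    return ordered[min(int(level * len(ordered)), len(ordered) - 1)]

def empirical_chi(x, y, level=0.95):
    """Estimate Pr(Y > its level-quantile | X > its level-quantile)."""
    qx, qy = empirical_quantile(x, level), empirical_quantile(y, level)
    n_x = sum(1 for a in x if a > qx)
    n_xy = sum(1 for a, b in zip(x, y) if a > qx and b > qy)
    return n_xy / n_x

def simulate_additive_noise(n, sigma_eps, seed):
    """Simulate Y = h(X) + eps with h(x) = x^3 + x and Gaussian noise."""
    rng = random.Random(seed)
    x = [gpd_sample(rng, 2.0, 0.3, 0.1) for _ in range(n)]
    y = [a ** 3 + a + rng.gauss(0.0, sigma_eps) for a in x]
    return x, y

for sigma_eps in (0.1, 20.0):
    x, y = simulate_additive_noise(5000, sigma_eps, seed=0)
    print(sigma_eps, round(empirical_chi(x, y), 3))
```

With little noise the two variables are almost comonotone and the estimate is close to $1$; with more noise the estimate decreases, which is the effect monitored in Figure 2.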

Owing to the relatively small size of the set of extreme observations, we perform our score-based CausEV relying solely on three quantiles of the limiting GPD given by Gauss–Legendre quadrature on the interval $[0,1]$, i.e., the quantities in the relation (14) are computed for $\tau=0.5$ and $\tau=0.5\times(1\pm\sqrt{3/5})$ and weighted respectively by $w=0.5\times(8/9)$ and $w=0.5\times(5/9)$ to approximate the integrated scores in (15). We conduct the estimation of the extreme value copula using the min-projection method of Mhalla et al. (2019) at $500$ unequally spaced values in the unit simplex. The margins are transformed to the unit Fréchet scale using rank transformations.
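For reference, the three-point rule can be written out in a few lines of Python. In the standard Gauss–Legendre rule on $[-1,1]$ the central node carries weight $8/9$ and the two outer nodes carry weight $5/9$; rescaling to $[0,1]$ halves the weights. The function names are illustrative, and `score` stands for any function of $\tau$, such as an estimated quantile score.

```python
import math

def gauss_legendre_3_unit_interval():
    """Three-point Gauss-Legendre nodes and weights rescaled from [-1, 1] to [0, 1]."""
    r = math.sqrt(3.0 / 5.0)
    nodes = [0.5 * (1.0 - r), 0.5, 0.5 * (1.0 + r)]
    weights = [0.5 * 5.0 / 9.0, 0.5 * 8.0 / 9.0, 0.5 * 5.0 / 9.0]
    return nodes, weights

def integrate(score, nodes, weights):
    """Approximate the integral of score(tau) over [0, 1] by a weighted sum."""
    return sum(w * score(t) for t, w in zip(nodes, weights))

nodes, weights = gauss_legendre_3_unit_interval()
print(nodes)
print(integrate(lambda t: t ** 4, nodes, weights))   # exact value of the integral is 1/5
```

The rule integrates polynomials of degree up to five exactly, so three evaluations of the score already give a reasonable approximation when the score varies smoothly in $\tau$.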

Figure 3 shows the estimated score $S_{X\rightarrow Y}^{\text{ext}}$ over the 300 simulations. In almost all cases, the scores are greater than $0.5$ and CausEV can distinguish the cause from the effect in the extreme region of the data. The variability of the noise affects the estimated score $S_{X\rightarrow Y}^{\text{ext}}$ , with larger variances inducing score values closer to $0.5$ . For instance, the scores for scenario 2(d) reach values smaller than $0.5$ when $\sigma_{\epsilon}=4$ and a decision on the causality in this noisy scenario cannot be made. Heuristically, the value of the causal score reflects the strength of the causal signal, with decreasing causal scores approaching $0.5$ in the presence of increasing noise; see for instance the heuristic estimate of confidence defined by Mooij et al. (2016) as the difference between causal scores of both directions.

Although numerous procedures have been proposed to infer the causal direction from bivariate joint observational distributions, these methods are not tailored to deal with exceedances in the joint upper tails and do not exploit the asymptotically motivated results of the multivariate extreme value theory. Moreover, when the causal mechanism is believed to be different in the tails than in the bulk of the distribution (see, e.g., Barbero et al. (2018)), the outcome of these methods might be affected by the causal relations in a moderate regime, thus misrepresenting these relations in an extreme regime. Therefore, to obtain a fair comparison, we apply four methods for the mean level to the extreme set $\{(X^{\text{ext}}_{i},Y^{\text{ext}}_{i})\}_{i=1}^{n_{u}}$ directly.

The first method, LINGAM (Linear Non-Gaussian Acyclic Model) (Shimizu et al., 2006), assumes that the data generating process is linear with non-Gaussian noise and that there are no unobserved confounders. Although the assumption of linearity is violated in our simulation settings, a few scenarios, namely 2(c), 2(d), and 3(e), exhibit a causal relationship close to linearity in the upper tails of the cause $X$. The second method, IGCI (Information-Geometric Causal Inference) (Janzing et al., 2012), assumes the absence of noise in a structural causal model between the cause and the effect, and uses Postulate 1 to construct a score based on the Kullback–Leibler divergence between the densities of the cause and the effect and a reference measure. We choose the Gaussian reference measure as it has been shown by Mooij et al. (2016) to yield better performance than the uniform reference measure. Despite the assumption of a noiseless deterministic relation, Janzing et al. (2012) discuss the robustness of their method in a noisy regime such as (16). The third method, CAM (Causal Additive Model), assumes an additive noise structure between the effect and the cause with a Gaussian noise variable. This method is robust against misspecification of the noise distribution (Bühlmann et al., 2014). The fourth method, RESIT (Regression with Subsequent Independence Test), assumes a structural equation model between $X$ and $Y$ with additive noise and performs an HSIC test (Gretton et al., 2008) for independence between the cause and the residuals of a generalized additive model of the effect as a function of the cause. The assumptions of the latter two methods are not violated in our simulation scenarios.

We compute the success rate, i.e., the percentage of repetitions (out of 300) inferring the true causal direction. Additionally, we assess the sensitivity of the success rate to the threshold choice by considering joint exceedances in the upper quadrant $\{(X^{\text{ext}}_{i},Y^{\text{ext}}_{i})\}_{i=1}^{n_{u}}=\{(X_{i},Y_{i}):X_{i}>F^{-1}_{X}(u)\ \text{and }Y_{i}>F^{-1}_{Y}(u)\}$ , with $u=0.9,0.93,0.95$ and fixed size $n_{u}=55$ . Figure 4 shows the results for all methods and at all considered marginal thresholds. CausEV is the clear winner as expected and its performance is unaffected by the increasing bias resulting from applying asymptotic extreme value models at lower thresholds. The performances of the methods CAM and IGCI are comparable in some settings, but deteriorate with increasing noise variance. Lower marginal thresholds lead to decreasing success rates for the four mean-based methods in the Gaussian noise scenario 3(e). This is unsurprising since, by taking higher marginal thresholds, the causal structure is less spoiled by the light-tailed noise in these scenarios.

4.2 Robustness to the assumption of causality

Our use of the weaker version of Postulate 2 given by (8) assumes $X$ and $Y$ are causally related. We now carry out experiments under three settings where this assumption does not hold.

Here, the MDL principle is applied to the correct class of models, i.e., the true underlying model belongs to the class of considered models. We simulate $(X,Y)$ from bivariate generalized extreme value distributions with the symmetric and asymmetric logistic copulas (Tawn, 1988). Under the symmetric logistic setting, the strength of asymptotic dependence between $X$ and $Y$ depends on the parameter $\alpha\in(0,1]$, which we vary from $0.1$ (strong dependence) to $0.9$ (weak dependence). Two additional parameters $(\theta_{1},\theta_{2})\in[0,1]^{2}$ control the asymmetry of the dependence structure between $X$ and $Y$ in the asymmetric logistic case. Under this setting, we set the overall dependence parameter $\alpha=0.2$ and explore different levels of asymmetry by fixing $\theta_{1}=1$ and varying $\theta_{2}=0.1,\ldots,0.9$. The symmetric case is retrieved when $\theta_{1}=\theta_{2}=1$. By construction, both resulting random vectors satisfy the assumptions of our model class but neither $X$ causes $Y$ nor $Y$ causes $X$. We investigate the causality in the upper quadrant $\{(X^{\text{ext}}_{i},Y^{\text{ext}}_{i})\}_{i=1}^{n_{u}}=\{(X_{i},Y_{i}):X_{i}>F^{-1}_{X}(0.95)\ \text{and }Y_{i}>F^{-1}_{Y}(0.95)\}$ and base the experiments on $300$ repetitions where we fix the number of concurrent exceedances at $n_{u}=55$. Figure 5 shows the estimated scores $S_{X\rightarrow Y}^{\text{ext}}$ as a function of the logistic dependence parameter $\alpha$ (left panel) and the asymmetry parameter $\theta_{2}$ of the asymmetric logistic copula (middle panel). Based on the mean estimates of the score $S_{X\rightarrow Y}^{\text{ext}}$, and for the different considered strengths of asymptotic dependence and asymmetry, CausEV is robust against the absence of a causal mechanism in the data and yields few false positives.
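To give a sense of the dependence strengths covered by these settings, the Python sketch below evaluates the Pickands dependence function of the (asymmetric) logistic model and the implied coefficient $\chi=2-2A(1/2)$. The parametrisation shown is one common convention and is an assumption here; which of $\theta_{1},\theta_{2}$ is attached to which margin varies across references, but this does not affect the value at $w=1/2$.

```python
def pickands_asymmetric_logistic(w, alpha, theta1=1.0, theta2=1.0):
    """Pickands function A(w) of the asymmetric logistic model (assumed parametrisation)."""
    inner = (theta1 * (1.0 - w)) ** (1.0 / alpha) + (theta2 * w) ** (1.0 / alpha)
    return (1.0 - theta1) * (1.0 - w) + (1.0 - theta2) * w + inner ** alpha

def chi(alpha, theta1=1.0, theta2=1.0):
    """Coefficient of tail dependence chi = 2 - 2 * A(1/2)."""
    return 2.0 - 2.0 * pickands_asymmetric_logistic(0.5, alpha, theta1, theta2)

# Symmetric logistic: alpha from 0.1 (strong) to 0.9 (weak dependence).
for k in range(1, 10):
    print("alpha", k / 10, "chi", round(chi(k / 10), 3))

# Asymmetric logistic: alpha = 0.2, theta1 = 1, theta2 from 0.1 to 0.9.
for k in range(1, 10):
    print("theta2", k / 10, "chi", round(chi(0.2, 1.0, k / 10), 3))
```

In the symmetric case this reduces to $\chi=2-2^{\alpha}$, so $\alpha=0.1$ gives $\chi\approx 0.93$ and $\alpha=0.9$ gives $\chi\approx 0.13$.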

We now add model misspecification: we repeat the above simulation scheme but with a random vector $(X,Y)$ simulated from a bivariate Gaussian distribution with correlation $\rho$ varying from $0.6$ to $0.95$ . The Gaussian copula is asymptotically independent, i.e., although we observe residual tail dependence at finite thresholds, the associated coefficient of tail dependence is $\chi=0$ . Assuming an extreme value copula in the joint tail for $\{(X^{\text{ext}}_{i},Y^{\text{ext}}_{i})\}_{i=1}^{n_{u}}=\{(X_{i},Y_{i}):X_{i}>F^{-1}_{X}(0.95)\ \text{and }Y_{i}>F^{-1}_{Y}(0.95)\}$ will overestimate the dependence in the joint tails and we assess here the resulting consequences on the causal score $S_{X\rightarrow Y}^{\text{ext}}$ . Figure 5 (right) shows the estimated scores $S_{X\rightarrow Y}^{\text{ext}}$ as a function of $\rho$ . The resulting mean estimates of the score $S_{X\rightarrow Y}^{\text{ext}}$ show very few false positives, accurately detecting the absence of a causal mechanism in the data.

5 Causal mechanism of extremes in the upper Danube basin

The upper Danube basin covers most of the German state of Bavaria and parts of Baden-Württemberg, Austria and Switzerland. Frequent flooding in the area has led to a well-developed system of gauging stations covering the basin. Figure 1 shows the locations of the 31 stations at which we have daily measurements of river discharge ($\text{m}^{3}/\text{s}$). The daily series have lengths from 51 to 110 years and are made available by the Bavarian Environmental Agency (http://www.gkd.bayern.de). There are 51 years of data for all stations from 1960–2010 and these data are analysed in Asadi et al. (2015) and Engelke and Hitz (2020), where the multivariate approaches require observations at all gauging stations to be available. We also consider data from 1960–2010 and remove seasonality and trend issues following these authors: only data for the months of June, July, and August are retained. The most extreme precipitations and floods occur during summer (Jeneiová et al., 2016); see Asadi et al. (2015) for further justification of temporal stationarity over the period retained.

Our interest lies in the causal structure between the peak flows at the 31 stations. Temporal dependence causes extreme discharges at a given station to occur in clusters. Extreme discharges at upstream stations may also cause extreme discharges at downstream stations some days later. All these extremes should be considered part of the same event and treated as dependent. The data must thus be declustered and we do so following Asadi et al. (2015, Section 5.2).

Declustering yields approximately independent threshold exceedances at each station, and allows us to unveil the mechanism of causality in the extremes due to the inherent physical river dynamics, as opposed to the temporal causality in the discharges at flow-connected stations that one would observe due to time lags allowing water propagation through the network. According to Serinaldi et al. (2018), time lags between peaks of flood events reflect flood duration which, for European summer flood events, rarely exceeds the $9$-day period used by Asadi et al. (2015). Starting from the $51\times 92$ daily observations, the declustering step results in $n=428$ independent events at all of the $31$ gauging stations. Each independent flood event represents a $31$-dimensional vector whose $i$-th entry corresponds to the maximum water discharge at the $i$-th station, observed within a $9$-day window where at least one station witnessed a large discharge value.
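The construction of multivariate events can be sketched in Python as follows. This is a simplified illustration loosely modelled on the description above, not the exact procedure of Asadi et al. (2015): the ranking rule, the centred window, the stopping rule `n_events` and the treatment of the record as one continuous series (ignoring the breaks between summers) are all assumptions.

```python
def decluster(discharge, window=9, n_events=None):
    """Greedy declustering of a (days x stations) table into multivariate events.

    discharge: list of rows, one per day, each a list of station values.
    Returns a list of (centre_day, event_vector) with non-overlapping windows.
    """
    n_days, n_stations = len(discharge), len(discharge[0])
    half = window // 2
    # Rank each station's series (0 = smallest) so that stations are comparable.
    ranks = [[0] * n_stations for _ in range(n_days)]
    for j in range(n_stations):
        order = sorted(range(n_days), key=lambda d: discharge[d][j])
        for r, d in enumerate(order):
            ranks[d][j] = r
    # Visit days from the most to the least extreme (largest rank at any station).
    by_severity = sorted(range(n_days), key=lambda d: max(ranks[d]), reverse=True)
    used = [False] * n_days
    events = []
    for d in by_severity:
        lo, hi = max(0, d - half), min(n_days, d + half + 1)
        if any(used[lo:hi]):
            continue                      # window overlaps an earlier event
        for k in range(lo, hi):
            used[k] = True
        # Event vector: station-wise maximum within the window.
        vector = [max(discharge[k][j] for k in range(lo, hi)) for j in range(n_stations)]
        events.append((d, vector))
        if n_events is not None and len(events) == n_events:
            break
    return events

# Toy record: two stations, 30 days, peaks two days apart belong to one event.
toy = [[0.01 * d, 0.01 * d] for d in range(30)]
toy[10][0] = 50.0
toy[12][1] = 40.0
print(decluster(toy, window=9, n_events=1))
```

In the toy record the peaks at the two stations occur two days apart and are merged into a single event, mirroring the idea that a flood wave reaching a downstream station a few days later is part of the same event.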

We assess the validity of the assumption of asymptotic dependence in the data by computing empirical estimates of the coefficient of tail dependence $\chi$ (6) for all 465 possible pairs of stations. Figure 6 depicts the estimates across all pairs, along with $95\%$ block bootstrap uncertainty bounds.

All estimates are strictly positive and we proceed by assuming asymptotic dependence in the data. This assumption is also justified by the exploratory analysis of Asadi et al. (2015, Section 5.4) and the hydrological properties of the flow connections in the upper Danube basin; see for instance the tree induced by flow-connections in Engelke and Hitz (2020), reproduced in the left panel of Figure 7.

We consider the set of all possible pairs of stations in the river network to which we apply CausEV. For each pair, we set the marginal thresholds at the $90\%$ empirical quantiles leading to an average of $n_{u}=25$ joint threshold exceedances. For the estimation of the extreme value copula, we transform the data empirically to the unit Fréchet scale and perform the min-projection of Mhalla et al. (2019) at $500$ unequally spaced values in the unit simplex $[0,1]$ . The Pickands dependence function is then estimated based on the deficits of the $10\%$ quantile of the min-projected random variable and regularised using the COBS procedure; see Ng and Maechler (2007) and Mhalla et al. (2017) for details. Figure 6 displays the resulting estimates of the coefficient of upper tail dependence for all pairs of stations $(X_{i},X_{j})$ . Although this non-parametric estimation procedure might result in biased estimates as it assumes the validity of the extreme value copula (Serinaldi et al., 2015), it yields sensible estimates of the tail correlation in the pairs, with slightly less bias for the pairs of flow-connected sites.

The right panel of Figure 7 displays, with solid edges, the oriented graph induced by the pairs of stations with a significant causal relation, i.e., with $95\%$ percentile bootstrap bounds for the causal score not containing the critical value $0.5$. Bootstrap bounds are based on a yearly block bootstrap procedure applied to the declustered data 300 times. Forty-three oriented edges, from the cause to the effect, are observed over the upper Danube basin. All follow the natural water flow topography.
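A yearly block bootstrap with percentile bounds can be sketched in Python as below. The argument `statistic` is a placeholder for the causal score computed on a resampled set of events; the data layout, the percentile indexing and the function names are assumptions made for illustration.

```python
import random

def yearly_block_bootstrap(events_by_year, statistic, n_boot=300, level=0.95, seed=0):
    """Percentile bootstrap bounds obtained by resampling whole years with replacement.

    events_by_year: dict mapping a year to the list of events observed that year.
    statistic: function taking a flat list of events and returning a float.
    """
    rng = random.Random(seed)
    years = sorted(events_by_year)
    estimates = []
    for _ in range(n_boot):
        resampled = []
        for _ in years:                   # draw as many years as in the record
            resampled.extend(events_by_year[rng.choice(years)])
        if resampled:                     # skip an empty pseudo-sample
            estimates.append(statistic(resampled))
    estimates.sort()
    alpha = (1.0 - level) / 2.0
    lower = estimates[int(alpha * (len(estimates) - 1))]
    upper = estimates[int(round((1.0 - alpha) * (len(estimates) - 1)))]
    return lower, upper

def is_significant(lower, upper, critical=0.5):
    """An edge is kept when the bootstrap interval excludes the critical value."""
    return not (lower <= critical <= upper)

# Toy usage: the "events" are numbers and the statistic is their mean.
toy = {1960 + i: [0.6 + 0.01 * (i % 5), 0.7] for i in range(51)}
low, high = yearly_block_bootstrap(toy, lambda ev: sum(ev) / len(ev))
print(low, high, is_significant(low, high))
```

Resampling whole years keeps together the events of one summer, so any dependence remaining within a year is preserved in each pseudo-sample.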

Among the $43$ significant edges, we observe $28$ edges oriented towards stations located over the Danube river (dark green stations in Figure 7). Aside from the edges connecting stations along the Danube river and for which the direction agrees with the natural water flow, we uncover causal relations between the extreme discharges of stations located in the Alpine tributaries (red stations in Figure 7) of Lech (stations $20$ – $22$ ) and Salzach (stations $28$ – $31$ ) and the extreme discharges of stations located in the Bavarian Danube. These findings agree with the assertion of Skublics et al. (2016) and Blöschl et al. (2013) that summer floods in the Bavarian Danube basin are strongly caused by topographically enhanced precipitation at the northern rim of the Alps.

We do not find edges in the northern tributaries of Regen and Naab (yellow stations in Figure 7) and gauging station $4$ , even if they are flow-connected. This is in fact as expected since we are only examining extreme summer events where causal signals between these tributaries and the Danube might be weak. According to Skublics et al. (2016), it is warm winter snow melt and rainfall in these northern tributaries that lead to extreme discharges and winter flooding in the Bavarian Danube basin.

We do not find an edge between the flow-connected gauging stations $15$ and $14$ . Looking at the daily water discharges at these stations, we notice a strong linear correlation in the observations that carries over in the tails with an empirical estimate of $\chi_{0.9}$ evaluated approximately at $0.89$ . The observed linearity in the data renders the causal discovery task infeasible as both causal directions could be observed, although in reality only the one given by the flow direction is possible.

We find an edge between the Alpine tributaries Iller (station $12$ ) and Lech (station $21$ ). These are not flow-connected, but they originate in the same region of the Bregenz Forest Mountains in the Alps, where common topographic effects induce moisture convergence during summer that triggers intense convective downpours (Beniston, 2007). The edge should be interpreted cautiously, as extreme discharges along these tributaries are likely to be triggered by a set of confounders related to their geographical locations. For example, the largest river discharges at these two stations were observed in August 2005 when western Tyrol and the south of Bavaria witnessed extensive precipitation and high antecedent soil moisture (BLU (2007), Bayerisches Landesamt für Umwelt). The oriented edge between stations $12$ and $21$ remains even after removing the August 2005 flood event. The Salzach tributary (stations $28$ to $31$ ), although originating in the Alps, is not causally related to these tributaries. This is expected from the topographic map of the upper Danube basin where the Salzach catchment is separated from the sub-catchment of the other Alpine tributaries by the Inn river (not represented in our dataset). For instance, the most extreme discharge observed at station $28$ occurred in August 1977 and it coincides with the 26th largest discharge at station $21$ .

An advantage of CausEV is that the analyses are done pairwise and do not require complete data at all stations for data to be retained, as in a multivariate approach. We restricted our analyses to data over 1960–2010 only because the upper Danube basin presents non-stationarity over longer periods due to the construction of dams and hydropower plants along the Danube and its tributaries prior to 1960. Only a few segments in the upper Danube basin are uninterrupted by a dam, e.g., that stretching from Straubing (upstream of the gauging station $3$) to Vilshofen (downstream of the gauging station $2$), and the segment linking gauging station $14$ on the Isar to the Danube; see Map 23.1 in Mauser and Prasch (2015). When all the available observations at these stations are considered, i.e., from the summer of $1901$ for station $2$ and from the summer of $1926$ for stations $3$ and $14$, we uncover a causal effect for the flow-connected pairs $3\longrightarrow 2$ and $14\longrightarrow 2$ and, as one would expect, there are no cycles in the resulting graph.

6 Conclusion

The approach described in this paper is a first step towards the development of causal inference for extreme values, a new and promising line of research. The resulting method is based on the asymptotically justified arguments of extreme value theory, lifting the restriction on the knowledge of the true distribution underlying the observations. In the context of its application to the Danube, CausEV requires no additional information such as the topology of the network or the distances between the pairs of stations. Other applications in which the underlying causal relationships between extremes matter are thus easily treated. An important point not treated in this paper is the possible presence of confounders. Future work will aim to adapt the current method by including the effects of common causes at extreme levels. Specifically, one can compute proxies for the Kolmogorov complexity conditional on a set of potential confounders by modelling the influence of these confounders on the conditional quantiles, that is, on the marginal distributions and the extreme value copula.

Acknowledgements

The first author acknowledges the support of the Centre de Recherches Mathématiques and the Canadian Statistical Sciences Institute. Support of the Swiss National Science Foundation is gratefully acknowledged by the first and second authors. The second author acknowledges the financial support of the Forschungsinstitut fuer Mathematik (FIM) of the ETH Zurich. The third author acknowledges the support of the Natural Sciences and Engineering Research Council of Canada RGPIN-2016-04114 and the Fondation HEC. The authors would like to thank Anthony C. Davison for his comments and suggestions. The authors also thank the Associate Editor and two anonymous referees for comments that improved the presentation of the results.

Bibliography


Asadi, P., Davison, A. C. and Engelke, S. (2015) Extremes on river networks. The Annals of Applied Statistics, 9, 2023–2050.
Aue, A., Cheung, R. C. Y., Lee, T. C. M. and Zhong, M. (2014) Segmented model selection in quantile regression using the minimum description length principle. Journal of the American Statistical Association, 109, 1241–1256. URL: http://www.tandfonline.com/doi/abs/10.1080/01621459.2014.889022.
Balkema, A. A. and de Haan, L. (1974) Residual life time at great age. The Annals of Probability, 792–804.
Barbero, R., Westra, S., Lenderink, G. and Fowler, H. J. (2018) Temperature-extreme precipitation scaling: a two-way causality? International Journal of Climatology, 38, e1274–e1279.
Beirlant, J., Goegebeur, Y., Segers, J., Teugels, J., De Waal, D. and Ferro, C. (2004) Statistics of Extremes: Theory and Applications. New York: Wiley.
Beniston, M. (2007) Linking extreme climate events and economic impacts: Examples from the Swiss Alps. Energy Policy, 35, 5384–5392.
Blöschl, G., Nester, T., Komma, J., Parajka, J. and Perdigão, R. A. P. (2013) The June 2013 flood in the Upper Danube Basin, and comparisons with the 2002, 1954 and 1899 floods. Hydrology and Earth System Sciences, 17, 5197–5212. URL: https://www.hydrol-earth-syst-sci.net/17/5197/2013/.
BLU (Bayerisches Landesamt für Umwelt) (2007) August – Hochwasser 2005 in Südbayern (August 2005 flood in Southern Bavaria). Tech. rep., Bayerisches Landesamt für Umwelt.