Graphical Models for Extremes

Sebastian Engelke; Adrien S. Hitz

arXiv:1812.01734·math.ST·November 14, 2019

TL;DR

This paper develops a new graphical model framework for multivariate extreme value distributions, enabling sparse and interpretable models for high-dimensional rare event data, with applications to flood risk assessment.

Contribution

It introduces a general theory of conditional independence for multivariate Pareto distributions, linking it to graphical models and sparsity in extreme value analysis.

Findings

1. Hammersley–Clifford theorem for extremal graphical models

2. For Hüsler–Reiss distributions, sparsity patterns can be read off from suitable inverse covariance matrices

3. Application to flood risk assessment on the Danube river

Abstract

Conditional independence, graphical models and sparsity are key notions for parsimonious statistical models and for understanding the structural relationships in the data. The theory of multivariate and spatial extremes describes the risk of rare events through asymptotically justified limit models such as max-stable and multivariate Pareto distributions. Statistical modelling in this field has been limited to moderate dimensions so far, partly owing to complicated likelihoods and a lack of understanding of the underlying probabilistic structures. We introduce a general theory of conditional independence for multivariate Pareto distributions that allows the definition of graphical models and sparsity for extremes. A Hammersley–Clifford theorem links this new notion to the factorization of densities of extreme value models on graphs. For the popular class of Hüsler–Reiss distributions…

Equations

$$Y_{i}\perp_{e}Y_{j}\mid\boldsymbol{Y}_{\setminus\{i,j\}},\quad(i,j)\notin E.$$

$$\lim_{n\to\infty}\mathbb{P}\left(\frac{M_{jn}-b_{jn}}{a_{jn}}\leq x\right)=G_{j}(x)=\exp\left\{-(1+\xi_{j}x)_{+}^{-1/\xi_{j}}\right\},\quad x\in\mathbb{R},$$

$$\lim_{n\to\infty}\mathbb{P}\left\{\max_{i=1,\dots,n}X_{i1}\leq nz_{1},\dots,\max_{i=1,\dots,n}X_{id}\leq nz_{d}\right\}=\mathbb{P}(\boldsymbol{Z}\leq\boldsymbol{z}).$$

$$\mathbb{P}(\boldsymbol{Z}\leq\boldsymbol{z})=\exp\{-\Lambda(\boldsymbol{z})\},\quad\boldsymbol{z}\in\mathcal{E},$$

$$\int_{\boldsymbol{y}\in\mathcal{E}:y_{i}>1}\lambda(\boldsymbol{y})\,\mathrm{d}\boldsymbol{y}=1.$$

$$\lambda_{I}(\boldsymbol{y}_{I})=\int_{[0,\infty)^{d-|I|}}\lambda(\boldsymbol{y})\,\mathrm{d}\boldsymbol{y}_{\setminus I},$$

$$\lim_{u\to\infty}u\{1-\mathbb{P}(\boldsymbol{X}\leq u\boldsymbol{z})\}=\Lambda(\boldsymbol{z}),\quad\boldsymbol{z}\in\mathcal{E}.$$

$$\mathbb{P}(\boldsymbol{Y}\leq\boldsymbol{z})$$

$$f_{\boldsymbol{Y}}(\boldsymbol{y})=\frac{\partial^{d}}{\partial y_{1}\cdots\partial y_{d}}\mathbb{P}(\boldsymbol{Y}\leq\boldsymbol{y})=\frac{\lambda(\boldsymbol{y})}{\Lambda(\boldsymbol{1})},\quad\boldsymbol{y}\in\mathcal{L},$$

$$f_{I}(\boldsymbol{y}_{I})=\frac{\Lambda(\boldsymbol{1})}{\Lambda_{I}(\boldsymbol{1})}\int_{[0,\infty)^{d-|I|}}f_{\boldsymbol{Y}}(\boldsymbol{y})\,\mathrm{d}\boldsymbol{y}_{\setminus I}=\frac{\lambda_{I}(\boldsymbol{y}_{I})}{\Lambda_{I}(\boldsymbol{1})},\quad\boldsymbol{y}_{I}\in\mathcal{L}_{I},$$

$$f_{\boldsymbol{Y}}(\boldsymbol{y})=\frac{1}{d^{\theta}}\left(y_{1}^{-1/\theta}+\dots+y_{d}^{-1/\theta}\right)^{\theta-d}\prod_{i=1}^{d-1}\left(\frac{i}{\theta}-1\right)\prod_{i=1}^{d}y_{i}^{-1/\theta-1},\quad\boldsymbol{y}\in\mathcal{L}.$$

$$\lambda(\boldsymbol{y})$$

$$\Sigma^{(k)}=\frac{1}{2}\{\Gamma_{ik}+\Gamma_{jk}-\Gamma_{ij}\}_{i,j\neq k}\in\mathbb{R}^{(d-1)\times(d-1)}.$$

$$\lambda(y_{1},y_{2})=\frac{y_{1}^{-2}y_{2}^{-1}}{\sqrt{2\pi\Gamma_{12}}}\exp\left[-\frac{\{\log(y_{2}/y_{1})+\Gamma_{12}/2\}^{2}}{2\Gamma_{12}}\right],\quad(y_{1},y_{2})\in\mathcal{E},$$

$$\lambda(y_{1},y_{2})=y_{1}^{-3}f_{U^{1}_{2}}(y_{2}/y_{1}),\quad(y_{1},y_{2})\in\mathcal{E},$$

$$f_{\boldsymbol{X}}(\boldsymbol{x})=\frac{f_{A\cup B}(\boldsymbol{x}_{A\cup B})f_{B\cup C}(\boldsymbol{x}_{B\cup C})}{f_{B}(\boldsymbol{x}_{B})},$$

$$f_{\boldsymbol{X}}(\boldsymbol{x})=\prod_{C\in\mathcal{C}}\psi_{C}(\boldsymbol{x}_{C}),\quad\boldsymbol{x}\in\mathcal{X},$$

$$f_{\boldsymbol{X}}(\boldsymbol{x})=\frac{\prod_{C\in\mathcal{C}}f_{C}(\boldsymbol{x}_{C})}{\prod_{D\in\mathcal{D}}f_{D}(\boldsymbol{x}_{D})},\quad\boldsymbol{x}\in\mathcal{X},$$

$$W_{i}\perp\!\!\!\perp W_{j}\mid\boldsymbol{W}_{\setminus\{i,j\}}\iff\Sigma^{-1}_{ij}=0.$$

$$f^{k}(\boldsymbol{y})=\frac{f_{\boldsymbol{Y}}(\boldsymbol{y})}{\int_{\mathcal{L}^{k}}f_{\boldsymbol{Y}}(\boldsymbol{y})\,\mathrm{d}\boldsymbol{y}}=\lambda(\boldsymbol{y}),\quad\boldsymbol{y}\in\mathcal{L}^{k},$$

$$f^{k}_{I}(\boldsymbol{y}_{I})=\int_{[0,\infty)^{d-|I|}}\lambda(\boldsymbol{y})\,\mathrm{d}\boldsymbol{y}_{\setminus I}=\lambda_{I}(\boldsymbol{y}_{I}),\quad\boldsymbol{y}_{I}\in\mathcal{L}^{k}_{I},$$

$$\forall k\in\{1,\dots,d\}:\quad\boldsymbol{Y}^{k}_{A}\perp\!\!\!\perp\boldsymbol{Y}^{k}_{C}\mid\boldsymbol{Y}^{k}_{B}.$$

$$\exists k\in B:\quad\boldsymbol{Y}^{k}_{A}\perp\!\!\!\perp\boldsymbol{Y}^{k}_{C}\mid\boldsymbol{Y}^{k}_{B}.$$

$$\lambda(\boldsymbol{y})=\frac{\lambda_{A\cup B}(\boldsymbol{y}_{A\cup B})\lambda_{B\cup C}(\boldsymbol{y}_{B\cup C})}{\lambda_{B}(\boldsymbol{y}_{B})},\quad\boldsymbol{y}\in\mathcal{L}.$$

$$Y_{i}\perp_{e}Y_{j}\mid\boldsymbol{Y}_{\setminus\{i,j\}},\quad(i,j)\notin E,$$

$$f_{\boldsymbol{Y}}(\boldsymbol{y})=\frac{1}{\Lambda(\boldsymbol{1})}\frac{\prod_{C\in\mathcal{C}}\lambda_{C}(\boldsymbol{y}_{C})}{\prod_{D\in\mathcal{D}}\lambda_{D}(\boldsymbol{y}_{D})},\quad\boldsymbol{y}\in\mathcal{L},$$

$$E=\{\{1,2\},\{2,3\},\dots,\{d-1,d\}\}.$$

$$\lambda_{I}(\boldsymbol{y}_{I})$$

$$\lambda_{D}(\boldsymbol{y}_{D})=\int_{[0,\infty)^{|C\setminus D|}}\lambda_{C}(\boldsymbol{y}_{C})\,\mathrm{d}\boldsymbol{y}_{C\setminus D}.$$

$$\lambda(\boldsymbol{y})=\frac{\prod_{C\in\mathcal{C}}\lambda_{C}(\boldsymbol{y}_{C})}{\prod_{D\in\mathcal{D}}\lambda_{D}(\boldsymbol{y}_{D})},\quad\boldsymbol{y}\in\mathcal{L},$$

Full text

Graphical Models for Extremes

Sebastian Engelke

Research Center for Statistics, University of Geneva, Boulevard du Pont d’Arve 40, 1205 Geneva, Switzerland.

Adrien S. Hitz

Department of Statistics, University of Oxford, 24-29 St Giles’, Oxford OX1 3LB, UK and Materialize.X, Enterprise Lab, Imperial College London, London SW7 2AZ, UK.

Abstract

Conditional independence, graphical models and sparsity are key notions for parsimonious statistical models and for understanding the structural relationships in the data. The theory of multivariate and spatial extremes describes the risk of rare events through asymptotically justified limit models such as max-stable and multivariate Pareto distributions. Statistical modelling in this field has been limited to moderate dimensions so far, partly owing to complicated likelihoods and a lack of understanding of the underlying probabilistic structures. We introduce a general theory of conditional independence for multivariate Pareto distributions that allows the definition of graphical models and sparsity for extremes. A Hammersley–Clifford theorem links this new notion to the factorization of densities of extreme value models on graphs. For the popular class of Hüsler–Reiss distributions we show that, similarly to the Gaussian case, the sparsity pattern of a general extremal graphical model can be read off from suitable inverse covariance matrices. New parametric models can be built in a modular way and statistical inference can be simplified to lower-dimensional marginals. We discuss learning of minimum spanning trees and model selection for extremal graph structures, and illustrate their use with an application to flood risk assessment on the Danube river.

Keywords: Extreme value theory; Conditional independence; Multivariate Pareto distribution; Graphical models; Sparsity

1 Introduction

Evaluation of the risk related to heat waves, extreme flooding, financial crises, or other rare events requires the quantification of their small occurrence probabilities. Empirical estimates are unreliable since the regions of interest are in the tail of the distribution and typically contain few or no data points. Extreme value theory provides the theoretical foundation for extrapolations to the distributional tail of a $d$ -dimensional random vector $\boldsymbol{X}$ . The univariate case $d=1$ is well-studied and the generalized extreme value and Pareto distributions are widely applied in areas such as hydrology (Katz et al., 2002), climate science (Min et al., 2011) and finance (McNeil et al., 2015); see also Embrechts et al. (1997) and Beirlant et al. (2004).

In the multivariate setting, $d\geq 2$ , the result of the extrapolation strongly depends on the strength of extremal dependence between the components of $\boldsymbol{X}$ . Most current statistical models assume multivariate regular variation for $\boldsymbol{X}$ (Resnick, 2008) since this entails mathematically elegant descriptions of the asymptotic tail distribution. Similar to the univariate setting, two different but closely related approaches exist. Max-stable distributions arise as limits of normalised maxima of independent copies of $\boldsymbol{X}$ and have been extensively studied and applied in multivariate and spatial risk problems (cf., de Haan, 1984; Gudendorf and Segers, 2010; Davison et al., 2012). On the other hand, multivariate Pareto distributions describe the random vector $\boldsymbol{X}$ conditioned on the event that at least one component exceeds a high threshold; see Rootzén and Tajvidi (2006), Rootzén et al. (2018) and Kiriliouk et al. (2018) for their construction, stability properties and statistical inference.

A drawback of the current multivariate models is their limitation to rather moderate dimensions $d$ , and the construction of tractable parametric models in higher dimensions is challenging, both for max-stable and multivariate Pareto distributions. Sparse multivariate models require the notion of conditional independence (Dawid, 1979), which is not easy to define for tail distributions. In fact, Papastathopoulos and Strokorb (2016) show that if $(Z_{1},Z_{2},Z_{3})$ is a max-stable random vector with positive continuous density, then the conditional independence of $Z_{1}\perp\!\!\!\perp Z_{3}\mid Z_{2}$ already implies the independence $Z_{1}\perp\!\!\!\perp Z_{3}$ ; see also Dombry and Éyi-Minko (2014). Meaningful conditional independence structures can thus only be obtained for max-stable distributions with discrete spectral measure (Gissibl and Klüppelberg, 2018). Since these models do not admit densities, this excludes most of the currently used parametric families.

In this work we take the perspective of threshold exceedances and introduce a new notion of conditional independence for a multivariate Pareto distribution $\boldsymbol{Y}=(Y_{1},\dots,Y_{d})$ , which we denote by $\perp_{e}$ to stress that it is designed for extremes. It is different from classical conditional independence since the support of $\boldsymbol{Y}$ is not a product space, but the homogeneity property of $\boldsymbol{Y}$ can be used to show that it is well-defined. Conditional independence is tightly linked to graphical models. For an undirected graph $\mathcal{G}=(V,E)$ with nodes $V=\{1,\dots,d\}$ and edge set $E$ , we say that $\boldsymbol{Y}$ is an extremal graphical model if it satisfies the pairwise Markov property

$$Y_{i}\perp_{e}Y_{j}\mid\boldsymbol{Y}_{\setminus\{i,j\}},\quad(i,j)\notin E.\tag{1}$$

The main advantage of conditional independence and graphical models is that they imply a simple probabilistic structure and possibly sparse patterns in multivariate random vectors (Lauritzen, 1996; Wainwright and Jordan, 2008). For extremal graphical models on decomposable graphs, we prove a Hammersley–Clifford type theorem stating that (1) is equivalent to the factorization of the density $f_{\boldsymbol{Y}}$ of $\boldsymbol{Y}$ into lower-dimensional marginals. This underlines that our notion of conditional independence is in fact natural for multivariate Pareto distributions.

Applications of this result are manifold. From a probabilistic perspective, we analyse models in the literature regarding their graphical properties in the sense of our definition (1). Extremal graphical models whose underlying graph is a tree have a particularly simple multiplicative stochastic representation in terms of extremal functions, a notion that is known from the simulation of max-stable processes (Dombry et al., 2016). In multivariate extremes, one may argue that the family of Hüsler–Reiss distributions (Hüsler and Reiss, 1989) plays a role similar to that of Gaussian distributions in the non-extreme world. Instead of covariance matrices, they are parameterized by a variogram matrix $\Gamma$. We show that the extremal graphical structure of a Hüsler–Reiss distribution can be identified by zero patterns in matrices derived from $\Gamma$.

Extremal graphical models enable the construction of parsimonious models for multivariate Pareto distributions $\boldsymbol{Y}$ , which further enjoy the advantage of interpretability in terms of the underlying graph. Thanks to the factorization of the densities, statistical inference can be efficiently carried out on lower-dimensional marginals. For decomposable graphs with singleton separator sets, so-called block graphs, this allows the use of multivariate Pareto models in fairly high dimensions. In many cases the underlying graphical structure is unknown and has to be learned from data. We discuss how a maximum likelihood tree can be obtained using standard algorithms by Kruskal (1956) or Prim (1957), and how the best model can be selected among different extremal graphical models.

There is previous work on the construction of parsimonious extreme value models. A large body of literature studies spatial max-stable random fields (Schlather, 2002; Kabluchko et al., 2009; Opitz, 2013). Such models have small parameter dimension but they rely on strong assumptions on stationarity and cannot be applied to multivariate, non-spatial data without information on an underlying space. Other approaches include constructions through factor copulas (Lee and Joe, 2018), ensembles of trees combining bivariate copulas (Yu et al., 2017), graphical models for large censored observations (Hitz and Evans, 2016) and eigendecompositions (Cooley and Thibaud, 2018). Closely related to our concept of conditional independence is the work of Coles and Tawn (1991) and Smith et al. (1997) who propose a Markov chain model where all bivariate marginals are extreme value distributions. This can be seen as a special case of our approach when the graph has the simple structure of a chain. Similar limiting objects also arise as the tail chains in the theory of extremes of stationary Markov chains with regularly varying marginals (Smith, 1992; Basrak and Segers, 2009; Janssen and Segers, 2014). This theory has recently been extended to regularly varying Markov trees (Segers, 2019). Gissibl and Klüppelberg (2018) and Gissibl et al. (2018) study the causal structure of directed acyclic graphs for max-linear models, and they develop methods for model identification based on tail dependence coefficients. Their work is in some sense complementary to ours, since their models do not possess densities whereas we will explicitly assume the existence of densities.

To the best of our knowledge, our work is the first principled attempt to define conditional independence for general multivariate extreme value models that naturally extends to the factorization of densities, sparsity and graphical models. Section 2 introduces background on extreme value theory and graphical models needed throughout the paper. The new notion of conditional independence is defined in Section 3 and equivalent properties are derived. Section 4 contains the main probabilistic results on extremal graphical models, the representation of trees and the characterization for Hüsler–Reiss distributions. Statistical models on block graphs and their estimation, simulation and model selection are discussed in Section 5. In these graphical models the dependence is modeled directly between lower-dimensional subsets of variables, whereas the global dependence is implied by the conditional independence structure of the graph. There are many potential applications of extremal graphical models. In Section 6, we illustrate the advantages of this structured approach compared to classical extreme value models on a data set related to flooding on a river network in the upper Danube basin (cf., Asadi et al., 2015). The interpretation of the graphical structures obtained in this application is particularly interesting since there is a seemingly natural underlying tree associated with the flow connections. Our conditional independence is formulated for multivariate Pareto distributions, but the results in this paper have implications for max-stable distributions. This point and further research directions will be addressed in the discussion in Section 7. The Appendix contains proofs and some additional results.

An implementation for R (R Core Team, 2019) is available in the package graphicalExtremes (Engelke et al., 2019). The code for the simulation study and application can be found in the supplementary material.

2 Background

2.1 Notation

We introduce some standard notation that is used throughout the paper. Symbols in boldface such as $\boldsymbol{x}\in\mathbb{R}^{d}$ are column vectors with components denoted by $x_{i}$ , $i\in\{1,\dots,d\}$ , and operations and relations involving such vectors are meant componentwise. The vectors $\boldsymbol{0}=(0,\dots,0)$ and $\boldsymbol{1}=(1,\dots,1)$ are used as generic vectors with suitable dimension. Denoting the index set by $V=\{1,\dots,d\}$ , for a non-empty subset $I\subset V$ , we write for the subvectors $\boldsymbol{x}_{I}=(x_{i})_{i\in I}$ and $\boldsymbol{x}_{\setminus I}=(x_{i})_{i\in V\setminus I}$ . Similar notation is used for random vectors $\boldsymbol{X}=(X_{i})_{i\in V}$ with values in $\mathbb{R}^{d}$ . For a matrix $A=(A_{ij})_{i,j\in V}\in\mathbb{R}^{d\times d}$ with entries indexed by $V$ , and subsets $I,J\subset V$ we let $A_{IJ}=(A_{ij})_{i\in I,j\in J}$ denote the $|I|\times|J|$ submatrix of $A$ , and we abbreviate $A_{I}=A_{II}$ . For $\boldsymbol{a},\boldsymbol{b}\in\mathbb{R}^{d}$ with $\boldsymbol{a}\leq\boldsymbol{b}$ , a multivariate interval is denoted by $[\boldsymbol{a},\boldsymbol{b}]=[a_{1},b_{1}]\times\dots\times[a_{d},b_{d}]$ . The $\ell_{p}$ -norm of a vector $\boldsymbol{x}\in\mathbb{R}^{d}$ for $p\geq 1$ is $\|\boldsymbol{x}\|_{p}=\left(\sum_{i\in V}|x_{i}|^{p}\right)^{1/p}$ , and its $\ell_{\infty}$ -norm is $\|\boldsymbol{x}\|_{\infty}=\max_{i\in V}|x_{i}|$ . The density of a random vector $\boldsymbol{X}$ , if it exists, is denoted by $f_{\boldsymbol{X}}$ . The density of the marginal $\boldsymbol{X}_{I}$ for a non-empty $I\subset V$ is denoted by $f_{I}$ , if there is no ambiguity regarding the random vector.

2.2 Multivariate extreme value theory

The tail behavior of the random vector $\boldsymbol{X}=(X_{1},\dots,X_{d})$ can be described through two different approaches, one based on componentwise maxima and the other one on threshold exceedances. We briefly discuss both approaches and the close link between them.

Let $\boldsymbol{X}_{i}=(X_{i1},\dots,X_{id})$ , $i=1,\dots,n$ , be independent copies of $\boldsymbol{X}$ and denote the componentwise maximum by $\boldsymbol{M}_{n}=(M_{1n},\dots,M_{dn})=(\max_{i=1}^{n}X_{i1},\dots,\max_{i=1}^{n}X_{id})$ . Under mild conditions on the marginal distribution of $X_{j}$ there exist sequences of normalizing constants $b_{jn}\in\mathbb{R}$ , $a_{jn}>0$ , $j=1,\dots,d$ , such that

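$$\lim_{n\to\infty}\mathbb{P}\left(\frac{M_{jn}-b_{jn}}{a_{jn}}\leq x\right)=G_{j}(x)=\exp\left\{-(1+\xi_{j}x)_{+}^{-1/\xi_{j}}\right\},\quad x\in\mathbb{R},\tag{2}$$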

where $z_{+}=\max(z,0)$ , and $G_{j}$ is the generalized extreme value distribution whose shape parameter $\xi_{j}\in\mathbb{R}$ determines the heaviness of the tail of $X_{j}$ ; see de Haan and Ferreira (2006); Embrechts et al. (1997) and Beirlant et al. (2004) for details. For analysis of the dependence structure, the marginal distributions $F_{j}$ of $X_{j}$ are typically estimated first to normalise the data by $1/\{1-F_{j}(X_{j})\}$ to standard Pareto distributions. For simplicity, we assume in the sequel that the $F_{j}$ are known and the vector $\boldsymbol{X}$ has been normalised to standard Pareto marginals. Joint estimation of marginals and dependence is discussed in Section 5.2.
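A minimal sketch of the marginal standardisation $1/\{1-F_{j}(X_{j})\}$ described above. It assumes that the unknown $F_{j}$ is replaced by a rank-based empirical distribution function rescaled by $n+1$, which is a common convention and not something the text prescribes; the function name `to_standard_pareto` and the toy sample are hypothetical.

```python
def to_standard_pareto(column):
    """Map one margin of a sample to approximately standard Pareto scale.

    Assumption (not from the paper): F_j is estimated by rank / (n + 1),
    so that 1 - F_hat never reaches zero and the transform stays finite.
    """
    n = len(column)
    # Rank 1 goes to the smallest value, rank n to the largest.
    order = sorted(range(n), key=lambda i: column[i])
    ranks = [0] * n
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    # Apply x -> 1 / (1 - F_hat(x)) observation by observation.
    return [1.0 / (1.0 - r / (n + 1)) for r in ranks]


sample = [2.3, -0.7, 5.1, 0.4]          # hypothetical raw observations
pareto_scale = to_standard_pareto(sample)
```

The transform is monotone, so the ordering of the observations is preserved and every transformed value exceeds 1, as expected for a standard Pareto margin.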

The standardized vector $\boldsymbol{X}$ is said to be in the max-domain of attraction of the random vector $\boldsymbol{Z}=(Z_{1},\dots,Z_{d})$ if for any $\boldsymbol{z}=(z_{1},\dots,z_{d})$

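$$\lim_{n\to\infty}\mathbb{P}\left\{\max_{i=1,\dots,n}X_{i1}\leq nz_{1},\dots,\max_{i=1,\dots,n}X_{id}\leq nz_{d}\right\}=\mathbb{P}(\boldsymbol{Z}\leq\boldsymbol{z}).\tag{3}$$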

In this case, $\boldsymbol{Z}$ is max-stable with standard Fréchet marginals $\mathbb{P}(Z_{j}\leq z)=\exp(-1/z)$ , $z\geq 0$ , and we may write

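$$\mathbb{P}(\boldsymbol{Z}\leq\boldsymbol{z})=\exp\{-\Lambda(\boldsymbol{z})\},\quad\boldsymbol{z}\in\mathcal{E},\tag{4}$$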

where the exponent measure $\Lambda$ is a Radon measure on the cone $\mathcal{E}=[0,\infty)^{d}\setminus\{\boldsymbol{0}\}$ , and $\Lambda\left(\boldsymbol{z}\right)$ is shorthand for $\Lambda\left(\mathcal{E}\setminus[\boldsymbol{0},\boldsymbol{z}]\right)$ . If $\Lambda$ is absolutely continuous with respect to Lebesgue measure on $\mathcal{E}$ , its Radon–Nikodym derivative, denoted by $\lambda$ , has the following properties:

(L1) homogeneity of order $-(d+1)$, i.e., $\lambda(t\boldsymbol{y})=t^{-(d+1)}\lambda(\boldsymbol{y})$ for any $t>0$ and $\boldsymbol{y}\in\mathcal{E}$;

(L2) normalised marginals, i.e., for any $i=1,\dots,d$,

$$\int_{\boldsymbol{y}\in\mathcal{E}:y_{i}>1}\lambda(\boldsymbol{y})\,\mathrm{d}\boldsymbol{y}=1.$$

The two properties follow from the max-stability and the standard Fréchet marginals of $\boldsymbol{Z}$ , respectively. For a non-empty subset $I\subset\{1,\dots,d\}$ , we define the marginal of $\lambda$ by

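$$\lambda_{I}(\boldsymbol{y}_{I})=\int_{[0,\infty)^{d-|I|}}\lambda(\boldsymbol{y})\,\mathrm{d}\boldsymbol{y}_{\setminus I},$$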

and note that it is homogeneous of order $-(|I|+1)$ . In particular, if $I=\{i\}$ for some $i=1,\dots,d$ , then $\lambda_{\{i\}}(y_{i})=1/y_{i}^{2}$ as a consequence of (L1) and (L2). Conversely, any positive and continuous function $\lambda$ satisfying (L1) and (L2) defines a valid density of an exponent measure $\Lambda(\boldsymbol{z})$ by integration over $\mathcal{E}\setminus[\boldsymbol{0},\boldsymbol{z}]$ , $\boldsymbol{z}\in\mathcal{E}$ , that satisfies similar homogeneity and normalization properties as $\lambda$ . By (4) this also defines a max-stable distribution.
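The identity $\lambda_{\{i\}}(y_{i})=1/y_{i}^{2}$ can be checked in two short steps. The following worked derivation uses only (L1), (L2) and the definition of the marginal $\lambda_{I}$; it is a sketch of the argument, not a quotation from the paper.

```latex
\begin{align*}
% Step 1: for I = {i} the marginal is homogeneous of order -(|I|+1) = -2.
\lambda_{\{i\}}(y_i) &= y_i^{-2}\,\lambda_{\{i\}}(1), \qquad y_i > 0, \\
% Step 2: integrate out all other coordinates in (L2), then apply Step 1.
1 = \int_{\boldsymbol{y}\in\mathcal{E}:\,y_i>1} \lambda(\boldsymbol{y})\,\mathrm{d}\boldsymbol{y}
  &= \int_1^\infty \lambda_{\{i\}}(y_i)\,\mathrm{d}y_i
   = \lambda_{\{i\}}(1)\int_1^\infty y_i^{-2}\,\mathrm{d}y_i
   = \lambda_{\{i\}}(1).
\end{align*}
```

Hence $\lambda_{\{i\}}(1)=1$ and $\lambda_{\{i\}}(y_{i})=y_{i}^{-2}$.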

Another perspective on multivariate extremes is through threshold exceedances. By Proposition 5.17 in Resnick (2008), the convergence in (3) is equivalent to

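$$\lim_{u\to\infty}u\{1-\mathbb{P}(\boldsymbol{X}\leq u\boldsymbol{z})\}=\Lambda(\boldsymbol{z}),\quad\boldsymbol{z}\in\mathcal{E}.$$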

Consequently, the multivariate distribution of the threshold exceedances of $\boldsymbol{X}$ satisfies

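$$\mathbb{P}(\boldsymbol{Y}\leq\boldsymbol{z})=\lim_{u\to\infty}\mathbb{P}\left(\boldsymbol{X}/u\leq\boldsymbol{z}\mid\|\boldsymbol{X}\|_{\infty}>u\right)=\frac{\Lambda(\boldsymbol{z}\wedge\boldsymbol{1})-\Lambda(\boldsymbol{z})}{\Lambda(\boldsymbol{1})},\quad\boldsymbol{z}\in\mathcal{L}.\tag{6}$$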

The distribution of the limiting random vector $\boldsymbol{Y}$ is called a multivariate Pareto distribution (cf., Rootzén and Tajvidi, 2006). It is defined through the exponent measure $\Lambda$ of the max-stable distribution $\boldsymbol{Z}$ , with support on the $L$ -shaped space $\mathcal{L}=\{\boldsymbol{x}\in\mathcal{E}:\|\boldsymbol{x}\|_{\infty}>1\}$ . We say that $\boldsymbol{Z}$ and $\boldsymbol{Y}$ are associated, since their distributions mutually determine each other.
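In practice the limit over $u$ is replaced by a high but finite threshold. The sketch below extracts approximate samples of $\boldsymbol{Y}$ by keeping the observations whose $\ell_{\infty}$-norm exceeds $u$ and rescaling them by $u$. It assumes the rows are already on standard Pareto scale; the function name `threshold_exceedances`, the threshold value and the toy data are hypothetical.

```python
def threshold_exceedances(rows, u):
    """Return x / u for every observation x whose largest component exceeds u.

    Each returned row lies in the L-shaped space {x : max(x) > 1}. A finite
    threshold u only approximates the limiting multivariate Pareto law.
    """
    return [[x / u for x in row] for row in rows if max(row) > u]


# Hypothetical observations, assumed to have standard Pareto margins.
data = [
    [1.2, 3.0, 1.1],
    [12.0, 2.0, 40.0],
    [1.5, 1.3, 9.5],
    [25.0, 30.0, 1.4],
]
exceedances = threshold_exceedances(data, u=10.0)
```

Only the second and fourth rows are retained here, since they are the only ones with at least one component above the threshold.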

Multivariate Pareto distributions are the only possible limits in (6) and, owing to the homogeneity of the exponent measure, they enjoy certain stability properties (cf., Rootzén et al., 2018). The exponent measure $\Lambda$ , and hence the distribution of $\boldsymbol{Y}$ , may place mass on some lower-dimensional faces of the space $\mathcal{E}$ . For the remainder of this paper we exclude this case to avoid technical difficulties. We further assume that the distribution of $\boldsymbol{Y}$ admits a positive and continuous density $f_{\boldsymbol{Y}}$ on $\mathcal{L}$ , which is

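$$f_{\boldsymbol{Y}}(\boldsymbol{y})=\frac{\partial^{d}}{\partial y_{1}\cdots\partial y_{d}}\mathbb{P}(\boldsymbol{Y}\leq\boldsymbol{y})=\frac{\lambda(\boldsymbol{y})}{\Lambda(\boldsymbol{1})},\quad\boldsymbol{y}\in\mathcal{L},$$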

since $\Lambda(\boldsymbol{y}\wedge\boldsymbol{1})$ is always constant along at least one coordinate for $\boldsymbol{y}\in\mathcal{L}$ . The density $f_{\boldsymbol{Y}}$ is thus proportional to the density $\lambda$ of the exponent measure $\Lambda$ . By the homogeneity of $\lambda$ , $f_{\boldsymbol{Y}}$ is also homogeneous of order $-(d+1)$ . The normalization constant $\Lambda(\boldsymbol{1})\in[1,d]$ is known as the $d$ -variate extremal coefficient (cf., Schlather and Tawn, 2003). The assumption of a positive and continuous density $f_{\boldsymbol{Y}}$ implies that the multivariate Pareto distributions we study are models for asymptotic extremal dependence, and all $p$ -variate extremal coefficients, $1\leq p\leq d$ , are strictly smaller than their upper limit $p$ .

For some non-empty subset $I\subset\{1,\dots,d\}$, the subvector $\boldsymbol{X}_{I}=(X_{j})_{j\in I}$, properly normalised and conditioned on its $\ell_{\infty}$-norm being large, converges in the sense of (6) to the marginal $\boldsymbol{Y}_{I}=(Y_{j})_{j\in I}$ of $\boldsymbol{Y}$ defined on $\mathcal{L}_{I}=\{\boldsymbol{x}_{I}\in[0,\infty)^{|I|}\setminus\{\boldsymbol{0}\}:\|\boldsymbol{x}_{I}\|_{\infty}>1\}$ with homogeneous density of order $-(|I|+1)$ given by

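$$f_{I}(\boldsymbol{y}_{I})=\frac{\Lambda(\boldsymbol{1})}{\Lambda_{I}(\boldsymbol{1})}\int_{[0,\infty)^{d-|I|}}f_{\boldsymbol{Y}}(\boldsymbol{y})\,\mathrm{d}\boldsymbol{y}_{\setminus I}=\frac{\lambda_{I}(\boldsymbol{y}_{I})}{\Lambda_{I}(\boldsymbol{1})},\quad\boldsymbol{y}_{I}\in\mathcal{L}_{I},$$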

where $\Lambda_{I}$ is the exponent measure of $\boldsymbol{Z}_{I}$ , and $\lambda_{I}$ is the density of $\Lambda_{I}$ .

Example 1 (Logistic distribution).

The extremal logistic distribution with parameter $\theta\in(0,1)$ induces a multivariate Pareto distribution with density

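$$f_{\boldsymbol{Y}}(\boldsymbol{y})=\frac{1}{d^{\theta}}\left(y_{1}^{-1/\theta}+\dots+y_{d}^{-1/\theta}\right)^{\theta-d}\prod_{i=1}^{d-1}\left(\frac{i}{\theta}-1\right)\prod_{i=1}^{d}y_{i}^{-1/\theta-1},\quad\boldsymbol{y}\in\mathcal{L}.$$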
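As a quick numerical illustration, the logistic multivariate Pareto density $d^{-\theta}\left(\sum_{i}y_{i}^{-1/\theta}\right)^{\theta-d}\prod_{i=1}^{d-1}(i/\theta-1)\prod_{i=1}^{d}y_{i}^{-1/\theta-1}$ can be coded directly and checked for homogeneity of order $-(d+1)$. This is a sketch under stated assumptions: the function name `logistic_pareto_density`, the evaluation point and the scaling factor `t` are hypothetical, and the caller is assumed to pass a point with positive components in $\mathcal{L}$.

```python
import math


def logistic_pareto_density(y, theta):
    """Density of the logistic multivariate Pareto distribution at y.

    Assumes 0 < theta < 1, all components of y positive and max(y) > 1.
    """
    d = len(y)
    power_sum = sum(v ** (-1.0 / theta) for v in y)
    # Product over i = 1, ..., d - 1 of (i / theta - 1).
    coefficient = math.prod(i / theta - 1.0 for i in range(1, d))
    marginal_part = math.prod(v ** (-1.0 / theta - 1.0) for v in y)
    return coefficient * power_sum ** (theta - d) * marginal_part / d ** theta


y = [2.0, 0.5, 1.5]     # hypothetical point with max(y) > 1
theta = 0.5
t = 3.0                 # scaling factor for the homogeneity check

lhs = logistic_pareto_density([t * v for v in y], theta)
rhs = t ** -(len(y) + 1) * logistic_pareto_density(y, theta)
```

Up to floating-point error, `lhs` and `rhs` agree, which reflects the homogeneity of $f_{\boldsymbol{Y}}$ inherited from $\lambda$.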

Example 2 (Hüsler–Reiss distribution).

The Hüsler–Reiss distribution (Hüsler and Reiss, 1989) is parameterized by a symmetric, strictly conditionally negative definite matrix $\Gamma=\{\Gamma_{ij}\}_{1\leq i,j\leq d}$ with $\operatorname{diag}(\Gamma)=\boldsymbol{0}$ and non-negative entries, that is, $\boldsymbol{a}^{\top}\Gamma\boldsymbol{a}<0$ for all non-zero vectors $\boldsymbol{a}\in\mathbb{R}^{d}$ with $\sum_{i=1}^{d}a_{i}=0$ . The corresponding density of the exponent measure can be written for any $k\in\{1,\dots,d\}$ as (cf., Engelke et al., 2015)

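$$\lambda(\boldsymbol{y})=y_{k}^{-2}\prod_{i\neq k}y_{i}^{-1}\,\phi_{d-1}\left(\boldsymbol{\tilde{y}}_{\setminus k};\Sigma^{(k)}\right),\quad\boldsymbol{y}\in\mathcal{E},\tag{9}$$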

where $\phi_{p}(\cdot;\Sigma)$ is the density of a centred $p$ -dimensional normal distribution with covariance matrix $\Sigma$ , $\boldsymbol{\tilde{y}}=\{\log(y_{i}/y_{k})+\Gamma_{ik}/2\}_{i=1,\dots,d}$ and

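$$\Sigma^{(k)}=\frac{1}{2}\{\Gamma_{ik}+\Gamma_{jk}-\Gamma_{ij}\}_{i,j\neq k}\in\mathbb{R}^{(d-1)\times(d-1)}.\tag{10}$$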

The matrix $\Sigma^{(k)}$ is strictly positive definite; see Appendix B for details. The representation of the density in (9) seems to depend on the choice of $k$, but, in fact, the value of the right-hand side of this equation is independent of $k$. The Hüsler–Reiss multivariate Pareto distribution has density $f_{\boldsymbol{Y}}(\boldsymbol{y})=\lambda(\boldsymbol{y})/\Lambda(\mathbf{1})$ and the strength of dependence between the $i$th and $j$th component is parameterized by $\Gamma_{ij}$, ranging from complete dependence for $\Gamma_{ij}=0$ to independence for $\Gamma_{ij}=+\infty$. In the bivariate case $d=2$ we have

$$\lambda(y_{1},y_{2})=\frac{y_{1}^{-2}y_{2}^{-1}}{\sqrt{2\pi\Gamma_{12}}}\exp\left[-\frac{\{\log(y_{2}/y_{1})+\Gamma_{12}/2\}^{2}}{2\Gamma_{12}}\right],\quad(y_{1},y_{2})\in\mathcal{E},\tag{11}$$

and $\Lambda(1,1)=2\Phi\left(\sqrt{\Gamma_{12}}/2\right)$, where $\Phi$ is the standard normal distribution function. The extensions of Hüsler–Reiss distributions to random fields are Brown–Resnick processes (Brown and Resnick, 1977; Kabluchko et al., 2009), which are widely used models for spatial extremes.
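The quantities in this example are easy to compute. The sketch below builds $\Sigma^{(k)}=\frac{1}{2}\{\Gamma_{ik}+\Gamma_{jk}-\Gamma_{ij}\}_{i,j\neq k}$ from a variogram matrix, evaluates the bivariate exponent measure density as a normal density in $\log(y_{2}/y_{1})+\Gamma_{12}/2$ with variance $\Gamma_{12}$, and computes $\Lambda(1,1)=2\Phi(\sqrt{\Gamma_{12}}/2)$. The function names and the example matrix $\Gamma_{ij}=|i-j|$ are hypothetical choices for illustration, and indices are 0-based in the code.

```python
import math


def sigma_k(gamma, k):
    """Sigma^(k) with entries 0.5 * (Gamma_ik + Gamma_jk - Gamma_ij), i, j != k."""
    others = [i for i in range(len(gamma)) if i != k]
    return [
        [0.5 * (gamma[i][k] + gamma[j][k] - gamma[i][j]) for j in others]
        for i in others
    ]


def hr_lambda_bivariate(y1, y2, gamma12):
    """Bivariate Huesler-Reiss exponent measure density for gamma12 > 0."""
    z = math.log(y2 / y1) + gamma12 / 2.0
    # Centred normal density with variance gamma12, evaluated at z.
    normal_density = math.exp(-z * z / (2.0 * gamma12)) / math.sqrt(
        2.0 * math.pi * gamma12
    )
    return normal_density / (y1 ** 2 * y2)


def hr_extremal_coefficient(gamma12):
    """Lambda(1, 1) = 2 * Phi(sqrt(gamma12) / 2), via 2 * Phi(x) = 1 + erf(x / sqrt(2))."""
    x = math.sqrt(gamma12) / 2.0
    return 1.0 + math.erf(x / math.sqrt(2.0))


# Hypothetical variogram matrix Gamma_ij = |i - j| for d = 3.
gamma = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
sigma_first = sigma_k(gamma, 0)          # [[1.0, 1.0], [1.0, 2.0]]

# Homogeneity of order -(d + 1) = -3 in the bivariate case.
scaled = hr_lambda_bivariate(3.0 * 2.0, 3.0 * 0.7, gamma[0][1])
unscaled = hr_lambda_bivariate(2.0, 0.7, gamma[0][1])
```

For this $\Gamma$, `sigma_first` has determinant 1 and positive diagonal, consistent with strict positive definiteness, and `scaled` equals `unscaled` divided by $3^{3}$ up to rounding. The extremal coefficient moves from 1 towards 2 as $\Gamma_{12}$ grows, matching the range from complete dependence to independence.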

Example 3 (Bivariate Pareto distribution).

In the general bivariate case $d=2$ , due to homogeneity, the density $\lambda$ of the exponent measure can be characterised by a univariate distribution. Indeed, for any positive random variable $U^{1}_{2}$ with density $f_{U^{1}_{2}}$ and $\mathbb{E}U^{1}_{2}=1$ ,

$$\lambda(y_{1},y_{2})=y_{1}^{-3}f_{U^{1}_{2}}(y_{2}/y_{1}),\quad(y_{1},y_{2})\in\mathcal{E},\tag{12}$$

satisfies conditions (L1) and (L2) above and thus defines a valid exponent measure density. We call $U^{1}_{2}$ the extremal function at coordinate $2$ , relative to coordinate $1$ (cf., Dombry et al., 2013, 2016). Equivalently, we can write the density in terms of the extremal function $U^{2}_{1}$ at coordinate $1$ , relative to coordinate $2$ , as $\lambda(y_{1},y_{2})=y_{2}^{-3}f_{U^{2}_{1}}(y_{1}/y_{2})$ , $(y_{1},y_{2})\in\mathcal{E}$ , and $U^{2}_{1}$ is related to $U^{1}_{2}$ via the measure change $\mathbb{P}(U^{2}_{1}\leq z)=\mathbb{E}(\boldsymbol{1}\{1/U^{1}_{2}\leq z\}U^{1}_{2})$ , $z>0$ .

The above is a general construction principle, since every valid exponent measure density can be obtained in this way. The bivariate Hüsler–Reiss distribution in (11) corresponds to the case of log-normal $U^{1}_{2}$ and $U^{2}_{1}$ , but many other parametric and non-parametric examples are available (e.g., Boldi and Davison, 2007; Cooley et al., 2010; Ballani and Schlather, 2011; de Carvalho and Davison, 2014).
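For completeness, here is a short worked check that $\lambda(y_{1},y_{2})=y_{1}^{-3}f_{U^{1}_{2}}(y_{2}/y_{1})$ satisfies (L1) and (L2). It is a sketch of the standard argument, writing $U=U^{1}_{2}$ and $f=f_{U^{1}_{2}}$ for brevity, and it shows where the condition $\mathbb{E}U^{1}_{2}=1$ enters.

```latex
\begin{align*}
% (L1): homogeneity of order -(d+1) = -3, since the ratio y_2 / y_1 is scale free.
\lambda(t y_1, t y_2) &= t^{-3} y_1^{-3} f(y_2/y_1) = t^{-3}\lambda(y_1, y_2), \\
% (L2) for i = 1: substitute u = y_2 / y_1 in the inner integral.
\int_1^\infty \int_0^\infty y_1^{-3} f(y_2/y_1)\,\mathrm{d}y_2\,\mathrm{d}y_1
  &= \int_1^\infty y_1^{-2}\,\mathrm{d}y_1 = 1, \\
% (L2) for i = 2: same substitution, then s = 1 / y_1.
\int_0^\infty \int_1^\infty y_1^{-3} f(y_2/y_1)\,\mathrm{d}y_2\,\mathrm{d}y_1
  &= \int_0^\infty y_1^{-2}\,\mathbb{P}(U > 1/y_1)\,\mathrm{d}y_1
   = \int_0^\infty \mathbb{P}(U > s)\,\mathrm{d}s = \mathbb{E}U = 1.
\end{align*}
```

The normalisation for the first coordinate holds for any density $f$, whereas the normalisation for the second coordinate is exactly the mean constraint on the extremal function.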

2.3 Graphical models

A graph $\mathcal{G}=(V,E)$ is defined as a set of nodes $V=\{1,\dots,d\}$ , also called vertices, together with a set of edges $E\subset V\times V$ of pairs of distinct nodes. The graph is called undirected if for two nodes $i,j\in V$ , $(i,j)\in E$ if and only if $(j,i)\in E$ . For notational convenience, for undirected graphs we sometimes represent edges as unordered pairs $\{i,j\}\in E$ . When counting the number of edges, we count $\{i,j\}\in E$ such that each edge is considered only once. A subset $C\subset V$ of nodes is called complete if it is fully connected in the sense that $(i,j)\in E$ for all $i,j\in C$ . We denote by $\mathcal{C}$ the set of all cliques, that is, the complete subsets that are not properly contained within any other complete subset.

To each node $i\in V$ we associate a random variable $X_{i}$ with continuous state space $\mathcal{X}_{i}\subset\mathbb{R}$ . The resulting random vector $\boldsymbol{X}=(X_{i})_{i\in V}$ takes values in the Cartesian product $\mathcal{X}=\times_{i\in V}\mathcal{X}_{i}$ . Suppose that $\boldsymbol{X}$ has a positive and continuous Lebesgue density $f_{\boldsymbol{X}}$ on $\mathcal{X}$ . For three disjoint subsets $A,B,C\subset V$ whose union is $V$ , we say that $\boldsymbol{X}_{A}$ is conditionally independent of $\boldsymbol{X}_{C}$ given $\boldsymbol{X}_{B}$ if the density factorizes as

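$$f_{\boldsymbol{X}}(\boldsymbol{x})=\frac{f_{A\cup B}(\boldsymbol{x}_{A\cup B})f_{B\cup C}(\boldsymbol{x}_{B\cup C})}{f_{B}(\boldsymbol{x}_{B})},\tag{13}$$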

and we write $\boldsymbol{X}_{A}\perp\!\!\!\perp\boldsymbol{X}_{C}\mid\boldsymbol{X}_{B}$ . If $B=\emptyset$ , then (13) amounts to independence of $\boldsymbol{X}_{A}$ and $\boldsymbol{X}_{C}$ .

The random vector $\boldsymbol{X}$ is said to be a probabilistic graphical model on the graph $\mathcal{G}=(V,E)$ if its distribution satisfies the pairwise Markov property relative to $\mathcal{G}$ , that is, $X_{i}\perp\!\!\!\perp X_{j}\mid\boldsymbol{X}_{\setminus\{i,j\}}$ for all $(i,j)\notin E$ . If in addition, for any disjoint subsets $A,B,C\subset V$ such that $B$ separates $A$ from $C$ in $\mathcal{G}$ , $\boldsymbol{X}_{A}\perp\!\!\!\perp\boldsymbol{X}_{C}\mid\boldsymbol{X}_{B}$ , then $\boldsymbol{X}$ is said to obey the global Markov property relative to $\mathcal{G}$ . Since $f_{\boldsymbol{X}}$ is positive and continuous, it follows from the Hammersley–Clifford theorem (cf., Lauritzen, 1996, Theorem 3.9) that the two Markov properties are equivalent, and they are further equivalent to the factorization of the density

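$$f_{\boldsymbol{X}}(\boldsymbol{x})=\prod_{C\in\mathcal{C}}\psi_{C}(\boldsymbol{x}_{C}),\quad\boldsymbol{x}\in\mathcal{X},$$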

for suitable functions $\psi_{C}$ on $\times_{i\in C}\mathcal{X}_{i}$ . If the graph $\mathcal{G}$ is decomposable, then this factorization can be rewritten in terms of marginal densities

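$$f_{\boldsymbol{X}}(\boldsymbol{x})=\frac{\prod_{C\in\mathcal{C}}f_{C}(\boldsymbol{x}_{C})}{\prod_{D\in\mathcal{D}}f_{D}(\boldsymbol{x}_{D})},\quad\boldsymbol{x}\in\mathcal{X},$$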

where $\mathcal{D}$ is a multiset containing intersections between the cliques called separator sets; see Lauritzen (1996) and Appendix A for the definition of decomposability and separator sets.

Example 4.

We recall that for a normal distribution $\boldsymbol{W}=(W_{i})_{i\in V}$ with invertible covariance matrix $\Sigma$ , the precision matrix $\Sigma^{-1}$ contains the conditional independencies, or equivalently the graph structure, since for $i,j\in V$ ,

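$$\Sigma^{-1}_{ij}=0\quad\Longleftrightarrow\quad W_{i}\perp\!\!\!\perp W_{j}\mid\boldsymbol{W}_{\setminus\{i,j\}}.$$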
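As a small worked instance of Example 4 (an illustration added here, with the correlation parameter $\rho\in(-1,1)$ , $\rho\neq 0$ , chosen for convenience), take $d=3$ and the covariance matrix of a stationary first-order autoregression. Its inverse is tridiagonal:

$$\Sigma=\begin{pmatrix}1&\rho&\rho^{2}\\ \rho&1&\rho\\ \rho^{2}&\rho&1\end{pmatrix},\qquad\Sigma^{-1}=\frac{1}{1-\rho^{2}}\begin{pmatrix}1&-\rho&0\\ -\rho&1+\rho^{2}&-\rho\\ 0&-\rho&1\end{pmatrix}.$$

Although $\Sigma_{13}=\rho^{2}\neq 0$ , the entry $(\Sigma^{-1})_{13}$ vanishes, so $W_{1}\perp\!\!\!\perp W_{3}\mid W_{2}$ and the graph is the chain on the nodes $1,2,3$ .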

3 Conditional independence for threshold exceedances

The notion of conditional independence has not been exploited in extreme value theory. In fact, for max-stable distributions it only leads to trivial probabilistic structures (Papastathopoulos and Strokorb, 2016). An exception is the class of directed acyclic graphs for max-linear models studied in Gissibl and Klüppelberg (2018) and Gissibl et al. (2018), which, however, do not admit densities.

We therefore approach the problem from the perspective of threshold exceedances. Since the notion of independence is only defined on product spaces, the meaning of conditional independence is not straightforward for a multivariate Pareto distribution $\boldsymbol{Y}=(Y_{i})_{i\in V}$ , $V=\{1,\dots,d\}$ , with support on the $L$ -shaped space $\mathcal{L}=\{\boldsymbol{x}\in\mathcal{E}:\|\boldsymbol{x}\|_{\infty}>1\}$ . In this section we show that there is nevertheless a natural definition of conditional independence for $\boldsymbol{Y}$ . To this end, we restrict $\boldsymbol{Y}$ to product spaces. For any $k\in V$ , we consider the random vector $\boldsymbol{Y}^{k}$ defined as $\boldsymbol{Y}$ conditioned on the event that $\{Y_{k}>1\}$ . Clearly, $\boldsymbol{Y}^{k}$ has support on the product space $\mathcal{L}^{k}=\{\boldsymbol{x}\in\mathcal{L}:x_{k}>1\}$ (cf., Figure 1) and it admits the density

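$$f^{k}(\boldsymbol{y})=\frac{f_{\boldsymbol{Y}}(\boldsymbol{y})}{\int_{\mathcal{L}^{k}}f_{\boldsymbol{Y}}(\boldsymbol{y})\,\mathrm{d}\boldsymbol{y}}=\Lambda(\boldsymbol{1})\,f_{\boldsymbol{Y}}(\boldsymbol{y})=\lambda(\boldsymbol{y}),\quad\boldsymbol{y}\in\mathcal{L}^{k},\tag{16}$$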

since $\int_{\mathcal{L}^{k}}f_{\boldsymbol{Y}}(\boldsymbol{y})\mathrm{d}\boldsymbol{y}=1/\Lambda(\boldsymbol{1})$ because of (L2) in Section 2.2. From (16) we see that the densities $f^{1},\dots,f^{d}$ coincide with $\lambda$ on the intersection of their supports. Therefore there are no problems with lack of self-consistency as for instance in the model of Heffernan and Tawn (2004).

For any set $I\subset V$ with $k\in I$ , the marginal $\boldsymbol{Y}_{I}^{k}$ has density

[TABLE]

which is homogeneous of order $-(|I|+1)$ on $\mathcal{L}_{I}^{k}=\{\boldsymbol{x}_{I}\in\mathcal{L}_{I}:x_{k}>1\}$ ; see (5). This is however not the case if $k\notin I$ , since integration over $\boldsymbol{y}_{\setminus I}$ then includes $y_{k}$ whose domain is $(1,\infty)$ in $\mathcal{L}^{k}$ , and thus in general $f^{k}_{I}(\boldsymbol{y}_{I})\neq\lambda_{I}(\boldsymbol{y}_{I})$ , $\boldsymbol{y}_{I}\in[0,\infty)^{|I|}.$

Definition 1.

Suppose that $\boldsymbol{Y}$ is multivariate Pareto and admits a positive and continuous density $f_{\boldsymbol{Y}}$ on $\mathcal{L}$ , and let $A,B,C\subset V$ be non-empty disjoint subsets whose union is $V=\{1,\ldots,d\}$ . We say that $\boldsymbol{Y}_{A}$ is conditionally independent of $\boldsymbol{Y}_{C}$ given $\boldsymbol{Y}_{B}$ if

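$$\forall k\in\{1,\dots,d\}:\quad\boldsymbol{Y}^{k}_{A}\perp\!\!\!\perp\boldsymbol{Y}^{k}_{C}\mid\boldsymbol{Y}^{k}_{B}.\tag{17}$$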

In this case we write $\boldsymbol{Y}_{A}\perp_{e}\boldsymbol{Y}_{C}\mid\boldsymbol{Y}_{B}$ .

In fact, this definition can be shown to be equivalent to a slightly weaker condition, and to a factorization of the exponent measure density $\lambda$ .

Proposition 1.

Let $f_{\boldsymbol{Y}}$ and the sets $A,B,C$ be as in the above definition. Then $\boldsymbol{Y}_{A}\perp_{e}\boldsymbol{Y}_{C}\mid\boldsymbol{Y}_{B}$ is equivalent to either of the following two conditions.

(i)

[TABLE]

(ii)

The density of the exponent measure factorizes as

[TABLE]

A natural question is whether one can extend the definition of $\boldsymbol{Y}_{A}\perp_{e}\boldsymbol{Y}_{C}\mid\boldsymbol{Y}_{B}$ to the case where $B=\emptyset$ , meaning that $\boldsymbol{Y}_{A}$ and $\boldsymbol{Y}_{C}$ are independent on $\mathcal{L}$ . In terms of the original definition, that would mean that for any $k\in V$ , $f^{k}(\boldsymbol{y})=f^{k}_{A}(\boldsymbol{y}_{A})f^{k}_{C}(\boldsymbol{y}_{C})$ for all $\boldsymbol{y}\in\mathcal{L}^{k}$ . Without loss of generality, suppose $k\in A$ . Since $f^{k}=\lambda$ on $\mathcal{L}^{k}$ by (16) and $f^{k}_{A}=\lambda_{A}$ because $k\in A$ , we would have $f^{k}_{C}(\boldsymbol{y}_{C})=\lambda(\boldsymbol{y}_{A},\boldsymbol{y}_{C})/\lambda_{A}(\boldsymbol{y}_{A})$ for any $\boldsymbol{y}_{A}\in\mathcal{L}^{k}_{A}$ and $\boldsymbol{y}_{C}\in[0,\infty)^{|C|}$ . Therefore $f^{k}_{C}$ would be homogeneous of order $-|C|$ and thus not integrable on $[0,\infty)^{|C|}$ , a contradiction. In the next section we show that this property implies that all graphical models defined in terms of the conditional independence $\perp_{e}$ must be connected.
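The non-integrability invoked in this argument can be made explicit. The following short derivation is a sketch; the symbols $m$ , $g$ , $S$ and $\sigma$ are notation introduced here for illustration only. Write $m=|C|$ and let $g$ be positive, continuous and homogeneous of order $-m$ on $[0,\infty)^{m}$ , that is, $g(t\boldsymbol{y})=t^{-m}g(\boldsymbol{y})$ for $t>0$ . In polar coordinates $\boldsymbol{y}=r\boldsymbol{\omega}$ with $r=\|\boldsymbol{y}\|_{1}$ and $\boldsymbol{\omega}\in S=\{\boldsymbol{\omega}\in[0,\infty)^{m}:\|\boldsymbol{\omega}\|_{1}=1\}$ , Lebesgue measure decomposes as $r^{m-1}\mathrm{d}r\,\sigma(\mathrm{d}\boldsymbol{\omega})$ for a finite measure $\sigma$ on $S$ , so that

$$\int_{[0,\infty)^{m}}g(\boldsymbol{y})\,\mathrm{d}\boldsymbol{y}=\int_{S}g(\boldsymbol{\omega})\,\sigma(\mathrm{d}\boldsymbol{\omega})\int_{0}^{\infty}r^{-m}\,r^{m-1}\,\mathrm{d}r=\int_{S}g(\boldsymbol{\omega})\,\sigma(\mathrm{d}\boldsymbol{\omega})\int_{0}^{\infty}\frac{\mathrm{d}r}{r}=\infty.$$

The radial integral diverges at both $0$ and $\infty$ , so no such $g$ can be a probability density.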

4 Graphical models for threshold exceedances

The notion of conditional independence allows us to define graphical models for threshold exceedances. As before, let $f_{\boldsymbol{Y}}$ be the positive and continuous density on $\mathcal{L}$ of a multivariate Pareto distribution $\boldsymbol{Y}$ , proportional to the density $\lambda$ of the exponent measure $\Lambda$ , and homogeneous of order $-(d+1)$ . Let $\mathcal{G}=(V,E)$ be an undirected graph with nodes $V=\{1,\dots,d\}$ and edge set $E$ . Similarly to the classical probabilistic graphical models described in Section 2.3, we say that $\boldsymbol{Y}$ satisfies the pairwise Markov property on $\mathcal{L}$ relative to $\mathcal{G}$ if

$$Y_{i}\perp_{e}Y_{j}\mid\boldsymbol{Y}_{\setminus\{i,j\}},\quad(i,j)\notin E,\tag{20}$$

that is, $Y_{i}$ and $Y_{j}$ are conditionally independent in the sense of Definition 1 given all other nodes whenever there is no edge between $i$ and $j$ in $\mathcal{G}$ . By definition, this is equivalent to saying that $\boldsymbol{Y}^{k}$ satisfies the usual pairwise Markov property on $\mathcal{L}^{k}$ relative to $\mathcal{G}$ for all $k\in V$ . The global Markov property for $\boldsymbol{Y}$ is defined similarly.

Definition 2.

Let $\mathcal{G}=(V,E)$ be an undirected graph. If the multivariate Pareto distribution $\boldsymbol{Y}$ with positive and continuous density $f_{\boldsymbol{Y}}$ satisfies the pairwise Markov property (20) relative to $\mathcal{G}$ , we call the distribution of $\boldsymbol{Y}$ an extremal graphical model with respect to $\mathcal{G}$ .

For a decomposable graph $\mathcal{G}$ we obtain a factorization of the density $f_{\boldsymbol{Y}}$ similar to the classical Hammersley–Clifford theorem, showing that Definition 1 of conditional independence is natural for multivariate Pareto distributions. Let $\mathcal{C}$ and $\mathcal{D}$ be the sequences of cliques and separators of $\mathcal{G}$ , respectively, satisfying the running intersection property (44) in Appendix A.

Theorem 1.

Let $\mathcal{G}=(V,E)$ be a decomposable graph and suppose that $\boldsymbol{Y}$ has a multivariate Pareto distribution with positive and continuous density $f_{\boldsymbol{Y}}$ on $\mathcal{L}$ . Denote the corresponding exponent measure and its density by $\Lambda$ and $\lambda$ , respectively. Then the following are equivalent.

(i)

The distribution of ${\boldsymbol{Y}}$ satisfies the pairwise Markov property relative to $\mathcal{G}.$

(ii)

The distribution of ${\boldsymbol{Y}}$ satisfies the global Markov property relative to $\mathcal{G}.$

(iii)

The density $f_{\boldsymbol{Y}}$ factorizes according to $\mathcal{G}$ , that is,

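$$f_{\boldsymbol{Y}}(\boldsymbol{y})=\frac{1}{\Lambda(\boldsymbol{1})}\,\frac{\prod_{C\in\mathcal{C}}\lambda_{C}(\boldsymbol{y}_{C})}{\prod_{D\in\mathcal{D}}\lambda_{D}(\boldsymbol{y}_{D})},\quad\boldsymbol{y}\in\mathcal{L},\tag{21}$$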

where the marginals $\lambda_{I}$ are positive, continuous and homogeneous of order $-(|I|+1)$ for any $I\subset V$ .

In all cases, the graph $\mathcal{G}$ is necessarily connected.

Remark 1.

The above theorem shows that only connected extremal graphical models can arise. This is related to the assumption of multivariate regular variation in (3) that implies asymptotic dependence between all components. Loosely speaking, unconnected components would correspond to asymptotically independent components.

Remark 2.

If the graph $\mathcal{G}$ in the above theorem is non-decomposable, it is expected that the density $f_{\boldsymbol{Y}}$ still factorizes into factors on the cliques of the graph. These factors can however no longer be identified with marginal densities, and it is an open problem whether they can be chosen to be homogeneous.

Since $\mathcal{L}$ is not a product space, unlike in the classical Hammersley–Clifford theorem for decomposable graphs in (15), the factors in the factorization of the density $f_{\boldsymbol{Y}}$ in (21) are not the marginals $f_{I}$ but the marginals of the exponent measure density $\lambda_{I}$ . It holds however that $f_{I}(\boldsymbol{y}_{I})=\lambda_{I}(\boldsymbol{y}_{I})/\Lambda_{I}(\boldsymbol{1})$ for all $\boldsymbol{y}_{I}\in\mathcal{L}_{I}\subset\{\boldsymbol{x}_{I}:\boldsymbol{x}\in\mathcal{L}\}$ .

As a first application, the above theorem allows us to formally analyse the conditional independencies and graphical structures of models in the multivariate extreme value literature.

Example 5.

One of the simplest examples of a graph is a chain, that is,

[TABLE]

Coles and Tawn (1991) proposed a model that factorizes with respect to this chain where all bivariate marginals are logistic (cf., Example 1). This was extended to general bivariate marginals in Smith et al. (1997). More generally, in the study of extremes of stationary Markov chains the limiting objects are so-called tail chains. The latter induce multivariate Pareto distributions that can readily be seen to factorize with respect to a chain; see Smith (1992), Basrak and Segers (2009) and Janssen and Segers (2014).

Example 6.

It turns out that many of the multivariate models in the literature do not have any conditional independencies, that is, their underlying graph is fully connected. For instance, this holds for the logistic multivariate Pareto distribution in Example 1, the Dirichlet mixture model in Boldi and Davison (2007), and the pairwise beta distribution in Cooley et al. (2010).

Example 7.

Similar to Gaussian distributions, an appealing property of the Hüsler–Reiss model is its stability under taking marginals. Indeed, for any $I\subset V$ and $k\in I$ the marginal density of the exponent measure is

[TABLE]

with the notation of Example 2, where $\Sigma^{(k)}_{I}$ is the matrix in (10) induced by the submatrix $\Gamma_{I}$ . Thus, $f_{I}(\boldsymbol{y}_{I})=\lambda_{I}(\boldsymbol{y}_{I})/\Lambda_{I}(\mathbf{1})$ is the density of the $|I|$ -dimensional Hüsler–Reiss Pareto distribution with parameter matrix $\Gamma_{I}$ .

By Theorem 1, the density of a Hüsler–Reiss distribution that satisfies the pairwise Markov property relative to some decomposable graph $\mathcal{G}$ factorizes into lower-dimensional Hüsler–Reiss distributions. The explicit formula is given in Corollary 2 in Appendix C.

Theorem 1 can also be seen as a construction principle to build new classes of extreme value distributions in a modular way by combining lower-dimensional marginals. The following corollary shows how to define a multivariate Pareto distribution that factorizes according to a desired underlying graphical structure. This is particularly useful in high-dimensional problems to ensure model sparsity.

Corollary 1.

Let $\mathcal{G}$ be a decomposable and connected graph and suppose that $\{\lambda_{I}:I\in\mathcal{C}\cup\mathcal{D}\}$ is a set of valid, positive and continuous exponent measure densities in the sense of (L1) and (L2) in Section 2.2. For $D\subset C$ , $D\in\mathcal{D},$ $C\in\mathcal{C}$ , assume that they satisfy the consistency constraint

[TABLE]

The density of a valid $d$ -dimensional exponent measure $\Lambda$ is then given by

[TABLE]

and the function $f_{\boldsymbol{Y}}(\boldsymbol{y})=\lambda(\boldsymbol{y})/\Lambda(\mathbf{1})$ , $\boldsymbol{y}\in\mathcal{L}$ , is the density of a multivariate Pareto distribution satisfying the pairwise Markov property relative to $\mathcal{G}$ .

4.1 Tree graphical models

A tree is a special case of a decomposable graphical model that is connected and has no cycles. All cliques are then of size two and the separators contain only one node. Let $\mathcal{T}=(V,E)$ be an undirected tree with nodes $V=\{1,\dots,d\}$ and edge set $E\subset V\times V$ . Suppose that $\boldsymbol{Y}=(Y_{i})_{i\in V}$ follows a multivariate Pareto distribution on $\mathcal{L}$ with density $f_{\boldsymbol{Y}}$ that is an extremal graphical model with respect to the tree $\mathcal{T}$ . Theorem 1 yields the factorization

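$$f_{\boldsymbol{Y}}(\boldsymbol{y})=\frac{1}{\Lambda(\boldsymbol{1})}\prod_{\{i,j\}\in E}\frac{\lambda_{ij}(y_{i},y_{j})}{y_{i}^{-2}\,y_{j}^{-2}}\prod_{i\in V}y_{i}^{-2},\quad\boldsymbol{y}\in\mathcal{L},\tag{23}$$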

where $\lambda_{ij}=\lambda_{\{i,j\}}$ are the bivariate marginals of the exponent measure density $\lambda$ corresponding to $\boldsymbol{Y}$ . The formula (23) allows the extension of the modelling approach by Smith et al. (1997) described in Example 5 from time series to general tree structures. Such tree models are able to represent more complex dependencies and, moreover, are suitable beyond temporal data for multivariate or spatial applications.

Thanks to the relatively simple structure of a tree, more explicit results can be derived than for general graphical models. To this end, we define a new, directed tree $\mathcal{T}^{k}=(V,E^{k})$ rooted at an arbitrary but fixed node $k\in V$ . The edge set $E^{k}$ consists of all edges $e\in E$ of the tree $\mathcal{T}$ pointing away from node $k$ ; see Figure 2 for an example with $k=2$ . For the resulting directed tree we define a set $(U_{e})_{e\in E^{k}}$ of independent random variables, where for $e=(i,j)$ , the distribution of $U_{e}=U^{i}_{j}$ is the extremal function of $\lambda_{ij}$ at coordinate $j$ , relative to coordinate $i$ ; see (12) in Example 3 for the definition of extremal functions. The following stochastic representation of the random vectors $\boldsymbol{Y}^{k}$ , $k\in V$ , provides a better understanding of the stochastic structure of multivariate Pareto distributions factorizing on a tree, and it is the main ingredient for efficient simulation in Section 5.4.

Proposition 2.

Let $\boldsymbol{Y}$ be a multivariate Pareto distribution with positive and continuous density on $\mathcal{L}$ that factorizes with respect to the tree $\mathcal{T}$ . With the notation above, and for a standard Pareto distribution $P$ , we have the joint stochastic representation for $\boldsymbol{Y}^{k}$ on $\mathcal{L}^{k}$

[TABLE]

where $\operatorname{ph}(ki)$ denotes the set of edges on the unique path from node $k$ to node $i$ on the tree $\mathcal{T}^{k}$ .
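To illustrate how this representation can drive simulation, here is a minimal sketch under stated assumptions. It assumes that (24) takes the product form $Y^{k}_{k}=P$ and $Y^{k}_{i}=P\prod_{e\in\operatorname{ph}(ki)}U_{e}$ for $i\neq k$ , and that every edge carries a bivariate Hüsler–Reiss model, whose extremal function is taken to be log-normal, $U_{e}=\exp(N)$ with $N\sim\mathcal{N}(-\Gamma_{ij}/2,\Gamma_{ij})$ . The function name `simulate_tree_hr` and its arguments are hypothetical and nodes are numbered from 0; this is not the algorithm of Section 5.4 verbatim.

```python
import math
import random
from collections import deque


def simulate_tree_hr(edges, gamma, k, n, seed=0):
    """Draw n samples of Y^k for a Hüsler-Reiss tree model (illustrative sketch).

    edges: list of undirected edges (i, j) forming a tree on nodes 0..d-1
    gamma: dict mapping each edge (i, j) to its variogram parameter Gamma_ij > 0
    k:     root node, i.e. the component conditioned to exceed 1
    """
    rng = random.Random(seed)
    nodes = sorted({v for e in edges for v in e})
    neighbours = {v: [] for v in nodes}
    for i, j in edges:
        neighbours[i].append((j, gamma[(i, j)]))
        neighbours[j].append((i, gamma[(i, j)]))

    samples = []
    for _ in range(n):
        # Root value: standard Pareto, P(P > x) = 1/x for x >= 1.
        y = {k: rng.paretovariate(1.0)}
        queue = deque([k])
        # Breadth-first traversal follows the edges pointing away from k,
        # multiplying by one independent extremal function per edge.
        while queue:
            i = queue.popleft()
            for j, g in neighbours[i]:
                if j not in y:
                    u = math.exp(rng.gauss(-g / 2.0, math.sqrt(g)))
                    y[j] = y[i] * u
                    queue.append(j)
        samples.append([y[v] for v in nodes])
    return samples
```

For example, `simulate_tree_hr([(0, 1), (1, 2), (1, 3)], {(0, 1): 1.0, (1, 2): 0.5, (1, 3): 2.0}, k=1, n=1000)` returns 1000 vectors whose component 1 always exceeds 1, while the other components may fall below 1.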

Remark 3.

The same object as in (24) has been obtained in Segers (2019) as the limit of regularly varying random vectors that satisfy a Markov condition on a tree. In analogy to the tail chains in Example 5, they term it a tail tree.

Example 8.

Suppose all bivariate marginals $\lambda_{ij}$ for $\{i,j\}\in E$ of a tree Pareto model are of logistic type with parameter $\theta_{ij}\in(0,1)$ as defined in Example 1. This tree logistic model is a generalization of the chain logistic model considered in Coles and Tawn (1991). In this symmetric case, the extremal functions $U^{i}_{j}$ and $U^{j}_{i}$ have the same distribution with stochastic representation $F/G$ , where $F$ follows a Fréchet $(1/\theta,c_{\theta})$ distribution with scale parameter $c_{\theta}=\Gamma(1-\theta)^{-1}$ and $(G/c_{\theta})^{-1/\theta}$ follows a Gamma $(1-\theta,1)$ distribution, where we abbreviated $\theta=\theta_{ij}$ and $\Gamma$ is the Gamma function.

Similarly we can define a Hüsler–Reiss tree model, or use asymmetric models for $\lambda_{ij}$ such as the Dirichlet model in Boldi and Davison (2007) for some or all of the edges $\{i,j\}\in E$ . In asymmetric models, the extremal functions $U^{i}_{j}$ and $U^{j}_{i}$ have in general different distributions. We refer to Section 4 in Dombry et al. (2016) for explicit formulas for extremal function distributions of commonly used model classes.

4.2 Hüsler–Reiss graphical models

In many ways, the class of Hüsler–Reiss distributions introduced in Example 2 can be seen as the natural analog of Gaussian distributions in the world of asymptotically dependent extremes. They are parameterized by the variogram of Gaussian distributions, and their statistical inference (Wadsworth and Tawn, 2014; Engelke et al., 2015) and exact simulation (Dombry et al., 2016) involve tools that are closely related to the corresponding methods for normal models.

Despite the similarities to Gaussian distributions, there are subtle but important differences that render analysis and statistical inference of Hüsler–Reiss distributions more difficult. In order to characterise conditional independence and graphical structures in these models, we first recall some results related to the original construction. The max-stable Hüsler–Reiss distribution has a stochastic representation as componentwise maxima

[TABLE]

where $\{U_{l}:l\in\mathbb{N}\}$ is a Poisson point process on $[0,\infty)$ with intensity $u^{-2}\mathrm{d}u$ , and $\boldsymbol{W}_{l}$ are independent copies of a $d$ -dimensional normal distribution $\boldsymbol{W}$ with zero mean and covariance matrix $\Sigma$ . Subtracting $\log\mathbb{E}(\exp\boldsymbol{W})=\operatorname{diag}(\Sigma)/2$ in the exponent normalises the marginals of $\boldsymbol{Z}$ to be standard Fréchet. Kabluchko et al. (2009) show that the distribution of $\boldsymbol{Z}$ only depends on the strictly conditionally negative definite variogram matrix of $\boldsymbol{W}$ ,

[TABLE]

Importantly, this implies that the representation in (25) is not unique since any centred, possibly degenerate normal distribution $\boldsymbol{W}$ with variogram matrix $\Gamma$ leads to the same max-stable Hüsler–Reiss distribution. Let

[TABLE]

be the set of all covariance matrices that correspond to the same variogram matrix $\Gamma$ ; see Appendix B. The Hüsler–Reiss Pareto distribution $\boldsymbol{Y}$ associated with $\boldsymbol{Z}$ is defined by its density in Example 2, which is also parameterized by $\Gamma$ . We recall that for a normal distribution $\boldsymbol{W}$ with invertible covariance matrix $\Sigma$ , the precision matrix $\Sigma^{-1}$ contains the conditional independencies; see Example 4. A first, naive guess would be that the graph structure of $\boldsymbol{W}$ used in the construction of $\boldsymbol{Z}$ directly translates into the extremal graph structure of the Hüsler–Reiss Pareto distribution $\boldsymbol{Y}$ . This is however not the case.

Example 9.

We consider three examples for $\boldsymbol{W}$ in the representation (25) with $d=4$ .

1. Let $W_{i}$ , $i=1,\dots,4$ , be independent standard normal distributions, then $\Sigma^{-1}=\operatorname{diag}(1,\dots,1)$ and $\Gamma_{ij}=2$ if $i\neq j$ and zero otherwise. The graph underlying the distribution of $\boldsymbol{W}$ is the graph with four unconnected nodes. The graph of the corresponding Hüsler–Reiss Pareto distribution $\boldsymbol{Y}$ turns out to be the fully connected graph on the left-hand side of Figure 3.

2. Consider the centred normal distribution $\boldsymbol{W}$ with precision matrix and variogram matrix

[TABLE]

respectively. The Gaussian graphical model is the graph in the centre of Figure 3 with an additional edge between the nodes $2$ and $3$ . On the contrary, the corresponding Hüsler–Reiss model factorizes according to the graph in the centre of Figure 3.

3. Consider the centred normal distribution $\boldsymbol{W}$ with precision matrix and variogram matrix

[TABLE]

respectively. It can be checked that both the Gaussian and the Hüsler–Reiss graphical model are as in the right-hand side of Figure 3. Also note that this graph is not decomposable.

The above examples show that it is not possible to simply transfer the Gaussian graphical model of the covariance matrix $\Sigma$ in the construction (25) to the extremal graphical structure of the corresponding Hüsler–Reiss Pareto distribution. This is not surprising since the covariance matrices in the set $\mathcal{S}_{\Gamma}$ can have very different graph structures, but all lead to the same Hüsler–Reiss graphical model. There is however a set of particular matrices that allow us to identify conditional independencies and thus the graphical structure of a Hüsler–Reiss Pareto distribution. Recall the definition of $\Sigma^{(k)}\in\mathbb{R}^{(d-1)\times(d-1)}$ in (10). The same matrix including the $k$ th row and column

[TABLE]

is degenerate since the $k$ th component has zero variance, but it is a valid choice in the construction (25), that is, $\tilde{\Sigma}^{(k)}\in\mathcal{S}_{\Gamma}$ , for any $k\in V$ . Let $\boldsymbol{W}^{k}$ be a centred normal distribution with covariance matrix $\tilde{\Sigma}^{(k)}$ and note that $W^{k}_{k}=0$ almost surely. For a random variable $P$ with standard Pareto distribution, independent of $\boldsymbol{W}^{k}$ , it can be seen that

[TABLE]

by comparing the density of the right-hand side with (9) and noting that $\operatorname{diag}(\tilde{\Sigma}^{(k)})=\Gamma_{\cdot\,k}$ . This together with the original definition of conditional independence in (17) suggests that the matrices $\Sigma^{(k)}$ contain the graphical structure of a Hüsler–Reiss Pareto distribution.

We denote the precision matrix of $\Sigma^{(k)}$ by $\Theta^{(k)}=(\Sigma^{(k)})^{-1}$ . For notational convenience, the indices of the matrices $\Sigma^{(k)}$ and $\Theta^{(k)}$ range in $\{1,\dots,d\}\setminus\{k\}$ instead of $\{1,\dots,d-1\}$ .

Lemma 1.

For $k,k^{\prime}\in V$ , $k\neq k^{\prime}$ , the precision matrices $\Theta^{(k)}$ and $\Theta^{(k^{\prime})}$ satisfy for $i,j\in V\setminus\{k^{\prime}\}$

[TABLE]

The above lemma is of independent interest since it explains the link between the precision matrices $\Theta^{(k)}$ for different $k\in V$ . The proof uses blockwise inversion of the precision matrices. This result is the crucial ingredient to characterise conditional independence in Hüsler–Reiss models.

Proposition 3.

For a Hüsler–Reiss Pareto distribution $\boldsymbol{Y}$ with parameter matrix $\Gamma$ , it holds for $i,j\in V$ with $i\neq j$ , and for any $k\in V$ , that

[TABLE]

For any $k\in V$ , the single matrix $\Theta^{(k)}$ contains all information on conditional independence of $\boldsymbol{Y}$ . Conditional independence concerning the $k$ th component is encoded in the row and column sums of $\Theta^{(k)}$ , and it might sometimes be easier to switch to another representation $\Theta^{(k^{\prime})}$ , $k^{\prime}\neq k$ , where it simply figures as a zero entry. In Example 9 we can now easily determine the graphical model $\mathcal{G}=(V,E)$ for each of the three Hüsler–Reiss Pareto distributions. For a given $\Sigma$ we first compute the matrix $\Gamma$ as in (26), then transform it by (10) to obtain $\Sigma^{(k)}$ for any $k\in V$ and use Proposition 3 to decide whether $(i,j)\in E$ for all $i,j\in V$ . These transformations are implemented in our R-package graphicalExtremes (Engelke et al., 2019).
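The three steps just described can be sketched in a few lines. This is an illustrative, standard-library sketch and not the interface of the graphicalExtremes package; the function names `invert`, `variogram` and `extremal_edges` are hypothetical, and nodes are numbered from 0. It assumes that (26) reads $\Gamma_{ij}=\Sigma_{ii}+\Sigma_{jj}-2\Sigma_{ij}$ , that (10) reads $\Sigma^{(k)}_{ij}=(\Gamma_{ik}+\Gamma_{jk}-\Gamma_{ij})/2$ , and it reads Proposition 3 as: for $i,j\neq k$ there is no edge exactly when $\Theta^{(k)}_{ij}=0$ , and there is no edge between $i$ and $k$ exactly when the $i$ th row sum of $\Theta^{(k)}$ is zero. Exact rational arithmetic is used so that zeros are detected without a numerical tolerance.

```python
from fractions import Fraction


def invert(matrix):
    """Invert a square matrix exactly by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [[Fraction(v) for v in row] + [Fraction(int(r == c)) for c in range(n)]
           for r, row in enumerate(matrix)]
    for col in range(n):
        # Find a non-zero pivot (raises StopIteration if the matrix is singular).
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def variogram(sigma):
    """Variogram matrix Gamma_ij = Sigma_ii + Sigma_jj - 2 Sigma_ij."""
    d = len(sigma)
    return [[sigma[i][i] + sigma[j][j] - 2 * sigma[i][j] for j in range(d)]
            for i in range(d)]


def extremal_edges(gamma, k=0):
    """Edges of the Hüsler-Reiss extremal graph, read off from Theta^(k)."""
    d = len(gamma)
    rest = [i for i in range(d) if i != k]
    sigma_k = [[(Fraction(gamma[i][k]) + Fraction(gamma[j][k])
                 - Fraction(gamma[i][j])) / 2 for j in rest] for i in rest]
    theta = invert(sigma_k)
    edges = set()
    for a, i in enumerate(rest):
        if sum(theta[a]) != 0:            # row sum encodes the edge {i, k}
            edges.add(tuple(sorted((i, k))))
        for b, j in enumerate(rest):
            if a < b and theta[a][b] != 0:  # off-diagonal entry encodes {i, j}
                edges.add((i, j))
    return sorted(edges)
```

With the identity covariance of the first case in Example 9, `extremal_edges(variogram(identity))` returns all six edges, in line with the fully connected graph stated there. With $\Gamma_{ij}=|t_{i}-t_{j}|$ for $t=(0,1,2,3)$ , it returns the chain `[(0, 1), (1, 2), (2, 3)]` for every choice of `k`, which also illustrates that the recovered graph does not depend on $k$ .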

Example 10.

In spatial extreme value statistics, the finite dimensional distributions of the Brown–Resnick process (Kabluchko et al., 2009) at locations $t_{1},\dots,t_{d}\in\mathbb{R}^{D}$ are Hüsler–Reiss distributed with variogram matrix $\Gamma_{ij}=\gamma(t_{i}-t_{j})$ , $i,j\in\{1,\dots,d\}$ , where $\gamma$ is a variogram function on $\mathbb{R}^{D}$ . The most commonly used model is the fractal variogram family $\gamma_{\alpha}(h)=\|h\|_{2}^{\alpha}$ , for some $\alpha\in(0,2]$ . The corresponding $d$ -variate Hüsler–Reiss distribution does not have conditional independencies and its graph is thus fully connected. The only exception is the case of the original Brown–Resnick process in Brown and Resnick (1977) with $\alpha=1$ and $D=1$ , where the corresponding graph is a chain as in Example 5.

In this section, we have so far not required that the underlying graph $\mathcal{G}$ is decomposable. If this is the case then, as shown in Example 7, Theorem 1 implies that the density of the Hüsler–Reiss graphical model factorizes into lower-dimensional Hüsler–Reiss densities; see Corollary 2 in Appendix C.

5 Statistical inference for block graphs

5.1 Model construction

The notion of conditional independence and graphical models for multivariate Pareto distributions allows the construction of new statistical models with two major advantages. First, sparsity can be imposed on the model, which is a crucial ingredient for tractable and parsimonious models in higher dimensions. Second, under certain graphical structures, the model parameters can be estimated separately on lower-dimensional subsets of the data.

We consider here, and throughout the rest of the paper, decomposable and connected graphs $\mathcal{G}=(V,E)$ with clique set $\mathcal{C}$ and separator set $\mathcal{D}$ , where all separators in $\mathcal{D}$ are single nodes. Such graph structures with singleton separator sets are known as block graphs (cf., Harary, 1963) and have already been seen to have appealing properties for discrete data (Loh and Wainwright, 2013). In our case, they are a convenient way of restricting the model complexity in order to obtain a tractable class of extremal graphical models. In fact, Corollary 1 provides a simple construction principle for multivariate Pareto distributions that factorize with respect to the block graph $\mathcal{G}$ .

i)

For each clique $C\in\mathcal{C}$ , choose possibly different parametric families of valid exponent measure densities $\{\lambda_{C}(\cdot;\theta_{C}):\theta_{C}\in\Omega_{C}\}$ for suitable parameter spaces $\Omega_{C}$ . If $\mathcal{G}$ is a tree $\mathcal{T}$ , then this reduces to choosing $d-1$ bivariate exponent measure densities $\lambda_{ij}$ , for each $\{i,j\}\in E$ ; see Example 3 for a general representation of such densities.

ii)

Since all separator sets consist of a single node, the consistency constraint (22) is trivially fulfilled as a consequence of (L1) and (L2) in Section 2.2 and the fact that $\lambda_{D}(y_{D})=y_{D}^{-2}$ for all $D\in\mathcal{D}$ .

iii)

For any fixed combination of parameters $\theta=(\theta_{C})_{C\in\mathcal{C}}\in\Omega=\times_{C\in\mathcal{C}}\Omega_{C}$ , the product of the normalised lower-dimensional exponent measure densities,

[TABLE]

defines a valid $d$ -variate Pareto distribution factorizing according to the graph $\mathcal{G}$ , which is a member of the parametric family parameterized by $\theta\in\Omega$ . For a tree $\mathcal{T}$ , this reduces to the density in (23).
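For the tree case, steps i) to iii) can be sketched numerically. The sketch below assumes that the tree density has the form $\lambda(\boldsymbol{y})=\prod_{\{i,j\}\in E}\lambda_{ij}(y_{i},y_{j})/(y_{i}^{-2}y_{j}^{-2})\prod_{i\in V}y_{i}^{-2}$ , and it uses the bivariate logistic exponent measure density $\lambda_{ij}(y_{1},y_{2})=\frac{1-\theta}{\theta}(y_{1}y_{2})^{-1/\theta-1}(y_{1}^{-1/\theta}+y_{2}^{-1/\theta})^{\theta-2}$ , obtained here by differentiating $\Lambda(y_{1},y_{2})=(y_{1}^{-1/\theta}+y_{2}^{-1/\theta})^{\theta}$ . The function names are hypothetical, nodes are numbered from 0, and the normalizing constant $\Lambda(\boldsymbol{1})$ is deliberately left out.

```python
def logistic_biv_density(y1, y2, theta):
    """Bivariate logistic exponent measure density, theta in (0, 1)."""
    s = y1 ** (-1.0 / theta) + y2 ** (-1.0 / theta)
    return ((1.0 - theta) / theta) * (y1 * y2) ** (-1.0 / theta - 1.0) * s ** (theta - 2.0)


def tree_exponent_density(y, edge_theta):
    """Unnormalised exponent measure density of a tree logistic model.

    y:          list of positive coordinates, one per node 0..d-1
    edge_theta: dict {(i, j): theta_ij} over the edges of the tree
    """
    value = 1.0
    # Univariate margins y_i^{-2}, one factor per node.
    for yi in y:
        value *= yi ** -2.0
    # One ratio lambda_ij / (lambda_i * lambda_j) per edge of the tree.
    for (i, j), theta in edge_theta.items():
        value *= logistic_biv_density(y[i], y[j], theta) / (y[i] ** -2.0 * y[j] ** -2.0)
    return value
```

A quick sanity check of the design is homogeneity: scaling all coordinates by $t>0$ should multiply the result by $t^{-(d+1)}$ , because each of the $d-1$ edge ratios is homogeneous of order $1$ and the product of margins is homogeneous of order $-2d$ . Different edges may carry different parameters $\theta_{ij}$ , and the logistic family could be swapped edge by edge for any other valid bivariate density.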

Concrete examples for this construction are tree logistic or tree Hüsler–Reiss models as described in Example 8, where all cliques have the same type of distributions. The above construction is much more flexible, as it allows us to use different distribution families for the different cliques. Moreover, some, or even all of the cliques may be modeled by non-parametric methods; see Lafferty et al. (2012) for non-parametric tree models in the non-extreme case. In this direction, there is a line of research on kernel-based estimation of exponent measure densities (cf., de Carvalho and Davison, 2014; Marcon et al., 2017; Kiriliouk et al., 2018) that could be used as clique models. We will not follow this approach here.

In the graphical models above, the dependence inside each clique is modeled directly, whereas dependence between components from different cliques is implied by the conditional independence structure of the graph. Even if all cliques are modeled with the same type of parametric family, the joint distribution (30) is typically not of this distribution type. For a tree logistic distribution, for instance, this can easily be seen by comparing its density (23) with that of the $d$ -variate logistic distribution in Example 1. The latter only has one parameter governing the whole $d$ -dimensional dependence structure, whereas the tree has $d-1$ logistic parameters $\{\theta_{ij};\{i,j\}\in E\}$ and thus much higher flexibility.

An important exception is the family of Hüsler–Reiss distributions, which is stable under taking marginal distributions; see Example 7. The following proposition shows that for a given graphical structure as above, if all cliques have Hüsler–Reiss distributions, then so has the full $d$ -dimensional model. This is the converse of Corollary 2 in Appendix C.

Proposition 4.

Let $\mathcal{G}=(V,E)$ be a block graph as above, and suppose that on each clique $C\in\mathcal{C}$ , $\boldsymbol{Y}$ has a $|C|$ -variate Hüsler–Reiss distribution with exponent measure density $\lambda_{C}(\cdot;\Gamma^{(C)})$ parameterized by a $|C|\times|C|$ -dimensional variogram matrix $\Gamma^{(C)}$ . Then there exists a unique solution to the problem:

[TABLE]

with the notation from Proposition 3. The corresponding $d$ -variate Hüsler–Reiss distribution factorizes according to the graph $\mathcal{G}$ into the lower-dimensional Hüsler–Reiss densities on the cliques.

This is a matrix completion problem for variograms similar to what Dempster (1972) introduced for covariance matrices. In our case, the graph is decomposable and the above result relates to the marginal problem studied in Kellerer (1964) and Dawid and Lauritzen (1993). For Hüsler–Reiss marginals on block graphs we even see that the implied $d$ -dimensional distribution is again Hüsler–Reiss. We give a direct, constructive proof in Appendix F. This provides a method to construct high-dimensional Hüsler–Reiss distributions out of many low-dimensional ones. The full $d$ -variate Hüsler–Reiss model without any conditional independencies has $d(d-1)/2$ parameters. A Hüsler–Reiss distribution as in Proposition 4 that factorizes on a block graph with clique set $\mathcal{C}$ has only

$$\sum_{C\in\mathcal{C}}\frac{|C|(|C|-1)}{2}$$

parameters, which can be much smaller than $d(d-1)/2$ .
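The saving can be checked with a few lines of arithmetic. The sketch below assumes that each clique $C$ contributes the $|C|(|C|-1)/2$ free entries of its variogram matrix; the function name `hr_parameter_counts` is hypothetical.

```python
def hr_parameter_counts(cliques):
    """Return (full, sparse) Hüsler-Reiss parameter counts for a block graph.

    cliques: iterable of node collections, one per clique of the graph
    """
    cliques = [tuple(c) for c in cliques]
    d = len(set().union(*cliques))           # number of distinct nodes
    full = d * (d - 1) // 2                   # unrestricted d-variate model
    sparse = sum(len(c) * (len(c) - 1) // 2 for c in cliques)
    return full, sparse
```

For a tree on $d=10$ nodes, every clique is an edge and the call `hr_parameter_counts([(i, i + 1) for i in range(9)])` gives `(45, 9)`.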

5.2 Estimation

Extremal graphical models can be used to build parsimonious statistical models for the tail of a multivariate random vector. In this section we discuss how the model parameters can be estimated efficiently by considering each clique distribution separately.

Let $\boldsymbol{X}=(X_{j})_{j\in V}$ , $V=\{1,\dots,d\}$ , be a random vector in the max-domain of attraction of the max-stable random vector $\boldsymbol{Z}$ as in (3), with each marginal $X_{j}$ in the max-domain of attraction of a generalized extreme value distribution with shape parameter $\xi_{j}$ , $j\in V$ . Equivalently, there exists a sequence of high thresholds $\boldsymbol{t}_{u}=(t_{u1},\dots,t_{ud})$ with $t_{uj}$ tending to the upper endpoint of $X_{j}$ as $u\to\infty$ , and positive normalizing functions $\sigma_{u}=(\sigma_{u1},\dots,\sigma_{ud})$ , such that the distribution of exceedances converges weakly

[TABLE]

where $\boldsymbol{Y}$ is the multivariate Pareto distribution associated with $\boldsymbol{Z}$ . We assume $\boldsymbol{Y}$ to be in the model class of the previous section with density (30), and for now we suppose that the underlying graph $\mathcal{G}=(V,E)$ is known and fixed. The conditional density of $\boldsymbol{X}-\boldsymbol{t}_{u}$ given that $\|\boldsymbol{X}/\boldsymbol{t}_{u}\|_{\infty}>1$ is then approximated by

[TABLE]

This density can be used to estimate jointly the marginal parameters $(\sigma_{uj},\xi_{j})$ , $j\in V$ , and the dependence parameter vector $\theta=(\theta_{C})_{C\in\mathcal{C}}$ of $f_{\boldsymbol{Y}}$ .

In the sequel we concentrate on estimation of the dependence, and we therefore assume that the marginal parameters are known or have been estimated separately. As described in Section 2.2, we can then normalise $\boldsymbol{X}$ to standard Pareto marginals, in which case $\xi_{j}=1$ , $t_{uj}=u$ and $\sigma_{uj}=u$ for all $j\in V$ . We recover the standardized setting of (6) considered throughout the paper, where $\boldsymbol{X}/u$ given that $\|\boldsymbol{X}\|_{\infty}>u$ converges to $\boldsymbol{Y}$ , whose likelihood is proportional as a function of $\theta$ to

[TABLE]

Direct maximization of the likelihood with contributions (34) for each data point is tedious since the normalizing constant $Z_{\theta}$ contains all parameters and does not factorize. Fortunately the class of block graphs has the property that we can estimate the parameters $\theta_{C}$ of each $\lambda_{C}$ separately, without having to enforce the consistency constraints at the separator sets. In fact, we use the following observation. If $\boldsymbol{X}$ is in the domain of attraction of the family of multivariate Pareto distributions $\{f_{\boldsymbol{Y}}(\cdot;\theta):\theta\in\Omega\}$ , then for a fixed clique $C\in\mathcal{C}$ , the subvector $\boldsymbol{X}_{C}$ is in the domain of attraction of $\{f_{C}(\cdot;\theta_{C}):\theta_{C}\in\Omega_{C}\}$ , and the distribution of the normalised exceedance $\boldsymbol{X}_{C}/u\mid\|\boldsymbol{X}_{C}\|_{\infty}>u$ is approximated for large $u$ by $\boldsymbol{Y}_{C}$ with density

[TABLE]

see (7) in Section 2.2. We can therefore obtain an estimate of $\theta_{C}$ based only on data of the components in $C$ , whose dimension is typically much smaller than the dimension $d$ of the full graph. Estimating the cliques separately might in principle result in a loss of estimation efficiency compared to using the joint likelihood (34). The normalizing constant $Z_{\theta}$ does however not contain much information on the parameter $\theta$ and the maximum likelihood estimate using $f_{\boldsymbol{Y}}(\boldsymbol{y};\theta)$ is generally very close to the estimate obtained by maximizing separate likelihoods based on (35). We discuss this point in the simulation study in Section 5.5.

In practice, some components of $\boldsymbol{X}$ might not have converged to the limiting distribution $\boldsymbol{Y}$ . In order to avoid biased estimates of the dependence parameters $\theta_{C}$ , it has become a standard approach to apply censoring to the data; see Ledford and Tawn (1997), Smith et al. (1997). For a data point $\boldsymbol{X}_{C}$ with $\|\boldsymbol{X}_{C}\|_{\infty}>u$ for a high threshold $u>0$ , define $J$ to be the set of indices $j\in C$ such that $Y_{j}<1$ , i.e., $X_{j}<u$ . For this data point we use the censored likelihood contribution

[TABLE]

which uses for all $j\in J$ only the information that this component of $\boldsymbol{Y}_{C}$ is smaller than $1$ , but not its exact value. For explicit forms of the censored likelihoods for many parametric models see Dombry et al. (2017) and Kiriliouk et al. (2018).

For $n$ independent data $\boldsymbol{y}^{(h)}\in\mathcal{L}$ , $h=1,\dots,n$ , of $\boldsymbol{X}/u\mid\|\boldsymbol{X}\|_{\infty}>u$ , for each clique $C$ we define $\widehat{\theta}_{C}$ as the maximizer of the censored log-likelihood

[TABLE]

where $\mathcal{L}_{C}=\{\boldsymbol{y}\in\mathcal{L}:\exists j\in C\text{ s.t. }y_{j}>1\}$ , and each $\boldsymbol{y}^{(h)}_{C}$ has its own censoring set $J^{(h)}\subset C$ .
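The data preparation behind the clique-wise censored likelihood can be sketched as follows. This is only an illustration of the selection and censoring rules described above, not of the likelihood itself; the function name `clique_exceedances` and the toy data are hypothetical, and the input is assumed to be already normalised to standard Pareto marginals.

```python
import numpy as np

def clique_exceedances(x, clique, u):
    """Return (y_C, censored) for one clique.

    x        : (n, d) array, assumed to be on the standard Pareto scale
    clique   : list of column indices forming the clique C
    u        : high threshold, u > 0
    y_C      : rows of x[:, C] / u whose maximum exceeds 1
    censored : boolean array, True where y_C < 1 (the set J of that row)
    """
    x_c = np.asarray(x, dtype=float)[:, clique]
    keep = x_c.max(axis=1) > u      # ||X_C||_inf > u
    y_c = x_c[keep] / u
    censored = y_c < 1.0            # components with X_j < u
    return y_c, censored

# Toy data, made up for illustration only.
x = np.array([[12.0, 3.0, 1.5],
              [2.0, 4.0, 30.0],
              [1.2, 1.1, 2.0],
              [15.0, 11.0, 1.0]])
y_c, censored = clique_exceedances(x, clique=[0, 1], u=10.0)
```

Each retained row carries its own censoring pattern, which matches the remark that every $\boldsymbol{y}^{(h)}_{C}$ has its own set $J^{(h)}$ .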

Maximum likelihood estimation is only one possibility to infer the parameters $\theta_{C}$ based on exceedances of $\boldsymbol{X}_{C}$ and the limiting distribution (35). Alternative methods use $M$ -estimators (Einmahl et al., 2012; Einmahl et al., 2016) or proper scoring rules (de Fondeville and Davison, 2018).

5.3 Model selection

Up to now we have assumed that a graphical structure $\mathcal{G}$ was given a priori, and we analysed models that factorize with respect to this structure. In many applications the underlying graph structure is unknown and should be learned in a data-driven way. Theorem 1 implies that all extremal graphical structures are connected, and trees form a simple and flexible class of connected graphs; see Section 4.1. It is thus natural to first build a suitable tree as a baseline model, and then to extend the tree by adding edges in order to obtain more complex graphs.

Since trees are a special case of general graphical models, there are specific methods to learn these simpler structures. The notion of a minimum spanning tree is crucial (Kruskal, 1956). Let $\mathcal{G}_{0}=(V,E_{0})$ be the fully connected graph on $V=\{1,\dots,d\}$ with edge set $E_{0}=\{(i,j):i,j\in V\}$ . Suppose that a positive weight $w_{ij}>0$ is attached to each edge $(i,j)\in E_{0}$ of $\mathcal{G}_{0}$ . This number can be seen as the length of the edge $(i,j)$ or the distance between nodes $i$ and $j$ , and it is assumed that $w_{ij}=w_{ji}$ and $w_{ii}=0$ , $i,j\in V$ . The minimum spanning tree is the tree $\mathcal{T}_{\operatorname{mst}}=(V,E_{\operatorname{mst}})$ with $E_{\operatorname{mst}}\subset E_{0}$ , that minimizes the sum of weights on that tree, i.e.,

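$$\mathcal{T}_{\operatorname{mst}}=\operatorname*{arg\,min}_{\mathcal{T}=(V,E)}\sum_{(i,j)\in E}w_{ij},\qquad(38)$$

where the minimum is taken over all trees $\mathcal{T}=(V,E)$ with $E\subset E_{0}$ .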

If all edges of $\mathcal{G}_{0}$ have distinct lengths, then $\mathcal{T}_{\operatorname{mst}}$ is unique. This minimization problem can be solved efficiently by the greedy algorithms proposed in Kruskal (1956) or Prim (1957).
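As an illustration of the greedy construction, here is a minimal sketch of Kruskal's algorithm with a union-find structure. The function name `minimum_spanning_tree` and the weight matrix are hypothetical; nodes are indexed from $0$ as is usual in Python.

```python
import numpy as np

def minimum_spanning_tree(w):
    """Kruskal's algorithm on a symmetric (d, d) weight matrix w.

    Returns the list of tree edges (i, j) with i < j and the total weight.
    """
    w = np.asarray(w, dtype=float)
    d = w.shape[0]
    parent = list(range(d))                 # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    # All edges of the fully connected graph, shortest first.
    edges = sorted((w[i, j], i, j) for i in range(d) for j in range(i + 1, d))
    tree, total = [], 0.0
    for weight, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                        # edge joins two components, no cycle
            parent[ri] = rj
            tree.append((i, j))
            total += weight
        if len(tree) == d - 1:
            break
    return tree, total

# Hypothetical weights on four nodes.
w = np.array([[0, 1, 4, 3],
              [1, 0, 2, 5],
              [4, 2, 0, 6],
              [3, 5, 6, 0]])
tree, total = minimum_spanning_tree(w)
```

For these weights all edge lengths are distinct, so the resulting tree is the unique minimizer.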

The weights $w_{ij}$ determine the tree structure and should be chosen carefully. A common approach in graphical modelling is to search for the conditional independence structure that maximizes the likelihood (cf., Cowell et al., 2006, Chapter 11). Such a tree is also called a Chow–Liu tree (Chow and Liu, 1968). We fix a parametric family $\{f(\cdot;\theta_{ij}):\theta_{ij}\in\Omega\}$ of bivariate Pareto distributions that is used for all pairs of nodes. For $n$ independent data $\boldsymbol{y}^{(h)}$ , $h=1,\dots,n$ , the maximal log-likelihood of a fixed tree within this parametric class is essentially the sum of the maximized clique log-likelihoods in (37) over all edges of this tree. In order to find the tree that maximizes the log-likelihood over all trees and all distributions in this parametric family, we therefore find the minimum spanning tree in (38) with weights

[TABLE]

where we include the censored marginal densities $y_{i}^{-2}$ and $y_{j}^{-2}$ in (30) for the clique $\{i,j\}$ , since now the edges are no longer fixed but parameters of the optimization. The resulting tree ${\mathcal{T}}_{\operatorname{mst}}$ is the baseline model for the data. If the model fit is not satisfactory, it is possible to extend this tree to graphs with more complex structures by adding additional edges. The family of Hüsler–Reiss distributions is particularly appealing since the bivariate marginals remain in the same class. We illustrate this model extension through a greedy forward selection in Section 5.5.

The different multivariate Pareto models can then be compared by the Akaike information criterion (Kiriliouk et al., 2018),

[TABLE]

where $p$ is the number of parameters in the respective model, and the second term is twice the negative log-likelihood based on the censored version of (34), evaluated at the optimized parameters of each clique.

5.4 Exact simulation

Exact simulation of a max-stable random vector $\boldsymbol{Z}$ relies on the notion of extremal functions (Dombry and Éyi-Minko, 2013). The extremal function of $\boldsymbol{Z}$ , or of its associated multivariate Pareto distribution $\boldsymbol{Y}$ , relative to coordinate $k\in V$ is the $d$ -dimensional random vector $\boldsymbol{U}^{k}$ with $U^{k}_{k}=1$ such that the exponent measure density of $\boldsymbol{Z}$ can be written as

[TABLE]

The distributions of the extremal functions $\boldsymbol{U}^{k}$ , $k\in V$ , for most commonly used models have explicit forms and are derived in Section 4 of Dombry et al. (2016). Theorem 2 in the same paper relates the distribution of the so-called spectral measure to these extremal functions. Together with the following representation of $\boldsymbol{Y}$ , this enables simulation of multivariate Pareto distributions by rejection sampling. Recall that for any $k\in V$ , the random vector $\boldsymbol{Y}^{k}$ is defined as $\boldsymbol{Y}$ conditioned on the event that $\{Y_{k}>1\}$ .

Lemma 2.

The distribution of the extremal function $\boldsymbol{U}^{k}$ of $\boldsymbol{Y}$ relative to coordinate $k\in V$ is given by the distribution of $\boldsymbol{Y}^{k}/Y^{k}_{k}$ . Independently, let $P$ be a standard Pareto random variable and $T$ uniformly distributed on $\{1,\dots,d\}$ . We then have for any Borel set $A\subset\mathcal{L}$

[TABLE]

The above representation yields a simple algorithm for exact simulation of $\boldsymbol{Y}$ ; see also de Fondeville and Davison (2018).

Algorithm 1 (Exact simulation of a multivariate Pareto distribution $\boldsymbol{Y}$ ).

1. Simulate a standard Pareto random variable $P$ .

2. Simulate $T$ uniformly on $\{1,\dots,d\}$ and sample a realization of the extremal function $\boldsymbol{U}^{T}$ relative to coordinate $T$ .

3. If $\max\{P\|\boldsymbol{U}^{T}\|_{\infty}/\|\boldsymbol{U}^{T}\|_{1}\}>1$ , return $\boldsymbol{Y}=P\boldsymbol{U}^{T}/\|\boldsymbol{U}^{T}\|_{1}$ as a realization of the multivariate Pareto distribution. Else, reject the simulation and go back to step 1.
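The rejection scheme of Algorithm 1 can be sketched in a few lines. The sampler `simulate_pareto` takes any user-supplied function that draws an extremal function. For a concrete run we plug in a Hüsler–Reiss extremal function, assuming the usual log-Gaussian form $\boldsymbol{U}^{k}=\exp\{\boldsymbol{W}^{k}-\Gamma_{\cdot k}/2\}$ with covariance entries $(\Gamma_{ik}+\Gamma_{jk}-\Gamma_{ij})/2$ . All function names and the matrix `gamma` are hypothetical, and this is not the implementation of the R package.

```python
import numpy as np

def hr_extremal_function(gamma, k, rng):
    """Sample U^k for a Huesler-Reiss model with variogram matrix gamma.

    Assumed form: U^k = exp(W^k - gamma[:, k] / 2), W^k centred normal with
    covariance (gamma[i, k] + gamma[j, k] - gamma[i, j]) / 2 and W^k_k = 0.
    """
    d = gamma.shape[0]
    col = gamma[:, k]
    cov = 0.5 * (col[:, None] + col[None, :] - gamma)
    idx = [i for i in range(d) if i != k]
    w = np.zeros(d)                      # k-th entry stays 0, so U^k_k = 1
    w[idx] = rng.multivariate_normal(np.zeros(d - 1), cov[np.ix_(idx, idx)])
    return np.exp(w - col / 2.0)

def simulate_pareto(sample_u, d, rng):
    """Algorithm 1: one exact draw of Y by rejection sampling."""
    while True:
        p = 1.0 / (1.0 - rng.uniform())  # standard Pareto, P(P > x) = 1 / x
        t = rng.integers(d)              # T uniform on {0, ..., d - 1}
        u = sample_u(t, rng)             # extremal function U^T
        y = p * u / u.sum()              # P U^T / ||U^T||_1
        if y.max() > 1.0:                # accept if the sup-norm exceeds 1
            return y

# Hypothetical variogram matrix of a path 0 - 1 - 2.
gamma = np.array([[0.0, 1.0, 2.0],
                  [1.0, 0.0, 1.0],
                  [2.0, 1.0, 0.0]])
rng = np.random.default_rng(1)
samples = np.array([
    simulate_pareto(lambda k, r: hr_extremal_function(gamma, k, r), 3, rng)
    for _ in range(200)
])
```

Every accepted draw lies in $\mathcal{L}$ , that is, at least one of its components exceeds $1$ .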

The complexity of this simulation algorithm as a function of the dimension $d$ of the vector $\boldsymbol{Y}$ is driven by the number of times one has to sample from one of the extremal functions $\boldsymbol{U}^{1},\dots,\boldsymbol{U}^{d}$ , since simulation of the variables $P$ and $T$ requires much less computational effort. Let $C_{\boldsymbol{Y}}(d)$ denote the number of extremal functions that have to be simulated in the above algorithm. The random variable $C_{\boldsymbol{Y}}(d)$ follows a geometric distribution and from (50) in the proof of Lemma 2 its expectation is

[TABLE]

The expected complexity therefore depends on both the dimension and the strength of extremal dependence in $\boldsymbol{Y}$ . Weak dependence implies a large coefficient $\Lambda(\boldsymbol{1})$ closer to $d$ and therefore reduces the computational effort required for exact simulation. The simulation of multivariate Pareto distributions is in general computationally easier than for the associated max-stable distribution $\boldsymbol{Z}$ . Indeed, exact simulation of the latter is also based on samples from a mixture of the $\boldsymbol{U}^{1},\dots,\boldsymbol{U}^{d}$ , and the fastest algorithm in Dombry et al. (2016) has expected complexity $\mathbb{E}\{C_{\boldsymbol{Z}}(d)\}=d$ ; see also Dieker and Mikosch (2015) and Oesting et al. (2018) for other exact simulation methods.

The complexity measures $C_{\boldsymbol{Y}}(d)$ and $C_{\boldsymbol{Z}}(d)$ only consider the number of extremal functions required for one exact simulation of $\boldsymbol{Y}$ and $\boldsymbol{Z}$ , respectively. The computational effort of sampling $\boldsymbol{U}^{k}$ can however be significantly lower if $\boldsymbol{Y}$ has a sparse structure. If $\boldsymbol{Y}$ factorizes according to a graph, then, by Definition 1 of conditional independence, the $\boldsymbol{Y}^{1},\dots,\boldsymbol{Y}^{d}$ inherit the sparsity of this graph structure. This is particularly important in the case of trees and for Hüsler–Reiss distributions, as shown in the examples below. It is important to note that more efficient simulation of the extremal functions speeds up exact simulation not only of the multivariate Pareto distribution $\boldsymbol{Y}$ , but also of the max-stable distribution $\boldsymbol{Z}$ .

Example 11.

Suppose that $\boldsymbol{Y}$ factorizes according to a tree $\mathcal{T}=(V,E)$ . It follows from Proposition 2 and Lemma 2 that the extremal function $\boldsymbol{U}^{k}$ relative to coordinate $k\in V$ is

[TABLE]

For exact simulation of $\boldsymbol{Y}$ it therefore suffices to simulate the univariate random variables $U_{e}$ . This is feasible even in very large dimensions.
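A minimal sketch of this idea, assuming that $U^{k}_{i}$ is the product of independent edge variables $U_{e}$ along the path from $k$ to $i$ . For concreteness each $U_{e}$ is taken to be log-normal, $\exp\{N(-\Gamma_{e}/2,\Gamma_{e})\}$ , which corresponds to Hüsler–Reiss edges; for other families the edge sampler would have to be replaced and may depend on the direction of the edge. The function name and the edge parameters are hypothetical.

```python
import numpy as np
from collections import deque

def tree_extremal_function(d, edge_gamma, k, rng):
    """Sample U^k on a tree as products of independent edge variables.

    edge_gamma maps an undirected edge (i, j) to a positive parameter g.
    Assumption of this sketch: U_e = exp(N(-g / 2, g)) on every edge.
    """
    neighbours = {i: [] for i in range(d)}
    for (i, j), g in edge_gamma.items():
        neighbours[i].append((j, g))
        neighbours[j].append((i, g))
    u = np.ones(d)                       # U^k_k = 1 by definition
    visited = {k}
    queue = deque([k])
    while queue:                         # breadth-first walk away from k
        i = queue.popleft()
        for j, g in neighbours[i]:
            if j not in visited:
                u_e = np.exp(rng.normal(-g / 2.0, np.sqrt(g)))
                u[j] = u[i] * u_e        # product along the path k -> j
                visited.add(j)
                queue.append(j)
    return u

# Hypothetical tree on five nodes with edge parameters.
edges = {(0, 1): 1.0, (1, 2): 0.5, (1, 3): 0.8, (3, 4): 1.2}
rng = np.random.default_rng(0)
u = tree_extremal_function(5, edges, k=1, rng=rng)
```

Only $d-1$ univariate draws are needed per extremal function, so the cost grows linearly in the dimension.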

Example 12.

If $\boldsymbol{Y}$ has a Hüsler–Reiss distribution that factorizes on the graph $\mathcal{G}=(V,E)$ , then it follows from (28) that the extremal function $\boldsymbol{U}^{k}$ relative to coordinate $k\in V$ is

[TABLE]

where $\boldsymbol{W}^{k}$ is a centred normal distribution with covariance matrix $\tilde{\Sigma}^{(k)}$ in (27); see also Proposition 4 in Dombry et al. (2016). The normal distribution $\boldsymbol{W}^{k}_{\setminus k}$ factorizes in the classical sense on the subgraph $\mathcal{G}_{\setminus k}$ , and efficient simulation algorithms exist if the graph is sparse (e.g., Rue and Held, 2005).

The exact simulation algorithms for both multivariate Pareto and max-stable distributions are implemented in our R-package graphicalExtremes (Engelke et al., 2019).

5.5 Simulation study

We assess the efficiency of parameter estimation and model selection in the framework of graphical models for extremes described in the previous sections. We fix a dimension $d$ of variables or nodes $V=\{1,\dots,d\}$ and a block graph $\mathcal{G}=(V,E)$ as in Section 5.1. In this study we simulate samples directly from the limiting distribution $\boldsymbol{Y}$ using the exact Algorithm 1, but we use the censored estimation since this is common practice in applications.

We first choose $d=5$ and let $\mathcal{G}$ be the undirected version of the tree in Figure 2. We simulate $n\in\{100,200\}$ samples $\boldsymbol{y}^{(1)},\dots,\boldsymbol{y}^{(n)}$ of a Hüsler–Reiss distribution with parameter matrix $\Gamma$ that factorizes according to $\mathcal{G}$ . The entries of $\Gamma$ need to be specified only on the submatrices $\Gamma^{(C)}$ for all cliques $C\in\mathcal{C}$ of $\mathcal{G}$ , since the solution to the matrix completion problem in Proposition 4 then yields the unique variogram matrix $\Gamma$ . In this simulation we set

[TABLE]

where we only specified the four parameters $\Gamma_{ij}$ for $(i,j)\in E$ , $i<j$ , to the values in bold, and the rest of the matrix is implied by the graph structure.

In this dimension we can still maximize the censored version of the joint likelihood (34) to obtain an estimate $\widehat{\Gamma}_{ij}^{\text{joint}}$ , $\{i,j\}\in E$ , of the parameters corresponding to the four edges of the tree. We also obtain estimates $\widehat{\Gamma}_{ij}$ , $\{i,j\}\in E$ , of the parameters of each clique separately by maximizing the censored clique likelihood (37). In both cases, the four estimated parameters yield estimates $\widehat{\Gamma}^{\text{joint}}$ and $\widehat{\Gamma}$ of the whole variogram matrix $\Gamma$ through the graph structure. We repeat the simulation and estimation $200$ times and compare the efficiency of both approaches in Figure 4, displaying only the four free parameters that have actually been estimated.

The difference in efficiency between the joint and clique likelihoods seems to be small or even negligible. This is due to two reasons. For non-censored points the two likelihoods only differ by the normalizing constant $Z_{\theta}$ . Since this constant only measures the global strength of dependence and does not depend on the data, it seems not very sensitive to changes in the parameter $\theta$ . The second difference between the two approaches is that they use slightly different data. Consider a clique $C\in\mathcal{C}$ and the corresponding model parameter $\theta_{C}$ . The joint likelihood uses all data $\boldsymbol{Y}$ in the space $\mathcal{L}=\{\boldsymbol{y}\in\mathcal{E}:\exists j\in V\text{ s.t. }y_{j}>1\}$ , but censors all components with $y_{j}\leq 1$ . On the other hand, the clique likelihood uses the marginals $\boldsymbol{Y}_{C}$ of all data $\boldsymbol{Y}$ in $\mathcal{L}_{C}=\{\boldsymbol{y}\in\mathcal{L}:\exists j\in C\text{ s.t. }y_{j}>1\}$ . Consequently, the additional data used in the joint likelihood is in $\mathcal{L}\setminus\mathcal{L}_{C}=\{\boldsymbol{y}\in\mathcal{L}:y_{j}\leq 1\text{ for all }j\in C\}$ . But the contribution to the joint likelihood of data in this set with regard to the parameter $\theta_{C}$ is completely censored and does therefore not add significant additional information. These two considerations underline that estimating the parameters for each clique separately does not result in significant efficiency losses. This is one of the main advantages of graphical models, namely that the distribution is defined locally by the cliques and extends globally by the conditional independence structure. In terms of computational aspects, the joint likelihood becomes infeasible even in moderate dimensions, whereas the clique likelihood is applicable in high dimensions as long as the cliques have small enough sizes. Moreover, the computations for different cliques can be easily parallelized.

For the second experiment we take $d=16$ and let $\mathcal{G}$ be the graph on the left-hand side of Figure 5, which is not a tree. We simulate $n=100$ samples of a Hüsler–Reiss distribution with parameter matrix $\Gamma$ that factorizes according to $\mathcal{G}$ . The parameters of the $p=18$ edges are independently sampled from a uniform distribution on $(0.5,1)$ , under the constraint that $\Gamma$ is conditionally negative definite on cliques with three nodes. We illustrate how we can choose the best graphical model, where we restrict to block graphs as in Section 5.1 with cliques of sizes two and three. We first construct the minimum spanning tree as described in Section 5.3 within the class of Hüsler–Reiss distributions. The estimated edge set of this tree is denoted by $E_{1}$ . The $15$ parameter estimates $\widehat{\Gamma}_{ij}$ , $\{i,j\}\in E_{1}$ , obtained by fitting the clique likelihoods of each clique of the tree yield a unique estimate $\widehat{\Gamma}$ of the $d\times d$ -dimensional variogram matrix; see Proposition 4. This tree model does not contain all edges of the true underlying graph. We therefore perform a greedy forward selection in order to add edges and improve the model. In each step, we define an enlarged edge set $E_{m+1}=E_{m}\cup\{i,j\}$ , $m=1,2,\dots$ , restricting to those edges $\{i,j\}$ , $i,j\in V$ , that still yield a block graph with cliques of maximal size three. We continue this process until no more edges can be added in this way. For the same parameter matrix $\Gamma$ , we repeat the simulation and model selection 100 times. The right-hand side of Figure 5 shows the graph with the selected edges, where the line width of each edge indicates the number of times it has been selected among the first $18$ edges. It can be seen that the graph structure is generally very well identified. For each model and each repetition we also compute the resulting $\operatorname{AIC}$ according to (40). The proportions of repetitions in which the models with $15,\dots,20$ edges have the smallest $\operatorname{AIC}$ are $0.01,0.11,0.23,0.39,0.23,0.03$ , respectively. Even though the $\operatorname{AIC}$ is a criterion built for model estimation and not for identification (cf., Arlot and Celisse, 2010), it seems to be well suited to select the correct degree of sparsity for this extremal graphical model.

6 Application

We illustrate the applicability of extremal graphical models with the example of river discharges in the upper Danube basin, a region that is prone to serious flooding. The data are provided by the Bavarian Environmental Agency (http://www.gkd.bayern.de) and we use $d=31$ gauging stations with $50$ years of common daily data from 1960–2009. The tree induced by the physical flow-connections at these stations is shown on the left-hand side of Figure 6, where the path $10\to 9\to\dots\to 1$ is on the Danube and the other branches are tributaries. The spatial extremal dependence structure of this data set has been studied in Asadi et al. (2015) and we follow their preprocessing steps to make the results comparable. Out of all daily data only the three months June, July and August are considered since the most severe floods occur in this period and are caused by heavy summer rain (Böhm and Wetzel, 2006). The $50\times 92=4600$ observations in these months are declustered in time in order to remove temporal dependence and to match slightly shifted peak flows at different locations. We refer to Asadi et al. (2015) for more details on the data, the declustering method and exploratory analysis concerning stationarity and asymptotic dependence; see also Keef et al. (2009, 2013) for other approaches to flood risk assessment.

The declustering yields $N=428$ supposedly independent events $\boldsymbol{x}^{(1)},\dots,\boldsymbol{x}^{(N)}\in\mathbb{R}^{d}$ . The univariate marginal distributions of these data are estimated in Asadi et al. (2015) by a regionalized extreme value model. We focus on estimation of the extremal dependence and normalise the data empirically to standard Pareto marginals. This still guarantees consistent inference of the dependence parameters (e.g., Genest et al., 1995; Joe, 2015). We obtain $n=117$ approximate samples of $\boldsymbol{Y}$ by $\boldsymbol{y}^{(h)}=\boldsymbol{x}^{(h)}/u$ for all observations with $\|\boldsymbol{x}^{(h)}\|_{\infty}>u$ , where we choose the threshold $u$ as the $90\%$ -quantile of the marginal Pareto distribution.
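The empirical normalisation and the extraction of exceedances can be sketched as follows. The rank transform with denominator $n+1$ is one common convention and an assumption here, as are the function names; the simulated input merely stands in for the declustered events and is not the Danube data.

```python
import numpy as np

def to_standard_pareto(x):
    """Rank-transform each column of x to approximately standard Pareto.

    Uses F_hat = rank / (n + 1) so that 1 / (1 - F_hat) stays finite.
    Ties are broken arbitrarily, so continuous data are assumed.
    """
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    ranks = x.argsort(axis=0).argsort(axis=0) + 1   # ranks 1..n per column
    return 1.0 / (1.0 - ranks / (n + 1.0))

def exceedances(x_pareto, prob=0.9):
    """Rows whose sup-norm exceeds u, divided by u, where u = 1 / (1 - prob)
    is the prob-quantile of the standard Pareto distribution."""
    u = 1.0 / (1.0 - prob)
    keep = x_pareto.max(axis=1) > u
    return x_pareto[keep] / u

# Placeholder data with N = 428 rows and three columns.
rng = np.random.default_rng(0)
x = rng.normal(size=(428, 3))
x_pareto = to_standard_pareto(x)
y = exceedances(x_pareto, prob=0.9)
```

With the $90\%$ -quantile the threshold on the Pareto scale is $u=10$ , and the number of retained rows depends on how strongly the large values of the columns coincide.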

The max-stable Brown–Resnick model in Asadi et al. (2015) corresponds to a parametric family of Hüsler–Reiss Pareto distributions $\{f_{\boldsymbol{Y}}(\cdot;\theta):\theta\in\Omega\}$ at the $31$ gauging stations. The dependence model is tailor-made for this particular application to river extremes and uses several covariates such as distance on the river network, catchment sizes and altitudes. In terms of our new notion of extremal graphical models it is readily checked using the results of Proposition 3 that for any parameter value $\theta\in\Omega$ their model does not exhibit conditional independencies.

We propose a different Hüsler–Reiss model that factorizes according to a sparse graph and does not require any domain knowledge or additional covariates. In fact, we propose a sequence of models

[TABLE]

where $\theta^{(l)}=(\theta^{(l)}_{C})_{C\in\mathcal{C}^{(l)}}$ , and $\mathcal{C}^{(l)}$ is the set of all cliques of the $l$ th extremal graphical model $\mathcal{G}^{(l)}$ according to which the model family $M^{(l)}$ factorizes. As the simplest model we take $\mathcal{G}^{(1)}$ to be the minimum spanning tree within the family of Hüsler–Reiss models as described in Section 5.3. As in the simulation study in Section 5.5, we obtain $\mathcal{G}^{(2)},\dots,\mathcal{G}^{(L)}$ by successively adding edges to the tree $\mathcal{G}^{(1)}$ in a greedy way while restricting the model class to block graphs with cliques of size at most three. The estimated tree $\mathcal{G}^{(1)}$ is shown on the left-hand side of Figure 9 in Appendix D. It is very similar to the tree in Figure 6 that is induced by the flow-connections of the river network. There are however differences, and it is important to note that the flow-connection tree is not necessarily the optimal tree structure in terms of extreme river discharges. Appendix D also contains a sensitivity analysis of the tree structure for different thresholds $u$ , and a comparison to a Gaussian tree model fitted to non-extremal data.

Figure 7 shows the $\operatorname{AIC}$ values for the different models $M^{(1)},\dots,M^{(L)}$ . The forward selection is a greedy approach and is not guaranteed to find the optimal graph. We therefore also initialize the forward selection with the simplest model $\mathcal{G}^{(1)}$ being the flow-connection tree on the left-hand side of Figure 6. This tree must have a larger $\operatorname{AIC}$ than the minimum spanning tree, since all trees on $V$ have the same number of parameters and the minimum spanning tree maximizes the likelihood among them. Interestingly, however, the left panel of Figure 7 shows that after adding edges, the smallest $\operatorname{AIC}$ reached from the flow-connection tree is lower than the smallest $\operatorname{AIC}$ reached from the minimum spanning tree. In this particular case, we thus choose the graph initialized with the flow-connection tree and extended by $9$ additional edges. In general, a tree structure appears to be too simple for this application. The reason is that only part of the extremal dependence of discharges at different locations can be explained by flow-connections. Additional dependence may arise even between flow-unconnected locations due to proximity of their catchments that are affected by the same spatial precipitation events. Asadi et al. (2015) model this explicitly through a variogram with two parts, one for the dependence on the river network and one for the spatial, meteorological dependence. The $9$ additional edges of the graphical model on the right-hand side of Figure 6, which minimizes the $\operatorname{AIC}$ , partly improve the model in terms of this spatial dependence between flow-unconnected stations, but also strengthen it between some flow-connected locations. This best graphical model has $39$ edges and an $\operatorname{AIC}$ of $5269.43$ . It significantly outperforms the simpler tree models with $30$ edges. It also outperforms the spatial model of Asadi et al. (2015), which has only six parameters but an $\operatorname{AIC}$ of $5291.34$ , indicated by the dashed orange line in the left panel of Figure 7.

A popular summary statistic for extremal dependence between $Y_{i}$ and $Y_{j}$ , $i,j\in V$ , is the tail correlation (cf., Coles et al., 1999), which can be expressed as $\chi_{ij}=2-\Lambda_{ij}(1,1)$ . The centre and right panels of Figure 7 compare empirical estimates of these statistics for all pairs of stations with those implied by the fitted models. In terms of this bivariate summary, both models seem to fit the data well, even though the graphical model seems to be slightly less biased than the spatial model. There are also versions of $\chi$ that assess how a model captures the higher-order extremal dependence structure. In Figure 11 in Appendix E we compare the trivariate empirical $\chi$ coefficients with those implied by the fitted spatial and graphical models. Both models fit the trivariate dependence well, again with a slightly lower bias of the graphical model.
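The comparison of empirical and model-implied tail correlations can be sketched as follows. The model value assumes the standard bivariate Hüsler–Reiss expression $\Lambda_{ij}(1,1)=2\Phi(\sqrt{\Gamma_{ij}}/2)$ , and the empirical estimator is the usual rank-based version at a fixed probability level $q$ . The function names and the level $q=0.9$ are illustrative assumptions.

```python
import math
import numpy as np

def hr_chi(gamma_ij):
    """chi_ij = 2 - Lambda_ij(1, 1), assuming the Huesler-Reiss form
    Lambda_ij(1, 1) = 2 * Phi(sqrt(gamma_ij) / 2)."""
    x = math.sqrt(gamma_ij) / 2.0
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))   # standard normal cdf
    return 2.0 - 2.0 * phi

def empirical_chi(xi, xj, q=0.9):
    """Rank-based estimate of P(F_i > q, F_j > q) / (1 - q)."""
    n = len(xi)
    fi = (np.argsort(np.argsort(xi)) + 1) / (n + 1.0)
    fj = (np.argsort(np.argsort(xj)) + 1) / (n + 1.0)
    return np.mean((fi > q) & (fj > q)) / (1.0 - q)

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
chi_same = empirical_chi(x, x)                       # complete dependence
chi_indep = empirical_chi(x, rng.normal(size=1000))  # independent columns
```

The model value decreases from $1$ towards $0$ as $\Gamma_{ij}$ grows, so a small variogram entry corresponds to strong extremal dependence.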

In this application we have only considered block graphs, which are particularly convenient in terms of statistical inference as seen in the previous sections. In general it should be assessed whether this sparse model class is justified for the data. In our case, the bivariate and trivariate $\chi$ coefficients indicate that block graphs are flexible enough to capture the extremal dependence structure of the river data. This is further supported by the fact that the AIC curve in Figure 7 attains its minimum even before the maximal number of edges is added in this model class. It is an important question for future research how extremal graphical models with more complicated structures can be estimated.

7 Discussion

The conditional independence relation $\perp_{e}$ introduced in this paper is natural for a multivariate Pareto distribution $\boldsymbol{Y}$ as it explains the factorization of its density $f_{\boldsymbol{Y}}$ into lower-dimensional marginals (cf., Theorem 1). This establishes a link between extreme value statistics and the broad field of graphical models, and it opens the door to defining sparsity and performing structure learning for tail distributions. In this work we have studied the probabilistic structure and statistical inference for some important models, with the main purpose of modelling the extremal dependence structure. Many subsequent research directions are possible. Directed acyclic graphs as in Gissibl and Klüppelberg (2018) for max-linear models may be formulated in our setting and would yield different factorizations than for undirected graphs, and this would form the basis to extend work on causal inference for extremes (Naveau et al., 2018; Mhalla et al., 2019; Gnecco et al., 2019) to continuous extreme value distributions. The models in this paper are well-suited for asymptotic dependence. Another line of research focuses on multivariate tail models under asymptotic independence (Ledford and Tawn, 1997; Heffernan and Tawn, 2004; Wadsworth et al., 2017). Conditional independence and graphical models have not been studied in this framework, except for the special case of Markov chains (Kulik and Soulier, 2015; Papastathopoulos et al., 2017).

Conditional independence for $\boldsymbol{Y}$ does not carry over to factorization of the density of the associated max-stable distribution $\boldsymbol{Z}$ . By Proposition 1, the conditional independence relation $\perp_{e}$ does however imply the factorization of the exponent measure density $\lambda$ of $\boldsymbol{Z}$ , which is the key object in simulation (Dombry et al., 2016) and full likelihood estimation (Thibaud et al., 2016; Dombry et al., 2017; Huser et al., 2019) of max-stable processes. Thus, sparsity in our notion for multivariate Pareto distributions also facilitates inferential tasks for max-stable distributions, a fact that has been briefly discussed for simulation in Section 5.4 but deserves further investigation.

The application to flood risk assessment is just one illustrative example. Unlike spatial models, extremal graphical models can be applied to multivariate problems without domain knowledge, as for instance in financial or insurance applications. The ability to learn underlying structures in a data-driven way also has great practical potential for exploratory analysis and data visualization. In ongoing research we investigate efficient learning of extremal tree structures and, in the case of Hüsler–Reiss distributions, of more general graphs based on $\ell_{1}$ -regularization.

Acknowledgments

We thank Robin J. Evans and Nicola Gnecco for helpful discussions. We are grateful to the editorial team and the referees for knowledgeable comments that improved the paper. Financial support by the Swiss National Science Foundation (S. Engelke) and by the Berrow Foundation (A. S. Hitz) is gratefully acknowledged. The paper was completed while S. Engelke was a visitor at the Department of Statistical Sciences, University of Toronto.

Appendix

A Definitions for graphical models

Let $\mathcal{G}=(V,E)$ be an undirected graph with node set $V=\{1,\dots,d\}$ and edge set $E\subset V\times V$ ; see Section 2.3. We define the notions of decomposition and decomposability for the graph $\mathcal{G}$ (cf., Lauritzen, 1996, Definition 2.1).

Definition 3.

A triplet $(A,B,C)$ of disjoint subsets of $V$ is said to form a decomposition of $\mathcal{G}$ into the components $\mathcal{G}_{A\cup B}$ and $\mathcal{G}_{B\cup C}$ if $V=A\cup B\cup C$ and

• $B$ separates $A$ from $C$ (i.e., every path from $A$ to $C$ intersects $B$ );

• $B$ is a complete subset.

The decomposition is called proper if $A$ and $C$ are both non-empty. A graph $\mathcal{G}$ is decomposable if it is complete or if there exists a proper decomposition $(A,B,C)$ into decomposable subgraphs $\mathcal{G}_{A\cup B}$ and $\mathcal{G}_{B\cup C}.$ Decomposable graphs are also known as triangulated or chordal graphs.

For instance, $(\{1,2,3\},\{4,5\},\{6\})$ is a proper decomposition of the decomposable graph in Figure 8 into the components with node sets $\{1,2,3,4,5\}$ and $\{4,5,6\}$ .

For a connected, decomposable graph $\mathcal{G}$ , we can order the set of the cliques $\mathcal{C}=\{C_{1},\dots,C_{m}\}$ such that for all $i=2,\dots,m$ ,

[TABLE]

a condition called the running intersection property; cf., Lauritzen (1996, Chapter 2) and Green and Thomas (2013). The sets $D_{i}$ , $i=2,\dots,m$ , are called separators of the graph, and both $\mathcal{C}$ and the collection of separators $\mathcal{D}=\{D_{2},\dots,D_{m}\}$ are uniquely determined up to different orderings. The separators may not all be distinct, and we say that $\mathcal{D}$ is a multiset. A possible enumeration of cliques and separators for the graph in Figure 8 that satisfies the running intersection property is

[TABLE]

From (44) we note that the clique $C_{m}$ intersects the other cliques only in $D_{m}$ . Consider the connected, decomposable subgraph $\mathcal{G}_{m-1}$ of $\mathcal{G}$ with node set $V_{m-1}=V\setminus(C_{m}\setminus D_{m})$ and corresponding induced edge set. The property (44) then holds for $\mathcal{G}_{m-1}$ , which has one clique fewer. Continuing this process, we note that each $C_{j}$ intersects the subgraph $\mathcal{G}_{j-1}$ only in $D_{j}$ , $j=2,\dots,m$ , and $\mathcal{G}_{1}$ with nodes $V_{1}=C_{1}$ is complete.
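A small sketch of how an ordering of cliques can be checked and its separators extracted. It assumes the standard form of the running intersection property (Lauritzen, 1996), namely that $D_{i}=C_{i}\cap(C_{1}\cup\dots\cup C_{i-1})$ is contained in some earlier clique. The function name and the example cliques are hypothetical and do not correspond to Figure 8.

```python
def separators(cliques):
    """Return [D_2, ..., D_m] for an ordered list of cliques.

    Raises ValueError if the ordering violates the running intersection
    property in its standard form (assumed here).
    """
    seps = []
    seen = set(cliques[0])
    for i in range(1, len(cliques)):
        d_i = set(cliques[i]) & seen            # overlap with earlier cliques
        if not any(d_i <= set(c) for c in cliques[:i]):
            raise ValueError(f"ordering fails at clique {i + 1}")
        seps.append(d_i)
        seen |= set(cliques[i])
    return seps

# Hypothetical decomposable graph with three cliques.
good = [{1, 2, 3}, {2, 3, 4}, {4, 5}]
seps = separators(good)

# The ordering matters: this arrangement of a path 1 - 2 - 3 - 4 fails.
bad = [{1, 2}, {3, 4}, {2, 3}]
try:
    separators(bad)
    bad_ok = True
except ValueError:
    bad_ok = False
```

The check does not enforce non-empty separators, which would additionally be required for a connected graph.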

B Link between variogram and covariance matrices

For $k\in V=\{1,\dots,d\}$ , we denote by $\mathcal{P}_{d-1}^{k}$ the set of all strictly positive definite covariance matrices $\Sigma^{(k)}\in\mathbb{R}^{(d-1)\times(d-1)}$ indexed by $V\setminus\{k\}$ . On the other hand, the space of strictly conditionally negative definite $d\times d$ matrices is denoted by

[TABLE]

Lemma 3.

For any $k\in V$ , there is a bijection $\varphi_{k}:\mathcal{D}_{d}\to\mathcal{P}_{d-1}^{k}$ given by

[TABLE]

where $\tilde{\Sigma}^{(k)}$ is the $d\times d$ matrix that coincides with $\Sigma^{(k)}$ for $i,j\neq k$ and that has zeros in the $k$ th column and row.

Proof.

It is easy to check that the two mappings are mutually inverse. To see that the strict positive definiteness of $\Sigma^{(k)}$ is equivalent to the strict conditionally negative definiteness of $\Gamma$ , we observe that for any $\boldsymbol{a}_{\setminus k}\in\mathbb{R}^{d-1}\setminus\{\boldsymbol{0}\}$ and $a_{k}=-\sum_{i\neq k}a_{i}$

[TABLE]

using the fact that $\Gamma$ is symmetric and $\Gamma_{ii}=0$ for all $i\in V$ . The assertion then follows; see also the proof of Lemma 3.2.1 in Berg et al. (1984). ∎
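The computation behind this equivalence can be sketched as follows, under the assumption that the map of Lemma 3 has the form $\Sigma^{(k)}_{ij}=\frac{1}{2}(\Gamma_{ik}+\Gamma_{jk}-\Gamma_{ij})$ for $i,j\neq k$ . The last two steps use $\sum_{j\neq k}a_{j}=-a_{k}$ , the symmetry of $\Gamma$ and $\Gamma_{kk}=0$ .

```latex
\begin{align*}
\boldsymbol{a}_{\setminus k}^{\top}\Sigma^{(k)}\boldsymbol{a}_{\setminus k}
  &= \frac{1}{2}\sum_{i,j\neq k} a_{i}a_{j}\,(\Gamma_{ik}+\Gamma_{jk}-\Gamma_{ij}) \\
  &= \Big(\sum_{j\neq k}a_{j}\Big)\sum_{i\neq k}a_{i}\Gamma_{ik}
     -\frac{1}{2}\sum_{i,j\neq k}a_{i}a_{j}\Gamma_{ij} \\
  &= -\frac{1}{2}\Big(2a_{k}\sum_{i\neq k}a_{i}\Gamma_{ik}
     +\sum_{i,j\neq k}a_{i}a_{j}\Gamma_{ij}\Big)
   = -\frac{1}{2}\,\boldsymbol{a}^{\top}\Gamma\,\boldsymbol{a}.
\end{align*}
```

Hence the quadratic form of $\Sigma^{(k)}$ is strictly positive for all non-zero $\boldsymbol{a}_{\setminus k}$ exactly when $\boldsymbol{a}^{\top}\Gamma\boldsymbol{a}<0$ for all non-zero $\boldsymbol{a}$ whose entries sum to zero.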

C Hüsler–Reiss densities on decomposable graphs

Corollary 2.

Let $\mathcal{G}=(V,E)$ be a decomposable and connected graph, and suppose that $\boldsymbol{Y}$ is a Hüsler–Reiss Pareto distribution that satisfies the pairwise Markov property

$$Y_{i}\perp_{e}Y_{j}\mid\boldsymbol{Y}_{\setminus\{i,j\}},\quad(i,j)\notin E.$$

Then the density of $\boldsymbol{Y}$ factorizes according to $\mathcal{G}$ into lower-dimensional Hüsler–Reiss densities, that is,

[TABLE]

where the sequences of cliques $\{C_{1},\dots,C_{m}\}$ and separator sets $\{D_{2},\dots,D_{m}\}$ have the running intersection property (44), and $k_{i}\in D_{i}$ , $i=2,\dots,m$ , $k_{1}\in C_{1}$ .

Proof.

Theorem 1 and Proposition 3 yield the factorization. It remains to show that the factors in front of the normal densities simplify to $y_{k_{1}}^{-2}\prod_{i\neq k_{1}}y_{i}^{-1}$ . Indeed, since we choose $k_{i}\in D_{i}\subset C_{i}$ , $i=2,\dots,m$ , the ratio $\lambda_{C_{i}}(\boldsymbol{y}_{C_{i}})/\lambda_{D_{i}}(\boldsymbol{y}_{D_{i}})$ contributes the factor $y_{j}^{-1}$ for all $j\in C_{i}\setminus D_{i}$ , and each such $j$ appears exactly once. For $i=1$ , the contribution of $\lambda_{C_{1}}(\boldsymbol{y}_{C_{1}})$ is $y_{k_{1}}^{-2}\prod_{j\in C_{1}\setminus\{k_{1}\}}y_{j}^{-1}$ . ∎
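The bookkeeping in this proof can be illustrated with a short sketch that tracks only the powers of the $y_{j}$ . It assumes, as in the argument above, that each $\lambda_{I}$ written with index $k\in I$ carries the prefactor $y_{k}^{-2}\prod_{j\in I\setminus\{k\}}y_{j}^{-1}$ . The clique sequence and the function name are hypothetical.

```python
from collections import Counter


def power_exponents(cliques, separators, roots):
    """Exponent of each y_j in lambda_{C_1} * prod_{i>=2} lambda_{C_i} / lambda_{D_i}.

    roots[0] = k_1 must lie in C_1 and roots[i-1] = k_i must lie in D_i for i >= 2.
    Only the power prefactors are tracked, not the normal densities.
    """
    def prefactor(index_set, k):
        return Counter({j: (-2 if j == k else -1) for j in index_set})

    total = Counter()
    total.update(prefactor(cliques[0], roots[0]))
    for clique, sep, k in zip(cliques[1:], separators, roots[1:]):
        total.update(prefactor(clique, k))    # numerator lambda_{C_i}
        total.subtract(prefactor(sep, k))     # denominator lambda_{D_i}
    return dict(total)


# Hypothetical clique sequence with separators {2,3} and {4}.
cliques = [{1, 2, 3}, {2, 3, 4}, {4, 5}]
seps = [{2, 3}, {4}]
exponents = power_exponents(cliques, seps, roots=[1, 2, 4])
```

With $k_{1}=1$ , the node $1$ ends up with exponent $-2$ and every other node with exponent $-1$ , in line with the simplified prefactor.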

D Minimum spanning tree for the Danube river

The left-hand side of Figure 9 shows the estimated Hüsler–Reiss minimum spanning tree for the Danube data in Section 6 for a threshold $u$ chosen as the $90\%$ -quantile of the marginal Pareto distribution. In order to assess the sensitivity of the tree structure with respect to the threshold choice, we estimate the minimum spanning tree for thresholds $u$ corresponding to a range of different quantiles. The similarity of these trees to the $90\%$ -quantile tree, measured by the number of identical edges, is shown in Figure 10. The tree structure varies somewhat with the threshold, but most of the $30$ edges are fairly stable throughout a wide range of thresholds. As a comparison, the right-hand side of Figure 9 shows the Gaussian minimum spanning tree fitted to all log-transformed data, using $\log(1-\rho_{ij}^{2})$ as distances in (38), where $\rho_{ij}$ is the correlation coefficient between nodes $i,j\in V$ . The Gaussian tree, a model for non-extremal data, is similar to the Hüsler–Reiss tree, a model for extreme flooding, but there are also some differences. For instance, for the extremal data the ordering of the stations 16 to 19 seems to be less important since large discharges affect all of them at the same time. This is confirmed by the fact that when the Hüsler–Reiss tree is extended to a block graph, additional edges are introduced between these stations.
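The comparison of trees described above can be sketched in a few lines. The code below uses Kruskal's algorithm as one standard way to compute a minimum spanning tree (the method used for the figures is not specified here), builds the Gaussian distances $\log(1-\rho_{ij}^{2})$ , and counts identical edges between two trees. The correlation matrix and the second distance matrix are made-up numbers, not the Danube data.

```python
import math


def minimum_spanning_tree(dist):
    """Kruskal's algorithm on a symmetric distance matrix; returns edges (i, j), i < j."""
    d = len(dist)
    parent = list(range(d))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    tree = set()
    for _, i, j in sorted((dist[i][j], i, j) for i in range(d) for j in range(i + 1, d)):
        ri, rj = find(i), find(j)
        if ri != rj:                       # adding (i, j) creates no cycle
            parent[ri] = rj
            tree.add((i, j))
    return tree


def gaussian_distances(corr):
    """Edge weights log(1 - rho_ij^2); the diagonal is unused and set to 0."""
    d = len(corr)
    return [[math.log(1 - corr[i][j] ** 2) if i != j else 0.0 for j in range(d)]
            for i in range(d)]


# Made-up correlation matrix on four nodes.
corr = [[1.0, 0.9, 0.72, 0.504],
        [0.9, 1.0, 0.8, 0.56],
        [0.72, 0.8, 1.0, 0.7],
        [0.504, 0.56, 0.7, 1.0]]
gaussian_tree = minimum_spanning_tree(gaussian_distances(corr))

# Made-up distances standing in for an extremal tree at some threshold.
other = [[0.0, 1.0, 3.0, 4.0],
         [1.0, 0.0, 1.5, 2.0],
         [3.0, 1.5, 0.0, 2.5],
         [4.0, 2.0, 2.5, 0.0]]
other_tree = minimum_spanning_tree(other)
identical_edges = len(gaussian_tree & other_tree)
```

Repeating the last two steps over a grid of thresholds would give a stability curve of the kind summarised in Figure 10.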

E Trivariate $\chi$ coefficients

Figure 11 shows the empirical estimates of the trivariate coefficients

[TABLE]

against those implied by the fitted spatial model in Asadi et al. (2015) and our graphical model minimizing the $\operatorname{AIC}$ .
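An empirical estimate of such a coefficient can be sketched as follows. The code assumes that the trivariate coefficient is the limit, as the level $p$ tends to $1$ , of the probability that components $i$ and $j$ both exceed their $p$ -quantiles given that component $k$ does. This definition, the rank transformation and the function name are assumptions made for illustration, and the data are artificial.

```python
def empirical_chi3(data, i, j, k, p):
    """Empirical P(F_i(X_i) > p, F_j(X_j) > p | F_k(X_k) > p) with rank-based margins."""
    n = len(data)

    def uniform_ranks(col):
        order = sorted(range(n), key=lambda r: data[r][col])
        ranks = [0.0] * n
        for pos, r in enumerate(order, start=1):
            ranks[r] = pos / (n + 1)
        return ranks

    ui, uj, uk = uniform_ranks(i), uniform_ranks(j), uniform_ranks(k)
    cond = [r for r in range(n) if uk[r] > p]
    if not cond:
        raise ValueError("no observation exceeds the level p in component k")
    joint = sum(1 for r in cond if ui[r] > p and uj[r] > p)
    return joint / len(cond)


# Artificial data: perfectly dependent columns, then one column reversed.
comonotone = [(t, 2 * t, 3 * t) for t in range(1, 101)]
reversed_k = [(t, t, -t) for t in range(1, 101)]
chi_dependent = empirical_chi3(comonotone, 0, 1, 2, p=0.9)
chi_opposed = empirical_chi3(reversed_k, 0, 1, 2, p=0.9)
```

The two artificial data sets give the boundary values $1$ and $0$ , respectively. On real data the estimate would be evaluated at a high level $p$ and compared with the model-implied value.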

F Proofs

Proof of Proposition 1.

The implication $(17)\Rightarrow(i)$ is trivial. For $(i)\Rightarrow(ii)$ let $k\in B$ and suppose that (18) holds, that is,

[TABLE]

For any $\boldsymbol{y}\in\mathcal{L}$ choose $0<t<\min(y_{k},1)$ , i.e., $\boldsymbol{y}/t\in\mathcal{L}^{k}$ , and observe

[TABLE]

using the homogeneity of the $\lambda_{I}$ , and the fact that $f_{I}^{k}(\boldsymbol{y}_{I}/t)=\lambda_{I}(\boldsymbol{y}_{I}/t)$ for any $I\subset V$ with $k\in I$ . Note that for this argument it is crucial that $k$ is an element of all three sets $B$ , $A\cup B$ and $B\cup C$ .

For $(ii)\Rightarrow(17)$ suppose that the factorization (19) of $\lambda$ holds, and let $k\in V$ . For all $\boldsymbol{y}\in\mathcal{L}^{k}$

[TABLE]

for suitable functions $g$ and $h$ , implying the required conditional independence of $f^{k}$ (cf., Lauritzen, 1996, Chapter 3). This shows that condition (17) indeed holds and thus $\boldsymbol{Y}_{A}\perp_{e}\boldsymbol{Y}_{C}\mid\boldsymbol{Y}_{B}$ . ∎

Proof of Theorem 1.

We start by proving that if $\boldsymbol{Y}$ satisfies the pairwise Markov property relative to $\mathcal{G}$ , then the graph $\mathcal{G}$ is necessarily connected. Indeed, suppose $V$ can be split into non-empty, disjoint subsets $V_{1},V_{2}\subset V$ such that every edge $(i,j)\in E$ satisfies either $i,j\in V_{1}$ or $i,j\in V_{2}$ . For an arbitrary $k\in V$ , by assumption, the pairwise Markov property relative to $\mathcal{G}$ is satisfied for $f^{k}$ on $\mathcal{L}^{k}$ and the classical Hammersley–Clifford theorem implies the global Markov property for $f^{k}$ , and in particular

[TABLE]

The discussion after Proposition 1 shows that such a factorization contradicts integrability of the multivariate Pareto density, and therefore the graph has to be connected.

We now show that $(i)\Rightarrow(iii)$ . The pairwise Markov property of $f^{k}$ relative to $\mathcal{G}$ implies by the classical Hammersley–Clifford theorem that

[TABLE]

This representation is not of direct use because it cannot be extended to $f_{\boldsymbol{Y}}$ on the whole space $\mathcal{L}$ : the marginals $f_{I}^{k}$ with $k\notin I$ are not homogeneous. The result however tells us that $\boldsymbol{Y}^{k}$ also satisfies the global Markov property on $\mathcal{L}^{k}$ relative to $\mathcal{G}$ , as defined in Section 2.3. The running intersection property implies that $D_{m}$ separates $C_{m}\setminus D_{m}$ from $(C_{1}\cup\dots\cup C_{m-1})\setminus D_{m}$ . Choose $k\in D_{m}$ , then the global Markov property for $\boldsymbol{Y}^{k}$ yields

[TABLE]

where the second equality holds since $k\in D_{m}$ , and $D_{m}$ is a subset of both $C_{m}$ and $C_{1}\cup\dots\cup C_{m-1}$ . By a homogeneity argument similar to the proof of Proposition 1, this factorization extends to $\lambda$ on the whole space $\mathcal{L}$ , that is,

[TABLE]

It remains to decompose $\lambda_{C_{1}\cup\dots\cup C_{m-1}}$ in the same manner. To this end, choose a new $k\in D_{m-1}$ and note that

[TABLE]

and therefore satisfies the global Markov property relative to the subgraph induced on $C_{1}\cup\dots\cup C_{m-1}$ . Since $f_{C_{1}\cup\dots\cup C_{m-1}}^{k}=\lambda_{C_{1}\cup\dots\cup C_{m-1}}$ on $\mathcal{L}^{k}$ , applying successively the same reasoning as before yields the factorization of $\lambda$ that directly implies the representation in (21) for $f_{\boldsymbol{Y}}$ .

In order to show that $(iii)\Rightarrow(ii)$ , we only need to verify that $\boldsymbol{Y}^{k}$ satisfies the global Markov property on $\mathcal{L}^{k}$ for any $k\in V$ . For disjoint sets $A,B,C\subset V$ such that $B$ separates $A$ from $C$ , the factorization (21) entails that

[TABLE]

for suitable functions $g$ and $h$ , and thus $\boldsymbol{Y}^{k}_{A}\perp\!\!\!\perp\boldsymbol{Y}^{k}_{C}\mid\boldsymbol{Y}^{k}_{B}$ .

The implication $(ii)\Rightarrow(i)$ holds trivially. ∎

Proof of Corollary 1.

It is easy to check that $\lambda$ and $f_{\boldsymbol{Y}}$ are homogeneous of order $-(d+1)$ on $\mathcal{L}$ . Let $\{C_{1},\dots,C_{m}\}$ and $\{D_{2},\dots,D_{m}\}$ be the sequences of cliques and separators with the running intersection property (44). Sequential integration of the function $f_{\boldsymbol{Y}}$ over $C_{m}\setminus D_{m},\dots,C_{2}\setminus D_{2}$ , together with the consistency constraint, shows that it indeed defines a probability density. Theorem 1 implies that the corresponding distribution on $\mathcal{L}$ satisfies the Markov property relative to $\mathcal{G}$ . ∎

Proof of Proposition 2.

The density of the random vector on the right-hand side of (24) is

[TABLE]

where we used (12) for the first equation, and the fact that each node $i\in V\setminus\{k\}$ has exactly one incoming arrow, and the $k$ th node has no incoming arrows. On the other hand, we recall that the density of $\boldsymbol{Y}^{k}$ is $\lambda(\boldsymbol{y})=\Lambda(\boldsymbol{1})f_{\boldsymbol{Y}}(\boldsymbol{y})$ , which factorizes with respect to the tree $\mathcal{T}$ . Comparing the above density with (23) yields the result. ∎
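The orientation used in this proof, where the tree is directed away from the root $k$ , can be made concrete with a small sketch. The tree below and the function name are hypothetical. The check at the end confirms that every node other than the root has exactly one incoming arrow and the root has none.

```python
from collections import Counter, deque


def direct_tree(edges, root):
    """Orient the undirected tree edges away from `root`; returns (parent, child) pairs."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    directed, visited, queue = [], {root}, deque([root])
    while queue:                                   # breadth-first traversal from the root
        node = queue.popleft()
        for child in sorted(neighbours[node] - visited):
            visited.add(child)
            directed.append((node, child))
            queue.append(child)
    return directed


# Hypothetical tree on four nodes, rooted at k = 2.
tree_edges = [(1, 2), (2, 3), (2, 4)]
arrows = direct_tree(tree_edges, root=2)
in_degree = Counter(child for _, child in arrows)
```

Choosing a different root changes the direction of some arrows but not the in-degree pattern, which is why the factorization along the tree holds for every $k\in V$ .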

Proof of Lemma 1.

Without loss of generality, we assume that $k^{\prime}=1$ and $k=2$ . Let the vector $\boldsymbol{W}^{1}=(0,W_{2}^{1},\dots,W_{d}^{1})$ have a centred normal distribution with covariance matrix $\Sigma=\{\sigma_{ij}\}=\tilde{\Sigma}^{(1)}$ , such that

[TABLE]

The precision matrix is obtained by blockwise inversion as

[TABLE]

where $S=\Sigma_{\setminus\{1,2\}}-\sigma_{22}^{-1}\Sigma_{\setminus\{1,2\},2}\Sigma_{2,\setminus\{1,2\}}$ is the Schur complement of the upper left block $\sigma_{22}$ in the matrix $\Sigma^{(1)}$ . The random vector $\boldsymbol{W}^{1}$ can be transformed into

[TABLE]

which is readily verified to have a centred normal distribution with covariance matrix $\tilde{\Sigma}^{(2)}$ . On the other hand, we may write the covariance matrix $\Sigma^{(2)}$ of $(-W_{2}^{1},W_{3}^{1}-W_{2}^{1},\dots,W_{d}^{1}-W_{2}^{1})$ in terms of $\Sigma$ as

[TABLE]

It can be checked that the Schur complement of the upper left block $\sigma_{22}$ in the matrix $\Sigma^{(2)}$ is again $S$ . Thus, blockwise inversion yields

[TABLE]

Comparing these representations of $\Theta^{(1)}$ and $\Theta^{(2)}$ yields the assertion for $i,j\in V\setminus\{1,2\}$ . For $i\neq 2,j=2$ , we observe

[TABLE]

The case $i=j=2$ follows similarly. ∎
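The first part of the assertion, that $\Theta^{(1)}$ and $\Theta^{(2)}$ agree on all entries indexed by $V\setminus\{1,2\}$ , can be verified numerically. The sketch below uses exact rational arithmetic and assumes $\Sigma^{(k)}_{ij}=\frac{1}{2}(\Gamma_{ik}+\Gamma_{jk}-\Gamma_{ij})$ . The variogram is a hypothetical example built from squared distances between four points, and indices are 0-based, so the nodes $1,2$ of the proof correspond to $0,1$ in the code.

```python
from fractions import Fraction


def phi(gamma, k):
    """Sigma^(k) and its index list, assuming Sigma_ij = (G_ik + G_jk - G_ij) / 2."""
    idx = [i for i in range(len(gamma)) if i != k]
    sigma = [[Fraction(gamma[i][k] + gamma[j][k] - gamma[i][j], 2) for j in idx]
             for i in idx]
    return sigma, idx


def inverse(matrix):
    """Gauss-Jordan inverse with exact rational arithmetic."""
    n = len(matrix)
    aug = [list(row) + [Fraction(int(r == c)) for c in range(n)]
           for r, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def precision(gamma, k):
    """Entries of Theta^(k), keyed by the original node labels."""
    sigma, idx = phi(gamma, k)
    theta = inverse(sigma)
    return {(i, j): theta[a][b] for a, i in enumerate(idx) for b, j in enumerate(idx)}


# Hypothetical variogram: squared distances between (0,0,0), (1,0,0), (1,2,0), (0,1,3).
gamma = [[0, 1, 5, 10],
         [1, 0, 4, 11],
         [5, 4, 0, 11],
         [10, 11, 11, 0]]
theta_0 = precision(gamma, 0)
theta_1 = precision(gamma, 1)
common = [(i, j) for i in (2, 3) for j in (2, 3)]
agree = all(theta_0[e] == theta_1[e] for e in common)
```

Because the arithmetic is exact, the comparison is an equality check rather than a tolerance check.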

Proof of Proposition 3.

Let $i,j\in V$ with $i\neq j$ be fixed and choose a $k\neq i,j$ . Let $P$ and $\boldsymbol{W}$ be as in representation (28). Since $Y_{k}^{k}=P$ and due to the independence of $P$ and $\boldsymbol{W}$ we obtain

[TABLE]

where the variable $W_{k}^{k}$ can be deleted from the conditioning since it is deterministic given $P$ , and therefore the reduced precision matrix $\Theta^{(k)}$ of the vector $\boldsymbol{W}_{\setminus k}^{k}$ appears. The last equivalence follows from the well-known fact that conditional independence in multivariate normal models corresponds to zeros in the precision matrix (cf., Example 4).

Let now $k=i\neq j$ and choose a $k^{\prime}\notin\{i,j\}$ . Lemma 1 implies that

[TABLE]

Since $k^{\prime}\in V\setminus\{i,j\}$ , by Proposition 1, $Y_{i}\perp_{e}Y_{j}\mid\boldsymbol{Y}_{\setminus\{i,j\}}$ is equivalent to $Y_{k}^{k^{\prime}}\perp\!\!\!\perp Y_{j}^{k^{\prime}}\mid\boldsymbol{Y}^{k^{\prime}}_{\setminus\{k,j\}}$ . The latter, by the first part of the proof, is then equivalent to $\Theta^{(k^{\prime})}_{jk}=0,$ which, together with (46), yields the assertion. The case $k=j\neq i$ is analogous by symmetry. ∎

Proof of Proposition 4.

Let $C_{1},\dots,C_{m}$ be an enumeration of the cliques of the decomposable connected graph $\mathcal{G}=(V,E)$ . Recall that by assumption, all intersections between pairs of cliques are either empty or contain a single node. We show how to obtain the unique, $d\times d$ -dimensional variogram matrix $\Gamma$ that solves the completion problem (31) by adding one clique after the other. We first set

[TABLE]

Let $I_{p-1}=C_{1}\cup\dots\cup C_{p-1}$ , $2\leq p\leq m$ , be the union of the first $p-1$ cliques, which have been chosen in an order such that $\mathcal{G}$ restricted to $I_{p-1}$ forms a connected graph. Suppose that we have already constructed a unique $|I_{p-1}|\times|I_{p-1}|$ -dimensional variogram matrix $\Gamma^{(I_{p-1})}$ that satisfies

[TABLE]

where here and in the sequel we use the notation $\Theta^{(J,k)}$ as the inverse of $\Sigma^{(J,k)}=\varphi_{k}(\Gamma^{(J)})$ for a variogram matrix $\Gamma^{(J)}$ on some index set $J\subset V$ and $k\in J$ . We next choose a clique, say $C_{p}$ , that intersects $I_{p-1}$ , and this intersection has to be a single node, say $k_{0}\in V$ . Let $I_{p}=I_{p-1}\cup C_{p}$ and define the matrix

[TABLE]

This matrix is an invertible covariance matrix since its blocks are invertible covariance matrices, and its inverse $\Sigma^{(I_{p},k_{0})}$ has the same property with blocks $\Sigma^{(I_{p-1},k_{0})}$ and $\Sigma^{(C_{p},k_{0})}$ . This yields an $|I_{p}|\times|I_{p}|$ -dimensional variogram matrix $\Gamma^{(I_{p})}$ through the mapping $\varphi_{k_{0}}^{-1}$ , which has the form

[TABLE]

This variogram matrix clearly solves the problem (48) with $I_{p-1}$ replaced by $I_{p}$ . It is unique by construction and the fact that $\varphi_{k_{0}}$ and $\varphi_{k_{0}}^{-1}$ are bijections.

Starting with (47) and then adding all cliques for $p=2,\dots,m$ according to the above procedure, we obtain a unique $d\times d$ -dimensional variogram matrix $\Gamma=\Gamma^{(I_{m})}$ that satisfies all constraints in (31). Comparing with Corollary 2 it follows that the corresponding density in (30) is $d$ -variate Hüsler–Reiss with parameter matrix $\Gamma$ . ∎
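One step of this construction can be sketched in code. The sketch assumes that the relation $\Gamma_{ij}=\Sigma_{ii}+\Sigma_{jj}-2\Sigma_{ij}$ holds for $\Sigma=\Sigma^{(I_{p},k_{0})}$ , so that the zero cross-covariances implied by the block structure give $\Gamma_{ij}=\Gamma_{ik_{0}}+\Gamma_{k_{0}j}$ for $i\in I_{p-1}\setminus\{k_{0}\}$ and $j\in C_{p}\setminus\{k_{0}\}$ . The function name and the example matrices are hypothetical.

```python
def glue_variograms(gamma_a, nodes_a, gamma_b, nodes_b):
    """Combine variograms on two index sets that share exactly one node k0.

    Entries within each set are copied; cross entries are
    Gamma_ij = Gamma_{i,k0} + Gamma_{k0,j}.
    """
    shared = set(nodes_a) & set(nodes_b)
    if len(shared) != 1:
        raise ValueError("index sets must intersect in exactly one node")
    (k0,) = shared
    pos_a = {v: p for p, v in enumerate(nodes_a)}
    pos_b = {v: p for p, v in enumerate(nodes_b)}
    nodes = list(nodes_a) + [v for v in nodes_b if v != k0]

    def entry(i, j):
        if i in pos_a and j in pos_a:
            return gamma_a[pos_a[i]][pos_a[j]]
        if i in pos_b and j in pos_b:
            return gamma_b[pos_b[i]][pos_b[j]]
        if i in pos_b:            # reorder so that i comes from the first set
            i, j = j, i
        return gamma_a[pos_a[i]][pos_a[k0]] + gamma_b[pos_b[k0]][pos_b[j]]

    return [[entry(i, j) for j in nodes] for i in nodes], nodes


# Hypothetical cliques {1,2,3} and {3,4} intersecting in the single node k0 = 3.
gamma_c1 = [[0, 2, 2], [2, 0, 2], [2, 2, 0]]
gamma_c2 = [[0, 1], [1, 0]]
gamma_full, order = glue_variograms(gamma_c1, [1, 2, 3], gamma_c2, [3, 4])
```

Applying the function once per clique, in an order that keeps the union connected, mirrors the induction over $p=2,\dots,m$ .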

Proof of Lemma 2.

The general formula for extremal functions in Proposition 1 in Dombry et al. (2016) can be written in terms of the exponent measure density $\lambda$ as

[TABLE]

Since the density of $\boldsymbol{U}^{k}_{\setminus k}=\boldsymbol{Y}^{k}_{\setminus k}/Y^{k}_{k}$ is readily seen to be $\lambda(\boldsymbol{y})$ for $\boldsymbol{y}_{\setminus k}\in[0,\infty)^{d-1}$ and $y_{k}=1$ , it follows with

[TABLE]

that (41) is an equivalent definition of extremal functions.

It follows from Theorem 2 in Dombry et al. (2016) that for a uniform distribution $T$ on $\{1,\dots,d\}$ , the random vector $\boldsymbol{Y}^{T}/\|\boldsymbol{Y}^{T}\|_{1}$ follows the distribution of the spectral measure $H$ on $S_{d-1}=\{\boldsymbol{x}\in\mathcal{E}:\|\boldsymbol{x}\|_{1}=1\}$ associated with the max-stable distribution $\boldsymbol{Z}$ , that is,

[TABLE]

If $A\subset\mathcal{L}$ , then $u\boldsymbol{w}\in A$ implies $u\geq 1$ , and therefore

[TABLE]

since $f_{P}(u)=1/u^{2},u\geq 1$ . For $A=\mathcal{L}=\mathcal{E}\setminus[\boldsymbol{0},\boldsymbol{1}]$ this yields for the conditioning event in (42)

[TABLE]

Since $\boldsymbol{Y}$ has density $\lambda(\boldsymbol{y})/\Lambda(\boldsymbol{1})$ , this concludes the proof. ∎
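The radial variable $P$ with density $f_{P}(u)=1/u^{2}$ , $u\geq 1$ , used in this proof is a standard Pareto variable and can be simulated by inverse transform. The following is a minimal sketch. The function name, the seed and the sample size are arbitrary choices for illustration.

```python
import random


def sample_standard_pareto(rng):
    """Inverse-transform draw from the density 1/u^2 on [1, infinity)."""
    return 1.0 / (1.0 - rng.random())  # 1 - U lies in (0, 1], so the draw is >= 1


rng = random.Random(0)
draws = [sample_standard_pareto(rng) for _ in range(10000)]
# The survival function is P(P > u) = 1/u, so about a quarter of the draws exceed 4.
tail_share = sum(x > 4.0 for x in draws) / len(draws)
```

Since $\int_{1}^{\infty}u^{-2}\,\mathrm{d}u=1$ , no normalising constant is needed for the radial part.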

Bibliography

Arlot, S. and A. Celisse (2010). A survey of cross-validation procedures for model selection. Statist. Surv. 4, 40–79.
Asadi, P., A. C. Davison, and S. Engelke (2015). Extremes on river networks. Ann. Appl. Stat. 9, 2023–2050.
Ballani, F. and M. Schlather (2011). A construction principle for multivariate extreme value distributions. Biometrika 98, 633–645.
Basrak, B. and J. Segers (2009). Regularly varying multivariate time series. Stochastic Process. Appl. 119, 1055–1080.
Beirlant, J., Y. Goegebeur, J. Teugels, and J. Segers (2004). Statistics of Extremes. Wiley Series in Probability and Statistics. John Wiley & Sons, Ltd., Chichester.
Berg, C., J. P. R. Christensen, and P. Ressel (1984). Harmonic Analysis on Semigroups: Theory of Positive Definite and Related Functions, Volume 100 of Graduate Texts in Mathematics. New York: Springer-Verlag.
Böhm, O. and K.-F. Wetzel (2006). Flood history of the Danube tributaries Lech and Isar in the alpine foreland of Germany. Hydrological Sciences Journal 51, 784–798.
Boldi, M.-O. and A. C. Davison (2007). A mixture model for multivariate extremes. J. R. Statist. Soc. B 69, 217–229.
