Evaluating probabilistic forecasts of extremes using continuous ranked probability score distributions

Maxime Taillardat; Anne-Laure Fougères; Philippe Naveau; Raphaël de Fondeville

arXiv:1905.04022·stat.ME·February 9, 2023

TL;DR

This paper investigates the effectiveness of the continuous ranked probability score (CRPS) in evaluating probabilistic forecasts of extreme events, proposing a new approach based on extreme value theory for better assessment.

Contribution

It introduces a formal framework for evaluating extreme event forecasts and proposes a novel method using extreme value theory to improve assessment accuracy.

Findings

01

CRPS is not suitable for extreme event verification when assessed by expectation.

02

A new index based on extreme value theory effectively compares calibrated forecasts for extremes.

03

The proposed method's strengths and limitations are analyzed through theory and simulations.

Abstract

Verifying probabilistic forecasts for extreme events is a highly active research area because popular media and public opinions are naturally focused on extreme events, and biased conclusions are readily made. In this context, classical verification methods tailored for extreme events, such as thresholded and weighted scoring rules, have undesirable properties that cannot be mitigated, and the well-known continuous ranked probability score (CRPS) is no exception. In this paper, we define a formal framework for assessing the behavior of forecast evaluation procedures with respect to extreme events, which we use to demonstrate that assessment based on the expectation of a proper score is not suitable for extremes. Alternatively, we propose studying the properties of the CRPS as a random variable by using extreme value theory to address extreme event verification. An index is introduced…

Tables

Table 1: Benchmark to assess the behavior of forecast evaluation procedures with respect to different tail regimes. All forecasts but $F_{\rm extr}$ are calibrated.

Forecast \ Truth	$Y \overset{d}{=} \mathrm{Exp}(\Delta)$ where $\Delta \overset{d}{=} \Gamma(1/\gamma, 1/\gamma)$, $0 < \gamma < 1$
Ideal $F_{\rm ideal}$	$\mathrm{Exp}(\Delta)$
Climatological $F_{\rm clim}$	$\mathrm{GP}(1, \gamma)$
$\lambda$-Informed $F_{\lambda}$	$\lambda\,\mathrm{Exp}(\Delta) + (1 - \lambda)\,\mathrm{GP}(1, \gamma)$
Extremist $F_{\rm extr}$	$\mathrm{Exp}(\Delta / \nu)$, $\nu > 1$

Table 2: Relative ratio of the mean CRPS, in percent, with respect to the ideal forecast for Model GE with $\gamma=1/4$, based on $T=10^{6}$ observation/forecast pairs.

Truth	$Y \overset{d}{=} \mathrm{Exp}(\Delta)$ where $\Delta \overset{d}{=} \Gamma(4, 4)$
Forecasts	% w.r.t. Ideal
Ideal $F_{\rm ideal}$	100%
Extremist $\nu = 1.1$	100.48%
0.75-Informed $F_{0.75}$	100.90%
0.5-Informed $F_{0.5}$	103.58%
Extremist $\nu = 1.4$	106.68%
0.25-Informed $F_{0.25}$	108.06%
Climatological $F_{\rm clim}$	114.33%
Extremist $\nu = 1.8$	122.89%

Table 3: Availability status of the quantities of interest. It can be an a posteriori availability.

Object	Definition	Availability in practice
$F_{t}$	Distribution of the forecast for time $t$	yes
$y_{t}$	Observed realisation at time $t$	yes
$\delta_{t}$	Conditioning variable	no
$\Delta$	Conditioning random variable	no
$Y_{t}$	Conditional random variable generating $y_{t}$	no
$Y$	Unconditional random variable of the observations	yes
$\mathrm{CRPS}(F_{t}, y_{t})$	CRPS of the forecast/observation pair at time $t$	yes
$\mathrm{CRPS}(F_{t}, Y_{t})$	Random variable associated with $\mathrm{CRPS}(F_{t}, y_{t})$	no
$\mathrm{CRPS}_{\mathcal{S}}(F, Y)$	Random variable generated by the $(\mathrm{CRPS}(F_{t}, y_{t}))_{t}$	yes
$\mathrm{CRPS}_{\mathcal{S}^{*}}(F, Y)$	Random variable generated by the $(\mathrm{CRPS}(F_{t}, y_{\pi(t)}))_{t}$	yes

Table 4: Computation of the Cramér-von Mises statistic from $N$ forecast/observation pairs. It can be done with the R package extremeIndex (Taillardat, 2021a).

0. CRPS estimates for each forecaster:	For the $N$ forecast/observation pairs, compute the corresponding instantaneous CRPS.
1. Estimation of $\gamma$ on the observations:	Find a threshold $u$ above which the Pareto approximation is acceptable, and estimate the Pareto shape parameter $\gamma$ and the scale parameter $\sigma$.
2. For a threshold $w \geq u$:	Compute the scale parameter $\sigma_{w} = \sigma + \gamma w$.
3. Computation of $X_{u}$:	Sort the $m$ CRPS values whose observation satisfies $y \geq w$ in increasing order $s_{1}, \dots, s_{m}$.
For $i \in [1, m]$:	Compute $H_{\gamma, \sigma_{w}}(s_{i})$ for each CRPS value $s_{i}$.
	Compute $\left[\frac{2 i - 1}{2 m} - H_{\gamma, \sigma_{w}}(s_{i})\right]^{2}$.
End 3.
End 2.

Equations

$\mathrm{CRPS}(F, y)$

$\mathrm{wCRPS}(F, y)$

$\frac{\overline{F}\{u + x\,b(u)\}}{\overline{F}(u)} \longrightarrow \overline{H}(x) > 0, \quad u \to x_{F},$

$\overline{H}_{\gamma}(x) = (1 + \gamma x)^{-1/\gamma},$

$P(X - u \geq x \mid X > u) \approx \overline{H}_{\gamma}(x/\sigma) = \left(1 + \frac{\gamma x}{\sigma}\right)^{-1/\gamma},$

$\lim_{x \to x_{*}} \frac{\overline{F}(x)}{\overline{G}(x)} = c \in (0, +\infty).$

$\left|\mathbb{E}_{G}(\mathrm{wCRPS}(G, Y)) - \mathbb{E}_{G}(\mathrm{wCRPS}(F, Y))\right| \leq \eta,$

$\left\{\begin{array}{rl}\Delta & \overset{d}{=} \Gamma(\gamma^{-1}, \gamma^{-1})\\ Y & \overset{d}{=} \mathrm{Exp}(\Delta) \overset{d}{=} \mathrm{GP}(1, \gamma),\end{array}\right.$

$\mathrm{CRPS}(F_{\rm extr}, y) = y + \frac{2\nu}{\delta} \exp\left(-\frac{\delta y}{\nu}\right) - \frac{3\nu}{2\delta};$

$\mathrm{CRPS}(F_{\lambda}, y)$

$\mathcal{S}(F_{T}) = \{\mathrm{CRPS}(F_{t}, Y_{t})\}_{t = 1, \dots, T} \quad \text{and} \quad \mathcal{S}^{*}(F_{T}) = \{\mathrm{CRPS}(F_{t}, Y_{\pi(t)})\}_{t = 1, \dots, T},$

$\mathrm{CRPS}(G, Y) \overset{d}{=} \mathcal{S}^{*}(G) \overset{d}{=} \mathcal{S}(G).$

$P\left(\frac{\mathrm{CRPS}(F_{\delta}, Y_{\delta}) + c_{F_{\delta}} - u_{\delta}}{b_{\delta}(u_{\delta})} > x \,\middle|\, Y_{\delta} > u_{\delta}\right) \longrightarrow (1 + \gamma_{\delta} x)^{-1/\gamma_{\delta}},$

$P\left\{\frac{\mathrm{CRPS}(G, Y) + c_{G} - u}{b(u)} > x \,\middle|\, Y > u\right\} \longrightarrow (1 + \gamma x)^{-1/\gamma}, \quad u \to x_{G},$

$\omega_{u}^{2}\{\mathcal{S}(F_{T})\} = \int_{-\infty}^{+\infty} \left[\hat{K}^{(m)}_{\mathcal{S}, u}(v) - H_{\gamma, \sigma_{u}}(v)\right]^{2} dH_{\gamma, \sigma_{u}}(v),$

$\Omega_{u}^{F} = m \times \omega_{u}^{2}\{\mathcal{S}(F_{T})\} = \frac{1}{12 m} + \sum_{i=1}^{m} \left[\frac{2 i - 1}{2 m} - H_{\gamma, \sigma_{u}}(s_{i})\right]^{2},$

$T_{u}(F, G) = 1 - \frac{\Omega_{u}^{G}}{\Omega_{u}^{F}}.$

$\mathrm{wCRPS}(F, y) = W(y) + 2\,\mathbb{E}_{F}\left[\{W(X) - W(y)\} \mathbb{1}_{X > y}\right] - 2\,\mathbb{E}_{F}\left[W(X) F(X)\right].$

$\mathrm{wCRPS}(F, y) = \mathbb{E}_{F}|W(X) - W(y)| - \frac{1}{2} \mathbb{E}_{F}|W(X) - W(X^{\prime})|.$

$\mathbb{E}_{F}|W(X) - W(y)|$

$\mathbb{E}_{F}|W(X) - W(X^{\prime})|$

$\mathrm{wCRPS}(F, y)$

$X_{u} = Y\,\mathbb{1}\{u \geq Y\} + (Z + u)\,\mathbb{1}\{Y > u\},$

$\overline{F_{u}}(x)$

$\overline{F_{u}}(x) \leq \overline{G}(x).$

$\mathbb{E}\left[W(Y)\,\mathbb{1}\{Y < x\}\right] = \mathbb{E}\left[W(X_{u})\,\mathbb{1}\{X_{u} < x\}\right].$

$\frac{1}{2}\left[\mathrm{wCRPS}(F_{u}, x) - \mathrm{wCRPS}(G, x)\right]$

$\Delta(x) = \mathbb{E}_{G}\left[(W(Y) - W(x))\,\mathbb{1}\{Y \leq x\}\right] - \mathbb{E}_{F_{u}}\left[(W(X_{u}) - W(x))\,\mathbb{1}\{X_{u} \leq x\}\right].$

$\frac{1}{2}\left|\mathbb{E}_{G}[\mathrm{wCRPS}(F_{u}, Y)] - \mathbb{E}_{G}[\mathrm{wCRPS}(G, Y)]\right| \leq \int_{u}^{x_{G}} \Delta(x)\,dG(x).$

$\Delta(x)$

Full text

Extreme events evaluation using CRPS distributions

Maxime Taillardat

Anne-Laure Fougères

Philippe Naveau

Raphaël de Fondeville

CNRM, Université de Toulouse, Météo-France, CNRS, Toulouse, France.

Météo-France, Toulouse, France

Univ. Lyon, Université Claude Bernard Lyon 1, CNRS UMR 5208, Institut Camille Jordan, F-69622 Villeurbanne, France

Laboratoire des Sciences du Climat et de l’Environnement, UMR 8212, CEA-CNRS-UVSQ, IPSL & U Paris-Saclay, Gif-sur-Yvette, France

Swiss Data Science Center, ETH Zürich and EPFL, Switzerland

Abstract

Verification of probabilistic forecasts for extreme events has been a very active field of research, stirred by media and public opinion, which naturally focus their attention on extreme events and easily draw biased conclusions. In this context, classical verification methodologies tailored for extreme events, such as thresholded and weighted scoring rules, have undesirable properties that cannot be mitigated; the well-known Continuous Ranked Probability Score (CRPS) is no exception.

In this paper, we define a formal framework to assess the behavior of forecast evaluation procedures with respect to extreme events, which we use to show that assessment based on the expectation of a proper score is not suitable for extremes. As an alternative, we propose to study the properties of the CRPS as a random variable, using extreme value theory to address the verification of extreme events. To compare calibrated forecasts, an index is introduced that summarizes the ability of probabilistic forecasts to predict extremes. Its strengths and limitations are discussed using both theoretical arguments and simulations.

Keywords: CRPS, Extreme events, Probabilistic forecasting, Scoring rules, Calibration, Verification.

Journal: International Journal of Forecasting

1 Introduction

By definition, the rarity of extreme events makes it difficult to issue relevant forecasts, and assessing their performance is an even greater challenge. In particular, the scarcity of extremes means that verification schemes have to be built and understood in a probabilistic sense. The general framework for probabilistic forecast evaluation compares an observation $y$ with a probabilistic forecast $F$, represented by its cumulative distribution function (cdf). The framework also assumes that $y$ is drawn from a random variable $Y$ with cdf $G$. To make better use of the forecasts, it is generally convenient, and even recommended (Ferro and Stephenson, 2011), to further assume that the forecast $F$ is calibrated (Dawid, 1984; Diebold et al., 1997), i.e., that the predictive distribution resembles the distribution of the observations given the information contained in the forecast. For a formal definition of auto-calibration (simply called calibration in the following), we refer to the works of Tsyplakov (2011) and Strähl and Ziegel (2017), summarized in Appendix A.

Calibrated forecasts are commonly evaluated based on their sharpness, also called refinement by Winkler et al. (1996), which usually refers to their spread. This leads to the paradigm of ‘maximizing sharpness subject to calibration’, introduced by Gneiting et al. (2007) and later formally justified by Tsyplakov (2011).

Probabilistic forecasting has become increasingly popular in recent years in various fields such as economics and finance (Galbraith and Norden, 2012), demography and social science (Raftery and Ševčíková, 2021), health (Henzi et al., 2021), energy (Hong et al., 2016), and hydrology and hydraulics (Tiberi-Wadier et al., 2021). In this work, we focus on probabilistic weather forecasts (Leutbecher and Palmer, 2008). Indeed, probabilistic forecasts are nowadays issued by most National Weather Services (NWS), and $F$ is known through a sample of finite size called an “ensemble” (see, e.g., Zamo and Naveau, 2017). In this context, forecast verification is performed by computing scoring rules such as the Continuous Ranked Probability Score (CRPS) (Epstein, 1969; Hersbach, 2000; Bröcker, 2012)

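$$\mathrm{CRPS}(F,y)=\int_{-\infty}^{+\infty}\{F(x)-\mathbb{1}(x\geq y)\}^{2}\,dx=\mathbb{E}_{F}|X-y|-\frac{1}{2}\mathbb{E}_{F}|X-X^{\prime}|,\tag{1}$$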

where $y\in\mathbb{R}$, and $X$ and $X^{\prime}$ are independent random variables with common cdf $F$. The CRPS is attractive because it does not require predictive densities, can be estimated non-parametrically, and has a simple interpretation. The right-hand side of Equation (1) decomposes the CRPS into, in this order, a calibration term and a sharpness term (Gneiting and Raftery, 2007). Alternative decompositions are also available; see Taillardat et al. (2016), Bessac and Naveau (2021), and Appendix B.
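When $F$ is only known through an ensemble, the kernel form $\mathbb{E}_{F}|X-y|-\frac{1}{2}\mathbb{E}_{F}|X-X^{\prime}|$ can be estimated by replacing expectations with ensemble averages. The following is a minimal sketch of that plug-in estimator; the function name `crps_ensemble` is illustrative and not taken from the paper, and the double loop is quadratic in the ensemble size, which is acceptable for the small ensembles typical of weather forecasting.

```python
def crps_ensemble(members, y):
    """Plug-in CRPS estimate: mean|x_i - y| - 0.5 * mean|x_i - x_j|."""
    m = len(members)
    # First term: average distance between ensemble members and the observation.
    term_obs = sum(abs(x - y) for x in members) / m
    # Second term: average distance between all ordered pairs of members.
    term_spread = sum(abs(a - b) for a in members for b in members) / (m * m)
    return term_obs - 0.5 * term_spread


# A one-member ensemble is a point forecast: the CRPS is the absolute error.
point_score = crps_ensemble([1.0], 3.0)
# A two-member ensemble centred on the observation.
pair_score = crps_ensemble([0.0, 2.0], 1.0)
```

The first term rewards ensembles that sit close to the observation, and the second rewards spread, which is why a point forecast is scored by its plain absolute error.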

For the forecast evaluation of extreme events, proper weighted scoring rules were introduced by Gneiting and Ranjan (2011) and Diks et al. (2011). For a non-negative function $w(x)$ , the weighted CRPS

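$$\mathrm{wCRPS}(F,y)=\int_{-\infty}^{+\infty}w(x)\{F(x)-\mathbb{1}(x\geq y)\}^{2}\,dx=\mathbb{E}_{F}|W(X)-W(y)|-\frac{1}{2}\mathbb{E}_{F}|W(X)-W(X^{\prime})|,\tag{2}$$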

with $W(x)=\int_{-\infty}^{x}w(t)dt$, aims to emphasize a region of interest, for instance distributional tails. When $w$ is continuous, an alternative expression of the weighted CRPS is available and can be found in Appendix B. The choice of the weight function $w(x)$ is complex and depends on the different stakeholders, such as forecast users and forecasters; see, e.g., Ehm et al. (2016); Gneiting and Ranjan (2011); Patton (2014); Smith et al. (2015); Taillardat (2021b). Even in the hypothetical case where $w(x)$ could be objectively defined, it is essential that verification be performed on the whole set of observations (Lerch et al., 2017), and one can wonder whether the corresponding weighted CRPS correctly discriminates between two competing forecasts with respect to extreme events.

In this work, we show that the expected weighted CRPS cannot discriminate between forecasts with different extremal tail behaviors, a potentially disqualifying defect for the evaluation of extremes. To address this issue, we view the CRPS as a random variable. Its tail behavior is derived and compared to the tail regime of the observations using Extreme Value Theory (EVT) (see, e.g., De Haan and Ferreira, 2007).

This work is organized as follows. Section 2 provides an analysis of the weighted CRPS with respect to the notion of tail equivalence, the backbone of EVT. In particular, we propose a benchmark to compare the tail properties of forecast verification tools, which allows us to pinpoint the shortcomings of the CRPS and its weighted counterpart for scoring extreme events. In Section 3, we study the CRPS as a random variable and establish theoretical links between its tail behavior and the tail distribution of the observations. These mathematical connections help us to propose and study a new index to assess the skill of calibrated probabilistic forecasts with respect to extreme events. The strengths and pitfalls of this index and potential future work are discussed in Section 4.

2 Limitations of the (w)CRPS as a proper scoring rule for extremes

2.1 Tail modelling using EVT

Thanks to the pioneering work of Gumbel (1935) and De Haan (1970), EVT provides a theoretically justified framework to model the tail of random variables, more precisely excesses above a large threshold; see, e.g., Embrechts et al. (1997); Beirlant et al. (2004). For any random variable $X$ with cdf $F$ , EVT models assume the existence of a domain of attraction, i.e., that there exists a positive auxiliary function $b$ , such that

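$$\frac{\overline{F}\{u+x\,b(u)\}}{\overline{F}(u)}\longrightarrow\overline{H}(x)>0,\quad u\to x_{F},\tag{3}$$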

where $\overline{F}=1-F$ is the survival function, also called the tail function, and $x_{F}=\sup\{x:F(x)<1\}$ is the upper endpoint of $F$. Under condition (3), denoted $F\in\mathcal{D}(H)$, the Pickands-Balkema-de Haan theorem (De Haan, 1970; Pickands, 1975) establishes that $\overline{H}$ has to belong to the family of generalized Pareto (GP) survival functions, i.e.,

$$\overline{H}_{\gamma}(x)=(1+\gamma x)^{-1/\gamma},$$

where $x\in\{x:1+\gamma x>0\}$ . As a consequence, the GP tail appears to be the ideal candidate to approximate the survival function of exceedances over a large threshold $u>0$ , i.e.,

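$$P(X-u\geq x\mid X>u)\approx\overline{H}_{\gamma}(x/\sigma)=\left(1+\frac{\gamma x}{\sigma}\right)^{-1/\gamma},$$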

where $x\in\{x:1+\gamma x/\sigma>0\}$ and $\sigma>0$. The GP family covers the three possible regimes of tail decay, determined by the value of the tail index $\gamma$: when $\gamma>0$ the decay is polynomial (heavy tail), and when $\gamma<0$ the distribution has a finite upper endpoint. For $\gamma=0$, the GP survival function is understood as the limit $\gamma\to 0$ and becomes exponential, i.e., $\overline{H}_{0}(z/\sigma)=e^{-z/\sigma}$.
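The three regimes can be made concrete with a small sketch of the GP survival function $\overline{H}_{\gamma}(x/\sigma)$. The function name `gp_survival` and the handling of negative arguments are illustrative choices, not part of the paper.

```python
import math


def gp_survival(x, gamma, sigma=1.0):
    """GP survival function (1 + gamma * x / sigma)^(-1/gamma) for x >= 0."""
    if x < 0.0:
        return 1.0
    if gamma == 0.0:
        # Exponential regime, obtained as the limit gamma -> 0.
        return math.exp(-x / sigma)
    z = 1.0 + gamma * x / sigma
    if z <= 0.0:
        # Only possible when gamma < 0: x lies beyond the endpoint -sigma/gamma.
        return 0.0
    return z ** (-1.0 / gamma)


heavy = gp_survival(3.0, 1.0)          # polynomial decay
bounded = gp_survival(2.0, -0.5)       # at the upper endpoint -sigma/gamma = 2
near_exp = gp_survival(1.0, 1e-6)      # close to exp(-1)
```

With $\gamma=1$ the survival probability at $x=3$ is $1/4$, with $\gamma=-1/2$ no mass lies beyond $x=2$, and a very small positive $\gamma$ is numerically indistinguishable from the exponential case.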

2.2 Tail equivalence and proper scoring rules

The comparison of the tail behavior of two random variables, or equivalently their respective cdfs $F$ and $G$ , can be framed using the notion of tail equivalence.

Definition 1.

(Embrechts et al., 1997, Section 3.3) Two random variables $X$ and $Y$ with respective cdfs $F$ and $G$ are tail equivalent if they have the same upper endpoint $x_{F}=x_{G}=x_{*}$ and if their survival functions $\overline{F}$ and $\overline{G}$ satisfy

$$\lim_{x\to x_{*}}\frac{\overline{F}(x)}{\overline{G}(x)}=c\in(0,+\infty).$$

Tail equivalence can also be simply expressed as the equality of tail indexes. In terms of extremal forecasts, we expect that, between two forecasters, one should favor the one that is tail equivalent to the observations. In practice, this may be difficult. For instance, consider two GP distributed random variables $X_{1}$ and $X_{2}$ with survival functions $\overline{H}_{1}(x)$ and $\overline{H}_{1+\epsilon}(x/\sigma)$ with $\sigma=(1+\epsilon)/(2^{1+\epsilon}-1)$. By construction, the medians of $X_{1}$ and $X_{2}$ are both equal to one. Still, their tail behaviors differ widely even for small $\epsilon$: the 100-year return level for $X_{1}$ is 99, while it is equal to 138 for $X_{2}$ with $\epsilon=0.1$. In other words, if these random variables represented water levels, a small difference of $0.1$ in tail index would imply a difference of $39$ meters, which would most likely cause massive and destructive flooding.
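The numbers in this example can be checked by inverting the GP survival function. The sketch below assumes that the 100-year return level is the level exceeded with probability $1/100$ (one observation per year); the helper name `gp_quantile` is illustrative.

```python
def gp_quantile(p_exceed, gamma, sigma=1.0):
    """Level x such that (1 + gamma * x / sigma)^(-1/gamma) = p_exceed, gamma != 0."""
    return sigma * (p_exceed ** (-gamma) - 1.0) / gamma


eps = 0.1
# Scale chosen in the text so that X_2 has median one.
sigma_2 = (1.0 + eps) / (2.0 ** (1.0 + eps) - 1.0)

median_1 = gp_quantile(0.5, 1.0)
median_2 = gp_quantile(0.5, 1.0 + eps, sigma_2)
level_1 = gp_quantile(0.01, 1.0)                  # 100-year level of X_1
level_2 = gp_quantile(0.01, 1.0 + eps, sigma_2)   # 100-year level of X_2
```

Both medians come out equal to one, `level_1` equals 99 up to rounding error, and `level_2` rounds to 138, which reproduces the gap of 39 quoted in the text.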

This short example illustrates that issuing forecasts with the right tail regime, i.e., as close as possible to the observational one, is a priority for extreme events, and that a verification methodology should reward forecasts with a close, if not equal, tail regime. Ideally, the measure of forecast performance should give not only the distance but also the ‘direction’, i.e., whether the forecast is more likely to over- or under-estimate the high quantiles. Indeed, let $\gamma_{G}\in\mathbb{R}$ be the tail index of the observations. If the forecast satisfies $\gamma_{F}>\gamma_{G}$, it over-estimates the risk, producing a pessimistic or risk-averse scenario. On the contrary, a forecast with $\gamma_{F}<\gamma_{G}$ falls on the optimistic side by under-estimating the likelihood of extreme events.

Classical methods for forecast evaluation, even when designed to focus on extreme events, do not preserve tail equivalence. For instance, for any positive $\eta$ and observation distribution $G$, it is always possible to construct a non-tail-equivalent cdf $F$ such that

$$\left|\mathbb{E}_{G}(\mathrm{wCRPS}(G,Y))-\mathbb{E}_{G}(\mathrm{wCRPS}(F,Y))\right|\leq\eta,\tag{4}$$

A proof can be found in Appendix C. More precisely, if $G\in\mathcal{D}(H_{\gamma_{G}})$, then it is possible for any arbitrary $\gamma_{F}\in\mathbb{R}$ to find $F\in\mathcal{D}(H_{\gamma_{F}})$ satisfying Equation (4). Thus the expected (weighted) CRPS is unable to properly discriminate between forecasts with different tail regimes, as non-tail-equivalent forecasts can perform almost as well as the ideal forecast $G$. A detailed illustration of this result for GP forecasts is given in Appendix D. We also refer to Brehmer and Strokorb (2019), who obtained a more general result, proving that expectations of proper scoring rules are not suitable for distinguishing tail properties; see their Theorem 5.4.

2.3 A benchmark for assessing forecasts of extremes

Following Gneiting et al. (2007) and Strähl and Ziegel (2017), we propose a benchmark to assess the behavior of forecast evaluation procedures with respect to tail regimes. The design relies on a hierarchical model based on Gamma–exponential mixtures with $\gamma>0$

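$$\left\{\begin{array}{rl}\Delta&\overset{d}{=}\Gamma(\gamma^{-1},\gamma^{-1})\\ Y&\overset{d}{=}\textrm{Exp}(\Delta)\overset{d}{=}\textrm{GP}(1,\gamma),\end{array}\right.\tag{5}$$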

where $\textrm{Exp}(\delta)$ refers to an exponential random variable with rate $\delta>0$ (mean $1/\delta$), the parameterization under which the mixture in relation (5) is GP distributed. The fact that $Y$ follows a heavy-tailed GP distribution, see relation (5), can be proved using Laplace transforms. By analogy with weather forecasting, we present the benchmark in a temporal setting. At each time $t=1,\dots,T>1$, an observation $y$ is drawn independently from an exponential distribution whose rate $\delta$ is a realization of $\Delta$. In this setting, $Y$ has an exponential tail conditionally on the information brought by $\delta$, which represents the a priori knowledge of the system, for instance the weather at the previous time step. Thus the ideal forecast for each time step is $\textrm{Exp}(\delta)$, and requires the knowledge of $\delta$. Using relation (5), we see that the climatological forecaster $F_{\rm clim}$ is a GP distribution with tail index $\gamma$ and unit scale. Climatology is a commonly used forecast reference in meteorology. In other fields, it can be viewed as the unconditional distribution of the truth, and a climatological forecast can be estimated from a sample of past and analogous observations. This setting is attractive as the ideal and the climatological forecasters belong to two different regimes of tail decay.
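The hierarchical model can be simulated in a few lines to check that the unconditional distribution of $Y$ matches $\textrm{GP}(1,\gamma)$. This is a sketch under the assumption that $\delta$ acts as the rate of the exponential and that $\Gamma(\gamma^{-1},\gamma^{-1})$ is a shape/rate parameterization (so that $\Delta$ has mean one); the function name, sample size, and seed are illustrative.

```python
import random


def simulate_model_ge(gamma, n, seed=0):
    """Draw n pairs (delta_t, y_t) from the Gamma-exponential mixture."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        # random.gammavariate takes (shape, scale): rate 1/gamma means scale gamma.
        delta = rng.gammavariate(1.0 / gamma, gamma)
        # random.expovariate takes the rate of the exponential distribution.
        y = rng.expovariate(delta)
        pairs.append((delta, y))
    return pairs


gamma = 0.25
pairs = simulate_model_ge(gamma, n=100_000)
# Empirical survival probability at y = 1 versus the GP(1, gamma) value.
empirical = sum(1 for _, y in pairs if y > 1.0) / len(pairs)
theoretical = (1.0 + gamma * 1.0) ** (-1.0 / gamma)
```

For $\gamma=1/4$ the theoretical value is $1.25^{-4}=0.4096$, and the empirical frequency should agree with it up to Monte Carlo error.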

We introduce alternative competitors modelling partial knowledge of the conditional state: the $\lambda$-informed forecaster $F_{\lambda}$, with $\lambda\in[0,1]$, is a mixture of the ideal and climatological forecasts in which the weight $\lambda$ is the contribution of the ideal forecast and $1-\lambda$ that of the climatology; see Table 1 for the definition.

Finally, the extremist forecaster $F_{\rm extr}$ simply adds a multiplicative bias to the ideal forecaster: although it is not calibrated, such a forecast has the same tail behavior as the ideal forecaster; see Appendix A for a detailed discussion of calibration. The benchmark is summarized in Table 1 and later referred to as “Model GE”.

Closed forms of the CRPS are available for each forecast of the proposed benchmark. For instance, the extremist forecast $F_{\rm extr}$ satisfies

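$$\mathrm{CRPS}(F_{\rm extr},y)=y+\frac{2\nu}{\delta}\exp\left(-\frac{\delta y}{\nu}\right)-\frac{3\nu}{2\delta}.\tag{6}$$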

Besides, combining (12) and (6) yields the following formula for the $\lambda$ -informed forecast, $\lambda\in[0,1],$

[TABLE]

where ${\rm I}\!\Gamma(s,x)=\int_{x}^{+\infty}e^{-t}t^{s-1}\,dt$ . Table 2 gives the relative ratio of the empirical means of the CRPS for the benchmark with $\gamma=1/4$ .

The CRPS being a proper score, the ideal forecast cannot be beaten on average in Table 2. Moreover, there is a clear ranking among calibrated forecasts, based on the nested information sets (Holzmann and Eulert, 2014). Following the principle of tail equivalence presented in Section 2.2, the extremist forecast should be the closest to the ideal one, as they both belong to the same regime of tail decay; however, Table 2 shows that the mean CRPS can rank it below partially informed forecasts and, for $\nu=1.8$, even below the climatology. An alternative measure for forecast evaluation that satisfies the tail equivalence principle is thus required. A good candidate commonly used in forecast science is the ROC curve (Gneiting and Vogel, 2018). However, in the case of Model GE, all the ROC curves, except the climatological one, coincide for any event, which illustrates the invariance of the ROC curve under calibration (Kharin and Zwiers, 2003). Further alternatives should thus be investigated.
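The extremist rows of Table 2 can be cross-checked analytically. Assuming $\delta$ is the rate of the exponential, the closed form $y+\frac{2\nu}{\delta}e^{-\delta y/\nu}-\frac{3\nu}{2\delta}$ integrated against $Y\sim\textrm{Exp}(\delta)$ gives $\mathbb{E}[\mathrm{CRPS}\mid\delta]=\frac{1}{\delta}\left(1+\frac{2\nu^{2}}{\nu+1}-\frac{3\nu}{2}\right)$, so the factor $\mathbb{E}(1/\Delta)$ cancels in the ratio to the ideal forecast ($\nu=1$). The sketch below, with illustrative function names, evaluates that ratio; it is a consistency check and not the simulation used in the paper.

```python
import math


def crps_exponential(y, rate):
    """Closed-form CRPS of an exponential forecast with the given rate, y >= 0."""
    return y + 2.0 / rate * math.exp(-rate * y) - 1.5 / rate


def extremist_ratio(nu):
    """Expected CRPS of Exp(delta / nu) relative to the ideal Exp(delta).

    Conditionally on delta the expected score is g(nu) / delta with
    g(nu) = 1 + 2 nu^2 / (nu + 1) - 1.5 nu, so delta cancels in the ratio.
    """
    g_nu = 1.0 + 2.0 * nu ** 2 / (nu + 1.0) - 1.5 * nu
    g_ideal = 0.5  # g(1)
    return g_nu / g_ideal


ratios_percent = {nu: 100.0 * extremist_ratio(nu) for nu in (1.1, 1.4, 1.8)}
```

The resulting percentages are approximately 100.48, 106.67 and 122.86, which agree with the simulated values 100.48, 106.68 and 122.89 in Table 2 up to Monte Carlo error.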

3 The CRPS as a random variable

3.1 The random CRPS and its properties

Section 2 pointed out the difficulty of summarizing forecast performance for meaningful comparisons for extreme observations. We illustrated in particular that a single number such as the mean of the CRPS, or its weighted counterpart, fails to deliver relevant comparisons. As an alternative, we propose to study the distribution of the CRPS when treated as a random variable, see also Ferro (2017); Bessac and Naveau (2021).

For simplicity, we use the setting and corresponding notations of the benchmark presented in Section 2.3. From equations (12) and (6), the climatological and ideal scores can be treated as random variables whenever $y_{t}$ is replaced by $Y_{t}$. At this stage, it is important to recall that a forecast is issued with only partial knowledge of the system: the exact value of $\delta_{t}$ and the distribution of $Y_{t}$ are unknown, and only the observation $y_{t}$ is available. Table 3 summarizes the quantities that are available to forecasters. Thus, to evaluate forecast performance, it is only possible to compute $\mathrm{CRPS}(F_{t},y_{t})$ for each $t$. The climatological distribution, which we now denote by $G$ and whose existence needs to be hypothesized in practice, is characterized by the observed sample $(y_{1},\dots,y_{T})$, considered as a sample of independent realizations of the random variable $Y$.

For any set of forecasts $\{F_{t}\}_{t=1,\dots,T}$ and sample $y_{1},\dots,y_{T}$ , two types of sets of random variables can be defined:

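$$\mathcal{S}(F_{T})=\{\mathrm{CRPS}(F_{t},Y_{t})\}_{t=1,\dots,T}\quad\text{and}\quad\mathcal{S}^{*}(F_{T})=\{\mathrm{CRPS}(F_{t},Y_{\pi(t)})\}_{t=1,\dots,T},$$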

where $\pi$ is a random permutation of $\{1,\dots,T\}$. Applying $\pi$ breaks the conditional dependence between $y_{t}$ and $F_{t}$, quantified by $\delta_{t}$ in the benchmark, creating alternative, less informative forecasts. Thus for a given forecaster, represented by the set $F_{T}=\{F_{t}\}_{t=1,\dots,T}$ and a permutation $\pi$, we introduce two random variables ${\cal S}(F_{T})$ and ${\cal S}^{*}(F_{T})$ characterized by their respective empirical cdfs.
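The effect of the permutation can be illustrated on the ideal forecaster of the benchmark. The sketch below assumes the rate parameterization of the exponential and a shape/rate Gamma distribution for $\Delta$; the function names, sample size, and seed are illustrative. It builds the matched scores $\mathrm{CRPS}(F_{t},y_{t})$ and the permuted scores $\mathrm{CRPS}(F_{t},y_{\pi(t)})$ and compares their means.

```python
import math
import random


def crps_exponential(y, rate):
    """Closed-form CRPS of an exponential forecast with the given rate, y >= 0."""
    return y + 2.0 / rate * math.exp(-rate * y) - 1.5 / rate


def matched_and_permuted_scores(gamma=0.25, n=50_000, seed=1):
    rng = random.Random(seed)
    # Hidden states: Gamma with shape 1/gamma and rate 1/gamma (scale gamma).
    deltas = [rng.gammavariate(1.0 / gamma, gamma) for _ in range(n)]
    # Observations: y_t drawn from an exponential with rate delta_t.
    ys = [rng.expovariate(d) for d in deltas]
    # Random permutation pi of the time indices.
    perm = list(range(n))
    rng.shuffle(perm)
    matched = [crps_exponential(ys[t], deltas[t]) for t in range(n)]
    permuted = [crps_exponential(ys[perm[t]], deltas[t]) for t in range(n)]
    return matched, permuted


matched, permuted = matched_and_permuted_scores()
mean_matched = sum(matched) / len(matched)
mean_permuted = sum(permuted) / len(permuted)
```

Because the permutation discards the information carried by $\delta_{t}$, the permuted scores are larger on average than the matched ones; comparing the two empirical distributions, rather than only their means, is what the qq-plot diagnostic does.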

The climatological forecaster is the only forecaster satisfying

$$\mathrm{CRPS}(G,Y)\overset{d}{=}\mathcal{S}^{*}(G)\overset{d}{=}\mathcal{S}(G),\tag{8}$$

as by definition it discards any information about the system conditioning. The first equality in (8) is a direct consequence of auto-calibration, see Appendix A; the second equality follows from the permutation invariance of the data from the point of view of the climatological forecaster.

The distributional properties of ${\cal S}(F_{T})$ , ${\cal S^{*}}(F_{T})$ , and ${\cal S}(G)$ give relevant insights on the behavior of the forecaster. For illustration, Figure 1 gives qq-plots of the distributions of ${\cal S^{*}}(F_{T})$ against ${\cal S}(F_{T})$ for each forecast of the benchmark with $\gamma=1/4$ .

We observe that the ideal, $\lambda$-informed and extremist forecasts deviate from the diagonal, illustrating the loss of information caused by the permutation: such a visual diagnostic summarizes how ${\cal S}(F_{T})$ and ${\cal S^{*}}(F_{T})$ capture relevant information from the conditioning, modelled here by the random variable $\Delta$. The right panel of Figure 1 displays these distributions on the probability scale and highlights how the discrepancy of the $\lambda$-informed forecaster evolves with the parameter $\lambda$. Extremist forecasts, with multiple values of the scale parameter $\nu$, are displayed here for the sole purpose of illustrating how such visual diagnostics behave when calibration is not satisfied. In Figure 1, we can also see that forecast dominance among forecasters could be inferred, as in Ehm et al. (2016, Fig. 1, 2, 4, 6) for point forecasts. Under calibration, the discrepancy between distributions can be interpreted as a direct measure of the forecaster's skill (the $\lambda$-informed curves never cross each other), making such a diagnostic particularly relevant and compliant with the recommendations on extremal dependence indices established by Ferro and Stephenson (2011).

3.2 Tail properties of the random CRPS

We now study the upper tail behavior of the random CRPS, using EVT to develop a meaningful forecast evaluation for extreme events. To lighten the technicality of this section, all proofs are relegated to Appendix E. In terms of notation, for any conditional model that depends on $\Delta=\delta$, we want to emphasize the difference between a conditional forecast, say $F_{\delta}$, and an unconditional forecast $F$. Note that $\delta$ depends on the time index $t$, but for notational simplicity, we drop this index; the distribution of $\Delta$ might also change over time but is here assumed invariant.

Let $X$ and $Y$ be two random variables with absolutely continuous cdfs $F$ and $G$ with common upper bound $x_{F}=x_{G}$ . Suppose that there exists $\gamma<1$ such that $G\in\mathcal{D}(H_{\gamma})$ and that $c_{F}=2\mathbb{E}_{F}(XF(X))$ is finite. Then conditionally on $\Delta=\delta$ , one has

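$$P\left(\frac{\mathrm{CRPS}(F_{\delta},Y_{\delta})+c_{F_{\delta}}-u_{\delta}}{b_{\delta}(u_{\delta})}>x\,\middle|\,Y_{\delta}>u_{\delta}\right)\longrightarrow(1+\gamma_{\delta}x)^{-1/\gamma_{\delta}},\tag{9}$$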

as $u_{\delta}$ tends to $x_{G_{\delta}}$ , with $1+\gamma_{\delta}x>0$ . So at any fixed state $\delta$ (state of the atmosphere for a weather forecast, say), the CRPS upper tail behavior (conditionally on $\Delta=\delta$ ) is equivalent to the observation tail behavior and formalizes what could be intuited from (12).
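A short heuristic for this equivalence can be written down directly. It is a sketch, assuming that $F$ is continuous with a finite mean and using the unweighted case $W(x)=x$ of the alternative CRPS representation, in which the constant $c_{F}=2\mathbb{E}_{F}(XF(X))$ appears:

```latex
\begin{align*}
\mathrm{CRPS}(F,y)
  &= y + 2\,\mathbb{E}_{F}\big[(X-y)\,\mathbb{1}\{X>y\}\big] - c_{F},
  \qquad c_{F}=2\,\mathbb{E}_{F}\big(XF(X)\big),\\
0 \;\le\; \mathrm{CRPS}(F,y) + c_{F} - y
  &= 2\,\mathbb{E}_{F}\big[(X-y)\,\mathbb{1}\{X>y\}\big]
  \;\xrightarrow[\,y\to x_{F}\,]{}\; 0 .
\end{align*}
```

The shifted score $\mathrm{CRPS}(F,y)+c_{F}$ therefore behaves like the observation $y$ itself for large $y$, so its exceedances inherit the tail of $Y$. For an exponential forecast with rate $\delta/\nu$, the vanishing term is $\frac{2\nu}{\delta}e^{-\delta y/\nu}$ and $c_{F}=\frac{3\nu}{2\delta}$, which matches the closed form of the extremist forecast.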

Now, unconditionally, one can also get a result for the climatological forecast, thanks to its property of invariance under permutation (see Section 3.1). If there exists $\gamma<1$ such that $G\in\mathcal{D}(H_{\gamma})$ , then

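$$P\left\{\frac{\mathrm{CRPS}(G,Y)+c_{G}-u}{b(u)}>x\,\middle|\,Y>u\right\}\longrightarrow(1+\gamma x)^{-1/\gamma},\quad u\to x_{G},\tag{10}$$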

for any $x$ such that $1+\gamma x>0$ . In the case where $\gamma>0$ , convergence in Equation (10) also holds for $c_{G}=0$ as the latter vanishes due to the linear behavior of the auxiliary function $b$ in Equation (3), e.g., see Embrechts et al. (1997).

The benchmark presented in Table 1 illustrates these results. The choice of working with a time indexed couple $(F_{t},Y_{t})$ or with an invariant $(G,Y)$ impacts significantly the tail behavior of the CRPS random variables: according to Table 1, the former case implies that the limit in (9) exhibits an exponential tail, whereas the climatological tail given by (10) is heavy, i.e., $\gamma>0$ .

3.3 Assessing the forecaster tail behavior

In this section, we propose a tail-equivalent forecast performance index inspired by equations (9) and (10) and by Figure 1. We aim only to provide the intuition behind the index and leave formal theoretical analysis for future work. We assume that the forecasts lie in the domain of attraction of some distribution $H_{\gamma,\sigma}$. For sufficiently large $u$, the null hypothesis $H_{0}:\mathcal{S}(F_{T})\mid Y>u\overset{d}{=}H_{\gamma,\sigma_{u}}$ should be rejected for any calibrated forecast with tail behaviour closer to the ideal forecast than the climatological reference.

To go further, assume that the variables in $\mathcal{S}(F_{T})$ are iid. This assumption may not always be satisfied, as for instance temperature measurements on two consecutive days are likely to be dependent, but it can reasonably be satisfied for measurements taken sufficiently far apart. For each forecast, we can compute a Cramér-von Mises criterion

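$$\omega_{u}^{2}\{\mathcal{S}(F_{T})\}=\int_{-\infty}^{+\infty}\left[\hat{K}^{(m)}_{\mathcal{S},u}(v)-H_{\gamma,\sigma_{u}}(v)\right]^{2}dH_{\gamma,\sigma_{u}}(v),$$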

where $\hat{K}^{(m)}_{\mathcal{S},u}$ is the empirical distribution of the observations in $\mathcal{S}(F_{T})$ exceeding the threshold $u$. The empirical nature of $\hat{K}^{(m)}_{\mathcal{S},u}$ allows us to simplify $\omega_{u}^{2}\{\mathcal{S}(F_{T})\}$ to

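$$\Omega_{u}^{F}=m\times\omega_{u}^{2}\{\mathcal{S}(F_{T})\}=\frac{1}{12m}+\sum_{i=1}^{m}\left[\frac{2i-1}{2m}-H_{\gamma,\sigma_{u}}(s_{i})\right]^{2},$$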

where $m$ denotes the number of observations exceeding $u$ and $s_{1},\dots,s_{m}$ are the corresponding ordered values of $\mathcal{S}(F_{T})$. A detailed algorithm for the computation of $\Omega^{F}_{u}$ is provided in Table 4 of Appendix F.

As suggested by Figure 1, we assume that $\Omega^{F}_{u}>\Omega^{G}_{u}$ for any calibrated forecast $F$ and the climatology $G$. Also, for two calibrated forecasts $F^{1}$ and $F^{2}$, we conjecture that $\Omega^{F^{2}}_{u}\geq\Omega^{F^{1}}_{u}$ if $F^{2}$ has a tail behaviour closer to the ideal forecast than $F^{1}$. Under these assumptions, we can summarize the comparison between $\Omega^{F}_{u}$ and $\Omega^{G}_{u}$ through

[TABLE]

The behaviour of the index $T_{u}$ is illustrated with the help of model GE; Figure 2 displays the evolution of $T_{u}$ as a function of the threshold $u$ for $T=10^{6}$ and $\gamma=1/4$. The behaviour of the index is consistent with our conjecture: the ideal forecast performs best among calibrated forecasts, while the climatology has the lowest index. The performance ranking among calibrated forecasters is stable as the threshold increases, with the ideal forecast always obtaining the largest index. The extremist forecasters, displayed here to illustrate the behaviour of the index for non-calibrated forecasts, obtain a high index, even larger than that of the ideal forecast. This stresses the importance of calibration, which must be carefully assessed before any interpretation of $T_{u}$.

In practice, a threshold choice has to be made, for which numerous methodologies have been developed, see, e.g., Beirlant et al. (2004); Papastathopoulos and Tawn (2013); Naveau et al. (2016).

4 Discussion

In this work, we have argued with the help of a carefully designed benchmark that the mean of the CRPS, like the means of its weighted counterparts, is unable to discriminate between forecast upper tail regimes, as demonstrated by Brehmer and Strokorb (2019). Ehm et al. (2016) have introduced the so-called “Murphy diagrams” for assessing dominance in point forecasts. This original approach makes it possible to assess dominance among different forecasts and anticipate their skill area; a similar visual diagnostic is presented in Figure 1 for calibrated forecasts.

Inspired by Friederichs and Thorarinsdottir (2012), we apply EVT directly to common verification measures. By considering the CRPS as a random variable (see also Bessac and Naveau (2021) for non-extreme cases), one can view this contribution as a first step in considering functionals of the score distributions other than their means. The new index introduced in Section 3.3 can be considered as a probabilistic alternative to the scores introduced by Ferro (2007) and Ferro and Stephenson (2011). We make a link between the paradigm of maximizing the sharpness subject to calibration from Gneiting et al. (2007) and the paradigm of maximizing the information for extreme events subject to calibration. In the same vein, Murphy (1993) has presented the differences between forecast quality (accordance between forecasts and observations) and forecast value (ability to bring information to realize a benefit by choosing a forecast); forecast value seems to be the most important for extreme events, where decision making is crucial. For deterministic weather forecasts, such tools are well known; see, e.g., Richardson (2000); Zhu et al. (2002). Other widely used scores based on the dependence between forecasts and observed events have been considered in Stephenson et al. (2008); Ferro and Stephenson (2011).

It would be worthwhile to further study the theoretical properties of this CRPS-based tool. Another potentially interesting investigation could be to extend this procedure to other scores like the mean absolute difference, the Dawid-Sebastiani score (Dawid and Sebastiani, 1999) or the ignorance score (Smith et al., 2015; Diks et al., 2011). Classical verification tools rely on a verification period; as a consequence, evaluation is always done a posteriori. Thus, an interesting way to pursue this work would be to consider sequential evaluation of rare events, in the spirit of the e-values (Vovk and Wang, 2021) introduced to assess and monitor calibration continuously (Arnold et al., 2021). Finally, we invite scientists to work on a new theory of scoring rules that departs from score averages.

Acknowledgments

Part of this work was supported by the French National Research Agency (ANR) project T-REX (ANR-20-CE40-0025) and by Energy oriented Centre of Excellence-II (EoCoE-II), Grant Agreement 824158, funded within the Horizon2020 framework of the European Union. Part of this work was also supported by the ExtremesLearning grant from 80 PRIME CNRS-INSU and the ANR project Melody (ANR-19-CE46-0011). This work was partially supported by the ANR LABEX MILYON (ANR-10-LABX-0070) of Université de Lyon, within the program "Investissements d’Avenir" (ANR-11-IDEX-0007).

Implementation details

The implementation of the index relies on the extremeIndex package (Taillardat, 2021a). The R code generating simulation data and Figures is available upon request.

Appendix A Prediction framework and calibration

The theoretical framework considered in this paper is the now classical prediction space already introduced by Murphy and Winkler (1987); Gneiting and Ranjan (2013); Ehm et al. (2016), and generalized in a serial context by Strähl and Ziegel (2017). It starts formally with a probability space $(\Omega,{\mathcal{A}},{\mathbb{Q}})$ and a collection of sub-$\sigma$-algebras $\mathcal{A}_{1},\dots,\mathcal{A}_{k}\subset{\mathcal{A}}$, where $\mathcal{A}_{i}$ represents the information available to forecaster $i$. In a meteorological context, it can be seen as the representation of the atmosphere made by each forecaster. In the benchmark considered in Section 2.3, we assume for simplicity that the information set is generated by a random variable $\Delta$.

A real-valued outcome $Y$ is observed and seen as a (real-valued) random variable. A probabilistic forecast $i$ for $Y$ is identified with its so-called “predictive distribution” with cdf $F_{i}$. Rigorously speaking, $F_{i}:\Omega\times{\mathcal{B}}({\mathbb{R}})\to[0,1]$ is a kernel from $(\Omega,{\mathcal{A}}_{i})$ to $({\mathbb{R}},{\mathcal{B}}({\mathbb{R}}))$, meaning that for each fixed $\omega\in\Omega$, $F_{i}(\omega,\cdot)$ is a probability measure, and for each fixed $x\in{\mathbb{R}}$, $F_{i}(\cdot,(-\infty,x])$ is ${\mathcal{A}}_{i}$-measurable (see, e.g., Kallenberg, 2017). As done by previous authors, we will identify the kernels with random cdfs; see, e.g., Strähl and Ziegel (2017) for more details. For each $x\in{\mathbb{R}}$, we might in particular use the notation $F_{i}(x)$ for the random element $\omega\mapsto F_{i}(\omega,(-\infty,x])$.

In such a framework, a forecast $F_{i}$ is termed ideal with respect to ${\mathcal{A}}_{i}$ if $F_{i}={\mathcal{L}}(Y|{\mathcal{A}}_{i})$ almost surely. Tsyplakov (2011) also refers to this property by saying that $F_{i}$ is calibrated with respect to ${\mathcal{A}}_{i}$. He additionally defines auto-calibration as the property that $F_{i}$ satisfies $F_{i}={\mathcal{L}}(Y|\sigma(F_{i}))$ almost surely. Here, $\sigma(F_{i})$ denotes the $\sigma$-algebra generated by $F_{i}$, that is to say the smallest $\sigma$-algebra such that $\omega\mapsto F_{i}(\omega,x)$ is measurable for all $x\in{\mathbb{R}}$. Note that if a forecast is calibrated with respect to ${\mathcal{A}}_{i}$, then it is auto-calibrated, but the converse does not hold in general. As a particular case considered in Section 2.3, the climatological forecaster is ideal with respect to the trivial $\sigma$-algebra.

In practice, one is not only concerned with predictions for an outcome $Y$ at a single time point. The framework introduced above also accommodates independent replicates at times $t=1,2,\dots$, as is done in Section 2.3. Although such an independence assumption is unrealistic in several situations, as argued by Strähl and Ziegel (2017), it provides a first step and benefits from a lighter context. We therefore chose to keep it in this paper for simplicity.
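The distinction between an ideal forecaster and the climatological one can be made concrete on a toy example. The sketch below assumes a Gaussian model that is not the benchmark of Section 2.3: $\Delta\sim N(0,1)$ and $Y|\Delta\sim N(\Delta,1)$, so that ${\mathcal{L}}(Y)=N(0,2)$. Both forecasters are auto-calibrated, and a standard consequence for continuous predictive distributions is that their probability integral transforms $F(Y)$ are uniform on $(0,1)$; the code checks the first two moments only. The names `normal_cdf`, `simulate_pit` and `mean_and_variance` are hypothetical.

```python
import math
import random


def normal_cdf(x, mean, sd):
    """CDF of N(mean, sd^2) via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def simulate_pit(n, seed=0):
    """Return the values F(Y) for an ideal and a climatological forecaster.

    Toy model (an assumption, not the paper's benchmark):
    Delta ~ N(0, 1) and Y | Delta ~ N(Delta, 1), hence Y ~ N(0, 2).
    """
    rng = random.Random(seed)
    pit_ideal, pit_clim = [], []
    for _ in range(n):
        delta = rng.gauss(0.0, 1.0)
        y = rng.gauss(delta, 1.0)
        pit_ideal.append(normal_cdf(y, delta, 1.0))          # F = L(Y | Delta)
        pit_clim.append(normal_cdf(y, 0.0, math.sqrt(2.0)))  # F = L(Y)
    return pit_ideal, pit_clim


def mean_and_variance(values):
    """Sample mean and (biased) sample variance."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / len(values)
    return m, v


pit_ideal, pit_clim = simulate_pit(100_000, seed=1)
print(mean_and_variance(pit_ideal))  # compare with (1/2, 1/12)
print(mean_and_variance(pit_clim))   # compare with (1/2, 1/12)
```

Both forecasters pass this check although they use very different information sets, which is why calibration alone cannot rank them.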

Appendix B An alternative expression of the weighted CRPS

The weighted CRPS defined by (1) can be reformulated in the following way, as soon as the weight function $w(.)$ is continuous,

[TABLE]

Assume that the weight function $w(.)$ is continuous. By integrating by parts $\int_{-\infty}^{y}F^{2}(x)w(x)\,dx$ and $\int_{y}^{\infty}\overline{F}^{2}(x)w(x)\,dx$ and using $W(x)=\int_{-\infty}^{x}w(z)dz$ , the weighted CRPS defined by (1) can be rewritten as

[TABLE]

The equality $|a-b|=2\max(a,b)-(a+b)$ gives

[TABLE]

and

[TABLE]

where the last line follows from the fact that $F_{W(X)}(W(X))$ and $F(X)$ have the same distribution, which is uniform on $(0,1)$. As $W(x)$ is non-decreasing, one has $\{W(X)>W(y)\}=\{X>y\}$, and it follows that

[TABLE]

as announced in (12).
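The role played by the identity $|a-b|=2\max(a,b)-(a+b)$ can be checked numerically in the unweighted special case $w\equiv 1$, $W(x)=x$, where the CRPS admits the classical kernel representation $\mathbb{E}|X-y|-\frac{1}{2}\mathbb{E}|X-X^{\prime}|$ with $X,X^{\prime}$ independent draws from $F$. The following Python sketch, written for an empirical (ensemble) forecast so that every quantity is exact, compares this representation, its rewriting in terms of maxima, and the defining integral $\int\{F(x)-{\rm 1}_{x\geq y}\}^{2}dx$. Function names are hypothetical and the weighted case is not covered.

```python
import random


def crps_kernel(ensemble, y):
    """E|X - y| - 0.5 E|X - X'| with X, X' drawn independently from the ensemble."""
    n = len(ensemble)
    first = sum(abs(x - y) for x in ensemble) / n
    second = sum(abs(a - b) for a in ensemble for b in ensemble) / (n * n)
    return first - 0.5 * second


def crps_max_form(ensemble, y):
    """Same quantity after applying |a - b| = 2 max(a, b) - (a + b) to both terms."""
    n = len(ensemble)
    mean_max_xy = sum(max(x, y) for x in ensemble) / n
    mean_max_xx = sum(max(a, b) for a in ensemble for b in ensemble) / (n * n)
    # E|X - y| = 2 E max(X, y) - E X - y and E|X - X'| = 2 E max(X, X') - 2 E X,
    # so the E X terms cancel in the difference.
    return 2.0 * mean_max_xy - y - mean_max_xx


def crps_integral(ensemble, y):
    """Exact integral of (F_n(x) - 1{x >= y})^2 dx for the empirical cdf F_n."""
    n = len(ensemble)
    knots = sorted(list(ensemble) + [y])
    total = 0.0
    for left, right in zip(knots[:-1], knots[1:]):
        cdf = sum(1 for x in ensemble if x <= left) / n  # F_n is constant on [left, right)
        step = 1.0 if left >= y else 0.0                 # indicator is constant there too
        total += (cdf - step) ** 2 * (right - left)
    return total


rng = random.Random(0)
ensemble = [rng.gauss(0.0, 1.0) for _ in range(40)]
for y in (-1.5, 0.3, 2.0):
    print(y, crps_kernel(ensemble, y), crps_max_form(ensemble, y), crps_integral(ensemble, y))
```

The three columns printed for each $y$ agree up to floating-point rounding.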

Appendix C Proof of the inequality (4)

Let $u$ be a positive real number. Denote by $Z$ a non-negative random variable with finite mean and cdf $H$. Assume that $Z$ and $Y$ are independent and have the same right endpoint. We introduce the new random variable

[TABLE]

with survival function $\overline{F_{u}}$ defined by

[TABLE]

Note that, since $\overline{F_{u}}$ is decreasing, we have in particular that for all $x$,

[TABLE]

Besides, Equation (16) and the monotonicity of $W$ allow us to write that, for any $x\leq u$,

[TABLE]

Equality (12) implies that

[TABLE]

where

[TABLE]

The stochastic ordering that holds between $X_{u}$ and $Y$ implies that the quantity $\mathbb{E}_{F_{u}}[W(X_{u})\overline{F_{u}}(X_{u})]-\mathbb{E}_{G}[W(Y)\overline{G}(Y)]$ is negative. Combined with (18), this leads to

[TABLE]

For $x>u$ we can write that

[TABLE]

since $W(Y)-W(x)\leq 0$ in the first expectation, whereas $0\leq W(x)-W(X_{u})\leq W(x)-W(u)$ in the second one. As a consequence, one gets

[TABLE]

This last expression combined with (19) leads finally to

[TABLE]

Note that this inequality is true for any $u$ and $H$, and its right-hand side does not depend on $\overline{H}(x)$. Thus, the tail behavior of the random variables $Y$ and $Z$ can be completely different, although the CRPS of $F_{u}$ and $G$ can be as close as one wishes. The right-hand side goes to $0$ as $u$ increases, due to the finite mean of $W(Y)$.

Appendix D A detailed example related to Section 2.2

In this appendix, we illustrate the fact that the CRPS fails to discriminate between forecasts with different tails. We consider GP distributed forecasts and observations. In this case, closed-form expressions of the CRPS are available, as detailed in the following.

Lemma 1.

Consider $X\stackrel{{\scriptstyle d}}{{=}}\mbox{\rm GP}(\beta,\xi)$ and $Y\stackrel{{\scriptstyle d}}{{=}}\mbox{\rm GP}(\sigma,\gamma)$ with $0\leq\xi<1$ and $0\leq\gamma<1$ , with respective survival functions $\overline{F}(x)=(1+\xi x/\beta)^{-1/\xi}$ (for $x>-\beta/\xi$ ) and $\overline{G}(x)=(1+\gamma x/\sigma)^{-1/\gamma}$ (for $x>-\sigma/\gamma$ ). If $\gamma/\sigma=\xi/\beta$ , with $\gamma\neq 0$ , then

[TABLE]

This gives the minimum CRPS value for $\xi=\gamma$ and $\sigma=\beta$ ,

[TABLE]

Proof: Applying (12) with $W(y)=y$, and making use of classical properties of the Pareto distribution (see, e.g., Embrechts et al., 1997, Theorem 3.4.13), one gets

[TABLE]

It follows that

[TABLE]

with

[TABLE]

Since

[TABLE]

one can write

[TABLE]

Besides, as $G^{-1}(v)=\frac{\sigma}{\gamma}\left(\left(1-v\right)^{-\gamma}-1\right)$ , one can thus rewrite, denoting by $U$ a random variable uniformly distributed on $(0,1)$ ,

[TABLE]

If $\displaystyle c=\frac{\xi\sigma}{\beta\gamma}=1$ , then this simplifies to

[TABLE]

In particular, $m_{0}=\frac{1}{\gamma}B(1,1/\xi+1/\gamma)=\left(1+\frac{\gamma}{\xi}\right)^{-1}$ and

[TABLE]

It follows that, if $\frac{\gamma}{\sigma}=\frac{\xi}{\beta}$ , then we have

[TABLE]

This gives the minimum CRPS value for $\xi=\gamma$ and $\sigma=\beta$ ,

[TABLE]

concluding the proof of Lemma 1. $\square$

Lemma 1 allows us to study the effect of changing the forecast’s tail behavior, captured by $\xi$, and the forecast spread, encapsulated in $\beta$, when $F$ and $G$ have proportional parameters, i.e., $\beta=a\sigma$ and $\xi=a\gamma$ for some $a>0$. In this case, the CRPS simplifies to

[TABLE]

leading, when $a>1$, to a forecaster with a heavier tail, which overestimates the true upper tail behavior, and to the opposite when $a<1$.

Counterexamples such as the previous one can thus be found, illustrating how weighted scoring rules fail to compare tail behaviors. They should therefore be handled with particular care, especially by forecast makers, as already advocated by Gilleland et al. (2018); Lerch et al. (2017).
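The effect of the proportionality factor $a$ can also be explored numerically without any closed form. The Python sketch below relies on the standard decomposition $\mathbb{E}_{G}[\mathrm{CRPS}(F,Y)]=\int\{F(x)-G(x)\}^{2}dx+\int G(x)\overline{G}(x)\,dx$ and evaluates both integrals with a trapezoid rule on a log-spaced grid. The function names, the grid, and the values $\sigma=1$, $\gamma=1/4$ are illustrative assumptions. For $a=1$, direct integration of $\overline{G}-\overline{G}^{2}$ gives $\sigma/\{(1-\gamma)(2-\gamma)\}$, which serves as a sanity check of the quadrature.

```python
import math


def gp_survival(x, scale, shape):
    """Survival function of GP(scale, shape) for x >= 0 and shape > 0."""
    return (1.0 + shape * x / scale) ** (-1.0 / shape)


def expected_crps(beta, xi, sigma, gamma, x_max=1.0e4, n_grid=100_000):
    """Approximate E_G[CRPS(F, Y)] for F = GP(beta, xi) and G = GP(sigma, gamma).

    Uses E_G[CRPS(F, Y)] = int (F - G)^2 dx + int G (1 - G) dx, truncated to
    [0, x_max] and evaluated by the trapezoid rule on a log-spaced grid.
    """
    lo, hi = -6.0, math.log10(x_max)
    grid = [0.0] + [10.0 ** (lo + (hi - lo) * k / n_grid) for k in range(n_grid + 1)]

    def integrand(x):
        f_bar = gp_survival(x, beta, xi)
        g_bar = gp_survival(x, sigma, gamma)
        return (f_bar - g_bar) ** 2 + g_bar * (1.0 - g_bar)

    values = [integrand(x) for x in grid]
    return sum(0.5 * (values[i] + values[i + 1]) * (grid[i + 1] - grid[i])
               for i in range(len(grid) - 1))


sigma, gamma = 1.0, 0.25
for a in (0.5, 1.0, 2.0):
    # Forecast parameters proportional to the truth: beta = a sigma, xi = a gamma.
    print(a, expected_crps(a * sigma, a * gamma, sigma, gamma))
print("a = 1 reference:", sigma / ((1.0 - gamma) * (2.0 - gamma)))
```

Comparing the printed values for $a\neq 1$ with the one for $a=1$ shows how much, or how little, the expected CRPS penalizes a misspecified tail index in this setting.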

Appendix E Proof of the convergences (9) and (10)

The proof of (10) can be seen as a particular case of (9), so we focus on proving (9). The following lemma will help to get the result, and is presented first with its proof. In what follows, the mean excess function of any random variable $Z$ with finite mean and with cdf $F$ will be denoted by $M(F,z)$, so that $\overline{F}(z)M(F,z)=\mathbb{E}_{F}[(Z-z){\rm{1}\!l}_{Z>z}].$

Lemma: Consider a random variable $Z$ with finite mean that belongs to the domain of attraction $\mathcal{D}(H_{\gamma})$ with $\gamma<1$. There exist non-negative real numbers $\alpha$ and $\beta$ such that for each $z\in{\mathbb{R}}$,

[TABLE]

Proof of the lemma: The indicator function ${\rm{1}\!l}_{Z>z}$ implies that we always have $0\leq 2\mathbb{E}_{F}((Z-z){\rm{1}\!l}_{Z>z})$. To prove that $2\mathbb{E}_{F}((Z-z){\rm{1}\!l}_{Z>z})$ is smaller than $\overline{F}(z)(\alpha z+\beta)$, we first show that this inequality holds for large values of $z$. Note first that if $z>x_{F}$, then (22) is trivially true. Let us then show the result when $z\uparrow x_{F}$; to this end, we split the proof according to the sign of $\gamma$:

1. $F$ belongs to $\mathcal{D}(H_{\gamma})$ with $0<\gamma<1$: In this case, Embrechts et al. (1997) (Section 3.4) show that $M(F,z)\sim\gamma z/(1-\gamma)$ as $z$ tends to $x_{F}$, and we can conclude directly.

2. $F$ belongs to $\mathcal{D}(H_{\gamma})$ with $\gamma<0$: In this case, the result also follows easily from Embrechts et al. (1997) since, when $z$ tends to $x_{F}$, $M(F,z)\sim\gamma(x_{F}-z)/(\gamma-1).$ This allows us to fix $\alpha=0$ and $\beta=\sup_{z\in V(x_{F})}\gamma(x_{F}-z)/(\gamma-1)$ for an appropriate neighborhood $V(x_{F})$ of $x_{F}$.

3. $F$ belongs to $\mathcal{D}(H_{0})$: When $F$ is in the Gumbel domain of attraction, $M(F,z)/z\rightarrow 0$ as $z$ tends to $x_{F}$ (see, e.g., Theorem 3.9 in Ghosh and Resnick (2010)). If $x_{F}$ is finite, then there exists a positive $\beta$ such that $2M(F,z)\leq\beta$ and $\alpha$ can be fixed to 0, whereas if $x_{F}$ is infinite, the fact that $2M(F,z)<z$ for $z$ large enough allows us to conclude.

So far, we have shown that, for some large $z_{0}$ , there exist non negative $\alpha$ and $\beta$ such that

[TABLE]

We still need to prove that this statement also holds for $z\leq z_{0}$ . Define

[TABLE]

As $\gamma<1$ , $\beta_{0}$ is finite and, as $\overline{F}(z)\geq\overline{F}(z_{0})$ for all $z\leq z_{0}$ , we have

[TABLE]

We now have two cases: either $\beta<\frac{\beta_{0}}{\overline{F}(z_{0})}$ or $\beta\geq\frac{\beta_{0}}{\overline{F}(z_{0})}$. In the latter case, we have $2\mathbb{E}_{F}((Z-z){\rm{1}\!l}_{Z>z})\leq\beta_{0}\leq\overline{F}(z)(\alpha z+\beta)$, and so the required result is obtained. In the case of $\beta<\frac{\beta_{0}}{\overline{F}(z_{0})}$, it is always possible to increase the value of $\beta$ chosen for $z>z_{0}$, and bring it above $\frac{\beta_{0}}{\overline{F}(z_{0})}$. ∎
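The linear control of the mean excess function used in the lemma is explicit in the GP case: for $Z\stackrel{d}{=}\mbox{GP}(\sigma,\gamma)$ with $0<\gamma<1$ and $z\geq 0$, one has $M(F,z)=(\sigma+\gamma z)/(1-\gamma)$, so that the bound holds on $z\geq 0$ with equality for $\alpha=2\gamma/(1-\gamma)$ and $\beta=2\sigma/(1-\gamma)$. The short Monte Carlo sketch below compares the empirical mean excess of simulated GP values with this expression. It is only an illustration: the function names, sample size and parameter values are assumptions, and sampling uses the quantile function $G^{-1}$ recalled in Appendix D.

```python
import random


def gp_quantile(v, sigma, gamma):
    """Quantile function sigma / gamma * ((1 - v)^(-gamma) - 1), for gamma != 0."""
    return sigma / gamma * ((1.0 - v) ** (-gamma) - 1.0)


def empirical_mean_excess(sample, z):
    """Average of (Z - z) over the observations exceeding z."""
    excesses = [x - z for x in sample if x > z]
    return sum(excesses) / len(excesses)


def gp_mean_excess(z, sigma, gamma):
    """Exact mean excess of GP(sigma, gamma) for gamma < 1 and z >= 0."""
    return (sigma + gamma * z) / (1.0 - gamma)


rng = random.Random(0)
sigma, gamma = 1.0, 0.25
sample = [gp_quantile(rng.random(), sigma, gamma) for _ in range(400_000)]
for z in (0.0, 2.0, 5.0):
    # Empirical value first, exact linear expression second.
    print(z, empirical_mean_excess(sample, z), gp_mean_excess(z, sigma, gamma))
```

The empirical values grow roughly linearly in $z$ with slope $\gamma/(1-\gamma)$, which is the behaviour exploited in case 1 of the proof.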

We are now ready to prove (9) as announced.

Proof of (9): Given the conditional forecast $F_{\delta}$, the CRPS can be computed with respect to the conditional observation $y_{\delta}$ in the following way

[TABLE]

where $c_{\delta}=2\mathbb{E}_{F_{\delta}}\left[X_{\delta}F_{\delta}(X_{\delta})\right]$. To simplify notation, we drop the subscript $\delta$ in the rest of the proof and reintroduce it at the end. The previous lemma allows us to write

[TABLE]

Let us now work conditionally on $Y>u$, for a large $u$ close to $x_{F}=x_{Y}$. We then get

[TABLE]

This holds when the right endpoint of $Y$ is non-negative. If this were not the case, note that one can simply write $Y\leq\mathrm{CRPS}(F,Y)+c\leq Y+\beta\overline{F}(u)\;\;\;a.s.$

The main idea of the proof is to notice that $\overline{F}(u)$ goes to zero as $u$ gets large, and consequently, the above inequalities indicate that the thresholded random variable $Y[u]=[(Y-u)/b(u)\;|\;Y>u]$ and the thresholded CRPS $C[u]=[(\mathrm{CRPS}(F,Y)+c-u)/b(u)\;|\;Y>u]$ should behave similarly for large $u$. The choice of the positive constant $b(u)$ depends on the domain of attraction of $Y$. More precisely, we assume that $Y[u]$ converges in distribution towards a GPD with finite mean, so that

[TABLE]

We recognize the probability (conditionally on $Y>u$ ) for $Y$ to be in an interval denoted by

[TABLE]

The remaining part of the proof consists in showing that this conditional probability tends to 0 as $u\to x_{F}$ . We can write

[TABLE]

where $J_{u}=\displaystyle\left[\frac{tb(u)-\overline{F}(u)(\alpha+\beta)}{1+\alpha\overline{F}(u)},tb(u)\right].$ For $u$ large enough, the latter probability can be approximated by a GPD, so that

[TABLE]

where $g_{GP}$ denotes the probability density function associated to the GPD. This implies the convergence to 0 of the latter probability. Since this is true conditionally on $\Delta=\delta$ , it can be rewritten, after reintroduction of the subscript $\delta$ , as

[TABLE]

as $u$ tends to $x_{G_{\delta}}$ , with $1+\gamma_{\delta}x>0$ . ∎

Appendix F Algorithm for the computation of the Cramér-von Mises criterion

Note that for large $u$ , under the null hypothesis, the statistic ${\Omega^{F}_{u}}$ follows a Cramér-von Mises distribution. The associated $p$ -values $p^{F}_{u}\in[0,1]$ could have been computed, but they are actually subject to numerical instabilities (Prokhorov, 1968; Csörgő and Faraway, 1996). Furthermore, ${\Omega^{F}_{u}}$ is sufficient to compare the effect size of the deviation.

Bibliography

Arnold, S., Henzi, A., Ziegel, J. F., 2021. Sequentially valid tests for forecast calibration. arXiv preprint arXiv:2109.11761.
Beirlant, J., Goegebeur, Y., Segers, J., Teugels, J., Waal, D., Ferro, C., 2004. Statistics of extremes: Theory and applications.
Bessac, J., Naveau, P., 2021. Forecast score distributions with imperfect observations. Advances in Statistical Climatology, Meteorology and Oceanography 7 (2), 53–71.
Brehmer, J. R., Strokorb, K., 2019. Why scoring functions cannot assess tail properties. Electronic Journal of Statistics 13 (2), 4015–4034. URL https://doi.org/10.1214/19-EJS1622
Bröcker, J., 2012. Evaluating raw ensembles with the continuous ranked probability score. Quarterly Journal of the Royal Meteorological Society 138 (667), 1611–1617.
Csörgő, S., Faraway, J. J., 1996. The exact and asymptotic distributions of Cramér-von Mises statistics. Journal of the Royal Statistical Society: Series B (Methodological) 58 (1), 221–234.
Dawid, A. P., 1984. Present position and potential developments: Some personal views: Statistical theory: The prequential approach. Journal of the Royal Statistical Society. Series A (General), 278–292.
Dawid, A. P., Sebastiani, P., 1999. Coherent dispersion criteria for optimal experimental design. Annals of Statistics, 65–81.
