Evaluating probabilistic forecasts of extremes using continuous ranked probability score distributions
Maxime Taillardat, Anne-Laure Fougères, Philippe Naveau, Raphaël de Fondeville

TL;DR
This paper investigates the effectiveness of the continuous ranked probability score (CRPS) in evaluating probabilistic forecasts of extreme events, proposing a new approach based on extreme value theory for better assessment.
Contribution
It introduces a formal framework for evaluating extreme event forecasts and proposes a novel method using extreme value theory to improve assessment accuracy.
Findings
CRPS is not suitable for extreme event verification when assessed by expectation.
A new index based on extreme value theory effectively compares calibrated forecasts for extremes.
The proposed method's strengths and limitations are analyzed through theory and simulations.
Abstract
Verifying probabilistic forecasts for extreme events is a highly active research area because popular media and public opinions are naturally focused on extreme events, and biased conclusions are readily made. In this context, classical verification methods tailored for extreme events, such as thresholded and weighted scoring rules, have undesirable properties that cannot be mitigated, and the well-known continuous ranked probability score (CRPS) is no exception. In this paper, we define a formal framework for assessing the behavior of forecast evaluation procedures with respect to extreme events, which we use to demonstrate that assessment based on the expectation of a proper score is not suitable for extremes. Alternatively, we propose studying the properties of the CRPS as a random variable by using extreme value theory to address extreme event verification. An index is introduced…
| Forecasts Truth | where , |
|---|---|
| Ideal | |
| Climatological | |
| -Informed | |
| Extremist | , |
| Truth | where |
|---|---|
| Forecasts | w.r.t. Ideal |
| Ideal | |
| Extremist | |
| 0.75-Informed | |
| 0.5-Informed | |
| Extremist | |
| 0.25-Informed | |
| Climatological | |
| Extremist |
| Object | Definition | Availability in practice |
|---|---|---|
| | Distribution of the forecast for time | yes |
| | Observed realisation at time | yes |
| | Conditioning variable | no |
| | Conditioning random variable | no |
| | Conditional random variable generating | no |
| | Unconditional random variable of the observations | yes |
| | CRPS of the couple for time | yes |
| | Random variable associated to | no |
| | Random variable generated by the | yes |
| | Random variable generated by the | yes |
| 0. CRPS estimates for each forecaster: | - For the couples forecast/observation, compute their corresponding instantaneous CRPS. |
| 1. Estimation of on the observations: | - Find a threshold where the Pareto approximation is acceptable and estimate the Pareto shape parameter and . |
| 2. For a threshold : | - Compute the scale parameter . |
| 3. Computation of | - Order the CRPS values where the observation in increasing order . |
| For | -Compute for each CRPS value , . |
| -Compute . | |
| End 3. | |
| End 2. |
Extreme events evaluation using CRPS distributions
Maxime Taillardat
Anne-Laure Fougères
Philippe Naveau
Raphaël de Fondeville
CNRM, Université de Toulouse, Météo-France, CNRS, Toulouse, France.
Météo-France, Toulouse, France
Univ. Lyon, Université Claude Bernard Lyon 1, CNRS UMR 5208, Institut Camille Jordan, F-69622 Villeurbanne, France
Laboratoire des Sciences du Climat et de l’Environnement, UMR 8212, CEA-CNRS-UVSQ, IPSL & U Paris-Saclay, Gif-sur-Yvette, France
Swiss Data Science Center, ETH Zürich and EPFL, Switzerland
Abstract
Verification of probabilistic forecasts for extreme events has been a very active field of research, driven by media and public opinion, which naturally focus on extreme events and easily draw biased conclusions. In this context, classical verification methodologies tailored for extreme events, such as thresholded and weighted scoring rules, have undesirable properties that cannot be mitigated; the well-known Continuous Ranked Probability Score (CRPS) is no exception.
In this paper, we define a formal framework to assess the behavior of forecast evaluation procedures with respect to extreme events, which we use to show that assessment based on the expectation of a proper score is not suitable for extremes. As an alternative, we propose to study the properties of the CRPS as a random variable, using extreme value theory to address the verification of extreme events. To compare calibrated forecasts, an index is introduced that summarizes the ability of probabilistic forecasts to predict extremes. Its strengths and limitations are discussed using both theoretical arguments and simulations.
Keywords: CRPS, Extreme events, Probabilistic forecasting, Scoring rules, Calibration, Verification.
Journal: International Journal of Forecasting
1 Introduction
By definition, the rarity of extreme events makes it difficult to issue relevant forecasts, and assessing their performance is an even greater challenge. In particular, the scarcity of extremes means that verification schemes have to be built and understood in a probabilistic sense. The general framework for probabilistic forecast evaluation compares an observation with a probabilistic forecast , represented by its cumulative distribution function (cdf). The framework also assumes that is drawn from a random variable with cdf . To make better use of the forecasts, it is generally convenient, and even recommended (Ferro and Stephenson, 2011), to further assume that the forecast is calibrated (Dawid, 1984; Diebold et al., 1997), i.e., that the predictive distribution resembles the distribution of the observations given the information contained in the forecast. For a formal definition of auto-calibration (simply called calibration in the following), we refer to the works of Tsyplakov (2011) and Strähl and Ziegel (2017), summarized in Appendix A.
Calibrated forecasts are commonly evaluated based on their sharpness, also called refinement by Winkler et al. (1996), which usually refers to their spread. This leads to the paradigm of ‘maximizing sharpness subject to calibration’, introduced by Gneiting et al. (2007) and later formally justified by Tsyplakov (2011).
Probabilistic forecasting has become increasingly popular in recent years in various fields such as economics and finance (Galbraith and Norden, 2012), demography and social science (Raftery and Ševčíková, 2021), health (Henzi et al., 2021), energy (Hong et al., 2016), and hydrology and hydraulics (Tiberi-Wadier et al., 2021). In this work, we focus on probabilistic weather forecasts (Leutbecher and Palmer, 2008). Indeed, probabilistic forecasts are nowadays issued by most National Weather Services (NWS), and the forecast distribution is known through a sample of finite size called an “ensemble” (see, e.g., Zamo and Naveau, 2017). In this context, forecast verification is performed by computing scoring rules such as the Continuous Ranked Probability Score (CRPS) (Epstein, 1969; Hersbach, 2000; Bröcker, 2012)
[TABLE]
where , and and are independent random variables with common cdf . The CRPS is attractive as it does not require predictive densities, is inferred non-parametrically, and has a simple interpretation. The right-hand side of Equation (1) decomposes the CRPS into, in this order, a calibration and a sharpness term (Gneiting and Raftery, 2007). Alternative decompositions are also available; see Taillardat et al. (2016), Bessac and Naveau (2021) and Appendix B.
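For an ensemble forecast, the two-term decomposition described above can be evaluated directly from the members. The following is a minimal sketch, assuming the standard kernel form of the CRPS (expected absolute error minus half the expected absolute difference between two independent draws from the forecast) and treating the ensemble as an equally weighted sample; the function name `crps_ensemble` is illustrative and does not come from the paper.

```python
def crps_ensemble(members, y):
    """Empirical CRPS of an equally weighted ensemble at observation y.

    Assumed form: mean|x_i - y| - 0.5 * mean|x_i - x_j| over all pairs (i, j).
    """
    m = len(members)
    # First term: average distance between the members and the observation.
    calibration_term = sum(abs(x - y) for x in members) / m
    # Second term: average distance between pairs of members (ensemble spread).
    spread_term = sum(abs(a - b) for a in members for b in members) / (m * m)
    return calibration_term - 0.5 * spread_term


# A one-member ensemble reduces to the absolute error of a point forecast.
print(crps_ensemble([3.0], 5.0))
print(crps_ensemble([0.0, 1.0, 2.0], 1.0))
```

The double loop is quadratic in the ensemble size, which is usually acceptable for operational ensembles of a few dozen members.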
For the forecast evaluation of extreme events, proper weighted scoring rules were introduced by Gneiting and Ranjan (2011) and Diks et al. (2011). For a non-negative function , the weighted CRPS
[TABLE]
with , aims to emphasize a region of interest, for instance distributional tails. When is continuous, an alternative expression of the weighted CRPS is available and can be found in Appendix B. The choice of the weight function is complex and depends on the different stakeholders, such as forecast users and forecasters; see, e.g., Ehm et al. (2016); Gneiting and Ranjan (2011); Patton (2014); Smith et al. (2015); Taillardat (2021b). Even in the hypothetical case where could be objectively defined, it is essential that verification be performed on the whole set of observations (Lerch et al., 2017), and one can wonder whether the corresponding weighted CRPS correctly discriminates between two competing forecasts with respect to extreme events.
In this work, we show that the expected weighted CRPS cannot discriminate between forecasts with different extremal tail behaviors, a potentially prohibitive defect for the evaluation of extremes. To address this issue, we view the CRPS as a random variable. Its tail behavior is derived and compared to the tail regime of observations using Extreme Value Theory (EVT) (see, e.g., De Haan and Ferreira, 2007).
This work is organized as follows. Section 2 provides an analysis of the weighted CRPS with respect to the notion of tail equivalence, the backbone of EVT. In particular, we propose a benchmark to compare the tail properties of forecast verification tools, allowing us to pinpoint the shortcomings of the CRPS and its weighted counterpart for scoring extreme events. In Section 3, we study the CRPS as a random variable and establish theoretical links between its tail behavior and the observational tail distribution. These mathematical connections help us to propose and study a new index to assess the skill of calibrated probabilistic forecasts with respect to extreme events. The strengths and pitfalls of this index, and potential future work, are discussed in Section 4.
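To make the role of the weight function concrete, here is a small numerical sketch. It assumes the usual threshold-weighted form from Gneiting and Ranjan (2011), that is, the integral of the weight times the squared difference between the forecast cdf and the step function at the observation; the displayed equation is not recoverable from this text, so this form is an assumption. The helper names `ecdf` and `weighted_crps`, the integration bounds and the grid size are all hypothetical choices.

```python
def ecdf(members, x):
    """Empirical cdf of the ensemble evaluated at x."""
    return sum(1 for m in members if m <= x) / len(members)


def weighted_crps(members, y, weight, lower, upper, n=20000):
    """Midpoint-rule approximation of the integral over [lower, upper] of
    weight(x) * (F(x) - 1{x >= y})**2, with F the ensemble ecdf.

    The bounds must cover the members and the observation.
    """
    h = (upper - lower) / n
    total = 0.0
    for i in range(n):
        x = lower + (i + 0.5) * h
        indicator = 1.0 if x >= y else 0.0
        total += weight(x) * (ecdf(members, x) - indicator) ** 2
    return total * h


members, y = [0.0, 1.0, 2.0], 1.0
# Constant weight: the unweighted CRPS is recovered.
unweighted = weighted_crps(members, y, lambda x: 1.0, -1.0, 3.0)
# Indicator weight: only the region above 1.5 contributes.
upper_tail = weighted_crps(members, y, lambda x: 1.0 if x >= 1.5 else 0.0, -1.0, 3.0)
print(unweighted, upper_tail)
```

With an indicator weight, everything the forecast does below the threshold is ignored, which is why the choice of weight matters so much to the different stakeholders.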
2 Limitations of the (w)CRPS as a proper scoring rule for extremes
2.1 Tail modelling using EVT
Thanks to the pioneering work of Gumbel (1935) and De Haan (1970), EVT provides a theoretically justified framework to model the tail of random variables, more precisely excesses above a large threshold; see, e.g., Embrechts et al. (1997); Beirlant et al. (2004). For any random variable with cdf , EVT models assume the existence of a domain of attraction, i.e., that there exists a positive auxiliary function , such that
[TABLE]
where corresponds to the survival function, also called the tail function, and is the upper endpoint of . Under condition (3), denoted , the Pickands-Balkema-de Haan theorem (De Haan, 1970; Pickands, 1975) establishes that has to belong to the family of generalized Pareto (GP) survival functions, i.e.,
[TABLE]
where . As a consequence, the GP tail appears to be the ideal candidate to approximate the survival function of exceedances over a large threshold , i.e.,
[TABLE]
where and . The GP family covers the three possible regimes of tail decay, which are determined by the value of its tail index : when the decay is polynomial, and has an upper bound when . For , the GP survival function becomes exponential, i.e., .
2.2 Tail equivalence and proper scoring rules
The comparison of the tail behavior of two random variables, or equivalently their respective cdfs and , can be framed using the notion of tail equivalence.
Definition 1 (Embrechts et al., 1997, Section 3.3). Two random variables and with respective cdf and are tail equivalent if they have equal upper endpoint and if their survival functions and satisfy
[TABLE]
Tail equivalence can also be simply expressed as the equality of tail indexes. In terms of extremal forecasts, we expect that, between two forecasters, one should favor the one that is tail equivalent to the observations. In practice, this may be difficult. For instance, consider two GP distributed random variables and with survival functions and with . By construction, the medians of and are both equal to one. Still, their tail behaviors differ widely even for small : the 100-year return level for is 99, while it is equal to 138 for with . In other words, if these random variables were to represent water levels, a small difference of in tail index would imply a difference of meters, which would most likely cause massive and destructive flooding.
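The two return levels quoted above can be checked numerically. The sketch below makes several assumptions that are not visible in this text: a GP survival function of the form (1 + xi * x / sigma)^(-1/xi), a scale chosen so that the median equals one, a 100-year return level read as the quantile of order 1 - 1/100, and tail indexes of 1 and 1.1. These two tail-index values are a guess, picked because they reproduce the quoted levels under this parameterisation.

```python
def gp_quantile(p, xi, sigma):
    """Quantile of order p of a GP with survival (1 + xi*x/sigma)**(-1/xi), xi > 0."""
    return sigma / xi * ((1.0 - p) ** (-xi) - 1.0)


def unit_median_scale(xi):
    """Scale sigma for which the GP median equals one."""
    return xi / (2.0 ** xi - 1.0)


for xi in (1.0, 1.1):  # assumed tail indexes, see the note above
    sigma = unit_median_scale(xi)
    median = gp_quantile(0.5, xi, sigma)
    level = gp_quantile(1.0 - 1.0 / 100.0, xi, sigma)
    print(f"xi={xi}: median={median:.3f}, 100-year level={level:.1f}")
```

Both distributions share the same median, yet the high quantile moves by dozens of units when the tail index changes by a tenth, which is the point of the example.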
This short example illustrates how issuing forecasts with the right tail regime, i.e., as close as possible to the observational one, is a priority for extreme events, and that a verification methodology should reward forecasts with a close, if not equal, tail regime. Ideally, the measure of forecast performance should give not only the distance but also the ‘direction’, i.e., whether the forecast is more likely to over- or under-estimate the high quantiles. Indeed, let be the tail index of observations. If the forecast satisfies , the forecast over-estimates the risk, producing a pessimistic or risk-averse scenario. On the contrary, falls on the optimistic side by under-estimating the likelihood of extreme events.
Classical methods for forecast evaluation, even when designed to focus on extreme events, do not preserve tail equivalence. For instance, for any positive and observation distribution , it is always possible to construct a non-tail-equivalent cdf , such that
[TABLE]
A proof can be found in Appendix C. More precisely, if , then it is possible for any arbitrary to find satisfying Equation (4). Thus the CRPS is unable to properly discriminate between forecasts with different tail regimes, as non-tail-equivalent forecasts can perform almost as well as the ideal forecast . A detailed illustration of this result for GP forecasts is given in Appendix D. We also refer to Brehmer and Strokorb (2019), who obtained a more general result, proving that proper scoring rule expectations are not suitable for distinguishing tail properties; see their Theorem 5.4.
2.3 A benchmark for assessing forecasts of extremes
Following Gneiting et al. (2007) and Strähl and Ziegel (2017), we propose a benchmark to assess the behavior of forecast evaluation procedures with respect to tail regimes. The design relies on a hierarchical model based on Gamma–exponential mixtures with
[TABLE]
where refers to an exponential random variable with scale . The fact that follows a heavy-tailed GP distribution, see relation (5), can be proved using Laplace transforms. By analogy with weather forecasting, we present the benchmark in a temporal setting. At each time , an observation is drawn independently from an exponential distribution whose scale is a realization of . In this setting, has an exponential tail which is conditioned by the information brought by its scale , representing the a priori knowledge of the system, for instance the weather at the previous time. Thus the ideal forecast for each time step is , and requires the knowledge of . Using relation (5), we see that the climatological forecaster is a GP distribution with tail index and unit scale. Climatology is a commonly used forecast reference in meteorology. In other fields, it can be viewed as the unconditional distribution of the truth, and a climatological forecast can be estimated from a sample of past and analogous observations. This setting is attractive as the ideal and the climatological forecasters belong to two different regimes of tail decay.
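The Laplace-transform argument mentioned above can be checked numerically. The sketch below assumes one common parameterisation, which may differ from the paper's: the conditioning variable follows a Gamma distribution with a given shape and rate, and the observation is exponential with that variable as its rate. Under this assumption the unconditional survival function is (1 + y/rate)^(-shape), a GP with tail index 1/shape and scale rate/shape, so taking rate equal to shape gives unit scale. All function names and the integration settings are hypothetical.

```python
import math


def gamma_pdf(lam, shape, rate):
    """Density of a Gamma(shape, rate) variable at lam > 0."""
    return rate ** shape * lam ** (shape - 1) * math.exp(-rate * lam) / math.gamma(shape)


def mixture_survival(y, shape, rate, upper=60.0, n=20000):
    """P(Y > y) when Y | Lambda ~ Exp(rate=Lambda) and Lambda ~ Gamma(shape, rate).

    Midpoint-rule approximation of the Laplace transform of the Gamma density at y.
    """
    h = upper / n
    return h * sum(
        math.exp(-(i + 0.5) * h * y) * gamma_pdf((i + 0.5) * h, shape, rate)
        for i in range(n)
    )


def gp_survival(y, xi, sigma):
    """GP survival function (1 + xi*y/sigma)**(-1/xi) for xi > 0."""
    return (1.0 + xi * y / sigma) ** (-1.0 / xi)


shape = 4.0  # illustrative value, not taken from the paper
for y in (0.5, 2.0, 10.0):
    print(y, mixture_survival(y, shape, shape), gp_survival(y, 1.0 / shape, 1.0))
```

The two columns agree to numerical precision, illustrating how a conditionally light-tailed observation becomes heavy-tailed once the conditioning variable is integrated out.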
We introduce alternative competitors modelling partial knowledge of the conditional state: the -informed forecaster , is a mixture between the climatological and ideal forecasts, where a weight, say , indicates the contribution of each one, see Table 1 for the definition.
Finally, the extremist forecaster simply adds a multiplicative bias to the ideal forecaster: while it is not calibrated, such a forecast has the same tail behavior as the ideal forecaster ; see Appendix A for a detailed discussion on calibration. The benchmark is summarized in Table 1 and later referred to as the “Model GE”.
Closed forms of the CRPS are available for each forecast of the proposed benchmark. For instance, the extremist forecast , satisfies
[TABLE]
Besides, combining (12) and (6) yields the following formula for the -informed forecast,
[TABLE]
where . Table 2 gives the relative ratio of the empirical means of the CRPS for the benchmark with .
The CRPS being a proper score, the ideal forecast cannot be beaten on average in Table 2. Moreover, there is a clear ranking among calibrated forecasts, based on the nested information sets (Holzmann and Eulert, 2014). Following the principle of tail equivalence presented in Section 2.2, the extremist forecast should be the forecast closest to the ideal, as they both belong to the same regime of tail decay; however, we observe that the CRPS average ranks its performance between the least informed forecaster and the climatology. An alternative measure for forecast evaluation that satisfies the tail equivalence principle is thus required. A good candidate commonly used in forecast science is the ROC curve (Gneiting and Vogel, 2018). However, in the case of Model GE, all the ROC curves, except the climatological one, coincide whatever the event, which illustrates its invariance under calibration (Kharin and Zwiers, 2003). Further alternatives should thus be investigated.
3 The CRPS as a random variable
3.1 The random CRPS and its properties
Section 2 pointed out the difficulty of summarizing forecast performance for meaningful comparisons for extreme observations. We illustrated in particular that a single number such as the mean of the CRPS, or its weighted counterpart, fails to deliver relevant comparisons. As an alternative, we propose to study the distribution of the CRPS when treated as a random variable, see also Ferro (2017); Bessac and Naveau (2021).
For simplicity, we use the setting and corresponding notations of the benchmark presented in Section 2.3. From equations (12) and (6), the climatological and ideal scores can be treated as random variables whenever is replaced by . At this stage, it is important to recall that a forecast is issued with only partial knowledge of the system: the exact value of and the distribution of are unknown, and only the observation is available. Table 3 summarizes the quantities that are available to forecasters. Thus, to evaluate forecast performance, it is only possible to compute for each . The climatological distribution, which we now denote and whose existence needs to be hypothesised in practice, is characterized by the observed sample , considered as a sample of independent realizations of the random variable .
For any set of forecasts and sample , two types of sets of random variables can be defined:
[TABLE]
where is a random permutation of . Applying breaks the conditional dependence between and , quantified by in the benchmark, creating alternative less informative forecasts. Thus for a given forecaster, represented by the set and permutation , we introduce two random variables and characterized by their respective empirical cdf.
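The effect of the permutation can be illustrated by simulation. This is a minimal sketch under stated assumptions: the conditioning variable is drawn from a Gamma distribution with equal shape and rate, the observation is exponential with that variable as its rate, the ideal forecaster issues the matching exponential distribution, and the CRPS of an exponential forecast uses its standard closed form (not copied from the paper, whose formula is not visible in this text). The sample size, seed and shape value are arbitrary.

```python
import math
import random


def crps_exponential(rate, y):
    """Standard closed-form CRPS of an exponential forecast with the given rate, y >= 0."""
    return y + 2.0 * math.exp(-rate * y) / rate - 1.5 / rate


rng = random.Random(0)
n, shape = 20000, 4.0
# random.gammavariate takes (shape, scale), so scale = 1/shape gives rate = shape.
rates = [rng.gammavariate(shape, 1.0 / shape) for _ in range(n)]
obs = [rng.expovariate(r) for r in rates]

# Paired set: each forecast is scored against its own observation.
paired = [crps_exponential(r, y) for r, y in zip(rates, obs)]

# Permuted set: shuffling the observations breaks the link with the forecasts.
shuffled = obs[:]
rng.shuffle(shuffled)
permuted = [crps_exponential(r, y) for r, y in zip(rates, shuffled)]

mean_paired = sum(paired) / n
mean_permuted = sum(permuted) / n
print(mean_paired, mean_permuted)
```

Comparing the sorted values of `paired` and `permuted` gives the kind of quantile-quantile diagnostic discussed in the text; the means alone already show that the permuted forecasts score worse.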
The climatological forecaster is the only forecaster satisfying
[TABLE]
as by definition it discards any information about the system conditioning. The first equality in (8) is a direct consequence of auto-calibration, see A; the second equality follows from the permutation invariance of the data from the point of view of the climatological forecaster.
The distributional properties of , , and give relevant insights on the behavior of the forecaster. For illustration, Figure 1 gives qq-plots of the distributions of against for each forecast of the benchmark with .
We observe that the ideal, -informed and extremist forecasts deviate from the diagonal, illustrating the influence of the loss of information caused by the permutation: such a visual diagnostic summarizes how and capture relevant information from the conditioning modelled here by the random variable . The right panel of Figure 1 displays these distributions on the probability scale and highlights how the discrepancy of the -informed forecaster evolves with the parameter . Extremist forecasts, with multiple values of the scale parameter , are displayed here for the sole purpose of illustrating how such visual diagnostics behave when calibration is not satisfied. In Figure 1, we can also see that forecast dominance among forecasters could be inferred, as in Ehm et al. (2016, Fig. 1,2,4,6) for point forecasts. Under calibration, the discrepancy between distributions can be appropriately interpreted as a direct measure of forecaster skill (the -informed curves never cross each other), making such a diagnostic particularly relevant and compliant with the recommendations on the extremal dependence indices established by Ferro and Stephenson (2011).
3.2 Tail properties of the random CRPS
We now study the upper tail behavior of the random CRPS, using EVT to develop a meaningful forecast evaluation for extreme events. To lighten the technicality of this section, all proofs are relegated to Appendix E. In terms of notation with respect to any conditional model that depends on , we want to emphasize the difference between a conditional forecast, say , and an unconditional forecast . Note that depends on the time index , but for notational simplicity, we drop this index; might also change over time but is here assumed invariant.
Let and be two random variables with absolutely continuous cdfs and with common upper bound . Suppose that there exists such that and that is finite. Then conditionally on , one has
[TABLE]
as tends to , with . Thus, at any fixed state (the state of the atmosphere for a weather forecast, say), the CRPS upper tail behavior (conditionally on ) is equivalent to the observation tail behavior, which formalizes what could be intuited from (12).
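A quick numerical check conveys the intuition in the simplest case. The sketch assumes an exponential forecast with a fixed rate and uses the standard closed-form CRPS for that family; it is an illustration, not the general result. For large observations the score behaves like the observation minus a constant, so an exceedance of the CRPS above a high level corresponds to an exceedance of the observation above a shifted level, and both share the same exponential tail.

```python
import math


def crps_exponential(rate, y):
    """Standard closed-form CRPS of an exponential forecast with the given rate, y >= 0."""
    return y + 2.0 * math.exp(-rate * y) / rate - 1.5 / rate


rate = 2.0  # arbitrary fixed state
for y in (1.0, 5.0, 20.0):
    linear_part = y - 1.5 / rate
    gap = crps_exponential(rate, y) - linear_part
    # The gap equals 2 * exp(-rate * y) / rate and vanishes as y grows.
    print(y, gap)
```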
Now, unconditionally, one can also get a result for the climatological forecast, thanks to its property of invariance under permutation (see Section 3.1). If there exists such that , then
[TABLE]
for any such that . In the case where , convergence in Equation (10) also holds for as the latter vanishes due to the linear behavior of the auxiliary function in Equation (3), e.g., see Embrechts et al. (1997).
The benchmark presented in Table 1 illustrates these results. The choice of working with a time indexed couple or with an invariant impacts significantly the tail behavior of the CRPS random variables: according to Table 1, the former case implies that the limit in (9) exhibits an exponential tail, whereas the climatological tail given by (10) is heavy, i.e., .
3.3 Assessing the forecaster tail behavior
In this section, we propose a tail-equivalent forecast performance index inspired by equations (9), (10), and Figure 1. We aim only to provide the intuition behind the index and leave formal theoretical analysis for future work. We assume that the forecasts lie in the domain of attraction of some distribution . For sufficiently large , the null hypothesis should be rejected for any calibrated forecast with tail behaviour closer to the ideal forecast than the climatological reference.
To go further, assume that the variables in are iid. This assumption may not always be satisfied, as for instance temperature measurements on two consecutive days are likely to be dependent, but it can reasonably be expected to hold for measurements taken sufficiently far apart in time. For each forecast, we can compute a Cramér-von Mises criterion
[TABLE]
where is the empirical distribution of the observations in exceeding the threshold . The empirical nature of allows us to simplify to
[TABLE]
where denotes the number of observations exceeding and are the ordered values of . A detailed algorithm for the computation of is provided in Table 4 of Appendix F.
As suggested by Figure 1, we assume that , for any calibrated forecast and climatology . Also, for two calibrated forecasts and , we conjecture that if has a tail behaviour closer to the ideal forecast than . Under these assumptions, we can simply summarize the comparison between and through
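The exact expression of the criterion is not recoverable from this text, so the following sketch falls back on the textbook computing formula of the Cramér-von Mises statistic, applied to the CRPS values whose paired observation exceeds the threshold and compared with a GP cdf. The GP parameters are assumed to have been estimated beforehand from the observations; the function names, parameter values and data below are toy placeholders, and the block should not be read as the paper's algorithm or as the `extremeIndex` implementation.

```python
def gp_cdf(x, xi, sigma):
    """Generalized Pareto cdf for xi > 0 and x >= 0."""
    return 1.0 - (1.0 + xi * x / sigma) ** (-1.0 / xi)


def crps_above_threshold(crps_values, observations, threshold):
    """Keep the CRPS values whose paired observation exceeds the threshold."""
    return [c for c, y in zip(crps_values, observations) if y > threshold]


def cramer_von_mises(values, cdf):
    """Textbook formula: 1/(12 m) + sum_i ((2 i - 1)/(2 m) - cdf(x_(i)))**2."""
    ordered = sorted(values)
    m = len(ordered)
    return 1.0 / (12 * m) + sum(
        ((2 * i - 1) / (2 * m) - cdf(x)) ** 2
        for i, x in enumerate(ordered, start=1)
    )


# Toy values for illustration only.
crps_values = [0.2, 1.4, 0.7, 3.1, 0.9]
observations = [0.5, 2.5, 1.1, 4.0, 2.2]
selected = crps_above_threshold(crps_values, observations, threshold=2.0)
omega = cramer_von_mises(selected, lambda x: gp_cdf(x, xi=0.25, sigma=1.0))
print(selected, omega)
```

The statistic reaches its minimum, 1/(12 m), when the ordered values sit exactly at the GP quantiles of order (2i - 1)/(2m), and grows as the empirical distribution of the selected CRPS values departs from the reference tail.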
[TABLE]
The behaviour of the index is illustrated with the help of model GE; Figure 2 displays the evolution of as a function of the threshold for and . The behaviour of the index is shown to be consistent with our conjecture: first, the ideal forecast performs best, while the climatology has the lowest index. Performance ranking among calibrated forecasters is stable as the threshold increases, with the ideal forecast always obtaining the largest index. The extremist forecasters, displayed here to illustrate the behaviour of the index for non-calibrated forecast, obtain a high index, even larger than the ideal forecast, stressing the importance of calibration which must be carefully assessed before any interpretation of .
In practice, a threshold choice has to be made, for which numerous methodologies have been developed, see, e.g., Beirlant et al. (2004); Papastathopoulos and Tawn (2013); Naveau et al. (2016).
4 Discussion
In this work, we have argued with the help of a carefully designed benchmark that the mean of the CRPS, or of its weighted counterparts, is unable to successfully discriminate a forecast's upper tail regime, as demonstrated by Brehmer and Strokorb (2019). Ehm et al. (2016) have introduced the so-called “Murphy diagrams” for assessing dominance in point forecasts. This original approach makes it possible to assess dominance among different forecasts and anticipate their skill area; a similar visual diagnostic is presented in Figure 1 for calibrated forecasts.
Inspired by Friederichs and Thorarinsdottir (2012), we apply EVT directly to common verification measures. By considering the CRPS as a random variable, see also Bessac and Naveau (2021) for non-extreme cases, one can view this contribution as a first step towards considering functionals of the score distributions other than their means. The new index introduced in Section 3.3 can be considered as a probabilistic alternative to the scores introduced by Ferro (2007) and Ferro and Stephenson (2011). We make a link between the paradigm of maximizing the sharpness subject to calibration from Gneiting et al. (2007) and the paradigm of maximizing the information for extreme events subject to calibration. In the same vein, Murphy (1993) has presented the differences between forecast quality (accordance between forecasts and observations) and forecast value (ability to bring information to realize a benefit by choosing a forecast). Forecast value seems to be the most important for extreme events, where decision making is crucial. For deterministic weather forecasts, such tools are well known; see, e.g., Richardson (2000); Zhu et al. (2002). Other widely used scores based on the dependence between forecasts and observed events have been considered in Stephenson et al. (2008); Ferro and Stephenson (2011).
It would be worthwhile to further study the theoretical properties of this CRPS-based tool. Another potentially interesting investigation could be to extend this procedure to other scores such as the mean absolute difference, the Dawid-Sebastiani score (Dawid and Sebastiani, 1999) or the ignorance score (Smith et al., 2015; Diks et al., 2011). Classical tools in verification rely on a verification period; as a consequence, evaluation is always done a posteriori. Thus, an interesting way to pursue this work would be to consider sequential evaluation of rare events, in the spirit of the e-values (Vovk and Wang, 2021) introduced to assess and monitor calibration continuously (Arnold et al., 2021). Finally, we invite scientists to work on a new theory of scoring rules that departs from score averages.
Acknowledgments
Part of this work was supported by the French National Research Agency (ANR) project T-REX (ANR-20-CE40-0025) and by Energy oriented Centre of Excellence-II (EoCoE-II), Grant Agreement 824158, funded within the Horizon2020 framework of the European Union. Part of this work was also supported by the ExtremesLearning grant from 80 PRIME CNRS-INSU and the ANR project Melody (ANR-19-CE46-0011). This work was partially supported by the ANR LABEX MILYON (ANR-10-LABX-0070) of Université de Lyon, within the program "Investissements d’Avenir" (ANR-11-IDEX-0007).
Implementation details
The implementation of the index relies on the extremeIndex package (Taillardat, 2021a). The R code generating simulation data and Figures is available upon request.
Appendix A Prediction framework and calibration
The theoretical framework considered in this paper is the now classical prediction space already introduced by Murphy and Winkler (1987); Gneiting and Ranjan (2013); Ehm et al. (2016), and generalized in a serial context by Strähl and Ziegel (2017). It starts formally with a probability space and a collection of sub-σ-algebras , where represents the information available to forecaster . In a meteorological context, it can be seen as the representation of the atmosphere made by each forecaster. In the benchmark considered in Section 2.3, we will consider for simplicity that the information set is generated by a random variable .
A real-valued outcome is observed and seen as a (real-valued) random variable. A probabilistic forecast for is identified with its so-called “predictive distribution” with cdf . Rigorously speaking, is a kernel from to (this means that for each fixed , is a probability measure, and for each fixed , is -measurable; see, e.g., Kallenberg (2017)), but as done by previous authors, we will identify the kernels with random cdfs; see, e.g., Strähl and Ziegel (2017) for more details. For each , we might in particular use the notation meaning the random element .
In such a framework, a forecast is termed ideal with respect to if almost surely. Tsyplakov (2011) also refers to this property by saying that is calibrated with respect to . He additionally defines auto-calibration as the property for to satisfy almost surely. Here, denotes the σ-algebra generated by , that is to say the smallest σ-algebra such that is measurable for all . Note that if a forecast is calibrated with respect to , then it is auto-calibrated, but the converse does not hold in general. As a particular case considered in Section 2.3, the climatological forecaster is ideal with respect to the trivial σ-algebra.
In practice, one is not only concerned with predictions for an outcome at a single time point. The framework introduced above also allows one to deal with independent replicates at times , as is done in Section 2.3. Although such an assumption of independence sounds unrealistic in several situations, as argued by Strähl and Ziegel (2017), it can nevertheless provide a first step and takes advantage of a lighter context. We therefore chose to keep it in this paper for simplicity.
Appendix B An alternative expression of the weighted CRPS
The weighted CRPS defined by (1) can be reformulated in the following way, as soon as the weight function is continuous,
[TABLE]
Assume that the weight function is continuous. By integrating by parts and and using , the weighted CRPS defined by (1) can be rewritten as
[TABLE]
The equality gives
[TABLE]
and
[TABLE]
where the last line follows from the fact that and have the same distribution, which is uniform on (0, 1). As is non-decreasing, one has , and it follows that
[TABLE]
as announced in (12).
Appendix C Proof of the inequality (4)
Let be a positive real number. Denote by a non-negative random variable with finite mean and cdf . Assume that and are independent and have the same right endpoint. We introduce the new random variable
[TABLE]
with survival function defined by
[TABLE]
Note that the monotone decrease of yields in particular that for all ,
[TABLE]
Besides, equation (16) and the monotonicity of allow us to write that for any
[TABLE]
Equality (12) implies that
[TABLE]
where
[TABLE]
The stochastic ordering that holds between and implies that the quantity is negative. Combined with (18), this leads to
[TABLE]
For we can write that
[TABLE]
since in the first expectation, whereas in the second one. As a consequence, one gets
[TABLE]
This last expression combined with (19) leads finally to
[TABLE]
Note that this inequality is true for any and , and its right-hand side does not depend on . Thus, the tail behavior of the random variables and can be completely different, although the CRPS of and can be as close as one wishes. The right-hand side goes to [math] due to the finite mean of .
Appendix D A detailed example related to Section 2.2
In this appendix, we illustrate the fact that the CRPS fails to discriminate between forecasts with different tails. We consider GP distributed forecasts and observations. In this case, closed-form expressions of the CRPS are available, as detailed below.
**Lemma 1.**
Consider and with and , with respective survival functions (for ) and (for ). If , with , then
[TABLE]
This gives the minimum CRPS value for and ,
[TABLE]
Proof: Applying (12) with , and making use of classical properties of the Pareto distribution (see e.g. Embrechts et al., 1997, Theorem 3.4.13), one gets
[TABLE]
It follows that
[TABLE]
with
[TABLE]
Since
[TABLE]
one can write
[TABLE]
Besides, as , one can thus rewrite, denoting by a random variable uniformly distributed on ,
[TABLE]
If , then this simplifies to
[TABLE]
In particular, and
[TABLE]
It follows that, if , then we have
[TABLE]
This gives the minimum CRPS value for and ,
[TABLE]
concluding the proof of Lemma 1.
Lemma 1 makes it possible to study the effect of changing the forecast’s tail behavior captured by and the forecast spread encapsulated in , when and have proportional parameters, i.e., and for some . In this case, the CRPS simplifies to
[TABLE]
leading when to a forecaster with a heavier tail, overestimating the true upper tail behavior, and to the opposite when .
Counterexamples such as the previous one can thus be found, illustrating how weighted scoring rules fail to compare tail behaviors. They should therefore be handled with particular care, especially by forecast makers, as already advocated by Gilleland et al. (2018) and Lerch et al. (2017).
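The behavior described in this appendix can be explored numerically without the closed form of Lemma 1. The sketch below is a minimal illustration, assuming GP forecasts and observations with zero location and positive shape below 1. It evaluates the expected CRPS through the decomposition E[CRPS(F, Y)] = int (F - G)^2 dx + int G (1 - G) dx, where G is the cdf of the observation. The function names and parameter values are hypothetical. The reference value sigma / ((1 - xi)(2 - xi)) for the ideal forecast follows from standard GP integrals and is meant to be compared with the minimum given in Lemma 1, not read as a quotation of it.

```python
import math


def gp_cdf(x, sigma, xi):
    """Cdf of a generalized Pareto distribution with scale sigma > 0 and shape 0 < xi < 1."""
    if x <= 0.0:
        return 0.0
    return 1.0 - (1.0 + xi * x / sigma) ** (-1.0 / xi)


def expected_crps(sigma_f, xi_f, sigma_g, xi_g, n=20_000):
    """E[CRPS(F, Y)] for a forecast F = GP(sigma_f, xi_f) and Y ~ G = GP(sigma_g, xi_g).

    The half-line is mapped to (0, 1) by x = u / (1 - u), and a midpoint rule is applied.
    """
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        u = (i + 0.5) * h
        x = u / (1.0 - u)
        f = gp_cdf(x, sigma_f, xi_f)
        g = gp_cdf(x, sigma_g, xi_g)
        # Integrand of the decomposition times the Jacobian dx/du = 1 / (1 - u)^2.
        total += ((f - g) ** 2 + g * (1.0 - g)) / (1.0 - u) ** 2
    return total * h


def ideal_expected_crps(sigma, xi):
    """Value of int G (1 - G) dx for G = GP(sigma, xi), i.e. the ideal forecast's expected CRPS."""
    return sigma / ((1.0 - xi) * (2.0 - xi))


if __name__ == "__main__":
    print(expected_crps(1.0, 0.2, 1.0, 0.2), ideal_expected_crps(1.0, 0.2))
    print(expected_crps(1.0, 0.4, 1.0, 0.2))  # heavier-tailed forecaster
```

Varying the forecast scale and shape together shows how small the gap to the ideal value can remain even when the forecast tail index differs from the true one.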
Appendix E Proof of the convergences (9) and (10)
The proof of (10) can be seen as a particular case of (9), so we focus on proving (9). The following lemma, presented first with its proof, will help to obtain the result. In what follows, the mean excess function of any random variable with finite mean and with cdf will be denoted by , so that
Lemma: Consider a random variable with finite mean that belongs to the domain of attraction with . There exist non-negative real numbers and such that, for each ,
[TABLE]
Proof of the lemma: The indicator function implies that we always have . To prove that is smaller than , we first show that this inequality holds for large values of . Note first that if , then (22) is trivially true. Let us then show the result when and, to this end, split the proof according to the sign of :
1. belongs to with : In this case, Embrechts et al. (1997, Section 3.4) show that as tends to , and we can conclude directly.
2. belongs to with : In this case, the result also follows easily from Embrechts et al. (1997) since when tends to , This allows us to fix and for an appropriate neighborhood of .
3. belongs to : When is in the Gumbel domain of attraction, as tends to (see e.g. Theorem 3.9 in Ghosh and Resnick (2010)). If is finite, then there exists a positive such that and can be fixed to 0, whereas if is infinite, the fact that for large enough allows us to conclude.
So far, we have shown that, for some large , there exist non-negative and such that
[TABLE]
We still need to prove that this statement also holds for . Define
[TABLE]
As , is finite and, as for all , we have
[TABLE]
We now have two cases: either or . In the latter case, we have , so the required result is obtained. In the case of , it is always possible to increase the value of chosen when , and bring it above . ∎
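The mean excess function used in the lemma can be estimated directly from a sample. The sketch below is a minimal illustration, assuming a GP distribution with zero location, for which the mean excess function is exactly affine in the threshold, e(u) = (sigma + xi u) / (1 - xi) for xi < 1 (a standard property). The function names, the sample size, and the parameter values are hypothetical.

```python
import random


def sample_gp(n, sigma, xi, seed=0):
    """Draw n values from GP(sigma, xi), xi > 0, by inverse transform sampling."""
    rng = random.Random(seed)
    # rng.random() lies in [0, 1), so 1 - U lies in (0, 1] and the power is well defined.
    return [sigma / xi * ((1.0 - rng.random()) ** (-xi) - 1.0) for _ in range(n)]


def empirical_mean_excess(sample, u):
    """Average of X - u over the observations exceeding the threshold u."""
    excesses = [x - u for x in sample if x > u]
    if not excesses:
        raise ValueError("no observation exceeds the threshold")
    return sum(excesses) / len(excesses)


def gp_mean_excess(u, sigma, xi):
    """Theoretical mean excess function of GP(sigma, xi) for xi < 1 and u >= 0."""
    return (sigma + xi * u) / (1.0 - xi)


if __name__ == "__main__":
    data = sample_gp(200_000, sigma=1.0, xi=0.2)
    for threshold in (0.0, 1.0, 2.0):
        print(threshold, empirical_mean_excess(data, threshold),
              gp_mean_excess(threshold, 1.0, 0.2))
```

In this example the empirical and theoretical values should track each other along a straight line in the threshold, which is the kind of affine control the lemma provides in general.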
We are now ready to prove (9) as announced.
*Proof of (9):* Given the conditional forecast , the CRPS can be computed with respect to the conditional observation in the following way
[TABLE]
where . To simplify notation, we drop the subscript in the rest of the proof and reinstate it at the end. The previous lemma allows us to write
[TABLE]
Let us now work conditionally on , for a large close to . We then get
[TABLE]
This holds when the right endpoint of is non-negative. If this were not the case, note that one can simply write .
The main idea of the proof is to notice that goes to zero as gets large, and consequently, the above inequalities indicate that the thresholded random variable and the thresholded CRPS should behave similarly for large . The choice of the positive constant depends on the domain of attraction of . More precisely, we assume that converges in distribution towards a GPD with finite mean, so that
[TABLE]
We recognize the probability (conditionally on ) for to be in an interval denoted by
[TABLE]
The remaining part of the proof consists in showing that this conditional probability tends to 0 as . We can write
[TABLE]
where For large enough, the latter probability can be approximated by a GPD, so that
[TABLE]
where denotes the probability density function associated with the GPD. This implies the convergence to 0 of the latter probability. Since this is true conditionally on , it can be rewritten, after reintroduction of the subscript , as
[TABLE]
as tends to , with . ∎
Appendix F Algorithm for the computation of the Cramér-von Mises criterion
Note that for large , under the null hypothesis, the statistic follows a Cramér-von Mises distribution. The associated -values could have been computed, but they are actually subject to numerical instabilities (Prokhorov, 1968; Csörgő and Faraway, 1996). Furthermore, is sufficient to compare the effect size of the deviation.
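For reference, the standard one-sample Cramér-von Mises statistic against the uniform distribution on (0, 1) can be computed as in the sketch below. This is a minimal illustration: whether the paper's algorithm uses exactly this normalization, and which values are fed to it, are assumptions, and the function name is hypothetical.

```python
def cramer_von_mises_uniform(values):
    """Cramér-von Mises statistic of a sample against the uniform distribution on (0, 1).

    Uses the usual computing formula 1 / (12 n) + sum_i ((2 i - 1) / (2 n) - u_(i))^2,
    where u_(1) <= ... <= u_(n) are the ordered values.
    """
    u = sorted(values)
    n = len(u)
    if n == 0:
        raise ValueError("empty sample")
    # enumerate is 0-based, so (2 * i + 1) / (2 * n) equals (2 * rank - 1) / (2 * n).
    return 1.0 / (12 * n) + sum(((2 * i + 1) / (2 * n) - v) ** 2 for i, v in enumerate(u))


if __name__ == "__main__":
    evenly_spread = [0.125, 0.375, 0.625, 0.875]
    concentrated = [0.9, 0.9, 0.9, 0.9]
    print(cramer_von_mises_uniform(evenly_spread))
    print(cramer_von_mises_uniform(concentrated))
```

A sample spread evenly over (0, 1) attains the smallest possible value, 1 / (12 n), while a concentrated sample yields a much larger statistic, which is what makes the criterion usable for comparing the size of the deviation from uniformity.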
References
- Arnold, S., Henzi, A., Ziegel, J. F., 2021. Sequentially valid tests for forecast calibration. arXiv preprint arXiv:2109.11761.
- Beirlant, J., Goegebeur, Y., Segers, J., Teugels, J., Waal, D., Ferro, C., 2004. Statistics of extremes: Theory and applications.
- Bessac, J., Naveau, P., 2021. Forecast score distributions with imperfect observations. Advances in Statistical Climatology, Meteorology and Oceanography 7 (2), 53–71.
- Brehmer, J. R., Strokorb, K., 2019. Why scoring functions cannot assess tail properties. Electronic Journal of Statistics 13 (2), 4015–4034. https://doi.org/10.1214/19-EJS1622
- Bröcker, J., 2012. Evaluating raw ensembles with the continuous ranked probability score. Quarterly Journal of the Royal Meteorological Society 138 (667), 1611–1617.
- Csörgő, S., Faraway, J. J., 1996. The exact and asymptotic distributions of Cramér-von Mises statistics. Journal of the Royal Statistical Society: Series B (Methodological) 58 (1), 221–234.
- Dawid, A. P., 1984. Present position and potential developments: Some personal views: Statistical theory: The prequential approach. Journal of the Royal Statistical Society. Series A (General), 278–292.
- Dawid, A. P., Sebastiani, P., 1999. Coherent dispersion criteria for optimal experimental design. Annals of Statistics, 65–81.
