Multivariate postprocessing methods for high-dimensional seasonal   weather forecasts

Claudio Heinrich; Kristoffer H. Hellton; Alex Lenkoski; Thordis L.; Thorarinsdottir

arXiv:1907.09716·stat.ME·November 11, 2019

Multivariate postprocessing methods for high-dimensional seasonal weather forecasts

Claudio Heinrich, Kristoffer H. Hellton, Alex Lenkoski, Thordis L., Thorarinsdottir

PDF

Open Access

TL;DR

This paper introduces a multivariate postprocessing method for high-dimensional seasonal weather forecasts that improves accuracy by handling complex error patterns and trends, using covariance tapering and PCA for efficiency.

Contribution

It presents a novel multivariate postprocessing approach combining covariance tapering and PCA, capable of managing non-stationary, non-isotropic, and negatively correlated errors on a global scale.

Findings

01

Outperforms existing methods in global sea surface temperature forecasts

02

Handles non-stationary and non-isotropic error patterns effectively

03

Adapts to bias trends caused by global warming

Abstract

Seasonal weather forecasts are crucial for long-term planning in many practical situations and skillful forecasts may have substantial economic and humanitarian implications. Current seasonal forecasting models require statistical postprocessing of the output to correct systematic biases and unrealistic uncertainty assessments. We propose a multivariate postprocessing approach utilizing covariance tapering, combined with a dimension reduction step based on principal component analysis for efficient computation. Our proposed technique can correctly and efficiently handle non-stationary, non-isotropic and negatively correlated spatial error patterns, and is applicable on a global scale. Further, a moving average approach to marginal postprocessing is shown to flexibly handle trends in biases caused by global warming, and short training periods. In an application to global sea surface…

Tables4

Table 1. Table 1: Mean squared error (MSE) over all grid points, months and years in the validation period 2001-2016. N G R m 𝑁 𝐺 subscript 𝑅 𝑚 NGR_{m} is linear regression grouped by month, N G R s 𝑁 𝐺 subscript 𝑅 𝑠 NGR_{s} by location, and N G R m , s 𝑁 𝐺 subscript 𝑅 𝑚 𝑠 NGR_{m,s} by both, N G R m l a 𝑁 𝐺 subscript superscript 𝑅 𝑙 𝑎 𝑚 NGR^{la}_{m} is the locally adaptive method described in Section 2.2 , S M A 𝑆 𝑀 𝐴 SMA is bias correction by simple moving averages, and E M A 𝐸 𝑀 𝐴 EMA by exponential moving averages. The second row shows p 𝑝 p -values obtained in a permutation test for the significance of the score difference to the best performing method (EMA).

Method	$N G R_{m}$	$N G R_{s}$	$N G R_{m, s}$	$N G R_{m}^{l a}$	$S M A$	$E M A$
MSE	2.028	0.417	0.227	0.331	0.223	0.220
$p$ -value	$< 0.1 %$	$< 0.1 %$	$< 0.1 %$	$< 0.1 %$	$< 0.1 %$	-

Table 2. Table 2: Continuous ranked probability score (CRPS) for different variance estimation methods with bias-correction by EMA and p 𝑝 p -values for permutation tests comparing with the best possible model (EMA). Results are aggregated over all grid points, months and years in the validation period 2001-2016 and the best models are indicated in bold.

Method	$N G R_{m}$	$N G R_{s}$	$N G R_{m, s}$	$S M A$	$E M A$
CRPS	0.2426	0.2349	0.2311	0.2305	0.2304
$p$ -value	$< 0.1 %$	$< 0.1 %$	$< 0.1 %$	$< 0.1 %$	-

Table 3. Table 3: Variogram scores for the area shown in Figure 5 averaged over all months in the validation period 2001–2016, and p 𝑝 p -values for a permutation test comparing to the best performing model. Numbers in bold highlight models that are not significantly worse than the best performing model, at a level of 5%.

Method:	${\hat{Σ}}^{m c}$	${\hat{Σ}}^{a c}$	GS	ECC	Schaake
VS $(F, 𝒚)$	0.03074	0.03075	0.03410	0.03111	0.03095
$p$ -value	-	$40.1 %$	$< 0.1 %$	$< 0.1 %$	$< 0.1 %$

Table 4. Table 4: Scores for minimum SST forecasts along a shipping route from Bordeaux, France to Norfolk, USA for the four different multivariate models, aggregated over all months and years in the validation period 2001–2016.

Method	${\hat{Σ}}^{m c}$	${\hat{Σ}}^{a c}$	GS	ECC	Schaake
MSE	0.692	0.690	0.726	0.708	0.687
$p$ -value MSE	26.3 $%$	37.1 $%$	2.1 $%$	0.2 $%$	-
CRPS	0.445	0.443	0.454	0.463	0.446
$p$ -value CRPS	19.0 $%$	-	1.2 $%$	$< 0.1 %$	22.5 $%$

Equations52

X_{y, s} \sim N (μ_{y, s}, σ_{y, s}^{2}) .

X_{y, s} \sim N (μ_{y, s}, σ_{y, s}^{2}) .

b_{y, s} := j < y \sum w_{y - j}^{b} (\overline{f}_{j, s} - t_{j, s}), σ_{y, s}^{2} := j < y \sum w_{y - j}^{σ} (t_{j, s} - (\overline{f}_{j, s} - b_{j, s}))^{2},

b_{y, s} := j < y \sum w_{y - j}^{b} (\overline{f}_{j, s} - t_{j, s}), σ_{y, s}^{2} := j < y \sum w_{y - j}^{σ} (t_{j, s} - (\overline{f}_{j, s} - b_{j, s}))^{2},

CRPS (F, t) = E_{F} ∣ X - t ∣ - \frac{1}{2} E_{F} E_{F} ∣ X - X^{'} ∣,

CRPS (F, t) = E_{F} ∣ X - t ∣ - \frac{1}{2} E_{F} E_{F} ∣ X - X^{'} ∣,

μ_{y, s} = a + b \overline{f}_{y, s}, σ_{y, s}^{2} = c^{2} + d^{2} S_{y, s}^{2},

μ_{y, s} = a + b \overline{f}_{y, s}, σ_{y, s}^{2} = c^{2} + d^{2} S_{y, s}^{2},

\overline{f}_{y, s}^{ano} := \overline{f}_{y, s} - \frac{1}{# { j < y }} j < y \sum \overline{f}_{j, s}, t_{y, s}^{ano} := t_{y, s} - \frac{1}{# { j < y }} j < y \sum t_{j, s},

\overline{f}_{y, s}^{ano} := \overline{f}_{y, s} - \frac{1}{# { j < y }} j < y \sum \overline{f}_{j, s}, t_{y, s}^{ano} := t_{y, s} - \frac{1}{# { j < y }} j < y \sum t_{j, s},

μ_{y, s} = a + b \overline{f}_{y, s}^{ano} + t_{y, s}^{clim}, where t_{y, s}^{clim} := t_{y, s} - t_{y, s}^{ano} .

μ_{y, s} = a + b \overline{f}_{y, s}^{ano} + t_{y, s}^{clim}, where t_{y, s}^{clim} := t_{y, s} - t_{y, s}^{ano} .

res_{y} := t_{y} - μ_{y} \sim N_{S} (0, Σ_{y}), with diag(Σ_{y}) = (σ_{y, s_{i}}^{2})_{s_{i} \in S},

res_{y} := t_{y} - μ_{y} \sim N_{S} (0, Σ_{y}), with diag(Σ_{y}) = (σ_{y, s_{i}}^{2})_{s_{i} \in S},

X_{y} \sim N_{S} (μ_{y}, Σ_{y}),

X_{y} \sim N_{S} (μ_{y}, Σ_{y}),

S (s_{i}, s_{j}) := \frac{1}{Y - 1} y \sum (t_{y, s_{i}} - μ_{y, s_{i}}) (t_{y, s_{j}} - μ_{y, s_{j}}),

S (s_{i}, s_{j}) := \frac{1}{Y - 1} y \sum (t_{y, s_{i}} - μ_{y, s_{i}}) (t_{y, s_{j}} - μ_{y, s_{j}}),

S_{ϕ} (s_{i}, s_{j}) := ϕ (∥ s_{i} - s_{j} ∥) S (s_{i}, s_{j}) .

S_{ϕ} (s_{i}, s_{j}) := ϕ (∥ s_{i} - s_{j} ∥) S (s_{i}, s_{j}) .

\phi_{L}(t):=\phi(t/L),\quad\text{where}\quad\phi(t):=\bigg{(}(1-t)\frac{\sin(2\pi t)}{2\pi t}+\frac{1-\cos(2\pi t)}{2\pi^{2}t}\bigg{)}\mathds{1}\{0\leq t\leq 1\}.

\phi_{L}(t):=\phi(t/L),\quad\text{where}\quad\phi(t):=\bigg{(}(1-t)\frac{\sin(2\pi t)}{2\pi t}+\frac{1-\cos(2\pi t)}{2\pi^{2}t}\bigg{)}\mathds{1}\{0\leq t\leq 1\}.

S_{ϕ} = U Λ U^{T} = i = 1 \sum S λ_{i} u_{i} u_{i}^{T},

S_{ϕ} = U Λ U^{T} = i = 1 \sum S λ_{i} u_{i} u_{i}^{T},

Σ := i = 1 \sum d λ_{i} u_{i} u_{i}^{T} .

Σ := i = 1 \sum d λ_{i} u_{i} u_{i}^{T} .

Σ_{ss} = i = 1 \sum d λ_{i} u_{i s}^{2} < i = 1 \sum S λ_{i} u_{i s}^{2} = (S_{ϕ})_{ss} = S_{ss} .

Σ_{ss} = i = 1 \sum d λ_{i} u_{i s}^{2} < i = 1 \sum S λ_{i} u_{i s}^{2} = (S_{ϕ})_{ss} = S_{ss} .

\widehat{\Sigma}^{mc}\ :=\ \Xi\,\widetilde{\Sigma}\,\Xi,\qquad\text{ with }\qquad\text{$\Xi=\text{diag}\bigg{(}\frac{\widehat{\sigma}_{y,1}}{(\widetilde{\Sigma}_{11})^{1/2}},...,\frac{\widehat{\sigma}_{y,S}}{(\widetilde{\Sigma}_{SS})^{1/2}}\bigg{)}$},

\widehat{\Sigma}^{mc}\ :=\ \Xi\,\widetilde{\Sigma}\,\Xi,\qquad\text{ with }\qquad\text{$\Xi=\text{diag}\bigg{(}\frac{\widehat{\sigma}_{y,1}}{(\widetilde{\Sigma}_{11})^{1/2}},...,\frac{\widehat{\sigma}_{y,S}}{(\widetilde{\Sigma}_{SS})^{1/2}}\bigg{)}$},

Σ^{a c} := Σ + diag (η_{1}, ..., η_{S}), where η_{s} := max {σ_{y, s}^{2} - Σ_{ss}, 0} .

Σ^{a c} := Σ + diag (η_{1}, ..., η_{S}), where η_{s} := max {σ_{y, s}^{2} - Σ_{ss}, 0} .

X = Ξ U^{(d)} (Λ^{(d)})^{1/2} Y,

X = Ξ U^{(d)} (Λ^{(d)})^{1/2} Y,

X = U^{(d)} (Λ^{(d)})^{1/2} Y + diag (η_{1}, ..., η_{S}) Z,

X = U^{(d)} (Λ^{(d)})^{1/2} Y + diag (η_{1}, ..., η_{S}) Z,

C_{\theta,r}(s_{i},s_{j})=(1-\theta)\exp\bigg{(}-\frac{\|s_{i}-s_{j}\|}{r}\bigg{)}+\theta\delta_{ij}.

C_{\theta,r}(s_{i},s_{j})=(1-\theta)\exp\bigg{(}-\frac{\|s_{i}-s_{j}\|}{r}\bigg{)}+\theta\delta_{ij}.

\widetilde{x}^{(i)}_{s}:=F_{s}^{-1}\bigg{(}\frac{i}{N+1}\bigg{)},\qquad\text{for $i=1,...,N,$ $s=1,...,S$,}

\widetilde{x}^{(i)}_{s}:=F_{s}^{-1}\bigg{(}\frac{i}{N+1}\bigg{)},\qquad\text{for $i=1,...,N,$ $s=1,...,S$,}

f_{s}^{(ρ_{s} (1))} \leq \dots \leq f_{s}^{(ρ_{s} (N))}

f_{s}^{(ρ_{s} (1))} \leq \dots \leq f_{s}^{(ρ_{s} (N))}

{x_{s}^{(ρ_{s}^{- 1} (i))}}_{s \in {1, ..., S}} .

{x_{s}^{(ρ_{s}^{- 1} (i))}}_{s \in {1, ..., S}} .

MSE (F, t) = (μ - t)^{2},

MSE (F, t) = (μ - t)^{2},

CRPS (\b x, t) = \frac{1}{N} k = 1 \sum N ∣ x_{k} - t ∣ - \frac{1}{2 N ^{2}} k = 1 \sum N l = 1 \sum N ∣ x_{k} - x_{l} ∣.

CRPS (\b x, t) = \frac{1}{N} k = 1 \sum N ∣ x_{k} - t ∣ - \frac{1}{2 N ^{2}} k = 1 \sum N l = 1 \sum N ∣ x_{k} - x_{l} ∣.

\textup{VS}(\widehat{F},{\bf t})=\sum_{i=1}^{S}\sum_{j=1}^{S}\omega_{ij}\big{(}|t_{i}-t_{j}|^{p}-\mathbb{E}_{\widehat{F}}|X_{i}-X_{j}|^{p}\big{)}^{2},

\textup{VS}(\widehat{F},{\bf t})=\sum_{i=1}^{S}\sum_{j=1}^{S}\omega_{ij}\big{(}|t_{i}-t_{j}|^{p}-\mathbb{E}_{\widehat{F}}|X_{i}-X_{j}|^{p}\big{)}^{2},

s := \frac{1}{n} i = 1 \sum n (S (F_{1}, y_{i}) - S (F_{2}, y_{j})) =: \frac{1}{n} i = 1 \sum n S_{i} .

s := \frac{1}{n} i = 1 \sum n (S (F_{1}, y_{i}) - S (F_{2}, y_{j})) =: \frac{1}{n} i = 1 \sum n S_{i} .

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsClimate variability and models · Soil Geostatistics and Mapping · Grey System Theory Applications

Full text

Multivariate postprocessing methods for high-dimensional seasonal weather forecasts

Claudio Heinrich, Kristoffer H. Hellton, Alex Lenkoski and

Thordis L. Thorarinsdottir

Norwegian Computing Center, Oslo, Norway

The authors gratefully acknowledge the support of the Volkswagen Foundation through grant nr. 88511 and the Research Council of Norway through grant nr. 270733. We thank Tilmann Gneiting and our project partners in the Seasonal Forecasting Engine project for fruitful discussions.

Abstract

Seasonal weather forecasts are crucial for long-term planning in many practical situations and skillful forecasts may have substantial economic and humanitarian implications. Current seasonal forecasting models require statistical postprocessing of the output to correct systematic biases and unrealistic uncertainty assessments. We propose a multivariate postprocessing approach utilizing covariance tapering, combined with a dimension reduction step based on principal component analysis for efficient computation. Our proposed technique can correctly and efficiently handle non-stationary, non-isotropic and negatively correlated spatial error patterns, and is applicable on a global scale. Further, a moving average approach to marginal postprocessing is shown to flexibly handle trends in biases caused by global warming, and short training periods. In an application to global sea surface temperature forecasts issued by the Norwegian Climate Prediction Model (NorCPM), our proposed methodology is shown to outperform known reference methods.

Keywords: Covariance regularization, moving average, multivariate postprocessing, probabilistic forecast, sea surface temperature

1 Introduction

Seasonal, or medium-range, weather forecasts on a timescale of one month to a year ahead are highly important in a range of applications. Decisions makers can e.g. greatly benefit from skillful forecasts of increased danger for natural disasters or extreme weather events, such as droughts, hurricanes or extreme snowfall and winds, for efficient mitigation efforts and emergency management. Unlike short-range weather forecasting, medium-range forecasts rely on the prediction of atmospheric modes with a low-frequency variability which can be predicted months ahead. This includes the El Niño Southern Oscillation, monsoon rains and the Northern Atlantic Oscillation (Hoskins,, 2013). As ocean states change considerably slower than states in the atmosphere, these modes are typically associated with the sea surface temperature in certain regions. Therefore, reliable months-ahead forecasting of sea surface temperature is a crucial first step towards skillful seasonal forecasts of other weather phenomena. For example, the winter mean surface temperature in large parts of Europe is considered to be negatively correlated with the sea surface temperature in the Nordic seas in the preceding autumn (Kolstad and Årthun,, 2018; Dobrynin et al.,, 2018).

In order to be useful for decision-making, weather forecasts ought to be probabilistic in nature and well calibrated (Gneiting et al.,, 2007). Calibration implies that the probability of any event under the forecast distribution matches the actual frequency observed for the event. Current numerical weather prediction (NWP) models are typically deterministic and account for forecast uncertainty by generating an ensemble of forecasts where every ensemble member represents a possible simulation of the future, often generated by creating a small perturbation of the initial state of the prediction. However, as these models rely on simplifications of the underlying physical system, a (possibly too crude) discretization of space and imperfect initialization, they will be biased. Further, the ensemble spread may not accurately capture the forecast uncertainty. Hence statistical postprocessing is required, where the forecast model is recalibrated based on past performance and observations, see Vannitsem et al., (2018) for an overview.

In the postprocessing of medium-range forecasts, obtaining enough training data is a particular challenge. The high-frequency variability patterns need to be filtered out, such that observations have to be averaged over several weeks or months. As the period of reliable sea surface temperature observations starts around 1980, the beginning of the satellite era, the number of available observations for each season is typically below 50. Therefore, medium-range postprocessing techniques must be robust to minimize the risk of overfitting. Additionally, ongoing climate change leads to significant trends in biases and model uncertainty over time (Van Schaeybroeck and Vannitsem,, 2018).

Many questions stated by forecast users share the feature that they depend on the forecast distribution at multiple locations, so that the forecast must take into account complex dependencies for a skillful prediction of the answers. Examples include predicting the probability of the maximal sea surface temperature in a specific area exceeding 26.5∘ C, a necessary condition for the development of tropical cyclones (McTaggart-Cowan et al.,, 2015), or predicting the probability of observing sea ice along a shipping route. This requires multivariate postprocessing techniques.

A common approach to multivariate postprocessing is to fit variograms of a parametric covariance family, such as exponential covariance functions (e.g. Feldmann et al.,, 2015). This approach generally assumes stationarity and often nonnegative correlations decaying with distance. Neither of these assumptions are natural when considering global sea surface temperature, as it is likely to depend on ocean currents and the presence or absence of land near, or in between, locations. This is highlighted by Figure 1 showing normalized observed sea surface temperature for May 2016. Nonstationary effects are visible, such as the strong horizontal correlations in the Pacific ocean westwards of Peru and Columbia due to the El Nino Southern Oscillation. These effects commonly carry forward to the forecast errors in that the model captures the pattern but not the exact magnitude.

Given the physical complexity underlying the NWP model, forecast residuals of different locations may be negatively correlated. For example, the sea surface temperature forecast in our data set for July tends to underestimate the temperature in the Baltic sea, when overestimating the temperature in the Barents sea, and vice versa. Most parametric families are not able to model negative correlations, an exception being parametric hole effect models (e.g. Chilès and Delfiner,, 1999). However, these assume that locations at certain distances are always negatively correlated, which is not reasonable in our setup.

We propose a probabilistic multivariate postprocessing approach to tackle these issues and apply it to forecasts of monthly mean sea surface temperature issued by the Norwegian climate prediction model (NorCPM). A moving average approach ensures that the postprocessing will be robust against lack of training data and trends in biases and uncertainties caused by climate change. To achieve spatially consistent forecasts, we explicitly model the spatial dependence structure of the forecast residuals. We utilize regularization of the covariance structure by tapering the covariance matrix, and use further dimension reduction based on principal component analysis to reduce the computational time and reduce the risk of overfitting. Validation on out-of-sample observational data demonstrates that this multivariate postprocessing approach yields spatially consistent and well calibrated forecasts.

The paper is organized as follows: in Section 2, we first develop a univariate postprocessing technique based on moving averages, and extend this to the multivariate setting, incorporating covariance tapering, principal component analysis and a marginal variance correction. Section 3 outlines several validation and comparison tools for multivariate forecast distributions. In Section 4, we show how the proposed univariate and multivariate methods perform for the NorCPM forecasts and compare their performance to several reference methods. In Section 5, we consider a case study of a shipping route in the Northern Atlantic, forecasting the probability of ice along the route. The Section 6 gives the concluding remarks and discussion of results. The code for all our methods is available as R package at www.github.com/ClaudioHeinrich/pp.sst.

2 Modeling sea surface temperature

2.1 Data

We consider monthly mean forecasts of sea surface temperature (SST) issued by the Norwegian climate prediction model (NorCPM), see Counillon et al., (2014, 2016). The forecasts cover the entire globe on a longitude-latitude grid with resolution 1°, for a total of 64 800 grid points, where approximately 43 000 are located in the oceans. NorCPM issues new forecasts every 3 months, at the beginning of January, April, July and October, such that the lead times of the forecasts vary between one and three months. Each forecast consists of nine exchangeable members. For postprocessing, we consider forecasts from 1985 to 2016. The validation period is set to 2001–2016, while the years 1985–2000 are used to train the model. Throughout the validation period, the model estimation is updated for each time point to include the most recent observations. Observations of monthly mean SST over the period are obtained from the Optimum Interpolation Sea Surface Temperature (OISST) dataset of the National Oceanic and Atmospheric Administration (Reynolds et al.,, 2007).

2.2 Univariate postprocessing

A wide range of methods are available for univariate forecast postprocessing, e.g. ensemble Bayesian model averaging, nonhomogeneous regressions and quantile regression; see Wilks, (2018) for a recent overview. In our data set, both bias and prediction uncertainty depend strongly on spatial location and calendar month. Here, we postprocess data from each calendar month separately, ignoring possible interactions between months. In the following, a fixed month is considered with the monthly index suppressed for simplicity.

For a given year $y\in\{1985,\dots,2016\}$ and location $s\in\mathcal{S}$ , the SST forecast is assumed normally distributed,

[TABLE]

The mean and variance are estimated following a (weighted) moving average approach. Specifically, the mean is taken to be the bias-corrected NorCPM ensemble mean, $\widehat{\mu}_{y,s}=\overline{f}_{y,s}-\widehat{b}_{y,s}$ , where $\overline{f}$ denotes the mean of the raw ensemble. To account for trends in the bias over time, for instance caused by climatic changes not accounted for by NorCPM and improved reliability of observations, the bias and predictive variance are estimated by weighted moving averages:

[TABLE]

where $t_{j,s}$ denotes the observed temperature at year $j$ and location $s$ , and the sequences of weights $w_{1},w_{2}\dots$ are normalized. In Section 4, we compare the performance of a simple and an exponentially decaying weighting scheme. The weights for the bias are chosen by minimizing the mean squared error (MSE) of the bias-corrected forecast $\overline{f}-\widehat{b}$ . For the variance, they are chosen by minimizing the continuous rank probability score (CRPS). The CRPS for predictive distribution $\widehat{F}$ and observation $t$ is defined as

[TABLE]

where $X,X^{\prime}\sim\widehat{F}$ denote two independent random variables. It constitutes a proper scoring rule, cf. Gneiting and Raftery, (2007), and is often used to assess predictive performance.

A main advantage of this approach is that the weighted averages automatically account for month- and location-specific bias and uncertainty. Therefore, parameters of a weighting scheme may be estimated using all locations and all past months which prevents overfitting and makes optimal use of the available training data.

This comes at the cost that our univariate postprocessing technique is not inherently coherent, see Krzysztofowicz, (1999); Zhao et al., (2017), in the sense that it automatically produces forecasts that are no worse than climatology. Coherent models estimate the correlation between forecast and observation and produce skillful forecasts even when this correlation is negative. In our context, if this is done for each location separately it introduces a large risk of overfitting due to the short duration of the training period, whereas estimating the same correlation at all locations makes the estimate too crude. The model (2) makes the assumption that the NWP forecast indeed contains some signal, which is sometimes not satisfied in seasonal forecasting, see Schepen et al., (2014). On the other hand, the model (2) only requires to fit the parameters of the weighting scheme. These can be estimated using all locations and months, since for any weighting scheme the bias and variance estimates are month- and location-specific. As a consequence, the model (2) is very robust, which in our context is more valuable than coherence. Indeed, in Section 4.2 we demonstrate that it outperforms several coherent reference methods.

Both estimators rely on the ensemble mean only and are independent both of the number of ensemble members and the ensemble spread which is commonly used as predictor of uncertainty in short range weather forecasting (e.g. Messner et al.,, 2017). In seasonal to decadal forecasts the ensemble spread is known to be a less reliable predictor of forecast uncertainty (Ho et al.,, 2013). This is supported by the findings in Section 3. Increasing the number of ensemble members usually also benefits the skill of the ensemble mean (Buizza and Palmer,, 1998) and would decrease the variance of the estimators $\widehat{b}$ and $\widehat{\sigma}$ . Therefore, increasing the ensemble size remains desirable in our context, even though it does not directly enter our equations.

In the OISST dataset, the SST is truncated at $-1.79^{\circ}$ C, the assumed freezing temperature of sea water. As this is relevant for relatively few grid points, we apply the same truncation to the predictive distributions after the parameter estimation rather than assuming a truncated normal model in (1). In numerical experiments, the truncation error was found to be substantially smaller than the forecast uncertainty. NorCPM, as most climate prediction models, will inherently account for global warming. However, it relies on simplification of the underlying physical processes and is unlikely to fully describe the effects of climate change. Moreover, numerical prediction models, once initialized, tend to drift towards a model attractor which on the seasonal to decadal scale introduces changes in model biases over time. While this may be accounted for with a linear trend term in the bias model (Boer,, 2009), this was found to reduce the predictive ability here due to overfitting.

The proposed approach is compared against the related and well known non-homogeneous Gaussian regression (NGR) approach (Gneiting et al.,, 2005). Here, the mean and the variance are modeled as linear functions of predictor variables, most commonly the ensemble mean and the ensemble spread, i.e.

[TABLE]

where $S^{2}$ denotes the sample variance of the forecast ensemble, and $a,b,c,d$ are regression coefficients. The coefficients $a$ and $b$ are fitted by linear regression by minimizing the mean squared error of the forecast, whereas $c$ and $d$ are fitted by minimizing the CRPS over the training period. In order to assess sensitivity to month and location, we consider three different versions of NGR for comparison: Grouped by month and location, $NGR_{m,s}$ , where the coefficients may depend on both; grouped by location, $NGR_{s}$ ; and grouped by month, $NGR_{m}$ . In the original reference (Gneiting et al.,, 2005) $a,b,c$ and $d$ are estimated simultaneously to minimize the CRPS. When using the moving average estimates (2), estimating the parameters of the weighting schemes $w^{b},w^{\sigma}$ simultaneously is computationally costly which is why we estimate the weights $w^{b}$ for the bias correction by MSE-minimization first, and thereafter estimate the variance for the assumed bias correction. In Section 3 we assess the skill of the mean model by (out-of-sample) MSE. For this reason we deviate from Gneiting et al., (2005) and estimate the regression parameters $a,b$ by ordinary least squares.

The main disadvantage of $NGR_{m,s}$ in our context is the vast number of parameters which makes it prone to overfitting. An alternative method is usually referred to as locally adaptive NGR, see for example Scheuerer and König, (2014) . This method avoids fitting $a$ and $b$ for each location while still allowing for location-specific bias of the model which is achieved by considering model and temperature anomalies instead of absolute temperatures. More precisely, defining

[TABLE]

the locally adaptive NGR fits parameters $a,b$ by regressing $t^{\text{ano}}$ on $\overline{f}^{\text{ano}}$ , and issues the mean of the predictive distribution as

[TABLE]

Unlike the classical NGR this method accounts for the the mean of the predicted and true temperature varying largely between locations, and is used to estimate one pair of regression coefficients $a,b$ for all locations. We denote this reference method by $NGR^{la}_{m}$ , since the parameters $a$ and $b$ are estimated for each month separately and the method is locally adaptive.

2.3 Multivariate postprocessing

In order to obtain physically consistent postprocessed forecast fields, the model (1) must be extended to include spatial correlation. The main challenge is that the set $\mathcal{S}$ of considered locations contains around 42 000 points for the entire globe, and is very large compared to the sample size of up to 31 training years. To allow for non-stationary effects and negative correlations, we propose a postprocessing procedure based on regularization of the sample covariance matrix. It relies on a combination of tapering the sample covariance matrix and principal component analysis (PCA). These are classical tools for high-dimensional covariance estimation, but have found little attention in the context of statistical postprocessing of spatial data. As reference methods we compare the proposed technique to a geostationary approach (Feldmann et al.,, 2015) and ensemble copula coupling (Schefzik et al.,, 2013).

2.3.1 Postprocessing by regularization of the sample covariance matrix

The univariate model (1) is extended by estimating the covariance matrix of the forecast residuals. The residuals are assumed to follow a multivariate normal distribution

[TABLE]

where $\widehat{\mu}_{y}$ denotes the vector of bias-corrected forecasts $\overline{f}_{y}-\widehat{b}_{y}$ , and $t_{y}$ the vector of observed temperatures. The covariance matrix $\Sigma_{y}$ is multi-layered as it captures both the spatial climatological correlation between different locations on the globe and the forecast uncertainty including spatial interactions. Given an estimator of the covariance matrix, $\widehat{\Sigma}_{y}$ , a spatial forecast is issued as

[TABLE]

generalizing the marginal model in Equation (1). In the following, the year and month are assumed fixed and the indices $y$ and $m$ are suppressed.

The standard estimator of the covariance matrix is the sample covariance matrix (SCM):

[TABLE]

where the sum runs over all $Y$ previously observed years. However, in the high-dimensional setting with limited training data, the sample covariance estimator requires regularization. We propose a two-step procedure for regularizing $\mathbf{S}$ , first applying a distance-dependent tapering, or weighting, of the covariance matrix and secondly utilizing principal component analysis (PCA) to regularize the eigenstructure and reduce dimensionality.

For the first step, the tapering, we consider a positive, monotonically decreasing function $\phi$ , defining a spatial correlation function $C_{\phi}(s_{i},s_{j})=\phi(\|s_{i}-s_{j}\|)$ that only depends on the distance of the locations $s_{i},s_{j}\in\mathcal{S}$ . The SCM $\mathbf{S}$ , is then tapered by $\phi$ by

[TABLE]

The resulting tapered matrix is always positive semi-definite. Tapering covariance matrices by distances is frequently used in atmospheric sciences (Gaspari and Cohn,, 1999). Gneiting, (2002) argued that the weight function $\phi$ should be twice differentiable with $\phi^{\prime}(0)=0$ and a minimal second derivative $|\phi^{\prime\prime}(0)|$ , and suggests the function

[TABLE]

Here, $\phi$ is supported on $[0,1]$ , such that the tapering function $\phi_{L}$ has a tuning parameter $L$ determining its support. In numerical experiments, the performance of our postprocessing method performed best for $L$ between $1000$ km and $4000$ km. For the remaining of the paper, $L$ is set to 2500km.

The tapering is beneficial in two ways: Firstly, the SCM does not consider distance between locations, and it will thus have a high risk of spurious correlations given the large number of locations pairs ( $\sim 10^{9}$ ). The spatial correlation is likely to decrease with distance, and the tapering down-weights high correlations between distant locations as these are less credible than those between close locations. Secondly, it removes the rank deficiency of the SCM $\mathbf{S}$ and changes it into a full rank matrix. Indeed, the rank of $\mathbf{S}$ is limited by the number of observed years, $Y=31$ , significantly lower than $S\approx$ 42 000. The tapered covariance matrix having full rank makes it benefit more from regularization by principal component analysis (PCA).

To further reduce the risk of over-fitting and increase the speed of simulation, PCA is applied to the tapered covariance matrix estimate. PCA can be used to restrict the covariance estimator to a low-dimensional linear subspace with minimal information loss. In detail, we consider the eigenvalue decomposition of $\mathbf{S}_{\phi}$

[TABLE]

with orthogonal eigenvectors $U=[\mathbf{u}_{1},...,\mathbf{u}_{S}]$ and eigenvalues $\Lambda=\text{diag}(\lambda_{1},...,\lambda_{S})$ in decreasing order.

The ordered eigenvectors, usually referred to as principal components, are orthogonal linear combinations of the locations expressing the highest variance. The underlying assumption is that only the first $d\ll S$ principal components truly represent a signal, whereas the variability of the remaining components represents unstructured noise. Therefore, only the first $d$ eigenvectors are considered for the covariance estimate:

[TABLE]

The truncation of the eigenvalue decomposition will decrease the marginal sample variance at location $s$ , the diagonal element ${\mathbf{S}}_{ss}$ , for a given month:

[TABLE]

Assuming the marginal postprocessing yields calibrated marginal distributions, we want the marginal variances of the multivariate method to equal those estimated by the univariate method. We will therefore compare two alternative approaches for correcting the variance deflation, a multiplicative and an additive correction. In the multiplicative correction, the PCA step is performed on the (tapered) correlation matrix and transformed back to the covariance matrix

[TABLE]

where the marginal variances $\widehat{\sigma}^{2}_{y,s}$ are estimated as in (2).

Alternatively, we perform the PCA on the (tapered) covariance matrix and apply an additive correction to the marginal variances,

[TABLE]

To ensure the positive definiteness of $\widehat{\Sigma}^{ac}$ , the difference between the regularized and unregularized marginal variances has to be truncated at zero. The additive correction does not change the off-diagonal elements of $\widetilde{\Sigma}$ . It, however, only satisfies $\widehat{\Sigma}^{ac}_{ss}=\widehat{\sigma}_{y,s}^{2}$ for locations $s$ where $\widehat{\sigma}^{2}_{y,s}\geq\widetilde{\Sigma}_{ss}$ . As the marginal estimator $\widehat{\sigma}^{2}_{y,s}$ is not equal to the standard sample variance $(\mathbf{S})_{ss}$ , this is not guaranteed for all locations.

The main purpose of applying PCA in this way is to allow for more efficient sampling from the predictive distribution. For instance, when simulating an $S$ -dimensional normally distributed $X$ with zero mean and covariance matrix $\widehat{\Sigma}^{mc}$ , it is sufficient to simulate a $d$ -dimensional vector $Y$ and set

[TABLE]

where $U^{(d)}$ and $\Lambda^{(d)}$ contain the first $d$ eigenvectors and eigenvalues. We found that a dimension reduction with order of magnitude $S/d\approx 100$ leads to good results. As a consequence, simulating $X$ with (previously computed) covariance matrix $\widehat{\Sigma}^{mc}$ is approximately 100 times faster than simulating from a full rank normal distribution with known Cholesky-decomposition of the covariance matrix, disregarding fixed computation costs. When the additive correction is applied, $X$ is simulated by

[TABLE]

where $Y$ is $d$ -dimensional standard normal and $Z$ is $S$ -dimensional standard normal. This is slower than simulating from the multiplicative corrected covariance estimate, but nevertheless significantly faster than simulating from a general $S$ -dimensional normal distribution.

2.3.2 Reference methods

We consider two reference methods commonly used in statistical postprocessing of spatial forecasts, see Schefzik and Möller, (2018). A geostationary approach fits a parametric correlation model, assuming spatial stationarity and isotropy. The parametric correlation function is usually assumed to be in the Whittle-Matérn or the exponential family. Feldmann et al., (2015) suggest an exponential model with nugget, where the correlation of the forecast error at locations $s_{i}$ and $s_{j}$ can be written as

[TABLE]

The parameters $\theta$ and $r$ are estimated by fitting the variogram of the parametric model to the empirical variogram, for details see Feldmann et al., (2015).

Secondly, we compare to ensemble copula coupling (ECC), see Schefzik et al., (2013). The method constructs a postprocessed ensemble of the same size $N$ as the original NWP ensemble. The new ensemble is univariately calibrated and follows the same rank-order structure as the raw NWP ensemble. This is achieved in two steps: First, a univariately calibrated ensemble $\widetilde{x}^{(1)},...,\widetilde{x}^{(N)}$ is generated by considering $m$ equally spaced quantiles of the calibrated distribution (1), i.e.

[TABLE]

where $F_{s}$ denotes the cumulative distribution function of the univariate model (1) at location $s$ . Thereafter, the ensemble indices are permuted at each location to obtain an ensemble with the same rank order structure, or empirical copula, as the raw ensemble forecast. To achieve this, denote by $f^{(i)}_{s}$ the value of the $i$ th ensemble member at location $s$ of the raw forecast and find for each location a permutation $\rho_{s}$ such that

[TABLE]

is satisfied. Then, the $i$ th member of the multivariate ECC forecast ensemble is

[TABLE]

ECC is computationally efficient and it does not require the specification of a full multivariate distribution. A limitation is that the newly generated ensemble has the same number of members as the NWP ensemble. While extensions to larger ensembles using the rank order structure of historical observations have been proposed (e.g. Schefzik,, 2016), those do not apply in our setting of limited available historical observations.

As a third reference method we compare to the Schaake shuffle (Clark et al.,, 2004). The Schaake shuffle constructs a postprocessed ensemble that has the same empirical copula as past observations. It works very similar to ECC, except that a post-processed ensemble with the same rank order structure as past observations is constructed rather than with the rank order structure of the raw ensemble forecast. The size of the ensemble created by the Schaake shuffle is therefore not limited by the size of the NWP ensemble but by the number of years in the training data set.

3 Validation methods

We validate predictive performance by assessing the sharpness of the predictive distributions subject to calibration (Gneiting et al.,, 2007). Calibration, or reliability, refers to the statistical consistency between the forecast and the observations in the validation period, while sharpness refers to the spread of the predictive distribution. Subject to being calibrated, a sharper forecast is less uncertain and thus more informative.

Following Dawid, (1984), probabilistic calibration of marginal forecasts is assessed by the probability integral transform (PIT), i.e. the predictive cumulative distribution function $\widehat{F}$ evaluated at the observation $t$ . If $\widehat{F}$ is probabilistically calibrated, the PIT will be uniformly distributed, $\widehat{F}(t)\sim U([0,1])$ . To summarize the marginal calibration across grid point locations, we investigate the first two moments of the marginal PIT distribution over all time points in the validation period. A uniform distribution $U([0,1])$ has an expectation of $0.5$ and a standard deviation of $1/(2\sqrt{3})\simeq 0.29$ . It follows that $\mathbb{E}(\widehat{F}(t))<0.5$ indicates a positive bias and $\mathbb{E}(\widehat{F}(t))>0.5$ indicates a negative bias. If $\textup{SD}(\widehat{F}(t))<0.29$ , the forecast is overdispersive, and reversely, for $\textup{SD}(\widehat{F}(t))>0.29$ , it is underdispersive.

We assess multivariate calibration as proposed by Thorarinsdottir et al., (2016). Here, pre-rank functions are employed to map an ensemble of realizations from the multivariate forecast and the observation to real numbers which are subsequently ranked in a standard manner. If the forecast and the observations are statistically indistinguishable, the resulting histogram over the observation ranks is flat, whereas deviations from uniformity indicate miscalibration. Thorarinsdottir et al., (2016) propose two pre-rank functions which assess the multivariate calibration in slightly different manners. The average pre-rank function finds the average of the marginal univariate ranks while band depth ranking assesses the centrality of the observation within the forecast ensemble as proposed by López-Pintado and Romo, (2009).

Forecast accuracy is typically assessed using proper scoring rules (Winkler and Murphy,, 1968; Gneiting and Raftery,, 2007). Scoring rules assign a numerical score to each forecast-observation pair, where a lower value indicates better predictive performance. To assess the marginal accuracy, we use the mean squared error (MSE),

[TABLE]

where $\widehat{\mu}$ denotes the mean of $\widehat{F}$ . The continuous ranked probability score was defined in (3). For an ensemble $\b{x}=\{x_{1},\ldots,x_{N}\}$ , the CRPS equals

[TABLE]

The MSE is fast to compute and compares different mean models for the predictive distribution. The CRPS provides a more complete picture in that it assesses both calibration and sharpness (Gneiting and Raftery,, 2007).

For a multivariate assessment we utilize the multivariate variogram score (VS) proposed by Scheuerer and Hamill, (2015). For a multivariate distribution function $\widehat{F}$ and an observation vector $\bf{t}$ at $S$ locations, the VS of order $p$ is given by

[TABLE]

where $t_{i}$ is the observation at the $i$ th location and $X_{i}$ the $i$ th component of a random vector distributed according to $\widehat{F}$ . The (nonnegative) weights $\omega_{ij}$ are set to be constant, such that the correlation structure of all distances is assessed, and we select the order $p=0.5$ , as recommended by Scheuerer and Hamill, (2015).

To test significance of score differences, we apply a permutation test relying on resampling (Möller et al.,, 2013; Good,, 2013). Two predictive distributions $\widehat{F}_{1}$ and $\widehat{F}_{2}$ are compared under a scoring rule $S(F,\cdot)$ using the statistic

[TABLE]

The permutation test is then based on resampling copies of $s$ with the labels of $\widehat{F}_{1}$ and $\widehat{F}_{2}$ swapped for a random number of summands. Under the null hypothesis, $\widehat{F}_{1}$ and $\widehat{F}_{2}$ perform equally well and the permutations would have the same limiting distribution as the statistic, $s$ , as $n\to\infty$ . By considering the rank of the observed statistic within the permutations, an asymptotic test is obtained. Permutation tests are computationally efficient and, unlike the commonly applied Diebold-Mariano test (Diebold and Mariano,, 1995), they do not require the estimation of the asymptotic variance of the score difference $S_{i}$ which can be involved in the spatial dependence context.

Let us finally remark that validating forecasts in a high dimensional setting is a challenge in its own rights. For both variogram scores and multivariate rank histograms the role of the forecast dimension has been discussed in the original papers Scheuerer and Hamill, (2015); Thorarinsdottir et al., (2016), and they were found to perform well in dimension $S$ up to 20. When the dimension is much higher than that, and in particular larger than the number of available forecast-observation-couples, new issues arise. For example, the variogram score becomes computationally involved as the number of summands is $S^{2}$ . For multivariate rank histograms, on the other hand, slight misspecifications of the predictive marginal distributions tend to dominate the appearance of the histogram in very high dimension, making it less informative with regard to the multivariate predictive performance.

3njem

4 Results

4.1 Training period

The first step of the analysis is to determine optimal weighting parameters in Equation (2). For simple moving averages and window length $l$ , the corresponding weights are $w_{k}=\mathds{1}\{k\leq l\}/l,$ while for exponential moving averages with scale parameter $a$ the weights are $w_{k}\sim\exp(-ak)$ , for the $k$ th preceeding year. For the bias-correction, the weighting parameters are chosen for each year in the validation period by minimizing MSE as follows: For a range of weighting parameters, we bias-correct a set of previous forecasts in an out-of-sample fashion and compute the MSE. We then select the weighting parameter for which the MSE is minimized. Figure 2 shows this selection process for 2017. Here, the best performance is obtained for relatively short training periods of $l=12$ for SMA and $a=0.11$ for EMA, with a substantial improvement in the predictive performance under EMA. For the variance estimation, the weighting parameters are estimated by minimizing the CRPS. Here, the improvement by using weighted averages is more marginal and unweighted averages lead to close to optimal results. Furthermore, the optimal training period is usually somewhat longer than for the bias-correction. For 2017, the values would be $l=28$ for SMA and $a=0.05$ for EMA. These values are typical for years for which sufficient past training data is available.

4.2 Marginal predictive performance

For a more formal skill assessment, we compare the aggregated MSE for the SMA and EMA methods against the three NGR reference method. The results are summarized in Table 1. The $NGR_{m}$ method performs substantially worse than all others, demonstrating that model biases strongly depend on location. Further, $NGR_{m,s}$ performs significantly better than $NGR_{s}$ , indicating that the bias also varies between seasons. The locally adaptive method $NGR^{la}_{m}$ estimates the same number of parameters as $NGR_{m}$ , which it outperforms clearly, but performs worse than $NGR_{m,s}$ . Both SMA and EMA outperform all other methods, with EMA yielding the overall lowest value as demonstrated in Figure 2. The second row of the table shows $p$ -values of a permutation test assessing significance of the score difference between the model and the best performing model EMA. The seemingly small score differences are all highly significant, as these values constitute average scores over roughly 8 million score evaluations (192 validation months and 42.000 grid points). The $NGR_{m,s}$ model relies on a total of approximately $10^{6}$ parameters while the SMA and EMA approaches rely on one parameter each and are thus much more robust towards outliers.

We continue our analysis using EMA for the bias-correction. Different models for estimating marginal variances are compared in Table 2 using the CRPS. In particular for the CRPS, miniscule differences can be significant when averaging over many score evaluations. Permutation tests reveal that even the difference between the mean CRPS for SMA and EMA is highly significant.

In the supplementary material to this article we additionally assess mean and variance estimation for each month separately. This more detailed analysis supports the conclusion that the exponential moving average method performs best overall, in almost all instances significantly better than all other models.

The calibration of EMA and the best NGR method, $NGR_{m,s}$ are assessed in Figure 3. The postprocessed forecasts are overall well calibrated, indicated by a white color. However, the PIT values reveal small biases in the region governed by the Gulf stream and in the southern oceans below $-50^{\circ}$ south. The PIT mean values for the EMA method are between 0.26 and 0.72 for all locations, while they range from 0.23 to 0.78 for the $NGR_{m,s}$ . The figure indicates that the $NGR_{m,s}$ method exhibits similar biases as the exponential moving average approach, but tends to have larger biases overall. Figure 4 further shows the PIT standard deviations across locations for EMA, indicating overall good calibration except in the polar regions where the forecast is somewhat overdispersed.

4.3 Multivariate predictive performance

Here, we compare various multivariate postprocessing approaches where the marginal distributions are generated with EMA. For computational reasons, we restrict our analysis to an area covering the northern half of the Atlantic ocean, cf. Figure 5. The restricted area covers approximately 5600 grid points. Figure 5 shows the forecast residual, the difference between mean forecast and observations, for June 2016. The aim is for the multivariate correlation structure of the predictive distribution to produce similar spatial patterns. To assess this, we compare the methods in terms of variogram scores. To compute the variogram score, the moments of the predictive distribution, $\mathbb{E}_{F}[|X_{i}-X_{j}|^{1/2}]$ are estimated using 500 simulations from the distribution. For the ECC, the 9 forecast ensemble members are used instead. Table 3 shows the variogram scores averaged over all months in the validation period, with the significance of the score differences assessed by the permutation test. The lowest variogram score is achieved by the regularized covariance matrix with multiplicative correction, $\widehat{\Sigma}^{mc}$ , and the score is significantly lower than those for both the geostationary and ECC approach, at a 5% level. The regularized covariance approach with additive correction, $\widehat{\Sigma}^{ac}$ , achieves a similar variogram score as $\widehat{\Sigma}^{mc}$ , with a non-significant score difference at the 5% level.

An empirical assessment of simulated residulas (not shown) suggests that the regularization approach produces the most realistic spatial structure. The residuals generated by the geostationary approach look somewhat too coarse. This is likely caused by an overall poor fit of the parametric variogram to the empirical variogram, resulting in an overestimation of the nugget. The residuals generated by ECC, on the other hand, seem to vary too little on a large scale, which is mainly caused by the low number of only 9 ensemble members.

5 Case study

In a further assessment of the multivariate predictive distributions, we take a more applied angle and use the model to predict the minimum SST along a shipping route crossing the Atlantic Ocean from Bordeaux, France to Norfolk, USA, see Figure 5. The route has a length of 6205 km and we consider all grid cells that are intersected by the geodesic from Bordeaux to Norfolk a part of the route, a total of 93 grid cells. The minimum SST along this route depends jointly on the SST at all locations along the route, requiring spatially coherent forecasts. We consider the same methods of multivariate postprocessing as in the previous section, i.e. the regularization approach, both with multiplicative and additive correction of the marginal variance, as well as the geostationary model and ECC as reference.

For each method, we generate multiple simulations from the predictive distribution for each month of the validation period, and compute the minimum temperature along the route. The empirical distribution of simulated minima is then considered the probabilistic forecast of the minimum temperature along the route. For the regularization and geostationary approaches, we simulate 500 forecasts each, whereas for ECC the postprocessed ensemble containing 9 members is used. The accuracy of the forecasts is evaluated with the univariate scores MSE and CRPS, see Table 4. Permutation tests show that the score differences between the Schaake shuffle and the regularization approaches with multiplicative and additive correction of the marginal distribution are not significant at a level of 5%, whereas both the geostationary model and ECC show lower skill.

We further assess the multivariate calibration of the 93-dimensional forecasts using average and band depth ranking, see Figure 6. Note that the number of available observations in the test set is only 192, making it necessary to restrict the number of bins. The rank histograms of the regularization techniques and the geostationary approach look very similar. In fact, the correlation of the observation ranks is approximately 0.99 for any two of these methods, indicating that deviations from uniformity are mainly attributed to imperfect marginal calibration. All three methods exhibit too many high average ranks, corresponding to an overall underestimation of the temperature along the route, possibly caused by warming trends not accounted for in the numerical model. The same effect is responsible for too many low band depth ranks, signalizing non-centrality of the observation within the ensemble. The band depth rank histograms for ECC display a slight $\cap$ -shape indicating too strong spatial correlations in the empirical copula of the raw NWP ensemble. This is supported by plots of members from the raw ensemble forecast which tend to be visibly smoother than the observation (not shown). The band depth rank histogram of the Schaake shuffle looks fairly uniform. If anything, it is slightly $\cap$ -shaped which could indicate stronger correlations in the training period than in the validation period, and thus an increasing volatility in the minimum temperature possibly caused by warming. The average rank histogram for ECC and Schaake shuffle look more uniform than those for the other methods. This is presumably caused by the use of equally spaced quantiles which leads to a more evenly spread out predictive ensemble than any ensemble based on simulation. However, this comes at the cost of a very limited ensemble size.

6 Discussion

This paper proposes a fully probabilistic postprocessing approach for multivariate forecasts, the computational costs of which scale well to higher dimensions. The proposed method incorporates a moving average approach combined with regularization of the covariance matrix through tapering. The approach yields a predictive distribution allowing for non-stationary, non-isotropic and negative correlations in the forecasting error. Based on validation data, it performs well with little available training data and is therefore attractive for seasonal and long-range weather predictions.

We have applied the method to seasonal forecasts of sea surface temperature issued by the Norwegian Climate Prediction Model. Performance comparisons indicate that our methodology has higher predictive skill than two reference models, specifically empirical copula coupling and a geostationary method. The Schaake shuffle, our third reference model performed equally good in our case study, but significantly worse when assessed by the variogram score. The geostationary method assumes the forecast error to have a positive and stationary correlation structure, which in weather forecasting is a highly restrictive assumption, as the underlying physics in numerical prediction models will depend on geographic features not taken into account. Both Schaake shuffle and ECC construct ensembles of fixed size, namely the number of years in the trainings period (16, Schaake shuffle) or the size of the NWP ensemble (9, ECC). The small size of these ensembles restricts the usefulness of these methods for some tasks such as quantile estimation. This limitation is not shared by our fully probabilistic postprocessing method which can be used to construct ensembles of any size.

There are a variety of regularization methods for sample covariance matrices available in the literature, see Pourahmadi, (2013) for an overview. We believe that our combination of tapering and PCA has several advantages making it appealing in the context of postprocessing seasonal weather forecasts. Firstly, the tapering regularizes by distance, in the sense that the sample covariance of distant gridpoints is down-weighted, but still allows for spatial correlations over long ranges. Given the lack of training data, regularization by distance is more reliable than purely data-driven regularization methods such as e.g. sparse PCA, thresholding or graphical lasso. There are some regularization methods that assume sparsity of the precision matrix while accounting for spatial structure of the data, such as for example nearest-neighbour Gaussian fields (Datta et al.,, 2016). We expect such methods to lead to similar results as tapering. However, both tapering and PCA have a long tradition in the field of meteorology and are therefore more familiar to practitioners. Note that in our model PCA is applied to the already tapered covariance matrix, and its main purpose is not regularization, but reducing the dimension of the predictive distribution. This leads to a massive decrease in the computational costs when sampling from the predictive distribution.

For the covariance tapering we chose the range of our tapering function to be 2500 km. At the equator this corresponds to 22 grid points in each direction. In our experiments, choosing any range between 1000 and 4000 km led to good results and did not affect the conclusions derived from Sections 4 and 5. Given the very limited amount of data we refrained from attempting to fit the range parameter. For the PCA, we chose the number $d$ of principal components such that 90% of the variance where retained. Let us stress that PCA is applied to the already regularized tapered covariance matrix and its main purpose is not regularization, but dimension reduction which allows for faster sampling from the predictive distribution. When the area with 5.600 gridpoints shown in Figure 5 is considered, 90% of the variance corresponds to approximately 170 principal components (the exact number varying from month to month). For comparison, if 99% of the variance should be retained, roughly 600 principal components are required. It is, in fact, not straightforward to derive an optimal number of principal components based on validation, as validating high-dimensional forecasts in itself is involved. An optimization of the variogram score using cross-validation, for example, comes at tremendous computational costs and the score differences are insignificant over a large range of principal components. In other words, the impact of truncating the eigenspace of the tapered covariance matrix by PCA is not picked up by the validation tools we applied. This allows us to choose a relatively small number of principal components in order to lower the computational costs for sampling from the predictive distribution. The rule of keeping 90% is therefore a compromise that recovers most of the covariance structure while leading to an increase of sampling speed of factor 20 compared to using the full tapered covariance matrix.

We have investigated an additive and a multiplicative correction of the marginal variances of the multivariate model. The results indicate no significant difference between the skill of these corrections. As the additive correction does not recreate the marginal model (1) exactly, and is heavier computationally, the approach based on multiplicative correction should be preferred.

As emphasized by Van Schaeybroeck and Vannitsem, (2018), a main challenge for postprocessing of seasonal weather forecasts is the shortage of available training data. This is supported by the findings in Sections 4.2 and 4.3, which demonstrate the risk of overfitting when forecast distributions are estimated separately at each location. We utilized moving averages to estimate the biases and variances. Our approach estimates location-specific biases and variances, but we only need to determine a single, global, weighting parameter. Thus the approach will be more robust against outliers than non-homogeneous Gaussian regression grouped by month and location, the best performing reference model. Moreover, our moving average approach will account (to a certain extend) for the trends caused by global warming and the increase in reliability of temperature measurements over the last 30 years.

Future research directions include to consider ensemble information beyond the ensemble mean, for instance the single ensemble members. The good performance of ensemble copula coupling, considering the high number of locations and low number of ensemble members, indicates that the ensemble members do contain valuable information. Ensemble copula coupling relies fully on the empirical copula of the ensemble to capture the multivariate forecast structure, while our approach only considers variability around the ensemble mean. Combining both sources of information may be a fruitful way to extend postprocessing techniques in the high-dimensional setting. Moreover, we have currently not considered interactions between different months or seasons throughout the year. Early exploratory analyses showed that including information from previous months as predictors did not improved the forecast distribution. It seems, however, reasonable that forecast errors of different months may in fact be correlated and developing detailed models for such interactions may improve the predictive skill. Finally, our model does not account for sea ice in an appropriate way and could be improved by being combined with an external sea ice model.

Bibliography40

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Boer, (2009) Boer, G. (2009). Climate trends in a seasonal forecasting system. Atmos. ocean , 47(2):123–138.
2Buizza and Palmer, (1998) Buizza, R. and Palmer, T. (1998). Impact of ensemble size on ensemble prediction. Mon. weather rev. , 126(9):2503–2518.
3Chilès and Delfiner, (1999) Chilès, J.-P. and Delfiner, P. (1999). Geostatistics: modeling spatial uncertainty . John Wiley & Sons Inc., New York.
4Clark et al., (2004) Clark, M. P., Gangopadhyay, S., Hay, L. E., Rajagopalan, B., and Wilby, R. L. (2004). The Schaake Shuffle: A method for reconstructing space-time variability in forecasted precipitation and temperature fields. J. Hydrometeorol. , 5:243–262.
5Counillon et al., (2014) Counillon, F., Bethke, I., Keenlyside, N., Bentsen, M., Bertino, L., and Zheng, F. (2014). Seasonal-to-decadal predictions with the ensemble Kalman filter and the Norwegian Earth System Model: a twin experiment. Tellus A , 66(1):21074.
6Counillon et al., (2016) Counillon, F., Keenlyside, N., Bethke, I., Wang, Y., Billeau, S., Shen, M. L., and Bentsen, M. (2016). Flow-dependent assimilation of sea surface temperature in isopycnal coordinates with the Norwegian Climate Prediction Model. Tellus A , 68(1):32437.
7Datta et al., (2016) Datta, A., Banerjee, S., Finley, A., and Gelfand, A. (2016). Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. J. Am. Stat. Assoc. , 111(514):800–812.
8Dawid, (1984) Dawid, A. P. (1984). Statistical theory: The prequential approach. J. Roy. Statist. Soc. Ser. A , 147:278–292.