Skill and reliability of seasonal forecasts for the Chinese energy sector

Philip E. Bett; Hazel E. Thornton; Julia F. Lockwood; Adam A. Scaife; Nicola Golding; Chris Hewitt; Rong Zhu; Peiqun Zhang; Chaofan Li

arXiv:1703.06662·physics.ao-ph·June 26, 2018


TL;DR

This study evaluates the accuracy of seasonal forecasts for temperature, wind, and irradiance over China using GloSea5, highlighting regions with high forecast skill relevant to the energy sector's planning and renewable energy management.

Contribution

It provides a detailed assessment of the skill and reliability of seasonal climate forecasts over China, identifying specific regions and variables with high forecast skill for energy applications.

Findings

01

High skill in winter wind forecasts near South China Sea coast.

02

Good winter irradiance forecast skill in eastern central China.

03

Summer temperature forecasts show skill, especially around Beijing.

Abstract

We assess the skill and reliability of forecasts of winter and summer temperature, wind speed and irradiance over China, using the GloSea5 seasonal forecast system. Skill in such forecasts is important for the future development of seasonal climate services for the energy sector, allowing better estimates of forthcoming demand and renewable electricity supply. We find that although overall the skill from the direct model output is patchy, some high-skill regions of interest to the energy sector can be identified. In particular, winter mean wind speed is skilfully forecast around the coast of the South China Sea, related to skilful forecasts of the El Niño–Southern Oscillation. Such information could improve seasonal estimates of offshore wind power generation. Similarly, forecasts of winter irradiance have good skill in eastern central China, with possible use for solar power…

Equations

BSS = \frac{RES - REL}{UNC} .

ROCSS = 2 \times A_{ROC} - 1.


Full text

Adam A. Scaife: Met Office Hadley Centre, FitzRoy Road, Exeter EX1 3PB, UK; College of Engineering, Mathematics and Physical Sciences, University of Exeter, Exeter, UK

Nicola Golding, Chris Hewitt: Met Office Hadley Centre, FitzRoy Road, Exeter EX1 3PB, UK

Rong Zhu, Peiqun Zhang: Laboratory for Climate Studies, National Climate Center, China Meteorological Administration, Beijing, People’s Republic of China

Chaofan Li: Center for Monsoon System Research, Institute of Atmospheric Physics, Chinese Academy of Sciences, Beijing, People’s Republic of China

Skill and reliability of seasonal forecasts for the Chinese energy sector

Abstract

We assess the skill and reliability of forecasts of winter and summer temperature, wind speed and irradiance over China, using the GloSea5 seasonal forecast system. Skill in such forecasts is important for the future development of seasonal climate services for the energy sector, allowing better estimates of forthcoming demand and renewable electricity supply. We find that although overall the skill from the direct model output is patchy, some high-skill regions of interest to the energy sector can be identified. In particular, winter mean wind speed is skilfully forecast around the coast of the South China Sea, related to skilful forecasts of the El Niño–Southern Oscillation. Such information could improve seasonal estimates of offshore wind power generation. Similarly, forecasts of winter irradiance have good skill in eastern central China, with possible use for solar power estimation. Much of China shows skill in summer temperatures, which derives from an upward trend. However, the region around Beijing retains this skill even when detrended. This temperature skill could be helpful in managing summer energy demand. While both the strengths and limitations of our results will need to be considered when developing seasonal climate services in the future, the outlook for such service development in China is promising.

Journal: Journal of Applied Meteorology and Climatology (JAMC)

1 Introduction

The energy sector has long been a key user of weather and climate information across timescales: short-range weather forecasts (e.g. Taylor and Buizza 2003; Costa et al. 2008), longer-range forecasts out to several weeks ahead (e.g. Dubus 2014), as well as projections of possible future climates decades ahead (e.g. McColl et al. 2012; Wang et al. 2014). These are all used to inform the planning, development, management and running of energy systems on those timescales. The energy sector has also been a leader in demonstrating demand for seasonal to decadal climate prediction (Buontempo et al. 2010; Bruno Soares and Dessai 2015). In particular, seasonal forecasts of the climate in the coming 3-month period have the potential for providing real added value, in both a practical and financial sense, across a range of areas within the energy sector (Troccoli 2010; Doblas-Reyes et al. 2013; Dessai and Bruno Soares 2015; Bruno Soares and Dessai 2015). However, lack of skill in many areas of the globe has limited the uptake of this kind of information within the sector.

Energy demand is strongly related to air temperature (e.g. Valor et al. 2001; Hor et al. 2005; Apadula et al. 2012; Zhang et al. 2014; Thornton et al. 2016), and the potential use of seasonal climate forecasting in demand management has been recognised for many decades (e.g. Weiss 1982, and references therein). The need to reduce both greenhouse gas emissions and air pollution has driven an increase in the amount of electricity supplied by renewable sources. This has, in turn, resulted in an increase in the weather-dependence of energy supply systems, and therefore in the possible utility of weather and climate forecasting to the sector. The energy sector in China faces similar issues as other countries, although particular features are the increases in demand due to rapid urbanisation (Wang 2014; Lin and Ouyang 2014), and a recent large growth in both installed and planned renewable energy capacity (e.g. Hong et al. 2013; CNREC 2014; Lo 2014; Qiang et al. 2016).

Seasonal forecasts can give an early warning of a season of high demand – such as a particularly cold winter or hot summer – or of reduced supply, due to low wind speeds, more cloudy/hazy periods reducing solar power generation, or low water levels (which can affect both hydroelectric plants and the cooling systems for traditional thermal power plants). Predicting conditions that could damage energy infrastructure, such as storms, could also be valuable. In all these cases, seasonal forecasts can enable mitigation plans to be put in place: storing more water in dams, rescheduling maintenance work, making early decisions around staff availability and financial planning for the coming 3-month period. The information could be used by a range of people, including industry regulators, network operators, energy production companies, maintenance contractors, and financial market traders.

However, seasonal forecasts are most useful if they have sufficient skill to allow decision-making. Furthermore, what ‘sufficient’ means (beyond being statistically significant) will depend on the particular use case. If forecasts are not sufficiently skillful, then while an organisation might be happy to receive forecast information, they might not be able to make a decision based on it.

While skillful forecasts for some variables in some parts of the world have been possible for some time (e.g. Arribas et al. 2011), recent advances in seasonal forecasting systems have led to major improvements in the skill of extratropical features such as the North Atlantic Oscillation (NAO, e.g. Athanasiadis et al. 2014, 2016; Butler et al. 2016; Smith et al. 2016 and references therein). Scaife et al. (2014) demonstrated skill in NAO forecasts from version 5 of the Met Office’s Global Seasonal forecasting system, GloSea5 (MacLachlan et al. 2015). This has led to the development of seasonal climate services for various sectors in the UK, including hydrology (Svensson et al. 2015), transport (Palin et al. 2016) and energy (Clark et al. 2017). While seasonal forecasting has a long history in China, the traditional low skill from dynamical models has led to a wide literature in statistical downscaling and statistical forecasting (e.g. Wang et al. 2015; Xing et al. 2016). It is timely therefore to examine the skill of the GloSea5 system in China, for direct forecasts of energy-relevant climate variables. The results could allow the development of future climate services (Golding et al. 2017), that is, the provision and use of climate information to enable better informed decisions.

The development and use of such climate services has become a major undertaking world-wide, with international coordination being facilitated by the Global Framework for Climate Services (GFCS, Hewitt et al. 2012). The GFCS focuses on 5 priority sectors, one of which is the energy sector. China is developing its own framework aligned to the GFCS, called the China Framework for Climate Services. This brings together actors involved in scientific research, climate service development, service providers and users, to ensure that available capability and services meet users’ needs.

In this paper we assess the skill of seasonal forecasts of wind speed, irradiance, and temperature across China, from the GloSea5 system, and consider the implications for the wind power, solar power and energy demand sectors. We firstly describe the data sets and methods used in section 2. We then present our results in section 3, considering a China-wide overview of each variable before focusing on some specific areas of interest. We discuss our conclusions in section 4.

2 Data and analysis methods

2.1 Data sets

In this paper we use the hindcast data set produced to assess the version of GloSea5 that was deployed operationally at the Met Office in February 2015. This is based on the second Global Coupled configuration (GC2) of the HadGEM3 global climate model, described in Williams et al. (2015). HadGEM3-GC2 uses the GA6.0 configuration of the Met Office Unified Model (UM, version 8.4) as its atmospheric component, on an N216 grid (i.e. $432$ cells east–west by $324$ cells north–south, a horizontal resolution of $0.83°$ in longitude and $0.55°$ in latitude) and 85 vertical levels reaching a height of $85\,\mathrm{km}$ near the mesopause (Walters et al. 2016). This is coupled to the GL6.0 configuration of the JULES land surface model (Best et al. 2011), the GO5.0 configuration of the NEMO ocean model with a $0.25°$ nominal resolution and 75 vertical levels (version 3.4, Megann et al. 2014; Madec 2008), and the GSI6.0 configuration of the CICE sea ice model (version 4.1, Rae et al. 2015; Hunke and Lipscomb 2010). The GloSea5 system is described in full by MacLachlan et al. (2015) and references therein.

The assessment hindcast was produced to examine the skill of the system in forecasting for winter (December–January–February, DJF) and summer (June–July–August, JJA) only. Lagged ensemble ‘forecasts’ were initialised on three start dates centred on 1st November (for DJF) and 1st May (for JJA) for each of the 20 years of the hindcast data set, producing a total of 24 members for each hindcast season. The DJF hindcasts cover (boreal) winter 1992/1993 to winter 2011/2012, and the JJA hindcasts cover (boreal) summers 1992–2011. The details of the initialisation are described in MacLachlan et al. (2015).

Our ability to robustly assess forecast skill is limited by the size of the hindcast, both in terms of the number of years and the number of members. The operational GloSea5 forecast system uses 42 members, rather than the 24 available in the hindcast used here, so the probability distributions inferred using the hindcast will be less well resolved than they would be operationally. Furthermore, it has been shown that the skill itself depends directly on the size of the ensemble, as it allows better identification of predictable signals (Scaife et al. 2014; Eade et al. 2014; Dunstone et al. 2016). For our purposes, the ensemble size means that we can regard levels of skill shown here, where significant, to be lower limits of the actual skill that could be realized in the operational system. The robustness of the skill estimates is also limited by the number of years in the hindcast (Kumar 2009), as this places a restriction on the number of different types of event that are sampled in the period of study. The impact of the limited hindcast period is quantified by assessing the statistical significance of the correlations between the hindcast and observations, described in the next subsection.

We use the ERA-Interim reanalysis data (Dee et al. 2011) as a proxy for observations in this paper. While ERA-Interim and other reanalyses are frequently used to validate temperature and wind data from climate models, their use as a proxy for irradiance observations is more contested (e.g. Boilley and Wald 2015). We have compared some of our irradiance results against the SARAH-E satellite-derived observational data (Huld et al. 2016; Amillo et al. 2014), and find that, for the seasonal means averaged over large areas that we consider here, and in the standardised units we use, ERA-Interim compares very well. However, the use of climatological aerosols in both ERA-Interim and SARAH-E, and indeed GloSea5, means that the impact of aerosols on interannual variability remains an important uncertainty.

2.2 Skill assessment methodology

In this paper we focus on three meteorological variables of interest to the energy sector in China: near-surface air temperature (related to energy demand); 10-metre wind speed (linked to wind power generation); and downwelling shortwave irradiance at the surface (related to solar power generation). Precipitation is also of great importance for the energy sector, as China has a very large, and growing, hydroelectric industry. Li et al. (2016) have already shown that GloSea5 has significant skill in forecasting summer precipitation in the Yangtze river basin, where the Three Gorges Dam is located, and this has led to the development of a trial forecast service (Golding et al. 2017). Further work by Lu et al. (2017) found high levels of skill in GloSea5 forecasts of winter precipitation, over south-east China.

We take the following approach to skill assessments. For each variable, we first map the Pearson correlation between the hindcast ensemble mean and the observations. While the limited timespan and ensemble size of the hindcast mean that forecasts from single grid cells (or even small regions) are likely to be very noisy and not robust, these maps give a good indicative overview, and provide context when selecting larger geographical areas of interest for subsequent analysis. These regions are selected based on their interest to the energy sector: we consider the current and likely future development of substantial energy supply or demand, and whether they appear to have some promising skill (regions are not selected on the basis of skill maps alone).

We then assess the skill in each region in more detail. We consider 3 types of plot, each with an associated skill score:

1. Standardised time series (using the hindcast ensemble mean), with the Pearson correlation $r$. By standardised we mean that we subtract the average then divide by the standard deviation $\sigma$. The result is an anomaly time series in units of $\sigma$.

2. Reliability and sharpness diagrams, with the Brier skill score $\mathrm{BSS}$. These show the joint distribution of hindcast probabilities and observed frequencies for a particular class of event, and the $\mathrm{BSS}$ measures how much better the forecast system is compared to using climatology in that case (see Appendix A for further details).

3. ROC diagrams, with the ROC skill score $\mathrm{ROCSS}$. These describe the ability of the forecast system to distinguish between events occurring or not occurring (see Appendix B for further details).
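As a rough illustration of the ROC skill score, the sketch below builds ROC points from a set of probability forecasts and applies $\mathrm{ROCSS} = 2 \times A_{ROC} - 1$. The function name, the ten probability thresholds, the trapezium-rule area and the toy data are all assumptions made for illustration; they are not taken from the paper's own processing.

```python
def roc_skill_score(probs, occurred, n_thresholds=10):
    """ROCSS = 2 * A_ROC - 1, with the ROC area from the trapezium rule.

    probs: forecast probabilities of the event; occurred: matching booleans.
    Assumes at least one event and one non-event in the sample.
    """
    n_event = sum(1 for o in occurred if o)
    n_nonevent = len(occurred) - n_event
    points = [(1.0, 1.0)]  # threshold 0: the event is always forecast
    for k in range(1, n_thresholds + 1):
        thresh = k / n_thresholds
        hits = sum(1 for p, o in zip(probs, occurred) if o and p >= thresh)
        false_alarms = sum(1 for p, o in zip(probs, occurred) if not o and p >= thresh)
        # (false alarm rate, hit rate) for this probability threshold
        points.append((false_alarms / n_nonevent, hits / n_event))
    points.append((0.0, 0.0))  # the event is never forecast
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x0 - x1) * (y0 + y1) / 2
    return 2 * area - 1


# Hypothetical forecasts that separate events from non-events perfectly.
perfect = roc_skill_score([0.9, 0.8, 0.1, 0.2], [True, True, False, False])
# Hypothetical forecasts that always issue the same probability.
no_skill = roc_skill_score([0.5, 0.5, 0.5, 0.5], [True, False, True, False])
```

A forecast that always issues the same probability cannot discriminate between events and non-events, so its ROC curve collapses onto the diagonal and the score is zero.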

In all cases, we follow the standard WMO (2010) procedure for assessing such forecasts. In particular, this means weighting by the cosines of the grid cell latitudes when aggregating grid cells in the region in question.
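The cos-latitude weighting and the standardised time series can be sketched as follows. This is a minimal illustration only: the function names, the use of the population standard deviation, and the toy grid-cell values are assumptions, not details given in the text.

```python
import math


def coslat_weighted_mean(values, lats_deg):
    """Area-weighted mean of grid-cell values using cos(latitude) weights."""
    weights = [math.cos(math.radians(lat)) for lat in lats_deg]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def standardise(series):
    """Subtract the mean and divide by the standard deviation (units of sigma).

    Assumption: the population standard deviation (divide by n) is used.
    """
    n = len(series)
    mean = sum(series) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    return [(x - mean) / sigma for x in series]


def pearson_r(x, y):
    """Pearson correlation as the mean product of two standardised series."""
    zx, zy = standardise(x), standardise(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)


# Hypothetical region: three grid cells at different latitudes, four years.
lats = [20.0, 40.0, 60.0]
cells_by_year = [[1.0, 2.0, 3.0], [2.0, 2.5, 5.0], [0.5, 1.5, 2.0], [3.0, 3.5, 6.0]]
regional = [coslat_weighted_mean(year, lats) for year in cells_by_year]
regional_std = standardise(regional)
```

Cells nearer the equator cover more area on a regular latitude-longitude grid, so they receive larger weights in the regional mean.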

A Student’s t-test is often used to assess whether a Pearson sample correlation is significantly non-zero. This assumes that the variables in question are independent in time and have a Gaussian distribution. While this will only be approximately true in our case, we nevertheless use the t-test to give an indicative measure of significant skill in our correlation maps, to aid (but not determine) selection of regions of interest: we draw a contour around areas that would be significantly non-zero at the $5\%$ level if the assumptions of the test applied exactly (with $20$ years of data, the threshold in correlation corresponding to the $5\%$ significance level is $|r|>0.44$). We are not correcting for multiple testing here, so it should be expected that some of the regions marked as notionally significant will be false positives; again, the significance contour should not be taken as definitive.
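The correlation threshold can be checked with a short calculation. The test statistic for a sample correlation is $t = r\sqrt{n-2}/\sqrt{1-r^2}$, so the critical correlation is $t_c/\sqrt{t_c^2 + n - 2}$. The sketch below assumes a two-sided test and finds the critical $t$ value by numerical integration and bisection so that it needs only the standard library; the function names are illustrative.

```python
import math


def t_pdf(t, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + t * t / df) ** (-(df + 1) / 2)


def t_cdf(t, df, steps=2000):
    """CDF for t >= 0: Simpson's rule on [0, t], using the symmetry about zero."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def critical_r(n_years, alpha=0.05):
    """Smallest |r| that is significantly non-zero at level alpha (two-sided)."""
    df = n_years - 2
    lo, hi = 0.0, 50.0
    for _ in range(60):  # bisection for the (1 - alpha/2) quantile of t
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < 1 - alpha / 2:
            lo = mid
        else:
            hi = mid
    t_crit = (lo + hi) / 2
    return t_crit / math.sqrt(t_crit ** 2 + df)


print(round(critical_r(20), 2))  # compare with the 0.44 threshold quoted in the text
```

The same function shows how quickly the threshold rises for shorter hindcasts, which is one reason the hindcast length limits the robustness of the skill estimates.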

There are some cases where there is a clear trend running through the data. The reproduction of such a trend by the hindcast is a genuinely useful aspect of skill, as it shows that the forecast system is capable of maintaining the impact of whatever forcing caused the trend, after being initialised. However, it can hide information about the ability of the model to evolve correctly away from its initialised state more generally, which is also of interest when assessing the model. We have therefore also looked at the correlation skill after detrending, which we perform by simply removing the linear least-squares regression fits to the hindcast ensemble mean and observational time series, separately. Note that we are not making any assumptions as to the cause or significance of any trends. The time series is sufficiently short that natural interannual and decadal-scale variability will be very important, even before considering anthropogenic drivers of climate change such as CO2 emissions, land-use change (affecting effective surface roughness and hence wind speed) and aerosol emissions (affecting surface irradiance and temperature). Although ERA-Interim uses climatological aerosols, it assimilates other observations that could be forced by anomalous aerosol emissions, making it particularly complicated to assess in this way. We simply remove the empirical linear trend.
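The detrending step can be illustrated with a small sketch: fit a straight line to each series separately by least squares, subtract it, and correlate the residuals. The function names and the toy series (a shared trend plus unrelated year-to-year variations) are assumptions chosen to show how a common trend can dominate the raw correlation.

```python
import math


def detrend(series):
    """Remove the least-squares straight line fitted against the year index."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    slope = sxy / sxx
    return [y - (y_mean + slope * (t - t_mean)) for t, y in enumerate(series)]


def pearson_r(x, y):
    """Pearson sample correlation between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical series: the same upward trend, but unrelated interannual wiggles.
wiggle_obs = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.1]
wiggle_fc = [-0.1, 0.3, -0.3, 0.1, 0.0, -0.2, 0.2, 0.0]
obs = [0.5 * t + w for t, w in enumerate(wiggle_obs)]
fc = [0.5 * t + w for t, w in enumerate(wiggle_fc)]

r_raw = pearson_r(fc, obs)                            # dominated by the shared trend
r_detrended = pearson_r(detrend(fc), detrend(obs))    # interannual skill only
```

In this toy case the raw correlation is high purely because both series rise together, while the detrended correlation reveals no interannual skill.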

Our reliability and ROC diagrams are made in terms of probabilistic forecasts of particular types of event: we consider the probability of the variable in question being above the median, in the upper/middle/lower tercile, or in the top/bottom quintile, of its historical distribution. These quantiles are calculated for the hindcast and observational data sets independently, from their own climatologies. This means that our reliability diagrams and Brier skill scores are insensitive to a simple bias in the mean state between the two data sets.

For each type of event (e.g. upper tercile), the distribution of ensemble members each year provides the forecast probability of that event occurring, in each grid cell. We use cross-validation when calculating the quantile values, with a window length of 1 year (i.e. 1 DJF or JJA period), following WMO (2010): the quantile in question is calculated separately for each year, from the 19 years of data remaining after that year is masked out. Using bins of probability with width 0.1, we then consider all the years when the event was forecast to occur with probability in a given bin, and count the frequency of times when it was observed to actually occur. The counting is done in each grid cell, and we then pool the counts from all the grid cells in the region using cos-latitude weighted sums, to calculate the skill scores and reliability/ROC plots.
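The counting procedure for a single grid cell might look like the sketch below. Several details are assumptions made for illustration rather than statements of the paper's method: the hindcast threshold is taken from all ensemble members of the other years pooled together, quantiles use linear interpolation, a forecast probability of exactly 1 is placed in the top bin, and the data are invented. Pooling over a region would then sum these counts across grid cells with cos-latitude weights.

```python
def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list (0 <= q <= 1)."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def loo_threshold(yearly_values, year, q):
    """Quantile threshold for `year`, computed from all the other years only."""
    others = [v for i, vals in enumerate(yearly_values) if i != year for v in vals]
    return quantile(sorted(others), q)


def reliability_counts(members, obs, q=2 / 3, n_bins=10):
    """Per-bin counts of forecasts and of observed 'above quantile q' events.

    members: one list of ensemble-member values per year; obs: one value per year.
    Thresholds come from each data set's own climatology, so a mean bias
    between hindcast and observations does not affect the counts.
    """
    n_forecasts = [0] * n_bins
    n_events = [0] * n_bins
    for year in range(len(obs)):
        fc_thresh = loo_threshold(members, year, q)
        ob_thresh = loo_threshold([[o] for o in obs], year, q)
        n_above = sum(1 for m in members[year] if m > fc_thresh)
        # Integer arithmetic avoids floating-point errors at bin edges.
        b = min(n_above * n_bins // len(members[year]), n_bins - 1)
        n_forecasts[b] += 1
        n_events[b] += 1 if obs[year] > ob_thresh else 0
    return n_forecasts, n_events


# Hypothetical hindcast: 5 years, 4 members per year.
members = [[y + d for d in (-0.5, 0.0, 0.5, 1.0)] for y in range(5)]
obs = [0.2, 0.9, 2.1, 3.3, 3.8]
n_forecasts, n_events = reliability_counts(members, obs)
```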

3 Results

Here we show maps of the correlation between GloSea5 and ERA-Interim for each variable. In each case, we then go on to examine particular regions in more detail, through their regional time series, reliability and ROC diagrams.

3.1 Wind speed

Maps of the correlation between ERA-Interim and the GloSea5 hindcast for 10 m wind speeds are shown in Figure 1, for winter and summer. Maps of the correlation of detrended time series are practically indistinguishable (not shown).

While there are some areas of significant positive skill for wind speed in DJF, they are rather patchy. A major highlight however is the very high skill in the South China Sea, off the south and south-east coasts of China, with some skill being retained inland. This is likely to be related to the skill in forecasting the El Niño–Southern Oscillation ($r\approx 0.9$ for the DJF Niño3.4 index, see MacLachlan et al. 2015; the Niño3.4 index is the time series of sea surface temperature anomalies in the region 120°W–170°W, 5°S–5°N), and in its teleconnections over China: these are shown in Figure 2 in terms of correlations between the Niño3.4 index and wind speed. While the overall response in the region differs in detail between GloSea5 and ERA-Interim, the significant anticorrelation in the South China Sea is present in both, with weaker wind speeds correlated to El Niño events. In this region, as part of the East Asian winter monsoon, there is strong north-easterly flow around the south-eastern edge of the Siberian–Mongolian High (Chang et al. 2006). During El Niño events, there is increased subsidence over the Maritime Continent, increasing the surface pressure over that region and thus reducing the land–sea pressure gradient and resulting monsoonal winds (Zhang et al. 1996).

Figure 3 shows skill and reliability for winter wind speeds in this South China Sea region (the south-eastern green box in Figure 1). While the deterministic ensemble-mean forecast has a correlation of $r\approx 0.8$ , the reliability diagrams show that this region also exhibits skillful, reliable and sharp probabilistic forecasts of above-median wind speed events. Results are also very good for upper and lower tercile events, and upper and lower quintiles, although these are more noisy. This is reflected in the sharpness diagrams: high-probability forecasts of outer quintile events in particular are very poorly sampled by the hindcast, whereas forecasts of above-median events are well sampled across the full range of probabilities. It is also important to note that this skill in wind speeds is not retained in summer (as seen in the lower panel of Figure 1): in that case, the skill scores $r=0.13$ , $\mathrm{BSS}=-0.07$ and $\mathrm{ROCSS}=0.004$ are not significantly different to zero (not shown).

Some other regions also appear to have reasonably high levels of skill. Figure 4 shows the timeseries for winter wind speed in north-central China and southern Mongolia (northern green box in Figure 1). This is a region of particularly high wind resource (e.g. CNREC 2014; Davidson et al. 2016), so being able to forecast it could be of great practical use. However, the skill in this particular region is marginal ( $r=0.42$ ). Research is ongoing to see if statistical models, based on larger-scale atmospheric drivers such as the Arctic Oscillation and Middle Eastern Jet Stream (Yang et al. 2004; He et al. 2017), could result in higher levels of skill than the direct model output.

Yunnan province, in southern China, shows modest but significant skill for wind speeds, in both winter and summer (Figure 5; the region is also marked in Figure 1). Yunnan is very mountainous, and energy production has traditionally been dominated by hydroelectricity. However, in recent years there has been substantial drive to utilise the available wind resource (Liang et al. 2015), and a seasonal forecast could prove useful.

3.2 Irradiance

Figure 6 shows correlation maps for irradiance in winter and summer. In winter, there is a broad area of promising skill in eastern China and the East China Sea. This bears a strong resemblance to the patterns of skill in winter precipitation shown by Lu et al. (2017), perhaps unsurprisingly as both rainfall and irradiance are strongly related to cloudiness. Lu et al. (2017) determined that the key drivers of precipitation predictability here are ENSO, and rainfall in the eastern Indian Ocean/Bay of Bengal. It is reasonable to assume that the same processes that affect winter precipitation in this region also affect cloudiness and therefore downwelling shortwave irradiance at the surface.

Figure 7 shows the winter irradiance skill in the eastern China region in more detail, using the eastern green box marked in Figure 6: probabilistic forecasts remain reasonably skillful and reliable for tercile and outer quintile events. While this isn’t a region of particularly high solar radiation resource within China (CNREC 2014), it is an area of high population density with many urban centres, including Shanghai. The potential for high levels of demand modulated by large numbers of roof-mounted solar panels means that being able to forecast winters with more or less solar generation than usual could be of value.

In summer, the correlation overall across China is much poorer, with the eastern China region considered above now having $r=0.03$, consistent with zero (not shown). Regions further west appear to have higher levels of skill, but it is still very patchy.

As already discussed, since both ERA-Interim and GloSea5 use climatological aerosols, we are unable to assess the impact of interannual aerosol variability on seasonal irradiance forecasts. This might have a strong impact in urban areas affected by haze for example, and it remains an important uncertainty when considering the application of these results to solar power generation.

3.3 Temperature

Figure 8 shows correlation maps for temperature, including a comparison with detrended data. It is clear that there is significant skill in predicting summer temperatures over large areas of China, and that in many regions these are due to positive trends over the hindcast period. The winter temperatures are less affected by trends, and indeed show very little skill overall. As with winter wind speed, research is underway to improve the forecasts through the use of larger-scale atmospheric drivers.

One exception is Yunnan province in south–central China, which we also highlighted for wind speed skill: here it shows positive skill for both winter and summer temperatures. Yunnan is less urban than many more eastern parts of China, so the utility of a temperature forecast in energy demand planning is more limited. However, agriculture and tourism are both very important for Yunnan, which could benefit from a skillful seasonal temperature forecast.

Clearly, the potential for useful seasonal forecasts of energy demand, and hence temperature, is greatest in urban centres. In particular, energy demand in Beijing is strongly related to temperature in summer (Zhang et al. 2014). It is important therefore that the region around Beijing (the northern green box in Figure 8) also shows some skill for summer temperatures, before and after detrending, with correlations of $\sim 0.5$ – $0.6$ (Figure 9). There is less skill however for probabilistic forecasts more detailed than the simple ‘above-average’ case, although this might be improved by looking at a larger region, reducing statistical noise.

4 Discussion and conclusions

Our results have shown that, while overall skill for energy-relevant variables in China remains patchy, some specific areas have significant skill: winter wind speeds in the South China Sea, winter solar irradiance in eastern/southern China, and summer temperatures across much of China (due to the trend), including Beijing (even when detrended). Taken together with similarly-promising results for skillful summer precipitation in the Yangtze river basin (Li et al. 2016), and winter precipitation in southeastern China (Lu et al. 2017), there are clear opportunities to develop useful seasonal climate services for specific cases within China.

Indeed, taking our results and those on precipitation skill together, there are clear potential climate service applications beyond the energy sector: for example, forecasting risks to agriculture and transport, and risks of flooding.

We have only considered the skill of the direct model output from the GloSea5 hindcast here. This represents a minimum level of forecast skill, in two senses. Firstly, the operational forecast ensemble is larger than that available in this hindcast, and it is well-established that the forecast skill in GloSea5 increases with the size of the ensemble (e.g. Scaife et al. 2014; Li et al. 2016; Dunstone et al. 2016).

Secondly, statistical models linking larger-scale drivers directly to the impact variable of interest may offer further improvement in predictability (e.g. Scaife 2016). This technique has been used for seasonal forecasts in the UK (e.g. Svensson et al. 2015; Palin et al. 2016; Clark et al. 2017), and is often used already in China (e.g. Xiao et al. 2012; Wang et al. 2013; Peng et al. 2014; Wang et al. 2015; Xing et al. 2016). Research is ongoing to understand the predictability of larger-scale drivers in GloSea5, and how they can be used to improve sector-specific forecasts.

Furthermore, the most user-relevant services are likely to be forecasts of the particular impact of interest to the user, such as energy supply or demand. A next step in developing seasonal climate services based on these results should therefore be to assess the skill of GloSea5 against such direct impacts data, where available from a potential user. The way that forecasts are communicated and handled also affects the usefulness of the forecast (e.g. Taylor et al. 2015; Davis et al. 2016): user engagement is therefore key to optimising a climate prediction service.

Nevertheless, if co-developed with users and communicated carefully, our results show some areas of very promising skill, allowing the development of improved, skillful seasonal climate forecasting services for specific parts of the energy sector, and other sectors, in China.

Acknowledgements.

This work and its contributors (PB, JL, AS, CH and NG) were supported by the UK–China Research & Innovation Partnership Fund through the Met Office Climate Science for Service Partnership (CSSP) China as part of the Newton Fund. HT was supported by the Joint UK BEIS/Defra Met Office Hadley Centre Climate Programme (GA01101). CL was supported by the National Natural Science Foundation of China (Grant No. 41320104007). PB would like to thank Hongli Ren, Jo Camp, Robin Clark and Margaret Gordon for helpful discussions.

Appendix A. Reliability and sharpness diagrams, and Brier skill scores $\mathrm{BSS}$

Model reliability is a description of how closely the forecast probabilities of an event correspond to the frequency of that event being observed in historical data, assessing a conditional bias in the forecast system: for example, we might find that every time the event is forecast to occur with $70\%$ probability, it actually occurs only $60\%$ of the time. Having characterised such discrepancies, they can then be removed through calibration, resulting in improved forecasts. A full description can be found in Wilks (2011), but we describe the key points for interpreting our plots here.

The reliability diagram for a given class of event is a plot of the observed frequency of the event, at times when it was forecast to occur with a given probability. As described in section 2.2, we use bins of probability of width 0.1, and pool the event counts from all grid cells in the chosen region of interest. For the set of years where the event was forecast to occur with a given probability, we plot on the vertical axis the fraction of those years when the event actually occurred, and join the points from each bin with a line.

We mark additional lines in our reliability diagrams (sometimes called an attributes diagram, Hsu and Murphy 1986). The 1:1 line (black, solid) marks “perfect reliability”, differentiating between underconfident and overconfident forecasts – these will have steeper or shallower reliability lines respectively than the ‘perfect’ case. The climatological frequency (e.g. $1/3$ for terciles) is marked with a dotted horizontal “no resolution” line: if the reliability line lies along this, then it cannot resolve different events into different probabilities, as all forecasts occur at the climatological rate. A line midway between “perfect reliability” and the “no resolution” line is called the “no skill” line (dashed), as only points above this line make a positive contribution to the Brier skill score. We shade this region of skill in green.

The Brier skill score measures how much better the forecast system is relative to climatology (in general, the Brier skill score compares the forecast system to any reference forecast, but here we use climatology as the reference), and can be written as

$$\mathrm{BSS}=\frac{\mathrm{RES}-\mathrm{REL}}{\mathrm{UNC}}.$$

Here, the resolution $\mathrm{RES}$ is the weighted mean square distance between the points and the “no resolution” line, the reliability $\mathrm{REL}$ is the weighted mean square distance between the points and the “perfect reliability” line, and the “uncertainty” $\mathrm{UNC}$ is the product of the observed climatological frequency and its complement, e.g. $\frac{1}{3}\times\frac{2}{3}$ . The skill is positive if $\mathrm{RES}>\mathrm{REL}$ , i.e. if the points in the reliability line are closer to “perfect reliability” than to the “no resolution” line.
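The decomposition described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the paper: it assumes the standard form $\mathrm{BSS}=(\mathrm{RES}-\mathrm{REL})/\mathrm{UNC}$, uses the mean forecast probability in each bin as that bin's representative value, takes the climatological frequency from the sample itself, and ignores any grid-cell weighting. The function name `brier_decomposition` and its arguments are hypothetical.

```python
import numpy as np

def brier_decomposition(forecast_prob, observed, n_bins=10):
    """Return (REL, RES, UNC, BSS) for binned probability forecasts.

    forecast_prob: forecast probabilities in [0, 1]
    observed: 1 where the event occurred, 0 otherwise
    """
    p = np.asarray(forecast_prob, dtype=float)
    o = np.asarray(observed, dtype=float)

    # Assign each forecast to a probability bin of width 1/n_bins;
    # a forecast of exactly 1.0 goes in the top bin.
    idx = np.minimum((p * n_bins).astype(int), n_bins - 1)

    n_total = p.size
    clim = o.mean()  # observed climatological frequency (assumed from sample)

    rel = 0.0
    res = 0.0
    for k in range(n_bins):
        in_bin = idx == k
        n_k = in_bin.sum()
        if n_k == 0:
            continue  # empty bins contribute nothing
        p_k = p[in_bin].mean()  # representative forecast probability
        o_k = o[in_bin].mean()  # observed frequency in this bin
        rel += n_k * (p_k - o_k) ** 2   # distance from "perfect reliability"
        res += n_k * (o_k - clim) ** 2  # distance from "no resolution"

    rel /= n_total
    res /= n_total
    unc = clim * (1.0 - clim)
    bss = (res - rel) / unc
    return rel, res, unc, bss
```

For instance, ten forecasts of $70\%$ for an event that then occurs six times give $\mathrm{RES}=0$ and $\mathrm{REL}\approx 0.01$, so the sketch returns a slightly negative skill score, matching the overconfident example given earlier in this appendix.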

Below each reliability diagram, we include a sharpness diagram, a histogram of the distribution of forecasts made in each probability bin. If the histogram is flat, then the hindcast has sampled the full range of possible forecast probabilities and is described as sharp. If it is strongly peaked at the climatological frequency for the event, then the system has no sharpness and mostly just predicts climatology. Taken together, the sharpness and reliability diagrams provide a complete description of the joint distribution of observed frequencies and forecast probabilities.
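As a rough illustration of the sharpness diagram, the sketch below counts the fraction of forecasts falling in each probability bin of width 0.1. The function name `sharpness_histogram` is hypothetical, the counts are unweighted, and it is assumed that at least one forecast is supplied.

```python
import numpy as np

def sharpness_histogram(forecast_prob, n_bins=10):
    """Fraction of forecasts issued in each probability bin on [0, 1]."""
    p = np.asarray(forecast_prob, dtype=float)
    counts, edges = np.histogram(p, bins=n_bins, range=(0.0, 1.0))
    fractions = counts / counts.sum()  # normalise so the bars sum to 1
    return fractions, edges

# A system that mostly predicts near the tercile climatology (1/3):
fractions, edges = sharpness_histogram([0.35] * 8 + [0.05, 0.95])
```

In this example most of the weight sits in the bin containing the climatological frequency, which is the "no sharpness" behaviour described above.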

Appendix B: ROC diagrams and ROC skill scores $\mathrm{ROCSS}$

Relative Operating Characteristic (ROC) diagrams describe how well the forecast system can distinguish between classes of event occurring and not occurring (again, see Wilks 2011 for a fuller description). In practice, we construct ROC diagrams and scores using the same event classes and probability bins as used for the reliability diagrams, counting events and performing weighted sums over contributing grid cells. Four aggregates of the event counts are made, for each probability bin $p$:

• $N_{\mathrm{H}}(p)$, the number of hits: the number of times the event was forecast with probability $>p$, and observed to occur.
• $N_{\mathrm{M}}(p)$, the number of misses: the number of times the event was observed, but was not forecast with probability $>p$.
• $N_{\mathrm{FA}}(p)$, the number of false alarms: the number of times the event was forecast with probability $>p$, but was not observed to occur.
• $N_{\mathrm{CR}}(p)$, the number of correct rejections: the number of times the event was not observed to occur, and was not forecast with probability $>p$.

We then calculate:

• Hit Rate, $\mathrm{HR}(p)=N_{\mathrm{H}}/(N_{\mathrm{H}}+N_{\mathrm{M}})$
• False Alarm Rate, $\mathrm{FAR}(p)=N_{\mathrm{FA}}/(N_{\mathrm{FA}}+N_{\mathrm{CR}})$
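The four counts and two rates above can be sketched as follows. This is an illustrative, unweighted version (the weighted sums over grid cells are omitted), and it assumes the sample contains at least one occurrence and one non-occurrence of the event so that neither denominator is zero. The function name `hit_and_false_alarm_rates` is hypothetical.

```python
import numpy as np

def hit_and_false_alarm_rates(forecast_prob, observed, thresholds):
    """Return arrays (HR, FAR), one value per probability threshold."""
    p = np.asarray(forecast_prob, dtype=float)
    o = np.asarray(observed, dtype=bool)

    hr, far = [], []
    for t in thresholds:
        warned = p > t                  # event forecast with probability > t
        n_h = np.sum(warned & o)        # hits
        n_m = np.sum(~warned & o)       # misses
        n_fa = np.sum(warned & ~o)      # false alarms
        n_cr = np.sum(~warned & ~o)     # correct rejections
        hr.append(n_h / (n_h + n_m))
        far.append(n_fa / (n_fa + n_cr))
    return np.array(hr), np.array(far)

# Four forecasts, two of which verified:
hr, far = hit_and_false_alarm_rates([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0],
                                    thresholds=[0.5, 0.85])
```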

The ROC diagram is then a plot of Hit Rate against False Alarm Rate, for a series of probability thresholds. A skillful system has $\mathrm{HR}>\mathrm{FAR}$ and therefore a ROC curve in the top-left of the diagram; the 1:1 line (dashed in our plots) delineates no skill, as $\mathrm{HR}=\mathrm{FAR}$. We therefore use the area under the ROC curve $A_{\mathrm{ROC}}$ as a measure of skill, and scale it to produce a skill score that lies between 0 (no skill) and 1 (perfect):

$$\mathrm{ROCSS}=2A_{\mathrm{ROC}}-1.$$
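A minimal sketch of the skill score calculation is given below. It assumes the conventional scaling $\mathrm{ROCSS}=2A_{\mathrm{ROC}}-1$, which maps the no-skill area of 0.5 to 0 and a perfect area of 1 to 1, and it estimates the area with the trapezoidal rule after closing the curve at $(0,0)$ and $(1,1)$. Both the integration method and the function name `roc_skill_score` are assumptions made for illustration.

```python
import numpy as np

def roc_skill_score(hit_rate, false_alarm_rate):
    """ROC skill score from (FAR, HR) points, one per probability threshold."""
    hr = np.asarray(hit_rate, dtype=float)
    far = np.asarray(false_alarm_rate, dtype=float)

    # Close the curve at the trivial end points: never warn (0, 0)
    # and always warn (1, 1).
    far = np.concatenate(([0.0], far, [1.0]))
    hr = np.concatenate(([0.0], hr, [1.0]))

    # Order points by FAR, breaking ties by HR, so the curve is traced
    # from bottom-left to top-right.
    order = np.lexsort((hr, far))
    far, hr = far[order], hr[order]

    # Trapezoidal estimate of the area under the ROC curve.
    area = np.sum(0.5 * (hr[1:] + hr[:-1]) * np.diff(far))
    return 2.0 * area - 1.0
```

A single point on the 1:1 line, such as $\mathrm{HR}=\mathrm{FAR}=0.5$, gives an area of 0.5 and a score of 0, while a point at $\mathrm{HR}=1$, $\mathrm{FAR}=0$ gives an area of 1 and a score of 1.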

ROC diagrams are insensitive to calibration of the forecast probabilities, and so complement the reliability diagrams: they assess the potential usefulness of the forecast system after calibration.

Bibliography

Amillo, A., T. Huld, and R. Müller, 2014: A new database of global and direct solar radiation using the eastern Meteosat satellite, models and validation. Remote Sens., 6 (9), 8165–8189, 10.3390/rs6098165.
Apadula, F., A. Bassini, A. Elli, and S. Scapin, 2012: Relationships between meteorological variables and monthly electricity demand. Appl. Energy, 98, 346–356, 10.1016/j.apenergy.2012.03.053.
Arribas, A., and Coauthors, 2011: The GloSea4 ensemble prediction system for seasonal forecasting. Mon. Wea. Rev., 139 (6), 1891–1910, 10.1175/2010mwr3615.1.
Athanasiadis, P. J., and Coauthors, 2014: The representation of atmospheric blocking and the associated low-frequency variability in two seasonal prediction systems. J. Climate, 27 (24), 9082–9100, 10.1175/jcli-d-14-00291.1.
Athanasiadis, P. J., and Coauthors, 2016: A multi-system view of wintertime NAO seasonal predictions. J. Climate, 10.1175/jcli-d-16-0153.1.
Best, M. J., and Coauthors, 2011: The joint UK land environment simulator (JULES), model description – part 1: Energy and water fluxes. Geosci. Model Dev., 4 (3), 677–699, 10.5194/gmd-4-677-2011.
Boilley, A., and L. Wald, 2015: Comparison between meteorological re-analyses from ERA-Interim and MERRA and measurements of daily solar irradiation at surface. Renew. Energy, 75, 135–143, 10.1016/j.renene.2014.09.042.
Bruno Soares, M., and S. Dessai, 2015: Exploring the use of seasonal climate forecasts in Europe through expert elicitation. Climate Risk. Manage., 10, 8–16, 10.1016/j.crm.2015.07.001.