Multiscale inference and long-run variance estimation in nonparametric regression with time series errors
Marina Khismatullina, Michael Vogt

TL;DR
This paper introduces multiscale testing methods for qualitative features of nonparametric regression functions with time series errors, along with a new estimator for long-run variance in AR(p) processes, supported by theory, simulations, and climate data analysis.
Contribution
It presents novel multiscale tests for shape properties of regression functions with time series errors and a new difference-based estimator for long-run variance in AR(p) models.
Findings
Asymptotic validity of the proposed tests and estimator.
Simulation results demonstrate good finite-sample performance.
Application to climate data reveals meaningful trend features.
Abstract
In this paper, we develop new multiscale methods to test qualitative hypotheses about the regression function m in a nonparametric regression model with fixed design points and time series errors. In time series applications, m represents a nonparametric time trend. Practitioners are often interested in whether the trend m has certain shape properties. For example, they would like to know whether m is constant or whether it is increasing/decreasing in certain time regions. Our multiscale methods allow to test for such shape properties of the trend m. In order to perform the methods, we require an estimator of the long-run variance of the error process. We propose a new difference-based estimator of the long-run error variance for the case that the error terms form an AR(p) process. In the technical part of the paper, we derive asymptotic theory for the proposed multiscale test and the…
| nominal size | nominal size | nominal size | nominal size | nominal size | ||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.01 | 0.05 | 0.1 | 0.01 | 0.05 | 0.1 | 0.01 | 0.05 | 0.1 | 0.01 | 0.05 | 0.1 | 0.01 | 0.05 | 0.1 | ||||||
| 0.015 | 0.050 | 0.127 | 0.014 | 0.057 | 0.120 | 0.011 | 0.046 | 0.116 | 0.013 | 0.042 | 0.108 | 0.011 | 0.052 | 0.117 | ||||||
| 0.009 | 0.067 | 0.120 | 0.010 | 0.055 | 0.095 | 0.009 | 0.055 | 0.096 | 0.010 | 0.049 | 0.090 | 0.010 | 0.059 | 0.114 | ||||||
| 0.015 | 0.053 | 0.128 | 0.015 | 0.047 | 0.100 | 0.018 | 0.048 | 0.101 | 0.015 | 0.042 | 0.106 | 0.015 | 0.056 | 0.107 | ||||||
| MT | SiZer | MT | SiZer | MT | SiZer | MT | SiZer | MT | SiZer | MT | SiZer | |||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.018 | 0.112 | 0.040 | 0.374 | 0.104 | 0.575 | 0.017 | 0.106 | 0.034 | 0.347 | 0.092 | 0.522 | |||
| 0.012 | 0.140 | 0.058 | 0.426 | 0.080 | 0.621 | 0.012 | 0.130 | 0.046 | 0.399 | 0.074 | 0.578 | |||
| 0.005 | 0.140 | 0.041 | 0.489 | 0.097 | 0.680 | 0.006 | 0.136 | 0.039 | 0.452 | 0.097 | 0.639 | |||
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Distance to upper boundary
**Multiscale Inference
** **and Long-Run Variance Estimation
** **in Nonparametric Regression
** with Time Series Errors
Marina Khismatullina111Address: Bonn Graduate School of Economics, University of Bonn, 53113 Bonn, Germany. Email: [email protected].
University of Bonn
Michael Vogt222Address: Department of Economics and Hausdorff Center for Mathematics, University of Bonn, 53113 Bonn, Germany. Email: [email protected].
University of Bonn
In this paper, we develop new multiscale methods to test qualitative hypotheses about the function in the nonparametric regression model with time series errors . In time series applications, represents a nonparametric time trend. Practitioners are often interested in whether the trend has certain shape properties. For example, they would like to know whether is constant or whether it is increasing/decreasing in certain time regions. Our multiscale methods allow to test for such shape properties of the trend . In order to perform the methods, we require an estimator of the long-run error variance . We propose a new difference-based estimator of for the case that is an AR() process. In the technical part of the paper, we derive asymptotic theory for the proposed multiscale test and the estimator of the long-run error variance. The theory is complemented by a simulation study and an empirical application to climate data.
In this supplement, we provide the technical details and proofs that are omitted in the paper. In addition, we report the results of some robustness checks which complement the simulation exercises in Section 5 of the paper.
Key words: Multiscale statistics; long-run variance; nonparametric regression; time series errors; shape constraints; strong approximations; anti-concentration bounds.
AMS 2010 subject classifications: 62E20; 62G10; 62G20; 62M10.
1 Introduction
The analysis of time trends is an important aspect of many time series applications. In a wide range of situations, practitioners are particularly interested in certain shape properties of the trend. They raise questions such as the following: Does the observed time series have a trend at all? If so, is the trend increasing/decreasing in certain time regions? Can one identify the regions of increase/decrease? As an example, consider the time series plotted in Figure 1 which shows the yearly mean temperature in Central England from 1659 to 2017. Climatologists are very much interested in learning about the trending behaviour of temperature time series like this; see e.g. Benner (1999) and Rahmstorf et al. (2017). Among other things, they would like to know whether there is an upward trend in the Central England mean temperature towards the end of the sample as visual inspection might suggest.
In this paper, we develop new methods to test for certain shape properties of a nonparametric time trend. We in particular construct a multiscale test which allows to identify local increases/decreases of the trend function. We develop our test in the context of the following model setting: We observe a time series of the form
[TABLE]
for , where is an unknown nonparametric regression function and the error terms form a stationary time series process with . In a time series context, the design points represent the time points of observation and is a nonparametric time trend. As usual in nonparametric regression, we let the function depend on rescaled time rather than on real time . A detailed description of model (1.1) is provided in Section 2.
Our multiscale test is developed step by step in Section 3. Roughly speaking, the procedure can be outlined as follows: Let be the hypothesis that is constant in the time window , where is the midpoint and the size of the window. In a first step, we set up a test statistic for the hypothesis . In a second step, we aggregate the statistics for a large number of different time windows . We thereby construct a multiscale statistic which allows to test the hypothesis simultaneously for many time windows . In the technical part of the paper, we derive the theoretical properties of the resulting multiscale test. To do so, we come up with a proof strategy which combines strong approximation results for dependent processes with anti-concentration bounds for Gaussian random vectors. This strategy is of interest in itself and may be applied to other multiscale test problems for dependent data. As shown by our theoretical analysis, our multiscale test is a rigorous level--test of the overall null hypothesis that is simultaneously fulfilled for all time windows under consideration. Moreover, for a given significance level , the test allows to make simultaneous confidence statements of the following form: We can claim, with statistical confidence , that there is an increase/decrease in the trend on all time windows for which the hypothesis is rejected. Hence, the test allows to identify, with a pre-specified statistical confidence, time regions where the trend is increasing/decreasing.
For independent data, multiscale tests have been developed in a variety of different contexts in recent years. In the regression context, Chaudhuri and Marron (1999, 2000) introduced the so-called SiZer method which has been extended in various directions; see e.g. Hannig and Marron (2006) where a refined distribution theory for SiZer is derived. Hall and Heckman (2000) constructed a multiscale test on monotonicity of a regression function. Dümbgen and Spokoiny (2001) developed a multiscale approach which works with additively corrected supremum statistics and derived theoretical results in the context of a continuous Gaussian white noise model. Rank-based multiscale tests for nonparametric regression were proposed in Dümbgen (2002) and Rohde (2008). More recently, Proksch et al. (2018) have constructed multiscale tests for inverse regression models. In the context of density estimation, multiscale tests have been investigated in Dümbgen and Walther (2008), Rufibach and Walther (2010), Schmidt-Hieber et al. (2013) and Eckle et al. (2017) among others.
Whereas a large number of multiscale tests for independent data have been developed in recent years, multiscale tests for dependent data are much rarer. Most notably, there are some extensions of the SiZer approach to a time series context. Park et al. (2004) and Rondonotti et al. (2007) have introduced SiZer methods for dependent data which can be used to find local increases/decreases of a trend and which may thus be regarded as an alternative to our multiscale test. However, these SiZer methods are mainly designed for data exploration rather than for rigorous statistical inference. Our multiscale method, in contrast, is a rigorous level--test of the hypothesis which allows to make simultaneous confidence statements about the time regions where the trend is increasing/decreasing. Some theoretical results for dependent SiZer methods have been derived in Park et al. (2009), but only under a quite severe restriction: Only time windows with window sizes or scales are taken into account that remain bounded away from zero as the sample size grows. Scales that converge to zero as increases are excluded. This effectively means that only large time windows are taken into consideration. Our theory, in contrast, allows to simultaneously consider scales of fixed size and scales that converge to zero at various different rates. We are thus able to take into account time windows of many different sizes.
Our multiscale approach is also related to Wavelet-based methods: Similar to the latter, it takes into account different locations and resolution levels or scales simultaneously. However, while our multiscale approach is designed to test for local increases/decreases of a nonparametric trend, Wavelet methods are commonly used for other purposes. Among other things, they are employed for estimating/reconstructing nonparametric regression curves [see e.g. Donoho et al. (1995) or Von Sachs and MacGibbon (2000)] and for change point detection [see e.g. Cho and Fryzlewicz (2012)].
The test statistic of our multiscale method depends on the long-run error variance , which is usually unknown in practice. To carry out our multiscale test, we thus require an estimator of . Indeed, such an estimator is required for virtually all inferential procedures in the context of model (1.1). Hence, the problem of estimating in model (1.1) is of broader interest and has received a lot of attention in the literature; see Müller and Stadtmüller (1988), Herrmann et al. (1992) and Hall and Van Keilegom (2003) among many others. In Section 4, we discuss several estimators of which are valid under different conditions on the error process . Most notably, we introduce a new difference-based estimator of for the case that is an AR() process. This estimator improves on existing methods in several respects.
The methodological and theoretical analysis of the paper is complemented by a simulation study in Section 5 and an empirical application in Section 6. In the simulation study, we examine the finite sample properties of our multiscale test and compare it to the dependent SiZer methods introduced in Park et al. (2004) and Rondonotti et al. (2007). Moreover, we investigate the small sample performance of our estimator of in the AR() case and compare it to the estimator of Hall and Van Keilegom (2003). In Section 6, we use our methods to analyse the temperature data from Figure 1.
2 The model
We now describe the model setting in detail which was briefly outlined in the Introduction. We observe a time series of length which satisfies the nonparametric regression equation
[TABLE]
for . Here, is an unknown nonparametric function defined on and is a zero-mean stationary error process. For simplicity, we restrict attention to equidistant design points . However, our methods and theory can also be carried over to non-equidistant designs. The stationary error process is assumed to have the following properties:
- (C1)
The variables allow for the representation , where are i.i.d. random variables and is a measurable function. 2. (C2)
It holds that for some , where .
Following Wu (2005), we impose conditions on the dependence structure of the error process in terms of the physical dependence measure , where with being an i.i.d. copy of . In particular, we assume the following:
- (C3)
Define for . It holds that , where and .
The conditions (C1)–(C3) are fulfilled by a wide range of stationary processes . As a first example, consider linear processes of the form with , where are absolutely summable coefficients and are i.i.d. innovations with and . Trivially, (C1) and (C2) are fulfilled in this case. Moreover, if for some , then (C3) is easily seen to be satisfied as well. As a special case, consider an ARMA process of the form with , where and are real-valued parameters. As before, we let be i.i.d. innovations with and . Moreover, as usual, we suppose that the complex polynomials and do not have any roots in common. If does not have any roots inside the unit disc, then the ARMA process is stationary and causal. Specifically, it has the representation with for some , implying that (C1)–(C3) are fulfilled. The results in Wu and Shao (2004) show that condition (C3) (as well as the other two conditions) is not only fulfilled for linear time series processes but also for a variety of non-linear processes.
3 The multiscale test
In this section, we introduce our multiscale method to test for local increases/decreases of the trend function and analyse its theoretical properties. We assume throughout that is continuously differentiable on . The test problem under consideration can be formulated as follows: Let be the hypothesis that is constant on the interval . Since is continuously differentiable, can be reformulated as
[TABLE]
where is the first derivative of . We want to test the hypothesis not only for a single interval but simultaneously for many different intervals. The overall null hypothesis is thus given by
[TABLE]
where is some large set of points . The details on the set are discussed at the end of Section 3.1 below. Note that in general depends on the sample size , implying that the null hypothesis depends on as well. We thus consider a sequence of null hypotheses as increases. For simplicity of notation, we however suppress the dependence of on . In Sections 3.1 and 3.2, we step by step construct the multiscale test of the hypothesis . The theoretical properties of the test are analysed in Section 3.3.
3.1 Construction of the multiscale statistic
We first construct a test statistic for the hypothesis , where is a given interval. To do so, we consider the kernel average
[TABLE]
where is a kernel weight and is the bandwidth. In order to avoid boundary issues, we work with a local linear weighting scheme. We in particular set
[TABLE]
where
[TABLE]
for and is a kernel function with the following properties:
- (C4)
The kernel is non-negative, symmetric about zero and integrates to one. Moreover, it has compact support and is Lipschitz continuous, that is, for any and some constant .
The kernel average is nothing else than a rescaled local linear estimator of the derivative with bandwidth .333Alternatively to the local linear weights defined in (3.1), we could also work with the weights , where the kernel function is assumed to be differentiable and is its derivative. We however prefer to use local linear weights as these have superior theoretical properties at the boundary.
A test statistic for the hypothesis is given by the normalized kernel average , where is an estimator of the long-run variance of the error process . The problem of estimating is discussed in detail in Section 4. For the time being, we suppose that is an estimator with reasonable theoretical properties. Specifically, we assume that with . This is a fairly weak condition which is in particular satisfied by the estimators of analysed in Section 4. The kernel weights are chosen such that in the case of independent errors , for any location and bandwidth , where the long-run error variance simplifies to . In the more general case that the error terms satisfy the weak dependence conditions from Section 2, for any and under consideration. Hence, for sufficiently large sample sizes , the test statistic has approximately unit variance.
We now combine the test statistics for a wide range of different locations and bandwidths or scales . There are different ways to do so, leading to different types of multiscale statistics. Our multiscale statistic is defined as
[TABLE]
where and is the set of points that are taken into consideration. The details on the set are given below. As can be seen, the statistic does not simply aggregate the individual statistics by taking the supremum over all points as in more traditional multiscale approaches. We rather calibrate the statistics that correspond to the bandwidth by subtracting the additive correction term . This approach was pioneered by Dümbgen and Spokoiny (2001) and has been used in numerous other studies since then; see e.g. Dümbgen (2002), Rohde (2008), Dümbgen and Walther (2008), Rufibach and Walther (2010), Schmidt-Hieber et al. (2013) and Eckle et al. (2017).
To see the heuristic idea behind the additive correction , consider for a moment the uncorrected statistic
[TABLE]
and suppose that the hypothesis is true for all . For simplicity, assume that the errors are i.i.d. normally distributed and neglect the estimation error in , that is, set . Moreover, suppose that the set only consists of the points with and . In this case, we can write
[TABLE]
Under our simplifying assumptions, the statistics with are independent and standard normal for any given bandwidth . Since the maximum over independent standard normal random variables is as , we obtain that is approximately of size for small bandwidths . As for , this implies that tends to be much larger in size for small than for large bandwidths . As a result, the stochastic behaviour of the uncorrected statistic tends to be dominated by the statistics corresponding to small bandwidths . The additively corrected statistic , in contrast, puts the statistics corresponding to different bandwidths on a more equal footing, thus counteracting the dominance of small bandwidth values.
The multiscale statistic simultaneously takes into account all locations and bandwidths with . Throughout the paper, we suppose that is some subset of , where and denote some minimal and maximal bandwidth value, respectively. For our theory to work, we require the following conditions to hold:
- (C5)
for some arbitrarily large but fixed constant , where denotes the cardinality of . 2. (C6)
, that is, with defined in (C2) and .
According to (C5), the number of points in should not grow faster than for some arbitrarily large but fixed . This is a fairly weak restriction as it allows the set to be extremely large compared to the sample size . For example, we may work with the set
[TABLE]
which contains more than enough points for most practical applications. Condition (C6) imposes some restrictions on the minimal and maximal bandwidths and . These conditions are fairly weak, allowing us to choose the bandwidth window extremely large. The lower bound on depends on the parameter defined in (C2) which specifies the number of existing moments for the error terms . As one can see, we can choose to be of the order for any . Hence, we can let converge to [math] very quickly even if only the first few moments of the error terms exist. If all moments exist (i.e. ), may converge to [math] almost as quickly as . Furthermore, the maximal bandwidth is not even required to converge to [math], which implies that we can pick it very large.
Remark 3.1**.**
The above construction of the multiscale statistic can be easily adapted to hypotheses other than . To do so, one simply needs to replace the kernel weights defined in (3.1) by appropriate versions which are suited to test the hypothesis of interest. For example, if one wants to test for local convexity/concavity of , one may define the kernel weights such that the kernel average is a (rescaled) estimator of the second derivative of at the location with bandwidth .
3.2 The test procedure
In order to formulate a test for the null hypothesis , we still need to specify a critical value. To do so, we define the statistic
[TABLE]
where and are independent standard normal random variables. The statistic can be regarded as a Gaussian version of the test statistic under the null hypothesis . Let be the -quantile of . Importantly, the quantile can be computed by Monte Carlo simulations and can thus be regarded as known. Our multiscale test of the hypothesis is now defined as follows: For a given significance level , we reject if .
3.3 Theoretical properties of the test
In order to examine the theoretical properties of our multiscale test, we introduce the auxiliary multiscale statistic
[TABLE]
with . The following result is central to the theoretical analysis of our multiscale test. According to it, the (known) quantile of the Gaussian statistic defined in Section 3.2 can be used as a proxy for the -quantile of the multiscale statistic .
Theorem 3.1**.**
Let (C1)–(C6) be fulfilled and assume that with . Then
[TABLE]
A full proof of Theorem 3.1 is given in the Supplementary Material. We here shortly outline the proof strategy, which splits up into two main steps. In the first, we replace the statistic for each by a statistic with the same distribution as and the property that
[TABLE]
where and the Gaussian statistic is defined in Section 3.2. We thus replace the statistic by an identically distributed version which is close to a Gaussian statistic whose distribution is known. To do so, we make use of strong approximation theory for dependent processes as derived in Berkes et al. (2014). In the second step, we show that
[TABLE]
which immediately implies the statement of Theorem 3.1. Importantly, the convergence result (3.5) is not sufficient for establishing (3.6). Put differently, the fact that can be approximated by in the sense that does not imply that the distribution of is close to that of in the sense of (3.6). For (3.6) to hold, we additionally require the distribution of to have some sort of continuity property. Specifically, we prove that
[TABLE]
which says that does not concentrate too strongly in small regions of the form . The main tool for verifying (3.7) are anti-concentration results for Gaussian random vectors as derived in Chernozhukov et al. (2015). The claim (3.6) can be proven by using (3.5) together with (3.7), which in turn yields Theorem 3.1.
The main idea of our proof strategy is to combine strong approximation theory with anti-concentration bounds for Gaussian random vectors to show that the quantiles of the multiscale statistic can be proxied by those of a Gaussian analogue. This strategy is quite general in nature and may be applied to other multiscale problems for dependent data. Strong approximation theory has also been used to investigate multiscale tests for independent data; see e.g. Schmidt-Hieber et al. (2013). However, it has not been combined with anti-concentration results to approximate the quantiles of the multiscale statistic. As an alternative to strong approximation theory, Eckle et al. (2017) and Proksch et al. (2018) have recently used Gaussian approximation results derived in Chernozhukov et al. (2014, 2017) to analyse multiscale tests for independent data. Even though it might be possible to adapt these techniques to the case of dependent data, this is not trivial at all as part of the technical arguments and the Gaussian approximation tools strongly rely on the assumption of independence.
We now investigate the theoretical properties of our multiscale test with the help of Theorem 3.1. The first result is an immediate consequence of Theorem 3.1. It says that the test has the correct (asymptotic) size.
Proposition 3.1**.**
Let the conditions of Theorem 3.1 be satisfied. Under the null hypothesis , it holds that
[TABLE]
The second result characterizes the power of the multiscale test against local alternatives. To formulate it, we consider any sequence of functions with the following property: There exists with such that
[TABLE]
where is any sequence of positive numbers with . Alternatively to (3.8), we may also assume that for all . According to the following result, our test has asymptotic power against local alternatives of the form (3.8).
Proposition 3.2**.**
Let the conditions of Theorem 3.1 be satisfied and consider any sequence of functions with the property (3.8). Then
[TABLE]
The proof of Proposition 3.2 can be found in the Supplementary Material. To formulate the next result, we define
[TABLE]
together with
[TABLE]
is the collection of intervals for which the (corrected) test statistic lies above the critical value , that is, for which our multiscale test rejects the hypothesis . and can be interpreted analogously but take into account the sign of the statistic . With this notation at hand, we consider the events
[TABLE]
(, ) is the event that the function is non-constant (increasing, decreasing) on all intervals (, ). More precisely, (, ) is the event that for each interval (, ), there is a subset with being a non-constant (increasing, decreasing) function on . We can make the following formal statement about the events , and , whose proof is given in the Supplementary Material.
Proposition 3.3**.**
Let the conditions of Theorem 3.1 be fulfilled. Then for , it holds that
[TABLE]
According to Proposition 3.3, we can make simultaneous confidence statements of the following form: With (asymptotic) probability , the trend function is non-constant (increasing, decreasing) on some part of the interval for all (, ). Hence, our multiscale procedure allows to identify, with a pre-specified confidence, time regions where there is an increase/decrease in the time trend .
Remark 3.2**.**
Unlike , the sets and only contain intervals which are subsets of . We thus exclude points and which lie at the boundary, that is, for which . The reason is as follows: Let with . Our technical arguments allow us to say, with asymptotic confidence , that for some . However, we cannot say whether or , that is, we cannot make confidence statements about the sign. Crudely speaking, the problem is that the local linear weights behave quite differently at boundary points with . As a consequence, we can include boundary points in but not in and .
The statement of Proposition 3.3 suggests to graphically present the results of our multiscale test by plotting the intervals for , that is, by plotting the intervals where (with asymptotic confidence ) our test detects a violation of the null hypothesis. The drawback of this graphical presentation is that the number of intervals in is often quite large. To obtain a better graphical summary of the results, we replace by a subset which is constructed as follows: As in Dümbgen (2002), we call an interval minimal if there is no other interval with . Let be the set of all minimal intervals in for and define the events
[TABLE]
It is easily seen that for . Hence, by Proposition 3.3, it holds that
[TABLE]
for . This suggests to plot the minimal intervals in rather than the whole collection of intervals as a graphical summary of the test results. We in particular use this way of presenting the test results in our application in Section 6.
4 Estimation of the long-run error variance
In this section, we discuss how to estimate the long-run variance of the error terms in model (2.1). There are two broad classes of estimators: residual- and difference-based estimators. In residual-based approaches, is estimated from the residuals , where is a nonparametric estimator of with the bandwidth or smoothing parameter . Difference-based methods proceed by estimating from the -th differences of the observed time series for certain orders . In what follows, we focus attention on difference-based methods as these do not involve a nonparametric estimator of the function and thus do not require to specify a bandwidth for the estimation of . To simplify notation, we let denote the -th differences of a general time series throughout the section.
4.1 Weakly dependent error processes
We first consider the case that is a general stationary error process. We do not impose any time series model such as a moving average (MA) or an autoregressive (AR) model on but only require that satisfies certain weak dependence conditions such as those from Section 2. These conditions imply that the autocovariances decay to zero at a certain rate as . For simplicity of exposition, we assume that the decay is exponential, that is, for some and . In addition to these weak dependence conditions, we suppose that the trend is smooth. Specifically, we assume to be Lipschitz continuous on , that is, for all and some constant .
Under these conditions, a difference-based estimator of can be obtained as follows: To start with, we construct an estimator of the short-run error variance . As is Lipschitz continuous, it holds that . Hence, the differences of the observed time series are close to the differences of the unobserved error process as long as is not too large in comparison to . Moreover, since , we have that . Taken together, these considerations yield that , which motivates to estimate by
[TABLE]
where we assume that with and . Estimators of the autocovariances for can be derived by similar considerations. Since , we may in particular define
[TABLE]
for any . Difference-based estimators of the type (4.1) and (4.2) have been used in different contexts in the literature before. Estimators similar to (4.1) and (4.2) were analysed, for example, in Müller and Stadtmüller (1988) and Hall and Van Keilegom (2003) in the context of -dependent and autoregressive error terms, respectively. In order to estimate the long-run error variance , we may employ HAC-type estimation procedures as discussed in Andrews (1991) or De Jong and Davidson (2000). In particular, an estimator of may be defined as
[TABLE]
where is a kernel (e.g. of Bartlett or Parzen type) and is a bandwidth parameter with and . The additional bandwidth comes into play because estimating under general weak dependence conditions is a nonparametric problem. In particular, it is equivalent to estimating the (nonparametric) spectral density of the process at frequency [math] (assuming that exists).
Estimating the long-run error variance under general weak dependence conditions is a notoriously difficult problem. Estimators of such as from (4.3) tend to be quite imprecise and are usually very sensitive to the choice of the smoothing parameter, that is, to in the case of from (4.3). To circumvent this issue in practice, it may be beneficial to impose a time series model on the error process . Estimating under the restrictions of such a model may of course create some misspecification bias. However, as long as the model gives a reasonable approximation to the true error process, the produced estimates of can be expected to be fairly reliable even though they are a bit biased. Which time series model is appropriate of course depends on the application at hand. In the sequel, we follow authors such as Hart (1994) and Hall and Van Keilegom (2003) and impose an autoregressive structure on the error terms , which is a very popular error model in many application contexts. We thus do not dwell on the nonparametric estimator from (4.3) any further but rather give an in-depth analysis of the case of autoregressive error terms.
4.2 Autoregressive error processes
Estimators of the long-run error variance in model (2.1) have been developed for different kinds of error processes . A number of authors have analysed the case of MA() or, more generally, -dependent error terms. Difference-based estimators of for this case were proposed in Müller and Stadtmüller (1988), Herrmann et al. (1992) and Tecuapetla-Gómez and Munk (2017) among others. Under the assumption of -dependence, for all . Even though -dependent time series are a reasonable error model in some applications, the condition that is exactly equal to [math] for sufficiently large lags is quite restrictive in many situations. Presumably the most widely used error model in practice is an AR() process. Residual-based methods to estimate in model (2.1) with AR() errors can be found for example in Truong (1991), Shao and Yang (2011) and Qiu et al. (2013). A difference-based method was proposed in Hall and Van Keilegom (2003).
In what follows, we introduce a difference-based estimator of for the AR() case which improves on existing methods in several respects. As in Hall and Van Keilegom (2003), we consider the following situation: is a stationary and causal AR() process of the form
[TABLE]
where are unknown parameters and are i.i.d. innovations with and . The AR order is known and is Lipschitz continuous on , that is, for all and some constant . Since is causal, the variables have an MA() representation of the form . The coefficients can be computed iteratively from the equations
[TABLE]
for , where , for and for . Moreover, the coefficients can be shown to decay exponentially fast to zero as , in particular, with some and .
Our estimation method relies on the following simple observation: If is an AR() process of the form (4.4), then the time series of the differences is an ARMA() process of the form
[TABLE]
As is Lipschitz, the differences of the unobserved error process are close to the differences of the observed time series in the sense that
[TABLE]
Taken together, (4.6) and (4.7) imply that the differenced time series is approximately an ARMA() process of the form (4.6). It is precisely this point which is exploited by our estimation methods.
We first construct an estimator of the parameter vector . For any , the ARMA() process satisfies the Yule-Walker equations
[TABLE]
where and are the coefficients from the MA() expansion of . From (4.8) and (4.9), we get that
[TABLE]
where , and denotes the covariance matrix . Since the coefficients decay exponentially fast to zero, and thus for large values of . This suggests to estimate by
[TABLE]
where and are defined analogously as and with replaced by the sample autocovariances and goes to infinity sufficiently fast as , specifically, with and .
The estimator depends on the tuning parameter , which is very similar in nature to the two tuning parameters of the methods in Hall and Van Keilegom (2003). An appropriate choice of needs to take care of the following two points: (i) should be chosen large enough to ensure that the vector is close to zero. As we have already seen, the constants decay exponentially fast to zero and can be computed from the recursive equations (4.5) for given AR parameters . In the AR() case, for example, one can readily calculate that for any and any . Hence, if we have an AR() model for the errors and the error process is not too persistent, choosing such that should make sure that is close to zero. Generally speaking, the recursive equations (4.5) can be used to get some idea for which values of the vector can be expected to be approximately zero. (ii) should not be chosen too large in order to ensure that the trend is appropriately eliminated by taking -th differences. As long as the trend is not very strong, the two requirements (i) and (ii) can be fulfilled without much difficulty. For example, by choosing in the AR() case just discussed, we do not only take care of (i) but also make sure that moderate trends are differenced out appropriately.
When the trend is very pronounced, in contrast, even moderate values of may be too large to eliminate the trend appropriately. As a result, the estimator will have a strong bias. In order to reduce this bias, we refine our estimation procedure as follows: By solving the recursive equations (4.5) with replaced by , we can compute estimators of the coefficients and thus estimators of the vectors for any . Moreover, the innovation variance can be estimated by , where and is the -th entry of the vector . Plugging the expressions , , and into (4.10), we can estimate by
[TABLE]
where is any fixed number with . In particular, unlike , the parameter does not diverge to infinity but remains fixed as the sample size increases. As one can see, the estimator is based on differences of some small order ; only the pilot estimator relies on differences of a larger order . As a consequence, should eliminate the trend more appropriately and should thus be less biased than the pilot estimator . In order to make the method more robust against estimation errors in , we finally average the estimators for a few small values of . In particular, we define
[TABLE]
where is a small natural number. For ease of notation, we suppress the dependence of on the parameter . Once is computed, the long-run variance can be estimated by
[TABLE]
where with is an estimator of the innovation variance and we make use of the fact that for the AR() process .
We briefly compare the estimator to competing methods. Presumably closest to our approach is the procedure of Hall and Van Keilegom (2003). Nevertheless, the two approaches differ in several respects. The two main advantages of our method are as follows:
- (a)
Our estimator produces accurate estimation results even when the AR process is quite persistent, that is, even when the AR polynomial has a root close to the unit circle. The estimator of Hall and Van Keilegom (2003), in contrast, may have very high variance and may thus produce unreliable results when the AR polynomial is close to having a unit root. This difference in behaviour can be explained as follows: Our pilot estimator has the property that the estimated AR polynomial has no root inside the unit disc, that is, for all complex numbers with .444More precisely, for all with , whenever the covariance matrix is non-singular. Moreover, is non-singular whenever , which is the generic case. Hence, the fitted AR model with the coefficients is ensured to be stationary and causal. Even though this may seem to be a minor technical detail, it has a huge effect on the performance of the estimator: It keeps the estimator stable even when the AR process is very persistent and the AR polynomial has almost a unit root. This in turn results in a reliable behaviour of the estimator in the case of high persistence. The estimator of Hall and Van Keilegom (2003), in contrast, may produce non-causal results when the AR polynomial is close to having a unit root. As a consequence, it may have unnecessarily high variance in the case of high persistence. We illustrate this difference between the estimators by the simulation exercises in Section 5.3. A striking example is Figure 5, which presents the simulation results for the case of an AR() process with and clearly shows the much better performance of our method. 2. (b)
Both our pilot estimator and the estimator of Hall and Van Keilegom (2003) tend to have a substantial bias when the trend is pronounced. Our estimator reduces this bias considerably as demonstrated in the simulations of Section 5.3. Unlike the estimator of Hall and Van Keilegom (2003), it thus produces accurate results even in the presence of a very strong trend.
We now derive some basic asymptotic properties of the estimators , and . The following proposition shows that they are -consistent.
Proposition 4.1**.**
Let be a causal AR() process of the form (4.4). Suppose that the innovations have a finite fourth moment and let be Lipschitz continuous. If with and , then as well as and .
It can also be shown that , and are asymptotically normal. In general, their asymptotic variance is somewhat larger than that of the estimators in Hall and Van Keilegom (2003). They are thus a bit less efficient in terms of asymptotic variance. However, this theoretical loss of efficiency is more than compensated by the advantages discussed in (a) and (b) above, which lead to a substantially better small sample performance as demonstrated in the simulations of Section 5.3.
5 Simulations
To assess the finite sample performance of our methods, we conduct a number of simulations. In Sections 5.1 and 5.2, we investigate the performance of our multiscale test and compare it to the SiZer methods for time series developed in Park et al. (2004), Rondonotti et al. (2007) and Park et al. (2009). In Section 5.3, we analyse the finite sample properties of our long-run variance estimator from Section 4.2 and compare it to the estimator of Hall and Van Keilegom (2003).
5.1 Size and power properties of the multiscale test
Our simulation design mimics the situation in the application example of Section 6. We generate data from the model for different trend functions , error processes and time series lengths . The error terms are supposed to have the AR() structure , where and are i.i.d. standard normal. In addition, we consider the AR() specification , where are normally distributed with and . We set , and , thus matching the estimated values obtained in the application of Section 6. To simulate data under the null hypothesis, we let be a constant function. In particular, we set without loss of generality. To generate data under the alternative, we consider the trend functions with . These functions are broken lines with a kink at and different slopes . Their shape roughly resembles the trend estimates in the application of Section 6. The slope parameter corresponds to a trend with the value at the right endpoint . We thus consider broken lines with the values . Inspecting the middle panel of Figure 7, the broken lines with the endpoints and (that is, with and ) can be seen to resemble the local linear trend estimates in the real-data example the most (where we neglect the nonlinearities of the local linear fits at the beginning of the observation period). The broken line with is closer to the null, making it harder for our test to detect this alternative.555The broken lines are obviously non-differentiable at the kink point. We could replace them by slightly smoothed versions to satisfy the differentiability assumption that is imposed in the theoretical part of the paper. However, as this leaves the simulation results essentially unchanged but only creates additional notation, we stick to the broken lines.
[FIGURE:]
To implement our test, we choose to be an Epanechnikov kernel and define the set of location-scale points as
[TABLE]
We thus take into account all rescaled time points on an equidistant grid with step length . For the bandwidth and any , the kernel weights are non-zero for exactly observations. Hence, the bandwidths in correspond to effective sample sizes of up to approximately data points. As a robustness check, we have re-run the simulations for a number of other grids. As the results are very similar, we do however not report them here. The long-run error variance is estimated by the procedures from Section 4.2: We first compute the estimator of the AR parameter(s), where we use and the pilot estimator with . Based on , we then compute the estimator of the long-run error variance . As a further robustness check, we have re-run the simulations for other choices of the parameters and , which yields very similar results. The dependence of the estimators and on and is further explored in Section 5.3. To compute the critical values of the multiscale test, we simulate values of the statistic defined in Section 3.2 and compute their empirical quantile .
Tables 2 and 2 report the simulation results for the sample sizes and the significance levels . The sample size is approximately equal to the time series length in the real-data example of Section 6. To produce our simulation results, we generate samples for each model specification and carry out the multiscale test for each sample. The entries of Tables 2 and 2 are computed as the number of simulations in which the test rejects divided by the total number of simulations. As can be seen from Table 2, the actual size of the test is fairly close to the nominal target for all the considered AR specifications and sample sizes. Hence, the test has approximately the correct size. Inspecting Table 2, one can further see that the test has reasonable power properties. For all the considered AR specifications, the power increases quickly (i) as the sample size gets larger and (ii) as we move away from the null by increasing the slope parameter . The power is of course quite different across the various AR specifications. In particular, it is much lower for positive than for negative values of in the AR() case, the lowest power numbers being obtained for the largest positive value under consideration. This reflects the fact that it is more difficult to detect a trend when there is strong positive autocorrelation in the data. For the AR() specification of the errors, the sample size and the slopes and , which yield the two model specifications that resemble the real-life data in Section 6 the most, the power of the test is above for the significance levels and and above for . Hence, our method has substantial power in the two simulation scenarios which are closest to the situation in the application.
5.2 Comparison with SiZer
We now compare our multiscale test to SiZer for times series which was developed in Park et al. (2004), Rondonotti et al. (2007) and Park et al. (2009). Roughly speaking, the SiZer method proceeds as follows: For each location and bandwidth in a pre-specified set, SiZer computes an estimator of the derivative and a corresponding confidence interval. For each , it then checks whether the confidence interval includes the value [math]. The set of points for which the confidence interval does not include [math] corresponds to the set of intervals for which our multiscale test finds an increase/decrease in the trend . In order to explore how our test performs in comparison to SiZer, we compare the two sets and in different ways to each other in what follows.
In order to implement SiZer for time series, we follow the exposition in Park et al. (2009).666We have also examined the somewhat different implementation from Rondonotti et al. (2007). As this yields worse simulation results than the procedure from Park et al. (2009), we however do not report them here. The details are given in Section S.3 in the Supplementary Material. To simplify the implementation of SiZer, we assume that the autocovariance function of the error process and thus the long-run error variance is known. Our multiscale test is implemented in the same way as in Section 5.1. To keep the comparison fair, we treat as known also when implementing our method. Moreover, we use the same grid of points for both methods. To achieve this, we start off with the grid from (5.1). We then follow Rondonotti et al. (2007) and Park et al. (2009) and restrict attention to those points for which the effective sample size for correlated data is not smaller than . This yields the grid . A detailed discussion of the effective sample size for correlated data can be found in Rondonotti et al. (2007).
In the first part of the comparison study, we analyse the size and power of the two methods. To do so, we treat SiZer as a rigorous statistical test of the null hypothesis that is constant on all intervals with . In particular, we let SiZer reject the null if the set is non-empty, that is, if the value [math] is not included in the confidence interval for at least one point . We simulate data from the model with different AR() error processes and different trends . In particular, we let be an AR() process of the form with and i.i.d. standard normal innovations . To simulate data under the null, we set as in the previous section. To generate data under the alternative, we consider the linear trends with different slopes . As it is more difficult to detect a trend in the data when the error terms are positively autocorrelated, we choose the slopes larger in the AR() case with than in the case with . In particular, we let when and when . Further model specifications with nonlinear trends are considered in the second part of the comparison study. To produce our simulation results, we generate samples for each model specification and carry out the two methods for each sample.
The simulation results are reported in Tables 4 and 4. Both for our multiscale test and SiZer, the entries in the tables are computed as the number of simulations in which the respective method rejects the null hypothesis divided by the total number of simulations. As can be seen from Table 4, our test has approximately correct size in all of the considered settings, whereas SiZer is very liberal and rejects the null way too often. Examining Table 4, one can further see that our procedure has reasonable power against the considered alternatives. The power numbers are of course higher for SiZer, which is a trivial consequence of the fact that SiZer is extremely liberal. These numbers should thus be treated with caution. All in all, the simulations suggest that SiZer can hardly be regarded as a rigorous statistical test of the null hypothesis that is constant on all intervals with . This is not very surprising as SiZer is not designed to be such a test but to produce informative SiZer maps. In particular, the confidence intervals of SiZer are not constructed to control the level under . In what follows, we thus attempt to compare the two methods in a different way which goes beyond mere size and power comparisons.
Both our method and SiZer can be regarded as statistical tools to identify time regions where the curve is increasing/decreasing.777More precisely speaking, SiZer is usually interpreted as investigating the curve , viewed at different levels of resolution, rather than the curve itself. Put differently, the underlying object of interest is a family of smoothed versions of rather than itself. Suppose that is increasing/decreasing in the time region but constant otherwise, that is, for all and for all . A natural question is the following: How well can the two methods identify the time region ? In our framework, information on the region is contained in the minimal intervals of the set . In particular, the union of the minimal intervals in can be regarded as an estimate of . This follows from the results in Propositions 3.2 and 3.3. Let be the union of the minimal intervals in . In what follows, we compare and to the region . This gives us information on how well the two methods approximate the true region where is increasing/decreasing.888The same exercise could of course also be carried out separately for the time region where the trend increases and the region where it decreases.
We consider the same simulation setup as in the first part of the comparison study, only the trend function is different. We let be defined as , which implies that . The function is plotted in the two upper panels of Figure 2. We set the significance level to and the sample size to . For each AR parameter , we simulate samples and compute and for each sample. The simulation results are depicted in Figure 2, the two subfigures (a) and (b) corresponding to different AR parameters. The upper panel of each subfigure displays the time series path of a representative simulation together with the trend function . The middle panel shows the regions produced by our multiscale approach for the simulation runs: On the -axis, the simulation runs are enumerated for , and the black line at -level represents for the -th simulation. Finally, the lower panel of each subfigure depicts the regions in an analogous way.
Inspecting Figure 2, our multiscale method can be seen to approximate the region fairly well in both simulation scenarios under consideration. In particular, gives a good approximation to the region for most simulations. Only in some simulation runs, is too large compared to , which means that our method is not able to locate the region sufficiently precisely. Overall, the SiZer method also produces quite satisfactory results. However, the SiZer estimates of are not as precise as ours. In particular, SiZer spuriously finds regions of decrease/increase outside the interval much more often than our method. It thus frequently mistakes fluctuations in the time series which are due to the dependence in the error terms for increases/decreases in the trend .
To sum up, our multiscale test exhibits good size and power properties in the simulations, and the minimal intervals produced by it identify the time regions where increases/decreases in a quite reliable way. SiZer performs clearly worse in these respects. Nevertheless, it may still produce informative SiZer plots. All in all, we would like to regard the two methods as complementary rather than direct competitors. SiZer is an explorative tool which aims to give an overview of the increases/decreases in by means of a SiZer plot. Our method, in contrast, is tailored to be a rigorous statistical test of the hypothesis . In particular, it allows to make rigorous confidence statements about the time regions where the trend increases/decreases.
5.3 Small sample properties of the long-run variance estimator
In the final part of the simulation study, we examine the estimators of the AR parameters and the long-run error variance from Section 4.2. We simulate data from the model , where is an AR() process of the form . We consider the AR parameters and let be i.i.d. standard normal innovation terms. We report our findings for a specific sample size , in particular for , as the results for other sample sizes are very similar. For simplicity, is chosen to be a linear function of the form with the slope parameter . For each value of , we consider two different slopes , one corresponding to a moderate and one to a pronounced trend . In particular, we let with . When , the slope is equal to the standard deviation of the error process, which yields a moderate trend . When , in contrast, the slope is times as large as , which results in a quite pronounced trend .
For each model specification, we generate data samples and compute the following quantities for each simulated sample:
- (i)
the pilot estimator from (4.11) with the tuning parameter . 2. (ii)
the estimator from (4.13) with the tuning parameter as well as the long-run variance estimator from (4.14). 3. (iii)
the estimators of and from Hall and Van Keilegom (2003), which are denoted by and for ease of reference. The estimator is computed as described in Section 2.2 of Hall and Van Keilegom (2003) and as defined at the bottom of p.447 in Section 2.3. The estimator (as well as ) depends on two tuning parameters which we denote by and as in Hall and Van Keilegom (2003). 4. (iv)
oracle estimators and of and , which are constructed under the assumption that the error process is observed. For each simulation run, we compute as the maximum likelihood estimator of from the time series of simulated error terms . We then calculate the residuals and estimate the innovation variance by . Finally, we set .
Throughout the section, we set , and . We in particular choose to be in the middle of and to make the tuning parameters of the estimators and more or less comparable. In order to assess how sensitive our estimators are to the choice of and , we carry out a number of robustness checks, considering a range of different values for and . In addition, we vary the tuning parameters and of the estimators from Hall and Van Keilegom (2003) in order to make sure that the results of our comparison study are not driven by the particular choice of any of the involved tuning parameters. The results of our robustness checks are reported in Section S.3 of the Supplementary Material. They show that the results of our comparison study are robust to different choices of the parameters , and . Moreover, they indicate that our estimators are rather insensitive to the choice of tuning parameters.
For each estimator , , and , , and for each model specification, the simulation output consists in a vector of length which contains the simulated values of the respective estimator. Figures 3 and 4 report the mean squared error (MSE) of these simulated values for each estimator. On the -axis of each plot, the various values of the AR parameter are listed which are considered. The solid line in each plot gives the MSE values of our estimators. The dashed and dotted lines specify the MSE values of the HvK and the oracle estimators, respectively. Note that for the long-run variance estimators, the plots report the logarithm of the MSE rather than the MSE itself since the MSE values are too different across simulation scenarios to obtain a reasonable graphical presentation. In addition to the MSE values presented in Figures 3 and 4, we depict histograms of the simulated values produced by the estimators , , and , , for two specific simulation scenarios in Figures 5 and 6. The main findings can be summarized as follows:
- (a)
In the simulation scenarios with a moderate trend (), the estimators and of Hall and Van Keilegom (2003) exhibit a similar performance as our estimators and as long as the AR parameter is not too close to . For strongly negative values of (in particular for and ), the estimators perform much worse than ours. This can be clearly seen from the much larger MSE values of the estimators and for and in Figure 3. Figure 5 gives some further insights into what is happening here. It shows the histograms of the simulated values produced by the estimators , , and the corresponding long-run variance estimators in the scenario with and . As can be seen, the estimator does not obey the causality restriction but frequently takes values substantially smaller than . This results in a very large spread of the histogram and thus in a disastrous performance of the estimator.999One could of course set to for some small whenever it takes a value smaller than . This modified estimator, however, is still far from performing in a satisfying way when is close to . A similar point applies to the histogram of the long-run variance estimator . Our estimators and , in contrast, exhibit a stable behaviour in this case.
Interestingly, the estimator (as well as the corresponding long-run variance estimator ) performs much worse than ours for large negative values but not for large positive values of . This can be explained as follows: In the special case of an AR() process, the estimator may produce estimates smaller than but it cannot become larger than . This can be easily seen upon inspecting the definition of the estimator. Hence, for large positive values of , the estimator performs well as it satisfies the causality restriction that the estimated AR parameter should be smaller than . 2. (b)
In the simulation scenarios with a pronounced trend (), the estimators of Hall and Van Keilegom (2003) are clearly outperformed by ours for most of the AR parameters under consideration. In particular, their MSE values reported in Figure 4 are much larger than the values produced by our estimators for most parameter values . The reason is the following: The HvK estimators have a strong bias since the pronounced trend with is not eliminated appropriately by the underlying differencing methods. This point is illustrated by Figure 6 which shows histograms of the simulated values for the estimators , , and the corresponding long-run variance estimators in the scenario with and . As can be seen, the histogram produced by our estimator is approximately centred around the true value , whereas that of the estimator is strongly biased upwards. A similar picture arises for the long-run variance estimators and .
Whereas the methods of Hall and Van Keilegom (2003) perform much worse than ours for negative and moderately positive values of , the performance (in terms of MSE) is fairly similar for large values of . This can be explained as follows: When the trend is not eliminated appropriately by taking differences, this creates spurious persistence in the data. Hence, the estimator tends to overestimate the AR parameter , that is, tends to be larger in absolute value than . Very loosely speaking, when the parameter is close to , say , there is not much room for overestimation since cannot become larger than . Consequently, the effect of not eliminating the trend appropriately has a much smaller impact on for large positive values of .
6 Application
The analysis of time trends in long temperature records is an important task in climatology. Information on the shape of the trend is needed in order to better understand long-term climate variability. The Central England temperature record is the longest instrumental temperature time series in the world. It is a valuable asset for analysing climate variability over the last few hundred years. The data is publicly available on the webpage of the UK Met Office. A detailed description of the data can be found in Parker et al. (1992). For our analysis, we use the dataset of yearly mean temperatures which consists of observations covering the years from to .
We assume that the data follow the nonparametric trend model , where is the unknown time trend of interest. The error process is supposed to have the AR() structure , where are i.i.d. innovations with mean [math] and variance . As pointed out in Mudelsee (2010) among others, this is the most widely used error model for discrete climate time series. To select the AR order , we proceed as follows: We estimate the AR parameters and the corresponding variance of the innovation terms for different AR orders by our methods from Section 4.2 and choose to be the minimizer of the Bayesian information criterion (BIC). This yields the AR order . We then estimate the parameters and the long-run error variance by the estimators and , which gives the values , and . To select the AR order and to produce the estimators and , we set and as in the simulation study of Section 5.1.101010As a robustness check, we have repeated the process of order selection and parameter estimation for other values of and as well as for other criteria such as FPE, AIC and AICC, which gave similar results.
With the help of our multiscale method from Section 3, we now test the null hypothesis that is constant on all intervals with , where we use the grid defined in (5.1). To do so, we set the significance level to and implement the test in exactly the same way as in the simulations of Section 5.1. The results are presented in Figure 7. The upper panel shows the raw temperature time series, whereas the middle panel depicts local linear kernel estimates of the trend for different bandwidths . As one can see, the shape of the estimated time trend strongly differs with the chosen bandwidth. When the bandwidth is small, there are many local increases and decreases in the estimated trend. When the bandwidth is large, most of these local variations get smoothed out. Hence, by themselves, the nonparametric fits do not give much information on whether the trend is increasing or decreasing in certain time regions.
Our multiscale test provides this kind of information, which is summarized in the lower panel of Figure 7. The plot depicts the minimal intervals contained in the set , which is defined in Section 3.3. The set of intervals is empty in the present case. The height at which a minimal interval is plotted indicates the value of the corresponding (additively corrected) test statistic . The dashed line specifies the critical value , where as already mentioned above. According to Proposition 3.3, we can make the following simultaneous confidence statement about the collection of minimal intervals in . We can claim, with confidence of about , that the trend function has some increase on each minimal interval. More specifically, we can claim with this confidence that there has been some upward movement in the trend both in the period from around to and in the period from about onwards. Hence, our test in particular provides evidence that there has been some warming trend in the period over approximately the last years. On the other hand, as the set is empty, there is no evidence of any downward movement of the trend.
S.1 Proofs of the results from Section 3
In this section, we prove the theoretical results from Section 3. We use the following notation: The symbol denotes a universal real constant which may take a different value on each occurrence. For , we write and . For any set , the symbol denotes the cardinality of . The notation means that the two random variables and have the same distribution. Finally, and denote the density and distribution function of the standard normal distribution, respectively.
Auxiliary results using strong approximation theory
The main purpose of this section is to prove that there is a version of the multiscale statistic defined in (3.4) which is close to a Gaussian statistic whose distribution is known. More specifically, we prove the following result.
Proposition S.1**.**
Under the conditions of Theorem 3.1, there exist statistics for with the following two properties: (i) has the same distribution as for any , and (ii)
[TABLE]
where is a Gaussian statistic as defined in (3.3).
Proof of Proposition S.1.
For the proof, we draw on strong approximation theory for stationary processes that fulfill the conditions (C1)–(C3). By Theorem 2.1 and Corollary 2.1 in Berkes et al. (2014), the following strong approximation result holds true: On a richer probability space, there exist a standard Brownian motion and a sequence such that for each and
[TABLE]
where denotes the long-run error variance. To apply this result, we define
[TABLE]
where and is the same estimator as with replaced by for . In addition, we let
[TABLE]
with and . With this notation, we can write
[TABLE]
where the last equality follows by taking into account that for all , for some large but fixed constant and . Straightforward calculations yield that
[TABLE]
Using summation by parts, we further obtain that
[TABLE]
where
[TABLE]
Standard arguments show that . Applying the strong approximation result (S.1), we can thus infer that
[TABLE]
Plugging (S.3) into (S.2) completes the proof. ∎
Auxiliary results using anti-concentration bounds
In this section, we establish some properties of the Gaussian statistic defined in (3.3). We in particular show that does not concentrate too strongly in small regions of the form with converging to zero.
Proposition S.2**.**
Under the conditions of Theorem 3.1, it holds that
[TABLE]
where .
Proof of Proposition S.2.
The main technical tool for proving Proposition S.2 are anti-concentration bounds for Gaussian random vectors. The following proposition slightly generalizes anti-concentration results derived in Chernozhukov et al. (2015), in particular Theorem 3 therein.
Proposition S.3**.**
Let be a Gaussian random vector in with and for . Define together with and . Moreover, set and . For every , it holds that
[TABLE]
where depends only on and .
The proof of Proposition S.3 is provided at the end of this section for completeness. To apply Proposition S.3 to our setting at hand, we introduce the following notation: We write along with , where for some large but fixed by our assumptions. Moreover, for , we set
[TABLE]
with . This notation allows us to write
[TABLE]
where is a Gaussian random vector with the following properties: (i) and thus , and (ii) for all . Since for all , it holds that . Moreover, as the variables are standard normal, we have that . With this notation at hand, we can apply Proposition S.3 to obtain that
[TABLE]
with , which is the statement of Proposition S.2. ∎
Proof of Theorem 3.1
To prove Theorem 3.1, we make use of the two auxiliary results derived above. By Proposition S.1, there exist statistics for which are distributed as for any and which have the property that
[TABLE]
where is a Gaussian statistic as defined in (3.3). The approximation result (S.4) allows us to replace the multiscale statistic by an identically distributed version which is close to the Gaussian statistic . In the next step, we show that
[TABLE]
which immediately implies the statement of Theorem 3.1. For the proof of (S.5), we use the following simple lemma:
Lemma S.1**.**
Let and be real-valued random variables for such that with some . If
[TABLE]
then
[TABLE]
The statement of Lemma S.1 can be summarized as follows: If can be approximated by in the sense that and if does not concentrate too strongly in small regions of the form as assumed in (S.6), then the distribution of can be approximated by that of in the sense of (S.7).
Proof of Lemma S.1.
It holds that
[TABLE]
We now apply this lemma with , and : From (S.4), we already know that . Moreover, by Proposition S.2, it holds that
[TABLE]
Hence, the conditions of Lemma S.1 are satisfied. Applying the lemma, we obtain (S.5), which completes the proof of Theorem 3.1.
Proof of Proposition 3.2
To start with, we introduce the notation with and . By assumption, there exists with such that for all . (The case that for all can be treated analogously.) Below, we prove that under this assumption,
[TABLE]
for sufficiently large , where . Moreover, by arguments very similar to those for the proof of Proposition S.1, it follows that
[TABLE]
With the help of (S.9), (S.10) and the fact that , we can infer that
[TABLE]
for sufficiently large . Since for any fixed , (S.11) immediately yields that , which is the statement of Proposition 3.2.
Proof of (S.9).
Write , where is an intermediate point between and . The local linear weights are constructed such that . We thus obtain that
[TABLE]
Moreover, since the kernel is symmetric and for some , it holds that , which in turn implies that
[TABLE]
From (S.12), (S.13) and the assumption that for all , we get that
[TABLE]
Standard calculations exploiting the Lipschitz continuity of the kernel show that for any and any given natural number ,
[TABLE]
where the constant does not depend on , and . With the help of (S.13) and (S.15), we obtain that for any with ,
[TABLE]
where the constant does once again not depend on , and . (S.16) implies that for sufficiently large and any with . Using this together with (S.14), we immediately obtain (S.9). ∎
Proof of Proposition 3.3
In what follows, we show that
[TABLE]
The other statements of Proposition 3.3 can be verified by analogous arguments. (S.17) is a consequence of the following two observations:
- (i)
For all with
[TABLE]
it holds that . 2. (ii)
For all with , implies that for some .
Observation (i) is trivial, (ii) can be seen as follows: Let be any point with and . It holds that , where has been defined in the proof of Proposition 3.2. As already shown in (S.12),
[TABLE]
where is some intermediate point between and . Moreover, by (S.13), it holds that for any . Hence, can only take a positive value if for some .
From observations (i) and (ii), we can draw the following conclusions: On the event
[TABLE]
it holds that for all with , for some . We thus obtain that . This in turn implies that
[TABLE]
where the last equality holds by Theorem 3.1.
Proof of Proposition S.3
The proof makes use of the following three lemmas, which correspond to Lemmas 5–7 in Chernozhukov et al. (2015).
Lemma S.2**.**
Let be a (not necessarily centred) Gaussian random vector in with for all . Suppose that whenever . Then the distribution of is absolutely continuous with respect to Lebesgue measure and a version of the density is given by
[TABLE]
Lemma S.3**.**
Let be a (not necessarily centred) Gaussian random vector with for all . Suppose that . Then the map
[TABLE]
is non-decreasing on .
Lemma S.4**.**
Let be a centred Gaussian random vector in with for some . Then for any ,
[TABLE]
The proof of Lemmas S.2 and S.3 can be found in Chernozhukov et al. (2015). Lemma S.4 is a standard result on Gaussian concentration whose proof is given e.g. in Ledoux (2001); see Theorem 7.1 therein. We now closely follow the arguments for the proof of Theorem 3 in Chernozhukov et al. (2015). The proof splits up into three steps.
Step 1. Pick any and set
[TABLE]
By construction, and . Defining , it holds that
[TABLE]
Step 2. We now bound the density of . Without loss of generality, we assume that for . The marginal distribution of is with . Hence, by Lemmas S.2 and S.3, the random variable has a density of the form
[TABLE]
where the map is non-decreasing. Define and set such that for any . With these definitions at hand, we obtain that
[TABLE]
where the last inequality follows from Lemma S.4. Since , it holds that
[TABLE]
Hence, for every ,
[TABLE]
Mill’s inequality states that for ,
[TABLE]
Since for and for , we can infer that
[TABLE]
This together with (S.18) and (S.19) yields that
[TABLE]
Step 3. By Step 2, we get that for any and ,
[TABLE]
where the last inequality follows from the fact that the map (with ) is non-increasing on . Combining this bound with Step 1, we further obtain that for any and ,
[TABLE]
This inequality also holds for by an analogous argument, and hence for all .
Now let and define . For any , (S.20) yields that
[TABLE]
with a sufficiently large constant that depends only on and . For , we obtain that
[TABLE]
which can be seen as follows: If , then implies that and thus . Hence, it holds that
[TABLE]
If , then implies that . Hence, in this case,
[TABLE]
where the last inequality follows from the fact that for centred Gaussian random variables and , . With (S.23) and (S.24), we obtain that for any ,
[TABLE]
the last inequality following from Lemma S.4. To sum up, we have established that for any and any ,
[TABLE]
with some constant that does only depend on and . For , (S.25) trivially follows upon setting . This completes the proof.
S.2 Proofs of the results from Section 4
In what follows, we prove Proposition 4.1 from Section 4. The notation is the same as in the previous section. In particular, we use the symbol to denote a generic constant which may take a different value on each occurrence.
Auxiliary results
To start with, we derive some auxiliary results needed for the proof of Proposition 4.1. The first lemma analyses the term
[TABLE]
where and are natural numbers with that may depend on the sample size , that is, as well as and .
Lemma S.5**.**
For any with , it holds that
[TABLE]
where .
Proof of Lemma S.5.
Since the variables have the expansion and , it holds that
[TABLE]
where
[TABLE]
with and for . Noting that
[TABLE]
we can infer that
[TABLE]
the last equality following from the fact that the autocovariances are absolutely summable and the coefficients decay exponentially fast to zero. ∎
We next show that the empirical autocovariances
[TABLE]
of the process have the following property.
Lemma S.6**.**
For any with and any , it holds that
[TABLE]
where .
Proof of Lemma S.6.
To analyse the term , we decompose it as follows:
[TABLE]
where
[TABLE]
as well as , and with . With the help of Lemma S.5, it is straightforward to show that
[TABLE]
Moreover, the Cauchy-Schwarz inequality yields that
[TABLE]
Since is Lipschitz by assumption, we get that . In addition, it obviously holds that . Hence, we can infer that
[TABLE]
which implies that . Similar arguments yield that for as well. Putting everything together, we arrive at the statement of Lemma S.6. ∎
Proof of Proposition 4.1
We first show that the pilot estimator converges to . In particular, we verify that . By Lemma S.6, it holds that and . Since is invertible, this implies that
[TABLE]
With the help of equation (4.10), we can further infer that
[TABLE]
As already noted in Section 4.2, the entries of the vector decay exponentially fast to zero, that is, for some . Moreover, it holds that for any fixed as . Consequently, , where denotes the usual supremum norm for vectors. As a result, we obtain that .
We next show that , where is any fixed integer that does not grow with the sample size . By definition, it holds that . From Lemma S.6, it follows that and . Moreover, with the help of the fact that , it is straightforward to verify that and . Hence, we arrive at
[TABLE]
where the last equality is due to equation (4.10).
From (S.26), it immediately follows that , which in turn allows us to infer that and by straightforward arguments.
S.3 Robustness checks and implementation details for the simulations in Section 5
Robustness checks for Section 5.3
In what follows, we carry out some robustness checks to assess how sensitive the estimators and are to the choice of the tuning parameters and . To do so, we repeat the simulation exercises of Section 5.3 for different values of and . In addition, we consider different choices of the tuning parameters on which the estimators of Hall and Van Keilegom (2003) depend. As in Section 5.3, we choose and such that lies between these values. We thus keep the parameters and roughly comparable.
To start with, we consider the simulation scenarios with a moderate trend (). The MSE values of the estimators , , and , , for these scenarios are presented in Figure 3 of Section 5.3. These MSEs are re-calculated in Figures S.1 and S.2 for a range of different choices of , and . As one can see, the MSEs in the different plots of Figures S.1 and S.2 are very similar. Hence, the MSE results reported in Section 5.3 for the scenarios with a moderate trend appear to be fairly robust to different choices of the tuning parameters. In particular, our estimators and seem to be quite insensitive to the choice of tuning parameters, at least as far as their MSEs are concerned.
We next turn to the simulation designs with a pronounced trend (). The MSE values of the estimators in these scenarios are reported in Figure 4 of Section 5.3. Analogously as before, we re-calculate these MSEs for different tuning parameters in Figures S.3–S.5. Figure S.4 is a zoomed-in version of Figure S.3 which is added for better visibility. As can be seen, our estimators appear to be barely influenced by the choice of . However, the MSE values become somewhat larger when is chosen bigger. This is of course not very surprising: The main reason why the estimator works well in the presence of a strong trend is that it is only based on differences of small orders. If we increase , we use larger differences to compute , which results in not eliminating the trend appropriately any more. This becomes visible in somewhat larger MSE values. Nevertheless, overall, our estimators appear not to be strongly influenced by the choice of tuning parameters (in terms of MSE) as long as these are chosen within reasonable bounds.
Implementation of SiZer in Section 5.2
The SiZer methods for the comparison study in Section 5.2 are implemented as follows:
- (a)
Computation of the grid :
To start with, we compute the variance of , which is given by
[TABLE]
Since the autocovariance function is known by assumption, we can calculate the value of by using the formula together with the true parameters and . We next compute
[TABLE]
which can be interpreted as a measure of information in the data. For each point from (5.1), we finally calculate the effective sample size for dependent data
[TABLE]
with and set . 2. (b)
Computation of the local linear estimators and their standard deviations:
For each , we compute a standard local linear estimator of the derivative together with its standard deviation . The latter is given by , where with and
[TABLE]
The matrices , and are defined as follows: is a matrix with the elements
[TABLE]
is a diagonal matrix with the diagonal entries and
[TABLE] 3. (c)
Computation of the confidence intervals:
For a given confidence level and for each bandwidth value with , we compute the quantile
[TABLE]
where is the distribution function of a standard normal random variable, is the number of locations in the grid , and the cluster index is defined on p.1519 in Park et al. (2009). The confidence interval of is then computed as .
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Andrews (1991) Andrews, D. W. K. (1991). Heteroskedasticity and autocorrelation consistent covariance matrix estimation. Econometrica , 59 817–858.
- 2Benner (1999) Benner, T. C. (1999). Central england temperatures: long-term variability and teleconnections. International Journal of Climatology , 19 391–403.
- 3Berkes et al. (2014) Berkes, I. , Liu, W. and Wu, W. B. (2014). Komlós-Major-Tusnády approximation under dependence. Annals of Probability , 42 794–817.
- 4Chaudhuri and Marron (1999) Chaudhuri, P. and Marron, J. S. (1999). Si Zer for the exploration of structures in curves. Journal of the American Statistical Association , 94 807–823.
- 5Chaudhuri and Marron (2000) Chaudhuri, P. and Marron, J. S. (2000). Scale space view of curve estimation. Annals of Statistics , 28 408–428.
- 6Chernozhukov et al. (2014) Chernozhukov, V. , Chetverikov, D. and Kato, K. (2014). Gaussian approximation of suprema of empirical processes. Annals of Statistics , 42 1564–1597.
- 7Chernozhukov et al. (2015) Chernozhukov, V. , Chetverikov, D. and Kato, K. (2015). Comparison and anti-concentration bounds for maxima of Gaussian random vectors. Probability Theory and Related Fields , 162 47–70.
- 8Chernozhukov et al. (2017) Chernozhukov, V. , Chetverikov, D. and Kato, K. (2017). Central limit theorems and bootstrap in high dimensions. Annals of Probability , 45 2309–2352.
