Multiscale inference and long-run variance estimation in nonparametric regression with time series errors

Marina Khismatullina; Michael Vogt

arXiv:1903.01253·math.ST·March 5, 2019

TL;DR

This paper introduces multiscale testing methods for qualitative features of nonparametric regression functions with time series errors, along with a new estimator for long-run variance in AR(p) processes, supported by theory, simulations, and climate data analysis.

Contribution

It presents novel multiscale tests for shape properties of regression functions with time series errors and a new difference-based estimator for long-run variance in AR(p) models.

Findings

1. Asymptotic validity of the proposed tests and estimator.

2. Simulation results demonstrate good finite-sample performance.

3. Application to climate data reveals meaningful trend features.

Abstract

In this paper, we develop new multiscale methods to test qualitative hypotheses about the regression function m in a nonparametric regression model with fixed design points and time series errors. In time series applications, m represents a nonparametric time trend. Practitioners are often interested in whether the trend m has certain shape properties. For example, they would like to know whether m is constant or whether it is increasing/decreasing in certain time regions. Our multiscale methods allow to test for such shape properties of the trend m. In order to perform the methods, we require an estimator of the long-run variance of the error process. We propose a new difference-based estimator of the long-run error variance for the case that the error terms form an AR(p) process. In the technical part of the paper, we derive asymptotic theory for the proposed multiscale test and the…

Tables

Table 1: Size of our multiscale test for different AR parameters $a_{1}$ and $a_{2}$, sample sizes $T$ and nominal sizes $\alpha$.

	$a_{1} = - 0.5$			$a_{1} = - 0.25$			$a_{1} = 0.25$			$a_{1} = 0.5$			$(a_{1}, a_{2}) = (0.167, 0.178)$
	nominal size $α$			nominal size $α$			nominal size $α$			nominal size $α$			nominal size $α$
	0.01	0.05	0.1	0.01	0.05	0.1	0.01	0.05	0.1	0.01	0.05	0.1	0.01	0.05	0.1
$T = 250$	0.015	0.050	0.127	0.014	0.057	0.120	0.011	0.046	0.116	0.013	0.042	0.108	0.011	0.052	0.117
$T = 350$	0.009	0.067	0.120	0.010	0.055	0.095	0.009	0.055	0.096	0.010	0.049	0.090	0.010	0.059	0.114
$T = 500$	0.015	0.053	0.128	0.015	0.047	0.100	0.018	0.048	0.101	0.015	0.042	0.106	0.015	0.056	0.107

Table 3: Size of our multiscale test (MT) and SiZer for different model specifications.

	$a_{1} = - 0.25$						$a_{1} = 0.25$
	$α = 0.01$		$α = 0.05$		$α = 0.1$		$α = 0.01$		$α = 0.05$		$α = 0.1$
	MT	SiZer	MT	SiZer	MT	SiZer	MT	SiZer	MT	SiZer	MT	SiZer
$T = 250$	0.018	0.112	0.040	0.374	0.104	0.575	0.017	0.106	0.034	0.347	0.092	0.522
$T = 350$	0.012	0.140	0.058	0.426	0.080	0.621	0.012	0.130	0.046	0.399	0.074	0.578
$T = 500$	0.005	0.140	0.041	0.489	0.097	0.680	0.006	0.136	0.039	0.452	0.097	0.639

Equations

Y_{t,T}=m\Big(\frac{t}{T}\Big)+\varepsilon_{t}

H_{0}(u,h):\ m^{\prime}(w)=0\text{ for all }w\in[u-h,u+h],

H_{0}:\ \text{The hypothesis }H_{0}(u,h)\text{ holds true for all }(u,h)\in\mathcal{G}_{T},

\widehat{\psi}_{T}(u,h)=\sum_{t=1}^{T}w_{t,T}(u,h)Y_{t,T},

w_{t,T}(u,h)=\frac{\Lambda_{t,T}(u,h)}{\{\sum_{t=1}^{T}\Lambda_{t,T}(u,h)^{2}\}^{1/2}},

\Lambda_{t,T}(u,h)=K\Big(\frac{\frac{t}{T}-u}{h}\Big)\Big[S_{T,0}(u,h)\Big(\frac{\frac{t}{T}-u}{h}\Big)-S_{T,1}(u,h)\Big],

\widehat{\Psi}_{T}=\max_{(u,h)\in\mathcal{G}_{T}}\Big\{\Big|\frac{\widehat{\psi}_{T}(u,h)}{\widehat{\sigma}}\Big|-\lambda(h)\Big\},

\widehat{\Psi}_{T,\text{uncorrected}}=\max_{(u,h)\in\mathcal{G}_{T}}\Big|\frac{\widehat{\psi}_{T}(u,h)}{\widehat{\sigma}}\Big|

\widehat{\Psi}_{T,\text{uncorrected}}=\max_{1\leq\ell\leq L}\max_{1\leq k\leq\lfloor 1/2h_{\ell}\rfloor}\Big|\frac{\widehat{\psi}_{T}(u_{k},h_{\ell})}{\sigma}\Big|.

\mathcal{G}_{T}=\big\{(u,h):u=t/T\text{ for some }1\leq t\leq T\text{ and }h\in[h_{\min},h_{\max}]\text{ with }h=t/T\text{ for some }1\leq t\leq T\big\},

\Phi_{T}=\max_{(u,h)\in\mathcal{G}_{T}}\Big\{\Big|\frac{\phi_{T}(u,h)}{\sigma}\Big|-\lambda(h)\Big\},

\Phi_{T}

\mathbb{P}\big(\widehat{\Phi}_{T}\leq q_{T}(\alpha)\big)=(1-\alpha)+o(1).

\big|\widetilde{\Phi}_{T}-\Phi_{T}\big|=o_{p}(\delta_{T}),

\sup_{x\in\mathbb{R}}\big|\mathbb{P}(\widetilde{\Phi}_{T}\leq x)-\mathbb{P}(\Phi_{T}\leq x)\big|=o(1),

\sup_{x\in\mathbb{R}}\mathbb{P}\big(|\Phi_{T}-x|\leq\delta_{T}\big)=o(1),

\mathbb{P}\big(\widehat{\Psi}_{T}\leq q_{T}(\alpha)\big)=(1-\alpha)+o(1).

m_{T}^{\prime}(w)\geq c_{T}\frac{\log T}{Th^{3}}\text{ for all }w\in[u-h,u+h],

\mathbb{P}\big(\widehat{\Psi}_{T}\leq q_{T}(\alpha)\big)=o(1).

\Pi_{T}^{\pm}

\Pi_{T}^{+}

\Pi_{T}^{-}

\mathcal{A}_{T}^{\pm}

\mathcal{A}_{T}^{+}

\mathcal{A}_{T}^{-}

E_{T}^{\pm}

E_{T}^{+}

E_{T}^{-}

\mathbb{P}\big(E_{T}^{\ell}\big)\geq(1-\alpha)+o(1).

E_{T}^{\pm,\min}

E_{T}^{+,\min}

E_{T}^{-,\min}

\mathbb{P}\big(E_{T}^{\ell,\min}\big)\geq(1-\alpha)+o(1)

Full text

**Multiscale Inference and Long-Run Variance Estimation in Nonparametric Regression with Time Series Errors**

Marina Khismatullina (Address: Bonn Graduate School of Economics, University of Bonn, 53113 Bonn, Germany. Email: [email protected].)

University of Bonn

Michael Vogt (Address: Department of Economics and Hausdorff Center for Mathematics, University of Bonn, 53113 Bonn, Germany. Email: [email protected].)

University of Bonn

In this paper, we develop new multiscale methods to test qualitative hypotheses about the function $m$ in the nonparametric regression model $Y_{t,T}=m(t/T)+\varepsilon_{t}$ with time series errors $\varepsilon_{t}$. In time series applications, $m$ represents a nonparametric time trend. Practitioners are often interested in whether the trend $m$ has certain shape properties. For example, they would like to know whether $m$ is constant or whether it is increasing/decreasing in certain time regions. Our multiscale methods allow us to test for such shape properties of the trend $m$. To carry out the methods, we require an estimator of the long-run error variance $\sigma^{2}=\sum\nolimits_{\ell=-\infty}^{\infty}\textnormal{Cov}(\varepsilon_{0},\varepsilon_{\ell})$. We propose a new difference-based estimator of $\sigma^{2}$ for the case that $\{\varepsilon_{t}\}$ is an AR($p$) process. In the technical part of the paper, we derive asymptotic theory for the proposed multiscale test and the estimator of the long-run error variance. The theory is complemented by a simulation study and an empirical application to climate data.

In this supplement, we provide the technical details and proofs that are omitted in the paper. In addition, we report the results of some robustness checks which complement the simulation exercises in Section 5 of the paper.

Key words: Multiscale statistics; long-run variance; nonparametric regression; time series errors; shape constraints; strong approximations; anti-concentration bounds.

AMS 2010 subject classifications: 62E20; 62G10; 62G20; 62M10.

1 Introduction

The analysis of time trends is an important aspect of many time series applications. In a wide range of situations, practitioners are particularly interested in certain shape properties of the trend. They raise questions such as the following: Does the observed time series have a trend at all? If so, is the trend increasing/decreasing in certain time regions? Can one identify the regions of increase/decrease? As an example, consider the time series plotted in Figure 1 which shows the yearly mean temperature in Central England from 1659 to 2017. Climatologists are very much interested in learning about the trending behaviour of temperature time series like this; see e.g. Benner (1999) and Rahmstorf et al. (2017). Among other things, they would like to know whether there is an upward trend in the Central England mean temperature towards the end of the sample as visual inspection might suggest.

In this paper, we develop new methods to test for certain shape properties of a nonparametric time trend. In particular, we construct a multiscale test which allows us to identify local increases/decreases of the trend function. We develop our test in the context of the following model setting: We observe a time series $\{Y_{t,T}:1\leq t\leq T\}$ of the form

$$Y_{t,T}=m\Big(\frac{t}{T}\Big)+\varepsilon_{t} \tag{1.1}$$

for $1\leq t\leq T$ , where $m:[0,1]\rightarrow\mathbb{R}$ is an unknown nonparametric regression function and the error terms $\varepsilon_{t}$ form a stationary time series process with $\mathbb{E}[\varepsilon_{t}]=0$ . In a time series context, the design points $t/T$ represent the time points of observation and $m$ is a nonparametric time trend. As usual in nonparametric regression, we let the function $m$ depend on rescaled time $t/T$ rather than on real time $t$ . A detailed description of model (1.1) is provided in Section 2.

Our multiscale test is developed step by step in Section 3. Roughly speaking, the procedure can be outlined as follows: Let $H_{0}(u,h)$ be the hypothesis that $m$ is constant in the time window $[u-h,u+h]\subseteq[0,1]$, where $u$ is the midpoint and $2h$ the size of the window. In a first step, we set up a test statistic $\widehat{s}_{T}(u,h)$ for the hypothesis $H_{0}(u,h)$. In a second step, we aggregate the statistics $\widehat{s}_{T}(u,h)$ for a large number of different time windows $[u-h,u+h]$. We thereby construct a multiscale statistic which allows us to test the hypothesis $H_{0}(u,h)$ simultaneously for many time windows $[u-h,u+h]$. In the technical part of the paper, we derive the theoretical properties of the resulting multiscale test. To do so, we develop a proof strategy which combines strong approximation results for dependent processes with anti-concentration bounds for Gaussian random vectors. This strategy is of interest in itself and may be applied to other multiscale test problems for dependent data. As shown by our theoretical analysis, our multiscale test is a rigorous level-$\alpha$ test of the overall null hypothesis $H_{0}$ that $H_{0}(u,h)$ is simultaneously fulfilled for all time windows $[u-h,u+h]$ under consideration. Moreover, for a given significance level $\alpha\in(0,1)$, the test allows us to make simultaneous confidence statements of the following form: We can claim, with statistical confidence $1-\alpha$, that there is an increase/decrease in the trend $m$ on all time windows $[u-h,u+h]$ for which the hypothesis $H_{0}(u,h)$ is rejected. Hence, the test allows us to identify, with a pre-specified statistical confidence, time regions where the trend $m$ is increasing/decreasing.

For independent data, multiscale tests have been developed in a variety of different contexts in recent years. In the regression context, Chaudhuri and Marron (1999, 2000) introduced the so-called SiZer method which has been extended in various directions; see e.g. Hannig and Marron (2006) where a refined distribution theory for SiZer is derived. Hall and Heckman (2000) constructed a multiscale test on monotonicity of a regression function. Dümbgen and Spokoiny (2001) developed a multiscale approach which works with additively corrected supremum statistics and derived theoretical results in the context of a continuous Gaussian white noise model. Rank-based multiscale tests for nonparametric regression were proposed in Dümbgen (2002) and Rohde (2008). More recently, Proksch et al. (2018) have constructed multiscale tests for inverse regression models. In the context of density estimation, multiscale tests have been investigated in Dümbgen and Walther (2008), Rufibach and Walther (2010), Schmidt-Hieber et al. (2013) and Eckle et al. (2017) among others.

Whereas a large number of multiscale tests for independent data have been developed in recent years, multiscale tests for dependent data are much rarer. Most notably, there are some extensions of the SiZer approach to a time series context. Park et al. (2004) and Rondonotti et al. (2007) have introduced SiZer methods for dependent data which can be used to find local increases/decreases of a trend and which may thus be regarded as an alternative to our multiscale test. However, these SiZer methods are mainly designed for data exploration rather than for rigorous statistical inference. Our multiscale method, in contrast, is a rigorous level-$\alpha$ test of the hypothesis $H_{0}$ which allows us to make simultaneous confidence statements about the time regions where the trend $m$ is increasing/decreasing. Some theoretical results for dependent SiZer methods have been derived in Park et al. (2009), but only under a quite severe restriction: only time windows $[u-h,u+h]$ whose sizes or scales $h$ remain bounded away from zero as the sample size $T$ grows are taken into account. Scales $h$ that converge to zero as $T$ increases are excluded. This effectively means that only large time windows $[u-h,u+h]$ are taken into consideration. Our theory, in contrast, allows us to simultaneously consider scales $h$ of fixed size and scales $h$ that converge to zero at various different rates. We are thus able to take into account time windows of many different sizes.

Our multiscale approach is also related to Wavelet-based methods: Similar to the latter, it takes into account different locations $u$ and resolution levels or scales $h$ simultaneously. However, while our multiscale approach is designed to test for local increases/decreases of a nonparametric trend, Wavelet methods are commonly used for other purposes. Among other things, they are employed for estimating/reconstructing nonparametric regression curves [see e.g. Donoho et al. (1995) or Von Sachs and MacGibbon (2000)] and for change point detection [see e.g. Cho and Fryzlewicz (2012)].

The test statistic of our multiscale method depends on the long-run error variance $\sigma^{2}=\sum\nolimits_{\ell=-\infty}^{\infty}\textnormal{Cov}(\varepsilon_{0},\varepsilon_{\ell})$ , which is usually unknown in practice. To carry out our multiscale test, we thus require an estimator of $\sigma^{2}$ . Indeed, such an estimator is required for virtually all inferential procedures in the context of model (1.1). Hence, the problem of estimating $\sigma^{2}$ in model (1.1) is of broader interest and has received a lot of attention in the literature; see Müller and Stadtmüller (1988), Herrmann et al. (1992) and Hall and Van Keilegom (2003) among many others. In Section 4, we discuss several estimators of $\sigma^{2}$ which are valid under different conditions on the error process $\{\varepsilon_{t}\}$ . Most notably, we introduce a new difference-based estimator of $\sigma^{2}$ for the case that $\{\varepsilon_{t}\}$ is an AR( $p$ ) process. This estimator improves on existing methods in several respects.

The methodological and theoretical analysis of the paper is complemented by a simulation study in Section 5 and an empirical application in Section 6. In the simulation study, we examine the finite sample properties of our multiscale test and compare it to the dependent SiZer methods introduced in Park et al. (2004) and Rondonotti et al. (2007). Moreover, we investigate the small sample performance of our estimator of $\sigma^{2}$ in the AR( $p$ ) case and compare it to the estimator of Hall and Van Keilegom (2003). In Section 6, we use our methods to analyse the temperature data from Figure 1.

2 The model

We now describe in detail the model setting that was briefly outlined in the Introduction. We observe a time series $\{Y_{t,T}:1\leq t\leq T\}$ of length $T$ which satisfies the nonparametric regression equation

$$Y_{t,T}=m\Big(\frac{t}{T}\Big)+\varepsilon_{t}$$

for $1\leq t\leq T$ . Here, $m$ is an unknown nonparametric function defined on $[0,1]$ and $\{\varepsilon_{t}:1\leq t\leq T\}$ is a zero-mean stationary error process. For simplicity, we restrict attention to equidistant design points $x_{t}=t/T$ . However, our methods and theory can also be carried over to non-equidistant designs. The stationary error process $\{\varepsilon_{t}\}$ is assumed to have the following properties:

(C1) The variables $\varepsilon_{t}$ allow for the representation $\varepsilon_{t}=G(\ldots,\eta_{t-1},\eta_{t},\eta_{t+1},\ldots)$, where $\eta_{t}$ are i.i.d. random variables and $G:\mathbb{R}^{\mathbb{Z}}\rightarrow\mathbb{R}$ is a measurable function.

(C2) It holds that $\|\varepsilon_{t}\|_{q}<\infty$ for some $q>4$, where $\|\varepsilon_{t}\|_{q}=(\mathbb{E}|\varepsilon_{t}|^{q})^{1/q}$.

Following Wu (2005), we impose conditions on the dependence structure of the error process $\{\varepsilon_{t}\}$ in terms of the physical dependence measure $d_{t,q}=\|\varepsilon_{t}-\varepsilon_{t}^{\prime}\|_{q}$ , where $\varepsilon_{t}^{\prime}=G(\ldots,\eta_{-1},\eta_{0}^{\prime},\eta_{1},\ldots,\eta_{t-1},\eta_{t},\eta_{t+1},\ldots)$ with $\{\eta_{t}^{\prime}\}$ being an i.i.d. copy of $\{\eta_{t}\}$ . In particular, we assume the following:

(C3) Define $\Theta_{t,q}=\sum\nolimits_{|s|\geq t}d_{s,q}$ for $t\geq 0$. It holds that $\Theta_{t,q}=O(t^{-\tau_{q}}(\log t)^{-A})$, where $A>\frac{2}{3}(1/q+1+\tau_{q})$ and $\tau_{q}=\{q^{2}-4+(q-2)\sqrt{q^{2}+20q+4}\}/8q$.

The conditions (C1)–(C3) are fulfilled by a wide range of stationary processes $\{\varepsilon_{t}\}$. As a first example, consider linear processes of the form $\varepsilon_{t}=\sum\nolimits_{i=0}^{\infty}c_{i}\eta_{t-i}$ with $\|\varepsilon_{t}\|_{q}<\infty$, where $c_{i}$ are absolutely summable coefficients and $\eta_{t}$ are i.i.d. innovations with $\mathbb{E}[\eta_{t}]=0$ and $\|\eta_{t}\|_{q}<\infty$. Trivially, (C1) and (C2) are fulfilled in this case. Moreover, if $|c_{i}|=O(\rho^{i})$ for some $\rho\in(0,1)$, then (C3) is easily seen to be satisfied as well. As a special case, consider an ARMA process $\{\varepsilon_{t}\}$ of the form $\varepsilon_{t}-\sum\nolimits_{i=1}^{p}a_{i}\varepsilon_{t-i}=\eta_{t}+\sum\nolimits_{j=1}^{r}b_{j}\eta_{t-j}$ with $\|\varepsilon_{t}\|_{q}<\infty$, where $a_{1},\ldots,a_{p}$ and $b_{1},\ldots,b_{r}$ are real-valued parameters. As before, we let $\eta_{t}$ be i.i.d. innovations with $\mathbb{E}[\eta_{t}]=0$ and $\|\eta_{t}\|_{q}<\infty$. Moreover, as usual, we suppose that the complex polynomials $A(z)=1-\sum\nolimits_{j=1}^{p}a_{j}z^{j}$ and $B(z)=1+\sum\nolimits_{j=1}^{r}b_{j}z^{j}$ do not have any roots in common. If $A(z)$ does not have any roots in the closed unit disc (that is, no roots $z$ with $|z|\leq 1$), then the ARMA process $\{\varepsilon_{t}\}$ is stationary and causal. Specifically, it has the representation $\varepsilon_{t}=\sum\nolimits_{i=0}^{\infty}c_{i}\eta_{t-i}$ with $|c_{i}|=O(\rho^{i})$ for some $\rho\in(0,1)$, implying that (C1)–(C3) are fulfilled. The results in Wu and Shao (2004) show that condition (C3) (as well as the other two conditions) is not only fulfilled for linear time series processes but also for a variety of non-linear processes.
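To make the model setting concrete, the following minimal sketch simulates $Y_{t,T}=m(t/T)+\varepsilon_{t}$ with causal AR(1) errors, one of the processes covered by (C1)–(C3). The function names, the linear trend $m(u)=2u$, the Gaussian innovations and the burn-in length are illustrative assumptions, not part of the paper. For an AR(1) process with coefficient $a_{1}$ and innovation variance $\nu^{2}$, the long-run variance has the closed form $\sigma^{2}=\nu^{2}/(1-a_{1})^{2}$, which the sketch evaluates as well.

```python
import numpy as np

def simulate_ar1_errors(T, a1, innovation_sd=1.0, burn_in=500, rng=None):
    """Simulate T values of eps_t = a1 * eps_{t-1} + eta_t, assuming |a1| < 1."""
    rng = np.random.default_rng() if rng is None else rng
    eta = rng.normal(0.0, innovation_sd, size=T + burn_in)
    eps = np.zeros(T + burn_in)
    for t in range(1, T + burn_in):
        eps[t] = a1 * eps[t - 1] + eta[t]
    # Drop the burn-in so that the retained series is close to stationary.
    return eps[burn_in:]

def ar1_long_run_variance(a1, innovation_sd=1.0):
    """Closed-form long-run variance sum_l Cov(eps_0, eps_l) of a causal AR(1)."""
    return innovation_sd**2 / (1.0 - a1) ** 2

def simulate_model(T, trend, a1, rng=None):
    """Draw Y_{t,T} = m(t/T) + eps_t on the equidistant design t/T, t = 1..T."""
    design = np.arange(1, T + 1) / T
    return design, trend(design) + simulate_ar1_errors(T, a1, rng=rng)

# Hypothetical example: linear trend m(u) = 2u with AR(1) errors, a1 = 0.5.
rng = np.random.default_rng(0)
design, y = simulate_model(T=500, trend=lambda u: 2.0 * u, a1=0.5, rng=rng)
sigma2 = ar1_long_run_variance(0.5)
```

The long-run variance exceeds the marginal variance $\nu^{2}/(1-a_{1}^{2})$ whenever $a_{1}>0$, which is why a test calibrated with the marginal variance would be miscalibrated under positive autocorrelation.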

3 The multiscale test

In this section, we introduce our multiscale method to test for local increases/decreases of the trend function $m$ and analyse its theoretical properties. We assume throughout that $m$ is continuously differentiable on $[0,1]$ . The test problem under consideration can be formulated as follows: Let $H_{0}(u,h)$ be the hypothesis that $m$ is constant on the interval $[u-h,u+h]$ . Since $m$ is continuously differentiable, $H_{0}(u,h)$ can be reformulated as

$$H_{0}(u,h):\ m^{\prime}(w)=0\text{ for all }w\in[u-h,u+h],$$

where $m^{\prime}$ is the first derivative of $m$ . We want to test the hypothesis $H_{0}(u,h)$ not only for a single interval $[u-h,u+h]$ but simultaneously for many different intervals. The overall null hypothesis is thus given by

$$H_{0}:\ \text{The hypothesis }H_{0}(u,h)\text{ holds true for all }(u,h)\in\mathcal{G}_{T},$$

where $\mathcal{G}_{T}$ is some large set of points $(u,h)$ . The details on the set $\mathcal{G}_{T}$ are discussed at the end of Section 3.1 below. Note that $\mathcal{G}_{T}$ in general depends on the sample size $T$ , implying that the null hypothesis $H_{0}=H_{0,T}$ depends on $T$ as well. We thus consider a sequence of null hypotheses $\{H_{0,T}:T=1,2,\ldots\}$ as $T$ increases. For simplicity of notation, we however suppress the dependence of $H_{0}$ on $T$ . In Sections 3.1 and 3.2, we step by step construct the multiscale test of the hypothesis $H_{0}$ . The theoretical properties of the test are analysed in Section 3.3.

3.1 Construction of the multiscale statistic

We first construct a test statistic for the hypothesis $H_{0}(u,h)$ , where $[u-h,u+h]$ is a given interval. To do so, we consider the kernel average

$$\widehat{\psi}_{T}(u,h)=\sum_{t=1}^{T}w_{t,T}(u,h)Y_{t,T},$$

where $w_{t,T}(u,h)$ is a kernel weight and $h$ is the bandwidth. In order to avoid boundary issues, we work with a local linear weighting scheme. We in particular set

$$w_{t,T}(u,h)=\frac{\Lambda_{t,T}(u,h)}{\{\sum_{t=1}^{T}\Lambda_{t,T}(u,h)^{2}\}^{1/2}}, \tag{3.1}$$

where

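$$\Lambda_{t,T}(u,h)=K\Big(\frac{\frac{t}{T}-u}{h}\Big)\Big[S_{T,0}(u,h)\Big(\frac{\frac{t}{T}-u}{h}\Big)-S_{T,1}(u,h)\Big],$$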

$S_{T,\ell}(u,h)=(Th)^{-1}\sum\nolimits_{t=1}^{T}K(\frac{\frac{t}{T}-u}{h})(\frac{\frac{t}{T}-u}{h})^{\ell}$ for $\ell=0,1,2$ and $K$ is a kernel function with the following properties:

(C4) The kernel $K$ is non-negative, symmetric about zero and integrates to one. Moreover, it has compact support $[-1,1]$ and is Lipschitz continuous, that is, $|K(v)-K(w)|\leq C|v-w|$ for any $v,w\in\mathbb{R}$ and some constant $C>0$.

The kernel average $\widehat{\psi}_{T}(u,h)$ is nothing else than a rescaled local linear estimator of the derivative $m^{\prime}(u)$ with bandwidth $h$.

Footnote 3: As an alternative to the local linear weights defined in (3.1), we could also work with the weights $w_{t,T}(u,h)=K^{\prime}(h^{-1}[u-t/T])/\{\sum\nolimits_{t=1}^{T}K^{\prime}(h^{-1}[u-t/T])^{2}\}^{1/2}$, where the kernel function $K$ is assumed to be differentiable and $K^{\prime}$ is its derivative. We however prefer to use local linear weights as these have superior theoretical properties at the boundary.

A test statistic for the hypothesis $H_{0}(u,h)$ is given by the normalized kernel average $\widehat{\psi}_{T}(u,h)/\widehat{\sigma}$ , where $\widehat{\sigma}^{2}$ is an estimator of the long-run variance $\sigma^{2}=\sum\nolimits_{\ell=-\infty}^{\infty}\textnormal{Cov}(\varepsilon_{0},\varepsilon_{\ell})$ of the error process $\{\varepsilon_{t}\}$ . The problem of estimating $\sigma^{2}$ is discussed in detail in Section 4. For the time being, we suppose that $\widehat{\sigma}^{2}$ is an estimator with reasonable theoretical properties. Specifically, we assume that $\widehat{\sigma}^{2}=\sigma^{2}+o_{p}(\rho_{T})$ with $\rho_{T}=o(1/\log T)$ . This is a fairly weak condition which is in particular satisfied by the estimators of $\sigma^{2}$ analysed in Section 4. The kernel weights $w_{t,T}(u,h)$ are chosen such that in the case of independent errors $\varepsilon_{t}$ , $\textnormal{Var}(\widehat{\psi}_{T}(u,h))=\sigma^{2}$ for any location $u$ and bandwidth $h$ , where the long-run error variance $\sigma^{2}$ simplifies to $\sigma^{2}=\textnormal{Var}(\varepsilon_{t})$ . In the more general case that the error terms satisfy the weak dependence conditions from Section 2, $\textnormal{Var}(\widehat{\psi}_{T}(u,h))=\sigma^{2}+o(1)$ for any $u$ and $h$ under consideration. Hence, for sufficiently large sample sizes $T$ , the test statistic $\widehat{\psi}_{T}(u,h)/\widehat{\sigma}$ has approximately unit variance.
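The construction of the weights can be sketched numerically as follows. This is a minimal illustration rather than the authors' implementation: the Epanechnikov kernel is one possible choice satisfying (C4), and the function names and the values of $T$, $u$ and $h$ are assumptions made for the example. By construction the weights satisfy $\sum_{t}w_{t,T}(u,h)^{2}=1$, which gives $\textnormal{Var}(\widehat{\psi}_{T}(u,h))=\sigma^{2}$ for independent errors, and they sum to zero, so that a constant trend contributes nothing to $\widehat{\psi}_{T}(u,h)$.

```python
import numpy as np

def epanechnikov(x):
    """Epanechnikov kernel: non-negative, symmetric, supported on [-1, 1]."""
    return np.where(np.abs(x) <= 1.0, 0.75 * (1.0 - x**2), 0.0)

def local_linear_weights(u, h, T, kernel=epanechnikov):
    """Weights w_{t,T}(u, h) for t = 1..T, built from Lambda_{t,T}(u, h)."""
    x = (np.arange(1, T + 1) / T - u) / h
    k = kernel(x)
    s0 = np.sum(k) / (T * h)       # S_{T,0}(u, h)
    s1 = np.sum(k * x) / (T * h)   # S_{T,1}(u, h)
    lam = k * (s0 * x - s1)        # Lambda_{t,T}(u, h)
    return lam / np.sqrt(np.sum(lam**2))

def kernel_average(y, u, h):
    """psi_hat_T(u, h) = sum_t w_{t,T}(u, h) * Y_{t,T}."""
    return float(np.dot(local_linear_weights(u, h, len(y)), y))

T = 200
w = local_linear_weights(u=0.5, h=0.1, T=T)
# A constant series gives a kernel average of (numerically) zero ...
psi_constant = kernel_average(np.full(T, 3.0), u=0.5, h=0.1)
# ... whereas a noiseless increasing trend gives a positive value.
psi_increasing = kernel_average(np.arange(1, T + 1) / T, u=0.5, h=0.1)
```

The sign of the kernel average thus indicates the direction of the local slope, which is what allows rejections of $H_{0}(u,h)$ to be read as local increases or decreases.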

We now combine the test statistics $\widehat{\psi}_{T}(u,h)/\widehat{\sigma}$ for a wide range of different locations $u$ and bandwidths or scales $h$ . There are different ways to do so, leading to different types of multiscale statistics. Our multiscale statistic is defined as

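$$\widehat{\Psi}_{T}=\max_{(u,h)\in\mathcal{G}_{T}}\Big\{\Big|\frac{\widehat{\psi}_{T}(u,h)}{\widehat{\sigma}}\Big|-\lambda(h)\Big\},$$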

where $\lambda(h)=\sqrt{2\log\{1/(2h)\}}$ and $\mathcal{G}_{T}$ is the set of points $(u,h)$ that are taken into consideration. The details on the set $\mathcal{G}_{T}$ are given below. As can be seen, the statistic $\widehat{\Psi}_{T}$ does not simply aggregate the individual statistics $\widehat{\psi}_{T}(u,h)/\widehat{\sigma}$ by taking the supremum over all points $(u,h)\in\mathcal{G}_{T}$ as in more traditional multiscale approaches. We rather calibrate the statistics $\widehat{\psi}_{T}(u,h)/\widehat{\sigma}$ that correspond to the bandwidth $h$ by subtracting the additive correction term $\lambda(h)$ . This approach was pioneered by Dümbgen and Spokoiny (2001) and has been used in numerous other studies since then; see e.g. Dümbgen (2002), Rohde (2008), Dümbgen and Walther (2008), Rufibach and Walther (2010), Schmidt-Hieber et al. (2013) and Eckle et al. (2017).

To see the heuristic idea behind the additive correction $\lambda(h)$ , consider for a moment the uncorrected statistic

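$$\widehat{\Psi}_{T,\text{uncorrected}}=\max_{(u,h)\in\mathcal{G}_{T}}\Big|\frac{\widehat{\psi}_{T}(u,h)}{\widehat{\sigma}}\Big|$$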

and suppose that the hypothesis $H_{0}(u,h)$ is true for all $(u,h)\in\mathcal{G}_{T}$ . For simplicity, assume that the errors $\varepsilon_{t}$ are i.i.d. normally distributed and neglect the estimation error in $\widehat{\sigma}$ , that is, set $\widehat{\sigma}=\sigma$ . Moreover, suppose that the set $\mathcal{G}_{T}$ only consists of the points $(u_{k},h_{\ell})=((2k-1)h_{\ell},h_{\ell})$ with $k=1,\ldots,\lfloor 1/2h_{\ell}\rfloor$ and $\ell=1,\ldots,L$ . In this case, we can write

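$$\widehat{\Psi}_{T,\text{uncorrected}}=\max_{1\leq\ell\leq L}\max_{1\leq k\leq\lfloor 1/2h_{\ell}\rfloor}\Big|\frac{\widehat{\psi}_{T}(u_{k},h_{\ell})}{\sigma}\Big|.$$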

Under our simplifying assumptions, the statistics $\widehat{\psi}_{T}(u_{k},h_{\ell})/\sigma$ with $k=1,\ldots,\lfloor 1/2h_{\ell}\rfloor$ are independent and standard normal for any given bandwidth $h_{\ell}$ . Since the maximum over $\lfloor 1/2h\rfloor$ independent standard normal random variables is $\lambda(h)+o_{p}(1)$ as $h\rightarrow 0$ , we obtain that $\max_{k}\widehat{\psi}_{T}(u_{k},h_{\ell})/\sigma$ is approximately of size $\lambda(h_{\ell})$ for small bandwidths $h_{\ell}$ . As $\lambda(h)\rightarrow\infty$ for $h\rightarrow 0$ , this implies that $\max_{k}\widehat{\psi}_{T}(u_{k},h_{\ell})/\sigma$ tends to be much larger in size for small than for large bandwidths $h_{\ell}$ . As a result, the stochastic behaviour of the uncorrected statistic $\widehat{\Psi}_{T,\text{uncorrected}}$ tends to be dominated by the statistics $\widehat{\psi}_{T}(u_{k},h_{\ell})$ corresponding to small bandwidths $h_{\ell}$ . The additively corrected statistic $\widehat{\Psi}_{T}$ , in contrast, puts the statistics $\widehat{\psi}_{T}(u_{k},h_{\ell})$ corresponding to different bandwidths $h_{\ell}$ on a more equal footing, thus counteracting the dominance of small bandwidth values.
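The heuristic can be illustrated with a small simulation. The sketch below, whose function names, bandwidth values and replication count are assumptions chosen for illustration, compares $\lambda(h)$ with the Monte Carlo mean of the maximum of $n=1/(2h)$ independent standard normal variables. Both quantities grow as $h$ shrinks, which is the dominance effect that the additive correction counteracts. The agreement between the two is only asymptotic, so the simulated values need not be close to $\lambda(h)$ at these moderate values of $n$.

```python
import numpy as np

def additive_correction(h):
    """lambda(h) = sqrt(2 * log(1 / (2h))), defined for 0 < h <= 1/2."""
    return np.sqrt(2.0 * np.log(1.0 / (2.0 * h)))

def mean_max_of_normals(n, n_rep, rng):
    """Monte Carlo mean of the maximum of n independent N(0, 1) draws."""
    return float(rng.standard_normal((n_rep, n)).max(axis=1).mean())

rng = np.random.default_rng(1)
# Number of non-overlapping windows n and the matching bandwidth h = 1 / (2n).
counts = [5, 50, 500]
bandwidths = [1.0 / (2 * n) for n in counts]
corrections = [float(additive_correction(h)) for h in bandwidths]
mean_maxima = [mean_max_of_normals(n, 2000, rng) for n in counts]
```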

The multiscale statistic $\widehat{\Psi}_{T}$ simultaneously takes into account all locations $u$ and bandwidths $h$ with $(u,h)\in\mathcal{G}_{T}$ . Throughout the paper, we suppose that $\mathcal{G}_{T}$ is some subset of $\mathcal{G}_{T}^{\text{full}}=\{(u,h):u=t/T\text{ for some }1\leq t\leq T\text{ and }h\in[h_{\min},h_{\max}]\}$ , where $h_{\min}$ and $h_{\max}$ denote some minimal and maximal bandwidth value, respectively. For our theory to work, we require the following conditions to hold:

(C5) $|\mathcal{G}_{T}|=O(T^{\theta})$ for some arbitrarily large but fixed constant $\theta>0$, where $|\mathcal{G}_{T}|$ denotes the cardinality of $\mathcal{G}_{T}$.

(C6) $h_{\min}\gg T^{-(1-\frac{2}{q})}\log T$, that is, $h_{\min}/\{T^{-(1-\frac{2}{q})}\log T\}\rightarrow\infty$ with $q>4$ defined in (C2), and $h_{\max}<1/2$.

According to (C5), the number of points $(u,h)$ in $\mathcal{G}_{T}$ should not grow faster than $T^{\theta}$ for some arbitrarily large but fixed $\theta>0$ . This is a fairly weak restriction as it allows the set $\mathcal{G}_{T}$ to be extremely large compared to the sample size $T$ . For example, we may work with the set

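$$\mathcal{G}_{T}=\big\{(u,h):u=t/T\text{ for some }1\leq t\leq T\text{ and }h\in[h_{\min},h_{\max}]\text{ with }h=t/T\text{ for some }1\leq t\leq T\big\},$$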

which contains more than enough points $(u,h)$ for most practical applications. Condition (C6) imposes some restrictions on the minimal and maximal bandwidths $h_{\min}$ and $h_{\max}$. These conditions are fairly weak, allowing us to choose the bandwidth window $[h_{\min},h_{\max}]$ extremely large. The lower bound on $h_{\min}$ depends on the parameter $q$ defined in (C2) which specifies the number of existing moments for the error terms $\varepsilon_{t}$. Since $1-2/q>1/2$ for any $q>4$, we can choose $h_{\min}$ to be of the order $T^{-1/2}$ for any $q>4$. Hence, we can let $h_{\min}$ converge to $0$ very quickly even if only the first few moments of the error terms $\varepsilon_{t}$ exist. If all moments exist (i.e. $q=\infty$), $h_{\min}$ may converge to $0$ almost as quickly as $T^{-1}\log T$. Furthermore, the maximal bandwidth $h_{\max}$ is not even required to converge to $0$, which implies that we can pick it very large.
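A grid of this type is straightforward to enumerate. The following sketch builds all pairs $(u,h)$ with $u=t/T$ and $h=s/T\in[h_{\min},h_{\max}]$; the function name and the values of $T$, $h_{\min}$ and $h_{\max}$ are illustrative assumptions. The resulting set has at most $T^{2}$ elements, so (C5) holds with $\theta=2$.

```python
import numpy as np

def build_grid(T, h_min, h_max):
    """All (u, h) with u = t/T and h = s/T in [h_min, h_max], 1 <= t, s <= T."""
    points = np.arange(1, T + 1) / T
    bandwidths = points[(points >= h_min) & (points <= h_max)]
    return [(float(u), float(h)) for h in bandwidths for u in points]

T = 100
# Hypothetical bandwidth window; (C6) requires h_max < 1/2.
grid = build_grid(T, h_min=0.05, h_max=0.25)
n_points = len(grid)
```

In practice one may thin the grid, for example by using only every few locations $u$, since (C5) only requires $\mathcal{G}_{T}$ to be a subset of $\mathcal{G}_{T}^{\text{full}}$ of polynomial size.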

Remark 3.1.

The above construction of the multiscale statistic can be easily adapted to hypotheses other than $H_{0}$ . To do so, one simply needs to replace the kernel weights $w_{t,T}(u,h)$ defined in (3.1) by appropriate versions which are suited to test the hypothesis of interest. For example, if one wants to test for local convexity/concavity of $m$ , one may define the kernel weights $w_{t,T}(u,h)$ such that the kernel average $\widehat{\psi}_{T}(u,h)$ is a (rescaled) estimator of the second derivative of $m$ at the location $u$ with bandwidth $h$ .

3.2 The test procedure

In order to formulate a test for the null hypothesis $H_{0}$ , we still need to specify a critical value. To do so, we define the statistic

$$\Phi_{T}=\max_{(u,h)\in\mathcal{G}_{T}}\Big\{\Big|\frac{\phi_{T}(u,h)}{\sigma}\Big|-\lambda(h)\Big\},$$

where $\phi_{T}(u,h)=\sum\nolimits_{t=1}^{T}w_{t,T}(u,h)\,\sigma Z_{t}$ and $Z_{t}$ are independent standard normal random variables. The statistic $\Phi_{T}$ can be regarded as a Gaussian version of the test statistic $\widehat{\Psi}_{T}$ under the null hypothesis $H_{0}$ . Let $q_{T}(\alpha)$ be the $(1-\alpha)$ -quantile of $\Phi_{T}$ . Importantly, the quantile $q_{T}(\alpha)$ can be computed by Monte Carlo simulations and can thus be regarded as known. Our multiscale test of the hypothesis $H_{0}$ is now defined as follows: For a given significance level $\alpha\in(0,1)$ , we reject $H_{0}$ if $\widehat{\Psi}_{T}>q_{T}(\alpha)$ .
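The Monte Carlo computation of $q_{T}(\alpha)$ can be sketched as follows. Since $\phi_{T}(u,h)/\sigma=\sum_{t=1}^{T}w_{t,T}(u,h)Z_{t}$, the unknown $\sigma$ cancels and the quantile depends only on $T$, the kernel and the grid. The code is a minimal sketch under stated assumptions: the Epanechnikov kernel, the small grid, the number of replications and all function names are illustrative choices, not the authors' implementation.

```python
import numpy as np

def epanechnikov(x):
    """Epanechnikov kernel, one possible choice satisfying (C4)."""
    return np.where(np.abs(x) <= 1.0, 0.75 * (1.0 - x**2), 0.0)

def local_linear_weights(u, h, T):
    """Local linear weights w_{t,T}(u, h) for t = 1..T."""
    x = (np.arange(1, T + 1) / T - u) / h
    k = epanechnikov(x)
    s0, s1 = np.sum(k) / (T * h), np.sum(k * x) / (T * h)
    lam = k * (s0 * x - s1)
    return lam / np.sqrt(np.sum(lam**2))

def gaussian_quantile(T, grid, alpha, n_rep=1000, rng=None):
    """Monte Carlo (1 - alpha)-quantile of the Gaussian statistic Phi_T."""
    rng = np.random.default_rng() if rng is None else rng
    weights = np.array([local_linear_weights(u, h, T) for u, h in grid])
    h_values = np.array([h for _, h in grid])
    lam = np.sqrt(2.0 * np.log(1.0 / (2.0 * h_values)))
    # Each column of z is one independent draw of (Z_1, ..., Z_T).
    z = rng.standard_normal((T, n_rep))
    # phi_T(u, h) / sigma = sum_t w_{t,T}(u, h) Z_t, so sigma drops out.
    stats = np.max(np.abs(weights @ z) - lam[:, None], axis=0)
    return float(np.quantile(stats, 1.0 - alpha))

T = 100
design = np.arange(1, T + 1) / T
# Hypothetical thinned grid: three bandwidths and every fifth design point.
grid = [(u, h) for h in (0.05, 0.1, 0.2) for u in design[4::5]]
q05 = gaussian_quantile(T, grid, alpha=0.05, rng=np.random.default_rng(2))
q10 = gaussian_quantile(T, grid, alpha=0.10, rng=np.random.default_rng(2))
```

Given an estimate $\widehat{\sigma}$ and the statistic $\widehat{\Psi}_{T}$ computed on the same grid, the test would then reject $H_{0}$ at level $0.05$ whenever $\widehat{\Psi}_{T}$ exceeds `q05`. Increasing `n_rep` reduces the simulation error in the quantile.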

3.3 Theoretical properties of the test

In order to examine the theoretical properties of our multiscale test, we introduce the auxiliary multiscale statistic

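$$\widehat{\Phi}_{T}=\max_{(u,h)\in\mathcal{G}_{T}}\Big\{\Big|\frac{\widehat{\phi}_{T}(u,h)}{\widehat{\sigma}}\Big|-\lambda(h)\Big\}$$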

with $\widehat{\phi}_{T}(u,h)=\widehat{\psi}_{T}(u,h)-\mathbb{E}[\widehat{\psi}_{T}(u,h)]=\sum\nolimits_{t=1}^{T}w_{t,T}(u,h)\varepsilon_{t}$ . The following result is central to the theoretical analysis of our multiscale test. According to it, the (known) quantile $q_{T}(\alpha)$ of the Gaussian statistic $\Phi_{T}$ defined in Section 3.2 can be used as a proxy for the $(1-\alpha)$ -quantile of the multiscale statistic $\widehat{\Phi}_{T}$ .

Theorem 3.1.

Let (C1)–(C6) be fulfilled and assume that $\widehat{\sigma}^{2}=\sigma^{2}+o_{p}(\rho_{T})$ with $\rho_{T}=o(1/\log T)$ . Then

$$\mathbb{P}\big(\widehat{\Phi}_{T}\leq q_{T}(\alpha)\big)=(1-\alpha)+o(1).$$

A full proof of Theorem 3.1 is given in the Supplementary Material. Here, we briefly outline the proof strategy, which splits into two main steps. In the first step, we replace the statistic $\widehat{\Phi}_{T}$ for each $T\geq 1$ by a statistic $\widetilde{\Phi}_{T}$ with the same distribution as $\widehat{\Phi}_{T}$ and the property that

$$\big|\widetilde{\Phi}_{T}-\Phi_{T}\big|=o_{p}(\delta_{T}),\tag{3.5}$$

where $\delta_{T}=o(1)$ and the Gaussian statistic $\Phi_{T}$ is defined in Section 3.2. We thus replace the statistic $\widehat{\Phi}_{T}$ by an identically distributed version which is close to a Gaussian statistic whose distribution is known. To do so, we make use of strong approximation theory for dependent processes as derived in Berkes et al. (2014). In the second step, we show that

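$$\sup_{x\in\mathbb{R}}\big|\mathbb{P}(\widetilde{\Phi}_{T}\leq x)-\mathbb{P}(\Phi_{T}\leq x)\big|=o(1),\tag{3.6}$$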

which immediately implies the statement of Theorem 3.1. Importantly, the convergence result (3.5) is not sufficient for establishing (3.6). Put differently, the fact that $\widetilde{\Phi}_{T}$ can be approximated by $\Phi_{T}$ in the sense that $\widetilde{\Phi}_{T}-\Phi_{T}=o_{p}(\delta_{T})$ does not imply that the distribution of $\widetilde{\Phi}_{T}$ is close to that of $\Phi_{T}$ in the sense of (3.6). For (3.6) to hold, we additionally require the distribution of $\Phi_{T}$ to have some sort of continuity property. Specifically, we prove that

$$\sup_{x\in\mathbb{R}}\mathbb{P}\big(|\Phi_{T}-x|\leq\delta_{T}\big)=o(1),\tag{3.7}$$

which says that $\Phi_{T}$ does not concentrate too strongly in small regions of the form $[x-\delta_{T},x+\delta_{T}]$. The main tools for verifying (3.7) are anti-concentration results for Gaussian random vectors as derived in Chernozhukov et al. (2015). The claim (3.6) can be proven by using (3.5) together with (3.7), which in turn yields Theorem 3.1.
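The way (3.5) and (3.7) combine to give (3.6) can be made explicit by a short sandwich argument. The display below is a sketch, not the full proof from the Supplementary Material. It assumes that (3.5) is read as $\mathbb{P}(|\widetilde{\Phi}_{T}-\Phi_{T}|>\delta_{T})=o(1)$ and that (3.7) has the form $\sup_{x}\mathbb{P}(|\Phi_{T}-x|\leq\delta_{T})=o(1)$. For any $x\in\mathbb{R}$:

```latex
\begin{align*}
\mathbb{P}(\widetilde{\Phi}_T \le x)
  &\le \mathbb{P}(\Phi_T \le x + \delta_T)
     + \mathbb{P}\big(|\widetilde{\Phi}_T - \Phi_T| > \delta_T\big) \\
  &\le \mathbb{P}(\Phi_T \le x)
     + \mathbb{P}\big(|\Phi_T - x| \le \delta_T\big) + o(1), \\
\mathbb{P}(\widetilde{\Phi}_T \le x)
  &\ge \mathbb{P}(\Phi_T \le x - \delta_T)
     - \mathbb{P}\big(|\widetilde{\Phi}_T - \Phi_T| > \delta_T\big) \\
  &\ge \mathbb{P}(\Phi_T \le x)
     - \mathbb{P}\big(|\Phi_T - x| \le \delta_T\big) - o(1).
\end{align*}
```

The $o(1)$ terms do not depend on $x$, so taking the supremum over $x$ and applying the anti-concentration bound to the middle terms gives the uniform closeness of the two distribution functions.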

The main idea of our proof strategy is to combine strong approximation theory with anti-concentration bounds for Gaussian random vectors to show that the quantiles of the multiscale statistic $\widehat{\Phi}_{T}$ can be proxied by those of a Gaussian analogue. This strategy is quite general in nature and may be applied to other multiscale problems for dependent data. Strong approximation theory has also been used to investigate multiscale tests for independent data; see e.g. Schmidt-Hieber et al. (2013). However, it has not been combined with anti-concentration results to approximate the quantiles of the multiscale statistic. As an alternative to strong approximation theory, Eckle et al. (2017) and Proksch et al. (2018) have recently used Gaussian approximation results derived in Chernozhukov et al. (2014, 2017) to analyse multiscale tests for independent data. Even though it might be possible to adapt these techniques to the case of dependent data, this is not trivial at all as part of the technical arguments and the Gaussian approximation tools strongly rely on the assumption of independence.

We now investigate the theoretical properties of our multiscale test with the help of Theorem 3.1. The first result is an immediate consequence of Theorem 3.1. It says that the test has the correct (asymptotic) size.

Proposition 3.1.

Let the conditions of Theorem 3.1 be satisfied. Under the null hypothesis $H_{0}$ , it holds that

$$\mathbb{P}\big(\widehat{\Psi}_{T}\leq q_{T}(\alpha)\big)=(1-\alpha)+o(1).$$

The second result characterizes the power of the multiscale test against local alternatives. To formulate it, we consider any sequence of functions $m=m_{T}$ with the following property: There exists $(u,h)\in\mathcal{G}_{T}$ with $[u-h,u+h]\subseteq[0,1]$ such that

$$m_{T}^{\prime}(w)\geq c_{T}\sqrt{\frac{\log T}{Th^{3}}}\quad\text{for all }w\in[u-h,u+h],\tag{3.8}$$

where $\{c_{T}\}$ is any sequence of positive numbers with $c_{T}\rightarrow\infty$ . Alternatively to (3.8), we may also assume that $-m_{T}^{\prime}(w)\geq c_{T}\sqrt{\log T/(Th^{3})}$ for all $w\in[u-h,u+h]$ . According to the following result, our test has asymptotic power $1$ against local alternatives of the form (3.8).

Proposition 3.2.

Let the conditions of Theorem 3.1 be satisfied and consider any sequence of functions $m_{T}$ with the property (3.8). Then

$$\mathbb{P}\big(\widehat{\Psi}_{T}\leq q_{T}(\alpha)\big)=o(1).$$

The proof of Proposition 3.2 can be found in the Supplementary Material. To formulate the next result, we define

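$$\begin{aligned}\Pi_{T}^{\pm}&=\big\{I_{u,h}=[u-h,u+h]:(u,h)\in\mathcal{A}_{T}^{\pm}\big\}\\ \Pi_{T}^{+}&=\big\{I_{u,h}=[u-h,u+h]:(u,h)\in\mathcal{A}_{T}^{+}\text{ and }I_{u,h}\subseteq[0,1]\big\}\\ \Pi_{T}^{-}&=\big\{I_{u,h}=[u-h,u+h]:(u,h)\in\mathcal{A}_{T}^{-}\text{ and }I_{u,h}\subseteq[0,1]\big\}\end{aligned}$$

together with

$$\begin{aligned}\mathcal{A}_{T}^{\pm}&=\Big\{(u,h)\in\mathcal{G}_{T}:\Big|\frac{\widehat{\psi}_{T}(u,h)}{\widehat{\sigma}}\Big|>q_{T}(\alpha)+\lambda(h)\Big\}\\ \mathcal{A}_{T}^{+}&=\Big\{(u,h)\in\mathcal{G}_{T}:\frac{\widehat{\psi}_{T}(u,h)}{\widehat{\sigma}}>q_{T}(\alpha)+\lambda(h)\Big\}\\ \mathcal{A}_{T}^{-}&=\Big\{(u,h)\in\mathcal{G}_{T}:-\frac{\widehat{\psi}_{T}(u,h)}{\widehat{\sigma}}>q_{T}(\alpha)+\lambda(h)\Big\}.\end{aligned}$$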

$\Pi_{T}^{\pm}$ is the collection of intervals $I_{u,h}=[u-h,u+h]$ for which the (corrected) test statistic $|\widehat{\psi}_{T}(u,h)/\widehat{\sigma}|-\lambda(h)$ lies above the critical value $q_{T}(\alpha)$ , that is, for which our multiscale test rejects the hypothesis $H_{0}(u,h)$ . $\Pi_{T}^{+}$ and $\Pi_{T}^{-}$ can be interpreted analogously but take into account the sign of the statistic $\widehat{\psi}_{T}(u,h)/\widehat{\sigma}$ . With this notation at hand, we consider the events

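$$\begin{aligned}E_{T}^{\pm}&=\big\{\forall I_{u,h}\in\Pi_{T}^{\pm}:m^{\prime}(v)\neq 0\text{ for some }v\in I_{u,h}\big\}\\ E_{T}^{+}&=\big\{\forall I_{u,h}\in\Pi_{T}^{+}:m^{\prime}(v)>0\text{ for some }v\in I_{u,h}\big\}\\ E_{T}^{-}&=\big\{\forall I_{u,h}\in\Pi_{T}^{-}:m^{\prime}(v)<0\text{ for some }v\in I_{u,h}\big\}.\end{aligned}$$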

$E_{T}^{\pm}$ ( $E_{T}^{+}$ , $E_{T}^{-}$ ) is the event that the function $m$ is non-constant (increasing, decreasing) on all intervals $I_{u,h}\in\Pi_{T}^{\pm}$ ( $\Pi_{T}^{+}$ , $\Pi_{T}^{-}$ ). More precisely, $E_{T}^{\pm}$ ( $E_{T}^{+}$ , $E_{T}^{-}$ ) is the event that for each interval $I_{u,h}\in\Pi_{T}^{\pm}$ ( $\Pi_{T}^{+}$ , $\Pi_{T}^{-}$ ), there is a subset $J_{u,h}\subseteq I_{u,h}$ with $m$ being a non-constant (increasing, decreasing) function on $J_{u,h}$ . We can make the following formal statement about the events $E_{T}^{\pm}$ , $E_{T}^{+}$ and $E_{T}^{-}$ , whose proof is given in the Supplementary Material.

Proposition 3.3.

Let the conditions of Theorem 3.1 be fulfilled. Then for $\ell\in\{\pm,+,-\}$ , it holds that

$$\mathbb{P}\big(E_{T}^{\ell}\big)\geq(1-\alpha)+o(1).$$

According to Proposition 3.3, we can make simultaneous confidence statements of the following form: With (asymptotic) probability $\geq(1-\alpha)$, the trend function $m$ is non-constant (increasing, decreasing) on some part of the interval $I_{u,h}$ for all $I_{u,h}\in\Pi_{T}^{\pm}$ ($\Pi_{T}^{+}$, $\Pi_{T}^{-}$). Hence, our multiscale procedure allows us to identify, with a pre-specified confidence, time regions where there is an increase/decrease in the time trend $m$.

Remark 3.2.

Unlike $\Pi_{T}^{\pm}$ , the sets $\Pi_{T}^{+}$ and $\Pi_{T}^{-}$ only contain intervals $I_{u,h}=[u-h,u+h]$ which are subsets of $[0,1]$ . We thus exclude points $(u,h)\in\mathcal{A}_{T}^{+}$ and $(u,h)\in\mathcal{A}_{T}^{-}$ which lie at the boundary, that is, for which $I_{u,h}\nsubseteq[0,1]$ . The reason is as follows: Let $(u,h)\in\mathcal{A}_{T}^{+}$ with $I_{u,h}\nsubseteq[0,1]$ . Our technical arguments allow us to say, with asymptotic confidence $\geq 1-\alpha$ , that $m^{\prime}(v)\neq 0$ for some $v\in I_{u,h}$ . However, we cannot say whether $m^{\prime}(v)>0$ or $m^{\prime}(v)<0$ , that is, we cannot make confidence statements about the sign. Crudely speaking, the problem is that the local linear weights $w_{t,T}(u,h)$ behave quite differently at boundary points $(u,h)$ with $I_{u,h}\nsubseteq[0,1]$ . As a consequence, we can include boundary points $(u,h)$ in $\Pi_{T}^{\pm}$ but not in $\Pi_{T}^{+}$ and $\Pi_{T}^{-}$ .
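The construction of the three interval collections, including the exclusion of boundary points from the signed sets, can be sketched as follows. The sketch assumes that rejection at $(u,h)$ means $|\widehat{\psi}_{T}(u,h)/\widehat{\sigma}|>q_{T}(\alpha)+\lambda(h)$ and that $\lambda(h)=\sqrt{2\log\{1/(2h)\}}$, which is defined outside this section. The function name and the input format are hypothetical.

```python
import math

def rejection_sets(stats, q_alpha, tol=1e-12):
    """Collect the intervals in Pi^{+-}, Pi^{+} and Pi^{-}.

    stats   : iterable of (u, h, s) with s = psi_hat(u, h) / sigma_hat
    q_alpha : critical value q_T(alpha)
    """
    pi_pm, pi_plus, pi_minus = [], [], []
    for u, h, s in stats:
        lam = math.sqrt(2.0 * math.log(1.0 / (2.0 * h)))  # ASSUMED lambda(h), needs h < 1/2
        threshold = q_alpha + lam
        interval = (u - h, u + h)
        # Signed statements are only made for intervals inside [0, 1]
        interior = interval[0] >= -tol and interval[1] <= 1.0 + tol
        if abs(s) > threshold:
            pi_pm.append(interval)
        if interior and s > threshold:
            pi_plus.append(interval)
        if interior and -s > threshold:
            pi_minus.append(interval)
    return pi_pm, pi_plus, pi_minus

# Hypothetical statistics: the second point lies at the boundary
example = [(0.5, 0.1, 5.0), (0.05, 0.1, 5.0), (0.5, 0.2, -5.0), (0.3, 0.1, 0.5)]
pi_pm, pi_plus, pi_minus = rejection_sets(example, q_alpha=2.0)
```

In this example the boundary point contributes to the unsigned collection but not to the signed ones, mirroring the restriction discussed in the remark above.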

The statement of Proposition 3.3 suggests presenting the results of our multiscale test graphically by plotting the intervals $I_{u,h}\in\Pi_{T}^{\ell}$ for $\ell\in\{\pm,+,-\}$, that is, by plotting the intervals where (with asymptotic confidence $\geq 1-\alpha$) our test detects a violation of the null hypothesis. The drawback of this graphical presentation is that the number of intervals in $\Pi_{T}^{\ell}$ is often quite large. To obtain a better graphical summary of the results, we replace $\Pi_{T}^{\ell}$ by a subset $\Pi_{T}^{\ell,\min}$ which is constructed as follows: As in Dümbgen (2002), we call an interval $I_{u,h}\in\Pi_{T}^{\ell}$ minimal if there is no other interval $I_{u^{\prime},h^{\prime}}\in\Pi_{T}^{\ell}$ with $I_{u^{\prime},h^{\prime}}\subset I_{u,h}$. Let $\Pi_{T}^{\ell,\min}$ be the set of all minimal intervals in $\Pi_{T}^{\ell}$ for $\ell\in\{\pm,+,-\}$ and define the events

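$$\begin{aligned}E_{T}^{\pm,\min}&=\big\{\forall I_{u,h}\in\Pi_{T}^{\pm,\min}:m^{\prime}(v)\neq 0\text{ for some }v\in I_{u,h}\big\}\\ E_{T}^{+,\min}&=\big\{\forall I_{u,h}\in\Pi_{T}^{+,\min}:m^{\prime}(v)>0\text{ for some }v\in I_{u,h}\big\}\\ E_{T}^{-,\min}&=\big\{\forall I_{u,h}\in\Pi_{T}^{-,\min}:m^{\prime}(v)<0\text{ for some }v\in I_{u,h}\big\}.\end{aligned}$$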

It is easily seen that $E_{T}^{\ell}=E_{T}^{\ell,\min}$ for $\ell\in\{\pm,+,-\}$ . Hence, by Proposition 3.3, it holds that

$$\mathbb{P}\big(E_{T}^{\ell,\min}\big)\geq(1-\alpha)+o(1)$$

for $\ell\in\{\pm,+,-\}$. This suggests plotting the minimal intervals in $\Pi_{T}^{\ell,\min}$ rather than the whole collection of intervals $\Pi_{T}^{\ell}$ as a graphical summary of the test results. In particular, we present the test results in this way in our application in Section 6.
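Extracting the minimal intervals from a collection is a simple filtering step. The following sketch represents intervals as `(left, right)` tuples and uses a quadratic-time scan, which should be adequate for grids of the size considered here. The function name is hypothetical.

```python
def minimal_intervals(intervals):
    """Return the intervals that contain no other interval of the
    collection as a proper subset (the minimal intervals)."""
    unique = sorted(set(intervals))

    def proper_subset(a, b):
        # True if interval a is contained in interval b and differs from it
        return b[0] <= a[0] and a[1] <= b[1] and a != b

    return [I for I in unique if not any(proper_subset(J, I) for J in unique)]

# Hypothetical collection of rejected intervals
rejected = [(0.1, 0.5), (0.2, 0.4), (0.6, 0.9), (0.25, 0.35), (0.3, 0.7)]
minimal = minimal_intervals(rejected)
```

Here `(0.1, 0.5)` and `(0.2, 0.4)` are dropped because both contain `(0.25, 0.35)`, while the remaining three intervals contain no other member of the collection.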

4 Estimation of the long-run error variance

In this section, we discuss how to estimate the long-run variance $\sigma^{2}=\sum\nolimits_{\ell=-\infty}^{\infty}\textnormal{Cov}(\varepsilon_{0},\varepsilon_{\ell})$ of the error terms in model (2.1). There are two broad classes of estimators: residual- and difference-based estimators. In residual-based approaches, $\sigma^{2}$ is estimated from the residuals $\widehat{\varepsilon}_{t}=Y_{t,T}-\widehat{m}_{h}(t/T)$ , where $\widehat{m}_{h}$ is a nonparametric estimator of $m$ with the bandwidth or smoothing parameter $h$ . Difference-based methods proceed by estimating $\sigma^{2}$ from the $\ell$ -th differences $Y_{t,T}-Y_{t-\ell,T}$ of the observed time series $\{Y_{t,T}\}$ for certain orders $\ell$ . In what follows, we focus attention on difference-based methods as these do not involve a nonparametric estimator of the function $m$ and thus do not require to specify a bandwidth $h$ for the estimation of $m$ . To simplify notation, we let $\Delta_{\ell}Z_{t}=Z_{t}-Z_{t-\ell}$ denote the $\ell$ -th differences of a general time series $\{Z_{t}\}$ throughout the section.

4.1 Weakly dependent error processes

We first consider the case that $\{\varepsilon_{t}\}$ is a general stationary error process. We do not impose any time series model such as a moving average (MA) or an autoregressive (AR) model on $\{\varepsilon_{t}\}$ but only require that $\{\varepsilon_{t}\}$ satisfies certain weak dependence conditions such as those from Section 2. These conditions imply that the autocovariances $\gamma_{\varepsilon}(\ell)=\textnormal{Cov}(\varepsilon_{0},\varepsilon_{\ell})$ decay to zero at a certain rate as $|\ell|\rightarrow\infty$ . For simplicity of exposition, we assume that the decay is exponential, that is, $|\gamma_{\varepsilon}(\ell)|\leq C\rho^{|\ell|}$ for some $C>0$ and $0<\rho<1$ . In addition to these weak dependence conditions, we suppose that the trend $m$ is smooth. Specifically, we assume $m$ to be Lipschitz continuous on $[0,1]$ , that is, $|m(u)-m(v)|\leq C|u-v|$ for all $u,v\in[0,1]$ and some constant $C<\infty$ .

Under these conditions, a difference-based estimator of $\sigma^{2}$ can be obtained as follows: To start with, we construct an estimator of the short-run error variance $\gamma_{\varepsilon}(0)=\textnormal{Var}(\varepsilon_{0})$ . As $m$ is Lipschitz continuous, it holds that $\Delta_{q}Y_{t,T}=\Delta_{q}\varepsilon_{t}+O(q/T)$ . Hence, the differences $\Delta_{q}Y_{t,T}$ of the observed time series are close to the differences $\Delta_{q}\varepsilon_{t}$ of the unobserved error process as long as $q$ is not too large in comparison to $T$ . Moreover, since $|\gamma_{\varepsilon}(q)|\leq C\rho^{q}$ , we have that $\mathbb{E}[(\Delta_{q}\varepsilon_{t})^{2}]/2=\gamma_{\varepsilon}(0)-\gamma_{\varepsilon}(q)=\gamma_{\varepsilon}(0)+O(\rho^{q})$ . Taken together, these considerations yield that $\gamma_{\varepsilon}(0)=\mathbb{E}[(\Delta_{q}Y_{t,T})^{2}]/2+O(\{q/T\}^{2}+\rho^{q})$ , which motivates to estimate $\gamma_{\varepsilon}(0)$ by

$$\widehat{\gamma}_{\varepsilon}(0)=\frac{1}{2(T-q)}\sum_{t=q+1}^{T}(\Delta_{q}Y_{t,T})^{2},\tag{4.1}$$

where we assume that $q=q_{T}\rightarrow\infty$ with $q_{T}/\log T\rightarrow\infty$ and $q_{T}/\sqrt{T}\rightarrow 0$ . Estimators of the autocovariances $\gamma_{\varepsilon}(\ell)$ for $\ell\neq 0$ can be derived by similar considerations. Since $\gamma_{\varepsilon}(\ell)=\gamma_{\varepsilon}(0)-\mathbb{E}[(\Delta_{\ell}\varepsilon_{t})^{2}]/2=\gamma_{\varepsilon}(0)-\mathbb{E}[(\Delta_{\ell}Y_{t,T})^{2}]/2+O(\{\ell/T\}^{2})$ , we may in particular define

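$$\widehat{\gamma}_{\varepsilon}(\ell)=\widehat{\gamma}_{\varepsilon}(0)-\frac{1}{2(T-|\ell|)}\sum_{t=|\ell|+1}^{T}(\Delta_{|\ell|}Y_{t,T})^{2}\tag{4.2}$$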

for any $\ell\neq 0$ . Difference-based estimators of the type (4.1) and (4.2) have been used in different contexts in the literature before. Estimators similar to (4.1) and (4.2) were analysed, for example, in Müller and Stadtmüller (1988) and Hall and Van Keilegom (2003) in the context of $m$ -dependent and autoregressive error terms, respectively. In order to estimate the long-run error variance $\sigma^{2}$ , we may employ HAC-type estimation procedures as discussed in Andrews (1991) or De Jong and Davidson (2000). In particular, an estimator of $\sigma^{2}$ may be defined as

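$$\widehat{\sigma}^{2}=\sum_{|\ell|\leq b_{T}}W\Big(\frac{\ell}{b_{T}}\Big)\,\widehat{\gamma}_{\varepsilon}(\ell),\tag{4.3}$$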

where $W:[-1,1]\rightarrow\mathbb{R}$ is a kernel (e.g. of Bartlett or Parzen type) and $b_{T}$ is a bandwidth parameter with $b_{T}\rightarrow\infty$ and $b_{T}/q_{T}\rightarrow 0$. The additional bandwidth $b_{T}$ comes into play because estimating $\sigma^{2}$ under general weak dependence conditions is a nonparametric problem. In particular, it is equivalent to estimating the (nonparametric) spectral density $f_{\varepsilon}$ of the process $\{\varepsilon_{t}\}$ at frequency $0$ (assuming that $f_{\varepsilon}$ exists).
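A difference-based HAC-type estimator of this kind can be sketched as follows. The normalisations (dividing by $2(T-\ell)$), the Bartlett kernel $W(x)=1-|x|$ and the values of $q$ and $b$ are assumptions made for illustration. The function names are hypothetical.

```python
import numpy as np

def half_mean_sq_diff(y, lag):
    # (1 / (2 (T - lag))) * sum_t (Y_t - Y_{t-lag})^2   (ASSUMED normalisation)
    d = y[lag:] - y[:-lag]
    return np.sum(d**2) / (2.0 * (len(y) - lag))

def hac_long_run_variance(y, q, b):
    # gamma_hat(0) from q-th differences; gamma_hat(l) = gamma_hat(0) minus the
    # half mean squared l-th difference; Bartlett-weighted sum over |l| <= b.
    g0 = half_mean_sq_diff(y, q)
    sigma2 = g0
    for l in range(1, int(np.floor(b)) + 1):
        weight = 1.0 - l / b                      # Bartlett kernel W(l / b)
        sigma2 += 2.0 * weight * (g0 - half_mean_sq_diff(y, l))
    return g0, sigma2

def simulate_ar1(T, a1, rng, burn=200):
    # AR(1) errors with standard normal innovations (burn-in is an assumption)
    eta = rng.standard_normal(T + burn)
    eps = np.zeros(T + burn)
    for t in range(1, T + burn):
        eps[t] = a1 * eps[t - 1] + eta[t]
    return eps[burn:]

rng = np.random.default_rng(0)
T = 20000
y = np.arange(1, T + 1) / T + simulate_ar1(T, 0.5, rng)   # linear trend plus AR(1) errors
g0_hat, sigma2_hat = hac_long_run_variance(y, q=50, b=10)
```

For $a_{1}=0.5$ and unit innovation variance, the targets are $\gamma_{\varepsilon}(0)=4/3$ and $\sigma^{2}=4$. With a finite bandwidth the kernel-weighted sum is biased downwards, which illustrates the sensitivity to $b_{T}$ discussed next.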

Estimating the long-run error variance $\sigma^{2}$ under general weak dependence conditions is a notoriously difficult problem. Estimators of $\sigma^{2}$ such as $\widehat{\sigma}^{2}$ from (4.3) tend to be quite imprecise and are usually very sensitive to the choice of the smoothing parameter, that is, to $b_{T}$ in the case of $\widehat{\sigma}^{2}$ from (4.3). To circumvent this issue in practice, it may be beneficial to impose a time series model on the error process $\{\varepsilon_{t}\}$ . Estimating $\sigma^{2}$ under the restrictions of such a model may of course create some misspecification bias. However, as long as the model gives a reasonable approximation to the true error process, the produced estimates of $\sigma^{2}$ can be expected to be fairly reliable even though they are a bit biased. Which time series model is appropriate of course depends on the application at hand. In the sequel, we follow authors such as Hart (1994) and Hall and Van Keilegom (2003) and impose an autoregressive structure on the error terms $\{\varepsilon_{t}\}$ , which is a very popular error model in many application contexts. We thus do not dwell on the nonparametric estimator $\widehat{\sigma}^{2}$ from (4.3) any further but rather give an in-depth analysis of the case of autoregressive error terms.

4.2 Autoregressive error processes

Estimators of the long-run error variance $\sigma^{2}$ in model (2.1) have been developed for different kinds of error processes $\{\varepsilon_{t}\}$. A number of authors have analysed the case of MA($m$) or, more generally, $m$-dependent error terms. Difference-based estimators of $\sigma^{2}$ for this case were proposed in Müller and Stadtmüller (1988), Herrmann et al. (1992) and Tecuapetla-Gómez and Munk (2017) among others. Under the assumption of $m$-dependence, $\gamma_{\varepsilon}(\ell)=0$ for all $|\ell|>m$. Even though $m$-dependent time series are a reasonable error model in some applications, the condition that $\gamma_{\varepsilon}(\ell)$ is exactly equal to $0$ for sufficiently large lags $\ell$ is quite restrictive in many situations. Presumably the most widely used error model in practice is an AR($p$) process. Residual-based methods to estimate $\sigma^{2}$ in model (2.1) with AR($p$) errors can be found for example in Truong (1991), Shao and Yang (2011) and Qiu et al. (2013). A difference-based method was proposed in Hall and Van Keilegom (2003).

In what follows, we introduce a difference-based estimator of $\sigma^{2}$ for the AR( $p$ ) case which improves on existing methods in several respects. As in Hall and Van Keilegom (2003), we consider the following situation: $\{\varepsilon_{t}\}$ is a stationary and causal AR( $p$ ) process of the form

$$\varepsilon_{t}=\sum_{j=1}^{p}a_{j}\varepsilon_{t-j}+\eta_{t},\tag{4.4}$$

where $a_{1},\ldots,a_{p}$ are unknown parameters and $\eta_{t}$ are i.i.d. innovations with $\mathbb{E}[\eta_{t}]=0$ and $\mathbb{E}[\eta_{t}^{2}]=\nu^{2}$ . The AR order $p$ is known and $m$ is Lipschitz continuous on $[0,1]$ , that is, $|m(u)-m(v)|\leq C|u-v|$ for all $u,v\in[0,1]$ and some constant $C<\infty$ . Since $\{\varepsilon_{t}\}$ is causal, the variables $\varepsilon_{t}$ have an MA( $\infty$ ) representation of the form $\varepsilon_{t}=\sum_{k=0}^{\infty}c_{k}\eta_{t-k}$ . The coefficients $c_{k}$ can be computed iteratively from the equations

$$c_{k}-\sum_{j=1}^{p}a_{j}c_{k-j}=b_{k}\tag{4.5}$$

for $k=0,1,2,\ldots$ , where $b_{0}=1$ , $b_{k}=0$ for $k>0$ and $c_{k}=0$ for $k<0$ . Moreover, the coefficients $c_{k}$ can be shown to decay exponentially fast to zero as $k\rightarrow\infty$ , in particular, $|c_{k}|\leq C\rho^{k}$ with some $C>0$ and $0<\rho<1$ .
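The recursion for the MA($\infty$) coefficients is easy to evaluate numerically. The sketch below reads (4.5) as $c_{k}=b_{k}+\sum_{j=1}^{p}a_{j}c_{k-j}$ with $b_{0}=1$, $b_{k}=0$ for $k>0$ and $c_{k}=0$ for $k<0$, which is the standard recursion for a causal AR($p$) process. The function name is hypothetical.

```python
import numpy as np

def ma_coefficients(a, n):
    """Return c_0, ..., c_{n-1} of the MA(infinity) representation of a
    causal AR(p) process with coefficients a = (a_1, ..., a_p)."""
    a = np.asarray(a, dtype=float)
    p = len(a)
    c = np.zeros(n)
    for k in range(n):
        b_k = 1.0 if k == 0 else 0.0
        # terms with k - j < 0 vanish because c_k = 0 for k < 0
        c[k] = b_k + sum(a[j - 1] * c[k - j] for j in range(1, p + 1) if k - j >= 0)
    return c

c_ar1 = ma_coefficients([0.75], 30)            # AR(1): c_k = 0.75 ** k
c_ar2 = ma_coefficients([0.167, 0.178], 30)    # AR(2) example values
```

For an AR($1$) process the recursion reduces to $c_{k}=a_{1}^{k}$, so the exponential decay is immediate. For example, $0.75^{20}\approx 0.0032$.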

Our estimation method relies on the following simple observation: If $\{\varepsilon_{t}\}$ is an AR( $p$ ) process of the form (4.4), then the time series $\{\Delta_{q}\varepsilon_{t}\}$ of the differences $\Delta_{q}\varepsilon_{t}=\varepsilon_{t}-\varepsilon_{t-q}$ is an ARMA( $p,q$ ) process of the form

$$\Delta_{q}\varepsilon_{t}-\sum_{j=1}^{p}a_{j}\Delta_{q}\varepsilon_{t-j}=\eta_{t}-\eta_{t-q}.\tag{4.6}$$

As $m$ is Lipschitz, the differences $\Delta_{q}\varepsilon_{t}$ of the unobserved error process are close to the differences $\Delta_{q}Y_{t,T}$ of the observed time series in the sense that

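$$\big|\Delta_{q}Y_{t,T}-\Delta_{q}\varepsilon_{t}\big|=\Big|m\Big(\frac{t}{T}\Big)-m\Big(\frac{t-q}{T}\Big)\Big|\leq\frac{Cq}{T}.\tag{4.7}$$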

Taken together, (4.6) and (4.7) imply that the differenced time series $\{\Delta_{q}Y_{t,T}\}$ is approximately an ARMA( $p,q$ ) process of the form (4.6). It is precisely this point which is exploited by our estimation methods.

We first construct an estimator of the parameter vector $\boldsymbol{a}=(a_{1},\ldots,a_{p})^{\top}$ . For any $q\geq 1$ , the ARMA( $p,q$ ) process $\{\Delta_{q}\varepsilon_{t}\}$ satisfies the Yule-Walker equations

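$$\gamma_{q}(\ell)-\sum_{j=1}^{p}a_{j}\gamma_{q}(\ell-j)=-\nu^{2}c_{q-\ell}\quad\text{for }1\leq\ell<q+1,\tag{4.8}$$

$$\gamma_{q}(\ell)-\sum_{j=1}^{p}a_{j}\gamma_{q}(\ell-j)=0\quad\text{for }\ell\geq q+1,\tag{4.9}$$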

where $\gamma_{q}(\ell)=\textnormal{Cov}(\Delta_{q}\varepsilon_{t},$ $\Delta_{q}\varepsilon_{t-\ell})$ and $c_{k}$ are the coefficients from the MA( $\infty$ ) expansion of $\{\varepsilon_{t}\}$ . From (4.8) and (4.9), we get that

$$\boldsymbol{\Gamma}_{q}\boldsymbol{a}=\boldsymbol{\gamma}_{q}+\nu^{2}\boldsymbol{c}_{q},\tag{4.10}$$

where $\boldsymbol{c}_{q}=(c_{q-1},\dots,c_{q-p})^{\top}$ , $\boldsymbol{\gamma}_{q}=(\gamma_{q}(1),\dots,\gamma_{q}(p))^{\top}$ and $\boldsymbol{\Gamma}_{q}$ denotes the $p\times p$ covariance matrix $\boldsymbol{\Gamma}_{q}=(\gamma_{q}(i-j):1\leq i,j\leq p)$ . Since the coefficients $c_{k}$ decay exponentially fast to zero, $\boldsymbol{c}_{q}\approx\boldsymbol{0}$ and thus $\boldsymbol{\Gamma}_{q}\boldsymbol{a}\approx\boldsymbol{\gamma}_{q}$ for large values of $q$ . This suggests to estimate $\boldsymbol{a}$ by

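$$\widetilde{\boldsymbol{a}}_{q}=\widehat{\boldsymbol{\Gamma}}_{q}^{-1}\widehat{\boldsymbol{\gamma}}_{q},$$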

where $\widehat{\boldsymbol{\Gamma}}_{q}$ and $\widehat{\boldsymbol{\gamma}}_{q}$ are defined analogously as $\boldsymbol{\Gamma}_{q}$ and $\boldsymbol{\gamma}_{q}$ with $\gamma_{q}(\ell)$ replaced by the sample autocovariances $\widehat{\gamma}_{q}(\ell)=(T-q)^{-1}\sum_{t=q+\ell+1}^{T}\Delta_{q}Y_{t,T}\Delta_{q}Y_{t-\ell,T}$ and $q=q_{T}$ goes to infinity sufficiently fast as $T\rightarrow\infty$ , specifically, $q=q_{T}\rightarrow\infty$ with $q_{T}/\log T\rightarrow\infty$ and $q_{T}/\sqrt{T}\rightarrow 0$ .
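The pilot estimator amounts to solving a Yule-Walker-type system built from the sample autocovariances of the $q$-th differences. The following is a minimal sketch; the function names, the simulated example and the burn-in length are assumptions for illustration.

```python
import numpy as np

def diff_autocov(y, q, max_lag):
    # gamma_hat_q(l) = (T - q)^{-1} sum_{t=q+l+1}^{T} D_q Y_t * D_q Y_{t-l}
    T = len(y)
    d = y[q:] - y[:-q]                       # q-th differences for t = q+1, ..., T
    return np.array([np.dot(d[l:], d[:len(d) - l]) / (T - q)
                     for l in range(max_lag + 1)])

def pilot_estimator(y, p, q):
    # Solve Gamma_hat_q a = gamma_hat_q for the AR parameter vector
    g = diff_autocov(y, q, p)
    Gamma = np.array([[g[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(Gamma, g[1:p + 1])

def simulate_ar1(T, a1, rng, burn=200):
    # AR(1) errors with standard normal innovations
    eta = rng.standard_normal(T + burn)
    eps = np.zeros(T + burn)
    for t in range(1, T + burn):
        eps[t] = a1 * eps[t - 1] + eta[t]
    return eps[burn:]

rng = np.random.default_rng(1)
T = 5000
y = np.arange(1, T + 1) / T + simulate_ar1(T, 0.5, rng)   # linear trend plus AR(1) errors
a_pilot = pilot_estimator(y, p=1, q=25)
```

With a moderate trend and $q=25$, the estimate should lie close to the true value $a_{1}=0.5$ in this example.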

The estimator $\widetilde{\boldsymbol{a}}_{q}$ depends on the tuning parameter $q$ , which is very similar in nature to the two tuning parameters of the methods in Hall and Van Keilegom (2003). An appropriate choice of $q$ needs to take care of the following two points: (i) $q$ should be chosen large enough to ensure that the vector $\boldsymbol{c}_{q}=(c_{q-1},\dots,c_{q-p})^{\top}$ is close to zero. As we have already seen, the constants $c_{k}$ decay exponentially fast to zero and can be computed from the recursive equations (4.5) for given AR parameters $a_{1},\ldots,a_{p}$ . In the AR( $1$ ) case, for example, one can readily calculate that $c_{k}\leq 0.0035$ for any $k\geq 20$ and any $|a_{1}|\leq 0.75$ . Hence, if we have an AR( $1$ ) model for the errors $\varepsilon_{t}$ and the error process is not too persistent, choosing $q$ such that $q\geq 20$ should make sure that $\boldsymbol{c}_{q}$ is close to zero. Generally speaking, the recursive equations (4.5) can be used to get some idea for which values of $q$ the vector $\boldsymbol{c}_{q}$ can be expected to be approximately zero. (ii) $q$ should not be chosen too large in order to ensure that the trend $m$ is appropriately eliminated by taking $q$ -th differences. As long as the trend $m$ is not very strong, the two requirements (i) and (ii) can be fulfilled without much difficulty. For example, by choosing $q=20$ in the AR( $1$ ) case just discussed, we do not only take care of (i) but also make sure that moderate trends $m$ are differenced out appropriately.

When the trend $m$ is very pronounced, in contrast, even moderate values of $q$ may be too large to eliminate the trend appropriately. As a result, the estimator $\widetilde{\boldsymbol{a}}_{q}$ will have a strong bias. In order to reduce this bias, we refine our estimation procedure as follows: By solving the recursive equations (4.5) with $\boldsymbol{a}$ replaced by $\widetilde{\boldsymbol{a}}_{q}$ , we can compute estimators $\widetilde{c}_{k}$ of the coefficients $c_{k}$ and thus estimators $\widetilde{\boldsymbol{c}}_{r}$ of the vectors $\boldsymbol{c}_{r}$ for any $r\geq 1$ . Moreover, the innovation variance $\nu^{2}$ can be estimated by $\widetilde{\nu}^{2}=(2T)^{-1}\sum_{t=p+1}^{T}\widetilde{r}_{t,T}^{2}$ , where $\widetilde{r}_{t,T}=\Delta_{1}Y_{t,T}-\sum_{j=1}^{p}\widetilde{a}_{j}\Delta_{1}Y_{t-j,T}$ and $\widetilde{a}_{j}$ is the $j$ -th entry of the vector $\widetilde{\boldsymbol{a}}_{q}$ . Plugging the expressions $\widehat{\boldsymbol{\Gamma}}_{r}$ , $\widehat{\boldsymbol{\gamma}}_{r}$ , $\widetilde{\boldsymbol{c}}_{r}$ and $\widetilde{\nu}^{2}$ into (4.10), we can estimate $\boldsymbol{a}$ by

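$$\widehat{\boldsymbol{a}}_{r}=\widehat{\boldsymbol{\Gamma}}_{r}^{-1}\big(\widehat{\boldsymbol{\gamma}}_{r}+\widetilde{\nu}^{2}\widetilde{\boldsymbol{c}}_{r}\big),$$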

where $r$ is any fixed number with $r\geq 1$ . In particular, unlike $q$ , the parameter $r$ does not diverge to infinity but remains fixed as the sample size $T$ increases. As one can see, the estimator $\widehat{\boldsymbol{a}}_{r}$ is based on differences of some small order $r$ ; only the pilot estimator $\widetilde{\boldsymbol{a}}_{q}$ relies on differences of a larger order $q$ . As a consequence, $\widehat{\boldsymbol{a}}_{r}$ should eliminate the trend $m$ more appropriately and should thus be less biased than the pilot estimator $\widetilde{\boldsymbol{a}}_{q}$ . In order to make the method more robust against estimation errors in $\widetilde{\boldsymbol{c}}_{r}$ , we finally average the estimators $\widehat{\boldsymbol{a}}_{r}$ for a few small values of $r$ . In particular, we define

$$\widehat{\boldsymbol{a}}=\frac{1}{\overline{r}}\sum_{r=1}^{\overline{r}}\widehat{\boldsymbol{a}}_{r},$$

where $\overline{r}$ is a small natural number. For ease of notation, we suppress the dependence of $\widehat{\boldsymbol{a}}$ on the parameter $\overline{r}$ . Once $\widehat{\boldsymbol{a}}=(\widehat{a}_{1},\ldots,\widehat{a}_{p})^{\top}$ is computed, the long-run variance $\sigma^{2}$ can be estimated by

$$\widehat{\sigma}^{2}=\frac{\widehat{\nu}^{2}}{\big(1-\sum_{j=1}^{p}\widehat{a}_{j}\big)^{2}},$$

where $\widehat{\nu}^{2}=(2T)^{-1}\sum_{t=p+1}^{T}\widehat{r}_{t,T}^{2}$ with $\widehat{r}_{t,T}=\Delta_{1}Y_{t,T}-\sum_{j=1}^{p}\widehat{a}_{j}\Delta_{1}Y_{t-j,T}$ is an estimator of the innovation variance $\nu^{2}$ and we make use of the fact that $\sigma^{2}=\nu^{2}/(1-\sum_{j=1}^{p}a_{j})^{2}$ for the AR( $p$ ) process $\{\varepsilon_{t}\}$ .
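The complete procedure (pilot estimator, bias-corrected estimators for small $r$, averaging, and the plug-in long-run variance) can be sketched in a few lines. This is an illustrative implementation under assumptions, not the authors' code: the function names are hypothetical, the refined estimator is taken to be $\widehat{\boldsymbol{\Gamma}}_{r}^{-1}(\widehat{\boldsymbol{\gamma}}_{r}+\widetilde{\nu}^{2}\widetilde{\boldsymbol{c}}_{r})$ as suggested by (4.10), and the residual sums run over all indices for which the lagged differences are available.

```python
import numpy as np

def yule_walker_system(y, p, q):
    # Sample autocovariances of the q-th differences, arranged as (Gamma, gamma)
    T = len(y)
    d = y[q:] - y[:-q]
    g = np.array([np.dot(d[l:], d[:len(d) - l]) / (T - q) for l in range(p + 1)])
    Gamma = np.array([[g[abs(i - j)] for j in range(p)] for i in range(p)])
    return Gamma, g[1:p + 1]

def ma_coefficients(a, n):
    # c_k = b_k + sum_j a_j c_{k-j}, with b_0 = 1 and c_k = 0 for k < 0
    c = np.zeros(n)
    for k in range(n):
        c[k] = (k == 0) + sum(a[j - 1] * c[k - j]
                              for j in range(1, len(a) + 1) if k - j >= 0)
    return c

def innovation_variance(y, a):
    # (2T)^{-1} * sum of squared residuals of the first differences
    p, d = len(a), np.diff(y)
    r = d[p:] - sum(a[j] * d[p - 1 - j:len(d) - 1 - j] for j in range(p))
    return np.sum(r**2) / (2.0 * len(y))

def long_run_variance(y, p, q, r_bar):
    Gamma_q, gamma_q = yule_walker_system(y, p, q)
    a_pilot = np.linalg.solve(Gamma_q, gamma_q)        # pilot estimator
    c = ma_coefficients(a_pilot, r_bar)                # c_0, ..., c_{r_bar - 1}
    nu2_pilot = innovation_variance(y, a_pilot)
    a_r = []
    for r in range(1, r_bar + 1):
        Gamma_r, gamma_r = yule_walker_system(y, p, r)
        # c_r = (c_{r-1}, ..., c_{r-p}) with c_k = 0 for k < 0
        c_r = np.array([c[r - j] if r - j >= 0 else 0.0 for j in range(1, p + 1)])
        a_r.append(np.linalg.solve(Gamma_r, gamma_r + nu2_pilot * c_r))
    a_hat = np.mean(a_r, axis=0)                       # average over r = 1, ..., r_bar
    nu2_hat = innovation_variance(y, a_hat)
    return a_hat, nu2_hat / (1.0 - a_hat.sum())**2

def simulate_ar1(T, a1, rng, burn=200):
    eta = rng.standard_normal(T + burn)
    eps = np.zeros(T + burn)
    for t in range(1, T + burn):
        eps[t] = a1 * eps[t - 1] + eta[t]
    return eps[burn:]

rng = np.random.default_rng(2)
T = 20000
y = 2.0 * np.arange(1, T + 1) / T + simulate_ar1(T, 0.25, rng)
a_hat, sigma2_hat = long_run_variance(y, p=1, q=25, r_bar=10)
```

In this example the true values are $a_{1}=0.25$ and $\sigma^{2}=1/(1-0.25)^{2}\approx 1.78$. The choices $q=25$ and $\overline{r}=10$ mirror those used in the simulations of Section 5.1.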

We briefly compare the estimator $\widehat{\boldsymbol{a}}$ to competing methods. Presumably closest to our approach is the procedure of Hall and Van Keilegom (2003). Nevertheless, the two approaches differ in several respects. The two main advantages of our method are as follows:

(a) Our estimator produces accurate estimation results even when the AR process $\{\varepsilon_{t}\}$ is quite persistent, that is, even when the AR polynomial $A(z)=1-\sum_{j=1}^{p}a_{j}z^{j}$ has a root close to the unit circle. The estimator of Hall and Van Keilegom (2003), in contrast, may have very high variance and may thus produce unreliable results when the AR polynomial $A(z)$ is close to having a unit root. This difference in behaviour can be explained as follows: Our pilot estimator $\widetilde{\boldsymbol{a}}_{q}=(\widetilde{a}_{1},\ldots,\widetilde{a}_{p})^{\top}$ has the property that the estimated AR polynomial $\widetilde{A}(z)=1-\sum_{j=1}^{p}\widetilde{a}_{j}z^{j}$ has no root inside the unit disc, that is, $\widetilde{A}(z)\neq 0$ for all complex numbers $z$ with $|z|\leq 1$. (More precisely, $\widetilde{A}(z)\neq 0$ for all $z$ with $|z|\leq 1$ whenever the covariance matrix $(\widehat{\gamma}_{q}(i-j):1\leq i,j\leq p+1)$ is non-singular. Moreover, $(\widehat{\gamma}_{q}(i-j):1\leq i,j\leq p+1)$ is non-singular whenever $\widehat{\gamma}_{q}(0)>0$, which is the generic case.) Hence, the fitted AR model with the coefficients $\widetilde{\boldsymbol{a}}_{q}$ is ensured to be stationary and causal. Even though this may seem to be a minor technical detail, it has a huge effect on the performance of the estimator: It keeps the estimator stable even when the AR process is very persistent and the AR polynomial $A(z)$ has almost a unit root. This in turn results in a reliable behaviour of the estimator $\widehat{\boldsymbol{a}}$ in the case of high persistence. The estimator of Hall and Van Keilegom (2003), in contrast, may produce non-causal results when the AR polynomial $A(z)$ is close to having a unit root. As a consequence, it may have unnecessarily high variance in the case of high persistence. We illustrate this difference between the estimators by the simulation exercises in Section 5.3. A striking example is Figure 5, which presents the simulation results for the case of an AR($1$) process $\varepsilon_{t}=a_{1}\varepsilon_{t-1}+\eta_{t}$ with $a_{1}=-0.95$ and clearly shows the much better performance of our method.

(b) Both our pilot estimator $\widetilde{\boldsymbol{a}}_{q}$ and the estimator of Hall and Van Keilegom (2003) tend to have a substantial bias when the trend $m$ is pronounced. Our estimator $\widehat{\boldsymbol{a}}$ reduces this bias considerably as demonstrated in the simulations of Section 5.3. Unlike the estimator of Hall and Van Keilegom (2003), it thus produces accurate results even in the presence of a very strong trend.

We now derive some basic asymptotic properties of the estimators $\widetilde{\boldsymbol{a}}_{q}$ , $\widehat{\boldsymbol{a}}$ and $\widehat{\sigma}^{2}$ . The following proposition shows that they are $\sqrt{T}$ -consistent.

Proposition 4.1.

Let $\{\varepsilon_{t}\}$ be a causal AR( $p$ ) process of the form (4.4). Suppose that the innovations $\eta_{t}$ have a finite fourth moment and let $m$ be Lipschitz continuous. If $q\rightarrow\infty$ with $q/\log T\rightarrow\infty$ and $q/\sqrt{T}\rightarrow 0$ , then $\widetilde{\boldsymbol{a}}_{q}-\boldsymbol{a}=O_{p}(T^{-1/2})$ as well as $\widehat{\boldsymbol{a}}-\boldsymbol{a}=O_{p}(T^{-1/2})$ and $\widehat{\sigma}^{2}-\sigma^{2}=O_{p}(T^{-1/2})$ .

It can also be shown that $\widetilde{\boldsymbol{a}}_{q}$ , $\widehat{\boldsymbol{a}}$ and $\widehat{\sigma}^{2}$ are asymptotically normal. In general, their asymptotic variance is somewhat larger than that of the estimators in Hall and Van Keilegom (2003). They are thus a bit less efficient in terms of asymptotic variance. However, this theoretical loss of efficiency is more than compensated by the advantages discussed in (a) and (b) above, which lead to a substantially better small sample performance as demonstrated in the simulations of Section 5.3.

5 Simulations

To assess the finite sample performance of our methods, we conduct a number of simulations. In Sections 5.1 and 5.2, we investigate the performance of our multiscale test and compare it to the SiZer methods for time series developed in Park et al. (2004), Rondonotti et al. (2007) and Park et al. (2009). In Section 5.3, we analyse the finite sample properties of our long-run variance estimator from Section 4.2 and compare it to the estimator of Hall and Van Keilegom (2003).

5.1 Size and power properties of the multiscale test

Our simulation design mimics the situation in the application example of Section 6. We generate data from the model $Y_{t,T}=m(t/T)+\varepsilon_{t}$ for different trend functions $m$, error processes $\{\varepsilon_{t}\}$ and time series lengths $T$. The error terms are supposed to have the AR($1$) structure $\varepsilon_{t}=a_{1}\varepsilon_{t-1}+\eta_{t}$, where $a_{1}\in\{-0.5,-0.25,0.25,0.5\}$ and $\eta_{t}$ are i.i.d. standard normal. In addition, we consider the AR($2$) specification $\varepsilon_{t}=a_{1}\varepsilon_{t-1}+a_{2}\varepsilon_{t-2}+\eta_{t}$, where $\eta_{t}$ are normally distributed with $\mathbb{E}[\eta_{t}]=0$ and $\mathbb{E}[\eta_{t}^{2}]=\nu^{2}$. We set $a_{1}=0.167$, $a_{2}=0.178$ and $\nu^{2}=0.322$, thus matching the estimated values obtained in the application of Section 6. To simulate data under the null hypothesis, we let $m$ be a constant function. In particular, we set $m=0$ without loss of generality. To generate data under the alternative, we consider the trend functions $m(u)=\beta(u-0.5)\cdot 1(0.5\leq u\leq 1)$ with $\beta=1.5,2.0,2.5$. These functions are broken lines with a kink at $u=0.5$ and different slopes $\beta$. Their shape roughly resembles the trend estimates in the application of Section 6. The slope parameter $\beta$ corresponds to a trend with the value $m(1)=0.5\beta$ at the right endpoint $u=1$. We thus consider broken lines with the values $m(1)=0.75,1.0,1.25$. Inspecting the middle panel of Figure 7, the broken lines with the endpoints $m(1)=1.0$ and $m(1)=1.25$ (that is, with $\beta=2.0$ and $\beta=2.5$) can be seen to resemble the local linear trend estimates in the real-data example the most (where we neglect the nonlinearities of the local linear fits at the beginning of the observation period). The broken line with $\beta=1.5$ is closer to the null, making it harder for our test to detect this alternative. (The broken lines $m$ are obviously non-differentiable at the kink point. We could replace them by slightly smoothed versions to satisfy the differentiability assumption that is imposed in the theoretical part of the paper. However, as this leaves the simulation results essentially unchanged but only creates additional notation, we stick to the broken lines.)
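The data-generating process of this design can be sketched as follows. The function names and the burn-in period used to approximate stationarity are assumptions made for illustration; the parameter values are those stated above.

```python
import numpy as np

def broken_line(u, beta):
    # m(u) = beta * (u - 0.5) * 1(0.5 <= u <= 1)
    u = np.asarray(u, dtype=float)
    return beta * (u - 0.5) * ((u >= 0.5) & (u <= 1.0))

def simulate_ar(T, a, nu2, rng, burn=200):
    # AR(p) errors with N(0, nu2) innovations; the first `burn` values are
    # discarded (ASSUMED burn-in length)
    a = np.asarray(a, dtype=float)
    p = len(a)
    eta = np.sqrt(nu2) * rng.standard_normal(T + burn)
    eps = np.zeros(T + burn)
    for t in range(p, T + burn):
        # eps[t - p:t][::-1] = (eps_{t-1}, ..., eps_{t-p})
        eps[t] = np.dot(a, eps[t - p:t][::-1]) + eta[t]
    return eps[burn:]

def simulate_sample(T, a, nu2, beta, rng):
    # Y_{t,T} = m(t / T) + eps_t for t = 1, ..., T; beta = 0 gives the null
    u = np.arange(1, T + 1) / T
    return broken_line(u, beta) + simulate_ar(T, a, nu2, rng)

rng = np.random.default_rng(0)
y_null = simulate_sample(350, [0.25], 1.0, 0.0, rng)            # AR(1), constant trend
y_alt = simulate_sample(350, [0.167, 0.178], 0.322, 2.0, rng)   # AR(2), beta = 2.0
```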


To implement our test, we choose $K$ to be an Epanechnikov kernel and define the set $\mathcal{G}_{T}$ of location-scale points $(u,h)$ as

[TABLE]

We thus take into account all rescaled time points $u\in[0,1]$ on an equidistant grid with step length $5/T$ . For the bandwidth $h=(3+5\ell)/T$ and any $u\in[h,1-h]$ , the kernel weights $K(h^{-1}\{t/T-u\})$ are non-zero for exactly $5+10\ell$ observations. Hence, the bandwidths $h$ in $\mathcal{G}_{T}$ correspond to effective sample sizes of $5,15,25,\ldots$ up to approximately $T/4$ data points. As a robustness check, we have re-run the simulations for a number of other grids. As the results are very similar, we do however not report them here. The long-run error variance $\sigma^{2}$ is estimated by the procedures from Section 4.2: We first compute the estimator $\widehat{\boldsymbol{a}}$ of the AR parameter(s), where we use $\overline{r}=10$ and the pilot estimator $\widetilde{\boldsymbol{a}}_{q}$ with $q=25$ . Based on $\widehat{\boldsymbol{a}}$ , we then compute the estimator $\widehat{\sigma}^{2}$ of the long-run error variance $\sigma^{2}$ . As a further robustness check, we have re-run the simulations for other choices of the parameters $q$ and $\overline{r}$ , which yields very similar results. The dependence of the estimators $\widehat{\boldsymbol{a}}$ and $\widehat{\sigma}^{2}$ on $q$ and $\overline{r}$ is further explored in Section 5.3. To compute the critical values of the multiscale test, we simulate $1000$ values of the statistic $\Phi_{T}$ defined in Section 3.2 and compute their empirical $(1-\alpha)$ quantile $q_{T}(\alpha)$ .

Tables 1 and 2 report the simulation results for the sample sizes $T=250,350,500$ and the significance levels $\alpha=0.01,0.05,0.10$. The sample size $T=350$ is approximately equal to the time series length $359$ in the real-data example of Section 6. To produce our simulation results, we generate $S=1000$ samples for each model specification and carry out the multiscale test for each sample. The entries of Tables 1 and 2 are computed as the number of simulations in which the test rejects divided by the total number of simulations. As can be seen from Table 1, the actual size of the test is fairly close to the nominal target $\alpha$ for all the considered AR specifications and sample sizes. Hence, the test has approximately the correct size. Inspecting Table 2, one can further see that the test has reasonable power properties. For all the considered AR specifications, the power increases quickly (i) as the sample size gets larger and (ii) as we move away from the null by increasing the slope parameter $\beta$. The power is of course quite different across the various AR specifications. In particular, it is much lower for positive than for negative values of $a_{1}$ in the AR($1$) case, the lowest power numbers being obtained for the largest positive value $a_{1}=0.5$ under consideration. This reflects the fact that it is more difficult to detect a trend when there is strong positive autocorrelation in the data. For the AR($2$) specification of the errors, the sample size $T=350$ and the slopes $\beta=2.0$ and $\beta=2.5$, which yield the two model specifications that resemble the real-life data in Section 6 the most, the power of the test is above $92\%$ for the significance levels $\alpha=0.05$ and $\alpha=0.1$ and above $75\%$ for $\alpha=0.01$. Hence, our method has substantial power in the two simulation scenarios which are closest to the situation in the application.

5.2 Comparison with SiZer

We now compare our multiscale test to SiZer for time series, which was developed in Park et al. (2004), Rondonotti et al. (2007) and Park et al. (2009). Roughly speaking, the SiZer method proceeds as follows: For each location $u$ and bandwidth $h$ in a pre-specified set, SiZer computes an estimator $\widehat{m}_{h}^{\prime}(u)$ of the derivative $m^{\prime}(u)$ and a corresponding confidence interval. For each $(u,h)$, it then checks whether the confidence interval includes the value $0$. The set $\Pi_{T}^{\text{SiZer}}$ of points $(u,h)$ for which the confidence interval does not include $0$ corresponds to the set of intervals $\Pi_{T}^{\pm}$ for which our multiscale test finds an increase/decrease in the trend $m$. In order to explore how our test performs in comparison to SiZer, we compare the two sets $\Pi_{T}^{\pm}$ and $\Pi_{T}^{\text{SiZer}}$ in several ways in what follows.

In order to implement SiZer for time series, we follow the exposition in Park et al. (2009). (Footnote 6: We have also examined the somewhat different implementation from Rondonotti et al. (2007). As this yields worse simulation results than the procedure from Park et al. (2009), we do not report them here.) The details are given in Section S.3 in the Supplementary Material. To simplify the implementation of SiZer, we assume that the autocovariance function $\gamma_{\varepsilon}(\cdot)$ of the error process and thus the long-run error variance $\sigma^{2}$ is known. Our multiscale test is implemented in the same way as in Section 5.1. To keep the comparison fair, we treat $\sigma^{2}$ as known also when implementing our method. Moreover, we use the same grid $\mathcal{G}_{T}$ of points $(u,h)$ for both methods. To achieve this, we start off with the grid $\mathcal{G}_{T}$ from (5.1). We then follow Rondonotti et al. (2007) and Park et al. (2009) and restrict attention to those points $(u,h)\in\mathcal{G}_{T}$ for which the effective sample size $\text{ESS}^{*}(u,h)$ for correlated data is not smaller than $5$. This yields the grid $\mathcal{G}_{T}^{*}=\{(u,h)\in\mathcal{G}_{T}:\text{ESS}^{*}(u,h)\geq 5\}$. A detailed discussion of the effective sample size $\text{ESS}^{*}(u,h)$ for correlated data can be found in Rondonotti et al. (2007).

In the first part of the comparison study, we analyse the size and power of the two methods. To do so, we treat SiZer as a rigorous statistical test of the null hypothesis $H_{0}$ that $m$ is constant on all intervals $[u-h,u+h]$ with $(u,h)\in\mathcal{G}_{T}^{*}$. In particular, we let SiZer reject the null if the set $\Pi_{T}^{\text{SiZer}}$ is non-empty, that is, if the value $0$ is not included in the confidence interval for at least one point $(u,h)\in\mathcal{G}_{T}^{*}$. We simulate data from the model $Y_{t,T}=m(t/T)+\varepsilon_{t}$ with different AR($1$) error processes and different trends $m$. In particular, we let $\{\varepsilon_{t}\}$ be an AR($1$) process of the form $\varepsilon_{t}=a_{1}\varepsilon_{t-1}+\eta_{t}$ with $a_{1}\in\{-0.25,0.25\}$ and i.i.d. standard normal innovations $\eta_{t}$. To simulate data under the null, we set $m=0$ as in the previous section. To generate data under the alternative, we consider the linear trends $m(u)=\beta(u-0.5)$ with different slopes $\beta$. As it is more difficult to detect a trend $m$ in the data when the error terms are positively autocorrelated, we choose the slopes $\beta$ larger in the AR($1$) case with $a_{1}=0.25$ than in the case with $a_{1}=-0.25$. In particular, we let $\beta\in\{1.0,1.25,1.5\}$ when $a_{1}=-0.25$ and $\beta\in\{2.0,2.25,2.5\}$ when $a_{1}=0.25$. Further model specifications with nonlinear trends are considered in the second part of the comparison study. To produce our simulation results, we generate $S=1000$ samples for each model specification and carry out the two methods for each sample.

The simulation results are reported in Tables 3 and 4. Both for our multiscale test and SiZer, the entries in the tables are computed as the number of simulations in which the respective method rejects the null hypothesis $H_{0}$ divided by the total number of simulations. As can be seen from Table 3, our test has approximately correct size in all of the considered settings, whereas SiZer is very liberal and rejects the null far too often. Examining Table 4, one can further see that our procedure has reasonable power against the considered alternatives. The power numbers are of course higher for SiZer, which is a trivial consequence of the fact that SiZer is extremely liberal. These numbers should thus be treated with caution. All in all, the simulations suggest that SiZer can hardly be regarded as a rigorous statistical test of the null hypothesis $H_{0}$ that $m$ is constant on all intervals $[u-h,u+h]$ with $(u,h)\in\mathcal{G}_{T}^{*}$. This is not very surprising as SiZer is not designed to be such a test but to produce informative SiZer maps. In particular, the confidence intervals of SiZer are not constructed to control the level $\alpha$ under $H_{0}$. In what follows, we thus attempt to compare the two methods in a different way which goes beyond mere size and power comparisons.

Both our method and SiZer can be regarded as statistical tools to identify time regions where the curve $m$ is increasing/decreasing. (Footnote 7: More precisely speaking, SiZer is usually interpreted as investigating the curve $m$, viewed at different levels of resolution, rather than the curve $m$ itself. Put differently, the underlying object of interest is a family of smoothed versions of $m$ rather than $m$ itself.) Suppose that $m$ is increasing/decreasing in the time region $\mathcal{R}\subset[0,1]$ but constant otherwise, that is, $m^{\prime}(u)\neq 0$ for all $u\in\mathcal{R}$ and $m^{\prime}(u)=0$ for all $u\notin\mathcal{R}$. A natural question is the following: How well can the two methods identify the time region $\mathcal{R}$? In our framework, information on the region $\mathcal{R}$ is contained in the minimal intervals of the set $\Pi_{T}^{\pm}$. In particular, the union $\mathcal{R}_{T}^{\pm}$ of the minimal intervals in $\Pi_{T}^{\pm}$ can be regarded as an estimate of $\mathcal{R}$. This follows from the results in Propositions 3.2 and 3.3. Let $\mathcal{R}_{T}^{\text{SiZer}}$ be the union of the minimal intervals in $\Pi_{T}^{\text{SiZer}}$. In what follows, we compare $\mathcal{R}_{T}^{\pm}$ and $\mathcal{R}_{T}^{\text{SiZer}}$ to the region $\mathcal{R}$. This gives us information on how well the two methods approximate the true region where $m$ is increasing/decreasing. (Footnote 8: The same exercise could of course also be carried out separately for the time region where the trend $m$ increases and the region where it decreases.)

We consider the same simulation setup as in the first part of the comparison study, only the trend function $m$ is different. We let $m$ be defined as $m(u)=2\cdot 1(u\in[0.4,0.6])\cdot(1-100\{u-0.5\}^{2})^{2}$ , which implies that $\mathcal{R}=(0.4,0.5)\cup(0.5,0.6)$ . The function $m$ is plotted in the two upper panels of Figure 2. We set the significance level to $\alpha=0.05$ and the sample size to $T=500$ . For each AR parameter $a_{1}\in\{-0.25,0.25\}$ , we simulate $S=100$ samples and compute $\mathcal{R}_{T}^{\pm}$ and $\mathcal{R}_{T}^{\text{SiZer}}$ for each sample. The simulation results are depicted in Figure 2, the two subfigures (a) and (b) corresponding to different AR parameters. The upper panel of each subfigure displays the time series path of a representative simulation together with the trend function $m$ . The middle panel shows the regions $\mathcal{R}_{T}^{\pm}$ produced by our multiscale approach for the $100$ simulation runs: On the $y$ -axis, the simulation runs $i$ are enumerated for $1\leq i\leq 100$ , and the black line at $y$ -level $i$ represents $\mathcal{R}_{T}^{\pm}$ for the $i$ -th simulation. Finally, the lower panel of each subfigure depicts the regions $\mathcal{R}_{T}^{\text{SiZer}}$ in an analogous way.
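To see why this choice of $m$ implies $\mathcal{R}=(0.4,0.5)\cup(0.5,0.6)$, one can evaluate the trend and its derivative directly. On $[0.4,0.6]$, the chain rule gives $m^{\prime}(u)=-800(u-0.5)(1-100\{u-0.5\}^{2})$, which vanishes at $u=0.5$ and at the two boundary points. The Python sketch below encodes both functions; the function names are hypothetical.

```python
import numpy as np

def bump_trend(u):
    """m(u) = 2 * 1(u in [0.4, 0.6]) * (1 - 100 (u - 0.5)^2)^2."""
    u = np.asarray(u, dtype=float)
    inside = (u >= 0.4) & (u <= 0.6)
    return 2.0 * inside * (1.0 - 100.0 * (u - 0.5) ** 2) ** 2

def bump_trend_derivative(u):
    """m'(u) = -800 (u - 0.5) (1 - 100 (u - 0.5)^2) on [0.4, 0.6], zero elsewhere."""
    u = np.asarray(u, dtype=float)
    inside = (u >= 0.4) & (u <= 0.6)
    return -800.0 * inside * (u - 0.5) * (1.0 - 100.0 * (u - 0.5) ** 2)

# The trend increases on (0.4, 0.5), peaks at u = 0.5 and decreases on (0.5, 0.6)
grid = np.linspace(0.0, 1.0, 1001)
increasing = grid[bump_trend_derivative(grid) > 1e-9]
decreasing = grid[bump_trend_derivative(grid) < -1e-9]
```

The arrays `increasing` and `decreasing` then contain only grid points inside $(0.4,0.5)$ and $(0.5,0.6)$, respectively.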

Inspecting Figure 2, our multiscale method can be seen to approximate the region $\mathcal{R}$ fairly well in both simulation scenarios under consideration. In particular, $\mathcal{R}_{T}^{\pm}$ gives a good approximation to the region $\mathcal{R}$ for most simulations. Only in some simulation runs, $\mathcal{R}_{T}^{\pm}$ is too large compared to $\mathcal{R}$ , which means that our method is not able to locate the region $\mathcal{R}$ sufficiently precisely. Overall, the SiZer method also produces quite satisfactory results. However, the SiZer estimates of $\mathcal{R}$ are not as precise as ours. In particular, SiZer spuriously finds regions of decrease/increase outside the interval $\mathcal{R}$ much more often than our method. It thus frequently mistakes fluctuations in the time series which are due to the dependence in the error terms for increases/decreases in the trend $m$ .

To sum up, our multiscale test exhibits good size and power properties in the simulations, and the minimal intervals produced by it identify the time regions where $m$ increases/decreases quite reliably. SiZer performs clearly worse in these respects. Nevertheless, it may still produce informative SiZer plots. All in all, we would like to regard the two methods as complementary rather than as direct competitors. SiZer is an explorative tool which aims to give an overview of the increases/decreases in $m$ by means of a SiZer plot. Our method, in contrast, is tailored to be a rigorous statistical test of the hypothesis $H_{0}$. In particular, it allows us to make rigorous confidence statements about the time regions where the trend $m$ increases/decreases.

5.3 Small sample properties of the long-run variance estimator

In the final part of the simulation study, we examine the estimators of the AR parameters and the long-run error variance from Section 4.2. We simulate data from the model $Y_{t,T}=m(t/T)+\varepsilon_{t}$ , where $\{\varepsilon_{t}\}$ is an AR( $1$ ) process of the form $\varepsilon_{t}=a_{1}\varepsilon_{t-1}+\eta_{t}$ . We consider the AR parameters $a_{1}\in\{-0.95,-0.75,-0.5,-0.25,0.25,0.5,0.75,0.95\}$ and let $\eta_{t}$ be i.i.d. standard normal innovation terms. We report our findings for a specific sample size $T$ , in particular for $T=500$ , as the results for other sample sizes are very similar. For simplicity, $m$ is chosen to be a linear function of the form $m(u)=\beta u$ with the slope parameter $\beta$ . For each value of $a_{1}$ , we consider two different slopes $\beta$ , one corresponding to a moderate and one to a pronounced trend $m$ . In particular, we let $\beta=s_{\beta}\sqrt{\textnormal{Var}(\varepsilon_{t})}$ with $s_{\beta}\in\{1,10\}$ . When $s_{\beta}=1$ , the slope $\beta$ is equal to the standard deviation $\sqrt{\textnormal{Var}(\varepsilon_{t})}$ of the error process, which yields a moderate trend $m$ . When $s_{\beta}=10$ , in contrast, the slope $\beta$ is $10$ times as large as $\sqrt{\textnormal{Var}(\varepsilon_{t})}$ , which results in a quite pronounced trend $m$ .
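The slope calibration can be made concrete using the stationary variance of an AR($1$) process with unit innovation variance, $\textnormal{Var}(\varepsilon_{t})=1/(1-a_{1}^{2})$. The following Python sketch computes $\beta=s_{\beta}\sqrt{\textnormal{Var}(\varepsilon_{t})}$ for the parameter values listed above; the function names are hypothetical.

```python
import math

def ar1_variance(a1, innov_var=1.0):
    """Stationary variance of eps_t = a1 * eps_{t-1} + eta_t with Var(eta_t) = innov_var."""
    if abs(a1) >= 1.0:
        raise ValueError("the AR(1) process is only stationary and causal for |a1| < 1")
    return innov_var / (1.0 - a1**2)

def slope(a1, s_beta):
    """Slope beta = s_beta * sqrt(Var(eps_t)) of the linear trend m(u) = beta * u."""
    return s_beta * math.sqrt(ar1_variance(a1))

ar_params = [-0.95, -0.75, -0.5, -0.25, 0.25, 0.5, 0.75, 0.95]
slopes = {(a1, s): slope(a1, s) for a1 in ar_params for s in (1, 10)}
```

Since the variance depends on $a_{1}$ only through $a_{1}^{2}$, the slopes for $a_{1}$ and $-a_{1}$ coincide, and they grow rapidly as $|a_{1}|$ approaches $1$.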

For each model specification, we generate $S=1000$ data samples and compute the following quantities for each simulated sample:

(i) the pilot estimator $\widetilde{a}_{q}$ from (4.11) with the tuning parameter $q$.

(ii) the estimator $\widehat{a}$ from (4.13) with the tuning parameter $\overline{r}$ as well as the long-run variance estimator $\widehat{\sigma}^{2}$ from (4.14).

(iii) the estimators of $a_{1}$ and $\sigma^{2}$ from Hall and Van Keilegom (2003), which are denoted by $\widehat{a}_{\text{HvK}}$ and $\widehat{\sigma}^{2}_{\text{HvK}}$ for ease of reference. The estimator $\widehat{a}_{\text{HvK}}$ is computed as described in Section 2.2 of Hall and Van Keilegom (2003) and $\widehat{\sigma}^{2}_{\text{HvK}}$ as defined at the bottom of p.447 in Section 2.3. The estimator $\widehat{a}_{\text{HvK}}$ (as well as $\widehat{\sigma}^{2}_{\text{HvK}}$) depends on two tuning parameters which we denote by $m_{1}$ and $m_{2}$ as in Hall and Van Keilegom (2003).

(iv) oracle estimators $\widehat{a}_{\text{oracle}}$ and $\widehat{\sigma}^{2}_{\text{oracle}}$ of $a_{1}$ and $\sigma^{2}$, which are constructed under the assumption that the error process $\{\varepsilon_{t}\}$ is observed. For each simulation run, we compute $\widehat{a}_{\text{oracle}}$ as the maximum likelihood estimator of $a_{1}$ from the time series of simulated error terms $\varepsilon_{1},\ldots,\varepsilon_{T}$. We then calculate the residuals $r_{t}=\varepsilon_{t}-\widehat{a}_{\text{oracle}}\,\varepsilon_{t-1}$ and estimate the innovation variance $\nu^{2}=\mathbb{E}[\eta_{t}^{2}]$ by $\widehat{\nu}_{\text{oracle}}^{2}=(T-1)^{-1}\sum_{t=2}^{T}r_{t}^{2}$. Finally, we set $\widehat{\sigma}^{2}_{\text{oracle}}=\widehat{\nu}_{\text{oracle}}^{2}/(1-\widehat{a}_{\text{oracle}})^{2}$.
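The oracle estimators in (iv) can be sketched as follows. As a simplifying assumption, this sketch replaces the maximum likelihood estimator of $a_{1}$ by the conditional least squares estimator (the Gaussian likelihood conditional on $\varepsilon_{1}$), which is not necessarily the exact variant used in the paper. The function name, the burn-in length and the seed are likewise hypothetical.

```python
import numpy as np

def oracle_ar1_estimates(eps):
    """Oracle estimates of a_1, nu^2 and sigma^2 from an observed AR(1) error path.

    Assumption of this sketch: a_1 is estimated by conditional least squares
    rather than by the full maximum likelihood estimator.
    """
    eps = np.asarray(eps, dtype=float)
    T = len(eps)
    a_hat = np.dot(eps[1:], eps[:-1]) / np.dot(eps[:-1], eps[:-1])
    resid = eps[1:] - a_hat * eps[:-1]          # r_t for t = 2, ..., T
    nu2_hat = np.sum(resid**2) / (T - 1)        # innovation variance estimate
    sigma2_hat = nu2_hat / (1.0 - a_hat) ** 2   # long-run variance of an AR(1)
    return a_hat, nu2_hat, sigma2_hat

# Demonstration on one simulated AR(1) path with a_1 = 0.5 and T = 500
rng = np.random.default_rng(1)
T, a1, burn_in = 500, 0.5, 200
eta = rng.standard_normal(T + burn_in)
eps_full = np.zeros(T + burn_in)
for t in range(1, T + burn_in):
    eps_full[t] = a1 * eps_full[t - 1] + eta[t]
eps = eps_full[burn_in:]
a_hat, nu2_hat, sigma2_hat = oracle_ar1_estimates(eps)
```

For $a_{1}=0.5$ and $\nu^{2}=1$, the true long-run variance is $1/(1-0.5)^{2}=4$, so `sigma2_hat` should be of that order in a typical run.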

Throughout the section, we set $q=25$ , $\overline{r}=10$ and $(m_{1},m_{2})=(20,30)$ . We in particular choose $q$ to be in the middle of $m_{1}$ and $m_{2}$ to make the tuning parameters of the estimators $\widetilde{a}_{q}$ and $\widehat{a}_{\text{HvK}}$ more or less comparable. In order to assess how sensitive our estimators are to the choice of $q$ and $\overline{r}$ , we carry out a number of robustness checks, considering a range of different values for $q$ and $\overline{r}$ . In addition, we vary the tuning parameters $m_{1}$ and $m_{2}$ of the estimators from Hall and Van Keilegom (2003) in order to make sure that the results of our comparison study are not driven by the particular choice of any of the involved tuning parameters. The results of our robustness checks are reported in Section S.3 of the Supplementary Material. They show that the results of our comparison study are robust to different choices of the parameters $q$ , $\overline{r}$ and $(m_{1},m_{2})$ . Moreover, they indicate that our estimators are rather insensitive to the choice of tuning parameters.

For each estimator $\widehat{a}$, $\widehat{a}_{\text{HvK}}$, $\widehat{a}_{\text{oracle}}$ and $\widehat{\sigma}^{2}$, $\widehat{\sigma}^{2}_{\text{HvK}}$, $\widehat{\sigma}^{2}_{\text{oracle}}$ and for each model specification, the simulation output consists of a vector of length $S=1000$ which contains the $1000$ simulated values of the respective estimator. Figures 3 and 4 report the mean squared error (MSE) of these $1000$ simulated values for each estimator. The $x$-axis of each plot lists the values of the AR parameter $a_{1}$ under consideration. The solid line in each plot gives the MSE values of our estimators. The dashed and dotted lines specify the MSE values of the HvK and the oracle estimators, respectively. Note that for the long-run variance estimators, the plots report the logarithm of the MSE rather than the MSE itself since the MSE values are too different across simulation scenarios to obtain a reasonable graphical presentation. In addition to the MSE values presented in Figures 3 and 4, we depict histograms of the $1000$ simulated values produced by the estimators $\widehat{a}$, $\widehat{a}_{\text{HvK}}$, $\widehat{a}_{\text{oracle}}$ and $\widehat{\sigma}^{2}$, $\widehat{\sigma}^{2}_{\text{HvK}}$, $\widehat{\sigma}^{2}_{\text{oracle}}$ for two specific simulation scenarios in Figures 5 and 6. The main findings can be summarized as follows:

(a) In the simulation scenarios with a moderate trend ($s_{\beta}=1$), the estimators $\widehat{a}_{\text{HvK}}$ and $\widehat{\sigma}^{2}_{\text{HvK}}$ of Hall and Van Keilegom (2003) exhibit a similar performance as our estimators $\widehat{a}$ and $\widehat{\sigma}^{2}$ as long as the AR parameter $a_{1}$ is not too close to $-1$. For strongly negative values of $a_{1}$ (in particular for $a_{1}=-0.75$ and $a_{1}=-0.95$), the HvK estimators perform much worse than ours. This can be clearly seen from the much larger MSE values of the estimators $\widehat{a}_{\text{HvK}}$ and $\widehat{\sigma}^{2}_{\text{HvK}}$ for $a_{1}=-0.75$ and $a_{1}=-0.95$ in Figure 3. Figure 5 gives some further insights into what is happening here. It shows the histograms of the simulated values produced by the estimators $\widehat{a}$, $\widehat{a}_{\text{HvK}}$, $\widehat{a}_{\text{oracle}}$ and the corresponding long-run variance estimators in the scenario with $a_{1}=-0.95$ and $s_{\beta}=1$. As can be seen, the estimator $\widehat{a}_{\text{HvK}}$ does not obey the causality restriction $|a_{1}|<1$ but frequently takes values substantially smaller than $-1$. This results in a very large spread of the histogram and thus in a disastrous performance of the estimator. (Footnote 9: One could of course set $\widehat{a}_{\text{HvK}}$ to $-(1-\delta)$ for some small $\delta>0$ whenever it takes a value smaller than $-1$. This modified estimator, however, is still far from performing in a satisfying way when $a_{1}$ is close to $-1$.) A similar point applies to the histogram of the long-run variance estimator $\widehat{\sigma}^{2}_{\text{HvK}}$. Our estimators $\widehat{a}$ and $\widehat{\sigma}^{2}$, in contrast, exhibit a stable behaviour in this case.

Interestingly, the estimator $\widehat{a}_{\text{HvK}}$ (as well as the corresponding long-run variance estimator $\widehat{\sigma}^{2}_{\text{HvK}}$) performs much worse than ours for large negative values but not for large positive values of $a_{1}$. This can be explained as follows: In the special case of an AR($1$) process, the estimator $\widehat{a}_{\text{HvK}}$ may produce estimates smaller than $-1$ but it cannot become larger than $1$. This can be easily seen upon inspecting the definition of the estimator. Hence, for large positive values of $a_{1}$, the estimator $\widehat{a}_{\text{HvK}}$ performs well as it satisfies the causality restriction that the estimated AR parameter should be smaller than $1$.

(b) In the simulation scenarios with a pronounced trend ($s_{\beta}=10$), the estimators of Hall and Van Keilegom (2003) are clearly outperformed by ours for most of the AR parameters $a_{1}$ under consideration. In particular, their MSE values reported in Figure 4 are much larger than the values produced by our estimators for most parameter values $a_{1}$. The reason is the following: The HvK estimators have a strong bias since the pronounced trend with $s_{\beta}=10$ is not eliminated appropriately by the underlying differencing methods. This point is illustrated by Figure 6 which shows histograms of the simulated values for the estimators $\widehat{a}$, $\widehat{a}_{\text{HvK}}$, $\widehat{a}_{\text{oracle}}$ and the corresponding long-run variance estimators in the scenario with $a_{1}=0.25$ and $s_{\beta}=10$. As can be seen, the histogram produced by our estimator $\widehat{a}$ is approximately centred around the true value $a_{1}=0.25$, whereas that of the estimator $\widehat{a}_{\text{HvK}}$ is strongly biased upwards. A similar picture arises for the long-run variance estimators $\widehat{\sigma}^{2}$ and $\widehat{\sigma}^{2}_{\text{HvK}}$.

Whereas the methods of Hall and Van Keilegom (2003) perform much worse than ours for negative and moderately positive values of $a_{1}$ , the performance (in terms of MSE) is fairly similar for large values of $a_{1}$ . This can be explained as follows: When the trend $m$ is not eliminated appropriately by taking differences, this creates spurious persistence in the data. Hence, the estimator $\widehat{a}_{\text{HvK}}$ tends to overestimate the AR parameter $a_{1}$ , that is, $\widehat{a}_{\text{HvK}}$ tends to be larger in absolute value than $a_{1}$ . Very loosely speaking, when the parameter $a_{1}$ is close to $1$ , say $a_{1}=0.95$ , there is not much room for overestimation since $\widehat{a}_{\text{HvK}}$ cannot become larger than $1$ . Consequently, the effect of not eliminating the trend appropriately has a much smaller impact on $\widehat{a}_{\text{HvK}}$ for large positive values of $a_{1}$ .

6 Application

The analysis of time trends in long temperature records is an important task in climatology. Information on the shape of the trend is needed in order to better understand long-term climate variability. The Central England temperature record is the longest instrumental temperature time series in the world. It is a valuable asset for analysing climate variability over the last few hundred years. The data is publicly available on the webpage of the UK Met Office. A detailed description of the data can be found in Parker et al. (1992). For our analysis, we use the dataset of yearly mean temperatures which consists of $T=359$ observations covering the years from $1659$ to $2017$ .

We assume that the data follow the nonparametric trend model $Y_{t,T}=m(t/T)+\varepsilon_{t}$, where $m$ is the unknown time trend of interest. The error process $\{\varepsilon_{t}\}$ is supposed to have the AR($p$) structure $\varepsilon_{t}=\sum_{j=1}^{p}a_{j}\varepsilon_{t-j}+\eta_{t}$, where $\eta_{t}$ are i.i.d. innovations with mean $0$ and variance $\nu^{2}$. As pointed out in Mudelsee (2010) among others, this is the most widely used error model for discrete climate time series. To select the AR order $p$, we proceed as follows: We estimate the AR parameters and the corresponding variance of the innovation terms for different AR orders by our methods from Section 4.2 and choose $p$ to be the minimizer of the Bayesian information criterion (BIC). This yields the AR order $p=2$. We then estimate the parameters $\boldsymbol{a}=(a_{1},a_{2})$ and the long-run error variance $\sigma^{2}$ by the estimators $\widehat{\boldsymbol{a}}=(\widehat{a}_{1},\widehat{a}_{2})$ and $\widehat{\sigma}^{2}$, which gives the values $\widehat{a}_{1}=0.167$, $\widehat{a}_{2}=0.178$ and $\widehat{\sigma}^{2}=0.749$. To select the AR order $p$ and to produce the estimators $\widehat{\boldsymbol{a}}$ and $\widehat{\sigma}^{2}$, we set $q=25$ and $\overline{r}=10$ as in the simulation study of Section 5.1. (Footnote 10: As a robustness check, we have repeated the process of order selection and parameter estimation for other values of $q$ and $\overline{r}$ as well as for other criteria such as FPE, AIC and AICC, which gave similar results.)
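The reported estimates can be cross-checked against the standard formula for the long-run variance of a causal AR($p$) process, $\sigma^{2}=\nu^{2}/(1-\sum_{j=1}^{p}a_{j})^{2}$. The Python sketch below plugs in $\widehat{a}_{1}=0.167$, $\widehat{a}_{2}=0.178$ and the innovation variance $\nu^{2}=0.322$ quoted in the simulation design of Section 5.1; the function name is hypothetical.

```python
def ar_long_run_variance(coeffs, innov_var):
    """Long-run variance nu^2 / (1 - sum_j a_j)^2 of a causal AR(p) process."""
    one_minus_sum = 1.0 - sum(coeffs)
    if one_minus_sum <= 0.0:
        raise ValueError("the AR coefficients must sum to less than 1")
    return innov_var / one_minus_sum**2

# Rounded estimates quoted in the text
sigma2 = ar_long_run_variance([0.167, 0.178], 0.322)
```

With these rounded inputs, the formula gives a value of about $0.75$, which agrees with the reported $\widehat{\sigma}^{2}=0.749$ up to rounding.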

With the help of our multiscale method from Section 3, we now test the null hypothesis $H_{0}$ that $m$ is constant on all intervals $[u-h,u+h]$ with $(u,h)\in\mathcal{G}_{T}$ , where we use the grid $\mathcal{G}_{T}$ defined in (5.1). To do so, we set the significance level to $\alpha=0.05$ and implement the test in exactly the same way as in the simulations of Section 5.1. The results are presented in Figure 7. The upper panel shows the raw temperature time series, whereas the middle panel depicts local linear kernel estimates of the trend $m$ for different bandwidths $h$ . As one can see, the shape of the estimated time trend strongly differs with the chosen bandwidth. When the bandwidth is small, there are many local increases and decreases in the estimated trend. When the bandwidth is large, most of these local variations get smoothed out. Hence, by themselves, the nonparametric fits do not give much information on whether the trend $m$ is increasing or decreasing in certain time regions.

Our multiscale test provides this kind of information, which is summarized in the lower panel of Figure 7. The plot depicts the minimal intervals contained in the set $\Pi_{T}^{+}$ , which is defined in Section 3.3. The set of intervals $\Pi_{T}^{-}$ is empty in the present case. The height at which a minimal interval $I_{u,h}=[u-h,u+h]\in\Pi_{T}^{+}$ is plotted indicates the value of the corresponding (additively corrected) test statistic $\widehat{\psi}_{T}(u,h)/\widehat{\sigma}-\lambda(h)$ . The dashed line specifies the critical value $q_{T}(\alpha)$ , where $\alpha=0.05$ as already mentioned above. According to Proposition 3.3, we can make the following simultaneous confidence statement about the collection of minimal intervals in $\Pi_{T}^{+}$ . We can claim, with confidence of about $95\%$ , that the trend function $m$ has some increase on each minimal interval. More specifically, we can claim with this confidence that there has been some upward movement in the trend both in the period from around $1680$ to $1740$ and in the period from about $1870$ onwards. Hence, our test in particular provides evidence that there has been some warming trend in the period over approximately the last $150$ years. On the other hand, as the set $\Pi_{T}^{-}$ is empty, there is no evidence of any downward movement of the trend.

S.1 Proofs of the results from Section 3

In this section, we prove the theoretical results from Section 3. We use the following notation: The symbol $C$ denotes a universal real constant which may take a different value on each occurrence. For $a,b\in\mathbb{R}$, we write $a_{+}=\max\{0,a\}$ and $a\vee b=\max\{a,b\}$. For any set $A$, the symbol $|A|$ denotes the cardinality of $A$. The notation $X\stackrel{\mathcal{D}}{=}Y$ means that the two random variables $X$ and $Y$ have the same distribution. Finally, $f_{0}(\cdot)$ and $F_{0}(\cdot)$ denote the density and distribution function of the standard normal distribution, respectively.

Auxiliary results using strong approximation theory

The main purpose of this section is to prove that there is a version of the multiscale statistic $\widehat{\Phi}_{T}$ defined in (3.4) which is close to a Gaussian statistic whose distribution is known. More specifically, we prove the following result.

Proposition S.1.

Under the conditions of Theorem 3.1, there exist statistics $\widetilde{\Phi}_{T}$ for $T=1,2,\ldots$ with the following two properties: (i) $\widetilde{\Phi}_{T}$ has the same distribution as $\widehat{\Phi}_{T}$ for any $T$ , and (ii)

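$$\big|\widetilde{\Phi}_{T}-\Phi_{T}\big|=o_{p}\Big(\frac{T^{1/q}}{\sqrt{Th_{\min}}}+\rho_{T}\sqrt{\log T}\Big),$$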

where $\Phi_{T}$ is a Gaussian statistic as defined in (3.3).

Proof of Proposition S.1.

For the proof, we draw on strong approximation theory for stationary processes $\{\varepsilon_{t}\}$ that fulfill the conditions (C1)–(C3). By Theorem 2.1 and Corollary 2.1 in Berkes et al. (2014), the following strong approximation result holds true: On a richer probability space, there exist a standard Brownian motion $\mathbb{B}$ and a sequence $\{\widetilde{\varepsilon}_{t}:t\in\mathbb{N}\}$ such that $[\widetilde{\varepsilon}_{1},\ldots,\widetilde{\varepsilon}_{T}]\stackrel{\mathcal{D}}{=}[\varepsilon_{1},\ldots,\varepsilon_{T}]$ for each $T$ and

[TABLE]

where $\sigma^{2}=\sum_{k\in\mathbb{Z}}\textnormal{Cov}(\varepsilon_{0},\varepsilon_{k})$ denotes the long-run error variance. To apply this result, we define

[TABLE]

where $\widetilde{\phi}_{T}(u,h)=\sum\nolimits_{t=1}^{T}w_{t,T}(u,h)\widetilde{\varepsilon}_{t}$ and $\widetilde{\sigma}^{2}$ is the same estimator as $\widehat{\sigma}^{2}$ with $Y_{t,T}=m(t/T)+\varepsilon_{t}$ replaced by $\widetilde{Y}_{t,T}=m(t/T)+\widetilde{\varepsilon}_{t}$ for $1\leq t\leq T$ . In addition, we let

[TABLE]

with $\phi_{T}(u,h)=\sum\nolimits_{t=1}^{T}w_{t,T}(u,h)\sigma Z_{t}$ and $Z_{t}=\mathbb{B}(t)-\mathbb{B}(t-1)$ . With this notation, we can write

[TABLE]

where the last equality follows by taking into account that $\phi_{T}(u,h)\sim N(0,\sigma^{2})$ for all $(u,h)\in\mathcal{G}_{T}$ , $|\mathcal{G}_{T}|=O(T^{\theta})$ for some large but fixed constant $\theta$ and $\widetilde{\sigma}^{2}=\sigma^{2}+o_{p}(\rho_{T})$ . Straightforward calculations yield that

[TABLE]

Using summation by parts, we further obtain that

[TABLE]

where

[TABLE]

Standard arguments show that $\max_{(u,h)\in\mathcal{G}_{T}}W_{T}(u,h)=O(1/\sqrt{Th_{\min}})$ . Applying the strong approximation result (S.1), we can thus infer that

[TABLE]

Plugging (S.3) into (S.2) completes the proof. ∎

Auxiliary results using anti-concentration bounds

In this section, we establish some properties of the Gaussian statistic $\Phi_{T}$ defined in (3.3). We in particular show that $\Phi_{T}$ does not concentrate too strongly in small regions of the form $[x-\delta_{T},x+\delta_{T}]$ with $\delta_{T}$ converging to zero.

Proposition S.2.

Under the conditions of Theorem 3.1, it holds that

$$\sup_{x\in\mathbb{R}}\mathbb{P}\big(|\Phi_{T}-x|\leq\delta_{T}\big)=o(1),$$

where $\delta_{T}=T^{1/q}/\sqrt{Th_{\min}}+\rho_{T}\sqrt{\log T}$ .

Proof of Proposition S.2.

The main technical tools for proving Proposition S.2 are anti-concentration bounds for Gaussian random vectors. The following proposition slightly generalizes anti-concentration results derived in Chernozhukov et al. (2015), in particular Theorem 3 therein.

Proposition S.3.

Let $(X_{1},\ldots,X_{p})^{\top}$ be a Gaussian random vector in $\mathbb{R}^{p}$ with $\mathbb{E}[X_{j}]=\mu_{j}$ and $\textnormal{Var}(X_{j})=\sigma_{j}^{2}>0$ for $1\leq j\leq p$ . Define $\overline{\mu}=\max_{1\leq j\leq p}|\mu_{j}|$ together with $\underline{\sigma}=\min_{1\leq j\leq p}\sigma_{j}$ and $\overline{\sigma}=\max_{1\leq j\leq p}\sigma_{j}$ . Moreover, set $a_{p}=\mathbb{E}[\max_{1\leq j\leq p}(X_{j}-\mu_{j})/\sigma_{j}]$ and $b_{p}=\mathbb{E}[\max_{1\leq j\leq p}(X_{j}-\mu_{j})]$ . For every $\delta>0$ , it holds that

[TABLE]

where $C>0$ depends only on $\underline{\sigma}$ and $\overline{\sigma}$ .

The proof of Proposition S.3 is provided at the end of this section for completeness. To apply Proposition S.3 to our setting at hand, we introduce the following notation: We write $x=(u,h)$ along with $\mathcal{G}_{T}=\{x:x\in\mathcal{G}_{T}\}=\{x_{1},\ldots,x_{p}\}$ , where $p:=|\mathcal{G}_{T}|\leq O(T^{\theta})$ for some large but fixed $\theta>0$ by our assumptions. Moreover, for $j=1,\ldots,p$ , we set

[TABLE]

with $x_{j}=(x_{j1},x_{j2})$ . This notation allows us to write

[TABLE]

where $(X_{1},\ldots,X_{2p})^{\top}$ is a Gaussian random vector with the following properties: (i) $\mu_{j}:=\mathbb{E}[X_{j}]=-\lambda(x_{j2})$ and thus $\overline{\mu}=\max_{1\leq j\leq 2p}|\mu_{j}|\leq C\sqrt{\log T}$ , and (ii) $\sigma_{j}^{2}:=\textnormal{Var}(X_{j})=1$ for all $j$ . Since $\sigma_{j}=1$ for all $j$ , it holds that $a_{2p}=b_{2p}$ . Moreover, as the variables $(X_{j}-\mu_{j})/\sigma_{j}$ are standard normal, we have that $a_{2p}=b_{2p}\leq\sqrt{2\log(2p)}\leq C\sqrt{\log T}$ . With this notation at hand, we can apply Proposition S.3 to obtain that

[TABLE]

with $\delta_{T}=T^{1/q}/\sqrt{Th_{\min}}+\rho_{T}\sqrt{\log T}$ , which is the statement of Proposition S.2. ∎
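The bound $a_{2p}=b_{2p}\leq\sqrt{2\log(2p)}$ used above can be checked numerically. The following is a minimal sketch, not part of the original proof: it assumes independent standard normal variables (the bound itself holds under any dependence structure), and the function names are illustrative.

```python
import math
import random

def mean_max_std_normal(n_vars, n_rep=500, seed=0):
    """Monte Carlo estimate of E[max_j Z_j] for n_vars independent N(0, 1) variables."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rep):
        total += max(rng.gauss(0.0, 1.0) for _ in range(n_vars))
    return total / n_rep

def gaussian_max_bound(n_vars):
    """Upper bound sqrt(2 log n) on the expected maximum of n standard normals."""
    return math.sqrt(2.0 * math.log(n_vars))

# Compare the simulated expected maximum with the bound for growing dimension.
results = {n: (mean_max_std_normal(n), gaussian_max_bound(n)) for n in (10, 100, 1000)}
for n, (estimate, bound) in results.items():
    print(f"n = {n:4d}: E[max] ~ {estimate:.3f}, sqrt(2 log n) = {bound:.3f}")
```

Since $p=O(T^{\theta})$, the bound grows only like $\sqrt{\log T}$, which is the rate that enters $\delta_{T}$ through the factor $\sqrt{\log T}$.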

Proof of Theorem 3.1

To prove Theorem 3.1, we make use of the two auxiliary results derived above. By Proposition S.1, there exist statistics $\widetilde{\Phi}_{T}$ for $T=1,2,\ldots$ which are distributed as $\widehat{\Phi}_{T}$ for any $T\geq 1$ and which have the property that

[TABLE]

where $\Phi_{T}$ is a Gaussian statistic as defined in (3.3). The approximation result (S.4) allows us to replace the multiscale statistic $\widehat{\Phi}_{T}$ by an identically distributed version $\widetilde{\Phi}_{T}$ which is close to the Gaussian statistic $\Phi_{T}$ . In the next step, we show that

[TABLE]

which immediately implies the statement of Theorem 3.1. For the proof of (S.5), we use the following simple lemma:

Lemma S.1.

Let $V_{T}$ and $W_{T}$ be real-valued random variables for $T=1,2,\ldots$ such that $V_{T}-W_{T}=o_{p}(\delta_{T})$ with some $\delta_{T}=o(1)$ . If

[TABLE]

then

[TABLE]

The statement of Lemma S.1 can be summarized as follows: If $W_{T}$ can be approximated by $V_{T}$ in the sense that $V_{T}-W_{T}=o_{p}(\delta_{T})$ and if $V_{T}$ does not concentrate too strongly in small regions of the form $[x-\delta_{T},x+\delta_{T}]$ as assumed in (S.6), then the distribution of $W_{T}$ can be approximated by that of $V_{T}$ in the sense of (S.7).

Proof of Lemma S.1.

It holds that

[TABLE]

We now apply this lemma with $V_{T}=\Phi_{T}$ , $W_{T}=\widetilde{\Phi}_{T}$ and $\delta_{T}=T^{1/q}/\sqrt{Th_{\min}}+\rho_{T}\sqrt{\log T}$ : From (S.4), we already know that $\widetilde{\Phi}_{T}-\Phi_{T}=o_{p}(\delta_{T})$ . Moreover, by Proposition S.2, it holds that

[TABLE]

Hence, the conditions of Lemma S.1 are satisfied. Applying the lemma, we obtain (S.5), which completes the proof of Theorem 3.1.

Proof of Proposition 3.2

To start with, we introduce the notation $\widehat{\psi}_{T}(u,h)=\widehat{\psi}_{T}^{A}(u,h)+\widehat{\psi}_{T}^{B}(u,h)$ with $\widehat{\psi}_{T}^{A}(u,h)=\sum\nolimits_{t=1}^{T}w_{t,T}(u,h)\varepsilon_{t}$ and $\widehat{\psi}_{T}^{B}(u,h)=\sum\nolimits_{t=1}^{T}w_{t,T}(u,h)m_{T}(\frac{t}{T})$ . By assumption, there exists $(u_{0},h_{0})\in\mathcal{G}_{T}$ with $[u_{0}-h_{0},u_{0}+h_{0}]\subseteq[0,1]$ such that $m_{T}^{\prime}(w)\geq c_{T}\sqrt{\log T/(Th_{0}^{3})}$ for all $w\in[u_{0}-h_{0},u_{0}+h_{0}]$ . (The case that $-m_{T}^{\prime}(w)\geq c_{T}\sqrt{\log T/(Th_{0}^{3})}$ for all $w$ can be treated analogously.) Below, we prove that under this assumption,

[TABLE]

for sufficiently large $T$ , where $\kappa=(\int K(\varphi)\varphi^{2}d\varphi)/(\int K^{2}(\varphi)\varphi^{2}d\varphi)^{1/2}$ . Moreover, by arguments very similar to those for the proof of Proposition S.1, it follows that

[TABLE]

With the help of (S.9), (S.10) and the fact that $\lambda(h)\leq\lambda(h_{\min})\leq C\sqrt{\log T}$ , we can infer that

[TABLE]

for sufficiently large $T$ . Since $q_{T}(\alpha)=O(\sqrt{\log T})$ for any fixed $\alpha\in(0,1)$ , (S.11) immediately yields that $\mathbb{P}(\widehat{\Psi}_{T}\leq q_{T}(\alpha))=o(1)$ , which is the statement of Proposition 3.2.

Proof of (S.9).

Write $m_{T}(\frac{t}{T})=m_{T}(u_{0})+m_{T}^{\prime}(\xi_{u_{0},t,T})(\frac{t}{T}-u_{0})$ , where $\xi_{u_{0},t,T}$ is an intermediate point between $u_{0}$ and $t/T$ . The local linear weights $w_{t,T}(u_{0},h_{0})$ are constructed such that $\sum_{t=1}^{T}w_{t,T}(u_{0},h_{0})=0$ . We thus obtain that

[TABLE]

Moreover, since the kernel $K$ is symmetric and $u_{0}=t/T$ for some $t$ , it holds that $S_{T,1}(u_{0},h_{0})=0$ , which in turn implies that

[TABLE]

From (S.12), (S.13) and the assumption that $m_{T}^{\prime}(w)\geq c_{T}\sqrt{\log T/(Th_{0}^{3})}$ for all $w\in[u_{0}-h_{0},u_{0}+h_{0}]$ , we get that

[TABLE]

Standard calculations exploiting the Lipschitz continuity of the kernel $K$ show that for any $(u,h)\in\mathcal{G}_{T}$ and any given natural number $\ell$ ,

[TABLE]

where the constant $C$ does not depend on $u$ , $h$ and $T$ . With the help of (S.13) and (S.15), we obtain that for any $(u,h)\in\mathcal{G}_{T}$ with $[u-h,u+h]\subseteq[0,1]$ ,

[TABLE]

where the constant $C$ once again does not depend on $u$ , $h$ and $T$ . (S.16) implies that $\sum\nolimits_{t=1}^{T}w_{t,T}(u,h)(\frac{t}{T}-u)/h\geq\kappa\sqrt{Th}/2$ for sufficiently large $T$ and any $(u,h)\in\mathcal{G}_{T}$ with $[u-h,u+h]\subseteq[0,1]$ . Using this together with (S.14), we immediately obtain (S.9). ∎
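The two properties of the local linear weights used in this proof, namely $\sum_{t}w_{t,T}(u,h)=0$ and $\sum_{t}w_{t,T}(u,h)(\frac{t}{T}-u)/h\approx\kappa\sqrt{Th}$, can be illustrated numerically. The sketch below rests on assumptions that are not restated in this section: the weights are taken to be $w_{t,T}(u,h)=\Lambda_{t,T}(u,h)/\{\sum_{s}\Lambda_{s,T}(u,h)^{2}\}^{1/2}$ with $\Lambda_{t,T}(u,h)=K(z_{t})[S_{T,0}(u,h)z_{t}-S_{T,1}(u,h)]$, $z_{t}=(\frac{t}{T}-u)/h$ and $S_{T,\ell}(u,h)=(Th)^{-1}\sum_{t}K(z_{t})z_{t}^{\ell}$, and $K$ is chosen to be the Epanechnikov kernel, for which $\int K(\varphi)\varphi^{2}d\varphi=1/5$ and $\int K^{2}(\varphi)\varphi^{2}d\varphi=3/35$.

```python
import math

def epanechnikov(x):
    """Epanechnikov kernel on [-1, 1] (an illustrative choice of K)."""
    return 0.75 * (1.0 - x * x) if abs(x) <= 1.0 else 0.0

def local_linear_weights(T, u, h, kernel=epanechnikov):
    """Normalized local linear weights w_{t,T}(u, h) under the assumed definition."""
    z = [((t / T) - u) / h for t in range(1, T + 1)]
    k = [kernel(zt) for zt in z]
    s0 = sum(k) / (T * h)                                  # S_{T,0}(u, h)
    s1 = sum(kt * zt for kt, zt in zip(k, z)) / (T * h)    # S_{T,1}(u, h)
    lam = [kt * (s0 * zt - s1) for kt, zt in zip(k, z)]
    norm = math.sqrt(sum(l * l for l in lam))
    return [l / norm for l in lam], z

T, u0, h0 = 1000, 0.5, 0.1
weights, z = local_linear_weights(T, u0, h0)

weight_sum = sum(weights)                                  # should vanish
slope_sum = sum(w * zt for w, zt in zip(weights, z))       # sum_t w_t (t/T - u)/h
kappa = 0.2 / math.sqrt(3.0 / 35.0)                        # kappa for the Epanechnikov kernel

print(f"sum of weights      : {weight_sum:.2e}")
print(f"sum_t w_t z_t       : {slope_sum:.4f}")
print(f"kappa * sqrt(T h)   : {kappa * math.sqrt(T * h0):.4f}")
```

In this example $u_{0}=t/T$ for $t=500$, so the kernel weights are symmetric around $u_{0}$ and every summand $w_{t,T}(u_{0},h_{0})(\frac{t}{T}-u_{0})/h_{0}$ is non-negative, in line with (S.13).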

Proof of Proposition 3.3

In what follows, we show that

[TABLE]

The other statements of Proposition 3.3 can be verified by analogous arguments. (S.17) is a consequence of the following two observations:

(i)

For all $(u,h)\in\mathcal{G}_{T}$ with

[TABLE]

it holds that $\mathbb{E}[\widehat{\psi}_{T}(u,h)]>0$ .

(ii)

For all $(u,h)\in\mathcal{G}_{T}$ with $[u-h,u+h]\subseteq[0,1]$ , $\mathbb{E}[\widehat{\psi}_{T}(u,h)]>0$ implies that $m^{\prime}(v)>0$ for some $v\in[u-h,u+h]$ .

Observation (i) is trivial; (ii) can be seen as follows: Let $(u,h)$ be any point with $(u,h)\in\mathcal{G}_{T}$ and $[u-h,u+h]\subseteq[0,1]$ . It holds that $\mathbb{E}[\widehat{\psi}_{T}(u,h)]=\widehat{\psi}_{T}^{B}(u,h)$ , where $\widehat{\psi}_{T}^{B}(u,h)$ has been defined in the proof of Proposition 3.2. As already shown in (S.12),

[TABLE]

where $\xi_{u,t,T}$ is some intermediate point between $u$ and $t/T$ . Moreover, by (S.13), it holds that $w_{t,T}(u,h)(\frac{t}{T}-u)/h\geq 0$ for any $t$ . Hence, $\mathbb{E}[\widehat{\psi}_{T}(u,h)]=\widehat{\psi}_{T}^{B}(u,h)$ can only take a positive value if $m^{\prime}(v)>0$ for some $v\in[u-h,u+h]$ .

From observations (i) and (ii), we can draw the following conclusions: On the event

[TABLE]

it holds that for all $(u,h)\in\mathcal{A}_{T}^{+}$ with $[u-h,u+h]\subseteq[0,1]$ , $m^{\prime}(v)>0$ for some $v\in I_{u,h}=[u-h,u+h]$ . We thus obtain that $\{\widehat{\Phi}_{T}\leq q_{T}(\alpha)\}\subseteq E_{T}^{+}$ . This in turn implies that

[TABLE]

where the last equality holds by Theorem 3.1.

Proof of Proposition S.3

The proof makes use of the following three lemmas, which correspond to Lemmas 5–7 in Chernozhukov et al. (2015).

Lemma S.2.

Let $(W_{1},\ldots,W_{p})^{\top}$ be a (not necessarily centred) Gaussian random vector in $\mathbb{R}^{p}$ with $\textnormal{Var}(W_{j})=1$ for all $1\leq j\leq p$ . Suppose that $\textnormal{Corr}(W_{j},W_{k})<1$ whenever $j\neq k$ . Then the distribution of $\max_{1\leq j\leq p}W_{j}$ is absolutely continuous with respect to Lebesgue measure and a version of the density is given by

[TABLE]

Lemma S.3.

Let $(W_{0},W_{1},\ldots,W_{p})^{\top}$ be a (not necessarily centred) Gaussian random vector with $\textnormal{Var}(W_{j})=1$ for all $0\leq j\leq p$ . Suppose that $\mathbb{E}[W_{0}]\geq 0$ . Then the map

[TABLE]

is non-decreasing on $\mathbb{R}$ .

Lemma S.4.

Let $(X_{1},\ldots,X_{p})^{\top}$ be a centred Gaussian random vector in $\mathbb{R}^{p}$ with $\max_{1\leq j\leq p}\mathbb{E}[X_{j}^{2}]\leq\sigma_{X}^{2}$ for some $\sigma_{X}^{2}>0$ . Then for any $r>0$ ,

[TABLE]

The proofs of Lemmas S.2 and S.3 can be found in Chernozhukov et al. (2015). Lemma S.4 is a standard result on Gaussian concentration whose proof is given e.g. in Ledoux (2001); see Theorem 7.1 therein. We now closely follow the arguments for the proof of Theorem 3 in Chernozhukov et al. (2015). The proof splits up into three steps.

Step 1. Pick any $x\geq 0$ and set

[TABLE]

By construction, $\mathbb{E}[W_{j}]\geq 0$ and $\textnormal{Var}(W_{j})=1$ . Defining $Z=\max_{1\leq j\leq p}W_{j}$ , it holds that

[TABLE]

Step 2. We now bound the density of $Z$ . Without loss of generality, we assume that $\text{Corr}(W_{j},W_{k})<1$ for $k\neq j$ . The marginal distribution of $W_{j}$ is $N(\nu_{j},1)$ with $\nu_{j}=\mathbb{E}[W_{j}]=(\mu_{j}/\sigma_{j}+\overline{\mu}/{\underline{\sigma}})+(x/\underline{\sigma}-x/\sigma_{j})\geq 0$ . Hence, by Lemmas S.2 and S.3, the random variable $Z$ has a density of the form

[TABLE]

where the map $z\mapsto G_{p}(z)$ is non-decreasing. Define $\overline{Z}=\max_{1\leq j\leq p}(W_{j}-\mathbb{E}[W_{j}])$ and set $\overline{z}=2\overline{\mu}/\underline{\sigma}+x(1/\underline{\sigma}-1/\overline{\sigma})$ such that $\mathbb{E}[W_{j}]\leq\overline{z}$ for any $1\leq j\leq p$ . With these definitions at hand, we obtain that

[TABLE]

where the last inequality follows from Lemma S.4. Since $W_{j}-\mathbb{E}[W_{j}]=(X_{j}-\mu_{j})/\sigma_{j}$ , it holds that

[TABLE]

Hence, for every $z\in\mathbb{R}$ ,

[TABLE]

Mills' inequality states that for $z>0$ ,

[TABLE]

Since $(1+z^{2})/z^{2}\leq 2$ for $z\geq 1$ and $f_{0}(z)/\{1-F_{0}(z)\}\leq 1.53\leq 2$ for $z\in(-\infty,1)$ , we can infer that

[TABLE]

This together with (S.18) and (S.19) yields that

[TABLE]
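The hazard-rate bounds for the standard normal distribution used in Step 2 can be verified numerically. The displayed inequalities are not reproduced in this text, so the sketch below assumes their standard forms: $f_{0}(z)/\{1-F_{0}(z)\}\leq(1+z^{2})/z$ for $z>0$ and, as a consequence, $f_{0}(z)/\{1-F_{0}(z)\}\leq 2(z\vee 1)$ for all $z$, where $f_{0}$ and $F_{0}$ denote the standard normal density and distribution function.

```python
import math

def normal_pdf(z):
    """Standard normal density f_0(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_sf(z):
    """Survival function 1 - F_0(z), computed stably via erfc."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def hazard(z):
    """Hazard rate f_0(z) / (1 - F_0(z)) of the standard normal distribution."""
    return normal_pdf(z) / normal_sf(z)

grid = [-5.0 + 0.01 * k for k in range(1301)]  # z from -5 to 8

# Assumed form of the inequality for z > 0: hazard(z) <= (1 + z^2) / z.
mills_ok = all(hazard(z) <= (1.0 + z * z) / z for z in grid if z > 0.0)
# Assumed consequence for all z: hazard(z) <= 2 * max(z, 1).
two_max_ok = all(hazard(z) <= 2.0 * max(z, 1.0) for z in grid)
# The hazard rate is increasing, so its value at z = 1 bounds it on (-inf, 1).
hazard_at_one = hazard(1.0)

print(mills_ok, two_max_ok, round(hazard_at_one, 4))
```

The value of the hazard rate at $z=1$ is what lies behind the constant $1.53$ in the argument above.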

Step 3. By Step 2, we get that for any $y\in\mathbb{R}$ and $u>0$ ,

[TABLE]

where the last inequality follows from the fact that the map $z\mapsto ze^{-(z-a)^{2}/2}$ (with $a>0$ ) is non-increasing on $[a+1,\infty)$ . Combining this bound with Step 1, we further obtain that for any $x\geq 0$ and $\delta>0$ ,

[TABLE]

This inequality also holds for $x<0$ by an analogous argument, and hence for all $x\in\mathbb{R}$ .

Now let $0<\delta\leq\underline{\sigma}$ and define $b_{p}=\mathbb{E}\max_{1\leq j\leq p}\{X_{j}-\mu_{j}\}$ . For any $|x|\leq\delta+\overline{\mu}+b_{p}+\overline{\sigma}\sqrt{2\log(\underline{\sigma}/\delta)}$ , (S.20) yields that

[TABLE]

with a sufficiently large constant $C>0$ that depends only on $\underline{\sigma}$ and $\overline{\sigma}$ . For $|x|\geq\delta+\overline{\mu}+b_{p}+\overline{\sigma}\sqrt{2\log(\underline{\sigma}/\delta)}$ , we obtain that

[TABLE]

which can be seen as follows: If $x>\delta+\overline{\mu}$ , then $|\max_{j}X_{j}-x|\leq\delta$ implies that $|x|-\delta\leq\max_{j}X_{j}\leq\max_{j}\{X_{j}-\mu_{j}\}+\overline{\mu}$ and thus $\max_{j}\{X_{j}-\mu_{j}\}\geq|x|-\delta-\overline{\mu}$ . Hence, it holds that

[TABLE]

If $x<-(\delta+\overline{\mu})$ , then $|\max_{j}X_{j}-x|\leq\delta$ implies that $\max_{j}\{X_{j}-\mu_{j}\}\leq-|x|+\delta+\overline{\mu}$ . Hence, in this case,

[TABLE]

where the last inequality follows from the fact that for centred Gaussian random variables $V_{j}$ and $v>0$ , $\mathbb{P}(\max_{j}V_{j}\leq-v)\leq\mathbb{P}(V_{1}\leq-v)=\mathbb{P}(V_{1}\geq v)\leq\mathbb{P}(\max_{j}V_{j}\geq v)$ . With (S.23) and (S.24), we obtain that for any $|x|\geq\delta+\overline{\mu}+b_{p}+\overline{\sigma}\sqrt{2\log(\underline{\sigma}/\delta)}$ ,

[TABLE]

the last inequality following from Lemma S.4. To sum up, we have established that for any $0<\delta\leq\underline{\sigma}$ and any $x\in\mathbb{R}$ ,

[TABLE]

with some constant $C>0$ that depends only on $\underline{\sigma}$ and $\overline{\sigma}$ . For $\delta>\underline{\sigma}$ , (S.25) trivially follows upon setting $C\geq 1/\underline{\sigma}$ . This completes the proof.

S.2 Proofs of the results from Section 4

In what follows, we prove Proposition 4.1 from Section 4. The notation is the same as in the previous section. In particular, we use the symbol $C$ to denote a generic constant which may take a different value on each occurrence.

Auxiliary results

To start with, we derive some auxiliary results needed for the proof of Proposition 4.1. The first lemma analyses the term

[TABLE]

where $\ell_{1},\ell_{2}$ and $L$ are natural numbers with $0\leq\ell_{1},\ell_{2}\leq L$ that may depend on the sample size $T$ , that is, $L=L_{T}$ as well as $\ell_{1}=\ell_{1,T}$ and $\ell_{2}=\ell_{2,T}$ .

Lemma S.5.

For any $L=L_{T}$ with $L_{T}/T\rightarrow 0$ , it holds that

[TABLE]

where $\gamma_{\varepsilon}(\ell)=\textnormal{Cov}(\varepsilon_{t},\varepsilon_{t-\ell})$ .

Proof of Lemma S.5.

Since the variables $\varepsilon_{t}$ have the expansion $\varepsilon_{t}=\sum_{k=0}^{\infty}c_{k}\eta_{t-k}$ and $\gamma_{\varepsilon}(\ell)=(\sum_{k=0}^{\infty}c_{k}c_{k+\ell})\nu^{2}$ , it holds that

[TABLE]

where

[TABLE]

with $\kappa=\mathbb{E}[\eta_{0}^{4}]-3\nu^{4}$ and $c_{k}=0$ for $k<0$ . Noting that

[TABLE]

we can infer that

[TABLE]

the last equality following from the fact that the autocovariances $\gamma_{\varepsilon}(\ell)$ are absolutely summable and the coefficients $c_{k}$ decay exponentially fast to zero. ∎
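The representation $\gamma_{\varepsilon}(\ell)=(\sum_{k=0}^{\infty}c_{k}c_{k+\ell})\nu^{2}$ used in this proof can be illustrated for an AR(1) error process, where $c_{k}=a_{1}^{k}$ and the autocovariances have the closed form $\gamma_{\varepsilon}(\ell)=\nu^{2}a_{1}^{|\ell|}/(1-a_{1}^{2})$. The following is a minimal sketch; the parameter values and the truncation of the infinite sum at 200 terms are illustrative assumptions.

```python
def ma_autocovariance(coeffs, lag, nu2):
    """gamma(lag) = nu2 * sum_k c_k c_{k+lag} for a (truncated) MA(infinity) expansion."""
    return nu2 * sum(coeffs[k] * coeffs[k + lag] for k in range(len(coeffs) - lag))

a1, nu2 = 0.5, 1.0
coeffs = [a1 ** k for k in range(200)]  # c_k = a1^k for an AR(1) process

comparison = []
for lag in range(4):
    from_expansion = ma_autocovariance(coeffs, lag, nu2)
    closed_form = nu2 * a1 ** lag / (1.0 - a1 ** 2)
    comparison.append((lag, from_expansion, closed_form))
    print(f"lag {lag}: expansion = {from_expansion:.6f}, closed form = {closed_form:.6f}")
```

The exponential decay of the coefficients $c_{k}$ is what makes the truncation error negligible here and what yields the absolute summability of the autocovariances invoked at the end of the proof.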

We next show that the empirical autocovariances

[TABLE]

of the process $\{\Delta_{q}Y_{t,T}\}$ have the following property.

Lemma S.6.

For any $q=q_{T}$ with $q_{T}/\sqrt{T}\rightarrow 0$ and any $1\leq\ell\leq p+1$ , it holds that

[TABLE]

where $\gamma_{q}(\ell)=\textnormal{Cov}(\Delta_{q}\varepsilon_{t},\Delta_{q}\varepsilon_{t-\ell})$ .

Proof of Lemma S.6.

To analyse the term $\widehat{\gamma}_{q}(\ell)$ , we decompose it as follows:

[TABLE]

where

[TABLE]

as well as $R_{A}=(T-q)^{-1}\sum_{t=q+\ell+1}^{T}\Delta_{q}m_{t}\Delta_{q}\varepsilon_{t-\ell}$ , $R_{B}=(T-q)^{-1}\sum_{t=q+\ell+1}^{T}\Delta_{q}\varepsilon_{t}\Delta_{q}m_{t-\ell}$ and $R_{C}=(T-q)^{-1}\sum_{t=q+\ell+1}^{T}\Delta_{q}m_{t}\Delta_{q}m_{t-\ell}$ with $\Delta_{q}m_{t}=m(\frac{t}{T})-m(\frac{t-q}{T})$ . With the help of Lemma S.5, it is straightforward to show that

[TABLE]

Moreover, the Cauchy-Schwarz inequality yields that

[TABLE]

Since $m$ is Lipschitz by assumption, we get that $(T-q)^{-1}\sum_{t=q+\ell+1}^{T}(\Delta_{q}m_{t})^{2}\leq C(q/T)^{2}$ . In addition, it obviously holds that $\mathbb{E}[(T-q)^{-1}\sum_{t=q+\ell+1}^{T}(\Delta_{q}\varepsilon_{t-\ell})^{2}]=O(1)$ . Hence, we can infer that

[TABLE]

which implies that $R_{A}=o_{p}(T^{-1/2})$ . Similar arguments yield that $R_{j}=o_{p}(T^{-1/2})$ for $j=B,C$ as well. Putting everything together, we arrive at the statement of Lemma S.6. ∎

Proof of Proposition 4.1

We first show that the pilot estimator $\widetilde{\boldsymbol{a}}_{q}$ converges to $\boldsymbol{a}$ . In particular, we verify that $\widetilde{\boldsymbol{a}}_{q}-\boldsymbol{a}=O_{p}(T^{-1/2})$ . By Lemma S.6, it holds that $\widehat{\boldsymbol{\Gamma}}_{q}=\boldsymbol{\Gamma}_{q}+O_{p}(T^{-1/2})$ and $\widehat{\boldsymbol{\gamma}}_{q}=\boldsymbol{\gamma}_{q}+O_{p}(T^{-1/2})$ . Since $\boldsymbol{\Gamma}_{q}$ is invertible, this implies that

[TABLE]

With the help of equation (4.10), we can further infer that

[TABLE]

As already noted in Section 4.2, the entries of the vector $\boldsymbol{c}_{q}=(c_{q-1},\ldots,c_{q-p})^{\top}$ decay exponentially fast to zero, that is, $|c_{k}|\leq C\rho^{k}$ for some $0<\rho<1$ . Moreover, it holds that $\gamma_{q}(\ell)\rightarrow 2\gamma_{\varepsilon}(\ell)$ for any fixed $\ell$ as $q\rightarrow\infty$ . Consequently, $\|\nu^{2}\boldsymbol{\Gamma}_{q}^{-1}\boldsymbol{c}_{q}\|_{\infty}=o(T^{-1/2})$ , where $\|\cdot\|_{\infty}$ denotes the usual supremum norm for vectors. As a result, we obtain that $\widetilde{\boldsymbol{a}}_{q}-\boldsymbol{a}=O_{p}(T^{-1/2})$ .
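The behaviour of the pilot estimator can be illustrated in the simplest case $p=1$. The sketch below is an illustration under stated assumptions rather than the authors' implementation: it assumes that the pilot estimator takes the Yule-Walker form $\widetilde{\boldsymbol{a}}_{q}=\widehat{\boldsymbol{\Gamma}}_{q}^{-1}\widehat{\boldsymbol{\gamma}}_{q}$, which for $p=1$ reduces to $\widehat{\gamma}_{q}(1)/\widehat{\gamma}_{q}(0)$, and it uses $\widehat{\gamma}_{q}(\ell)=(T-q)^{-1}\sum_{t=q+\ell+1}^{T}\Delta_{q}Y_{t,T}\Delta_{q}Y_{t-\ell,T}$. The trend $m(u)=\sin(2\pi u)$, the parameter values and the naive comparison estimator are hypothetical choices.

```python
import math
import random

def simulate_trend_plus_ar1(T, a1, seed=0, burn_in=500):
    """Simulate Y_t = m(t/T) + eps_t with m(u) = sin(2 pi u) and AR(1) errors eps_t."""
    rng = random.Random(seed)
    eps = 0.0
    for _ in range(burn_in):          # discard start-up values of the AR(1) recursion
        eps = a1 * eps + rng.gauss(0.0, 1.0)
    y = []
    for t in range(1, T + 1):
        eps = a1 * eps + rng.gauss(0.0, 1.0)
        y.append(math.sin(2.0 * math.pi * t / T) + eps)
    return y

def diff_autocov(y, q, lag):
    """Empirical autocovariance of the q-th differences Delta_q Y_t = Y_t - Y_{t-q}."""
    T = len(y)
    d = [y[t] - y[t - q] for t in range(q, T)]
    return sum(d[i] * d[i - lag] for i in range(lag, len(d))) / (T - q)

def pilot_ar1(y, q):
    """Pilot estimator for p = 1 under the assumed Yule-Walker form."""
    return diff_autocov(y, q, 1) / diff_autocov(y, q, 0)

def naive_ar1(y):
    """Lag-one autocorrelation of the raw data, which ignores the trend."""
    return sum(y[t] * y[t - 1] for t in range(1, len(y))) / sum(v * v for v in y)

a1, T, q = 0.5, 20000, 25
y = simulate_trend_plus_ar1(T, a1)
pilot = pilot_ar1(y, q)
naive = naive_ar1(y)
print(f"true a1 = {a1}, pilot estimate = {pilot:.3f}, naive estimate = {naive:.3f}")
```

Differencing removes the Lipschitz trend up to terms of order $q/T$, whereas the naive estimator absorbs the trend into the autocovariances and is biased upwards in this design.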

We next show that $\widehat{\boldsymbol{a}}_{r}-\boldsymbol{a}=O_{p}(T^{-1/2})$ , where $r\geq 1$ is any fixed integer that does not grow with the sample size $T$ . By definition, it holds that $\widehat{\boldsymbol{a}}_{r}=\widehat{\boldsymbol{\Gamma}}_{r}^{-1}(\widehat{\boldsymbol{\gamma}}_{r}+\widetilde{\nu}^{2}\widetilde{\boldsymbol{c}}_{r})$ . From Lemma S.6, it follows that $\widehat{\boldsymbol{\Gamma}}_{r}^{-1}=\boldsymbol{\Gamma}_{r}^{-1}+O_{p}(T^{-1/2})$ and $\widehat{\boldsymbol{\gamma}}_{r}=\boldsymbol{\gamma}_{r}+O_{p}(T^{-1/2})$ . Moreover, with the help of the fact that $\widetilde{\boldsymbol{a}}_{q}-\boldsymbol{a}=O_{p}(T^{-1/2})$ , it is straightforward to verify that $\widetilde{\nu}^{2}-\nu^{2}=O_{p}(T^{-1/2})$ and $\widetilde{\boldsymbol{c}}_{r}-\boldsymbol{c}_{r}=O_{p}(T^{-1/2})$ . Hence, we arrive at

[TABLE]

where the last equality is due to equation (4.10).

From (S.26), it immediately follows that $\widehat{\boldsymbol{a}}-\boldsymbol{a}=O_{p}(T^{-1/2})$ , which in turn allows us to infer that $\widehat{\nu}^{2}-\nu^{2}=O_{p}(T^{-1/2})$ and $\widehat{\sigma}^{2}=\sigma^{2}+O_{p}(T^{-1/2})$ by straightforward arguments.
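For an AR($p$) error process with coefficients $a_{1},\ldots,a_{p}$ and innovation variance $\nu^{2}$, the long-run variance has the closed form $\sigma^{2}=\nu^{2}/(1-\sum_{j=1}^{p}a_{j})^{2}$, which is why rates for $\widehat{\boldsymbol{a}}$ and $\widehat{\nu}^{2}$ carry over to $\widehat{\sigma}^{2}$. The following minimal sketch checks this identity against the equivalent expression $\nu^{2}(\sum_{k\geq 0}c_{k})^{2}$ in terms of the MA($\infty$) coefficients; the AR(2) parameter values and the truncation length are illustrative assumptions.

```python
def ma_coefficients(ar, n_terms):
    """MA(infinity) coefficients of an AR(p) process: c_0 = 1, c_k = sum_j a_j c_{k-j}."""
    c = [1.0]
    for k in range(1, n_terms):
        c.append(sum(a * c[k - j] for j, a in enumerate(ar, start=1) if k - j >= 0))
    return c

def long_run_variance(ar, nu2):
    """Closed-form long-run variance nu2 / (1 - sum_j a_j)^2 of an AR(p) process."""
    return nu2 / (1.0 - sum(ar)) ** 2

ar_params, nu2 = [0.5, 0.2], 1.0       # a stationary AR(2) example
coeffs = ma_coefficients(ar_params, 2000)

lrv_closed_form = long_run_variance(ar_params, nu2)
lrv_from_ma = nu2 * sum(coeffs) ** 2
print(f"closed form: {lrv_closed_form:.6f}, via MA coefficients: {lrv_from_ma:.6f}")
```

Because the map $(\boldsymbol{a},\nu^{2})\mapsto\nu^{2}/(1-\sum_{j}a_{j})^{2}$ is smooth away from $\sum_{j}a_{j}=1$, a $\sqrt{T}$-rate for the inputs translates into the same rate for the long-run variance estimator.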

S.3 Robustness checks and implementation details for the simulations in Section 5

Robustness checks for Section 5.3

In what follows, we carry out some robustness checks to assess how sensitive the estimators $\widehat{a}$ and $\widehat{\sigma}^{2}$ are to the choice of the tuning parameters $q$ and $\overline{r}$ . To do so, we repeat the simulation exercises of Section 5.3 for different values of $q$ and $\overline{r}$ . In addition, we consider different choices of the tuning parameters $(m_{1},m_{2})$ on which the estimators of Hall and Van Keilegom (2003) depend. As in Section 5.3, we choose $m_{1}$ and $m_{2}$ such that $q$ lies between these values. We thus keep the parameters $q$ and $(m_{1},m_{2})$ roughly comparable.

To start with, we consider the simulation scenarios with a moderate trend ( $s_{\beta}=1$ ). The MSE values of the estimators $\widehat{a}$ , $\widehat{a}_{\text{HvK}}$ , $\widehat{a}_{\text{oracle}}$ and $\widehat{\sigma}^{2}$ , $\widehat{\sigma}^{2}_{\text{HvK}}$ , $\widehat{\sigma}^{2}_{\text{oracle}}$ for these scenarios are presented in Figure 3 of Section 5.3. These MSEs are re-calculated in Figures S.1 and S.2 for a range of different choices of $q$ , $\overline{r}$ and $(m_{1},m_{2})$ . As one can see, the MSEs in the different plots of Figures S.1 and S.2 are very similar. Hence, the MSE results reported in Section 5.3 for the scenarios with a moderate trend appear to be fairly robust to different choices of the tuning parameters. In particular, our estimators $\widehat{a}$ and $\widehat{\sigma}^{2}$ seem to be quite insensitive to the choice of tuning parameters, at least as far as their MSEs are concerned.

We next turn to the simulation designs with a pronounced trend ( $s_{\beta}=10$ ). The MSE values of the estimators in these scenarios are reported in Figure 4 of Section 5.3. As before, we re-calculate these MSEs for different tuning parameters in Figures S.3–S.5. Figure S.4 is a zoomed-in version of Figure S.3 which is added for better visibility. As can be seen, our estimators appear to be barely influenced by the choice of $q$ . However, the MSE values become somewhat larger when $\overline{r}$ is chosen bigger. This is not very surprising: The main reason why the estimator $\widehat{a}$ works well in the presence of a strong trend is that it is only based on differences of small orders. If we increase $\overline{r}$ , we use larger differences to compute $\widehat{a}$ , so that the trend $m$ is no longer eliminated appropriately. This becomes visible in somewhat larger MSE values. Nevertheless, overall, our estimators appear not to be strongly influenced by the choice of tuning parameters (in terms of MSE) as long as these are chosen within reasonable bounds.

Implementation of SiZer in Section 5.2

The SiZer methods for the comparison study in Section 5.2 are implemented as follows:

(a)

Computation of the grid $\mathcal{G}_{T}^{*}$ :

To start with, we compute the variance of $\bar{Y}=T^{-1}\sum_{t=1}^{T}Y_{t,T}$ , which is given by

[TABLE]

Since the autocovariance function $\gamma_{\varepsilon}(\cdot)$ is known by assumption, we can calculate the value of $\textnormal{Var}(\bar{Y})$ by using the formula $\gamma_{\varepsilon}(k)=\nu^{2}a_{1}^{|k|}/(1-a_{1}^{2})$ together with the true parameters $a_{1}$ and $\nu^{2}=\mathbb{E}[\eta_{t}^{2}]$ . We next compute

[TABLE]

which can be interpreted as a measure of information in the data. For each point $(u,h)\in\mathcal{G}_{T}$ from (5.1), we finally calculate the effective sample size for dependent data

[TABLE]

with $K_{h}(v)=h^{-1}K(v/h)$ and set $\mathcal{G}_{T}^{*}=\{(u,h)\in\mathcal{G}_{T}:\text{ESS}^{*}(u,h)\geq 5\}$ .

(b)

Computation of the local linear estimators and their standard deviations:

For each $(u,h)\in\mathcal{G}_{T}^{*}$ , we compute a standard local linear estimator $\widehat{m}^{\prime}_{h}(u)$ of the derivative $m^{\prime}(u)$ together with its standard deviation $\text{sd}(\widehat{m}^{\prime}_{h}(u))$ . The latter is given by $\text{sd}(\widehat{m}^{\prime}_{h}(u))=\{\textnormal{Var}(\widehat{m}^{\prime}_{h}(u))\}^{1/2}$ , where $\textnormal{Var}(\widehat{m}^{\prime}_{h}(u))=e^{\top}Ve$ with $e=(\begin{matrix}0&1\end{matrix})^{\top}$ and

[TABLE]

The matrices $X$ , $W$ and $\Sigma$ are defined as follows: $\Sigma$ is a $T\times T$ matrix with the elements

[TABLE]

$W$ is a $T\times T$ diagonal matrix with the diagonal entries $K_{h}(t/T-u)$ and

[TABLE]

(c)

Computation of the confidence intervals:

For a given confidence level $\alpha$ and for each bandwidth value $h$ with $(u,h)\in\mathcal{G}_{T}^{*}$ , we compute the quantile

[TABLE]

where $\Phi$ is the distribution function of a standard normal random variable, $g$ is the number of locations $u$ in the grid $\mathcal{G}_{T}$ , and the cluster index $\theta$ is defined on p.1519 in Park et al. (2009). The confidence interval of $\widehat{m}^{\prime}_{h}(u)$ is then computed as $[\widehat{m}^{\prime}_{h}(u)-q(h)\,\text{sd}(\widehat{m}^{\prime}_{h}(u)),\widehat{m}^{\prime}_{h}(u)+q(h)\,\text{sd}(\widehat{m}^{\prime}_{h}(u))]$ .
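Two ingredients of steps (a) and (b) can be sketched in code for AR(1) errors with $\gamma_{\varepsilon}(k)=\nu^{2}a_{1}^{|k|}/(1-a_{1}^{2})$. The displayed formulas are not reproduced in this text, so the sketch relies on standard forms that are assumptions here: $\textnormal{Var}(\bar{Y})=T^{-1}\sum_{|k|<T}(1-|k|/T)\gamma_{\varepsilon}(k)$, the sandwich matrix $V=(X^{\top}WX)^{-1}(X^{\top}W\Sigma WX)(X^{\top}WX)^{-1}$, a design matrix $X$ with rows $(1,\frac{t}{T}-u)$ and $\Sigma$ with entries $\gamma_{\varepsilon}(s-t)$. The Epanechnikov kernel and all parameter values are illustrative choices.

```python
import math

def gamma_ar1(k, a1, nu2):
    """Autocovariance of an AR(1) process at lag k."""
    return nu2 * a1 ** abs(k) / (1.0 - a1 ** 2)

def var_sample_mean(T, a1, nu2):
    """Var(Y-bar) = T^{-1} sum_{|k| < T} (1 - |k|/T) gamma(k) (assumed standard form)."""
    return sum((1.0 - abs(k) / T) * gamma_ar1(k, a1, nu2) for k in range(-(T - 1), T)) / T

def epanechnikov(x):
    """Epanechnikov kernel on [-1, 1] (an illustrative choice of K)."""
    return 0.75 * (1.0 - x * x) if abs(x) <= 1.0 else 0.0

def sd_local_linear_derivative(T, u, h, a1, nu2):
    """sd of the local linear derivative estimator, sqrt(e' V e), under the assumed V."""
    x = [t / T - u for t in range(1, T + 1)]
    w = [epanechnikov(xt / h) / h for xt in x]             # K_h(t/T - u)
    active = [t for t in range(T) if w[t] > 0.0]
    # Entries of the 2 x 2 matrix X'WX.
    a00 = sum(w[t] for t in active)
    a01 = sum(w[t] * x[t] for t in active)
    a11 = sum(w[t] * x[t] ** 2 for t in active)
    det = a00 * a11 - a01 ** 2
    # Second row of (X'WX)^{-1}, which picks out the derivative component.
    r0, r1 = -a01 / det, a00 / det
    # Linear weights l_t such that the derivative estimator equals sum_t l_t Y_t.
    l = {t: w[t] * (r0 + r1 * x[t]) for t in active}
    variance = sum(l[s] * l[t] * gamma_ar1(s - t, a1, nu2) for s in active for t in active)
    return math.sqrt(variance)

a1, nu2 = 0.5, 1.0
scaled_var = 2000 * var_sample_mean(2000, a1, nu2)         # approaches nu2 / (1 - a1)^2
sd_dependent = sd_local_linear_derivative(200, 0.5, 0.1, a1, nu2)
sd_independent = sd_local_linear_derivative(200, 0.5, 0.1, 0.0, nu2)
print(f"T * Var(Y-bar) = {scaled_var:.4f}")
print(f"sd (a1 = 0.5) = {sd_dependent:.3f}, sd (a1 = 0) = {sd_independent:.3f}")
```

For large $T$, the quantity $T\,\textnormal{Var}(\bar{Y})$ approaches the long-run variance $\nu^{2}/(1-a_{1})^{2}$, which links the SiZer calibration to the long-run variance studied in Section 4.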

Bibliography

Andrews, D. W. K. (1991). Heteroskedasticity and autocorrelation consistent covariance matrix estimation. Econometrica, 59 817–858.
Benner, T. C. (1999). Central England temperatures: long-term variability and teleconnections. International Journal of Climatology, 19 391–403.
Berkes, I., Liu, W. and Wu, W. B. (2014). Komlós-Major-Tusnády approximation under dependence. Annals of Probability, 42 794–817.
Chaudhuri, P. and Marron, J. S. (1999). SiZer for the exploration of structures in curves. Journal of the American Statistical Association, 94 807–823.
Chaudhuri, P. and Marron, J. S. (2000). Scale space view of curve estimation. Annals of Statistics, 28 408–428.
Chernozhukov, V., Chetverikov, D. and Kato, K. (2014). Gaussian approximation of suprema of empirical processes. Annals of Statistics, 42 1564–1597.
Chernozhukov, V., Chetverikov, D. and Kato, K. (2015). Comparison and anti-concentration bounds for maxima of Gaussian random vectors. Probability Theory and Related Fields, 162 47–70.
Chernozhukov, V., Chetverikov, D. and Kato, K. (2017). Central limit theorems and bootstrap in high dimensions. Annals of Probability, 45 2309–2352.
