Cross validation for locally stationary processes
Stefan Richter, Rainer Dahlhaus

TL;DR
This paper introduces an adaptive cross-validation method for selecting bandwidths in local M-estimators applied to locally stationary processes, demonstrating asymptotic optimality and practical effectiveness through simulations.
Contribution
It presents a novel cross-validation approach for bandwidth selection in locally stationary processes, with proven asymptotic optimality and broad applicability.
Findings
Method achieves asymptotic optimality under mild conditions
Works well even in misspecified models
Applicable to both linear and nonlinear processes
Abstract
We propose an adaptive bandwidth selector via cross validation for local M-estimators in locally stationary processes. We prove asymptotic optimality of the procedure under mild conditions on the underlying parameter curves. The results are applicable to a wide range of locally stationary processes such as linear and nonlinear processes. A simulation study shows that the method also works fairly well in misspecified situations.
1 Introduction
Inference for locally stationary time series models is strongly connected to the estimation of parameter curves which determine the degree of nonstationarity. The estimation of these curves was discussed for several specific models such as tvARMA processes (Dahlhaus and Polonik (2009)), the tvARCH and tvGARCH processes (Fryzlewicz, Sapatinas and Subba Rao (2008), Dahlhaus and Subba Rao (2006), Dahlhaus (2012)), and time-varying random coefficient models (Subba Rao (2006)). Of interest is also a time-varying TAR process which was considered in Zhou and Wu (2009).
Local estimators such as kernel estimators require the selection of a bandwidth. In contrast to nonparametric regression, there exist only very few theoretical results about adaptivity for locally stationary processes. We mention Mallat, Papanicolaou and Zhang (1998) who discussed adaptive covariance estimation for a general class of locally stationary processes. Other results are constructed for specific models and are partly dependent on further tuning parameters: Giraud, Roueff and Sanchez-Perez (2015) discussed online-adaptive forecasting of tvAR processes and Arkoun and Pergamenchtchikov (2008), Arkoun (2011) proposed methods for sequential and minimax-optimal bandwidth selection for tvAR processes of order 1.
In this paper we treat the problem for arbitrary locally stationary time series models determined by a time varying parameter curve. We focus on local M-estimators and use the functional dependence measure introduced in Wu (2005) to formulate mixing conditions. We propose an adaptive bandwidth selection procedure inspired by cross validation in the iid regression model which does not need any tuning parameters. We discuss the theoretic behavior by proving asymptotic optimality of the selector (similar to Härdle and Marron (1985) where nonparametric regression has been treated). We also prove convergence towards the deterministic asymptotic optimal bandwidth.
The technical core of the paper is martingale theory, applied in particular to the score function of the objective function, together with several bounds for moments of quadratic and cubic forms of locally stationary processes. These bounds are needed to show that the expansions of the estimation error converge with suitable rates.
In Section 2 we introduce the locally stationary time series model and formalize the separation of the process into a parametric stationary process and unknown parameter curves. We define local M-estimators and the cross validation procedure. We introduce a Kullback-Leibler type distance measure which can be seen as an analogue to the averaged squared error in nonparametric regression.
In Section 3 we prove asymptotic optimality of the cross validation procedure with respect to the Kullback-Leibler type distance measure and convergence of the cross validation bandwidth towards the deterministic asymptotic optimal bandwidth. The assumptions are stated in terms of a parametric stationary time series model which is connected to the locally stationary process. This allows for easy verification since most of the conditions are standard in M-estimation theory and were already shown for specific stationary models.
In Section 4 we discuss some processes where the main results are applicable. The performance of the method for different models such as tvAR, tvARCH and tvMA is studied in simulations.
In Section 5 a short conclusion is drawn. All proofs are deferred without further reference to Section 6 and the appendix.
2 A cross validation method for locally stationary processes
2.1 The Model
In this paper we discuss adaptive estimation of a multidimensional parameter curve , i.e. we restrict to locally stationary processes , parameterized by curves. As usual we are working in the infill asymptotic framework with rescaled time t/n, where n denotes the number of observations.
Following the original idea of locally stationary processes, for fixed , should locally (i.e., for ) behave like a stationary process . In this paper, we assume that the time dependence of the approximation is solely described by , i.e. , where , is some family of parametric stationary processes. In this paper we will formulate the assumptions in terms of instead of leading to a clear separation between the properties of the model class and the smoothness assumptions on . We formalize this by
Assumption 2.1 (Locally stationary time series model).
Let and . Let , be a triangular array of observations. Suppose that for each , there exists a stationary process , such that for all , uniformly in ,
[TABLE]
with some , and
[TABLE]
Remark 2.2.
(i) We conjecture that the assumption on the existence of all moments of and can be dropped, but the calculations would be very tedious without much additional insight. The number of moments needed for the proofs increases if the Hoelder exponent of the unknown parameter curve decreases.
(ii) In many models, the second condition in (1) basically means that the unknown parameter curve has bounded variation, see also Assumption 3.3.
We first give some examples which are covered by our results. These include in particular several classical parametric time series models where the constant parameters have been replaced by time-dependent parameter curves.
Example 2.3.
(i) the tvARMA() process: Given parameter curves , ) with ,
[TABLE]
(ii) the tvARCH() process (cf. Dahlhaus and Subba Rao (2006)): Given parameter curves (),
[TABLE]
(iii) the tvTAR() process (cf. Zhou and Wu (2009)): Given parameter curves , define
[TABLE]
where and .
As an estimator of we consider local likelihood (or local M-) estimators weighted by kernels, that is
[TABLE]
where
[TABLE]
and with consisting of the observed past, where is a given objective function (localized in by the kernel ). is nonnegative with , and is the bandwidth. To shorten the notation, we write K_{h}(\cdot):=\frac{1}{h}K\big(\frac{\cdot}{h}\big). In practice, is often chosen to be the negative logarithm of the infinite past likelihood of given ,
[TABLE]
assuming that . In this paper, we allow for general objective functions which have to obey some smoothness conditions (see Assumption 3.3).
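As a rough illustration of the local objective, the following Python sketch evaluates a kernel-weighted Gaussian objective for a tvAR(1) model and minimizes it over a grid of candidate values. The sign convention X_t + alpha(t/n) X_{t-1} = eps_t follows Remark 4.2. The Epanechnikov kernel rescaled to [-1/2, 1/2], the curve alpha(u) = -0.6 + 0.4u, the fixed sigma = 1 and all function names are assumptions made for this example and are not taken from the paper.

```python
import math
import random


def epanechnikov(x):
    """Epanechnikov kernel rescaled to support [-1/2, 1/2] (assumed); integrates to 1."""
    return 1.5 * (1.0 - (2.0 * x) ** 2) if abs(x) <= 0.5 else 0.0


def simulate_tvar1(n, alpha_curve, seed=0):
    """Simulate X_t + alpha(t/n) X_{t-1} = eps_t with standard Gaussian eps_t."""
    rng = random.Random(seed)
    x, prev = [], 0.0
    for t in range(1, n + 1):
        prev = -alpha_curve(t / n) * prev + rng.gauss(0.0, 1.0)
        x.append(prev)
    return x


def gaussian_ar1_loss(x_t, x_prev, alpha, sigma=1.0):
    """Negative Gaussian log-likelihood of one observation given its past value."""
    resid = x_t + alpha * x_prev
    return 0.5 * math.log(2.0 * math.pi * sigma ** 2) + resid ** 2 / (2.0 * sigma ** 2)


def local_objective(x, u, h, alpha):
    """(1/n) * sum_t K_h(t/n - u) * loss(X_t, X_{t-1}, alpha)."""
    n = len(x)
    total = 0.0
    for i in range(1, n):  # x[i] is the observation at time i + 1
        w = epanechnikov(((i + 1) / n - u) / h) / h
        if w > 0.0:
            total += w * gaussian_ar1_loss(x[i], x[i - 1], alpha)
    return total / n


def local_estimate(x, u, h, grid):
    """Grid minimizer of the local objective at rescaled time u."""
    return min(grid, key=lambda a: local_objective(x, u, h, a))


x = simulate_tvar1(2000, lambda u: -0.6 + 0.4 * u, seed=1)
grid = [k / 100.0 for k in range(-90, 91)]
alpha_hat = local_estimate(x, u=0.5, h=0.3, grid=grid)
print(alpha_hat)  # should lie near alpha(0.5) = -0.4
```

A grid search is used only to keep the sketch free of dependencies; for this particular loss the minimizer is also available in closed form (weighted least squares).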
2.2 Distance measures
Define . In the following, we will use to denote the derivative with respect to , and denotes the transpose of a vector or matrix . As global distance measures we use the averaged and the integrated squared error (ASE/ISE) weighted by the Fisher information
[TABLE]
and the misspecified Fisher information V(\theta):=\mathbb{E}\nabla^{2}\ell(\tilde{Y}_{0}(a),\theta)\big|_{a=\theta} of the corresponding stationary approximation. In addition, the weight function is needed to exclude boundary effects. Since the proof is the same for other weights, Assumption 3.4 allows for more general weights.
More precisely we set (with for and )
[TABLE]
and
[TABLE]
It can be shown that and are for an approximation of the Kullback-Leibler divergence between models with parameter curves and .
In Theorem 3.8 we will prove that under suitable conditions, can be approximated uniformly in by a deterministic distance measure , which has a unique minimizer . can be seen as the (deterministic) optimal bandwidth.
2.3 The cross validation method
We now choose the bandwidth by a generalized cross validation method. We define a ’quasi-leave-one-out’ local likelihood
[TABLE]
and a ’quasi-leave-one-out’ estimator of by
[TABLE]
Here, ’leave-one-out’ does not mean that we ignore the -th observation of the process , but that we ignore the term which is contributed by the likelihood at time step . Because of that, we refer to the estimator as a quasi-leave-one-out method.
We then choose via minimizing the cross validation functional
[TABLE]
It is important to note that such a minimizer of does not need to exist, because cannot be shown to be continuous. When varies it is possible that the location of the minimum of changes and therefore makes a jump. For the mathematical considerations we therefore choose some such that
[TABLE]
where is a suitable subinterval of , see Assumption 3.4, which covers all relevant values of .
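The selection procedure described above can be sketched as follows for a tvAR(1) model. The displays defining the quasi-leave-one-out likelihood and the cross validation functional are not reproduced in this text, so the sketch assumes that the leave-one-out objective drops only the summand of time step t and that the cross validation functional averages the weighted loss at the resulting estimates. The squared-residual loss (a Gaussian likelihood with sigma fixed to 1), the indicator weight on [0.1, 0.9], the bandwidth grid and all function names are hypothetical choices for illustration.

```python
import math
import random


def epanechnikov(x):
    """Epanechnikov kernel rescaled to support [-1/2, 1/2] (assumed)."""
    return 1.5 * (1.0 - (2.0 * x) ** 2) if abs(x) <= 0.5 else 0.0


def simulate_tvar1(n, alpha_curve, seed=0):
    """Simulate X_t + alpha(t/n) X_{t-1} = eps_t with standard Gaussian eps_t."""
    rng = random.Random(seed)
    x, prev = [], 0.0
    for t in range(1, n + 1):
        prev = -alpha_curve(t / n) * prev + rng.gauss(0.0, 1.0)
        x.append(prev)
    return x


def loo_alpha(x, t, h):
    """Quasi-leave-one-out weighted least-squares estimate of alpha at the time of x[t]."""
    n = len(x)
    half = int(0.5 * h * n)  # kernel support is |s - t| <= h * n / 2
    num = den = 0.0
    for s in range(max(1, t - half), min(n - 1, t + half) + 1):
        if s == t:
            continue  # drop the objective term of time step t, not the observation itself
        w = epanechnikov((s - t) / (n * h))
        num += w * x[s] * x[s - 1]
        den += w * x[s - 1] ** 2
    return -num / den if den > 0.0 else 0.0


def cv_score(x, h, lo=0.1, hi=0.9):
    """Average squared one-step residual at the quasi-leave-one-out estimates."""
    n = len(x)
    total = 0.0
    for t in range(1, n):
        if lo <= (t + 1) / n <= hi:  # indicator weight excludes the boundary
            resid = x[t] + loo_alpha(x, t, h) * x[t - 1]
            total += resid ** 2
    return total / n


def cv_bandwidth(x, grid):
    """Return the grid bandwidth with the smallest cross validation score."""
    scores = {h: cv_score(x, h) for h in grid}
    return min(scores, key=scores.get), scores


x = simulate_tvar1(400, lambda u: 0.8 * math.sin(2.0 * math.pi * u), seed=2)
h_grid = [0.05, 0.1, 0.2, 0.4, 0.8]
h_cv, scores = cv_bandwidth(x, h_grid)
print(h_cv, scores[h_cv])
```

Note that x[t] still enters the estimate through the summand of time step t + 1, where it plays the role of the observed past; this is the sense in which the method is only "quasi" leave-one-out. Minimizing over a finite grid also sidesteps the existence issue for the minimizer mentioned above.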
3 Main results
In this section we present our main results concerning the bandwidth chosen by cross validation. We prove in Theorem 3.6 that is asymptotically optimal with respect to , i.e.
[TABLE]
and in Theorem 3.9 that is consistent in the sense that a.s., where is the deterministic optimal bandwidth defined in (21). Recall that can be interpreted as a Kullback-Leibler-type distance between the two time series models associated to and . Thus, the cross validation procedure yields an estimator of such that the distributions of the associated time series coincide best.
To prove asymptotic results, we have to state some mixing type conditions on the underlying process . For this, we use the functional dependence measure introduced in Wu (2005). Let , be a sequence of i.i.d. random variables. For let be the shift process and , where is a random variable which has the same distribution as and is independent of all , . For a stationary process with deterministic define and the functional dependence measure
[TABLE]
Assumption 3.1 (Dependence assumption).
Suppose that for each , there exists a representation with some measurable and for some .
Note that we only need dependence conditions on the stationary approximations and no further assumption on .
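To make the functional dependence measure concrete, the following sketch estimates it by Monte Carlo for a stationary AR(1) recursion X_t = a X_{t-1} + zeta_t. The coupled process replaces zeta_0 by an independent copy and shares all other innovations, so that X_k - X_k^* = a^k (zeta_0 - zeta_0^*); for q = 2 and standard Gaussian innovations the measure therefore equals sqrt(2) |a|^k. The AR(1) example, the value a = 0.5 and the function names are assumptions chosen for illustration.

```python
import math
import random


def coupled_difference(a, k, rng, burn_in=50):
    """Return X_k - X_k^* for X_t = a X_{t-1} + zeta_t, where X^* uses an
    independent copy of zeta_0 and shares all other innovations."""
    x = x_star = 0.0
    for t in range(-burn_in, k + 1):
        z = rng.gauss(0.0, 1.0)
        z_star = rng.gauss(0.0, 1.0) if t == 0 else z
        x = a * x + z
        x_star = a * x_star + z_star
    return x - x_star


def dependence_measure(a, k, q=2, reps=2000, seed=0):
    """Monte Carlo estimate of the L^q norm of X_k - X_k^*."""
    rng = random.Random(seed)
    moment = sum(abs(coupled_difference(a, k, rng)) ** q for _ in range(reps)) / reps
    return moment ** (1.0 / q)


a = 0.5
for k in range(4):
    estimate = dependence_measure(a, k)
    exact = math.sqrt(2.0) * abs(a) ** k  # closed form for q = 2, Gaussian innovations
    print(k, round(estimate, 3), round(exact, 3))
```

In this example the measure decays geometrically in k and is therefore summable.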
To state smoothness conditions on the objective function in a concise way, we introduce the class of Lipschitz-continuous functions from to where we allow the Lipschitz constant to depend on the location at most polynomially.
Definition 3.2 (The class ).
We say that a function is in the class if , , and for all , :
[TABLE]
where and .
In Assumption 3.3, we pose some standard conditions on the likelihood function which ensure the validity of basic results (such as Taylor expansions) from maximum likelihood theory. Again all conditions are formulated in terms of the stationary process and therefore easily verifiable due to known results on stationary time series.
Assumption 3.3.
Suppose that is three times differentiable with respect to , and
(1) is compact. For all , lies in the interior of and is Hoelder continuous with exponent and has component-wise bounded variation .
(2) is the unique minimizer of .
(3) the minimal eigenvalue of is bounded from below by some constant uniformly in .
(4) \nabla\ell(\tilde{Y}_{t}(\theta^{\prime}),\theta)\big|_{\theta^{\prime}=\theta} is a martingale difference sequence with respect to in each component.
(5) each component of lies in for some , where for some .
Finally, let us formalize the conditions on the set of bandwidths , the localizing kernel appearing in the estimation procedure and the weight function which arises in the cross validation functional and the distance measures.
Assumption 3.4.
For let , where , for some constants . Suppose that
(1) the kernel has compact support , fulfills and is Lipschitz continuous with Lipschitz constant .
(2) the weight function is bounded by , has bounded variation uniformly in and support .
For some with support of Lebesgue measure greater than zero, assume that .
Furthermore, suppose that there exists some such that
[TABLE]
Remark 3.5.
Note that all conditions in (2) are fulfilled by the indicator or with some fixed .
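The kernel conditions in Assumption 3.4(1) can be checked numerically for a concrete kernel. The sketch below uses the Epanechnikov kernel rescaled to the support [-1/2, 1/2]; this support and the normalization to unit mass are assumptions, since the corresponding symbols are not reproduced in the text above. It verifies that the rescaled kernel K_h has unit mass and vanishes outside [-h/2, h/2], and it computes difference quotients, which for this kernel are bounded by 6.

```python
def epanechnikov(x):
    """Epanechnikov kernel rescaled to support [-1/2, 1/2] (assumed)."""
    return 1.5 * (1.0 - (2.0 * x) ** 2) if abs(x) <= 0.5 else 0.0


def k_h(x, h):
    """Rescaled kernel K_h(x) = K(x / h) / h."""
    return epanechnikov(x / h) / h


def integrate(f, lo, hi, steps=20000):
    """Midpoint rule on [lo, hi]."""
    width = (hi - lo) / steps
    return width * sum(f(lo + (i + 0.5) * width) for i in range(steps))


def max_slope(f, lo, hi, steps=20000):
    """Largest absolute difference quotient on a regular grid,
    a lower bound for the Lipschitz constant of f on [lo, hi]."""
    width = (hi - lo) / steps
    return max(abs(f(lo + (i + 1) * width) - f(lo + i * width)) / width
               for i in range(steps))


h = 0.2
mass = integrate(lambda x: k_h(x, h), -1.0, 1.0)
slope = max_slope(epanechnikov, -1.0, 1.0)
print(mass, slope)
```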
We now show that the cross validation bandwidth is asymptotically optimal.
Theorem 3.6 (Asymptotic optimality of cross validation).
Under Assumptions 2.1, 3.1, 3.3 and 3.4 the bandwidth chosen by cross validation is asymptotically optimal in the sense that
[TABLE]
where is or .
Under stronger smoothness assumptions which allow a typical bias expansion up to the second derivative, we will show (in Theorem 3.9 below) that is asymptotically equivalent to the asymptotically optimal theoretical bandwidth (ao-bandwidth for short). The additional smoothness assumptions are natural specifications of Assumptions 2.1 and 3.3.
Assumption 3.7 (Bias expansion conditions).
Suppose that
(1) is symmetric and is twice continuously differentiable,
(2) for all , , is twice partially differentiable and for all with absolutely summable sequences .
(3) is twice continuously differentiable almost surely. For all , and are finite.
We know from standard asymptotics that
[TABLE]
which motivates the following approximations to and :
[TABLE]
We now set
[TABLE]
If is twice continuously differentiable and some additional smoothness assumptions on the approximating stationary process hold (see Assumption 3.7), then Proposition 6.4 together with Assumption 3.4(2) implies the usual bias-variance decomposition for :
[TABLE]
uniformly in , where , and
[TABLE]
leading to the definition of the deterministic bias-variance decomposition and the resulting asymptotically optimal bandwidth in the following two theorems.
Theorem 3.8 (Approximation of distance measures).
Let Assumptions 2.1, 3.1, 3.3, 3.4 and 3.7 hold. Define
[TABLE]
If the bias is not degenerate, i.e. , then it holds that
[TABLE]
where is or .
Theorem 3.9 (Consistency of the cross validation bandwidth).
Let Assumptions 2.1, 3.1, 3.3, 3.4 and 3.7 hold. Then the bandwidth chosen by cross validation fulfils
[TABLE]
where
[TABLE]
is the unique minimizer of .
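The trade-off behind the ao-bandwidth can be illustrated numerically. The displays with the exact constants are not reproduced in this text, so the sketch below assumes the generic form v0/(n h) + b0 h^4 for the deterministic distance, with hypothetical positive constants v0 (variance part) and b0 (squared-bias part). Setting the derivative -v0/(n h^2) + 4 b0 h^3 to zero gives the closed-form minimizer h = (v0 / (4 b0 n))^(1/5), which the code compares with a grid search.

```python
def deterministic_distance(h, n, v0, b0):
    """Generic trade-off: variance term v0/(n h) plus squared-bias term b0 h^4."""
    return v0 / (n * h) + b0 * h ** 4


def ao_bandwidth(n, v0, b0):
    """Closed-form minimizer of v0/(n h) + b0 h^4 over h > 0."""
    return (v0 / (4.0 * b0 * n)) ** 0.2


n, v0, b0 = 1000, 1.5, 20.0  # hypothetical constants, not taken from the paper
grid = [k / 10000.0 for k in range(1, 10001)]
h_grid = min(grid, key=lambda h: deterministic_distance(h, n, v0, b0))
h_closed = ao_bandwidth(n, v0, b0)
print(h_grid, h_closed)
```

Under this assumed form the optimal bandwidth shrinks at the rate n^(-1/5): multiplying n by 32 halves it.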
3.1 Proofs
Here we present the structure of the proofs of Theorems 3.6, 3.8 and 3.9. The technical details including the proofs of the lemmata are postponed to the appendix. The main tool for the proofs is a general bound for moments on quadratic and cubic forms of functions of locally stationary processes (cf. Proposition C.1). From now on, we assume that Assumptions 2.1, 3.1, 3.3 and 3.4 hold. The following lemma shows that the approximated distances , are close to .
Lemma 3.10.
We have almost surely
[TABLE]
As a consequence of Lemma 3.10 also the distances are close to :
Corollary 3.11.
We have almost surely
[TABLE]
To get a connection between the distance measure and the cross validation functional , we define
[TABLE]
The next two lemmata show that is close both to and . Lemma 3.13 can be viewed as the core of the proof since there the main assumptions come into play, such as the martingale property of \nabla\ell(\tilde{Y}_{t}(\theta^{\prime}),\theta)\big|_{\theta^{\prime}=\theta}, which is used for normalization, and the differentiability properties of , which are used for Taylor expansions of third order.
Lemma 3.12.
We have almost surely
[TABLE]
Lemma 3.13.
We have almost surely
[TABLE]
With the help of these results, we can now prove Theorems 3.6, 3.8, 3.9:
Proof of Theorem 3.6.
An immediate consequence of Lemma 3.13 is (use for positive numbers )
[TABLE]
almost surely. Now, using Corollary 3.11 and Lemma 3.12 it is easy to see that
[TABLE]
Choosing and such that
[TABLE]
yields
[TABLE]
almost surely. Because of Corollary 3.11 and (17) we have a.s. Thus,
[TABLE]
from which
[TABLE]
follows. The same can be done for . ∎
Proof of Theorem 3.8.
Because of and (17), we have
[TABLE]
Application of Corollary 3.11 finishes the proof. ∎
Proof of Theorem 3.9.
As in the proof of Theorem 3.8, we show (23). This result in combination with Lemma 3.12 and Lemma 3.13 gives almost surely
[TABLE]
Using the same methods as in the proof of Theorem 3.6, we have almost surely
[TABLE]
The structure of implies a.s. ∎
4 Examples and Simulations
4.1 Examples
Assumptions 2.1, 3.1, 3.3 and 3.7 are fulfilled for a large class of locally stationary time series models. Here, we discuss how the conditions transform in the case of some special linear and recursively defined time series. More general statements can be found in the technical supplement, see Proposition D.1 and D.2 therein.
Recall that , is a sequence of i.i.d. real random variables. We will use a Gaussian likelihood for defined in (4), but allow for a non Gaussian distribution of .
An important special case of locally stationary linear processes is given by tvARMA processes, see also Proposition 2.4 in Dahlhaus and Polonik (2009). Since in this case the linear filter and the spectral density f_{\theta}(\lambda)=\frac{\sigma^{2}}{2\pi}\cdot\big|\frac{\beta(e^{i\lambda})}{\alpha(e^{i\lambda})}\big|^{2} have a simple form, the conditions in Proposition D.1 are obviously fulfilled. The likelihood (4) takes the form
[TABLE]
Example 4.1 (tvARMA() process).
Assume that are i.i.d. with existing moments of all order. Suppose that and . Let Assumption 3.3(1) hold. Assume that obeys
[TABLE]
where . Define , , and let be an arbitrary compact subset of
[TABLE]
Then Assumptions 2.1, 3.1, 3.3 are fulfilled for chosen as in (24). If additionally Assumption 3.7(1) is fulfilled, then Assumption 3.7 is fulfilled. It holds that .
Remark 4.2 (tvAR(r) processes).
In the special case of tvAR(r) processes, closed forms for the estimators based on \ell(z,\theta)=\frac{1}{2}\log(2\pi\sigma^{2})+\frac{1}{2\sigma^{2}}\big(z_{1}+\sum_{j=1}^{r}\alpha_{j}z_{j+1}\big)^{2} are available: and \hat{\sigma}_{h}(u)^{2}=\frac{1}{n}\sum_{t=r+1}^{n}\big(X_{t,n}+\sum_{j=1}^{r}\hat{\alpha}_{j}(u)X_{t-j,n}\big)^{2}, where and
[TABLE]
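A closed-form local estimator of this kind can be sketched for r = 2 as kernel-weighted least squares: with the loss from Remark 4.2, the estimate of (alpha_1, alpha_2) solves a 2 x 2 system of weighted normal equations. The exact normalization of the matrices in the omitted display is not reproduced here; the sketch normalizes the residual variance by the sum of the kernel weights, which is an assumption. The curves, sample size and function names are likewise hypothetical.

```python
import math
import random


def epanechnikov(x):
    """Epanechnikov kernel rescaled to support [-1/2, 1/2] (assumed)."""
    return 1.5 * (1.0 - (2.0 * x) ** 2) if abs(x) <= 0.5 else 0.0


def simulate_tvar2(n, a1, a2, seed=0):
    """Simulate X_t + a1(t/n) X_{t-1} + a2(t/n) X_{t-2} = eps_t, eps_t standard Gaussian."""
    rng = random.Random(seed)
    x = [0.0, 0.0]  # two start-up values, discarded below
    for t in range(1, n + 1):
        u = t / n
        x.append(-a1(u) * x[-1] - a2(u) * x[-2] + rng.gauss(0.0, 1.0))
    return x[2:]


def local_tvar2(x, u, h):
    """Kernel-weighted least squares for (alpha_1, alpha_2) and a weighted residual scale."""
    n = len(x)
    g11 = g12 = g22 = c1 = c2 = 0.0
    weights = []
    for i in range(2, n):  # x[i] is the observation at time i + 1
        w = epanechnikov(((i + 1) / n - u) / h)
        if w > 0.0:
            weights.append((i, w))
            g11 += w * x[i - 1] ** 2
            g12 += w * x[i - 1] * x[i - 2]
            g22 += w * x[i - 2] ** 2
            c1 += w * x[i] * x[i - 1]
            c2 += w * x[i] * x[i - 2]
    det = g11 * g22 - g12 ** 2
    # Solve Gamma * alpha = -gamma by Cramer's rule.
    alpha1 = -(c1 * g22 - c2 * g12) / det
    alpha2 = -(g11 * c2 - g12 * c1) / det
    rss = sum(w * (x[i] + alpha1 * x[i - 1] + alpha2 * x[i - 2]) ** 2 for i, w in weights)
    sigma = math.sqrt(rss / sum(w for _, w in weights))
    return alpha1, alpha2, sigma


x = simulate_tvar2(4000, lambda u: -0.4 - 0.2 * u, lambda u: 0.3, seed=3)
alpha1_hat, alpha2_hat, sigma_hat = local_tvar2(x, u=0.5, h=0.3)
print(alpha1_hat, alpha2_hat, sigma_hat)  # true values at u = 0.5: -0.5, 0.3, 1.0
```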
We now discuss recursively defined nonlinear time series models with additive innovations . Let us fix some and define the vectors of the last lags , as the vector of the past values of the locally stationary and the stationary time series, respectively. Many popular locally stationary models assume that the conditional mean and / or variance is a linear combination of unknown parameter curves and functions of , i.e.
[TABLE]
with some measurable , . In this case, the likelihood (4) takes the form
[TABLE]
In the first example we discuss conditional mean processes. This class covers the tvAR- as well as the tvTAR case.
Example 4.3 (Conditional mean processes).
Assume that , are i.i.d. and have all moments with and . Suppose that Assumption 3.3(1) is fulfilled. Assume that obeys
[TABLE]
Here, and is a function which fulfills
(a) with some (),
(b) are linearly independent in for all .
Define with some and .
Then Assumptions 2.1, 3.1, 3.3 are fulfilled for chosen as in (25). Furthermore, with it holds that
[TABLE]
The next example discusses conditional variance processes, which cover, for instance, the tvARCH process.
Example 4.4 (Conditional variance processes).
Assume that , are i.i.d. with and and are almost surely bounded by some . Suppose that Assumption 3.3(1) is fulfilled. Assume that obeys
[TABLE]
Here, and is a function which fulfills
(a) with some (). There exists such that for all .
(b) are linearly independent in for all .
Define with some and .
Then Assumptions 2.1, 3.1, 3.3 are fulfilled for chosen as in (25). It holds that
[TABLE]
A simulation study. Here, we study the behavior of the presented cross validation algorithm for different time series models. We assume that is standard Gaussian distributed, and consider
(a) tvAR(1) processes , with and .
(b) tvMA(1) processes , with and .
(c) tvARCH(1) processes , with and .
(d) tvTAR(1) processes , with and and , for real numbers .
We performed a Monte Carlo study by generating in each case realizations of time series with length . For estimation, we used the weight function which excludes most of the boundary effects and the Epanechnikov kernel . We do not use since this weight function has poor finite sample properties for large .
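Paths from the four model classes listed above can be generated along the following lines. The model equations and parameter curves used in the study are not reproduced in this text, so the recursions in the comments are generic first-order forms and the curves are hypothetical placeholders; in particular, x^+ = max(x, 0) and x^- = max(-x, 0) is an assumed convention for the tvTAR model.

```python
import math
import random


def simulate(model, n, curve, seed=0):
    """Simulate one path of length n from a first-order time-varying model.

    model: 'tvAR', 'tvMA', 'tvARCH' or 'tvTAR'.
    curve: maps rescaled time u in (0, 1] to a tuple of two parameter values.
    """
    rng = random.Random(seed)
    x, prev_x, prev_eps = [], 0.0, rng.gauss(0.0, 1.0)
    for t in range(1, n + 1):
        eps = rng.gauss(0.0, 1.0)
        p1, p2 = curve(t / n)
        if model == "tvAR":      # X_t = p1 X_{t-1} + p2 eps_t
            cur = p1 * prev_x + p2 * eps
        elif model == "tvMA":    # X_t = p2 (eps_t + p1 eps_{t-1})
            cur = p2 * (eps + p1 * prev_eps)
        elif model == "tvARCH":  # X_t = sqrt(p1 + p2 X_{t-1}^2) eps_t
            cur = math.sqrt(p1 + p2 * prev_x ** 2) * eps
        elif model == "tvTAR":   # X_t = p1 X_{t-1}^+ + p2 X_{t-1}^- + eps_t
            cur = p1 * max(prev_x, 0.0) + p2 * max(-prev_x, 0.0) + eps
        else:
            raise ValueError(model)
        x.append(cur)
        prev_x, prev_eps = cur, eps
    return x


# Hypothetical parameter curves, chosen only so that each model stays stable.
curves = {
    "tvAR": lambda u: (0.9 * math.sin(2.0 * math.pi * u), 1.0),
    "tvMA": lambda u: (0.9 * math.sin(2.0 * math.pi * u), 1.0),
    "tvARCH": lambda u: (1.0, 0.4 + 0.3 * math.sin(2.0 * math.pi * u)),
    "tvTAR": lambda u: (0.6 - 0.3 * u, -0.2 - 0.3 * u),
}
paths = {name: simulate(name, 500, curve, seed=4) for name, curve in curves.items()}
for name, path in paths.items():
    print(name, len(path), round(sum(v * v for v in path) / len(path), 3))
```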
We chose and calculated the cross-validation bandwidth , the ao-bandwidth from Theorem 3.9 (for models (a)-(c), model (d) does not satisfy the smoothness conditions) and the optimal theoretical bandwidth
[TABLE]
Note that depend on the current realization while is deterministic and fixed. and depend on the unknown true curve and are unavailable in practice.
Figure 1 shows the results for the four models respectively. The histograms show the chosen cross validation bandwidths , the bandwidth is marked via a black vertical line. The boxplots show the achieved values of for the different selectors (labeled as ’CV’, ’Plugin’ and ’Optimal’). Each box contains while the whiskers contain of the values of . It can be seen that the cross validation procedure works well even for the case of a time series length of only if the model is recursively defined (i.e., (a), (c), (d)) while the method needs a larger sample size such that the bandwidths accumulate around in the tvMA case (b). For the models (a),(d) we observe that the distances attained by the cross validation approach are nearly as good as the distances obtained by the optimal selector, which is remarkable. For the models (b) and (c) the values of associated to have a higher variance. This can be explained by the higher variance of the maximum likelihood estimators in these models. In all cases, the distances produced by the estimator based on the cross validation procedure are of course greater on average, but they still look quite satisfactory in our opinion.
Model misspecifications: We observed in simulations that the performance of the cross validation procedure is robust against the distribution of , leading to similar results even if is uniformly, exponentially or Pareto distributed (meaning that the moment conditions from Assumption 2.1 are violated).
Due to the fact that our cross validation method is a natural generalization of the version for iid regression, it works well even if the underlying model itself is misspecified. In the following we estimate parameters with a Gaussian likelihood which assumes that the time series model follows a tvAR(1) model , but in fact the underlying model is either tvMA (b) or tvARCH (c). The cross validation method then tries to estimate the minimizer of , i.e. and \sigma^{ms}(u)=\big(\frac{c(0)^{2}-c(1)^{2}}{c(0)}\big)^{1/2} with the covariances :
[TABLE]
To compare the distances, we use with from the tvAR(1) model. The simulations are performed in the same way as for the correctly specified case above. Figure 2 shows that even in the misspecified case the bandwidth selector produces reasonable estimators, which are comparable with the optimal bandwidth choice in the case of tvMA estimators and still satisfactory in the tvARCH case (note that a lot of information is lost due to the fact that in this case).
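The pseudo-true parameters targeted in the misspecified case can be computed directly from the covariances. The sketch below assumes the sign convention of Remark 4.2 (residual X_t + alpha X_{t-1}), under which the expected Gaussian AR(1) loss is minimized at alpha = -c(1)/c(0) and at the value of sigma given above; with the opposite sign convention the sign of alpha flips. The MA(1) coefficient b = 0.6 and the function names are hypothetical.

```python
import math


def expected_gaussian_ar1_loss(alpha, sigma, c0, c1):
    """E[0.5 log(2 pi sigma^2) + (X_t + alpha X_{t-1})^2 / (2 sigma^2)] for a
    centered stationary process with covariances c0 = c(0) and c1 = c(1)."""
    mse = c0 + 2.0 * alpha * c1 + alpha ** 2 * c0
    return 0.5 * math.log(2.0 * math.pi * sigma ** 2) + mse / (2.0 * sigma ** 2)


def pseudo_true_ar1(c0, c1):
    """Minimizer (alpha, sigma) of the expected Gaussian AR(1) loss."""
    alpha_ms = -c1 / c0
    sigma_ms = math.sqrt((c0 ** 2 - c1 ** 2) / c0)
    return alpha_ms, sigma_ms


def ma1_covariances(b, s=1.0):
    """c(0), c(1) of X_t = s (eps_t + b eps_{t-1}) with unit-variance eps_t."""
    return s ** 2 * (1.0 + b ** 2), s ** 2 * b


c0, c1 = ma1_covariances(b=0.6)
alpha_ms, sigma_ms = pseudo_true_ar1(c0, c1)
best = expected_gaussian_ar1_loss(alpha_ms, sigma_ms, c0, c1)
print(alpha_ms, sigma_ms, best)
```

For an ARCH-type process with finite variance the observations are uncorrelated, so c(1) = 0 and the pseudo-true AR coefficient is 0; the fitted tvAR(1) model then only captures the variance.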
5 Concluding remarks
In this paper we have introduced a data adaptive bandwidth selector via cross validation which is applicable for a large class of locally stationary processes. An important property of the method is the fact that it does not involve any tuning parameters.
In simulations we have seen that the proposed cross validation method yields nearly optimal bandwidth choices with respect to a Kullback-Leibler type distance measure in the case of correctly specified models and still leads to satisfactory results in the case of model misspecification. It remains an open question whether a similar cross validation procedure can be defined which is asymptotically optimal with respect to a simple quadratic distance measure (i.e. without a weighting matrix). Such a procedure would lead to estimates of which optimize not the prediction properties of the associated model but the estimation quality of the parameter curve itself.
We mention that it is not hard to generalize the proposed method and the proofs to multidimensional time series which may be of interest in many practical applications.
An interesting open problem is the adaptive estimation in time series models with several parameter curves coming from different smoothness classes, in particular since these curves are not observed separately but via a single time series.
Let us point out the fact that cross validation procedures in general are not stable if applied locally. Thus it remains an open question to find a locally adaptive bandwidth selector.
In nonparametric regression there exist also several results on the rate of convergence of . Based on the simulations, we conjecture that (with from (21)) is asymptotically normal if is twice continuously differentiable, like Härdle, Hall and Marron (1988) showed in the iid regression case. This raises the question of whether there are improved cross validation methods, like those of Chiu (1991) (via Fourier transform) or Hall, Marron and Park (1992) (via presmoothing) in the iid kernel density estimation case, that attain the optimal rate of if further smoothness assumptions on are supposed.
6 Proofs
In the following, we will use the abbreviation for real-valued random variables (or random vectors) . Note that by Assumption 3.3, we have that the minimal eigenvalue of is bounded from below by some , leading to invertibility of and by equivalence of the norms on to a uniform upper bound
[TABLE]
Additionally define . In the following, we will use the abbreviation and . Recall that . Define
[TABLE]
and by omitting the -th summand, .
6.1 Standard approximations
Lemma 6.1 (The stationary crop approximation).
Let . Put
[TABLE]
Suppose that Assumption 2.1 holds. Assume that for some . Then for all there exists a constant not depending on such that
[TABLE]
The proof follows from Hoelder’s inequality. Details can be found in the appendix.
Lemma 6.2 (Weak bias approximation).
Let . Define
[TABLE]
Suppose that Assumption 2.1 holds. Assume that is Hoelder continuous with exponent in each component. Then there exist constants such that for all :
[TABLE]
Proof of Lemma 6.2.
Since is Hoelder continuous in each component, there exists such that for all and . Thus . By Hoelder’s inequality,
[TABLE]
Since \frac{1}{n}\sum_{t=1}^{n}\big|K_{h}\big(\frac{t}{n}-u\big)\big|\cdot\|\tilde{X}_{0}(t/n)-\tilde{X}_{0}(u)\|_{2M}\leq|K|_{\infty}C_{A}pL\cdot h^{\beta\wedge 1}, we obtain the result. ∎
6.2 The bias-variance decomposition of
The proof of the next Lemma 6.3 is purely analytical and is deferred to the appendix.
Lemma 6.3 (Expansion of expectations).
Let Assumptions 2.1, 3.3 and 3.7 hold. Assume that is twice continuously partially differentiable and for each component of , where are absolutely summable sequences.
Furthermore assume that and for all . Then it holds for and that
[TABLE]
where R(u,\xi):=\int_{0}^{1}\big\{\mathbb{E}\partial_{u}^{2}g(\tilde{Y}_{0}(u+\xi s))-\mathbb{E}\partial_{u}^{2}g(\tilde{Y}_{0}(u))\big\}\,\mathrm{d}s=o(1) (\xi\to 0) uniformly in , and all expressions exist.
We now summarize the results about the bias-variance decomposition of . The following proposition is obtained as a corollary of Lemma 6.1, Lemma 6.2 and Lemma 6.3. Details of the proof are deferred to the appendix.
Proposition 6.4 (Bias-Variance decomposition of ).
Let Assumptions 2.1, 3.1, 3.3 and 3.4 hold.
(i) Decomposition: Let denote the trace of a matrix, ,
[TABLE]
Set , and . Then it holds that
[TABLE]
(ii) Put and define the discrete bias terms and . Then it holds for that
[TABLE]
(iii) Bias expansion: Suppose additionally that Assumption 3.7 holds. Then it holds uniformly in that
[TABLE]
6.3 Uniform convergence results and moment inequalities for the local likelihood
In this section we show the uniform convergence of empirical processes of towards their expectations. We give convergence rates and prove the uniform consistency (w.r.t. and ) of the maximum likelihood estimator towards . For some and , define
[TABLE]
The proof of the next Lemma 6.5 as well as the proofs of Lemmas 3.10 and 6.11 use Lemma C.1, which is deferred to the appendix due to its complexity. It allows us to bound moments of linear, quadratic and cubic forms of functions of locally stationary processes. For instance we obtain bounds \|\sum_{t=1}^{n}a_{t}V_{t,n}^{(1)}\|_{q}\leq\tilde{C}_{q}\big(\sum_{t=1}^{n}a_{t}^{2}\big)^{1/2} and \|\sum_{s,t=1}^{n}a_{s,t}V_{t,n}^{(1)}(s)V_{s,n}^{(2)}(t)\|_{q}\leq\tilde{C}_{q}\big(\sum_{s,t=1}^{n}a_{s,t}^{2}\big)^{1/2} for deterministic numbers or and processes , and which fulfill dependence conditions and have bounded variation with respect to the indices and .
Lemma 6.5 (Moment inequality).
Let Assumptions 2.1 and 3.1 hold. Let . Then, for all ,
[TABLE]
where , and is defined in Lemma C.1.
Proof of Lemma 6.5.
Note that for , we have by Hoelder’s inequality for all , :
[TABLE]
Since \sum_{t=1}^{\infty}\Big(\sum_{j=0}^{t-1}\chi_{j}\delta_{qM}(t-j)\Big)\leq\sum_{j=0}^{\infty}\chi_{j}\cdot\sum_{t=1}^{\infty}\delta_{qM}(t)<\infty, Lemma C.1 is applicable and we obtain the assertion. ∎
Lemma 6.6 (Continuity properties of localized sums).
Let Assumptions 2.1, 3.1 and 3.4 hold. Let . Then it holds for arbitrary , :
[TABLE]
where , depend solely on and .
Proof.
Define . Since , it holds that for all , |g(y,\theta)|\leq|g|_{\infty}+C_{1}|y|_{\chi,1}\cdot\big(1+|y|_{\chi,1}^{M-1}\big). By Young’s inequality, it holds that a\leq\frac{1}{M}\big((M-1)+a^{M}\big) for and nonnegative real numbers , which shows that there exists a constant such that for ,
[TABLE]
We can use the bound (32) to see that uniformly in it holds that . It holds that
[TABLE]
∎
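The Young-inequality step used in the proof above can be spelled out as follows, assuming M > 1 and writing M' = M/(M-1) for the conjugate exponent (the symbol M' is introduced only for this illustration; for M = 1 the inequality is an equality).

```latex
a = a \cdot 1 \;\le\; \frac{a^{M}}{M} + \frac{1^{M'}}{M'}
  = \frac{a^{M}}{M} + \frac{M-1}{M}
  = \frac{1}{M}\big((M-1) + a^{M}\big), \qquad a \ge 0 .
```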
Lemma 6.7 (Uniform convergence and weak bias expansion).
Let Assumptions 2.1, 3.1 and 3.4 hold. Let . Then for all , it holds almost surely that
[TABLE]
and there exists a constant independent of such that for all :
[TABLE]
Proof of Lemma 6.7.
Define
[TABLE]
where . For each , we can find a space with such that the compact space is approximated in the following way: for each there is a such that . Now fix some . For , we obtain
[TABLE]
Our goal is to bound , by absolutely summable sequences in . Then the assertion follows from Borel-Cantelli’s lemma. From Lemma 6.5 we obtain:
[TABLE]
Furthermore we have for by Lemma 6.1 that
[TABLE]
By Markov’s inequality, it follows that
[TABLE]
and thus for large enough, W_{1}\leq\#\Xi_{n}^{\prime}\cdot\sup_{\xi\in\Xi_{n}}\mathbb{P}\big((nh)^{\frac{1}{2}-\alpha}|f(\xi)|>\delta/2\big)\leq\big(\frac{\rho_{1,\psi C_{\delta,\cdot},q}|K|_{\infty}+C_{S,q}}{\delta/2}\big)^{q}\cdot n^{q-\alpha\delta q} is bounded by an absolutely summable sequence in .
We now discuss . Define . Using the inequality (32) and the Lipschitz property of , we obtain
[TABLE]
Since and (cf. Assumption 3.4), we have shown that , where the deterministic grows only polynomially fast in . Choose large enough and some constants such that for all , then we have
[TABLE]
which is absolutely summable.
The proof of (34) is immediate from the bounds (59), (61) and (26) applied to each summand of and the fact that has bounded variation which gives
[TABLE]
as long as . ∎
Lemma 6.8.
Let Assumption 2.1, 3.1 and 3.4 hold. Let . Define $E_{n,-s}(\phi,g,\theta):=\frac{1}{n}\sum_{t=1,\,t\neq s}^{n}\phi\big(\frac{t}{n}\big)\cdot g(Y_{t,n}^{c},\theta)$. Then for all , we have
[TABLE]
Proof of Lemma 6.8.
Fix . Since $F_{n}\leq n^{-\delta\alpha}\cdot|K|_{\infty}C_{\infty}\cdot\sup_{s=1,\dots,n}\big(1+|Y_{s,n}|_{\chi,1}^{M}\big)$, we obtain by Markov’s inequality
[TABLE]
If is chosen large enough, we obtain the assertion by Borel-Cantelli’s lemma. ∎
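The leave-one-out sum $E_{n,-s}$ from Lemma 6.8 differs from the full sum only by the single term $\frac{1}{n}\phi(s/n)\,g(Y_{s,n}^{c},\theta)$, which is why a bound on the supremum of one summand, as in the proof above, is enough. The following sketch illustrates this bookkeeping. The weight function `phi`, the function `g`, and the data are hypothetical placeholders, not the paper's choices.

```python
import math


def phi(x: float) -> float:
    # Hypothetical bounded weight function (stand-in for a kernel-type weight).
    return max(0.0, 1.0 - abs(2.0 * x - 1.0))


def g(y: float, theta: float) -> float:
    # Hypothetical function of the observation and the parameter.
    return (y - theta) ** 2


def full_sum(ys, theta):
    n = len(ys)
    return sum(phi(t / n) * g(ys[t - 1], theta) for t in range(1, n + 1)) / n


def leave_one_out_sum(ys, theta, s):
    # E_{n,-s}: the same sum with the s-th term (1-based) removed, still divided by n.
    n = len(ys)
    return sum(phi(t / n) * g(ys[t - 1], theta)
               for t in range(1, n + 1) if t != s) / n


ys = [math.sin(0.3 * t) for t in range(1, 51)]  # deterministic toy data
theta = 0.1
n = len(ys)

# Difference between full and leave-one-out sums versus the single omitted term.
diffs = [full_sum(ys, theta) - leave_one_out_sum(ys, theta, s) for s in range(1, n + 1)]
single_terms = [phi(s / n) * g(ys[s - 1], theta) / n for s in range(1, n + 1)]
max_error = max(abs(d - e) for d, e in zip(diffs, single_terms))
print(max_error)
```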
The following corollary is immediate from Lemma 6.7 and 6.8.
Corollary 6.9 (Uniform convergence of likelihoods).
Let Assumption 2.1, 3.1, 3.3 and 3.4 hold. Then for all and all it holds component-wise that
[TABLE]
and there exists a constant independent of such that (component-wise):
[TABLE]
and for all , it holds that
[TABLE]
The following theorem is a consequence of Corollary 6.9. The proof uses standard arguments from maximum likelihood theory and is postponed to the appendix.
Theorem 6.10 (Uniform strong consistency of the maximum likelihood estimator).
Let Assumptions 2.1, 3.1, 3.3 and 3.4 hold. Then for each , it holds almost surely in each component that
[TABLE]
Furthermore for large enough, we have uniformly in , for each component that
[TABLE]
The results still hold if , are replaced by , accordingly.
6.4 Proofs of the results of Chapter 3
Proof of Lemma 3.10.
We have for arbitrary :
[TABLE]
Since has bounded variation , has bounded variation and , the second summand of (39) is of order . Furthermore, by Lemma 6.1, Lemma 6.5 and Lemma 6.7 we obtain for each component
[TABLE]
By Lemma 6.6, we have
[TABLE]
Since has bounded variation , we obtain that the first summand in (39) is O((nh)^{-1}\big{\{}h^{\beta\wedge 1}+(nh)^{-1/2}\big{\}}). So in view of Proposition 6.4, we have shown that there exists such that
[TABLE]
Define . Using the inequality (32) and the notation therein, we obtain for each component that and . These results together with (14) imply
[TABLE]
A similar argumentation is valid for . Since by assumption, we have shown that there exists which grows at most polynomially in such that
[TABLE]
As in the proof of Lemma 3.10 it can be shown that (40) and (41) together imply that a.s.
With the definitions and , we decompose
[TABLE]
With Lemma C.1 we obtain , and (see Proposition 6.4(ii) for ), i.e. with some . Details are deferred to the appendix. It is straightforward to see that fulfills a similar condition as in (40) for . The technique from the proof of Lemma 3.10 implies $\sup_{h\in H_{n}}\big|\frac{d_{I}^{*}(\hat{\theta}_{h},\theta_{0})-d_{M}^{*}(\hat{\theta}_{h},\theta_{0})}{d_{M}^{*}(\hat{\theta}_{h},\theta_{0})}\big|\to 0$ a.s. ∎
Proof of Corollary 3.11.
The convergence $\sup_{h\in H_{n}}\big|\frac{d_{I}(\hat{\theta}_{h},\theta_{0})-d_{I}^{*}(\hat{\theta}_{h},\theta_{0})}{d_{I}^{*}(\hat{\theta}_{h},\theta_{0})}\big|\to 0$ a.s. follows from the decomposition
[TABLE]
and the results from Corollary 6.9 and Theorem 6.10. Together with Lemma 3.10, the assertion of the Corollary follows. Details can be found in the appendix. ∎
Proof of Lemma 3.12.
Put
[TABLE]
We have to show that
[TABLE]
then it follows immediately from Lemma 3.10:
[TABLE]
Using the same techniques as in the proof of Corollary 3.11 (and additionally the uniform convergence results of Corollary 6.9), it can be shown that
[TABLE]
and we can conclude from (43), (44) (by a similar expansion as in (58)) that $\sup_{h\in H_{n}}\big|\frac{\overline{d}_{A}(\hat{\theta}_{h},\theta_{0})-d_{M}^{*}(\hat{\theta}_{h},\theta_{0})}{d_{M}^{*}(\hat{\theta}_{h},\theta_{0})}\big|\to 0$ a.s. It remains to prove (42). By applying the Cauchy-Schwarz inequality, we obtain
[TABLE]
where $W_{n}:=\frac{1}{n}\sum_{t=1}^{n}\big|\nabla\ell(Y_{t,n}^{c},\theta_{0}(t/n))\big|^{2}_{V(\theta_{0}(t/n))^{-1}}$ is bounded a.s. (details are deferred to the appendix). The assertion now follows from (45), Lemma 3.10 and the expansion $d_{A}^{*}=d_{M}^{*}\cdot\big(1+\frac{d_{A}^{*}-d_{M}^{*}}{d_{M}^{*}}\big)$. ∎
For some and vectors , define as .
Proof of Lemma 3.13.
Define . By Taylor’s expansion, it holds that
[TABLE]
with some intermediate value which satisfies . By Theorem 6.10, converges to uniformly in and thus lies in the interior of for large enough. Using a third-order Taylor expansion, we obtain
[TABLE]
with some intermediate value which satisfies . Put . For large enough and , we have by Corollary 6.9 that . Then it follows that the minimal eigenvalue of and are bounded from below by . So for large enough, , we have
[TABLE]
With (46) and (47), we obtain the decomposition
[TABLE]
The remainders () now have to be discussed separately. To get rid of and the intermediate values and we replace them by and , respectively. To replace , we use the decompositions
[TABLE]
It can be shown that these replacements are of order uniformly in with arbitrarily small by using the uniform results of Corollary 6.9 and Theorem 6.10, (38) (see Proposition 6.4 for ). By the decomposition of in Proposition 6.4(i),(ii) the replacements are of smaller order than . The remaining terms are listed in Lemma 6.11, where their convergence is also proven. Details are in the appendix. ∎
In the proof of the following Lemma 6.11 we use a similar technique as in Lemma 6.7 (or Lemma 3.10). The main part therefore is to calculate the norm of the quantities which is done via the results of Lemma C.1. Details can be found in the appendix.
Lemma 6.11.
Let Assumption 2.1, 3.1, 3.3 and 3.4 hold. Put . Then it holds almost surely that
[TABLE]
Acknowledgements.
We gratefully acknowledge support by Deutsche Forschungsgemeinschaft through the Research Training Group RTG 1653.
Appendix A More detailed proofs of section 3
Proof of Corollary 3.11, more detailed.
We have
[TABLE]
where is the identity matrix and
[TABLE]
and is some intermediate value with . Using the bound , we have
[TABLE]
By Corollary 6.9 and Theorem 6.10, we have
[TABLE]
thus
[TABLE]
According to Assumption 3.3, let be the value which bounds all eigenvalues from from below. Using the representations , , we conclude with (56), (57):
[TABLE]
Using the shortcuts and so on, we have
[TABLE]
hence, the assertion follows from Lemma 3.10. The proof for is the same by using sums instead of integrals. ∎
Appendix B Additional proofs of section 6
Proof of Lemma 6.1.
Since , we have by Hoelder’s inequality:
[TABLE]
Furthermore, we have
[TABLE]
Define , then
[TABLE]
By Hoelder’s inequality, we have for arbitrary :
[TABLE]
Defining , we obtain
[TABLE]
(60) and (62) give the result. ∎
Proof of Lemma 6.3.
Put . Define and . We show that is Frechet differentiable with derivative . Choose with . Let be a sequence of zeros where only at the -th position is a 1. By the mean value theorem in , there exists such that
[TABLE]
This shows Frechet differentiability of . By applying the chain rule, is differentiable with derivative . By the fundamental theorem of calculus,
[TABLE]
with some constant dependent on . This shows that . In the same manner, we obtain Frechet differentiability of itself, and with some constant depending on . Since and are twice continuously differentiable, we conclude that the composition is twice continuously differentiable and thus
[TABLE]
by the chain rule. We obtain with Hoelder’s inequality
[TABLE]
where . With similar arguments, we can show that and with some constants . Hoelder’s inequality yields
[TABLE]
so we have proven the existence of all terms in the expansion (27). It remains to analyze the residual term . Since is (uniformly) continuous a.s., we have
[TABLE]
Furthermore, it holds that and thus is finite by Assumption 3.7. Assumption 3.7 also directly implies that
[TABLE]
Using similar techniques as in (63) and (64), follows. The dominated convergence theorem and (65) yield
[TABLE]
∎
Proof of Proposition 6.4.
(i) Let . By Lemma 6.1, there exists some constant such that
[TABLE]
By (component-wise) and (32), it holds that
[TABLE]
Furthermore, has bounded variation (uniformly in ) since has bounded variation. The same holds for the kernel . We conclude that
[TABLE]
This implies, uniformly in ,
[TABLE]
By Lemma 6.1 and Lemma 6.2, there exist constants such that
[TABLE]
By the (component-wise) martingale difference property of and since has bounded variation, we have
[TABLE]
Let us use the abbreviations and . With (69) we conclude
[TABLE]
Note that $\mathbb{E}\big|\nabla\ell(\tilde{Y}_{t}(u),\theta_{0}(u))\big|_{V(\theta_{0}(u))^{-1}}^{2}=\operatorname{tr}\{V^{-1}(\theta_{0}(u))I(\theta_{0}(u))\}$. By combining (68), (70) and (71), we obtain
[TABLE]
from which the stated convergence follows by integration.
(ii) By Lipschitz continuity of , we have for each component that . Furthermore, by using the inequality (where denotes the spectral norm of a matrix ), the bounded variation of and and Lemma 6.2, we have
[TABLE]
The uniform results (66) and (67) together with Lemma 6.2 provide (30).
(iii) If , it holds that . According to Lemma 6.3, we have uniformly in :
[TABLE]
Since is a martingale difference sequence and is symmetric, the first two terms in (72) vanish. The last term in (72) is uniformly in since and from Lemma 6.3. ∎
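The trace identity used in part (i) of the proof above is an instance of the general fact $\mathbb{E}[x^{\prime}Ax]=\operatorname{tr}(A\,\mathbb{E}[xx^{\prime}])$ for a random vector $x$ and a fixed matrix $A$. This reading assumes that $|x|_{A}^{2}$ denotes $x^{\prime}Ax$ and that $I(\theta)$ is the second-moment matrix of the gradient. A small exact check on a finite distribution, where the support points, the probabilities and the matrix `A` are made-up illustrative values:

```python
# Finite distribution of a 2-dimensional random vector x (illustrative values).
points = [(1.0, 2.0), (-1.0, 0.5), (0.0, -3.0)]
probs = [0.5, 0.25, 0.25]
A = [[2.0, 0.5], [0.5, 1.0]]  # fixed symmetric matrix, stand-in for V^{-1}


def quad_form(x, mat):
    """Quadratic form x' * mat * x for a 2-dimensional vector x."""
    return sum(x[i] * mat[i][j] * x[j] for i in range(2) for j in range(2))


# Left-hand side: E[x' A x].
lhs = sum(p * quad_form(x, A) for x, p in zip(points, probs))

# Right-hand side: tr(A * E[x x']).
second_moment = [[sum(p * x[i] * x[j] for x, p in zip(points, probs))
                  for j in range(2)] for i in range(2)]
rhs = sum(A[i][j] * second_moment[j][i] for i in range(2) for j in range(2))

print(lhs, rhs)
```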
Proof of Theorem 6.10.
By application of Corollary 6.9 with , we obtain the uniform convergence
[TABLE]
The identifiability condition in Assumption 3.3 implies that attains its unique minimum at . Standard arguments provide the uniform convergence (37) (see also Dahlhaus, Richter and Wu (2017)). Since lies in the interior of , we have that lies in the interior of almost surely for large enough. With some intermediate value satisfying , it holds that
[TABLE]
By using the continuity of and the uniform convergences of provided by Corollary 6.9, we have that
[TABLE]
converges to [math] uniformly in , . Since the smallest eigenvalue of is bounded from below by uniformly in , we have that for large enough, the smallest eigenvalue of is bounded from below by uniformly in , giving the result (38). The arguments for are similar. ∎
Proof of Lemma 3.10 (additional material).
We now show that (40) and (41) together imply that a.s., where . For each , we can find a space such that for each there exists some such that . Choose large enough such that with some constants . Let be arbitrarily chosen. Then by Markov’s inequality, (40) and (41),
[TABLE]
which is absolutely summable for large enough, giving the result by applying Borel-Cantelli’s lemma. In the following we will use this technique for similar expressions without explicitly showing results of the type (41). The proofs are similar to the proof of (41). Following the argumentation in (B), we obtain
[TABLE]
For , it holds that
[TABLE]
where and . We now derive upper bounds for
[TABLE]
[TABLE]
where and are defined in Lemma 6.5. Furthermore, we have
[TABLE]
Put and from Proposition 6.4. By Lemma 6.6, we have $|\hat{b}_{h,i}(u)-\hat{b}_{h,i}(v)|\leq C_{-,1}\big\{\frac{|u-v|}{h}+|\theta_{0}(u)-\theta_{0}(v)|_{1}\big\}$ and . This implies
[TABLE]
Since , we obtain
[TABLE]
The bounds (74), (76) and (78) imply $\sup_{h\in H_{n}}\big|\frac{d_{I}^{*}(\hat{\theta}_{h},\theta_{0})-d_{M}^{*}(\hat{\theta}_{h},\theta_{0})}{d_{M}^{*}(\hat{\theta}_{h},\theta_{0})}\big|\to 0$ a.s. ∎
Proof of Lemma 3.12 (additional material).
Using similar techniques as in (B), we obtain , which shows a.s. Furthermore, by ((ii)), we have for arbitrary that , showing that a.s. By the bound (32),
[TABLE]
which finally shows that is bounded and thus is bounded a.s. ∎
Proof of Lemma 3.13 (additional material).
For brevity, let us define , where the supremum over should be understood as the maximum over all components of the argument (if it is a vector or matrix). It follows from Corollary 6.9 that for and all ,
[TABLE]
Furthermore, let denote a generic constant depending only on and which may change its value from line to line.
We first discuss . By using (50), can be expanded into three terms:
[TABLE]
The first summand and the second summand in (80) are discussed in Lemma 6.11, (51) and (53). Put . Due to the Cauchy-Schwarz inequality, the third summand in (80) is bounded by
[TABLE]
Here we used the fact that for and , we have for all by the assumption on . is defined in Proposition 6.4(ii). By (79), (30) and Proposition 6.4(i),(ii) we conclude that a.s. and a.s.
The next part of the proof addresses . Our goal is to eliminate and the intermediate values. Put
[TABLE]
We now argue how can be obtained by successively replacing terms in with a rate smaller than . The remainder by replacing by in is bounded by (see (50) and (38)):
[TABLE]
which is of order by (79).
The same arguments hold for the remainder which is obtained by replacing with $\mathbb{E}[\nabla^{3}L_{n,h,-t}(t/n,\theta)]\big|_{\theta=\tilde{\theta}_{h,-t}(t/n)}$. For the next replacement note that for arbitrary , we have
[TABLE]
where . By Lemma 6.7, is bounded almost surely. The remainder by replacing $\mathbb{E}[\nabla^{3}L_{n,h,-t}(t/n,\theta)]\big|_{\theta=\tilde{\theta}_{h,-t}(t/n)}$ with is bounded by (see (38) and (82))
[TABLE]
While the second summand is discussed in Lemma 6.11, (55), the first and the third summand are of order by Lemma 6.2 and (79).
Lastly we replace twice by by using the expansion
[TABLE]
where . The remainder is bounded by terms similar to (81) and (83).
With Proposition 6.4(i),(ii) we conclude that $\sup_{h\in H_{n}}\big|\frac{R_{n,h,2}-R_{n,h,2,1}}{d_{M}^{*}(\hat{\theta}_{h},\theta_{0})}\big|\to 0$ a.s. In Lemma 6.11, (54) it is shown that $\sup_{h\in H_{n}}\big|\frac{R_{n,h,2,1}}{d_{M}^{*}(\hat{\theta}_{h},\theta_{0})}\big|\to 0$ a.s.
Finally, put
[TABLE]
We briefly explain how can be obtained from . The intermediate value is replaced by with remainder given by
[TABLE]
which is handled similarly to (83). The replacement of by is done as for . We conclude with Proposition 6.4(i),(ii) that $\sup_{h\in H_{n}}\big|\frac{R_{n,h,3}-R_{n,h,3,1}}{d_{M}^{*}(\hat{\theta}_{h},\theta_{0})}\big|\to 0$ a.s. The convergence $\sup_{h\in H_{n}}\big|\frac{R_{n,h,3,1}}{d_{M}^{*}(\hat{\theta}_{h},\theta_{0})}\big|\to 0$ a.s. is proved in Lemma 6.11, (52).
∎
Proof of Lemma 6.11.
We have uniformly in that (see the bound in (31)), where and are defined in Lemma 6.5. Furthermore, we can use the same argumentation as in (26) to see that uniformly in it holds that . Since is Hoelder continuous, there exists such that . As in the proof of Lemma 3.13, let denote a generic constant depending only on and which may change its value from line to line. We use the same technique as in the proof of Lemma 3.10, (73), but omit the proofs on Lipschitz continuity in (cf. (41)) since they do not pose any extra difficulty.
We start by proving (51). Put
[TABLE]
By Hoelder’s inequality (cp. also Lemma 6.1) it follows that
[TABLE]
Since are martingale differences, has expectation 0 and we obtain
[TABLE]
which shows that since is absolutely summable. Define
[TABLE]
Recall that is absolutely summable. Furthermore, for some , note that for arbitrary , we have for all by the assumption on . By Lemma C.1, we have
[TABLE]
which shows that a.s. Finally, put
[TABLE]
Since are martingale differences, it holds that . Since and , we conclude that and , which also shows . Thus Lemma C.1 is applicable and we obtain
[TABLE]
which shows that a.s.
We now prove (52). Similar arguments as in (B) can be used to show that
[TABLE]
fulfills a.s. Put
[TABLE]
Decompose
[TABLE]
where
[TABLE]
Since , we obtain by Lemma C.1, similarly to (85),
[TABLE]
[TABLE]
By the Cauchy-Schwarz inequality, we have
[TABLE]
which shows that .
By Lemma C.1, we obtain similarly to (86),
[TABLE]
which shows that a.s. By Lemma C.1, (93),
[TABLE]
thus . Define
[TABLE]
then by Lemma C.1 and inequality (35) it holds that
[TABLE]
Define
[TABLE]
then by Lemma C.1 and inequality (35) it holds that
[TABLE]
Finally, by Lemma C.1, ((iii)), we have
[TABLE]
We now discuss (53). Put . Similar arguments as in (B) can be used to show that
[TABLE]
fulfills . Define
[TABLE]
Put
[TABLE]
By Lemma C.1, it holds that
[TABLE]
Due to the similar structure, the argumentation for can be mimicked from above and leads to .
The stochastic structure of (54) is exactly the same as in (52), thus the proof follows the same lines. Finally, it is easy to see from Lemma C.1 that
[TABLE]
which shows (55). ∎
Appendix C Bounds for moments of sums, quadratic and cubic forms of Bernoulli shifts
The inequalities derived in this section are needed to prove the moment inequalities for the local likelihoods and their derivatives in Section 6.3. Results for linear and quadratic forms were obtained in Xiao and Wu (2012); however, we need them in a more general setting.
Lemma C.1.
Let . Let , be some sequence of nonnegative real numbers. Put . Let , , be some deterministic sequences of real numbers.
- (i)
Let , be a triangular array with and . Put . Then it holds that
[TABLE]
- (ii)
For , let , be triangular arrays with and as well as
[TABLE]
Put $\rho_{2,\delta,q}=8q^{1/2}\Delta_{0,2q}\big[q^{1/2}\Delta_{0,2q}+\sum_{k=0}^{\infty}\Delta_{k,2q}\big]$ and . Then it holds that
[TABLE]
- (iii)
For , let , be triangular arrays with and . Assume that and fulfill (88) uniformly in as well as
[TABLE]
Put
[TABLE]
and $\tilde{\rho}_{3,\delta}:=12\Delta_{0,3}^{2}\big[\Delta_{0,3}+\sum_{k=0}^{\infty}\Delta_{k,3}\big]$. Then it holds that
[TABLE]
Proof.
To keep the notation simple, set to [math] if one of the indices is not in . We start with proving the stochastic inequalities (87), ((ii)) and ((iii)). Since , it holds almost surely that . , is a martingale difference sequence, thus with Rio (2009), Theorem 2.1 we obtain
[TABLE]
We now prove ((ii)). Define $D_{s,t,k_{1},k_{2}}:=a_{s,t}\big\{P_{s-k_{1}}V_{s,n}^{(1)}(t)P_{t-k_{2}}V_{t,n}^{(2)}(s)-\mathbb{E}P_{s-k_{1}}V_{s,n}^{(1)}(t)P_{t-k_{2}}V_{t,n}^{(2)}(s)\big\}$. Note that
[TABLE]
where
[TABLE]
Here as well as are martingale difference sequences. By applying Theorem 2.1 in Rio (2009),
[TABLE]
Define . By partial summation and the Cauchy-Schwarz inequality, we have
[TABLE]
We finally obtain $A_{2,1}\leq 2(q-1)^{1/2}(2q-1)^{1/2}\Delta_{0,2q}^{2}\cdot\big(\sum_{s,t=1}^{n}a_{s,t}^{2}\big)^{1/2}$; a similar upper bound holds for . To discuss , note that
[TABLE]
is a martingale difference sequence. The arguments of and of are omitted in the next steps. In the following, fix some . Let be an i.i.d. copy of . Let . For fixed and some random variable , we define and . Similarly, and . We use the abbreviation . It holds that
[TABLE]
Note that and . Furthermore,
[TABLE]
and thus, by Jensen’s inequality,
[TABLE]
Note that sharper bounds than (94) may be available in special cases. We obtain
[TABLE]
We now prove ((iii)). To do so, define
[TABLE]
Here, we can bound $\sum_{k_{1},k_{2},k_{3}\geq 0}\big\|\sum_{r,s,t=1}^{n}D_{r,s,t,k_{1},k_{2},k_{3}}\big\|_{q}$ by four different types of terms. The first type (all indices are different) is of the form
[TABLE]
Note that the three sequences $\big(\sum_{s:s-k_{2}<r-k_{1}}\sum_{t:t-k_{3}<s-k_{2}}D_{r,s,t,k_{1},k_{2},k_{3}}\big)_{r}$ and $\big(\sum_{t:t-k_{3}<s-k_{2}}a_{r,s,t}P_{s-k_{2}}V_{s,n}^{(2)}P_{t-k_{3}}V_{t,n}^{(3)}\big)_{s}$ and are martingale differences. Put . By partial summation, we have
[TABLE]
leading with Theorem 2.1. in Rio (2009) and Hoelder’s inequality to the upper bound
[TABLE]
for . Using the same partial summation argument as in the discussion of above, we obtain
[TABLE]
The second type (the two smaller indices are equal) is of the form
[TABLE]
Put . By applying similar partial summation techniques as for , we obtain the upper bound
[TABLE]
for . The same technique as applied in leads to the bound
[TABLE]
Note that is a martingale difference sequence. For brevity, we will omit the additional arguments of in the following part. The third type (all three indices are equal) is of the form
[TABLE]
where
[TABLE]
and
[TABLE]
Similarly to (94), we obtain
[TABLE]
which leads to
[TABLE]
The fourth type (the two bigger indices are equal) has the form
[TABLE]
is bounded by the sum of three terms , which will be defined in the following. Put . For brevity in the following argumentation, put . By using similar techniques as for , we obtain
[TABLE]
Applying the same partial summation techniques as for the term , we conclude
[TABLE]
With slight changes in the argumentation, the second term has the upper bound
[TABLE]
In the following, we will again omit the arguments of . The third term reads
[TABLE]
Put , , and . Using similar techniques as in , we obtain
[TABLE]
which shows
[TABLE]
and finishes the proof of ((iii)).
To prove (90) and (93), we will omit the arguments of since all bounds are uniform in these arguments. For (90), we use the inequalities
[TABLE]
To prove (93), note that
[TABLE]
The above term is bounded by two types of terms. The first type (all three indices are equal) is of the form
[TABLE]
the second type (the two bigger indices of are equal) is of the form
[TABLE]
∎
Appendix D Proofs of section 4
For linear time series, we use the model which was set up in Dahlhaus and Polonik (2009).
Proposition D.1 (Linear time series models).
Suppose that Assumption 3.3(1) holds. Assume that
[TABLE]
with some coefficients and satisfying
[TABLE]
with some absolutely summable sequence . Assume that , are i.i.d. and have all moments, especially and .
Define , the spectral density and real numbers . Assume that
- (a)
For , it holds that implies .
- (b)
uniformly in . is four times continuously differentiable in . Assume that there exist , such that ().
- (c)
The minimal eigenvalue of is bounded away from 0 uniformly in .
Then, assuming that has a standard Gaussian distribution, from (4) has the form
[TABLE]
and Assumptions 2.1, 3.1 and 3.3 are fulfilled for (24), and it holds with the fourth cumulant of that
[TABLE]
If additionally Assumption 3.7(1) holds and
- (d)
is -times continuously differentiable in and fulfills component-wise ).
then Assumption 3.7 is fulfilled and the bias term has the form
[TABLE]
where and is defined via .
Proof of Proposition D.1.
Put . By (95), we have for all that
[TABLE]
It holds that . By condition (b) and Katznelson (2004), chapter I, section 4, we have that with some for .
[TABLE]
Furthermore we obtain . Since , it follows from the inverse Fourier transform that the process is invertible in the sense that
[TABLE]
since
[TABLE]
This shows and , thus the negative logarithm of the Gaussian conditional likelihood (4) has the form (96). From the conditions on in (b) it is straightforward to see that satisfies Assumption 3.3(5).
Furthermore, we have
[TABLE]
since with the same Fourier argument as before. From (98) it is immediate that \nabla\ell(\tilde{Y}_{t}(\theta^{\prime}),\theta)\big{|}_{\theta^{\prime}=\theta} is a martingale difference sequence w.r.t. . Using Bochner’s theorem, we have and thus
[TABLE]
Furthermore, Kolmogorov’s formula (cf. Brockwell and Davis (2009), Theorem 5.8.1) implies that , which shows that
[TABLE]
Since with equality if and only if and since condition (a) holds, is the unique minimizer of . Differentiating (100) twice with respect to (and replacing by afterwards), we obtain
[TABLE]
Condition (c) implies that the minimal eigenvalue of is bounded away from 0 uniformly in . Define which is -measurable and . Recall . From (99) it follows that
[TABLE]
where
[TABLE]
thus
[TABLE]
where denotes the fourth cumulant of .
Now let be twice continuously differentiable. Note that
[TABLE]
which implies Assumption 3.4.
Define . Then , . We have
[TABLE]
Thus we obtain the following decomposition of the bias term
[TABLE]
∎
Proposition D.2 (Recursively defined time series).
Assume that , are i.i.d. and have all moments with and . Suppose that Assumption 3.3(1) is fulfilled. Assume that fulfills
[TABLE]
Assume that satisfy
[TABLE]
for all with some with . Assume that with some constant . Assume that , and
- (a)
For , it holds and for each component.
- (b)
for fixed , the two conditions a.s. and a.s. imply .
- (c)
uniformly in , the smallest eigenvalues of the matrices $W_{\mu}(\theta):=\mathbb{E}\big[\frac{\nabla\mu(\tilde{Y}_{0}(\theta^{\prime}),\theta)\nabla\mu(\tilde{Y}_{0}(\theta^{\prime}),\theta)^{\prime}}{\sigma^{2}(\tilde{Y}_{0}(\theta),\theta)}\big]\big|_{\theta^{\prime}=\theta}$ and $W_{\sigma}(\theta):=\mathbb{E}\big[\frac{\nabla\sigma(\tilde{Y}_{0}(\theta^{\prime}),\theta)\nabla\sigma(\tilde{Y}_{0}(\theta^{\prime}),\theta)^{\prime}}{\sigma^{2}(\tilde{Y}_{0}(\theta),\theta)}\big]\big|_{\theta^{\prime}=\theta}$ are bounded from below by some .
Then Assumptions 2.1, 3.1 and 3.3 are fulfilled for chosen as in (4) assuming that is standard Gaussian distributed, i.e. for
[TABLE]
In this case, it holds with $W_{\mu\sigma}:=\mathbb{E}\big[\frac{\nabla\mu(\tilde{Y}_{0}(\theta^{\prime}),\theta)\nabla\sigma(\tilde{Y}_{0}(\theta^{\prime}),\theta)^{\prime}}{\sigma^{2}(\tilde{Y}_{0}(\theta),\theta)}\big]\big|_{\theta^{\prime}=\theta}$ that
[TABLE]
Proof of Proposition D.2.
Condition (102) implies that exists and is a.s. unique (cf. Proposition 4.3 in Dahlhaus, Richter and Wu (2017) or Theorem 5.1 in Shao and Wu (2007)) and that there exist , such that for all . With slight changes in the argumentation of their proof, (1) and follow similarly to Proposition 4.3 and Lemma 4.4 in Dahlhaus, Richter and Wu (2017). Thus, the conditions of Assumptions 2.1 and 3.1 are fulfilled.
Put . Let us omit the argument of and in the following. It holds that
[TABLE]
By a Taylor expansion of , we obtain that the second summand is bounded from below by $\frac{1}{4}\mathbb{E}\Big[\frac{(\sigma^{2}(\theta_{0}(u))-\sigma^{2}(\theta))^{2}}{(\sigma^{2}(\theta_{0}(u))-\sigma^{2}(\theta))^{2}+\Sigma^{4}(\theta)}\Big]$. By these inequalities, condition (b) shows that the unique minimizer of is given by .
Omitting the arguments, (where ), we have
[TABLE]
Since and , , we obtain that is a martingale difference sequence with respect to in each component. Furthermore,
[TABLE]
whose smallest eigenvalue is bounded from below by condition (c). By condition (a) and the fact that , straightforward calculations show that Assumption 3.3(5) is fulfilled with the truncated (and thus summable) sequence and some . ∎
We do not go into detail on when Assumption 3.7 is fulfilled in the situation of Proposition D.2. In view of the results in Section 4 of Dahlhaus, Richter and Wu (2017), we additionally need that are four times continuously differentiable and fulfill some moment conditions.
Appendix E The bias in the tvAR(1) model
We use the results from Proposition D.1. Assume that for , . Here it holds that , where . Here, it holds that and thus
[TABLE]
Moreover, the variance and the first covariance take the form and . This leads to (let denote the derivative with respect to )
[TABLE]
The derivatives of the covariances read
[TABLE]
and
[TABLE]
We obtain
[TABLE]
thus
[TABLE]
where
[TABLE]
By using (97), we finally obtain
[TABLE]
If , i.e. is assumed to be constant and known, we obtain
[TABLE]
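For a stationary AR(1) recursion $X_{t}=\alpha X_{t-1}+\sigma\varepsilon_{t}$ with $|\alpha|<1$ and unit-variance white noise, the variance and the first autocovariance are $\sigma^{2}/(1-\alpha^{2})$ and $\alpha\sigma^{2}/(1-\alpha^{2})$. The sketch below checks these standard closed forms against the MA($\infty$) representation $X_{t}=\sigma\sum_{j\geq 0}\alpha^{j}\varepsilon_{t-j}$. It assumes that the stationary approximation of the tvAR(1) model at a fixed rescaled time $u$ has this form, with `alpha` and `sigma` standing for the curve values at $u$. The parameter values and the truncation length are illustrative.

```python
def ar1_autocov_truncated(alpha: float, sigma: float, lag: int, terms: int = 2000) -> float:
    """Autocovariance at `lag` from truncated MA(infinity) coefficients sigma * alpha**j."""
    return sum((sigma * alpha ** j) * (sigma * alpha ** (j + lag)) for j in range(terms))


alpha, sigma = 0.6, 1.5  # illustrative values with |alpha| < 1

# Closed forms for the variance c(0) and the first autocovariance c(1).
c0_closed = sigma ** 2 / (1 - alpha ** 2)
c1_closed = alpha * sigma ** 2 / (1 - alpha ** 2)

# Numerical values from the truncated series.
c0_num = ar1_autocov_truncated(alpha, sigma, 0)
c1_num = ar1_autocov_truncated(alpha, sigma, 1)

print(c0_closed, c0_num)
print(c1_closed, c1_num)
```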
Appendix F The bias in the tvMA(1) model
We use the results from Proposition D.1. Assume that for , . Then we have . Note that with corresponding to the spectral density of an AR(1) process with parameter instead of . Recall that Kolmogorov’s formula implies and thus
[TABLE]
By (101), this leads to the same as in the AR case:
[TABLE]
To calculate , note that
[TABLE]
We have
[TABLE]
where and . With
[TABLE]
we obtain
[TABLE]
thus
[TABLE]
where
[TABLE]
Using (97), we finally obtain
[TABLE]
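Kolmogorov's formula, as used in this appendix, rests on the fact that $\int_{-\pi}^{\pi}\log|1+\alpha e^{i\lambda}|^{2}\,d\lambda=0$ for $|\alpha|<1$, so the MA(1) or AR(1) transfer function contributes nothing to the integrated log spectral density. A numerical check with an equispaced Riemann sum follows; the value of $\alpha$, the grid size and the function name are illustrative assumptions.

```python
import cmath
import math


def integrated_log_transfer(alpha: float, n_grid: int = 4096) -> float:
    """Approximate the integral over [-pi, pi) of log|1 + alpha * exp(i*lam)|^2."""
    step = 2 * math.pi / n_grid
    total = 0.0
    for k in range(n_grid):
        lam = -math.pi + k * step
        total += math.log(abs(1 + alpha * cmath.exp(1j * lam)) ** 2)
    return total * step


# MA(1)-type factor |1 + alpha * exp(i*lam)|^2 with alpha = 0.5.
integral_ma = integrated_log_transfer(0.5)

# AR(1)-type factor |1 - alpha * exp(i*lam)|^{-2}: the log is minus the log of the inverse.
integral_ar = -integrated_log_transfer(-0.5)

print(integral_ma, integral_ar)
```

Both integrals vanish up to rounding error, so only the innovation variance enters the integrated log spectral density.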
F.1 Bias terms in Examples 4.3 and 4.4
Lemma F.1.
In the situation of Example 4.3, it holds that
[TABLE]
where and
[TABLE]
Proof.
Note that . It holds that
[TABLE]
The recursion implies
[TABLE]
Using the equations given by the recursion, we end up with
[TABLE]
Furthermore, we obtain
[TABLE]
and thus
[TABLE]
Note that
[TABLE]
We finally obtain the bias-term
[TABLE]
In the special case of the AR(1) process it is known that $W(\theta_{0}(u))=\frac{\sigma(u)^{2}}{1-\alpha(u)^{2}}$ and thus $P(u)=\partial_{u}W(\theta_{0}(u))=\frac{2\sigma(u)^{2}}{1-\alpha(u)^{2}}\big[\frac{\partial_{u}\sigma(u)}{\sigma(u)}+\frac{\alpha(u)\partial_{u}\alpha(u)}{1-\alpha(u)^{2}}\big]=2W(\theta_{0}(u))\cdot\big[\frac{\partial_{u}\sigma(u)}{\sigma(u)}+\frac{\alpha(u)\partial_{u}\alpha(u)}{1-\alpha(u)^{2}}\big]$, i.e.
[TABLE]
This leads to
[TABLE]
If is assumed to be known (and not a parameter, meaning that ), then this simplifies to
[TABLE]
∎
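The AR(1) identity $\partial_{u}W=2W\big[\frac{\partial_{u}\sigma}{\sigma}+\frac{\alpha\,\partial_{u}\alpha}{1-\alpha^{2}}\big]$ from the proof above follows from the product and chain rules applied to $W=\sigma^{2}/(1-\alpha^{2})$. A finite-difference check is sketched below. The parameter curves `alpha` and `sigma` and their derivatives are hypothetical smooth examples, not curves from the paper.

```python
import math


def alpha(u: float) -> float:
    # Hypothetical coefficient curve with |alpha| < 1.
    return 0.5 * math.sin(2 * math.pi * u)


def d_alpha(u: float) -> float:
    return math.pi * math.cos(2 * math.pi * u)


def sigma(u: float) -> float:
    # Hypothetical positive scale curve.
    return 1.0 + 0.5 * u


def d_sigma(u: float) -> float:
    return 0.5


def W(u: float) -> float:
    return sigma(u) ** 2 / (1 - alpha(u) ** 2)


def dW_formula(u: float) -> float:
    # Closed-form derivative 2 W [sigma'/sigma + alpha alpha'/(1 - alpha^2)].
    return 2 * W(u) * (d_sigma(u) / sigma(u)
                       + alpha(u) * d_alpha(u) / (1 - alpha(u) ** 2))


def dW_numeric(u: float, eps: float = 1e-6) -> float:
    # Central finite difference approximation of the derivative of W.
    return (W(u + eps) - W(u - eps)) / (2 * eps)


max_abs_err = max(abs(dW_formula(u) - dW_numeric(u)) for u in [0.1, 0.3, 0.55, 0.8])
print(max_abs_err)
```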
Lemma F.2.
In the situation of Example 4.4, it holds that
[TABLE]
where and .
Proof of Lemma F.2.
Let and . It holds that . Thus
[TABLE]
Since is a martingale difference sequence, the first summand in the above formula will vanish when the expectation is applied. To keep the formulas short, we will abbreviate in the following. It holds that
[TABLE]
Furthermore, we have
[TABLE]
The recursion implies
[TABLE]
Recall that $V(\theta)=\frac{1}{2}\mathbb{E}\big[\frac{\mu(\tilde{Z}_{0}(\theta))\cdot\mu(\tilde{Z}_{0}(\theta))^{\prime}}{\langle\theta,\mu(\tilde{Z}_{0}(\theta))\rangle^{2}}\big]$. Using the equations above and , we obtain
[TABLE]
This leads to
[TABLE]
In the special case of the ARCH(1) process with , we have
[TABLE]
where , can be obtained by using the recursion formulas $\tilde{X}_{t}(u)^{2}=\big(\alpha_{1}(u)+\alpha_{2}(u)\tilde{X}_{t-1}(u)^{2}\big)\varepsilon_{t}^{2}$ and
[TABLE]
∎
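Taking expectations in the ARCH(1) recursion $\tilde{X}_{t}(u)^{2}=\big(\alpha_{1}(u)+\alpha_{2}(u)\tilde{X}_{t-1}(u)^{2}\big)\varepsilon_{t}^{2}$ gives, for fixed $u$ and under the assumption that $\mathbb{E}\varepsilon_{t}^{2}=1$ with $\varepsilon_{t}$ independent of the past, the fixed-point equation $m=\alpha_{1}+\alpha_{2}m$ for $m=\mathbb{E}\tilde{X}_{t}(u)^{2}$. The sketch below iterates this moment recursion and compares the result with the closed form $\alpha_{1}/(1-\alpha_{2})$. The parameter values and the function name are illustrative; higher moments can be treated in the same way once the corresponding moments of $\varepsilon_{t}$ are specified.

```python
def stationary_second_moment(alpha1: float, alpha2: float, iterations: int = 500) -> float:
    """Iterate m -> alpha1 + alpha2 * m, the moment recursion implied by the ARCH(1) equation."""
    m = 0.0
    for _ in range(iterations):
        m = alpha1 + alpha2 * m
    return m


alpha1, alpha2 = 0.4, 0.3  # illustrative values with 0 <= alpha2 < 1

m_iter = stationary_second_moment(alpha1, alpha2)
m_closed = alpha1 / (1 - alpha2)

print(m_iter, m_closed)
```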
References
- Arkoun, O., and Pergamenchtchikov, S. (2008). Nonparametric estimation for an autoregressive model. Bulletin of Tomsk State University. Mathematics and Mechanics, 2(3).
- Arkoun, O. (2011). Sequential adaptive estimators in nonparametric autoregressive models. Sequential Analysis, 30(2), 229-247.
- Brockwell, P. J., and Davis, R. A. (2009). Time Series: Theory and Methods, Springer Series in Statistics.
- Chiu, S. T. (1991). Bandwidth selection for kernel density estimation. The Annals of Statistics, 19(4), 1883-1905.
- Dahlhaus, R. (1997). Fitting time series models to nonstationary processes. The Annals of Statistics, 25(1), 1-37.
- Dahlhaus, R., and Giraitis, L. (1998). On the optimal segment length for parameter estimates for locally stationary time series. Journal of Time Series Analysis, 19(6), 629-655.
- Dahlhaus, R., Neumann, M. H., and von Sachs, R. (1999). Nonlinear wavelet estimation of time-varying autoregressive processes. Bernoulli, 5(5), 873-906.
- Dahlhaus, R. (2000). A likelihood approximation for locally stationary processes. The Annals of Statistics, 28(6), 1762-1794.
