Consistency Results for Stationary Autoregressive Processes with Constrained Coefficients
Alessio Sancetta

TL;DR
This paper investigates the estimation consistency of stationary autoregressive processes with constrained coefficients, demonstrating theoretical results and practical benefits of including constraints directly in estimation.
Contribution
It provides new consistency results for constrained and penalized estimators in autoregressive models with coefficients in an ellipsoid, including universal consistency and robustness insights.
Findings
Constrained estimators improve robustness in autoregressive process estimation.
Consistency results hold under various norms for these estimators.
Simulations confirm practical advantages of direct constraint inclusion.
Abstract
We consider stationary autoregressive processes with coefficients restricted to an ellipsoid, which includes autoregressive processes with absolutely summable coefficients. We provide consistency results under different norms for the estimation of such processes using constrained and penalized estimators. As an application we show some weak form of universal consistency. Simulations show that directly including the constraint in the estimation can lead to more robust results.
| 100 | 1000 | ||||
| Short Memory | |||||
| 0.99 | 0.99 | 0.99 | 0.99 | ||
| 0.99 | 0.99 | 0.99 | 0.99 | ||
| Long Memory | |||||
| 0.93 | 0.88 | 0.94 | 0.88 | ||
| 0.93 | 0.88 | 0.94 | 0.88 | ||
Alessio Sancetta Acknowledgements: I am grateful to Luca Mucciante for insightful conversations. E-mail: [email protected], URL: http://sites.google.com/site/wwwsancetta/. Address for correspondence: Department of Economics, Royal Holloway University of London, Egham TW20 0EX, UK
Key Words: consistency, empirical process, ridge regression, reproducing kernel Hilbert space, universal consistency.
1 Introduction
It is common to impose constraints on the decay rate of the autoregressive coefficients in order to derive results amenable to estimation for the purpose of prediction. At a minimum, these constraints tend to require that the AR coefficients are absolutely summable. Then, a natural approach when dealing with high order autoregressive models is to consider sieve estimation. Sieve estimation of infinite AR models has been considered by various authors. For universal consistency, Schäfer (2002) derived perhaps the strongest result possible. Györfi and Sancetta (2015) review some of these results. For convergence in probability, various authors have considered infinite AR models and their applications, e.g. Bühlmann (1997) and Kreiss et al. (2011). Additional references can be found in the cited papers.
Here, we constrain the autoregressive coefficients to lie in an infinite dimensional ellipsoid such that coefficients associated with higher order lags decay fast. Then, we can exploit the fact that the ellipsoid is compact under the Euclidean norm in order to derive asymptotic results. The conditions essentially require the autoregressive coefficients to be absolutely summable. We shall see that the vector of autoregressive coefficients can be seen as an element of a Reproducing Kernel Hilbert Space (RKHS) when the coefficient space is equipped with a suitable inner product. This allows us to exploit all the existing machinery for estimation in RKHS and build on it (see Steinwart and Christmann, 2008, for a comprehensive review). The main ingredient is penalized least squares estimation. We also consider the constrained least squares problem. Penalized and constrained estimation are dual problems for specific values of the penalty coefficient. Our result establishes the relation between the two problems and the consistency rates. In general, they can lead to different consistency results under different norms. One norm is the usual Euclidean norm of the vector of coefficients while the other is the norm of the RKHS. We show that consistency under the latter has important implications for prediction problems.
In general, unlike existing results, we are able to establish consistency as both the autoregressive order and the sample size go to infinity with no constraint on the rates. Existing results use the machinery of the method of sieves, hence they require the autoregressive order to go to infinity in a controlled way. As already mentioned, we are able to avoid this restriction because the ellipsoid is compact under the Euclidean norm.
The plan for the paper is as follows. Section 2 reviews the estimation method and presents the consistency results. A numerical example is provided in Section 3. Section 4 mentions extensions to other processes such as vector autoregressive processes (VAR). The proof of the consistency results is long and is given in Section 5.
2 Estimation Method
We restrict attention to the infinite order autoregressive process
[TABLE]
for some mean zero independent identically distributed (i.i.d.) sequence and unknown coefficients ’s. This paper considers estimators of the above under the condition that .
In a finite sample, the above model can only be approximated by the finite dimensional model
[TABLE]
with . While this is essentially a sieve we do not necessarily require to be of smaller order than the sample size. Here, we restrict the coefficients in an ellipsoid to be defined as follows. Let ’s be positive constants such that for , where means that the left hand side (l.h.s.) and the right hand side (r.h.s.) are proportional. Define the ellipsoid as
[TABLE]
Given that the ’s are increasing, the ’s need to be smaller in absolute values as increases. Write for the ellipsoid where all coefficients can be non-zero, and , so for example is the ellipsoid that is restricted to have finite but decreasing principal axes. The following condition will be imposed on the ellipsoid.
Condition 1
The sequence follows the process (1) with and , where . Moreover, only for outside the unit circle. The innovations are independent identically distributed with finite fourth moment.
Throughout, when writing and similar quantities, it is understood that the ’s are as in Condition 1. The following is stated for convenience.
Lemma 1
If then, for some , where is inequality up to a fixed absolute multiplicative constant.
In consequence, Condition 1 implies absolutely summable autoregressive coefficients. Note that absolute summability would just require in Condition 1 rather than , hence the condition we use is a bit more restrictive. The following states additional properties of the model.
Lemma 2
Under Condition 1, is stationary and ergodic with absolutely summable autocovariance function and .
It is well known that for the AR process, only for outside the unit circle if the autocovariance function is absolutely summable and the spectral density is strictly positive and continuous (Kreiss et al., 2011, Corollary 2.1).
Note that there are processes (even Gaussian) that satisfy Condition 1, but fail to be beta mixing (Doukhan, 1995, Theorem 3, p.59). The beta mixing assumption is often conveniently used when proving convergence using methods from empirical process theory. Alas, it cannot be used here.
2.1 Estimation and Consistency
The goal is to find an estimator for . We consider two approaches: constrained least squares and penalized least squares. By duality, the two can be made equivalent by a suitable choice of the penalty parameter. However, in the constrained case, the penalty turns out to be sample dependent, while in penalized estimation this is not necessarily the case.
To avoid notational trivialities, suppose that the sample size is . This will be assumed without further notice throughout the paper. In particular, our sample is . This also stresses the fact that and can go to infinity at different rates.
In the constrained problem, we estimate . The constrained estimator is defined as
[TABLE]
Of course, in the above, if .
In the penalized problem, we estimate , but introduce the penalty parameter . The penalized estimator is defined as
[TABLE]
where the ’s are from the definition of . By use of the Lagrangian, we can always rewrite (3) as (4) for a suitable choice of , i.e. there is a ( if the constraint is not binding) such that .
Both problems can be reformulated in matrix form using the Lagrangian. Let be the dimensional matrix with entry equal to and be the -dimensional vector with entry . Also, let be the diagonal matrix with diagonal entry equal to . The estimator for either (3) or (4) is found by minimizing the penalized least square criterion with respect to (w.r.t.) ,
[TABLE]
where for (3) is chosen so that the constraint is satisfied. In this latter case, is necessarily random because the constraint needs to be satisfied in sample. Here the tilde in is used to remind us that in the matrix formulation, is truncated to be a dimensional vector, as all entries larger than are zero by definition of . The solution is the usual ridge regression estimator .
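As a rough illustration of the matrix formulation, the following Python sketch builds the lagged design matrix and computes a ridge-type solution. The displayed criterion (5) did not survive in this copy, so the normalization by the sample size, the polynomial weights `k**decay` on the diagonal of the penalty matrix, and all function names are assumptions made for illustration rather than the paper's exact definitions.

```python
import numpy as np

def lagged_design(x, K):
    """Return (X, y): row t of X holds (x[t-1], ..., x[t-K]) and y[t] = x[t]."""
    x = np.asarray(x, dtype=float)
    n = len(x) - K
    if n <= 0:
        raise ValueError("need more observations than lags")
    # Column k-1 holds the series lagged by k periods.
    X = np.column_stack([x[K - k : K - k + n] for k in range(1, K + 1)])
    y = x[K:]
    return X, y

def ridge_ar(x, K, rho, decay=1.0):
    """Ridge-type AR estimate with diagonal penalty weights (k**decay)**2 (assumed form)."""
    X, y = lagged_design(x, K)
    n = len(y)
    lam = np.arange(1, K + 1, dtype=float) ** decay  # hypothetical ellipsoid weights
    Lambda = np.diag(lam ** 2)
    # Solve (X'X/n + rho * Lambda) a = X'y/n rather than forming an inverse.
    return np.linalg.solve(X.T @ X / n + rho * Lambda, X.T @ y / n)
```

With `rho > 0` the linear system stays well posed even when the number of lags is large relative to the number of usable observations, which is the practical counterpart of allowing many lags once the penalty is in place.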
For problem (4), can go to zero in a controlled way. For problem (3), must be chosen so that the constraint is satisfied. Such a penalty is strictly positive if the constraint is binding, and zero otherwise. This is equivalent to replacing with in (5), and minimizing the so modified objective function (5) w.r.t. and . The minimizer w.r.t. is .
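The sample-dependent penalty of the constrained problem can be found numerically. The sketch below is one possible approach, not the paper's procedure: it assumes the constraint has the weighted form `sum_k lam_k**2 * a_k**2 <= B**2` with hypothetical weights `lam_k = k**decay`, and it uses bisection, relying on the fact that the weighted norm of a ridge solution decreases as the penalty grows.

```python
import numpy as np

def constrained_ar(x, K, B, decay=1.0, tol=1e-10, max_iter=200):
    """Return (a, rho): AR estimate with weighted norm <= B and the implied penalty."""
    x = np.asarray(x, dtype=float)
    n = len(x) - K
    X = np.column_stack([x[K - k : K - k + n] for k in range(1, K + 1)])
    y = x[K:]
    w = np.arange(1, K + 1, dtype=float) ** (2 * decay)  # assumed weights lam_k**2
    G, c = X.T @ X / n, X.T @ y / n

    def wnorm(a):
        return float(np.sqrt(np.sum(w * a ** 2)))

    # If plain least squares already satisfies the constraint, the penalty is zero.
    a0 = np.linalg.lstsq(X, y, rcond=None)[0]
    if wnorm(a0) <= B:
        return a0, 0.0

    def solve(rho):
        return np.linalg.solve(G + rho * np.diag(w), c)

    # Bracket the penalty, then bisect until the constraint just holds.
    lo, hi = 0.0, 1.0
    while wnorm(solve(hi)) > B:
        hi *= 2.0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if wnorm(solve(mid)) > B:
            lo = mid
        else:
            hi = mid
        if hi - lo <= tol * max(1.0, hi):
            break
    return solve(hi), hi
```

The returned penalty depends on the data through `G` and `c`, which mirrors the remark that the multiplier of the constrained problem is necessarily random.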
All vectors are in , though only the first elements might be non-zero. The exception is when we use a tilde, as in (5). For in (3), the Euclidean norm of becomes
It is worth noting that the ellipsoid is a RKHS generated by the kernel where is the Kronecker delta, i.e. if and zero otherwise. The inner product is defined to satisfy the reproducing kernel property . Hence for , and . The norm induced by the inner product is such that for any vector , . This norm strictly dominates the Euclidean norm. The fact that is compact under the Euclidean norm is a consequence of the fact that is a RKHS (Li and Linde, 1999) and sharp asymptotics can be derived by related means (Graf and Luschgy, 2004).
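To make the comparison between the two norms concrete, here is a short worked inequality. It is a sketch under assumptions, since the displayed formulas are missing from this copy: we write $\lambda_k$ for the ellipsoid weights, assume $\lambda_k \ge 1$ with $\lambda_k \to \infty$, and take the RKHS norm to be the weighted norm $|a|_{\mathcal{H}}^2 = \sum_k \lambda_k^2 a_k^2$ (the notation is ours).

```latex
% Domination: every term of the weighted sum is at least the unweighted one.
\[
|a|_2^2 = \sum_{k\ge 1} a_k^2 \;\le\; \sum_{k\ge 1} \lambda_k^2 a_k^2 = |a|_{\mathcal{H}}^2 .
\]
% The converse fails: let a^{(K)} have K-th entry 1/\lambda_K and zeros elsewhere.
\[
|a^{(K)}|_2 = \lambda_K^{-1} \to 0 \quad (K\to\infty),
\qquad
|a^{(K)}|_{\mathcal{H}} = 1 \quad \text{for all } K .
\]
```

Under these assumptions a sequence can vanish in the Euclidean norm while staying on the unit sphere of the weighted norm, which is the kind of gap that Point 5 of Theorem 1 addresses.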
Once we realize such compactness, it becomes clear that it might be possible to estimate infinite AR processes under no restriction on the number of estimated coefficients. We show that this conjecture is true. We also establish convergence rates. Moreover, we want to clearly address the relation between constrained and penalized estimation.
The best approximation to minimizes the population mean square error
[TABLE]
Despite the abuse of notation, do not confuse with the entry in .
Theorem 1
Suppose that Condition 1, and hold.
1. (Consistency of Constrained Estimator) There is a random such that , and if , for any .
2. (Consistency of Penalized Estimator) Consider possibly random such that and in probability. There is a finite such that , eventually in probability and in probability.
3. (Approximation Error in ) There is an such that . Suppose the entry in satisfies with for all large enough. Then .
4. (Estimation Error in ) If , then
5. (Difference Between Norms) There is and such that in probability, but does not converge to zero in probability.
Point 1 in the theorem establishes the link between constrained and penalized estimation by finding the rate of decay of the ridge penalty so that (3) and (4) are the same. It also establishes the convergence rate of (3) towards the true in terms of (recall in Condition 1). This rate does not constrain the number of lags used once we constrain . For the finite dimensional case we trivially recover the root-n convergence by letting .
Point 2 says that if we use the penalized estimation and the penalty does not go to zero too fast (i.e. strictly slower than in Point 1) we can expect (4) to be contained in a ball in that contains the true parameter with probability going to one. Moreover, (4) is consistent under the norm .
Point 3 is concerned with the approximation error of (6) in the RKHS norm. This error might go to zero at a logarithmic rate. However, if the true coefficients decay fast, then we can have polynomial convergence rate.
Point 4 restricts the way we let in order to derive convergence rates of the estimation error under the norm .
Point 5 establishes an additional insight between the convergence under the Euclidean norm and the RKHS norm in terms of the penalty. A “slowly convergent” penalty is necessary for convergence under . Hence, this also shows that the constrained estimator (whose penalty is when ) cannot be consistent in the norm in general. This happens when choosing a rather large that leads to a binding constraint for (3).
As corollary to Points 3 and 4 in Theorem 1, we have the following.
Corollary 1
Suppose Condition 1 holds, and .
1. Choose for some . Then, there is an such that .
2. Suppose the entry in satisfies with for all large enough. Choose . Then, .
Corollary 1 imposes additional restrictions in order to improve on the statement of Point 2 in Theorem 1 by giving rates of convergence. These rates are not tight as they require unlike Point 2 in Theorem 1. However, they are useful in applications (e.g. Section 2.1.1).
Sieve estimators are often consistent under the sole condition that the number of components (here ) is of smaller order of magnitude than the sample size . In Point 1 of Theorem 1, we have shown that this is not required. Recall that is the sample size. We can have as long as . Of course, we require knowledge concerning the magnitude of the coefficients. Such knowledge is usually assumed in the literature in order to bound the approximation error.
In practice the fact that we allow might sound irrelevant. However, the asymptotic results can be seen as suggesting that, once we set the constraint, the procedure used here can be more robust to lag choice. We show this in the simulation in Section 3.
2.1.1 Application to Optimal Forecasting and Universal Consistency
Define for any . The expectation of conditioning on the infinite past is . As an application of Theorem 1 consider the following problem. Show that
[TABLE]
in probability where or ( in (4)). Hence, we want to be close to the conditional expectation of uniformly in , which is even more general than considering a moving target. The norm is useful because the previous display can be written as
[TABLE]
To obtain the inequality, we have multiplied and divided each term in the sum (on the l.h.s.) by and then used the Cauchy-Schwarz inequality and Condition 1 to set .
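The multiply-and-divide step can be written out generically as follows. This is a hedged reconstruction, since the display itself is missing from this copy: $a$ and $b$ stand for two coefficient vectors, $x_k$ for the lagged observations, and $\lambda_k$ for the ellipsoid weights, all in our own notation.

```latex
\[
\Big|\sum_{k\ge 1} (a_k-b_k)\,x_k\Big|
= \Big|\sum_{k\ge 1} \lambda_k (a_k-b_k)\,\frac{x_k}{\lambda_k}\Big|
\le \Big(\sum_{k\ge 1}\lambda_k^2 (a_k-b_k)^2\Big)^{1/2}
    \Big(\sum_{k\ge 1}\frac{x_k^2}{\lambda_k^2}\Big)^{1/2}.
\]
```

The first factor is the distance between $a$ and $b$ in the weighted norm, so a prediction error is controlled by the weighted-norm estimation error times a term that depends only on the data and the weights.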
We have that in probability, where at rate which depends on Theorem 1. Then, if
[TABLE]
we have shown that (7) goes to zero in probability. This is a weak form of universal consistency because the convergence is in probability rather than almost surely. On the positive side, the convergence holds for a variety of processes and circumstances.
If then (8) is almost surely finite if the random variables are bounded, and (7) goes to zero in probability using Point 2 in Theorem 1.
If , we can use the bound
[TABLE]
when the variables are integrable. If is such that , then the r.h.s. of (7) goes to zero in probability. If has moment generating function the r.h.s. of the above display is . Either way, to find we can use Corollary 1. Note that the argument is unchanged if for any .
Theorem 1 can also be applied to the less ambitious problem: show that
[TABLE]
in probability. In this case we want to forecast as well as the increasingly best approximation of the conditional expectation of , uniformly in . Point 4 in Theorem 1 is suited for this problem.
2.2 Choice of in Practice
The parameter can be chosen to minimize some cross-validated prediction error estimate (beware of cross-validation in a time series context, e.g. Györfi et al., 1990, Burman and Nolan, 1992, Burman et al., 1994, for discussions and applicability). Alternatively, one can choose to minimize some penalized loss function such as
[TABLE]
where and is the solution of , using the notation in (5). Here, is the sample variance of the residuals from the estimation. If the constraint is binding, solves
[TABLE]
This is then used to compute , which is the effective number of degrees of freedom implied by (Hastie et al., 2009).
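The effective number of degrees of freedom of a ridge-type fit is the trace of its hat matrix. The sketch below computes it and uses it in a grid search over the penalty. The criterion in the missing display is not recoverable here, so the AIC-type form `log(sigma2) + 2 * df / n`, the weights `lam`, and the function names are assumptions for illustration only.

```python
import numpy as np

def effective_df(X, rho, lam):
    """Trace of the hat matrix X (X'X + n*rho*W)^{-1} X', with W = diag(lam**2)."""
    n = X.shape[0]
    G = X.T @ X / n
    W = np.diag(np.asarray(lam, dtype=float) ** 2)
    # trace(X (X'X + n rho W)^{-1} X') equals trace((G + rho W)^{-1} G).
    return float(np.trace(np.linalg.solve(G + rho * W, G)))

def select_rho(X, y, lam, grid):
    """Pick the penalty on a grid by an AIC-type criterion (assumed form)."""
    n = len(y)
    G, c = X.T @ X / n, X.T @ y / n
    W = np.diag(np.asarray(lam, dtype=float) ** 2)
    best_rho, best_crit = None, np.inf
    for rho in grid:
        a = np.linalg.solve(G + rho * W, c)
        sigma2 = np.mean((y - X @ a) ** 2)  # residual variance of the fit
        crit = np.log(sigma2) + 2.0 * effective_df(X, rho, lam) / n
        if crit < best_crit:
            best_rho, best_crit = rho, crit
    return best_rho, best_crit
```

With a zero penalty and a full-rank design the degrees of freedom equal the number of lags, and they shrink as the penalty grows.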
3 Numerical Example
Asymptotic results are of interest on their own, but it is also of interest to understand the scope of applicability in practice. As a benchmark, we use predictions based on an AR model where the lag length is chosen by Akaike’s Information Criterion (AIC).
3.1 Simulated True Models
One thousand data samples are simulated from (1). The sample size is . A warm up sample of 1000 observations is used to reduce any dependence on the starting value. We also simulate a testing sample of observations to approximate the mean square error (MSE). We consider different specifications for in (1) including long memory in order to see how the procedure works when the true model is not in . In this case, an approximation error is incurred.
Short Memory
In (1), the errors are i.i.d. standard normal and the ’s are chosen to be , where . A higher value for leads to a more persistent behaviour. By construction, for both values of , the model appears to generate cycles because the roots of are outside the unit circle, but complex. We shall have different values for . Given the finite number of lags the coefficients are automatically in .
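A simulation of this kind can be sketched as follows. The coefficient values used in the paper are not recoverable from this copy, so `coefs` is a placeholder supplied by the user, and the helper names are hypothetical. The burn-in of 1000 observations follows the description in Section 3.1.

```python
import numpy as np

def is_stationary(coefs):
    """Check that all roots of 1 - sum_k coefs[k-1] z^k lie outside the unit circle."""
    coefs = np.asarray(coefs, dtype=float)
    # np.roots expects coefficients ordered from the highest degree down.
    roots = np.roots(np.r_[-coefs[::-1], 1.0])
    return bool(np.all(np.abs(roots) > 1.0))

def simulate_ar(coefs, n, burn_in=1000, seed=None):
    """Simulate n observations of an AR(p) with standard normal innovations."""
    rng = np.random.default_rng(seed)
    coefs = np.asarray(coefs, dtype=float)
    p = len(coefs)
    total = n + burn_in
    e = rng.standard_normal(total)
    x = np.zeros(total + p)  # the first p entries are zero starting values
    for t in range(p, total + p):
        # x[t - p : t][::-1] is (x[t-1], ..., x[t-p]).
        x[t] = coefs @ x[t - p : t][::-1] + e[t - p]
    return x[-n:]  # drop starting values and the burn-in sample
```

For example, `is_stationary([1.0, -0.5])` holds because the roots of `1 - z + 0.5 z**2` are `1 ± i`, which are complex and lie outside the unit circle; this is the configuration that produces cyclical behaviour.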
Long Memory Model
The model is an ARFIMA
[TABLE]
where the ’s are as in the previous paragraph. The MA polynomial is with . The coefficient of fractional integration . Hence, the model is stationary, but exhibits long memory.
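The long-memory component can be illustrated through the moving-average weights of the fractional integration filter. The recursion below is the standard expansion of `(1 - L)**(-d)`; the value of `d` used in the paper is not recoverable from this copy, so it is left as an argument, and the function name is ours.

```python
import numpy as np

def frac_int_weights(d, m):
    """First m+1 MA weights of (1 - L)^(-d): psi_0 = 1, psi_j = psi_{j-1} (j-1+d)/j."""
    psi = np.empty(m + 1)
    psi[0] = 1.0
    for j in range(1, m + 1):
        psi[j] = psi[j - 1] * (j - 1 + d) / j
    return psi
```

For `0 < d < 0.5` the weights decay hyperbolically rather than geometrically, so they are square summable (the process is stationary) but not absolutely summable, which is the sense in which the model has long memory.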
3.2 Estimation and Results
The parameter estimates are obtained from (5) with . The benchmark is an AR model with lag length chosen to minimize AIC. Denote the number of lags chosen using AIC by . We compare this to a model estimated using more lags, but with coefficients constrained in . In particular, and with chosen as outlined in Section 2.2. The goal is to verify whether the procedure is robust to lag choice. AIC is known to choose large models. We use even larger models, and verify whether we are able to obtain sensible results.
The results in Table 1 show the improvement in MSE of the constrained procedure over AIC. Table 1 shows that the procedure is robust against lag choice. This becomes evident in the long memory case. The larger model leads to relatively better performance when the true model exhibits persistence, as in (11).
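The AIC benchmark can be sketched as below. This is one common variant, not necessarily the one used for the reported results: it fits each candidate order by ordinary least squares on a common estimation sample and uses `n * log(sigma2) + 2 * p` as the criterion. The function name and the `max_lag` argument are hypothetical.

```python
import numpy as np

def aic_lag(x, max_lag):
    """Return the AR order in 0..max_lag that minimizes n*log(sigma2) + 2*p."""
    x = np.asarray(x, dtype=float)
    n = len(x) - max_lag  # common sample so that the criteria are comparable
    y = x[max_lag:]
    best_p, best_aic = 0, n * np.log(np.mean(y ** 2))
    for p in range(1, max_lag + 1):
        # Columns are the series lagged by 1, ..., p periods.
        X = np.column_stack([x[max_lag - k : max_lag - k + n] for k in range(1, p + 1)])
        a = np.linalg.lstsq(X, y, rcond=None)[0]
        sigma2 = np.mean((y - X @ a) ** 2)
        aic = n * np.log(sigma2) + 2 * p
        if aic < best_aic:
            best_p, best_aic = p, aic
    return best_p
```

The selected order can then be scaled up to set the number of lags of the constrained model, in the spirit of the comparison described above.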
4 Further Remarks
It is simple to impose linear restrictions on the coefficients of either the constrained or penalized estimator. A natural example is positivity. This is the case if we wish to estimate ARCH models of large orders. Under ARCH restrictions, the squared returns follow an AR process. The estimator does not have a closed form expression, but it is just the solution of a quadratic programming problem. Another extension pertains to vector autoregressive processes
[TABLE]
where now the variables and innovations are dimensional vectors and we use the capital to stress the multivariate framework, where is an matrix. Again, we can restrict in a suitable way. For example, we can impose that is lower triangular. This restriction has a variety of implications going from Granger causality to exogeneity and it is of much interest in econometrics (e.g., Sims, 1980). For fixed , all the results in this paper apply to this problem as well, with obvious changes if we modify the constraint to where is any matrix norm, e.g., Frobenius: , where is the transpose of .
An extension, which does not follow directly from the results derived here, is to consider the case where . This is the problem where we have a large cross-section ( is the dimension of the vector in (12)). In this case, the constraint cannot use an arbitrary matrix norm (norms are not equivalent in infinite dimensional spaces). Results in Lutz and Bühlmann (2006) together with the ones derived here can provide initial guidance on how to tackle this problem in the future.
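The positivity restriction mentioned at the start of this section turns the penalized problem into a small quadratic program. As a minimal sketch, assuming a penalty of the form `rho * sum_k lam_k**2 * a_k**2` with user-supplied weights `lam`, projected gradient descent is enough; a dedicated quadratic programming solver would serve equally well. The function name is hypothetical.

```python
import numpy as np

def nonneg_ridge(X, y, rho, lam, n_iter=5000):
    """Minimize 0.5*a'Ha - c'a subject to a >= 0, with H = X'X/n + rho*diag(lam**2)."""
    n = X.shape[0]
    G, c = X.T @ X / n, X.T @ y / n
    H = G + rho * np.diag(np.asarray(lam, dtype=float) ** 2)
    step = 1.0 / np.linalg.eigvalsh(H).max()  # 1/L for an L-smooth quadratic
    a = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # Gradient step followed by projection onto the non-negative orthant.
        a = np.maximum(a - step * (H @ a - c), 0.0)
    return a
```

When the unconstrained ridge solution is already non-negative the two coincide; otherwise the offending coefficients are set to zero and the remaining ones adjust.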
5 Proofs
We first give the short proof of Lemma 2.
Proof of Lemma 2. A stationary infinite AR process with absolutely summable AR coefficients has an infinite MA representation with absolutely summable coefficients and it is invertible (Lemma 2.1 in Bühlmann, 1995). Hence, there are coefficients ’s such that and
[TABLE]
which means that the autocovariance function is absolutely summable. The moment bound follows from the infinite MA representation and the bound on the fourth moment of the innovations.
5.1 Proof of Theorem 1
We divide the proof into two parts. One only concerns results under the Euclidean norm. The other is concerned with convergence results under the RKHS norm.
5.1.1 Consistency Under the Euclidean Norm
A few lemmas are needed for the proof. Throughout, we shall use the notation for any .
Lemma 3
For as in Condition 1 and real constants ’s, and similarly, for real constants ’s, .
Proof. Note that . Given that , then uniformly in , by Lemma 1. This implies that the previous quantity is bounded by a constant multiple of . The same argument proves the second statement in the lemma.
The ’s in the lemma above will be partial sums of cross products of ’s, which we bound using the following.
For arbitrary , the first order conditions that define (4) imply that
[TABLE]
where is the element in . By Condition 1, multiplying both sides by and summing over ,
[TABLE]
recalling the definition of and using the Cauchy-Schwarz inequality. If , and the above display clearly holds uniformly in . We need to show that there is a such . This will imply the display in the statement of the lemma.
Lemma 4
Under Condition 1,
[TABLE]
Proof. From the proof of Lemma 2, there are absolutely summable coefficients ’s, such that . For ease of notation suppose that the i.i.d. innovations have variance one and the MA coefficients are non-negative. By stationarity,
[TABLE]
where the r.h.s. holds for any . If we showed that
[TABLE]
the result would follow by summability of the coefficients. To show the above, with no loss of generality, by symmetry, consider only the case . This implies that
[TABLE]
The above is equal to
[TABLE]
By the i.i.d. condition on the innovations, the covariance is zero if the indexes are not constrained in the following sets , , . Hence, we can consider summation with indexes in these sets only. Splitting the sum according to the above index sets, we have respectively,
[TABLE]
[TABLE]
[TABLE]
By elementary change of indexes,
[TABLE]
Similarly, deduce that
[TABLE]
Finally,
[TABLE]
The bounds do not depend on beyond the fact that . Repeating the argument for , the result follows.
Lemma 4 will be used to bound quantities such as the following
[TABLE]
where the second inequality follows because . Then, by Lemma 4 the expectation is finite because and it is independent of by stationarity. In consequence the display is because convergence in implies convergence in probability.
To establish convergence rates we need two stochastic equicontinuity results.
Lemma 5
Under Condition 1, for any
[TABLE]
Proof. By the triangle inequality, (15) is bounded by
[TABLE]
By Lemma 3, there is a such that the above is bounded by a constant multiple of
[TABLE]
by summability of . For any positive , the above display can be written as
[TABLE]
We shall bound the two sums separately. By the Cauchy-Schwarz inequality, the first sum is bounded by
[TABLE]
where the inequality uses Lemma 4 and . Having set to such finite value, by the Cauchy-Schwarz inequality, the second sum is bounded by
[TABLE]
for any , using again Lemma 4, and the fact that is summable and is decreasing. The r.h.s. is then bounded by a constant multiple of . Equating with we choose , implying that and the lemma is proved.
Lemma 6
Under Condition 1, for any ,
[TABLE]
Proof. By linearity and the triangle inequality,
[TABLE]
Note that
[TABLE]
Hence, we can proceed exactly as in the proof of Lemma 5 to deduce the result.
The first part of Point 1 in the theorem will be proved in Lemma 8 (Section 5.1.2). Hence, here we shall only derive the convergence rate.
Define the empirical loss function
[TABLE]
where . When the sum inside the parenthesis only runs from to . The population loss is
[TABLE]
Define such that its first entries are as in and the remaining are all zero. The consistency proof is standard (van der Vaart and Wellner, 2000, Theorem 3.2.5) once we show the following:
[TABLE]
[TABLE]
for some . Then, for any sequence satisfying , and , we have that .
We first verify (17). Note that
[TABLE]
where is the autocovariance function (ACF) of the ’s. The estimator is uniquely identified if the matrix, say , with entry equal to , is strictly positive definite with smallest eigenvalue (see remarks after Lemma 2.2. in Kreiss et al., 2011). This is the case if the spectral density of , say , is bounded away from zero. The spectral density of the AR model (1) is given by , where with . Noting that by Condition 1, , deduce that the eigenvalues of are bounded away from zero. Hence,
[TABLE]
and (17) holds.
Using the notation , the empirical loss is equal to
[TABLE]
This implies that
[TABLE]
To verify (18), we need to bound the above uniformly in such that To this end, apply Lemma 6 to the first term on the r.h.s. to find that the uniform bound is a constant multiple of for any . By basic algebraic manipulations, the second term on the r.h.s. of the display is
[TABLE]
Note that both and are in . We apply Lemma 5 to deduce that each term on the r.h.s. of the above display is uniformly bounded in by a constant multiple of for any when . Hence (18) is verified with . When we are only interested in a finite dimensional model, we can take to deduce that , which is the parametric case.
To find note that
[TABLE]
Also, for some using Lemma 1 and bounding the sum with an integral and using the fact that is slowly varying at infinity. Hence we deduce that as stated in Point 1 of the theorem.
5.1.2 Consistency Under the RKHS Norm
The proof depends on a few preliminary lemmas. Let be the penalized population estimator
[TABLE]
The following can be deduced from Theorem 5.9 in Steinwart and Christmann (2008, eq. 5.14). The proof is given, as the context might seem different at first sight.
Lemma 7
Suppose Condition 1. For arbitrary but fixed , consider and in (4) and (20) with possibly diverging to infinity. Then,
[TABLE]
where is the entry in the dimensional vector , and similarly for .
Proof. By convexity of the square error loss,
[TABLE]
Note the following algebraic equality,
[TABLE]
The above two displays imply
[TABLE]
where the most r.h.s. follows because minimizes the empirical penalized risk. The first order conditions for read
[TABLE]
for . Substituting this in the previous display,
[TABLE]
Rearranging and using the definition of , deduce that
[TABLE]
using the Cauchy-Schwarz inequality in the last step. This implies the result of the lemma after simple rearrangement.
The next lemma establishes the relation between the constrained and penalized estimator and states a bound for the distance between the sample and population penalized estimator under the RKHS norm.
Lemma 8
Suppose that . Under Condition 1, if , and is as in (4), there is such that and
[TABLE]
where the above bound holds uniformly in . In consequence, there is a such that .
Moreover, for any ,
[TABLE]
Proof. Suppose that as otherwise, by the first order conditions, the r.h.s. in the first display in the statement of lemma is exactly zero and there is nothing to prove.
By the triangle inequality,
[TABLE]
For , , as the penalized population estimator must have norm no larger than . By this remark and the fact that , there is an such that the first term on the r.h.s. is . Lemma 7 gives
[TABLE]
Adding and subtracting , and then using the basic inequality for any real , the r.h.s. is
[TABLE]
Recalling that our goal is to bound the second term on the r.h.s. of (22), the above two displays imply that
[TABLE]
To bound on the r.h.s. note that for ,
[TABLE]
(recall is the ACF) so that
[TABLE]
because the coefficients are summable. Hence, it is possible to find a such that . To bound , recall that for any , and write
[TABLE]
for ease of notation. Then, for ,
[TABLE]
using Lemma 3 in the second inequality and summability of the coefficients in the last step. By Lemma 4, for some finite absolute constant . Hence, deduce that , which implies that . Hence, there is a such that . The control of implies that (24) is not greater than for suitable . Hence, we have shown that there is a such that (22) is not greater than . This bound for (22) together with (14) proves the first display in the lemma. To see that this also implies that there is a such that note that is non-decreasing as . Hence, for the smallest such that
The last statement in the lemma follows from (23) and the just derived bound for (24).
We now estimate the approximation error.
Lemma 9
For any , we have that as where is as in (6). Moreover, if , then .
Proof. The first part of the lemma is just Theorem 5.17 in Steinwart and Christmann (2008). Hence, we only need to prove the second statement. Let be the matrix with entry and let be the first column in . Let be the first entries in . Recall that in both and all entries are zero. Then, , and writing for as in (5),
[TABLE]
By the Woodbury identity (Petersen and Pedersen, 2012, eq.159)
[TABLE]
we have that
[TABLE]
Hence,
[TABLE]
using the definitions of and . For any square matrix and compatible vector , , where is the maximum eigenvalue of . Define . Given that , then, . Hence, we only need to find the maximum eigenvalue of to bound the above display. The following inequalities hold for the eigenvalues of the product of two positive definite matrices and :
[TABLE]
where and are the maximum and minimum eigenvalues of the matrix argument (Bhatia, 1997, problem III.6.14, p.78). In order to derive (19), we argued that has minimum eigenvalue bounded away from zero. Hence, has eigenvalues in . The matrix has eigenvalues equal to 1 plus the eigenvalues of . Hence deduce that . This is just as required.
We need a final approximation result.
Lemma 10
Recall (6). If , then as . If also with , then, .
Proof. Recall the definition of just before (17). Let have the same first entries as as . Write where . Given that is the population ordinary least square estimator, using the same notation as in the proof of Lemma 9,
[TABLE]
We need to show that the second term goes to zero under the norm . Given that the innovations are i.i.d., the expectation is equal to
[TABLE]
Hence,
[TABLE]
We need to show that this converges to zero. By similar arguments as in the proof of Lemma 9, deduce that
[TABLE]
so that it is sufficient to bound the square root of the above display. We have that
[TABLE]
Note that , and by Lemma 1 the autocovariance function is summable. Moreover . Hence, when holds true, the above display can be bounded by a constant multiple of
[TABLE]
Finally, by definition of ,
[TABLE]
This implies that . If we only assume that , then for some by Lemma 1. Substituting in the above display, we have a logarithmic convergence rate rather than polynomial.
We can now prove Points 2-5 in Theorem 1. If , then, there is a finite such that . Hence, by Lemmas 7 and 8, deduce that and also that eventually in probability. Hence, if in probability, by Lemma 9, in probability irrespective of the fact that . By Lemma 10, as , so that the triangle inequality gives in probability under the sole condition that in probability. This proves Point 2.
The approximation rates in Point 3 are from Lemma 10.
To show Point 4, use Lemma 9 for the approximation error of the penalized estimator. We need for the lemma to apply. Use Lemmas 7 and 8 to derive the estimation error relative to the penalized estimator. Hence, deduce that . Equating the two terms inside the , this quantity is when . This choice of satisfies as long as , as required.
We now prove Point 5. Lemma 8 also shows that for the constrained problem, the Lagrange multiplier is , and the constraint is possibly binding. In fact, there is a large enough relative to , such that the constraint needs to be binding. Then, , and from Lemma 8 we deduce that . Hence, if there is an such that . Then, we must have
[TABLE]
But . Hence, the above display is greater than or equal to
[TABLE]
This means that cannot converge under the norm .
5.2 Proof of Corollary 1
We now prove Point 1 in the corollary. By Point 4 in Theorem 1, the estimation error is as long as for ; we also require which under the condition on also satisfies . Point 3 in Theorem 1 gives an approximation error of order because . Hence, we deduce the first part of the corollary.
To derive Point 2, consider Point 3 in Theorem 1 under the additional condition on the decay rate of the true coefficients. Point 4 in the same theorem gives again the estimation error. From the sum of the two errors deduce that . Equating the coefficients this is when . Once again, the bound on the estimation error requires that . Under the condition on this ensures that , which is required.
References
- [1] Bhatia, R. (1997) Matrix Analysis. New York: Springer.
- [2] Bühlmann, P. (1995) Moving-average representation for autoregressive approximations. Stochastic Processes and their Applications 60, 331-342.
- [3] Bühlmann, P. (1997) Sieve Bootstrap for Time Series. Bernoulli 3, 123-148.
- [4] Burman, P. and D. Nolan (1992) Data-Dependent Estimation of Prediction Functions. Journal of Time Series Analysis 13, 189-207.
- [5] Burman, P., E. Chow and D. Nolan (1994) A Cross-Validatory Method for Dependent Data. Biometrika 81, 351-358.
- [6] Graf, S. and H. Luschgy (2004) Sharp Asymptotics of the Metric Entropy for Ellipsoids. Journal of Complexity 20, 876-882.
- [7] Györfi, L., W. Härdle, P. Sarda and P. Vieu (1990) Nonparametric Curve Estimation from Time Series. Heidelberg: Springer.
- [8] Györfi, L. and A. Sancetta (2015) An open problem on strongly consistent learning of the best prediction for Gaussian processes. In M. Akritas, S.N. Lahiri and D. Politis (eds.), Proceedings of the First Conference of the International Society of Nonparametric Statistics. Heidelberg: Springer.
