Consistency Results for Stationary Autoregressive Processes with Constrained Coefficients

Alessio Sancetta

arXiv:1706.02492·stat.ML·June 9, 2017·IEEE Trans. Inf. Theory


TL;DR

This paper investigates the estimation consistency of stationary autoregressive processes with constrained coefficients, demonstrating theoretical results and practical benefits of including constraints directly in estimation.

Contribution

It provides new consistency results for constrained and penalized estimators in autoregressive models with coefficients in an ellipsoid, including universal consistency and robustness insights.

Findings

1. Constrained estimators improve robustness in autoregressive process estimation.
2. Consistency results hold under various norms for these estimators.
3. Simulations confirm practical advantages of direct constraint inclusion.

Abstract

We consider stationary autoregressive processes with coefficients restricted to an ellipsoid, which includes autoregressive processes with absolutely summable coefficients. We provide consistency results under different norms for the estimation of such processes using constrained and penalized estimators. As an application we show some weak form of universal consistency. Simulations show that directly including the constraint in the estimation can lead to more robust results.

Tables

Table 1: Simulation Results. For Short Memory the process is as in (1) with number of true AR coefficients equal to $K_{0}$ and AR coefficients satisfying $\varphi_{k}=\bar{\varphi}k^{-1/2}/\left(\sum_{k=1}^{K_{0}}k^{-1/2}\right)$, where $\bar{\varphi}=0.75,\,0.99$. For Long Memory, the process is as in (11). Entries denote the MSE improvement relative to the MSE of a model with lag length $K_{AIC}$ chosen using AIC. MSE in the numerator in the calculation of the relative improvement is computed using lag length $2K_{AIC}$ and $4K_{AIC}$ and constraining the coefficients in $\mathcal{E}\left(B\right)$, where $B$ is chosen as described in Section 2.2.

| | $K_{0}=100$, $2K_{AIC}$ | $K_{0}=100$, $4K_{AIC}$ | $K_{0}=1000$, $2K_{AIC}$ | $K_{0}=1000$, $4K_{AIC}$ |
|---|---|---|---|---|
| Short Memory | | | | |
| $\bar{\varphi}=0.75$ | 0.99 | 0.99 | 0.99 | 0.99 |
| $\bar{\varphi}=0.99$ | 0.99 | 0.99 | 0.99 | 0.99 |
| Long Memory | | | | |
| $\bar{\varphi}=0.75$ | 0.93 | 0.88 | 0.94 | 0.88 |
| $\bar{\varphi}=0.99$ | 0.93 | 0.88 | 0.94 | 0.88 |

Equations

$$Y_{t}=\sum_{k=1}^{\infty}\varphi_{k}Y_{t-k}+\varepsilon_{t}$$

$$Y_{t}=\sum_{k=1}^{K}b_{k}Y_{t-k}+\varepsilon_{t}$$

$$\mathcal{E}_{K}\left(B\right):=\left\{b\in\mathbb{R}^{\infty}:\sum_{k=1}^{\infty}b_{k}^{2}\lambda_{k}^{2}\leq B^{2},\,b_{k}=0\text{ for }k>K\right\}.$$

$$b_{n}=\arg\inf_{b\in\mathcal{E}_{K}\left(B\right)}\frac{1}{n}\sum_{t=1}^{n}\left(Y_{t}-\sum_{k=1}^{\infty}b_{k}Y_{t-k}\right)^{2}$$

$$b_{n,\tau}:=\arg\inf_{b\in\mathcal{E}_{K}}\frac{1}{n}\sum_{t=1}^{n}\left(Y_{t}-\sum_{k=1}^{\infty}b_{k}Y_{t-k}\right)^{2}+\tau\sum_{k=1}^{\infty}\lambda_{k}^{2}b_{k}^{2},$$

$$\frac{1}{n}\left(Y-X\tilde{b}\right)^{T}\left(Y-X\tilde{b}\right)+\tau\tilde{b}^{T}\Lambda^{2}\tilde{b}$$

$$\varphi_{K}=\arg\inf_{b\in\mathcal{E}_{K}}\mathbb{E}\left(Y_{1}-\sum_{k=1}^{\infty}b_{k}Y_{1-k}\right)^{2}$$

$$\sup_{t\in\mathcal{T}}\left|X_{t}\left(\varphi\right)-X_{t}\left(b_{n,\tau}\right)\right|\rightarrow0$$

$$\sup_{t\in\mathcal{T}}\left|X_{t}\left(\varphi-b_{n,\tau}\right)\right|$$

$$\sup_{t\in\mathcal{T}}\left(\sum_{k=1}^{\infty}\left(\frac{Y_{t-k}}{k^{\lambda}}\right)^{2}\right)^{1/2}=o_{p}\left(\epsilon_{n}^{-1}\right),$$

$$\left(\mathbb{E}\sup_{t\in\left(0,n\right)}\sum_{k=1}^{\infty}Y_{t-k}^{2}k^{-2\lambda}\right)^{1/2}\leq n^{1/\left(2p\right)}\sup_{t\in\left(0,n\right)}\left(\mathbb{E}\sum_{k=1}^{\infty}Y_{t-k}^{2p}k^{-2\lambda p}\right)^{1/\left(2p\right)}$$

$$\lim_{K\rightarrow\infty}\sup_{t\in\mathcal{T}}\left|X_{t}\left(\varphi_{K}\right)-X_{t}\left(b_{n,\tau}\right)\right|\rightarrow0$$

$$\ln\hat{\sigma}_{B}^{2}+\frac{2\,\mathrm{df}\left(B\right)}{n}$$

$$Y^{T}X\left(X^{T}X+\tau_{B,n}n\Lambda^{2}\right)^{-2}X^{T}Y=B^{2}.$$

$$Y_{t}=\sum_{k=1}^{K_{0}}\varphi_{k}Y_{t-k}+\left(1-L\right)^{-d}\left(\sum_{l=0}^{L}\theta_{l}\varepsilon_{t-l}\right)$$

$$Y_{t}=\sum_{k=1}^{\infty}\Phi_{k}Y_{t-k}+\varepsilon_{t}$$

$$\sum_{k=1}^{\infty}\left|\mathbb{E}Y_{t}Y_{t-k}\right|\leq\sigma^{2}\sum_{k=1}^{\infty}\sum_{s=0}^{\infty}\left|\psi_{s+k}\right|\left|\psi_{s}\right|<\infty,$$

$$b_{n,\tau,k}=-\frac{1}{2\tau\lambda_{k}^{2}}\frac{1}{n}\sum_{t=1}^{n}\left(Y_{t}-X_{t}\left(b_{n,\tau}\right)\right)Y_{t-k}$$

$$\frac{1}{n}\sum_{t=1}^{n}\left(Y_{t}-X_{t}\left(b_{n,\tau}\right)\right)X_{t}\left(a\right)$$

$$\leq2\tau\sum_{k=1}^{K}\lambda_{k}^{2}b_{n,\tau,k}^{2}\sum_{k=1}^{K}\lambda_{k}^{2}a_{k}^{2},$$

$$\sup_{n,k,l>0}\mathbb{E}\left|\frac{1}{n}\sum_{t=1}^{n}\left(1-\mathbb{E}\right)Y_{t-k}Y_{t-l}\right|^{2}<\infty.$$

$$\mathbb{E}\left|\frac{1}{n}\sum_{t=1}^{n}\left(1-\mathbb{E}\right)Y_{t-k}Y_{t-l}\right|^{2}\leq2\sum_{s=0}^{n}\mathbb{E}\left[\left(1-\mathbb{E}\right)Y_{t-k}Y_{t-l}\right]\left[\left(1-\mathbb{E}\right)Y_{t-s-k}Y_{t-s-l}\right],$$

$$\mathbb{E}\left[\left(1-\mathbb{E}\right)Y_{t-k}Y_{t-l}\right]\left[\left(1-\mathbb{E}\right)Y_{t-s-k}Y_{t-s-l}\right]\lesssim\psi_{s}$$

$$\mathbb{E}\left[\left(1-\mathbb{E}\right)Y_{t-k}Y_{t-l}\right]\left[\left(1-\mathbb{E}\right)Y_{t-s-k}Y_{t-s-l}\right]$$

$$\sum_{u_{1}=0}^{\infty}\sum_{u_{2}=0}^{\infty}\sum_{u_{3}=0}^{\infty}\sum_{u_{4}=0}^{\infty}\psi_{u_{1}}\psi_{u_{2}}\psi_{u_{3}}\psi_{u_{4}}\mathrm{Cov}\left(\varepsilon_{t-k-u_{1}}\varepsilon_{t-l-u_{2}},\varepsilon_{t-s-k-u_{3}}\varepsilon_{t-s-l-u_{4}}\right).$$

$$I=\sum_{u=0}^{\infty}\sum_{v=0}^{\infty}\psi_{u+l-k}\psi_{u}\psi_{v+l-k}\psi_{v}\mathrm{Cov}\left(\varepsilon_{0}^{2},\varepsilon_{u-\left(s+v\right)}^{2}\right),$$

$$II=\sum_{u=0}^{\infty}\sum_{v=0}^{\infty}\psi_{u+s}\psi_{v+s}\psi_{u}\psi_{v}\mathbb{E}\varepsilon_{0}^{2}\varepsilon_{\left(u-v\right)+\left(k-l\right)}^{2},$$

$$III=\sum_{u=0}^{\infty}\sum_{v=0}^{\infty}\psi_{u+s+\left(l-k\right)}\psi_{v+s+\left(k-l\right)}\psi_{u}\psi_{v}\mathbb{E}\varepsilon_{0}^{2}\varepsilon_{\left(u-v-s\right)+\left(k-l\right)}^{2}.$$

$$I$$

$$II\lesssim\left(\sum_{u=0}^{\infty}\psi_{u}\psi_{u+s}\right)^{2}\leq\psi_{s}^{2}\left(\sum_{u=0}^{\infty}\psi_{u}\right)^{2}\lesssim\psi_{s}^{2}.$$

$$III\lesssim\sum_{u=0}^{\infty}\sum_{v=0}^{\infty}\psi_{u}\psi_{v}\psi_{u+s+\left(l-k\right)}\psi_{v+s+\left(k-l\right)}\leq\psi_{s}\left(\sum_{u=0}^{\infty}\sum_{v=0}^{\infty}\psi_{v}\psi_{u}\right)\lesssim\psi_{s}.$$


Full text

Consistency Results for Stationary Autoregressive Processes with Constrained Coefficients

Alessio Sancetta

Acknowledgements: I am grateful to Luca Mucciante for insightful conversations. E-mail: [email protected], URL: http://sites.google.com/site/wwwsancetta/. Address for correspondence: Department of Economics, Royal Holloway University of London, Egham TW20 0EX, UK.

Abstract

We consider stationary autoregressive processes with coefficients restricted to an ellipsoid, which includes autoregressive processes with absolutely summable coefficients. We provide consistency results under different norms for the estimation of such processes using constrained and penalized estimators. As an application we show some weak form of universal consistency. Simulations show that directly including the constraint in the estimation can lead to more robust results.

Key Words: consistency, empirical process, ridge regression, reproducing kernel Hilbert space, universal consistency.

1 Introduction

It is common to impose constraints on the decay rate of the autoregressive coefficients in order to derive results amenable to estimation for the purpose of prediction. At a minimum, these constraints tend to require that the AR coefficients be absolutely summable. A natural approach when dealing with high order autoregressive models is then sieve estimation, which has been considered for infinite AR models by various authors. For universal consistency, Schäfer (2002) derived perhaps the strongest result possible. Györfi and Sancetta (2015) review some of these results. For convergence in probability, various authors have considered infinite AR models and their applications, e.g. Bühlmann (1997) and Kreiss et al. (2011). Additional references can be found in the cited papers.

Here, we constrain the autoregressive coefficients to lie in an infinite dimensional ellipsoid such that coefficients associated with higher order lags decay fast. Then, we can exploit the fact that the ellipsoid is compact under the $\ell_{2}$ norm in order to derive asymptotic results. The conditions essentially require the autoregressive coefficients to be absolutely summable. We shall see that the vector of autoregressive coefficients can be seen as an element in a Reproducing Kernel Hilbert Space (RKHS) when $\ell_{2}$ is equipped with a suitable inner product. This allows us to exploit all the existing machinery for estimation in RKHS and build on it (see Steinwart and Christmann, 2008, for a comprehensive review). The main ingredient is penalized least squares estimation. We also consider the constrained least squares problem. Penalized and constrained estimation are dual problems for specific values of the penalty coefficient. Our result establishes the relation between the two problems and the consistency rates. In general, they can lead to different consistency results under different norms. One norm is the usual Euclidean norm of the vector of coefficients while the other is the norm of the RKHS. We show that consistency under the latter has important implications for prediction problems.

In general, unlike existing results, we are able to establish consistency as both the autoregressive order and the sample size go to infinity with no constraint on the rates. Existing results use the machinery of the method of sieves, hence they require the autoregressive order to go to infinity in a controlled way. As already mentioned, we are able to avoid this restriction because the ellipsoid is compact under the Euclidean norm.

The plan for the paper is as follows. Section 2 reviews the estimation method and presents the consistency results. A numerical example is provided in Section 3. Section 4 mentions extensions to other processes such as vector autoregressive processes (VAR). The proof of the consistency results is long and is given in Section 5.

2 Estimation Method

We restrict attention to the infinite order autoregressive process

$$Y_{t}=\sum_{k=1}^{\infty}\varphi_{k}Y_{t-k}+\varepsilon_{t}\qquad(1)$$

for some mean zero independent identically distributed (i.i.d.) sequence $\left(\varepsilon_{t}\right)_{t\in\mathbb{Z}}$ and unknown coefficients $\varphi_{k}$. This paper considers estimators of the above under the condition that $\sum_{k=1}^{\infty}\left|\varphi_{k}\right|\leq\bar{\varphi}<\infty$.

In a finite sample, the above model can only be approximated by the finite dimensional model

$$Y_{t}=\sum_{k=1}^{K}b_{k}Y_{t-k}+\varepsilon_{t}$$

with $K\rightarrow\infty$. While this is essentially a sieve, we do not necessarily require $K$ to be of smaller order than the sample size. Here, we restrict the coefficients to an ellipsoid defined as follows. Let the $\lambda_{k}$'s be positive constants such that $\lambda_{k}\asymp k^{\lambda}$ for $\lambda>0$, where $\asymp$ means that the left hand side (l.h.s.) and the right hand side (r.h.s.) are proportional. Define the ellipsoid as

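$$\mathcal{E}_{K}\left(B\right):=\left\{b\in\mathbb{R}^{\infty}:\sum_{k=1}^{\infty}b_{k}^{2}\lambda_{k}^{2}\leq B^{2},\,b_{k}=0\text{ for }k>K\right\}.$$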

Given that the $\lambda_{k}$'s are increasing, the $b_{k}$'s need to be smaller in absolute value as $k$ increases. Write $\mathcal{E}\left(B\right)=\bigcup_{K>0}\mathcal{E}_{K}\left(B\right)$ for the ellipsoid where all coefficients can be non-zero, $\mathcal{E}_{K}=\bigcup_{B<\infty}\mathcal{E}_{K}\left(B\right)$ and $\mathcal{E}=\bigcup_{B<\infty}\mathcal{E}\left(B\right)$, so for example $\mathcal{E}=\left\{b\in\mathbb{R}^{\infty}:\sum_{k=1}^{\infty}b_{k}^{2}\lambda_{k}^{2}<\infty\right\}$ is the ellipsoid that is restricted to have finite but decreasing principal axes. The following condition will be imposed on the ellipsoid.

Condition 1

The sequence $\left(Y_{t}\right)_{t\in\mathbb{Z}}$ follows the process (1) with $\varphi\in\mathcal{E}$ and $\lambda_{k}\asymp k^{\lambda}$ , where $\lambda>1/2$ . Moreover, $1-\sum_{k=1}^{\infty}\varphi_{k}z^{k}=0$ only for $z$ outside the unit circle. The innovations $\left(\varepsilon_{t}\right)_{t\in\mathbb{Z}}$ are independent identically distributed with finite fourth moment.
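To make the ellipsoid in Condition 1 concrete, the following minimal sketch computes the weighted norm $\left(\sum_{k}b_{k}^{2}\lambda_{k}^{2}\right)^{1/2}$ of a finite coefficient vector and checks membership in $\mathcal{E}_{K}\left(B\right)$. It assumes $\lambda_{k}=k^{\lambda}$ exactly (the text only requires proportionality); the function names `ellipsoid_norm` and `in_ellipsoid` and the example coefficients are hypothetical.

```python
import math

def ellipsoid_norm(b, lam):
    """Weighted norm (sum_k b_k^2 * k^(2*lam))^(1/2), assuming lambda_k = k**lam."""
    return math.sqrt(sum((bk * (k ** lam)) ** 2 for k, bk in enumerate(b, start=1)))

def in_ellipsoid(b, lam, B):
    """True if the finite vector b (zero beyond len(b)) lies in E_K(B), K = len(b)."""
    return ellipsoid_norm(b, lam) <= B

# Illustrative coefficients decaying like k^(-2), with lam = 0.75 > 1/2.
b = [k ** -2.0 for k in range(1, 51)]
print(ellipsoid_norm(b, 0.75), in_ellipsoid(b, 0.75, B=2.0))
```

Coefficients that decay more slowly need a larger $B$, which is the sense in which the ellipsoid forces higher order lags to be small.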

Throughout, when writing $\mathcal{E}_{K}\left(B\right)$ and similar quantities, it is understood that the $\lambda_{k}$'s are as in Condition 1. The following is stated for convenience.

Lemma 1

If $b\in\mathcal{E}\left(B\right)$ then, $b_{k}\lesssim k^{-\left(2\lambda+1\right)/2}/\ln^{1+\epsilon}\left(1+k\right)$ for some $\epsilon>0$ , where $\lesssim$ is inequality up to a fixed absolute multiplicative constant.

In consequence, Condition 1 implies absolutely summable autoregressive coefficients. Note that absolute summability would just require $\lambda\geq 1/2$ in Condition 1 rather than $\lambda>1/2$ , hence the condition we use is a bit more restrictive. The following states additional properties of the model.

Lemma 2

Under Condition 1, $\left(Y_{t}\right)_{t\in\mathbb{Z}}$ is stationary and ergodic with absolutely summable autocovariance function and $\mathbb{E}Y_{t}^{4}<\infty$ .

It is well known that for the AR process, $1-\sum_{k=1}^{\infty}\varphi_{k}z^{k}=0$ only for $z$ outside the unit circle if the autocovariance function is absolutely summable and the spectral density is strictly positive and continuous (Kreiss et al., 2011, Corollary 2.1).

Note that there are processes (even Gaussian) that satisfy Condition 1, but fail to be beta mixing (Doukhan, 1995, Theorem 3, p.59). The beta mixing assumption is often conveniently used when proving convergence using methods from empirical process theory. Alas, it cannot be used here.

2.1 Estimation and Consistency

The goal is to find an estimator for $\varphi$. We consider two approaches: constrained least squares and penalized least squares. By duality, the two can be made equivalent by a suitable choice of the penalty parameter. However, in the constrained case, the penalty turns out to be sample dependent, while in penalized estimation this is not necessarily the case.

To avoid notational trivialities, suppose that the sample size is $N=n+K$ . This will be assumed without further notice throughout the paper. In particular, our sample is $Y_{-\left(K-1\right)},Y_{-\left(K-2\right)},...,Y_{0},Y_{1},...,Y_{n}$ . This also stresses the fact that $n$ and $K$ can go to infinity at different rates.

In the constrained problem, we estimate $b\in\mathcal{E}_{K}\left(B\right)$ . The constrained estimator is defined as

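$$b_{n}=\arg\inf_{b\in\mathcal{E}_{K}\left(B\right)}\frac{1}{n}\sum_{t=1}^{n}\left(Y_{t}-\sum_{k=1}^{\infty}b_{k}Y_{t-k}\right)^{2}\qquad(3)$$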

Of course, in the above, $\sum_{k=1}^{\infty}b_{k}Y_{t-k}=\sum_{k=1}^{K}b_{k}Y_{t-k}$ if $b\in\mathcal{E}_{K}\left(B\right)$ .

In the penalized problem, we estimate $b\in\mathcal{E}_{K}$ , but introduce the penalty parameter $\tau>0$ . The penalized estimator is defined as

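$$b_{n,\tau}:=\arg\inf_{b\in\mathcal{E}_{K}}\frac{1}{n}\sum_{t=1}^{n}\left(Y_{t}-\sum_{k=1}^{\infty}b_{k}Y_{t-k}\right)^{2}+\tau\sum_{k=1}^{\infty}\lambda_{k}^{2}b_{k}^{2},\qquad(4)$$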

where the $\lambda_{k}$'s are from the definition of $\mathcal{E}$. By use of the Lagrangian, we can always rewrite (3) as (4) for a suitable choice of $\tau$, i.e. there is a $\tau=\tau_{B,n}$ ($\tau=0$ if the constraint is not binding) such that $b_{n,\tau}=b_{n}$.

Both problems can be reformulated in matrix form using the Lagrangian. Let $X$ be the $n\times K$ dimensional matrix with $\left(t,k\right)^{th}$ entry equal to $Y_{t-k}$ and $Y$ be the $n$ -dimensional vector with $t^{th}$ entry $Y_{t}$ . Also, let $\Lambda$ be the $K\times K$ diagonal matrix with $k^{th}$ diagonal entry equal to $\lambda_{k}$ . The estimator for either (3) or (4) is found by minimizing the penalized least square criterion with respect to (w.r.t.) $\tilde{b}\in\mathbb{R}^{K}$ ,

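$$\frac{1}{n}\left(Y-X\tilde{b}\right)^{T}\left(Y-X\tilde{b}\right)+\tau\tilde{b}^{T}\Lambda^{2}\tilde{b}\qquad(5)$$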

where for (3) $\tau$ is chosen so that the constraint $\tilde{b}^{T}\Lambda^{2}\tilde{b}\leq B^{2}$ is satisfied. In this latter case, $\tau$ is necessarily random because the constraint needs to be satisfied in sample. Here the tilde in $\tilde{b}$ is used to remind us that in the matrix formulation, $b$ is truncated to be a $K$ dimensional vector, as all entries with index larger than $K$ are zero by definition of $\mathcal{E}_{K}$. The solution is the usual ridge regression estimator $\tilde{b}_{n,\tau}:=\left(X^{T}X+n\tau\Lambda^{2}\right)^{-1}X^{T}Y$, where the factor $n$ comes from the $1/n$ scaling of the squared loss in (5).
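The matrix formulation can be sketched in a few lines. This is only an illustration under stated assumptions: $\lambda_{k}=k^{\lambda}$ exactly, the helper names `lag_matrix` and `ridge_ar` are hypothetical, and the AR(1) data are simulated here rather than taken from the paper. Setting the gradient of (5) to zero gives the linear system $\left(X^{T}X+n\tau\Lambda^{2}\right)\tilde{b}=X^{T}Y$, which the code solves directly.

```python
import numpy as np

def lag_matrix(y, K):
    """Return (Y, X): Y has entries y_t and X has (t, k) entry y_{t-k}, k = 1..K."""
    y = np.asarray(y, dtype=float)
    n = len(y) - K                      # total sample size is N = n + K
    X = np.column_stack([y[K - k : K - k + n] for k in range(1, K + 1)])
    return y[K:], X

def ridge_ar(y, K, tau, lam):
    """Minimize (1/n)|Y - Xb|^2 + tau * b' Lambda^2 b, assuming lambda_k = k**lam."""
    Y, X = lag_matrix(y, K)
    n = len(Y)
    Lambda2 = np.diag(np.arange(1, K + 1, dtype=float) ** (2 * lam))
    # First-order condition of the criterion: (X'X + n*tau*Lambda^2) b = X'Y.
    return np.linalg.solve(X.T @ X + n * tau * Lambda2, X.T @ Y)

# Illustrative AR(1) with coefficient 0.5 (simulated data, not from the paper).
rng = np.random.default_rng(0)
y = np.zeros(2000)
for t in range(1, len(y)):
    y[t] = 0.5 * y[t - 1] + rng.standard_normal()
print(ridge_ar(y, K=10, tau=1e-3, lam=0.75)[:3])
```

Because the penalty weights grow with the lag index, the higher order coefficients are shrunk more strongly than the first few.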

For problem (4), $\tau=\tau_{n}$ can go to zero in a controlled way. For problem (3), $\tau=\tau_{B,n}\geq 0$ must be chosen so that the constraint is satisfied. Such $\tau_{B,n}$ is positive if the constraint is binding, and zero otherwise. This is equivalent to replacing $\tau\tilde{b}^{T}\Lambda^{2}\tilde{b}$ with $\tau\left(\tilde{b}^{T}\Lambda^{2}\tilde{b}-B^{2}\right)$ in (5), and minimizing the so modified objective function (5) w.r.t. $\tilde{b}$ and $\tau\geq 0$. The minimizer w.r.t. $\tau$ is $\tau_{B,n}$.
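One way to find $\tau_{B,n}$ numerically is bisection, since the weighted norm $\tilde{b}^{T}\Lambda^{2}\tilde{b}$ of the ridge solution decreases as the penalty grows. The sketch below is an assumption-laden illustration, not the paper's procedure: it takes $\lambda_{k}=k^{\lambda}$, assumes $X^{T}X$ is invertible so that the unpenalized fit exists, and uses a hypothetical function name `constrained_ridge` and synthetic data.

```python
import numpy as np

def constrained_ridge(X, Y, B, lam, tol=1e-10, max_iter=200):
    """Return (b, tau) with b' Lambda^2 b <= B^2, assuming lambda_k = k**lam."""
    n, K = X.shape
    Lambda2 = np.diag(np.arange(1, K + 1, dtype=float) ** (2 * lam))
    XtX, XtY = X.T @ X, X.T @ Y

    def solve(tau):
        b = np.linalg.solve(XtX + n * tau * Lambda2, XtY)
        return b, float(b @ Lambda2 @ b)

    b, norm2 = solve(0.0)
    if norm2 <= B ** 2:                 # constraint not binding: tau = 0
        return b, 0.0
    lo, hi = 0.0, 1.0
    while solve(hi)[1] > B ** 2:        # grow hi until the constraint holds
        hi *= 2.0
    for _ in range(max_iter):           # bisection: the norm is decreasing in tau
        mid = 0.5 * (lo + hi)
        if solve(mid)[1] > B ** 2:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return solve(hi)[0], hi

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
Y = X @ np.array([0.5, 0.2, 0.1, 0.0, 0.0]) + 0.1 * rng.standard_normal(200)
b, tau = constrained_ridge(X, Y, B=0.3, lam=0.75)
print(tau, b)
```

The returned penalty depends on the sample through $X$ and $Y$, which is the sense in which $\tau_{B,n}$ is random.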

All vectors are in $\mathbb{R}^{\infty}$, though only the first $K$ elements might be non-zero. The exception is when we use a tilde, as in (5). For $b_{n}$ in (3), the Euclidean norm of $b_{n}-\varphi$ becomes $\left|b_{n}-\varphi\right|_{2}=\left(\sum_{k=1}^{K}\left|b_{nk}-\varphi_{k}\right|^{2}+\sum_{k>K}\left|\varphi_{k}\right|^{2}\right)^{1/2}$.

It is worth noting that the ellipsoid $\mathcal{E}\subset\ell_{2}$ is a RKHS generated by the kernel $C\left(k,l\right)=\sum_{v=1}^{\infty}\lambda_{v}^{-2}\delta_{v,k}\delta_{v,l}$ where $\delta_{v,l}$ is the Kronecker delta, i.e. $\delta_{v,l}=1$ if $v=l$ and zero otherwise. The inner product $\left\langle\cdot,\cdot\right\rangle_{\mathcal{E}}$ is defined to satisfy the reproducing kernel property $\left\langle C\left(\cdot,l\right),C\left(\cdot,k\right)\right\rangle_{\mathcal{E}}=C\left(k,l\right)$. Hence for $a,b\in\mathcal{E}$, $b_{k}=\left\langle b,C\left(\cdot,k\right)\right\rangle_{\mathcal{E}}$ and $\left\langle a,b\right\rangle_{\mathcal{E}}=\sum_{v=1}^{\infty}\lambda_{v}^{2}a_{v}b_{v}$. The norm induced by the inner product is $\left|\cdot\right|_{\mathcal{E}}$ such that for any vector $b\in\mathbb{R}^{\infty}$, $\left|b\right|_{\mathcal{E}}^{2}=\sum_{k=1}^{\infty}\lambda_{k}^{2}b_{k}^{2}$. This norm strictly dominates the Euclidean norm. The fact that $\mathcal{E}\left(1\right)$ is compact under the Euclidean norm is a consequence of the fact that $\mathcal{E}$ is a RKHS (Li and Linde, 1999) and sharp asymptotics can be derived by related means (Graf and Luschgy, 2004).

Once we realize such compactness, it becomes clear that it might be possible to estimate infinite AR processes under no restriction on the number of estimated coefficients. We show that this conjecture is true. We also establish convergence rates. Moreover, we want to clearly address the relation between constrained and penalized estimation.

The best approximation $\varphi_{K}\in\mathcal{E}_{K}$ to $\varphi$ minimizes the population mean square error

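$$\varphi_{K}=\arg\inf_{b\in\mathcal{E}_{K}}\mathbb{E}\left(Y_{1}-\sum_{k=1}^{\infty}b_{k}Y_{1-k}\right)^{2}\qquad(6)$$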

Despite the abuse of notation, do not confuse $\varphi_{K}$ with the $K^{th}$ entry in $\varphi$ .

Theorem 1

Suppose that Condition 1, and $n,\,K\rightarrow\infty$ hold.

1. (Consistency of Constrained Estimator) There is a random $\tau=\tau_{B,n}$ such that $\tau=O_{p}\left(n^{-1/2}\right)$, $b_{n,\tau}=b_{n}$ and, if $\varphi\in\mathcal{E}\left(B\right)$, $\left|b_{n}-\varphi\right|_{2}=O_{p}\left(n^{-\frac{1}{2}\left(\frac{2\lambda-\epsilon}{2\lambda-\epsilon+1}\right)}+K^{-\lambda}\right)$ for any $\epsilon\in\left(0,2\lambda-1\right)$.

2. (Consistency of Penalized Estimator) Consider possibly random $\tau=\tau_{n}$ such that $\tau\rightarrow 0$ and $\tau n^{1/2}\rightarrow\infty$ in probability. There is a finite $B$ such that $\varphi\in\mathrm{int}\left(\mathcal{E}\left(B\right)\right)$, $\left|b_{n,\tau}\right|_{\mathcal{E}}<B$ eventually in probability and $\left|b_{n,\tau}-\varphi\right|_{\mathcal{E}}\rightarrow 0$ in probability.

3. (Approximation Error in $\mathcal{E}$) There is an $\epsilon>0$ such that $\left|\varphi-\varphi_{K}\right|_{\mathcal{E}}=O\left(\left(\ln K\right)^{-\left(1+\epsilon\right)}\right)$. Suppose the $k^{th}$ entry $\varphi_{k}$ in $\varphi$ satisfies $\left|\varphi_{k}\right|\lesssim k^{-\nu}$ with $\nu>\left(2\lambda+1\right)/2$ for all $k$ large enough. Then $\left|\varphi-\varphi_{K}\right|_{\mathcal{E}}=O\left(K^{\left(2\lambda+1-2\nu\right)/2}\right)$.

4. (Estimation Error in $\mathcal{E}$) If $\left(\tau+n^{-1/2}\right)=O_{p}\left(K^{-2\lambda}\right)$, then $\left|b_{n,\tau}-\varphi_{K}\right|_{\mathcal{E}}=O_{p}\left(n^{-1/4}K^{\lambda}\right)$.

5. (Difference Between Norms) There is $K\rightarrow\infty$ and $\tau=O_{p}\left(n^{-1/2}\right)$ such that $\left|b_{n,\tau}-\varphi\right|_{2}\rightarrow 0$ in probability, but $\left|b_{n,\tau}-\varphi\right|_{\mathcal{E}}$ does not converge to zero in probability.

Point 1 in the theorem establishes the link between constrained and penalized estimation by finding the rate of decay of the ridge penalty so that (3) and (4) are the same. It also establishes the convergence rate of (3) towards the true $\varphi$ in terms of $\lambda$ (recall $\lambda_{k}\asymp k^{\lambda}$ in Condition 1). This rate does not constrain the number of lags used once we constrain $\varphi\in\mathcal{E}\left(B\right)$ . For the finite dimensional case we trivially recover the root-n convergence by letting $\lambda\rightarrow\infty$ .

Point 2 says that if we use the penalized estimation and the penalty does not go to zero too fast (i.e. strictly slower than in Point 1) we can expect (4) to be contained in a ball in $\mathcal{E}$ that contains the true parameter with probability going to one. Moreover, (4) is consistent under the norm $\left|\cdot\right|_{\mathcal{E}}$ .

Point 3 is concerned with the approximation error of (6) in the RKHS norm. This error might go to zero at a logarithmic rate. However, if the true coefficients decay fast, then we can have polynomial convergence rate.

Point 4 restricts the way we let $K\rightarrow\infty$ in order to derive convergence rates of the estimation error under the norm $\left|\cdot\right|_{\mathcal{E}}$ .

Point 5 establishes an additional insight between the convergence under the Euclidean norm and the RKHS norm in terms of the penalty. A “slowly convergent” penalty is necessary for convergence under $\left|\cdot\right|_{\mathcal{E}}$ . Hence, this also shows that the constrained estimator (whose penalty is $\tau=\tau_{B,n}=O_{p}\left(n^{-1/2}\right)$ when $\varphi\in\mathcal{E}\left(B\right)$ ) cannot be consistent in the norm $\left|\cdot\right|_{\mathcal{E}}$ in general. This happens when choosing a rather large $K$ that leads to a binding constraint for (3).

As corollary to Points 3 and 4 in Theorem 1, we have the following.

Corollary 1

Suppose Condition 1 holds, $K\rightarrow\infty$ and $\tau=O_{p}\left(K^{-2\lambda}\right)$ .

1. Choose $K\asymp n^{\kappa}$ for some $\kappa\in\left(0,1/4\right)$. Then, there is an $\epsilon>0$ such that $\left|b_{n,\tau}-\varphi\right|_{\mathcal{E}}=O_{p}\left(\left(\ln K\right)^{-\left(1+\epsilon\right)}\right)$.

2. Suppose the $k^{th}$ entry $\varphi_{k}$ in $\varphi$ satisfies $\left|\varphi_{k}\right|\lesssim k^{-\nu}$ with $\nu>\left(2\lambda+1\right)/2$ for all $k$ large enough. Choose $K\asymp n^{\frac{1}{2\left(2\nu-1\right)}}$. Then, $\left|b_{n,\tau}-\varphi\right|_{\mathcal{E}}=O_{p}\left(n^{-\frac{2\nu-\left(2\lambda+1\right)}{4\left(2\nu-1\right)}}\right)$.
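The exponents in the second point of Corollary 1 are easy to evaluate for given $\nu$ and $\lambda$. The sketch below uses exact fractions; the function name `corollary_exponents` and the example values $\nu=2$, $\lambda=1$ are illustrative choices, not taken from the paper.

```python
from fractions import Fraction

def corollary_exponents(nu, lam):
    """Exponents (kappa, rate): K ~ n**kappa and error O_p(n**(-rate))."""
    nu, lam = Fraction(nu), Fraction(lam)
    if not (lam > Fraction(1, 2) and nu > (2 * lam + 1) / 2):
        raise ValueError("need lam > 1/2 and nu > (2*lam + 1)/2")
    kappa = 1 / (2 * (2 * nu - 1))
    rate = (2 * nu - (2 * lam + 1)) / (4 * (2 * nu - 1))
    return kappa, rate

kappa, rate = corollary_exponents(2, 1)   # nu = 2, lam = 1
print(kappa, rate)                         # 1/6 1/12
```

With these values, $K\asymp n^{1/6}$ balances the estimation error $n^{-1/4}K^{\lambda}=n^{-1/12}$ from Point 4 of Theorem 1 against the approximation error $K^{\left(2\lambda+1-2\nu\right)/2}=n^{-1/12}$ from Point 3.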

Corollary 1 imposes additional restrictions in order to improve on the statement of Point 2 in Theorem 1 by giving rates of convergence. These rates are not tight as they require $K=o\left(n\right)$ unlike Point 2 in Theorem 1. However, they are useful in applications (e.g. Section 2.1.1).

Sieve estimators are often consistent under the sole condition that the number of components (here $K$ ) is of smaller order of magnitude than the sample size $n$ . In Point 1 of Theorem 1, we have shown that this is not required. Recall that $N=n+K$ is the sample size. We can have $K=O\left(N\right)$ as long as $n\rightarrow\infty$ . Of course, we require knowledge concerning the magnitude of the coefficients. Such knowledge is usually assumed in the literature in order to bound the approximation error.

In practice the fact that we allow $K=O\left(N\right)$ might sound irrelevant. However, the asymptotic results can be seen as suggesting that, once we set the constraint, the procedure used here can be more robust to lag choice. We show this in the simulation in Section 3.

2.1.1 Application to Optimal Forecasting and Universal Consistency

Define $X_{t}\left(a\right)=\sum_{k=1}^{\infty}a_{k}Y_{t-k}$ for any $a\in\mathbb{R}^{\infty}$ . The expectation of $Y_{t}$ conditioning on the infinite past $\left(Y_{t-s}\right)_{s>0}$ is $X_{t}\left(\varphi\right)$ . As an application of Theorem 1 consider the following problem. Show that

$$\sup_{t\in\mathcal{T}}\left|X_{t}\left(\varphi\right)-X_{t}\left(b_{n,\tau}\right)\right|\rightarrow0$$

in probability, where $\mathcal{T}=\left(0,\infty\right)$ or $\left(0,n\right)$ (with $b_{n,\tau}$ as in (4)). Hence, we want $X_{t}\left(b_{n,\tau}\right)$ to be close to the conditional expectation of $Y_{t}$ uniformly in $t\in\mathcal{T}$, which is even more general than considering a moving target. The norm $\left|\cdot\right|_{\mathcal{E}}$ is useful because the previous display can be written as

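$$\sup_{t\in\mathcal{T}}\left|X_{t}\left(\varphi-b_{n,\tau}\right)\right|\lesssim\left|\varphi-b_{n,\tau}\right|_{\mathcal{E}}\sup_{t\in\mathcal{T}}\left(\sum_{k=1}^{\infty}\left(\frac{Y_{t-k}}{k^{\lambda}}\right)^{2}\right)^{1/2}.\qquad(7)$$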

To obtain the inequality, we have multiplied and divided each term in the sum (on the l.h.s.) by $\lambda_{k}$ and then used the Cauchy-Schwarz inequality and Condition 1 to set $\lambda_{k}\asymp k^{\lambda}$ .
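The step just described can be spelled out as follows. This is a reconstruction from the stated argument (linearity of $X_{t}$, multiplying and dividing by $\lambda_{k}$, then Cauchy-Schwarz), with $a:=\varphi-b_{n,\tau}$ introduced here as shorthand:

$$\left|X_{t}\left(a\right)\right|=\left|\sum_{k=1}^{\infty}\left(\lambda_{k}a_{k}\right)\frac{Y_{t-k}}{\lambda_{k}}\right|\leq\left(\sum_{k=1}^{\infty}\lambda_{k}^{2}a_{k}^{2}\right)^{1/2}\left(\sum_{k=1}^{\infty}\frac{Y_{t-k}^{2}}{\lambda_{k}^{2}}\right)^{1/2}\lesssim\left|a\right|_{\mathcal{E}}\left(\sum_{k=1}^{\infty}\left(\frac{Y_{t-k}}{k^{\lambda}}\right)^{2}\right)^{1/2}.$$

The first factor is the RKHS norm of the estimation error and does not depend on $t$, so taking the supremum over $t\in\mathcal{T}$ only affects the second factor.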

We have that $\left|\varphi-b_{n,\tau}\right|_{\mathcal{E}}=O_{p}\left(\epsilon_{n}\right)$, where $\epsilon_{n}\rightarrow 0$ at a rate that depends on Theorem 1. Then, if

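$$\sup_{t\in\mathcal{T}}\left(\sum_{k=1}^{\infty}\left(\frac{Y_{t-k}}{k^{\lambda}}\right)^{2}\right)^{1/2}=o_{p}\left(\epsilon_{n}^{-1}\right),\qquad(8)$$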

we have shown that (7) goes to zero in probability. This is a weak form of universal consistency because the convergence is in probability rather than almost surely. On the positive side, the convergence holds for a variety of processes and circumstances.

If $\mathcal{T}=\left(0,\infty\right)$ then (8) is almost surely finite if the random variables are bounded, and (7) goes to zero in probability using Point 2 in Theorem 1.

If $\mathcal{T}=\left(0,n\right)$ , we can use the bound

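$$\left(\mathbb{E}\sup_{t\in\left(0,n\right)}\sum_{k=1}^{\infty}Y_{t-k}^{2}k^{-2\lambda}\right)^{1/2}\leq n^{1/\left(2p\right)}\sup_{t\in\left(0,n\right)}\left(\mathbb{E}\sum_{k=1}^{\infty}Y_{t-k}^{2p}k^{-2\lambda p}\right)^{1/\left(2p\right)}$$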

when the variables are $2p$ integrable. If $p$ is such that $n^{1/\left(2p\right)}=o\left(\epsilon_{n}^{-1}\right)$, then the r.h.s. of (7) goes to zero in probability. If $Y_{t}$ has a moment generating function, the r.h.s. of the above display is $O\left(\ln n\right)$. Either way, to find $\epsilon_{n}$ we can use Corollary 1. Note that the argument is unchanged if $\mathcal{T}=\left(0,c_{n}\right)$ for any $c_{n}\asymp n$.

Theorem 1 can also be applied to the less ambitious problem: show that

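$$\lim_{K\rightarrow\infty}\sup_{t\in\mathcal{T}}\left|X_{t}\left(\varphi_{K}\right)-X_{t}\left(b_{n,\tau}\right)\right|\rightarrow0$$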

in probability. In this case we want to forecast as well as the increasingly best approximation of the conditional expectation of $Y_{t}$ , uniformly in $t\in\mathcal{T}$ . Point 4 in Theorem 1 is suited for this problem.

2.2 Choice of $B$ in Practice

The parameter $B$ can be chosen to minimize some cross-validated prediction error estimate (beware of cross-validation in a time series context, e.g. Györfi et al., 1990, Burman and Nolan, 1992, Burman et al., 1994, for discussions and applicability). Alternatively, one can choose $B$ to minimize some penalized loss function such as

$$\ln\hat{\sigma}_{B}^{2}+\frac{2\,\mathrm{df}\left(B\right)}{n}$$

where $\mathrm{df}\left(B\right)=\mathrm{Trace}\left(\left(X^{T}X+\tau_{B,n}n\Lambda^{2}\right)^{-1}X^{T}X\right)$ and $\tau_{B,n}$ is the solution of $\tilde{b}_{n}^{T}\Lambda^{2}\tilde{b}_{n}\leq B^{2}$, using the notation in (5). Here, $\hat{\sigma}_{B}^{2}$ is the sample variance of the residuals from the estimation. If the constraint is binding, $\tau_{B,n}$ solves

$$Y^{T}X\left(X^{T}X+\tau_{B,n}n\Lambda^{2}\right)^{-2}X^{T}Y=B^{2}.$$

This $\tau_{B,n}$ is then used to compute $\mathrm{df}\left(B\right)$, which is the effective number of degrees of freedom implied by $B$ (Hastie et al., 2009).
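The criterion can be evaluated directly from its definition. The sketch below is illustrative only: it assumes $\lambda_{k}=k^{\lambda}$, takes the penalty $\tau$ as given (in practice it would be the $\tau_{B,n}$ implied by a candidate $B$), and uses hypothetical function names `effective_df` and `penalized_loss` with synthetic data.

```python
import numpy as np

def effective_df(X, tau, lam):
    """Trace((X'X + n*tau*Lambda^2)^{-1} X'X), assuming lambda_k = k**lam."""
    n, K = X.shape
    Lambda2 = np.diag(np.arange(1, K + 1, dtype=float) ** (2 * lam))
    XtX = X.T @ X
    return float(np.trace(np.linalg.solve(XtX + n * tau * Lambda2, XtX)))

def penalized_loss(X, Y, tau, lam):
    """ln(sigma_hat^2) + 2*df/n for the ridge fit with penalty tau."""
    n, K = X.shape
    Lambda2 = np.diag(np.arange(1, K + 1, dtype=float) ** (2 * lam))
    b = np.linalg.solve(X.T @ X + n * tau * Lambda2, X.T @ Y)
    resid = Y - X @ b
    return float(np.log(np.var(resid)) + 2.0 * effective_df(X, tau, lam) / n)

# Synthetic data, for illustration only.
rng = np.random.default_rng(2)
X = rng.standard_normal((200, 5))
Y = X @ np.array([0.5, 0.2, 0.1, 0.0, 0.0]) + rng.standard_normal(200)
for tau in (0.0, 0.01, 0.1, 1.0):
    print(tau, effective_df(X, tau, 0.75), penalized_loss(X, Y, tau, 0.75))
```

With no penalty the effective degrees of freedom equal the number of lags $K$, and they shrink as the penalty grows, which is how a smaller $B$ translates into a less complex model.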

3 Numerical Example

Asymptotic results are of interest on their own, but it is also of interest to understand the scope of applicability in practice. As a benchmark, we use predictions based on an AR model where the lag length is chosen by Akaike’s Information Criterion (AIC).

3.1 Simulated True Models

One thousand data samples are simulated from (1). The sample size is $N=1000$ . A warm up sample of 1000 observations is used to reduce any dependence on the starting value. We also simulate a testing sample of $1000$ observations to approximate the mean square error (MSE). We consider different specifications for $\varphi$ in (1) including long memory in order to see how the procedure works when the true model is not in $\mathcal{E}$ . In this case, an approximation error is incurred.

Short Memory

In (1), the errors are i.i.d. standard normal and the $\varphi_{k}$ ’s are chosen to be $\varphi_{k}=\bar{\varphi}k^{-1/2}/\left(\sum_{k=1}^{K_{0}}k^{-1/2}\right)$ , where $\bar{\varphi}=0.75,\,0.99$ . A higher value for $\bar{\varphi}$ leads to a more persistent behaviour. By construction, for both values of $\bar{\varphi}$ , the model appears to generate cycles because the roots of $1-\sum_{k=1}^{K_{0}}\varphi_{k}z^{k}=0$ are outside the unit circle, but complex. We shall have different values for $K_{0}\in\left\{100,1000\right\}$ . Given the finite number of lags the coefficients are automatically in $\mathcal{E}$ .
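A minimal sketch of this design follows, assuming standard normal errors and a warm-up period as described in Section 3.1. The function names `short_memory_coefficients` and `simulate_ar` and the seed are hypothetical, and the code is not the author's simulation code.

```python
import numpy as np

def short_memory_coefficients(K0, phi_bar):
    """phi_k = phi_bar * k**(-1/2) / sum_{j<=K0} j**(-1/2); the phi_k sum to phi_bar."""
    k = np.arange(1, K0 + 1, dtype=float)
    w = k ** -0.5
    return phi_bar * w / w.sum()

def simulate_ar(phi, n, burn_in=1000, seed=0):
    """Simulate n observations of the AR process with i.i.d. N(0,1) errors after a warm-up."""
    rng = np.random.default_rng(seed)
    K0 = len(phi)
    total = K0 + burn_in + n
    y = np.zeros(total)
    eps = rng.standard_normal(total)
    for t in range(K0, total):
        # y[t-K0:t][::-1] is (y_{t-1}, ..., y_{t-K0}), matching (phi_1, ..., phi_K0).
        y[t] = phi @ y[t - K0 : t][::-1] + eps[t]
    return y[-n:]

phi = short_memory_coefficients(100, 0.75)
y = simulate_ar(phi, n=1000)
print(phi[:3], phi.sum(), y[:3])
```

Since the coefficients are positive and sum to $\bar{\varphi}<1$, they are absolutely summable with $\sum_{k}\left|\varphi_{k}\right|=\bar{\varphi}$.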

Long Memory Model

The model is an ARFIMA

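$$Y_{t}=\sum_{k=1}^{K_{0}}\varphi_{k}Y_{t-k}+\left(1-L\right)^{-d}\left(\sum_{l=0}^{L}\theta_{l}\varepsilon_{t-l}\right)\qquad(11)$$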

where the $\varphi_{k}$'s are as in the previous paragraph. The MA polynomial has coefficients $\theta_{l}=\left(1-0.1l\right)$ with $L=5$. The coefficient of fractional integration is $d=0.49$. Hence, the model is stationary, but exhibits long memory.
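The fractional integration operator can be expanded as $\left(1-L\right)^{-d}=\sum_{j\geq 0}\pi_{j}L^{j}$ with $\pi_{0}=1$ and $\pi_{j}=\pi_{j-1}\left(j-1+d\right)/j$, which is the standard binomial series. The sketch below computes the first few weights; the function name `frac_int_weights` is hypothetical, and any finite truncation point $J$ is an approximation chosen here for illustration.

```python
def frac_int_weights(d, J):
    """First J+1 coefficients of (1 - L)^(-d), via pi_j = pi_{j-1} * (j - 1 + d) / j."""
    pi = [1.0]
    for j in range(1, J + 1):
        pi.append(pi[-1] * (j - 1 + d) / j)
    return pi

w = frac_int_weights(0.49, 1000)
print(w[:3], w[-1])
```

For $d=0.49$ the weights decay hyperbolically rather than geometrically, which is the source of the long memory in (11).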

3.2 Estimation and Results

The parameter estimates are obtained from (5) with $\lambda_{k}=k^{0.501}$, so that $\lambda>1/2$ as required by Condition 1. The benchmark is an AR model with lag length chosen to minimize AIC. Denote the number of lags chosen using AIC by $K_{AIC}$. We compare this to a model estimated using more lags, but with coefficients constrained in $\mathcal{E}_{K}\left(B\right)$. In particular, $K=2K_{AIC}$ and $4K_{AIC}$ with $B$ chosen as outlined in Section 2.2. The goal is to verify whether the procedure is robust to lag choice. AIC is known to choose large models. We use even larger models, and verify whether we are able to obtain sensible results.

The results in Table 1 show the improvement in MSE of the constrained procedure over AIC. Table 1 shows that the procedure is robust to lag choice. This is most evident in the long memory case. The larger model ($4K_{AIC}$) leads to relatively better performance when the true model exhibits persistence, as in (11).

4 Further Remarks

It is simple to impose linear restrictions on the coefficients of either the constrained or penalized estimator. A natural example is positivity. This is the case if we wish to estimate ARCH models of large orders. Under ARCH restrictions, the squared returns follow an AR process. The estimator does not have a closed form expression, but it is just the solution of a quadratic programming problem. Another extension pertains to vector autoregressive processes

[TABLE]

where now the variables and innovations are $L$ dimensional vectors and we use the capital $\Phi_{k}$ to stress the multivariate framework, where $\Phi_{k}$ is an $L\times L$ matrix. Again, we can restrict $\mathcal{E}$ in a suitable way. For example, we can impose that $\Phi_{k}$ is lower triangular. This restriction has a variety of implications going from Granger causality to exogeneity and it is of much interest in econometrics (e.g., Sims, 1980). For fixed $L$ , all the results in this paper apply to this problem as well, with obvious changes if we modify the constraint to $\sum_{k=1}^{\infty}\left|\Phi_{k}\right|^{2}\lambda_{k}^{2}\leq B$ where $\left|\Phi_{k}\right|$ is any matrix norm, e.g., Frobenius: $\left|\Phi_{k}\right|=\sqrt{Trace\left(\Phi_{k}^{T}\Phi_{k}\right)}$ , where $\Phi_{k}^{T}$ is the transpose of $\Phi_{k}$ .
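For a finite number of lags, the left-hand side of the modified constraint is straightforward to evaluate. The following sketch computes $\sum_{k}\left|\Phi_{k}\right|^{2}\lambda_{k}^{2}$ under the Frobenius norm and checks the lower triangular restriction. The example matrices, the weights $k^{0.501}$ and the function names are illustrative assumptions only.

```python
import math


def frobenius(matrix):
    """Frobenius norm sqrt(trace(M^T M)), i.e. the root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def var_constraint_value(phis, weights):
    """sum_k |Phi_k|**2 * lambda_k**2 for a finite list of coefficient matrices."""
    return sum(frobenius(p) ** 2 * w ** 2 for p, w in zip(phis, weights))


def is_lower_triangular(matrix):
    """True if every entry above the main diagonal is zero."""
    return all(x == 0 for i, row in enumerate(matrix) for x in row[i + 1:])


# Two hypothetical 2 x 2 coefficient matrices (L = 2, two lags).
phis = [[[0.5, 0.0], [0.2, 0.3]], [[0.1, 0.0], [0.0, 0.1]]]
weights = [k ** 0.501 for k in (1, 2)]  # assumed growing weights lambda_k
value = var_constraint_value(phis, weights)
```

The coefficient matrices satisfy the constraint whenever `value` does not exceed the chosen $B$.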

An extension, which does not follow directly from the results derived here, is to consider the case where $L\rightarrow\infty$. This is the problem where we have a large cross-section ($L$ is the dimension of the vector $Y_{t}$ in (12)). In this case, the constraint cannot use an arbitrary matrix norm (norms are not equivalent in infinite dimensional spaces). Results in Lutz and Bühlmann (2006) together with the ones derived here can provide initial guidance on how to tackle this problem in the future.

5 Proofs

We first give the short proof of Lemma 2.

Proof. [Lemma 2] A stationary infinite AR process with absolutely summable AR coefficients has an infinite MA representation with absolutely summable coefficients and it is invertible (Lemma 2.1 in Bühlmann, 1995). Hence, there are coefficients $\psi_{s}$’s such that $Y_{t}=\sum_{s=0}^{\infty}\psi_{s}\varepsilon_{t-s}$ and

[TABLE]

which means that the autocovariance function is absolutely summable. The moment bound follows from the infinite MA representation and the bound on the fourth moment of the innovations.
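The MA coefficients in this representation can be computed from the AR coefficients by the recursion $\psi_{0}=1$, $\psi_{s}=\sum_{k=1}^{s}\varphi_{k}\psi_{s-k}$. The sketch below does this for a finite-order example. It also checks the elementary bound $\sum_{s}\left|\psi_{s}\right|\leq\left(1-\sum_{k}\left|\varphi_{k}\right|\right)^{-1}$, which holds in the special case $\sum_{k}\left|\varphi_{k}\right|<1$. The lemma itself does not need this restriction, and the function name and example coefficients are assumptions.

```python
def ma_coefficients(phi, n_terms):
    """psi_0 = 1, psi_s = sum_{k=1}^{min(s, K)} phi_k * psi_{s-k}."""
    psi = [1.0]
    for s in range(1, n_terms):
        psi.append(sum(phi[k - 1] * psi[s - k]
                       for k in range(1, min(s, len(phi)) + 1)))
    return psi


phi = [0.5, -0.2, 0.1]  # hypothetical AR(3) coefficients with sum |phi_k| = 0.8
psi = ma_coefficients(phi, n_terms=200)
abs_sum = sum(abs(p) for p in psi)
bound = 1.0 / (1.0 - sum(abs(c) for c in phi))
```

For an AR(1) model with coefficient $0.5$ the recursion returns $\psi_{s}=0.5^{s}$, as expected.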

5.1 Proof of Theorem 1

We divide the proof into two parts. One only concerns results under the Euclidean norm. The other is concerned with convergence results under the RKHS norm.

5.1.1 Consistency Under the Euclidean Norm

A few lemmas are needed for the proof. Throughout, we shall use the notation $X_{t}\left(a\right)=\sum_{k=1}^{\infty}a_{k}Y_{t-k}$ for any $a\in\mathbb{R}^{\infty}$.

Lemma 3

For $\rho:=\left(2\lambda+1\right)/2>1$ ($\lambda>1/2$ as in Condition 1) and real constants $w_{k}$’s, $\sup_{b\in\mathcal{E}_{K}\left(B\right)}\left|\sum_{k=1}^{K}b_{k}w_{k}\right|\lesssim\sum_{k=1}^{K}k^{-\rho}\left|w_{k}\right|,$ and similarly, for real constants $w_{kl}$’s, $\sup_{b\in\mathcal{E}_{K}\left(B\right)}\left|\sum_{k,l=1}^{K}b_{k}b_{l}w_{kl}\right|\lesssim\sum_{k,l=1}^{K}k^{-\rho}l^{-\rho}\left|w_{kl}\right|$.

Proof. Note that $\left|\sum_{k=1}^{K}b_{k}w_{k}\right|\leq\sum_{k=1}^{K}\frac{\left|b_{k}\right|}{k^{-\rho}}k^{-\rho}\left|w_{k}\right|$. Given that $b\in\mathcal{E}_{K}\left(B\right)$, we have $\left|b_{k}\right|\lesssim k^{-\rho}$ uniformly in $b\in\mathcal{E}_{K}\left(B\right)$, by Lemma 1. This implies that the previous quantity is bounded by a constant multiple of $\sum_{k=1}^{K}k^{-\rho}\left|w_{k}\right|$. The same argument proves the second statement in the lemma.

The $w_{kl}$ ’s in the lemma above will be partial sums of cross products of $Y_{t}$ ’s, which we bound using the following.

For arbitrary $\tau>0$ , the first order conditions that define (4) imply that

[TABLE]

where $b_{n,\tau,k}$ is the $k^{th}$ element in $b_{n,\tau}$ . By Condition 1, multiplying both sides by $2\tau\lambda_{k}^{2}a_{k}$ and summing over $k$ ,

[TABLE]

recalling the definition of $X_{t}\left(a\right)$ and using the Cauchy-Schwarz inequality. If $a\in\mathcal{E}_{K}\left(1\right)$, $\sqrt{\sum_{k=1}^{K}\lambda_{k}^{2}a_{k}^{2}}\leq 1$ and the above display clearly holds uniformly in $a$. We need to show that there is a $\tau=\tau_{n}=O_{p}\left(n^{-1/2}\right)$ such that $\sqrt{\sum_{k=1}^{K}\lambda_{k}^{2}b_{n,\tau,k}^{2}}<B$. This will imply the display in the statement of the lemma.

Lemma 4

Under Condition 1,

[TABLE]

Proof. From the proof of Lemma 2, there are absolutely summable coefficients $\psi_{u}$ ’s, such that $Y_{t}=\sum_{u=0}^{\infty}\psi_{u}\varepsilon_{t-u}$ . For ease of notation suppose that the i.i.d. innovations have variance one and the MA coefficients are non-negative. By stationarity,

[TABLE]

where the r.h.s. holds for any $t$ . If we showed that

[TABLE]

the result would follow by summability of the coefficients. To show the above, with no loss of generality, by symmetry, consider only the case $l\geq k$ . This implies that

[TABLE]

The above is equal to

[TABLE]

By the i.i.d. condition on the innovations, the covariance is zero if the indexes are not constrained in the following sets $\left\{k+u_{1}=l+u_{2},\,k+u_{3}=l+u_{4}\right\}$ , $\left\{u_{1}=u_{3}+s,\,u_{2}=u_{4}+s\right\}$ , $\left\{k+u_{1}=l+u_{4}+s,\,l+u_{2}=k+u_{3}+s\right\}$ . Hence, we can consider summation with indexes in these sets only. Splitting the sum according to the above index sets, we have respectively,

[TABLE]

By elementary change of indexes,

[TABLE]

Similarly, deduce that

[TABLE]

Finally,

[TABLE]

The bounds do not depend on $k,l$ beyond the fact that $l\geq k$ . Repeating the argument for $k>l$ , the result follows.

Lemma 4 will be used to bound quantities such as the following

[TABLE]

where the second inequality follows because $\left(2\lambda+1\right)/2>1$ . Then, by Lemma 4 the expectation is finite because $\mathbb{E}\left|\cdot\right|\leq\left(\mathbb{E}\left|\cdot\right|^{2}\right)^{1/2}$ and it is independent of $k,l$ by stationarity. In consequence the display is $O_{p}\left(n^{-1/2}\right)$ because convergence in $L_{1}$ implies convergence in probability.

To establish convergence rates we need two stochastic equicontinuity results.

Lemma 5

Under Condition 1, for any $\epsilon>0$

[TABLE]

Proof. By the triangle inequality, (15) is bounded by

[TABLE]

By Lemma 3, there is a $\rho>1$ such that the above is bounded by a constant multiple of

[TABLE]

by summability of $l^{-\rho}$ . For any positive $V$ , the above display can be written as

[TABLE]

We shall bound the two sums separately. By the Cauchy-Schwarz inequality, the first sum is bounded by

[TABLE]

where the inequality uses Lemma 4 and $\left|b\right|_{2}\leq\delta$ . Having set $V$ to such finite value, by the Cauchy-Schwarz inequality, the second sum is bounded by

[TABLE]

for any $\epsilon\in\left(0,2\lambda-1\right)$, using again Lemma 4, and the fact that $k^{-\left(1+\epsilon\right)}$ is summable and $k^{\left(1+\epsilon\right)}\lambda_{k}^{-2}$ is decreasing. The r.h.s. is then bounded by a constant multiple of $V^{\left(1+\epsilon-2\lambda\right)/2}$. Equating $\delta\sqrt{V}$ with $V^{\left(1+\epsilon-2\lambda\right)/2}$ we choose $V=\delta^{-2/\left(2\lambda-\epsilon\right)}$, implying that $\delta\sqrt{V}+V^{\left(1+\epsilon-2\lambda\right)/2}\lesssim\delta^{\frac{2\lambda-\epsilon-1}{2\lambda-\epsilon}}$ and the lemma is proved.
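The balancing step at the end of the proof can be written out explicitly. The following worked computation uses only the two bounds $\delta\sqrt{V}$ and $V^{\left(1+\epsilon-2\lambda\right)/2}$ from the proof. Note that the exponent of $\delta$ in $V$ is negative, so the cut-off $V$ grows as $\delta$ shrinks.

```latex
\delta\sqrt{V}=V^{(1+\epsilon-2\lambda)/2}
\iff \delta=V^{-(2\lambda-\epsilon)/2}
\iff V=\delta^{-2/(2\lambda-\epsilon)},
\qquad\text{so that}\qquad
\delta\sqrt{V}=\delta\cdot\delta^{-1/(2\lambda-\epsilon)}
=\delta^{\frac{2\lambda-\epsilon-1}{2\lambda-\epsilon}}.
```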

Lemma 6

Under Condition 1, for any $\epsilon>0$ ,

[TABLE]

Proof. By linearity and the triangle inequality,

[TABLE]

Note that

[TABLE]

Hence, we can proceed exactly as in the proof of Lemma 5 to deduce the result.

The first part of Point 1 in the theorem will be proved in Lemma 8 (Section 5.1.2). Hence, here we shall only derive the convergence rate.

Define the empirical loss function

[TABLE]

where $b\in\mathcal{E}$ . When $b\in\mathcal{E}_{K}$ the sum inside the parenthesis only runs from $1$ to $K$ . The population loss is

[TABLE]

Define $\beta=\beta_{K}\in\mathbb{R}^{\infty}$ such that its first $K$ entries are as in $\varphi$ and the remaining are all zero. The consistency proof is standard (van der Vaart and Wellner, 2000, Theorem 3.2.5) once we show the following:

[TABLE]

for some $\alpha\in\left(0,2\right)$. Then, for any sequence $r_{n}\rightarrow\infty$ satisfying $r_{n}^{2-\alpha}\lesssim\sqrt{n}$, $L_{n}\left(b_{n}\right)\leq L_{n}\left(\beta\right)+O_{p}\left(r_{n}^{-2}\right)$ and $\left|\varphi-\beta\right|_{2}\lesssim r_{n}^{-1}$, we have that $\left|b_{n}-\varphi\right|_{2}=O_{p}\left(r_{n}^{-1}\right)$.

We first verify (17). Note that

[TABLE]

where $\gamma\left(k\right)$ is the autocovariance function (ACF) of the $Y_{t}$ ’s. The estimator is uniquely identified if the matrix, say $\Gamma$ , with $\left(k,l\right)$ entry equal to $\gamma\left(k-l\right)$ , is strictly positive definite with smallest eigenvalue $\theta_{min}>0$ (see remarks after Lemma 2.2. in Kreiss et al., 2011). This is the case if the spectral density of $\left(Y_{t}\right)_{t\in\mathbb{Z}}$ , say $g\left(\omega\right)$ , is bounded away from zero. The spectral density of the AR model (1) is given by $g\left(\omega\right)=\left(2\pi\right)^{-1}\sigma^{2}/\varphi\left(\omega\right)$ , where $\varphi\left(\omega\right)=\left|\sum_{k=0}^{\infty}\varphi_{k}e^{-ik\omega}\right|^{2}$ with $\varphi_{0}:=1$ . Noting that by Condition 1, $\varphi\left(\omega\right)=\left|\sum_{k=0}^{\infty}\varphi_{k}e^{-ik\omega}\right|^{2}\leq\left(\sum_{k=0}^{\infty}\left|\varphi_{k}\right|\right)^{2}<\infty$ , deduce that the eigenvalues of $\Gamma$ are bounded away from zero. Hence,

[TABLE]

and (17) holds.

Using the notation $Y_{t}=X_{t}\left(\varphi\right)+\varepsilon_{t}$ , the empirical loss is equal to

[TABLE]

This implies that

[TABLE]

To verify (18), we need to bound the above uniformly in $b\in\mathcal{E}\left(B\right)$ such that $\left|b-\beta\right|_{2}\leq\delta.$ To this end, apply Lemma 6 to the first term on the r.h.s. to find that the uniform bound is a constant multiple of $n^{-1/2}\delta^{\frac{2\lambda-\epsilon-1}{2\lambda-\epsilon}}$ for any $\epsilon>0$ . By basic algebraic manipulations, the second term on the r.h.s. of the display is

[TABLE]

Note that both $\varphi-b$ and $\beta-\varphi$ are in $\mathcal{E}\left(2B\right)$ . We apply Lemma 5 to deduce that each term on the r.h.s. of the above display is uniformly bounded in $L_{1}$ by a constant multiple of $n^{-1/2}\delta^{\frac{2\lambda-\epsilon-1}{2\lambda-\epsilon}}$ for any $\epsilon>0$ when $\left|b-\beta\right|_{2}\leq\delta$ . Hence (18) is verified with $\alpha=\frac{2\lambda-\epsilon-1}{2\lambda-\epsilon}$ . When we are only interested in a finite dimensional model, we can take $\lambda\rightarrow\infty$ to deduce that $\alpha=1$ , which is the parametric case.

To find $r_{n}$ note that

[TABLE]

Also, $\left|\varphi-\beta\right|_{2}=\left(\sum_{k>K}\left|\varphi_{k}\right|^{2}\right)^{1/2}\lesssim K^{-\lambda}/\ln^{1+\epsilon}\left(K\right)$ for some $\epsilon>0$ using Lemma 1, bounding the sum with an integral and using the fact that $\ln^{1+\epsilon}\left(\cdot\right)$ is slowly varying at infinity. Hence we deduce that $r_{n}^{-1}\asymp\left(K^{-\lambda}/\ln^{1+\epsilon}\left(K\right)\right)+n^{-\frac{1}{2}\left(\frac{2\lambda-\epsilon}{2\lambda-\epsilon+1}\right)}$ as stated in Point 1 of the theorem.
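The stochastic part of the rate can be checked by a short computation. In Theorem 3.2.5 of van der Vaart and Wellner (2000), a modulus of continuity $\phi_{n}\left(\delta\right)=\delta^{\alpha}$ leads to the requirement $r_{n}^{2}\phi_{n}\left(1/r_{n}\right)\lesssim\sqrt{n}$. The sketch below plugs in the value of $\alpha$ obtained above; it is a worked restatement and adds no new assumption beyond that modulus.

```latex
r_{n}^{2}\,\phi_{n}\!\left(r_{n}^{-1}\right)=r_{n}^{2-\alpha}\asymp\sqrt{n}
\iff r_{n}\asymp n^{\frac{1}{2(2-\alpha)}},
\qquad
\alpha=\frac{2\lambda-\epsilon-1}{2\lambda-\epsilon}
\;\Rightarrow\;
2-\alpha=\frac{2\lambda-\epsilon+1}{2\lambda-\epsilon},
\qquad
r_{n}^{-1}\asymp n^{-\frac{1}{2}\left(\frac{2\lambda-\epsilon}{2\lambda-\epsilon+1}\right)}.
```

As $\lambda\rightarrow\infty$ the exponent tends to $-1/2$, which recovers the parametric rate mentioned above.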

5.1.2 Consistency Under the RKHS Norm

The proof depends on a few preliminary lemmas. Let $\varphi_{\tau}=\varphi_{K,\tau}\in\mathcal{E}_{K}$ be the penalized population estimator

[TABLE]

The following can be deduced from Theorem 5.9 in Steinwart and Christmann (2008, eq. 5.14). The proof is given, as the context might seem different at first sight.

Lemma 7

Suppose Condition 1. For arbitrary but fixed $\tau>0$ , consider $b_{n,\tau}$ and $\varphi_{\tau}$ in (4) and (20) with $K$ possibly diverging to infinity. Then,

[TABLE]

where $b_{n,\tau,k}$ is the $k^{th}$ entry in the $K$ dimensional vector $b_{n,\tau}$ , and similarly for $\varphi_{\tau,k}$ .

Proof. By convexity of the square error loss,

[TABLE]

Note the following algebraic equality,

[TABLE]

The above two displays imply

[TABLE]

where the last step follows because $b_{n,\tau}$ minimizes the empirical penalized risk. The first order conditions for $\varphi_{\tau}$ read

[TABLE]

for $k\geq 1$ . Substituting this in the previous display,

[TABLE]

Rearranging and using the definition of $X_{t}\left(b_{n,\tau}-\varphi_{\tau}\right)$ , deduce that

[TABLE]

using the Cauchy-Schwarz inequality in the last step. This implies the result of the lemma after simple rearrangement.

The next lemma establishes the relation between the constrained and penalized estimator and states a bound for the distance between the sample and population penalized estimator under the RKHS norm.

Lemma 8

Suppose that $\varphi\in\mathrm{int}\left(\mathcal{E}\left(B\right)\right)$ . Under Condition 1, if $a\in\mathcal{E}_{K}\left(1\right)$ , and $b_{n,\tau}$ is as in (4), there is $\tau=\tau_{n}=O_{p}\left(n^{-1/2}\right)$ such that $\left|b_{n,\tau}\right|_{\mathcal{E}}<B$ and

[TABLE]

where the above bound holds uniformly in $a\in\mathcal{E}_{K}\left(1\right)$ . In consequence, there is a $\tau=O_{p}\left(n^{-1/2}\right)$ such that $b_{n,\tau}=b_{n}$ .

Moreover, for any $\tau>0$ ,

[TABLE]

Proof. Suppose that $\tau>0$, as otherwise, by the first order conditions, the r.h.s. in the first display in the statement of the lemma is exactly zero and there is nothing to prove.

By the triangle inequality,

[TABLE]

For $\tau\geq 0$, $\sqrt{\sum_{k=1}^{K}\lambda_{k}^{2}\varphi_{\tau,k}^{2}}\leq\sqrt{\sum_{k=1}^{K}\lambda_{k}^{2}\varphi_{k}^{2}}$, as the penalized population estimator must have norm no larger than $\varphi$. By this remark and the fact that $\varphi\in\mathrm{int}\left(\mathcal{E}\left(B\right)\right)$, there is an $\epsilon>0$ such that the first term on the r.h.s. is at most $B-3\epsilon$. Lemma 7 gives

[TABLE]

Adding and subtracting $\left(1-\mathbb{E}\right)X_{t}\left(\varphi\right)Y_{t-k}$ , and then using the basic inequality $\left(x+y\right)^{2}\leq 2x^{2}+2y^{2}$ for any real $x,y$ , the r.h.s. is

[TABLE]

Recalling that our goal is to bound the second term on the r.h.s. of (22), the above two displays imply that

[TABLE]

To bound $I$ on the r.h.s. note that for $k>0$ ,

[TABLE]

(recall $\gamma\left(k\right)$ is the ACF) so that

[TABLE]

because the coefficients $\lambda_{k}^{-2}$ are summable. Hence, it is possible to find a $\tau=O_{p}\left(n^{-1/2}\right)$ such that $I\leq\epsilon$ . To bound $II$ , recall that $\varphi_{\tau},\varphi\in\mathcal{E}\left(B\right)$ for any $\tau\geq 0$ , and write

[TABLE]

for ease of notation. Then, for $\rho=\left(2\lambda+1\right)/2>1$ ,

[TABLE]

using Lemma 3 in the second inequality and summability of the coefficients in the last step. By Lemma 4, $\mathbb{E}W_{k,l}^{2}\leq c$ for some finite absolute constant $c$. Hence, deduce that $III=O_{p}\left(n^{-1}\right)$, which implies that $II=O_{p}\left(\tau^{-1}n^{-1/2}\right)$. Hence, there is a $\tau=O_{p}\left(n^{-1/2}\right)$ such that $II\leq\epsilon$. The control of $I+II$ implies that (24) is not greater than $2\epsilon$ for suitable $\tau$. Hence, we have shown that there is a $\tau=O_{p}\left(n^{-1/2}\right)$ such that (22) is not greater than $B-\epsilon$. This bound for (22) together with (14) proves the first display in the lemma. To see that this also implies that there is a $\tau=O_{p}\left(n^{-1/2}\right)$ such that $b_{n,\tau}=b_{n}$, note that $\left|b_{n,\tau}\right|_{\mathcal{E}}$ is non-decreasing as $\tau\rightarrow 0$. Hence, $b_{n,\tau}=b_{n}$ for the smallest $\tau$ such that $\left|b_{n,\tau}\right|_{\mathcal{E}}\leq B$.

The last statement in the lemma follows from (23) and the just derived bound for (24).

We now estimate the approximation error.

Lemma 9

For any $K\rightarrow\infty$ , we have that $\left|\varphi_{K}-\varphi_{\tau}\right|_{\mathcal{E}}\rightarrow 0$ as $\tau\rightarrow 0$ where $\varphi_{K}$ is as in (6). Moreover, if $\tau=O_{p}\left(K^{-2\lambda}\right)$ , then $\left|\varphi_{K}-\varphi_{\tau}\right|_{\mathcal{E}}=O_{p}\left(\tau K^{2\lambda}\right)$ .

Proof. The first part of the lemma is just Theorem 5.17 in Steinwart and Christmann (2008). Hence, we only need to prove the second statement. Let $\Gamma$ be the $K\times K$ matrix with $\left(k,l\right)$ entry $\gamma\left(k-l\right)$ and let $\Gamma_{1}$ be the first column in $\Gamma$. Let $\tilde{\varphi}_{K},\tilde{\varphi}_{\tau}\in\mathbb{R}^{K}$ be the first $K$ entries in $\varphi_{K},\varphi_{\tau}\in\mathcal{E}_{K}$. Recall that in both $\varphi_{K}$ and $\varphi_{\tau}$ all entries $k>K$ are zero. Then, $\tilde{\varphi}_{K}=\Gamma^{-1}\Gamma_{1}$, and writing $D:=\tau^{1/2}\Lambda$ for $\Lambda$ as in (5),

[TABLE]

By the Woodbury identity (Petersen and Pedersen, 2012, eq. 159)

[TABLE]

we have that

[TABLE]

Hence,

[TABLE]

using the definitions of $\tilde{\varphi}_{K}$ and $D$ . For any square matrix $W$ and compatible vector $a$ , $\left|Wa\right|_{2}\leq\sigma_{\max}^{2}\left(W\right)\left|a\right|_{2}$ , where $\sigma_{\max}^{2}\left(W\right)$ is the maximum eigenvalue of $W$ . Define $W=D\Gamma^{-1}D\left(I+D\Gamma^{-1}D\right)^{-1}$ . Given that $\varphi\in\mathcal{E}_{K}\left(B\right)$ , then, $\left|\Lambda\tilde{\varphi}\right|_{2}\leq B$ . Hence, we only need to find the maximum eigenvalue of $W$ to bound the above display. The following inequalities hold for the eigenvalues of the product of two positive definite matrices $A$ and $C$ :

[TABLE]

where $\sigma_{\max}^{2}\left(\cdot\right)$ and $\sigma_{\min}^{2}\left(\cdot\right)$ are the maximum and minimum eigenvalue of the matrix argument (Bhatia, 1997, problem III.6.14, p.78). In order to derive (19), we argued that $\Gamma$ has minimum eigenvalue $\theta_{\min}$ bounded away from zero. Hence, $D\Gamma^{-1}D$ has eigenvalues in $\left[\theta_{\min}^{-1}\tau\lambda_{1}^{2},\theta_{\min}^{-1}\tau\lambda_{K}^{2}\right]$. The matrix $\left(I+D\Gamma^{-1}D\right)$ has eigenvalues equal to 1 plus the eigenvalues of $D\Gamma^{-1}D$. Hence deduce that $\left|\varphi_{K}-\varphi_{\tau}\right|_{\mathcal{E}}\lesssim\theta_{\min}^{-1}\tau\lambda_{K}^{2}\left(1+\theta_{\min}^{-1}\tau\lambda_{1}^{2}\right)$. This is just $O\left(\tau\lambda_{K}^{2}\right)=O\left(\tau K^{2\lambda}\right)$ as required.
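The linear dependence on $\tau$ can be checked numerically on a toy example. The sketch below compares the penalized and unpenalized population solutions for an AR(1) autocovariance function and records the error in the ellipsoid norm for three values of $\tau$. Several choices here are assumptions made for illustration: the right-hand side is taken to be the Yule-Walker vector $\left(\gamma\left(1\right),\ldots,\gamma\left(K\right)\right)$, the weights are $\lambda_{k}=k^{0.501}$, and the AR(1) coefficient, $K$ and the function name are arbitrary.

```python
import numpy as np


def penalized_population(gamma, weights, tau):
    """Solve (Gamma + tau * Lambda**2) phi = (gamma(1), ..., gamma(K))."""
    K = len(weights)
    idx = np.abs(np.subtract.outer(np.arange(K), np.arange(K)))
    Gamma = gamma[idx]  # Toeplitz matrix with (k, l) entry gamma(|k - l|)
    rhs = gamma[1:K + 1]  # assumed Yule-Walker right-hand side
    return np.linalg.solve(Gamma + tau * np.diag(weights ** 2), rhs)


a, K = 0.5, 10
# AR(1) autocovariances with unit innovation variance: gamma(h) = a**h / (1 - a**2).
gamma = a ** np.arange(K + 1) / (1 - a ** 2)
weights = np.arange(1, K + 1) ** 0.501  # assumed growing weights lambda_k

phi_K = penalized_population(gamma, weights, 0.0)
errors = []
for tau in (1e-3, 1e-4, 1e-5):
    diff = penalized_population(gamma, weights, tau) - phi_K
    errors.append(float(np.sqrt(np.sum((weights * diff) ** 2))))
```

In this example each tenfold reduction in $\tau$ reduces the error by a factor close to ten, in line with the $O\left(\tau K^{2\lambda}\right)$ bound for fixed $K$.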

We need a final approximation result.

Lemma 10

Recall (6). If $\varphi\in\mathcal{E}$, then $\left|\varphi-\varphi_{K}\right|_{\mathcal{E}}=O\left(1/\ln^{1+\epsilon}\left(K\right)\right)$ for some $\epsilon>0$ as $K\rightarrow\infty$. If also $\left|\varphi_{k}\right|\lesssim k^{-\nu}$ with $\nu>\left(2\lambda+1\right)/2$, then, $\left|\varphi-\varphi_{K}\right|_{\mathcal{E}}=O\left(K^{\left(2\lambda+1-2\nu\right)/2}\right)$.

Proof. Recall the definition of $\beta=\beta_{K}\in\mathbb{R}^{\infty}$ just before (17). Let $\tilde{\beta}\in\mathbb{R}^{K}$ have the same first $K$ entries as $\beta$. Write $Y_{t}=X_{t}\left(\beta\right)+\varepsilon_{K,t}$ where $\varepsilon_{K,t}=\varepsilon_{t}-X_{t}\left(\beta-\varphi\right)$. Given that $\tilde{\varphi}_{K}$ is the population ordinary least squares estimator, using the same notation as in the proof of Lemma 9,

[TABLE]

We need to show that the second term goes to zero under the norm $\left|\cdot\right|_{\mathcal{E}}$ . Given that the innovations are i.i.d., the expectation is equal to

[TABLE]

Hence,

[TABLE]

We need to show that this converges to zero. By similar arguments as in the proof of Lemma 9, deduce that

[TABLE]

so that it is sufficient to bound the square root of the above display. We have that

[TABLE]

Note that $\max_{k\leq K}\left|\gamma\left(K-k+l\right)\right|\leq\left|\gamma\left(l\right)\right|$, and by Lemma 2 the autocovariance function is summable. Moreover $\lambda_{k}^{2}\asymp k^{2\lambda}$. Hence, when $\left|\varphi_{K+l}\right|\lesssim K^{-\nu}$ holds true, the above display can be bounded by a constant multiple of

[TABLE]

Finally, by definition of $\beta$ ,

[TABLE]

This implies that $\left|\varphi-\varphi_{K}\right|_{\mathcal{E}}=O\left(K^{\left(2\lambda+1-2\nu\right)/2}\right)$ . If we only assume that $\varphi\in\mathcal{E}$ , then $\left|\varphi_{k}\right|\lesssim k^{-\left(2\lambda+1\right)/2}/\ln^{1+\epsilon}\left(1+k\right)$ for some $\epsilon>0$ by Lemma 1. Substituting in the above display, we have a logarithmic convergence rate rather than polynomial.

We can now prove Points 2-5 in Theorem 1. If $\varphi\in\mathcal{E}$, then there is a finite $B$ such that $\varphi\in\mathrm{int}\left(\mathcal{E}\left(B\right)\right)$. Hence, by Lemmas 7 and 8, deduce that $\left|b_{n,\tau}-\varphi_{\tau}\right|_{\mathcal{E}}=O_{p}\left(\tau^{-1}n^{-1/2}\right)$ and also that $\left|b_{n,\tau}\right|_{\mathcal{E}}<B$ eventually in probability. Hence, if $\tau n^{1/2}\rightarrow\infty$ in probability, by Lemma 9, $\left|b_{n,\tau}-\varphi_{K}\right|_{\mathcal{E}}\rightarrow 0$ in probability irrespective of the fact that $K\rightarrow\infty$. By Lemma 10, $\left|\varphi-\varphi_{K}\right|_{\mathcal{E}}\rightarrow 0$ as $K\rightarrow\infty$, so that the triangle inequality gives $\left|b_{n,\tau}-\varphi\right|_{\mathcal{E}}\rightarrow 0$ in probability under the sole condition that $\tau n^{1/2}+K\rightarrow\infty$ in probability. This proves Point 2.

The approximation rates in Point 3 are from Lemma 10.

To show Point 4, use Lemma 9 for the approximation error of the penalized estimator. We need $\tau=O_{p}\left(K^{-2\lambda}\right)$ for the lemma to apply. Use Lemmas 7 and 8 to derive the estimation error relative to the penalized estimator. Hence, deduce that $\left|b_{n,\tau}-\varphi_{K}\right|_{\mathcal{E}}=O_{p}\left(\tau^{-1}n^{-1/2}+\tau K^{2\lambda}\right)$ . Equating the two terms inside the $O_{p}\left(\cdot\right)$ , this quantity is $O_{p}\left(n^{-1/4}K^{\lambda}\right)$ when $\tau\asymp n^{-1/4}K^{-\lambda}$ . This choice of $\tau$ satisfies $\tau=O_{p}\left(K^{-2\lambda}\right)$ as long as $n^{-1/4}K^{\lambda}=O\left(1\right)$ , as required.
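The choice of $\tau$ in this step follows from equating the estimation term $\tau^{-1}n^{-1/2}$ with the approximation term $\tau K^{2\lambda}$. Written out as a worked computation, using only the two terms stated above:

```latex
\tau^{-1}n^{-1/2}=\tau K^{2\lambda}
\iff \tau^{2}=n^{-1/2}K^{-2\lambda}
\iff \tau=n^{-1/4}K^{-\lambda},
\qquad
\tau K^{2\lambda}=n^{-1/4}K^{\lambda},
\qquad
\frac{\tau}{K^{-2\lambda}}=n^{-1/4}K^{\lambda}.
```

The last ratio shows why $\tau=O\left(K^{-2\lambda}\right)$ holds exactly when $n^{-1/4}K^{\lambda}=O\left(1\right)$.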

We now prove Point 5. Lemma 8 also shows that for the constrained problem, the Lagrange multiplier is $\tau=\tau_{n,B}=O_{p}\left(n^{-1/2}\right)$, and the constraint is possibly binding. In fact, there is a $K$ large enough relative to $n$, such that the constraint needs to be binding. Then, $\left|b_{n}\right|_{\mathcal{E}}=B$, and from Lemma 8 we deduce that $\tau n^{1/2}=O_{p}\left(1\right)$. Hence, if $\varphi\in\mathrm{int}\left(\mathcal{E}\left(B\right)\right)$ there is an $\epsilon>0$ such that $\left|\varphi\right|_{\mathcal{E}}=B-\epsilon$. Then, we must have

[TABLE]

But $\left\langle b_{n},\varphi\right\rangle_{\mathcal{E}}\leq\left|b_{n}\right|_{\mathcal{E}}\left|\varphi\right|_{\mathcal{E}}\leq B\left(B-\epsilon\right)$. Hence, the above display is greater than or equal to

[TABLE]

This means that $b_{n}$ cannot converge under the norm $\left|\cdot\right|_{\mathcal{E}}$ .

5.2 Proof of Corollary 1

We now prove Point 1 in the corollary. By Point 4 in Theorem 1, the estimation error is $o_{p}\left(1\right)$ as long as $K\asymp n^{\kappa}$ for $\kappa\in\left(0,1/4\right)$; we also require $\tau=O_{p}\left(K^{-2\lambda}\right)$ which under the condition on $K$ also satisfies $\tau n^{1/2}\rightarrow\infty$. Point 3 in Theorem 1 gives an approximation error of order $\left(\ln K\right)^{-\left(1+\epsilon\right)}=o\left(1\right)$ because $K\rightarrow\infty$. Hence, we deduce the first part of the corollary.

To derive Point 2, consider Point 3 in Theorem 1 under the additional condition on the decay rate of the true coefficients. Point 4 in the same theorem gives again the estimation error. From the sum of the two errors deduce that $\left|b_{n,\tau}-\varphi\right|_{\mathcal{E}}=O_{p}\left(n^{-1/4}K^{\lambda}+K^{\left(2\lambda+1-2\nu\right)/2}\right)$. Equating the two terms, this is $O_{p}\left(n^{-\frac{2\nu-\left(2\lambda+1\right)}{4\left(2\nu-1\right)}}\right)$ when $K=n^{\frac{1}{2\left(2\nu-1\right)}}$. Once again, the bound on the estimation error requires that $\tau=O_{p}\left(K^{-2\lambda}\right)$. Under the condition on $K$ this ensures that $\tau n^{1/2}\rightarrow\infty$, which is required.
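The balancing of the two error terms in Point 2 can be verified with exact arithmetic. The sketch below expresses every quantity as a power of $n$ and checks that, with $K=n^{1/\left(2\left(2\nu-1\right)\right)}$, the estimation and approximation exponents coincide. The function name and the example values $\nu=2$, $\lambda=3/4$ are assumptions chosen so that $\nu>\left(2\lambda+1\right)/2$.

```python
from fractions import Fraction


def balanced_exponents(nu, lam):
    """Exponents (as powers of n) of K and of the two error terms."""
    nu, lam = Fraction(nu), Fraction(lam)
    kappa = 1 / (2 * (2 * nu - 1))  # K = n**kappa
    estimation = Fraction(-1, 4) + kappa * lam  # n**(-1/4) * K**lam
    approximation = kappa * (2 * lam + 1 - 2 * nu) / 2  # K**((2*lam + 1 - 2*nu) / 2)
    return kappa, estimation, approximation


kappa, est, approx = balanced_exponents(nu=2, lam=Fraction(3, 4))
# kappa = 1/6 and est = approx = -1/8 for these values
stated_rate = -(2 * 2 - (2 * Fraction(3, 4) + 1)) / (4 * (2 * 2 - 1))
```

Here `stated_rate` evaluates the exponent $-\frac{2\nu-\left(2\lambda+1\right)}{4\left(2\nu-1\right)}$ from the text, which agrees with both balanced terms.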

Bibliography

[1] Bhatia, R. (1997) Matrix Analysis. New York: Springer.
[2] Bühlmann, P. (1995) Moving-average representation for autoregressive approximations. Stochastic Processes and their Applications 60, 331-342.
[3] Bühlmann, P. (1997) Sieve Bootstrap for Time Series. Bernoulli 3, 123-148.
[4] Burman, P. and D. Nolan (1992) Data-Dependent Estimation of Prediction Functions. Journal of Time Series Analysis 13, 189-207.
[5] Burman, P., E. Chow and D. Nolan (1994) A Cross-Validatory Method for Dependent Data. Biometrika 81, 351-358.
[6] Graf, S. and H. Luschgy (2004) Sharp Asymptotics of the Metric Entropy for Ellipsoids. Journal of Complexity 20, 876-882.
[7] Györfi, L., W. Härdle, P. Sarda and P. Vieu (1990) Nonparametric Curve Estimation from Time Series. Heidelberg: Springer.
[8] Györfi, L. and A. Sancetta (2015) An open problem on strongly consistent learning of the best prediction for Gaussian processes. In M. Akritas, S.N. Lahiri and D. Politis (eds.), Proceedings of the first conference of the international Society of Nonparametric Statistics. Heidelberg: Springer.
