Linear regression with stationary errors : the R package slm

Emmanuel Caron; J\'er\^ome Dedecker; Bertrand Michel

arXiv:1906.06583·stat.AP·August 31, 2021·R J.

Linear regression with stationary errors : the R package slm

Emmanuel Caron, J\'er\^ome Dedecker, Bertrand Michel

PDF

Open Access 1 Repo

TL;DR

The paper presents the R package slm for linear regression with stationary errors, providing procedures for estimation, inference correction, and validation through simulations and real data examples.

Contribution

It introduces new methods for estimating covariance and correcting inference in linear regression with stationary errors within the R package slm.

Findings

01

Effective covariance estimation methods demonstrated.

02

Improved type I error control in tests.

03

Validated procedures with simulations and real data.

Abstract

This paper introduces the R package slm which stands for Stationary Linear Models. The package contains a set of statistical procedures for linear regression in the general context where the error process is strictly stationary with short memory. We work in the setting of Hannan (1973), who proved the asymptotic normality of the (normalized) least squares estimators (LSE) under very mild conditions on the error process. We propose different ways to estimate the asymptotic covariance matrix of the LSE, and then to correct the type I error rates of the usual tests on the parameters (as well as confidence intervals). The procedures are evaluated through different sets of simulations, and two examples of real datasets are studied.

Figures17

Click any figure to enlarge with its caption.

Tables5

Table 1. Table 1: Plot output for each method given in the \code method_cov_st of \code slm.

\codemethod_cov_st=	plot
\codefitAR	ACF and PACF of the residual process
\codekernel	ACF of the residual process
\codekernel with \codemodel_selec = -1	Graph of the estimated risk and of the estimated $γ (k)$ ’s
\codespectralproj	Estimated spectral density
\codeselect	ACF of the residuals up to the selected order
\codeefromovich	ACF of the residuals up to the selected order
\codehac	No plot available

Table 2. Table 2: Available kernel functions in \pkg slm.

\codekernel_fonc =	kernel definition
\coderectangular	$K (x) = 𝟙_{{\| x \| \leq 1}}$
\codetriangle	$K (x) = (1 - \| x \|) 𝟙_{{\| x \| \leq 1}}$
\codetrapeze	$K (x) = 𝟙_{{\| x \| \leq δ}} + \frac{1}{1 - δ} (1 - \| x \|) 𝟙_{{δ \leq \| x \| \leq 1}}$

Table 3. Table 3: Estimated levels for the non-seasonal processes.

n		Fisher test	fitAR	spectralproj	efromovich	kernel	hac
200	AR1 process	0.465	0.097	0.14	0.135	0.149	0.108
	NonMixing	0.298	0.082	0.103	0.096	0.125	0.064
	Sysdyn process	0.385	0.105	0.118	0.124	0.162	0.12
1000	AR1 process	0.418	0.043	0.049	0.049	0.086	0.049
	NonMixing	0.298	0.046	0.05	0.053	0.076	0.038
	Sysdyn process	0.393	0.073	0.077	0.079	0.074	0.078
2000	AR1 process	0.454	0.071	0.078	0.075	0.067	0.071
	NonMixing	0.313	0.051	0.053	0.057	0.067	0.047
	Sysdyn process	0.355	0.063	0.064	0.066	0.069	0.073
5000	AR1 process	0.439	0.044	0.047	0.047	0.047	0.044
	NonMixing	0.315	0.053	0.056	0.059	0.068	0.05
	Sysdyn process	0.381	0.058	0.061	0.057	0.064	0.071

Table 4. Table 4: Estimated levels for the seasonal processes.

n		Fisher test	fitAR	spectralproj	efromovich	kernel	hac
200	AR12 process	0.436	0.178	0.203	0.223	0.234	0.169
200	MA12 process	0.228	0.113	0.113	0.116	0.15	0.222
1000	AR12 process	0.468	0.068	0.183	0.181	0.124	0.179
1000	MA12 process	0.209	0.064	0.066	0.069	0.063	0.18
2000	AR12 process	0.507	0.071	0.196	0.153	0.104	0.192
2000	MA12 process	0.237	0.064	0.064	0.058	0.068	0.173
5000	AR12 process	0.47	0.062	0.183	0.1	0.091	0.171
5000	MA12 process	0.242	0.044	0.048	0.043	0.057	0.147

Table 5. Table 5: Estimated levels for the i.i.d. process

n		Fisher test	fitAR	spectralproj	efromovich	kernel	hac
150	i.i.d. process	0.053	0.068	0.078	0.061	0.124	0.063
300	i.i.d. process	0.052	0.051	0.06	0.05	0.095	0.052
500	i.i.d. process	0.047	0.049	0.053	0.049	0.082	0.056

Equations64

Y = X β + ε,

Y = X β + ε,

Γ_{n} := [γ (j - l)]_{1 \leq j, l \leq n} .

Γ_{n} := [γ (j - l)]_{1 \leq j, l \leq n} .

\mathbb{E}\left(D(n)(\hat{\beta}-\beta)(\hat{\beta}-\beta)^{t}D(n)^{t}\Big{|}X\right)\xrightarrow[n\rightarrow\infty]{}C,\quad a.s.

\mathbb{E}\left(D(n)(\hat{\beta}-\beta)(\hat{\beta}-\beta)^{t}D(n)^{t}\Big{|}X\right)\xrightarrow[n\rightarrow\infty]{}C,\quad a.s.

\mathbb{E}\left(D(n)(\hat{\beta}-\beta)(\hat{\beta}-\beta)^{t}D(n)^{t}\Big{|}X\right)=D(n)(X^{t}X)^{-1}X^{t}\Gamma_{n}X(X^{t}X)^{-1}D(n).

\mathbb{E}\left(D(n)(\hat{\beta}-\beta)(\hat{\beta}-\beta)^{t}D(n)^{t}\Big{|}X\right)=D(n)(X^{t}X)^{-1}X^{t}\Gamma_{n}X(X^{t}X)^{-1}D(n).

C = C (Γ_{n}) := D (n) (X^{t} X)^{- 1} X^{t} Γ_{n} X (X^{t} X)^{- 1} D (n),

C = C (Γ_{n}) := D (n) (X^{t} X)^{- 1} X^{t} Γ_{n} X (X^{t} X)^{- 1} D (n),

Γ_{n, h} = [K (\frac{j - l}{h}) \tilde{γ}_{j - l}]_{1 \leq j, l \leq n},

Γ_{n, h} = [K (\frac{j - l}{h}) \tilde{γ}_{j - l}]_{1 \leq j, l \leq n},

\tilde{γ}_{k} = \frac{1}{n} j = 1 \sum n - ∣ k ∣ \overset{ε}{^}_{j} \overset{ε}{^}_{j + ∣ k ∣} .

\tilde{γ}_{k} = \frac{1}{n} j = 1 \sum n - ∣ k ∣ \overset{ε}{^}_{j} \overset{ε}{^}_{j + ∣ k ∣} .

\tilde{C}_{n}^{- 1/2} D (n) (\hat{β} - β) L n \to \infty N_{p} (0_{p}, I_{p}),

\tilde{C}_{n}^{- 1/2} D (n) (\hat{β} - β) L n \to \infty N_{p} (0_{p}, I_{p}),

Z_{j} = \frac{d _{j} ( n ) β ^ _{j}}{C _{(j, j)}}

Z_{j} = \frac{d _{j} ( n ) β ^ _{j}}{C _{(j, j)}}

W := (A C (Γ_{n}))^{- 1/2} D (n) A (\hat{β} - β) L n \to \infty N_{k} (0_{k}, I_{k}) .

W := (A C (Γ_{n}))^{- 1/2} D (n) A (\hat{β} - β) L n \to \infty N_{k} (0_{k}, I_{k}) .

Y_{i} = β_{1} + β_{2} (lo g (i) + sin (i) + Z_{i}) + β_{3} i + ε_{i},

Y_{i} = β_{1} + β_{2} (lo g (i) + sin (i) + Z_{i}) + β_{3} i + ε_{i},

\hat{f}_{J_{n r}} (λ) = \frac{1}{2 π} k = - J_{n r} \sum k = J_{n r} \overset{γ}{^}_{k} e^{ik λ},

\hat{f}_{J_{n r}} (λ) = \frac{1}{2 π} k = - J_{n r} \sum k = J_{n r} \overset{γ}{^}_{k} e^{ik λ},

J_{n r} = \frac{lo g ( n )}{2 r} [1 + (lo g (n))^{- 1/2}],

J_{n r} = \frac{lo g ( n )}{2 r} [1 + (lo g (n))^{- 1/2}],

K (x) = \frac{25}{12 π ^{2} x ^{2}} (\frac{sin ( 6 π x /5 )}{6 π x /5} - cos (6 π x /5))

K (x) = \frac{25}{12 π ^{2} x ^{2}} (\frac{sin ( 6 π x /5 )}{6 π x /5} - cos (6 π x /5))

ϕ_{j}^{(d)} = \frac{d}{π} \mathds 1_{[π j / d, π (j + 1) / d [}, j = 0, 1, \dots, d - 1.

ϕ_{j}^{(d)} = \frac{d}{π} \mathds 1_{[π j / d, π (j + 1) / d [}, j = 0, 1, \dots, d - 1.

\hat{f}_{d} (λ) = j = 0 \sum d - 1 \overset{a}{^}_{j}^{(d)} ϕ_{j}^{(d)},

\hat{f}_{d} (λ) = j = 0 \sum d - 1 \overset{a}{^}_{j}^{(d)} ϕ_{j}^{(d)},

\overset{a}{^}_{j}^{(d)} = \frac{d}{π} (\frac{γ ~ _{0}}{2 d} + \frac{1}{π} r = 1 \sum n - 1 \frac{γ ~ _{r}}{r} [sin (\frac{π ( j + 1 ) r}{d}) - sin (\frac{π j r}{d})]) .

\overset{a}{^}_{j}^{(d)} = \frac{d}{π} (\frac{γ ~ _{0}}{2 d} + \frac{1}{π} r = 1 \sum n - 1 \frac{γ ~ _{r}}{r} [sin (\frac{π ( j + 1 ) r}{d}) - sin (\frac{π j r}{d})]) .

γ_{k}

γ_{k}

γ_{0}

γ_{0}

\mbox cr i t (d) := - j = 0 \sum d - 1 [\overset{a}{^}_{j}^{(d)}]^{2} + c \frac{d}{n}

\mbox cr i t (d) := - j = 0 \sum d - 1 [\overset{a}{^}_{j}^{(d)}]^{2} + c \frac{d}{n}

\overset{γ}{^}_{I} (k) := \tilde{γ}_{k} \mathbbm 1_{k \in I \cup {0}} 0 \leq ∣ k ∣ \leq n - 1

\overset{γ}{^}_{I} (k) := \tilde{γ}_{k} \mathbbm 1_{k \in I \cup {0}} 0 \leq ∣ k ∣ \leq n - 1

ε_{i} - 0.7 ε_{i - 1} = W_{i},

ε_{i} - 0.7 ε_{i - 1} = W_{i},

ε_{i} - 0.5 ε_{i - 1} - 0.2 ε_{i - 12} = W_{i},

ε_{i} - 0.5 ε_{i - 1} - 0.2 ε_{i - 12} = W_{i},

ε_{i} = W_{i} + 0.5 W_{i - 2} + 0.3 W_{i - 3} + 0.2 W_{i - 12},

ε_{i} = W_{i} + 0.5 W_{i - 2} + 0.3 W_{i - 3} + 0.2 W_{i - 12},

Z_{i + 1} = \frac{1}{2} (Z_{i} + η_{i + 1}),

Z_{i + 1} = \frac{1}{2} (Z_{i} + η_{i + 1}),

ε_{i} = Q_{0, σ^{2}} (Z_{i}) .

ε_{i} = Q_{0, σ^{2}} (Z_{i}) .

\theta_{\gamma}(x)=\left\{\begin{array}[]{r c l}x(1+2^{\gamma}x^{\gamma})&\text{if}&\ x\in[0,1/2[\\ 2x-1&\text{if}&\ x\in[1/2,1].\\ \end{array}\right.

\theta_{\gamma}(x)=\left\{\begin{array}[]{r c l}x(1+2^{\gamma}x^{\gamma})&\text{if}&\ x\in[0,1/2[\\ 2x-1&\text{if}&\ x\in[1/2,1].\\ \end{array}\right.

ε_{i} = θ_{γ}^{i} .

ε_{i} = θ_{γ}^{i} .

Y_{i} = β_{1} + β_{2} (lo g (i) + sin (i) + Z_{i}) + β_{3} i + ε_{i}, for all i in {1, \dots, n},

Y_{i} = β_{1} + β_{2} (lo g (i) + sin (i) + Z_{i}) + β_{3} i + ε_{i}, for all i in {1, \dots, n},

ϵ_{i} = W_{i}^{2} - \frac{5}{4},

ϵ_{i} = W_{i}^{2} - \frac{5}{4},

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

E-Caron/slm
none

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdvanced Statistical Methods and Models · Statistical Methods and Inference · Statistical Methods and Bayesian Inference

Full text

Linear regression with stationary errors : the \proglangR package \pkgslm

Emmanuel Caron

LMJL Ecole Centrale Nantes &Jérôme Dedecker

Université Paris Descartes &Bertrand Michel

LMJL Ecole Centrale Nantes

[email protected]

\Plainauthor

Emmanuel Caron, Jérôme Dedecker, Bertrand Michel

\PlaintitleLinear regression with stationary errors : the R package slm \ShorttitleLinear regression with stationary errors \AbstractThis paper introduces the \proglangR package \pkgslm which stands for Stationary Linear Models. The package contains a set of statistical procedures for linear regression in the general context where the error process is strictly stationary with short memory. We work in the setting of Hannan (1973), who proved the asymptotic normality of the (normalized) least squares estimators (LSE) under very mild conditions on the error process. We propose different ways to estimate the asymptotic covariance matrix of the LSE, and then to correct the type $I$ error rates of the usual tests on the parameters (as well as confidence intervals). The procedures are evaluated through different sets of simulations, and two examples of real datasets are studied. \KeywordsLinear model, Least Squares Estimators, Stationary processes, AutoRegressive processes, Spectral density, Hypothesis testing\PlainkeywordsLinear model, Least Squares Estimators, Stationary processes, AutoRegressive processes, Spectral density, Hypothesis testing \Address Emmanuel Caron, Bertrand Michel

Laboratoire de Mathématiques Jean Leray UMR 6629

Ecole Centrale Nantes

44300 Nantes, France

E-mail: ,

URL: http://ecaron.perso.math.cnrs.fr, http://bertrand.michel.perso.math.cnrs.fr

**

Jérôme Dedecker

Laboratoire MAP5 UMR 8145

Université Paris Descartes

75006 Paris, France

E-mail:

URL: http://w3.mi.parisdescartes.fr/~jdedecke/

**

1 Introduction

We consider the usual linear regression model

[TABLE]

where $Y$ is the $n$ -dimensional vector of observations, $X$ is a (possibly random) $n\times p$ design matrix, $\beta$ is a $p$ -dimensional vector of parameters, and $\varepsilon=(\varepsilon_{i})_{1\leq i\leq n}$ is the error process (with zero mean and independent of $X$ ). The standard assumptions are that the $\varepsilon_{i}$ ’s are independent and identically distributed (i.i.d.) with zero mean and finite variance.

In this paper, we propose to modify the standard statistical procedures (tests, confidence intervals, …) of the linear model in the more general context where the $\varepsilon_{i}$ ’s are obtained from a strictly stationary process $(\varepsilon_{i})_{i\in{\mathbb{N}}}$ with short memory. To be more precise, let $\hat{\beta}$ denote the usual least squares estimator of $\beta$ . Our approach is based on two papers: the paper by Hannan (1973) who proved the asymptotic normality of the least squares estimator $D(n)(\hat{\beta}-\beta)$ ( $D(n)$ being the usual normalization) under very mild conditions on the design and on the error process; and a recent paper by Caron (2019) who showed that, under Hannan’s conditions, the asymptotic covariance matrix of $D(n)(\hat{\beta}-\beta)$ can be consistently estimated.

Let us emphasize that Hannan’s conditions on the error process are very mild and are satisfied for most of short-memory processes (see the discussion in Section $4.4$ of Caron and Dede (2018)). Putting together the two above results, we can develop a general methodology for tests and confidence regions on the parameter $\beta$ , which should be valid for most of short-memory processes. This is of course directly useful for time-series regression (we shall present in Section 5 an application to the "Mona Loa" \proglangR data-set on CO2 concentration), but also in the more general context where the residuals of the linear model seem to be strongly correlated. More precisely, when checking the residuals of the linear model, if the autocorrelation function of the residuals shows significant correlations, and if the residuals can be suitably modeled by an ARMA process, then our methodology is likely to apply. We shall give an example of such a situation in Section 5 (Shangai pollution data-set).

Hence, the tools presented in the present paper can be seen from two different points of view:

as appropriate tools for time series regression with short memory error process.

-

as a way to robustify the usual statistical procedures when the residuals are correlated.

Let us now describe the organisation of the paper. In Section 2, we recall the mathematical background, the consistent estimator of the asymptotic covariance matrix introduced in Caron (2019) and the modified $Z$ -statistics and $\chi$ -square statistics for testing hypothesis on the parameter $\beta$ . In Section 3 we present the \pkgslm package, and the different ways to estimate the asymptotic covariance matrix: by fitting an autoregressive process on the residuals (default procedure), by means of the kernel estimator described in Caron (2019) (theoretically valid) with a bootstrap method to choose the bandwidth (Wu and Pourahmadi (2009)), by using alternative choices of the bandwidth for the rectangular kernel (Efromovich (1998)) and the quadratic spectral kernel (Andrews (1991)), by means of an adaptative estimator of the spectral density via Histograms (Comte (2001)). In Section 4, we estimate the level of a $\chi$ -square test for a linear model with random design, with different kinds of error processes and for different estimation procedures. In Section 5, we present two different data sets "CO2 concentration", "Shangai pollution", and we compare the summary output of \codeslm with the usual summary output of \codelm.

2 Linear regression with stationary errors

2.1 Asymptotic results for the kernel estimator

We start this section by giving a short presentation of linear regression with stationary errors, more details can be found for instance in Caron (2019). Let $\hat{\beta}$ be the usual least squares estimator for the unknown vector $\beta$ . The aim is to provide hypothesis tests and confidence regions for $\beta$ in the non i.i.d. context.

Let $\gamma$ be the autocovariance function of the error process $\varepsilon$ : for any integers $k$ and $m$ , let $\gamma(k)=\mathrm{Cov}(\varepsilon_{m},\varepsilon_{m+k})$ . We also introduce the covariance matrix

[TABLE]

Hannan (1973) has shown a Central Limit Theorem for $\hat{\beta}$ when the error process is strictly stationary, under very mild conditions on the design and the error process. Let us notice that the design can be random or deterministic. We introduce the normalization matrix $D(n)$ which is a diagonal matrix with diagonal term $d_{j}(n)=\left\|X_{.,j}\right\|_{2}$ for $j$ in $\{1,\ldots,p\}$ , where $X_{.,j}$ is the $j$ th column of $X$ . Roughly speaking, Hannan’s result says in particular that, given the design $X$ , the vector $D(n)(\hat{\beta}-\beta)$ converges in distribution to a centered Gaussian distribution with covariance matrix $C$ . As usual, in practice the covariance matrix $C$ is unknown and it has to be estimated. Hannan also showed the convergence of second order moment:111The transpose of a matrix $X$ is denoted by $X^{t}$ .

[TABLE]

where

[TABLE]

In this paper we propose a general plug-in approach: for some given estimator $\widehat{\Gamma}_{n}$ of $\Gamma_{n}$ , we introduce the plug-in estimator

[TABLE]

and we use $\widehat{C}$ to standardize the usual statistics considered for the study of linear regression.

Let us illustrate this plug-in approach with a kernel estimator which has been proposed in Caron (2019). For some $K$ and a bandwidth $h$ , the kernel estimator $\widetilde{\Gamma}_{n,h}$ is defined by

[TABLE]

where the residual based empirical covariance coefficients are defined for $0\leq|k|\leq n-1$ by

[TABLE]

For a well-chosen kernel $K$ and under mild assumptions on the design and the error process, it has been proved in Caron (2019) that

[TABLE]

for the plug-in estimator $\tilde{C}_{n}:=\widehat{C}(\widetilde{\Gamma}_{n,h_{n}})$ , for some suitable sequence of bandwidths $(h_{n})$ .

More generally, in this paper we say that an estimator $\widehat{\Gamma}_{n}$ of $\Gamma_{n}$ is consistent for estimating the covariance matrix $C$ if $\widehat{C}(\widehat{\Gamma}_{n})$ is positive definite and if it converges in probability to $C$ . Note that such a property requires assumptions on the design, see Caron (2019). If $\widehat{C}(\widehat{\Gamma}_{n})$ is consistent for estimating the covariance matrix $C$ , then ${\widehat{C}(\widehat{\Gamma}_{n})}^{-1/2}D(n)(\hat{\beta}-\beta)$ converges in distribution to a standard Gaussian vector.

To conclude this section, let us make some additional remarks. The interest of Caron’s recent paper is that the consistency of the estimator $\widehat{C}(\widehat{\Gamma}_{n})$ is proved under Hannan’s condition on the error process, which is known to be optimal with respect to the convergence in distribution (see for instance Dedecker (2015)), and which allows to deal with most short memory processes. But the natural estimator of the covariance matrix of $\hat{\beta}$ based on $\widehat{\Gamma}_{n}$ has been studied by many other authors in various contexts. For instance, let us mention the important line of research initiated by Newey and West (1987, 1994), and the related papers by Andrews (1991); Andrews and Monahan (1992) among others. In the paper by Andrews (1991), the consistency of the estimator based on $\widehat{\Gamma}_{n}$ is proved under general conditions on the fourth order cumulants of the error process, and a data driven choice of the bandwidth is proposed. Note that these authors also considered the case of heteroskedastic processes. Most of these procedures, known as HAC (Heteroskedasticity and Autocorrelation Consistent) procedures, are implemented in the package \pkgsandwich by Zeileis, Lumley, Berger and Graham222See the CRAN website: https://cran.r-project.org/web/packages/sandwich/index.html., and presented in great detail in the paper by Zeileis (2004). We shall use an argument of the \pkgsandwich package, based on the data driven procedure described by Andrews (1991) at the end of Section 3.2 (see also Section 4 for a comparison with other methods).

2.2 Tests and confidence regions

We now present tests and confidence regions for arbitrary estimators $\widehat{\Gamma}_{n}$ . The complete justifications are available for kernel estimators, see Caron (2019).

Z-Statistics.

We introduce the following univariate statistics:

[TABLE]

where $\widehat{C}=\widehat{C}(\widehat{\Gamma}_{n})$ . If $\widehat{\Gamma}_{n}$ is consistent for estimating the covariance matrix $C$ and if $\beta_{j}=0$ , the distribution of $Z_{j}$ converges to a standard normal distribution when $n$ tends to infinity. We directly derive an asymptotic hypothesis test for testing $\beta_{j}=0$ against $\beta_{j}\neq 0$ as well as an asymptotic confidence interval for $\beta_{j}$ .

Chi-square statistics.

Let $A$ be a $n\times k$ matrix with $\operatorname{rank}(A)=k$ . Under Hannan (1973)’s conditions, $D(n)(A\hat{\beta}-A\beta)$ converges in distribution to a centered Gaussian distribution with covariance matrix $ACA^{t}$ . If $\widehat{\Gamma}_{n}$ is consistent for estimating the covariance matrix $C$ , then $A\widehat{C}(\widehat{\Gamma}_{n})$ converges in probability to $AC$ . The matrix $A\widehat{C}(\widehat{\Gamma}_{n})A^{t}$ being symmetric positive definite, this yields

[TABLE]

This last result provides asymptotical confidence regions for the vector $A\beta$ . It also provides an asymptotic test for testing the hypothesis $H_{0}$ : $A\beta=0$ against $H_{1}$ : $A\beta\neq 0$ . Indeed, under the $H_{0}$ -hypothesis, the distribution of $\|W\|_{2}^{2}$ converges to a $\chi^{2}(k)$ -distribution.

The test can be used to simplify a linear model by testing that several linear combinations between the parameters $\beta_{j}$ are zero, as we usually do for Anova and regression models. In particular, with $A=I_{p}$ , the test corresponds to the test of overall significance.

3 Introduction to linear regression with the slm package

Using the \pkgslm package is very intuitive because the arguments and the outputs of \codeslm are similar to those of the standard functions \codelm, \codeglm, etc. The output of the main function \codeslm is an object of class \codeslm , a specific class that has been defined for linear regression with stationary processes. The \codeslm class has methods \codeplot, \codesummary, \codeconfint and \codepredict. Moreover, the class \codeslm inherits from the \codelm class and thus provides the output of the classical \codelm function.

R> library(slm)

The statistical tools available in \codeslm strongly depend on the choice of the covariance plug-in estimator $\widehat{C}(\widehat{\Gamma}_{n})$ we use for estimating $C$ . All the estimators $\widehat{\Gamma}_{n}$ proposed in \codeslm are residual-based estimators, but they rely on different approaches. In this section, we present the main functionality of \codeslm together with the different covariance plug-in estimators.

For illustrating the package, we simulate synthetic data according to the linear model:

[TABLE]

where $Z$ is a gaussian autoregressive process of order $1$ , and $\varepsilon$ is the Nonmixing process described in Section 4.1. We use the functions \codegenerative_model and \codegenerative_process respectively to simulate observations according to this regression design and with this specific stationary process. More details on the designs and the processes available with

\codegenerative_model and \codegenerative_process are given in Section 4.1.

R> n = 500R> eps = generative_process(n,"Nonmixing")R> design = generative_model(n,"mod2")R> design_sim = cbind(rep(1,n), as.matrix(design))R> beta_vec = c(2,0.001,0.5)R> Y = design_sim %*% beta_vec + eps

3.1 Linear regression via AR fitting on the residuals

A large class of stationary processes with continuous spectral density can be well approximated by AR processes, see for instance Corollary 4.4.2 in Brockwell and Davis (1991). The covariance structure of an AR process having a closed form, it is thus easy to derive an approximation $\widetilde{\Gamma}_{AR(p)}$ of $\Gamma_{n}$ by fitting an AR process on the residual process.

The AR-based method for estimating $C$ is the default version of \codeslm. This method proceeds in four main steps:

Fit an autoregressive process on the residual process $\hat{\varepsilon}$ ; 2. 2.

Compute the theoretical covariances of the fitted AR process ; 3. 3.

Plug the covariances in the Toeplitz matrix $\widetilde{\Gamma}_{AR(p)}$ ; 4. 4.

Compute $\widehat{C}=\widehat{C}(\widetilde{\Gamma}_{AR(p)})$ .

The \codeslm function fits a linear regression of the vector $Y$ on the design $X$ and then fits an AR process on the residual process using the \codear function from the \pkgstats package. The output of the \codeslm function is an object of class \codeslm. The order $p$ of the AR process is set in the argument \codemodel_selec:

R> regslm = slm(Y ~ X1+X2, data = design, method_cov_st = "fitAR",+ model_selec = 3) The estimated covariance is recorded as a vector in the attribute \codecov_st of \coderegslm, which is an object of class \codeslm. The estimated covariance matrix can be computed by taking the Toeplitz matrix of \codecov_st, using the \codetoeplitz function.

Summary method

As for \codelm objects, a summary of a \codeslm object is given by

R> summary(regslm)

Call:"slm(formula = myformula, data = data, x = x, y = y)"Residuals: Min 1Q Median 3Q Max-13.9086 -3.4586 0.1646 3.5025 13.7488Coefficients: Estimate Std. Error z value Pr(>|z|)(Intercept) 2.936183 0.855214 3.433 0.000596 X1 0.084387 0.082371 1.024 0.305613X2 0.492590 0.002738 179.938 < 2e-16 ---Signif. codes: 0 ’’ 0.001 ’’ 0.01 ’’ 0.05 ’.’ 0.1 ’ ’ 1Residual standard error: 4.907Multiple R-squared: 0.9953chi2-statistic: 3.278e+04 on 2 DF, p-value: < 2.2e-16 The coefficient table output by the summary provides the estimators of the $\beta_{j}$ ’s, which are exactly the classical least squares estimators. The \codez value column provides the values of the $Z_{j}$ statistics defined by (4). The \codeStd.Error column gives an estimation of the standard errors of the $\hat{\beta}_{j}$ ’s, which are taken equal to $\frac{\sqrt{\widehat{C}_{(j,j)}}}{d_{j}(n)}$ . As with the \codelm function, the p-value column is the p-value for testing $\beta_{j}=0$ against $\beta_{j}\neq 0$ . In this example, the small p-value for the second feature $X2$ is consistent with the value chosen for \codebeta_vec at the beginning of the section. The \codechi2-statistic at the end of the summary is the $\chi^{2}$ statistic for testing the significance of the model (see the end of Section 2.2) For this example, the p-value is very small, indeed the variable $X2$ has a significant effect on $Y$ .

Plot argument and plot method

The \codeslm function has a \codeplot argument: with \codeplot = TRUE, the function plots a figure which depends on the method chosen for estimating the covariance matrix $C$ . Table 1 summarizes the plots for each method given in the argument \codemethod_cov_st. With the AR fitting method, the argument \codeplot = TRUE outputs the ACF and the PACF of the residual process. The ACF and PACF are computed with the functions \codeacf and \codepacf of the \pkgstats package. As usual, the ACF and PACF graphs should help the user to choose an appropriate order for the AR process.

R> regslm = slm(Y ~ X1 + X2, data = design, method_cov_st = "fitAR",+ model_selec = 2, plot = TRUE) The plot output by the \codeslm function for this example is given in Figure 1.

Since the \codeslm class inherits from the \codelm class, the former class comes with a \codeplot method which is the same as for the \codelm class, namely the diagnostic analysis of the linear regression. The graphics are displayed using the command

R> plot(regslm)

Confidence intervals for the coefficients

The \codeconfint function computes the confidence intervals for the coefficients $\beta_{j}$ estimated by \codeslm. These intervals are computed according to the distribution of the $Z_{j}$ statistics defined in (4).

R> confint(regslm, level = 0.90)

5 % 95 %(Intercept) 1.56396552 4.3083996X1 -0.05048587 0.2192597X2 0.48821351 0.4969666

AR order selection

The order $p$ of the AR process can be chosen at hand by setting \codemodel_selec = p, or automatically with the AIC criterion by setting \codemodel_selec = -1.

R> regslm = slm(Y ~ X1 + X2, data = design, method_cov_st = "fitAR",+ model_selec = -1) The order of the fitted AR process is recorded in the \codemodel_selec attribute of \coderegslm:

R> regslm@model_selec

[1] 2 Here, the AIC criterion suggests to fit an AR(2) process on the residuals.

3.2 Linear regression via kernel estimation of the error covariance

The second method for estimating the covariance matrix $C$ is the kernel estimation method (1) studied in Caron (2019). In short, this method estimates $C$ via a smooth approximation of the covariance matrix $\Gamma_{n}$ of the residuals. This estimation of $\Gamma_{n}$ corresponds to the so-called tapered covariance matrix estimator in the literature, see for instance Xiao and Wu (2012), or also to the "lag-window estimator" defined in Brockwell and Davis (1991), page $330$ . It applies in particular for non negative symmetric kernels with compact support, with an integrable Fourier transform and such that $K(0)=1$ . Table 2 gives the list of the available kernels in the package \pkgslm.

It is also possible for the user to define his own kernel and to use it in the argument \codekernel_fonc of the \codeslm function. Below we use the triangle kernel which assures that the covariance matrix is positive definite. The support of the kernel $K$ in Equation (1) being compact, only the terms $\tilde{\gamma}_{j-l}$ for small enough lag $j-l$ are kept and weighted by the kernel in the expression of $\widetilde{\Gamma}_{n,h}$ . Rather than setting the bandwidth $h$ , we select the number of $\gamma(k)$ ’s that should be kept (the lag) with the argument \codemodel_selec in the \codeslm function. Then the bandwidth $h$ is calibrated accordingly, that is equal to $\texttt{model\_selec}+1$ .

R> regslm = slm(Y ~ X1 + X2, data = design, method_cov_st = "kernel",+ model_selec = 5, kernel_fonc = triangle, plot = TRUE) The plot output by the \codeslm function is given in Figure 2.

Order selection via bootstrap

The order parameter can be chosen at hand as before or automatically by setting \codemodel_selec = -1. The automatic order selection is based on the bootstrap procedure proposed by Wu and Pourahmadi (2009) for banded covariance matrix estimation. The \codeblock_size argument sets the size of bootstrap blocks and the \codeblock_n argument sets the number of blocks. The final order is chosen by taking the order which has the minimal risk. Figure 3 gives the plots of the estimated risk for the estimation of $\Gamma_{n}$ (left) and the final estimated ACF (right).

R> regslm = slm(Y ~ X1 + X2, data = design, method_cov_st ="kernel",+ model_selec = -1, kernel_fonc = triangle, model_max = 30,+ block_size = 100, block_n = 100, plot = TRUE)

The selected order is recorded in the \codemodel_selec attribute of the \codeslm object output by the \codeslm function:

R> regslm@model_selec

[1] 10

3.2.1 Order selection by Efromovich’s method (rectangular kernel)

An alternative method for choosing the bandwidth in the case of the rectangular kernel has been proposed in Efromovich (1998). For a large class of stationary processes with exponentially decaying autocovariance function (mainly the ARMA processes), Efromovich proved that the rectangular kernel is asymptotically minimax, and he proposed the following estimator:

[TABLE]

with the lag

[TABLE]

where $r$ is a regularity index of the autocovariance index. In practice this parameter is unknown and is estimated thanks to the algorithm proposed in the section $4$ of Efromovich (1998). As for the other methods, we use the residual based empirical covariances $\tilde{\gamma}_{k}$ to compute $\hat{f}_{J_{nr}}(\lambda)$ .

R> regslm = slm(Y ~ X1 + X2, data = design, method_cov_st = "efromovich",+ model_selec = -1)

3.2.2 Order Selection by Andrews’s method

Another method for choosing the bandwidth has been proposed by Andrews (1991) (see the last paragraph of Section 2.1) and implemented in the package \pkgsandwich by Zeileis, Lumley, Berger and Graham (see the paper by Zeileis (2004)). For the \pkgslm package, the automatic choice of the bandwidth proposed by Andrews can be obtained as follows

R> regslm = slm(Y ~ X1 + X2, data = design, method_cov_st = "hac") The procedure is based on the function \codekernHAC in the \pkgsandwich package. This function computes directly the covariance matrix estimator of $\hat{\beta}$ , which will be recorded in the slot \codeCov_ST of the \codeslm function.

Here, we take the quadratic spectral kernel

[TABLE]

as suggested by Andrews (see Section $2$ in Andrews (1991), or Section $3.2$ in Zeileis (2004)), but other kernels could be used, such as Bartlett, Parzen, Tukey-Hamming, among others (see Zeileis (2004)).

Positive definite projection

Depending of the method used, the matrix $\widehat{C}(\widehat{\Gamma}_{n})$ may not always be positive definite. It is the case of the kernel method with rectangular or trapeze kernel. To overcome this problem, we make the projection of $\widehat{C}(\widehat{\Gamma}_{n})$ into the cone of positive definite matrices by applying a hard thresholding on the spectrum of this matrix: we replace all eigenvalues lower or equal to zero with the smallest positive eigenvalue of $\widehat{C}(\widehat{\Gamma}_{n})$ .

Note that this projection is useless for the triangle or quadratic spectral kernels because their Fourier transform is non-negative (leading to a positive definite matrix $\widehat{C}(\widehat{\Gamma}_{n})$ ). Of course, it is also useless for the \codefitAR and \codespectralproj methods.

3.3 Linear regression via projection spectral estimation

The projection method relies on the ideas of Comte (2001), where an adaptive nonparametric method has been proposed for estimating the spectral density of a stationary Gaussian process.

We use the residual process as a proxy for the error process and we compute the projection coefficients with the residual-based empirical covariance coefficients $\tilde{\gamma}_{k}$ , see Equation (2).

For some $d\in\mathbb{N}^{*}$ , the estimator of the spectral density of the error process that we use is defined by computing the projection estimators for the residual process, on the basis of histogram functions

[TABLE]

The estimator is defined by

[TABLE]

where the projection coefficients are

[TABLE]

The Fourier coefficients of the spectral density are equal to the covariance coefficients. Thus, for $k=1,\ldots,n-1$ it yields

[TABLE]

and for $k=0$ :

[TABLE]

This method can be proceeded in the \codeslm function by setting \codemethod_cov_st =

\code"spectralproj":

R> regslm = slm(Y ~ X1 + X2, data = design, method_cov_st = "spectralproj",+ model_selec = 10, plot = TRUE) The graph of the estimated spectral density can be plotted by setting \codeplot = TRUE in the \codeslm function, see Figure 4.

Model selection

The Gaussian model selection method proposed in Comte (2001) follows the ideas of Birgé and Massart, see for instance Massart (2007). It consists in minimizing the $l_{2}$ penalized criterion, see Section 5 in Comte (2001):

[TABLE]

where $c$ is a multiplicative constant that in practice can be calibrated using the slope heuristic method, see Birgé and Massart (2007); Baudry et al. (2012) and the \proglangR package \pkgcapushe.

R> regslm = slm(Y ~ X1 + X2, data = design, method_cov_st = "spectralproj",+ model_selec = -1, model_max = 50, plot = TRUE) The selected dimension is recorded in the \codemodel_selec attribute of the \codeslm object output by the \codeslm function:

R> regslm@model_selec

[1] 8 The slope heuristic algorithm here selects an Histogram on a regular partition of size $8$ over the interval $[0,\pi]$ to estimate the spectral density.

3.4 Linear regression via masked covariance estimation

This method is a full-manual method for estimating the covariance matrix $C$ by only selecting covariance terms from the residual covariances $\tilde{\gamma}_{k}$ defined by (2). Let $I$ be a set of positive integers, then we consider

[TABLE]

and then we define the estimated covariance marix $\widehat{\Gamma}_{I}$ by taking the Toeplitz matrix of the vector $\hat{\gamma}_{I}$ . This estimator is a particular example of masked sample covariance estimator, as introduced by Chen et al. (2012), see also Levina and Vershynin (2012). Finally we derive from $\widehat{\Gamma}_{I}$ an estimator $\widehat{C}(\widehat{\Gamma}_{I})$ for $C$ .

The next instruction selects the coefficients 0, 1, 2 and 4 from the residual covariance terms:

R> regslm = slm(Y ~ X1 + X2, data = design, method_cov_st = "select",+ model_selec = c(1,2,4)) The positive lags of the selected covariances are recordered in the \codemodel_selec argument. Let us notice that the variance $\gamma_{0}$ is automatically selected.

As for the kernel method, the resulting covariance matrix may not be positive definite. If it is the case, the positive definite projection method, described at the end of the section 3.2, is used.

3.5 Linear regression via manual plugged covariance matrix

This last method is a direct plug-in method. The user proposes his own vector estimator $\hat{\gamma}$ of $\gamma$ and then the Toeplitz matrix $\widehat{\Gamma}_{n}$ of the vector $\hat{\gamma}$ is used for estimating $C$ with $\widehat{C}(\widehat{\Gamma}_{n})$ .

R> v = rep(0,n)R> v[1:10] = acf(epsilon, type = "covariance", lag.max = 9)$acfR> regslm = slm(Y ~ X1 + X2, data = design, cov_st = v)

The user can also propose his own covariance matrix $\widehat{\Gamma}_{n}$ for estimating $C$ .

R> v = rep(0,n)R> v[1:10] = acf(epsilon, type = "covariance", lag.max = 9)$acfR> V = toeplitz(v)R> regslm = slm(Y ~ X1 + X2, data = design, Cov_ST = V)

Let us notice that the user must verify that the resulting covariance matrix is positive definite. The positive definite projection algorithm is not used with this method.

4 Numerical experiments and method comparisons

This section summarizes an extensive study which has been carried out to compare the performances of the different approaches presented before in the context of linear model with short range dependent stationary errors.

4.1 Description of the generative models

We first present the five generative models for the errors that we consider in the paper. We choose different kinds of processes to reflect the diversity of short-memory processes.

•

AR1 process. The AR1 process is a gaussian AR(1) process defined by:

[TABLE]

where $W_{i}$ is a standard gaussian distribution $\mathcal{N}(0,1)$ .

•

AR12 process. The AR12 process is a seasonal AR(12) process defined by:

[TABLE]

where $W_{i}$ is a standard gaussian distribution $\mathcal{N}(0,1)$ . When studying monthly data-sets, one usually observes a seasonality of order $12$ . For example, when looking at climate data (such as the "CO2 concentration" dataset of Section 5), the data are often collected per month, and the same phenomenon tends to repeat every year. Even if the design integrates the deterministic part of the seasonality, a correlation of order $12$ remains usually present in the residual process.

•

MA12 process. The MA12 is also a seasonal process defined by:

[TABLE]

where the $(W_{i})$ ’s are i.i.d. random variables following Student’s distribution with $10$ degrees of freedom.

•

Nonmixing process. The three processes described above are basic ARMA processes, whose innovations have absolutely continuous distributions; in particular, they are strongly mixing in the sense of Rosenblatt (1956), with a geometric decay of the mixing coefficients (in fact the MA12 process is even $12$ -dependent, which means that the mixing coefficient $\alpha(k)=0$ if $k>12$ ). Let us now describe a more complicated process: let $(Z_{1},\ldots,Z_{n})$ satisfying the AR(1) equation

[TABLE]

where $Z_{1}$ is uniformly distributed over $[0,1]$ and the $\eta_{i}$ ’s are i.i.d. random variables with distribution $\mathcal{B}(1/2)$ , independent of $Z_{1}$ . The process $(Z_{i})_{i\geq 1}$ is a strictly stationary Markov chain, but it is not $\alpha$ -mixing in the sense of Rosenblatt (see Bradley (1986)). Let now $Q_{0,\sigma^{2}}$ be the inverse of the cumulative distribution function of a centered Gaussian distribution with variance $\sigma^{2}$ (for the simulations below, we choose $\sigma^{2}=25$ ). The Nonmixing process is then defined by

[TABLE]

The sequence $(\varepsilon_{i})_{i\geq 1}$ is also a stationary Markov chain (as an invertible function of a stationary Markov chain). By construction, $\varepsilon_{i}$ is $\mathcal{N}(0,\sigma^{2})$ -distributed, but the sequence $(\varepsilon_{i})_{i\geq 1}$ is not a Gaussian process (otherwise it would be mixing in the sense of Rosenblatt). Although it is not obvious, one can prove that the process $(\varepsilon_{i})_{i\geq 1}$ satisfies Hannan’s condition (see Caron (2019), Section $4.2$ ).

•

Sysdyn process. The four processes described above have the property of "geometric decay of correlations", which means that the $\gamma(k)$ ’s tend to [math] at an exponential rate. However, as already pointed out in the introduction, Hannan’s condition is valid for most of short memory processes, even for processes with slow decay of correlations (provided that the $\gamma(k)$ ’s are summable). Hence, our last example will be a non-mixing process (in the sense of Rosenblatt), with an arithmetic decay of the correlations.

For $\gamma\in]0,1[$ , the intermittent map $\theta_{\gamma}:[0,1]\mapsto[0,1]$ introduced in Liverani et al. (1999) is defined by

[TABLE]

It follows from Liverani et al. (1999) that there exists a unique $\theta_{\gamma}$ -invariant probability measure $\nu_{\gamma}$ . The Sysdyn process is then defined by

[TABLE]

From Liverani et al. (1999), we know that, on the probability space $([0,1],\nu_{\gamma})$ , the autocorrelations $\gamma(k)$ of the stationary process $(\varepsilon_{i})_{i\geq 1}$ are exactly of order $k^{-(1-\gamma)/\gamma}$ . Hence $(\varepsilon_{i})_{i\geq 1}$ is a short memory process provided $\gamma\in]0,1/2[$ . Moreover, it has been proved in Section $4.4$ of Caron and Dede (2018) that $(\varepsilon_{i})_{i\geq 1}$ satisfies Hannan’s condition in the whole short-memory range, that is for $\gamma\in]0,1/2[$ . For the simulations below, we took $\gamma=1/4$ , which give autocorrelations $\gamma(k)$ of order $k^{-3}$ .

The linear regression models simulated in the experiments all have the following form:

[TABLE]

where $Z$ is a gaussian autoregressive process of order $1$ and $\varepsilon$ is one of the stationary processes defined above. For the simulations, $\beta_{1}$ is always equal to $3$ . All the error processes presented above can be simulated with the \pkgslm package with the \codegenerative_process function. The design can be simulated with the \codegenerative_model function.

4.2 Automatic calibration of the tests

It is of course of first importance to provide hypothesis tests with correct significance levels or at least with correct asymptotical significance levels, which is possible if the estimator $\widehat{\Gamma}_{n}$ of the covariance matrix $\Gamma_{n}$ is consistent for estimating $C$ . For instance, the results of Caron (2019) show that it is possible to construct statistical tests with correct asymptotical significance levels. However in practice such asymptotical results are not sufficient since they do not indicate how to tune the bandwidth on a given dataset. This situation makes the practice of linear regression with dependent errors really more difficult than linear regression with i.i.d. errors. This problem happens for several methods given before : order choice for the \codefitAR method, bandwidth choice for the \codekernel method, dimension selection for the \codespectralproj method.

It is a tricky issue to design a data driven procedure for choosing test parameters in order to have to correct Type I Error. Note that unlike with supervised problems and density estimation, it is not possible to calibrate hypothesis tests in practice using cross validation approaches. We thus propose to calibrate the tests using well founded statistical procedures for risk minimization : AIC criterion for the \codefitAR method, bootstrap procedures for the \codekernel method and slope heuristics for the \codespectralproj method. These procedures are implemented in the \codeslm function with the \codemodel_selec = -1 argument, as detailed in the previous section.

Let us first illustrate the calibration problem with the AR12 process. For $T=1000$ simulations, we generate an error process of size $n$ under the null hypothesis: $H_{0}:\beta_{2}=\beta_{3}=0$ . Then we use the \codefitAR method of the \codeslm function with orders between $1$ and $50$ and we perform the model significance test. The procedure is repeated $1000$ times and we estimate the true level of the test by taking the average of the estimated levels on the $1000$ simulations for each order. The results are given on Figure 5 for $n=1000$ . A boxplot is also displayed to visualize the distribution of the order selected by the automatic criterion (AIC).

4.3 Non-Seasonal errors

We first study the case of non-Seasonal error processes. We simulate a $n$ -error process according to the AR1, the Nonmixing or the Sysdyn processes. We simulate realizations of the linear regression model (5) under the null hypothesis: $H_{0}:\beta_{2}=\beta_{3}=0$ . We use the automatic selection procedures for each method (\codemodel_selec = -1). The simulations are repeated $1000$ times in order to estimate the true level of the model significance for each test procedure. We simulate either small samples ( $n=200$ ) or larger samples ( $n=1000,2000,5000$ ). The results of this experiments are summarized in Table 3.

For $n$ large enough ( $n\geq 1000$ ), all methods work well and the estimated level is around $0.05$ . However, for small samples ( $n=200$ ), we observe that the \codefitAR and the \codehac methods show better performances than the others. The \codekernel method is slightly less effective. With this method, we must choose the size of the bootstrap blocks as well as the number of blocks and the test results are really sensitive to these parameters. In these simulations, we have chosen $100$ blocks with a size of $n/2$ . The results are expected to improve with a larger number of blocks.

Let us notice that for all methods and for all sample sizes, the estimated level is much better than if no correction is made (usual Fisher tests).

4.4 Seasonal errors

We now study the case of linear regression with seasonal errors. The experiment is exactly the same as before, except that we simulate AR12 or MA12 processes. The results of these experiments are summarized in Table 4.

We directly see that the case of seasonal processes is more complicated than for the non-seasonal processes especially for the AR12 process. For small samples size, the estimated level is between $0.17$ and $0.24$ , which is clearly too large. It is however much better than the estimated level of the usual Fisher test, which is around $0.45$ . The \codefitAR method is the best method here for the AR12 process, because for $n\geq 1000$ , the estimated level is between $0.06$ and $0.07$ . For \codeefromovich and \codekernel methods, a level less than $0.10$ is reached but for large samples only. The \codespectralproj and \codehac methods do not seem to work well for the AR12 process, although they remain much better than the usual Fisher tests (around $19\%$ of rejection instead of $45\%$ ).

The case of the MA12 process seems easier to deal with. For $n$ large enough ( $n\geq 1000$ ), the estimated level is between $0.04$ and $0.07$ whatever the method, except for \codehac (around $0.15$ for $n=5000$ ). It is less effective for small sample size ( $n=200$ ) with an estimated level around $0.115$ for \codefitAR, \codespectralproj and \codeefromovich methods.

4.5 I.I.D. errors

To be complete, we consider in this subsection the case where the $\epsilon_{i}$ ’s are i.i.d., to see how the five automatic methods perform in that case.

We simulate $n$ i.i.d. centered random variables according to the formula:

[TABLE]

where $W$ follows a student distribution with $10$ degrees of freedom. Note that the distribution of the $\epsilon_{i}$ ’s is not symmetric and has no exponential moments.

Except for the kernel method, the estimated levels are close to $5\%$ for $n$ large enough ( $n\geq 300$ ). It is slightly worse for small samples but it remains quite good for the methods \codefitAR, \codeefromovich and \codehac.

As a general conclusion of Section $4$ , we see that the \codefitAR method performs quite well in a wide variety of situations, and should therefore be used as soon as the user suspects that the error process can be modeled by a stationary short-memory process.

5 Application to real data

5.1 Data CO2

Let us introduce the first dataset that we want to study. It concerns the well-known dataset "co2", available in the package \pkgdatasets of \proglangR:

R> data("co2") This dataset is provided by the observatory of Mona Loa (Hawaii). It contains average monthly measurements of CO2 (parts per million: ppmv) in the atmosphere of the Hawaiian coast. Surveys were produced monthly between $1959$ and $1998$ , giving a total of $468$ measurements. The graph of the data is displayed in Figure 6. More information on this dataset is available in the \proglangR documentation.

We model the CO2 measurements with a time series. Typically, a time series can be decomposed into three parts: a trend $m$ and a seasonality $s$ , which are deterministic components, and the errors $\varepsilon$ , which constitute the random part of the model. The trend represents the overall behavior of the series and seasonality its periodic behavior. Formally, we have:

[TABLE]

where $Y_{t}$ represents the CO2 rate at time $t$ , with the usual constraints $s_{t}=s_{t+12}$ and $\sum_{t=1}^{12}s_{t}=0$ . The two deterministic components can be grouped into a matrix $X$ and the model can be rewritten into a linear regression model:

[TABLE]

For this example, we fit a $3$ -degree polynomial for the trend and a trigonometric polynomial with well-chosen frequencies for the seasonality. Here the time $t$ represents a month and $t$ goes from $1$ to $40$ by step of length $1/12$ . Let us perform a linear regression to fit the trend and the seasonality on the CO2 time series, using the \codelm function:

R> y = as.vector(co2)R> t = as.vector(time(co2)) - 1958R> regtrigo = lm(y ~ t + I(t^2) + I(t^3) + sin(2pit) + cos(2pit)+ + sin(4pit) + cos(4pit) + sin(6pit) + cos(6pit)+ + sin(8pit) + cos(8pit)) We obtain the following output:

R> summary.lm(regtrigo)

Call:lm(formula = y ~ t + I(t^2) + I(t^3) + sin(2 * pi * t) + cos(2 * pi * t) + sin(4 * pi * t) + cos(4 * pi * t) + sin(6 * pi * t) + cos(6 * pi * t) + sin(8 * pi * t) + cos(8 * pi * t))Residuals: Min 1Q Median 3Q Max-1.54750 -0.32688 0.00233 0.28100 1.50295Coefficients: Estimate Std. Error t value Pr(>|t|)(Intercept) 3.157e+02 1.118e-01 2823.332 < 2e-16 ***t 3.194e-01 2.306e-02 13.847 < 2e-16 ***I(t^2) 4.077e-02 1.293e-03 31.523 < 2e-16 ***I(t^3) -4.562e-04 2.080e-05 -21.930 < 2e-16 ***sin(2 * pi * t) 2.751e+00 3.298e-02 83.426 < 2e-16 ***cos(2 * pi * t) -3.960e-01 3.296e-02 -12.015 < 2e-16 ***sin(4 * pi * t) -6.743e-01 3.296e-02 -20.459 < 2e-16 **cos(4 * pi * t) 3.785e-01 3.296e-02 11.484 < 2e-16 sin(6 * pi * t) -1.042e-01 3.296e-02 -3.161 0.00168 **cos(6 * pi * t) -4.389e-02 3.296e-02 -1.332 0.18362sin(8 * pi * t) 8.733e-02 3.296e-02 2.650 0.00833 cos(8 * pi * t) 2.559e-03 3.296e-02 0.078 0.93814---Signif. codes: 0 ’’ 0.001 ’’ 0.01 ’’ 0.05 ’.’ 0.1 ’ ’ 1Residual standard error: 0.5041 on 456 degrees of freedomMultiple R-squared: 0.9989,Adjusted R-squared: 0.9989F-statistic: 3.738e+04 on 11 and 456 DF, p-value: < 2.2e-16 We see in the summary that two variables have no significant effect on the CO2 rate. Next, we perform a backward selection method with a p-value threshold equal to $0.05$ . This selects the following model:

R> regtrigo = lm(y ~ t + I(t^2) + I(t^3) + sin(2pit) + cos(2pit)+ + sin(4pit) + cos(4pit) + sin(6pit) + sin(8pit)) with the corresponding summary

R> summary.lm(regtrigo)

Call:lm(formula = y ~ t + I(t^2) + I(t^3) + sin(2 * pi * t) + cos(2 * pi * t) + sin(4 * pi * t) + cos(4 * pi * t) + sin(6 * pi * t) + sin(8 * pi * t))Residuals: Min 1Q Median 3Q Max-1.59287 -0.32364 0.00226 0.29884 1.50154Coefficients: Estimate Std. Error t value Pr(>|t|)(Intercept) 3.157e+02 1.118e-01 2824.174 < 2e-16 ***t 3.196e-01 2.306e-02 13.861 < 2e-16 ***I(t^2) 4.075e-02 1.293e-03 31.522 < 2e-16 ***I(t^3) -4.560e-04 2.080e-05 -21.927 < 2e-16 ***sin(2 * pi * t) 2.751e+00 3.297e-02 83.446 < 2e-16 ***cos(2 * pi * t) -3.960e-01 3.295e-02 -12.018 < 2e-16 ***sin(4 * pi * t) -6.743e-01 3.295e-02 -20.464 < 2e-16 **cos(4 * pi * t) 3.785e-01 3.295e-02 11.487 < 2e-16 sin(6 * pi * t) -1.042e-01 3.295e-02 -3.162 0.00167 **sin(8 * pi * t) 8.734e-02 3.295e-02 2.651 0.00831 ---Signif. codes: 0 ’’ 0.001 ’’ 0.01 ’’ 0.05 ’.’ 0.1 ’ ’ 1Residual standard error: 0.504 on 458 degrees of freedomMultiple R-squared: 0.9989,Adjusted R-squared: 0.9989F-statistic: 4.57e+04 on 9 and 458 DF, p-value: < 2.2e-16

The sum of the estimated trend and estimated tendency is displayed on the left plot of Figure 7, and the residuals are displayed on the right plot.

The \codelm procedure assumes that the errors are independent, but if we look at the autocorrelation function of the residual process we clearly observe that the residuals are strongly correlated, see Figure 8. Consequently, the \codelm procedure may be unreliable in this context.

The autocorrelation function of the residuals decreases rather fast. Looking at the partial autocorrelation function, it seems reasonable to fit an AR process on the residuals. The automatic \codefitAR method selects an AR of order $14$ and the residuals look like a white noise, see Figure 9.

We now use the \codeslm function with the \codefitAR method with the following complete model

R> regtrigo = slm(y ~ t + I(t^2) + I(t^3) + sin(2pit) + cos(2pit)+ + sin(4pit) + cos(4pit) + sin(6pit) + cos(6pit)+ + sin(8pit) + cos(8pit), method_cov_st = "fitAR",+ model_selec = -1) Let us display the summary of the procedure:

R> summary(regtrigo)

Call:"slm(formula = myformula, data = data, x = x, y = y)"Residuals: Min 1Q Median 3Q Max-1.54750 -0.32688 0.00233 0.28100 1.50295Coefficients: Estimate Std. Error z value Pr(>|z|)(Intercept) 3.157e+02 3.968e-01 795.646 < 2e-16 ***t 3.194e-01 8.222e-02 3.884 0.000103 ***I(t^2) 4.077e-02 4.619e-03 8.825 < 2e-16 ***I(t^3) -4.562e-04 7.430e-05 -6.140 8.23e-10 ***sin(2 * pi * t) 2.751e+00 4.739e-02 58.054 < 2e-16 ***cos(2 * pi * t) -3.960e-01 4.716e-02 -8.396 < 2e-16 ***sin(4 * pi * t) -6.743e-01 2.051e-02 -32.875 < 2e-16 ***cos(4 * pi * t) 3.785e-01 2.041e-02 18.548 < 2e-16 ***sin(6 * pi * t) -1.042e-01 1.359e-02 -7.663 1.82e-14 **cos(6 * pi * t) -4.389e-02 1.359e-02 -3.228 0.001245 sin(8 * pi * t) 8.733e-02 1.246e-02 7.009 2.41e-12 cos(8 * pi * t) 2.559e-03 1.252e-02 0.204 0.838038---Signif. codes: 0 ’’ 0.001 ’’ 0.01 ’’ 0.05 ’.’ 0.1 ’ ’ 1Residual standard error: 0.5041Multiple R-squared: 0.9989chi2-statistic: 3.598e+04 on 11 DF, p-value: < 2.2e-16 The last variable has no significant effect on the CO2. After performing a backward selection method with a p-value threshold equal to $0.05$ , we obtain the following model

R> regtrigo = slm(y ~ t + I(t^2) + I(t^3) + sin(2pit) + cos(2pit)+ + sin(4pit) + cos(4pit) + sin(6pit) + cos(6pit)+ + sin(8pit), method_cov_st = "fitAR", model_selec = -1) and the associated summary

R> summary(regtrigo)

Call:"slm(formula = myformula, data = data, x = x, y = y)"Residuals: Min 1Q Median 3Q Max-1.54877 -0.32432 0.00187 0.28069 1.50168Coefficients: Estimate Std. Error z value Pr(>|z|)(Intercept) 3.157e+02 3.969e-01 795.627 < 2e-16 ***t 3.194e-01 8.223e-02 3.884 0.000103 ***I(t^2) 4.077e-02 4.619e-03 8.825 < 2e-16 ***I(t^3) -4.562e-04 7.430e-05 -6.140 8.23e-10 ***sin(2 * pi * t) 2.751e+00 4.738e-02 58.061 < 2e-16 ***cos(2 * pi * t) -3.960e-01 4.716e-02 -8.397 < 2e-16 ***sin(4 * pi * t) -6.743e-01 2.051e-02 -32.874 < 2e-16 ***cos(4 * pi * t) 3.785e-01 2.041e-02 18.547 < 2e-16 ***sin(6 * pi * t) -1.042e-01 1.359e-02 -7.664 1.80e-14 **cos(6 * pi * t) -4.389e-02 1.359e-02 -3.229 0.001244 sin(8 * pi * t) 8.733e-02 1.248e-02 6.998 2.60e-12 ---Signif. codes: 0 ’’ 0.001 ’’ 0.01 ’’ 0.05 ’.’ 0.1 ’ ’ 1Residual standard error: 0.5036Multiple R-squared: 0.9989chi2-statistic: 3.596e+04 on 10 DF, p-value: < 2.2e-16 There is a clear difference between the two backward procedures: \codeslm keeps the variable $\cos(6\pi x)$ , while \codelm does not. Given the obvious dependency of the error process, we recommend using \codeslm instead of \codelm in this context.

5.2 PM2.5 Data of Shanghai

This dataset comes from a study about fine particle pollution in five Chinese cities. The data are available on the following website https://archive.ics.uci.edu/ml/datasets/PM2.5+Data+of+Five+Chinese+Cities#. We are interested here by the city of Shanghai. We study the regression of PM2.5 pollution in Xuhui District by other measurements of pollution in neighboring districts and also by meteorological variables. The dataset contains hourly observations between January 2010 and December 2015. More precisely it contains $52584$ records of $17$ variables: date, time of measurement, pollution and meteorological variables. More information on these data is available in the paper of Liang et al. (2016).

We remove the lines that contain NA observations and we then extract the first $5000$ observations. For simplicity, we will only consider pollution variables and weather variables. We start the study with the following $10$ variables:

PM_Xuhui: PM2.5 concentration in the Xuhui district ( $ug/m^{3}$ )

-

PM_Jingan: PM2.5 concentration in the Jing’an district ( $ug/m^{3}$ )

-

PM_US.Post: PM2.5 concentration in the U.S diplomatic post ( $ug/m^{3}$ )

-

DEWP: Dew Point (Celsius Degree)

-

TEMP: Temperature (Celsius Degree)

-

HUMI: Humidity ( $\%$ )

-

PRES: Pressure (hPa)

-

Iws: Cumulated wind speed ( $m/s$ )

-

precipitation: hourly precipitation (mm)

-

Iprec: Cumulated precipitation (mm)

R> shan = read.csv("ShanghaiPM20100101_20151231.csv", header = TRUE,+ sep = ",")R> shan = na.omit(shan)R> shan_complete = shan[1:5000,c(7,8,9,10,11,12,13,15,16,17)]R> shan_complete[1:5,]

PM_Jingan PM_US.Post PM_Xuhui DEWP HUMI PRES TEMP Iws26305 66 70 71 -5 69.00 1023 0 6026306 67 76 72 -5 69.00 1023 0 6226308 73 78 74 -4 74.41 1023 0 6526309 75 77 77 -4 80.04 1023 -1 6826310 73 78 80 -4 80.04 1023 -1 70 precipitation Iprec26305 0 026306 0 026308 0 026309 0 026310 0 0

The aim is to study the concentration of particles in Xuhui District according to the other variables. We first fit a linear regression with the \codelm function:

R> reglm = lm(shan_complete$PM_Xuhui ~ . ,data = shan_complete)R> summary.lm(reglm)

Call:lm(formula = shan_complete $PM_Xuhui ~ ., data = shan_complete)Residuals: Min 1Q Median 3Q Max-132.139 -4.256 -0.195 4.279 176.450Coefficients: Estimate Std. Error t value Pr(>|t|)(Intercept) -54.859483 40.975948 -1.339 0.180690PM_Jingan 0.596490 0.014024 42.533 < 2e-16 ***PM_US.Post 0.375636 0.015492 24.246 < 2e-16 ***DEWP -1.038941 0.170144 -6.106 1.10e-09 ***HUMI 0.291713 0.045799 6.369 2.07e-10 ***PRES 0.025287 0.038915 0.650 0.515852TEMP 1.305543 0.168754 7.736 1.23e-14 ***Iws -0.007650 0.002027 -3.774 0.000163 ***precipitation 0.462885 0.132139 3.503 0.000464 ***Iprec -0.125456 0.039025 -3.215 0.001314 **---Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1Residual standard error: 10.68 on 4990 degrees of freedomMultiple R-squared: 0.9409,Adjusted R-squared: 0.9408F-statistic: 8828 on 9 and 4990 DF, p-value: < 2.2e-16 The variable PRES has no significant effect on the PM_Xuhui variable. We then perform a backward selection procedure, which leads to select$ 9$ significant variables:

R> shan_lm = shan[1:5000,c(7,8,9,10,11,13,15,16,17)]R> reglm = lm(shan_lm$PM_Xuhui ~ . ,data = shan_lm)R> summary.lm(reglm)

Call:lm(formula = shan_lm$PM_Xuhui ~ ., data = shan_lm)Residuals: Min 1Q Median 3Q Max-132.122 -4.265 -0.168 4.283 176.560Coefficients: Estimate Std. Error t value Pr(>|t|)(Intercept) -28.365506 4.077590 -6.956 3.94e-12 ***PM_Jingan 0.595564 0.013951 42.690 < 2e-16 ***PM_US.Post 0.376486 0.015436 24.390 < 2e-16 ***DEWP -1.029188 0.169471 -6.073 1.35e-09 ***HUMI 0.285759 0.044870 6.369 2.08e-10 ***TEMP 1.275880 0.162453 7.854 4.90e-15 ***Iws -0.007734 0.002023 -3.824 0.000133 **precipitation 0.462137 0.132127 3.498 0.000473 Iprec -0.127162 0.038934 -3.266 0.001098 ---Signif. codes: 0 ’’ 0.001 ’’ 0.01 ’’ 0.05 ’.’ 0.1 ’ ’ 1Residual standard error: 10.68 on 4991 degrees of freedomMultiple R-squared: 0.9409,Adjusted R-squared: 0.9408F-statistic: 9933 on 8 and 4991 DF, p-value: < 2.2e-16

The autocorrelation of the residual process shows that the errors are clearly not i.i.d., see Figure 10. We thus suspect the \codelm procedure to be unreliable in this context.

The autocorrelation function decreases pretty fast, and the partial autocorrelation function suggests that fitting an AR process on the residuals should be an appropriate method in this case. The automatic \codefitAR method of \codeslm selects an AR process of order $28$ . The residuals of this AR fitting look like a white noise, as shown in Figure 11.

Consequently, we propose to perform a linear regression with \codeslm function, using the fitAR method on the complete model

R> regslm = slm(shan_complete$PM_Xuhui ~ . ,data = shan_complete,+ method_cov_st = "fitAR", model_selec = -1)R> summary(regslm)

Call:"slm(formula = myformula, data = data, x = x, y = y)"Residuals: Min 1Q Median 3Q Max-132.139 -4.256 -0.195 4.279 176.450Coefficients: Estimate Std. Error z value Pr(>|z|)(Intercept) -54.859483 143.268399 -0.383 0.701783PM_Jingan 0.596490 0.028467 20.953 < 2e-16 ***PM_US.Post 0.375636 0.030869 12.169 < 2e-16 ***DEWP -1.038941 0.335909 -3.093 0.001982 **HUMI 0.291713 0.093122 3.133 0.001733 **PRES 0.025287 0.137533 0.184 0.854123TEMP 1.305543 0.340999 3.829 0.000129 Iws -0.007650 0.005698 -1.343 0.179399precipitation 0.462885 0.125641 3.684 0.000229 Iprec -0.125456 0.064652 -1.940 0.052323 .---Signif. codes: 0 ’’ 0.001 ’’ 0.01 ’’ 0.05 ’.’ 0.1 ’ ’ 1Residual standard error: 10.68Multiple R-squared: 0.9409chi2-statistic: 8383 on 9 DF, p-value: < 2.2e-16 Note that the variables show globally larger p-values than with the \codelm procedure, and more variables have no significant effect than with \codelm. After performing a backward selection we obtain the following results

R> shan_slm = shan[1:5000,c(7,8,9,10,11,13)]R> regslm = slm(shan_slm$PM_Xuhui ~ . , data = shan_slm,+ method_cov_st = "fitAR", model_selec = -1)R> summary(regslm)

Call:"slm(formula = myformula, data = data, x = x, y = y)"Residuals: Min 1Q Median 3Q Max-132.263 -4.341 -0.192 4.315 176.501Coefficients: Estimate Std. Error z value Pr(>|z|)(Intercept) -29.44924 8.38036 -3.514 0.000441 ***PM_Jingan 0.60063 0.02911 20.636 < 2e-16 ***PM_US.Post 0.37552 0.03172 11.840 < 2e-16 **DEWP -1.05252 0.34131 -3.084 0.002044 **HUMI 0.28890 0.09191 3.143 0.001671 TEMP 1.30069 0.32435 4.010 6.07e-05 ---Signif. codes: 0 ’’ 0.001 ’’ 0.01 ’’ 0.05 ’.’ 0.1 ’ ’ 1Residual standard error: 10.71Multiple R-squared: 0.9406chi2-statistic: 8247 on 5 DF, p-value: < 2.2e-16 The backward selection with \codeslm only keeps $5$ variables.

Acknowledgements

The authors are grateful to Anne Philippe and Aymeric Stamm for valuable discussions.

Bibliography23

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Andrews (1991) Andrews D (1991). “Heteroskedasticity and autocorrelation consistent covariant matrix estimation.” Econometrica , 59 (3), 817–858.
2Andrews and Monahan (1992) Andrews DW, Monahan JC (1992). “An improved heteroskedasticity and autocorrelation consistent covariance matrix estimator.” Econometrica: Journal of the Econometric Society , pp. 953–966.
3Baudry et al. (2012) Baudry JP, Maugis C, Michel B (2012). “Slope heuristics: overview and implementation.” Statistics and Computing , 22 (2), 455–470.
4Birgé and Massart (2007) Birgé L, Massart P (2007). “Minimal penalties for Gaussian model selection.” Probability theory and related fields , 138 (1-2), 33–73.
5Bradley (1986) Bradley RC (1986). “Basic properties of strong mixing conditions.” In Dependence in probability and statistics (Oberwolfach, 1985) , volume 11 of Progr. Probab. Statist. , pp. 165–192. Birkhäuser Boston, Boston, MA.
6Brockwell and Davis (1991) Brockwell PJ, Davis RA (1991). Time Series: Theory and Methods . Springer Science & Business Media.
7Caron (2019) Caron E (2019). “Asymptotic distribution of least square estimators for linear models with dependent errors.” Statistics , 53 (4), 885–902. https://doi.org/10.1080/02331888.2019.1593987 . · doi ↗
8Caron and Dede (2018) Caron E, Dede S (2018). “Asymptotic Distribution of Least Squares Estimators for Linear Models with Dependent Errors: Regular Designs.” Mathematical Methods of Statistics , 27 (4), 268–293.