Testing for high-dimensional network parameters in auto-regressive models

Lili Zheng; Garvesh Raskutti

arXiv:1812.03659·math.ST·December 13, 2018

TL;DR

This paper develops statistical inference methods, including confidence intervals, for high-dimensional auto-regressive network models with sub-Gaussian noise, extending beyond Gaussian assumptions and addressing dependence challenges.

Contribution

It introduces convergence in distribution results and confidence intervals for high-dimensional AR(p) models with sub-Gaussian noise, broadening applicability beyond Gaussian assumptions.

Findings

1. Convergence results hold when T scales as (s ∨ ρ)^2 log^2 M.

2. Provides novel concentration bounds for dependent sub-Gaussian quadratic forms.

3. Validates theoretical results through simulations on structured networks.

Equations

$$X_{t+1}=\sum_{j=1}^{p}A^{*}(j)X_{t+1-j}+\epsilon_{t},$$

$$\ell(\theta,\boldsymbol{\gamma})=-\frac{1}{n}\sum_{i=1}^{n}\log f(\boldsymbol{U}_{i};\theta,\boldsymbol{\gamma}).$$

$$\sqrt{n}\nabla_{\theta}\ell(0,\hat{\boldsymbol{\gamma}})-\sqrt{n}\nabla_{\theta}\ell(0,\boldsymbol{\gamma}^{*})\approx\sqrt{n}\nabla_{\theta\boldsymbol{\gamma}}^{2}\ell(0,\boldsymbol{\gamma}^{*})(\hat{\boldsymbol{\gamma}}-\boldsymbol{\gamma}^{*}),$$

$$S(\theta,\boldsymbol{\gamma})=\nabla_{\theta}\ell(\theta,\boldsymbol{\gamma})-{\mathbf{I}}_{\theta\boldsymbol{\gamma}}{\mathbf{I}}_{\boldsymbol{\gamma}\boldsymbol{\gamma}}^{-1}\nabla_{\boldsymbol{\gamma}}\ell(\theta,\boldsymbol{\gamma}),$$

$$\mathbb{E}_{\boldsymbol{\beta}}\left(\nabla_{\boldsymbol{\gamma}}S(\theta,\boldsymbol{\gamma})\right)=0.$$

$$M_{X}(t)\triangleq\mathbb{E}[\exp(tX)]\leq\exp(\tau^{2}t^{2}/2).$$

$$X_{t+1}=\sum_{j=1}^{p}A(j)X_{t-j+1}+\epsilon_{t},$$

$$X_{t+1}=A^{*}\mathcal{X}_{t}+\epsilon_{t}.$$

$$\mathcal{H}_{0}:\widetilde{A}_{D}=0$$

$$\det(\mathcal{A}(z))\neq 0,\quad |z|\leq 1.$$

$$(\mathcal{A}(z))^{-1}=\sum_{j=0}^{\infty}\Psi_{j}z^{j},$$

$$X_{t}=\sum_{j=0}^{\infty}\Psi_{j}\epsilon_{t-j-1},$$

$$\Sigma={\rm Cov}(X_{t})=\sum_{j=0}^{\infty}\Psi_{j}\Psi_{j}^{\top}.$$

$$[S(A^{*})]_{jk}=-\frac{1}{T}\sum_{t=0}^{T-1}(X_{t+1,j}-a_{j}^{*\top}X_{t})X_{tk}=-\frac{1}{T}\sum_{t=0}^{T-1}\epsilon_{t,j}X_{tk}.$$

$$S=(S_{1}^{\top},S_{2}^{\top},\dots,S_{k}^{\top})^{\top}\in\mathbb{R}^{d},$$

$$S_{m}=-\frac{1}{T}\sum_{t=0}^{T-1}\epsilon_{t,m}(\mathcal{X}_{t,D_{m}}-w_{m}^{*\top}\mathcal{X}_{t,D_{m}^{c}}),$$

$${\rm Cov}(\mathcal{X}_{t,D_{m}}-w_{m}^{*\top}\mathcal{X}_{t,D_{m}^{c}},\mathcal{X}_{t,D_{m}^{c}})=0.$$

$$w_{m}^{*}=(\Upsilon_{D_{m}^{c},D_{m}^{c}})^{-1}\Upsilon_{D_{m}^{c},D_{m}}.$$

$$V_{T,m}\triangleq\sqrt{T}(\Upsilon^{(m)})^{-\frac{1}{2}}S_{m},$$

$$\Upsilon^{(m)}\triangleq{\rm Cov}(\mathcal{X}_{t,D_{m}}-w_{m}^{*\top}\mathcal{X}_{t,D_{m}^{c}})={\rm Cov}(\mathcal{X}_{t,D_{m}}\mid\mathcal{X}_{t,D_{m}^{c}})=\Upsilon_{D_{m},D_{m}}-\Upsilon_{D_{m},D_{m}^{c}}(\Upsilon_{D_{m}^{c},D_{m}^{c}})^{-1}\Upsilon_{D_{m}^{c},D_{m}}.$$

$$V_{T}=(V_{T,1}^{\top},\dots,V_{T,k}^{\top})^{\top}.$$

$$\widehat{U}_{T}=T\sum_{m=1}^{k}\widehat{S}_{m}^{\top}(\widehat{\Upsilon^{(m)}})^{-1}\widehat{S}_{m},$$

$$\widehat{S}_{m}=-\frac{1}{T}\sum_{t=0}^{T-1}(X_{t+1,m}-(\widehat{A}_{m})_{D_{m}^{c}}^{\top}\mathcal{X}_{t,D_{m}^{c}})(\mathcal{X}_{t,D_{m}}-\hat{w}_{m}^{\top}\mathcal{X}_{t,D_{m}^{c}}),$$

$$\rho_{m}\triangleq\|A_{m}^{*}\|_{0},\quad s_{i}\triangleq\|w_{i}^{*}\|_{0},$$

$$s_{m}\leq d_{m}^{2}\max_{1\leq i\leq M}\rho_{i},\quad\text{for }1\leq m\leq k.$$

$$\mathcal{H}_{0}:\widetilde{A}_{D}=0,$$

$$\mathcal{H}_{A}:\widetilde{A}_{D}=T^{-\phi}\Delta,$$

$$\Delta=(\Delta_{1}^{\top},\dots,\Delta_{k}^{\top})^{\top},$$

$$\sum_{i=0}^{\infty}\left(\sum_{j=0}^{\infty}\left\|\Psi_{i+j}\right\|_{2}^{2}\right)^{\frac{1}{2}}\leq\beta,$$

$$\sum_{i=0}^{\infty}\left(\sum_{j=0}^{\infty}\left\|(A^{*})^{i+j}\right\|_{2}^{2}\right)^{\frac{1}{2}}\leq\beta,$$

Taxonomy

Topics: Statistical Methods and Inference · Markov Chains and Monte Carlo Methods · Statistical Methods and Bayesian Inference

Full text

Lili Zheng¹ and Garvesh Raskutti¹

Abstract

High-dimensional auto-regressive models provide a natural way to model influence between $M$ actors given multi-variate time series data for $T$ time intervals. While there has been considerable work on network estimation, there is limited work in the context of inference and hypothesis testing. In particular, prior work on hypothesis testing in time series has been restricted to linear Gaussian auto-regressive models. From a practical perspective, it is important to determine suitable statistical tests for connections between actors that go beyond the Gaussian assumption. In the context of high-dimensional time series models, confidence intervals present additional challenges since most estimators, such as the Lasso and Dantzig selectors, are biased, which has led to de-biased estimators. In this paper we address these challenges and provide convergence in distribution results and confidence intervals for the multi-variate AR(p) model with sub-Gaussian noise, a generalization of Gaussian noise that broadens applicability and presents numerous technical challenges. The main technical challenge lies in the fact that, unlike Gaussian random vectors, for sub-Gaussian vectors zero correlation does not imply independence. The proof relies on an intricate truncation argument to develop novel concentration bounds for quadratic forms of dependent sub-Gaussian random variables. Our convergence in distribution results hold provided $T=\Omega((s\vee\rho)^{2}\log^{2}M)$, where $s$ and $\rho$ refer to sparsity parameters, which matches existing results for hypothesis testing with i.i.d. samples. We validate our theoretical results with simulation results for both block-structured and chain-structured networks.

¹ Department of Statistics, University of Wisconsin-Madison

1 Introduction

Vector autoregressive models arise in a number of applications including macroeconomics (see e.g. Ang and Piazzesi (2003), Hansen (2003), Shan (2005)), computational neuroscience (see e.g. Goebel et al. (2003), Seth et al. (2015), Harrison et al. (2003), Bressler et al. (2007)), and many others (see e.g. Michailidis and d’Alché Buc (2013), Fujita et al. (2007)). Recent years have seen substantial development in the theory and methodology of high-dimensional auto-regressive models with respect to parameter estimation (see e.g. Song and Bickel (2011), Basu et al. (2015), Davis et al. (2016), Medeiros and Mendes (2016), Mark B. and R. (2018)). In particular, if there are $M$ dependent time series (e.g. voxels in the brain, actors in a social network, measurements at different spatial locations), time series network models allow us to model temporal dependence between actors/nodes in a network.

More precisely, consider the following time series auto-regressive network model with lag $p$ ,

$$X_{t+1}=\sum_{j=1}^{p}A^{*}(j)X_{t+1-j}+\epsilon_{t},$$

where $\{X_{t}\}_{t=0}^{T}\in\mathbb{R}^{M}$ is the time series data we have access to, $\{A^{*}(j)\in\mathbb{R}^{M\times M},j=1,\dots,p\}$ are the network parameters of interest and $\epsilon_{t}\in\mathbb{R}^{M}$ is zero-mean noise. We are considering the high-dimensional setting where the number of nodes $M$ in the network is much larger than the sample size $T$ . Prior work in Basu et al. (2015) has addressed the question of how to estimate the network parameter $A^{*}$ with Gaussian noise $\epsilon_{t}$ under sparsity assumptions and various structural constraints. In this paper, we focus on inference and hypothesis testing for the parameter $A^{*}$ given the data $(X_{t})_{t=0}^{T}$ .

In high-dimensional statistics, there has recently been a growing body of work on confidence intervals and hypothesis testing under structural assumptions such as sparsity. Since the widely used Lasso estimator for sparse linear regression is asymptotically biased, one-step estimators based on bias-correction have been studied in works such as Zhang and Zhang (2014), Van de Geer et al. (2014) and Javanmard and Montanari (2014), which are referred to as the LDPE, de-sparsifying and de-biasing estimators, respectively. Low-dimensional components of these estimators are asymptotically normal and thus can be used for constructing hypothesis tests and confidence intervals.

In this paper, we adopt the framework of Ning and Liu (Ning et al. (2017)), who propose a high-dimensional test statistic based on the score function, called the decorrelated score function, which we briefly describe here. Formally, consider a statistical model $\mathcal{P}=\{\mathbb{P}_{\boldsymbol{\beta}}:\boldsymbol{\beta}\in\Omega\}$ with high-dimensional parameter vector ${\boldsymbol{\beta}}=(\theta,\boldsymbol{\gamma}^{\top})^{\top}\in\mathbb{R}^{d}$. Suppose we are interested in the scalar parameter $\theta$ and $\boldsymbol{\gamma}\in\mathbb{R}^{d-1}$ is the nuisance parameter. Suppose the data $\{\boldsymbol{U}_{i},i=1,\dots,n\}$ are i.i.d. with distribution $\mathbb{P}_{\boldsymbol{\beta}}$; then the negative log-likelihood function is defined as

$$\ell(\theta,\boldsymbol{\gamma})=-\frac{1}{n}\sum_{i=1}^{n}\log f(\boldsymbol{U}_{i};\theta,\boldsymbol{\gamma}).$$

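It is known that the score function $\sqrt{n}\nabla_{\theta}\ell(0,\boldsymbol{\gamma}^{*})$ is asymptotically normal if the true parameter is $\boldsymbol{\beta}^{*}=(0,\boldsymbol{\gamma}^{*})$. If $\boldsymbol{\gamma}^{*}$ is substituted by some estimator $\hat{\boldsymbol{\gamma}}$, the error induced by estimation can be approximated as follows:

$$\sqrt{n}\nabla_{\theta}\ell(0,\hat{\boldsymbol{\gamma}})-\sqrt{n}\nabla_{\theta}\ell(0,\boldsymbol{\gamma}^{*})\approx\sqrt{n}\nabla_{\theta\boldsymbol{\gamma}}^{2}\ell(0,\boldsymbol{\gamma}^{*})(\hat{\boldsymbol{\gamma}}-\boldsymbol{\gamma}^{*}),$$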

when $\hat{\boldsymbol{\gamma}}-\boldsymbol{\gamma}^{*}$ is small enough. Although $\hat{\boldsymbol{\gamma}}-\boldsymbol{\gamma}^{*}$ converges to 0 for a properly chosen $\hat{\boldsymbol{\gamma}}$, e.g. the Lasso estimator, $\sqrt{n}\nabla_{\theta\boldsymbol{\gamma}}^{2}\ell(0,\boldsymbol{\gamma}^{*})(\hat{\boldsymbol{\gamma}}-\boldsymbol{\gamma}^{*})$ would not vanish if $\mathbb{E}_{\boldsymbol{\beta}}\left(\nabla_{\theta\boldsymbol{\gamma}}^{2}\ell(0,\boldsymbol{\gamma}^{*})\right)\neq 0$. This fact motivates the decorrelated score function:

$$S(\theta,\boldsymbol{\gamma})=\nabla_{\theta}\ell(\theta,\boldsymbol{\gamma})-{\mathbf{I}}_{\theta\boldsymbol{\gamma}}{\mathbf{I}}_{\boldsymbol{\gamma}\boldsymbol{\gamma}}^{-1}\nabla_{\boldsymbol{\gamma}}\ell(\theta,\boldsymbol{\gamma}),$$

with Fisher information matrix ${\mathbf{I}}=\mathbb{E}_{\boldsymbol{\beta}}\left(\nabla^{2}\ell(\boldsymbol{\beta})\right)$ . One can check that

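$$\mathbb{E}_{\boldsymbol{\beta}}\left(\nabla_{\boldsymbol{\gamma}}S(\theta,\boldsymbol{\gamma})\right)=0.$$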

Both $\boldsymbol{\gamma}$ and ${\mathbf{I}}_{\theta\boldsymbol{\gamma}}{\mathbf{I}}_{\boldsymbol{\gamma}\boldsymbol{\gamma}}^{-1}$ are substituted by some estimator, and it is shown in Ning et al. (2017) that the decorrelated score function is asymptotically normal.
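The decorrelation property stated above can be checked in two lines. The following is a sketch that assumes the Fisher information blocks are evaluated at the true parameter and treated as constants when differentiating in $\boldsymbol{\gamma}$, and that differentiation and expectation can be interchanged:

```latex
\begin{align*}
\nabla_{\boldsymbol{\gamma}} S(\theta,\boldsymbol{\gamma})
  &= \nabla^{2}_{\theta\boldsymbol{\gamma}}\ell(\theta,\boldsymbol{\gamma})
     - \mathbf{I}_{\theta\boldsymbol{\gamma}}\mathbf{I}_{\boldsymbol{\gamma}\boldsymbol{\gamma}}^{-1}
       \nabla^{2}_{\boldsymbol{\gamma}\boldsymbol{\gamma}}\ell(\theta,\boldsymbol{\gamma}), \\
\mathbb{E}_{\boldsymbol{\beta}}\left(\nabla_{\boldsymbol{\gamma}} S(\theta,\boldsymbol{\gamma})\right)
  &= \mathbf{I}_{\theta\boldsymbol{\gamma}}
     - \mathbf{I}_{\theta\boldsymbol{\gamma}}\mathbf{I}_{\boldsymbol{\gamma}\boldsymbol{\gamma}}^{-1}
       \mathbf{I}_{\boldsymbol{\gamma}\boldsymbol{\gamma}} = 0 .
\end{align*}
```

The second line uses only the definition ${\mathbf{I}}=\mathbb{E}_{\boldsymbol{\beta}}\left(\nabla^{2}\ell(\boldsymbol{\beta})\right)$, so the first-order effect of plugging in $\hat{\boldsymbol{\gamma}}$ cancels in expectation.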

In the linear regression case, the test statistic generated by the decorrelated score function in Ning et al. (2017) is equivalent to that constructed from the de-biased estimator in Van de Geer et al. (2014). However, Ning et al. (2017) allow a more general form, which is thus easier to adapt to the time series case. In fact, Neykov et al. (2018) consider, amongst other examples, high-dimensional time series with Gaussian error innovations. While Gaussian error innovations are widely used, many time series models include data that has bounded range or is discrete, for which the Gaussian distribution is not a natural fit. In this paper, we address the more general and technically challenging setting in which the noise $\epsilon_{t}$ is sub-Gaussian.

One of the important technical challenges in going from the Gaussian to the sub-Gaussian case is that dependent Gaussian vectors can be rotated to be independent, while such a result does not hold for sub-Gaussian vectors. Prior work in Wong et al. (2016) addresses this challenge by imposing stationarity and $\beta$-mixing conditions. In order to avoid these conditions, we develop novel concentration bounds for sub-Gaussian random vectors.

In this paper, we investigate the hypothesis testing and confidence region with respect to a low-dimensional component of parameter matrices $\{A^{*}(j),j=1,\dots,p\}$ for sub-Gaussian data, using the testing framework in Ning et al. (2017). Our major contributions are as follows:

• Extending theoretical results in Ning et al. (2017) for high-dimensional hypothesis testing from Gaussian to sub-Gaussian temporally dependent data (VAR model), under both the null and the alternative hypothesis. We also show that our techniques lead to similar results to Neykov et al. (2018) in the Gaussian case but under less restrictive conditions;

• A novel concentration bound for quadratic forms of sub-Gaussian time series data. Note that unlike Gaussian vectors, which can be rotated to be independent, sub-Gaussian vectors cannot, which presents additional technical challenges. Our analysis also leads to estimators for covariance and regression parameters for time series data under sub-Gaussian assumptions, which are of independent interest;

• We also construct a semi-parametric efficient confidence region for multivariate parameters with fixed dimension;

• Finally, we support our theoretical guarantees with a simulation study on bounded noise, which is sub-Gaussian but not Gaussian.

1.1 Related Work

In the literature on inference for high-dimensional VAR models, most work focuses on the estimation problem. Song and Bickel (2011) investigate penalized least squares algorithms for different penalties, with some externally imposed assumptions on the temporal dependence. Theoretical guarantees on Dantzig type and Lasso type estimators are studied in Han et al. (2015) and Basu et al. (2015), but with Gaussian noise. Barigozzi and Brownlees (2018) consider inference for the stationary dependence structure among variables, rather than the parameters in the VAR model. In our work, we control the error bounds of Lasso and Dantzig type estimators for the parameter matrices with sub-Gaussian noise. We then establish the asymptotic distribution of the test statistic based on these bounds.

In the high-dimensional hypothesis testing literature, there is some work on testing high-dimensional mean vectors (Srivastava (2009)), covariance matrices (Chen et al. (2010), Zhang et al. (2013)) and independence among variables (Schott (2005)). For testing regression parameters, most work assumes i.i.d. samples. Lockhart et al. (2014), Taylor et al. (2014) and Lee et al. (2016) propose methods to test whether a covariate should be selected conditional on the selection of some other covariates. A penalized score test depending on the tuning parameter $\lambda$ is considered in Voorman et al. (2014). Our work follows a line of work by Zhang and Zhang (2014), Van de Geer et al. (2014), Javanmard and Montanari (2014) and Ning et al. (2017), the de-sparsifying or decorrelated literature. We construct a VAR version of the decorrelated score test proposed by Ning et al. (2017). Chen and Wu (2018) tackle the hypothesis testing problem for time series data as well, but they test the trend in a time series, instead of the autoregressive parameter which encodes the influence structure among variables.

As mentioned earlier, our work is most closely related to the prior work of Neykov et al. (2018), which provides a hypothesis testing framework with high-dimensional Gaussian time series as a special case. In our work, we consider the more general and technically challenging case of sub-Gaussian vector auto-regressive models. Throughout this paper, we provide a comparison to results derived in that work for the Gaussian case.

1.2 Organization of the Paper

Section 2 explains the problem setup and proposes our test statistic. Theoretical guarantees are given in Section 3. Specifically, Sections 3.1 and 3.2 present the weak convergence rate of the test statistic under the null and alternative hypotheses $\mathcal{H}_{0}$ and $\mathcal{H}_{A}$. Section 3.3 proposes some feasible estimators, which satisfy the required assumptions and can be plugged into the test statistic. Section 3.4 considers the case when the variance of the noise is unknown, and we construct a confidence region for multivariate parameter vectors in Section 3.5. For the special case of the AR(1) model with Gaussian noise, a detailed comparison with Neykov et al. (2018) is provided in Section 3.6. Section 4 provides simulation results and Section 5 includes the proofs of the two main theorems. Much of the proof is deferred to the Appendices.

1.3 Notation

We define the following norms for vectors and matrices. For a vector $u=(u_{1},\dots,u_{d})^{\top}\in\mathbb{R}^{d}$, we define the $p$-norm, where $p\geq 1$, as $\|u\|_{p}=\left(\sum_{i=1}^{d}|u_{i}|^{p}\right)^{\frac{1}{p}}.$ For a matrix $U\in\mathbb{R}^{m\times n}$, the $\ell_{p}$ norm and Frobenius norm of $U$ are defined as $\|U\|_{p}=\sup_{v}\frac{\left\|Uv\right\|_{p}}{\|v\|_{p}},\quad\|U\|_{F}=\left(\sum_{i=1}^{m}\sum_{j=1}^{n}U_{ij}^{2}\right)^{\frac{1}{2}}.$ We also use the notation $\|U\|_{1,1}$ to denote the $\ell_{1}$ penalty on $U$, which is $\sum_{i=1}^{m}\sum_{j=1}^{n}|U_{i,j}|$. Furthermore, if $U$ is symmetric, the trace norm of $U$ is $\|U\|_{{\rm tr}}=\text{tr}(\sqrt{U^{2}}).$

Throughout the paper, we assume that the entries of noise vectors $\{\epsilon_{ti},1\leq i\leq M\}_{t=-\infty}^{\infty}$ are independent sub-Gaussian variables with constant scale factor. A univariate centered random variable $X$ has a sub-Gaussian distribution with scale factor $\tau$ if

$$M_{X}(t)\triangleq\mathbb{E}[\exp(tX)]\leq\exp(\tau^{2}t^{2}/2).$$
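As a concrete instance of the sub-Gaussian condition above, the sketch below checks numerically that bounded uniform noise satisfies the moment generating function bound. The choice of Uniform$[-\sqrt{3},\sqrt{3}]$ (unit variance) and the scale factor $\tau=\sqrt{3}$ taken from Hoeffding's lemma are illustrative assumptions, not values taken from the paper; the function names are hypothetical.

```python
import numpy as np

def uniform_mgf(t, b):
    """Exact MGF of Uniform[-b, b]: sinh(b t) / (b t), equal to 1 at t = 0."""
    bt = b * np.asarray(t, dtype=float)
    safe = np.where(bt == 0.0, 1.0, bt)  # placeholder value to avoid 0/0 at t = 0
    return np.where(bt == 0.0, 1.0, np.sinh(safe) / safe)

def subgaussian_bound(t, tau):
    """Right-hand side of the sub-Gaussian condition: exp(tau^2 t^2 / 2)."""
    t = np.asarray(t, dtype=float)
    return np.exp(tau**2 * t**2 / 2.0)

b = np.sqrt(3.0)   # Uniform[-sqrt(3), sqrt(3)] has variance b^2 / 3 = 1
tau = b            # Hoeffding's lemma: scale factor (b - (-b)) / 2 = b (assumed choice)
t_grid = np.linspace(-5.0, 5.0, 201)

# Small relative slack guards against floating-point rounding near t = 0.
mgf_ok = bool(np.all(uniform_mgf(t_grid, b) <= subgaussian_bound(t_grid, tau) * (1 + 1e-12)))
print(mgf_ok)
```

The check is only over a finite grid of $t$ values; the inequality for all $t$ follows from Hoeffding's lemma for bounded zero-mean variables.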

2 Problem Setup

We consider a general vector auto-regressive time series with lag $p$ , where $p$ is known and finite and independent of $T$ or other dimensions:

$$X_{t+1}=\sum_{j=1}^{p}A(j)X_{t-j+1}+\epsilon_{t},\tag{2}$$

where $X_{t}\in\mathbb{R}^{M}$ , $\epsilon_{t}\in\mathbb{R}^{M}$ is zero-mean entry-wise independent sub-Gaussian noise with identity covariance matrix, and $A(j)\in\mathbb{R}^{M\times M},j=1,\cdots,p$ are parameters of interest. Define the matrix $A^{*}=(A(1),\cdots,A(p))\in\mathbb{R}^{M\times pM}$ and $\mathcal{X}_{t}=(X_{t}^{\top},\cdots,X_{t-p+1}^{\top})^{\top}\in\mathbb{R}^{pM}$ , then we can also write (2) as

$$X_{t+1}=A^{*}\mathcal{X}_{t}+\epsilon_{t}.\tag{3}$$

For notational convenience, we assume that time series data $X_{t}$ has time range $1-p\leq t\leq T$ .
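The two equivalent forms of the model above can be illustrated with a short simulation. This is a minimal sketch under stated assumptions: the noise is Uniform$[-\sqrt{3},\sqrt{3}]$ (bounded, unit variance), the transition matrices are hypothetical diagonal matrices chosen to be stable, and `simulate_var` and `stack_lags` are illustrative names, not functions from the paper.

```python
import numpy as np

def simulate_var(A_list, T, rng, burn_in=200):
    """Simulate X_{t+1} = sum_j A(j) X_{t+1-j} + eps_t with bounded uniform noise.

    A_list[j] plays the role of A(j+1). Returns an array of shape (T + p, M)
    holding X_{1-p}, ..., X_T after discarding a burn-in period.
    """
    p, M = len(A_list), A_list[0].shape[0]
    n_total = burn_in + T + p
    X = np.zeros((n_total, M))
    b = np.sqrt(3.0)  # unit-variance noise that is sub-Gaussian but not Gaussian
    for t in range(p, n_total):
        eps = rng.uniform(-b, b, size=M)
        X[t] = sum(A_list[j] @ X[t - 1 - j] for j in range(p)) + eps
    return X[burn_in:]

def stack_lags(X, p):
    """Return the stacked lag vectors (X_t, ..., X_{t-p+1}) and the responses X_{t+1}."""
    n = X.shape[0]
    stacked = np.hstack([X[p - 1 - j : n - 1 - j] for j in range(p)])
    response = X[p:]
    return stacked, response

rng = np.random.default_rng(0)
A_list = [0.4 * np.eye(5), 0.2 * np.eye(5)]      # hypothetical stable VAR(2)
X = simulate_var(A_list, T=300, rng=rng)
stacked, response = stack_lags(X, p=2)

# In the compact form, the residual recovers the noise, so it is bounded by sqrt(3).
residual = response - stacked @ np.hstack(A_list).T
print(stacked.shape, float(np.max(np.abs(residual))))
```

Stacking the lags turns the lag-$p$ model into a single regression with a $pM$-dimensional covariate, which is the form used for the score function in the rest of the section.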

Based on data $(X_{t})_{t=1-p}^{T}$, we test the hypothesis of whether a subset of entries in $A^{*}$ are zero. Let $A_{i}^{*}$ be the $i$th row vector of $A^{*}$. Without loss of generality, suppose the entries we test are in rows $1,\cdots,k$. Define $D_{m}\subset\{1,\cdots,pM\}$ as the columns we test in the $m$th row with $d_{m}=|D_{m}|$, and $D=\{(i,j):1\leq i\leq k,j\in D_{i}\}$, with $d=|D|=\sum_{m=1}^{k}d_{m}$. We test the null hypothesis:

$$\mathcal{H}_{0}:\widetilde{A}_{D}=0,$$

where $\widetilde{A}_{D}=((A_{1}^{*})_{D_{1}}^{\top},\cdots,(A_{k}^{*})_{D_{k}}^{\top})^{\top}\in\mathbb{R}^{d}$. We also assume that $d$ is finite and not increasing with $T$. In the work of Neykov et al. (2018), $d$ is assumed to be $1$.

2.1 Stationary distribution

Since we are developing a hypothesis testing framework based on the decorrelated score test, it is important to specify a stationary distribution for $X_{t}$. Using standard notation from auto-regressive time series models, define the polynomial $\mathcal{A}(z)=I_{M}-\sum_{j=1}^{p}A(j)z^{j}$, where $I_{M}$ is an $M\times M$ identity matrix, and $z$ is a complex number. To guarantee the existence of a stationary solution to (3), we assume

$$\det(\mathcal{A}(z))\neq 0,\quad |z|\leq 1.$$

Then we can write

$$(\mathcal{A}(z))^{-1}=\sum_{j=0}^{\infty}\Psi_{j}z^{j},$$

where $\Psi_{j},j\geq 0$ are all real valued matrices which are polynomial functions of $A(i),1\leq i\leq p$ . Note that in the special case where $p=1$ , $\Psi_{j}=(A^{*})^{j}$ .

It can be shown that the unique stationary solution to (2) is

$$X_{t}=\sum_{j=0}^{\infty}\Psi_{j}\epsilon_{t-j-1},$$

and the covariance matrix $\Sigma$ of $X_{t}$ satisfies

$$\Sigma={\rm Cov}(X_{t})=\sum_{j=0}^{\infty}\Psi_{j}\Psi_{j}^{\top}.$$
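For the special case $p=1$, where $\Psi_{j}=(A^{*})^{j}$, the covariance series can be evaluated numerically. The sketch below truncates the series and checks the result against the fixed-point relation $\Sigma=A\Sigma A^{\top}+I$, which follows from the model with identity noise covariance. The chain-structured matrix `A` and the truncation level are hypothetical choices for illustration.

```python
import numpy as np

def stationary_cov_ar1(A, n_terms=500):
    """Truncated series sum_{j < n_terms} A^j (A^j)^T, using Psi_j = A^j when p = 1."""
    M = A.shape[0]
    Sigma = np.zeros((M, M))
    Psi = np.eye(M)          # Psi_0 = I
    for _ in range(n_terms):
        Sigma += Psi @ Psi.T
        Psi = A @ Psi        # Psi_{j+1} = A Psi_j
    return Sigma

# Hypothetical chain-structured transition matrix with spectral norm at most 0.5.
M = 6
A = 0.3 * np.eye(M) + 0.2 * np.eye(M, k=1)
Sigma = stationary_cov_ar1(A)

# With identity noise covariance, Sigma should solve Sigma = A Sigma A^T + I.
lyapunov_gap = float(np.max(np.abs(Sigma - (A @ Sigma @ A.T + np.eye(M)))))
print(lyapunov_gap)
```

Because the spectral norm of `A` is below one, the terms decay geometrically and the truncation error is negligible at this level.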

2.2 Decorrelated Score Function

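Using the framework developed in Ning et al. (2017) for independent design, we consider the decorrelated score test. First we define the score function $S(A^{*})\in\mathbb{R}^{M\times M}$, with each entry defined as follows:

$$[S(A^{*})]_{jk}=-\frac{1}{T}\sum_{t=0}^{T-1}(X_{t+1,j}-a_{j}^{*\top}X_{t})X_{tk}=-\frac{1}{T}\sum_{t=0}^{T-1}\epsilon_{t,j}X_{tk}.$$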

As pointed out in Ning et al. (2017), the standard score function is infeasible and we need to consider the decorrelated score function

$$S=(S_{1}^{\top},S_{2}^{\top},\dots,S_{k}^{\top})^{\top}\in\mathbb{R}^{d},$$

with each $S_{m}\in\mathbb{R}^{d_{m}}$ corresponding to the tested row $(m,D_{m})$ :

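$$S_{m}=-\frac{1}{T}\sum_{t=0}^{T-1}\epsilon_{t,m}(\mathcal{X}_{t,D_{m}}-w_{m}^{*\top}\mathcal{X}_{t,D_{m}^{c}}),$$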

where $\mathcal{X}_{t,D_{m}}\in\mathbb{R}^{d_{m}}$ is composed of the entries of $\mathcal{X}_{t}$ whose indices are within set $D_{m}$ . $\mathcal{X}_{t,D_{m}^{c}}\in\mathbb{R}^{pM-d_{m}}$ is also defined similarly and $w_{m}^{*}\in\mathbb{R}^{(pM-d_{m})\times d_{m}}$ is chosen to satisfy

$${\rm Cov}(\mathcal{X}_{t,D_{m}}-w_{m}^{*\top}\mathcal{X}_{t,D_{m}^{c}},\mathcal{X}_{t,D_{m}^{c}})=0.$$

Specifically, $w_{m}^{*}$ is defined as a function of $\Upsilon={\rm Cov}(\mathcal{X}_{t})\in\mathbb{R}^{pM\times pM}$ :

$$w_{m}^{*}=(\Upsilon_{D_{m}^{c},D_{m}^{c}})^{-1}\Upsilon_{D_{m}^{c},D_{m}}.$$

2.3 Test Statistic

Based on the decorrelated score function $S_{m}$ , we first define the statistic $V_{T,m}\in\mathbb{R}^{d_{m}}$ :

$$V_{T,m}\triangleq\sqrt{T}(\Upsilon^{(m)})^{-\frac{1}{2}}S_{m},$$

with $\Upsilon^{(m)}\in\mathbb{R}^{d_{m}\times d_{m}}$ being defined as:

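$$\Upsilon^{(m)}\triangleq{\rm Cov}(\mathcal{X}_{t,D_{m}}-w_{m}^{*\top}\mathcal{X}_{t,D_{m}^{c}})={\rm Cov}(\mathcal{X}_{t,D_{m}}\mid\mathcal{X}_{t,D_{m}^{c}})=\Upsilon_{D_{m},D_{m}}-\Upsilon_{D_{m},D_{m}^{c}}(\Upsilon_{D_{m}^{c},D_{m}^{c}})^{-1}\Upsilon_{D_{m}^{c},D_{m}}.\tag{8}$$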
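The population quantities $w_{m}^{*}$ and $\Upsilon^{(m)}$ are simple block operations on $\Upsilon$. The following sketch computes them for a hypothetical positive-definite matrix standing in for $\Upsilon$ (the true matrix depends on $A^{*}$) and verifies the decorrelation property and the Schur complement form; `decorrelation_weights` is an illustrative name.

```python
import numpy as np

def decorrelation_weights(Upsilon, D):
    """Return (w_star, Upsilon_m, Dc) for the tested index set D.

    w_star    = (Upsilon_{Dc,Dc})^{-1} Upsilon_{Dc,D}
    Upsilon_m = Upsilon_{D,D} - Upsilon_{D,Dc} w_star   (Schur complement)
    """
    D = np.asarray(D)
    Dc = np.setdiff1d(np.arange(Upsilon.shape[0]), D)
    U_cc = Upsilon[np.ix_(Dc, Dc)]
    U_cd = Upsilon[np.ix_(Dc, D)]
    w_star = np.linalg.solve(U_cc, U_cd)
    Upsilon_m = Upsilon[np.ix_(D, D)] - U_cd.T @ w_star
    return w_star, Upsilon_m, Dc

rng = np.random.default_rng(0)
B = rng.standard_normal((6, 6))
Upsilon = B @ B.T + np.eye(6)    # hypothetical covariance, positive definite by construction
D = [0, 2]
w_star, Upsilon_m, Dc = decorrelation_weights(Upsilon, D)

# Covariance between the decorrelated tested coordinates and the untested ones.
cross_cov = Upsilon[np.ix_(D, Dc)] - w_star.T @ Upsilon[np.ix_(Dc, Dc)]
print(float(np.max(np.abs(cross_cov))))
```

The Schur complement also equals the inverse of the $(D_{m},D_{m})$ block of $\Upsilon^{-1}$, which is why sparsity of $\Upsilon^{-1}$ is relevant to the sparsity of $w_{m}^{*}$.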

Let $V_{T}$ be the $d$ -dimensional vector concatenated by $V_{T,m}$ ’s:

$$V_{T}=(V_{T,1}^{\top},\dots,V_{T,k}^{\top})^{\top}.$$

One of the main results of the paper is to show that $V_{T}$ is asymptotically Gaussian. Define $U_{T}=\|V_{T}\|_{2}^{2}$ , then $U_{T}$ is asymptotically $\chi_{d}^{2}$ . Since we do not know $\epsilon_{t}$ , $w_{m}^{*}$ , and $\Upsilon^{(m)}$ , we later define estimators for these quantities. Formally, we define our test statistic $\widehat{U}_{T}$ as

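$$\widehat{U}_{T}=T\sum_{m=1}^{k}\widehat{S}_{m}^{\top}(\widehat{\Upsilon^{(m)}})^{-1}\widehat{S}_{m},\tag{9}$$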

where $\widehat{\Upsilon^{(m)}}\in\mathbb{R}^{d_{m}\times d_{m}}$ is an estimator for $\Upsilon^{(m)}$ and $\widehat{S}_{m}\in\mathbb{R}^{d_{m}}$ is defined as

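$$\widehat{S}_{m}=-\frac{1}{T}\sum_{t=0}^{T-1}(X_{t+1,m}-(\widehat{A}_{m})_{D_{m}^{c}}^{\top}\mathcal{X}_{t,D_{m}^{c}})(\mathcal{X}_{t,D_{m}}-\hat{w}_{m}^{\top}\mathcal{X}_{t,D_{m}^{c}}),$$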

with $\widehat{A}_{m}\in\mathbb{R}^{pM}$ and $\hat{w}_{m}\in\mathbb{R}^{(pM-d_{m})\times d_{m}}$ estimating $A_{m}^{*}$ and $w_{m}^{*}$. Here we are not worried about the invertibility of $\widehat{\Upsilon^{(m)}}$, since $\Upsilon^{(m)}$ is a low-dimensional covariance matrix. To guarantee good estimation of the high-dimensional parameters $A_{m}^{*}$ and $w_{m}^{*}$, we impose sparsity conditions upon them. Specifically, for each $1\leq m\leq M$ and $1\leq i\leq k$, define

$$\rho_{m}\triangleq\|A_{m}^{*}\|_{0},\quad s_{i}\triangleq\|w_{i}^{*}\|_{0},\tag{10}$$

and note that they both depend on $A^{*}$ .
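To make the construction of the test statistic concrete, the sketch below computes the contribution of a single tested row to $\widehat{U}_{T}$ for an AR(1) example. Several parts are assumptions made for illustration only: the true row of $A^{*}$ is plugged in for $\widehat{A}_{m}$, $\hat{w}_{m}$ is obtained by ordinary least squares (feasible here because the example is low-dimensional, whereas the high-dimensional setting calls for a sparse estimator), the series starts at zero without burn-in, and `decorrelated_score_statistic` is a hypothetical name.

```python
import numpy as np

def decorrelated_score_statistic(X_lag, y_m, D, A_hat_m, w_hat):
    """One-row contribution T * S_hat^T Upsilon_hat^{-1} S_hat.

    X_lag   : (T, pM) array whose rows are the stacked lag vectors
    y_m     : (T,) array holding X_{t+1, m}
    D       : indices of the tested columns in row m
    A_hat_m : (pM,) estimate of row m of A*
    w_hat   : (pM - d_m, d_m) estimate of w_m*
    """
    T, dim = X_lag.shape
    D = np.asarray(D)
    Dc = np.setdiff1d(np.arange(dim), D)
    # Response residual using only the untested coordinates (the null model).
    resid_y = y_m - X_lag[:, Dc] @ A_hat_m[Dc]
    # Tested covariates with the untested covariates projected out.
    resid_x = X_lag[:, D] - X_lag[:, Dc] @ w_hat
    S_hat = -(resid_x * resid_y[:, None]).mean(axis=0)
    # Plug-in for Upsilon^(m): sample second moment (the series has mean zero).
    Upsilon_hat = resid_x.T @ resid_x / T
    return float(T * S_hat @ np.linalg.solve(Upsilon_hat, S_hat))

rng = np.random.default_rng(1)
M, T = 8, 500
A = 0.4 * np.eye(M) + 0.2 * np.eye(M, k=1)   # hypothetical chain network
b = np.sqrt(3.0)
X = np.zeros((T + 1, M))
for t in range(T):
    X[t + 1] = A @ X[t] + rng.uniform(-b, b, size=M)

X_lag, y = X[:-1], X[1:, 0]                  # test row m = 0
D = [5]                                      # entry (0, 5) of A is truly zero
Dc = np.setdiff1d(np.arange(M), D)
w_hat = np.linalg.lstsq(X_lag[:, Dc], X_lag[:, D], rcond=None)[0]
stat = decorrelated_score_statistic(X_lag, y, D, A[0], w_hat)
print(stat)
```

Since the tested entry is zero in this example, the printed value is a single draw from a distribution that the theory describes as approximately $\chi^{2}_{1}$ for large $T$.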

The sparsity of $w_{m}^{*}$ can be implied by the sparsity of $\Upsilon^{-1}$ , which is a common condition in high-dimensional hypothesis testing literature (e.g. see Van de Geer et al. (2014)). Specifically, the following Lemma shows that when lag $p=1$ and $A^{*}$ is symmetric, the sparsity of $w_{m}^{*}$ is implied by the sparsity of $A^{*}$ :

Lemma 2.1.

If $p=1$ and $A^{*}\in\mathbb{R}^{M\times M}$ is symmetric, then $s_{m}$ defined in (10) satisfies

$$s_{m}\leq d_{m}^{2}\max_{1\leq i\leq M}\rho_{i},\quad\text{for }1\leq m\leq k.$$

The proof for Lemma 2.1 is included in Appendix E.

3 Theoretical guarantee

In this section, we present uniform convergence results for the test statistic $\widehat{U}_{T}$ under $\mathcal{H}_{0}$ and $\mathcal{H}_{A}$, with $A^{*}$ and the estimators satisfying certain conditions. We also provide feasible estimators and prove in Section 3.3 that they satisfy the corresponding conditions. Unknown variance and confidence region construction are discussed in Sections 3.4 and 3.5. In Section 3.6 we provide consequences of our theory under the AR(1) model with Gaussian noise and compare our results with Neykov et al. (2018).

Recall that the null hypothesis is

$$\mathcal{H}_{0}:\widetilde{A}_{D}=0,$$

with $\widetilde{A}_{D}\in\mathbb{R}^{d}$ being the concatenation of $(A_{1}^{*})_{D_{1}},\dots,(A_{k}^{*})_{D_{k}}$. For the alternative hypothesis, as in Ning et al. (2017), we consider

$$\mathcal{H}_{A}:\widetilde{A}_{D}=T^{-\phi}\Delta,\tag{12}$$

with some constant $\phi>0$ and constant vector $\Delta\in\mathbb{R}^{d}$ . Write

$$\Delta=(\Delta_{1}^{\top},\dots,\Delta_{k}^{\top})^{\top},$$

where each $\Delta_{m}\in\mathbb{R}^{d_{m}}$ . The reason why $T^{-\phi}\Delta$ instead of $\Delta$ is considered in (12) is that we expect the test to be more sensitive as sample size increases. We will see how the value of $\phi$ influences the convergence of $\widehat{U}_{T}$ in Theorem 3.2.

We still assume the $\epsilon_{ti}$’s are i.i.d. sub-Gaussian random variables, and also consider a special case, where $\epsilon_{t}\sim\mathcal{N}(0,I)$. We compare our result in the Gaussian case to results in Neykov et al. (2018).

First we define the sets $\Omega_{0}$ and $\Omega_{1}$ of feasible parameter matrices $A^{*}$ under $\mathcal{H}_{0}$ and $\mathcal{H}_{A}$ respectively. To control the stability of $\{X_{t}\}$ in model (3), we impose the condition:

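$$\sum_{i=0}^{\infty}\left(\sum_{j=0}^{\infty}\left\|\Psi_{i+j}\right\|_{2}^{2}\right)^{\frac{1}{2}}\leq\beta,\tag{13}$$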

for some constant $\beta>0$ . In the case $p=1$ , condition (13) reduces to

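$$\sum_{i=0}^{\infty}\left(\sum_{j=0}^{\infty}\left\|(A^{*})^{i+j}\right\|_{2}^{2}\right)^{\frac{1}{2}}\leq\beta,$$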

which is implied by $\left\|A^{*}\right\|_{2}\leq 1-\epsilon$ for some $0<\epsilon<1$ , a typical condition assumed (see e.g. Neykov et al. (2018)). Then define sets $\Omega_{0}$ and $\Omega_{1}$ for any $\beta,\rho,s,M,T,\phi>0$ , set $D$ of size $d$ and vector $\Delta=(\Delta_{1}^{\top},\cdots,\Delta_{k}^{\top})^{\top}\in\mathbb{R}^{d}$ :

[TABLE]

Note that $\rho_{m}(A^{*})$ and $s_{m}(A^{*})$ are still functions of $A^{*}$, since $\Upsilon$ is determined by $A^{*}$. Clearly we need reliable estimators $\widehat{A}_{m}$, $\hat{w}_{m}$ and $\widehat{\Upsilon^{(m)}}$ with $1\leq m\leq k$ to guarantee the weak convergence of $\widehat{U}_{T}$. We present the following assumptions for these estimators, which we will verify in Section 3.3. Note that the constants $C$ may depend on $p,d,\beta$ and $\tau$, but do not depend on either $M$ or $T$.
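As a worked check of the claim above that, for $p=1$, the stability condition is implied by $\left\|A^{*}\right\|_{2}\leq 1-\epsilon$, the following derivation gives one admissible value of $\beta$. It is a sketch that uses only submultiplicativity of the spectral norm and geometric series; the resulting constant is not claimed to be the one used in the paper.

```latex
\begin{align*}
\left\|(A^{*})^{i+j}\right\|_{2} &\leq \left\|A^{*}\right\|_{2}^{\,i+j} \leq (1-\epsilon)^{i+j}, \\
\left(\sum_{j=0}^{\infty}\left\|(A^{*})^{i+j}\right\|_{2}^{2}\right)^{\frac{1}{2}}
  &\leq \left(\frac{(1-\epsilon)^{2i}}{1-(1-\epsilon)^{2}}\right)^{\frac{1}{2}}
   = \frac{(1-\epsilon)^{i}}{\sqrt{\epsilon(2-\epsilon)}}, \\
\sum_{i=0}^{\infty}\left(\sum_{j=0}^{\infty}\left\|(A^{*})^{i+j}\right\|_{2}^{2}\right)^{\frac{1}{2}}
  &\leq \frac{1}{\sqrt{\epsilon(2-\epsilon)}}\sum_{i=0}^{\infty}(1-\epsilon)^{i}
   = \frac{1}{\epsilon\sqrt{\epsilon(2-\epsilon)}} .
\end{align*}
```

So the condition holds with $\beta=\epsilon^{-3/2}(2-\epsilon)^{-1/2}$, which is a constant whenever $\epsilon$ is bounded away from zero.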

Assumption 3.1 (Estimation Error for $A_{m}^{*}$ ).

For each $A^{*}\in\Omega_{0}\cup\Omega_{1}$ ,

[TABLE]

hold for $1\leq m\leq k$ , with probability at least $1-c_{1}\exp\{-c_{2}\log M\}$ .

These are standard error bounds for the Lasso estimator and the Dantzig selector with independent design. In Section 3.3 we verify Assumption 3.1, along with the remaining two assumptions, for dependent sub-Gaussian random variables, as arise in our vector auto-regressive model setting.

Assumption 3.2 (Estimation Error for $w_{m}^{*}$ ).

For each $A^{*}\in\Omega_{0}\cup\Omega_{1}$ :

[TABLE]

hold for $1\leq m\leq k$ , with probability at least $1-c_{1}\exp\{-c_{2}\log M\}$ .

Similar to Assumption 3.1, we will show that both Lasso estimator and Dantzig selector under model (3) satisfy Assumption 3.2.

Assumption 3.3 (Estimation Error for $\Upsilon^{(m)}$ ).

For each $A^{*}\in\Omega_{0}\cup\Omega_{1}$ ,

[TABLE]

hold for $1\leq m\leq k$ , with probability at least $1-c_{1}\exp\{-c_{2}\log M\}$ .

Note that $\Upsilon^{(m)}\in\mathbb{R}^{d_{m}\times d_{m}}$ is a low-dimensional matrix, and thus it is computationally feasible to use the sample covariance matrix of $X_{t,D_{m}}-\hat{w}_{m}^{\top}X_{t,D_{m}^{c}}$ as the estimator $\widehat{\Upsilon^{(m)}}$. We show in Section 3.3 that, as long as $\hat{w}_{m}$ is a reliable estimator for $w_{m}^{*}$, $\widehat{\Upsilon^{(m)}}$ satisfies a tighter bound than (19). The looser bound in Assumption 3.3 allows more choices of estimators for $(\Upsilon^{(m)})^{-1}$, as shown in Section 3.5.

3.1 Uniform convergence under null hypothesis

Based on these assumptions, we have the following main theorem.

Theorem 3.1.

Consider the model (3) with i.i.d. sub-Gaussian noise $\epsilon_{ti}$ with sub-Gaussian parameter $\tau$ . If Assumptions 3.1-3.3 are satisfied, and $(\rho\vee s)\log M=o(\sqrt{T})$ , then $\widehat{U}_{T}$ defined in (9) satisfies

[TABLE]

when $T>C$ for some constant $C$ . Here the constants $C_{i}$ ’s depend on $p,d,\beta,\tau$ .

Theorem 3.1 proves weak convergence of $\widehat{U}_{T}$ to $\chi_{d}^{2}$ . The uniform convergence rate can be understood as follows: the first term is due to the rate obtained by martingale CLT, where we require $T^{-\frac{1}{8}}$ rather than $T^{-\frac{1}{2}}$ due to the dependence; the remaining two terms arise from estimation error, with the second one being the error bounds, and third being the probability that the error bounds do not hold. If we assume Gaussianity, we can improve the first term in the rate of convergence from $T^{-\frac{1}{8}}$ to $T^{-\frac{1}{4}+\alpha}$ for any $\alpha>0$ . To the best of our knowledge, ours is the first work that formally attempts to characterize the rates of convergence.

Remark 3.1.

Compared to the theoretical result for independent design in Ning et al. (2017), the only additional condition we add is $\sum_{i=0}^{\infty}\left(\sum_{j=0}^{\infty}\left\|\Psi_{i+j}\right\|_{2}^{2}\right)^{\frac{1}{2}}\leq\beta$ , which is used to control the strength of dependence uniformly. Also, we consider multivariate testing which is more general, and derive the explicit convergence rate.

Remark 3.2.

The test statistics proposed in Van de Geer et al. (2014) and Javanmard and Montanari (2014) for the independent design share similar ideas with our test statistic. Instead of imposing a sparsity assumption upon $w_{m}^{*}$, Van de Geer et al. (2014) assume $\Upsilon^{-1}$ to be row-wise sparse. This is equivalent to the sparsity assumption on $w_{m}^{*}$ in the univariate case. Javanmard and Montanari (2014) do not require the sparsity condition on $\Upsilon^{-1}$, but it is hard to extend their theory to the time series setting, due to a difficulty in applying the martingale CLT.

Remark 3.3.

The theoretical guarantee we obtain here is more general and stronger than the result achieved in Neykov et al. (2018). A more detailed comparison is presented in Section 3.6.

3.2 Uniform convergence under alternative hypothesis

Recall the definition of $\Omega_{A}$ in (16). The following theorem establishes the asymptotic behavior of $\widehat{U}_{T}$ for $A^{*}\in\Omega_{A}$ , with different values of $\phi$ . First define

[TABLE]

where $\Upsilon^{(m)}$ is defined in (8).

Theorem 3.2.

Consider the model (3) with i.i.d. sub-Gaussian noise $\epsilon_{ti}$ and sub-Gaussian parameter $\tau$ . If Assumptions 3.1-3.3 are satisfied, and $(\rho\vee s)\log M=o(\sqrt{T})$ , then when $T>C$ for some constant $C$ ,

(1)

$\phi=\frac{1}{2}$ **

[TABLE]

(2)

$0<\phi<\frac{1}{2}$

[TABLE]

(3)

$\phi>\frac{1}{2}$

[TABLE]

Here $C_{i}$ ’s are constants depending on $p,d,\beta,\Delta,\tau$ .

Theorem 3.2 identifies the threshold value of $\phi$ at which $\mathcal{H}_{A}$ becomes detectable. When $\phi>\frac{1}{2}$ , we cannot distinguish $\mathcal{H}_{0}$ from $\mathcal{H}_{A}$ , since in both cases $\widehat{U}_{T}$ converges to $\chi_{d}^{2}$ . When $\phi<\frac{1}{2}$ , $\widehat{U}_{T}$ diverges to $+\infty$ in probability, so $\mathcal{H}_{A}$ is easy to detect. When $\phi=\frac{1}{2}$ , $\widehat{U}_{T}$ converges to a non-central $\chi_{d}^{2}$ with noncentrality parameter determined by the constant vector $\Delta$ and $\Upsilon=\text{Cov}(\mathcal{X}_{t})$ , which determines the asymptotic power of the test. Note that (23) also holds for the trivial case $\phi<0$ , since the proof does not use the fact that $\phi>0$ .
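To make the role of the noncentrality parameter concrete, the following sketch evaluates the rejection probability of a level-0.05 test whose statistic follows a noncentral $\chi_{d}^{2}$ law, using the Poisson-mixture representation of that law. The choice $d=6$ , the hardcoded 0.95 critical value, and all function names are illustrative assumptions and are not taken from the paper.

```python
import math

def chi2_sf_even(x, dof):
    """Survival function P(chi2_dof > x), closed form valid for even dof."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("dof must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, dof // 2):
        term *= half / i          # (x/2)^i / i!
        total += term
    return math.exp(-half) * total

def noncentral_power(noncentrality, dof=6, critical=12.5916, n_terms=200):
    """P(noncentral chi2_dof(noncentrality) > critical).

    Uses the fact that a noncentral chi-square is a Poisson(noncentrality / 2)
    mixture of central chi-squares with dof + 2j degrees of freedom.
    Assumes a moderate noncentrality so that exp(-noncentrality / 2) does not underflow.
    """
    mean = noncentrality / 2.0
    weight = math.exp(-mean)      # Poisson probability of j = 0
    power = 0.0
    for j in range(n_terms):
        power += weight * chi2_sf_even(critical, dof + 2 * j)
        weight *= mean / (j + 1)  # move to the Poisson probability of j + 1
    return power

# Power grows from the nominal level as the noncentrality increases.
for lam in (0.0, 5.0, 20.0):
    print(lam, round(noncentral_power(lam), 4))
```

With zero noncentrality the function returns the nominal level, and it increases toward 1 as the noncentrality grows, which is the sense in which the limiting law at $\phi=\frac{1}{2}$ determines the power.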

Remark 3.4.

Theorem 3.2 is also consistent with the threshold value of $\phi$ given by Ning et al. (2017) for linear regression with i.i.d. samples. However, Ning et al. (2017) assume additional conditions on the scaling of sample size, number of covariates and sparsity of $w_{m}^{*}$ for proving asymptotic power. Our conditions are exactly the same as the ones for $\mathcal{H}_{0}$ , due to a more specific model and careful analysis.

3.3 Feasible Estimators

Both the estimation of $w_{m}^{*}$ and $A^{*}$ can be viewed as high-dimensional sparse regression problems, thus we can use the Lasso or Dantzig selector. Formally, define

[TABLE]

as the Lasso estimator for $A^{*}$ , and

[TABLE]

as the Dantzig selector estimator for $A^{*}$ . Similarly, for $1\leq m\leq k,$ define

[TABLE]

and

[TABLE]

To estimate $\Upsilon^{(m)}$ , which is the low-dimensional covariance matrix of $\mathcal{X}_{t,D_{m}}-w_{m}^{*\top}\mathcal{X}_{t,D_{m}^{c}}$ , we can directly use the sample covariance of $\mathcal{X}_{t,D_{m}}-\hat{w}_{m}^{\top}\mathcal{X}_{t,D_{m}^{c}}$ as $\widehat{\Upsilon^{(m)}}$ :

[TABLE]

for $1\leq m\leq k$ . Here $\hat{w}_{m}$ in the definition of (29) is either $\hat{w}_{m}^{(L)}$ or $\hat{w}_{m}^{(D)}$ .
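As an illustration of the Lasso step, the sketch below fits one row of the transition matrix of a small AR(1) process by coordinate descent. The objective scaling $\frac{1}{2T}\sum_{t}(X_{t+1,m}-b^{\top}X_{t})^{2}+\lambda\|b\|_{1}$ , the starting value $X_{0}=0$ , the problem sizes, and the function names are assumptions made for this example; the paper's own definitions are the displays above.

```python
import math
import random

def soft_threshold(z, lam):
    """Soft-thresholding operator, the proximal map of lam * |.|."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def simulate_ar1(A, T, seed=0):
    """Simulate X_{t+1} = A X_t + eps_t with Uniform(-1, 1) noise and X_0 = 0."""
    rng = random.Random(seed)
    M = len(A)
    path = [[0.0] * M]
    for _ in range(T):
        prev = path[-1]
        path.append([sum(A[i][j] * prev[j] for j in range(M)) + rng.uniform(-1.0, 1.0)
                     for i in range(M)])
    return path

def lasso_row(path, m, lam, n_sweeps=100):
    """Coordinate descent for row m under the assumed objective
    (1 / 2T) * sum_t (X_{t+1,m} - b^T X_t)^2 + lam * ||b||_1."""
    T, M = len(path) - 1, len(path[0])
    # Sample second moments of the covariates and their cross moments with the response.
    gram = [[sum(path[t][j] * path[t][k] for t in range(T)) / T for k in range(M)]
            for j in range(M)]
    cross = [sum(path[t][j] * path[t + 1][m] for t in range(T)) / T for j in range(M)]
    b = [0.0] * M
    for _ in range(n_sweeps):
        for j in range(M):
            partial = cross[j] - sum(gram[j][k] * b[k] for k in range(M) if k != j)
            b[j] = soft_threshold(partial, lam) / gram[j][j]
    return b

# Example: a diagonal transition matrix, tuning parameter of order sqrt(log M / T).
M, T = 5, 2000
A_true = [[0.5 if i == j else 0.0 for j in range(M)] for i in range(M)]
row0 = lasso_row(simulate_ar1(A_true, T), 0, math.sqrt(math.log(M) / T))
print([round(v, 3) for v in row0])
```

The estimate of the nonzero entry is shrunk toward zero by the penalty, which is the bias that the decorrelated score and one-step constructions in this paper are designed to remove.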

As shown in the following, estimators (25) to (29) all satisfy Assumptions 3.1 to 3.3, under the model setting stated in (3):

Lemma 3.1.

If $\widehat{A}=\widehat{A}^{(L)}$ , or $\widehat{A}=\widehat{A}^{(D)}$ , which are defined as in (25) and (26) with $\lambda_{A}\asymp\sqrt{\frac{\log M}{T}}$ , then $\widehat{A}$ satisfies Assumption 3.1 when $T>C\rho\log M$ .

Lemma 3.2.

If $\hat{w}_{m}=\hat{w}_{m}^{(L)}$ or $\hat{w}_{m}=\hat{w}_{m}^{(D)}$ , which are defined as in (27) and (28) with $\lambda_{w}\asymp\sqrt{\frac{\log M}{T}}$ , then $\hat{w}_{m}$ ’s satisfy Assumption 3.2 when $T>Cs\log M$ .

Lemma 3.3.

If $\widehat{\Upsilon^{(m)}}$ ’s are defined as in (29), where $\hat{w}_{m}$ satisfies (18) with probability at least $1-c_{1}\exp\{-c_{2}\log M\}$ , then

[TABLE]

with probability at least $1-c_{1}\exp\{-c_{2}\log M\}$ , when $T>Cs^{2}\log M$ .

Note that Lemma 3.3 is stronger than Assumption 3.3. The proofs of these lemmas are deferred to Appendix A. Combining these lemmas with Theorems 3.1 and 3.2, we arrive at the following corollary.

Corollary 3.1.

Under model (3) with i.i.d. sub-Gaussian noise $\epsilon_{ti}$ with parameter $\tau$ , if $\widehat{A}=\widehat{A}^{(L)}$ or $\widehat{A}^{(D)}$ , $\hat{w}_{m}=\hat{w}_{m}^{(L)}$ or $\hat{w}_{m}^{(D)}$ , and the $\widehat{\Upsilon^{(m)}}$ ’s are defined as in (29) for $1\leq m\leq k$ with $\lambda_{A}\asymp\lambda_{w}\asymp\sqrt{\frac{\log M}{T}}$ , then bounds (20) to (24) from Theorems 3.1 and 3.2 hold whenever $(\rho\vee s)\log M=o(\sqrt{T})$ and $T>C$ for some constant $C>0$ .

3.4 Variance Estimation

In this section, we consider the case where $\sigma^{*2}=\text{Var}(\epsilon_{ti})$ is unknown under model (3). If $\sigma^{*}\neq 1$ is known, it is straightforward to extend Theorems 3.1 and 3.2 to $\widehat{U}_{T}$ defined as follows:

[TABLE]

This follows since if we consider $Y_{t}=X_{t}/\sigma^{*}$ , time series data $Y_{t}$ would satisfy the same model but with unit variance noise.
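The rescaling argument can be written out in one line. The symbol $\epsilon_{t}'$ for the rescaled noise is notation introduced only for this sketch:

```latex
Y_{t+1}
  = \frac{X_{t+1}}{\sigma^{*}}
  = \sum_{j=1}^{p} A^{*}(j)\,\frac{X_{t+1-j}}{\sigma^{*}} + \frac{\epsilon_{t}}{\sigma^{*}}
  = \sum_{j=1}^{p} A^{*}(j)\,Y_{t+1-j} + \epsilon_{t}',
\qquad
\epsilon_{t}' = \frac{\epsilon_{t}}{\sigma^{*}},
\quad
\operatorname{Var}(\epsilon_{ti}') = 1 .
```

The transition matrices $A^{*}(j)$ are unchanged by the rescaling, while every quantity that is quadratic in the data, such as a sample covariance, is divided by $\sigma^{*2}$ .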

When $\sigma^{*2}$ is unknown, we apply the estimator

[TABLE]

and define the test statistic

[TABLE]

We show that $\widetilde{U}_{T}$ has the same convergence results we derive for the unit variance noise case.

Theorem 3.3.

Consider the model (3) with i.i.d. sub-Gaussian noise $\epsilon_{ti}$ of variance $\sigma^{*2}=\text{Var}(\epsilon_{ti})\geq\sigma_{0}^{2}>0$ and scale factor $\tau\sigma^{*}$ . Then Theorem 3.1 and 3.2 hold for $\widetilde{U}_{T}$ under each corresponding condition, and constants $C_{i}$ ’s also depend on $\sigma_{0}$ .

Theorem 3.3 shows that when we have to estimate the unknown $\sigma^{*2}$ , test statistic $\widetilde{U}_{T}$ maintains the same asymptotic behavior as $\widehat{U}_{T}$ under the known variance case, given that all the assumptions for estimation errors are satisfied and $\sigma^{*}$ is lower bounded by some constant.

Remark 3.5.

With sub-Gaussian noise $\epsilon_{ti}$ , if we still assume the scale factor $\tau\sigma^{*}$ of $\epsilon_{ti}$ to be bounded by constant, then Lemma 3.1 to 3.3 would still hold. Thus the assumptions imposed on estimation errors of $\widehat{A}$ , $\hat{w}_{m}$ and $\widehat{\Upsilon^{(m)}}$ are all satisfied. However, if we don’t assume $\sigma^{*}$ to be bounded, then the tuning parameters $\lambda_{A}$ and $\lambda_{w}$ have to scale with $\sigma^{*}$ .

Remark 3.6.

Neykov et al. (2018) propose another estimator for the variance of $\epsilon_{ti}$ , based on the fact that $\Sigma=A\Sigma A^{\top}+\text{Cov}(\epsilon_{t})$ . Both estimators are consistent and lead to convergence in distribution results.

3.5 Semi-parametric Optimal Confidence Region

In this section, we construct a confidence region for $\widetilde{A}_{D}$ , under model (3) with unknown noise variance $\sigma^{*2}$ . Similar to Ning et al. (2017), we consider the one-step estimator $\hat{a}(m)$ for each $(A_{m}^{*})_{D_{m}}$ , based on the decorrelated score function:

[TABLE]

where $\widehat{A}_{m}$ is any estimator satisfying the error bounds on $\widehat{A}_{m}-A_{m}^{*}$ in Assumption 3.1; both the Lasso and the Dantzig selector estimators for $A_{m}^{*}$ are suitable. The matrix $\widetilde{\Upsilon^{(m)}}$ takes the form:

[TABLE]

which is another estimator for $\Upsilon^{(m)}$ , and

[TABLE]

We will show that $\hat{a}(m)-(A^{*}_{m})_{D_{m}}$ is asymptotically Gaussian with covariance matrix $(\Upsilon^{(m)})^{-1}$ . Thus we construct the following confidence region for $\widetilde{A}_{D}$ , with asymptotic confidence coefficient $1-\alpha$ :

[TABLE]

This is a $d$ -dimensional ellipsoid centered at $(\hat{a}(1)^{\top},\dots,\hat{a}(k)^{\top})^{\top}$ . The following theorem shows the weak convergence result of

[TABLE]

Theorem 3.4.

Under model (3) with i.i.d. sub-Gaussian noise $\epsilon_{ti}$ with variance $\sigma^{*2}=\text{Var}(\epsilon_{ti})\geq\sigma_{0}^{2}>0$ and sub-Gaussian parameter $\tau\sigma^{*}$ , Theorems 3.1 and 3.2 hold for $\widehat{R}_{T}$ under each corresponding condition, and the constants $C_{i}$ ’s also depend on $\sigma_{0}$ .

Remark 3.7.

In the definition of one-step estimator $\hat{a}(m)$ , we use $\widetilde{\Upsilon^{(m)}}$ instead of $\widehat{\Upsilon^{(m)}}$ for theoretical convenience. Theorem 3.4 would still hold true if $\hat{a}(m)$ is defined as $(\widehat{A}_{m})_{D_{m}}-\left(\widehat{\Upsilon^{(m)}}\right)^{-1}\widetilde{S}_{m}$ .

Remark 3.8.

We have exactly the same theoretical result for $\widetilde{U}_{T}$ and $\widehat{R}_{T}$ , and this is due to the close relationship between these two quantities. In particular,

[TABLE]

compared to $\widetilde{U}_{T}=T\sum_{m=1}^{k}\widehat{S}_{m}^{\top}(\widehat{\Upsilon^{(m)}})^{-1}\widehat{S}_{m}/\hat{\sigma}^{2}.$ We show in the proof of Theorem 3.4 that $\left(\widetilde{\Upsilon^{(m)}}^{\top}\right)^{-1}\widehat{\Upsilon^{(m)}}\left(\widetilde{\Upsilon^{(m)}}\right)^{-1}$ also satisfies Assumption 3.3 as an estimator for $\left(\Upsilon^{(m)}\right)^{-1}$ .

Remark 3.9.

The one-step estimator $\hat{a}(m)$ is asymptotically unbiased, and shares a similar form to the de-biased estimator proposed by Zhang and Zhang (2014), Van de Geer et al. (2014). The de-biased estimator in Van de Geer et al. (2014) would take the following form under our setting:

[TABLE]

where $\widehat{\Theta}$ is computed by node-wise regression, as an estimator for $\Upsilon^{-1}$ . When $d_{m}=|D_{m}|=1$ , this is essentially the same as our estimator $\hat{a}(m)$ , but it is slightly different in the multivariate case. Note that the asymptotic covariance matrix for $\hat{a}(m)$ equals the partial information matrix $I^{*}(A_{m,D_{m}}|A_{m,D_{m}^{c}})$ , and thus $\hat{a}(m)$ is semi-parametric efficient, while $\hat{b}_{m}$ is only efficient when it is a scalar.

Remark 3.10.

$\widehat{R}_{T}$ is also very similar to the test statistic proposed by Neykov et al. (2018) for the VAR model with lag 1. The differences lie in the estimation of $\text{Var}(\epsilon_{ti})$ and in the fact that they only consider the Dantzig selector for estimating $A^{*}$ and $w_{m}^{*}$ . We provide a detailed comparison between their theoretical results and ours in Section 3.6.

3.6 Special case: AR(1) with Gaussian noise

Our theoretical guarantee covers VAR models with lag $p$ and sub-Gaussian noise, of which AR(1) model and Gaussian noise are special cases. Here we explain the consequences of our result under this special case and provide comparison with Neykov et al. (2018).

When we consider lag $p=1$ , the constraint for $A^{*}$ becomes

[TABLE]

with $(\rho\vee s)\log M=o(\sqrt{T})$ . The two sparsity conditions and sample size requirement are included in the conditions Neykov et al. (2018) proposes. In addition, they assume the following:

[TABLE]

for some $0<\varepsilon<1$ . Note that we don’t require these conditions, among which the first and third are quite strong, and the second one $\|A^{*}\|_{2}\leq 1-\varepsilon$ is sufficient for our condition $\sum_{i=0}^{\infty}\left(\sum_{j=0}^{\infty}\left\|(A^{*})^{i+j}\right\|_{2}^{2}\right)^{\frac{1}{2}}\leq\beta$ . This follows since if $\|A^{*}\|_{2}\leq 1-\varepsilon$ ,

[TABLE]
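The implication can also be checked numerically. The sketch below uses only submultiplicativity, $\|(A^{*})^{k}\|_{2}\leq\|A^{*}\|_{2}^{k}\leq(1-\varepsilon)^{k}$ , and compares a truncated version of the resulting double sum with the geometric-series value $\frac{1}{\varepsilon\sqrt{\varepsilon(2-\varepsilon)}}$ . The function names and the truncation level are illustrative assumptions, and the closed form is our own computation rather than a quotation of the display above.

```python
import math

def truncated_dependence_sum(rate, n_terms=400):
    """Truncation of sum_i sqrt(sum_j rate^(2(i+j))).

    This upper-bounds the dependence quantity whenever ||A^k||_2 <= rate^k.
    """
    total = 0.0
    for i in range(n_terms):
        inner = sum(rate ** (2 * (i + j)) for j in range(n_terms))
        total += math.sqrt(inner)
    return total

def closed_form_bound(eps):
    """Geometric-series value 1 / (eps * sqrt(eps * (2 - eps))) for rate = 1 - eps."""
    return 1.0 / (eps * math.sqrt(eps * (2.0 - eps)))

eps = 0.25  # corresponds to a spectral norm of 0.75
print(truncated_dependence_sum(1.0 - eps), closed_form_bound(eps))
```

The bound depends on $\varepsilon$ only, so any such value can serve as the constant $\beta$ uniformly over matrices with $\|A^{*}\|_{2}\leq 1-\varepsilon$ .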

So far the discussion has focused on the case where $\epsilon_{ti}$ are i.i.d. sub-Gaussian noise of scale factor $C\sigma^{*}$ , with $(\sigma^{*})^{2}$ being the variance of $\epsilon_{ti}$ and lower bounded by some constant. Thus our setting covers the case where $\epsilon_{t}\sim\mathcal{N}(0,(\sigma^{*})^{2}I)$ with $\sigma^{*}\geq c$ . If $\epsilon_{t}\sim\mathcal{N}(0,\Psi)$ with $\Psi_{ii}\geq c$ as assumed in Neykov et al. (2018), we can still prove the same theoretical guarantee, under an even weaker condition based on the spectral density, thanks to the concentration bounds established in Basu et al. (2015).

4 Numerical Experiments

In this section, we provide a simulation study to validate our theoretical results. For simplicity, our simulation is based on the AR(1) model:

[TABLE]

where $A^{*}\in\mathbb{R}^{M\times M}$ is set to be row-wise sparse. Symmetry is not required in our theory, but in order to ensure the sparsity of $w_{m}^{*}$ , we focus on symmetric matrices under $\mathcal{H}_{0}$ , and slightly asymmetric ones under $\mathcal{H}_{A}$ . The eigenvalues of $A^{*}$ all lie inside the unit circle of the complex plane, which ensures the existence of a stationary solution to this model. The white noise $\epsilon_{ti}$ is simulated as independent $\mbox{Uniform}(-1,1)$ , which satisfies the sub-Gaussianity condition. Other distributions were also used but are not reported since the results were very similar.

To consider multi-variate test sets, throughout the simulation we test the index set $D$ with $d=|D|=6$ , which involves three different rows and two columns in each row:

[TABLE]

The null hypothesis takes the form $\mathcal{H}_{0}:\widetilde{A}_{D}=\mu$ with some $d$ -dimensional vector $\mu$ . Correspondingly, we consider alternative hypothesis $\mathcal{H}_{A}:\widetilde{A}_{D}=\mu+T^{-\phi}\Delta$ , with $\Delta$ randomly selected from $d$ -dimensional Gaussian distribution, and $\phi$ ranges from $0.25$ to $1.2$ .

Under $\mathcal{H}_{0}$ , we generate $A^{*}$ with different row-wise sparsity levels and structures, and for each $A^{*}$ , the vector $\mu$ may differ depending on the corresponding $\widetilde{A}_{D}$ . Under $\mathcal{H}_{A}$ , we use the same matrices $A^{*}$ as under $\mathcal{H}_{0}$ , except that the tested entries $\widetilde{A}_{D}$ are shifted by $T^{-\phi}\Delta$ . The experiments are repeated under different settings of $A^{*}$ , $\Delta$ , $M,T$ and $\phi$ .

We use Lasso estimators defined in (25), (27) for the estimation of $A^{*}$ and $w_{m}^{*}$ , $1\leq m\leq k$ , and tuning parameters $\lambda_{A}$ , $\lambda_{w}$ are selected using cross validation. In cross validation, the training sets are composed of consecutive time series data, with the remaining 10% of the original data set being testing sets. Under $\mathcal{H}_{0}$ , 1000 simulations are carried out under each parameter setting, while under $\mathcal{H}_{A}$ , we have 100 simulations. In the following sections, we look into false positive rates (FPR) and true positive rates (TPR) of test statistics $\widetilde{U}_{T}$ and $\widehat{R}_{T}$ as defined in (32) and (36), when we set the level of test as $\alpha=0.05$ .
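One way to build validation splits that respect the time ordering is to hold out a contiguous block of 10% of the time points and train on the rest. The sketch below is an assumption about how such splits could be formed; the function name, the number of folds, and the choice to rotate the held-out block are hypothetical and are not specified in the text.

```python
def blocked_folds(n_obs, n_folds=10):
    """Yield (train_idx, test_idx) pairs where each test set is a contiguous block.

    Indices refer to transitions t -> t + 1, so index t stands for the
    regression pair (X_t, X_{t+1}).
    """
    fold_size = n_obs // n_folds
    for f in range(n_folds):
        start = f * fold_size
        # The last fold absorbs any remainder so that every index is held out once.
        stop = n_obs if f == n_folds - 1 else start + fold_size
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_obs))
        yield train_idx, test_idx

for train_idx, test_idx in blocked_folds(1000):
    print(test_idx[0], test_idx[-1], len(train_idx))
```

Keeping the held-out points contiguous limits the leakage between training and validation data that random splits would cause in a dependent series.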

4.1 Under the Null Hypothesis

(1)

Varying sparsity

Here we summarize the experiments with randomly generated $A^{*}$ that are symmetric and row-wise sparse, with different sparsity levels $\rho$ defined in (10). Figure 1 shows how the FPR of $\widetilde{U}_{T}$ and $\widehat{R}_{T}$ , averaged over 1000 experiments, varies with $\sqrt{T}$ . We can see that when $T$ increases to about 500, the FPR becomes stable and close to $\alpha=0.05$ regardless of $\rho$ , $M$ , and the choice between $\widetilde{U}_{T}$ and $\widehat{R}_{T}$ .

When the sample size $T$ is small, the test tends to be conservative, which is the consequence of estimating variance $\sigma^{*2}$ and covariances $\Upsilon^{(m)}$ ’s. In the simulation we use naive estimators for these two quantities, as defined in (31) and (29) which tend to be smaller than the true parameters. This is because we usually fit noise in the regression, as noticed by Fan et al. (2012). As shown in these two figures, $\widehat{R}_{T}$ is less conservative than $\widetilde{U}_{T}$ when $T$ is small, since the magnitude of $\widetilde{\Upsilon}^{(m)}$ is larger than $\widehat{\Upsilon}^{(m)}$ , which makes $\left(\widetilde{\Upsilon}^{(m)^{\top}}\right)^{-1}\widehat{\Upsilon}^{(m)}\left(\widetilde{\Upsilon}^{(m)}\right)^{-1}$ probably a better estimator for $\Upsilon^{(m)}$ .

We also summarize the FPR when the variance $\sigma^{*2}$ of $\epsilon_{ti}$ is known in Figure 2. We can see from these figures that $\widehat{U}_{T}$ is still a little conservative when $T$ is small, while $\widehat{R}_{T}$ with $\hat{\sigma}^{2}$ substituted by $\sigma^{*2}$ is not conservative.

(2)

Different Graph Structures

If we consider the $M$ actors in the time series as nodes in a network, and a nonzero $A_{ij}^{*}$ represents a directed edge from $j$ to $i$ , then each matrix $A^{*}$ corresponds to a directed graph on $M$ nodes. We experiment with different structures of $A^{*}$ , which correspond to different graph structures, including a block graph and a chain graph. Specifically, we consider matrices with $\ell_{2}$ norm equal to 0.75:

[TABLE]

which is a block graph;

[TABLE]

with constant $c$ chosen to ensure $\left\|A^{(2)}\right\|_{2}=0.75$ , which is a chain graph; and $A^{(3)}$ being randomly generated symmetric matrix of sparsity level $\rho=2$ , and largest eigenvalue equal to 0.75. Figure 3 shows the difference among these three different structures. We can see that block graph is less accurate than the other two, which is due to a larger variance for each $X_{t,D_{m}}-w_{m}^{*\top}X_{t,D_{m}^{c}}$ . Investigating the question of how graph structure theoretically influences testing performance remains an open and interesting direction.
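The scaling of a chain-structured matrix to a prescribed spectral norm can be sketched as follows. The zero-diagonal, nearest-neighbor form used here is an assumption made only for illustration (the paper's $A^{(2)}$ is the display above and may differ), and the function names are hypothetical. The scaling constant uses the known eigenvalues $2\cos\left(\frac{k\pi}{M+1}\right)$ of the path-graph adjacency matrix, and a power iteration checks the result.

```python
import math

def chain_matrix(M, target_norm=0.75):
    """Symmetric tridiagonal matrix with zero diagonal and spectral norm target_norm."""
    # The unscaled path-graph adjacency has largest eigenvalue 2 * cos(pi / (M + 1)).
    c = target_norm / (2.0 * math.cos(math.pi / (M + 1)))
    A = [[0.0] * M for _ in range(M)]
    for i in range(M - 1):
        A[i][i + 1] = c
        A[i + 1][i] = c
    return A

def spectral_norm(A, n_iter=500):
    """Estimate ||A||_2 by power iteration on A^T A."""
    M = len(A)
    v = [1.0 / math.sqrt(M)] * M                      # unit-norm starting vector
    est = 0.0
    for _ in range(n_iter):
        w = [sum(A[i][j] * v[j] for j in range(M)) for i in range(M)]  # A v
        u = [sum(A[j][i] * w[j] for j in range(M)) for i in range(M)]  # A^T (A v)
        norm_u = math.sqrt(sum(x * x for x in u))
        est = math.sqrt(norm_u)   # ||A^T A v|| tends to the top eigenvalue of A^T A
        v = [x / norm_u for x in u]
    return est

print(spectral_norm(chain_matrix(10)))
```

Fixing the spectral norm across structures keeps the strength of temporal dependence comparable, so differences in test accuracy can be attributed to the graph structure itself.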

4.2 Alternative Hypothesis

First we look into how the true positive rate (TPR) varies with $\|T^{-\phi}\Delta\|_{2}$ , since we set $\mathcal{H}_{A}$ as $\widetilde{A}_{D}=\mu+T^{-\phi}\Delta$ and $\|T^{-\phi}\Delta\|_{2}$ may be viewed as a measure of distance from the null hypothesis. Fig. 4 only presents the simulation results when $A^{*}=A^{(1)}$ and $M=300$ , while the other choices of $A^{*}$ and $M$ generate very similar results. We can see from these two figures that as $\|T^{-\phi}\Delta\|_{2}$ increases, TPR approaches 1. The slope increases when sample size $T$ gets larger, or when the test statistic changes from $\widehat{R}_{T}$ to $\widetilde{U}_{T}$ . This aligns with intuition, since when $T$ increases, we are supposed to distinguish between $\mathcal{H}_{0}$ and $\mathcal{H}_{A}$ better, and $\widetilde{U}_{T}$ is more conservative than $\widehat{R}_{T}$ as we show in subsection 4.1.

We also check the influence of $\phi$ . Figure 5 shows how the TPR changes as $T$ increases, when $\left\|\widetilde{\Delta}\right\|_{2}$ and $\phi$ are held fixed. If $\phi<0.5$ , the TPR converges to 1 very quickly, while if $\phi>0.5$ , the TPR converges to 0.05, but the convergence is slower when $\phi$ or $\left\|\widetilde{\Delta}\right\|_{2}$ increases. When $\phi=0.5$ , Theorems 3.3 and 3.4 state that $\widetilde{U}_{T}$ and $\widehat{R}_{T}$ converge to $\chi^{2}_{d,\left\|\widetilde{\Delta}\right\|_{2}^{2}}$ , thus the TPR should converge to some value between 0.05 and 1, depending on $d$ and $\left\|\widetilde{\Delta}\right\|_{2}^{2}$ . The black lines in Figure 5 indicate this limiting value, but since the test tends to be conservative when $T$ is not large enough, the TPR when $\phi=0.5$ is usually above the black line. The conservative issue is more severe under $\mathcal{H}_{A}$ since the deviation $\widetilde{\Delta}$ is also multiplied by the estimated variances, which exaggerates the conservative tendency. However, this may not be a big concern under $\mathcal{H}_{A}$ , since we always want the TPR to be large.
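The three regimes can be reproduced in an idealized setting by replacing the test statistic with its Gaussian limit, that is, by $\|Z+T^{\frac{1}{2}-\phi}\widetilde{\Delta}\|_{2}^{2}$ with $Z\sim\mathcal{N}(0,I_{d})$ . The sketch below ignores all estimation error, so it illustrates only the limiting behavior and not the finite-sample simulations reported in Figure 5; the vector `delta`, the value $d=6$ , and the function name are assumptions of the example.

```python
import random

CHI2_6_95 = 12.5916  # 0.95 quantile of the chi-square law with 6 degrees of freedom

def idealized_tpr(T, phi, delta, n_rep=20000, seed=0):
    """Monte Carlo rejection rate of the level-0.05 test in the Gaussian limit.

    The statistic is ||Z + T^(1/2 - phi) * delta||^2 with Z standard Gaussian,
    where delta plays the role of the standardized deviation (d = 6 entries).
    """
    rng = random.Random(seed)
    shift = T ** (0.5 - phi)
    hits = 0
    for _ in range(n_rep):
        stat = sum((rng.gauss(0.0, 1.0) + shift * d_j) ** 2 for d_j in delta)
        hits += stat > CHI2_6_95
    return hits / n_rep

delta = [0.5] * 6
for phi in (0.25, 0.5, 1.2):
    print(phi, [idealized_tpr(T, phi, delta) for T in (100, 10000)])
```

For $\phi<0.5$ the shift grows with $T$ and the rejection rate approaches 1, for $\phi>0.5$ the shift vanishes and the rate approaches the nominal level, and for $\phi=0.5$ the rate does not depend on $T$ .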

5 Proof Overview

One of the main contributions of this work is the proof technique, which addresses a number of technical challenges and develops novel concentration bounds for dependent sub-Gaussian random vectors. In this section, we present and discuss key lemmas for the proof and provide the main steps for proving Theorems 3.1 and 3.2, deferring the more technically intensive steps to the supplement.

5.1 Key Lemmas

The major technical challenge lies in proving the following two concentration bounds for dependent sub-Gaussian random vectors.

Lemma 5.1 (Deviation Bound for $A^{*}$ ).

Under model (3), when $\epsilon_{ti}$ are sub-Gaussian noise with scale factor $\tau$ , and $A^{*}\in\Omega_{0}\cup\Omega_{1}$ ,

[TABLE]

when $T\geq C\log M$ .

Lemma 5.1 is a standard deviation bound used to prove estimation error bounds for Lasso-type or Dantzig-selector-type estimators. We apply this lemma in the proofs of Theorems 3.1 and 3.2 and of Lemma 3.1.

Lemma 5.2.

Under model (3), when $\epsilon_{ti}$ are sub-Gaussian noise with constant scale factor $\tau$ , and $A^{*}\in\Omega_{0}\cup\Omega_{1}$ , if $B\in\mathbb{R}^{pM\times pM}$ is a symmetric matrix, we have

[TABLE]

Lemma 5.2 provides a concentration bound for the sample average of the general quadratic form $\mathcal{X}_{t}^{\top}B\mathcal{X}_{t}$ , and is very helpful in proving the martingale CLT under our setting, the REC, Lemma 3.3, etc.

In the Gaussian case, both these lemmas follow from prior work in Basu et al. (2015) which relies on the fact that dependent Gaussian vectors can be rotated to be independent. Since dependent sub-Gaussian random variables cannot be rotated to be independent (only uncorrelated), we exploit the independence of $\epsilon_{t}$ by representing each $\mathcal{X}_{t}$ by linear function of the infinite series $\{\epsilon_{i}\}_{i=-\infty}^{i=t}$ and then use a careful truncation argument. We analyze sufficiently many terms in the summation, and control the infinite residues.
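The truncation idea can be illustrated in the simplest scalar case. For $x_{t+1}=ax_{t}+\epsilon_{t}$ , the value $x_{T}$ equals a $K$ -term moving average of the most recent noise terms plus a remainder $a^{K}x_{T-K}$ that decays geometrically in $K$ . The sketch below starts the recursion at $x_{0}=0$ instead of working with the stationary two-sided series, and the parameter values and function name are assumptions of the example rather than part of the proof.

```python
import random

def scalar_ar1_truncation(a=0.75, T=300, K=40, seed=0):
    """Compare x_T from the recursion with its K-term moving-average truncation.

    Returns (x_T, truncated sum, remainder, a^K * x_{T-K}); the last two agree
    up to floating-point error.
    """
    rng = random.Random(seed)
    eps = [rng.uniform(-1.0, 1.0) for _ in range(T)]
    x = [0.0]
    for t in range(T):
        x.append(a * x[t] + eps[t])                    # x_{t+1} = a x_t + eps_t
    truncated = sum(a ** i * eps[T - 1 - i] for i in range(K))
    remainder = x[T] - truncated
    return x[T], truncated, remainder, a ** K * x[T - K]

x_T, truncated, remainder, predicted = scalar_ar1_truncation()
print(x_T, truncated, remainder, predicted)
```

The truncated sum involves only independent noise terms, which is what makes concentration arguments for independent variables applicable, while the remainder is controlled separately through its geometric decay.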

5.2 Proof of Theorem 3.1

Proof.

Suppose $A^{*}\in\Omega_{0}$ . We will use $C_{i},c_{i}$ to refer to constants that only depend on $p,d,\beta,\tau$ (not $M$ or $T$ ), and different constants might share the same notation.

The proof can be divided into two major parts: showing the convergence of $U_{T}$ to $\chi_{d}^{2}$ , and bounding the estimation error $\left|\widehat{U}_{T}-U_{T}\right|$ . Formally, for any $\varepsilon>0$ ,

[TABLE]

and

[TABLE]

which implies

[TABLE]

In the following, we provide bounds on each of the three terms. The following lemma shows the uniform weak convergence rate of $\left\|V_{T}+\mu\right\|_{2}^{2}$ to $\chi^{2}_{d,\|\mu\|_{2}^{2}}$ , of which the convergence of $U_{T}=\|V_{T}\|_{2}^{2}$ to $\chi_{d}^{2}$ is a special case.

Lemma 5.3 (Convergence Rate of $\left\|V_{T}+\mu\right\|_{2}^{2}$ ).

Under model (3) with $\epsilon_{ti}$ being sub-Gaussian noise of scale factor $\tau$ , then for any $A^{*}\in\Omega_{0}$ , $\forall\mu\in\mathbb{R}^{d}$ ,

[TABLE]

when $T>C$ for some absolute constant $C$ , where $C(\|\mu\|_{2})$ is a constant depending on and is non-decreasing with respect to $\|\mu\|_{2}$ .

This Lemma is proved in section C, by applying a uniform martingale central limit theorem result. Thus, by Lemma 5.3, if $T>C$ for some constant $C$ ,

[TABLE]

Meanwhile,

[TABLE]

since $\chi_{d}^{2}$ has bounded density.

Now we only need to choose a proper $\varepsilon$ and bound $\mathbb{P}\left(\left|\widehat{U}_{T}-U_{T}\right|>\varepsilon\right)$ .

[TABLE]

Define $E_{m}=\sqrt{T}(\Upsilon^{(m)})^{-\frac{1}{2}}\left(\widehat{S}_{m}-S_{m}\right)$ , then (40) turns into

[TABLE]

We can bound $\|V_{T,m}\|_{2}$ using Lemma 5.3 and $\left\|\Upsilon^{(m)\frac{1}{2}}\left(\widehat{\Upsilon^{(m)}}\right)^{-1}\Upsilon^{(m)\frac{1}{2}}-I\right\|_{\infty}$ using the bound (19). To bound the estimation-induced error $\|E_{m}\|_{2}$ , we first apply the following lemma to bound the eigenvalues of $\Upsilon^{(m)}$ .

Lemma 5.4.

Consider the model (2) with independent noise $\epsilon_{ti}$ of unit variance, $A^{*}$ satisfies (13), then the eigenvalues of $\Upsilon$ can be bounded as follows:

[TABLE]

Lemma 5.4 is proved based on established results in Basu et al. (2015). Note that we assumed unit variance in Theorem 3.1 and 3.2, so we can apply Lemma 5.4 here. Since $\left(\Upsilon^{(m)}\right)^{-1}=\left(\Upsilon^{-1}\right)_{D_{m},D_{m}}$ , applying Lemma 5.4 would lead us to the following:

[TABLE]

Thus we have

[TABLE]

with

[TABLE]

The following two lemmas provide bounds for $\left\|\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{X}_{t,D_{m}^{c}}\epsilon_{t,m}\right\|_{\infty}$ , and

[TABLE]

Lemma 5.5.

When $T\geq C\log M$ ,

[TABLE]

Lemma 5.1 is a common condition in high-dimensional regression problems, and is usually referred to as deviation bound. We will prove it in Section C.

Lemma 5.6 (Deviation Bound for $w_{m}^{*}$ ).

With probability at least $1-c_{1}\exp\{-c_{2}\log M\}$ , for all $1\leq m\leq k$ ,

[TABLE]

Lemma 5.6 can also be viewed as a deviation bound, if we consider a regression problem with $\mathcal{X}_{t,D_{m}}$ as response and $\mathcal{X}_{t,D_{m}^{c}}$ as covariates. This is also proved in Section C. Applying Assumptions 3.1 and 3.2, with probability at least $1-c_{1}\exp\{-c_{2}\log M\}$ ,

[TABLE]

where

[TABLE]

and Assumptions 3.1 and 3.2 imply $Q_{1}\leq C\frac{\rho_{m}\log M}{T}$ and $Q_{2}\leq C\frac{s_{m}\log M}{T}$ . The former is not straightforward: to see why it holds, let $\hat{h}_{m}=\widehat{A}_{m}-A_{m}^{*}$ and $H=\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{X}_{t}\mathcal{X}_{t}^{\top}$ ; then we have

[TABLE]

Here we apply Assumption 3.1, and the fact that

[TABLE]

The last inequality is due to Lemma 5.4 and the following lemma:

Lemma 5.7.

With probability at least $1-c_{1}\exp\{-c_{2}\log M\}$ ,

[TABLE]

Therefore, by taking a union bound, we show that

[TABLE]

for any $1\leq m\leq k$ , with probability at least $1-c_{1}\exp\{-c_{2}\log M\}$ .

Meanwhile, by applying Lemma 5.3, one can show that for $y>\sqrt{5d}$ ,

[TABLE]

where the second inequality is due to a $\chi_{d}^{2}$ tail bound established in Laurent and Massart (2000) (see Lemma 1 in Laurent and Massart (2000)), and the third inequality comes from the fact that, $\forall$ constant $C_{1}>0$ , $\exists$ constant $C_{2}$ such that

[TABLE]
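One way to see where a threshold of the form $y>\sqrt{5d}$ can come from is the following short computation based on Lemma 1 of Laurent and Massart (2000), which states that $\mathbb{P}\left(\chi_{d}^{2}\geq d+2\sqrt{dx}+2x\right)\leq e^{-x}$ for all $x>0$ . The constant $\frac{1}{5}$ in the exponent is our own reading and may differ from the constants used in the displays above.

```latex
\text{Take } x = \tfrac{y^{2}}{5} \text{ with } y > \sqrt{5d}, \text{ so that } d < x. \text{ Then}
\qquad
d + 2\sqrt{dx} + 2x \;<\; x + 2x + 2x \;=\; y^{2},
\qquad\text{hence}\qquad
\mathbb{P}\left(\|Z\|_{2} \geq y\right)
  = \mathbb{P}\left(\chi_{d}^{2} \geq y^{2}\right)
  \leq \mathbb{P}\left(\chi_{d}^{2} \geq d + 2\sqrt{dx} + 2x\right)
  \leq \exp\left\{-\tfrac{y^{2}}{5}\right\},
\qquad Z \sim \mathcal{N}(0, I_{d}).
```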

Let $y=\left(\frac{(s\vee\rho)\log M}{\sqrt{T}}\right)^{-\frac{1}{4}}$ and plug it into (41), then with Assumption 3.3, we can show that with probability at least

[TABLE]

the following holds:

[TABLE]

if $(s\vee\rho)\log M=o(\sqrt{T})$ and $T>C$ for some constant $C$ . Therefore, applying (38) with $\varepsilon=C\left(\frac{(s\vee\rho)\log M}{\sqrt{T}}\right)^{\frac{1}{2}}$ ,

[TABLE]

Since constants $C_{i}$ only depend on $d,\beta$ and $\tau$ , this bound also holds for supremum over $A^{*}\in\Omega_{0}$ and $x\in\mathbb{R}$ . Note that for a clear presentation, we are not showing the sharpest bound, which can be obtained by choosing a different $y$ . ∎

5.3 Proof of Theorem 3.2

Proof of Theorem 3.2.

We prove this case by case. We will use $C_{i},c_{i}$ to refer to constants that only depend on $d,\beta,\Delta,\phi$ , and different constants might share the same notation.

Similar to the proof of Theorem 3.1, the major part of the proof is devoted to bounding $\left|\widehat{U}_{T}-\left\|V_{T}+\mu\right\|_{2}^{2}\right|$ with high probability for some vector $\mu\in\mathbb{R}^{d}$ .

(1)

$\phi=\frac{1}{2}$

Suppose $A^{*}\in\Omega_{1}$ . Using similar deduction as in the proof of Theorem 3.1, for any $\varepsilon>0$ ,

[TABLE]

(a)

Bounding the first two terms

The first term is the convergence rate of $\|V_{T}-\widetilde{\Delta}\|_{2}^{2}$ to $\chi^{2}_{d,\|\widetilde{\Delta}\|_{2}^{2}}$ . By Lemma 5.3,

[TABLE]

The last inequality is due to

[TABLE]

and an upper bound for $\Lambda_{\max}\left(\Upsilon^{(m)}\right)$ in (42).

Bounding the second term in (46) is not as straightforward as bounding $F_{d}(x+\varepsilon)-F_{d}(x-\varepsilon)$ in the proof of Theorem 3.1, since $\widetilde{\Delta}$ is not a constant vector when $A^{*}$ takes different values in $\Omega_{1}^{*}$ . We only have a uniform bound on $\left\|\widetilde{\Delta}\right\|_{2}$ as shown above. One can show that

[TABLE]

where $Z$ is a $d$ -dimensional standard Gaussian random vector with density $\phi(z)=C(d)\exp\{-\|z\|_{2}^{2}/2\}$ . The last inequality holds because, for any set $\mathcal{C}\subset\mathbb{R}^{d}$ ,

[TABLE]

Suppose $0<\varepsilon\leq 1$ , then if $\sqrt{x-\varepsilon}\geq 2\|\widetilde{\Delta}\|_{2}$ ,

[TABLE]

otherwise,

[TABLE]

Thus,

[TABLE]

(b)

Bounding $\left|\widehat{U}_{T}-\left\|V_{T}-\widetilde{\Delta}\right\|_{2}^{2}\right|$

Similar to (41) in the proof of Theorem 3.1, it is straightforward to show that

[TABLE]

where $E_{m}=\sqrt{T}(\Upsilon^{(m)})^{-\frac{1}{2}}\widehat{S}_{m}-V_{T,m}+\widetilde{\Delta}_{m}$ . To bound $\|E_{m}\|_{2}$ , note that

[TABLE]

and

[TABLE]

with $\widetilde{S}_{m}\in\mathbb{R}^{d_{m}}$ and $W_{m}^{*}\in\mathbb{R}^{d_{m}\times M}$ defined as follows:

[TABLE]

Therefore,

[TABLE]

The last inequality applies (42). Meanwhile,

[TABLE]

The first equality and second inequality come from the definitions of $W_{m}^{*}$ and $w_{m}^{*}$ ; the third inequality holds because $\left\|\Upsilon_{\cdot i}\right\|_{2}^{2}=\left(\Upsilon^{2}\right)_{ii}$ ; the fourth inequality follows from $\left(\Upsilon^{2}\right)_{ii}=e_{i}^{\top}\Upsilon^{2}e_{i}\leq\Lambda_{\max}(\Upsilon)^{2}$ ; and the last inequality is obtained from Lemma 5.4. Applying Lemma 5.7 leads us to

[TABLE]

We can write $\widehat{S}_{m}-\widetilde{S}_{m}$ as

[TABLE]

Note that

[TABLE]

due to Lemma 5.4 and 5.7, which further implies

[TABLE]

Applying Assumption 3.1 to 3.3, Lemma 5.1, 5.6, one can show that with probability at least $1-c_{1}\exp\{-c_{2}\log M\}$ ,

[TABLE]

with the same arguments as bounding $\|\widehat{S}_{m}-S_{m}\|_{2}$ under $\mathcal{H}_{0}$ .

While for $\left\|V_{T,m}-(\Upsilon^{(m)})^{\frac{1}{2}}\Delta_{m}\right\|_{2}$ , applying Lemma 5.3 leads us to

[TABLE]

for any $y\geq 0$ , where $Z\sim\mathcal{N}(0,I_{d})$ . We apply the tail bound for $\chi^{2}_{d}$ (Lemma 1 in Laurent and Massart (2000)) as in (45), and obtain

[TABLE]

when $y>C$ for some constant $C$ . Letting $y=\left(\frac{(s\vee\rho)\log M}{\sqrt{T}}\right)^{-\frac{1}{4}}$ and plugging $\left\|V_{T,m}-(\Upsilon^{(m)})^{\frac{1}{2}}\Delta_{m}\right\|_{2}\leq y$ , (51) and (19) into (47), one can show that

[TABLE]

with probability at least

[TABLE]

if $(s\vee\rho)\log M=o(\sqrt{T})$ and $T>C$ .

Therefore, applying (46) with $\varepsilon=C\left(\frac{(s\vee\rho)\log M}{\sqrt{T}}\right)^{\frac{1}{2}}$ leads to

[TABLE]

Since constants $C_{i}$ only depend on $d,\beta,\Delta,\tau$ , this bound also holds for supremum over $A^{*}\in\Omega_{1}$ and $x\in\mathbb{R}$ .

(2)

$0<\phi<\frac{1}{2}$

First we provide a lower bound for $\widehat{U}_{T}$ with high probability. Since bounds in Assumption 3.1 to 3.3, Lemma 5.1 to 5.7 hold with probability at least $1-c_{1}\exp\{-c_{2}\log M\}$ , we apply these bounds directly in following deduction. Meanwhile, we always assume $(\rho\vee s)\log M=o(\sqrt{T})$ and $T>C$ for desired constant $C$ . With these conditions, one can show that

[TABLE]

The third line is due to Assumption 3.3, which implies $\left\|\Upsilon^{(m)\frac{1}{2}}(\widehat{\Upsilon^{(m)}})^{-1}\Upsilon^{(m)\frac{1}{2}}-I\right\|_{\infty}$ converges to 0 under our scaling $(\rho\vee s)\log M=o(\sqrt{T})$ .

We provide a lower bound for $\left\|(\Upsilon^{(m)})^{-\frac{1}{2}}(\widehat{S}_{m}-S_{m})\right\|_{2}^{2}$ in the following. First write $\widehat{S}_{m}-S_{m}$ as

[TABLE]

We now find upper bounds for $\left\|E_{m}^{(1)}\right\|_{2},\left\|E_{m}^{(3)}\right\|_{2}$ and a lower bound for $\left\|E_{m}^{(2)}\right\|_{2}$ . Applying Assumption 3.2 and Lemma 5.1 provides an upper bound for $\left\|E_{m}^{(1)}\right\|_{2}$ :

[TABLE]

Since

[TABLE]

then using the same argument as bounding $\|\widehat{S}_{m}-S_{m}\|_{2}$ when proving Theorem 3.1, we have

[TABLE]

To lower bound $\|E_{m}^{(2)}\|_{2}$ , first note that

[TABLE]

where we apply (49), Lemma 5.7, Assumption 3.2, and bound $\left\|\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{X}_{t}\mathcal{X}_{t}^{\top}\right\|_{\infty}$ using the same argument as in (50). Thus,

[TABLE]

since $\Delta_{m}$ is a constant vector, and $\Lambda_{\min}(\Upsilon^{(m)})$ is lower bounded by a constant as in (42).

Applying these bounds for $\|E_{m}^{(i)}\|_{2},1\leq i\leq 3$ , one can show that,

[TABLE]

Plugging this into (52) and applying Lemma 5.3, we have

[TABLE]

where in the last line we apply the $\chi_{d}^{2}$ tail bound as in (45). Since the constants here only depend on $d,\beta,\Delta,\tau$ , this bound holds when taking supremum over $A^{*}\in\Omega_{1}$ and $x\in\mathbb{R}$ .

(3)

$\phi>\frac{1}{2}$

The proof of this case is similar to that of Theorem 3.1. The only differences lie in the choice of $\varepsilon$ and in bounding $\mathbb{P}\left(\left|\widehat{U}_{T}-U_{T}\right|>\varepsilon\right)$ . The bound (41) for $\left|\widehat{U}_{T}-U_{T}\right|$ still holds here, with $E_{m}=\sqrt{T}(\Upsilon^{(m)})^{-\frac{1}{2}}(\widehat{S}_{m}-S_{m})$ . We directly apply the bounds in Assumptions 3.1 to 3.3, and Lemma 5.1 to Lemma 5.7 in the following. First we write

[TABLE]

Note here that the first three terms are exactly the same as in (43), and thus can be bounded as in the proof of Theorem 3.1. We only have to tackle the last term. By (53), one can show that,

[TABLE]

Thus, going through the same arguments as bounding $\left\|\widehat{S}_{m}-S_{m}\right\|_{2}$ under $\mathcal{H}_{0}$ , we have

[TABLE]

with probability at least $1-C\exp\{-c\log M\}$ . Recall that in (45), when $y>C$ for some constant $C$ ,

[TABLE]

Let $y=\left(\frac{(s\vee\rho)\log M}{\sqrt{T}}\right)^{-\frac{1}{4}}\wedge T^{\frac{2\phi-1}{6}}$ , then by (41) one can show that

[TABLE]

with probability at least

[TABLE]

if $(s\vee\rho)\log M=o(\sqrt{T})$ and $T>C$ for some constant $C$ . Therefore, applying (38) with $\varepsilon=C_{1}\left(\frac{(s\vee\rho)\log M}{\sqrt{T}}\right)^{\frac{1}{2}}+C_{2}T^{\frac{1-2\phi}{3}}$ ,

[TABLE]

Since constants $C_{i}$ only depend on $d,\beta,\tau,\Delta$ , this bound also holds for supremum over $A^{*}\in\Omega_{1}$ and $x\in\mathbb{R}$ .

∎

6 Conclusion

In this paper, we have provided theoretical guarantees for hypothesis tests for sparse high-dimensional auto-regressive models with sub-Gaussian innovations. Specific upper bounds for the convergence rates of test statistics are given. Importantly, our results go beyond the Gaussian assumption and do not rely on mixing assumptions. As a consequence of our theory, we also develop novel concentration bounds for quadratic forms of dependent sub-Gaussian random variables using a careful truncation argument.

It would be of interest to consider other variance estimation methods, e.g., the scaled Lasso (Sun and Zhang, 2012) or cross-validation based methods (Fan et al., 2012), and to establish the corresponding theoretical guarantees. There also remain a number of open questions and challenges, including extensions to generalized linear models, heavy-tailed innovations, and incorporating hidden variables in the time series setting.

Acknowledgements

We would like to thank both Sumanta Basu and Yiming Sun for useful discussions and comments. LZ and GR were supported by ARO W911NF-17-1-0357 and NGA HM0476-17-1-2003. GR was also supported by NSF DMS-1811767.

Appendix A Proof of Lemmas in Section 3.3

Proof of Lemma 3.1.

We prove the error bounds for each $\widehat{A}_{m}$ and then take a union bound. Without loss of generality, we consider the estimation of $A_{1}^{*}\in\mathbb{R}^{M}$ . With a slight abuse of notation, let $S=\text{supp}(A_{1}^{*})$ , $\hat{h}=\widehat{A}_{1}-A_{1}^{*}$ , and $H=\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{X}_{t}\mathcal{X}_{t}^{\top}$ (here $S$ is not the decorrelated score function we defined in section 9). We would like to bound $\|\hat{h}\|_{1}$ , $\|\hat{h}\|_{2}$ and $\hat{h}^{\top}H\hat{h}$ under two cases separately:

(1)

$\widehat{A}=\widehat{A}^{(L)}$ .

Here we adopt the standard proof framework for Lasso. By (25) we know that $\widehat{A}_{1}\in\mathbb{R}^{M}$ satisfies

[TABLE]

which implies

[TABLE]

Rearranging the terms, we have

[TABLE]

The last line holds because

[TABLE]

By Lemma 5.1, with probability at least $1-c_{1}\exp\{-c_{2}\log M\}$ ,

[TABLE]

Meanwhile, since $H$ is positive semi-definite,

[TABLE]

We have the following restricted eigenvalue condition for $H$ .

Lemma A.1.

Under the model specified in (3) with independent sub-Gaussian noise $\epsilon_{ti}$ of constant scale factor, and $A^{*}\in\Omega_{0}\cup\Omega_{1}$ , for any set $J\subset\{1,2,\cdots,pM\}$ , positive integer $\kappa>0$ , $H$ satisfies the following REC:

[TABLE]

with probability at least $1-2\exp\left\{-cT\right\}$ , when $|J|\log pM\leq C_{2}T$ . Here $\mathcal{C}(J,\kappa)=\{v:\|v_{J^{c}}\|_{1}\leq\kappa\|v_{J}\|_{1}\}$ , constant $C_{1}$ depends on $\beta$ , $c$ and $C_{2}$ depend on $\kappa$ and $\beta$ .

Since $\hat{h}\in\mathcal{C}(S,3)$ and $|S|=\rho_{1}$ , Lemma A.1 implies that, when $T\geq C\rho\log M$ ,

[TABLE]

with probability at least $1-2\exp\{-cT\}$ . Thus

[TABLE]

which implies

[TABLE]

with probability at least $1-c_{1}\exp\{-c_{2}\log M\}$ .
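To make the quantities in this case concrete, the following is a minimal numerical sketch of a row-wise Lasso fit for a VAR(1) model, with the error vector $\hat{h}=\widehat{A}_{1}-A_{1}^{*}$ computed at the end. Everything specific here is an assumption made for illustration and is not taken from the paper: the objective normalization $\frac{1}{2T}\|y-Xa\|_{2}^{2}+\lambda\|a\|_{1}$ (which may differ from (25) by constants), the ISTA solver `lasso_ista`, the choice $A^{*}=0.5I$, uniform noise with unit variance, and the tuning constant 4 in $\lambda=4\sqrt{\log M/T}$.

```python
import numpy as np

def soft_threshold(v, thr):
    # Proximal operator of thr * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    # Hypothetical helper: minimizes (1/(2T)) ||y - X a||_2^2 + lam ||a||_1
    # by proximal gradient descent (ISTA) with step size 1 / Lambda_max(H).
    T, M = X.shape
    H = X.T @ X / T          # sample covariance, plays the role of H in the proof
    g = X.T @ y / T
    step = 1.0 / np.linalg.eigvalsh(H)[-1]
    a = np.zeros(M)
    for _ in range(n_iter):
        a = soft_threshold(a - step * (H @ a - g), step * lam)
    return a

rng = np.random.default_rng(2)
M, T = 20, 2000
A_star = 0.5 * np.eye(M)                     # illustrative sparse transition matrix
X = np.zeros((T + 1, M))
for t in range(T):
    # X_{t+1} = A* X_t + eps_t with bounded (hence sub-Gaussian) unit-variance noise
    X[t + 1] = A_star @ X[t] + rng.uniform(-np.sqrt(3), np.sqrt(3), size=M)

design, response = X[:-1], X[1:, 0]          # regress the first coordinate on the past
lam = 4.0 * np.sqrt(np.log(M) / T)           # assumed tuning constant
a_hat = lasso_ista(design, response, lam)
h_hat = a_hat - A_star[0]                    # error vector for the first row
print(np.linalg.norm(h_hat, 1), np.linalg.norm(h_hat, 2), h_hat @ (design.T @ design / T) @ h_hat)
```

The three printed numbers correspond to $\|\hat{h}\|_{1}$ , $\|\hat{h}\|_{2}$ and $\hat{h}^{\top}H\hat{h}$ , the quantities bounded in this proof. The sketch only illustrates their definitions on one simulated path and does not verify the stated rates.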

(2)

$\widehat{A}=\widehat{A}^{(D)}$ .

Here we adopt the standard proof framework for Dantzig selector. By (26),

[TABLE]

By Lemma 5.1, when $T\geq C\log M$ , with probability at least $1-c_{1}\exp\{-c_{2}\log M\}$ ,

[TABLE]

which implies

[TABLE]

Meanwhile, by (55),

[TABLE]

Since $\hat{h}\in\mathcal{C}(S,1)$ and $|S|=\rho_{1}$ , Lemma A.1 implies that, when $T\geq C\rho\log M$ ,

[TABLE]

with probability at least $1-2\exp\{-cT\}$ . Thus

[TABLE]

which implies

[TABLE]

with probability at least $1-c_{1}\exp\{-c_{2}\log M\}$ .

Therefore, after taking a union bound over $m=1,\cdots,k$ , the proof is complete. ∎

Proof of Lemma 3.2.

Without loss of generality, we consider the estimation of $(w_{1}^{*})_{\cdot,1}$ and then take a union bound. Let $v^{*}=(w_{1}^{*})_{\cdot,1}$ , $\hat{v}=(\hat{w}_{1})_{\cdot,1}$ , $\hat{h}=\hat{v}-v^{*}\in\mathbb{R}^{M-d_{1}}$ and $S=\text{supp}(v^{*})$ . Then we prove upper bounds for $\|\hat{h}\|_{1}$ and $\hat{h}^{\top}H_{D_{1}^{c},D_{1}^{c}}\hat{h}$ with high probability under two cases.

(1)

$\hat{w}_{m}=\hat{w}_{m}^{(L)}$ .

Looking into the definition (27) of $\hat{w}_{1}$ , it is clear that the optimization can be viewed as $d_{1}$ separate optimization problems, in terms of each column of $\hat{w}_{1}$ . Thus

[TABLE]

The following proof is almost identical to the proof of Lemma 3.1 under $\widehat{A}=\widehat{A}^{(L)}$ , apart from some differences in notation and in which lemmas are applied. One can show that,

[TABLE]

Rearranging the inequality gives us

[TABLE]

By Lemma 5.6, with probability at least $1-c_{1}\exp\{-c_{2}\log M\}$ ,

[TABLE]

which implies,

[TABLE]

Let $\tilde{h}\in\mathbb{R}^{M}$ be defined as the following:

[TABLE]

By Lemma A.1, when $T\geq Cs\log M$ , with probability at least $1-2\exp\{-cT\}$ ,

[TABLE]

which implies

[TABLE]

and

[TABLE]

with probability at least $1-c_{1}\exp\{-c_{2}\log M\}$ .

(2)

$\hat{w}_{m}=\hat{w}_{m}^{(D)}$ .

By (28),

[TABLE]

This proof is also very similar to the proof of Lemma 3.1 in the case where $\widehat{A}=\widehat{A}^{(D)}$ . By Lemma 5.6,

[TABLE]

with probability at least $1-c_{1}\exp\{-c_{2}\log M\}$ . Thus,

[TABLE]

Meanwhile, by (58),

[TABLE]

which further implies

[TABLE]

Recall the definition of $\tilde{h}$ in (57). Then by Lemma A.1, (59) and (57), when $T\geq Cs\log M$ ,

[TABLE]

which implies

[TABLE]

and

[TABLE]

with probability at least $1-c_{1}\exp\{-c_{2}\log M\}$ .

Since

[TABLE]

and

[TABLE]

taking a union bound over $\{\hat{w}_{m}:m=1,\cdots,k\}$ and all columns of $\hat{w}_{m}$ , the proof is complete. ∎

Proof of Lemma 3.3.

The following established result can be applied here:

Lemma A.2.

For any invertible matrix $B$ , if $B+\Delta$ is also invertible, then

[TABLE]

Since $\left\|I\right\|_{2}=1$ , one can show that for $1\leq m\leq k$ ,

[TABLE]

where $\Delta=(\Upsilon^{(m)})^{-\frac{1}{2}}\widehat{\Upsilon^{(m)}}(\Upsilon^{(m)})^{-\frac{1}{2}}-I$ . Due to (42),

[TABLE]

In the following we bound $\left\|\widehat{\Upsilon^{(m)}}-\Upsilon^{(m)}\right\|_{\infty}$ . Write $\widehat{\Upsilon^{(m)}}-\Upsilon^{(m)}$ as

[TABLE]

where $W_{m}^{*}$ is defined as in (48). Actually,

[TABLE]

which is the maximum over deviations of some quadratic forms from their expectations. Lemma 5.2 provides a bound for the quadratic form $\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{X}_{t}^{\top}B\mathcal{X}_{t}$ , with $B\in\mathbb{R}^{M\times M}$ being any symmetric matrix.

By Lemma 5.2, we only need to bound the trace norm and operator norm of

[TABLE]

The following lemma establishes the relationship between $\|\cdot\|_{{\rm tr}}$ and $\|\cdot\|_{2}$ for symmetric matrices.

Lemma A.3.

For any symmetric matrix $U$ of rank $r$ , $\|U\|_{{\rm tr}}\leq r\|U\|_{2}$ .
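Lemma A.3 follows from the fact that the trace norm of a symmetric matrix is the sum of the absolute values of its eigenvalues, of which at most $r$ are nonzero, each bounded by $\|U\|_{2}$ . As a quick sanity check, the sketch below evaluates both norms for a symmetrized rank-2 matrix of the same form as the one used next in the proof. The random vectors `a` and `b` are stand-ins chosen for illustration, not rows of $W_{m}^{*}$ .

```python
import numpy as np

def trace_norm(U):
    # Trace (nuclear) norm: sum of singular values.
    return np.linalg.svd(U, compute_uv=False).sum()

def operator_norm(U):
    # Spectral norm: largest singular value.
    return np.linalg.norm(U, 2)

rng = np.random.default_rng(0)
M = 8
a = rng.standard_normal(M)   # illustrative stand-in for one row vector
b = rng.standard_normal(M)   # illustrative stand-in for another row vector

# Symmetrized outer product, of rank at most 2.
U = 0.5 * (np.outer(a, b) + np.outer(b, a))
r = np.linalg.matrix_rank(U)
tr, op = trace_norm(U), operator_norm(U)
print(r, tr, op, tr <= r * op)
```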

Since $\frac{1}{2}\left((W_{m}^{*})_{i\cdot}^{\top}(W_{m}^{*})_{j\cdot}+(W_{m}^{*})_{j\cdot}^{\top}(W_{m}^{*})_{i\cdot}\right)$ is of rank 2,

[TABLE]

Meanwhile, similarly to (49), we bound $\max_{i}\|(W_{m}^{*})_{i\cdot}\|_{2}^{2}$ by

[TABLE]

where the second inequality holds because $\left\|\Upsilon_{\cdot,i}\right\|_{2}^{2}=(\Upsilon^{2})_{ii}\leq\Lambda_{\max}(\Upsilon^{2})\leq\Lambda_{\max}(\Upsilon)^{2}$ . Thus, both the trace norm and $\ell_{2}$ norm of $\frac{1}{2}\left(W_{m,i\cdot}^{*\top}W_{m,j\cdot}^{*}+W_{m,j\cdot}^{*\top}W_{m,i\cdot}^{*}\right)$ can be bounded by a constant, and applying Lemma 5.2 gives us

[TABLE]

Meanwhile, by Lemma 5.6 and Assumption 3.2, with probability at least $1-c_{1}\exp\{-c_{2}\log M\}$ ,

[TABLE]

and

[TABLE]

Here the second line holds because $H_{D_{m}^{c},D_{m}^{c}}=\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{X}_{t,D_{m}^{c}}\mathcal{X}_{t,D_{m}^{c}}^{\top}$ is symmetric and positive semi-definite, so we can apply the Cauchy-Schwarz inequality. When $T\geq Cs^{2}\log M$ ,

[TABLE]

which implies

[TABLE]

Therefore, taking a union bound over $1\leq m\leq k$ , with probability at least $1-c_{1}\exp\{-c_{2}\log M\}$ ,

[TABLE]

when $T\geq Cs^{2}\log M$ . ∎

Appendix B Proof of Theorem 3.3 and Theorem 3.4

Proof of Theorem 3.3.

Now we consider model (3), with unknown $\sigma^{*2}=\text{Var}(\epsilon_{ti})\geq\sigma_{0}^{2}$ . Under this model, we use the notation $\widehat{U}_{T}$ for the quantity defined in the following:

[TABLE]

As explained in Section 3.4, $\widehat{U}_{T}$ satisfies Theorems 3.1 and 3.2 under each corresponding condition. We show in the following that we only need to control the estimation error of $\hat{\sigma}^{2}$ . Note that for any $0<\delta<1$ ,

[TABLE]

and

[TABLE]

For any distribution function $F(x)$ ,

[TABLE]

Recall that Theorems 3.1 and 3.2 establish bounds for $\mathbb{P}\left(\widehat{U}_{T}\leq x\right)-F_{d}(x)$ under $\mathcal{H}_{0}$ , or under $\mathcal{H}_{A}$ with $\phi>\frac{1}{2}$ , for $\mathbb{P}\left(\widehat{U}_{T}\leq x\right)-F_{d,\|\widetilde{\Delta}\|_{2}^{2}}(x)$ when $\phi=\frac{1}{2}$ , and for $\mathbb{P}\left(\widehat{U}_{T}\leq x\right)$ when $0<\phi<\frac{1}{2}$ . Thus we only need to bound $\mathbb{P}\left(\hat{\sigma}^{2}<\frac{\sigma^{*2}}{1+\delta}\right)$ , $\mathbb{P}\left(\hat{\sigma}^{2}>\frac{\sigma^{*2}}{1-\delta}\right)$ and $\sup_{y}\left|F(y)-F(y(1-\delta))\right|$ with $F(x)=F_{d}(x)$ or $F(x)=F_{d,\|\widetilde{\Delta}\|_{2}^{2}}(x)$ . Since $0<\delta<1$ ,

[TABLE]

Meanwhile,

[TABLE]

By Assumption 3.1 and Lemma 5.1, with probability at least $1-c_{1}\exp\{-c_{2}\log M\}$ ,

[TABLE]

and

[TABLE]

Also, since $\epsilon_{ti}$ are independent sub-Gaussian random variables with scale factor $C\sigma^{*}$ , the first term can be bounded by the Bernstein-type inequality for sub-exponential random variables (see Proposition 5.16 in Vershynin [2010]):

[TABLE]

Let $\delta=C\sqrt{\frac{\rho\log M}{T}}$ , then

[TABLE]

As for $\sup_{x}F_{d,\|\mu\|_{2}^{2}}(x)-F_{d,\|\mu\|_{2}^{2}}\left(x(1-\delta)\right)$ with any $\mu\in\mathbb{R}^{d}$ satisfying $\|\mu\|_{2}\leq C$ , if $\delta<\frac{1}{2}$ ,

[TABLE]

Here $Z\in\mathbb{R}^{d}$ is a standard Gaussian random vector, the third line holds because the density of $Z$ is $(2\pi)^{-\frac{d}{2}}e^{-\|z\|_{2}^{2}/2}$ , and the fourth line applies the fact that when $0<\delta<\frac{1}{2}$ ,

[TABLE]

Meanwhile, when $\sqrt{x(1-\delta)}<\|\mu\|_{2}$ ,

[TABLE]

and when $\sqrt{x(1-\delta)}\geq\|\mu\|_{2}$ ,

[TABLE]

which implies

[TABLE]
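The role of this step is to show that rescaling the argument of the limiting distribution function by $1-\delta$ perturbs it by at most a constant multiple of $\delta$ . The sketch below illustrates this numerically in the simplest special case, chosen purely for convenience because its distribution function has a closed form: the central $\chi^{2}$ distribution with $d=2$ , for which $F_{2}(x)=1-e^{-x/2}$ . The grid of $x$ values and the list of $\delta$ values are arbitrary illustrative choices.

```python
import numpy as np

def chi2_cdf_d2(x):
    # CDF of the central chi-square distribution with 2 degrees of freedom:
    # F(x) = 1 - exp(-x / 2) for x >= 0, and 0 otherwise.
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, 1.0 - np.exp(-np.maximum(x, 0.0) / 2.0), 0.0)

def sup_gap(delta, x_grid):
    # Grid approximation of sup_x |F(x) - F(x (1 - delta))|.
    return np.max(np.abs(chi2_cdf_d2(x_grid) - chi2_cdf_d2(x_grid * (1.0 - delta))))

x_grid = np.linspace(0.0, 60.0, 60001)
deltas = np.array([0.01, 0.05, 0.1, 0.25, 0.5])

# If the gap is O(delta), these ratios stay bounded as delta varies.
ratios = np.array([sup_gap(d, x_grid) / d for d in deltas])
print(ratios)
```

In this special case the ratios remain bounded over the whole range $0<\delta\leq\frac{1}{2}$ , which is the behaviour the general argument establishes for $F_{d,\|\mu\|_{2}^{2}}$ with bounded $\|\mu\|_{2}$ .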

To see why all the bounds for $\widehat{U}_{T}$ still hold for $\widetilde{U}_{T}$ , note that we only need to add $C\sqrt{\frac{\rho\log M}{T}}+2\exp\left\{-c_{1}\rho M\log M\right\}+c_{2}\exp\{-c_{3}\log M\}$ to the bounds under $\mathcal{H}_{0}$ , and under $\mathcal{H}_{A}$ when $\phi\geq\frac{1}{2}$ , which only changes the constant factors of the previous bounds. For the bound under $\mathcal{H}_{A}$ when $0<\phi<\frac{1}{2}$ , we substitute $x$ by $\frac{x}{1-\delta}$ with $\delta=C\sqrt{\frac{\log M}{T}}$ , and add $2\exp\left\{-c_{1}\rho M\log M\right\}+c_{2}\exp\{-c_{3}\log M\}$ , which only changes the constant factors as well. Therefore, all the conclusions for $\widehat{U}_{T}$ in Theorems 3.1 and 3.2 still hold for $\widetilde{U}_{T}$ under each corresponding condition. ∎

Proof of Theorem 3.4.

First we show the connection between $R_{T}$ and $\widetilde{U}_{T}$ . Note that

[TABLE]

which implies

[TABLE]

Thus

[TABLE]

and the only difference between $R_{T}$ and $\widetilde{U}_{T}$ is that we substitute $\left(\widehat{\Upsilon^{(m)}}\right)^{-1}$ by $\left(\widetilde{\Upsilon^{(m)}}^{\top}\right)^{-1}\widehat{\Upsilon^{(m)}}\left(\widetilde{\Upsilon^{(m)}}\right)^{-1}$ . We only need to prove that $\left(\widetilde{\Upsilon^{(m)}}^{\top}\right)^{-1}\widehat{\Upsilon^{(m)}}\left(\widetilde{\Upsilon^{(m)}}\right)^{-1}$ satisfies Assumption 3.3. The argument is very similar to the proof of Lemma 3.3, but we need to bound $\left\|\widetilde{\Upsilon^{(m)}}\left(\widehat{\Upsilon^{(m)}}\right)^{-1}\widetilde{\Upsilon^{(m)}}^{\top}-\Upsilon^{(m)}\right\|_{\infty}$ instead of $\left\|\widehat{\Upsilon^{(m)}}-\Upsilon^{(m)}\right\|_{\infty}$ here.

Let $E=\widetilde{\Upsilon^{(m)}}-\widehat{\Upsilon^{(m)}}$ , then

[TABLE]

Recall that when proving Lemma 3.3, we already upper bounded $\left\|\widehat{\Upsilon^{(m)}}-\Upsilon^{(m)}\right\|_{\infty}$ by $C\sqrt{\frac{\log M}{T}}$ with probability at least $1-c_{1}\exp\{-c_{2}\log M\}$ . Thus for any vector $u\in\mathbb{R}^{d_{m}}$ such that $\|u\|_{2}=1$ ,

[TABLE]

which implies $\Lambda_{\max}\left(\left(\widehat{\Upsilon^{(m)}}\right)^{-1}\right)\leq C$ , and $\left\|E\left(\widehat{\Upsilon^{(m)}}\right)^{-1}E^{\top}\right\|_{\infty}\leq Cd_{m}\|E\|_{\infty}$ . We bound $\|E\|_{\infty}$ in the following. One can show that

[TABLE]

Applying (42), (62), Lemma 5.7, we have

[TABLE]

Thus, with Lemma 5.6, Assumption 3.2, and (63), we show that with probability at least $1-c_{1}\exp\{-c_{2}\log M\}$ ,

[TABLE]

Therefore, using the same arguments as in the proof of Lemma 3.3,

[TABLE]

By Lemma A.2,

[TABLE]

∎

Appendix C Proof of Lemmas in Section 5

Proof of Lemma 5.3.

Let

[TABLE]

Define the filtration $\mathcal{F}_{T,t}=\sigma(X_{-p+1},X_{-p+2},\cdots,X_{t+1})$ , then $(\xi_{T,t},\mathcal{F}_{T,t})_{0\leq t\leq T-1}$ is a martingale difference sequence, and $V_{T}=\sum_{t=0}^{T-1}\xi_{T,t}$ . To bound the convergence rate, we are going to use a modified version of Lemma 4 in Grama and Haeusler (2006).

Lemma C.1.

Let $(\xi_{ni},\mathcal{F}_{ni})_{0\leq i\leq n}$ be a martingale difference sequence taking values in $\mathbb{R}^{d}$ . Let $X_{k}^{n}=\sum_{i=1}^{k}\xi_{ni}$ , and $\left\langle X^{n}\right\rangle_{k}=\sum_{i=1}^{k}a_{ni}\triangleq\sum_{i=1}^{k}\mathbb{E}(\xi_{ni}\xi_{ni}^{\top}|\mathcal{F}_{n,i-1})$ . Define $R_{\delta}^{n,d}=L_{\delta}^{n,d}+N_{\delta}^{n,d}$ ,

[TABLE]

Then $\forall\mu\in\mathbb{R}^{d},r\geq 0,0<\delta\leq\frac{1}{2}$ , when $R_{\delta}^{n,d}\leq 1$ ,

[TABLE]

where $Z_{d\times 1}\sim\mathcal{N}(0,I)$ , $C(\|\mu\|_{2},d,\delta)$ is non-decreasing as $\|\mu\|_{2}$ increases.

By Lemma C.1, to bound $\sup_{x>0}\left|\mathbb{P}(\|V_{T}+\mu\|_{2}^{2}\leq x)-F_{d,\|\mu\|_{2}^{2}}(x)\right|$ , we only need to bound $R_{\delta}^{T,d}=L_{\delta}^{T,d}+N_{\delta}^{T,d}$ .

[TABLE]

Here the second line is due to $\Lambda_{\min}(\Upsilon^{(m)})\geq 1$ , and the third line holds because $f(x)=x^{1+\delta}$ is a convex function. More specifically,

[TABLE]

As for the last line, since $\epsilon_{t,m}$ is sub-Gaussian with parameter $\tau$ , $\mathbb{E}|\epsilon_{t,m}|^{2+2\delta}\leq C(\delta)$ . Note that $d,\beta,\tau$ are all viewed as constants here. Due to the sub-Gaussianity of the $\epsilon_{t,i}$ ’s, we have the following lemma.

Lemma C.2.

[TABLE]

Therefore,

[TABLE]

which implies

[TABLE]

As for $N_{\delta}^{T,d}$ , since

[TABLE]

where $B_{m}=W_{m}^{*}\left(\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{X}_{t}\mathcal{X}_{t}^{\top}-\Upsilon\right)W_{m}^{*\top}$ ,

[TABLE]

where the second line holds because $(\Upsilon^{(m)})^{-\frac{1}{2}}B_{m}(\Upsilon^{(m)})^{-\frac{1}{2}}$ is of rank at most $d_{m}$ , so we can apply Lemma A.3; the last line is due to

[TABLE]

Since

[TABLE]

by Lemma 5.2, we only need to bound the operator norm and trace norm of

[TABLE]

By (61) and (62), we have the following:

[TABLE]

Therefore, applying Lemma 5.2 leads us to

[TABLE]

which implies

[TABLE]

Thus,

[TABLE]

By Lemma C.1, for any $x\geq 0$ , $\mu\in\mathbb{R}^{d}$ , and $0<\delta\leq\frac{1}{2}$ , when $T>C(\delta)$ ,

[TABLE]

The best rate is achieved when $\delta=\frac{1}{2}$ , and thus when $T>C$ ,

[TABLE]

∎

Proof of Lemma 42.

We prove the lower and upper bounds for the eigenvalues of $\Upsilon$ by establishing a connection between our stability condition (13) and another spectral density based condition proposed in Basu et al. [2015]. First we introduce the following lemma, which is a direct consequence of Proposition 2.3 and (2.6) in Basu et al. [2015] under our setting.

Lemma C.3.

Under the model specified in (3) with independent noise $\epsilon_{ti}$ of unit variance, the eigenvalues of $\Upsilon$ can be bounded as follows:

[TABLE]

where $\mu_{\min}(\mathcal{A})=\min_{|z|=1}\Lambda_{\min}\left(\mathcal{A}^{*}(z)\mathcal{A}(z)\right)$ , and $\mu_{\max}(\mathcal{A})=\max_{|z|=1}\Lambda_{\max}\left(\mathcal{A}^{*}(z)\mathcal{A}(z)\right)$ .
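The quantities $\mu_{\min}(\mathcal{A})$ and $\mu_{\max}(\mathcal{A})$ can be approximated numerically by scanning the unit circle. The sketch below does this for a small VAR(1) example and compares the eigenvalues of the stationary covariance with $1/\mu_{\max}(\mathcal{A})$ and $1/\mu_{\min}(\mathcal{A})$ . Several ingredients are assumptions made for illustration: the convention $\mathcal{A}(z)=I-Az$ for $p=1$ , the form $1/\mu_{\max}(\mathcal{A})\leq\Lambda_{\min}(\Upsilon)\leq\Lambda_{\max}(\Upsilon)\leq 1/\mu_{\min}(\mathcal{A})$ for the bound under unit noise variance, the particular matrix `A`, and the truncation of the series for $\Upsilon$ .

```python
import numpy as np

def stationary_cov_var1(A, n_terms=200):
    # Upsilon = sum_{j >= 0} A^j (A^j)^T for X_{t+1} = A X_t + eps_t
    # with identity noise covariance (series truncated at n_terms).
    M = A.shape[0]
    power = np.eye(M)
    cov = np.zeros((M, M))
    for _ in range(n_terms):
        cov += power @ power.T
        power = A @ power
    return cov

def mu_min_max(A, n_grid=2000):
    # Grid approximation over |z| = 1 of the extreme eigenvalues of
    # A(z)^* A(z), assuming the convention A(z) = I - A z.
    M = A.shape[0]
    lo, hi = np.inf, 0.0
    for theta in np.linspace(0.0, 2.0 * np.pi, n_grid, endpoint=False):
        Az = np.eye(M) - A * np.exp(1j * theta)
        s = np.linalg.svd(Az, compute_uv=False)   # singular values, descending
        lo = min(lo, s[-1] ** 2)
        hi = max(hi, s[0] ** 2)
    return lo, hi

A = np.array([[0.5, 0.2],
              [0.0, 0.3]])                        # illustrative stable transition matrix
Upsilon = stationary_cov_var1(A)
eig = np.linalg.eigvalsh(Upsilon)                 # ascending eigenvalues
mu_min, mu_max = mu_min_max(A)
print(1.0 / mu_max, eig[0], eig[-1], 1.0 / mu_min)
```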

By Lemma C.3, we only need to prove that condition (13) implies a lower bound for $\mu_{\min}(\mathcal{A})$ and upper bound for $\mu_{\max}(\mathcal{A})$ . First note that

[TABLE]

where the last equality holds because $\left\|\left(\mathcal{A}^{*}(z)\right)^{-1}\right\|_{2}=\sup_{v}\frac{\left\|\left(\mathcal{A}^{*}(z)\right)^{-1}v\right\|_{2}}{\|v\|_{2}}$ . Meanwhile, for any $|z|=1$ ,

[TABLE]

where we apply condition (13) in the last inequality. Thus $\mu_{\min}(\mathcal{A})\geq\beta^{-2}$ .

To bound $\mu_{\max}(\mathcal{A})$ , we start by bounding $\|A_{n}\|_{2}$ for $0\leq n\leq p$ . Here we define $A_{0}=I_{M\times M}$ , and $A_{n}=0$ for all $n>p$ . Since

[TABLE]

one can show that $\Psi_{0}=I$ , and $\sum_{i=0}^{n}\Psi_{i}A_{n-i}=0$ for $n\geq 1$ . Thus

[TABLE]

and $\|A_{n}\|_{2}\leq\sum_{i=1}^{n}\|\Psi_{i}\|_{2}\|A_{n-i}\|_{2}$ . We have the following claim:

[TABLE]

This can be proved by induction. It is clear that $\|A_{0}\|_{2}=\|I\|_{2}=\beta^{0}$ , and if (64) holds for $0\leq n=k\leq p$ ,

[TABLE]

Therefore, $\mu_{\max}(\mathcal{A})$ can be bounded in the following:

[TABLE]

With Lemma C.3, we conclude that

[TABLE]

where $C_{1}(\beta)=\left(\frac{1-\beta}{1-\beta^{p+1}}\right)^{2}1(\beta>1)+(p+1)^{-2}1(0\leq\beta\leq 1)$ , and $C_{2}(\beta)=\beta^{2}$ . ∎

Proof of Lemma 5.1.

Recall that $X_{t}=\sum_{j=0}^{\infty}\Psi_{j}\epsilon_{t-j-1}$ . Define $\Psi_{j}^{(p)}\in\mathbb{R}^{pM\times M}$ as the following:

[TABLE]

then we can also write $\mathcal{X}_{t}$ as an infinite sum $\mathcal{X}_{t}=\sum_{j=0}^{\infty}\Psi_{j}^{(p)}\epsilon_{t-j-1}$ . Without loss of generality, we consider the first entry of $\frac{1}{T}\sum_{t=0}^{T-1}\epsilon_{t}\mathcal{X}_{t}^{\top}$ :

[TABLE]

In the following, we tackle the infinite sum in (66) by focusing our analysis on a finite sum and letting the remainder converge to 0. More precisely, for any positive integer $m$ , let

[TABLE]

and $e^{(t)}\in\mathbb{R}^{(T+m+1)M}$ satisfying $e^{(t)}_{i}=\mathbf{1}(i=(t+m)M+1)$ , then we have

[TABLE]

We will let $m$ be sufficiently large later in the argument. The following arguments are divided into two parts: bounding $E_{1}$ and $E_{2}$ .

(1)

Bounding $E_{1}$

Since all entries of $\tilde{\epsilon}$ are independent sub-Gaussian with constant parameter, we can apply the following Hanson-Wright inequality:

Lemma C.4.

Let $X=(X_{1},\dots,X_{n})\in\mathbb{R}^{n}$ be a random vector with independent components $X_{i}$ which satisfy $\mathbb{E}(X_{i})=0$ and $\|X_{i}\|_{\psi_{2}}\leq K$ . Let $A$ be an $n\times n$ matrix. Then, for every $t\geq 0$ ,

[TABLE]

This lemma is a result in Rudelson et al. [2013]. By Lemma C.4, we only need to bound the norms of $\frac{1}{T}\sum_{t=0}^{T-1}e^{(t)}\eta^{(t)^{\top}}$ .

First note that

[TABLE]

For any $u,v\in\mathbb{R}^{(T+m+1)M}$ with unit $\ell_{2}$ norm, one can show that

[TABLE]

where $v^{(i)}=(v_{(i-1)M+1},\dots,v_{iM})^{\top}$ , $\alpha_{i}=\left\|\Psi_{i}\right\|_{2}\geq\left\|(\Psi_{i})_{1\cdot}\right\|_{2}$ , and $\Gamma\in\mathbb{R}^{T\times(T+m)}$ is a matrix with each entry $\Gamma_{ij}=\alpha_{m+i-j}1(m+i-j\geq 0)$ . Since $\Gamma$ is a Toeplitz matrix, we will use the following lemma to bound its $\ell_{2}$ norm.

Lemma C.5.

Let $f(\lambda)$ be a Fourier series defined as $f(\lambda)=\sum_{k=-\infty}^{\infty}t_{k}\exp\{ik\lambda\}$ , with $\sum_{k=-\infty}^{\infty}|t_{k}|<\infty$ . We define a sequence of Toeplitz matrices $T_{n}$ with $(T_{n})_{i,j}=t_{i-j}$ , then the operator norm of $T_{n}$ is bounded by

[TABLE]

where ess $\sup f$ denotes the essential supremum of $f$ .

This is actually Lemma 4.1 in Gray et al. [2006], and we directly apply it here. By Lemma C.5,

[TABLE]

Thus $\left\|\frac{1}{T}\sum_{t=0}^{T-1}e^{(t)}\eta^{(t)^{\top}}\right\|_{2}\leq\frac{\beta}{T}$ . While for the Frobenius norm, we have

[TABLE]

Therefore, by Lemma C.4, for any $\delta>0$ ,

[TABLE]
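The Toeplitz step used in part (1) can be checked numerically. The sketch below builds a matrix with the same structure as $\Gamma$ , namely $\Gamma_{ij}=\alpha_{m+i-j}1(m+i-j\geq 0)$ , and compares its operator norm with the sum of the coefficients, which is the value of $\sup|f|$ when all coefficients are non-negative. The geometric sequence used for $\alpha_{k}$ , its truncation at 40 terms, the sizes $T=50$ and $m=10$ , and the zero-based indexing are all illustrative assumptions.

```python
import numpy as np

def banded_toeplitz(alpha, T, m):
    # Gamma[i, j] = alpha[m + i - j] when 0 <= m + i - j < len(alpha), else 0.
    # Indices are zero-based here; the Toeplitz structure matches the text.
    K = len(alpha)
    i = np.arange(T)[:, None]
    j = np.arange(T + m)[None, :]
    lag = m + i - j
    valid = (lag >= 0) & (lag < K)
    return np.where(valid, alpha[np.clip(lag, 0, K - 1)], 0.0)

alpha = 0.6 ** np.arange(40)          # hypothetical summable non-negative coefficients
Gamma = banded_toeplitz(alpha, T=50, m=10)

op_norm = np.linalg.norm(Gamma, 2)    # spectral norm of Gamma
l1_bound = alpha.sum()                # sum of coefficients, equal to f(0) here
print(op_norm, l1_bound)
```

The comparison with the coefficient sum holds for any such matrix by the Schur test, since every row sum and every column sum of $\Gamma$ is at most $\sum_{k}\alpha_{k}$ .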

(2)

Bounding $E_{2}$

First note that

[TABLE]

Recall the definition of $\|\cdot\|_{\psi_{1}}$ and $\|\cdot\|_{\psi_{2}}$ in the proof of Lemma C.2. Since $\|\epsilon_{t,1}^{2}\|_{\psi_{1}}\leq 2\|\epsilon_{t,1}\|_{\psi_{2}}^{2}\leq 2\tau^{2}$ ,

[TABLE]

by the Bernstein-type inequality for sub-exponential random variables (see Proposition 5.16 in Vershynin [2010]).

Now we bound the second term $\frac{1}{2T}\sum_{t=0}^{T-1}\left(\sum_{j=t+m+1}^{\infty}(\Psi_{j})_{1\cdot}\epsilon_{t-j-1}\right)^{2}$ . Since

[TABLE]

one can show that

[TABLE]

where we apply the fact that $\left\|\left\|\epsilon_{t}\right\|_{2}\right\|_{\psi_{2}}\leq C\sqrt{M}\tau$ , which is shown in the proof of Lemma C.2. Thus we have

[TABLE]

due to the tail bound of sub-exponential r.v. (also see Vershynin [2010]). Since

[TABLE]

Let $m$ be sufficiently large such that $\left(\sum_{j=t+m+1}^{\infty}\alpha_{j}\right)^{2}\leq\frac{1}{MT}$ , then we arrive at the following

[TABLE]

Let $\delta=C\sqrt{\frac{\log M}{T}}$ and take a union bound over the $pM^{2}$ entries of $\frac{1}{T}\sum_{t=0}^{T-1}\epsilon_{t}\mathcal{X}_{t}^{\top}$ ; the conclusion follows.
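As a rough illustration of the quantity controlled in this proof, the following sketch simulates a VAR(1) process with bounded (hence sub-Gaussian) noise and reports the largest entry of $\frac{1}{T}\sum_{t}\epsilon_{t}X_{t}^{\top}$ in absolute value for two sample sizes. The transition matrix $0.5I$ , the uniform noise distribution, the burn-in length and the two values of $T$ are assumptions chosen for illustration only, and the helper name `max_cross_term` is hypothetical.

```python
import numpy as np

def max_cross_term(A, T, rng, burn_in=200):
    # Simulates X_{t+1} = A X_t + eps_t and returns
    # max_{i,j} | (1/T) sum_t eps_{t,i} X_{t,j} |.
    M = A.shape[0]
    half_width = np.sqrt(3.0)        # uniform on [-sqrt(3), sqrt(3)] has unit variance
    x = np.zeros(M)
    for _ in range(burn_in):         # discard the initial transient
        x = A @ x + rng.uniform(-half_width, half_width, size=M)
    acc = np.zeros((M, M))
    for _ in range(T):
        eps = rng.uniform(-half_width, half_width, size=M)
        acc += np.outer(eps, x)      # eps_t X_t^T, where eps_t is independent of X_t
        x = A @ x + eps              # advance the process by one step
    return np.abs(acc / T).max()

rng = np.random.default_rng(1)
M = 10
A = 0.5 * np.eye(M)                  # illustrative stable transition matrix
results = {T: max_cross_term(A, T, rng) for T in (200, 20000)}
for T, value in results.items():
    print(T, value, np.sqrt(np.log(M) / T))
```

The last printed column is the reference rate $\sqrt{\log M/T}$ . A single simulated path of this kind only suggests the scaling and is not a substitute for the concentration argument above.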

∎

Proof of Lemma 5.6.

Without loss of generality, consider

[TABLE]

for any $1\leq i\leq d_{m}$ , and $j\in D_{m}^{c}$ . Similarly to the proof of Lemma 5.6, we can write it as a quadratic form

[TABLE]

where $W_{m}^{*}$ is defined as in (48). Since $\frac{1}{2}\left((W_{m}^{*})_{i\cdot}^{\top}e_{j}^{\top}+e_{j}(W_{m}^{*})_{i\cdot}\right)$ is of rank 2, and we have bounded $\left\|(W_{m}^{*})_{i\cdot}\right\|_{2}$ in (62), applying Lemma A.3 leads to

[TABLE]

Applying Lemma 5.2, and taking a union bound over all entries of

[TABLE]

the conclusion follows. ∎

Proof of Lemma 5.7.

Similarly to the proof of Lemma 5.1, we consider $\left|\frac{1}{T}\sum_{t=0}^{T-1}X_{ti}X_{tj}-\Upsilon_{ij}\right|$ . Since

[TABLE]

by Lemma 5.2, we need to bound norms of $\frac{1}{2}(e_{i}e_{j}^{\top}+e_{j}e_{i}^{\top})$ , which is of rank at most 2. One can show that

[TABLE]

with Lemma A.3. Therefore, by taking a union bound, it is clear that

[TABLE]

with probability at least $1-c_{1}\exp\{-c_{2}\log M\}$ . ∎

Appendix D Proof of Lemmas in Appendix A and Appendix C

Proof of Lemma C.1.

Here we adopt the proof framework for Lemma 4 in Grama and Haeusler [2006], with some small adjustments. First we construct a new martingale difference sequence $(m_{nk},\mathcal{G}_{nk})_{1\leq k\leq n+1}$ whose covariances sum to $I_{d\times d}$ . Random projections are used in the construction. The following lemma on random projections is stated as Lemma 3 in Grama and Haeusler [2006].

Lemma D.1.

*Let $V$ and $a_{1},\cdots,a_{n}$ be positive semi-definite $d\times d$ matrices. Set $A_{k}=a_{1}+\cdots+a_{k}$ , for $k=1,\cdots,n$ . Then there exists a sequence of integers $1\leq\tau_{1}\leq\cdots\leq\tau_{d}\leq n$ and a corresponding sequence $\mathcal{S}_{1}\supseteq\cdots\supseteq\mathcal{S}_{d}$ of subspaces of $\mathbb{R}^{d}$ such that, with $P_{k}$ defined as the projection matrix of subspace $\mathcal{S}_{i}$ , for $\tau_{i}\leq k<\tau_{i+1}$ (where $\tau_{0}=1,\tau_{d+1}=n+1,\mathcal{S}_{0}=\mathbb{R}^{d}$ ), the following statements hold true for $k=1,\cdots,n$ :

$(a)V-\widehat{A}_{k}$ is non-negative definite, where $\widehat{A}_{k}=P_{1}a_{1}P_{1}+\cdots+P_{k}a_{k}P_{k}$ ;

$(b)x^{\top}(\widehat{A}_{k}-A_{k})x=0$ , for all $x\in\Pi_{k}\triangleq\{P_{k}x:x\in\mathbb{R}^{d}\}$ ;

$(c)x^{\top}(\widehat{A}_{k}-V+\alpha_{k}I)x\geq 0$ for all $x\in\Pi_{k}^{\top}$ , where $\alpha_{k}=\max\{\|a_{\tau_{j}}\|_{2}:\tau_{j}\leq k\}$ .

Meanwhile, $P_{k}$ is determined by $a_{1},\cdots,a_{k}$ and $V$ .*

Given this lemma, $m_{nk}$ can be constructed as follows:

Recall that the martingale difference sequence we consider is $(\xi_{nk},\mathcal{F}_{nk})_{1\leq k\leq n+1}$ , and $a_{nk}=\mathbb{E}\left(\xi_{nk}\xi_{nk}^{\top}|\mathcal{F}_{n,k-1}\right)$ . Apply Lemma D.1 with $V=I$ , $a_{k}=a_{nk}$ , and let $\{P_{nk}\}_{k=1}^{n}$ be the corresponding projection matrices. Let $D_{n}=I-\sum_{k=1}^{n}P_{nk}a_{nk}P_{nk}$ , which is non-negative definite. Define

[TABLE]

where

[TABLE]

Since $P_{nk}\in\mathcal{F}_{n,k-1}$ , $m_{nk}\in\mathcal{F}_{nk}$ for $1\leq k\leq n$ . Thus $(m_{nk},\mathcal{G}_{nk})$ is also a martingale difference sequence with $\mathcal{G}_{nk}=\mathcal{F}_{nk}$ , when $1\leq k\leq n$ , and $\mathcal{G}_{n,n+1}=\sigma(\mathcal{F}_{nn},\eta_{n,n+1})$ . Meanwhile,

[TABLE]

This construction is from Grama and Haeusler [2006]. They also prove that, for any $\varepsilon,\delta>0$ ,

[TABLE]

Since

[TABLE]

for any $\mu\in\mathbb{R}^{d},r\geq 0,\varepsilon>0$ , we need to bound

[TABLE]

and

[TABLE]

The following functions are defined as a smooth relaxation of the indicator function. Let

[TABLE]

where $C$ is a normalizing constant s.t. $\int\phi(t)dt=1$ . Then we have $f_{*}(z)=0$ if $z\leq 0$ , $0\leq f_{*}(z)\leq 1$ if $0\leq z\leq 1$ , and $f_{*}(z)=1$ if $z\geq 1$ . $f_{*}(z)$ is infinitely many times differentiable on $\mathbb{R}$ , and since $f_{*}(z)$ is constant when $z\leq 0$ or $z\geq 1$ , for any fixed order, the derivative of $f_{*}(z)$ is bounded. For any $z\in\mathbb{R}^{d}$ , let

[TABLE]

where

[TABLE]

In the following proof, we will denote $f_{l,\mu,r,\varepsilon}(z)$ and $g_{l,\mu,r,\varepsilon}(z)$ as $f_{l}(z)$ and $g_{l}(z)$ , $l=1,2$ for brevity. Therefore,

[TABLE]

Thus,

[TABLE]

Actually, when $r\leq 3\varepsilon$ , the right hand side of (68) can be substituted by

[TABLE]

and

[TABLE]

To bound $\mathbb{E}(f_{l}(M_{n+1}^{n})-f_{l}(Z))$ , we will use the following lemma.

Lemma D.2.

For $f_{l}(\cdot)$ defined as in (70),

[TABLE]

for any $k\in\mathbb{Z}^{*}$ , $y,z\in\mathbb{R}^{d}$ , when $l=1$ , or when $l=2$ and $r>3\varepsilon$ .

The proof of this lemma is deferred to Appendix E. In the following proof, we will always assume that the condition $l=1$ , or $l=2$ and $r>3\varepsilon$ , holds. Therefore, for any $m\in\mathbb{Z}^{*}$ ,

[TABLE]

where $u=z+t_{1}y$ for some $0\leq t_{1}\leq 1$ . Meanwhile,

[TABLE]

where $v=z+t_{2}y$ for some $0\leq t_{2}\leq 1$ . Thus, for any $\delta>0$ ,

[TABLE]

Let $\tilde{w}_{nk}$ , $1\leq k\leq n$ be i.i.d. standard Gaussian random vectors that are independent of $\mathcal{G}_{n,n+1}$ , $w_{nk}=(b_{nk})^{\frac{1}{2}}\tilde{w}_{nk}$ , for $k=1,\cdots,n+1$ , where $b_{nk}=\mathbb{E}(m_{nk}m_{nk}^{\top}|\mathcal{G}_{n,k-1})$ . Define

[TABLE]

Then $W_{1}^{n}$ follows standard Gaussian distribution. Let $U_{k}^{n}=M_{k-1}^{n}+W_{k+1}^{n}$ , then

[TABLE]

Generally this inequality holds for $\delta\in(0,\frac{1}{2}]$ , since $w_{nk}$ and $m_{nk}$ have the same second order moments, which justifies the fourth line. By the proof of Lemma 4 in Grama and Haeusler [2006],

[TABLE]

thus

[TABLE]

Now we only need to bound $\mathbb{P}\left(\|Z+\mu\|_{2}\in[r-2\varepsilon,r+2\varepsilon]\right)$ and $\mathbb{P}\left(\|Z+\mu\|_{2}\in[0,3\varepsilon)\right)$ . Assume $\varepsilon\leq 1$ , then

[TABLE]

Meanwhile,

[TABLE]

The last line holds because

[TABLE]

when $r\leq 2\varepsilon+\|\mu\|_{2}$ , and

[TABLE]

Here clearly $C(d,\|\mu\|_{2})$ is non-decreasing with respect to $\|\mu\|_{2}$ . Therefore, by (72), (67) and (74), when $R_{\delta}^{n,d}\leq 1$ , for any $\mu\in\mathbb{R}^{d}$ , $r\geq 0$ , $0<\delta\leq\frac{1}{2}$ , with $\varepsilon=(R_{\delta}^{n,d})^{\frac{1}{3+2\delta}}$ ,

[TABLE]

where $C(d,\delta,\|\mu\|_{2})$ is non-decreasing with respect to $\|\mu\|_{2}$ .

∎

Proof of Lemma C.2.

First we introduce the following two norms:

For any random variable $X$ ,

[TABLE]

These two norms are related to sub-exponential and sub-Gaussian random variables, and the following lemma shows the connections between the two norms and the scale factor for sub-Gaussian r.v.

Lemma D.3.

For any sub-Gaussian r.v. $X$ with scale factor $\tau$ , the following hold:

[TABLE]

with some absolute constants $c,C$ , and

[TABLE]

This is an established result in Vershynin [2010]. By Lemma D.3, bounding $\left\|\left\|W_{m}^{*}\mathcal{X}_{t}\right\|_{2}^{2}\right\|_{\psi_{1}}$ would be sufficient, and we start from bounding $\mathbb{E}\left(\exp\left\{\lambda\left(W_{m}^{*}\right)_{i\cdot}\mathcal{X}_{t}\right\}\right)$ for any $\lambda\in\mathbb{R}$ . Recall that $\mathcal{X}_{t}=\sum_{j=0}^{\infty}\Psi_{j}^{(p)}\epsilon_{t-j-1}$ , with $\Psi_{j}^{(p)}$ defined as in (65). We can write

[TABLE]

and

[TABLE]

where $\tilde{\alpha}_{k}$ is defined as $\left\|\Psi_{k}^{(p)}\right\|_{2}$ . The relationship between $\tilde{\alpha}_{k}$ and $\alpha_{k}=\left\|\Psi_{k}\right\|_{2}$ can be established as follows:

[TABLE]

if we define $\alpha_{i}=0$ when $i<0$ . We now prove that $\exp\left\{|\lambda|\sum_{k=0}^{\infty}\left\|(W_{m}^{*})_{i\cdot}\right\|_{2}\tilde{\alpha}_{k}\left\|\epsilon_{t-k}\right\|_{2}\right\}$ is integrable so that we can use the Dominated Convergence Theorem. Since the $\epsilon_{ti}$ ’s are all independent sub-Gaussian random variables with parameter $\tau$ ,

[TABLE]

where the second inequality is due to Minkowski’s inequality. Thus,

[TABLE]

where the first equality is due to the Monotone Convergence Theorem, and the last line is due to (62) and the fact that

[TABLE]

Therefore, by the Dominated Convergence Theorem,

[TABLE]

By Lemma D.3, $\left\|\left(W_{m}^{*}\right)_{i\cdot}\mathcal{X}_{t}\right\|_{\psi_{2}}\leq C$ , and

[TABLE]

Thus

[TABLE]

∎

Proof of Lemma 5.2.

Recall that $\mathcal{X}_{t}=\sum_{j=0}^{\infty}\Psi_{j}^{(p)}\epsilon_{t-j-1}$ , where $\Psi_{j}^{(p)}$ is defined in (65). Similarly to the proof of Lemma 5.1, for any positive integer $m$ , we can write $\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{X}_{t}^{\top}B\mathcal{X}_{t}$ as follows:

[TABLE]

Then we can bound the deviation of each $E_{i}$ from its expectation separately, and $m$ will be chosen to be sufficiently large later.

(1)

Bounding $E_{1}-\mathbb{E}(E_{1})$

Let $\Theta^{(t)}\in\mathbb{R}^{pM\times(T+m)M}$ and $\tilde{\epsilon}\in\mathbb{R}^{(T+m)M}$ be defined as

[TABLE]

Then $E_{1}=\tilde{\epsilon}^{\top}\left(\frac{1}{T}\sum_{t=0}^{T-1}\Theta^{(t)\top}B\Theta^{(t)}\right)\tilde{\epsilon}$ , and by Lemma C.4 we only need to bound the operator norm and Frobenius norm of $\frac{1}{T}\sum_{t=0}^{T-1}\Theta^{(t)\top}B\Theta^{(t)}$ .

i.

Bounding $\left\|\frac{1}{T}\sum_{t=0}^{T-1}\Theta^{(t)\top}B\Theta^{(t)}\right\|_{2}$

For any unit vectors $u,v\in\mathbb{R}^{(T+m)M}$ ,

[TABLE]

where $u^{(i)}=(u_{(i-1)M+1},\dots,u_{iM})$ . Let $\tilde{\alpha}_{i}=\left\|\Psi_{i}^{(p)}\right\|_{2}$ , and $\Gamma\in\mathbb{R}^{(T+m)\times(T+m)}$ be defined as $\Gamma_{ij}=\sum_{k=0}^{\infty}\tilde{\alpha}_{|i-j|+k}\tilde{\alpha}_{k}$ , then

[TABLE]

Thus we only need to bound $\Lambda_{\max}(\Gamma)$ . Applying Lemma C.5, the largest eigenvalue of Toeplitz matrix $\Gamma$ can be bounded by

[TABLE]

where the third inequality is due to the Cauchy-Schwarz inequality. Due to (75), we can further obtain

[TABLE]

and we define $\alpha_{i}=0$ when $i<0$ for convenience. Therefore,

[TABLE]

ii.

Bounding $\left\|\frac{1}{T}\sum_{t=0}^{T-1}\Theta^{(t)\top}B\Theta^{(t)}\right\|_{F}^{2}$

First note that

[TABLE]

and if we write $B=P^{\top}\Lambda P$ with orthogonal $P$ and diagonal $\Lambda$ (since $B$ is symmetric),

[TABLE]

Meanwhile, due to that $\tilde{\alpha}_{i}=\left\|\Psi_{i}^{(p)}\right\|_{2}$ and (75),

[TABLE]

Note that $\sum_{i=0}^{\infty}\left(\sum_{j=0}^{\infty}\alpha_{i+j}^{2}\right)^{\frac{1}{2}}\leq\beta$ ,

[TABLE]

where the fourth line is due to the Cauchy-Schwarz inequality. Therefore,

[TABLE]

Now we apply Lemma C.4, and arrive at

[TABLE]

(2)

Bounding $E_{2}-\mathbb{E}(E_{2})$

We will show that $\left|E_{2}-\mathbb{E}(E_{2})\right|$ vanishes when $m$ is large enough. First we bound $\|E_{2}\|_{\psi_{1}}$ . Since

[TABLE]

by (75) and (76),

[TABLE]

Meanwhile,

[TABLE]

For any $\delta>0$ , let $m$ be sufficiently large such that $\sum_{j=m-p}^{\infty}\alpha_{j}^{2}<\frac{\delta}{2p\|B\|_{{\rm tr}}}$ and $\|E_{2}\|_{\psi_{1}}\leq\frac{C\|B\|_{2}}{T}$ ; then by the tail bound for sub-exponential random variables (see Vershynin [2010]),

[TABLE]

(3)

Bounding $E_{3}-\mathbb{E}(E_{3})$

One can show that

[TABLE]

and

[TABLE]

Thus

[TABLE]

The first line is due to the following fact: for any two sub-Gaussian random variables $X$ and $Y$ , $\left\|XY\right\|_{\psi_{1}}\leq 2\|X\|_{\psi_{2}}\|Y\|_{\psi_{2}}$ . We can prove this as follows:

[TABLE]

where the first line applies the Cauchy-Schwarz inequality. Thus, with large enough $m$ , $\|E_{3}\|_{\psi_{1}}\leq\frac{\|B\|_{2}}{T}$ . Also, $\mathbb{E}(E_{3})=0$ , which implies the same bound for $E_{3}-\mathbb{E}(E_{3})$ as the one for $E_{2}-\mathbb{E}(E_{2})$ :

[TABLE]
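For reference, the product bound $\left\|XY\right\|_{\psi_{1}}\leq 2\|X\|_{\psi_{2}}\|Y\|_{\psi_{2}}$ invoked for $E_{3}$ can be checked in a few lines. The sketch below assumes the moment-based definitions of the Orlicz norms from Vershynin [2010] (stated in the comments); if a different but equivalent definition is in force, the constant may change. The moment index is written $q$ to avoid a clash with the lag order $p$.

```latex
% Assumed definitions (Vershynin [2010]):
%   \|X\|_{\psi_2} = \sup_{q \ge 1} q^{-1/2} (\mathbb{E}|X|^q)^{1/q},
%   \|X\|_{\psi_1} = \sup_{q \ge 1} q^{-1}   (\mathbb{E}|X|^q)^{1/q}.
\begin{align*}
\left(\mathbb{E}|XY|^{q}\right)^{\frac{1}{q}}
  &\leq \left(\mathbb{E}|X|^{2q}\right)^{\frac{1}{2q}}
        \left(\mathbb{E}|Y|^{2q}\right)^{\frac{1}{2q}}
        && \text{(Cauchy-Schwarz)} \\
  &\leq \sqrt{2q}\,\|X\|_{\psi_{2}} \cdot \sqrt{2q}\,\|Y\|_{\psi_{2}}
        && \text{(definition of } \|\cdot\|_{\psi_{2}} \text{ at index } 2q\text{)} \\
  &= 2q\,\|X\|_{\psi_{2}}\|Y\|_{\psi_{2}} .
\end{align*}
% Dividing by q and taking the supremum over q >= 1 gives
%   \|XY\|_{\psi_1} \le 2 \|X\|_{\psi_2} \|Y\|_{\psi_2}.
```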

In conclusion, for any $\delta>0$ , if we choose some $m$ accordingly,

[TABLE]

∎

Proof of Lemma A.1.

Here we apply some results from Basu et al. [2015] with a slight change in notation. These results reduce the original problem to finding an upper bound for $\left|v^{\top}(H-\Upsilon)v\right|$ for any fixed unit vector $v$ . Specifically, the following lemmas are useful:

Lemma D.4.

For any $J\subset\{1,\cdots,pM\}$ , and $\kappa>0$ ,

[TABLE]

where $\mathcal{K}(l)=\{v\in\mathbb{R}^{pM}:\|v\|_{0}\leq l,\|v\|_{2}\leq 1\}$ for any positive integer $l$ .

Lemma D.5.

[TABLE]

Lemma D.6.

Consider a symmetric matrix $D\in\mathbb{R}^{pM\times pM}$ . If for any vector $v\in\mathbb{R}^{pM}$ with $\|v\|_{2}\leq 1$ , and any $\eta\geq 0$ ,

[TABLE]

then for any integer $l\geq 1$ ,

[TABLE]

By Lemma D.4 and Lemma D.5,

[TABLE]

For any unit vector $v\in\mathbb{R}^{pM}$ ,

[TABLE]

Thus $\left|v^{\top}(H-\Upsilon)v\right|$ can be bounded by Lemma 5.2.

[TABLE]

which implies

[TABLE]

By Lemma D.6, when $|J|\log pM\leq C(\eta)T$ ,

[TABLE]

with probability at least $1-c_{1}\exp\{-c_{2}T\min\{\eta,\eta^{2}\}\}$ . Let $\eta=[6(\kappa+2)^{2}]^{-1}\Lambda_{\min}(\Upsilon)\geq C(\kappa,\beta)$ , then

[TABLE]

with probability at least $1-c_{1}\exp\{-c_{2}T\}$ , when $|J|\log pM\leq C(\kappa,\beta)T$ , and $c_{2}$ depends on $\kappa$ and $\beta$ . Here we apply Lemma 5.4 to lower bound the eigenvalues of $\Upsilon$ . ∎

Appendix E Proofs of Lemmas D.2, 2.1, A.2, and A.3

Proof of Lemma D.2.

Recall that $f_{l}(z)=f_{*}(g_{l}(z))$ , with $f_{*}(z)=\int_{-\infty}^{z-\frac{1}{2}}\phi(x)\,dx$ , $g_{1}(z)=\left(\|z+\mu\|_{2}-r-\varepsilon\right)/\varepsilon$ , and $g_{2}(z)=\left(\|z+\mu\|_{2}-r+2\varepsilon\right)/\varepsilon$ . To bound the partial derivatives of the composite function, we apply the following lemma, which is a direct consequence of Propositions 1 and 2 in Hardy [2006].

Lemma E.1.

Suppose a univariate function $f$ and a function $g:\mathbb{R}^{n}\rightarrow\mathbb{R}$ have derivatives and partial derivatives of orders up to $k$ ; then $\forall\{i_{1},\dots,i_{k}\}\subset\{1,\dots,n\}$ ,

[TABLE]

where $\Pi(k)$ is the set of partitions for $\{1,\cdots,k\}$ , and $B\in\pi$ is a block in $\pi$ . Formally,

[TABLE]
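To make the index set $\Pi(k)$ concrete, the following minimal sketch enumerates all set partitions of $\{1,\dots,k\}$. The function name `set_partitions` is introduced here purely for illustration. Each partition contributes one term to the sum in Lemma E.1, so the number of terms is the $k$ th Bell number.

```python
def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (each block a list)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for idx in range(len(partition)):
            yield partition[:idx] + [[first] + partition[idx]] + partition[idx + 1:]
        # ... or open a new singleton block for it.
        yield [[first]] + partition

partitions_of_3 = list(set_partitions([1, 2, 3]))
for pi in partitions_of_3:
    print(pi)

# Number of terms in the k-th order expansion for k = 1, ..., 5.
bell_numbers = [len(list(set_partitions(list(range(1, k + 1))))) for k in range(1, 6)]
print(bell_numbers)
```

For $k=3$ the five partitions are $\{\{1,2,3\}\}$ , the three partitions into a pair and a singleton, and $\{\{1\},\{2\},\{3\}\}$ .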

By Lemma E.1, we can write out the $k$ th order partial derivatives of $f_{l}$ :

[TABLE]

Moreover, we can also write $g_{l}(z)$ as a composite function $\varphi_{l}(\psi(z))$ , with $\varphi_{1}(x)=\frac{\sqrt{x}-r-\varepsilon}{\varepsilon}$ , $\varphi_{2}(x)=\frac{\sqrt{x}-r+2\varepsilon}{\varepsilon}$ , and $\psi(z)=\|z+\mu\|_{2}^{2}$ . Then applying Lemma E.1 to $g_{l}(z)$ gives us

[TABLE]

Note that

[TABLE]

which means that we only need to consider partitions whose blocks all have size $1$ or $2$ when calculating the partial derivatives of $g_{l}(z)$ using (77). Since we also need partitions of the blocks within an original partition $\pi$ , we define the following partition set $\mathcal{C}(\pi)$ for any partition $\pi=\{B_{1},\dots,B_{n}\}$ of size $n$ :

[TABLE]
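One way to read the construction of $\mathcal{C}(\pi)$ is sketched below: for each block of $\pi$ , choose a partition of that block into singletons and pairs, then take the union of the choices. The helper names (`small_block_partitions`, `refinements`, `singles_and_pairs`) are hypothetical and the code reflects this reading of the definition rather than a verbatim transcription of it.

```python
from itertools import product

def small_block_partitions(items):
    """Yield partitions of `items` whose blocks all have size 1 or 2."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    # Case 1: `first` forms a singleton block.
    for partition in small_block_partitions(rest):
        yield [[first]] + partition
    # Case 2: `first` is paired with exactly one other element.
    for idx, partner in enumerate(rest):
        remaining = rest[:idx] + rest[idx + 1:]
        for partition in small_block_partitions(remaining):
            yield [[first, partner]] + partition

def refinements(pi):
    """Assumed reading of C(pi): refine every block of pi into singletons and pairs."""
    per_block = [list(small_block_partitions(block)) for block in pi]
    for choice in product(*per_block):
        yield [block for part in choice for block in part]

def singles_and_pairs(pi_tilde):
    """Split a refinement into its singleton indices and its pairs."""
    singles = [b[0] for b in pi_tilde if len(b) == 1]
    pairs = [tuple(b) for b in pi_tilde if len(b) == 2]
    return singles, pairs

for pi_tilde in refinements([[1, 2, 3]]):
    print(pi_tilde, singles_and_pairs(pi_tilde))
```

For the single-block partition $\pi=\{\{1,2,3\}\}$ this yields four refinements: all singletons, and the three ways of pairing two indices while leaving the third alone.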

This set $\mathcal{C}(\pi)$ includes the unions of partitions of each block $B_{i}$ within $\pi$ , where each block within the partition of $B_{i}$ has size at most $2$ . Let $S(\tilde{\pi})=\{i:\{i\}\in\tilde{\pi}\}$ , and $P(\tilde{\pi})=\{\{i,j\}:\{i,j\}\in\tilde{\pi}\}$ , then the partial derivative of $f_{l}(z)$ can be expanded as

[TABLE]

where we apply the fact that $\varphi_{l}^{(k)}(x)=\frac{C(k)}{\varepsilon x^{k-\frac{1}{2}}}$ . For each fixed $\pi\in\Pi(k)$ and $\tilde{\pi}\in\mathcal{C}(\pi)$ ,

[TABLE]

Combining this with (78), we have

[TABLE]

In addition, note that $f_{*}^{(k)}(x)=\phi^{(k-1)}(x-\frac{1}{2})=0$ when $x\leq 0$ or $x\geq 1$ , and is bounded on $(0,1)$ . Thus we only have to consider $\|z+\mu\|_{2}>r+\varepsilon$ when $l=1$ and $\|z+\mu\|_{2}>r-2\varepsilon$ when $l=2$ . If $r>3\varepsilon$ and $l=2$ , $\|z+\mu\|_{2}>r-2\varepsilon>\varepsilon$ . Therefore,

[TABLE]

∎

Proof of Lemma 2.1.

Note that

[TABLE]

When $A^{*}$ is symmetric, $\Upsilon^{-1}=I-(A^{*})^{2}$ , thus

[TABLE]

It is clear that

[TABLE]

Let $R_{m}=\left|\{i:[(A^{*})^{2}]_{i,D_{m}}\neq 0\}\right|$ and $C_{m}=\{j:A^{*}_{j,D_{m}}\neq 0\}$ , then

[TABLE]

and

[TABLE]

Therefore,

[TABLE]

∎
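The identity $\Upsilon^{-1}=I-(A^{*})^{2}$ used in the proof of Lemma 2.1 can be sanity-checked numerically. The sketch below assumes that $\Upsilon$ is the stationary covariance $\sum_{j\geq 0}A^{j}(A^{j})^{\top}$ of an AR(1) process $X_{t+1}=AX_{t}+\epsilon_{t}$ with identity noise covariance; both this reading and the example matrix `A` are assumptions made for illustration.

```python
def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def matadd(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def stationary_covariance(a, terms=200):
    """Truncated series sum_{j=0}^{terms} A^j (A^j)^T, assuming Cov(eps_t) = I."""
    n = len(a)
    a_t = [list(col) for col in zip(*a)]
    term = identity(n)
    total = identity(n)
    for _ in range(terms):
        term = matmul(matmul(a, term), a_t)  # A^{j+1} (A^{j+1})^T from A^j (A^j)^T
        total = matadd(total, term)
    return total

# Hypothetical symmetric transition matrix with spectral radius below 1.
A = [[0.5, 0.2], [0.2, 0.3]]
upsilon = stationary_covariance(A)

A_sq = matmul(A, A)
candidate_inverse = [[identity(2)[i][j] - A_sq[i][j] for j in range(2)] for i in range(2)]

# If the identity holds, this product is numerically the 2 x 2 identity matrix.
check = matmul(upsilon, candidate_inverse)
max_dev = max(abs(check[i][j] - identity(2)[i][j]) for i in range(2) for j in range(2))
print(max_dev)
```

Symmetry of `A` is what turns $A^{j}(A^{j})^{\top}$ into $A^{2j}$ , so the series becomes the Neumann series for $(I-A^{2})^{-1}$ .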

Proof of Lemma A.2.

Let $Y=(B+\Delta)^{-1}$ ; then immediately we have $YB-I=-Y\Delta$ , which is equivalent to $Y-B^{-1}=-Y\Delta B^{-1}$ . Thus the $\ell_{2}$ norm of $Y-B^{-1}$ can be bounded by $\|Y\|_{2}\|\Delta\|_{2}\|B^{-1}\|_{2}$ . Moreover, since $\|Y\|_{2}\leq\|Y-B^{-1}\|_{2}+\|B^{-1}\|_{2}$ , we have

[TABLE]

and rearranging terms gives us

[TABLE]

Therefore,

[TABLE]

∎
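The rearrangement in the proof of Lemma A.2 leads, when $\|\Delta\|_{2}\|B^{-1}\|_{2}<1$ , to a bound of the form $\|(B+\Delta)^{-1}-B^{-1}\|_{2}\leq\frac{\|B^{-1}\|_{2}^{2}\|\Delta\|_{2}}{1-\|B^{-1}\|_{2}\|\Delta\|_{2}}$ . The sketch below checks this form on a small example; the matrices `B` and `Delta` are hypothetical, and the exact constants in the lemma statement may be arranged differently.

```python
import math

def inv2(m):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def spectral_norm2(m):
    """Operator norm of a 2 x 2 matrix: sqrt of the top eigenvalue of M^T M."""
    (a, b), (c, d) = m
    p = a * a + c * c  # (M^T M)_{11}
    q = a * b + c * d  # (M^T M)_{12}
    r = b * b + d * d  # (M^T M)_{22}
    top = (p + r) / 2 + math.sqrt(((p - r) / 2) ** 2 + q * q)
    return math.sqrt(top)

# Hypothetical well-conditioned matrix and small perturbation.
B = [[2.0, 0.5], [0.5, 1.0]]
Delta = [[0.1, -0.05], [0.02, 0.08]]

Y = inv2([[B[i][j] + Delta[i][j] for j in range(2)] for i in range(2)])
B_inv = inv2(B)
diff = [[Y[i][j] - B_inv[i][j] for j in range(2)] for i in range(2)]

lhs = spectral_norm2(diff)
nb, nd = spectral_norm2(B_inv), spectral_norm2(Delta)
rhs = nb * nb * nd / (1 - nb * nd)  # valid only when nb * nd < 1

print(f"||Y - B^-1|| = {lhs:.4f} <= {rhs:.4f}")
```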

Proof of Lemma A.3.

First note that for any symmetric matrix $U$ , we can write it as $U=P^{\top}\Lambda P$ , with orthogonal matrix $P$ and diagonal matrix $\Lambda$ . By the definition of trace norm,

[TABLE]

If we denote the non-zero eigenvalues of $U$ as $\lambda_{1},\dots,\lambda_{r}$ , then

[TABLE]

∎
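The eigenvalue representation used in the proof of Lemma A.3 can be illustrated on a small example. The sketch below computes the trace norm of a hypothetical symmetric matrix `U` as the sum of absolute eigenvalues and compares it with two standard consequences, $\|U\|_{\rm tr}\leq\sqrt{r}\,\|U\|_{F}$ and $\|U\|_{\rm tr}\leq r\|U\|_{2}$ , where $r$ is the rank. Which of these forms Lemma A.3 states is not assumed here.

```python
import math

def symmetric_eigenvalues_2x2(u):
    """Closed-form eigenvalues of a symmetric 2 x 2 matrix [[a, b], [b, d]]."""
    (a, b), (_, d) = u
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b * b)
    return mean + radius, mean - radius

# Hypothetical symmetric matrix with one positive and one negative eigenvalue.
U = [[2.0, 1.0], [1.0, -1.0]]
eigs = symmetric_eigenvalues_2x2(U)

trace_norm = sum(abs(lam) for lam in eigs)             # sum of |eigenvalues|
frobenius = math.sqrt(sum(x * x for row in U for x in row))
operator = max(abs(lam) for lam in eigs)               # spectral norm
rank = sum(1 for lam in eigs if abs(lam) > 1e-12)

print(trace_norm, math.sqrt(rank) * frobenius, rank * operator)
```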

Bibliography

Ang and Piazzesi [2003] A. Ang and M. Piazzesi. A no-arbitrage vector autoregression of term structure dynamics with macroeconomic and latent variables. Journal of Monetary Economics, 50(4):745–787, 2003.
Barigozzi and Brownlees [2018] M. Barigozzi and C. T. Brownlees. Nets: Network estimation for time series. 2018.
Basu et al. [2015] S. Basu and G. Michailidis. Regularized estimation in sparse high-dimensional time series models. The Annals of Statistics, 43(4):1535–1567, 2015.
Bressler et al. [2007] S. L. Bressler, C. G. Richter, Y. Chen, and M. Ding. Cortical functional network organization from autoregressive modeling of local field potential oscillations. Statistics in Medicine, 26(21):3875–3885, 2007.
Chen and Wu [2018] L. Chen and W. B. Wu. Testing for trends in high-dimensional time series. Journal of the American Statistical Association, (just-accepted):1–37, 2018.
Chen et al. [2010] S. X. Chen, L.-X. Zhang, and P.-S. Zhong. Tests for high-dimensional covariance matrices. Journal of the American Statistical Association, 105(490):810–819, 2010.
Davis et al. [2016] R. A. Davis, P. Zang, and T. Zheng. Sparse vector autoregressive modeling. Journal of Computational and Graphical Statistics, 25(4):1077–1096, 2016.
Fan et al. [2012] J. Fan, S. Guo, and N. Hao. Variance estimation using refitted cross-validation in ultrahigh dimensional regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(1):37–65, 2012.
