Testing for high-dimensional network parameters in auto-regressive models
Lili Zheng, Garvesh Raskutti

TL;DR
This paper develops statistical inference methods, including confidence intervals, for high-dimensional auto-regressive network models with sub-Gaussian noise, extending beyond Gaussian assumptions and addressing dependence challenges.
Contribution
It introduces convergence in distribution results and confidence intervals for high-dimensional AR(p) models with sub-Gaussian noise, broadening applicability beyond Gaussian assumptions.
Findings
Convergence results hold when T scales as (s ∨ ρ)^2 log^2 M.
Provides novel concentration bounds for dependent sub-Gaussian quadratic forms.
Validates theoretical results through simulations on structured networks.
Abstract
High-dimensional auto-regressive models provide a natural way to model influence between actors given multi-variate time series data for time intervals. While there has been considerable work on network estimation, there is limited work in the context of inference and hypothesis testing. In particular, prior work on hypothesis testing in time series has been restricted to linear Gaussian auto-regressive models. From a practical perspective, it is important to determine suitable statistical tests for connections between actors that go beyond the Gaussian assumption. In the context of high-dimensional time series models, confidence intervals present additional challenges, since most estimators such as the Lasso and Dantzig selectors are biased, which has led to de-biased estimators. In this paper we address these challenges and provide convergence in distribution results and confidence intervals for the multi-variate AR(p) model with sub-Gaussian noise, a generalization of Gaussian noise that broadens applicability and presents numerous technical challenges. The main technical challenge lies in the fact that, unlike Gaussian random vectors, for sub-Gaussian vectors zero correlation does not imply independence. The proof relies on an intricate truncation argument to develop novel concentration bounds for quadratic forms of dependent sub-Gaussian random variables. Our convergence in distribution results hold provided the sample size $T$ scales as $(s \vee \rho)^2 \log^2 M$, where $s$ and $\rho$ refer to sparsity parameters, which matches existing results for hypothesis testing with i.i.d. samples. We validate our theoretical results with simulation results for both block-structured and chain-structured networks.
Author affiliation: Department of Statistics, University of Wisconsin-Madison
1 Introduction
Vector autoregressive models arise in a number of applications including macroeconomics (see e.g. Ang and Piazzesi (2003), Hansen (2003), Shan (2005)), computational neuroscience (see e.g. Goebel et al. (2003), Seth et al. (2015), Harrison et al. (2003), Bressler et al. (2007)), and many others (see e.g. Michailidis and d’Alché Buc (2013), Fujita et al. (2007)). Recent years have seen substantial development in the theory and methodology of high-dimensional auto-regressive models with respect to parameter estimation (see e.g. Song and Bickel (2011), Basu et al. (2015), Davis et al. (2016), Medeiros and Mendes (2016), Mark B. and R. (2018)). In particular, if there are dependent time series (e.g. voxels in the brain, actors in a social network, measurements at different spatial locations), time series network models allow us to model temporal dependence between actors/nodes in a network.
More precisely, consider the following time series auto-regressive network model with lag ,
[TABLE]
where is the time series data we have access to, are the network parameters of interest and is zero-mean noise. We are considering the high-dimensional setting where the number of nodes in the network is much larger than the sample size . Prior work in Basu et al. (2015) has addressed the question of how to estimate the network parameter with Gaussian noise under sparsity assumptions and various structural constraints. In this paper, we focus on inference and hypothesis testing for the parameter given the data .
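To make the setting concrete, the following is a minimal simulation sketch of a lag-p vector auto-regression with bounded noise. It assumes the standard VAR(p) recursion (each new observation is a linear combination of the previous p observations plus noise); the function name `simulate_var`, the burn-in length, the diagonal coefficient matrices and the uniform noise law are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def simulate_var(A_list, T, burn_in=200, seed=0):
    """Simulate X_t = sum_i A_list[i] @ X_{t-1-i} + eps_t (assumed VAR(p) form)."""
    rng = np.random.default_rng(seed)
    p = len(A_list)
    M = A_list[0].shape[0]
    # Assumption: Uniform(-sqrt(3), sqrt(3)) noise, i.e. zero mean, unit variance
    # and bounded, hence sub-Gaussian but not Gaussian.
    half_width = np.sqrt(3.0)
    n_total = burn_in + T + p
    X = np.zeros((n_total, M))
    for t in range(p, n_total):
        eps = rng.uniform(-half_width, half_width, size=M)
        X[t] = sum(A_list[i] @ X[t - 1 - i] for i in range(p)) + eps
    # Drop the burn-in so the returned sample is close to stationarity.
    return X[-T:]

M, T = 200, 100                      # more nodes than time points
A1 = 0.4 * np.eye(M)                 # hypothetical lag-1 coefficients
A2 = 0.2 * np.eye(M)                 # hypothetical lag-2 coefficients
X = simulate_var([A1, A2], T)
print(X.shape)
```

With more nodes than time points, ordinary least squares for the coefficient matrices is not well defined, which is why sparsity-based estimators are needed in this regime.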
In high-dimensional statistics, there has recently been a growing body of work on confidence intervals and hypothesis testing under structural assumptions such as sparsity. Since the widely used Lasso estimator for sparse linear regression is asymptotically biased, one-step estimators based on bias correction have been studied in works such as Zhang and Zhang (2014), Van de Geer et al. (2014) and Javanmard and Montanari (2014), where they are referred to as the LDPE, de-sparsifying and de-biasing estimators, respectively. Low-dimensional components of these estimators are asymptotically normal and can thus be used to construct hypothesis tests and confidence intervals.
In this paper, we adopt the framework of Ning and Liu (Ning et al. (2017)), who propose a high-dimensional test statistic based on the score function, called the decorrelated score function, which we briefly describe here. Formally, consider a statistical model with high-dimensional parameter vector . Suppose we are interested in the scalar parameter and is the nuisance parameter. Suppose data are i.i.d. data following distribution , then the negative log-likelihood function is defined as
[TABLE]
It is known that the score function is asymptotically normal if the true parameter . If is substituted by some estimator , the estimation induced error can be approximated as the following:
[TABLE]
when is small enough. Although converge to 0 with properly chosen , e.g. Lasso estimator, would not vanish if . This fact motivates the decorrelated score function:
[TABLE]
with Fisher information matrix . One can check that
[TABLE]
Both and are substituted by some estimator, and it is shown in Ning et al. (2017) that the decorrelated score function is asymptotically normal.
In the linear regression case, the test statistic generated by the decorrelated score function in Ning et al. (2017) is equivalent to that constructed from the de-biased estimator in Van de Geer et al. (2014). However, the framework of Ning et al. (2017) allows a more general form and is thus easier to adapt to the time series case. In fact, Neykov et al. (2018) consider, amongst other examples, high-dimensional time series with Gaussian error innovations. While Gaussian error innovations are widely used, many time series models involve data with bounded range or discrete data, for which the Gaussian distribution is not a natural fit. In this paper, we address the more general and technically challenging setting in which the noise is sub-Gaussian.
One of the important technical challenges in going from the Gaussian to the sub-Gaussian case is that dependent Gaussian vectors can be rotated to be independent, while such a result does not hold for sub-Gaussian vectors. Prior work in Wong et al. (2016) addresses this challenge by imposing stationarity and -mixing conditions. In order to avoid these conditions, we develop novel concentration bounds for sub-Gaussian random vectors.
In this paper, we investigate the hypothesis testing and confidence region with respect to a low-dimensional component of parameter matrices for sub-Gaussian data, using the testing framework in Ning et al. (2017). Our major contributions are as follows:
- Extending theoretical results in Ning et al. (2017) for high-dimensional hypothesis testing from Gaussian to sub-Gaussian temporally dependent data (VAR model), both under the null and alternative hypotheses. We also show that our techniques lead to similar results to Neykov et al. (2018) in the Gaussian case but under less restrictive conditions;
- A novel concentration bound for quadratic forms of sub-Gaussian time series data. Note that unlike Gaussian vectors, which can be rotated to be independent, sub-Gaussian vectors cannot, which presents additional technical challenges. Our analysis also leads to estimators for covariance and regression parameters for time series data under sub-Gaussian assumptions, which are of independent interest;
- We also construct a semi-parametric efficient confidence region for multivariate parameters with fixed dimension;
- Finally, we support our theoretical guarantees with a simulation study on bounded noise, which is sub-Gaussian but not Gaussian.
1.1 Related Work
In the literature on inference for high-dimensional VAR models, most work focuses on the estimation problem. Song and Bickel (2011) investigate penalized least squares algorithms for different penalties, with some externally imposed assumptions on the temporal dependence. Theoretical guarantees for Dantzig-type and Lasso-type estimators are studied in Han et al. (2015) and Basu et al. (2015), but with Gaussian noise. Barigozzi and Brownlees (2018) consider inference for the stationary dependence structure among variables, rather than for the parameters in the VAR model. In our work, we control the error bounds of Lasso and Dantzig type estimators for parameter matrices with sub-Gaussian noise, and then establish the asymptotic distribution of the test statistic based on these bounds.
In the high-dimensional hypothesis testing literature, there is some work on testing high-dimensional mean vectors (Srivastava (2009)), covariance matrices (Chen et al. (2010), Zhang et al. (2013)) and independence among variables (Schott (2005)). For testing regression parameters, however, most work assumes i.i.d. samples. Lockhart et al. (2014), Taylor et al. (2014) and Lee et al. (2016) propose methods to test whether a covariate should be selected conditional on the selection of some other covariates. A penalized score test depending on the tuning parameter is considered in Voorman et al. (2014). Our work follows a line of work by Zhang and Zhang (2014), Van de Geer et al. (2014), Javanmard and Montanari (2014) and Ning et al. (2017), the de-sparsifying or decorrelated literature. We construct a VAR version of the decorrelated score test proposed by Ning et al. (2017). Chen and Wu (2018) tackle the hypothesis testing problem for time series data as well, but they test the trend in a time series, instead of the autoregressive parameter which encodes the influence structure among variables.
As mentioned earlier, our work is most closely related to the prior work of Neykov et al. (2018), which provides a hypothesis testing framework with high-dimensional Gaussian time series as a special case. In our work, we consider the more general and technically challenging case of sub-Gaussian vector auto-regressive models. Throughout this paper, we provide a comparison to the results derived in that work for the Gaussian case.
1.2 Organization of the Paper
Section 2 explains the problem setup and proposes our test statistic. Theoretical guarantees are presented in Section 3. Specifically, Sections 3.1 and 3.2 present the weak convergence rate of the test statistic under the null and alternative hypotheses and . Section 3.3 proposes some feasible estimators, which satisfy the required assumptions and can be plugged into the test statistic. Section 3.4 considers the case when the variance of the noise is unknown, and we construct a confidence region for multivariate parameter vectors in Section 3.5. For the special case of the AR(1) model with Gaussian noise, a detailed comparison with Neykov et al. (2018) is provided in Section 3.6. Section 4 provides simulation results and Section 5 includes the proofs for the two main theorems. Much of the proof is deferred to the Appendices.
1.3 Notation
We define the following norms for vectors and matrices: For a vector , we define the -norm where , For a matrix , the norm and Frobenius norm of is defined as We also use notation to denote the penalty on , which is . Furthermore, if is symmetric the trace norm of is
Throughout the paper, we assume that the entries of noise vectors are independent sub-Gaussian variables with constant scale factor. A univariate centered random variable has a sub-Gaussian distribution with scale factor if
[TABLE]
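For reference, the usual moment-generating-function form of sub-Gaussianity is sketched below. The symbols $x$ (the centered random variable) and $\tau$ (the scale factor) are placeholders chosen here, since the original display did not survive extraction; the paper's exact statement may differ in notation or constants.

```latex
\mathbb{E}\left[\exp(\lambda x)\right] \le \exp\!\left(\frac{\tau^{2}\lambda^{2}}{2}\right)
\quad \text{for all } \lambda \in \mathbb{R}.
```

Under this form, a centered Gaussian variable with standard deviation $\sigma$ satisfies the bound with $\tau = \sigma$, and by Hoeffding's lemma any centered variable taking values in $[-b, b]$ satisfies it with $\tau = b$, which is why bounded noise falls within the sub-Gaussian class.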
2 Problem Setup
We consider a general vector auto-regressive time series with lag , where is known and finite and independent of or other dimensions:
[TABLE]
where , is zero-mean entry-wise independent sub-Gaussian noise with identity covariance matrix, and are parameters of interest. Define the matrix and , then we can also write (2) as
[TABLE]
For notational convenience, we assume that time series data has time range .
Based on data , we test the hypothesis of whether a subset of entries in are [math]. Let be the th row vector of . Without loss of generality, suppose the entries we test are in rows . Define as the columns we test in th row with , and , with . We test the null hypothesis:
[TABLE]
where . We also assume that is finite and not increasing with . In the work of Neykov et al. (2018), is assumed to be .
2.1 Stationary distribution
Since we are developing a hypothesis testing framework based on the decorrelated score test, it is important to specify a stationary distribution for . Using standard notation from auto-regressive time series models, define the polynomial , where is an identity matrix, and is a complex number. To guarantee the existence of a stationary solution to (3), we assume
[TABLE]
Then we can write
[TABLE]
where are all real valued matrices which are polynomial functions of . Note that in the special case where , .
It can be shown that the unique stationary solution to (2) is
[TABLE]
and the covariance matrix of satisfies
[TABLE]
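A numerical check of stationarity can be sketched as follows. It assumes the stability condition above is the usual one for VAR(p) models (the determinant of the matrix polynomial has no roots inside or on the unit circle), which is equivalent to the companion matrix having spectral radius strictly below one. The helper names `companion_matrix` and `is_stable` and the example coefficients are hypothetical.

```python
import numpy as np

def companion_matrix(A_list):
    """Stack the lag matrices of a VAR(p) into its (M p) x (M p) companion matrix."""
    p = len(A_list)
    M = A_list[0].shape[0]
    top = np.hstack(A_list)                       # first block row: [A1, ..., Ap]
    if p == 1:
        return top
    # Remaining block rows shift the lagged states down by one position.
    bottom = np.hstack([np.eye(M * (p - 1)), np.zeros((M * (p - 1), M))])
    return np.vstack([top, bottom])

def is_stable(A_list, tol=1e-10):
    """Return (stable?, spectral radius) for the companion matrix."""
    rho = np.max(np.abs(np.linalg.eigvals(companion_matrix(A_list))))
    return bool(rho < 1.0 - tol), float(rho)

# Hypothetical coefficient matrices, chosen only for illustration.
ok, rho_ok = is_stable([0.5 * np.eye(3), 0.3 * np.eye(3)])
bad, rho_bad = is_stable([0.8 * np.eye(3), 0.3 * np.eye(3)])
print(ok, round(rho_ok, 3), bad, round(rho_bad, 3))
```

For lag one the companion matrix is the coefficient matrix itself, so the check reduces to requiring all of its eigenvalues to lie strictly inside the unit circle.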
2.2 Decorrelated Score Function
Using the framework developed in Ning et al. (2017) for independent design, we consider the decorrelated score test. First we define the score function , with each entry defined as follows:
[TABLE]
As pointed out in Ning et al. (2017), the standard score function is infeasible and we need to consider the decorrelated score function
[TABLE]
with each corresponding to the tested row :
[TABLE]
where is composed of the entries of whose indices are within set . is also defined similarly and is chosen to satisfy
[TABLE]
Specifically, is defined as a function of :
[TABLE]
2.3 Test Statistic
Based on the decorrelated score function , we first define the statistic :
[TABLE]
with being defined as:
[TABLE]
Let be the -dimensional vector concatenated by ’s:
[TABLE]
One of the main results of the paper is to show that is asymptotically Gaussian. Define , then is asymptotically . Since we do not know , , and , we later define estimators for these quantities. Formally, we define our test statistic as
[TABLE]
where is an estimator for and is defined as
[TABLE]
with and estimating and . Here we are not concerned about the invertibility of , since is a low-dimensional covariance matrix. To guarantee good estimation of the high-dimensional parameters and , we impose sparsity conditions upon them. Specifically, for each , define
[TABLE]
and note that they both depend on .
The sparsity of can be implied by the sparsity of , which is a common condition in high-dimensional hypothesis testing literature (e.g. see Van de Geer et al. (2014)). Specifically, the following Lemma shows that when lag and is symmetric, the sparsity of is implied by the sparsity of :
Lemma 2.1.
If , is symmetric, then defined in (10) satisfies
[TABLE]
The proof for Lemma 2.1 is included in Appendix E.
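The construction in this section can be illustrated with a simplified, univariate analogue: testing a single entry of the coefficient matrix in a lag-one model with unit-variance noise. This is a sketch under stated assumptions, not the paper's exact statistic: ordinary least squares stands in for the sparse (Lasso or Dantzig) estimators, which is only viable when the number of nodes is small, and the function name `decorrelated_score_stat` and the simulated example are hypothetical.

```python
import numpy as np

def decorrelated_score_stat(X, m, j, a0=0.0):
    """Chi-square(1)-type statistic for H0: A[m, j] = a0 in X_{t+1} = A X_t + eps_t.

    Assumes unit noise variance and uses least squares in place of the sparse
    estimators that would be required in high dimensions.
    """
    Z, y = X[:-1], X[1:, m]
    T = Z.shape[0]
    z_j = Z[:, j]
    Z_rest = np.delete(Z, j, axis=1)
    # Nuisance coefficients of row m, with the tested entry fixed at a0.
    gamma, *_ = np.linalg.lstsq(Z_rest, y - a0 * z_j, rcond=None)
    # Decorrelation weights: projection of the tested covariate on the others.
    w, *_ = np.linalg.lstsq(Z_rest, z_j, rcond=None)
    resid_y = y - a0 * z_j - Z_rest @ gamma
    resid_j = z_j - Z_rest @ w
    score = -np.mean(resid_y * resid_j)     # decorrelated score at the null value
    var_j = np.mean(resid_j ** 2)           # plug-in variance of the score
    return T * score ** 2 / var_j

rng = np.random.default_rng(1)
M, T = 5, 2000
A = 0.5 * np.eye(M)                         # hypothetical true coefficient matrix
X = np.zeros((T + 1, M))
for t in range(T):
    X[t + 1] = A @ X[t] + rng.uniform(-np.sqrt(3), np.sqrt(3), size=M)

stat_null = decorrelated_score_stat(X, m=0, j=1)   # true entry is 0
stat_alt = decorrelated_score_stat(X, m=0, j=0)    # true entry is 0.5
print(stat_null, stat_alt)
```

Subtracting the projection of the tested covariate on the remaining ones is what makes the score insensitive to first-order errors in the nuisance estimate; the statistic for the truly zero entry should be of the order of a chi-square draw with one degree of freedom, while the one for the nonzero entry grows with the sample size.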
3 Theoretical guarantee
In this section, we present uniform convergence results for test statistic under and , with and estimators satisfying conditions. We also provide feasible estimators, and prove that they satisfy corresponding conditions in Section 3.3. Unknown variance and confidence region construction are discussed in Sections 3.4 and 3.5. In Section 3.6 we provide consequences of our theory under the AR(1) model with Gaussian noise and compare our results with Neykov et al. (2018).
Recall that the null hypothesis is
[TABLE]
with being concatenated by . For the alternative hypothesis, as in Ning et al. (2017), we consider
[TABLE]
with some constant and constant vector . Write
[TABLE]
where each . The reason why instead of is considered in (12) is that we expect the test to be more sensitive as sample size increases. We will see how the value of influences the convergence of in Theorem 3.2.
We still assume ’s are i.i.d. sub-Gaussian random variables, and also consider a special case, where . We compare our result in the Gaussian case to results in Neykov et al. (2018).
First we define the sets and of feasible parameter matrices under and respectively. To control the stability of in model (3), we impose the condition:
[TABLE]
for some constant . In the case , condition (13) reduces to
[TABLE]
which is implied by for some , a typical condition assumed (see e.g. Neykov et al. (2018)). Then define sets and for any , set of size and vector :
[TABLE]
[TABLE]
Note here and are still functions of , since is determined by . Clearly we need reliable estimators for , and with , to guarantee the weak convergence of . We present the following assumptions for these estimators, which we will verify in section 3.3. Note that constants may depend on and , but do not depend on either or .
Assumption 3.1 (Estimation Error for ).
For each ,
[TABLE]
hold for , with probability at least .
These are standard error bounds for the Lasso estimator and the Dantzig selector with independent design. In Section 3.3 we verify Assumption 3.1, together with the remaining two assumptions, for dependent sub-Gaussian random variables, as arise in our vector auto-regressive model setting.
Assumption 3.2 (Estimation Error for ).
For each :
[TABLE]
hold for , with probability at least .
Similar to Assumption 3.1, we will show that both Lasso estimator and Dantzig selector under model (3) satisfy Assumption 3.2.
Assumption 3.3 (Estimation Error for ).
For each ,
[TABLE]
hold for , with probability at least .
Note that is a low-dimensional matrix, and thus it is computationally feasible to use the sample covariance matrix of as an estimator for . We show in section 3.3 that, as long as is a reliable estimator for , would satisfy a tighter bound than (19). This looser bound in Assumption 3.3 actually allows more choices for estimators for , as shown in section 3.5.
3.1 Uniform convergence under null hypothesis
Based on these assumptions, we have the following main theorem.
Theorem 3.1.
Consider the model (3) with i.i.d. sub-Gaussian noise with sub-Gaussian parameter . If Assumptions 3.1-3.3 are satisfied, and , then defined in (9) satisfies
[TABLE]
when for some constant . Here the constants ’s depend on .
Theorem 3.1 proves weak convergence of to . The uniform convergence rate can be understood as follows: the first term is due to the rate obtained by martingale CLT, where we require rather than due to the dependence; the remaining two terms arise from estimation error, with the second one being the error bounds, and third being the probability that the error bounds do not hold. If we assume Gaussianity, we can improve the first term in the rate of convergence from to for any . To the best of our knowledge, ours is the first work that formally attempts to characterize the rates of convergence.
Remark 3.1.
Compared to the theoretical result for independent design in Ning et al. (2017), the only additional condition we add is , which is used to control the strength of dependence uniformly. Also, we consider multivariate testing, which is more general, and derive the explicit convergence rate.
Remark 3.2.
The test statistics proposed in Van de Geer et al. (2014) and Javanmard and Montanari (2014) for the independent design share similar ideas with our test statistic. Instead of imposing a sparsity assumption upon , Van de Geer et al. (2014) assume to be row-wise sparse. This is actually equivalent to the sparsity assumption on in the univariate case. Javanmard and Montanari (2014) do not require the sparsity condition on , but it is hard to extend their theory to the time series setting, due to a difficulty in applying the martingale CLT.
Remark 3.3.
The theoretical guarantee we obtain here is more general and stronger than the result achieved in Neykov et al. (2018). A more detailed comparison is presented in Section 3.6.
3.2 Uniform convergence under alternative hypothesis
Recall the definition of in (16). The following theorem establishes the asymptotic behavior of for , with different values of . First define
[TABLE]
where is defined in (8).
Theorem 3.2.
Consider the model (3) with i.i.d. sub-Gaussian noise and sub-Gaussian parameter . If Assumptions 3.1-3.3 are satisfied, and , then when for some constant ,
- (1)
[TABLE]
- (2)
[TABLE]
- (3)
[TABLE]
Here ’s are constants depending on .
Theorem 3.2 shows the threshold value of for to be detectable. When , we cannot distinguish and since under both cases converges to ; when , diverges to in probability, thus it would be very easy to detect ; when , converges to a non-central with noncentrality parameter determined by constant vector and , which determines the power of the test. Note that (23) also holds for the trivial case , since we do not use the fact in the proof.
Remark 3.4.
Theorem 3.2 is also consistent with the threshold value of given by Ning et al. (2017) for linear regression with i.i.d. samples. However, Ning et al. (2017) assume additional conditions on the scaling of sample size, number of covariates and sparsity of for proving asymptotic power. Our conditions are exactly the same as the ones for , due to a more specific model and careful analysis.
3.3 Feasible Estimators
Both the estimation of and can be viewed as high-dimensional sparse regression problems, thus we can use the Lasso or Dantzig selector. Formally, define
[TABLE]
as the Lasso estimator for , and
[TABLE]
as the Dantzig selector estimator for . Similarly, for define
[TABLE]
and
[TABLE]
For estimating , since this is a low-dimensional covariance matrix for , we can directly use the sample covariance of as :
[TABLE]
for . Here in the definition of (29) is either or .
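The Lasso step for the coefficient matrix can be sketched row by row as below, for a lag-one model. The objective scaling (one half of the mean squared error plus the penalty), the proximal-gradient (ISTA) solver, the function names and the order of the tuning parameter are all assumptions made for illustration; the paper's displayed definitions did not survive extraction, and the Dantzig selector variant is not shown.

```python
import numpy as np

def soft_threshold(v, thr):
    """Entry-wise soft-thresholding, the proximal map of the l1 penalty."""
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

def lasso_ista(Z, y, lam, n_iter=500):
    """Minimise (1/(2T)) * ||y - Z b||^2 + lam * ||b||_1 by proximal gradient."""
    T, d = Z.shape
    step = T / np.linalg.norm(Z, 2) ** 2    # 1 / Lipschitz constant of the gradient
    b = np.zeros(d)
    for _ in range(n_iter):
        grad = Z.T @ (Z @ b - y) / T
        b = soft_threshold(b - step * grad, step * lam)
    return b

def lasso_var1(X, lam):
    """Row-by-row Lasso estimate of A in X_{t+1} = A X_t + eps_t."""
    Z, Y = X[:-1], X[1:]
    return np.vstack([lasso_ista(Z, Y[:, m], lam) for m in range(Y.shape[1])])

rng = np.random.default_rng(0)
M, T = 30, 400                               # small sizes, chosen only for speed
A_true = 0.5 * np.eye(M)                     # hypothetical sparse coefficient matrix
X = np.zeros((T + 1, M))
for t in range(T):
    X[t + 1] = A_true @ X[t] + rng.uniform(-np.sqrt(3), np.sqrt(3), size=M)

lam = np.sqrt(np.log(M) / T)                 # assumed order of the tuning parameter
A_hat = lasso_var1(X, lam)
print(np.round(np.diag(A_hat)[:5], 2))
```

The estimated diagonal is shrunk towards zero relative to the true value, which is the bias that the decorrelated score construction is designed to remove.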
As shown in the following, estimators (25) to (29) all satisfy Assumptions 3.1 to 3.3, under the model setting stated in (3):
Lemma 3.1.
If , or , which are defined as in (25) and (26) with , then satisfies Assumption 3.1 when .
Lemma 3.2.
If or , which are defined as in (27) and (28) with , then ’s satisfy Assumption 3.2 when .
Lemma 3.3.
If ’s are defined as in (29), where satisfies (18) with probability at least , then
[TABLE]
with probability at least , when .
Note that Lemma 3.3 is stronger than Assumption 3.3. The proofs of these lemmas are deferred to Appendix A. By these lemmas and Theorems 3.1 and 3.2, we arrive at the following corollary.
Corollary 3.1.
Under model (3) with i.i.d. sub-Gaussian noise with parameter , if or , or , and ’s are defined as in (29) for with , then if and for some constant , bounds (20) to (24) from Theorems 3.1 and 3.2 hold.
3.4 Variance Estimation
In this section, we consider the case where is unknown under model (3). If is known, it is straightforward to extend Theorems 3.1 and 3.2 to defined as follows:
[TABLE]
This follows since if we consider , time series data would satisfy the same model but with unit variance noise.
When is unknown, we apply the estimator
[TABLE]
and define the test statistic
[TABLE]
We show that has the same convergence results we derive for the unit variance noise case.
Theorem 3.3.
Consider the model (3) with i.i.d. sub-Gaussian noise of variance and scale factor . Then Theorems 3.1 and 3.2 hold for under each corresponding condition, and constants ’s also depend on .
Theorem 3.3 shows that when we have to estimate the unknown , test statistic maintains the same asymptotic behavior as under the known variance case, given that all the assumptions for estimation errors are satisfied and is lower bounded by some constant.
Remark 3.5.
With sub-Gaussian noise , if we still assume the scale factor of to be bounded by a constant, then Lemmas 3.1 to 3.3 would still hold. Thus the assumptions imposed on estimation errors of , and are all satisfied. However, if we don’t assume to be bounded, then the tuning parameters and have to scale with .
Remark 3.6.
Neykov et al. (2018) propose another estimator for the variance of , based on the fact that . Both these estimators are consistent and lead to convergence in distribution results.
3.5 Semi-parametric Optimal Confidence Region
In this section, we construct a confidence region for , under model (3) with unknown noise variance . Similar to Ning et al. (2017), we consider the one-step estimator for each , based on the decorrelated score function:
[TABLE]
where is any estimator satisfying Assumption 3.1 on error bounds for , and both the Lasso and the Dantzig selector for are suitable. takes the form:
[TABLE]
which is another estimator for , and
[TABLE]
We will show that is asymptotically Gaussian with covariance matrix . Thus we construct the following confidence region for , with asymptotic confidence coefficient :
[TABLE]
This is a dimensional elliptical ball with center vector . The following theorem shows the weak convergence result of
[TABLE]
Theorem 3.4.
Under model (3) with i.i.d. sub-Gaussian noise with variance and sub-Gaussian parameter , Theorems 3.1 and 3.2 hold for under each corresponding condition, and the constants ’s also depend on .
Remark 3.7.
In the definition of one-step estimator , we use instead of for theoretical convenience. Theorem 3.4 would still hold true if is defined as .
Remark 3.8.
We have exactly the same theoretical result for and , and this is due to the close relationship between these two quantities. In particular,
[TABLE]
compared to We show in the proof of Theorem 3.4 that also satisfies Assumption 3.3 as an estimator for .
Remark 3.9.
The one-step estimator is asymptotically unbiased, and shares a similar form with the de-biased estimator proposed by Zhang and Zhang (2014) and Van de Geer et al. (2014). The de-biased estimator in Van de Geer et al. (2014) would take the following form under our setting:
[TABLE]
where is computed by node-wise regression, as an estimator for . When , this is essentially the same as our estimator , but would be slightly different in the multivariate case. Note that the asymptotic covariance matrix for equals the partial information matrix ), and thus is semi-parametric efficient, while is only efficient when it is a scalar.
Remark 3.10.
is also very similar to the test statistic proposed by Neykov et al. (2018) for the VAR model with lag 1. The only difference lies in the estimation of Var, and they only consider the Dantzig selector for estimating and . We provide a detailed comparison of their theoretical results with ours in Section 3.6.
3.6 Special case: AR(1) with Gaussian noise
Our theoretical guarantee covers VAR models with lag and sub-Gaussian noise, of which the AR(1) model and Gaussian noise are special cases. Here we explain the consequences of our result under this special case and provide a comparison with Neykov et al. (2018).
When we consider lag , the constraint for becomes
[TABLE]
with . The two sparsity conditions and sample size requirement are included in the conditions Neykov et al. (2018) proposes. In addition, they assume the following:
[TABLE]
for some . Note that we don’t require these conditions, among which the first and third are quite strong, and the second one is sufficient for our condition . This follows since if ,
[TABLE]
So far, the discussion has focused on the case where are i.i.d. sub-Gaussian noise of scale factor , with being the variance of and lower bounded by some constant. Thus our setting covers the case where with . If with as assumed in Neykov et al. (2018), we can still prove the same theoretical guarantee, under an even weaker condition based on spectral density, due to established concentration bounds in Basu et al. (2015).
4 Numerical Experiments
In this section, we provide a simulation study to validate our theoretical results. For simplicity, our simulation is based on the AR(1) model:
[TABLE]
where is set to be row-wise sparse. Symmetry is not required in our theory, but in order to ensure the sparsity of , we focus on symmetric matrices under , and slightly asymmetric ones under . The eigenvalues of all fall inside the unit circle of the complex plane, which ensures the existence of a stationary solution to this model. White noise is simulated as independent in order to satisfy the sub-Gaussianity condition. Other distributions were also used but are not reported since the results were very similar.
To consider multi-variate test sets, throughout the simulation we test the index set with , which involves three different rows and two columns in each row:
[TABLE]
The null hypothesis takes the form with some -dimensional vector . Correspondingly, we consider alternative hypothesis , with randomly selected from -dimensional Gaussian distribution, and ranges from to .
Under , we generate with different row-wise sparsity levels and structures, and for each , vector may differ depending on the corresponding . Under , are still the same matrices as under , except that the tested entries are shifted by . The experiments are repeated under different settings of , , and .
We use Lasso estimators defined in (25), (27) for the estimation of and , , and tuning parameters , are selected using cross validation. In cross validation, the training sets are composed of consecutive time series data, with the remaining 10% of the original data set being testing sets. Under , 1000 simulations are carried out under each parameter setting, while under , we have 100 simulations. In the following sections, we look into false positive rates (FPR) and true positive rates (TPR) of test statistics and as defined in (32) and (36), when we set the level of test as .
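The accept/reject decision behind the FPR and TPR figures can be sketched in pure Python. The test set described above contains three rows with two columns each, so the reference distribution has six degrees of freedom; for even degrees of freedom the chi-square tail probability has a closed form, which avoids any external dependency. The level 0.05 is an assumption inferred from the value to which the TPR is reported to converge in Section 4.2, and the function names are hypothetical.

```python
import math

def chi2_sf_even(x, df):
    """P(chi-square with `df` degrees of freedom > x), valid for even `df` only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form below requires a positive even df")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term, total = 1.0, 1.0
    # Tail of a chi-square with 2k degrees of freedom:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

def reject(stat, df=6, level=0.05):
    """Reject the null when the p-value falls below the (assumed) level."""
    return chi2_sf_even(stat, df) < level

def rejection_rate(stats, df=6, level=0.05):
    """Fraction of rejections: the FPR under the null, the TPR under the alternative."""
    return sum(reject(s, df, level) for s in stats) / len(stats)

print(round(chi2_sf_even(12.592, 6), 3))
print(rejection_rate([1.0, 5.0, 13.5, 20.0]))
```

Applied to the statistics from repeated simulations, `rejection_rate` gives the empirical false positive rate when the data are generated under the null and the empirical true positive rate otherwise.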
4.1 Under the Null Hypothesis
- (1) Varying sparsity
Here we summarize the experiments with randomly generated , that are symmetric and row-wise sparse, with different sparsity levels defined in (10). Figure 1 shows how FPR of and averaged over 1000 experiments vary with . We can see that when increases to about 500, the FPR becomes stable and close to regardless of , choice between and .
When the sample size is small, the test tends to be conservative, which is the consequence of estimating variance and covariances ’s. In the simulation we use naive estimators for these two quantities, as defined in (31) and (29), which tend to be smaller than the true parameters. This is because we usually fit noise in the regression, as noticed by Fan et al. (2012). As shown in these two figures, is less conservative than when is small, since the magnitude of is larger than , which probably makes a better estimator for .
We also summarize the FPR when the variance of is known in Figure 2. We can see from these figures that is still a little conservative when is small, while with substituted by is not conservative.
- (2)
Different Graph Structures
If we consider the actors in the time series as nodes in a network, and a nonzero represents a directed edge from to , then each matrix corresponds to a -dimensional directed graph. We experiment with different structures of , which also correspond to different graph structures, including block graphs and chain graphs. Specifically, we consider matrices with norm equal to 0.75:
[TABLE]
which is a block graph;
[TABLE]
with constant chosen to ensure , which is a chain graph; and being a randomly generated symmetric matrix of sparsity level , with largest eigenvalue equal to 0.75. Figure 3 shows the difference among these three structures. We can see that the block graph is less accurate than the other two, which is due to a larger variance for each . How graph structure theoretically influences testing performance remains an open and interesting question.
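As a rough illustration of the block and chain structures, the sketch below builds one matrix of each type and rescales it to norm 0.75. The exact entries used in the experiments are not reproduced here; the all-ones blocks, the unit off-diagonal weights, the block size, and the choice of the spectral norm are assumptions made for this example, and the helper names are hypothetical.

```python
import numpy as np

def scale_to_norm(A, target=0.75):
    """Rescale A so that its operator (spectral) norm equals `target`."""
    return A * (target / np.linalg.norm(A, 2))

def chain_graph(M):
    """Symmetric chain: node i is linked to nodes i-1 and i+1."""
    A = np.zeros((M, M))
    idx = np.arange(M - 1)
    A[idx, idx + 1] = 1.0
    A[idx + 1, idx] = 1.0
    return A

def block_graph(M, block_size):
    """Block-diagonal graph: all-ones blocks along the diagonal."""
    n_blocks = M // block_size
    return np.kron(np.eye(n_blocks), np.ones((block_size, block_size)))

M = 20  # hypothetical number of actors
A_chain = scale_to_norm(chain_graph(M))
A_block = scale_to_norm(block_graph(M, block_size=5))
```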
4.2 Alternative Hypothesis
First we look into how the true positive rate (TPR) varies with , since we set as and may be viewed as a measure of distance from the null hypothesis. Fig. 4 only presents the simulation results when and , while the other choices of and generate very similar results. We can see from these two figures that as increases, TPR approaches 1. The slope increases when sample size gets larger, or when the test statistic changes from to . This aligns with intuition, since when increases, we are supposed to distinguish between and better, and is more conservative than as we show in subsection 4.1.
We also check the influence of . Figure 5 reveals how TPR changes when increases, if we set and fixed. If , TPR converges to 1 very quickly, while if , TPR converges to 0.05, but the convergence is slower when or increases. When , Theorems 3.3 and 3.4 state that and would converge to , thus the TPR should converge to some value between 0.05 and 1, depending on and . The black lines in Figure 5 indicate this limiting value, but since the test tends to be conservative when is not large enough, TPR when is usually above the black line. The conservative issue is more severe under since the deviation is also multiplied by the estimated variances, which exaggerates the conservative tendency. However, this may not be a big concern under , since we always want the TPR to be large.
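The downward bias of naive variance estimates, which Section 4.1 attributes to fitting noise in the regression, can be illustrated in a simpler setting. The sketch below swaps the Lasso for ordinary least squares, where the effect has a closed form: the residual sum of squares divided by n has expectation sigma^2 (n - p) / n. The sizes, coefficients, and seed are hypothetical, and this is only an analogy for the behaviour of the estimators in (29) and (31).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma2 = 50, 20, 1.0   # hypothetical sizes; the true noise variance is 1
n_rep = 2000

naive, corrected = [], []
for _ in range(n_rep):
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[:3] = 1.0
    y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta_hat) ** 2)
    naive.append(rss / n)            # ignores the degrees of freedom used by the fit
    corrected.append(rss / (n - p))  # classical unbiased estimator for least squares

mean_naive = float(np.mean(naive))          # close to sigma2 * (n - p) / n = 0.6
mean_corrected = float(np.mean(corrected))  # close to sigma2 = 1.0
```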
5 Proof Overview
One of the main contributions of this work is the proof technique, which addresses a number of technical challenges and develops novel concentration bounds for dependent sub-Gaussian random vectors. In this section, we present and discuss key lemmas for the proof and provide the main steps for proving Theorems 3.1 and 3.2, deferring the more technically intensive steps to the supplement.
5.1 Key Lemmas
The major technical challenge lies in proving the following two concentration bounds for dependent sub-Gaussian random vectors.
**Lemma 5.1** (Deviation Bound for ).
Under model (3), when are sub-Gaussian noise with scale factor , and ,
[TABLE]
When .
Lemma 5.1 is a standard deviation bound used to prove estimation error bounds for Lasso-type or Dantzig-selector-type estimators. We apply this lemma in the proofs of Theorems 3.1 and 3.2 and of Lemma 3.1.
**Lemma 5.2.**
Under model (3), when are sub-Gaussian noise with constant scale factor , and , if is a symmetric matrix, we have
[TABLE]
Lemma 5.2 provides a concentration bound for the sample average of a general quadratic form , and is very helpful in proving the martingale CLT under our setting, the REC, Lemma 3.3, etc.
In the Gaussian case, both of these lemmas follow from prior work in Basu et al. (2015), which relies on the fact that dependent Gaussian vectors can be rotated to be independent. Since dependent sub-Gaussian random variables cannot be rotated to be independent (only uncorrelated), we exploit the independence of by representing each as a linear function of the infinite series and then use a careful truncation argument: we analyze sufficiently many terms in the summation and control the infinite remainder.
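The truncation idea can be illustrated numerically. The sketch below assumes a VAR(1) recursion X_t = A X_{t-1} + eps_t with operator norm of A below one, started from X_0 = eps_0 rather than from the stationary distribution; the dimension, truncation level `K`, and noise law are hypothetical. It checks that the first K terms of the moving-average expansion reproduce X_t up to a remainder that decays geometrically in K.

```python
import numpy as np

rng = np.random.default_rng(2)
M, T, K = 5, 200, 30             # dimension, series length, truncation level (hypothetical)
A = rng.standard_normal((M, M))
A *= 0.6 / np.linalg.norm(A, 2)  # operator norm 0.6 < 1, so the series converges

# Simulate X_t = A X_{t-1} + eps_t with bounded (hence sub-Gaussian) noise.
eps = rng.uniform(-1.0, 1.0, size=(T, M))
X = np.zeros((T, M))
X[0] = eps[0]
for t in range(1, T):
    X[t] = A @ X[t - 1] + eps[t]

# Truncated moving-average representation: sum_{j < K} A^j eps_{t-j}.
powers = [np.linalg.matrix_power(A, j) for j in range(K)]
t = T - 1
X_trunc = sum(powers[j] @ eps[t - j] for j in range(K))

# The remainder equals A^K X_{t-K}, so its norm is at most ||A||^K ||X_{t-K}||.
residual = np.linalg.norm(X[t] - X_trunc)
bound = 0.6 ** K * np.linalg.norm(X[t - K])
```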
5.2 Proof of Theorem 3.1
Proof.
Suppose . We will use to refer to constants that only depend on (not or ), and different constants might share the same notation.
The proof can be divided into two major parts: showing the convergence of to , and bounding the estimation error . Formally, for any ,
[TABLE]
and
[TABLE]
which implies
[TABLE]
In the following, we provide bounds on each of the three terms. The following lemma shows the uniform weak convergence rate of to , of which the convergence of to is a special case.
**Lemma 5.3** (Convergence Rate of ).
Under model (3) with being sub-Gaussian noise of scale factor , then for any , ,
[TABLE]
when for some absolute constant , where is a constant depending on and is non-decreasing with respect to .
This lemma is proved in Appendix C by applying a uniform martingale central limit theorem. Thus, by Lemma 5.3, if for some constant ,
[TABLE]
Meanwhile,
[TABLE]
since has bounded density.
Now we only need to choose a proper and bound .
[TABLE]
Define , then (40) turns into
[TABLE]
We can bound using Lemma 5.3 and using Lemma 19, while for bounding the estimation induced error , we first apply the following lemma to bound the eigenvalues of .
**Lemma 5.4.**
Consider the model (2) with independent noise of unit variance, satisfies (13), then the eigenvalues of can be bounded as follows:
[TABLE]
Lemma 5.4 is proved based on established results in Basu et al. (2015). Note that we assumed unit variance in Theorem 3.1 and 3.2, so we can apply Lemma 5.4 here. Since , applying Lemma 5.4 would lead us to the following:
[TABLE]
Thus we have
[TABLE]
with
[TABLE]
The following two lemmas provide bounds for , and
[TABLE]
**Lemma 5.5.**
When ,
[TABLE]
Lemma 5.1 is a common condition in high-dimensional regression problems, and is usually referred to as a deviation bound. We will prove it in Appendix C.
**Lemma 5.6** (Deviation Bound for ).
With probability at least , for all ,
[TABLE]
Lemma 5.6 can also be viewed as a deviation bound, if we consider a regression problem with as response and as covariates. This is also proved in Appendix C. Applying Assumptions 3.1 and 3.2, with probability at least ,
[TABLE]
where
[TABLE]
and Assumptions 3.1 and 3.2 imply and . The former is not straightforward: to see why it holds, let and ; then we have
[TABLE]
Here we apply Assumption 3.1, and the fact that
[TABLE]
The last inequality is due to Lemma 5.4 and the following lemma:
**Lemma 5.7.**
With probability at least ,
[TABLE]
Therefore, by taking a union bound, we show that
[TABLE]
for any , with probability at least .
Meanwhile, by applying Lemma 5.3, one can show that for ,
[TABLE]
where the second inequality is due to a tail bound established in Laurent and Massart (2000) (see Lemma 1 in Laurent and Massart (2000)), and the third inequality comes from the fact that, constant , constant such that
[TABLE]
Let and plug it into (41), then with Assumption 3.3, we can show that with probability at least
[TABLE]
the following holds:
[TABLE]
if and for some constant . Therefore, applying (38) with ,
[TABLE]
Since constants only depend on and , this bound also holds for supremum over and . Note that for a clear presentation, we are not showing the sharpest bound, which can be obtained by choosing a different . ∎
5.3 Proof of Theorem 3.2
Proof of Theorem 3.2.
We prove this case by case. We will use to refer to constants that only depend on , and different constants might share the same notation.
Similar to the proof of Theorem 3.1, the major part of the proof is devoted to bounding with high probability for some vector .
- (1)
Suppose . Using similar deduction as in the proof of Theorem 3.1, for any ,
[TABLE]
- (a)
Bounding the first two terms
The first term is the convergence rate of to . By Lemma 5.3,
[TABLE]
The last inequality is due to
[TABLE]
and an upper bound for in (42).
Bounding the second term in (46) is not as straightforward as bounding in the proof of Theorem 3.1, since is not a constant vector when takes different values in . We only have a uniform bound on as shown above. One can show that
[TABLE]
where is a -dimensional standard Gaussian random vector with density . The last inequality holds because, for any set ,
[TABLE]
Suppose , then if ,
[TABLE]
otherwise,
[TABLE]
Thus,
[TABLE]
- (b)
Bounding
Similar to (41) in the proof of Theorem 3.1, it is straightforward to show that
[TABLE]
where . To bound , note that
[TABLE]
and
[TABLE]
with and defined as follows:
[TABLE]
[TABLE]
Therefore,
[TABLE]
The last inequality applies (42). Meanwhile,
[TABLE]
The first equality and second inequality come from the definition of and ; the third inequality holds because ; the fourth inequality holds because ; and the last inequality is obtained from Lemma 5.4. Applying Lemma 5.7 leads us to
[TABLE]
We can write as
[TABLE]
Note that
[TABLE]
due to Lemma 5.4 and 5.7, which further implies
[TABLE]
Applying Assumption 3.1 to 3.3, Lemma 5.1, 5.6, one can show that with probability at least ,
[TABLE]
with the same arguments as bounding under .
While for , applying Lemma 5.3 leads us to
[TABLE]
for any , where . We apply the tail bound for (Lemma 1 in Laurent and Massart (2000)) as in (45), and obtain
[TABLE]
when for some constant . Let , and plug , (51) and (19) into (47), one can show that
[TABLE]
with probability at least
[TABLE]
if and .
Therefore, applying (46) with leads to
[TABLE]
Since constants only depend on , this bound also holds for the supremum over and .
- (2)
First we provide a lower bound for with high probability. Since the bounds in Assumptions 3.1 to 3.3 and Lemmas 5.1 to 5.7 hold with probability at least , we apply these bounds directly in the following deduction. Meanwhile, we always assume and for desired constant . With these conditions, one can show that
[TABLE]
The third line is due to Assumption 3.3, which implies converges to 0 under our scaling .
We provide a lower bound for in the following. First write as
[TABLE]
we find the upper bounds for and lower bound for in the following. Applying Assumption 3.2 and Lemma 5.1 provides an upper bound for :
[TABLE]
Since
[TABLE]
then using the same argument as bounding when proving Theorem 3.1, we have
[TABLE]
To lower bound , first note that
[TABLE]
where we apply (49), Lemma 5.7, Assumption 3.2, and bound using the same argument as in (50). Thus,
[TABLE]
since is a constant vector, and is lower bounded by constant as in (42).
Applying these bounds for , one can show that,
[TABLE]
Plugging this into (52) and applying Lemma 5.3, we have
[TABLE]
where in the last line we apply the tail bound as in (45). Since the constants here only depend on , this bound holds when taking the supremum over and .
- (3)
The proof of this case is similar to that of Theorem 3.1. The only difference lies in the choice of and in bounding . The bound (41) for still holds here, with . We directly apply the bounds in Assumptions 3.1 to 3.3, and Lemma 5.1 to Lemma 5.7 in the following. First we write
[TABLE]
Note here that the first three terms are exactly the same as in (43), and thus can be bounded as in the proof of Theorem 3.1. We only have to tackle the last term. By (53), one can show that,
[TABLE]
Thus, going through the same arguments as bounding under , we have
[TABLE]
with probability at least . Recall that in (45), when for some constant ,
[TABLE]
Let , then by (41) one can show that
[TABLE]
with probability at least
[TABLE]
if and for some constant . Therefore, applying (38) with ,
[TABLE]
Since constants only depend on , this bound also holds for supremum over and .
∎
6 Conclusion
In this paper, we have provided theoretical guarantees for hypothesis tests for sparse high-dimensional auto-regressive models with sub-Gaussian innovations. Specific upper bounds for the convergence rates of test statistics are given. Importantly, our results go beyond the Gaussian assumption and do not rely on mixing assumptions. As a consequence of our theory, we also develop novel concentration bounds for quadratic forms of dependent sub-Gaussian random variables using a careful truncation argument.
It would be of interest to consider other variance estimation methods, e.g., the scaled Lasso of Sun and Zhang (2012) or the cross-validation based method of Fan et al. (2012), and to establish corresponding theoretical guarantees. There also remain a number of open questions and challenges, including extensions to generalized linear models, heavy-tailed innovations, and incorporating hidden variables in the time series setting.
Acknowledgements
We would like to thank both Sumanta Basu and Yiming Sun for useful discussions and comments. LZ and GR were supported by ARO W911NF-17-1-0357 and NGA HM0476-17-1-2003. GR was also supported by NSF DMS-1811767.
Appendix A Proof of Lemmas in Section 3.3
Proof of Lemma 3.1.
We prove the error bounds for each and then take a union bound. Without loss of generality, we consider the estimation of . With a little abuse of notation, let , , , and ( is not the decorrelated score function we defined in section 9). We would like to bound , and under two cases separately:
- (1)
.
Here we adopt the standard proof framework for Lasso. By (25) we know that satisfies
[TABLE]
which implies
[TABLE]
Rearranging the terms, we have
[TABLE]
The last line is due to that
[TABLE]
By Lemma 5.1, with probability at least ,
[TABLE]
Meanwhile, since is positive semi-definite,
[TABLE]
We have the following restricted eigenvalue condition for .
**Lemma A.1.**
Under the model specified in (3) with independent sub-Gaussian noise of constant scale factor, and , for any set , positive integer , satisfies the following REC:
[TABLE]
with probability at least , when . Here , constant depends on , and depend on and .
Here , , by Lemma A.1, when ,
[TABLE]
with probability at least , when . Thus
[TABLE]
which implies
[TABLE]
with probability at least .
- (2)
.
Here we adopt the standard proof framework for Dantzig selector. By (26),
[TABLE]
By Lemma 5.1, when , with probability at least ,
[TABLE]
which implies
[TABLE]
Meanwhile, by (55),
[TABLE]
Here , , by Lemma A.1, when ,
[TABLE]
with probability at least , when . Thus
[TABLE]
which implies
[TABLE]
with probability at least .
Therefore, after taking a union bound over , the proof is complete. ∎
Proof of Lemma 3.2.
Without loss of generality, we consider the estimation of and then take a union bound. Let , , and . Then we prove upper bounds for and with high probability under two cases.
- (1)
.
Looking into the definition (27) of , it is clear that the optimization can be viewed as separate optimization problems, in terms of each column of . Thus
[TABLE]
The following proof is almost identical to the proof in Lemma 3.1 under , except some difference in notation and application of Lemmas. One can show that,
[TABLE]
Rearranging the inequality gives us
[TABLE]
By Lemma 5.6, with probability at least ,
[TABLE]
which implies,
[TABLE]
Let be defined as the following:
[TABLE]
By Lemma A.1, when , with probability at least ,
[TABLE]
which implies
[TABLE]
and
[TABLE]
with probability at least .
- (2)
.
By (28),
[TABLE]
This proof is also very similar to the proof of Lemma 3.1 under the case where . By Lemma 5.6,
[TABLE]
with probability at least . Thus,
[TABLE]
Meanwhile, by (58),
[TABLE]
which further implies
[TABLE]
Recall the definition of in (57); then by Lemma A.1, (59) and (57), when ,
[TABLE]
which implies
[TABLE]
and
[TABLE]
with probability at least .
Since
[TABLE]
and
[TABLE]
taking a union bound over and all columns of , the proof is complete. ∎
Proof of Lemma 3.3.
The following established result can be applied here:
**Lemma A.2.**
For any invertible matrix , if is also invertible, then
[TABLE]
Since , one can show that for ,
[TABLE]
where . Due to (42),
[TABLE]
In the following we bound . Write as
[TABLE]
where is defined as in (48). Actually,
[TABLE]
which is the maximum over deviations of some quadratic forms from their expectation. The following lemma provides a bound for quadratic form , with being any symmetric matrix.
By Lemma 5.2, we only need to bound the trace norm and operator norm of
[TABLE]
The following lemma establishes the relationship between and for symmetric matrices.
**Lemma A.3.**
For any symmetric matrix of rank , .
Since is of rank 2,
[TABLE]
Meanwhile, similar to (49), we bound by
[TABLE]
where the second inequality is due to that . Thus, both the trace norm and norm of can be bounded by constant, and applying Lemma 5.2 gives us
[TABLE]
Meanwhile, by Lemma 5.6 and Assumption 3.2, with probability at least ,
[TABLE]
and
[TABLE]
Here the second line holds because is symmetric and positive semi-definite, so we can apply the Cauchy-Schwarz inequality. When .
[TABLE]
which implies
[TABLE]
Therefore, taking a union bound over , with probability at least ,
[TABLE]
when . ∎
Appendix B Proof of Theorem 3.3 and Theorem 3.4
Proof of Theorem 3.3.
Now we consider model (3), with unknown . Under this model, we use the notation for the quantity defined in the following:
[TABLE]
As explained in Section 3.4, satisfies Theorem 3.1 and 3.2 under each corresponding condition. We show in the following that we only need to control the estimation error of . Note that for any ,
[TABLE]
and
[TABLE]
For any distribution function ,
[TABLE]
Recall that Theorem 3.1 and 3.2 establish bounds for under , or under with , for when , and for when . Thus we only need to bound , and with or . Since ,
[TABLE]
Meanwhile,
[TABLE]
By Assumption 3.1 and Lemma 5.1, with probability at least ,
[TABLE]
and
[TABLE]
Also, since are independent sub-Gaussian random variables with scale factor , the first term can be bounded by a Bernstein-type inequality for sub-exponential random variables (see Proposition 5.16 in Vershynin [2010]):
[TABLE]
Let , then
[TABLE]
While for with any satisfying , if ,
[TABLE]
Here is a standard Gaussian random vector; the third line holds because the density of is , and the fourth line applies the fact that when ,
[TABLE]
Meanwhile, when ,
[TABLE]
and when ,
[TABLE]
which implies
[TABLE]
To see why all the bounds for still hold for , note that we only need to add to the bounds under , and under when , which only changes the constant factors of the previous bounds. For the bound under when , we substitute by with , and add , which only changes the constant factors as well. Therefore, all the conclusions for in Theorem 3.1 and 3.2 still hold for under each corresponding condition. ∎
Proof of Theorem 3.4.
First we show the connection between and . Note that
[TABLE]
which implies
[TABLE]
Thus
[TABLE]
and the only difference between and is that we substitute by . We only need to prove that satisfies Assumption 3.3. The argument is very similar to the proof of Lemma 3.3, but we need to bound instead of here.
Let , then
[TABLE]
Recall that when proving Lemma 3.3, we already upper bound by with probability at least . Thus for any vector s.t ,
[TABLE]
which implies , and . We bound in the following. One can show that
[TABLE]
Applying (42), (62), Lemma 5.7, we have
[TABLE]
Thus, with Lemma 5.6, Assumption 3.2, and (63), we show that with probability at least ,
[TABLE]
Therefore, using the same arguments as in the proof of Lemma 3.3,
[TABLE]
By Lemma A.2,
[TABLE]
∎
Appendix C Proof of Lemmas in Section 5
Proof of Lemma 5.3.
Let
[TABLE]
Define filtration , then is a martingale difference sequence, and . To bound the convergence rate, we are going to use a modified version of Lemma 4 in Grama and Haeusler (2006).
**Lemma C.1.**
Let be a martingale difference sequence taking values in . Let , and . Define ,
[TABLE]
Then , when ,
[TABLE]
where , is non-decreasing as increases.
By Lemma C.1, to bound , we only need to bound .
[TABLE]
Here the second line is due to , and the third line holds because is a convex function. More specifically,
[TABLE]
While for the last line, since is sub-Gaussian with parameter , . Note that are all viewed as constants here. Due to the sub-Gaussianity of ’s, we have the following lemma.
**Lemma C.2.**
[TABLE]
Therefore,
[TABLE]
which implies
[TABLE]
While for , since
[TABLE]
where ,
[TABLE]
where the second line holds because is of rank at most , so we can apply Lemma A.3; the last line is due to
[TABLE]
Since
[TABLE]
by Lemma 5.2, we only need to bound the operator norm and trace norm of
[TABLE]
By (61) and (62), we have the following:
[TABLE]
Therefore, applying Lemma 5.2 leads us to
[TABLE]
which implies
[TABLE]
Thus,
[TABLE]
By Lemma C.1, for any , , and , when ,
[TABLE]
The best rate is achieved when , and thus when ,
[TABLE]
∎
Proof of Lemma 5.4.
We prove the lower and upper bounds for eigenvalues of , by establishing a connection between our stability condition (13) and another spectral density based condition proposed in Basu et al. [2015]. First we introduce the following lemma, which is a direct result of proposition 2.3 and (2.6) in Basu et al. [2015] under our setting.
**Lemma C.3.**
Under the model specified in (3) with independent noise of unit variance, the eigenvalues of can be bounded as follows:
[TABLE]
where , and .
By Lemma C.3, we only need to prove that condition (13) implies a lower bound for and upper bound for . First note that
[TABLE]
where the last equality is due to that . Meanwhile, for any ,
[TABLE]
where we apply condition (13) in the last inequality. Thus .
While for bounding , we start by bounding for . Here we define , and for all . Since
[TABLE]
one can show that , and for . Thus
[TABLE]
and . We have the following claim:
[TABLE]
This can be proved by induction. It is clear that , and if (64) holds for ,
[TABLE]
Therefore, can be bounded in the following:
[TABLE]
With Lemma C.3, we conclude that
[TABLE]
where , and . ∎
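A quick numerical sanity check of eigenvalue bounds of this type is sketched below. It assumes a VAR(1) model with identity noise covariance and takes the stability condition to be an operator-norm bound rho < 1 on the transition matrix; the precise condition (13) and the constants in Lemma 5.4 may differ. Under that assumption the spectral-density argument gives eigenvalues of the stationary covariance between 1/(1 + rho)^2 and 1/(1 - rho)^2.

```python
import numpy as np

rng = np.random.default_rng(3)
M = 8                                # hypothetical dimension
A = rng.standard_normal((M, M))
rho = 0.7
A *= rho / np.linalg.norm(A, 2)      # operator norm rho < 1 (assumed stability condition)

# Stationary covariance of X_t = A X_{t-1} + eps_t with Cov(eps_t) = I:
# Sigma = sum_{j >= 0} A^j (A^j)^T, truncated once the terms are negligible.
Sigma = np.zeros((M, M))
term = np.eye(M)
for _ in range(500):
    Sigma += term @ term.T
    term = A @ term

eigs = np.linalg.eigvalsh(Sigma)
lower, upper = 1.0 / (1.0 + rho) ** 2, 1.0 / (1.0 - rho) ** 2
```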
Proof of Lemma 5.1.
Recall that . Define as the following:
[TABLE]
then we can also write as an infinite sum . Without loss of generality, we consider the first entry of :
[TABLE]
In the following, we tackle the infinite sum in (66) by focusing our analysis on a finite sum and letting the remainder converge to 0. Formally, for any positive integer , let
[TABLE]
and satisfying , then we have
[TABLE]
We will let be sufficiently large in a later argument. The following argument is divided into two parts: bounding and .
- (1)
Bounding
Since all entries of are independent sub-Gaussian with constant parameter, we can apply the following Hanson-Wright inequality:
**Lemma C.4.**
Let be a random vector with independent components which satisfy and . Let be an matrix. Then, for every ,
[TABLE]
This lemma is a result in Rudelson et al. [2013]. By Lemma C.4, we only need to bound the norms of .
First note that
[TABLE]
For any with unit norm, one can show that
[TABLE]
where , , and is a matrix with each entry . Since is a Toeplitz matrix, we will use the following lemma to bound its norm.
**Lemma C.5.**
Let be a Fourier series defined as , with . We define a sequence of Toeplitz matrices with , then the operator norm of is bounded by
[TABLE]
where ess sup denotes the essential supremum.
This is actually Lemma 4.1 in Gray et al. [2006], and we directly apply it here. By Lemma C.5,
[TABLE]
Thus . While for the Frobenius norm, we have
[TABLE]
Therefore, by Lemma C.4, for any ,
[TABLE]
- (2)
Bounding
First note that
[TABLE]
Recall the definition of and in the proof of Lemma C.2. Since ,
[TABLE]
by a Bernstein-type inequality for sub-exponential random variables (see Proposition 5.16 in Vershynin [2010]).
Now we bound the second term . Since
[TABLE]
one can show that
[TABLE]
where we apply the fact that , which is shown in the proof of Lemma C.2. Thus we have
[TABLE]
due to the tail bound of sub-exponential r.v. (also see Vershynin [2010]). Since
[TABLE]
[TABLE]
Let be sufficiently large such that , then we arrive at the following
[TABLE]
Let and take a union bound over the entries of , the conclusion follows.
∎
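The Toeplitz step in the proof above (Lemma C.5) can be checked numerically on a simple example. The sketch below uses the hypothetical coefficient sequence t_k = rho^|k|, whose Fourier series has a closed form, and compares the operator norm of the resulting Toeplitz matrix with the supremum of the symbol. It illustrates the type of bound only; the constant in the lemma as stated may differ for non-symmetric sequences.

```python
import numpy as np

def toeplitz_from_symbol(coeffs, n):
    """n x n matrix with (i, j) entry coeffs[|i - j|] (symmetric Toeplitz)."""
    idx = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    return np.asarray(coeffs)[idx]

rho, n = 0.5, 200                    # hypothetical decay rate and matrix size
coeffs = rho ** np.arange(n)         # t_k = rho^{|k|}, absolutely summable

T_n = toeplitz_from_symbol(coeffs, n)
op_norm = np.linalg.norm(T_n, 2)

# Symbol f(lam) = sum_k t_k e^{i k lam} = (1 - rho^2) / (1 - 2 rho cos(lam) + rho^2).
lam = np.linspace(-np.pi, np.pi, 10001)
f = (1 - rho**2) / (1 - 2 * rho * np.cos(lam) + rho**2)
sup_f = np.max(np.abs(f))            # attained at lam = 0, value (1 + rho) / (1 - rho)
```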
Proof of Lemma 5.6.
Without loss of generality, consider
[TABLE]
for any , and . Similar to the proof of Lemma 5.6, we can write it as a quadratic form
[TABLE]
where is defined as in (48). Since is of rank 2, and we have bounded in (62), applying Lemma A.3 leads to
[TABLE]
Applying Lemma 5.2, and taking a union bound over all entries of
[TABLE]
the conclusion follows. ∎
Proof of Lemma 5.7.
Similar to the proof of Lemma 5.1, we consider . Since
[TABLE]
by Lemma 5.2, we need to bound norms of , which is of rank at most 2. One can show that
[TABLE]
with Lemma A.3. Therefore, by taking a union bound, it is clear that
[TABLE]
with probability at least . ∎
Appendix D Proof of Lemmas in Appendix A and Appendix C
Proof of Lemma C.1.
Here we adopt the proof framework for Lemma 4 in Grama and Haeusler [2006], with some small adjustments. First we construct a new martingale difference sequence , the sum of whose covariances equals . Random projections are used for the construction. The following lemma on random projections is stated as Lemma 3 in Grama and Haeusler [2006].
**Lemma D.1.**
*Let and be positive semi-definite matrices. Set , for . Then there exist a sequence of integers and a corresponding sequence of subspaces of such that, with defined as the projection matrix of subspace , for (where ), the following statements hold true for :
is non-negative definite, where ;
, for all ;
for all , where .
Meanwhile, is determined by and .*
Given this claim, can be constructed as follows:
Recall the martingale sequence we consider is , and . Apply the fact with , , and let be the corresponding projection matrices. Let , which is non-negative definite. Define
[TABLE]
where
[TABLE]
Since , for . Thus is also a martingale difference sequence with , when , and . Meanwhile,
[TABLE]
This construction is from Grama and Haeusler [2006]. They also prove that, for any ,
[TABLE]
Since
[TABLE]
for any , we need to bound
[TABLE]
and
[TABLE]
The following functions are defined as a smooth relaxation for indicator function. Let
[TABLE]
where is a normalizing constant s.t. . Then we have if , if , and if . is infinitely many times differentiable on , and since is constant when or , for any fixed order, the derivative of is bounded. For any , let
[TABLE]
where
[TABLE]
In the following proof, we will denote and as and , for brevity. Therefore,
[TABLE]
[TABLE]
Thus,
[TABLE]
Actually, when , the right hand side of (68) can be substituted by
[TABLE]
and
[TABLE]
To bound , we will use the following lemma.
**Lemma D.2.**
For defined as in (70),
[TABLE]
for any , , when , or when and .
The proof of this lemma is deferred to Appendix E. In the following proof, we will always assume the condition or and hold. Therefore, for any ,
[TABLE]
where for some . Meanwhile,
[TABLE]
where for some . Thus, for any ,
[TABLE]
Let , be i.i.d. standard Gaussian random vectors that are independent of , , for , where . Define
[TABLE]
Then follows standard Gaussian distribution. Let , then
[TABLE]
Generally this inequality holds for , since and have the same second order moments, which justifies the fourth line. By the proof of Lemma 4 in Grama and Haeusler [2006],
[TABLE]
thus
[TABLE]
Now we only need to bound and . Assume , then
[TABLE]
Meanwhile,
[TABLE]
The last line holds because
[TABLE]
when , and
[TABLE]
Here clearly is non-decreasing with respect to . Therefore, by (72), (67) and (74), when , for any , , , with ,
[TABLE]
where is non-decreasing with respect to .
∎
Proof of Lemma C.2.
First we introduce the following two norms:
For any random variable ,
[TABLE]
These two norms are related to sub-exponential and sub-Gaussian random variables, and the following lemma shows the connections between the two norms and the scale factor for sub-Gaussian r.v.
**Lemma D.3.**
For any sub-Gaussian r.v. with scale factor , the following hold:
[TABLE]
with some absolute constants , and
[TABLE]
This is an established result in Vershynin [2010]. By Lemma D.3, bounding would be sufficient, and we start from bounding for any . Recall that , with defined as in (65), we can write
[TABLE]
and
[TABLE]
where is defined as . The relationship between and can be established as follows:
[TABLE]
if we define when . We now prove that is integrable so that we can use Dominated Convergence Theorem. Since ’s are all independent sub-Gaussian random variables with parameter ,
[TABLE]
where the second inequality is due to Minkowski’s inequality. Thus,
[TABLE]
where the first equality is due to Monotone Convergence Theorem, and the last line is due to (62) and the fact that
[TABLE]
Therefore, by Dominated Convergence Theorem,
[TABLE]
By Lemma D.3, , and
[TABLE]
Thus
[TABLE]
∎
Proof of Lemma 5.2.
Recall that , where is defined in (65). Similar to the proof of Lemma 5.1, for any positive integer , we can write as follows:
[TABLE]
Then we can bound each from its expectation separately, and will be chosen to be sufficiently large later.
- (1)
Bounding
Let and be defined as
[TABLE]
[TABLE]
Then , and by Lemma C.4 we only need to bound the operator norm and Frobenius norm of .
- i.
Bounding
For any unit vector ,
[TABLE]
where . Let , and be defined as , then
[TABLE]
Thus we only need to bound . Applying Lemma C.5, the largest eigenvalue of Toeplitz matrix can be bounded by
[TABLE]
where the third inequality is due to the Cauchy-Schwarz inequality. Due to (75), we can further obtain
[TABLE]
and we define when for convenience. Therefore,
[TABLE]
- ii.
Bounding
First note that
[TABLE]
and if we write with orthogonal and diagonal (since is symmetric),
[TABLE]
Meanwhile, because and by (75),
[TABLE]
Note that ,
[TABLE]
where the fourth line is due to the Cauchy-Schwarz inequality. Therefore,
[TABLE]
Now we apply Lemma C.4, and arrive at
[TABLE]
- (2)
Bounding
We will show that vanishes when is large enough. First we bound . Since
[TABLE]
[TABLE]
Meanwhile,
[TABLE]
For any , let be sufficiently large such that , , then by tail bound of sub-exponential random variable (see Vershynin [2010]),
[TABLE]
- (3)
Bounding
One can show that
[TABLE]
and
[TABLE]
Thus
[TABLE]
The first line is due to the following fact: For any two sub-Gaussian random variables and , . We can prove this in the following:
[TABLE]
where the first line applies the Cauchy-Schwarz inequality. Thus, with large enough , . Also, , therefore implies the same bound for as the one for :
[TABLE]
In conclusion, for any , if we choose some accordingly,
[TABLE]
∎
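The Hanson-Wright step (Lemma C.4) used in this proof concerns quadratic forms in independent sub-Gaussian coordinates. A small Monte Carlo sketch of that independent case is given below, with Rademacher entries and an arbitrary symmetric matrix; all sizes and the seed are hypothetical. It shows the quadratic form centring at the trace with fluctuations on the scale of the Frobenius norm, which is the regime the inequality describes for moderate deviations. It does not reproduce the dependent setting of Lemma 5.2.

```python
import numpy as np

rng = np.random.default_rng(4)
n, n_rep = 100, 5000                 # hypothetical dimension and number of replicates
B = rng.standard_normal((n, n))
A = (B + B.T) / 2                    # arbitrary symmetric matrix

# Rademacher entries: independent, mean zero, unit variance, sub-Gaussian.
X = rng.choice([-1.0, 1.0], size=(n_rep, n))
quad = np.sum((X @ A) * X, axis=1)   # x^T A x for each replicate

expected = np.trace(A)               # E[x^T A x] = tr(A) for unit-variance entries
fro = np.linalg.norm(A, "fro")
emp_mean = quad.mean()
emp_std = quad.std()                 # at most sqrt(2) * fro for Rademacher entries
```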
Proof of Lemma A.1.
Here we apply some results in Basu et al. [2015] with a slight change in notation. These results reduce the original problem to finding an upper bound for with any fixed unit vector . Specifically, the following lemmas are useful:
**Lemma D.4.**
For any , and ,
[TABLE]
where for any positive integer .
**Lemma D.5.**
[TABLE]
**Lemma D.6.**
Consider a symmetric matrix . If for any vector with , and any ,
[TABLE]
then for any integer ,
[TABLE]
[TABLE]
For any unit vector ,
[TABLE]
Thus can be bounded by Lemma 5.2.
[TABLE]
which implies
[TABLE]
By Lemma D.6, when ,
[TABLE]
with probability at least . Let , then
[TABLE]
with probability at least , when , and depends on and . Here we apply Lemma 5.4 to lower bound the eigenvalues of . ∎
Appendix E Proof of Lemmas D.2, 2.1, A.2, and A.3
Proof of Lemma D.2.
Recall that , with , , and . In order to bound the partial derivatives of composite function, we apply the following lemma which is a direct result of Proposition 1 and 2 in Hardy [2006].
**Lemma E.1.**
Suppose univariate function and : have derivatives and partial derivatives of orders up to , then ,
[TABLE]
where is the set of partitions for , and is a block in . Formally,
[TABLE]
By Lemma E.1, we can write out the th order partial derivatives of :
[TABLE]
Moreover, we can also write as a composite function , with , , and . Then applying Lemma E.1 on gives us
[TABLE]
Note that
[TABLE]
which means that we only need to consider the partitions with all blocks of size or , when calculating the partial derivative of using (77). Also note that we need partitions for blocks within an original partition , we define the following partition set for any partition of size :
[TABLE]
This set includes the unions of partitions for each block within , and each block within the partition of has size bounded by . Let , and , then the partial derivative of can be expanded as
[TABLE]
where we apply the fact that . For each fixed and ,
[TABLE]
and combining this with (78), we have
[TABLE]
In addition, note that when or , and is bounded on . Thus we only have to consider when and when . If and , . Therefore,
[TABLE]
∎
Proof of Lemma 2.1.
Note that
[TABLE]
When is symmetric, , thus
[TABLE]
It is clear that
[TABLE]
Let and , then
[TABLE]
and
[TABLE]
Therefore,
[TABLE]
∎
Proof of Lemma A.2.
Let , then immediately we have , which is equivalent to . Thus the norm of can be bounded by . Moreover, since , we have
[TABLE]
and rearranging terms gives us
[TABLE]
Therefore,
[TABLE]
∎
Proof of Lemma A.3.
First note that any symmetric matrix can be written as , with an orthogonal matrix and a diagonal matrix . By the definition of the trace norm,
[TABLE]
If we denote the non-zero eigenvalues of as , then
[TABLE]
∎
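The proof of Lemma A.3 relies on the eigendecomposition of a symmetric matrix and the definition of the trace norm. As a small numerical illustration of the standard fact behind this step (the trace norm of a symmetric matrix, taken here as the sum of its singular values, equals the sum of the absolute values of its eigenvalues), the sketch below checks it on a 2×2 example. The matrix entries and function names are illustrative assumptions, not quantities from the paper.

```python
import math


def eigenvalues_sym_2x2(a, b, c):
    """Eigenvalues (smaller, larger) of the symmetric matrix [[a, b], [b, c]]."""
    mean = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    return mean - radius, mean + radius


def trace_norm_sym_2x2(a, b, c):
    """Sum of singular values of A = [[a, b], [b, c]].

    The singular values are the square roots of the eigenvalues of
    A^T A = A^2 = [[a^2 + b^2, b(a + c)], [b(a + c), b^2 + c^2]].
    """
    lo, hi = eigenvalues_sym_2x2(a * a + b * b, b * (a + c), b * b + c * c)
    # Guard against a tiny negative value caused by rounding.
    return math.sqrt(max(lo, 0.0)) + math.sqrt(hi)


# Illustrative indefinite symmetric matrix [[1, 2], [2, -3]].
a, b, c = 1.0, 2.0, -3.0
lam_small, lam_large = eigenvalues_sym_2x2(a, b, c)
sum_abs_eigenvalues = abs(lam_small) + abs(lam_large)
trace_norm = trace_norm_sym_2x2(a, b, c)
```

The example uses an indefinite matrix on purpose: with one negative eigenvalue, the trace norm differs from the trace itself, which is why absolute values are needed.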
References
- Ang and Piazzesi [2003] A. Ang and M. Piazzesi. A no-arbitrage vector autoregression of term structure dynamics with macroeconomic and latent variables. Journal of Monetary Economics, 50(4):745–787, 2003.
- Barigozzi and Brownlees [2018] M. Barigozzi and C. T. Brownlees. Nets: Network estimation for time series. 2018.
- Basu et al. [2015] S. Basu, G. Michailidis, et al. Regularized estimation in sparse high-dimensional time series models. The Annals of Statistics, 43(4):1535–1567, 2015.
- Bressler et al. [2007] S. L. Bressler, C. G. Richter, Y. Chen, and M. Ding. Cortical functional network organization from autoregressive modeling of local field potential oscillations. Statistics in Medicine, 26(21):3875–3885, 2007.
- Chen and Wu [2018] L. Chen and W. B. Wu. Testing for trends in high-dimensional time series. Journal of the American Statistical Association, (just-accepted):1–37, 2018.
- Chen et al. [2010] S. X. Chen, L.-X. Zhang, and P.-S. Zhong. Tests for high-dimensional covariance matrices. Journal of the American Statistical Association, 105(490):810–819, 2010.
- Davis et al. [2016] R. A. Davis, P. Zang, and T. Zheng. Sparse vector autoregressive modeling. Journal of Computational and Graphical Statistics, 25(4):1077–1096, 2016.
- Fan et al. [2012] J. Fan, S. Guo, and N. Hao. Variance estimation using refitted cross-validation in ultrahigh dimensional regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(1):37–65, 2012.
