TL;DR
This paper introduces a new nonparametric change-point detection method for regression that is fully data-driven, tuning-free, and effective in both theoretical and practical scenarios, including financial data analysis.
Contribution
It proposes a novel, tuning-free, data-driven change-point detection procedure for regression with proven theoretical guarantees and practical effectiveness.
Findings
Proper control of type I error rate under null hypothesis
High power approaching 1 under alternative hypothesis
Successful detection of change-points in financial data
| Power | Average width of the narrowest detecting window |
|---|---|
| 1.0 | 40.0 |
| 1.0 | 20.5 |
| 1.0 | 15.7 |
| 1.0 | 15.9 |
Nonparametric Change Point Detection in Regression
Valeriy Avanesov
WIAS Berlin
Abstract
This paper considers the prominent problem of change-point detection in regression. The study suggests a novel testing procedure featuring a fully data-driven calibration scheme. The method is essentially a black box, requiring no tuning from the practitioner. The approach is investigated from both theoretical and practical points of view. The theoretical study demonstrates proper control of the first-type error rate under the null hypothesis and power approaching 1 under the alternative. The experiments conducted on synthetic data fully support the theoretical claims. Finally, the method is applied to financial data, where it detects sensible change-points. Techniques for change-point localization are also suggested and investigated.
1 Introduction
The current study addresses the problem of change-point detection, whose applications range from neuroimaging [9] to finance [17, 10, 19, 29]. In many fields practitioners have to deal with processes subject to abrupt, unpredictable changes; hence the need to detect and localize such changes. Throughout the paper we refer to the former problem as break detection and to the latter as change-point localization, adopting the terminology suggested in [4]. The importance of the topic has produced an immense variety of settings and results [18, 2, 28, 25, 1, 43, 45, 15, 22, 16].
In the current paper we focus on break detection and change-point localization in regression. Typically, in a regression setting one considers a dataset of pairs of (possibly) multivariate covariates and univariate responses, and the goal is to approximate the functional dependence between the two. Here we assume that the data points are separated in time. The problem at hand is whether the functional dependence has stayed the same over time and, if not, when the break took place. This setting has attracted considerable attention for decades. Most studies consider linear [33, 23, 7, 6, 8, 32, 27, 24] or piece-wise constant regression [5, 26, 31]. A recent paper [39] allows for a generalized linear model, leaving the proper choice of a parametric model to the practitioner. In contrast, we develop a fully nonparametric method, eliminating the need to choose a parametric family. Some papers (e.g. [33, 6, 39]), however, rely on the fairly general framework of the likelihood-ratio test, which we employ in our study as well. Further, some researchers (see [20] for example) propose a test statistic yet leave the choice of the critical value to the practitioner, whereas we also suggest a fully data-driven way to obtain it.
Contribution of our work consists in a novel break detection approach in regression which is:
- fully nonparametric
- fully data-driven
- working in black-box mode: has virtually no tuning parameters
- capable of multiple break detection
- naturally suitable for change-point localization
- featuring formal results bounding first type error rate (from above) and power (from below)
- performing well on simulated and on real-world data.
Formally, we consider the pairs of deterministic multidimensional covariates and corresponding univariate responses for , where is a compact in . We wish to test a null hypothesis
[TABLE]
versus an alternative (only a single break is allowed for simplicity, Section 2.1 suggests a generalization)
[TABLE]
where denote centered independent identically distributed noise. The functions , and , mapping from the compact to , are assumed to be unknown along with the distribution of .
The approach relies on a likelihood-ratio test statistic. Assume for now that the break can happen only at the time . Then it makes sense to take data points to the left and data points to the right of and to compare the likelihood of these points under a single model with their likelihood under a pair of models explaining the portions of data to the left and to the right of separately. Yet the break can happen at any moment, so we consider the test statistics for all possible time moments simultaneously. Finally, in order to resolve the issue of the proper choice of the window size we suggest considering multiple window sizes at once (e.g. powers of ).
The paper is organized as follows. Section 2 describes the approach. Further, the approach receives a formal treatment in Section 3. Finally, the behavior of the approach is empirically examined in Section 4.
2 The approach
Let us introduce some notation first. Denote the maximal and the minimal window sizes as and . Define a set of central points for each window size as . Further, for each and define vectors composed of the responses belonging to the window to the left of . Correspondingly, vectors are composed of . The concatenation of these two vectors is denoted as . Also, we use , and to denote the tuples of covariates corresponding to , and respectively. For each window size and central point we define the test statistic
[TABLE]
where is a likelihood function which is defined below. Intuitively, the statistic should take extremely large values when the two portions of data before and after are much better explained by a pair of distinct models than by a common one. As we aim to construct a nonparametric approach, we define relying on a well known technique named Gaussian Process Regression [34]. Formally, we model the noise with a normal distribution and impose the zero-mean Gaussian Process prior with covariance function on the regression function
[TABLE]
where is the number of response-covariate pairs under consideration and is a regularization parameter (see (3.4) and (3.5) for its choice). Integrating out we can easily see that the joint distribution of responses given the covariates is modelled as a multivariate normal distribution with zero mean and covariance matrix , such that , where is the Kronecker delta. This observation, followed by taking the logarithm and dropping the non-random additive constants, leads to the following definition of the likelihood :
[TABLE]
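The construction above can be illustrated with a minimal sketch. It assumes the standard Gaussian Process marginal likelihood with covariance `K_ij = k(x_i, x_j) + noise_var * delta_ij` and a squared-exponential kernel; the function names, the 0-based window convention and all parameter values are illustrative assumptions, and the regularization parameter is taken to be absorbed into the kernel variance.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.2, variance=1.0):
    """Squared-exponential covariance between two 1-D arrays of covariates."""
    diff = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_log_likelihood(y, x, noise_var=0.01):
    """Gaussian log marginal likelihood of y, additive constant dropped.

    Assumed covariance: K_ij = k(x_i, x_j) + noise_var * delta_ij.
    """
    K = rbf_kernel(x, x) + noise_var * np.eye(len(x))
    chol = np.linalg.cholesky(K)
    alpha = np.linalg.solve(chol, y)  # chol^{-1} y, so alpha @ alpha = y^T K^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * alpha @ alpha - 0.5 * log_det


def window_statistic(y, x, t, n):
    """Two separate models versus one common model around index t (0-based).

    Uses the n points before t and the n points from t onwards.
    """
    left, right, both = slice(t - n, t), slice(t, t + n), slice(t - n, t + n)
    return (gp_log_likelihood(y[left], x[left])
            + gp_log_likelihood(y[right], x[right])
            - gp_log_likelihood(y[both], x[both]))


rng = np.random.default_rng(0)
x = rng.permutation(np.linspace(0.0, 1.0, 80))
y_null = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(80)
y_break = y_null + np.where(np.arange(80) >= 40, 5.0, 0.0)  # shift after index 40

stat_null = window_statistic(y_null, x, t=40, n=20)
stat_break = window_statistic(y_break, x, t=40, n=20)
print(stat_null, stat_break)
```

With a pronounced break the statistic at the true location is far larger than without one, which is the behavior the test exploits.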
**Remark 2.1.**
The suggested approach shares its local nature with the ones presented in [4, 3, 39], as they use only a portion of the dataset (of size 2n) to construct a test statistic for time . Alternatively, one could use the whole dataset as in [42], yet this is not the best option in the presence of multiple breaks. Consider a setting where a function changes to and back to shortly afterwards. The long tails might "water down" the test statistic. To address this, a method called Wild Binary Segmentation suggests choosing multiple random contiguous sub-datasets of random lengths [20]. Unfortunately, this might lead to excessively long sub-datasets and significantly increase computational complexity. Our approach is free of both issues. See also Remark 3.1 for another motivation for an approach of a local nature.
**Remark 2.2.**
The choice of the covariance function and is rather important in practice. Typically, a parametric family of covariance functions is considered and the optimal combination of hyper-parameters and is chosen via evidence maximization (see Section 4.5.1 in [34] for details).
The approach being suggested rejects the if for some window size and some central point the test statistic exceeds its corresponding critical level given the significance level . Formally, the rejection set is
[TABLE]
As the joint distribution of is unknown, we mimic it with a residual bootstrap scheme in order to allow for the proper choice of the critical levels. First, let us choose some subset of indices to be used for the bootstrap. We assume the response-covariate pairs follow the same distribution, hence we require to be located either to the left or to the right of (we presume the former without loss of generality). Given a collection of pairs , we construct estimates of and the corresponding residuals . Now define the bootstrap counterpart of the response as
[TABLE]
where for all we draw independently and uniformly with replacement from and are independently and uniformly drawn from . At this point we can trivially define the bootstrap statistics in the same way their real-world counterparts are defined by plugging in instead of . Next, using to denote the bootstrap probability measure, we define the quantile functions for each
[TABLE]
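The resampling step and the bootstrap quantiles can be sketched as follows. This is a hedged illustration rather than the exact scheme: it assumes the bootstrap response is the fitted value plus a resampled residual multiplied by a random sign drawn from {-1, +1}, and the function names are hypothetical.

```python
import numpy as np


def bootstrap_responses(fitted, residuals, rng):
    """Draw one bootstrap copy of the responses.

    fitted    -- regression estimates at the covariates of interest
    residuals -- residuals computed on the calibration subset
    Assumed form: fitted value + random sign * residual drawn with replacement.
    """
    resampled = rng.choice(residuals, size=len(fitted), replace=True)
    signs = rng.choice(np.array([-1.0, 1.0]), size=len(fitted))
    return fitted + signs * resampled


def bootstrap_quantile(boot_stats, level):
    """Empirical (1 - level) quantile of the simulated bootstrap statistics."""
    return np.quantile(boot_stats, 1.0 - level)


rng = np.random.default_rng(1)
y_boot = bootstrap_responses(np.zeros(50), np.array([1.0, 2.0, 3.0]), rng)
print(y_boot[:5])
```

Recomputing the test statistics on many such copies gives the samples from which the quantile functions are estimated.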
Finally, we correct the significance level for multiplicity
[TABLE]
and define the critical levels .
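One common way to implement such a multiplicity correction is sketched below. It is an assumption-laden illustration, not necessarily the exact rule (2.7): it searches a grid for the largest per-statistic level at which the share of bootstrap draws with at least one exceedance stays below the target significance level. The array layout `(B, M)`, with one column per pair of window size and central point, and all names are hypothetical.

```python
import numpy as np


def critical_levels(boot_stats, level):
    """Per-statistic (1 - level) empirical quantiles; boot_stats is (B, M)."""
    return np.quantile(boot_stats, 1.0 - level, axis=0)


def familywise_rate(boot_stats, level):
    """Share of bootstrap draws where at least one statistic exceeds its level."""
    exceed = boot_stats > critical_levels(boot_stats, level)
    return exceed.any(axis=1).mean()


def corrected_level(boot_stats, alpha, grid_size=200):
    """Largest per-statistic level on a grid with familywise rate <= alpha."""
    best = 0.0
    n_stats = boot_stats.shape[1]
    for level in np.linspace(alpha / n_stats, alpha, grid_size):
        if familywise_rate(boot_stats, level) <= alpha:
            best = level
    return best


rng = np.random.default_rng(0)
boot = rng.standard_normal((1000, 5))   # stand-in for simulated bootstrap statistics
alpha_star = corrected_level(boot, alpha=0.05)
print(alpha_star, critical_levels(boot, alpha_star))
```

For strongly dependent statistics the corrected level stays close to the target level, whereas for nearly independent ones it moves towards the Bonferroni value.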
If the method rejects , one can localize the change-point as follows. First, define the earliest central point, where is rejected
[TABLE]
Now, if , the change point is located in the interval (up to the significance level ). Therefore, we suggest defining the earliest detecting window
[TABLE]
and use the following change-point location estimator
[TABLE]
**Remark 2.3.**
The estimates may be obtained with any regression method as long as they are consistent under . As we strive for a nonparametric methodology, Gaussian Process Regression trained on is suggested. The theoretical results can be trivially adapted to any kind of a consistent regressor used instead.
**Remark 2.4.**
In practice it may be computationally difficult to obtain enough samples of the bootstrap statistics for the large number of quantiles to be simultaneously estimated. Alternatively, we suggest choosing the critical levels independently of the central point , effectively replacing the rejection region (2.4) with
[TABLE]
as a smaller number of quantiles can be reliably estimated from far fewer bootstrap samples. Clearly, this may lead to some loss of sensitivity.
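A minimal sketch of this simplification, under the assumption that the critical level for each window size is taken as a quantile of the maximum of the bootstrap statistic over all central points; the array layout and the function name are hypothetical.

```python
import numpy as np


def window_critical_levels(boot_stats, level):
    """One critical level per window size.

    boot_stats has shape (B, n_windows, n_points); NaN marks central points
    that are not available for a given window size.
    """
    per_window_max = np.nanmax(boot_stats, axis=2)   # shape (B, n_windows)
    return np.quantile(per_window_max, 1.0 - level, axis=0)


demo = np.arange(24.0).reshape(4, 2, 3)   # 4 bootstrap draws, 2 windows, 3 points
print(window_critical_levels(demo, level=0.0))
```

Only as many quantiles as there are window sizes have to be estimated, at the price of a level that is conservative for most central points.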
**Remark 2.5.**
The method can be easily extended to break detection in multivariate regression. In that case one can consider for the -th component of the outcome, alter the calibration scheme accordingly and make the multiplicity correction (2.7) also account for the dimensionality of the responses (not only for the windows and break locations).
2.1 Multiple break detection
In spite of the fact that we allow for at most one break, the local nature of the test statistic allows for a straightforward application of the test in presence of multiple breaks as well. Again, consider a dataset but assume allows for multiple change-points ( is unknown). Formally, extending the notation and ,
[TABLE]
Then we estimate the location of the first change-point as
[TABLE]
Next, the procedure is recursively called on the rest of the dataset .
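The recursion can be sketched as below. The single-break detector is passed in as a callable; `detect_first` and `toy_detector` are hypothetical stand-ins for the calibrated test with the localization step, used here only to show the control flow.

```python
from typing import Callable, List, Optional, Sequence


def detect_all_breaks(
    data: Sequence[float],
    detect_first: Callable[[Sequence[float]], Optional[int]],
    offset: int = 0,
) -> List[int]:
    """Recursively apply a single-break detector to the rest of the data."""
    location = detect_first(data)
    # Stop when nothing is detected or the estimate would not shrink the data.
    if location is None or location <= 0 or location >= len(data):
        return []
    return [offset + location] + detect_all_breaks(
        data[location:], detect_first, offset + location
    )


def toy_detector(segment: Sequence[float]) -> Optional[int]:
    """Hypothetical stand-in: first index whose value differs from segment[0]."""
    for i, value in enumerate(segment):
        if value != segment[0]:
            return i
    return None


breaks = detect_all_breaks([0, 0, 0, 1, 1, 2, 2, 2], toy_detector)
print(breaks)  # [3, 5]
```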
3 Theoretical results
This section is devoted to the theoretical results. Namely, Section 3.2 presents the bootstrap validity result, claiming that the critical levels yielded by the calibration procedure are indeed chosen in accordance with the significance level . The sensitivity result is reported in Section 3.3. It defines the minimal window width sufficient for the detection of a break and is followed by a corollary providing change-point localization guarantees.
3.1 Assumptions and definitions
In order to state the theoretical results we need to formulate some assumptions and definitions. In particular, we rely on the definition of sub-Gaussian variables and vectors.
**Definition 3.1 (Sub-Gaussianity).**
We say a centered random variable is sub-Gaussian with if
[TABLE]
We say a centered random vector is sub-Gaussian with if for all unit vectors the product is sub-Gaussian with .
Further, we consider two broad classes of smooth functions: Sobolev and Hölder.
**Definition 3.2 (Sobolev and Hölder classes).**
Consider an orthonormal basis in and a function . We call it -smooth Sobolev if
[TABLE]
and we call it -smooth Hölder if
[TABLE]
These properties drive the choice of the regularization parameter . Namely, for sample size large enough we choose
[TABLE]
if the function is Sobolev and
[TABLE]
if the function is -Hölder.
Throughout the paper we use a variety of norms. We use to denote Euclidean norm of a vector or a spectral norm of a matrix. Further, refers to sup-norm for both vectors and matrices (the maximal absolute value of an element), as well as functions (the maximal absolute value of an element of its image), while stands for Frobenius norm of a matrix.
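For concreteness, the norms just listed can be computed with NumPy as follows; the matrix and vector are arbitrary illustrative values.

```python
import numpy as np

A = np.array([[3.0, 0.0], [0.0, -4.0]])
v = np.array([3.0, -4.0])

euclidean = np.linalg.norm(v)          # Euclidean norm of a vector
spectral = np.linalg.norm(A, 2)        # spectral norm: largest singular value
sup_norm = np.max(np.abs(A))           # sup-norm: maximal absolute entry
frobenius = np.linalg.norm(A, "fro")   # Frobenius norm: root of the sum of squares

print(euclidean, spectral, sup_norm, frobenius)
```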
Lemma F.1 (due to [44]), on which we rely, imposes the following two assumptions.
**Assumption 3.1.**
Let there exist and s.t. for eigenfunctions of covariance function
[TABLE]
and for all
[TABLE]
**Assumption 3.2.**
Let for the eigenvalues of covariance function exist positive and s.t. for .
The Matérn kernel with smoothness index satisfies these assumptions. In [44] the authors claim that their results also hold for kernels with non-polynomially decaying eigenvalues, like RBF and polynomial kernels. Since we do not use these assumptions in our proofs directly, so do ours.
Finally, we introduce the assumptions required by our machinery.
**Assumption 3.3.**
Let have the same elements as with the exception of the diagonal and . Assume there exists a positive s.t. for all for
[TABLE]
It would be natural to expect in (3.8) instead of , e.g.
[TABLE]
If the design is regular (e.g. a uniform grid), (3.9) implies (3.8); in general, however, it is (3.8) that our argument requires. We prove the bootstrap validity result (Theorem 3.1) using our Gaussian approximation Lemma B.3, where we have to treat the diagonal and off-diagonal elements of the quadratic forms separately. This is reminiscent of the results in [21], which study the asymptotic distribution of a single quadratic form (we, in contrast, work with the joint distribution of numerous quadratic forms).
**Assumption 3.4.**
Let there exist a positive constant independent of s.t.
[TABLE]
Informally, Assumption 3.3 does not let the GP prior be too unrealistic, while Assumption 3.4 prohibits concentration of measurements in a local area. From a practical perspective we would not want Assumption 3.4 to be violated either, as it ensures that is well-conditioned.
3.2 Bootstrap validity
In this section we demonstrate that the measures and are close in a certain sense, which provides a theoretical justification for our choice of the calibration scheme.
**Theorem 3.1.**
Let , Assumption 3.1, Assumption 3.2 hold, be sub-Gaussian with . Let be -smooth Sobolev and or -smooth Hölder and . Let , , and grow. Also assume for some positive and
[TABLE]
[TABLE]
and finally, let Assumption 3.3 hold for . Then on a set of arbitrarily high probability
[TABLE]
The proof of the theorem is given in Appendix A. The strategy of the proof is typical for bootstrap validity results. First, we approximate the joint distribution of the test statistics with the distribution of some function of a high-dimensional Gaussian vector. This step is handled by our Gaussian approximation result, Lemma B.4. Next, the same is done for their bootstrap counterparts using a different Gaussian vector. Finally, we build a bridge between the two approximating distributions using the fact that the means and variances of these Gaussian vectors are close to each other (see Lemma C.1). The assumptions (3.11) and (3.12) enforce negligibility of the remainder terms involved in Lemma B.4 and Lemma C.1 respectively. In turn, the Gaussian approximation result (Lemma B.4) is obtained using a novel, significantly tailored version of the Lindeberg principle [30, 35, 11, 12]. The proof of the Gaussian comparison result (Lemma C.1) is inspired by the technique used in [41]. We use the Slepian smart interpolant too, yet apply it in a non-trivial way. We believe that Lemma B.4 can also be proven via the Slepian smart interpolant instead of the Lindeberg principle, which might yield a slightly better convergence rate. We leave this for future research.
3.3 Sensitivity result
Consider a setting under . For simplicity, assume there is a single change point at . In order for the break to be detectable we have to impose some discrepancy condition on and . Moreover, in order to guarantee detection we have to require the choice of covariates to make this discrepancy observed. Keeping that in mind we define the observed break extent
[TABLE]
**Theorem 3.2.**
Let the setting described above hold, be sub-Gaussian with . Let , , be -smooth Sobolev and or -smooth Hölder and . Also let , and . Also impose Assumption 3.1, Assumption 3.2, Assumption 3.3 (for ), Assumption 3.4, (3.11), (3.12) and
[TABLE]
Then
[TABLE]
We defer the proof to Appendix D. It is fairly straightforward. First, we bound the test statistics with high probability; next, we use Theorem 3.1 to bound the critical levels as well; finally, we bound the test statistic from below and make sure it exceeds the critical level. The assumption (3.15) essentially requires the observed break extent to exceed the precision of the Gaussian Process Regression predictor.
**Remark 3.1.**
The sensitivity result gives rise to another motivation behind simultaneous consideration of wider and narrower windows (and also it is another argument for local statistics in the first place, also see Remark 2.1). Consider a hostile setting, where the values of functions and coincide for most of the arguments. For instance, let and let for all . Then by definition and hence the assumption (3.15) implies
[TABLE]
Clearly, a narrower window detects a smaller break of such a kind.
**Remark 3.2.**
In the setting allowing for multiple change-points (see Section 2.1) assumption (3.15) dictates the requirement for the minimal distance between two consecutive change-points as which is sufficient for detection of all the change-points with probability approaching .
Finally, we formulate a trivial corollary providing change-point localization guarantees.
**Corollary 3.1.**
Under the assumptions of Theorem 3.2
[TABLE]
4 Empirical Study
In this section we report the results of our experiments (the code is available at github.com/akopich/gpcd). Section 4.1 presents the findings of the simulation study supporting the bootstrap validity and sensitivity results, as well as empirically justifying the simultaneous use of multiple windows and the change-point location estimator (2.10). In Section 4.2 we successfully apply the method to detect change-points in daily quotes of the NASDAQ Composite index.
4.1 Experiment on synthetic data
We consider functions and for various choices of . Univariate covariates are shuffled equidistant points between [math] and . Under the responses are sampled independently as . Under we choose the change-point location and sample for and for . For our experiments we consider and report the corresponding observed break extent (defined by (3.14)). In all the experiments , the confidence level was chosen to be . We choose RBF kernel family
[TABLE]
and choose optimal parameters and via evidence maximization using .
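Evidence maximization can be illustrated with a small grid search over the kernel hyper-parameters. The sketch below assumes a unit-variance squared-exponential kernel with an additive noise variance; the grid values, the synthetic data and the function names are illustrative assumptions, and a gradient-based optimizer would normally replace the grid.

```python
import numpy as np


def log_evidence(y, x, length_scale, noise_var):
    """Log marginal likelihood of a zero-mean GP with an RBF kernel."""
    diff = x[:, None] - x[None, :]
    K = np.exp(-0.5 * (diff / length_scale) ** 2) + noise_var * np.eye(len(x))
    chol = np.linalg.cholesky(K)
    alpha = np.linalg.solve(chol, y)
    return (-0.5 * alpha @ alpha
            - np.sum(np.log(np.diag(chol)))        # equals 0.5 * log det K
            - 0.5 * len(x) * np.log(2.0 * np.pi))


def maximize_evidence(y, x, length_scales, noise_vars):
    """Return (best evidence, length scale, noise variance) over the grid."""
    candidates = [(log_evidence(y, x, ls, nv), ls, nv)
                  for ls in length_scales for nv in noise_vars]
    return max(candidates)


rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 40)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(40)
grid_ls, grid_nv = [0.05, 0.2, 1.0], [0.01, 0.1]
best_evidence, best_ls, best_nv = maximize_evidence(y_train, x_train, grid_ls, grid_nv)
print(best_evidence, best_ls, best_nv)
```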
The suggested approach has demonstrated proper control of the first type error rate in all the configurations we consider, keeping it below .
The power the test exhibits is shown in Figure 1. As expected, a larger window size and a larger observed break extent correspond to higher power. At the same time, Figure 2 summarizes the root mean squared errors of the estimator (defined by (2.10)). The estimator proves reliable when the power of the test is high. Generally, wider windows and a larger observed break extent lead to higher accuracy of .
Further, in order to investigate the behavior of the method in a multiscale regime () we use several choices of for a single . The results, reported in Table 1, exhibit a significant decrease in the average width of the narrowest detecting window and hence an improvement in change-point localization thanks to the simultaneous use of wider and narrower windows. This should be highly beneficial in the presence of multiple change points, as it allows for a smaller distance between them (see Section 2.1 and Remark 3.2).
4.2 Real-world dataset experiment
The prices of stock indexes are known to be subject to abrupt breaks [37, 38]. We consider a series of daily closing prices of the NASDAQ Composite index. The dataset spans from February 1990 until February 2019. We suggest modelling the process using the following stochastic differential equation
[TABLE]
where denotes a Wiener process. Now we wish to test the dataset for the presence of breaks. In order to do so we employ the Euler–Maruyama method, effectively boiling the problem down to a regression problem with univariate covariates and the corresponding responses . Further, we apply the scheme suggested in Section 2.1 with , , and the kernel family (4.1). The method detects three breaks, all of which may be related to known events: the activation of the CIH computer virus, which attacked Windows 9x, in August 1998; the burst of the dot-com bubble; and the 2008 financial crisis.
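The reduction to regression can be sketched as follows, under the assumption that the model has the form dS_t = f(S_t) dt + (noise) dW_t, so that the Euler–Maruyama discretization pairs each price with its subsequent increment. The function name, the unit time step and the use of raw (rather than transformed) prices are illustrative assumptions.

```python
import numpy as np


def sde_to_regression(prices, dt=1.0):
    """Covariates S_t and responses (S_{t+dt} - S_t) / dt from a price series."""
    prices = np.asarray(prices, dtype=float)
    covariates = prices[:-1]
    responses = np.diff(prices) / dt
    return covariates, responses


covariates, responses = sde_to_regression([1.0, 2.0, 4.0, 7.0])
print(covariates, responses)  # [1. 2. 4.] [1. 2. 3.]
```

The resulting pairs can then be fed to the break detection procedure like any other regression dataset.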
Acknowledgements
The research has been partially funded by Deutsche Forschungsgemeinschaft (DFG) through grant CRC 1294 “Data Assimilation”, Project “Approximative Bayesian inference and model selection for stochastic differential equations (SDEs)”.
Further, we would like to thank Vladimir Spokoiny, Alexandra Carpentier and Evgeniya Sokolova for the discussions and/or proofreading which have greatly improved the manuscript.
Appendix A Proof of the bootstrap validity result
Proof of Theorem 3.1.
Apply Lemma B.4 to and , next apply Lemma C.1 and via triangle inequality obtain on a set of probability at least
[TABLE]
where and come from Lemma B.4 and Lemma C.1 respectively. Now observe
[TABLE]
and using (3.11) conclude . Clearly, the ratio entering the definition of is bounded (in the same way as in the proof of Lemma B.4). Next we use Lemma F.1 and obtain on a set of probability at least
[TABLE]
where
[TABLE]
Now observe that the following holds for and involved in Lemma C.1 by construction of and (coming from the Gaussian approximation and defined by (B.4))
[TABLE]
[TABLE]
Further, Lemma C.3 yields the bound . Assumption (3.11) implies . Then (3.12) in turn implies
[TABLE]
Finally, choose (involved in the definition of , see Lemma C.1), recall assumption (3.12) and conclude . ∎
Appendix B Gaussian Approximation
Consider a random vector of independent components centered at . Introduce for even and denoting a vector composed of . Also, assume symmetric matrices are given and define a map s. t. .
Two ingredients of paramount importance are the soft-max function
[TABLE]
and a smooth indicator function with three bounded derivatives s.t. . Also let and . An example of such function along with bounds for its derivatives is provided in [39].
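As an illustration, assuming the soft-max function takes the standard log-sum-exp form with a smoothing parameter `beta`, it approximates the maximum of a vector of length p from above with an error of at most log(p) / beta. The function name and the numbers below are illustrative.

```python
import numpy as np


def soft_max(x, beta):
    """Smooth maximum log(sum(exp(beta * x))) / beta, computed stably."""
    x = np.asarray(x, dtype=float)
    shift = np.max(x)
    return shift + np.log(np.sum(np.exp(beta * (x - shift)))) / beta


values = np.array([0.3, -1.2, 2.0, 1.9])
beta = 10.0
smooth = soft_max(values, beta)
upper = np.max(values) + np.log(len(values)) / beta
print(np.max(values), smooth, upper)
```

A larger `beta` tightens the approximation at the cost of larger derivatives, which is the trade-off balanced in the proofs.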
Consider the following decomposition of matrices into diagonal matrices and matrices with zeroes down their diagonals
[TABLE]
Further, consider a vector s.t. for all . And introduce notation similar to . Now consider a vector denoting vectors and stacked. Clearly, there is a map s.t.:
[TABLE]
for all . Also define an independent vector
[TABLE]
and denote the first half of the vector as and the second as .
Our proof employs a novel version of the Lindeberg principle [30, 35, 11, 12] tuned for the problem at hand. Typically, the Lindeberg principle suggests replacing random variables with their Gaussian counterparts one by one. Here we have to "replace" each -th component of along with the component of being its square starting with the -st one, repeat starting with the -nd one and so on repeating the procedure times. Namely, in the first step we "replace" components with indices , , and so on. In the second step we "replace" components with indices , , and so on. And further in the same manner. Or more formally, consider a sequence of vectors for s. t. and for all and for s.t. . Denote the indices of components which were replaced at step as . Also define a vector s.t. for and for the rest of . Define sequence of and in a similar way. Finally, let denote the vectors and stacked together and denote stacked vectors and . Note, .
**Lemma B.1.**
Choose . Consider a function defined as
[TABLE]
where and is defined by (B.3). Further, using decomposition (B.2) assume for some positive :
[TABLE]
and denote
[TABLE]
Then
[TABLE]
[TABLE]
[TABLE]
Proof.
Proof of this result consists in direct differentiation followed by application of Lemma A.2 from [13] providing bounds for the first three derivatives of soft-max function. ∎
**Lemma B.2.**
Let assumptions of Lemma B.1 hold. Then for an independent Gaussian vector (defined by (B.4))
[TABLE]
where is the sum of the maximal third centered absolute moments of and , while is defined in Lemma B.5 and is defined by (B.3).
Proof.
Clearly, for ,
[TABLE]
and hence
[TABLE]
The rest of the proof consists in bounding an arbitrary summand on the right hand side. In order to do so we use Taylor expansion of second degree for and around with Lagrange remainder. Given equality of the first two moments of and , we conclude, the first two terms cancel out. Hence, using Lemma B.5 we immediately obtain
[TABLE]
Combination of (B.13) and (B.14) establishes the claim. ∎
**Lemma B.3.**
Let assumptions of Lemma B.1 hold. Then
[TABLE]
where comes from Lemma E.1 and is defined by (B.3).
Proof.
Choose . Then for an arbitrary constant vector
[TABLE]
Here we have consecutively used Lemma B.6, Lemma B.2, Lemma B.6 again and Lemma E.1 (which also defines ). The last step uses that . Now we choose
[TABLE]
and obtain
[TABLE]
Similar reasoning yields a chain of "larger-or-equal" inequalities which, combined with the one above, completes the proof. ∎
**Lemma B.4.**
Let be sub-Gaussian and matrices have bounded spectrum. Also assume for some positive
[TABLE]
Then for any positive on a set of probability at least for and going to infinity
[TABLE]
where is defined by (B.3).
Proof.
Application of Lemma B.7 to matrices yields the bound on defined by (B.7)
[TABLE]
on a set of probability at least .
Investigation of defined in Lemma E.1 yields . Indeed, is a sum of squared eigenvalues (which are bounded) and . Now we apply Lemma B.3
[TABLE]
Now change the variable
[TABLE]
∎
**Lemma B.5.**
In terms of Lemma B.1 for function
[TABLE]
it holds that
[TABLE]
[TABLE]
and
[TABLE]
Proof.
The proof consists in direct differentiation and bounding using Lemma B.1 and equation (53) from [39]. Intermediate differentiation steps can be found in the proof of Lemma A.14 [39]. ∎
The following lemma justifies the smoothing relying on smooth indicator and soft-max . Its proof can be found in [13].
**Lemma B.6.**
Let , then for arbitrary vector :
[TABLE]
The next lemma establishes prerequisites for inequality (B.28).
**Lemma B.7.**
Consider a symmetric matrix with the largest eigenvalue . Let be a vector of independent sub-Gaussian with elements. Then on a set of probability at least
[TABLE]
Proof.
For a given unit vector , since the components of are independent and sub-Gaussian, is sub-Gaussian with as well. Hence,
[TABLE]
and therefore,
[TABLE]
∎
Appendix C Gaussian comparison
The notation of this section follows that of Appendix B. The proof of the following result was inspired by the proof of Theorem 1 in [41].
**Lemma C.1.**
Consider two -dimensional normal vectors and . Denote and . Use the notation of Lemma B.1. Then for any constant vector and positive it holds that
[TABLE]
where comes from Lemma E.1.
Proof.
The proof consists in repeated use of the Slepian smart interpolant. Denote the first and the second halves of the vector as and and similarly introduce and being the halves of . Further, consider real values and compose a vector of length iterating over these values:
[TABLE]
Denote and consider a function
[TABLE]
where we use to denote element-wise product and radicals are applied to vectors in an element-wise manner. Clearly,
[TABLE]
and hence
[TABLE]
For the derivative we have
[TABLE]
Next we apply Lemma C.2 (which applies only to centered vectors, thus the second term)
[TABLE]
Now we make use of Lemma B.5 and Lemma B.1 and choose
[TABLE]
[TABLE]
Next, recalling (C.5) obtain
[TABLE]
Finally, in order to move from smooth functions to indicators we employ reasoning identical to the one in Lemma B.3.
[TABLE]
Combination with a similar chain of larger-or-equal finalizes the proof. ∎
We use the same version of Stein’s identity as the authors of [41].
**Lemma C.2 (Stein’s identity).**
Let be a centered normal vector and let be a function with finite first derivatives. Then for all
[TABLE]
Proof.
See Section A.6 of [40]. ∎
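For orientation, the multivariate Stein identity is commonly stated as below. The symbols are assumptions made for illustration: $Z$ is a centered normal vector of dimension $p$ and $f$ is a function with finite first partial derivatives.

```latex
\mathbb{E}\bigl[ Z_j f(Z) \bigr]
  = \sum_{k=1}^{p} \mathbb{E}[Z_j Z_k]\,
    \mathbb{E}\!\left[ \frac{\partial f}{\partial z_k}(Z) \right],
  \qquad j = 1, \dots, p .
```

In the univariate standard normal case this reduces to the familiar $\mathbb{E}[Z f(Z)] = \mathbb{E}[f'(Z)]$.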
**Lemma C.3.**
Consider defined in Section 2. Let for some positive . Let be sub-Gaussian with . Then on a set of probability at least
[TABLE]
Proof.
By construction
[TABLE]
Now due to sub-Gaussianity for a positive
[TABLE]
and hence
[TABLE]
On set Hoeffding inequality applies to and their squares:
[TABLE]
[TABLE]
Therefore, with probability at least
[TABLE]
and hence
[TABLE]
Clearly, the choice
[TABLE]
[TABLE]
[TABLE]
makes polynomially decreasing. Substitution yields the claim. ∎
Appendix D Proof of sensitivity result
Proof of Theorem 3.2.
Denoting the probability density functions in the world of the Gaussian Process Regression model as by construction we have
[TABLE]
Further, denote and . Define shorthand notation and . Also let and denote predictive mean and variance of the Gaussian Process Regression for given and . Now recall the posterior is Gaussian:
[TABLE]
Define a norm for an arbitrary positive-definite symmetric matrix . Clearly, . Now trivial algebra yields
[TABLE]
where we use to denote “equality up to an additive deterministic constant”. Consider a matrix being a block matrix with blocks of equal size:
[TABLE]
Notice that is its Schur complement, thus (the second inequality is due to Assumption 3.4). Using we have and for some independent of . To sum these observations up:
[TABLE]
Having established control over these eigenvalues, we are ready to bound the terms and from above under both and , while should be bounded from above under and from below under . Now we bound the test statistic under . Denote . Then
[TABLE]
In order to bound the second term on a set of high probability we employ Lemma D.1 and obtain for a positive
[TABLE]
The third term will be controlled using sub-Gaussianity of . For any unit vector and positive
[TABLE]
and clearly,
[TABLE]
where . Hence, on a set of probability at least
[TABLE]
Finally, we choose and . Now under bound by Lemma F.1, recall and obtain
[TABLE]
on a set of probability at least as . Now we use Theorem 3.1 along with the fact that for large enough and obtain on a set of probability approaching
[TABLE]
On the other hand, under the bounds (D.7) and (D.10) still hold and
[TABLE]
Finally, choose , , and recall assumption (3.15) to conclude for large with probability approaching . ∎
The following result bounds a quadratic form of a sub-Gaussian vector with high probability. It is a direct corollary of Theorem 1.1 (Hanson-Wright inequality) stated in [36].
Lemma D.1.
Consider a vector sub-Gaussian with and a positive-definite matrix of size . Let there be a constant independent of s.t. . Then for a positive , large enough and some absolute positive
[TABLE]
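A numerical illustration of this kind of bound, under simplifying assumptions: the sketch below takes a standard Gaussian vector and a diagonal positive-definite matrix, and uses the Laurent-Massart tail bound for weighted chi-square sums in place of the general Hanson-Wright inequality, whose absolute constants are not specified. All variable names are hypothetical, and the constants of Lemma D.1 are not reproduced.

```python
import math
import random

rng = random.Random(2)
n, trials, t = 50, 5000, 3.0

# Assumption: A is diagonal with these eigenvalues, x is standard Gaussian.
weights = [rng.uniform(0.2, 1.0) for _ in range(n)]
trace = sum(weights)                             # tr(A) = E[x^T A x]
frob = math.sqrt(sum(a * a for a in weights))    # Frobenius norm of A
op_norm = max(weights)                           # operator norm of A

# Laurent-Massart: P(x^T A x >= tr(A) + 2 ||A||_F sqrt(t) + 2 ||A|| t) <= e^{-t}
threshold = trace + 2.0 * frob * math.sqrt(t) + 2.0 * op_norm * t

exceed = 0
for _ in range(trials):
    quad = sum(a * rng.gauss(0.0, 1.0) ** 2 for a in weights)
    exceed += quad >= threshold

exceed_freq = exceed / trials
bound = math.exp(-t)
print(exceed_freq, bound)
```

The empirical exceedance frequency should not exceed `bound`, reflecting that the quadratic form concentrates around the trace at a scale set by the Frobenius and operator norms.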
Appendix E Anti-concentration inequality
This section uses notation introduced in Section B.
Lemma E.1.
Consider a -dimensional Gaussian vector , where and are -dimensional. Further, let and for arbitrary and . Finally, let . Then for an arbitrary vector and
[TABLE]
where
[TABLE]
and the map is defined by (B.3).
Proof.
Introduce an isotropic Gaussian vector and notice, for all . Applying Lemma E.2 to yields the claim. ∎
The remaining proofs in this section mostly follow the proof of Nazarov's inequality presented in [14].
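As background, the classical form of Nazarov's inequality can be checked exactly in a simple case. The sketch below assumes the commonly stated form, P(Y ≤ y + δ) − P(Y ≤ y) ≤ (δ/σ)(√(2 log p) + 2) for a p-dimensional Gaussian vector whose coordinates have variance at least σ², and evaluates it for independent standard normal coordinates and constant thresholds. This is an illustration of the classical statement, not of the lemmas of this section, and the function names are hypothetical.

```python
import math


def std_normal_cdf(z):
    """Distribution function of the standard normal law."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def strip_probability(p, y, delta):
    """P(max_j Y_j <= y + delta) - P(max_j Y_j <= y)
    for p independent N(0, 1) coordinates (an assumed special case)."""
    return std_normal_cdf(y + delta) ** p - std_normal_cdf(y) ** p


def nazarov_bound(p, delta, sigma=1.0):
    """Right-hand side of the classical inequality, as assumed above."""
    return (delta / sigma) * (math.sqrt(2.0 * math.log(p)) + 2.0)


p, delta = 50, 0.1
grid = [-2.0 + 0.01 * k for k in range(800)]  # thresholds y in [-2, 6)
worst = max(strip_probability(p, y, delta) for y in grid)
print(worst, nazarov_bound(p, delta))
```

Because the coordinates are independent here, the strip probability has a closed form and no simulation is needed; the dependence on the dimension enters the bound only through √(log p).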
Define a map , where :
[TABLE]
With a slight abuse of notation we will use to denote both the map and an element of its image.
Lemma E.2.
Consider and along with . Then for all positive :
[TABLE]
Proof.
Define a set , and a function . is absolutely continuous distribution function and hence
[TABLE]
where denotes the right derivative of . Essentially, the proof boils down to the following lemma.
Lemma E.3.
[TABLE]
Proof.
Denote and note it is a convex polyhedron. Denote a projector onto as : . Now for a (proper) face of define
[TABLE]
[TABLE]
Clearly, . Clearly,
[TABLE]
hence for any face of dimensionality less than for . Hence,
[TABLE]
Now it is left to prove that
[TABLE]
By Lemma E.4
[TABLE]
where denotes the density of . Consider facets such that . Choose (or flip the sign if ) and denote . Further, since ,
[TABLE]
and given the number of facets is less than ,
[TABLE]
Now turn to the facets s.t. . By Lemma E.4,
[TABLE]
The final observation is based on the fact that are disjoint and
[TABLE]
and its combination with (E.14) completes the proof. ∎
∎
Lemma E.4.
[TABLE]
where is the standard surface measure on .
Proof.
Parametrize every as
[TABLE]
where is an arbitrary element of , while form an orthonormal basis on . Further, choose a unit outward normal vector to at . Then we can parametrize
[TABLE]
Now
[TABLE]
and in the same way
[TABLE]
Thus
[TABLE]
which proves the equality in the claim.
Now for any and there exist vectors and such that for . Then
[TABLE]
Combination of (E.20) and (E.23) yields
[TABLE]
Now choose , note and establish the claim. ∎
Appendix F Consistency of Gaussian Process Regression by [44]
In this section we quote a consistency result for predictions of Gaussian Process Regression.
Lemma F.1 (Corollary 2.1 in [44]).
Assume that are sub-Gaussian. Let be -smooth Sobolev and or -smooth Hölder and . Further let satisfy Assumption 3.1 and Assumption 3.2. Then, as the training sample size goes to infinity, with probability at least we have
[TABLE]
where denotes the predictive function.
References
- [1] Alexander Aue, Siegfried Hörmann, Lajos Horváth, and Matthew Reimherr. Break detection in the covariance structure of multivariate time series models. Ann. Statist., 37(6B):4046–4087, December 2009.
- [2] Alexander Aue and Lajos Horváth. Structural breaks in time series. Journal of Time Series Analysis, 34(1):1–16, 2013.
- [3] Valeriy Avanesov. Structural break analysis in high-dimensional covariance structure. arXiv e-prints, page arXiv:1803.00508, March 2018.
- [4] Valeriy Avanesov and Nazar Buzun. Change-point detection in high-dimensional covariance structure. Electron. J. Statist., 12(2):3254–3294, 2018.
- [5] Jushan Bai. Estimating multiple breaks one at a time. Econometric Theory, 13(3):315–352, 1997.
- [6] Jushan Bai. Likelihood ratio tests for multiple structural changes. Journal of Econometrics, 91(2):299–323, 1999.
- [7] Jushan Bai and Pierre Perron. Estimating and testing linear models with multiple structural changes. Econometrica, 66(1):47–78, 1998.
- [8] Jushan Bai and Pierre Perron. Critical values for multiple structural change tests. The Econometrics Journal, 6(1):72–78, 2003.
