Nonparametric Change Point Detection in Regression

Valeriy Avanesov

arXiv:1903.02603·math.ST·July 2, 2019

TL;DR

This paper introduces a new nonparametric change-point detection method for regression that is fully data-driven, tuning-free, and effective in both theoretical and practical scenarios, including financial data analysis.

Contribution

It proposes a novel, tuning-free, data-driven change-point detection procedure for regression with proven theoretical guarantees and practical effectiveness.

Findings

01 Proper control of the type I error rate under the null hypothesis

02 Power approaching 1 under the alternative hypothesis

03 Successful detection of change-points in financial data

Abstract

This paper considers the prominent problem of change-point detection in regression. The study suggests a novel testing procedure featuring a fully data-driven calibration scheme. The method is essentially a black box, requiring no tuning from the practitioner. The approach is investigated from both theoretical and practical points of view. The theoretical study demonstrates proper control of first-type error rate under $H_{0}$ and power approaching $1$ under $H_{1}$ . The experiments conducted on synthetic data fully support the theoretical claims. In conclusion, the method is applied to financial data, where it detects sensible change-points. Techniques for change-point localization are also suggested and investigated.

Tables

Table 1: The average width of the narrowest detecting window $n_{*}$ (from Theorem 3.2) may be reduced by employing multiple window sizes at once without noticeable loss of power.

$\mathfrak{N}$	Power	Average $n_{*}$
$\{40\}$	1.0	40.0
$\{40, 20\}$	1.0	20.5
$\{40, 20, 10\}$	1.0	15.7
$\{40, 20, 10, 5\}$	1.0	15.9

Equations

\mathbb{H}_{0} = \{\forall i : y_{i} = f^{*}(X_{i}) + \varepsilon_{i}\}

\mathbb{H}_{1} = \left\{\exists \tau, f_{1}^{*} \neq f_{2}^{*} : y_{i} = \begin{cases} f_{1}^{*}(X_{i}) + \varepsilon_{i} & \text{if } i < \tau, \\ f_{2}^{*}(X_{i}) + \varepsilon_{i} & \text{otherwise} \end{cases}\right\},

A_{n}(t) \coloneqq L(y_{n}^{l}(t), X_{n}^{l}(t)) + L(y_{n}^{r}(t), X_{n}^{r}(t)) - L(y_{n}(t), X_{n}(t)),

f \sim \mathcal{GP}(0, \rho k(\cdot, \cdot)), \qquad y_{j} \sim \mathcal{N}(f(X_{j}), \sigma^{2}) \text{ for } j \in 1..M,

L(y, X) \coloneqq -\frac{1}{2} y^{T} K(X)^{-1} y.

\{\exists n \in \mathfrak{N}, t \in \mathbb{T}_{n} : A_{n}(t) > x_{n,\alpha}(t)\}.

y_{i}^{\flat} = \hat{y}_{i} + \varepsilon_{i}^{\flat}, \text{ with } \varepsilon_{i}^{\flat} \coloneqq s_{i}\hat{\varepsilon}_{j_{i}},

z_{n,\mathrm{x}}^{\flat}(t) \coloneqq \inf\{z : \mathbb{P}^{\flat}\{A_{n}^{\flat}(t) > z\} \leq \mathrm{x}\}.

\alpha^{*} \coloneqq \sup\{\mathrm{x} : \mathbb{P}^{\flat}\{\exists n \in \mathfrak{N}, t \in \mathbb{T}_{n} : A_{n}^{\flat}(t) > z_{n,\mathrm{x}}^{\flat}(t)\} \leq \alpha\}

\tilde{\tau}^{n} \coloneqq \min\{t \in \mathbb{T}_{n} : A_{n}(t) > x_{n,\alpha}^{\flat}(t)\}.

n_{*} \coloneqq \arg\min_{n \in \mathfrak{N}} (\tilde{\tau}^{n} + n)

\hat{\tau} \coloneqq \arg\max_{t \in [\tilde{\tau}^{n_{*}} - n_{*}, \tilde{\tau}^{n_{*}} + n_{*})} A_{n_{+}}(t).

\{\exists n \in \mathfrak{N} : \max_{t \in \mathbb{T}_{n}} A_{n}(t) > x_{n,\alpha}^{\flat}\}

\mathbb{H}_{1} \coloneqq \left\{\begin{array}{l} \exists \{f_{k}^{*}\}_{k=1}^{K+1} : f_{k}^{*} \neq f_{k+1}^{*}, \\ y_{i} = f_{k}^{*}(X_{i}) + \varepsilon_{i} \text{ if } \tau_{k-1} \leq i < \tau_{k} \text{ for all } k \end{array}\right\}.

\hat{\tau}_{1} \coloneqq \arg\max_{t \in [\tilde{\tau}^{n_{*}} - n_{*}, \tilde{\tau}^{n_{*}} + n_{*})} A_{n_{+}}(t).

\mathbb{E}[\exp(sx)] \leq \exp(\mathfrak{g}^{2}s^{2}/2), \quad \forall s \in \mathbb{R}.

\exists B : \sum_{j=1}^{\infty} j^{2\aleph} f_{j}^{2} \leq B^{2}

\exists B : \sum_{j=1}^{\infty} j^{\aleph} |f_{j}| \leq B^{2}.

\rho = \frac{B^{2}}{\log M}

\rho = \left(\frac{B^{2}}{\log M}\right)^{2\aleph/(2\aleph+1)} \left(\frac{1}{M}\right)^{1/(2\aleph+1)}

\max_{j} \|\psi_{j}\|_{\infty} \leq C_{\psi}

|\psi_{j}(t) - \psi_{j}(s)| \leq j L_{\psi} \|t - s\|.

\left\|\tilde{K}_{n}(t)^{-1}\mathbb{E}[y_{n}(t)]\right\|_{\infty} = O(n^{\gamma}).

\left\|K(X_{n}(t))^{-1}\mathbb{E}[y_{n}(t)]\right\|_{\infty} = O(n^{\gamma}).

\|K(X_{n}(t))\| < C.

\frac{\log^{15}(|\mathfrak{N}| N)}{n_{-}^{1-6\gamma}} = o(1),

\left(\frac{\log |\mathcal{I}^{\flat}|}{|\mathcal{I}^{\flat}|}\right)^{\kappa} n_{+}^{1/2+\delta+\gamma} = o(1)

\sup_{c_{n}(t)} \left| \mathbb{P}\{\forall n \in \mathfrak{N}, t \in \mathbb{T}_{n} : A_{n}(t) < c_{n}(t)\} - \mathbb{P}^{\flat}\{\forall n \in \mathfrak{N}, t \in \mathbb{T}_{n} : A_{n}^{\flat}(t) < c_{n}(t)\} \right| = o(1).

\mathfrak{B}_{n}^{2} \coloneqq \frac{1}{n} \sum_{i=\tau}^{\tau+n-1} (f_{1}^{*}(X_{i}) - f_{2}^{*}(X_{i}))^{2}.

\mathfrak{B}_{n_{*}}^{-1} \left(\frac{\log n_{*}}{n_{*}}\right)^{\kappa} = o(1).

Code & Models

Repositories

akopich/gpcd (official)

Full text

Nonparametric Change Point Detection in Regression

Valeriy Avanesov

WIAS Berlin

Abstract

This paper considers the prominent problem of change-point detection in regression. The study suggests a novel testing procedure featuring a fully data-driven calibration scheme. The method is essentially a black box, requiring no tuning from the practitioner. The approach is investigated from both theoretical and practical points of view. The theoretical study demonstrates proper control of first-type error rate under $\mathbb{H}_{0}$ and power approaching $1$ under $\mathbb{H}_{1}$ . The experiments conducted on synthetic data fully support the theoretical claims. In conclusion, the method is applied to financial data, where it detects sensible change-points. Techniques for change-point localization are also suggested and investigated.

1 Introduction

The current study addresses the problem of change-point detection, whose applications range from neuroimaging [9] to finance [17, 10, 19, 29]. In many fields practitioners have to deal with processes subject to abrupt, unpredictable changes, hence the need to detect and localize such changes. In this paper we refer to the former problem as break detection and to the latter as change-point localization, adopting the terminology suggested in [4]. The importance of the topic has given rise to an immense variety of settings and results [18, 2, 28, 25, 1, 43, 45, 15, 22, 16].

In the current paper we focus on break detection and change-point localization in regression. Typically, in a regression setting a dataset of pairs of (possibly) multivariate covariates and univariate responses is considered, and the goal is to approximate the functional dependence between the two. Here we assume that the data points are separated in time. The question at hand is whether the functional dependence has stayed the same over time and, if not, when the break took place. This setting has attracted considerable attention for decades. Most studies consider linear [33, 23, 7, 6, 8, 32, 27, 24] or piecewise constant regression [5, 26, 31]. A recent paper [39] allows for a generalized linear model, leaving the proper choice of a parametric model to the practitioner. In contrast, we develop a fully nonparametric method, eliminating the need to choose a parametric family. Some papers (e.g. [33, 6, 39]), however, rely on the fairly general framework of the likelihood-ratio test, which we employ in our study as well. Further, some researchers (see [20] for example) propose a test statistic yet leave the choice of the critical value to the practitioner, while we also suggest a fully data-driven way to obtain the critical values.

The contribution of our work is a novel break detection approach for regression which is:

• fully nonparametric

• fully data-driven

• working in black-box mode: it has virtually no tuning parameters

• capable of multiple break detection

• naturally suitable for change-point localization

• featuring formal results bounding the first type error rate (from above) and the power (from below)

• performing well on simulated and on real-world data.

Formally, we consider pairs of deterministic multidimensional covariates $X_{i}\in\mathcal{X}$ and corresponding univariate responses $y_{i}\in\mathbb{R}$ for $i\in 1..N$ , where $\mathcal{X}$ is a compact set in $\mathbb{R}^{p}$ . We wish to test the null hypothesis

$$\mathbb{H}_{0}=\left\{\forall i:y_{i}=f^{*}(X_{i})+\varepsilon_{i}\right\}$$

versus an alternative (only a single break is allowed for simplicity, Section 2.1 suggests a generalization)

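$$\mathbb{H}_{1}=\left\{\exists\tau,f_{1}^{*}\neq f_{2}^{*}:y_{i}=\begin{cases}f_{1}^{*}(X_{i})+\varepsilon_{i}&\text{if }i<\tau,\\ f_{2}^{*}(X_{i})+\varepsilon_{i}&\text{otherwise}\end{cases}\right\},$$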

where $\varepsilon_{i}$ denote centered independent identically distributed noise. The functions $f^{*}$ , $f_{1}^{*}$ and $f_{2}^{*}$ , mapping from the compact $\mathcal{X}$ to $\mathbb{R}$ , are assumed to be unknown along with the distribution of $\varepsilon_{i}$ .

The approach relies on the likelihood-ratio test statistic. Assume for now that the break can happen only at time $t$ . Then it makes sense to take $n$ data points to the left and $n$ data points to the right of $t$ and to consider the ratio of likelihoods $A_{n}(t)$ of the $2n$ points under a single model and under a pair of models explaining the portions of data to the left and to the right of $t$ separately. Yet the break can happen at any moment, so we consider the test statistics for all possible time moments simultaneously. Finally, in order to resolve the issue of the proper choice of the window size $n$ , we suggest considering multiple window sizes $n\in\mathfrak{N}\subset\mathbb{N}$ at once (e.g. powers of $2$ ).

The paper is organized as follows. Section 2 describes the approach. Further, the approach receives a formal treatment in Section 3. Finally, the behavior of the approach is empirically examined in Section 4.

2 The approach

Let us introduce some notation first. Denote the maximal and the minimal window sizes as $n_{+}\coloneqq\max\mathfrak{N}$ and $n_{-}\coloneqq\min\mathfrak{N}$ . Define a set of central points for each window size $n$ as $\mathbb{T}_{n}\coloneqq\{n,n+1,..,N-n\}$ . Further, for each $n\in\mathfrak{N}$ and $t\in\mathbb{T}_{n}$ define vectors $y^{l}_{n}(t)$ composed of the responses $\{y_{i}\}_{i=t-n+1}^{t}$ belonging to the window to the left of $t$ . Correspondingly, vectors $y^{r}_{n}(t)$ are composed of $\{y_{i}\}_{i=t+1}^{t+n}$ . The concatenation of these two vectors is denoted as $y_{n}(t)$ . Also, we use $X^{l}_{n}(t)$ , $X^{r}_{n}(t)$ and $X_{n}(t)$ to denote the tuples of covariates corresponding to $y^{l}_{n}(t)$ , $y^{r}_{n}(t)$ and $y_{n}(t)$ respectively. For each window size $n\in\mathfrak{N}$ and central point $t\in\mathbb{T}_{n}$ we define the test statistic

$$A_{n}(t)\coloneqq L(y_{n}^{l}(t),X_{n}^{l}(t))+L(y_{n}^{r}(t),X_{n}^{r}(t))-L(y_{n}(t),X_{n}(t)),$$

where $L$ is a likelihood function which is defined below. Intuitively, the statistic should take extremely large values when the two portions of data before and after $t$ are much better explained by a pair of distinct models than by a common one. As we aim to construct a nonparametric approach, we define $L$ relying on a well known technique named Gaussian Process Regression [34]. Formally, we model the noise with a normal distribution and impose the zero-mean Gaussian Process prior with covariance function $k(\cdot,\cdot)$ on the regression function

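$$f\sim\mathcal{GP}(0,\rho k(\cdot,\cdot)),\qquad y_{j}\sim\mathcal{N}(f(X_{j}),\sigma^{2})\text{ for }j\in 1..M,$$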

where $M$ is the number of response-covariate pairs under consideration and $\rho$ is a regularization parameter (see (3.4) and (3.5) for its choice). Integrating $f$ out, we see that the joint distribution of responses $y\in\mathbb{R}^{M}$ given the covariates $X=\{X_{j}\}_{j=1}^{M}$ is modelled as a multivariate normal distribution with zero mean and covariance matrix $K\left(X\right)\in\mathbb{R}^{M\times M}$ such that $K\left(X\right)_{jj^{\prime}}\coloneqq\rho k(X_{j},X_{j^{\prime}})+\sigma^{2}\delta_{jj^{\prime}}$ , where $\delta_{jj^{\prime}}$ is the Kronecker delta. Taking the logarithm and dropping the non-random additive terms leads to the following definition of the likelihood $L$ :

$$L(y,X)\coloneqq-\frac{1}{2}y^{T}K(X)^{-1}y.$$
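The definitions of $K(X)$, $L$ and $A_{n}(t)$ above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the author's implementation: the function names, the scalar covariates and the RBF covariance `rbf` are hypothetical choices, and a single value of $\rho$ is used for all three windows, whereas the paper ties $\rho$ to the sample size $M$.

```python
import math


def rbf(a, b, length_scale=1.0):
    """Illustrative covariance k(a, b) for scalar covariates (an assumption)."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def gram(X, rho, sigma2, k=rbf):
    """K(X)_{jj'} = rho * k(X_j, X_j') + sigma^2 * delta_{jj'}."""
    M = len(X)
    return [[rho * k(X[i], X[j]) + (sigma2 if i == j else 0.0)
             for j in range(M)] for i in range(M)]


def cholesky(A):
    """Lower-triangular C with C C^T = A, for symmetric positive definite A."""
    M = len(A)
    C = [[0.0] * M for _ in range(M)]
    for i in range(M):
        for j in range(i + 1):
            s = A[i][j] - sum(C[i][m] * C[j][m] for m in range(j))
            C[i][j] = math.sqrt(s) if i == j else s / C[j][j]
    return C


def log_lik(y, X, rho, sigma2):
    """L(y, X) = -1/2 * y^T K(X)^{-1} y, computed via a Cholesky factor."""
    C = cholesky(gram(X, rho, sigma2))
    # Forward substitution solves C w = y, so y^T K^{-1} y = ||w||^2.
    w = []
    for i in range(len(y)):
        w.append((y[i] - sum(C[i][m] * w[m] for m in range(i))) / C[i][i])
    return -0.5 * sum(v * v for v in w)


def statistic(y, X, t, n, rho, sigma2):
    """A_n(t) = L(left) + L(right) - L(both) for a 1-based central point t.

    y and X are 0-based lists holding y_1..y_N and X_1..X_N.
    """
    yl, yr = y[t - n:t], y[t:t + n]      # y_{t-n+1}..y_t and y_{t+1}..y_{t+n}
    Xl, Xr = X[t - n:t], X[t:t + n]
    return (log_lik(yl, Xl, rho, sigma2) + log_lik(yr, Xr, rho, sigma2)
            - log_lik(yl + yr, Xl + Xr, rho, sigma2))
```

In practice one would rely on a numerical linear algebra library for the factorization; the hand-written Cholesky routine only keeps the sketch self-contained.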

Remark 2.1.

The suggested approach shares its local nature with the ones presented in [4, 3, 39], as they use only a portion of the dataset (of size $2n$ ) to construct a test statistic for time $t$ . Alternatively, one could use the whole dataset as in [42], yet this is not the best option in the presence of multiple breaks. Consider a setting where a function $f_{1}^{*}$ changes to $f_{2}^{*}$ and back to $f_{1}^{*}$ shortly afterwards. The long tails might “water down” the test statistic. To address this, a method called Wild Binary Segmentation suggests choosing multiple random contiguous sub-datasets of random lengths [20]. Unfortunately, this might lead to excessively long sub-datasets and significantly increase computational complexity. Our approach is free of both issues. See also Remark 3.1 for another motivation for an approach of a local nature.

Remark 2.2.

The choice of the covariance function $k(\cdot,\cdot)$ and of $\sigma^{2}$ is rather important in practice. Typically, a parametric family of covariance functions $\{k_{\theta}(\cdot,\cdot)\}_{\theta\in\Theta}$ is considered and the optimal combination of hyper-parameters $\theta$ and $\sigma^{2}$ is chosen via evidence maximization (see Section 4.5.1 in [34] for details).

The suggested approach rejects $\mathbb{H}_{0}$ if for some window size $n\in\mathfrak{N}$ and some central point $t\in\mathbb{T}_{n}$ the test statistic $A_{n}(t)$ exceeds its corresponding critical level $x_{n,\alpha}(t)$ given the significance level $\alpha$ . Formally, the rejection set is

$$\left\{\exists n\in\mathfrak{N},t\in\mathbb{T}_{n}:A_{n}(t)>x_{n,\alpha}(t)\right\}.\tag{2.4}$$

As the joint distribution of $A_{n}(t)$ is unknown, we mimic it with a residual bootstrap scheme in order to allow for the proper choice of the critical levels. First, let us choose some subset of indices $\mathcal{I}^{\flat}\subseteq 1..N$ we use for bootstrap. We assume the response-covariate pairs $\{(y_{i},X_{i})\}_{i\in\mathcal{I}^{\flat}}$ follow the same distribution, hence we require $\mathcal{I}^{\flat}$ to be located either to the left, or to the right from $\tau$ (we presume the former without loss of generality). Given a collection of pairs $\{\left(y_{i},X_{i}\right)\}_{i\in\mathcal{I}^{\flat}}$ , we construct estimates $\hat{y}_{i}$ of $\mathbb{E}\left[y_{i}\right]$ and the corresponding residuals $\hat{\varepsilon}_{i}\coloneqq y_{i}-\hat{y}_{i}$ . Now define the bootstrap counterpart of the response $y_{i}$ as

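$$y_{i}^{\flat}=\hat{y}_{i}+\varepsilon_{i}^{\flat},\text{ with }\varepsilon_{i}^{\flat}\coloneqq s_{i}\hat{\varepsilon}_{j_{i}},$$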

where for all $i\in 1..N$ we draw $j_{i}$ independently and uniformly with replacement from $\mathcal{I}^{\flat}$ and $s_{i}$ are independently and uniformly drawn from $\{-1,1\}$ . At this point we can trivially define the bootstrap statistics $A_{n}^{\flat}(t)$ in the same way their real-world counterparts $A_{n}(t)$ are defined by plugging in $y^{\flat}_{i}$ instead of $y_{i}$ . Next, using $\mathbb{P}^{\flat}$ to denote the bootstrap probability measure, we define the quantile functions for each $\mathrm{x}\in[0,1]$

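$$z_{n,\mathrm{x}}^{\flat}(t)\coloneqq\inf\left\{z:\mathbb{P}^{\flat}\left\{A_{n}^{\flat}(t)>z\right\}\leq\mathrm{x}\right\}.$$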

Finally, we correct the significance level $\alpha$ for multiplicity

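$$\alpha^{*}\coloneqq\sup\left\{\mathrm{x}:\mathbb{P}^{\flat}\left\{\exists n\in\mathfrak{N},t\in\mathbb{T}_{n}:A_{n}^{\flat}(t)>z_{n,\mathrm{x}}^{\flat}(t)\right\}\leq\alpha\right\}\tag{2.7}$$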

and define the critical levels $x_{n,\alpha}^{\flat}(t)\coloneqq z_{n,\alpha^{*}}^{\flat}(t)$ .
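The calibration steps above (bootstrap responses, per-statistic quantiles and the multiplicity-corrected level $\alpha^{*}$) can be illustrated with an empirical version based on a finite number of bootstrap draws. The sketch below is written under assumptions: the function names are hypothetical, `y_hat` is taken to hold fitted values for all $N$ time points, and the supremum over $\mathrm{x}$ is approximated on the grid $0, 1/B, 2/B, \ldots$ for $B$ draws.

```python
import random


def bootstrap_responses(y_hat, residuals, rng):
    """One draw of y_i^b = y_hat_i + s_i * eps_hat_{j_i} for i = 1..N."""
    out = []
    for fitted in y_hat:
        eps = rng.choice(residuals)      # eps_hat_{j_i}, j_i uniform on I^b
        sign = rng.choice((-1, 1))       # s_i uniform on {-1, 1}
        out.append(fitted + sign * eps)
    return out


def corrected_levels(draws, alpha):
    """Return (alpha_star, levels) on the grid x = 0, 1/B, 2/B, ...

    draws[b][k] is the k-th bootstrap statistic (one k per pair (n, t))
    computed from the b-th bootstrap draw.
    """
    B, K = len(draws), len(draws[0])
    cols = [sorted(d[k] for d in draws) for k in range(K)]
    alpha_star, levels = 0.0, [c[-1] for c in cols]
    for j in range(B):
        # Empirical quantile for x = j / B: at most j draws exceed it.
        z = [c[B - 1 - j] for c in cols]
        joint = sum(any(d[k] > z[k] for k in range(K)) for d in draws) / B
        if joint > alpha:
            break                        # the joint rate is monotone in x
        alpha_star, levels = j / B, z
    return alpha_star, levels
```

Each row of `draws` would be obtained by recomputing all statistics $A_{n}^{\flat}(t)$ on one output of `bootstrap_responses`; the returned `levels` then play the role of the critical levels $x_{n,\alpha}^{\flat}(t)$.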

If the method rejects $\mathbb{H}_{0}$ , one can localize the change-point as follows. First, define the earliest central point, where $\mathbb{H}_{0}$ is rejected

$$\tilde{\tau}^{n}\coloneqq\min\left\{t\in\mathbb{T}_{n}:A_{n}(t)>x_{n,\alpha}^{\flat}(t)\right\}.$$

Now, if $A_{n}(t)>x_{n,\alpha}^{\flat}(t)$ , the change point is located in the interval $[t-n,t+n)$ (up to the significance level $\alpha$ ). Therefore, we suggest defining the earliest detecting window

$$n_{*}\coloneqq\arg\min_{n\in\mathfrak{N}}\left(\tilde{\tau}^{n}+n\right)$$

and use the following change-point location estimator

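$$\hat{\tau}\coloneqq\arg\max_{t\in[\tilde{\tau}^{n_{*}}-n_{*},\tilde{\tau}^{n_{*}}+n_{*})}A_{n_{+}}(t).\tag{2.10}$$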
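The three localization steps (earliest rejecting point per window size, earliest detecting window, maximization over the resulting interval) can be sketched as follows. The function `localize` and the nested-dictionary layout of its inputs are hypothetical conventions chosen for the illustration; the statistics and critical levels are assumed to have been computed beforehand.

```python
def localize(A, crit):
    """Return (n_star, tau_hat), or None if no statistic exceeds its level.

    A[n][t] holds A_n(t); crit[n][t] holds the critical level for (n, t).
    """
    # Earliest rejecting central point for each window size n.
    first = {}
    for n in A:
        hits = [t for t in sorted(A[n]) if A[n][t] > crit[n][t]]
        if hits:
            first[n] = hits[0]
    if not first:
        return None
    # Earliest detecting window: n_star minimises first[n] + n.
    n_star = min(first, key=lambda n: first[n] + n)
    lo, hi = first[n_star] - n_star, first[n_star] + n_star
    # Maximise the statistic of the largest window over [lo, hi).
    n_plus = max(A)
    candidates = [t for t in A[n_plus] if lo <= t < hi]
    if not candidates:
        return n_star, None
    return n_star, max(candidates, key=lambda t: A[n_plus][t])
```

For example, with a single window size `2`, statistics `{2: 0, 3: 0, 4: 1, 5: 3, 6: 5, 7: 2, 8: 0}` and a constant level of `2.5`, the earliest rejection is at `5`, the search interval is `[3, 7)`, and the estimate is `6`.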

Remark 2.3.

The estimates $\hat{y}_{i}$ may be obtained with any regression method as long as they are consistent under $\mathbb{H}_{0}$ . As we strive for a nonparametric methodology, Gaussian Process Regression trained on $\{\left(y_{i},X_{i}\right)\}_{i\in\mathcal{I}^{\flat}}$ is suggested. The theoretical results can be trivially adapted to any kind of a consistent regressor used instead.

Remark 2.4.

In practice it may be computationally difficult to obtain enough samples of the bootstrap statistics $A_{n}^{\flat}(t)$ for the large number of quantiles to be estimated simultaneously. Alternatively, we suggest choosing the critical levels $x_{n,\alpha}^{\flat}(t)=x^{\flat}_{n,\alpha}$ independently of the central point $t$ , effectively replacing the rejection region (2.4) with

$$\left\{\exists n\in\mathfrak{N}:\max_{t\in\mathbb{T}_{n}}A_{n}(t)>x_{n,\alpha}^{\flat}\right\}$$

as a smaller number of quantiles can be reliably estimated from far fewer bootstrap samples. Clearly, this may lead to some loss of sensitivity.
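A minimal sketch of this simplification for one fixed window size: the level is an empirical upper quantile of the bootstrap maxima over $t$. The function name `uniform_level` and the input layout are assumptions made for the illustration, and the correction for multiple window sizes is left out.

```python
import math


def uniform_level(draws, x):
    """Smallest level exceeded by max_t A_n^b(t) in at most a fraction x of draws.

    draws[b][t] holds the b-th bootstrap draw of A_n^b(t) for a fixed n.
    """
    maxima = sorted(max(d) for d in draws)
    B = len(maxima)
    allowed = math.floor(x * B + 1e-9)   # draws permitted to exceed the level
    if allowed >= B:
        return -math.inf
    return maxima[B - 1 - allowed]
```

Only one quantile per window size is estimated here, which is why far fewer bootstrap draws suffice than for a separate level at every central point.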

Remark 2.5.

The method can be easily extended for break detection in multivariate regression. In that case one can consider $A_{n}^{l}(t)$ for $l$ -th component of outcome, alter the calibration scheme accordingly and make multiplicity correction (2.7) also account for the dimensionality of responses (not only for the windows and break locations).

2.1 Multiple break detection

In spite of the fact that we allow for at most one break, the local nature of the test statistic $A_{n}(t)$ allows for a straightforward application of the test in presence of multiple breaks as well. Again, consider a dataset $\{(X_{i},y_{i})\}_{i=1}^{N}$ but assume $\mathbb{H}_{1}$ allows for multiple change-points $\{\tau_{k}\}_{k=1}^{K}$ ( $K$ is unknown). Formally, extending the notation $\tau_{0}\coloneqq 1$ and $\tau_{K+1}=N$ ,

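$$\mathbb{H}_{1}\coloneqq\left\{\begin{array}{l}\exists\{f_{k}^{*}\}_{k=1}^{K+1}:f_{k}^{*}\neq f_{k+1}^{*},\\ y_{i}=f_{k}^{*}(X_{i})+\varepsilon_{i}\text{ if }\tau_{k-1}\leq i<\tau_{k}\text{ for all }k\end{array}\right\}.$$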

Then we estimate the location of the first change-point as

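$$\hat{\tau}_{1}\coloneqq\arg\max_{t\in[\tilde{\tau}^{n_{*}}-n_{*},\tilde{\tau}^{n_{*}}+n_{*})}A_{n_{+}}(t).$$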

Next, the procedure is recursively called on the rest of the dataset $\{(X_{i},y_{i})\}_{i=\tilde{\tau}^{n_{*}}+n_{*}}^{N}$ .
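The restart rule can be sketched as a loop (an iterative rewrite of the recursion, chosen here for simplicity). The callable `detect` is hypothetical: it stands for the single-break procedure of this section applied to observations `start..stop` and is assumed to report its results in the indexing of the full dataset.

```python
def detect_all(detect, N):
    """Collect change-point estimates by restarting after each detection.

    detect(start, stop) runs the single-break procedure on observations
    start..stop (1-based, inclusive) and returns (tau_tilde, n_star, tau_hat)
    in global indices, or None if H0 is not rejected.
    """
    found, start = [], 1
    while start <= N:
        hit = detect(start, N)
        if hit is None:
            break
        tau_tilde, n_star, tau_hat = hit
        found.append(tau_hat)
        next_start = tau_tilde + n_star   # resume at i = tau_tilde + n_star
        if next_start <= start:           # guard against a stalled detector
            break
        start = next_start
    return found
```

The guard on `next_start` is a defensive addition for the sketch and is not part of the procedure described in the text.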

3 Theoretical results

This section is devoted to the theoretical results. Namely, Section 3.2 presents the bootstrap validity result, claiming that the critical levels $x_{n,\alpha}^{\flat}(t)$ yielded by the calibration procedure are indeed chosen in accordance with the significance level $\alpha$ . The sensitivity result is reported in Section 3.3. It defines the minimal window width sufficient for the detection of a break and is followed by a corollary providing change-point localization guarantees.

3.1 Assumptions and definitions

In order to state the theoretical results we need to formulate some assumptions and definitions. Particularly, we rely on definition of sub-Gaussian variables and vectors.

Definition 3.1 (Sub-Gaussianity).

We say a centered random variable $x$ is sub-Gaussian with $\mathfrak{g}^{2}$ if

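$$\mathbb{E}\left[\exp(sx)\right]\leq\exp\left(\mathfrak{g}^{2}s^{2}/2\right),\quad\forall s\in\mathbb{R}.$$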

We say a centered random vector $X$ is sub-Gaussian with $\mathfrak{g}^{2}$ if for all unit vectors $u$ the product $\langle u,X\rangle$ is sub-Gaussian with $\mathfrak{g}^{2}$ .
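Two standard examples (textbook facts, not taken from the paper) may help to read the definition: a centered Gaussian variable attains the bound with equality for $\mathfrak{g}^{2}=\sigma^{2}$, and a symmetric random sign satisfies it with $\mathfrak{g}^{2}=1$.

```latex
\mathbb{E}[\exp(sx)] = \exp\left(\sigma^{2}s^{2}/2\right)
  \quad \text{for } x \sim \mathcal{N}(0,\sigma^{2}),
\qquad
\mathbb{E}[\exp(sx)] = \cosh(s) \leq \exp\left(s^{2}/2\right)
  \quad \text{for } \mathbb{P}\{x = 1\} = \mathbb{P}\{x = -1\} = 1/2 .
```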

Further, we consider two broad classes of smooth functions: Sobolev and Hölder.

Definition 3.2 (Sobolev and Hölder classes).

Consider an orthonormal basis $\{\psi_{j}\}$ in $L_{2}(\mathbb{R}^{p})$ and a function $f=\sum_{j}f_{j}\psi_{j}\in L_{2}(\mathbb{R}^{p})$ . We call it $\aleph$ -smooth Sobolev if

$$\exists B:\sum_{j=1}^{\infty}j^{2\aleph}f_{j}^{2}\leq B^{2}$$

and we call it $\aleph$ -smooth Hölder if

$$\exists B:\sum_{j=1}^{\infty}j^{\aleph}\left|f_{j}\right|\leq B^{2}.$$

These properties drive the choice of the regularization parameter $\rho$ . Namely, for sample size $M$ large enough we choose

$$\rho=\frac{B^{2}}{\log M}\tag{3.4}$$

if the function is Sobolev and

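$$\rho=\left(\frac{B^{2}}{\log M}\right)^{2\aleph/(2\aleph+1)}\left(\frac{1}{M}\right)^{1/(2\aleph+1)}\tag{3.5}$$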

if the function is $\aleph$ -Hölder.
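For illustration, the two choices of $\rho$ can be evaluated numerically as below. The function names are hypothetical, the formulas are transcribed from the equation list of this page as (3.4) and (3.5), and the sketch assumes $M>1$ so that $\log M$ is positive.

```python
import math


def rho_sobolev(B, M):
    """Regularization parameter for the Sobolev case, following (3.4)."""
    return B ** 2 / math.log(M)


def rho_holder(B, M, aleph):
    """Regularization parameter for the aleph-Hölder case, following (3.5)."""
    e = 2 * aleph + 1
    return (B ** 2 / math.log(M)) ** (2 * aleph / e) * (1 / M) ** (1 / e)
```

Since $\rho$ depends on the sample size $M$, the windows of size $n$ and $2n$ entering $A_{n}(t)$ would in general use different values of $\rho$.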

Throughout the paper we use a variety of norms. We use $\left\|\cdot\right\|$ to denote Euclidean norm of a vector or a spectral norm of a matrix. Further, $\left\|\cdot\right\|_{\infty}$ refers to sup-norm for both vectors and matrices (the maximal absolute value of an element), as well as functions (the maximal absolute value of an element of its image), while $\left\|\cdot\right\|_{F}$ stands for Frobenius norm of a matrix.

The result Lemma F.1 (by [44]) we rely upon imposes the following two assumptions.

Assumption 3.1.

Let there exist $C_{\psi}$ and $L_{\psi}$ s.t. for eigenfunctions $\{\psi_{j}(\cdot)\}_{j=1}^{\infty}$ of covariance function $k(\cdot,\cdot)$

$$\max_{j}\left\|\psi_{j}\right\|_{\infty}\leq C_{\psi}$$

and for all $t,s\in\mathbb{R}^{p}$

$$\left|\psi_{j}(t)-\psi_{j}(s)\right|\leq jL_{\psi}\left\|t-s\right\|.$$

Assumption 3.2.

Let for the eigenvalues $\{\mu_{j}\}_{j=1}^{\infty}$ of covariance function $k(\cdot,\cdot)$ exist positive $c$ and $C$ s.t. $cj^{-2\aleph}\leq\mu_{j}\leq Cj^{-2\aleph}$ for $\aleph>1/2$ .

The Matérn kernel with smoothness index $\aleph-1/2$ satisfies these assumptions. In [44] the authors claim that their results also hold for kernels with non-polynomially decaying eigenvalues, like RBF and polynomial kernels. Since we do not use these assumptions in our proofs directly, so do ours.

Finally, we introduce the assumptions required by our machinery.

Assumption 3.3.

Let $\tilde{K}_{n}(t)^{-1}$ have the same elements as $K(X_{n}(t))^{-1}$ except for the diagonal, where $\mathrm{diag}\,\tilde{K}_{n}(t)^{-1}=0$ . Assume there exists a positive $\gamma$ s.t. for all $t\in\mathbb{T}_{n}$ , as $n\rightarrow\infty$ ,

$$\left\|\tilde{K}_{n}(t)^{-1}\mathbb{E}\left[y_{n}(t)\right]\right\|_{\infty}=O(n^{\gamma}).\tag{3.8}$$

It would be natural to expect $K(X_{n}(t))^{-1}$ in (3.8) instead of $\tilde{K}_{n}(t)^{-1}$ , e.g.

$$\left\|K(X_{n}(t))^{-1}\mathbb{E}\left[y_{n}(t)\right]\right\|_{\infty}=O(n^{\gamma}).\tag{3.9}$$

If the design $\{X_{i}\}_{i=1}^{N}$ is regular (e.g. a uniform grid), (3.9) implies (3.8); in general, however, it is specifically (3.8) that we need. We prove the bootstrap validity result (Theorem 3.1) using our Gaussian approximation Lemma B.3. There we have to treat the diagonal and off-diagonal elements of the quadratic forms separately. This is reminiscent of the results in [21], where the asymptotic distribution of a single quadratic form is studied (we, in contrast, work with the joint distribution of numerous quadratic forms).

Assumption 3.4.

Let there exist a positive constant $C$ independent of $n$ s.t. $\forall t$

$$\left\|K(X_{n}(t))\right\|<C.$$

Informally, Assumption 3.3 does not let the GP prior be too unrealistic, while Assumption 3.4 prohibits concentration of measurements in a local area. From a practical perspective, we would not want Assumption 3.4 to be violated either, as it ensures that $K(X_{n}(t))$ is well-conditioned.

3.2 Bootstrap validity

In this section we demonstrate closeness of measures $\mathbb{P}$ and $\mathbb{P}^{\flat}$ in some sense which is a theoretical justification of our choice of the calibration scheme.

Theorem 3.1.

Let $\mathbb{H}_{0}$ , Assumption 3.1, Assumption 3.2 hold, $\varepsilon_{i}$ be sub-Gaussian with $\mathfrak{g}^{2}$ . Let $f^{*}$ be $\aleph$ -smooth Sobolev and $\kappa\coloneqq\left(\aleph-1/2\right)/(2\aleph)$ or $\aleph$ -smooth Hölder and $\kappa\coloneqq\aleph/(2\aleph+1)$ . Let $n_{-}$ , $n_{+}$ , $\left|\mathfrak{N}\right|$ and $N$ grow. Also assume for some positive $\gamma$ and $\delta$

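$$\frac{\log^{15}\left(\left|\mathfrak{N}\right|N\right)}{n_{-}^{1-6\gamma}}=o(1),\tag{3.11}$$

$$\left(\frac{\log\left|\mathcal{I}^{\flat}\right|}{\left|\mathcal{I}^{\flat}\right|}\right)^{\kappa}n_{+}^{1/2+\delta+\gamma}=o(1)\tag{3.12}$$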

and finally, let Assumption 3.3 hold for $\gamma$ . Then on a set of arbitrarily high probability

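$$\sup_{c_{n}(t)}\left|\mathbb{P}\left\{\forall n\in\mathfrak{N},t\in\mathbb{T}_{n}:A_{n}(t)<c_{n}(t)\right\}-\mathbb{P}^{\flat}\left\{\forall n\in\mathfrak{N},t\in\mathbb{T}_{n}:A_{n}^{\flat}(t)<c_{n}(t)\right\}\right|=o(1).$$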

The proof of the theorem is given in Appendix A. The strategy of the proof is typical for bootstrap validity results. First, we approximate the joint distribution of the test statistics $\{A_{n}(t)\}_{n\in\mathfrak{N},t\in\mathbb{T}_{n}}$ with the distribution of some function of a high-dimensional Gaussian vector. This step is handled by our Gaussian approximation result, Lemma B.4. Next, the same is done for their bootstrap counterparts $\{A_{n}^{\flat}(t)\}_{n\in\mathfrak{N},t\in\mathbb{T}_{n}}$ using a different Gaussian vector. Finally, we build the bridge between the two approximating distributions using the fact that the means and variances of these Gaussian vectors are close to each other (see Lemma C.1). The assumptions (3.11) and (3.12) ensure that the remainder terms involved in Lemma B.4 and Lemma C.1, respectively, are negligible. In turn, the Gaussian approximation result (Lemma B.4) is obtained using a novel, significantly tailored version of the Lindeberg principle [30, 35, 11, 12]. The proof of the Gaussian comparison result (Lemma C.1) is inspired by the technique used in [41]. We use the Slepian smart interpolant too, yet apply it in a non-trivial way. We believe Lemma B.4 can also be proven via the Slepian smart interpolant instead of the Lindeberg principle, which might yield a slightly better convergence rate. We leave this for future research.

3.3 Sensitivity result

Consider a setting under $\mathbb{H}_{1}$ . For simplicity, assume there is a single change point at $\tau$ . In order for the break to be detectable we have to impose some discrepancy condition on $f_{1}^{*}$ and $f_{2}^{*}$ . Moreover, in order to guarantee detection we have to require the choice of covariates $X_{i}$ to make this discrepancy observed. Keeping that in mind we define the observed break extent

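$$\mathfrak{B}_{n}^{2}\coloneqq\frac{1}{n}\sum_{i=\tau}^{\tau+n-1}\left(f_{1}^{*}(X_{i})-f_{2}^{*}(X_{i})\right)^{2}.\tag{3.14}$$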

Theorem 3.2.

Let the setting described above hold, $\varepsilon_{i}$ be sub-Gaussian with $\mathfrak{g}^{2}$ . Let $f^{*}$ , $f_{1}^{*}$ , $f_{2}^{*}$ be $\aleph$ -smooth Sobolev and $\kappa\coloneqq\left(\aleph-1/2\right)/(2\aleph)$ or $\aleph$ -smooth Hölder and $\kappa\coloneqq\aleph/(2\aleph+1)$ . Also let $n_{*}\in\mathfrak{N}$ , $n_{*},N\rightarrow+\infty$ and $\mathfrak{B}_{n_{*}}\rightarrow 0$ . Also impose Assumption 3.1, Assumption 3.2, Assumption 3.3 (for $t<\tau$ ), Assumption 3.4, (3.11), (3.12) and

$$\mathfrak{B}_{n_{*}}^{-1}\left(\frac{\log n_{*}}{n_{*}}\right)^{\kappa}=o(1).\tag{3.15}$$

Then

[TABLE]

We defer the proof to Appendix D. It is fairly straightforward. First, we bound the test statistics $A_{n}(\tau)$ with high probability, next we use Theorem 3.1 to also bound the critical levels $x_{n,\alpha}(\tau)$ and finally, we bound the test statistic $A_{n}(\tau)$ from below and make sure it exceeds the critical level. The assumption (3.15) essentially requires the observed break extent to exceed the precision of the Gaussian Process Regression predictor.

Remark 3.1.

The sensitivity result gives rise to another motivation behind the simultaneous consideration of wider and narrower windows (and it is also another argument for local statistics in the first place, see Remark 2.1). Consider a hostile setting, where the values of functions $f_{1}^{*}$ and $f_{2}^{*}$ coincide for most of the arguments. For instance, let $\mathfrak{B}^{\prime}\coloneqq\left|f_{1}^{*}(X_{\tau})-f_{2}^{*}(X_{\tau})\right|$ and let $f_{1}^{*}(X_{i})=f_{2}^{*}(X_{i})$ for all $i>\tau$ . Then only the term $i=\tau$ contributes to the sum defining the observed break extent, so by definition $\mathfrak{B}_{n}=\mathfrak{B}^{\prime}/\sqrt{n}$ and hence the assumption (3.15) implies

[TABLE]

Clearly, a narrower window detects a smaller break of such a kind.

Remark 3.2.

In the setting allowing for multiple change-points (see Section 2.1) assumption (3.15) dictates the requirement for the minimal distance $\Delta_{\tau}\coloneqq\min_{k,k^{\prime}:k\neq k^{\prime}}\left|\tau_{k}-\tau_{k^{\prime}}\right|$ between two consecutive change-points as $\Delta_{\tau}\geq 2n_{*}+\left|\mathcal{I}^{\flat}\right|$ which is sufficient for detection of all the change-points with probability approaching $1$ .

Finally, we formulate a trivial corollary providing change-point localization guarantees.

Corollary 3.1.

Under the assumptions of Theorem 3.2

[TABLE]

4 Empirical Study

In this section we report the results of our experiments (the code is available at github.com/akopich/gpcd). Section 4.1 presents the findings of the simulation study supporting the bootstrap validity and sensitivity results, as well as empirically justifying the simultaneous use of multiple windows and the change-point location estimator (2.10). In Section 4.2 we successfully apply the method to detect change-points in daily quotes of the NASDAQ Composite index.

4.1 Experiment on synthetic data

We consider functions $f^{*}(x)=f_{1}^{*}(x)=\sin(x)$ and $f_{2}^{*}(x)=\sin(x+\phi)$ for various choices of $\phi$ . The univariate covariates $\{x_{i}\}_{i=1}^{800}$ are $800$ equidistant points between [math] and $\pi$ , taken in shuffled order. Under $\mathbb{H}_{0}$ the responses are sampled independently as $y_{i}\sim\mathcal{N}\left(f^{*}(x_{i}),0.1^{2}\right)$ . Under $\mathbb{H}_{1}$ we choose the change-point location $\tau=700$ and sample $y_{i}\sim\mathcal{N}\left(f_{1}^{*}(x_{i}),0.1^{2}\right)$ for $i<\tau$ and $y_{i}\sim\mathcal{N}\left(f_{2}^{*}(x_{i}),0.1^{2}\right)$ for $i\geq\tau$ . For our experiments we consider $\phi\in\{\pi/2,\pi/5,\pi/10,\pi/20,\pi/40\}$ and report the corresponding observed break extent $\mathfrak{B}_{n}$ (defined by (3.14)). In all the experiments $\mathcal{I}^{\flat}=\{1,2,..,500\}$ and the significance level $\alpha$ was chosen to be $0.01$ . We choose the RBF kernel family

[TABLE]

and choose optimal parameters $\theta$ and $\sigma^{2}$ via evidence maximization using $\{x_{i}\}_{i\in\mathcal{I}^{\flat}}$ .
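The synthetic design can be reproduced along the following lines. This is a sketch under assumptions rather than the code used in the experiments: the lower end of the covariate range is not recoverable from this text, so it is exposed as the hypothetical parameter `lower` (defaulting to `0.0` purely for illustration), both endpoints are assumed to be included in the grid, and the function names and the seed are arbitrary.

```python
import math
import random


def make_data(phi, tau=700, N=800, lower=0.0, upper=math.pi, sd=0.1, seed=0):
    """Shuffled equidistant covariates and responses with a break at tau.

    `lower` is an assumption; phi = 0 corresponds to sampling under H0.
    """
    rng = random.Random(seed)
    x = [lower + (upper - lower) * i / (N - 1) for i in range(N)]
    rng.shuffle(x)
    y = []
    for i in range(1, N + 1):                 # 1-based time index
        shift = phi if i >= tau else 0.0      # f_2 applies for i >= tau
        y.append(rng.gauss(math.sin(x[i - 1] + shift), sd))
    return x, y


def break_extent(x, phi, tau, n):
    """Observed break extent B_n from (3.14) for f_1 = sin, f_2 = sin(. + phi)."""
    gaps = [(math.sin(x[i - 1]) - math.sin(x[i - 1] + phi)) ** 2
            for i in range(tau, tau + n)]
    return math.sqrt(sum(gaps) / n)
```

For instance, `x, y = make_data(math.pi / 10)` followed by `break_extent(x, math.pi / 10, 700, 40)` gives one realization of the dataset together with its observed break extent for window size `40`.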

The suggested approach has demonstrated proper control of the first type error rate in all the configurations we consider, keeping it below $0.015$ .

The power of the test is shown in Figure 1. As expected, a larger window size $n$ and a larger observed break extent $\mathfrak{B}_{n}$ correspond to higher power. Figure 2 summarizes the root mean squared errors of the estimator $\hat{\tau}$ (defined by (2.10)). The estimator proves to be reliable when the power of the test is high. Generally, wider windows and a larger observed break extent lead to higher accuracy of $\hat{\tau}$ .

Further, in order to investigate the behavior of the method in a multiscale regime ( $\left|\mathfrak{N}\right|>1$ ) we use several choices of $\mathfrak{N}$ for a single $\phi=\pi/10$ . The results, reported in Table 1, exhibit a significant decrease in the average width of the narrowest detecting window $n_{*}$ and hence an improvement in change-point localization thanks to the simultaneous use of wider and narrower windows. This should be highly beneficial in the presence of multiple change points, as it allows for a smaller distance $\Delta_{\tau}$ between them (see Section 2.1 and Remark 3.2).

4.2 Real-world dataset experiment

The prices of stock indexes are known to be subject to abrupt breaks [37, 38]. We consider a series $X_{t}$ of daily closing prices of the NASDAQ Composite index. The dataset spans from February 1990 until February 2019. We suggest modelling the process using the following stochastic differential equation

[TABLE]

where $W_{t}$ denotes a Wiener process. Now we wish to test the dataset for the presence of breaks. In order to do so we employ the Euler–Maruyama method, effectively reducing the problem to a regression problem with univariate covariates $x_{t}\coloneqq X_{t}$ and the corresponding responses $y_{t}\coloneqq\frac{X_{t+1}-X_{t}}{X_{t}}$ . Further, we apply the scheme suggested in Section 2.1 with $\alpha=0.01$ , $\mathfrak{N}=\{20\}$ , $\mathcal{I}^{\flat}=\{1..300\}$ and the kernel family (4.1). The method detects three breaks, all of which may be related to known events: the activation of the CIH computer virus, which attacked Windows 9x in August 1998, the burst of the dot-com bubble, and the 2008 financial crisis.
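The reduction to a regression problem amounts to pairing each closing price with the relative increment to the next one. A minimal sketch, with a hypothetical function name and a plain list of prices as input:

```python
def to_regression_pairs(prices):
    """Covariates x_t = X_t and responses y_t = (X_{t+1} - X_t) / X_t."""
    x = list(prices[:-1])
    y = [(nxt - cur) / cur for cur, nxt in zip(prices[:-1], prices[1:])]
    return x, y
```

The last price has no successor and therefore yields no pair, so a series of length $T$ produces $T-1$ response-covariate pairs.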

Acknowledgements

The research has been partially funded by Deutsche Forschungsgemeinschaft (DFG) through grant CRC 1294 “Data Assimilation”, project “Approximative Bayesian inference and model selection for stochastic differential equations (SDEs)”.

Further, we would like to thank Vladimir Spokoiny, Alexandra Carpentier and Evgeniya Sokolova for the discussions and/or proofreading which have greatly improved the manuscript.

Appendix A Proof of the bootstrap validity result

Proof of Theorem 3.1.

Apply Lemma B.4 to $A_{n}(t)$ and $A_{n}^{\flat}(t)$, next apply Lemma C.1 and, via the triangle inequality, obtain on a set of probability at least $1-2\exp(-\mathrm{u}^{2})$

[TABLE]

where $R_{A}$ and $R_{C}$ come from Lemma B.4 and Lemma C.1 respectively. Now observe

[TABLE]

and using (3.11) conclude $R_{A}=o(1)$. Clearly, the ratio entering the definition of $R_{C}$ is bounded: $\sqrt{n}/\mathfrak{s}=O(1)$ (in the same way as in the proof of Lemma B.4). Next we use Lemma F.1 and obtain on a set of probability at least $1-\left|\mathcal{I}^{\flat}\right|^{-10}$

[TABLE]

where

[TABLE]

Now observe that, by construction of $Z$ and $\tilde{Z}$ (coming from the Gaussian approximation and defined by (B.4)), the following holds for $\Delta_{\mu}$ and $\Delta_{\Sigma}$ involved in Lemma C.1

[TABLE]

Further, Lemma C.3 yields the bound $\left|\mathrm{Var}\left[\varepsilon_{1}\right]-\mathrm{Var}\left[\varepsilon^{\flat}_{1}\right]\right|=O(\Delta_{f}^{2})$. Assumption (3.11) implies $\gamma<1/6$. Then (3.12) in turn implies

[TABLE]

Finally, choose $\Delta=n^{-\delta/2}$ (involved in the definition of $R_{C}$ , see Lemma C.1), recall assumption (3.12) and conclude $R_{C}=o(1)$ . ∎

Appendix B Gaussian Approximation

Consider a random vector $x\in\mathbb{R}^{N}$ of independent components centered at $\mu=\mathbb{E}\left[x\right]$ . Introduce $x_{n}(j)$ for even $n=2m\in\mathbb{N}$ and $j\in J\coloneqq\{m,m+1,..,N-m\}$ denoting a vector composed of $\{x_{i}\}_{i=j-m+1}^{j+m}$ . Also, assume $\left|J\right|$ symmetric matrices $B(j)\in\mathbb{R}^{n\times n}$ are given and define a map $S:\mathbb{R}^{N}\rightarrow\mathbb{R}^{\left|J\right|}$ s. t. $S(x)_{j}\coloneqq\frac{1}{\sqrt{n}}\langle x_{n}(j),B(j)x_{n}(j)\rangle$ .
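The map $S$ can be sketched in a few lines. This is a minimal illustration, assuming the matrices $B(j)$ are stored as one array with a leading window axis; the function name `quadratic_window_statistics` is hypothetical. In 0-based indexing the window $x_{n}(j)$ of the $k$-th element of $J$ is the slice `x[k : k + n]`.

```python
import numpy as np

def quadratic_window_statistics(x, B):
    """Compute S(x)_j = <x_n(j), B(j) x_n(j)> / sqrt(n) for every window.

    x : array of shape (N,)
    B : array of shape (|J|, n, n), one symmetric matrix per window,
        with n even and |J| = N - n + 1.
    """
    x = np.asarray(x, dtype=float)
    B = np.asarray(B, dtype=float)
    num_windows, n, _ = B.shape
    if n % 2 != 0:
        raise ValueError("window size n must be even")
    if num_windows != x.size - n + 1:
        raise ValueError("expected exactly one matrix per window")
    # Row k holds the n consecutive components x[k : k + n].
    windows = np.lib.stride_tricks.sliding_window_view(x, n)
    return np.einsum("ki,kij,kj->k", windows, B, windows) / np.sqrt(n)

# Toy usage: N = 6, n = 4, identity matrices, so S is a scaled sum of squares.
x_demo = np.arange(1.0, 7.0)
B_demo = np.stack([np.eye(4)] * 3)
S_demo = quadratic_window_statistics(x_demo, B_demo)
```

With identity matrices each entry of `S_demo` equals the sum of squares in the window divided by $\sqrt{n}=2$.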

Two ingredients of paramount importance are the soft-max function $F_{\beta}:\mathbb{R}^{\left|J\right|}\rightarrow\mathbb{R}$

[TABLE]

and a smooth indicator function $g_{\Delta}$ with three bounded derivatives s.t. $\left|x\right|\geq\Delta\Rightarrow g_{\Delta}(x)=1[x>0]$. Also let $g\coloneqq g_{1}$ and $g(x/\Delta)=g_{\Delta}(x)$. An example of such a function along with bounds for its derivatives is provided in [39].
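To illustrate why the soft-max is a convenient smooth surrogate for the maximum, the sketch below assumes the standard log-sum-exp form $F_{\beta}(x)=\beta^{-1}\log\sum_{j}\exp(\beta x_{j})$. This form is an assumption on our part (the displayed definition is not reproduced here), and the function name `soft_max` is hypothetical. For this form the gap to the true maximum lies in $[0,\log\left|J\right|/\beta]$, which matches the choice $\Delta=\log\left|J\right|/\beta$ made in Lemma B.6.

```python
import numpy as np

def soft_max(x, beta):
    """Log-sum-exp smoothing of max(x): log(sum(exp(beta * x))) / beta.

    ASSUMPTION: the standard log-sum-exp form is used for illustration.
    """
    x = np.asarray(x, dtype=float)
    if beta <= 0:
        raise ValueError("beta must be positive")
    shift = np.max(x)  # subtract the maximum for numerical stability
    return shift + np.log(np.sum(np.exp(beta * (x - shift)))) / beta

x_demo = np.array([0.3, -1.2, 0.9, 0.85])
beta_demo = 50.0
# Approximation gap; it lies between 0 and log(len(x_demo)) / beta_demo.
gap = soft_max(x_demo, beta_demo) - np.max(x_demo)
```

Increasing $\beta$ tightens the approximation but inflates the derivatives of $F_{\beta}$, which is the trade-off the proofs balance.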

Consider the following decomposition of matrices $B(j)$ into diagonal matrices and matrices with zeroes down their diagonals

[TABLE]

Further, consider a vector $X\in\mathbb{R}^{N}$ s.t. $x_{i}^{2}=X_{i}$ for all $i=1..N$, and introduce the notation $X_{n}(j)$ similarly to $x_{n}(j)$. Now let $Z$ denote the vectors $x$ and $X$ stacked. Clearly, there is a map $Q:\mathbb{R}^{2N}\rightarrow\mathbb{R}^{\left|J\right|}$ s.t.:

[TABLE]

for all $j=1..\left|J\right|$ . Also define an independent vector

[TABLE]

and denote the first half of the vector as $\tilde{x}$ and the second as $\tilde{X}$ .

Our proof employs a novel version of the Lindeberg principle [30, 35, 11, 12] tuned for the problem at hand. Typically, the Lindeberg principle suggests “replacing” random variables with their Gaussian counterparts one by one. Here we have to “replace” every $n$-th component of $x$, along with the component of $X$ being its square, starting with the $1$-st one, then repeat starting with the $2$-nd one and so on, repeating the procedure $n$ times. Namely, in the first step we “replace” the components with indexes $1$, $n+1$, $2n+1$ and so on. In the second step we “replace” the components with indexes $2$, $n+2$, $2n+2$ and so on, and we proceed further in the same manner. More formally, consider a sequence of vectors $x^{i}\in\mathbb{R}^{N}$ for $i=0..n$ s.t. $x^{0}=x$ and $\forall i>0:x^{i}_{kn+i}=\tilde{x}_{kn+i}$ for all $k\in\{0,1,2,..,\lceil N/n\rceil-1\}$ and $x^{i}_{j}=x^{i-1}_{j}$ for $j$ s.t. $\nexists k\in\{0,1,2,..,\lceil N/n\rceil-1\}:kn+i=j$. Denote the indexes of the components replaced at step $i$ as $I_{i}$. Also define a vector $\mathring{x}^{i}$ s.t. $\mathring{x}^{i}_{j}=0$ for $j\in I_{i}$ and $\mathring{x}^{i}_{j}=x^{i}_{j}$ for the rest of $j$. Define the sequences $X^{i}$ and $\mathring{X}^{i}$ in a similar way. Finally, let $Z^{i}$ denote the vectors $x^{i}$ and $X^{i}$ stacked together and $\mathring{Z}_{i}$ denote the stacked vectors $\mathring{x}^{i}$ and $\mathring{X}^{i}$. Note that $Z^{n}=\tilde{Z}$.
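The replacement schedule can be made concrete with a short sketch. The helper names below are hypothetical, indexes are 1-based to match the text, and indexes $kn+i$ exceeding $N$ are simply dropped, which is our reading of the construction. The sketch also checks a property the schedule seems designed for: every window of $n$ consecutive components contains exactly one index from each $I_{i}$, so a single step alters one component per window.

```python
def replacement_index_sets(N, n):
    """Return [I_1, ..., I_n] with I_i = {k*n + i : k >= 0} restricted to {1, ..., N}."""
    if n <= 0 or N < n:
        raise ValueError("need 0 < n <= N")
    return [list(range(i, N + 1, n)) for i in range(1, n + 1)]

def hits_per_window(index_set, N, n):
    """Count the members of index_set inside each window of n consecutive positions."""
    members = set(index_set)
    return [sum(1 for j in range(start, start + n) if j in members)
            for start in range(1, N - n + 2)]

N_demo, n_demo = 10, 4
sets_demo = replacement_index_sets(N_demo, n_demo)
# sets_demo == [[1, 5, 9], [2, 6, 10], [3, 7], [4, 8]]
hits_demo = [hits_per_window(I, N_demo, n_demo) for I in sets_demo]
```

After all $n$ steps every index in $\{1,..,N\}$ has been replaced exactly once, in agreement with $Z^{n}=\tilde{Z}$.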

Lemma B.1.

Choose $i=1..n$ . Consider a function $\phi:\mathbb{R}^{N}\times\mathbb{R}^{N}\rightarrow\mathbb{R}$ defined as

[TABLE]

where $j\notin I_{i}\Rightarrow a_{j}=0,~{}b_{j}=0$ and $Q(\cdot)$ is defined by (B.3). Further, using decomposition (B.2) assume for some positive $L$ :

[TABLE]

and denote

[TABLE]

Then

[TABLE]

Proof.

The proof consists in direct differentiation followed by an application of Lemma A.2 from [13], which provides bounds for the first three derivatives of the soft-max function. ∎

Lemma B.2.

Let assumptions of Lemma B.1 hold. Then for an independent Gaussian vector $\tilde{Z}$ (defined by (B.4))

[TABLE]

where $m_{3}$ is the sum of the maximal third centered absolute moments of $x$ and $\tilde{Z}$ , while $\mathcal{Z}$ is defined in Lemma B.5 and $Q(\cdot)$ is defined by (B.3).

Proof.

Clearly, for $f\coloneqq g_{\Delta}\circ F_{\beta}\circ Q$ ,

[TABLE]

and hence

[TABLE]

The rest of the proof consists in bounding an arbitrary summand on the right-hand side. In order to do so we use a Taylor expansion of second degree for ${f\left(Z^{i-1}\right)}$ and ${f\left(Z^{i}\right)}$ around $\mathbb{E}\left[Z\right]$ with Lagrange remainder. Given the equality of the first two moments of $Z$ and $\tilde{Z}$, we conclude that the first two terms cancel out. Hence, using Lemma B.5 we immediately obtain

[TABLE]

Combination of (B.13) and (B.14) establishes the claim. ∎

Lemma B.3.

Let assumptions of Lemma B.1 hold. Then

[TABLE]

where $\mathfrak{s}$ comes from Lemma E.1 and $Q(\cdot)$ is defined by (B.3).

Proof.

Choose $\Delta=\log\left|J\right|/\beta$ . Then for an arbitrary constant vector $c\in\mathbb{R}^{\left|J\right|}$

[TABLE]

Here we have consecutively used Lemma B.6, Lemma B.2, Lemma B.6 again and Lemma E.1 (which also defines $\mathfrak{s}$). The last step uses that $\log 3>1$. Now we choose

[TABLE]

and obtain

[TABLE]

Similar reasoning yields a chain of “larger-or-equal” inequalities which, combined with the one above, finalizes the proof. ∎

Lemma B.4.

Let $x-\mathbb{E}\left[x\right]$ be sub-Gaussian and matrices $B(j)$ have bounded spectrum. Also assume for some positive $\gamma$

[TABLE]

Then for any positive $\mathrm{u}$ on a set of probability at least $1-\exp\left(-\mathrm{u}^{2}\right)$ for $N$ and $n$ going to infinity

[TABLE]

where $Q(\cdot)$ is defined by (B.3).

Proof.

Application of Lemma B.7 to matrices $E(j)$ yields the bound on $\mathfrak{L}$ defined by (B.7)

[TABLE]

on a set of probability at least $1-\left|J\right|e^{-\mathrm{t}^{2}}$ .

Investigation of $\mathfrak{s}$ defined in Lemma E.1 yields $\sqrt{n}/\mathfrak{s}=O(1)$. Indeed, $\left\|B(j)\right\|_{F}^{2}$ is the sum of squared eigenvalues (which are bounded) and $\left\|\mathbb{E}\left[x\right]\right\|^{2}\leq n\left\|\mathbb{E}\left[x\right]\right\|_{\infty}^{2}$. Now we apply Lemma B.3

[TABLE]

Now change the variable $\mathrm{u}^{2}\coloneqq\mathrm{t}^{2}-\log\left|J\right|$

[TABLE]

∎

Lemma B.5.

In terms of Lemma B.1 for function

[TABLE]

it holds that

[TABLE]

and

[TABLE]

Proof.

The proof consists in direct differentiation and bounding using Lemma B.1 and equation (53) from [39]. Intermediate differentiation steps can be found in the proof of Lemma A.14 [39]. ∎

The following lemma justifies the smoothing relying on smooth indicator $g_{\Delta}$ and soft-max $F_{\beta}$ . Its proof can be found in [13].

Lemma B.6.

Let $\Delta=\log\left|J\right|/\beta$ , then for arbitrary vector $x$ :

[TABLE]

The next lemma establishes prerequisites for inequality (B.28).

Lemma B.7.

Consider a symmetric matrix $A$ with the largest eigenvalue $\Lambda$. Let $\varepsilon$ be a vector of independent elements, each sub-Gaussian with $\mathfrak{g}^{2}$. Then on a set of probability at least $1-\exp(-\mathrm{t}^{2})$

[TABLE]

Proof.

For a given unit vector $a$, since the components of $\varepsilon$ are independent and sub-Gaussian, $a^{T}\varepsilon$ is sub-Gaussian with $\mathfrak{g}^{2}$ as well. Hence,

[TABLE]

and therefore,

[TABLE]

∎

Appendix C Gaussian comparison

Notation of this section follows the notation of Section B. Proof of the following result was inspired by the proof of Theorem 1 in [41].

Lemma C.1.

Consider two $2N$-dimensional normal vectors $Z\sim\mathcal{N}\left(\mu,\Sigma\right)$ and $\tilde{Z}\sim\mathcal{N}\left(\tilde{\mu},\tilde{\Sigma}\right)$. Denote $\Delta_{\mu}\coloneqq\left\|\mu-\tilde{\mu}\right\|_{\infty}$ and $\Delta_{\Sigma}\coloneqq\left\|\Sigma-\tilde{\Sigma}\right\|_{\infty}$. Use the notation of Lemma B.1. Then for any constant vector $c$ and positive $\Delta$ it holds that

[TABLE]

where $\mathfrak{s}$ comes from Lemma E.1.

Proof.

The proof consists in repeated use of the Slepian smart interpolant. Denote the first and the second halves of the vector $Z$ as $x$ and $X$ and similarly introduce $\tilde{x}$ and $\tilde{X}$ being the halves of $\tilde{Z}$. Further, consider $n$ real values $\varphi_{1},\varphi_{2},..,\varphi_{n}$ and compose a vector of length $N$ iterating over these values:

[TABLE]

Denote $f\coloneqq g_{\Delta}\circ F_{\beta}\circ Q$ and consider a function

[TABLE]

where we use $\otimes$ to denote element-wise product and radicals are applied to vectors in an element-wise manner. Clearly,

[TABLE]

and hence

[TABLE]

For the derivative we have

[TABLE]

Next we apply Lemma C.2 (which applies only to centered vectors, thus the second term)

[TABLE]

Now we make use of Lemma B.5 and Lemma B.1 and choose $\beta=\log\left|J\right|/\Delta$

[TABLE]

Next, recalling (C.5) obtain

[TABLE]

Finally, in order to move from smooth functions to indicators we employ reasoning identical to the one in Lemma B.3.

[TABLE]

Combination with a similar chain of larger-or-equal finalizes the proof. ∎

We use the same version of Stein’s identity as the authors of [41] have.

Lemma C.2 (Stein’s identity).

Let $X\in\mathbb{R}^{p}$ be a normal centered vector and function $f:\mathbb{R}^{p}\rightarrow\mathbb{R}$ be a $C^{1}$ function with finite first derivatives. Then for all $j=1..p$

[TABLE]

Proof.

See Section A.6 of [40]. ∎
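As a sanity check, the identity can be verified numerically. The sketch below assumes the standard multivariate form $\mathbb{E}\left[X_{j}f(X)\right]=\sum_{k}\Sigma_{jk}\mathbb{E}\left[\partial_{k}f(X)\right]$ for a centered normal $X$ with covariance $\Sigma$; the covariance, the test function and the sample size are illustrative choices of ours, and the comparison holds only up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Covariance of a centred normal vector in R^3 (illustrative values).
Sigma = np.array([[1.0, 0.3, 0.1],
                  [0.3, 0.8, 0.2],
                  [0.1, 0.2, 0.5]])
w = np.array([0.7, -0.4, 1.1])

def f(X):
    """Smooth test function f(x) = tanh(<w, x>), evaluated row-wise."""
    return np.tanh(X @ w)

def grad_f(X):
    """Gradient of f, one row per sample: (1 - tanh^2(<w, x>)) * w."""
    return (1.0 - np.tanh(X @ w) ** 2)[:, None] * w[None, :]

X = rng.multivariate_normal(np.zeros(3), Sigma, size=200_000)

lhs = (X * f(X)[:, None]).mean(axis=0)   # estimates E[X_j f(X)] for each j
rhs = Sigma @ grad_f(X).mean(axis=0)     # estimates sum_k Sigma_jk E[d_k f(X)]
max_gap = np.max(np.abs(lhs - rhs))      # should be of Monte Carlo order
```

The two sides agree up to sampling noise, which shrinks at the usual rate of one over the square root of the sample size.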

Lemma C.3.

Consider $y,\hat{y},\varepsilon,\varepsilon^{\flat}$ defined in Section 2. Let $\Delta_{f}\coloneqq\left\|\mathbb{E}\left[y\right]-\hat{y}\right\|_{\infty}=O\left((\log\left|\mathcal{I}^{\flat}\right|/\left|\mathcal{I}^{\flat}\right|)^{\kappa}\right)$ for some positive $\kappa\leq 1/2$. Let $\varepsilon_{i}$ be sub-Gaussian with $\mathfrak{g}^{2}$. Then on a set of probability at least $1-\left|\mathcal{I}^{\flat}\right|^{-10}$

[TABLE]

Proof.

By construction

[TABLE]

Now due to sub-Gaussianity for a positive $\mathrm{e}$

[TABLE]

and hence

[TABLE]

On the set $\mathcal{E}$, Hoeffding’s inequality applies to $\varepsilon_{i}$ and their squares:

[TABLE]

Therefore, with probability at least $1-p^{\prime}-p^{\prime\prime}-p^{\prime\prime\prime}$

[TABLE]

and hence

[TABLE]

Clearly, the choice

[TABLE]

makes $p^{\prime},p^{\prime\prime},p^{\prime\prime\prime}$ polynomially decreasing. Substitution yields the claim. ∎

Appendix D Proof of sensitivity result

Proof of Theorem 3.2.

Denoting the probability density functions in the world of the Gaussian Process Regression model as $p(\cdot)$, by construction we have

[TABLE]

Further, denote $f^{r}_{n}(t)\coloneqq\mathbb{E}\left[y^{r}_{n}(t)\right]$ and $\varepsilon^{r}_{n}(t)\coloneqq y^{r}_{n}(t)-f^{r}_{n}(t)$ . Define shorthand notation $K^{r}_{n}(t)\coloneqq K(X^{r}_{n}(t))$ and $K^{l}_{n}(t)\coloneqq K(X^{l}_{n}(t))$ . Also let $\hat{f}^{r}_{n}(t)$ and $\hat{V}_{n}(t)$ denote predictive mean and variance of the Gaussian Process Regression for $X^{r}_{n}(t)$ given $X^{l}_{n}(t)$ and $y^{l}_{n}(t)$ . Now recall the posterior is Gaussian:

[TABLE]

Define a norm $\left\|x\right\|_{A}\coloneqq\left\|A^{1/2}x\right\|$ for an arbitrary positive-definite symmetric matrix $A$ . Clearly, $\left\|x\right\|_{A}^{2}=\langle x,Ax\rangle$ . Now trivial algebra yields

[TABLE]

where we use $\cong$ to denote “equality up to an additive deterministic constant”. Consider a matrix $K(X_{n}(t))$ being a block $2\times 2$ matrix with blocks of equal size:

[TABLE]

Notice that $\hat{V}_{n}(t)$ is its Schur complement, thus $\lambda_{\max}(\hat{V}_{n}(t))\leq\lambda_{\max}(K(X_{n}(t)))\leq C$ (the second inequality is due to Assumption 3.4). Using $\sigma^{2}>0$ we have $\lambda_{\min}(\hat{V}_{n}(t))>1/c$ and $\lambda_{\min}(K^{r}_{n}(t))>1/c$ for some $c$ independent of $n$ . To sum these observations up:

[TABLE]
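The eigenvalue control obtained from the Schur complement can be checked on a toy example. The sketch below makes several assumptions that are ours: a squared-exponential kernel stands in for the kernel family of the paper, $K(\cdot)$ is taken to include the noise term $\sigma^{2}I$, and the names `rbf_kernel` and `gp_predict` are hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel matrix between 1-D input arrays a and b."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_predict(x_left, y_left, x_right, noise_var):
    """Predictive mean and covariance of noisy responses at x_right given the left half."""
    K_ll = rbf_kernel(x_left, x_left) + noise_var * np.eye(x_left.size)
    K_rl = rbf_kernel(x_right, x_left)
    K_rr = rbf_kernel(x_right, x_right) + noise_var * np.eye(x_right.size)
    # Solve against y_left and K_lr in one call.
    solved = np.linalg.solve(K_ll, np.column_stack([y_left, K_rl.T]))
    mean = K_rl @ solved[:, 0]
    cov = K_rr - K_rl @ solved[:, 1:]   # Schur complement of K_ll in the joint matrix
    return mean, cov

rng = np.random.default_rng(1)
x_all = rng.uniform(0.0, 5.0, size=20)
x_left, x_right = x_all[:10], x_all[10:]
y_left = np.sin(x_left) + 0.1 * rng.standard_normal(10)
noise_var = 0.01

mean, cov = gp_predict(x_left, y_left, x_right, noise_var)
K_joint = rbf_kernel(x_all, x_all) + noise_var * np.eye(x_all.size)
eig_cov = np.linalg.eigvalsh(cov)       # spectrum of the predictive covariance
eig_joint = np.linalg.eigvalsh(K_joint) # spectrum of the joint 2x2 block matrix
```

On this example the spectrum of the predictive covariance is sandwiched between the extreme eigenvalues of the joint matrix, and its smallest eigenvalue stays above the noise variance, which is where $\sigma^{2}>0$ enters.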

Having established control over these eigenvalues, we are ready to bound the terms $T_{2}$ and $T_{3}$ from above under both $\mathbb{H}_{0}$ and $\mathbb{H}_{1}$ , while $T_{1}$ should be bounded from above under $\mathbb{H}_{0}$ and from below under $\mathbb{H}_{1}$ . Now we bound the test statistic $A_{n}(t)$ under $\mathbb{H}_{0}$ . Denote $\Delta_{f}\coloneqq\left\|f^{r}_{n}(t)-\hat{f}^{r}_{n}(t)\right\|_{\infty}$ . Then

[TABLE]

In order to bound the second term on a set of high probability we employ Lemma D.1 and obtain for a positive $\mathrm{t}$

[TABLE]

The third term will be controlled using sub-Gaussianity of $\varepsilon^{r}_{n}(t)$ . For any unit vector $u$ and positive $\mathrm{e}$

[TABLE]

and clearly,

[TABLE]

where $F\coloneqq\max\{\left\|f\right\|_{\infty},\left\|f_{1}^{*}\right\|_{\infty},\left\|f_{2}^{*}\right\|_{\infty}\}$ . Hence, on a set of probability at least $1-2\exp(-\mathrm{e})$

[TABLE]

Finally, we choose $\mathrm{t}\coloneqq 10\log n$ and $\mathrm{e}\coloneqq 10\log n$ . Now under $\mathbb{H}_{0}$ bound $\Delta_{f}$ by Lemma F.1, recall $\kappa<1/2$ and obtain

[TABLE]

on a set of probability at least $1-3/n^{10}\rightarrow 1$ as $n\rightarrow+\infty$ . Now we use Theorem 3.1 along with the fact that for $n$ large enough $\alpha>R_{A}+3/n^{10}$ and obtain on a set of probability approaching $1$

[TABLE]

On the other hand, under $\mathbb{H}_{1}$ the bounds (D.7) and (D.10) still hold and

[TABLE]

Finally, choose $n=n_{*}$ , $t=\tau$ , and recall assumption (3.15) to conclude $A_{n}(\tau)>x^{\flat}_{n,\alpha}(\tau)$ for large $n$ with probability approaching $1$ . ∎

The following result bounds a quadratic form of a sub-Gaussian vector with high probability. It is a direct corollary of Theorem 1.1 (Hanson-Wright inequality) stated in [36].

Lemma D.1.

Consider a vector $x\in\mathbb{R}^{n}$ sub-Gaussian with $\mathfrak{g}^{2}$ and a positive-definite matrix $A$ of size $n\times n$ . Let there be a constant $\Lambda$ independent of $n$ s.t. $\lambda_{\max}(A)\leq\Lambda$ . Then for a positive $\mathrm{t}$ , large enough $n$ and some absolute positive $c$

[TABLE]

Appendix E Anti-concentration inequality

This section uses notation introduced in Section B.

Lemma E.1.

Consider a $2p$ -dimensional Gaussian vector $z=(x,X)$ , where $x$ and $X$ are $p$ -dimensional. Further, let $\mathrm{Var}\left[x\right]=\sigma^{2}I_{p}$ and $Cov(x_{j},X_{j})=Cov(\eta,\eta^{2})$ for arbitrary $1\leq j\leq p$ and $\eta\sim\mathcal{N}\left(0,\sigma^{2}\right)$ . Finally, let $\mathrm{Var}\left[X\right]=\mathrm{Var}\left[\eta^{2}\right]I_{p}$ . Then for an arbitrary vector $C$ and $\delta\in\mathbb{R}$

[TABLE]

where

[TABLE]

and the map $Q(\cdot)$ is defined by (B.3).

Proof.

Introduce an isotropic Gaussian vector $\tilde{z}=\mathrm{Var}\left[z\right]^{-1/2}z\mathrm{Var}\left[z\right]^{-1/2}$ and notice, $\sigma^{2}/3\leq\mathrm{Var}\left[\tilde{z}_{j}\right]\leq 3\sigma^{2}$ for all $j$ . Applying Lemma E.2 to $\tilde{z}$ yields the claim. ∎

The rest of the proofs in this section mostly follow the proof of Nazarov’s inequality presented in [14].

Define a map $u:\mathbb{R}^{2p}\rightarrow\mathcal{U}$ , where $\mathcal{U}\coloneqq\mathbb{R}^{(p+5)p/2}$ :

[TABLE]

With a slight abuse of notation we will use $u$ to denote both the map and an element of its image.

Lemma E.2.

Consider $x\sim\mathcal{N}\left(0,I_{2p}\right)$ and $a_{1},a_{2},..a_{p(p+5)/2}\in\{u\in\mathcal{U}:\left\|u\right\|=1\}$ along with $b_{1},b_{2},..,b_{p(p+5)/2}\in\mathbb{R}$ . Then for all positive $\delta$ :

[TABLE]

Proof.

Define a set $K(t)\coloneqq\left\{u\in\mathcal{U}:\forall j~{}a_{j}^{T}u\leq b_{j}+t\right\}$, and a function $G(t)\coloneqq\mathbb{P}\left\{u(x)\in K(t)\right\}$. $G$ is an absolutely continuous distribution function and hence

[TABLE]

where $G^{\prime}_{+}$ denotes the right derivative of $G$ . Essentially, the proof boils down to the following lemma.

Lemma E.3.

[TABLE]

Proof.

Denote $K\coloneqq K(0)$ and note it is a convex polyhedron. Denote a projector onto $K$ as $P_{K}$ : $\left\|x-P_{K}x\right\|=\min_{y\in K}\left\|x-y\right\|$ . Now for a (proper) face $F$ of $K$ define

[TABLE]

Clearly, $K(\delta)\backslash K=\bigcup_{F:\text{face of }K}N_{F}(\delta)$ . Clearly,

[TABLE]

hence for any face $F$ of dimensionality less than $\dim\mathcal{U}-1$ for $\delta\downarrow 0:\gamma_{p}(N_{F})\coloneqq\mathbb{P}\left\{u(x)\in N_{F}\right\}=o(\delta)$ . Hence,

[TABLE]

Now it is left to prove that

[TABLE]

By Lemma E.4

[TABLE]

where $f_{2p}(\cdot)$ denotes the density of $\mathcal{N}\left(0,I_{2p}\right)$ . Consider facets $F$ such that $\mathrm{dist}(0,F)>4\log p$ . Choose $\bar{h}=\mathrm{dist}(0,F)v$ (or flip the sign if $\bar{h}\notin F$ ) and denote $\bar{x}=u^{-1}(\bar{h})$ . Further, since $\left\|\bar{x}\right\|\geq\sqrt{4\log p}$ ,

[TABLE]

and given the number of facets is less than $p^{2}$ ,

[TABLE]

Now turn to the facets $F$ s.t. $\mathrm{dist}(0,F)\leq 4\log p$ . By Lemma E.4,

[TABLE]

The final observation is based on the fact that $N_{F}$ are disjoint and $\gamma_{p}(\mathcal{U})=1$

[TABLE]

and its combination with (E.14) completes the proof. ∎

∎

Lemma E.4.

[TABLE]

where $d\sigma$ is the standard surface measure on $F$ .

Proof.

Parametrize every $h\in F$ as

[TABLE]

where $\bar{h}$ is an arbitrary element of $F$ , while $q_{j}$ form an orthonormal basis on $F-\bar{h}$ . Further, choose a unit outward normal vector $v$ to $\partial K$ at $F$ . Then we can parametrize $N_{F}$

[TABLE]

Now

[TABLE]

and in the same way

[TABLE]

Thus

[TABLE]

which proves the equality in the claim.

Now for any $h\in F$ and $v$ there exist vectors $x$ and $n$ such that $u^{-1}(h+tv)=x+t^{\prime}n$ for $t^{\prime}=\sqrt{t}$. Then

[TABLE]

Combination of (E.20) and (E.23) yields

[TABLE]

Now choose $h=\mathrm{dist}(0,F)$ , note $|x^{T}n|=\sqrt{\mathrm{dist}(0,F)}$ and establish the claim. ∎

Appendix F Consistency of Gaussian Process Regression by [44]

In this section we quote a consistency result for predictions of Gaussian Process Regression.

Lemma F.1 (Corollary 2.1 in [44]).

Assume $\varepsilon_{i}$ are sub-Gaussian. Let $f^{*}$ be $\aleph$-smooth Sobolev and $\kappa\coloneqq\left(\aleph-1/2\right)/(2\aleph)$, or $\aleph$-smooth Hölder and $\kappa\coloneqq\aleph/(2\aleph+1)$. Further, let $k(\cdot,\cdot)$ satisfy Assumption 3.1 and Assumption 3.2. Then, for the training sample size $n$ going to infinity, with probability at least $1-n^{-10}$ we have

[TABLE]

where $\hat{f}$ denotes the predictive function.
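For orientation, a worked instance of the two exponents is given below; the smoothness value $\aleph=2$ is chosen by us purely for illustration.

```latex
\begin{align*}
\text{Sobolev, } \aleph = 2:&\quad \kappa = \frac{\aleph - 1/2}{2\aleph} = \frac{3/2}{4} = \frac{3}{8},\\
\text{H\"older, } \aleph = 2:&\quad \kappa = \frac{\aleph}{2\aleph + 1} = \frac{2}{5},\\
\text{in both cases:}&\quad \kappa < \frac{1}{2}, \qquad \kappa \to \frac{1}{2} \text{ as } \aleph \to \infty.
\end{align*}
```

In both classes the exponent stays strictly below $1/2$ for every finite smoothness, in line with the condition $\kappa<1/2$ recalled in the proof of Theorem 3.2.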

Bibliography

[1] Alexander Aue, Siegfried Hörmann, Lajos Horváth, and Matthew Reimherr. Break detection in the covariance structure of multivariate time series models. Ann. Statist., 37(6B):4046–4087, December 2009.
[2] Alexander Aue and Lajos Horváth. Structural breaks in time series. Journal of Time Series Analysis, 34(1):1–16, 2013.
[3] Valeriy Avanesov. Structural break analysis in high-dimensional covariance structure. arXiv e-prints, page arXiv:1803.00508, March 2018.
[4] Valeriy Avanesov and Nazar Buzun. Change-point detection in high-dimensional covariance structure. Electron. J. Statist., 12(2):3254–3294, 2018.
[5] Jushan Bai. Estimating multiple breaks one at a time. Econometric Theory, 13(3):315–352, 1997.
[6] Jushan Bai. Likelihood ratio tests for multiple structural changes. Journal of Econometrics, 91(2):299–323, 1999.
[7] Jushan Bai and Pierre Perron. Estimating and testing linear models with multiple structural changes. Econometrica, 66(1):47–78, 1998.
[8] Jushan Bai and Pierre Perron. Critical values for multiple structural change tests. The Econometrics Journal, 6(1):72–78, 2003.
