Nonparametric Multiple Change Point Detection for Non-Stationary Times Series
Zixiang Guan, Gemai Chen

TL;DR
This paper introduces a nonparametric method for detecting multiple change points in non-stationary time series by comparing spectral density functions, applicable to various linear processes, with proven consistency and empirical validation.
Contribution
It proposes a novel nonparametric approach for change point detection that works for a wide class of linear processes, including non-invertible models, with consistent estimation and model selection.
Findings
Method accurately detects change points in simulations.
Approach is effective for both invertible and non-invertible processes.
Consistent estimation of number and locations of change points.
Abstract
This article considers a nonparametric method for detecting change points in non-stationary time series. The proposed method will divide the time series into several segments so that between two adjacent segments, the normalized spectral density functions are different. The theory is based on the assumption that within each segment, time series is a linear process, which means that our method works not only for classic time series models, e.g., causal and invertible ARMA process, but also preserves good performance for non-invertible moving average process. We show that our estimations for change points are consistent. Also, a Bayesian information criterion is applied to estimate the member of change points consistently. Simulation results as well as empirical results will be presented.
| 22.99 (0.011) | 22.99 (0.011) | ||
| 23.55 (0.012) | 23.55 (0.012) | ||
| 22.61 (0.011) | 22.61 (0.011) | ||
| 26.47 (0.013) | 26.47 (0.013) |
| AutoPARM | 93.3% | 7.61 (0.004) | 8.05 (0.004) |
| NSCD | 98.8% | 25.00 (0.012) | 37.33 (0.018) |
| MuBred | 97.3 % | 12.219 (0.006) | 16.229 (0.008) |
| WBS | 37.9% | 66.65 (0.033) | 183.184 (0.089) |
| BS | 62.7% | 84.875 (0.041) | 122.058 (0.060) |
| NMCD | 0% | 147.4365 (0.072) | 554.9088 (0.271) |
| ARMA | MA | ||||
|---|---|---|---|---|---|
| 35.29 (0.020) | 34.74 (0.020) | 13.09 (0.007) | 13.09 (0.007) | ||
| 32.74 (0.018) | 32.07 (0.018) | 13.11 (0.007) | 13.11 (0.007) | ||
| 31.12 (0.017) | 31.01 (0.017) | 12.84 (0.007) | 12.84 (0.007) | ||
| 26.39 (0.015) | 26.38 (0.015) | 13.71 (0.008) | 13.71 (0.008) | ||
| ARMA | MA | |||||
|---|---|---|---|---|---|---|
| AutoPARM | 95.3% | 48.37 (0.027) | 27.62 (0.015) | 99% | 11.42(0.006) | 11.42 (0.006) |
| NSCD | 99% | 31.42 (0.017) | 33.26 (0.018) | 99.6% | 14.59 (0.008) | 15.69 (0.009) |
| MuBred | 93.3% | 40.436 (0.022) | 44.467 (0.025) | 97.7% | 49.518 (0.028) | 41.444 (0.023) |
| WBS | 74.9% | 184.393 (0.102) | 66.288 (0.037) | 91.8% | 33.586 (0.019) | 45.362 (0.025) |
| BS | 77.7% | 188.955 (0.105) | 69.194 (0.038) | 96.3% | 29.673 (0.016) | 36.611 (0.020) |
| NMCD | 0% | 188.64 (0.105) | 324.31 (0.180) | 0.1% | 160.4434 (0.089) | 326.4282 (0.181) |
| MA | |||
|---|---|---|---|
| 33.60 (0.019) | 33.60 (0.019) | ||
| 36.00 (0.020) | 36.00 (0.020) | ||
| 40.11 (0.022) | 40.11 (0.022) | ||
| 41.00 (0.023) | 41.00 (0.023) | ||
| MA | |||
|---|---|---|---|
| AutoPARM | 36.7% | 408.60 (0.227) | 27.24 (0.015) |
| NSCD | 99.7% | 37.49 (0.021) | 38.02 (0.021) |
| MuBred | 8.4% | 216.203 (0.120) | 26.724 (0.015) |
| WBS | 18.2% | 488.674 (0.271) | 90.493 (0.050) |
| BS | 8.6% | 576.462 (0.320) | 56.975 (0.032) |
| NMCD | 0% | 165.54 (0.092) | 326.701 (0.182) |
| Case 1 | 24.24 (0.012) | 24.37 (0.012) |
| Case 2 | 29.67 (0.017) | 28.91 (0.016) |
| Case 3 | 11.74 (0.007) | 11.74 (0.007) |
| Case 4 | 36.36 (0.020) | 36.36 (0.020) |
| Bandwidth | |||
|---|---|---|---|
| Case 1 | 23.195 (0.023) | 23.263 (0.023) | |
| 27.716 (0.027) | 28.075 (0.027) | ||
| Case 2 | 31.72 (0.035) | 30.718 (0.035) | |
| 28.104 (0.031) | 27.734 (0.030) | ||
| Case 3 | 14.664 (0.016) | 14.158 (0.016) | |
| 12.652 (0.014) | 12.652 (0.014) | ||
| Case 4 | 37.28 (0.041) | 37.272 (0.041) | |
| 44.3 (0.05) | 43.877 (0.049) |
| bandwidth | ||||
|---|---|---|---|---|
| Case 1 | 96% | 24.972 (0.024) | 30.44 (0.030) | |
| 90.4% | 35.268 (0.040) | 39.494 (0.044) | ||
| Case 2 | 96.2% | 38.326 (0.043) | 28.1 (0.031) | |
| 98.7% | 27.494 (0.031) | 26.384 (0.029) | ||
| Case 3 | 96.5% | 31.458 (0.035) | 12.73 (0.014) | |
| 96% | 32.484 (0.036) | 11.436 (0.013) | ||
| Case 4 | 95% | 41.228 (0.046) | 40.246 (0.045) | |
| 92% | 47.67 (0.053) | 53.254 (0.059) |
| Case 1 | 22.9 (0.011) | 22.9 (0.011) |
| Case 2 | 29.26 (0.016) | 29.26 (0.016) |
| Case 3 | 10.14 (0.006) | 10.14 (0.006) |
| Case 4 | 40.52 (0.023) | 40.52 (0.023) |
| Case 1 | 96.2% | 22.696 (0.011) | 39.72 (0.019) |
| Case 2 | 97.2% | 42.02 (0.0233) | 30.68 (0.017) |
| Case 3 | 99.6% | 12.54 (0.007) | 10.14 (0.006) |
| Case 4 | 98.6% | 40.56 (0.023) | 43.74 (0.024) |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsFault Detection and Control Systems · Spectroscopy and Chemometric Analyses · Control Systems and Identification
Nonparametric Multiple Change Point Detection
for Non-Stationary Time Series
Zixiang Guan, Gemai Chen
Department of Mathematics and Statitsics
University of Calgary, Calgary, Alberta, T2N 1N4
Email: [email protected]
Author’s Footnote:
Zixiang Guan is PhD, Department of Mathematics and Statistics, University of Calgary (Email: [email protected]), Gemai Chen is Professor (Email: [email protected]), Department of Mathematics and Statistics, University of Calgary, Alberta, Canada.
Abstract
This article considers a nonparametric method for detecting change points in non-stationary time series. The proposed method will divide the time series into several segments so that between two adjacent segments, the normalized spectral density functions are different. The theory is based on the assumption that within each segment, time series is a linear process, which means that our method works not only for classic time series models, e.g., causal and invertible ARMA process, but also preserves good performance for non-invertible moving average process. We show that our estimations for change points are consistent. Also, a Bayesian information criterion is applied to estimate the member of change points consistently. Simulation results as well as empirical results will be presented.
Keywords: Changepoint; Dynamic Programming; Spectrums; BIC; Kullback-Leibler Divergence
1 Introduction
Time series analysis is a well-developed branch of statistics with a wide range of applications in engineering, economics, biology and so on. Generally speaking, when investigating the theoretical properties as well as analyzing real data both in time domain and frequency domain, it is often assumed that time series is stationary. However in application, stationarity may be violated. How to analyze non-stationary time series is challenging. In ARIMA model ( Brockwell and Davis 1991), data is differenced finite times so that it reduces to ARMA process. In another way, we may assume that non-stationary process consists of several stationary ones. Our goal is to segment time series properly.
Consider a sequence of data and let be nonnegative integers satisfying . We assume that within the th segment of data, i.e., , , is stationary. are called structural breaks, or change points, which are unknown. is the number of change points. From the perspective of time domain, we can assume that each stationary segment can be modeled by appropriate statistical models while its structure varies across different segments. Davis, Lee and Rodriguez-Yam (2006) divided non-stationary time series into several different autoregressive processes. Kitagawa and Akaike (1978) detected change points by AIC criterion. In frequency domain, Ombao, Raz, Von Sachs and Malow (2001) used a family of orthogonal wavelet called SLEX to partition non-stationary process into stationary ones. Lavielle and Ludeña (2000) estimated change points using Whittle log-likelihood when time series was parametric. In Korkas and Fryzlewicz (2017), Locally Stationary Wavelets was applied to estimate the second-order structure, then Wild Binary Segmentation was imposed to divide the time series into several segments based on CUSUM statistics.
When autocovariance function is absolutely summable, it is well known that stationary process has spectral density function, which is the Fourier transformation applied to autocovariance function (Brockwell and Davis 1991). Spectral density function preserves a good property which is that after being properly normalized, it becomes a well-defined probability density function (Priestley 1981). In probability theory, there are some existing functions to measure the difference between probability density functions. Kullback-Leibler divergence (K-L divergence) is one of them.
In Kullback and Leibler (1951), a divergence function was introduced to measure the discrimination information between two distribution functions. The original definition is as follows. Suppose there are two probability distributions, and . and denote the probability density functions of and , respectively. Then Kullback-Leibler (K-L) divergence from to is
.
As we can see, K-L divergence is defined for probability density function, which is non-negative. Since spectral density function is also non-negative, we will generalize K-L divergence so that it is applicable to non-negative functions, which is still called Kullback-Leibler divergence in this article and will be applied later in the forthcoming sections. The definition is as follows. Suppose we have two non-negative functions, and , defined on a common support. Then the Kullback-Leibler divergence of with respect to is
,
where , , and , . By Gibbs’s inequality, this new K-L divergence is still non-negative and is equal to 0 if and only if almost everywhere, where is an appropriate constant. In this article, we apply K-L divergence to normalized spectral density functions to define our objective function. Then we estimate change points by maximizing the objective function. That is, we find the locations where discrepancy between different spectral density functions reaches its maximum. When estimating spectral density function, we adopt the classical method. That is, we first calculate periodogram then smooth it by choosing appropriate spectral window. Although we assume that each stationary segment is a linear process, which is a class of stationary time series more general than autoregressive process and ARMA model, we calculate spectral density function without estimating any parameters. So our change point detection method is nonparametric, which we call it nonparametric spectral change-point detection (NSCD). In application, we do not know the number of change points, so a BIC criterion,which is similar to Yao (1988) and Zou, Yin, Feng, and Wang (2014), is proposed. The consistency of both change point estimation and BIC criterion can be guaranteed. Dynamic programming algorithm (Hawkins 2001) is used, but due to its computational complexity, we adopt the screening algorithm (Zou et al. 2014). Also, Pruned Exact Linear Time (PELT) by Killick, Fearnhead and Eckley (2012) can also boost the speed of algorithm when estimating the number and locations of change points simultaneously.
The rest of this article is organized as follows. In Section 2, we describe our objective functions. Asymptotic properties are presented in Section 3. Implementation and further details of our algorithm will be given in Section 4. In Section 5, numerical simulations as well as comparison with several methodologies are shown. The analysis of EEG data is presented in Section 6.
2 Model and Methodology
Consider non-stationary time series , , , with , , . Here we assume that all are independent and identically distributed. satisfy , , so that when , spectral density function exists, denoted by (Brockwell and Davis 1991). If , we can subtract mean from observations by . Otherwise, we assume that for all . Our goal is to detect when and are different. If we could find a spectral density function that it is different from all , then a discriminant function can be applied so that it reaches its maximum at . Here we adopt Kullback-Leibler divergence and define our objective function as below:
.
Here , , , and so that and become probability density functions, and is a possible partition of the time series. There are two ways to find . We can give an estimation of the spectral density function , based on the whole time series. Although the whole time series is non-stationary, the estimated spectral density function still converges to a well-defined function, which is the weighted sum of (See Lemma 2 for details). If we know that data is not white noise, then .
There are literatures concerning the application of K-L divergence in time series. Parzen (1982) used it to estimate the parameters of autoregressive process. Parzen (1983) extended it to estimate ARMA process. Shore (1981) calculated the minimum K-L divergence to estimate spectrum given its priori spectrum has an exponential form. See Rao (1993) for more reviews. From information point of view, K-L divergence measures the information gain when a new probability density function is used instead of the old one. So our objective function will detect the locations where we maximize the information gain when a new spectral density function comes in.
To estimate the spectral density function, the classical methodology is adopted here. First, periodogram will be calculated as follows:
Since periodogram is not consistent, we choose a spectral window, , to smooth periodogram, then
, .
There are several choices for (Priestley 1981). Since we want the normalized to be a probability density function, i.e., , Bartlett kernel is chosen, which is , where is the bandwidth. One usually applies Fourier transformation to achieve an estimator on a grid of frequencies, denoted by , where is the cardinality of , which could tend to infinity as goes to infinity. Here we still use integral to denote the summation of estimators. We may set so could tend to infinity, or could be Nyquist frequency. Based on this grid of frequency, K-L divergence applied to normalized spectral density function is still non-negative, and equals zero if and only if two normalized spectral density functions are equal on .
When estimating change points, we have no idea about the number of change points in the data, so should be estimated. Yao and Au (1989), Zou et al. (2014) gave BIC criterion and showed its consistency. Following their work, we also propose a BIC criterion as follows:
.
Here is an appropriate constant which will be illustrated later. So our estimator is chosen by minimizing the criterion above.
3 Asymptotic Theory
In Section 2, we can see that is a constant, and , , change with . So there is a sequence of constants, , such that is a realization of non-stationary time series with , , where denotes the largest integer which is not greater than . Their estimators are denoted by , . We can estimate first, denoted by , by maximizing the objective function , then . In literature, Yao and Au (1989) achieved a consistent estimation of for , which means that the difference between estimators and true change points is no bigger than a constant. Zou et al. (2014) also drew the same conclusion when the number of change points was constant. In Davis et al. (2006), the consistency was attained in the sense that in probability 1, where was some constant. The reason that estimators cannot converge as fast as those in Yao and Au (1989) and Zou et al. (2014) is that estimating AR process needs a sufficient number of samples. To guarantee estimate accuracy, we always find the next change point which is away from the previous one. For example, after giving , we look for the next change point starting from . Our results are similar to Davis et al. (2006) and based on a set of discrete frequencies . The following assumptions are needed to obtain the consistency:
- A1:
where is some integer satisfying , converge absolutely, .
- A2:
is a non-negative, even, bounded, integrable function, ,
.
- A3:
, as , and , where .
- A4:
, is everywhere positive and satisfies uniform Lipschitz condition:
, , where is a constant.
- A5:
, where .
- A6:
, and has continuous derivatives to the order of .
- A7:
are linearly independent for .
- A8:
, where .
In A8, is the minimal length of time series when estimating spectral density functions, since a sufficient number of observations is always necessary, especially when the convergence rate of estimators is slow. Also, so that all change points are distinguishable. Assumption 7 is given to guarantee that any linear combination is not equal to , . Assumption 1-6 are similar to those in Woodroofe and Van Ness (1967) so that the can be bounded in probability. Assumption 4 can be stronger so that in Assumption 5, can be less than (see Woodroofe and Van Ness (1967) for further details). Theorem 1 gives the consistency of change point estimation.
Theorem 1**.**
When is known, , .
To estimate the number of change points, a pre-specified upper bound satisfying will be given. Then BIC values for each are calculated and will be the number where BIC reaches its minimum. The following theorem establishes the consistency of estimation of BIC criterion.
Theorem 2**.**
If , , we have .
4 Algorithm
Similar to Zou et al. (2014), our objective function is separable. For change point detection, a commonly adopted algorithm, Dynamic Programming (Hawkins 2001), can be applied. The main idea of Dynamic Programming is that estimation of is computed first, which is the rightmost change point. Then the time series data from to will be divided into parts, and we will estimate recursively. However, the computation complexity is , and taking Discrete Fourier transformation and spectrum smoothing into consideration, it is time-consuming.
To reduce computational complexity, Zou et al. (2014) proposed a screening algorithm. For , where is some constant integer, calculate the location where the function below reaches its maximum.
[TABLE]
Here and denote the estimated spectral density functions based on observations from to , and to . Let change from to , then we have a set containing all , then apply Dynamic Programming on . The main idea is that if a change point is included in , then the equation above should reach its maximum at this true change point. That is, contains true change points. Here we should choose so that contain only one change point.
Here and denote the estimated spectral density function of samples from to , and to . Let change from to , then we have a set containing all , then apply dynamic programming on . The main idea is that if a change point is included in , then the equation above should reach its maximum at this true change point. That is, contains true change points. Here we should choose so that contain only one change point.
When calculating the spectrums within each segment of time series, the most widely used method is Fast Fourier Transformation (FFT). However, in FFT, a problem is that spectral density function is estimated on Nyguist frequency. If so, time series with different length is estimated on different set of Fourier frequency. So when calculating the integral in , we cannot align the frequencies where spectral density functions and are estimated. Our method is that we first choose a set of frequencies, denoted by , then apply Discrete Fourier Transformation on for every subset of samples, which will solve the alignment problem naturally.
For BIC criterion, usually Dynamic Programming will be applied first, then the values of BIC criterion for all will be calculated. Obviously, this will increase complexity. Killick et al. (2012) proposed a method called Pruned Exact Linear Time (PELT), which would significantly reduce computational complexity. The main idea is that when a new sample is included, check all the remaining locations before new sample. If the objective function decreases to the extent that it is larger than the penalty term , those locations which do not satisfy the condition will be removed and the next sample will be added into our calculation until the end. The cardinality of the set of all remaining locations is the estimated number of change points and the elements within will be the estimated change points. Under some assumptions, the computational complexity is linear with respect to sample size.
In BIC criterion, another problem is how to choose . Although we can set to satisfy conditions in Theorem 2, this choice may be too large which leads to underestimation of . To overcome this difficulty, we first choose a length, which equals , then compute the median, denoted by , for all values of the function below:
, .
is the estimated spectral density function from , is its integral. Finally, . Our simulation shows that an appropriate choice for is 0.73 regardless of .
The selection of spectral windows is an important topic in spectral estimation. In Priestley (1981), bandwidth is selected as follows
,
where , where is the inverse Fourier transformation of , and is the largest integer so that the limit aforementioned exists and is non-zero. is the scale parameter in spectral windows. By the proofs of Theorem 1, we can see that estimators of change points reach consistency because dominate convergence rate. So bandwidth selection does not matter too much, which is verified by our simulations.
In application, sometimes sample size is large. Although we can apply some methods, such as screening, PELT, to boost the calculation, the computational complexity is still intolerable. Here, we set a change point searching unit, denoted by (Hawkins 2001). That is, when searching for change points, we add a unit of observations into our calculation each time, not just one observation. This unit is different from , which is used to calculate spectrums since estimating spectral density function usually needs sufficient amount of observations. By setting this unit, change point can only be estimated at , , , and so on. If we set this unit equal to 1, the algorithm degenerates to the general scenario when no searching unit is given. This will dramatically increase the speed of our algorithm. The computational complexity of Dynamic Programming is (Zou et al. 2014), so by setting a searching unit , the complexity will drop to . Apparently the estimation accuracy will be sacrificed, since the true change points and the maximizers of objective function, may not lie on the grid. We suggest the choice of by choosing the desired estimation accuracy first. Intuitively speaking, if we want the estimates having the accuracy of of the total sample size, then we can set .
5 Simulation
In this section we show the finite sample properties of our method and compare it with AutoPARM, Wild Binary Segmentation (WBS), Binary Segmentation (BS), MuBred, and NMCD under several cases.
Following Zou et al. (2014), we calculate the distance between two sets and , which are the true change point set and the estimated set, respectively, by
, and .
The first measurement shows if there is an estimator close enough to a true change point, while the second measurement reveals the distance between estimates and true change points. When is known, both measures should give good performances in the sense that and are small. For all the tables, and are shown outside the parentheses while and are shown inside, which can measure the estimation accuracy of . In the following subsections, , if it is not mentioned. To guarantee the estimation accuracy of spectral density function, we should set the minimal length of a segment, which is denoted by . In this simulation, we set if it is not mentioned. By setting , we also assume that the distance of two adjacent change points will not be smaller than , so we set for each case. If should be estimated, we report the percentage when is accurately detected. All simulations are obtained with 1000 replications.
5.1 AutoRegresstive Process
Following the examples in Davis et al. (2006), we generate the non-stationary time series from the following.
Autoregressive Processes (Case 1):
- 1
, ,
- 2
, ,
- 3
, .
and their normalized spectral density functions are shown in Figure 1.
In Table 1, we show the results of our method when is known with the spectral density function of total samples and white noise as the baseline functions, and different choices of bandwidths, respectively. In Table 2, the simulation is conducted when is unknown. We adopt Bartlett window since it guarantees the non-negativity of estimated spectral density function. To reduce computational complexity, screening algorithm is applied.
From Table 1, the choice of baseline function does not make much difference. Also, under different bandwidths, results are quite similar. We may suggest to choose larger bandwidth for large sample size since the bias of estimation is small. From Table 2, it is not surprising that AutoPARM outperforms NSCD, because AutoPARM knows the exact structure of data. WBS and BS perform weaker than NSCD, and cannot estimate the number of change points accurately. The performance of MuBred is better than NSCD, although it does not need the assumption of Autoregressive process. NMCD fails to capture the true change points, since it is designed for independent random variables. In fact, it prefers to overestimate the number of change points, which is 3.86.
5.2 ARMA Process and Invertible Moving Average Process
Next, we simulate from ARMA and invertible MA processes. Table 3 and Table 4 contain the results.
ARMA (Case 2):
- 1:
, ,
- 2:
, ,
- 3:
, .
Invertible MA (Case 3):
- 1:
, ,
- 2:
, ,
- 3:
, .
We expect that AutoPARM still works since it is well known that ARMA and invertible MA processes can be approximated by causal AR processes. We see that under ARMA setting, the performance of NSCD is comparable to AutoPARM, while AutoPARM performs better under MA settings. Both NSCD and AutoPARM perform better than WBS and BS. MuBred gives a slightly worse performance since periodogram is not consistent. NMCD fails in both cases as it does in the previous section. The number of change points is overestimated by NMCD again, which is 3.005 and 3.00, respectively.
5.3 Non-invertible MA Process
We will show simulation results when samples are generated from non-invertible MA process. In theory causal AR cannot approximate non-invertible MA process. Results are given in Table 5 and 6.
Non-invertible MA (Case 4):
- •
, ,
- •
, ,
- •
, .
From Table 6, we can see that NSCD works for Case 4, while AutoPARM and MuBred fail since the number of change points are all underestimated. The possible reason is that the marginal variance of Case 4 does not vary much.
5.4 Random Noise without the Existence of Higher Moments
Next, we investigate the results when does not have higher-order expectations. Here so that the variance of is still 1. The results for four cases are shown in Table 7, when is known.
Apparently, from Table 7, it is clear that Assumption 1 can be slightly violated in application, while still achieving good performance.
5.5 Investigation of Smaller Sample Size
In previous sections, we discuss the performances when sample size is about 2000, which is large. Here we shrink the sample size by half, and investigate the estimate accuracy as well as the choice of bandwidth. We set , and . Baseline function is chosen to be . Results are shown in Table 8.
As shown in the table, the estimate accuracy is not affected by the choice of bandwidth when sample size is relatively small. The estimation accuracy of is worse than the results in Section 5.1 to 5.3, since the sample size is decreased by half.
5.6 Further Investigation of BIC Criterion
In this section, we investigate the performance with different choice of . ranges from 0.1 to 0.9 and will be plotted in all the cases mentioned above with two choices of bandwidth. All the results are shown from Figures 2a to 3d with baseline function . From figures, we can see that is a good choice for all the four cases.
Next, we are going to simulate the performances of BIC when changes. Since from the previous section,we can see that the choice of depends on . Figures 4a to 5d show the results when , while Figures 6a to 7d give the performances when .
As we can see from the figures, is not a good choice in general, and it will overestimate the number of change points. Although overestimating is tolerable since we do not want to miss the true change points, we still suggest that a sufficiently large choice for is necessary. Since the maximum possible number of change points is . We suggest that one should choose a reasonable depending on some prior information, then choose a satisfying .
5.7 Influence of Change Point Searching Unit
In this section, we investigate the influence of . For all the four cases, , which means that for Case 2-4, are all a multiple of , while for Case 1, the true change points are not divisible by . We set and baseline function for four cases. The results are shown in Table 10 and 11.
We can see that setting an will not affect our results much even if the true change points are not a multiple of in Case 1, so it is safe to apply this in application, which could boost the computation as well as give accurate estimations.
6 Case Study
6.1 Simulated Data
In the simulation section, we check the performances of our method based on 4 cases, and compare NMSD with other methods. Here we will investigate the change point detection of these 4 cases more directly. Figure 8a-8d are the realizations of Case 1 to Case 4. The long-dashed lines are the locations of estimated change points while the dotted lines represent true change points. It is easy to see from Figure 8a that the existence of change points are obvious, since marginal variance of the second segment is bigger. In Case 2, the second change point is far less obvious than the first one which can be seen in Figure 8b. In Figure 8c, since the observations in the first two segments concentrate more around their mean compared to the third segment, the second change point may be captured by eyes. In Figure 8d, two change points are not apparent any more. However, from the perspective of spectrum, those change points can be easily detected. Figure 9a-9d give the estimated spectral density functions. In Case 4, we can see that the power of spectrum of the first segment concentrates more at higher and lower frequencies, while the spectrum of the second part has more power at low frequency. The spectrum of the third part is similar to the first one, but it has more power in the low frequency range.
6.2 Electroencephalography Recordings for Seizure
In this section, we investigate the performance of NSCD when applied to EEG data for seizure (Goldberger et al. 2000). The data is retrieved from https://www.physionet.org/pn6/chbmit/. For each subject, the EEG signal was recorded into several data files, which was one-hour long. Here all subjects were monitored for several days to trace their states, so we only analyze the data file with seizures. The sampling rate is 256 Hz, so the number of observations in each recording is 9216000, which is huge. In the data file we choose (which is chb01_16), seizure happened only once with duration 51 seconds, so the number of change points is 2. The duration of seizure is very short compared with the total length of this recording, so we analyze a particular part of the data which begins at 10 seconds before the seizure and ends at the 10 seconds after the seizure, so the sample size is 18176, which is still large. So here we will set a change point searching unit equal to 64, which is 0.25 second, to ease the computation burden. What is more, the minimal length is 256 which is 1 second. There are totally 23 channels in EEG recording, we apply our method on channel “FP1-F3” and “FP1-F7”. The results are shown in Figure 10a and 10b. The vertical dotted lines represent the locations of the beginning and ending of seizure, while the vertical dashed lines are the estimated change points. To demonstrate the changes in spectrums, we plot the estimated spectrums in Figure F3s and F7s. As we can see, NSCD can successfully detect the true change points, while almost all the other estimates fall between the true change points. This is because during seizure, the EEG recordings change abruptly, which will bring more non-stationarity into the time series.
7 Conclusion
In this article, we propose a change point detection method based on spectral density functions for non-stationary time series. We assume that non-stationary time series can be segmented into several linear processes. Then Kullback-Leibler divergence is applied to measure the discrepancy between different spectral density functions. A BIC criterion is suggested to estimate the number of change points. Due to the separable structure of objective function, we use Dynamic Programming to find the estimators. We also show the consistency of our estimators in theory and the estimate accuracy by simulations.
Appendix A Proofs
In the appendix, to are appropriate constant and , are constant with respect to .
Lemma 1**.**
Suppose , , where . . Denote as the spectral density function of . is the estimated spectral density function. Then under Assumptions 1-6,
Proof: Following the method of Woodroofe and Van Ness (1967), we separate our proofs into 3 parts.
Part 1: we show that for given , . Here is the smoothed spectral density estimation for ,
[TABLE]
where ,
,
,
.
Since for any real number , . So we only need to prove that , , are . By Markov’s inequality, it suffices to show that , , are .
After expanding , we have
[TABLE]
where . If for any , then expectation of is not zero. If one of is 1, for example, , since for any , it is impossible to find in , so the expectation would be zero. To sum up, for all non-zero terms in , should be greater than 2. Since for any , , , can be bounded by a real number, denoted by , so we only need to count the number of non-zero terms. When for any , then , and the number of non-zero terms are no more than . When at least one of , say , the number of non-zero terms is no more than . So
.
For
,
we can see that for any in all non-zero terms, so following the discussions above, we have
.
By far, we have proven that and are .
[TABLE]
First, when for any , , the number of terms after expanding is . So for fixed , contains no more than terms. What is more, for any fixed , the number for all possible , of which the power satisfy , is no more than . So there are no more than terms, which means that .
When and some of are equal to 1, we assume that . For , should satisfy , that is, . If not, since is independent of ,
, we have
.
So the number of all non-zero terms is no more than . For , there are terms which contain . So there are terms which do not contain . Then the number of non-zero terms is O(. For , if , contain , then it is easy to see that there are totally no more than terms which do not contain . However, now for all , . So the number of all possible non-zero terms is . Hence .
When , there are at least of , so following the discussions above, . Therefore summarizing discussions above, .
Part 2, we prove that .
[TABLE]
And
[TABLE]
Since , we have .
And
[TABLE]
Since by Woodroofe and Van Ness (1967),
,
where is a constant. So
[TABLE]
And . So we complete Part 2.
Part 3: we show that .
[TABLE]
where is the periodogram for , is a constant. After some manipulations (see Woodroofe and Van Ness 1967, Grenander and Rosenblatt 1957).
[TABLE]
Now let us focus on , that is , , and denote as the autocovariance function of .
[TABLE]
Again, we only need to prove that the expectation of and are . And
[TABLE]
For those four terms in the last inequality above, following the proofs in Part 1, we can show that each of them is . Similarly to the discussions above, . So we have . When some of are different from each other, for example, for , then in , the number of non-zero terms should be less than because some in are not in . And now, when the power of is 1 in the expansion of , we can not find any in , so the expectations of these terms are 0. So,
[TABLE]
Since ,
for any given . Since , .
[TABLE]
where the second inequality holds because of Hölder inequality, and is the Fourier transformation of the th derivative of . Since for , we have for , and is bounded. So similar to Part 1, we have . Here we complete Part 3.
Since
[TABLE]
we have
.
Then, by Markov inequality,
[TABLE]
[TABLE]
So
[TABLE]
Here we complete the proof of Lemma 1.
Lemma 2**.**
Suppose , satisfy Assumptions 1-8, then
[TABLE]
[TABLE]
Proof:
[TABLE]
where . So
[TABLE]
By Lemma 1, we only need to show that
.
What is more, assume without loss of generality, then we have
[TABLE]
Since , , for , so if , . So without loss of generality, we set , . Next, we separate the proofs into 2 parts, as in Lemma 1.
Part 1:
[TABLE]
So
[TABLE]
So for two terms in the inequality above, following the proofs in Part 1 of Lemma 1, we have
,
.
Part 2: Set , satisfying , , where denotes the conjugate of . We show that .
[TABLE]
Since , all satisfy uniform Lipschitz condition, then it is easy to see that also satisfies uniform Lipschitz condition. Following the proofs in Part 3 of Lemma 1, we have
.
Following the proofs in Part 3 of Lemma 1 again, we have , where
[TABLE]
So following the proofs in Part 3 of Lemma 1 again. we have
.
Here we complete the proof of Lemma 2.
Lemma 3**.**
Assume Assumption 1-8, , , when . Here
**
Proof: Set , and to are all constant. By Lemma 1, we have
,
So denote , then on set ,
, , .
Since ,
.
So on set , we have
[TABLE]
Then on ,
[TABLE]
Here is some constant containing . So set ,
[TABLE]
and the last inequality can be arbitrarily small by choosing sufficiently large . Here we complete proofs of Lemma 3.
Lemma 4**.**
Assume Assumption 1-8, , , when . Here
**
Proof: Following the proofs of Lemma 3, this lemma can be easily obtained.
**Proofs of Theorem 1:
**Denote , where is the event in probability space . Then , we prove that , .
For , since is bounded, , then there exists such that on the subsequence. It follows from Lemma 2 and 3, that
,
where , . If , then by Lemma 3, we have
.
Since .
So
[TABLE]
as , since every term above is always negative. If , then
[TABLE]
So we have
[TABLE]
as . This is a contradiction because are the maximizers of . Since
.
so .
Proofs of Theorem 2:
If , then there should be a change point that can not estimated consistently. Then following the proof in Theorem 1,
[TABLE]
as .
If , then still every change point should be estimated consistently. So, there should be a change point . Then by Lemma 2 and 3, . So , as . Here we complete the proofs of Theorem 2.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] Brockwell, P. J., and Davis, R. A. (1991), Time Series: Theory and Methods (2nd Edition) , Springer-Verlag.
- 2[2] Davis, R. A., Lee, T. C. M., and Rodriguez-Yam, G. A. (2006), “Structural Break Estimation for Nonstationary Time Series Models,” Journal of the American Statistical Association , 101(473), 223-239.
- 3[3] Fan, J. and Yao, Q., (2003), Nonlinear Time Series , Springer.
- 4[4] Goldberger, A. L., Amaral, L. A. N., Glass, L., Hausdorff, J. M., Ivanov, P. Ch., Mark, R. G., Mietus, J. E., Moody, G. B., Peng, C-K., Stanley, H. E. (2000), “Physio Bank, Physio Toolkit, and Physio Net: Components of a New Research Resource for Complex Physiologic Signals”, Circulation 101(23), e 215-e 220.
- 5[5] Hawkins, D. M. (2001), “Fitting Multiple Change-point Models to Data,” Computational Statistics & Data Analysis , 37(3), 323-341.
- 6[6] Killick, R., Fearnhead, P., and Eckley, I. A. (2012), “Optimal Detection of Changepoints With a Linear Computational Cost,” Journals of the American Statistical Association , 107(500), 1590- 1598.
- 7[7] Kitagawa, G., and Akaike, H. (1978), “A Procedure for the Modeling of Non-Stationary Time Series,” Annals of the Institute of Statistical Mathematics , 30, 351-363.
- 8[8] Korkas, K. K., and Fryzlewicz, P. (2017), “Multiple Change-Point Detection for Non-Stationary Time Series Using Wild Binary Segmentation,” Statistica Sinica , 27(1), 287-311.
