Nonparametric Multiple Change Point Detection for Non-Stationary Time Series

Zixiang Guan; Gemai Chen

arXiv:1901.03036·math.ST·November 5, 2020

TL;DR

This paper introduces a nonparametric method for detecting multiple change points in non-stationary time series by comparing spectral density functions, applicable to various linear processes, with proven consistency and empirical validation.

Contribution

It proposes a novel nonparametric approach for change point detection that works for a wide class of linear processes, including non-invertible models, with consistent estimation and model selection.

Findings

1. The method accurately detects change points in simulations.

2. The approach is effective for both invertible and non-invertible processes.

3. The number and locations of change points are estimated consistently.

Abstract

This article considers a nonparametric method for detecting change points in non-stationary time series. The proposed method divides the time series into several segments so that the normalized spectral density functions of two adjacent segments are different. The theory is based on the assumption that, within each segment, the time series is a linear process, which means that our method works not only for classic time series models, e.g., causal and invertible ARMA processes, but also preserves good performance for non-invertible moving average processes. We show that our estimators of the change points are consistent. In addition, a Bayesian information criterion is applied to estimate the number of change points consistently. Simulation results as well as empirical results are presented.

Tables (11)

Table 1: Performance of NSCD for Case 1 with Different Baseline Functions and Bandwidths When $K$ Is Known

Bandwidth	Baseline	$\varrho(\hat{G}||G)$	$\varrho(G||\hat{G})$
$m = N^{1 / 3}$	${\hat{f}}_{0}$	22.99 (0.011)	22.99 (0.011)
$m = N^{1 / 3}$	$\frac{1}{2 π}$	23.55 (0.012)	23.55 (0.012)
$m = N^{1 / 4}$	${\hat{f}}_{0}$	22.61 (0.011)	22.61 (0.011)
$m = N^{1 / 4}$	$\frac{1}{2 π}$	26.47 (0.013)	26.47 (0.013)

Table 2: BIC Criterion of NSCD for AR Processes with Comparison to AutoPARM, WBS, BS, MuBred, and NMCD

Method	$\hat{K}$	$\varrho(\hat{G}||G)$	$\varrho(G||\hat{G})$
AutoPARM	93.3%	7.61 (0.004)	8.05 (0.004)
NSCD	98.8%	25.00 (0.012)	37.33 (0.018)
MuBred	97.3%	12.219 (0.006)	16.229 (0.008)
WBS	37.9%	66.65 (0.033)	183.184 (0.089)
BS	62.7%	84.875 (0.041)	122.058 (0.060)
NMCD	0%	147.4365 (0.072)	554.9088 (0.271)

Table 3: Performance of NSCD for ARMA and Invertible MA Processes with Different Baseline Functions and Bandwidths When $K$ Is Known

		ARMA		MA	
Bandwidth	Baseline	$\varrho(\hat{G}||G)$	$\varrho(G||\hat{G})$	$\varrho(\hat{G}||G)$	$\varrho(G||\hat{G})$
$m = N^{1 / 3}$	${\hat{f}}_{0}$	35.29 (0.020)	34.74 (0.020)	13.09 (0.007)	13.09 (0.007)
$m = N^{1 / 3}$	$\frac{1}{2 π}$	32.74 (0.018)	32.07 (0.018)	13.11 (0.007)	13.11 (0.007)
$m = N^{1 / 4}$	${\hat{f}}_{0}$	31.12 (0.017)	31.01 (0.017)	12.84 (0.007)	12.84 (0.007)
$m = N^{1 / 4}$	$\frac{1}{2 π}$	26.39 (0.015)	26.38 (0.015)	13.71 (0.008)	13.71 (0.008)

Table 4: Performance of BIC Criterion of NSCD for ARMA and Invertible MA Processes with Comparison to AutoPARM, WBS, BS, NMCD, and MuBred

	ARMA			MA		
Method	$\hat{K}$	$\varrho(\hat{G}||G)$	$\varrho(G||\hat{G})$	$\hat{K}$	$\varrho(\hat{G}||G)$	$\varrho(G||\hat{G})$
AutoPARM	95.3%	48.37 (0.027)	27.62 (0.015)	99%	11.42 (0.006)	11.42 (0.006)
NSCD	99%	31.42 (0.017)	33.26 (0.018)	99.6%	14.59 (0.008)	15.69 (0.009)
MuBred	93.3%	40.436 (0.022)	44.467 (0.025)	97.7%	49.518 (0.028)	41.444 (0.023)
WBS	74.9%	184.393 (0.102)	66.288 (0.037)	91.8%	33.586 (0.019)	45.362 (0.025)
BS	77.7%	188.955 (0.105)	69.194 (0.038)	96.3%	29.673 (0.016)	36.611 (0.020)
NMCD	0%	188.64 (0.105)	324.31 (0.180)	0.1%	160.4434 (0.089)	326.4282 (0.181)

Table 5: Performance of NSCD for Non-Invertible MA Processes with Different Baseline Functions and Bandwidths When $K$ Is Known

		MA	
Bandwidth	Baseline	$\varrho(\hat{G}||G)$	$\varrho(G||\hat{G})$
$m = N^{1 / 3}$	${\hat{f}}_{0}$	33.60 (0.019)	33.60 (0.019)
$m = N^{1 / 3}$	$\frac{1}{2 π}$	36.00 (0.020)	36.00 (0.020)
$m = N^{1 / 4}$	${\hat{f}}_{0}$	40.11 (0.022)	40.11 (0.022)
$m = N^{1 / 4}$	$\frac{1}{2 π}$	41.00 (0.023)	41.00 (0.023)

Table 6: Performance of BIC Criterion of NSCD for Non-Invertible MA Processes with Comparison to AutoPARM, WBS, BS, NMCD, and MuBred

	MA		
Method	$\hat{K}$	$\varrho(\hat{G}||G)$	$\varrho(G||\hat{G})$
AutoPARM	36.7%	408.60 (0.227)	27.24 (0.015)
NSCD	99.7%	37.49 (0.021)	38.02 (0.021)
MuBred	8.4%	216.203 (0.120)	26.724 (0.015)
WBS	18.2%	488.674 (0.271)	90.493 (0.050)
BS	8.6%	576.462 (0.320)	56.975 (0.032)
NMCD	0%	165.54 (0.092)	326.701 (0.182)

Table 7: Performance of NSCD for Random Noise with $t(4)$ Distribution

Case	$\varrho(\hat{G}||G)$	$\varrho(G||\hat{G})$
Case 1	24.24 (0.012)	24.37 (0.012)
Case 2	29.67 (0.017)	28.91 (0.016)
Case 3	11.74 (0.007)	11.74 (0.007)
Case 4	36.36 (0.020)	36.36 (0.020)

Table 8: Performance of NSCD with Different Bandwidths While the Sample Size Is Reduced by Half

Case	Bandwidth	$\varrho(\hat{G}||G)$	$\varrho(G||\hat{G})$
Case 1	$m = N^{1 / 3}$	23.195 (0.023)	23.263 (0.023)
Case 1	$m = N^{1 / 4}$	27.716 (0.027)	28.075 (0.027)
Case 2	$m = N^{1 / 3}$	31.72 (0.035)	30.718 (0.035)
Case 2	$m = N^{1 / 4}$	28.104 (0.031)	27.734 (0.030)
Case 3	$m = N^{1 / 3}$	14.664 (0.016)	14.158 (0.016)
Case 3	$m = N^{1 / 4}$	12.652 (0.014)	12.652 (0.014)
Case 4	$m = N^{1 / 3}$	37.28 (0.041)	37.272 (0.041)
Case 4	$m = N^{1 / 4}$	44.3 (0.05)	43.877 (0.049)

Table 9: Performance of BIC Criterion of NSCD with Different Bandwidths While the Sample Size Is Reduced by Half

Case	Bandwidth	$\hat{K}$	$\varrho(\hat{G}||G)$	$\varrho(G||\hat{G})$
Case 1	$m = N^{1 / 3}$	96%	24.972 (0.024)	30.44 (0.030)
Case 1	$m = N^{1 / 4}$	90.4%	35.268 (0.040)	39.494 (0.044)
Case 2	$m = N^{1 / 3}$	96.2%	38.326 (0.043)	28.1 (0.031)
Case 2	$m = N^{1 / 4}$	98.7%	27.494 (0.031)	26.384 (0.029)
Case 3	$m = N^{1 / 3}$	96.5%	31.458 (0.035)	12.73 (0.014)
Case 3	$m = N^{1 / 4}$	96%	32.484 (0.036)	11.436 (0.013)
Case 4	$m = N^{1 / 3}$	95%	41.228 (0.046)	40.246 (0.045)
Case 4	$m = N^{1 / 4}$	92%	47.67 (0.053)	53.254 (0.059)

Table 10: Performance of NSCD with Searching Unit When $K$ Is Known

Case	$\varrho(\hat{G}||G)$	$\varrho(G||\hat{G})$
Case 1	22.9 (0.011)	22.9 (0.011)
Case 2	29.26 (0.016)	29.26 (0.016)
Case 3	10.14 (0.006)	10.14 (0.006)
Case 4	40.52 (0.023)	40.52 (0.023)

Table 11: Performance of BIC Criterion of NSCD with Searching Unit

Case	$\hat{K}$	$\varrho(\hat{G}||G)$	$\varrho(G||\hat{G})$
Case 1	96.2%	22.696 (0.011)	39.72 (0.019)
Case 2	97.2%	42.02 (0.0233)	30.68 (0.017)
Case 3	99.6%	12.54 (0.007)	10.14 (0.006)
Case 4	98.6%	40.56 (0.023)	43.74 (0.024)

Equations

$r_{j}$

$\left|g_{N}(\lambda)-1\right|^{2q}$

$(r_{N}(\lambda))^{2q}$

$(Z_{N}(\lambda))^{2q}$

$\left|\frac{f(\lambda)-Ef_{N}(\lambda)}{f(\lambda)}\right|^{2q}$

$\left|f(\lambda)-m\int W(m(\lambda-u))f(u)du\right|^{2q}$

$\left|m\int W(m(\lambda-u))(f(u)-EI_{N}(u))du\right|^{2q}$

$\left|\frac{f_{N}(\lambda)-Ef_{N}(\lambda)}{f(\lambda)}-(g_{N}(\lambda)-\sigma^{2})\right|^{2q}$

$ER_{1}(\lambda)$

$|d_{rs}|^{4q}$

$EQ_{rs}^{4q}$

$ER_{1}(\lambda)\leq\frac{BN^{q}m^{2q}}{N^{2q}}\sum\limits_{p=1}^{2q}\sum\limits_{t_{1}+\dots+t_{p}}\sum\limits_{r_{1},s_{1},\dots,r_{p},s_{p}}|a_{r_{1}}a_{s_{1}}|^{t_{1}}\cdots|a_{r_{p}}a_{s_{p}}|^{t_{p}}$

$R_{2}(\lambda)$

$\left|\frac{f_{N}(v)-f(v)}{f(v)}\right|^{2q}$

$P\left(\left|\frac{\hat{f}(\lambda)-f(\lambda)}{f(\lambda)}\right|\geq\epsilon\frac{m}{N^{1/2}}\right)$

$P\left(\max\limits_{\lambda_{j}}\left|\frac{f_{N}(\lambda_{j})-f(\lambda_{j})}{f(\lambda_{j})}\right|\geq\epsilon\frac{m}{n^{(q-1)/(2q)}}\right)$

$P\left(\max\limits_{1\leq k<l\leq N}\max\limits_{\lambda_{i}}\left|\frac{f_{N}(\lambda_{i})-f(\lambda_{i})}{f(\lambda_{i})}\right|\geq\epsilon\frac{m}{N^{\frac{q-3}{2q}}}\right)\leq\sum\limits_{k=1}^{N}\sum\limits_{l=1}^{N}B_{5}\frac{m^{2q}N^{q-3}}{N^{q-1}\epsilon^{2q}}=\frac{B_{5}m^{2q}}{\epsilon^{2q}}$ .

$f_{0}(\lambda)=\sum\limits_{k=1}^{K+1}\frac{N_{k}}{N}f_{k}(\lambda)$

$\max\limits_{\lambda_{i}\in\Lambda}\left|\frac{\hat{f}_{0}(\lambda_{i})-f_{0}(\lambda_{i})}{f_{0}(\lambda_{i})}\right|=O_{p}\left(\frac{m}{N^{(q-1)/(2q)}}\right)$

$\hat{f}_{0}(\lambda)$

$P\left(\left|\frac{\hat{f}_{0}(\lambda)-f_{0}(\lambda)}{f_{0}(\lambda)}\right|\geq\epsilon\frac{m}{N^{1/2}}\right)$

$\frac{m}{N}\int_{-\infty}^{+\infty}W(m(u-\lambda))\zeta_{k_{1}}(u)\bar{\zeta}_{k_{2}}(u)du$

$\frac{1}{N}m\int W(m(u-\lambda))\zeta_{1}(u)\bar{\zeta}_{2}(u)$

$E\frac{1}{N^{2q}}\left|\sum\limits_{l=1}^{N_{1}}w(lm^{-1})e^{il\lambda}\sum\limits_{j=N_{1}-l+1}^{N_{1}-m}\xi_{j}\xi_{j+l}\right|^{2q}$

$\left|\frac{m\int W(m(u-\lambda))\zeta_{1}(u)\bar{\zeta}_{2}(u)du}{f_{11}\bar{f}_{22}}-g_{n}(\lambda)\right|^{2q}$

$d_{rs}$

$0\leq\frac{\hat{f}_{kl}(\lambda_{i})}{f_{s}(\lambda_{i})}\leq 1+\epsilon\frac{B_{5}m}{N^{(q-3)/(2q)}}$

$\max\limits_{kl}\vartheta_{kl}$

$P\left(\max\limits_{kl}\vartheta_{kl}\geq B_{8\epsilon}N^{-\frac{q-3-2q\alpha}{2q}}\right)$

Taxonomy

Topics: Fault Detection and Control Systems · Spectroscopy and Chemometric Analyses · Control Systems and Identification

Full text

Nonparametric Multiple Change Point Detection

for Non-Stationary Time Series

Zixiang Guan, Gemai Chen

Department of Mathematics and Statistics

University of Calgary, Calgary, Alberta, T2N 1N4

Email: [email protected]

[email protected]

Author’s Footnote:

Zixiang Guan is PhD, Department of Mathematics and Statistics, University of Calgary (Email: [email protected]), Gemai Chen is Professor (Email: [email protected]), Department of Mathematics and Statistics, University of Calgary, Alberta, Canada.

Keywords: Changepoint; Dynamic Programming; Spectrums; BIC; Kullback-Leibler Divergence

1 Introduction

Time series analysis is a well-developed branch of statistics with a wide range of applications in engineering, economics, biology, and other fields. When investigating theoretical properties or analyzing real data, in both the time domain and the frequency domain, it is often assumed that the time series is stationary. In applications, however, stationarity may be violated, and analyzing non-stationary time series is challenging. In the ARIMA model (Brockwell and Davis 1991), the data are differenced a finite number of times so that they reduce to an ARMA process. Alternatively, we may assume that a non-stationary process consists of several stationary segments. Our goal is to segment such a time series properly.

Consider a sequence of data $\{X_{1},\ldots,X_{N}\}$ and let $\tau^{0}_{0},\tau^{0}_{1},\tau_{2}^{0},\ldots,\tau_{K}^{0},\tau^{0}_{K+1}$ be nonnegative integers satisfying $0=\tau^{0}_{0}<\tau^{0}_{1}<\tau_{2}^{0}<\ldots<\tau_{K}^{0}<\tau^{0}_{K+1}=N$. We assume that within the $j$th segment of data, i.e., $\tau_{j-1}^{0}+1\leq t\leq\tau_{j}^{0}$, $1\leq j\leq K+1$, $X_{t}$ is stationary. The $\tau_{j}^{0}$ are called structural breaks, or change points, and are unknown; $K$ is the number of change points. From the time-domain perspective, we can assume that each stationary segment is described by an appropriate statistical model whose structure varies across segments. Davis, Lee and Rodriguez-Yam (2006) divided a non-stationary time series into several different autoregressive processes. Kitagawa and Akaike (1978) detected change points using the AIC. In the frequency domain, Ombao, Raz, Von Sachs and Malow (2001) used a family of orthogonal wavelets called SLEX to partition a non-stationary process into stationary ones. Lavielle and Ludeña (2000) estimated change points using the Whittle log-likelihood when the time series was parametric. In Korkas and Fryzlewicz (2017), locally stationary wavelets were applied to estimate the second-order structure, and Wild Binary Segmentation was then used to divide the time series into several segments based on CUSUM statistics.

When the autocovariance function is absolutely summable, it is well known that a stationary process has a spectral density function, which is the Fourier transform of the autocovariance function (Brockwell and Davis 1991). The spectral density function has the useful property that, after being properly normalized, it becomes a well-defined probability density function (Priestley 1981). In probability theory, several functions exist for measuring the difference between probability density functions; the Kullback-Leibler divergence (K-L divergence) is one of them.

In Kullback and Leibler (1951), a divergence function was introduced to measure the discrimination information between two distribution functions. The original definition is as follows. Suppose there are two probability distributions, $Z_{1}(x)$ and $Z_{2}(x)$, with probability density functions $z_{1}(x)$ and $z_{2}(x)$, respectively. Then the Kullback-Leibler (K-L) divergence from $Z_{2}$ to $Z_{1}$ is

$\displaystyle D_{\mathrm{KL}}(Z_{1}\|Z_{2})=\int z_{1}(x)\log\left(\frac{z_{1}(x)}{z_{2}(x)}\right)dx$ .

As we can see, the K-L divergence is defined for probability density functions, which are non-negative. Since the spectral density function is also non-negative, we generalize the K-L divergence so that it applies to non-negative functions; the generalized version is still called the Kullback-Leibler divergence in this article and is used in the following sections. The definition is as follows. Suppose we have two non-negative functions, $f_{1}$ and $f_{2}$, defined on a common support. Then the Kullback-Leibler divergence of $f_{1}(x)$ with respect to $f_{2}(x)$ is

$\displaystyle D_{\mathrm{KL}}(f_{1}\|f_{2})=\int f_{1}(x)\log\frac{st(f_{1})}{st(f_{2})}dx$ ,

where $st(f_{1})=\frac{f_{1}(x)}{F_{1}}$, $st(f_{2})=\frac{f_{2}(x)}{F_{2}}$, and $F_{1}=\int f_{1}(x)dx$, $F_{2}=\int f_{2}(x)dx$. By Gibbs’s inequality, this generalized K-L divergence is still non-negative and equals 0 if and only if $f_{1}=cf_{2}$ almost everywhere for some positive constant $c$.

In this article, we apply the K-L divergence to normalized spectral density functions to define our objective function, and we estimate the change points by maximizing it. That is, we find the locations where the discrepancy between the spectral density functions reaches its maximum. To estimate a spectral density function, we adopt the classical method: we first calculate the periodogram and then smooth it with an appropriate spectral window. Although we assume that each stationary segment is a linear process, a class of stationary time series more general than autoregressive and ARMA processes, we calculate the spectral density function without estimating any parameters. Our change point detection method is therefore nonparametric, and we call it nonparametric spectral change-point detection (NSCD). In applications, the number of change points is unknown, so we propose a BIC criterion similar to those of Yao (1988) and Zou, Yin, Feng, and Wang (2014). The consistency of both the change point estimation and the BIC criterion can be guaranteed. The dynamic programming algorithm (Hawkins 2001) is used, but because of its computational complexity we also adopt the screening algorithm (Zou et al. 2014). In addition, Pruned Exact Linear Time (PELT) by Killick, Fearnhead and Eckley (2012) can speed up the algorithm when the number and locations of change points are estimated simultaneously.
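As a minimal numerical sketch of the generalized K-L divergence defined above, the following Python function evaluates it for two non-negative functions sampled on a common grid. The function names, the trapezoidal rule, and the example functions are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def _trapezoid(y, x):
    """Trapezoidal approximation of the integral of y over the grid x."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))


def generalized_kl(f1, f2, grid):
    """Generalized K-L divergence of f1 with respect to f2.

    f1 and f2 hold values of two non-negative functions on `grid`
    (f2 is assumed positive wherever f1 is positive).
    """
    f1 = np.asarray(f1, dtype=float)
    f2 = np.asarray(f2, dtype=float)
    st1 = f1 / _trapezoid(f1, grid)   # st(f1) = f1 / F1
    st2 = f2 / _trapezoid(f2, grid)   # st(f2) = f2 / F2
    integrand = np.zeros_like(f1)
    pos = f1 > 0                      # points with f1 = 0 contribute nothing
    integrand[pos] = f1[pos] * np.log(st1[pos] / st2[pos])
    return _trapezoid(integrand, grid)


# Hypothetical example: a non-flat function against a flat one on [-pi, pi].
grid = np.linspace(-np.pi, np.pi, 513)
f = 1.0 + 0.5 * np.cos(grid)
flat = np.full_like(grid, 3.0)
print(generalized_kl(f, 2.5 * f, grid))  # approximately 0: f and 2.5 f are proportional
print(generalized_kl(f, flat, grid))     # strictly positive
```

Because both functions are normalized before the logarithm is taken, rescaling either argument by a positive constant leaves the logarithmic term unchanged, which is why proportional functions give a divergence of zero.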

The rest of this article is organized as follows. In Section 2, we describe our objective functions. Asymptotic properties are presented in Section 3. Implementation and further details of our algorithm are given in Section 4. In Section 5, numerical simulations and comparisons with several existing methods are shown. The analysis of EEG data is presented in Section 6.

2 Model and Methodology

Consider the non-stationary time series $X_{t}=\sum\limits_{j=-\infty}^{+\infty}a_{j}(k)\xi_{t-j}$, $\xi_{j}\overset{i.i.d}{\sim}(0,\sigma^{2})$, $\tau^{0}_{k-1}<t\leq\tau^{0}_{k}$, $k=1,\ldots,K+1$, with $\tau_{0}^{0}=0$, $\tau^{0}_{K+1}=N$, and $\tau_{k}^{0}-\tau_{k-1}^{0}=N_{k}$. Here all $\xi_{t}$ are independent and identically distributed with mean $0$ and variance $\sigma^{2}$. The coefficients $\{a_{j}(k)\}$ satisfy $\sum\limits_{j}|a_{j}(k)|<+\infty$, $\forall k$, so that for $\tau_{k-1}^{0}<t\leq\tau_{k}^{0}$ the spectral density function exists; it is denoted by $f_{k}$ (Brockwell and Davis 1991). If $EX_{t}=\mu\neq 0$, we subtract the mean from the observations and work with $X_{t}-\mu$, so we assume $EX_{t}=0$ for all $t$ without loss of generality. Our goal is to detect $\tau_{k}^{0}$ when $f_{k}$ and $f_{k+1}$ are different. If we can find a spectral density function $f$ that is different from all $f_{k}$, then a discriminant function can be constructed that reaches its maximum at the $\tau_{k}^{0}$. Here we adopt the Kullback-Leibler divergence and define our objective function as follows:

$\displaystyle R(\tau_{1},\ldots,\tau_{K+1})=\sum\limits_{k=1}^{K+1}(\tau_{k}-\tau_{k-1})\int_{-\pi}^{\pi}\hat{f}_{k}(u)\log\frac{st(\hat{f}_{k})}{st(f)}du$ .

Here $\hat{F}_{k}=\int_{-\pi}^{\pi}\hat{f}_{k}(u)du$, $F=\int_{-\pi}^{\pi}f(u)du$, $st(\hat{f}_{k})=\frac{\hat{f}_{k}}{\hat{F}_{k}}$, and $st(f)=\frac{f}{F}$, so that $st(\hat{f}_{k})$ and $st(f)$ are probability density functions, and $0<\tau_{1}<\tau_{2}<\cdots<\tau_{K}<N$ is a candidate partition of the time series, with $\tau_{0}=0$ and $\tau_{K+1}=N$. There are two ways to choose $f$. First, we can use an estimate $\hat{f}_{0}$ of the spectral density function based on the whole time series. Although the whole time series is non-stationary, the estimated spectral density function still converges to a well-defined function, namely a weighted sum of the $f_{k}$ (see Lemma 2 for details). Second, if we know that the data are not white noise, we can take $f$ to be the spectral density of white noise, so that $st(f)=\frac{1}{2\pi}$.

There is a body of literature on the application of the K-L divergence to time series. Parzen (1982) used it to estimate the parameters of autoregressive processes, and Parzen (1983) extended it to ARMA processes. Shore (1981) minimized the K-L divergence to estimate a spectrum given that its prior spectrum has an exponential form. See Rao (1993) for further review. From an information point of view, the K-L divergence measures the information gained when a new probability density function is used instead of the old one. Our objective function therefore detects the locations that maximize the information gain when a new spectral density function $f$ comes in.

To estimate the spectral density function, we adopt the classical methodology. First, the periodogram is calculated as follows:

$\displaystyle I_{\tau_{k-1}+1,\tau_{k}}(\lambda)=\frac{1}{\tau_{k}-\tau_{k-1}}\left|\sum\limits_{j=\tau_{k-1}+1}^{\tau_{k}}X_{j}e^{-ij\lambda}\right|^{2}$

Since the periodogram is not a consistent estimator, we choose a spectral window, $W(u)$, to smooth it:

$\displaystyle\hat{f}_{k}(\lambda)=m\int_{-\pi}^{\pi}W(m(\lambda-u))I_{\tau_{k-1}+1,\tau_{k}}(u)du$ , $\lambda\in[-\pi,\pi]$ .

There are several choices for $W(u)$ (Priestley 1981). Since we want the normalized $\hat{f}_{k}$ to be a probability density function, i.e., $\hat{f}_{k}\geq 0$, we choose the Bartlett kernel, $mW(mu)=\sin^{2}(mu/2)/(2\pi m\sin^{2}(u/2))$, where $m$ is the bandwidth. In practice, one applies the Fourier transform to obtain the estimator on a grid of frequencies, denoted by $\Lambda=\{\lambda_{1},\ldots,\lambda_{N_{\lambda}}\}$, where $N_{\lambda}$ is the cardinality of $\Lambda$ and may tend to infinity as $N$ goes to infinity. We still use integral notation for the corresponding sums over $\Lambda$. We may set $N_{\lambda}=N$, so that $N_{\lambda}$ tends to infinity, or take $\Lambda$ to be the Nyquist frequencies. On this grid of frequencies, the K-L divergence applied to normalized spectral density functions is still non-negative, and it equals zero if and only if the two normalized spectral density functions are equal on $[-\pi,\pi]$.
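The smoothing step can be sketched in code. A standard identity is that convolving the periodogram with the Bartlett kernel equals a lag-window sum of sample autocovariances with triangular weights $1-|h|/m$, and the sketch below uses that form. The function name `bartlett_spectrum`, the equally spaced grid, and the simulated input are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def bartlett_spectrum(x, freqs, m):
    """Bartlett-smoothed spectral estimate of a zero-mean series on `freqs`.

    Uses the lag-window form
        f_hat(lam) = sum_{|h| < m} (1 - |h|/m) * gamma_hat(h) * cos(h * lam),
    where gamma_hat(h) is the biased sample autocovariance. With the
    periodogram normalized as in the text (no 1/(2*pi) factor), this is
    2*pi times the usual spectral density; the constant cancels once the
    estimate is normalized to integrate to one.
    """
    x = np.asarray(x, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    n = len(x)
    m = min(int(m), n)
    # Biased sample autocovariances for lags 0, ..., m-1 (mean assumed zero).
    gamma = np.array([np.dot(x[: n - h], x[h:]) / n for h in range(m)])
    weights = 1.0 - np.arange(m) / m          # triangular (Bartlett) weights
    lags = np.arange(1, m)
    cos_terms = np.cos(np.outer(freqs, lags))  # shape (len(freqs), m - 1)
    return gamma[0] + 2.0 * cos_terms @ (weights[1:] * gamma[1:])


# Hypothetical usage on simulated white noise.
rng = np.random.default_rng(0)
x = rng.standard_normal(200)
grid = np.linspace(-np.pi, np.pi, 101)
f_hat = bartlett_spectrum(x, grid, m=6)
```

The triangular weights are what make the estimate non-negative, which is the property the text relies on when normalizing $\hat{f}_{k}$ into a probability density function.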

When estimating change points, the number of change points in the data is unknown, so $K$ must be estimated. Yao and Au (1989) and Zou et al. (2014) proposed BIC criteria and showed their consistency. Following their work, we propose the BIC criterion

$BIC_{L}=-\max_{\tau_{1}^{\prime},\ldots,\tau_{L}^{\prime}}R(\tau_{1}^{\prime},\ldots,\tau_{L}^{\prime})+LC_{N}$ .

Here $C_{N}$ is an appropriate penalty term, depending on $N$, which will be specified later. Our estimator $\hat{K}$ is the value of $L$ that minimizes the criterion above.
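The selection rule can be illustrated with a short sketch, reading the criterion as $BIC_{L}=-(\text{maximized objective with } L \text{ change points})+LC_{N}$. The function name and the numbers in the example are hypothetical; the maximized objective values are assumed to have been computed beforehand.

```python
def select_number_of_change_points(max_objective, penalty):
    """Return (K_hat, bic_values) for L = 1, ..., K_max.

    max_objective[L - 1] is the maximized objective with L change points,
    and `penalty` plays the role of C_N.
    """
    bic = [-r + L * penalty for L, r in enumerate(max_objective, start=1)]
    k_hat = min(range(1, len(bic) + 1), key=lambda L: bic[L - 1])
    return k_hat, bic


# Hypothetical values: the objective gains little after L = 2.
k_hat, bic = select_number_of_change_points([10.0, 30.0, 32.0, 33.0], penalty=5.0)
print(k_hat)  # 2
```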

3 Asymptotic Theory

In Section 2, $K$ is a constant, while the $\tau_{i}^{0}$, $i=1,2,\ldots,K$, change with $N$. We therefore assume that there is a sequence of constants, $0<\kappa_{1}^{0}<\kappa_{2}^{0}<\cdots<\kappa_{K}^{0}<1$, such that $\{X_{t},t=1,2,\ldots,N\}$ is a realization of a non-stationary time series with $\tau_{k}^{0}=[\kappa_{k}^{0}N]$, $k=1,2,\ldots,K$, where $[x]$ denotes the largest integer not greater than $x$. The estimators of the $\kappa_{k}^{0}$ are denoted by $\hat{\kappa}_{k}$, $k=1,\ldots,K$. We first estimate $\tau_{k}^{0}$ by $\hat{\tau}_{k}$, obtained by maximizing the objective function $R(\tau_{1},\ldots,\tau_{K})$, and then set $\hat{\kappa}_{k}=\hat{\tau}_{k}/N$. In the literature, Yao and Au (1989) obtained a consistent estimator with rate $O_{p}(1)$ for $\hat{\tau}_{k}$, which means that the difference between the estimators and the true change points is bounded in probability. Zou et al. (2014) drew the same conclusion when the number of change points is constant. In Davis et al. (2006), consistency was attained in the sense that $|\hat{\kappa}_{k}-\kappa_{k}|<\epsilon$ in probability 1, where $\epsilon$ was some constant. The estimators there cannot converge as fast as those in Yao and Au (1989) and Zou et al. (2014) because estimating an AR process requires a sufficient number of samples. To guarantee estimation accuracy, we always search for the next change point at least $ml$ observations away from the previous one. For example, given $\hat{\tau}_{j}$, we look for the next change point starting from $\hat{\tau}_{j}+ml$. Our results are similar to those of Davis et al. (2006) and are based on a set of discrete frequencies $\Lambda$. The following assumptions are needed to obtain consistency:

A1: $E\xi_{t}^{8q}<\infty$, where $q$ is some integer satisfying $q\geq 3$, and $\{a_{j}(k)\}$ is absolutely summable, $\forall k$.

A2: $W(v)$ is a non-negative, even, bounded, integrable function with $\int_{-\pi}^{\pi}W(v)dv=1$ and $\int_{-\pi}^{\pi}\left(W(v)\right)^{1-\frac{1}{2q}}dv<\infty$.

A3: $N_{\mathrm{min}}/N\rightarrow c_{\mathrm{min}}>0$ as $N\rightarrow\infty$, and $N_{\mathrm{min}}>m$, where $N_{\mathrm{min}}=\min_{1\leq k\leq K}(\tau_{k}^{0}-\tau^{0}_{k-1})$.

A4: $\forall k$, $f_{k}$ is everywhere positive and satisfies the uniform Lipschitz condition $\left|f_{k}(u_{1})-f_{k}(u_{2})\right|\leq B_{f_{k}}|u_{1}-u_{2}|$, $\forall u_{1}\in[-\pi,\pi]$, $u_{2}\in[-\pi,\pi]$, where $B_{f_{k}}$ is a constant.

A5: $m=O(N^{\alpha})$, where $\frac{1}{4}\leq\alpha<\alpha+\frac{3}{2q}<\frac{1}{2}$.

A6: $w(0)=1$, and $w(v)$ has continuous derivatives up to order $2q$.

A7: $\{f_{k}(u)\}$ are linearly independent for $u\in[-\pi,\pi]$.

A8: $ml=c_{\mathrm{ml}}N$, where $c_{\mathrm{ml}}>0$, and $ml<N_{\mathrm{min}}$.

In A8, $ml$ is the minimal length of a segment used when estimating spectral density functions, since a sufficient number of observations is always necessary, especially when the convergence rate of the estimators is slow. Also, $ml<N_{\mathrm{min}}$, so that all change points are distinguishable. Assumption 7 guarantees that no linear combination of the other spectral densities equals $f_{j}$, $\forall j=1,\ldots,K$. Assumptions 1-6 are similar to those in Woodroofe and Van Ness (1967), so that $\max\limits_{\lambda_{j}\in\Lambda}\frac{\hat{f}_{k}(\lambda_{j})}{f_{k}(\lambda_{j})}$ can be bounded in probability. Assumption 4 can be strengthened so that, in Assumption 5, $\alpha$ can be less than $\frac{1}{4}$ (see Woodroofe and Van Ness (1967) for further details). Theorem 1 gives the consistency of the change point estimation.

Theorem 1.

When $K$ is known, $\hat{\kappa}_{j}\overset{p}{\rightarrow}\kappa_{j}$ , $\forall j=1,\ldots,K$ .

To estimate the number of change points, a pre-specified upper bound $K_{\mathrm{max}}$ satisfying $K<K_{\mathrm{max}}$ is given. The BIC values for each $1\leq L\leq K_{\mathrm{max}}$ are then calculated, and $\hat{K}$ is the value of $L$ at which the BIC reaches its minimum. The following theorem establishes the consistency of the BIC criterion.

Theorem 2.

If $C_{N}/N\rightarrow 0$ , $C_{N}/N^{\frac{q+3+2q\alpha}{2q}}\rightarrow\infty$ , we have $P(\hat{K}=K)\rightarrow 1$ .

4 Algorithm

Similar to Zou et al. (2014), our objective function is separable, so a commonly adopted algorithm for change point detection, dynamic programming (Hawkins 2001), can be applied. The main idea of dynamic programming is that the rightmost change point estimate, $\hat{\tau}_{K}$, is computed first. The data from $1$ to $\hat{\tau}_{K}$, which contain the remaining $K-1$ change points, are then segmented in the same way, so that $\tau_{K-1}$ and the earlier change points are estimated recursively. However, the computational complexity is $O(KN^{2})$, and once the discrete Fourier transform and spectrum smoothing are taken into account, the procedure is time-consuming.
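The recursion can be sketched for any separable objective. In the sketch below, `seg_score(s, e)` is a hypothetical callback returning the contribution of the observations with 0-based indices `s, ..., e-1`; in NSCD it would be the length-weighted K-L term of that segment, but here a simple mean-shift score is used only to keep the example self-contained. The function name and the `min_len` argument (playing the role of $ml$) are assumptions.

```python
import numpy as np


def dp_segment(n, K, seg_score, min_len=1):
    """Maximize the sum of seg_score over partitions with K change points.

    Returns (best_total, change_points), where change_points are the
    right end points tau_1 < ... < tau_K of the first K segments.
    """
    if (K + 1) * min_len > n:
        raise ValueError("series too short for K change points")
    best = np.full((K + 2, n + 1), -np.inf)  # best[k, e]: k segments covering 0..e
    arg = np.zeros((K + 2, n + 1), dtype=int)
    best[0, 0] = 0.0
    for k in range(1, K + 2):
        for e in range(k * min_len, n + 1):
            for s in range((k - 1) * min_len, e - min_len + 1):
                if best[k - 1, s] == -np.inf:
                    continue
                val = best[k - 1, s] + seg_score(s, e)
                if val > best[k, e]:
                    best[k, e] = val
                    arg[k, e] = s
    # Backtrack from the rightmost change point to the leftmost one.
    cps, e = [], n
    for k in range(K + 1, 1, -1):
        e = int(arg[k, e])
        cps.append(e)
    return float(best[K + 1, n]), sorted(cps)


# Hypothetical example: piecewise-constant mean, score = minus within-segment SSE.
x = np.array([0.0] * 5 + [10.0] * 5 + [0.0] * 5)
score = lambda s, e: -float(np.sum((x[s:e] - x[s:e].mean()) ** 2))
total, cps = dp_segment(len(x), 2, score, min_len=2)
print(cps)  # [5, 10]
```

The three nested loops make the $O(KN^{2})$ count of segment evaluations explicit; each evaluation in NSCD additionally requires a spectral estimate, which is why the text calls the procedure time-consuming.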

To reduce the computational complexity, Zou et al. (2014) proposed a screening algorithm. For $X_{j},\ldots,X_{j+l}$, where $l$ is a fixed integer, calculate the location at which the function below reaches its maximum.

[TABLE]

Here $\hat{f}_{j,j+r}$ and $\hat{f}_{j+r+1,j+l}$ denote the estimated spectral density functions based on the observations from $j$ to $j+r$ and from $j+r+1$ to $j+l$, respectively. Letting $j$ run from $1$ to $N-l$ gives a set $A_{sc}$ containing all the maximizers $r_{j}$, and dynamic programming is then applied on $A_{sc}$. The main idea is that if a change point is included in $X_{j},\ldots,X_{j+l}$, then the function above should reach its maximum at this true change point, so $A_{sc}$ contains the true change points. We should choose $l<N_{\mathrm{min}}$ so that $X_{j},\ldots,X_{j+l}$ contain at most one change point.
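The screening step can be sketched as follows. The local function from the source is not reproduced in this text, so the sketch takes it as a hypothetical callback `two_segment_score(a, r, b)` that scores splitting the window with 0-based indices `a, ..., b-1` at position `r`; the function name, the callback, and the mean-shift example are assumptions made only for illustration.

```python
import numpy as np


def screening_candidates(n, l, two_segment_score, min_len=1):
    """Collect, for every window of length l, the split with the largest score."""
    candidates = set()
    for j in range(0, n - l + 1):
        splits = range(j + min_len, j + l - min_len + 1)
        best_split = max(splits, key=lambda r: two_segment_score(j, r, j + l))
        candidates.add(best_split)
    return sorted(candidates)


# Hypothetical example with a single mean shift after 10 observations.
x = np.array([0.0] * 10 + [5.0] * 10)


def sse(a, b):
    seg = x[a:b]
    return float(np.sum((seg - seg.mean()) ** 2))


score = lambda a, r, b: -(sse(a, r) + sse(r, b))
cands = screening_candidates(len(x), l=6, two_segment_score=score)
print(10 in cands)  # True: the true change point survives screening
```

Windows that contain no change point still contribute a candidate, so the screened set is a superset of the true change points; the subsequent dynamic programming step is what discards the spurious ones.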

When calculating the spectrum within each segment of the time series, the most widely used method is the Fast Fourier Transform (FFT). A problem with the FFT, however, is that the spectral density function is estimated at the Nyquist frequencies, so time series of different lengths are estimated on different sets of Fourier frequencies. Consequently, when calculating the integral in $R(\tau_{1},\ldots,\tau_{K})$, we cannot align the frequencies at which the spectral density functions $f_{k}$ and $f$ are estimated. Our solution is to first choose a set of frequencies, denoted by $\Lambda$, and then apply the discrete Fourier transform on $\Lambda$ for every subset of samples, which solves the alignment problem naturally.
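The alignment issue and its fix can be illustrated with a short sketch. The function names and the grid size are assumptions; the periodogram follows the normalization used in Section 2, and it is evaluated by a direct discrete Fourier transform on an arbitrary grid rather than by the FFT.

```python
import numpy as np


def fourier_frequencies(n):
    """Frequencies (in radians) at which an FFT of length n is evaluated."""
    return 2.0 * np.pi * np.fft.fftfreq(n)


def periodogram_on_grid(x, freqs):
    """Periodogram |sum_j x_j exp(-i j lam)|^2 / n on an arbitrary grid."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    j = np.arange(1, n + 1)
    dft = np.exp(-1j * np.outer(freqs, j)) @ x
    return np.abs(dft) ** 2 / n


rng = np.random.default_rng(1)
short, long_ = rng.standard_normal(8), rng.standard_normal(12)

# FFT grids depend on the segment length, so they cannot be compared directly.
print(len(fourier_frequencies(8)), len(fourier_frequencies(12)))

# A common grid Lambda gives estimates at the same frequencies for both segments.
Lambda = np.linspace(-np.pi, np.pi, 64, endpoint=False)
I_short = periodogram_on_grid(short, Lambda)
I_long = periodogram_on_grid(long_, Lambda)
```

The direct transform costs $O(nN_{\lambda})$ per segment instead of $O(n\log n)$, which is the price paid for having every segment evaluated on the same $\Lambda$.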

For the BIC criterion, dynamic programming is usually applied first, and the values of the BIC criterion for all $L\leq K_{\mathrm{max}}$ are then calculated, which increases the complexity. Killick et al. (2012) proposed a method called Pruned Exact Linear Time (PELT), which can significantly reduce the computational complexity. The main idea is that, each time a new observation is included, all remaining candidate locations before it are checked. A location at which the objective function falls short by more than the penalty term $C_{N}$ is removed from the candidate set, and the next observation is added until the end of the series is reached. The cardinality of the set of remaining locations is the estimated number of change points, and its elements are the estimated change points. Under some assumptions, the computational complexity is linear in the sample size.
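A minimal sketch of the standard PELT recursion is given below, written in the usual cost-minimization form, so a segment cost corresponds to minus a segment's contribution to the objective. The function name, the callback `cost(s, t)`, and the mean-shift example are assumptions; the pruning rule is only exact for costs that do not increase when a segment is split, and whether the NSCD segment term satisfies this is not checked here.

```python
import numpy as np


def pelt(n, cost, penalty):
    """Minimize total segment cost plus `penalty` per change point.

    cost(s, t) is the cost of observations with 0-based indices s, ..., t-1.
    Returns the sorted list of estimated change points.
    """
    F = np.full(n + 1, np.inf)   # F[t]: optimal penalized cost of the first t points
    F[0] = -penalty              # so that the first segment carries no penalty
    prev = np.zeros(n + 1, dtype=int)
    candidates = [0]
    for t in range(1, n + 1):
        vals = np.array([F[s] + cost(s, t) for s in candidates])
        i = int(np.argmin(vals))
        F[t] = vals[i] + penalty
        prev[t] = candidates[i]
        # Pruning: a start s with F[s] + cost(s, t) > F[t] can never be optimal later.
        candidates = [s for s, v in zip(candidates, vals) if v <= F[t]]
        candidates.append(t)
    cps, t = [], n
    while prev[t] > 0:
        cps.append(int(prev[t]))
        t = int(prev[t])
    return sorted(cps)


# Hypothetical example: piecewise-constant mean with within-segment SSE as cost.
x = np.array([0.0] * 5 + [10.0] * 5 + [0.0] * 5)


def sse(s, t):
    seg = x[s:t]
    return float(np.sum((seg - seg.mean()) ** 2))


print(pelt(len(x), sse, penalty=1.0))  # [5, 10]
```

In this form the number of change points is not fixed in advance: it is read off from the backtracking step, which is how the number and the locations are estimated simultaneously.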

In the BIC criterion, another problem is how to choose $C_{N}$. Although we can set $C_{N}$ to satisfy the conditions in Theorem 2, such a choice may be too large and lead to underestimation of $K$. To overcome this difficulty, we first choose a window length equal to $ml$, and then compute the median, denoted by $me_{\mathrm{BIC}}$, of all values of the function below:

$\displaystyle\int\hat{f}_{j,j+ml}(u)\log\frac{st(\hat{f}_{j,j+ml})}{st(f)}du$ , $\forall j=1,\ldots,N-ml$ .

Here $\hat{f}_{j,j+ml}$ is the estimated spectral density function from $X_{j},\ldots,X_{j+ml}$, and $\hat{F}_{j,j+ml}$ is its integral. Finally, we set $C_{N}=me_{\mathrm{BIC}}\times N^{c}$. Our simulations show that an appropriate choice for $c$ is 0.73, regardless of $m$.
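The data-driven penalty can be written in a few lines once the window values of the function above are available. The function name and the example numbers are hypothetical; the window values are assumed to have been computed separately from sliding windows of length $ml$.

```python
import numpy as np


def bic_penalty(window_values, n, c=0.73):
    """Penalty = median of the window K-L values times n ** c."""
    me_bic = float(np.median(window_values))
    return me_bic * n ** c


# Hypothetical window values for a series of length 2000.
print(bic_penalty([0.8, 1.1, 0.9, 1.4, 1.0], n=2000))
```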

The selection of the spectral window is an important topic in spectral estimation. In Priestley (1981), the bandwidth is selected as follows:

$B_{W}=2\sqrt{6}\left(\frac{1}{m^{r}}k^{(r)}\right)^{1/r}$ ,

where $k^{(r)}=\lim\limits_{u\rightarrow 0}\frac{1-w(u)}{|u|^{r}}$ , where $w(u)$ is the inverse Fourier transformation of $W(u)$ , and $r$ is the largest integer so that the limit aforementioned exists and is non-zero. $m$ is the scale parameter in spectral windows. By the proofs of Theorem 1, we can see that estimators of change points reach consistency because $N$ dominate convergence rate. So bandwidth selection does not matter too much, which is verified by our simulations.

In applications, the sample size is sometimes large. Although methods such as screening and PELT can speed up the calculation, the computational cost may still be intolerable. We therefore set a change point searching unit, denoted by $n_{\mathrm{su}}$ (Hawkins 2001). That is, when searching for change points, we add a block of $n_{\mathrm{su}}$ observations into our calculation each time, rather than a single observation. This unit is different from $ml$, which is used to calculate spectra, since estimating a spectral density function usually needs a sufficient number of observations. With this unit, change points can only be estimated at $n_{\mathrm{su}}$, $2n_{\mathrm{su}}$, $3n_{\mathrm{su}}$, and so on. If we set the unit equal to 1, the algorithm reduces to the general case in which no searching unit is given. A unit larger than 1 dramatically increases the speed of the algorithm: the computational complexity of Dynamic Programming is $O(n^{2})$ (Zou et al. 2014), so with a searching unit $n_{\mathrm{su}}$ the complexity drops to $O(n^{2}/n^{2}_{\mathrm{su}})$. The price is some loss of estimation accuracy, since the true change points and the maximizers of the objective function may not lie on the grid. We suggest choosing $n_{\mathrm{su}}$ from the desired estimation accuracy. Intuitively, if we want the estimates to be accurate to within $1\%$ of the total sample size, we can set $n_{\mathrm{su}}=0.01n$.
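The effect of the searching unit on the candidate set can be sketched as follows. The helper names `candidate_grid` and `snap_to_grid` are hypothetical; they only illustrate that admissible locations are multiples of $n_{\mathrm{su}}$ and that a location off the grid is at most $n_{\mathrm{su}}/2$ away from its nearest grid point.

```python
def candidate_grid(n, n_su):
    """Admissible change point locations n_su, 2*n_su, ... strictly below n."""
    return list(range(n_su, n, n_su))


def snap_to_grid(tau, n_su):
    """Nearest multiple of n_su to the location tau."""
    return n_su * round(tau / n_su)


# Example with n = 2048 and n_su = 10 (the change points of Case 1).
grid = candidate_grid(2048, 10)
snapped = [snap_to_grid(tau, 10) for tau in (1024, 1536)]
```

Here the grid has 204 candidate locations instead of 2047, and the locations 1024 and 1536 map to 1020 and 1540.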

5 Simulation

In this section we show the finite sample properties of our method and compare it with AutoPARM, Wild Binary Segmentation (WBS), Binary Segmentation (BS), MuBred, and NMCD under several cases.

Following Zou et al. (2014), we calculate the distance between two sets $G$ and $\hat{G}$ , which are the true change point set and the estimated set, respectively, by

$\varrho(\hat{G}||G)=\sup\limits_{b\in G}\inf\limits_{a\in\hat{G}}|a-b|$ , and $\varrho(G||\hat{G})=\sup\limits_{a\in\hat{G}}\inf\limits_{b\in G}|a-b|$ .

The first measure is large when some true change point has no estimate close to it, while the second is large when some estimate lies far from every true change point. When $K$ is known, both measures should be small if the method performs well. In all the tables, $\varrho(\hat{G}||G)$ and $\varrho(G||\hat{G})$ are shown outside the parentheses, while $\varrho(\hat{G}||G)/N$ and $\varrho(G||\hat{G})/N$ are shown inside, which measure the estimation accuracy of $\hat{\kappa}$. In the following subsections, $\xi_{t}\overset{\mathrm{i.i.d}}{\sim}N(0,1)$ unless stated otherwise. To guarantee the estimation accuracy of the spectral density function, we set a minimal segment length, denoted by $ml$. In this simulation, we set $ml=350$ unless stated otherwise. By setting $ml$, we also assume that the distance between two adjacent change points is not smaller than $ml$, so we set $K_{\mathrm{max}}=6$ for each case. If $K$ has to be estimated, we report the percentage of replications in which $K$ is correctly detected. All simulations are based on 1000 replications.
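The two distances can be computed directly from their definitions. The following sketch uses hypothetical function names and an invented estimated set purely to show how the two measures differ.

```python
def one_sided_distance(targets, candidates):
    """sup over targets of the distance to the nearest candidate."""
    return max(min(abs(a - b) for a in candidates) for b in targets)


def rho_est_given_true(g_hat, g):
    """rho(G_hat || G): sup over true points of the nearest-estimate distance."""
    return one_sided_distance(g, g_hat)


def rho_true_given_est(g_hat, g):
    """rho(G || G_hat): sup over estimates of the nearest-true-point distance."""
    return one_sided_distance(g_hat, g)


# Invented example: true change points of Case 1 and an estimate with one
# spurious point at 1800.
g_true = [1024, 1536]
g_est = [1020, 1540, 1800]
first = rho_est_given_true(g_est, g_true)
second = rho_true_given_est(g_est, g_true)
```

In this example the first distance is 4, because every true change point has a nearby estimate, while the second is 264, because the spurious estimate at 1800 is far from both true change points.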

5.1 Autoregressive Process

Following the examples in Davis et al. (2006), we generate the non-stationary time series from the following.

Autoregressive Processes (Case 1):

1. $X_{t}-0.9X_{t-1}=\xi_{t}$ , $1\leq t\leq 1024$ ,

2. $X_{t}-1.69X_{t-1}+0.81X_{t-2}=\xi_{t}$ , $1025\leq t\leq 1536$ ,

3. $X_{t}-1.32X_{t-1}+0.81X_{t-2}=\xi_{t}$ , $1537\leq t\leq 2048$ .

Their normalized spectral density functions are shown in Figure 1.
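A realization of Case 1 can be generated with a short recursion. This is only a sketch under stated assumptions: the series is started from zero pre-sample values, each new segment continues from the last observations of the previous one, and the names `CASE1` and `simulate_piecewise_ar` are hypothetical.

```python
import random

# (last time index of the segment, AR coefficients phi_1, phi_2, ...),
# so that X_t = phi_1 * X_{t-1} + phi_2 * X_{t-2} + ... + xi_t.
CASE1 = [
    (1024, [0.9]),
    (1536, [1.69, -0.81]),
    (2048, [1.32, -0.81]),
]


def simulate_piecewise_ar(segments, seed=0):
    """Simulate a piecewise AR series with i.i.d. N(0, 1) innovations."""
    rng = random.Random(seed)
    x = []
    start = 0
    for end, phi in segments:
        for t in range(start, end):
            value = rng.gauss(0.0, 1.0)
            for lag, coef in enumerate(phi, start=1):
                if t - lag >= 0:  # pre-sample values are treated as zero
                    value += coef * x[t - lag]
            x.append(value)
        start = end
    return x


series = simulate_piecewise_ar(CASE1, seed=1)
```

The returned list has 2048 observations, with the coefficient changes taking effect after observations 1024 and 1536.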

In Table 1, we show the results of our method when $K$ is known, using either the spectral density function of the total sample or white noise as the baseline function, and with different choices of bandwidth. In Table 2, the simulation is conducted when $K$ is unknown. We adopt the Bartlett window since it guarantees the non-negativity of the estimated spectral density function. To reduce the computational complexity, a screening algorithm is applied.
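The non-negativity of a Bartlett-window estimate can be checked numerically. The sketch below assumes the standard lag-window form $\hat{f}(\lambda)=\frac{1}{2\pi}\sum_{|v|<m}w(v/m)\hat{\gamma}(v)\cos(v\lambda)$ with $w(u)=1-|u|$ and the biased sample autocovariance $\hat{\gamma}$; the exact normalization used in the paper may differ, and the function name is hypothetical.

```python
import math


def bartlett_spectral_estimate(x, m, freqs):
    """Lag-window spectral estimate with the Bartlett window.

    x: observations, m: integer scale parameter (m <= len(x)),
    freqs: frequencies in [-pi, pi] at which to evaluate the estimate.
    """
    n = len(x)
    mean = sum(x) / n
    xc = [v - mean for v in x]
    # Biased sample autocovariances for lags 0, ..., m - 1 (divisor n).
    gamma = [sum(xc[t] * xc[t + v] for t in range(n - v)) / n for v in range(m)]
    estimate = []
    for lam in freqs:
        total = gamma[0]
        for v in range(1, m):
            total += 2.0 * (1.0 - v / m) * gamma[v] * math.cos(v * lam)
        estimate.append(total / (2.0 * math.pi))
    return estimate
```

Because the Bartlett lag window corresponds to the non-negative Fejér kernel and the biased autocovariances form a positive semidefinite sequence, the returned values are non-negative up to rounding error.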

From Table 1, the choice of baseline function does not make much difference, and the results are quite similar under different bandwidths. We suggest choosing a larger bandwidth for a large sample size, since the estimation bias is then small. From Table 2, it is not surprising that AutoPARM outperforms NSCD, because AutoPARM assumes the exact structure of the data. WBS and BS perform worse than NSCD and cannot estimate the number of change points accurately. MuBred performs better than NSCD, although it does not need the assumption of an autoregressive process. NMCD fails to capture the true change points, since it is designed for independent random variables. In fact, it tends to overestimate the number of change points, giving 3.86 on average.

5.2 ARMA Process and Invertible Moving Average Process

Next, we simulate from ARMA and invertible MA processes. Table 3 and Table 4 contain the results.

ARMA (Case 2):

1. $X_{t}-X_{t-1}+0.25X_{t-2}=\xi_{t}+0.8\xi_{t-1}$ , $1\leq t\leq 500$ ,

2. $X_{t}-0.5X_{t-1}=\xi_{t}$ , $501\leq t\leq 1100$ ,

3. $X_{t}-1.7X_{t-1}+0.9X_{t-2}-0.168X_{t-3}=\xi_{t}-1.6\xi_{t-1}+0.79\xi_{t-2}-0.12\xi_{t-3}$ , $1101\leq t\leq 1800$ .

Invertible MA (Case 3):

1. $X_{t}=(3+B)(2-B)\xi_{t}$ , $1\leq t\leq 500$ ,

2. $X_{t}=(3-B)(2-B)\xi_{t}$ , $501\leq t\leq 1100$ ,

3. $X_{t}=(3+B)(2-B)\xi_{t}$ , $1101\leq t\leq 1800$ .
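The factored MA polynomials of Case 3 can be expanded and simulated as follows. Expanding gives $(3+B)(2-B)=6-B-B^{2}$ and $(3-B)(2-B)=6-5B+B^{2}$. The helper names `poly_mul` and `simulate_piecewise_ma` are hypothetical, and the innovations are drawn as i.i.d. $N(0,1)$.

```python
import random


def poly_mul(a, b):
    """Multiply two polynomials in B given as coefficient lists (constant first)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def simulate_piecewise_ma(segments, seed=0):
    """Simulate X_t = sum_j theta_j * xi_{t-j} with segment-specific theta."""
    rng = random.Random(seed)
    n = segments[-1][0]
    max_lag = max(len(theta) for _, theta in segments) - 1
    # noise[k] is the innovation at time k - max_lag, so lags exist at t = 0.
    noise = [rng.gauss(0.0, 1.0) for _ in range(n + max_lag)]
    x = []
    start = 0
    for end, theta in segments:
        for t in range(start, end):
            x.append(sum(c * noise[t + max_lag - j] for j, c in enumerate(theta)))
        start = end
    return x


CASE3 = [
    (500, poly_mul([3, 1], [2, -1])),    # 6 - B - B^2
    (1100, poly_mul([3, -1], [2, -1])),  # 6 - 5B + B^2
    (1800, poly_mul([3, 1], [2, -1])),
]
ma_series = simulate_piecewise_ma(CASE3, seed=1)
```

The roots of both polynomials ($B=-3,2$ and $B=3,2$) lie outside the unit circle, which is why these segments are invertible.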

We expect that AutoPARM still works, since it is well known that ARMA and invertible MA processes can be approximated by causal AR processes. Under the ARMA setting, the performance of NSCD is comparable to that of AutoPARM, while AutoPARM performs better under the MA setting. Both NSCD and AutoPARM perform better than WBS and BS. MuBred performs slightly worse, since the periodogram is not a consistent estimator. NMCD fails in both cases, as in the previous section, and again overestimates the number of change points, giving 3.005 and 3.00 on average, respectively.

5.3 Non-invertible MA Process

We now show simulation results when the samples are generated from non-invertible MA processes. In theory, a causal AR process cannot approximate a non-invertible MA process. Results are given in Tables 5 and 6.

Non-invertible MA (Case 4):

• $X_{t}=(1+2B+B^{2}+5B^{3})\xi_{t}$ , $1\leq t\leq 500$ ,

• $X_{t}=(1-2B+2B^{2}-5B^{3})\xi_{t}$ , $501\leq t\leq 1100$ ,

• $X_{t}=(1+2B-B^{2}+5B^{3})\xi_{t}$ , $1101\leq t\leq 1800$ .

From Table 6, we can see that NSCD works for Case 4, while AutoPARM and MuBred fail, since both underestimate the number of change points. A possible reason is that the marginal variance in Case 4 does not vary much across segments.
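That the three polynomials of Case 4 are non-invertible can be seen without computing roots. For $\theta(B)=c_{0}+c_{1}B+\cdots+c_{d}B^{d}$, the product of the moduli of the roots equals $|c_{0}/c_{d}|$, so a value below 1 forces at least one root inside the unit circle. The check below is a sufficient condition only, and its function name is hypothetical.

```python
def has_root_inside_unit_circle(coeffs):
    """Sufficient check via the product of root moduli, |c_0 / c_d| < 1.

    coeffs[j] is the coefficient of B**j. A return value of False is
    inconclusive: it does not prove that all roots lie outside the circle.
    """
    return abs(coeffs[0] / coeffs[-1]) < 1.0


CASE4_POLYNOMIALS = [
    [1, 2, 1, 5],
    [1, -2, 2, -5],
    [1, 2, -1, 5],
]
case4_flags = [has_root_inside_unit_circle(c) for c in CASE4_POLYNOMIALS]
```

For all three segments the product of root moduli is $1/5$, so each polynomial has a root inside the unit circle.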

5.4 Random Noise without the Existence of Higher Moments

Next, we investigate the results when $\xi$ does not have finite higher-order moments. Here $\xi\sim\frac{1}{\sqrt{2}}t(4)$, so that the variance of $\xi$ is still 1. The results for the four cases when $K$ is known are shown in Table 7.

Table 7 shows that Assumption 1 can be slightly violated in applications while good performance is still achieved.
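The scaled noise can be generated with the standard library as a sketch. A $t(4)$ variable has variance $4/(4-2)=2$, so dividing by $\sqrt{2}$ gives unit variance, while its fourth moment is infinite. The construction below uses a normal variable divided by the square root of an independent chi-square variable over its degrees of freedom; the function name is hypothetical.

```python
import math
import random


def scaled_t4_noise(n, seed=0):
    """Draw n values of t(4) / sqrt(2), which has variance 1."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        # Chi-square with 4 degrees of freedom is Gamma(shape=2, scale=2).
        chi2 = rng.gammavariate(2.0, 2.0)
        t4 = z / math.sqrt(chi2 / 4.0)
        out.append(t4 / math.sqrt(2.0))
    return out


noise = scaled_t4_noise(1000, seed=1)
```

Because of the heavy tails, the sample variance of such draws converges slowly, so individual realizations can deviate noticeably from 1.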

5.5 Investigation of Smaller Sample Size

In the previous sections, we discussed the performance when the sample size is about 2000, which is large. Here we reduce the sample size by half and investigate the estimation accuracy as well as the choice of bandwidth. We set $ml=200$ and $K_{\mathrm{max}}=5$. The baseline function is chosen to be $\hat{f}_{0}$. Results are shown in Table 8.

As shown in the table, the estimation accuracy is not affected by the choice of bandwidth when the sample size is relatively small. The estimation accuracy of $\kappa$ is worse than in Sections 5.1 to 5.3, since the sample size is halved.

5.6 Further Investigation of BIC Criterion

In this section, we investigate the performance under different choices of the exponent $c$ in $N^{c}$. Here $c$ ranges from 0.1 to 0.9, and $\hat{K}$ is plotted for all the cases mentioned above with two choices of bandwidth. All the results are shown in Figures 2a to 3d with baseline function $\hat{f}_{0}$. From the figures, we can see that $c=0.73$ is a good choice for all four cases.

Next, we examine the performance of BIC when $ml$ changes, since, as seen in the previous section, the choice of $C_{N}$ depends on $ml$. Figures 4a to 5d show the results when $ml=300$, while Figures 6a to 7d give the performance when $ml=250$.

As we can see from the figures, for these smaller values of $ml$, $c=0.73$ is no longer a good choice in general, and it overestimates the number of change points. Although overestimation is tolerable since we do not want to miss the true change points, we still suggest that a sufficiently large $ml$ is necessary. Since the maximum possible number of change points is $[N/ml]$, we suggest choosing a reasonable $K_{\mathrm{max}}$ based on prior information and then choosing an $ml$ satisfying $ml\leq N/K_{\mathrm{max}}$.

5.7 Influence of Change Point Searching Unit

In this section, we investigate the influence of $n_{\mathrm{su}}$. For all four cases, $n_{\mathrm{su}}=10$, which means that for Cases 2-4, $\tau_{1}^{0},\ldots,\tau_{K}^{0}$ are all multiples of $n_{\mathrm{su}}$, while for Case 1 the true change points are not divisible by $n_{\mathrm{su}}$. We set $m=N^{1/4}$ and baseline function $f=\hat{f}_{0}$ for the four cases. The results are shown in Tables 10 and 11.

We can see that setting an $n_{\mathrm{su}}$ does not affect our results much, even when the true change points are not multiples of $n_{\mathrm{su}}$ as in Case 1. It is therefore safe to use a searching unit in applications, which speeds up the computation while still giving accurate estimates.

6 Case Study

6.1 Simulated Data

In the simulation section, we checked the performance of our method on 4 cases and compared NSCD with other methods. Here we investigate the change point detection of these 4 cases more directly. Figures 8a-8d are realizations of Case 1 to Case 4. The long-dashed lines are the locations of the estimated change points, while the dotted lines represent the true change points. It is easy to see from Figure 8a that the change points are obvious, since the marginal variance of the second segment is larger. In Case 2, the second change point is far less obvious than the first one, as can be seen in Figure 8b. In Figure 8c, since the observations in the first two segments concentrate more around their mean than those in the third segment, the second change point may be detected by eye. In Figure 8d, the two change points are no longer apparent. However, from the perspective of the spectrum, those change points can be easily detected. Figures 9a-9d give the estimated spectral density functions. In Case 4, we can see that the power of the spectrum of the first segment concentrates more at higher and lower frequencies, while the spectrum of the second segment has more power at low frequency. The spectrum of the third segment is similar to that of the first one, but it has more power in the low frequency range.

6.2 Electroencephalography Recordings for Seizure

In this section, we investigate the performance of NSCD when applied to EEG data for seizure (Goldberger et al. 2000). The data is retrieved from https://www.physionet.org/pn6/chbmit/. For each subject, the EEG signal was recorded into several data files, each of which is one hour long. All subjects were monitored for several days to trace their states, so we only analyze a data file that contains a seizure. The sampling rate is 256 Hz, so the number of observations in each recording is 921600, which is huge. In the data file we choose (which is chb01_16), a seizure happened only once, with a duration of 51 seconds, so the number of change points is 2. The duration of the seizure is very short compared with the total length of the recording, so we analyze the part of the data that begins 10 seconds before the seizure and ends 10 seconds after it. The sample size is then 18176, which is still large, so we set a change point searching unit equal to 64, which is 0.25 second, to ease the computational burden. In addition, the minimal length $n_{\mathrm{min}}$ is 256, which is 1 second. There are 23 channels in total in the EEG recording; we apply our method to channels “FP1-F3” and “FP1-F7”. The results are shown in Figures 10a and 10b. The vertical dotted lines represent the beginning and end of the seizure, while the vertical dashed lines are the estimated change points. To demonstrate the changes in spectra, we plot the estimated spectra in Figure F3s and F7s. As we can see, NSCD successfully detects the true change points, while almost all the other estimates fall between the true change points. This is because during a seizure the EEG recordings change abruptly, which brings more non-stationarity into the time series.
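The sizes used for the analyzed window follow from simple arithmetic on the sampling rate, as the sketch below shows. The constant names are hypothetical; the numbers are those stated in the text.

```python
FS_HZ = 256            # sampling rate
SEIZURE_SECONDS = 51   # duration of the seizure
PADDING_SECONDS = 10   # data kept before and after the seizure
N_SU = 64              # change point searching unit, in samples
MIN_LENGTH = 256       # minimal segment length, in samples

# Samples in the analyzed window: (10 + 51 + 10) seconds at 256 Hz.
window_samples = (PADDING_SECONDS + SEIZURE_SECONDS + PADDING_SECONDS) * FS_HZ

# Time resolution of the searching unit and of the minimal length, in seconds.
unit_seconds = N_SU / FS_HZ
min_length_seconds = MIN_LENGTH / FS_HZ

# Number of blocks of N_SU samples in the window.
blocks = window_samples // N_SU
```

The window thus contains 18176 samples split into 284 blocks of 0.25 second each, so the search considers a few hundred candidate locations rather than tens of thousands.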

7 Conclusion

In this article, we propose a change point detection method based on spectral density functions for non-stationary time series. We assume that a non-stationary time series can be segmented into several linear processes. The Kullback-Leibler divergence is then applied to measure the discrepancy between different spectral density functions. A BIC criterion is suggested to estimate the number of change points. Owing to the separable structure of the objective function, we use Dynamic Programming to find the estimators. We also show the consistency of our estimators in theory and their estimation accuracy by simulations.

Appendix A Proofs

In the appendix, $B_{1}$ to $B_{8}$ are appropriate constants, and $B_{7\epsilon}$, $B_{8\epsilon}$ are constants depending on $\epsilon$.

Lemma 1.

Suppose $X_{t}=\sum\limits_{j=-\infty}^{+\infty}a_{j}\xi_{t-j}$ , $1\leq t\leq N$ , where $\sum\limits_{j}|a_{j}|<+\infty$ . $\xi_{j}\overset{\mathrm{i.i.d}}{\sim}(0,\sigma^{2})$ . Denote $f(\lambda)$ as the spectral density function of $X_{t}$ . $\hat{f}$ is the estimated spectral density function. Then under Assumptions 1-6,

$\displaystyle\left|\frac{\hat{f}(\lambda)-f(\lambda)}{f(\lambda)}\right|^{2q}=O_{p}(\frac{m^{2q}}{N^{q}})$

$\displaystyle\max\limits_{\lambda_{i}\in\Lambda}\left|\frac{\hat{f}(\lambda_{i})-f(\lambda_{i})}{f(\lambda_{i})}\right|^{2q}=O_{p}(\frac{m^{2q}}{N^{q-1}})$

Proof: Following the method of Woodroofe and Van Ness (1967), we separate our proofs into 3 parts.

Part 1: we show that for given $\lambda\in[-\pi,\pi]$ , $\displaystyle E|g_{N}(\lambda)-1|^{2q}=O_{p}\left(\frac{m^{2q}}{N^{q}}\right)$ . Here $g_{N}(\lambda)$ is the smoothed spectral density estimation for $\{\xi_{1},\cdots,\xi_{N}\}$ , $\xi$

[TABLE]

where $Z_{N}(\lambda)=\sum\limits_{t=1}^{N}Z_{N,t}(\lambda)$ ,

$Z_{N,t}(\lambda)=2\sum\limits_{v=1}^{m-1}\xi_{t}\xi_{t+v}w(vm^{-1})\cos(v\lambda)$ ,

$r_{N}(\lambda)=2\sum\limits_{t=N-m+2}^{N}\sum\limits_{v=N-t+1}^{m-1}\xi_{t}\xi_{t+v}w(vm^{-1})\cos(v\lambda)$ ,

$r_{N}=\sum\limits_{t=1}^{N}(\xi_{t}^{2}-\sigma^{2})$ .

Since for any real numbers $b_{1},\ldots,b_{p}$ , $\left(b_{1}+\ldots+b_{p}\right)^{2q}\leq 2^{2pq}(b_{1}^{2q}+\ldots+b_{p}^{2q})$ , we only need to prove that $\left(Z_{N}(\lambda)\right)^{2q}$ , $\left(r_{N}(\lambda)\right)^{2q}$ , $\left(r_{N}\right)^{2q}$ are $O_{p}(N^{q}m^{2q})$ . By Markov’s inequality, it suffices to show that $E\left(Z_{N}(\lambda)\right)^{2q}$ , $E\left(r_{N}(\lambda)\right)^{2q}$ , $E\left(r_{N}\right)^{2q}$ are $O(N^{q}m^{2q})$ .

After expanding $\left(r_{N}(\lambda)\right)^{2q}$ , we have

[TABLE]

where $\{j_{1},\ldots,j_{p}\}\subset\{N-m+2,\ldots,N\}$ . If $t_{k}\geq 2$ for every $k$ , then the expectation of $\left(\xi_{j_{1}}\xi_{j_{1}+v_{1}}\right)^{t_{1}}\cdots\left(\xi_{j_{p}}\xi_{j_{p}+v_{p}}\right)^{t_{p}}$ is not zero. If one of the $t_{k}$ is 1, for example $t_{k_{0}}=1$ , then since $j_{k}\leq N<j_{k}+v_{k}$ for any $k$ , it is impossible to find $\xi_{j_{k_{0}}}$ in $\{\xi_{j_{1}+v_{1}},\ldots,\xi_{j_{p}+v_{p}}\}$ , so the expectation is zero. To sum up, for all non-zero terms in $E\left(r_{N}(\lambda)\right)^{2q}$ , every $t_{k}$ must be at least 2. Since for any $s\leq 2q$ with $\sum\limits_{k=1}^{s}t_{k}=2q$ , both $w^{j}(u)\cos^{j}(u)$ and $E\xi_{1}^{t_{1}}\cdots\xi_{s}^{t_{s}}$ can be bounded by a real number, denoted by $B_{1}$ , we only need to count the number of non-zero terms. When $t_{k}=2$ for every $k$ , we have $p=q$ , and the number of non-zero terms is no more than $\left(C_{2q}^{2}C_{2q-2}^{2}\cdots C_{2}^{2}C_{m}^{q}\right)^{2}=O(m^{2q})$ . When at least one $t_{k}>2$ , say $t_{k_{0}}=3$ , the number of non-zero terms is no more than $\left(C_{2q}^{2}C_{2q-2}^{2}\cdots C_{4}^{3}C_{m}^{q-1}\right)^{2}=O(m^{2q-2})$ . So

$E\left(r_{N}(\lambda)\right)^{2q}\leq B_{1}m^{2q}$ .

For

$E(r_{N})^{2q}=E(\sum\limits_{t=1}^{N}(\xi_{t}^{2}-\sigma^{2}))^{2q}=\sum\limits_{p=1}^{2q}\sum\limits_{t_{1}+\ldots+t_{p}=2q}\sum\limits_{j_{1}\cdots j_{p}}E(\xi_{j_{1}}^{2}-\sigma^{2})^{t_{1}}\ldots(\xi_{j_{p}}^{2}-\sigma^{2})^{t_{p}}$ ,

we can see that $t_{k}\geq 2$ for any $k$ in all non-zero terms, so following the discussions above, we have

$E(\sum\limits_{t=1}^{N}(\xi_{t}^{2}-\sigma^{2}))^{2q}\leq B_{2}N^{q}$ .

So far, we have proven that $(r_{N}(\lambda))^{2q}$ and $r_{N}^{2q}$ are $O_{p}(N^{q}m^{2q})$ .

[TABLE]

First, when $t_{k}\geq 2$ for any $k$ , $p\leq q$ , the number of terms after expanding $Z_{N,j_{1}}^{t_{1}}$ is $(m-1)^{t_{1}}$ . So for fixed $t_{1},\ldots,t_{p}$ , $Z_{N,j_{1}}^{t_{1}}\cdots Z_{N,j_{p}}^{t_{p}}$ contains no more than $(m-1)^{t_{1}+\cdots+t_{p}}=(m-1)^{2q}$ terms. What is more, for any fixed $p$ , the number for all possible $j_{1},\ldots,j_{p}$ , of which the power satisfy $t_{1}+\cdots+t_{p}=2q$ , is no more than $p!\,C_{N}^{p}$ . So there are no more than $N^{q}m^{2q}$ terms, which means that $E\sum\limits_{p=1}^{q}\sum\limits_{t_{1}+\ldots+t_{p}=2q}\sum\limits_{j_{1}\cdots j_{p}}Z_{N,j_{1}}^{t_{1}}(\lambda)\cdots Z_{N,j_{p}}^{t_{p}}(\lambda)\leq B_{1}N^{q}m^{2q}$ .

When $p\leq q$ and some of $t_{k}$ are equal to 1, we assume that $t_{k_{s_{1}}}=\cdots=t_{k_{s_{c}}}=1$ . For $t_{k_{s_{1}}}$ , $j_{k_{s_{1}}}$ should satisfy $j_{k_{s_{1}}}-j_{k_{s_{1}}-1}<m$ , that is, $\xi_{j_{k_{s_{1}}}}\in\{\xi_{j_{k_{s_{1}}-1}+1},\ldots,\xi_{j_{k_{s_{1}}-1}+m-1}\}$ . If not, since $\xi_{j_{k_{s_{1}}}}$ is independent of $(Z_{N,j_{1}})^{t_{1}},\ldots,(Z_{N,j_{k_{s_{1}}-1}})^{t_{j_{k_{s_{1}}-1}}},\left(\xi_{j_{k_{s_{1}}}+1}+\cdots+\xi_{j_{k_{s_{1}}}+m-1}\right)$ ,

$\left(Z_{N,j_{k_{s_{1}}}+1}\right)^{t_{j_{k_{s_{1}}}+1}},\ldots,\left(Z_{N,j_{p}}\right)^{t_{p}}$ , we have

$E(Z_{N,j_{1}})^{t_{1}}\cdots\left(Z_{N,j_{p}}\right)^{t_{p}}=0$ .

So the number of all non-zero terms is no more than $C_{N}^{p-c}m^{c}$ . For $(Z_{N,j_{k_{s_{1}}-1}})^{t_{j_{k_{s_{1}}-1}}}$ , there are $(m-2)^{t_{j_{k_{s_{1}}-1}}}$ terms which contain $\xi_{j_{k_{s_{1}}}}$ . So there are $(m-1)^{t_{j_{k_{s_{1}}-1}}}-(m-2)^{t_{j_{k_{s_{1}}-1}}}=O((m-2)^{t_{j_{k_{s_{1}}-1}}})$ terms which do not contain $\xi_{j_{k_{s_{1}}}}$ . Then the number of non-zero terms is O( $N^{p-c}m^{c}m^{t_{1}+\cdots+t_{p}-c})\leq O(N^{q}m^{2q})$ . For $\xi_{j_{k_{s_{1}}}}$ , if $Z_{N,j_{k_{1}}}^{t_{k_{1}}},\ldots,Z_{N,j_{k_{d}}}^{t_{k_{d}}}$ , contain $\xi_{j_{k_{s_{1}}}}$ , then it is easy to see that there are totally no more than $m^{t_{k_{1}}+\cdots+t_{k_{d}}-1}$ terms which do not contain $\xi_{j_{k_{s_{1}}}}$ . However, now for all $j_{k_{1}},\ldots,j_{k_{d}}$ , $|j_{k_{v}}-j_{k_{s_{1}}}|\leq m-1$ . So the number of all possible non-zero terms is $O(N^{p-c-d+1}m^{c+d}m^{t_{1}+\cdots+t_{p}-c})\leq O(N^{q}m^{2q})$ . Hence $E(Z_{N,j_{1}})^{t_{1}}\cdots\left(Z_{N,j_{p}}\right)^{t_{p}}\leq B_{1}N^{q}m^{2q}$ .

When $p>q$ , at least $p-q$ of the $t_{j}$ equal 1, so following the discussions above, $E(Z_{N,j_{1}})^{t_{1}}\cdots\left(Z_{N,j_{p}}\right)^{t_{p}}\leq B_{1}N^{q}m^{2q}$ . Summarizing the discussions above, $E(Z_{N}(\lambda))^{2q}=O(N^{q}m^{2q})$ .

Part 2: we prove that $\displaystyle\left|\frac{f(\lambda)-Ef_{N}(\lambda)}{f(\lambda)}\right|^{2q}=O(\frac{m^{2q}}{N^{q}})$ .

[TABLE]

And

[TABLE]

Since $\alpha\geq\frac{1}{4}$ , we have $\displaystyle\lim\limits_{N\rightarrow\infty}\frac{1}{m^{2q}}/\frac{m^{2q}}{N^{q}}<+\infty$ .

And

[TABLE]

Since by Woodroofe and Van Ness (1967),

$\max_{|\lambda|\leq\pi}\left|f(\lambda)-EI_{N}(\lambda)\right|\leq B_{pe}\log N/N$ ,

where $B_{pe}$ is a constant. So

[TABLE]

And $\displaystyle\lim\limits_{N\rightarrow\infty}\frac{(\log N)^{2q}}{N^{2q}}/\frac{m^{2q}}{N^{q}}=0$ . So we complete Part 2.

Part 3: we show that $E\left|\frac{f_{N}(\lambda)-Ef_{N}(\lambda)}{f(\lambda)}-(g_{N}(\lambda)-\sigma^{2})\right|^{2q}=O_{p}(\frac{m^{2q}}{N^{q}})$ .

[TABLE]

where $J_{N}(u)$ is the periodogram for $\xi_{1},\ldots,\xi_{N}$ , $B_{3}$ is a constant. After some manipulations (see Woodroofe and Van Ness 1967, Grenander and Rosenblatt 1957).

[TABLE]

Now let us focus on $\left|d_{rs}\right|^{4q}$ , that is $r_{1}=\cdots=r_{p}$ , $s_{1}=\cdots=s_{p}$ , and denote $R_{\xi}(v)$ as the autocovariance function of $\xi_{j}$ .

[TABLE]

Again, we only need to prove that the expectation of $Q_{rs}^{4q}$ and $T_{rs}^{4q}$ are $O(N^{2q}m^{4q})$ . And

[TABLE]

For those four terms in the last inequality above, following the proofs in Part 1, we can show that each of them is $O(N^{2q}m^{4q})$ . Similarly to the discussions above, $E\left(T_{rs}(\lambda)\right)^{4q}=O(N^{2q}m^{4q})$ . So we have $E\left|d_{rs}(\lambda)\right|^{4q}\leq O(N^{2q}m^{4q})$ . When some of $(r_{k},s_{k})$ are different from each other, for example, $(r_{1},s_{1})\neq(r_{k},s_{k})$ for $k\geq 2$ , then in $E|d_{r_{1}s_{1}}(\lambda)|^{2t_{1}}\cdots|d_{r_{p}s_{p}}(\lambda)|^{2t_{p}}$ , the number of non-zero terms should be less than $E\left|d_{rs}(\lambda)\right|^{4q}$ because some $\xi_{j}$ in $d_{r_{1}s_{1}}(\lambda)$ are not in $d_{r_{k}s_{k}}(\lambda)$ . And now, when the power of $\xi_{j}$ is 1 in the expansion of $\left(d_{r_{1}s_{1}}(\lambda)\right)^{2t_{1}}$ , we can not find any $\xi_{j}$ in $\left(d_{r_{k}s_{k}}(\lambda)\right)^{2t_{k}}$ , so the expectations of these terms are 0. So,

[TABLE]

Since $\sum_{j}|a_{j}|<+\infty$ ,

$\sum\limits_{\begin{subarray}{c}r_{1},s_{1}\\ \cdots\\ r_{p},s_{p}\end{subarray}}\left|a_{r_{1}}a_{s_{1}}\right|^{t_{1}}\cdots\left|a_{r_{p}}a_{s_{p}}\right|^{t_{p}}<+\infty$

for any given $t_{1},\ldots,t_{p}$ . Since $t_{1}+\cdots+t_{p}=2q$ , $\displaystyle ER_{1}(\lambda)\leq B_{1}\frac{N^{q}m^{2q}}{N^{2q}}=B_{1}\frac{m^{2q}}{N^{q}}$ .

[TABLE]

where the second inequality holds because of Hölder inequality, and $\mathcal{F}(w^{(2q)})(u)$ is the Fourier transformation of the $2q$ th derivative of $w(v)$ . Since $w(v)=0$ for $v\geq 1$ , we have $w^{(2q)}=0$ for $v\geq 1$ , and $w^{(2q)}(v)$ is bounded. So similar to Part 1, we have $\displaystyle R_{2}(\lambda)\leq B_{5}m^{-2q}\frac{m^{2q}}{N^{q}}=\frac{B_{5}}{N^{q}}$ . Here we complete Part 3.

Since

[TABLE]

we have

$\displaystyle E\left|\frac{f_{N}(v)-f(v)}{f(v)}\right|^{2q}\leq B\frac{m^{2q}}{N^{q}}$ .

Then, by Markov inequality,

[TABLE]

So

[TABLE]

Here we complete the proof of Lemma 1.

Lemma 2.

Suppose $\forall k$ , $X_{t}=\sum\limits_{j=-\infty}^{+\infty}a_{j}(k)\epsilon_{t-j}$ satisfy Assumptions 1-8, then

[TABLE]

Proof:

[TABLE]

where $\zeta_{k_{1}}(u)=\sum\limits_{j=\tau_{k_{1}-1}^{0}+1}^{\tau_{k_{1}}^{0}}e^{-iju}X_{j}$ . So

[TABLE]

By Lemma 1, we only need to show that

$\displaystyle\left|\frac{m}{N}\int_{-\infty}^{+\infty}W(m(u-\lambda))\zeta_{k_{1}}(u)\bar{\zeta}_{k_{2}}(u)du\right|=O_{p}(\frac{m}{N^{1/2}})$ .

What is more, assume $k_{1}<k_{2}$ without loss of generality, then we have

[TABLE]

Since $N_{k}>m$ , $\forall k$ , and $w(u)=0$ for $|u|\geq 1$ , if $k_{2}-k_{1}>1$ , then $\zeta_{k_{1}}(u)\bar{\zeta}_{k_{2}}(u)=0$ . So without loss of generality, we set $k_{1}=1$ , $k_{2}=2$ . Next, we separate the proof into 2 parts, as in Lemma 1.

Part 1:

[TABLE]

So

[TABLE]

So for two terms in the inequality above, following the proofs in Part 1 of Lemma 1, we have

$\displaystyle E\left(\frac{1}{N^{2q}}2^{2q}\left(\sum\limits_{l=1}^{m}w(lm^{-1})\cos(l\lambda)\sum\limits_{j=n_{1}-l+1}^{n_{1}-m}\xi_{j}\xi_{j+l}\right)^{2q}\right)=O\left(\frac{m^{3q}}{N^{2q}}\right)$ ,

$\displaystyle E\left(\frac{1}{N^{2q}}2^{2q}\left(\sum\limits_{l=1}^{m}w(lm^{-1})\sin(l\lambda)\sum\limits_{j=n_{1}-l+1}^{n_{1}-m}\xi_{j}\xi_{j+l}\right)^{2q}\right)=O\left(\frac{m^{3q}}{N^{2q}}\right)$ .

Part 2: Set $f_{11}(u)$ , $f_{22}(u)$ satisfying $f_{1}(u)=f_{11}(u)\bar{f}_{11}(u)$ , $f_{2}(u)=f_{22}(u)\bar{f}_{22}(u)$ , where $\bar{f}_{22}$ denotes the conjugate of $f_{22}$ . We show that $\displaystyle\left|\frac{m\int W(m(u-\lambda))\zeta_{1}(u)\bar{\zeta}_{2}(u)du}{f_{11}\bar{f}_{22}}-g_{n}(\lambda)\right|^{2q}=O_{p}(\frac{m^{2q}}{N^{q}})$ .

[TABLE]

Since $\forall k$ , $f_{k}(\lambda)$ all satisfy uniform Lipschitz condition, then it is easy to see that $f_{11}(u)\bar{f}_{22}(u)$ also satisfies uniform Lipschitz condition. Following the proofs in Part 3 of Lemma 1, we have

$\displaystyle R_{2}(\lambda)=O_{p}(\frac{1}{N^{q}})$ .

Following the proofs in Part 3 of Lemma 1 again, we have $\displaystyle R_{1}(\lambda)=|\sum\limits_{r=-\infty,s=-\infty}^{+\infty}a_{r}(1)a_{s}(2)d_{rs}|^{2q}$ , where

[TABLE]

So following the proofs in Part 3 of Lemma 1 again, we have

$\displaystyle R_{1}(\lambda)=O_{p}(\frac{m^{2q}}{N^{q}})$ .

Here we complete the proof of Lemma 2.

Lemma 3.

Under Assumptions 1-8, $\forall s=1,\cdots,K+1$ , $\max\limits_{\tau_{s-1}^{0}\leq k<l\leq\tau_{s}^{0}}\vartheta_{kl}\sim O_{p}(N^{-\frac{q-3-2q\alpha}{2q}})$ , when $l-k=N_{kl}\geq ml$ . Here

$\displaystyle\vartheta_{kl}=\frac{N_{kl}}{N}\int_{-\pi}^{\pi}\hat{f}_{k}^{l}(v)\log\left(\frac{st(\hat{f}_{k})}{st(f_{s})}\right)dv$ .

Proof: Set $m_{kl}=N_{kl}^{\alpha}$ , and $B_{1}$ to $B_{6}$ are all constant. By Lemma 1, we have

$\displaystyle P\left(\max\limits_{\lambda_{i}}\left|\frac{\hat{f}_{k}^{l}(\lambda_{i})-f(\lambda_{i})}{f(\lambda_{i})}\right|^{2q}>\epsilon\frac{m^{2q}}{N^{(q-3)}}\right)\leq\frac{B_{5}}{N^{2}\epsilon^{2q}}$ ,

So denote $\displaystyle A_{kl}=\{\max\limits_{\lambda_{i}}\left|\frac{\hat{f}_{k}^{l}(\lambda_{i})-f(\lambda_{i})}{f(\lambda_{i})}\right|\leq\epsilon\frac{m}{N^{(q-3)/(2q)}}\}$ , then on set $A_{kl}$ ,

$\displaystyle\left|\frac{\hat{f}_{k}^{l}(\lambda_{i})-f(\lambda_{i})}{f(\lambda_{i})}\right|\leq\epsilon\frac{m}{N^{(q-3)/(2q)}}\Rightarrow\max(0,1-\epsilon\frac{B_{5}m}{N^{(q-3)/(2q)}})\leq\frac{\hat{f}_{kl}(\lambda_{i})}{f_{s}(\lambda_{i})}\leq 1+\epsilon\frac{B_{5}m}{N^{(q-3)/(2q)}}$ , $\forall k,l$ , $\lambda_{i}$ .

Since $\frac{\hat{f}_{kl}(\lambda_{i})}{f_{s}(\lambda_{i})}\geq 0$ ,

$\displaystyle\max(0,1-\epsilon\frac{B_{5}m}{N^{(q-3)/(2q)}})\leq\frac{\hat{f}_{kl}(\lambda_{i})}{f_{s}(\lambda_{i})}\leq 1+\epsilon\frac{B_{5}m}{N^{(q-3)/(2q)}}$ .

So on set $A_{kl}$ , we have

[TABLE]

Then on $\bigcap\limits_{kl}A_{kl}$ ,

[TABLE]

Here $B_{7\epsilon}$ is some constant containing $\epsilon$ . So set $B_{8\epsilon}>B_{7\epsilon}$ ,

[TABLE]

and the last inequality can be arbitrarily small by choosing sufficiently large $\epsilon$ . Here we complete proofs of Lemma 3.

Lemma 4.

Under Assumptions 1-8, $\forall s=1,\cdots,K+1$ , $\displaystyle\max\limits_{1\leq k<l\leq N}\vartheta_{kl}\sim O_{p}(N^{-\frac{q-3-2q\alpha}{2q}})$ , when $l-k=N_{kl}\geq ml$ . Here

$\displaystyle\vartheta_{kl}=\frac{N_{kl}}{N}\int_{-\pi}^{\pi}\hat{f}_{k}^{l}(v)\log\left(\frac{st(\hat{f}_{k})}{st(f_{s})}\right)dv$ .

Proof: Following the proofs of Lemma 3, this lemma can be easily obtained.

Proof of Theorem 1:

Denote $\displaystyle A_{kl}=\left\{\omega:\max\limits_{\lambda_{i}}\left|\frac{\hat{f}_{k}^{l}(\lambda_{i})-f(\lambda_{i})}{f(\lambda_{i})}\right|\leq\epsilon\frac{m}{N^{(q-3)/(2q)}},\forall j=1,\ldots,K\right\}$ , where $\omega$ is an outcome in the probability space $(\Omega,\mathcal{F},P)$ . Then $\forall\omega\in\bigcap\limits_{kl}A_{kl}$ , we prove that $\hat{\kappa}_{k}\rightarrow\kappa_{k}^{0}$ , $\forall k$ .

For $\hat{\kappa}_{k}$ , since $\hat{\kappa}_{k}(\omega)$ is bounded, $\forall k$ , then there exists $\{n_{s}\}$ such that $\hat{\kappa}_{n_{s}}(\omega)\rightarrow\kappa_{k}^{*}$ on the subsequence. It follows from Lemma 2 and 3, that

$\displaystyle\frac{1}{N}R(\kappa_{1}^{*},\ldots,\kappa_{K}^{*})\leq\sum\limits_{i=1}^{K+1}(\kappa_{i}^{*}-\kappa_{i-1}^{*})\int f_{\kappa_{i-1}^{*}\kappa_{i}^{*}}(v)\log\frac{st(f_{\kappa_{i-1}^{*}\kappa_{i}^{*}})}{st(f)}dv+O(N^{-\frac{q-3-2q\alpha}{2q}})$ ,

where $\lambda_{0}^{*}=0$ , $\lambda_{K+1}^{*}=1$ . If $\kappa_{i-1}\leq\kappa_{j-1}^{*}<\kappa_{k}<\cdots<\kappa_{i+k}<\kappa_{j}^{*}$ , then by Lemma 3, we have

$\displaystyle f_{\kappa_{j-1}^{*}\kappa_{j}^{*}}=\frac{\kappa_{i}-\kappa_{j-1}^{*}}{\kappa_{j}^{*}-\kappa_{j-1}^{*}}f_{i}+\frac{\kappa_{i+1}-\kappa_{i}}{\kappa_{j}^{*}-\kappa_{j-1}^{*}}f_{i+1}+\ldots+\frac{\kappa_{j}^{*}-\kappa_{j+k}}{\kappa_{j}^{*}-\kappa_{j-1}^{*}}f_{j+k+1}$

$\displaystyle F_{\kappa_{j-1}^{*}\kappa_{j}^{*}}=\frac{\kappa_{i}-\kappa_{j-1}^{*}}{\kappa_{j}^{*}-\kappa_{j-1}^{*}}F_{i}+\frac{\kappa_{i+1}-\kappa_{i}}{\kappa_{j}^{*}-\kappa_{j-1}^{*}}F_{i+1}+\ldots+\frac{\kappa_{j}^{*}-\kappa_{j+k}}{\kappa_{j}^{*}-\kappa_{j-1}^{*}}F_{j+k+1}$ .

Since $\frac{1}{N}R(\kappa_{1},\ldots,\kappa_{K})=\sum\limits_{k=1}^{K+1}\int f_{k}(u)\log\frac{st(f_{k})}{st(f)}du$ .

So

[TABLE]

as $N\rightarrow\infty$ , since every term above is always negative. If $\kappa_{j-1}<\kappa_{k}^{*}<\kappa_{k+1}^{*}<\kappa_{j}$ , then

[TABLE]

So we have

[TABLE]

as $N\rightarrow\infty$ . This is a contradiction because $\hat{\kappa}_{k}$ are the maximizers of $R(\kappa_{1},\ldots,\kappa_{K})$ . Since

$\displaystyle P\left(\bigcap\limits_{k,l}A_{kl}\right)=1-P\left(\bigcup\limits_{k,l}A^{c}_{kl}\right)\geq 1-\sum\limits_{l-k\geq ml}^{N}P\left(A_{kl}^{c}\right)=1-\frac{B_{1}}{\epsilon^{2q}}$ .

so $\hat{\kappa}_{j}\overset{p}{\rightarrow}\kappa_{j}^{0}$ .

Proof of Theorem 2:

If $L<K$ , then there should be a change point $\lambda_{j}^{0}$ that cannot be estimated consistently. Then following the proof of Theorem 1,

[TABLE]

as $N\rightarrow\infty$ .

If $\hat{K}>K$ , then still every change point $\lambda_{i}$ should be estimated consistently. So, there should be a change point $\kappa_{j-1}<\kappa_{k}^{*}<\kappa_{j}$ . Then by Lemma 2 and 3, $\displaystyle R(\kappa_{j-1},\kappa_{k}^{*},\kappa_{j})-R(\kappa_{j-1},\kappa_{j})=O_{p}(N^{-\frac{q-3-2q\alpha}{2q}})$ . So $BIC_{L}-BIC_{K}=O_{p}(N^{-\frac{q-3-2q\alpha}{2q}})+(K-L)C_{N}/N<0$ , as $N\rightarrow\infty$ . Here we complete the proofs of Theorem 2.

Bibliography

[1] Brockwell, P. J., and Davis, R. A. (1991), Time Series: Theory and Methods (2nd Edition), Springer-Verlag.
[2] Davis, R. A., Lee, T. C. M., and Rodriguez-Yam, G. A. (2006), “Structural Break Estimation for Nonstationary Time Series Models,” Journal of the American Statistical Association, 101(473), 223-239.
[3] Fan, J., and Yao, Q. (2003), Nonlinear Time Series, Springer.
[4] Goldberger, A. L., Amaral, L. A. N., Glass, L., Hausdorff, J. M., Ivanov, P. Ch., Mark, R. G., Mietus, J. E., Moody, G. B., Peng, C-K., and Stanley, H. E. (2000), “PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals,” Circulation, 101(23), e215-e220.
[5] Hawkins, D. M. (2001), “Fitting Multiple Change-point Models to Data,” Computational Statistics & Data Analysis, 37(3), 323-341.
[6] Killick, R., Fearnhead, P., and Eckley, I. A. (2012), “Optimal Detection of Changepoints With a Linear Computational Cost,” Journal of the American Statistical Association, 107(500), 1590-1598.
[7] Kitagawa, G., and Akaike, H. (1978), “A Procedure for the Modeling of Non-Stationary Time Series,” Annals of the Institute of Statistical Mathematics, 30, 351-363.
[8] Korkas, K. K., and Fryzlewicz, P. (2017), “Multiple Change-Point Detection for Non-Stationary Time Series Using Wild Binary Segmentation,” Statistica Sinica, 27(1), 287-311.
