Detecting multiple generalized change-points by isolating single ones

Andreas Anastasiou; Piotr Fryzlewicz

arXiv:1901.10852·stat.ME·June 2, 2021

Detecting multiple generalized change-points by isolating single ones

Andreas Anastasiou, Piotr Fryzlewicz

PDF

Open Access

TL;DR

The paper presents Isolate-Detect (ID), a novel method for accurately identifying multiple change-points in noisy data sequences, capable of handling various signal changes and outperforming existing methods.

Contribution

Introduction of the Isolate-Detect (ID) method, a new approach for consistent estimation of multiple change-points with enhanced accuracy and minimal parameter tuning.

Findings

01

ID outperforms state-of-the-art methods in various scenarios.

02

ID effectively detects change-points with small magnitudes.

03

The method is implemented in R packages IDetect and breakfast.

Abstract

We introduce a new approach, called Isolate-Detect (ID), for the consistent estimation of the number and location of multiple generalized change-points in noisy data sequences. Examples of signal changes that ID can deal with are changes in the mean of a piecewise-constant signal and changes, continuous or not, in the linear trend. The number of change-points can increase with the sample size. Our method is based on an isolation technique, which prevents the consideration of intervals that contain more than one change-point. This isolation enhances ID's accuracy as it allows for detection in the presence of frequent changes of possibly small magnitudes. In ID, model selection is carried out via thresholding, or an information criterion, or SDLL, or a hybrid involving the former two. The hybrid model selection leads to a general method with very good practical performance and minimal…

Tables23

Table 1. Table 1: Distribution of N ^ − N ^ 𝑁 𝑁 \hat{N}-N over 100 simulated data sequences from (S1). The average MSE is also given.

		•
		$\hat{N} - N$
Signal	Method	$\leq - 15$	$(- 15, - 5]$	$[- 4, 4]$	$[5, 15)$	$\geq 15$	MSE
	ID	0	0	100	0	0	$13 \times 10^{- 5}$
(S1)	NOT	5	86	9	0	0	$141 \times 10^{- 5}$
	MARS	100	0	0	0	0	$284 \times 10^{- 5}$
	ID	0	1	97	2	0	$94 \times 10^{- 5}$
(S2)	NOT	100	0	0	0	0	$485 \times 10^{- 5}$
	PELT	78	22	0	0	0	$437 \times 10^{- 5}$
	WBS	27	71	2	0	0	$413 \times 10^{- 5}$

Table 2. Table 2: The competing methods used in the simulation study

Type of signal	Method notation	Reference	R package
	PELT	Killick et al. (2012)	changepoint
	NP.PELT	Haynes et al. (2017)	changepoint.np
	S3IB	Rigaill (2015)	Segmentor3IsBack
	CumSeg	Muggeo and Adelfio (2011)	cumSeg
Piecewise-constant	CPM	Ross (2015)	cpm
	WBS	Fryzlewicz (2014)	wbs
	WBS2	Fryzlewicz (2020)	breakfast
	NOT	Baranowski et al. (2019)	not
	FDR	Li et al. (2016)	FDRSeg
	TGUH	Fryzlewicz (2018)	breakfast
	NOT	Baranowski et al. (2019)	not
	TF	Kim et al. (2009)	-
Continuous piecewise-linear	CPOP	Maidstone et al. (2019)	-
	MARS	Friedman (1991)	earth
	FKS	Spiriti et al. (2013)	freeknotsplines

Table 3. Table 3: Distribution of N ^ − N ^ 𝑁 𝑁 \hat{N}-N over 100 simulated data sequences of the piecewise-constant signals (M1)-(M4). The average MSE, d H subscript 𝑑 𝐻 d_{H} and computational time are also given

		•
		$\hat{N} - N$
Method	Model	$\leq - 3$	$- 2$	$- 1$	0	1	2	$\geq 3$	MSE	$d_{H}$	Time (ms)
PELT		6	32	50	12	0	0	0	3.23	0.14	3
NP.PELT		0	2	27	49	15	5	2	2.82	0.10	211.8
S3IB		0	7	38	54	1	0	0	2.49	0.08	343.2
CumSeg		39	21	38	2	0	0	0	6.37	0.20	62.3
CPM. $l .$ 500		0	0	0	3	3	4	90	4.45	0.44	2.3
CPM. $l .$ 3000		0	0	8	41	26	19	6	3.03	0.19	3.3
WBSC1	(M1)	0	0	11	32	27	19	11	2.79	0.25	99.3
WBSIC		0	3	37	53	7	0	0	2.59	0.08	99.3
WBS2		0	3	54	31	8	2	2	2.64	0.09	623.3
NOT		0	3	51	43	3	0	0	2.61	0.10	80.7
FDR		0	0	33	54	12	1	0	2.51	0.09	-
TGUH		0	5	37	49	7	1	1	3.30	0.08	127.4
ID		0	3	30	62	5	0	0	2.66	0.08	23.9
ID.SDLL		1	2	59	28	5	3	2	2.80	0.10	20
ID $_{\sqrt{3 / 2}}$		0	9	62	28	1	0	0	2.75	0.09	22.3
PELT		85	6	0	9	0	0	0	181 $\times 10^{- 3}$	6.62	1.1
NP.PELT		84	12	3	1	0	0	0	165 $\times 10^{- 3}$	4.26	3.1
S3IB		41	15	1	43	0	0	0	117 $\times 10^{- 3}$	3.73	15.2
CumSeg		100	0	0	0	0	0	0	251 $\times 10^{- 3}$	-	3.9
CPM. $l .$ 500		78	4	15	3	0	0	0	145 $\times 10^{- 3}$	2.96	0.4
WBSC1	(M2)	1	2	7	72	12	6	0	53 $\times 10^{- 3}$	0.33	38.2
WBSIC		7	8	1	68	13	3	0	64 $\times 10^{- 3}$	1.00	38.2
WBS2		3	3	4	71	10	4	5	58 $\times 10^{- 3}$	0.363	30.5
NOT		9	7	4	73	6	1	0	65 $\times 10^{- 3}$	0.97	43.4
FDR		14	11	11	55	7	2	0	71 $\times 10^{- 3}$	0.80	-
TGUH		4	18	3	68	7	0	0	64 $\times 10^{- 3}$	0.47	22.8
ID		7	7	1	74	11	0	0	60 $\times 10^{- 3}$	0.87	8.8
ID.SDLL		5	5	6	63	8	4	9	62 $\times 10^{- 3}$	0.43	3.7
ID $_{\sqrt{3 / 2}}$		28	13	9	47	3	0	0	84 $\times 10^{- 3}$	0.90	5.3
PELT		0	2	7	90	1	0	0	23 $\times 10^{- 3}$	0.15	1.1
NP.PELT		100	0	0	0	0	0	0	781 $\times 10^{- 3}$	1.78	4.2
S3IB		98	1	1	0	0	0	0	213 $\times 10^{- 3}$	0.91	20.2
CumSeg		0	3	16	72	9	0	0	65 $\times 10^{- 3}$	0.32	5.2
CPM. $l .$ 500		1	6	87	6	0	0	0	51 $\times 10^{- 3}$	0.85	0.2
WBSC1	(M3)	0	0	0	66	26	7	1	24 $\times 10^{- 3}$	0.19	37.3
WBSIC		0	0	0	64	27	9	0	24 $\times 10^{- 3}$	0.18	37.3
WBS2		0	0	1	87	8	2	2	25 $\times 10^{- 3}$	0.17	34.7
NOT		0	0	0	93	7	0	0	21 $\times 10^{- 3}$	0.13	118.3
FDR		0	0	2	77	15	5	1	23 $\times 10^{- 3}$	0.17	-
TGUH		0	0	1	91	6	2	0	25 $\times 10^{- 3}$	0.15	25.2
ID		0	0	0	91	8	1	0	22 $\times 10^{- 3}$	0.13	9.8
ID.SDLL		0	0	1	97	1	0	1	24 $\times 10^{- 3}$	0.14	6.8
ID $_{\sqrt{𝟑 / 𝟐}}$		0	0	2	94	3	1	0	23 $\times 10^{- 3}$	0.15	4.4
PELT		-	53	0	47	0	0	0	14 $\times 10^{- 3}$	0.54	6.7
NP.PELT		-	0	0	21	3	34	42	14 $\times 10^{- 3}$	0.47	395.2
S3IB		-	12	0	87	1	0	0	7 $\times 10^{- 3}$	0.12	292.1
CumSeg		-	100	0	0	0	0	0	23 $\times 10^{- 3}$	-	84.6
CPM. $l .$ 500		-	0	0	0	0	6	94	31 $\times 10^{- 3}$	0.76	14
CPM. $l .$ 2000		-	0	0	35	11	22	32	13 $\times 10^{- 3}$	0.39	20.4
WBSC1	(M4)	-	0	0	23	20	17	40	13 $\times 10^{- 3}$	0.50	120.8
WBSIC		-	4	0	96	1	0	0	5 $\times 10^{- 3}$	0.04	119.2
WBS2		-	0	1	83	10	4	2	5 $\times 10^{- 3}$	0.09	666.4
NOT		-	8	0	92	0	0	0	6 $\times 10^{- 3}$	0.08	61.8
FDR		-	0	19	70	10	1	0	9 $\times 10^{- 3}$	0.07	-
TGUH		-	0	51	40	7	2	0	23 $\times 10^{- 3}$	0.28	169.2
ID		-	7	0	93	0	0	0	6 $\times 10^{- 3}$	0.07	42.3
ID.SDLL		-	0	0	81	4	10	5	7 $\times 10^{- 3}$	0.10	28.7
ID $_{\sqrt{𝟑 / 𝟐}}$		-	1	0	98	1	0	0	5 $\times 10^{- 3}$	0.05	66.4

Table 4. Table 4: Distribution of N ^ − N ^ 𝑁 𝑁 \hat{N}-N over 100 simulated data sequences from the piecewise-constant signal (M5). The average MSE, d H subscript 𝑑 𝐻 d_{H} and computational time are also given

	•
	$\hat{N} - N$
Method	$\leq - 500$	$(- 500, - 50]$	$(- 50, - 10)$	$[- 10, 10]$	$> 10$	MSE	$d_{H}$	Time (s)
PELT	100	0	0	0	0	1.97	114.92	0.033
NP.PELT	100	0	0	0	0	2.25	551.89	8.976
S3IB	99	1	0	0	0	2.23	1979.95	332.841
CumSeg	100	0	0	0	0	2.25	1999	0.551
CPM. $l .$ 500	0	45	54	1	0	0.19	9.00	0.002
CPM. $l .$ 20000	100	0	0	0	0	2.23	1999	1.245
WBSC1	100	0	0	0	0	1.51	35.26	12.272
WBSIC	100	0	0	0	0	2.25	1999	12.272
WBS2	0	0	0	100	0	0.14	0.54	5.796
NOT	100	0	0	0	0	2.25	1999	0.484
FDR	0	0	0	5	95	0.14	0.51	-
TGUH	0	0	0	100	0	0.16	0.84	0.794
ID	0	0	0	100	0	0.14	0.99	0.785
ID.SDLL	0	0	0	100	0	0.14	0.71	120.601
ID $_{\sqrt{3 / 2}}$	0	82	18	0	0	0.22	2.48	1.363

Table 5. Table 5: Distribution of N ^ − N ^ 𝑁 𝑁 \hat{N}-N over 100 simulated data sequences from (NC). Also the average MSE and computational times for each method are given

	•
	$\hat{N} - N$
Method	0	1	2	$\geq 3$	MSE	Time (s)
PELT	100	0	0	0	39 $\times 10^{- 5}$	0.004
NP.PELT	8	1	23	68	999 $\times 10^{- 5}$	1.077
S3IB	100	0	0	0	39 $\times 10^{- 5}$	0.715
CumSeg	100	0	0	0	39 $\times 10^{- 5}$	0.115
CPM. $l .$ 500	0	0	0	100	2957 $\times 10^{- 5}$	0.011
CPM. $l .$ 3000	28	6	39	27	628 $\times 10^{- 5}$	0.031
WBSC1	15	18	20	47	653 $\times 10^{- 5}$	0.149
WBSIC	99	1	0	0	44 $\times 10^{- 5}$	0.149
WBS2	89	5	4	2	82 $\times 10^{- 5}$	0.958
NOT	99	1	0	0	44 $\times 10^{- 5}$	0.089
FDR	96	4	0	0	47 $\times 10^{- 5}$	-
TGUH	100	0	0	0	39 $\times 10^{- 5}$	0.217
ID	100	0	0	0	39 $\times 10^{- 5}$	0.172
ID.SDLL	90	4	0	6	182 $\times 10^{- 5}$	0.069
ID $_{\sqrt{𝟑 / 𝟐}}$	99	0	1	0	41 $\times 10^{- 5}$	0.259

Table 6. Table 6: Distribution of N ^ − N ^ 𝑁 𝑁 \hat{N}-N over 100 simulated data sequences from the continuous piecewise-linear signals (W1), (W3), and (W4). The average MSE, d H subscript 𝑑 𝐻 d_{H} and computational time for each method are also given

		•
		$\hat{N} - N$
Method	Model	$\leq - 3$	$- 2$	$- 1$	0	1	2	$\geq 3$	MSE	$d_{H}$	Time (s)
NOT		0	0	0	99	1	0	0	0.016	0.063	0.343
TF		0	0	0	0	0	0	100	0.029	0.451	1.125
CPOP		0	0	0	99	1	0	0	0.013	0.055	23.190
MARS	(W1)	0	0	2	9	42	39	8	0.034	0.200	0.011
FKS		0	0	0	72	22	6	0	0.015	0.109	270.385
ID		0	0	0	91	9	0	0	0.030	0.104	0.036
ID.SDLL		0	0	0	98	0	1	1	0.033	0.098	0.030
NOT		0	0	27	0	6	18	49	0.035	0.571	0.163
TF		0	0	0	0	0	0	100	606.523	0.432	0.117
CPOP		0	0	0	90	6	2	2	0.010	0.097	0.078
MARS	(W3)	91	0	7	2	0	0	0	3.991	2.258	0.008
FKS		0	0	0	90	9	1	0	0.010	0.097	67.582
ID		0	0	0	99	1	0	0	0.013	0.101	0.017
ID.SDLL		0	0	0	93	4	1	2	0.022	0.130	0.010
NOT		0	1	14	20	16	20	29	0.109	0.998	0.958
TF		0	0	0	0	0	0	100	660.399	0.465	1.349
CPOP	(W4)	0	0	0	92	8	0	0	0.015	0.084	1.627
MARS		100	0	0	0	0	0	0	22.058	1.609	0.019
ID		0	0	0	92	8	0	0	0.038	0.123	0.045
ID.SDLL		0	0	0	92	4	1	3	0.062	0.120	0.025

Table 7. Table 7: Distribution of N ^ − N ^ 𝑁 𝑁 \hat{N}-N over 100 simulated data sequences of the continuous piecewise-linear signal (W2). The average MSE, d H subscript 𝑑 𝐻 d_{H} and computational time for each method are also given

	•
	$\hat{N} - N$
Method	$\leq - 90$	$(- 90, - 1)$	$- 1$	0	$1$	$(1, 60]$	$> 60$	MSE	$d_{H}$	Time (s)
NOT	100	0	0	0	0	0	0	4.731	99	0.869
TF	0	0	0	0	0	0	100	212.547	0.387	0.863
CPOP	0	0	0	97	3	0	0	0.162	0.189	1.161
MARS	100	0	0	0	0	0	0	4.703	98.523	0.009
ID	0	0	0	98	2	0	0	0.201	0.242	0.589
ID.SDLL	0	0	0	98	2	0	0	0.256	0.287	0.097

Table 8. Table 8: ID results for the distribution of N ^ − N ^ 𝑁 𝑁 \hat{N}-N for the models (M2)-(M4) and (W1), over 100 simulations where the distribution of the noise is Student- t d subscript 𝑡 𝑑 t_{d} , for d = 3 , 5 𝑑 3 5 d=3,5 . The average MSE, d H subscript 𝑑 𝐻 d_{H} and computational time are also given

		•
		$\hat{N} - N$
$d$	Model	$\leq - 3$	$- 2$	$- 1$	0	1	2	$\geq 3$	MSE	$d_{H}$	Time (ms)
	(M2)	6	2	2	74	9	5	2	$60 \times 10^{- 3}$	0.86	9.7
5	(M3)	0	0	0	75	16	5	4	$21 \times 10^{- 3}$	0.16	9.2
	(W1)	0	0	0	86	12	2	0	$31 \times 10^{- 3}$	0.23	32.8
	(M2)	7	1	2	52	21	8	9	$71 \times 10^{- 3}$	1.18	8.7
3	(M3)	0	1	0	59	20	13	7	$26 \times 10^{- 3}$	0.22	9.8
	(W1)	0	0	0	62	28	4	6	$32 \times 10^{- 3}$	0.25	22.6

Table 9. Table 1: A comparison on the performance of WID and ID over 10 simulated time series of three different models of length 10 5 superscript 10 5 10^{5} each. The distribution of N ^ − N ^ 𝑁 𝑁 \hat{N}-N , as well as the average MSE, Hausdorff distance and computational time for each method are provided

Method	Model	$\hat{N} - N = 0$	MSE	$d_{H}$	Time (s)
WID		10	$1.65 \times 10^{- 5}$	-	2.39
ID	(D1)	10	$1.65 \times 10^{- 5}$	-	79.40
WID		10	$8.8 \times 10^{- 5}$	$3.8 \times 10^{- 5}$	2.28
ID	(D2)	10	$11 \times 10^{- 5}$	$1.6 \times 10^{- 5}$	13.06
WID		10	$8.9 \times 10^{- 5}$	0	2.17
ID	(D3)	10	$8.9 \times 10^{- 5}$	0	6.34

Table 10. Table 2: Distribution of N ^ − N ^ 𝑁 𝑁 \hat{N}-N over 100 simulated data sequences from the piecewise-constant signal (LS). The average MSE, d H subscript 𝑑 𝐻 d_{H} and computational time are also given

	•
	$\hat{N} - N$
Method	$\leq - 300$	$(- 300, - 100]$	$(- 100, - 10)$	$[- 10, 10]$	$> 10$	MSE	$d_{H}$	Time (s)
PELT	0	100	0	0	0	0.67	1.02	0.024
NP.PELT	100	0	0	0	0	9655.28	113.76	11.302
S3IB	0	13	87	0	0	0.44	1.01	111.143
CumSeg	100	0	0	0	0	87.11	15.08	0.323
CPM. $l .$ 500	0	0	0	100	0	0.19	0.75	0.002
CPM. $l .$ 10000	0	0	74	25	0	0.22	1	0.002
WBSC1	0	0	80	20	0	0.24	1	1.293
WBSIC	0	0	22	78	0	0.22	0.99	1.293
WBS2	0	0	29	58	13	0.22	0.99	1.293
NOT	100	0	0	0	0	9.31	5.27	4.330
FDR	0	0	2	98	0	0.19	0.83	-
TGUH	0	0	100	0	0	0.29	1	0.484
ID	0	0	14	86	0	0.21	1.00	0.460

Table 11. Table 3: Distribution of N ^ − N ^ 𝑁 𝑁 \hat{N}-N over 100 simulated data sequences from the piecewise-constant signal (LT2). The average MSE, d H subscript 𝑑 𝐻 d_{H} and computational time are also given

	•
	$\hat{N} - N$
Method	$\leq - 150$	$(- 150, - 50]$	$(- 50, - 10)$	$[- 10, 10]$	$> 10$	MSE	$d_{H}$	Time (s)
PELT	100	0	0	0	0	0.55	130.64	0.016
NP.PELT	0	12	88	0	0	0.21	4.95	0.496
S3IB	0	1	54	45	0	0.13	2.58	33.410
CumSeg	100	0	0	0	0	0.56	249	0.428
CPM. $l .$ 500	0	0	0	9	91	0.12	0.59	0.008
CPM. $l .$ 10000	0	0	0	100	0	0.12	1.21	0.008
WBSC1	0	84	16	0	0	0.27	4.22	0.631
WBSIC	7	0	0	88	5	0.16	15.92	1.051
NOT	100	0	0	0	0	0.56	240.08	0.610
FDR	0	0	0	99	1	0.11	0.72	-
TGUH	0	0	2	98	0	0.14	1.43	0.580
ID	0	0	0	100	0	0.11	0.76	0.139

Table 12. Table 4: Distribution of N ^ − N ^ 𝑁 𝑁 \hat{N}-N over 100 simulated time series from the signal (ELT). Also the average MSE and computational times for each method are given

	•
	$\hat{N} - N$
Method	$\leq - 17000$	$(- 17000, - 10)$	$[- 10, 10]$	MSE	Time (s)
PELT	100	0	0	0.94	0.086
NP.PELT	100	0	0	1	136.154
CumSeg	100	0	0	1	4.831
CPM. $l .$ 500	100	0	0	1	59.936
WBSC1	100	0	0	0.92	6.346
NOT	100	0	0	1	2.873
TGUH	0	0	100	0.02	4.751
ID	0	10	90	0.02	3.693

Table 13. Table 5: Distribution of N ^ − N ^ 𝑁 𝑁 \hat{N}-N over 100 simulated time series from (NC2). Also the average MSE and computational times for each method are given

	•
	$\hat{N} - N$
Method	0	1	2	$\geq 3$	MSE	Time (ms)
PELT	100	0	0	0	28 $\times 10^{- 4}$	12.3
NP.PELT	56	13	23	8	269 $\times 10^{- 4}$	30.4
S3IB	95	3	2	0	57 $\times 10^{- 4}$	35.7
CumSeg	100	0	0	0	28 $\times 10^{- 4}$	19.3
CPM. $l .$ 500	54	11	15	20	294 $\times 10^{- 4}$	1.5
WBSC1	17	16	19	48	578 $\times 10^{- 4}$	62.7
WBSIC	95	3	1	1	59 $\times 10^{- 4}$	61.3
NOT	99	0	0	1	39 $\times 10^{- 4}$	37.6
FDR	90	7	2	1	66 $\times 10^{- 4}$	-
TGUH	83	0	12	5	147 $\times 10^{- 4}$	49.6
ID	95	4	1	0	60 $\times 10^{- 4}$	1.1

Table 14. Table 6: Distribution of N ^ − N ^ 𝑁 𝑁 \hat{N}-N over 100 simulated time series of the signal (SW1). Also, the average MSE, Hausdorff distance and computational time for each method are given

	•
	$\hat{N} - N$
Method	$\leq - 110$	$(- 110, - 1]$	0	$1$	$(1, 90)$	$\geq 90$	MSE	$d_{H}$	Time (s)
NOT	100	0	0	0	0	0	52.352	119	6.061
TF	0	0	0	0	0	100	107.463	0.381	2.309
CPOP	0	0	96	4	0	0	1.063	0.146	3.327
ID	0	0	90	10	0	0	1.781	0.254	0.041

Table 15. Table 7: Distribution of N ^ − N ^ 𝑁 𝑁 \hat{N}-N over 100 simulated time series of the signal (SW2). Also, the average MSE, Hausdorff distance and computational time for each method are given

	•
	$\hat{N} - N$
Method	$\leq - 10$	$(- 10, - 1)$	$- 1$	0	1	2	$(2, 10)$	$\geq 10$	MSE	$d_{H}$	Time (s)
NOT	0	5	1	1	1	3	25	64	0.24	0.95	0.412
TF	0	0	0	0	0	0	0	100	0.76	0.33	1.417
CPOP	0	0	0	97	3	0	0	0	0.05	0.17	8.728
ID	0	0	0	97	2	1	0	0	0.07	0.27	0.049

Table 16. Table 8: Distribution of N ^ − N ^ 𝑁 𝑁 \hat{N}-N over 100 simulated data sequences of the continuous piecewise-linear signal (SW3). The average MSE, d H subscript 𝑑 𝐻 d_{H} and computational time for each method are also given

	•
	$\hat{N} - N$
Method	$\leq - 100$	$(- 100, - 1)$	$- 1$	0	$1$	$(1, 10]$	$> 10$	MSE	$d_{H}$	Time (s)
NOT	100	0	0	0	0	0	0	1.063	119	0.485
TF	0	0	0	0	0	0	100	217868.2	0.324	0.632
CPOP	0	0	0	98	2	0	0	0.027	0.154	0.438
ID	0	0	0	100	0	0	0	0.039	0.210	0.055

Table 17. Table 9: Distribution of N ^ − N ^ 𝑁 𝑁 \hat{N}-N over 100 simulated data sequences from (NC). The average MSE and computational times are also given

	•
	$\hat{N} - N$
Method	0	1	2	$\geq 3$	MSE	Time (s)
ID $_{λ_{T} = 5}$	100	0	0	0	31 $\times 10^{- 5}$	0.116
ID $_{λ_{T} = 20}$	100	0	0	0	31 $\times 10^{- 5}$	0.033
ID $_{λ_{T} = 80}$	100	0	0	0	31 $\times 10^{- 5}$	0.012

Table 18. Table 10: Distribution of N ^ − N ^ 𝑁 𝑁 \hat{N}-N over 100 simulated data sequences of the piecewise-constant signals (M1)-(M3). The average MSE, d H subscript 𝑑 𝐻 d_{H} and computational time are also given

		•
		$\hat{N} - N$
Method	Model	$\leq - 3$	$- 2$	$- 1$	0	1	2	$\geq 3$	MSE	$d_{H}$	Time (ms)
ID $_{λ_{T} = 5}$		0	3	40	56	1	0	0	2.49	0.06	26.8
ID $_{λ_{T} = 20}$	(M1)	0	3	40	57	0	0	0	2.44	0.06	17.4
ID $_{λ_{T} = 80}$		2	8	47	43	0	0	0	2.92	0.08	11.6
ID $_{λ_{T} = 5}$		7	8	3	72	9	1	0	67 $\times 10^{- 3}$	0.88	10.9
ID $_{λ_{T} = 20}$	(M2)	83	7	8	4	0	0	0	185 $\times 10^{- 3}$	6.80	8.1
ID $_{λ_{T} = 80}$		91	2	0	7	0	0	0	212 $\times 10^{- 3}$	8.75	5.7
ID $_{λ_{T} = 5}$		0	0	0	82	14	4	0	23 $\times 10^{- 3}$	0.16	9.9
ID $_{λ_{T} = 20}$	(M3)	20	38	30	10	2	0	0	86 $\times 10^{- 3}$	0.82	9.2
ID $_{λ_{T} = 80}$		100	0	0	0	0	0	0	784 $\times 10^{- 3}$	2.73	5

Table 19. Table 11: Distribution of N ^ − N ^ 𝑁 𝑁 \hat{N}-N over 100 simulated data sequences from the piecewise-constant signal (M4). The average MSE, d H subscript 𝑑 𝐻 d_{H} and computational time are also given

	•
	$\hat{N} - N$
Method	$- 2$	$- 1$	$0$	$1$	$\geq 2$	MSE	$d_{H}$	Time (ms)
ID $_{λ_{T} = 5}$	12	0	87	1	0	7 $\times 10^{- 3}$	0.13	31.5
ID $_{λ_{T} = 20}$	11	0	89	0	0	6 $\times 10^{- 3}$	0.11	11.7
ID $_{λ_{T} = 80}$	28	0	72	0	0	11 $\times 10^{- 3}$	0.29	5.8

Table 20. Table 12: Distribution of N ^ − N ^ 𝑁 𝑁 \hat{N}-N over 100 simulated data sequences from the piecewise-constant signal (M5). The average MSE, d H subscript 𝑑 𝐻 d_{H} and computational time are also given

	•
	$\hat{N} - N$
Method	$\leq - 500$	$(- 500, - 50]$	$(- 50, - 10)$	$[- 10, 10]$	$> 10$	MSE	$d_{H}$	Time (s)
ID $_{λ_{T} = 5}$	0	0	0	100	0	0.14	0.99	0.772
ID $_{λ_{T} = 20}$	100	0	0	0	0	1.28	3.45	0.395
ID $_{λ_{T} = 80}$	100	0	0	0	0	1.53	10.34	0.308

Table 21. Table 13: Distribution of N ^ − N ^ 𝑁 𝑁 \hat{N}-N over 100 simulated data sequences from the continuous piecewise-linear signal (W1). The average MSE, d H subscript 𝑑 𝐻 d_{H} and computational time for each method are also given

	•
	$\hat{N} - N$
Method	$\leq - 3$	$- 2$	$- 1$	0	1	2	$\geq 3$	MSE	$d_{H}$	Time (s)
ID $_{λ_{T} = 5}$	0	0	0	91	9	0	0	0.031	0.104	0.036
ID $_{λ_{T} = 20}$	0	0	0	92	8	0	0	0.027	0.099	0.020
ID $_{λ_{T} = 80}$	0	0	0	99	1	0	0	0.020	0.067	0.017

Table 22. Table 14: Distribution of N ^ − N ^ 𝑁 𝑁 \hat{N}-N over 100 simulated data sequences of the continuous piecewise-linear signal (W2). The average MSE, d H subscript 𝑑 𝐻 d_{H} and computational time for each method are also given

	•
	$\hat{N} - N$
Method	$\leq - 90$	$(- 90, - 1)$	$- 1$	0	$1$	$(1, 60]$	$> 60$	MSE	$d_{H}$	Time (s)
ID $_{λ_{T} = 5}$	0	0	0	97	3	0	0	0.227	0.272	0.690
ID $_{λ_{T} = 20}$	0	0	0	100	0	0	0	0.364	0.328	0.722
ID $_{λ_{T} = 80}$	100	0	0	0	0	0	0	4.730	98.955	0.102

Table 23. Table 15: Distribution of N ^ − N ^ 𝑁 𝑁 \hat{N}-N over 100 simulated time series of the continuous piecewise-linear signals (W3) and (W4). The average MSE, d H subscript 𝑑 𝐻 d_{H} and computational time are also given

		•
		$\hat{N} - N$
Method	Model	$\leq - 3$	$- 2$	$- 1$	0	1	2	$\geq 3$	MSE	$d_{H}$	Time (s)
ID $_{λ_{T} = 5}$		0	0	0	94	6	0	0	0.020	0.122	0.014
ID $_{λ_{T} = 20}$	(W3)	0	0	0	99	1	0	0	0.009	0.067	0.012
ID $_{λ_{T} = 80}$		100	0	0	0	0	0	0	3.917	1.875	0.009
ID $_{λ_{T} = 5}$		0	0	0	77	22	1	0	0.055	0.134	0.040
ID $_{λ_{T} = 20}$	(W4)	0	0	0	98	2	0	0	0.029	0.106	0.032
ID $_{λ_{T} = 80}$		0	36	50	14	0	0	0	1.603	0.950	0.030

Equations258

X_{t} = f_{t} + σ ϵ_{t}, t = 1, 2, \dots, T,

X_{t} = f_{t} + σ ϵ_{t}, t = 1, 2, \dots, T,

R_{s, e} = [c_{k_{1}}^{r}, c_{k_{1} + 1}^{r}, \dots, e], L_{s, e} = [c_{k_{2}}^{l}, c_{k_{2} + 1}^{l}, \dots, s],

R_{s, e} = [c_{k_{1}}^{r}, c_{k_{1} + 1}^{r}, \dots, e], L_{s, e} = [c_{k_{2}}^{l}, c_{k_{2} + 1}^{l}, \dots, s],

X_{t} = f_{t} + ϵ_{t}, t = 1, 2, \dots, T .

X_{t} = f_{t} + ϵ_{t}, t = 1, 2, \dots, T .

\tilde{X}_{s, e}^{b} = \frac{e - b}{n ( b - s + 1 )} t = s \sum b X_{t} - \frac{b - s + 1}{n ( e - b )} t = b + 1 \sum e X_{t},

\tilde{X}_{s, e}^{b} = \frac{e - b}{n ( b - s + 1 )} t = s \sum b X_{t} - \frac{b - s + 1}{n ( e - b )} t = b + 1 \sum e X_{t},

P (\hat{N} = N, j = 1, 2, \dots, N max (∣ \overset{r}{^}_{j} - r_{j} ∣ (Δ_{j}^{f})^{2}) \leq C_{3} lo g T) \geq 1 - \frac{C _{4}}{T} .

P (\hat{N} = N, j = 1, 2, \dots, N max (∣ \overset{r}{^}_{j} - r_{j} ∣ (Δ_{j}^{f})^{2}) \leq C_{3} lo g T) \geq 1 - \frac{C _{4}}{T} .

ζ_{T} = C 2 lo g T .

ζ_{T} = C 2 lo g T .

ϕ_{s, e}^{b} (t) = ⎩ ⎨ ⎧ α_{s, e}^{b} β_{s, e}^{b} [(e + 2 b - 3 s + 2) t - (b e + b s - 2 s^{2} + 2 s)], - \frac{α _{s, e}^{b}}{β _{s, e}^{b}} [(3 e - 2 b - s + 2) t - (2 e^{2} + 2 e - b e - b s)], 0, t = s, \dots, b, t = b + 1, \dots, e, otherwise,

ϕ_{s, e}^{b} (t) = ⎩ ⎨ ⎧ α_{s, e}^{b} β_{s, e}^{b} [(e + 2 b - 3 s + 2) t - (b e + b s - 2 s^{2} + 2 s)], - \frac{α _{s, e}^{b}}{β _{s, e}^{b}} [(3 e - 2 b - s + 2) t - (2 e^{2} + 2 e - b e - b s)], 0, t = s, \dots, b, t = b + 1, \dots, e, otherwise,

P (\hat{N} = N, j = 1, 2, \dots, N max (∣ \overset{r}{^}_{j} - r_{j} ∣ (Δ_{j}^{f})^{2/3}) \leq C_{3} (lo g T)^{1/3}) \geq 1 - \frac{C _{4}}{T} .

P (\hat{N} = N, j = 1, 2, \dots, N max (∣ \overset{r}{^}_{j} - r_{j} ∣ (Δ_{j}^{f})^{2/3}) \leq C_{3} (lo g T)^{1/3}) \geq 1 - \frac{C _{4}}{T} .

ζ_{T} = \tilde{C} 2 lo g T,

ζ_{T} = \tilde{C} 2 lo g T,

b = (b_{1}, b_{2}, \dots, b_{J}),

b = (b_{1}, b_{2}, \dots, b_{J}),

sSIC (j) = - 2 k = 1 \sum j + 1 ℓ (X_{\tilde{b}_{k - 1} + 1}, \dots, X_{\tilde{b}_{k}}; \hat{θ}_{k}) + n_{j} (lo g T)^{α},

sSIC (j) = - 2 k = 1 \sum j + 1 ℓ (X_{\tilde{b}_{k - 1} + 1}, \dots, X_{\tilde{b}_{k}}; \hat{θ}_{k}) + n_{j} (lo g T)^{α},

P (\hat{N} = N, j = 1, 2, \dots, N max (∣ \overset{r}{^}_{j} - r_{j} ∣ (Δ_{j}^{f})^{2}) \leq C_{1} (lo g T)^{α}) \geq 1 - \frac{C _{2}}{T} .

P (\hat{N} = N, j = 1, 2, \dots, N max (∣ \overset{r}{^}_{j} - r_{j} ∣ (Δ_{j}^{f})^{2}) \leq C_{1} (lo g T)^{α}) \geq 1 - \frac{C _{2}}{T} .

P (\hat{N} = N, j = 1, 2, \dots, N max (∣ \overset{r}{^}_{j} - r_{j} ∣ (Δ_{j}^{f})^{2/3}) \leq C_{1} (lo g T)^{α /3}) \geq 1 - \frac{C _{2}}{T} .

P (\hat{N} = N, j = 1, 2, \dots, N max (∣ \overset{r}{^}_{j} - r_{j} ∣ (Δ_{j}^{f})^{2/3}) \leq C_{1} (lo g T)^{α /3}) \geq 1 - \frac{C _{2}}{T} .

\tilde{y}_{s, e}^{b} = \frac{e - b}{n ( b - s + 1 )} t = s \sum b y_{t} - \frac{b - s + 1}{n ( e - b )} t = b + 1 \sum e y_{t}; \tilde{\tilde{y}}_{s, e} = \frac{\sum _{t = s}^{e} y _{t}}{( e - s + 1 ) ^{1/2}},

\tilde{y}_{s, e}^{b} = \frac{e - b}{n ( b - s + 1 )} t = s \sum b y_{t} - \frac{b - s + 1}{n ( e - b )} t = b + 1 \sum e y_{t}; \tilde{\tilde{y}}_{s, e} = \frac{\sum _{t = s}^{e} y _{t}}{( e - s + 1 ) ^{1/2}},

\forall γ > 0, P (s, b, e min \tilde{\tilde{ϵ}}_{s, b} \tilde{\tilde{ϵ}}_{b + 1, e} < - γ) \leq P (s, b, e max \tilde{\tilde{ϵ}}_{s, b} \tilde{\tilde{ϵ}}_{b + 1, e} > γ)

\forall γ > 0, P (s, b, e min \tilde{\tilde{ϵ}}_{s, b} \tilde{\tilde{ϵ}}_{b + 1, e} < - γ) \leq P (s, b, e max \tilde{\tilde{ϵ}}_{s, b} \tilde{\tilde{ϵ}}_{b + 1, e} > γ)

P (\exists_{s, a, e} (\tilde{ϵ}_{s, e}^{b})^{2} > 3 σ^{2} (1 + δ) lo g T) T \to \infty 0.

P (\exists_{s, a, e} (\tilde{ϵ}_{s, e}^{b})^{2} > 3 σ^{2} (1 + δ) lo g T) T \to \infty 0.

d_{H} = n_{s}^{- 1} max {j max k min ∣ r_{j} - \overset{r}{^}_{k} ∣, k max j min ∣ r_{j} - \overset{r}{^}_{k} ∣},

d_{H} = n_{s}^{- 1} max {j max k min ∣ r_{j} - \overset{r}{^}_{k} ∣, k max j min ∣ r_{j} - \overset{r}{^}_{k} ∣},

ψ_{s, e}^{b} (t) = ⎩ ⎨ ⎧ \frac{e - b}{n ( b - s + 1 )}, - \frac{b - s + 1}{n ( e - b )}, 0, t = s, s + 1, \dots, b, t = b + 1, b + 2, \dots, e, otherwise .,

ψ_{s, e}^{b} (t) = ⎩ ⎨ ⎧ \frac{e - b}{n ( b - s + 1 )}, - \frac{b - s + 1}{n ( e - b )}, 0, t = s, s + 1, \dots, b, t = b + 1, b + 2, \dots, e, otherwise .,

∥ ψ_{s, e}^{b} ⟨ f, ψ_{s, e}^{b} ⟩ - ψ_{s, e}^{r_{j}} ⟨ f, ψ_{s, e}^{r_{j}} ⟩ ∥_{2}^{2} = (\tilde{f}_{s, e}^{r_{j}})^{2} - (\tilde{f}_{s, e}^{b})^{2} .

∥ ψ_{s, e}^{b} ⟨ f, ψ_{s, e}^{b} ⟩ - ψ_{s, e}^{r_{j}} ⟨ f, ψ_{s, e}^{r_{j}} ⟩ ∥_{2}^{2} = (\tilde{f}_{s, e}^{r_{j}})^{2} - (\tilde{f}_{s, e}^{b})^{2} .

P (\hat{N} = N, j = 1, 2, \dots, N max (∣ \overset{r}{^}_{j} - r_{j} ∣ (Δ_{j}^{f})^{2}) \leq C_{3} lo g T) \geq 1 - \frac{1}{6 π T},

P (\hat{N} = N, j = 1, 2, \dots, N max (∣ \overset{r}{^}_{j} - r_{j} ∣ (Δ_{j}^{f})^{2}) \leq C_{3} lo g T) \geq 1 - \frac{1}{6 π T},

A_{T} = {s, b, e : 1 \leq s \leq b < e \leq T max \tilde{X}_{s, e}^{b} - \tilde{f}_{s, e}^{b} \leq 8 lo g T} .

A_{T} = {s, b, e : 1 \leq s \leq b < e \leq T max \tilde{X}_{s, e}^{b} - \tilde{f}_{s, e}^{b} \leq 8 lo g T} .

P ((A_{T})^{c}) = P (s, b, e : 1 \leq s \leq b < e \leq T max \tilde{X}_{s, e}^{b} - \tilde{f}_{s, e}^{b} > 8 lo g T)

P ((A_{T})^{c}) = P (s, b, e : 1 \leq s \leq b < e \leq T max \tilde{X}_{s, e}^{b} - \tilde{f}_{s, e}^{b} > 8 lo g T)

\leq 1 \leq s \leq b < e \leq T \sum P (\tilde{ϵ}_{s, e}^{b} > 8 lo g T) \leq \frac{T ^{3}}{6} P (∣ Z ∣ > 8 lo g T)

= \frac{T ^{3}}{3} P (Z > 8 lo g T) \leq \frac{T ^{3}}{3} \frac{ϕ ( 8 lo g T )}{8 lo g T} \leq \frac{1}{12 π T},

B_{T} = ⎩ ⎨ ⎧ j = 1, 2, \dots, N max r_{j - 1} < s \leq r_{j} r_{j} < e \leq r_{j + 1} s \leq b < e max \frac{⟨ ψ _{s, e}^{b} ⟨ f , ψ _{s, e}^{b} ⟩ - ψ _{s, e}^{r_{j}} ⟨ f , ψ _{s, e}^{r_{j}} ⟩ , ϵ ⟩}{∥ ψ _{s, e}^{b} ⟨ f , ψ _{s, e}^{b} ⟩ - ψ _{s, e}^{r_{j}} ⟨ f , ψ _{s, e}^{r_{j}} ⟩ ∥ _{2}} \leq 8 lo g T ⎭ ⎬ ⎫ .

B_{T} = ⎩ ⎨ ⎧ j = 1, 2, \dots, N max r_{j - 1} < s \leq r_{j} r_{j} < e \leq r_{j + 1} s \leq b < e max \frac{⟨ ψ _{s, e}^{b} ⟨ f , ψ _{s, e}^{b} ⟩ - ψ _{s, e}^{r_{j}} ⟨ f , ψ _{s, e}^{r_{j}} ⟩ , ϵ ⟩}{∥ ψ _{s, e}^{b} ⟨ f , ψ _{s, e}^{b} ⟩ - ψ _{s, e}^{r_{j}} ⟨ f , ψ _{s, e}^{r_{j}} ⟩ ∥ _{2}} \leq 8 lo g T ⎭ ⎬ ⎫ .

P (A_{T} \cap B_{T}) \geq 1 - \frac{1}{6 π T} .

P (A_{T} \cap B_{T}) \geq 1 - \frac{1}{6 π T} .

C_{1} = C_{3} + 8, C_{2} = \frac{1}{6} - \frac{2 2}{C}, C_{3} = 2 (22 + 4)^{2},

C_{1} = C_{3} + 8, C_{2} = \frac{1}{6} - \frac{2 2}{C}, C_{3} = 2 (22 + 4)^{2},

I_{j}^{R} = [r_{j} + \frac{δ _{T}}{3}, r_{j} + 2 \frac{δ _{T}}{3}), I_{j}^{L} = (r_{j} - 2 \frac{δ _{T}}{3}, r_{j} - \frac{δ _{T}}{3}] .

I_{j}^{R} = [r_{j} + \frac{δ _{T}}{3}, r_{j} + 2 \frac{δ _{T}}{3}), I_{j}^{L} = (r_{j} - 2 \frac{δ _{T}}{3}, r_{j} - \frac{δ _{T}}{3}] .

\tilde{X}_{1, c_{k}^{r}}^{\tilde{b}} \geq \tilde{X}_{1, c_{k}^{r}}^{r_{1}} \geq \tilde{f}_{1, c_{k}^{r}}^{r_{1}} - 8 lo g T .

\tilde{X}_{1, c_{k}^{r}}^{\tilde{b}} \geq \tilde{X}_{1, c_{k}^{r}}^{r_{1}} \geq \tilde{f}_{1, c_{k}^{r}}^{r_{1}} - 8 lo g T .

\tilde{f}_{1, c_{k}^{r}}^{r_{1}} = \frac{c _{k}^{r} - r _{1}}{r _{1} c _{k}^{r}} r_{1} f_{r_{1}} - \frac{r _{1}}{c _{k}^{r} ( c _{k}^{r} - r _{1} )} (c_{k}^{r} - r_{1}) f_{r_{1} + 1} = \frac{( c _{k}^{r} - r _{1} ) r _{1}}{c _{k}^{r}} Δ_{1}^{f}

\tilde{f}_{1, c_{k}^{r}}^{r_{1}} = \frac{c _{k}^{r} - r _{1}}{r _{1} c _{k}^{r}} r_{1} f_{r_{1}} - \frac{r _{1}}{c _{k}^{r} ( c _{k}^{r} - r _{1} )} (c_{k}^{r} - r_{1}) f_{r_{1} + 1} = \frac{( c _{k}^{r} - r _{1} ) r _{1}}{c _{k}^{r}} Δ_{1}^{f}

= \frac{( c _{k}^{r} - r _{1} ) r _{1}}{( c _{k}^{r} - r _{1} ) + r _{1}} Δ_{1}^{f} \geq \frac{( c _{k}^{r} - r _{1} ) r _{1}}{2 max { c _{k}^{r} - r _{1} , r _{1} }} Δ_{1}^{f} = \frac{min { c _{k}^{r} - r _{1} , r _{1} }}{2} Δ_{1}^{f} .

min {c_{k}^{r} - r_{1}, r_{1}} \geq \frac{δ _{T}}{3} .

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMetabolomics and Mass Spectrometry Studies · Statistical Methods and Inference · Fault Detection and Control Systems

Full text

∎

11institutetext: A. Anastasiou 22institutetext: Department of Mathematics and Statistics, University of Cyprus, P.O. Box: 20537, 1678, Nicosia, Cyprus.

22email: [email protected] 33institutetext: P. Fryzlewicz 44institutetext: Department of Statistics, The London School of Economics and Political Science, Columbia House, Houghton Street, London, WC2A 2AE.

Detecting multiple generalized change-points by isolating single ones

Andreas Anastasiou111On behalf of all authors, the corresponding author states that there is no conflict of interest.

Piotr Fryzlewicz222Piotr Fryzlewicz’s work was supported by the Engineering and Physical Sciences Research Council grant No. EP/L014246/1.

(Received: date / Accepted: date)

Abstract

We introduce a new approach, called Isolate-Detect (ID), for the consistent estimation of the number and location of multiple generalized change-points in noisy data sequences. Examples of signal changes that ID can deal with are changes in the mean of a piecewise-constant signal and changes, continuous or not, in the linear trend. The number of change-points can increase with the sample size. Our method is based on an isolation technique, which prevents the consideration of intervals that contain more than one change-point. This isolation enhances ID’s accuracy as it allows for detection in the presence of frequent changes of possibly small magnitudes. In ID, model selection is carried out via thresholding, or an information criterion, or SDLL, or a hybrid involving the former two. The hybrid model selection leads to a general method with very good practical performance and minimal parameter choice. In the scenarios tested, ID is at least as accurate as the state-of-the-art methods; most of the times it outperforms them. ID is implemented in the R packages IDetect and breakfast, available from CRAN.

Keywords:

Segmentation symmetric interval expansion threshold criterion Schwarz information criterion SDLL

1 Introduction

Change-point detection is an active area of statistical research that has attracted a lot of interest in recent years. Our work’s focus is on a posteriori change-point detection, where the aim is to estimate the number and locations of certain changes in the behaviour of the data. We work in the model

[TABLE]

where $X_{t}$ are the observed data and $f_{t}$ is a one-dimensional, deterministic signal with structural changes at certain points. Two examples are: change-points in the level when $f_{t}$ is seen as piecewise-constant, and change-points in the first derivative when $f_{t}$ is piecewise-linear. We highlight, however, that our methodology and analysis apply to more general scenarios, for instance the detection of knots in a piecewise polynomial signal of order $k$ , where $k$ is not necessarily equal to zero (piecewise-constant mean) or one (piecewise-linear mean). The number $N$ of change-points as well as their locations $r_{1},r_{2},\ldots,r_{N}$ are unknown and our aim is to estimate them. In addition, $N$ can grow with $T$ . The random variables $\epsilon_{t}$ in (1) have mean zero and variance one; further assumptions will be given in Section 3.2.

When $f_{t}$ is assumed to be piecewise-constant, the existing change-point detection techniques are mainly split into two categories based on whether the change-points are detected all at once or one at a time. The former category mainly includes optimization-based methods, in which the estimated signal is chosen based on its least squares or log-likelihood fit to the data, penalized by a complexity rule in order to avoid overfitting. The most common example of a penalty function is the Schwarz Information Criterion (SIC); see Yao (1988) for details. To solve the implied penalization problem, dynamic programming approaches, such as the Segment Neighborhood (SN) and Optimal Partitioning (OP) methods of Auger and Lawrence (1989) and Jackson et al. (2005), have been developed. In an attempt to improve on OP’s computational cost, Killick et al. (2012) introduce the PELT method, based on a pruning step applied to OP’s dynamic programming approach. A non-parametric adaptation of PELT is given in Haynes et al. (2017). Rigaill (2015) introduces an improvement over classical SN algorithms, through a pruning approach called PDPa, while Maidstone et al. (2017) give two algorithms by combining ideas from PELT and PDPa. Frick et al. (2014) propose the simultaneous multiscale change-point estimator (SMUCE) for the change-point problem in the case of exponential family regression; solving an optimization problem is also required. The FDRSeg method of Li et al. (2016) is a combination of False Discovery Rate (FDR) control and global segmentation methods in a multiscale way; the change-points are again detected all at once.

In the latter category, in which change-points are detected one at a time, a popular method is binary segmentation, which performs an iterative binary splitting of the data on intervals determined by the previously obtained splits. Vostrikova (1981) introduces and proves the validity of binary segmentation in the setting of change-point detection for piecewise-constant signals. The main advantages of binary segmentation are its conceptual simplicity and low computational cost. However, at each step of the algorithm, binary segmentation looks for a single change-point, which leads to its suboptimality in terms of accuracy, especially for signals with frequent change-points. Some variants of binary segmentation that work towards solving this issue are the Circular Binary Segmentation (CBS) of Olshen et al. (2004), the Wild Binary Segmentation (WBS) of Fryzlewicz (2014) as well as its second version (WBS2) of Fryzlewicz (2020), the Narrowest-Over-Threshold (NOT) method of Baranowski et al. (2019), and the Seeded Binary Segmentation (SeedBS) of Kovács et al. (2020). CBS searches for at most two change-points at each step. Instead of initially calculating the contrast value for the whole data sequence, WBS and NOT are based on a random draw of subintervals of the domain of the data, on which an appropriate statistic is tested against a threshold. The draw of all the subintervals takes place at the beginning of the algorithm. In contrast, WBS2 draws first only a small number, $\tilde{M}$ , of data subsamples. It then uses the first change-point candidate to split the data into two parts, and again recursively draws the same number $\tilde{M}$ of subsamples to the left and to the right of this change-point candidate, and so on. A major difference between WBS and WBS2 is that the latter adaptively decides where to recursively draw the next subsamples, based on the change-point candidates detected so far; this adds to the detection power of the method. SeedBS is an approach, similar to WBS and NOT, that relies instead on a deterministic construction of background intervals in which single change points are searched. Apart from binary-segmentation-related approaches, the category in which the change-points are detected one at a time also includes methods that control the False Discovery Rate. For instance, the “pseudo-sequential” (PS) procedure of Venkatraman (1992), as well as the CPM method of Ross (2015) are based on an adaptation of online detection algorithms to a posteriori situations and work by bounding the Type I error rate of falsely detecting change-points. Some methods do not fall in either category. For example, the tail-greedy algorithm in Fryzlewicz (2018) achieves a multiscale decomposition of the data using Unbalanced Haar wavelets in an agglomerative way. In addition, Eichinger and Kirch (2018) use moving sum (MOSUM) statistics in order to detect multiple change-points. For a more thorough review of the literature on the detection of multiple change-points in the mean of univariate data sequences, see Cho and Kirch (2020) and Yu (2020). Truong et al. (2020) also present a survey of various a posteriori change-point detection algorithms; the focus is, however, on multivariate time series.

Beyond the piecewise-constant signal model, existing methods mainly minimize the residual sum of squares taking into account a penalty, with the most common being the SIC. This is used in Bai and Perron (1998), in the trend filtering (TF) approach (Kim et al., 2009; Tibshirani, 2014), and in the dynamic programming algorithm CPOP (Maidstone et al., 2019). Friedman (1991) introduces the Multivariate Adaptive Regression Splines (MARS) method for regression analysis based on splines with the number and the location of the knots being determined by the data. Spiriti et al. (2013) propose two methods for optimizing knot locations in spline smoothing, where either the number of knots is fixed or an upper bound for it needs to be given. The NOT approach (Baranowski et al., 2019) detects change-points one at a time in various scenarios including piecewise-linear mean signals.

In general, change-point detection becomes easier in situations where there is at most one change-point to be detected in a given interval; in such cases the detection power of the contrast function (more details are in Section 3.2) is maximised. Therefore, it makes sense to decouple the multiple change-point detection problem into many single change-point detections. To achieve this, we propose a generic technique, Isolate-Detect (ID), for generalized change-point detection in various different structures, such as piecewise-constant or piecewise-linear signals. The concept behind ID is simple and is split into two stages; firstly, the isolation of each of the true change-points within subintervals of the domain $[1,2,\ldots,T]$ , and secondly their detection. From now on, the terms subinterval and interval will be used interchangeably. Although a detailed explanation of our methodology is provided in Section 3.1, the basic idea is that for an observed data sequence of length $T$ and with $\lambda_{T}$ a positive constant, ID first creates two ordered sets of $K=\left\lceil T/\lambda_{T}\right\rceil$ right- and left-expanding intervals as follows. The $j^{th}$ right-expanding interval is $R_{j}=[1,\min\left\{j\lambda_{T},T\right\}]$ , while the $j^{th}$ left-expanding interval is $L_{j}=[\max\left\{1,T-j\lambda_{T}+1\right\},T]$ . We collect these intervals in the ordered set $S_{RL}=\left\{R_{1},L_{1},R_{2},L_{2},\ldots,R_{K},L_{K}\right\}$ . For a suitably chosen contrast function (more details are in Section 3.2), ID identifies the point with the maximum contrast value in $R_{1}$ . If its value exceeds a threshold, denoted by $\zeta_{T}$ , then it is taken as a change-point. If not, then the next interval in $S_{RL}$ is tested. Upon detection, ID makes a new start from the end-point (or start-point) of the right- (or left-) expanding interval where the detection occurred. Upon correct choice of $\zeta_{T}$ , ID ensures that we work on intervals with at most one change-point, which was our aim.

We would like to highlight the importance of the change-point isolation aspect present in our method as explained in the previous paragraph. There are various advantages. First, it enables detection in higher-order polynomial signals. Second, it is carried out in a fixed and systematic way, which eliminates any randomness in the selection of the intervals and, by extension, in the final results. Third, the way the isolation is carried out in ID makes it quicker than other localisation-focused algorithms, such as NOT, due to the fact that it needs to work on fewer intervals; more details on this advantage of our proposed methodology are in Section 4.1. We note here that, even though the default methodology described in Fryzlewicz (2014) and Baranowski et al. (2019) is based on the construction of random intervals, the same approaches can be applied to a fixed grid of intervals. However, as noted in Kovács et al. (2020), the latter implementation can be quite slow. Fourth, the pseudo-sequential nature of the attempted isolation, makes our proposed methodology suitable for online change-point detection. This is one of the various different ways that ID is different from existing techniques in the literature which also attempt change-point isolation; a more thorough comparison with seemingly similar, but still different, methods is given in the next section.

The paper is organized as follows. Section 2 is a motivating illustration of our proposed method through examples. Section 3 gives a formal explanation of the ID methodology along with two different scenarios of use and the associated theory. In Section 4, we first discuss the computational aspects of ID and the choice of parameter values. ID variants which lead to improved practical performance are also explained. In Section 5, we provide a thorough simulation study to compare ID with state-of-the-art methods. Real-life data examples are provided in Section 6. The paper is concluded with reflections on the proposed method. The theoretical, as well as practical, merits and weaknesses of ID when compared against state-of-the-art methods are discussed throughout the paper. However, for the sake of clarity these are also brought together in Section 7. ID is implemented in the R packages IDetect and breakfast, available from CRAN.

2 Motivating illustration of Isolate-Detect

The fact that each change-point is sequentially detected using an interval that contains no other change-points leads to high detection power, especially in difficult structures, such as limited spacings between consecutive change-points and/or higher-order piecewise-polynomial signals. Two examples follow in order to make clear the importance of the isolation step and to illustrate the power of ID compared to other change-point detection methods (some of those also attempt localisation) in capturing even small movements in the data that are close to each other. Table 1 provides results on 100 replications of the continuous piecewise-linear signal (S1) and the piecewise-constant signal (S2), where

(S1)

$T=5200$ , with 21 change-points in the slope at locations $100,140,\cdots,900$ . The standard deviation is $\sigma=0.25$ ;

(S2)

$T=5200$ , with 21 change-points in the mean at locations $100,105,\cdots,200$ . The standard deviation is $\sigma=0.1$ .

As a measure of the accuracy of the estimated number we give $\hat{N}-N$ , while as a measure of the accuracy of the detected locations, we give Monte-Carlo estimates of the mean square error, ${\rm MSE}=T^{-1}\sum_{t=1}^{T}{\mathbb{E}}\left(\hat{f}_{t}-f_{t}\right)^{2}$ . The methods compared are ID, NOT, and MARS for (S1) and ID, WBS, NOT, and PELT for (S2). For the ID related results in Table 1, we used the hybrid version of ID explained in Section 4.4. The choice of the parameters is described in Section 4.2. As already mentioned, WBS and NOT also work on subintervals of the data, chosen though in a completely different manner than in ID. More comparative simulation and real-life studies will be given in Sections 5 and 6, respectively.

We notice from Table 1 that ID offers an important increase in the change-point detection power, especially under limited spacings between consecutive change-points. Figures 1 and 2 give a graphical representation of the results for the first out of the 100 repetitions for signals (S1) and (S2), respectively. For better presentation of the results, in (S1) the signals are presented up to $t=1000$ , since after $t=900$ there is no change-point and in all methods the estimated signal continues linearly beyond that point. For the same reason, in Figure 2 which is related to (S2), the results are presented up to $t=250$ .

The NOT and WBS methods also operate on sub-intervals of the data. However, the nature of the fixed, certain (we can expand one data point at each time), localization in ID means that it is of an order of magnitude faster than the aforementioned methods, which have high computational cost that increases linearly with the number of the randomly drawn intervals. This is an issue of fundamental importance, especially in signals with a large number of change-points, in which NOT and WBS need to increase the number $M$ of intervals drawn. However, doing this also increases the computational cost. More specifically, one could try and draw all possible combinations of start- and end-points of the intervals; however, the computational complexity turns out to be cubic in $T$ . In contrast, due to the explained interval expansion approach, in ID no choice of $M$ is required, which leads to better practical performance with more predictable execution times, while at the same time ID examines all possible change-point locations. We recall that unlike ID and NOT, the principle of WBS does not extend to models other than piecewise-constant. To be more precise, this generality of Isolate-Detect with respect to its applicability in many different signal structures is a main distinction between our method and recently published competing methods which, with the exception of NOT, have been developed to cover only the detection of level-changes.

3 Methodology and Theory

3.1 Methodology

The model is given in (1) and the unknown number, $N$ , of change-points $r_{j}$ can possibly grow with $T$ . Let $r_{0}=0$ and $r_{N+1}=T$ and let $\delta_{T}=\min_{j=1,2,\ldots,N+1}\left|r_{j}-r_{j-1}\right|$ . For clarity of exposition, we start with a simple example before providing a more thorough explanation of how ID works. Figure 3 covers a specific case of two change-points, $r_{1}=38$ and $r_{2}=77$ . We will be referring to Phases 1 and 2 involving six and four intervals, respectively. These are clearly indicated in the figure and they are only related to this specific example, as for cases with more change-points we would have more such phases. At the beginning, $s=1$ , $e=T=100$ , and we take $\lambda_{T}=10$ (how to choose $\lambda_{T}$ will be described in Section 4.2). Suppose the threshold $\zeta_{T}$ has been chosen well enough (more details in Section 4.2) so that $r_{2}$ gets detected in $\left\{X_{s^{*}},X_{s^{*}+1},\ldots,X_{e}\right\}$ , where $s^{*}=71$ . After the detection, $e$ is updated as the start-point of the interval where the detection occurred; therefore, $e=71$ . In Phase 2 indicated in the figure, ID is applied in $[s,e]=[1,71]$ . Intervals 1, 3 and 5 of Phase 1 will not be re-examined in Phase 2 and $r_{1}$ gets, upon a good choice of $\zeta_{T}$ , detected in $\left\{X_{s},X_{s+1},\ldots,X_{e^{*}}\right\}$ , where $e^{*}=40$ . After the detection, $s$ is updated as the end-point of the interval where the detection occurred; therefore, $s=40$ . Our method is then applied in $[s,e]=[40,71]$ ; supposing there is no interval $[s^{*},e^{*}]\subseteq[40,71]$ on which the contrast function value exceeds $\zeta_{T}$ , the process will terminate.

We now describe ID more generically. For each change-point, $r_{j}$ , ID works in two stages: Firstly, we isolate $r_{j}$ in an interval that contains no other change-point. To ensure this, the expansion parameter $\lambda_{T}$ can be taken to be as small as equal to 1. If $\lambda_{T}>1$ , then isolation is guaranteed with high probability. Theoretically for large $T$ , the chosen value for $\lambda_{T}$ (this typically will be small; see Section 4.2 for more details) is guaranteed to be smaller than the minimum distance $\delta_{T}$ (which has to grow with $T$ ) between two consecutive change-points and isolation will be guaranteed. For an explanation on the rate of $\delta_{T}$ with respect to the sample size $T$ , see the discussion that follows Theorem 3.1. (Of course when asymptotics is put aside, in finite samples anything can happen, and in some configurations no method can be guaranteed to detect change-points if they are arbitrarily close.) The second stage is to detect $r_{j}$ through the use of an appropriate contrast function. This function is, from now on, denoted by $C_{s,e}^{b}(\bm{X})$ , and it is defined for any integer triple $(s,e,b)$ , with $1\leq s\leq b<e\leq T$ . Heuristically, the value of $C_{s,e}^{b}(\bm{X})$ is small if $b$ is not a change-point and large otherwise. In piecewise-constant signals, the contrast function reduces to the absolute value of the CUSUM statistic defined in (4), while for continuous, piecewise-linear signals, the contrast function is given in Section 3.2. For the better understanding of the method, we provide its step-by-step simple outline through pseudocode, followed by a succinct narrative of the purpose of each step. The threshold to be used, in order to decide if a change has occurred at a specific data point, is denoted by $\zeta_{T}$ . Practical choices for $\lambda_{T}$ and $\zeta_{T}$ are given in Section 4.2. For $K=\lceil T/\lambda_{T}\rceil$ , let $c_{j}^{r}=j\lambda_{T}$ and $c_{j}^{l}=T-j\lambda_{T}+1$ for $j=1,2,\ldots,K-1$ , while $c_{K}^{r}=T$ and $c_{K}^{l}=1$ . For a generic interval $[s,e]$ , define the sequences

[TABLE]

where $k_{1}:={\rm argmin}_{j\in\left\{1,2\ldots,K\right\}}\left\{j\lambda_{T}>s\right\}$ and $k_{2}:={\rm argmin}_{j\in\left\{1,2\ldots,K\right\}}\left\{T-j\lambda_{T}+1<e\right\}$ . Denoting by $|A|$ , the cardinality of any sequence $A$ , and by $A(j)$ its $j^{th}$ element, the pseudocode of the main function is as below:

**Pseudocode explaining the proposed ID algorithm

function** ISOLATEDETECT( $s,e,\lambda_{T},\zeta_{T}$ )

if $e-s<1$ then

STOP

else

For $j\in\left\{1,2,\ldots,\left|R\right|\right\},$ denote $\left[s_{2j-1},e_{2j-1}\right]:=\left[s,{\rm R}_{s,e}(j)\right]$ For $j\in\left\{1,2,\ldots,\left|{\rm L}_{s,e}\right|\right\},$ denote $\left[s_{2j},e_{2j}\right]:=\left[{\rm L}_{s,e}(j),e\right]$$i=1$ (Main part) $b_{2i-1}:=\underset{t\in[s_{2i-1},e_{2i-1})}{\rm argmax}C_{s_{2i-1},e_{2i-1}}^{t}(\bm{X})$

if $C_{s_{2i-1},e_{2i-1}}^{b_{2i-1}}>\zeta_{T}$ then

add $b_{2i-1}$ to the set of estimated change-points.ISOLATEDETECT( $e_{2i-1},e,\lambda_{T},\zeta_{T}$ )

else

$b_{2i}:={\rm argmax}_{t\in[s_{2i},e_{2i})}C_{s_{2i},e_{2i}}^{t}(\bm{X})$

if $C_{s_{2i},e_{2i}}^{b_{2i}}>\zeta_{T}$ then

add $b_{2i}$ to the set of estimated change-points.ISOLATEDETECT( $s,s_{2i},\lambda_{T},\zeta_{T}$ )

else

$i=i+1$

if $i\leq\max\left\{|{\rm L}_{s,e}|,|{\rm R}_{s,e}|\right\}$ then

Go back to (Main part) and repeat

else

STOP

end if

end function

A brief explanation of the pseudocode follows. With $K$ already defined above, the intervals $[s_{1},e_{1}],[s_{2},e_{2}],\ldots,[s_{2K},e_{2K}]$ are those used for the isolation step. Notice that in the odd intervals $[s_{1},e_{1}],[s_{3},e_{3}],\ldots,[s_{2K-1},e_{2K-1}]$ the start-point is fixed, unchanged, and equal to $s$ , meaning that $s_{1}=s_{3}=\ldots=s_{2K-1}=s$ . In the even intervals $[s_{2},e_{2}],[s_{4},e_{4}],\ldots,[s_{2K},e_{2K}]$ , it is the end-point that is kept fixed and equal to $e$ , meaning that $e_{2}=e_{4}=\ldots=e_{2K}=e$ . The process will follow until there are intervals to check. The term “expanding intervals” that is used throughout the paper is due to this one-sided expansion (of magnitude $\lambda_{T}$ ) of the intervals. The pseudocode makes it also clear that ID is looking for change-points interchangeably in right- and left-expanding intervals which, with high probability, contain at most one change-point. The Isolate-Detect procedure is launched by the call ISOLATEDETECT( $1,T,\lambda_{T},\zeta_{T}$ ).

The idea of a-posteriori change-point detection, in which change-points are detected sequentially, has appeared previously in the literature. The PS method of Venkatraman (1992) studies the multiple change-point detection problem for the case of piecewise-constant mean signals, as well as for changes in the rate of an exponential process. The CPM method of Ross (2015) treats change-point detection in the mean or variance of a sequence of random variables when their distribution is known. In addition, CPM can be used for distributional changes. Fang and Siegmund (2020), in a work completed after the first version of the current paper appeared on arXiv, search for significant change-points in settings such as piecewise-linear, and one of their algorithms, labelled Seq, bears some resemblance to ID; we note, however, that in addition to some algorithmic differences our aim is different as we focus on consistent estimation while Fang and Siegmund (2020) on testing.

ID is conceptually and in practice different from these methods in a number of ways related to the threshold choice, the construction of the estimated change-point locations as well as the way PS, CPM, and Seq restart upon detection. Furthermore, ID’s isolation technique does not appear in CPM. By contrast, we use this isolation property of ID as a device enabling its use in piecewise-(higher-order-) polynomial models. Indeed, as shown in Baranowski et al. (2019), fast segmentation of signals of the latter type is difficult to achieve unless any change-point present can be isolated away from neighbouring change-points before detection is performed, which is exactly what ID sets out to do. In particular, this paper demonstrates the use of ID in continuous piecewise-linear models. A comparison between the performance of ID and that of state-of-the-art methods is given in Section 5.

3.2 Theoretical behavior of ID

The assumption of the random sequence $\left\{\epsilon_{t}\right\}_{t=1,2,\ldots,T}$ being independent and identically distributed (i.i.d.) from the Gaussian distribution is widely used in the literature. In this paper, the Gaussianity assumption is only made for technical convenience with respect to the proofs of Theorems 3.1 and 3.2. Relaxing both the Gaussianity and the independence assumptions in order to have time-dependent errors is a more complicated issue in terms of theory development. Recently, Dette et al. (2018) have attempted to treat this issue, specifically for the SMUCE approach of Frick et al. (2014), using a reliable estimate for the long run variance, $\sigma_{*}^{2}:=\sum_{k\in\mathbb{Z}}{\rm Cov}(\epsilon_{0},\epsilon_{k})$ , of the error distribution, which is not necessarily Gaussian.

Apart from the well-studied i.i.d. Gaussian noise structure, Isolate-Detect is explored under a variety of settings including i.i.d. non-Gaussian (see Section 4.5), and auto-correlated noise structures; see Fearnhead and Rigaill (2020) who conclude that “IDetect has very strong performance for many scenarios when either we have auto-correlated or heavy-tailed noise”.

If the standard deviation, $\sigma$ , of $\epsilon_{t}$ is unknown, then we need to estimate it and in the cases of independent errors with the signal being piecewise-constant or piecewise-linear, $\sigma$ can be estimated via the Median Absolute Deviation (MAD) method proposed in Hampel (1974). For $\bm{x}=(x_{1},x_{2},\ldots,x_{T})$ , the proposed estimator, denoted by $\hat{\sigma}:=C\times{\rm median}\left|\bm{x}-{\rm median}(\bm{x})\right|$ , has been shown to be, for $C=1.4826$ , a consistent estimator of the population standard deviation $\sigma$ in the case of Gaussian data (Rousseeuw, 1993). It is very robust as evidenced by its bounded influence function and its 50% breakdown point. For simplicity, let $\sigma=1$ , and (1) becomes

[TABLE]

With $r_{0}=0$ and $r_{N+1}=T$ , and for $j=1,2,\ldots,N+1$ , we examine the theoretical behaviour of ID in the following two illustration cases:

Piecewise-constant signals: $f_{t}=\mu_{j}$ for $t=r_{j-1}+1,\ldots,r_{j}$ , and $f_{r_{j}}\neq f_{r_{j}+1}$ .

Continuous, piecewise-linear signals: $f_{t}=\mu_{j,1}+\mu_{j,2}t$ , for $t=r_{j-1}+1,\ldots,r_{j}$ with the additional constraint of $\mu_{k,1}+\mu_{k,2}r_{k}=\mu_{k+1,1}+\mu_{k+1,2}r_{k}$ for $k=1,2,\ldots,N$ . The change-points, $r_{k}$ , satisfy $f_{r_{k}-1}+f_{r_{k}+1}\neq 2f_{r_{k}}$ .

The above scenarios are only examples of settings in which the ID methodology can be applied. The isolation aspect of the method allows its application to various different cases, such as the estimation of the number and the position of knots in piecewise polynomial signals (with or without the continuity constraint).

Piecewise-constant signals. Under piecewise-constancy, the contrast function used is the absolute value of the CUSUM statistic, the latter being

[TABLE]

where $1\leq s\leq b<e\leq T$ and $n=e-s+1$ . Under the i.i.d. Gaussian framework used for the theoretical results presented in this paper, it can be shown that ${\rm argmax}_{b}\left|\tilde{X}_{s,e}^{b}\right|={\rm argmax}_{b}\mathcal{R}_{s,e}^{b}(\bm{X})$ , where $\mathcal{R}_{s,e}^{b}(\bm{X})$ is the generalized log-likelihood ratio statistic for all potential single change-points within $[s,e]$ . For the main result of Theorem 3.1, we also make the following assumption.

(A1)

The minimum distance, $\delta_{T}$ , between two change-points and the minimum magnitude of jumps, $\underline{f}_{T}$ , are connected by $\sqrt{\delta_{T}}\underline{f}_{T}\geq\underline{C}\sqrt{\log T}$ , for a large enough constant $\underline{C}$ .

The number of change-points, $N$ , is assumed to be neither known nor fixed. It can grow with $T$ and the only indirect assumption on $N$ is due to the minimum distance, $\delta_{T}$ , between two change-points in the sense that $N+1\leq T/\delta_{T}$ . Below, we give the theoretical result for the consistency of the number and location of the estimated change-points. The proof is in Section 8 of the supplementary material.

Theorem 3.1

Let $\left\{X_{t}\right\}_{t=1,2,\ldots,T}$ follow model (3), with $f_{t}$ being a piecewise-constant signal and assume that the random sequence $\left\{\epsilon_{t}\right\}_{t=1,2,\ldots,T}$ is independent and identically distributed (i.i.d.) from the normal distribution with mean zero and variance one and also that (A1) holds. Let $N$ and $r_{j},j=1,2,\ldots,N$ be the number and locations of the change-points, while $\hat{N}$ and $\hat{r}_{j},j=1,2,\ldots,\hat{N}$ are their estimates sorted in increasing order. In addition, $\Delta_{j}^{f}=\left|f_{r_{j}+1}-f_{r_{j}}\right|$ , $j=1,2,\ldots,N$ . Then, there exist positive constants $C_{1},C_{2},C_{3},C_{4}$ , which do not depend on $T$ , such that for $C_{1}\sqrt{\log T}\leq\zeta_{T}<C_{2}\sqrt{\delta_{T}}\underline{f}_{T}$ and for a sufficiently large $T$ , we obtain

[TABLE]

The isolation aspect of Isolate-Detect helps us to prove consistency under the conditions used in Theorem 3.1 (and later in Theorem 3.2). From (5), we notice that in order to be able to match the estimated change-point locations with the true ones, $\delta_{T}$ should be larger than $\underset{j=1,2,\ldots,N}{\max}\left|\hat{r}_{j}-r_{j}\right|$ , meaning that $\delta_{T}$ must be at least $\mathcal{O}(\log T)$ . For this order of $\delta_{T}$ , Chan and Walther (2013) argue that the smallest possible $\delta_{T}\underline{f}_{T}^{2}$ that allows change-point detection is $\mathcal{O}\left(\log T-\log(\log T)\right)$ . In our case, assumption (A1) ensures that the $\mathcal{O}(\log T)$ rate for $\delta_{T}\underline{f}_{T}^{2}$ is attained, which is nearly optimal (up to the double logarithmic term). This provides evidence that ID allows for detection in complex scenarios, such as limited spacings between change-points. We mention that if $\delta_{T}$ is of higher order than $\mathcal{O}\left(\log T\right)$ , then Assumption (A1) implies that $\underline{f}_{T}$ could decrease with $T$ .

The quantity on the right-hand side of (5) is $1-\mathcal{O}(1/T)$ ; the same order as in WBS and NOT. However, ID gives a provably lower constant $C_{4}$ for the bound. To understand this consistency advantage of our method over, for example, NOT see our proof in Section 8 of the supplement and compare (17) with the result in Equation (19), p.28 in the online supplementary material of Baranowski et al. (2019). The rate of the lower bound for the threshold $\zeta_{T}$ is $\mathcal{O}\left(\sqrt{\log T}\right)$ and this is what will be used in practice as the default rate: we use

[TABLE]

and the choice of the constant $C$ will be explained in Section 4. Furthermore, (5) indicates that $\delta_{T}$ does not affect the rate of convergence of the estimated change-point locations; these only depend on $\Delta_{j}^{f}$ .

Continuous, piecewise-linear signals. Under Gaussianity and with $R_{s,e}^{b}(\bm{X})$ being the generalized log-likelihood ratio for all possible single change-points within $[s,e)$ , the idea is to find a contrast function $C_{s,e}^{b}(\bm{X})$ , which is maximized at the same point as $R_{s,e}^{b}(\bm{X})$ . The contrast function is constructed by taking inner products of the data with a contrast vector. In the case of continuous piecewise-linear signals, Baranowski et al. (2019) show that the contrast vector to be used is $\bm{\phi_{s,e}^{b}}=\left(\phi_{s,e}^{b}(1),\ldots,\phi_{s,e}^{b}(T)\right)$ , where

[TABLE]

where $n=e-s+1$ , $\alpha_{s,e}^{b}=(6/[n(n^{2}-1)(1+(e-b+1)(b-s+1)+(e-b)(b-s))])^{\frac{1}{2}}$ and $\beta_{s,e}^{b}=([(e-b+1)(e-b)]/[(b-s+1)(b-s)])^{\frac{1}{2}}$ . The contrast function is $C_{s,e}^{b}(\bm{X})=\left|\left\langle\bm{X},\bm{\phi_{s,e}^{b}}\right\rangle\right|$ . To explain the reasoning behind the choice of the triangular function $\phi_{s,e}^{b}(\cdot)$ , we define, for the interval $[s,e]$ , the linear vector $\gamma_{s,e}(t)=\left(\frac{1}{12}(e-s+1)\left(e^{2}-2es+2e+s^{2}-2s\right)\right)^{-\frac{1}{2}}\left(t-\frac{e+s}{2}\right)$ , $t=s,\ldots,e$ (and 0 otherwise) as well as the constant vector $1_{s,e}(t)=\left(e-s+1\right)^{-\frac{1}{2}}$ , $t=s,\ldots,e$ (and 0 otherwise). On the vector $\tilde{\phi}_{s,e}^{b}(t)=t-b$ , $t=b+1,\ldots,e$ (and 0 otherwise), which is linear with a kink at $b+1$ , we apply the Gram-Schmidt orthogonalization with respect to $\gamma_{s,e}(t)$ and $1_{s,e}(t)$ . Normalizing the obtained vector such that $\|.\|_{2}=1$ returns the contrast vector $\phi_{s,e}^{b}(t)$ defined in (7). The best approximation, in terms of the Euclidean distance, of $X_{t}$ in $[s,e]$ is a linear combination of $\gamma_{s,e}(t)$ , $1_{s,e}(t)$ , and $\phi_{s,e}(t)$ , which are mutually orthonormal (Baranowski et al., 2019). This orthonormality leads to $R_{s,e}^{b}(\bm{X})=\left|\left\langle\bm{X},\bm{\phi_{s,e}^{b}}\right\rangle\right|=C_{s,e}^{b}(\bm{X})$ . For the consistency of ID in continuous piecewise-linear signals, we make the following assumption.

(A2)

The minimum distance, $\delta_{T}$ , between two change-points and the minimum magnitude of jumps, $\underline{f}_{T}=\min_{j=1,2,\ldots,N}\left|2f_{r_{j}}-f_{r_{j}+1}-f_{r_{j}-1}\right|$ , are connected by the requirement $\delta_{T}^{3/2}\underline{f}_{T}\geq C^{*}\sqrt{\log T}$ , for a large enough constant $C^{*}$ .

The term $\delta_{T}^{3/2}\underline{f}_{T}$ characterizes the difficulty level of the detection problem and is analogous to $\sqrt{\delta_{T}}\underline{f}_{T}$ in the scenario of piecewise-constant signals. Theorem 3.2 gives the consistency result for the case of continuous piecewise-linear signals. The proof is in Section 8 of the supplement.

Theorem 3.2

Let $\left\{X_{t}\right\}_{t=1,2,\ldots,T}$ follow model (3) with $f_{t}$ being a continuous, piecewise-linear signal and assume that the random sequence $\left\{\epsilon_{t}\right\}_{t=1,2,\ldots,T}$ is independent and identically distributed (i.i.d.) from the normal distribution with mean zero and variance one and that (A2) holds. We denote by $N$ and $r_{j},j=1,2,\ldots,N$ the number and locations of the change-points, while $\hat{N}$ and $\hat{r}_{j},j=1,2,\ldots,\hat{N}$ are their estimates sorted in increasing order. Also, we denote $\Delta_{j}^{f}=\left|2f_{r_{j}}-f_{r_{j}+1}-f_{r_{j}-1}\right|$ . Then, there exist positive constants $C_{1},C_{2},C_{3},C_{4}$ , which do not depend on $T$ , such that for $C_{1}\sqrt{\log T}\leq\zeta_{T}<C_{2}\delta_{T}^{3/2}\underline{f}_{T}$ and for sufficiently large $T$ ,

[TABLE]

The quantity on the right-hand side of (8) is $1-\mathcal{O}\left(1/T\right)$ . In addition, in the case of $\underline{f}_{T}\sim T^{-1}$ , ID’s change-point detection accuracy is $\mathcal{O}\left(T^{2/3}\left(\log T\right)^{1/3}\right)$ , as can be seen from (8). This differs from the $\mathcal{O}\left(T^{2/3}\right)$ rate derived in Raimondo (1998) only by the logarithmic factor. The lower bound of the threshold is $\mathcal{O}\left(\sqrt{\log T}\right)$ . Therefore,

[TABLE]

where $\tilde{C}$ is a constant and we will comment on its choice in Section 4.2.

ID is flexible because it does not depend on the structure of the signal; what changes is the choice of an appropriate contrast function. Adopting a similar approach as the one for the case of continuous piecewise-linear signals, one can construct contrast functions for the detection of other types of features.

3.3 Information Criterion approach

Misspecification of the threshold in the ID algorithm can lead to the misestimation of the number of change-points. To remedy this, we develop an approach which starts by possibly overestimating the number of change-points and then creates a solution path, with the estimates ordered according to a certain predefined criterion. The best fit is then chosen, based on the optimization of a model selection criterion.

The solution path algorithm: The estimated number of change-points depends on $\zeta_{T}$ and this allows us to denote $\hat{N}=\hat{N}(\zeta_{T})$ . For given data, we employ ID using first $\zeta_{T}$ and then $\tilde{\zeta}_{T}$ , where $\tilde{\zeta}_{T}<\zeta_{T}$ . Let $C_{\tilde{\zeta}_{T}}$ and $\tilde{C}_{\tilde{\zeta}_{T}}$ be the $\tilde{\zeta}_{T}$ -associated constants in (6) and (9), respectively. With $J\geq\hat{N}(\zeta_{T})$ , we estimate $\tilde{r}_{j},j=1,2,\ldots,J$ , which are sorted in increasing order in $\tilde{S}=\left[\tilde{r}_{1},\tilde{r}_{2},\ldots,\tilde{r}_{J}\right]$ . Our aim is to prune the estimates through an iterative procedure, where at each iteration the estimation most likely to be spurious is removed. The algorithm is split into four parts, with their descriptions being fairly technical. We note however that the different parts are very similar and are based on the idea of removing change-points according to their contrast function values as well as their distance to neighbouring estimates. Even though the full explanation of each part is in Section 1 of the supplement, we now provide a brief summary for the framework of the solution path algorithm. With $\tilde{r}_{0}=1$ and $\tilde{r}_{J+1}=T$ , we first collect triplets $(\tilde{r}_{j-1},\tilde{r}_{j},\tilde{r}_{j+1})$ , $\forall\left\{1,2,\ldots,J\right\}$ and we calculate ${\rm CS}(\tilde{r}_{j}):=C_{\tilde{r}_{j-1},\tilde{r}_{j+1}}^{\tilde{r}_{j}}(\bm{X}),$ with $C_{s,e}^{b}(\bm{X})$ being the relevant contrast function. For $m={\rm argmin}_{j}\left\{{\rm CS}(\tilde{r}_{j})\right\}$ we check whether $CS(\tilde{r}_{m})\leq\tilde{\tilde{C}}\sqrt{\log T}$ , for $\tilde{\tilde{C}}>0$ ; in the proofs of Theorems 3.3 and 3.4, $\tilde{\tilde{C}}=2\sqrt{2}$ but smaller values could be sufficient; see for example Corollary 1. If $CS(\tilde{r}_{m})\leq\tilde{\tilde{C}}\sqrt{\log T}$ , we remove $\tilde{r}_{m}$ from $\tilde{S}$ , reduce $J$ by 1, relabel the remaining estimates (in increasing order) in $\tilde{S}$ , and repeat this estimate removal process, which is carried out in a way such that once the set $\tilde{S}$ contains $N$ estimates, then for $j=1,2,\ldots,N$ , each $\tilde{r}_{j}$ is within a distance of $C^{*}\left(\log T\right)^{\alpha}$ from the true change-point $r_{j}$ . We keep removing estimates until $\tilde{S}=\emptyset$ .

At the end of this change-point removal approach, we collect the estimates in

[TABLE]

where $b_{J}$ is the estimate that was removed first, $b_{J-1}$ is the one that was removed second, and so on. From now on, the vector $\bm{b}$ is called the solution path and is used to give a range of different fits. We define the collection $\left\{\mathcal{M}_{j}\right\}_{j=0,1,\ldots,J}$ where $\mathcal{M}_{0}=\emptyset$ and $\mathcal{M}_{j}=\left\{b_{1},b_{2},\ldots,b_{j}\right\}$ . For $j=2,\ldots,J$ , let $\tilde{b}_{1}<\ldots<\tilde{b}_{j}$ be the sorted elements of $\mathcal{M}_{j}$ . Among the collection of models $\left\{\mathcal{M}_{j}\right\}_{j=0,1,\ldots,J}$ , we propose to select the one that minimizes the strengthened Schwarz Information Criterion (Liu et al., 1997; Fryzlewicz, 2014), defined as

[TABLE]

where $\tilde{b}_{0}=0$ and for each collection $\mathcal{M}_{j}$ , $\tilde{b}_{j+1}=T$ and $\hat{\theta}_{1},\hat{\theta}_{2},\ldots,\hat{\theta}_{j+1}$ are the maximum likelihood estimators of the segment parameters for the model (3) with change-point locations $b_{1},b_{2},\ldots,b_{j}$ . The quantity $n_{j}$ is the total number of estimated parameters related to $\mathcal{M}_{j}$ . For example, if we do not consider the change-point locations as free parameters, then in the scenario of piecewise-constant mean $n_{j}=j+1$ (the constant values for each of the $j+1$ segments), while in the scenario of continuous and piecewise-linear signals $n_{j}=j+2$ (the starting intercept and slope and the $j$ changes in the slope). We mention that if the continuity constraint is to be removed, then $n_{j}$ would be equal to $2j+2$ (the constant and slope values for the $j+1$ segments). If now we consider the change-point locations to be free parameters, then we just need to add $j$ in the above values for $n_{j}$ in the different scenarios.

In the algorithm we have referred to three parameters: $C^{*},\tilde{\tilde{C}}$ , and $\alpha$ . Although we do not give a recipe for the choice of $C^{*}$ and $\tilde{\tilde{C}}$ , Section 4.3 describes how to circumvent their choice. With respect to $\alpha$ , taking its value to be equal to 1 in (11) gives the standard SIC penalty, but our theory requires $\alpha>1$ . In practice we use $\alpha=1.01$ in order to remain close to SIC. Theorems 3.3 and 3.4 below give the consistency results for the piecewise-constant and continuous piecewise-linear models, based on the sSIC approach. The proof of Theorem 3.3 is in the supplementary material and the same approach can be followed to prove Theorem 3.4.

Theorem 3.3

Let $\left\{X_{t}\right\}_{t=1,2,\ldots,T}$ follow model (3) under piecewise-constancy and let the assumptions of Theorem 3.1 hold. Let $N$ and $r_{j},j=1,2,\ldots,N$ be the number and locations of the change-points. Let $N\leq J$ , where $J$ can also grow with $T$ . In addition, let $\alpha>1$ be such that $\left(\log T\right)^{\alpha}=o(\delta_{T}\underline{f}_{T}^{2})$ is satisfied, where $\delta_{T}$ and $\underline{f}_{T}$ are defined in (A1). With $\left\{\mathcal{M}_{j}\right\}_{j=0,1,\ldots,J}$ being the set of candidate models obtained by the solution path algorithm, we define $\hat{N}={\rm argmin}_{j=0,1,\ldots,J}\;{\rm sSIC}(j)$ . Then, there exist positive constants $C_{1},C_{2}$ , which do not depend on $T$ , such that for $\Delta_{j}^{f}=\left|f_{r_{j}+1}-f_{r_{j}}\right|$ ,

[TABLE]

Theorem 3.4

Let $\left\{X_{t}\right\}_{t=1,2,\ldots,T}$ follow model (3) under continuous piecewise-linearity and let the assumptions of Theorem 3.2 hold. Let $N$ and $r_{j},j=1,2,\ldots,N$ be the number and locations of the change-points. Let $N\leq J$ , where $J$ can also grow with $T$ . In addition, let $\alpha>1$ be such that $\left(\log T\right)^{\alpha}=o(\delta_{T}^{3}\underline{f}_{T}^{2})$ is satisfied, where $\delta_{T}$ and $\underline{f}_{T}$ are defined in (A2). With $\left\{\mathcal{M}_{j}\right\}_{j=0,1,\ldots,J}$ being the set of candidate models obtained by the solution path algorithm, we define $\hat{N}={\rm argmin}_{j=0,1,\ldots,J}\;{\rm sSIC}(j)$ . Then, there exist positive constants $C_{1},C_{2}$ , which do not depend on $T$ , such that for $\Delta_{j}^{f}=\left|2f_{r_{j}}-f_{r_{j}+1}-f_{r_{j}-1}\right|$ ,

[TABLE]

We note that our solution path algorithm, explained in detail in Section 1 of the supplementary material, allows $J$ , the number of the detections from the already explained overestimation process, to grow with $T$ . The quantities on the right hand sides of (12) and (13) are $1-\mathcal{O}\left(1/T\right)$ ; the same order as those in (5) and (8). The lowest admissible $\delta_{T}\underline{f}_{T}^{2}$ and $\delta_{T}^{3}\underline{f}_{T}^{2}$ in Theorems 3.3 and 3.4, respectively, are slightly larger than the same quantities in the thresholding approach. Our empirical expertise suggests that SIC-based approaches tend to exhibit better practical behaviour for signals that have a moderate number of change-points and/or large spacings between them. A hybrid that combines the advantages of the thresholding and the SIC-based approach is introduced in Section 4.4.

4 Computational complexity and practicalities

4.1 Computational cost

With $\delta_{T}$ being the minimum distance between two change-points, and $\lambda_{T}$ the interval-expansion parameter, we use $\lambda_{T}<\delta_{T}$ . We note that while $\delta_{T}$ is unknown, choosing $\lambda_{T}$ small enough guarantees with high probability that this requirement holds; see Section 4.2 for how to choose $\lambda_{T}$ in order to obtain good accuracy performance and at the same time low computational cost. Now, since $K=\lceil T/\lambda_{T}\rceil>\lceil T/\delta_{T}\rceil$ and the total number, $M_{ID}$ , of intervals required to scan the data is no more than $2K$ ( $K$ intervals from each expanding direction), in the worst case scenario we have $M_{ID}=2K>2\left\lceil\frac{T}{\delta_{T}}\right\rceil$ . As a comparison, in WBS and NOT one needs to draw at least $M$ intervals where $M\geq\left(9T^{2}/\delta_{T}^{2}\right)\log\left(T^{2}/\delta_{T}\right)$ . The lower bound for $M$ in WBS and NOT is $\mathcal{O}\left(T^{2}/\delta_{T}^{2}\right)$ up to a logarithmic factor, whereas the lower bound for $M_{ID}$ is $\mathcal{O}\left(T/\delta_{T}\right)$ . This results in great speed gains of ID over WBS and NOT. The reason behind this significant difference in the computational complexity of the methods is that in WBS and NOT both the start- and end-points of the randomly drawn intervals have to be chosen, whereas in ID, depending on the expanding direction, we keep the start- or the end-point fixed.

4.2 Parameter choice

Choice of the threshold constant. We start with an upper bound on the constant $C$ , as defined in (6), for the case of piecewise-constant signals when the error terms $\epsilon_{t}$ are i.i.d. from the Gaussian distribution. We note that this result is of independent interest. Our model is as in (1) for stationary $\epsilon_{t}$ . For any vector $\bm{y}\in\mathbb{R}^{T}$ , we define

[TABLE]

where $1\leq s\leq b<e\leq T$ and $n=e-s+1$ . It can be shown that if $\epsilon_{t}$ , are serially independent and their distribution is symmetric about zero (for example i.i.d. standard Gaussian random variables), then the sequence $\left\{\epsilon_{t}\right\}_{t=1}^{T}$ satisfies

[TABLE]

The following corollary indicates that as $T\rightarrow\infty$ , we have that $C\leq\sqrt{3/2}$ , meaning that the threshold can be taken to be at most $\sqrt{3\log T}$ . This value of $\sqrt{3}$ is smaller than the constant used in the solution path algorithm of Section 3.3 ( $\tilde{\tilde{C}}=2\sqrt{2}$ ), which can however be used to give explicit upper bounds on the consistency results as explained in Theorems 3.1 and 3.3; in contrast, Corollary 1 does not give an explicit upper bound for the probability related to the consistency result as expressed in (16). We highlight that the aforementioned bound on the constant and its proof are simpler than the results presented in Fang et al. (2020) which involve the manipulation of complex distributions. The proof is in the supplementary material.

Corollary 1

Let $\{\epsilon_{t}\}_{t=1}^{T}$ be i.i.d. $N(0,\sigma^{2})$ . For any $\delta>0$ ,

[TABLE]

For the practical choice of the values of $C$ and $\tilde{C}$ , in (6) and (9), respectively, we ran a large-scale simulation study involving a wide range of signals. The number of change-points, $N$ , was generated from the Poisson distribution with rate parameter $N_{\alpha}\in\left\{4,8,12\right\}$ . For $T\in\left\{100,200,500,1000,2000,5000\right\}$ , we uniformly distributed the change-points in $\left\{1,2,\ldots,T\right\}$ . Then, for piecewise-constant (or continuous piecewise-linear) signals, at each change-point location we introduced a jump (or a slope change) which followed the normal distribution with mean zero and variance $\sigma^{2}\in\left\{1,3,5\right\}$ . Standard Gaussian noise was then added onto the simulated signal. For each value of $N_{\alpha}$ , $\sigma^{2}$ and $T$ we generated 1000 replicates and estimated the number of change-points using ID with threshold $\zeta_{T}$ as in (6) and (9) for a variety of constant values $C$ and $\tilde{C}$ . The best behaviour occurred when, approximately, $C=1.05$ and $\tilde{C}=1.4$ . These values will be referred to as the default constants and they hold true for all signals that satisfy the assumption of the error terms $\epsilon_{t}$ being i.i.d. Gaussian. We note that the value of $\tilde{C}=1.4$ does not violate Corollary 1 because the result expressed in the latter is only for piecewise-constant signals, while the constant $\tilde{C}$ applies to the scenario of continuous, piecewise-linear signals. Due to the fact that the contrast function used is based on local averaging, the CLT can be used to show that for sufficiently large sample size $T$ , ID is robust when the normality assumption is not satisfied; this has also been explored in Fearnhead and Rigaill (2020). Also, pre-averaging is a practical approach that we employed in Section 4.5 for such cases with error departures from Gaussianity.

In the SIC-based approach of Section 3.3, we started by detecting change-points using threshold $\tilde{\zeta}_{T}<\zeta_{T}$ . In practice, we take the constants related to $\tilde{\zeta}_{T}$ , namely $C_{\tilde{\zeta}_{T}}$ and $\tilde{C}_{\tilde{\zeta}_{T}}$ as defined in Section 3.3, to be 0.9 and 1.25, respectively.

Choice of the expansion parameter $\lambda_{T}$ . We start by highlighting that our numerical experience suggests that ID is robust to small changes in the value of $\lambda_{T}$ ; for a small-scale simulation study when the value of $\lambda_{T}$ changes significantly ( $\lambda_{T}\in\left\{5,20,80\right\}$ ), see Section 6 of the supplementary material. Theoretically, for a given signal, the change-point detection results obtained from ID are the same for any value of $\lambda_{T}$ used which is less than the minimum spacing between two successive change-points. The computational cost of running ID is inversely proportional to the size of the expansion parameter; the smaller the $\lambda_{T}$ , the more intervals we need to work on. However, the low computational complexity of our algorithm allows us to take $\lambda_{T}$ to be as small as the value of three leading to very good accuracy even for signals with frequent change-points. We now give example execution times for two models, (T1) and (T2) defined below, on a 3.60GHz CPU with 16 GB of RAM. We employed the ID-variant for long signals explained in Section 4.3.

(T1)

Length $l_{j}=7\times 10^{j},j=3,4,5$ , with change-points at $7,14,\ldots,l_{j}-7$ and values between them $0,4,0,4,\ldots,0,4$ . The standard deviation is $\sigma=0.5$ . Execution times: 0.31s ( $j=3$ ), 2.25s ( $j=4$ ), 26.41s ( $j=5$ ).

(T2)

Length $l_{j}=7\times 10^{j},j=3,4,5$ , with no change-points. We use $\sigma=1$ . Execution times: 0.64s ( $j=3$ ), 3.01s ( $j=4$ ), 30.35s ( $j=5$ ).

4.3 Variants

Here, we describe three different ways to further improve ID’s practical performance.

Long signals: If $T$ is large, we split the given data sequence uniformly into smaller parts (windows), to which ID is then applied. In practical implementations, the length of the window is 3000 and we apply this structure only when $T>12000$ , because for smaller values of $T$ there are no significant differences in the execution times of ID and its window-based variant. The computational improvement that this structure offers is explained in Section 3 of the supplement.

Restarting after detection: In practice, instead of starting from the end-point $e^{*}$ (or start-point $s^{*}$ ) of the right-expanding (or left-expanding) interval where a detection occurred, we could start from the estimated change-point, $\hat{b}$ . This alternative, labelled ${\rm ID}_{det}$ , leads to accuracy improvement without affecting the speed of the method.

Faster solution path algorithm: In practice, we use only Part 4 of the solution path algorithm described in Section 1 of the supplement because it is quicker and conceptually simpler; it requires only the choice of $\alpha$ and tends not to affect ID’s accuracy.

4.4 Alternative model selection criteria

A hybrid between thresholding and SIC stopping rules: For signals with a large number of regularly occurring change-points, the threshold-based ID tends to behave better than the SIC-based procedure. As explained after Theorems 3.3 and 3.4, this is unsurprising because SIC-based approaches typically perform better on signals with a moderate number of change-points separated by larger spacings. This difference in ID’s behaviour between the threshold- and SIC-based versions is what motivates us to introduce a hybrid of these two stopping rules with minimal parameter choice, which works as follows. Firstly, we estimate the change-points using the threshold approach ${\rm ID}_{det}$ with $\lambda_{T}^{th}=3$ . If the estimated change-points are more than a constant $J^{*}$ , then the result is accepted and we stop. Otherwise, the hybrid method proceeds to detect the change-points using the SIC-based approach with $\lambda_{T}>\lambda_{T}^{th}$ , since the already-applied thresholding rule has not suggested a signal with many change-points. In the simulations, we use $J^{*}=100$ , $\lambda_{T}=10$ .

Steepest Drop to Low Levels (SDLL): We also combine ID with the SDLL model selection method introduced in Fryzlewicz (2020).

4.5 Extension to different noise structures

This section describes how to use ID when the noise is not Gaussian. We pre-process the data in order to obtain a noise structure that is closer to Gaussianity. For a given scale number $s$ and data $\left\{X_{t}\right\}_{t=1,2,\ldots,T}$ , let $Q=\lceil T/s\rceil$ and $\tilde{X}_{q}=\frac{1}{s}\sum_{t=(q-1)s+1}^{qs}X_{t}$ , for $q=1,2,\ldots,Q-1$ , while $\tilde{X}_{Q}=(T-(Q-1)s)^{-1}\sum_{t=(Q-1)s+1}^{T}X_{t}$ . We apply ID on $\left\{\tilde{X}_{q}\right\}_{q=1,2,\ldots,Q}$ to obtain the estimated change-points, namely $\tilde{\tilde{r}}_{1},\tilde{\tilde{r}}_{2},\ldots,\tilde{\tilde{r}}_{\hat{N}}$ , in increasing order. To estimate the original locations of the change-points we define $\hat{r}_{k}=\left(\tilde{\tilde{r}}_{k}-1\right)s+\left\lfloor\frac{s}{2}+0.5\right\rfloor,\;k=1,2,\ldots,\hat{N}$ . The larger the value of $s$ , the closer the distribution of the noise to normal, but the more the amount of pre-processing. In simulations presented in Section 5, we use $s=3$ for the case of Student- $t_{5}$ distributed noise, while if the tails are heavier (Student- $t_{3}$ ), we set $s=5$ . The hybrid version of ID will be employed on $\left\{\tilde{X}_{q}\right\}_{q=1,2,\ldots,Q}$ and in order to be consistent with the choice of the expansion parameter, we take $\lambda^{*}_{T}=\left\lfloor\lambda_{T}/s\right\rfloor$ . In practice, for unknown noise, our recommendation is to set $s=5$ .

5 Simulations

This section compares the performance of ID with competitors. The main change-point detection $\mathsf{R}$ functions in the competing packages were called using their default input arguments, which does not always allow direct like-for-like comparisons of the methods. Whenever needed (difficult signal structures), and in order to help the competitors capture their best possible performance, the input values were adjusted accordingly. The $\mathsf{R}$ code used for the simulation study is available from Github at https://github.com/Anastasiou-Andreas/IDetect/blob/master/R/Simulations_used.R. Table 2 shows the competitors used.

CPOP is employed based on $\mathsf{R}$ code found in http://www.research.lancs.ac.uk/portal/en/datasets/cpop(56c07868-3fe9-4016-ad99-54439ec03b6c).html and TF in https://stanford.edu/~boyd/l1_tf. For WBS, we give results based on both the information criterion and the thresholding (for $C=1$ ) stopping rules. The notation is WBSIC and WBSC1, respectively. With respect to WBS2, its performance is investigated based on the SDLL model selection criterion introduced in Fryzlewicz (2020). In the cpm package, the threshold is decided through the average run length (ARL) until a false positive occurs. In our simulations, we give results for ${\rm ARL}=500$ (the default value) and if the signal length, $l_{s}$ , is greater than 500, results are also given for ${\rm ARL}=1000\lceil ls/1000\rceil$ . The notation is CPM. $l.A$ , with $A$ the value of ARL. For FKS, when the number of knots is unknown (the scenario we work in), we need to specify the maximum allowed number of knots. We take this to be 2 $N$ , with $N$ the true number of change-points. Also, the estimated change-points by FKS are positive real numbers; we take as estimation the closest integer. The proposed ID version is the hybrid described in Section 4.4. However, we also present the results for two more variants: SDLL and thresholding with constant $\sqrt{3/2}$ (see (6)), which is the upper bound proven in Corollary 1. The notation for these variants is ID.SDLL and ID ${}_{\sqrt{3/2}}$ , respectively.

A seemingly difficult structure for ID: Signals that present the most difficulty to ID are ones in which change-points are concentrated in the middle part of the data and offset each other, as in Figure 4. The reason is that due to the left- and right- expanding feature of ID, where one of the two end-points of the interval is kept fixed, the change-points need to be detectable based on relatively “unbalanced” (explanation follows directly below) tests, which typically tend to offer poor power. For example, referring again to Figure 4, the change-point at 490 will need to be isolated and detected by comparing the means of the data over the long interval $[1,490]$ and a short interval of the form $[491,e_{j}]$ , where $e_{j}\leq 510$ is the end-point of a right-expanding interval $[1,e_{j}]$ . To be more precise, if the expansion parameter $\lambda=3$ , then $e_{j}\in\left\{492,495,\ldots,510\right\}$ and therefore our procedure will have seven opportunities to detect the change-point 490 while it is still isolated in intervals that do not contain any other change-points. Even though ID would be expected to struggle in detecting the change-points in such unbalanced intervals, our numerical experience suggests that its performance on such challenging signals is in fact very good and matches or surpasses that of the best competitors; see for example the results in Table 3 for the model (M4), which follows this structure.

All the signals are fully specified in Section 2 of the supplementary material. Figure 5 shows examples of the data generated by models (M1) blocks, (M2) teeth, (M4) middle-points, and (W1) wave 1. Tables 3–7 summarize the results in the case of i.i.d. Gaussian noise. Table 8 presents the behaviour of ID under the setting of i.i.d. scaled Student- $t_{d}$ noise, where $d=3,5$ . More examples are in the supplement.

We highlight that the NOT, WBSIC, and S3IB methods require the specification of the maximum number, $K_{max}$ , of change-points allowed to be detected. If the default values in these methods are lower than the true number of change-points in the simulated examples, then we take $K_{max}=\lceil T/\delta_{T}\rceil$ , where $\delta_{T}$ is the minimum distance between two change-points. We ran 100 replications for each signal and the frequency distribution of $\hat{N}-N$ for each method is presented. The methods with the highest empirical frequency of $\hat{N}-N=0$ (or in a neighbourhood of zero, depending on the example) and those within $10\%$ off the highest are given in bold. As a measure of the accuracy of the detected locations, we provide Monte-Carlo estimates of the mean squared error, ${\rm MSE}=T^{-1}\sum_{t=1}^{T}{\mathbb{E}}\left(\hat{f}_{t}-f_{t}\right)^{2}$ , where $\hat{f}_{t}$ is the ordinary least square approximation of $f_{t}$ between two successive change-points. In continuous piecewise-linear signals, $\hat{f}_{t}$ is the splines fit obtained using the splines package in R. The scaled Hausdorff distance, $d_{H}=n_{s}^{-1}\max\left\{\max_{j}\min_{k}\left|r_{j}-\hat{r}_{k}\right|,\max_{k}\min_{j}\left|r_{j}-\hat{r}_{k}\right|\right\},$ where $n_{s}$ is the length of the largest segment, is also given in all examples apart from the signal (NC) in Table 5, which is a constant-mean signal with no change-points.

The average computational time for all methods, apart from FDR, is also provided. FDR is excluded due to its non-uniform procedure in terms of the execution speed for each signal (if a newly obtained signal has length greater than previously treated signals, then FDR estimates the threshold by 5000 Monte-Carlo simulations, which makes it slow). In some cases the average computational time for FKS is not given. We have already explained that we need to pre-specify the maximum allowed number of knots in order for FKS to work. The method is somewhat slow and we exclude the results for FKS when the true change-points are more than 10, as in such cases it would take a significant amount of time to finish all the 100 simulations.

With regards to piecewise-constancy, ID is always in the top $10\%$ of the best methods when considering accuracy in any aspect (estimation of $N$ , MSE, $d_{H}$ ); in most cases it is the best method overall. ID.SDLL is also, in most cases, in the top 10% of the best performing methods; this provides evidence that the Isolate-Detect algorithm can be combined with various model selection criteria (thresholding, SIC, SDLL) and maintain a good practical behaviour. When the threshold constant, $C$ , is equal to $\sqrt{3/2}$ , the behaviour of ID remains good for signals that have a moderate number of change-points that are not near each other. As we can see from Table 4, ID ${}_{\sqrt{3/2}}$ seems to struggle in scenarios with a large number of frequently occurring change-points. In continuous piecewise-linear signals, CPOP, ID, and ID.SDLL are in all cases in the top $10\%$ of the best methods in terms of the accurate estimation of $N$ . In terms of the MSE and $d_{H}$ , CPOP is by a narrow margin the overall best method, with ID and ID.SDLL coming second and third, respectively. We can deduce that our method exhibits uniformity in detecting with high accuracy the change-points for various different signal structures, a characteristic which is at least partly absent from the majority of its competitors. Furthermore, ID’s behaviour is particularly impressive in extremely long signals with a large number of frequently occurring change-points; see Tables 4 and 7. Compared to other well-behaved methods, such as NOT for piecewise-constancy and CPOP for continuous piecewise-linear signals, our methodology has by far the lowest computational cost. To conclude, the simulation study provides evidence that Isolate-Detect is an accurate, reliable, and quick method for generalized change-point detection.

The results of Table 8 are very good for $d=5$ and not too different from those under Gaussian noise. For $d=3$ , there is a slight overestimation of the number of change-points. When the tails of the distribution of the noise are significantly heavier than those of the normal distribution, one can obtain better results by increasing the threshold constant. For example, the results in Table 8 for $d=3$ were improved when the threshold constant was slightly increased. We highlight that more thorough simulations can be done using our R packages IDetect and breakfast and code available from https://github.com/Anastasiou-Andreas/IDetect/blob/master/R/Simulations_used.R.

6 Real data examples

6.1 UK House Price Index

We investigate the performance of ID on monthly percentage changes in the UK House price index from January 1995 to December 2020 in two London Boroughs: Tower Hamlets and Hackney. The data are available from http://landregistry.data.gov.uk/app/ukhpi and they were accessed in March 2021. Figure 6 shows the fits of ID, ID.SDLL, NOT, and TGUH. In both data sets, ID behaves similarly to NOT whereas ID.SDLL’s performance is closer to that of TGUH where we detect more change-points. This difference between the examined methods is, in our opinion, due to the fact that ID in this example and NOT detect change-points based on the Schwarz Information Criterion, so fewer estimated change-points can be expected. The detection of two change-points near March 2008 and September 2009 for both boroughs may be related to the financial crisis during that time, which led to a decrease in house prices. As explained in Section 3.3, our methodology returns the solution path defined in (10), which can be used to obtain different fits; see Section 7 in the supplement for more details and for a real-data example where this is useful.

Residual diagnostics have indicated that the behaviour of the raw residuals, $X_{t}-\hat{f}_{t}$ , in relation to normality and independence is good for all methods.

6.2 The COVID-19 outbreak in the UK

The performance of ID is investigated on data from the recent COVID-19 pandemic; we employ a continuous piecewise-linear model on the daily number of lab-confirmed cases in England, as well as on the daily additional COVID-19 associated UK deaths. The data concern the period from the beginning of March 2020 until the end of February 2021 and they are available from https://coronavirus.data.gov.uk. The data were accessed on the $8^{th}$ of March 2021. Before applying the various methods to the data, we bring the distribution closer to Gaussian with constant variance. To achieve this we perform the Anscombe transform, $a:\mathbb{N}\rightarrow\mathbb{R}$ , with $a(x)=2\sqrt{x+3/8}$ as described in Anscombe (1948). We denote the transformed number of COVID-19 cases by $\tilde{X}_{t}$ and the transformed number of COVID-19 associated deaths by $\tilde{D}_{t}$ . Figure 7 presents the results of ID, ID.SDLL, CPOP, and NOT for the transformed data.

We observe that ID, ID.SDLL, and NOT have a similar behaviour, while CPOP gives a higher estimated number of change-points. In an attempt to date the detected change-points by ID, we provide a possible explanation of their location with respect to the outbreak of the pandemic in the UK; this discussion is given in Section 4 of the supplementary material.

For another example related to the continuous, piecewise-linear case, see Section 7 of the supplement where we explore the behaviour of Isolate-Detect and two competitors, CPOP and NOT, on the daily closing stock prices of Samsung Electronics Co. from July 2012 until June 2020.

7 Concluding reflections on ID

In this paper, we have proposed Isolate-Detect which is a new, generic technique for multiple generalized change-point detection in noisy data sequences. The method is based on a change-point isolation approach which seems to provide an advantage in detection power, especially in complex structures where most state-of-the-art competitors seem to suffer (see the simulations in Section 5) such as limited spacings between change-points. In addition, the aforementioned isolation aspect allows the extension of our method to the detection of knots in higher-order polynomial signals. As already mentioned in Section 1, NOT, WBS, and WBS2 also work on sub-intervals of the data, but the way the isolation is carried out in ID, where one of the end-points of the subintervals is kept fixed, provides predictable execution times for the analysis of a given data sequence, which are faster than the aforementioned competitors; see Sections 4.1 and 5. Another advantage of our method over NOT, WBS and WBS2 is that, due to its pseudo-sequential interval expansion character, it can easily be applied for online change-point detection.

In Section 4.4, a variant of ID was introduced that combines the threshold- and SIC-based versions of our proposed method with the aim to enhance its accuracy (both in terms of the estimated number and the estimated change-point locations) for signals of different structures with respect to the true number of change-points and the distance between them. In addition, due to the way that the relevant hybrid approach has been developed in Section 4.4, we manage to offer, for ease of execution, minimal parameter choice. Apart from thresholding and SIC, we have also combined ID with the SDLL model selection criterion.

In the practical applications of Sections 5 and 6, compared to the state-of-the-art competitors, ID lies in the top 10% (in terms of the accurate estimation of the number and the location of the change-points) of the best methods. Furthermore, it exhibits a notable advantage over other techniques in long signals with many change-points that occur frequently. In addition, ID’s pseudo-sequential character assists in attaining a low computational time; our method can accurately analyse signals of tens of thousands with thousands of change-points in less than a second; see for example Table 4. In cases where the normality assumption for the error terms is violated, Section 4.5 provides a practical solution where pre-processing allows us to use ID without altering the proposed parameter values. The results of simulations from a Student- $t$ distribution with two options for the degrees of freedom are in Table 8.

Since no method has a uniformly best behaviour, it is natural to also highlight the weaknesses of our method in terms of its practical behaviour. To start with, ID can be slow in long and constant signals in which change-points do not occur. This is because of the expanding intervals attribute, which in the case of no change-points will push the method to keep testing for change-points in growing, overlapping intervals. This is inevitably going to lead to high computational costs. We tried to eliminate this weakness by introducing a window-based variant, as explained in Section 4.3. Another drawback of the method is that, due to its left- and right-expanding feature, the change-points need to be detectable based on relatively unbalanced intervals. This could lead to accuracy issues in signals where the change-points are in the middle of the data sequence and offset each other. In practice, we have not encountered this type of behaviour in ID; in particular it accurately detects the change-points for the model (M4) in Table 3, which is an example of the aforementioned structure with two nearby change-points in the middle of the data sequence.

1 The different parts in the removal process of the solution path algorithm

The four different parts of the removal process applied in order to create the solution path explained in Section 3.3 of the paper are very similar and are based on the idea of removing change-points according to their contrast function values as well as their distance to neighbouring estimates. We mention that if the algorithm proceeds from Part 1 to Part 2 as below, then it is guaranteed that it will also proceed up to Part 4. All events below occur with probability tending to one with $T$ .

Part 1: With $C^{*}$ being a positive constant, the aim is to prune the estimates in $\tilde{S}$ , such that, for each true change-point, there are at most four and at least one estimated change-point within a distance of $C^{*}(\log T)^{\alpha}$ . To achieve this, $\forall j\in\left\{1,2,\ldots,J\right\}$ and with $\tilde{r}_{0}=1$ , $\tilde{r}_{J+1}=T$ , we collect triplets $(\tilde{r}_{j-1},\tilde{r}_{j},\tilde{r}_{j+1})$ and we calculate $CS(\tilde{r}_{j}):=C_{\tilde{r}_{j-1},\tilde{r}_{j+1}}^{\tilde{r}_{j}}(\bm{X})$ , with $C_{s,e}^{b}(\bm{X})$ being the relevant contrast function. For $m={\rm argmin}_{j}\left\{CS(\tilde{r}_{j})\right\}$ , firstly we check whether $CS(\tilde{r}_{m})\leq\tilde{\tilde{C}}\sqrt{\log T}$ , for $\tilde{\tilde{C}}>0$ ; in the proofs of Theorems 3.3 and 3.4, $\tilde{\tilde{C}}=2\sqrt{2}$ , but smaller values could be sufficient. If $CS(\tilde{r}_{m})\leq\tilde{\tilde{C}}\sqrt{\log T}$ and also $\tilde{r}_{j+1}-\tilde{r}_{j-1}\leq 2C^{*}(\log T)^{\alpha}$ , we remove $\tilde{r}_{m}$ from $\tilde{S}$ , reduce $J$ by 1, relabel the remaining estimates (in increasing order) in $\tilde{S}$ , and repeat this estimate removal process. We proceed to Part 2 when $CS(\tilde{r}_{m})>\tilde{\tilde{C}}\sqrt{\log T}$ . If this is not satisfied at any point of this part, then we conclude that there are no change-points in the data sequence and we stop.

Part 2: The aim is to continue the pruning process of Part 1, in a way that at the end of Part 2 there is at least one estimate within a distance of $C^{*}(\log T)^{\alpha}$ from each true change-point, but also there are at most two estimates between any pair of consecutive true change-points. For the relabelled estimates in $\tilde{S}$ after the completion of Part 1, if $\tilde{r}_{j}-\tilde{r}_{j-1}\leq C^{*}(\log T)^{\alpha}$ , then we remove $\tilde{r}_{j}$ , relabel the remaining estimates, and keep removing the estimates until there is no pair $(\tilde{r}_{j-1},\tilde{r}_{j})$ , such that $\tilde{r}_{j}-\tilde{r}_{j-1}\leq C^{*}(\log T)^{\alpha}$ . We then calculate $CS(\tilde{r}_{j})$ as in Part 1 and for $m={\rm argmin}_{j}\left\{CS(\tilde{r}_{j})\right\}$ , if $CS(\tilde{r}_{m})\leq\tilde{\tilde{C}}\sqrt{\log T}$ , then we remove $\tilde{r}_{m}$ and relabel the remaining elements of $\tilde{S}$ . This removal process is repeated and we proceed to Part 3 only when $CS(\tilde{r}_{m})>\tilde{\tilde{C}}\sqrt{\log T}$ .

Part 3: We need to ensure that once $\tilde{S}$ contains $N$ estimates, then for $j=1,2,\ldots,N$ , each $\tilde{r}_{j}$ is within a distance of $C^{*}\left(\log T\right)^{\alpha}$ from $r_{j}$ . To achieve this, for the remaining estimated change-points after Part 2, we use triplets $(\tilde{s}_{j},\tilde{r}_{j},\tilde{e}_{j})$ , with $\tilde{s}_{j}=\left\lfloor(\tilde{r}_{j-1}+\tilde{r}_{j})/2\right\rfloor+1$ and $\tilde{e}_{j}=\left\lceil(\tilde{r}_{j}+\tilde{r}_{j+1})/2\right\rceil$ . For $m={\rm argmin}_{j}C_{\tilde{s}_{j},\tilde{e}_{j}}^{\tilde{r}_{j}}(\bm{X})$ , if $C_{\tilde{s}_{m},\tilde{e}_{m}}^{\tilde{r}_{m}}(\bm{X})\leq\tilde{\tilde{C}}\sqrt{\log T}$ , then we remove $\tilde{r}_{m}$ and relabel the remaining estimates in $\tilde{S}$ in increasing order. We repeat this removal procedure until $C_{\tilde{s}_{m},\tilde{e}_{m}}^{\tilde{r}_{m}}(\bm{X})>\tilde{\tilde{C}}\sqrt{\log T}$ , which is when we proceed to Part 4.

Part 4: For the estimated change-points that are in $\tilde{S}$ after Part 3 is completed, we use again the triplets $(\tilde{r}_{j-1},\tilde{r}_{j},\tilde{r}_{j+1})$ in order to find $m={\rm argmin}_{j}\left\{CS(\tilde{r}_{j})\right\}$ and then remove $\tilde{r}_{m}$ from $\tilde{S}$ . This estimates removal approach is repeated until $\tilde{S}=\emptyset$ .

2 Models used in the simulation study of the paper

The characteristics of the test signals $f_{t}$ as well as the standard deviations $\sigma$ of the noise $\epsilon_{t}$ , which were used in the simulation study are given in the list below.

(M1)

blocks: length 2048 with change-points at 205, 267, 308, 472, 512, 820, 902, 1332, 1557, 1598, 1659 with values between change-points 0, 14.64, $-$ 3.66, 7.32, $-$ 7.32, 10.98, $-$ 4.39, 3.29, 19.03, 7.68, 15.37, 0. The standard deviation is $\sigma=10$ .

(M2)

teeth: length 140 with change-points at 11, 21, 31, 41, 51, 61, 71, 81, 91, 101, 111, 121, 131 with values between change-points 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1. The standard deviation of the noise is $\sigma=0.4$ .

(M3)

stairs: length 150 with change-points at 11, 21, 31, 41, 51, 61, 71, 81, 91, 101, 111, 121, 131, 141 with values between change-points 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15. The standard deviation of the noise is $\sigma=0.3$ .

(M4)

middle-points: length 2000 with change-points at 1000 and 1020 with values between change-points 0, 1.5, 0. The standard deviation of the noise is $\sigma=1$ .

(M5)

long teeth: length 20000 with 1999 change-points at $10,20,\ldots,19990$ with values between change-points $0,3,0,3,\ldots,0,3$ . The standard deviation is $\sigma=0.8$ .

(NC)

constant signal: length 3000 with no change-points. The standard deviation is $\sigma=1$ .

(W1)

wave 1: piecewise-linear signal without jumps in the intercept, $T=1408$ , with 7 change-points at $256,512,768,1024,1152,1280,1344$ with the corresponding changes in slopes $-1/64,2/64,-3/64,4/64,-5/64,6/64,-7/64$ , starting intercept $f_{1}=1$ and slope $f_{2}-f_{1}$ = $1/256$ . The standard deviation of the noise is $\sigma=1$ .

(W2)

wave 2: piecewise-linear signal without jumps in the intercept, $T=1500$ , with 99 change-points at $15,30,\ldots,1485$ . The corresponding changes in the slope are $-1,1,-1,\ldots,-1$ , while the starting intercept is $f_{1}=-1/2$ and the starting slope is $f_{2}-f_{1}$ = $1/40$ . The standard deviation is $\sigma=1$ .

(W3)

smoother signal 1: piecewise-linear signal without jumps in the intercept, $T=200$ , with 9 change-points at $20,40,\ldots,180$ with the corresponding changes in slopes $1/6,1/2,-3/4,-1/3,-2/3,1,1/4,3/4,-5/4$ . The starting intercept is $f_{1}=1$ and slope $f_{2}-f_{1}$ = $1/32$ . The standard deviation of the noise is $\sigma=0.3$ .

(W4)

smoother signal 2: piecewise-linear signal without jumps in the intercept, $T=1000$ , with 19 change-points at $50,100,\ldots,950$ with the corresponding changes in slopes $-1/16,-5/16,-5/8,1,5/16,15/32,-5/8,-7/32,-3/4,13/16,\linebreak 5/16,19/32,-1,-5/8,23/32,1/2,15/16,-25/16,-5/4$ , starting intercept $f_{1}=1$ and slope $1/32$ . The standard deviation of the noise is $\sigma=0.6$ .

3 Improvement of ID in the case of big data

In this section, we show through simulations that applying ID on a fixed window grid improves its speed in large data sets, without affecting its accuracy. We compare the classic ID method as explained in Section 3.1 of the main paper with the new window-grid-based version (WID) of Section 4.3 in the case of three data sequences of length $10^{5}$ , each with standard Gaussian noise. We work under the scenario of piecewise-constant mean. The three signals are

(D1)

No change-points;

(D2)

three change-points at 25000, 55000, 85000 and the values between change-points are 0,3,-3,2;

(D3)

seven change-points at 16000, 22000, 28000, 46000, 62000, 74000, 86000 and the values between change-points are 0,4,-4,4,-4,4,-4,4.

We took the expansion parameter $\lambda_{T}$ to be equal to 10 and the results are shown in Table 1. As a measure of the accuracy of the detected locations, we provide Monte-Carlo estimates of the mean squared error, ${\rm MSE}=T^{-1}\sum_{t=1}^{T}{\mathbb{E}}\left(\hat{f}_{t}-f_{t}\right)^{2}$ . The scaled Hausdorff distance,

[TABLE]

where $n_{s}$ is the length of the largest segment, is also given for (D2) and (D3); for (D1), $d_{H}$ is not informative. In terms of accuracy, both methods exhibit excellent behaviour. However, in terms of speed, the advantage of the windows-based approach is obvious. Note the decrease in the computational time of ID when the number of change-points gets larger. This is expected because the worst case in terms of computational complexity is when there are no change-points because ID will then be forced to calculate the contrast function on quite large intervals, even on $[1,T]$ , which is computationally more expensive.

4 A justification of the estimations in the example of Section 6.2

Here, we provide a possible explanation of the three most important (based on the solution path) detected change-points by ID in the real data related to the COVID-19 outbreak in the UK. We focus on the signal obtained with respect to the daily number of deaths (ID and ID.SDLL give exactly the same outcome), but similar conclusions can be extracted for the daily number of cases. The three most important changes in the behaviour of the data sequence have been detected on the $8^{th}$ of April, the $12^{th}$ of July and the $19^{th}$ of January. Some justification of these results is as follows:

•

The change-point on the $8^{th}$ of April, shows that the upward trend vanishes and a negative slope takes its place leading to a vast decrease in the number of reported deaths. There is an obvious connection of this change-point with the British Government’s important decision in late March to impose a general lockdown. On 9 April, the Secretary of State for Foreign Affairs stated that the UK was starting to see the impact of the restrictions; our second detection is in full agreement with the aforementioned date and statement.

•

With respect to the change-point on the $12^{th}$ of July, it indicates a stabilisation, at a low level, for the number of deaths. This is, of course, expected because the downward trend could not continue indefinitely.

•

The last change-point on the $19^{th}$ of January indicates that an upward trend on the daily number of deaths, which was apparent for a period of almost 3.5 months, vanishes and its place takes a significantly negative slope. There is an obvious connection of this change-point with the vaccination programme in the country. More specifically, on 8 January 2021, MRNA-1273 (commonly known as the Moderna vaccine) was the third COVID-19 vaccine approved for use in the UK.

5 Additional simulation results

Here we present the results of simulations for various signals other than those presented in Section 5 of the main article. Tables 3-7 summarize the results for the following models.

(LS)

long stairs: length 10000 with 499 change-points at $20,40,\ldots,9980$ with values between change-points $0,2,4,6,\ldots,996,998$ . The standard deviation is $\sigma=1$ .

(LT2)

long teeth 2: length 10000 with 249 change-points at $40,80,\ldots,9960$ with values between change-points $0,1.5,0,1.5,\ldots,0,1.5$ . The standard deviation is $\sigma=1$ .

(ELT)

extremely long teeth: length 100000 with 19999 change-points at $5,10,\ldots,99995$ with values between change-points $0,2,0,2,\ldots,0,2$ . The standard deviation is $\sigma=0.3$ .

(NC2)

constant signal 2: length 300 with no change-points. The standard deviation is $\sigma=1$ .

(SW1)

wave 5: piecewise-linear signal without jumps in the intercept, $T=2400$ , with 119 change-points at $20,40,\ldots,2380$ with the corresponding changes in slopes $2.5,-2.5,2.5,\ldots,2.5$ , starting intercept $f_{1}=1$ , slope $f_{2}-f_{1}$ = $1.25$ and $\sigma=3$ .

(SW2)

wave 6: piecewise-linear signal without jumps in the intercept, $T=1500$ , with 29 change-points at $50,100,\ldots,1450$ with the corresponding changes in slopes $-1/7,1/7,-1/7,\ldots,-1/7$ , starting intercept $f_{1}=-1/2$ , slope $f_{2}-f_{1}$ = $1/24$ and $\sigma=1$ .

(SW3)

wave 7: piecewise-linear signal without jumps in the intercept, $T=840$ , with 119 change-points at $7,14,\ldots,833$ with the corresponding changes in slopes $-1,1,-1,\ldots,-1$ , starting intercept $f_{1}=-1/2$ and slope $f_{2}-f_{1}$ = $1/32$ . The standard deviation is $\sigma=0.3$ .

The signals (LS), (LT2), (ELT), and (NC2) are treated under piecewise-constancy, while (SW1), (SW2) and (SW3) under the continuous and piecewise-linear case. FDR, WBSIC and S3IB are excluded from the comparative study for the extremely long signal (ELT). For FDR, we had to interrupt the execution after 10 hours, while for WBSIC and S3IB, in order to have a fair comparison of the methods with the rest, we had to increase the default value of the maximum number of change-points allowed to be detected to be greater than 20000. For a single iteration we had to stop the execution for WBSIC and S3IB after 30 minutes.

In all examples, the ID methodology is within $10\%$ of the best method. Once again, it exhibits remarkable behaviour when it comes to very long signals with a large number of frequently appearing change-points; see Tables 3, 4 and 6.

6 Investigation of the impact of different values of $\lambda_{T}$ on accuracy

In this section, we provide a small-scale simulation study in order to examine the behaviour of ID on different values of the expansion parameter $\lambda_{T}$ . The simulation set up is as in Section 5 of the main paper and the signals used are those explained in Section 2 of the supplement. For the expansion parameter, we take $\lambda_{T}\in\left\{5,20,80\right\}$ . The results are in Tables 9 - 15 below.

In terms of accuracy, we notice that in most cases, the best method is ID ${}_{\lambda_{T}=5}$ . In addition, as long as the different values of $\lambda_{T}$ are all less than the minimum distance, $\delta_{T}$ , between two successive change-points, then the results are extremely similar (if not identical); see for example Tables 9 and 13. On the other hand, when $\lambda_{T}>\delta_{T}$ the accuracy is getting worse; compare for example the results for the three different values of $\lambda_{T}$ in Models (M2), (M3), (M5), (W2), and (W3). In terms of speed, as expected, the larger the value of $\lambda_{T}$ , the quicker the method in general.

7 Additional real-data example

In this section we explore the behaviour of our method and two competitors, CPOP and NOT, to the daily closing stock prices of Samsung Electronics Co. from July 2012 until June 2020. The data are available from https://finance.yahoo.com/quote/005930.KS/history?p=005930.KS and they were accessed in July 2020. We look for changes in a continuous piecewise-linear mean signal. Figure 8 shows the results for the ID, ID.SDLL, NOT and CPOP methods, which detect $165,134,22,239$ change-points, respectively. From both the fit and the residuals given in Figure 8, it is not easy to say which of the three methods gives the “best” number of change-points.

ID can return a range of different fits providing users with the flexibility to choose according to their preference. In Figure 9, we use the solution path and we obtain the estimated signal and the raw residuals of ID for $\hat{N}=22$ and $\hat{N}=239$ , which are the estimated change-point numbers through NOT and CPOP, respectively. The fit is similar to those obtained by the aforementioned methods as presented in Figure 8. However, we note that both those competitors are significantly slower than ID; see Tables 6 and 7 in the main paper and Tables 6 - 8 in the supplement for a comparison. To conclude, apart from returning the estimated fit, the ID methodology can directly, and without any extra effort, produce a series of estimated signals based on the solution path defined in (10).

8 Proof of the theorems in the paper and of Corollary 1

From now on, the contrast vector $\bm{\psi_{s,e}^{b}}=(\psi_{s,e}^{b}(1),\psi_{s,e}^{b}(2),\ldots,\psi_{s,e}^{b}(T))$ is defined through the contrast function

[TABLE]

where $s\leq b<e$ . Notice that for any vector $\bm{v}=(v_{1},v_{2},\ldots,v_{T})$ , we have that $\langle\bm{v},\bm{\psi_{s,e}^{b}}\rangle=\tilde{v}_{s,e}^{b}.$ We first present Lemma 1, which is partly used for the proof of Theorem 3.1,

Lemma 1

Suppose $\bm{f}=\left(f_{1},f_{2},\ldots,f_{T}\right)^{\intercal}$ is a piecewise-constant vector. Pick any interval $[s,e]\subset[1,T]$ such that $[s,e-1]$ contains exactly one change-point $r_{j}$ . Let $\rho=|r_{j}-b|$ , $\Delta_{j}^{f}=\left|f_{r_{j}+1}-f_{r_{j}}\right|$ , $\eta_{L}=r_{j}-s+1$ and $\eta_{R}=e-r_{j}$ . Then,

[TABLE]

In addition,

for any $r_{j}\leq b<e$ , $\left(\tilde{f}_{s,e}^{r_{j}}\right)^{2}-\left(\tilde{f}_{s,e}^{b}\right)^{2}=\left(\rho\eta_{L}/(\rho+\eta_{L})\right)\left(\Delta_{j}^{f}\right)^{2}$ ; 2. 2.

for any $s\leq b<r_{j}$ , $\left(\tilde{f}_{s,e}^{r_{j}}\right)^{2}-\left(\tilde{f}_{s,e}^{b}\right)^{2}=\left(\rho\eta_{R}/(\rho+\eta_{R})\right)\left(\Delta_{j}^{f}\right)^{2}$ ;

Proof

See Lemma 4 from Baranowski et al. (2019).

**Brief discussion of the steps of the proof of Theorem 3.1

**Before proceeding with the thorough mathematical proof, we give an informal explanation of the main steps. In the main part of the proof, we derive results for the signal $f_{t}$ . However, the consistency is concerned with the estimated number and locations of the change-points in the observed process $\left\{X_{t}\right\}_{t=1,2,\ldots,T}$ . Therefore, in order to be able to deduce consistency related to $X_{t}$ from our $f_{t}$ -reliant proof, we need first to show that for all $1\leq s\leq b<e\leq T$ , the observed quantity $\left|\tilde{X}_{s,e}^{b}\right|$ is uniformly close to the unobserved $\left|\tilde{f}_{s,e}^{b}\right|$ ; this is achieved in Step 1. In Step 2, for $b_{1},b_{2}\in[s,e)$ , we control the distance between the noised $\left|\tilde{X}_{s,e}^{b_{1}}\right|-\left|\tilde{X}_{s,e}^{b_{2}}\right|$ and its noiseless equivalent $\left|\tilde{f}_{s,e}^{b_{1}}\right|-\left|\tilde{f}_{s,e}^{b_{2}}\right|$ for all possible combinations of $s,e,b_{1},b_{2}$ . This allows us to transfer the decision on whether $b_{1}$ or $b_{2}$ is more suitable as a change-point, from $\left|\tilde{f}_{s,e}^{b_{1}}\right|-\left|\tilde{f}_{s,e}^{b_{2}}\right|$ to the calculable $\left|\tilde{X}_{s,e}^{b_{1}}\right|-\left|\tilde{X}_{s,e}^{b_{2}}\right|$ . Step 3 is the main part of our proof, where we first show that as the ID algorithm proceeds, each change-point will get isolated in an interval where its detection will occur with high probability. Therefore, it suffices to restrict our proof to a single change-point detection framework, and the convergence rate is proved to hold for each estimated location. Because upon detection ID proceeds from the end-point (or start-point) of the interval where the detection occurred, we also show that with probability one there is no change-point in those bypassed points (between the detection and the new start- or end-point). Furthermore, in Step 3 it is shown that, the new start- and end-points are at places that allow the detection of the next change-point. In Step 4, we conclude the proof by showing that after detecting all change-points, then ID, with high probability, will terminate after scanning all the remaining data. We mention that for our proof, we employ Lemma 1 given in the online supplement.

Proof of Theorem 3.1. We will prove the more specific result

[TABLE]

which implies the result in (5).

Step 1: Allow us to denote by

[TABLE]

We will show that $\mathbb{P}\,\left(A_{T}\right)\geq 1-1/(12\sqrt{\pi}T)$ . From (3) and (4), simple steps yield $\tilde{X}_{s,e}^{b}-\tilde{f}_{s,e}^{b}=\tilde{\epsilon}_{s,e}^{b}$ , where $\tilde{\epsilon}_{s,e}^{b}\sim\mathcal{N}(0,1)$ . Thus, for $Z\sim\mathcal{N}(0,1)$ , using the Bonferroni inequality we get that

[TABLE]

where $\phi(\cdot)$ is the probability density function of the standard normal distribution.

Step 2: For intervals $[s,e)$ that contain only one true change-point $r_{j}$ , we denote by

[TABLE]

Because $\left|\left\langle\bm{\psi_{s,e}^{b}}\langle\bm{f},\bm{\psi_{s,e}^{b}}\rangle-\bm{\psi_{s,e}^{r_{j}}}\langle\bm{f},\bm{\psi_{s,e}^{r_{j}}}\rangle\right\rangle\right|/\left(\|\bm{\psi_{s,e}^{b}}\langle\bm{f},\bm{\psi_{s,e}^{b}}\rangle-\bm{\psi_{s,e}^{r_{j}}}\langle\bm{f},\bm{\psi_{s,e}^{r_{j}}}\rangle\|_{2}\right)$ follows the standard normal distribution, then we use a similar approach as in Step 1, to show that $\mathbb{P}\,\left(\left(B_{T}\right)^{c}\right)\leq\frac{1}{12\sqrt{\pi}T}$ . Therefore, Steps 1 and 2 lead to

[TABLE]

Step 3: This is the main part of our proof, where we explain in detail how to get the result in (17). For ease of understanding, we split this step into two smaller parts. From now on, we assume that $A_{T}$ and $B_{T}$ both hold. The constants we use are

[TABLE]

where $\underline{C}$ is as in condition (A1).

Step 3.1: For ease of presentation, we take $\lambda_{T}\leq\delta_{T}/3$ ; see Remark 1 for comments in regards to the general case of $\lambda_{T}\leq\delta_{T}/m$ , for an $m>1$ . Allow us now $\forall j\in\left\{1,2,\ldots,N\right\}$ , to define the intervals

[TABLE]

In order for $I_{j}^{R}$ and $I_{j}^{L}$ to have at least one point, we actually implicitly require that $\delta_{T}>3$ , which is the case for sufficiently large $T$ ; see assumption (A1). Since the length of the intervals in (21) is equal to $\delta_{T}/3$ and $\lambda_{T}\leq\delta_{T}/3$ , then ID ensures that for $K=\left\lceil T/\lambda_{T}\right\rceil$ and $k,m\in\left\{1,2,\ldots,K\right\}$ , there exists at least one $c_{k}^{r}=k\lambda_{T}$ and at least one $c_{m}^{l}=T-m\lambda_{T}+1$ that are in $I_{j}^{R}$ and $I_{j}^{L}$ , $\forall j=1,2,\ldots,N$ . At the beginning of our algorithm, $s=1$ , $e=T$ and depending on whether $r_{1}\leq T-r_{N}$ then $r_{1}$ or $r_{N}$ will get isolated in a right- or left-expanding interval, respectively. W.l.o.g., assume that $r_{1}\leq T-r_{N}$ . As already mentioned, ID naturally ensures that $\exists k\in\left\{1,2,\ldots,K\right\}$ such that $c_{k}^{r}\in I_{1}^{R}$ . There is no other change-point in $[1,c_{k}^{r}]$ apart from $r_{1}$ . We will show that for $\tilde{b}={\rm argmax}_{1\leq t<c_{k}^{r}}\left|\tilde{X}_{1,c_{k}^{r}}^{t}\right|$ , then $\left|\tilde{X}_{1,c_{k}^{r}}^{\tilde{b}}\right|>\zeta_{T}$ . Using (18), we have that

[TABLE]

But,

[TABLE]

By the definition of $\delta_{T}$ and from our notation of $r_{0}=0$ , we know that $r_{1}\geq\delta_{T}$ . In addition, since $c_{k}^{r}\in I_{1}^{R}$ , then $\delta_{T}/3\leq c_{k}^{r}-r_{1}<2\delta_{T}/3$ , meaning that

[TABLE]

The result in (22), the assumption (A1) and the application of (24) in (8) yield

[TABLE]

Therefore, there will be an interval of the form $[1,c_{\tilde{k}}^{r}]$ , with $c_{\tilde{k}}^{r}>r_{1}$ , such that $[1,c_{\tilde{k}}^{r}]$ contains only $r_{1}$ and $\max_{1\leq b<c_{\tilde{k}}^{r}}\left|\tilde{X}_{1,c_{\tilde{k}}^{r}}^{b}\right|>\zeta_{T}$ . Let us, for $k^{*}\in\left\{1,2,\ldots,K\right\}$ , to denote by $c_{k^{*}}^{r}\leq c_{\tilde{k}}^{r}$ the first right-expanding point where this happens and let $b_{1}={\rm argmax}_{1\leq t<c_{k^{*}}^{r}}\left|\tilde{X}_{1,c_{k^{*}}^{r}}^{t}\right|$ with $\left|\tilde{X}_{1,c_{k^{*}}^{r}}^{b_{1}}\right|>\zeta_{T}$ . Our aim now is to find $\gamma_{T}>0$ such that for any $b^{*}\in\left\{1,2,\ldots,c_{k^{*}}^{r}-1\right\}$ with $\left|b^{*}-r_{1}\right|\left(\Delta_{1}^{f}\right)^{2}>\gamma_{T}$ , we have that

[TABLE]

Proving (25) and using the definition of $b_{1}$ we can conclude that $|b_{1}-r_{1}|\left(\Delta_{1}^{f}\right)^{2}\leq\gamma_{T}$ . Now, since $X_{t}=f_{t}+\epsilon_{t}$ , then (25) can be expressed as

[TABLE]

W.l.o.g. assume that $b^{*}\geq r_{1}$ and a similar approach as below holds when $b^{*}<r_{1}$ . Lemma 1, gives for the left-hand side of the inequality in (8) that

[TABLE]

For the terms on the right-hand side of (8), using (18) we obtain that

[TABLE]

while from (19) and Lemma 1,

[TABLE]

Therefore (8) is satisfied if the stronger inequality $\Lambda>8\log T+2\sqrt{\Lambda}\sqrt{8\log T}$ is satisfied, which has solution

[TABLE]

From (27) and since $(\left|b^{*}-r_{1}\right|r_{1})/(\left|b^{*}-r_{1}\right|+r_{1})\geq\min\left\{\left|b^{*}-r_{1}\right|,r_{1}\right\}/2$ , we deduce that (25) is implied by

[TABLE]

However,

[TABLE]

and this is because if we assume that $\min\left\{r_{1},c_{k^{*}}^{r}-r_{1}\right\}\leq C_{3}\log T/\left(\Delta_{1}^{f}\right)^{2}$ , then

[TABLE]

This comes to a contradiction to $\left|\tilde{X}_{1,c_{k^{*}}^{r}}^{b_{1}}\right|>\zeta_{T}$ . Therefore, (29) holds and (28) is restricted to $\left|b^{*}-r_{1}\right|\left(\Delta_{1}^{f}\right)^{2}>C_{3}\log T$ , which implies (25). Thus, we conclude that necessarily,

[TABLE]

So far, for $\lambda_{T}\leq\delta_{T}/3$ we have proven that working under the assumption that $A_{T}$ and $B_{T}$ hold, there will be an interval $[1,c_{k^{*}}^{r}]$ , with $\left|\tilde{X}_{1,c_{k^{*}}^{r}}^{b_{1}}\right|>\zeta_{T}$ , where $b_{1}=\underset{1\leq t<c_{k^{*}}^{r}}{{\rm argmax}}\left|\tilde{X}_{1,c_{k^{*}}^{r}}^{t}\right|$ is an estimation of $r_{1}$ that satisfies (30).

Step 3.2: After detecting the first change-point, ID follows the same process as in Step 3.1 but in the set $[c_{k^{*}}^{r},T]$ , which contains $r_{2},r_{3},\ldots,r_{N}$ . This means that we bypass, without checking for possible change-points, the interval $[b_{1}+1,c_{k^{*}}^{r})$ and we need to prove that:

(S.1)

There is no change-point in $[b_{1}+1,c_{k^{*}}^{r})$ , apart from maybe the already detected $r_{1}$ ;

(S.2)

$c_{k^{*}}^{r}$ is at a location which allows for detection of $r_{2}$ .

For (S.1): We will split the explanation into two cases with respect to the location of $b_{1}$ .

Case 1: $b_{1}<r_{1}<c_{k^{*}}^{r}$ . Using (30) and imposing the condition

[TABLE]

then since $c_{\tilde{k}}^{r}\in I_{1}^{R}$ , we have that

[TABLE]

Since $r_{2}-r_{1}\geq\delta_{T}$ and $r_{1}$ is already in $[b_{1}+1,c_{k^{*}}^{r})$ , then there is no other change-point in $[b_{1}+1,c_{k^{*}}^{r})$ apart from $r_{1}$ . Actually, the result in (31) is not an extra assumption and we will briefly explain the reason at the end of our proof.

Case 2: $r_{1}\leq b_{1}<c_{k^{*}}^{r}$ . Since $c_{\tilde{k}}^{r}\in I_{1}^{R}$ , then $c_{k^{*}}^{r}-r_{1}\leq c_{\tilde{k}}^{r}-r_{1}<2\delta_{T}/3$ , which means that apart from $r_{1}$ there is no other change-point in $[r_{1},c_{k^{*}}^{r})$ . With $r_{1}\leq b_{1}$ , then $[b_{1}+1,c_{k^{*}}^{r})$ does not have any change-point.

Cases 1 and 2 above show that no matter the location of $b_{1}$ , there is no change-point in $[b_{1}+1,c_{k^{*}}^{r})$ other than possibly the previously detected $r_{1}$ . Similarly to the approach in Step 3.1, our method applied now in $[c_{k^{*}}^{r},T]$ , will first isolate $r_{2}$ or $r_{N}$ depending on whether $r_{2}-c_{k^{*}}^{r}$ is smaller or larger than $T-r_{N}$ . If $T-r_{N}<r_{2}-c_{k^{*}}^{r}$ then $r_{N}$ will get isolated first in a left-expanding interval and the procedure to show its detection is exactly the same as for the detection of $r_{1}$ in Step 3.1. Therefore, for the sake of showing (S.2) let us assume that $r_{2}-c_{k^{*}}^{r}\leq T-r_{N}$ .

For (S.2): With ${\rm R}_{s,e}$ as in (2), there exists $c_{k_{2}}^{r}\in{\rm R}_{c_{k^{*}}^{r},T}$ such that $c_{k_{2}}^{r}\in I_{2}^{R}$ , with $I_{j}^{R}$ defined in (21). We will show that $r_{2}$ gets detected in $[c_{k^{*}}^{r},c_{k^{*}_{2}}^{r}]$ , for $k_{2}^{*}\leq k_{2}$ and its detection is $b_{2}={\rm argmax}_{c_{k^{*}}^{r}\leq t<c_{k^{*}_{2}}^{r}}\left|\tilde{X}_{c_{k^{*}}^{r},c_{k^{*}_{2}}^{r}}^{t}\right|$ , which satisfies $\left|b_{2}-r_{2}\right|\left(\Delta_{2}^{f}\right)^{2}\leq C_{3}\log T$ . Following similar steps as in (8), we have that for $\tilde{b}_{2}=\underset{c_{k^{*}}^{r}\leq t<c_{k_{2}}^{r}}{\rm argmax}\left|\tilde{X}_{c_{k^{*}}^{r},c_{k_{2}}^{r}}^{t}\right|$ ,

[TABLE]

By construction, $c_{k_{2}}^{r}-r_{2}\geq\delta_{T}/3$ and

[TABLE]

which means that $\min\left\{c_{k_{2}}^{r}-r_{2},r_{2}-c_{k^{*}}^{r}+1\right\}\geq(\delta_{T}/3)$ and therefore continuing from (32),

[TABLE]

Therefore, for a $c_{\tilde{k}_{2}}^{r}\in{\rm{R}}_{c_{k^{*}}^{r},T}$ we have shown that there exists an interval of the form $[c_{k^{*}}^{r},c_{\tilde{k}_{2}}^{r}]$ , with $\max_{c_{k^{*}}^{r}\leq b<c_{\tilde{k}_{2}}^{r}}\left|\tilde{X}_{c_{k^{*}}^{r},c_{\tilde{k}_{2}}^{r}}^{b}\right|>\zeta_{T}$ . Let us denote by $c_{k^{*}_{2}}^{r}\in{\rm{R}}_{c_{k^{*}}^{r},T}$ the first right-expanding point where this occurs and let $b_{2}={\rm argmax}_{c_{k^{*}}^{r}\leq t<c_{k^{*}_{2}}^{r}}\left|\tilde{X}_{c_{k^{*}}^{r},c_{k^{*}_{2}}^{r}}^{t}\right|$ with $\left|\tilde{X}_{c_{k^{*}}^{r},c_{k^{*}_{2}}^{r}}^{b_{2}}\right|>\zeta_{T}$ .

We will now show that $\left|b_{2}-r_{2}\right|\left(\Delta_{2}^{f}\right)^{2}\leq C_{3}\log T$ . Following exactly the same process as in Step 3.1 and assuming now w.l.o.g. that $b_{2}<r_{2}$ , we have that for $b^{*}\in\left\{c_{k^{*}}^{r},\ldots,c_{k^{*}_{2}}^{r}-1\right\}$ ,

[TABLE]

is implied by $\min\left\{\left|b^{*}-r_{2}\right|,c_{k^{*}_{2}}^{r}-r_{2}\right\}>C_{3}\log T/\left(\Delta_{2}^{f}\right)^{2}$ . In the same way as in Step 3.1 and by contradiction we can show that $\min\left\{c_{k^{*}_{2}}^{r}-r_{2},r_{2}-c_{k^{*}}^{r}+1\right\}>C_{3}\log T/\left(\Delta_{2}^{f}\right)^{2}$ and (33) is implied by $\left|b^{*}-r_{2}\right|\left(\Delta_{2}^{f}\right)^{2}>C_{3}\log T$ . Therefore $\left|b_{2}-r_{2}\right|\left(\Delta_{2}^{f}\right)^{2}>C_{3}\log T$ would mean that $\left|\tilde{X}_{c_{k^{*}}^{r},c_{k^{*}_{2}}^{r}}^{r_{2}}\right|>\left|\tilde{X}_{c_{k^{*}}^{r},c_{k^{*}_{2}}^{r}}^{b_{2}}\right|$ , which is not true by the definition of $b_{2}$ . Having said this, we conclude that $\left|b_{2}-r_{2}\right|\left(\Delta_{2}^{f}\right)^{2}\leq C_{3}\log T$ . Having detected $r_{2}$ , then our algorithm will proceed in the interval $[s,e]=[c_{k^{*}_{2}}^{r},T]$ and all the change-points will get detected one by one since Step 3.2 will be applicable as long as there are undetected change-points in $[s,e]$ .

Denoting by $\hat{r}_{j}$ the estimation of $r_{j}$ as we did in the statement of the theorem, then we conclude that all change-points will get detected one by one and $\left|\hat{r}_{j}-r_{j}\right|\left(\Delta_{j}^{f}\right)^{2}\leq C_{3}\log T,\quad\forall j\in\left\{1,2,\ldots,N\right\}$ . In addition, as one can see from (31), our process imposes that $\delta_{T}>3C_{3}\log T/\left(\Delta_{j}^{f}\right)^{2},\forall j\in\left\{1,2,\ldots,N\right\}$ , which, by the definition of $\underline{f}_{T}$ , is implied by

[TABLE]

We will now explain why (34) is not actually an extra assumption but it is implied by our assumption (A1), which requires $\delta_{T}\geq\underline{C}^{2}\log T/\underline{f}_{T}^{2}$ . Proving that $\underline{C}>\sqrt{3C_{3}}$ would mean that indeed (A1) implies (34). Due to $C_{1}\sqrt{\log T}\leq\zeta_{T}<C_{2}\sqrt{\delta_{T}}\underline{f}_{T}$ , we require $\underline{C}$ to be such that $\underline{C}C_{2}>C_{1}$ . Simple steps yield

[TABLE]

We conclude that $\underline{C}>\sqrt{3C_{3}}$ , meaning that (34) is something already satisfied due to (A1).

Step 4: The arguments given in Steps 1-3 hold in $A_{T}\cap B_{T}$ . At the beginning of the algorithm, $s=1,e=T$ and for $N\geq 1$ , there exist $k_{1}\in\left\{1,2,\ldots,K\right\}$ such that $s_{k_{1}}=s,e_{k_{1}}\in I_{1}^{R}$ and $k_{2}\in\left\{1,2,\ldots,K\right\}$ such that $s_{k_{2}}\in I_{N}^{L},e_{k_{2}}=e$ . As in our previous steps, w.l.o.g. assume that $r_{1}\leq T-r_{N}$ and $r_{1}$ gets isolated and detected first in an interval $[s,c_{k^{*}}^{r}]$ , where $c_{k^{*}}^{r}\in{\rm{R}}_{1,T}$ and it is less than or equal to $e_{k_{1}}$ . Then, $\hat{r}_{1}={\rm argmax}_{s\leq t<c_{k^{*}}^{r}}|\tilde{X}_{s,c_{k^{*}}^{r}}^{t}|$ is the estimated location for $r_{1}$ and $\left|r_{1}-\hat{r}_{1}\right|\left(\Delta_{1}^{f}\right)^{2}\leq C_{3}\log T$ . After this, the method continues in $[c_{k^{*}}^{r},T]$ and keeps detecting all the change-points as explained in Step 3. There will not be any double detection issues because naturally, at each step of the algorithm, the new interval $[s,e]$ does not include any previously detected change-points. Once all the change-points have been detected one by one, then $[s,e]$ will contain no other change-points. ID will keep checking for possible change-points in intervals of the form $\left[s,c_{\tilde{k}_{1}}^{r}\right]$ and $\left[c_{\tilde{k}_{2}}^{l},e\right]$ for $c_{\tilde{k}_{1}}^{r}\in{\rm{R}}_{s,e}$ and $c_{\tilde{k}_{2}}^{l}\in{\rm{L}}_{s,e}$ . We denote by $[s^{*},e^{*}]$ any of these intervals. ID will not detect anything in $[s^{*},e^{*}]$ since $\forall b\in[s^{*},e^{*})$ ,

[TABLE]

After not detecting anything in all intervals of the above form, then the algorithm concludes that there are not any change-points in $[s,e]$ and stops. $\qquad\qquad\qquad\;\;\;\;\blacksquare$

Remark 1

It is interesting to explore what happens when instead of $\lambda_{T}\leq\delta_{T}/3$ , we use the more general case of $\lambda_{T}\leq\delta_{T}/m$ , for $m>1$ . The adjustments need to be made are

(Adj.1)

Instead of the definition in (21), we now have

[TABLE]

Note that the length of the above intervals is $\delta_{T}/m$ , meaning that with probability one there will be at least one left and one right expanding point in each of them because the distance between two consecutive right (left) expanding points is $\lambda_{T}\leq\delta_{T}/m$ .

(Adj.2)

Instead of $C_{2}$ as in (20), we should now use that $C_{2}=\sqrt{(m-1)/4m}-2\sqrt{2}/\underline{C}$ . This is easy to prove and it will not be shown here.

(Adj.3)

In (35), we give a lower bound for $\underline{C}$ . Following similar steps, this now becomes

[TABLE]

We see from (Adj.3) that the higher the value of $m$ , the smaller the lower bound will be, meaning that the assumption on $\underline{C}$ gets possibly relaxed for larger values of $m$ . On the other hand, the results above hold for an expanding level of $\lambda_{T}\leq\delta_{T}/m$ and thus, we notice that the smaller the value of $m$ , the larger the upper bound for the acceptable $\lambda_{T}$ -values. Our choice of $m=3$ gives a more symmetric aspect to our approach as the length of the intervals $I_{j}^{R}$ and $I_{j}^{L}$ is the same as the minimum distance of their start- and end-points from possible change-points, which is $\delta_{T}/3$ .

We now proceed to prove the result in Theorem 3.2. For the continuous piecewise-linear case, the contrast function values at $b$ for the observed data, the signal, and the noise are denoted by $C_{s,e}^{b}(\bm{X})$ , $C_{s,e}^{b}(\bm{f})$ and $C_{s,e}^{b}(\bm{\epsilon})$ , respectively. We have $\Delta_{j}^{f}=\left|2f_{r_{j}}-f_{r_{j}-1}-f_{r_{j}+1}\right|$ and as in the case of piecewise-constancy, $\underline{f}_{T}=\underset{j=1,2,\ldots,N}{\min}\Delta_{j}^{f}$ . The contrast vector $\bm{\phi_{s,e}^{b}}=(\phi_{s,e}^{b}(1),\phi_{s,e}^{b}(2),\ldots,\phi_{s,e}^{b}(T))$ is defined through the contrast function

[TABLE]

where $\alpha_{s,e}^{b}=(6/[n(n^{2}-1)(1+(e-b+1)(b-s+1)+(e-b)(b-s))])^{1/2}$ and $\beta_{s,e}^{b}=([(e-b+1)(e-b)]/[(b-s+1)(b-s)])^{1/2}$ , with $n=e-s+1$ . For any vector $\bm{v}=(v_{1},v_{2},\ldots,v_{T})$ , we have that

[TABLE]

Towards the proof of Theorem 3.2, we use Lemmas 2 and 3 given below.

Lemma 2

Suppose $\bm{f}=\left(f_{1},f_{2},\ldots,f_{T}\right)^{\intercal}$ is piecewise-linear vector and $r_{1},\ldots,r_{N}$ are the locations of the change-points. Suppose $1\leq s<e\leq T$ , such that $r_{j-1}\leq s<r_{j}<e\leq r_{j+1}$ , for some $j=1,2,\ldots,N$ . Let $\eta=\min\left\{r_{j}-s,e-r_{j}\right\}$ . Then,

[TABLE]

Proof

See Lemma 5 from Baranowski et al. (2019).

Lemma 3

Suppose $\bm{f}=\left(f_{1},f_{2},\ldots,f_{T}\right)^{\intercal}$ is piecewise-linear vector. With $r_{1},r_{2},\ldots,r_{N}$ the locations of the change-points, suppose that $1\leq s<e\leq T$ , such that $r_{j-1}\leq s<r_{j}<e\leq r_{j+1}$ for some $j=1,2,\ldots,N$ . Let $\rho=|r_{j}-b|$ , $\Delta_{j}^{f}=\left|2f_{r_{j}}-f_{r_{j}-1}-f_{r_{j}+1}\right|$ , $\eta_{L}=r_{j}-s$ and $\eta_{R}=e-r_{j}$ . Then,

[TABLE]

Furthermore,

for any $r_{j}\leq b<e$ , $\left(C_{s,e}^{r_{j}}(\bm{f})\right)^{2}-\left(C_{s,e}^{b}(\bm{f})\right)^{2}\geq(1/63)\min(\rho,\eta_{L})^{3}\left(\Delta_{j}^{f}\right)^{2}$ ; 2. 2.

for any $s<b\leq r_{j}$ , $\left(C_{s,e}^{r_{j}}(\bm{f})\right)^{2}-\left(C_{s,e}^{b}(\bm{f})\right)^{2}\geq(1/63)\min(\rho,\eta_{R})^{3}\left(\Delta_{j}^{f}\right)^{2}$ .

Proof

See Lemma 7 from Baranowski et al. (2019), where the approach is similar as to the one for Lemma 1.

The steps we follow for the proof of Theorem 3.2 are the same as those explained for the proof of Theorem 3.1.

Proof of Theorem 3.2. We will prove the more specific result

[TABLE]

which implies the result in (8).

Steps 1 and 2: As in Theorem 3.1, let

[TABLE]

The same reasoning as in the proof of Theorem 3.1 leads to $\mathbb{P}\,\left(A^{*}_{T}\right)\geq 1-1/(12\sqrt{\pi}T)$ and $\mathbb{P}\,\left(B^{*}_{T}\right)\geq 1-1/(12\sqrt{\pi}T)$ . Therefore, Steps 1 and 2 lead to

[TABLE]

Step 3: This is the main part of our proof, where we explain in detail how to get the result in (36). From now on, we assume that $A_{T}^{*}$ and $B_{T}^{*}$ both hold. The constants we use are

[TABLE]

where $C^{*}$ is as in assumption (A2).

Step 3.1: First, $\forall j\in\left\{1,2,\ldots,N\right\}$ , we define $I_{j}^{R}$ and $I_{j}^{L}$ as in (21). At the beginning of our algorithm, $s=1$ , $e=T$ and depending on whether $r_{1}\leq T-r_{N}$ then $r_{1}$ or $r_{N}$ will get isolated first, respectively. W.l.o.g., assume that $r_{1}\leq T-r_{N}$ . Our aim is to first show that there will be at least an interval of the form $[1,c_{\tilde{k}}^{r}]$ , for $\tilde{k}\in\left\{1,2,\ldots,K\right\}$ , which contains only $r_{1}$ and no other change-point, such that $\max_{1\leq b<c_{\tilde{k}}^{r}}C_{1,c_{\tilde{k}}^{r}}^{b}>\zeta_{T}$ . Due to ID’s nature, for $K=\left\lceil T/\lambda_{T}\right\rceil$ , then $\exists\tilde{k}\in\left\{1,2,\ldots,K\right\}$ such that $c_{\tilde{k}}^{r}=\tilde{k}\lambda_{T}\in I_{1}^{R}$ and there is no other change-point in $[1,c_{\tilde{k}}^{r}]$ apart from $r_{1}$ . We will now show that for $\tilde{b}_{1}={\rm argmax}_{1<t<c_{\tilde{k}}^{r}}C_{1,c_{\tilde{k}}^{r}}^{t}(\bm{X})$ , then $C_{1,c_{\tilde{k}}^{r}}^{\tilde{b}_{1}}(\bm{X})>\zeta_{T}$ . Firstly, we have that

[TABLE]

From Lemma 2, we know that $C_{1,c_{\tilde{k}}^{r}}^{r_{1}}(\bm{f})\geq 1/(\sqrt{24})\left(\min\left\{r_{1}-1,c_{\tilde{k}}^{r}-r_{1}\right\}\right)^{3/2}\Delta_{1}^{f}$ . Now, $r_{1}-1=r_{1}-r_{0}-1\geq\delta_{T}-1>\delta_{T}/3$ , because for continuous piecewise-linear signals we have that $\delta_{T}\geq 2$ for identifiability purposes. In addition, since $c_{\tilde{k}}^{r}\in I_{1}^{R}$ , then $c_{\tilde{k}}^{r}-r_{1}\geq\delta_{T}/3$ , meaning that

[TABLE]

The result in (37), the assumption (A2) and (38) yield

[TABLE]

Therefore, there will be an interval of the form $[1,c_{\tilde{k}}^{r}]$ , with $c_{\tilde{k}}^{r}>r_{1}$ , such that $[1,c_{\tilde{k}}^{r}]$ contains only $r_{1}$ and $\max_{1\leq b<c_{\tilde{k}}^{r}}C_{1,c_{\tilde{k}}^{r}}^{b}>\zeta_{T}$ . Let us, for $k^{*}\in\left\{1,2,\ldots,K\right\}$ , denote by $c_{k^{*}}^{r}\leq c_{\tilde{k}}^{r}$ the first right-expanding point where this happens and let $b_{1}={\rm argmax}_{1\leq t<c_{k^{*}}^{r}}C_{1,c_{k^{*}}^{r}}^{t}$ with $C_{1,c_{k^{*}}^{r}}^{b_{1}}>\zeta_{T}$ . Note that $b_{1}$ can not be an estimation of any other change-point as $[1,c_{k^{*}}^{r}]$ includes only $r_{1}$ .

Our aim now is to find $\tilde{\gamma}_{T}>0$ such that for any $b^{*}\in\left\{1,2,\ldots,c_{k^{*}}^{r}-1\right\}$ with $\left|b^{*}-r_{1}\right|\left(\Delta_{1}^{f}\right)^{2/3}>\tilde{\gamma}_{T}$ , we have

[TABLE]

Proving (40) and using the definition of $b_{1}$ we can conclude that $|b_{1}-r_{1}|\left(\Delta_{1}^{f}\right)^{2/3}\leq\tilde{\gamma}_{T}$ . Since $X_{t}=f_{t}+\epsilon_{t}$ , then (40) can be expressed as

[TABLE]

W.l.o.g. assume that $b^{*}\geq r_{1}$ and a similar approach as below holds when $b^{*}<r_{1}$ . We denote by

[TABLE]

and for the terms in the right-hand side of (8), we get that

[TABLE]

while from Lemma 3,

[TABLE]

Therefore (8) is satisfied if the stronger inequality

[TABLE]

is satisfied, which has solution

[TABLE]

Using Lemma 3, we have that (42) is implied by

[TABLE]

However,

[TABLE]

and this is because if we assume that

[TABLE]

yields

[TABLE]

This comes to a contradiction to $C_{1,c_{k^{*}}^{r}}^{b_{1}}(\bm{X})>\zeta_{T}$ . Therefore, (44) holds and for sufficiently large $T$ ,

[TABLE]

From (45) we deduce that (8) is restricted to

[TABLE]

which implies (40). Therefore, necessarily,

[TABLE]

So far, for $\lambda_{T}\leq\delta_{T}/3$ we have proven that working under the sets $A_{T}^{*}$ and $B_{T}^{*}$ , there will be an interval of the form $[1,c_{k^{*}}^{r}]$ , with $C_{1,c_{k^{*}}^{r}}^{b_{1}}>\zeta_{T}$ , where $b_{1}=\underset{1\leq t<c_{k^{*}}^{r}}{\rm argmax}\;C_{1,c_{k^{*}}^{r}}^{t}$ is an estimation of $r_{1}$ that satisfies (46).

Step 3.2: After detecting the first change-point, ID follows the same process as in Step 3.1 in the set $[c_{k^{*}}^{r},T]$ , which contains $r_{2},r_{3},\ldots,r_{N}$ . This means that we do not check for possible change-points in the interval $[b_{1}+1,c_{k^{*}}^{r})$ . Therefore, we need to prove that:

(S.1)

There is no other change-point in $[b_{1}+1,c_{k^{*}}^{r})$ , apart from possibly the already detected $r_{1}$ ;

(S.2)

$c_{k^{*}}^{r}$ is at a location which allows for detection of $r_{2}$ .

For (S.1): The approach is the same as the one in Step 3.2 in the proof of Theorem 3.1 and will not be repeated here.

Similarly to the approach in Step 3.1, our method applied now to $[c_{k^{*}}^{r},T]$ , will first detect $r_{2}$ or $r_{N}$ depending on whether $r_{2}-c_{k^{*}}^{r}$ is smaller or larger than $T-r_{N}$ . If $T-r_{N}<r_{2}-c_{k^{*}}^{r}$ then $r_{N}$ will get isolated first and the procedure to show its detection is exactly the same as in Step 3.1 where we explained the detection of $r_{1}$ . Therefore, w.l.o.g. and also for the sake of showing (S.2) let us assume that $r_{2}-c_{k^{*}}^{r}\leq T-r_{N}$ .

For (S.2): With ${\rm{R}}_{s,e}$ as in (2), there exists $c_{k_{2}}^{r}\in{\rm{R}}_{c_{k^{*}}^{r},T}$ such that $c_{k_{2}}^{r}\in I_{2}^{R}$ . We will show that $r_{2}$ gets detected in $[c_{k^{*}}^{r},c_{k^{*}_{2}}^{r}]$ , for $k_{2}^{*}\leq k_{2}$ and its detection is $b_{2}={\rm argmax}_{c_{k^{*}}^{r}\leq t<c_{k^{*}_{2}}^{r}}C_{c_{k^{*}}^{r},c_{k^{*}_{2}}^{r}}^{t}(\bm{X})$ , which satisfies $\left|b_{2}-r_{2}\right|\left(\Delta_{2}^{f}\right)^{2/3}\leq C_{3}(\log T)^{1/3}$ . Using again Lemma 3 and for $\tilde{b}_{2}={\rm argmax}_{c_{k^{*}}^{r}\leq t<c_{k_{2}}^{r}}C_{c_{k^{*}}^{r},c_{k_{2}}^{r}}^{t}$ , we have that

[TABLE]

By construction,

[TABLE]

which means that $\min\left\{c_{k_{2}}^{r}-r_{2},r_{2}-c_{k^{*}}^{r}\right\}\geq\delta_{T}/3$ . Therefore, continuing from (8) and using the exact same calculations as in (8), we have that

[TABLE]

Therefore, for a $c_{\tilde{k}_{2}}^{r}\in{\rm{R}}_{c_{k^{*}}^{r},T}$ we have shown that there exists an interval of the form $[c_{k^{*}}^{r},c_{\tilde{k}_{2}}^{r}]$ , with $\max_{c_{k^{*}}^{r}\leq b<c_{\tilde{k}_{2}}^{r}}C_{c_{k^{*}}^{r},c_{\tilde{k}_{2}}^{r}}^{b}>\zeta_{T}$ . Let us denote by $c_{k^{*}_{2}}^{r}\in{\rm{R}}_{c_{k^{*}}^{r},T}$ the first right-expanding point where this occurs and let $b_{2}={\rm argmax}_{c_{k^{*}}^{r}\leq t<c_{k^{*}_{2}}^{r}}C_{c_{k^{*}}^{r},c_{k^{*}_{2}}^{r}}^{t}$ with $C_{c_{k^{*}}^{r},c_{k^{*}_{2}}^{r}}^{b_{2}}>\zeta_{T}$ . We will now show that $\left|b_{2}-r_{2}\right|\left(\Delta_{2}^{f}\right)^{2/3}\leq C_{3}\left(\log T\right)^{1/3}$ . Following the same process as in Step 3.1 and assuming now that $b_{2}<r_{2}$ , we have that for $b^{*}\in\left\{c_{k^{*}}^{r},c_{k^{*}}^{r}+1,\ldots,c_{k^{*}_{2}}^{r}-1\right\}$ ,

[TABLE]

is implied by $\min\left\{\left|b^{*}-r_{2}\right|,c_{k_{2}}^{r}-r_{2}\right\}>C_{3}\left(\log T\right)^{1/3}/\left(\Delta_{2}^{f}\right)^{2/3}$ . However, following the same procedure as in Step 3.1 we can show that for sufficiently large $T$ ,

[TABLE]

Thus, (48) is implied by $\left|b^{*}-r_{2}\right|\left(\Delta_{2}^{f}\right)^{2/3}>C_{3}(\log T)^{1/3}$ . Therefore, $\left|b_{2}-r_{2}\right|\left(\Delta_{2}^{f}\right)^{2/3}>C_{3}(\log T)^{1/3}$ would necessarily mean that $C_{c_{k^{*}}^{r},c_{k^{*}_{2}}^{r}}^{r_{2}}(\bm{X})>C_{c_{k^{*}}^{r},c_{k^{*}_{2}}^{r}}^{b_{2}}(\bm{X})$ , which is not true by the definition of $b_{2}$ . Having said this, we conclude that $\left|b_{2}-r_{2}\right|\left(\Delta_{2}^{f}\right)^{2/3}\leq C_{3}\left(\log T\right)^{1/3}$ .

Having detected $r_{2}$ , then our algorithm will proceed in the interval $[s,e]=[c_{k^{*}_{2}}^{r},T]$ and all the change-points will get detected one by one since Step 3.2 will be applicable as long as there are previously undetected change-points in $[s,e]$ . Denoting by $\hat{r}_{j}$ the estimation of $r_{j}$ as we did in the statement of the theorem, then we conclude that all change-points will first get isolated and then detected one by one and

[TABLE]

Step 4: The arguments given in Steps 1-3 hold in $A_{T}^{*}\cap B_{T}^{*}$ . At the beginning of the algorithm, $s=1,e=T$ and for $N\geq 1$ , there exist $k_{1}\in\left\{1,2,\ldots,K\right\}$ such that $s_{k_{1}}=s,e_{k_{1}}\in I_{1}^{R}$ and $k_{2}\in\left\{1,2,\ldots,K\right\}$ such that $s_{k_{2}}\in I_{N}^{L},e_{k_{2}}=e$ . As in our previous steps, w.l.o.g. assume that $r_{1}\leq T-r_{N}+1$ , meaning that $r_{1}$ gets isolated and detected first in an interval $[s,c_{k^{*}}^{r}]$ , where $c_{k^{*}}^{r}\in{\rm{R}}_{1,T}$ and it is less than or equal to $e_{k_{1}}$ . Then, $\hat{r}_{1}={\rm argmax}_{s\leq t<c_{k^{*}}^{r}}C_{s,c_{k^{*}}^{r}}^{t}(\bm{X})$ is the estimated location for $r_{1}$ and $\left|r_{1}-\hat{r}_{1}\right|\left(\Delta_{1}^{f}\right)^{2/3}\leq C_{3}\left(\log T\right)^{1/3}$ . After this, the algorithm continues in $[c_{k^{*}}^{r},T]$ and keeps detecting all the change-points as explained in Step 3. It is important to note that there will not be any double detection issues because naturally, at each step of the algorithm, the new interval $[s,e]$ does not include any previously detected change-points.

Once all the change-points have been detected one by one, then $[s,e]$ will have no other change-points in it. Our method will keep interchangeably checking for possible change-points in intervals of the form $\left[s,c_{\tilde{k}_{1}}^{r}\right]$ and $\left[c_{\tilde{k}_{2}}^{l},e\right]$ for $c_{\tilde{k}_{1}}^{r}\in{\rm{R}}_{s,e}$ and $c_{\tilde{k}_{2}}^{l}\in{\rm{L}}_{s,e}$ . Allow us to denote by $[s^{*},e^{*}]$ any of these intervals. Our algorithm will not detect anything in $[s^{*},e^{*}]$ since $\forall b\in[s^{*},e^{*})$ ,

[TABLE]

After not detecting anything in all intervals of the above form, then the algorithm concludes that there are not any change-points in $[s,e]$ and stops. $\hfill\blacksquare$

**Brief discussion of the steps of the proof of Theorem 3.3

**Before the thorough mathematical proof of Theorem 3.3, we provide an informal explanation of the three main steps in our proof. The notation is as in the main paper with $\tilde{S}$ denoting the ordered set with the remaining and relabelled estimated change-points, $\tilde{r}_{k}$ , after each estimation is removed. At the beginning of the change-point removal approach, $\tilde{S}=\left[\tilde{r}_{1},\tilde{r}_{2},\ldots,\tilde{r}_{J}\right]$ . In Step 1 of the proof, we show that for each true change-point $r_{j},j\in\left\{1,2,\ldots,N\right\}$ , there is at least one and at most four estimated change-points, $\tilde{r}_{k},k\in\left\{1,2,\ldots,J\right\}$ within a distance equal to $\tilde{C}(\log T)^{\alpha}/\left(\Delta_{j}^{f}\right)^{2}$ , where $\tilde{C}>0$ . In Step 2, we show that there are at most two estimated change-points between two consecutive true change-points. In Step 3, we prove that as the algorithm proceeds, then $\forall j\in\left\{1,2,\ldots,N\right\}$ , the only remaining change-point for $r_{j}$ is within a distance of $C_{1}(\log T)^{\alpha}/\left(\Delta_{j}^{f}\right)^{2}$ from $r_{j}$ and it cannot be removed whilst there are still more than $N$ estimated change-points in $\tilde{S}$ . Step 4 shows that the sSIC penalty as defined in (11), proposes a solution with $\hat{N}=N$ estimated change-points.

**Proof of Theorem 3.3

**Allow us first to denote by

[TABLE]

We will show that $\mathbb{P}\,\left(D_{T}\right)\geq 1-\sqrt{2}/(\sqrt{\pi}T)$ . For $Z\sim\mathcal{N}(0,1)$ , using the Bonferroni inequality we get that

[TABLE]

where $\phi(\cdot)$ is the probability density function of the standard normal distribution. Therefore, $\mathbb{P}\,(D_{T})\geq 1-\sqrt{2}/(\sqrt{\pi}T)$ . With $A_{T}$ as in (18), the work that follows is valid on the set $A_{T}\cap D_{T}$ with $\mathbb{P}\,\left(A_{T}\cap D_{T}\right)\geq 1-1/(12\sqrt{\pi}T)-\sqrt{2}/(\sqrt{\pi}T)$ . We take $\tilde{\tilde{C}}$ from the main paper to be equal to $2\sqrt{2}$ .

Step 1: When the algorithm moves from Part 1 to Part 2, as described in Subsection 3.3, then we are under a structure described by the following three characteristics:

(P1) For $\tilde{C}>0$ , there is at least one estimation within a distance of $\tilde{C}(\log T)^{\alpha}/\left(\Delta_{j}^{f}\right)^{2}$ from $r_{j}$ , $\forall j\in\left\{1,2,\ldots,N\right\}$ . We know that this is true at the beginning of Part 1 due to calling the ID algorithm with threshold $\zeta_{T}$ . This continues to be the case when the algorithm proceeds to Part 2, because if $\tilde{r}_{k}$ is the last estimation within $\tilde{C}(\log T)^{\alpha}/\left(\Delta_{j}^{f}\right)^{2}$ from $r_{j}$ , then $\tilde{r}_{k+1}-\tilde{r}_{k-1}>2\tilde{C}(\log T)^{\alpha}/\left(\Delta_{j}^{f}\right)^{2}=2C^{*}(\log T)^{\alpha}$ and $\tilde{r}_{k}$ cannot be removed in Part 1.

(P2) For each $j=1,2,\ldots,N$ , there are at most four estimated change-points within a distance of $\tilde{C}(\log T)^{\alpha}/\left(\Delta_{j}^{f}\right)^{2}$ from $r_{j}$ . We can not have more than four estimations as if this was the case then at least three of them, let’s denote them by $p_{1},p_{2},p_{3}$ , would be either on the right or the left of $r_{j}$ , which would then mean that both $CS(p_{2})\leq 2\sqrt{2\log T}$ and $p_{3}-p_{1}\leq\tilde{C}(\log T)^{\alpha}/\left(\Delta_{j}^{f}\right)^{2}=C^{*}(\log T)^{\alpha}$ are satisfied and therefore $p_{2}$ would have been removed in Part 1 of the algorithm as explained in Subsection 3.3 of the paper.

(P3) There is an unknown, possibly large, number of estimated change-points (which tends to infinity as $T$ goes to infinity) between any two true change-points, namely $r_{j}$ and $r_{j+1}$ . This issue is solved in Part 2 of the algorithm.

Step 2: We are now in Part 2 of the algorithm as explained in Subsection 3.3, which guarantees that the minimum distance between two estimated change-points is $C^{*}(\log T)^{\alpha}$ , and also that there exists $C\geq\tilde{C}$ , such that there is at least one estimation within a distance of $C(\log T)^{\alpha}/\left(\Delta_{j}^{f}\right)^{2}$ from $r_{j}$ . After that, in Part 2 we collect the triplets $(\tilde{r}_{j-1},\tilde{r}_{j},\tilde{r}_{j+1})$ and we calculate ${\rm CS}(\tilde{r}_{j})$ and for $m={\rm argmin}_{j}\left\{{\rm CS}(\tilde{r}_{j})\right\}$ , if $CS(\tilde{r}_{m})\leq 2\sqrt{2\log T}$ , then $\tilde{r}_{m}$ is removed and the process is repeated for the remaining estimated change-points. By doing this it is easy to see that, first of all, between $r_{j}$ and $r_{j+1}$ , $j=0,1,\ldots,N$ there will be at most two estimated change-points, since if there were more, then we would have triplets $(\tilde{r}_{j-1},\tilde{r}_{j},\tilde{r}_{j+1})$ with ${\rm CS}(\tilde{r}_{j})\leq 2\sqrt{2\log T}$ and $\tilde{r}_{j}$ would have been removed. Secondly, for each $j=1,2,\ldots,N$ there is still at least an estimation within a distance of $C(\log T)^{\alpha}/\left(\Delta_{j}^{f}\right)^{2}$ from $r_{j}$ . If there is exactly one estimated change-point in the area $r_{j}\pm C(\log T)^{\alpha}/\left(\Delta_{j}^{f}\right)^{2}$ , namely $\tilde{r}_{k}$ , then this cannot be removed in Part 2 of the algorithm because $\min\left\{\tilde{r}_{k}-\tilde{r}_{k-1},\tilde{r}_{k+1}-\tilde{r}_{k}\right\}>C^{*}(\log T)^{\alpha}$ and therefore for $C_{4}>0$ ,

[TABLE]

which for sufficiently large $T$ is greater than $2\sqrt{2\log T}$ . Therefore, $\tilde{r}_{k}$ will not be removed and there is at least one estimation within a distance of $C(\log T)^{\alpha}/\left(\Delta_{j}^{f}\right)^{2}$ from $r_{j}$ , $\forall j=1,2,\ldots,N$ . This estimation will be in the set $\tilde{S}$ of estimated change-points that continue in Part 3 of the algorithm.

Step 3: We are in Part 3 of the algorithm as explained in Subsection 3.3. In this step, we will show that with $m={\rm argmin}_{\tilde{r}_{k}\in\tilde{S}}\;CS(\tilde{r}_{k})$ , then once $CS(\tilde{r}_{m})>2\sqrt{2\log T}$ we will be at the stage where $\tilde{S}$ contains $N$ estimated change-points; one estimated change-point within a distance of $C_{1}(\log T)^{\alpha}/\left(\Delta_{j}^{f}\right)^{2}$ from each $r_{j}$ , $\forall j\in\left\{1,2,\ldots,N\right\}$ , where $C_{1}>0$ . W.l.o.g. let $\tilde{r}_{m}$ be between $r_{j}$ and $r_{j+1}$ . From Step 2 we know that there is a finite number (no more than four) of estimated change-points in $[r_{j-1},r_{j+1}]$ . It is straightforward that at the beginning of Part 3 we have at most $2N$ estimations and either of the following two cases is possible:

Case 1: $\tilde{r}_{m}$ is not the closest change-point to either the true change-point on its left ( $r_{j}$ for a $j\in\left\{1,2,\ldots,N\right\}$ ) or the true change-point on its right ( $r_{j+1}$ ). Since, $\tilde{s}_{m}=\left\lfloor\left(\tilde{r}_{m-1}+\tilde{r}_{m}\right)/2\right\rfloor+1$ and $\tilde{e}_{m}=\left\lceil\left(\tilde{r}_{m}+\tilde{r}_{m+1}\right)/2\right\rceil$ , it is straightforward to see that necessarily $\tilde{s}_{m}$ is on the right of $r_{j}$ and $\tilde{e}_{m}$ is on the left of $r_{j+1}$ . This means that there are not any true change-points in $\left[\tilde{s}_{m},\tilde{e}_{m}\right]$ and because we are working in the set $A_{T}$ , we have that

[TABLE]

Therefore, $\tilde{r}_{m}$ will be removed from the set $\tilde{S}$ .

Case 2: $\tilde{r}_{m}$ is the closest change-point to a true change-point, namely $r_{j}$ for a $j\in\left\{1,2,\ldots,N\right\}$ , and from what has been discussed in Steps 1 and 2, $\tilde{r}_{m}$ is within a distance of $C(\log T)^{\alpha}/\left(\Delta_{j}^{f}\right)^{2}$ . If $CS(\tilde{r}_{m})\leq 2\sqrt{2\log T}$ , then there is at least another estimated change-point within a distance of $C_{m}(\log T)^{\alpha}/\left(\Delta_{j}^{f}\right)^{2}$ from $\tilde{r}_{m}$ (and therefore from $r_{j}$ too), where $C_{m}$ is a constant that does not depend on $T$ . If this was not the case, then since we are working under $A_{T}$ ,

[TABLE]

for a constant $C^{*}_{m}$ and for sufficiently large $T$ . Therefore, $\tilde{r}_{m}$ would not get removed. Since there are at most $2N$ estimations, then the constants $C_{m}$ are upper bounded by a general constant $C_{1}$ and therefore, when Part 3 terminates, each true change-point will have only one estimated change-point within the distance of $C_{1}(\log T)^{\alpha}/\left(\Delta_{j}^{f}\right)^{2}$ . This is exactly the stage in our algorithm, where $J=N$ , with $J$ denoting the number of estimated change-points in the set $\tilde{S}$ . There are not any change-point identifiability issues, because $(\log T)^{\alpha}=o\left(\delta_{T}\underline{f}_{T}^{2}\right)$ and for $T$ large enough

[TABLE]

The algorithm will then proceed to Part 4 explained in Subsection 3.3 and each of the remaining estimated change-points will be removed one by one until $\tilde{S}$ is the empty set.

Step 4: In this last step we will prove that the sSIC penalty indicates the solution obtained when Part 3 terminates, where the number of estimated change-points is equal to $N$ . We have already explained in Section 3.3 of the paper that in the scenario of piecewise-constant mean signals,

[TABLE]

where for any candidate model $\mathcal{M}_{j},j=0,1,\ldots,J$ , we have $\hat{f}_{t}^{j}=(\hat{r}_{j+1}-\hat{r}_{j})^{-1}\sum_{k=r_{j}+1}^{\hat{r}_{j+1}}X_{j}$ , for $\hat{r}_{j}+1\leq t\leq\hat{r}_{j+1}$ , and $\hat{\sigma}_{j}^{2}=T^{-1}\sum_{t=1}^{T}(X_{t}-\hat{f}_{t}^{j})^{2}$ is the maximum likelihood estimator of the residual variance associated with model $\mathcal{M}_{j}$ . It has also been proven in Fryzlewicz (2014) that in an interval $[s,e]$ ,

[TABLE]

where $d\in\left\{s,s+1,\ldots,e-1\right\}$ . With $j=0,1,\ldots,J$ the number of estimated change-points related to $\mathcal{M}_{j}$ , if $j>N$ , it means that all the change-points have been detected (see explanation in Step 2) and therefore $\left|\tilde{X}_{s,e}^{d}\right|\leq 2\sqrt{2\log T}$ since we are working under $A_{T}$ . Therefore, $\hat{\sigma}_{j-1}^{2}-\hat{\sigma}_{j}^{2}\leq 8\log T/T$ . In addition, since we are working in the set $D_{T}$ , we have that $\left|\hat{\sigma}_{N}^{2}-\sigma^{2}\right|\leq C^{*}\log T/T$ , for a positive constant $C^{*}$ . Using these results, the definition of ${\rm sSIC}(j)$ in (50), and a first order Taylor expansion, we conclude that

[TABLE]

where $K_{1}$ and $w$ are positive constants. The lower bound in (8) is positive for $T$ large enough. Now, if $j<N$ , then from the proof of Theorem 3.1, we know that in the interval $[s,e]$ , we have that $\left|\tilde{X}_{s,e}^{d}\right|\geq\tilde{C}_{2}\sqrt{\delta_{T}}\underline{f}_{T}$ , leading to $\hat{\sigma}_{j-1}^{2}-\hat{\sigma}_{j}^{2}\geq\tilde{C}_{2}^{2}\delta_{t}\underline{f}_{T}^{2}/T$ . Therefore,

[TABLE]

where $K_{2}$ and $w_{2}$ are positive constants. The lower bound in (8) is positive for $T$ large enough because $(\log T)^{\alpha}=o\left(\delta_{T}\underline{f}_{T}^{2}\right)$ . The results in (8) and (8) show that for $T$ large enough and on the set $A_{T}\cap D_{T}$ , we have that ${\rm sSIC}(j)>{\rm sSIC}(N)$ , for $j\neq N$ . Therefore, ${\rm sSIC}(j)$ is minimized for $j=N$ , showing that $\hat{N}=N$ . $\hfill\blacksquare$

Proof of Corollary 1. From now on, we denote by

[TABLE]

where $1\leq s\leq b<e\leq T$ and $n=e-s+1$ . Using also the notation in (14), we have

[TABLE]

Define

[TABLE]

and consider any straight line in the $(\tilde{\tilde{\epsilon}}_{s,b},\tilde{\tilde{\epsilon}}_{b+1,e})$ -plane, defined by the equation $\alpha\tilde{\tilde{\epsilon}}_{s,b}+\beta\tilde{\tilde{\epsilon}}_{b+1,e}=\lambda$ , that is tangent to the curve $\tilde{\tilde{\epsilon}}_{s,b}\tilde{\tilde{\epsilon}}_{b+1,e}=\gamma^{\prime}$ in the quadrant $(\tilde{\tilde{\epsilon}}_{s,b},\tilde{\tilde{\epsilon}}_{b+1,e})>(0,0)$ . By elementary calculus, which we do not repeat here, the tangency is attained when $\lambda=2\{\gamma^{\prime}\alpha\beta\}^{1/2}$ . By elementary geometry,

[TABLE]

Consider the particular values of $(\alpha,\beta)$ given by

[TABLE]

and note that by the definition of $\tilde{\tilde{\epsilon}}_{s,e}$ we have

[TABLE]

Therefore, from the definitions of $\gamma^{\prime}$ , $\alpha$ and $\beta$ , the implication in (54) and the identity in (55), we have

[TABLE]

Considering again (53), this implies

[TABLE]

By (15), we have

[TABLE]

From (56), we have

[TABLE]

Therefore, from (57), and using (58) and (59) in turn, we have

[TABLE]

Now, for any $\delta>0$ , taking $\gamma=\sigma^{2}(1+\delta)\log\,T$ , we have that

[TABLE]

and the statement is a consequence of Lemma 1 in Yao (1988). $\blacksquare$

References

Baranowski et al. (2019)

Baranowski, R., Chen, Y. and Fryzlewicz, P. (2019). Narrowest-over-threshold detection of multiple change points and change-point-like features.

Journal of the Royal Statistical Society Series B, 81, 649–672.

Fryzlewicz (2014)

Fryzlewicz, P. (2014).

Wild binary segmentation for multiple change-point detection.

Annals of Statistics 42, 2243–2281.

Yao (1988)

Yao, Y.-C. (1988). Estimating the number of change-points via Schwarz’ criterion.

Statistics & Probability Letters, 6, 181–189.

Bibliography42

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Anscombe (1948) Anscombe, F. J. (1948). The transformation of Poisson, binomial and negative-binomial data. Biometrika , 35 , 246–254.
2Auger and Lawrence (1989) Auger, I. E. and Lawrence, C. E. (1989). Algorithms for the Optimal Identification of Segment Neighborhoods. Bulletin of Mathematical Biology , 51 , 39–54.
3Bai and Perron (1998) Bai, J. and Perron, P. (1998). Estimating and testing linear models with multiple structural changes. Econometrica , 66 , 47–78.
4Baranowski et al. (2019) Baranowski, R., Chen, Y. and Fryzlewicz, P. (2019). Narrowest-over-threshold detection of multiple change points and change-point-like features. Journal of the Royal Statistical Society Series B , 81 , 649–672.
5Chan and Walther (2013) Chan, H. P. and Walther, G. (2013). Detection with the scan and the average likelihood ratio. Statistica Sinica , 23 , 409–428.
6Cho and Kirch (2020) Cho, H., and Kirch, C. (2020). Data segmentation algorithms: Univariate mean change and beyond. https://arxiv.org/pdf/2012.12814.pdf .
7Dette et al. (2018) Dette, H., Eckle, T., and Vetter, M. (2020). Multiscale change point detection for dependent data. Scandinavian Journal of Statistics , 47 , 1243–1274.
8Eichinger and Kirch (2018) Eichinger, B. and C. Kirch (2018). A MOSUM procedure for the estimation of multiple random change points. Bernoulli 24 , 526–564.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Taxonomy

Detecting multiple generalized change-points by isolating single ones

Abstract

Keywords:

1 Introduction

2 Motivating illustration of Isolate-Detect

3 Methodology and Theory

3.1 Methodology

3.2 Theoretical behavior of ID

Theorem 3.1

Theorem 3.2

3.3 Information Criterion approach

Theorem 3.3

Theorem 3.4

4 Computational complexity and practicalities

4.1 Computational cost

4.2 Parameter choice

Corollary 1

4.3 Variants

4.4 Alternative model selection criteria

4.5 Extension to different noise structures

5 Simulations

6 Real data examples

6.1 UK House Price Index

6.2 The COVID-19 outbreak in the UK

7 Concluding reflections on ID

1 The different parts in the removal process of the solution path algorithm

2 Models used in the simulation study of the paper

3 Improvement of ID in the case of big data

4 A justification of the estimations in the example of Section 6.2

5 Additional simulation results

6 Investigation of the impact of different values of λT\lambda_{T}λT​ on accuracy

7 Additional real-data example

8 Proof of the theorems in the paper and of Corollary 1

Lemma 1

Proof

Remark 1

Lemma 2

Proof

Lemma 3

Proof

References

6 Investigation of the impact of different values of $\lambda_{T}$ on accuracy