Detecting linear trend changes in data sequences

Hyeyoung Maeng; Piotr Fryzlewicz

arXiv:1906.01939·stat.ME·January 9, 2023

TL;DR

TrendSegment is a new method that detects multiple linear trend change-points in data sequences using a novel wavelet transform, enabling efficient and accurate identification of both short and long trend segments.

Contribution

It introduces a Tail-Greedy Unbalanced Wavelet transform for multiscale data decomposition, improving change-point detection in linear trends with theoretical consistency guarantees.

Findings

01. Effective detection of multiple trend change-points demonstrated on real data

02. Method shows consistency in estimating number and locations of change-points

03. Implementation available as an R package on CRAN

Abstract

We propose TrendSegment, a methodology for detecting multiple change-points corresponding to linear trend changes in one-dimensional data. A core ingredient of TrendSegment is a new Tail-Greedy Unbalanced Wavelet transform: a conditionally orthonormal, bottom-up transformation of the data through an adaptively constructed unbalanced wavelet basis, which results in a sparse representation of the data. Due to its bottom-up nature, this multiscale decomposition focuses on local features in its early stages and on global features later, which enables the detection of both long and short linear trend segments at once. To reduce the computational complexity, the proposed method merges multiple regions in a single pass over the data. We show the consistency of the estimated number and locations of change-points. The practicality of our approach is demonstrated through simulations and two real…

Tables (14)

Table 1: Notation. See Section 2.2.4 for formulae for the terms listed.

$X_{p}$	$p^{th}$ element of the observation vector $\boldsymbol{X} = (X_{1}, X_{2}, \dots, X_{T})^{\top}$.
$s_{p, p}^{0}$	$p^{th}$ initial smooth coefficient of the vector $\boldsymbol{s}^{0}$, where $\boldsymbol{X} = \boldsymbol{s}^{0}$.
$d_{p, q, r}$	detail coefficient obtained from $\{X_{p}, \dots, X_{r}\}$ (merges of Types 1 or 2).
$s_{p, r}^{[1]}, s_{p, r}^{[2]}$	smooth coefficients obtained from $\{X_{p}, \dots, X_{r}\}$, paired under the “two together” rule.
$d_{p, q, r}^{[1]}, d_{p, q, r}^{[2]}$	paired detail coefficients obtained by merging two adjacent subintervals, $\{X_{p}, \dots, X_{q}\}$ and $\{X_{q + 1}, \dots, X_{r}\}$, where $r > q + 2$ and $q > p + 1$ (merge of Type 3).
$\boldsymbol{s}$	data sequence vector containing the (recursively updated) smooth and detail coefficients from the initial input $\boldsymbol{s}^{0}$.

Table 2: Distribution of $\hat{N}-N$ for models (M1)-(M4) and all methods listed in Sections 4.1 and 4.2 over 100 simulation runs. Also reported are the average MSE (Mean Squared Error) of the estimated signal $\hat{f}_{t}$ defined in Section 4.3, the average Hausdorff distance $d_{H}$ given by (20) and the average computational time in seconds using an Intel Core i5 2.9 GHz CPU with 8 GB of RAM, all over 100 simulations. Bold: methods within 10% of the highest empirical frequency of $\hat{N}-N=0$ or within 10% of the lowest empirical average $d_{H}(\times 10^{2})$. Note that TrendSegment is shortened to TS.

Model	Method	$\leq$ -3	-2	-1	0	1	2	$\geq$ 3	MSE	$d_{H} (\times 10^{2})$	time
					$\hat{N} - N$
(M1)	TS( $λ^{Naïve}$ )	0	0	2	98	0	0	0	0.23	2.96	0.22
	TS( $λ^{Robust}$ )	0	0	2	97	1	0	0	0.23	2.97	0.09
	NOT	0	0	0	98	2	0	0	0.19	2.28	0.22
	ID	0	0	0	97	3	0	0	0.14	1.52	0.02
	TF	0	0	0	0	0	0	100	0.11	4.50	3.18
	CPOP	0	0	0	97	2	1	0	0.09	1.09	0.05
	BUP	100	0	0	0	0	0	0	2.65	10.75	0.35
(M2)	TS( $λ^{Naïve}$ )	0	0	2	98	0	0	0	0.11	1.90	0.50
	TS( $λ^{Robust}$ )	0	0	4	96	0	0	0	0.11	1.91	0.24
	NOT	0	0	2	98	0	0	0	0.09	1.56	0.35
	ID	0	0	0	94	6	0	0	0.09	1.44	0.23
	TF	0	0	0	0	0	0	100	0.06	2.31	31.34
	CPOP	0	0	0	93	7	0	0	0.06	1.15	2.09
	BUP	100	0	0	0	0	0	0	0.75	4.69	2.21
(M3)	TS( $λ^{Naïve}$ )	0	0	0	99	1	0	0	0.03	3.33	0.61
	TS( $λ^{Robust}$ )	0	0	0	100	0	0	0	0.03	3.33	0.29
	NOT	0	0	0	100	0	0	0	0.02	2.70	0.33
	ID	0	0	0	100	0	0	0	0.02	1.86	0.02
	TF	0	0	0	0	0	0	100	0.01	5.41	28.89
	CPOP	0	0	0	100	0	0	0	0.01	1.02	17.38
	BUP	0	0	0	2	22	48	28	0.03	5.46	2.20
(M4)	TS( $λ^{Naïve}$ )	0	0	0	100	0	0	0	0.09	3.24	0.31
	TS( $λ^{Robust}$ )	0	0	0	100	0	0	0	0.09	3.24	0.09
	NOT	0	0	0	99	1	0	0	0.08	2.71	0.23
	ID	0	0	0	97	3	0	0	0.07	2.04	0.02
	TF	0	0	0	0	0	0	100	0.05	5.47	8.50
	CPOP	0	0	0	97	3	0	0	0.04	1.83	0.39
	BUP	7	64	27	2	0	0	0	0.52	10.66	0.56

Table 3: Distribution of $\hat{N}-N$ for models (M5)-(M8) and all methods listed in Sections 4.1 and 4.2 over 100 simulation runs. Also reported are the average MSE (Mean Squared Error) of the estimated signal $\hat{f}_{t}$ defined in Section 4.3, the average Hausdorff distance $d_{H}$ given by (20) and the average computational time in seconds using an Intel Core i5 2.9 GHz CPU with 8 GB of RAM, all over 100 simulations. Bold: methods within 10% of the highest empirical frequency of $\hat{N}-N=0$ or within 10% of the lowest empirical average $d_{H}(\times 10^{2})$. Note that TrendSegment is shortened to TS.

Model	Method	$\leq$ -3	-2	-1	0	1	2	$\geq$ 3	MSE	$d_{H} (\times 10^{2})$	time
					$\hat{N} - N$
(M5)	TS( $λ^{Naïve}$ )	0	0	0	90	10	0	0	0.03	1.40	1.30
	TS( $λ^{Robust}$ )	0	0	0	89	11	0	0	0.03	1.41	0.32
	NOT	0	12	9	75	3	0	1	0.05	0.73	0.25
	ID	0	0	0	1	5	25	69	0.29	8.09	0.03
	TF	0	0	0	0	0	0	100	0.14	6.15	28.53
	CPOP	0	0	0	8	27	31	34	0.03	1.42	3.50
	BUP	0	0	0	41	44	13	2	0.10	4.72	2.25
(M6)	TS( $λ^{Naïve}$ )	0	0	0	99	1	0	0	0.01	0.05	0.90
	TS( $λ^{Robust}$ )	0	3	1	96	0	0	0	0.02	0.64	0.34
	NOT	2	13	37	45	2	1	0	0.07	1.74	0.25
	ID	0	0	0	0	0	1	99	0.07	0.17	0.04
	TF	0	0	0	0	0	0	100	0.13	9.87	30.72
	CPOP	0	0	0	21	28	40	11	0.03	0.22	3.02
	BUP	0	0	0	0	0	0	100	0.12	9.29	2.70
(M7)	TS( $λ^{Naïve}$ )	0	5	21	40	28	6	0	0.10	7.02	0.31
	TS( $λ^{Robust}$ )	1	10	38	31	16	4	0	0.13	8.64	0.13
	NOT	1	1	8	56	31	3	0	0.06	2.62	0.25
	ID	3	0	16	14	26	13	28	0.32	10.87	0.12
	TF	0	0	0	0	0	0	100	0.10	6.11	23.19
	CPOP	0	0	1	1	3	17	78	0.05	3.37	1.19
	BUP	70	25	5	0	0	0	0	0.28	11.89	1.58
(M8)	TS( $λ^{Naïve}$ )	0	0	0	100	0	0	0	0.00	0.00	0.43
	TS( $λ^{Robust}$ )	0	0	0	100	0	0	0	0.00	0.00	0.19
	NOT	0	0	0	100	0	0	0	0.00	0.00	0.17
	ID	0	0	0	100	0	0	0	0.00	0.00	0.59
	TF	0	0	0	78	5	2	15	0.00	9.08	35.79
	CPOP	0	0	0	100	0	0	0	0.00	0.00	12.96
	BUP	0	0	0	0	0	0	100	0.01	46.34	2.63

Table C.1: Distribution of $\hat{N}-N$ for models (M1)-(M4) and all methods with the noise term $\varepsilon_{t}\overset{\text{iid}}{\sim}t_{5}$ over 100 simulation runs. Also reported are the average MSE (Mean Squared Error) of the estimated signal $\hat{f}_{t}$, the average Hausdorff distance $d_{H}$ and the average computational time in seconds using an Intel Core i5 2.9 GHz CPU with 8 GB of RAM, all over 100 simulations. Bold: methods within 10% of the highest empirical frequency of $\hat{N}-N=0$ or within 10% of the lowest empirical average $d_{H}(\times 10^{2})$. Note that TrendSegment is shortened to TS.

Model	Method	$\leq$ -3	-2	-1	0	1	2	$\geq$ 3	MSE	$d_{H} (\times 10^{2})$	time
					$\hat{N} - N$
(M1)	TS	0	0	1	88	9	2	0	0.24	3.10	0.09
	NOT	0	0	0	92	6	2	0	0.20	2.51	0.22
	ID	0	0	0	91	9	0	0	0.14	1.69	0.01
	TF	0	0	0	0	0	0	100	0.10	4.40	3.22
	CPOP	0	0	0	78	12	9	1	0.13	1.44	0.04
	BUP	100	0	0	0	0	0	0	2.63	10.61	0.35
(M2)	TS	0	0	4	83	9	2	2	0.13	2.05	0.24
	NOT	0	0	3	85	11	0	1	0.098	1.69	0.29
	ID	0	0	0	77	21	2	0	0.102	1.36	0.38
	TF	0	0	0	0	0	0	100	0.067	2.29	31.41
	CPOP	0	0	0	14	23	25	38	0.119	1.54	1.66
	BUP	100	0	0	0	0	0	0	0.752	4.69	2.18
(M3)	TS	0	0	8	81	8	2	1	0.04	4.43	0.29
	NOT	0	0	1	97	2	0	0	0.021	2.71	0.31
	ID	0	0	0	85	10	2	3	0.023	2.40	0.03
	TF	0	0	0	0	0	0	100	0.010	5.20	28.83
	CPOP	0	0	0	32	25	24	19	0.039	2.51	13.06
	BUP	0	0	0	2	26	46	26	0.032	5.39	2.18
(M4)	TS	0	0	0	91	7	0	2	0.11	3.39	0.09
	NOT	0	0	0	98	2	0	0	0.08	2.57	0.24
	ID	0	0	0	87	12	1	0	0.08	2.21	0.01
	TF	0	0	0	0	0	0	100	0.05	5.54	8.73
	CPOP	0	0	0	62	22	8	8	0.08	2.24	0.38
	BUP	2	73	24	1	0	0	0	0.52	10.80	0.57

Table C.2: Distribution of $\hat{N}-N$ for models (M5)-(M8) and all methods with the noise term $\varepsilon_{t}\overset{\text{iid}}{\sim}t_{5}$ over 100 simulation runs. Also reported are the average MSE (Mean Squared Error) of the estimated signal $\hat{f}_{t}$, the average Hausdorff distance $d_{H}$ and the average computational time in seconds using an Intel Core i5 2.9 GHz CPU with 8 GB of RAM, all over 100 simulations. Bold: methods within 10% of the highest empirical frequency of $\hat{N}-N=0$ or within 10% of the lowest empirical average $d_{H}(\times 10^{2})$. Note that TrendSegment is shortened to TS.

Model	Method	$\leq$ -3	-2	-1	0	1	2	$\geq$ 3	MSE	$d_{H} (\times 10^{2})$	time
					$\hat{N} - N$
(M5)	TS	0	0	2	75	18	4	1	0.04	2.17	0.31
	NOT	0	11	10	63	10	3	3	0.049	1.29	0.25
	ID	0	0	0	0	0	7	93	0.332	9.58	0.03
	TF	0	0	0	0	0	0	100	0.145	6.14	28.33
	CPOP	0	0	0	4	6	20	70	0.064	2.61	3.07
	BUP	0	0	0	32	44	20	4	0.097	4.62	2.22
(M6)	TS	0	4	1	88	3	1	3	0.02	1.23	0.35
	NOT	6	10	26	44	6	2	6	0.071	3.53	0.24
	ID	0	3	0	0	19	0	78	0.129	4.73	0.03
	TF	0	0	0	0	0	0	100	0.136	9.88	30.26
	CPOP	0	0	0	8	19	20	53	0.053	3.15	2.45
	BUP	0	0	0	0	0	0	100	0.132	9.23	2.47
(M7)	TS	5	15	28	36	10	4	2	0.16	8.51	0.13
	NOT	0	6	16	30	36	11	1	0.079	5.12	0.22
	ID	6	3	9	18	19	15	30	0.385	12.40	0.01
	TF	0	0	0	0	0	0	100	0.098	6.08	23.86
	CPOP	0	0	0	0	4	5	91	0.102	3.01	0.81
	BUP	69	28	3	0	0	0	0	0.266	12.12	1.47
(M8)	TS	0	0	0	99	0	0	1	0.00	0.50	0.19
	NOT	0	0	0	100	0	0	0	0.001	0.00	0.17
	ID	0	0	0	99	1	0	0	0.001	0.00	0.03
	TF	0	0	0	65	12	9	14	0.003	14.63	36.03
	CPOP	0	0	0	35	0	34	31	0.042	20.53	3.91
	BUP	0	0	0	0	0	0	100	0.014	46.80	2.62

Table C.3: Distribution of $\hat{N}-N$ for models (M1)-(M4) and all methods with the noise term $\epsilon_{t}$ being an $AR(1)$ process with $\phi=0.3$ over 100 simulation runs. Also reported are the average MSE (Mean Squared Error) of the estimated signal $\hat{f}_{t}$, the average Hausdorff distance $d_{H}$ and the average computational time in seconds using an Intel Core i5 2.9 GHz CPU with 8 GB of RAM, all over 100 simulations. Bold: methods within 10% of the highest empirical frequency of $\hat{N}-N=0$ or within 10% of the lowest empirical average $d_{H}(\times 10^{2})$. Note that TrendSegment is shortened to TS.

Model	Method	$\leq$ -3	-2	-1	0	1	2	$\geq$ 3	MSE	$d_{H} (\times 10^{2})$	time
					$\hat{N} - N$
(M1)	TS	0	1	13	82	4	0	0	0.39	3.65	0.07
	NOT	0	0	0	87	8	2	3	0.35	3.10	0.23
	ID	0	0	0	62	27	9	2	0.27	2.70	0.02
	TF	3	0	0	0	0	0	97	0.61	6.18	3.31
	CPOP	0	0	0	53	35	10	2	0.23	2.52	0.05
	BUP	100	0	0	0	0	0	0	2.64	10.96	0.36
(M2)	TS	4	9	30	57	0	0	0	0.20	2.56	0.24
	NOT	0	0	8	83	6	2	1	0.182	2.11	0.31
	ID	0	0	0	69	24	5	2	0.155	1.75	0.40
	TF	0	0	0	0	0	0	100	0.600	2.38	32.03
	CPOP	0	0	0	1	6	8	85	0.163	1.98	1.50
	BUP	100	0	0	0	0	0	0	0.717	4.63	2.39
(M3)	TS	0	0	17	79	4	0	0	0.05	4.79	0.30
	NOT	0	0	1	89	7	2	1	0.045	3.81	0.32
	ID	0	0	1	83	14	1	1	0.037	2.98	0.03
	TF	0	0	0	0	0	0	100	0.258	6.24	28.76
	CPOP	0	0	0	76	10	9	5	0.022	2.14	15.35
	BUP	0	0	0	0	6	23	71	0.040	5.59	2.31
(M4)	TS	0	0	2	95	3	0	0	0.15	3.61	0.09
	NOT	0	0	0	86	9	3	2	0.16	3.50	0.23
	ID	0	0	0	84	14	0	2	0.13	2.87	0.01
	TF	1	0	1	0	1	0	97	0.64	6.76	8.67
	CPOP	0	0	0	51	24	15	10	0.11	3.17	0.39
	BUP	1	61	38	0	0	0	0	0.50	10.43	0.58

Table C.4: Distribution of $\hat{N}-N$ for models (M5)-(M8) and all methods with the noise term $\epsilon_{t}$ being an $AR(1)$ process with $\phi=0.3$ over 100 simulation runs. Also reported are the average MSE (Mean Squared Error) of the estimated signal $\hat{f}_{t}$, the average Hausdorff distance $d_{H}$ and the average computational time in seconds using an Intel Core i5 2.9 GHz CPU with 8 GB of RAM, all over 100 simulations. Bold: methods within 10% of the highest empirical frequency of $\hat{N}-N=0$ or within 10% of the lowest empirical average $d_{H}(\times 10^{2})$. Note that TrendSegment is shortened to TS.

Model	Method	$\leq$ -3	-2	-1	0	1	2	$\geq$ 3	MSE	$d_{H} (\times 10^{2})$	time
					$\hat{N} - N$
(M5)	TS	0	0	0	84	15	1	0	0.05	1.85	0.32
	NOT	0	6	13	74	4	3	0	0.062	1.59	0.26
	ID	0	0	0	1	4	17	78	0.347	8.87	0.03
	TF	0	0	0	0	0	0	100	0.192	6.16	28.45
	CPOP	0	0	0	2	14	20	64	0.059	2.02	3.64
	BUP	0	0	0	11	32	30	27	0.131	5.19	2.32
(M6)	TS	0	6	0	93	1	0	0	0.02	1.34	0.35
	NOT	6	18	28	31	4	2	11	0.094	5.30	0.25
	ID	1	10	0	0	21	0	68	0.149	6.75	0.04
	TF	0	0	0	0	0	0	100	0.324	9.93	29.97
	CPOP	0	0	0	7	32	28	23	0.043	1.04	3.58
	BUP	0	0	0	0	0	0	100	0.159	9.09	2.82
(M7)	TS	20	47	24	7	2	0	0	0.23	11.74	0.12
	NOT	5	12	19	24	22	7	11	0.158	7.69	0.24
	ID	11	3	15	22	18	16	15	0.405	14.22	0.01
	TF	3	0	0	0	0	0	97	0.623	7.01	23.25
	CPOP	0	0	0	0	0	1	99	0.162	5.27	0.85
	BUP	54	43	3	0	0	0	0	0.283	11.92	1.55
(M8)	TS	0	0	0	100	0	0	0	0.00	0.00	0.19
	NOT	0	0	0	93	3	3	1	0.005	2.02	0.19
	ID	0	0	0	100	0	0	0	0.003	0.00	0.51
	TF	0	0	0	0	0	0	100	0.551	49.94	35.81
	CPOP	0	0	0	30	10	3	57	0.035	19.71	7.55
	BUP	0	0	0	0	0	0	100	0.025	46.73	2.72

Table C.5: Distribution of $\hat{N}-N$ for models (M1)-(M4) and all methods with the noise term $\epsilon_{t}$ being an $AR(1)$ process with $\phi=0.6$ over 100 simulation runs. Also reported are the average MSE (Mean Squared Error) of the estimated signal $\hat{f}_{t}$, the average Hausdorff distance $d_{H}$ and the average computational time in seconds using 10 cores of an Apple M1 Pro with 16 GB of RAM on macOS, all over 100 simulations. Bold: methods within 10% of the highest empirical frequency of $\hat{N}-N=0$ or within 10% of the lowest empirical average $d_{H}(\times 10^{2})$. Note that TrendSegment is shortened to TS.

Model	Method	$\leq$ -3	-2	-1	0	1	2	$\geq$ 3	MSE	$d_{H} (\times 10^{2})$	time
					$\hat{N} - N$
(M1)	TS	2	2	22	67	7	0	0	0.82	4.82	0.07
	NOT	0	1	4	50	15	9	21	0.86	4.19	0.09
	ID	0	0	0	5	16	20	59	0.70	4.29	0.01
	TF	28	0	0	1	0	1	70	1.44	14.23	1.84
	CPOP	0	0	0	4	11	18	67	0.76	4.31	0.05
	BUP	100	0	0	0	0	0	0	2.55	11.22	0.15
(M2)	TS	30	34	26	8	2	0	0	0.50	3.69	0.23
	NOT	0	4	13	21	19	18	25	0.51	2.94	0.14
	ID	2	5	4	33	23	17	16	0.41	3.20	0.02
	TF	0	0	0	0	0	0	100	1.23	2.38	12.86
	CPOP	0	0	0	0	0	0	100	0.54	2.48	0.87
	BUP	100	0	0	0	0	0	0	0.70	4.43	0.70
(M3)	TS	0	4	23	45	16	8	4	0.14	6.77	0.31
	NOT	0	0	4	24	7	6	59	0.21	5.95	0.20
	ID	1	5	11	28	11	16	28	0.12	6.08	0.04
	TF	0	0	0	0	0	0	100	0.58	6.23	19.54
	CPOP	0	0	0	0	0	0	100	0.38	6.08	3.31
	BUP	0	0	0	0	0	0	100	0.13	5.92	1.22
(M4)	TS	0	1	16	57	18	4	4	0.42	5.63	0.09
	NOT	0	0	3	26	16	11	44	0.56	5.61	0.12
	ID	0	0	8	41	24	17	10	0.35	4.78	0.01
	TF	25	0	2	0	1	0	72	1.46	13.64	5.63
	CPOP	0	0	0	0	2	0	98	0.59	5.72	0.26
	BUP	1	37	58	4	0	0	0	0.57	9.45	0.31

Table C.6: Distribution of $\hat{N}-N$ for models (M5)-(M8) and all methods with the noise term $\epsilon_{t}$ being an $AR(1)$ process with $\phi=0.6$ over 100 simulation runs. Also reported are the average MSE (Mean Squared Error) of the estimated signal $\hat{f}_{t}$, the average Hausdorff distance $d_{H}$ and the average computational time in seconds using 10 cores of an Apple M1 Pro with 16 GB of RAM on macOS, all over 100 simulations. Bold: methods within 10% of the highest empirical frequency of $\hat{N}-N=0$ or within 10% of the lowest empirical average $d_{H}(\times 10^{2})$. Note that TrendSegment is shortened to TS.

Model	Method	$\leq$ -3	-2	-1	0	1	2	$\geq$ 3	MSE	$d_{H} (\times 10^{2})$	time
					$\hat{N} - N$
(M5)	TS	0	0	10	40	32	10	8	0.14	3.62	0.33
	NOT	0	3	10	11	11	10	55	0.22	4.46	0.18
	ID	2	3	1	4	6	5	79	0.42	7.28	0.04
	TF	0	0	0	0	0	0	100	0.40	6.18	18.88
	CPOP	0	0	0	0	0	0	100	0.47	5.96	2.18
	BUP	0	0	0	0	0	0	100	0.24	5.90	1.24
(M6)	TS	0	3	0	65	12	11	9	0.07	3.09	0.36
	NOT	13	8	16	22	6	3	32	0.19	8.10	0.15
	ID	39	26	2	0	19	1	13	0.28	24.64	0.04
	TF	0	0	0	0	0	0	100	0.69	9.91	197.40
	CPOP	0	0	0	0	0	0	100	0.52	9.50	2.91
	BUP	0	0	0	0	0	0	100	0.30	9.41	1.36
(M7)	TS	25	25	26	14	5	4	1	0.40	11.82	0.13
	NOT	3	4	4	11	13	3	62	0.49	8.03	0.11
	ID	10	3	9	20	19	15	24	0.47	13.15	0.02
	TF	24	0	2	0	0	0	74	1.47	13.14	9.90
	CPOP	0	0	0	0	0	0	100	0.59	7.02	0.44
	BUP	3	30	42	21	4	0	0	0.35	8.97	0.46
(M8)	TS	0	0	0	63	7	26	4	0.03	10.42	0.19
	NOT	0	0	0	7	8	4	81	0.19	37.77	0.10
	ID	0	0	0	96	3	0	1	0.01	0.61	0.03
	TF	0	0	0	0	0	0	100	1.10	49.94	15.54
	CPOP	0	0	0	0	1	0	99	0.38	45.54	2.08
	BUP	0	0	0	0	0	0	100	0.11	47.44	0.85

Table C.7: Distribution of $\hat{N}-N$ for models (M1)-(M4) and all methods with $\varepsilon_{t}$ being an $AR(1)$ process with $\phi=0.3$ and noise term following $t_{5}$ over 100 simulation runs. Also reported are the average MSE (Mean Squared Error) of the estimated signal $\hat{f}_{t}$, the average Hausdorff distance $d_{H}$ and the average computational time in seconds using 10 cores of an Apple M1 Pro with 16 GB of RAM on macOS, all over 100 simulations. Bold: methods within 10% of the highest empirical frequency of $\hat{N}-N=0$ or within 10% of the lowest empirical average $d_{H}(\times 10^{2})$. Note that TrendSegment is shortened to TS.

Model	Method	$\leq$ -3	-2	-1	0	1	2	$\geq$ 3	MSE	$d_{H} (\times 10^{2})$	time
					$\hat{N} - N$
(M1)	TS	0	0	0	100	0	0	0	0.07	1.72	0.10
	NOT	0	0	0	90	7	2	1	0.06	1.61	0.09
	ID	0	0	0	49	29	9	13	0.06	2.08	0.01
	TF	17	0	0	0	1	0	82	0.13	9.94	1.64
	CPOP	0	0	0	100	0	0	0	0.03	0.87	0.05
	BUP	100	0	0	0	0	0	0	3.14	12.19	0.14
(M2)	TS	0	0	2	94	2	0	2	0.04	1.32	0.25
	NOT	0	0	0	95	4	1	0	0.03	1.10	0.16
	ID	0	0	0	47	25	14	14	0.08	1.18	0.02
	TF	0	0	0	0	0	0	100	0.14	2.38	12.86
	CPOP	0	0	0	100	0	0	0	0.04	0.82	1.38
	BUP	100	0	0	0	0	0	0	1.22	5.08	0.69
(M3)	TS	0	0	0	99	0	0	1	0.01	2.52	0.31
	NOT	0	0	0	88	9	2	1	0.01	2.04	0.24
	ID	0	0	0	56	29	9	6	0.01	2.42	0.02
	TF	0	0	0	0	0	0	100	0.04	6.23	20.08
	CPOP	0	0	0	99	0	1	0	0.00	0.67	21.78
	BUP	5	82	13	0	0	0	0	0.14	11.18	1.10
(M4)	TS	0	0	0	100	0	0	0	0.03	2.00	0.10
	NOT	0	0	0	88	8	4	0	0.03	1.81	0.18
	ID	0	0	0	54	27	15	4	0.05	2.02	0.01
	TF	5	0	0	0	0	0	95	0.13	7.75	5.95
	CPOP	0	0	0	100	0	0	0	0.03	1.62	0.34
	BUP	85	15	0	0	0	0	0	0.85	12.55	0.30

Table C.8: Distribution of $\hat{N}-N$ for models (M5)-(M8) and all methods with $\varepsilon_{t}$ being an $AR(1)$ process with $\phi=0.3$ and noise term following $t_{5}$ over 100 simulation runs. Also reported are the average MSE (Mean Squared Error) of the estimated signal $\hat{f}_{t}$, the average Hausdorff distance $d_{H}$ and the average computational time in seconds using 10 cores of an Apple M1 Pro with 16 GB of RAM on macOS, all over 100 simulations. Bold: methods within 10% of the highest empirical frequency of $\hat{N}-N=0$ or within 10% of the lowest empirical average $d_{H}(\times 10^{2})$. Note that TrendSegment is shortened to TS.

Model	Method	$\leq$ -3	-2	-1	0	1	2	$\geq$ 3	MSE	$d_{H} (\times 10^{2})$	time
					$\hat{N} - N$
(M5)	TS	0	0	0	98	0	0	2	0.01	0.66	0.34
	NOT	0	12	7	66	6	4	5	0.03	0.74	0.17
	ID	0	0	0	0	0	2	98	0.20	5.02	0.03
	TF	0	0	0	0	0	0	100	0.27	5.92	18.55
	CPOP	0	0	0	50	38	9	3	0.02	1.43	3.27
	BUP	16	84	0	0	0	0	0	0.16	0.89	1.12
(M6)	TS	0	0	1	97	0	0	2	0.01	0.20	0.37
	NOT	0	3	44	48	2	0	3	0.05	0.80	0.14
	ID	0	0	0	0	0	0	100	0.09	0.39	0.03
	TF	0	0	0	0	0	0	100	0.07	9.90	21.12
	CPOP	0	0	0	72	24	4	0	0.01	0.09	2.01
	BUP	23	36	30	11	0	0	0	0.18	0.27	1.42
(M7)	TS	32	35	26	6	0	0	1	0.13	11.69	0.13
	NOT	0	0	0	74	20	4	2	0.01	0.81	0.11
	ID	2	2	3	10	16	33	34	0.35	10.75	0.01
	TF	0	0	0	0	0	0	100	0.14	6.24	11.24
	CPOP	0	0	0	10	30	28	32	0.02	0.65	0.76
	BUP	100	0	0	0	0	0	0	0.25	12.50	0.53
(M8)	TS	0	0	0	97	0	0	3	0.00	1.50	0.21
	NOT	0	0	0	88	4	7	1	0.00	3.46	0.09
	ID	0	0	0	100	0	0	0	0.00	0.00	0.03
	TF	0	0	0	0	0	0	100	0.11	49.93	15.78
	CPOP	0	0	0	99	0	1	0	0.00	0.05	5.50
	BUP	0	0	0	0	100	0	0	0.00	39.45	0.84

Table C.9: Distribution of $\hat{N}-N$ for models (M1)-(M4) and all methods with $\varepsilon_{t}$ being an $AR(1)$ process with $\phi=0.6$ and noise term following $t_{5}$ over 100 simulation runs. Also reported are the average MSE (Mean Squared Error) of the estimated signal $\hat{f}_{t}$, the average Hausdorff distance $d_{H}$ and the average computational time in seconds using 10 cores of an Apple M1 Pro with 16 GB of RAM on macOS, all over 100 simulations. Bold: methods within 10% of the highest empirical frequency of $\hat{N}-N=0$ or within 10% of the lowest empirical average $d_{H}(\times 10^{2})$. Note that TrendSegment is shortened to TS.

Model	Method	$\leq$ -3	-2	-1	0	1	2	$\geq$ 3	MSE	$d_{H} (\times 10^{2})$	time
					$\hat{N} - N$
(M1)	TS	0	0	1	99	0	0	0	0.15	2.18	0.10
	NOT	0	0	0	55	12	8	25	0.17	2.90	0.10
	ID	0	0	0	7	9	11	73	0.15	3.98	0.01
	TF	43	0	1	0	0	0	56	0.29	18.34	1.74
	CPOP	0	0	0	100	0	0	0	0.10	1.21	0.06
	BUP	100	0	0	0	0	0	0	3.03	11.77	0.15
(M2)	TS	1	7	19	72	0	0	1	0.11	2.11	0.24
	NOT	0	0	0	59	8	7	26	0.09	1.68	0.17
	ID	0	0	0	15	11	14	60	0.11	1.80	0.03
	TF	0	0	0	0	0	0	100	0.25	2.38	12.96
	CPOP	0	0	0	83	15	1	1	0.07	1.13	1.34
	BUP	100	0	0	0	0	0	0	1.15	5.42	0.71
(M3)	TS	0	5	17	77	0	0	1	0.04	4.38	0.31
	NOT	0	0	0	16	7	8	69	0.05	5.16	0.22
	ID	0	0	0	24	18	14	44	0.03	4.20	0.03
	TF	0	0	0	0	0	0	100	0.10	6.23	19.46
	CPOP	0	0	0	96	2	2	0	0.01	1.39	19.42
	BUP	0	43	49	8	0	0	0	0.10	9.61	1.10
(M4)	TS	0	0	3	97	0	0	0	0.08	2.82	0.09
	NOT	0	0	0	22	14	10	54	0.12	4.68	0.18
	ID	0	0	0	27	19	21	33	0.09	3.71	0.01
	TF	38	0	0	1	1	0	60	0.29	17.07	6.15
	CPOP	0	0	0	95	3	2	0	0.05	2.03	0.35
	BUP	72	28	0	0	0	0	0	0.79	12.41	0.31

Table C.10: Distribution of $\hat{N}-N$ for models (M5)-(M8) and all methods with $\varepsilon_{t}$ being an $AR(1)$ process with $\phi=0.6$ and noise term following $t_{5}$ over 100 simulation runs. Also reported are the average MSE (Mean Squared Error) of the estimated signal $\hat{f}_{t}$, the average Hausdorff distance $d_{H}$ and the average computational time in seconds using 10 cores of an Apple M1 Pro with 16 GB of RAM on macOS, all over 100 simulations. Bold: methods within 10% of the highest empirical frequency of $\hat{N}-N=0$ or within 10% of the lowest empirical average $d_{H}(\times 10^{2})$. Note that TrendSegment is shortened to TS.

Model	Method	$\leq$ -3	-2	-1	0	1	2	$\geq$ 3	MSE	$d_{H} (\times 10^{2})$	time
					$\hat{N} - N$
(M5)	TS	0	2	15	81	1	0	1	0.03	1.11	0.33
	NOT	0	5	10	37	7	4	37	0.05	2.80	0.18
	ID	0	0	0	0	0	1	99	0.19	4.57	0.04
	TF	0	0	0	0	0	0	100	0.26	6.02	18.55
	CPOP	0	0	0	29	38	28	5	0.03	1.55	3.44
	BUP	13	86	1	0	0	0	0	0.16	1.32	1.17
(M6)	TS	0	0	0	99	0	0	1	0.01	0.10	0.36
	NOT	1	2	36	46	4	3	8	0.06	1.98	0.14
	ID	0	2	0	0	8	1	89	0.12	3.02	0.04
	TF	0	0	0	0	0	0	100	0.15	9.90	20.25
	CPOP	0	0	0	81	17	1	1	0.01	0.15	2.92
	BUP	0	2	24	74	0	0	0	0.08	0.24	1.22
(M7)	TS	73	25	1	0	0	0	1	0.20	12.44	0.12
	NOT	0	0	0	7	5	16	72	0.08	4.50	0.11
	ID	0	0	2	3	7	8	80	0.24	10.13	0.01
	TF	0	0	0	0	0	0	100	0.29	6.26	11.19
	CPOP	0	0	0	2	17	24	57	0.07	4.48	0.84
	BUP	100	0	0	0	0	0	0	0.26	12.51	0.52
(M8)	TS	0	0	0	99	0	0	1	0.00	0.50	0.20
	NOT	0	0	0	3	2	13	82	0.04	40.33	0.10
	ID	0	0	0	86	4	1	9	0.00	2.89	0.02
	TF	0	0	0	0	0	0	100	0.22	49.93	15.67
	CPOP	0	0	0	97	0	3	0	0.00	0.72	8.40
	BUP	0	0	0	0	80	20	0	0.00	40.07	0.83

Table C.11: The default thresholding constant $C$ with its range examined for five scenarios.

$ε_{t}$	$C_{min}^{η}$	$C_{med}^{η}$	$C_{max}^{η}$	$C_{min}^{d_{H}}$	$C_{med}^{d_{H}}$	$C_{max}^{d_{H}}$
(i)	1.2	1.4	1.5	1.4	1.5	1.7
(ii)	1.4	1.5	1.6	1.3	1.5	1.5
(iii)	1.5	1.5	1.5	1.4	1.4	1.4
(iv)	1.8	1.8	1.8	1.3	1.3	1.3
(v)	1.0	1.6	1.9	1.0	1.4	1.9
(vi)	1.4	1.5	1.7	1.4	1.5	1.7

Equations (171)

$$X_{t} = f_{t} + \varepsilon_{t}, \quad t = 1, \dots, T,$$

$$f_{t} = \theta_{\ell, 1} + \theta_{\ell, 2}\, t \quad \text{for } t \in [\eta_{\ell - 1} + 1, \eta_{\ell}],\ \ell = 1, \dots, N + 1, \quad \text{where } f_{\eta_{\ell}} + \theta_{\ell, 2} \neq f_{\eta_{\ell} + 1} \text{ for } \ell = 1, \dots, N.$$

$$\begin{pmatrix} s_{1, T}^{[1]} \\ s_{1, T}^{[2]} \\ d^{(j, k)}_{j = 1, \dots, J,\, k = 1, \dots, K(j)} \end{pmatrix} = \begin{pmatrix} \psi^{(0, 1)} \\ \psi^{(0, 2)} \\ \psi^{(j, k)}_{j = 1, \dots, J,\, k = 1, \dots, K(j)} \end{pmatrix} \begin{pmatrix} X_{1} \\ X_{2} \\ \vdots \\ X_{T} \end{pmatrix} = \Psi_{T \times T} \begin{pmatrix} X_{1} \\ X_{2} \\ \vdots \\ X_{T} \end{pmatrix},$$

$$\sum_{t = 1}^{T} X_{t}^{2} = \sum_{j = 1}^{J} \sum_{k = 1}^{K(j)} \big(d^{(j, k)}\big)^{2} + \big(s_{1, T}^{[1]}\big)^{2} + \big(s_{1, T}^{[2]}\big)^{2}.$$

$$\sum_{t = 1}^{T} X_{t}^{2} - \big(s_{1, T}^{[1]}\big)^{2} - \big(s_{1, T}^{[2]}\big)^{2} = \sum_{t = 1}^{T} (X_{t} - \hat{X}_{t})^{2}$$

$$\sum_{t = 1}^{T} (X_{t} - \hat{X}_{t})^{2} = \sum_{j = 1}^{J} \sum_{k = 1}^{K(j)} \big(d^{(j, k)}\big)^{2}.$$
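The energy identities listed above can be checked numerically. The sketch below is not the TGUW transform itself: it assumes that $\hat{X}_{t}$ denotes the least-squares linear fit to the data and that the two smooth coefficients are projections onto an orthonormal basis of the span of the constant and linear vectors. The QR factorisation and the arbitrary orthonormal completion used for the "detail" coefficients are stand-ins chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 50
t = np.arange(1, T + 1, dtype=float)
X = 0.5 + 0.1 * t + rng.normal(size=T)   # toy data: one linear trend plus noise

# Orthonormal basis of span{constant, linear}; an assumed stand-in for
# the vectors psi^(0,1) and psi^(0,2).
A = np.column_stack([np.ones(T), t])
Q, _ = np.linalg.qr(A)                   # reduced QR: Q has shape (T, 2)
s1, s2 = Q.T @ X                         # the two smooth coefficients

# Least-squares linear fit X_hat and its residual sum of squares.
coef, *_ = np.linalg.lstsq(A, X, rcond=None)
X_hat = A @ coef
rss = np.sum((X - X_hat) ** 2)

# Total energy minus the two squared smooth coefficients.
lhs = np.sum(X ** 2) - s1 ** 2 - s2 ** 2

# Any orthonormal completion of the basis gives T - 2 "detail" coefficients
# whose squared sum equals the same residual energy (arbitrary basis, not TGUW).
Q_full, _ = np.linalg.qr(A, mode="complete")
d = Q_full[:, 2:].T @ X

print(np.isclose(lhs, rss), np.isclose(np.sum(d ** 2), rss))
```

Under these assumptions both comparisons hold up to floating-point error, which is the content of the last two identities: the energy not captured by the two smooth coefficients is exactly the residual of the linear fit, and it is carried entirely by the detail coefficients.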

$$d_{p, q, r}^{\cdot} = a\, s_{p : r}^{1} + b\, s_{p : r}^{2} + c\, s_{p : r}^{3},$$

$$\begin{pmatrix} s_{p, r}^{[1]} \\ s_{p, r}^{[2]} \\ d_{p, q, r}^{\cdot} \end{pmatrix} = \begin{pmatrix} \ell_{1}^{\top} \\ \ell_{2}^{\top} \\ h^{\top} \end{pmatrix} \begin{pmatrix} s_{p : r}^{1} \\ s_{p : r}^{2} \\ s_{p : r}^{3} \end{pmatrix} = \Lambda \begin{pmatrix} s_{p : r}^{1} \\ s_{p : r}^{2} \\ s_{p : r}^{3} \end{pmatrix}.$$

$$\hat{\mu}_{0}^{(j,k)} = d^{(j,k)}_{p,q,r} \cdot \mathbb{I}\Big\{\, \exists (j',k') \in \mathcal{C}_{j,k} \quad \big| d^{(j',k')}_{p',q',r'} \big| > \lambda \,\Big\},$$

$$\mathcal{C}_{j, k} = \big\{ (j', k'),\ j' = 1, \dots, j,\ k' = 1, \dots, K(j') : d_{p', q', r'}^{(j', k')} \text{ is such that } [p', r'] \subseteq [p, r] \big\}.$$
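The thresholding rule for $\hat{\mu}_{0}^{(j,k)}$ keeps a detail coefficient whenever some coefficient in its set $\mathcal{C}_{j,k}$ (itself included) exceeds $\lambda$ in absolute value. A minimal sketch of that rule follows; the `Detail` record and the function name `connected_threshold` are hypothetical and not taken from the trendsegmentR package, and the coefficients are made-up numbers rather than output of the transform.

```python
from dataclasses import dataclass


@dataclass
class Detail:
    j: int      # scale at which the coefficient was produced
    p: int      # left end of the interval [p, r]
    r: int      # right end of the interval [p, r]
    d: float    # value of the detail coefficient


def connected_threshold(details, lam):
    """Return mu0 for each coefficient: d if any coefficient at scale <= j
    whose interval is nested in [p, r] exceeds lam in absolute value, else 0."""
    out = []
    for c in details:
        survives = any(
            o.j <= c.j and c.p <= o.p and o.r <= c.r and abs(o.d) > lam
            for o in details
        )
        out.append(c.d if survives else 0.0)
    return out


# Illustrative (made-up) coefficients.
details = [
    Detail(j=1, p=1, r=3, d=2.5),
    Detail(j=1, p=5, r=7, d=0.2),
    Detail(j=2, p=1, r=7, d=0.4),
    Detail(j=2, p=8, r=12, d=0.3),
]
print(connected_threshold(details, lam=1.0))  # [2.5, 0.0, 0.4, 0.0]
```

The third coefficient is small but survives because its interval contains that of the first, which exceeds the threshold; this is the sense in which the rule keeps surviving coefficients connected across scales.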

$$\hat{\mu}^{(j, k)} =$$

$$\hat{\mu}^{(j, k)} =$$

$$\tilde{f} = \text{TGUW}^{-1}\big\{\, \hat{\mu}^{(j,k)},\ j = 1, \ldots, J,\ k = 1, \ldots, K(j) \;\|\; s^{[1]}_{1,T},\, s^{[2]}_{1,T} \,\big\},$$

$$\tilde{\tilde{f}}_{t} = \hat{\theta}_{i,1} + \hat{\theta}_{i,2}\, t \quad \text{for} \quad t \in \big[\tilde{\tilde{\eta}}_{i-1} + 1,\ \tilde{\tilde{\eta}}_{i}\big], \quad i = 1, \ldots, \tilde{\tilde{N}},$$

$$\mathbb{P}\bigg( \|\tilde{f} - f\|_{T}^{2} \leq C_{1}^{2}\, \frac{1}{T}\, \log(T)\, \Big\{ 4 + 8N \lceil \log(T) / \log(1-\rho)^{-1} \rceil \Big\} \bigg) \rightarrow 1,$$

$$\mathbb{P}\bigg( \hat{N} = N, \quad \max_{i = 1, \ldots, N} \bigg\{ |\hat{\eta}_{i} - \eta_{i}| \cdot \Big( \underline{f}_{T}^{i} \Big)^{2/3} \bigg\} \leq C\, T^{1/3} R_{T}^{1/3} \bigg) \rightarrow 1,$$

$$\lambda^{\text{Naïve}} = C \sigma \sqrt{2 \log T},$$

λ^{Robust} = C I g (K) 2 lo g T,

$$\hat{K} = \frac{\sum_{t = 1}^{T} \big( \hat{\varepsilon}_{t} - \bar{\hat{\varepsilon}} \big)^{4}}{T\, \hat{s}_{\hat{\varepsilon}}^{4}},$$

$$I = (1 + \phi) / (1 - \phi),$$
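The two quantities just listed, the kurtosis estimate $\hat{K}$ of the residuals and the factor $I$ for an AR(1) parameter $\phi$, are simple to compute. The sketch below is illustrative only: the function names are hypothetical, and it assumes that the residual variance $\hat{s}_{\hat{\varepsilon}}^{2}$ uses divisor $T$, which the formula above does not specify. For an AR(1) process, $(1+\phi)/(1-\phi)$ is the ratio of the long-run variance to the marginal variance.

```python
import numpy as np


def sample_kurtosis(resid):
    """K_hat: fourth central moment of the residuals over the squared variance."""
    e = np.asarray(resid, dtype=float)
    centred = e - e.mean()
    s2 = np.mean(centred ** 2)            # assumption: variance with divisor T
    return np.sum(centred ** 4) / (e.size * s2 ** 2)


def ar1_inflation(phi):
    """I = (1 + phi) / (1 - phi) for a stationary AR(1) parameter phi."""
    if not -1.0 < phi < 1.0:
        raise ValueError("phi must lie in (-1, 1) for a stationary AR(1) process")
    return (1.0 + phi) / (1.0 - phi)


rng = np.random.default_rng(0)
print(sample_kurtosis(rng.normal(size=10_000)))   # close to 3 for Gaussian residuals
print(ar1_inflation(0.3), ar1_inflation(0.6))     # the two phi values used in Tables C.3-C.10
```

Heavier-tailed residuals such as $t_{5}$ noise give a larger $\hat{K}$, and stronger autocorrelation gives a larger $I$.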

$$d_{H} = \frac{1}{T}\, \mathbb{E} \max\bigg\{ \max_{i} \min_{j} \big| \eta_{i} - \hat{\eta}_{j} \big|, \quad \max_{j} \min_{i} \big| \hat{\eta}_{j} - \eta_{i} \big| \bigg\}$$
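For a single simulation run, the quantity inside the expectation in $d_{H}$ can be computed directly from the true and estimated change-point locations; averaging it over runs then approximates the expectation. The following is a minimal sketch with a hypothetical function name. It deliberately does not cover the case where either set of change-points is empty, since the formula above does not say which convention applies there.

```python
import numpy as np


def hausdorff_scaled(true_cps, est_cps, T):
    """Scaled Hausdorff distance between two non-empty sets of change-points."""
    eta = np.asarray(true_cps, dtype=float)
    eta_hat = np.asarray(est_cps, dtype=float)
    if eta.size == 0 or eta_hat.size == 0:
        raise ValueError("both sets of change-points must be non-empty")
    # gaps[i, j] = |eta_i - eta_hat_j|
    gaps = np.abs(eta[:, None] - eta_hat[None, :])
    worst_true = gaps.min(axis=1).max()   # max_i min_j |eta_i - eta_hat_j|
    worst_est = gaps.min(axis=0).max()    # max_j min_i |eta_hat_j - eta_i|
    return max(worst_true, worst_est) / T


# Two true change-points, three estimates, one of them spurious.
print(hausdorff_scaled([100, 200], [98, 205, 250], T=300))
```

In this example the spurious estimate at 250 dominates: every true change-point has a nearby estimate, but the estimate at 250 is 50 observations from the nearest true change-point, so the per-run value is 50/300.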

$$\|\tilde{f} - f\|_{T}^{2} =$$

$$\big( d^{(j,k)} \cdot \mathbb{I}\{\mathcal{B}\} - \mu^{(j,k)} \big)^{2} =$$

$$\frac{1}{T} \sum_{j = \eta_{\ell} - r_{\ell,T}}^{\eta_{\ell} + r_{\ell,T}} \big( \tilde{\tilde{f}}_{j} - f_{j} \big)^{2} \geq \frac{13\, r_{\ell,T}^{3}}{24\, T} \big( \underline{f}_{T}^{\ell} \big)^{2} > \frac{C^{3}}{4} R_{T}.$$

$$A_{T} = \bigg\{ \max_{g_{l} \in G} |g_{l}^{\top} \boldsymbol{\varepsilon}| \leq \lambda \bigg\},$$

Type 1: $\psi_{p, q, r}^{(j, k)}$

Type 2: $\psi_{p, q, r}^{(j, k)}$

Type 3: $\psi_{p, q, r}^{(j, k)}$

$$+\ \gamma_{3} \big( \underbrace{0, \dots, 0}_{q \times 1},\ \ell_{1, q + 1, r}^{\top},\ \underbrace{0, \dots, 0}_{(T - r) \times 1} \big) + \gamma_{4} \big( \underbrace{0, \dots, 0}_{q \times 1},\ \ell_{2, q + 1, r}^{\top},\ \underbrace{0, \dots, 0}_{(T - r) \times 1} \big),$$

Code & Models

Repositories

hmaeng/trendsegment (official)

Taxonomy

Topics: Statistical and numerical algorithms · Climate variability and models · Metabolomics and Mass Spectrometry Studies

Full text

Detecting linear trend changes in data sequences

Hyeyoung Maeng (Department of Mathematical Sciences, Durham University) and Piotr Fryzlewicz (Department of Statistics, London School of Economics)

Abstract

We propose TrendSegment, a methodology for detecting multiple change-points corresponding to linear trend changes in one-dimensional data. A core ingredient of TrendSegment is a new Tail-Greedy Unbalanced Wavelet transform: a conditionally orthonormal, bottom-up transformation of the data through an adaptively constructed unbalanced wavelet basis, which results in a sparse representation of the data. Due to its bottom-up nature, this multiscale decomposition focuses on local features in its early stages and on global features later, which enables the detection of both long and short linear trend segments at once. To reduce the computational complexity, the proposed method merges multiple regions in a single pass over the data. We show the consistency of the estimated number and locations of change-points. The practicality of our approach is demonstrated through simulations and two real data examples, involving Iceland temperature data and sea ice extent of the Arctic and the Antarctic. Our methodology is implemented in the R package trendsegmentR, available from CRAN.

Keywords: change-point detection, bottom-up algorithms, piecewise-linear signal, wavelets

1 Introduction

Multiple change-point detection is a problem of importance in many applications; recent examples include automatic detection of change-points in cloud data to maintain the performance and availability of an app or a website (James et al., 2016), climate change detection in tropical cyclone records (Robbins et al., 2011), detecting exoplanets from light curve data (Fisch et al., 2018), detecting changes in the DNA copy number (Olshen et al., 2004; Jeng et al., 2012; Bardwell et al., 2017), estimation of stationary intervals in potentially cointegrated stock prices (Matteson et al., 2013), estimation of change-points in multi-subject fMRI data (Robinson et al., 2010) and detecting changes in vegetation trends (Jamali et al., 2015).

This paper considers the change-point model

$X_{t}=f_{t}+\varepsilon_{t},\quad t=1,\ldots,T,$ (1)

where $f_{t}$ is a deterministic and piecewise-linear signal containing $N$ change-points, i.e. time indices at which the slope and/or the intercept in $f_{t}$ undergoes changes. These changes occur at unknown locations $\eta_{1},\eta_{2},\ldots,\eta_{N}$ . In this article, we assume that the $\varepsilon_{t}$ ’s are iid N(0, $\sigma^{2}$ ) and in the supplementary material, we show how our method can be extended to dependent and/or non-Gaussian noise such as $\varepsilon_{t}$ following a stationary Gaussian AR process or t-distribution. The true change-points $\{\eta_{i}\}_{i=1}^{N}$ are such that,

[TABLE]

This definition permits both continuous and discontinuous changes in the linear trend.

Our main interest is in the estimation of $N$ and $\eta_{1},\eta_{2},\ldots,\eta_{N}$ under some assumptions that quantify the difficulty of detecting each $\eta_{i}$ ; therefore, our aim is to segment the data into sections of linearity in $f_{t}$ . In detail, a change-point located close to its neighbouring ones can only be detected when it has a large enough size of linear trend change, while a change-point capturing a small size of linear trend change requires a longer distance from its adjacent change-points to be detected. Detecting linear trend changes is an important applied problem in a variety of fields, including climate change, as illustrated in Section 5.

The change-point detection procedure proposed in this paper is referred to as TrendSegment; it is designed to work well in the presence of either long or short spacings between neighbouring change-points, or a mixture of both. The engine underlying TrendSegment is a new Tail-Greedy Unbalanced Wavelet (TGUW) transform: a conditionally orthonormal, bottom-up transformation for univariate data sequences through an adaptively constructed unbalanced wavelet basis, which results in a sparse representation of the data. In this article, we show that TrendSegment offers good performance in estimating the number and locations of change-points across a wide range of signals containing constant and/or linear segments. TrendSegment is also shown to be statistically consistent and computationally efficient.

In earlier related work regarding linear trend changes, Bai and Perron (1998) consider the estimation of linear models with multiple structural changes by least-squares and present Wald-type tests for the null hypothesis of no change. Kim et al. (2009) and Tibshirani et al. (2014) consider ‘trend filtering’ with the $L_{1}$ penalty and Fearnhead et al. (2019) detect changes in the slope with an $L_{0}$ regularisation via a dynamic programming algorithm. Spiriti et al. (2013) study two algorithms for optimising the knot locations in least-squares and penalised splines. Baranowski et al. (2019) propose a multiple change-point detection device termed Narrowest-Over-Threshold (NOT), which focuses on the narrowest segment among those whose contrast exceeds a pre-specified threshold. Anastasiou and Fryzlewicz (2022) propose the Isolate-Detect (ID) approach which continuously searches expanding data segments for changes. Yu et al. (2022) propose a two-step algorithm for detecting multiple change-points in piecewise polynomials with general degrees.

Keogh et al. (2004) mention that sliding windows, top-down and bottom-up approaches are the three principal categories into which most time series segmentation algorithms can be grouped. Keogh et al. (2004) apply those three approaches to the detection of changes in linear trends in 10 different signals and discover that the performance of bottom-up methods is better than that of top-down methods and sliding windows, notably when the underlying signal has jumps, sharp cusps or large fluctuations. Bottom-up procedures have rarely been used in change-point detection. Matteson and James (2014) use an agglomerative algorithm for hierarchical clustering in the context of change-point analysis. Keogh et al. (2004) merge adjacent segments of the data according to a criterion involving the minimum residual sum of squares (RSS) from a linear fit, until the RSS falls under a certain threshold; but the lack of precise recipes for the choice of this threshold parameter causes the performance of this method to be somewhat unstable, as we report in Section 4.

As illustrated later in this paper, our TGUW transform, which underlies TrendSegment, is designed to work well in detecting frequent change-points or abrupt local features in which many existing change-point detection methods for the piecewise-linear model fail. The TGUW transform constructs, in a bottom-up way, an adaptive wavelet basis by consecutively merging neighbouring segments of the data starting from the finest level (throughout the paper, we refer to a wavelet basis as adaptive if it is constructed in a data-driven way). This enables it to identify local features at an early stage, before it proceeds to focus on more global features corresponding to longer data segments.

Fryzlewicz (2018) introduces the Tail-Greedy Unbalanced Haar (TGUH) transform, a bottom-up, agglomerative, data-adaptive transformation of univariate sequences that facilitates change-point detection in the piecewise-constant sequence model. The current paper extends this idea to adaptive wavelets other than adaptive Haar, which enables change-point detection in the piecewise-linear model (and, in principle, to higher-order piecewise polynomials, where the details can be found in Section G of the supplementary material). We emphasise that this extension from TGUH to TGUW is both conceptually and technically non-trivial, due to the fact that it is not a priori clear how to construct a suitable wavelet basis in TGUW for wavelets other than adaptive Haar; this is due to the non-uniqueness of the local orthonormal matrix transformation for performing each merge in TGUW, which does not occur in TGUH. We solve this issue by imposing certain guiding principles in the way the merges are performed, which enables detecting not only long trend segments, but also frequent change-points including abrupt local features. The computational cost of TGUW is the same as that of TGUH. Important properties of the TGUW transform include orthonormality conditional on the merging order, nonlinearity and “tail-greediness”, and will be investigated in Section 2. The TGUW transform is the first step of the TrendSegment procedure, which involves four steps.

The remainder of the article is organised as follows. Section 2 gives a full description of the TrendSegment procedure and the relevant theoretical results are presented in Section 3. The supporting simulation studies are described in Section 4 and our methodology is illustrated in Section 5 through climate datasets. The proofs of our main theoretical results are in Appendix A. The supplementary material includes theoretical results for dependent and/or non-Gaussian noise, extension to piecewise-quadratic signal, details of robust threshold selection and extra simulation and data application results. The TrendSegment procedure is implemented in the R package trendsegmentR, available from CRAN.

2 Methodology

2.1 Summary of TrendSegment

The TrendSegment procedure for estimating the number and the locations of change-points includes four steps. We give the broad picture first and outline details in later sections.

1. TGUW transformation. Perform the TGUW transform, a bottom-up unbalanced adaptive wavelet transformation of the input data $X_{1},\ldots,X_{T}$ , by recursively applying local conditionally orthonormal transformations. This produces a data-adaptive multiscale decomposition of the data with $T-2$ detail-type coefficients and 2 smooth coefficients. The resulting conditionally orthonormal transform of the data is intended to encode most of the energy of the signal in only a few detail-type coefficients arising at coarse levels (see Figure 1 for an example output). This representation sparsity justifies thresholding in the next step.

2. Thresholding. Set to zero those detail coefficients whose magnitude is smaller than a pre-specified threshold as long as all the non-zero detail coefficients are connected to each other in the tree structure. This step performs “pruning” as a way of deciding the significance of the sparse representation obtained in step 1.

3. Inverse TGUW transformation. Obtain an initial estimate of $f_{t}$ by carrying out the inverse TGUW transformation of the thresholded coefficient tree. The resulting estimator is discontinuous at the estimated change-points. It can be shown to be $l_{2}$ -consistent, but not yet consistent for $N$ or $\eta_{1},\ldots,\eta_{N}$ .

4. Post-processing. Post-process the estimate from step 3 by removing some change-points perceived to be spurious, which enables us to achieve estimation consistency for $N$ and $\eta_{1},\ldots,\eta_{N}$ .

Figure 2 illustrates the first three steps of the TrendSegment procedure. We devote the following four sections to describing each step above in order.

2.2 TGUW transformation

2.2.1 Key principles of the TGUW transform

In the initial stage, the data are considered smooth coefficients and the TGUW transform iteratively updates the sequence of smooth coefficients by merging the adjacent sections of the data which are the most likely to belong to the same segment. The merging is done by performing an adaptively constructed orthonormal transformation to the chosen triplet of the smooth coefficients and in doing so, a data-adaptive unbalanced wavelet basis is established. The TGUW transform is completed after $T-2$ such orthonormal transformations and each merge is performed under the following principles.

1. In each merge, three adjacent smooth coefficients are selected and the orthonormal transformation converts those three values into one detail and two (updated) smooth coefficients. The size of the detail coefficient gives information about the strength of the local linearity and the two updated smooth coefficients are associated with the estimated parameters (intercept and slope) of the local linear regression performed on the raw observations corresponding to the initially chosen three smooth coefficients.

2. “Two together” rule. The two smooth coefficients returned by the orthonormal transformation are paired in the sense that both contain information about one local linear regression fit. Thus, we require that any such pair of smooth coefficients cannot be separated when choosing triplets in any subsequent merges. We refer to this recipe as the “two together” rule.

3. To decide which triplet of smooth coefficients should be merged next, we compare the corresponding detail coefficients as their magnitude represents the strength of the corresponding local linear trend; the smaller the (absolute) size of the detail, the smaller the local deviation from linearity. Smooth coefficients corresponding to the smallest detail coefficients have priority in merging.

As merging continues under the “two together” rule, all mergings can be classified into one of three forms:

• Type 1: merging three initial smooth coefficients,

• Type 2: merging one initial and a paired smooth coefficient,

• Type 3: merging two sets of (paired) smooth coefficients,

where Type 3 is composed of two merges of triplets and more details are given in Section 2.2.2.

2.2.2 Example

We now provide a simple example of the TGUW transformation; the accompanying illustration is in Figure 3. The notation for this example and for the general algorithm introduced later is in Table 1. This example shows a single merge at each pass through the data, as happens when the algorithm runs in a purely greedy way. We will later generalise it to multiple merges in each pass through the data, which will speed up computation (this device is referred to as “tail-greediness” as the algorithm merges those triplets corresponding to the lower tail of the distribution of local deviation from linearity in $\boldsymbol{X}$ ). We refer to the $j^{\text{th}}$ pass through the data as scale $j$ . Assume that we have the initial input $\boldsymbol{s}^{0}=(X_{1},X_{2},\ldots,X_{8})$ , so that the complete TGUW transform consists of 6 merges. We show 6 example merges one by one under the rules introduced in Section 2.2.1. This example demonstrates all three possible types of merges.

Scale $j=1$ . From the initial input $\boldsymbol{s}^{0}=(X_{1},\ldots,X_{8})$ , we consider 6 triplets $(X_{1},X_{2},X_{3})$ , $(X_{2},X_{3},X_{4})$ , $(X_{3},X_{4},X_{5})$ , $(X_{4},X_{5},X_{6})$ , $(X_{5},X_{6},X_{7})$ , $(X_{6},X_{7},X_{8})$ and compute the size of the detail for each triplet, where the formula can be found in (7). Suppose that $(X_{2},X_{3},X_{4})$ gives the smallest size of detail, $|d_{2,3,4}|$ , then merge $(X_{2},X_{3},X_{4})$ through the orthogonal transformation formulated in (8) and update the data sequence into $\boldsymbol{s}=(X_{1},s^{[1]}_{2,4},s^{[2]}_{2,4},d_{2,3,4},X_{5},X_{6},X_{7},X_{8})$ . We categorise this transformation into Type 1 (merging three initial smooth coefficients).

Scale $j=2$ . From now on, the “two together” rule is applied. Ignoring any detail coefficients in $\boldsymbol{s}$ , the possible triplets for next merging are $(X_{1},s^{[1]}_{2,4},s^{[2]}_{2,4})$ , $(s^{[1]}_{2,4},s^{[2]}_{2,4},X_{5})$ , $(X_{5},X_{6},X_{7})$ , $(X_{6},X_{7},X_{8})$ . We note that $(s^{[2]}_{2,4},X_{5},X_{6})$ cannot be considered as a candidate for next merging under the “two together” rule as this triplet contains only one (not both) of the paired smooth coefficients returned by the previous merging. Assume that $(X_{5},X_{6},X_{7})$ gives the smallest size of detail coefficient $|d_{5,6,7}|$ among the four candidates, then we merge them through the orthogonal transformation formulated in (8) and now update the sequence into $\boldsymbol{s}=(X_{1},s^{[1]}_{2,4},s^{[2]}_{2,4},d_{2,3,4},s^{[1]}_{5,7},s^{[2]}_{5,7},d_{5,6,7},X_{8})$ . This transformation is also Type 1.

Scale $j=3$ . We now compare four candidates for merging, $(X_{1},s^{[1]}_{2,4},s^{[2]}_{2,4})$ , $(s^{[1]}_{2,4},s^{[2]}_{2,4},s^{[1]}_{5,7})$ , $(s^{[2]}_{2,4},s^{[1]}_{5,7},s^{[2]}_{5,7})$ and $(s^{[1]}_{5,7},s^{[2]}_{5,7},X_{8})$ . The two triplets in the middle, $(s^{[1]}_{2,4},s^{[2]}_{2,4},s^{[1]}_{5,7})$ and $(s^{[2]}_{2,4},s^{[1]}_{5,7},s^{[2]}_{5,7})$ , are paired together as they contain two sets of paired smooth coefficients, $(s^{[1]}_{2,4},s^{[2]}_{2,4})$ and $(s^{[1]}_{5,7},s^{[2]}_{5,7})$ , and if we were to treat these two triplets separately, we would be violating the “two together” rule. The summary detail coefficient for this pair of triplets is obtained as $d_{2,4,7}=\max(|d^{[1]}_{2,4,7}|,|d^{[2]}_{2,4,7}|)$ , which is compared with those of the other triplets. Now suppose that $(X_{1},s^{[1]}_{2,4},s^{[2]}_{2,4})$ has the smallest size of detail; we merge this triplet and update the data sequence into $\boldsymbol{s}=(s^{[1]}_{1,4},s^{[2]}_{1,4},d_{1,1,4},d_{2,3,4},s^{[1]}_{5,7},s^{[2]}_{5,7},d_{5,6,7},X_{8})$ . This transformation is of Type 2.

Scale $j=4$ . We now have two pairs of paired coefficients: $(s^{[1]}_{1,4},s^{[2]}_{1,4})$ and $(s^{[1]}_{5,7},s^{[2]}_{5,7})$ . Therefore, with the “two together” rule in mind, the only possible options for merging are: to merge the two pairs into $(s^{[1]}_{1,4},s^{[2]}_{1,4},s^{[1]}_{5,7},s^{[2]}_{5,7})$ , or to merge $(s^{[1]}_{5,7},s^{[2]}_{5,7})$ with $X_{8}$ . Suppose that the second merging is preferred. We then perform a Type 2 merge and update the data sequence into $\boldsymbol{s}=(s^{[1]}_{1,4},s^{[2]}_{1,4},d_{1,1,4},d_{2,3,4},s^{[1]}_{5,8},s^{[2]}_{5,8},d_{5,6,7},d_{5,7,8})$ .

Scale $j=5$ . The only remaining step is merging $(s^{[1]}_{1,4},s^{[2]}_{1,4})$ and $(s^{[1]}_{5,8},s^{[2]}_{5,8})$ into $(s^{[1]}_{1,4},s^{[2]}_{1,4},s^{[1]}_{5,8},s^{[2]}_{5,8})$ . This transformation is Type 3 and performed in two stages as follows. In the first stage, we merge $(s^{[1]}_{1,4},s^{[2]}_{1,4},s^{[1]}_{5,8})$ and then update the sequence temporarily as $\boldsymbol{s}=(s^{[1^{\prime}]}_{1,8},s^{[2^{\prime}]}_{1,8},d_{1,1,4},d_{2,3,4},d^{[1]}_{1,4,8},s^{[2]}_{5,8},d_{5,6,7},d_{5,7,8})$ . In the second stage, we merge $(s^{[1^{\prime}]}_{1,8},s^{[2^{\prime}]}_{1,8},s^{[2]}_{5,8})$ , which gives the updated sequence $\boldsymbol{s}=(s^{[1]}_{1,8},s^{[2]}_{1,8},d_{1,1,4},d_{2,3,4},d^{[1]}_{1,4,8},d^{[2]}_{1,4,8},d_{5,6,7},d_{5,7,8})$ . The transformation is now completed with the updated data sequence which contains $T-2=6$ detail and $2$ smooth coefficients.
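The first pass of this example (scale $j=1$) can be sketched in a few lines of Python. This is only an illustration, not the trendsegmentR implementation: it assumes the Type 1 detail filter $(1,-2,1)/\sqrt{6}$ (the unit normal to the plane $x-2y+z=0$ described in Section 2.2.4), uses 0-based indices, and the data values and the helper names `type1_detail` and `first_pass` are hypothetical.

```python
import math

def type1_detail(x1, x2, x3):
    """Detail coefficient of three raw observations, assuming the
    Type 1 filter (1, -2, 1) / sqrt(6); it is zero for a linear triplet."""
    return (x1 - 2.0 * x2 + x3) / math.sqrt(6.0)

def first_pass(x):
    """Return (start_index, |detail|) for every triplet of adjacent observations."""
    return [(p, abs(type1_detail(x[p], x[p + 1], x[p + 2])))
            for p in range(len(x) - 2)]

# Hypothetical input of length 8, chosen so that (X_2, X_3, X_4) is exactly linear.
x = [0.5, 1.0, 2.0, 3.0, 5.0, 4.0, 6.0, 9.0]
candidates = first_pass(x)                      # 6 candidate triplets
best_start, best_size = min(candidates, key=lambda c: c[1])
```

With these values the triplet starting at 0-based index 1, that is $(X_{2},X_{3},X_{4})$, has the smallest detail and would be merged first, as supposed in the example.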

2.2.3 Some important features of TGUW transformation

Before formulating the TGUW transformation in generality, we describe how it achieves sparse representation of the data. Sometimes, we will be referring to a detail coefficient $d_{p,q,r}^{\cdot}$ as $d_{p,q,r}^{(j,k)}$ or $d^{(j,k)}$ , where $j=1,\ldots,J$ is the scale of the transform (i.e. the consecutive pass through the data) at which $d_{p,q,r}^{\cdot}$ was computed, $k=1,\ldots,K(j)$ is the location index of $d_{p,q,r}^{\cdot}$ within all scale $j$ coefficients, and $d_{p,q,r}^{\cdot}$ is $d_{p,q,r}^{[1]}$ or $d_{p,q,r}^{[2]}$ or $d_{p,q,r}$ , depending on the type of merge.

The TGUW transform eventually converts the input data sequence $\boldsymbol{X}$ of length $T$ into the sequence containing 2 smooth and $T-2$ detail coefficients through $T-2$ orthonormal transforms as follows,

[TABLE]

where $\Psi$ is a data-adaptively chosen orthonormal unbalanced wavelet basis for $\mathbb{R}^{T}$ . The detail coefficients $d^{(j,k)}$ can be regarded as scalar products between $\boldsymbol{X}$ and a particular unbalanced wavelet basis $\psi^{(j,k)}$ , where the formal representation is given as $\{d^{(j,k)}=\langle X,\psi^{(j,k)}\rangle\}_{j=1,\ldots,J,\;k=1,\ldots,K(j)}$ for detail coefficients and $s^{[1]}_{1,T}=\langle X,\psi^{(0,1)}\rangle$ , $s^{[2]}_{1,T}=\langle X,\psi^{(0,2)}\rangle$ for the two smooth coefficients.

The TGUW transform is nonlinear, but it is also conditionally linear and orthonormal given the order in which the merges are performed. The orthonormality of the unbalanced wavelet basis, $\{\psi^{(j,k)}\}$ , implies Parseval’s identity:

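$\sum_{t=1}^{T}X_{t}^{2}=\sum_{j=1}^{J}\sum_{k=1}^{K(j)}(d^{(j,k)})^{2}+(s^{[1]}_{1,T})^{2}+(s^{[2]}_{1,T})^{2}.$ (4)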

Furthermore, the filters $(\psi^{(0,1)},\psi^{(0,2)})$ corresponding to the two smooth coefficients $s^{[1]}_{1,T}$ and $s^{[2]}_{1,T}$ form an orthonormal basis of the subspace $\{(x_{1},x_{2},\ldots,x_{T})\;|\;x_{1}-x_{2}=x_{2}-x_{3}=\cdots=x_{T-1}-x_{T}\}$ of $\mathbb{R}^{T}$ ; see Section E of the supplementary materials for further details. This implies

$\sum_{t=1}^{T}X_{t}^{2}-(s^{[1]}_{1,T})^{2}-(s^{[2]}_{1,T})^{2}=\sum_{t=1}^{T}(X_{t}-\hat{X}_{t})^{2},$ (5)

where $\hat{\boldsymbol{X}}=s^{[1]}_{1,T}\psi^{(0,1)}+s^{[2]}_{1,T}\psi^{(0,2)}$ is the best linear regression fit to $\boldsymbol{X}$ achieved by minimising the sum of squared errors. This, combined with Parseval’s identity above, implies

$\sum_{t=1}^{T}(X_{t}-\hat{X}_{t})^{2}=\sum_{j=1}^{J}\sum_{k=1}^{K(j)}(d^{(j,k)})^{2}.$ (6)

By construction, the detail coefficients $|d^{(j,k)}|$ obtained in the initial stages of the TGUW transform tend to be small in magnitude. Then Parseval’s identity in (4) implies that a large portion of $\sum_{t=1}^{T}(X_{t}-\hat{X}_{t})^{2}$ is explained by only a few large $|d^{(j,k)}|$ ’s arising in the later stages of the transform; in this sense, the TGUW transform provides sparsity of signal representation.

2.2.4 TGUW transformation: general algorithm

In this section, we formulate in generality the TGUW transformation illustrated in Section 2.2.2 by showing how an adaptive orthonormal unbalanced wavelet basis, $\Psi$ in (3), is constructed. One of the important principles is “tail-greediness” (Fryzlewicz, 2018) which enables us to reduce the computational complexity by performing multiple merges over non-overlapping regions in a single pass over the data. More specifically, it allows us to perform up to $\max\{2,\lceil\rho\alpha_{j}\rceil\}$ merges at each scale $j$ , where $\alpha_{j}$ is the number of smooth coefficients in the data sequence $\boldsymbol{s}$ and $\rho\in(0,1)$ (the lower bound of 2 is essential to permit a Type 3 transformation, which consists of two merges).

We now describe the TGUW algorithm.

1. At each scale $j$ , find the set of triplets that are candidates for merging under the “two together” rule and compute the corresponding detail coefficients. Regardless of the type of merge, a detail coefficient $d_{p,q,r}^{\cdot}$ is, in general, obtained as

$d_{p,q,r}^{\cdot}=a\boldsymbol{s}_{p:r}^{1}+b\boldsymbol{s}_{p:r}^{2}+c\boldsymbol{s}_{p:r}^{3},$ (7)

where $p\leq q<r$ , $\boldsymbol{s}_{p:r}^{k}$ is the $k^{\text{th}}$ smooth coefficient of the subvector $\boldsymbol{s}_{p:r}$ with a length of $r-p+1$ and the constants $a,b,c$ are the elements of the detail filter $\boldsymbol{h}=(a,b,c)^{\top}$ . We note that $(a,b,c)$ also depends on $(p,q,r)$ , but this is not reflected in the notation, for simplicity. The detail filter is a weight vector used in computing the weighted sum of a triplet of smooth coefficients which should satisfy the condition that the detail coefficient is zero if and only if the corresponding raw observations over the merged regions have a perfect linear trend. If $(X_{p},\ldots,X_{r})$ are the raw observations associated with the triplet of the smooth coefficients $(\boldsymbol{s}_{p:r}^{1},\boldsymbol{s}_{p:r}^{2},\boldsymbol{s}_{p:r}^{3})$ under consideration, then the detail filter $\boldsymbol{h}$ is obtained in such a way as to produce zero detail coefficient only when $(X_{p},\ldots,X_{r})$ has a perfect linear trend, as the detail coefficient itself represents the extent of non-linearity in the corresponding region of data. This implies that the smaller the size of the detail coefficient, the closer the alignment of the corresponding data section with linearity.

2. Summarise all $d_{p,q,r}^{\cdot}$ constructed in step 1 to a (equal length or shorter) sequence of $d_{p,q,r}$ by finding a summary detail coefficient $d_{p,q,r}=\max(|d^{[1]}_{p,q,r}|,|d^{[2]}_{p,q,r}|)$ for any pair of detail coefficients constructed by Type 3 merges.

3. Sort the size of the summarised detail coefficients $|d_{p,q,r}|$ obtained in step 2 in non-decreasing order.

4. Extract the (non-summarised) detail coefficient(s) $\lvert d^{\cdot}_{p,q,r}\rvert$ corresponding to the smallest (summarised) detail coefficient $\lvert d_{p,q,r}\rvert$ e.g. both $\lvert d^{[1]}_{p,q,r}\rvert$ and $\lvert d^{[2]}_{p,q,r}\rvert$ should be extracted only if $d_{p,q,r}=\max(\lvert d^{[1]}_{p,q,r}\rvert,\lvert d^{[2]}_{p,q,r}\rvert)$ . Repeat the extraction until $\max\{2,\lceil\rho\alpha_{j}\rceil\}$ (or all possible, whichever is the smaller number) detail coefficients have been obtained, as long as the region of the data corresponding to each detail coefficient extracted does not overlap with the regions corresponding to the detail coefficients already drawn.

5. For each $\lvert d_{p,q,r}^{\cdot}\rvert$ extracted in step 4, merge the corresponding smooth coefficients by updating the corresponding triplet in $\boldsymbol{s}$ through the orthonormal transform as follows,

$\begin{pmatrix}s^{[1]}_{p,r}\\ s^{[2]}_{p,r}\\ d_{p,q,r}\end{pmatrix}=\begin{pmatrix}\boldsymbol{\ell}_{1}^{\top}\\ \boldsymbol{\ell}_{2}^{\top}\\ \boldsymbol{h}^{\top}\end{pmatrix}\begin{pmatrix}\boldsymbol{s}_{p:r}^{1}\\ \boldsymbol{s}_{p:r}^{2}\\ \boldsymbol{s}_{p:r}^{3}\end{pmatrix}=\Lambda\begin{pmatrix}\boldsymbol{s}_{p:r}^{1}\\ \boldsymbol{s}_{p:r}^{2}\\ \boldsymbol{s}_{p:r}^{3}\end{pmatrix}.$ (8)

The key step is finding the $3\times 3$ orthonormal matrix, $\Lambda$ , which is composed of one detail and two low-pass filter vectors in its rows. Firstly the detail filter $\boldsymbol{h}^{\top}$ is determined to satisfy the condition mentioned in step 1, and then the two low-pass filters ( $\boldsymbol{\ell}_{1}^{\top},\boldsymbol{\ell}_{2}^{\top}$ ) are obtained by satisfying the orthonormality of $\Lambda$ . There is no uniqueness in the choice of ( $\boldsymbol{\ell}_{1}^{\top},\boldsymbol{\ell}_{2}^{\top}$ ), but this has no effect on the transformation itself. The details of this mechanism can be found in Section E of the supplementary materials.

6. Go to step 1 and repeat at new scale $j=j+1$ as long as we have at least three smooth coefficients in the updated data sequence $\boldsymbol{s}$ .
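Steps 3 and 4 (sorting and tail-greedy extraction of non-overlapping regions) can be sketched as follows. This is a simplified illustration under stated assumptions, not the package code: each candidate is represented by a hypothetical tuple `(size, p, r)` holding its summarised detail size and data region $[p,r]$, and every candidate counts as one merge, whereas in the algorithm above a Type 3 candidate contributes two detail coefficients.

```python
import math

def select_merges(candidates, rho, n_smooth):
    """Pick candidates for one scale, smallest summarised detail first.

    candidates: list of (size, p, r); [p, r] is the data region of the merge.
    At most max(2, ceil(rho * n_smooth)) regions are chosen, and a region is
    skipped if it overlaps one that has already been chosen.
    """
    budget = max(2, math.ceil(rho * n_smooth))
    chosen = []
    for size, p, r in sorted(candidates):
        if len(chosen) >= budget:
            break
        if all(r < cp or p > cr for _, cp, cr in chosen):
            chosen.append((size, p, r))
    return chosen

# Hypothetical detail sizes for the six triplets of a length-8 input.
cands = [(0.5, 1, 3), (0.0, 2, 4), (1.0, 3, 5),
         (3.0, 4, 6), (3.0, 5, 7), (1.0, 6, 8)]
chosen = select_merges(cands, rho=0.25, n_smooth=8)
```

Here the budget is 2, so the regions $[2,4]$ and $[6,8]$ are merged in the same pass, while $[1,3]$ and $[3,5]$ are skipped because they overlap $[2,4]$.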

More specifically, when a Type 1 merge is performed in step 1 (i.e. $\boldsymbol{s}_{p:r}$ in (7) consists of three initial smooth coefficients, which implies $r=p+2$ ), the corresponding detail filter $\boldsymbol{h}$ is obtained as a unit normal vector to the plane $\{(x,y,z)|x-2y+z=0\}$ , thus the detail coefficient $d$ is the projection of the three initial smooth coefficients onto this unit normal vector. In the same manner, due to the orthonormality of $\Lambda$ in (8), the two low-pass filters ( $\boldsymbol{\ell}_{1}^{\top},\boldsymbol{\ell}_{2}^{\top}$ ) form an arbitrary orthonormal basis of the plane $\{(x,y,z)|x-2y+z=0\}$ . In practice, the detail filter $\boldsymbol{h}$ in step 1 is obtained by updating so-called weight vectors of constancy and linearity in which the initial inputs have a form of $(1,1,\ldots,1)^{\top}$ and $(1,2,\ldots,T)^{\top}$ , respectively. The details can be found in Section F of the supplementary materials.
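For a Type 1 merge, one admissible $\Lambda$ can be written down explicitly. The sketch below assumes the detail filter $(1,-2,1)/\sqrt{6}$ and picks one particular orthonormal basis of the plane $x-2y+z=0$ for the low-pass filters; since that choice is not unique, the rows `l1` and `l2` are an illustrative assumption rather than the ones used in the package.

```python
import math

# Detail filter: unit normal to the plane x - 2y + z = 0.
h = [1 / math.sqrt(6), -2 / math.sqrt(6), 1 / math.sqrt(6)]
# One (arbitrary) orthonormal basis of that plane, used as low-pass filters.
l1 = [1 / math.sqrt(3)] * 3
l2 = [-1 / math.sqrt(2), 0.0, 1 / math.sqrt(2)]
LAMBDA = [l1, l2, h]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def merge(triplet):
    """Apply LAMBDA to three smooth coefficients; returns [s1, s2, d]."""
    return [dot(row, triplet) for row in LAMBDA]

# Gram matrix of the rows; it should be the 3 x 3 identity.
gram = [[dot(r1, r2) for r2 in LAMBDA] for r1 in LAMBDA]

s1, s2, d = merge([2.0, 4.0, 6.0])   # perfectly linear triplet: d is 0
t1, t2, e = merge([2.0, 4.0, 9.0])   # non-linear triplet: e is non-zero
```

The detail vanishes for the linear triplet, and for the second triplet the energy $2^{2}+4^{2}+9^{2}=101$ is preserved across the three output coefficients.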

We now comment briefly on the computational complexity of the TGUW transform. Assume that $\alpha_{j}$ smooth coefficients are available in the data sequence $\boldsymbol{s}$ at scale $j$ and we allow the algorithm to merge up to $\lceil\rho\alpha_{j}\rceil$ many triplets (unless their corresponding data regions overlap) where $\rho\in(0,1)$ is a constant. This gives us at most $(1-\rho)^{j}T$ smooth coefficients remaining in $\boldsymbol{s}$ after $j$ scales. Solving for $(1-\rho)^{j}T\leq 2$ gives the largest number of scales $J$ as $\Big\lceil\log(T)/\log\big((1-\rho)^{-1}\big)+\log(2)/\log(1-\rho)\Big\rceil$ , at which point the TGUW transform terminates with two smooth coefficients remaining. Considering that the most expensive step at each scale is sorting which takes $O(T\log(T))$ operations, the computational complexity of the TGUW transformation is $O(T\log^{2}(T))$ .
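The bound on the number of scales can be checked numerically. The sketch below simply evaluates the formula for $J$ given above; the values $T=1000$ and $\rho=0.04$ are arbitrary illustrative choices, and the function names are hypothetical.

```python
import math

def max_scales(T, rho):
    """Largest number of scales J from solving (1 - rho)^j * T <= 2."""
    return math.ceil(math.log(T) / math.log(1.0 / (1.0 - rho))
                     + math.log(2.0) / math.log(1.0 - rho))

def bound_after(T, rho, j):
    """Upper bound on the number of smooth coefficients left after j scales."""
    return (1.0 - rho) ** j * T

T, rho = 1000, 0.04          # illustrative values only
J = max_scales(T, rho)
```

For these values $J=153$: the bound first drops to 2 or below at scale 153 and is still above 2 at scale 152.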

2.3 Thresholding

Because at each stage, the TGUW transform constructs the smallest possible detail coefficients, but it is at the same time orthonormal and so preserves the $l_{2}$ energy of the input data, the variability (= deviation from linearity) of the signal tends to be mainly encoded in only a few detail coefficients computed at the later stages of the transform. The resulting sparsity of representation of the input data in the domain of TGUW coefficients justifies thresholding as a way of deciding the significance of each detail coefficient (which measures the local deviation from linearity).

We propose to threshold the TGUW detail coefficients under two important rules, which should simultaneously be satisfied; we refer to these as the “connected” rule and the “two together” rule. The “two together” rule in thresholding is similar to the one in the TGUW transformation except it targets pairs of detail rather than smooth coefficients, and only applies to pairs of detail coefficients arising from Type 3 merges. Figure 4(b) shows one such pair in the example of Section 2.2.2, $(d_{1,4,8}^{[1]},d_{1,4,8}^{[2]})$ , and the “two together” rule means that both such detail coefficients should be kept if at least one survives the initial thresholding. This is a natural requirement as a pair of Type 3 detail coefficients effectively corresponds to a single merge of two adjacent regions.

The “connected” rule prunes a branch of the TGUW detail coefficients if and only if the detail coefficient itself and all of its children coefficients fall below a certain threshold in absolute value. This is illustrated in Figure 4(a) along with the example of Section 2.2.2; if both $d_{2,3,4}$ and $(d_{1,4,8}^{[1]},d_{1,4,8}^{[2]})$ were to survive the initial thresholding, the “connected” rule would mean we also had to keep $d_{1,1,4}$ , which is the child of $(d_{1,4,8}^{[1]},d_{1,4,8}^{[2]})$ and the parent of $d_{2,3,4}$ in the TGUW coefficient tree.
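The “connected” rule amounts to a bottom-up pass over the coefficient tree. The sketch below is illustrative only: it uses the tree of the example in Section 2.2.2 with hypothetical coefficient values, treats the Type 3 pair $(d_{1,4,8}^{[1]},d_{1,4,8}^{[2]})$ as a single node for brevity, and the dictionary-based tree representation is an assumption.

```python
def connected_keep(children, values, lam):
    """keep[node] is True if |value| > lam at the node or at any descendant."""
    keep = {}

    def visit(node):
        # Visit every child (a list, not a generator, so none is skipped).
        below = [visit(c) for c in children.get(node, [])]
        keep[node] = abs(values[node]) > lam or any(below)
        return keep[node]

    all_children = {c for cs in children.values() for c in cs}
    for root in set(values) - all_children:
        visit(root)
    return keep

# Tree of the Section 2.2.2 example; the values are hypothetical.
children = {"d_148": ["d_114", "d_578"], "d_114": ["d_234"], "d_578": ["d_567"]}
values = {"d_148": 5.0, "d_114": 0.3, "d_234": 2.1, "d_578": 0.2, "d_567": 0.1}
keep = connected_keep(children, values, lam=1.0)
```

With threshold 1.0, `d_114` is kept even though it is small, because its child `d_234` survives; the branch `d_578`, `d_567` is pruned.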

Through the thresholding, we wish to estimate the underlying signal $f$ in (1) by estimating $\mu^{(j,k)}=\langle f,\psi^{(j,k)}\rangle$ where $\psi^{(j,k)}$ is an orthonormal unbalanced wavelet basis constructed in the TGUW transform from the data. Throughout the entire thresholding procedure, the “connected” and “two together” rules are applied in this order. We firstly threshold and apply the “connected” rule, which gives us $\hat{\mu}_{0}^{(j,k)}$ , the initial estimator of $\mu^{(j,k)}$ , as

[TABLE]

where $\mathbb{I}$ is an indicator function and

[TABLE]

Now the “two together” rule is applied to the initial estimators $\hat{\mu}_{0}^{(j,k)}$ to obtain the final estimators $\hat{\mu}^{(j,k)}$ . We firstly note that two detail coefficients, $d^{(j,k)}_{p,q,r}$ and $d^{(j^{\prime},k+1)}_{p^{\prime},q^{\prime},r^{\prime}}$ are called “paired” when they are formed by Type 3 mergings and when $(j,p,q,r)=(j^{\prime},p^{\prime},q^{\prime},r^{\prime})$ . The “two together” rule is formulated as below,

[TABLE]

It is important to note that the application of the two rules ensures that $\tilde{f}$ is a piecewise-linear function composed of best linear fits (in the least-squares sense) for each interval of linearity. As an aside, we note that the number of surviving detail coefficients does not necessarily equal the number of change-points in $\tilde{f}$ as a pair of detail coefficients arising from a Type 3 merge are associated with a single change-point.
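The “two together” rule in thresholding can be sketched as a small post-pass over the initial estimators. This is a hedged illustration with hypothetical names and values: `details` holds the detail coefficients, `survived` flags which ones passed the thresholding and the “connected” rule, and `pairs` lists the Type 3 pairs.

```python
def apply_two_together(details, survived, pairs):
    """Keep both members of a Type 3 pair if at least one of them survived."""
    final = {k: (details[k] if survived[k] else 0.0) for k in details}
    for a, b in pairs:
        if survived[a] or survived[b]:
            final[a], final[b] = details[a], details[b]
    return final

# Hypothetical values: only the first member of the pair exceeds the threshold.
details = {"d1_148": 4.0, "d2_148": 0.4, "d_114": 0.3}
survived = {"d1_148": True, "d2_148": False, "d_114": False}
final = apply_two_together(details, survived, pairs=[("d1_148", "d2_148")])
```

The second member of the pair is restored to its original value, while the unpaired coefficient that did not survive stays at zero.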

2.4 Inverse TGUW transformation

The estimator $\tilde{f}$ of the true signal $f$ in (1) is obtained by inverting (= transposing) the orthonormal transformations in (8) in reverse order to that in which they were originally performed. This inverse TGUW transformation is referred to as $\text{TGUW}^{-1}$ , and thus

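$\tilde{f}=\text{TGUW}^{-1}\big\{\hat{\mu}^{(j,k)},\;j=1,\ldots,J,\;k=1,\ldots,K(j)\;\|\;s^{[1]}_{1,T},s^{[2]}_{1,T}\big\},$ (12)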

where $\|$ denotes vector concatenation.
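For a single Type 1 merge, inversion by transposition can be illustrated directly. The sketch below assumes the same illustrative choice of $\Lambda$ as a unit normal to $x-2y+z=0$ plus one arbitrary orthonormal basis of that plane; it is not the package implementation. Setting the detail coefficient to zero before inverting returns the least-squares line through the three points.

```python
import math

r6, r3, r2 = math.sqrt(6), math.sqrt(3), math.sqrt(2)
# Rows: two low-pass filters (one arbitrary choice), then the detail filter.
LAMBDA = [[1 / r3, 1 / r3, 1 / r3],
          [-1 / r2, 0.0, 1 / r2],
          [1 / r6, -2 / r6, 1 / r6]]

def forward(x):
    """Orthonormal transform of a triplet: returns [s1, s2, d]."""
    return [sum(LAMBDA[i][k] * x[k] for k in range(3)) for i in range(3)]

def inverse(coeffs):
    """Inverse transform, i.e. multiplication by the transpose of LAMBDA."""
    return [sum(LAMBDA[i][k] * coeffs[i] for i in range(3)) for k in range(3)]

x = [1.0, 2.0, 6.0]                 # hypothetical observations at t = 1, 2, 3
s1, s2, d = forward(x)
exact = inverse([s1, s2, d])        # recovers x
fitted = inverse([s1, s2, 0.0])     # detail thresholded to zero
```

Here `fitted` is approximately `[0.5, 3.0, 5.5]`, which is the ordinary least-squares line $-2+2.5t$ evaluated at $t=1,2,3$.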

2.5 Post-processing for consistency of change-point detection

As will be formalised in Theorem 1 of Section 3, the piecewise-linear estimator $\tilde{f}$ in (12) possibly overestimates the number of change-points. To remove the spurious estimated change-points and to achieve the consistency of the number and the locations of the estimated change-points, we adopt the post-processing framework of Fryzlewicz (2018). Lin et al. (2017) show that we can usually post-process $l_{2}$ -consistent estimators in this way as a fast enough $l_{2}$ error rate implies that each true change-point has an estimator nearby. The post-processing methodology includes two stages: i) applying the three steps (TGUW transform, thresholding and inverse TGUW transform) again, this time to the estimator $\tilde{f}$ in (12), and ii) examining regions containing only one estimated change-point to check for its significance.

Stage 1.

We transform the estimated function $\tilde{f}$ in (12) with change-points $(\tilde{\eta}_{1},\tilde{\eta}_{2},\ldots,\tilde{\eta}_{\tilde{N}})$ into a new estimator $\tilde{\tilde{f}}$ with corresponding change-points $(\tilde{\tilde{\eta}}_{1},\tilde{\tilde{\eta}}_{2},\ldots,\tilde{\tilde{\eta}}_{\tilde{\tilde{N}}})$ . Using $\tilde{f}$ in (12) as an input data sequence $\boldsymbol{s}$ , we perform the TGUW transform as presented in Section 2.2.4, but in a greedy rather than tail-greedy way such that only one detail coefficient $d^{(j,1)}$ is produced at each scale $j$ , and thus $K(j)=1$ for all $j$ . We keep producing detail coefficients until the first one with $|d^{(j,1)}|>\lambda$ is obtained, where $\lambda$ is the parameter used in the thresholding procedure described in Section 2.3. Once the condition $|d^{(j,1)}|>\lambda$ is satisfied, we stop merging, relabel the surviving change-points as $(\tilde{\tilde{\eta}}_{1},\tilde{\tilde{\eta}}_{2},\ldots,\tilde{\tilde{\eta}}_{\tilde{\tilde{N}}})$ and construct the new estimator $\tilde{\tilde{f}}$ as

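$$\tilde{\tilde{f}}_{t}=\hat{\theta}_{i,1}+\hat{\theta}_{i,2}\,t\quad\text{for}\quad t\in\big[\tilde{\tilde{\eta}}_{i-1}+1,\tilde{\tilde{\eta}}_{i}\big],\quad i=1,\ldots,\tilde{\tilde{N}}+1,\qquad(13)$$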

where $\tilde{\tilde{\eta}}_{0}=0$ , $\tilde{\tilde{\eta}}_{\tilde{\tilde{N}}+1}=T$ and ( $\hat{\theta}_{i,1},\hat{\theta}_{i,2}$ ) are the OLS intercept and slope coefficients, respectively, for the corresponding pairs $\{(t,X_{t}),\;t\in\big{[}\tilde{\tilde{\eta}}_{i-1}+1,\tilde{\tilde{\eta}}_{i}\big{]}\}$ . The exception is when the region under consideration only contains a single data point $X_{t_{0}}$ , in which case fitting a linear regression is impossible. We then set $\tilde{\tilde{f}}_{t_{0}}=X_{t_{0}}$ .
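The segment-wise least-squares fit described above, including the single-point exception, can be sketched as follows. This is a minimal pure-Python illustration rather than the trendsegmentR implementation; the function names `ols_line` and `piecewise_linear_fit` are hypothetical, and change-points are assumed to be distinct integers strictly between $0$ and $T$.

```python
def ols_line(ts, xs):
    """Return (intercept, slope) of the least-squares line through the (t, x) pairs."""
    n = len(ts)
    t_bar = sum(ts) / n
    x_bar = sum(xs) / n
    sxx = sum((t - t_bar) ** 2 for t in ts)
    sxy = sum((t - t_bar) * (x - x_bar) for t, x in zip(ts, xs))
    slope = sxy / sxx
    return x_bar - slope * t_bar, slope


def piecewise_linear_fit(x, change_points):
    """Fit a separate OLS line on each segment [eta_{i-1}+1, eta_i], with time t = 1..T."""
    T = len(x)
    bounds = [0] + sorted(change_points) + [T]  # conventions eta_0 = 0, eta_{N+1} = T
    fit = [0.0] * T
    for left, right in zip(bounds[:-1], bounds[1:]):
        ts = list(range(left + 1, right + 1))
        xs = [x[t - 1] for t in ts]
        if len(ts) == 1:
            # A single data point cannot support a regression: keep the observation.
            fit[left] = xs[0]
            continue
        intercept, slope = ols_line(ts, xs)
        for t in ts:
            fit[t - 1] = intercept + slope * t
    return fit
```

Calling `piecewise_linear_fit(x, [3, 4])` on a series of length 7 fits lines on $t\in[1,3]$ and $t\in[5,7]$ and returns $X_{4}$ unchanged at $t=4$.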

Stage 2.

From the estimator $\tilde{\tilde{f}}_{t}$ in Stage 1, we obtain the final estimator $\hat{f}$ by pruning the change-points $(\tilde{\tilde{\eta}}_{1},\tilde{\tilde{\eta}}_{2},\ldots,\tilde{\tilde{\eta}}_{\tilde{\tilde{N}}})$ in $\tilde{\tilde{f}}_{t}$ . For each $i=1,\ldots,\tilde{\tilde{N}}$ , compute the corresponding detail coefficient $d_{p_{i},q_{i},r_{i}}$ as described in (7), where $p_{i}=\Big{\lfloor}\frac{\tilde{\tilde{\eta}}_{i-1}+\tilde{\tilde{\eta}}_{i}}{2}\Big{\rfloor}+1$ , $q_{i}=\tilde{\tilde{\eta}}_{i}$ and $r_{i}=\Big{\lceil}\frac{\tilde{\tilde{\eta}}_{i}+\tilde{\tilde{\eta}}_{i+1}}{2}\Big{\rceil}$ . Now prune by finding the minimiser $i_{0}=\operatorname*{arg\,min}_{i}{|d_{p_{i},q_{i},r_{i}}|}$ and, if $|d_{p_{i_{0}},q_{i_{0}},r_{i_{0}}}|\leq\lambda$ where $\lambda$ is the same as in Section 2.3, removing $\tilde{\tilde{\eta}}_{i_{0}}$ and setting $\tilde{\tilde{N}}:=\tilde{\tilde{N}}-1$ . Then relabel the change-points with the subscripts $i=1,\ldots,\tilde{\tilde{N}}$ under the convention $\tilde{\tilde{\eta}}_{0}=0$ , $\tilde{\tilde{\eta}}_{\tilde{\tilde{N}}+1}=T$ . Repeat the pruning while we can find $i_{0}$ which satisfies the condition $\big{|}d_{p_{i_{0}},q_{i_{0}},r_{i_{0}}}\big{|}<\lambda$ . Otherwise, stop, and denote by $\hat{N}$ the number of detected change-points and by $\hat{\eta}_{i}$ the change-points in increasing order for $i=0,\ldots,\hat{N}+1$ , where $\hat{\eta}_{0}=0$ and $\hat{\eta}_{\hat{N}+1}=T$ . The estimated function $\hat{f}$ is obtained by simple linear regression on each region determined by the final change-points $\hat{\eta}_{1},\ldots,\hat{\eta}_{\hat{N}}$ as in (13), except in the case of a single data point, which is handled as described in Stage 1 above.
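The index triplets $(p_{i},q_{i},r_{i})$ and the pruning loop of Stage 2 can be sketched as below. The detail coefficient in (7) is not reproduced here, so the sketch takes it as a user-supplied callable `detail(p, q, r)`; that callable, and the names `stage2_triplets` and `prune`, are hypothetical and are not part of the trendsegmentR interface. The stopping rule uses the $\leq\lambda$ form of the condition.

```python
def stage2_triplets(change_points, T):
    """Return (p_i, q_i, r_i) for every change-point, using eta_0 = 0 and eta_{N+1} = T."""
    eta = [0] + sorted(change_points) + [T]
    triplets = []
    for i in range(1, len(eta) - 1):
        p = (eta[i - 1] + eta[i]) // 2 + 1   # floor of the left midpoint, plus one
        q = eta[i]
        r = (eta[i] + eta[i + 1] + 1) // 2   # integer ceiling of the right midpoint
        triplets.append((p, q, r))
    return triplets


def prune(change_points, T, detail, lam):
    """Repeatedly drop the change-point with the smallest |detail| while it is <= lam."""
    cps = sorted(change_points)
    while cps:
        magnitudes = [abs(detail(p, q, r)) for p, q, r in stage2_triplets(cps, T)]
        i0 = min(range(len(cps)), key=magnitudes.__getitem__)
        if magnitudes[i0] > lam:
            break  # every remaining change-point is significant
        del cps[i0]  # triplets are recomputed for the relabelled change-points
    return cps
```

Because the triplets are recomputed after each removal, the neighbours of a removed change-point are re-examined over their enlarged regions, which mirrors the relabelling step in the text.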

Through these two stages of post-processing, the estimation of the number and the locations of change-points becomes consistent; further details can be found in Section 3.

3 Theoretical results

We study the $l_{2}$ consistency of $\tilde{f}$ and $\tilde{\tilde{f}}$ , and the change-point detection consistency of $\hat{f}$ , where the estimators are defined in Section 2. The $l_{2}$ risk of an estimator $\tilde{f}$ is defined as $\big{\|}\tilde{f}-f\big{\|}_{T}^{2}=T^{-1}\sum_{i=1}^{T}(\tilde{f}_{i}-f_{i})^{2}$ , where $f$ is the underlying signal as in (1). We first investigate the $l_{2}$ behaviour of $\tilde{f}$ . The proofs of Theorems 1-3 can be found in Appendix A.

Theorem 1

*$X_{t}$ follows model (1) with $\sigma=1$ and $\tilde{f}$ is the estimator in (12). If the threshold $\lambda=C_{1}\{2\log(T)\}^{1/2}$ with a constant $C_{1}\geq\sqrt{3}$ , then we have*

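$$\mathbb{P}\Big(\big\|\tilde{f}-f\big\|_{T}^{2}\leq C_{1}^{2}\,T^{-1}\log(T)\,\Big\{4+8N\,\big\lceil\log(T)/\log((1-\rho)^{-1})+\log(2)/\log(1-\rho)\big\rceil\Big\}\Big)\rightarrow 1\qquad(14)$$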

as $T\rightarrow\infty$ and the piecewise-linear estimator $\tilde{f}$ contains $\tilde{N}\leq CN\log(T)$ change-points where $C$ is a constant.

Thus, $\tilde{f}$ is $l_{2}$ consistent under the strong sparsity assumption (i.e. if $N$ is finite) or even under the relaxed condition that $N$ has the order of $\log T$ . The crucial mechanism of $l_{2}$ consistency is the “tail-greediness” which allows up to $K(j)\geq 1$ smooth coefficients to be removed at each scale $j$ . In other words, consistency is generally unachievable if we proceed in a greedy (as opposed to tail-greedy) way, i.e. if we only merge one triplet at each scale of the TGUW transformation.

We now move on to the estimator $\tilde{\tilde{f}}$ obtained in the first stage of post-processing.

Theorem 2

*$X_{t}$ follows model (1) with $\sigma=1$ and $\tilde{\tilde{f}}$ is the estimator in (13). Let the threshold $\lambda$ be as in Theorem 1. Then we have $\big{\|}\tilde{\tilde{f}}-f\big{\|}_{T}^{2}\;=\;O\big{(}NT^{-1}\log^{2}(T)\big{)}$ with probability approaching $1$ as $T\rightarrow\infty$ and there exist at most two estimated change-points between each pair of true change-points $(\eta_{i},\eta_{i+1})$ for $i=0,\ldots,N$ , where $\eta_{0}=0$ and $\eta_{N+1}=T$ . Therefore $\tilde{\tilde{N}}\leq 2(N+1)$ .*

We see that $\tilde{\tilde{f}}$ is $l_{2}$ consistent, but inconsistent for the number of change-points. Now we investigate the final estimators, $\hat{f}$ and $\hat{N}$ .

Theorem 3

*$X_{t}$ follows model (1) with $\sigma=1$ and ( $\hat{f}$ , $\hat{N}$ ) are the estimators obtained in Section 2.5. Let the threshold $\lambda$ be as in Theorem 1 and suppose that the number of true change-points, $N$ , has the order of $\log T$ . Let $\Delta_{T}=\min_{i=1,\ldots,N}\Big{\{}\Big{(}\underaccent{\bar}{f}_{T}^{i}\Big{)}^{2/3}\cdot\delta_{T}^{i}\Big{\}}$ where $\underaccent{\bar}{f}_{T}^{i}=\min\Big{(}|f_{\eta_{i+1}}-2f_{\eta_{i}}+f_{\eta_{i-1}}|,|f_{\eta_{i+2}}-2f_{\eta_{i+1}}+f_{\eta_{i}}|\Big{)}$ and $\delta_{T}^{i}=\min\Big{(}|\eta_{i}-\eta_{i-1}|,|\eta_{i+1}-\eta_{i}|\Big{)}$ . Assume that $T^{1/3}R_{T}^{1/3}=o\Big{(}\Delta_{T}\Big{)}$ where $\big{\|}\tilde{\tilde{f}}-f\big{\|}_{T}^{2}=O_{p}(R_{T})$ is as in Theorem 2. Then we have*

[TABLE]

as $T\rightarrow\infty$ where $C$ is a constant.

Our theory indicates that when $\min_{i}\underaccent{\bar}{f}_{T}^{i}\sim T^{-1}$ , the change-point detection rate of the TrendSegment procedure is $O_{p}(T^{2/3}\log T)$ . If the number of true change-points, $N$ , is finite, then the detection accuracy becomes $O_{p}(T^{2/3}(\log T)^{2/3})$ . Compared with the rate of $O_{p}(T^{2/3}(\log T)^{1/3})$ derived by Baranowski et al. (2019) and Anastasiou and Fryzlewicz (2022), and with the rate of $O_{p}(T^{2/3})$ derived by Raimondo (1998), our detection accuracy differs by only a logarithmic factor. In the case in which $\min_{i}\underaccent{\bar}{f}_{T}^{i}$ is bounded away from zero, the consistent estimation of the number and locations of change-points is achieved by assuming $T^{1/3}R_{T}^{1/3}=o(\delta_{T})$ where $\delta_{T}=\min_{i=1,\ldots,N+1}|\eta_{i}-\eta_{i-1}|$ and $R_{T}=NT^{-1}\log^{2}(T)$ . In addition, when there exists a separate data segment containing only one data point, the two consecutive change-points, $\eta_{k-1}$ and $\eta_{k}$ , linked via $\eta_{k-1}=\eta_{k}-1$ under the definition of a change-point in (2), can be detected exactly at their true locations only if the corresponding $\underaccent{\bar}{f}_{T}^{i}$ s satisfy the condition $\min\Big{(}\underaccent{\bar}{f}_{T}^{k},\underaccent{\bar}{f}_{T}^{k-1}\Big{)}\gtrsim\log(T)$ .
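The two rates quoted above can be checked by direct substitution. The sketch below ignores constants and assumes that the localisation error scales like the radius $T^{1/3}R_{T}^{1/3}(\underaccent{\bar}{f}_{T}^{i})^{-2/3}$ used in the proof of Theorem 3 in Appendix A, with $R_{T}=NT^{-1}\log^{2}(T)$ from Theorem 2.

```latex
\begin{align*}
\text{With } \min_{i}\underaccent{\bar}{f}_{T}^{i}\sim T^{-1}:\qquad
  & \big(\underaccent{\bar}{f}_{T}^{i}\big)^{-2/3}\sim T^{2/3}. \\[4pt]
N\sim\log T:\qquad
  & R_{T}\sim T^{-1}\log^{3}T
    \;\Longrightarrow\; T^{1/3}R_{T}^{1/3}\sim\log T
    \;\Longrightarrow\; T^{1/3}R_{T}^{1/3}\big(\underaccent{\bar}{f}_{T}^{i}\big)^{-2/3}\sim T^{2/3}\log T. \\[4pt]
N \text{ finite}:\qquad
  & R_{T}\sim T^{-1}\log^{2}T
    \;\Longrightarrow\; T^{1/3}R_{T}^{1/3}\sim(\log T)^{2/3}
    \;\Longrightarrow\; T^{1/3}R_{T}^{1/3}\big(\underaccent{\bar}{f}_{T}^{i}\big)^{-2/3}\sim T^{2/3}(\log T)^{2/3}.
\end{align*}
```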

In the supplementary material, the assumptions of Gaussianity and independence of $\varepsilon_{t}$ are relaxed and the corresponding Theorems B.1-B.3 are presented in a setting in which the noise is dependent and/or non-Gaussian.

4 Simulation study

4.1 Parameter choice and setting

4.1.1 Post-processing

In what follows, we disable Stages 1 and 2 of post-processing by default: our empirical experience is that Stage 1 rarely makes a difference in practice but comes with an additional computational cost, and Stage 2 occasionally over-prunes change-point estimates.

4.1.2 Choice of the “tail-greediness” parameter

$\rho\in(0,1)$ is a constant which controls the greediness level of the TGUW transformation in the sense that it decides how many merges are performed in a single pass over the data. A large $\rho$ can reduce the computational cost but it makes the procedure less adaptive, whereas a small $\rho$ gives the opposite effect. Based on our empirical experience, the best performance is stably achieved in the range $\rho\in(0,0.05]$ and we use $\rho=0.04$ as a default in the simulation study and data analyses.

4.1.3 Choice of the minimum segment length

We can impose a condition on the minimum segment length of the estimated signal returned by the TrendSegment algorithm. If it is set to $1$ , two consecutive data points can be detected as change-points. As theoretically shown in the supplementary material, the minimum length of the estimated segment should be of order $\log(T)$ to achieve estimation consistency in the case of dependent and/or non-Gaussian errors. To avoid overly short segments, and to cover noise that is not iid Gaussian, we set the minimum segment length to $C\log(T)$ and use the default $C=0.9$ in the remainder of the paper; otherwise we would not be able to detect the short segments in (M6). This constraint can be adjusted by users in the R package trendsegmentR.

4.1.4 Continuity at change-points

As described in Section 2, the TrendSegment algorithm works by detecting change-points first (in thresholding) and then estimating the best linear fit (in the least-squares sense) for each segment (in the inverse TGUW transform). These procedures typically produce discontinuities at change-points; however, our R package trendsegmentR has an option for enforcing continuity at change-points by approximating $f$ using the linear spline fit with knots at the detected change-points.

4.1.5 Choice of threshold $\lambda$

Motivated by Theorem 1, we consider the simplest naïve threshold of the form

[TABLE]

where $\sigma$ can be estimated in different ways depending on the type of noise. Under iid Gaussian noise, we can estimate $\sigma$ using the Median Absolute Deviation (MAD) estimator (Hampel, 1974) defined as $\hat{\sigma}=\text{Median}(|X_{1}-2X_{2}+X_{3}|,\ldots,|X_{T-2}-2X_{T-1}+X_{T}|)/(\Phi^{-1}(3/4)\sqrt{6})$ where $\Phi^{-1}$ is the quantile function of the standard Gaussian distribution. We found that under iid Gaussian noise $C=1.3$ empirically leads to the best performance over a range of values of $C$ ; the details and the relevant results for non-Gaussian and/or dependent errors can be found in Section C of the supplementary material. For completeness, we now present an algorithm for a threshold that works well in all circumstances. When the noise is not iid Gaussian, it is reasonable to assume that the threshold is affected by the serial dependence structure and/or the extent of heavy-tailedness of the noise, which motivates us to use a threshold of the form:

[TABLE]

where $\mathcal{I}$ is the long-run standard deviation, $\mathcal{K}$ is kurtosis and $g$ is a function. To estimate the unknown parameters in (17), we follow Algorithm 1.
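The MAD-based estimate of $\sigma$ for iid Gaussian noise can be sketched in a few lines of standard-library Python. The second function assumes that the naïve threshold in (16) has the form $C\hat{\sigma}\{2\log(T)\}^{1/2}$ , i.e. the threshold of Theorem 1 rescaled by $\hat{\sigma}$ ; that form is our reading of the text, and the function names are hypothetical.

```python
import math
from statistics import NormalDist, median


def mad_sigma(x):
    """MAD estimate of sigma from second differences |X_t - 2 X_{t+1} + X_{t+2}|."""
    second_diffs = [abs(x[t] - 2 * x[t + 1] + x[t + 2]) for t in range(len(x) - 2)]
    # Second differences of iid N(0, sigma^2) noise have variance 6 sigma^2, and a
    # linear trend cancels exactly, hence the normalisation Phi^{-1}(3/4) * sqrt(6).
    return median(second_diffs) / (NormalDist().inv_cdf(0.75) * math.sqrt(6))


def naive_threshold(x, C=1.3):
    """Assumed form of the naive threshold: C * sigma_hat * sqrt(2 log T)."""
    return C * mad_sigma(x) * math.sqrt(2 * math.log(len(x)))
```

Second differences are used because they annihilate any locally linear signal, so $\hat{\sigma}$ is affected only by the few triples that straddle a change-point.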

We now describe the details of each step in Algorithm 1.

Pre-estimated fit in Step 1.

In (17), the heavy-tailedness and dependence structure of the noise are captured by $\mathcal{K}$ and $\mathcal{I}$ , respectively. In practice, estimating $\mathcal{I}$ and $\mathcal{K}$ is challenging as the underlying signal of the observations contains change-points. One of the most straightforward ways is to pre-estimate the fit $\hat{f}_{t}$ via the TrendSegment algorithm with a parameter $\eta_{\max}$ , the maximum number of estimated change-points. As long as $\eta_{\max}$ is not too large, some extent of overestimation is acceptable, and we use $\eta_{\max}=\lceil 0.15T\rceil$ as a default in practice, as it empirically led to the best performance and the simulation results do not vary much over the range $\eta_{\max}\in[\lceil 0.1T\rceil,\lceil 0.2T\rceil]$ . The pre-fitting gives us the estimated noise $\hat{\varepsilon}_{t}=X_{t}-\hat{f}_{t}$ , from which we can estimate both $\mathcal{I}$ and $\mathcal{K}$ .

Pre-specified constant $C$ in Step 4.

We set $C=1.3$ as it empirically led to the best performance for iid Gaussian noise with the naive approach in (16). Thus we hope to have both $\hat{\mathcal{I}}$ and $\hat{g}(\hat{\mathcal{K}})$ close to $1$ under iid Gaussian noise, but larger than $1$ when the noise has serial dependence and/or heavy-tailedness.

$\mathcal{I}$ and $\mathcal{K}$ in Step 4.

$\mathcal{I}$ and $\mathcal{K}$ capture dependency and heavy-tailedness of noise, respectively. First, kurtosis is estimated from the estimated noise as follows:

[TABLE]

where $\bar{\hat{\varepsilon}}$ and $\hat{s}_{\hat{\varepsilon}}$ are the sample mean and sample standard deviation of $\hat{\varepsilon}$ , respectively. For estimating $\mathcal{I}$ , we consider the case in which Gaussian noise has a dependence structure. The dependence then increases the marginal variance of the CUSUM statistic, and one way of addressing this issue is to inflate the threshold by the following factor

[TABLE]

where $\phi$ is the true parameter of an AR(1) process (Fearnhead and Fryzlewicz, 2022). We can estimate $\phi$ by fitting an AR(1) model to the estimated noise $\hat{\varepsilon}_{t}=X_{t}-\hat{f}_{t}$ , and this gives us the estimated long-run standard deviation $\hat{\mathcal{I}}$ . Although in theory the inflation factor in (19) is valid only for Gaussian noise, we use the estimator of (19) as an estimated long-run standard deviation even when the noise has both serial dependence and heavy-tailedness, hoping that the heavy-tailedness is captured reasonably well by $\mathcal{K}$ .
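As a rough illustration of how $\hat{\mathcal{K}}$ and $\hat{\mathcal{I}}$ might be computed from the residuals $\hat{\varepsilon}_{t}$ , consider the sketch below. It rests on three assumptions that are not confirmed by the text: that (18) is the usual standardised fourth moment, that the inflation factor in (19) is $\sqrt{(1+\phi)/(1-\phi)}$ (the ratio of the long-run to the marginal standard deviation of an AR(1) process), and that $\phi$ is estimated by the lag-1 sample autocorrelation (the Yule-Walker estimate) in place of a full AR(1) model fit. All function names are hypothetical.

```python
import math


def sample_kurtosis(eps):
    """Mean fourth power of the residuals standardised by sample mean and sample sd."""
    n = len(eps)
    mean = sum(eps) / n
    sd = math.sqrt(sum((e - mean) ** 2 for e in eps) / (n - 1))
    return sum(((e - mean) / sd) ** 4 for e in eps) / n


def ar1_coefficient(eps):
    """Lag-1 sample autocorrelation, used here as a simple estimate of phi."""
    n = len(eps)
    mean = sum(eps) / n
    centred = [e - mean for e in eps]
    lag1 = sum(a * b for a, b in zip(centred[:-1], centred[1:]))
    return lag1 / sum(c * c for c in centred)


def inflation_factor(phi):
    """Assumed form of (19): sqrt((1 + phi) / (1 - phi)), valid for |phi| < 1."""
    return math.sqrt((1 + phi) / (1 - phi))
```

Under this assumed form the factor equals $1$ when $\phi=0$ and grows with positive serial dependence, which is in line with the requirement that $\hat{\mathcal{I}}$ be close to $1$ under iid Gaussian noise.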

Kurtosis function $g$ in Step 5.

We fit a non-parametric regression as described in Step 5 of Algorithm 1 over different models and noise scenarios. We found that $g(\hat{\mathcal{K}})$ has no particular functional form in $\hat{\mathcal{K}}$ , and is scattered between $0.9$ and $1.6$ over all noise scenarios and all simulation models considered in the paper. Therefore, the resulting non-parametric fit $\hat{g}(\hat{\mathcal{K}})$ also has a flat shape over a range of $\hat{\mathcal{K}}$ , and we use this in finding the robust threshold in practice. This is due to the condition on the minimum segment length described earlier, which helps the method to be robust to spikes.

The detailed procedure for estimating $g$ is presented in Section C.2 of the supplementary material. The simulation results using Algorithm 1 for dependent and/or heavy-tailed noise can be found in Tables C.1-C.10 in Section C.1 of the supplementary material. The proposed robust threshold selection algorithm can also be applied to iid Gaussian noise without any knowledge of the type of noise, and the corresponding simulation results are given in Section 4.3.

We consider iid Gaussian noise and simulate data from model (1) using 8 signals, (M1) wave1, (M2) wave2, (M3) mix1, (M4) mix2, (M5) mix3, (M6) lin.sgmts, (M7) teeth and (M8) lin, shown in Figure 5. (M1) is continuous at change-points, while (M2) has discontinuities. (M3) contains both constant and linear segments and is continuous at change-points, whereas (M4) is of a similar type but has a mix of continuous and discontinuous change-points. (M5) has three particularly short segments containing 12, 9 and 6 data points, respectively, and (M6) has isolated spike-type short segments containing 6 data points each. (M7) is piecewise-constant, and (M8) is a linear signal without change-points. The signals and R code for all simulations can be downloaded from our GitHub repository (Maeng and Fryzlewicz, 2021) and the simulation results under dependent or heavy-tailed errors can be found in Section C of the supplementary materials.

4.2 Competing methods and estimators

We perform the TrendSegment procedure based on the parameter choice in Section 4.1 and compare the performance with that of the following competitors: Narrowest-Over-Threshold detection (NOT, Baranowski et al., 2019) implemented in the R package not from CRAN, Isolate-Detect (ID, Anastasiou and Fryzlewicz, 2022) available in the R package IDetect, trend filtering (TF, Kim et al., 2009) available from https://github.com/glmgen/genlasso, Continuous-piecewise-linear Pruned Optimal Partitioning (CPOP, Fearnhead et al., 2019) available from https://www.maths.lancs.ac.uk/~fearnhea/Publications.html and a bottom-up algorithm based on the residual sum of squares (RSS) from a linear fit (BUP, Keogh et al., 2004). The TrendSegment methodology is implemented in the R package trendsegmentR.

As BUP requires a pre-specified number of change-points (or a well-chosen stopping criterion which can vary depending on the data), we include it in the simulation study (with the stopping criterion optimised for the best performance using the knowledge of the truth) but not in data applications. We do not include the methods of Spiriti et al. (2013) and Bai and Perron (2003) implemented in the R packages freeknotsplines and strucchange as we have found them to be particularly slow. For instance, the minimum segment size in strucchange can be adjusted to be small as long as it is greater than or equal to 3 for detecting linear trend changes. This is suitable for detecting very short segments (e.g. in (M6) lin.sgmts); however, this setting is accompanied by extremely heavy computation: with this minimum segment size in place, a single signal simulated from (M6) took us over three hours to process on a standard PC.

Out of the competing methods tested, ID, TF and CPOP return continuous change-points, while the estimated signals of TrendSegment and BUP are in principle discontinuous at change-points. For NOT, we use the contrast function for not necessarily continuous piecewise-linear signals. Regarding the tuning parameters for the competing methods, we follow the recommendation of each respective paper or the corresponding R package.

4.3 Results

The summary of the results for all models and methods can be found in Tables 2 and 3. We run 100 simulations and as a measure of accuracy of estimators, we use Monte-Carlo estimates of the Mean Squared Error of the estimated signal defined as MSE= $\mathbb{E}\{(1/T)\sum_{t=1}^{T}(f_{t}-\hat{f}_{t})^{2}\}$ . The empirical distribution of $\hat{N}-N$ is also reported where $\hat{N}$ is the estimated number of change-points and $N$ is the true one. In addition to this, for comparing the accuracy of the locations of the estimated change-points $\hat{\eta}_{i}$ , we show estimates of the scaled Hausdorff distance given by

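$$d_{H}=\frac{1}{T}\max\Big\{\max_{i}\min_{j}\big|\eta_{i}-\hat{\eta}_{j}\big|,\;\max_{j}\min_{i}\big|\hat{\eta}_{j}-\eta_{i}\big|\Big\},\qquad(20)$$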

where $i=0,\ldots,N+1$ and $j=0,\ldots,\hat{N}+1$ with the convention $\eta_{0}=\hat{\eta}_{0}=0,\eta_{N+1}=\hat{\eta}_{N+1}=T$ and $\hat{\eta}$ and $\eta$ denote estimated and true locations of the change-points. The smaller the Hausdorff distance, the better the estimation of the change-point locations. For each method, the average computation time in seconds is shown.
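The two accuracy measures can be sketched as follows for a single simulated data set. The sketch assumes the usual two-sided form of the scaled Hausdorff distance, i.e. the larger of the two one-sided distances between the sets of true and estimated change-points (end-points $0$ and $T$ included), divided by $T$ ; the function names are hypothetical.

```python
def scaled_hausdorff(true_cps, est_cps, T):
    """Two-sided Hausdorff distance between change-point sets, scaled by T."""
    eta = [0] + sorted(true_cps) + [T]      # eta_0 = 0, eta_{N+1} = T
    eta_hat = [0] + sorted(est_cps) + [T]   # same convention for the estimates
    # Worst-case distance from a true change-point to its nearest estimate ...
    missed = max(min(abs(a - b) for b in eta_hat) for a in eta)
    # ... and from an estimated change-point to its nearest true one.
    spurious = max(min(abs(a - b) for b in eta) for a in eta_hat)
    return max(missed, spurious) / T


def mse(f, f_hat):
    """Squared error between the true and estimated signal, averaged over t = 1..T."""
    return sum((a - b) ** 2 for a, b in zip(f, f_hat)) / len(f)
```

Averaging `mse` and `scaled_hausdorff` over the simulated data sets gives Monte-Carlo estimates of the kind reported in the tables. Including the end-points keeps the distance well defined when a method returns no change-points.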

We first emphasise that the results with both the naïve and the robust thresholds ( $\lambda^{\text{Naïve}}$ in (16) and $\lambda^{\text{Robust}}$ in (17)) are reported for TrendSegment, and the performances are nearly the same except for (M7). For simplicity, we refer to both methods as TrendSegment in the remainder of this section.

The results for (M1) and (M2) are similar. TrendSegment shows comparable performance to NOT, ID and CPOP in terms of the estimation of the number of change-points, while it is less attractive in terms of the estimated locations of change-points. TF tends to overestimate the number of change-points throughout all models. When the signal is a mix of constant and linear trends as in (M3) and (M4), TrendSegment, NOT and ID still perform well in terms of the estimation of the number of change-points. CPOP tends to overestimate in (M4), where there are discontinuities at change-points; however, it shows the best performance in terms of localisation (i.e. the smallest mean Hausdorff distance) as it tends to place more than one closely spaced estimated change-point at each discontinuous change-point. As TrendSegment and NOT deal with piecewise-linear signals that are not necessarily continuous at change-points, they perform better than the others in (M2) and (M4).

We see that TrendSegment has a particular advantage over the other methods especially in (M5) and (M6), when frequent change-points composed of the isolated spike-type short segments of length 6 exist. This is due to the bottom-up nature of TrendSegment which focuses on local features in the early stage of merges and enables TrendSegment to detect those short segments. TrendSegment shows its relative robustness in estimating the number and the location of change-points while NOT, ID and CPOP significantly underperform.

For the estimation of the piecewise-constant signal (M7), no method performs well: NOT, ID and TrendSegment tend to underestimate the number of change-points while CPOP and TF overestimate it. In the case of the no-change-point signal (M8), all methods estimate well except TF and BUP. In summary, TrendSegment is never among the worst methods, is almost always among the best ones, and is particularly attractive for signals containing frequent change-points with short segments. With respect to computation time, NOT and ID are very fast in all cases; TrendSegment is slower than these two but faster than TF, CPOP and BUP, especially when the length of the time series is larger than 2000.

5 Data applications

5.1 Average January temperatures in Iceland

We analyse a land temperature dataset available from https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data, consisting of average temperatures in January recorded in Reykjavik from 1763 to 2013. Figure 6 shows the data; the point corresponding to 1918 appears to be an anomalous point. This is sometimes called a point anomaly, which can be viewed as a separate data segment containing only one data point. Regarding the 1918 observation, Moore and Babij (2017) report that “[t]he winter of 1917/1918 is referred to as the Great Frost Winter in Iceland. It was the coldest winter in the region during the twentieth century. It was remarkable for the presence of sea ice in Reykjavik Harbour as well as for the unusually large number of polar bear sightings in northern Iceland.”

Out of the competing methods tested, ID, TF and CPOP are in principle able to classify two consecutive time points as change-points, and therefore they are able to detect separate data segments containing only one data point each. NOT and BUP are not designed to detect two consecutive time points as change-points as their minimum distance between two consecutive change-points is restricted to be at least two. In the TrendSegment algorithm, the minimum segment length can be flexibly set by the users as described in Section 4. Figures 6(a) and 6(b) show that the change-point estimators depend on the type of threshold we use ( $\lambda^{\text{Naïve}}$ or $\lambda^{\text{Robust}}$ ) and also vary over conditions on the minimum segment length. Regardless of the minimum segment length, the robust threshold selection tends to detect more change-points than the naïve threshold. When the minimum segment length is set to $1$ , with both naïve and robust thresholds, TrendSegment identifies change-points in 1917 and 1918, where the temperature in 1918 is fitted as a single point. As shown in Figure 6(d), out of the competing methods, only CPOP detects the temperature in 1918 as an anomalous point. Figures 6(b), 6(c) and 6(d) show that TrendSegment with $\lambda^{\text{Robust}}$ , NOT and CPOP detect the change of slope in 1974, ID returns an increasing function with no change-points and TF reports 6 points with the most recent one in 1981, but none of them detect the point in 1918 as a separate data segment. When the minimum segment length is set equal to the default ( $\lfloor 0.9\log(T)\rfloor$ ) in TrendSegment with $\lambda^{\text{Naïve}}$ in Figure 6(a), it returns no change-points, as ID does. This example illustrates the flexibility of TrendSegment, as it not only detects change-points in linear trend but can also identify a separate data segment at the same time, which the competing methods do not achieve.

5.2 Monthly average sea ice extent of Arctic and Antarctic

We analyse the average sea ice extent of the Arctic and the Antarctic available from https://nsidc.org to estimate the change-points in its trend. As mentioned in Serreze and Meier, (2018), sea ice extent is the most common measure for assessing the feature of high-latitude oceans and it is defined as the area covered with an ice concentration of at least 15 $\%$ . Here we use the average ice extent in February and September as it is known that the Arctic has the maximum ice extent typically in February while the minimum occurs in September and the Antarctic does the opposite.

Serreze and Meier (2018) indicate that the clear decreasing trend of sea ice extent of the Arctic in September is one of the most important indicators of climate change. In contrast to the Arctic, the sea ice extent of the Antarctic has been known to be stable in the sense that it shows a weak increasing trend in the decades preceding 2016 (Comiso et al., 2017; Serreze and Meier, 2018). However, Rintoul et al. (2018) warn of a possible collapse of the past stability by citing a significant decline of the sea ice extent in 2016. We now use the most up-to-date records (to 2020) and re-examine the concerns expressed in Rintoul et al. (2018) with the help of our change-point detection methodology.

In this example, the condition on the minimum segment length does not affect the change-point estimation results; thus Figure 7 shows the results obtained from the default minimum segment length. Also, as shown in Figure 7, the TrendSegment estimate with $\lambda^{\text{Robust}}$ identifies no change-point in any of the four datasets; thus we focus on interpreting the TrendSegment estimate with $\lambda^{\text{Naïve}}$ in the following.

Figures 7(a) and 7(c) show the well-known decreasing trend of the average sea ice extent in the Arctic both in its winter (February) and summer (September). In Figure 7(a), the TrendSegment estimate identifies change-points in 2005 and detects a sudden drop during 2003-2005. One change-point in 2007 is identified in Figure 7(c), which differentiates the decreasing speed of winter ice extent in the Arctic before and after 2007. As observed in the above-mentioned literature, the sea ice extent of the Antarctic shows a modest increasing trend up until recently (Figures 7(b) and 7(d)); however, the TrendSegment procedure estimates a change-point in 2016 and detects a sudden drop during 2015-2017 for the Antarctic summer (February), and similarly detects two sudden drops via the estimated change-points in 2001 and 2015 for the Antarctic winter (September), which is in line with the message of Rintoul et al. (2018). The results of other competing methods can be found in Section D.1 of the supplementary materials.

6 Extension to non-Gaussian and/or dependent noise

Our TrendSegment algorithm can be extended to more realistic settings e.g. when the noise $\varepsilon_{t}$ is possibly dependent and/or non-Gaussian. The extension is performed by slightly altering the estimators $\tilde{f},\tilde{\tilde{f}}$ and $\hat{f}$ and keeping the rate of threshold the same as the one used in Theorems 1-3 (i.e. $\lambda=O((\log T)^{1/2})$ ) that is established under the iid Gaussian noise. We add an additional step to ensure that only the detail coefficients $d^{(j,k)}_{p,q,r}$ corresponding to a long enough interval $[p,r]$ survive, as this step enables us to apply strong asymptotic normality of $\sum_{t=p}^{r}\varepsilon_{t}$ . Under dependent or non-Gaussian noise, Theorems 1-3 presented in Section 3 still hold with a larger rate that is different by only a logarithmic factor, where the corresponding theories and proofs can be found in Section B of the supplementary material.

In Algorithm 1 in Section 4.1.5, we propose a robust way of threshold selection that works well in all circumstances including iid Gaussian noise. To demonstrate the robustness of our threshold selection when the noise has serial dependence and/or heavy-tailedness, additional simulations are performed for five distributions of the noise: (a) $\varepsilon_{t}\sim$ i.i.d. scaled $t_{5}$ distribution with unit variance, (b) $\varepsilon_{t}$ follows a stationary AR(1) process with $\phi=0.3$ and Gaussian innovations, (c) the same setting as (b) but with $\phi=0.6$ , (d) $\varepsilon_{t}$ follows a stationary AR(1) process with $\phi=0.3$ and $t_{5}$ innovations and (e) the same setting as (d) but with $\phi=0.6$ . The results are summarised in Tables C.1-C.10 in Section C.1 of the supplementary material. Lastly, in Section D.2 of the supplementary material, we demonstrate that our TrendSegment algorithm performs well on London air quality data that possibly has some non-negligible autocorrelation.

Appendix A Technical proofs

The proofs of Theorems 1-3 are given below and Lemmas 1 and 2 can be found in Section A of the supplementary materials.

Proof of Theorem 1. Let $\mathcal{S}^{1}_{j}$ and $\mathcal{S}^{0}_{j}$ be as in Lemma 2. From the conditional orthonormality of the unbalanced wavelet transform, on the set $A_{T}$ defined in Lemma 1, we have

[TABLE]

where $\mu^{(0,1)}=\langle{f},\psi^{(0,1)}\rangle$ and $\mu^{(0,2)}=\langle{f},\psi^{(0,2)}\rangle$ . We note that $\big{(}s^{[1]}_{1,T}-\mu^{(0,1)}\big{)}^{2}\leq 2C_{1}^{2}\log T$ is simply obtained by combining Lemma 2 and the fact that $s^{[1]}_{1,T}-\mu^{(0,1)}=\langle\boldsymbol{\varepsilon},\psi^{(0,1)}\rangle$ , which can also be applied to obtain $\big{(}s^{[2]}_{1,T}-\mu^{(0,2)}\big{)}^{2}\leq 2C_{1}^{2}\log T$ . By Lemma 2, $\mathbb{I}\big{\{}\,\exists(j^{\prime},k^{\prime})\in\mathcal{C}_{j,k}\quad|d^{(j^{\prime},k^{\prime})}|>\lambda\big{\}}=0$ for $k\in\mathcal{S}^{0}_{j}$ and also by the fact that $\mu^{(j,k)}=0$ for $j=1,\ldots,J,k\in\mathcal{S}^{0}_{j}$ , we have $\mathit{I}=0$ . For $\mathit{II}$ , we denote $\mathcal{B}=\big{\{}\,\exists(j^{\prime},k^{\prime})\in\mathcal{C}_{j,k}\quad|d^{(j^{\prime},k^{\prime})}|>\lambda\,\big{\}}$ and have

[TABLE]

Combining this with the upper bound on $J$ , $\lceil\log(T)/\log((1-\rho)^{-1})+\log(2)/\log(1-\rho)\rceil$ , and the fact that $|\mathcal{S}^{1}_{j}|\leq N$ , we have $\mathit{II}\leq 8C_{1}^{2}NT^{-1}\lceil\log(T)/\log((1-\rho)^{-1})+\log(2)/\log(1-\rho)\rceil\log T$ , and therefore $\|\tilde{f}-f\|_{T}^{2}\;\leq\;C_{1}^{2}\;T^{-1}\;\log(T)\;\Big{\{}4+8N\;\lceil\log(T)/\log((1-\rho)^{-1})+\log(2)/\log(1-\rho)\rceil\;\Big{\}}$ . As the estimated change-points are obtained through those detail coefficients, up to $N$ estimated change-points are added at each scale. Combining this with the fact that the largest scale $J$ is of order $\log T$ , the number of change-points in $\tilde{f}$ returned from the inverse TGUW transformation is up to $CN\log T$ where $C$ is a constant.
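To get a feel for the quantities in this bound, the sketch below evaluates the upper bound on the number of scales $J$ and the resulting bound on $\|\tilde{f}-f\|_{T}^{2}$ for given $T$ , $N$ , $\rho$ and $C_{1}$ . It simply transcribes the two expressions above; the function names are hypothetical and the numbers it returns are finite-sample values of an asymptotic bound, not statements about practical performance.

```python
import math


def max_scales(T, rho):
    """Upper bound on J: ceil(log T / log((1-rho)^{-1}) + log 2 / log(1-rho))."""
    return math.ceil(
        math.log(T) / math.log(1 / (1 - rho)) + math.log(2) / math.log(1 - rho)
    )


def l2_risk_bound(T, N, rho, C1=math.sqrt(3)):
    """C1^2 * T^{-1} * log(T) * (4 + 8 * N * J_bound), as in the proof of Theorem 1."""
    return C1 ** 2 * math.log(T) / T * (4 + 8 * N * max_scales(T, rho))
```

A smaller $\rho$ increases the bound on $J$ , and hence the constant in front of $NT^{-1}\log^{2}(T)$ , which is one way of seeing the trade-off between adaptivity and computational cost governed by the tail-greediness parameter.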

Proof of Theorem 2. Let $\tilde{B}$ and $\tilde{\tilde{B}}$ be the unbalanced wavelet bases corresponding to $\tilde{f}$ and $\tilde{\tilde{f}}$ , respectively. As the change-points in $\tilde{\tilde{f}}$ are a subset of those in $\tilde{f}$ , establishing $\tilde{\tilde{f}}$ can be considered as applying the TGUW transform again to $\tilde{f}$ , which is just a repetition of the procedure used in estimating $\tilde{f}$ , carried out in the greediest way. Thus the vectors in $\tilde{\tilde{B}}$ fall into two categories: 1) all basis vectors $\psi^{(j,k)}\in\tilde{B}$ such that $\psi^{(j,k)}$ is not associated with the change-points in $\tilde{f}$ and $|\langle\boldsymbol{X},\psi^{(j,k)}\rangle|=|d^{(j,k)}|<\lambda$ and 2) all vectors $\psi^{(j,1)}$ produced in Stage 1 of post-processing.

We now investigate how many scales are used for this particular transform. First, the detail coefficients $d^{(j,k)}$ corresponding to the basis vectors $\psi^{(j,k)}\in\tilde{B}$ live on no more than $J=O(\log T)$ scales and we have $|\mathcal{S}^{1}_{j}|\leq N$ by the argument used in the proof of Theorem 1. In addition, the vectors $\psi^{(j,1)}$ in the second category correspond to different change-points in $\tilde{f}$ and there exist at most $\tilde{N}=O(N\log T)$ change-points in $\tilde{f}$ , which we examine one at a time (i.e. $|\mathcal{S}^{1}_{j}|\leq 1$ ); thus at most $\tilde{N}$ scales are required for $d^{(j,1)}$ . Combining the results for the two categories, the equivalent of the quantity $\mathit{II}$ in the proof of Theorem 1 for $\tilde{\tilde{f}}$ is bounded by $\mathit{II}\leq C_{3}NT^{-1}\log^{2}T$ , where $C_{3}$ is a large enough positive constant, and this completes the proof of the $l_{2}$ result, $\big{\|}\tilde{\tilde{f}}-{f}\big{\|}_{T}^{2}\;=\;O\big{(}NT^{-1}\log^{2}(T)\big{)}$ .

Finally, we show that there exist at most two change-points in $\tilde{\tilde{f}}$ between each pair of consecutive true change-points $(\eta_{\ell},\eta_{\ell+1})$ for $\ell=0,\ldots,N$ , where $\eta_{0}=0$ and $\eta_{N+1}=T$ . Consider the case where three change-points, say ( $\tilde{\tilde{\eta}}_{l},\tilde{\tilde{\eta}}_{l+1},\tilde{\tilde{\eta}}_{l+2}$ ), lie between a pair of true change-points $(\eta_{\ell},\eta_{\ell+1})$ . In this case, by Lemma 2, the maximum magnitude of the two detail coefficients computed from the adjacent intervals, $[\tilde{\tilde{\eta}}_{l}+1,\tilde{\tilde{\eta}}_{l+1}]$ and $[\tilde{\tilde{\eta}}_{l+1}+1,\tilde{\tilde{\eta}}_{l+2}]$ , is less than $\lambda$ and $\tilde{\tilde{\eta}}_{l+1}$ would be removed from the set of estimated change-points. Hence $\tilde{\tilde{N}}\leq 2(N+1)$ .

Proof of Theorem 3. From the assumptions of Theorem 3, the following hold.

•

Given any $\epsilon>0$ and $C>0$ , for some $T_{1}$ and all $T>T_{1}$ , it holds that

$\mathbb{P}\Big{(}\big{\|}\tilde{\tilde{f}}-{f}\big{\|}_{T}^{2}>\frac{C^{3}}{4}R_{T}\Big{)}\leq\epsilon$ where $\tilde{\tilde{f}}$ is the estimated signal specified in Theorem 2.

•

For some $T_{2}$ , and all $T>T_{2}$ , it holds that $C^{1/3}T^{1/3}R_{T}^{1/3}(\underaccent{\bar}{f}_{T}^{\ell})^{-2/3}<\delta_{T}^{\ell}$ for all $\ell=1,\ldots,N$ .

Following the argument used in the proof of Theorem 19 in Lin et al., (2016), we take $T\geq T^{*}$ where $T^{*}=\max\{T_{1},T_{2}\}$ and let $r_{\ell,T}=\lfloor C^{1/3}T^{1/3}R_{T}^{1/3}(\underaccent{\bar}{f}_{T}^{\ell})^{-2/3}\rfloor$ for $\ell=1,\ldots,N$ . Suppose that there exists at least one $\eta_{\ell}$ whose closest estimated change-point is not within distance $r_{\ell,T}$ . Then there are no estimated change-points in $\tilde{\tilde{f}}$ within $r_{\ell,T}$ of $\eta_{\ell}$ , which means that $\tilde{\tilde{f}}_{j}$ displays a linear trend over the entire segment $j\in\{\eta_{\ell}-r_{\ell,T},\ldots,\eta_{\ell}+r_{\ell,T}\}$ . Hence

[TABLE]

The first inequality holds by Lemma 20 of Lin et al., (2016), and the second one holds by the definition of $r_{\ell,T}$ . Assuming that at least one $\eta_{\ell}$ does not have an estimated change-point within the distance of $r_{\ell,T}$ implies that the estimation error exceeds $\frac{C^{3}}{4}R_{T}$ which is a contradiction as it is an event that we know occurs with probability at most $\epsilon$ . Therefore, there must exist at least one estimated change-point within the distance of $r_{\ell,T}$ from each true change point $\eta_{\ell}$ .

Throughout Stage 2 of post-processing, $\tilde{\tilde{\eta}}_{\ell_{0}}$ is either the closest estimated change-point to some $\eta_{\ell}$ or not. If $\tilde{\tilde{\eta}}_{\ell_{0}}$ is not the closest estimated change-point to the nearest true change-point on either its left or its right, then, by the construction of detail coefficients in Stage 2 of post-processing, Lemma 2 guarantees that the corresponding detail coefficient has magnitude less than $\lambda$ and $\tilde{\tilde{\eta}}_{\ell_{0}}$ gets removed. Suppose $\tilde{\tilde{\eta}}_{\ell_{0}}$ is the closest estimated change-point to a true change-point $\eta_{\ell}$ and it is within distance $CT^{1/3}R_{T}^{1/3}\big{(}\underaccent{\bar}{f}_{T}^{\ell}\big{)}^{-2/3}$ of $\eta_{\ell}$ . If the corresponding detail coefficient has magnitude less than $\lambda$ and $\tilde{\tilde{\eta}}_{\ell_{0}}$ is removed, there must exist another $\tilde{\tilde{\eta}}_{\ell}$ within distance $CT^{1/3}R_{T}^{1/3}\big{(}\underaccent{\bar}{f}_{T}^{\ell}\big{)}^{-2/3}$ of $\eta_{\ell}$ . If there is no such $\tilde{\tilde{\eta}}_{\ell}$ , then by the construction of the detail coefficient, the order of magnitude of $\big{|}d_{p_{\ell_{0}},q_{\ell_{0}},r_{\ell_{0}}}\big{|}$ would be such that $\big{|}d_{p_{\ell_{0}},q_{\ell_{0}},r_{\ell_{0}}}\big{|}>\lambda$ , and thus $\tilde{\tilde{\eta}}_{\ell_{0}}$ would not be removed. Therefore, after Stage 2 of post-processing is finished, each true change-point $\eta_{\ell}$ has a unique estimator within distance $CT^{1/3}R_{T}^{1/3}\big{(}\underaccent{\bar}{f}_{T}^{\ell}\big{)}^{-2/3}$ .
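As a rough numerical illustration of this localisation rate, the sketch below plugs the rate $R_{T}=NT^{-1}\log^{2}T$ from Theorem 2 into $CT^{1/3}R_{T}^{1/3}\big{(}\underaccent{\bar}{f}_{T}^{\ell}\big{)}^{-2/3}$ , which then simplifies to $C(N\log^{2}T)^{1/3}\big{(}\underaccent{\bar}{f}_{T}^{\ell}\big{)}^{-2/3}$ . The function name, the argument name `slope_change` (standing in for $\underaccent{\bar}{f}_{T}^{\ell}$ ) and the default $C=1$ are assumptions made for illustration.

```python
import math

def localisation_radius(T: int, N: int, slope_change: float, C: float = 1.0) -> float:
    """C * T^(1/3) * R_T^(1/3) * slope_change^(-2/3), with R_T = N * log(T)^2 / T."""
    if slope_change <= 0:
        raise ValueError("slope_change must be positive")
    R_T = N * math.log(T) ** 2 / T
    return C * (T * R_T) ** (1.0 / 3.0) * slope_change ** (-2.0 / 3.0)

# Illustrative values: a larger change in slope gives a tighter radius.
radius_small_change = localisation_radius(T=1000, N=2, slope_change=0.5)
radius_large_change = localisation_radius(T=1000, N=2, slope_change=2.0)
```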

Supplementary materials for “Detecting linear trend changes in data sequences”

Hyeyoung Maeng and Piotr Fryzlewicz

This document includes the following sections:

A. Proofs

B. Extension to dependent non-Gaussian noise

C. Additional simulation results

D. Additional data application results

E. Shape of the unbalanced wavelet basis

F. A practical way to implement the TGUW transformation

G. Extension to piecewise-quadratic signal

Appendix A. Proofs

A.1 Some useful lemmas for Theorems 1-3 of the main article

Lemma 1

Let the distribution of $\varepsilon_{t}$ in model (1) of the main article be iid standard Gaussian. Let $\psi^{(j,k)}=\sum_{i=1}^{I^{(j,k)}}\phi_{i}^{(j,k)}g_{i}^{(j,k)}$ where $\phi_{i}^{(j,k)}$ are constants and $g_{i}^{(j,k)}$ are vectors of the same length as $\psi^{(j,k)}$ , with $I^{(j,k)}\in\{3,4\},j=1,\ldots,J,\;k=1,\ldots,K(j)$ . If we define the set $G=\{g_{l}\}$ where there is a unique correspondence between $\big{\{}{g_{i}^{(j,k)}}_{i=1,\ldots,I^{(j,k)},j=1,\ldots,J,\,k=1,\ldots,K(j)}\big{\}}$ and $\{g_{l}\}$ , we then have $P(A_{T})\geq 1-C_{2}T^{-1}$ where

[TABLE]

$\lambda$ is as in Theorem 1 and $C_{2}$ is a positive constant.

Proof. We first show that for any fixed ${(j,k)}$ , $g_{i}^{(j,k)}$ and $\phi_{i}^{(j,k)}$ satisfy the conditions $\big{(}g_{i}^{(j,k)}\big{)}^{\top}g_{i}^{(j,k)}=1$ , $\big{(}g_{i}^{(j,k)}\big{)}^{\top}g_{i^{\prime}}^{(j,k)}=0$ for $i\neq i^{\prime}$ and $\sum_{i}\big{(}\phi_{i}^{(j,k)}\big{)}^{2}=1$ , where $\psi^{(j,k)}=\sum_{i=1}^{I^{(j,k)}}\phi_{i}^{(j,k)}g_{i}^{(j,k)}$ . Depending on the type of merge, $\psi^{(j,k)}$ falls into one of the following categories,

[TABLE]

where $e_{i}$ is a vector of length $T$ whose $i^{th}$ element is $1$ and whose other elements are zero. As will be shown in Section E, $\boldsymbol{\ell}_{1,i,j}$ and $\boldsymbol{\ell}_{2,i,j}$ form an arbitrary orthonormal basis of the subspace $\{(x_{1},x_{2},\ldots,x_{j-i+1})\;|\;x_{1}-x_{2}=x_{2}-x_{3}=\cdots=x_{j-i}-x_{j-i+1}\}$ of $\mathbb{R}^{j-i+1}$ .
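The subspace of equally spaced vectors is two-dimensional, so any orthonormal pair inside it will do. The sketch below builds one such pair, a normalised constant vector and a normalised centred ramp; this particular choice and the function names are assumptions for illustration, since the text only requires an arbitrary orthonormal basis.

```python
import math

def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def linear_trend_basis(n: int):
    """Return two orthonormal vectors spanning
    {x in R^n : x_1 - x_2 = x_2 - x_3 = ... = x_{n-1} - x_n}."""
    if n < 2:
        raise ValueError("need n >= 2 for a two-dimensional subspace")
    # First vector: the constant direction, scaled to unit norm.
    l1 = [1.0 / math.sqrt(n)] * n
    # Second vector: a ramp centred at zero (hence orthogonal to l1), unit norm.
    centre = (n + 1) / 2.0
    ramp = [t - centre for t in range(1, n + 1)]
    norm = math.sqrt(dot(ramp, ramp))
    l2 = [v / norm for v in ramp]
    return l1, l2

l1, l2 = linear_trend_basis(5)
# Any combination alpha * l1 + beta * l2 has constant successive differences.
combo = [0.3 * a - 1.7 * b for a, b in zip(l1, l2)]
diffs = [combo[i] - combo[i + 1] for i in range(len(combo) - 1)]
```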

In any case, we can obtain the representation $\psi^{(j,k)}=\sum_{i=1}^{I^{(j,k)}}\phi_{i}^{(j,k)}g_{i}^{(j,k)}$ from (S.2) if the constants $\phi_{i}^{(j,k)}$ correspond to $\{\alpha_{i}\}_{i=1}^{3}$ in Type 1, $\{\beta_{i}\}_{i=1}^{3}$ or $\{\beta_{i}\}_{i=4}^{6}$ in Type 2 and $\{\gamma_{i}\}_{i=1}^{4}$ in Type 3, and $g_{i}^{(j,k)}$ is the corresponding vector. From the orthonormality of the basis ( $\boldsymbol{\ell}_{1,m,n},\boldsymbol{\ell}_{2,m,n}$ ) for any $(m,n)$ , we see that the conditions $\big{(}g_{i}^{(j,k)}\big{)}^{\top}g_{i}^{(j,k)}=1$ and $\big{(}g_{i}^{(j,k)}\big{)}^{\top}g_{i^{\prime}}^{(j,k)}=0$ are satisfied for any $(i,i^{\prime},j,k)$ where $i\neq i^{\prime}$ . In addition, as each $\psi^{(j,k)}$ has unit norm, the constants $\phi_{i}^{(j,k)}$ satisfy $\sum_{i}\big{(}\phi_{i}^{(j,k)}\big{)}^{2}=1$ for any ${(j,k)}$ , which implies $\sum_{i=1}^{3}\alpha_{i}^{2}=\sum_{i=1}^{3}\beta_{i}^{2}=\sum_{i=4}^{6}\beta_{i}^{2}=\sum_{i=1}^{4}\gamma_{i}^{2}=1$ in (S.2).

If we predefine the pairs ( $\boldsymbol{\ell}_{1,m,n},\boldsymbol{\ell}_{2,m,n}$ ) for any $(m,n)$ by choosing an orthonormal basis of the subspace $\{(x_{1},x_{2},\ldots,x_{n-m+1})\;|\;x_{1}-x_{2}=x_{2}-x_{3}=\cdots=x_{n-m}-x_{n-m+1}\}$ of $\mathbb{R}^{n-m+1}$ , then there exist at most $T^{2}$ vectors $g_{l}$ in the set $G$ . This is because $m$ and $n$ can be chosen from $\{1,2,\ldots,T\}$ with replacement and, if $m\neq n$ , the two pairs $(m,n)$ and $(n,m)$ correspond to the same basis vectors ( $\boldsymbol{\ell}_{1,m,n},\boldsymbol{\ell}_{2,m,n}$ ), while $(m,m)$ corresponds to the single vector $e_{m}$ . We are now in a position to show that $P(A_{T})\geq 1-C_{2}T^{-1}$ . Using a simple Bonferroni inequality, we have

[TABLE]

where $\phi_{Z}$ is the p.d.f. of a standard normal $Z$ . This completes the proof.
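The shape of this argument can be checked numerically. The sketch below assumes, for illustration only, that the event of interest is that at most $T^{2}$ standard normal variables $g_{l}^{\top}\boldsymbol{\varepsilon}$ all stay below a threshold of the form $\lambda=C_{1}\sqrt{2\log T}$ ; the union bound combined with the Gaussian tail inequality $P(Z>\lambda)\leq\phi_{Z}(\lambda)/\lambda$ then gives a quantity of order $T^{2-C_{1}^{2}}/\lambda$ . The function names and the choice $C_{1}=\sqrt{3}$ are hypothetical.

```python
import math

def threshold(T: int, C1: float) -> float:
    """Assumed threshold form lambda = C1 * sqrt(2 * log T)."""
    return C1 * math.sqrt(2.0 * math.log(T))

def union_bound_exact(T: int, C1: float) -> float:
    """T^2 * P(|Z| > lambda) for a standard normal Z (Bonferroni over T^2 terms)."""
    lam = threshold(T, C1)
    return T ** 2 * math.erfc(lam / math.sqrt(2.0))

def union_bound_mills(T: int, C1: float) -> float:
    """T^2 * 2 * phi(lambda) / lambda, using P(Z > lambda) <= phi(lambda) / lambda."""
    lam = threshold(T, C1)
    phi = math.exp(-0.5 * lam * lam) / math.sqrt(2.0 * math.pi)
    return T ** 2 * 2.0 * phi / lam

# With C1^2 = 3 the Mills-ratio bound decays like T^-1 / lambda.
C1 = math.sqrt(3.0)
scaled = [T * union_bound_mills(T, C1) for T in (10**2, 10**3, 10**4)]
```

Under these assumptions `scaled` is a decreasing sequence, which is consistent with a bound of the form $C_{2}T^{-1}$ .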

Lemma 2

Let $\mathcal{S}^{1}_{j}=\{1\leq k\leq K(j):d^{(j,k)}$ is $d_{p,q,r}$ such that $p<\eta_{i}+1/2<r$ for some $i=1,\ldots,N$ $\}$ , and $\mathcal{S}^{0}_{j}=\{1,\ldots,K(j)\}\setminus\mathcal{S}^{1}_{j}$ . On the set $A_{T}$ in (S.1) which satisfies $P(A_{T})\rightarrow 1$ as $T\rightarrow\infty$ , we have

[TABLE]

where $\lambda$ is as in Theorem 1.

Proof. On the set $A_{T}$ , the following holds for $j=1,\ldots,J,k\in\mathcal{S}^{0}_{j}$ ,

[TABLE]

where $\boldsymbol{\varepsilon}=(\varepsilon_{1},\ldots,\varepsilon_{T})^{\top}$ and $\psi^{(j,k)}_{p,q,r}$ are as in (S.2). The condition, $\sum_{i}\big{(}\phi_{i}^{(j,k)}\big{)}^{2}=1$ for any fixed ${(j,k)}$ , given in the proof of Lemma 1 implies that $\max_{i}\big{|}\phi_{i}^{(j,k)}\big{|}\leq 1$ for any ${(j,k)}$ , thus we have (S.6) when the constant $C_{1}$ for $\lambda$ in (S.6) is larger than or equal to $4$ times $C_{1}$ used in (S.1).

Appendix B. Extension to dependent non-Gaussian noise

In this section, we extend the TGUW methodology to more realistic settings in which the noise $\varepsilon_{t}$ is possibly dependent and/or non-Gaussian. We borrow the idea proposed in the supplementary material of Fryzlewicz, (2018), in the sense that the extension is performed by altering the estimators $\tilde{f},\tilde{\tilde{f}}$ and $\hat{f}$ while keeping the rate of the threshold, $O((\log T)^{1/2})$ , used in Theorems 1-3 of the main article, which were established under iid Gaussian noise. However, our technique differs from that of Fryzlewicz, (2018) in that we add a step which ensures that only the detail coefficients $d^{(j,k)}_{p,q,r}$ corresponding to a long enough interval $[p,r]$ survive, while Fryzlewicz, (2018) requires that both the left ( $[p,q]$ ) and the right ( $[q+1,r]$ ) segments be long enough. This enables us to use the same threshold size, $O((\log T)^{1/2})$ , as in the iid Gaussian model without any further procedure such as the basis rearrangement proposed in Fryzlewicz, (2018).

We now define the sets of short-segment and long-segment coefficients at each scale $j$ as follows:

[TABLE]

where $a$ will be specified later in this section. The detail coefficients obtained from short segments are set to zero in the construction of the new estimators $\tilde{f}^{L},\tilde{\tilde{f}}^{L}$ and $\hat{f}^{L}$ , where $L$ in $f^{L}$ stands for “Long-segment”. The initial estimator $\tilde{f}^{L}$ is obtained from the estimator of $\mu^{(j,k)}$ for $j\geq 1$ by applying a “connected” rule, modified from the original one in Section 2.3 of the main article so that the minimum segment length is longer than $a$ :

[TABLE]

where $\mathbb{I}$ is an indicator function and

[TABLE]

We then apply the “two together” rule to (S.8), under which both of the paired detail coefficients (formed by Type 3 mergings) survive if at least one of them survives, as in the thresholding of the main article. Compared to the estimator $\hat{\mu}^{(j,k)}$ obtained under the iid Gaussian setting, the only added step is setting all short-segment coefficients $d^{(j,k)}_{p,q,r}$ to zero.

B.1 Preparatory lemmas

Lemma 3

Let the distribution of $\varepsilon_{t}$ in model (1) of the main article be as in Theorem 1. Then for $\lambda$ as in Theorem 1, we have $P(A_{T})\geq 1-C_{3}T^{-1}$ for a constant $C_{3}>0$ , where

[TABLE]

and $\boldsymbol{\ell}_{k,t_{1},t_{2}}^{t}$ is the $t$ -th element of the vector $\boldsymbol{\ell}_{k,t_{1},t_{2}}$ of length $t_{2}-t_{1}+1$ and the pairs ( $\boldsymbol{\ell}_{1,t_{1},t_{2}},\boldsymbol{\ell}_{2,t_{1},t_{2}}$ ) are predetermined for any $(t_{1},t_{2})$ by choosing an orthonormal basis of the subspace $\{(x_{1},x_{2},\ldots,x_{t_{2}-t_{1}+1})\;|\;x_{1}-x_{2}=x_{2}-x_{3}=\cdots=x_{t_{2}-t_{1}}-x_{t_{2}-t_{1}+1}\}$ of $\mathbb{R}^{t_{2}-t_{1}+1}$ .

Proof. In the following, we consider the single sum $\sum_{t=1}^{a}w_{t}\varepsilon_{t}$ over the interval $[1,a]$ where $w_{t}=\boldsymbol{\ell}_{k,1,a}^{t}$ for a fixed $k\in\{1,2\}$ . The results apply in the same way to an interval with different end-points, provided that the length of the interval is at least $a$ . Since $\varepsilon_{t}$ is $m$ -dependent, we have $\alpha(l)=0$ for $l>m$ , where $\alpha(\cdot)$ denotes the $\alpha$ -mixing coefficients of $\varepsilon_{t}$ .

From Theorem 1.4 in Bosq, (1998), if $m_{2}^{2}<\infty$ , for each $\epsilon>0$ and for a constant $c>0$ , we obtain

[TABLE]

where

[TABLE]

The assumption $m_{2}^{2}<\infty$ is reasonably achievable as we can show $m_{2}^{2}=a\max_{t}(w^{2}_{t})$ is bounded by a constant from two conditions given on $w_{t}$ , 1) $\{w_{1}-w_{2}=\cdots=w_{a-1}-w_{a}\}$ and 2) $\sum_{t=1}^{a}w_{t}^{2}=1$ .
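The boundedness of $a\max_{t}(w_{t}^{2})$ can be verified directly. The sketch below computes the largest value of $a\max_{t}w_{t}^{2}$ over all unit-norm vectors $w$ of length $a$ with equal successive differences, using an orthonormal pair $(u_{0},u_{1})$ of the linear subspace and the Cauchy-Schwarz inequality. The function name is hypothetical and the construction is an illustration, not the argument used in the paper.

```python
def max_scaled_square(a: int) -> float:
    """Supremum of a * max_t w_t^2 over unit-norm, equally spaced vectors w of length a."""
    if a < 2:
        raise ValueError("need a >= 2")
    centre = (a + 1) / 2.0
    # Squared norm of the centred ramp (1 - centre, ..., a - centre).
    ramp_sq_norm = sum((t - centre) ** 2 for t in range(1, a + 1))
    best = 0.0
    for t in range(1, a + 1):
        # For w = c0 * u0 + c1 * u1 with c0^2 + c1^2 = 1, Cauchy-Schwarz gives
        # max w_t^2 = u0_t^2 + u1_t^2, where u0_t^2 = 1 / a.
        value = 1.0 / a + (t - centre) ** 2 / ramp_sq_norm
        best = max(best, value)
    return a * best

values = {a: max_scaled_square(a) for a in (2, 3, 10, 100, 1000)}
```

For these vectors the computed value equals $(4a-2)/(a+1)$ , which stays below $4$ for every $a$ , so the quantity is bounded by a constant as claimed.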

By setting $\epsilon=\lambda/\sqrt{a}$ , $a=C\log(T)$ and $\lambda=C_{1}\log^{1/2}T$ for large enough $C>0$ and $C_{1}>0$ , and setting $q=[c_{1}a]$ with a small $c_{1}$ (which gives $\Big{[}\frac{a}{q+1}\Big{]}\geq m+1$ ), we have that $a_{1}$ is bounded by a constant and $\alpha([a/(q+1)])=0$ , thus (S.10) can be bounded as

[TABLE]

where $C_{2}>0$ is suitably large. Since there exist at most $T^{2}$ sub-intervals $[t_{1},t_{2}]$ , applying a simple Bonferroni inequality, we have

[TABLE]

as $T\rightarrow\infty$ for a large enough $C>0$ and a certain constant $C_{3}>0$ .

Lemma 4

Let $\mathcal{S}^{1}_{j}$ and $\mathcal{S}^{0}_{j}$ be as in Lemma 2. On the set $A_{T}^{L}$ in (S.9) that satisfies $P(A_{T}^{L})\rightarrow 1$ as $T\rightarrow\infty$ , we have

[TABLE]

where $\lambda$ is as in Theorem 1.

Proof. The argument follows the proof of Lemma 2.

B.2 Theoretical results of the length-lowerbounded-basis estimators

We now describe the behaviour of the initial estimator $\tilde{f}^{L}$ that is built from the basis vectors whose non-zero elements have length larger than $a$ .

Theorem 4

Let the distribution of $\varepsilon_{t}$ in model (1) of the main article be as follows:

(a)

$\varepsilon_{t}$ has mean zero and satisfies Cramer’s conditions that

[TABLE]

where $c>0$ .

(b)

$\{\varepsilon_{t}\}_{t}$ is a stationary and $m$ -dependent sequence, i.e. $\sigma(\varepsilon_{s},s\leq t)$ and $\sigma(\varepsilon_{s},s\geq t+k)$ are independent for $k>m$ .

Let $\bar{f}=\max_{t}f_{t}-\min_{t}f_{t}$ be bounded and let the estimator $\tilde{f}^{L}$ be obtained from the estimator $\hat{\mu}^{(j,k)}$ in (S.8), with $a=C\log(T)$ and the threshold $\lambda=C_{1}\log^{1/2}(T)$ , for large enough $C$ and $C_{1}$ . Then on the set $A_{T}^{L}$ in (S.9), we have

[TABLE]

for a constant $\tilde{C}>0$ .

Proof. Let $\mathcal{S}^{1}_{j}$ and $\mathcal{S}^{0}_{j}$ be as in Lemma 2. From the conditional orthonormality of the unbalanced wavelet transform, on the set $A_{T}^{L}$ in (S.9), we have

[TABLE]

where $\mu^{(0,1)}=\langle{f},\psi^{(0,1)}\rangle$ , $\mu^{(0,2)}=\langle{f},\psi^{(0,2)}\rangle$ and $\mathcal{W}_{j^{\prime}}^{S}(a)$ and $\mathcal{W}_{j^{\prime}}^{L}(a)$ are as in (S.7). We note that $\big{(}s^{[1]}_{1,T}-\mu^{(0,1)}\big{)}^{2}\leq 2C_{1}^{2}\log T$ is simply obtained by combining Lemma 4 and the fact that $s^{[1]}_{1,T}-\mu^{(0,1)}=\langle\boldsymbol{\varepsilon},\psi^{(0,1)}\rangle$ , which can also be applied to obtain $\big{(}s^{[2]}_{1,T}-\mu^{(0,2)}\big{)}^{2}\leq 2C_{1}^{2}\log T$ . We now examine the terms $\mathit{I},\mathit{II}$ and $\mathit{III}$ in (S.11).

Term $\mathit{I}$ : By Lemma 4, on the set $A_{T}^{L}$ , $\mathbb{I}\big{\{}\,\exists(j^{\prime},k^{\prime})\in\mathcal{C}_{j,k}\quad|d^{(j^{\prime},k^{\prime})}|>\lambda\big{\}}=0$ for $k\in\mathcal{S}^{0}_{j}$ if $k^{\prime}\in\mathcal{W}_{j^{\prime}}^{L}(a)$ . Also by the fact that $\mu^{(j,k)}=0$ for $j=1,\ldots,J,k\in\mathcal{S}^{0}_{j}$ , we obtain $\mathit{I}=0$ .

Term $\mathit{II}$ : The principle of bottom-up merging rules out any short-segment parent coefficient whose children are long-segment coefficients, so the indicator function returns zero and the term $\mathit{II}$ simplifies to $\frac{1}{T}\sum_{j=1}^{J}\sum_{k\in\mathcal{S}^{1}_{j}\cap\mathcal{W}_{j}^{S}(a)}\Big{(}\mu^{(j,k)}\Big{)}^{2}$ .

We now examine the bound of individual $\mu^{(j,k)}_{p,q,r}$ . Note that only Type 2 and Type 3 basis vectors are considered due to the minimum length constraint given on the set $A_{T}^{L}$ . Borrowing the generalised form of $\psi^{(j,k)}_{p,q,r}$ in (S.2), for Type 3 basis vector, we obtain

[TABLE]

where ${f}_{p:q}$ is the subvector of ${f}$ containing $q-p+1$ elements. The inequality (S.12) is obtained from the orthonormality of $\boldsymbol{\ell}_{1,p,q},\boldsymbol{\ell}_{2,p,q},\boldsymbol{\ell}_{1,q+1,r},\boldsymbol{\ell}_{2,q+1,r}$ and the definition of inner product $a\cdot b=\|a\|\cdot\|b\|\cdot\cos(\theta)$ , where $\theta$ is the angle between $a$ and $b$ . Note that if ${f}_{p:q}$ does not contain a change point, the corresponding $\cos(\theta)=0$ as ${f}_{p:q}$ has a perfect linear trend and in the case when ${f}_{p:q}$ includes a change point, the size of angle is bounded as $|\cos(\theta)|\leq 1$ . As $\bar{f}=\max_{t}f_{t}-\min_{t}f_{t}$ is assumed to be bounded and $\|{f}_{p:q}\|^{2}\leq C[(q-p+1)^{2}+\bar{f}^{2}]$ regardless of whether there exists a true change point in $[p,q]$ , we have

[TABLE]

where $c_{i}>0$ and $C>0$ . Without loss of generality, we assume $r-q+1\leq a$ , then using $a=O(\log(T))$ and applying the same upper bounds, $J\leq\lceil\log(T)/\log((1-\rho)^{-1})+\log(2)/\log(1-\rho)\rceil$ and $|\mathcal{S}^{1}_{j}|\leq N$ , used in the proof of Theorem 1 in the main article, we obtain

[TABLE]

Term $\mathit{III}$ : Denote $\mathcal{B}=\big{\{}\,\exists(j^{\prime},k^{\prime})\in\mathcal{C}_{j,k}\quad|d^{(j^{\prime},k^{\prime})}|>\lambda\,\;\;\text{and}\;\;k^{\prime}\in\mathcal{W}_{j^{\prime}}^{L}(a)\big{\}}$ and on the set $A_{T}^{L}$ in (S.9) we have

[TABLE]

Following the same argument used in the proof of Theorem 1 in the main article, we have

[TABLE]

To complete the proof, considering all terms in (S.11), we finally obtain

[TABLE]

where $\tilde{C}>0$ . Compared with Theorem 1 of the main article, which is presented under the iid Gaussian noise assumption, the $\ell_{2}$ rate in (S.14) differs by only a logarithmic factor.

Theorem 5

Let $X_{t}$ follow model (1) with $\sigma=1$ . Let the distribution of $\varepsilon_{t}$ and the threshold $\lambda$ be as in Theorem 4. Further, let $\bar{f}=\max_{t}f_{t}-\min_{t}f_{t}$ be bounded. Then we have $\big{\|}\tilde{\tilde{f}}^{L}-f\big{\|}_{T}^{2}\;=\;O\big{(}NT^{-1}\log^{3}(T)\big{)}$ with probability approaching $1$ as $T\rightarrow\infty$ , where $\tilde{\tilde{f}}^{L}$ is the estimator constructed from ${\tilde{f}}^{L}$ through Stage 1 of the post-processing described in Section 2.5 of the main article. Moreover, there exist at most two estimated change-points between each pair of true change-points $(\eta_{i},\eta_{i+1})$ for $i=0,\ldots,N$ , where $\eta_{0}=0$ and $\eta_{N+1}=T$ . Therefore $\tilde{\tilde{N}}\leq 2(N+1)$ , where $\tilde{\tilde{N}}$ is the number of estimated change-points in $\tilde{\tilde{f}}^{L}$ .

Proof. The proof proceeds in the same way as the proof of Theorem 2 of the main article.

Theorem 6

Let $X_{t}$ follow model (1) with $\sigma=1$ . Let the distribution of $\varepsilon_{t}$ and the threshold $\lambda$ be as in Theorem 4. Further, let the number of true change-points, $N$ , have the order of $\log T$ and let $\bar{f}=\max_{t}f_{t}-\min_{t}f_{t}$ be bounded. Let the estimators $\hat{f}^{L}$ , $\hat{N}$ and $(\hat{\eta}_{1},\ldots,\hat{\eta}_{\hat{N}})$ be constructed through Stage 2 of the post-processing described in Section 2.5 of the main article. Let $\Delta_{T}=\min_{i=1,\ldots,N}\Big{\{}\Big{(}\underaccent{\bar}{f}_{T}^{i}\Big{)}^{2/3}\cdot\delta_{T}^{i}\Big{\}}$ where $\underaccent{\bar}{f}_{T}^{i}=\min\Big{(}|f_{\eta_{i+1}}-2f_{\eta_{i}}+f_{\eta_{i-1}}|,|f_{\eta_{i+2}}-2f_{\eta_{i+1}}+f_{\eta_{i}}|\Big{)}$ and $\delta_{T}^{i}=\min\Big{(}|\eta_{i}-\eta_{i-1}|,|\eta_{i+1}-\eta_{i}|\Big{)}$ . Assume that $T^{1/3}R_{T}^{1/3}=o\Big{(}\Delta_{T}\Big{)}$ where $\big{\|}\tilde{\tilde{f}}^{L}-f\big{\|}_{T}^{2}=O_{p}(R_{T})$ is as in Theorem 5. Then we have

[TABLE]

as $T\rightarrow\infty$ where $C$ is a constant.

Proof. The proof proceeds in the same way as the proof of Theorem 3 of the main article.

Appendix C. Threshold selection and additional simulation results

C.1 Simulation results for non-Gaussian and/or dependent noise

In addition to the simulations in Section 4 of the main article, here we present the results for the cases when $\varepsilon_{t}$ is possibly dependent and/or non-Gaussian. Including the standard Gaussian noise, we consider the following six scenarios for $\varepsilon_{t}$ :

(i)

standard Gaussian,

(ii)

iid $t_{5}$ distribution with unit variance,

(iii)

a stationary Gaussian AR(1) process with $\phi=0.3$ , zero mean and unit variance,

(iv)

the same setting as in (iii) except $\phi=0.6$ ,

(v)

a stationary AR(1) process with $\phi=0.3$ and the noise term following $t_{5}$ ,

(vi)

the same setting as in (v) except $\phi=0.6$ .

In summary, (ii) is iid but heavy-tailed, (iii) and (iv) are Gaussian AR(1) errors with relatively mild and strong dependence, respectively, and (v) and (vi) are both heavy-tailed but differ in the strength of dependence. A summary of the simulation results can be found in Tables C.1-C.10.

Following the theoretical results presented in Sections B.1-B.2, we need to set the minimum segment length to be of order $\log(T)$ . As in the main paper, we set $\lfloor 0.9\log(T)\rfloor$ as the default minimum segment length. We follow Algorithm 1 introduced in the main paper and use $\lambda^{\text{Robust}}$ as the default threshold, as it is designed to work well in all circumstances.

The simulation results under this robust threshold selection are presented in Tables C.1-C.10; TrendSegment generally outperforms its competitors across all noise scenarios and almost all simulation models considered in this paper. Among the competitors, only ID provides an option for heavy-tailed noise in its R package IDetect; the other methods are run with their default settings.

C.2 Choice of the threshold

In this section, we describe the details of how the thresholds, $\lambda^{\text{Naïve}}$ and $\lambda^{\text{Robust}}$ , are built under different scenarios of noise introduced in Section C.1.

The naïve threshold selection

We first explain how the best performing threshold constant $C$ in the naïve threshold is chosen. To cover all the noise scenarios, including dependent and/or non-Gaussian noise, here we use a simpler version of the following naïve threshold:

[TABLE]

by noting that the $\sigma$ in $\lambda^{\text{Naïve}}=C\sigma\sqrt{2\log T}$ can be absorbed into the constant $C$ . To find the best performing constant $C$ over the different noise scenarios introduced in Section C.1, we repeat the simulations over a range of $C$ , $[0.5,3.5]$ . The performance can be evaluated by the accuracy in detecting the number and locations of change-points. For the number of change-points, we define

[TABLE]

where $\tilde{C}^{\eta}_{\text{min},j}$ is the minimum of those constants $C$ that give the maximum number of occurrences of the event $\{\hat{N}-N=0\}$ for the $j^{\text{th}}$ model over 100 simulation runs. The minimum condition only matters when more than one constant gives the same maximum number of occurrences of $\{\hat{N}-N=0\}$ . Similarly, we can define $C^{\eta}_{\text{med}}$ and $C^{\eta}_{\text{max}}$ by replacing the minimum condition with the median and the maximum, respectively. For evaluating the performance in terms of change-point locations, we define
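The per-model tie-breaking described above can be sketched as follows. The function name and the example counts are made up for illustration; the use of `statistics.median`, which averages the two middle values when the number of tied constants is even, is an assumption, and the aggregation over models is not shown.

```python
from statistics import median

def tied_best_constants(exact_counts: dict) -> tuple:
    """Given {C: number of runs with N_hat - N == 0} for one model,
    return the (min, median, max) of the constants attaining the maximum count."""
    if not exact_counts:
        raise ValueError("exact_counts must not be empty")
    top = max(exact_counts.values())
    winners = sorted(c for c, count in exact_counts.items() if count == top)
    return winners[0], median(winners), winners[-1]

# Hypothetical counts out of 100 runs for a grid of constants in [0.5, 3.5].
counts = {0.5: 40, 1.0: 87, 1.5: 87, 2.0: 87, 2.5: 60, 3.0: 35, 3.5: 20}
c_min, c_med, c_max = tied_best_constants(counts)
```

When a single constant attains the maximum count, the three returned values coincide.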

[TABLE]

where $\tilde{C}^{d_{H}}_{\text{max},j}$ is the maximum of those constants $C$ that give the minimum value of the average Hausdorff distance for the $j^{\text{th}}$ model computed from 100 simulation runs. Note that, whereas the maximum number of occurrences of $\{\hat{N}-N=0\}$ is considered in (S.17), the minimum value of the Hausdorff distance is used in (S.18), as the smaller the Hausdorff distance, the better the estimation of the change-point locations. As in (S.17), the maximum condition only matters when more than one constant gives the same minimum average Hausdorff distance, and $C^{d_{H}}_{\text{med}}$ and $C^{d_{H}}_{\text{min}}$ can be defined by replacing the maximum condition with the median and the minimum, respectively.

The best performing constants over all noise scenarios are reported in Table C.11. Compared to the iid Gaussian noise, a larger threshold constant tends to be chosen when the noise is heavy-tailed and/or dependent. Also, compared to the other noise scenarios, when the noise is dependent but generated with Gaussian innovations ((iii) and (iv)), the best performing constant has a narrower range of $C_{\text{max}}^{\cdot}$ - $C_{\text{min}}^{\cdot}$ .
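For reference, a minimal sketch of a Hausdorff distance between an estimated and a true set of change-point locations is given below. It implements the standard two-sided definition on integer locations; whether the version used in the simulations is rescaled (for example by $T$ ) is not stated here, so the unscaled form and the function name are assumptions.

```python
def hausdorff_distance(estimated, true) -> int:
    """Two-sided Hausdorff distance between two non-empty sets of locations."""
    if not estimated or not true:
        raise ValueError("both sets must be non-empty")
    # Largest distance from an estimated point to its nearest true point.
    est_to_true = max(min(abs(e - t) for t in true) for e in estimated)
    # Largest distance from a true point to its nearest estimated point.
    true_to_est = max(min(abs(e - t) for e in estimated) for t in true)
    return max(est_to_true, true_to_est)

# Made-up example: the change-point at 300 is missed, which dominates the distance.
d_missed = hausdorff_distance([100, 205], [100, 200, 300])
d_close = hausdorff_distance([100, 205, 298], [100, 200, 300])
```

A missed or spurious change-point inflates the distance, which is why a smaller value indicates better localisation.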

The naïve threshold, $\lambda^{\text{Naïve}}$ , is an essential element in building the robust threshold, $\lambda^{\text{Robust}}$ , as shown in Step 5 of Algorithm 1 in the main paper. For this, we use the default constants $C_{\text{med}}^{\eta}$ in Table C.11, where there is not much difference in simulation performance presented in Tables C.1-C.10 when $C_{\text{max}}^{\eta}$ is used instead.

The robust threshold selection

We now describe the details of how $\lambda^{\text{Robust}}$ is built. We first justify using the ratio,

[TABLE]

in the process of building the kurtosis function $g$ in Step 5 of Algorithm 1 in the main paper.

We first recall that the ratio in (S.19) corresponds to the kurtosis function $g(\mathcal{K})$ in the following robust threshold:

[TABLE]

Figure C.1 shows that $g(\hat{\mathcal{K}})$ behaves like a constant over a range of $\hat{\mathcal{K}}$ under all models and noise scenarios we considered. This is due to the condition on the minimum segment length imposed for stable and good performance. With this condition, we found that the constant-like behaviour is also observed when the noise has a relatively extreme heavy tail, e.g. $t_{2.1}$ ; however, we do not include such an extreme case in estimating the non-parametric function $g$ .

We are now ready to estimate the function $g(\cdot)$ . To prevent the non-parametric fit from being unduly affected by extremely large values of $\hat{\mathcal{K}}$ , we split the range of $\hat{\mathcal{K}}$ into two at the $99\%$ quantile of $\hat{\mathcal{K}}$ , as shown in Figure C.1. We then estimate a non-parametric regression fit on each interval, $g_{1}(\mathcal{K})$ and $g_{2}(\mathcal{K})$ respectively, and use these functions for computing the robust threshold.

Appendix D. Additional data application results

D.1 Monthly average sea ice extent of Arctic and Antarctic data

D.2 Nitrogen oxides concentrations

In this section, we demonstrate that our TrendSegment algorithm performs well on a real-world dataset that possibly has some non-negligible autocorrelation. London air quality data were recently studied by Cho and Fryzlewicz, (2020) in the context of proposing a methodology for detecting multiple changes in the mean of a possibly autocorrelated time series. Using the same data but in a different context, we now detect changes in linear trend. We use the daily average concentrations of nitrogen dioxide ( $\text{NO}_{2}$ ) measured from September 1, 2000 to September 30, 2020 at Marylebone Road in London, United Kingdom, which results in $T=7139$ time points. The data were downloaded from https://github.com/haeran-cho/wem.gsc, and the original data can be obtained from Defra (https://uk-air.defra.gov.uk/). We follow the pre-processing steps used in Cho and Fryzlewicz, (2020), taking the square root transform and removing weekly, seasonal and bank holiday effects.

Considering that the data possibly exhibit serial dependence and/or heavy tails, we use the robust threshold ( $\lambda^{\text{Robust}}$ ) introduced in Section 4.1.5 of the main paper. The top plot in Figure D.6 shows the detected change-points using the robust threshold selection. From the two bottom plots, we see that the persistent autocorrelations are no longer observable after removing the linear trends, although a certain amount of autocorrelation still exists.

Appendix E. Shape of the unbalanced wavelet basis

We now explore the shape of the adaptively constructed unbalanced wavelet basis. First, we note that $\psi^{(j,k)}$ is sometimes written as $\psi_{p,q,r}^{(j,k)}$ . One of the important properties of the unbalanced wavelet basis is that $\psi_{p,q,r}^{(j,k)}$ always has the shape of a linear trend in regions that were previously merged, and this linearity is preserved in future merges as long as later transforms are performed under the “two together” rule. For example, the two vectors $(\psi^{(0,1)},\psi^{(0,2)})$ corresponding to the two smooth coefficients $s^{[1]}_{1,T}$ and $s^{[2]}_{1,T}$ have linear trends in the region $[1,T]$ , as they form an orthonormal basis of the subspace $\{(x_{1},x_{2},\ldots,x_{T})\;|\;x_{1}-x_{2}=x_{2}-x_{3}=\cdots=x_{T-1}-x_{T}\}$ of $\mathbb{R}^{T}$ . This is because each local orthonormal transform extends the dimension of the subspace in which the orthonormal basis lives.

Through an illustrative example, we now show how a basis vector $\psi_{p,q,r}^{(j,k)}$ keeps its linearity in subregions that are already merged in previous scales, which includes a geometric interpretation of the TGUW transformation. Suppose that the initial data sequence is $\boldsymbol{s}^{0}=(X_{1},\ldots,X_{5})$ and the initial weight vectors of constancy and linearity are $\boldsymbol{w}^{c}_{0}=(1,1,1,1,1)^{\top}$ and $\boldsymbol{w}^{l}_{0}=(1,2,3,4,5)^{\top}$ , respectively. As we have the data sequence of length 5, the complete TGUW transform consists of 3 orthonormal transformations and the most important task for each transform is finding an appropriate orthonormal matrix.

First merge. Assume that $(X_{3},X_{4},X_{5})$ is chosen as the first triplet to be merged. To find the values of the transform matrix $\Lambda$ ,

[TABLE]

we first seek the detail filter, $\boldsymbol{h}$ , which satisfies the conditions (1) $\boldsymbol{h}^{\top}\boldsymbol{w}^{c}_{0,3:5}=0$ , (2) $\boldsymbol{h}^{\top}\boldsymbol{w}^{l}_{0,3:5}=0$ and (3) $\boldsymbol{h}^{\top}\boldsymbol{h}=1$ , where $\boldsymbol{w}^{\cdot}_{0,p:r}$ is the subvector of length $r-p+1$ . Thus, $\boldsymbol{h}$ is obtained as a normal vector to the plane $\{(x,y,z)\;|\;x-2y+z=0\}$ . Then, two low filter vectors ( $\boldsymbol{\ell}_{1}$ and $\boldsymbol{\ell}_{2}$ ) are obtained under the conditions, (1) $\boldsymbol{\ell}_{1}^{\top}\boldsymbol{h}=0$ , (2) $\boldsymbol{\ell}_{2}^{\top}\boldsymbol{h}=0$ , (3) $\boldsymbol{\ell}_{1}^{\top}\boldsymbol{\ell}_{2}=0$ and (4) $\boldsymbol{\ell}_{1}^{\top}\boldsymbol{\ell}_{1}=\boldsymbol{\ell}_{2}^{\top}\boldsymbol{\ell}_{2}=1$ which implies that $\boldsymbol{\ell}_{1}$ and $\boldsymbol{\ell}_{2}$ form an arbitrary orthonormal basis of the plane $\{(x,y,z)\;|\;x-2y+z=0\}$ and this guarantees the linear trend of $\boldsymbol{\ell}_{1}$ and $\boldsymbol{\ell}_{2}$ . Now, the orthonormal transform updates the data sequence and weight vectors as follows,

[TABLE]

where the constants $(e_{c_{1}},e_{c_{2}})$ and $(e_{l_{1}},e_{l_{2}})$ are obtained by $\Lambda\boldsymbol{w}^{c}_{0,3:5}=(e_{c_{1}},e_{c_{2}},0)^{\top}$ and $\Lambda\boldsymbol{w}^{l}_{0,3:5}=(e_{l_{1}},e_{l_{2}},0)^{\top}$ , respectively. As $\boldsymbol{\ell}_{1}$ and $\boldsymbol{\ell}_{2}$ form an orthonormal basis of the plane $\{(x,y,z)\;|\;x-2y+z=0\}$ , $e_{c_{1}},e_{c_{2}}$ and $e_{l_{1}},e_{l_{2}}$ are the unique constants which represent $\boldsymbol{w}^{c}_{0,3:5}$ and $\boldsymbol{w}^{l}_{0,3:5}$ as linear combinations of the basis vectors $\boldsymbol{\ell}_{1}$ and $\boldsymbol{\ell}_{2}$ as follows:

[TABLE]
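The first merge can be reproduced numerically. In the sketch below the detail filter is the unit normal to the plane $\{(x,y,z)\;|\;x-2y+z=0\}$ ; the two low filters are taken to be a normalised constant and a normalised centred ramp, which is one admissible choice among many and is an assumption of this example, as are the helper names.

```python
import math

def dot(u, v):
    """Euclidean inner product."""
    return sum(a * b for a, b in zip(u, v))

def normalise(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def apply(matrix, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [dot(row, v) for row in matrix]

h = normalise([1.0, -2.0, 1.0])    # detail filter: unit normal to x - 2y + z = 0
l1 = normalise([1.0, 1.0, 1.0])    # assumed low filter 1: constant direction
l2 = normalise([-1.0, 0.0, 1.0])   # assumed low filter 2: centred ramp
Lambda = [l1, l2, h]               # rows of the 3x3 orthonormal transform

e_c = apply(Lambda, [1.0, 1.0, 1.0])   # (e_c1, e_c2, 0) from w^c on positions 3..5
e_l = apply(Lambda, [3.0, 4.0, 5.0])   # (e_l1, e_l2, 0) from w^l on positions 3..5

# The weight vectors are recovered as combinations of the two low filters.
rebuilt_c = [e_c[0] * a + e_c[1] * b for a, b in zip(l1, l2)]
rebuilt_l = [e_l[0] * a + e_l[1] * b for a, b in zip(l1, l2)]
```

The third entries of `e_c` and `e_l` vanish because both weight vectors lie in the plane orthogonal to the detail filter.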

Importantly, the orthonormal transform matrix $\Psi_{T\times T}$ introduced in (5) (i.e. an orthonormal basis of $\mathbb{R}^{5}$ in this example) is constructed by recursively updating its initial input $\Psi_{0}=\mathbf{I}_{5\times 5}$ through local orthonormal transforms. For example, if the $(p,q,r)^{\text{th}}$ elements in $\boldsymbol{s}$ are selected to be merged, then we extract the corresponding $(p,q,r)^{\text{th}}$ columns of $\Psi^{\top}$ and update them through matrix multiplication with the $\Lambda$ used in that merge. Therefore, the first orthonormal transform performed in (S.22) updates the initial matrix $\Psi_{0}^{\top}$ by applying $\Lambda$ to the corresponding $(3,4,5)^{\text{th}}$ columns of $\Psi_{0}^{\top}$ , which returns the following,

[TABLE]

The $5^{\text{th}}$ column of $\Psi^{\top}$ is now fixed (it will not be updated again) as it corresponds to the detail coefficient, but the other four columns, which correspond to the smooth coefficients in $\boldsymbol{s}$ , will be updated as the merging continues.

Second merge. Suppose that $(X_{2},s^{[1]}_{3,4,5},s^{[2]}_{3,4,5})$ are selected to be merged next under the “two together” rule. Then we need to find the following orthonormal transform matrix,

[TABLE]

where its elements would be different from those in (S.21). The detail filter ${\boldsymbol{h}^{*}}^{\top}=(a^{*},b^{*},c^{*})$ is constructed from the corresponding weight vectors, $\boldsymbol{w}^{c}_{2:4}=(1,e_{c_{1}},e_{c_{2}})^{\top}$ and $\boldsymbol{w}^{l}_{2:4}=(2,e_{l_{1}},e_{l_{2}})^{\top}$ , by satisfying the conditions (1) ${\boldsymbol{h}^{*}}^{\top}\boldsymbol{w}^{c}_{2:4}=0$ , (2) ${\boldsymbol{h}^{*}}^{\top}\boldsymbol{w}^{l}_{2:4}=0$ and (3) ${\boldsymbol{h}^{*}}^{\top}{\boldsymbol{h}^{*}}=1$ . The detail filter is a weight vector designed for indicating the strength of linearity in $(X_{2},X_{3},X_{4},X_{5})$ as $(e_{c_{1}},e_{c_{2}})$ and $(e_{l_{1}},e_{l_{2}})$ already contain the information of three raw observations $(X_{3},X_{4},X_{5})$ . Then, two low filters, ${\boldsymbol{\ell}_{1}^{*}}$ and ${\boldsymbol{\ell}_{2}^{*}}$ , are obtained by satisfying the conditions, ${\boldsymbol{\ell}_{1}^{*}}^{\top}{\boldsymbol{h}^{*}}=0$ , ${\boldsymbol{\ell}_{2}^{*}}^{\top}{\boldsymbol{h}^{*}}=0$ , ${\boldsymbol{\ell}_{1}^{*}}^{\top}{\boldsymbol{\ell}_{2}^{*}}=0$ and ${\Lambda^{*}}^{\top}\Lambda^{*}=\mathbf{I}$ . Now the data sequence and the weight vectors are updated as follows,

[TABLE]

and $\Psi^{\top}$ is also updated into

[TABLE]

At this scale, the $4^{\text{th}}$ column of $\Psi^{\top}$ is fixed. This corresponds to the Type 2 basis vector in (S.2) whose non-zero subregion is composed of a single point ( $a^{*}$ ) and a linear trend ( $b^{*}\boldsymbol{\ell}_{1}+c^{*}\boldsymbol{\ell}_{2}$ ).

Importantly, the orthonormal transform at this scale returns an orthonormal basis of the expanded subspace: for example, the $2^{\text{nd}}$ and $3^{\text{rd}}$ columns of (S.27) (referred to as ${\boldsymbol{\ell}_{1}^{**}}$ and ${\boldsymbol{\ell}_{2}^{**}}$ in (S.28)) form an arbitrary orthonormal basis of the subspace $\{(w,x,y,z)\;|\;w-x=x-y=y-z\}$ of $\mathbb{R}^{4}$ . This is due to the semi-orthogonality of the transformation matrix $\mathbf{\Pi}$ in (S.28), which extends the dimension from $\mathbb{R}^{3}$ to $\mathbb{R}^{4}$ while preserving the fact that ( $\boldsymbol{\ell}_{1}^{*},\boldsymbol{\ell}_{2}^{*}$ ) and ( $\boldsymbol{\ell}_{1}^{**},\boldsymbol{\ell}_{2}^{**}$ ) each form an arbitrary orthonormal basis of the corresponding subspace. This guarantees the properties ${\boldsymbol{\ell}_{1}^{**}}^{\top}{\boldsymbol{\ell}_{2}^{**}}=0$ and ${\boldsymbol{\ell}_{1}^{**}}^{\top}{\boldsymbol{\ell}_{1}^{**}}={\boldsymbol{\ell}_{2}^{**}}^{\top}{\boldsymbol{\ell}_{2}^{**}}=1$ , where

[TABLE]

and $\mathbf{\Pi}$ is obtained from the $2^{\text{nd}}$ to $4^{\text{th}}$ columns of (S.24), with the selected rows corresponding to the indices of the smooth coefficients involved in the orthonormal transformation in (S.25).

As in (S.23), the extended subregions of the original weight vectors, $\boldsymbol{w}^{c}_{0,2:5}$ and $\boldsymbol{w}^{l}_{0,2:5}$ , can also be represented as linear combinations of $\boldsymbol{\ell}_{1}^{**}$ and $\boldsymbol{\ell}_{2}^{**}$ as follows:

[TABLE]

where $\boldsymbol{\ell}_{1}^{**}$ and $\boldsymbol{\ell}_{2}^{**}$ form an orthonormal basis of the subspace $\{(w,x,y,z)\;|\;w-x=x-y=y-z\}$ of $\mathbb{R}^{4}$ . This can be simply shown by 1) expressing the weight vectors as a linear combination of two low filters,

[TABLE]

and 2) multiplying both sides of (S.30) by $\mathbf{\Pi}$ in (S.28),

[TABLE]

Last merge. In the same manner, after the last orthonormal transform is applied to $(X_{1},s^{[1]}_{2,5},s^{[2]}_{2,5})$ , we end up with the finalised $\Psi^{\top}$ in which an orthonormal basis of the subspace $\{(v,w,x,y,z)\;|\;v-w=w-x=x-y=y-z\}$ of $\mathbb{R}^{5}$ is shown in its first and second columns where these two columns correspond to two basis vectors, $\psi^{(0,1)}$ and $\psi^{(0,2)}$ , in (5). Regardless of the length of data ( $T$ ), the first two columns of the finalised $\Psi^{\top}$ build two smooth coefficients ( $s^{[1]}_{1,T},s^{[2]}_{1,T}$ ) and always keep a linear trend with length $T$ , while the shape of other columns of $\Psi^{\top}$ corresponding to the detail coefficients depends on the type of merge and follows one of the forms in (S.2).
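The three merges of this $T=5$ example can be replayed end to end. The following is a minimal sketch, not the authors' code: the function `merge` and its helpers are hypothetical names, the low filters are chosen by an assumed fixed rule ( $\boldsymbol{\ell}_{1}$ proportional to the current constancy weights, $\boldsymbol{\ell}_{2}=\boldsymbol{h}\times\boldsymbol{\ell}_{1}$ ), and `Psi` is stored row-wise, so its rows are the columns of $\Psi^{\top}$ in the text.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def unit(u):
    norm = math.sqrt(dot(u, u))
    return [a / norm for a in u]

def merge(idx, s, w_c, w_l, Psi):
    """Merge positions idx = (p, q, r) in place; the detail lands at r."""
    h = unit(cross([w_c[i] for i in idx], [w_l[i] for i in idx]))
    ell_1 = unit([w_c[i] for i in idx])   # assumed low-filter choice
    ell_2 = cross(h, ell_1)
    Lam = [ell_1, ell_2, h]
    # Update the triplets of s, w^c and w^l with the same Lambda.
    for vec in (s, w_c, w_l):
        old = [vec[i] for i in idx]
        for k, i in enumerate(idx):
            vec[i] = dot(Lam[k], old)
    # Update the matching rows of Psi (columns of Psi^T in the text).
    old_rows = [Psi[i] for i in idx]
    new_rows = [[sum(Lam[k][m] * old_rows[m][t] for m in range(3))
                 for t in range(len(s))] for k in range(3)]
    for k, i in enumerate(idx):
        Psi[i] = new_rows[k]

T = 5
X = [2.0 + 3.0 * t for t in range(1, T + 1)]   # exactly linear data
s = X[:]
w_c = [1.0] * T
w_l = [float(t) for t in range(1, T + 1)]
Psi = [[1.0 if i == j else 0.0 for j in range(T)] for i in range(T)]

# 0-based positions of the merges (3,4,5), (2,3,4), (1,2,3) in the text.
for idx in [(2, 3, 4), (1, 2, 3), (0, 1, 2)]:
    merge(idx, s, w_c, w_l, Psi)

gram = [[dot(Psi[i], Psi[j]) for j in range(T)] for i in range(T)]
second_diffs = [row[t] - 2 * row[t + 1] + row[t + 2]
                for row in Psi[:2] for t in range(T - 2)]
```

Under these assumptions, `gram` is the identity up to rounding, the first two rows of `Psi` have vanishing second differences (a linear trend of length $T$), and the three detail coefficients `s[2:]` vanish because the input is exactly linear.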

As shown above, the non-uniqueness of the low filters has no effect on preserving the linearity of the subregions that are already merged. In simulation studies, we found empirically that the choice of low filters has no qualitative effect on the results, as long as they satisfy the orthonormality condition of the transform. We therefore used a fixed rule for choosing the low filters, rather than drawing an arbitrary set that satisfies the orthonormality condition on every run, which also reduces the computational cost.

Appendix F. A practical way to implement the TGUW transformation

In this section, we explore a way of implementing the TGUW transform. As briefly mentioned in Section 2.2.3, it is implemented by consecutively updating so-called weight vectors of constancy and linearity. These two weight vectors are initially used in the first stage of the TGUW transform for obtaining the detail filter $\boldsymbol{h}$ and updated through the orthonormal transform. In detail, Steps 1 and 5 of the TGUW algorithm presented in Section 2.2.3 can be reformulated by weight vectors as follows.

Step 1. At each scale $j$ , find the set of triplets that are candidates for merging under the “two together” rule and compute the corresponding detail coefficients. Regardless of the type of merge, a detail coefficient $d_{p,q,r}^{\cdot}$ is, in general, obtained as

[TABLE]

where $p\leq q<r$ , $\boldsymbol{s}_{p:r}^{k}$ is the $k^{\text{th}}$ smooth coefficient of the subvector $\boldsymbol{s}_{p:r}$ with a length of $r-p+1$ and the constants $a,b,c$ are the elements of the detail filter $\boldsymbol{h}=(a,b,c)^{\top}$ . Specifically, the detail filter $\boldsymbol{h}$ is established by solving the following equations,

[TABLE]

where $\boldsymbol{w}^{\cdot,k}_{p:r}$ is the $k^{\text{th}}$ non-zero element of the subvector $\boldsymbol{w}^{\cdot}_{p:r}$ with a length of $r-p+1$ , and $\boldsymbol{w}^{c}$ and $\boldsymbol{w}^{l}$ are weight vectors of constancy and linearity, respectively, whose initial values are ${\boldsymbol{w}^{c}_{0}}=(1,1,\ldots,1)^{\top}$ and ${\boldsymbol{w}^{l}_{0}}=(1,2,\ldots,T)^{\top}$ . The last condition in (S.33) preserves the orthonormality of the transform. For a Type 1 merge, where the weight triplets still take their initial values $(1,1,1)$ and $(p,p+1,p+2)$ , the detail filter $\boldsymbol{h}$ becomes a unit normal vector of the plane $\{(x,y,z)\;|\;x-2y+z=0\}$ . The solution to (S.33) is unique up to multiplication by $-1$ , which can be shown by solving the equations directly, e.g. $a+b+c=0$ , $a+2b+3c=0$ and $a^{2}+b^{2}+c^{2}=1$ .
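For completeness, the example system $a+b+c=0$ , $a+2b+3c=0$ , $a^{2}+b^{2}+c^{2}=1$ (the initial weights on positions $1,2,3$ ) can be solved by hand as follows; this is a worked illustration rather than part of the original derivation.

```latex
\begin{aligned}
(a+2b+3c)-(a+b+c) &= b+2c = 0 &&\Longrightarrow\; b=-2c,\\
a &= -b-c = c,\\
a^{2}+b^{2}+c^{2} &= c^{2}+4c^{2}+c^{2} = 6c^{2} = 1 &&\Longrightarrow\; c=\pm\tfrac{1}{\sqrt{6}},\\
\boldsymbol{h} &= \pm\tfrac{1}{\sqrt{6}}\,(1,-2,1)^{\top}.
\end{aligned}
```

The same filter is obtained for initial weights on any three consecutive positions $(p,p+1,p+2)$ , since replacing the second equation by $pa+(p+1)b+(p+2)c=0$ and subtracting $p$ times the first again gives $b+2c=0$ . The resulting detail coefficient is a scaled second difference of the three observations.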

More specifically, the detail coefficient in (S.32) is formulated for each type of merging introduced in Section 2.2.1 as follows.

Type 1: merging three initial smooth coefficients $(s^{0}_{p,p},s^{0}_{p+1,p+1},s^{0}_{p+2,p+2})$ ,

[TABLE]

Type 2: merging one initial and a paired smooth coefficient $(s^{0}_{p,p},s^{[1]}_{p+1,r},s^{[2]}_{p+1,r})$ ,

[TABLE]

similarly, when merging a paired smooth coefficient and one initial, $(s^{[1]}_{p,r-1},s^{[2]}_{p,r-1},s^{0}_{r,r})$ ,

[TABLE]

Type 3: merging two sets of (paired) smooth coefficients, $(s^{[1]}_{p,q},s^{[2]}_{p,q})$ and $(s^{[1]}_{q+1,r},s^{[2]}_{q+1,r})$ ,

[TABLE]

where $q>p+1$ and $r>q+2$ . Importantly, the two consecutive merges in (S.37) are achieved by visiting the same two adjacent data regions twice. In this case, after the first detail coefficient, $d^{[1]}_{p,q,r}$ , has been obtained, we instantly update the corresponding triplets $\boldsymbol{s}$ , $\boldsymbol{w}^{c}$ and $\boldsymbol{w}^{l}$ via an orthonormal transform as defined in (8) and (S.39). Therefore, the second detail filter, $(a^{2}_{p,q,r},b^{2}_{p,q,r},c^{2}_{p,q,r})$ , is constructed with the updated $\boldsymbol{w}^{c}$ and $\boldsymbol{w}^{l}$ in a way that satisfies the conditions (S.33).

Step 5. For each $|d_{p,q,r}^{\cdot}|$ extracted in step 4, merge the corresponding smooth coefficients by updating the corresponding triplet in $\boldsymbol{s}$ , $\boldsymbol{w}^{c}$ and $\boldsymbol{w}^{l}$ through the orthonormal transform as follows,

[TABLE]

The key step is finding the $3\times 3$ orthonormal matrix, $\Lambda$ , which is composed of one detail and two low-pass filter vectors in its rows. First, the detail filter $\boldsymbol{h}^{\top}$ is determined to satisfy the conditions in (S.33), and then the two low-pass filters ( $\boldsymbol{\ell}_{1}^{\top},\boldsymbol{\ell}_{2}^{\top}$ ) are obtained by satisfying the orthonormality of $\Lambda$ . The choice of ( $\boldsymbol{\ell}_{1}^{\top},\boldsymbol{\ell}_{2}^{\top}$ ) is not unique, but as described in Appendix E, this has no effect on the orthonormal transformation itself.
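A minimal sketch of the Step 5 update on a single triplet is given below. The names `build_lambda`, `apply` and `apply_transpose` are hypothetical, the positions $7,8,9$ and the data values are made up for illustration, and the low filters follow an assumed fixed rule that is not necessarily the one used in the authors' implementation.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def unit(u):
    norm = math.sqrt(dot(u, u))
    return [a / norm for a in u]

def build_lambda(wc3, wl3):
    """Return Lambda with rows (ell_1, ell_2, h) for the given weight triplets."""
    h = unit(cross(wc3, wl3))   # satisfies h.wc = 0, h.wl = 0, h.h = 1
    ell_1 = unit(wc3)           # assumed choice of the first low filter
    ell_2 = cross(h, ell_1)     # completes the orthonormal rows
    return [ell_1, ell_2, h]

def apply(Lam, v):
    """Forward update: Lambda v."""
    return [dot(row, v) for row in Lam]

def apply_transpose(Lam, v):
    """Inverse update: Lambda^T v (Lambda is orthonormal)."""
    return [sum(Lam[k][i] * v[k] for k in range(3)) for i in range(3)]

# Type 1 merge of three raw observations at hypothetical positions 7, 8, 9.
wc3 = [1.0, 1.0, 1.0]
wl3 = [7.0, 8.0, 9.0]
Lam = build_lambda(wc3, wl3)

linear = [2.0 + 0.5 * t for t in (7, 8, 9)]   # collinear points
kinked = [5.5, 6.0, 4.0]                      # slope changes at the middle point

s_lin = apply(Lam, linear)    # last entry is the detail coefficient
s_kink = apply(Lam, kinked)
recovered = apply_transpose(Lam, s_kink)

energy_before = dot(kinked, kinked)
energy_after = dot(s_kink, s_kink)
```

Here the detail coefficient `s_lin[2]` vanishes for the collinear triplet and `s_kink[2]` does not, while `energy_before` and `energy_after` agree up to rounding because $\Lambda$ is orthonormal. The transpose recovers the original triplet.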

Appendix G. Extension to piecewise-quadratic signal

In this section, we explore how the TGUW transform can be extended to handle piecewise-quadratic signals. Given that we apply an orthonormal transformation to a chosen pair (triplet) to deal with piecewise-constant (piecewise-linear) signals, it is natural to apply a transform to a chosen quadruplet of smooth coefficients when establishing a data-adaptive unbalanced wavelet basis. In each merge, four adjacent smooth coefficients are selected and the orthonormal transformation converts them into one detail and three (updated) smooth coefficients. Those three updated smooth coefficients form a triplet in the sense that they contain information about one local quadratic regression fit. Therefore, any such triplet of smooth coefficients cannot be separated when choosing a quadruplet in any subsequent merge, which can be called the “three together” rule (instead of the “two together” rule introduced for the piecewise-linear model). We now give a simple example to illustrate how the TGUW transform works for a piecewise-quadratic signal. Figure G.7 shows the merging history of the modified TGUW transform, which follows the “three together” rule. Three different types of merges are defined similarly to those for piecewise-linear signals, except that the merges are performed on quadruplets instead of triplets. The tree structure shows that the modified TGUW transform performs well in detecting a single change-point in the piecewise-quadratic scenario, as the last Type 3 merge corresponds to the true change-point.
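To make the quadruplet merge concrete, the sketch below builds a length-4 detail filter for four initial smooth coefficients. It rests on an assumption that is not spelled out in the text: by analogy with (S.33), the filter is taken to be the unit vector orthogonal to weight vectors of constancy, linearity and a hypothetical quadratic weight $(1,4,\ldots,T^{2})^{\top}$ . The function names are illustrative only.

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def detail_filter4(w_c, w_l, w_q):
    """Unit vector in R^4 orthogonal to three weight quadruplets.

    Uses the generalised cross product: entry i is the signed 3x3 minor
    obtained by deleting column i from the 3x4 matrix of weights.
    """
    rows = [w_c, w_l, w_q]
    h = []
    for i in range(4):
        minor = [[row[j] for j in range(4) if j != i] for row in rows]
        h.append((-1) ** i * det3(minor))
    norm = math.sqrt(sum(x * x for x in h))
    return [x / norm for x in h]

# Assumed initial weights on positions 1..4.
w_c = [1.0, 1.0, 1.0, 1.0]
w_l = [1.0, 2.0, 3.0, 4.0]
w_q = [1.0, 4.0, 9.0, 16.0]   # hypothetical quadratic weight vector

h = detail_filter4(w_c, w_l, w_q)   # proportional to (1, -3, 3, -1)

quadratic = [1.0 + 2.0 * t + 3.0 * t * t for t in (1, 2, 3, 4)]
cubic = [float(t ** 3) for t in (1, 2, 3, 4)]

d_quadratic = sum(a * x for a, x in zip(h, quadratic))
d_cubic = sum(a * x for a, x in zip(h, cubic))
```

Under this assumption the filter is a scaled third difference, so `d_quadratic` vanishes on any exactly quadratic quadruplet while `d_cubic` does not, mirroring the role of the scaled second difference in the piecewise-linear case.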

