Fast Sequence Segmentation using Log-Linear Models

Nikolaj Tatti

arXiv:1902.03285·cs.DS·February 12, 2019

TL;DR

This paper introduces a pruning technique for dynamic programming in sequence segmentation with log-linear models, significantly reducing computation time while maintaining optimality.

Contribution

It presents a theoretical pruning method that accelerates the dynamic programming approach for sequence segmentation using log-linear models.

Findings

01

Significant reduction in computational time demonstrated empirically.

02

Pruning method maintains optimal segmentation results.

03

Applicable to a broad class of distributions within log-linear models.

Abstract

Sequence segmentation is a well-studied problem, where given a sequence of elements, an integer K, and some measure of homogeneity, the task is to split the sequence into K contiguous segments that are maximally homogeneous. A classic approach to find the optimal solution is by using a dynamic program. Unfortunately, the execution time of this program is quadratic with respect to the length of the input sequence. This makes the algorithm slow for a sequence of non-trivial length. In this paper we study segmentations whose measure of goodness is based on log-linear models, a rich family that contains many of the standard distributions. We present a theoretical result allowing us to prune many suboptimal segmentations. Using this result, we modify the standard dynamic program for one-dimensional log-linear models, and by doing so reduce the computational time. We demonstrate empirically,…

Tables

Table 1: Characteristics of real-world datasets and performance of the algorithm with 20 segments. The last column is the time needed to compute the optimal segmentation using the traditional dynamic program.

Data	length	performance	time (s)	baseline time (s)
Marotta	5 000	0.04	0.6	13
Power	35 040	0.03	19.5	600
Video1	11 251	0.1	6.7	62
Video2	11 251	0.14	9.7	62

Equations

\begin{split}X&=\Big\{\frac{1}{101-j}\sum_{i=j}^{100}D_{i}\mid 1\leq j\leq 100\Big\}\quad\text{and}\\ Y&=\Big\{\frac{1}{j-100}\sum_{i=101}^{j}D_{i}\mid 101\leq j\leq 200\Big\},\end{split}

p(x\mid r)=q(x)\exp\left(Z(r)+r^{T}S(x)\right),

\log\prod_{I\in P}\prod_{k\in I}p(D_{k}\mid r_{I})=\sum_{I\in P}\sum_{k\in I}\log q(D_{k})+Z(r_{I})+r_{I}^{T}S(D_{k})=\sum_{k=1}^{\left|D\right|}\log q(D_{k})+\sum_{I\in P}\sum_{k\in I}Z(r_{I})+r_{I}^{T}S(D_{k}),

\mathit{c}\left(D\right)=\sum_{i=1}^{\left|D\right|}D_{i}\quad\text{and}\quad\mathit{av}\left(D\right)=\frac{\mathit{c}\left(D\right)}{\left|D\right|}

\mathit{sc}\left(D\mid r\right)=\left|D\right|Z(r)+r^{T}\mathit{c}\left(D\right).

\mathit{sc}\left(P;D\right)=\sum_{I\in P}\mathit{sc}\left(D[I]\right),\quad\text{where}\quad\mathit{sc}\left(D\right)=\sup_{r}\mathit{sc}\left(D\mid r\right),

(2\pi)^{-M/2}e^{-0.5\left\|x-\mu\right\|^{2}}\quad\text{as}\quad e^{-0.5\left\|x\right\|^{2}}(2\pi)^{-M/2}e^{-0.5\left\|\mu\right\|^{2}+\mu^{T}x}.

\log\prod_{I\in P}\prod_{x\in D[I]}(2\pi)^{-M/2}e^{-0.5\left\|x-\mu_{I}\right\|^{2}}=\sum_{I\in P}\sum_{x\in D[I]}-\frac{M}{2}\log 2\pi-0.5\left\|x-\mu_{I}\right\|^{2}=-\frac{\left|D\right|M}{2}\log 2\pi-0.5\sum_{I\in P}\sum_{x\in D[I]}\left\|x-\mu_{I}\right\|^{2}.

\mathit{diff}\left(D,E\right)=\left\{\mathit{av}\left(D[k,\left|D\right|]\right)-\mathit{av}\left(E[1,l]\right)\mid 1\leq k\leq\left|D\right|,\ 1\leq l\leq\left|E\right|\right\}.

\mathit{int}_{L}\left(D\right)=\Big(\min_{1\leq i\leq\left|D\right|}\mathit{av}\left(1,i\right),\ \max_{1\leq i\leq\left|D\right|}\mathit{av}\left(1,i\right)\Big)

\mathit{int}_{R}\left(D\right)=\Big(\min_{1\leq i\leq\left|D\right|}\mathit{av}\left(i,\left|D\right|\right),\ \max_{1\leq i\leq\left|D\right|}\mathit{av}\left(i,\left|D\right|\right)\Big).

\mathit{av}\left(F[i,\left|F\right|]\right)=\max_{1\leq j\leq\left|F\right|}\mathit{av}\left(F[j,\left|F\right|]\right).

\mathit{av}\left(j,\left|D\right|\right)=\max_{1\leq k\leq\left|D\right|}\mathit{av}\left(k,\left|D\right|\right).

V(T)=\left\{b\mid b\in\mathit{borders}\left(c\right)\text{ for some }c\in C\right\}.

h(k\mid s,r)=\mathit{sc}\left([1,k]\mid s\right)+\mathit{sc}\left([k+1,e]\mid r\right).

g(l,\delta\mid s,r)=l\left(Z(s)-Z(r)+(s-r)^{T}\delta\right).

h(k\mid s,r)=kZ(s)+s^{T}\mathit{c}\left(k\right)+(e-k)Z(r)+r^{T}\left(\mathit{c}\left(e\right)-\mathit{c}\left(k\right)\right)=k\left(Z(s)-Z(r)\right)+(s-r)^{T}\mathit{c}\left(k\right)+eZ(r)+r^{T}\mathit{c}\left(e\right).

\begin{split}&h(k\mid s,r)-h(l\mid s,r)=k(Z(s)-Z(r))+(s-r)^{T}\mathit{c}\left(k\right)-l(Z(s)-Z(r))-(s-r)^{T}\mathit{c}\left(l\right)\\ &\quad=(k-l)\Big(Z(s)-Z(r)+(s-r)^{T}\frac{\mathit{c}\left(k\right)-\mathit{c}\left(l\right)}{k-l}\Big)=g(k-l,\mathit{av}\left(l+1,k\right)\mid s,r).\end{split}

x=\max_{k<m}\mathit{sc}\left([1,k],[k+1,e]\right)\quad\text{and}\quad z=\max_{k>m}\mathit{sc}\left([1,k],[k+1,e]\right).

\mathit{sc}\left([1,m]\mid s\right)+\mathit{sc}\left([m+1,e]\mid r\right)\geq y-\epsilon.

\epsilon\geq z-h(m)\geq h(n)-h(m)=g(n-m,\alpha)=c\,g(m-l,\alpha)=c\,g(m-l,\beta)+c(m-l)(s-r)^{T}(\alpha-\beta)\geq c\,g(m-l,\beta)=c\left(h(m)-h(l)\right)\geq c\left(h(m)-x\right)\geq c\left(y-\epsilon-x\right),

Full text

Nikolaj Tatti

Department of Mathematics and Computer Science, University of Antwerp, Antwerp, and Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium

Email: [email protected]

Abstract

Sequence segmentation is a well-studied problem: given a sequence of elements, an integer $K$, and some measure of homogeneity, the task is to split the sequence into $K$ contiguous segments that are maximally homogeneous. A classic approach to finding the optimal solution is to use a dynamic program. Unfortunately, the execution time of this program is quadratic with respect to the length of the input sequence, which makes the algorithm slow for sequences of non-trivial length. In this paper we study segmentations whose measure of goodness is based on log-linear models, a rich family that contains many of the standard distributions. We present a theoretical result allowing us to prune many suboptimal segmentations. Using this result, we modify the standard dynamic program for one-dimensional log-linear models, and by doing so reduce the computational time. We demonstrate empirically that this approach can significantly reduce the computational burden of finding the optimal segmentation.

Keywords:

segmentation, pruning, change-point detection, dynamic program

Journal: Data Mining and Knowledge Discovery

1 Introduction

Sequence segmentation is a well-studied problem, where given a sequence of elements, an integer $K$ , and some measure of homogeneity, the task is to split the sequence into $K$ contiguous segments that are maximally homogeneous.

An exact solution for segmentation with $K$ segments can be obtained by a classic dynamic program in $O(L^{2}K)$ time, where $L$ is the length of the sequence (Bellman, 1961). Due to the quadratic complexity, exact segmentation is impractical for sequences of non-trivial length. In this paper we introduce a speedup to the dynamic program used for computing the exact solution. Our key result, given in Theorem 3.1, states that when certain conditions are met, we can discard a candidate for a segment border, thus speeding up the inner loop of the dynamic program.

We consider segmentation using the log-likelihood of a log-linear model to score the goodness of individual segments. Many standard distributions can be described as log-linear models, including Bernoulli, Gamma, Poisson, and Gaussian distributions. Moreover, when using a Gaussian distribution, optimizing the log-likelihood is equivalent to minimizing the $L_{2}$ error (see Example 1).

The conditions given in Theorem 3.1 are hard to verify in general; however, we demonstrate that this can be done with relative ease for one-dimensional models. The key idea is as follows: Consider segmenting the sequence given in Figure 1(a) into $2$ segments using the $L_{2}$ error. Assume a segmentation $[1,100]$, $[101,200]$. Figure 1(b) tells us that this segmentation is not optimal. In fact, the optimal segmentation with 2 segments for this data is $[1,70],[71,200]$.

Sequence values around $101$ have a particular characteristic which we can exploit to speed up the optimization. In order to demonstrate this, let us define

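$$\begin{aligned}X&=\Big\{\frac{1}{101-j}\sum_{i=j}^{100}D_{i}\mid 1\leq j\leq 100\Big\}\quad\text{and}\\ Y&=\Big\{\frac{1}{j-100}\sum_{i=101}^{j}D_{i}\mid 101\leq j\leq 200\Big\},\end{aligned}$$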

that is, $X$ contains the averages from the right side of the first segment and $Y$ contains the averages from the left side of the second segment. Let us define $r_{1}=\min X$, $r_{2}=\max X$, $l_{1}=\min Y$, $l_{2}=\max Y$. We see that $r_{1}\approx-1$, $r_{2}\approx 1.8$, $l_{1}\approx-1.8$, and $l_{2}\approx 1$. That is, the intervals $[r_{1},r_{2}]$ and $[l_{1},l_{2}]$ intersect. We will show that in such a case not only can we safely ignore the segmentation $[1,100]$, $[101,200]$, but even if we augment the sequence with additional data points, index $101$ will never be part of the optimal segmentation with 2 segments. This pruning allows us to speed up the dynamic programming.

In general, if the extreme values of averages $[r_{1},r_{2}]$ and $[l_{1},l_{2}]$ computed from neighboring segments intersect, we know that the segmentation is suboptimal. On the other hand, the optimal segmentation with $4$ segments for data in Figure 1(a), $[1,70],[71,100],[101,130],[131,200]$ , uses index $101$ . We do not violate our condition since the extreme values of averages $[r_{1},r_{2}]$ computed only from the second segment and extreme values of averages $[l_{1},l_{2}]$ computed only from the third segment no longer intersect.

Using this idea, we will build an efficient pruning technique for segmenting data using one-dimensional log-linear models. We empirically demonstrate that this approach can reduce the computational load by several orders of magnitude compared to the standard approach.

The remainder of the paper is organized as follows. In Section 2 we give preliminary notation and define the segmentation problem. In Section 3 we give the key result which allows us to prune segments. Sections 4–5 are devoted to a segmentation algorithm. We present our experiments in Section 6 and related work in Section 7. Finally, we conclude the paper with a discussion in Section 8.

2 Segmentation for log-linear models

In this section we give preliminaries and define the segmentation problem.

A sequence $D=\left(D_{1},\ldots,D_{L}\right)$ is a sequence of real vectors, each of length $M$, $D_{i}\in\mathbb{R}^{M}$. A segment $I=[b,e]$ consists of two integers such that $1\leq b\leq e\leq L$. We will write $k\in I$ whenever $b\leq k\leq e$ for an integer $k$. We define $D[b,e]=\left(D_{b},\ldots,D_{e}\right)$ to be the subsequence corresponding to that segment. A segmentation $P$ is a list of disjoint segments that cover $D$, that is, $P=\left(I_{1},\ldots,I_{K}\right)$ such that the first segment $I_{1}$ starts at $1$, the last segment $I_{K}$ ends at ${\left|D\right|}$, and $I_{k}=[a,b]$ begins right after $I_{k-1}=[c,d]$, that is, $a=d+1$.

Our goal is to find a segmentation that maximizes the likelihood of a log-linear model of each individual segment. By log-linear models, also known as the exponential family, we mean models whose probability density function can be written as

$$p(x\mid r)=q(x)\exp\left(Z(r)+r^{T}S(x)\right),$$

where $S:\mathbb{R}^{M}\to\mathbb{R}^{N}$ is a function mapping $x$ to a vector in $\mathbb{R}^{N}$, $r\in\mathbb{R}^{N}$ is the parameter vector of the model, and $Z(r)$ is the normalization constant. Many standard distributions are log-linear, for example, Poisson, Gamma, Bernoulli, Binomial, and Gaussian (with either fixed or unknown variance). We will argue later in this section that using a Gaussian distribution with a fixed variance is equivalent to minimizing the $L_{2}$ error.

Assume that we are given a segmentation $P$ and for each segment $I\in P$ , we have a parameter vector $r_{I}$ . Let us now consider the log-likelihood

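$$\log\prod_{I\in P}\prod_{k\in I}p(D_{k}\mid r_{I})=\sum_{I\in P}\sum_{k\in I}\log q(D_{k})+Z(r_{I})+r_{I}^{T}S(D_{k})=\sum_{k=1}^{\left|D\right|}\log q(D_{k})+\sum_{I\in P}\sum_{k\in I}Z(r_{I})+r_{I}^{T}S(D_{k}),$$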

for this segmentation of $D$ . Note that the first term in the right-hand side does not depend on the parameters nor on the segmentation. Consequently, we can ignore it. In addition, note that we can safely assume that $S(x)=x$ . If this is not the case, we can always transform sequence $D$ into $D^{\prime}=\left(S(D_{1}),\ldots,S(D_{L})\right)$ . From now on we will assume that $S(x)=x$ .

For notational simplicity, let us define

$$\mathit{c}\left(D\right)=\sum_{i=1}^{\left|D\right|}D_{i}\quad\text{and}\quad\mathit{av}\left(D\right)=\frac{\mathit{c}\left(D\right)}{\left|D\right|}$$

to be the sum and the average of data points in $D$. If $D$ is clear from the context, we will often write $\mathit{c}\mathopen{}\left(i,j\right)$ and $\mathit{av}\mathopen{}\left(i,j\right)$ to mean $\mathit{c}\mathopen{}\left(D[i,j]\right)$ and $\mathit{av}\mathopen{}\left(D[i,j]\right)$. As shorthand, we write $\mathit{c}\mathopen{}\left(j\right)$ and $\mathit{av}\mathopen{}\left(j\right)$ to mean $\mathit{c}\mathopen{}\left(1,j\right)$ and $\mathit{av}\mathopen{}\left(1,j\right)$.

We define the score of a single segment given a parameter vector as

$$\mathit{sc}\left(D\mid r\right)=\left|D\right|Z(r)+r^{T}\mathit{c}\left(D\right).$$

We define the score for a segmentation $P$ as

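$$\mathit{sc}\left(P;D\right)=\sum_{I\in P}\mathit{sc}\left(D[I]\right),\quad\text{where}\quad\mathit{sc}\left(D\right)=\sup_{r}\mathit{sc}\left(D\mid r\right),$$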

that is, $\mathit{sc}\mathopen{}\left(P;D\right)$ is a sum of the optimal scores of individual segments. We see that optimizing $\mathit{sc}\mathopen{}\left(P;D\right)$ is equivalent to maximizing likelihood of the log-linear model.

We are now ready to state our optimization problem.

Problem 1

Given a sequence $D$ , a log-linear model, and an integer $K$ , find a segmentation $P$ with $K$ segments maximizing $\mathit{sc}\mathopen{}\left(P;D\right)$ .

Example 1

Let us now consider a Gaussian distribution with identity covariance matrix. This distribution is log-linear since we can rewrite

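$$(2\pi)^{-M/2}e^{-0.5\left\|x-\mu\right\|^{2}}\quad\text{as}\quad e^{-0.5\left\|x\right\|^{2}}(2\pi)^{-M/2}e^{-0.5\left\|\mu\right\|^{2}+\mu^{T}x}.$$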

The log-likelihood of a Gaussian distribution for a segmentation $P$ is

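$$\log\prod_{I\in P}\prod_{x\in D[I]}(2\pi)^{-M/2}e^{-0.5\left\|x-\mu_{I}\right\|^{2}}=\sum_{I\in P}\sum_{x\in D[I]}-\frac{M}{2}\log 2\pi-0.5\left\|x-\mu_{I}\right\|^{2}=-\frac{\left|D\right|M}{2}\log 2\pi-0.5\sum_{I\in P}\sum_{x\in D[I]}\left\|x-\mu_{I}\right\|^{2}.$$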

The optimal value for $\mu_{I}$ is an average of data points in $D[I]$ . The first term of the right-hand side is constant while the second term is the $L_{2}$ error. Consequently, selecting a segmentation that maximizes log-likelihood is equivalent to finding a segmentation that minimizes the $L_{2}$ error, a typical choice for an error function.
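As a worked instance of the segment score for the Gaussian model of Example 1, the supremum over the parameter can be computed in closed form. The derivation below assumes that the constant $(2\pi)^{-M/2}$ is absorbed into $q(x)$, so that $Z(\mu)=-0.5\left\|\mu\right\|^{2}$ and $r=\mu$; this split is an assumption made here for illustration and is not spelled out in the text.

```latex
\begin{aligned}
\mathit{sc}(D\mid\mu) &= -\tfrac{1}{2}\,|D|\,\|\mu\|^{2} + \mu^{T}\mathit{c}(D),\\
\nabla_{\mu}\,\mathit{sc}(D\mid\mu) &= -|D|\,\mu + \mathit{c}(D) = 0
  \;\Longrightarrow\; \mu = \mathit{av}(D),\\
\mathit{sc}(D) &= \frac{\|\mathit{c}(D)\|^{2}}{2\,|D|}
  = \tfrac{1}{2}\sum_{x\in D}\|x\|^{2} - \tfrac{1}{2}\sum_{x\in D}\|x-\mathit{av}(D)\|^{2}.
\end{aligned}
```

The first sum in the last line does not depend on the segmentation, so under this assumption maximizing the total score is the same as minimizing the summed squared error, in agreement with the example.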

The optimal segmentation can be found with a dynamic program (Bellman, 1961). In order to see this, let $P=\left(I_{1},\ldots,I_{K}\right)$ be the optimal segmentation with $K$ segments. Let $c$ be the last index of $I_{K-1}$. Then $\left(I_{1},\ldots,I_{K-1}\right)$ is the optimal segmentation for $D[1,c]$. We can find the optimal segmentation with $K$ segments by first computing the optimal segmentation with $K-1$ segments for each $D[1,c]$ and then testing which segment of the form $[c+1,{\left|D\right|}]$ we need to add to produce the optimal segmentation with $K$ segments. This leads to an algorithm of time complexity $O(K{\left|D\right|}^{2})$. The goal of this paper is to provide an optimization of this dynamic program.
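The unpruned dynamic program described above can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes one-dimensional data and the unit-variance Gaussian score, for which maximizing $\left|D\right|Z(r)+r\,\mathit{c}\left(D\right)$ with $Z(r)=-r^{2}/2$ gives $\mathit{c}\left(D\right)^{2}/(2\left|D\right|)$. The function name `segment_dp` and the table names are hypothetical.

```python
from itertools import accumulate


def segment_dp(data, k):
    """Classic O(k * n^2) dynamic program maximizing the Gaussian segment score."""
    n = len(data)
    prefix = [0.0] + list(accumulate(data))  # prefix[i] = c(1, i)

    def sc(b, e):
        # Score of segment [b, e] (1-based, inclusive): c(b, e)^2 / (2 * length).
        s = prefix[e] - prefix[b - 1]
        return s * s / (2 * (e - b + 1))

    neg_inf = float("-inf")
    # best[m][i]: best score of splitting data[1..i] into m segments.
    best = [[neg_inf] * (n + 1) for _ in range(k + 1)]
    # start[m][i]: first index of the last segment in that best split.
    start = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for i in range(m, n + 1):
            for j in range(m, i + 1):  # candidate last segment [j, i]
                cand = best[m - 1][j - 1] + sc(j, i)
                if cand > best[m][i]:
                    best[m][i], start[m][i] = cand, j

    # Walk back through the stored start points to recover the segments.
    segments, e = [], n
    for m in range(k, 0, -1):
        b = start[m][e]
        segments.append((b, e))
        e = b - 1
    return best[k][n], segments[::-1]
```

The innermost loop over `j` is the part that the pruning result of the paper aims to shorten: every candidate start point `j` that can be discarded saves one score evaluation for every later `i`.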

3 Necessary Condition for Optimal Segmentation

In this section we give a key result of this paper. This result allows us to prune candidates that will not be included in the optimal segmentation and hence speed up the dynamic program.

In order to do so, let $V$ be a set of vectors in $\mathbb{R}^{N}$. We say that $V$ is a cover if for any $y\in\mathbb{R}^{N}$, there is a $v\in V$ such that $y^{T}v\geq 0$. See Figure 2 for an example. Given two sequences $D$ and $E$ we define $\mathit{diff}\mathopen{}\left(D,E\right)$ to be the difference set for $D$ and $E$ as

$$\mathit{diff}\left(D,E\right)=\left\{\mathit{av}\left(D[k,\left|D\right|]\right)-\mathit{av}\left(E[1,l]\right)\mid 1\leq k\leq\left|D\right|,\ 1\leq l\leq\left|E\right|\right\}.$$

We are now ready to state the key result of the paper. For readability, we postpone the proof to Appendix A.1.

Theorem 3.1

Let $P$ be a segmentation. There is a segmentation $P^{\prime}$ such that $\mathit{sc}\mathopen{}\left(P^{\prime}\right)\geq\mathit{sc}\mathopen{}\left(P\right)$ and $\mathit{diff}\mathopen{}\left(D[I],D[J]\right)$ is not a cover for any two consecutive segments $I$ and $J$ in $P^{\prime}$ .

4 Segmentation for one-dimensional models

In the previous section we saw a necessary condition for optimal segmentation. This involves checking whether the difference set of consecutive segments is a cover. In this section and the next section we show that we can efficiently check this condition if our log-linear model is one-dimensional, that is, if data points $D_{i}$ are real numbers.

In order to show this, let $D$ be a sequence. We define a left interval to be an interval

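$$\mathit{int}_{L}\left(D\right)=\Big(\min_{1\leq i\leq\left|D\right|}\mathit{av}\left(1,i\right),\ \max_{1\leq i\leq\left|D\right|}\mathit{av}\left(1,i\right)\Big)$$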

of extreme values of $\mathit{av}\mathopen{}\left(1,i\right)$ . Similarly, we define a right interval to be

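$$\mathit{int}_{R}\left(D\right)=\Big(\min_{1\leq i\leq\left|D\right|}\mathit{av}\left(i,\left|D\right|\right),\ \max_{1\leq i\leq\left|D\right|}\mathit{av}\left(i,\left|D\right|\right)\Big).$$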

We can now express the condition using these intervals.

Theorem 4.1

Assume two sequences, $D$ and $E$ and let $S$ be a one-dimensional statistic. Then $\mathit{diff}\mathopen{}\left(D,E\right)$ is a cover if and only if the intervals $\mathit{int}_{R}\mathopen{}\left(D\right)$ and $\mathit{int}_{L}\mathopen{}\left(E\right)$ intersect.

Proof

Let $\mathit{int}_{R}\mathopen{}\left(D\right)=(x,y)$ and $\mathit{int}_{L}\mathopen{}\left(E\right)=(u,v)$ . $\mathit{diff}\mathopen{}\left(D,E\right)$ is a cover if and only if there are $a$ , $b$ , $c$ , and $d$ such that $\mathit{av}\mathopen{}\left(D[a,{\left|D\right|}]\right)\leq\mathit{av}\mathopen{}\left(E[1,b]\right)$ and $\mathit{av}\mathopen{}\left(D[c,{\left|D\right|}]\right)\geq\mathit{av}\mathopen{}\left(E[1,d]\right)$ . This is equivalent to $x\leq v$ and $y\geq u$ , which is equivalent to $\mathit{int}_{R}\mathopen{}\left(D\right)$ and $\mathit{int}_{L}\mathopen{}\left(E\right)$ intersecting.
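The equivalence in Theorem 4.1 can be checked numerically on small one-dimensional sequences. The sketch below is a brute-force illustration with hypothetical function names; it relies on the observation that, in one dimension, a set is a cover exactly when it contains both a non-positive and a non-negative element.

```python
def av(seq, b, e):
    """Average of seq[b..e], 1-based and inclusive."""
    return sum(seq[b - 1:e]) / (e - b + 1)


def int_left(seq):
    """Extreme values of the prefix averages av(1, i)."""
    avgs = [av(seq, 1, i) for i in range(1, len(seq) + 1)]
    return min(avgs), max(avgs)


def int_right(seq):
    """Extreme values of the suffix averages av(i, |seq|)."""
    n = len(seq)
    avgs = [av(seq, i, n) for i in range(1, n + 1)]
    return min(avgs), max(avgs)


def intervals_intersect(d, e):
    x, y = int_right(d)
    u, v = int_left(e)
    return x <= v and y >= u


def diff_is_cover(d, e):
    """Brute-force cover test on the difference set diff(d, e)."""
    diffs = [av(d, k, len(d)) - av(e, 1, l)
             for k in range(1, len(d) + 1)
             for l in range(1, len(e) + 1)]
    # One-dimensional cover: some element <= 0 and some element >= 0.
    return min(diffs) <= 0 <= max(diffs)
```

The interval test needs only four numbers per pair of segments, whereas the brute-force test enumerates all $\left|D\right|\cdot\left|E\right|$ differences; this is what makes the condition cheap to check inside the dynamic program.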

We can now use this result to design an efficient algorithm. Assume that we already have computed for each $j$ the optimal segmentation with $K-1$ segments, say $P_{j}$ covering $D[1,j]$ . We now want to find an optimal segmentation with $K$ segments covering $D[1,i]$ . In order to do so we need to augment each $P_{j-1}$ for $j\leq i$ with a segment $[j,i]$ , and pick the optimal segmentation. Assume that the intervals $\mathit{int}_{L}\mathopen{}\left(D[j,i]\right)$ and $\mathit{int}_{R}\mathopen{}\left(D[c,j-1]\right)$ intersect, where $c$ is the starting point of the last segment in $P_{j-1}$ . Then Theorems 3.1 and 4.1 imply that we can safely ignore the segmentation $P_{j-1}$ augmented with $[j,i]$ . Moreover, if the intervals intersect when segmenting $D[1,i]$ , they will also intersect when segmenting $D[1,k]$ for $k>i$ . Hence, as soon as $\mathit{int}_{L}\mathopen{}\left(D[j,i]\right)$ and $\mathit{int}_{R}\mathopen{}\left(D[c,j-1]\right)$ intersect, we can ignore $j$ as a candidate for the starting point of the last segment. We present the pseudo-code for this approach as Algorithm 1.

Let us next analyze the time and memory complexity of Algorithm 1. Let $L$ be the maximal size of $C$ . It is easy to see that we can compute $\mathit{sc}\mathopen{}\left(D[j,i]\right)$ and $\mathit{int}_{L}\mathopen{}\left(D[j,i]\right)$ in constant time by keeping and updating the sum $\mathit{c}\mathopen{}\left(j,i\right)$ for every $j\in C$ . The only non-trivial part of Algorithm 1 is computing the right interval $\mathit{int}_{R}\mathopen{}\left(D[c,i]\right)$ . In the next section, we will show how to compute the right interval in amortized $O(L)$ time, hence the execution time of the algorithm is in $O(L{\left|D\right|})$ . Moreover, we will show that the total memory requirement for computing the right interval is in $O({\left|D\right|})$ which will make the memory usage of the algorithm $O({\left|D\right|})$ .
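The constant-time update of the left interval mentioned above can be sketched as follows, assuming one running sum per candidate start point $j$. The class name `LeftInterval` and its fields are hypothetical and only illustrate the bookkeeping; they are not taken from Algorithm 1.

```python
class LeftInterval:
    """Tracks int_L(D[j, i]) for a fixed start j as i grows, in O(1) per point."""

    def __init__(self):
        self.total = 0.0            # running sum c(j, i)
        self.count = 0              # segment length i - j + 1
        self.low = float("inf")     # smallest prefix average seen so far
        self.high = float("-inf")   # largest prefix average seen so far

    def push(self, value):
        """Extend the segment by one data point and return the updated interval."""
        self.total += value
        self.count += 1
        mean = self.total / self.count  # av(j, i)
        self.low = min(self.low, mean)
        self.high = max(self.high, mean)
        return self.low, self.high
```

Because the interval can only widen as points are added, an intersection with a fixed right interval, once observed, persists for all later $i$; this matches the remark that a pruned candidate never needs to be reconsidered.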

5 Computing the Right Interval

In this section we show how to compute the right interval, as needed in Algorithm 1. We will focus on how to compute the maximal value of the right interval; we can compute the minimal value using exactly the same framework.

5.1 Computing the Borders

Our first goal, given a sequence $D$ and integer $i$ , is to find $j$ such that $\mathit{av}\mathopen{}\left(j,i\right)$ is maximal. Naturally, if we have to do so from scratch, we have no other option but to test every $1\leq j\leq i$ . However, since segmentation needs the maximal average for every $i$ we can use information from previous scans to find the optimal $j$ more quickly.

We will now present the main results by Calders et al (2007) in which the authors considered efficiently finding the maximal average from a stream of data points. In the next section we will modify this approach to make it more memory-efficient.

Given a sequence $D$ we say that $1\leq i\leq{\left|D\right|}$ is a border if there is a (possibly empty) sequence $E$ such that if we define $F$ to be $D$ concatenated with $E$ , then

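$$\mathit{av}\left(F[i,\left|F\right|]\right)=\max_{1\leq j\leq\left|F\right|}\mathit{av}\left(F[j,\left|F\right|]\right).$$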

We define $\mathit{borders}\mathopen{}\left(D\right)$ to be the sorted list of border points of $D$ .

Let $D$ be a sequence. Further, let $1\leq i\leq j\leq{\left|D\right|}$ and let $\left(b_{1},\ldots,b_{M}\right)=\mathit{borders}\mathopen{}\left(D[i,j]\right)$. Whenever $D$ is clear from the context, we define $\mathit{borders}\mathopen{}\left(i,j\right)$ to be $\left(b_{1}+i-1,\ldots,b_{M}+i-1\right)$. Further, we will write $\mathit{borders}\mathopen{}\left(i\right)$ instead of $\mathit{borders}\mathopen{}\left(i,{\left|D\right|}\right)$.

The following theorem states that a maximal average can be found by simply taking the largest border. (Footnote 1: Calders et al (2007) deal only with binary sequences, but we can easily extend these results to the general case.)

Theorem 5.1 (see Calders et al, 2007)

Assume a sequence $D$ . Let $j=\max\mathit{borders}\mathopen{}\left(D\right)$ . Then

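$$\mathit{av}\left(j,\left|D\right|\right)=\max_{1\leq k\leq\left|D\right|}\mathit{av}\left(k,\left|D\right|\right).$$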

We can describe the borders using the following theorem.

Theorem 5.2 (see Calders et al, 2007)

An integer $i$ is a border for $D$ if and only if there are no $a$ and $b$ , $a<i\leq b$ such that $\mathit{av}\mathopen{}\left(a,i-1\right)\geq\mathit{av}\mathopen{}\left(i,b\right)$ .

Example 2

Assume that we are given a sequence $D=\left(2,0,1,2,1,1,9,2,5,0\right)$ , and that $S(x)=x$ . According to Theorem 5.2, index $3\notin\mathit{borders}\mathopen{}\left(D\right)$ , since $\mathit{av}\mathopen{}\left(1,2\right)=1=\mathit{av}\mathopen{}\left(3,3\right)$ . The borders are $\left(1,4,7\right)=\mathit{borders}\mathopen{}\left(D\right)$ .
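The characterization in Theorem 5.2 can be turned directly into a brute-force check, which is useful for verifying small examples such as the one above. The sketch below uses hypothetical names and exact rational arithmetic (so it assumes integer data) to avoid rounding issues when averages tie.

```python
from fractions import Fraction


def av(seq, a, b):
    """Exact average of seq[a..b], 1-based and inclusive (integer data assumed)."""
    return Fraction(sum(seq[a - 1:b]), b - a + 1)


def borders_bruteforce(seq):
    """All borders of seq, tested index by index with the condition of Theorem 5.2."""
    n = len(seq)
    result = []
    for i in range(1, n + 1):
        # i is not a border if some block ending at i - 1 has an average
        # at least as large as some block starting at i.
        blocked = any(av(seq, a, i - 1) >= av(seq, i, b)
                      for a in range(1, i)
                      for b in range(i, n + 1))
        if not blocked:
            result.append(i)
    return result
```

This check is far too slow for real use, since it examines every pair of blocks around every index; it serves only as a reference against which faster update rules can be compared.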

Our next step is to revise the algorithm given by Calders et al (2007) for constructing $\mathit{borders}\mathopen{}\left(1,i\right)$ from $\mathit{borders}\mathopen{}\left(1,i-1\right)$ . The key idea for update is given in the following theorem.

Theorem 5.3 (see Calders et al, 2007)

Let us assume a list of borders $\left(b_{1},\ldots,b_{M}\right)=\mathit{borders}\mathopen{}\left(1,i-1\right)$ . Define $b_{M+1}=i$ . Define $N$ , $2\leq N\leq M+1$ , to be the maximal integer such that $\mathit{av}\mathopen{}\left(b_{N-1},i\right)<\mathit{av}\mathopen{}\left(b_{N},i\right)$ . If such $N$ does not exist, we set $N=1$ . Then, $\left(b_{1},\ldots,b_{N}\right)=\mathit{borders}\mathopen{}\left(1,i\right)$ .

The update algorithm (given as Algorithm 2) starts with the previous borders $\left(b_{1},\ldots,b_{M}\right)=\mathit{borders}\mathopen{}\left(1,i-1\right)$ and adds $i$ as a border. Then the algorithm tests whether the average starting from the second-to-last border is at least as large as the average starting from the last border, both taken up to $i$. If so, then the condition in Theorem 5.3 is violated, and we remove the last border and repeat the test. The correctness of Update is given by Calders et al (2007). Note that we can compute the needed averages in constant time, for example, by precalculating a sequence $\left(\mathit{c}\mathopen{}\left(1\right),\ldots,\mathit{c}\mathopen{}\left({\left|D\right|}\right)\right)$.
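The update step can be sketched as a stack operation. The code below is an illustration in the spirit of Algorithm 2, whose pseudo-code is not reproduced here; the function name is hypothetical, prefix sums provide constant-time averages, and exact fractions (integer data assumed) keep ties exact.

```python
from fractions import Fraction
from itertools import accumulate


def borders_incremental(seq):
    """Return borders(1, i) for every prefix length i, using the rule of Theorem 5.3."""
    prefix = [0] + list(accumulate(seq))  # prefix[i] = c(1, i)

    def av(a, b):
        return Fraction(prefix[b] - prefix[a - 1], b - a + 1)

    stack, history = [], []
    for i in range(1, len(seq) + 1):
        stack.append(i)  # tentatively add i as the newest border
        # Remove the last border while the average from the second-to-last
        # border is at least as large as the average from the last one.
        while len(stack) >= 2 and av(stack[-2], i) >= av(stack[-1], i):
            stack.pop()
        history.append(list(stack))
    return history
```

Each index is pushed once and popped at most once, so maintaining a single border list costs amortized constant time per data point.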

5.2 Computing Borders Simultaneously

We can use borders to discover the right interval for a single segment. However, recall that in Algorithm 1 we need to be able to compute the right interval for any $D[c,i]$ , where $c\in C$ is the current set of candidates for a segment. A naïve approach would be to compute borders separately for each $D[c,i]$ . This leads to $O(L{\left|D\right|})$ memory and time consumption, where $L$ is the maximum size of $C$ during evaluation of Segment. Here, we will modify the border update algorithm such that its total memory consumption is $O({\left|D\right|})$ . This will guarantee that the memory consumption of Segment is $O({\left|D\right|})$ .

Example 3

Let us continue Example 2. We have $\mathit{borders}\mathopen{}\left(1,10\right)=(1,4,7)$ , $\mathit{borders}\mathopen{}\left(3,10\right)=(3,4,7)$ , $\mathit{borders}\mathopen{}\left(8,10\right)=(8,9)$ , and $\mathit{borders}\mathopen{}\left(10,10\right)=(10)$ . Note that $\mathit{borders}\mathopen{}\left(1,10\right)$ and $\mathit{borders}\mathopen{}\left(3,10\right)$ have a common tail sequence, namely, $(4,7)$ . We can generalize this observation.
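The border lists in this example, and their common tail, can be reproduced with a short script. This is a sketch with hypothetical function names: it runs the stack update separately from each start point, which corresponds to the naïve approach rather than the shared structure developed in this section.

```python
from fractions import Fraction


def borders_from(seq, start):
    """borders(start, |seq|) in the original 1-based indices."""
    prefix = [0]
    for x in seq:
        prefix.append(prefix[-1] + x)

    def av(a, b):
        return Fraction(prefix[b] - prefix[a - 1], b - a + 1)

    stack = []
    for i in range(start, len(seq) + 1):
        stack.append(i)
        while len(stack) >= 2 and av(stack[-2], i) >= av(stack[-1], i):
            stack.pop()
    return stack


def common_tail(xs, ys):
    """Longest shared suffix of two border lists."""
    tail = []
    for x, y in zip(reversed(xs), reversed(ys)):
        if x != y:
            break
        tail.append(x)
    return tail[::-1]
```

Storing each list separately duplicates the shared tails, which is the redundancy that a shared representation can remove.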

The following key result states that when two border lists, say $\mathit{borders}\mathopen{}\left(i\right)$ and $\mathit{borders}\mathopen{}\left(j\right)$ share a common border, the subsequent borders are equivalent.

Theorem 5.4

Let $D$ be a sequence and let $1\leq i,j\leq{\left|D\right|}$ be two indices. Assume that $a\in\mathit{borders}\mathopen{}\left(i\right)\cap\mathit{borders}\mathopen{}\left(j\right)$ . Let $b\geq a$ . Then $b\in\mathit{borders}\mathopen{}\left(i\right)$ if and only if $b\in\mathit{borders}\mathopen{}\left(j\right)$ .

Proof

We will show that $\left\{b\in\mathit{borders}\mathopen{}\left(i,k\right)\mid b\geq a\right\}=\left\{b\in\mathit{borders}\mathopen{}\left(j,k\right)\mid b\geq a\right\}$ using induction over $k$ . The result follows by setting $k={\left|D\right|}$ .

Since $\left\{b\in\mathit{borders}\mathopen{}\left(i,a\right)\mid b\geq a\right\}=\left\{a\right\}=\left\{b\in\mathit{borders}\mathopen{}\left(j,a\right)\mid b\geq a\right\}$ , the first $k=a$ step follows.

Assume that the result holds for $k-1$ . Let $\left(b_{1},\ldots,b_{M}\right)=\mathit{borders}\mathopen{}\left(i,k-1\right)$ and $\left(c_{1},\ldots,c_{K}\right)=\mathit{borders}\mathopen{}\left(j,k-1\right)$ . Let also $b_{M+1}=c_{K+1}=k$ . Let $x$ and $y$ be such that $b_{x}=a=c_{y}$ .

Theorem 5.3 states that there is an integer $N$ such that $\left(b_{1},\ldots,b_{N}\right)=\mathit{borders}\mathopen{}\left(i,k\right)$ and an integer $L$ such that $\left(c_{1},\ldots,c_{L}\right)=\mathit{borders}\mathopen{}\left(j,k\right)$ . Since $a\in\mathit{borders}\mathopen{}\left(i\right)$ , we must have $N\geq x$ and similarly $L\geq y$ . Note that, by the induction assumption, we have $\left(b_{x},\ldots,b_{M+1}\right)=\left(c_{y},\ldots,c_{K+1}\right)$ . This implies that update will process exactly the same input, and deletes exactly the same number of entries, that is, implies that $M-y\leq N-x$ . This proves the induction step.

This theorem allows us to group border lists into a tree. Let $D$ be a sequence and let $C$ be a set of indices. We define a border tree $T=\mathit{btree}\mathopen{}\left(D,C\right)$ as follows: The non-root nodes of the tree consist of the borders from $\mathit{borders}\mathopen{}\left(c\right)$ for each $c\in C$, that is,

$$V(T)=\left\{b\mid b\in\mathit{borders}\left(c\right)\text{ for some }c\in C\right\}.$$

There is an edge from a node $m$ to a node $n$ if and only if there is $c\in C$ such that $\left(b_{1},\ldots,b_{M}\right)=\mathit{borders}\mathopen{}\left(c\right)$, $n=b_{j}$ and $m=b_{j+1}$. Note that this is well-defined since Theorem 5.4 states that if we have a node $n$, essentially a border, shared by several border lists, then each border list will have exactly the same next border, which is represented by the parent of $n$. Finally, the last border from each $\mathit{borders}\mathopen{}\left(c\right)$ is a child of a root, which we will denote by $r$. Note that, for each $c\in C$, a path from $c$ to $r$ in $T$ is equal to $\left(c=b_{1},\ldots,b_{M},r\right)$, where $\left(b_{1},\ldots,b_{M}\right)=\mathit{borders}\mathopen{}\left(c\right)$.

Given a node $a$ in $\mathit{btree}\mathopen{}\left(D,C\right)$ we write $\mathit{children}\mathopen{}\left(a\right)$ to be the child nodes of $a$ . We assume that $\mathit{btree}\mathopen{}\left(D,C\right)$ is constructed so that the children are ordered from smallest to largest. In order to be able to modify the tree quickly, we store the tree structure as follows. Each node can have 3 pointers at most: a pointer to a right sibling, a pointer to a left sibling or to the parent, if there is no left sibling, and a pointer to the first child, see Figure 3(a) as an example.
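The three-pointer layout described above can be sketched as a small data structure. All names below (`Node`, `add_child`, `parent_of`, `path_to_root`) are hypothetical; the sketch only illustrates how a parent can be recovered when each node stores a right sibling, a left sibling or parent, and a first child.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Node:
    index: Optional[int]                      # border index; None marks the root
    right: Optional["Node"] = None            # right sibling
    left_or_parent: Optional["Node"] = None   # left sibling, or parent if first child
    first_child: Optional["Node"] = None


def add_child(parent, child):
    """Append child as the last (largest) child of parent."""
    if parent.first_child is None:
        parent.first_child = child
        child.left_or_parent = parent
        return
    last = parent.first_child
    while last.right is not None:
        last = last.right
    last.right = child
    child.left_or_parent = last


def parent_of(node):
    """Walk left along the siblings, then step up to the parent (None for the root)."""
    while (node.left_or_parent is not None
           and node.left_or_parent.first_child is not node):
        node = node.left_or_parent
    return node.left_or_parent


def path_to_root(node):
    """Indices on the path from node up to the root, root shown as None."""
    path = []
    while node is not None:
        path.append(node.index)
        node = parent_of(node)
    return path
```

For the border lists $(1,4,7)$, $(3,4,7)$, $(8,9)$ and $(10)$, building the tree with `add_child` and calling `path_to_root` on the node for index 1 would return the borders of that list followed by the root marker.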

Our next step is to show how to extract the maximal average, and by doing so compute the right interval. In order to do so we need the following results.

Theorem 5.5

Let $D$ be a sequence and let $1\leq i\leq j\leq{\left|D\right|}$ be two indices. If $a\in\mathit{borders}\mathopen{}\left(i\right)$ and $a\geq j$ , then $a\in\mathit{borders}\mathopen{}\left(j\right)$ .

Proof

Assume that $a\notin\mathit{borders}\mathopen{}\left(j\right)$ . Then Theorem 5.2 implies that there are $j\leq x<a\leq y\leq{\left|D\right|}$ such that $\mathit{av}\mathopen{}\left(x,a-1\right)\geq\mathit{av}\mathopen{}\left(a,y\right)$ . Since $i\leq x$ , Theorem 5.2 immediately implies $a\notin\mathit{borders}\mathopen{}\left(i\right)$ .

Corollary 1

Let $D$ be a sequence and let $1\leq i\leq j\leq{\left|D\right|}$ be two indices. If $a=\max\mathit{borders}\mathopen{}\left(i\right)$ and $a\geq j$ , then $a=\max\mathit{borders}\mathopen{}\left(j\right)$ .

Proof

Theorem 5.5 implies $a\in\mathit{borders}\mathopen{}\left(j\right)$. Since both border lists share $a$, they also share any border larger than $a$. If $b\in\mathit{borders}\mathopen{}\left(j\right)$ and $b>a$, then Theorem 5.4 implies $b\in\mathit{borders}\mathopen{}\left(i\right)$, which is a contradiction. Consequently, $a=\max\mathit{borders}\mathopen{}\left(j\right)$.

Corollary 2

Let $D$ be a sequence and let $C$ be a set of indices. Let $\mathit{btree}\mathopen{}\left(D,C\right)$ be a border tree and let $r$ be its root. Select $c\in C$ and let $a\in\mathit{children}\mathopen{}\left(r\right)$ be the smallest index such that $c\leq a$ . Then $\mathit{av}\mathopen{}\left(a,{\left|D\right|}\right)\geq\mathit{av}\mathopen{}\left(b,{\left|D\right|}\right)$ for any $b\geq c$ .

Proof

Let $b=\max\mathit{borders}\mathopen{}\left(c\right)$ be the maximal border. Theorem 5.1 states that we need to prove $a=b$ . We see immediately that $a\leq b$ . Let $d$ be such that $a=\max\mathit{borders}\mathopen{}\left(d\right)$ . If $d\leq c$ , then, since $c\leq a$ , Corollary 1 implies $a=\max\mathit{borders}\mathopen{}\left(c\right)=b$ . On the other hand, if $d>c$ , then since $d\leq a\leq b$ , Corollary 1 implies $b=\max\mathit{borders}\mathopen{}\left(d\right)=a$ .

Corollary 2 gives a way to find the maximal average. Given $\mathit{btree}\mathopen{}\left(D,C\right)$ and $c\in C$ , we look for the smallest child of root, say $a$ , such that $a\geq c$ .
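As a small illustration of this lookup, the sketch below keeps the children of the root in a sorted Python list and uses binary search. The sorted list is an assumption standing in for the linked sibling list of the border tree, and `max_average_start` is a hypothetical name. The border lists are those of Example 4.

```python
from bisect import bisect_left

def max_average_start(root_children, c):
    """Return the smallest child a of the root with a >= c."""
    pos = bisect_left(root_children, c)
    if pos == len(root_children):
        raise ValueError("no child of the root is >= c")
    return root_children[pos]

# Border lists of Example 4; the last border of each list is a child of the root.
border_lists = {1: (1, 4, 7), 3: (3, 4, 7), 8: (8,), 10: (10, 11)}
root_children = sorted({borders[-1] for borders in border_lists.values()})

for c, borders in border_lists.items():
    print(c, max_average_start(root_children, c), borders[-1])
```

For every candidate the lookup returns the last entry of its border list, which is the maximal border that Corollary 2 refers to.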

Our next step is to update a border tree from $T=\mathit{btree}\mathopen{}\left(D[1,i-1],C\right)$ to $\mathit{btree}\mathopen{}\left(D[1,i],C\right)$ , an update step similar to Algorithm 2. We start by first adding a node $i$ between a root and its children. This corresponds to the first two lines in Algorithm 2. After this we modify the tree such that Theorem 5.3 holds for every path from $c\in C$ to the root. In Algorithm 2 we simply deleted indices that were no longer borders. However, since a single node $n$ can be shared by several border lists we cannot just delete it, since it might be the case that it is still used by another border list. Instead, we reattach children of $n$ violating Theorem 5.3 to the root; effectively removing $n$ from the border lists in which $n$ is no longer a border. We give the pseudo-code in Algorithm 3.
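To make the update step concrete, here is a sketch for a single candidate that keeps its own border list, so there is no node sharing and no tree. It is not a reproduction of Algorithm 2 or Algorithm 3. The removal test $\mathit{av}(n,i)\geq\mathit{av}(m,i)$ is the one quoted in the proof of Lemma 2, and the brute-force check uses the characterization of borders that the proof of Theorem 5.5 takes from Theorem 5.2. The function names and the test data are assumptions.

```python
from fractions import Fraction
from itertools import accumulate

def make_av(D):
    """Return av(x, y): the exact mean of D_x, ..., D_y (1-based, inclusive)."""
    prefix = [0] + list(accumulate(D))
    return lambda x, y: Fraction(prefix[y] - prefix[x - 1], y - x + 1)

def update_borders(borders, av, i):
    """Border list of one candidate for D[1, i], given its list for D[1, i-1]."""
    borders = borders + [i]   # tentatively add i as the last border
    # Remove the last border m while its predecessor n has av(n, i) >= av(m, i).
    while len(borders) >= 2 and av(borders[-2], i) >= av(borders[-1], i):
        borders.pop()
    return borders

def borders_brute_force(av, c, i):
    """Indices a in [c, i] with av(x, a-1) < av(a, y) for all c <= x < a <= y <= i."""
    return [a for a in range(c, i + 1)
            if all(av(x, a - 1) < av(a, y)
                   for x in range(c, a) for y in range(a, i + 1))]

D = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]   # arbitrary integer data, not from the paper
av = make_av(D)
borders = []
for i in range(1, len(D) + 1):          # candidate c = 1
    borders = update_borders(borders, av, i)
```

Exact fractions are used so that the comparison of averages is not affected by rounding. Borders are only ever removed from the end of the list, which is why the shared tree can express the same update by reattaching children to the root.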

Example 4

Let us continue Examples 2–3. Assume that we have a sequence given in Example 2 and that we have $C=\left\{1,3,8,10\right\}$ . Based on borders given in Example 3, the border tree is given in Figure 3(a). Assume that we see a new data point, $D_{11}=1$ . We have $\mathit{borders}\mathopen{}\left(1,11\right)=(1,4,7)$ , $\mathit{borders}\mathopen{}\left(3,11\right)=(3,4,7)$ , $\mathit{borders}\mathopen{}\left(8,11\right)=(8)$ , and $\mathit{borders}\mathopen{}\left(10,11\right)=(10,11)$ .

We begin updating the tree by first adding node $11$ between the root and its children, see Figure 3(b). We continue by checking the first child of $11$ : node $7$ , and reattach it to $r$ , see Figure 3(c). After this, we check the first child of $7$ , node $4$ and leave it unmodified. We continue by reattaching $9$ to the root, see Figure 3(d), and similarly node $8$ , see Figure 3(e). Since node $9$ is now a leaf and $9\notin C$ , we can delete it. Finally, we leave $10$ attached to $11$ . The final tree, which corresponds to the correct border tree, is given in Figure 3(f).

Theorem 5.6

Let $T=\mathit{btree}\mathopen{}\left(D[1,i-1],C\right)$ . Algorithm $\textsc{UpdateTree}(T,C,D,i)$ outputs $\mathit{btree}\mathopen{}\left(D[1,i],C\right)$ .

See Appendix for the proof.

In addition to UpdateTree, we need a routine for updating the tree when an index $c$ is deleted from $C$ . This is needed when Segment deletes a candidate for the optimal segmentation. To update the tree, we check whether $c$ is a leaf. If it is, we delete it and recursively apply the same test to the parent of $c$ .
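The deletion routine can be sketched as below. For brevity the tree is stored as two dictionaries instead of the three-pointer layout, and the rule that a parent is removed only if it is a leaf outside $C$ is inferred from Example 4, where node $9$ is deleted because it is a leaf and $9\notin C$ . The name `delete_candidate` and the root label `"r"` are assumptions.

```python
def delete_candidate(parent, children, C, c, root="r"):
    """Remove candidate c from C and prune nodes that no border list uses.

    parent:   dict mapping each non-root node to its parent
    children: dict mapping a node to the sorted list of its children
    Assumes c is currently a node of the tree.
    """
    C.discard(c)
    node = c
    # A node can go when it is a leaf, is not a candidate, and is not the root.
    while node != root and not children.get(node) and node not in C:
        par = parent.pop(node)
        children[par].remove(node)
        children.pop(node, None)
        node = par   # recursively test the parent

# Final tree of Example 4 (Figure 3(f)) with C = {1, 3, 8, 10}.
parent = {1: 4, 3: 4, 4: 7, 7: "r", 8: "r", 10: 11, 11: "r"}
children = {"r": [7, 8, 11], 7: [4], 4: [1, 3], 11: [10]}
C = {1, 3, 8, 10}
delete_candidate(parent, children, C, 10)   # removes 10 and then 11
```

If the deleted candidate still has children, it is a border of another candidate and stays in the tree; only its membership in $C$ changes.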

Finally, let us address the memory and time complexity of a border tree. First of all, we have at most ${\left|D\right|}$ nodes, hence we need $O({\left|D\right|})$ memory. Let $L$ be the maximum value of ${\left|C\right|}$ . Let $K_{i}$ be the number of nodes removed during $\textsc{UpdateTree}(T,C,D,i)$ . If we do not modify the tree during the while-loop, then we execute the while-loop only once, since there is only one child of $r$ , namely $i$ . Note that by the end of each $\textsc{UpdateTree}(T,C,D,i)$ , the root $r$ can have at most $L$ children. This means that we have done at most $L+K_{i}$ reattachments. Each reattachment increases the number of while-loop executions by 2: we need to check the child attached to the root and we need to check whether the parent has more children that need to be reattached. Hence, the while-loop is executed at most $2(L+K_{i})+1$ times during $\textsc{UpdateTree}(T,C,D,i)$ . Thus the total time complexity is $O({\left|D\right|}L+\sum_{i=1}^{{\left|D\right|}}K_{i})$ . Note that once a node is deleted it will not be introduced again. Hence, $\sum_{i=1}^{{\left|D\right|}}K_{i}\leq{\left|D\right|}$ . This gives us a total execution time of $O({\left|D\right|}L)$ .
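Written out with the symbols defined above, and assuming $L\geq 1$ , the per-call bound on the while-loop sums to the stated total as follows:

```latex
\sum_{i=1}^{|D|}\bigl(2(L+K_{i})+1\bigr)
  = (2L+1)\,|D| + 2\sum_{i=1}^{|D|}K_{i}
  \leq (2L+1)\,|D| + 2\,|D|
  = (2L+3)\,|D|
  \in O(|D|\,L).
```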

6 Experiments

In this section we empirically evaluate our approach on synthetic and real-world datasets. The implementation of the algorithm is given at http://adrem.ua.ac.be/segmentation

Synthetic data

The main contribution of the paper is the speedup of the dynamic program for finding the optimal segmentation when using one-dimensional log-linear models. We measure the efficiency by the total number of comparisons needed in Line 1 of Algorithm 1. We define a performance ratio by normalizing this number by the number of comparisons that we would have made without any pruning. This ensures that the ratio is between $0$ and $1$ , smaller values indicating faster performance. Note that if we do not use any pruning, the total number of comparisons is $O(K{\left|D\right|}^{2})$ .
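To illustrate where the $O(K{\left|D\right|}^{2})$ count comes from, the sketch below implements a plain dynamic program without pruning and counts how often a candidate is compared in the maximization step. It is a generic textbook formulation rather than the paper's Algorithm 1. The squared error around the segment mean is used as the cost, which is an assumption standing in for the Gaussian score, and all function names are hypothetical.

```python
from itertools import accumulate

def segment_baseline(D, K):
    """Optimal K-segmentation by plain dynamic programming, without pruning.

    Returns (total squared error, 0-based segment starts, comparison count).
    """
    n = len(D)
    s1 = [0] + list(accumulate(D))
    s2 = [0] + list(accumulate(x * x for x in D))

    def cost(a, b):
        """Squared error of D[a:b] (0-based, half-open) around its mean."""
        s = s1[b] - s1[a]
        return (s2[b] - s2[a]) - s * s / (b - a)

    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(K + 1)]
    back = [[0] * (n + 1) for _ in range(K + 1)]
    for i in range(1, n + 1):
        best[1][i] = cost(0, i)                # one segment: no choice to make
    comparisons = 0
    for k in range(2, K + 1):
        for i in range(k, n + 1):
            for c in range(k - 1, i):          # last segment is D[c:i]
                comparisons += 1
                value = best[k - 1][c] + cost(c, i)
                if value < best[k][i]:
                    best[k][i], back[k][i] = value, c
    starts, i = [], n
    for k in range(K, 0, -1):                  # follow the back pointers
        i = back[k][i]
        starts.append(i)
    return best[K][n], starts[::-1], comparisons

def performance_ratio(pruned_comparisons, baseline_comparisons):
    """Comparisons with pruning divided by comparisons without pruning."""
    return pruned_comparisons / baseline_comparisons

error, starts, baseline = segment_baseline([0, 0, 0, 5, 5, 5, -5, -5, -5], 3)
```

A pruned variant would skip some values of `c` in the innermost loop; dividing its counter by `baseline` gives the performance ratio defined above.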

We begin by generating sequences of random samples drawn from the Gaussian distribution with $0$ mean and $1$ variance. We generated 11 sequences of lengths $2^{k}$ for $k=10,\ldots,20$ and computed the performance ratio of our segmentation using $4$ segments of Gaussian distributions (as given in Example 1). From the results given in Figure 4(a) we see that we obtain speedups of 1 order of magnitude for the smallest data, and up to 3 orders of magnitude for longer data: the ratio for the largest sequence is $0.0007$ . Note that the ratios become smaller as the sequence becomes longer. The reason is that when considering longer segments, it becomes more likely that we can delete candidates, making the algorithm relatively faster. The absolute computation time grows with the length of the sequence: $11$ ms, $1.3$ s, and $20$ minutes for sequences of length $2^{10}$ , $2^{15}$ , and $2^{20}$ , respectively.

Our second experiment is to study the performance ratio as a function of the number of segments. We sampled 3 sequences from a Gaussian distribution, with $0$ mean and $1$ variance, of sizes $2^{14}$ , $2^{15}$ , $2^{16}$ . For each sequence we computed segmentations up to $50$ segments. From the results given in Figure 4(b) we see that the performance ratio becomes worse as we increase the number of segments. The reason is that when segments become shorter, the right intervals become more compact and have less chance of intersecting the left interval. Nevertheless, we get performance ratios of $0.06$ , $0.04$ , and $0.02$ for our sequences when using $50$ segments. The peak at $3$ segments suggests that discovering a segmentation with $3$ segments is particularly expensive. To see why this is happening, first note that the first segment always starts from the beginning. This implies that when looking for a segmentation with $2$ segments for a sequence $D[1,i]$ , the second segment will typically be either really short or really long, as its mean needs to differ from the mean of the first segment. If the second segment is short, it will have an abnormal right interval; consequently, the interval has a smaller chance of overlapping with the left interval of the next segment.

Our next step is to study how candidates for segments are distributed. A candidate $c$ is added to $C$ on Line 1 and deleted from $C$ on Line 1 in Segment. The candidate is added when the counter $i$ is equal to $c$ and let us assume that it is deleted when the counter is equal to $j$ . If $c$ is not deleted, after the for-loop in Segment, we simply set $j={\left|D\right|}+1$ . We define a lifetime of a candidate $c$ to be $j-c$ , that is, a candidate lifetime is how often it has been used in the maximization step on Line 1. The smaller the value, the less computational burden a candidate is producing. In the worst case, that is, without any pruning, the lifetime for a candidate $c$ is equal to ${\left|D\right|}+1-c$ .
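The lifetime definition translates directly into code. The sketch below assumes that every index $1,\ldots,{\left|D\right|}$ becomes a candidate when the counter reaches it, and that the deletion times have been recorded in a dictionary. Both the dictionary `deleted_at` and the function names are hypothetical, and the binning helper only mirrors the averaging used when plotting lifetimes.

```python
def lifetimes(n, deleted_at):
    """Lifetime j - c of each candidate c = 1, ..., n.

    deleted_at maps a candidate c to the counter value j at which it was
    deleted; candidates that are never deleted get j = n + 1.
    """
    return [deleted_at.get(c, n + 1) - c for c in range(1, n + 1)]

def bin_averages(values, width):
    """Average consecutive bins of `width` values (the last bin may be shorter)."""
    bins = [values[k:k + width] for k in range(0, len(values), width)]
    return [sum(b) / len(b) for b in bins]

no_pruning = lifetimes(5, {})                    # worst case: n + 1 - c
with_pruning = lifetimes(5, {1: 3, 2: 3, 4: 5})  # made-up deletion times
```

Without any deletions the lifetimes are ${\left|D\right|}+1-c$ , the worst case mentioned above.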

To study candidate lifetimes we generate a sequence of $4\,000$ samples, consisting of $4$ segments of Gaussian distribution with $0$ , $5$ , $-5$ , and $0$ means, respectively, and variance of $1$ (see Figure 5(a)). We computed segmentations up to $4$ segments and present the lifetimes in Figure 5. (For the sake of clarity, the figures show average lifetimes of bins containing $40$ points.) We see that there are four major spikes in lifetimes, at the beginning of the sequence and around each change point. Let us consider the spike at $2\,000$ for $K=4$ . A candidate on the left side of the spike has a longer lifetime because the left interval of the next segment is shifted and it is less likely that it will intersect with the right interval. On the other hand, a candidate on the right side of the spike has a longer lifetime because the segment is short and the right interval has a higher chance of being abnormal. The same rationale applies to the spike at the beginning of the sequence. The spikes grow with an increasing number of segments; nevertheless they are shallow, implying that we have a significant speedup. In fact, the performance ratios are $0.004$ , $0.01$ , $0.02$ for segmentations with $K=2,3,4$ segments, respectively.

Finally, let us demonstrate the limitations of our approach. We generate a sequence of $4\,000$ samples, where a sample $i$ is generated from a Gaussian distribution with a mean of $i/100$ and variance of $1$ , see Figure 6(a). The performance ratio of the segmentation with $4$ segments is $0.06$ ; the lifetimes are given in Figure 6(b). While we see a good performance for this data, when we increase the slope (or equivalently, lower the variance) the performance ratio becomes worse. The worst case scenario is a genuinely monotonically increasing (or decreasing) sequence, that is, $D_{i+1}>D_{i}$ . In such a case, the left intervals and the right intervals will never overlap and no candidate will be pruned. We should point out that applying segmentation to a monotonic sequence in the first place is questionable, as such a sequence does not fit the probabilistic segmentation model well, and it might be beneficial to detrend the data to obtain a better segmentation.
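One possible way to detrend, shown here only as an illustration, is to subtract a least-squares line before segmenting. The paper does not prescribe a detrending method, so the choice of a linear fit and the function name `detrend` are assumptions.

```python
def detrend(D):
    """Subtract the least-squares line fitted to D at positions 1, ..., n."""
    n = len(D)
    t_mean = (n + 1) / 2
    d_mean = sum(D) / n
    sxx = sum((t - t_mean) ** 2 for t in range(1, n + 1))
    sxy = sum((t - t_mean) * (d - d_mean) for t, d in enumerate(D, start=1))
    slope = sxy / sxx if sxx else 0.0
    return [d - (d_mean + slope * (t - t_mean)) for t, d in enumerate(D, start=1)]

# A noiseless version of the trend used above: the mean of sample i is i / 100.
trend = [i / 100 for i in range(1, 4001)]
residuals = detrend(trend)
```

A purely linear sequence is reduced to residuals that are zero up to rounding, so the strictly increasing worst case disappears. Real data with both a trend and genuine level shifts may need a more careful fit, since a single line also absorbs part of the shifts.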

Real-world data

We continue our experiments using real-world datasets. We considered $3$ different datasets, obtained from http://www.cs.ucr.edu/~eamonn/discords/ . The first dataset, Marotta, is the Space Shuttle Marotta Valve time series, consisting of 5 energize/de-energize cycles (TEK17). The second dataset, Power, consists of the power consumption of a Dutch research facility during the year 1997. The third dataset consists of two-dimensional time series extracted from videos of an actor performing various acts with and without a replica gun. Since this sequence is two-dimensional, we split the dimensions into Video1 and Video2. The sequence lengths are given in Table 1.

We study the performance by computing segmentations with $20$ segments for each dataset and comparing them against the traditional dynamic program, that is, without deleting any candidates. From the results, given in Table 1, we see that our approach has a significant advantage over the baseline approach: for example, with the Power dataset we find an optimal solution in 20 seconds while the baseline approach requires 10 minutes.

Finally, let us look at some of the discovered segmentations. In Figure 7 we present a segmentation of Marotta with $11$ segments. The segments align with high and low energy states. Note that the $3$rd high energy segment is more shallow than the other high energy segments. This cycle contains an anomaly, as pointed out by Keogh et al (2005), resulting in a shorter high energy segment. In Figure 8 we show a segmentation of the power consumption with $3$ segments. We can see that the mean of the middle segment is lower than the other means, indicating a summer season.

7 Related Work

Segmentation is an instance of a larger problem setting, called change point detection; see Basseville and Nikiforov (1993) for an introduction. We can divide the problem settings broadly into two categories: offline and online. Although these settings have conceptually the same goal, the setup details make them different from an algorithmic point of view. In online change point detection (see Kifer et al (2004), for example) the data arrives in a stream fashion, typically there is no budget for how many change points are allowed, and the decision needs to be made within some time frame, whereas in segmentation, that is, offline change point detection, new data points can change early segments. A typical goal for online change detection is to alert the system or a user of a change, whereas in segmentation the only goal is to summarize the sequence.

Popular approaches for segmentation are top-down approaches where we select greedily a new change-point (see Shatkay and Zdonik (1996); Bernaola-Galván et al (1996); Douglas and Peucker (1973); Lavrenko et al (2000), for example) and bottom-up approaches where at the beginning each point is a segment, and points are combined in a greedy fashion (see Palpanas et al (2004), for example). A randomized heuristic was suggested by Himberg et al (2001), where we start from a random segmentation and optimize the segment boundaries. These approaches, although fast, are heuristics and have no theoretical guarantees of the approximation quality. A divide-and-segment approach, an approximation algorithm with theoretical guarantees on the approximation quality was given by Terzi and Tsaparas (2006).

Modifications of the original segmentation problem have also been studied. Gionis and Mannila (2003) suggested discovering recurrent sources, a setup where the number of distinct segment means is limited to $H$ such that $H<K$ , where $K$ is the number of allowed segments. Haiminen and Gionis (2004) study unimodal sequences, where the means of the centroids (of a one-dimensional sequence) are required to follow a unimodal curve, that is, the means should only rise to some point and then only decline afterwards. For a survey of the segmentation algorithms, see Chapter 8 in (Džeroski et al, 2011).

8 Discussion and Conclusions

In this paper we introduced a pruning technique to speed up the dynamic program used for solving the segmentation problem. We demonstrated on both synthetic and real-world data that we gain a significant speedup by using our pruning technique.

We should point out that our pruning is online, that is, the decision to delete a candidate is based only on current and past data points. We believe that we can speed up the algorithm further by applying additional pruning techniques based on future data points, such as the one by Gedikli et al (2010). In addition, we conjecture that these optimizations may prove to be useful in other setups, such as discovering HMMs or CRFs, where dynamic programs are used in order to optimize the model. We leave these studies as future work.

Segmentation requires a parameter, namely the number of segments. One approach to remove this parameter is to use model selection techniques, such as BIC (Schwarz, 1978) or MDL (Grünwald, 2007). We conjecture that these techniques not only remove the parameter but can also be used for further speedup.

Our algorithm is limited to the one-dimensional case. However, the key result, Theorem 3.1, actually handles the multi-dimensional case. The reason why we limit ourselves to the one-dimensional case is that we were able to verify the sufficient conditions in Theorem 3.1 with relative ease. We leave applying Theorem 3.1 more generally as future work. While we are skeptical whether it is possible to verify the conditions in Theorem 3.1 exactly, we believe that it is possible to find more conservative conditions that can be easily checked and that will imply the conditions in Theorem 3.1.

Acknowledgements

Nikolaj Tatti was partly supported by a Post-Doctoral Fellowship of the Research Foundation – Flanders (FWO).

Appendix A Proofs

A.1 Proof of Theorem 3.1

Theorem 3.1 will follow from the following theorem.

Theorem A.1

Let $D=\left(D_{1},\ldots,D_{e}\right)$ . Let $1\leq m<e$ . Assume that $\mathit{diff}\mathopen{}\left(D[1,m],D[m+1,e]\right)$ is a cover. Then there exists $n>m$ such that $\mathit{sc}\mathopen{}\left([1,n],[n+1,e]\right)>\mathit{sc}\mathopen{}\left([1,m],[m+1,e]\right)$ or there exists $l<m$ such that $\mathit{sc}\mathopen{}\left([1,l],[l+1,e]\right)\geq\mathit{sc}\mathopen{}\left([1,m],[m+1,e]\right)$ .

In order to prove the theorem we will introduce some helpful notation. First, given a parameter vector $s$ and $r$ , we define

[TABLE]

Note that $h(k\mid s,r)\leq\mathit{sc}\mathopen{}\left([1,k],[k+1,e]\right)$ . We also define

[TABLE]

This function is essentially the difference between two scores.

Lemma 1

Let $k>l$ . We have $h(k\mid s,r)-h(l\mid s,r)=g(k-l,\mathit{av}\mathopen{}\left(l+1,k\right)\mid s,r)$ .

Proof

Note that

[TABLE]

The last two terms do not depend on $k$ . This allows us to write

[TABLE]

This completes the proof.

Proof (Proof of Theorem A.1)

Write $y=\mathit{sc}\mathopen{}\left([1,m],[m+1,e]\right)$ and define

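$x=\max_{l<m}\mathit{sc}\mathopen{}\left([1,l],[l+1,e]\right)\quad\text{and}\quad z=\max_{n>m}\mathit{sc}\mathopen{}\left([1,n],[n+1,e]\right).$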

We need to show that either $x\geq y$ or $z>y$ . Assume that $z\leq y$ . Fix $\epsilon>0$ . By definition, there exist $s$ and $r$ such that

[TABLE]

From now on we will write $h(k)$ to mean $h(k\mid s,r)$ and $g(k,\delta)$ to mean $g(k,\delta\mid s,r)$ . We must have $h(m)+\epsilon\geq y\geq z$ or, equivalently, $\epsilon\geq z-h(m)$ .

Since $\mathit{diff}\mathopen{}\left(D[1,m],D[m+1,e]\right)$ is a cover, there exist integers $l$ and $n$ , $0\leq l<m<n\leq e$ , such that $(\alpha-\beta)^{T}(s-r)\geq 0$ , where $\alpha=\mathit{av}\mathopen{}\left(m+1,n\right)$ and $\beta=\mathit{av}\mathopen{}\left(l+1,m\right)$ .

Define $c=(n-m)/(m-l)$ . We now have

[TABLE]

which implies $y-x\leq\epsilon(1+c^{-1})\leq\epsilon(1+e)$ . Since this holds for any $\epsilon>0$ , we conclude that $y\leq x$ . This proves the theorem.

Proof (Proof of Theorem 3.1)

Let $P$ be a segmentation and let $I$ and $J$ be two consecutive segments such that $\mathit{diff}\mathopen{}\left(D[I],D[J]\right)$ is a cover. We can now apply Theorem A.1 to find alternative segments $I^{\prime}$ and $J^{\prime}$ such that if we define $P^{\prime}$ by replacing $I$ and $J$ from $P$ with $I^{\prime}$ and $J^{\prime}$ then either $\mathit{sc}\mathopen{}\left(P^{\prime}\mid D\right)>\mathit{sc}\mathopen{}\left(P\mid D\right)$ or $\mathit{sc}\mathopen{}\left(P^{\prime}\mid D\right)\geq\mathit{sc}\mathopen{}\left(P\mid D\right)$ and $I^{\prime}$ ends before $I$ . We repeat this until no consecutive segments constitute a cover. This repetition ends because no segmentation will occur twice during these steps and there is a finite number of segmentations. The reason why no segmentation occurs twice is that either the score strictly increases or the score stays the same and we move a breakpoint to the left.

A.2 Proof of Theorem 5.6

Let $U$ be the resulting tree from $\textsc{UpdateTree}(T,C,D,i)$ . To prove the theorem we need to show that the paths of $U$ from leaves to the root consist of borders, there are no nodes in $U$ outside the borders, and that children are ordered. We will prove these results in a series of lemmata.

Lemma 2

Let $T^{\prime}$ be a tree after we have added a node $i$ in UpdateTree. Let $n\neq i$ be a node in $T^{\prime}$ and let $m$ be its parent. Let $c\in C$ be such that $n\in\mathit{borders}\mathopen{}\left(c,i-1\right)$ . If $m\notin\mathit{borders}\mathopen{}\left(c,i\right)$ , then $n$ will cease to be a child of $m$ during some stage of UpdateTree.

Proof

Let $r$ be a root node of $T^{\prime}$ . Consider a pre-order of nodes of $T^{\prime}$ , that is, parents and earlier siblings come first. We will prove the lemma using induction on the pre-order.

To prove the first step, let $n$ be the first child of $i$ . If $i\notin\mathit{borders}\mathopen{}\left(c,i\right)$ , then Theorem 5.3 implies that $\mathit{av}\mathopen{}\left(n,i\right)\geq\mathit{av}\mathopen{}\left(i,i\right)$ which is exactly the test on Line 3. Hence, $n$ will be disconnected from $i$ .

Let us now prove the induction step. Let $p$ be the parent of $m$ in $T^{\prime}$ . Assume that $p\neq r$ . Note that $p$ is the border next to $m$ in $\mathit{borders}\mathopen{}\left(c,i-1\right)$ . Theorem 5.3 implies that $p\notin\mathit{borders}\mathopen{}\left(c,i\right)$ , hence the induction assumption implies that $m$ and $p$ are disconnected and $m$ becomes a child of $r$ at some point.

Assume now that $n$ is not the first child of $m$ and let $q$ be the sibling left to $n$ , and let $p$ be such that $q\in\mathit{borders}\mathopen{}\left(p,i-1\right)$ . Theorem 5.1 implies that $\mathit{av}\mathopen{}\left(q,m-1\right)\geq\mathit{av}\mathopen{}\left(j,m-1\right)$ for any $q\leq j<m$ . Since $n>q$ , we must have $\mathit{av}\mathopen{}\left(q,m-1\right)\geq\mathit{av}\mathopen{}\left(n,m-1\right)\geq\mathit{av}\mathopen{}\left(m,i\right)$ , which implies that $m\notin\mathit{borders}\mathopen{}\left(p,i\right)$ . Again, the induction assumption implies that $q$ and $m$ will be disconnected. Consequently, $n$ will be the first child of $m$ at some point.

Note that while moving $m$ or left siblings of $n$ to be children of $r$ we move the current node $a$ in UpdateTree to the left. Hence, there will be a point where $a=m$ and $n$ is the first child of $m$ . Theorem 5.3 implies that $\mathit{av}\mathopen{}\left(n,i\right)\geq\mathit{av}\mathopen{}\left(m,i\right)$ which is exactly the test on Line 3. Hence, $n$ will be disconnected from $m$ . This proves the lemma.

Lemma 3

For every $c\in C$ , a path in $U$ from $c$ to a child of the root node $r$ equals $\mathit{borders}\mathopen{}\left(c,i\right)$ .

Proof

Fix $c\in C$ and let $\left(b_{1},\ldots,b_{M}\right)=\mathit{borders}\mathopen{}\left(c,i-1\right)$ and define $b_{M+1}=i$ . Theorem 5.3 implies that there is $1\leq N\leq M+1$ such that $\left(b_{1},\ldots,b_{N}\right)=\mathit{borders}\mathopen{}\left(c,i\right)$ .

After adding $i$ to $T$ , UpdateTree will not add new nodes into the path from $c$ to $r$ . Lemma 2 now implies that the path from $c$ to $r$ will be $\left(b_{1},\ldots,b_{K}\right)$ , where $K\leq N$ . If $N=1$ , then immediately $K=1$ . To conclude that $K=N$ in general, assume that $N>1$ and assume that at some point in UpdateTree we have $a=b_{N}$ and $b=b_{N-1}$ . Then, according to Theorem 5.3, the test on Line 3 will fail and $b_{N-1}$ remains as a child of $b_{N}$ .

Lemma 4

Let $n$ be a node in $U$ , then there is $c\in C$ such that $n\in\mathit{borders}\mathopen{}\left(c,i\right)$ .

Proof

Let $m$ be a node that occurs in $T$ but not in $\mathit{btree}\mathopen{}\left(D[1,i],C\right)$ . The lemma will follow if we can show that $m$ is not in $U$ . Let $n$ be the last child of $m$ . Lemma 2 implies that at some point $n$ will be disconnected from $m$ and we will visit $m$ when it is a leaf. Since $m\notin C$ , we will delete $m$ .

Lemma 5

Consider a post-order of nodes of $T=\mathit{btree}\mathopen{}\left(D[1,i-1],C\right)$ , that is, parents and later siblings come first. Node values decrease with respect to this order.

Proof

We will prove that the following holds: Let $n$ be a node and let $m$ be its left sibling. Let $q$ be the smallest child of $n$ . Then $m<q$ . Note that this automatically proves the lemma.

Note that $q\in C$ . To prove that $m<q$ , let $c\in C$ such that $m\in\mathit{borders}\mathopen{}\left(c,i-1\right)$ . If $c\geq q$ , then since $n>m\geq c$ , Theorem 5.5 implies that $n\in\mathit{borders}\mathopen{}\left(c,i-1\right)$ which is a contradiction. Consequently, $c<q$ . If $q\leq m$ , then again Theorem 5.5 implies that $m\in\mathit{borders}\mathopen{}\left(q,i-1\right)$ which is a contradiction. This proves that $m<q$ .

Lemma 6

Child nodes of each node in $U$ are ordered from smallest to largest.

Proof

UpdateTree modifies the tree by moving the first child of a node $a$ to be the left sibling of $a$ . This does not change the post-order of the nodes. This implies that, since node values decrease with respect to the post-order in $T$ , they will also decrease in $U$ . This proves the lemma.

Bibliography

Basseville M, Nikiforov IV (1993) Detection of Abrupt Changes: Theory and Application. Prentice-Hall
Bellman R (1961) On the approximation of curves by line segments using dynamic programming. Communications of the ACM 4(6)
Bernaola-Galván P, Román-Roldán R, Oliver JL (1996) Compositional segmentation and long-range fractal correlations in DNA sequences. Physical Review E Statistical Physics Plasmas Fluids And Related Interdisciplinary Topics 53(5):5181–5189
Calders T, Dexters N, Goethals B (2007) Mining frequent itemsets in a stream. In: ICDM, pp 83–92
Douglas D, Peucker T (1973) Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Canadian Cartographer 10(2):112–122
Džeroski S, Goethals B, Panov P (eds) (2011) Inductive Databases and Constraint-based Data Mining. Springer
Gedikli A, Aksoy H, Unal NE, Kehagias A (2010) Modified dynamic programming approach for offline segmentation of long hydrometeorological time series. Stochastic Environmental Research and Risk Assessment 24(5)
Gionis A, Mannila H (2003) Finding recurrent sources in sequences. In: Proceedings of the seventh annual international conference on Research in computational molecular biology, RECOMB ’03, pp 123–130