Fast Sequence Segmentation using Log-Linear Models
Nikolaj Tatti

TL;DR
This paper introduces a pruning technique for dynamic programming in sequence segmentation with log-linear models, significantly reducing computation time while maintaining optimality.
Contribution
It presents a theoretical pruning method that accelerates the dynamic programming approach for sequence segmentation using log-linear models.
Findings
Significant reduction in computational time demonstrated empirically.
Pruning method maintains optimal segmentation results.
Applicable to a broad class of distributions within log-linear models.
| Data | length | performance | time (s) | baseline time (s) |
|---|---|---|---|---|
| Marotta | | | | |
| Power | | | | |
| Video1 | | | | |
| Video2 | | | | |
Nikolaj Tatti
Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium
Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium
email: [email protected]
Abstract
Sequence segmentation is a well-studied problem, where given a sequence of elements, an integer $K$, and some measure of homogeneity, the task is to split the sequence into $K$ contiguous segments that are maximally homogeneous. A classic approach to find the optimal solution is by using a dynamic program. Unfortunately, the execution time of this program is quadratic with respect to the length of the input sequence. This makes the algorithm slow for a sequence of non-trivial length. In this paper we study segmentations whose measure of goodness is based on log-linear models, a rich family that contains many of the standard distributions. We present a theoretical result allowing us to prune many suboptimal segmentations. Using this result, we modify the standard dynamic program for one-dimensional log-linear models, and by doing so reduce the computational time. We demonstrate empirically that this approach can significantly reduce the computational burden of finding the optimal segmentation.
Keywords:
segmentation, pruning, change-point detection, dynamic program
Journal: Data Mining and Knowledge Discovery
1 Introduction
Sequence segmentation is a well-studied problem, where given a sequence of elements, an integer $K$, and some measure of homogeneity, the task is to split the sequence into $K$ contiguous segments that are maximally homogeneous.
An exact solution for segmentation with $K$ segments can be obtained by a classic dynamic program in $O(L^2 K)$ time, where $L$ is the length of the sequence (Bellman, 1961). Due to the quadratic complexity, exact segmentation is impractical for sequences of non-trivial length. In this paper we introduce a speedup to the dynamic program used for computing the exact solution. Our key result, given in Theorem 3.1, states that when certain conditions are met, we can discard the candidate for a segment border, thus speeding up the inner loop of the dynamic program.
We consider segmentation using the log-likelihood of a log-linear model to score the goodness of individual segments. Many standard distributions can be described as log-linear models, including Bernoulli, Gamma, Poisson, and Gaussian distributions. Moreover, when using a Gaussian distribution, optimizing the log-likelihood is equivalent to minimizing the $L_2$ error (see Example 1).
The conditions given in Theorem 3.1 are hard to verify; however, we demonstrate that this can be done with relative ease for one-dimensional models. The key idea is as follows: Consider segmenting the sequence given in Figure 1(a) into 2 segments using the $L_2$ error. Assume a segmentation , . Figure 1(b) tells us that this segmentation is not optimal. In fact, the optimal segmentation with 2 segments for this data is .
Sequence values around have a particular characteristic which we can exploit to speed up the optimization. In order to demonstrate this, let us define
[TABLE]
that is, contains the averages from the right side of the first segment and contains the averages from the left side of the second segment. Let us define , , , . We see that , , , and . That is, the intervals and intersect. We will show that in such a case not only can we safely ignore the segmentation , but also that even if we augment the sequence with additional data points, index will never be part of the optimal segmentation with 2 segments. This pruning allows us to speed up the dynamic program.
In general, if the extreme values of averages and computed from neighboring segments intersect, we know that the segmentation is suboptimal. On the other hand, the optimal segmentation with segments for data in Figure 1(a), , uses index . We do not violate our condition since the extreme values of averages computed only from the second segment and extreme values of averages computed only from the third segment no longer intersect.
Using this idea, we will build an efficient pruning technique for segmenting data using one-dimensional log-linear models. We empirically demonstrate that this approach can reduce the computational load by several orders of magnitude compared to the standard approach.
The remainder of the paper is organized as follows. In Section 2 we give preliminary notation and define the segmentation problem. In Section 3 we give the key result which allows us to prune segments. Sections 4–5 are devoted to a segmentation algorithm. We present our experiments in Section 6 and related work in Section 7. Finally, we conclude the paper with discussion in Section 8.
2 Segmentation for log-linear models
In this section we give preliminaries and define the segmentation problem.
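A sequence $D = (D_1, \ldots, D_L)$ is a sequence of real vectors of length $L$, $D_i \in \mathbb{R}^N$. A segment $I = (a, b)$ consists of two integers such that $1 \le a \le b \le L$. We will write $i \in I$ whenever $a \le i \le b$ for an integer $i$. We define $D_I = (D_a, \ldots, D_b)$ to be the subsequence corresponding to that segment. A segmentation $P = (I_1, \ldots, I_K)$ is a list of disjoint segments that cover $D$, that is, the first segment starts at $1$, the last segment ends at $L$, and $I_{k+1} = (a_{k+1}, b_{k+1})$ begins right after $I_k = (a_k, b_k)$, that is, $a_{k+1} = b_k + 1$.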
Our goal is to find a segmentation that maximizes the likelihood of a log-linear model of each individual segment. By log-linear models, also known as the exponential family, we mean models whose probability density function can be written as
$$p(x \mid r) = q(x) \exp\left(r^T S(x) - Z(r)\right),$$
where $S$ is a function mapping $x$ to a vector in $\mathbb{R}^N$, $r$ is the parameter vector of the model, and $Z(r)$ is the normalization constant. Many standard distributions are log-linear, for example, Poisson, Gamma, Bernoulli, Binomial, and Gaussian (with either fixed or unknown variance). We will argue later in this section that using a Gaussian distribution with a fixed variance is equivalent to minimizing the $L_2$ error.
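Assume that we are given a segmentation $P = (I_1, \ldots, I_K)$ and for each segment $I_k$, we have a parameter vector $r_k$. Let us now consider the log-likelihood
$$\sum_{k=1}^{K} \sum_{i \in I_k} \log p(D_i \mid r_k) = \sum_{i=1}^{L} \log q(D_i) + \sum_{k=1}^{K} \sum_{i \in I_k} \left( r_k^T S(D_i) - Z(r_k) \right)$$
for this segmentation of $D$. Note that the first term in the right-hand side does not depend on the parameters nor on the segmentation. Consequently, we can ignore it. In addition, note that we can safely assume that $S(x) = x$. If this is not the case, we can always transform sequence $D$ into $S(D_1), \ldots, S(D_L)$. From now on we will assume that $S(x) = x$.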
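For notational simplicity, let us define
$$\mathit{sm}(a, b; D) = \sum_{i=a}^{b} D_i \quad \text{and} \quad \mathit{av}(a, b; D) = \frac{\mathit{sm}(a, b; D)}{b - a + 1}$$
to be the sum and the average of data points in the segment $(a, b)$. If $D$ is clear from the context, we will often write $\mathit{sm}(a, b)$ and $\mathit{av}(a, b)$ to mean $\mathit{sm}(a, b; D)$ and $\mathit{av}(a, b; D)$. As shorthand, for a segment $I = (a, b)$ we write $\mathit{sm}(I)$ and $\mathit{av}(I)$ to mean $\mathit{sm}(a, b)$ and $\mathit{av}(a, b)$.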
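Segment sums and averages are needed many times during segmentation, so it helps to obtain them in constant time from precomputed prefix sums. The following is a minimal sketch for one-dimensional data; the helper names `prefix_sums`, `sm`, and `av` and the 1-indexed, inclusive segment convention are assumptions made for illustration.

```python
from itertools import accumulate


def prefix_sums(data):
    """Return c with c[i] = data[0] + ... + data[i - 1], so that c[0] = 0."""
    return [0.0] + list(accumulate(data))


def sm(c, a, b):
    """Sum of the 1-indexed, inclusive segment (a, b), in constant time."""
    return c[b] - c[a - 1]


def av(c, a, b):
    """Average of the 1-indexed, inclusive segment (a, b), in constant time."""
    return sm(c, a, b) / (b - a + 1)
```

After a single linear pass to build `c`, every later query costs two array lookups, regardless of the segment length.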
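We define the score of a single segment $I = (a, b)$ given a parameter vector $r$ as
$$\mathit{sc}(I; r) = r^T \mathit{sm}(I) - (b - a + 1) Z(r).$$
We define the score for a segmentation $P = (I_1, \ldots, I_K)$ as
$$\mathit{sc}(P) = \sum_{k=1}^{K} \sup_{r} \mathit{sc}(I_k; r),$$
that is, $\mathit{sc}(P)$ is a sum of the optimal scores of individual segments. We see that optimizing $\mathit{sc}(P)$ is equivalent to maximizing the likelihood of the log-linear model.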
We are now ready to state our optimization problem.
Problem 1
Given a sequence $D$, a log-linear model, and an integer $K$, find a segmentation $P$ with $K$ segments maximizing $\mathit{sc}(P)$.
Example 1
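Let us now consider a Gaussian distribution with identity covariance matrix. This distribution is log-linear since we can rewrite
$$p(x \mid \mu) = (2\pi)^{-N/2} \exp\left(-\tfrac{1}{2}\|x - \mu\|^2\right) = (2\pi)^{-N/2} \exp\left(-\tfrac{1}{2}\|x\|^2\right) \exp\left(\mu^T x - \tfrac{1}{2}\|\mu\|^2\right).$$
The log-likelihood of a Gaussian distribution for a segmentation $P = (I_1, \ldots, I_K)$ is
$$\sum_{k=1}^{K} \sum_{i \in I_k} \log p(D_i \mid \mu_k) = -\frac{LN}{2} \log(2\pi) - \frac{1}{2} \sum_{k=1}^{K} \sum_{i \in I_k} \|D_i - \mu_k\|^2.$$
The optimal value for $\mu_k$ is the average of data points in $I_k$. The first term of the right-hand side is constant while the second term is, up to the factor $-1/2$, the $L_2$ error. Consequently, selecting a segmentation that maximizes log-likelihood is equivalent to finding a segmentation that minimizes the $L_2$ error, a typical choice for an error function.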
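The equivalence in Example 1 can be checked numerically on a single one-dimensional segment. The sketch below assumes the statistic is the data point itself and that the parameter-dependent part of the Gaussian log-likelihood is $\mu x - \mu^2/2$ per data point, which follows from expanding $-(x - \mu)^2/2$; the function names are hypothetical.

```python
def gaussian_segment_score(xs):
    """Maximum over mu of sum(mu * x - mu**2 / 2), attained at mu = mean(xs)."""
    n = len(xs)
    mu = sum(xs) / n
    return mu * sum(xs) - n * mu * mu / 2


def squared_error(xs):
    """Sum of squared deviations from the segment mean (the L2 error)."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs)
```

For any segment, `gaussian_segment_score(xs)` equals `0.5 * (sum(x * x for x in xs) - squared_error(xs))`. Since the sum of squares over the whole sequence does not depend on where the borders are, maximizing the total score and minimizing the total squared error select the same segmentation.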
The optimal segmentation can be found with a dynamic program (Bellman, 1961). In order to see this, let $P = (I_1, \ldots, I_K)$ be the optimal segmentation with $K$ segments. Let $b$ be the last index of $I_{K-1}$. Then $(I_1, \ldots, I_{K-1})$ is the optimal segmentation with $K - 1$ segments for $D_1, \ldots, D_b$. We can find the optimal segmentation with $K$ segments by first computing the optimal segmentation with $K - 1$ segments for each prefix $D_1, \ldots, D_i$, $i < L$, and then testing which segment of form $(i + 1, L)$ we need to add to produce the optimal segmentation with $K$ segments. This leads to an algorithm of time complexity $O(L^2 K)$. The goal of this paper is to speed up this dynamic program.
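The classic dynamic program can be sketched as follows for one-dimensional data scored with the squared error, which by Example 1 is equivalent to the Gaussian log-likelihood. This is an illustrative baseline without any pruning, not the implementation used in the experiments; the function name `segment_dp` and the 1-indexed, inclusive segment convention are assumptions.

```python
from itertools import accumulate


def segment_dp(data, K):
    """Exact K-segmentation minimizing the total squared error, O(L^2 K) time."""
    L = len(data)
    if not 1 <= K <= L:
        raise ValueError("need 1 <= K <= len(data)")
    c1 = [0.0] + list(accumulate(data))                 # prefix sums
    c2 = [0.0] + list(accumulate(x * x for x in data))  # prefix sums of squares

    def cost(a, b):  # squared error of the 1-indexed, inclusive segment (a, b)
        s = c1[b] - c1[a - 1]
        return (c2[b] - c2[a - 1]) - s * s / (b - a + 1)

    inf = float("inf")
    # opt[k][i]: best cost of splitting the prefix 1..i into k segments
    opt = [[inf] * (L + 1) for _ in range(K + 1)]
    start = [[0] * (L + 1) for _ in range(K + 1)]
    opt[0][0] = 0.0
    for k in range(1, K + 1):
        for i in range(k, L + 1):
            for j in range(k, i + 1):  # j is the start of the last segment
                value = opt[k - 1][j - 1] + cost(j, i)
                if value < opt[k][i]:
                    opt[k][i], start[k][i] = value, j
    # Follow the stored starting points backwards to recover the segments.
    segments, i = [], L
    for k in range(K, 0, -1):
        j = start[k][i]
        segments.append((j, i))
        i = j - 1
    return opt[K][L], segments[::-1]
```

For example, `segment_dp([0, 0, 0, 10, 10, 10, 5, 5], 3)` returns an error of zero with the segments `(1, 3)`, `(4, 6)`, and `(7, 8)`. The innermost loop over `j` is the part that pruning aims to shorten.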
3 Necessary Condition for Optimal Segmentation
In this section we give the key result of this paper. This result allows us to prune candidates that will not be included in the optimal segmentation and hence speed up the dynamic program.
In order to do so, let be a set of vectors in $\mathbb{R}^N$. We say that is a cover if for any , there is a such that . See Figure 2 for an example. Given two sequences and we define to be the difference set for and as
[TABLE]
We are now ready to state the key result of the paper. For readability, we postpone the proof to Appendix A.1.
Theorem 3.1
Let be a segmentation. There is a segmentation such that and is not a cover for any two consecutive segments and in .
4 Segmentation for one-dimensional models
In the previous section we saw a necessary condition for optimal segmentation. This involves checking whether the difference set of consecutive segments is a cover. In this section and the next section we show that we can efficiently check this condition if our log-linear model is one-dimensional, that is, if data points are real numbers.
In order to show this, let be a sequence. We define a left interval to be an interval
[TABLE]
of extreme values of . Similarly, we define a right interval to be
[TABLE]
We can now express the condition using these intervals.
Theorem 4.1
Assume two sequences, and and let be a one-dimensional statistic. Then is a cover if and only if the intervals and intersect.
Proof
Let and . is a cover if and only if there are , , , and such that and . This is equivalent to and , which is equivalent to and intersecting.
We can now use this result to design an efficient algorithm. Assume that we already have computed for each the optimal segmentation with segments, say covering . We now want to find an optimal segmentation with segments covering . In order to do so we need to augment each for with a segment , and pick the optimal segmentation. Assume that the intervals and intersect, where is the starting point of the last segment in . Then Theorems 3.1 and 4.1 imply that we can safely ignore the segmentation augmented with . Moreover, if the intervals intersect when segmenting , they will also intersect when segmenting for . Hence, as soon as and intersect, we can ignore as a candidate for the starting point of the last segment. We present the pseudo-code for this approach as Algorithm 1.
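The pruning rule described above can be illustrated with a small sketch for one-dimensional data and the squared error. It is not Algorithm 1 itself: it assumes that the right interval of a segment is the range of its suffix averages and the left interval is the range of its prefix averages, it recomputes the right interval naively instead of using the faster method of the next section, and it evaluates every live candidate before testing whether to drop it. The function name `segment` and the `prune` flag are hypothetical.

```python
from itertools import accumulate


def segment(data, K, prune=True):
    """Return (minimal squared error, number of candidate evaluations)."""
    L = len(data)
    if not 1 <= K <= L:
        raise ValueError("need 1 <= K <= len(data)")
    c1 = [0.0] + list(accumulate(data))
    c2 = [0.0] + list(accumulate(x * x for x in data))

    def av(a, b):    # average of the 1-indexed, inclusive segment (a, b)
        return (c1[b] - c1[a - 1]) / (b - a + 1)

    def cost(a, b):  # squared error of the segment (a, b)
        s = c1[b] - c1[a - 1]
        return (c2[b] - c2[a - 1]) - s * s / (b - a + 1)

    inf = float("inf")
    opt = [[inf] * (L + 1) for _ in range(K + 1)]
    start = [[0] * (L + 1) for _ in range(K + 1)]
    for i in range(1, L + 1):
        opt[1][i], start[1][i] = cost(1, i), 1
    evaluations = 0
    for k in range(2, K + 1):
        live = []  # candidates: [j, right_lo, right_hi, left_lo, left_hi]
        for i in range(k, L + 1):
            # New candidate j = i. Its right interval comes from the last
            # segment (s, i - 1) of the stored (k - 1)-segmentation.
            s = start[k - 1][i - 1]
            suffix = [av(t, i - 1) for t in range(s, i)]
            live.append([i, min(suffix), max(suffix), inf, -inf])
            survivors = []
            for cand in live:
                j, r_lo, r_hi = cand[0], cand[1], cand[2]
                evaluations += 1
                value = opt[k - 1][j - 1] + cost(j, i)
                if value < opt[k][i]:
                    opt[k][i], start[k][i] = value, j
                # Grow the left interval with the prefix average av(j, i).
                cand[3] = min(cand[3], av(j, i))
                cand[4] = max(cand[4], av(j, i))
                # Keep the candidate unless the two intervals intersect.
                if not (prune and cand[3] <= r_hi and r_lo <= cand[4]):
                    survivors.append(cand)
            live = survivors
    return opt[K][L], evaluations
```

Calling `segment(data, K, prune=False)` gives the unpruned baseline, so the two evaluation counts can be compared directly. On random real-valued data without exact ties, the returned error should be the same with and without pruning, while the pruned run evaluates no more candidates than the baseline.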
Let us next analyze the time and memory complexity of Algorithm 1. Let be the maximal size of . It is easy to see that we can compute and in constant time by keeping and updating the sum for every . The only non-trivial part of Algorithm 1 is computing the right interval . In the next section, we will show how to compute the right interval in amortized time, hence the execution time of the algorithm is in . Moreover, we will show that the total memory requirement for computing the right interval is in which will make the memory usage of the algorithm .
5 Computing the Right Interval
In this section we show how to compute the right interval, as needed in Algorithm 1. We will focus on how to compute the maximal value of the right interval; we can compute the minimal value using exactly the same framework.
5.1 Computing the Borders
Our first goal, given a sequence and integer , is to find such that is maximal. Naturally, if we have to do so from scratch, we have no other option but to test every . However, since segmentation needs the maximal average for every we can use information from previous scans to find the optimal more quickly.
We will now present the main results by Calders et al (2007) in which the authors considered efficiently finding the maximal average from a stream of data points. In the next section we will modify this approach to make it more memory-efficient.
Given a sequence we say that is a border if there is a (possibly empty) sequence such that if we define to be concatenated with , then
[TABLE]
We define to be the sorted list of border points of .
Let be a sequence. Further, let and let . Whenever is clear from the context, we define to be . Further, we will write instead of .
The following theorem states that a maximal average can be found by simply taking the largest border. (Calders et al (2007) deal only with binary sequences, but we can easily extend these results to the general case.)
Theorem 5.1 (see Calders et al, 2007)
Assume a sequence . Let . Then
[TABLE]
We can describe the borders using the following theorem.
Theorem 5.2 (see Calders et al, 2007)
An integer is a border for if and only if there are no and , such that .
Example 2
Assume that we are given a sequence , and that . According to Theorem 5.2, index , since . The borders are .
Our next step is to revise the algorithm given by Calders et al (2007) for constructing from . The key idea for update is given in the following theorem.
Theorem 5.3 (see Calders et al, 2007)
Let us assume a list of borders . Define . Define , , to be the maximal integer such that . If such does not exist, we set . Then, .
The update algorithm (given as Algorithm 2) starts with the previous borders and adds as a border. Then the algorithm tests whether the average of the second last border is larger than the average of the last border. If so, then the condition in Theorem 5.3 is violated, and we remove the last border and repeat the test. The correctness of Update is given by Calders et al (2007). Note that we can compute the needed averages in constant time, for example, by precalculating a sequence .
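The update step can be sketched as follows for a one-dimensional stream. This is an illustration in the spirit of the description above rather than a transcription of Algorithm 2: it assumes that borders are kept as the 1-indexed starting points of consecutive blocks whose averages strictly increase, and that the maximal average of a suffix is the average of the last block. The function name is hypothetical.

```python
def max_suffix_averages(data):
    """For each i, return the maximum over a <= i of the average of data[a..i]."""
    prefix = [0.0]  # prefix[t] = data[1] + ... + data[t]
    borders = []    # 1-indexed block starts; block averages strictly increase
    result = []

    def block_average(m, end):
        """Average of the m-th block; the last block ends at index `end`."""
        a = borders[m]
        b = borders[m + 1] - 1 if m + 1 < len(borders) else end
        return (prefix[b] - prefix[a - 1]) / (b - a + 1)

    for i, x in enumerate(data, start=1):
        prefix.append(prefix[-1] + x)
        borders.append(i)  # the new point starts as its own block
        while (len(borders) >= 2 and
               block_average(len(borders) - 2, i)
               >= block_average(len(borders) - 1, i)):
            borders.pop()  # merge the last block into its predecessor
        result.append(block_average(len(borders) - 1, i))
    return result
```

For the input `[1, 5, 3]` the function returns `[1.0, 5.0, 4.0]`. Every index is appended once and removed at most once, so the total work over the stream is linear. The sketch considers suffixes starting anywhere in the sequence; restricting the starting point to a given segment is the topic of the following subsection.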
5.2 Computing Borders Simultaneously
We can use borders to discover the right interval for a single segment. However, recall that in Algorithm 1 we need to be able to compute the right interval for any , where is the current set of candidates for a segment. A naïve approach would be to compute borders separately for each . This leads to memory and time consumption, where is the maximum size of during evaluation of Segment. Here, we will modify the border update algorithm such that its total memory consumption is . This will guarantee that the memory consumption of Segment is .
Example 3
Let us continue Example 2. We have , , , and . Note that and have a common tail sequence, namely, . We can generalize this observation.
The following key result states that when two border lists, say and share a common border, the subsequent borders are equivalent.
Theorem 5.4
Let be a sequence and let be two indices. Assume that . Let . Then if and only if .
Proof
We will show that using induction over . The result follows by setting .
Since , the first step follows.
Assume that the result holds for . Let and . Let also . Let and be such that .
Theorem 5.3 states that there is an integer such that and an integer such that . Since , we must have and similarly . Note that, by the induction assumption, we have . This implies that update will process exactly the same input, and deletes exactly the same number of entries, that is, implies that . This proves the induction step.
This theorem allows us to group border lists into a tree. Let be a sequence and let be a set of indices. We define a border tree as follows: The non-root nodes of the tree consist of the borders from for each , that is,
[TABLE]
There is an edge from a node to a node if and only if there is such that , and . Note that this is well-defined since Theorem 5.4 states that if we have a node , essentially a border, shared by several border lists, then each border list will have exactly the same next border, which is represented by the parent of . Finally, the last border from each is a child of a root, which we will denote by . Note that, for each , a path from to in is equal to , where .
Given a node in we write to be the child nodes of . We assume that is constructed so that the children are ordered from smallest to largest. In order to be able to modify the tree quickly, we store the tree structure as follows. Each node can have 3 pointers at most: a pointer to a right sibling, a pointer to a left sibling or to the parent, if there is no left sibling, and a pointer to the first child, see Figure 3(a) as an example.
Our next step is to show how to extract the maximal average, and by doing so compute the right interval. In order to do so we need the following results.
Theorem 5.5
Let be a sequence and let be two indices. If and , then .
Proof
Assume that . Then Theorem 5.2 implies that there are such that . Since , Theorem 5.2 immediately implies .
Corollary 1
Let be a sequence and let be two indices. If and , then .
Proof
Theorem 5.5 implies . Since both border lists share they also share any border larger than . If both and , then Theorem 5.4 implies , which is a contradiction. Consequently, .
Corollary 2
Let be a sequence and let be a set of indices. Let be a border tree and let be its root. Select and let be the smallest index such that . Then for any .
Proof
Let be the maximal border. Theorem 5.1 states that we need to prove . We see immediately that . Let be such that . If , then, since , Corollary 1 implies . On the other hand, if , then since , Corollary 1 implies .
Corollary 2 gives a way to find the maximal average. Given and , we look for the smallest child of root, say , such that .
Our next step is to update a border tree from to , an update step similar to Algorithm 2. We start by first adding a node between a root and its children. This corresponds to the first two lines in Algorithm 2. After this we modify the tree such that Theorem 5.3 holds for every path from to the root. In Algorithm 2 we simply deleted indices that were no longer borders. However, since a single node can be shared by several border lists we cannot just delete it, since it might be the case that it is still used by another border list. Instead, we reattach children of violating Theorem 5.3 to the root; effectively removing from the border lists in which is no longer a border. We give the pseudo-code in Algorithm 3.
Example 4
Let us continue Examples 2–3. Assume that we have a sequence given in Example 2 and that we have . Based on borders given in Example 3, the border tree is given in Figure 3(a). Assume that we see a new data point, . We have , , , and .
We begin updating the tree by first adding node between the root and its children, see Figure 3(b). We continue by checking the first child of : node , and reattach it to , see Figure 3(c). After this, we check the first child of , node and leave it unmodified. We continue by reattaching to the root, see Figure 3(d), and similarly node , see Figure 3(e). Since node is now a leaf and , we can delete it. Finally, we leave attached to . The final tree, which corresponds to the correct border tree, is given in Figure 3(f).
Theorem 5.6
Let . Algorithm outputs .
See Appendix for the proof.
In addition to UpdateTree, we need a routine for updating the tree when an index is deleted from . This is needed when Segment deletes a candidate for the optimal segmentation. In order to update we simply check whether is a leaf, if it is, then we delete it, and recursively test the parent of .
Finally, let us address the memory and time complexity of a border tree. First of all, we have nodes at maximum, hence we need memory. Let be the maximum number of . Let be the number of nodes removed during . If we do not modify the tree during the while-loop, then we execute the while-loop only once, since there is only one child of , namely . Note that by the end of each , root can have at most children. This means that at maximum we have done reattachments. Each reattachment increases the while-loop executions by 2: we need to check the child attached to the root and we need to check whether the parent has more children that need to be reattached. Hence, the while-loop is executed at most times during . Thus the total time complexity is . Note that once a node is deleted it will not be introduced again. Hence, . This gives us a total execution time of .
6 Experiments
In this section we empirically evaluate our approach on synthetic and real-world datasets. (The implementation of the algorithm is given at http://adrem.ua.ac.be/segmentation)
Synthetic data
The main contribution of this paper is the speedup of the dynamic program for finding the optimal segmentation when using one-dimensional log-linear models. We measure the efficiency by the total number of comparisons needed in Line 1 of Algorithm 1. We define a performance ratio by normalizing this number by the number of comparisons that we would have made without any pruning. This ensures that the ratio is between 0 and 1, smaller values indicating faster performance. Note that if we do not use any pruning, the total number of comparisons is .
We begin by generating sequences of random samples drawn from the Gaussian distribution with [math] mean and variance. We generated 11 sequences of lengths for and computed the performance ratio of our segmentation using segments of Gaussian distributions (as given in Example 1). From results given in Figure 4(a) we see that we obtain speedups of 1 order of magnitude for the smallest data, up to 3 orders of magnitude for longer data: the ratio for the largest sequence is . Note that the ratios become smaller as the sequence becomes larger. The reason is that when considering longer segments, it becomes more likely that we can delete candidates, making the algorithm relatively faster. The absolute computation time grows with the length of a sequence, , , and minutes for sequences of length , , and , respectively.
Our second experiment is to study the performance ratio as a function of the number of segments. We sampled 3 sequences from a Gaussian distribution, with [math] mean and variance, of sizes , , . For each sequence we computed segmentations up to segments. From the results given in Figure 4(b) we see that the performance ratio becomes worse as we increase the number of segments. The reason is that as segments become shorter, the right intervals become more compact and are less likely to intersect with the left interval. Nevertheless, we get , , and for performance ratios for our sequences when using segments. The peak at segments suggests that discovering a segmentation with segments is particularly expensive. To see why this is happening, first note that the first segment always starts from the beginning. This implies that when looking for a segmentation with segments for a sequence , the second segment will typically be either really short or really long as its mean needs to differ from the mean of the first segment. If the second segment is short, it will have an abnormal right interval; consequently, the interval has a smaller chance of overlapping with the left interval of the next segment.
Our next step is to study how candidates for segments are distributed. A candidate is added to on Line 1 and deleted from on Line 1 in Segment. The candidate is added when the counter is equal to and let us assume that it is deleted when the counter is equal to . If is not deleted, after the for-loop in Segment, we simply set . We define a lifetime of a candidate to be , that is, a candidate lifetime is how often it has been used in the maximization step on Line 1. The smaller the value, the less computational burden a candidate is producing. In the worst case, that is, without any pruning, the lifetime for a candidate is equal to .
To study candidate lifetimes we generate a sequence of samples, consisting of segments of Gaussian distribution with [math], , , and [math] means, respectively, and variance of (see Figure 5(a)). We computed segmentations up to segments and present the lifetimes in Figure 5. (For the sake of clarity, figures show average lifetimes of bins containing points.) We see that there are four major spikes in lifetimes, at the beginning of the sequence and around each change point. Let us consider a spike at for . A candidate on the left side of the spike has a longer lifetime because the left interval of the next segment is shifted and it is less likely that it will intersect with the right interval. On the other hand, a candidate on the right side of the spike has a longer lifetime because the segment is short and the right interval has a higher chance of being abnormal. The same rationale applies to the spike at the beginning of the sequence. The spikes grow with an increasing number of segments; nevertheless, they are shallow, implying that we have a significant speedup. In fact, the performance ratios are , , for segmentations with segments, respectively.
Finally, let us demonstrate the limitations of our approach. We generate a sequence of samples, where a sample is generated from a Gaussian distribution with a mean of and variance of , see Figure 6(a). The performance ratio of segmentation with segments is , the lifetimes are given in Figure 6(b). While we see a good performance for this data, when we increase the slope (or equivalently, lower the variance) the performance ratio becomes worse. The worst case scenario is a genuinely monotonically increasing (or decreasing) sequence, that is, . In such a case, the left intervals and the right intervals will never overlap and no candidate will be pruned. We should point out that applying segmentation to a monotonic sequence in the first place is questionable, as such a sequence does not fit the probabilistic segmentation model well, and it might be beneficial to detrend the data to obtain a better segmentation.
Real-world data
We continue our experiments using real-world data sets. We considered different datasets. (The datasets were obtained from http://www.cs.ucr.edu/~eamonn/discords/) The first dataset, Marotta, is the Space Shuttle Marotta Valve time series, consisting of 5 energize/de-energize cycles (TEK17). The second dataset, Power, consists of the power consumption of a Dutch research facility during the year 1997. The third dataset consists of two-dimensional time series extracted from videos of an actor performing various acts with and without a replica gun. Since this sequence is two-dimensional, we split the dimensions into Video1 and Video2. The sequence lengths are given in Table 1.
We study the performance by computing segmentations with segments for each dataset and comparing it against the traditional dynamic program, that is, without deleting any candidates. From the results, given in Table 1, we see that our approach has a significant advantage over the baseline approach. For example, with the Power dataset we find an optimal solution in 20 seconds while the baseline approach requires 10 minutes.
Finally, let us look at some of the discovered segmentations. In Figure 7 we present a segmentation of Marotta with segments. The segments align with high and low energy states. Note that the 3rd high energy segment is more shallow than the other high energy segments. This cycle contains an anomaly as pointed out by Keogh et al (2005), resulting in a shorter high energy segment. In Figure 8 we show a segmentation with segments of the power consumption. We can see that the mean of the middle segment is lower than the other means, indicating a summer season.
7 Related Work
Segmentation is an instance of a larger problem setting, called change point detection; see Basseville and Nikiforov (1993) for an introduction. We can divide the problem settings broadly into two categories: offline and online. Although these settings have conceptually the same goal, the setup details make them different from an algorithmic point of view. In online change point detection (see Kifer et al (2004), for example) the data arrives in a stream fashion, typically there is no budget for how many change points are allowed, and the decision needs to be made within some time frame, whereas in segmentation, that is, offline change point detection, new data points can change early segments. A typical goal for online change detection is to alert the system or a user of a change, whereas in segmentation the only goal is to summarize the sequence.
Popular approaches for segmentation are top-down approaches where we greedily select a new change-point (see Shatkay and Zdonik (1996); Bernaola-Galván et al (1996); Douglas and Peucker (1973); Lavrenko et al (2000), for example) and bottom-up approaches where at the beginning each point is a segment, and points are combined in a greedy fashion (see Palpanas et al (2004), for example). A randomized heuristic was suggested by Himberg et al (2001), where we start from a random segmentation and optimize the segment boundaries. These approaches, although fast, are heuristics and have no theoretical guarantees of the approximation quality. A divide-and-segment approach, an approximation algorithm with theoretical guarantees on the approximation quality, was given by Terzi and Tsaparas (2006).
Modifications of the original segmentation problem have also been studied. Discovering recurrent sources, a setup where the number of distinct segment means is limited to be smaller than the number of allowed segments, has been suggested by Gionis and Mannila (2003). Haiminen and Gionis (2004) study unimodal sequences, where means of the centroids (of a one-dimensional sequence) are required to follow a unimodal curve, that is, the means should only rise to some point and then only decline afterwards. For a survey of the segmentation algorithms, see Chapter 8 in (Džeroski et al, 2011).
8 Discussion and Conclusions
In this paper we introduced a pruning technique to speedup the dynamic program used for solving the segmentation problem. We demonstrated on both synthetic and real-world data that we gain a significant speedup by using our pruning technique.
We should point out that our pruning is online, that is, the decision to delete a candidate is based only on current and past data points. We believe that we can speedup the algorithm further by applying additional pruning techniques based on future data points, such as (Gedikli et al, 2010). In addition, we conjecture that these optimizations may prove to be useful in other setups, such as, discovering HMMs or CRFs, where dynamic programs are used in order to optimize the model. We leave these studies as future work.
Segmentation requires a parameter, namely the number of segments. One approach to remove this parameter is to use model selection techniques, such as BIC (Schwarz, 1978) or MDL (Grünwald, 2007). We conjecture that these techniques not only remove the parameter but can also be used for further speedup.
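As an illustration of how model selection could remove the parameter, the sketch below picks the number of segments by BIC on top of the standard quadratic dynamic program, without any pruning. It rests on several assumptions that are not from the paper: a Gaussian model with a shared variance, so that the segment cost is squared error; a parameter count of 2K (K means, K - 1 breakpoints, one variance); and the hypothetical names `segment_costs` and `choose_k_by_bic`.

```python
import math


def segment_costs(x, kmax):
    """Optimal squared error of splitting x into k segments, k = 1..kmax.

    Standard dynamic program without pruning; assumes kmax <= len(x).
    """
    n = len(x)
    prefix, prefix_sq = [0.0], [0.0]
    for v in x:
        prefix.append(prefix[-1] + v)
        prefix_sq.append(prefix_sq[-1] + v * v)

    def sse(i, j):
        """Squared error of x[i:j] around its mean."""
        s = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - s * s / (j - i)

    # cost[k][j] = best error of splitting x[:j] into k segments.
    cost = [[math.inf] * (n + 1) for _ in range(kmax + 1)]
    cost[0][0] = 0.0
    for k in range(1, kmax + 1):
        for j in range(k, n + 1):
            cost[k][j] = min(cost[k - 1][i] + sse(i, j)
                             for i in range(k - 1, j))
    return [cost[k][n] for k in range(1, kmax + 1)]


def choose_k_by_bic(x, kmax):
    """Return the k in 1..kmax with the smallest BIC (assumed Gaussian form)."""
    n = len(x)
    best_k, best_bic = None, math.inf
    for k, c in enumerate(segment_costs(x, kmax), start=1):
        var = max(c / n, 1e-12)   # floor avoids log(0) on noise-free data
        n_params = 2 * k          # assumed: k means, k-1 breakpoints, 1 variance
        bic = n * math.log(var) + n_params * math.log(n)
        if bic < best_bic:
            best_k, best_bic = k, bic
    return best_k
```

The sketch recomputes the full table for every candidate K up to `kmax`; whether model selection can also be used to prune candidates is exactly the open question raised above.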
Our algorithm is limited to the one-dimensional case. However, the key result, Theorem 3.1, actually handles the multi-dimensional case. The reason why we limit ourselves to the one-dimensional case is that there we were able to verify the sufficient conditions in Theorem 3.1 with relative ease. We leave applying Theorem 3.1 more generally as future work. While we are skeptical whether it is possible to verify the conditions in Theorem 3.1 exactly, we believe that it is possible to find more conservative conditions that can be easily checked and that imply the conditions in Theorem 3.1.
Acknowledgements
Nikolaj Tatti was partly supported by a Post-Doctoral Fellowship of the Research Foundation – Flanders (FWO).
Appendix A Proofs
A.1 Proof of Theorem 3.1
Theorem 3.1 will follow from the following theorem.
Theorem A.1
Let . Let . Assume that is a cover. Then there exists such that or there exists such that .
In order to prove the theorem we will introduce some helpful notation. First, given a parameter vector and , we define
[TABLE]
Note that . We also define
[TABLE]
This function is essentially the difference between two scores.
Lemma 1
Let . We have .
Proof
Note that
[TABLE]
The last two terms do not depend on . This allows us to write
[TABLE]
This completes the proof.
Proof (Proof of Theorem A.1)
Write and define
[TABLE]
We need to show that either or . Assume that . Fix . By definition, there exist and such that
[TABLE]
From now on we will write to mean and to mean . We must have or, equivalently, .
Since is a cover, there exist integers and , , such that , where and .
Define . We now have
[TABLE]
which implies . Since this holds for any , we conclude that . This proves the theorem.
Proof (Proof of Theorem 3.1)
Let a segmentation and let and be two consecutive segments such that is a cover. We can now apply Theorem A.1 to find alternative segments and such that if we define by replacing and from with and then either or and ends before . We repeat this until no consecutive segments constitute a cover. This repetition terminates because no segmentation occurs twice during these steps and there is a finite number of segmentations. No segmentation occurs twice because at each step either the score strictly increases, or the score stays the same and a breakpoint moves to the left.
A.2 Proof of Theorem 5.6
Let be the resulting tree from . To prove the theorem we need to show that the paths of from leaves to the root consist of borders, that there are no nodes in outside the borders, and that children are ordered. We will prove these results in a series of lemmata.
Lemma 2
Let be a tree after we have added a node in UpdateTree. Let be a node in and let be its parent. Let be such that . If , then will cease to be a child of during some stage of UpdateTree.
Proof
Let be a root node of . Consider a pre-order of nodes of , that is, parents and earlier siblings come first. We will prove the lemma using induction on the pre-order.
To prove the first step, let be the first child of . If , then Theorem 5.3 implies that which is exactly the test on Line 3. Hence, will be disconnected from .
Let us now prove the induction step. Let be the parent of in . Assume that . Note that is the border next to in . Theorem 5.3 implies that , hence the induction assumption implies that and are disconnected and becomes a child of at some point.
Assume now that is not the first child of and let be the sibling left to , and let be such that . Theorem 5.1 implies that for any . Since , we must have , which implies that . Again, the induction assumption implies that and will be disconnected. Consequently, will be the first child of at some point.
Note that while moving or left siblings of to be children of we move the current node in UpdateTree to the left. Hence, there will be a point where and is the first child of . Theorem 5.3 implies that which is exactly the test on Line 3. Hence, will be disconnected from . This proves the lemma.
Lemma 3
For every , a path in from to a child of the root node equals .
Proof
Fix and let and define . Theorem 5.3 implies that there is such that .
After adding to , UpdateTree will not add new nodes into the path from to . Lemma 2 now implies that the path from to will be , where . If , then immediately . To conclude that in general, assume that and assume that at some point in UpdateTree we have and . Then, according to Theorem 5.3, the test on Line 3 will fail and remains as a child of .
Lemma 4
Let be a node in , then there is such that .
Proof
Let be a node that occurs in but not in . The lemma will follow if we can show that is not in . Let be the last child of . Lemma 2 implies that at some point will be disconnected from and we will visit when it is a leaf, since , we will delete .
Lemma 5
Consider a post-order of nodes of , that is, parents and later siblings come first. Node values decrease with respect to this order.
Proof
We will prove that the following holds: Let be a node and let be its left sibling. Let be the smallest child of . Then . Note that this automatically proves the lemma.
Note that . To prove that , let such that . If , then since , Theorem 5.5 implies that which is a contradiction. Consequently, . If , then again Theorem 5.5 implies that which is a contradiction. This proves that .
Lemma 6
Child nodes of each node in are ordered from smallest to largest.
Proof
UpdateTree modifies the tree by moving the first child of a node to be the left sibling of . This does not change the post-order of the nodes. This implies that, since node values decrease with respect to the post-order in , they will also decrease in . This proves the lemma.
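The invariance used in the proof of Lemma 6 can be checked on a small example. The sketch below is only an illustration under stated assumptions: the tree is stored as nodes with ordered child lists, and `Node`, `lemma_order`, and `promote_first_child` are hypothetical names that do not reproduce UpdateTree itself. It lists nodes in the order that Lemma 5 calls post-order (parents and later siblings come first) and shows that moving the first child of a node to be that node's left sibling leaves the order unchanged.

```python
class Node:
    """Tree node with an ordered list of children (leftmost child first)."""

    def __init__(self, value, children=None):
        self.value = value
        self.children = list(children or [])


def lemma_order(node):
    """List values so that parents and later siblings come first."""
    out = [node.value]
    for child in reversed(node.children):
        out.extend(lemma_order(child))
    return out


def promote_first_child(parent, node):
    """Move the first child of `node` to be the left sibling of `node`."""
    idx = parent.children.index(node)
    first = node.children.pop(0)
    parent.children.insert(idx, first)


# Small example: r has children a, n, b; n has children c1, c2; c1 has child d.
d = Node("d")
c1 = Node("c1", [d])
c2 = Node("c2")
n = Node("n", [c1, c2])
root = Node("r", [Node("a"), n, Node("b")])

before = lemma_order(root)
promote_first_child(root, n)   # c1 becomes the left sibling of n
after = lemma_order(root)
```

In both cases the order is r, b, n, c2, c1, d, a: the promoted child and its subtree are still visited right after the remaining children of the node, which is why a decreasing sequence of node values stays decreasing.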
References
- Basseville M, Nikiforov IV (1993) Detection of Abrupt Changes: Theory and Application. Prentice-Hall
- Bellman R (1961) On the approximation of curves by line segments using dynamic programming. Communications of the ACM 4(6)
- Bernaola-Galván P, Román-Roldán R, Oliver JL (1996) Compositional segmentation and long-range fractal correlations in DNA sequences. Physical Review E: Statistical Physics, Plasmas, Fluids, and Related Interdisciplinary Topics 53(5):5181–5189
- Calders T, Dexters N, Goethals B (2007) Mining frequent itemsets in a stream. In: ICDM, pp 83–92
- Douglas D, Peucker T (1973) Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Canadian Cartographer 10(2):112–122
- Džeroski S, Goethals B, Panov P (eds) (2011) Inductive Databases and Constraint-based Data Mining. Springer
- Gedikli A, Aksoy H, Unal NE, Kehagias A (2010) Modified dynamic programming approach for offline segmentation of long hydrometeorological time series. Stochastic Environmental Research and Risk Assessment 24(5)
- Gionis A, Mannila H (2003) Finding recurrent sources in sequences. In: Proceedings of the seventh annual international conference on Research in computational molecular biology, RECOMB ’03, pp 123–130
