Duel and sweep algorithm for order-preserving pattern matching
Davaajav Jargalsaikhan, Diptarama, Ryo Yoshinaka, Ayumi Shinohara

Graduate School of Information Sciences, Tohoku University
6-6-05 Aramaki Aza Aoba, Aoba-ku, Sendai, Japan
{davaajav@shino., diptarama@shino., ry@, ayumi@}ecei.tohoku.ac.jp
Abstract
Given a text $T$ and a pattern $P$ over an alphabet $\Sigma$, the classic exact matching problem searches for all occurrences of the pattern $P$ in the text $T$. Unlike the exact matching problem, order-preserving pattern matching (OPPM) considers the relative order of elements rather than their actual values. In this paper, we propose an efficient algorithm for the OPPM problem using the “duel-and-sweep” paradigm. Our algorithm runs in $O(n + m\log m)$ time in general, and in $O(n + m)$ time under the assumption that the characters of a string can be sorted in time linear in the string size. We also perform experiments and show that our algorithm is faster than the KMP-based algorithm. Last, we introduce two-dimensional order-preserving pattern matching and give a duel-and-sweep algorithm that runs in $O(n^2)$ time for the dueling stage and $O(n^2 m)$ time for the sweeping stage, with $O(m^3)$ preprocessing time.
1 Introduction
The exact string matching problem is one of the most widely studied problems. Given a text and a pattern, the exact matching problem searches for all occurrence positions of the pattern in the text. Motivated by low-level image processing, the two-dimensional exact matching problem has been extensively studied in recent decades. Given a text $T$ of size $n \times n$ and a pattern $P$ of size $m \times m$ over an alphabet $\Sigma$ of size $\sigma$, the exact matching problem on two-dimensional strings searches for all occurrence positions of $P$ in $T$. Bird [4] and Baker [3] proposed two-dimensional exact matching using a dictionary matching algorithm, and Amir and Farach [2] proposed an algorithm that uses suffix trees. These algorithms require a total ordering on the alphabet and run in $O(n^2 \log \sigma)$ time with $O(m^2 \log \sigma)$ preprocessing time. Amir et al. [1] also proposed an alphabet-independent approach to the problem that runs in $O(m^2 \log \sigma)$ preprocessing time and $O(n^2)$ matching time.
Unlike the exact matching problem, order-preserving pattern matching (OPPM) considers the relative order of elements, rather than their real values. Order-preserving matching has gained much interest in recent years, due to its applicability in problems where the relative order is compared, rather than the exact value, such as share prices in stock markets, weather data or musical notes.
Kubica et al. [15] and Kim et al. [14] proposed solutions based on the KMP algorithm. These algorithms address the one-dimensional OPPM problem and have time complexity $O(n + m\log m)$. Cho et al. [8] brought forward another algorithm, based on Horspool’s algorithm, that uses $q$-grams and was shown experimentally to be fast. Crochemore et al. [10] proposed data structures for OPPM. On the other hand, Chhabra and Tarhio [7] and Faro and Külekci [11] proposed filtration methods that are fast in practice. Moreover, faster filtration algorithms using SIMD (Single Instruction Multiple Data) instructions were proposed by Cantone et al. [5], Chhabra et al. [6] and Ueki et al. [16]. They showed that SIMD instructions are efficient in speeding up their algorithms.
In this paper, we propose an algorithm for OPPM that is based on the dueling technique [17]. Our algorithm runs in $O(n + m\log m)$ time, which is as fast as the KMP-based algorithm. Moreover, we perform experiments that compare the performance of our algorithm with the KMP-based algorithm. The experimental results show that our algorithm is faster than the KMP-based algorithm. Last, we introduce two-dimensional order-preserving pattern matching and give a duel-and-sweep algorithm that runs in $O(n^2)$ time for the dueling stage and $O(n^2 m)$ time for the sweeping stage, with $O(m^3)$ preprocessing time. To the best of our knowledge, our solution is the first to address the two-dimensional order-preserving pattern matching problem.
The rest of the paper is organized as follows. In Section 2, we give preliminaries on the problem. In Section 3, we describe the algorithm for the OPPM problem. In Section 4, we show experimental results that compare the performance of our algorithm with the KMP-based algorithm. In Section 5, we extend the algorithm and describe the method for the two-dimensional OPPM problem. In Section 6, we conclude our work and discuss future work.
2 Preliminaries
We use $\Sigma$ to denote an alphabet of integer symbols such that the comparison of any two symbols can be done in constant time. $\Sigma^*$ denotes the set of strings over the alphabet $\Sigma$. For a string $S \in \Sigma^*$, we denote the $i$-th element of $S$ by $S[i]$ and the substring of $S$ that starts at location $i$ and ends at location $j$ by $S[i:j]$. We say that two strings $S$ and $T$ of equal length $n$ are order-isomorphic, written $S \approx T$, if $S[i] \le S[j] \iff T[i] \le T[j]$ for all $1 \le i, j \le n$. For instance, $(12, 35, 5) \approx (25, 30, 21)$.
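The definition of order-isomorphism can be checked directly by comparing every pair of positions. The following is a minimal sketch, not the paper's algorithm: it is quadratic in the string length, uses 0-indexed Python lists, and the function name `order_isomorphic` is an assumption made for illustration.

```python
def order_isomorphic(s, t):
    """Return True if s and t have equal length and the same relative order.

    Naive O(n^2) check taken directly from the definition:
    s[i] <= s[j] holds exactly when t[i] <= t[j], for all pairs (i, j).
    """
    if len(s) != len(t):
        return False
    n = len(s)
    return all((s[i] <= s[j]) == (t[i] <= t[j])
               for i in range(n) for j in range(n))


print(order_isomorphic([12, 35, 5], [25, 30, 21]))  # True
print(order_isomorphic([12, 35, 5], [11, 13, 20]))  # False
```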
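In order to check order-isomorphism of two strings, Kubica et al. [15] introduced useful arrays $\mathit{Lmax}_S$ and $\mathit{Lmin}_S$ (similar arrays are introduced in [13]), defined by
$$\mathit{Lmax}_S[i] = j \quad \text{if } S[j] = \max_{k < i} \{\, S[k] \mid S[k] \le S[i] \,\}, \tag{1}$$
$$\mathit{Lmin}_S[i] = j \quad \text{if } S[j] = \min_{k < i} \{\, S[k] \mid S[k] \ge S[i] \,\}. \tag{2}$$
We use the rightmost (largest) $j$ if there exists more than one such $j$. If there is no such $j$, then we define $\mathit{Lmax}_S[i] = -\infty$ and $\mathit{Lmin}_S[i] = \infty$, respectively. From the definition, we can easily observe the following properties.
$$S[\mathit{Lmax}_S[i]] = S[i] \iff S[i] = S[\mathit{Lmin}_S[i]], \tag{3}$$
$$S[\mathit{Lmax}_S[i]] < S[i] \iff S[i] < S[\mathit{Lmin}_S[i]]. \tag{4}$$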
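The two nearest-neighbor arrays described above can be computed by brute force, which makes the tie-breaking rule concrete. This is a sketch under stated assumptions: it runs in quadratic time rather than the sorting-based bound of Lemma 1, it is 0-indexed, it uses `None` where no suitable earlier position exists, and the names `nearest_neighbors`, `lmax` and `lmin` are illustrative.

```python
def nearest_neighbors(s):
    """Naive O(n^2) computation of the two neighbor arrays (0-indexed).

    lmax[i]: index j < i of the largest s[j] with s[j] <= s[i];
    lmin[i]: index j < i of the smallest s[j] with s[j] >= s[i].
    Ties go to the rightmost j; None means that no such j exists.
    """
    n = len(s)
    lmax, lmin = [None] * n, [None] * n
    for i in range(n):
        for j in range(i):
            # ">=" / "<=" let a later j replace an earlier one on ties.
            if s[j] <= s[i] and (lmax[i] is None or s[j] >= s[lmax[i]]):
                lmax[i] = j
            if s[j] >= s[i] and (lmin[i] is None or s[j] <= s[lmin[i]]):
                lmin[i] = j
    return lmax, lmin


lmax, lmin = nearest_neighbors([18, 22, 12, 50, 10, 17])
print(lmax)  # [None, 0, None, 1, None, 2]
print(lmin)  # [None, None, 0, None, 2, 0]
```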
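Lemma 1 ([15]).
For a string $S$, let $\mathit{sort}(S)$ be the time required to sort the elements of $S$. $\mathit{Lmax}_S$ and $\mathit{Lmin}_S$ can be computed in $O(\mathit{sort}(S) + |S|)$ time.
Thus, $\mathit{Lmax}_S$ and $\mathit{Lmin}_S$ can be computed in $O(|S| \log |S|)$ time in general. Moreover, the computation can be done in $O(|S|)$ time under a natural assumption [15] that the characters of $S$ are elements of the set $\{1, \ldots, |S|^{O(1)}\}$. By using $\mathit{Lmax}_S$ and $\mathit{Lmin}_S$, order-isomorphism of two strings can be decided as follows.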
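Lemma 2 ([8]).
For two strings $S$ and $T$ of length $n$, assume that $S[1:i] \approx T[1:i]$ for some $i < n$. Let $i_{max} = \mathit{Lmax}_S[i+1]$ and $i_{min} = \mathit{Lmin}_S[i+1]$. Then $S[1:i+1] \approx T[1:i+1]$ if and only if either of the following two conditions holds.
$$S[i_{max}] = S[i+1] = S[i_{min}] \;\wedge\; T[i_{max}] = T[i+1] = T[i_{min}], \tag{5}$$
$$S[i_{max}] < S[i+1] < S[i_{min}] \;\wedge\; T[i_{max}] < T[i+1] < T[i_{min}]. \tag{6}$$
We omit the corresponding equalities/inequalities if $i_{max} = -\infty$ or $i_{min} = \infty$.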
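Lemma 2 suggests an incremental test: once a prefix is known to be order-isomorphic, extending it by one position needs only two comparisons in the second string. The sketch below illustrates this under assumptions: indices are 0-based, `None` stands for a missing neighbor, the neighbor arrays are built by a quadratic helper for brevity, and all function names are hypothetical.

```python
def nearest_neighbors(s):
    """Quadratic helper: rightmost nearest predecessor/successor values."""
    n = len(s)
    lmax, lmin = [None] * n, [None] * n
    for i in range(n):
        for j in range(i):
            if s[j] <= s[i] and (lmax[i] is None or s[j] >= s[lmax[i]]):
                lmax[i] = j
            if s[j] >= s[i] and (lmin[i] is None or s[j] <= s[lmin[i]]):
                lmin[i] = j
    return lmax, lmin


def extends(s, t, i, imax, imin):
    """Lemma 2 test at position i, assuming s[:i] and t[:i] already agree."""
    low_ok = imax is None or (
        t[imax] == t[i] if s[imax] == s[i] else t[imax] < t[i])
    high_ok = imin is None or (
        t[i] == t[imin] if s[i] == s[imin] else t[i] < t[imin])
    return low_ok and high_ok


def order_isomorphic_incremental(s, t):
    """Decide order-isomorphism with a constant number of tests per position."""
    if len(s) != len(t):
        return False
    lmax, lmin = nearest_neighbors(s)
    # all() stops at the first failing position, so each test sees a valid prefix.
    return all(extends(s, t, i, lmax[i], lmin[i]) for i in range(len(s)))


print(order_isomorphic_incremental([12, 35, 5], [25, 30, 21]))  # True
print(order_isomorphic_incremental([1, 1], [1, 2]))             # False
```

With linear-time neighbor arrays in place of the helper, the checking loop itself performs a constant amount of work per position.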
Hasan et al. [13] proposed a modification of the Z-function, which Gusfield [12] defined for ordinary pattern matching, to make it useful from the order-preserving point of view. For a string $S$, the (modified) Z-array $Z_S$ of $S$ is defined by
$$Z_S[i] = \max_{1 \le j \le |S| - i + 1} \{\, j \mid S[1:j] \approx S[i:i+j-1] \,\} \quad \text{for each } 1 \le i \le |S|.$$
In other words, $Z_S[i]$ is the length of the longest substring of $S$ that starts at position $i$ and is order-isomorphic with some prefix of $S$. An example of a Z-array is illustrated in Table 1.
Lemma 3 ([13]).
For a string $S$, the Z-array $Z_S$ can be computed in $O(|S|)$ time, assuming that $\mathit{Lmax}_S$ and $\mathit{Lmin}_S$ are already computed.
Note that in their original work, Hasan et al. [13] assumed that each character in $S$ is distinct. However, we can extend their algorithm by using Lemma 2 to verify order-isomorphism even when $S$ contains duplicate characters.
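To make the order-preserving Z-array concrete, the sketch below computes it by brute force. It is not the linear-time method of Lemma 3: it re-checks order-isomorphism from scratch for each extension, using dense ranks as a signature. Indices are 0-based, and the names `signature` and `z_array` are assumptions made for illustration.

```python
def signature(s):
    """Dense ranks of s; two strings are order-isomorphic iff these agree."""
    rank = {v: r for r, v in enumerate(sorted(set(s)))}
    return [rank[v] for v in s]


def z_array(s):
    """Naive order-preserving Z-array (0-indexed).

    z[i] is the largest j such that s[:j] is order-isomorphic with s[i:i+j].
    """
    n = len(s)
    z = []
    for i in range(n):
        j = 0
        # Prefixes of order-isomorphic strings are order-isomorphic,
        # so the first failure marks the longest match.
        while i + j < n and signature(s[:j + 1]) == signature(s[i:i + j + 1]):
            j += 1
        z.append(j)
    return z


print(z_array([18, 22, 12, 50, 10, 17]))  # [6, 1, 3, 1, 2, 1]
```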
3 One-dimensional order-preserving matching
In this section, we will propose an algorithm for one-dimensional OPPM using the “duel-and-sweep” paradigm [1]. In the dueling stage, all possible pairs of candidates “duel” with each other. The surviving candidates are further pruned during the sweeping stage, leaving the candidates that are order-isomorphic with the pattern. Prior to the dueling stage, the pattern is preprocessed to construct a witness table that contains witness pairs for all possible offsets.
Definition 1 (1d-OPPM problem).
The one-dimensional order-preserving matching problem is defined as follows.
Input: A text $T \in \Sigma^*$ of length $n$ and a pattern $P \in \Sigma^*$ of length $m$.
Output: All occurrences of substrings of $T$ that are order-isomorphic with $P$.
3.1 Pattern preprocessing
Let $a > 0$ be an integer such that when $P$ is superimposed on itself with the offset $a$, the overlap regions are not order-isomorphic. We say that a pair $\langle i, j \rangle$ of locations is a witness pair for the offset $a$ if any of the following holds:
$P[i] = P[j]$ and $P[i+a] \ne P[j+a]$,
$P[i] > P[j]$ and $P[i+a] \le P[j+a]$,
$P[i] < P[j]$ and $P[i+a] \ge P[j+a]$.
Next, we describe how to construct a witness table for $P$ that stores witness pairs for all possible offsets $a$ with $0 < a < m$. For the one-dimensional problem, the witness table $\mathit{WIT}_P$ is an array of length $m - 1$ such that $\mathit{WIT}_P[a]$ is a witness pair for offset $a$. In the case when there are multiple witness pairs for offset $a$, we take the pair $\langle i, j \rangle$ with the smallest value of $j$ and $i$. When the overlap regions are order-isomorphic for offset $a$, which implies that no witness pair exists for $a$, we express it by a special value of $\mathit{WIT}_P[a]$.
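The witness-pair definition can be turned into a brute-force table construction, shown here only to illustrate what the table contains. It is a sketch under assumptions: it takes cubic time rather than the linear construction of Lemma 4, positions are 0-indexed, `None` is used as the "no witness" marker, and `rel` and `witness_table` are hypothetical names.

```python
def rel(x, y):
    """Three-way comparison: -1 if x < y, 0 if x == y, 1 if x > y."""
    return (x > y) - (x < y)


def witness_table(p):
    """Brute-force witness table for offsets 1..m-1 (entry 0 is unused).

    wit[a] is a pair (i, j), i < j, for which the order relation between
    p[i] and p[j] differs from the one between p[i + a] and p[j + a].
    wit[a] is None when the overlap at offset a is order-isomorphic.
    """
    m = len(p)
    wit = [None] * max(m, 1)
    for a in range(1, m):
        # Scan j first, then i, so the pair with the smallest j and i wins.
        wit[a] = next(((i, j) for j in range(m - a) for i in range(j)
                       if rel(p[i], p[j]) != rel(p[i + a], p[j + a])), None)
    return wit


print(witness_table([1, 3, 2]))     # [None, (0, 1), None]
print(witness_table([1, 2, 1, 2]))  # [None, (0, 1), None, None]
```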
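Lemma 4.
For a pattern $P$ of length $m$, we can construct $\mathit{WIT}_P$ in $O(m)$ time, assuming that $Z_P$ is already computed.
Proof.
Recall that $Z_P[a+1]$ is the length of the longest prefix of $P[a+1:m]$ that is order-isomorphic with a prefix of $P$. For each $0 < a < m$, we have two cases.
Case 1
$Z_P[a+1] = m - a$: Since $P[1:m-a] \approx P[a+1:m]$, there is no witness pair for offset $a$.
Case 2
$Z_P[a+1] < m - a$: Let $j = Z_P[a+1]$, $i_{max} = \mathit{Lmax}_P[j+1]$, and $i_{min} = \mathit{Lmin}_P[j+1]$. Then $P[1:j] \approx P[a+1:a+j]$ and $P[1:j+1] \not\approx P[a+1:a+j+1]$, by the definition of $Z_P$. By Lemma 2, neither condition (5) nor (6) holds. If $P[i_{max}] = P[j+1]$ then $P[j+1] = P[i_{min}]$ by property (3), so that
$$P[a+i_{max}] \ne P[a+j+1] \;\vee\; P[a+j+1] \ne P[a+i_{min}] \tag{7}$$
holds by condition (5). Otherwise, i.e. $P[i_{max}] < P[j+1]$, we have $P[j+1] < P[i_{min}]$ by property (3), so that
$$P[a+i_{max}] \ge P[a+j+1] \;\vee\; P[a+j+1] \ge P[a+i_{min}] \tag{8}$$
holds by condition (6). Therefore, $\langle i_{max}, j+1 \rangle$ is a witness pair if the left side of condition (7) or (8) holds, and $\langle i_{min}, j+1 \rangle$ is a witness pair if the right side of condition (7) or (8) holds.
Algorithm 1 describes the procedure. Clearly it runs in $O(m)$ time. ∎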
3.2 Dueling stage
A substring of $T$ of length $m$ will be referred to as a candidate. A candidate that starts at location $x$ will be denoted by $T_x$. Witness pairs are useful in the following situation. Let $T_x$ and $T_{x+a}$ be two overlapping candidates and $\langle i, j \rangle$ be the witness pair for offset $a$. Without loss of generality, we assume that $P[i] < P[j]$ and $P[i+a] > P[j+a]$.
If $T[x+a+i-1] \le T[x+a+j-1]$, then $T_x \not\approx P$.
If $T[x+a+i-1] \ge T[x+a+j-1]$, then $T_{x+a} \not\approx P$.
Based on this information, we can safely eliminate either candidate $T_x$ or $T_{x+a}$ without looking into other locations. This process is called dueling. The procedure for dueling is described in Algorithm 2.
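Next, we prove that the consistency property is transitive. Suppose $T_x$ and $T_{x+a}$ are two overlapping candidates. We say that $T_x$ and $T_{x+a}$ are consistent with respect to $P$ if $P[1:m-a] \approx P[a+1:m]$. Candidates that do not overlap are trivially consistent.
Lemma 5.
For any $a$ and $b$ such that $a, b > 0$ and $a + b < m$, let us consider three candidates $T_x$, $T_{x+a}$, and $T_{x+a+b}$. If $T_x$ is consistent with $T_{x+a}$ and $T_{x+a}$ is consistent with $T_{x+a+b}$, then $T_x$ is consistent with $T_{x+a+b}$.
Proof.
Since $T_x$ is consistent with $T_{x+a}$, it follows that $P[1:m-a] \approx P[a+1:m]$, so that $P[b+1:m-a] \approx P[a+b+1:m]$. Moreover, since $T_{x+a}$ is consistent with $T_{x+a+b}$, it follows that $P[1:m-b] \approx P[b+1:m]$, so that $P[1:m-a-b] \approx P[b+1:m-a]$. Thus, $P[1:m-a-b] \approx P[a+b+1:m]$, which implies that $T_x$ is consistent with $T_{x+a+b}$. ∎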
During the dueling stage, candidates are eliminated until all remaining candidates are pairwise consistent. For that purpose, we can apply the dueling algorithm of Amir et al. [1], developed for ordinary pattern matching.
Lemma 6 ([1]).
The dueling stage can be done in $O(n)$ time by using $\mathit{WIT}_P$.
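One common way to organize the dueling stage is a left-to-right scan that keeps a stack of pairwise consistent candidates; transitivity (Lemma 5) means a new candidate only needs to be compared with the top of the stack. The sketch below follows that idea and is not claimed to be the exact procedure of [1] or of Algorithm 2. Assumptions: 0-indexed positions, a precomputed table `wit` whose entry `wit[a]` is a witness pair `(i, j)` or `None`, and hypothetical function names.

```python
def rel(x, y):
    """Three-way comparison: -1, 0 or 1."""
    return (x > y) - (x < y)


def dueling_stage(text, p, wit):
    """Return starts of candidates that are pairwise consistent.

    Every true occurrence survives; survivors may still fail to match.
    """
    n, m = len(text), len(p)
    stack = []
    for y in range(n - m + 1):
        alive = True
        while stack:
            x = stack[-1]
            a = y - x
            if a >= m or wit[a] is None:
                break  # consistent with the top, hence with the whole stack
            i, j = wit[a]
            # Text positions y+i, y+j are pattern positions i, j of candidate y
            # and pattern positions i+a, j+a of candidate x.
            if rel(text[y + i], text[y + j]) == rel(p[i], p[j]):
                stack.pop()      # candidate x cannot match
            else:
                alive = False    # candidate y cannot match
                break
        if alive:
            stack.append(y)
    return stack


# wit written by hand for p = [1, 3, 2]: only offset 1 has a witness.
print(dueling_stage([10, 20, 15, 25, 18], [1, 3, 2], [None, (0, 1), None]))  # [0, 2]
```

Each candidate is pushed and popped at most once, so the scan performs a linear number of duels.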
3.3 Sweeping stage
The goal of the sweeping stage is to prune candidates until all remaining candidates are order-isomorphic with the pattern. Suppose that we need to check whether some surviving candidate $T_x$ is order-isomorphic with the pattern $P$. It suffices to successively check the conditions (5) and (6) in Lemma 2, starting from the leftmost location in $T_x$. If the conditions are satisfied for all locations in $T_x$, then $T_x \approx P$. Otherwise, $T_x \not\approx P$, and we obtain a mismatch position $j$, i.e. $T[x : x+j-2] \approx P[1 : j-1]$ but the check fails at location $j$.
A naive implementation of the sweeping stage results in $O(nm)$ time. However, if we take advantage of the fact that all the remaining candidates are pairwise consistent, we can reduce the time complexity to $O(n)$. Since the remaining candidates are consistent with each other, for overlapping candidates $T_x$ and $T_{x+a}$, the overlap region is checked only once if $T_x$ is order-isomorphic with the pattern $P$. Otherwise, for a mismatch position $j$, $T_{x+a}$ should be checked from position $j - a$ of $T_{x+a}$, because $T[x+a : x+j-2] \approx P[a+1 : j-1] \approx P[1 : j-a-1]$. Algorithm 3 describes the procedure for the sweeping stage.
Lemma 7.
The sweeping stage can be completed in $O(n)$ time.
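The sweeping idea can be sketched as follows: remember how far the text has already been verified and let the next consistent candidate resume from there instead of from its first position. This is an illustration under assumptions rather than a transcription of Algorithm 3: positions are 0-indexed, the input `survivors` must be increasing and pairwise consistent (as produced by a dueling stage), the neighbor arrays are built by a quadratic helper for brevity, and all names are hypothetical.

```python
def nearest_neighbors(p):
    """Quadratic helper: rightmost nearest predecessor/successor values."""
    m = len(p)
    lmax, lmin = [None] * m, [None] * m
    for i in range(m):
        for j in range(i):
            if p[j] <= p[i] and (lmax[i] is None or p[j] >= p[lmax[i]]):
                lmax[i] = j
            if p[j] >= p[i] and (lmin[i] is None or p[j] <= p[lmin[i]]):
                lmin[i] = j
    return lmax, lmin


def sweep(text, p, survivors):
    """Return the survivors that are order-isomorphic with p."""
    m = len(p)
    lmax, lmin = nearest_neighbors(p)

    def fits(x, k):
        # Lemma 2 test for pattern position k of the candidate starting at x.
        for nb, below in ((lmax[k], True), (lmin[k], False)):
            if nb is None:
                continue
            lo, hi = ((text[x + nb], text[x + k]) if below
                      else (text[x + k], text[x + nb]))
            if not (lo == hi if p[nb] == p[k] else lo < hi):
                return False
        return True

    matches, verified_end = [], 0
    for x in survivors:
        # Consistency lets us skip the part already verified for the
        # previous candidate (matched prefix or whole overlap).
        k = max(0, verified_end - x)
        while k < m and fits(x, k):
            k += 1
        if k == m:
            matches.append(x)
        verified_end = x + k
    return matches


# For an increasing pattern every pair of candidates is consistent.
print(sweep([1, 2, 0, 1, 2, 3], [1, 2, 3], [0, 1, 2, 3]))  # [2, 3]
```

Because `verified_end` never decreases, the inner loop advances over each text position a bounded number of times once the neighbor arrays are available.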
By Lemmas 4, 6, and 7, we summarize this section as follows.
Theorem 1.
The duel-and-sweep algorithm solves the 1d-OPPM problem in $O(n + m\log m)$ time. Moreover, the running time is $O(n + m)$ under the natural assumption that the characters of $P$ can be sorted in $O(m)$ time.
4 Experiment
In order to compare the performance of proposed algorithm with the KMP-based algorithm, we conducted experiments on 1d-OPPM problem. We performed two sets of experiments. In the first experiment, the pattern size is fixed to , while the text size is changed from to . In the second experiment, the text size is fixed to while the pattern size is changed from to . We measured the average of running time and the number of comparisons for repetitions on each experiment. We used randomly generated texts and patterns with alphabet size . Experiments are executed on a machine with Intel Xeon CPU E5-2609 8 cores 2.40 GHz, 256 GB memory, and Debian Wheezy operating system.
The results of our preliminary experiments are shown in Fig. 1 and Fig. 2. We can see that our algorithm is better than the KMP-based algorithm in running time and number of comparisons when the pattern size and text size are large. However, our algorithm is worse when the pattern size is small, less than .
5 Two-dimensional order preserving pattern matching
In this section, we discuss how to perform two-dimensional order-preserving pattern matching (2d-OPPM). Array indexing is used for two-dimensional strings: the horizontal coordinate $x$ increases from left to right and the vertical coordinate $y$ increases from top to bottom. $S[x, y]$ denotes the element of $S$ at position $(x, y)$ and $S[x : x+w-1, y : y+h-1]$ denotes the substring of $S$ of size $w \times h$ with top-left corner at position $(x, y)$.
We say that two-dimensional strings $S$ and $T$ of size $w \times h$ are order-isomorphic, written $S \approx T$, if $S[x, y] \le S[x', y'] \iff T[x, y] \le T[x', y']$ for all $1 \le x, x' \le w$ and $1 \le y, y' \le h$. For simplicity of presentation, we assume in this paper that both the text and the pattern are squares, but the generalization is straightforward.
Definition 2 (2d-OPPM problem).
The two-dimensional order-preserving matching problem is defined as follows.
Input: A text $T$ of size $n \times n$ and a pattern $P$ of size $m \times m$.
Output: All occurrences of substrings of $T$ that are order-isomorphic with $P$.
Our approach is to reduce the 2d-OPPM problem to the 1d-OPPM problem, based on the following observation. For a two-dimensional string $S$, let $\mathit{ser}(S)$ be the (one-dimensional) string obtained by serializing $S$, traversing it in left-to-right/top-to-bottom order. We can easily verify the following lemma.
Lemma 8.
$S \approx T$ if and only if $\mathit{ser}(S) \approx \mathit{ser}(T)$, for any $S$ and $T$.
Theorem 2.
The 2d-OPPM problem can be solved in $O(n^2 m + m^2 \log m)$ time.
Proof.
For a fixed $1 \le x \le n - m + 1$, consider the substring $T[x : x+m-1, 1 : n]$ and let $S_x = \mathit{ser}(T[x : x+m-1, 1 : n])$. By Lemma 8, $P$ occurs in $T$ at position $(x, y)$, i.e. $P \approx T[x : x+m-1, y : y+m-1]$, if and only if $\mathit{ser}(P) \approx S_x[m(y-1)+1 : m(y-1)+m^2]$. The positions satisfying the latter condition can be found in $O(nm)$ time by 1d-OPPM algorithms, either the one shown in Section 3 or the KMP-based ones [15, 14], because $|S_x| = nm$ and $|\mathit{ser}(P)| = m^2$. Because we need to preprocess the pattern only once and execute the search in $S_x$ for each $x$, the result follows. ∎
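The reduction in the proof above can be sketched in a few lines: cut the text into vertical strips as wide as the pattern, serialize each strip row by row, and test only the windows that start at a row boundary. This is a sketch under assumptions: a naive rank-signature comparison stands in for a real 1d-OPPM matcher, so it does not achieve the stated time bound; grids are lists of rows, reported positions are 0-indexed `(row, column)` pairs, and `signature` and `match_2d` are hypothetical names.

```python
def signature(s):
    """Dense ranks of s; two strings are order-isomorphic iff these agree."""
    rank = {v: r for r, v in enumerate(sorted(set(s)))}
    return [rank[v] for v in s]


def match_2d(text, pattern):
    """Return all (row, col) top-left corners where pattern occurs in text."""
    n, m = len(text), len(pattern)
    target = signature([v for row in pattern for v in row])  # serialized pattern
    hits = []
    for col in range(n - m + 1):
        # Serialize the n x m strip that starts at this column, row by row.
        strip = [v for row in text for v in row[col:col + m]]
        for row in range(n - m + 1):
            # Only windows starting at a multiple of m align with a sub-square.
            window = strip[row * m:(row + m) * m]
            if signature(window) == target:
                hits.append((row, col))
    return hits


text = [[1, 2, 0],
        [3, 4, 9],
        [5, 1, 1]]
print(match_2d(text, [[10, 20], [30, 40]]))  # [(0, 0)]
```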
In the rest of this paper, we try a direct approach to two-dimensional strings based on the duel-and-sweep paradigm, inspired by the work [2, 9]. A substring of $T$ of size $m \times m$ will be referred to as a candidate. $T_{x,y}$ denotes a candidate with the top-left corner at $(x, y)$.
5.1 Pattern preprocessing
For and , we say that a pair of locations is a witness pair for the offset if either of the following holds:
,
,
.
The witness table for pattern P is a two-dimensional array of size , where is a witness pair for the offset . If the overlap regions are order-isomorphic when P is superimposed with offset , then no witness pair exists. We denote it as .
We show how to efficiently construct the witness table . For P and each , we define the Z-array by
[TABLE]
where , , and .
Lemma 9.
For arbitrarily fixed , we can compute the value of in time and for each , assuming that is already computed.
Proof.
For an offset with , let us consider .
Case 1
: Note that the value is equal to the number of elements in the overlap region. Then , so that no witness pair exists for the offset .
Case 2
: There exists a witness pair , where is the location of the element in P, that corresponds to the -th element of . By a simple calculation, we can obtain the values in time. We can also compute from in time, similarly to the proof of Lemma 4, with the help of auxiliary arrays and . (Details are omitted.)
Symmetrically, we can compute it for . ∎
Lemma 10.
We can construct the witness table $\mathit{WIT}_P$ in $O(m^3)$ time.
Proof.
Assume that we sorted all elements of P. For an arbitrarily fixed , calculation of and takes time by using sorted P. can be constructed in time by Lemma 3. Furthermore, finding witness pairs for all offsets takes time by Lemma 9. Since there are such ’s to consider, can be constructed in time. ∎
5.2 Dueling stage
Similarly to Lemma 5, we can show the transitivity as follows.
Lemma 11.
For any , let us consider three candidates , , and . If is consistent with and is consistent with , then is consistent with .
The dueling algorithm due to Amir et al. [1] is also applicable to the problem.
Lemma 12 ([1]).
The dueling stage can be done in $O(n^2)$ time by using $\mathit{WIT}_P$.
5.3 Sweeping stage
This is the hardest part for two-dimensional strings. We first consider two surviving candidates $T_{x,y_1}$ and $T_{x,y_2}$ in some column $x$, with $y_1 < y_2$. If we traverse them in a top-to-bottom/left-to-right manner, we can reduce the problem to the one-dimensional order-preserving problem. Thus, performing the sweeping stage for one column takes $O(nm)$ time. Since there are $O(n)$ such columns, the sweeping stage takes $O(n^2 m)$ time.
Next, we propose a method that takes advantage of consistency relation in both horizontal and vertical directions. First, we construct strings for by serializing P in different way. We then compute and for , thus we can compare the order-isomorphism of the pattern with the text in several different ways. and for can be computed in time by sorting once and then calculated and by using the sorted . Fig. 4 shows for where . We also do the same computation for bottom-to-top/left-to-right traversing direction.
Let us consider two overlapping candidates and , where and . Suppose that is order-isomorphic with the pattern and we need to check . Since is consistent with , we need to check the order-isomorphism of the region of that is not an overlap region. We do this by using , where , without checking the overlap region. This idea is illustrated in Figure 5 (a). The procedure for is symmetrical.
Next, consider three overlapping candidates , and , such that and . We assume that and are both order-isomorphic with the pattern. If , we can use the method for two overlapping candidates that we described before to perform sweeping efficiently. However, if , as shown in Fig. 5 (b), we need to check the blue region twice, since we do not know the order-isomorphism relation between the blue region and the overlap region of and .
By using the above method, we can reduce the number of comparisons for sweep stage. However, the time complexity remains the same.
Lemma 13.
The sweeping stage can be completed in $O(n^2 m)$ time.
By Lemmas 10, 12, and 13, we conclude this section as follows.
Theorem 3.
The duel-and-sweep algorithm solves the 2d-OPPM problem in $O(n^2 m + m^3)$ time.
6 Discussion
At present, the time complexity of the duel-and-sweep algorithm for the 2d-OPPM problem in Theorem 3 is no better than that of the straightforward reduction to the 1d-OPPM problem explained in Theorem 2. We present this result as preliminary work on solving 2d-OPPM, and we hope that 2d-OPPM can be solved more efficiently by a more sophisticated method based on as yet unknown combinatorial properties, as Cole et al. [9] did for the two-dimensional parameterized matching problem. This is left for future work.
References
- [1] A. Amir, G. Benson, and M. Farach. An alphabet independent approach to two-dimensional pattern matching. SIAM Journal on Computing, 23(2):313–323, 1994.
- [2] A. Amir and M. Farach. Two-dimensional dictionary matching. Information Processing Letters, 44(5):233–239, 1992.
- [3] T. P. Baker. A technique for extending rapid exact-match string matching to arrays of more than one dimension. SIAM Journal on Computing, 7(4):533–541, 1978.
- [4] R. S. Bird. Two dimensional pattern matching. Information Processing Letters, 6(5):168–170, 1977.
- [5] D. Cantone, S. Faro, and M. O. Külekci. An efficient skip-search approach to the order-preserving pattern matching problem. In PSC, pages 22–35, 2015.
- [6] T. Chhabra, M. O. Külekci, and J. Tarhio. Alternative algorithms for order-preserving matching. In PSC, pages 36–46, 2015.
- [7] T. Chhabra and J. Tarhio. Order-preserving matching with filtration. In SEA, pages 307–314, 2014.
- [8] S. Cho, J. C. Na, K. Park, and J. S. Sim. A fast algorithm for order-preserving pattern matching. Information Processing Letters, 115(2):397–402, 2015.
