Order-Preserving Pattern Matching Indeterminate Strings
Diogo Costa, Luís M. S. Russo, Rui Henriques, Hideo Bannai, and Alexandre P. Francisco

TL;DR
This paper introduces the first polynomial-time algorithm for order-preserving pattern matching with indeterminate strings, enabling analysis of noisy time series and patterns with uncertain data, which was previously infeasible.
Contribution
It presents a novel polynomial algorithm for the $\mu$OPPM problem with indeterminate strings, extending exact OPPM to handle uncertainty in pattern and text.
Findings
Algorithm runs in $O(mr \lg r)$ time for one indeterminate string
Mappings to satisfiability problems for both pattern and text cases
Proves $\mu$OPPM is NP-hard in the general case
INESC-ID and Instituto Superior Técnico, Universidade de Lisboa, Portugal
Department of Computer Science, Kyushu University, Japan
Abstract
Given an indeterminate string pattern $p$ and an indeterminate string text $t$, the problem of order-preserving pattern matching with character uncertainties ($\mu$OPPM) is to find all substrings of $t$ that satisfy one of the possible orderings defined by $p$. When the text and pattern are determinate strings, we are in the presence of the well-studied exact order-preserving pattern matching (OPPM) problem with diverse applications on time series analysis. Despite its relevance, the exact OPPM problem suffers from two major drawbacks: 1) the inability to deal with indetermination in the text, thus preventing the analysis of noisy time series; and 2) the inability to deal with indetermination in the pattern, thus imposing the strict satisfaction of the orders among all pattern positions.
This paper provides the first polynomial algorithm to answer the $\mu$OPPM problem when indetermination is observed on the pattern or text. Given two strings with length $m$ and $O(r)$ uncertain characters per string position, we show that the $\mu$OPPM problem can be solved in $O(mr \lg r)$ time when one string is indeterminate and $r\in\mathbb{N}^{+}$. Mappings into satisfiability problems are provided when indetermination is observed on both the pattern and the text, and results concerning the general problem complexity are presented as well, with the $\mu$OPPM problem proved to be NP-hard in general.
keywords:
order-preserving pattern matching, indeterminate string analysis, generic pattern matching, satisfiability
1 Introduction
Given a pattern string $p$ and a text string $t$, the exact order preserving pattern matching (OPPM) problem is to find all substrings of $t$ with the same relative orders as $p$. The problem is applicable to strings with characters drawn from numeric or ordinal alphabets. Illustrating, given $p$=(1,5,3,3), a substring of $t$ is reported when it satisfies the character orders in $p$, $p[0]<p[2]=p[3]<p[1]$. Despite its relevance, the OPPM problem has limited potential since it prevents the specification of errors, uncertainties or don't care characters within the text.
Indeterminate strings allow uncertainties between two or more characters per position. Given indeterminate strings and , the problem of order preserving pattern matching uncertain text (OPPM) is to find all substrings of with an assignment of values that satisfy the orders defined by . For instance, let and . The substrings and are reported since there is an assignment of values that preserve either or orderings: respectively and .
Order-preserving pattern matching captures the structural isomorphism of strings, therefore having a wide range of relevant applications in the analysis of financial time series, musical sheets, physiological signals and biological sequences [1, 2, 3]. Uncertainties often occur across these domains. In this context, although the OPPM problem is already a relaxation of the traditional pattern matching problem, the need to further handle localized errors is essential to deal with noisy strings [4]. For instance, given the stochasticity of gene regulation (or markets), the discovery of order-preserving patterns in gene expression (or financial) time series needs to account for uncertainties [5, 6]. Numerical indexes of amino-acids (representing physiochemical and biochemical properties) are subject to errors that hinder the analysis of protein sequences [7]. Another example is ordinal strings obtained from the discretization of numerical strings, which often have two uncertain characters in positions where the original values are near a discretization boundary [4].
Let $m$ and $n$ be the lengths of the pattern $p$ and text $t$, respectively. The exact OPPM problem has a linear solution on the text length based on the Knuth-Morris-Pratt algorithm [8, 2, 9]. Alternative algorithms for the OPPM problem have also been proposed [10, 11, 12]. In contrast with the large attention given to the exact OPPM problem, to our knowledge there are no polynomial-time algorithms to solve the $\mu$OPPM problem. Naive algorithms for $\mu$OPPM assess all possible pattern and text assignments, whose number grows exponentially with the string length when up to $r$ uncertain characters are allowed per position.
This work proposes the first polynomial time algorithms able to answer the $\mu$OPPM problem. Accordingly, the contributions are organized as follows. First, we show that an indeterminate string of length $m$ order-preserving matches a determinate string with the same length in $O(mr \lg r)$ time based on their monotonic properties. Second, given two indeterminate strings with the same size, we provide a linear encoding of the $\mu$OPPM into a satisfiability formula with properties of interest. Furthermore, we extend this encoding and we present results concerning the computational complexity of $\mu$OPPM problem variations, namely a proof that the $\mu$OPPM problem is NP-hard in general. Third, given pattern and text strings with lengths $m$ and $n$, only one of them indeterminate, we show that the $\mu$OPPM problem can be solved in linear space and its average efficiency boosted under effective filtration procedures.
A preliminary version of this work was presented at the Annual Symposium on Combinatorial Pattern Matching (CPM) [13]. In this paper, we revise previous results and we present new results concerning the computational complexity of the $\mu$OPPM problem; Sections 3.3, 3.4 and 5 are new.
2 Background
Let be a totally ordered alphabet and an element of be a string. The length of a string is denoted by . The empty string is a string of length 0. For a string , , and are called a prefix, substring, and suffix of , respectively. The -th character of a string is denoted by for each . For a string and integers , denotes the substring of from position to position . For convenience, let when .
Given strings $x$ and $y$ with equal length $m$, $y$ is said to be order-preserving against $x$ [8], denoted by $x\approx y$, if the orders between the characters of $x$ and $y$ are the same, i.e. $x[i]\leq x[j]\Leftrightarrow y[i]\leq y[j]$ for any $0\leq i,j<m$. A non-empty pattern string $p$ is said to order-preserving match (op-match in short) a non-empty text string $t$ if and only if there is a position in $t$ where a substring of $t$ of length $|p|$ is order-preserving against $p$. The order-preserving pattern matching (OPPM) problem is to find all such text positions.
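The definition above can be checked directly. The following is a minimal sketch, not the linear-time KMP-based method cited in the text: it tests every pair of positions, so it is quadratic per window. The function names and the example text are illustrative choices, and matches are reported by their start position, which is an assumption of this sketch.

```python
def order_preserving(x, y):
    """True when x[i] <= x[j] <=> y[i] <= y[j] for every pair of positions."""
    if len(x) != len(y):
        return False
    m = len(x)
    return all((x[i] <= x[j]) == (y[i] <= y[j])
               for i in range(m) for j in range(m))


def op_match_positions(p, t):
    """Start positions of every window of t that p op-matches (naive scan)."""
    m = len(p)
    return [s for s in range(len(t) - m + 1)
            if order_preserving(p, t[s:s + m])]


# Illustrative strings chosen for this sketch.
pattern = (1, 5, 3, 3)
text = (5, 1, 4, 2, 2, 5, 2, 4)
print(op_match_positions(pattern, text))
```

Only the window (1,4,2,2) starting at position 1 has the same relative orders as the pattern.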
2.1 The Problem
Given a totally ordered alphabet , an indeterminate string is a sequence of disjunctive sets of characters where . Each position is given by where .
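Given an indeterminate string $x$, a valid assignment $\$x$ is a (determinate) string with a single character at each position $i$, $\$x[i]$, contained in the set of characters $x[i]$, i.e. $\$x[0]\in x[0]$, ..., $\$x[m-1]\in x[m-1]$. For instance, the indeterminate string $(1|3,3|4,2|3,1|2)$ has $2^{4}$ valid assignments. Given an indeterminate position $x[i]\subseteq\Sigma$, $\$x_{j}[i]$ is the $j^{th}$ ordered value of $x[i]$ (e.g. $\$x_{0}[i]$ is 1 when $x[i]=1|2$). Given an indeterminate string $x$, a partially assigned string $\S x$ is an indeterminate string obtained by removing an arbitrary number of uncertain characters, i.e. $\S x[0]\subseteq x[0]$, ..., $\S x[m-1]\subseteq x[m-1]$.
Given a determinate string $x$ of length $m$, an indeterminate string $y$ of equal length is said to be order-preserving against $x$, identically denoted by $x\approx y$, if there is a valid assignment $\$y$ such that the orders of $x$ and $\$y$ are the same, i.e. $x[i]\leq x[j]\Leftrightarrow\$y[i]\leq\$y[j]$ for any $0\leq i,j<m$. Given two indeterminate strings $x$ and $y$ of length $m$, $y$ is said to be order-preserving against $x$, $x\approx y$, if there is a valid assignment $\$y$ in $y$ that respects the orders of a valid assignment $\$x$ in $x$.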
A non-empty indeterminate pattern string $p$ is said to order-preserving match (op-match in short) a non-empty indeterminate text string $t$ if and only if there is a position in $t$ where a substring of $t$ of length $|p|$ is order-preserving against $p$. The problem of order-preserving pattern matching with character uncertainties ($\mu$OPPM) is to find all such text positions.
To understand the complexity of the $\mu$OPPM problem, let us look at its solution from a naive stance yet considering state-of-the-art OPPM principles. The algorithmic proposal by Kubica et al. [8] is still up to this date the one providing the lowest bound, +, where for alphabets of size ( otherwise). Given a determinate string of length , an integer () is said in the context of this work to be an order-preserving border of if . In this context, given a pattern string , the orders between the characters of are used to linearly infer the order borders. The order borders can then be used within the Knuth-Morris-Pratt algorithm to find op-matches against a text string in linear time [8].
Given a determinate string of length and an indeterminate string of length , the previous approach is a direct candidate to the $\mu$OPPM problem by decomposing into all its possible assignments, . Since determinate assignments to are only relevant in the context of -length windows, this approach can be improved to guarantee a maximum of assignments at each text position. Despite its simplicity, this solution is bounded by . This complexity is further increased when indetermination is also considered in the pattern, stressing the need for more efficient alternatives.
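The naive strategy described above, enumerating every assignment of an indeterminate window, can be sketched as follows. This is an illustrative brute force under stated assumptions: the indeterminate string is represented as a list of Python sets, the function names are hypothetical, and the example strings are chosen for this sketch. Its cost grows exponentially with the window length, which is the drawback the paragraph points out.

```python
from itertools import product


def order_preserving(x, y):
    """True when x[i] <= x[j] <=> y[i] <= y[j] for every pair of positions."""
    m = len(x)
    return len(y) == m and all((x[i] <= x[j]) == (y[i] <= y[j])
                               for i in range(m) for j in range(m))


def valid_assignments(x, y):
    """All assignments of the indeterminate y (list of sets) preserving x's orders."""
    return [cand for cand in product(*(sorted(s) for s in y))
            if order_preserving(x, cand)]


# Illustrative instance: one determinate and one indeterminate string.
x = (1, 4, 3, 1)
y = [{2}, {4, 5}, {3, 5}, {1, 2}]
print(valid_assignments(x, y))
```

With two uncertain positions of size two and one more of size two, eight candidates are enumerated and two of them survive.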
2.2 Related work
The exact OPPM problem is well-studied in literature. Kubica et al. [8], Kim et al. [2] and Cho et al. [9] presented linear time solutions on the text length by respectively combining order-borders, rank-based prefixes and grammars with the Knuth-Morris-Pratt (KMP) algorithm [14]. Cho et al. [10], Belazzougui et al. [11], and Chhabra et al. [12] presented algorithms that show a sublinear average complexity by either combining bad character heuristics with the Boyer-Moore algorithm [15] or applying filtration strategies. Recently, Chhabra et al. [16] proposed further principles to solve OPPM using word-size packed string matching instructions to enhance efficiency.
In the context of numeric strings, multiple relaxations to the exact pattern matching problem have been pursued to guarantee that approximate matches are retrieved. In norm matching [17, 18, 19, 20], matches between numeric strings occur if a given distance threshold is satisfied. In ($\delta$,$\gamma$)-matching [21, 22, 23, 24, 25, 26, 27], strings are matched if the maximum difference of the corresponding characters is at most $\delta$ and the sum of differences is at most $\gamma$.
In the context of nominal strings, variants of the pattern matching task have also been extensively studied to allow for don't care symbols in the pattern [28, 29, 30], transposition-invariant [25], parameterized matching [31, 32], less than matching [33], swapped matching [34, 35], gaps [36, 37, 38], overlap matching [39], and function matching [40, 41].
Despite the relevance of the aforementioned contributions to answer the exact order-preserving pattern matching and generic pattern matching, they cannot be straightforwardly extended to efficiently answer the OPPM problem.
3 On solving OPPM
Section 3.1 introduces the first efficient algorithm to solve the $\mu$OPPM problem when one string is indeterminate (). Section 3.2 discusses the existence of efficient solvers when both strings are indeterminate. Section 3.3 then introduces a polynomial time algorithm for the Alternate-OPPM as a subproblem of $\mu$OPPM where both strings may have indeterminate characters, but never in the same position. Given the formulations proposed in Section 3.2, we hypothesize that op-matching indeterminate strings with an arbitrary number of uncertain characters per position () is in class NPC. Furthermore, we show in Section 3.4 that the problem {3,3}-OPPM, defined as the subproblem of $\mu$OPPM where both the pattern and the text have indeterminate characters in any position (although at least one position must have at least three indeterminate characters in both pattern and text), is NP-hard. We still leave a gap in between these two groups, namely for the strings where there are at most two indeterminate characters in both strings at the same position. It remains open whether or not this problem is NP-hard.
3.1 $O(mr \lg r)$ time $\mu$OPPM when one string is indeterminate
Given a determinate string $x$ of length $m$, there is a well-defined permutation of positions, , that specifies a non-monotonic ascending order of characters in $x$. For instance, given $x$=(1,4,3,1), then and . Given a determinate string $y$ with the same length, $y$ op-matches $x$ if it satisfies the same $m-1$ orders. For instance, given and , orders are not preserved in $y$ since $y[0]\neq y[3]<y[2]<y[1]$.
The monotonic properties can be used to answer $\mu$OPPM when one string is indeterminate. Given an indeterminate string , let and be the permuted strings in accordance with orders in . To handle equality constraints, positions in with identical characters in can be intersected, producing a new string with length (). Illustrating, given =(4,1,4,2) and , then =(1,3,0,2), =(1,2,4,4), and . To handle monotonic inequalities, characters can be concatenated in descending order to compose and the orders between and verified by testing if the longest increasing subsequence (LIS) [42] of has length. In the given example, , and the LIS of is =(2,4,7). Since =3, op-matches .
Theorem 3.1.
Given a determinate string and an indeterminate string , let and be the sorted strings in accordance with order of characters in . Let the positions with equal characters in be intersected in to produce a new indeterminate string . Consider to be a string with characters in descending order and , then if and only if , where is a longest increasing subsequence in .
Proof.
If the length of the longest increasing subsequence (LIS), , equals the number of monotonic relations in , , then . By sorting characters in descending order per position, we guarantee that at most one character per position in appears in the LIS (respecting monotonic orders in given properties). By intersecting characters in positions of with identical characters in , we guarantee the eligibility of characters satisfying equality orders in , otherwise empty positions in are observed and the LIS length is less than . If , there is no assignment in that op-matches due to one of two reasons: 1) there are empty positions in due to the inability to satisfy equalities in , or 2) it is not possible to find a monotonically increasing assignment to and, given the properties of , cannot preserve the orders of .
Solving the LIS task on a string of size is [42] where . In addition, set intersection operations are performed times on sets with size, which can be accomplished in time. As a result, the $\mu$OPPM problem with one indeterminate string can be solved in .
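The LIS-based test of Theorem 3.1 can be sketched as below. This is a hedged illustration rather than the authors' implementation: the indeterminate string is assumed to be a list of Python sets, the helper names are hypothetical, and the indeterminate string in the usage example is an illustrative choice compatible with the determinate string (4,1,4,2) and the LIS (2,4,7) mentioned in the text. Positions sharing a character in the determinate string are intersected, each resulting set is written in descending order, and a strictly increasing subsequence must pick exactly one character per set.

```python
from bisect import bisect_left


def lis_length(seq):
    """Length of the longest strictly increasing subsequence (patience method)."""
    tails = []
    for value in seq:
        k = bisect_left(tails, value)
        if k == len(tails):
            tails.append(value)
        else:
            tails[k] = value
    return len(tails)


def op_match_lis(x, y):
    """x: determinate string; y: indeterminate string given as a list of sets."""
    groups = {}
    for i, c in enumerate(x):
        groups.setdefault(c, []).append(i)  # positions sharing a character of x
    z = []
    for c in sorted(groups):
        common = set.intersection(*(set(y[i]) for i in groups[c]))
        if not common:  # an equality of x cannot be satisfied
            return False
        # Descending order lets at most one character per set enter the LIS.
        z.extend(sorted(common, reverse=True))
    return lis_length(z) == len(groups)


print(op_match_lis((4, 1, 4, 2), [{2, 4, 7}, {2}, {7, 8}, {1, 4, 8}]))
```

In this example the concatenated string is (2,8,4,1,7), whose longest increasing subsequence (2,4,7) has one character per intersected position.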
Given the fact that the candidate string for the LIS task has properties of interest, we can improve the complexity of this calculus (Theorem 3.2) in accordance with Algorithm 1.
Theorem 3.2.
OPPM two strings of length , one being indeterminate, is in time, where .
Proof.
In accordance with Algorithm 1, OPPM is bounded by the verification of equalities, [43]. Testing inequalities after set intersections can be linearly performed on the size of , time, improving the bound given by the LIS calculus.
The analysis of Algorithm 1 further reveals that the $\mu$OPPM problem with one indeterminate string requires linear space in the text length, .
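Algorithm 1 itself is not reproduced in this text, so the following is only a sketch in its spirit, under the assumption that inequalities can be tested with a single left-to-right scan after the set intersections: at each step the smallest character strictly greater than the previously chosen one is kept. The function name and the list-of-sets representation are illustrative assumptions, and this sketch re-sorts each set, so it does not claim the bound of Theorem 3.2.

```python
def op_match_greedy(x, y):
    """Return an assignment of y (list of sets) preserving x's orders, or None."""
    groups = {}
    for i, c in enumerate(x):
        groups.setdefault(c, []).append(i)  # equal characters of x share a value
    assignment = [None] * len(x)
    previous = None
    for c in sorted(groups):
        common = set.intersection(*(set(y[i]) for i in groups[c]))
        # Keeping the smallest feasible value leaves the most room for later sets.
        candidates = [v for v in sorted(common)
                      if previous is None or v > previous]
        if not candidates:
            return None
        previous = candidates[0]
        for i in groups[c]:
            assignment[i] = previous
    return tuple(assignment)


print(op_match_greedy((4, 1, 4, 2), [{2, 4, 7}, {2}, {7, 8}, {1, 4, 8}]))
```

Unlike the LIS test, this variant also returns a witness assignment, which can be handy when the matched values themselves are of interest.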
3.2 OPPM with indeterminate pattern and text
As indetermination in real-world strings is typically observed between pairs of characters [4], a key question is whether OPPM on two indeterminate strings is in class P when . To explore this possibility, new concepts need to be introduced. In OPPM research, character orders in a determinate string of length can be decomposed in 3 sequences with unit sets:
Definition 3.3.
For :
( if there is no eligible ),
- 2.
( if there is no eligible ),
- 3.
( if there is no eligible ).
Leq, Lmax and Lmin capture , and relationships between each character in and the closest preceding character . These orders can be inferred in linear time for alphabets of size and in time for other alphabets by answering the 'all nearest smaller values' task on the sorted indexes [8]. Figure 1 depicts Leq, Lmax and Lmin for . Given determinate strings and , , and , if , then if and only if
[TABLE]
When allowing uncertainties between pairs of characters, previous research on the OPPM problem cannot be straightforwardly extended due to the need to trace assignments on indeterminate strings.
Lemma 3.4.
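Given a determinate string $x$, an indeterminate string $y$, and the singleton sets $A$, $B$ and $C$ (the Leq, Lmax and Lmin sets of position $t+1$), each containing a position in $x$. If $x[0..t]\approx y[0..t]$ is verified on a specific assignment of characters, denoted $\S y$, then $x[0..t+1]\approx y[0..t+1]$ if and only if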
[TABLE]
Proof.
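($\Rightarrow$) In accordance with the Leq, Lmax and Lmin definitions, for any $a$, $b$ and $c$ in the Leq, Lmax and Lmin sets of position $t+1$ (denoted $A$, $B$ and $C$) we have $x[t+1]=x[a]$, $x[t+1]>x[b]$ and $x[t+1]<x[c]$. If there is an assignment to $y$ in $\S y$ that preserves the orders of $x$, then for each $a\in A$, $b\in B$ and $c\in C$: $\$y[t+1]=\$y[a]$, $\$y[t+1]>\$y[b]$ and $\$y[t+1]<\$y[c]$ (where $\$y[t+1]\in\S y[t+1]$, $\$y[a]\in\S y[a]$, $\$y[b]\in\S y[b]$ and $\$y[c]\in\S y[c]$).
($\Leftarrow$) We need to show that $x[0..t+1]\approx y[0..t+1]$. Since $x[0..t]\approx y[0..t]$, it is sufficient to prove that for $i<t$: $\exists_{\$y[i]\in\S y[i],\$y[t+1]\in\S y[t+1]}\ x[t+1]>x[i]\Leftrightarrow\$y[t+1]>\$y[i]$. Assume $x[t+1]>x[i]$ for some $i\in\{0,\ldots,t\}$; then $\forall_{b\in B}\ x[b]>x[i]$. Since $x[0..t]$ op-matches an assignment $\$y[0..t]$ in $\S y[0..t]$, there are $\$y[i]\in\S y[i]$ and $\$y[b]\in\S y[b]$ with $\forall_{b\in B}\ \$y[b]>\$y[i]$. As $\forall_{b\in B}\ \$y[t+1]>\$y[b]$, it follows that $\$y[t+1]>\$y[i]$. The cases $x[t+1]<x[i]$ and $x[t+1]=x[i]$ are handled identically (giving $\$y[t+1]<\$y[i]$ and $\$y[t+1]=\$y[i]$), yielding the stated equivalence.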
Given two strings of equal length, the $\mu$OPPM problem can be schematically represented according to the identified order restrictions. Figure 2 represents restrictions on the indeterminate string $y$ in accordance with the observed orders in $x$. The left side edges are placed in accordance with Lemma 3.4 and capture assessments on the orders between pairs of characters. The right side edges capture incompatibilities detected after the assessments, i.e. pairs of characters that cannot be selected simultaneously (for instance, $\$y[0]=2$ and $\$y[3]=1$, or $\$y[1]=4$ and $\$y[2]=5$). For the given example, there are two valid assignments, $\$y_{1}=(2,4,3,2)$ and $\$y_{2}=(2,5,3,2)$, satisfying $x[0]=x[3]<x[2]<x[1]$, and thus $y$ op-matches $x$.
To verify whether there is an assignment that satisfies the identified ordering restrictions, we propose the reduction of the $\mu$OPPM problem to a Boolean satisfiability problem.
Given a set of Boolean variables, a formula in conjunctive normal form (CNF) is a conjunction of clauses, where each clause is a disjunction of literals, and a literal corresponds to a variable or its negation. Let a 2CNF formula be a formula in conjunctive normal form with at most two literals per clause. Given a CNF formula, the satisfiability (SAT) problem is to verify if there is an assignment of values to the Boolean variables such that the CNF formula is satisfied.
Theorem 3.5.
The OPPM problem over two strings of equal length, one being indeterminate, can be reduced to a satisfiability problem with the following CNF formula:
[TABLE]
Proof.
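Let us show that if $y$ op-matches $x$ then $\phi$ is satisfiable, and if $y$ does not op-match $x$ then $\phi$ is not satisfiable. When $x\approx y$, there is an assignment of values to $y$, $\$y$, that preserves the orders of $x$. In this case $\phi$ is satisfiable: every clause $\vee_{\$y[i]\in y[i]}z_{i,\$y[i]}$ is satisfied by the selected characters and no conflict clause $\neg z_{i,\$y[i]}\vee\neg z_{j,\$y[j]}$ is violated, i.e. $\exists_{\$y}\wedge_{i\in\{0..m-1\}}z_{i,\$y[i]}$ satisfies $\phi$. When $x$ does not op-match $y$, there is no $\$y\in y$ that preserves the orders of $x$; the conflicts $\neg z_{i,\$y[i]}\vee\neg z_{j,\$y[j]}$ then prevent some clause $\vee_{\$y[i]\in y[i]}z_{i,\$y[i]}$ from being satisfied, leading to a non-satisfiable formula.
If the established formula is satisfiable, there is a Boolean assignment to the variables that specifies an assignment of characters in $y$, $\$y$, preserving the orders of $x$, i.e. $\$y$ op-matches $x$. The formula $\phi$ has at most $r\times m$ variables, $\{z_{i,\sigma}\mid i\in\{0..m-1\},\ \sigma\in\Sigma\}$. The Boolean value assigned to $z_{i,\sigma}$ defines whether character $\sigma$ of $y[i]$ is selected in the assignment $\$y$ that op-matches $x$. The reduced formula in (1) is composed of two major types of clauses: $\vee_{\$y[i]\in y[i]}z_{i,\$y[i]}$, which selects at least one character per position, and $(\neg z_{i,\$y[i]}\vee\neg z_{j,\$y[j]}\vee\textsf{bool})$, where $\textsf{bool}$ states whether the required relation ($\$y[i]=\$y[j]$, $\$y[i]<\$y[j]$ or $\$y[i]>\$y[j]$) holds. If the order required in $y$, e.g. $\$y[i]>\$y[j]$, is not satisfied, the clause becomes $(\neg z_{i,\sigma_{1}}\vee\neg z_{j,\sigma_{2}})$, meaning that the $\sigma_{1}$ and $\sigma_{2}$ characters should not be selected simultaneously since they do not satisfy the orders defined by a given pattern. For instance, the pairs of characters in orange from Figure 2 should not be simultaneously selected due to order conflicts. To this end, the clauses $(\neg z_{0,2}\vee\neg z_{3,1})$ and $(\neg z_{1,4}\vee\neg z_{2,5})$ are included. Considering the problem $y\approx x$ with $y=(2,4|5,4|5,1|2)$ and $x=(1,4,3,1)$, schematically represented in Figure 2, the associated CNF formula is: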
[TABLE]
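The two clause types described in the proof can be generated mechanically. The sketch below is a simplified illustration, not the exact formula of Theorem 3.5: it compares every pair of positions instead of only the pairs given by Leq, Lmax and Lmin, and it checks satisfiability by brute force, which is only viable for tiny instances. Variables are represented as `(position, character)` tuples and literals as `(variable, polarity)` pairs; these representations and the function names are assumptions of this sketch.

```python
from itertools import product


def sign(a, b):
    """-1, 0 or 1 according to the order between a and b."""
    return (a > b) - (a < b)


def build_cnf(x, y):
    """Clauses for op-matching indeterminate y (list of sets) against determinate x."""
    clauses = []
    for i, chars in enumerate(y):
        # Select at least one character per position.
        clauses.append([((i, c), True) for c in sorted(chars)])
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            for a in sorted(y[i]):
                for b in sorted(y[j]):
                    if sign(a, b) != sign(x[i], x[j]):
                        # Conflicting characters cannot be selected together.
                        clauses.append([((i, a), False), ((j, b), False)])
    return clauses


def satisfiable(clauses):
    """Brute-force satisfiability check, exponential in the number of variables."""
    variables = sorted({var for clause in clauses for var, _ in clause})
    for bits in product([False, True], repeat=len(variables)):
        value = dict(zip(variables, bits))
        if all(any(value[var] == pol for var, pol in clause) for clause in clauses):
            return True
    return False


x = (1, 4, 3, 1)
y = [{2}, {4, 5}, {4, 5}, {1, 2}]
print(satisfiable(build_cnf(x, y)))
```

For this instance the generated clauses include the two conflicts named in the text, and the formula is satisfied by selecting the characters (2,5,4,2).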
Theorem 3.6.
Given two strings of length $m$, one being indeterminate with $r=2$, the $\mu$OPPM problem can be reduced to a 2SAT problem with a CNF formula of size linear in $m$.
Proof.
Given Theorem 1 and the fact that the reduced CNF formula has at most two literals per clause ($\phi$ is a composition of $\vee_{\$y[i]\in y[i]}z_{i,\$y[i]}$ clauses with $|y[i]|\in\{1,2\}$ and $(\neg z_{i,\$y[i]}\vee\neg z_{j,\$y[j]}\vee\textsf{bool})$ clauses), the $\mu$OPPM problem with $r=2$ reduces to a 2SAT formula bounded by $10m$:
1. [clauses that impose the selection of at least one character per position in $y$] Since $y$ has $m$ positions, and each position is either determinate (unitary clause) or defines an uncertainty between a pair of characters, there are $m$ clauses and at most $2m$ literals;
2. [clauses that define the ordering restrictions between two variables] A position in the indeterminate string needs to satisfy at most two order relations. Considering that , , and specify uncertainties between pairs of characters, there are up to 12 restrictions per position: 4 ordering restrictions between characters in and , and . Whenever the order between two characters is not satisfied, a clause is added per position, leading to at most clauses.
Theorem 3.7.
The $\mu$OPPM between determinate and indeterminate strings of equal length can be solved in linear time when $r=2$.
Proof.
Given the fact that a 2SAT problem can be solved in linear time [44], this proof directly derives from Theorem 3.6 as it guarantees the soundness of reducing $\mu$OPPM ($r=2$) to a 2SAT problem with a CNF formula of size linear in $m$.
Footnote: 2SAT problems have linear time and space solutions on the size of the input formula. Consider for instance the original proposal [44], the formula is modeled by a directed graph , with two nodes per variable in ( and ) and two directed edges for each clause (the equivalent implicative forms and ). Given , the strongly connected components (SCCs) of can be discovered in . During the traversal if a variable and its complement belong to the same SCC, then the procedure stops as is determined to be unsatisfiable. Given the fact that both and by Lemma 3.6, this procedure is time and space.
As the size of the mapped CNF formula is linear in $m$ and a valid algorithm to verify its satisfiability would require the construction of a graph with a linear number of nodes and edges, the required memory for the target $\mu$OPPM problem is also linear in $m$.
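The implication-graph procedure summarised above can be sketched as follows. This is a generic 2SAT solver based on Kosaraju's two-pass SCC computation, offered as an illustration rather than the exact procedure of [44]; the signed-integer literal encoding (`+v` for a variable, `-v` for its negation, with `v` starting at 1) and the function name are assumptions of this sketch.

```python
def solve_2sat(num_vars, clauses):
    """clauses: pairs of non-zero ints; returns a list of booleans or None."""
    n = 2 * num_vars

    def node(lit):
        # Even index for the positive literal, odd index for its negation.
        return 2 * (abs(lit) - 1) + (lit < 0)

    graph = [[] for _ in range(n)]
    reverse = [[] for _ in range(n)]
    for a, b in clauses:
        # (a or b) is equivalent to (not a -> b) and (not b -> a).
        for u, v in ((node(-a), node(b)), (node(-b), node(a))):
            graph[u].append(v)
            reverse[v].append(u)

    # First pass: record finishing order with an iterative depth-first search.
    order, seen = [], [False] * n
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack = [(start, 0)]
        while stack:
            u, k = stack.pop()
            if k < len(graph[u]):
                stack.append((u, k + 1))
                v = graph[u][k]
                if not seen[v]:
                    seen[v] = True
                    stack.append((v, 0))
            else:
                order.append(u)

    # Second pass: label SCCs on the reversed graph, in topological order.
    comp, label = [-1] * n, 0
    for start in reversed(order):
        if comp[start] != -1:
            continue
        comp[start] = label
        stack = [start]
        while stack:
            u = stack.pop()
            for v in reverse[u]:
                if comp[v] == -1:
                    comp[v] = label
                    stack.append(v)
        label += 1

    result = []
    for v in range(num_vars):
        if comp[2 * v] == comp[2 * v + 1]:
            return None  # a variable and its complement share an SCC
        result.append(comp[2 * v] > comp[2 * v + 1])
    return result


print(solve_2sat(3, [(1, 2), (-1, 2), (-2, 3)]))
```

A unit clause such as the one forcing a determinate position can be passed as a pair with the same literal twice, for example `(1, 1)`.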
When moving from one to two indeterminate strings, previous contributions are insufficient to answer the OPPM problem. In this context, the Leq, Lmax and Lmin vectors need to be redefined to be inferred from an indeterminate string:
Definition 3.8.
For :
1. $\textit{Leq}_{x}[i|j]=\{k\mid k<i,\ \exists_{p}\ \$x_{j}[i]=\$x_{p}[k]\}$ ($\emptyset$ if there is no eligible $k$),
2. $\textit{Lmax}_{x}[i|j]=\{k\mid k<i,\ \exists_{p}\ \$x_{j}[i]>\$x_{p}[k]\}$ ($\emptyset$ if there is no eligible $k$),
3. $\textit{Lmin}_{x}[i|j]=\{k\mid k<i,\ \exists_{p}\ \$x_{j}[i]<\$x_{p}[k]\}$ ($\emptyset$ if there is no eligible $k$).
Figure 3 schematically represents the order relationships of and the associated Leq, Lmax and Lmin vectors. In this scenario, needs to be verified not only against but also against in case is disregarded.
Remark 3.9.
Given Leq, Lmax and Lmin (Definition 3.8), there are order relationships when since each character in a given position establishes at most relationships with characters in preceding positions.
Lemma 3.10.
Given indeterminate strings $x$ and $y$, let the Leq, Lmax and Lmin sets (Definition 3.8) be the orders associated with $\$x_{j}[t+1]$. If $x[1..t]\approx y[1..t]$ is verified on a partial assignment of $y$, denoted $\S y$, then $x[1..t+1]\approx y[1..t+1]$ if and only if
[TABLE]
Proof.
Similar to the proof of Lemma 3.4, yet the order sets of Definition 3.3 are now replaced by the order sets of Definition 3.8. If there is an assignment to $y$ in $\S y$ that preserves one of the possible orders in $x$, then for any $a$, $b$ and $c$ in the Leq, Lmax and Lmin sets, respectively: $\$y[t+1]=\$y[a]$, $\$y[t+1]>\$y[b]$ and $\$y[t+1]<\$y[c]$ (where $\$y[t+1]\in\S y[t+1]$, $\$y[a]\in\S y[a]$, $\$y[b]\in\S y[b]$ and $\$y[c]\in\S y[c]$).
We need to show that $x[1..t+1]\approx y[1..t+1]$. Since $x[1..t]\approx y[1..t]$, it is sufficient to prove that for each preceding position $i$ there exist $\$x[i]\in\S x[i]$, $\$x[t+1]\in\S x[t+1]$, $\$y[i]\in\S y[i]$ and $\$y[t+1]\in\S y[t+1]$ such that $\$x[t+1]=\$x[i]\Leftrightarrow\$y[t+1]=\$y[i]$, $\$x[t+1]>\$x[i]\Leftrightarrow\$y[t+1]>\$y[i]$ and $\$x[t+1]<\$x[i]\Leftrightarrow\$y[t+1]<\$y[i]$. This results from Definition 3.8, the order-isomorphism property and Lemma 3.4.
Figure 4 represents encountered restrictions when op-matching $x$ and $y$. The right side edges capture the detected incompatibilities, i.e. pairs of characters that cannot be selected simultaneously. For the given example, there are 2 valid assignments, $\$y_{1}=(2,0,3)$ and $\$y_{2}=(2,0,4)$, satisfying $\$x_{0}[1]<\$x_{0}[0]<\$x_{0}[2]$, hence $x\approx y$.
To verify whether there is an assignment that satisfies the identified ordering restrictions, Theorem 2 extends the previously introduced SAT mapping given by (1).
Theorem 3.11.
Given Leq, Lmax and Lmin (DefinitionĀ 3.8), OPPM problem over two indeterminate strings of equal length can be reduced to a satisfiability problem with the following CNF formula:
[TABLE]
Proof.
If then is satisfiable, and if does not op-match then is not satisfiable.
When $x$ op-matches $y$, there is an assignment of values in $x$ and $y$ such that $\$x\approx\$y$. In this case $\phi$ is satisfiable: setting $z_{i,\$x[i],\$y[i]}$ to true for the $i^{\textit{th}}$ position violates no clause $\neg z_{i,\$x[i],\$y[i]}\vee\neg z_{j,\$x[j],\$y[j]}$, and the selected $z_{i,\$x[i],\$y[i]}$ variables satisfy the clauses of $\phi$ that require one $z_{i,\$x[i],\$y[i]}$ per $i^{\textit{th}}$ position. When $x$ does not op-match $y$, there are no $\$x\in x$ and $\$y\in y$ such that $\$x\approx\$y$, so no $z_{i,\$x[i],\$y[i]}$ can be selected for some $i^{\textit{th}}$ position, making the $\phi$ formula unsatisfiable.
If the formula in (2) is satisfiable, there is a Boolean assignment to the variables such that there is an assignment of characters in $y$, $\$y$, and in $x$, $\$x$, with $x\approx y$. For $r=2$, $\phi$ has at most $4m$ variables, $\{z_{i,\sigma_{1},\sigma_{2}}\mid i\in\{0\ldots m-1\},\ \sigma_{1},\sigma_{2}\in\Sigma\}$. The Boolean values assigned to these variables define whether characters $\sigma_{1}\in x[i]$ and $\sigma_{2}\in y[i]$ belong to an op-match. The reduced formula is composed of two major types of clauses:
1. Those in the first line of (2) ensure that at least one combination of characters, $\$x[i]$ and $\$y[i]$, is selected for the $i^{\textit{th}}$ position.
2. Remaining ones in (2) specify ordering constraints between pairs of characters; if the required relation ($\$y[i]=\$y[j]$, $\$y[i]>\$y[j]$ or $\$y[i]<\$y[j]$) is not satisfied, the clause becomes $(\neg z_{i,\sigma_{1}}\vee\neg z_{j,\sigma_{2}})$, meaning that these characters should not be selected simultaneously in the given positions (see Figure 4).
To instantiate the proposed mapping, consider and , schematically represented in FigureĀ 3. The associated CNF formula is:
[TABLE]
Theorem 3.12.
The OPPM problem for two indeterminate strings of equal length is reducible into a satisfiability problem over a CNF formula with size .
Proof.
The reduced formula in (2) is in the two conjunctive normal form (CNF) with at most clauses in the first line of (2) and a maximum of orders per position (Remark 3.9), totalling at most order conflicts between characters, from the restriction clauses in the remainder of (2).
Although we are no longer in the conditions of Theorem 3.7, namely because the above satisfiability formulation is not a 2SAT instance, given its unique properties, effective backtracking in accordance with the clauses in the first line of (2), as well as dedicated conflict pruning principles derived from the remaining clauses in (2), can be considered to develop efficient SAT solvers able to solve the $\mu$OPPM problem. And, as we will show later, we are not expected to do much better.
3.3 Polynomial time Alternate-OPPM
In this section, we define Alternate-OPPM as the subproblem of $\mu$OPPM where both strings ($x$ and $y$, interchangeable) may have indeterminate characters, but never in the same position; we show that Alternate-OPPM is polynomial in both the number of indeterminacies ($r$, which may be different in each position and string) and length of the strings ($m$). To do this, we will present a set of 2SAT clauses, in the form of implications, that can represent every constraint of this problem. We will first assume that there are no repeated characters within each string and then extend the reduction to handle equalities.
Given a string $x$ and position $i$, we represent the set of indeterminate characters as the ascending sequence where and . We will use only when the context leads to no ambiguities, or to mean the largest possible . All of our 2SAT variables will be of the form , meaning that the chosen value $\$x[i]$ is greater than or equal to $a_{j}$.
Consistency clauses
Here, we describe the clauses that maintain consistency between all the variables for individual positions. We only need to specify that, if we have chosen a value greater than $a_{j}$, we have also chosen a value greater than $a_{j-1}$, the value immediately below it, i.e.,
[TABLE]
This leads to a single clause per indeterminacy, per position, for both pattern and text, and so, at most, clauses.
Order clauses (Type I)
Here, we describe the clauses enforcing the order relation between each pair of positions. Given two strings $x$ and $y$, for positions $\alpha$ and $\beta$, if $\$x[\alpha]>\$x[\beta]$ then $\$y[\alpha]>\$y[\beta]$ (and similarly for the $<$ relation).
This first set of clauses applies to Type I (see Table 1). We only need to find the index (in each string) that separates the cases where $\$x[\alpha]>\$x[\beta]$ from those where $\$x[\alpha]<\$x[\beta]$ and add a single constraint expressing it.
Let be the lowest index such that and the lowest index such that , where and are as in Table 1. Then, we have
[TABLE]
This leads to two clauses for every pair of positions, and so, clauses.
Order clauses (Type II)
Finally, we have a second set of clauses that applies to Type II (see Table 2). Here, we have the order between and fixed already by whichever string or has no indeterminacies.
If , for every index indexing , and let be the lowest index such that . Then we add
[TABLE]
If there is no such , we add instead
[TABLE]
Similarly, if , for every index indexing , let be the lowest index such that . Then we add
[TABLE]
If there is no such , we add instead
[TABLE]
This leads to at most clauses for every pair of positions, and so clauses. Because character order is a transitive property, this type of clauses may be reduced to using a similar notion to the Lmax and Lmin sets introduced in Section 3.2 to consider only 'adjacent' (taking adjacent to mean the closest position of the same type) pairs of positions, instead of every pair.
Forcing choice
With the clauses specified above, we can find coherent solutions to the problem. However, it is possible to satisfy the formula by assigning all possible values for a given variable to false (effectively skipping the position). This has a straightforward solution, given the chosen encoding of the variables. Since each 2SAT variable represents a greater or equal value in the corresponding $\mu$OPPM position, the variable corresponding to the lowest value for each position is trivially true, letting us force a value choice with a single added clause. For every position, with variables , we add the clause , forcing it to be true to satisfy the 2SAT formula.
Extracting solutions
Finally, we need to extract the solution to the OPPM problem from the 2SAT solution. This is easily done in linear time by sweeping every variable in ascending order, in each position. In each position, with variables , we find the variable at index such that is true and is false. The chosen value in the OPPM problem, for the given position, is the value at index .
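The extraction step can be illustrated with a minimal sketch. It assumes that the candidate values of a position are stored in ascending order and that the truth value at index `i` means "the chosen value is greater than or equal to `values[i]`"; the function names `encode_position` and `decode_position` are hypothetical and not taken from the paper.

```python
def encode_position(values, chosen):
    """Order encoding of one position (illustrative helper).

    values: candidate characters of the position, in ascending order.
    Returns one boolean per candidate: True iff chosen >= that candidate.
    """
    return [chosen >= v for v in values]


def decode_position(values, assignment):
    """Recover the chosen value from a 2SAT assignment of one position.

    Sweeps the variables in ascending order and returns the value at the
    index i such that assignment[i] is true and assignment[i + 1] is false
    (or i is the last index).
    """
    for i in range(len(values)):
        is_last = i + 1 == len(values)
        if assignment[i] and (is_last or not assignment[i + 1]):
            return values[i]
    # Only reachable if the lowest variable is false, i.e. the forcing
    # clause of the position was not satisfied.
    raise ValueError("no value chosen for this position")


values = [2, 5, 9]
print(encode_position(values, 5))                 # [True, True, False]
print(decode_position(values, [True, True, False]))  # 5
```

The sweep visits each variable once, so decoding all positions is linear in the total number of variables.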
Dealing with equalities
We now turn to cases where characters match and show how to adapt the encoding above to equalities. Let us consider Type II equalities, first, where . The easy solution to this is the same as the one presented before. We preprocess the two strings by grouping all the repeats into a single position and intersecting their indeterminacies. For Type I equalities, we need to add clauses to each pair. Let be indexes such that and . We add
[TABLE]
If only exists (or ), we simply remove (or ) from the input, as such an assignment could never lead to a valid solution.
Pair incompatibility
All the clauses described above serve to maintain consistency between pairs. It may happen that a given pair is unsatisfiable by itself, and no clauses would be constructed. These cases can be dealt with separately, as a preprocessing step. If we find a pair that cannot be satisfied, we can terminate before finishing the construction, since there is no solution to the OPPM instance.
Theorem 3.13.
The Alternate-OPPM can be solved in time and space.
Proof.
This follows from the encoding above and, as in the proof of Theorem 3.7, from the fact that a 2SAT problem can be solved in linear time [44].
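For readers who want to experiment with the encoding, the following is a minimal sketch of a linear-time 2SAT solver based on strongly connected components of the implication graph (Kosaraju's two-pass method). It is not the implementation used by the authors, and the literal convention (`k` for variable `k`, `-k` for its negation) and the name `solve_2sat` are assumptions made for the example.

```python
def solve_2sat(num_vars, clauses):
    """Return a list of booleans (index 0 unused) or None if unsatisfiable.

    clauses: list of pairs (a, b) meaning (a OR b); literal k > 0 is
    variable k, literal -k is its negation.
    """
    def node(lit):
        return 2 * (abs(lit) - 1) + (1 if lit < 0 else 0)

    n = 2 * num_vars
    graph = [[] for _ in range(n)]
    rgraph = [[] for _ in range(n)]
    for a, b in clauses:
        # (a OR b) is equivalent to (NOT a -> b) and (NOT b -> a).
        for u, v in ((node(-a), node(b)), (node(-b), node(a))):
            graph[u].append(v)
            rgraph[v].append(u)

    # Pass 1: iterative DFS recording vertices by finishing time.
    order, seen = [], [False] * n
    for s in range(n):
        if seen[s]:
            continue
        seen[s] = True
        stack = [(s, 0)]
        while stack:
            u, i = stack.pop()
            if i < len(graph[u]):
                stack.append((u, i + 1))
                v = graph[u][i]
                if not seen[v]:
                    seen[v] = True
                    stack.append((v, 0))
            else:
                order.append(u)

    # Pass 2: DFS on the reversed graph in decreasing finishing time;
    # components are numbered in topological order of the condensation.
    comp, c = [-1] * n, 0
    for s in reversed(order):
        if comp[s] != -1:
            continue
        comp[s] = c
        stack = [s]
        while stack:
            u = stack.pop()
            for v in rgraph[u]:
                if comp[v] == -1:
                    comp[v] = c
                    stack.append(v)
        c += 1

    model = [None]
    for var in range(1, num_vars + 1):
        pos, neg = comp[node(var)], comp[node(-var)]
        if pos == neg:
            return None  # a variable and its negation are equivalent
        model.append(pos > neg)
    return model
```

Both passes touch every vertex and edge a constant number of times, so the running time is linear in the number of variables plus clauses.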
3.4 OPPM with 3 indeterminacies in both text and pattern is NP-hard
In this section, we define $\{3,3\}$-OPPM as the subproblem of OPPM where both the pattern and the text may have indeterminate characters in any position (although at least one position must have at least three indeterminate characters in both pattern and text) and prove it NP-hard (thus proving the same for general OPPM). We do this with a direct reduction from 3CNF-SAT, first presenting the construction and then the proof of equivalence between the two instances. The construction is similar to the one by Bose et al. for the permutation matching problem [45].
Construction
To ease the description of the construction itself, we start by describing how we represent an instance of 3CNF-SAT. First, we assume that every literal and clause has some ordering. We have a set of literals, and a set of clauses. Each clause is represented by two tuples, and . represents the index of literal of clause ; represents the value of the literal in clause , having the value of [math] for positive literals and for negative literals. For example, the clause would be represented by the two tuples and .
Although the designations of text or pattern are interchangeable in this section, we will use pattern for the simpler string (with fewer indeterminacies) and text for the more complicated string (with more indeterminacies). We use and for the pattern and text, respectively, or when they are interchangeable.
Both text and pattern have two parts, one representing literals and the other representing clauses. Each literal, and clause, has a single position in each string to represent it, dividing into and . In , we have a simple sequence of literals given by their indexes, so , for ; in we have a similar sequence, but each literal takes one of two variable values to represent an assignment of true or false, so or . We choose the larger value to represent the assignment of true. In , each position has three indeterminacies, corresponding to the three variables of the clause. In , we choose one of the three literals of the respective clause. For clause , with literals (regardless of their value being positive or negative), its position in , . In , as in we choose one of the literals, but now the value of the literal must satisfy the clause. For clause , , . An example of this construction is shown in Table 3.
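To make the construction concrete, the following sketch builds the two strings for a small formula and checks the op-match by brute force. The concrete values are an assumption: literal $i$ (numbered from 1) is given the text values $2i-1$ (false) and $2i$ (true), which is consistent with choosing the larger value for true and with reading even values as true, but the exact values used in the paper are not recoverable here. The names `build_strings`, `op_match` and `extract_assignment` are hypothetical, and the exhaustive search is only meant for tiny instances.

```python
from itertools import product


def build_strings(n, clauses):
    """clauses: list of 3-tuples of (literal_index, is_positive), indices 1..n.

    Returns (pattern, text); each position is a tuple of candidate values.
    """
    pattern = [(i,) for i in range(1, n + 1)]
    text = [(2 * i - 1, 2 * i) for i in range(1, n + 1)]  # assumed values
    for clause in clauses:
        # Pattern: choose one of the literals of the clause.
        pattern.append(tuple(sorted({i for i, _ in clause})))
        # Text: choose a literal value that satisfies the clause.
        text.append(tuple(sorted({2 * i if pos else 2 * i - 1
                                  for i, pos in clause})))
    return pattern, text


def order_isomorphic(x, y):
    m = len(x)
    return all((x[a] < x[b]) == (y[a] < y[b])
               for a in range(m) for b in range(m))


def op_match(pattern, text):
    """Exhaustive search; returns one witness (p_choice, t_choice) or None."""
    for p in product(*pattern):
        for t in product(*text):
            if order_isomorphic(p, t):
                return p, t
    return None


def extract_assignment(n, t_choice):
    """Even value means true, odd value means false (literal part only)."""
    return {i: t_choice[i - 1] % 2 == 0 for i in range(1, n + 1)}


# (x1 or x2 or x3) and (not x1 or not x2 or not x3)
formula = [((1, True), (2, True), (3, True)),
           ((1, False), (2, False), (3, False))]
pattern, text = build_strings(3, formula)
witness = op_match(pattern, text)
print(extract_assignment(3, witness[1]))
```

Under these assumed values, a clause position can only equal the literal position it refers to when the chosen literal value agrees with the assignment, which is the consistency argument used in the proof that follows.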
Lemma 3.14.
The construction above takes polynomial time.
Proof.
It is easy to see that, assuming that variables and clauses are numbered, we can simply scan the formula once to construct our two strings in linear time.
Lemma 3.15.
The initial 3CNF-SAT formula is satisfiable if and only if there is an order-isomorphic match between the two constructed strings.
Proof.
We start by showing how solving the OPPM instance solves the initial 3CNF-SAT instance. To solve OPPM, we need to choose exactly one value for each position in and that leads to two order-isomorphic strings. To extract the solution, we can limit ourselves to look at the initial part of , , which sets the value of each literal.
First, note that function is to maintain consistency between the values of literals chosen in . By choosing only literals in , and not their values, we force equality between all such literals. Because of order-isomorphism, this equality must be kept in , forcing a valid solution to use a single value for each literal (since different values match in but mismatch in ). If we choose a literal to be positive/negative at some position in , we force the value of that literal to be positive/negative at every position in .
Now, we focus on . Every clause has exactly one position in , and each of these positions has three choices of value, matching only the three values that satisfy a clause. Because we must choose one value in each position to solve our OPPM instance, we must choose one value that satisfies each clause, for every clause.
Putting these two properties together, to solve OPPM we must choose a literal value that satisfies each clause and those literals must have consistent values. This establishes the equivalence between the solutions of the two instances.
We can easily extract the solution from OPPM to 3CNF-SAT by checking whether the values in are even or odd, true or false, respectively. There is a unique solution to 3CNF-SAT given an OPPM solution.
To extract the solution from 3CNF-SAT to OPPM, we take the values assigned to each variable and choose the respective values in . Then, we need to choose values for and , which can easily be done by choosing any of the literals that satisfies its respective clause. There may be multiple OPPM solutions for a given 3CNF-SAT solution.
Theorem 3.16.
{3,3}-OPPM is NP-hard.
Proof.
Using Lemmas 3.14 and 3.15, we show that 3CNF-SAT $\leq_P$ $\{3,3\}$-OPPM by constructing an instance of OPPM in polynomial time. The solutions can also be retrieved and translated in polynomial time.
Theorem 3.17.
OPPM is NP-hard.
Proof.
Since $\{3,3\}$-OPPM is a particular case of OPPM, and it is NP-hard, OPPM is NP-hard.
4 Polynomial time OPPM
Lemma 4.1.
Given a pattern string of length and a text string of length , one being indeterminate, the OPPM problem can be solved in time.
Proof.
From Theorem 3.2, verifying if two strings of length op-match can be done in time (indetermination in one string) since at most verifications need to be performed.
Lemma 4.1 confirms that the OPPM problem with one indeterminate string is in class P. This lemma further triggers the research question "Is a tight bound to solve the OPPM?", here left as an open research question.
Irrespective of the answer, the analysis of the average complexity is of complementary relevance. State-of-the-art research on the exact OPPM problem shows that the average performance of algorithms in time can outperform linear time algorithms [12, 46, 47].
Motivated by the evidence gathered by these works, we suggest the use of filtration procedures to improve the average complexity of the proposed OPPM algorithm while still preserving its complexity bounds. A filtration procedure encodes the input pattern and text, and relies on this encoding to efficiently find positions in the text with a high likelihood to op-match a given pattern. Despite the diversity of string encodings, simplistic binary encodings are considered to be the state-of-the-art in OPPM research [12, 46]. In accordance with Chhabra et al. [12], a pattern can be mapped into a binary string expressing increases (1), equalities (0) and decreases (0) between subsequent positions. By searching for exact pattern matches of in an analogously transformed text string , we guarantee that the verification of whether and orders are preserved is only performed when exact binary matches occur. Illustrating, given and , then and , revealing two matches and : one spurious match and one true match .
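The filtration idea for determinate strings can be sketched as follows. This is a minimal illustration: the binary strings are compared with a naive window scan rather than SBNDM or Fast Shift-Or, the verification is a quadratic pairwise check, and the function names and the example strings are assumptions made for the sketch (positions are 0-indexed).

```python
def binary_encode(s):
    """1 for an increase between consecutive values, 0 for equality or decrease."""
    return [1 if b > a else 0 for a, b in zip(s, s[1:])]


def order_isomorphic(x, y):
    m = len(x)
    return all((x[a] < x[b]) == (y[a] < y[b])
               for a in range(m) for b in range(m))


def filtered_oppm(pattern, text):
    """Return (candidates, matches): start positions that pass the binary
    filter, and the subset that are true order-preserving matches."""
    m = len(pattern)
    p_enc, t_enc = binary_encode(pattern), binary_encode(text)
    candidates, matches = [], []
    for start in range(len(text) - m + 1):
        if t_enc[start:start + m - 1] == p_enc:
            candidates.append(start)
            # Verification is only performed on exact binary matches.
            if order_isomorphic(pattern, text[start:start + m]):
                matches.append(start)
    return candidates, matches


# Hypothetical example: the filter reports positions 0 and 3;
# (2, 5, 3) is a true match of (1, 3, 2), while (4, 6, 1) is spurious.
print(filtered_oppm([1, 3, 2], [2, 5, 3, 4, 6, 1]))  # ([0, 3], [0])
```

The filter never discards a true match, because order-isomorphic strings necessarily share the same sequence of increases and non-increases.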
When handling indeterminate strings, the concepts of increase, equality and decrease need to be redefined. Given an indeterminate string , consider if , if , and otherwise. Under this encoding, the pattern matching problem is identical under the additional guard that a character in always matches a do not care position, , and vice versa. Illustrating, given and , then and , leading to one true match, e.g. $t[3..5]=(6,3,5)$, $t[5..7]$. Exact pattern matching algorithms, such as Knuth-Morris-Pratt and Boyer-Moore, can be adapted to consider do not care positions while preserving complexity bounds [14, 15].
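A possible reading of this encoding is sketched below. The three cases are an assumption, since the exact definition is not reproduced here: a transition is encoded as 1 when every choice of values yields an increase, as 0 when every choice yields an equality or decrease, and as a do not care symbol otherwise. The names `encode_indeterminate`, `symbols_match` and `candidate_positions` are hypothetical, and the window scan is naive rather than an adapted Knuth-Morris-Pratt or Boyer-Moore.

```python
WILDCARD = "*"  # do not care symbol (assumed representation)


def encode_indeterminate(s):
    """s: list of tuples, each holding the candidate values of one position."""
    enc = []
    for cur, nxt in zip(s, s[1:]):
        if min(nxt) > max(cur):
            enc.append(1)         # every choice yields an increase
        elif max(nxt) <= min(cur):
            enc.append(0)         # every choice yields an equality or decrease
        else:
            enc.append(WILDCARD)  # the outcome depends on the choice
    return enc


def symbols_match(a, b):
    return a == b or a == WILDCARD or b == WILDCARD


def candidate_positions(pattern, text):
    """Start positions (0-indexed) that survive the filter; each still
    needs to be verified by the op-match procedure."""
    p_enc, t_enc = encode_indeterminate(pattern), encode_indeterminate(text)
    k = len(p_enc)
    return [s for s in range(len(t_enc) - k + 1)
            if all(symbols_match(p_enc[i], t_enc[s + i]) for i in range(k))]


# Hypothetical example: an indeterminate pattern against a determinate text.
pattern = [(1,), (2, 4), (3,)]
text = [(5,), (6,), (7,), (2,), (1,)]
print(encode_indeterminate(pattern))       # [1, '*']
print(candidate_positions(pattern, text))  # [0, 1]
```

Under this assumed encoding, a 1 can only clash with a 0 when no choice of values makes the two transitions agree, so no op-match is filtered out.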
The properties of the proposed encoding guarantee that the exact matches of in cannot skip any op-match of in . Thus, when combining the premises of Lemma 4.1 with the previous observation, we guarantee that the computed OPPM solution is sound.
The application of this simple filtration procedure prevents the recurring verifications times. Instead, the complexity of the proposed method to solve the OPPM problem becomes (when one string is indeterminate) where is the number of exact matches (). According to previous work on exact OPPM with filtration procedures [12], SBNDM2 and SBNDM4 algorithms [48] (Boyer-Moore variants) were suggested to match binary encodings. In the presence of small patterns, Fast Shift-Or (FSO) [49] can be alternatively applied [12].
A given string text can be read and encoded incrementally from the standard input as needed to perform OPPM, thus requiring space. When filtration procedures are considered, the aforementioned algorithms for exact pattern matching require space [12], thus OPPM space requirements are bounded by substring verifications (Section 3): space when one string is indeterminate and when indetermination is considered on both strings.
5 Open problem
We can look at the OPPM by the number and position of the indeterminate characters. We have shown that, for any number of indeterminacies, OPPM has a polynomial-time algorithm for indeterminate characters in a single string (Section 3.1), or in both strings, but never in both strings at the same position (Section 3.3). For indeterminate characters in both strings at the same position, we have also shown that for at least three indeterminacies (at select positions), the problem is NP-hard (Section 3.4).
There is a gap in between these two groups, however, for the strings where there are at most two indeterminate characters in both strings at the same position. It remains open whether or not this problem is NP-hard. Given that our reduction from Section 3.4 uses three indeterminate characters in both strings, it also remains open whether the problem with two indeterminate characters in one string and three in the other (at the same position) is NP-hard.
Following the pattern-avoidance precedent by Guillemot and Vialette [50] for the related problem of permutation matching, we note that, for the case of OPPM with at most two indeterminate characters (both strings, same position), there is a straightforward encoding in 2SAT for -avoiding strings, here taken to mean that, in a single string, for the pair of positions , the rank of the characters (only for the pair in question) is not in and in (with and let being interchangeable). The full problem, however, remains open.
6 Concluding remark
This work addressed the relevant yet scarcely studied problem of finding order-preserving pattern matches on indeterminate strings (OPPM). We showed that the problem has a linear time and space solution when one string is indeterminate. In addition, the OPPM problem (when both strings are indeterminate) was mapped into a satisfiability formula of polynomial size with two simple types of clauses, in order to study efficient solvers for the OPPM problem. Moreover, the OPPM problem was shown to be NP-hard in general. Finally, we showed that solvers of the OPPM problem can be boosted in the presence of filtration procedures, and we identified a still open problem concerning the computational complexity of the OPPM problem when restricted to at most two indeterminate characters in both strings at the same position.
Acknowledgments
This work was developed in the context of a secondment granted by the BIRDS MASC RISE project funded in part by EU H2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement no. 690941. This work was further supported by national funds through Fundação para a Ciência e Tecnologia (FCT), namely under projects PTDC/CCI-BIO/29676/2017, TUBITAK/0004/2014, SAICTPAC/0021/2015, and UID/CEC/50021/2019.
References
- [1] X. Ge, Pattern matching in financial time series data, final project report for ICS 278 (1998).
- [2] J. Kim, P. Eades, R. Fleischer, S.-H. Hong, C. S. Iliopoulos, K. Park, S. J. Puglisi, T. Tokuyama, Order-preserving matching, Theoretical Computer Science 525 (2014) 68–79.
- [3] R. Henriques, A. Paiva, Seven principles to mine flexible behavior from physiological signals for effective emotion recognition and description in affective interactions, in: PhyCS, 2014, pp. 75–82.
- [4] R. Henriques, Learning from high-dimensional data using local descriptive models, Ph.D. thesis, Instituto Superior Tecnico, Universidade de Lisboa, Lisboa (2016).
- [5] R. Henriques, S. C. Madeira, Bicspam: flexible biclustering using sequential patterns, BMC bioinformatics 15 (1) (2014) 130.
- [6] R. Henriques, C. Antunes, S. Madeira, Methods for the efficient discovery of large item-indexable sequential patterns, in: New Frontiers in Mining Complex Patterns, Vol. 8399 of LNCS, Springer International Publishing, 2014, pp. 100–116.
- [7] S. Kawashima, M. Kanehisa, Aaindex: amino acid index database, Nucleic acids research 28 (1) (2000) 374–374.
- [8] M. Kubica, T. Kulczyński, J. Radoszewski, W. Rytter, T. Waleń, A linear time algorithm for consecutive permutation pattern matching, Information Processing Letters 113 (12) (2013) 430–433.
