Order-Preserving Pattern Matching Indeterminate Strings

Diogo Costa; Lu\'is M. S. Russo; Rui Henriques; Hideo Bannai; and Alexandre P. Francisco

arXiv:1905.02589·cs.DS·May 8, 2019

Order-Preserving Pattern Matching Indeterminate Strings

Diogo Costa, Lu\'is M. S. Russo, Rui Henriques, Hideo Bannai, and Alexandre P. Francisco

PDF

TL;DR

This paper introduces the first polynomial-time algorithm for order-preserving pattern matching with indeterminate strings, enabling analysis of noisy time series and patterns with uncertain data, which was previously infeasible.

Contribution

It presents a novel polynomial algorithm for the $$OPPM problem with indeterminate strings, extending exact OPPM to handle uncertainty in pattern and text.

Findings

01

Algorithm runs in $O(mr\u2212lg r)$ time for one indeterminate string

02

Mappings to satisfiability problems for both pattern and text cases

03

Proves $$OPPM is NP-hard in the general case

Abstract

Given an indeterminate string pattern $p$ and an indeterminate string text $t$ , the problem of order-preserving pattern matching with character uncertainties ( $μ$ OPPM) is to find all substrings of $t$ that satisfy one of the possible orderings defined by $p$ . When the text and pattern are determinate strings, we are in the presence of the well-studied exact order-preserving pattern matching (OPPM) problem with diverse applications on time series analysis. Despite its relevance, the exact OPPM problem suffers from two major drawbacks: 1) the inability to deal with indetermination in the text, thus preventing the analysis of noisy time series; and 2) the inability to deal with indetermination in the pattern, thus imposing the strict satisfaction of the orders among all pattern positions. This paper provides the first polynomial algorithm to answer the $μ$ OPPM problem when…

Tables3

Table 1. Table 1: Type 1 1 1 of pairs we can have in Alternate- μ 𝜇 \mu OPPM.

$i$	$α$	$β$
$x$	$a$	$b_{0} \| \dots \| b_{r_{β} - 1}$
$y$	$a_{0} \| \dots \| a_{r_{α} - 1}$	$b$

Table 2. Table 2: Type 2 2 2 of pairs we can have in Alternate- μ 𝜇 \mu OPPM.

$i$	$α$	$β$
$x$	$a$	$b$
$y$	$a_{0} \| \dots \| a_{r_{α} - 1}$	$b_{0} \| \dots \| b_{r_{β} - 1}$

Table 3. Table 3: μ 𝜇 \mu OPPM instance corresponding to the 3CNF-SAT formula ( z 1 ∨ ¬ z 2 ∨ z 3 ) ∧ ( ¬ z 1 ∨ z 2 ∨ z 4 ) subscript 𝑧 1 subscript 𝑧 2 subscript 𝑧 3 subscript 𝑧 1 subscript 𝑧 2 subscript 𝑧 4 (z_{1}\lor\lnot z_{2}\lor z_{3})\land(\lnot z_{1}\lor z_{2}\lor z_{4}) .

$i$	$0$	$1$	$2$	$3$	$4$	$5$
Formula	$z_{1}$	$z_{2}$	$z_{3}$	$z_{4}$	$c_{1}$	$c_{2}$
Pattern	$1$	$2$	$3$	$4$	$1 \| 2 \| 3$	$1 \| 2 \| 4$
Text	$1 \| 2$	$3 \| 4$	$5 \| 6$	$7 \| 8$	$2 \| 3 \| 6$	$1 \| 4 \| 8$

Equations29

\forall_{a \in A} (y [t + 1] = y [a]) \land \forall_{b \in B} (y [t + 1] > y [b]) \land \forall_{c \in C} (y [t + 1] < y [c]) .

\forall_{a \in A} (y [t + 1] = y [a]) \land \forall_{b \in B} (y [t + 1] > y [b]) \land \forall_{c \in C} (y [t + 1] < y [c]) .

\exists_{$ y [t + 1] \in § y [t + 1]} \forall_{a \in A} \exists_{$ y [a] \in § y [a]} \forall_{b \in B} \exists_{$ y [b] \in § y [b]} \forall_{c \in C} \exists_{$ y [c] \in § y [c]} $ y [t + 1] = $ y [a] \land $ y [t + 1] > $ y [b] \land $ y [t + 1] < $ y [c]

\exists_{$ y [t + 1] \in § y [t + 1]} \forall_{a \in A} \exists_{$ y [a] \in § y [a]} \forall_{b \in B} \exists_{$ y [b] \in § y [b]} \forall_{c \in C} \exists_{$ y [c] \in § y [c]} $ y [t + 1] = $ y [a] \land $ y [t + 1] > $ y [b] \land $ y [t + 1] < $ y [c]

ϕ = i = 0 ⋀ m - 1 $ y [i] \in y [i] ⋁ z_{i, $ y [i]} \land i = 0 ⋀ m - 1 $ y [i] \in y [i] ⋀ j \in L e q [i] $ y [j] \in y [j] ⋀ (\neg z_{i, $ y [i]} \lor \neg z_{j, $ y [j]} \lor $ y [i] = $ y [j]) \land $ y [i] \in y [i] ⋀ j \in L ma x [i] $ y [j] \in y [j] ⋀ (\neg z_{i, $ y [i]} \lor \neg z_{j, $ y [j]} \lor $ y [i] > $ y [j]) \land $ y [i] \in y [i] ⋀ j \in L min [i] $ y [j] \in y [j] ⋀ (\neg z_{i, $ y [i]} \lor \neg z_{j, $ y [j]} \lor $ y [i] < $ y [j])

ϕ = i = 0 ⋀ m - 1 $ y [i] \in y [i] ⋁ z_{i, $ y [i]} \land i = 0 ⋀ m - 1 $ y [i] \in y [i] ⋀ j \in L e q [i] $ y [j] \in y [j] ⋀ (\neg z_{i, $ y [i]} \lor \neg z_{j, $ y [j]} \lor $ y [i] = $ y [j]) \land $ y [i] \in y [i] ⋀ j \in L ma x [i] $ y [j] \in y [j] ⋀ (\neg z_{i, $ y [i]} \lor \neg z_{j, $ y [j]} \lor $ y [i] > $ y [j]) \land $ y [i] \in y [i] ⋀ j \in L min [i] $ y [j] \in y [j] ⋀ (\neg z_{i, $ y [i]} \lor \neg z_{j, $ y [j]} \lor $ y [i] < $ y [j])

ϕ = z_{0, 2} \land (z_{1, 4} \lor z_{1, 5}) \land (z_{2, 4} \lor z_{2, 5}) \land (z_{3, 1} \lor z_{3, 2}) \land (\neg z_{0, 2} \lor \neg z_{3, 1}) \land (\neg z_{1, 4} \lor \neg z_{2, 5})

ϕ = z_{0, 2} \land (z_{1, 4} \lor z_{1, 5}) \land (z_{2, 4} \lor z_{2, 5}) \land (z_{3, 1} \lor z_{3, 2}) \land (\neg z_{0, 2} \lor \neg z_{3, 1}) \land (\neg z_{1, 4} \lor \neg z_{2, 5})

\exists_{j \in {0, 1}} \exists_{$ y [t + 1] \in § y [t + 1]} \forall_{a \in A_{j}, b \in B_{j}, c \in C_{j}} \exists_{$ y [a] \in § y [a], $ y [b] \in § y [b], $ y [c] \in § y [c]}

\exists_{j \in {0, 1}} \exists_{$ y [t + 1] \in § y [t + 1]} \forall_{a \in A_{j}, b \in B_{j}, c \in C_{j}} \exists_{$ y [a] \in § y [a], $ y [b] \in § y [b], $ y [c] \in § y [c]}

($ y [t + 1] = $ y [a] \land $ y [t + 1] > $ y [b] \land $ y [t + 1] < $ y [c])

ϕ = i = 0 ⋀ m - 1 $ y [i] \in y [i] $ x [i] \in x [i] ⋁ z_{i, $ x [i], $ y [i]} \land i = 0 ⋀ m - 1 $ y [i] \in y [i] $ x [i] \in x [i] ⋀ j \in L e q [i] ⋀ $ y [j] \in y [j] $ x [j] \in x [j] ⋀ (\neg z_{i, $ x [i], $ y [i]} \lor \neg z_{j, $ x [j], $ y [j]} \lor $ y [i] = $ y [j]) \land j \in L ma x [i] ⋀ $ y [j] \in y [j] $ x [j] \in x [j] ⋀ (\neg z_{i, $ x [i], $ y [i]} \lor \neg z_{j, $ x [j], $ y [j]} \lor $ y [i] > $ y [j]) \land j \in L min [i] ⋀ $ y [j] \in y [j] $ x [j] \in x [j] ⋀ (\neg z_{i, $ x [i], $ y [i]} \lor \neg z_{j, $ x [j], $ y [j]} \lor $ y [i] < $ y [j])

ϕ = i = 0 ⋀ m - 1 $ y [i] \in y [i] $ x [i] \in x [i] ⋁ z_{i, $ x [i], $ y [i]} \land i = 0 ⋀ m - 1 $ y [i] \in y [i] $ x [i] \in x [i] ⋀ j \in L e q [i] ⋀ $ y [j] \in y [j] $ x [j] \in x [j] ⋀ (\neg z_{i, $ x [i], $ y [i]} \lor \neg z_{j, $ x [j], $ y [j]} \lor $ y [i] = $ y [j]) \land j \in L ma x [i] ⋀ $ y [j] \in y [j] $ x [j] \in x [j] ⋀ (\neg z_{i, $ x [i], $ y [i]} \lor \neg z_{j, $ x [j], $ y [j]} \lor $ y [i] > $ y [j]) \land j \in L min [i] ⋀ $ y [j] \in y [j] $ x [j] \in x [j] ⋀ (\neg z_{i, $ x [i], $ y [i]} \lor \neg z_{j, $ x [j], $ y [j]} \lor $ y [i] < $ y [j])

ϕ = z_{0, 2, 2} \land (z_{1, 1, 0} \lor z_{1, 3, 0}) \land (z_{2, 3, 3} \lor z_{2, 3, 4}) \land (\neg z_{0, 2, 0} \lor \neg z_{1, 3, 0}) \land (\neg z_{1, 3, 0} \lor \neg z_{2, 3, 3}) \land (\neg z_{1, 3, 0} \lor \neg z_{2, 3, 4})

ϕ = z_{0, 2, 2} \land (z_{1, 1, 0} \lor z_{1, 3, 0}) \land (z_{2, 3, 3} \lor z_{2, 3, 4}) \land (\neg z_{0, 2, 0} \lor \neg z_{1, 3, 0}) \land (\neg z_{1, 3, 0} \lor \neg z_{2, 3, 3}) \land (\neg z_{1, 3, 0} \lor \neg z_{2, 3, 4})

\forall_{i \in [1, r - 1]} (g_{a_{i}} ⟹ g_{a_{i - 1}}) .

\forall_{i \in [1, r - 1]} (g_{a_{i}} ⟹ g_{a_{i - 1}}) .

g_{b_{i}} ⟹ \neg g_{a_{j}}, \neg g_{b_{i}} ⟹ g_{a_{j}} .

g_{b_{i}} ⟹ \neg g_{a_{j}}, \neg g_{b_{i}} ⟹ g_{a_{j}} .

g_{b_{i}} ⟹ g_{a_{j}} .

g_{b_{i}} ⟹ g_{a_{j}} .

\neg g_{b_{i}} .

\neg g_{b_{i}} .

g_{a_{i}} ⟹ g_{b_{j}} .

g_{a_{i}} ⟹ g_{b_{j}} .

\neg g_{a_{i}} .

\neg g_{a_{i}} .

g_{b_{i}} g_{a_{j}} \neg g_{b_{i + 1}} \neg g_{a_{j + 1}} ⟹ g_{a_{j}}, ⟹ g_{b_{i}}, ⟹ \neg g_{a_{j + 1}}, ⟹ \neg g_{b_{i + 1}} .

g_{b_{i}} g_{a_{j}} \neg g_{b_{i + 1}} \neg g_{a_{j + 1}} ⟹ g_{a_{j}}, ⟹ g_{b_{i}}, ⟹ \neg g_{a_{j + 1}}, ⟹ \neg g_{b_{i + 1}} .

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Order-Preserving Pattern Matching Indeterminate Strings

Diogo Costa

Luís M. S. Russo

Rui Henriques

Hideo Bannai

Alexandre P. Francisco

[email protected]

INESC-ID and Instituto Superior Técnico, Universidade de Lisboa, Portugal

Department of Computer Science, Kyushu University, Japan

Abstract

Given an indeterminate string pattern $p$ and an indeterminate string text $t$ , the problem of order-preserving pattern matching with character uncertainties ( $\mu$ OPPM) is to find all substrings of $t$ that satisfy one of the possible orderings defined by $p$ . When the text and pattern are determinate strings, we are in the presence of the well-studied exact order-preserving pattern matching (OPPM) problem with diverse applications on time series analysis. Despite its relevance, the exact OPPM problem suffers from two major drawbacks: 1) the inability to deal with indetermination in the text, thus preventing the analysis of noisy time series; and 2) the inability to deal with indetermination in the pattern, thus imposing the strict satisfaction of the orders among all pattern positions.

This paper provides the first polynomial algorithm to answer the $\mu$ OPPM problem when indetermination is observed on the pattern or text. Given two strings with length $m$ and $O(r)$ uncertain characters per string position, we show that the $\mu$ OPPM problem can be solved in $O(mr\lg r)$ time when one string is indeterminate and $r\in\mathbb{N}^{+}$ . Mappings into satisfiability problems are provided when indetermination is observed on both the pattern and the text, and results concerning the general problem complexity are presented as well, with $\mu$ OPPM problem proved to be NP-hard in general.

keywords:

order-preserving pattern matching, indeterminate string analysis, generic pattern matching, satisfiability

††journal: Journal of LaTeX Templates

1 Introduction

Given a pattern string $p$ and a text string $t$ , the exact order preserving pattern matching (OPPM) problem is to find all substrings of $t$ with the same relative orders as $p$ . The problem is applicable to strings with characters drawn from numeric or ordinal alphabets. Illustrating, given $p$ =(1,5,3,3) and $t=(5,1,4,2,2,5,2,4)$ , substring $t[1..4]=(1,4,2,2)$ is reported since it satisfies the character orders in $p$ , $p[0]\leq p[2]=p[3]\leq p[1]$ . Despite its relevance, the OPPM problem has limited potential since it prevents the specification of errors, uncertainties or don’t care characters within the text.

Indeterminate strings allow uncertainties between two or more characters per position. Given indeterminate strings $p$ and $t$ , the problem of order preserving pattern matching uncertain text ( $\mu$ OPPM) is to find all substrings of $t$ with an assignment of values that satisfy the orders defined by $p$ . For instance, let $p=(1,2|5,3,3)$ and $t=(5,0,1,2|1,2,5,2|3,3|4)$ . The substrings $t[1..4]$ and $t[4..7]$ are reported since there is an assignment of values that preserve either $p[0]<p[1]<p[2]=p[3]$ or $p[0]<p[2]=p[3]<p[1]$ orderings: respectively $t[1..4]=(0,1,2,2)$ and $t[4..7]=(2,5,3,3)$ .

Order-preserving pattern matching captures the structural isomorphism of strings, therefore having a wide-range of relevant applications in the analysis of financial times series, musical sheets, physiological signals and biological sequences [1, 2, 3]. Uncertainties often occur across these domains. In this context, although the OPPM problem is already a relaxation of the traditional pattern matching problem, the need to further handle localized errors is essential to deal with noisy strings [4]. For instance, given the stochasticity of gene regulation (or markets), the discovery of order-preserving patterns in gene expression (or financial) time series needs to account for uncertainties [5, 6]. Numerical indexes of amino-acids (representing physiochemical and biochemical properties) are subjected to errors difficulting the analysis of protein sequences [7]. Another example are ordinal strings obtained from the discretization of numerical strings, often having two uncertain characters in positions where the original values are near a discretization boundary [4].

Let $m$ and $n$ be the length of the pattern $p$ and text $t$ , respectively. The exact OPPM problem has a linear solution on the text length $O(n+m\lg m)$ based on the Knuth-Morris-Pratt algorithm [8, 2, 9]. Alternative algorithms for the OPPM problem have also been proposed [10, 11, 12]. Contrasting with the large attention given to the resolution of the OPPM problem, to our knowledge there are no polynomial-time algorithms to solve the $\mu$ OPPM problem. Naive algorithms for $\mu$ OPPM assess all possible pattern and text assignments, bounded by $O(nr^{m})$ when considering up to $r$ uncertain characters per position.

This work proposes the first polynomial time algorithms able to answer the $\mu$ OPPM problem. Accordingly, the contributions are organized as follows. First, we show that an indeterminate string of length $m$ order-preserving matches a determinate string with the same length in $O(mr\lg r)$ time based on their monotonic properties. Second, and given two indeterminate strings with the same size, we provide a linear encoding of the $\mu$ OPPM into a satisfiability formula with properties of interest. Furthermore, we extend this encoding and we present results concerning the computational complexity of $\mu$ OPPM problem variations, namely a proof of that the $\mu$ OPPM problem is NP-hard in general. Third, given a pattern and text strings with lengths $m$ and $n$ , only one of them indeterminate, we show that the $\mu$ OPPM problem can be solved in linear space and its average efficiency boosted under effective filtration procedures.

A preliminary version of this work was presented at the Annual Symposium on Combinatorial Pattern Matching (CPM) [13]. In this paper, we revise previous results and we present new results concerning the computational complexity of $\mu$ OPPM problem; Sections 3.3, 3.4 and 5 are new.

2 Background

Let $\Sigma$ be a totally ordered alphabet and an element of $\Sigma^{\ast}$ be a string. The length of a string $w$ is denoted by $|w|$ . The empty string $\varepsilon$ is a string of length 0. For a string $w=xyz$ , $x$ , $y$ and $z$ are called a prefix, substring, and suffix of $w$ , respectively. The $i$ -th character of a string $w$ is denoted by $w[i]$ for each $0\leq i<|w|$ . For a string $w$ and integers $0\leq i\leq j<|w|$ , $w[i..j]$ denotes the substring of $w$ from position $i$ to position $j$ . For convenience, let $w[i..j]=\varepsilon$ when $i>j$ .

Given strings $x$ and $y$ with equal length $m$ , $y$ is said to order-preserving against $x$ [8], denoted by $x\approx y$ , if the orders between the characters of $x$ and $y$ are the same, i.e. $x[i]\leq x[j]\Leftrightarrow y[i]\leq y[j]$ for any $0\leq i,j<m$ . A non-empty pattern string $p$ is said to order-preserving match (op-match in short) a non-empty text string $t$ if and only if there is a position $i$ in $t$ such that $p\approx t[i-|p|+1..i]$ . The order-preserving pattern matching (OPPM) problem is to find all such text positions.

2.1 The Problem

Given a totally ordered alphabet $\Sigma$ , an indeterminate string is a sequence of disjunctive sets of characters $x[0]x[1]..x[n-1]$ where $x[i]\subseteq\Sigma$ . Each position is given by $x[i]=\sigma_{1}..\sigma_{r}$ where $r\geq 1\wedge\sigma_{i}\in\Sigma$ .

Given an indeterminate string $x$ , a valid assignment $\$ x $is a (determinate) string with a single character at position$ i $, denoted$ $x[i] $, contained in the$ x[i] $set of characters, i.e.$ $x[0]\in x[0] $, …,$ $x[m-1]\in x[m-1] $. For instance, the indeterminate string$ (1|3,3|4,2|3,1|2) $has$ 2^{4} $valid assignments. Given an indeterminate position$ x[i]\subseteq\Sigma $,$ $x_{j}[i] $is the$ j^{th} $ordered value of$ x[i] $(e.g.$ $x_{0}[i] $=1 for$ x[i]=1|2 $). Given an indeterminate string$ x $, let a *partially assigned* string$ \S x $be an indeterminate string with an arbitrary number of uncertain characters removed, i.e.$ \S x[0]\subseteq x[0] $, …,$ \S x[m-1]\subseteq x[m-1]$.

Given a determinate string $x$ of length $m$ , an indeterminate string $y$ of equal length is said to be order-preserving against $x$ , identically denoted by $x\approx y$ , if there is a valid assignment $\$ y $such that the relative orders of the characters in$ x $and$ $y $are the same, i.e.$ x[i]\leq x[j]\Leftrightarrow$y[i]\leq$y[j] $for any$ 0\leq i,j<m $. Given two indeterminate strings$ x $and$ y $with length$ m $,$ y $preserves the orders of$ x $,$ x\approx y $, if exists$ $y $in$ y $that respects the orders of a valid assignment$ $x $in$ x$.

A non-empty indeterminate pattern string $p$ is said to order-preserving match (op-match in short) a non-empty indeterminate text string $t$ if and only if there is a position $i$ in $t$ such that $p\approx t[i-|p|+1..i]$ . The problem of order-preserving pattern matching with character uncertainties ( $\mu$ OPPM) problem is to find all such text positions.

To understand the complexity of the $\mu$ OPPM problem, let us look to its solution from a naive stance yet considering state-of-the-art OPPM principles. The algorithmic proposal by Kubica et al. [8] is still up to this date the one providing a lowest bound, $O(n$ + $q)$ , where $q=m$ for alphabets of size $m^{O(1)}$ ( $q=m\lg m$ otherwise). Given a determinate string $x$ of length $m$ , an integer $i$ ( $0\leq i<m$ ) is said in the context of this work to be an order-preserving border of $x$ if $x[0..i]\approx x[m-i+1..m]$ . In this context, given a pattern string $p$ , the orders between the characters of $p$ are used to linearly infer the order borders. The order borders can then be used within the Knuth-Morris-Pratt algorithm to find op-matches against a text string $t$ in linear time [8].

Given a determinate string $p$ of length $m$ and an indeterminate string $t$ of length $n$ , the previous approach is a direct candidate to the $\mu$ OPPM problem by decomposing $t$ in all its possible assignments, $O(r^{n})$ . Since determinate assignments to $t$ are only relevant in the context of $m$ -length windows, this approach can be improved to guarantee a maximum of $O(r^{m})$ assignments at each text position. Despite its simplicity, this solution is bounded by $O(nr^{m})$ . This complexity is further increased when indetermination is also considered in the pattern, stressing the need for more efficient alternatives..

2.2 Related work

The exact OPPM problem is well-studied in literature. Kubica et al. [8], Kim et al. [2] and Cho et al. [9] presented linear time solutions on the text length by respectively combining order-borders, rank-based prefixes and grammars with the Knuth–Morris–Pratt (KMP) algorithm [14]. Cho et al. [10], Belazzougui et al. [11], and Chhabra et al. [12] presented $O(nm)$ algorithms that show a sublinear average complexity by either combining bad character heuristics with the Boyer–Moore algorithm [15] or applying filtration strategies. Recently, Chhabra et al. [16] proposed further principles to solve OPPM using word-size packed string matching instructions to enhance efficiency.

In the context of numeric strings, multiple relaxations to the exact pattern matching problem have been pursued to guarantee that approximate matches are retrieved. In norm matching [17, 18, 19, 20], matches between numeric strings occur if a given distance threshold $f(x,y)\leq\theta$ is satisfied. In ( $\delta$ , $\gamma$ )-matching [21, 22, 23, 24, 25, 26, 27], strings are matched if the maximum difference of the corresponding characters is at most $\delta$ and the sum of differences is at most $\gamma$ .

In the context of nominal strings, variants of the pattern matching task have also been extensively studied to allow for don’t care symbols in the pattern [28, 29, 30], transposition-invariant [25], parameterized matching [31, 32], less than matching [33], swapped matching [34, 35], gaps [36, 37, 38], overlap matching [39], and function matching [40, 41].

Despite the relevance of the aforementioned contributions to answer the exact order-preserving pattern matching and generic pattern matching, they cannot be straightforwardly extended to efficiently answer the $\mu$ OPPM problem.

3 On solving $\mu$ OPPM

Section 3.1 introduces the first efficient algorithm to solve the $\mu$ OPPM problem when one string is indeterminate ( $r\in\mathbb{N}^{+}$ ). Section 3.2 discusses the existence of efficient solvers when both strings are indeterminate. Section 3.3 introduces then a polynomial time algorithm for the Alternate- $\mu$ OPPM as a subproblem of $\mu$ OPPM where both strings may have indeterminate characters, but never in the same position. Given the formulations proposed in Section 3.2, we hypothesize that op-matching indeterminate strings with an arbitrary number of uncertain characters per position ( $r\in\mathbb{N}^{+}$ ) is in class NPC. Furthermore, we show in Section 3.4 that the problem {3,3}- $\mu$ OPPM, defined as the subproblem of $\mu$ OPPM where both the pattern and the text have indeterminate characters in any position (although at least one position must have at least three indeterminate characters in both pattern and text), is NP-hard. We still leave a gap in between these two groups, namely for the strings where there are at most two indeterminate characters in both strings at the same position. It remains open whether or not this problem is NP-hard.

3.1 $O(mr\lg r)$ time $\mu$ OPPM when one string is indeterminate

Given a determinate string $x$ of length $m$ , there is a well-defined permutation of positions, $\pi$ , that specifies a non-monotonic ascending order of characters in $x$ . For instance, given $x$ =(1,4,3,1), then $x[0]=x[3]<x[2]<x[1]$ and $\pi=(0,3,2,1)$ . Given a determinate string $y$ with the same length, $y$ op-matches $x$ if it $y$ satisfies the same $m$ -1 orders. For instance, given $x=(1,4,3,1)$ and $y=(2,5,4,3)$ , $x$ orders are not preserved in $y$ since $y[0]\mathbin{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}\neq}}y[3]<y[2]<y[1]$ .

The monotonic properties can be used to answer $\mu$ OPPM when one string is indeterminate. Given an indeterminate string $y$ , let $x_{\pi}$ and $y_{\pi}$ be the permuted strings in accordance with $\pi$ orders in $x$ . To handle equality constraints, positions in $y_{\pi}$ with identical characters in $x_{\pi}$ can be intersected, producing a new string $y^{\prime}_{\pi}$ with $s$ length ( $s\leq m$ ). Illustrating, given $x$ =(4,1,4,2) and $y=(2|7,2,7|8,1|4|8)$ , then ${\pi}$ =(1,3,0,2), $x_{\pi}$ =(1,2,4,4), $y_{\pi}=(2,8|4|1,7|2,8|7)$ and $y^{\prime}_{\pi}=(y_{\pi}[0],y_{\pi}[1],$ $y_{\pi}[2]\cap y_{\pi}[3])=(2,8|4|1,7)$ . To handle monotonic inequalities, $y^{\prime}_{\pi}[i]$ characters can be concatenated in descending order to compose $z=y^{\prime}_{\pi}[0]y^{\prime}_{\pi}[1]..y^{\prime}_{\pi}[s]$ and the orders between $x$ and $y$ verified by testing if the longest increasing subsequence (LIS) [42] of $z$ has $s$ length. In the given example, $z=(2,8,4,1,7)$ , and the LIS of $z=(\textbf{2},8,\textbf{4},1,\textbf{7})$ is $w$ =(2,4,7). Since $|w|=|y^{\prime}_{\pi}|$ =3, $y$ op-matches $x$ .

Theorem 3.1.

Given a determinate string $x$ and an indeterminate string $y$ , let $x_{\pi}$ and $y_{\pi}$ be the sorted strings in accordance with $\pi$ order of characters in $x$ . Let the positions with equal characters in $x_{\pi}$ be intersected in $y_{\pi}$ to produce a new indeterminate string $y^{\prime}_{\pi}$ . Consider $z_{i}$ to be a string with $y^{\prime}_{\pi}[i]$ characters in descending order and $z=z_{1}z_{2}..z_{m}$ , then $|w|=|y^{\prime}_{\pi}|$ if and only if $y\approx x$ , where $w$ is a longest increasing subsequence in $z$ .

Proof.

$(\Rightarrow)$ If the length of the longest increasing subsequence (LIS), $|w|$ , equals the number of monotonic relations in $x$ , $|y^{\prime}_{\pi}|$ , then $y\approx x$ . By sorting characters in descending order per position, we guarantee that at most one character per position in $y^{\prime}_{\pi}$ appears in the LIS (respecting monotonic orders in $x$ given $y^{\prime}_{\pi}$ properties). By intersecting characters in positions of $y$ with identical characters in $x$ , we guarantee the eligibility of characters satisfying equality orders in $x$ , otherwise empty positions in $y^{\prime}_{\pi}$ are observed and the LIS length is less than $|y^{\prime}_{\pi}|$ . $(\Leftarrow)$ If $|w|<|y^{\prime}_{\pi}|$ , there is no assignment in $y$ that op-matches $x$ due to one of two reasons: 1) there are empty positions in $y^{\prime}_{\pi}$ due to the inability to satisfy equalities in $x$ , or 2) it is not possible to find a monotonically increasing assignment to $y^{\prime}_{\pi}$ and, given the properties of $y^{\prime}_{\pi}$ , $y_{\pi}$ cannot preserve the orders of $x_{\pi}$ .

Solving the LIS task on a string of size $n$ is $O(n\lg n)$ [42] where $n=|z|=O(rm)$ . In addition, set intersection operations are performed $O(m)$ times on sets with $O(r)$ size, which can be accomplished in $O(rm\lg r)$ time. As a result, the $\mu$ OPPM problem with one indeterminate string can be solved in $O(rm\lg(rm))$ .

Given the fact that the candidate string for the LIS task has properties of interest, we can improve the complexity of this calculus (Theorem 3.2) in accordance with Algorithm 1.

Theorem 3.2.

$\mu$ OPPM two strings of length $m$ , one being indeterminate, is in $O(mr\lg r)$ time, where $r\in\mathbb{N}+$ .

Proof.

In accordance with Algorithm 1, $\mu$ OPPM is bounded by the verification of equalities, $O(mr\lg r)$ [43]. Testing inequalities after set intersections can be linearly performed on the size of $y$ , $O(mr)$ time, improving the $O(mr\lg(mr))$ bound given by the LIS calculus.

The analysis of Algorthim 1 further reveals that the $\mu$ OPPM problem with one indeterminate string requires linear space in the text length, $O(mr)$ .

3.2 $\mu$ OPPM with indeterminate pattern and text

As indetermination in real-world strings is typically observed between pairs of characters [4], a key question is whether $\mu$ OPPM on two indeterminate strings is in class P when $r=2$ . To explore this possibility, new concepts need to be introduced. In OPPM research, character orders in a determinate string of length $m$ can be decomposed in 3 sequences with $m$ unit sets:

Definition 3.3.

For $i=0,\ldots,m-1$ :

$\textit{Leq}_{x}[i]=\{\max\{k\mid k<i,x[i]=x[k]\}\}$ ( $\emptyset$ if there is no eligible $k$ ),

2.

$\textit{Lmax}_{x}[i]=\{\max\{\mathrm{argmax}_{k}\{x[k]\mid k<i,x[i]>x[k]\}\}\}$ ( $\emptyset$ if there is no eligible $k$ ),

3.

$\textit{Lmin}_{x}[i]=\{\max\{\mathrm{argmin}_{k}\{x[k]\mid k<i,x[i]<x[k]\}\}\}$ ( $\emptyset$ if there is no eligible $k$ ).

Leq, Lmax and Lmin capture $=$ , $>$ and $<$ relationships between each character $x[i]$ in $x$ and the closest preceding character $x[k]$ . These orders can be inferred in linear time for alphabets of size $m^{O(1)}$ and in $O(m\lg m)$ time for other alphabets by answering the “all nearest smaller values” task on the sorted indexes [8]. Figure 1 depicts Leq, Lmax and Lmin for $x=(1,4,3,1)$ . Given determinate strings $x$ and $y$ , $A=Leq_{x}[t+1]$ , $B=Lmax_{x}[t+1]$ and $C=Lmin_{x}[t+1]$ , if $x[0..t]\approx y[0..t]$ , then $x[0..t+1]\approx y[0..t+1]$ if and only if

[TABLE]

When allowing uncertainties between pairs of characters, previous research on the OPPM problem cannot be straightforwardly extended due to the need to trace $O(2^{m})$ assignments on indeterminate strings.

Lemma 3.4.

Given a determinate string $x$ , an indeterminate string $y$ , and the singleton sets $A=Leq_{x}[t+1]$ , $B=Lmax_{x}[t+1]$ and $C=Lmin_{x}[t+1]$ containing a position in $\{0,\ldots,t\}$ . If $x[0..t]\approx y[0..t]$ is verified on a specific assignment of $y$ characters, denoted $\S y$ , then $x[0..t+1]\approx y[0..t+1]$ if and only if

[TABLE]

Proof.

$(\Rightarrow)$ In accordance with Leq, Lmax and Lmin definition, for any $a\in A$ , $b\in B$ and $c\in C$ we have $x[t+1]=x[a]$ , $x[t+1]>x[b]$ and $x[t+1]<x[c]$ . If there is an assignment to $y[0..t+1]$ in $\S y$ that preserves the orders of $x[0..t+1]$ , then for each $a\in A$ , $b\in B$ and $c\in C$ $\$ y[t+1]=$y[a] $,$ $y[t+1]>$y[b] $and$ $y[t+1]<$y[c] $(where$ $y[t+1]\in\S y[t+1] $,$ $y[a]\in\S y[a] $,$ $y[b]\in\S y[b] $,$ $y[c]\in\S y[c] $).$ (\Leftarrow) $We need to show that$ x[0..t+1]\approx y[0..t+1] $. Since$ x[0..t]\approx y[0..t] $, for$ i<t $,$ \exists_{$y[i]\in\S y[i],$y[t+1]\in\S y[t+1]} $:$ x[t+1]>x[i]\Leftrightarrow$y[t+1]>$y[i] $. Assuming$ x[t+1]>x[i] $for some$ i\in{0,\ldots,t} $: by the definition of *Lmax*,$ \forall_{b\in B}x[b]>x[i] $; by the order-isomorphism of$ x[0..t] $and$ $y[0..t] $in$ \S y[0..t] $, there is$ $y[i]\in\S y[i] $and$ $y[b]\in\S y[b] $that$ \forall_{b\in B}$y[b]>$y[i] $; and by the assumption of the lemma,$ \forall_{b\in B}$y[t+1]>$y[b] $; hence$ $y[t+1]>$y[i] $. Similarly,$ x[t+1]<x[i] $(and$ x[t+1]=x[i] $) implies$ $y[t+1]<$y[i] $(and$ $y[t+1]=$y[i]$), yielding the stated equivalence.

Given two strings of equal length, the $\mu$ OPPM problem can be schematically represented according to the identified order restrictions. Figure 2 represents restrictions on the indeterminate string $y=(2,4|5,3|5,1|2)$ in accordance with the observed orders in $x=(1,4,3,1)$ . The left side edges are placed in accordance with Lemma 3.4 and capture assessments on the orders between pairs of characters. The right side edges capture incompatibilities detected after the assessments, i.e. pairs of characters that cannot be selected simultaneously (for instance, $y[0]=2$ and $y[3]=1$ , or $y[1]=4$ and $y[2]=5$ ). For the given example, there are two valid assignments, $\$ y_{1}=(2,4,3,2) $and$ $y_{2}=(2,5,3,2) $, that satisfy$ x[0]=x[3]<x[2]<x[1] $, thus$ y $op-matches$ x$.

To verify whether there is an assignment that satisfies the identified ordering restrictions, we propose the reduction of $\mu$ OPPM problem to a Boolean satisfiability problem.

Given a set of Boolean variables, a formula in conjunctive normal form is a conjunction of clauses, where each clause is a disjunction of literals, and a literal corresponds to a variable or its negation. Let a 2CNF formula be a formula in the conjunctive normal form with at most two literals per clause. Given a CNF formula, the satisfiability (SAT) problem is to verify if there is an assigning of values to the Boolean variables such that the CNF formula is satisfied.

Theorem 3.5.

The $\mu$ OPPM problem over two strings of equal length, one being indeterminate, can be reduced to a satisfiability problem with the following CNF formula:

[TABLE]

Proof.

Let us show that if $x$ op-matches $y$ then $\phi$ is satisfiable, and if $x$ does not op-match $y$ then $\phi$ is not satisfiable. $(\Rightarrow)$ When $x\approx y$ , there is an assignment of values to $y$ , $\$ y $, that satisfy the orderings of$ x $.$ \phi $is satisfiable if there is at least one variable assigned to true per clause$ \vee_{$y[i]\in y[i]}z_{i,$y[i]} $given conflicts$ \neg z_{i,$y[i]}\vee\neg z_{j,$y[j]} $. As conflicts do not prevent the existence of a valid assignment (by assumption), then$ \exists_{$y}\wedge_{i\in{0..m-1}}z_{i,$y[i]} $and$ \phi $is satisfiable.$ (\Leftarrow) $When$ x $does not op-match$ y $, there is no assignment of values$ $y\in y $that can satisfy the orders of$ x $. Per formulation, the conflicts$ \neg z_{i,$y[i]}\vee\neg z_{j,$y[j]} $prevent the satisfiability of one or more clauses$ \vee_{$y[i]\in y[i]}z_{i,$y[i]}$, leading to a non-satisfiable formula.

If the established $\phi$ formula is satisfiable, there is a Boolean assignment to the variables that specify an assignment of characters in $y$ , $\$ y $, preserving the orders of$ x $(as defined by *Leq*, *Lmax* and *Lmin*). Otherwise, it is not possible to select an assignment$ $y $op-matching$ x $.$ \phi $has at most$ r\times m $variables,$ {z_{i,\sigma}\mid i\in{0..m-1},\ \sigma\in\Sigma} $. The Boolean value assigned to a variable$ z_{i,\sigma} $simply defines that the associated character$ \sigma $from$ y[i] $can be either considered (when true) or not (when false) to compose a valid assignment$ $y $that op-matches the given determinate string$ x $. The reduced formula in ([1](#S3.E1)) is composed of two major types of clauses:$ \vee_{$y[i]\in y[i]}z_{i,$y[i]} $, and$ (\neg z_{i,$y[i]}\vee\neg z_{j,$y[j]}\vee\textsf{bool}) $where bool is either given by$ $y[i]=$y[j] $,$ $y[i]<$y[j] $or$ $y[i]>$y[j] $. Clauses of the first type specify the need to select at least one character per position in$ y $to guarantee the presence of valid assignments. The remaining clauses specify ordering constraints between characters. If an inequality, such as$ $y[i]>$y[j] $, is assessed as true, the associated clause is removed. Otherwise,$ (\neg z_{i,\sigma_{1}}\vee\neg z_{j,\sigma_{2}}) $is derived, meaning that these$ \sigma_{1} $and$ \sigma_{2} $characters should not be selected simultaneously since they do not satisfy the orders defined by a given pattern. For instance, the pairs of characters in orange from Figure [2](#S3.F2) should not be simultaneously selected due to order conflicts. To this end,$ (\neg z_{0,2}\vee\neg z_{3,1}) $and$ (\neg z_{1,4}\vee\neg z_{2,5}) $clauses need to be included to verify if$ y\approx x $. Considering$ y=(2,4|5,4|5,1|2) $and$ x=(1,4,3,1)$, schematically represented in Figure 2, the associated CNF formula is:

[TABLE]

Theorem 3.6.

Given two strings of length $m$ , one being indeterminate with $r=2$ , the $\mu$ OPPM problem can be reduced to a 2SAT problem with a CNF formula with $O(m)$ size.

Proof.

Given Theorem 1 and the fact that the reduced CNF formula has at most two literals per clause – $\phi$ is a composition of $\vee_{\$ y[i]\in y[i]}z_{i,$y[i]} $clauses with$ |y[i]|\in{1,2} $and$ (\neg z_{i,$y[i]}\vee\neg z_{j,$y[j]}\vee\textsf{bool}) $clauses –$ \mu $OPPM with$ r=2 $and one indeterminate string is reducible to 2SAT. The reduced formula has at most$ 10m $clauses with 2 literals each, being linear in$ m$:

[clauses that impose the selection of at least one character per position in $y$ ] Since $y$ has $m$ positions, and each position is either determinate (unitary clause) or defines an uncertainty between a pair of characters, there are $m$ clauses and at most $2m$ literals;

2.

[clauses that define the ordering restrictions between two variables] A position in the indeterminate string $y[i]$ needs to satisfy at most two order relations. Considering that $i$ , $Leq[i]$ , $Lmax[i]$ and $Lmin[i]$ specify uncertainties between pairs of characters, there are up to 12 restrictions per position: 4 ordering restrictions between characters in $y[i]$ and $y[Leq[i]]$ , $y[Lmax[i]]$ and $y[Lmin[i]]$ . Whenever the order between two characters is not satisfied, a clause is added per position, leading to at most $12m$ clauses.

Theorem 3.7.

The $\mu$ OPPM between determinate and indeterminate strings of equal length can be solved in linear time when $r=2$ .

Proof.

Given the fact that a 2SAT problem can be solved in linear time [44]***2SAT problems have linear time and space solutions on the size of the input formula. Consider for instance the original proposal [44], the formula $\phi$ is modeled by a directed graph $G=(V,E)$ , with two nodes per variable $z_{i}$ in $\phi$ ( $z_{i}$ and $\neg z_{i}$ ) and two directed edges for each clause $z_{i}\vee z_{j}$ (the equivalent implicative forms $\neg z_{i}\Rightarrow z_{j}$ and $\neg z_{j}\Rightarrow z_{i}$ ). Given $G$ , the strongly connected components (SCCs) of $G$ can be discovered in $O(|V|+|E|)$ . During the traversal if a variable and its complement belong to the same SCC, then the procedure stops as $\phi$ is determined to be unsatisfiable. Given the fact that both $|V|=O(m)$ and $|E|=O(m)$ by Lemma 3.6, this procedure is $O(m)$ time and space., this proof directly derives from Theorem 3.6 as it guarantees the soundness of reducing $\mu$ OPPM ( $r=2$ ) to a 2SAT problem with a CNF formula with $O(m)$ size.

As the size of the mapped CNF formula $\phi$ is $O(m)$ and the a valid algorithm to verify its satisfiability would require the construction of a graph with $O(m)$ nodes and edges, the required memory for the target $\mu$ OPPM problem is $\Theta(m)$ .

When moving from one to two indeterminate strings, previous contributions are insufficient to answer the $\mu$ OPPM problem. In this context, the Leq, Lmax and Lmin vectors need to be redefined to be inferred from an indeterminate string:

Definition 3.8.

For $i=0,\ldots,m-1$ :

$\textit{Leq}_{x}[i|j]=\{k\mid k<i,\ \exists_{p}\ \$ x_{j}[i]=$x_{p}[k]} $($ \emptyset $if there is no eligible$ k$),

2.

$\textit{Lmax}_{x}[i|j]=\{k\mid k<i,\ \exists_{p}\ \$ x_{j}[i]>$x_{p}[k]} $($ \emptyset $if there is no eligible$ k$),

3.

$\textit{Lmin}_{x}[i|j]=\{k\mid k<i,\ \exists_{p}\ \$ x_{j}[i]<$x_{p}[k]} $($ \emptyset $if there is no eligible$ k$).

Figure 3 schematically represents the order relationships of $x=(2,1|3,3)$ and the associated Leq, Lmax and Lmin vectors. In this scenario, $x[2]$ needs to be verified not only against $x_{0}[1]$ but also against $x_{1}[1]$ in case $x_{0}[1]$ is disregarded.

Remark 3.9.

Given Leq, Lmax and Lmin (Definition 3.8), there are $O((rm)^{2})$ order relationships when $r\in\mathbb{N}^{+}$ since each character in a given position establishes at most $O(m)$ relationships with characters in preceding positions.

Lemma 3.10.

Given indeterminate strings $x$ and $y$ , let $A_{j}=\textit{Leq}_{x}[t+1|j]$ , $B_{j}=Lmax_{x}[t+1|j]$ and $C_{j}=Lmin_{x}[t+1|j]$ (Definition 3.8) be the orders associated with $\$ x_{j}[t+1] $. If$ x[1..t]\approx y[1..t] $is verified on a partial assignment of$ y $characters, denoted by$ \S y $, then$ x[1..t+1]\approx y[1..t+1]$ if and only if

[TABLE]

Proof.

$(\Rightarrow)$ Similar to the proof of Lemma 3.4, yet $A$ , $B$ and $C$ conditional to $x[t+1]$ (Definition 3.3) are now given by $A_{j}$ , $B_{j}$ and $C_{j}$ conditional to $x_{j}[t+1]$ (Definition 3.8). If there is an assignment to $y[1..t+1]$ in $\S y$ that preserves one of the possible orders in $x[1..t+1]$ , then for any $a\in A_{j}$ , $b\in B_{j}$ and $c\in C_{j}$ : $\$ y[t+1]=$y[a] $,$ $y[t+1]>$y[b] $and$ $y[t+1]<$y[c] $(where$ $y[t+1]\in\S y[t+1] $,$ $y[a]\in\S y[a] $,$ $y[b]\in\S y[b] $, and$ $y[c]\in\S y[c]$).

$(\Leftarrow)$ We need to show that $x[1..t+1]\approx y[1..t+1]$ . Since $x[1..t]\approx y[1..t]$ , it is sufficient to prove that for $i\leq t$ : exists $\$ x[i]\in\S x[i] $,$ $x[t+1]\in\S x[t+1] $,$ $y[i]\in\S y[i] $, and$ $y[t+1]\in\S y[t+1] $such that$ $x[t+1]=$x[i]\Leftrightarrow$y[t+1]=$y[i] $,$ $x[t+1]>$x[i]\Leftrightarrow$y[t+1]>$y[i] $and$ $x[t+1]<$x[i]\Leftrightarrow$y[t+1]<$y[i]$. This results from Definition 3.8, the order-isomorphism property and Lemma 3.4.

Figure 4 represents encountered restrictions when op-matching $x=(2,1|3,3)$ against $y=(2,0,3|4)$ . The right side edges capture the detected incompatibilities, i.e. pairs of characters that cannot be selected simultaneously. For the given example, there are 2 valid assignments – $\$ y_{1}=(2,0,3) $and$ $y_{2}=(2,0,4) $– satisfying$ $x_{0}[1]<$x_{0}[0]<$x_{0}[2] $, thus$ x\approx y$.

To verify whether there is an assignment that satisfies the identified ordering restrictions, Theorem 2 extends the previously introduced SAT mapping given by (1).

Theorem 3.11.

Given Leq, Lmax and Lmin (Definition 3.8), $\mu$ OPPM problem over two indeterminate strings of equal length can be reduced to a satisfiability problem with the following CNF formula:

[TABLE]

Proof.

If $x\approx y$ then $\phi$ is satisfiable, and if $x$ does not op-match $y$ then $\phi$ is not satisfiable.

$(\Rightarrow)$ When $x$ op-matches $y$ , there is an assignment of values in $x$ and $y$ such that $\$ x\approx$y $.$ \phi $is satisfiable if there is at least one valid assignment$ z_{i,$x[i],$y[i]} $per$ i^{\textit{th}} $position. As conflicts$ \neg z_{i,$x[i],$y[i]}\vee\neg z_{j,$x[j],$y[j]} $do not prevent the existence of a valid assignment (by assumption), one or more variables$ z_{i,$x[i],$y[i]} $can be selected per position.$ \phi $can then be satisfied by fixing a single variable$ z_{i,$x[i],$y[i]} $per$ i^{\textit{th}} $position as true and the remaining variables as false.$ (\Leftarrow) $When$ x $does not op-match$ y $, there is no assignment of values$ $x\in x $and$ $y\in y $such that$ $x\approx$y $. Per formulation, in the absence of an order-preserving match, conflicts will prevent the assignment of at least one variable$ z_{i,$x[i],$y[i]} $per$ i^{\textit{th}} $position, thus making$ \phi$ formula unsat.

If the formula in (2) is satisfiable, there is a Boolean assignment to the variables such that there is an assignment of characters in $y$ , $\$ y $, and in$ x $,$ $x $, such that both strings op-match. Otherwise, it is not possible to select assignments such that$ x\approx y $. Given$ r=2 $, the established$ \phi $formula has at most$ 4m $variables,$ {z_{i,\sigma_{1},\sigma_{2}}\mid i\in{0\ldots m-1} $,$ \sigma_{1},\sigma_{2}\in\Sigma $}. The Boolean values assigned to these variables define whether characters$ \sigma_{1}\in x[i] $and$ \sigma_{2}\in y[i]$ belong to an op-match. The reduced formula is composed of two major types of clauses:

Those in the first line of (2) ensure that at least one combination of characters, $\$ x[i] $and$ $y[i] $, should be selected per$ i^{\textit{th}}$ position.

2.

Remaining ones in (2) specify ordering constraints between pairs of characters $\sigma_{1}\in y[i]$ and $y[Leq[i]]$ , $y[Lmax[i]]$ and $y[Lmin[i]]$ ; if the inequalities $\$ y[i]=$y[j] $,$ $y[i]>$y[j] $and$ $y[i]<$y[j] $are assessed as false, then it leads to clauses of the form$ (\neg z_{i,\sigma_{1}}\vee\neg z_{j,\sigma_{2}})$, meaning that these characters should not be selected simultaneously in the given positions (see Figure 4).

To instantiate the proposed mapping, consider $x=(2,1|3,3)$ and $y=(2,0,3|4)$ , schematically represented in Figure 3. The associated CNF formula is:

[TABLE]

Theorem 3.12.

The $\mu$ OPPM problem for two indeterminate strings of equal length is reducible into a satisfiability problem over a CNF formula with size $O((mr)^{2})$ .

Proof.

The reduced formula in (2) is in the two conjunctive normal form (CNF) with at most $4m$ clauses in the first line of (2) and a maximum of $O(mr)$ orders per position (Remark 3.9), totalling at most $O((mr)^{2})$ order conflicts between characters, from the restriction clauses in the reammining of (2).

Although we are no longer in the conditions of Theorem 3.7, namely because the above satisfiability formulation is not a 2SAT instance, given its unique properties, effective backtracking in accordance with the clauses in the first line of (2), as well as dedicated conflict pruning principles derived from reamining clauses in (2), can be considered to develop efficient SAT solvers able to solve the $\mu$ OPPM problem. And, as we will show later, we are not expected to do much better.

3.3 Polynomial time Alternate- $\mu$ OPPM

In this section, we define Alternate- $\mu$ OPPM as the subproblem of $\mu$ OPPM where both strings ( $x$ and $y$ , interchangeable) may have indeterminate characters, but never in the same position; we show that Alternate- $\mu$ OPPM is polynomial in both the number of indeterminacies ( $r$ , which may be different in each position and string) and length of the strings ( $m$ ). To do this, we will present a set of 2SAT clauses, in the form of implications, that can represent every constraint of this problem. We will first assume that there are no repeated characters within each string and then extend the reduction to handle equalities.

Given a string $x$ and position $i$ , we represent the set of indeterminate characters $x[i]$ as the ascending sequence $a_{0}|...|a_{{r_{i}}-1}$ where $\forall_{j}\ a_{j}\in x[i]$ and $|x[i]|=r_{i}$ . We will use only $r$ when the context leads to no ambiguities, or to mean the largest possible $r_{i}$ . All of our 2SAT variables will be of the form $g_{a_{j}}$ , meaning that the chosen value $\$ x[i] $is greater than or equal to$ a_{j}$.

Consistency clauses

Here, we describe the clauses that maintain consistency between all the $g$ variables for individual positions. We only need to specify that, if we have chosen a value greater than $a_{i}$ , we have also chosen a value greater than $a_{i-1}$ , the value immediately below it, i.e.,

[TABLE]

This leads to a single clause per indeterminacy, per position, for both pattern and text, and so, at most, $2mr=O(mr)$ clauses.

Order clauses (Type $1$ )

Here, we describe the clauses enforcing the order relation between each pair of positions. Given two strings $x$ and $y$ , for positions $\alpha$ and $\beta$ , if $\$ x[\alpha]>$x[\beta] $, then$ $y[\alpha]>$y[\beta] $(and the same for the$ <$ relation).

This first set of clauses applies to Type $1$ (see Table 1). We only need to find the index (in each string) that separates the cases where $\$ x[\alpha]>$x[\beta] $from the cases where$ $x[\alpha]<$x[\beta]$ and add a single constraint expressing it.

Let $i$ be the lowest index such that $b_{i}>a$ and $j$ the lowest index such that $a_{j}\geq b$ , where $a$ and $b$ are as in Table 1. Then, we have

[TABLE]

This leads to two clauses for every pair of positions, and so, $O(m^{2})$ clauses.

Order clauses (Type $2$ )

Finally, we have a second set of clauses that applies to Type $2$ (see Table 2). Here, we have the order between $\alpha$ and $\beta$ fixed already by whichever string $x$ or $y$ has no indeterminacies.

If $a>b$ , for every index $i$ indexing $b_{i}$ , and let $j$ be the lowest index such that $a_{j}>b_{i}$ . Then we add

[TABLE]

If there is no such $j$ , we add instead

[TABLE]

Similarly, if $a<b$ , for every index $i$ indexing $a_{i}$ , let $j$ be the lowest index such that $a_{i}<b_{j}$ . Then we add

[TABLE]

If there is no such $j$ , we add instead

[TABLE]

This leads to at most $r$ clauses for every pair of positions, and so $O(rm^{2})$ clauses. Because character order is a transitive property, this type of clauses may be reduced to $O(rm)$ using a similar notion to the Lmax and Lmin sets introduced in Section 3.2 to consider only “adjacent” (taking adjacent to mean the closest position of the same type) pairs of positions, instead of every pair.

Forcing choice

With the clauses specified above, we can find coherent solutions to the problem. However, it is possible to satisfy the formula by assigning all possible values for a given variable to false (effectively skipping the position). This has a straightforward solution, given the chosen encoding of the variables. Each 2SAT variable represents a greater or equal value in the corresponding OPPM position, the variable corresponding to the lowest value for each position is trivially true, letting us force a value choice with a single added variable. For every position, with variables $g_{0},...g_{r_{i}}$ , we add the clause $g_{0}$ , forcing it to be true to satisfy the 2SAT formula.

Extracting solutions

Finally, we need to extract the solution to the OPPM problem from the 2SAT solution. This is easily done in linear time by sweeping every variable in ascending order, in each position. In each position, with variables $g_{0},...,g_{r_{i}}$ , we find the variable at index $j$ such that $g_{j}$ is true and $g_{j+1}$ is false. The chosen value in the OPPM problem, for the given position, is the value at index $j$ .

Dealing with equalities

We now turn to cases where characters match and show how to adapt the encoding above to equalities. Let us consider Type II equalities, first, where $a=b$ . The easy solution to this is the same as the one presented before. We preprocess the two strings by grouping all the repeats into a single position and intersecting their indeterminacies. For Type I equalities, we need to add $4$ clauses to each pair. Let $i,j$ be indexes such that $a=b_{i}$ and $b=a_{j}$ . We add

[TABLE]

If only $i$ exists (or $j$ ), we simply remove $b_{i}$ (or $a_{j}$ ) from the input, as such an assignment could never lead to a valid solution.

Pair incompatibility

All the clauses described above serve to maintain consistency between pairs. It may happen that a given pair is unsatisfiable by itself, and no clauses would be constructed. These cases can be dealt separately, as pre-processing. If we find a pair that can not be satisfied, we can terminate the program before ending the construction, since there is no solution to the OPPM instance.

Theorem 3.13.

The Alternate- $\mu$ OPPM can be solved in $O(rm^{2})$ time and space.

Proof.

Property resulting from the encoding above and, as in the proof of Theorem 3.7, given the fact that a 2SAT problem can be solved in linear time [44].

3.4 $\mu$ OPPM with 3 indeterminacies in both text and pattern is NP-hard

In this section, we define $\{3,3\}$ - $\mu$ OPPM as the subproblem of $\mu$ OPPM where both the pattern and the text have indeterminate characters in any position (although at least one position must have at least three indeterminate characters in both pattern and text) and prove it NP-hard (thus proving the same for general $\mu$ OPPM). We do this with a direct reduction from 3CNF-SAT, first presenting the construction and then the proof of equivalence between the two instances. The construction is similar to the one by Bose et al. for the permutation matching problem [45].

Construction

To ease the description of the construction itself, we start by describing how we represent an instance of 3CNF-SAT. First, we assume that every literal and clause has some ordering. We have a set $V$ of literals, and a set $C$ of clauses. Each clause $c$ is represented by two tuples, $(z_{c,0},z_{c,1},z_{c,2})$ and $(l_{c,0},l_{c,1},l_{c,2})$ . $z_{c,i}\in\{0,\ldots,|V|-1\}$ represents the index of literal $i$ of clause $c$ ; $l_{c,i}\in\{0,1\}$ represents the value of the literal $i$ in clause $c$ , having the value of [math] for positive literals and $1$ for negative literals. For example, the clause $(v_{1}\lor\lnot v_{2},\lor v_{5})$ would be represented by the two tuples $z=(1,2,5)$ and $l=(0,1,0)$ .

Although the designations of text or pattern are interchangeable in this section, we will use pattern for the simpler string (with less indeterminacies) and text for the more complicated string (with more indeterminacies). We use $p$ and $t$ for the pattern and text, respectively, or $s$ when they are interchangeable.

Both text and pattern have two parts, one representing literals and the other representing clauses. Each literal, and clause, has a single position in each string to represent it, dividing $s$ into $s_{V}=s[0..|V|-1]$ and $s_{C}=s[|V|..|V|+|C|-1]$ . In $p_{V}$ , we have a simple sequence of literals given by their indexes, so $p[i]=i+1$ , for $i\in\{0,\ldots,|V|-1\}$ ; in $t_{V}$ we have a similar sequence, but each literal takes one of two variable values to represent an assignment of true or false, so $t[i]=2\times(i+1)$ or $2\times(i+1)-1$ . We choose the larger value to represent the assignment of true. In $s_{C}$ , each position has three indeterminacies, corresponding to the three variables of the clause. In $p_{C}$ , we choose one of the three literals of the respective clause. For clause $c$ , with literals $v_{1},v_{2},v_{5}$ (regardless of their value being positive or negative), its position in $p$ , $p[|V|+c]=1|2|5$ . In $t_{C}$ , as in $p_{C}$ we choose one of the literals, but now the value of the literal must satisfy the clause. For clause $c$ , $(v_{1}\lor\lnot v_{2}\lor v_{5})$ , $t[|V|+c]=2\times 1-0\mid 2\times 2-1\mid 2\times 5-0$ $=2|3|10$ . An example of this construction is shown in Table 3.

Lemma 3.14.

The construction above takes polynomial time.

Proof.

It is easy to see that, assuming that variables and clauses are numbered, we can simply scan the formula once to construct our two strings in linear time.

Lemma 3.15.

The initial 3CNF-SAT clause is satisfiable if and only if there is an order-isomorphic match between the two constructed strings.

Proof.

We start by showing how solving the $\mu$ OPPM instance solves the initial 3CNF-SAT instance. To solve $\mu$ OPPM, we need to choose exactly one value for each position in $p$ and $t$ that leads to two order-isomorphic strings. To extract the solution, we can limit ourselves to look at the initial part of $t$ , $t[0,|V|-1]$ , which sets the value of each literal.

First, note that $p$ function is to maintain consistency between the values of literals chosen in $t$ . By choosing only literals in $p$ , and not their values, we force equality between all such literals. Because of order-isomorphism, this equality must be kept in $t$ , forcing a valid solution to use a single value for each literal (since different values match in $p$ but mismatch in $t$ ). If we choose a literal to be positive/negative at some position in $t$ , we force the value of that literal to be positive/negative at every position in $t$ .

Now, we focus on $t_{C}$ . Every clause has exactly one position in $t_{C}$ , and each of these positions have three choices of value, matching only the three values that satisfy a clause. Because we must choose one value in each position to solve our $\mu$ OPPM instance, we must choose one value that satisfies each clause, for every clause.

Putting these two properties together, to solve $\mu$ OPPM we must choose a literal value that satisfies each clause and those literals must have consistent values. This establishes the equivalence between the solutions of the two instances.

We can easily extract the solution from $\mu$ OPPM to 3CNF-SAT by checking whether the values in $t_{V}$ are even or odd, true or false, respectively. There is a unique solution to 3CNF-SAT given an $\mu$ OPPM solution.

To extract the solution from 3CNF-SAT to $\mu$ OPPM, we take the values assigned to each variable and choose the respective values in $t_{V}$ . Then, we need to choose values for $p_{C}$ and $t_{C}$ , which can easily be done by choosing any of the literals that satisfies its respective clause. There may be multiple $\mu$ OPPM solutions for a given 3CNF-SAT solution.

Theorem 3.16.

{3,3}- $\mu$ OPPM is NP-hard.

Proof.

Using Lemmas 3.14 and 3.15 we show that 3CNF-SAT $\leq_{p}$ $\{3,3\}$ - $\mu$ OPPM by constructing an instance of $\mu$ OPPM in polynomial time. The solutions can also be retrieved and translated in polynomial time.

Theorem 3.17.

$\mu$ OPPM is NP-hard.

Proof.

Since $\{3,3\}$ - $\mu$ OPPM is a particular case of $\mu$ OPPM, and it is NP-hard, then OPPM is NP-hard.

4 Polynomial time $\mu$ OPPM

.

Lemma 4.1.

Given a pattern string of length $m$ and a text string of length $n$ , one being indeterminate, the $\mu$ OPPM problem can be solved in $O(nmr\lg r)$ time.

Proof.

From Theorem 3.2, verifying if two strings of length $m$ op-match can be done in $O(mr\lg r)$ time (indetermination in one string) since at most $n-m+1$ verifications need to be performed.

Lemma 4.1 confirms that the $\mu$ OPPM problem with one indeterminate strings is in class P. This lemma further triggers the research question “Is $O(nmr\lg r)$ a tight bound to solve the $\mu$ OPPM?”, here left as an open research question.

Irrespectively of the answer, the analysis of the average complexity is of complementary relevance. State-of-the-art research on the exact OPPM problem shows that the average performance of algorithms in $O(nm)$ time can outperform linear time algorithms [12, 46, 47].

Motivated by the evidence gathered by these works, we suggest the use of filtration procedures to improve the average complexity of the proposed $\mu$ OPPM algorithm while still preserving its complexity bounds. A filtration procedure encodes the input pattern and text, and relies on this encoding to efficiently find positions in the text with a high likelihood to op-match a given pattern. Despite the diversity of string encodings, simplistic binary encodings are considered to be the state-of-the-art in OPPM research [12, 46]. In accordance with Chhabra et al. [12], a pattern $p$ can be mapped into a binary string $p^{\prime}$ expressing increases (1), equalities (0) and decreases (0) between subsequent positions. By searching for exact pattern matches of $p^{\prime}$ in an analogously transformed text string $t^{\prime}$ , we guarantee that the verification of whether $p[0..m-1]$ and $t[i..i+m-1]$ orders are preserved is only performed when exact binary matches occur. Illustrating, given $p=(3,1,2,4)$ and $t=(2,4,3,5,7,1,4,8)$ , then $p^{\prime}=(1,0,1,1)$ and $t^{\prime}=(1,1,0,1,1,0,1,1)$ , revealing two matches $t^{\prime}[1..4]$ and $t^{\prime}[4..7]$ : one spurious match $t[1..4]$ and one true match $t[4..7]$ .

When handling indeterminate strings the concept of increase, equality and decrease needs to be redefined. Given an indeterminate string $x$ , consider $x^{\prime}[i]=1$ if $max(x[i])<min(x[i+1])$ , $x^{\prime}[i]=0$ if $min(x[i])\geq max(x[i+1])$ , and $x^{\prime}[i]=\ast$ otherwise. Under this encoding, the pattern matching problem is identical under the additional guard that a character in $p^{\prime}$ always matches a do not care position, $t^{\prime}[i]=\ast$ , and vice-versa. Illustrating, given $p=(6,2|3,5)$ and $t=(3|4,5,6|8,6|7,3,5,4|6,7|8,4)$ , then $p^{\prime}=(0,1)$ and $t^{\prime}=(11\ast 01\ast 10)$ , leading to one true match $t[3..5]$ – e.g. $\$ t[3..5]=(6,3,5) $– and one spurious match$ t[5..7]$. Exact pattern matching algorithms, such as Knuth-Morris-Pratt and Boyer-Moore, can be adapted to consider do not care positions while preserving complexity bounds [14, 15].

The properties of the proposed encoding guarantee that the exact matches of $p^{\prime}$ in $t^{\prime}$ cannot skip any op-match of $p$ in $t$ . Thus, when combining the premises of Lemma 4.1 with the previous observation, we guarantee that the computed $\mu$ OPPM solution is sound.

The application of this simple filtration procedure prevents the recurring $O(mr\lg r)$ verifications $n-m+1$ times. Instead, the complexity of the proposed method to solve the $\mu$ OPPM problem becomes $O(dmr\lg r+n)$ (when one string is indeterminate) where $d$ is the number of exact matches ( $d\ll n$ ). According to previous work on exact OPPM with filtration procedures [12], SBNDM2 and SBNDM4 algorithms [48] (Boyer-Moore variants) were suggested to match binary encodings. In the presence of small patterns, Fast Shift-Or (FSO) [49] can be alternatively applied [12].

A given string text can be read and encoded incrementally from the standard input as needed to perform $\mu$ OPPM, thus requiring $O(mr)$ space. When filtration procedures are considered, the aforementioned algorithms for exact pattern matching require $O(m)$ space [12], thus $\mu$ OPPM space requirements are bound by substring verifications (Section 3): $O(mr)$ space when one string is indeterminate and $O((mr)^{2})$ when indetermination is considered on both strings.

5 Open problem

We can look at the $\mu$ OPPM by the number and position of the indeterminate characters. We have shown that, for any number of indeterminacies, $\mu$ OPPM has a polynomial-time algorithm for indeterminate characters in a single string (Section 3.1), or in both strings, but never in both strings at the same position (Section 3.3). For indeterminate characters in both strings at the same position, we have also shown that for at least three indeterminacies (at select positions), the problem in NP-hard (Section 3.4).

There is a gap in between these two groups, however, for the strings where there are at most two indeterminate characters in both strings at the same position. It remains open whether or not this problem is NP-hard. Given that our reduction from Section 3.4 uses three indeterminate character in both strings, it also remains open whether the problem with two indeterminate characters in one string and three in the other (at the same position) is NP-hard.

Following the pattern-avoidance precedent by Guillemot and Vialette [50] for the related problem of permutation matching, we note that, for the case of $\mu$ OPPM with at most two indeterminate characters (both strings, same position), there is a straightforward encoding in 2SAT for $(1|3,2|4)$ -avoiding strings, here taken to mean that, in a single string, for the pair of positions $(i,j)$ , the rank of the characters (only for the pair in question) is not $1|3$ in $i$ and $2|4$ in $j$ (with $i$ and let $j$ being interchangeable). The full problem, however, remains open.

6 Concluding remark

This work addressed the relevant yet scarcely studied problem of finding order-preserving pattern matches on indeterminate strings ( $\mu$ OPPM). We showed that the problem has a linear time and space solution when one string is indeterminate. In addition, the $\mu$ OPPM problem (when both strings are indeterminate) was mapped into a satisfiability formula of polynomial size and two simple types of clauses in order to study efficient solvers for the $\mu$ OPPM problem. Moreover the $\mu$ OPPM problem was shown to be NP-hard in general. Finally, we showed that solvers of the $\mu$ OPPM problem can be boosted in the presence of filtration procedures and we identified a still open problem in what concerns the computational complexity of the $\mu$ OPPM problem when restricted to at most two indeterminate characters in both strings at the same position.

Acknowledgments

This work was developed in the context of a secondment granted by the BIRDS MASC RISE project funded in part by EU H2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement no.690941. This work was further supported by national funds through Fundação para a Ciência e Tecnologia (FCT), namely under projects PTDC/CCI-BIO/29676/2017, TUBITAK/0004/2014, SAICTPAC/0021/2015, and UID/CEC/50021/2019.

Bibliography50

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] X. Ge, Pattern matching in financial time series data, final project report for ICS 278 (1998).
2[2] J. Kim, P. Eades, R. Fleischer, S.-H. Hong, C. S. Iliopoulos, K. Park, S. J. Puglisi, T. Tokuyama, Order-preserving matching, Theoretical Computer Science 525 (2014) 68–79 (2014).
3[3] R. Henriques, A. Paiva, Seven principles to mine flexible behavior from physiological signals for effective emotion recognition and description in affective interactions., in: Phy CS, 2014, pp. 75–82 (2014).
4[4] R. Henriques, Learning from high-dimensional data using local descriptive models, Ph.D. thesis, Instituto Superior Tecnico, Universidade de Lisboa, Lisboa (2016).
5[5] R. Henriques, S. C. Madeira, Bicspam: flexible biclustering using sequential patterns, BMC bioinformatics 15 (1) (2014) 130 (2014).
6[6] R. Henriques, C. Antunes, S. Madeira, Methods for the efficient discovery of large item-indexable sequential patterns, in: New Frontiers in Mining Complex Patterns, Vol. 8399 of LNCS, Springer International Publishing, 2014, pp. 100–116 (2014).
7[7] S. Kawashima, M. Kanehisa, Aaindex: amino acid index database, Nucleic acids research 28 (1) (2000) 374–374 (2000).
8[8] M. Kubica, T. Kulczyński, J. Radoszewski, W. Rytter, T. Waleń, A linear time algorithm for consecutive permutation pattern matching, Information Processing Letters 113 (12) (2013) 430–433 (2013).

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Order-Preserving Pattern Matching Indeterminate Strings

Abstract

keywords:

1 Introduction

2 Background

2.1 The Problem

2.2 Related work

3 On solving μ\muμOPPM

3.1 O(mrlg⁡r)O(mr\lg r)O(mrlgr) time μ\muμOPPM when one string is indeterminate

Theorem 3.1**.**

Proof.

Theorem 3.2**.**

Proof.

3.2 μ\muμOPPM with indeterminate pattern and text

Definition 3.3**.**

Lemma 3.4**.**

Proof.

Theorem 3.5**.**

Proof.

Theorem 3.6**.**

Proof.

Theorem 3.7**.**

Proof.

Definition 3.8**.**

Remark 3.9**.**

Lemma 3.10**.**

Proof.

Theorem 3.11**.**

Proof.

Theorem 3.12**.**

Proof.

3.3 Polynomial time Alternate-μ\muμOPPM

Consistency clauses

Order clauses (Type 111)

Order clauses (Type 222)

Forcing choice

Extracting solutions

Dealing with equalities

Pair incompatibility

Theorem 3.13**.**

Proof.

3.4 μ\muμOPPM with 3 indeterminacies in both text and pattern is NP-hard

Construction

Lemma 3.14**.**

Proof.

Lemma 3.15**.**

Proof.

Theorem 3.16**.**

Proof.

Theorem 3.17**.**

Proof.

4 Polynomial time μ\muμOPPM

Lemma 4.1**.**

Proof.

5 Open problem

6 Concluding remark

Acknowledgments

3 On solving $\mu$ OPPM

3.1 $O(mr\lg r)$ time $\mu$ OPPM when one string is indeterminate

Theorem 3.1.

Theorem 3.2.

3.2 $\mu$ OPPM with indeterminate pattern and text

Definition 3.3.

Lemma 3.4.

Theorem 3.5.

Theorem 3.6.

Theorem 3.7.

Definition 3.8.

Remark 3.9.

Lemma 3.10.

Theorem 3.11.

Theorem 3.12.

3.3 Polynomial time Alternate- $\mu$ OPPM

Order clauses (Type $1$ )

Order clauses (Type $2$ )

Theorem 3.13.

3.4 $\mu$ OPPM with 3 indeterminacies in both text and pattern is NP-hard

Lemma 3.14.

Lemma 3.15.

Theorem 3.16.

Theorem 3.17.

4 Polynomial time $\mu$ OPPM

Lemma 4.1.