LIKE Patterns and Complexity

Holger Petersen

arXiv:1903.06195·cs.CC·March 18, 2019

TL;DR

This paper explores the expressive capabilities and computational complexity of the LIKE operator in SQL, providing insights into its theoretical limitations and practical implications.

Contribution

It offers a detailed analysis of the LIKE operator's expressive power and complexity, which was previously not thoroughly understood.

Findings

01 Characterizes the complexity class of LIKE pattern matching

02 Identifies limitations in expressiveness for certain pattern classes

03 Provides theoretical bounds for LIKE operator's computational requirements

Abstract

We investigate the expressive power and complexity questions for the LIKE operator in SQL.

Full text

LIKE Patterns and Complexity

Holger Petersen

Reinsburgstr. 75

70197 Stuttgart

Germany

1 Introduction

Regular expressions conveniently support the analysis of software defects involving strings stored in a database and the subsequent selection of test data for checking the effectiveness of data cleansing. As an example take a list of values separated by a special symbol. When manipulating strings of this form, it might happen that separator symbols are stored consecutively or that strings start with a separator. This data corruption possibly leads to problems when displaying the data or generating export files.

A very restricted variant of regular expressions we will consider are patterns for the LIKE operator available in SQL (Structured Query Language) [1]. It admits defining patterns including constants and wild-card symbols representing single letters or arbitrary strings. Since our investigations are motivated by defect analysis and test data selection, which by definition may not modify data, we assume that new auxiliary columns for holding intermediate values cannot be defined.

Continuing the example given above, we can select corrupt strings in an obvious way. After data cleansing, the same selections can verify the correctness of the resulting data. It is known that LIKE pattern matching can define star-free languages only [4, Section 4.2]. In Section 3 we will explore what classes of languages known from the literature are characterized by LIKE patterns and their boolean combinations.
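The selection of corrupt strings mentioned above can be sketched with Python's built-in sqlite3 module. The table name items, the column name vals, the separator ; and the sample rows are assumptions made only for illustration; the query reads the data without modifying it.

```python
import sqlite3

# Hypothetical table and data: lists of values separated by ';'.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, vals TEXT)")
conn.executemany(
    "INSERT INTO items (id, vals) VALUES (?, ?)",
    [(1, "a;b;c"), (2, "a;;c"), (3, ";b;c"), (4, "abc")],
)

# A string is treated as corrupt if it contains two consecutive
# separators or if it starts with a separator.
CORRUPT = (
    "SELECT id FROM items "
    "WHERE vals LIKE '%;;%' OR vals LIKE ';%' "
    "ORDER BY id"
)

corrupt_ids = [row[0] for row in conn.execute(CORRUPT)]
print(corrupt_ids)
```

Running the same selection after data cleansing and checking that it returns no rows corresponds to the verification step described above.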

A more extensive set of operations than those available with the LIKE operator (including concatenation and closure) is employed in classical regular expressions studied in Theoretical Computer Science. An even more powerful set of operations is offered by practical regular expressions, which may contain back references [2].

2 Preliminaries

For basic definitions related to formal languages, finite automata, and computational complexity we refer to [13].

The star-free languages are those regular languages obtained by replacing the star-operator with complement in regular expressions. Cohen and Brzozowski [5] defined a hierarchy of star-free languages according to the notion of dot-depth. For an alphabet $\Sigma=\{a_{1},\ldots,a_{k}\}$ the family $E_{0}$ consists of the basic languages $\{a_{1}\},\ldots,\{a_{k}\},\{\varepsilon\}$ (where $\varepsilon$ denotes the empty string). If $X$ is a family of languages, then we denote by $B(X)$ the boolean closure of $X$ and by $M(X)$ the closure of $X$ under concatenation. Define the following hierarchy of language families:

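$$B_{1}=B(E_{0}),\qquad M_{n}=M(B_{n})\ \text{for } n\geq 1,\qquad B_{n}=B(M_{n-1})\ \text{for } n\geq 2.$$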

Obviously these families form a hierarchy:

$$E_{0}\subseteq B_{1}\subseteq M_{1}\subseteq B_{2}\subseteq M_{2}\subseteq\cdots$$

In [5] it is shown that the hierarchy is strict up to dot-depth 2 ($B_{3}$), leaving open whether the upper levels can be separated. This open problem was resolved in [3] by showing that the hierarchy is strict. The dot-depth $d(R)$ of a language $R$ is defined to be $n$ if $R\in B_{n+1}\setminus B_{n}$.

The LIKE operator of SQL admits defining patterns in WHERE clauses which can be matched against string-valued columns. Each symbol represents itself except for certain meta-characters, among which the most important is % as a wildcard matching zero or more characters. The symbol _ is a substitute for an arbitrary single character. Like further syntactic enhancements (character sets and complements of such sets), the _ wildcard can be seen as a (very convenient) shorthand for an enumeration of patterns for every symbol in the alphabet. If wildcard symbols are required as literals in a pattern, an escape symbol can be declared that enforces a literal interpretation of % and _.
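The behaviour of the two wildcards and of an escape symbol can be tried out with a small Python sketch based on the built-in sqlite3 module. The helper name sql_like and the choice of ! as escape symbol are assumptions for illustration; other SQL systems may differ in details such as case sensitivity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def sql_like(text, pattern):
    """Evaluate `text LIKE pattern` in SQLite with '!' declared as escape symbol."""
    row = conn.execute("SELECT ? LIKE ? ESCAPE '!'", (text, pattern)).fetchone()
    return bool(row[0])

# % matches zero or more characters, _ matches exactly one character.
print(sql_like("abc", "a%"))
print(sql_like("abc", "a_c"))
print(sql_like("abc", "a_"))

# After the escape symbol, % is interpreted literally.
print(sql_like("100%", "100!%"))
print(sql_like("1000", "100!%"))
```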

The more powerful operator SIMILAR TO or the Oracle® function REGEXP_LIKE implement general regular expression matching in SQL (the latter even for extended regular expressions).

The following table compares different notations of the variants Practical Regular Expressions (PRE), Classical Regular Expressions (CRE) [13], Star-Free Expressions (SFE), and LIKE patterns. By ’impl.’ we denote the implicit notation of the empty string or of concatenation by juxtaposition of neighboring symbols. $\Sigma$ is not part of the syntax of CRE or SFE but a common abbreviation:

[TABLE]

CRE and SFE include a notation for the empty set, which is not relevant for practical purposes and thus does not have a counterpart in PRE or LIKE patterns. PRE may include as “syntactic sugar” the notations $[\alpha_{1}\alpha_{2}\ldots\alpha_{n}]$ for the set of characters $\{\alpha_{1},\alpha_{2},\ldots,\alpha_{n}\}$ and $[\alpha_{1}-\alpha_{n}]$ for the range of consecutive characters $\alpha_{1}$ to $\alpha_{n}$ (this assumes some specific encoding). Notation $[\mbox{\^{}}\alpha_{1}\alpha_{2}\ldots\alpha_{n}]$ and $[\mbox{\^{}}\alpha_{1}-\alpha_{n}]$ denote the complements of these sets of characters. Other extensions are the notation $e?$ that denotes zero or one occurrence of expression $e$ and $e\{n\}$ that denotes exactly $n$ occurrences. None of these operators increases the expressive power of regular expressions, but they may lead to significantly shorter expressions than possible with CRE.

One extension of PRE that goes beyond regular languages is the use of back references. The $k$-th subexpression put into parentheses can be referenced by $\backslash k$, which matches the same string as matched by the subexpression.

For CRE the membership problem asks whether the entire input text matches a given pattern. In practice we are more interested in one or even all substrings within the input text matching the pattern. From the latter set of substrings the answer to the decision problem can easily be derived and lower bounds above polynomial time carry over (notice that the number of substrings of a text of length $n$ is ${{n+1}\choose 2}=\Theta(n^{2})$). We can enforce a match of a PRE $\alpha$ with the entire input text by enclosing it into “anchors” and matching with $\mbox{\^{}}\alpha\$$. Conversely, the CRE $\Sigma^{*}\alpha\Sigma^{*}$ simulates the PRE $\alpha$. We conclude that upper and lower bounds for CRE membership and PRE matching coincide.

Since LIKE patterns are rather restricted (see Section 3) we also consider boolean formulas containing LIKE patterns (LIKE expressions) and boolean formulas without negations (monotone LIKE expressions).

Definition 1

A language $L\subseteq\Sigma^{*}$ is LIKE-characterizable if it is a set of strings satisfying a boolean combination of LIKE pattern matching conditions.

We summarize known complexity results for some decision problems related to regular expressions:

[TABLE]

3 Expressive Power

In this section we briefly discuss the power of LIKE patterns and LIKE expressions in comparison to the dot-depth hierarchy as defined in [5].

It is clear that the languages of family $E_{0}$ can be characterized by LIKE patterns of the form $a_{i}$ . Family $B_{1}$ is incomparable to the languages characterized by LIKE patterns: For an alphabet $\Sigma$ with $|\Sigma|\geq 2$ the language $L_{1}=\{a_{1},\varepsilon\}$ is clearly in $B_{1}$ (a boolean combination of basic languages), but a LIKE pattern characterizing a finite language can contain different words via _ only, which allows for words of the same length only. Thus $L_{1}$ cannot be characterized by a LIKE pattern. Conversely, the LIKE pattern $00$ defines the language $L_{2}=\{00\}$ , which cannot be expressed as a boolean combination of basic languages. Monotone LIKE expressions can describe all finite languages, but also all co-finite languages. Therefore $B_{1}$ is properly contained in the languages characterized by monotone LIKE expressions (separation by $L_{2}$ ).

Every language $R$ in family $B_{2}$ can be denoted in the form

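$$\bigcup_{k=1}^{\ell}\left(\bigcap_{i=1}^{m(k)}\overline{w_{0}^{k,i}\Sigma^{*}w_{1}^{k,i}\Sigma^{*}\cdots\Sigma^{*}w_{s(k,i)}^{k,i}}\cap\bigcap_{j=1}^{n(k)}u_{0}^{k,j}\Sigma^{*}u_{1}^{k,j}\Sigma^{*}\cdots\Sigma^{*}u_{t(k,j)}^{k,j}\right)$$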

with $w_{p}^{k,i}$ , $u_{q}^{k,j}$ words and $m(k)$ , $n(k)$ , $\ell$ , $s(k,i)$ , $t(k,j)$ non-negative integers [5, Lemma 2.8]. This representation translates directly to a LIKE expression. Given a LIKE expression, every pattern containing wildcards _ can be replaced by an enumeration of patterns substituting the alphabet symbols for wildcards. All negations can be moved to the LIKE operators applying De Morgan’s laws. The resulting expression characterizes a set in $B_{2}$ .
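The replacement of the wildcard _ by an enumeration of patterns can be illustrated by a short Python sketch. The function name expand_underscores and the representation of the alphabet as a string of single-character symbols are our own assumptions for illustration.

```python
from itertools import product

def expand_underscores(pattern, alphabet):
    """Return all patterns obtained by substituting alphabet symbols for every '_'.

    A string matches the original pattern if and only if it matches
    at least one of the returned patterns (an OR over the list).
    """
    # Each '_' offers every alphabet symbol; any other symbol offers only itself.
    choices = [alphabet if ch == "_" else ch for ch in pattern]
    return ["".join(combo) for combo in product(*choices)]

print(expand_underscores("a_%", "01"))
print(expand_underscores("__", "01"))
```

The number of resulting patterns is $|\Sigma|^{r}$ for a pattern with $r$ occurrences of _, so the enumeration serves the argument about expressive power and is not meant as an efficient procedure.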

We are thus led to the following observation:

Observation 1

The class of LIKE-characterizable languages coincides with the family $B_{2}$, that is, with the class of languages of dot-depth at most 1.

An example of a star-free language shown to be of dot-depth 2 (and therefore not LIKE-characterizable) is $(0+1+2)^{*}02^{*}$ from [5, Lemma 2.9].

Finally we sketch why monotone LIKE expressions are weaker than general LIKE expressions. We claim that monotone LIKE expressions cannot express that strings are formed over a proper subset $\Sigma^{\prime}$ of the underlying alphabet $\Sigma$ (which we assume to contain at least two symbols). Suppose a monotone LIKE expression $e$ can express this restriction. Choose a string $w$ over $\Sigma^{\prime}$ which is longer than $e$ . Then $w$ matches $e$ and at least one symbol of $w$ matches wildcards only. This symbol can be substituted by a symbol from $\Sigma\setminus\Sigma^{\prime}$ . The resulting string still matches $e$ , contradicting the assumption.

4 Computational Complexity

We first introduce a syntactical transformation of patterns that will simplify the subsequent algorithms.

Definition 2

A LIKE pattern is called normalized, if it contains none of the substrings %_ and %%.

Consider an arbitrary string $w\in\{\%,\_\}^{*}$ consisting of wildcards. Let $w^{\prime}$ consist of as many symbols _ as occur in $w$, followed by a single trailing % if and only if $w$ contains %. Then $w$ and $w^{\prime}$ match the same strings over the base alphabet. Since $w^{\prime}$ is normalized we obtain:

Proposition 1

For every LIKE pattern there is an equivalent normalized LIKE pattern.

Normalization cannot in general identify equivalent patterns. As an example take the patterns $\%01\%$ and $\%0\%1\%$ over the binary alphabet $\{0,1\}$. Obviously, any string matching the first pattern matches the second. But the converse is also true, because there is a left-most $1$ between the two constants of the pattern (including the $1$) and it is preceded by a $0$. Over the alphabet $\{0,1,2\}$, the patterns are separated by $021$.
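The example can be checked by exhaustive search over short strings. The following Python sketch is only a sanity check up to a chosen length bound and not a proof of equivalence; the helper names like_to_regex and separating_string are assumptions, and LIKE patterns (without escape symbols) are evaluated by translation to Python regular expressions.

```python
import re
from itertools import product

def like_to_regex(pattern):
    """Translate a LIKE pattern without escape symbols into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")      # zero or more arbitrary symbols
        elif ch == "_":
            parts.append(".")       # exactly one arbitrary symbol
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)

def separating_string(p1, p2, alphabet, max_len):
    """Return a shortest string of length <= max_len matched by exactly one pattern, else None."""
    r1, r2 = like_to_regex(p1), like_to_regex(p2)
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            s = "".join(letters)
            if bool(r1.fullmatch(s)) != bool(r2.fullmatch(s)):
                return s
    return None

print(separating_string("%01%", "%0%1%", "01", 8))    # binary alphabet
print(separating_string("%01%", "%0%1%", "012", 4))   # ternary alphabet
```

Over the binary alphabet no separating string of length at most 8 exists, while over $\{0,1,2\}$ the search returns the string $021$ named above.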

Lemma 1

LIKE patterns can be normalized in deterministic logarithmic space.

Proof. Any input can be written as $x_{0}w_{0}\cdots x_{n}w_{n}x_{n+1}$ where $w_{0},\ldots,w_{n}\in\{\%,\_\}^{*}$ and $x_{0},\ldots,x_{n+1}\in\Sigma^{*}$ for the underlying alphabet $\Sigma$.

A deterministic Turing machine $M$ scans the input and directly outputs any symbol from $\Sigma$ . For every string $w_{i}$ of consecutive wildcards, the number $m$ of occurrences of _ is counted and a flag is maintained indicating the presence of %. At the end of $w_{i}$ , machine $M$ outputs $m$ symbols _ and an optional % if the flag is set.

Since $M$ has to store counters bounded by the input length, it can do so in logarithmic space if the counters are encoded in binary notation. $\Box$
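The counter-and-flag idea of the proof can be sketched in Python as follows. The function name normalize is an assumption, and escape symbols are not handled. The sketch collects its output in a list for convenience, so it mirrors the behaviour of machine $M$ but is not itself a logarithmic-space implementation.

```python
def normalize(pattern):
    """Rewrite every maximal run of wildcards as its '_' symbols followed by at most one '%'."""
    out = []
    underscores = 0      # number of '_' seen in the current wildcard run
    percent = False      # flag: the current wildcard run contains '%'

    def flush():
        # Emit the normalized form of the wildcard run that just ended.
        nonlocal underscores, percent
        out.append("_" * underscores)
        if percent:
            out.append("%")
        underscores, percent = 0, False

    for ch in pattern:
        if ch == "_":
            underscores += 1
        elif ch == "%":
            percent = True
        else:
            flush()
            out.append(ch)   # symbols of the base alphabet are copied directly
    flush()
    return "".join(out)

print(normalize("a%_%%_b"))
```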

Theorem 1

Matching with a LIKE pattern can be done in deterministic logarithmic space.

Proof. If the pattern contains no %, in a single scan the constant symbols in the pattern are compared and for every _ in the pattern a symbol in the text is skipped.

By Lemma 1 we can assume that any LIKE-pattern containing % has the form $p=a_{1}\%a_{2}\%\cdots\%a_{n}$ where $a_{i}\in(\Sigma\cup\{\_\})^{*}$ . We first argue that a greedy matching strategy suffices for checking whether $p$ matches a text $t$ . Suppose in a given matching $i$ is minimal with the property that $a_{i}$ could be matched further to the start of the text (but after $a_{i-1}$ ). Then a new match can be obtained by moving $a_{i}$ to the first occurrence. Carrying out this operation for all $a_{i}$ leads to a greedy matching.

For every $a_{i}$ a left-most match can be determined by comparing the constant part and shifting the position in the text if a mis-match occurs. Once an $a_{i}$ has been matched, it is not necessary to reconsider it by the argument above.

In logarithmic space pointers into pattern and text can be stored and by scanning $p$ and $t$ in parallel a greedy matching can be determined. $\Box$
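A minimal Python sketch of the greedy strategy is given below; the function names are assumptions and escape symbols are not handled. The sketch anchors the first segment at the start and the last segment at the end of the text and places the segments in between at their left-most possible positions. Apart from the split of the pattern, which is done for readability, it only maintains positions in pattern and text.

```python
def segment_matches_at(segment, text, pos):
    """Check whether a '%'-free segment matches text starting at position pos."""
    if pos + len(segment) > len(text):
        return False
    return all(p == "_" or p == text[pos + i] for i, p in enumerate(segment))

def like_match(pattern, text):
    """Greedy matching of a LIKE pattern (without escape symbols) against the entire text."""
    segments = pattern.split("%")
    if len(segments) == 1:
        # No '%': constants are compared and every '_' skips one text symbol.
        return len(text) == len(pattern) and segment_matches_at(pattern, text, 0)

    first, *middle, last = segments
    if len(first) + len(last) > len(text):
        return False
    if not segment_matches_at(first, text, 0):
        return False
    if not segment_matches_at(last, text, len(text) - len(last)):
        return False

    pos = len(first)
    limit = len(text) - len(last)    # middle segments must end before the suffix
    for seg in middle:
        # Shift to the left-most position at which the segment matches.
        while pos + len(seg) <= limit and not segment_matches_at(seg, text, pos):
            pos += 1
        if pos + len(seg) > limit:
            return False
        pos += len(seg)              # a matched segment is never reconsidered
    return True

print(like_match("%01%", "1101"))
print(like_match("%a%a%", "a"))
```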

We have the following (weaker) lower bound for the membership problem:

Theorem 2

Matching with a LIKE pattern cannot be done by constant-depth, polynomial-size, unbounded fan-in circuits (it is not in AC0).

Proof. Recall from [8] that the majority predicate on $n$ binary variables is $1$ if and only if more than half of the input values are 1. We map a given input $x$ for the majority predicate to the pattern $\%(1\%)^{\lceil(|x|+1)/2\rceil}$. String $x$ matches the pattern if and only if $x$ contains at least $\lceil(|x|+1)/2\rceil>|x|/2$ symbols $1$, which is majority. By the result [8, Theorem 4.3] this predicate is not in AC0. $\Box$
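The mapping used in the proof can be checked on short inputs with the following Python sketch. The names majority_pattern, like_match and agrees_with_majority are assumptions; matching is delegated to Python regular expressions, and the check is exhaustive only up to the chosen length bound.

```python
import re
from itertools import product

def like_match(pattern, text):
    """Match a LIKE pattern without escape symbols against the entire text."""
    regex = "".join(
        ".*" if c == "%" else "." if c == "_" else re.escape(c) for c in pattern
    )
    return re.fullmatch(regex, text, re.DOTALL) is not None

def majority_pattern(n):
    """Build the pattern %(1%)^k with k = ceil((n + 1) / 2) for inputs of length n."""
    k = (n + 2) // 2     # integer form of ceil((n + 1) / 2)
    return "%" + "1%" * k

def agrees_with_majority(max_len):
    """Compare pattern matching with the majority predicate on all inputs up to max_len."""
    for n in range(max_len + 1):
        p = majority_pattern(n)
        for bits in product("01", repeat=n):
            x = "".join(bits)
            if like_match(p, x) != (2 * x.count("1") > n):
                return False
    return True

print(majority_pattern(4))
print(agrees_with_majority(8))
```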

Since the evaluation of boolean formulas is possible in logarithmic space, we obtain from Theorem 1:

Corollary 1

Matching with a LIKE expression can be done in deterministic logarithmic space.

Considering equivalence of LIKE patterns, a test using syntactical properties alone seems to be impossible because of the example given above.

Based on Theorem 1 we can obtain the following upper bound:

Corollary 2

Inequivalence of LIKE patterns is in nondeterministic logarithmic space.

Proof. Guess a separating text symbol by symbol and match with the given patterns in logarithmic space. $\Box$

Theorem 3

Nonemptiness of monotone LIKE-expressions is complete in NP.

Proof. For membership in NP consider a string $w$ matching a given expression $e$ . We claim that there is no loss of generality in assuming $|w|\leq|e|$ . We fix a matching of $w$ by $e$ . For every OR in expression $e$ there has to be at least one sub-expression matching $w$ . We delete the other sub-expression and continue this process until there is no OR left obtaining $e^{\prime}$ . Clearly $|e^{\prime}|\leq|e|$ . Now we mark every symbol of $w$ matched by a constant or _. At most $|e^{\prime}|$ symbols of $w$ will thus be marked and the others have to be matched by %. Deleting these symbols yields a string $w^{\prime}$ matching $e$ with $|w^{\prime}|\leq|e^{\prime}|\leq|e|$ . The NP algorithm simply consists in guessing a string $w$ with $|w|\leq|e|$ , writing it onto the work tape, and checking membership according to Corollary 1.

For hardness we reduce the satisfiability problem of boolean formulas in 3-CNF (3SAT) to the nonemptiness problem. It is well-known that 3SAT is complete in NP [13]. Let

[TABLE]

be a formula in CNF over variables $x_{1},\ldots,x_{n}$ . The idea is to enumerate all satisfied literals in a string that matches a monotone LIKE-expression. We form a set of LIKE patterns over the alphabet $\{x_{1},\ldots,x_{n},\bar{x}_{1},\ldots,\bar{x}_{n}\}$ that are joined by AND:

• $\_^{n}$ (there are exactly $n$ literals).

• For $1\leq i\leq n$ an OR of the patterns $\%x_{i}\%$ and $\%\bar{x}_{i}\%$ (for every variable at least one literal is true).

• For every clause $\alpha_{k}\vee\beta_{k}\vee\gamma_{k}$ an OR of the patterns $\%\alpha_{k}\%$, $\%\beta_{k}\%$, and $\%\gamma_{k}\%$ (at least one literal is true in every clause).

Suppose that $F$ is satisfied by some assignment of boolean values to $x_{1},\ldots,x_{n}$ . Concatenate the satisfied literal for each variable to form a string to be matched. This string clearly matches all patterns defined above. Conversely, if a string matches all patterns it contains at least one literal per variable by the second item. The length restriction to $n$ symbols implies that exactly one literal per variable is included. These literals define a truth assignment in the obvious way and by the third item every clause is satisfied by this assignment. $\Box$
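The reduction can be tried out on small formulas with the following Python sketch. Several choices are assumptions made for illustration: literals are given as signed integers, the literal $x_{i}$ is encoded as the $i$-th upper-case letter and $\bar{x}_{i}$ as the corresponding lower-case letter, the literal patterns are read as containment conditions of the form %a%, and nonemptiness is decided by brute force over all strings of length $n$, which takes exponential time and is feasible for tiny instances only.

```python
import re
from itertools import product

def like_match(pattern, text):
    """Case-sensitive match of a LIKE pattern without escape symbols against the entire text."""
    regex = "".join(
        ".*" if c == "%" else "." if c == "_" else re.escape(c) for c in pattern
    )
    return re.fullmatch(regex, text, re.DOTALL) is not None

def literal_symbol(lit):
    """Encode literal +i as an upper-case letter and -i as the matching lower-case letter."""
    letter = chr(ord("A") + abs(lit) - 1)
    return letter if lit > 0 else letter.lower()

def reduce_3sat(clauses, n):
    """Return the monotone LIKE expression as a list (AND) of lists (OR) of patterns."""
    expr = [["_" * n]]                                   # exactly n literals
    for i in range(1, n + 1):                            # one literal per variable
        expr.append(["%" + literal_symbol(i) + "%", "%" + literal_symbol(-i) + "%"])
    for clause in clauses:                               # one true literal per clause
        expr.append(["%" + literal_symbol(lit) + "%" for lit in clause])
    return expr

def find_witness(expr, n):
    """Search all strings of length n for one that satisfies the expression."""
    alphabet = [literal_symbol(s * i) for i in range(1, n + 1) for s in (1, -1)]
    for letters in product(alphabet, repeat=n):
        w = "".join(letters)
        if all(any(like_match(p, w) for p in disjunction) for disjunction in expr):
            return w
    return None

satisfiable = [(1, 2, 3), (-1, -2, 3)]
unsatisfiable = [(a * 1, b * 2, c * 3) for a in (1, -1) for b in (1, -1) for c in (1, -1)]
print(find_witness(reduce_3sat(satisfiable, 3), 3))
print(find_witness(reduce_3sat(unsatisfiable, 3), 3))
```

A returned string lists one literal per variable and can be read directly as a satisfying assignment; for the unsatisfiable formula no string is found.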

Lemma 2

For a deterministic Turing machine $M$ with input $w$ and space bound $s(|w|)$ , a LIKE-expression $e$ with the following properties can be constructed:

1. All LIKE conditions are negative.

2. The LIKE-expression $e$ is of size $O(s^{2}(|w|))$.

3. If $M$ accepts $w$ within space $s(|w|)$, there is a single string matching $e$.

4. If $M$ does not accept $w$ within space $s(|w|)$, the language described by $e$ is empty.

Proof. Without loss of generality we assume that $M$ accepts with a blank tape and the tape head on the left-most tape cell. We denote the input length by $n=|w|$ .

In order to simplify the presentation we first use arbitrary LIKE conditions. We encode a computation of $M$ as a sequence of configurations over the alphabet $\Gamma\cup Q$ (tape alphabet and set of states). A configuration $uqv$ encodes the tape inscription $uv$ , current state $q$ and head position on the first symbol of $v$ . A computation consisting of $k$ steps is encoded as $\#c_{0}\#c_{1}\#\cdots\#c_{k}\#$ . Configuration $c_{0}$ is $q_{0}w$ followed by $s(n)-n$ blanks and for $i\geq 1$ configuration $c_{i-1}$ yields $c_{i}$ by $M$ ’s transition function. We therefore identify the following patterns:

1. $\#c_{0}\#\%$ (start configuration).

2. $\%\#c_{\mbox{accept}}\#$ (accepting configuration).

3. For every $\delta(q_{i},b)=(q_{j},c,L)$ negative patterns $aq_{i}b\_^{s(n)}def$ with $def\neq q_{j}ac$.

4. For every $\delta(q_{i},b)=(q_{j},c,R)$ negative patterns $aq_{i}b\_^{s(n)}def$ with $def\neq acq_{j}$.

5. Negative patterns $abc\_^{s(n)}d$ with $a,b,c\in\Gamma\cup\{\#\}$ and $b\neq d$ (portions of the tape not affected by the computation).

For each of the patterns in item 1 and 2 we can substitute $(s(n)+2)(|\Gamma|+|Q|)$ equivalent negative patterns that exclude all but one symbol from $\Gamma\cup Q\cup\{\#\}$ at position $i$ with $1\leq i\leq s(n)+2$ from the start resp. end of the string. $\Box$

Lemma 3

Inequivalence of LIKE-expressions can be decided nondeterministically in linear space.

Proof. For two given expressions guess a string symbol by symbol and mark in every pattern the positions reachable by matching the guessed string. When a separating string has been found, both expressions are evaluated and it is checked that exactly one of the expressions matches. $\Box$

The previous lemmas can be summarized in the following way:

Theorem 4

Equivalence of monotone as well as of arbitrary LIKE-expressions is complete in PSPACE.

5 Discussion

We investigated the expressive power and computational complexity of the LIKE operator. For the more powerful monotone and general LIKE expressions we classified the complexity of nonemptiness and equivalence. In the case of membership we could establish the upper bound L (deterministic logarithmic space). This is believed to be of lower complexity than the general membership problem for CRE, which is complete in NL [9]. Membership for a single LIKE pattern is not decidable by the highly parallel AC0 circuits. It remains open what the exact complexity of the latter problem and of inequivalence is.

Acknowledgement

Many thanks to Manfred Kufleitner for information about star-free languages.

Bibliography

[1] Oracle® Database SQL Reference 10g Release 1. https://docs.oracle.com/cd/B13789_01/server.101/b10759/conditions016.htm.
[2] A. V. Aho. Algorithms for finding patterns in strings. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science: Volume A, Algorithms and Complexity, pages 255–300. MIT Press, Cambridge, MA, 1990.
[3] J. A. Brzozowski and R. Knast. The dot-depth hierarchy of star-free languages is infinite. JCSS, 16:37–55, 1978.
[4] M. Benedikt, L. Libkin, T. Schwentick, and L. Segoufin. String operations in query languages. In P. Buneman, editor, Proceedings of the Twentieth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 21-23, 2001, Santa Barbara, California, USA, pages 183–194, 2001.
[5] R. S. Cohen and J. A. Brzozowski. Dot-depth of star-free events. JCSS, 5:1–16, 1971.
[6] D. D. Freydenberger. Extended regular expressions: Succinctness and decidability. In T. Schwentick and C. Dürr, editors, Proceedings of the 28th Annual Symposium on Theoretical Aspects of Computer Science (STACS 11), Leibniz International Proceedings in Informatics, pages 507–518, Schloss Dagstuhl, 2011. Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany.
[7] M. Fürer. Nicht-elementare untere Schranken in der Automaten-Theorie. PhD thesis, ETH Zürich, 1978.
[8] M. Furst, J. B. Saxe, and M. Sipser. Parity, Circuits, and the Polynomial-Time Hierarchy. Math. Systems Theory, 17:13–27, 1984.