
TL;DR
This paper explores the expressive capabilities and computational complexity of the LIKE operator in SQL, providing insights into its theoretical limitations and practical implications.
Contribution
It offers a detailed analysis of the LIKE operator's expressive power and complexity, which was previously not thoroughly understood.
Findings
Characterizes the complexity class of LIKE pattern matching
Identifies limitations in expressiveness for certain pattern classes
Provides theoretical bounds for LIKE operator's computational requirements
Abstract
We investigate the expressive power and complexity questions for the LIKE operator in SQL.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
LIKE Patterns and Complexity
Holger Petersen
Reinsburgstr. 75
70197 Stuttgart
Germany
Abstract
We investigate the expressive power and complexity questions for the LIKE operator in SQL.
1 Introduction
Regular expressions conveniently support the analysis of software defects involving strings stored in a data base and the subsequent selection of test data for checking the effectiveness of data cleansing. As an example take a list of values separated by a special symbol. When manipulating strings of this form, it might happen that separator symbols are stored consecutively or strings start with a separator. This data corruption possibly leads to problems when displaying the data or generating export files.
A very restricted variant of regular expressions we will consider are patterns for the LIKE operator available in SQL (Structured Query Language) [1]. It admits defining patterns including constants and wild-card symbols representing single letters or arbitrary strings. Since our investigations are motivated by defect analysis and test data selection, which by definition may not modify data, we assume that new auxiliary columns for holding intermediate values cannot be defined.
Continuing the example given above, we can select corrupt strings in an obvious way. After data cleansing, the same selections can verify the correctness of the resulting data. It is known that LIKE pattern matching can define star-free languages only [4, Section 4.2]. In Section 3 we will explore what classes of languages known from the literature are characterized by LIKE patterns and their boolean combinations.
A more extensive set of operations than those available with the LIKE operator (including concatenation and closure) is employed in classical regular expressions studied in Theoretical Computer Science. An even more powerful set of operations is offered by practical regular expressions, which may contain back references [2].
2 Preliminaries
For basic definitions related to formal languages, finite automata, and computational complexity we refer to [13].
The star-free languages are those regular languages obtained by replacing the star-operator with complement in regular expressions. Cohen and Brzozowski [5] defined a hierarchy of star-free languages according to the notion of dot-depth. For an alphabet the family consists of the basic languages (where denotes the empty string). If is a family of languages, then we denote by the boolean closure of and by the closure of under concatenation. Define the following hierarchy of language families:
[TABLE]
Obviously these families form a hierarchy:
[TABLE]
In [5] its is shown that the hierarchy is strict up to dot-depth 2 (), leaving open whether the upper levels can be separated. This open problem was resolved in [3] by showing that the hierarchy is strict. The dot-depth for a language is defined to be if .
The LIKE operator of SQL admits defining patterns in WHERE clauses which can be matched against string valued columns . Each symbol represents itself except for certain meta-characters, among which the most important is % as a wildcard matching zero or more characters. Symbol _ is a substitute for an arbitrary single character. Similarly as further syntactic enhancements (character sets and complements of such sets), the _ wildcard can be seen as a (very convenient) shorthand for an enumeration of patterns for every symbol in the alphabet. If wildcard symbols are required in a pattern, an escape symbol can be declared that enforces a literal interpretation of % and _.
The more powerful operator SIMILAR TO or the Oracle*®* function REGEXP_LIKE implement general regular expression matching in SQL (the latter even for extended regular expressions).
The following table compares different notations of the variants Practical Regular Expressions (PRE), Classical Regular Expressions (CRE) [13], Star-Free Expressions (SFE), and LIKE Patterns 111By ’impl.’ we denote the implicit notation of the empty string or concatentaion by juxtaposition of neighboring symbols. is not part of the syntax of CRE or SFE but a common abbreviation:
[TABLE]
CRE and SFE include a notation for the empty set, which is not relevant for practical purposes and thus does not have a counterpart in PRE or LIKE patterns. PRE may include as “syntactic sugar” the notations for the set of characters and for the range of consecutive characters to (this assumes some specific encoding). Notation and denote the complements of these sets of characters. Other extensions are the notation that denotes zero or one occurrence of expression and that denotes exactly occurrences. None of these operators increases the expressive power of regular expressions, but they may lead to significantly shorter expressions than possible with CRE.
One extension of PRE that goes beyond regular languages is the use of back references. The -th subexpression put into parentheses can be referenced by , which matches the same string as matched by the subexpression.
For CRE the membership problem asks whether the entire input text matches a given pattern. In practice we are more interested on one or even all substrings within the input text matching the pattern. From the latter set of substrings the answer to the decision problem can easily be derived and lower bounds above polynomial time carry over (notice that the number of substrings of a text of length is ). We can enforce a match of a PRE with the entire input text by enclosing it into “anchors” and matching with \mbox{\^{}}\alpha\$$. Conversely, the CRE \Sigma^{}\alpha\Sigma^{}\alpha$. We conclude that upper and lower bounds for CRE membership and PRE matching coincide.
Since LIKE patterns are rather restricted (see Section 3) we also consider boolean formulas containing LIKE patterns (LIKE expressions) and boolean formulas without negations (monotone LIKE expressions).
Definition 1
A language is LIKE-characterizable if it is a set of strings satisfying a boolean combination of LIKE pattern matching conditions.
We summarize known complexity results for some decision problems related to regular expressions:
[TABLE]
3 Expressive Power
In this section we briefly discuss the power of LIKE patterns and LIKE expressions in comparison to the dot-depth hierarchy as defined in [5].
It is clear that the languages of family can be characterized by LIKE patterns of the form . Family is incomparable to the languages characterized by LIKE patterns: For an alphabet with the language is clearly in (a boolean combination of basic languages), but a LIKE pattern characterizing a finite language can contain different words via _ only, which allows for words of the same length only. Thus cannot be characterized by a LIKE pattern. Conversely, the LIKE pattern defines the language , which cannot be expressed as a boolean combination of basic languages. Monotone LIKE expressions can describe all finite languages, but also all co-finite languages. Therefore is properly contained in the languages characterized by monotone LIKE expressions (separation by ).
Every language in family can be denoted in the form
[TABLE]
with , words and , , , , non-negative integers [5, Lemma 2.8]. This representation translates directly to a LIKE expression. Given a LIKE expression, every pattern containing wildcards _ can be replaced by an enumeration of patterns substituting the alphabet symbols for wildcards. All negations can be moved to the LIKE operators applying De Morgan’s laws. The resulting expression characterizes a set in .
We are thus led to the following observation:
Observation 1
The class of LIKE-characterizable languages coincides with the class of languages of dot-depth 1.
An example of a star-free language shown to be of dot-depth 2 (and therefore not LIKE-characterizable) is from [5, LEMMA 2.9].
Finally we sketch why monotone LIKE expressions are weaker than general LIKE expressions. We claim that monotone LIKE expressions cannot express that strings are formed over a proper subset of the underlying alphabet (which we assume to contain at least two symbols). Suppose a monotone LIKE expression can express this restriction. Choose a string over which is longer than . Then matches and at least one symbol of matches wildcards only. This symbol can be substituted by a symbol from . The resulting string still matches , contradicting the assumption.
4 Computational Complexity
We first introduce a syntactical transformation of patterns that will simplify the subsequent algorithms.
Definition 2
A LIKE pattern is called normalized, if it contains none of the substrings %_ and %%.
Consider an arbitrary string consisting of wildcards. If matches a string over the base alphabet, then a string containing the same number of the symbol _ and a trailing % if and only if contains % matches as well. Since is normalized we obtain:
Proposition 1
For every LIKE pattern there is an equivalent normalized LIKE pattern.
Normalization cannot in general identify equivalent patterns. As an example take the patterns and over the binary alphabet . Obviously, any string matching the first pattern matches the second. But the converse is also true, because there is a left-most between the two constants of the pattern (including the ) and it is preceded by a [math]. Over the alphabet , the patterns are separated by .
Lemma 1
LIKE patterns can be normalized in deterministic logarithmic space.
Proof. Any input can be written as where and for the underlying alphabet .
A deterministic Turing machine scans the input and directly outputs any symbol from . For every string of consecutive wildcards, the number of occurrences of _ is counted and a flag is maintained indicating the presence of %. At the end of , machine outputs symbols _ and an optional % if the flag is set.
Since has to store counters bounded by the input length, it can do so in logarithmic space if the counters are encoded in binary notation.
Theorem 1
Matching with a LIKE pattern can be done in deterministic logarithmic space.
Proof. If the pattern contains no %, in a single scan the constant symbols in the pattern are compared and for every _ in the pattern a symbol in the text is skipped.
By Lemma 1 we can assume that any LIKE-pattern containing % has the form where . We first argue that a greedy matching strategy suffices for checking whether matches a text . Suppose in a given matching is minimal with the property that could be matched further to the start of the text (but after ). Then a new match can be obtained by moving to the first occurrence. Carrying out this operation for all leads to a greedy matching.
For every a left-most match can be determined by comparing the constant part and shifting the position in the text if a mis-match occurs. Once an has been matched, it is not necessary to reconsider it by the argument above.
In logarithmic space pointers into pattern and text can be stored and by scanning and in parallel a greedy matching can be determined.
We have the following (weaker) lower bound for the membership problem:
Theorem 2
Matching with a LIKE pattern cannot be done by constant-depth, polynomial-size, unbounded fan-in circuits (it is not in AC0).
Proof. Recall from [8] that the majority predicate on binary variables is if and only if more than half of the input values are 1. We map a given input for the majority predicate to the pattern . String matches the pattern only if contains at least symbols , which is majority. By the result [8, Theorem 4.3] this predicate is not in AC0.
Since the evaluation of boolean formulas is possible in logarithmic space, we obtain from Theorem 1:
Corollary 1
Matching with a LIKE expression can be done in deterministic logarithmic space.
Considering equivalence of LIKE patterns, a test using syntactical properties alone seems to be impossible because of the example given above.
Based on Theorem 1 we can obtain the following upper bound:
Corollary 2
Inequivalence of LIKE patterns is in nondeterministic logarithmic space.
Proof. Guess a separating text symbol by symbol and match with the given patterns in logarithmic space.
Theorem 3
Nonemptiness of monotone LIKE-expressions is complete in NP.
Proof. For membership in NP consider a string matching a given expression . We claim that there is no loss of generality in assuming . We fix a matching of by . For every OR in expression there has to be at least one sub-expression matching . We delete the other sub-expression and continue this process until there is no OR left obtaining . Clearly . Now we mark every symbol of matched by a constant or _. At most symbols of will thus be marked and the others have to be matched by %. Deleting these symbols yields a string matching with . The NP algorithm simply consists in guessing a string with , writing it onto the work tape, and checking membership according to Corollary 1.
For hardness we reduce the satisfiability problem of boolean formulas in 3-CNF (3SAT) to the nonemptiness problem. It is well-known that 3SAT is complete in NP [13]. Let
[TABLE]
be a formula in CNF over variables . The idea is to enumerate all satisfied literals in a string that matches a monotone LIKE-expression. We form a set of LIKE patterns over the alphabet that are joined by AND:
- •
(there are exactly literals).
- •
For an OR of the patterns and (for every variable at least one literal is true).
- •
For every clause an OR of the patterns , , and (at least one literal is true in every clause).
Suppose that is satisfied by some assignment of boolean values to . Concatenate the satisfied literal for each variable to form a string to be matched. This string clearly matches all patterns defined above. Conversely, if a string matches all patterns it contains at least one literal per variable by the second item. The length restriction to symbols implies that exactly one literal per variable is included. These literals define a truth assignment in the obvious way and by the third item every clause is satisfied by this assignment.
Lemma 2
For a deterministic Turing machine with input and space bound , a LIKE-expression with the following properties can be constructed:
All LIKE conditions are negative. 2. 2.
The LIKE-expression is of size . 3. 3.
If accepts within space , there is a single string matching . 4. 4.
If does not accept within space , the language described by is empty.
Proof. Without loss of generality we assume that accepts with a blank tape and the tape head on the left-most tape cell. We denote the input length by .
In order to simplify the presentation we first use arbitrary LIKE conditions. We encode a computation of as a sequence of configurations over the alphabet (tape alphabet and set of states). A configuration encodes the tape inscription , current state and head position on the first symbol of . A computation consisting of steps is encoded as . Configuration is followed by blanks and for configuration yields by ’s transition function. We therefore identify the following patterns:
(start configuration). 2. 2.
(accepting configuration). 3. 3.
For every negative patterns with . 4. 4.
For every negative patterns with . 5. 5.
Negative patterns with and (portions of the tape not affected by the computation).
For each of the patterns in item 1 and 2 we can substitute equivalent negative patterns that exclude all but one symbol from at position with from the start resp. end of the string.
Lemma 3
Inequivalence of LIKE-expressions can be decided nondeterministically in linear space.
Proof. For two given expressions guess a string symbol by symbol and mark in every pattern the positions reachable by matching the guessed string. When a separating string has been found, both expressions are evaluated and it is checked that exactly one of the expressions matches.
The previous lemmas can be summarized in the following way:
Theorem 4
Equivalence of monotone as well as of arbitrary LIKE-expressions is complete in PSPACE.
5 Discussion
We investigated the expressive power and computational complexity of the LIKE operator. For the more powerful monotone and general LIKE expressions we classified the complexity of nonemptiness and equivalence. In case of membership we could establish the upper bound L (deterministic logarithmic space). This is believed to be of lower complexity than the general membership problem for CRE, which is complete in NL [9]. Membership for a single LIKE pattern is not decidable by the highly parallel AC0 circuits. It remains open, what the exact complexity of the latter problem and inequivalence is.
Acknowledgement
Many thanks to Manfred Kufleitner for information about star-free languages.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] Oracle ® Database SQL Reference 10g Release 1 . https://docs.oracle.com/ cd/B 13789_01/server.101/b 10759/conditions 016.htm.
- 2[2] A. V. Aho. Algorithms for finding patterns in strings. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science: Volume A, Algorithms and Complexity , pages 255–300. MIT Press, Cambridge, MA, 1990.
- 3[3] J. A. Brzozowski and R. Knast. The dot-depth hierarchy of star-free languages is infinite. JCSS , 16:37–55, 1978.
- 4[4] M. Benedikt, L. Libkin, T. Schwentick, and L. Segoufin. String operations in query languages. In P. Buneman, editor, Proceedings of the Twentieth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 21-23, 2001, Santa Barbara, California, USA , pages 183–194, 2001.
- 5[5] R. S. Cohen and J. A. Brzozowski. Dot-depth of star-free events. JCSS , 5:1–16, 1971.
- 6[6] D. D. Freydenberger. Extended regular expressions: Succinctness and decidability. In T. Schwentick and C. Dürr, editors, Proceedings of the 28th Annual Symposium on Theoretical Aspects of Computer Science (STACS 11) , Leibniz International Proceedings in Informatics, pages 507–518, Schloss Dagstuhl, 2011. Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany.
- 7[7] M. Fürer. Nicht-elementare untere Schranken in der Automaten-Theorie . Ph D thesis, ETH Zürich, 1978.
- 8[8] M. Furst, J. B. Saxe and M. Sipser. Parity, Circuits, and the Polynomial-Time Hierarchy. Math. Systems Theory , 1713–27, 1984.
