On the Complexity of Exact Pattern Matching in Graphs: Binary Strings and Bounded Degree

Massimo Equi; Roberto Grossi; Veli Mäkinen

arXiv:1901.05264·cs.CC·June 4, 2020


TL;DR

This paper establishes a conditional lower bound on the computational complexity of exact pattern matching in labeled graphs with binary labels, showing it cannot be solved faster than quadratic time unless SETH is false, even in restricted graph classes.

Contribution

The paper provides a direct reduction from SETH to the exact pattern matching problem in graphs, strengthening the understanding of its computational hardness and linking it to well-known complexity hypotheses.

Findings

01

Exact pattern matching in graphs is conditionally quadratic-time hard.

02

The problem remains hard even for restricted graph classes like bounded degree and acyclic graphs.

03

Exact and approximate pattern matching are both quadratic-time hard under SETH.



Table 1. State of the art for PMLG. Legend: $V$ = set of nodes, $E$ = set of edges, $occ$ = number of matches for the pattern in the graph, $m$ = pattern length, $N$ = total length of text in all nodes; (1) errors only in the pattern, (2) errors in the graph, (3) matches span only one edge. In the original table, the two rows reporting the best known bounds for exact and approximate pattern matching are highlighted in gray.

Year	Authors	Graph	Exact/Approximate	Time
1992	Manber, Wu [20]	DAG	approximate $^{(1)}$	$O(m|E| + occ \lg\lg m)$
1993	Akutsu [2]	Tree	exact $^{(3)}$	$O(N)$
1995	Park, Kim [22]	DAG	exact $^{(3)}$	$O(N + m|E|)$
1997	Amir et al. [3]	general	exact $^{(3)}$	$O(N + m|E|)$
1997	Amir et al. [3]	general	approximate $^{(2)}$	NP-Hard
1997	Amir et al. [3]	general	approximate $^{(1)}$	$O(Nm \lg N + m|E|)$
1998	Navarro [21]	general	approximate $^{(1)}$	$O(Nm + m|E|)$
2017	Vadaddi et al. [29]	general	approximate $^{(1)}$	$O((|V| + 1)\,m|E|)$
2017	Rautiainen, Marschall [24]	general	approximate $^{(1)}$	$O(N + m|E|)$
2019	Jain et al. [18]	general	approximate $^{(2)}$	NP-Hard on binary alphabet

Equations

$$X=\{x_{i}=\langle b_{1}^{(i)},\ldots,b_{\frac{n}{2}}^{(i)}\rangle \mid 1\leq i\leq 2^{\frac{n}{2}}\}$$

$$Y=\{y_{j}=\langle b_{\frac{n}{2}+1}^{(j)},\ldots,b_{n}^{(j)}\rangle \mid 1\leq j\leq 2^{\frac{n}{2}}\}$$

$$P_{x_{i}}[h]=\begin{cases}\mathtt{c} & \text{if } x_{i}\not\models c_{h}\\ \mathtt{d} & \text{otherwise}\end{cases}$$

$$V_{F}=\{c_{j,h}\mid y_{j}\models c_{h},\ y_{j}\in Y,\ c_{h}\in C\}\cup\{d_{j,h}\mid y_{j}\in Y,\ c_{h}\in C\}\cup\{b,e\}$$

$$L_{F}(u)=\begin{cases}\mathtt{b} & \text{if } u=b\\ \mathtt{e} & \text{if } u=e\\ \mathtt{c} & \text{if } u=c_{j,h}\\ \mathtt{d} & \text{if } u=d_{j,h}\end{cases}$$

$$
\begin{aligned}
E_{F}={}&\{(b,c_{j,1})\mid c_{j,1}\in V_{F}\}\cup\{(b,d_{j,1})\mid d_{j,1}\in V_{F}\}\\
&\cup\{(c_{j,k},e)\mid c_{j,k}\in V_{F}\}\cup\{(d_{j,k},e)\mid d_{j,k}\in V_{F}\}\\
&\cup\{(c_{j,h},c_{j,h+1})\mid c_{j,h},c_{j,h+1}\in V_{F}\}\cup\{(c_{j,h},d_{j,h+1})\mid c_{j,h},d_{j,h+1}\in V_{F}\}\\
&\cup\{(d_{j,h},c_{j,h+1})\mid d_{j,h},c_{j,h+1}\in V_{F}\}\cup\{(d_{j,h},d_{j,h+1})\mid d_{j,h},d_{j,h+1}\in V_{F}\}
\end{aligned}
$$

$$P^{\prime}=\mathtt{e}\mathtt{b}\underbrace{\mathtt{d}\dots\mathtt{d}}_{\frac{n}{2}\text{ times}}P_{x_{1}}^{\prime}\underbrace{\mathtt{d}\dots\mathtt{d}}_{\frac{n}{2}\text{ times}}\mathtt{e}\dots\mathtt{b}\underbrace{\mathtt{d}\dots\mathtt{d}}_{\frac{n}{2}\text{ times}}P_{x_{2^{\frac{n}{2}}}}^{\prime}\underbrace{\mathtt{d}\dots\mathtt{d}}_{\frac{n}{2}\text{ times}}\mathtt{e}\mathtt{b}.$$

$$\alpha(\mathtt{c})=0000,\quad\alpha(\mathtt{d})=1111,\quad\alpha(\mathtt{b})=10,\quad\alpha(\mathtt{e})=01.$$


Full text


Department of Computer Science, University of Helsinki; Dipartimento di Informatica, Università di Pisa; Department of Computer Science, University of Helsinki

Copyright: M. Equi, R. Grossi, V. Mäkinen

Funding: This work has been partially supported by Academy of Finland (grant 309048).

Event: MFCS 2018, August 27–31, 2018, Liverpool, GB. Series volume 117, article no. 84.


Abstract.

Exact pattern matching in labeled graphs is the problem of searching paths of a graph $G=(V,E)$ that spell the same string as the pattern $P[1..m]$. This basic problem can be found at the heart of more complex operations on variation graphs in computational biology, of query operations in graph databases, and of analysis operations in heterogeneous networks, where the nodes of some paths must match a sequence of labels or types. We describe a simple conditional lower bound that, for any constant $\epsilon>0$, an $O(|E|^{1-\epsilon}\,m)$-time or an $O(|E|\,m^{1-\epsilon})$-time algorithm for exact pattern matching on graphs, with node labels and patterns drawn from a binary alphabet, cannot be achieved unless the Strong Exponential Time Hypothesis (SETH) is false. The result holds even if restricted to undirected graphs of maximum degree three or directed acyclic graphs of maximum sum of indegree and outdegree three. Although a conditional lower bound of this kind can be somehow derived from previous results (Backurs and Indyk, FOCS’16), we give a direct reduction from SETH for dissemination purposes, as the result might interest researchers from several areas, such as computational biology, graph databases, and graph mining, as mentioned before. Indeed, as approximate pattern matching on graphs can be solved in $O(|E|\,m)$ time, exact and approximate matching are thus equally hard (quadratic time) on graphs under the SETH assumption. In comparison, the same problems restricted to strings have linear-time and quadratic-time solutions, respectively, where the latter have a matching SETH lower bound on computing the edit distance of two strings (Backurs and Indyk, STOC’15).

Key words and phrases:

exact pattern matching, graph query, graph search, heterogeneous networks, labeled graphs, string matching, string search, strong exponential time hypothesis, variation graphs

1991 Mathematics Subject Classification:

Mathematics of computing → Graph algorithms; Theory of computation → Problems, reductions and completeness; Theory of computation → Pattern matching

1. Introduction

Large-scale labeled graphs are becoming ubiquitous in several areas, such as computational biology, graph databases, and graph mining. Applications require sophisticated operations on these graphs, and often rely on primitives that locate paths whose nodes have labels or types matching a pattern given at query time.

In graph databases, query languages let the user select paths based on the labels of their nodes or edges, where the edge labels are called properties. In this way, graph databases explicitly lay out the dependencies between the nodes of data, whereas these dependencies are implicit in classical relational databases [4]. Although no standard query language has yet been universally adopted (as happened with SQL for relational databases), popular query languages such as Cypher [13], Gremlin [25], and SPARQL [23] make it possible to specify paths by matching the labels of their nodes.

In graph mining and machine learning for network analysis, heterogeneous networks specify the type of each node [27]. For example, in the DBLP network [30], the nodes for authors can be marked with the letter ’A’ and the nodes for papers with the letter ’P’, where edges connect authors to their papers. Coauthors can then be identified by the pattern ’APA’ when it matches two different nodes labeled ’A’. The strings generated by the labels on the paths have several applications in heterogeneous networks, such as graph kernels [16] or node similarity [10], where a basic tool is retrieving the paths that spell a given string.

In genome research, the very first step of many standard analysis pipelines of high-throughput sequencing data has been to align the sequenced fragments of DNA (called reads) on a reference genome of a species. Further analysis reveals a set of positions where the sequenced individual differs from the reference genome. After years of this kind of study, there is now a growing dataset of frequently observed differences between individuals and the reference. A natural representation of this gained knowledge is a variation graph where the reference sequence is the backbone and variations are encoded as alternative paths [26]. Aligning reads (pattern matching) on this labeled graph gives the basis for the new paradigm called computational pan-genomics [9]. There are already tools that use such ideas, e.g. [15].

Although there is a growing need to perform pattern matching on graphs in the several situations described above, the idea of extending the problem of string searching in sequences to pattern matching in graphs was studied over 25 years ago as a search problem in hypertext [20]. The history of key contributions is given in Table 1, where the two best known results for exact and approximate pattern matching, both taking quadratic time in the worst case, are highlighted. Note that allowing errors in the graph makes the problem NP-hard [3], so we consider errors in the pattern only.

A common feature of the bounds reported in Table 1 is the appearance of the quadratic term $m\,|E|$ (except for the special case of trees and the NP-hard approximate versions). Here $m$ is the length of the pattern string and $E$ is the set of edges of the graph. The quadratic cost of approximate matching on graphs by Rautiainen and Marschall [24] is asymptotically optimal under the Strong Exponential Time Hypothesis [17] (SETH), as (i) their algorithm solves approximate string matching as a special case, since a graph consisting of just one path of $|E|+1$ nodes and $|E|$ edges is a text string of length $n=|E|+1$, and (ii) it has been proved that the edit distance of two strings of length $n$ cannot be computed in $O(n^{2-\epsilon})$ time, for any constant $\epsilon>0$, unless SETH is false [5]. Hence this conditional lower bound explains why the $O(m|E|)$ barrier has been difficult to cross.

This explains the complexity of approximate pattern matching on graphs, but nothing comparable is known for exact pattern matching on graphs. Indeed, classical exact pattern matching of a pattern in a text string can be solved in linear time [19], so one could expect the corresponding problem on graphs to be easier than approximate pattern matching.

In this paper we make the slightly surprising observation that exact and approximate pattern matching are equally hard on graphs. Namely, we show the conditional lower bound that an $O(|E|^{1-\epsilon}\,m)$-time or an $O(|E|\,m^{1-\epsilon})$-time algorithm for exact pattern matching on graphs cannot be achieved unless SETH is false. This result explains why it has been difficult to find indexing schemes for graphs with other than best case or average case guarantees for fast exact pattern matching [28, 14].

Before giving the overview and details of the reduction, let us fix the problem definition and the SETH formulation.

Definition 1.1.

Given an alphabet $\Sigma$ , a labeled graph $G$ is a triplet $(V,E,L)$ where $(V,E)$ is a directed or undirected graph and $L:V\rightarrow\Sigma^{+}$ is a function that defines which string over $\Sigma$ is assigned to each node.

Definition 1.2.

Let $u_{1},\ldots,u_{j}$ be a path in graph $G$ and $P$ be a pattern. Also, $L(u)[l:]$ and $L(u)[:l^{\prime}]$ denote the suffix of $L(u)$ starting at position $l$ and the prefix of $L(u)$ ending at position $l^{\prime}$ , respectively. We say that $u_{1},\ldots,u_{j}$ is a match for $P$ in $G$ with offset $l$ if the concatenation of the strings $L(u_{1})[l:]\cdot L(u_{2})\cdot\ldots\cdot L(u_{j-1})\cdot L(u_{j})[:l^{\prime}]$ equals $P$ , for some $l^{\prime}$ .

The Pattern Matching in Labeled Graphs (PMLG) problem is then defined as:

Input: a labeled graph $G=(V,E,L)$ and a pattern $P$ over an alphabet $\Sigma$.

Output: all the matches for $P$ in $G$.

For example, in Fig. 1 pattern $\mathtt{cde}$ has two occurrences but pattern $\mathtt{bccce}$ does not occur. For our purposes it is enough to consider the decision version of the problem, namely, to determine whether or not there exists at least one match for $P$ in $G$, without reporting all of them. Note that the matching path for $P$ can go through the same nodes multiple times in $G$; if repeated nodes were forbidden, PMLG would have to solve the NP-hard Hamiltonian path problem as a special case.
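The decision version can be illustrated with a minimal sketch. It assumes every node carries a single-character label (as in the graphs built by the reduction later on) and propagates, for each pattern position, the set of nodes where a partial match can end. The function name `pattern_matches` and the dictionary-based graph representation are illustrative choices, not taken from the paper, and the sketch is not claimed to be the algorithm of [3]; it simply runs in $O(|V| + m\,|E|)$ time.

```python
def pattern_matches(labels, adj, pattern):
    """Decide whether some walk in the graph spells `pattern`.

    labels: dict mapping node -> single-character label (assumption).
    adj:    dict mapping node -> iterable of neighbours.
    Nodes may be revisited, as allowed by the problem definition.
    """
    if not pattern:
        return True
    # Nodes where a match of pattern[:1] can end.
    frontier = {u for u, a in labels.items() if a == pattern[0]}
    for ch in pattern[1:]:
        # Extend every partial match by one edge; each step scans O(|E|) edges.
        frontier = {v for u in frontier for v in adj.get(u, ()) if labels[v] == ch}
        if not frontier:
            return False
    return bool(frontier)


# Undirected path 1 - 2 - 3 with labels a, b, a.
labels = {1: "a", 2: "b", 3: "a"}
adj = {1: [2], 2: [1, 3], 3: [2]}
print(pattern_matches(labels, adj, "aba"))
print(pattern_matches(labels, adj, "abab"))  # needs to revisit node 2
print(pattern_matches(labels, adj, "aa"))
```

The second query only succeeds because the walk is allowed to return to a node it has already visited.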

We now recall SETH, namely, the Strong Exponential Time Hypothesis [17]. This conjecture is commonly used as a basis for reductions, even though its weaker version ETH is more widely accepted.

Definition 1.3 ([17]).

Let $q$-SAT be the restriction of SAT to instances with at most $q$ literals per clause. Given $\delta_{q}=\inf\,\{\alpha\,:\,\text{there is an }O(2^{\alpha n})\text{-time algorithm for }q\text{-SAT}\}$, SETH claims that $\lim\limits_{q\to\infty}\delta_{q}=1$.

In other words, SETH implies that there is no $O(2^{\alpha n})$-time algorithm for general SAT for any constant $\alpha<1$. We use SETH in the following result, given that the best known algorithm for PMLG, devised 20 years ago [3], has $O(|E|\,m)$ time complexity.

Theorem 1.4.

For any constant $\epsilon>0$, the Pattern Matching in Labeled Graphs (PMLG) problem for an alphabet of at least $4$ symbols cannot be solved in either $O(|E|^{1-\epsilon}\,m)$ or $O(|E|\,m^{1-\epsilon})$ time unless SETH is false.

We can further strengthen the statement of this theorem by proving the following corollaries.

Corollary 1.5.

The conditional lower bound stated in Theorem 1.4 holds even if it is restricted to graphs with binary alphabet for the labels, where each node has degree at most three.

Corollary 1.6.

The conditional lower bound stated in Theorem 1.4 holds even if it is restricted to labeled directed acyclic graphs (DAGs) with binary alphabet for the labels, where each node has the sum of indegree and outdegree at most three.

In order to achieve these results we break down our reasoning into intermediate steps. Since this is a conditional lower bound, we reduce SAT to PMLG. Then we show that a truly subquadratic algorithm for PMLG would allow SAT to be solved in $O(2^{\alpha n})$ time with $\alpha<1$. Our reduction costs $\tilde{O}(2^{\frac{n}{2}})$ time for a SAT formula with $n$ variables and $k=O(\operatorname{poly}(n))$ clauses, where $\tilde{O}$ is the shorthand for ignoring polynomial factors in $n$, e.g. $O(kn^{2}\,2^{\frac{n}{2}})=\tilde{O}(2^{\frac{n}{2}})$. Hence the main steps can be summarized as follows.

• Find a reduction from SAT to PMLG.

• Ensure that this reduction costs $\tilde{O}(2^{\frac{n}{2}})$ time.

• Show that an $O(|E|^{1-\epsilon}\,m)$-time or an $O(|E|\,m^{1-\epsilon})$-time algorithm for PMLG gives a solution for SAT that makes SETH fail.
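The last step can be made explicit with a short calculation. The following is only a sketch that rearranges exponents under the sizes $|E|,m=\tilde{O}(2^{\frac{n}{2}})$ produced by the reduction; it is not quoted from the paper.

$$
\begin{aligned}
O(|E|^{1-\epsilon}\,m) &= \tilde{O}\big(2^{\frac{n}{2}(1-\epsilon)}\cdot 2^{\frac{n}{2}}\big)=\tilde{O}\big(2^{(1-\frac{\epsilon}{2})n}\big),\\
O(|E|\,m^{1-\epsilon}) &= \tilde{O}\big(2^{\frac{n}{2}}\cdot 2^{\frac{n}{2}(1-\epsilon)}\big)=\tilde{O}\big(2^{(1-\frac{\epsilon}{2})n}\big).
\end{aligned}
$$

Adding the $\tilde{O}(2^{\frac{n}{2}})$ cost of the reduction does not change these bounds. Since the hidden polynomial factors are absorbed by any slightly larger exponent, SAT with polynomially many clauses (hence $q$-SAT for every $q$) would be solvable in $O(2^{\alpha n})$ time for any constant $\alpha$ with $1-\frac{\epsilon}{2}<\alpha<1$. This would give $\delta_{q}\leq\alpha<1$ for all $q$, contradicting $\lim_{q\to\infty}\delta_{q}=1$.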

Our reduction shares some similarities with those for string problems in [5, 8, 1, 6, 7], as it uses SETH. The closest connection is with [6], where regular expression matching is studied (graph $G_{F}$ in Section 2.2 is analogous to the NFA derived from the regular expression matching of type $\mid\cdot\mid$ in [6]). At the presentation level, the difference from earlier work is that we reduce directly from SETH, while the earlier work uses an intermediate problem, orthogonal vectors, as a tool; our reduction can also be presented via the orthogonal vectors problem, but we preferred to work with SETH directly since SAT is more familiar to researchers from various research areas. On a more conceptual level, the new reduction has some interesting features of independent interest. Given a SAT formula, our reduction builds a pattern and a graph, using some special characters in the pattern to match bridges in the graph that can be traversed in one direction only (even if the graph is undirected). Also, obtaining the reduction for a binary alphabet requires a suitable variable-length encoding of the characters to avoid certain paths in the graph.

An earlier version of this reduction can be found in the Master’s thesis of the first author [11] (supervised by the two last authors).

2. Conditional lower bound for PMLG on undirected graphs

Consider a SAT formula $F$ with variables $v_{1},\ldots,v_{n}$ and set $C$ of $k$ clauses. (Footnote 1: In this paper we discuss the interesting case where $k=O(\operatorname{poly}(n))$.)

We show how to generate a corresponding instance of PMLG. We build a pattern $P\in\Sigma^{m}$ of suitable length $m=\tilde{O}(2^{\frac{n}{2}})$ and a labeled graph $G=(V,E,L)$, where $|E|=\tilde{O}(2^{\frac{n}{2}})$ and $L:V\rightarrow\Sigma^{*}$ is the node labeling with strings from $\Sigma^{*}$, such that $P$ matches in $G$ if and only if $F$ is satisfied by some truth assignment of its variables. Recall that a truth assignment $x$ is a tuple $\langle b_{1},\ldots,b_{n}\rangle$, where $b_{i}\in\{\mathtt{true},\mathtt{false}\}$ is the truth value assigned to each variable $v_{i}$. We write $x\models c$ to indicate that there exists at least one literal $\ell\in c$ satisfied by $x$ (i.e. either $\ell=v_{i}$ and $b_{i}=\mathtt{true}$, or $\ell=\neg v_{i}$ and $b_{i}=\mathtt{false}$).

Our reduction builds a pattern with $m=\tilde{O}(2^{\frac{n}{2}})$ symbols from a binary alphabet $\Sigma$ along with an undirected graph whose nodes are labeled with single symbols from $\Sigma$ (i.e. $L:V\rightarrow\Sigma$). This graph has $|V|,|E|=\tilde{O}(2^{\frac{n}{2}})$ nodes and edges, and maximum degree three. The reduction can be modified so that the graph is directed with maximum sum of indegree and outdegree at most three.

For presentation’s sake, we begin with a pattern $P$ using an alphabet of four symbols, $\Sigma=\{\mathtt{b},\mathtt{e},\mathtt{c},\mathtt{d}\}$, whose interpretation is to label nodes according to their implicit functionality: $\mathtt{b}$egin (synchronization token), $\mathtt{e}$nd (synchronization token), $\mathtt{c}$lause (marker), $\mathtt{d}$ummy (don’t care); moreover, the resulting undirected graph $G$ has unbounded degree. After that, we will show how to obtain the degree-three configuration for $G$ and how to achieve a binary alphabet, as described above.

We assume that $n$ is an even number, without loss of generality, and denote by $X$ the set of $2^{\frac{n}{2}}$ possible assignments for the first $n/2$ variables, and by $Y$ those for the last $n/2$ variables, that is,

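$$X=\{x_{i}=\langle b_{1}^{(i)},\ldots,b_{\frac{n}{2}}^{(i)}\rangle \mid 1\leq i\leq 2^{\frac{n}{2}}\},\qquad Y=\{y_{j}=\langle b_{\frac{n}{2}+1}^{(j)},\ldots,b_{n}^{(j)}\rangle \mid 1\leq j\leq 2^{\frac{n}{2}}\}.$$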

We call elements of $X$ and $Y$ half-assignments and interpret notation $\models$ accordingly. For example, $y_{j}\models c$ if and only if there is a literal $\ell\in c$ satisfied by the half-assignment $y_{j}$ (i.e. either $\ell=v_{i}$ and $b_{i}^{(j)}=\mathtt{true}$ , or $\ell=\neg v_{i}$ and $b_{i}^{(j)}=\mathtt{false}$ , for some $i\geq\frac{n}{2}+1$ ).

The components of the reduction are to be interpreted as follows. The pattern encodes clauses by position, placing a symbol $\mathtt{c}$ to indicate which clauses are not satisfied by a half-assignment $x_{i}\in X$; the other clauses are marked by $\mathtt{d}$ as they are already satisfied by $x_{i}$ alone; symbols $\mathtt{b}$ and $\mathtt{e}$ are employed to sync the half-assignments from $X$ with portions of the graph, called gadgets.

The gadgets encode which clauses are satisfied by the half-assignments of $Y$, encoding each such clause with a distinct node labeled with $\mathtt{c}$: when a symbol $\mathtt{c}$ in the pattern matches a node with label $\mathtt{c}$ in the graph, the corresponding clause is covered by a half-assignment $y_{j}\in Y$, while it was not yet covered by half-assignment $x_{i}\in X$. If all the symbols $\mathtt{c}$ for $x_{i}$ are matched by the nodes of the gadget corresponding to $y_{j}$, then assignment $x_{i}y_{j}$ satisfies the SAT formula $F$; the converse also holds.

Parallel nodes labeled with $\mathtt{d}$ are introduced to deal with the cases when the pattern indicates that the corresponding clause is already satisfied by a half-assignment in $X$. Nodes labeled with $\mathtt{b}$ or $\mathtt{e}$ are used to match a half-assignment $x_{i}\in X$ with a half-assignment $y_{j}\in Y$. Details follow below.

2.1. Building the pattern

Pattern $P$ is defined over the alphabet $\Sigma=\{\mathtt{b},\mathtt{e},\mathtt{c},\mathtt{d}\}$ using the half-assignments in $X=\{x_{1},\ldots,x_{2^{\frac{n}{2}}}\}$ and the set $C=\{c_{1},\dots,c_{k}\}$ of clauses of SAT formula $F$. Specifically, it is built as the concatenation $P=\mathtt{e}\,\mathtt{b}P_{x_{1}}\mathtt{e}\,\mathtt{b}P_{x_{2}}\mathtt{e}\ldots\mathtt{b}P_{x_{2^{\frac{n}{2}}}}\mathtt{e}\,\mathtt{b}$ of $2^{\frac{n}{2}}$ strings $\mathtt{b}P_{x_{i}}\mathtt{e}$, one for each $x_{i}\in X$, preceded by $\mathtt{e}$ and followed by $\mathtt{b}$. For $1\leq h\leq k$, the $h$th symbol of string $P_{x_{i}}$ is defined as

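$$P_{x_{i}}[h]=\begin{cases}\mathtt{c} & \text{if } x_{i}\not\models c_{h}\\ \mathtt{d} & \text{otherwise}\end{cases}$$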

We will prove that $F$ is satisfiable if and only if we can find a match for this pattern in our graph, where the latter is made up of gadgets as specified below.
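The construction of $P$ can be sketched in a few lines. The encoding below is an assumption made for illustration only: clauses are lists of nonzero integers (literal $v$ means variable $v$ is true, $-v$ means it is false), variables are numbered $1,\ldots,n$ with $n$ even, and half-assignments in $X$ are enumerated in the order produced by `itertools.product`. The names `satisfies` and `build_pattern` are hypothetical.

```python
from itertools import product


def satisfies(half, first_var, clause):
    """True if the half-assignment `half` (a tuple of bools for variables
    first_var, first_var + 1, ...) satisfies at least one literal of `clause`."""
    for lit in clause:
        idx = abs(lit) - first_var
        if 0 <= idx < len(half) and half[idx] == (lit > 0):
            return True
    return False


def build_pattern(clauses, n):
    """Return P = e (b P_x e for each x in X) b over the alphabet {b, e, c, d}."""
    blocks = []
    for x in product([False, True], repeat=n // 2):
        # P_x[h] is 'd' if x already satisfies clause h, and 'c' otherwise.
        p_x = "".join("d" if satisfies(x, 1, c) else "c" for c in clauses)
        blocks.append("b" + p_x + "e")
    return "e" + "".join(blocks) + "b"


# F = (v1 or v3) and (not v1 or not v4), with n = 4 variables.
print(build_pattern([[1, 3], [-1, -4]], 4))
```

For this small formula the two half-assignments with $v_{1}=\mathtt{false}$ yield the block $\mathtt{bcde}$ and the two with $v_{1}=\mathtt{true}$ yield $\mathtt{bdce}$, giving a pattern of length $4\cdot 4+2=18$.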

2.2. Graph gadgets for SAT formulas

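Our gadget is an undirected graph $G_{F}=(V_{F},E_{F},L_{F})$, illustrated in Figure 1 and defined as follows using the $2^{\frac{n}{2}}$ half-assignments in $Y=\{y_{1},\ldots,y_{2^{\frac{n}{2}}}\}$ and the set $C=\{c_{1},\dots,c_{k}\}$ of clauses of SAT formula $F$.

In the set $V_{F}$ of nodes, we have a clause node $c_{j,h}$ for every pair $(y_{j},c_{h})\in Y\times C$ such that $y_{j}\models c_{h}$, and a dummy node $d_{j,h}$ for every pair $(y_{j},c_{h})\in Y\times C$. Set $V_{F}$ also contains two special nodes, a begin node $b$ and an end node $e$:

$$V_{F}=\{c_{j,h}\mid y_{j}\models c_{h},\ y_{j}\in Y,\ c_{h}\in C\}\cup\{d_{j,h}\mid y_{j}\in Y,\ c_{h}\in C\}\cup\{b,e\}$$

Labeling $L_{F}:V_{F}\rightarrow\Sigma$ is consequently defined, where a symbol $\mathtt{c}$ in the pattern that matches a node labeled with $\mathtt{c}$ in the graph will represent the fact that a clause not satisfied by a certain half-assignment in $X$ is actually satisfied by a certain half-assignment in $Y$. The $\mathtt{d}$ symbols are a sort of “don’t care”, and the $\mathtt{b}$ and $\mathtt{e}$ symbols synchronize the whole.

$$L_{F}(u)=\begin{cases}\mathtt{b} & \text{if } u=b\\ \mathtt{e} & \text{if } u=e\\ \mathtt{c} & \text{if } u=c_{j,h}\\ \mathtt{d} & \text{if } u=d_{j,h}\end{cases}$$

As shown in Fig. 1, the edges in the set $E_{F}$ connect $b$ to every $c_{j,1}$ and $d_{j,1}$, and connect every $c_{j,k}$ and $d_{j,k}$ to $e$, for $1\leq j\leq 2^{\frac{n}{2}}$. Moreover, there is an edge for every pair of nodes that share the same $j$ and are consecutive in terms of the $h$ coordinate (e.g. $c_{j,h},d_{j,h+1}$), for $1\leq j\leq 2^{\frac{n}{2}}$ and $1\leq h<k$.

$$
\begin{aligned}
E_{F}={}&\{(b,c_{j,1})\mid c_{j,1}\in V_{F}\}\cup\{(b,d_{j,1})\mid d_{j,1}\in V_{F}\}\\
&\cup\{(c_{j,k},e)\mid c_{j,k}\in V_{F}\}\cup\{(d_{j,k},e)\mid d_{j,k}\in V_{F}\}\\
&\cup\{(c_{j,h},c_{j,h+1})\mid c_{j,h},c_{j,h+1}\in V_{F}\}\cup\{(c_{j,h},d_{j,h+1})\mid c_{j,h},d_{j,h+1}\in V_{F}\}\\
&\cup\{(d_{j,h},c_{j,h+1})\mid d_{j,h},c_{j,h+1}\in V_{F}\}\cup\{(d_{j,h},d_{j,h+1})\mid d_{j,h},d_{j,h+1}\in V_{F}\}
\end{aligned}
$$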
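A minimal sketch of the gadget construction follows, under illustrative assumptions: clauses are lists of nonzero integers (literal $v$ or $-v$), variables are numbered $1,\ldots,n$ with $n$ even, the half-assignments of $Y$ are enumerated with `itertools.product`, nodes are named by tuples such as `("c", j, h)`, and undirected edges are stored as frozensets. The name `build_gf` is hypothetical.

```python
from itertools import product


def build_gf(clauses, n):
    """Return (labels, edges) for the gadget G_F of a CNF formula."""
    half, k = n // 2, len(clauses)

    def y_satisfies(y, clause):
        # y assigns the last n/2 variables: y[0] is v_{n/2+1}, and so on.
        return any(abs(l) > half and y[abs(l) - half - 1] == (l > 0) for l in clause)

    labels = {"b": "b", "e": "e"}
    for j, y in enumerate(product([False, True], repeat=half), start=1):
        for h, clause in enumerate(clauses, start=1):
            labels[("d", j, h)] = "d"        # dummy node d_{j,h}: always present
            if y_satisfies(y, clause):
                labels[("c", j, h)] = "c"    # clause node c_{j,h}: only if y_j |= c_h

    edges = set()
    for u in [v for v in labels if isinstance(v, tuple)]:
        _, j, h = u
        if h == 1:
            edges.add(frozenset({"b", u}))   # b to the first level of row j
        if h == k:
            edges.add(frozenset({u, "e"}))   # last level of row j to e
        for nxt in "cd":                     # same j, consecutive h
            v = (nxt, j, h + 1)
            if v in labels:
                edges.add(frozenset({u, v}))
    return labels, edges


labels, edges = build_gf([[1, 3], [-1, -4]], 4)
print(len(labels), len(edges))
```

For the formula $(v_{1}\vee v_{3})\wedge(\neg v_{1}\vee\neg v_{4})$ the gadget has 8 dummy nodes, 4 clause nodes, and the two nodes $b$ and $e$.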

We observe that pattern occurrences in $G_{F}$ have some combinatorial properties. (Footnote 2: Gadget $G_{F}$ is analogous to the main component of the SETH reduction to regular expression matching of type $\mid\cdot\mid$ in [6].)

Lemma 2.1.

If subpattern $\mathtt{b}P_{x_{i}}\mathtt{e}$ matches in $G_{F}$ then all the nodes matching $P_{x_{i}}$ share the same $j$ coordinate and have distinct and consecutive $h$ coordinates (i.e. either $c_{j,h}$ or $d_{j,h}$ for $1\leq h\leq k$ ).

Proof 2.2.

Gadget $G_{F}$ contains a single node $b$ with label $L(b)=\mathtt{b}$ and a single node $e$ with label $L(e)=\mathtt{e}$. Moreover, the shortest path from $b$ to $e$ contains $k+2$ nodes ($b$ and $e$ included). As $\mathtt{b}P_{x_{i}}\mathtt{e}$ contains $k+2$ symbols, its matching path $\pi=b,u_{1},\ldots,u_{k},e$ in $G_{F}$ must traverse all distinct nodes by construction. Suppose by contradiction that at least one node in $\pi$ has a different $j$ coordinate. This means that two consecutive nodes $u_{h}$ and $u_{h+1}$ in $\pi$ have coordinates $j$ and $j^{\prime}$, with $j\neq j^{\prime}$. Node $u_{h}$ is actually either $c_{j,h}$ or $d_{j,h}$, whereas $u_{h+1}$ is either $c_{j^{\prime},h+1}$ or $d_{j^{\prime},h+1}$. By inspection of these four possible cases, we observe that our construction of $G_{F}$ does not provide any edge connecting $u_{h}$ and $u_{h+1}$. Indeed, there is no edge that allows a node to change the $j$ coordinate in the middle of a path. Hence we reach a contradiction. Finally, if the matching nodes were not consecutive in terms of the $h$ coordinate, by construction we would not be following a shortest path from $b$ to $e$, hence it would not be possible to complete the match.

Lemma 2.3.

Subpattern $\mathtt{b}P_{x_{i}}\mathtt{e}$ matches in $G_{F}$ if and only if there is $y_{j}\in Y$ such that the truth assignment $x_{i}y_{j}$ satisfies $F$ (i.e. $x_{i}y_{j}\models F$ ).

Proof 2.4.

By Lemma 2.1, we can focus on the $k$ distinct nodes matching $P_{x_{i}}$ , sharing the same coordinate $j$ . We handle the two implications of the statement individually.

( $\Rightarrow$ ) Consider the partial assignment $x_{i}\in X$ . From the structure of the pattern we know that $x_{i}$ satisfies all the clauses $c_{h}$ for which $P_{x_{i}}[h]=\mathtt{d}$ . Since $P_{x_{i}}$ has a match in $G_{F}$ , consider the assignment $y_{j}\in Y$ where $j$ exists by Lemma 2.1, as observed above. We observe that by construction $y_{j}$ satisfies those clauses that $x_{i}$ cannot satisfy, namely those for which $P_{x_{i}}[h]=\mathtt{c}$ . Hence we have found a truth assignment $x_{i}y_{j}$ that satisfies $F$ .

( $\Leftarrow$ ) Consider a truth assignment $x_{i}y_{j}$ that satisfies $F$, that is, all clauses $c_{h}$ of the SAT formula $F$ are true. Consider now the nodes with coordinate $j$ in $G_{F}$. For $h=1,2,\ldots,k$, if $x_{i}\models c_{h}$ then $P_{x_{i}}[h]=\mathtt{d}$ and the matching node $d_{j,h}$ exists in $G_{F}$ by its definition. If $x_{i}\not\models c_{h}$ then it must be that $y_{j}\models c_{h}$: thus $P_{x_{i}}[h]=\mathtt{c}$ and a matching node $c_{j,h}$ exists in $G_{F}$ by its construction. The definition of the edges of $G_{F}$ ensures that all the above nodes $c_{j,h}$ and $d_{j,h}$ are properly linked to form a path of distinct nodes (for increasing values of $h$), because they all share the same $j$ coordinate. This implies that $\mathtt{b}P_{x_{i}}\mathtt{e}$ matches in $G_{F}$.

While the previous gadget is useful to check whether a half-assignment $x_{i}$ can be extended to an assignment satisfying $F$ using a given subpattern $\mathtt{b}P_{x_{i}}\mathtt{e}\in\mathtt{b}\{\mathtt{c},\mathtt{d}\}^{k}\mathtt{e}$, we need another “jolly” gadget that matches all subpatterns in $\mathtt{b}\{\mathtt{c},\mathtt{d}\}^{k}\mathtt{e}$ (this is useful when $x_{i}$ cannot be extended to satisfy $F$). We concatenate $2^{\frac{n}{2}}-1$ instances of the latter gadget, thus obtaining the graph $G_{U}=(V_{U},E_{U},L_{U})$ illustrated in Figure 2 and defined as follows. The $i$th copy of the gadget substructure has a node $b_{i}$ followed by nodes $c_{i,h}$, $d_{i,h}$ and then node $e_{i}$, with $1\leq i\leq 2^{\frac{n}{2}}-1$ and $1\leq h\leq k$. The labels are $L_{U}(b_{i})=\mathtt{b}$, $L_{U}(c_{i,h})=\mathtt{c}$, $L_{U}(d_{i,h})=\mathtt{d}$ and $L_{U}(e_{i})=\mathtt{e}$ (we may think of nodes $c_{i,h}$ and $d_{i,h}$ as arranged along two parallel lines). We place the edges $(b_{i},c_{i,1}),\,(b_{i},d_{i,1}),\,(c_{i,k},e_{i}),\,(d_{i,k},e_{i})$ for connecting the beginning and ending nodes of each gadget with its inner part. We connect nodes $c_{i,h}$ and $d_{i,h}$ to the next level with the edges $(c_{i,h},c_{i,h+1}),\,(c_{i,h},d_{i,h+1}),\,(d_{i,h},c_{i,h+1}),\,(d_{i,h},d_{i,h+1})$, for $1\leq h<k$. We concatenate our gadgets one after the other using the edges $(e_{i},b_{i+1})$, for $i=1,\ldots,2^{\frac{n}{2}}-2$.
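The definition of $G_{U}$ can be sketched as follows. The node naming (tuples such as `("c", i, h)`), the frozenset representation of undirected edges, and the function name `build_gu` are assumptions made for illustration; the parameter `copies` stands for $2^{\frac{n}{2}}-1$.

```python
def build_gu(copies, k):
    """Return (labels, edges) for a chain of `copies` universal gadgets,
    each matching every string in b{c,d}^k e."""
    labels, edges = {}, set()
    for i in range(1, copies + 1):
        labels[("b", i)] = "b"
        labels[("e", i)] = "e"
        for h in range(1, k + 1):
            labels[("c", i, h)] = "c"
            labels[("d", i, h)] = "d"
        for kind in "cd":
            edges.add(frozenset({("b", i), (kind, i, 1)}))   # b_i to first level
            edges.add(frozenset({(kind, i, k), ("e", i)}))   # last level to e_i
            for h in range(1, k):
                for nxt in "cd":                             # level h to level h+1
                    edges.add(frozenset({(kind, i, h), (nxt, i, h + 1)}))
        if i < copies:
            edges.add(frozenset({("e", i), ("b", i + 1)}))   # bridge to next copy
    return labels, edges


labels, edges = build_gu(copies=3, k=2)
print(len(labels), len(edges))
```

Each copy contributes $2k+2$ nodes and $4+4(k-1)$ edges, and consecutive copies are joined by one bridge, so the size is linear in `copies * k`.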

2.3. Putting all together

Armed with gadgets $G_{F}$ and $G_{U}$, we obtain the graph $G=(V,E,L)$ from the SAT formula $F$ by combining them as illustrated in Figure 3. We take one instance of $G_{F}=(V_{F},E_{F},L_{F})$ and two instances of $G_{U}$, say $G_{U}^{(1)}=(V_{U}^{(1)},E_{U}^{(1)},L_{U}^{(1)})$ and $G_{U}^{(2)}=(V_{U}^{(2)},E_{U}^{(2)},L_{U}^{(2)})$, and two new nodes $u$ and $z$, labeled $\mathtt{e}$ and $\mathtt{b}$, respectively. Then $G=(V,E,L)$ has node set $V=V_{F}\cup\{u,z\}\cup V_{U}^{(1)}\cup V_{U}^{(2)}$, preserving the node labels. The edge set is the union of the previous edge sets plus four edges: one connects the “last” node labeled with $\mathtt{e}$ in $G_{U}^{(1)}$ with the node labeled with $\mathtt{b}$ in $G_{F}$; another connects the node labeled with $\mathtt{e}$ in $G_{F}$ with the “first” node labeled with $\mathtt{b}$ in $G_{U}^{(2)}$; finally, $u$ is connected to the first node labeled with $\mathtt{b}$ in $G_{U}^{(1)}$, and the last node labeled with $\mathtt{e}$ in $G_{U}^{(2)}$ is connected to $z$.

Remark 2.5.

Each edge connecting a node labeled with $\mathtt{e}$ to a node labeled with $\mathtt{b}$ is a bridge in $G$ (i.e. its removal disconnects $G$). As we shall see, these bridges serve two purposes since, within a matching path, the $i$th occurrence of $\mathtt{eb}$ in the pattern matches the $i$th bridge with labels $\mathtt{e}$ and $\mathtt{b}$ at its endpoints: (i) they synchronize the distinct subpatterns with the distinct (portions of the) gadgets, and (ii) they guarantee that the pattern matches a path of distinct nodes rather than a walk.

We now prove that the reduction is correct, first focusing on subpatterns of $P$ .

Lemma 2.6.

Pattern $P$ matches in $G$ if and only if a subpattern $\mathtt{b}P_{x_{i}}\mathtt{e}$ of $P$ matches in $G_{F}$ .

Proof 2.7.

For the $\Rightarrow$ implication, the bridges with endpoints labeled with $\mathtt{e}$ and $\mathtt{b}$ can only be traversed once and in this direction, as $P$ contains the sequence $\mathtt{eb}$ but does not contain $\mathtt{be}$. Moreover, each occurrence of $P$ must begin with one such bridge and end with another such bridge. For this reason each distinct subpattern $\mathtt{b}P_{x_{i}}\mathtt{e}$ matches a path from either a distinct portion of $G_{U}^{(l)}$ ($l=1,2$) or $G_{F}$. Recall that $G_{U}^{(1)}$ and $G_{U}^{(2)}$ can match at most $2^{\frac{n}{2}}-1$ subpatterns of $P$ each, while $P$ has $2^{\frac{n}{2}}$ of them. Hence one subpattern $\mathtt{b}P_{x_{i}}\mathtt{e}$ is forced to have a match in $G_{F}$ in order to have a full match for $P$.

The $\Leftarrow$ implication is trivial. In fact, if $\mathtt{b}P_{x_{i}}\mathtt{e}$ has a match in $G_{F}$ then we can match $\mathtt{b}P_{x_{1}}\mathtt{e},\ldots,\mathtt{b}P_{x_{i-1}}\mathtt{e}$ in $G_{U}^{(1)}$ and $\mathtt{b}P_{x_{i+1}}\mathtt{e},\ldots,\mathtt{b}P_{x_{2^{\frac{n}{2}}}}\mathtt{e}$ in $G_{U}^{(2)}$ by construction, and have a full match for $P$ in $G$ .

The main result proves the correctness of our reduction.

Theorem 2.8.

Pattern $P$ matches in $G$ if and only if the SAT formula $F$ is satisfiable.

Proof 2.9.

By Lemma 2.6, $P$ matches in $G$ if and only if a subpattern $\mathtt{b}P_{x_{i}}\mathtt{e}$ matches in $G_{F}$ . By Lemma 2.3 this holds if and only if a truth assignment $x_{i}y_{j}$ satisfies $F$ , that is, if and only if $F$ is satisfiable.

2.4. Cost of the reduction

We analyze the cost of building the pattern $P$ and the graph $G$ from the SAT formula $F$ .

Lemma 2.10.

Given a SAT formula $F$ with $n$ variables, the corresponding pattern $P$ and graph $G$ can be built in $\tilde{O}(2^{\frac{n}{2}})$ time and space.

Proof 2.11.

Checking if an assignment satisfies a clause takes $O(n)$ time which, for our goals, is negligible when compared to $\tilde{O}(2^{\frac{n}{2}})$ . Recalling that the number $k$ of clauses is polynomially bounded in $n$ , we observe that each $P_{x_{i}}$ in $P$ has $k$ symbols that can be either $\mathtt{c}$ or $\mathtt{d}$ , plus the symbols $\mathtt{b}$ and $\mathtt{e}$ . Since $P$ has $2^{\frac{n}{2}}$ subpatterns $P_{x_{i}}$ , summing everything up we get a length of $m=(k+2)\,2^{\frac{n}{2}}=\tilde{O}(2^{\frac{n}{2}})$ symbols. As for $G_{U}$ , it has $2^{\frac{n}{2}}$ gadgets, each having $k$ nodes labeled with $\mathtt{c}$ , $k$ nodes labeled with $\mathtt{d}$ , and nodes $b_{i}$ and $e_{i}$ . Hence there are $(2+2k)\,2^{\frac{n}{2}}=\tilde{O}(2^{\frac{n}{2}})$ nodes in total. Each node has a constant number of incident edges (at most $4$ ), thus the number of edges is $\tilde{O}(2^{\frac{n}{2}})$ as well. As for $G_{F}$ , it has $O(k\,2^{\frac{n}{2}})$ nodes labeled with $\mathtt{c}$ and the same number of nodes labeled with $\mathtt{d}$ , plus those labeled with $\mathtt{b}$ and $\mathtt{e}$ . In this case, each node has a constant number of incident edges, except for the nodes labeled with $\mathtt{b}$ and $\mathtt{e}$ , which have $O(2^{\frac{n}{2}})$ edges each. Therefore the total number of edges is again $\tilde{O}(2^{\frac{n}{2}})$ . Connecting $G_{F}$ to the two instances of $G_{U}$ adds just $2$ edges. Since the pattern and the graph have size $\tilde{O}(2^{\frac{n}{2}})$ , we conclude that the cost of our reduction is indeed $\tilde{O}(2^{\frac{n}{2}})$ .
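The two closed-form counts in this proof can be evaluated for small parameters to see that they exceed $2^{\frac{n}{2}}$ only by a factor linear in $k$ . The following is a minimal sketch; the function name `reduction_sizes` and the dictionary keys are illustrative choices, not part of the paper, and $n$ is assumed to be even.

```python
def reduction_sizes(n, k):
    """Evaluate the counts from the proof of Lemma 2.10.

    n: number of variables (assumed even), k: number of clauses.
    The key names are hypothetical labels used only in this sketch.
    """
    half = 2 ** (n // 2)  # number of subpatterns P_{x_i}, i.e. 2^{n/2}
    return {
        "pattern_length": (k + 2) * half,  # m = (k+2) 2^{n/2}
        "gu_nodes": (2 + 2 * k) * half,    # (2+2k) 2^{n/2} nodes of G_U
    }


if __name__ == "__main__":
    for n, k in [(4, 3), (10, 20), (20, 50)]:
        sizes = reduction_sizes(n, k)
        # The ratio to 2^{n/2} depends on k only, not on n.
        print(n, k, sizes, sizes["pattern_length"] // 2 ** (n // 2))
```

For fixed $k$ , both quantities grow like $2^{\frac{n}{2}}$ , which is what the $\tilde{O}$ notation records once $k$ is bounded by a polynomial in $n$ .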

2.5. Implications on SETH

The last step in our proof of Theorem 1.4 is showing that any $O(|E|^{1-\epsilon}\,m)$ -time or $O(|E|\,m^{1-\epsilon})$ -time algorithm for PMLG would make SETH false. To this aim, assume that we have such an algorithm, say $A$ . Given a SAT formula $F$ , we perform the reduction of Theorem 2.8, obtaining pattern $P$ and graph $G$ in $\tilde{O}(2^{\frac{n}{2}})$ time by Lemma 2.10, and we observe that $|E|=\tilde{O}(2^{\frac{n}{2}})$ and $m=\tilde{O}(2^{\frac{n}{2}})$ . At this point, no matter whether $A$ has $O(|E|^{1-\epsilon}m)$ or $O(|E|\,m^{1-\epsilon})$ time complexity, we end up with an algorithm deciding if $F$ is satisfiable in $\tilde{O}(2^{\frac{n}{2}}\,2^{\frac{n}{2}(1-\epsilon)})=\tilde{O}(2^{\frac{(2-\epsilon)}{2}n})$ time. Since $\alpha=\frac{(2-\epsilon)}{2}<1$ , this means that SAT can be solved in $\tilde{O}(2^{\alpha n})$ time with $\alpha<1$ , making SETH false.
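The arithmetic behind this step can be written out for both running times. This is only a sketch: the exponent $c$ of the polynomial factor hidden by $\tilde{O}$ and the auxiliary constant $\alpha^{\prime}$ are symbols introduced here for illustration and do not appear in the original argument.

```latex
\begin{align*}
|E|^{1-\epsilon}\, m
  &= \tilde{O}\big(2^{\frac{n}{2}(1-\epsilon)}\big)\cdot \tilde{O}\big(2^{\frac{n}{2}}\big)
   = \tilde{O}\big(2^{\frac{(2-\epsilon)}{2}n}\big), \\
|E|\, m^{1-\epsilon}
  &= \tilde{O}\big(2^{\frac{n}{2}}\big)\cdot \tilde{O}\big(2^{\frac{n}{2}(1-\epsilon)}\big)
   = \tilde{O}\big(2^{\frac{(2-\epsilon)}{2}n}\big), \\
n^{c}\, 2^{\alpha n}
  &= O\big(2^{\alpha^{\prime} n}\big)
  \quad \text{for any constant } \alpha^{\prime} \text{ with } \alpha < \alpha^{\prime} < 1,
  \text{ where } \alpha = \tfrac{2-\epsilon}{2}.
\end{align*}
```

The $\tilde{O}(2^{\frac{n}{2}})$ cost of building $P$ and $G$ is dominated by either bound, so the polynomial factors can be absorbed into a slightly larger exponent that is still below $1$ .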

3. From undirected graphs to DAGs, with binary alphabets

In this section we show that the graph $G$ obtained from the reduction described in Section 2 can be transformed so that each node has degree at most three and label chosen from an alphabet of two symbols $\{\mathtt{0},\mathtt{1}\}$ .

We describe how to modify the proof of Theorem 1.4 so that it holds even when restricted to graphs of maximum degree three. We observe that the graph built in the reduction in Section 2 has maximum degree $O(2^{\frac{n}{2}})$ . To obtain degree at most three, we first modify gadgets $G_{F}$ and $G_{U}$ to meet this requirement, and then adjust pattern $P$ accordingly. Finally, we obtain a binary alphabet for the labels, thus proving Corollary 1.5.

After that we prove Corollary 1.6, showing that the undirected graph can be easily transformed into a directed acyclic graph (DAG).

3.1. Maximum degree three

Revised gadget $G_{F}$

As depicted in Figure 4(a), consider the $O(2^{\frac{n}{2}})$ edges connecting node $b$ with nodes $c_{j,1}$ and $d_{j,1}$ in $G_{F}$ . We replace them by a binary tree structure whose nodes are new dummy nodes $f_{l}$ with labels $L(f_{l})=\texttt{d}$ for $1\leq l\leq 2^{\frac{n}{2}}-2$ . As for node $e$ , we proceed along the same way and replace the edges connecting nodes $c_{j,k}$ and $d_{j,k}$ to $e$ by a binary tree structure (this case is not shown in the figure). The internal nodes of these trees have degree at most three.
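The degree-reduction idea can be sketched as follows: a high-degree hub is connected to its former neighbors through a complete binary tree of dummy nodes. The function name `fan_out`, the tuple naming of the dummy nodes, and the assumption that the number of targets is a power of two are all choices made for this illustration; the paper does not prescribe a particular tree shape or node naming.

```python
def fan_out(hub, targets):
    """Connect `hub` to `targets` through a complete binary tree of dummies.

    Assumes len(targets) is a power of two and at least 2 (an assumption of
    this sketch). Returns (edges, dummies); every dummy would carry label d.
    """
    count = len(targets)
    assert count >= 2 and count & (count - 1) == 0
    edges, dummies = [], []
    level = list(targets)
    # Pair up the nodes of the current level under fresh dummy parents
    # until only two nodes remain, which are attached to the hub directly.
    while len(level) > 2:
        parents = []
        for i in range(0, len(level), 2):
            f = ("f", len(dummies))  # hypothetical dummy-node identifier
            dummies.append(f)
            edges.append((f, level[i]))
            edges.append((f, level[i + 1]))
            parents.append(f)
        level = parents
    edges.extend((hub, v) for v in level)
    return edges, dummies
```

With $t$ targets the sketch creates $t-2$ dummy nodes, every hub-to-target path passes through $\log_{2}t-1$ of them, the hub ends up with degree two, and each dummy node has degree three.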

This is not enough to guarantee degree at most three for each node in $G_{F}$ , as nodes $c_{j,h}$ and $d_{j,h}$ could have degree four. For example, together with some nodes $d_{j,h-1},d_{j,h}$ and $d_{j,h+1}$ , nodes $c_{j,h-1},c_{j,h}$ and $c_{j,h+1}$ could exist. Then both $c_{j,h}$ and $d_{j,h}$ would have degree four. This can be fixed as shown in Figure 4(b), adding two pairs of dummy nodes $f$ with label $L(f)=\mathtt{d}$ to lower the degree to three. One pair is placed between nodes $c_{j,h-1},d_{j,h-1}$ and nodes $c_{j,h},d_{j,h}$ via edges $(c_{j,h-1},f_{j,h-1}^{(1)}),(d_{j,h-1},f_{j,h-1}^{(1)})$ and $(f_{j,h-1}^{(2)},c_{j,h}),(f_{j,h-1}^{(2)},d_{j,h})$ . The other pair of dummy nodes $f$ is placed between nodes $c_{j,h},d_{j,h}$ and nodes $c_{j,h+1},d_{j,h+1}$ via edges $(c_{j,h},f_{j,h}^{(1)}),(d_{j,h},f_{j,h}^{(1)})$ and $(f_{j,h}^{(2)},c_{j,h+1}),(f_{j,h}^{(2)},d_{j,h+1})$ .

At this point, we have added $O(2^{\frac{n}{2}})$ dummy nodes $f$ for the binary tree, and $O((k-1)2^{\frac{n}{2}})=\tilde{O}(2^{\frac{n}{2}})$ pairs of nodes $f_{j,h}^{(1)},f_{j,h}^{(2)}$ . Moreover, the new edges for the binary tree are as many as its nodes, while for the other modifications we add one edge for each pair of dummy nodes. The overall time complexity to build the transformed $G_{F}$ therefore remains $\tilde{O}(2^{\frac{n}{2}})$ .

Revised gadget $G_{U}$

Gadget $G_{U}$ has to be consistent with $G_{F}$ . We add $(\log 2^{\frac{n}{2}+1})-1=\frac{n}{2}$ dummy nodes $f$ with label $L(f)=\mathtt{d}$ between every $b_{i}$ node and the nodes $c_{i,1}$ and $d_{i,1}$ following it. We also add $\frac{n}{2}$ dummy nodes $f$ with label $L(f)=\mathtt{d}$ between every node $e_{i}$ and the previous nodes $c_{i,k}$ and $d_{i,k}$ . We are adding $2\frac{n}{2}(2^{\frac{n}{2}}-1)=\tilde{O}(2^{\frac{n}{2}})$ new nodes and one new edge per node, thus the overall time complexity is not affected. The need for this step will become clearer when we modify pattern $P$ : since it has to match either $G_{F}$ or $G_{U}$ , the same format of $P$ is required in both types of gadgets.

We have another issue to handle. As in $G_{F}$ , there could be nodes $c_{i,h}$ and $d_{i,h}$ of degree four. In that case, we add pairs of dummy nodes $f$ with label $L(f)=\mathtt{d}$ following the same scheme presented for $G_{F}$ and illustrated in Figure 4(b). In this way we introduce $2(k-1)(2^{\frac{n}{2}}-1)=\tilde{O}(2^{\frac{n}{2}})$ new nodes and one edge for each pair of dummy nodes, which does not change the time complexity of the reduction.

Revised pattern $P$

Pattern $P=\mathtt{e}\mathtt{b}P_{x_{1}}\texttt{eb}P_{x_{2}}\mathtt{e}\ldots\mathtt{b}P_{x_{2^{\frac{n}{2}}}}\mathtt{e}\mathtt{b}$ is modified so as to match $G_{F}$ and $G_{U}$ when needed. We add $\frac{n}{2}$ symbols $\mathtt{d}$ after each occurrence of $\mathtt{b}$ and before each occurrence of $\mathtt{e}$ . Moreover, we insert $\mathtt{d}$ symbols inside the subpatterns $P_{x_{i}}=a_{1}\,a_{2}\ldots a_{k}$ , where $a_{h}\in\{\texttt{c,d}\}$ , to obtain the new subpatterns $P^{\prime}_{x_{i}}=a_{1}\mathtt{d}\,\mathtt{d}\,a_{2}\,\mathtt{d}\,\mathtt{d}\ldots\mathtt{d}\,\mathtt{d}\,a_{k}$ . Therefore, the new pattern to match will be

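$$P^{\prime}=\mathtt{e}\mathtt{b}\,\mathtt{d}^{\frac{n}{2}}P^{\prime}_{x_{1}}\mathtt{d}^{\frac{n}{2}}\,\mathtt{e}\mathtt{b}\,\mathtt{d}^{\frac{n}{2}}P^{\prime}_{x_{2}}\mathtt{d}^{\frac{n}{2}}\,\mathtt{e}\ldots\mathtt{b}\,\mathtt{d}^{\frac{n}{2}}P^{\prime}_{x_{2^{\frac{n}{2}}}}\mathtt{d}^{\frac{n}{2}}\,\mathtt{e}\mathtt{b}.$$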

It is worth noting that, in this way, $P^{\prime}\in\mathtt{e}\mathtt{b}\,(\{\mathtt{c},\mathtt{d}\}^{+}\mathtt{e}\mathtt{b})^{+}$ . The number of new symbols added before and after the subpatterns is $\frac{n}{2}2^{\frac{n}{2}}=\tilde{O}(2^{\frac{n}{2}})$ while the ones inserted inside them are $2(k-1)2^{\frac{n}{2}}=\tilde{O}(2^{\frac{n}{2}})$ . Hence the time cost of the reduction remains $\tilde{O}(2^{\frac{n}{2}})$ .
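The revision of the pattern described above can be sketched in a few lines. The function name `revise_pattern` and the representation of each subpattern $P_{x_{i}}$ as a Python string over `c` and `d` are assumptions of this illustration; $n$ is assumed to be even.

```python
import re


def revise_pattern(subpatterns, n):
    """Build the revised pattern from the subpatterns P_{x_1}, P_{x_2}, ...

    subpatterns: list of strings over {'c', 'd'} (hypothetical input format).
    n: number of variables, assumed even, so that n // 2 symbols 'd' are
       added after each 'b' and before each 'e' that delimit a subpattern.
    """
    pad = "d" * (n // 2)
    parts = ["eb"]
    for p in subpatterns:
        inner = "dd".join(p)  # a_1 dd a_2 dd ... dd a_k
        parts.append(pad + inner + pad + "eb")
    return "".join(parts)


if __name__ == "__main__":
    revised = revise_pattern(["cd", "dc"], 2)
    print(revised)
    # Membership in eb({c,d}+ eb)+ can be checked with a regular expression.
    print(bool(re.fullmatch(r"eb([cd]+eb)+", revised)))
```

The leading $\mathtt{e}$ and trailing $\mathtt{b}$ receive no padding, which is what keeps the result inside $\mathtt{e}\mathtt{b}\,(\{\mathtt{c},\mathtt{d}\}^{+}\mathtt{e}\mathtt{b})^{+}$ .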

3.2. Binary alphabet

The last step consists in defining a binary encoding $\alpha$ of the symbols $\Sigma=\{\mathtt{b},\mathtt{e},\mathtt{c},\mathtt{d}\}$ , namely,

[TABLE]

Given any string $x=x[1..m]$ , we define its binary encoding $\alpha(x)=\alpha(x[1])\cdots\alpha(x[m])$ . The following useful synchronizing property holds, recalling that each edge connecting a node with label $\mathtt{e}$ to a node with label $\mathtt{b}$ is a bridge in (transformed) $G$ .

Lemma 3.1.

For any string $x\in\Sigma^{+}$ , its binary encoding $\alpha(x)$ contains $\mathtt{0}\mathtt{1}\mathtt{1}\mathtt{0}$ if and only if $x$ contains $\mathtt{e}\mathtt{b}$ .

Proof 3.2.

We observe that $\mathtt{e}$ and $\mathtt{b}$ are encoded by two bits each, while $\mathtt{c}$ and $\mathtt{d}$ are encoded by four bits each. Hence, $\mathtt{0}\mathtt{1}\mathtt{1}\mathtt{0}$ can only appear by concatenating the binary encodings of two or three symbols. On the other hand, $\mathtt{e}\mathtt{b}$ occurs in $x$ if and only if it occurs in a substring of length 3 of $x$ . Consequently, it suffices to check the claim by inspection of all the 64 possible substrings of length 3, $\mathtt{c}\mathtt{c}\mathtt{c}$ , …, $\mathtt{e}\mathtt{e}\mathtt{e}$ , and their encodings to see that the property holds.
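The inspection in this proof is easy to mechanize. The sketch below is an illustration under a stated assumption: the codes $\alpha(\mathtt{b})=\mathtt{10}$ and $\alpha(\mathtt{e})=\mathtt{01}$ follow from $\alpha(\mathtt{b}\mathtt{e})=\mathtt{1001}$ in Lemma 3.3, whereas the four-bit codes used here for $\mathtt{c}$ and $\mathtt{d}$ are hypothetical placeholders that may differ from the paper's encoding. The names `ALPHA`, `encode` and `lemma_3_1_holds` are likewise introduced only for this example.

```python
from itertools import product

# Codes for b and e are implied by alpha(be) = 1001 (Lemma 3.3).
# Codes for c and d are HYPOTHETICAL placeholders chosen for this sketch.
ALPHA = {"b": "10", "e": "01", "c": "0000", "d": "1111"}


def encode(x):
    """Concatenate the binary codes of the symbols of x."""
    return "".join(ALPHA[s] for s in x)


def lemma_3_1_holds(max_len=3):
    """Check, for every string of length 1..max_len over the alphabet,
    that its encoding contains 0110 exactly when the string contains eb."""
    for length in range(1, max_len + 1):
        for symbols in product(ALPHA, repeat=length):
            x = "".join(symbols)
            if ("0110" in encode(x)) != ("eb" in x):
                return False
    return True


if __name__ == "__main__":
    print(lemma_3_1_holds(3))
```

With `max_len=3` the loop covers the 64 strings of length 3 mentioned in the proof, plus the shorter ones. Replacing the two placeholder codes with the paper's actual codes for $\mathtt{c}$ and $\mathtt{d}$ would check the lemma for the real encoding.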

Any walk matched by the revised pattern $P^{\prime}$ crosses the bridges in the direction from $\mathtt{e}$ to $\mathtt{b}$ .

Lemma 3.3.

For any pattern $P^{\prime}$ obtained in the reduction, its binary encoding $\alpha(P^{\prime})$ does not contain $\mathtt{1}\mathtt{0}\mathtt{0}\mathtt{1}=\alpha(\mathtt{b}\mathtt{e})$ .

Proof 3.4.

Recalling that $P^{\prime}\in\mathtt{e}\mathtt{b}\,(\{\mathtt{c},\mathtt{d}\}^{+}\mathtt{e}\mathtt{b})^{+}$ , all the possible substrings of length 3 in $P^{\prime}$ by construction are of the forms $\mathtt{b}\{\mathtt{c},\mathtt{d}\}\mathtt{e}$ , $\mathtt{e}\mathtt{b}\{\mathtt{c},\mathtt{d}\}$ , $\{\mathtt{c},\mathtt{d}\}\mathtt{e}\mathtt{b}$ , $\mathtt{b}\{\mathtt{c},\mathtt{d}\}^{2}$ , $\{\mathtt{c},\mathtt{d}\}^{2}\mathtt{e}$ , and $\{\mathtt{c},\mathtt{d}\}^{3}$ . By inspection of this small number of cases, none contains $\mathtt{b}\mathtt{e}$ , and none of their binary encodings contains $\mathtt{1}\mathtt{0}\mathtt{0}\mathtt{1}$ .

An immediate consequence of Lemmas 3.1 and 3.3 is that the encodings preserve the occurrences. Let $G^{\prime}$ be the transformed graph, and $P^{\prime}$ be the revised pattern in the reduction. Let $\alpha(G^{\prime})$ denote the graph obtained from $G^{\prime}$ by relabeling its nodes with the binary encoding $\alpha$ of their labels.

Lemma 3.5.

In the reduction, $P^{\prime}$ matches in $G^{\prime}$ if and only if $\alpha(P^{\prime})$ matches in $\alpha(G^{\prime})$ .

Proof 3.6.

It follows from Lemmas 3.1 and 3.3, and from the fact that all the edges whose endpoints have one label $\mathtt{e}$ and the other label $\mathtt{b}$ are bridges, which are traversed in the direction from $\mathtt{e}$ to $\mathtt{b}$ when matching $P^{\prime}$ .

In the encoding above, each node stores two or four bits. By replacing it with a chain of two or four nodes with a single bit as a label, we obtain the proof of Corollary 1.5.

3.3. Directed acyclic graphs

In order to prove Corollary 1.6, we observe that the proof of Theorem 1.4 can be easily modified in order to work also for DAGs.

Considering the definitions of edges $E_{F}$ and $E_{U}$ in the proof of Theorem 1.4, and the transformation described so far, we immediately obtain a directed graph $G^{\prime}$ that is acyclic. Indeed, because of bridges and occurrences of $\mathtt{e}\mathtt{b}$ in the pattern, each pattern match must begin with some bridge, end with a different bridge, and lie along a path from the first to the last bridge in the graph. So the edges can be oriented by construction from left (first bridge) to right (last bridge), as can be checked in Figures 1–4.

4. Conclusions

We studied the complexity of pattern matching on labeled graphs, giving a SETH-based conditional quadratic lower bound for exact pattern matching. In strings, exact pattern matching takes linear time whereas approximate pattern matching takes quadratic time under a matching conditional lower bound. Differently from strings, our result along with the upper bounds in [3, 24] implies that exact and approximate pattern matching (the latter with errors in the pattern) have the same complexity under SETH. Our conditional lower bound uses a binary alphabet and holds even if restricted to nodes of degree at most three for undirected graphs, and to nodes whose sum of indegree and outdegree is at most three for directed acyclic graphs (DAGs).

Two border cases are left if the maximum degree or sum of indegree and outdegree is at most two: a) when the undirected graph is a simple path or a cycle, and pattern matching goes along a walk (so it is a sort of zig-zag string matching), and b) when the graph is a directed cycle. For a), we can convert each edge into a pair of arcs and apply the known quadratic algorithm in [3]. On the other hand, we can extend our reduction to derive a matching SETH lower bound for this case [12]. For b), we can adapt any known string matching algorithm (e.g. [19]) to get linear time.

An interesting and natural question for directed graphs is what happens when the graph is deterministic, that is, for each symbol $c$ and each node $v$ , there is at most one neighbor of $v$ labeled with $c$ . Unfortunately, this does not make the problem any easier. Although our reduction creates an inherently non-deterministic graph, it is possible to alter the reduction scheme to create a deterministic graph [12].

Acknowledgements

The first two authors are grateful to Alessio Conte and Luca Versari for providing their comments on the reduction. The last author wishes to thank the participants of the annual research group retreat for sparking the idea to study SETH reductions in this context. We thank the anonymous reviewers of an earlier version of this paper for useful suggestions for improving the presentation and for pointing out the connection to regular expression matching.

Bibliography


[1] Amir Abboud, Arturs Backurs, and Virginia Vassilevska Williams. Tight hardness results for LCS and other sequence similarity measures. In IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley, CA, USA, 17-20 October, 2015, pages 59–78, 2015.
[2] Tatsuya Akutsu. A linear time pattern matching algorithm between a string and a tree. In 4th Symposium on Combinatorial Pattern Matching, Padova, Italy, pages 1–10, 1993.
[3] Amihood Amir, Moshe Lewenstein, and Noa Lewenstein. Pattern matching in hypertext. J. Algorithms, 35(1):82–99, 2000.
[4] Renzo Angles and Claudio Gutierrez. Survey of graph database models. ACM Comput. Surv., 40(1):1:1–1:39, February 2008.
[5] Arturs Backurs and Piotr Indyk. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). In Proceedings of the Forty-seventh Annual ACM Symposium on Theory of Computing, STOC ’15, pages 51–58, New York, NY, USA, 2015. ACM.
[6] Arturs Backurs and Piotr Indyk. Which regular expression patterns are hard to match? In IEEE 57th Annual Symposium on Foundations of Computer Science, FOCS 2016, 9-11 October 2016, Hyatt Regency, New Brunswick, New Jersey, USA, pages 457–466, 2016.
[7] Arturs Backurs and Christos Tzamos. Improving Viterbi is hard: Better runtimes imply faster clique algorithms. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70, pages 311–321. PMLR, 2017.
[8] Karl Bringmann and Marvin Künnemann. Quadratic conditional lower bounds for string problems and dynamic time warping. In Proceedings of the 2015 IEEE 56th Annual Symposium on Foundations of Computer Science (FOCS), FOCS ’15, pages 79–97, Washington, DC, USA, 2015. IEEE Computer Society.
