Consensus Patterns parameterized by input string length is W[1]-hard

Laurent Bulteau

arXiv:1702.08238 · cs.CC · February 28, 2017

TL;DR

This paper proves that the Consensus Patterns problem, which involves finding a common pattern in multiple strings with errors, is computationally hard (W[1]-hard) when parameterized by input string length.

Contribution

It shows, by a reduction from Multi-Colored Clique, that the Consensus Patterns problem is W[1]-hard when parameterized by the maximum length of the input strings.

Findings

01

Consensus Patterns problem is W[1]-hard when parameterized by string length.

02

The result concerns the variant in which the sum of the Hamming distances to the chosen substrings is bounded, rather than each distance individually.

03

This contrasts with Consensus Substring, which is FPT for the same parameter.

Abstract

We consider the Consensus Patterns problem, where, given a set of input strings, one is asked to extract a long-enough pattern which appears (with some errors) in all strings. We prove that this problem is W[1]-hard when parameterized by the maximum length of input strings.

Taxonomy

Topics: Algorithms and Data Compression · DNA and Biological Computing · Network Packet Processing and Optimization

Full text

We consider the Consensus Patterns problem, where, given a set of input strings, one is asked to extract a long-enough pattern which appears (with some errors) in all strings. Formally, the problem is defined as follows:

Consensus Patterns

Input: Strings $S_{1},\ldots,S_{n}$ of length at most $\ell$, integers $m$ and $d$.

Output: Length-$m$ string $S$ and integers $(j_{1},\ldots,j_{n})$ such that $\sum_{i=1}^{n}\mathrm{Ham}(S,S_{i}[j_{i}..j_{i}+m-1])\leq d$.

Here $\mathrm{Ham}()$ denotes the Hamming distance and $S[a..b]$ is the substring of $S$ starting at position $a$ and ending at position $b$. This problem is one of many variations of the well-studied Consensus String problem. It is similar to Consensus Substring in that the target string must be close to a substring of each input string (rather than the whole string). However, in Consensus Substring the distance to each input string is bounded individually, whereas here the bound applies to the sum of the distances.
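The objective in the definition above can be evaluated directly for a candidate pattern. The following is a minimal sketch, not taken from the paper: the helper names `hamming`, `best_alignment` and `total_distance` are illustrative, and it assumes every input string has length at least $m$.

```python
def hamming(a, b):
    """Hamming distance between two sequences of equal length."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))


def best_alignment(pattern, s):
    """Return (distance, start) for the window of s closest to pattern.

    The start index is 1-based, matching the S[a..b] notation of the text.
    Assumes len(s) >= len(pattern).
    """
    m = len(pattern)
    return min((hamming(pattern, s[j:j + m]), j + 1)
               for j in range(len(s) - m + 1))


def total_distance(pattern, strings):
    """Sum over all input strings of the distance to their best window."""
    return sum(best_alignment(pattern, s)[0] for s in strings)


if __name__ == "__main__":
    strings = ["xabcx", "abd", "zzabc"]
    # A pattern S is a valid output for bound d iff total_distance(S, strings) <= d.
    print(total_distance("abc", strings))
```

Choosing the best window independently for each string is enough here, because the positions $j_{i}$ do not interact once $S$ is fixed.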

We look at this problem from the parameterized complexity viewpoint, more precisely for parameter $\ell$ . Recall that Consensus Substring is FPT for parameter $\ell$ [3]. See [1] for an overview of the variants of Consensus String, and [4] for recent advances on parameterized aspects of Consensus Substring and Consensus Patterns. We prove the following result.

Theorem 1.

Consensus Patterns $(\ell)$ is W[1]-hard.

The proof is by reduction from Multi-Colored Clique. We are given a graph $G=(V,E)$, with a partition (coloring) $V=V_{1}\cup V_{2}\cup\ldots\cup V_{k}$, such that no edge has both endpoints of the same color. Assume that $|V_{h}|=n$ for all $h\in[k]$. Write $V_{h}=\{v_{h,1},v_{h,2},\ldots,v_{h,n}\}$, i.e. each vertex has an index depending both on its color and its rank within its color. Let $m=|E|$. Multi-Colored Clique is W[1]-hard for parameter $k$ [2]. See Figure 1 for an example of the reduction.

We build an alphabet $\Sigma$ containing $V$ (i.e., one symbol per vertex) and two special characters $\$$ and ${\circ}$.

Define string ${\mathcal{V}}_{i}=\$ v_{1,i}v_{2,i}\ldots v_{k,i}$. Let $e=(v_{h,i},v_{h^{\prime},i^{\prime}})$ be the $j$th edge of $E$, $j\in[m]$. Define ${\mathcal{E}}_{j}$ as the string starting with $\$$, followed by $k+1$ characters: all ${\circ}$, except for two positions: ${\mathcal{E}}_{j}[h+2]=v_{h,i}$ and ${\mathcal{E}}_{j}[h^{\prime}+2]=v_{h^{\prime},i^{\prime}}$.

Let $N=m(k+2)+1$. The instance $\mathcal{I}$ of Consensus Patterns contains $N$ occurrences of each string ${\mathcal{V}}_{i}$, $i\in[n]$, and one occurrence of each string ${\mathcal{E}}_{j}$, $j\in[m]$. The target length (the integer $m$ of the problem definition, written $\mathrm{m}$ here to distinguish it from $m=|E|$) is $\mathrm{m}=k+1$.
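The construction can be sketched in a few lines. This is an illustration under assumptions that are not in the paper: vertex $v_{h,i}$ is encoded as the tuple `(h, i)`, the special characters as `"$"` and `"o"`, strings as Python lists of symbols, and each edge string places $v_{h,i}$ at 1-based position $h+2$, as the alignment argument of Lemma 2 requires. The function name `build_instance` is hypothetical.

```python
def build_instance(k, n, edges):
    """Build the Consensus Patterns instance for a k-colored graph.

    edges: list of ((h, i), (h2, i2)) pairs with 1-based color h and rank i,
           where the two endpoints have different colors.
    Returns (strings, target_length, N).
    """
    m = len(edges)                      # m = |E|
    N = m * (k + 2) + 1                 # number of copies of each vertex string
    DOLLAR, CIRC = "$", "o"

    # V_i = $ v_{1,i} v_{2,i} ... v_{k,i}   (length k+1)
    vertex_strings = [[DOLLAR] + [(h, i) for h in range(1, k + 1)]
                      for i in range(1, n + 1)]

    # E_j = $ followed by k+1 characters, all "o" except the two endpoints
    edge_strings = []
    for (h, i), (h2, i2) in edges:
        e = [DOLLAR] + [CIRC] * (k + 1)
        e[h + 1] = (h, i)               # 1-based position h+2
        e[h2 + 1] = (h2, i2)            # 1-based position h'+2
        edge_strings.append(e)

    strings = [v for v in vertex_strings for _ in range(N)] + edge_strings
    return strings, k + 1, N


if __name__ == "__main__":
    strings, target, N = build_instance(3, 2, [((1, 1), (2, 1)), ((2, 1), (3, 2))])
    print(N, target, len(strings), max(len(s) for s in strings))
```

Every string has length at most $k+2$, which is what ties the parameter $\ell$ of the built instance to $k$.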

Note that, due to the large value of $N$, any solution $S$ must be at minimum total distance from the set of strings $\{{\mathcal{V}}_{i}\mid i\in[n]\}$. Otherwise (if it is, say, at the minimum distance plus one), the distance to the whole instance $\mathcal{I}$ increases by at least $N$, which cannot be compensated by the remaining strings ${\mathcal{E}}_{j}$ (which have total size $m(k+2)<N$). Hence we first enumerate the optimal solutions for the set $\{{\mathcal{V}}_{i}\mid i\in[n]\}$.

Lemma 1.

The consensus patterns of $\{{\mathcal{V}}_{i}\mid i\in[n]\}$ (i.e., the strings of length $k+1$ at minimum total distance from strings ${\mathcal{V}}_{i}$) are the strings of the form $S=\$ v_{1,i_{1}}\ldots v_{k,i_{k}}$ with $i_{1},\ldots,i_{k}\in[n]$. Such a string has a total distance of $(n-1)k$.

Proof.

Since all strings in $\{{\mathcal{V}}_{i}\mid i\in[n]\}$ have length $k+1$, any consensus pattern $S$ must be aligned with ${\mathcal{V}}_{i}$ from the very first character. Hence $S$ is a consensus string of $\{{\mathcal{V}}_{i}\mid i\in[n]\}$. The consensus strings of this set are obtained by taking a majority character at each position. At position $1$ every string carries $\$$; at position $h+1$ the $n$ strings carry the $n$ distinct vertices of $V_{h}$, so each of them is a (tied) majority character and any such choice costs $n-1$. Thus, $S[1]=\$$, and, for all $h\in[k]$, there exists $i_{h}$ such that $S[h+1]=v_{h,i_{h}}$; the total distance is $(n-1)k$. ∎
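Lemma 1 can be checked exhaustively on tiny parameters. The sketch below is only a sanity check, not part of the proof; it assumes the same illustrative encoding as before (vertices as tuples `(h, i)`, special characters `"$"` and `"o"`), and the name `lemma1_check` is hypothetical. It enumerates all $|\Sigma|^{k+1}$ candidate strings, so it is only usable for very small $k$ and $n$.

```python
from itertools import product


def hamming(a, b):
    """Hamming distance between two sequences of equal length."""
    return sum(x != y for x, y in zip(a, b))


def lemma1_check(k, n):
    """Return (minimum total distance, set of optimal strings) for the V_i."""
    vertices = [(h, i) for h in range(1, k + 1) for i in range(1, n + 1)]
    alphabet = ["$", "o"] + vertices
    v_strings = [("$",) + tuple((h, i) for h in range(1, k + 1))
                 for i in range(1, n + 1)]
    # Total distance of every length-(k+1) string over the alphabet.
    cost = {s: sum(hamming(s, v) for v in v_strings)
            for s in product(alphabet, repeat=k + 1)}
    best = min(cost.values())
    optimal = {s for s, c in cost.items() if c == best}
    return best, optimal


if __name__ == "__main__":
    best, optimal = lemma1_check(2, 3)
    # Lemma 1 predicts best == (n-1)*k and n**k optimal strings.
    print(best, len(optimal))
```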

Consider now an optimal solution $S$ for $\mathcal{I}$ . Let $\{i_{h}\mid h\in[k]\}$ be the set of indices as obtained from the lemma above. We show that the set of vertices $K=\{v_{h,i_{h}}\mid h\in[k]\}$ forms a clique of $G$ iff the distance is below a certain threshold. To this end, we compute the best possible alignment between $S$ and each string ${\mathcal{E}}_{j}$ .

Lemma 2.

Let $j\in[m]$ . If both endpoints of edge $e_{j}$ are in $K$ then there exists an alignment of $S$ at distance $k-1$ from ${\mathcal{E}}_{j}$ , otherwise the best possible alignment has distance $k$ .

Proof.

Let $h_{\mathrm{s}},h_{\mathrm{t}},i_{\mathrm{s}},i_{\mathrm{t}}$ be such that $e_{j}=(v_{h_{\mathrm{s}},i_{\mathrm{s}}},v_{h_{\mathrm{t}},i_{\mathrm{t}}})$ . There are two possible alignments of $S$ with ${\mathcal{E}}_{j}$ : $S[1]$ is aligned either with ${\mathcal{E}}_{j}[1]$ or with ${\mathcal{E}}_{j}[2]$ . We compute the distance in both cases.

If $S[1]$ is aligned with ${\mathcal{E}}_{j}[1]$, then there is exactly one common character, namely $S[1]={\mathcal{E}}_{j}[1]=\$$. Indeed, for all $h\in[k]$, $S[h+1]\in V_{h}$ and ${\mathcal{E}}_{j}[h+1]\in V_{h-1}\cup\{{\circ}\}$ (with $V_{0}=\emptyset$), hence these two characters are different. The distance in this case is $k$.

If $S[1]$ is aligned with ${\mathcal{E}}_{j}[2]$, then first note that $S[1]=\$\neq{\circ}={\mathcal{E}}_{j}[2]$. For every $h\in[k]$, character $S[h+1]=v_{h,i_{h}}$ is aligned with ${\mathcal{E}}_{j}[h+2]$. Consider index $h_{\mathrm{s}}$. If $i_{\mathrm{s}}=i_{h_{\mathrm{s}}}$, then $S[h_{\mathrm{s}}+1]=v_{h_{\mathrm{s}},i_{h_{\mathrm{s}}}}={\mathcal{E}}_{j}[h_{\mathrm{s}}+2]$, otherwise $S[h_{\mathrm{s}}+1]=v_{h_{\mathrm{s}},i_{h_{\mathrm{s}}}}\neq v_{h_{\mathrm{s}},i_{\mathrm{s}}}={\mathcal{E}}_{j}[h_{\mathrm{s}}+2]$. Similarly for $h_{\mathrm{t}}$, $S[h_{\mathrm{t}}+1]={\mathcal{E}}_{j}[h_{\mathrm{t}}+2]$ iff $i_{\mathrm{t}}=i_{h_{\mathrm{t}}}$. For other values of $h$ (i.e. $h\in[k]\setminus\{h_{\mathrm{s}},h_{\mathrm{t}}\}$), $S[h+1]\neq{\circ}={\mathcal{E}}_{j}[h+2]$. The distance is thus $k-1$ iff $i_{\mathrm{s}}=i_{h_{\mathrm{s}}}$ and $i_{\mathrm{t}}=i_{h_{\mathrm{t}}}$, and it is at least $k$ otherwise.

Overall, if $i_{\mathrm{s}}=i_{h_{\mathrm{s}}}$ and $i_{\mathrm{t}}=i_{h_{\mathrm{t}}}$ the optimal alignment has distance $k-1$ , otherwise the optimal alignment has distance $k$ .

∎
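The two alignments considered in the proof of Lemma 2 can be computed explicitly. The following sketch assumes the illustrative encoding used earlier (vertices as tuples `(h, i)`, `"$"` and `"o"` for the special characters, $v_{h,i}$ at 1-based position $h+2$ of the edge string); the names `edge_string` and `edge_distance` are hypothetical.

```python
def hamming(a, b):
    """Hamming distance between two sequences of equal length."""
    return sum(x != y for x, y in zip(a, b))


def edge_string(k, edge):
    """E_j for edge ((h, i), (h2, i2)): '$', then k+1 symbols, mostly 'o'."""
    (h, i), (h2, i2) = edge
    e = ["$"] + ["o"] * (k + 1)
    e[h + 1] = (h, i)        # 1-based position h+2
    e[h2 + 1] = (h2, i2)     # 1-based position h'+2
    return e


def edge_distance(S, E):
    """Best of the two alignments of S (length k+1) inside E (length k+2)."""
    first = hamming(S, E[:len(S)])     # S[1] aligned with E[1]
    second = hamming(S, E[1:])         # S[1] aligned with E[2]
    return min(first, second)


if __name__ == "__main__":
    k = 3
    S = ["$", (1, 1), (2, 1), (3, 2)]          # K = {v_{1,1}, v_{2,1}, v_{3,2}}
    inside = edge_string(k, ((1, 1), (2, 1)))  # both endpoints in K
    half = edge_string(k, ((1, 2), (2, 1)))    # only one endpoint in K
    print(edge_distance(S, inside), edge_distance(S, half))
```

In this example the edge with both endpoints in $K$ is at distance $k-1=2$ and the other edge at distance $k=3$, in line with the lemma.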

We can now conclude the proof. Let $S$ be an optimal solution of Consensus Patterns for instance $\mathcal{I}$ and $K$ its corresponding set of vertices. The distance from $S$ to the $N$ copies of strings ${\mathcal{V}}_{i}$ is $N(n-1)k$. The distance between $S$ and ${\mathcal{E}}_{j}$ is $k-1$ if both endpoints of $e_{j}$ are in $K$, and $k$ otherwise. Let $|E(K)|$ be the number of edges with both endpoints in $K$: the total distance from $S$ to strings ${\mathcal{E}}_{j}$ is thus $mk-|E(K)|$, and the total distance from $S$ to $\mathcal{I}$ is $N(n-1)k+mk-|E(K)|$. Overall, the optimal distance is at most $N(n-1)k+mk-\frac{k(k-1)}{2}$ if, and only if, $G$ contains a size-$k$ set of vertices $K$ with $|E(K)|\geq\frac{k(k-1)}{2}$, i.e. if $G$ contains a clique of size $k$. Since every string of $\mathcal{I}$ has length at most $\ell=k+2$, this is a parameterized reduction from Multi-Colored Clique with parameter $k$.
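The threshold argument can be exercised end to end on a toy graph. The sketch below is an illustration under the same assumed encoding as before (vertices as tuples `(h, i)`, `"$"` and `"o"` as special characters, $v_{h,i}$ at 1-based position $h+2$ of each edge string). It relies on Lemma 1 to restrict the candidates to strings $\$ v_{1,i_{1}}\ldots v_{k,i_{k}}$ and tries all $n^{k}$ of them, so it only demonstrates the equivalence and is not an efficient algorithm. The function name `clique_via_reduction` is hypothetical.

```python
from itertools import product


def hamming(a, b):
    """Hamming distance between two sequences of equal length."""
    return sum(x != y for x, y in zip(a, b))


def best(S, T):
    """Distance from S to its closest window in T (len(T) >= len(S))."""
    m = len(S)
    return min(hamming(S, T[j:j + m]) for j in range(len(T) - m + 1))


def clique_via_reduction(k, n, edges):
    """True iff some candidate pattern meets the distance threshold."""
    m = len(edges)
    N = m * (k + 2) + 1
    v_strings = [["$"] + [(h, i) for h in range(1, k + 1)]
                 for i in range(1, n + 1)]
    e_strings = []
    for (h, i), (h2, i2) in edges:
        e = ["$"] + ["o"] * (k + 1)
        e[h + 1], e[h2 + 1] = (h, i), (h2, i2)
        e_strings.append(e)
    threshold = N * (n - 1) * k + m * k - k * (k - 1) // 2
    for choice in product(range(1, n + 1), repeat=k):
        S = ["$"] + [(h, choice[h - 1]) for h in range(1, k + 1)]
        # Each V_i occurs N times in the instance, each E_j once.
        total = (N * sum(best(S, v) for v in v_strings)
                 + sum(best(S, e) for e in e_strings))
        if total <= threshold:
            return True
    return False


if __name__ == "__main__":
    triangle = [((1, 1), (2, 1)), ((2, 1), (3, 2)), ((1, 1), (3, 2))]
    no_clique = [((1, 1), (2, 1)), ((2, 1), (3, 2)), ((1, 2), (3, 2))]
    print(clique_via_reduction(3, 2, triangle), clique_via_reduction(3, 2, no_clique))
```

For $k=3$, $n=2$ and three edges, the threshold is $16\cdot 3+9-3=54$; the triangle reaches exactly $54$, while the second graph cannot get below $55$.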

Bibliography

[1] Laurent Bulteau, Falk Hüffner, Christian Komusiewicz, and Rolf Niedermeier. Multivariate algorithmics for NP-hard string problems. Bulletin of the EATCS, 114, 2014.
[2] Rodney G. Downey and Michael Ralph Fellows. Parameterized Complexity. Springer Science & Business Media, 2012.
[3] Patricia A. Evans, Andrew D. Smith, and Harold T. Wareham. On the complexity of finding common approximate substrings. Theor. Comput. Sci., 306(1-3):407–430, 2003.
[4] Markus L. Schmid. Finding consensus strings with small length difference between input and solution strings. In MFCS 2015, Part II, volume 9235 of LNCS, pages 542–554. Springer, 2015.
