Consensus Patterns parameterized by input string length is W[1]-hard
Laurent Bulteau

TL;DR
This paper proves that the Consensus Patterns problem, which asks for a pattern that occurs with a bounded total number of mismatches in every input string, is W[1]-hard when parameterized by the maximum length of the input strings.
Contribution
It establishes, by a reduction from Multi-Colored Clique, that Consensus Patterns is W[1]-hard when parameterized by the maximum length of the input strings.
Findings
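The Consensus Patterns problem is W[1]-hard when parameterized by the maximum length of the input strings.
The hardness holds for the variant in which the sum of the Hamming distances between the pattern and its occurrences is bounded.
Consequently, unless FPT = W[1], the problem has no algorithm whose running time is a function of the maximum string length times a polynomial in the input size.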
Abstract
We consider the Consensus Patterns problem, where, given a set of input strings, one is asked to extract a long-enough pattern which appears (with some errors) in all strings. We prove that this problem is W[1]-hard when parameterized by the maximum length of input strings.
Taxonomy
Topics: Algorithms and Data Compression · DNA and Biological Computing · Network Packet Processing and Optimization
We consider the Consensus Patterns problem, where, given a set of input strings, one is asked to extract a long-enough pattern which appears (with some errors) in all strings. Formally, the problem is defined as follows:
Consensus Patterns
Input: Strings $S_1,\ldots,S_p$ of length at most $\ell$, integers $L$ and $d$.
Output: Length-$L$ string $S$ and integers $j_1,\ldots,j_p$ such that $\sum_{i=1}^{p}\mathrm{Ham}(S,S_i[j_i..j_i+L-1])\leq d$,
where $\mathrm{Ham}$ denotes the Hamming distance and $T[a..b]$ is the substring of $T$ starting in $a$ and ending in $b$. This problem is one of many variations of the well-studied Consensus String problem. It is similar to Consensus Substring in that the target string must be close to a substring of each input string (rather than the whole string). However, in the latter problem the distance to each input string is bounded, whereas in our case only the sum of the distances is bounded.
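To make the objective concrete, the following Python sketch evaluates a candidate pattern against a set of strings by picking, for each string, the window at minimum Hamming distance. The function names and the toy strings are illustrative assumptions, not taken from the paper, and positions are 0-based here.

```python
def hamming(a, b):
    """Number of positions where two equal-length sequences differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))


def best_alignment(pattern, text):
    """Return (distance, start) for the window of `text` closest to `pattern`.

    Assumes len(text) >= len(pattern); `start` is 0-based.
    """
    length = len(pattern)
    return min(
        (hamming(pattern, text[j:j + length]), j)
        for j in range(len(text) - length + 1)
    )


def total_distance(pattern, strings):
    """Sum, over all input strings, of the distance to their best window."""
    return sum(best_alignment(pattern, s)[0] for s in strings)


strings = ["abcab", "bbcba", "cacca"]
print(best_alignment("bca", "bbcba"))  # (1, 1): one mismatch, window starts at index 1
print(total_distance("bca", strings))  # 2
```

A pattern is a valid output exactly when this total is at most the distance bound; only the evaluation of a given pattern is shown, not the search for one.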
We look at this problem from the parameterized complexity viewpoint, more precisely for parameter $\ell$, the maximum length of the input strings. Recall that Consensus Substring is FPT for parameter $\ell$ [3]. See [1] for an overview of the variants of Consensus String, and [4] for recent advances on parameterized aspects of Consensus Substring and Consensus Patterns. We prove the following result.
Theorem 1.
Consensus Patterns is W[1]-hard when parameterized by the maximum length of the input strings.
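By reduction from Multi-Colored Clique. We are given a graph $G=(V,E)$, with a partition (coloring) $V=V_1\cup\ldots\cup V_k$, such that no edge has both endpoints of the same color. Assume that $|V_h|=n$ for all $h\in[k]$. Write $V_h=\{v_{h,1},\ldots,v_{h,n}\}$, i.e. each vertex has an index depending both on its color and its rank within its color. Let $m=|E|$ and write $E=\{e_1,\ldots,e_m\}$. The question is whether $G$ contains a clique with exactly one vertex of each color. Multi-Colored Clique is W[1]-hard for parameter $k$ [2]. See Figure 1 for an example of the reduction.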
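As a reference point for the source problem, here is a minimal brute-force sketch in Python. It assumes a vertex $v_{h,i}$ is encoded as the tuple `(h, i)` (color, then rank, both 1-based) and that edges are given as pairs of such tuples; these encodings and the function name are illustrative choices, not part of the paper. The search tries all $n^k$ selections, so it is only meant for tiny instances.

```python
from itertools import product


def has_multicolored_clique(k, n, edges):
    """Return True if some choice of one vertex per color is a clique.

    edges: iterable of ((h, i), (h2, i2)) pairs, where (h, i) stands
    for the vertex of color h and rank i.
    """
    edge_set = {frozenset(e) for e in edges}
    for ranks in product(range(1, n + 1), repeat=k):
        chosen = [(h, ranks[h - 1]) for h in range(1, k + 1)]
        if all(
            frozenset((u, v)) in edge_set
            for a, u in enumerate(chosen)
            for v in chosen[a + 1:]
        ):
            return True
    return False


triangle = [((1, 1), (2, 1)), ((2, 1), (3, 2)), ((1, 1), (3, 2))]
print(has_multicolored_clique(3, 2, triangle))      # True
print(has_multicolored_clique(3, 2, triangle[:2]))  # False
```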
We build an alphabet containing $V$ (i.e., one symbol per vertex) and two special characters $\$$ and $\circ$.
Define string $\mathcal{V}_i=\$v_{1,i}v_{2,i}\ldots v_{k,i}$ for each $i\in[n]$. For each edge $e=(v_{h,i},v_{h',i'})$ with index $j$ in $E$, $j\in[m]$, define $\mathcal{E}_j$ as the character $\$$ followed by $k+1$ characters $\circ$, except that $\mathcal{E}_j[h+2]=v_{h,i}$ and $\mathcal{E}_j[h'+2]=v_{h',i'}$.
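Let $N$ be a sufficiently large integer (any $N>m(k+2)$ suffices for the argument below). The instance of Consensus Patterns contains $N$ occurrences of strings $\mathcal{V}_i$, $i\in[n]$, and one occurrence of strings $\mathcal{E}_j$, $j\in[m]$. The target length is $k+1$, and every input string has length at most $k+2$.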
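The construction can be sketched in Python as follows. The sketch rests on several assumptions that are not spelled out in the text: the padding character $\circ$ is written `"o"`, a vertex $v_{h,i}$ is the tuple `(h, i)`, each edge string is `$` followed by $k+1$ padding characters with the two endpoints written at (1-based) positions $h+2$ and $h'+2$, and the number of copies `N` is passed in by the caller. Strings are tuples of symbols rather than Python `str` objects.

```python
DOLLAR, PAD = "$", "o"  # stand-ins for the two special characters


def build_instance(k, n, edges, N):
    """Return (strings, target_length) for the reduction sketched above.

    edges: list of ((h, i), (h2, i2)) with 1-based colors h != h2.
    N: number of copies of each vertex string (assumed to be "large").
    """
    vertex_strings = [
        (DOLLAR,) + tuple((h, i) for h in range(1, k + 1))
        for i in range(1, n + 1)
    ]
    edge_strings = []
    for (h, i), (h2, i2) in edges:
        e = [DOLLAR] + [PAD] * (k + 1)
        e[h + 1] = (h, i)      # 1-based position h + 2
        e[h2 + 1] = (h2, i2)   # 1-based position h2 + 2
        edge_strings.append(tuple(e))
    return vertex_strings * N + edge_strings, k + 1


strings, target = build_instance(3, 2, [((1, 1), (2, 2))], N=2)
print(target)       # 4
print(strings[-1])  # ('$', 'o', (1, 1), (2, 2), 'o')
```

Every vertex string has length $k+1$ and every edge string has length $k+2$, so the maximum string length depends on $k$ only, which is what a parameterized reduction to this parameter needs.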
Note that due to the large number $N$ of copies of each string $\mathcal{V}_i$, any solution must have a minimal distance to the set of strings $\{\mathcal{V}_i \mid i\in[n]\}$. Otherwise (if it is, say, at the minimum distance plus one), the distance to the whole instance increases by at least $N$, which cannot be compensated by the remaining strings (which have total size $m(k+2)$). Hence we first enumerate the optimal solutions for the set $\{\mathcal{V}_i \mid i\in[n]\}$.
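Lemma 1.
The Consensus Patterns of $\{\mathcal{V}_i \mid i\in[n]\}$ (i.e., the strings of length $k+1$ at minimum total distance from strings $\mathcal{V}_1,\ldots,\mathcal{V}_n$) are the strings of the form $S=\$v_{1,i_1}\ldots v_{k,i_k}$ with $i_1,\ldots,i_k\in[n]$. Their total distance is $(n-1)k$.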
Proof.
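Since all strings in $\{\mathcal{V}_i \mid i\in[n]\}$ have length $k+1$, any consensus pattern $S$ must be aligned with each $\mathcal{V}_i$ from the very first character. Hence $S$ is a consensus string of $\{\mathcal{V}_i \mid i\in[n]\}$. The consensus strings of this set are obtained by taking a most frequent character at each position. Thus, $S[1]=\$$, and, for all $h\in[k]$, there exists $i_h$ such that $S[h+1]=v_{h,i_h}$. Each position $h+1$ then differs from exactly $n-1$ of the $n$ strings, which gives the total distance $(n-1)k$. ∎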
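The lemma can be checked exhaustively on very small parameters. The Python sketch below enumerates every string of length $k+1$ over the alphabet and keeps those at minimum total distance from one copy of each vertex string. The encoding (`"o"` for the padding character, tuples `(h, i)` for vertices) and the function names are assumptions made for illustration.

```python
from itertools import product


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def consensus_strings(k, n):
    """Brute-force the minimum total distance and all strings achieving it."""
    vertex_strings = [
        ("$",) + tuple((h, i) for h in range(1, k + 1))
        for i in range(1, n + 1)
    ]
    alphabet = ["$", "o"] + [
        (h, i) for h in range(1, k + 1) for i in range(1, n + 1)
    ]
    best, winners = None, []
    for cand in product(alphabet, repeat=k + 1):
        cost = sum(hamming(cand, v) for v in vertex_strings)
        if best is None or cost < best:
            best, winners = cost, [cand]
        elif cost == best:
            winners.append(cand)
    return best, winners


best, winners = consensus_strings(k=2, n=3)
print(best, len(winners))  # 4 9, i.e. (n-1)k and n**k
```

The enumeration is exponential in $k$ and serves only as a sanity check of the statement, not as part of the reduction.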
Consider now an optimal solution $S$ for the whole instance. Let $\{i_1,\ldots,i_k\}$ be the set of indices as obtained from the lemma above. We show that the set of vertices $K=\{v_{h,i_h}\mid h\in[k]\}$ forms a clique of $G$ iff the distance is below a certain threshold. To this end, we compute the best possible alignment between $S$ and each string $\mathcal{E}_j$.
Lemma 2.
Let $j\in[m]$. If both endpoints of edge $e_j$ are in $K$ (the set of vertices $v_{h,i_h}$ selected by $S$) then there exists an alignment of $S$ at distance $k-1$ from $\mathcal{E}_j$, otherwise the best possible alignment has distance $k$.
Proof.
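Let $h_{\mathrm{s}},i_{\mathrm{s}},h_{\mathrm{t}},i_{\mathrm{t}}$ be such that $e_j=(v_{h_{\mathrm{s}},i_{\mathrm{s}}},v_{h_{\mathrm{t}},i_{\mathrm{t}}})$. There are two possible alignments of $S$ with $\mathcal{E}_j$: $S[1]$ is aligned either with $\mathcal{E}_j[1]$ or with $\mathcal{E}_j[2]$. We compute the distance in both cases.
If $S[1]$ is aligned with $\mathcal{E}_j[1]$, then there is exactly one common character, namely $S[1]=\mathcal{E}_j[1]=\$$. Indeed, for all $h\in[k]$, $S[h+1]\in V_h$ and $\mathcal{E}_j[h+1]\in V_{h-1}\cup\{\circ\}$ (with $V_0=\emptyset$). The distance is thus $k$.
If $S[1]$ is aligned with $\mathcal{E}_j[2]$, then first note that $S[1]=\$\neq\circ=\mathcal{E}_j[2]$. For $h_{\mathrm{s}}$: if $i_{\mathrm{s}}=i_{h_{\mathrm{s}}}$ then $S[h_{\mathrm{s}}+1]=v_{h_{\mathrm{s}},i_{h_{\mathrm{s}}}}=\mathcal{E}_j[h_{\mathrm{s}}+2]$, otherwise $S[h_{\mathrm{s}}+1]\neq\mathcal{E}_j[h_{\mathrm{s}}+2]$. Similarly for $h_{\mathrm{t}}$, $S[h_{\mathrm{t}}+1]=\mathcal{E}_j[h_{\mathrm{t}}+2]$ iff $i_{\mathrm{t}}=i_{h_{\mathrm{t}}}$. Finally, for every $h\in[k]\setminus\{h_{\mathrm{s}},h_{\mathrm{t}}\}$, $S[h+1]\neq\circ=\mathcal{E}_j[h+2]$. The distance is thus $k-1$ if $i_{\mathrm{s}}=i_{h_{\mathrm{s}}}$ and $i_{\mathrm{t}}=i_{h_{\mathrm{t}}}$, and at least $k$ otherwise.
Overall, if $i_{\mathrm{s}}=i_{h_{\mathrm{s}}}$ and $i_{\mathrm{t}}=i_{h_{\mathrm{t}}}$ the optimal alignment has distance $k-1$, otherwise the optimal alignment has distance $k$.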
∎
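The two alignments considered in the proof can be replayed on a small example. The Python sketch below assumes the edge-string layout `$` followed by $k+1$ padding characters, with endpoints at 1-based positions $h+2$ and $h'+2$, writes the padding character as `"o"`, and encodes $v_{h,i}$ as `(h, i)`; all of these are illustrative assumptions. It computes the best of the two alignments for an edge with both, one, or no endpoint among the vertices selected by $S$.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def edge_string(k, edge):
    """Edge string: '$', then k + 1 pads, endpoints at 1-based h + 2."""
    (h, i), (h2, i2) = edge
    e = ["$"] + ["o"] * (k + 1)
    e[h + 1], e[h2 + 1] = (h, i), (h2, i2)
    return tuple(e)


def best_distance(S, E):
    """Minimum Hamming distance over the windows of E of length len(S)."""
    length = len(S)
    return min(hamming(S, E[j:j + length]) for j in range(len(E) - length + 1))


k = 3
S = ("$", (1, 2), (2, 1), (3, 3))  # selects v_{1,2}, v_{2,1}, v_{3,3}

print(best_distance(S, edge_string(k, ((1, 2), (3, 3)))))  # 2 = k - 1, both endpoints selected
print(best_distance(S, edge_string(k, ((1, 2), (2, 2)))))  # 3 = k, one endpoint selected
print(best_distance(S, edge_string(k, ((1, 1), (2, 2)))))  # 3 = k, no endpoint selected
```

In the last case the shifted alignment alone would cost $k+1$, and the unshifted alignment brings the minimum back to $k$, which is why the lemma only distinguishes $k-1$ from $k$.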
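We can now conclude the proof. Let $S$ be an optimal solution of Consensus Patterns for the constructed instance and $K=\{v_{h,i_h}\mid h\in[k]\}$ its corresponding set of vertices. The distance from $S$ to the $N$ copies of strings $\mathcal{V}_1,\ldots,\mathcal{V}_n$ is $N(n-1)k$. The distance between $S$ and $\mathcal{E}_j$ is $k-1$ if both endpoints of $e_j$ are in $K$, and $k$ otherwise. Let $q$ be the number of edges with both endpoints in $K$: the total distance from $S$ to strings $\mathcal{E}_1,\ldots,\mathcal{E}_m$ is thus $mk-q$, and the total distance from $S$ to the whole instance is $N(n-1)k+mk-q$. Overall, the optimal distance is at most $N(n-1)k+mk-\binom{k}{2}$ if, and only if, $G$ contains a size-$k$ set of vertices $K$, one per color, with $q\geq\binom{k}{2}$, i.e. if $G$ contains a clique.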
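The whole argument can be exercised end to end on a toy graph. The Python sketch below rebuilds the instance, searches only over patterns of the form given by Lemma 1, and compares the optimum with the threshold $N(n-1)k+mk-\binom{k}{2}$. The encoding (`"o"` for the padding character, tuples `(h, i)` for vertices, endpoints at 1-based positions $h+2$) and the choice $N=m(k+2)+1$ in the usage lines are assumptions made for this illustration.

```python
from itertools import product
from math import comb


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def min_window_distance(S, T):
    length = len(S)
    return min(hamming(S, T[j:j + length]) for j in range(len(T) - length + 1))


def optimal_distance(k, n, edges, N):
    """Best total distance over patterns '$' v_{1,i_1} ... v_{k,i_k}."""
    vertex_strings = [
        ("$",) + tuple((h, i) for h in range(1, k + 1))
        for i in range(1, n + 1)
    ]
    edge_strings = []
    for (h, i), (h2, i2) in edges:
        e = ["$"] + ["o"] * (k + 1)
        e[h + 1], e[h2 + 1] = (h, i), (h2, i2)
        edge_strings.append(tuple(e))
    strings = vertex_strings * N + edge_strings
    best = None
    for ranks in product(range(1, n + 1), repeat=k):
        S = ("$",) + tuple((h, ranks[h - 1]) for h in range(1, k + 1))
        cost = sum(min_window_distance(S, T) for T in strings)
        best = cost if best is None else min(best, cost)
    return best


def threshold(k, n, m, N):
    return N * (n - 1) * k + m * k - comb(k, 2)


k, n = 3, 2
with_clique = [((1, 1), (2, 1)), ((2, 1), (3, 2)), ((1, 1), (3, 2)), ((1, 2), (2, 2))]
without_clique = with_clique[:2] + with_clique[3:]
for edges in (with_clique, without_clique):
    m = len(edges)
    N = m * (k + 2) + 1  # assumed choice of a "large" N
    print(optimal_distance(k, n, edges, N) <= threshold(k, n, m, N))
# prints True, then False
```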
References
- [1] Laurent Bulteau, Falk Hüffner, Christian Komusiewicz, and Rolf Niedermeier. Multivariate algorithmics for NP-hard string problems. Bulletin of the EATCS, 114, 2014.
- [2] Rodney G. Downey and Michael Ralph Fellows. Parameterized Complexity. Springer Science & Business Media, 2012.
- [3] Patricia A. Evans, Andrew D. Smith, and Harold T. Wareham. On the complexity of finding common approximate substrings. Theor. Comput. Sci., 306(1-3):407–430, 2003.
- [4] Markus L. Schmid. Finding consensus strings with small length difference between input and solution strings. In MFCS 2015, Part II, volume 9235 of LNCS, pages 542–554. Springer, 2015.
