Longest Common Subsequence on Weighted Sequences
Evangelos Kipouridis and Kostas Tsichlas

TL;DR
This paper advances the understanding of the Longest Common Subsequence problem on weighted sequences by providing efficient approximation schemes for bounded alphabets and establishing complexity bounds for unbounded alphabets.
Contribution
It introduces an EPTAS for bounded alphabets and proves hardness results for unbounded alphabets, closing the gap between upper and lower bounds.
Findings
EPTAS achieved for bounded alphabets
No EPTAS exists for unbounded alphabets unless FPT=W[1]
Lower bounds under ETH restrict PTAS improvements for unbounded alphabets
Abstract
We consider the general problem of the Longest Common Subsequence (LCS) on weighted sequences. Weighted sequences are an extension of classical strings, where in each position every letter of the alphabet may occur with some probability. Previous results presented a PTAS and noticed that no FPTAS is possible unless P=NP. In this paper we essentially close the gap between upper and lower bounds by improving both. First of all, we provide an EPTAS for bounded alphabets (which is the most natural case), and prove that there does not exist any EPTAS for unbounded alphabets unless FPT=W[1]. Furthermore, under the Exponential Time Hypothesis, we provide a lower bound which shows that no significantly better PTAS can exist for unbounded alphabets. As a side note, we prove that it is sufficient to work with only one threshold in the general variant of the problem.
Evangelos Kipouridis: Basic Algorithms Research Copenhagen (BARC), University of Copenhagen. ORCID: orcid.org/0000-0002-5830-5830. Funding: Thorup's Investigator Grant 16582, Basic Algorithms Research Copenhagen (BARC), from the VILLUM Foundation, and European Union's Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No 801199.
Kostas Tsichlas: School of Informatics, Aristotle University of Thessaloniki.
Copyright: Evangelos Kipouridis and Kostas Tsichlas.
Subject classification: Theory of computation → Approximation algorithms analysis; Theory of computation → W hierarchy; Theory of computation → Problems, reductions and completeness.
Published in: 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020), June 17–19, 2020, Copenhagen, Denmark. Editors: Inge Li Gørtz and Oren Weimann. Series volume 161, article no. 21.
Acknowledgements.
We would like to thank the anonymous reviewers for their careful reading of our paper and their many insightful comments and suggestions.
Keywords: WLCS, LCS, weighted sequences, approximation algorithms, lower bound
1 Introduction
1.1 General concepts
We consider the problem of determining the LCS (Longest Common Subsequence) on weighted sequences. Weighted sequences, also known as $p$-weighted sequences or Position Weighted Matrices (PWM) [3, 35], are probabilistic sequences which extend the notion of strings, in the sense that in each position there is some probability for each letter of an alphabet to occur there.
Weighted sequences were introduced as a tool for motif discovery and local alignment and are extensively used in molecular biology [23]. They have been studied both in the context of short sequences (binding sites, sequences resulting from multiple alignment, etc.) and on large sequences, such as complete chromosome sequences that have been obtained using a whole-genome shotgun strategy [31, 36]. Weighted sequences are able to keep all the information produced by such strategies, while classical strings impose restrictions that oversimplify the original data.
Basic concepts concerning the combinatorics of weighted sequences (like pattern matching, repeats discovery and cover computation) were studied using weighted suffix trees [26], Crochemore’s partitioning [9, 11, 18], the Karp-Miller-Rabin algorithm [18], and other approaches [42, 29]. Other interesting results include approximate and gapped pattern matching [6, 40, 33], online pattern matching [16], weighted indexing [2, 10], swapped matching [39], the all-covers and all-seeds problem [38, 41], extracting motifs [28], and the weighted shortest common supersequence problem [4, 17]. There are also some more practical results on mapping short weighted sequences to a reference genome [7] (also studied in the parallel setting [27]), as well as on the reporting version of the problem which we also consider in this paper [11].
The Longest Common Subsequence (LCS) problem is a well-known measure of similarity between two strings. Given two strings, the output should be the length of the longest subsequence common to both strings. Dynamic programming solutions [25, 37] for this problem are classical textbook algorithms in Computer Science. LCS has been applied in computational biology for measuring the commonality of DNA molecules or proteins which may yield similar functionality. A very interesting survey on algorithms for the LCS can be found in [13]. The current LCS algorithms are considered optimal, since matching lower bounds (under the Strong Exponential Time Hypothesis) were proven [1, 14].
Extensions of this problem on more general structures have also been investigated (trees and matrices [5], run-length encoded strings [8], and more). One interesting variant of the LCS is the Heaviest Common Subsequence (HCS), where the matching between different letters is assigned a different weight, and the goal is to maximize the weight of the common subsequence, rather than its length.
1.2 Weighted LCS
The problem studied in this paper is the weighted LCS (WLCS) problem. It was introduced by Amir et al. [3] as an extension of the classical LCS problem on weighted sequences. Given two weighted sequences, the goal is to find a longest string which has a high probability of appearing in both sequences. Amir et al. initially solved an easier version of this problem in polynomial time, but unfortunately its applications are limited. As far as the general problem is concerned, they hinted at its NP-Hardness by giving an NP-Hardness result on a closely related problem, which they call the log-probability version of WLCS. In short, the problem is the same, but all products in its definition are replaced with sums. Their proof is based on a Turing reduction and only works for unbounded alphabets. Finally, Amir et al. provide a $1/|\Sigma|$-approximation algorithm for the WLCS problem.
Cygan et al. [19] strengthened the evidence that WLCS is NP-Hard by providing an NP-Completeness result on the decision log-probability version of WLCS (informally introduced in the previous paragraph), already for alphabets of size $2$, using a Karp reduction; for alphabets of size $1$ the solution is trivial since there is no uncertainty. They also gave a $1/2$-approximation algorithm and a PTAS, while also noticing that an FPTAS cannot exist, assuming WLCS is indeed NP-Hard, as hinted by their evidence, and that P $\neq$ NP. Finally, they proved that every instance of the problem can be reduced to a more restricted class of instances. However, for this to be achieved their algorithm needs to perform exact computations of roots and logarithms, which may cause the algorithm to err.
Finally, it is worth noting that Charalampopoulos et al. [17] proved that unless P=NP, WLCS cannot be solved in $O(n^{f(\alpha)})$ time, for any function $f(\alpha)$, where $\alpha$ is the cut-off probability. We note that this result concerns exact computations rather than approximations.
1.3 Our results
In this paper we essentially close the gap between upper and lower bounds for WLCS by improving both; we prove that the problem is indeed NP-Hard even for alphabets of size $2$. Furthermore, we provide an EPTAS for bounded alphabets. These two results, along with the observation by Cygan et al., completely characterize the complexity of WLCS for bounded alphabets. For unbounded alphabets, a PTAS was already given by Cygan et al. [19]. We show matching lower bounds, both by ruling out the possibility of an EPTAS, and by showing that, under the Exponential Time Hypothesis, no significantly better PTAS can exist. We also prove that every instance of WLCS can be reduced to a restricted class of instances without using roots and logarithms, thus being able to actually achieve exact computations without rounding errors that can make the algorithm err.
As noted in the previous paragraph, apart from essentially closing the gap between hardness results and faster algorithms we also circumvent the need to work with roots and logarithms as the previous results did. In short, by taking advantage of the property that and setting to be an appropriate logarithm, previous results transformed any instance to a more manageable form. However, this transformation introduces an error that may make the algorithm err as noted in Appendix A. Table 1 summarizes the above discussion. Table 2 summarizes our results depending on the alphabet-size.
A short discussion is in order with respect to what new insights on weighted LCS enabled us to achieve progress. Our most crucial observation is the fact that the problem behaves differently in the natural case of a bounded alphabet, and in the case of an unbounded alphabet. Without this distinction, closing the gap between upper and lower bounds was unlikely. This is because, on the one hand, no EPTAS for the general case could be found, as none existed. On the other hand, proving that no EPTAS exists requires reductions that work only on unbounded alphabets. The aforementioned distinction is what enabled us to understand that modifying the existing reductions, which work for alphabets of size $2$, would be futile in proving W[1]-Hardness.
Furthermore, it was crucial to identify that working with products is the core difficulty in proving NP-Hardness of weighted LCS. The introduction of the log-probability version of the weighted LCS reflects the assumption that the difference between working with sums and working with products is just a technicality. After [3] and [19] successfully proved NP-Hardness for the log-probability version, it was natural to attempt reducing from it for proving NP-Hardness of the weighted LCS problem. Despite the apparent similarities between the two problems, their difference did not allow us to craft such a reduction. For the same reason, Cygan et al. used a model that assumed infinite precision computations with reals, while we are able to avoid such a strong assumption.
1.4 Organization of the paper
The rest of the paper is organized as follows. In Section 2, we provide necessary definitions and discuss the model of computation. In Section 3, we show that WLCS is NP-Complete, while in Section 4 we provide the EPTAS for bounded alphabets, which is also an improved PTAS for unbounded alphabets. In Section 5, we show that there can be no EPTAS for unbounded alphabets by showing that this problem is W[1]-hard, and in Section 6 we describe the matching conditional lower bound. We conclude in Section 7.
For clarity purposes, some proofs and technical discussions are moved to the Appendix. More specifically, in Appendix A we present an algorithm that transforms any instance of our problem to an equivalent, but easier to handle, instance. We also show that the rounding errors introduced by working with reals (logarithms and roots) may cause a similar algorithm by Cygan et al. [19] to err if standard rounding is used.
2 Preliminaries
2.1 Basic Definitions
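Let $\Sigma$ be a finite alphabet. We deal with both bounded and unbounded alphabets. $\Sigma^d$ denotes the set of all words of length $d$ over $\Sigma$. $\Sigma^*$ denotes the set of all words over $\Sigma$.
Definition 2.1 (Weighted Sequence).
A weighted sequence $X$ of length $|X| = n$ over $\Sigma$ is a sequence of functions $p_1^{(X)}, \ldots, p_n^{(X)}$, where each function assigns a probability to each letter from $\Sigma$. We thus have $\sum_{\sigma \in \Sigma} p_i^{(X)}(\sigma) = 1$ for all $i$, and $p_i^{(X)}(\sigma) \ge 0$ for all $\sigma \in \Sigma$ and all $i$.
By $WS(\Sigma)$ we denote the set of all weighted sequences over $\Sigma$. Let $X \in WS(\Sigma)$. Let $Seq_d^{|X|}$ be the set of all increasing sequences of $d$ positions in $X$. For a string $s \in \Sigma^d$ and $\pi \in Seq_d^{|X|}$, define $P_X(\pi, s)$ as the probability that the subsequence on positions corresponding to $\pi$ in $X$ equals $s$. More formally, if $\pi = (i_1, \ldots, i_d)$ and $s_k$ denotes the $k$-th letter of $s$, then
$$P_X(\pi, s) = \prod_{k=1}^{d} p_{i_k}^{(X)}(s_k).$$
Denote
$$SUBS(X, \alpha) = \{\, s \in \Sigma^* \mid \exists\, \pi \in Seq_{|s|}^{|X|} \text{ such that } P_X(\pi, s) \ge \alpha \,\}.$$
That is, $SUBS(X, \alpha)$ is the set of deterministic strings which match a subsequence of $X$ with probability at least $\alpha$. Every $s \in SUBS(X, \alpha)$ is called an $\alpha$-subsequence of $X$.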
Let us give a clarifying example. If and is a long weighted sequence, where in each position the probability for each letter to appear is , then does not contain , as, for any increasing subsequence of positions, the probability of appearing is .
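The definitions above can be made concrete with a small sketch. It assumes a hypothetical representation (not taken from the paper) in which a weighted sequence is a list of dictionaries mapping letters to exact `Fraction` probabilities; the function names are illustrative, and the membership test is a brute force over all increasing position sequences, meant only to mirror the definition.

```python
from fractions import Fraction
from itertools import combinations
from math import prod

def subsequence_probability(X, positions, s):
    """Product of the probabilities of s[k] occurring at position positions[k] of X."""
    return prod((X[i].get(c, Fraction(0)) for i, c in zip(positions, s)),
                start=Fraction(1))

def is_alpha_subsequence(X, s, alpha):
    """Brute force: does some increasing sequence of positions match s with probability >= alpha?"""
    return any(subsequence_probability(X, pi, s) >= alpha
               for pi in combinations(range(len(X)), len(s)))

# Hypothetical toy instance: five positions, two equally likely letters in each.
X = [{"a": Fraction(1, 2), "b": Fraction(1, 2)} for _ in range(5)]
print(is_alpha_subsequence(X, "aa", Fraction(1, 4)))   # every pair of positions gives 1/4
print(is_alpha_subsequence(X, "aaa", Fraction(1, 4)))  # every triple gives only 1/8
```

Exact rationals are used instead of floats so that the comparison with the threshold is not affected by rounding.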
The decision problem we consider is the following:
Definition 2.2 ($(\alpha_1, \alpha_2)$-WLCS decision problem).
Given two weighted sequences $X, Y$, two cut-off probabilities $\alpha_1, \alpha_2$ and a number $k$, find if the longest string contained in $SUBS(X, \alpha_1) \cap SUBS(Y, \alpha_2)$ has length at least $k$.
Naturally, the respective optimization problem is the following:
Definition 2.3 ($(\alpha_1, \alpha_2)$-WLCS optimization problem).
Given two weighted sequences $X, Y$, and two cut-off probabilities $\alpha_1, \alpha_2$, find the length of the longest string contained in $SUBS(X, \alpha_1) \cap SUBS(Y, \alpha_2)$.
Both in the decision and the optimization version, the WLCS problem is the $(\alpha_1, \alpha_2)$-WLCS problem, where $\alpha_1 = \alpha_2$. We denote these (equal) probabilities by $\alpha$ ($\alpha = \alpha_1 = \alpha_2$) for concreteness.
Let us note that the problem is only interesting if $|\Sigma| > 1$. For $|\Sigma| = 1$ the problem is trivial since there is no uncertainty at all. The same letter appears in every position in both strings with probability $1$, and thus the answer is simply the length of the shorter weighted sequence.
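For very small instances, the optimization problem can be solved by directly following its definition. The sketch below is an exponential-time brute force, not the algorithm of this paper; the list-of-dictionaries representation and all function names are assumptions made for illustration. It relies on the observation that dropping a letter from a string never lowers its matching probability, so the search can stop at the first length for which no common string exists.

```python
from fractions import Fraction
from itertools import combinations, product
from math import prod

def in_subs(X, s, alpha):
    """True if s matches some increasing position sequence of X with probability >= alpha."""
    return any(prod((X[i].get(c, 0) for i, c in zip(pi, s)), start=Fraction(1)) >= alpha
               for pi in combinations(range(len(X)), len(s)))

def wlcs_bruteforce(X, Y, alpha1, alpha2, alphabet):
    """Length of the longest string that is an alpha1-subsequence of X and an alpha2-subsequence of Y."""
    best = 0
    for length in range(1, min(len(X), len(Y)) + 1):
        found = any(in_subs(X, s, alpha1) and in_subs(Y, s, alpha2)
                    for s in product(alphabet, repeat=length))
        if not found:
            break  # no common string of this length, hence none of any larger length
        best = length
    return best

# Hypothetical toy instance.
X = [{"a": Fraction(1, 2), "b": Fraction(1, 2)} for _ in range(3)]
Y = [{"a": Fraction(1)}, {"b": Fraction(1)}]
print(wlcs_bruteforce(X, Y, Fraction(1, 4), Fraction(1, 4), "ab"))
```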
Finally, let us also state that the Log-Probability version of the WLCS, studied in previous papers, is the same as the original WLCS if we replace by .
2.2 Model of Computation
Our model of computation is the standard word RAM, introduced by Fredman and Willard [20] to simulate programming languages like C. The word size is $w \ge \log n$, where $n$ is the input size in bits, so as to allow random access indexing of the whole input. Thus, arithmetic operations between words take constant time. However, due to the nature of our problem, it is necessary to compute products of many numbers. This can produce numbers that are much larger than the word size. We even allow numbers in the input to be larger than $2^w$ (these numbers just need to use more than one word to be represented). We generally assume that each number in the input is represented by at most $B$ bits, but we do not pose any constraint on $B$ other than the trivial one that $B \le n$. Of course, in cases where we deal with numbers that occupy many words, we no longer have unit-cost arithmetic operations; we guarantee, however, that our results only use linear or near-linear time operations (like comparisons and multiplications) on numbers polynomial in the input size. Thus, although we do not enjoy the unit-cost assumption for arbitrary numbers, we always stay within the polynomial-time regime.
2.3 Basic Operations
In this subsection we discuss the multiplication of two -bit input numbers in (polynomial) time, where is the word-size. For example, for integers there exists a multiplication algorithm by Harvey and van der Hoeven [24] with time complexity (generally the running time can also depend on , although in this case it does not). Let us notice that although the result is unpublished yet, we use it due to its easy to read time complexity; it is trivial to use other algorithms instead, like the one from Fürer [21], or the more practical one by Schönhage and Strassen [34]. We establish the complexity of multiplying -bit numbers. Our divide and conquer algorithm splits the numbers into two (equal sized) groups, recursively multiplies each, and multiplies the results in time. By a direct application of the Master Theorem by Bentley et al. [12] we prove the following lemma.
Lemma 2.4.
Multiplying -bit numbers costs
- time if for some constant ,
- else if for some constant ,

assuming that is a polynomial time algorithm that multiplies two -bit numbers.
Proof 2.5.
The algorithm simply splits the numbers in two equal-sized groups, recursively multiplies each, and then multiplies the results. Let . We have that the running time for multiplying -bit numbers is . Since , and , the Master Theorem [12] gives two cases. Either for some constant , in which case , or else for some constant (such a constant exists since we assume polynomial time multiplications). In this case, since it holds that , we get that if . Notice that we handled all cases, since is handled by the first case with , and whatever does not fit in the first case, definitely fits in the second, since we assumed that is polynomial in .
Corollary 2.6.
Multiplying -bit numbers costs polynomial time by using any polynomial time algorithm for multiplying two -bit numbers as a black box. Especially if we use Harvey and Van Der Hoeven’s algorithm, the time cost is .
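The divide-and-conquer scheme described in the proof above can be sketched as follows. This is only an illustration: Python's built-in arbitrary-precision integer multiplication stands in for the black-box two-number multiplication algorithm, and the function name is hypothetical.

```python
def product_tree(nums):
    """Multiply a list of integers by splitting it into two halves,
    multiplying each half recursively, and multiplying the two results."""
    if not nums:
        return 1
    if len(nums) == 1:
        return nums[0]
    mid = len(nums) // 2
    return product_tree(nums[:mid]) * product_tree(nums[mid:])

print(product_tree([2, 3, 5, 7, 11]))
```

Splitting into halves keeps the two operands of each multiplication at comparable sizes, which is the setting in which fast multiplication algorithms pay off.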
Let us also notice that the way to divide two such numbers is simply storing both the numerator and the denominator. Comparing two numbers $a/b$ and $c/d$ can be done by comparing $ad$ and $cb$. The only other operation we need when working with such fractions is subtracting a number $a/b$ from $1$. This is simply $(b-a)/b$.
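A minimal sketch of these fraction operations, assuming each probability is kept as a pair of non-negative integers (numerator, denominator) with a positive denominator; the function names are hypothetical.

```python
def frac_leq(a, b, c, d):
    """Decide whether a/b <= c/d using integer multiplications only (assumes b, d > 0)."""
    return a * d <= c * b

def one_minus(a, b):
    """Return 1 - a/b as a (numerator, denominator) pair, without reducing it."""
    return b - a, b

print(frac_leq(1, 3, 2, 5))   # compares 1*5 with 2*3
print(one_minus(3, 10))
```

No division or rounding takes place, so every comparison against the threshold is exact.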
3 NP-Completeness
An NP-Completeness proof for the integer log-probability version of the WLCS problem has been given in [19]. This is a closely related problem, with the main difference being that products are replaced with sums. We do not know of any way to reduce from this log-probability version to WLCS other than exponentiating. As stated in the explanation of our model of computation in Section 2, there is no limit on the number of bits needed to represent a single number (it just occupies a lot of words). This means that, if the input consisted of bits, and there was a number (probability) represented with bits, exponentiating would result in a number with bits, meaning the reduction would not be a polynomial-time one. For this reason, we believe that although it is easier to prove NP-Completeness for the integer log-probability version of the problem, there is no easy way to use it for proving NP-Completeness for the general version. We, thus, give a reduction from the NP-Complete problem Subset Product [22] which proves NP-Completeness directly for the general problem.
Notice that for alphabets consisting of one letter, the problem is trivial since there is no uncertainty at all. In the following, we prove that even for alphabets consisting of two letters, the problem is NP-Complete.
Definition 3.1 (Subset Product).
Given a set $L$ of integers and an integer $P$, find if there exists a subset of the numbers in $L$ with product $P$.
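To make the source problem of the reduction concrete, here is an exponential-time brute-force sketch of Subset Product. It is only meant to illustrate the definition; the function name is hypothetical, and restricting the search to non-empty subsets is an assumption of this sketch.

```python
from itertools import combinations
from math import prod

def subset_product(L, P):
    """Return a non-empty tuple of elements of L whose product is P, or None if there is none."""
    for size in range(1, len(L) + 1):
        for subset in combinations(L, size):
            if prod(subset) == P:
                return subset
    return None

print(subset_product([2, 3, 5, 7], 35))
print(subset_product([2, 3, 5], 7))
```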
Lemma 3.2.
WLCS is NP-Complete, even for alphabets of size $2$.
Proof 3.3.
Obviously WLCS $\in$ NP, since the increasing subsequences $\pi_1, \pi_2$ and the string $s$ for which $P_X(\pi_1, s) \ge \alpha$ and $P_Y(\pi_2, s) \ge \alpha$ are a certificate which, along with the input, can be used to verify in polynomial time that the problem has a solution.
Let $(L, P)$ be an instance of Subset Product and let $n = |L|$. By $L_i$ we denote the $i$-th number of the set $L$, assuming any fixed ordering of the numbers of $L$. We give a polynomial-time reduction to a $(X, Y, \alpha, k)$ instance of WLCS, with alphabet size $2$ (we call the letters $a$ and $b$).
The core idea is the following: The weighted sequences have $n$ positions (plus $2$ more for technical reasons related to the threshold $\alpha$). The number $k$ is equal to the length of the sequences, meaning that we pick every position, and the only question is whether we picked letter $a$ or letter $b$. Letter $a$ in position $i$ corresponds to picking the $i$-th number in the original Subset Product, while letter $b$ corresponds to not picking it. Finally, the letters picked in $X$ form an inequality of the form: "some product is $\le P$", while the same letters in $Y$ form the inequality: "the same product is $\ge P$". For these two to hold simultaneously, it must be the case that we found some product equal to $P$, which is the goal of the original Subset Product.
More formally, the weighted sequences have size . Let and .
[TABLE]
where for all , and similarly for . Notice that, in particular, and . Finally, we set and .
First of all, notice that since we must find a string of length , we must choose a letter from every position. Thus, a choice of letter at some position on corresponds to the same choice of letter at that position on . The choice of letter on positions and is in both cases since
[TABLE]
Suppose that the numbers at positions give product :
[TABLE]
Then, we form the string by picking at positions and at all other positions. Thus
[TABLE]
[TABLE]
Conversely, suppose a solution for the WLCS problem, where the string is formed by picking at positions and at all other positions. It holds that:
[TABLE]
[TABLE]
The above imply that . Finally, notice that all computations are done in polynomial time, due to Corollary 2.6.
4 EPTAS for Bounded Alphabets, Improved PTAS for Unbounded Alphabets
We now give an Efficient Polynomial Time Approximation Scheme (EPTAS) for the case where our alphabet size is bounded ($|\Sigma| = O(1)$). Let us notice that this is the case when working with DNA sequences ($|\Sigma| = 4$), the most usual application of weighted sequences. The same algorithm is an improved PTAS (when compared to [19]) in the case of unbounded alphabets. This means that the WLCS problem is Fixed-Parameter Tractable for constant size alphabets and thus belongs to the corresponding complexity class FPT, as shown in Corollary 4.6.
The authors in [19] first noted that there is no FPTAS unless P=NP, and so we can only hope for an EPTAS. Our result relies on their following result:
Lemma 4.1 (Lemma 4.6 of [19]).
It is possible to find, in polynomial time, a solution of size $d$ to the WLCS optimization problem such that the optimal value is guaranteed to be either $d$ or $d+1$ (however we do not know which one holds).
Their PTAS uses the above result and in case the approximation is guaranteed to be good enough ($d/(d+1) \ge 1-\epsilon$, which implies that $d \ge 1/\epsilon - 1$), it stops. Else, it holds that $d < 1/\epsilon - 1$, and the PTAS exhaustively searches all subsequences of $X$, all subsequences of $Y$, and all possible strings of length $d+1$, for a total complexity of
[TABLE]
is the time needed to multiply numbers with at most -bits each, by Lemma 2.4, and is insignificant compared to the other terms. Our EPTAS improves the exhaustive search part to
[TABLE]
which is polynomial in the input size, in case of bounded alphabets. The following lemma is needed.
Lemma 4.2.
Given a weighted sequence $X$ of length $n$, and a string $s$ of length $d$, it is possible to find the maximum number $\alpha$ such that there exists an increasing subsequence $\pi$ of length $d$ for which $P_X(\pi, s) = \alpha$. The running time of the algorithm is , where $B$ is the maximum number of bits needed to represent each probability in $X$.
Proof 4.3.
We use dynamic programming. Let $s^{(j)}$ be the string formed by the first $j$ letters of $s$, $s_j$ be the $j$-th letter of $s$ and $T(i, j)$ be the maximum number such that there exists an increasing subsequence $\pi$ of length $j$ whose last term is at most $i$ and for which $P_X(\pi, s^{(j)}) = T(i, j)$. Since we choose whether $s_j$ is picked from the $i$-th position of $X$, it holds that:
$$T(i, j) = \max\{\, T(i-1, j),\; T(i-1, j-1) \cdot p_i^{(X)}(s_j) \,\}.$$
For the base cases, $T(i, 0) = 1$ for all $i$ (we can always form the empty string with certainty, by not picking anything), and $T(0, j) = 0$ for $j > 0$ (not picking anything never gives us a non-empty string). We are interested in the value $T(n, d)$.
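The dynamic program of this proof can be sketched as follows. The sketch assumes a weighted sequence is given as a list of dictionaries from letters to exact `Fraction` probabilities (a hypothetical representation), and it keeps a single row of the table, updated from right to left, instead of the full two-dimensional table.

```python
from fractions import Fraction

def max_match_probability(X, s):
    """Largest probability with which s matches some increasing sequence of positions of X."""
    d = len(s)
    # T[j]: best probability of matching the first j letters of s with the positions seen so far.
    T = [Fraction(1)] + [Fraction(0)] * d
    for x in X:
        # Descending j, so that T[j - 1] still refers to the previous position.
        for j in range(d, 0, -1):
            T[j] = max(T[j], T[j - 1] * x.get(s[j - 1], Fraction(0)))
    return T[d]

# Hypothetical toy instance.
X = [{"a": Fraction(9, 10), "b": Fraction(1, 10)},
     {"a": Fraction(1, 5), "b": Fraction(4, 5)},
     {"a": Fraction(1, 2), "b": Fraction(1, 2)}]
print(max_match_probability(X, "ab"))  # best choice: positions 1 and 2
print(max_match_probability(X, "ba"))  # best choice: positions 2 and 3
```

A string $s$ is then an $\alpha$-subsequence of $X$ exactly when the returned value is at least $\alpha$.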
Now we are ready to give our EPTAS.
Theorem 4.4.
For any value $\epsilon > 0$ there exists a $(1-\epsilon)$-approximation algorithm for the WLCS problem which runs in time and uses space, where $n$ is the input size, and $B$ is the maximum number of bits needed to represent a probability in $X$ and $Y$. Consequently, the WLCS problem admits an EPTAS for bounded alphabets.
Proof 4.5.
We begin by using Lemma 4.1 to find an $\alpha$-subsequence of length $d$, such that the optimal solution is at most $d+1$. If $d \ge 1/\epsilon - 1$, we are done, since in that case we have a $(1-\epsilon)$ approximation. Otherwise, we try all possible strings $s \in \Sigma^{d+1}$, and use Lemma 4.2 to check if any one of them can appear in both weighted sequences with probability at least $\alpha$.
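The exhaustive-search step of this proof can be sketched as below. The sketch assumes that the length $d$ has already been obtained from Lemma 4.1 (that step is not implemented here), reuses the dynamic program of Lemma 4.2, and uses a hypothetical list-of-dictionaries representation with exact `Fraction` probabilities. All names are illustrative.

```python
from fractions import Fraction
from itertools import product

def max_match_probability(X, s):
    """Dynamic program of Lemma 4.2: best probability of s over increasing positions of X."""
    T = [Fraction(1)] + [Fraction(0)] * len(s)
    for x in X:
        for j in range(len(s), 0, -1):
            T[j] = max(T[j], T[j - 1] * x.get(s[j - 1], Fraction(0)))
    return T[len(s)]

def exhaustive_step(X, Y, alpha, alphabet, d):
    """Try every string of length d + 1; return one that is an alpha-subsequence
    of both X and Y, or None if no such string exists."""
    for s in product(alphabet, repeat=d + 1):
        if max_match_probability(X, s) >= alpha and max_match_probability(Y, s) >= alpha:
            return "".join(s)
    return None

# Hypothetical toy instance.
X = [{"a": Fraction(1, 2), "b": Fraction(1, 2)} for _ in range(3)]
Y = [{"a": Fraction(1)}, {"b": Fraction(1)}]
print(exhaustive_step(X, Y, Fraction(1, 4), "ab", 1))
```

The loop enumerates $|\Sigma|^{d+1}$ candidate strings and never enumerates position sequences, which is where the saving over the earlier exhaustive search comes from.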
Corollary 4.6.
WLCS $\in$ FPT for bounded alphabets, parameterized by the solution length.
Proof 4.7.
Follows directly from [30], Proposition 2.
5 No EPTAS for Unbounded Alphabets
We have already seen that there is no FPTAS for WLCS, even for alphabets of size $2$, unless P=NP. We have also shown an EPTAS for bounded alphabets and a PTAS for unbounded alphabets. The natural question that arises is: Is it possible to give an EPTAS for unbounded alphabets?
We answer this question negatively, by proving that WLCS is W[1]-hard, meaning that it does not admit an EPTAS (and is in fact not even in FPT) unless FPT=W[1] ([30], Corollary ). To show this, we give a $2$-step FPT-reduction from Perfect Code, which was shown to be W[1]-Complete in [15], to $k$-sized Subset Product and then to WLCS. The $k$-sized Subset Product problem is the Subset Product problem with the additional constraint that the target subset must be of size $k$.
Definition 5.1 (Perfect Code).
A perfect code is a set of vertices $V' \subseteq V$ with the property that for each vertex $v \in V$ there is precisely one vertex in $N[v] \cap V'$, where $N[v]$ is the set of adjacent nodes of $v$ in $G$ (including $v$ itself).
In the perfect code problem, we are given an undirected graph $G = (V, E)$ and a positive integer $k$, and we need to decide whether $G$ has a $k$-element perfect code. Notice that the definition of a perfect code implies that there is a perfect code iff there is a set for which and for all . First we show that $k$-sized Subset Product is W[1]-hard.
Lemma 5.2.
$k$-sized Subset Product is W[1]-hard.
Proof 5.3.
Let $(G, k)$ be an instance of Perfect Code. Suppose that the vertices are $v_1, \ldots, v_n$. First of all, we compute the first $n$ prime numbers using the Sieve of Eratosthenes. We denote the $i$-th prime number as $p_i$. The set of positive integers $L$ as well as the positive integer $P$ are defined as follows, where $N[v_i]$ denotes the closed neighbourhood of $v_i$:
$$L = \Big\{ \prod_{v_j \in N[v_i]} p_j \;:\; 1 \le i \le n \Big\}, \qquad P = \prod_{i=1}^{n} p_i.$$
Notice that due to the unique prime factorization theorem, a subset of $k$ numbers from the set $L$ have product $P$ iff $G$ has a $k$-element Perfect Code.
The size of our $n$ primes is $O(n \log n)$ due to the prime number theorem. Thus, they require $O(\log n)$ bits to be represented. Each integer in $L$, as well as $P$, is computed using Corollary 2.6 in time, for an overall complexity for our reduction. Since the new parameter $k$ is the same as the old one (no dependence on $n$), our reduction is in fact an FPT-reduction.
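The reduction of this proof can be sketched in a few lines. The sketch assumes the natural reading of the construction, in which vertex $v_i$ contributes the product of the primes assigned to its closed neighbourhood and the target is the product of all $n$ primes; the adjacency-set input format and the function names are hypothetical.

```python
from math import isqrt, prod

def first_primes(n):
    """First n primes via the Sieve of Eratosthenes, doubling the bound until enough are found."""
    bound = 16
    while True:
        sieve = bytearray([1]) * (bound + 1)
        sieve[0:2] = b"\x00\x00"
        for i in range(2, isqrt(bound) + 1):
            if sieve[i]:
                sieve[i * i::i] = bytearray(len(range(i * i, bound + 1, i)))
        primes = [i for i, is_prime in enumerate(sieve) if is_prime]
        if len(primes) >= n:
            return primes[:n]
        bound *= 2

def perfect_code_to_subset_product(adj, k):
    """adj[i] is the set of neighbours of vertex i (0-indexed, undirected graph).
    Returns (L, P, k): one number per vertex, and the target product."""
    n = len(adj)
    p = first_primes(n)
    # Vertex i contributes the product of the primes of its closed neighbourhood.
    L = [prod(p[j] for j in adj[i] | {i}) for i in range(n)]
    P = prod(p)
    return L, P, k

# Path on four vertices 0-1-2-3; {0, 3} is a 2-element perfect code.
L, P, k = perfect_code_to_subset_product([{1}, {0, 2}, {1, 3}, {2}], 2)
print(L, P, L[0] * L[3] == P)
```

A choice of $k$ numbers has product $P$ exactly when every prime appears once among the chosen closed neighbourhoods, that is, when every vertex is dominated by exactly one chosen vertex.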
Our result for this section is the following.
Theorem 5.4.
WLCS, parameterized by the length of the solution, is W[1]-hard.
Proof 5.5.
To prove the theorem we create diagonal weighted sequences. That is, we require each letter to appear only in one position and vice-versa. In this way, the subsequences picked for $X$ and $Y$ are the same. The above rule is broken by the addition of two auxiliary letters that are there to make the probabilities add up to $1$ in each position. This creates no problem because we make sure that these letters are never picked. Finally, we force the product to be equal to our target, by forcing it to be at most our target and at least our target at the same time.
More formally, let be an instance of the -sized Subset Product problem and let , where is the maximum number in set . Notice that if then we only need to check the product of the highest numbers of , which means the problem is solvable in polynomial time. Thus we can assume that . The alphabet of is and we set .
[TABLE]
All non-specified probabilities are equal to 0. Notice that symbols and are used to guarantee that probabilities sum up to .
We show that the instance has a solution iff has a solution. Suppose there exists a solution to . Then, there exists an increasing subsequence such that . Let be extended by the number and be the string . It holds that .
Conversely, suppose there exists a solution to . Then there exist increasing subsequences and a string such that . First of all, notice that, due to for all , does not contain letters and , which leaves only one choice for every position. Also each letter appears only once in each sequence, and in the same position. Thus, , and due to our construction the -th letter of is the -th member of . Finally, not picking position would result in due to the fact that . Thus, the last letter of is . It holds that:
[TABLE]
[TABLE]
The above two inequalities imply a $k$-sized subset of $L$ with product equal to $P$.
The reduction is a polynomial-time one, due to Corollary 2.6. More than that, it is an FPT-reduction since the new parameter is equal to the old parameter incremented by one, and thus has no dependence on $n$.
6 Matching Conditional Lower Bound on any PTAS
In the $k$-SUM problem, we are given $n$ numbers and need to decide whether there exists a $k$-tuple that sums to zero. Patrascu and Williams [32] proved that any algorithm for solving the $k$-SUM problem requires $n^{\Omega(k)}$ time, unless the Exponential Time Hypothesis (ETH) fails. To show this, they first proved a hardness result for a variant of 3-SAT, the sparse 1-in-3 SAT.
Definition 6.1 (Sparse 1-in-3 SAT).
Given a boolean formula with $n$ variables and $m$ clauses in 3-CNF form, where each variable appears in a constant number of clauses, determine whether there exists an assignment of the variables such that each clause is satisfied by exactly one variable.
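The acceptance condition of 1-in-3 SAT can be illustrated with a small checker. The encoding is an assumption of this sketch: clauses are triples of non-zero integers in the DIMACS style (a negative number denotes a negated variable), and the assignment is a dictionary from variable index to truth value.

```python
def satisfies_one_in_three(clauses, assignment):
    """True if every clause contains exactly one true literal under the assignment."""
    def literal_is_true(lit):
        return assignment[abs(lit)] == (lit > 0)
    return all(sum(literal_is_true(lit) for lit in clause) == 1 for clause in clauses)

clauses = [(1, 2, 3), (1, -2, 4)]
print(satisfies_one_in_three(clauses, {1: False, 2: True, 3: False, 4: False}))
print(satisfies_one_in_three(clauses, {1: True, 2: True, 3: False, 4: False}))
```

The second assignment would satisfy the formula as ordinary 3-SAT, but it makes two literals of the first clause true, so it is rejected here.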
They first prove the following hardness result under ETH.
Proposition 6.2.
Under ETH, there is an (unknown) constant $s_3$ such that there exists no algorithm to solve sparse 1-in-3 SAT in $O(2^{\delta n})$ time for $\delta < s_3$.
By assuming an $n^{o(k)}$ time algorithm for $k$-SUM they disproved the above fact, which cannot happen under ETH. We use the same technique for proving an $n^{\Omega(k)}$ lower bound for $k$-sized Subset Product.
Lemma 6.3.
Assuming the ETH, the problem of $k$-sized Subset Product cannot be solved in $n^{o(k)}$ time on instances satisfying and each number in the input set has bits, where is the size of , and is the target which can be arbitrarily big.
Proof 6.4.
Let be a sparse 1-in-3 SAT instance with variables and clauses, and . Conceptually, we split the variables of into blocks of equal size - apart from the last block that may have smaller size. Each block contains at most variables, and thus there are at most different assignments of values to the group-of-variables within a block. For each block and for each one of these assignments we generate a number which serves as an identifier of the corresponding block and assignment. Thus, there are different identifiers.
Let be the -th prime number. In order to compute an identifier, we initialize it to , where is the index of the identifier’s corresponding block. Then, we run through all of the clauses and do the following: suppose we process the -th clause and let be the number of variables of the identifier’s corresponding assignment that satisfy the clause. We update the identifier by multiplying it with .
Since each variable appears only in a constant number of clauses, each identifier is a product of numbers. The prime number theorem guarantees bits to represent each factor, which means the identifiers have bits. Using the fact that , each identifier is represented by bits.
These identifiers, along with the target (recall that is the -th prime number), form a -sized Subset Product instance. This preprocessing step costs time, ignoring polynomial terms, which is more efficient than .
Due to the unique prime factorization, a solution to the -sized Subset Product corresponds to a solution in and vice-versa. If the running time of the -sized Subset Product was then we could solve the above instance in time.
Since and , it follows that . But , which means .
Thus the previous running time becomes . Both the preprocessing step and the solution of the -sized Subset Product can be achieved in time , where . However, this would violate Proposition 6.2.
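The identifier construction in the proof above can be illustrated with a small sketch. The exact initial value and multiplier are not legible in the text, so the code assumes one natural reading: the identifier of block b starts at the b-th prime, and processing the j-th clause multiplies it by the (number of blocks + j)-th prime raised to the number of satisfied literals. The target is assumed to be the product of all these primes. The function names (`first_primes`, `build_subset_product`, `has_solution`) and the signed-integer clause encoding are hypothetical, and the brute-force check is only there to make the correspondence visible on tiny instances.

```python
from itertools import combinations, product
from math import prod


def first_primes(count):
    """Return the first `count` primes by trial division (fine for tiny demos)."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes


def build_subset_product(num_vars, clauses, k):
    """Build identifiers and target from a 1-in-3 SAT instance (assumed encoding).

    clauses: tuples of nonzero ints; literal v refers to variable |v|
    (1-indexed) and is negated when v < 0.
    Returns (identifiers, target, number_of_blocks).
    """
    block_size = -(-num_vars // k)  # ceiling division
    blocks = [list(range(start, min(start + block_size, num_vars + 1)))
              for start in range(1, num_vars + 1, block_size)]
    primes = first_primes(len(blocks) + len(clauses))
    identifiers = []
    for b, block in enumerate(blocks):
        # One identifier per block and per assignment of the block's variables.
        for values in product([False, True], repeat=len(block)):
            assignment = dict(zip(block, values))
            ident = primes[b]  # assumption: marks which block was used
            for j, clause in enumerate(clauses):
                satisfied = sum(
                    1 for lit in clause
                    if abs(lit) in assignment and assignment[abs(lit)] == (lit > 0)
                )
                # assumption: clause prime raised to the number of true literals
                ident *= primes[len(blocks) + j] ** satisfied
            identifiers.append(ident)
    target = prod(primes)  # every block once, every clause satisfied exactly once
    return identifiers, target, len(blocks)


def has_solution(identifiers, target, size):
    """Brute-force check: is there a subset of `size` identifiers with product `target`?"""
    return any(prod(c) == target for c in combinations(identifiers, size))
```

For instance, `build_subset_product(3, [(1, 2, 3)], 2)` yields two blocks, and a pair of identifiers multiplies to the target exactly when the chosen block assignments set one literal of the clause to true. Unique prime factorization is what forces one identifier per block and one true literal per clause in this reading.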
Using the above, we are ready to prove our (matching) lower bound, conditional on ETH.
Theorem 6.5.
Under ETH, there is no PTAS for WLCS with running time , where is the input size in bits.
Proof 6.6.
Suppose that such an algorithm existed. Let be the polynomial time reduction from -sized Subset Product to WLCS given in the proof of Theorem 5.4. Then, there is a solution to -sized Subset Product iff there is a solution to WLCS of size , or, equivalently, iff the optimal solution to WLCS is at least .
Using the hypothetical with an appropriate value of , we solve -sized Subset Product more efficiently than possible, thus reaching a contradiction.
Consider the following algorithm for -sized Subset Product, where there are numbers in the input, each having bits and . Given an instance , we define the instance for the WLCS to be . We run and if the output is at least we return that is satisfied, otherwise we return that it cannot be satisfied.
Note that if -sized Subset Product is solvable, then , and the value output by is at least . Thus, the value output by is at least . On the other hand, if -sized Subset Product is not solvable, then , and obviously the value output by is at most k.
Thus we found an algorithm for -sized Subset Product whose running time is . Since is obtained by a polynomial time reduction, its size is bounded by a polynomial in . Therefore, the above running time becomes . Under our assumptions, this becomes , which is not feasible under , due to Lemma 6.3.
7 Conclusion
In this paper we prove NP-Completeness for the WLCS decision problem, and give a PTAS along with a matching conditional lower bound for the optimization problem. In the most usual setting, where the alphabet size is constant, the above PTAS is in fact an EPTAS, and it is known that no FPTAS can exist unless P = NP. In the Appendix we give a transformation such that algorithms for the WLCS problem can also be applied to the $(a_{1},a_{2})$-WLCS problem.
In proving that WLCS does not admit any EPTAS, we proved that it is W[1]-hard. It may be interesting to determine the exact complexity of WLCS in the W-hierarchy.
Appendix A One Threshold is Enough
For clarity, some proofs and technical discussions are moved to this appendix. In particular, in this section we show that -WLCS and WLCS are equivalent, thus one threshold is enough. Furthermore, we show that the rounding errors introduced by working with reals (logarithms and roots) may cause a similar algorithm from a paper by Cygan et al. [19] to err if standard rounding is used.
In the following, corresponds to the maximum number of bits to represent a number in the input (a probability or a symbol of the alphabet). is not to be confused with the word-size since an input number may need many words to be represented.
Lemma A.1.
Given an instance of -WLCS , it is possible to reduce it to an instance of WLCS. The construction of and requires time, while parameter is computed in time, where is the total length of the weighted sequences and , while is the maximum number of bits needed to represent an input number.
Proof A.2.
We first provide a sketch of the proof. Our goal is to use the same weighted sequences with one additional position at the end. We introduce a new letter ('%') which only appears in this position, and we make sure that any correct algorithm picks it, by making its probability very appealing (high). Since we cannot assign a probability higher than one, increasing it is simulated by reducing all other probabilities, in all positions. Knowing that this specific letter is picked at this specific position allows us to choose the two corresponding probabilities in a way that completes the proof. In order for the probabilities to sum to 1 in every position, we introduce two auxiliary letters ('$' and '#'; '$' never appears in the first sequence and '#' never appears in the second).
The alphabet of is the alphabet of extended by three new letters, $\Sigma^{\prime}=\Sigma\cup\{\text{'\#'},\text{'\$'},\text{'\%'}\}$. Let $m=\frac{a_{1}}{2}$ and $a=m^{k}a_{1}$, where $k\leq n$. With threshold $a$, the weighted sequences $X^{\prime}$ and $Y^{\prime}$ are constructed as follows:
[TABLE]
All non-specified probabilities are equal to [math].
If there exists a solution to , then there exist two increasing subsequences and a string such that . Define and to be equal to extended with the letter . It holds that:
[math]
Conversely, suppose there exists a solution to . Then, there exist increasing subsequences and a string such that . First of all, notice that, since $p^{(X^{\prime})}_{i}(\text{'\$'})=p^{(Y^{\prime})}_{i}(\text{'\#'})=0$ for every position $i$, the string $s$ contains neither '$' nor '#'. Moreover, $s$ must contain '%', since otherwise $P_{X^{\prime}}(\pi_{1},s),P_{Y^{\prime}}(\pi_{2},s)\leq m^{k+1}<a$. As '%' only appears in the last position, $s$ ends with '%'; let $s^{\prime}$ be $s$ without this last letter. Then $P_{X}(\{i_{1},\ldots,i_{k}\},s^{\prime})\geq a_{1}$ and $P_{Y}(\{j_{1},\ldots,j_{k}\},s^{\prime})\geq a_{2}$.
The computation of requires time due to Corollary 2.6, and the -multiplications of two numbers with at most bits each cost . All other computations take linear time.
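The threshold computation can be sketched with exact rational arithmetic, which is the point of the reduction: only multiplications are needed, so no roots or logarithms (and hence no rounding) enter the picture. The sketch assumes the definitions $m=\frac{a_{1}}{2}$ and $a=m^{k}a_{1}$ that appear in the construction; the function name `single_threshold` is hypothetical.

```python
from fractions import Fraction


def single_threshold(a1, k):
    """Return (m, a) with m = a1 / 2 and a = m**k * a1, computed exactly.

    a1 is the first threshold of the two-threshold instance and k is the
    length of the common subsequence sought. Exactly k multiplications of
    rationals are performed.
    """
    a1 = Fraction(a1)
    m = a1 / 2
    a = a1
    for _ in range(k):
        a *= m  # one exact multiplication per step
    return m, a
```

For example, with `a1 = Fraction(1, 4)` and `k = 3` this gives `m = 1/8` and `a = 1/2048`, and `m ** (k + 1) = 1/4096` is strictly smaller than `a`, which is the inequality used in the converse direction of the proof.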
We note that [19] proved the same result, but their reduction required computations with real numbers (raising to the power). To the best of our knowledge, there is no way to modify that reduction so that it tolerates the rounding error in the word introduced by working with roots and logarithms.
In what follows, we show that the rounding errors may cause the algorithm by Cygan et al. [19], which reduces any instance of WLCS to a more restricted class of instances, to err. This does not rule out the possibility that more clever rounding algorithms (depending on the input size) may indeed be used so that the algorithm does not err; however we are not aware of any such rounding technique, and even if it exists, the algorithm would probably become too complicated compared to ours.
Lemma A.3.
The reduction from -WLCS to WLCS with only one threshold given by Cygan et al. in [19] may err, if exact computations with logarithms and roots are not assumed (assuming the rounding technique does not depend on the input, for example it only keeps a constant number of decimal digits).
Proof A.4.
We prove the above with an example that demonstrates that the rounding error, introduced by not assuming exact computations with logarithms and roots, may cause the reduction to err.
Let and the two weighted sequences and on alphabet be:
[TABLE]
[TABLE]
where is a constant to be specified later. For , the weighted is and for the weighted is . The transformation described in [19] would give and the new sequences would be:
[TABLE]
[TABLE]
Since is an irrational number, it is rounded to some number . Suppose . In this case, when , while the weighted is the algorithm returns due to the rounding errors. On the other hand, if , we can always find an appropriate such that the weighted should have been but the algorithm returns due to the rounding errors. To show this, let for some integer . Then . It holds that is an increasing function of which converges to . Thus, we can find a big enough such that and err on this particular example, as long as the rounding technique does not depend on the input (for example it only keeps a constant number of decimal digits).
Once again, the above is not a proof that the algorithm given by Cygan et al. can never be correct, regardless of the rounding algorithm used. It just shows that it is necessary to explicitly specify such a rounding algorithm in order to construct a correct algorithm.
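The effect of input-independent rounding can be illustrated with a generic toy computation. This is not the construction from [19]; it only shows that once an irrational root is rounded to a fixed number of decimal digits, raising it back to the corresponding power can land on either side of the original threshold. The function name `rounded_root` and the chosen values (threshold 1/2, cube root, 2 or 3 digits) are assumptions made for the illustration.

```python
from decimal import Decimal, ROUND_HALF_EVEN, localcontext
from fractions import Fraction


def rounded_root(value, degree, digits):
    """Return the `degree`-th root of `value`, rounded to `digits` decimals.

    `value` is a decimal string such as "0.5". The root is computed with
    50 significant digits and then rounded, so the result is an exact
    rational with a fixed number of decimal digits.
    """
    with localcontext() as ctx:
        ctx.prec = 50
        root = Decimal(value) ** (Decimal(1) / Decimal(degree))
        quantum = Decimal(1).scaleb(-digits)  # e.g. 1E-2 for two decimals
        return Fraction(root.quantize(quantum, rounding=ROUND_HALF_EVEN))


threshold = Fraction(1, 2)
low = rounded_root("0.5", 3, 2)   # rounded below the true cube root
high = rounded_root("0.5", 3, 3)  # rounded above the true cube root

# A product of three such factors is compared against the threshold.
print(low, low ** 3 >= threshold)
print(high, high ** 3 >= threshold)
```

With two decimals the rounded root is 79/100 and its cube falls below 1/2, so a candidate that meets the threshold exactly would be rejected. With three decimals the rounded root is 397/500 and its cube exceeds 1/2, so a candidate slightly below the threshold could be accepted.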
References
- [1] Amir Abboud, Arturs Backurs, and Virginia Vassilevska Williams. Tight hardness results for LCS and other sequence similarity measures. In IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley, CA, USA, 17-20 October, 2015, pages 59–78, 2015. doi:10.1109/FOCS.2015.14.
- [2] Amihood Amir, Eran Chencinski, Costas S. Iliopoulos, Tsvi Kopelowitz, and Hui Zhang. Property matching and weighted matching. Theoretical Computer Science, 395(2-3):298–310, 2008. doi:10.1016/j.tcs.2008.01.006.
- [3] Amihood Amir, Zvi Gotthilf, and B. Riva Shalom. Weighted LCS. Journal of Discrete Algorithms, 8(3):273–281, 2010. doi:10.1016/j.jda.2010.02.001.
- [4] Amihood Amir, Zvi Gotthilf, and B. Riva Shalom. Weighted shortest common supersequence. In String Processing and Information Retrieval, 18th International Symposium, SPIRE 2011, Pisa, Italy, October 17-21, 2011. Proceedings, pages 44–54, 2011. doi:10.1007/978-3-642-24583-1_6.
- [5] Amihood Amir, Tzvika Hartman, Oren Kapah, B. Riva Shalom, and Dekel Tsur. Generalized LCS. Theoretical Computer Science, 409(3):438–449, 2008. doi:10.1016/j.tcs.2008.08.037.
- [6] Amihood Amir, Costas S. Iliopoulos, Oren Kapah, and Ely Porat. Approximate matching in weighted sequences. In Combinatorial Pattern Matching, 17th Annual Symposium, CPM 2006, Barcelona, Spain, July 5-7, 2006, Proceedings, pages 365–376, 2006. doi:10.1007/11780441_33.
- [7] Pavlos Antoniou, Costas S. Iliopoulos, Laurent Mouchard, and Solon P. Pissis. Algorithms for mapping short degenerate and weighted sequences to a reference genome. International Journal of Computational Biology and Drug Design, 2(4):385–397, 2009. doi:10.1504/IJCBDD.2009.030768.
- [8] Alberto Apostolico, Gad M. Landau, and Steven Skiena. Matching for run-length encoded strings. Journal of Complexity, 15(1):4–16, 1999. doi:10.1006/jcom.1998.0493.
