Longest Common Subsequence on Weighted Sequences

Evangelos Kipouridis and Kostas Tsichlas

arXiv:1901.04068·cs.CC·July 21, 2020


TL;DR

This paper advances the understanding of the Longest Common Subsequence problem on weighted sequences by providing efficient approximation schemes for bounded alphabets and establishing complexity bounds for unbounded alphabets.

Contribution

It introduces an EPTAS for bounded alphabets and proves hardness results for unbounded alphabets, closing the gap between upper and lower bounds.

Findings

01

EPTAS achieved for bounded alphabets

02

No EPTAS exists for unbounded alphabets unless FPT=W[1]

03

Lower bounds under ETH restrict PTAS improvements for unbounded alphabets

Abstract

We consider the general problem of the Longest Common Subsequence (LCS) on weighted sequences. Weighted sequences are an extension of classical strings, where in each position every letter of the alphabet may occur with some probability. Previous results presented a PTAS and noticed that no FPTAS is possible unless P=NP. In this paper we essentially close the gap between upper and lower bounds by improving both. First of all, we provide an EPTAS for bounded alphabets (which is the most natural case), and prove that there does not exist any EPTAS for unbounded alphabets unless FPT=W[1]. Furthermore, under the Exponential Time Hypothesis, we provide a lower bound which shows that no significantly better PTAS can exist for unbounded alphabets. As a side note, we prove that it is sufficient to work with only one threshold in the general variant of the problem.

Equations

$P_{X}(\pi,s)=\prod_{k=1}^{d}{p^{(X)}_{i_{k}}(s_{k})}$

$SUBS(X,a)=\{s\in\Sigma^{*}\mid\exists\pi\in Seq^{|X|}_{|s|}\text{ such that }P_{X}(\pi,s)\geq a\}$

$p^{(X)}_{i}({}^{\prime}A^{\prime})$

$p^{(X)}_{n+1}({}^{\prime}A^{\prime})$

$p^{(X)}_{n+2}({}^{\prime}A^{\prime})$

$p^{(X)}_{n+1}({}^{\prime}B^{\prime})=p^{(Y)}_{n+2}({}^{\prime}B^{\prime})=0$

$\prod_{j=1}^{\ell}{L_{i_{j}}}=P$

$P_{X}(\{1,2,\ldots,n+2\},s)=\frac{\prod_{j=1}^{\ell}{L_{i_{j}}}\prod_{i=1}^{n}{c_{i}}}{P^{2}}=\frac{\prod_{i=1}^{n}{c_{i}}}{P}=a$

$P_{Y}(\{1,2,\ldots,n+2\},s)=\frac{\prod_{i=1}^{n}{d_{i}}\prod_{i=1}^{n}{c_{i}}}{\prod_{j=1}^{\ell}{L_{i_{j}}}\prod_{i=1}^{n}{d_{i}}}=\frac{\prod_{i=1}^{n}{c_{i}}}{P}=a$

$P_{X}(\{1,2,\ldots,n+2\},s)=\frac{\prod_{j=1}^{\ell}{L_{i_{j}}}\prod_{i=1}^{n}{c_{i}}}{P^{2}}\geq a\implies\prod_{j=1}^{\ell}{L_{i_{j}}}\geq P$

$P_{Y}(\{1,2,\ldots,n+2\},s)=\frac{\prod_{i=1}^{n}{d_{i}}\prod_{i=1}^{n}{c_{i}}}{\prod_{j=1}^{\ell}{L_{i_{j}}}\prod_{i=1}^{n}{d_{i}}}\geq a\implies\prod_{j=1}^{\ell}{L_{i_{j}}}\leq P$

$\mathcal{O}\left(Mul_{w}\left(\frac{B}{\epsilon}\right)\log\left(\frac{B}{\epsilon}\right)|\Sigma|^{\frac{1}{\epsilon}}\binom{n}{\frac{1}{\epsilon}}^{2}\right)$

$\mathcal{O}\left(Mul_{w}\left(\frac{B}{\epsilon}\right)\frac{n}{\epsilon}|\Sigma|^{\frac{1}{\epsilon}}\right)$

$opt_{X}(i,j)=\max\{opt_{X}(i-1,j),\;opt_{X}(i-1,j-1)\,p^{(X)}_{i}(c_{j})\}$

$L_{v}=\prod_{u\in N_{G}(v)}{p_{u}},\quad P=\prod_{v=1}^{n}{p_{v}}$

$p^{(X)}_{i}(i)$

$p^{(X)}_{n+1}(n+1)$

$p^{(X)}_{i}(n+2)$

$P_{X}(\{i_{1},\ldots,i_{k+1}\},s)\geq a\implies\frac{\prod_{i=1}^{k}{L_{\pi_{i}}}}{P^{2}M^{k}}\geq\frac{1}{PM^{k}}\implies\prod_{i=1}^{k}{L_{\pi_{i}}}\geq P$

$P_{Y}(\{i_{1},\ldots,i_{k+1}\},s)\geq a\implies\frac{1}{M^{k}\prod_{i=1}^{k}{L_{\pi_{i}}}}\geq\frac{1}{PM^{k}}\implies\prod_{i=1}^{k}{L_{\pi_{i}}}\leq P$

$p^{(X^{\prime})}_{i}(\sigma)$

$p^{(X^{\prime})}_{i}({}^{\prime}\#^{\prime})$

$p^{(X^{\prime})}_{|X|+1}({}^{\prime}\%^{\prime})$


Full text

Evangelos Kipouridis: Basic Algorithms Research Copenhagen (BARC), University of Copenhagen, [email protected], orcid.org/0000-0002-5830-5830. Funding: Thorup’s Investigator Grant 16582, Basic Algorithms Research Copenhagen (BARC), from the VILLUM Foundation, and European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No 801199.

Kostas Tsichlas: School of Informatics, Aristotle University of Thessaloniki, [email protected]

Copyright: Evangelos Kipouridis and Kostas Tsichlas

ACM Subject Classification: Theory of computation → Approximation algorithms analysis; Theory of computation → W hierarchy; Theory of computation → Problems, reductions and completeness

Acknowledgements.

We would like to thank the anonymous reviewers for their careful reading of our paper and their many insightful comments and suggestions.

Event: 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020), June 17–19, 2020, Copenhagen, Denmark. Editors: Inge Li Gørtz and Oren Weimann. Series volume 161, article no. 21.


keywords:

WLCS, LCS, weighted sequences, approximation algorithms, lower bound


1 Introduction

1.1 General concepts

We consider the problem of determining the $LCS$ (Longest Common Subsequence) on weighted sequences. Weighted sequences, also known as $p$ -weighted sequences or Position Weighted Matrices (PWM) [3, 35], are probabilistic sequences which extend the notion of strings, in the sense that in each position there is some probability for each letter of an alphabet $\Sigma$ to occur there.

Weighted sequences were introduced as a tool for motif discovery and local alignment and are extensively used in molecular biology [23]. They have been studied both in the context of short sequences (binding sites, sequences resulting from multiple alignment, etc.) and on large sequences, such as complete chromosome sequences that have been obtained using a whole-genome shotgun strategy [31, 36]. Weighted sequences are able to keep all the information produced by such strategies, while classical strings impose restrictions that oversimplify the original data.

Basic concepts concerning the combinatorics of weighted sequences (like pattern matching, repeats discovery and cover computation) were studied using weighted suffix trees [26], Crochemore’s partitioning [9, 11, 18], the Karp-Miller-Rabin algorithm [18], and other approaches [42, 29]. Other interesting results include approximate and gapped pattern matching [6, 40, 33], online pattern matching [16], weighted indexing [2, 10], swapped matching [39], the all-covers and all-seeds problem [38, 41], extracting motifs [28], and the weighted shortest common supersequence problem [4, 17]. There are also some more practical results on mapping short weighted sequences to a reference genome [7] (also studied in the parallel setting [27]), as well as on the reporting version of the problem which we also consider in this paper [11].

The Longest Common Subsequence ( $LCS$ ) problem is a well-known measure of similarity between two strings. Given two strings, the output should be the length of the longest subsequence common to both strings. Dynamic programming solutions [25, 37] for this problem are classical textbook algorithms in Computer Science. $LCS$ has been applied in computational biology for measuring the commonality of DNA molecules or proteins which may yield similar functionality. A very interesting survey on algorithms for the $LCS$ can be found in [13]. The current $LCS$ algorithms are considered optimal, since matching lower bounds (under the Strong Exponential Time Hypothesis) were proven [1, 14].

Extensions of this problem on more general structures have also been investigated (trees and matrices [5], run-length encoded strings [8], and more). One interesting variant of the $LCS$ is the Heaviest Common Subsequence ( $HCS$ ) where the matching between different letters is assigned a different weight, and the goal is to maximize the weight of the common subsequence, rather than its length.

1.2 Weighted LCS

The problem studied in this paper is the weighted $LCS$ (WLCS) problem. It was introduced by Amir et al. [3] as an extension of the classical $LCS$ problem on weighted sequences. Given two weighted sequences, the goal is to find a longest string which has a high probability of appearing in both sequences. Amir et al. initially solved an easier version of this problem in polynomial time, but unfortunately its applications are limited. As far as the general problem is concerned, they hinted at its NP-Hardness by giving an NP-Hardness result on a closely related problem, which they call the log-probability version of WLCS. In short, the problem is the same, but all products in its definition are replaced with sums. Their proof is based on a Turing reduction and only works for unbounded alphabets. Finally, Amir et al. provide a $\frac{1}{|\Sigma|}$ -approximation algorithm for the WLCS problem.

Cygan et al. [19] strengthened the evidence that WLCS is NP-Hard by providing an NP-Completeness result on the decision log-probability version of WLCS (informally introduced in the previous paragraph), already for alphabets of size $2$ , using a Karp reduction; for alphabets of size $1$ the solution is trivial since there is no uncertainty. They also gave a $\frac{1}{2}$ -approximation algorithm and a $PTAS$ , while also noticing that an $FPTAS$ cannot exist, assuming WLCS is indeed NP-Hard, as hinted by their evidence, and that P $\neq$ NP. Finally, they proved that every instance of the problem can be reduced to a more restricted class of instances. However, to achieve this, their algorithm needs to perform exact computations of roots and logarithms, and rounding these values may cause the algorithm to err.

Finally, it is worth noting that Charalampopoulos et al. [17] proved that unless P=NP, WLCS cannot be solved in $\mathcal{O}(n^{f(a)})$ time, for any function $f(a)$ , where $a$ is the cut-off probability. We note that this result concerns exact computations rather than approximations.

1.3 Our results

In this paper we essentially close the gap between upper and lower bounds for WLCS by improving both; we prove that the problem is indeed NP-Hard even for alphabets of size $2$ . Furthermore, we provide an $EPTAS$ for bounded alphabets. These two results, along with the $FPTAS$ observation by Cygan et al., completely characterize the complexity of WLCS for bounded alphabets. For unbounded alphabets, a $PTAS$ was already known by Cygan et al. [19]. We show matching lower bounds, both by ruling out the possibility of an $EPTAS$ , and by showing that, under the Exponential Time Hypothesis, no significantly better $PTAS$ can exist. We also prove that every instance of WLCS can be reduced to a restricted class of instances without using roots and logarithms, which allows exact computations without the rounding errors that can make the algorithm err.

As noted in the previous paragraph, apart from essentially closing the gap between hardness results and faster algorithms we also circumvent the need to work with roots and logarithms as the previous results did. In short, by taking advantage of the property that $(ab)^{c}=a^{c}b^{c}$ and setting $c$ to be an appropriate logarithm, previous results transformed any instance to a more manageable form. However, this transformation introduces an error that may make the algorithm err as noted in Appendix A. Table 1 summarizes the above discussion. Table 2 summarizes our results depending on the alphabet-size.

A short discussion is in order with respect to what new insights on weighted $LCS$ enabled us to achieve progress. Our most crucial observation is the fact that the problem behaves differently in the natural case of a bounded alphabet, and in the case of an unbounded alphabet. Without this distinction, closing the gap between upper and lower bounds was unlikely. That’s because, on the one hand, no $EPTAS$ for the general case could be found, as none existed. On the other hand, proving that no $EPTAS$ exists requires reductions that work only on unbounded alphabets. The aforementioned distinction is what enabled us to understand that modifying the existing reductions, which work for alphabets of size $2$ , would be futile in proving $W[1]$ -Hardness.

Furthermore, it was crucial to identify that working with products is the core difficulty in proving NP-Hardness of weighted $LCS$ . The introduction of the log-probability version of the weighted $LCS$ reflects the assumption that the difference between working with sums and working with products is just a technicality. After [3] and [19] successfully proved NP-Hardness for the log-probability version, it was natural to attempt reducing from it for proving NP-Hardness of the weighted $LCS$ problem. Despite the apparent similarities between the two problems, their difference did not allow us to craft such a reduction. For the same reason, Cygan et al. used a model that assumed infinite precision computations with reals, while we are able to avoid such a strong assumption.

1.4 Organization of the paper

The rest of the paper is organized as follows. In Section 2, we provide necessary definitions and discuss the model of computation. In Section 3, we show that WLCS is NP-Complete while in Section 4, we provide the $EPTAS$ algorithm for bounded alphabets, which is also an improved $PTAS$ for unbounded alphabets. In Section 5, we show that there can be no $EPTAS$ for unbounded alphabets by showing that this problem is $W[1]$ -hard and in Section 6, we describe the matching conditional lower bound. We conclude in Section 7.

For clarity purposes, some proofs and technical discussions are moved to the Appendix. More specifically, in Appendix A we present an algorithm that transforms any instance of our problem to an equivalent, but easier to handle, instance. We also show that the rounding errors introduced by working with reals (logarithms and roots) may cause a similar algorithm by Cygan et al. [19] to err if standard rounding is used.

2 Preliminaries

2.1 Basic Definitions

Let $\Sigma=\{\sigma_{1},\sigma_{2},\ldots,\sigma_{K}\}$ be a finite alphabet. We deal both with bounded $(K=O(1))$ and unbounded alphabets. $\Sigma^{d}$ denotes the set of all words of length $d$ over $\Sigma$ . $\Sigma^{*}$ denotes the set of all words over $\Sigma$ .

Definition 2.1 (Weighted Sequence).

A weighted sequence $X$ is a sequence of functions $p^{(X)}_{1},\ldots,p^{(X)}_{|X|}$ , where each function assigns a probability to each letter from $\Sigma$ . We thus have $\sum_{j=1}^{K}{p^{(X)}_{i}(\sigma_{j})}=1$ for all $i$ , and $p^{(X)}_{i}(\sigma_{j})\geq 0$ for all $i,j$ .

By $WS(\Sigma)$ we denote the set of all weighted sequences over $\Sigma$ . Let $X\in WS(\Sigma)$ . Let $Seq^{|X|}_{d}$ be the set of all increasing sequences of $d$ positions in $X$ . For a string $s\in\Sigma^{d}$ and $\pi\in Seq^{|X|}_{d}$ , define $P_{X}(\pi,s)$ as the probability that the subsequence on positions corresponding to $\pi$ in $X$ equals $s$ . More formally, if $\pi=(i_{1},i_{2},\ldots,i_{d})$ and $s_{k}$ denotes the $k$ -th letter of $s$ , then

$$P_{X}(\pi,s)=\prod_{k=1}^{d}{p^{(X)}_{i_{k}}(s_{k})}.$$

Denote

$$SUBS(X,a)=\{s\in\Sigma^{*}\mid\exists\pi\in Seq^{|X|}_{|s|}\text{ such that }P_{X}(\pi,s)\geq a\}.$$

That is, $SUBS(X,a)$ is the set of deterministic strings which match a subsequence of $X$ with probability at least $a$ . Every $s\in SUBS(X,a)$ is called an $a$ -subsequence of $X$ .

Let us give a clarifying example. If $\Sigma=\{\sigma_{1},\sigma_{2}\}$ and $X$ is a long weighted sequence, where in each position the probability for each letter to appear is $0.5$ , then $SUBS(X,0.3)$ does not contain $s=\sigma_{1}\sigma_{1}$ , as, for any increasing subsequence of $2$ positions, the probability of $s$ appearing is $0.25<0.3$ .
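The definitions of $P_{X}(\pi,s)$ and $SUBS(X,a)$ can be checked by brute force on tiny inputs. The following is a minimal sketch, not part of the paper: the function names, the dictionary representation of a weighted sequence, and the letter names `s1`, `s2` (standing for $\sigma_{1},\sigma_{2}$) are illustrative assumptions. It uses exact rational arithmetic and enumerates all increasing position sequences, so it is exponential and only meant to illustrate the example above.

```python
from fractions import Fraction
from itertools import combinations


def subsequence_probability(X, positions, s):
    """P_X(pi, s): product of p_{i_k}(s_k) over the chosen positions (0-based here)."""
    prob = Fraction(1)
    for pos, letter in zip(positions, s):
        prob *= X[pos].get(letter, Fraction(0))
    return prob


def in_subs(X, s, a):
    """Brute-force membership test for SUBS(X, a); exponential, tiny inputs only."""
    return any(
        subsequence_probability(X, pi, s) >= a
        for pi in combinations(range(len(X)), len(s))  # increasing position tuples
    )


# The example from the text: every position is uniform over two letters.
half = Fraction(1, 2)
X = [{"s1": half, "s2": half} for _ in range(6)]
```

With this setup, `in_subs(X, ("s1", "s1"), Fraction(3, 10))` is false because every pair of positions gives probability $0.25$, while the single-letter string `("s1",)` is a $0.3$-subsequence.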

The decision problem we consider is the following:

Definition 2.2 ( $(a_{1},a_{2})$ -WLCS decision problem).

Given two weighted sequences $X,Y$ , two cut-off probabilities $a_{1},a_{2}$ and a number $k$ , decide whether the longest string $s$ contained in $SUBS(X,a_{1})\cap SUBS(Y,a_{2})$ has length at least $k$ .

Naturally, the respective optimization problem is the following:

Definition 2.3 ( $(a_{1},a_{2})$ -WLCS optimization problem).

Given two weighted sequences $X,Y$ , and two cut-off probabilities $a_{1},a_{2}$ , find the length of the longest string contained in $SUBS(X,a_{1})\cap SUBS(Y,a_{2})$ .

Both in the decision and the optimization version, the WLCS problem is the $(a_{1},a_{2})$ -WLCS problem, where $a_{1}=a_{2}$ . We denote these (equal) probabilities by $a$ ( $a=a_{1}=a_{2}$ ) for concreteness.

Let us note that the problem is only interesting if $|\Sigma|\geq 2$ . For $|\Sigma|=1$ the problem is trivial since there is no uncertainty at all. The same letter appears in every position in both strings with probability $1$ , and thus the answer is simply the length of the shorter weighted sequence.

Finally, let us also state that the Log-Probability version of the WLCS, studied in previous papers, is the same as the original WLCS if we replace $P_{X}(\pi,s)=\prod_{k=1}^{d}{p^{(X)}_{i_{k}}(s_{k})}$ by $P_{X}(\pi,s)=\sum_{k=1}^{d}{p^{(X)}_{i_{k}}(s_{k})}$ .

2.2 Model of Computation

Our model of computation is the standard word $RAM$ , introduced by Fredman and Willard [20] to simulate programming languages like C. The word size is $w=\Omega(\log{I})$ , where $I$ is the input size in bits, so as to allow random access indexing of the whole input. Thus, arithmetic operations between words take constant time. However, due to the nature of our problem, it is necessary to compute products of many numbers. This can produce numbers that are much larger than the word size. We even allow numbers in the input to be larger than $2^{w}$ (these numbers just need to use more than one word to be represented). We generally assume that each number in the input is represented by at most $B$ bits, but we do not pose any constraint on $B$ other than the trivial one that $B<I$ . Of course, in cases where we deal with numbers that occupy many words, we no longer have unit-cost arithmetic operations; we guarantee, however, that our results only use linear or near-linear time operations (like comparisons and multiplications) on numbers polynomial in the input size. Thus, although we do not enjoy the unit-cost assumption for arbitrary numbers, we always stay within the polynomial-time regime.

2.3 Basic Operations

In this subsection we discuss the multiplication of two $B$ -bit input numbers in (polynomial) $Mul_{w}(B)$ time, where $w$ is the word-size. For example, for integers there exists a multiplication algorithm by Harvey and van der Hoeven [24] with time complexity $Mul_{w}(B)=\mathcal{O}\left(B\log{B}\right)$ (generally the running time can also depend on $w$ , although in this case it does not). Although this result has not yet been published, we use it for its easy-to-read time complexity; it is trivial to use other algorithms instead, like the one from Fürer [21], or the more practical one by Schönhage and Strassen [34]. We establish the complexity of multiplying $x$ $B$ -bit numbers. Our divide and conquer algorithm splits the numbers into two (equal-sized) groups, recursively multiplies each, and multiplies the results in $Mul_{w}\left(\frac{xB}{2}\right)$ time. By a direct application of the Master Theorem by Bentley et al. [12] we prove the following lemma.

Lemma 2.4.

Multiplying $x$ $B$ -bit numbers costs

• $\mathcal{O}(Mul_{w}(xB)\log(xB))$ time if $Mul_{w}(xB)=\Theta(xB\log^{k}(xB))$ for some constant $k$ ,

• $\mathcal{O}((xB)^{c})$ time else if $Mul_{w}(xB)=\mathcal{O}((xB)^{c})$ for some constant $c\geq 1$ ,

assuming that $Mul_{w}(N)$ is a polynomial time algorithm that multiplies two $N$ -bit numbers.

Proof 2.5.

The algorithm simply splits the numbers into two equal-sized groups, recursively multiplies each, and then multiplies the results. Let $N=xB$ . We have that the running time for multiplying $x$ $B$ -bit numbers is $T(N)=2T(\frac{N}{2})+Mul_{w}(N)$ . Since $c_{crit}=\log_{2}{2}=1$ , and $Mul_{w}(N)=\Omega(N)$ , the Master Theorem [12] gives two cases. Either $Mul_{w}(N)=\Theta(N\log^{k}(N))$ for some constant $k$ , in which case $T(N)=\mathcal{O}(Mul_{w}(N)\log{N})$ , or else $Mul_{w}(N)=\mathcal{O}(N^{c})$ for some constant $c\geq 1$ (such a constant exists since we assume polynomial time multiplications). In this case, since it holds that $2Mul_{w}(\frac{N}{2})\leq 2Mul_{w}(N)$ , we get that $T(N)=\mathcal{O}(Mul_{w}(N))$ if $c>c_{crit}=1$ . Notice that we handled all cases, since $Mul_{w}(N)=N$ is handled by the first case with $k=0$ , and whatever does not fit in the first case, definitely fits in the second, since we assumed that $Mul_{w}(N)$ is polynomial in $N$ .

Corollary 2.6.

Multiplying $x$ $B$ -bit numbers costs polynomial time by using any polynomial time algorithm for multiplying two $B$ -bit numbers as a black box. In particular, if we use Harvey and van der Hoeven’s algorithm, the time cost is $\mathcal{O}\left(xB\log^{2}{(xB)}\right)$ .
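The divide and conquer scheme behind Lemma 2.4 can be sketched in a few lines. This is an illustrative sketch only: the name `product_tree` is an assumption, and Python's built-in integer multiplication stands in for the black-box $Mul_{w}$ routine (it is not the Harvey and van der Hoeven algorithm), so the code shows the recursion structure rather than the stated running time.

```python
def product_tree(numbers):
    """Multiply a list of integers by recursive halving (divide and conquer)."""
    if not numbers:
        return 1                      # empty product
    if len(numbers) == 1:
        return numbers[0]
    mid = len(numbers) // 2
    left = product_tree(numbers[:mid])    # product of the first group
    right = product_tree(numbers[mid:])   # product of the second group
    return left * right                   # one multiplication of two large numbers
```

The point of the balanced split is that the two operands of the final multiplication have roughly equal bit length, which is the situation the recurrence $T(N)=2T(\frac{N}{2})+Mul_{w}(N)$ describes.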

Let us also notice that the way to divide two $B$ -bit numbers is simply storing both the numerator and the denominator. Comparing two numbers $x_{1}=\frac{num_{1}}{den_{1}}$ and $x_{2}=\frac{num_{2}}{den_{2}}$ can be done by comparing $num_{1}\times den_{2}$ and $num_{2}\times den_{1}$ . The only other operation we need when working with such fractions is subtracting a $B$ -bit number $x=\frac{num}{den}$ from $1$ . This is simply $\frac{den-num}{den}$ .
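The fraction handling described above needs only integer multiplication, subtraction and comparison. Below is a minimal sketch under the assumption that a number is stored as a `(num, den)` pair with a positive denominator; the function names are hypothetical.

```python
def less_or_equal(x1, x2):
    """Test num1/den1 <= num2/den2 by cross-multiplication (denominators > 0)."""
    num1, den1 = x1
    num2, den2 = x2
    return num1 * den2 <= num2 * den1


def one_minus(x):
    """Return 1 - num/den as the pair (den - num, den)."""
    num, den = x
    return (den - num, den)


def multiply(x1, x2):
    """Multiply two fractions without reducing them to lowest terms."""
    return (x1[0] * x2[0], x1[1] * x2[1])
```

No division or greatest common divisor computation is performed, so every intermediate value is an exact integer and no rounding can occur.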

3 NP-Completeness

An NP-Completeness proof for the integer log-probability version of the WLCS problem has been given in [19]. This is a closely related problem, with the main difference being that products are replaced with sums. We do not know of any way to reduce from this log-probability version to WLCS other than exponentiating. As stated in the explanation of our model of computation in Section 2, there is no limit on the number of bits needed to represent a single number (it just occupies a lot of words). This means that, if the input consisted of $I$ bits, and there was a number (probability) represented with $\frac{I}{100}$ bits, exponentiating would result in a number with $2^{\frac{I}{100}}$ bits, meaning the reduction would not be a polynomial-time one. For this reason, we believe that although it is easier to prove NP-Completeness for the integer log-probability version of the problem, there is no easy way to use it for proving NP-Completeness for the general version. We, thus, give a reduction from the NP-Complete problem Subset Product [22] which proves NP-Completeness directly for the general problem.

Notice that for alphabets consisting of one letter, the problem is trivial since there is no uncertainty at all. In the following, we prove that even for alphabets consisting of two letters, the problem is NP-Complete.

Definition 3.1 (Subset Product).

Given a set $L$ of $n$ integers and an integer $P$ , decide whether there exists a subset of the numbers in $L$ with product $P$ .

Lemma 3.2.

WLCS is NP-Complete, even for alphabets of size $2$ .

Proof 3.3.

Obviously $WLCS\in NP$ since the increasing subsequences $\pi_{1},\pi_{2}$ and the string $s$ for which $P_{X}(\pi_{1},s)\geq a,P_{Y}(\pi_{2},s)\geq a$ are a certificate which, along with the input, can be used to verify in polynomial time that the problem has a solution.

Let $(L,P)$ be an instance of Subset Product and let $n=|L|$ . By $L_{i}$ we denote the $i$ -th number of the set $L$ , assuming any fixed ordering of the $n$ numbers of $L$ . We give a polynomial-time reduction to an $(X,Y,a,k)$ instance of WLCS, with alphabet size $2$ (we call the letters ${}^{\prime}A^{\prime}$ and ${}^{\prime}B^{\prime}$ ).

The core idea is the following: The weighted sequences have $n$ positions (plus $2$ more for technical reasons related to the threshold $a$ ). The number $k$ is equal to the length of the sequences, meaning that we pick every position, and the only question is whether we picked letter ${}^{\prime}A^{\prime}$ or letter ${}^{\prime}B^{\prime}$ . Letter ${}^{\prime}A^{\prime}$ in position $i$ corresponds to picking the $i$ -th number in the original Subset Product, while letter ${}^{\prime}B^{\prime}$ corresponds to not picking it. Finally, the letters ${}^{\prime}A^{\prime}$ picked in $X$ form an inequality of the form: "some product is $\geq P$ ", while the same letters in $Y$ form the inequality: "the same product is $\leq P$ ". For these two to hold simultaneously, it must be the case that we found some product equal to $P$ , which is the goal of the original Subset Product.

More formally, the weighted sequences have size $n+2$ . Let $c_{i}=\frac{1}{1+L_{i}}$ and $d_{i}=\frac{1}{1+\frac{1}{L_{i}}}$ .

[TABLE]

where $p^{(X)}_{i}(^{\prime}B^{\prime})=1-p^{(X)}_{i}(^{\prime}A^{\prime})$ for all $i$ , and similarly for $Y$ . Notice that, in particular, $p^{(X)}_{i}(^{\prime}B^{\prime})=c_{i},1\leq i\leq n$ and $p^{(Y)}_{i}(^{\prime}B^{\prime})=d_{i},1\leq i\leq n$ . Finally, we set $k=n+2$ and $a=\frac{\prod_{i=1}^{n}{c_{i}}}{P}$ .

First of all, notice that since we must find a string of length $n+2$ , we must choose a letter from every position. Thus, a choice of letter at some position on $X$ corresponds to the same choice of letter at that position on $Y$ . The choice of letter on positions $n+1$ and $n+2$ is ${}^{\prime}A^{\prime}$ in both cases since

$$p^{(X)}_{n+1}({}^{\prime}B^{\prime})=p^{(Y)}_{n+2}({}^{\prime}B^{\prime})=0.$$

Suppose that the numbers at positions $\{i_{1},\ldots,i_{\ell}\}$ give product $P$ :

$$\prod_{j=1}^{\ell}{L_{i_{j}}}=P.$$

Then, we form the string $s$ by picking ${}^{\prime}A^{\prime}$ at positions $\{i_{1},\ldots,i_{\ell},n+1,n+2\}$ and ${}^{\prime}B^{\prime}$ at all other positions. Thus

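$$P_{X}(\{1,2,\ldots,n+2\},s)=\frac{\prod_{j=1}^{\ell}{L_{i_{j}}}\prod_{i=1}^{n}{c_{i}}}{P^{2}}=\frac{\prod_{i=1}^{n}{c_{i}}}{P}=a$$

$$P_{Y}(\{1,2,\ldots,n+2\},s)=\frac{\prod_{i=1}^{n}{d_{i}}\prod_{i=1}^{n}{c_{i}}}{\prod_{j=1}^{\ell}{L_{i_{j}}}\prod_{i=1}^{n}{d_{i}}}=\frac{\prod_{i=1}^{n}{c_{i}}}{P}=a$$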

Conversely, suppose there is a solution for the WLCS problem, where the string $s$ is formed by picking ${}^{\prime}A^{\prime}$ at positions $\{i_{1},\ldots,i_{\ell},n+1,n+2\}$ and ${}^{\prime}B^{\prime}$ at all other positions. It holds that:

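$$P_{X}(\{1,2,\ldots,n+2\},s)=\frac{\prod_{j=1}^{\ell}{L_{i_{j}}}\prod_{i=1}^{n}{c_{i}}}{P^{2}}\geq a\implies\prod_{j=1}^{\ell}{L_{i_{j}}}\geq P$$

$$P_{Y}(\{1,2,\ldots,n+2\},s)=\frac{\prod_{i=1}^{n}{d_{i}}\prod_{i=1}^{n}{c_{i}}}{\prod_{j=1}^{\ell}{L_{i_{j}}}\prod_{i=1}^{n}{d_{i}}}\geq a\implies\prod_{j=1}^{\ell}{L_{i_{j}}}\leq P$$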

The above imply that $\prod_{j=1}^{\ell}{L_{i_{j}}}=P$ . Finally, notice that all computations are done in polynomial time, due to Corollary 2.6.
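To make the reduction concrete, here is a small sketch that builds a WLCS instance from a Subset Product instance and checks it by brute force. Two things are assumptions rather than statements from the paper. First, the function names and the dictionary representation are illustrative. Second, the exact probabilities at positions $n+1$ and $n+2$ are not recoverable from the text here; the sketch picks $p^{(X)}_{n+2}({}^{\prime}A^{\prime})=\frac{1}{P^{2}}$ and $p^{(Y)}_{n+1}({}^{\prime}A^{\prime})=\frac{\prod c_{i}}{\prod d_{i}}$, which is one choice consistent with $p^{(X)}_{n+1}({}^{\prime}B^{\prime})=p^{(Y)}_{n+2}({}^{\prime}B^{\prime})=0$ and with the products appearing in the proof. Positive integers are assumed for $L$ and $P$.

```python
from fractions import Fraction
from itertools import product as cartesian
from math import prod


def build_instance(L, P):
    """Build (X, Y, a, k) from a Subset Product instance with positive integers."""
    one = Fraction(1)
    c = [Fraction(1, 1 + Li) for Li in L]    # c_i = 1 / (1 + L_i)
    d = [Fraction(Li, 1 + Li) for Li in L]   # d_i = 1 / (1 + 1/L_i)
    X = [{"A": 1 - ci, "B": ci} for ci in c]
    Y = [{"A": 1 - di, "B": di} for di in d]
    # ASSUMPTION: padding positions n+1 and n+2, chosen to match the proof's products.
    ratio = prod(c, start=one) / prod(d, start=one)
    X.append({"A": one, "B": 0 * one})
    X.append({"A": Fraction(1, P * P), "B": 1 - Fraction(1, P * P)})
    Y.append({"A": ratio, "B": 1 - ratio})
    Y.append({"A": one, "B": 0 * one})
    a = prod(c, start=one) / P               # threshold a = (prod c_i) / P
    return X, Y, a, len(L) + 2


def full_probability(W, s):
    """Probability of s when every position of W is used (k equals the length)."""
    return prod((W[i][ch] for i, ch in enumerate(s)), start=Fraction(1))


def has_wlcs_solution(L, P):
    """Exhaustively test all 2^(n+2) strings; for illustration on tiny inputs."""
    X, Y, a, k = build_instance(L, P)
    return any(
        full_probability(X, s) >= a and full_probability(Y, s) >= a
        for s in cartesian("AB", repeat=k)
    )
```

For $L=\{2,3,5\}$ the call `has_wlcs_solution([2, 3, 5], 6)` succeeds through the string that picks ${}^{\prime}A^{\prime}$ for $2$ and $3$, whereas $P=7$ admits no solution because no subset has product $7$.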

4 EPTAS for Bounded Alphabets, Improved PTAS for Unbounded Alphabets

We now give an Efficient Polynomial Time Approximation Scheme ( $EPTAS$ ) for the case where our alphabet size is bounded ( $|\Sigma|=O(1)$ ). Let us notice that this is the case when working with DNA sequences ( $|\Sigma|=4$ ), the most usual application of weighted sequences. The same algorithm is an improved (when compared to [19]) $PTAS$ in the case of unbounded alphabets. This means that the WLCS problem is Fixed-Parameter Tractable for constant size alphabets and thus belongs to the corresponding complexity class $FPT$ as shown in Corollary 4.6.

The authors in [19] first noted that there is no $FPTAS$ unless $P=NP$ , and so we can only hope for an $EPTAS$ . Our result relies on their following result:

Lemma 4.1 (Lemma 4.6 of [19]).

It is possible to find, in polynomial time, a solution of size $d$ to the WLCS optimization problem such that the optimal value $OPT$ is guaranteed to be either $d$ or $d+1$ (however we do not know which one holds).

Their $PTAS$ uses the above result and in case the approximation is guaranteed to be good enough ( $d>(1-\epsilon)(d+1)$ , which implies that $d>(1-\epsilon)OPT$ ), it stops. Else, it holds that $\frac{1}{\epsilon}\geq d+1\geq OPT$ , and the $PTAS$ exhaustively searches all subsequences of $X$ , all subsequences of $Y$ , and all possible strings of length $d+1$ , for a total complexity of

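$$\mathcal{O}\left(Mul_{w}\left(\frac{B}{\epsilon}\right)\log\left(\frac{B}{\epsilon}\right)|\Sigma|^{\frac{1}{\epsilon}}\binom{n}{\frac{1}{\epsilon}}^{2}\right).$$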

$Mul_{w}(\frac{B}{\epsilon})\log(\frac{B}{\epsilon})$ is the time needed to multiply $d+1$ numbers with at most $B$ bits each, by Lemma 2.4, and is insignificant compared to the other terms. Our $EPTAS$ improves the exhaustive search part to

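$$\mathcal{O}\left(Mul_{w}\left(\frac{B}{\epsilon}\right)\frac{n}{\epsilon}|\Sigma|^{\frac{1}{\epsilon}}\right),$$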

which is polynomial in the input size, in case of bounded alphabets. The following lemma is needed.

Lemma 4.2.

Given a weighted sequence $X$ of length $n$ , and a string $s$ of length $d$ , it is possible to find the maximum number $a$ such that there exists an increasing subsequence $\pi$ of length $d$ for which $P_{X}(\pi,s)=a$ . The running time of the algorithm is $O(Mul_{w}(dB)nd)$ , where $B$ is the maximum number of bits needed to represent each probability in $X$ .

Proof 4.3.

We use dynamic programming. Let $s_{j}$ be the string formed by the first $j$ letters of $s$ , $c_{j}$ be the $j$ -th letter of $s$ and $opt_{X}(i,j)$ be the maximum number such that there exists an increasing subsequence $\pi^{\prime}$ of length $j$ whose last term $\pi^{\prime}_{j}$ is at most $i$ and for which $P_{X}(\pi^{\prime},s_{j})=opt_{X}(i,j)$ . Since we choose whether $c_{j}$ is picked from the $i$ -th position of $X$ , it holds that:

$$opt_{X}(i,j)=\max\{opt_{X}(i-1,j),\;opt_{X}(i-1,j-1)\,p^{(X)}_{i}(c_{j})\}.$$

For the base cases, $opt_{X}(i,0)=1$ for all $i$ (we can always form the empty string with certainty, by not picking anything), and $opt_{X}(0,j)=0$ for $j>0$ (not picking anything never gives us a non-empty string). We are interested in the value $opt_{X}(|X|,|s|)$ .
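The dynamic program of this proof can be sketched as follows. The function name and the dictionary representation of a weighted sequence are illustrative assumptions; exact rationals from the standard library replace the numerator and denominator bookkeeping of Section 2.3, and a single rolling row replaces the full $opt_{X}$ table.

```python
from fractions import Fraction


def max_subsequence_probability(X, s):
    """Return opt_X(|X|, |s|): the largest P_X(pi, s) over increasing pi."""
    d = len(s)
    # Row i = 0: opt(0, 0) = 1 and opt(0, j) = 0 for j > 0.
    opt = [Fraction(1)] + [Fraction(0)] * d
    for dist in X:                       # positions i = 1 .. |X|
        # Right to left, so opt[j - 1] still holds the value of row i - 1.
        for j in range(d, 0, -1):
            take = opt[j - 1] * dist.get(s[j - 1], Fraction(0))
            if take > opt[j]:            # max of skipping and taking position i
                opt[j] = take
    return opt[d]
```

A string $s$ then belongs to $SUBS(X,a)$ exactly when `max_subsequence_probability(X, s) >= a`, which is how the lemma is used in the exhaustive search step.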

Now we are ready to give our $EPTAS$ .

Theorem 4.4.

For any value $\epsilon\in(0,1]$ there exists a $(1-\epsilon)$ -approximation algorithm for the WLCS problem which runs in $\mathcal{O}\left(poly(I)+\frac{n}{\epsilon}Mul_{w}\left(\frac{B}{\epsilon}\right)|\Sigma|^{\frac{1}{\epsilon}}\right)$ time and uses $\mathcal{O}\left(poly(I)\right)$ space, where $I$ is the input size, $n=|X|+|Y|$ and $B$ is the maximum number of bits needed to represent a probability in $X$ and $Y$ . Consequently, the WLCS problem admits an $EPTAS$ for bounded alphabets.

Proof 4.5.

We begin by using Lemma 4.1 to find an $a$ -subsequence of length $d$ , such that the optimal solution is at most $d+1$ . If $d+1\geq\frac{1}{\epsilon}$ , we are done, since in that case we have a $\frac{d}{d+1}=1-\frac{1}{d+1}\geq(1-\epsilon)$ approximation. Otherwise, we try all possible strings $s\in\Sigma^{d+1}$ , and use Lemma 4.2 to check if any one of them can appear in both weighted sequences with probability at least $a$ . If one does, it is an optimal solution of length $d+1$ ; otherwise the $a$ -subsequence of length $d$ is optimal.
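The exhaustive-search step of this proof might look as follows in Python. It is a sketch, not the authors' implementation: the weighted sequences are assumed to be lists of dictionaries from single-character letters to `Fraction` probabilities, the value $d$ is assumed to come from Lemma 4.1 (not implemented here), and both function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def best_probability(X, s):
    """Rolling-array form of the Lemma 4.2 dynamic program."""
    opt = [Fraction(1)] + [Fraction(0)] * len(s)
    for column in X:
        # Sweep right to left so opt[j - 1] still refers to the previous position.
        for j in range(len(s), 0, -1):
            opt[j] = max(opt[j], opt[j - 1] * column.get(s[j - 1], 0))
    return opt[len(s)]


def find_longer_common_string(X, Y, alphabet, a, d):
    """Try every string of length d + 1 over the alphabet.

    Returns a string that occurs in both X and Y with probability at least a,
    or None if no such string exists.
    """
    for letters in product(alphabet, repeat=d + 1):
        if best_probability(X, letters) >= a and best_probability(Y, letters) >= a:
            return "".join(letters)
    return None
```

The loop visits $|\Sigma|^{d+1}$ candidates, and it is only reached when $d+1<\frac{1}{\epsilon}$ , which is where the $|\Sigma|^{\frac{1}{\epsilon}}$ factor of Theorem 4.4 comes from.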

Corollary 4.6.

$WLCS\in FPT$ for bounded alphabets, parameterized by the solution length.

Proof 4.7.

Follows directly from [30], Proposition 2.

5 No EPTAS for Unbounded Alphabets

We have already seen that there is no $FPTAS$ for WLCS, even for alphabets of size $2$ , unless $P=NP$ . We have also shown an $EPTAS$ for bounded alphabets and a $PTAS$ for unbounded alphabets. The natural question that arises is: Is it possible to give an $EPTAS$ for unbounded alphabets?

We answer this question negatively, by proving that WLCS is $W[1]$ -hard, meaning that it does not admit an $EPTAS$ (and is in fact not even in $FPT$ ) unless $FPT=W[1]$ ([30], Corollary $1$ ). To show this, we give a $2$ -step $FPT$ -reduction from Perfect Code, which was shown to be $W[1]$ -Complete in [15], to $k$ -sized Subset Product and then to WLCS. The $k$ -sized Subset Product problem is the Subset Product problem with the additional constraint that the target subset must be of size $k$ .

Definition 5.1 (Perfect Code).

A perfect code is a set of vertices $V^{\prime}\subseteq V$ with the property that for each vertex $u\in V$ there is precisely one vertex in $N_{G}(u)\cap V^{\prime}$ , where $N_{G}(u)$ is the set of adjacent nodes of $u$ in $G$ .

In the perfect code problem, we are given an undirected graph $G$ and a positive integer $k$ , and we need to decide whether $G$ has a $k$ -element perfect code. Notice that the definition of a perfect code implies that there is a perfect code iff there is a set $V^{\prime}\subseteq V$ for which $\bigcup_{u\in V^{\prime}}{N_{G}(u)}=V$ and $N_{G}(u)\cap N_{G}(v)=\emptyset$ for all $u,v\in V^{\prime},u\neq v$ . First we show that $k$ -sized Subset Product is $W[1]$ -hard.

Lemma 5.2.

$k$ -sized Subset Product is $W[1]$ -hard.

Proof 5.3.

Let $(G=(V,E),k)$ be an instance of Perfect Code. Suppose that the vertices are $V=\{1,\ldots,n\}$ . First of all, we compute the first $n$ prime numbers using the Sieve of Eratosthenes. We denote the $i$ -th prime number as $p_{i}$ . The set of positive integers $L=\{L_{1},L_{2},\ldots,L_{n}\}$ as well as the positive integer $P$ are defined as follows:

$L_{i}=\prod_{j\in N_{G}(i)}p_{j}$ for $1\leq i\leq n$ , and $P=\prod_{j=1}^{n}p_{j}$

Notice that due to the unique prime factorization theorem, a subset of $k$ numbers from the set $L$ have product $P$ iff $G$ has a $k$ -element Perfect Code.

The size of our primes is $O(n\log{n})$ due to the prime number theorem. Thus, they require $O(\log{n})$ bits to be represented. Each integer in $L$ , as well as $P$ , is computed using Corollary 2.6 in $O(n\log^{3}{n})$ time, for an overall $O(n^{2}\log^{3}{n})$ complexity for our reduction. Since the new parameter $k$ is the same as the old one (no dependence on $n$ ), our reduction is in fact an $FPT$ -reduction.
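A small Python sketch can make the reduction concrete. It assumes that $L_{i}$ is the product of the primes indexed by $N_{G}(i)$ and that $P$ is the product of the first $n$ primes, which is what the unique-factorization argument suggests. The example below also assumes that each neighbourhood contains the vertex itself. Trial division replaces the Sieve of Eratosthenes for brevity, and all function names are hypothetical.

```python
from itertools import combinations
from math import prod


def first_primes(n):
    """First n primes by trial division (a stand-in for the sieve in the proof)."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def perfect_code_to_subset_product(neighborhood):
    """neighborhood[i] is the set N_G(i) for the vertices 1..n.

    Returns (L, P): L[i-1] is the product of the primes p_j with j in N_G(i),
    and P is the product of the first n primes.
    """
    n = len(neighborhood)
    p = first_primes(n)
    L = [prod(p[j - 1] for j in neighborhood[i]) for i in range(1, n + 1)]
    return L, prod(p)


def has_k_subset_with_product(L, k, P):
    """Brute-force check, meant only for tiny instances."""
    return any(prod(subset) == P for subset in combinations(L, k))
```

For the path on vertices 1, 2, 3, 4 the neighbourhoods are {1,2}, {1,2,3}, {2,3,4}, {3,4}, which gives $L=\{6,30,105,35\}$ and $P=210$ ; the pair $6\cdot 35=210$ corresponds to the perfect code $\{1,4\}$ .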

Our result for this section is the following.

Theorem 5.4.

WLCS, parameterized by the length of the solution, is $W[1]$ -hard.

Proof 5.5.

To prove the theorem we create diagonal weighted sequences. That is, we require each letter to appear only in one position and vice-versa. In this way, the subsequences picked for $X$ and $Y$ are the same. The above rule is broken by the addition of two auxiliary letters that are there to make the probabilities add up to $1$ in each position. This creates no problem because we make sure that these letters are never picked. Finally, we force the product to be equal to our target, by forcing it to be at most our target and at least our target at the same time.

More formally, let $(L=\{L_{1},L_{2},\ldots,L_{n}\},k,P)$ be an instance of the $k$ -sized Subset Product problem and let $M=m^{k+1}$ , where $m$ is the maximum number in set $L$ . Notice that if $m^{k}\leq P$ then we only need to check the product of the highest $k$ numbers of $L$ , which means the problem is solvable in polynomial time. Thus we can assume that $M\geq m^{k}>P$ . The alphabet of $X,Y$ is $\Sigma=\{1,2,\ldots,n,n+1,n+2,n+3\}$ and we set $a=\frac{1}{PM^{k}}$ .

[TABLE]

All non-specified probabilities are equal to 0. Notice that symbols $n+2$ and $n+3$ are used to guarantee that probabilities sum up to $1$ .

We show that the instance $(X,Y,a,k+1)$ has a solution iff $(L,k,P)$ has a solution. Suppose there exists a solution to $(L,k,P)$ . Then, there exists an increasing subsequence $\pi=(i_{1},\ldots,i_{k})$ such that $\prod_{j=1}^{k}{L_{i_{j}}}=P$ . Let $\pi^{\prime}$ be $\pi$ extended by the number $i_{k+1}=n+1$ and $s$ be the string $i_{1}i_{2}\ldots i_{k+1}$ . It holds that $P_{X}(\pi^{\prime},s)=P_{Y}(\pi^{\prime},s)=a$ .

Conversely, suppose there exists a solution to $(X,Y,a,k+1)$ . Then there exist increasing subsequences $\pi=(i_{1},\ldots,i_{k+1}),\pi^{\prime}=(j_{1},\ldots,j_{k+1})$ and a string $s$ such that $P_{X}(\pi,s)\geq a,P_{Y}(\pi^{\prime},s)\geq a$ . First of all, notice that, due to $p^{(X)}_{i}(n+3)=p^{(Y)}_{i}(n+2)=0$ for all $i$ , $s$ does not contain letters $n+2$ and $n+3$ , which leaves only one choice for every position. Also each letter appears only once in each sequence, and in the same position. Thus, $\pi=\pi^{\prime}$ , and due to our construction the $i$ -th letter of $s$ is the $i$ -th member of $\pi$ . Finally, not picking position $n+1$ would result in $P_{Y}(\pi,s)<a$ due to the fact that $P<M$ . Thus, the last letter of $s$ is $n+1$ . It holds that:

[TABLE]

The above two inequalities imply a $k$ -sized subset of $L$ with product equal to $P$ .

The reduction is a polynomial-time one, due to Corollary 2.6. More than that, it is an $FPT$ -reduction since the new parameter $k$ is equal to the old parameter incremented by one, and thus has no dependence on $n$ .

6 Matching Conditional Lower Bound on any PTAS

In the $d$ -SUM problem, we are given $n$ numbers and need to decide whether there exists a $d$ -tuple that sums to zero. Patrascu and Williams [32] proved that any algorithm for solving the $d$ -SUM problem requires $n^{\Omega(d)}$ time, unless the Exponential Time Hypothesis ( $ETH$ ) fails. To show this, they first proved a hardness result for a variant of 3-SAT, the sparse 1-in-3 SAT.

Definition 6.1 (Sparse 1-in-3 SAT).

Given a boolean formula in 3-CNF with $n$ variables and $O(n)$ clauses, where each variable appears in a constant number of clauses, determine whether there exists an assignment of the variables such that each clause is satisfied by exactly one variable.

They first prove the following hardness result under $ETH$ .

Proposition 6.2.

Under $ETH$ , there is an (unknown) constant $s_{3}$ such that there exists no algorithm to solve sparse 1-in-3 SAT in $\mathcal{O}(2^{\delta n})$ time for $\delta<s_{3}$ .

By assuming an $n^{o(d)}$ time algorithm for $d$ -SUM they contradicted the above proposition, which cannot happen under $ETH$ . We use the same technique for proving an $n^{\Omega(k)}$ lower bound for $k$ -sized Subset Product.

Lemma 6.3.

Assuming the $ETH$ , the problem of $k$ -sized Subset Product cannot be solved in $\mathcal{O}(n^{\frac{s_{3}k}{101}})$ time on instances where $k<n^{0.99}$ and each number in the input set $L$ has $\mathcal{O}\left(\log{n}(\log{k}+\log{\log{n}})\right)$ bits. Here $n$ is the size of $L$ , and the target $P$ can be arbitrarily big.

Proof 6.4.

Let $f$ be a sparse 1-in-3 SAT instance with $N$ variables and $M=\mathcal{O}(N)$ clauses, and $k>\frac{1}{s_{3}}$ . Conceptually, we split the variables of $f$ into $k$ blocks of equal size - apart from the last block that may have smaller size. Each block contains at most $\lceil\frac{N}{k}\rceil$ variables, and thus there are at most $2^{\lceil\frac{N}{k}\rceil}$ different assignments of values to the group-of-variables within a block. For each block and for each one of these assignments we generate a number which serves as an identifier of the corresponding block and assignment. Thus, there are $n=k2^{\lceil\frac{N}{k}\rceil}$ different identifiers.

Let $p_{i}$ be the $i$ -th prime number. In order to compute an identifier, we initialize it to $p_{b}$ , where $b$ is the index of the identifier’s corresponding block. Then, we run through all of the $M=\mathcal{O}(N)$ clauses and do the following: suppose we process the $i$ -th clause and let $0\leq j\leq 3$ be the number of variables of the identifier’s corresponding assignment that satisfy the clause. We update the identifier by multiplying it with $p_{k+i}^{j}$ .

Since each variable appears only in a constant number of clauses, each identifier is a product of $\mathcal{O}(\frac{N}{k})$ numbers. The prime number theorem guarantees $\mathcal{O}(\log{N})$ bits to represent each factor, which means the identifiers have $\mathcal{O}(\frac{N}{k}\log{N})$ bits. Using the fact that $n=k2^{\lceil\frac{N}{k}\rceil}$ , each identifier is represented by $\mathcal{O}\left(\log{n}(\log{k}+\log{\log{n}})\right)$ bits.

These $n$ identifiers, along with the target $P=\prod_{i=1}^{k+M}p_{i}$ (recall that $p_{i}$ is the $i$ -th prime number), form a $k$ -sized Subset Product instance. This preprocessing step costs $\mathcal{O}(2^{\frac{N}{k}})$ time, ignoring polynomial terms, which is more efficient than $\mathcal{O}(2^{s_{3}N})$ because $k>\frac{1}{s_{3}}$ .
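The identifier construction can be illustrated with a short Python sketch. Several choices here are assumptions made for the illustration: a clause is a tuple of nonzero integers (literal `v` means variable `v` is true, `-v` means it is false), blocks are consecutive ranges of variables, a smaller last block simply yields fewer identifiers instead of being padded to $2^{\lceil\frac{N}{k}\rceil}$ , and all function names are hypothetical.

```python
from itertools import combinations, product
from math import prod


def first_primes(count):
    """First `count` primes by trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def build_identifiers(num_vars, clauses, k):
    """Return (identifiers, target) for a formula split into k blocks."""
    M = len(clauses)
    p = first_primes(k + M)       # p[0] is p_1
    size = -(-num_vars // k)      # ceil(N / k)
    identifiers = []
    for b in range(1, k + 1):
        block = list(range((b - 1) * size + 1, min(b * size, num_vars) + 1))
        for values in product([False, True], repeat=len(block)):
            assignment = dict(zip(block, values))
            ident = p[b - 1]      # initialise to p_b
            for i, clause in enumerate(clauses, start=1):
                # j = number of literals of clause i satisfied inside this block
                j = sum(
                    1
                    for lit in clause
                    if abs(lit) in assignment and assignment[abs(lit)] == (lit > 0)
                )
                ident *= p[k + i - 1] ** j   # multiply by p_{k+i}^j
            identifiers.append(ident)
    return identifiers, prod(p)


def count_k_subsets(identifiers, k, target):
    """Brute-force count of k-subsets whose product equals the target."""
    return sum(1 for c in combinations(identifiers, k) if prod(c) == target)
```

For the formula with clauses $(x_{1},x_{2},x_{3})$ and $(x_{2},x_{3},x_{4})$ and $k=2$ , the target is $2\cdot 3\cdot 5\cdot 7=210$ , and the subsets that hit it correspond to the assignments in which every clause has exactly one true variable.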

Due to the unique prime factorization, a solution to the $k$ -sized Subset Product corresponds to a solution in $f$ and vice-versa. If the running time of the $k$ -sized Subset Product was $\mathcal{O}(n^{\frac{s_{3}k}{101}})$ then we could solve the above instance in $\mathcal{O}((k2^{\frac{N}{k}})^{\frac{s_{3}k}{101}})$ time.

Since $k=\frac{n}{2^{\lceil\frac{N}{k}\rceil}}$ and $k<n^{0.99}$ , it follows that $\frac{n}{2^{\lceil\frac{N}{k}\rceil}}<n^{0.99}\implies n^{0.99}<2^{99\lceil\frac{N}{k}\rceil}$ . But $k<n^{0.99}$ , which means $k<2^{99\lceil\frac{N}{k}\rceil}$ .

Thus the previous running time becomes $\mathcal{O}(2^{\frac{100}{101}s_{3}N})$ . Both the preprocessing step and the solution of the $k$ -sized Subset Product can be achieved in time $\mathcal{O}(2^{\delta N})$ , where $\delta<s_{3}$ . However, this would violate Proposition 6.2.
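The arithmetic behind the last step can be written out as follows. This is a sketch that, like the proof, treats $\lceil\frac{N}{k}\rceil$ as $\frac{N}{k}$ in the final line; the rounding is assumed to affect only lower-order terms.

```latex
\begin{align*}
\left(k\,2^{\lceil N/k\rceil}\right)^{\frac{s_{3}k}{101}}
  &< \left(2^{99\lceil N/k\rceil}\cdot 2^{\lceil N/k\rceil}\right)^{\frac{s_{3}k}{101}}
     && \text{since } k<2^{99\lceil N/k\rceil} \\
  &= 2^{\frac{100}{101}\,s_{3}\,k\,\lceil N/k\rceil} \\
  &\approx 2^{\frac{100}{101}\,s_{3}\,N}
     && \text{replacing } \lceil N/k\rceil \text{ by } N/k .
\end{align*}
```

Since $\frac{100}{101}s_{3}<s_{3}$ , the exponent is of the form $\delta N$ with $\delta<s_{3}$ , which is what contradicts Proposition 6.2.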

Using the above, we are ready to prove our (matching) lower bound, conditional on $ETH$ .

Theorem 6.5.

Under $ETH$ , there is no $PTAS$ for WLCS with running time $|I|^{o(\frac{1}{\epsilon})}$ , where $|I|$ is the input size in bits.

Proof 6.6.

Suppose that such an algorithm $A(I,\epsilon)$ existed. Let $R()$ be the polynomial time reduction from $k$ -sized Subset Product to WLCS given in the proof of Theorem 5.4. Then, there is a solution to $k$ -sized Subset Product iff there is a solution to WLCS of size $k+1$ , or, equivalently, iff the optimal solution to WLCS is at least $k+1$ .

Using the hypothetical $A(I,\epsilon)$ with an appropriate value of $\epsilon$ , we solve $k$ -sized Subset Product more efficiently than possible, thus reaching a contradiction.

Consider the following algorithm for $k$ -sized Subset Product, where there are $|L|$ numbers in the input, each having $\mathcal{O}\left(\log{|L|}(\log{k}+\log{\log{|L|}})\right)$ bits and $k<|L|^{0.99}$ . Given an instance $(L,k,P)$ , we define the instance for the WLCS to be $I=R(L,k,P)$ . We run $A(I,\frac{1}{2(k+1)})$ and if the output is at least $k+1$ we return that $(L,k,P)$ is satisfied, otherwise we return that it cannot be satisfied.

Note that if $k$ -sized Subset Product is solvable, then $OPT(I)\geq k+1$ , and the value output by $A$ is at least $(1-\frac{1}{2(k+1)})(k+1)=k+\frac{1}{2}>k$ . Since the output is an integer, it is at least $k+1$ . On the other hand, if $k$ -sized Subset Product is not solvable, then $OPT(I)<k+1$ , and obviously the value output by $A$ is at most $k$ .

Thus we found an algorithm for $k$ -sized Subset Product whose running time is $|I|^{o(k)}$ . Since $I$ is obtained by a polynomial time reduction, its size is bounded by a polynomial in $|(L,k,P)|$ . Therefore, the above running time becomes $|(L,k,P)|^{o(k)}$ . Under our assumptions, this becomes $|L|^{o(k)}$ , which is not feasible under $ETH$ , due to Lemma 6.3.

7 Conclusion

In this paper we prove NP-Completeness for the WLCS decision problem, and give a $PTAS$ along with a matching conditional lower bound for the optimization problem. In the most usual setting, where the alphabet size is constant, the above $PTAS$ is in fact an $EPTAS$ , and it is known that no $FPTAS$ can exist unless $P=NP$ . In the Appendix we give a transformation such that algorithms for the WLCS problem can also be applied for the $(a_{1},a_{2})$ -WLCS problem.

In proving that WLCS does not admit any $EPTAS$ , we proved that it is $W[1]$ -hard. It may be interesting to determine the exact complexity of WLCS in the $W$ -hierarchy.

Appendix A One Threshold is Enough

For clarity purposes, some proofs and technical discussions are moved to this appendix. In particular, in this section we show that $(a_{1},a_{2})$ -WLCS and WLCS are equivalent, thus one threshold is enough. Furthermore, we show that the rounding errors introduced by working with reals (logarithms and roots) may cause a similar algorithm from a paper by Cygan et al. [19] to err if standard rounding is used.

In the following, $B$ corresponds to the maximum number of bits to represent a number in the input (a probability or a symbol of the alphabet). $B$ is not to be confused with the word-size $w$ since an input number may need many words to be represented.

Lemma A.1.

Given an instance $(X,Y,a_{1},a_{2},k)$ of $(a_{1},a_{2})$ -WLCS $(a_{1}<a_{2})$ , it is possible to reduce it to an instance $(X^{\prime},Y^{\prime},a,k+1)$ of WLCS. The construction of $X^{\prime}$ and $Y^{\prime}$ requires $\mathcal{O}(n|\Sigma|Mul_{w}(B))$ time, while parameter $a$ is computed in $\mathcal{O}(Mul_{w}(nB)\log{(nB)})$ time, where $n=|X|+|Y|$ is the total length of the weighted sequences $X$ and $Y$ , while $B$ is the maximum number of bits needed to represent an input number.

Proof A.2.

We first provide a sketch of the proof. Our goal is to use the same weighted sequences with one additional position at the end. We introduce a new letter ( ${}^{\prime}\%^{\prime}$ ) which only appears in this position, and we make sure that any correct algorithm picks it, by making its probability very appealing (high). Since we cannot assign a probability higher than one, increasing it is simulated by reducing all other probabilities, in all positions. Knowing that this specific letter is picked at this specific position allows us to choose the two corresponding probabilities in a way that completes the proof. In order for the probabilities to sum to $1$ in every position, we introduce two auxiliary letters ( ${}^{\prime}\#^{\prime}$ and ${}^{\prime}\$^{\prime}$ ) that are never picked ( ${}^{\prime}\$^{\prime}$ never appears on the first weighted sequence, ${}^{\prime}\#^{\prime}$ never appears on the second).

The alphabet $\Sigma^{\prime}$ of $X^{\prime},Y^{\prime}$ is the alphabet $\Sigma$ of $X,Y$ extended by three new letters, $\Sigma^{\prime}=\Sigma\cup\{{}^{\prime}\#^{\prime},{}^{\prime}\$^{\prime},{}^{\prime}\%^{\prime}\}$ . Let $m=\frac{a_{1}}{2}$ and $a=m^{k}a_{1}$ . Notice that since $k\leq n$ , the size of $a$ in bits is only polynomial compared to the input size, not exponential. The new sequences $X^{\prime}$ and $Y^{\prime}$ are constructed as follows:

[TABLE]

All non-specified probabilities are equal to $0$ .

If there exists a solution to $(X,Y,a_{1},a_{2},k)$ , then there exist two increasing subsequences $\pi_{1}=(i_{1},\ldots,i_{k}),\pi_{2}=(j_{1},\ldots,j_{k})$ and a string $s$ such that $P_{X}(\pi_{1},s)\geq a_{1},P_{Y}(\pi_{2},s)\geq a_{2}$ . Define $\pi_{1}^{\prime}=(i_{1},\ldots,i_{k},|X|+1),\pi_{2}^{\prime}=(j_{1},\ldots,j_{k},|Y|+1)$ and $s^{\prime}$ to be equal to $s$ extended with the letter ${}^{\prime}\%^{\prime}$ . It holds that:

$P_{X^{\prime}}(\pi_{1}^{\prime},s^{\prime})=m^{k}P_{X}(\pi_{1},s)\geq m^{k}a_{1}=a,\quad P_{Y^{\prime}}(\pi_{2}^{\prime},s^{\prime})=m^{k}P_{Y}(\pi_{2},s)\frac{a_{1}}{a_{2}}\geq m^{k}a_{2}\frac{a_{1}}{a_{2}}=a$

Conversely, suppose there exists a solution to $(X^{\prime},Y^{\prime},a,k+1)$ . Then, there exist increasing subsequences $\pi_{1}=(i_{1},\ldots,i_{k+1}),\pi_{2}=(j_{1},\ldots,j_{k+1})$ and a string $s$ such that $P_{X^{\prime}}(\pi_{1},s)\geq a,P_{Y^{\prime}}(\pi_{2},s)\geq a$ . First of all, notice that, due to $p^{(X^{\prime})}_{i}({}^{\prime}\$^{\prime})=p^{(Y^{\prime})}_{i}({}^{\prime}\#^{\prime})=0$ for all $i$ , $s$ does not contain letters ${}^{\prime}\$^{\prime}$ and ${}^{\prime}\#^{\prime}$ . In addition, the letter ${}^{\prime}\%^{\prime}$ only appears at the last position, and it is the only possible option for this position. Finally, the last position shall be used on both subsequences, because otherwise $P_{X^{\prime}}(\pi_{1},s),P_{Y^{\prime}}(\pi_{2},s)\leq m^{k+1}<a$ . Thus, the last letter of $s$ is ${}^{\prime}\%^{\prime}$ . If we denote by $s^{\prime}$ the string $s$ without its last letter, it holds that $P_{X}((i_{1},\ldots,i_{k}),s^{\prime})\geq a_{1},P_{Y}((j_{1},\ldots,j_{k}),s^{\prime})\geq a_{2}$ .

The computation of $a$ requires $\mathcal{O}(Mul_{w}(nB)\log{(nB)})$ time due to Corollary 2.6, and the $n|\Sigma|$ -multiplications of two numbers with at most $B$ bits each cost $\mathcal{O}(n|\Sigma|Mul_{w}(B))$ . All other computations take linear time.

We note that [19] proved the same result, but their reduction required computations with real numbers (raising to the $\log_{a_{2}}{a_{1}}$ power). To the best of our knowledge, there is no way to modify that reduction so that it tolerates the rounding error in the word $RAM$ introduced by working with roots and logarithms.

In what follows, we show that the rounding errors may cause the algorithm by Cygan et al. [19], which reduces any instance of WLCS to a more restricted class of instances, to err. This does not rule out the possibility that more clever rounding algorithms (depending on the input size) may indeed be used so that the algorithm does not err; however we are not aware of any such rounding technique, and even if it exists, the algorithm would probably become too complicated compared to ours.

Lemma A.3.

The reduction from $(a_{1},a_{2})$ -WLCS to WLCS with only one threshold given by Cygan et al. in [19] may err, if exact computations with logarithms and roots are not assumed (assuming the rounding technique does not depend on the input, for example it only keeps a constant number of decimal digits).

Proof A.4.

We prove the above with an example that demonstrates that the rounding error, introduced by not assuming exact computations with logarithms and roots, may cause the reduction to err.

Let $a_{1}=\frac{1}{8},a_{2}=\frac{1}{4}$ and the two weighted sequences $X$ and $Y$ on alphabet $\Sigma=\{a,b\}$ be:

[TABLE]

where $0\leq x\leq 1$ is a constant to be specified later. For $x=1$ , the weighted $LCS$ is $aaaa$ and for $x<1$ the weighted $LCS$ is $aaa$ . The transformation described in [19] would give $a=\frac{1}{8},\gamma=\frac{3}{2}$ and the new sequences would be:

[TABLE]

Since $\left(\frac{1}{2}\right)^{\gamma}$ is an irrational number, it is rounded to some number $r=\left\lfloor\left(\frac{1}{2}\right)^{\gamma}\right\rceil$ . Suppose $r<\left(\frac{1}{2}\right)^{\gamma}$ . In this case, when $x=1$ , while the weighted $LCS$ is $aaaa$ the algorithm returns $aaa$ due to the rounding errors. On the other hand, if $r>\left(\frac{1}{2}\right)^{\gamma}$ , we can always find an appropriate $x<1$ such that the weighted $LCS$ should have been $aaa$ but the algorithm returns $aaaa$ due to the rounding errors. To show this, let $x=\left(\frac{k-1}{k}\right)^{2}$ for some integer $k$ . Then $x^{\gamma}=\left(\frac{k-1}{k}\right)^{3}$ . It holds that $\left(\frac{k-1}{k}\right)^{3}r^{2}$ is an increasing function of $k$ which converges to $r^{2}>\frac{1}{8}$ . Thus, we can find a big enough $k$ such that $x^{\gamma}r^{2}\geq\frac{1}{8}$ and err on this particular example, as long as the rounding technique does not depend on the input (for example it only keeps a constant number of decimal digits).
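The search for a suitable $k$ can be tried numerically. The Python sketch below assumes one particular rounding rule (round half to even to a fixed number of decimal digits, using the standard `decimal` module), which is only one instance of an input-independent rounding technique; the function names are hypothetical. Exact `Fraction` arithmetic is used for the comparisons, relying on the fact that $r>\left(\frac{1}{2}\right)^{\frac{3}{2}}$ holds exactly when $r^{2}>\frac{1}{8}$ .

```python
from decimal import Decimal, ROUND_HALF_EVEN
from fractions import Fraction


def rounded_half_to_gamma(digits):
    """Round (1/2)^(3/2) = 1/sqrt(8) to `digits` decimal digits, as an exact Fraction."""
    exact = Decimal(1) / Decimal(8).sqrt()
    quantum = Decimal(1).scaleb(-digits)  # 10^(-digits)
    return Fraction(exact.quantize(quantum, rounding=ROUND_HALF_EVEN))


def witness_k(r):
    """Smallest integer k >= 2 with ((k-1)/k)^3 * r^2 >= 1/8, or None.

    None is returned when r does not overestimate (1/2)^(3/2); in that case
    the proof uses x = 1 instead.
    """
    if r * r <= Fraction(1, 8):
        return None
    k = 2
    while Fraction(k - 1, k) ** 3 * r * r < Fraction(1, 8):
        k += 1
    return k
```

With one decimal digit the rounded value is $r=0.4$ and `witness_k` returns 13, so $x=\left(\frac{12}{13}\right)^{2}<1$ already satisfies $x^{\gamma}r^{2}\geq\frac{1}{8}$ . With two digits the rounded value $0.35$ underestimates $\left(\frac{1}{2}\right)^{\gamma}$ , so the first case of the proof applies.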

Once again, the above is not a proof that the algorithm given by Cygan et al. can never be correct, regardless of the rounding algorithm used. It just shows that it is necessary to explicitly specify such a rounding algorithm in order to construct a correct algorithm.

Bibliography

[1] Amir Abboud, Arturs Backurs, and Virginia Vassilevska Williams. Tight hardness results for LCS and other sequence similarity measures. In IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley, CA, USA, 17-20 October, 2015, pages 59–78, 2015. doi:10.1109/FOCS.2015.14.
[2] Amihood Amir, Eran Chencinski, Costas S. Iliopoulos, Tsvi Kopelowitz, and Hui Zhang. Property matching and weighted matching. Theoretical Computer Science, 395(2-3):298–310, 2008. doi:10.1016/j.tcs.2008.01.006.
[3] Amihood Amir, Zvi Gotthilf, and B. Riva Shalom. Weighted LCS. Journal of Discrete Algorithms, 8(3):273–281, 2010. doi:10.1016/j.jda.2010.02.001.
[4] Amihood Amir, Zvi Gotthilf, and B. Riva Shalom. Weighted shortest common supersequence. In String Processing and Information Retrieval, 18th International Symposium, SPIRE 2011, Pisa, Italy, October 17-21, 2011. Proceedings, pages 44–54, 2011. doi:10.1007/978-3-642-24583-1_6.
[5] Amihood Amir, Tzvika Hartman, Oren Kapah, B. Riva Shalom, and Dekel Tsur. Generalized LCS. Theoretical Computer Science, 409(3):438–449, 2008. doi:10.1016/j.tcs.2008.08.037.
[6] Amihood Amir, Costas S. Iliopoulos, Oren Kapah, and Ely Porat. Approximate matching in weighted sequences. In Combinatorial Pattern Matching, 17th Annual Symposium, CPM 2006, Barcelona, Spain, July 5-7, 2006, Proceedings, pages 365–376, 2006. doi:10.1007/11780441_33.
[7] Pavlos Antoniou, Costas S. Iliopoulos, Laurent Mouchard, and Solon P. Pissis. Algorithms for mapping short degenerate and weighted sequences to a reference genome. International Journal of Computational Biology and Drug Design, 2(4):385–397, 2009. doi:10.1504/IJCBDD.2009.030768.
[8] Alberto Apostolico, Gad M. Landau, and Steven Skiena. Matching for run-length encoded strings. Journal of Complexity, 15(1):4–16, 1999. doi:10.1006/jcom.1998.0493.
