Reoptimization of the Closest Substring Problem under Pattern Length Modification
Jhoirene B. Clemente, Henry N. Adorna

TL;DR
This paper explores reoptimization techniques for the closest substring problem, demonstrating that while the problem remains hard with added pattern length, approximation algorithms can leverage previous solutions to improve efficiency and accuracy.
Contribution
It introduces greedy approximation algorithms utilizing previous solutions for reoptimization, proving their additive error bounds and improving the PTAS runtime.
Findings
Problem remains hard with k=1.
Approximation algorithms have additive error increasing with k.
Reoptimization can slightly improve PTAS runtime.
Abstract
This study investigates whether reoptimization can help in solving the closest substring problem. We consider the following reoptimization scenario. Suppose we have an optimal l-length closest substring of a given set of sequences S. How can this information be beneficial in obtaining an (l+k)-length closest substring for S? In this study, we show that the problem is still computationally hard even with k=1. We present greedy approximation algorithms that make use of the given information and prove that they have an additive error that grows as the parameter k increases. Furthermore, we present hard instances for each algorithm to show that the computed approximation ratio is tight. We also show that we can slightly improve the running time of the existing polynomial-time approximation scheme (PTAS) for the original problem through reoptimization.
Department of Computer Science, University of the Philippines Diliman, Quezon City 1101, Philippines
Keywords: approximation, reoptimization, closest substring problem
CCS Concepts: Mathematics of computing → Combinatorial optimization; Mathematics of computing → Approximation algorithms
1. Introduction
Given a set of sequences $S$ defined over some alphabet and a pattern length $l$ that does not exceed the length of any sequence in $S$, find a string of length $l$ together with one $l$-length substring from each sequence in $S$, such that the total Hamming distance between the string and the chosen substrings is minimized. We call this string the $l$-length *closest substring* of $S$. It is also called the consensus of the chosen substrings. Solutions to this problem have been applied to a variety of pattern identification tasks, ranging from biological sequences to text mining, as well as to other discrete structures such as graphs.
The problem of finding the closest substring is NP-hard (Garey1979), i.e., unless P=NP, there does not exist a polynomial-time exact algorithm for the problem. Therefore, approaches that find near-optimal solutions have been widely used to address the intractability of the problem. Approximation is one of these approaches. Here, algorithms are required to have provable error bounds: one can compute a constant, called the approximation ratio, which serves as a performance guarantee of an algorithm. A hierarchy exists for NP-hard optimization problems: while some admit a constant-factor approximation ratio, others are not even possible to approximate. Examples of inapproximable problems include the unrestricted traveling salesman problem (Sahni1976) and the maximum subgraph problem (Yannakakis1979). Therefore, we use another approach that goes hand in hand with approximation, called reoptimization.
Reoptimization was first mentioned in (Schaffter1997). Reoptimization is used to solve computational problems that are defined over instances that change over time. To illustrate the concept, consider a railway system with an optimal routing schedule. As part of development, new stations or connections will be added to the railway system. As a consequence, a new routing schedule for the new railway system is required. Reoptimization has been applied in similar studies, including finding the shortest path (Nardelli2003), finding the minimum spanning tree (Thorup2000), and some of its variants with edge weights (Ribeiro2007; Cattaneo2010). It has also been used to provide reoptimization solutions for the vehicle routing problem (Secomandi2009) and the facility location problem (Shachnai2012).
For some instances, the optimal routing schedule remains optimal after the modification, but for others, however trivial the modification, the problem of coming up with a new routing schedule remains computationally hard (Bockenhauer2008). In line with this, several studies investigate the benefit of reoptimization when applied to computationally hard problems. For some problems, the given optimal solution provides a good approximate solution to the new instance. Moreover, it was shown that reoptimization can help to either improve the approximability or even provide a PTAS for some problems that are APX-hard (Bockenhauer2008; Zych2012). These results include improvements for the metric traveling salesman problem (Bockenhauer2008), the Steiner tree problem (Hromkovic2009; Bilo2012; Bockenhauer2012), the common superstring problem (Bilo2011), and hereditary graph problems (Boria2012; Boria2012b).
The first application of reoptimization to the closest substring problem (CSP) was shown in our initial work (Clemente2014; Clemente2015). In (Clemente2014), we proved that CSP obeys a certain property called self-reducibility. We also proved that all problems that are polynomial-time reducible to a self-reducible problem admit the same property. The simple idea behind this property is that we can easily break down any given instance of the problem into a smaller instance, such that whenever there exists a solution to the smaller instance, we can easily make it feasible for the larger instance.
Initial findings in (Clemente2015) focused on a reoptimization variant characterized by adding a new sequence to $S$. In other words, we have as additional information the optimal closest substring for a subset of the sequences. Furthermore, we showed that we can obtain an error that grows as the number of additional sequences increases. With the same approximation ratio as the PTAS in (Li1999), we can also improve the running time.
In this paper, we explore the corresponding reoptimization variant of CSP under pattern length modification. We start with the simple case where the pattern length is increased by $1$. Let us define Reopt-CSP$_{l+1}$ as follows: given a set of sequences $S$ and an optimal closest substring of length $l$, find a closest substring of length $l+1$. Later on, we generalize the reoptimization variant to Reopt-CSP$_{l+k}$: given the same set of sequences $S$, we investigate whether a given optimal $l$-length closest substring is beneficial in finding an $(l+k)$-length closest substring.
The paper is organized as follows. In Section 2, we show that even with the additional information regarding the $l$-length closest substring, solving Reopt-CSP$_{l+1}$ and Reopt-CSP$_{l+k}$ remains computationally hard. In Sections 3 and 4, we provide approximation algorithms for Reopt-CSP$_{l+1}$ and Reopt-CSP$_{l+k}$, respectively. In Section 5, we show how reoptimization can be beneficial in improving the running time of the PTAS for CSP. In Section 6, we investigate the case where the pattern length is decreased. Lastly, we conclude the paper in Section 7.
2. Hardness Result
Theorem 2.1.
Reopt-CSP$_{l+1}$ is NP-hard.
Proof.
Towards a contradiction, suppose Reopt-CSP$_{l+1}$ is polynomial-time solvable. Then there is an exact polynomial-time algorithm Alg for Reopt-CSP$_{l+1}$. We now present an iterative algorithm for the closest substring problem that uses Alg. We start with a trivial closest substring of length $1$. For any valid set of sequences, any symbol that is present in all sequences is an optimal solution for $l=1$, except in the trivial case where the sets of symbols appearing in the sequences are disjoint.
Using the optimal closest substring of length $1$, we can obtain an optimal solution of length $2$ in polynomial time using Alg. Iteratively, we can use the optimal solution of length $i$ to get the optimal solution of length $i+1$, for $i = 1, \ldots, l-1$. Ultimately, we arrive at an optimal solution of an arbitrary length $l$ after $l-1$ calls to Alg, that is, in polynomial time. However, the closest substring problem is NP-hard. Thus, Reopt-CSP$_{l+1}$ must also be NP-hard. ∎
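The iterative argument in the proof can be sketched as a short driver loop. This is only an illustration under stated assumptions: `reopt_oracle_stub` is a hypothetical stand-in for the assumed polynomial-time algorithm Alg (here it simply brute-forces the next length, so it is not polynomial), and all function names are invented for this sketch.

```python
from itertools import product


def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def total_cost(pattern, seqs):
    """Sum of distances from pattern to its closest substring in each sequence."""
    l = len(pattern)
    return sum(
        min(hamming(pattern, s[i:i + l]) for i in range(len(s) - l + 1))
        for s in seqs
    )


def brute_force(seqs, l):
    """Exact l-length closest substring by exhaustive search (toy sizes only)."""
    alphabet = sorted(set("".join(seqs)))
    cands = ("".join(p) for p in product(alphabet, repeat=l))
    return min(cands, key=lambda p: (total_cost(p, seqs), p))


def reopt_oracle_stub(seqs, given_pattern):
    """Hypothetical stand-in for Alg: returns an optimal pattern that is one
    symbol longer than the given optimal pattern."""
    return brute_force(seqs, len(given_pattern) + 1)


def solve_csp_iteratively(seqs, target_len, oracle=reopt_oracle_stub):
    """Grow an optimal pattern one symbol at a time by repeated oracle calls."""
    # Base case: a length-1 optimum is found by trying every symbol.
    pattern = brute_force(seqs, 1)
    calls = 0
    while len(pattern) < target_len:
        pattern = oracle(seqs, pattern)
        calls += 1
    return pattern, calls
```

The loop makes exactly `target_len - 1` oracle calls, which is why a polynomial-time oracle would yield a polynomial-time exact algorithm for the original problem.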
Using Theorem 2.1, we have the following corollary.
Corollary 2.2.
Reopt-CSP$_{l+k}$ is NP-hard.
3. Approximation Algorithms
It is natural to think that the given optimal solution already provides a good approximate solution for Reopt-CSP$_{l+1}$. Here, we investigate possible transformations of the given solution in order to obtain a feasible solution for Reopt-CSP$_{l+1}$, as well as the approximation ratio of the best possible solution obtained from such a transformation.
Let the optimal $l$-length closest substring of $S$ be given, together with its occurrences, i.e., the sequence of its closest substrings in $S$. In order to obtain a feasible solution of length $l+1$ from a given $l$-length pattern, we define algorithm EXTEND in Algorithm 1.
Algorithm 1 extends each occurrence either to the left or to the right to obtain an occurrence of length $l+1$. The extended occurrences of length $l+1$ are chosen such that
[TABLE]
is minimized over all combinations of left and right extensions of each occurrence, with respect to their consensus substring. Naively, we can get the best solution from EXTEND in time exponential in the number of sequences. Due to the transformation, the quality of the extended solution with respect to the given solution is
[TABLE]
Since the optimal cost cannot decrease when the pattern length increases in this type of modification, we have an additive approximation ratio of
[TABLE]
From the given computations, we have the following theorem.
Theorem 3.1.
Procedure EXTEND in Algorithm 1 is an approximation algorithm for Reopt-CSP with cost at most which runs in .
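Since Algorithm 1 itself is not reproduced in this text, the following is a minimal sketch of the extension idea described above, not the authors' exact procedure. The names `consensus` and `extend`, the representation of occurrences as start indices, and the tie-breaking rule are all assumptions. One observation that holds for this sketch: each extension adds a single column, so the cost measured against the extended occurrences grows by at most one per sequence.

```python
from collections import Counter
from itertools import product


def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def consensus(substrings):
    """Column-wise majority string; ties go to the smallest symbol (assumption)."""
    return "".join(
        min(Counter(col).items(), key=lambda kv: (-kv[1], kv[0]))[0]
        for col in zip(*substrings)
    )


def extend(seqs, starts, l):
    """Try every left/right combination of one-symbol extensions.

    starts[i] is the start index of the given l-length occurrence in seqs[i].
    Returns (cost, pattern, new_starts), or None if some occurrence cannot be
    extended in either direction.
    """
    options = []
    for s, p in zip(seqs, starts):
        opts = []
        if p - 1 >= 0:              # one more symbol on the left
            opts.append(p - 1)
        if p + l + 1 <= len(s):     # one more symbol on the right
            opts.append(p)
        options.append(opts)

    best = None
    for combo in product(*options):  # at most 2^|seqs| combinations
        subs = [s[q:q + l + 1] for s, q in zip(seqs, combo)]
        v = consensus(subs)
        cost = sum(hamming(v, y) for y in subs)
        if best is None or cost < best[0]:
            best = (cost, v, list(combo))
    return best
```

With `seqs = ["acgt", "tacg", "cgta"]` and the occurrences of the length-2 pattern `cg` at `starts = [1, 2, 0]`, only two combinations are feasible, and both extended alignments cost 3, which equals the previous cost 0 plus one unit per sequence.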
Though it might seem that the shown approximation ratio for Reopt-CSP$_{l+1}$ is a trivial upper bound, we will show that the ratio is indeed tight by exhibiting a set of hard instances for Reopt-CSP$_{l+1}$. Let us consider an instance for Reopt-CSP$_{l+1}$. Let $S$ be the following set of sequences,
[TABLE]
where . The optimal -length closest substring of is with . However, all possible extension of will incur an additional cost of . On the other hand, a suboptimal solution could have been a better option when transformed to . This particular example can be generalized to a set of input instances for Reopt-CSP. The description of such instances is described in the proof of the following claim.
Claim 1.
There exists an instance and a given for Reopt-CSP such that the
[TABLE]
Proof.
We prove the following claim by describing a set instances for Reopt-CSP. Let be the set of sequences defined over the alphabet containing the subset of symbols . The set is defined such that , where is of the following form
[TABLE]
and all the remaining sequences in is described as follows
[TABLE]
For illustration purposes, we have the following alignment.
[TABLE]
The optimal solution of length is the closest substring with . However, the best possible solution from will incur an additional cost of , i.e., . On the other hand, a suboptimal solution , with , can be transformed into the optimal solution , i.e., , with
[TABLE]
Therefore, showing
[TABLE]
[TABLE]
[TABLE]
[TABLE]
∎
Theorem 3.2.
If there exists a -approximation algorithm for CSP, then there exists an approximation algorithm with ratio
[TABLE]
for Reopt-CSP.
Proof.
Let Alg return the better of the two solutions computed by the reoptimization algorithm EXTEND and by an existing approximation algorithm for CSP, i.e., the one with minimum cost. We have the following computation of the approximation ratio of Alg.
[TABLE]
∎
In the following corollary, we identify properties of some input instances where we can actually benefit from the additional information in Reopt-CSP.
Corollary 3.3.
If for some feasible instance , then algorithm EXTEND for Reopt-CSP is an advantage over any existing -approximation algorithm for CSP.
On the contrary, if , it is better to solve from scratch using the existing -approximation algorithm. In this case, the given optimal solution is not beneficial in improving the quality of the solution.
4. Generalization
The procedure EXTEND in Algorithm 1 can be generalized to obtain a feasible solution of length $l+k$. Each occurrence of the given solution is extended by $k$ symbols in total, some taken from its left and the rest from its right. For the rest of our discussion, we refer to these additional substrings as the flanking substrings of the occurrence in its sequence.
A substring of length $l$ can be extended to at most $k+1$ possible substrings of length $l+k$. Procedure K-EXTEND in Algorithm 2 works by trying all possible combinations of left and right extensions of each occurrence.
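The enumeration of extensions can be illustrated as follows. This is a sketch under assumptions rather than a transcription of Algorithm 2, which is not reproduced in this text: `k_extensions` and `k_extend` are hypothetical names, occurrences are given as start indices, and ties in the consensus are broken by the smallest symbol.

```python
from collections import Counter
from itertools import product


def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def consensus(substrings):
    """Column-wise majority string; ties go to the smallest symbol (assumption)."""
    return "".join(
        min(Counter(col).items(), key=lambda kv: (-kv[1], kv[0]))[0]
        for col in zip(*substrings)
    )


def k_extensions(seq, start, l, k):
    """All (new_start, substring) pairs of length l+k containing seq[start:start+l]."""
    out = []
    for left in range(k + 1):        # number of symbols taken from the left flank
        new_start = start - left
        if new_start >= 0 and new_start + l + k <= len(seq):
            out.append((new_start, seq[new_start:new_start + l + k]))
    return out


def k_extend(seqs, starts, l, k):
    """Exhaust every combination of extensions, one per sequence.

    Returns (cost, pattern, new_starts), or None if some occurrence has no
    feasible extension. The loop visits at most (k+1)^len(seqs) combinations.
    """
    options = [k_extensions(s, p, l, k) for s, p in zip(seqs, starts)]
    best = None
    for combo in product(*options):
        subs = [sub for _, sub in combo]
        v = consensus(subs)
        cost = sum(hamming(v, y) for y in subs)
        if best is None or cost < best[0]:
            best = (cost, v, [q for q, _ in combo])
    return best
```

For instance, the occurrence `de` at index 3 of `abcdefgh` with `k = 2` has exactly three extensions (`defg`, `cdef`, `bcde`), while an occurrence at index 0 can only be extended to the right.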
Theorem 4.1.
Procedure K-EXTEND in Algorithm 2 is an approximation algorithm for Reopt-CSP with cost at most which runs in .
Proof.
Exhausting all possible extensions will take steps. Extracting substrings from each sequence will take steps. Therefore, Algorithm 2 has a worst case time complexity of .
For the approximation ratio of K-EXTEND, we use the proof of Theorem 3.2 to give us an upper bound of for Reopt-CSP.
∎
5. Improving the PTAS
Recall the sampling-based PTAS in (Li1999, ). For each parameter , it describes an approximation algorithm for CSP that outputs a solution with
[TABLE]
in time. In this section, we will show that it is also possible to adapt the general idea of the existing PTAS from (Li1999, ) for improving the approximation ratio of K-EXTEND algorithm. Moreover, we argue that we also improve the running time of the existing PTAS for Reopt-CSP.
Note that, by exhausting all possible alignments of substrings in $S$, we can get the optimal closest substring. The PTAS from Li et al. (Li1999) explores a subset of this search space by limiting the number of substrings in the alignments. Instead of exhausting all possible alignments of one substring from every sequence in $S$, the PTAS explores all possible alignments of $r$ substrings present in $S$, where $r$ is a parameter. For some fixed $r$, it is easy to see how the problem admits a polynomial-time approximate solution.
Before we proceed with the discussion of how we aim to improve the PTAS via reoptimization, let us present the following concepts. An $r$-sample from a given instance or a set of sequences, i.e.,
[TABLE]
is a collection of $r$ substrings of length $l$ from $S$. Repetition of substrings is allowed as long as no two substrings are obtained from the same sequence. The set of all possible $r$-samples from $S$ has polynomial size for any fixed $r$. Note that a consensus pattern is polynomial-time computable from a given $r$-sample. This is done by simply taking the column-wise majority symbol from an alignment of the given set of equal-length substrings.
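The two ingredients described above, enumerating samples and taking a column-wise majority, can be sketched as follows. This is a simplified illustration and not the full PTAS of (Li1999): the names `r_samples` and `sample_based_pattern`, the tie-breaking rules, and the reading of a sample as one substring from each of `r` distinct sequences are assumptions made for this sketch.

```python
from collections import Counter
from itertools import combinations, product


def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def total_cost(pattern, seqs):
    """Sum of distances from pattern to its closest substring in each sequence."""
    l = len(pattern)
    return sum(
        min(hamming(pattern, s[i:i + l]) for i in range(len(s) - l + 1))
        for s in seqs
    )


def consensus(substrings):
    """Column-wise majority string; ties go to the smallest symbol (assumption)."""
    return "".join(
        min(Counter(col).items(), key=lambda kv: (-kv[1], kv[0]))[0]
        for col in zip(*substrings)
    )


def r_samples(seqs, l, r):
    """Yield every choice of r length-l substrings, no two from the same sequence."""
    for idxs in combinations(range(len(seqs)), r):
        windows = [
            [seqs[i][p:p + l] for p in range(len(seqs[i]) - l + 1)]
            for i in idxs
        ]
        yield from product(*windows)


def sample_based_pattern(seqs, l, r):
    """Return the sample consensus with the smallest total cost over all sequences."""
    return min(
        (consensus(sample) for sample in r_samples(seqs, l, r)),
        key=lambda v: (total_cost(v, seqs), v),
    )
```

For a fixed `r`, the number of samples visited is polynomial in the input size, which is the reason the sampling step stays tractable while the exhaustive alignment of all sequences does not.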
We present an approximation algorithm for Reopt-CSP$_{l+k}$ in Algorithm 3. The algorithm outputs the better of two feasible solutions. The first solution is obtained by using K-EXTEND on the given optimal solution. The second feasible solution is obtained by minimizing the cost among all $r$-samples obtained from the part of $S$ that is not covered by the occurrences produced by the K-EXTEND algorithm.
We argue that the algorithm can actually skip a portion of the sample search space, thereby maintaining the same approximation ratio while improving the running time. To illustrate the idea of the algorithm, we abstract the sample space in Figure 1.
Suppose each sequence in has a uniform length of . Let us partition the set of sample spaces into regions as shown in Figure 1. Recall the definition of an -sample in the previous section. Region consists of the set of all -samples obtained from the occurrences of . Regions , , and consist of the set of -samples obtained from the occurrences of including flanking substrings to the left and right of each occurrence. We have the following illustration to visualize the set of substrings where samples from regions , , and were taken from.
The above illustration captures the fact that occurrences of the given solution in $S$ may not necessarily align in terms of their starting positions. Without loss of generality, we assume that flanking substrings on the left and right of each occurrence exist in $S$. The remaining parts of $S$ that are not considered in Figure 2 comprise the samples in the remaining region.
Theorem 5.1.
Algorithm 3 is a approximation algorithm for Reopt-CSP which runs in .
Proof.
Algorithm 3 uses K-EXTEND, followed by the sampling step in lines 4-10. For small values of $k$, the total running time is dominated by the sampling step. However, for large values of $k$, the algorithm is dominated by the running time of K-EXTEND.
An algorithm has to cover all possible $r$-samples in $S$ in order to achieve the same approximation ratio as the PTAS. This is equivalent to covering all samples that are obtained from all regions in Figure 1. The K-EXTEND algorithm already covers the regions built from the occurrences of the given solution and their flanking substrings. Due to the exhaustiveness of K-EXTEND, its feasible solution has the minimum cost among the samples obtained from these regions. The remaining space that is not covered by K-EXTEND is handled by the sampling-based approach in lines 4 to 10 of Algorithm 3, thus maintaining the same approximation ratio as the PTAS in (Li1999). ∎
We can see in this scenario how the amount of information is useful in Algorithm 3. As we have more information about the optimal solution, or equivalently, if we are given an optimal solution for a longer pattern, i.e., we have a smaller value of $k$, then we can actually get an advantage over the existing PTAS. But if we have little information, i.e., a larger value of $k$, it is advisable to solve the problem from scratch, as it is much more expensive to start from the given solution and obtain a solution of longer length through the K-EXTEND algorithm.
Reoptimization in this case is helpful and its benefit scales up as $k$ decreases. The observation in Reopt-CSP is analogous to our result in the previous section for Reopt-CSP.
6. Decreasing the Pattern Length
We studied the reoptimization variant of CSP when the pattern length is increased. In this section, we investigate the case where we look for a smaller pattern length. Let us use Reopt-CSP$_{l-1}$ and Reopt-CSP$_{l-k}$ to denote the cases where the pattern length is decreased by $1$ and $k$, respectively. It is natural to think that these variants are easier than their counterparts with increased pattern length, i.e., that we can always find the shorter closest substring inside the longer closest substring. This is true for some cases. However, we will show some instances where the shorter closest substring is totally different from the longer one. Consider the following instance.
[TABLE]
In the given instance, we can observe that and can be totally different in terms of their edit distance even for binary alphabets, i.e., is defined over the alphabet . The optimal -length closest substring has cost equal to . Meanwhile, the optimal -length substring has cost equal to [math]. The the occurrences of the optimal closest substring of smaller length is not necessarily contained in the occurrences of longer closest substring, which can be observed even if the pattern length is decreased by . We can generalize the description to an arbitrary length . For and , we can always describe an instance such that we cannot transform the given solution to obtain a modified solution for Reopt-CSP. In such instances, the given optimal -length closest substring has cost equal to and an optimal -length closest substring with cost equal to [math].
The relationship of Reopt-CSP and Reopt-CSP is different as compared with the first reoptimization variant that we studied in (Clemente2015, ). In Reopt-CSP and Reopt-CSP, the hardest instance for Reopt-CSP remains to be the hardest for Reopt-CSP in terms of approximability, whereas in reoptimization variants where the pattern length is involved, the hard instance for Reopt-CSP is not easily realizable from the hard instance of Reopt-CSP.
7. Conclusion
In this paper, we showed that the reoptimization variant of CSP under pattern length modifications for both the simple case and its generalization are NP-hard. We presented simple greedy algorithms called EXTEND and K-EXTEND in Algorithms 1 and 2, respectively. We used a simple idea where the algorithms transform the given optimal solution to become feasible for the instance with longer closest substring length. The running time of algorithm EXTEND is exponential in with solutions that has a worst case approximation ratio of . Furthermore, we present a set of hard instances for EXTEND to show that the approximation ratio that we computed is tight. The scenario happens only when the cardinality of the alphabet exceeds . We isolated the case where we can actually have an advantage over any existing -approximation algorithm. As a corollary, we showed that we can benefit from K-EXTEND if , for any existing -approximation algorithm for CSP. We also presented an analogous result from our previous work in (Clemente2015, ) regarding the running time improvement over the existing PTAS in (Li1999, ). Here, we showed that we can maintain the same approximation ratio while saving running time for Reopt-CSP. For value of parameter , reoptimization variant Reopt-CSP can be more beneficial for CSP compared to Reopt-CSP.
References
- (1) Davide Bilò, Hans-Joachim Böckenhauer, Dennis Komm, Richard Královič, Tobias Mömke, Sebastian Seibert, and Anna Zych. Reoptimization of the shortest common superstring problem. Algorithmica (New York), 61(2):227–251, 2011.
- (2) Davide Bilò and Anna Zych. New Advances in Reoptimizing the Minimum Steiner Tree Problem. In Proc. of the Mathematical Foundations of Computer Science, LNCS, 7464:184–197, 2012.
- (3) Hans-Joachim Böckenhauer, Karin Freiermuth, Juraj Hromkovič, Tobias Mömke, Andreas Sprock, and Björn Steffen. Steiner tree reoptimization in graphs with sharpened triangle inequality. Journal of Discrete Algorithms, 11(1):73–86, February 2012.
- (4) Nicolas Boria, J Monnot, and VT Paschos. Reoptimization of maximum weight induced hereditary subgraph problems. Theoretical Computer Science, pages 1–12, 2012.
- (5) Nicolas Boria, Jérôme Monnot, Vangelis Th Paschos, Davide Bilò, Peter Widmayer, and Anna Zych. Reoptimization of the Maximum Weighted Pk-Free Subgraph Problem under Vertex Insertion. 5426:76–87, January 2012.
- (6) Giuseppe Cattaneo, Pompeo Faruolo, Umberto Ferraro Petrillo, and Giuseppe Italiano. Maintaining dynamic minimum spanning trees: An experimental study. Discrete Applied Mathematics, 158(5):404–425, 2010.
- (7) Jhoirene Clemente, Jeffrey Aborot, and Henry Adorna. Reoptimization of Motif Finding Problem. In Proc. of the International Multi Conference of Engineers and Computer Scientists, volume I, pages 106–111, 2014.
- (8) Jhoirene Clemente, Jeffrey Aborot, and Henry Adorna. On self-reducibility and reoptimization of the closest substring problem. Philippine Computing Journal, 10(2):1–7, 2016.
