The Hybrid k-Deck Problem: Reconstructing Sequences from Short and Long Traces
Ryan Gabrys, Olgica Milenkovic

TL;DR
This paper introduces the hybrid k-deck problem, combining traditional sequence reconstruction with partial subsequences, providing bounds for the minimal k needed for accurate reconstruction, motivated by DNA sequencing applications.
Contribution
It defines the hybrid k-deck problem, derives bounds for the minimal k in single and multiple subsequence cases, and extends classical sequence reconstruction theory.
Findings
Bounds for k in single subsequence case: [log t+2, min{t+1, O(√(n(1+log t)))}]
Extension to multiple subsequences by aggregation and applying single-trace results
Motivated by nanopore sequencing for DNA data storage
The Hybrid k-Deck Problem: Reconstructing Sequences from Short and Long Traces
Ryan Gabrys (1, 2) and Olgica Milenkovic (1)
(1) ECE Department, University of Illinois, Urbana-Champaign; (2) Spawar Systems Center, Pacific
Abstract
We introduce a new variant of the $k$-deck problem, which in its traditional formulation asks for determining the smallest $k$ that allows one to reconstruct any binary sequence of length $n$ from the multiset of its $k$-length subsequences. In our version of the problem, termed the hybrid $k$-deck problem, one is given a certain number of special subsequences of the sequence of length $n-t$, $t>0$, and the question of interest is to determine the smallest value of $k$ such that the $k$-deck, along with the subsequences, allows for reconstructing the original sequence in an error-free manner. We first consider the case that one is given a single subsequence of the sequence of length $n-t$, obtained by deleting zeros only, and seek the value of $k$ that allows for hybrid reconstruction. We prove that in this case, $k\in[\log t+2,\min\{t+1,O(\sqrt{n(1+\log t)})\}]$. We then proceed to extend the single-subsequence setup to the case where one is given $M$ subsequences of length $n-t$ obtained by deleting zeroes only. In this case, we first aggregate the asymmetric traces and then invoke the single-trace results. The analysis and problem at hand are motivated by nanopore sequencing problems for DNA-based data storage.
I Introduction
The $k$-deck of a sequence of length $n$ is the multiset of all its subsequences of length $k$. A sequence that is uniquely defined by its $k$-deck is termed $k$-deck reconstructable. The $k$-deck problem is to determine , the smallest value of $k$ such that any sequence of length $n$ is reconstructable from its $k$-deck. The problem was first described in [8], where it was also shown that . The first lower bounds were established in [19], and improved bounds were described in [9] and [16]. The $k$-deck problem is also closely related to a number of other reconstruction problems that have received significant attention, such as trace reconstruction [2], reconstruction of graphs from subgraphs [3], and set reconstruction based on multiset information [1].
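As a concrete illustration of the definition, the following minimal sketch builds the $k$-deck of a short binary sequence as a multiset. The helper name `k_deck` is introduced here for illustration only and does not come from the paper.

```python
from collections import Counter
from itertools import combinations


def k_deck(seq, k):
    """Return the multiset of all length-k subsequences of seq as a Counter."""
    # combinations() picks k positions in increasing order, so every
    # subsequence is counted once per way of selecting it.
    return Counter(combinations(seq, k))


# Illustrative input: a length-4 sequence and its 2-deck.
deck = k_deck((1, 1, 1, 0), 2)
```

The deck of a length-$n$ sequence always contains $\binom{n}{k}$ elements counted with multiplicity, which is why brute-force enumeration of this kind is only practical for short sequences.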
The $k$-deck problem may be viewed as an abstracted version of a DNA nanopore sequencing problem [12]. In this context, a string is passed through the nanopore multiple times, and at each pass a trace sequence is produced. Sequencing traces arise due to insertion, deletion and substitution edits in the original sequence and are usually of variable length. For simplicity, we consider traces obtained via deletions only, all of which have the same length. One issue in nanopore sequencing that was observed in the experimental study of the authors [18] is that the biological “nanopore channels” tend to degrade in time: the sequences produced in the first hour of sequencing usually contain fewer errors (i.e., fewer deletions) and are hence longer than the sequences produced later in the process. Furthermore, early deletion errors appear to be context dependent, insofar as so-called purine symbols (bases) show larger error rates than pyrimidine symbols (the DNA bases A and G are called purines, while C and T are called pyrimidines). We abstract this observation by assuming that the “good” sequencing channels are asymmetric, insofar as they delete only purines. In this case, it suffices to focus on analyzing binary sequences only, as “0” may be used to designate purines, and “1” may be used to designate pyrimidines.
The above discussion motivates the introduction of a “hybrid” sequence reconstruction problem, in which one is given a small set of long (length $n-t$, $t>0$), asymmetric subsequences of a sequence $\boldsymbol{x}$, and asked to determine the shortest length of a large set of shorter (length $k$) subsequences that allows for unique reconstruction of $\boldsymbol{x}$. We refer to this problem as the hybrid $k$-deck problem. Our results on the hybrid $k$-deck problem include lower and upper bounds on the smallest $k$ that allows for exact sequence reconstruction, for the case that only one asymmetric sequence of length $n-t$ is given, or for the case that $M$ such sequences are available. A related, simpler problem is that of hybrid $k$-substring reconstruction, in which the $k$-deck is replaced by the set of all substrings of $\boldsymbol{x}$ of length $k$. This previously unexplored problem is relevant in the context of DNA sequence reconstruction from a combination of short (e.g., Illumina [11]) and long (e.g., Oxford Nanopore [12]) reads, and will be discussed elsewhere.
The paper is organized as follows. In Section II, we introduce the problem and derive upper and non-asymptotic lower bounds on the hybrid $k$-deck size for the case that one long sequence is observed. In this setting, we show that under some constraints for , we have . For , we show that the upper bound is tight. We also consider the case of large , in which case significantly smaller $k$-decks are needed for reconstruction. In Section III, we consider the scenario when $M$ subsequences of $\boldsymbol{x}$ of length $n-t$ are available, along with the sequence’s $k$-deck, and describe a simple trace aggregation procedure that maps the problem to that of one asymmetric trace-aided reconstruction.
II Problem Statement and Single Trace Analysis
We introduce the hybrid $k$-deck problem, where one is asked to find the minimum value of $k$, denoted by , such that any binary sequence $\boldsymbol{x}$ may be reconstructed given $M$ subsequences of $\boldsymbol{x}$ of length $n-t$ obtained by deleting zeros only, and the $k$-deck of $\boldsymbol{x}$ (note that the subsequences in the $k$-deck are obtained via deletions of both zeroes and ones). Clearly, we require that , and mostly focus on constant values of where . Nevertheless, we provide some results for the case as well. Furthermore, we start our analysis with the case $M=1$ and refer to the problem as the multi-deck problem. In this case, the goal is to find the minimum value of $k$, denoted by , such that reconstruction is possible given a single length-$(n-t)$ subsequence of $\boldsymbol{x}$ obtained by deleting zeros only, and the $k$-deck of $\boldsymbol{x}$.
Example 1
Suppose that $\boldsymbol{x}=(1,1,1,0)$ and that $\boldsymbol{y}=(1,1,1)$ is the observed subsequence of $\boldsymbol{x}$ of length $3$. In this case, we may reconstruct $\boldsymbol{x}$ given $\boldsymbol{y}$ and the $2$-deck of $\boldsymbol{x}$, which is the multiset
$\{(1,1),(1,1),(1,1),(1,0),(1,0),(1,0)\}$.
(Observe that given the $k$-deck, one can uniquely reconstruct the $k'$-decks for any $k'\leq k$.) Note that reconstructing $\boldsymbol{x}$ is straightforward since we know that only symbols of value $0$ may have been deleted: since $(1,0)$ appears three times in the $2$-deck, it follows that to obtain $\boldsymbol{x}$ from $\boldsymbol{y}$ we need to insert $0$ in the last position of $\boldsymbol{y}$. The $1$-deck does not suffice for reconstruction.
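The example can be checked by brute force. The sketch below enumerates every way of inserting one zero into the trace and keeps the candidates whose deck matches the deck of the sequence $(1,1,1,0)$ given above. The trace $(1,1,1)$ is taken to be that sequence with its single zero removed, and the helper names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def k_deck(seq, k):
    """Multiset of all length-k subsequences of seq."""
    return Counter(combinations(seq, k))


def zero_insertions(trace, t):
    """All distinct sequences obtained by inserting t zeros into trace."""
    results = {tuple(trace)}
    for _ in range(t):
        results = {s[:i] + (0,) + s[i:] for s in results for i in range(len(s) + 1)}
    return results


def hybrid_candidates(trace, t, deck, k):
    """Supersequences of trace (t zeros inserted) whose k-deck equals deck."""
    return [s for s in sorted(zero_insertions(trace, t)) if k_deck(s, k) == deck]


x = (1, 1, 1, 0)
y = (1, 1, 1)  # assumed trace: x with its only zero deleted
with_2_deck = hybrid_candidates(y, 1, k_deck(x, 2), 2)
with_1_deck = hybrid_candidates(y, 1, k_deck(x, 1), 1)
```

With the 2-deck a single candidate survives, while the 1-deck only records the number of zeros and ones and leaves all four insertion positions possible.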
The following claim formalizes the above observation and establishes a connection between Varshamov-Tenengoltz (VT) codes [17, 15] and the hybrid $k$-deck problem.
Claim 1
For any positive integer , .
Proof:
Following the approach of [16], let denote the number of subsequences of of length that end with a one. Then,
[TABLE]
In particular, we are interested in , in which case and . Let
[TABLE]
and set . Thus, where It is known from [17] that is a code capable of correcting a single deletion so that there exists a decoder for that can uniquely determine given and . This proves the claim. ∎
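The link to VT codes can be made concrete with a small sketch. It relies on one elementary identity: each one at (1-indexed) position $i$ is the last symbol of exactly $i-1$ length-2 subsequences, so the number of length-2 subsequences ending in a one, plus the number of ones, equals the VT checksum $\sum_i i\,x_i$. The decoder below is a brute-force stand-in for a real VT decoder, and all function names and the test sequence are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def vt_checksum(seq):
    """Sum of i * x_i over 1-indexed positions."""
    return sum(i * s for i, s in enumerate(seq, start=1))


def checksum_from_2_deck(deck2, weight):
    """Recover sum(i * x_i) from the 2-deck and the number of ones."""
    ending_in_one = deck2[(0, 1)] + deck2[(1, 1)]
    return ending_in_one + weight


def decode_single_zero_deletion(trace, checksum):
    """Brute force: single-zero insertions whose checksum matches mod n + 1."""
    n = len(trace) + 1
    candidates = {trace[:i] + (0,) + trace[i:] for i in range(n)}
    return [c for c in candidates if (vt_checksum(c) - checksum) % (n + 1) == 0]


x = (1, 0, 1, 1, 0, 0, 1)
y = (1, 0, 1, 1, 0, 1)  # one zero deleted from x
deck2 = Counter(combinations(x, 2))
checksum = checksum_from_2_deck(deck2, sum(y))  # the trace keeps every one of x
decoded = decode_single_zero_deletion(y, checksum)
```

Because only zeros are deleted, the weight of the trace equals the weight of the original sequence, so the 2-deck and the trace together supply everything the checksum needs.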
Corollary 1
For a positive integer , .
Theorem 2
For positive integers and , one has .
Proof:
Let denote the -deck of and let denote the -deck of . For , let denote the number of subsequences in that start with ones and end with a zero, and similarly, let denote the number of subsequences in that start with ones and end with a zero. Suppose that where correspond to the positions of the zeros deleted in that lead to (For simplicity, we omit the arguments of whenever the meaning is clear from the context). As an example, if and {\boldsymbol{x}}=({\color[rgb]{1,0,0}0},0,{\color[rgb]{1,0,0}0},1,0), then . For an integer , let denote the number of ones that appear in before position . For example, if , then and .
Next, note that the difference equals
[TABLE]
as deleting a zero at position $k_i$ reduces the count of the sequences compared to by $\binom{1_{\boldsymbol{x}}(k_i)}{j}$.
Let $R=\{1_{\boldsymbol{x}}(k_1),\ldots,1_{\boldsymbol{x}}(k_t)\}$ and let be a polynomial with its set of roots equal to $R$. It is straightforward to see that given for , we may uniquely recover the $j$-th power sum symmetric polynomials over $R$ recursively. Recall that the $j$-th power sum symmetric polynomial over the variables is defined as
[TABLE]
Using Newton’s identities [14] one may evaluate the elementary symmetric polynomials over $R$ based on the power sum symmetric polynomials over $R$. The elementary symmetric polynomials are defined as
[TABLE]
[TABLE]
[TABLE]
Thus, we can recover the polynomial and the elements of $R$. This allows us to determine $\boldsymbol{x}$ from $\boldsymbol{y}$ and $R$. ∎
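The counting argument in the proof can be sketched in code. The number of subsequences consisting of $j$ ones followed by a zero is obtained by summing $\binom{r}{j}$ over the zeros of a sequence, where $r$ is the number of ones preceding that zero, so the difference between the counts for the original sequence and for the trace equals $\sum_{r\in R}\binom{r}{j}$. The sketch below swaps Newton's identities for an exhaustive search over candidate multisets, which is only a stand-in suitable for small inputs. The example sequence and all helper names are assumptions.

```python
from collections import Counter
from itertools import combinations, combinations_with_replacement
from math import comb


def count_ones_then_zero(seq, j):
    """Number of subsequences of seq equal to j ones followed by a zero."""
    total, ones_seen = 0, 0
    for s in seq:
        if s == 1:
            ones_seen += 1
        else:
            total += comb(ones_seen, j)
    return total


def recover_R(x_counts, trace, t):
    """All multisets R with sum(C(r, j) for r in R) equal to the count gap, j = 1..t."""
    gaps = [x_counts[j] - count_ones_then_zero(trace, j) for j in range(1, t + 1)]
    return [R for R in combinations_with_replacement(range(sum(trace) + 1), t)
            if all(sum(comb(r, j) for r in R) == gaps[j - 1] for j in range(1, t + 1))]


def insert_zeros(trace, R):
    """Insert one zero right after the r-th one of trace for every r in R."""
    out, ones_seen = [0] * R.count(0), 0
    for s in trace:
        out.append(s)
        if s == 1:
            ones_seen += 1
            out.extend([0] * R.count(ones_seen))
    return tuple(out)


x = (1, 0, 1, 1, 0, 0, 1, 0)
y = (1, 1, 1, 0, 0, 1)  # zeros preceded by 1 and by 4 ones were deleted
t = len(x) - len(y)
# Counts read off the (j+1)-decks of x, all of which the (t+1)-deck determines.
x_counts = {j: Counter(combinations(x, j + 1))[(1,) * j + (0,)] for j in range(1, t + 1)}
solutions = recover_R(x_counts, y, t)
rebuilt = insert_zeros(y, solutions[0])
```

Inserting the zero immediately after the $r$-th one is enough, since zeros within the same run are indistinguishable.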
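We now turn our attention to lower bounds. We use the following notation: For a vector $\boldsymbol{x}$, we let ${\cal D}_{t}(\boldsymbol{x})$ denote the set of all sequences that may be obtained by deleting $t$ zeros from $\boldsymbol{x}$. Also, for a $\boldsymbol{y}\in{\cal D}_{t}(\boldsymbol{x})$, we say that $\boldsymbol{y}$ is an asymmetric subsequence (or subsequence for short) of $\boldsymbol{x}$ and that $\boldsymbol{x}$ is an asymmetric supersequence (or supersequence for short) of $\boldsymbol{y}$.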
Lemma 3
For all positive integers and , one has .
Proof:
Assume that . Then, there exist two distinct binary vectors with the same -deck and such that and . From [10], we have that the -deck of is equal to the -deck of . Clearly, and . Thus, we have two sequences and each of length sharing the same -deck and containing the subsequence of length Therefore, as desired. ∎
Theorem 4
For , .
Proof:
Let and . Then, and from repeated application of Lemma 3, we have This establishes the claim. (For a related use of the infinite Morse-Thue sequence and its complement, the interested reader is referred to [6]).
∎
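A doubling construction in the spirit of the Morse-Thue remark above can be checked numerically. The sketch assumes the standard recursion $u\mapsto uv$, $v\mapsto vu$ starting from $u=(0)$, $v=(1)$; the exact sequences used in the proof are not legible in this copy, so this is an illustration rather than the authors' construction. After $m$ doubling steps the two words have length $2^m$, share the same $m$-deck, and both reduce to the all-ones word once every zero is deleted.

```python
from collections import Counter
from itertools import combinations


def k_deck(seq, k):
    """Multiset of all length-k subsequences of seq."""
    return Counter(combinations(seq, k))


def doubled_pair(m):
    """Apply the doubling u, v -> uv, vu a total of m times, starting from (0), (1)."""
    u, v = (0,), (1,)
    for _ in range(m):
        u, v = u + v, v + u
    return u, v


def delete_all_zeros(seq):
    """The asymmetric trace obtained by deleting every zero."""
    return tuple(s for s in seq if s == 1)


checks = {}
for m in range(1, 5):
    u, v = doubled_pair(m)
    checks[m] = (u != v,
                 k_deck(u, m) == k_deck(v, m),
                 delete_all_zeros(u) == delete_all_zeros(v))
```

With $t=2^{m-1}$ deleted zeros, such a pair shows that the $m$-deck together with one asymmetric trace cannot always identify the sequence, which is consistent with a lower bound of the form $\log t+2$.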
Using Theorem 4, we show next that the upper bound of Theorem 2 is tight for .
Corollary 5
For , provided that .
Proof:
The claim for follows from Lemma 1. The previous theorem established the result for . The claim for follows by observing that and share a common supersequence of length and have the same -deck. For , the bound follows from the existence of two sequences - and - which share a common length subsequence and have the same -deck. ∎
Let , where denotes the weight of the vector . The next lemma provides an improvement of the result of Theorem 2 for the case that and . Similar to [13], we make use of the following result from [4].
Lemma 6
(cf. [4]) There is an absolute constant such that every polynomial of the form:
[TABLE]
has at most zeros at one.
Theorem 7
If , where , then any sequence may be reconstructed given an asymmetric trace and a $k$-deck of with
[TABLE]
where is a constant.
Proof:
The result follows by counting the number of subsequences from the -deck that start with ones, for , and end with a zero, denoted by . For , let denote its complement and assume that . Furthermore, suppose that has ones and recall that . Let be a vector with elements defined as follows: For , equals the number of zeros between the -th and -th one in (We tacitly assume that a one is pre-pended and a one is appended to the sequence first). For example, if , then .
Note that similarly to our previous approach, we may write
[TABLE]
By linearly combining the counts for different values of we can determine
[TABLE]
Suppose next that , , and let and have the same -deck. In addition, assume that there exists a sequence such that and . Define in a manner analogous to . Then
[TABLE]
for . Let
[TABLE]
Furthermore, let be the -th partial derivative of evaluated at . Note that if (1) holds, then
[TABLE]
holds as well. Letting , we have
[TABLE]
Assume that the degree of the polynomial is and observe that for any , , since by assumption, there exists a such that and . Define ; satisfies the conditions of Lemma 6, so that it has at most zeros at one, which implies
[TABLE]
Substituting proves the claim. ∎
The previous result improves upon Theorem 2 for the case when . For large values of , an alternative approach is to discard the vector and reconstruct using only the -deck for according to [9], [16]. For the case when , Theorem 7 improves upon the best known result in the literature [9], which asserts that . The following corollary summarizes Theorem 2 and Theorem 7.
Corollary 8
For such that and where ,
[TABLE]
III The Multitrace Reconstruction Problem
We focus next on the scenario where one is given $M$ trace sequences of length $n-t$, each of which is obtained by deleting zeros from $\boldsymbol{x}$. The question of interest is to determine the minimum value of $k$, denoted by , such that it is possible to reconstruct $\boldsymbol{x}$ given the set along with the $k$-deck of $\boldsymbol{x}$.
For a set and a sequence , let denote the set obtained by pre-pending to every element in the vector . For instance if and , then . For a vector $\boldsymbol{v}$, let ${\cal I}_{t}(\boldsymbol{v})$ denote the set of vectors that may be obtained by inserting $t$ zeros into $\boldsymbol{v}$. For instance, if $\boldsymbol{v}=(0,1)$, then ${\cal I}_{1}(\boldsymbol{v})=\{(0,0,1),(0,1,0)\}$.
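The zero-insertion set defined here and the zero-deletion set defined in Section II are easy to enumerate for short vectors. The following sketch uses hypothetical helper names `insert_zeros` and `delete_zeros` and is meant only to make the two definitions concrete.

```python
from itertools import combinations


def insert_zeros(v, t):
    """All distinct vectors obtained by inserting t zeros into v."""
    results = {tuple(v)}
    for _ in range(t):
        results = {s[:i] + (0,) + s[i:] for s in results for i in range(len(s) + 1)}
    return results


def delete_zeros(x, t):
    """All distinct vectors obtained by deleting t zeros from x."""
    zero_positions = [i for i, s in enumerate(x) if s == 0]
    return {tuple(s for i, s in enumerate(x) if i not in drop)
            for drop in map(set, combinations(zero_positions, t))}


inserted = insert_zeros((0, 1), 1)
x = (0, 0, 0, 1, 0)
deleted = delete_zeros(x, 2)
# Every asymmetric trace of x has x among its asymmetric supersequences.
round_trip = all(x in insert_zeros(y, 2) for y in deleted)
```

The two operations are inverse to each other in the sense checked by `round_trip`: a vector lies in the deletion set of $\boldsymbol{x}$ exactly when $\boldsymbol{x}$ lies in its insertion set.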
Lemma 9
For positive integers,
[TABLE]
and
[TABLE]
Proof:
Let and suppose that we have two sequences and such that have the same -deck and such that there exists a (i.e., and share a trace of length ). Clearly, under this setup, have the same -deck.
First, note that . Since , we also have . Furthermore, since , one also has . Let , say . Then, are such that for all , . Thus, . The statement in the lemma follows now by setting .
For the case that and have odd length, we let the alternating sequence have length , and . In this case, we get . Substituting gives the second expression. ∎
Example 2
Suppose that and that . Let , , and observe that is a common subsequence of both and . Then, we may choose , such that and where and . Thus, we have sequences each of length where each sequence is a subsequence of both and . Since and have the same -deck, it follows from that .
We now turn our attention to an upper bound. Let and be as defined in the previous lemmas. In addition, reserve , for the sequence of obtained by counting the occurrences of zeros between ones as described in the proof of Theorem 7.
Lemma 10
For positive integers , and , .
Proof:
Suppose that , and let and . Observe that is non-increasing in , hence it suffices to analyze the case only. Furthermore, since otherwise . Since , we can identify and correct at least one deletion since we can find at least one run of zeros in that underwent a deletion. Let be the vector which results from correcting this deletion in . Then, the minimum -deck required to reconstruct given and is at most which proves the statement in the lemma. ∎
Corollary 11
For , and ,
[TABLE]
Example 3
Suppose that so that . Assume that we observe the following subsequences of length of , . Hence, $\mathbf{X}^{(1)}=(0,0,0,1,1)$ and $\mathbf{X}^{(2)}=(0,1,0,1,0)$. Let $\bar{\mathbf{X}}$ be given according to $\bar{X}_{i}=\max\{X^{(1)}_{i},X^{(2)}_{i}\}$. Then, and . Note that and that is the result of deleting a zero from . Let denote the number of occurrences of the subsequence in and similarly, let denote the number of occurrences of the subsequence in . Since , we need to add one to the value at the third position of to obtain . From , we can then recover .
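The aggregation step of this example can be sketched as follows. Each trace is represented by its vector of zero-run lengths, and the aggregate is the entry-wise maximum. The run vector assumed for the original sequence, $(0,1,1,1,1)$, is an inference from the statement that one must add one at the third position; it is not legible in this copy and should be read as an assumption. The last step uses the fact that the number of $(1,0)$ subsequences of a sequence with run vector $u$ (entries indexed from zero) is $\sum_i i\,u_i$.

```python
def aggregate(run_vectors):
    """Entry-wise maximum of the zero-run vectors of several traces."""
    return tuple(max(column) for column in zip(*run_vectors))


def count_one_zero_pairs(runs):
    """Number of (1, 0) subsequences: a zero in run i is preceded by i ones."""
    return sum(i * r for i, r in enumerate(runs))


X1 = (0, 0, 0, 1, 1)
X2 = (0, 1, 0, 1, 0)
X_bar = aggregate([X1, X2])

x_runs = (0, 1, 1, 1, 1)  # assumed run vector of the original sequence
# Exactly one zero is still missing, so the gap in (1, 0) counts equals the
# number of ones preceding it, which is also the index of the run to extend.
gap = count_one_zero_pairs(x_runs) - count_one_zero_pairs(X_bar)
recovered = tuple(r + (i == gap) for i, r in enumerate(X_bar))
```

This shortcut only applies when a single zero remains to be restored after aggregation; with more missing zeros one falls back on the single-trace analysis of Section II.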
Next, we consider the case when $M$ is sufficiently large to guarantee a significant reduction in the value of the deck length $k$. In our proofs, we make use of the following claims.
Claim 2
Let , be such that there exists a such that . Let be the smallest possible integer for which and suppose that . Then, .
Proof:
The result follows by noting that for any two strings such that , we have for . Here, and denote the -analogues of and . ∎
Example 4
Suppose that and so that and . Then may be formed by taking the maximum element of and , . This gives . Observe that if is any asymmetric supersequence of and , then for , we require and similarly which implies that since .
Claim 3
Suppose that . Then, for , one has
[TABLE]
Proof:
Let be the alternating string of length and suppose that , is an arbitrary binary string of length that contains at least one run of zeros of length (i.e., the substring ).
We first show that when . The proof proceeds by induction. We first establish the base case. For and for an arbitrary , . Furthermore, for any , it is straightforward to see since has at most runs of zeros. Next, for the inductive step, suppose that and assume that the claim holds for all . Suppose the first occurrence of in from the left starts at position . We partition the set as follows:
- : The set of all sequences in in which the zero between the positions and is not deleted.
- : The set of all sequences in in which the zero between the positions and is deleted.
We partition the set similarly:
- : The set of sequences in that start with zero.
- : The set of sequences in that start with one.
Note that where and that . Also, where is the length sequence obtained by deleting the string starting at index from . In addition, . Since , we can apply the inductive hypothesis to determine and , which implies when .
Consider next the case when is any length- vector that has no runs of zeros of length one, and let . In this case, since has at most runs of zeros, and $|{\cal D}_{t}(\boldsymbol{a})|\geqslant\binom{\lfloor n/2\rfloor}{t}$. Since $\binom{\lfloor n/2\rfloor}{t}\geqslant\left(n/3\right)^{t}$ when , the result follows. ∎
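The size of the zero-deletion set of an alternating string can be verified by enumeration. In the sketch below the alternating string is assumed to start with a one, so it contains $\lfloor n/2\rfloor$ zeros, each forming its own run; distinct choices of deleted zeros then give distinct traces and the set has exactly $\binom{\lfloor n/2\rfloor}{t}$ elements. The helper names are hypothetical.

```python
from itertools import combinations
from math import comb


def delete_zeros(x, t):
    """All distinct vectors obtained by deleting t zeros from x."""
    zero_positions = [i for i, s in enumerate(x) if s == 0]
    return {tuple(s for i, s in enumerate(x) if i not in drop)
            for drop in map(set, combinations(zero_positions, t))}


def alternating(n):
    """Alternating string 1, 0, 1, 0, ... of length n (starting symbol assumed)."""
    return tuple((i + 1) % 2 for i in range(n))


sizes_match = all(
    len(delete_zeros(alternating(n), t)) == comb(n // 2, t)
    for n in range(2, 11)
    for t in range(n // 2 + 1)
)

# By contrast, a single long run of zeros yields one trace only.
long_run_size = len(delete_zeros((1, 0, 0, 0, 0, 1), 2))
```

The contrast with the long-run example shows why strings with many short runs of zeros have the largest deletion sets.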
Using the previous claims, we can establish upper and lower bounds on .
Lemma 12
For integers , let $m_{0}=\left\lfloor\frac{\log M}{\log n}+(n-t)\right\rfloor$. Then, for ,
[TABLE]
Proof:
Under the assumptions of Claim 2 applied to sequences, we seek the smallest possible length sequence , , such that . According to Claim 3, for we have
[TABLE]
Since $\binom{\lceil\frac{m}{2}\rceil}{t-n+m}\leqslant\left(\lceil\frac{m}{2}\rceil\right)^{t-n+m}$, if
[TABLE]
then
[TABLE]
Hence, has length at least and . We can determine the sequence given the length subsequence and its -deck. ∎
Lemma 13
For integers , let $m=\left\lceil\frac{\log M}{-\log\left(2\left(1-\frac{n-t}{n-t+1}\right)\right)}+(n-t)\right\rceil$. Then, for ,
[TABLE]
Proof:
Under the assumptions of Claim 2 applied to sequences, we need to determine the minimum length sequence , , such that . Without loss of generality, assume that is the alternating string. Then, . Since , if $m=\left\lceil\frac{\log M}{-\log\left(2\left(1-\frac{n-t}{n-t+1}\right)\right)}+(n-t)\right\rceil$, then . Hence, . ∎
Theorem 14
For integers ,
[TABLE]
where $\left\lfloor\frac{\log M}{\log n}+(n-t)\right\rfloor\leqslant m\leqslant\left\lceil\frac{\log M}{-\log\left(2\left(1-\frac{n-t}{n-t+1}\right)\right)}+(n-t)\right\rceil$, and .
Invoking the results of the previous section, we arrive at the following corollary.
Corollary 15
Suppose that and $M=\binom{\frac{m}{2}}{t-n+m}+1$ where $m$ is an even integer. If , then
[TABLE]
Acknowledgement. This research was supported in part by the NSF grants CIF CCS 1526875 and 1618366, and the NSF STC Center for Science of Information at Purdue University.
References
- [1] J. Acharya, H. Das, O. Milenkovic, A. Orlitsky, and S. Pan, “String reconstruction from substring compositions,” SIAM Journal on Discrete Mathematics, vol. 29, no. 3, pp. 1340-1371, 2015.
- [2] T. Batu, S. Kannan, S. Khanna, and A. McGregor, “Reconstructing strings from random traces,” in Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 910-918, Society for Industrial and Applied Mathematics, 2004.
- [3] J. A. Bondy and R. L. Hemminger, “Graph reconstruction a survey,” Journal of Graph Theory, vol. 1, no. 3, pp. 227-268, 1977.
- [4] P. Borwein, T. Erdelyi, and G. Kos, “Littlewood-type problems on [0,1],” Proc. London Math. Soc., vol. 79, no. 1, pp. 22-46, 1999.
- [5] C. Choffrut and J. Karhumaki, “Combinatorics of words,” in Handbook of Formal Languages, vol. I, Springer, Berlin, 1997, pp. 329-438.
- [6] M. Dudik and L. J. Schulman, “Reconstruction from subsequences,” Journal of Combinatorial Theory, vol. 103, no. 2, pp. 337-348, 2003.
- [7] R. Gabrys and E. Yaakobi, “Sequence reconstruction over the deletion channel,” Proc. IEEE ISIT, Barcelona, 2016.
- [8] L. O. Kalashnik, “The reconstruction of a word from fragments,” Numerical Mathematics and Computer Technology, Akad. Nauk. Ukrain. SSR Inst. Mat., Preprint IV, pp. 56-57, 1973.
