The Hybrid k-Deck Problem: Reconstructing Sequences from Short and Long Traces

Ryan Gabrys; Olgica Milenkovic

arXiv:1701.08111·cs.IT·January 30, 2017

TL;DR

This paper introduces the hybrid k-deck problem, in which a binary sequence is reconstructed from its k-deck together with a few long subsequences obtained by deleting zeros only. It provides bounds on the minimal k needed for exact reconstruction and is motivated by DNA sequencing applications.

Contribution

It defines the hybrid k-deck problem, derives bounds for the minimal k in single and multiple subsequence cases, and extends classical sequence reconstruction theory.

Findings

01. Bounds for k in the single-subsequence case: k ∈ [log t + 2, min{t + 1, O(√(n(1 + log t)))}]

02. Extension to multiple subsequences by aggregating the traces and applying the single-trace results

03. Motivated by nanopore sequencing for DNA data storage

Abstract

We introduce a new variant of the $k$ -deck problem, which in its traditional formulation asks for determining the smallest $k$ that allows one to reconstruct any binary sequence of length $n$ from the multiset of its $k$ -length subsequences. In our version of the problem, termed the hybrid $k$ -deck problem, one is given a certain number of special subsequences of the sequence of length $n-t$ , $t>0$ , and the question of interest is to determine the smallest value of $k$ such that the $k$ -deck, along with the subsequences, allows for reconstructing the original sequence in an error-free manner. We first consider the case that one is given a single subsequence of the sequence of length $n-t$ , obtained by deleting zeros only, and seek the value of $k$ that allows for hybrid reconstruction. We prove that in this case, $k\in[\log t+2,\min\{t+1,O(\sqrt{n\cdot(1+\log t)})\}]$ . We…

Equations

$$\Big\{(1,1),(1,1),(1,{\color[rgb]{1,0,0}0}),(1,1),(1,{\color[rgb]{1,0,0}0}),(1,{\color[rgb]{1,0,0}0})\Big\}.$$

$$n_{i}=\sum_{j=1}^{n}\binom{j-1}{i-1}\cdot x_{j}.$$

$$S({\boldsymbol{x}})=n_{1}+n_{2}=\sum_{j=1}^{n}j\cdot x_{j},$$

$$\binom{1_{{\boldsymbol{x}}}(k_{1})}{j}+\binom{1_{{\boldsymbol{x}}}(k_{2})}{j}+\cdots+\binom{1_{{\boldsymbol{x}}}(k_{t})}{j},$$

$$p_{j}(a_{1},\ldots,a_{m})=\sum_{i=1}^{m}a_{i}^{j}.$$

$$e_{0}(R)=1,\quad e_{1}(R)=1_{{\boldsymbol{x}}}(k_{1})+\cdots+1_{{\boldsymbol{x}}}(k_{t}),\quad\ldots$$

$$e_{t-1}(R)=\sum_{i_{1}<i_{2}<\cdots<i_{t-1}}1_{{\boldsymbol{x}}}(k_{i_{1}})\cdots 1_{{\boldsymbol{x}}}(k_{i_{t-1}}),$$

$$e_{t}(R)=1_{{\boldsymbol{x}}}(k_{1})\cdots 1_{{\boldsymbol{x}}}(k_{t}).$$

$$p(x)=\sum_{j=0}^{n}a_{j}\cdot x^{j},\quad|a_{j}|\leqslant 1,\quad a_{j}\in\mathbb{C},$$

$$k\leqslant c\sqrt{N\cdot(1+\epsilon\log N)},$$

$$n_{{\boldsymbol{x}},1^{j}0}=\sum_{\ell=1}^{n}\binom{1_{{\boldsymbol{x}}}(\ell)}{j}\cdot\bar{x}_{\ell}=\sum_{\ell=1}^{N}\binom{\ell-1}{j}\cdot X_{\ell}.$$

$$s_{j}({\boldsymbol{x}})=\sum_{\ell=1}^{N}\ell^{j}\cdot X_{\ell}.$$

$$s_{j}({\boldsymbol{x}})=\sum_{\ell=1}^{N}\ell^{j}\cdot X_{\ell}=\sum_{\ell=1}^{N}\ell^{j}\cdot U_{\ell}=s_{j}({\boldsymbol{u}}),$$

$$p_{{\boldsymbol{x}}}(z)=\sum_{\ell=0}^{N}X_{\ell}\cdot z^{\ell},\quad p_{{\boldsymbol{u}}}(z)=\sum_{\ell=0}^{N}U_{\ell}\cdot z^{\ell}.$$

$$\left(\frac{\partial^{j}}{\partial z^{j}}p_{{\boldsymbol{x}}}(z)\right)_{z=1}=\left(\frac{\partial^{j}}{\partial z^{j}}p_{{\boldsymbol{u}}}(z)\right)_{z=1}$$

$$(1-z)^{k_{m}}\mid P(z).$$

$$k_{m}\leqslant c\sqrt{N\cdot(1+\log t)}.$$

$$f(n,t)\leqslant\min\Big\{N^{\epsilon}+1,O\Big(\sqrt{N\cdot(1+\epsilon\log N)}\Big)\Big\}.$$

$$f(2r,r-M+1,M)\geqslant f(2(r-M),r-M),$$

$$f(2r+1,r-M+1,M+1)\geqslant f(2(r-M),r-M).$$

$$f(n,t,M)=t.$$

$$\max_{{\boldsymbol{z}}\in\{0,1\}^{n}}|{\cal D}_{t}({\boldsymbol{z}})|\leqslant\binom{\lceil\frac{n}{2}\rceil}{t}.$$

$$f(n,t,M)\leqslant f(n,n-m_{0}).$$

$$|{\cal D}_{t-(n-m)}({\boldsymbol{z}})|\leqslant\binom{\lceil\frac{m}{2}\rceil}{t-(n-m)}.$$

$$m=m_{0}=\Big\lfloor\frac{\log M}{\log n}+(n-t)\Big\rfloor,$$

$$M>|{\cal D}_{t-(n-m)}({\boldsymbol{z}})|.$$

$$f(n,t,M)\geqslant f(n,n-m).$$

$$f(n,t,M)=f(n,n-m)$$

$$f(n,t,M)=n-m+1.$$

Taxonomy

Topics: Algorithms and Data Compression · DNA and Biological Computing · Forensic and Genetic Research

Full text

The Hybrid $k$ -Deck Problem: Reconstructing Sequences from Short and Long Traces

Ryan Gabrys$^{1,2}$ and Olgica Milenkovic$^{1}$

$^{1}$ECE Department, University of Illinois, Urbana-Champaign; $^{2}$Spawar Systems Center, Pacific

Abstract

We introduce a new variant of the $k$ -deck problem, which in its traditional formulation asks for determining the smallest $k$ that allows one to reconstruct any binary sequence of length $n$ from the multiset of its $k$ -length subsequences. In our version of the problem, termed the hybrid $k$ -deck problem, one is given a certain number of special subsequences of the sequence of length $n-t$ , $t>0$ , and the question of interest is to determine the smallest value of $k$ such that the $k$ -deck, along with the subsequences, allows for reconstructing the original sequence in an error-free manner. We first consider the case that one is given a single subsequence of the sequence of length $n-t$ , obtained by deleting zeros only, and seek the value of $k$ that allows for hybrid reconstruction. We prove that in this case, $k\in[\log t+2,\min\{{t+1,O(\sqrt{n\cdot(1+\log t)})\}}]$ . We then proceed to extend the single-subsequence setup to the case where one is given $M$ subsequences of length $n-t$ obtained by deleting zeroes only. In this case, we first aggregate the asymmetric traces and then invoke the single-trace results. The analysis and problem at hand are motivated by nanopore sequencing problems for DNA-based data storage.

I Introduction

The $k$ -deck of a sequence ${\boldsymbol{x}}$ of length $n$ is the multiset of all its subsequences of length $k$ . A sequence that is uniquely defined by its $k$ -deck is termed $k$ -deck reconstructable. The $k$ -deck problem is to determine $f(n)$ , the smallest value of $k$ such that any sequence ${\boldsymbol{x}}$ of length $n$ is reconstructable from its $k$ -deck. The problem was first described in [8], where it was also shown that $f(n)\leqslant\lfloor n/2\rfloor$ . The first lower bounds were established in [19], and improved bounds were described in [9] and [16]. The $k$ -deck problem is also closely related to a number of other reconstruction problems that have received significant attention, such as trace reconstruction [2], reconstruction of graphs from subgraphs [3], and set reconstruction based on multiset information [1].

The $k$ -deck problem may be viewed as an abstracted version of a DNA nanopore sequencing problem [12]. In this context, a string is passed through the nanopore multiple times, and at each pass a trace sequence is produced. Sequencing traces arise due to insertion, deletion, and substitution edits in the original sequence and are usually of variable length. For simplicity, we consider traces obtained via deletions only, all of which have the same length. One issue in nanopore sequencing that was observed in the experimental study of the authors [18] is that the biological “nanopore channels” tend to degrade over time: the sequences produced in the first hour of sequencing usually contain fewer errors (i.e., fewer deletions) and are hence longer than the sequences produced later in the process. Furthermore, early deletion errors appear to be context dependent, insofar as so-called purine symbols (bases) show larger error rates than pyrimidine symbols. (The DNA bases $A$ and $G$ are called purines, while $T$ and $C$ are called pyrimidines.) We abstract this observation by assuming that the “good” sequencing channels are asymmetric, in that they delete only purines. In this case, it suffices to focus on analyzing binary sequences only, as “0” may be used to designate purines, and “1” may be used to designate pyrimidines.

The above discussion motivates the introduction of a “hybrid” sequence reconstruction problem, in which one is given a small set of long (length $n-t$ , $t>0$ ), asymmetric subsequences of a sequence ${\boldsymbol{x}}$ , and asked to determine the shortest length of a large set of shorter (length $k$ ) subsequences that allows for unique reconstruction of ${\boldsymbol{x}}$ . We refer to this problem as the hybrid $k$ -deck problem. Our results on the hybrid $k$ -deck problem include lower and upper bounds on the smallest $k$ that allows for exact sequence reconstruction, for the case that only one asymmetric sequence of length $n-t$ is given, or for the case that $M$ such sequences are available. A related, simpler problem is that of hybrid $k$ -substring reconstruction, in which the $k$ -deck is replaced by the set of all substrings of ${\boldsymbol{x}}$ of length $k$ . This previously unexplored problem is relevant in the context of DNA sequence reconstruction from a combination of short (i.e., Illumina [11]) and long (i.e., Oxford Nanopore [12]) reads, and will be discussed elsewhere.

The paper is organized as follows. In Section II, we introduce the problem and derive upper and non-asymptotic lower bounds on the hybrid $k$ -deck size for the case that one long sequence is observed. In this setting, we show that under some constraints on $t$ , we have $\log t+2\leqslant k\leqslant\min\{t+1,O(\sqrt{n\cdot(1+\log t)})\}$ . For $t\leqslant 4$ , we show that the upper bound is tight. We also consider the case of large $t$ , in which case significantly smaller $k$ -decks are needed for reconstruction. In Section III, we consider the scenario in which $M$ subsequences of ${\boldsymbol{x}}$ of length $n-t$ are available, along with the sequence’s $k$ -deck, and describe a simple trace aggregation procedure that maps the problem to that of reconstruction aided by a single asymmetric trace.

II Problem Statement and Single Trace Analysis

We introduce the hybrid $(t,k,M)$ $k$ -deck problem, where one is asked to find the minimum value of $k$ , denoted by $f(n,t,M)$ , such that any binary sequence ${\boldsymbol{x}}$ may be reconstructed given $M$ subsequences ${\cal U}=\{\underline{{\boldsymbol{x}}}_{1},\ldots,\underline{{\boldsymbol{x}}}_{M}\}$ of ${\boldsymbol{x}}$ of length $n-t$ obtained by deleting zeros only, and the $k$ -deck of ${\boldsymbol{x}}$ (note that the subsequences in the $k$ -deck are obtained via deletions of both zeros and ones). Clearly, we require that $k<n-t$ , and we mostly focus on constant values of $t$ or on $t=o(n)$ . Nevertheless, we provide some results for the case $t=O(n)$ as well. Furthermore, we start our analysis with the case $M=1$ and refer to the problem as the $(t,k)$ multi-deck problem. In this case, the goal is to find the minimum value of $k$ , denoted by $f(n,t)$ , such that reconstruction is possible given a single length $n-t$ subsequence $\underline{{\boldsymbol{x}}}$ of ${\boldsymbol{x}}$ obtained by deleting zeros only, and the $k$ -deck of ${\boldsymbol{x}}$ .

Example 1

Suppose that ${\boldsymbol{x}}=(1,1,1,{\color[rgb]{1,0,0}0})$ and that $\underline{{\boldsymbol{x}}}=(1,1,1)$ is the observed subsequence of ${\boldsymbol{x}}$ of length $n-1=3$ . In this case, we may reconstruct ${\boldsymbol{x}}$ given $\underline{{\boldsymbol{x}}}$ and the $2$ -deck of ${\boldsymbol{x}}$ , denoted by ${\cal X}$ ,

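$${\cal X}=\Big\{(1,1),(1,1),(1,{\color[rgb]{1,0,0}0}),(1,1),(1,{\color[rgb]{1,0,0}0}),(1,{\color[rgb]{1,0,0}0})\Big\}.$$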

(Observe that given the $k$ -deck, one can uniquely reconstruct the $\ell$ -deck for any $\ell<k$ .) Reconstructing ${\boldsymbol{x}}$ is straightforward since we know that only symbols of value $0$ may have been deleted: since $(1,{\color[rgb]{1,0,0}0})$ appears three times in ${\cal X}$ , the missing zero is preceded by three ones, and hence to obtain ${\boldsymbol{x}}$ from $\underline{{\boldsymbol{x}}}$ we need to insert a $0$ in the last position. The $1$ -deck does not suffice for reconstruction, as it only reveals the number of zeros and ones in ${\boldsymbol{x}}$ .
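The reasoning in Example 1 can be checked by brute force. The following Python sketch is illustrative only: the helper names `k_deck`, `insert_zeros`, and `hybrid_candidates` are our own and do not come from the paper. It enumerates every sequence obtainable by inserting $t$ zeros into the observed trace and keeps those whose $k$ -deck matches the given one.

```python
from collections import Counter
from itertools import combinations, combinations_with_replacement

def k_deck(x, k):
    """Multiset of all length-k subsequences of x."""
    return Counter(tuple(x[i] for i in idx)
                   for idx in combinations(range(len(x)), k))

def insert_zeros(trace, t):
    """All distinct sequences obtained by inserting t zeros into trace."""
    out = set()
    # Choose, with repetition, the gap (0..len(trace)) that receives each zero.
    for gaps in combinations_with_replacement(range(len(trace) + 1), t):
        extra = Counter(gaps)
        seq = []
        for pos in range(len(trace) + 1):
            seq.extend([0] * extra[pos])
            if pos < len(trace):
                seq.append(trace[pos])
        out.add(tuple(seq))
    return out

def hybrid_candidates(trace, t, deck, k):
    """Sequences consistent with both the trace and the k-deck."""
    return {x for x in insert_zeros(trace, t) if k_deck(x, k) == deck}

x = (1, 1, 1, 0)
trace = (1, 1, 1)
with_two_deck = hybrid_candidates(trace, 1, k_deck(x, 2), 2)
with_one_deck = hybrid_candidates(trace, 1, k_deck(x, 1), 1)
print(with_two_deck)       # a single candidate, namely x itself
print(len(with_one_deck))  # several candidates: the 1-deck is not enough
```

The enumeration is exponential in $t$ and is only meant for toy instances such as this one.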

The following claim formalizes the above observation and establishes a connection between Varshamov-Tenengoltz (VT) codes [17, 15] and the $f(n,1)$ hybrid $k$ -deck problem.

Claim 1

For any positive integer $n\geqslant 2$ , $f(n,1)\leqslant 2$ .

Proof:

Following the approach of [16], let $n_{i}$ denote the number of subsequences of ${\boldsymbol{x}}=(x_{1},\ldots,x_{n})$ of length $i$ that end with a one. Then,

$$n_{i}=\sum_{j=1}^{n}\binom{j-1}{i-1}\cdot x_{j}.$$

In particular, we are interested in $i\in\{1,2\}$ , in which case $n_{1}=\sum_{j=1}^{n}x_{j}$ and $n_{2}=\sum_{j=1}^{n}(j-1)\cdot x_{j}$ . Let

$$S({\boldsymbol{x}})=n_{1}+n_{2}=\sum_{j=1}^{n}j\cdot x_{j},$$

and set $a=S({\boldsymbol{x}})\bmod(n+1)$ . Thus, ${\boldsymbol{x}}\in{\cal C}(n,a)$ where ${\cal C}(n,a)=\{{\boldsymbol{x}}:\sum_{i=1}^{n}i\cdot x_{i}\equiv a\bmod(n+1)\}.$ It is known from [17] that ${\cal C}(n,a)$ is a code capable of correcting a single deletion so that there exists a decoder for ${\cal C}(n,a)$ that can uniquely determine ${\boldsymbol{x}}$ given $\underline{{\boldsymbol{x}}}$ and $a$ . This proves the claim. ∎
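A minimal sketch of the idea behind Claim 1 follows. It is not the VT decoder of [17]: as a simplifying assumption it uses the exact value of the checksum $S({\boldsymbol{x}})=\sum_{j}j\cdot x_{j}$ rather than its residue modulo $n+1$ , and the function names are hypothetical. Inserting a zero shifts every one to its right by one position, so the difference between $S({\boldsymbol{x}})$ and the checksum of the trace equals the number of ones to the right of the missing zero.

```python
from itertools import combinations

def checksum_from_counts(x):
    """S(x) = n_1 + n_2, using only information available from the 2-deck."""
    n1 = sum(x)  # length-1 subsequences that end with a one
    n2 = sum(1 for i, j in combinations(range(len(x)), 2) if x[j] == 1)
    return n1 + n2

def recover_single_zero(trace, s):
    """Re-insert the one deleted zero into trace, given s = S(x)."""
    s_trace = sum((i + 1) * b for i, b in enumerate(trace))
    ones_to_right = s - s_trace
    # Walk from the right until exactly ones_to_right ones have been passed.
    pos, seen = len(trace), 0
    while seen < ones_to_right:
        pos -= 1
        seen += trace[pos]
    return trace[:pos] + (0,) + trace[pos:]

x = (1, 0, 1, 1, 0, 0, 1)
trace = x[:4] + x[5:]  # delete the zero at index 4
print(recover_single_zero(trace, checksum_from_counts(x)) == x)
```

Any insertion position inside the same run of zeros yields the same sequence, which is why the position found by the right-to-left scan suffices.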

Corollary 1

For a positive integer $n\geqslant 2$ , $f(n,1)=2$ .

Theorem 2

For positive integers $n\geqslant 2$ and $t<n$ , one has $f(n,t)\leqslant t+1$ .

Proof:

Let ${\cal X}$ denote the $(t+1)$ -deck of ${\boldsymbol{x}}$ and let $\underline{{\cal X}}$ denote the $(t+1)$ -deck of $\underline{{\boldsymbol{x}}}$ . For $j\in[t]$ , let $n_{{\boldsymbol{x}},1^{j}0}$ denote the number of subsequences in ${\cal X}$ that start with $j$ ones and end with a zero, and similarly, let $n_{\underline{{\boldsymbol{x}}},1^{j}0}$ denote the number of subsequences in $\underline{{\cal X}}$ that start with $j$ ones and end with a zero. Suppose that $I({\boldsymbol{x}},\underline{{\boldsymbol{x}}})=\{{k_{1},k_{2},\ldots,k_{t}\}},$ where $k_{1}<k_{2}<\cdots<k_{t}$ correspond to the positions of the zeros deleted in ${\boldsymbol{x}}$ that lead to $\underline{{\boldsymbol{x}}}$ (For simplicity, we omit the arguments of $I({\boldsymbol{x}},\underline{{\boldsymbol{x}}})$ whenever the meaning is clear from the context). As an example, if $I=\{1,3\}$ and ${\boldsymbol{x}}=({\color[rgb]{1,0,0}0},0,{\color[rgb]{1,0,0}0},1,0)$ , then $\underline{{\boldsymbol{x}}}=(0,1,0)$ . For an integer $m\leqslant n$ , let $1_{{{\boldsymbol{x}}}}(m)$ denote the number of ones that appear in ${{\boldsymbol{x}}}$ before position $m$ . For example, if ${{\boldsymbol{x}}}=(0,0,0,1,0)$ , then $1_{{{\boldsymbol{x}}}}(2)=0$ and $1_{{{\boldsymbol{x}}}}(5)=1$ .

Next, note that the difference $n_{{\boldsymbol{x}},1^{j}0}-n_{\underline{{\boldsymbol{x}}},1^{j}0}$ equals

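$$\binom{1_{{\boldsymbol{x}}}(k_{1})}{j}+\binom{1_{{\boldsymbol{x}}}(k_{2})}{j}+\cdots+\binom{1_{{\boldsymbol{x}}}(k_{t})}{j},$$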

as deleting a zero at position $k_{i}$ reduces the count of the $n_{\underline{{\boldsymbol{x}}},1^{j}0}$ sequences compared to $n_{{\boldsymbol{x}},1^{j}0}$ by $\left(\begin{array}[]{c}1_{{{\boldsymbol{x}}}}(k_{i})\\ j\end{array}\right).$

Let $R=\Big\{1_{{\boldsymbol{x}}}(k_{1}),\ldots,1_{{\boldsymbol{x}}}(k_{t})\Big\}$ and let $F(x)$ be a polynomial with its set of roots equal to $R$ . It is straightforward to see that given $n_{{\boldsymbol{x}},1^{j}0}-n_{\underline{{\boldsymbol{x}}},1^{j}0},$ for $1\leqslant j\leqslant t$ , we may uniquely recover the $j$ -th power sum symmetric polynomials over $R$ recursively. Recall that the $j$ -th power sum symmetric polynomial over the variables $a_{1},a_{2},\ldots,a_{m}$ is defined as

$$p_{j}(a_{1},\ldots,a_{m})=\sum_{i=1}^{m}a_{i}^{j}.$$

Using Newton’s identities [14] one may evaluate the elementary symmetric polynomials $e_{i},\,i=1,\ldots,t,$ over $R$ based on the power sum symmetric polynomials over $R$ . The elementary symmetric polynomials are defined as

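$$\begin{aligned}e_{0}(R)&=1,\quad e_{1}(R)=1_{{\boldsymbol{x}}}(k_{1})+\cdots+1_{{\boldsymbol{x}}}(k_{t}),\quad\ldots\\ e_{t-1}(R)&=\sum_{i_{1}<i_{2}<\cdots<i_{t-1}}1_{{\boldsymbol{x}}}(k_{i_{1}})\cdots 1_{{\boldsymbol{x}}}(k_{i_{t-1}}),\\ e_{t}(R)&=1_{{\boldsymbol{x}}}(k_{1})\cdots 1_{{\boldsymbol{x}}}(k_{t}).\end{aligned}$$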

Thus, we can recover the polynomial $F(x)$ and the elements of $R$ . This allows us to determine ${\boldsymbol{x}}$ from $R$ and $\underline{{\boldsymbol{x}}}$ . ∎
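The recovery of $R$ in the proof above can be sketched in Python. This is one possible realization under our own assumptions, not the authors' procedure: the binomial sums $\sum_{i}\binom{r_{i}}{j}$ are converted to power sums through Stirling numbers of the second kind (our choice for the "recursive" step), Newton's identities then give the elementary symmetric polynomials, and the integer roots of $F$ are peeled off by synthetic division. All function names are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial

def count_1j0(x, j):
    """Number of subsequences of x of the form 1^j 0."""
    total, ones = 0, 0
    for b in x:
        if b == 1:
            ones += 1
        else:
            total += comb(ones, j)
    return total

def stirling2(n, k):
    """Stirling number of the second kind S(n, k)."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

def recover_R(d, upper):
    """d[j-1] = sum_i C(r_i, j) for j = 1..t; returns the sorted multiset R."""
    t = len(d)
    # Power sums, using r^j = sum_m S(j, m) * m! * C(r, m).
    p = [sum(stirling2(j, m) * factorial(m) * d[m - 1] for m in range(1, j + 1))
         for j in range(1, t + 1)]
    # Newton's identities: i * e_i = sum_{j=1}^{i} (-1)^(j-1) * e_{i-j} * p_j.
    e = [Fraction(1)]
    for i in range(1, t + 1):
        e.append(sum((-1) ** (j - 1) * e[i - j] * p[j - 1]
                     for j in range(1, i + 1)) / i)
    # F(z) = prod_i (z - r_i) = sum_i (-1)^i e_i z^(t-i), highest degree first.
    coeffs = [(-1) ** i * e[i] for i in range(t + 1)]
    roots = []
    for r in range(upper + 1):
        while len(coeffs) > 1:
            q = [coeffs[0]]
            for c in coeffs[1:]:
                q.append(c + r * q[-1])  # synthetic division by (z - r)
            if q[-1] != 0:
                break
            roots.append(r)
            coeffs = q[:-1]
    return roots

x = (0, 1, 0, 1, 1, 0, 0, 1, 0)
deleted = {2, 5, 8}  # 0-based indices of the deleted zeros
trace = tuple(b for i, b in enumerate(x) if i not in deleted)
t = len(deleted)
d = [count_1j0(x, j) - count_1j0(trace, j) for j in range(1, t + 1)]
print(recover_R(d, upper=sum(x)))
```

Each recovered value is the number of ones preceding a deleted zero, so ${\boldsymbol{x}}$ follows by re-inserting one zero after that many ones in the trace.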

We now turn our attention to lower bounds. We use the following notation: For a vector ${\boldsymbol{v}}\in\{0,1\}^{n}$ , we let ${\cal D}_{t}({\boldsymbol{v}})\subseteq\{0,1\}^{n-t}$ denote the set of all sequences that may be obtained by deleting $t$ zeros from ${\boldsymbol{v}}$ . Also, for a ${\boldsymbol{v}}^{\prime}\in{\cal D}_{t}({\boldsymbol{v}})$ , we say that ${\boldsymbol{v}}^{\prime}$ is an asymmetric subsequence (or subsequence for short) of ${\boldsymbol{v}}$ and that ${\boldsymbol{v}}$ is an asymmetric supersequence (or supersequence for short) of ${\boldsymbol{v}}^{\prime}$ .

Lemma 3

For all positive integers $n\geqslant 2$ and $t<n$ , one has $f(2n,2t)\geqslant f(n,t)+1$ .

Proof:

Assume that $f(n,t)=k+1$ . Then, there exist two distinct binary vectors ${\boldsymbol{x}},{\boldsymbol{y}}\in\{0,1\}^{n}$ with the same $k$ -deck and such that $\underline{{\boldsymbol{x}}}\in{\cal D}_{t}({\boldsymbol{x}})$ and $\underline{{\boldsymbol{x}}}\in{\cal D}_{t}({\boldsymbol{y}})$ . From [10], we have that the $(k+1)$ -deck of ${\boldsymbol{x}}{\boldsymbol{y}}$ is equal to the $(k+1)$ -deck of ${\boldsymbol{y}}{\boldsymbol{x}}$ . Clearly, $\underline{{\boldsymbol{x}}}\underline{{\boldsymbol{x}}}\in{\cal D}_{2t}({\boldsymbol{x}}{\boldsymbol{y}})$ and $\underline{{\boldsymbol{x}}}\underline{{\boldsymbol{x}}}\in{\cal D}_{2t}({\boldsymbol{y}}{\boldsymbol{x}})$ . Thus, we have two sequences ${\boldsymbol{x}}{\boldsymbol{y}}$ and ${\boldsymbol{y}}{\boldsymbol{x}},$ each of length $2n,$ sharing the same $(k+1)$ -deck and containing the subsequence $\underline{{\boldsymbol{x}}}\underline{{\boldsymbol{x}}}$ of length $2n-2t$ . Therefore, $f(2n,2t)\geqslant k+2=f(n,t)+1,$ as desired. ∎

Theorem 4

For $t\leqslant\frac{n}{2}$ , $f(n,t)\geqslant\log t+2$ .

Proof:

Let ${\boldsymbol{x}}=01$ and ${\boldsymbol{y}}=10$ . Then, $f(2,1)\geqslant 2$ and from repeated application of Lemma 3, we have $f(2^{s},2^{s-1})\geqslant s+1.$ This establishes the claim. (For a related use of the infinite Morse-Thue sequence and its complement, the interested reader is referred to [6]).

∎
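The doubling argument used in Lemma 3 and Theorem 4 can be verified numerically for small lengths. The sketch below, with hypothetical helper names, starts from the pair $(01,10)$ and repeatedly maps $({\boldsymbol{x}},{\boldsymbol{y}})$ to $({\boldsymbol{x}}{\boldsymbol{y}},{\boldsymbol{y}}{\boldsymbol{x}})$ . After $s-1$ doublings the two sequences have length $2^{s}$ , are distinct, and are expected to share their $s$ -deck, while deleting all $2^{s-1}$ zeros from either one leaves the same all-ones trace.

```python
from collections import Counter
from itertools import combinations

def k_deck(x, k):
    """Multiset of all length-k subsequences of x."""
    return Counter(tuple(x[i] for i in idx)
                   for idx in combinations(range(len(x)), k))

def doubled_pairs(steps):
    """Yield (x, y), (xy, yx), ... starting from x = 01, y = 10."""
    x, y = (0, 1), (1, 0)
    yield x, y
    for _ in range(steps - 1):
        x, y = x + y, y + x
        yield x, y

for s, (x, y) in enumerate(doubled_pairs(4), start=1):
    same_deck = k_deck(x, s) == k_deck(y, s)
    same_trace = [b for b in x if b] == [b for b in y if b]
    print(f"n={len(x)} t={x.count(0)} k={s} same_deck={same_deck} "
          f"same_trace={same_trace}")
```

Each printed row exhibits two distinct sequences that the $s$ -deck and the shared trace cannot tell apart, which is the content of $f(2^{s},2^{s-1})\geqslant s+1$ .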

Using Theorem 4, we show next that the upper bound of Theorem 2 is tight for $t\leqslant 4$ .

Corollary 5

For $t\leqslant 4$ , $f(n,t)=t+1,$ provided that $n\geqslant 2t$ .

Proof:

The claim for $t=1$ follows from Corollary 1. The previous theorem established the result for $t=2$ . The claim for $t=3$ follows by observing that ${\boldsymbol{x}}=(0,1,1,0,1,0,0,1)$ and ${\boldsymbol{y}}=(1,0,0,1,0,1,1,0)$ share a common supersequence of length $11$ and have the same $3$ -deck. For $t=4$ , the bound follows from the existence of two sequences, $(1,1,0,0,1,1,1,0,1,1,0,0,1)$ and $(1,0,1,1,1,0,1,0,0,1,1,1,0)$ , which share a common length $9$ subsequence and have the same $4$ -deck. ∎

Let $N=1+wt({\boldsymbol{x}})$ , where $wt({\boldsymbol{x}})$ denotes the weight of the vector ${\boldsymbol{x}}$ . Theorem 7 below improves the result of Theorem 2 for the case that $t=N^{\epsilon}$ and $1/2<\epsilon<1$ . Similar to [13], we make use of the following result from [4].

Lemma 6

(cf. [4]) There is an absolute constant $c>0$ such that every polynomial $p$ of the form

$$p(x)=\sum_{j=0}^{n}a_{j}\cdot x^{j},\quad|a_{j}|\leqslant 1,\quad a_{j}\in\mathbb{C},$$

has at most $c\sqrt{n(1-\log|a_{0}|)}$ zeros at one.

Theorem 7

If $t=N^{\epsilon}$ , where $1/2<\epsilon<1$ , then any sequence ${\boldsymbol{x}}\in\{0,1\}^{n}$ may be reconstructed given an asymmetric trace of length $n-t$ and a $k$ -deck of ${\boldsymbol{x}}$ with

$$k\leqslant c\sqrt{N\cdot(1+\epsilon\log N)},$$

where $c$ is a constant.

Proof:

The result follows by counting the number of subsequences from the $k$ -deck that start with $j$ ones, for $j+1\in[k]$ , and end with a zero, denoted by $1^{j}0$ . For $b\in\{0,1\}$ , let $\bar{b}=1-b$ denote its complement and assume that ${\boldsymbol{x}}=(x_{1},\ldots,x_{n})$ . Furthermore, suppose that ${\boldsymbol{x}}$ has $wt({\boldsymbol{x}})$ ones and recall that $N=wt({\boldsymbol{x}})+1$ . Let ${\mathbf{X}}=(X_{1},\ldots,X_{N})\in\{0,1,\ldots,n\}^{N}$ be a vector with elements defined as follows: For $i\in[N]$ , $X_{i}$ equals the number of zeros between the $(i-1)$ -th and $i$ -th one in ${\boldsymbol{x}}$ (We tacitly assume that a one is pre-pended and a one is appended to the sequence first). For example, if ${\boldsymbol{x}}=(0,1,1,0)$ , then ${\mathbf{X}}=(1,0,1)$ .

Note that similarly to our previous approach, we may write

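$$n_{{\boldsymbol{x}},1^{j}0}=\sum_{\ell=1}^{n}\binom{1_{{\boldsymbol{x}}}(\ell)}{j}\cdot\bar{x}_{\ell}=\sum_{\ell=1}^{N}\binom{\ell-1}{j}\cdot X_{\ell}.$$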
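The counting identity used at this step, $n_{{\boldsymbol{x}},1^{j}0}=\sum_{\ell=1}^{N}\binom{\ell-1}{j}\cdot X_{\ell}$ , says that each of the $X_{\ell}$ zeros in the $\ell$ -th gap is preceded by exactly $\ell-1$ ones. The short Python check below compares the closed form against a brute-force count; `gap_vector` and `count_pattern_1j0` are hypothetical names introduced for illustration.

```python
from itertools import combinations
from math import comb

def gap_vector(x):
    """X = (X_1, ..., X_N): zeros before the first one, between ones, after the last."""
    gaps, run = [], 0
    for b in x:
        if b == 1:
            gaps.append(run)
            run = 0
        else:
            run += 1
    gaps.append(run)  # zeros after the last one
    return tuple(gaps)

def count_pattern_1j0(x, j):
    """Brute-force count of length-(j+1) subsequences equal to 1^j 0."""
    target = (1,) * j + (0,)
    return sum(tuple(x[i] for i in idx) == target
               for idx in combinations(range(len(x)), j + 1))

x = (0, 1, 1, 0, 1, 0, 0)
X = gap_vector(x)
for j in range(3):
    closed_form = sum(comb(l - 1, j) * X[l - 1] for l in range(1, len(X) + 1))
    print(j, count_pattern_1j0(x, j), closed_form)
```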

By linearly combining the counts $n_{{\boldsymbol{x}},1^{j}0}$ for different values of $j$ we can determine

$$s_{j}({\boldsymbol{x}})=\sum_{\ell=1}^{N}\ell^{j}\cdot X_{\ell}.$$

Suppose next that ${\boldsymbol{u}}\in\{0,1\}^{n}$ , ${\boldsymbol{u}}\neq{\boldsymbol{x}}$ , and let ${\boldsymbol{x}}$ and ${\boldsymbol{u}}$ have the same $k$ -deck. In addition, assume that there exists a sequence ${\boldsymbol{y}}\in\{0,1\}^{n-t}$ such that ${\boldsymbol{y}}\in{\cal D}_{t}({\boldsymbol{x}})$ and ${\boldsymbol{y}}\in{\cal D}_{t}({\boldsymbol{u}})$ . Define ${\mathbf{U}}$ in a manner analogous to ${\mathbf{X}}$ . Then

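$$s_{j}({\boldsymbol{x}})=\sum_{\ell=1}^{N}\ell^{j}\cdot X_{\ell}=\sum_{\ell=1}^{N}\ell^{j}\cdot U_{\ell}=s_{j}({\boldsymbol{u}}),\tag{1}$$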

for $1\leqslant j\leqslant k-1$ . Let

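$$p_{{\boldsymbol{x}}}(z)=\sum_{\ell=0}^{N}X_{\ell}\cdot z^{\ell},\quad p_{{\boldsymbol{u}}}(z)=\sum_{\ell=0}^{N}U_{\ell}\cdot z^{\ell}.$$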

Furthermore, let $\left(\frac{\partial^{j}}{\partial z^{j}}p_{{\boldsymbol{x}}}(z)\right)_{z=1}$ be the $j$ -th partial derivative of $p_{\boldsymbol{x}}(z)$ evaluated at $z=1$ . Note that if (1) holds, then

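$$\left(\frac{\partial^{j}}{\partial z^{j}}p_{{\boldsymbol{x}}}(z)\right)_{z=1}=\left(\frac{\partial^{j}}{\partial z^{j}}p_{{\boldsymbol{u}}}(z)\right)_{z=1}$$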

holds as well. Letting $P(z)=p_{{\boldsymbol{x}}}(z)-p_{{\boldsymbol{u}}}(z)$ , we have

$$(1-z)^{k_{m}}\mid P(z).$$

Assume that the degree of the polynomial $P(z)$ is $d$ and observe that for any $1\leqslant\ell\leqslant N$ , $|X_{\ell}-U_{\ell}|\leqslant t$ , since by assumption, there exists a ${\boldsymbol{y}}$ such that ${\boldsymbol{y}}\in{\cal D}_{t}({\boldsymbol{x}})$ and ${\boldsymbol{y}}\in{\cal D}_{t}({\boldsymbol{u}})$ . Define $f(z)=\frac{1}{z^{d}t}\cdot P(z)$ ; $f(z)$ satisfies the conditions of Lemma 6, so that it has at most $c\sqrt{N\cdot(1-\log|\frac{1}{t}|)}$ zeros at one, which implies

$$k_{m}\leqslant c\sqrt{N\cdot(1+\log t)}.$$

Substituting $t=N^{\epsilon}$ proves the claim. ∎
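To make the divisibility step concrete, the sketch below computes the multiplicity of the root $z=1$ of $P(z)=p_{{\boldsymbol{x}}}(z)-p_{{\boldsymbol{u}}}(z)$ for two pairs of equal-weight sequences that share a $2$ -deck and a $3$ -deck, respectively. It is an illustration under our own conventions: the helper names are hypothetical, and the coefficient of $z^{0}$ is taken to be zero, which does not affect the multiplicity at one.

```python
def gap_vector(x):
    """X = (X_1, ..., X_N): zeros before the first one, between ones, after the last."""
    gaps, run = [], 0
    for b in x:
        if b == 1:
            gaps.append(run)
            run = 0
        else:
            run += 1
    gaps.append(run)
    return tuple(gaps)

def difference_polynomial(x, u):
    """Coefficients of P(z) = sum_l (X_l - U_l) z^l, highest degree first."""
    X, U = gap_vector(x), gap_vector(u)
    assert len(X) == len(U), "sequences must have the same weight"
    low_to_high = [0] + [a - b for a, b in zip(X, U)]
    return low_to_high[::-1]

def multiplicity_at_one(coeffs):
    """How many times (z - 1) divides the polynomial with these coefficients."""
    c, mult = list(coeffs), 0
    while len(c) > 1:
        q = [c[0]]
        for a in c[1:]:
            q.append(a + q[-1])  # synthetic division by (z - 1)
        if q[-1] != 0:
            break
        mult += 1
        c = q[:-1]
    return mult

pair_2deck = ((0, 1, 1, 0), (1, 0, 0, 1))
pair_3deck = ((0, 1, 1, 0, 1, 0, 0, 1), (1, 0, 0, 1, 0, 1, 1, 0))
print(multiplicity_at_one(difference_polynomial(*pair_2deck)))  # 2
print(multiplicity_at_one(difference_polynomial(*pair_3deck)))  # 3
```

In both cases the multiplicity matches the size of the shared deck, in line with the bound on the number of zeros at one that the proof invokes.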

The previous result improves upon Theorem 2 for the case when $\epsilon>\frac{1}{2}$ . For large values of $t$ , an alternative approach is to discard the vector $\underline{{\boldsymbol{x}}}$ and reconstruct ${\boldsymbol{x}}$ using only the $k$ -deck for ${\boldsymbol{x}}$ according to [9], [16]. For the case when $n\gg N$ , Theorem 7 improves upon the best known result in the literature [9], which asserts that $f(n,n)\leqslant(1+o(1))\frac{16}{7}\sqrt{n}$ . The following corollary summarizes Theorem 2 and Theorem 7.

Corollary 8

For ${\boldsymbol{x}}\in\{0,1\}^{n}$ such that $wt({\boldsymbol{x}})=N-1$ and $t=N^{\epsilon},$ where $1/2<\epsilon<1$ ,

$$f(n,t)\leqslant\min\Big\{N^{\epsilon}+1,O\Big(\sqrt{N\cdot(1+\epsilon\log N)}\Big)\Big\}.$$

III The Multitrace Reconstruction Problem

We focus next on the scenario where one is given $M$ trace sequences ${\cal U}=\{\underline{{\boldsymbol{x}}}^{(1)},\underline{{\boldsymbol{x}}}^{(2)},$ $\ldots,\underline{{\boldsymbol{x}}}^{(M)}\}$ of length $n-t$ , each of which is obtained by deleting $t$ zeros from ${\boldsymbol{x}}$ . The question of interest is to determine the minimum value of $k$ , denoted by $f(n,t,M)$ , such that it is possible to reconstruct ${\boldsymbol{x}}$ given the set ${\cal U}$ along with the $k$ -deck of ${\boldsymbol{x}}$ .

For a set $S\subseteq\{0,1\}^{m}$ and a sequence ${\boldsymbol{v}}\in\{0,1\}^{k}$ , let ${\boldsymbol{v}}\circ S$ denote the set obtained by pre-pending to every element in $S$ the vector ${\boldsymbol{v}}$ . For instance if $S=\{(0,1),(1,1)\}$ and ${\boldsymbol{v}}=(0,0)$ , then ${\boldsymbol{v}}\circ S=\{(0,0,0,1),(0,0,1,1)\}$ . For a vector ${\boldsymbol{v}}\in\{0,1\}^{n}$ , let ${\cal I}_{t}({\boldsymbol{v}})$ denote the set of vectors that may be obtained by inserting $t$ zeros into ${\boldsymbol{v}}$ . For instance, if ${\boldsymbol{v}}=(0,1)$ , then ${\cal I}_{1}({\boldsymbol{v}})=\{({\color[rgb]{1,0,0}0},0,1),(0,1,{\color[rgb]{1,0,0}0})\}$ .
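The three set operations just introduced can be written down directly. The Python sketch below is a brute-force rendering for small inputs, with function names of our own choosing: `delete_zeros` plays the role of ${\cal D}_{t}$ , `insert_zeros` the role of ${\cal I}_{t}$ , and `prepend` the role of ${\boldsymbol{v}}\circ S$ .

```python
from collections import Counter
from itertools import combinations, combinations_with_replacement

def delete_zeros(v, t):
    """D_t(v): all sequences obtained by deleting t zeros from v."""
    zero_positions = [i for i, b in enumerate(v) if b == 0]
    out = set()
    for drop in combinations(zero_positions, t):
        dropped = set(drop)
        out.add(tuple(b for i, b in enumerate(v) if i not in dropped))
    return out

def insert_zeros(v, t):
    """I_t(v): all sequences obtained by inserting t zeros into v."""
    out = set()
    for gaps in combinations_with_replacement(range(len(v) + 1), t):
        extra = Counter(gaps)
        seq = []
        for pos in range(len(v) + 1):
            seq.extend([0] * extra[pos])
            if pos < len(v):
                seq.append(v[pos])
        out.add(tuple(seq))
    return out

def prepend(v, S):
    """v o S: prepend v to every element of S."""
    return {tuple(v) + tuple(s) for s in S}

print(prepend((0, 0), {(0, 1), (1, 1)}))
print(insert_zeros((0, 1), 1))
print(delete_zeros((0, 1, 0, 1, 0, 1), 1))
```

Both enumerations return sets, so different deletion or insertion patterns that yield the same sequence are counted once, matching the set-valued definitions in the text.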

Lemma 9

For positive integers $r\geqslant 2$ , $t>1$ , and $1\leqslant M<n-1$ ,

$$f(2r,r-M+1,M)\geqslant f(2(r-M),r-M),$$

and

$$f(2r+1,r-M+1,M+1)\geqslant f(2(r-M),r-M).$$

Proof:

Let ${\boldsymbol{a}}=(0,1,0,1,\ldots,0,1)\in\{0,1\}^{2M}$ and suppose that we have two sequences ${\boldsymbol{x}}=({\boldsymbol{x}}^{\prime},{\boldsymbol{a}})\in\{0,1\}^{2T+2M}$ and ${\boldsymbol{y}}=({\boldsymbol{y}}^{\prime},{\boldsymbol{a}})\in\{0,1\}^{2T+2M}$ such that ${\boldsymbol{x}}^{\prime},{\boldsymbol{y}}^{\prime}\in\{{0,1\}}^{2T}$ have the same $k$ -deck and such that there exists a ${\boldsymbol{z}}\in{\cal D}_{T}({\boldsymbol{x}}^{\prime})\cap{\cal D}_{T}({\boldsymbol{y}}^{\prime})$ (i.e, ${\boldsymbol{x}}^{\prime}$ and ${\boldsymbol{y}}^{\prime}$ share a trace of length $T$ ). Clearly, under this setup, ${\boldsymbol{x}},{\boldsymbol{y}}$ have the same $k$ -deck.

First, note that ${\boldsymbol{z}}\circ{\cal D}_{1}({\boldsymbol{a}})\subseteq{\cal D}_{T+1}({\boldsymbol{x}})$ . Since ${\boldsymbol{z}}\in{\cal D}_{T}({\boldsymbol{y}}^{\prime})$ , we also have ${\boldsymbol{z}}\circ{\cal D}_{1}({\boldsymbol{a}})\subseteq{\cal D}_{T+1}({\boldsymbol{y}})$ . Furthermore, since $|{\cal D}_{1}({\boldsymbol{a}})|\geqslant M$ , one also has $|{\boldsymbol{z}}\circ{\cal D}_{1}({\boldsymbol{a}})|\geqslant M$ . Let ${\cal U}_{{\boldsymbol{z}}}\subseteq{\boldsymbol{z}}\circ{\cal D}_{1}({\boldsymbol{a}})$ , say ${\cal U}_{{\boldsymbol{z}}}=\{\underline{{\boldsymbol{x}}}^{(1)},\underline{{\boldsymbol{x}}}^{(2)},\ldots,\underline{{\boldsymbol{x}}}^{(M)}\}$ . Then, ${\boldsymbol{x}},{\boldsymbol{y}}$ are such that for all $i\in[M]$ , ${\boldsymbol{x}},{\boldsymbol{y}}\in{\cal I}_{T+1}(\underline{{\boldsymbol{x}}}^{(i)})$ . Thus, $f(2T+2M,T+1,M)\geqslant f(2T,T)$ . The statement in the lemma follows now by setting $r=M+T$ .

For the case that ${\boldsymbol{x}}$ and ${\boldsymbol{y}}$ have odd length, we let the alternating sequence ${\boldsymbol{a}}$ have length $2M+1$ , and $|{\cal U}_{{\boldsymbol{z}}}|=M+1$ . In this case, we get $f(2T+2M+1,T+1,M+1)\geqslant f(2T,T)$ . Substituting $r=M+T$ gives the second expression. ∎

Example 2

Suppose that $M=3$ and that $T=2$ . Let ${\boldsymbol{x}}^{\prime}=(0,1,1,0)$ , ${\boldsymbol{y}}^{\prime}=(1,0,0,1)$ , ${\boldsymbol{a}}=(0,1,0,1,0,1)$ and observe that ${\boldsymbol{z}}=(1,1)$ is a common subsequence of both ${\boldsymbol{x}}^{\prime}$ and ${\boldsymbol{y}}^{\prime}$ . Then, we may choose ${\cal U}=\{(1,1,1,0,1,0,1),(1,1,0,1,1,0,1),(1,1,0,1,0,1,1)\}=\{\underline{{\boldsymbol{x}}}^{(1)},\underline{{\boldsymbol{x}}}^{(2)},\underline{{\boldsymbol{x}}}^{(3)}\}$ , such that ${\cal U}\subseteq{\cal D}_{T+1}({\boldsymbol{x}})$ and ${\cal U}\subseteq{\cal D}_{T+1}({\boldsymbol{y}}),$ where ${\boldsymbol{x}}=(0,1,1,0,0,1,0,1,0,1)$ and ${\boldsymbol{y}}=(1,0,0,1,0,1,0,1,0,1)$ . Thus, we have $3$ sequences, each of length $10-3=7$ , where each sequence is a subsequence of both ${\boldsymbol{x}}$ and ${\boldsymbol{y}}$ . Since ${\boldsymbol{x}}$ and ${\boldsymbol{y}}$ have the same $2$ -deck, it follows from Lemma 9 that $f(2\cdot 2+2\cdot 3,3,3)=f(10,3,3)\geqslant f(4,2)=3$ .
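The sequences of Example 2 are small enough to verify exhaustively. The following Python sketch, using hypothetical helper names, builds the traces as ${\boldsymbol{z}}\circ{\cal D}_{1}({\boldsymbol{a}})$ and checks that each of them can be obtained from both ${\boldsymbol{x}}$ and ${\boldsymbol{y}}$ by deleting three zeros, and that ${\boldsymbol{x}}$ and ${\boldsymbol{y}}$ share their $2$ -deck.

```python
from collections import Counter
from itertools import combinations

def k_deck(x, k):
    """Multiset of all length-k subsequences of x."""
    return Counter(tuple(x[i] for i in idx)
                   for idx in combinations(range(len(x)), k))

def delete_zeros(v, t):
    """All sequences obtained by deleting t zeros from v."""
    zero_positions = [i for i, b in enumerate(v) if b == 0]
    out = set()
    for drop in combinations(zero_positions, t):
        dropped = set(drop)
        out.add(tuple(b for i, b in enumerate(v) if i not in dropped))
    return out

a = (0, 1, 0, 1, 0, 1)
z = (1, 1)
x = (0, 1, 1, 0) + a
y = (1, 0, 0, 1) + a

traces = {z + s for s in delete_zeros(a, 1)}
shared = delete_zeros(x, 3) & delete_zeros(y, 3)
print(sorted(traces))
print(traces <= shared)                 # every trace fits both x and y
print(k_deck(x, 2) == k_deck(y, 2))     # the 2-deck cannot separate them
```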

We now turn our attention to an upper bound. Let $N=wt({\boldsymbol{x}})+1$ and ${\mathbf{X}}=(X_{1},\ldots,X_{N})$ be as defined in the previous lemmas. In addition, reserve $\underline{{\mathbf{X}}}^{(m)}$ , $1\leqslant m\leqslant M$ for the sequence ${\mathbf{X}}$ of ${\boldsymbol{x}}^{(m)}$ obtained by counting the occurrences of zeros between ones as described in the proof of Theorem 7.

Lemma 10

For positive integers $n\geqslant 2,t<n$ , and $M\geqslant 1$ , $f(n,t,M)\leqslant f(n,t-1)\leqslant t$ .

Proof:

Note that $f(n,t,M)\leqslant f(n,t)$ and that $f(n,t,M)$ is non-increasing in $M$ ; hence, it suffices to analyze the case $M=2$ only. Let ${\cal U}=\{{\boldsymbol{x}}^{(1)},{\boldsymbol{x}}^{(2)}\}$ . We have $d_{H}({\mathbf{X}}^{(1)},{\mathbf{X}}^{(2)})\geqslant 1$ , since otherwise $|{\cal U}|=1$ . Consequently, we can find at least one run of zeros in ${\boldsymbol{x}}^{(1)}$ that underwent a deletion, and we can identify and correct at least one deletion. Let $\underline{{\boldsymbol{x}}}\in\{0,1\}^{n-t+1}$ be the vector which results from correcting this deletion in ${\boldsymbol{x}}^{(1)}$ . Then, the minimum $k$ for which the $k$ -deck suffices to reconstruct ${\boldsymbol{x}}$ given ${\cal U}$ and $\underline{{\boldsymbol{x}}}$ is at most $f(n,t-1)$ , which proves the statement in the lemma. ∎

Corollary 11

For $t\leqslant 5$ and $M\leqslant n-2t$ ,

$$f(n,t,M)=t.$$

Example 3

Suppose that ${\boldsymbol{x}}=(1,0,1,1,0,0,1,0)$ so that ${\mathbf{X}}=(0,1,0,2,1)$ . Assume that we observe the following subsequences of length $n-t=8-2=6$ of ${\boldsymbol{x}}$ , ${\cal U}=\{(1,1,1,0,1,0),(1,0,1,1,0,1)\}$ . Hence, ${\mathbf{X}}^{(1)}=(0,{\color[rgb]{1,0,0}0},0,{\color[rgb]{1,0,0}1},1)$ and ${\mathbf{X}}^{(2)}=(0,1,0,{\color[rgb]{1,0,0}1},{\color[rgb]{1,0,0}0})$ . Let $\bar{{\mathbf{X}}}=(\bar{X}_{1},\bar{X}_{2},\bar{X}_{3},\bar{X}_{4},\bar{X}_{5})$ be given according to $\bar{X}_{i}=\max\Big\{X^{(1)}_{i},X^{(2)}_{i}\Big\}$ . Then, $\bar{{\mathbf{X}}}=(0,1,0,1,1)$ and $\bar{{\boldsymbol{x}}}=(1,0,1,1,0,1,0)$ . Note that $d_{H}({\mathbf{X}},\bar{{\mathbf{X}}})=1$ and that $\bar{{\boldsymbol{x}}}$ is the result of deleting a zero from ${\boldsymbol{x}}$ . Let $n_{10,{\boldsymbol{x}}}$ denote the number of occurrences of the subsequence $10$ in ${\boldsymbol{x}}$ and similarly, let $n_{10,\bar{{\boldsymbol{x}}}}$ denote the number of occurrences of the subsequence $10$ in $\bar{{\boldsymbol{x}}}$ . Since $n_{10,{\boldsymbol{x}}}-n_{10,\bar{{\boldsymbol{x}}}}=11-8=3$ , the missing zero is preceded by three ones, and we need to add one to the value at the fourth position of $\bar{{\mathbf{X}}}$ to obtain ${\mathbf{X}}$ . From ${\mathbf{X}}$ , we can then recover ${\boldsymbol{x}}$ .
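The aggregation in Example 3 can be traced step by step in code. The sketch below is a minimal illustration with hypothetical helper names: it takes the element-wise maximum of the gap vectors of the two traces, converts the result back to a binary sequence, and then uses the count of the subsequence $10$ (available from the $2$ -deck of ${\boldsymbol{x}}$ ) to place the one zero that is still missing.

```python
def gap_vector(x):
    """Zeros before the first one, between consecutive ones, and after the last one."""
    gaps, run = [], 0
    for b in x:
        if b == 1:
            gaps.append(run)
            run = 0
        else:
            run += 1
    gaps.append(run)
    return tuple(gaps)

def from_gaps(X):
    """Inverse of gap_vector."""
    seq = []
    for i, g in enumerate(X):
        seq.extend([0] * g)
        if i < len(X) - 1:
            seq.append(1)
    return tuple(seq)

def count_10(x):
    """Number of occurrences of the subsequence 10 in x."""
    ones = total = 0
    for b in x:
        if b == 1:
            ones += 1
        else:
            total += ones
    return total

x = (1, 0, 1, 1, 0, 0, 1, 0)           # unknown to the decoder
traces = [(1, 1, 1, 0, 1, 0), (1, 0, 1, 1, 0, 1)]

aggregated = tuple(max(col) for col in zip(*(gap_vector(u) for u in traces)))
x_bar = from_gaps(aggregated)

n10_x = count_10(x)                     # in practice read off the 2-deck of x
ones_before = n10_x - count_10(x_bar)   # ones preceding the missing zero
recovered = list(aggregated)
recovered[ones_before] += 1             # 0-based index of the affected gap
print(aggregated, x_bar, ones_before, from_gaps(recovered) == x)
```

A zero in the gap with 0-based index $i$ has exactly $i$ ones before it, which is why the difference of the two counts can be used directly as an index.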

Next, we consider the case when $M$ is sufficiently large to guarantee a significant reduction in the value of the deck length $k$ . In our proofs, we make use of the following claims.

Claim 2

Let ${\boldsymbol{x}}\in\{0,1\}^{n}$ , ${\boldsymbol{y}}\in\{0,1\}^{n}$ be such that there exists a ${\boldsymbol{w}}\in\{0,1\}^{n+t},$ such that ${\boldsymbol{w}}\in{\cal I}_{t}({\boldsymbol{x}})\cap{\cal I}_{t}({\boldsymbol{y}})$ . Let $t_{0}\leqslant t$ be the smallest possible integer for which ${\cal I}_{t_{0}}({\boldsymbol{x}})\cap{\cal I}_{t_{0}}({\boldsymbol{y}})\neq\emptyset$ and suppose that ${\boldsymbol{z}}\in{\cal I}_{t_{0}}({\boldsymbol{x}})\cap{\cal I}_{t_{0}}({\boldsymbol{y}})$ . Then, ${\boldsymbol{w}}\in{\cal I}_{t-t_{0}}({\boldsymbol{z}})$ .

Proof:

The result follows by noting that for any two strings ${\boldsymbol{v}},{\boldsymbol{w}}$ such that ${\boldsymbol{w}}\in{\cal I}_{t}({\boldsymbol{v}})$ , we have $W_{i}\geqslant V_{i}$ for $i\in[N]$ . Here, ${\mathbf{V}}=(V_{1},\ldots,V_{N})$ and ${\mathbf{W}}=(W_{1},\ldots,W_{N})$ denote the ${\mathbf{X}}$ -analogues of ${\boldsymbol{v}}$ and ${\boldsymbol{w}}$ . ∎

Example 4

Suppose that ${\boldsymbol{x}}=(0,1,1,0,1)$ and ${\boldsymbol{y}}=(0,0,1,1,1)$ so that ${\mathbf{X}}=(1,0,1,0)$ and ${\mathbf{Y}}=(2,0,0,0)$ . Then ${\mathbf{Z}}$ may be formed by taking the entrywise maximum of ${\mathbf{X}}=(X_{1},\ldots,X_{4})$ and ${\mathbf{Y}}=(Y_{1},\ldots,Y_{4})$ , that is, ${\mathbf{Z}}=(2,0,1,0)=(Z_{1},\ldots,Z_{4})$ . This gives ${\boldsymbol{z}}=(0,0,1,1,0,1)$ . Observe that if ${\boldsymbol{w}}$ is any asymmetric supersequence of ${\boldsymbol{x}}$ and ${\boldsymbol{y}}$ , then for $i\in[4]$ , we require $W_{i}\geqslant X_{i}$ and similarly $W_{i}\geqslant Y_{i}$ , which implies that $W_{i}\geqslant Z_{i}$ since $Z_{i}=\max\{X_{i},Y_{i}\}$ .
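The observation in this example, and with it the statement of Claim 2, can be checked exhaustively for small lengths. The sketch below assumes that ${\cal I}_{t}({\boldsymbol{v}})$ is the set of sequences obtained from ${\boldsymbol{v}}$ by inserting $t$ zeros; the helper names are hypothetical.

```python
from itertools import product


def zero_runs(bits):
    """Run-length vector: zeros before the first 1, between 1s, after the last 1."""
    runs = [0]
    for b in bits:
        if b == 1:
            runs.append(0)
        else:
            runs[-1] += 1
    return runs


def from_zero_runs(runs):
    """Rebuild the binary sequence from its run-length vector."""
    bits = []
    for i, r in enumerate(runs):
        bits.extend([0] * r)
        if i < len(runs) - 1:
            bits.append(1)
    return bits


def is_zero_supersequence(w, v):
    """True if w can be obtained from v by inserting zeros only (assumed meaning of I_t)."""
    W, V = zero_runs(w), zero_runs(v)
    return len(W) == len(V) and all(wi >= vi for wi, vi in zip(W, V))


def common_supersequences(x, y, length):
    """All binary sequences of the given length that contain both x and y in this sense."""
    return [list(w) for w in product([0, 1], repeat=length)
            if is_zero_supersequence(w, x) and is_zero_supersequence(w, y)]


x = [0, 1, 1, 0, 1]
y = [0, 0, 1, 1, 1]

# Shortest common supersequence: entrywise maximum of the run-length vectors.
Z = [max(a, b) for a, b in zip(zero_runs(x), zero_runs(y))]
z = from_zero_runs(Z)

# Every longer common supersequence must also contain z.
claim_holds = all(is_zero_supersequence(w, z)
                  for length in (7, 8)
                  for w in common_supersequences(x, y, length))
```

For these inputs there is no common supersequence of length 5, and ${\boldsymbol{z}}$ is the only one of length 6, so $t_{0}=1$ in the notation of Claim 2.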

Claim 3

Suppose that $n\geqslant 2$ . Then, for $t<\lfloor\frac{n}{6}\rfloor$ , one has

[TABLE]

Proof:

Let ${\boldsymbol{a}}=(0,1,0,1,0,1,\ldots)\in\{0,1\}^{m}$ be the alternating string of length $m$ , and suppose that ${\boldsymbol{v}}\in\{0,1\}^{m}$ , ${\boldsymbol{v}}\neq{\boldsymbol{a}}$ , is an arbitrary binary string of length $m$ that contains at least one run of zeros of length $1$ (i.e., the substring $101$ ).

We first show that $|{\cal D}_{t}({\boldsymbol{a}})|\geqslant|{\cal D}_{t}({\boldsymbol{v}})|$ when $t\leqslant\lceil\frac{m}{2}\rceil$ . The proof proceeds by induction. We first establish the base case. For $t=1$ and for an arbitrary $m$ , $|{\cal D}_{t}({\boldsymbol{a}})|\geqslant|{\cal D}_{t}({\boldsymbol{v}})|$ . Furthermore, for any $t\leqslant\lceil\frac{m}{2}\rceil$ , it is straightforward to see $|{\cal D}_{t}({\boldsymbol{a}})|\geqslant|{\cal D}_{t}({\boldsymbol{v}})|$ since ${\boldsymbol{v}}$ has at most $t=\lceil\frac{m}{2}\rceil$ runs of zeros. Next, for the inductive step, suppose that $m+t=s$ and assume that the claim holds for all $m+t<s$ . Suppose the first occurrence of $101$ in ${\boldsymbol{v}}$ from the left starts at position $j$ . We partition the set ${\cal D}_{t}({\boldsymbol{v}})$ as follows:

•

${\cal D}({\boldsymbol{v}})^{(0)}$ : The set of all sequences in ${\cal D}_{t}({\boldsymbol{v}})$ in which the zero between the positions $j$ and $(j+2)$ is not deleted.

•

${\cal D}({\boldsymbol{v}})^{(1)}$ : The set of all sequences in ${\cal D}_{t}({\boldsymbol{v}})$ in which the zero between the positions $j$ and $(j+2)$ is deleted.

We partition the set ${\cal D}_{t}({\boldsymbol{a}})$ similarly:

•

${\cal D}({\boldsymbol{a}})^{(0)}$ : The set of sequences in ${\cal D}_{t}({\boldsymbol{a}})$ that start with zero.

•

${\cal D}({\boldsymbol{a}})^{(1)}$ : The set of sequences in ${\cal D}_{t}({\boldsymbol{a}})$ that start with one.

Note that $|{\cal D}({\boldsymbol{a}})^{(0)}|=|{\cal D}_{t}({\boldsymbol{a}}^{\prime})|,$ where ${\boldsymbol{a}}^{\prime}=(0,1,0,\ldots)\in\{0,1\}^{m-2},$ and that $|{\cal D}({\boldsymbol{a}})^{(1)}|=|{\cal D}_{t-1}({\boldsymbol{a}}^{\prime})|$ . Also, $|{\cal D}({\boldsymbol{v}})^{(0)}|=|{\cal D}_{t}({\boldsymbol{v}}^{\prime})|,$ where ${\boldsymbol{v}}^{\prime}$ is the length $m-2$ sequence obtained by deleting the string $10$ starting at index $j$ from ${\boldsymbol{v}}$ . In addition, $|{\cal D}({\boldsymbol{v}})^{(1)}|=|{\cal D}_{t-1}({\boldsymbol{v}}^{\prime})|$ . Since $m-2+t<s$ , we can apply the inductive hypothesis to obtain $|{\cal D}_{t}({\boldsymbol{a}}^{\prime})|\geqslant|{\cal D}_{t}({\boldsymbol{v}}^{\prime})|$ and $|{\cal D}_{t-1}({\boldsymbol{a}}^{\prime})|\geqslant|{\cal D}_{t-1}({\boldsymbol{v}}^{\prime})|$ , which implies $|{\cal D}_{t}({\boldsymbol{a}})|\geqslant|{\cal D}_{t}({\boldsymbol{v}})|$ when $m+t=s$ .

Consider next the case when ${\boldsymbol{v}}$ is any length- $n$ vector that has no runs of zeros of length one, and let $t<\lfloor n/6\rfloor$ . In this case, $|{\cal D}_{t}({\boldsymbol{v}})|\leqslant\left(n/3\right)^{t}$ since ${\boldsymbol{v}}$ has at most $n/3$ runs of zeros, and $|{\cal D}_{t}({\boldsymbol{a}})|\geqslant\binom{\lfloor n/2\rfloor}{t}$ . Since $\binom{\lfloor n/2\rfloor}{t}\geqslant\left(n/3\right)^{t}$ when $t<\lfloor n/6\rfloor$ , the result follows. ∎
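The partition of ${\cal D}_{t}({\boldsymbol{a}})$ used in the inductive step can be illustrated by brute force for small $m$ . The sketch below assumes that ${\cal D}_{t}({\boldsymbol{v}})$ is the set of distinct sequences obtained by deleting exactly $t$ zeros from ${\boldsymbol{v}}$ ; the function names are hypothetical. It only covers the alternating string and does not verify the comparison with an arbitrary ${\boldsymbol{v}}$ .

```python
from itertools import combinations
from math import comb


def zero_deletions(v, t):
    """Set of distinct sequences obtained by deleting exactly t zeros from v (assumed D_t)."""
    zero_positions = [i for i, b in enumerate(v) if b == 0]
    out = set()
    for drop in combinations(zero_positions, t):
        dropped = set(drop)
        out.add(tuple(b for i, b in enumerate(v) if i not in dropped))
    return out


def alternating(m):
    """The alternating string (0, 1, 0, 1, ...) of length m."""
    return tuple(i % 2 for i in range(m))


def partition_sizes(m, t):
    """Sizes of the two parts of D_t(a): sequences starting with 0 and with 1."""
    D = zero_deletions(alternating(m), t)
    starts_zero = sum(1 for s in D if s[0] == 0)
    starts_one = sum(1 for s in D if s[0] == 1)
    return starts_zero, starts_one


m, t = 8, 2
a_prime = alternating(m - 2)
part0, part1 = partition_sizes(m, t)
size_a = len(zero_deletions(alternating(m), t))
```

Every zero of the alternating string forms its own run, so distinct choices of deleted zeros give distinct sequences. With $\lceil m/2\rceil$ zeros in ${\boldsymbol{a}}$ , the two parts therefore satisfy Pascal's rule, $\binom{\lceil m/2\rceil}{t}=\binom{\lceil m/2\rceil-1}{t}+\binom{\lceil m/2\rceil-1}{t-1}$ .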

Using the previous claims, we can establish upper and lower bounds on $f(n,t,M)$ .

Lemma 12

For integers $n\geqslant 2,t<n,M\geqslant 1$ , let $m_{0}=\Big{\lfloor}\frac{\log M}{\log n}+(n-t)\Big{\rfloor}$ . Then, for $t<\lfloor\frac{m_{0}}{6}\rfloor$ ,

[TABLE]

Proof:

Under the assumptions of Claim 2 applied to $M$ sequences, we seek the smallest possible length sequence ${\boldsymbol{z}}\in\{0,1\}^{m}$ , $m\geqslant n-t$ , such that ${\boldsymbol{z}}\in{\cal I}_{m-n+t}(\underline{{\boldsymbol{x}}}^{(1)})\cap{\cal I}_{m-n+t}(\underline{{\boldsymbol{x}}}^{(2)})\cap\cdots\cap{\cal I}_{m-n+t}(\underline{{\boldsymbol{x}}}^{(M)})$ . According to Claim 3, for $t<\lfloor\frac{m}{6}\rfloor$ we have

[TABLE]

Since $\binom{\lceil\frac{m}{2}\rceil}{t-n+m}\leqslant\left(\lceil\frac{m}{2}\rceil\right)^{t-n+m}$ , if

[TABLE]

then

[TABLE]

Hence, ${\boldsymbol{z}}$ has length at least $m$ and ${\boldsymbol{z}}\in{\cal I}_{m-n+t}(\underline{{\boldsymbol{x}}}^{(1)})\cap{\cal I}_{m-n+t}(\underline{{\boldsymbol{x}}}^{(2)})\cap\cdots\cap{\cal I}_{m-n+t}(\underline{{\boldsymbol{x}}}^{(M)})$ . We can determine the sequence ${\boldsymbol{x}}$ given the length $n-m$ subsequence ${\boldsymbol{z}}\in{\cal I}_{n-m}({\boldsymbol{x}})$ and its $f(n,n-m)$ -deck. ∎

Lemma 13

For integers $n\geqslant 2,t,M\geqslant 1$ , let $m=\Big{\lceil}\frac{\log M}{-\log\left({2(1-\frac{n-t}{n-t+1})}\right)}+(n-t)\Big{\rceil}$ . Then, for $t<\lfloor\frac{m}{6}\rfloor$ ,

[TABLE]

Proof:

Under the assumptions of Claim 2 applied to $M$ sequences, we need to determine the minimum length sequence ${\boldsymbol{z}}\in\{0,1\}^{m}$ , $m\geqslant n-t$ , such that ${\boldsymbol{z}}\in{\cal I}_{m-n+t}(\underline{{\boldsymbol{x}}}^{(1)})\cap{\cal I}_{m-n+t}(\underline{{\boldsymbol{x}}}^{(2)})\cap\cdots\cap{\cal I}_{m-n+t}(\underline{{\boldsymbol{x}}}^{(M)})$ . Without loss of generality, assume that ${\boldsymbol{z}}$ is the alternating string. Then, $|{\cal D}_{t-(n-m)}({\boldsymbol{z}})|\geqslant\left(\frac{m/2}{t-n+m}\right)^{t-n+m}=\left(\frac{1}{2(1-\frac{n-t}{m})}\right)^{t-n+m}$ . Since $m\geqslant n-t+1$ , if $m=\Big{\lceil}\frac{\log M}{-\log\left({2(1-\frac{n-t}{n-t+1})}\right)}+(n-t)\Big{\rceil}$ , then $M<|{\cal D}_{t-(n-m)}({\boldsymbol{z}})|$ . Hence, $f(n,t,M)\geqslant f(n,n-m)$ . ∎

Theorem 14

For integers $n\geqslant 2,t<n,M\geqslant 1$ ,

[TABLE]

where $\Big{\lfloor}\frac{\log M}{\log n}+(n-t)\Big{\rfloor}\leqslant m\leqslant\Big{\lceil}\frac{\log M}{-\log\left({2(1-\frac{n-t}{n-t+1})}\right)}+(n-t)\Big{\rceil}$ , and $t<\lfloor\frac{m}{6}\rfloor$ .
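The range of $m$ in the theorem can be evaluated numerically for concrete parameters. The following is a rough sketch with hypothetical function names. It uses the simplification $2\big(1-\frac{n-t}{n-t+1}\big)=\frac{2}{n-t+1}$ , so the denominator of the upper endpoint is $\log\frac{n-t+1}{2}$ , and it assumes $n-t\geqslant 2$ so that this logarithm is positive. The base of the logarithm cancels in both ratios.

```python
from math import ceil, floor, log


def m_lower(n, t, M):
    """Lower endpoint: floor(log M / log n + (n - t))."""
    return floor(log(M) / log(n) + (n - t))


def m_upper(n, t, M):
    """Upper endpoint: ceil(log M / log((n - t + 1) / 2) + (n - t))."""
    if n - t < 2:
        # Assumption of this sketch: the logarithm below must be positive.
        raise ValueError("need n - t >= 2")
    return ceil(log(M) / log((n - t + 1) / 2) + (n - t))


def side_condition(t, m):
    """The condition t < floor(m / 6) stated in the theorem."""
    return t < m // 6


# Illustrative parameters (not taken from the paper).
n, t, M = 30, 3, 50
lo, hi = m_lower(n, t, M), m_upper(n, t, M)
admissible = [m for m in range(lo, hi + 1) if side_condition(t, m)]
```

For these illustrative parameters the endpoints are 28 and 29, and both values satisfy the side condition. With $M=1$ both endpoints reduce to $n-t$ .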

Invoking the results of the previous section, we arrive at the following corollary.

Corollary 15

Suppose that $t<\lfloor\frac{m}{6}\rfloor$ and $M=\binom{\frac{m}{2}}{t-n+m}+1$ where $m$ is an even integer. If $n-m\leqslant 4$ , then

[TABLE]

Acknowledgement. This research was supported in part by the NSF grants CIF CCS 1526875 and 1618366, and the NSF STC Center for Science of Information at Purdue University.

