Trace Reconstruction: Generalized and Parameterized
Akshay Krishnamurthy, Arya Mazumdar, Andrew McGregor, and Soumyabrata Pal

TL;DR
This paper advances the understanding of trace reconstruction by exploring generalized and parameterized versions, providing new bounds for matrix, sparse, and random string reconstruction problems, and highlighting differences from sequence reconstruction.
Contribution
It introduces new bounds and methods for trace reconstruction in matrices and sparse strings, extending classical results and addressing open problems.
Findings
Exponential bounds for matrix trace reconstruction improve previous results.
Logarithmic trace complexity for random matrix reconstruction.
Polynomial traces suffice for sparse strings with separation promise.
Abstract
In the beautifully simple-to-state problem of trace reconstruction, the goal is to reconstruct an unknown binary string $x$ given random “traces” of $x$, where each trace is generated by deleting each coordinate of $x$ independently with probability $p<1$. The problem is well studied both when the unknown string is arbitrary and when it is chosen uniformly at random. For both settings, there is still an exponential gap between upper and lower sample complexity bounds and our understanding of the problem is still surprisingly limited. In this paper, we consider natural parameterizations and generalizations of this problem in an effort to attain a deeper and more comprehensive understanding. Perhaps our most surprising results are:
1. We prove that $\exp(O(n^{1/4}\sqrt{\log n}))$ traces suffice for reconstructing arbitrary matrices. In the matrix version of the problem, each row and column of an unknown $\sqrt{n}\times\sqrt{n}$ matrix is deleted independently with probability $p$. Our result contrasts with the best known results for sequence reconstruction, where the best known upper bound is $\exp(O(n^{1/3}))$.
2. An optimal result for random matrix reconstruction: we show that $\Theta(\log n)$ traces are necessary and sufficient. This is in contrast to the problem for random sequences, where there is a super-logarithmic lower bound and the best known upper bound is $\exp(O(\log^{1/3} n))$.
3. We show that $\exp(O(k^{1/3}\log^{2/3} n))$ traces suffice to reconstruct $k$-sparse strings, providing an improvement over the best known sequence reconstruction results when $k = o(n/\log^{2} n)$.
4. We show that $\mathrm{poly}(n)$ traces suffice if $x$ is $k$-sparse and we additionally have a “separation” promise, specifically that the indices of 1s in $x$ all differ by $\Omega(k\log n)$.
Akshay Krishnamurthy is with Microsoft Research, New York. Arya Mazumdar is with the University of California, San Diego. Andrew McGregor and Soumyabrata Pal are with the College of Information and Computer Sciences, University of Massachusetts, Amherst. Emails: {akshay,arya,mcgregor,spal}@cs.umass.edu. This work was supported in part by the National Science Foundation under CCF1642658, 1637536, 1763618, 1934846, 1909046 and 1908849. Part of this work was presented at the European Symposium on Algorithms, 2019.
1 Introduction
V. Levenshtein in [1] asked the following combinatorial question regarding reconstruction of a sequence from its subsequences: how many subsequences of a particular length are necessary and sufficient to reconstruct the original sequence? He followed up with [2] and [3] where upper and lower bounds were provided for different variations on the problem, along with efficient reconstruction algorithms. A similar question was studied in [4]: to find the minimum value of such that we can reconstruct any binary sequence provided we are given all subsequences of length . In his paper [2], Levenshtein also introduced the probabilistic version of the problem for discrete memoryless channels, stopping just short of introducing the trace reconstruction problem.
In the trace reconstruction problem, first proposed by Batu et al. [5], the goal is to reconstruct an unknown string given a set of random subsequences of . Each subsequence, or “trace”, is generated by passing through the deletion channel in which each entry of is deleted independently with probability . The locations of the deletions are not known; if they were, the channel would be an erasure channel. The central question is to find how many traces are required to exactly reconstruct with high probability.
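The deletion channel described above can be simulated in a few lines. This is a minimal sketch for intuition only; the function names `deletion_channel` and `sample_traces`, the seed handling, and the example string are illustrative assumptions, not part of the paper.

```python
import random

def deletion_channel(x, p, rng):
    """Return one trace: each entry of x is dropped independently with probability p."""
    # rng.random() is uniform on [0, 1), so an entry survives with probability 1 - p.
    return [b for b in x if rng.random() >= p]

def sample_traces(x, p, m, seed=0):
    """Draw m independent traces of x (seeded only for reproducibility)."""
    rng = random.Random(seed)
    return [deletion_channel(x, p, rng) for _ in range(m)]

if __name__ == "__main__":
    x = [1, 0, 0, 1, 1, 0, 1, 0]  # hypothetical unknown string
    for t in sample_traces(x, p=0.3, m=5):
        # Only the surviving symbols are visible, not the deleted positions.
        print(t)
```

Note that a trace carries no record of where deletions occurred, which is exactly what separates this channel from an erasure channel.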
This intriguing problem has attracted significant attention from a large number of researchers [6, 7, 5, 8, 9, 10, 11, 12, 13, 14, 15, 16]. In a recent breakthrough, De et al. [13] and Nazarov and Peres [12] independently showed that traces suffice where . This bound is achieved by a mean-based algorithm, which means that the only information used is the fraction of traces that have a 1 in each position. While is known to be optimal amongst mean-based algorithms, the best algorithm-independent lower bound is the much weaker [17].
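To make the notion of a mean-based algorithm concrete, the sketch below computes the only statistic such an algorithm may use: the fraction of traces with a 1 in each position. Treating positions beyond the end of a trace as 0 is a common convention that we assume here; the function name is hypothetical.

```python
def mean_trace_statistics(traces, n):
    """Fraction of traces that have a 1 at each of the first n positions.

    Traces shorter than n are implicitly padded with 0s (an assumed convention).
    """
    if not traces:
        raise ValueError("need at least one trace")
    counts = [0] * n
    for t in traces:
        for j, b in enumerate(t[:n]):
            counts[j] += b
    return [c / len(traces) for c in counts]

if __name__ == "__main__":
    print(mean_trace_statistics([[1, 0], [1], [0, 1, 1]], n=3))
```

A mean-based reconstruction procedure would then search for the string whose expected statistics best match these empirical averages; that search is not shown here.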
Many variants of the problem have also been considered, including (1) larger alphabets and (2) an average-case analysis where $x$ is drawn uniformly from $\{0,1\}^n$. Larger alphabets are only easier than the binary case, since we can encode the alphabet in binary, e.g., by mapping a single particular character to 1 and the rest to 0. We can then solve the binary problem and repeat the process for every character to reconstruct the entire string. In the average-case analysis, the state-of-the-art result is that traces suffice (the deletion probability is assumed to be constant in that work), whereas traces are necessary [11, 9, 17]. Very recently, and concurrent with our work, other variants have been studied, including (a) where the bits of $x$ are associated with nodes of a tree whose topology determines the distribution of traces generated [15] and (b) where $x$ is a codeword from a code with redundancy [16].
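The reduction from larger alphabets to the binary case can be sketched as follows. The argument `reconstruct_binary` is a hypothetical oracle for binary trace reconstruction (any such procedure could be plugged in); in the usage example it is replaced by a trivial stand-in that only works for noiseless traces.

```python
def reconstruct_over_alphabet(traces, alphabet, reconstruct_binary):
    """Reconstruct a string over `alphabet` using a binary reconstruction oracle.

    For each character c, every trace is mapped to a binary trace (1 where the
    symbol equals c, 0 elsewhere) and the oracle recovers the indicator of c.
    """
    masks = {}
    for c in alphabet:
        binary_traces = [[1 if s == c else 0 for s in t] for t in traces]
        masks[c] = reconstruct_binary(binary_traces)
    lengths = {len(mask) for mask in masks.values()}
    if len(lengths) != 1:
        raise ValueError("binary reconstructions disagree on the string length")
    n = lengths.pop()
    result = []
    for i in range(n):
        owners = [c for c in alphabet if masks[c][i] == 1]
        if len(owners) != 1:
            raise ValueError(f"position {i} is claimed by {len(owners)} characters")
        result.append(owners[0])
    return result

if __name__ == "__main__":
    # Stand-in oracle (assumption): with no deletions, any trace is the string.
    noiseless = [list("abca")] * 3
    print(reconstruct_over_alphabet(noiseless, "abc", lambda ts: ts[0]))
```

The oracle is invoked once per character, so the number of traces needed grows by at most a factor that accounts for a union bound over the alphabet.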
In order to develop a deeper understanding of this intriguing problem, we consider fine-grained parameterization and structured generalizations of trace reconstruction. We prove several new results for these variations that shed new light on the problem. Moreover, in studying these settings, we refine existing tools and introduce new techniques that we believe may be helpful in closing the gaps in the fully general problem.
1.1 Our Results
In all our results below, we use the term “with high probability” to mean that the statement holds with probability at least where is a term that tends to 0 as the size of the input (typically the length of the string, ) grows.
1.1.1 Parametrizations
We begin by considering parameterizations of the trace reconstruction problem. Given the important role that sparsity plays in other reconstruction problems (see, e.g., Gilbert and Indyk [18]), we first study the recovery of sparse strings. Here we prove the following result.
Theorem 1.
Let be the retention probability and assume that . If has at most non-zeros, traces suffice to recover exactly, with high probability.
As some points of comparison, note that there is a trivial upper bound, which our result improves on with a polynomially better dependence on in the exponent. The trivial bound is obtained by getting enough samples so that it is possible to obtain samples where none of the 1s are deleted. The best known result for the general case is [12, 13] and our result is a strict improvement when . Note that since we have no restrictions on in the statement, improving upon would imply an improved bound in the general setting.
Somewhat surprisingly, our actual result is considerably stronger (see Corollary 7 for a precise statement). We also obtain sample complexity in an asymmetric deletion channel, where each 0 is deleted with probability extremely close to , but each 1 is deleted with probability . With such a channel, all but a vanishingly small fraction of the traces contain only 1s, yet we are still able to exactly identify the location of every 0. Since we can accommodate this result also applies to the general case with an asymmetric channel, yielding improvements over De et al. [13] and Nazarov and Peres [12].
We elaborate more on our techniques in the next section, but the result is obtained by establishing a connection between trace reconstruction and learning binomial mixtures. There is a large body of work devoted to learning mixtures [19, 20, 21, 22, 23, 24, 25, 26, 27, 28] where it is common to assume that the mixture components are well-separated. In our context, separation corresponds to a promise that each pair of 1s in the original string is separated by a 0-run of a certain length. Our second result concerns strings with a separation promise.
Theorem 2.
If has at most 1s and each pair of consecutive 1s is separated by a 0-run of length , then, for any constant deletion probability , traces suffice to recover with high probability.
Note that reconstruction with traces is straightforward if every 1 is separated by a 0-run of length ; the basic idea is that we can identify which 1s in a collection of traces correspond to the same 1 in the original sequence and then we can use the indices of these 1s in their respective traces to infer the index of the 1 in the original string. However, reducing to separation is rather involved and is perhaps the most technically challenging result in this paper.
Here as well, we actually obtain a slightly stronger result. Instead of parameterizing by the sparsity and the separation, we instead parameterize by the number of runs, and the run lengths, where a run is a contiguous sequence of the same character. We require that each 0-run has length , where is the total number of runs. Note that this parameterization yields a stronger result since is at most if the string is sparse, but it can be much smaller, for example if the 1-runs are very long. On the other hand, the best lower bound, which is [17], considers strings with runs and run length .
Using the general approach used to prove Theorem 2, we can also prove an average case reconstruction result for sparse strings: traces suffice if each where for some sufficiently small . As mentioned above, if , it was already known that a sub-polynomial number of traces sufficed for reconstruction. However, for random strings sparsity is not necessarily helpful. In fact, if it is relatively straightforward to argue that traces are necessary since with constant probability has the form
[TABLE]
and identifying the position of the 1 requires traces.
As our last parametrization, we consider a sparse testing problem. We specifically consider testing whether the true string is or , with the promise that the Hamming distance between and , , is at most . This question is naturally related to sparse reconstruction, since the difference sequence is sparse, although of course neither string may be sparse on its own. Here we obtain the following result.
Theorem 3.
For any pair with , traces from the deletion channel with suffice to distinguish between and with high probability.
1.1.2 Generalizations
Turning to generalizations, we consider a natural multivariate version of the trace reconstruction problem, which we call matrix reconstruction. Here we receive matrix traces of an unknown binary matrix , where each matrix trace is obtained by deleting each row and each column with probability , independently. Here the deletion channel is much more structured, as there are only random bits, rather than in the sequence case. Our results show that we can exploit this structure to obtain improved sample complexity guarantees.
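The matrix deletion channel can be simulated as below. This is a sketch under the description above (one independent coin per row and one per column); the function name and the example matrix are illustrative assumptions.

```python
import random

def matrix_deletion_channel(X, p, rng):
    """Return one matrix trace: each row and each column of X is deleted
    independently with probability p."""
    # One coin per row and one per column, rather than one coin per entry.
    kept_rows = [i for i in range(len(X)) if rng.random() >= p]
    kept_cols = [j for j in range(len(X[0])) if rng.random() >= p]
    return [[X[i][j] for j in kept_cols] for i in kept_rows]

if __name__ == "__main__":
    X = [[1, 0, 1],
         [0, 1, 1],
         [1, 1, 0]]  # hypothetical unknown matrix
    rng = random.Random(0)
    for _ in range(3):
        print(matrix_deletion_channel(X, 0.3, rng))
```

Every surviving entry of a trace shares its row coin with the rest of its row and its column coin with the rest of its column, which is the structure the analysis exploits.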
In the worst case, we prove the following theorem.
Theorem 4.
For the matrix deletion channel with deletion probability ,
[TABLE]
traces suffice to recover an arbitrary matrix with high probability.
While no existing results are directly comparable, it is possible to obtain sample complexity via a combinatorial result due to Kós et al. [29]. This agrees with the results from the sequence case, but is obtained using very different techniques. Additionally, our proof is constructive, and the algorithm is actually mean-based, so the only information it requires is a set of estimates of the probability that each received entry is 1. As we mentioned, for the sequence case, both Nazarov and Peres [12] and De et al. [13] prove a lower bound for mean-based algorithms. Thus, our result provides a strict separation between matrix and sequence reconstruction, at least from the perspective of mean-based approaches.
Lastly, we consider the random matrix case, where every entry of is drawn i.i.d. from . Here we show that traces are sufficient.
Theorem 5.
For any constant deletion probability , traces suffice to reconstruct a random with high probability over the randomness in and the channel.
This result is optimal, since with traces, there is reasonable probability that a row/column will be deleted from all traces, at which point recovering this row/column is impossible. The result should be contrasted with the analogous results in the sequence case. For sequences, the best results for random strings are [9] and [17]. In light of the lower bound for sequences, it is perhaps surprising that matrix reconstruction admits sample complexity.
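The optimality argument can be made concrete with a short calculation. The sketch below assumes a $\sqrt{n}\times\sqrt{n}$ matrix, a constant deletion probability $p \in (0,1)$, natural logarithms, and $m = c\ln n$ traces; the constant $c$ is introduced here purely for illustration.

```latex
\begin{align*}
\Pr[\text{a fixed row is deleted from all } m \text{ traces}] &= p^{m}, \\
\Pr[\text{some row is deleted from all } m \text{ traces}]
  &= 1 - \left(1 - p^{m}\right)^{\sqrt{n}}
  \;\ge\; 1 - \exp\!\left(-\sqrt{n}\,p^{m}\right), \\
\sqrt{n}\,p^{\,c\ln n} &= n^{\frac{1}{2} - c\ln(1/p)} \;\ge\; 1
  \quad\text{whenever } c \le \frac{1}{2\ln(1/p)} .
\end{align*}
```

Under these assumptions, with $m \le \ln n / (2\ln(1/p))$ traces some row is missing from every trace with probability at least $1 - 1/e$, so a logarithmic number of traces is unavoidable.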
In Section 8, we show that it is possible to extend both matrix reconstruction results to tensors in a reasonably straightforward way.
1.2 Our Techniques
To prove our results, we introduce several new techniques in addition to refining and extending many existing ideas in prior trace reconstruction results.
Theorem 1 is proved via a reduction from trace reconstruction to learning the parameters of a mixture of binomial distributions. Surprisingly, this natural connection does not seem to have been observed in the earlier literature. We then use a generalization of a complex-analytic approach introduced by De et al. [13] and Nazarov and Peres [12] to prove a bound on the sample complexity of learning a binomial mixture. This generalization is to move beyond the analysis of Littlewood polynomials, i.e., polynomials with coefficients, to the case where coefficients have bounded precision. The generalization is not difficult. This is our simplest result to prove but we consider the final result to be revealing as it shows that sparsity plays a more important role than length in the complexity of trace reconstruction.
Our most technically involved result is Theorem 2. This is proved via an algorithm that constructs a hierarchical clustering of the individual 1s in all received traces according to their corresponding position in the original string. This clustering step requires a careful recursion, where in each step we ensure no false negatives (two 1s from the same origin are always clustered together) but we have many false positives, which we successively reduce. At the bottom of the recursion, we can identify a large fraction of 1s from each 1 in the original string. However, as the recursion eliminates many of the 1s, simply averaging the positions of the surviving fraction leads to a biased estimate. To resolve this, we introduce a de-biasing step which eliminates even more 1s, but ensures the survivors are unbiased, so that we can accurately estimate the location of each 1 in the original string. The initial recursion has levels, which is critical since the debiasing step involves conditioning on the presence of 1s in a trace, which only happens with probability .
Theorem 3 leverages combinatorial arguments about -decks (the multiset of subsequences of a string) due to Krasikov and Roditty [4]. The result demonstrates the utility of these combinatorial tools in trace reconstruction. As further evidence for the utility of combinatorial tools, the connection to -decks was also used by Ban et al. [30] in independent concurrent work on the deletion channel.
For Theorem 4, we return to the complex-analytic approach and extend the Littlewood polynomial argument to multivariate polynomials. Since the unknown matrices are , we can use a natural bivariate polynomial of degree , which yields the improvement. However, the result of Borwein and Erdélyi [31] used in previous work on trace reconstruction applies only to univariate polynomials. Our key technical result is a generalization of their result to accommodate bivariate Littlewood polynomials, which we then use in a statistical test to identify the unknown matrix.
For Theorem 5, using an averaging argument and exploiting randomness in the original matrix, we construct a statistical test to determine if two rows (or columns) from two different traces correspond to the same row (column) in the original matrix. We show that this test succeeds with overwhelming probability, which lets us align the rows and columns in all traces. Once aligned, we know which rows/columns were deleted from each trace, so we can simply read off the original matrix .
Notation
Throughout, is the length of the binary string being reconstructed, is the number of 0s, is the number of 1s, i.e., the sparsity or weight. For matrices, is the total number of entries, and we focus on square matrices. For most of our results, we assume that are known since, if not, they can easily be estimated using a polynomial number of traces. Let denote the deletion probability when the 1s and 0s are deleted with the same probability. We also study a channel where the 1s and 0s are deleted with different probabilities; in this case, is the deletion probability of a 0 and is the deletion probability of a 1. We refer to the corresponding channel as the -Deletion Channel or the asymmetric deletion channel. It will also be convenient to define and as the corresponding retention probabilities. Throughout, denotes the number of traces. For a natural number we use the notation .
2 Sparsity and Learning Binomial Mixtures
We begin with the sparse trace reconstruction problem, where we assume that the unknown string has at most 1s. Our analysis for this setting is based on a simple reduction from trace reconstruction to learning a mixture of binomial distributions, followed by a new sample complexity guarantee for the latter problem. This approach yields two new results: first, we obtain an sample complexity bound for sparse trace reconstruction, and second, we show that this guarantee applies even if the deletion probability for 0s is very close to .
To establish our results, we introduce a slightly more challenging channel which we refer to as the Austere Deletion Channel. The bulk of the proof analyzes this channel, and we obtain results for the channel via a simple reduction.
Theorem 6 (Austere Deletion Channel Trace Reconstruction).
In the Austere Deletion Channel, all but exactly one 0 are deleted (the choice of which 0 to retain is made uniformly at random) and each 1 is deleted with probability . For such a channel,
[TABLE]
traces suffice for sparse trace reconstruction with high probability where , provided .
We will prove this result shortly, but we first derive our main result for this section as a simple corollary.
Corollary 7 (Deletion Channel Trace Reconstruction).
For the -deletion channel,
[TABLE]
traces suffice for sparse trace reconstruction with high probability where and .
Proof.
This follows from Theorem 6. By focusing on just a single 0, it is clear that the probability that a trace from the -deletion channel contains at least one 0 is at least . If among the retained 0s we keep one at random and remove the rest, we generate a sample from the austere deletion channel. Thus, with samples from the deletion channel, we obtain at least samples from the austere channel and the result follows. Note that Theorem 1 is a special case where . ∎
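The conversion used in this proof, from a trace of the asymmetric deletion channel to a sample of the austere channel, can be sketched as follows. The function names are hypothetical, and `p0` and `p1` denote the deletion probabilities of a 0 and a 1 respectively; traces that contain no 0 are discarded.

```python
import random

def asymmetric_deletion_trace(x, p0, p1, rng):
    """Delete each 0 with probability p0 and each 1 with probability p1."""
    return [b for b in x if rng.random() >= (p1 if b == 1 else p0)]

def to_austere_sample(trace, rng):
    """Keep one retained 0 chosen uniformly at random and drop the other 0s.

    Returns None when the trace contains no 0, in which case it is discarded.
    """
    zero_positions = [i for i, b in enumerate(trace) if b == 0]
    if not zero_positions:
        return None
    keep = rng.choice(zero_positions)
    return [b for i, b in enumerate(trace) if b == 1 or i == keep]

if __name__ == "__main__":
    rng = random.Random(0)
    x = [1, 0, 1, 1, 0, 0, 1]  # hypothetical unknown string
    trace = asymmetric_deletion_trace(x, p0=0.5, p1=0.2, rng=rng)
    print(trace, to_austere_sample(trace, rng))
```

By symmetry among the 0s, the retained 0 is uniformly distributed over the 0s of the original string, which is what the austere channel requires.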
Remark 1.
Note that the case where is constant (a typical setting for the problem) and is not covered by the corollary. However, in this case a simpler approach applies to argue that traces suffice: with probability no 1s are deleted in the generation of the trace and given such traces, we can infer the original position of each 1 based on the average position of each 1 in each trace.
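The averaging idea in this remark can be sketched as below. The sketch assumes that the sparsity `k` and the retention probability `q` of a 0 are known, keeps only traces in which all `k` 1s survive, and rescales the average index by `q`; the rescaling step and all names are our illustrative assumptions rather than the paper's stated procedure.

```python
def _one_indices(trace):
    """0-based indices of the 1s in a trace."""
    return [i for i, b in enumerate(trace) if b == 1]

def estimate_one_positions(traces, k, q):
    """Estimate the 0-based positions of the k ones of the original string.

    In a trace that retains all k ones, the j-th one (0-based) sits at index
    j + Binomial(z_j, q), where z_j is the number of 0s before it originally.
    """
    full = [t for t in traces if sum(t) == k]
    if not full:
        return None
    estimates = []
    for j in range(k):
        avg = sum(_one_indices(t)[j] for t in full) / len(full)
        zeros_before = round((avg - j) / q)  # invert the mean j + q * z_j
        estimates.append(j + zeros_before)
    return estimates

if __name__ == "__main__":
    # Noiseless sanity check: with q = 1 the positions are read off directly.
    print(estimate_one_positions([[0, 1, 0, 0, 1]] * 3, k=2, q=1.0))
```

With enough qualifying traces, binomial concentration makes each rounded estimate exact; how many are enough depends on the string length and on `q`.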
Remark 2.
Note that the weak dependence on ensures that as long as , we still have the bound. Thus, our result shows that sparse trace reconstruction is possible even when zeros are retained with super-polynomially small probability.
2.1 Reduction to Learning Binomial Mixtures
We prove Theorem 6 via a reduction from austere deletion channel trace reconstruction to learning binomial mixtures. Given a string of length , let be the number of ones before the zero in . For example, if then Note that the multi-set uniquely determines , that each , and that the multi-set has size . The reduction from trace reconstruction to learning binomial mixtures is appealingly simple:
1. Given traces from the austere channel, let be the number of leading ones in .
2. Observe that each is generated by a uniform mixture of where . (Since the are not necessarily distinct, some of the binomial distributions are the same.) Hence, learning from allows us to reconstruct .
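The two steps of the reduction, and the fact that the multi-set of counts determines the string, can be sketched as follows. The names `ones_before_zeros`, `leading_ones`, and `string_from_counts` are hypothetical, and the sketch assumes the string length `n` is known.

```python
def ones_before_zeros(x):
    """For each 0 of x (left to right), the number of 1s that precede it."""
    counts, ones = [], 0
    for b in x:
        if b == 1:
            ones += 1
        else:
            counts.append(ones)
    return counts

def leading_ones(trace):
    """Number of 1s before the first 0 of a trace (all 1s if there is no 0)."""
    count = 0
    for b in trace:
        if b != 1:
            break
        count += 1
    return count

def string_from_counts(counts, n):
    """Rebuild the length-n string from the multi-set of per-zero counts."""
    k = n - len(counts)  # the number of 1s
    x, ones_so_far = [], 0
    for c in sorted(counts):
        x.extend([1] * (c - ones_so_far))
        ones_so_far = c
        x.append(0)
    x.extend([1] * (k - ones_so_far))
    return x

if __name__ == "__main__":
    x = [1, 0, 0, 1, 1, 0, 1]  # hypothetical string
    r = ones_before_zeros(x)
    print(r, string_from_counts(r, len(x)) == x)
```

For an austere trace that retained a given 0, `leading_ones` is a binomial sample whose number of trials is that 0's count, so recovering the mixture components recovers the counts and hence the string.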
We will say that a number has -precision if where and . To obtain Theorem 6, we establish the following new guarantee for learning binomial mixtures.
Theorem 8 (Learning Binomial Mixtures).
Let be a mixture of binomials:
[TABLE]
where are distinct integers, the values have precision, and . Then samples suffice to learn the parameters exactly with high probability.
Proof.
Let be a mixture where the samples are drawn from , where are distinct and the probabilities where . Consider the variational distance between and where
[TABLE]
We will show that the variational distance between and is at least
[TABLE]
Since there are at most possible choices for the parameters of , standard union bound arguments show that
[TABLE]
samples are sufficient to distinguish from all other mixtures.
To prove the total variation bound, observe that by applying the binomial formula, for any complex number , we have
[TABLE]
where . Let and apply the triangle inequality to obtain:
[TABLE]
Note that is a non-zero degree polynomial with coefficients in the set
[TABLE]
We would like to find a such that has large modulus but is small, since this will yield a total variation lower bound. We proceed along similar lines to Nazarov and Peres [12] and De et al. [13]. It follows from Corollary 3.2 in Borwein and Erdélyi [31] that there exists such that
[TABLE]
for some constant . For such a value of , Nazarov and Peres [12] show that
[TABLE]
for some constant . Therefore,
[TABLE]
For , by an application of the Chernoff bound, , so we obtain
[TABLE]
[TABLE]
where the second equality follows from the assumption that (which we will ensure when we set ) since,
[TABLE]
Set
[TABLE]
for some sufficiently large constant . This ensures that the first term of Eqn. 1 is
[TABLE]
Note that
[TABLE]
and so by the assumption that we may set the constant large enough such that as required. The second term of Eqn. 1 is a lower order term given the assumption on and thus we obtain the required lower bound on the total variation distance. ∎
Theorem 6 now follows from Theorem 8, since in the reduction, we have binomials, one per 0 in , is a multiple of and importantly, we have . The key is that we have a polynomial with degree rather than a degree polynomial as in the previous analysis.
Remark
If all are equal, Theorem 8 can be improved to by using a more refined bound from Borwein and Erdélyi [31] in our proof. This follows by observing that if , then is a multiple of a Littlewood polynomial and we may use the stronger bound , see Borwein and Erdélyi [31].
2.2 Lower Bound on Learning Binomial Mixtures
We now show that the exponential dependence on in Theorem 8 is necessary.
Theorem 9 (Binomial Mixtures Lower Bound).
There exist subsets
[TABLE]
such that if and , then . Thus, samples are required to distinguish from with constant probability.
Proof.
Previous work [12, 13] shows the existence of two strings such that where is the expected value of the th element (element at th position counted from beginning) of a string formed by applying the -deletion channel to the string . We may assume since otherwise
[TABLE]
which would contradict the assumption .
Consider and , where () is the number of coordinates preceding the th 1 in (). Note that
[TABLE]
and so
[TABLE]
which proves the result. ∎
3 Well-Separated Sequences
We now prove Theorem 2, showing that traces suffice for reconstruction of a -sparse string when there are 0s between each consecutive 1. For clarity of exposition, we are going to prove the statement of Theorem 2 for . The proof follows verbatim for any other constant . We call such sequences of 0s the 0-runs of the string. We also refer to the length of the shortest 0-run as the gap of the string .
Theorem (Restatement of Theorem 2).
Let be a -sparse string of length and gap at least for a large enough . Then traces from the -Deletion Channel suffice to recover with high probability.
In Section 3.1, we present a high-level overview of the algorithm and the analysis to provide intuition. In Section 3.2 we describe the algorithm in detail, state the key lemmas, and explain how to set the parameters. Due to the technical nature of the analysis, full details, including proofs of the lemmas, are deferred to Appendix A.
3.1 A Recursive Hierarchical Clustering Algorithm and Its Analysis: Overview
Let denote the positions (index of the coordinate from the left) of the 1s in the original string . Let denote the multi-set of all positions of all received 1s and call . We will construct a graph on vertices where every vertex is associated with a received 1. We decorate each vertex with a number , which is the position of the associated received 1. Each vertex also has an unknown label denoting the corresponding 1 in the original string.
At a high level, our approach uses the observed values to recover the unknown labels . Once this “alignment” has been performed, the original string can be recovered easily, since the average of is an unbiased estimator for .
A starting observation
Our first observation is a simple fact about binomial concentration, which we will use to define the edge set in : by the Chernoff bound, with high probability, for every vertex , if then we must have for some constant . Defining the edges in to be then guarantees that all vertices with are connected. This immediately yields an algorithm for the much stronger gap condition , since with such separation, no two vertices with will have an edge. Therefore, the connected components reveal the labeling so that traces suffice with .
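The clustering in this starting observation can be sketched as follows, assuming edges join two received 1s whenever their positions differ by at most a threshold `tau`. Because the vertices lie on a line, the connected components of such a graph are obtained by sorting and cutting wherever two consecutive positions are more than `tau` apart. The function name and the example values are illustrative.

```python
def threshold_components(z, tau):
    """Connected components of the graph on vertices 0..len(z)-1 with an edge
    (v, w) whenever |z[v] - z[w]| <= tau.

    Returns a list of components, each a list of vertex indices ordered by z.
    """
    order = sorted(range(len(z)), key=lambda v: z[v])
    components, current = [], []
    for v in order:
        # A gap larger than tau between sorted neighbours separates components.
        if current and z[v] - z[current[-1]] > tau:
            components.append(current)
            current = []
        current.append(v)
    if current:
        components.append(current)
    return components

if __name__ == "__main__":
    positions = [10, 52, 11, 50, 9, 100]  # hypothetical received positions
    print(threshold_components(positions, tau=3))
```

Under a sufficiently strong gap condition each component contains received 1s with a single label, so averaging positions within a component (after accounting for the retention probability) estimates one original position.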
Intuitively, we have constructed a clustering of the received 1s that corresponds to the underlying labeling. To tolerate a weaker gap condition, we proceed recursively, in effect constructing a hierarchical clustering. However there are many subtleties that must be resolved.
The first recursion
To proceed, let us consider the weaker gap condition of . In this regime, still maintains a consistency property that for each all vertices with are in the same connected component, but now a connected component may have vertices with different labels, so that each connected component identifies a contiguous set of the original 1s. Moreover, due to the sparsity assumption, must have length, defined as , at most . Therefore if we can correctly identify every trace that contains the left-most and right-most 1 in , we can recurse and are left to solve a subproblem of length . Appealing to our starting observation, this can be done with a gap of .
The challenge for this step is in identifying every trace that contains the left-most and right-most 1 in , which we call and respectively. This is important for ensuring a “clean” recursion, meaning that the traces used in the subproblem are generated by passing exactly the same substring through the deletion channel. To solve this problem we use a device that we call a Length Filter. For every trace, consider the subtrace that starts with the first received 1 in and ends with the last received 1 in (this subtrace can be identified using ). If the trace contains then the length of this subtrace is where is the distance between in the original string. On the other hand, if the subtrace does not contain both end points, then the length is where . Since we know that and we are operating with gap condition , binomial concentration implies that with high probability we can exactly identify the subtraces containing and .
Further recursion
The difficulty in applying a second recursive step is that when the length filter cannot isolate the subtraces that contain the leftmost and rightmost 1s for a block , so we cannot guarantee a clean recursion. However, substrings that pass through the filter are only missing a short prefix/suffix which upper bounds any error in the indices of the received 1s. We ensure consistency at subsequent levels by incorporating this error into a more cautious definition of the edge set (in fact the additional error is the same order as the binomial deviation at the next level, so it has negligible effect). In this way, we can continue the recursion until we have isolated each 1 from the original string. The lower bound on run length arises since the gap at level of the recursion, , is related to the gap at level via with , and this recursion asymptotes at .
The last technical challenge is that, while we can isolate each original 1, the error in our length filter introduces some bias into the recursion, so simply averaging the values of the clustered vertices does not accurately estimate the original position. However, since we have isolated each 1 into pure clusters, for any connected component corresponding to a block of 1s, we can identify all traces that contain the first and last 1 in the block. Applying this idea recursively from the bottom up allows us to debias the recursion and accurately estimate all positions.
3.2 The algorithm in detail: recursive hierarchical clustering
We now describe the recursive process in more detail. Let us define the thresholds:
[TABLE]
which will be used in the length filter and in the definitions of the edge set. Observe that with , we have . Let denote the traces. We will construct a sequence of graphs on the vertex sets , where each vertex corresponds to a received 1 in some trace and is decorated with its position and the unknown label . The round of the algorithm is specified as follows with , as the multi-set of all received 1s and .
1. Define with edge set $E_{d}=\bigcup_{j}\{(v,w) : v,w\in V_{d}\cap C_{j}^{(d-1)} \text{ and } |z_{v}^{(d)}-z_{w}^{(d)}|\leq\tau_{d}\}$.
2. Extract connected components from .
3. For each connected component , extract subtraces where is the substring of starting with the first 1 in and ending with the last 1 in . Formally, with and , we define .
4. Length Filter: Define . If
[TABLE]
delete all vertices with . Let be the multi-set of all surviving vertices.
5. For , define .
See Algorithm 1 for pseudocode. We note that corresponds to a shifted index of the received 1 associated with vertex . Intuitively, we shift by removing a prefix of the trace , which provides a form of noise reduction.
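A single clustering round can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1: it covers only the thresholded edge set and the connected components for one level, it omits the restriction to components of the previous level, the length filter, and the index shift, and the names `deletion_channel`, `one_positions`, `threshold_components`, `tau`, and `rng` are hypothetical.

```python
import random


def deletion_channel(x, p, rng):
    """Delete each bit of x independently with probability p."""
    return [b for b in x if rng.random() >= p]


def one_positions(trace):
    """Positions (0-indexed) of the received 1s in a single trace."""
    return [i for i, b in enumerate(trace) if b == 1]


def threshold_components(values, tau):
    """Connected components of the graph with an edge (v, w) iff
    |values[v] - values[w]| <= tau.

    On a line, sorting the values and splitting wherever two consecutive
    values differ by more than tau yields exactly these components.
    Each component is returned as a list of indices into `values`.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    components, current = [], []
    for i in order:
        if current and values[i] - values[current[-1]] > tau:
            components.append(current)
            current = []
        current.append(i)
    if current:
        components.append(current)
    return components
```

Sorting is used here only because the vertex decorations are scalars; a general graph search would give the same components at higher cost.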
We analyze the procedure via a sequence of lemmas. The first one establishes a basic consistency property: that two 1s originating from the same source 1 are always clustered together.
Lemma 10 (Consistency).
At level let for each . Then with high probability, for each and there exists some component at level such that .
The next lemma provides a length upper bound on any component, which is important for the recursion. At a high level since we are using a threshold at level and the string is -sparse, no connected component can span more than positions.
Lemma 11 (Length Bound).
At level , the following holds with probability at least : For every component at level , we have . Moreover if is a contiguous subsequence of with , then .
Finally we characterize the length filter.
Lemma 12 (Length Filter).
Assume . At level , the following holds with probability at least : For a component at level , let be the maximal contiguous subsequence of such that . Define and . Then for any , if and are present in , then survives to round , that is . Moreover, for any , let denote the original position of the first 1 from that is also in the trace . Then we have .
The lemmas are all interconnected and proved formally in Appendix A. It is important that the error incurred by the length filter is which is exactly the binomial deviation at level . Thus the threshold used to construct accounts for both the length filter error and the binomial deviation. This property, established in Lemma 12, is critical in the proof of Lemma 10.
For the hierarchical clustering, observe that after iterations, we have . With gap condition and applying Lemma 10, this means that the connected components at level each correspond to exactly one 1 in the original string. Moreover since the length filter preserves every trace containing the left-most and right-most 1 in the component, the probability that a subtrace passes through the length filter is at least . Hence, after levels, the expected number of surviving traces in each cluster is . Thus for each index corresponding to a 1 in the original string, our recursion identifies at least vertices such that .
Removing Bias
The last step in the algorithm is to overcome the bias introduced by the length filter. The de-biasing process works upward from the bottom of the recursion. Since we have isolated the vertices corresponding to each 1 in the original string, for a component at level , we can identify all subtraces that survived to this level that contain the first and last 1 of the corresponding block . Thus, we can eliminate all subtraces that erroneously passed this length filter.
Working upwards, consider a component that corresponds to a block of 1s in the original string. Since we have performed further clustering, we have effectively partitioned into sub-blocks . We would like to identify exactly the subtraces that survived to level that contain the first and last 1 of , but unfortunately this is not possible due to a weak gap condition. However, by induction, we can exactly identify all subtraces that survive to level that contain the first and last 1 of the first and last sub-block of , namely and . Thus we can de-bias the length filter at level by filtering based on a more stringent event, namely the presence of the nodes required to de-bias the first and last blocks and . In total to de-bias all length filters above a particular component, we require the presence of nodes, which happens with probability . Thus we can debias with only a polynomial overhead in sample complexity. See Figure 1 for an illustration.
4 Applications of the Well-Separated Strings Result and Methodology
In this section, we present two applications of the results and methodology developed in the previous section.
4.1 Strengthening to a Parameterization by Runs
We next strengthen Theorem 2 to show that traces suffice under the assumption that each 0-run has length where , in the string being reconstructed. Observe that this is a weaker assumption than assuming has sparsity and each one is separated by a 0-run of length , since always, but can be much less than .
Theorem 13.
For the -Deletion Channel, traces suffice with high probability if the lengths of the 0-runs are where is the number of runs in .
The proof is via a reduction to the -sparse case in the previous sections. Let be the string formed by replacing every run of 1s in by a single 1. We first argue that we can reconstruct with high probability using traces generated by applying the -Deletion Channel to . We will prove this result for the case since otherwise traces are sufficient even with no gap promise. (Specifically, if , with probability at least a trace also has runs. Given traces with runs we can estimate each run length because we know the run in each such trace corresponds to the run in the original string.) Observe that with traces, if every 0-run in has length at least for some sufficiently large constant , then a bit in every 0-run of appears in every trace with high probability. Conditioned on this event, no two 1’s that originally appeared in different runs of are adjacent in any trace. Next, replace each run of 1s in each trace with a single 1. The end result is a set of traces distributed as if we had deleted each 0 in with probability and each 1 in with probability where is the length of the run that the 1 belonged to in . This channel is not equivalent to the -Deletion channel, but our analysis for the sparse case (which only depends on the alignment of 1s using the deletion properties of the 0s) continues to hold even if the deletion probability of each 1 is different. Thus we can apply Theorem 2 to recover , and the sparsity of is at most . Since the algorithm identifies corresponding 1s in in the different traces, we can then estimate the length of the 1-runs in that were collapsed to each single 1 of by looking at the lengths of the corresponding 1-runs in the traces of before they were collapsed.
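The run-collapsing step of this reduction is easy to make concrete. The sketch below is illustrative only; the function name `collapse_one_runs` is hypothetical. It replaces each maximal run of 1s by a single 1 and records the run lengths, which is the information later used to estimate the original 1-run lengths.

```python
from itertools import groupby


def collapse_one_runs(bits):
    """Replace every maximal run of 1s by a single 1; 0s are kept as-is.

    Returns the collapsed sequence together with the length of each
    collapsed 1-run, listed in order of appearance.
    """
    collapsed, run_lengths = [], []
    for bit, group in groupby(bits):
        length = len(list(group))
        if bit == 1:
            collapsed.append(1)
            run_lengths.append(length)
        else:
            collapsed.extend([0] * length)
    return collapsed, run_lengths
```

For example, `collapse_one_runs([1, 1, 0, 0, 1, 1, 1, 0])` yields the collapsed sequence `[1, 0, 0, 1, 0]` with run lengths `[2, 3]`.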
4.2 Reconstruction of random sparse strings with polynomial traces
Suppose we have an unknown string such that every element of is sampled uniformly and independently according to for some sufficiently small . Again, we send through the deletion channel where every bit is deleted with probability and observe random traces. We have the following theorem characterizing the sufficient number of traces required to recover .
Theorem 14.
traces are sufficient to recover with high probability if every element of is drawn randomly according to for where is some small constant.
Proof.
Let denote the positions (index of the coordinate from the left) of the 1s in the original string . Let denote the multi-set of all positions of all received 1s and call . We construct a graph on vertices where every vertex is associated with a received 1. We decorate each vertex with a number , which is the position of the associated received 1. Each vertex also has an unknown label denoting the corresponding 1 in the original string. Finally, the edges in are defined as follows: two vertices will have an edge if for some appropriate large constant . Consider the original string partitioned into contiguous segments each of length . In that case, notice that
[TABLE]
Taking a union bound over all sets of consecutive segments ( of them), we get that no consecutive segments should all include 1’s with probability at least . We now have the following two claims:
Claim 1.
For any two vertices such that and , they will never have an edge with high probability.
Proof.
We will prove this claim by contradiction. Suppose indeed have an edge which must imply that because of the definition of graph . Therefore we must have by using Chernoff bound
[TABLE]
we can take a union bound over all vertices of the graph to conclude that for all vertices of the graph . In that case,
[TABLE]
which is a contradiction to the fact that . ∎
Therefore two 1’s in the original string which are separated by at least will never have an edge in the graph .
Claim 2.
For such that , there will exist an edge between and in the graph with high probability.
Proof.
For two vertices such that (implying that ), we must have
[TABLE]
with probability at least . Again, we can take a union bound over all vertices and over all traces to ensure that for such that , there will exist an edge between and in the graph . ∎
Further, the total number of 1’s in a particular segment of the string of length , denoted by the random variable is sampled according to
[TABLE]
Therefore, we have and we can further use the Chernoff bound to conclude that with probability at least . Taking a union bound, we can say that all segments of the string of length have at most 1’s with probability of failure at most . In that case, fix a particular connected component in the graph so that we can focus on reconstructing the contiguous sub-sequence of corresponding to the component . From our previous analysis, we can ensure that
[TABLE]
since at most contiguous segments will include 1 in all of them. Moreover the total number of 1’s in the component is at most . The probability that in a particular trace, all the 1’s in the component will appear is at least and from now on, we will only consider traces that have all the 1’s present. Subsequently, if the total number of traces used is , then the number of traces containing all the 1’s in is at least with exponentially high probability. Using the Binomial Mean Estimator (defined in Appendix A) on this subset of traces containing all the 1’s from , we can recover the length of all the 0-runs in the component with probability at least (after taking a union bound over at most 0-runs in ). We can repeat this procedure to reconstruct the substrings of corresponding to all the components in the graph .
In order to reconstruct the length of the run of 0’s between two distinct components , we consider only those traces in which all the 1’s corresponding to both have appeared. There are at most such 1’s and as before, we can use traces to obtain traces containing all the 1’s in . Subsequently, using the Binomial Mean Estimator, we can reconstruct the length of the 0-run between . Thus we can reconstruct the entire string with probability of failure at most . Setting appropriately results in a failure probability of . ∎
5 Bounded Hamming Distance
In this section, we turn to the sparse testing problem. We show that it is possible to distinguish between two strings and with Hamming distance , given traces. This question is naturally related to sparse reconstruction, since the difference string is at most sparse, but distinguishing two strings from traces is also at the core of our analysis in Section 2, as well as the analysis of Nazarov and Peres [12] and De et al. [13]. In particular given a testing routine, reconstruction simply requires applying the union bound.
In the binary symmetric channel (where each bit is flipped independently with some probability), distinguishing between two strings is easier if the Hamming distance is larger, since the two strings are farther apart. However, it is unclear if this intuition carries over to the deletion channel. In particular, the number of traces required for testing is unlikely to even be monotonic in the Hamming distance; if the Hamming distance is odd, then and have different Hamming weight, and we can estimate the Hamming weight using just traces.
Our analysis uses a combinatorial result about -decks due to Krasikov and Roditty [4] that is defined below, along with an approach first used in McGregor et al. [14].
Definition 1.
The -deck of a string is the multi-set of all length subsequences of the string.
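To make the definition concrete, the following sketch computes the deck by brute force. It is an illustration rather than part of the paper; the function name `k_deck` and the parameter name `k` (the subsequence length) are our own labels, and the enumeration is exponential in general, so it is only suitable for short strings.

```python
from collections import Counter
from itertools import combinations


def k_deck(s, k):
    """Multiset of all length-k subsequences of the string s.

    Each choice of k index positions contributes one subsequence, so the
    multiplicities sum to C(len(s), k).
    """
    return Counter(
        "".join(s[i] for i in idx)
        for idx in combinations(range(len(s)), k)
    )
```

For instance, the strings `"01"` and `"10"` have identical decks for `k = 1` but different decks for `k = 2`, which shows in miniature why the subsequence length must be large enough for the deck to identify the string.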
Theorem 15 (Krasikov and Roditty [4]).
No two strings of length have the same -deck if .
Theorem 16.
The -deck of a binary string can be determined exactly with traces from the symmetric deletion channel with high probability assuming .
Proof.
We argue that sampling length -subsequence of a string is sufficient to reconstruct the -deck with high probability. The result then follows because if , then with constant probability a trace generated by the deletion channel has length at least and hence we can take a random subsequence of such a trace as a random subsequence from .
Let be the number of times that appears as a subsequence of . Then, let be the number of times is generated if we sample subsequences of length uniformly at random. and by an application of the Chernoff bound,
[TABLE]
where the last line follows given and . Hence, by taking the union bound over all sequences , it follows that we can determine the frequency of all length subsequences with high probability.
∎
Theorem 3 follows directly from Theorem 15 and Theorem 16.
Theorem (Restatement of Theorem 3).
For all such that ,
[TABLE]
traces are sufficient to distinguish between and with high probability.
As noted earlier, if is odd then traces suffice. Also, regardless of the Hamming distance, if the first and second positions (say and ) at which and differ are at least apart, then it is easy to show that the expected weight of the length prefix of the traces differs by and hence we can distinguish and with traces.
6 Reconstructing Arbitrary Matrices
Recall that in the matrix reconstruction problem, we are given samples of a matrix passed through a matrix deletion channel, which deletes each row and each column independently with probability . In this section we prove Theorem 4.
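For concreteness, the matrix deletion channel described above can be simulated as follows. This is a minimal sketch; the function name `matrix_deletion_channel` and the representation of a matrix as a list of rows are assumptions made for illustration.

```python
import random


def matrix_deletion_channel(matrix, p, rng):
    """Return one matrix trace.

    Each row and each column is deleted independently with probability p;
    the surviving rows and columns keep their relative order.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    kept_rows = [i for i in range(n_rows) if rng.random() >= p]
    kept_cols = [j for j in range(n_cols) if rng.random() >= p]
    return [[matrix[i][j] for j in kept_cols] for i in kept_rows]
```

Note that an entry survives only if both its row and its column survive, so entries within a trace are not deleted independently of one another.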
Theorem (Restatement of Theorem 4).
For matrix reconstruction, traces suffice with high probability to recover an arbitrary matrix , where is the deletion probability and .
The bulk of the proof involves designing a procedure to test between two matrices and . This test is based on identifying a particular received entry where the traces must differ significantly; to show this, we analyze a certain bivariate Littlewood polynomial. Equipped with this test, we can apply a union bound and simply search over all pairs of matrices to recover the matrix.
For a matrix , let denote a matrix trace. Let us denote the entry of the matrix as , an indexing protocol we adhere to for every matrix. For two complex numbers , observe that
[TABLE]
Thus, for two matrices , we have
[TABLE]
where we are rebinding and . Observe that is a bivariate Littlewood polynomial; all coefficients are in , and the degree is in each variable. For such polynomials, we have the following estimate, which extends a result due to Borwein and Erdélyi [31] for univariate polynomials.
Lemma 17.
Let be a non-zero Littlewood polynomial of degree in each variable. Then,
[TABLE]
for some where , and is a universal constant.
Proof.
Fix and define the polynomial
[TABLE]
We use the maximum modulus principle, which states the following: for any holomorphic function , the modulus of , i.e. , does not have a strict local maximum in the interior of its domain and therefore achieves its maximum value on the boundary of its domain. We first show by an iterated application of the maximum modulus principle that there exists on the unit disk such that . First factorize where is chosen such that has no common factors of . Since has non-zero coefficients, this implies that is a non-zero univariate polynomial. Further factorize so that terms in have no common factors of . is also a Littlewood polynomial and moreover it has a non-zero leading term, so that . Thus by the maximum modulus principle:
[TABLE]
Now, for any we have
[TABLE]
where we are using the fact that . This proves the lemma, since we may choose such that for .
∎
Let denote the arc specified in Lemma 17. For any , Nazarov and Peres [12] provide the following estimate for the modulus of :
[TABLE]
Using these two estimates, we may sandwich by
[TABLE]
This implies that there exists some coordinate such that
[TABLE]
where the second inequality follows by optimizing for .
The remainder of the proof follows the argument of [12]: Since we have witnessed significant separation between the traces received from and those received from , we can test between these cases with samples (via a simple Chernoff bound). Since we do not know which of the matrices is the truth, we actually test between all pairs, where the test has no guarantee if neither matrix is the truth. However, via a union bound, the true matrix will beat every other in these tests and this only introduces a factor in the sample complexity.
7 Reconstructing Random Matrices
In this section, we prove Theorem 5: traces suffice to reconstruct a random matrix with high probability for any constant deletion probability . This is optimal since traces are necessary to just ensure that with high probability, every bit appears in at least one trace.
Our result is proved in two steps. We first design an oracle that allows us to identify when two rows (or two columns) in different matrix traces correspond to the same row (resp. column) of the original matrix. We then use this oracle to identify which rows and columns of the original matrix have been deleted to generate each trace. This allows us to identify the original position of each bit in each trace. Hence, as long as each bit is preserved in at least one trace (and traces is sufficient to ensure this with high probability), we can reconstruct the entire original matrix.
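The coverage requirement in the last sentence can be quantified with a short union-bound calculation. The sketch below assumes an n × n matrix and writes q for the probability that a single row (or column) is preserved; both the function name `traces_for_coverage` and the target failure probability `delta` are hypothetical names introduced here. A fixed entry appears in a given trace with probability q², so it is missed by all m traces with probability (1 − q²)^m, and a union bound over the n² entries gives the condition n²(1 − q²)^m ≤ delta.

```python
import math


def traces_for_coverage(n, q, delta):
    """Smallest m such that n*n*(1 - q*q)**m <= delta.

    By a union bound over the n*n entries, m traces then contain every
    entry of the matrix at least once with probability >= 1 - delta.
    """
    if n < 1 or not (0.0 < q <= 1.0) or not (0.0 < delta < 1.0):
        raise ValueError("need n >= 1, 0 < q <= 1 and 0 < delta < 1")
    if q == 1.0:
        return 1  # nothing is ever deleted
    miss = 1.0 - q * q  # probability that one trace misses a fixed entry
    return max(1, math.ceil(math.log(n * n / delta) / -math.log(miss)))
```

For constant q and polynomially small delta, the returned value grows like a constant times log n, which is consistent with the logarithmic trace count discussed in this section.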
7.1 Steps to reconstruct the matrix
Oracle for Identifying Corresponding Rows/Columns
We will first design an oracle that given two strings and distinguishes, for any constant , with high probability between the cases:
Case 1:
and are traces generated by the deletion channel with preservation probability from the same random string
Case 2:
and are traces generated by the deletion channel with preservation probability from independent random strings
If and are two rows (or two columns) from two different matrix traces, then this test determines whether and correspond to the same or different row (resp. column) of the original matrix. In Section 7.2, we show how to perform this test with failure probability at most . In fact, the failure probability can be made exponentially small but a polynomially small failure probability will be sufficient for our purposes.
Using the Oracle for Reconstruction
Given traces we can ensure that every bit of appears in at least one of the matrix traces with high probability. We then use this oracle to associate each row in each trace with the rows in other traces that are subsequences of the same original row. This requires at most applications of the oracle and so, by the union bound, this can be performed with failure probability at most where the inequality applies for sufficiently large .
After using the oracle to identify corresponding rows amongst the different traces we group all the rows of the traces into groups where the expected size of each group is . We next infer which group corresponds to the row of for each . Let be the bijection between groups and that we are trying to learn, i.e., if the group corresponds to the row of . It suffices to determine whether or for each pair . If there exists a matrix trace that includes a row in and a row in then we can infer the relative ordering of and based on whether the row from appears higher or lower in than the row in . The probability there exists such a trace is and we can learn the bijection with high probability.
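The ordering step can be sketched in a few lines, assuming the oracle has already labelled each row of each trace with its group. The function name `recover_order` and the input format (one list of group labels per trace, listed top to bottom) are hypothetical. Because every trace preserves the original row order, a single co-occurrence of two groups reveals their relative order, and once every pair has been compared the rank of a group equals the number of groups seen above it.

```python
def recover_order(traces):
    """Recover the original order of the row groups.

    traces: list of lists; each inner list holds the group labels of the
    rows of one matrix trace, from top to bottom.
    Returns the labels in original order, or raises ValueError if some
    pair of groups never appears together in a trace.
    """
    labels = sorted({g for t in traces for g in t})
    before = set()  # (a, b) means a was observed above b in some trace
    for t in traces:
        for i in range(len(t)):
            for j in range(i + 1, len(t)):
                before.add((t[i], t[j]))
    for a in labels:
        for b in labels:
            if a < b and (a, b) not in before and (b, a) not in before:
                raise ValueError(f"groups {a!r} and {b!r} never co-occur")
    # With all pairs compared, rank = number of groups lying above.
    rank = {a: sum((b, a) in before for b in labels) for a in labels}
    return sorted(labels, key=rank.get)
```

The same routine applies unchanged to columns, reading each trace left to right.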
We also perform an analogous process with columns. After both rows and columns have been processed, we know exactly which rows and columns were deleted to form each trace, which reveals the original position of each received bit in each trace. Given that every bit of appeared in at least some trace, this suffices to reconstruct , proving Theorem 5.
Theorem (Restatement of Theorem 5).
For any constant deletion probability , traces are sufficient to reconstruct a random with high probability.
7.2 Oracle: Testing whether two traces come from same random string
For any , define to be a contiguous subset of size
[TABLE]
Note that there are size gaps between each and , i.e., elements that are both larger than and smaller than . This will later help us argue that the bits in positions and in different traces are independent. Given traces , define the three quantities: , and . We will show that by considering we can determine whether and are traces of the same original string or traces of two different random strings.
The basic idea is that if and are generated by the same string, many of the bits summed to construct and the bits summed to construct will correspond to the same bits of the original string; hence will be smaller than it would be if and were generated from two independent random strings. To make this precise, we need to introduce some additional notation.
Definition 2.
For , let be the indices of the bits in the transmitted string that landed in positions in trace . Similarly define . For example, if bits in position 0 and 2 were deleted during the transmission of then .
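The definition can be illustrated with a small helper that maps a window of trace positions back to the original indices. The function name `source_indices`, the boolean mask `kept`, and the 0-indexed convention are assumptions made for this sketch.

```python
def source_indices(kept, window):
    """Original indices of the bits that landed in the given trace positions.

    kept:   list of booleans; kept[i] is True iff original bit i survived.
    window: iterable of (0-indexed) positions in the trace.
    """
    survivors = [i for i, is_kept in enumerate(kept) if is_kept]
    return {survivors[pos] for pos in window if pos < len(survivors)}
```

As a usage example, if the bits in positions 0 and 2 of a length-6 string are deleted, the surviving original indices are 1, 3, 4, 5 in that order, so `source_indices([False, True, False, True, True, True], range(3))` returns `{1, 3, 4}`.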
The next lemma quantifies the overlap between and .
Lemma 18 (Deletion Patterns).
With high probability over the randomness of the deletion channel,
[TABLE]
Note that conditioned on the second property, each of the ’s are independent random variables.
Proof.
First note that by the Chernoff bound, for each , the bit of the original sequence appears in position that belongs to where with high probability. The second part of the lemma follows since and therefore, with high probability, any bit in the original string will not appear in in one trace and in another for because there was a size gap between and .
For the first part of the lemma, for each , define
[TABLE]
By the Chernoff Bound, with high probability the bits in positions in the original string arrive in positions in the trace. Also with high probability, of the bits in are transmitted in the generation of both and . Hence, as required. ∎
Now, we prove a helper lemma characterizing the mean and variance of the square of difference of two independent binomials.
Lemma 19.
Let and be independent and . Then,
[TABLE]
Proof.
The result follows by direct calculation:
[TABLE]
and
[TABLE]
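A calculation of this kind can be checked numerically. The sketch below assumes, purely for illustration, that both variables are distributed as Bin(n, 1/2); the exact parameterization of the lemma is not reproduced here, and the function names are hypothetical. Under this assumption the squared difference has mean n/2 and variance n(2n − 1)/4, which the code verifies exactly with rational arithmetic.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(n):
    """Exact probability mass function of Bin(n, 1/2) as Fractions."""
    return [Fraction(comb(n, k), 2 ** n) for k in range(n + 1)]


def squared_difference_moments(n):
    """Exact mean and variance of (X - Y)**2 for independent
    X, Y ~ Bin(n, 1/2) (an assumed parameterization)."""
    pmf = binomial_pmf(n)
    mean = Fraction(0)
    second_moment = Fraction(0)
    for x, px in enumerate(pmf):
        for y, py in enumerate(pmf):
            z = (x - y) ** 2
            mean += px * py * z
            second_moment += px * py * z * z
    return mean, second_moment - mean ** 2
```

The quadratic growth of the variance against the linear growth of the mean is what makes a concentration argument over many independent blocks necessary.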
We are now ready to argue that the values are sufficient to determine whether or not and are generated from the same random string.
Theorem 20.
Let for and .
Case 1.
If and are generated from the same string, then .
Case 2.
If and are generated from different strings, then .
Proof.
Throughout the proof we condition on the equations in Lemma 18 being satisfied. Note that this event is a function of the randomness of the deletion channel rather than the randomness of the strings being transmitted over the deletion channel.
First, suppose and are generated from different strings. Then has the same distribution as the variable in Lemma 19 when is set to . Hence, and . Therefore,
[TABLE]
Therefore, by the Chernoff bound, with probability at least .
Now, suppose and are generated from the same string. Then, has the same distribution as in Lemma 19 for some . Hence, and . Therefore,
[TABLE]
Therefore, by the Chernoff bound, with probability at least . ∎
8 Extending Matrix Results to Tensors
8.1 Reconstruction of arbitrary tensors
In this setting, we have a order binary tensor such that has equal number of elements along every dimension. The tensor is now passed through a tensor deletion channel, which deletes each element along every dimension independently with probability . Notice that this is a generalization of the previous settings in matrix reconstruction (special case for ) and the trace reconstruction problem (special case for ) considered earlier.
In this section we prove Theorem 21.
Theorem 21.
For tensor reconstruction, \exp\Big(O\Big((n(kp/q^{2})^{k}\log^{2}n)^{1/(k+2)}\Big)\Big) traces suffice with high probability to recover an arbitrary tensor , where is the deletion probability and .
We again design a procedure to test between two tensors and . This test is based on identifying a particular received entry where the traces (traces of the two tensors) must differ significantly, and to show this, we analyze a certain multivariate Littlewood polynomial. Equipped with this test, we can apply a union bound and simply search over all pairs of tensors to recover the correct one. We will begin by showing an extension of Lemma 17 for any value of .
Lemma 22.
Let be a non-zero Littlewood polynomial of degree in each variable. In that case,
[TABLE]
for some where and is a universal constant.
The proof of Lemma 22 follows from an iterative use of the maximum modulus principle for multivariate Littlewood polynomials and follows along the lines of the proof presented in Lemma 17. The detailed proof has been deferred to Appendix B.
For a tensor , let denote a tensor trace (the output after the tensor is passed through the tensor deletion channel). Let us denote by the element in whose location along the dimension is i.e. there are elements along the dimension before . Notice that this indexing protocol uniquely determines the element within the tensor. We now show the following lemma:
Lemma 23.
For any two distinct tensors , there exists a position denoted by the set of ordered indices such that
[TABLE]
The proof of Lemma 23 follows from using the complex generating function of the tensor traces and subsequently, using Lemma 22 based on similar ideas as in Section 6. The detailed proof has been deferred to Appendix B. For the remaining part, we follow the argument of [12]: Since we have witnessed significant separation between the traces received from and those received from , we can test between these cases with samples (via a simple Chernoff bound). Since we do not know which of the tensors is the truth, we actually test between all pairs, where the test has no guarantee if neither tensor is the truth. However, via a union bound, the true tensor will beat every other in these tests and this only introduces a factor in the sample complexity.
8.2 Reconstruction of random tensors
In this section, we extend the results in Section 7 for random tensors. Suppose we have a order random binary tensor such that has equal number of elements along every dimension and every element in is randomly sampled from uniformly and independently. The tensor is now passed through a tensor deletion channel, which deletes each element along every dimension independently with probability . In this section we will prove the following theorem:
Theorem 24.
For any constant deletion probability , traces are sufficient with high probability to reconstruct a random .
Notice that this bound is also tight since we need traces to at least observe every bit in the tensor . The detailed proof of Theorem 24 is a generalization of the ideas presented in Section 7 and has been deferred to Appendix B.
9 Conclusion
In this paper, we study several variations on the trace reconstruction problem to understand how structural assumptions on the input influence the sample complexity. Our results shed light on how sparsity, separation between 1s, randomness, and multivariate structures can enable efficient statistical inference with the deletion channel. Along the way, we refine existing techniques, such as the Littlewood polynomial approach, and introduce several new ideas, including clustering and combinatorial methods. We hope our insights and techniques will prove useful in future work on trace reconstruction and related problems.
Appendix A Sparsity with gap: Technical details
This section contains missing details from Section 3. Recall that we have a string that is -sparse. We further assume that each pair of successive 1s in is separated by a run of 0s, and we refer to as the gap. Recall that we define as the position of the 1s in original string, where . As further notation we refer to the collection of traces as .
The first level
As a warm-up, we show an algorithm called FindPositions that uses traces to reconstruct exactly with high probability when the gap . The algorithm returns the values and crucially uses a binomial mean estimator. Given samples from a binomial distribution this estimator returns an estimate of , \hat{n}={\rm round}\Big(\frac{2}{s}\sum_{i=1}^{s}X_{i}\Big), where the function simply rounds the argument to the nearest integer. From the Hoeffding bound, it is clear that
[TABLE]
as long as for any .
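The estimator itself is a one-liner. The sketch below follows the formula for \hat{n} above; the factor 2 suggests that the samples have success probability 1/2, which we take as an assumption, and the function name `binomial_mean_estimator` is our own.

```python
import math


def binomial_mean_estimator(samples):
    """Estimate n from samples assumed to be drawn from Bin(n, 1/2).

    Returns round((2 / s) * sum(samples)), rounding halves up so that the
    result does not depend on Python's round-half-to-even behaviour.
    """
    s = len(samples)
    if s == 0:
        raise ValueError("at least one sample is required")
    return math.floor(2 * sum(samples) / s + 0.5)
```

For example, the samples `[5, 5, 5, 5]` give the estimate 10, and `[2, 3, 3]` give round(16/3) = 5.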
The algorithm FindPositions is displayed in Algorithm 2. Our first result of this section guarantees that with Algorithm 2 recovers exactly with traces.
Proposition 25.
Algorithm 2 (FindPositions) successfully returns the string from traces with probability at least as long as and the gap .
Proof.
First, let us associate with each vertex an unknown label describing the correspondence between this received 1 and a 1 in the original string. The first observation is that if then and we always have . Thus, by Hoeffding’s inequality and a union bound, we have
[TABLE]
And so with , with probability at least all values concentrate appropriately.
This event immediately implies that is consistent in the sense that if then . Further the gap condition implies the converse property, which we call purity: if then . Formally, if then
[TABLE]
which implies that . Hence .
The above two properties reveal that each connected component can be identified with a single index corresponding to a 1 in the original string and the component contains exactly the received 1s corresponding to that original one (formally ). From here we simply use the binomial estimator on each component. First observe that, by a Chernoff bound, with probability at least , each 1 from the original string appears in at least a -fraction of the traces, so that . Then apply the guarantee for the binomial mean estimator along with another union bound over the positions. Overall the failure probability is at most
[TABLE]
which is at most with . With this choice, we can tolerate . ∎
The recursion
The algorithm RecurGap (Algorithm 1) uses the clustering scheme in FindPositions in a recursive manner to estimate the parameters even when the gap is much less than . Define a series of threshold parameters, to be used in each level of the recursion:
[TABLE]
where the total number of levels is . Note that, . In particular, if then we have .
Recall that is the vertex set for the graph used above, where each vertex corresponds to a received 1 and is associated with an unknown original one . Our main result for RecurGap is the following.
Theorem 26.
Assume for some . Then with probability at least , Algorithm 1 (RecurGap) with levels of recursion returns sets such that
1. .
2. .
The theorem follows from the three lemmas stated earlier. Here we restate the lemmas and provide the proofs.
Lemma (Consistency, restatement of Lemma 10).
At level let for each . Then with probability , for each and there exists some component at level such that .
Lemma (Length Bound, restatement of Lemma 11).
At level , the following holds with probability at least : For every component at level , we have . Moreover if is a contiguous subsequence of with , then with high probability.
Lemma (Length Filter, restatement of Lemma 12).
Assume . At level , the following holds with probability at least : For a component at level , let be the maximal contiguous subsequence of such that . Define and . Then for any , if and are present in , then survives to round , that is . Moreover, for any , let denote the original position of the first 1 from that is also in the trace . Then we have with high probability.
The proofs of the lemmas are all intertwined. In the induction step we will assume that all lemmas hold at the previous level of the recursion. Throughout we repeatedly take a union bound over all traces and all up-to- components, and set the failure probability for each event to be . In applications of Hoeffding’s inequality, this produces a term inside the square root.
Proof of Lemma 11.
We proceed by induction. For the base case, by Hoeffding’s inequality, we know that for all we have
[TABLE]
except with probability at most . This means that the position corresponding to a single index can span at most positions. Formally, if two vertices have then, by the triangle inequality, . Additionally, if two vertices have and (so that ), then . Using these two facts, along with the fact that there are at most distinct values for , the total length of any connected component is at most . The second claim follows from the concentration statement.
For the induction step, assume that the connected components at level have length at most . Fix a connected component and let denote the left-most original 1 present in (). By another application of Hoeffding’s inequality and using the error guarantee in Lemma 12, we have that
[TABLE]
except with probability at most . From here, the same argument as in the base case yields the claim. ∎
Proof of Lemma 12.
We have two conditions to verify. Fix a component at level with maximal contiguous subsequence and recall the definitions and . By another concentration bound, we know that
[TABLE]
with probability at least . This reveals that:
[TABLE]
Moreover, for any trace that contains the tail bound is two-sided:
[TABLE]
Note that we also have with overwhelming probability as:
[TABLE]
Here we are using the symmetry of the binomial distribution. Thus, with , the failure probability here is , which is negligible.
Using the upper bound on reveals that survives, since
[TABLE]
For the second condition, assume that some trace survives but does not contain . Let denote the first original 1 in this trace that belongs to ’s block (by definition, for each ). Then we know that
[TABLE]
but since passed through the length filter, we also have a lower bound on its length, and so we get that
[TABLE]
where the last inequality follows from Lemma 11. ∎
Proof of Lemma 10.
The proof here is similar to that of Lemma 11. Fix a component with corresponding block at level and assume that all three lemmas apply for all previous levels. For a subtrace in this component observe and recall the definition and , which is the position of the first 1 in that appears in trace . Since the length of the subtrace is at most by Lemma 11 we get that
[TABLE]
Here the last inequality uses Hoeffding’s bound along with Lemma 12 at level . This implies that the clustering at level is consistent. ∎
Proof of Theorem 26.
First take a union bound over applications of the three lemmas, so that the total failure probability is . From now, assume that the events in the three lemmas all hold for all levels. In particular, this implies that the components are consistent. We must verify that the clusters are pure and then track how many vertices remain.
For the first claim, let us revisit the proof of Lemma 10. If two vertices, say , in a component at level corresponded to different 1s, say then by the gap condition, we know that . On the other hand, we know that (2) holds, and we will use this to prove that no edge appears between these vertices. We have that
[TABLE]
and so, if , then the two vertices will not share an edge. The argument applies to all pairs, and hence the clusters at level are pure, which establishes the first claim of Theorem 26.
For the second claim, note that by Lemma 12, for every component at every level, if a trace contains the two endpoints of that component, then it will survive the filter. Hence, in every filtering step we expect to retain of the subtraces passing through, and, by a Chernoff bound, we will retain of the subtraces except with , provided . Since we perform levels, we retain traces in each cluster with high probability. ∎
Removing Bias: The reverse recursion
Now that we have isolated the vertices into pure clusters, we need to work our way up through the recursion to remove biases introduced by the hierarchical clustering. For any component corresponding to block at level , since the components at level are pure, we can identify exactly the subtraces that contain the first and last 1 in the block. We throw away all other traces, which de-biases the length filter at level .
Unfortunately, for a component corresponding to a block at level , we cannot identify exactly the subtraces that contain the first and last 1 in the block. However, we know that is further refined into sub-components at level , and by induction we can identify all the traces that contain the left-most and right-most 1 in the left-most and right-most sub-components. We identify all such traces and eliminate the rest to debias the length filter at level . See Figure 1 for an illustration.
To debias this length filter, we filter based on the presence of two 1s at level (just the endpoints), two further 1s at level (the inner endpoints of the first and last sub-components), four further 1s at , and so on. So, just to debias the length filter at level we require 1s to be present. Since we must debias all length filters above a particular component, we require the presence of 1s. The probability of all of these 1s appearing is and, by a Chernoff bound, with high probability at least of our traces will contain all of these 1s.
For any 1, , in the original string, let denote the subset of 1s, whose presence we require to debias the length filters above the pure component containing . After the debiasing step, the remaining vertices in the component containing have values distributed as
[TABLE]
where is the number of 1s in that appear before in the sequence, and the final 1 is due to the presence of . Using the binomial mean estimator, we can therefore estimate with probability at least , provided . Thus, traces suffice to recover all values, provided that and . This proves Theorem 2.
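The binomial mean estimator can be illustrated with a minimal simulation. The sketch below handles only the simplest case, with no debiasing set: a coordinate at original position `j` that survives the deletion channel lands at trace position 1 plus a binomial count of surviving predecessors, so inverting the mean recovers `j`. The retention probability `keep` and all function names are assumptions made for illustration.

```python
import random


def trace_position(j, keep, rng):
    """Trace position (1-indexed) of original coordinate j, or None if
    coordinate j itself is deleted. Each coordinate is retained
    independently with probability keep (hypothetical parameter name)."""
    if rng.random() >= keep:
        return None
    survivors_before = sum(rng.random() < keep for _ in range(j - 1))
    return survivors_before + 1


def estimate_position(observed, keep):
    """Invert E[position] = (j - 1) * keep + 1 to estimate j."""
    mean = sum(observed) / len(observed)
    return (mean - 1) / keep + 1


def demo(j=40, keep=0.5, num_traces=20000, seed=0):
    rng = random.Random(seed)
    samples = (trace_position(j, keep, rng) for _ in range(num_traces))
    observed = [pos for pos in samples if pos is not None]
    return estimate_position(observed, keep)
```

Rounding the output of `demo()` to the nearest integer gives the estimated original position. In the setting above, the required 1s shift the distribution by a known offset, which would be subtracted before inverting the mean.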
Appendix B Missing Proofs from Section 8
Proof of Lemma 22.
Fix and define the polynomial
[TABLE]
We first show that there exists on the unit disk () such that . This follows from an iterated application of the maximum modulus principle. First factorize where is chosen such that has no common factors of . Since has non-zero coefficients, this implies that is a non-zero polynomial and therefore using the maximum modulus principle, for any fixed , there exists a value of such that and
[TABLE]
Subsequently we can further factorize so that has no common factors in . Repeating this procedure times, we can show the following chain of inequalities
[TABLE]
Now, for any we have
[TABLE]
where we are using the fact that . This proves the lemma, since we may choose such that for all . ∎
Proof of Lemma 23.
For complex numbers , observe that
[TABLE]
Thus, for two tensors , we have
[TABLE]
where we are rebinding for all . Observe that is a multivariate Littlewood polynomial; all coefficients are in , and the degree is in each variable.
Again, for we can use Lemma 22 and the fact that
[TABLE]
to sandwich by
[TABLE]
This implies that there exists such that
[TABLE]
where the second inequality follows by optimizing for . ∎
Proof of Theorem 24.
We will use the oracle described in Section 7 again. Recall that the oracle was able to distinguish between the following two cases
Case 1:
and are traces generated by the deletion channel with preservation probability from the same random string
Case 2:
and are traces generated by the deletion channel with preservation probability from independent random strings
with failure probability at most .
Notice that the probability of a particular bit in getting deleted is . In that case, with traces we can ensure that every bit of appears in at least one of the tensor traces with probability at least . Suppose we fix dimensions and without loss of generality suppose we fix the value of the dimension of to be for all . In that case the elements form a binary vector of length . There are such binary vectors corresponding to the different values of and we will denote the set of traces from the such binary vector by . Notice that there exists a natural ordering among these groups . For two distinct groups , where is defined by and respectively, we will have if and only if there exists a value such that
[TABLE]
Moreover, when we observe a tensor trace after fixing all the dimensions except the first one, we actually observe the vector traces of one of those binary vectors. Suppose that, for every tensor trace, we carry out this process and collect all the vector traces obtained by fixing every dimension except the first one. We can now use our oracle to group all these vector traces according to the original binary vector they emanated from, i.e., two vector traces belong to the same group if both of them belong to for some value of . This requires at most applications of the oracle and so, by the union bound, this can be performed with failure probability at most
[TABLE]
where the inequality applies for sufficiently large . We next infer the ordering among the groups . For two distinct , where is defined by and respectively, suppose there exists a tensor trace having at least one vector trace from both and . Moreover suppose the position of the vector trace from is given by and the position of the vector trace from is given by . In that case, we will infer that if there exists an such that
[TABLE]
and infer otherwise. The probability that there exists such a trace is . We also perform an analogous process with every other dimension. After all dimensions have been processed, we know exactly which elements along each dimension have been deleted to form each tensor trace, which in turn reveals the original position of each received bit in each tensor trace. Given that every bit of appeared in at least one trace, this suffices to reconstruct , proving the main theorem. ∎
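The coverage step of this argument can be sketched for the two-dimensional (matrix) case. The code below samples a matrix trace by deleting rows and columns independently, and computes a number of traces that suffices, by a union bound, for every entry to appear in at least one trace. The retention probability `keep`, the failure probability `delta`, and the function names are assumptions made for illustration; the bound is the generic union bound and not the exact expression used in the proof.

```python
import math
import random


def matrix_trace(matrix, keep, rng):
    """Delete each row and each column independently, retaining each
    with probability keep. Returns the trace together with the
    retained row and column indices (unknown to a real decoder)."""
    rows = [i for i in range(len(matrix)) if rng.random() < keep]
    cols = [j for j in range(len(matrix[0])) if rng.random() < keep]
    return [[matrix[i][j] for j in cols] for i in rows], rows, cols


def traces_for_coverage(n, dims, keep, delta):
    """Smallest T with n**dims * (1 - keep**dims)**T <= delta.

    A fixed entry survives in one trace with probability keep**dims,
    so it is missing from all T independent traces with probability
    (1 - keep**dims)**T; a union bound over the n**dims entries
    gives the condition. Requires 0 < keep < 1 and 0 < delta < 1."""
    per_trace = keep ** dims
    return math.ceil(math.log(n ** dims / delta) / -math.log(1.0 - per_trace))
```

For example, `traces_for_coverage(10, 2, 0.5, 0.01)` gives a trace count for a 10 × 10 matrix; the count grows only logarithmically in the number of entries for fixed `keep` and `dims`.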
