On the Parallel Reconstruction from Pooled Data
Oliver Gebhard, Max Hahn-Klimroth, Dominik Kaaser, Philipp Loick

TL;DR
This paper introduces a simple greedy algorithm for reconstructing sparse binary signals from pooled additive measurements in parallel, establishing sharp theoretical thresholds and validating them through empirical simulations.
Contribution
It presents a new efficient greedy reconstruction algorithm and derives the exact information-theoretic query threshold for sparse signals in pooled data problems.
Findings
The greedy algorithm achieves performance comparable to complex methods.
Theoretical thresholds for minimal queries are established and validated.
Empirical results confirm the practical effectiveness of the approach.
On the Parallel Reconstruction from Pooled Data
OG and PL were supported by DFG CO 646/3. MHK was supported by DFG FOR 2975 and Stiftung Polytechnische Gesellschaft.
Oliver Gebhard, TU Dortmund University, Dortmund, Germany
Dominik Kaaser, Universität Hamburg, Hamburg, Germany
Max Hahn-Klimroth, TU Dortmund University, Dortmund, Germany
Philipp Loick, Goethe University Frankfurt, Frankfurt, Germany
Abstract
In the pooled data problem the goal is to efficiently reconstruct a binary signal from additive measurements. Given a signal $\sigma \in \{0,1\}^n$, we can query multiple entries at once and get the total number of non-zero entries in the query as a result. We assume that queries are time-consuming and therefore focus on the setting where all queries are executed in parallel. For the regime where the signal is sparse such that $\|\sigma\|_1 = o(n)$, our results are twofold: First, we propose and analyze a simple and efficient greedy reconstruction algorithm. Secondly, we derive a sharp information-theoretic threshold for the minimum number of queries required to reconstruct $\sigma$ with high probability. Our first result matches the performance guarantees of much more involved constructions (Karimi et al. 2019). Our second result extends a result of Alaoui et al. (2014) and Scarlett & Cevher (2017) who studied the pooled data problem for dense signals. Finally, our theoretical findings are complemented with empirical simulations. Our data not only confirm the information-theoretic thresholds but also hint at the practical applicability of our pooling scheme and the simple greedy reconstruction algorithm.
Index Terms:
Reconstruction, Sparse Signal, Pooled Data, Information Theory, Phase Transitions
I Introduction
We consider the binary pooled data problem with additive queries which is defined as follows. We are given a signal $\sigma \in \{0,1\}^n$ of length $n$, a large vector of Hamming weight $k$, and a querying method. Each query pools multiple entries of $\sigma$ together and returns the exact number of non-zero entries contained in the pool (see Fig. 1 for an example). The goal is to reconstruct $\sigma$ using as few queries as possible.
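To make the query model concrete, the following minimal Python sketch evaluates one additive query on a toy signal. The function name `additive_query` and the example values are illustrative assumptions and do not come from the paper. An entry that is listed twice in a pool is counted twice, in line with the multi-graph design described later in the text.

```python
from typing import Sequence

def additive_query(signal: Sequence[int], pool: Sequence[int]) -> int:
    """Return the number of one-entries in the pool, counted with multiplicity."""
    return sum(signal[i] for i in pool)

# Hypothetical toy instance: n = 7 entries, Hamming weight k = 3.
signal = [1, 0, 0, 1, 0, 1, 0]
pool = [0, 1, 3, 4]          # indices pooled together in one query
result = additive_query(signal, pool)
print(result)
```

The reconstruction task is then to recover `signal` from a collection of such `(pool, result)` pairs alone.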
In many real-world scenarios the time to compute a reconstruction of $\sigma$ is dominated by the time to perform a single query. The evaluation of such a query may require, e.g., computations using a deep neural network on a GPU [20], biological processes such as DNA screening [7, 26], or PCR tests in a bio-medical context [4]. To obtain a substantial speed-up, we therefore focus on parallel schemes where all queries are specified a priori and executed simultaneously. This assumption makes sense in the context of a life sciences laboratory: queries can be envisioned as measurements conducted by a liquid handling robot. The time to perform all (parallel) queries then clearly dominates the time to run an efficient (sequential) reconstruction algorithm (for practical input sizes).
In this paper we focus on the sublinear regime where the number of non-zero entries $k$ scales sublinearly in the signal's length $n$ such that $k = n^\theta$ for some $\theta \in (0,1)$. In this setting, our main task is to specify a suitable parallel pooling design and an efficient reconstruction algorithm that allows us to compute $\sigma$ efficiently from the queried data. We are interested in two different types of phase-transitions that commonly arise in the analysis of reconstruction and statistical inference problems:
1. What is the minimum number of queries that allows us to infer $\sigma$ from the query results given unlimited computational power?
2. How many queries are required such that an efficient algorithm can compute $\sigma$ from the query results?
We will refer to the first phase-transition as the information-theoretic threshold and to the second phase-transition as the algorithmic threshold.
I-A The Teacher-Student Model
As in many related reconstruction problems, the teacher-student model provides the fundamental means towards analyzing information-theoretic questions. The challenge in such reconstruction problems lies in deriving probability distributions that are dependent on a variety of random variables and hard to express per se. However, deriving probability distributions conditioned on certain high-probability events is feasible. For an introduction and mathematical justification of the model, we refer the reader to [10]. The setup is the following: a teacher aims to convey some ground truth to a student. Rather than directly providing the ground truth to the student, the teacher generates observable data from the ground truth via some statistical model and passes both the data and the model to the student. The student now aims to infer the ground truth from the observed data and the model.
In terms of this paper we see $\sigma$ as the ground truth. It is drawn uniformly at random from all vectors in $\{0,1\}^n$ of Hamming weight $k$. The observable data (the query results), together with the conducted queries (expressed as a graph $G$), are passed to the student in order to infer $\sigma$. In the following, we analyze the chances of the student to infer the ground truth from the observable data. First, we derive the model distribution from the provided information $G$ and the query results. Afterwards, we use the gained knowledge to analyze the chances of the student to recover the ground truth by estimating the number of possible input vectors that are consistent with the observed query results. As our goal is to recover $\sigma$ with high probability, we condition on the event that the underlying bipartite multi-graph $G$, which will be defined properly in due course, behaves almost as expected. We exploit the knowledge about $G$ to derive high-probability events which we can condition on. Eventually, our analysis conveys the information whether there is a unique input vector or multiple possible input vectors out of which the student has to guess the correct one.
I-B Related Work
The binary pooled data problem, sometimes called quantitative group testing, finds its roots in early works of Dorfman [13], Djackov [11], and Shapiro [27]. It has recently gained a lot of interest in the literature [1, 6, 14, 18, 25], with applications in a multitude of disciplines such as DNA screening [26], identifying genetic carriers [7] and machine learning [20, 23, 33]. Variants of the problem include binary group testing [2, 9] or threshold group testing [8, 22]. We start our discussion with an overview of related work from information theory.
Information-Theoretic Aspects
A simple information-theoretic lower bound can be obtained by a folklore counting argument: each query returns a number from $0$ to $k$, thus a pooling design with $m$ queries can produce at most $(k+1)^m$ different outcomes. This number must be at least $\binom{n}{k}$ in order to distinguish all possible input vectors of length $n$ with Hamming weight $k$. By standard asymptotic bounds, we obtain
$$m \;\ge\; \frac{\ln \binom{n}{k}}{\ln(k+1)} \;\sim\; \frac{k \ln(n/k)}{\ln k}.$$
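The counting argument can be checked numerically. The sketch below, written under the assumption that each query has at most $k+1$ possible outcomes as stated above, computes the smallest $m$ with $(k+1)^m \ge \binom{n}{k}$ and prints it next to the asymptotic expression $k \ln(n/k)/\ln k$. The function name and the sample parameters are illustrative choices.

```python
import math

def counting_lower_bound(n: int, k: int) -> int:
    """Smallest m with (k + 1)**m >= C(n, k), using exact integer arithmetic."""
    target = math.comb(n, k)
    m, outcomes = 0, 1
    while outcomes < target:
        outcomes *= k + 1
        m += 1
    return m

n, k = 10_000, 16
m_min = counting_lower_bound(n, k)
asymptotic = k * math.log(n / k) / math.log(k)
print(m_min, round(asymptotic, 1))
```

For finite $n$ the two numbers differ by lower-order terms; they only agree asymptotically.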
The universal lower bound on $m$ holds in any case, even if the queries do not need to be conducted in parallel. Restricted to the important special case in which all queries are conducted in parallel, [11] shows that reconstruction of $\sigma$ requires at least
[TABLE]
queries, even with unlimited computational power. On the positive side, Bshouty [6] proves that reconstruction of is efficiently possible with queries if they are conducted sequentially and Grebinski and Kucherov [17] provide a parallelizable design with an exponential-time reconstruction decoding algorithm which guarantees inference with queries using separating matrices. The latter positive result was extended to the so-called Subset Select problem [21], a relaxation of the pooled data problem that asks to identify only a subset of positive entries correctly. Recently, [14] improved the result for this relaxation by a factor of . So far, these results hold independently of . For the linear regime where , much stricter results are already known: Alaoui et al. [1] and Scarlett and Cevher [25] show that there is an exponential-time construction that achieves reconstruction with parallel queries – a result that is dependent on scaling linearly in .
Algorithmic Aspects
If sequential queries are allowed, Bshouty [6] presents an efficient reconstruction algorithm that succeeds at recovery of $\sigma$ with no more than queries. However, for parallel schemes, there are significant gaps between the information-theoretic lower bound and the currently best known efficient algorithms [1, 12, 14, 15, 19, 24]. For instance, Alaoui et al. [1] present an Approximate Message Passing algorithm for dense signals ($k = \Theta(n)$). Furthermore, Donoho and Tanner [12] give a decoding strategy based on $\ell_1$-minimization, and Foucart and Rauhut [15] introduce the Basis Pursuit algorithm. They can be used to recover $\sigma$ with
[TABLE]
queries, respectively, if the signal is sparse (). Note that these algorithms solve the more general compressed sensing problem. Various improvements over the Basis Pursuit algorithm are known (e.g., the Orthogonal Matching Pursuit [24] and its improved version for discrete signals [29]) but as Wang and Yin [32] discuss, they do not perform asymptotically better in the setting discussed in this paper. More recent algorithms explicitly designed for recovery of from additive queries in the sparse regime are due to Karimi et al. [18, 19]. They provide two algorithms based on graph codes that require
[TABLE]
queries, respectively. Furthermore, in a yet unpublished draft that appeared subsequently to our work on arXiv, Feige and Lellouche [14] analyze the Subset Select problem. They prove that, under mild assumptions, an algorithm succeeding at this relaxation can be turned into an algorithm for recovery of without significantly increasing the required number of queries.
I-C Our Contributions
We study the pooled data problem under the random regular model which is known to be information-theoretically optimal in the linear regime as well as in similar inference problems [9]. More precisely, we let $G$ be a random bipartite multi-graph with $m$ query-nodes representing the queries, $n$ entry-nodes representing the coordinates of $\sigma$, and edges indicating how often a specific entry is contained in a given query. Each query contains exactly $\Gamma = n/2$ entries chosen uniformly at random with replacement.
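A pooling design of this kind can be sampled in a few lines. The following Python sketch is an illustration only: the function name `sample_pooling_design`, the `Counter` representation of multi-edges, and the pool size `n // 2` passed in the example are assumptions made for this sketch and are not the authors' implementation.

```python
import random
from collections import Counter
from typing import List

def sample_pooling_design(n: int, m: int, pool_size: int,
                          rng: random.Random) -> List[Counter]:
    """Sample m queries; each draws pool_size entries uniformly with replacement.

    Every query is a Counter mapping entry index -> multiplicity (multi-edges).
    """
    return [Counter(rng.choices(range(n), k=pool_size)) for _ in range(m)]

rng = random.Random(1)       # fixed seed, for reproducibility only
n, m = 20, 5
design = sample_pooling_design(n, m, pool_size=n // 2, rng=rng)
for a, query in enumerate(design):
    print(a, dict(query))
```

Because the design does not depend on the signal, all queries can be fixed before any measurement is taken, which is what allows them to run in parallel.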
Algorithmic Results
For the aforementioned pooling design we present a fairly intuitive greedy algorithm called Maximum Neighborhood (MN) Algorithm that allows reconstruction of $\sigma$ w.h.p. (the expression with high probability (w.h.p.) refers to a probability that tends to 1 as $n \to \infty$). It follows a thresholding approach that is much simpler than the known algorithms by Karimi et al. [18, 19], which are technically highly challenging. A formal definition of the MN-Algorithm is given in Algorithm 1.
On an intuitive level, the MN-Algorithm works as follows. First, we conduct $m$ queries in parallel, each on exactly $\Gamma = n/2$ randomly chosen entries of the signal, which yields the graph representation $G$. Secondly, for each coordinate we sum up the query results in its neighborhood induced by $G$, counting multi-edges only once. The sum is then centralized by its expected value. Finally, those coordinates with a large score are very likely to have the value $1$ under $\sigma$. Our first main theorem states how many parallel queries are required for the MN-Algorithm to recover the correct $\sigma$ w.h.p.
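The intuitive description above can be sketched in Python as follows. This is a simplified illustration, not Algorithm 1 from the paper: the names `mn_reconstruct`, `psi` and `distinct` are hypothetical, the centering term uses the approximation that a query of size $\Gamma$ has expected result $\Gamma k / n$, and the toy run uses a deliberately generous number of queries so that it is stable.

```python
import random
from typing import List

def mn_reconstruct(n: int, k: int, pools: List[List[int]],
                   results: List[int]) -> List[int]:
    """Greedy thresholding sketch: score every entry, declare the top k as ones."""
    psi = [0] * n            # sum of results over distinct queries containing i
    distinct = [0] * n       # number of distinct queries containing i
    for pool, res in zip(pools, results):
        for i in set(pool):  # multi-edges are counted only once
            psi[i] += res
            distinct[i] += 1
    mean_result = len(pools[0]) * k / n      # assumed expected result per query
    scores = [psi[i] - distinct[i] * mean_result for i in range(n)]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    estimate = [0] * n
    for i in ranked[:k]:
        estimate[i] = 1
    return estimate

rng = random.Random(7)
n, k, m = 50, 3, 2000        # toy sizes, far more queries than needed
sigma = [0] * n
for i in rng.sample(range(n), k):
    sigma[i] = 1
pools = [rng.choices(range(n), k=n // 2) for _ in range(m)]   # with replacement
results = [sum(sigma[i] for i in pool) for pool in pools]     # additive queries
estimate = mn_reconstruct(n, k, pools, results)
print(estimate == sigma)
```

Taking the $k$ largest scores plays the role of the threshold: every one-entry raises the results of all queries it belongs to, so its centered sum stands out from those of the zero-entries.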
Theorem 1.
Suppose that $0 < \theta < 1$, $k = n^\theta$, and $\varepsilon > 0$ and let
[TABLE]
If , then Algorithm 1 outputs w.h.p. on input and and an additive querying method query that returns the total number of one-entries in a query.
While the MN-algorithm takes as an input, the proof reveals that prior knowledge of is not required in detail. More precisely, a lower bound on suffices, as in this case enough queries are conducted and the design of is independent from . Observe that one additional parallel query on all entries reveals the exact value of immediately without increasing asymptotically and therefore the only dependence on in Algorithm 1 (Line 7) can be easily removed by this one additional query. Beside not being strictly dependent on , a main novelty of the MN-algorithm is its greedy fashion, providing a straightforward approach compared to the technically challenging algorithms presented in [18, 19].
Parallelized Reconstruction
Observe that our reconstruction algorithm, apart from sampling the test design and performing all queries in parallel, is specified in a sequential fashion. This emphasizes the local structure of the reconstruction algorithm. In the context of a parallel computation we observe that our algorithm can be readily parallelized. When individual queries can be conducted much faster, this further reduces the overall running time of our approach. Such improved reconstruction algorithms can be used in the context of machine learning, see, e.g., [33] for an application.
Recall that our test design is described by a random bipartite graph $G$ and let $A$ be the unweighted biadjacency matrix of $G$. Intuitively, the non-zero entries of $A$ mark those values that are summed up in Algorithm 1. It follows that the vector of neighborhood sums and the vector of distinct query counts are the matrix-vector products $A\hat{\sigma}$ and $A\mathbf{1}$, where $\mathbf{1}$ is the all-one-vector and $\hat{\sigma}$ is the query result vector. The sums computed in Lines 4 to 6 of Algorithm 1 can therefore be expressed in terms of two matrix-vector products for which efficient parallelizations are known. Finally, in Lines 7 to 9 of Algorithm 1 the coordinates of the resulting vector are sorted. See [28] for a rather recent survey (with a focus on but not limited to GPUs) on parallel sorting algorithms.
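The matrix-vector view can be illustrated on a tiny instance. The sketch below uses plain Python lists so that it has no dependencies; the matrix `A`, the helper `mat_vec` and the example results are assumptions made for illustration. In practice one would hand the same two products to a parallel linear algebra library.

```python
from typing import List

def mat_vec(matrix: List[List[int]], vector: List[int]) -> List[int]:
    """Plain matrix-vector product; rows are independent, hence parallelizable."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]

# Hypothetical toy design: n = 4 entries (rows), m = 3 queries (columns).
# A[i][a] = 1 if entry i occurs in query a at least once (unweighted).
A = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
]
query_results = [2, 1, 1]    # results for the signal (1, 1, 0, 0)
ones = [1] * len(query_results)

neighborhood_sums = mat_vec(A, query_results)   # sum of results around each entry
distinct_degrees = mat_vec(A, ones)             # distinct queries per entry
print(neighborhood_sums, distinct_degrees)
```

Each output coordinate depends on a single row of `A`, so the work splits across entries without any coordination.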
Information-Theoretic Results
We prove that in the sublinear regime where for some it is possible to reconstruct from with high probability with no more than parallel queries for some arbitrarily small . More precisely, we show that there is, with high probability, no second input vector leading to the same sequence of query results.
Theorem 2.
Suppose that $0 < \theta < 1$, $k = n^\theta$, and $\varepsilon > 0$ and let
[TABLE]
If , can be computed from and w.h.p.
Our result reduces the previously known upper bound of Grebinski and Kucherov [17] by a factor of two and we provide the missing counter part of (2) which establishes the existence of a phase-transition at for parallel designs.
I-D Discussion
Our results extend information-theoretic results of Alaoui et al. [1] from the linear regime to the sublinear regime. For , our threshold of Theorem 1 turns out to converge towards the threshold of [1]. The study of the sublinear regime is inspired by studies of the compressed sensing problem with a sparse underlying signal [3]. In the special case of the binary pooled data problem, those studies were initiated by [19]. The sparse regime is indeed interesting in real-world applications, with examples including epidemiology where Heaps' law models the early spread of pandemics [5, 31] or the detection of rare features in image classification in machine learning [20]. The relevance of the sublinear regime can be seen in the following example. Suppose a screening for HIV is conducted. Out of about 67,220,000 residents of the UK, 105,200 are known to be infected with HIV. Hence, by screening $n = 10{,}000$ random samples, we expect about 16 positive entries in the signal corresponding to the infection status. Thus, the choice $k \approx n^{0.3}$ describes the situation quite well.
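The arithmetic behind this example can be reproduced directly. The short Python sketch below uses only the figures quoted in the text and solves $k = n^\theta$ for $\theta$; the variable names are illustrative.

```python
import math

residents = 67_220_000       # figures quoted in the text
infected = 105_200
n = 10_000                   # number of screened samples

expected_positives = n * infected / residents
k = round(expected_positives)
theta = math.log(k) / math.log(n)    # solves k = n**theta

print(round(expected_positives, 2), k, round(theta, 3))
```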
It is not surprising that also similar problems have been recently analyzed in the sublinear regime. By now, a vast body of related literature exists (see, e.g., the survey by Aldridge et al. [2]). Interestingly, for the (presumably more difficult) variant in which a query only returns the information whether at least one non-zero entry was found, a very sophisticated efficient algorithm is known for which requires parallel queries [9]. Thus, dropping most of the available information and using this approach outperforms not only the simple greedy approach discussed in this paper for small values of , but also the quite involved algorithms by Karimi et al. [18, 19]. This result is of fundamental theoretical interest, since it solves an open complexity theoretical question. Nevertheless, their proposed algorithm appears to be of rather limited interest for practical applications, as it requires, e.g., that is large. This is in contrast to our simple greedy scheme, which our simulations have shown to work well for real-world input sizes.
As in state-of-the-art designs for similar reconstruction problems [2, 9], we allow a specific entry to be included multiple times in one query. While this seems counter-intuitive at first, it does not affect the practicability of the proposed design.
II Model and Notation
In this section we formally introduce the pooling design. As before, is the ground truth chosen uniformly at random from all vectors of length with exactly non-zero entries, where for some . We use to denote the random bipartite multi-graph that models the pooling design, where denotes the total number of queries and describes the number of queries each individual participates in. Observe that . Similarly, we let denote the number of distinct queries with expected value . We let the vector denote the sequence of query results. When we refer to any other input vector than , we simply write for the input vector and for the corresponding results’ vector. Additionally, we write for the set of the entries of and let and be the set of entries with value [math] and , respectively. For , we write for the multiset of queries in which is contained. Similarly, we write for the set of distinct such queries. Analogously, for a query , we denote by the multiset of entries that are contained.
Recall that in our model every query contains exactly entries, and those entries are assigned uniformly at random with replacement. If a one-entry participates in a query more than once, it increases multiple times. For each , we let be the sum of its query results for distinct queries it belongs to. That is, even if the entry appears more than once in a query and thus contributes to the result multiple times, this query’s result contributes to only once. Of course, the value of under has a significant impact on this sum, increasing it by , if is non-zero. To account for this effect in our analysis, we introduce a second variable that sums all the query results in which is contained and excludes the impact of . Formally, for any configuration we define
[TABLE]
and let and . When we consider a specific instance , we will write and for the sake of brevity. Notably, while is known to the observer or an algorithm instantly from the queries, is not, since the ground truth itself is unknown.
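The two neighborhood sums described above can be computed on a toy instance as follows. This is a sketch under stated assumptions: the names `psi` and `phi` are chosen here for the sum over distinct queries and for the same sum with the entry's own contribution removed, and the own contribution is taken to be the entry's total number of occurrences over all queries if the entry is non-zero.

```python
from collections import Counter
from typing import List, Tuple

def neighborhood_sums(signal: List[int],
                      pools: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Return (psi, phi): psi sums results over the distinct queries of each
    entry, phi additionally removes the entry's own contribution."""
    n = len(signal)
    results = [sum(signal[i] for i in pool) for pool in pools]
    psi = [0] * n
    own = [0] * n                      # occurrences of entry i over all queries
    for pool, res in zip(pools, results):
        for i, mult in Counter(pool).items():
            psi[i] += res              # each distinct query is counted once
            own[i] += mult
    phi = [psi[i] - signal[i] * own[i] for i in range(n)]
    return psi, phi

signal = [1, 0, 1, 0]
pools = [[0, 0, 1], [1, 2, 3], [0, 2, 2]]   # entry 0 appears twice in query 0
psi, phi = neighborhood_sums(signal, pools)
print(psi, phi)
```

Only `psi` can be computed by an observer, since `phi` needs the value of the entry itself; this mirrors the remark in the text that the second quantity is not known to the algorithm.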
To express the number of queries conducted, we let denote a positive function from to such that
[TABLE]
While it turns out that suffices in the analysis of the information-theoretic bound, we will see that the performance guarantee of the MN-algorithm requires to scale as . Finally, we define a high probability event that we will condition on as explained in the teacher-student model. Let be the event that, for all , we have
[TABLE]
meaning that the underlying random graph satisfies concentration properties. The following lemma states that is indeed a high probability event.
Lemma 3.
If is constructed according to our pooling scheme, then .
The proof follows from standard concentration results, see the appendix for the technical details. Since Theorems 1 and 2 only contain w.h.p.-assertions, we can safely condition on this event for the remainder of our analysis.
III MN-Algorithm
Outline
Recall that is the sum over all query results in which the entry is contained (multi-edges counted only once) and is the (random) number of disjoint such queries. Furthermore, let be the algebra generated by the edges connected with . As already discussed, we get
[TABLE]
w.h.p. Therefore, intuitively speaking, a non-zero entry increases the value of by , unlike a zero-entry. Moreover, by construction of the random bipartite (multi-)graph , we get that the second neighborhood of contains non-zero entries. Thus we expect
[TABLE]
Therefore, if is called the score of entry , we observe that the scores differ between zero entries and non-zero entries. The whole proof of the algorithmic performance boils down to identifying a threshold value such that, if sufficiently many queries are conducted, all scores of zero entries are below while the scores of all non-zero entries exceed this threshold w.h.p. If we conduct queries, with , we get by a standard application of a Chernoff bound and a union bound over all non-zero entries and, respectively, zero-entries that is a valid threshold whenever
[TABLE]
which will become clear shortly. Optimizing the condition above with respect to and plugging into yields for any the sufficient condition
[TABLE]
Formal Analysis
Let denote how often entry appears in query and let be the adjacency matrix of . Then the following holds.
Corollary 4.
Let . Given , the random variable
[TABLE]
has distribution
Proof.
This is an immediate consequence of the model definition. There are half-edges connected to query-nodes in the neighborhood of that are connected to entry-nodes . Each of these half-edges is connected to one of entry-nodes belonging to an entry of value , independently, from the remaining entry-nodes. ∎
Now it is possible to immediately infer the expectation of conditioned on the event (as defined in (3)). For the sake of brevity let . Given the event which guarantees concentration properties of the underlying graph, we get w.h.p.
[TABLE]
The Chernoff bound allows us to bound as follows.
Lemma 5.
Let be a constant and . Then
[TABLE]
Proof.
The Chernoff bound (Lemma 12) directly implies
[TABLE]
Next we show that, with a suitable choice of a threshold, the scores of zero- and one-entries are well separated.
Corollary 6.
Let be an arbitrary constant. If then there exists an such that, w.h.p., we have
[TABLE]
for all where .
Proof.
Let . Again, we make use of the concentration properties guaranteed by conditioning on . Therefore, we assume that . Then Lemma 5 ensures that
[TABLE]
Hence, the union bound shows that the first inequality holds for all elements of w.h.p. if
[TABLE]
Analogously, the second inequality holds for all elements of w.h.p. if
[TABLE]
Again, the union bound shows that the second inequality holds w.h.p. if
[TABLE]
Note that the condition in (6) is monotonically decreasing in while the condition in (7) is monotonically increasing in . Hence the optimal choice of is the one that makes the two terms in (6) and (7) equal:
[TABLE]
which boils down to
[TABLE]
By putting this solution for into (6) we get
[TABLE]
It now suffices to find the minimal such that
[TABLE]
Hence, we solve for (positive) and obtain that Eqs. 6 and 7 hold w.h.p. provided
[TABLE]
which matches the assumption in the lemma statement. ∎
We are now ready to formally prove Theorem 1.
Proof of Theorem 1.
According to Lemma 3, the event is a high-probability event. Corollary 6 then immediately implies the theorem, together with the definition . ∎
IV Information-Theoretic Achievability
In the following section we prove Theorem 2. Our approach is based on counting alternative input vectors that yield the same sequence of query results as the ground truth . Note that the underlying techniques are regularly employed for random constraint satisfaction problems [10].
We start with an outline of the proof. Let be the set of all vectors of Hamming weight such that
[TABLE]
This means, we fix queries and let be the set of all vectors with exactly ones that are consistent with the query results. Let now . We need to prove that w.h.p. if the number of queries exceeds . Note that we can always reconstruct exactly in this case via an exhaustive search (recall that from an information-theoretic point of view the computational power is assumed to be unlimited).
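The exhaustive search mentioned above is easy to write down for tiny instances. The following Python sketch enumerates all vectors of Hamming weight $k$ and keeps those that reproduce the observed query results; the function name `consistent_vectors` and the toy design are assumptions for illustration. Its running time grows like $\binom{n}{k}$, so it only serves to make the counting argument tangible.

```python
from itertools import combinations
from typing import List

def consistent_vectors(n: int, k: int, pools: List[List[int]],
                       results: List[int]) -> List[List[int]]:
    """Enumerate all weight-k binary vectors that reproduce the query results."""
    found = []
    for support in combinations(range(n), k):
        chosen = set(support)
        if all(sum(1 for i in pool if i in chosen) == res
               for pool, res in zip(pools, results)):
            found.append([1 if i in chosen else 0 for i in range(n)])
    return found

sigma = [1, 0, 0, 1, 0, 0]
pools = [[0, 1, 2], [2, 3, 4], [0, 3, 5], [1, 4, 5]]
results = [sum(sigma[i] for i in pool) for pool in pools]
candidates = consistent_vectors(len(sigma), 2, pools, results)
print(len(candidates), sigma in candidates)
```

With all four queries the ground truth is the only consistent vector in this toy design, whereas keeping only the first query leaves several candidates. Reconstruction is possible exactly when the candidate list has length one.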
In our analysis, it turns out that it is much more convenient to study , the number of alternative vectors that are consistent with the query results and have a so-called overlap of with . The overlap is the number of one-entries under that are also present in an alternative vector . Formally, we define
[TABLE]
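The number of weight-$k$ vectors with a given overlap is a product of two binomial coefficients: choose which ones of the ground truth are kept, then choose the remaining ones among the zero-entries. The Python sketch below verifies this by brute force on a small example; the function name and parameters are illustrative.

```python
from itertools import combinations
from math import comb

def count_by_overlap(n: int, k: int, truth_support: set) -> dict:
    """Count weight-k supports by their overlap with the ground-truth support."""
    counts = {}
    for support in combinations(range(n), k):
        overlap = len(truth_support.intersection(support))
        counts[overlap] = counts.get(overlap, 0) + 1
    return counts

n, k = 8, 3
truth_support = {0, 1, 2}
counts = count_by_overlap(n, k, truth_support)
for overlap, count in sorted(counts.items()):
    # brute-force count next to the closed form C(k, l) * C(n - k, k - l)
    print(overlap, count, comb(k, overlap) * comb(n - k, k - overlap))
```

Overlap $k$ corresponds to the ground truth itself, which is why only overlaps strictly below $k$ count as alternative vectors.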
It now suffices to prove that for w.h.p. To this end, two separate arguments are needed. First, we show in Proposition 7 via a first moment argument that no second satisfying input vector can exist with a small overlap with . Secondly, we employ in Proposition 11 the classical coupon collector argument to show that a second satisfying configuration cannot exist for large overlaps. Intuitively, this means that an entry that is flipped from zero under to one under an alternative configuration initiates a cascade of other changes to maintain the observed query results. The full technical proofs for the following statements can be found in the appendix.
Proposition 7.
Let , and assume that . W.h.p., we have
[TABLE]
We now sketch the proof of Proposition 7. By Markov’s inequality it suffices to show that fast enough for all with if for some . For we compute
[TABLE]
The combinatorial meaning is the following: The binomial coefficients count the number of possible input vectors of overlap with . The subsequent term measures the probability that a specific such yields the same results on queries as . To see this, we divide the entries into three categories. The first category contains those entries that exhibit the same value under and . The second and third category feature those entries that are set to one under and to zero under and vice versa. Recall that determines the number of that are set to one under both vectors and . The probability for a specific entry to be in the first category is , while the probability for a specific entry to be in the second or third categories is each. The key observation is that the query results are the same between and if and only if the number of entries in the second category is identical to the number of entries in the third category. We compute (a bound on) the sum over the number of entries which are flipped. Simplifying the term and conditioning on the high probability event yields the following lemma.
Lemma 8.
For every and a random variable , we have
[TABLE]
Here, is the binomial distribution with parameters and where we condition that its outcome is at least .
Proof.
The product of the two binomial coefficients simply accounts for the number of vectors that have overlap with . Let denote the event that one specific that has overlap with belongs to . It suffices to show for that
[TABLE]
The remainder of the proof is dedicated to showing Eq. 8.
By the design , each query contains entries chosen uniformly at random, and we observe that all query results are statistically independent of each other. Therefore, we need only to determine the probability that for a specific and a specific query the result is consistent with the result under such that . Given the overlap , we know for drawn uniformly at random that and finally holds for all , . We get
[TABLE]
The last two components of (9) describe the probability that a one-dimensional simple random walk returns to its original position after steps, which is by Lemma 14 equal to . The former term describes the probability that a random variable takes the value . For the expectation of given is at least of order such that the asymptotic description of the random walk return probability is feasible. Note that if gets closer to , the expectation of gets finite, s.t. the random walk approximation is not feasible anymore. Therefore, using Lemma 15, we can, as long as , simplify (9) to
[TABLE]
for large which implies Lemma 8. ∎
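The random walk estimate used in this proof rests on a standard fact: a simple symmetric random walk is back at its starting point after $2j$ steps with probability $\binom{2j}{j} 4^{-j}$, which behaves like $1/\sqrt{\pi j}$ for large $j$. The exact statement of Lemma 14 is not reproduced in this text, so the Python sketch below only illustrates that standard fact and why the approximation needs a large number of steps.

```python
import math

def return_probability(steps: int) -> float:
    """Exact probability that a simple symmetric walk is at 0 after `steps` steps."""
    if steps % 2 == 1:
        return 0.0                       # an odd number of steps cannot return
    return math.comb(steps, steps // 2) / 2 ** steps

for steps in (10, 100, 1000):
    exact = return_probability(steps)
    approx = math.sqrt(2 / (math.pi * steps))   # Stirling-type asymptotics
    print(steps, exact, approx)
```

For a handful of steps the two values differ noticeably, which matches the remark that the approximation is no longer feasible once the expected number of steps stays bounded.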
While the expression given through Lemma 8 might look hard to work with, it can be simplified using standard asymptotic arguments as follows.
Lemma 9.
For every , and , we have
[TABLE]
The key is to choose such that for every when . Asymptotically, takes its maximum at . Therefore, the r.h.s. of (9) becomes negative if and only if the number of queries parametrized by exceeds . This is formalized in the following lemma and concludes the proof of Proposition 7.
Lemma 10.
For every , and it holds if that
[TABLE]
Proof of Proposition 7.
The proposition is a direct consequence of Lemmas 8, 9 and 10 and Markov’s inequality. ∎
While we could already establish that there are w.h.p. no feasible vectors that have a small overlap with the ground truth , we still need to ensure that there are w.h.p. no feasible vectors that have a large overlap with . Indeed, we exclude such vectors with the next proposition.
Proposition 11.
Let and and assume that . Given we have for all w.h.p.
The proof is fundamentally easy as it follows the classical coupon collector argument. However, it needs some technical attention. If we consider a vector of length different from with the same Hamming weight , at least one entry that is set to one under is labeled zero under . Given the event , this entry is part of at least different queries whose results all change by at least , depending on how often the entry participates. To compensate for these changes, we need to find that are zero under and one under such that their joint neighborhood is a super-set of the changed queries. We show that this only happens with probability following a classical balls-into-bins argument. We now give the full technical proof.
Proof of Proposition 11.
Assume that is a second vector that is consistent with the query results . By definition, there is an index for which but . By Lemma 3 the size of is at least
[TABLE]
and for any query we have . To guarantee that it is necessary to identify a set of entries for which for all with the property that .
By construction of , the number of queries in that do not contain any of the entries in , i.e., , can be coupled with the number of empty bins in a balls-into-bins experiment as follows. Given , throw balls into bins. Observe that
[TABLE]
and denote by the number of empty bins in this experiment. Since for any the edges are not only distributed over the query-nodes in but over all query-nodes in , we get
[TABLE]
We condition on and therefore . Furthermore, set and let . Then the r.h.s. of (10) becomes
[TABLE]
Therefore, if , or equivalently,
[TABLE]
Thus, a Hamming distance of at least one between and immediately implies that the Hamming distance is at least with probability . A union bound over all one-entries implies the proposition. ∎
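The balls-into-bins experiment used in the coupling above can also be explored numerically. The following C++ sketch throws a given number of balls uniformly at random into a given number of bins and counts the bins that stay empty. The function names, parameter names, and the choice of generator are illustrative assumptions and are not taken from the paper; the sketch does not reproduce the specific ball and bin counts of the proof.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Throw `balls` balls independently and uniformly at random into `bins` bins
// and return the number of bins that receive no ball.
// (Hypothetical helper for illustration only.)
std::size_t count_empty_bins(std::size_t balls, std::size_t bins, std::mt19937_64& rng) {
    assert(bins > 0);
    std::vector<bool> hit(bins, false);
    std::uniform_int_distribution<std::size_t> pick(0, bins - 1);
    for (std::size_t b = 0; b < balls; ++b) {
        hit[pick(rng)] = true;
    }
    std::size_t empty = 0;
    for (std::size_t i = 0; i < bins; ++i) {
        if (!hit[i]) ++empty;
    }
    return empty;
}

// Exact expectation of the number of empty bins: bins * (1 - 1/bins)^balls.
double expected_empty_bins(std::size_t balls, std::size_t bins) {
    assert(bins > 0);
    const double b = static_cast<double>(bins);
    return b * std::pow(1.0 - 1.0 / b, static_cast<double>(balls));
}
```

Averaging `count_empty_bins` over many seeds and comparing against `expected_empty_bins` gives a quick sanity check of how unlikely it is that a small number of balls covers all bins.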
Proof of Theorem 2.
The theorem follows directly from Propositions 7 and 11. ∎
V Empirical Analysis and Simulation Results
In this section we present simulation results for the MN-Algorithm (Algorithm 1).
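To make the object of the simulations concrete, the following C++ sketch shows one possible greedy decoder of the neighborhood-sum type. Algorithm 1 is not reproduced in this section, so the details here are assumptions: each entry is scored by the sum of the results of the distinct queries containing it, each result is centered by the expected result of a query of that size (query size times k divided by n), and the k highest-scoring entries are declared one-entries, with ties broken by index. The function name `greedy_decode` and its parameters are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Illustrative greedy decoder (scoring rule is an assumption, see lead-in).
// queries[j] lists the entries pooled by query j (repetitions allowed),
// results[j] is the additive result of query j, k is the number of one-entries.
std::vector<std::uint8_t> greedy_decode(
    std::size_t n,
    const std::vector<std::vector<std::size_t>>& queries,
    const std::vector<std::size_t>& results,
    std::size_t k) {
    assert(n > 0 && k <= n && queries.size() == results.size());
    const std::size_t none = static_cast<std::size_t>(-1);
    std::vector<double> score(n, 0.0);
    std::vector<std::size_t> last_query(n, none);  // last query credited to each entry

    for (std::size_t j = 0; j < queries.size(); ++j) {
        // Center the result by the expected result of a query of this size.
        const double expected = static_cast<double>(queries[j].size()) *
                                static_cast<double>(k) / static_cast<double>(n);
        const double centered = static_cast<double>(results[j]) - expected;
        for (std::size_t i : queries[j]) {
            assert(i < n);
            if (last_query[i] != j) {  // credit each distinct query only once
                last_query[i] = j;
                score[i] += centered;
            }
        }
    }

    // Select the k entries with the highest score (ties broken by smaller index).
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::partial_sort(order.begin(), order.begin() + static_cast<std::ptrdiff_t>(k), order.end(),
                      [&score](std::size_t a, std::size_t b) {
                          return score[a] != score[b] ? score[a] > score[b] : a < b;
                      });

    std::vector<std::uint8_t> estimate(n, 0);
    for (std::size_t r = 0; r < k; ++r) estimate[order[r]] = 1;
    return estimate;
}
```

The decoder needs only one pass over the query results and one partial sort, which is why a rule of this kind suits the setting where all queries are executed in parallel and decoding happens afterwards.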
Our simulation software is implemented in the C++ programming language. It performs a faithful simulation of the parallel system. To generate the random structures, we use the Mersenne Twister mt19937_64 as provided by the C++11 <random> library. All of our simulations were carried out on machines equipped with 20 Intel(R) Xeon(R) E5-2630 v4 CPU cores, backed by 128 GiB of memory, and running the Linux 5.11 kernel. All code required to reproduce our figures, including the gnuplot scripts and various helper tools, can be obtained from our public GitHub repository.
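As an illustration of how random structures can be drawn with `std::mt19937_64`, the following sketch samples a binary signal with exactly k one-entries and a pooling design, and then evaluates all additive queries. It is not the authors' code: the `Instance` struct, the function `make_instance`, and the assumption that every query picks `query_size` entries uniformly at random with replacement are choices made for this sketch only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical container for one simulated instance.
struct Instance {
    std::vector<std::uint8_t> signal;               // ground truth with entries in {0, 1}
    std::vector<std::vector<std::size_t>> queries;  // queries[j]: pooled entries, with multiplicity
    std::vector<std::size_t> results;               // results[j]: one-entries in query j, with multiplicity
};

// Sample a signal of length n with exactly k ones and m additive queries.
// Assumption: each query draws `query_size` entries uniformly with replacement.
Instance make_instance(std::size_t n, std::size_t k, std::size_t m,
                       std::size_t query_size, std::mt19937_64& rng) {
    assert(n > 0 && k <= n);
    Instance inst;

    // Place k ones at the front, then shuffle to obtain a uniform k-sparse signal.
    inst.signal.assign(n, 0);
    std::fill(inst.signal.begin(), inst.signal.begin() + static_cast<std::ptrdiff_t>(k),
              static_cast<std::uint8_t>(1));
    std::shuffle(inst.signal.begin(), inst.signal.end(), rng);

    std::uniform_int_distribution<std::size_t> pick(0, n - 1);
    inst.queries.resize(m);
    inst.results.assign(m, 0);
    for (std::size_t j = 0; j < m; ++j) {
        inst.queries[j].reserve(query_size);
        for (std::size_t t = 0; t < query_size; ++t) {
            const std::size_t i = pick(rng);
            inst.queries[j].push_back(i);
            inst.results[j] += inst.signal[i];  // additive measurement
        }
    }
    return inst;
}
```

Seeding the generator explicitly, for example `std::mt19937_64 rng(42);`, makes individual runs reproducible.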
In our first empirical result in Fig. 2 we analyze the number of queries required to reconstruct for and different values of . The dotted lines show our theoretical asymptotic bounds. Note that the discontinuities in the theoretical bound stem from rounding the number of one-entries to the closest integer. We remark that our simulation results align well with the theoretical predictions for larger values of . For smaller values of , our theoretical results are too optimistic: the lower-order term hidden in the in LABEL:eqs_greedy scales as , and while this expression decreases polynomially fast in , it is far from vanishing for small values of and .
In Figs. 3 and 4 we analyze the success probability for exact reconstruction of and the number of correctly identified one-entries. For different numbers of queries we conducted 100 independent simulation runs for and and different values of . The dashed lines show the phase transitions predicted by Theorem 1. The data in Fig. 4 indicate that all but a small fraction of one-entries are correctly detected, even when the exact reconstruction of is still quite unlikely according to Fig. 3. Overall, the implementation hints at the practical usability of the MN-Algorithm, even for small values of .
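The two quantities reported in Figs. 3 and 4 can be computed from a batch of runs as in the following sketch. The names `correct_ones`, `Summary`, and `summarize` are hypothetical, and the sketch assumes that each run provides the ground truth and the decoder's estimate as 0/1 vectors of equal length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Signal = std::vector<std::uint8_t>;

// Number of positions that are one in both the ground truth and the estimate.
std::size_t correct_ones(const Signal& truth, const Signal& estimate) {
    assert(truth.size() == estimate.size());
    std::size_t count = 0;
    for (std::size_t i = 0; i < truth.size(); ++i) {
        if (truth[i] == 1 && estimate[i] == 1) ++count;
    }
    return count;
}

struct Summary {
    double success_rate;       // fraction of runs with exact reconstruction
    double mean_correct_ones;  // average number of correctly identified one-entries
};

// Aggregate independent runs; truths[r] and estimates[r] belong to run r.
Summary summarize(const std::vector<Signal>& truths, const std::vector<Signal>& estimates) {
    assert(!truths.empty() && truths.size() == estimates.size());
    std::size_t exact = 0;
    std::size_t correct = 0;
    for (std::size_t r = 0; r < truths.size(); ++r) {
        if (truths[r] == estimates[r]) ++exact;
        correct += correct_ones(truths[r], estimates[r]);
    }
    const double runs = static_cast<double>(truths.size());
    return {static_cast<double>(exact) / runs, static_cast<double>(correct) / runs};
}
```

Keeping the two statistics separate reflects the observation above: a run can miss exact reconstruction while still identifying almost all one-entries.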
Remark.
The formal proof of the algorithmic bound directly gives an insight about the convergence speed and thus about the expected performance of the MN-Algorithm for finite : we can compute that the MN-Algorithm requires an additional multiplicative factor of at least
[TABLE]
queries in addition to the asymptotic analysis for . This explains the (slight) deviation of the theoretical and the empirical results for small values of . See the proof of Corollary 6 in Section III for the rigorous analysis.
VI Conclusions and Open Problems
In this paper we analyze the binary pooled data problem with additive queries both from an information-theoretic and an algorithmic point of view. Our first result is a simple greedy reconstruction scheme that performs well even close to the information-theoretic boundaries. Our main concern is the design of a reconstruction scheme that works well when all queries are conducted in parallel. In a series of simulations we show that this scheme is applicable to a large range of parameters that can be expected from real-world instances. For example, our data indicate that on average we correctly identify 99% of the one-entries when conducting only 220 queries for and . Our second result sheds light on the information-theoretic achievability threshold, where our theorem closes the open gap between the results of [11] and [17] by establishing a sharp phase transition.
An immediate open problem is to close the gap between the algorithmic and the information-theoretic threshold. Furthermore, there are similar reconstruction problems in which conducting all queries in parallel is crucial. As discussed in the introduction, group testing is a prime example, and it was recently fully understood using techniques similar to those of the present work. A less well understood reconstruction problem is threshold group testing [8, 22], in which a query outputs if and only if the number of positive entries exceeds a threshold . It is very likely that the techniques of the present contribution can be applied to threshold group testing as well, as they were previously applied to various reconstruction problems, but the tailor-made application remains a highly non-trivial challenge. Another exciting avenue for future research is partially parallelizable designs. Suppose that, for instance, processing units can be used to evaluate queries in parallel. Then it is a natural requirement for a design to always conduct up to queries in parallel. An interesting open question then is to analyze the trade-offs that arise in such partially parallelized schemes. In particular, there might be designs providing efficient reconstruction algorithms that outperform the completely parallel design studied in this paper.
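The difference between the additive queries studied in this paper and threshold queries, as well as the round structure of a partially parallel schedule, can be sketched as follows. All names are hypothetical. Modeling the threshold query as a Boolean that is true exactly when the count strictly exceeds the threshold is an assumption of this sketch, and `rounds_needed` assumes a fixed, non-adaptive set of queries that is simply split into batches.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Additive query: number of one-entries among the pooled entries (with multiplicity).
std::size_t additive_query(const std::vector<std::uint8_t>& signal,
                           const std::vector<std::size_t>& pool) {
    std::size_t count = 0;
    for (std::size_t i : pool) {
        assert(i < signal.size());
        count += signal[i];
    }
    return count;
}

// Threshold query (assumed Boolean output): true iff the count exceeds the threshold.
bool threshold_query(const std::vector<std::uint8_t>& signal,
                     const std::vector<std::size_t>& pool,
                     std::size_t threshold) {
    return additive_query(signal, pool) > threshold;
}

// Rounds needed to run `num_queries` queries when up to `units` queries
// can be evaluated in parallel per round (ceiling division).
std::size_t rounds_needed(std::size_t num_queries, std::size_t units) {
    assert(units > 0);
    return (num_queries + units - 1) / units;
}
```

A threshold query reveals strictly less than the additive result of the same pool, which is one way to see why transferring the techniques is not immediate.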
Acknowledgements
The authors thank Uriel Feige for various detailed comments which improved the quality of the paper significantly. Furthermore, the authors thank Petra Berenbrink and Amin Coja-Oghlan for helpful discussions and important hints.
References
- [1] A. Alaoui, A. Ramdas, F. Krzakala, L. Zdeborová, and M. I. Jordan, “Decoding from pooled data: Phase transitions of message passing,” IEEE Trans. Information Theory, vol. 65, no. 1, pp. 572–585, 2019.
- [2] M. Aldridge, O. Johnson, and J. Scarlett, “Group testing: An information theory perspective,” Foundations and Trends in Communications and Information Theory, vol. 15, no. 3–4, pp. 196–392, 2019.
- [3] Y. Arjoune, N. Kaabouch, H. E. Ghazi, and A. Tamtaoui, “Compressive sensing: Performance comparison of sparse recovery algorithms,” Proc. 7th IEEE CCWC, 2017.
- [4] R. Ben-Ami, A. Klochendler et al., “Large-scale implementation of pooled RNA extraction and RT-PCR for SARS-CoV-2 detection,” Clinical Microbiology and Infection, vol. 26, no. 9, pp. 1248–1253, 2020.
- [5] R. W. Benz, S. J. Swamidass, and P. Baldi, “Discovery of power-laws in chemical space,” Journal of Chemical Information and Modeling, vol. 48, no. 6, pp. 1138–1151, 2008.
- [6] N. H. Bshouty, “Optimal algorithms for the coin weighing problem with a spring scale,” Proc. 22nd COLT, 2009.
- [7] C. C. Cao, C. Li, and X. Sun, “Quantitative group testing-based overlapping pool sequencing to identify rare variant carriers,” BMC Bioinformatics, vol. 15, p. 195, 2014.
- [8] C. L. Chan, S. Cai, M. Bakshi, S. Jaggi, and V. Saligrama, “Stochastic threshold group testing,” 2013 IEEE Information Theory Workshop (ITW), 2013.
