On the Parallel Reconstruction from Pooled Data

Oliver Gebhard; Max Hahn-Klimroth; Dominik Kaaser; Philipp Loick

arXiv:1905.01458 · cs.DM · April 14, 2022

TL;DR

This paper introduces a simple greedy algorithm for reconstructing sparse binary signals from pooled additive measurements in parallel, establishing sharp theoretical thresholds and validating them through empirical simulations.

Contribution

It presents a new efficient greedy reconstruction algorithm and derives the exact information-theoretic query threshold for sparse signals in pooled data problems.

Findings

1. The greedy algorithm achieves performance comparable to complex methods.

2. Theoretical thresholds for minimal queries are established and validated.

3. Empirical results confirm the practical effectiveness of the approach.

Abstract

In the pooled data problem the goal is to efficiently reconstruct a binary signal from additive measurements. Given a signal $\sigma\in\{0,1\}^{n}$, we can query multiple entries at once and get the total number of non-zero entries in the query as a result. We assume that queries are time-consuming and therefore focus on the setting where all queries are executed in parallel. For the regime where the signal is sparse such that $\|\sigma\|_{1}=o(n)$ our results are twofold: First, we propose and analyze a simple and efficient greedy reconstruction algorithm. Secondly, we derive a sharp information-theoretic threshold for the minimum number of queries required to reconstruct $\sigma$ with high probability. Our first result matches the performance guarantees of much more involved constructions (Karimi et al. 2019). Our second result extends a result of Alaoui et al. (2014) and…

Equations

$$m^{\text{\scriptsize BPD}}_{\text{\scriptsize seq}}\geq(1-o(1))\frac{\ln\frac{n}{k}}{\ln k}\,k.$$

$$m^{\text{\scriptsize BPD}}_{\text{\scriptsize para}}=(2-o(1))\frac{\ln\frac{n}{k}}{\ln k}\,k=2\cdot m^{\text{\scriptsize BPD}}_{\text{\scriptsize seq}}$$

$$(2+o(1))\,k\ln\frac{n}{k}\quad\text{and}\quad(2+o(1))\,k\ln n\sim\frac{2}{1-\theta}\,k\ln\frac{n}{k}$$

$$(1.72+o(1))\,k\ln\frac{n}{k}\quad\text{and}\quad(1.515+o(1))\,k\ln\frac{n}{k}$$

$$m_{\text{mn}}(n,\theta)=4\left(1-\frac{1}{\sqrt{e}}\right)\frac{1+\sqrt{\theta}}{1-\sqrt{\theta}}\,k\ln\frac{n}{k}.$$

$$\Psi_{i}(\sigma)=\sum_{j\in\partial^{\star}x_{i}}y_{a_{j}}\quad\text{and}\quad\Phi_{i}(\sigma)=\Psi_{i}(\sigma)-\boldsymbol{1}\left\{\sigma(i)=1\right\}\boldsymbol{\Delta}_{i}$$

$$\boldsymbol{\Delta}^{\star}_{i}=(1+o(1))\left(1-\exp(-1/2)\right)m$$

$$\mathbb{E}\left[\Psi_{i}-\boldsymbol{\Delta}^{\star}_{i}\frac{k}{2}\,\Big|\,\mathcal{E}_{i}\right]=\boldsymbol{1}\left\{\boldsymbol{\sigma}(i)=1\right\}\boldsymbol{\Delta}_{i}.$$

$$\frac{-(1-\theta)\alpha^{2}d}{4(1-\exp(-1/2))(1+o(1))}+\theta<0\quad\text{and}\quad\frac{-(1-\theta)(1-\alpha)^{2}d}{4(1-\exp(-1/2))(1+o(1))}+1<0,$$

$$m\geq(4+\varepsilon)(1+o(1))\left(1-\exp(-1/2)\right)\frac{1+\sqrt{\theta}}{1-\sqrt{\theta}}\,k\ln\frac{n}{k}.$$

$$\mathbb{P}\left(\left|\boldsymbol{S}_{j}-\mathbb{E}\left[\boldsymbol{S}_{j}\mid\mathcal{E}_{j},\mathcal{R}\right]\right|\geq\frac{(1-\alpha)m}{2}\,\Big|\,\mathcal{E}_{j},\mathcal{R}\right)\leq\exp\left(-(1+o(1))\frac{(1-\alpha)^{2}d}{4\gamma(1+o(1))}\ln\frac{n}{k}\right).$$

$$\begin{aligned}\mathbb{P}\left(\left|\boldsymbol{S}_{j}-\mathbb{E}\left[\boldsymbol{S}_{j}\mid\mathcal{E}_{j},\mathcal{R}\right]\right|\geq\frac{(1-\alpha)m}{2}\,\Big|\,\mathcal{E}_{j},\mathcal{R}\right)&\leq\exp\left(-(1+o(1))\frac{(1-\alpha)^{2}m^{2}}{8\,\mathbb{E}\left[\boldsymbol{S}_{j}\mid\mathcal{E}_{j},\mathcal{R}\right]}\right)\\&=\exp\left(-(1+o(1))\frac{(1-\alpha)^{2}d}{4\gamma(1+o(1))}\ln\frac{n}{k}\right).\end{aligned}$$

$$\begin{aligned}\mathbb{P}\left(\boldsymbol{S}_{j}+\boldsymbol{\Delta}_{j}\leq\mathbb{E}\left[\boldsymbol{S}_{j}\mid\mathcal{E}_{j},\mathcal{R}\right]+\frac{(1-\alpha)m}{2}\,\Big|\,\mathcal{E}_{j},\mathcal{R}\right)&\leq\exp\left(-\frac{\alpha^{2}d}{4\gamma(1+o(1))}\ln\frac{n}{k}\right)\\&=\exp\left(\frac{(\theta-1)\alpha^{2}d}{4\gamma(1+o(1))}\ln n\right).\end{aligned}$$

$$\frac{(\theta-1)\alpha^{2}d}{4\gamma(1+o(1))}+\theta<0.$$

$$\begin{aligned}\mathbb{P}\left[\boldsymbol{S}_{j}\geq\mathbb{E}\left[\boldsymbol{S}_{j}\mid\mathcal{E}_{j},\mathcal{R}\right]+\frac{(1-\alpha)m}{2}\,\Big|\,\mathcal{E}_{j},\mathcal{R}\right]&\leq\exp\left(-\frac{(1-\alpha)^{2}d}{4\gamma(1+o(1))}\ln\frac{n}{k}\right)\\&=\exp\left(\frac{(\theta-1)(1-\alpha)^{2}d}{4\gamma(1+o(1))}\ln n\right).\end{aligned}$$

$$\frac{(\theta-1)(1-\alpha)^{2}d}{4\gamma(1+o(1))}+1<0.$$

$$\frac{(\theta-1)\alpha^{2}d}{4\gamma(1+o(1))}+\theta=\frac{(\theta-1)(1-\alpha)^{2}d}{4\gamma(1+o(1))}+1,$$

$$\frac{(\theta-1)\left(d-4(\gamma+o(1))\right)^{2}}{16\gamma d+o(1)}+\theta<0.$$

Taxonomy

Topics: Sparse and Compressive Sensing Techniques · Medical Imaging Techniques and Applications · Machine Learning and Algorithms

Full text

On the Parallel Reconstruction from Pooled Data

Funding: OG and PL were supported by DFG CO 646/3. MHK was supported by DFG FOR 2975 and Stiftung Polytechnische Gesellschaft.

Oliver Gebhard, TU Dortmund University, Dortmund, Germany

Dominik Kaaser, Universität Hamburg, Hamburg, Germany

Max Hahn-Klimroth, TU Dortmund University, Dortmund, Germany

Philipp Loick, Goethe University Frankfurt, Frankfurt, Germany

Abstract

In the pooled data problem the goal is to efficiently reconstruct a binary signal from additive measurements. Given a signal $\boldsymbol{\sigma}\in\left\{{0,1}\right\}^{n}$ , we can query multiple entries at once and get the total number of non-zero entries in the query as a result. We assume that queries are time-consuming and therefore focus on the setting where all queries are executed in parallel. For the regime where the signal is sparse such that $\left|{\left|{\boldsymbol{\sigma}}\right|}\right|_{1}=o(n)$ our results are twofold: First, we propose and analyze a simple and efficient greedy reconstruction algorithm. Secondly, we derive a sharp information-theoretic threshold for the minimum number of queries required to reconstruct $\boldsymbol{\sigma}$ with high probability. Our first result matches the performance guarantees of much more involved constructions (Karimi et al. 2019). Our second result extends a result of Alaoui et al. (2014) and Scarlett & Cevher (2017) who studied the pooled data problem for dense signals. Finally, our theoretical findings are complemented with empirical simulations. Our data not only confirm the information-theoretic thresholds but also hint at the practical applicability of our pooling scheme and the simple greedy reconstruction algorithm.

Index Terms:

Reconstruction, Sparse Signal, Pooled Data, Information Theory, Phase Transitions

I Introduction

We consider the binary pooled data problem with additive queries, which is defined as follows. We are given a signal of length $n$, that is, a large vector $\boldsymbol{\sigma}\in\left\{{0,1}\right\}^{n}$ of Hamming weight $k$, and a querying method. Each query pools multiple entries of $\boldsymbol{\sigma}$ together and returns the exact number of non-zero entries contained in the pool (see Fig. 1 for an example). The goal is to reconstruct $\boldsymbol{\sigma}$ using as few queries as possible.

In many real-world scenarios the time to compute a reconstruction of $\boldsymbol{\sigma}$ is dominated by the time to perform a single query. The evaluation of such a query may require, e.g., computations using a deep neural network on a GPU [20], biological processes such as DNA screening [7, 26], or PCR tests in a bio-medical context [4]. To obtain a substantial speed-up, we therefore focus on parallel schemes where all queries are specified a priori and executed simultaneously. This assumption makes sense in the context of a life sciences laboratory: queries can be envisioned as measurements conducted by a liquid handling robot. The time to perform all (parallel) queries then clearly dominates the time to run an efficient (sequential) reconstruction algorithm (for practical input sizes).

In this paper we focus on the sublinear regime where the number of non-zero entries $k$ scales sub-linearly in the signal’s length $n$ such that $k=n^{\theta}$ for some $\theta<1$ . In this setting, our main task is to specify a suitable parallel pooling design and an efficient reconstruction algorithm that allows us to compute $\boldsymbol{\sigma}$ efficiently from the queried data. We are interested in two different types of phase-transitions that commonly arise in the analysis of reconstruction and statistical inference problems:

1. What is the minimum number of queries that allows us to infer $\boldsymbol{\sigma}$ from the query results given unlimited computational power?

2. How many queries are required such that an efficient algorithm can compute $\boldsymbol{\sigma}$ from the query results?

We will refer to the first phase-transition as the information-theoretic threshold and to the second phase-transition as the algorithmic threshold.

I-A The Teacher-Student Model

As in many related reconstruction problems, the teacher-student model provides the fundamental means towards analyzing information-theoretic questions. The challenge in such reconstruction problems lies in deriving probability distributions that are dependent on a variety of random variables and hard to express per se. However, deriving probability distributions conditioned on certain high-probability events is feasible. For an introduction and mathematical justification of the model, we refer the reader to [10]. The setup is the following: a teacher aims to convey some ground truth to a student. Rather than directly providing the ground truth to the student, the teacher generates observable data from the ground truth via some statistical model and passes both the data and the model to the student. The student now aims to infer the ground truth from the observed data and the model.

In terms of this paper we see $\boldsymbol{\sigma}$ as the ground truth. Its distribution is inherited from all vectors in $\left\{{0,1}\right\}^{n}$ of Hamming weight $k$ . The observable data $\boldsymbol{y}$ , together with the conducted queries (expressed as a graph ${\boldsymbol{G}}$ ) are passed to the student in order to infer $\boldsymbol{\sigma}$ . In the following, we analyze the chances of the student to infer the ground truth from the observable data. First, we derive the model distribution from the provided information ${\boldsymbol{G}}$ and the query results $\boldsymbol{y}$ . Afterwards, we use the gained knowledge to analyze the chances of the student to recover the ground truth by estimating the number of possible input vectors that are consistent with the observed query results. As our goal is to recover $\boldsymbol{\sigma}$ with high probability, we condition on the event that the underlying bipartite multi-graph ${\boldsymbol{G}}$ , which will be defined properly in due course, behaves almost as expected. We exploit the knowledge about ${\boldsymbol{G}}$ to derive high-probability events which we can condition on. Eventually, our analysis conveys the information whether there is a unique input vector or multiple possible input vectors out of which the student has to guess the correct one.

I-B Related Work

The binary pooled data problem, sometimes called quantitative group testing, finds its roots in early works of Dorfman [13], Djackov [11], and Shapiro [27]. It has recently gained a lot of interest in the literature [1, 6, 14, 18, 25], with applications in a multitude of disciplines such as DNA screening [26], identifying genetic carriers [7] and machine learning [20, 23, 33]. Variants of the problem include binary group testing [2, 9] or threshold group testing [8, 22]. We start our discussion with an overview of related work from information theory.

Information-Theoretic Aspects

A simple information-theoretic lower bound can be obtained by a folklore counting argument: each query returns a number from $0$ to $k$, thus a pooling design with $m$ queries can produce at most $(k+1)^{m}$ different outcomes. This number must be at least $\binom{n}{k}$ in order to distinguish all possible input vectors of length $n$ with Hamming weight $k$. By standard asymptotic bounds, we obtain

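$$m^{\text{\scriptsize BPD}}_{\text{\scriptsize seq}}\geq(1-o(1))\frac{\ln\frac{n}{k}}{\ln k}\,k. \tag{1}$$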
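As a rough numerical illustration of the counting argument (a sketch that is not part of the original analysis; the function names are our own), the following compares the exact requirement $(k+1)^{m}\geq\binom{n}{k}$ with the leading-order expression $k\ln(n/k)/\ln k$. The two agree only asymptotically, so some gap at moderate $n$ is to be expected.

```python
import math


def counting_lower_bound(n: int, k: int) -> float:
    """Smallest real m with (k+1)^m >= C(n, k), i.e. ln C(n, k) / ln(k + 1)."""
    # lgamma avoids building the (huge) binomial coefficient explicitly.
    log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_binom / math.log(k + 1)


def asymptotic_bound(n: int, k: int) -> float:
    """Leading-order expression k * ln(n/k) / ln(k) used in the text."""
    return k * math.log(n / k) / math.log(k)


if __name__ == "__main__":
    for n in (10**4, 10**6, 10**8):
        k = round(n ** 0.3)  # sublinear regime with theta = 0.3 (our choice)
        print(n, k, counting_lower_bound(n, k), asymptotic_bound(n, k))
```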

The universal lower bound on $m^{\text{\scriptsize BPD}}_{\text{\scriptsize seq}}$ holds in any case, even if the queries do not need to be conducted in parallel. Restricted to the important special case in which all queries are conducted in parallel, [11] shows that reconstruction of $\boldsymbol{\sigma}$ requires at least

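$$m^{\text{\scriptsize BPD}}_{\text{\scriptsize para}}=(2-o(1))\frac{\ln\frac{n}{k}}{\ln k}\,k=2\cdot m^{\text{\scriptsize BPD}}_{\text{\scriptsize seq}} \tag{2}$$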

queries, even with unlimited computational power. On the positive side, Bshouty [6] proves that reconstruction of $\boldsymbol{\sigma}$ is efficiently possible with $(2+\varepsilon)m^{\text{\scriptsize BPD}}_{\text{\scriptsize seq}}$ queries if they are conducted sequentially, and Grebinski and Kucherov [17] provide a parallelizable design based on separating matrices with an exponential-time decoding algorithm that guarantees inference with $(2+\varepsilon)m^{\text{\scriptsize BPD}}_{\text{\scriptsize para}}$ queries. The latter positive result was extended to the so-called Subset Select problem [21], a relaxation of the pooled data problem that asks to identify only a subset of positive entries correctly. Recently, [14] improved the result for this relaxation by a factor of $2$. So far, these results hold independently of $k$. For the linear regime where $k=\Theta(n)$, much sharper results are already known: Alaoui et al. [1] and Scarlett and Cevher [25] show that there is an exponential-time construction that achieves reconstruction with $(1+\varepsilon)m^{\text{\scriptsize BPD}}_{\text{\scriptsize para}}$ parallel queries, a result that depends on $k$ scaling linearly in $n$.

Algorithmic Aspects

If sequential queries are allowed, Bshouty [6] presents an efficient reconstruction algorithm that recovers $\boldsymbol{\sigma}$ with no more than $(2+o(1))m^{\text{\scriptsize BPD}}_{\text{\scriptsize seq}}$ queries. However, for parallel schemes, there are significant gaps between the information-theoretic lower bound and the best currently known efficient algorithms [1, 12, 14, 15, 19, 24]. For instance, Alaoui et al. [1] present an Approximate Message Passing algorithm for dense signals ($k=\Theta(n)$). Furthermore, Donoho and Tanner [12] give a decoding strategy based on $\ell_{1}$-minimization, and Foucart and Rauhut [15] introduce the Basis Pursuit algorithm. They can be used to recover $\boldsymbol{\sigma}$ with

$$(2+o(1))\,k\ln\frac{n}{k}\quad\text{and}\quad(2+o(1))\,k\ln n\sim\frac{2}{1-\theta}\,k\ln\frac{n}{k}$$

queries, respectively, if the signal is sparse ( $k\ll n$ ). Note that these algorithms solve the more general compressed sensing problem. Various improvements over the Basis Pursuit algorithm are known (e.g., the Orthogonal Matching Pursuit [24] and its improved version for discrete signals [29]) but as Wang and Yin [32] discuss, they do not perform asymptotically better in the setting discussed in this paper. More recent algorithms explicitly designed for recovery of $\boldsymbol{\sigma}$ from additive queries in the sparse regime are due to Karimi et al. [18, 19]. They provide two algorithms based on graph codes that require

$$(1.72+o(1))\,k\ln\frac{n}{k}\quad\text{and}\quad(1.515+o(1))\,k\ln\frac{n}{k}$$

queries, respectively. Furthermore, in a yet unpublished draft that appeared subsequently to our work on arXiv, Feige and Lellouche [14] analyze the Subset Select problem. They prove that, under mild assumptions, an algorithm succeeding at this relaxation can be turned into an algorithm for recovery of $\boldsymbol{\sigma}$ without significantly increasing the required number of queries.

I-C Our Contributions

We study the pooled data problem under the random regular model ${\boldsymbol{G}}$ which is known to be information-theoretically optimal in the linear regime as well as in similar inference problems [9]. More precisely, we let ${\boldsymbol{G}}=(V\cup F,E)$ be a random bipartite multi-graph with query-nodes $F=\{a_{1},\ldots,a_{m}\}$ representing the queries, entry-nodes $V=\{x_{1},\ldots,x_{n}\}$ representing the coordinates of $\boldsymbol{\sigma}$ , and edges $E$ indicating how often a specific entry is contained in a given query. Hereby, each query $a_{i}\in F$ contains exactly $\Gamma=n/2$ entries chosen uniformly at random with replacement.
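The pooling design just described can be sketched in a few lines. This is a minimal illustration under our own assumptions: the function names, the use of Python's `random` module, and the list-of-lists representation of ${\boldsymbol{G}}$ are ours and do not come from the paper.

```python
import random


def sample_signal(n: int, k: int, rng: random.Random) -> list[int]:
    """Uniformly random 0/1 vector of length n with exactly k ones."""
    sigma = [0] * n
    for i in rng.sample(range(n), k):
        sigma[i] = 1
    return sigma


def sample_design(n: int, m: int, rng: random.Random) -> list[list[int]]:
    """m queries, each holding Gamma = n // 2 entry indices drawn uniformly
    at random WITH replacement, so an entry may occur several times."""
    gamma = n // 2
    return [[rng.randrange(n) for _ in range(gamma)] for _ in range(m)]


def run_queries(design: list[list[int]], sigma: list[int]) -> list[int]:
    """Additive query results: every occurrence of a one-entry is counted,
    so a repeated one-entry contributes more than once."""
    return [sum(sigma[i] for i in pool) for pool in design]


if __name__ == "__main__":
    rng = random.Random(0)
    sigma = sample_signal(20, 3, rng)
    design = sample_design(20, 5, rng)
    print(run_queries(design, sigma))
```

Storing each query as a list with repetitions keeps the multi-edges of the multi-graph explicit, which matters because a repeated one-entry raises the query result more than once.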

Algorithmic Results

For the aforementioned pooling design we present a fairly intuitive greedy algorithm called Maximum Neighborhood (MN) Algorithm that allows reconstruction of $\boldsymbol{\sigma}$ w.h.p. (the expression with high probability, w.h.p., refers to a probability that tends to 1 as $n\rightarrow\infty$). It follows a thresholding approach that is much simpler than the known algorithms by Karimi et al. [18, 19], which are technically highly challenging. A formal definition of the MN-Algorithm is given in Algorithm 1.

On an intuitive level, the MN-Algorithm works as follows. First, we conduct $m$ queries in parallel, each containing exactly $\Gamma$ randomly chosen entries of the signal, which yields the graph representation ${\boldsymbol{G}}$. Secondly, for each coordinate $x$ we sum up the query results $(\boldsymbol{y}_{a})_{a\in\partial x}$ in its neighborhood induced by ${\boldsymbol{G}}$, counting multi-edges only once. The sum is then centered by subtracting its expected value. Finally, those coordinates with a large score are very likely to have the value $1$ under $\boldsymbol{\sigma}$. Our first main theorem states how many parallel queries are required for the MN-Algorithm to recover the correct $\boldsymbol{\sigma}$ w.h.p.
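The intuitive description above can be turned into a short sketch. It is not a transcription of Algorithm 1, which is not reproduced here. It assumes the score $\Psi_{i}-\boldsymbol{\Delta}^{\star}_{i}k/2$ used later in the analysis and declares the $k$ highest-scoring coordinates to be the one-entries. The names `mn_reconstruct` and `simulate` are hypothetical.

```python
import random


def mn_reconstruct(design, y, n, k):
    """Thresholding sketch: score_i = Psi_i - Delta*_i * k / 2, keep the k largest."""
    psi = [0] * n         # sum of results over the distinct queries containing i
    delta_star = [0] * n  # number of distinct queries containing i
    for pool, result in zip(design, y):
        for i in set(pool):  # set(): multi-edges are counted only once
            psi[i] += result
            delta_star[i] += 1
    scores = [psi[i] - delta_star[i] * k / 2 for i in range(n)]
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    estimate = [0] * n
    for i in top:
        estimate[i] = 1
    return estimate


def simulate(n, k, m, seed=0):
    """Draw a random signal and design, run all queries, and reconstruct."""
    rng = random.Random(seed)
    sigma = [0] * n
    for i in rng.sample(range(n), k):
        sigma[i] = 1
    gamma = n // 2
    design = [[rng.randrange(n) for _ in range(gamma)] for _ in range(m)]
    y = [sum(sigma[i] for i in pool) for pool in design]
    return sigma, mn_reconstruct(design, y, n, k)


if __name__ == "__main__":
    sigma, estimate = simulate(n=1000, k=8, m=300)
    print("exact recovery:", sigma == estimate)
```

Whether a single run recovers $\boldsymbol{\sigma}$ exactly depends on the random design. The guarantee in the text is asymptotic, so small instances may occasionally fail even above the stated number of queries.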

Theorem 1.

Suppose that $0<\theta<1$ , $k=n^{\theta}$ , and $\varepsilon>0$ and let

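$$m_{\text{{mn}}}(n,\theta)=4\left(1-\frac{1}{\sqrt{e}}\right)\frac{1+\sqrt{\theta}}{1-\sqrt{\theta}}\,k\ln\frac{n}{k}.$$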

If $m>(1+\varepsilon)m_{\text{{mn}}}(n,\theta)$ , then Algorithm 1 outputs $\boldsymbol{\sigma}$ w.h.p. on input $m$ and $k$ and an additive querying method query that returns the total number of one-entries in a query.

While the MN-algorithm takes $k$ as an input, the proof reveals that exact prior knowledge of $k$ is not required. More precisely, a lower bound on $k$ suffices, as in this case enough queries are conducted and the design of ${\boldsymbol{G}}$ is independent of $k$. Observe that one additional parallel query on all entries reveals the exact value of $k$ immediately without increasing $m$ asymptotically. Therefore, the only dependence on $k$ in Algorithm 1 (Line 7) can easily be removed by this one additional query. Besides not being strictly dependent on $k$, a main novelty of the MN-algorithm is its greedy fashion, providing a straightforward approach compared to the technically challenging algorithms presented in [18, 19].

Parallelized Reconstruction

Observe that our reconstruction algorithm, apart from sampling the test design and performing all queries in parallel, is specified in a sequential fashion. This emphasizes the local structure of the reconstruction algorithm. In the context of a parallel computation we observe that our algorithm can be readily parallelized. When individual queries can be conducted much faster, this further reduces the overall running time of our approach. Such improved reconstruction algorithms can be used in the context of machine learning, see, e.g., [33] for an application.

Recall that our test design is described by a random bipartite graph ${\boldsymbol{G}}$ and let $M=M({\boldsymbol{G}})=(m_{ij})\in\{0,1\}^{n\times m}$ be the unweighted biadjacency matrix of ${\boldsymbol{G}}$. Intuitively, the entries of $M$ are those values that are summed up in Line $6$ of Algorithm 1. It follows that the vectors $\boldsymbol{\Delta}^{\star}$ and $\Psi$ are the matrix-vector products $\boldsymbol{\Delta}^{\star}=M\boldsymbol{1}$ and $\Psi=M\boldsymbol{y}$, where $\boldsymbol{1}=(1,\dots,1)$ is the all-ones vector and $\boldsymbol{y}$ is the query result vector. The sums computed in Lines 4 to 6 of Algorithm 1 can therefore be expressed in terms of two matrix-vector products for which efficient parallelizations are known. Finally, in Lines 7 to 9 of Algorithm 1 the coordinates of the resulting vector are sorted. See [28] for a rather recent survey (with a focus on but not limited to GPUs) on parallel sorting algorithms.
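The matrix-vector formulation can be checked on a toy instance. The sketch below uses plain Python lists and a hypothetical helper `matvec` purely for illustration. In practice one would presumably hand both products to a parallel linear-algebra library. The toy design and signal are our own.

```python
def biadjacency(design, n):
    """Unweighted n x m matrix M with M[i][j] = 1 iff entry i occurs in query j."""
    m = len(design)
    M = [[0] * m for _ in range(n)]
    for j, pool in enumerate(design):
        for i in pool:
            M[i][j] = 1  # repeated occurrences collapse to a single 1
    return M


def matvec(M, v):
    """Plain matrix-vector product; each row is independent, hence parallelizable."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


# Toy instance with n = 4 entries and m = 3 queries (with repeated entries).
design = [[0, 0, 1], [1, 2, 2], [0, 3, 3]]
sigma = [1, 0, 0, 1]
y = [sum(sigma[i] for i in pool) for pool in design]

M = biadjacency(design, 4)
delta_star = matvec(M, [1] * len(design))  # Delta* = M 1
psi = matvec(M, y)                         # Psi = M y
```

Every row of $M$ is processed independently of the others, which is the property that makes the two products easy to distribute.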

Information-Theoretic Results

We prove that in the sublinear regime where $k=n^{\theta}$ for some $\theta\in(0,1)$ it is possible to reconstruct $\boldsymbol{\sigma}$ from $({\boldsymbol{G}},\boldsymbol{y})$ with high probability with no more than $(1+\varepsilon)m^{\text{\scriptsize BPD}}_{\text{\scriptsize para}}$ parallel queries for some arbitrarily small $\varepsilon>0$ . More precisely, we show that there is, with high probability, no second input vector $\tau\in\{0,1\}^{n}$ leading to the same sequence of query results.

Theorem 2.

Suppose that $0<\theta<1$ , $k=n^{\theta}$ , and $\varepsilon>0$ and let

[TABLE]

If $m>(1+\varepsilon)m^{\text{\scriptsize BPD}}_{\text{\scriptsize para}}$ , $\boldsymbol{\sigma}$ can be computed from ${\boldsymbol{G}}$ and $\boldsymbol{y}$ w.h.p.

Our result reduces the previously known upper bound of Grebinski and Kucherov [17] by a factor of two and we provide the missing counterpart of (2), which establishes the existence of a phase-transition at $m^{\text{\scriptsize BPD}}_{\text{\scriptsize para}}$ for parallel designs.

I-D Discussion

Our results extend information-theoretic results of Alaoui et al. [1] from the linear regime to the sublinear regime. For $\theta\to 1$, our threshold of Theorem 1 turns out to converge towards the threshold of [1]. The study of the sublinear regime is inspired by studies of the compressed sensing problem with a sparse underlying signal [3]. In the special case of the binary pooled data problem, those studies were initiated by [19]. The sparse regime is indeed interesting in real-world applications, with examples including epidemiology, where Heaps' law models the early spread of pandemics [5, 31], or the detection of rare features in image classification in machine learning [20]. The relevance of the sublinear regime can be seen in the following example. Suppose a screening for HIV is conducted. Out of about 67,220,000 residents of the UK, 105,200 are known to be infected with HIV. Hence, by screening $n=10{,}000$ random samples, we expect about 16 positive entries in the signal corresponding to the infection status. Since $10{,}000^{0.3}\approx 15.8$, the choice $\theta=0.3$ describes the situation quite well.
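The arithmetic behind this example is easy to reproduce. The snippet below uses only the figures quoted above, and the variable names are ours. It computes the expected number of positives in the sample and the exponent $\theta$ for which $k=n^{\theta}$.

```python
import math

residents = 67_220_000  # UK residents, as quoted in the text
infected = 105_200      # known infections, as quoted in the text
n = 10_000              # sample size

expected_k = n * infected / residents        # expected positives in the sample
theta = math.log(expected_k) / math.log(n)   # exponent such that k = n ** theta

print(f"expected k = {expected_k:.2f}, theta = {theta:.3f}")
```

The resulting exponent lies just below $0.3$, in line with the choice made in the text.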

It is not surprising that similar problems have also recently been analyzed in the sublinear regime. By now, a vast body of related literature exists (see, e.g., the survey by Aldridge et al. [2]). Interestingly, for the (presumably more difficult) variant in which a query only returns the information whether at least one non-zero entry was found, a very sophisticated efficient algorithm is known for $\theta\leq\ln 2/(1+\ln 2)\approx 0.409$ which requires $m_{GT}\sim\ln^{-1}(2)k\ln\frac{n}{k}$ parallel queries [9]. Thus, dropping most of the available information and using this approach outperforms not only the simple greedy approach discussed in this paper for small values of $\theta$, but also the quite involved algorithms by Karimi et al. [18, 19]. This result is of fundamental theoretical interest, since it solves an open complexity-theoretic question. Nevertheless, the algorithm proposed in [9] appears to be of rather limited interest for practical applications, as it requires, e.g., that $\sqrt{\ln\ln n}$ is large. This is in contrast to our simple greedy scheme, which our simulations have shown to work well for real-world input sizes.

As in state-of-the-art designs for similar reconstruction problems [2, 9], we allow a specific entry to be included multiple times in one query. While this seems counter-intuitive at first, it does not affect the practicability of the proposed design.

II Model and Notation

In this section we formally introduce the pooling design. As before, $\boldsymbol{\sigma}\in\left\{{0,1}\right\}^{n}$ is the ground truth chosen uniformly at random from all $0$-$1$ vectors of length $n$ with exactly $k$ non-zero entries, where $k=n^{\theta}$ for some $\theta\in(0,1)$. We use ${\boldsymbol{G}}={\boldsymbol{G}}(n,m,\Delta)$ to denote the random bipartite multi-graph that models the pooling design, where $m$ denotes the total number of queries and $\boldsymbol{\Delta}=\{\boldsymbol{\Delta}_{1},\dots,\boldsymbol{\Delta}_{n}\}$ describes the number of queries each individual participates in, counted with multiplicity. Observe that $\boldsymbol{\Delta}_{i}\sim{\rm Bin}(mn/2,1/n)$. Similarly, we let $\boldsymbol{\Delta}^{\star}=\{\boldsymbol{\Delta}^{\star}_{1},\dots,\boldsymbol{\Delta}^{\star}_{n}\}$ denote the number of distinct queries each individual participates in, with expected value $\mathbb{E}\left[{\boldsymbol{\Delta}_{i}^{\star}}\right]=(1-\exp(-1/2))m$. We let the vector $\boldsymbol{y}\in\left\{{0,\dots,\Gamma}\right\}^{m}$ denote the sequence of query results. When we refer to any other input vector than $\boldsymbol{\sigma}$, we simply write $\sigma$ for the input vector and $y=y({\boldsymbol{G}},\sigma)$ for the corresponding result vector. Additionally, we write $V=\left\{{x_{1},\ldots,x_{n}}\right\}$ for the set of the $n$ entries of $\boldsymbol{\sigma}$ and let $V_{0}=\left\{{x_{i}\in V:\boldsymbol{\sigma}(i)=0}\right\}$ and $V_{1}=V\setminus V_{0}$ be the set of entries with value $0$ and $1$, respectively. For $x_{i}\in V$, we write $\partial x_{i}$ for the multiset of queries $a_{j}$ in which $x_{i}$ is contained. Similarly, we write $\partial^{\star}x_{i}$ for the set of distinct such queries. Analogously, for a query $a_{i}$, we denote by $\partial a_{i}$ the multiset of entries that are contained.
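The two expectations can be made concrete at finite $n$. As a sketch under the model above (the function name is ours): an entry is hit by each of the $m\Gamma$ draws with probability $1/n$, and it is missed by a single query with probability $(1-1/n)^{\Gamma}$, which tends to $\exp(-1/2)$.

```python
import math

LIMIT = 1 - math.exp(-0.5)  # limiting fraction of distinct queries per entry


def expected_degrees(n: int, m: int) -> tuple[float, float]:
    """Return (E[Delta_i], E[Delta*_i]) for the design with Gamma = n // 2."""
    gamma = n // 2
    exp_delta = m * gamma / n                        # mean of Bin(m * Gamma, 1/n)
    exp_delta_star = m * (1 - (1 - 1 / n) ** gamma)  # m * P(entry occurs in a query)
    return exp_delta, exp_delta_star


if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        d, d_star = expected_degrees(n, m=100)
        print(n, d, d_star / 100, LIMIT)
```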

Recall that in our model every query contains exactly $\Gamma=n/2$ entries, and those entries are assigned uniformly at random with replacement. If a one-entry $x_{i}$ participates in a query $a_{j}$ more than once, it increases $\boldsymbol{y}_{j}$ multiple times. For each $x_{i}\in V$ , we let $\Psi_{i}$ be the sum of its query results for distinct queries it belongs to. That is, even if the entry appears more than once in a query and thus contributes to the result multiple times, this query’s result contributes to $\Psi_{i}$ only once. Of course, the value of $x_{i}$ under $\boldsymbol{\sigma}$ has a significant impact on this sum, increasing it by $\boldsymbol{\Delta}_{i}$ , if $x_{i}$ is non-zero. To account for this effect in our analysis, we introduce a second variable $\Phi_{i}$ that sums all the query results in which $x_{i}$ is contained and excludes the impact of $x_{i}$ . Formally, for any configuration $\sigma\in\{0,1\}^{n}$ we define

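$$\Psi_{i}(\sigma)=\sum_{j\in\partial^{\star}x_{i}}y_{a_{j}}\quad\text{and}\quad\Phi_{i}(\sigma)=\Psi_{i}(\sigma)-\boldsymbol{1}\left\{\sigma(i)=1\right\}\boldsymbol{\Delta}_{i}$$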

and let $\Psi=(\Psi_{1},\dots,\Psi_{n})$ and $\Phi=(\Phi_{1},\dots,\Phi_{n})$ . When we consider a specific instance $({\boldsymbol{G}},\boldsymbol{y})$ , we will write $\boldsymbol{\Psi}_{i}=\Psi_{i}(\boldsymbol{\sigma})$ and $\boldsymbol{\Phi}_{i}=\Phi_{i}(\boldsymbol{\sigma})$ for the sake of brevity. Notably, while $\boldsymbol{\Psi}_{i}$ is known to the observer or an algorithm instantly from the queries, $\boldsymbol{\Phi}_{i}$ is not, since the ground truth $\boldsymbol{\sigma}$ itself is unknown.
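To make the difference between the two quantities tangible, the following sketch computes $\Psi$ and $\Phi$ for a toy design in which some entries occur twice in a query. The helper name `psi_phi` and the toy data are our own. Note that $\Phi$ needs the ground truth and is therefore only available in the analysis, not to an algorithm.

```python
def psi_phi(design, sigma):
    """Return (Psi, Phi) for a design given as lists of entry indices."""
    n = len(sigma)
    y = [sum(sigma[i] for i in pool) for pool in design]
    delta = [0] * n  # Delta_i: occurrences counted with multiplicity
    psi = [0] * n    # Psi_i: results summed over distinct queries containing i
    for pool, result in zip(design, y):
        for i in pool:
            delta[i] += 1
        for i in set(pool):
            psi[i] += result
    # Phi_i removes the contribution of x_i itself (zero for zero-entries).
    phi = [psi[i] - sigma[i] * delta[i] for i in range(n)]
    return psi, phi


if __name__ == "__main__":
    design = [[0, 0, 1], [1, 2, 2], [0, 3, 3]]
    sigma = [1, 0, 0, 1]
    print(psi_phi(design, sigma))
```

In this toy instance entry $0$ occurs three times in total, so its own contribution of $\boldsymbol{\Delta}_{0}=3$ is removed from $\Psi_{0}$ to obtain $\Phi_{0}$.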

To express the number $m$ of queries conducted, we let $c(n)>0$ denote a positive function from $\mathbb{N}$ to $\mathbb{R}^{+}$ such that

$$m=c(n)\,k\,\frac{\ln\frac{n}{k}}{\ln k}.$$

While it turns out that $c(n)=\Theta(1)$ suffices in the analysis of the information-theoretic bound, we will see that the performance guarantee of the MN-algorithm requires $c(n)$ to scale as $\Theta(\ln n)$ . Finally, we define a high probability event ${\mathcal{R}}$ that we will condition on as explained in the teacher-student model. Let ${\mathcal{R}}$ be the event that, for all $i\in[n]$ , we have

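$$\boldsymbol{\Delta}_{i}=\frac{m}{2}+O\left(\sqrt{m}\ln n\right)\quad\text{and}\quad\boldsymbol{\Delta}^{\star}_{i}=(1+o(1))\left(1-\exp(-1/2)\right)m, \tag{3}$$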

meaning that the underlying random graph satisfies concentration properties. The following lemma states that ${\mathcal{R}}$ is indeed a high probability event.

Lemma 3.

If ${\boldsymbol{G}}$ is constructed according to our pooling scheme, then ${\mathbb{P}}({\mathcal{R}})=1-o(1)$ .

The proof follows from standard concentration results; see the appendix for the technical details. Since Theorems 1 and 2 only contain w.h.p.-assertions, we can safely condition on ${\mathcal{R}}$ for the remainder of our analysis.

III MN-Algorithm

Outline

Recall that $\Psi_{i}$ is the sum over all query results in which the entry $x_{i}$ is contained (multi-edges counted only once) and $\boldsymbol{\Delta}^{\star}_{i}$ is the (random) number of distinct such queries. Furthermore, let ${\mathcal{E}}_{j}$ be the $\sigma$-algebra generated by the edges connected with $x_{j}$. As already discussed, we get

$$\boldsymbol{\Delta}^{\star}_{i}=(1+o(1))\left(1-\exp(-1/2)\right)m$$

w.h.p. Therefore, intuitively, a non-zero entry $x_{i}$ increases the value of $\boldsymbol{\Psi}_{i}$ by $\boldsymbol{\Delta}_{i}=(1+o(1))m/2$, whereas a zero entry does not. Moreover, by construction of the random bipartite (multi-)graph ${\boldsymbol{G}}$, we get that the second neighborhood of $x_{i}$ contains ${\rm Bin}\left({\Gamma\boldsymbol{\Delta}^{\star}_{i},k/n}\right)$ non-zero entries. Thus we expect

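$$\mathbb{E}\left[\Psi_{i}-\boldsymbol{\Delta}^{\star}_{i}\frac{k}{2}\,\Big|\,\mathcal{E}_{i}\right]=\boldsymbol{1}\left\{\boldsymbol{\sigma}(i)=1\right\}\boldsymbol{\Delta}_{i}.$$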

Therefore, if $\Psi_{i}-\boldsymbol{\Delta}^{\star}_{i}\frac{k}{2}$ is called the score of entry $x_{i}$, we observe that the scores differ between zero entries and non-zero entries. The whole proof of the algorithmic performance boils down to identifying a threshold value $T(\alpha)=T(n,k,\alpha)$ such that, if sufficiently many queries are conducted, all scores of zero entries are below $T(\alpha)$ while the scores of all non-zero entries exceed this threshold w.h.p. If we conduct $m=dk\ln\frac{n}{k}$ queries, with $d=c(n)\ln(k)^{-1}$, we get by a standard application of a Chernoff bound and a union bound over all $k=n^{\theta}$ non-zero entries $x_{i}\in V_{1}$ and, respectively, $n-k=\Theta(n)$ zero-entries $x_{i}\in V_{0}$ that $T(\alpha)$ is a valid threshold whenever

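$$\frac{-(1-\theta)\alpha^{2}d}{4(1-\exp(-1/2))(1+o(1))}+\theta<0\quad\text{and}\quad\frac{-(1-\theta)(1-\alpha)^{2}d}{4(1-\exp(-1/2))(1+o(1))}+1<0,$$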

which will become clear in a second. Optimizing these two conditions with respect to $\alpha\in(0,1)$ and plugging $d$ into $m=dk\ln(n/k)$ yields for any $\varepsilon>0$ the sufficient condition

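$$m\geq(4+\varepsilon)(1+o(1))\left(1-\exp(-1/2)\right)\frac{1+\sqrt{\theta}}{1-\sqrt{\theta}}\,k\ln\frac{n}{k}.$$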
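The optimization over $\alpha$ can be verified numerically. The sketch below drops all $o(1)$ terms and writes $\gamma=1-\exp(-1/2)$. The function names are ours. It checks that at $d=4\gamma\frac{1+\sqrt{\theta}}{1-\sqrt{\theta}}$ the balancing choice of $\alpha$ makes both exponents vanish, so any larger $d$ satisfies both conditions strictly.

```python
import math

GAMMA = 1 - math.exp(-0.5)


def one_entry_exponent(alpha: float, d: float, theta: float) -> float:
    """Condition for the k one-entries (union bound adds theta): must be < 0."""
    return -(1 - theta) * alpha**2 * d / (4 * GAMMA) + theta


def zero_entry_exponent(alpha: float, d: float, theta: float) -> float:
    """Condition for the n - k zero-entries (union bound adds 1): must be < 0."""
    return -(1 - theta) * (1 - alpha) ** 2 * d / (4 * GAMMA) + 1


def balancing_alpha(d: float) -> float:
    """The alpha at which both exponents coincide (o(1) terms ignored)."""
    return 0.5 - 2 * GAMMA / d


def critical_d(theta: float) -> float:
    """Value of d at which the balanced exponents are exactly zero."""
    s = math.sqrt(theta)
    return 4 * GAMMA * (1 + s) / (1 - s)


if __name__ == "__main__":
    theta = 0.3
    d = critical_d(theta)
    a = balancing_alpha(d)
    print(d, a, one_entry_exponent(a, d, theta), zero_entry_exponent(a, d, theta))
```

At the critical value the balancing choice equals $\alpha=\sqrt{\theta}/(1+\sqrt{\theta})$, which explains why $\sqrt{\theta}$ appears in the final bound.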

Formal Analysis

Let $\boldsymbol{A}_{ij}\in\mathbb{N}_{0}$ denote how often entry $x_{i}$ appears in query $a_{j}$ and let $\boldsymbol{A}=\left({\boldsymbol{A}_{ij}}\right)_{i\in[n],j\in[m]}$ be the adjacency matrix of ${\boldsymbol{G}}$ . Then the following holds.

Corollary 4.

Let $1\leq j\leq n$ . Given ${\mathcal{E}}_{j}$ , the random variable

[TABLE]

has distribution $\displaystyle{\rm Bin}\left({\boldsymbol{\Delta}^{\star}_{j}\Gamma-\boldsymbol{\Delta}_{j},\frac{k-\boldsymbol{1}\{\boldsymbol{\sigma}(j)=1\}}{n-1}}\right).$

Proof.

This is an immediate consequence of the model definition. There are $\Gamma\boldsymbol{\Delta}^{\star}_{j}-\boldsymbol{\Delta}_{j}$ half-edges connected to query-nodes in the neighborhood of $x_{j}$ that are connected to entry-nodes $x_{i}\neq x_{j}$ . Each of these half-edges is connected to one of $k-\boldsymbol{1}\left\{{\boldsymbol{\sigma}(j)=1}\right\}$ entry-nodes belonging to an entry of value $1$ , independently, from the $n-1$ remaining entry-nodes. ∎

Now it is possible to immediately infer the expectation of $\boldsymbol{S}_{j}$ conditioned on the event ${\mathcal{R}}$ (as defined in (3)). For the sake of brevity let $\gamma=1-\exp\left({-1/2}\right)$ . Given the event ${\mathcal{R}}$ which guarantees concentration properties of the underlying graph, we get w.h.p.

[TABLE]
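As a hedged sketch of where the leading-order term comes from, the mean of the binomial distribution in Corollary 4 can be expanded as follows. We use only facts stated elsewhere in the text, namely $\Gamma=n/2$ , $\boldsymbol{\Delta}_{j}=m/2+O(\sqrt{m}\ln n)$ and $\boldsymbol{\Delta}^{\star}_{j}=(1+o(1))\gamma m$ on the event ${\mathcal{R}}$ . The final simplification is our own and is not meant as a restatement of the display above.

```latex
\mathbb{E}\left[\boldsymbol{S}_{j}\mid\mathcal{E}_{j}\right]
  = \left(\boldsymbol{\Delta}^{\star}_{j}\Gamma-\boldsymbol{\Delta}_{j}\right)
    \frac{k-\boldsymbol{1}\{\boldsymbol{\sigma}(j)=1\}}{n-1}
  = (1+o(1))\,\gamma m\cdot\frac{n}{2}\cdot\frac{k}{n}
  = (1+o(1))\,\frac{\gamma m k}{2}.
```

The first equality is the mean $Np$ of a ${\rm Bin}(N,p)$ variable. The second uses that $\boldsymbol{\Delta}_{j}=O(m)$ is negligible compared to $\boldsymbol{\Delta}^{\star}_{j}\Gamma=\Theta(mn)$ and that $(k-1)/(n-1)=(1+o(1))k/n$ for $k\to\infty$ . The result agrees to leading order with the centering term $\boldsymbol{\Delta}^{\star}_{j}\frac{k}{2}$ in the score.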

The Chernoff bound allows us to bound $\boldsymbol{S}_{j}$ as follows.

Lemma 5.

Let $\alpha\in(0,1)$ be a constant and $m=dk\ln\frac{n}{k}$ . Then

[TABLE]

Proof.

The Chernoff bound (Lemma 12) directly implies

[TABLE]
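To illustrate the kind of tail estimate used here, the following Python sketch compares exact binomial tails with the standard multiplicative Chernoff bounds. The precise form of Lemma 12 is not reproduced in this section, so the two bounds below are the textbook versions and are an assumption on our part. All function names are hypothetical.

```python
import math


def binom_pmf(n, p, i):
    """P[X = i] for X ~ Bin(n, p)."""
    return math.comb(n, i) * p**i * (1 - p) ** (n - i)


def upper_tail(n, p, t):
    """Exact P[X >= t] for X ~ Bin(n, p)."""
    return sum(binom_pmf(n, p, i) for i in range(t, n + 1))


def lower_tail(n, p, t):
    """Exact P[X <= t] for X ~ Bin(n, p)."""
    return sum(binom_pmf(n, p, i) for i in range(0, t + 1))


def chernoff_upper(mu, delta):
    """Bound on P[X >= (1 + delta) * mu], valid for delta > 0."""
    return math.exp(-delta**2 * mu / (2 + delta))


def chernoff_lower(mu, delta):
    """Bound on P[X <= (1 - delta) * mu], valid for 0 < delta < 1."""
    return math.exp(-delta**2 * mu / 2)


if __name__ == "__main__":
    n, p, delta = 200, 0.1, 0.5
    mu = n * p  # mean 20, so the thresholds are 30 and 10
    print(upper_tail(n, p, 30), "<=", chernoff_upper(mu, delta))
    print(lower_tail(n, p, 10), "<=", chernoff_lower(mu, delta))
```

The exponents scale linearly in the mean, which is why a union bound over $k=n^{\theta}$ or $n-k$ entries translates into a condition on $d$ .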

Next we show that, with a suitable choice of a threshold, the scores of zero- and one-entries are well separated.

Corollary 6.

Let $\varepsilon>0$ be an arbitrary constant. If $m\geq\left({4+\varepsilon}\right)\left({1-\exp\left({-1/2}\right)}\right)\frac{1+\sqrt{\theta}}{1-\sqrt{\theta}}k\ln\frac{n}{k}$ then there exists an $\alpha\in(0,1)$ such that, w.h.p., we have

[TABLE]

for all $x_{j}$ where $\boldsymbol{\sigma}(j)=0$ .

Proof.

Let $x_{j}\in V_{1}({\boldsymbol{G}})$ . Again, we make use of the concentration properties guaranteed by conditioning on ${\mathcal{R}}$ . Therefore, we assume that $\boldsymbol{\Delta}_{j}=m/2+O(\sqrt{m}\ln n)$ . Then Lemma 5 ensures that

[TABLE]

Hence, the union bound shows that the first inequality holds for all $k$ elements of $V_{1}({\boldsymbol{G}})$ w.h.p. if

[TABLE]

Analogously, the second inequality holds for all $n-k$ elements of $V_{0}({\boldsymbol{G}})$ w.h.p. if

[TABLE]

Again, the union bound shows that the second inequality holds w.h.p. if

[TABLE]

Note that the condition in (6) is monotonically decreasing in $\alpha$ while the condition in (7) is monotonically increasing in $\alpha$ . Hence the optimal choice of $\alpha$ is the one that makes the two terms in (6) and (7) equal:

[TABLE]

which boils down to

[TABLE]

By putting this solution for $\alpha$ into (6) we get

[TABLE]

It now suffices to find the minimal $d=d(\theta)>0$ such that

[TABLE]

Hence, we solve for (positive) $d$ and obtain that Eqs. 6 and 7 hold w.h.p. provided

[TABLE]

which matches the assumption in the statement of the corollary. ∎
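The balancing step in the proof can be checked numerically. The exact expressions in (6) and (7) are not reproduced in this section, so the sketch below assumes a Gaussian-type form for the two conditions on $d$ : one from the union bound over the $k=n^{\theta}$ one-entries (decreasing in $\alpha$ ) and one from the union bound over the zero-entries (increasing in $\alpha$ ). This form is our assumption, chosen because its balance point reproduces the constant in Corollary 6. The function names are hypothetical.

```python
import math

GAMMA = 1 - math.exp(-0.5)


def d_ones(alpha, theta):
    """Assumed lower bound on d from the one-entries; decreasing in alpha."""
    return 4 * GAMMA * theta / ((1 - theta) * alpha**2)


def d_zeros(alpha, theta):
    """Assumed lower bound on d from the zero-entries; increasing in alpha."""
    return 4 * GAMMA / ((1 - theta) * (1 - alpha) ** 2)


def balance(theta, iterations=200):
    """Bisect for the alpha at which both assumed conditions coincide."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if d_ones(mid, theta) > d_zeros(mid, theta):
            lo = mid  # one-entry condition still dominates: increase alpha
        else:
            hi = mid
    alpha = (lo + hi) / 2
    return alpha, max(d_ones(alpha, theta), d_zeros(alpha, theta))


def closed_form(theta):
    """Constant from Corollary 6 without the epsilon slack."""
    root = math.sqrt(theta)
    return 4 * GAMMA * (1 + root) / (1 - root)


if __name__ == "__main__":
    for theta in (0.1, 0.25, 0.5):
        alpha, d = balance(theta)
        print(theta, alpha, d, closed_form(theta))
```

Under this assumed form the balance point is $\alpha=\sqrt{\theta}/(1+\sqrt{\theta})$ , and the resulting $d$ equals $4\gamma\frac{1+\sqrt{\theta}}{1-\sqrt{\theta}}$ .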

We are now ready to formally prove Theorem 1.

Proof of Theorem 1.

According to Lemma 3, the event ${\mathcal{R}}$ is a high-probability event. Corollary 6 then immediately implies the theorem, together with the definition $m=dk\ln\frac{n}{k}$ . ∎

IV Information-Theoretic Achievability

In the following section we prove Theorem 2. Our approach is based on counting alternative input vectors $\sigma\neq\boldsymbol{\sigma}$ that yield the same sequence of query results as the ground truth $\boldsymbol{\sigma}$ . Note that the underlying techniques are regularly employed for random constraint satisfaction problems [10].

We start with an outline of the proof. Let $S_{k}({\boldsymbol{G}},\boldsymbol{y})$ be the set of all vectors $\sigma\in\{0,1\}^{n}$ of Hamming weight $k$ such that

[TABLE]

This means, we fix $m$ queries $a_{1},\ldots,a_{m}$ and let $S_{k}({\boldsymbol{G}},\boldsymbol{y})$ be the set of all vectors $\sigma\in\left\{{0,1}\right\}^{n}$ with exactly $k$ ones that are consistent with the query results. Let now $Z_{k}({\boldsymbol{G}},\boldsymbol{y})=|S_{k}({\boldsymbol{G}},\boldsymbol{y})|$ . We need to prove that $Z_{k}({\boldsymbol{G}},\boldsymbol{y})=1$ w.h.p. if the number of queries $m$ exceeds $m^{\text{\scriptsize BPD}}_{\text{\scriptsize para}}$ . Note that we can always reconstruct $\boldsymbol{\sigma}$ exactly in this case via an exhaustive search (recall that from an information-theoretic point of view the computational power is assumed to be unlimited).

In our analysis, it turns out that it is much more convenient to study $Z_{k,\ell}({\boldsymbol{G}},\boldsymbol{y})$ , the number of alternative vectors that are consistent with the query results and have a so-called overlap of $\ell$ with $\boldsymbol{\sigma}$ . The overlap is the number of one-entries under $\boldsymbol{\sigma}$ that are also present in an alternative vector $\sigma$ . Formally, we define

[TABLE]

It now suffices to prove that $\sum_{\ell=0}^{k-1}Z_{k,\ell}({\boldsymbol{G}},\boldsymbol{y})=0$ for $m\geq(1+\varepsilon)m^{\text{\scriptsize BPD}}_{\text{\scriptsize para}}$ w.h.p. To this end, two separate arguments are needed. First, we show in Proposition 7 via a first moment argument that no second satisfying input vector $\sigma$ can exist with a small overlap with $\boldsymbol{\sigma}$ . Secondly, we employ in Proposition 11 the classical coupon collector argument to show that a second satisfying configuration cannot exist for large overlaps. Intuitively, this means that an entry that is flipped from zero under $\boldsymbol{\sigma}$ to one under an alternative configuration $\sigma$ initiates a cascade of other changes to maintain the observed query results. The full technical proofs for the following statements can be found in the appendix.
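For very small instances the quantities $Z_{k}({\boldsymbol{G}},\boldsymbol{y})$ and $Z_{k,\ell}({\boldsymbol{G}},\boldsymbol{y})$ can be computed by exhaustive search, which makes the definitions concrete. The following Python sketch is for illustration only. Queries are given as lists of entry indices (repetitions allowed), and the function names are hypothetical. Note that the returned dictionary also contains the ground truth itself under overlap $\ell=k$ .

```python
from itertools import combinations


def query_results(sigma, queries):
    """Additive result of every query under the assignment sigma."""
    return [sum(sigma[i] for i in q) for q in queries]


def count_consistent_by_overlap(sigma, queries):
    """Map overlap -> number of weight-k vectors consistent with the results."""
    n, k = len(sigma), sum(sigma)
    y = query_results(sigma, queries)
    truth_support = {i for i, s in enumerate(sigma) if s == 1}
    counts = {}
    for support in combinations(range(n), k):
        candidate = [0] * n
        for i in support:
            candidate[i] = 1
        if query_results(candidate, queries) == y:
            overlap = len(truth_support.intersection(support))
            counts[overlap] = counts.get(overlap, 0) + 1
    return counts


if __name__ == "__main__":
    sigma = [1, 1, 0, 0]
    # A single query over all entries only reveals the Hamming weight.
    print(count_consistent_by_overlap(sigma, [[0, 1, 2, 3]]))
    # Three queries that pin down sigma uniquely.
    print(count_consistent_by_overlap(sigma, [[0, 1], [1, 2], [0, 2, 3]]))
```

Exact reconstruction by exhaustive search is possible precisely when the only surviving key is $\ell=k$ with count one.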

Proposition 7.

Let $\varepsilon>0$ , $0<\theta<1$ and assume that $m>(1+\varepsilon)m^{\text{\scriptsize BPD}}_{\text{\scriptsize para}}$ . W.h.p., we have

[TABLE]

We now sketch the proof of Proposition 7. By Markov’s inequality it suffices to show that $\mathbb{E}[Z_{k,\ell}({\boldsymbol{G}},\boldsymbol{y})]\to 0$ fast enough for all $\ell$ with $0\leq\ell<k-\left({1-\exp(-1/2)}\right)\ln k$ if $m\geq(1+\varepsilon)m^{\text{\scriptsize BPD}}_{\text{\scriptsize para}}$ for some $\varepsilon>0$ . For $\mathbb{E}[Z_{k,\ell}({\boldsymbol{G}},\boldsymbol{y})]$ we compute

[TABLE]

The combinatorial meaning is the following: The binomial coefficients count the number of possible input vectors $\sigma\neq\boldsymbol{\sigma}$ of overlap $\ell$ with $\boldsymbol{\sigma}$ . The subsequent term measures the probability that a specific such $\sigma$ yields the same results on queries $a_{1},\ldots,a_{m}$ as $\boldsymbol{\sigma}$ . To see this, we divide the entries $x_{1},\ldots,x_{n}$ into three categories. The first category contains those entries that exhibit the same value under $\sigma$ and $\boldsymbol{\sigma}$ . The second and third category feature those entries that are set to one under $\sigma$ and to zero under $\boldsymbol{\sigma}$ and vice versa. Recall that $\ell$ determines the number of $x_{i}$ that are set to one under both vectors $\sigma$ and $\boldsymbol{\sigma}$ . The probability for a specific entry to be in the first category is $1-2(1-\ell/k)k/n$ , while the probability for a specific entry to be in the second or third categories is $(1-\ell/k)k/n$ each. The key observation is that the query results are the same between $\sigma$ and $\boldsymbol{\sigma}$ if and only if the number of entries in the second category is identical to the number of entries in the third category. We compute (a bound on) the sum over the number of entries which are flipped. Simplifying the term and conditioning on the high probability event ${\mathcal{R}}$ yields the following lemma.

Lemma 8.

For every $0\leq\ell\leq k-\left({1-\exp(-1/2)}\right)\ln k$ and a random variable $\boldsymbol{X}\sim{\rm Bin}_{\geq 1}(\Gamma,2(1-\ell/k)k/n)$ , we have

[TABLE]

Here, ${\rm Bin}_{\geq i}(n,p)$ is the binomial distribution with parameters $n$ and $p$ where we condition that its outcome is at least $i$ .

Proof.

The product of the two binomial coefficients simply accounts for the number of vectors $\sigma$ that have overlap $\ell$ with $\boldsymbol{\sigma}$ . Let $\mathcal{S}$ denote the event that one specific $\sigma\in\{0,1\}^{n}$ that has overlap $\ell$ with $\boldsymbol{\sigma}$ belongs to $S_{k,\ell}({\boldsymbol{G}},\boldsymbol{y})$ . It suffices to show for $\boldsymbol{X}\sim{\rm Bin}_{\geq 1}(\Gamma,2(1-\ell/k)k/n)$ that

[TABLE]

The remainder of the proof is dedicated to showing Eq. 8.

By the design of ${\boldsymbol{G}}$ , each query contains $\Gamma=n/2$ entries chosen uniformly at random, and all query results are statistically independent of each other. Therefore, we only need to determine the probability that, for a specific $\sigma$ and a specific query $a_{i}$ , the result is consistent with the result under $\boldsymbol{\sigma}$ , i.e., $\boldsymbol{y}_{i}=y_{i}$ . Given the overlap $\ell$ , we know for $\sigma$ drawn uniformly at random that ${\mathbb{P}}\left[{\boldsymbol{\sigma}_{i}=\sigma_{i}=1}\right]=\ell/n,$ ${\mathbb{P}}\left[{\boldsymbol{\sigma}_{i}=\sigma_{i}=0}\right]=(n-2k+\ell)/n$ and finally ${\mathbb{P}}\left[{\boldsymbol{\sigma}_{i}=1,\sigma_{i}=0}\right]={\mathbb{P}}\left[{\boldsymbol{\sigma}_{i}=0,\sigma_{i}=1}\right]=(k-\ell)/n$ hold for all $x_{i}$ , $i=1,\ldots,n$ . We get

[TABLE]

The last two components of (9) describe the probability that a one-dimensional simple random walk returns to its original position after $2j$ steps, which by Lemma 14 equals $(1+O(j^{-1}))/\sqrt{\pi j}$ . The former term describes the probability that a ${\rm Bin}_{\geq 1}(\Gamma,2(1-\ell/k)k/n)$ random variable $\boldsymbol{X}$ takes the value $2j$ . For $\ell\leq k-(1-\exp(-1/2))\ln k$ the expectation of $\boldsymbol{X}$ given ${\boldsymbol{G}}$ is at least of order $\ln k$ , so that the asymptotic description of the random walk return probability is applicable. Note that if $\ell$ gets closer to $k$ , the expectation of $\boldsymbol{X}$ becomes bounded, so that the random walk approximation is no longer applicable. Therefore, using Lemma 15, we can, as long as $\Gamma(2(1-\ell/k)k/n)=\Omega(\ln n)$ , simplify (9) to

[TABLE]

for large $n\gg 1$ which implies Lemma 8. ∎
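The decomposition used in this proof can be verified numerically for a single query. In the Python sketch below, each of the $\Gamma$ slots of a query independently falls into the second or third category with probability $p=(k-\ell)/n$ each. The probability that both categories occur equally often is computed once directly and once as "number of flipped slots is $2j$ " times "a simple random walk returns to the origin after $2j$ steps". As a simplifying assumption, the sketch uses an unconditioned binomial and includes the term $j=0$ , which differs from the ${\rm Bin}_{\geq 1}$ variable in Lemma 8. The function names are hypothetical.

```python
import math


def prob_equal_direct(gamma, p):
    """P[#category-2 slots == #category-3 slots] via the multinomial formula."""
    total = 0.0
    for j in range(gamma // 2 + 1):
        ways = math.factorial(gamma) // (
            math.factorial(j) ** 2 * math.factorial(gamma - 2 * j)
        )
        total += ways * p ** (2 * j) * (1 - 2 * p) ** (gamma - 2 * j)
    return total


def prob_equal_via_walk(gamma, p):
    """Same probability: Bin(gamma, 2p) hits 2j, then the walk returns to 0."""
    total = 0.0
    for j in range(gamma // 2 + 1):
        flips = math.comb(gamma, 2 * j) * (2 * p) ** (2 * j) * (1 - 2 * p) ** (gamma - 2 * j)
        returns = math.comb(2 * j, j) / 4**j  # return probability after 2j steps
        total += flips * returns
    return total


def return_ratio(j):
    """Exact return probability divided by the approximation 1 / sqrt(pi * j)."""
    return (math.comb(2 * j, j) / 4**j) * math.sqrt(math.pi * j)


if __name__ == "__main__":
    print(prob_equal_direct(50, 0.05), prob_equal_via_walk(50, 0.05))
    print(return_ratio(10), return_ratio(1000))
```

The ratio in `return_ratio` approaches one as $j$ grows, in line with the $(1+O(j^{-1}))/\sqrt{\pi j}$ asymptotics. For small $j$ the approximation is visibly off, which is why the argument requires the expected number of flipped slots to grow.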

While the expression given through Lemma 8 might look hard to work with, it can be simplified using standard asymptotic arguments as follows.

Lemma 9.

For every $0\leq\ell\leq k-\left({1-\exp(-1/2)}\right)\ln k$ , $m=ck\frac{\ln(n/k)}{\ln(k)}$ and $n\gg 1$ , we have

[TABLE]

The key is to choose $c=c(n)$ such that $Z_{k,\ell}({\boldsymbol{G}},\boldsymbol{y})\to 0$ for every $\ell\leq k-\left({1-\exp(-1/2)}\right)\ln k$ when $n\to\infty$ . Asymptotically, $\ln\left(\mathbb{E}[Z_{k,\ell}({\boldsymbol{G}},\boldsymbol{y})]/n\right)$ takes its maximum at $\ell=\Theta\left({k^{2}/n}\right)$ . Therefore, the r.h.s. of (9) becomes negative if and only if the number of queries $m$ parametrized by $c$ exceeds $m^{\text{\scriptsize BPD}}_{\text{\scriptsize para}}$ . This is formalized in the following lemma and concludes the proof of Proposition 7.

Lemma 10.

For every $0\leq\ell\leq k-\left({1-\exp(-1/2)}\right)\ln k$ , $0<\theta<1$ and $\varepsilon>0$ it holds if $m\geq(1+\varepsilon)m^{\text{\scriptsize BPD}}_{\text{\scriptsize para}}$ that

[TABLE]

Proof of Proposition 7.

The proposition is a direct consequence of Lemmas 8, 9 and 10 and Markov’s inequality. ∎

While we could already establish that there are w.h.p. no feasible vectors $\sigma\in\left\{{0,1}\right\}^{n}$ that have a small overlap with the ground truth $\boldsymbol{\sigma}$ , we still need to ensure that there are w.h.p. no feasible vectors that have a large overlap with $\boldsymbol{\sigma}$ . Indeed, we exclude such vectors with the next proposition.

Proposition 11.

Let $\varepsilon>0$ and $0<\theta\leq 1$ and assume that $m>(1+\varepsilon)m^{\text{\scriptsize BPD}}_{\text{\scriptsize para}}$ . Given ${\mathcal{R}}$ we have $Z_{k,\ell}({\boldsymbol{G}},\boldsymbol{y})=0$ for all $k-(1-\exp(-1/2))\ln k<\ell<k$ w.h.p.

The proof is conceptually simple, as it follows the classical coupon collector argument, but it requires some technical care. If we consider a vector $\sigma$ of length $n$ different from $\boldsymbol{\sigma}$ with the same Hamming weight $k$ , at least one entry that is set to one under $\boldsymbol{\sigma}$ is set to zero under $\sigma$ . Given the event ${\mathcal{R}}$ , this entry is part of at least $\boldsymbol{\Delta}_{i}^{\star}>m/4$ different queries whose results all decrease by at least $1$ , depending on how often the entry participates. To compensate for these changes, we need to find $x_{1}\dots x_{\ell}$ that are zero under $\boldsymbol{\sigma}$ and one under $\sigma$ such that their joint neighborhood is a super-set of the changed queries. We show that this only happens with probability $o(1)$ following a classical balls-into-bins argument. We now give the full technical proof.

Proof of Proposition 11.

Assume that $\sigma\in\left\{{0,1}\right\}^{n}$ is a second vector that is consistent with the query results $\boldsymbol{y}$ . By definition, there is an index $j\in\left\{{1,\ldots,n}\right\}$ for which $\boldsymbol{\sigma}(j)=1$ but $\sigma(j)=0$ . By Lemma 3 the size of $\partial^{\star}x_{j}$ is at least

[TABLE]

and for any query $a_{l}\in\partial x_{j}$ we have $\left|{y_{l}(\sigma)-y_{l}(\boldsymbol{\sigma})}\right|\geq 1$ . To guarantee that $y(\boldsymbol{\sigma})=y(\sigma)$ it is necessary to identify a set $\mathcal{X}$ of $h$ entries for which $\sigma(i)=1-\boldsymbol{\sigma}(i)$ for all $i\in\mathcal{X}$ and whose joint neighborhood covers $\partial x_{j}$ , i.e., $\bigcup_{i\in\mathcal{X}}\partial x_{i}\supseteq\partial x_{j}$ .

By construction of ${\boldsymbol{G}}$ , the number of queries in $\partial^{\star}x_{j}$ that do not contain any of the entries in $\mathcal{X}$ , i.e., $\boldsymbol{H}=\left|{\left\{{a\in\partial^{\star}x_{j}:\mathcal{X}\cap\partial a=\emptyset}\right\}}\right|$ , can be coupled with the number of empty bins in a balls-into-bins experiment as follows. Given ${\boldsymbol{G}}$ , throw $b=\sum_{i=1}^{h}\deg(x_{i})$ balls into $\deg(x_{i})$ bins. Observe that

[TABLE]

and denote by $\boldsymbol{H^{\prime}}$ the number of empty bins in this experiment. Since for any $x_{i}$ the $\deg(x_{i})$ edges are not only distributed over the $(1-o(1))\left({1-\exp(-1/2)}\right)m$ query-nodes in $\partial x_{j}$ but over all $m$ query-nodes in ${\boldsymbol{G}}$ , we get

[TABLE]

We condition on ${\mathcal{R}}$ and therefore $b=(1+o(1))hm/2$ . Furthermore, set $L=\ln(m)h^{-1}$ and let $\gamma=(1-\exp(-1/2))$ . Then the r.h.s. of (10) becomes

[TABLE]

Therefore, if $L<2\gamma$ , or equivalently,

[TABLE]

Thus, a Hamming distance of at least one between $\boldsymbol{\sigma}$ and $\sigma$ immediately implies that the Hamming distance is at least $2\gamma\left({\ln k+\ln\ln k}\right)$ with probability $1-n^{-\omega(1)}$ . A union bound over all $k$ one-entries implies the proposition. ∎
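The balls-into-bins picture behind this argument can be illustrated with a short Python sketch. It is a generic experiment with uniformly thrown balls and is not tied to the specific parameters of the proof. The function names are hypothetical. The closed form $B(1-1/B)^{b}$ for the expected number of empty bins is standard.

```python
import random


def expected_empty_bins(bins, balls):
    """Expected number of empty bins after throwing balls uniformly at random."""
    return bins * (1 - 1 / bins) ** balls


def simulate_empty_bins(bins, balls, rng):
    """One run of the experiment; returns the number of bins that stay empty."""
    hit = {rng.randrange(bins) for _ in range(balls)}
    return bins - len(hit)


if __name__ == "__main__":
    rng = random.Random(1)
    bins = 100
    for balls in (100, 300, 1000):
        runs = [simulate_empty_bins(bins, balls, rng) for _ in range(400)]
        print(balls, sum(runs) / len(runs), expected_empty_bins(bins, balls))
```

Since $B(1-1/B)^{b}\approx B\exp(-b/B)$ , roughly $B\ln B$ balls are needed before no bin is expected to remain empty. This coupon collector effect is what forces the set of compensating entries to be large.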

Proof of Theorem 2.

The theorem follows directly from Propositions 7 and 11. ∎

V Empirical Analysis and Simulation Results

In this section we present simulation results for the MN-Algorithm (Algorithm 1).

Our simulation software is implemented in the C++ programming language. It performs a faithful simulation of the parallel system. To generate the random structures, we use the Mersenne Twister mt19937_64 as provided by the C++11 <random> library. All of our simulations have been carried out on machines equipped with 20 Intel(R) Xeon(R) E5-2630 v4 CPU cores, backed by 128 GiB of memory, and running the Linux 5.11 kernel. All code required to reproduce our figures, including the gnuplot scripts and various helper tools, can be obtained from our public GitHub repository.

In our first empirical result in Fig. 2 we analyze the number of queries required to reconstruct $\boldsymbol{\sigma}$ for $n\in[10^{2},10^{6}]$ and different values of $\theta$ . The dotted lines show our theoretical asymptotic bounds. Note that the discontinuities in the theoretical bound stem from rounding the number of one-entries $k$ to the closest integer. We remark that our simulation results align well with the theoretical predictions for larger values of $n$ . For smaller values of $n$ , our theoretical results are too optimistic: the lower-order term hidden in the $o(1)$ of the sufficient condition for the MN-Algorithm (Section III) scales as $\Theta\left({\frac{\sqrt{\ln n}}{k}}\right)$ , and while this expression decreases polynomially fast in $n$ , it is far from vanishing for small values of $n$ and $\theta$ .
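The leading-order theoretical bounds discussed here are easy to evaluate for concrete parameters. The Python sketch below computes the algorithmic bound of Corollary 6 and the information-theoretic threshold $m^{\text{\scriptsize BPD}}_{\text{\scriptsize para}}$ , both without the $\varepsilon$ and $o(1)$ terms, with $k$ rounded to the closest integer as described above. It is meant as a rough calculator and not as the plotting code used for Fig. 2. The function names are hypothetical.

```python
import math

GAMMA = 1 - math.exp(-0.5)


def num_ones(n, theta):
    """Number of one-entries k = n^theta, rounded to the closest integer."""
    return round(n**theta)


def mn_query_bound(n, theta):
    """Leading-order sufficient number of queries for the MN-Algorithm."""
    k = num_ones(n, theta)
    root = math.sqrt(theta)
    return 4 * GAMMA * (1 + root) / (1 - root) * k * math.log(n / k)


def info_theoretic_bound(n, theta):
    """Leading-order information-theoretic threshold 2 k ln(n/k) / ln k."""
    k = num_ones(n, theta)
    return 2 * k * math.log(n / k) / math.log(k)


if __name__ == "__main__":
    for n in (10**3, 10**4, 10**5):
        print(n, num_ones(n, 0.3), mn_query_bound(n, 0.3), info_theoretic_bound(n, 0.3))
```

For $n=1000$ and $\theta=0.3$ this gives $k=8$ and an algorithmic bound of roughly 208 queries at leading order. The rounding of $k$ is also what produces the jumps in the plotted bound.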

In Figs. 3 and 4 we analyze the success probability for exact reconstruction of $\boldsymbol{\sigma}$ and the number of correctly identified one-entries. For different numbers of queries we conducted 100 independent simulation runs for $n=10^{3}$ and $n=10^{4}$ and different values of $\theta$ . The dashed lines show the phase-transitions predicted by Theorem 1. The data in Fig. 4 indicate that all but a small fraction of one-entries are correctly detected, even if the exact reconstruction of $\boldsymbol{\sigma}$ is still quite unlikely according to Fig. 3. Overall, the implementation hints at the practical usability of the MN-Algorithm, even for small values of $n$ .

Remark.

The formal proof of the algorithmic bound directly gives an insight into the convergence speed and thus into the expected performance of the MN-Algorithm for finite $n$ : compared to the asymptotic bound for $n\rightarrow\infty$ , the MN-Algorithm requires more queries by a multiplicative factor of at least

[TABLE]

This explains the (slight) deviation between the theoretical and the empirical results for small values of $n$ . See the proof of Corollary 6 in Section III for the rigorous analysis.

VI Conclusions and Open Problems

In this paper we analyze the binary pooled data problem with additive queries both from an information-theoretic and an algorithmic point of view. Our first result is a simple greedy reconstruction scheme that performs well even close to the information-theoretic boundaries. Our main concern is the design of a reconstruction scheme that works well when all queries are conducted in parallel. In a series of simulations we show that this scheme is applicable to a large range of parameters that can be expected from real-world instances. For example, our data indicate that on average we correctly identify 99% of the one-entries when conducting only 220 queries for $n=1000$ and $\theta=0.3$ . Our second result sheds light on the information-theoretic achievability threshold, where our theorem closes the open gap between the results of [11] and [17] by establishing a sharp phase transition.

An immediate open problem is to close the gap between the algorithmic and the information-theoretic threshold. Furthermore, there are similar reconstruction problems in which the parallel execution of all queries is crucial. As discussed in the introduction, group testing is a prime example, which was recently fully understood using techniques similar to those in the present work. A less well understood reconstruction problem is threshold group testing [8, 22], in which a query outputs $1$ if and only if the number of positive entries exceeds a threshold $T>0$ . It is very likely that the techniques of the present contribution can be applied to threshold group testing as well, as they were previously applied to various reconstruction problems, but the tailor-made application remains a highly non-trivial challenge. Another exciting avenue for future research is the study of partially parallelizable designs. Suppose, for instance, that $L$ processing units can be used to evaluate queries in parallel. Then it is a natural requirement for a design to always conduct up to $L$ queries in parallel. An interesting open question is to analyze the trade-offs that arise in such partially parallelized schemes. In particular, there might be designs providing efficient reconstruction algorithms that outperform the completely parallel design studied in this paper.

Acknowledgements

The authors thank Uriel Feige for various detailed comments which improved the quality of the paper significantly. Furthermore, the authors thank Petra Berenbrink and Amin Coja-Oghlan for helpful discussions and important hints.

Bibliography


[1] A. Alaoui, A. Ramdas, F. Krzakala, L. Zdeborová, and M. I. Jordan, “Decoding from pooled data: Phase transitions of message passing,” IEEE Trans. Information Theory, vol. 65, no. 1, pp. 572–585, 2019.
[2] M. Aldridge, O. Johnson, and J. Scarlett, “Group testing: An information theory perspective,” Foundations and Trends in Communications and Information Theory, vol. 15, no. 3–4, pp. 196–392, 2019.
[3] Y. Arjoune, N. Kaabouch, H. E. Ghazi, and A. Tamtaoui, “Compressive sensing: Performance comparison of sparse recovery algorithms,” Proc. 7th IEEE CCWC, 2017.
[4] R. Ben-Ami, A. Klochendler et al., “Large-scale implementation of pooled RNA extraction and RT-PCR for SARS-CoV-2 detection,” Clinical Microbiology and Infection, vol. 26, no. 9, pp. 1248–1253, 2020.
[5] R. W. Benz, S. J. Swamidass, and P. Baldi, “Discovery of power-laws in chemical space,” Journal of Chemical Information and Modeling, vol. 48, no. 6, pp. 1138–1151, 2008.
[6] N. H. Bshouty, “Optimal algorithms for the coin weighing problem with a spring scale,” Proc. 22nd COLT, 2009.
[7] C. C. Cao, C. Li, and X. Sun, “Quantitative group testing-based overlapping pool sequencing to identify rare variant carriers,” BMC Bioinformatics, vol. 15, p. 195, 2014.
[8] C. L. Chan, S. Cai, M. Bakshi, S. Jaggi, and V. Saligrama, “Stochastic threshold group testing,” 2013 IEEE Information Theory Workshop (ITW), 2013.


OG and PL were supported by DFG CO 646/3. MHK was supported by DFG FOR 2975 and Stiftung Polytechnische Gesellschaft.
