Information-theoretic and algorithmic thresholds for group testing

Amin Coja-Oghlan; Oliver Gebhard; Max Hahn-Klimroth; Philipp Loick

arXiv:1902.02202·cs.DM·May 14, 2021


TL;DR

This paper determines the minimum number of tests needed for successful group testing using a randomized design, establishing sharp thresholds and analyzing algorithms to solve the problem efficiently.

Contribution

It precisely characterizes the information-theoretic threshold for group testing and analyzes the performance of inference algorithms, settling prior conjectures.

Findings

01

Identified sharp thresholds for the number of tests needed.

02

Analyzed the performance of two efficient inference algorithms.

03

Settled conjectures from previous studies.

Abstract

In the group testing problem we aim to identify a small number of infected individuals within a large population. We avail ourselves of a procedure that can test a group of multiple individuals, with the test result coming out positive iff at least one individual in the group is infected. With all tests conducted in parallel, what is the least number of tests required to identify the status of all individuals? In a recent test design [Aldridge et al. 2016] the individuals are assigned to test groups randomly, with every individual joining an equal number of groups. We pinpoint the sharp threshold for the number of tests required in this randomised design so that it is information-theoretically possible to infer the infection status of every individual. Moreover, we analyse two efficient inference algorithms. These results settle conjectures from [Aldridge et al. 2014, Johnson et al. …

Tables

Notation	Definition & Properties	Description
$n$		population size
$k$	$k \sim n^{θ}$ for $θ \in (0, 1)$	number of infected individuals
$m$	$m = c k \log (n / k)$	number of tests
$x_{1}, \dots, x_{n}$		variable nodes
$V = V_{n}$	$\{x_{1}, \dots, x_{n}\}$	set of all individuals
$a_{1}, \dots, a_{m}$		factor nodes
$F = F_{m}$	$\{a_{1}, \dots, a_{m}\}$	set of all tests
$Δ$	$Δ = d \log (n / k)$	tests per individual, variable node degree
$Γ_{1}, \dots, Γ_{m}$	$(\sum_{i = 1}^{m} Γ_{i}) / m = d n / (c k)$	individuals per test, factor node degree
$Γ$	$(Γ_{i})_{i \in [m]}$	$σ$-algebra generated by the random variables $(Γ_{i})_{i \in [m]}$
$𝝈 \in \{0, 1\}^{V}$	$\sum_{i = 1}^{n} 𝝈_{i} = k$	$n$-dimensional vector of Hamming weight $k$ indicating the individuals’ infection status
$𝑮 = 𝑮 (n, m, Δ)$		random bipartite graph on $n$ variable nodes, $m$ factor nodes and variable degree $Δ$
$\partial x_{i} = \partial_{𝑮} x_{i}$ for $i \in [n]$	$\partial x_{i} \subseteq F, | \partial x_{i} | = Δ$	set of tests that individual $x_{i}$ participates in under $𝑮$
$\partial a_{i} = \partial_{𝑮} a_{i}$ for $i \in [m]$	$\partial a_{i} \subseteq V, | \partial a_{i} | = Γ_{i}$	set of individuals in test $a_{i}$ under $𝑮$
$\hat{𝝈} \in \{0, 1\}^{F}$	$\hat{𝝈}_{i} = 𝟏 \{\exists x \in \partial a_{i} : 𝝈_{x} = 1\}$	$m$-dimensional vector indicating the test outcomes
$𝒎_{1}, 𝒎_{0}$	$𝒎_{1} = | \{a \in F : \hat{𝝈}_{a} = 1\} |, 𝒎_{0} = m - 𝒎_{1}$	number of positive and negative tests
$V_{0}$	$V_{0} = \{x \in V : 𝝈_{x} = 0\}$	set of healthy individuals
$V_{1}$	$V_{1} = V ∖ V_{0}, | V_{1} | = k$	set of infected individuals
$V_{0}^{+}$	$\{x \in V_{0} : \forall a \in \partial x : \hat{𝝈}_{a} = 1\}$	set of healthy individuals only included in positive tests
$V_{0}^{-}$	$V_{0}^{-} = V_{0} ∖ V_{0}^{+}$	set of healthy individuals included in at least one negative test
$V_{1}^{+}$	$\{x \in V_{1} : \forall a \in \partial x : \exists y \in \partial a ∖ \{x\} : 𝝈_{y} = 1\}$	set of infected individuals that have another infected individual in all their tests
$V_{1}^{- -}$	$\{x \in V_{1} : \exists a \in \partial x : \partial a ∖ \{x\} \subseteq V_{0}^{-}\}$	set of infected individuals that occur in at least one test with only healthy individuals
$Γ_{\min}, Γ_{\max}$	$Γ_{\min} = \min_{i \in [m]} Γ_{i}, Γ_{\max} = \max_{i \in [m]} Γ_{i}$	minimum and maximum test degree
$S_{k} (𝑮, \hat{𝝈})$	$S_{k} (𝑮, \hat{𝝈}) = \{σ \in \{0, 1\}^{V} : \forall i \in [m] : \hat{𝝈}_{a_{i}} = 𝟏 \{\exists x \in \partial a_{i} : σ_{x} = 1\}\}$	set of configurations consistent with the test results under $𝑮$
$Z_{k} (𝑮, \hat{𝝈})$	$Z_{k} (𝑮, \hat{𝝈}) = | S_{k} (𝑮, \hat{𝝈}) |$	number of configurations consistent with the test results
$Z_{k, ℓ} (𝑮, \hat{𝝈})$	$Z_{k, ℓ} (𝑮, \hat{𝝈}) = | \{σ \in S_{k} (𝑮, \hat{𝝈}) : ⟨ 𝝈, σ ⟩ = ℓ\} |$	number of configurations consistent with the test results and with overlap $ℓ$ with $𝝈$
$𝒀_{i}$ for $i \in [m]$	$𝒀_{i} = | \{x \in \partial a_{i} : 𝝈_{x} = 1\} |$	number of edges that connect test $a_{i}$ with an infected individual
$𝑿_{i}$ for $i \in [m]$	$𝑿_{i} \sim Bin (Γ_{i}, k / n)$	binomially distributed random variable with parameters $Γ_{i}$ and $k / n$
$W, W^{'}$	$W = \sum_{i = 1}^{m} 𝟏 \{𝒀_{i} = 1\}, W^{'} = \sum_{i = 1}^{m} 𝟏 \{𝑿_{i} = 1\}$	$W$ is the number of tests containing a single infected individual, $W^{'}$ is a random variable depending on $(𝑿_{i})_{i \in [m]}$
$U$	$U = | \{x \in V_{1} : \forall a_{i} \in \partial x : 𝒀_{i} > 1\} |$	number of infected individuals not adjacent to any test with precisely one infected individual
$T$	$| \{x \in V_{1} : \sum_{a \in \partial x} 𝟏 \{\partial a ∖ \{x\} \subseteq V_{0}\} < δ Δ\} |$	number of infected individuals who appear in fewer than $δ Δ$ tests as the only infected individual for some constant $δ > 0$
$R$	$R = | \{x \in V_{1} : \exists a_{i} \in \partial x : 𝒀_{i} > 1 \text{ and } \partial a_{i} ∖ \{x\} \subseteq V_{0}\} |$	number of infected individuals adjacent to some test multiple times with no other infected individual besides themselves
$𝑨_{i, j}^{'}, 𝑨_{i, j, k}^{'}$		auxiliary random variables, defined in proof of Proposition 3.1
$𝒜$	$𝒜 = \{\forall i \in [m] : \max_{j \in [Γ_{i}]} 𝑨_{i, j, 1}^{'} = \max_{j \in [Γ_{i}]} 𝑨_{i, j, 2}^{'}\}$	event that every test under the balls-and-bins experiment features the same test result
$ℰ$	$ℰ = \{\sum_{i \in [m]} 𝑿_{i} = k Δ\}$	event that the sum of $𝑿_{i}$ is exactly $k Δ$
$ℳ$		set of all indices $i \in [m]$ for which there exists precisely one $g_{i} \in [Γ_{i}]$ such that $𝑨_{i, g_{i}, 1}^{'} = 1$
$𝒩$		set of indices $i \in [m]$ such that $\max_{j \in [Γ_{i}]} 𝑨_{i, j, 1}^{'} = 0$
$ℛ$	$ℛ = \{\forall x \in V_{1} : | \{a \in \partial x : \partial a ∖ \{x\} \subseteq V_{0}\} | \geq δ Δ\}$	event that for every $x \in V_{1}$ there are at least $δ Δ$ tests $a \in \partial x$ for some $δ > 0$ such that $\partial a ∖ \{x\} \subseteq V_{0}$
$𝒮$		event that one specific $σ$ that has overlap $ℓ$ with $𝝈$ belongs to $S_{k} (𝑮, \hat{𝝈})$
$𝒯$		event that a sum of independent random variables is equal to a specific value, defined in (7)
$𝒱$	$𝒱 = \{𝒎_{1} = \frac{m}{2} (1 + o (1))\}$	event that around half of the tests are positive
$𝒲$	$𝒲 = \{| V_{0}^{+} | = (1 + o (1)) (n - k) (1 - \exp (- d / c))^{Δ}\}$	event that the size of $V_{0}^{+}$ is concentrated around its mean
$o (1), ω (1)$		$o (1)$ [$ω (1)$] denotes a term that vanishes [diverges] in the limit of large $n$
w.h.p.		probability of $1 - o (1)$ as $n \to \infty$

Equations

m_{\mathrm{inf}} = m_{\mathrm{inf}}(n, \theta)

m_{\mathrm{alg}} = m_{\mathrm{alg}}(n, \theta)

m_{\mathrm{inf}}^{\mathrm{adapt}}(n, \theta) = \frac{k \log(n/k)}{\log 2}.

V = V_{n} = \{x_{1}, \dots, x_{n}\}, \quad V_{0} = \{x_{i} \in V : \bm{\sigma}_{x_{i}} = 0\} \quad \text{and} \quad V_{1} = V \setminus V_{0}

m

\hat{\bm{\sigma}}_{a_{i}}

\mathbb{P}[\mathcal{A}(\bm{G}, \hat{\bm{\sigma}}, k) = \bm{\sigma}] = o(1).

\mathbb{P}[\mathcal{A}(\bm{G}, \hat{\bm{\sigma}}, k) = \bm{\sigma}] = 1 - o(1).

V_{0}^{+}

V_{1}^{+}

V_{0}^{-} = V_{0} \setminus V_{0}^{+} \quad \text{and} \quad V_{1}^{-} = V_{1} \setminus V_{1}^{+}

\Delta n/m - \sqrt{\Delta n/m}\,\log n \leq \Gamma_{\min} \leq \Gamma_{\max} \leq \Delta n/m + \sqrt{\Delta n/m}\,\log n.

\bm{m}_{0} = \exp(-d/c)\, m + O(\sqrt{m}\,\log^{2} n).

Z_{k,\ell}(\bm{G}, \hat{\bm{\sigma}})

\langle \bm{\sigma}, \sigma \rangle = \sum_{i=1}^{n} \bm{1}\{\bm{\sigma}_{x_{i}} = \sigma_{x_{i}} = 1\}

\mathbb{E}[Z_{k,\ell}(\bm{G}, \hat{\bm{\sigma}}) \mid \Gamma]

\mathbb{P}[\mathcal{S} \mid \Gamma]

\mathbb{P}[\bm{A}_{i,j}^{\prime} = (1, 1)]

\mathbb{P}[\bm{A}_{i,j}^{\prime} = (0, 0)]

\sum_{i=1}^{m} \sum_{j=1}^{\Gamma_{i}} \bm{1}\{\bm{A}_{i,j}^{\prime} = (1, 1)\}

\sum_{i=1}^{m} \sum_{j=1}^{\Gamma_{i}} \bm{1}\{\bm{A}_{i,j}^{\prime} = (1, 0)\}

\mathbb{P}[\mathcal{S} \mid \Gamma]

\mathcal{A} = \{\forall i \in [m] : \max_{j \in [\Gamma_{i}]} \bm{A}_{i,j,1}^{\prime} = \max_{j \in [\Gamma_{i}]} \bm{A}_{i,j,2}^{\prime}\}.

\mathbb{P}[\mathcal{S} \mid \Gamma] = \mathbb{P}[\mathcal{A} \mid \mathcal{T}, \Gamma].

\mathbb{P}[\mathcal{A} \mid \Gamma]

\mathbb{P}[\mathcal{A} \mid \mathcal{T}, \Gamma]

\mathbb{P}[\mathcal{T} \mid \Gamma] = \Omega((\Delta k)^{-3/2}).

\sum_{0 \leq \ell \leq \lceil (1 - 1/\log n) k \rceil} O((\Delta k)^{3/2}) \binom{k}{\ell} \binom{n-k}{k-\ell} \prod_{i=1}^{m} \left(1 - 2(1 - k/n)^{\Gamma_{i}} + 2(1 - 2k/n + \ell/n)^{\Gamma_{i}}\right)

E

\leq O((\Delta k)^{3/2}) \left(\frac{e}{(1-\alpha)} \frac{en}{(1-\alpha)k}\right)^{(1-\alpha)k} \left(1 - 2\left(1 - \frac{k}{n}\right)^{\Gamma_{\max}} + 2\left(1 - \frac{2k}{n} + \frac{\alpha k}{n}\right)^{\Gamma_{\min}}\right)^{m}

\displaystyle\leq O\left({\left({\Delta k}\right)^{3/2}}\right)\left(\frac{e}{(1-\alpha)}\frac{en}{(1-\alpha)k}\right)^{(1-\alpha)k}\bigg{(}1-2\left({1-\frac{k}{n}}\right)^{\frac{n\log 2}{k}\left({1+n^{-\Omega(1)}}\right)}


Full text


Amin Coja-Oghlan, Goethe University, Mathematics Institute, 10 Robert Mayer St, Frankfurt 60325, Germany.

Oliver Gebhard, Goethe University, Mathematics Institute, 10 Robert Mayer St, Frankfurt 60325, Germany.

Max Hahn-Klimroth, Goethe University, Mathematics Institute, 10 Robert Mayer St, Frankfurt 60325, Germany.

Philipp Loick, Goethe University, Mathematics Institute, 10 Robert Mayer St, Frankfurt 60325, Germany.

Abstract.

In the group testing problem we aim to identify a small number of infected individuals within a large population. We avail ourselves of a procedure that can test a group of multiple individuals, with the test result coming out positive iff at least one individual in the group is infected. With all tests conducted in parallel, what is the least number of tests required to identify the status of all individuals? In a recent test design [Aldridge et al. 2016] the individuals are assigned to test groups randomly with replacement, with every individual joining an almost equal number of groups. We pinpoint the sharp threshold for the number of tests required in this randomised design so that it is information-theoretically possible to infer the infection status of every individual. Moreover, we analyse two efficient inference algorithms. These results settle conjectures from [Aldridge et al. 2014, Johnson et al. 2019].

Supported by DFG CO 646/3 and Stiftung Polytechnische Gesellschaft. An extended abstract of this work appeared in the 2019 ICALP proceedings. A revised version is to appear in IEEE Transactions on Information Theory (Copyright (c) 2017 IEEE DOI: 10.1109/TIT.2020.3023377)

1. Introduction

1.1. Background and motivation

The group testing problem goes back to the work of Dorfman from the 1940s [24]. Among a large population a few individuals are infected with a rare disease. The objective is to identify the infected individuals effectively. At our disposal we have a testing procedure capable of not merely testing one individual, but several. The test result will be positive if at least one individual in the test group is infected, and negative otherwise; all tests are conducted in parallel. We are at liberty to assign a single individual to several test groups. The aim is to devise a test design that identifies the status of every single individual correctly while requiring as small a number of tests as possible.

A recently proposed test design allocates the individuals to tests randomly [10, 12, 13, 30, 33]. To be precise, given integers $n,m,\Delta>0$ we create a random bipartite multi-graph by choosing independently for each of the $n$ vertices $x_{1},\ldots,x_{n}$ ‘at the top’ $\Delta$ neighbours among the $m$ vertices $a_{1},\ldots,a_{m}$ ‘at the bottom’ uniformly at random with replacement. The vertices $x_{1},\ldots,x_{n}$ represent the individuals, the $a_{1},\ldots,a_{m}$ represent the test groups and an individual joins a test group iff the corresponding vertices are adjacent (see Figure 1). The wisdom behind this construction is that the expansion properties of the random bipartite graph precipitate virtuous correlations, facilitating inference. Given $n$ and (an estimate of) the number $k$ of infected individuals, what is the least $m$ for which, with a suitable choice of $\Delta$, the status of every individual can be inferred correctly from the test results with high probability?

As in many other inference problems the answer comes in two instalments. First, we might ask for what $m$ it is information-theoretically possible to detect the infected individuals. In other words, regardless of computational resources, do the test results contain enough information in principle to identify the infection status of every individual? Second, for what $m$ does this problem admit efficient algorithms?

The first main result of this paper resolves the information-theoretic question completely. Specifically, Aldridge, Johnson and Scarlett [13] obtained a function $m_{\mathrm{inf}}=m_{\mathrm{inf}}(n,k)$ such that for any fixed $\varepsilon>0$ the inference problem is information-theoretically infeasible if $m<(1-\varepsilon)m_{\mathrm{inf}}$. They conjectured that this bound is tight, i.e., that for $m>(1+\varepsilon)m_{\mathrm{inf}}(n,k)$ there is an (exponential) algorithm that correctly identifies the infected individuals with high probability. We prove this conjecture.

Furthermore, concerning the algorithmic question, Johnson, Aldridge and Scarlett [30] obtained a function $m_{\mathrm{alg}}=m_{\mathrm{alg}}(n,k)$ that exceeds $m_{\mathrm{inf}}$ by a constant factor for small $k$ such that for $m>(1+\varepsilon)m_{\mathrm{alg}}$ certain efficient algorithms successfully identify the infected individuals with high probability. They conjectured that SCOMP, their most sophisticated algorithm, actually succeeds for smaller values of $m$. We refute this conjecture and show that SCOMP asymptotically fails to outperform a much simpler algorithm called DD.

A technical novelty of the present work is that we investigate the group testing problem from a new perspective. While most prior contributions rely on elementary calculations and/or information-theoretic arguments [12, 13, 30, 39], here we bring to bear techniques from the theory of random constraint satisfaction problems [5, 32].

Indeed, group testing can be viewed naturally as a constraint satisfaction problem: the tests provide the constraints and the task is to find all possible ways of assigning a status (‘infected’ or ‘not infected’) to the $n$ individuals in a way consistent with the given test results. Since the allocation of individuals to tests is random, this question is similar in nature to, e.g., the random $k$ -SAT problem that asks for a Boolean assignment that satisfies a random collection of clauses [4, 6, 20, 23]. It also puts the group testing problem in the same framework as the considerable body of recent work on other inference problems on random graphs such as the stochastic block model (e.g., [1, 18, 22, 35, 37, 43]) or decoding from pooled data [7, 8].

We proceed to state the main results of the paper precisely, followed by a detailed discussion of the prior literature on group testing. The proofs of the information-theoretic and algorithmic bounds follow in Sections 3, 4, and 5. The technical details can be found in the appendix.

1.2. The information-theoretic threshold

Throughout the paper we labour under the assumptions commonly made in the context of group testing; we will revisit their merit in Section 1.4. Specifically, we assume that the number $k$ of infected individuals satisfies $k\sim n^{\theta}$ for a fixed $0<\theta<1$. (Footnote 1: While we write $k\sim n^{\theta}$ for the sake of brevity, our results immediately extend to the case $k\sim Cn^{\theta}$ for some constant $C$.) Moreover, let $\bm{\sigma}\in\{0,1\}^{\{x_{1},\ldots,x_{n}\}}$ be a vector of Hamming weight $k$ chosen uniformly at random. The (one-)entries of $\bm{\sigma}$ indicate which of the $n$ individuals are infected. Moreover, let $\bm{G}=\bm{G}(n,m,\Delta)$ signify the aforementioned random bipartite graph with multi-edges. Then $\bm{\sigma}$ induces a vector $\hat{\bm{\sigma}}\in\{0,1\}^{\{a_{1},\ldots,a_{m}\}}$ that indicates which of the $m$ tests come out positive. To be precise, $\hat{\bm{\sigma}}_{i}=1$ iff test $a_{i}$ is adjacent to an individual $x_{j}$ with $\bm{\sigma}_{x_{j}}=1$. For what $m$ is it possible to recover $\bm{\sigma}$ from $\bm{G},\hat{\bm{\sigma}}$? (Throughout the paper all logarithms are base $\mathrm{e}$.)
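To make the model concrete, the following minimal sketch samples the random design, a uniformly random infection vector of Hamming weight $k$, and the induced test results. The function names (`sample_design`, `sample_infected`, `compute_outcomes`) and the parameter values are illustrative assumptions and do not come from the paper.

```python
import random

def sample_design(n, m, delta, rng):
    """Each individual draws `delta` tests uniformly at random WITH replacement,
    so an individual may hit the same test twice (multi-edges)."""
    return [[rng.randrange(m) for _ in range(delta)] for _ in range(n)]

def sample_infected(n, k, rng):
    """Uniformly random 0/1 vector of length n with exactly k ones."""
    sigma = [0] * n
    for x in rng.sample(range(n), k):
        sigma[x] = 1
    return sigma

def compute_outcomes(design, sigma, m):
    """hat_sigma[a] = 1 iff test a contains at least one infected individual."""
    hat_sigma = [0] * m
    for x, tests in enumerate(design):
        if sigma[x] == 1:
            for a in tests:
                hat_sigma[a] = 1
    return hat_sigma

rng = random.Random(0)          # fixed seed, for reproducibility only
n, k, m, delta = 200, 5, 60, 4  # arbitrary toy parameters
design = sample_design(n, m, delta, rng)
sigma = sample_infected(n, k, rng)
hat_sigma = compute_outcomes(design, sigma, m)
print(sum(hat_sigma), "of", m, "tests are positive")
```

Here `design[x]` plays the role of the multi-set $\partial x$, and at most $k\Delta$ tests can come out positive.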

Theorem 1.1.

Suppose that $0<\theta<1$, $k\sim n^{\theta}$ and $\varepsilon>0$ and let

$\displaystyle m_{\mathrm{inf}}=m_{\mathrm{inf}}(n,\theta)=\max\left\{\frac{\theta}{\log^{2}2},\frac{1-\theta}{\log 2}\right\}n^{\theta}\log n.$

(i) If $m>(1+\varepsilon)m_{\mathrm{inf}}(n,\theta)$, then there exists an algorithm that given $\bm{G},\hat{\bm{\sigma}}$ outputs $\bm{\sigma}$ with high probability.

(ii) If $m<(1-\varepsilon)m_{\mathrm{inf}}(n,\theta)$, then there does not exist any algorithm that given $\bm{G},\hat{\bm{\sigma}},k$ outputs $\bm{\sigma}$ with a non-vanishing probability.

Since for $\theta\leq\log(2)/(1+\log(2))$ the first part of Theorem 1.1 readily follows from a folklore argument [25], the interesting regime is $\theta>\log(2)/(1+\log(2))\approx 0.41$ . The negative part of Theorem 1.1 strengthens a result from [13], who showed that for $m<(1-\varepsilon)m_{\mathrm{inf}}$ any inference algorithm has a strictly positive error probability. By comparison, Theorem 1.1 shows that any algorithm fails with high probability.
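The crossover value quoted above can be checked numerically. The sketch below assumes that the threshold takes the form $m_{\mathrm{inf}}(n,\theta)=\max\{\theta/\log^{2}2,(1-\theta)/\log 2\}\,n^{\theta}\log n$, which is consistent with the adaptive bound of Section 1.4 and with Proposition 2.3; the function names are hypothetical.

```python
import math

LOG2 = math.log(2)

def m_inf_constant(theta):
    """Leading constant of the (assumed) threshold, in units of n^theta * log n."""
    return max(theta / LOG2**2, (1 - theta) / LOG2)

def m_inf(n, theta):
    """Assumed first-order threshold max{...} * n^theta * log n."""
    return m_inf_constant(theta) * n**theta * math.log(n)

def crossover_theta():
    """Value of theta where theta/log^2(2) equals (1-theta)/log(2)."""
    return LOG2 / (1 + LOG2)

theta_star = crossover_theta()
print(round(theta_star, 4))  # about 0.41, as quoted in the text
for theta in (0.2, theta_star, 0.6):
    print(round(theta, 3), round(m_inf_constant(theta), 4))
```

Below the crossover the term $(1-\theta)/\log 2$ dominates, which is the regime covered by the folklore argument; above it the term $\theta/\log^{2}2$ takes over.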

But the main contribution of Theorem 1.1 is the first, positive statement. While the problem was solved for $\theta<1/3$ for a different test design [39, 40] and the case $\theta>1/2$ is easy because a plain greedy algorithm succeeds [30], the case $1/3<\theta<1/2$ proved more challenging. Only heuristic arguments predicting the result of Theorem 1.1 have been put forward for this regime so far [33]. Indeed, Aldridge et al. [12] conjectured that in this case inferring $\bm{\sigma}$ from $\bm{G},\hat{\bm{\sigma}}$ is equivalent to solving a hypergraph minimum vertex cover problem. The proof of Theorem 1.1 vindicates this conjecture. Specifically, the vertex set of the hypergraph comprises all ‘potentially infected’ individuals, i.e., those that do not appear in any negative test. The hyperedges are the neighbourhoods $\partial a_{i}$ of the positive tests $a_{i}$ in $\bm{G}$ . Exhaustive search solves this vertex cover problem in time $\exp(O(n^{\theta}\log n))$ . But how about efficient algorithms for general $\theta$ ?

1.3. Efficient algorithms for group testing

Several polynomial time group testing algorithms have been proposed. A very simple greedy strategy called DD (for ‘definitive defectives’) first labels all individuals that are members of negative test groups as uninfected. Subsequently it checks for positive tests in which all individuals but one have been identified as uninfected in the first step. Clearly, the single as yet unlabelled individual in such a test group must be infected. Up to this point all decisions made by DD are correct. But in the final step DD marks all as yet unclassified individuals as uninfected, possibly causing false negatives. In fact, the output of DD may be inconsistent with the test results, as some positive tests may fail to include an individual classified as ‘infected’. While an achievability result is known for the DD algorithm, a corollary of the work in this paper is a matching converse.
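The three steps of DD described above can be sketched as follows. This is an illustrative implementation under our own conventions (`design[x]` lists the tests of individual `x`, `hat_sigma[a]` is the result of test `a`); it is not code from the paper.

```python
def dd(design, hat_sigma):
    """Sketch of DD ('definitive defectives')."""
    n, m = len(design), len(hat_sigma)
    # Step 1: anyone who appears in a negative test is definitely uninfected.
    possible = [all(hat_sigma[a] == 1 for a in tests) for tests in design]
    # Remaining candidates per test (sets, so multi-edges are counted once).
    members = [set() for _ in range(m)]
    for x, tests in enumerate(design):
        if possible[x]:
            for a in tests:
                members[a].add(x)
    # Step 2: a positive test with exactly one remaining candidate
    # proves that this candidate is infected.
    estimate = [0] * n
    for a in range(m):
        if hat_sigma[a] == 1 and len(members[a]) == 1:
            estimate[next(iter(members[a]))] = 1
    # Step 3: everybody else is declared uninfected (possible false negatives).
    return estimate

# Individual 0 is infected and is the only candidate left in tests 0 and 1.
print(dd([[0, 1], [1, 2], [0, 2], [2, 2]], [1, 1, 0]))  # [1, 0, 0, 0]
# Two individuals share the only (positive) test: DD cannot decide and
# returns all zeros, which is inconsistent with the positive test.
print(dd([[0], [0]], [1]))  # [0, 0]
```

The second call illustrates the failure mode mentioned above: DD never produces a false positive, but it may leave a positive test unexplained.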

The more sophisticated SCOMP algorithm is roughly equivalent to the well-known greedy algorithm for the hypergraph vertex cover problem applied to the hypergraph from the previous paragraph. Specifically, in its first step SCOMP proceeds just like DD, classifying all individuals that occur in negative tests as uninfected. Then SCOMP identifies as infected all unmarked individuals that appear in at least one test whose other participants are already known to be uninfected. Subsequently the algorithm keeps picking an individual that appears in the largest number of as yet ‘unexplained’ (viz. uncovered) positive tests and marks that individual as infected, with ties broken randomly, until every positive test contains an individual classified as infected. Clearly, SCOMP may produce false positives as well as false negatives. But at least the output is consistent with the test results. Algorithm 1 summarises the procedure of SCOMP.
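A minimal sketch of the procedure just described is given below. It repeats the first two DD steps and then greedily covers the unexplained positive tests, with ties broken at random. The data layout (`design[x]` as the list of tests of individual `x`) and the function name are assumptions made for illustration; Algorithm 1 of the paper is the authoritative description.

```python
import random

def scomp(design, hat_sigma, rng):
    """Sketch of SCOMP: DD steps followed by greedy covering."""
    n, m = len(design), len(hat_sigma)
    # Step 1 (as in DD): individuals in some negative test are uninfected.
    possible = [all(hat_sigma[a] == 1 for a in tests) for tests in design]
    members = [set() for _ in range(m)]  # remaining candidates per test
    for x, tests in enumerate(design):
        if possible[x]:
            for a in tests:
                members[a].add(x)
    # Step 2: a positive test with a single remaining candidate pins it down.
    estimate = [0] * n
    for a in range(m):
        if hat_sigma[a] == 1 and len(members[a]) == 1:
            estimate[next(iter(members[a]))] = 1
    # Step 3: greedily explain the positive tests that are still uncovered.
    unexplained = {a for a in range(m)
                   if hat_sigma[a] == 1 and not any(estimate[x] for x in members[a])}
    while unexplained:
        score = {}  # number of unexplained tests each candidate would cover
        for a in unexplained:
            for x in members[a]:
                score[x] = score.get(x, 0) + 1
        if not score:  # only reachable if the input is inconsistent
            break
        best = max(score.values())
        x = rng.choice(sorted(y for y, s in score.items() if s == best))
        estimate[x] = 1
        unexplained = {a for a in unexplained if x not in members[a]}
    return estimate

print(scomp([[0, 1], [1, 2], [0, 2], [2, 2]], [1, 1, 0], random.Random(1)))
```

On consistent input every positive test ends up containing an individual marked as infected, so unlike DD the output always reproduces the observed test results.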

Analysing SCOMP has been prominently posed as an open problem in the group testing literature [9, 12, 30]. Indeed, Aldridge et al. [12] opined that “the complicated sequential nature of SCOMP makes it difficult to analyse mathematically”. On the positive side, [12] proved that SCOMP succeeds in recovering $\bm{\sigma}$ correctly given $(\bm{G},\hat{\bm{\sigma}})$ if $m>(1+\varepsilon)m_{\mathrm{alg}}(n,\theta)$ w.h.p. (Footnote 2: W.h.p. refers to a probability of $1-o(1)$ as $n\to\infty$.), where

$\displaystyle m_{\mathrm{alg}}=m_{\mathrm{alg}}(n,\theta)=\frac{\max\left\{\theta,1-\theta\right\}}{\log^{2}2}\,n^{\theta}\log n.$

However, the algorithm succeeds for a trivial reason; namely, for $m>(1+\varepsilon)m_{\mathrm{alg}}$ even DD suffices to recover $\bm{\sigma}$ w.h.p. Yet, based on experimental evidence, [12, 30] conjectured that SCOMP strictly outperforms DD. The following theorem refutes this conjecture.

Theorem 1.2.

Suppose that $0<\theta<1$ and $\varepsilon>0$ . If $m<(1-\varepsilon)m_{\mathrm{alg}}(n,\theta)$ , then given $\bm{G},\hat{\bm{\sigma}}$ w.h.p. both SCOMP and DD fail to output $\bm{\sigma}$ .

For $\theta<1/2$ the information-theoretic bound provided by Theorem 1.1 and the algorithmic bound $m_{\mathrm{alg}}$ supplied by Theorem 1.2 remain a modest constant factor apart; see Figure 2. Whether there exists an efficient algorithm for group testing that can close the gap to the information-theoretic bound has long been an open research question. A recent result by Coja-Oghlan et al. [19] shows that such a polynomial-time algorithm indeed exists. The proposed algorithm, which is inspired by the notion of spatial coupling from coding theory, is able to recover $\bm{\sigma}$ whenever $m>(1+\varepsilon)m_{\mathrm{inf}}$. Moreover, the authors prove that below the information-theoretic threshold from Theorem 1.1 no non-adaptive algorithm can succeed under any test design (not only the random regular test design considered here), thereby establishing the presence of an adaptivity gap in the group testing problem. An exciting avenue for future research is to investigate the merits of the results and techniques of this paper and [19, 28] for the noisy variant of group testing.

1.4. Discussion and related work

Dorfman’s original group testing scheme, intended to test the American army for syphilis, was adaptive. In a first round of tests each soldier would be allocated to precisely one test group. If the test result came out negative, none of the soldiers in the group were infected. In a second round the soldiers whose group tested positive would be tested individually. Of course, Dorfman’s scheme was not information-theoretically optimal. A first-order optimal adaptive scheme that involves several test stages, with the tests conducted in the present stage governed by the results from the previous stages, is known [15, 25]. In the adaptive scenario the information-theoretic threshold works out to be

$\displaystyle m_{\mathrm{inf}}^{\mathrm{adapt}}(n,\theta)=\frac{k\log(n/k)}{\log 2}.$

The lower bound, i.e., that no adaptive design gets by with $(1-\varepsilon)m_{\mathrm{inf}}^{\mathrm{adapt}}(n,\theta)$ tests, follows from a very simple information-theoretic consideration. Namely, with a total of $m$ tests at our disposal there are merely $2^{m}$ possible test outcomes, and we need this number to exceed the count ${\binom{n}{k}}$ of possible vectors $\bm{\sigma}$, i.e., $2^{m}\geq{\binom{n}{k}}$ [14].
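The counting argument is easy to evaluate exactly for concrete values. The sketch below compares the smallest $m$ with $2^{m}\geq\binom{n}{k}$ against the first-order expression $k\log(n/k)/\log 2$; the function names and the sample values of $n$ and $k$ are illustrative choices.

```python
import math

def counting_bound(n, k):
    """Smallest integer m with 2**m >= C(n, k), computed with exact integers."""
    return (math.comb(n, k) - 1).bit_length()

def adaptive_first_order(n, k):
    """First-order term k * log(n/k) / log(2)."""
    return k * math.log(n / k) / math.log(2)

for n, k in [(10**4, 10), (10**4, 100), (10**6, 1000)]:
    print(n, k, counting_bound(n, k), round(adaptive_first_order(n, k), 1))
```

Since $\binom{n}{k}\geq(n/k)^{k}$, the exact count is never below the first-order term, and the two agree to leading order when $k\sim n^{\theta}$ because $\log(n/k)\to\infty$.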

More recently there has been a great deal of interest in non-adaptive group testing, where the infection status of each individual is to be determined after just one round of tests [14, 17, 27, 33]. This is the version of the problem that we deal with in the present paper. An important advantage of the non-adaptive scenario is that tests, which may be time-consuming, can be conducted in parallel. Indeed, some of today’s most popular applications of group testing are non-adaptive such as DNA screening [17, 31, 38] or protein interaction experiments [36, 42] in computational molecular biology. The randomised test design that we deal with here is the best currently known non-adaptive design (in terms of the number of tests required).

The most interesting regime for the group testing problem is when the number $k$ of infected individuals scales as a power $n^{\theta}$ of the entire population. Mathematically this is because in the linear regime $k=\Omega(n)$ the optimal strategy is to perform $n$ individual tests [11] in order to achieve a vanishing error probability. Similarly, the case of constant $k$ has been solved for some time [41]. Thus, for $k$ linear in $n$ and $k$ constant the theory is already well established. But the sublinear case is also of practical relevance, as witnessed by Heap’s law in epidemiology [16] or biological applications [27].

Apart from the randomised test design $\bm{G}$ where each individual chooses precisely $\Delta$ tests (with replacement), the so-called Bernoulli design assigns each individual to every test with a certain probability independently. A considerable amount of attention has been devoted to this model, and its information-theoretic threshold as well as the thresholds for various algorithms have been determined [9, 10, 12, 39]. However, the Bernoulli test design, while easier to analyse, is provably inferior for $\theta>1/3$ to the test design $\bm{G}$ that we study here. This is because in the Bernoulli design there are likely quite a few individuals that participate in far fewer tests than expected due to degree fluctuations. We note that our proofs can easily be adapted to reprove the known results for the Bernoulli design. In fact, many technical parts of the proofs become significantly easier and shorter, since we can assume independence between tests, whereas the constant-column design under consideration here gives rise to subtle dependencies between the tests. A significant portion of the proofs is devoted to getting a handle on these dependencies.

1.5. Notation

Throughout the paper $\bm{G}=\bm{G}(n,m,\Delta)$ denotes the random bipartite graph that describes which individuals take part in which test groups, the vector $\bm{\sigma}\in\{0,1\}^{\left\{{x_{1},\ldots,x_{n}}\right\}}$ encodes which individuals are infected, and $\hat{\bm{\sigma}}\in\{0,1\}^{\left\{{a_{1},\ldots,a_{m}}\right\}}$ indicates the test results. Clearly, $\bm{G}$ is independent of $\bm{\sigma}$ . Moreover, $k\sim n^{\theta}$ signifies the number of infected individuals. Additionally, we write

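$\displaystyle V=V_{n}=\left\{{x_{1},\ldots,x_{n}}\right\},\qquad V_{0}=\left\{{x_{i}\in V:\bm{\sigma}_{x_{i}}=0}\right\}\qquad\text{and}\qquad V_{1}=V\setminus V_{0}$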

for the set of all individuals, the set of uninfected and infected individuals, respectively. For an individual $x\in V$ we write $\partial x$ for the multi-set of tests $a_{i}$ adjacent to $x$ with $\left|{\partial x}\right|=\Delta$. Analogously, for a test $a_{i}$ we denote by $\partial a_{i}$ the multi-set of individuals that take part in the test and $\Gamma_{i}=\left|{\partial a_{i}}\right|$. These are multi-sets since individuals are assigned to tests uniformly at random with replacement and therefore $\bm{G}$ features multi-edges w.h.p. Let $\Gamma$ be the vector $(\Gamma_{i})_{i\in[m]}$. Furthermore, all asymptotic notation refers to the limit $n\to\infty$. Thus, $o(1)$ denotes a term that vanishes in the limit of large $n$, while $\omega(1)$ stands for a function that diverges to $\infty$ as $n\to\infty$. We also let $c,d>0$ denote reals such that

$\displaystyle m=ck\log(n/k)\qquad\text{and}\qquad\Delta=d\log(n/k).$

Later, we will prove that $c,d=\Theta(1)$ as $n\to\infty$ is optimal for inference. Finally, let $\Gamma_{\min}=\min_{i\in[m]}\Gamma_{i}$, $\Gamma_{\max}=\max_{i\in[m]}\Gamma_{i}$. The following sections will outline the proofs of the information-theoretic bounds and the analysis of the SCOMP algorithm and feature the important proofs. The technical details are left to the appendix.

2. Getting started

The very first item on the agenda is to get a handle on the posterior distribution of $\bm{\sigma}$ given $\bm{G}$ and $\hat{\bm{\sigma}}$ . To this end, let $S_{k}(\bm{G},\hat{\bm{\sigma}})$ be the set of all vectors $\sigma\in\{0,1\}^{V}$ of Hamming weight $k$ such that

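$\displaystyle\hat{\bm{\sigma}}_{a_{i}}=\bm{1}\left\{{\exists x\in\partial a_{i}:\sigma_{x}=1}\right\}\qquad\text{for all }i\in[m].$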

In words, $S_{k}(\bm{G},\hat{\bm{\sigma}})$ contains the set of all vectors $\sigma$ with $k$ ones that label the individuals infected/uninfected in a way consistent with the test results, i.e. that are "satisfying sets" [12, 14]. Let $Z_{k}(\bm{G},\hat{\bm{\sigma}})=|S_{k}(\bm{G},\hat{\bm{\sigma}})|$ . The following proposition shows that the posterior of $\bm{\sigma}$ given $\bm{G},\hat{\bm{\sigma}}$ is uniform on $S_{k}(\bm{G},\hat{\bm{\sigma}})$ .

Proposition 2.1 ([10]).

For all $\tau\in\left\{{0,1}\right\}^{\left\{{x_{1},\ldots,x_{n}}\right\}}$ we have $\displaystyle{\mathbb{P}}\left[{\bm{\sigma}=\tau\mid\bm{G},\hat{\bm{\sigma}}}\right]=\frac{\bm{1}\left\{{\tau\in S_{k}(\bm{G},\hat{\bm{\sigma}})}\right\}}{Z_{k}(\bm{G},\hat{\bm{\sigma}})}.$
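For very small instances the set $S_{k}(\bm{G},\hat{\bm{\sigma}})$ can be enumerated by brute force, which makes the uniform posterior tangible. The sketch below is only an illustration: the helper names and the toy design are our own, and exhaustive enumeration is feasible only for tiny $n$ and $k$.

```python
from itertools import combinations

def outcomes(design, infected, m):
    """Test results induced by a set of infected individuals."""
    hat = [0] * m
    for x in infected:
        for a in design[x]:
            hat[a] = 1
    return hat

def satisfying_sets(design, hat_sigma, k):
    """All k-subsets of individuals that reproduce the observed test results."""
    n, m = len(design), len(hat_sigma)
    return [s for s in combinations(range(n), k)
            if outcomes(design, s, m) == hat_sigma]

# Toy design: individuals 0 and 3 take part in exactly the same tests.
design = [[0, 1], [1, 2], [0, 2], [0, 1]]
hat_sigma = outcomes(design, (0,), 3)       # ground truth: individual 0 infected
S = satisfying_sets(design, hat_sigma, 1)
posterior = {s: 1 / len(S) for s in S}      # uniform on S, as in Proposition 2.1
print(S, posterior)
```

In this example $Z_{k}=2$: the test results cannot distinguish individual 0 from individual 3, so no algorithm can identify the infected individual with probability above $1/2$.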

Adopting the jargon of the recent literature on inference problems on random graphs, we refer to Proposition 2.1 as the Nishimori identity [18, 43]. The proposition shows that apart from the actual test results, there is no further ‘hidden information’ about $\bm{\sigma}$ encoded in $\bm{G},\hat{\bm{\sigma}}$ . In particular, the information-theoretically optimal inference algorithm just outputs a uniform sample from $S_{k}(\bm{G},\hat{\bm{\sigma}})$ . In effect, we obtain the following.

Corollary 2.2.

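(1) If $Z_{k}(\bm{G},\hat{\bm{\sigma}})=\omega(1)$ w.h.p., then for any algorithm $\mathcal{A}$ we have

$\displaystyle{\mathbb{P}}\left[{\mathcal{A}(\bm{G},\hat{\bm{\sigma}},k)=\bm{\sigma}}\right]=o(1).$

(2) If $Z_{k}(\bm{G},\hat{\bm{\sigma}})=1$ w.h.p., then there is an algorithm $\mathcal{A}$ such that

$\displaystyle{\mathbb{P}}\left[{\mathcal{A}(\bm{G},\hat{\bm{\sigma}},k)=\bm{\sigma}}\right]=1-o(1).$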

Both the positive and the negative part of Corollary 2.2 assume that the precise number $k$ of infected individuals is known to the algorithm. This assumption makes the negative part stronger, but weakens the positive part. Yet we will see in due course how in the positive scenario the assumption that $k$ be known can be removed.

For the information-theoretic bound, the proof hinges on analysing the number of individuals that can be flipped without affecting the test results. We encounter two kinds of such individuals. The first kind consists of healthy individuals that only appear in positive tests and which we will denote by $V_{0}^{+}$ . In symbols,

$\displaystyle V_{0}^{+}=\left\{{x\in V_{0}:\forall a\in\partial x:\hat{\bm{\sigma}}_{a}=1}\right\}.$

Similarly, let $V_{1}^{+}$ be the set of all infected individuals $x_{i}$ such that every test in which $x_{i}$ occurs features another infected individual; in symbols,

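$\displaystyle V_{1}^{+}=\left\{{x_{i}\in V:\bm{\sigma}_{x_{i}}=1\mbox{ and for all }a\in\partial x_{i}\mbox{ there is }x_{j}\in\partial a\setminus\{x_{i}\}\mbox{ with }\bm{\sigma}_{x_{j}}=1}\right\}.$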

We think of the individuals in $V_{0}^{+}$ as the ‘potential false positives’. Indeed, if for any $x_{i}\in V_{0}^{+}$ we obtain $\bm{\sigma}^{\prime}$ from $\bm{\sigma}$ by setting $x_{i}$ to one, then $\bm{\sigma}^{\prime}$ will render the same test results as $\bm{\sigma}$ . Similarly, the individuals in $V_{1}^{+}$ are potential false negatives. For completeness, we also define $V_{0}^{-}$ and $V_{1}^{-}$ as

[TABLE]
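The verbal definitions of $V_{0}^{+}$ and $V_{1}^{+}$ translate directly into code. The sketch below is an illustration under assumed conventions (tests as lists of individual indices, a hypothetical toy instance); it only computes the two sets that are described in words above.

```python
def classify(tests, n, infected):
    """Return (V0_plus, V1_plus) for a design given as a list of tests.

    V0_plus: healthy individuals that appear in positive tests only.
    V1_plus: infected individuals x such that every test containing x
             also contains another infected individual.
    An individual that appears in no test satisfies both conditions vacuously.
    """
    infected = set(infected)
    member = {x: [t for t in tests if x in t] for x in range(n)}

    def positive(test):
        return any(y in infected for y in test)

    v0_plus = {x for x in range(n)
               if x not in infected and all(positive(t) for t in member[x])}
    v1_plus = {x for x in infected
               if all(any(y in infected and y != x for y in t) for t in member[x])}
    return v0_plus, v1_plus

# Hypothetical toy instance with infected set {0, 2}.
tests = [[0, 1, 2], [1, 2, 3], [0, 3, 4], [4, 5]]
v0_plus, v1_plus = classify(tests, n=6, infected={0, 2})
```

Here individuals 1 and 3 are potential false positives, while neither infected individual is a potential false negative because each of them is the sole infected member of some test.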

In the following, let us get a handle on the size of sets $V_{0}^{+}$ and $V_{1}^{+}$ . Specifically, we prove the following five statements.

Proposition 2.3.

Let $c,d=\Theta(1)$ . Then, the following statements hold w.h.p.

(1)

$\left|{V_{0}^{+}}\right|=(1+n^{-\Omega(1)})n\left({1-\exp(-d/c)}\right)^{\Delta}.$

(2)

If $k(1-\exp(-d/c))^{\Delta}\geq n^{\Omega(1)}$ , then $\left|{V_{1}^{+}}\right|=n^{\Omega(1)}.$

(3)

If $k(1-\exp(-d/c))^{\Delta}=o(1)$ , then $\left|{V_{1}^{+}}\right|=o(1).$

(4)

If $c<\frac{\theta}{1-\theta}\frac{1}{\log^{2}2}$ , then $\left|{V_{1}^{+}}\right|,\left|{V_{0}^{+}}\right|=n^{\Omega(1)}.$

(5)

If $c>\frac{\theta}{1-\theta}\frac{1}{\log^{2}2}$ , then $\left|{V_{1}^{+}}\right|=o(1).$
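The constant $\frac{\theta}{1-\theta}\frac{1}{\log^{2}2}$ in parts (4) and (5) can be checked numerically. Writing $k=n^{\theta}$ and $\Delta=d\log(n/k)$, the quantity $k(1-\exp(-d/c))^{\Delta}$ equals $n^{e}$ with $e=\theta+d(1-\theta)\log(1-\exp(-d/c))$. The sketch below (function names are ours, lower-order terms are ignored) evaluates this exponent, confirms that it is minimised at $d=c\log 2$, and that the minimum changes sign exactly at the stated threshold.

```python
import math

def exponent(c, d, theta):
    """Exponent e with k * (1 - exp(-d/c))**Delta = n**e,
    where k = n**theta and Delta = d * log(n/k)."""
    return theta + d * (1 - theta) * math.log(1 - math.exp(-d / c))

def threshold(theta):
    """Value of c at which the minimal exponent (attained at d = c log 2) vanishes."""
    return theta / ((1 - theta) * math.log(2) ** 2)

theta = 0.3
c_star = threshold(theta)
c_below = 0.9 * c_star
grid = [0.01 * i for i in range(1, 500)]          # candidate values of d
d_best = min(grid, key=lambda d: exponent(c_below, d, theta))
```

For `c_below` the exponent stays positive over the whole grid, which matches part (4): below the threshold no choice of $d$ prevents polynomially many flippable individuals.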

The proof of Proposition 2.3, while not fundamentally difficult, requires a bit of care because we are dealing with a random bipartite multi-graph whose (test-)degrees scale as a power of $n$ . In effect, the diameter of the bipartite graph is quite small and the neighbourhoods of different tests may have a sizeable intersection. The technical workout follows in Section B.6. In the next step, let us get a handle on the size of the test degrees.

Lemma 2.4.

With probability at least $1-o(n^{-2})$ we have

[TABLE]

The proofs of this and the subsequent elementary lemmas are included in Section B. Next, we calculate the number of positive and negative tests. Let ${\bm{m}}_{1}$ be the number of positive tests and let ${\bm{m}}_{0}$ be the number of negative tests. Clearly ${\bm{m}}_{0}+{\bm{m}}_{1}=m$ .

Lemma 2.5.

With probability at least $1-o(n^{-2})$ we have

[TABLE]

Finally, we justify that setting $c,d=\Theta(1)$ as $n\to\infty$ is optimal for inference. The fact that $c=\Theta(1)$ immediately follows from the information-theoretic counting bound, i.e., [14].

Lemma 2.6.

(1)

If $\Delta=o(\log(n/k))$ and $m=\Theta(k\log(n/k))$ , then $Z_{k}(\bm{G},\hat{\bm{\sigma}})=\omega(1)$ w.h.p.

(2)

If $\Delta=\omega(\log(n/k))$ and $m=\Theta(k\log(n/k))$ , then $Z_{k}(\bm{G},\hat{\bm{\sigma}})=\omega(1)$ w.h.p.

3. The information-theoretic upper bound

We proceed to discuss the proof of Theorem 1.1. The proof of the first, positive statement and of the second, negative statement hinge on two separate arguments. We begin with the proof of the information-theoretic upper bound which is the principal achievement of the present work. The proof rests upon techniques that have come to play an important role in the theory of random constraint satisfaction problems. Specifically, we need to show that $Z_{k}(\bm{G},\hat{\bm{\sigma}})=1$ w.h.p., i.e., that $\bm{\sigma}$ is the only assignment compatible with the test results w.h.p. We establish this result by combining two separate arguments. First, we use a moment calculation to show that w.h.p. there are no other solutions that have a small ‘overlap’ with $\bm{\sigma}$ . Then we use an expansion argument to show that w.h.p. there are no alternative solutions with a big overlap. Both these arguments are variants of the arguments that have been used to study the solution space geometry of random constraint satisfaction problems such as random $k$ -SAT or random $k$ -XORSAT [3, 4, 26], as well as the freezing thresholds of random constraint satisfaction problems [2, 34]. Yet to our knowledge these methods have thus far not been applied to the group testing problem. In this section we choose $\Delta=\lceil\frac{m}{k}\log 2\rceil$ which maximises the entropy of the test results. Formally, we define

[TABLE]

as the number of assignments $\sigma\in S_{k}(\bm{G},\hat{\bm{\sigma}})$ different from the true configuration $\bm{\sigma}$ whose overlap

[TABLE]

with $\bm{\sigma}$ is equal to $\ell$ . The following two propositions rule out assignments with a small and a big overlap, respectively. In either case we choose $\Delta=\lceil\frac{m}{k}\log 2\rceil$ to take its optimal value.

Proposition 3.1.

Let $\varepsilon>0$ and $0<\theta<1$ and assume that $m>(1+\varepsilon)m_{\mathrm{inf}}(k,\theta)$ . W.h.p. we have $Z_{k,\ell}(\bm{G},\hat{\bm{\sigma}})=0$ for all $\ell<(1-1/\log n)k$ .

Proof.

For $i\in[m]$ let $\Gamma_{i}$ be the degree of $a_{i}$ in $\bm{G}$ , i.e., the number of edges incident with $a_{i}$ ; this number may exceed the number of different individuals that participate in test $a_{i}$ as $\bm{G}$ may feature multi-edges. Let $\Gamma$ be the $\sigma$ -algebra generated by the random variables $(\Gamma_{i})_{i\in[m]}$ . Whenever we condition on $\Gamma$ , we assume that the bounds from Lemmas 2.4 and 2.5 hold. Given $\Gamma$ we can generate $\bm{G}$ from the well-known pairing model [29]. Specifically, we create a set $\left\{{x_{i}}\right\}\times[\Delta]$ of $\Delta$ clones of each individual as well as sets $\left\{{a_{i}}\right\}\times[\Gamma_{i}]$ of clones of the tests. Then we draw a perfect matching of the complete bipartite graph on the vertex sets $\bigcup_{i=1}^{n}\left\{{x_{i}}\right\}\times[\Delta]$ , $\bigcup_{i=1}^{m}\left\{{a_{i}}\right\}\times[\Gamma_{i}]$ uniformly at random. For each matching edge linking a clone of $x_{i}$ with a clone of $a_{j}$ we insert an $i$ - $j$ -edge. The resulting bipartite random multi-graph has the same distribution as $\bm{G}$ given $\Gamma$ . As an application of this observation we obtain for every integer $0\leq\ell<k$

[TABLE]
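The pairing model described above is easy to simulate. The following minimal sketch (function name, seed handling and list-based representation are our own choices) creates $\Delta$ clones per individual and $\Gamma_{j}$ clones per test, and obtains a uniformly random perfect matching by shuffling one side and pairing it position by position with the other.

```python
import random

def pairing_model(n, delta, gamma, seed=0):
    """Sample a bipartite multigraph with individual degrees delta and test degrees gamma.

    gamma[j] is the prescribed degree of test j; sum(gamma) must equal n * delta.
    Returns tests[j], the list of individuals (with multiplicity) matched to test j.
    """
    if sum(gamma) != n * delta:
        raise ValueError("degree sequences do not match")
    rng = random.Random(seed)
    individual_clones = [i for i in range(n) for _ in range(delta)]
    test_clones = [j for j, g in enumerate(gamma) for _ in range(g)]
    # Shuffling one side and zipping yields a uniformly random perfect matching.
    rng.shuffle(individual_clones)
    tests = [[] for _ in gamma]
    for i, j in zip(individual_clones, test_clones):
        tests[j].append(i)
    return tests

sample = pairing_model(n=6, delta=2, gamma=[3, 4, 5], seed=1)
```

An individual may occur twice in the same test, which reflects the multi-edges mentioned in the proof.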

To see why (4) holds we use the linearity of expectation. The product of the two binomial coefficients simply accounts for the number of assignments $\sigma$ that have overlap $\ell$ with $\bm{\sigma}$ . Hence, with $\mathcal{S}$ the event that one specific $\sigma\in\{0,1\}^{V}$ that has overlap $\ell$ with $\bm{\sigma}$ belongs to $S_{k,\ell}(\bm{G},\hat{\bm{\sigma}})$ , we need to show that

[TABLE]

By symmetry we may assume that $\bm{\sigma}_{x_{i}}=\bm{1}\{i\leq k\}$ and that $\sigma_{x_{i}}=\bm{1}\{i\leq\ell\}+\bm{1}\{k<i\leq 2k-\ell\}$ .

To establish (5) we harness the pairing model. Namely, given $\Gamma$ we can think of each test $a_{i}$ as a bin of capacity $\Gamma_{i}$ . Moreover, we think of each clone $(x_{i},h)$ , $h\in[\Delta]$ , of an individual as a ball. The ball is labelled $(\bm{\sigma}_{x_{i}},\sigma_{x_{i}})\in\{0,1\}^{2}$ . The random matching that creates $\bm{G}$ effectively tosses the $\Delta n$ balls randomly into the bins. Hence, for $i\in[m]$ and for $j\in[\Gamma_{i}]$ let us write $\bm{A}_{i,j}=(\bm{A}_{i,j,1},\bm{A}_{i,j,2})\in\{0,1\}^{2}$ for the label of the $j$ th ball that ends up in bin number $i$ . Then we are left to calculate the probability that for every test $a_{i}$ either $\bm{A}_{i,j,1}=\bm{A}_{i,j,2}=0$ for every $j\in[\Gamma_{i}]$ or there is at least one pair $(j,k)\in[\Gamma_{i}]^{2}$ such that $\bm{A}_{i,j,1}=\bm{A}_{i,k,2}=1$

[TABLE]

To calculate this probability we borrow a trick from the analysis of the random $k$ -SAT model [20]. Namely, we consider a new set of $\{0,1\}^{2}$ -valued random variables $\bm{A}_{i,j}^{\prime}=(\bm{A}_{i,j,1}^{\prime},\bm{A}_{i,j,2}^{\prime})$ such that $(\bm{A}_{i,j}^{\prime})_{i\in[m],j\in[\Gamma_{i}]}$ are mutually independent and such that

[TABLE]

for all $i,j$ . Due to their independence, these multinomially distributed random variables are much easier to handle than $\bm{A}_{i,j}$ . It will turn out that, given a (not too unlikely) event, it suffices to analyse these independent variables instead of $\bm{A}_{i,j}$ . Now, let $\mathcal{T}$ be the event that

[TABLE]

i.e., that all of the sums on the l.h.s. are precisely equal to their expected values. Then $\bm{A}^{\prime}=(\bm{A}_{i,j}^{\prime})_{i,j}$ given $\mathcal{T}$ is distributed precisely as $\bm{A}=(\bm{A}_{i,j})_{i,j}$ . Hence, (6) yields

[TABLE]

Thus, let

[TABLE]

The grand idea is now to calculate the probability ${\mathbb{P}}\left[{\mathcal{A}\mid\Gamma}\right]$ . Subsequently, we employ Bayes’ Theorem to derive a bound for the conditional probability ${\mathbb{P}}\left[{\mathcal{A}\mid\mathcal{T},\Gamma}\right]$ for which we know by the above application of the balls-into-bins principle

[TABLE]

Because the $(\bm{A}_{i,j}^{\prime})_{i,j}$ are mutually independent, we can easily compute the unconditional probability ${\mathbb{P}}\left[{\mathcal{A}\mid\Gamma}\right]$ : by inclusion/exclusion,

[TABLE]

(the probability that $\max_{j\in[\Gamma_{i}]}\bm{A}_{i,j,1}^{\prime}=\max_{j\in[\Gamma_{i}]}\bm{A}_{i,j,2}^{\prime}=1$ , i.e., both tests positive, equals one minus the probability that $\max_{j\in[\Gamma_{i}]}\bm{A}_{i,j,1}^{\prime}=0$ minus the probability that $\max_{j\in[\Gamma_{i}]}\bm{A}_{i,j,2}^{\prime}=0$ plus the probability that $\max_{j\in[\Gamma_{i}]}\bm{A}_{i,j,1}^{\prime}=\max_{j\in[\Gamma_{i}]}\bm{A}_{i,j,2}^{\prime}=0$ ; then add the probability that $\max_{j\in[\Gamma_{i}]}\bm{A}_{i,j,1}^{\prime}=\max_{j\in[\Gamma_{i}]}\bm{A}_{i,j,2}^{\prime}=0$ , i.e., both tests negative).
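The inclusion/exclusion step can be verified on a small example. The sketch below assumes, as our reading of the symmetric labelling introduced above, that each independent label equals $(1,1)$ with probability $\ell/n$, $(1,0)$ and $(0,1)$ with probability $(k-\ell)/n$ each, and $(0,0)$ otherwise; the function names and the numerical values are illustrative only. It compares the closed form for a single test of degree $\Gamma_{i}$ with an exhaustive enumeration of all label sequences.

```python
from itertools import product

def agree_probability(gamma, p1, p2, p00):
    """P[both coordinates have max 1] + P[both coordinates have max 0] for gamma i.i.d. labels.

    p1, p2: probability that the first / second coordinate of a label is 1.
    p00:    probability that a label equals (0, 0).
    """
    both_positive = 1 - (1 - p1) ** gamma - (1 - p2) ** gamma + p00 ** gamma
    both_negative = p00 ** gamma
    return both_positive + both_negative

def brute_force(gamma, q):
    """Same probability by summing over all label sequences; q maps labels to probabilities."""
    total = 0.0
    for labels in product(q, repeat=gamma):
        prob = 1.0
        for lab in labels:
            prob *= q[lab]
        if max(a for a, _ in labels) == max(b for _, b in labels):
            total += prob
    return total

# Hypothetical parameters: n = 20, k = 5, overlap l = 2, test degree 4.
n, k, l, gamma = 20, 5, 2, 4
q = {(1, 1): l / n, (1, 0): (k - l) / n, (0, 1): (k - l) / n, (0, 0): (n - 2 * k + l) / n}
closed_form = agree_probability(gamma, k / n, k / n, q[(0, 0)])
enumerated = brute_force(gamma, q)
```

Both values coincide, i.e. the per-test factor is $1-2(1-k/n)^{\Gamma_{i}}+2(1-(2k-\ell)/n)^{\Gamma_{i}}$ under the assumed label distribution.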

Finally, to deal with the conditioning we use Bayes’ rule:

[TABLE]

Since the $(\bm{A}_{i,j}^{\prime})_{i,j}$ are independent, Stirling’s formula yields

[TABLE]

A short justification can be found in Section B.1. Moreover, by definition we have ${\mathbb{P}}\left[{\mathcal{T}\mid\mathcal{A},\Gamma}\right]\leq 1$ . Hence, (5) follows from (9)–(11). To complete the proof of the proposition, we claim that

[TABLE]

To prove Equation (12), let $\alpha=\ell/k$ . Using Lemma 2.4 and recalling $m=ck\log(n/k)$ and $\Delta=d\log(n/k)$ , we find

[TABLE]

By the definition of $m>(1+\varepsilon)m_{\mathrm{inf}}$ and $\ell<\left\lceil k(1-\log^{-1}n)\right\rceil$ , we have

[TABLE]

Moreover, as $\ell<\left\lceil k(1-\log^{-1}n)\right\rceil$ we have $(1-\alpha)k=\omega(1)$ . Thus (15) implies that (14) tends to zero with $n\to\infty$ . Therefore, the proposition follows from Equations (14), (15) and Markov’s inequality.

∎

The argument from Proposition 3.1 does not extend to large overlaps (close to $k$ ) because the expression on the r.h.s. of (4) gets too large. In other words, merely computing the expected number of solutions with a given overlap does not do the trick. This ‘lottery phenomenon’ is ubiquitous in random constraint satisfaction problems: for big overlap values rare solution-rich instances drive up the expected number of solutions [4, 5]. Fortunately, we can find a remedy.

Proposition 3.2.

Let $\varepsilon>0$ and $0<\theta<1$ and assume that $m>(1+\varepsilon)m_{\mathrm{inf}}(k,\theta)$ . W.h.p. we have $Z_{k,\ell}(\bm{G},\hat{\bm{\sigma}})=0$ for all $(1-1/\log n)k\leq\ell<k$ .

In order to cope with this issue we take another leaf out of the random CSP literature [2, 34]. Namely, we show that the solution $\bm{\sigma}$ is locally rigid. That is, the expansion properties of the random bipartite graph $\bm{G}$ preclude the existence of other solutions that have a big overlap with $\bm{\sigma}$ . The following lemma holds the key to this effect.

Lemma 3.3.

For any $\varepsilon>0$ there exists $\delta=\delta(\varepsilon)>0$ such that for all $m>(1+\varepsilon)m_{\mathrm{inf}}$ the following is true. Let ${\mathcal{R}}$ be the event that for every $x_{i}$ with $\bm{\sigma}_{x_{i}}=1$ there are at least $\delta\Delta$ tests $a\in\partial x_{i}$ such that $\partial a\setminus\{x_{i}\}\subseteq V_{0}$ . Then ${\mathbb{P}}\left[{{\mathcal{R}}}\right]=1-o(1)$ .

Proof.

Let $(\bm{X}_{i})_{i\in[m]}$ be a sequence of independent ${\rm Bin}(\Gamma_{i},k/n)$ -variables as in Section 2. Also let $W=\sum_{i=1}^{m}\bm{1}\left\{{\bm{Y}_{i}=1}\right\}$ as in Section 2. Proceeding along the lines of the proof of Proposition 2.3 (see (35) in Section B.6), we obtain

[TABLE]

Let $T$ be the number of infected individuals that are the only infected individual in fewer than $\delta\Delta$ of their tests, i.e.

[TABLE]

Moreover, let $\bm{H}=\bm{H}(N,K,n^{\prime})$ be a hypergeometric random variable with parameters $N=k\Delta$ (total eligible assignments for infected individuals), $K=W$ (tests with only one infected individual) and $n^{\prime}=\Delta$ (number of tests per individual). Then the union bound over $k$ infected individuals yields

[TABLE]

Further, the Chernoff bound for the hypergeometric distribution implies

[TABLE]

Recall $\Delta=d\log(n/k)$ . Since $D_{\mathrm{KL}}\left({{{\delta}\|{1/2+o(1)}}}\right)=\delta\log\delta+(1-\delta)\log(1-\delta)+\log 2+o(1)$ and $\delta\log\delta+(1-\delta)\log(1-\delta)\nearrow 0$ as $\delta\to 0$ and $c>\frac{\theta}{(1-\theta)\log^{2}2}$ , we can choose $\delta>0$ small enough so that

[TABLE]

Finally, the assertion follows from (16)–(19). ∎
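The role of the Kullback-Leibler divergence in this proof can be illustrated numerically. The sketch below rests on our reading of the omitted displays: with $k=n^{\theta}$ and $\Delta=d\log(n/k)$, $d=c\log 2$, the union bound over $k$ infected individuals is of order $n^{e}$ with $e=\theta-c\log 2\,(1-\theta)\,D_{\mathrm{KL}}(\delta\|1/2)$, ignoring $o(1)$ terms. The function names are hypothetical.

```python
import math

def kl_bernoulli(a, b):
    """D_KL(Be(a) || Be(b)) in nats, with the convention 0 * log 0 = 0."""
    def term(x, y):
        return 0.0 if x == 0 else x * math.log(x / y)
    return term(a, b) + term(1 - a, 1 - b)

def union_bound_exponent(c, theta, delta):
    """Exponent e such that k * exp(-Delta * D_KL(delta || 1/2)) = n**e,
    assuming k = n**theta and Delta = c * log(2) * log(n/k)."""
    return theta - c * math.log(2) * (1 - theta) * kl_bernoulli(delta, 0.5)

theta = 0.3
c = 1.1 * theta / ((1 - theta) * math.log(2) ** 2)   # 10% above the threshold
small_delta = union_bound_exponent(c, theta, 0.001)
large_delta = union_bound_exponent(c, theta, 0.4)
```

For $c$ above the threshold a sufficiently small $\delta$ makes the exponent negative, whereas a large $\delta$ does not; this is why $\delta=\delta(\varepsilon)$ has to be chosen small.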

Hence, w.h.p. any infected individual appears in plenty of tests where all the other individuals are uninfected. This property causes $\bm{\sigma}$ to be locally rigid. To see why, consider the repercussions of just changing the status of a single individual $x_{i}$ from infected to uninfected. Because given ${\mathcal{R}}$ the individual $x_{i}$ appears as the only infected individual in at least $\delta\Delta$ tests, in order to maintain the same test results we will also need to flip at least one individual in each of these tests from ‘uninfected’ to ‘infected’. Since tests typically have relatively few individuals in common, the necessary number of flips from $0$ to $1$ will be $\Omega(\Delta)=\Omega(\log n)$ . But then in order to keep the total number of infected individuals constant at $k$ , we will need to perform another $\Omega(\Delta)$ flips from $1$ to $0$ . Yet given ${\mathcal{R}}$ each of these ‘second generation’ individuals that we flip from infected to uninfected is itself the only infected individual in many tests. Thus, the single flip that we started from triggers a veritable avalanche of flips, which will stop only after the overlap has dropped significantly. The next lemma formalises this intuition. The lemma shows that while the unconditional expectation of $Z_{k,\ell}(\bm{G},\hat{\bm{\sigma}})$ is ‘too big’, the conditional expectation of $Z_{k,\ell}(\bm{G},\hat{\bm{\sigma}})$ given ${\mathcal{R}}$ (as defined in Lemma 3.3) is much smaller. Let ${\bm{m}}_{0}={\bm{m}}_{0}(\bm{G},\hat{\bm{\sigma}})$ be the total number of negative tests.

Lemma 3.4.

Suppose that $(1-1/\log n)k\leq\ell<k$ and let $\Gamma_{\min}=\min_{i\in[m]}\Gamma_{i}$ , $\Gamma_{\max}=\max_{i\in[m]}\Gamma_{i}$ . Then

[TABLE]

The proof of Lemma 3.4 is somewhat subtle as we need to get a handle on the dependencies in $\bm{G}$ and is included in Section C.1. To convey the intuition behind the expression in Lemma 3.4, the term ${\binom{k}{\ell}}{\binom{n-k}{k-\ell}}$ accounts for the number of assignments $\tau\in\left\{{0,1}\right\}^{V}$ of Hamming weight $k$ whose overlap with $\bm{\sigma}$ is equal to $\ell$ . The terms thereafter capture the probability that such an assignment $\tau$ exhibits the same test results as the true configuration $\bm{\sigma}$ . The first term provides a necessary condition for a positive test under $\bm{\sigma}$ to stay positive under $\tau$ . By Lemma 3.3, we know that every infected individual shows up in at least $\delta\Delta$ tests as the only infected individual. Now, there are $k-\ell$ individuals that are infected under $\bm{\sigma}$ but healthy under $\tau$ . For any of these $\delta\Delta(k-\ell)$ tests, we need to have at least one individual that is healthy under $\bm{\sigma}$ , but infected under $\tau$ included in this test. Next, we need to ensure that any negative test under $\bm{\sigma}$ stays negative under $\tau$ . To this end, every individual included in a negative test under $\bm{\sigma}$ , of which there are at least $\Gamma_{\min}{\bm{m}}_{0}$ counted with multiplicity, must be healthy under $\tau$ . The second term captures this probability.

Proof of Proposition 3.2.

In order to establish the proposition it suffices to show that there is $\varepsilon^{\prime}\leq(1-1/\log(n))k$ such that

[TABLE]

Starting from the expression in Lemma 3.4, setting $\alpha=\ell/k$ and recalling $m=ck\log(n/k)$ and $\Delta=d\log(n/k)$ , we obtain

[TABLE]

As long as $1-\alpha=o(1)$ , we find

[TABLE]

Moreover, $(1-\alpha)k\geq 1$ . Thus, the expression (23) is of order

[TABLE]

Since (24) holds for any constant $c>0$ and any value of $\alpha$ s.t. $1-\alpha=o(1)$ , it also holds for $\alpha\geq 1-1/\log n$ . Consequently (21) is established w.h.p. ∎

Propositions 3.1 and 3.2 readily imply that $Z_{k}(\bm{G},\hat{\bm{\sigma}})=1$ w.h.p. if $m>(1+\varepsilon)m_{\mathrm{inf}}(k,\theta)$ . Hence, Corollary 2.2 shows that there exists an inference algorithm that given $\bm{G},\hat{\bm{\sigma}}$ and $k$ outputs $\bm{\sigma}$ w.h.p. Up to now, the algorithm relies on exactly knowing the number of infected individuals $k$ , which in practice could be rather difficult to learn. Fortunately, this assumption can be removed. Namely, the following proposition shows that w.h.p. there is no assignment $\sigma$ that is compatible with the test results and that has Hamming weight less than $k$ .

Proposition 3.5.

Let $\varepsilon>0$ and $0<\theta<1$ and assume that $m>(1+\varepsilon)m_{\mathrm{inf}}(k,\theta)$ . W.h.p. we have $\sum_{k^{\prime}<k}Z_{k^{\prime}}(\bm{G},\hat{\bm{\sigma}})=0$ .

Proof.

To get started, suppose that $0<\theta<1$ and $c<\log^{-2}2$ . We claim that for any value of $d>0$ , $\left|{V_{0}^{+}}\right|\geq k\log n$ w.h.p. Indeed, from Proposition 2.3(1), we know that

[TABLE]

Recalling $\Delta=d\log(n/k)$ , the expression takes the minimum at $d=c\log 2$ . It follows that

[TABLE]

If $c=(1-\varepsilon)\log^{-2}2$ for $\varepsilon>0$ , then

[TABLE]

Now, the following two statements establish that if there does not exist a second satisfying set of Hamming weight $k$ , then w.h.p. there is also no satisfying set of smaller Hamming weight.

First, we claim that if $m>(1+\epsilon)m_{\mathrm{inf}}(k,\theta)$ , w.h.p. there does not exist a satisfying configuration with Hamming weight smaller than the correct configuration, where the set of infected individuals is not a subset of the true set of infected individuals. To see why, suppose there existed a satisfying configuration with a smaller Hamming weight, whose infected individuals are not a subset of the true infected individuals. By (25), we know that $\left|{V_{0}^{+}}\right|\gg k$ for $m<(1-\epsilon)m_{\mathrm{alg}}$ w.h.p. Therefore, we could construct a satisfying configuration of identical Hamming weight as the true configuration by flipping individuals in $V_{0}^{+}$ from healthy to infected. Observe that by the definition of $V_{0}^{+}$ , flipping individuals in $V_{0}^{+}$ does not change the test result. Therefore, we would be left with a second satisfying configuration of identical Hamming weight as the true configuration, a contradiction to Propositions 3.1 and 3.2.

Second, we argue that if $m>(1+\epsilon)m_{\mathrm{inf}}(k,\theta)$ , w.h.p. there does not exist a satisfying configuration with Hamming weight smaller than the correct configuration, where the set of infected individuals is a subset of the true set of infected individuals. Suppose there existed a satisfying configuration with a smaller Hamming weight, whose infected individuals are a subset of the true infected individuals. Then, the true configuration would need to contain individuals in $V_{1}^{+}$ , which can be flipped from infected to healthy without affecting the test result. However, Proposition 2.3(5) shows that for $m>(1+\epsilon)m_{\mathrm{inf}}$ , $V_{1}^{+}=\emptyset$ w.h.p. ∎

As an immediate consequence of Proposition 3.5 we conclude that for $m>(1+\varepsilon)m_{\mathrm{inf}}(k,\theta)$ the problem of inferring $\bm{\sigma}$ boils down to a minimum vertex cover problem, as previously conjectured by Aldridge, Baldassini and Johnson [12]. Namely, let $\mathcal{P}$ be the set of all positive tests, i.e., all tests $a_{i}$ , $i\in[m]$ , with $\hat{\bm{\sigma}}_{a_{i}}=1$ . Moreover, let $V^{+}$ be the set of all variables $x_{i}\in V$ such that $\partial x_{i}\subseteq\mathcal{P}$ ; in words, $x_{i}$ takes part in positive tests only. We set up a hypergraph $\bm{H}$ with vertex set $V^{+}$ and hyperedges $\partial a_{i}\cap V^{+}$ , $a_{i}\in\mathcal{P}$ . Clearly, the set of all individuals $x_{i}$ with $\bm{\sigma}_{x_{i}}=1$ provides a valid vertex cover of $\bm{H}$ (as any positive test must feature an infected individual). Conversely, Propositions 3.1 and 3.2 show that w.h.p. this is the unique vertex cover of size $k$ , and Proposition 3.5 shows that there is no strictly smaller vertex cover w.h.p. Therefore, w.h.p. we can infer $\bm{\sigma}$ even without prior knowledge of $k$ by way of solving this minimum vertex cover instance.
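The reduction to minimum vertex cover can be sketched as follows. This is a brute-force illustration under our own conventions (tests as lists of indices, hypothetical function name), not an efficient decoder: it builds $V^{+}$, restricts every positive test to $V^{+}$, and returns all covers of minimum size without being told $k$.

```python
from itertools import combinations

def decode_min_cover(tests, results, n):
    """Return all minimum-size vertex covers of the hypergraph on V^+ (exponential time)."""
    negative_members = set().union(*[set(t) for t, r in zip(tests, results) if r == 0])
    candidates = [x for x in range(n) if x not in negative_members]       # V^+
    edges = [set(t) & set(candidates) for t, r in zip(tests, results) if r == 1]
    for size in range(len(candidates) + 1):
        covers = [set(c) for c in combinations(candidates, size)
                  if all(e & set(c) for e in edges)]
        if covers:
            return covers
    return []  # some positive test has no candidate left: inconsistent input

# Hypothetical toy instance generated by the infected set {1, 3}.
tests = [[0, 1], [1, 2], [2, 3], [3, 4, 5], [0, 5]]
results = [1, 1, 1, 1, 0]
covers = decode_min_cover(tests, results, n=6)
```

In this instance the minimum cover is unique and coincides with the infected set, which is the situation that Propositions 3.1, 3.2 and 3.5 guarantee w.h.p. above the threshold.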

4. The information-theoretic lower bound

We proceed with the negative statement that w.h.p. $\bm{\sigma}$ cannot be inferred if $m<(1-\varepsilon)m_{\mathrm{inf}}$ . In light of Corollary 2.2 in order to prove the first part of Theorem 1.1 we need to show that the number $Z_{k}(\bm{G},\hat{\bm{\sigma}})$ of assignments consistent with the test results $\hat{\bm{\sigma}}$ is unbounded w.h.p. The proof of this fact is based on a very simple idea: we just identify a moderately large number of individuals whose infection status could be flipped without affecting the test results. The following lemma yields a bound on $m$ below which the number of such potential false positives ( $\left|{V_{0}^{+}}\right|$ ) and negatives ( $\left|{V_{1}^{+}}\right|$ ) abound.

Proposition 4.1.

Let $\varepsilon>0$ and $0<\theta<1$ and assume that

[TABLE]

Then for any choice of $\Delta$ we have $|V_{0}^{+}|,|V_{1}^{+}|=n^{\Omega(1)}$ w.h.p.

Proof.

Thanks to Lemma 2.6 we may assume that $\Delta=d\log(n/k)$ for a constant $d$ , as this choice minimizes the number of individuals in $V_{1}^{+}$ . Then Proposition 2.3(4) guarantees that for every such constant, as long as $c<\frac{\theta}{1-\theta}\frac{1}{\log^{2}2}$ , there are $n^{\Omega(1)}$ individuals in both $V_{1}^{+}$ and $V_{0}^{+}$ , which yields Proposition 4.1. ∎

As an immediate application we obtain the following information-theoretic lower bound.

Corollary 4.2.

Let $\varepsilon>0$ and $0<\theta<1$ and assume that

[TABLE]

Then $Z_{k}(\bm{G},\hat{\bm{\sigma}})=\omega\left({1}\right)$ w.h.p.

Proof.

We need to exhibit alternative vectors $\bm{\sigma}^{\prime}\in\left\{{0,1}\right\}^{V}$ with Hamming weight $k$ that render the same test results as $\bm{\sigma}$ . Thus, pick any $x_{i}\in V_{0}^{+}$ and any $x_{j}\in V_{1}^{+}$ and obtain $\bm{\sigma}^{\prime}$ from $\bm{\sigma}$ by setting $\bm{\sigma}^{\prime}_{x_{i}}=1$ and $\bm{\sigma}^{\prime}_{x_{j}}=0$ . By construction, $\bm{\sigma}^{\prime}$ has Hamming weight $k$ and renders the same test results. Hence, Proposition 4.1 shows that $Z_{k}(\bm{G},\hat{\bm{\sigma}})\geq|V_{0}^{+}\times V_{1}^{+}|=\Omega(n^{2\theta})\gg 1$ w.h.p. ∎
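The swap construction in this proof can be replayed on a toy instance. The sketch below is illustrative (hypothetical instance and helper names): it computes $V_{0}^{+}$ and $V_{1}^{+}$ from their verbal definitions, forms every swap $(x_{i},x_{j})\in V_{0}^{+}\times V_{1}^{+}$, and records the resulting configurations.

```python
def outcomes(tests, infected):
    """Test results under the given infected set."""
    return [int(any(x in infected for x in t)) for t in tests]

def flippable(tests, n, infected):
    """Return (V0_plus, V1_plus) according to the verbal definitions of Section 2."""
    pos = [any(x in infected for x in t) for t in tests]
    v0_plus = {x for x in range(n) if x not in infected
               and all(p for t, p in zip(tests, pos) if x in t)}
    v1_plus = {x for x in infected
               if all(any(y in infected and y != x for y in t) for t in tests if x in t)}
    return v0_plus, v1_plus

# Hypothetical toy instance: individuals 0 and 1 are infected and always tested together.
tests = [[0, 1, 2, 6], [0, 1, 3, 6], [2, 4], [3, 5]]
truth = {0, 1}
v0_plus, v1_plus = flippable(tests, 7, truth)
# Each pair (i, j) yields a different configuration of the same Hamming weight.
alternatives = [(truth - {j}) | {i} for i in v0_plus for j in v1_plus]
```

Every element of `alternatives` has weight $k$ and reproduces the original test results, so $Z_{k}\geq 1+|V_{0}^{+}|\cdot|V_{1}^{+}|$ in this example.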

The bound (26) matches $m_{\mathrm{inf}}$ for $\theta\gtrapprox 0.41$ . A simpler, purely information-theoretic argument covers the remaining $\theta$ .

Proposition 4.3.

Let $\varepsilon>0$ , $0<\theta<1$ . If $m<\frac{1-\varepsilon}{\log 2}n^{\theta}(1-\theta)\log n$ , then $Z_{k}(\bm{G},\hat{\bm{\sigma}})=\omega\left({1}\right)$ w.h.p.

Proof.

This proposition follows from the classical information-theoretic lower bound for the group testing problem. Namely, $m$ tests allow for $2^{m}$ possible test results. Hence, if

[TABLE]

then the number of possible test results is far smaller than the number of vectors $\bm{\sigma}\in\left\{{0,1}\right\}^{V}$ with Hamming weight $k$ . Therefore, w.h.p. there exists an unbounded number of vectors of Hamming weight $k$ that render the same test results as $\bm{\sigma}$ . ∎
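The counting bound is simple to evaluate exactly for concrete parameters. The sketch below (function name is ours) returns the smallest $m$ with $2^{m}\geq\binom{n}{k}$ using integer arithmetic and compares it with the leading-order term $k\log_{2}(n/k)$, which is a lower bound because $\binom{n}{k}\geq(n/k)^{k}$.

```python
import math

def counting_bound(n, k):
    """Smallest m such that 2**m >= C(n, k), computed with exact integer arithmetic."""
    return (math.comb(n, k) - 1).bit_length()

n, k = 1000, 10
m_min = counting_bound(n, k)
leading_term = k * math.log2(n / k)   # equals (1 / log 2) * k * log(n / k)
```

With fewer than `m_min` tests at least two configurations of weight $k$ must share a result vector, whatever the design.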

We thus conclude that for all $0<\theta<1$ , w.h.p. $Z_{k}(\bm{G},\hat{\bm{\sigma}})=\omega(1)$ if $m<(1-\varepsilon)m_{\mathrm{inf}}$ . Therefore, the desired information-theoretic lower bound follows from Corollary 2.2.

5. The SCOMP algorithm

For $\theta\geq 1/2$ we have $m_{\mathrm{alg}}=m_{\mathrm{inf}}$ and thus Theorem 1.1 implies that SCOMP as described in Section 1.3 w.h.p. fails to infer $\bm{\sigma}$ for $m<(1-\varepsilon)m_{\mathrm{alg}}$ . Therefore, we are left to establish Theorem 1.2 for $\theta<1/2$ , in which case

[TABLE]

The proof of Theorem 1.2 for $\theta<1/2$ hinges on two propositions. First we show that below $m_{\mathrm{alg}}$ , the set $V_{1}^{--}$ of infected individuals that the second step of SCOMP identifies correctly is empty. Formally, with $V_{0}^{-}$ from (3), let

[TABLE]

Proposition 5.1.

Suppose that $0<\theta<1/2$ and $\varepsilon>0$ . If $m<(1-\varepsilon)m_{\mathrm{alg}}$ , then for all $\Delta>0$ we have $V^{--}_{1}(\bm{G},\hat{\bm{\sigma}}^{*})=\emptyset$ w.h.p.

The proofs of Propositions 5.1 and 5.2 are based on moment calculations that turn out to be mildly subtle due to the potentially very large degrees of the underlying graph $\bm{G}$ . The technical workout is included in Sections D.1 and D.2.

With the second step of SCOMP failing to ‘explain’ (viz. cover) any positive tests, the greedy vertex cover algorithm takes over. This algorithm is applied to the hypergraph whose vertices are the as yet unclassified individuals and whose edges are the neighbourhoods of the positive tests. Our second proposition shows that the set $V_{0}^{+,\Delta}$ of potentially false positive individuals $x\in V_{0}^{+}$ that participate in the maximum number $\Delta$ of different tests is far larger than the actual number $k$ of infected individuals. Formally, let

[TABLE]

Proposition 5.2.

Suppose that $0<\theta<1/2$ and $\varepsilon>0$ . If $m<(1-\varepsilon)m_{\mathrm{alg}}$ , then for $\Delta=d\log\left({n/k}\right)$ for all constant $d$ we have $\left|{V_{0}^{+,\Delta}}\right|\geq k\log n$ w.h.p.

We complete the proof of Theorem 1.2 as follows.

Proof of Theorem 1.2.

The first step of SCOMP (correctly) marks all individuals that appear in negative tests as healthy. Moreover, Proposition 5.1 implies that the second step of SCOMP is void w.h.p., because there is no single infected individual that appears in a test whose other individuals have already been identified as healthy by the first step. Consequently, SCOMP simply applies the greedy vertex cover algorithm. Now, thanks to Proposition 5.2 it suffices to prove that SCOMP will fail w.h.p. if $\left|{V_{0}^{+,\Delta}}\right|=\omega\left({k}\right)$ . Because they belong to positive tests only, all the individuals of $V_{0}^{+,\Delta}$ are present in the vertex cover instance that SCOMP attempts to solve. Moreover, in the hypergraph no vertex has degree greater than $\Delta$ , because the degrees of $x_{1},\ldots,x_{n}$ in $\bm{G}$ are equal to $\Delta$ . (Some of the hypergraph degrees may be strictly smaller than $\Delta$ because $\bm{G}$ is a multi-graph.) Therefore, since $|V_{0}^{+,\Delta}|\geq k\log n$ while the actual set of infected individuals only has size $k$ , w.h.p. the individual classified as infected by the very first step of the greedy set cover algorithm belongs to $V^{+}_{0}$ . Hence, this individual is not actually infected, i.e., SCOMP errs w.h.p. ∎
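The three steps referred to in this proof can be sketched in a few lines. The code below is our own minimal rendering of SCOMP as described in the text (mark members of negative tests as healthy, declare individuals infected that are the only remaining candidate of a positive test, then cover greedily); the tie-breaking rule by smallest index and the toy instances are assumptions made for illustration.

```python
def scomp(tests, results, n):
    """Sketch of SCOMP; ties in the greedy step are broken by smallest index (an assumption)."""
    tests = [set(t) for t in tests]
    # Step 1: everybody in a negative test is healthy.
    healthy = set()
    for t, r in zip(tests, results):
        if r == 0:
            healthy |= t
    possible = set(range(n)) - healthy
    positives = [t & possible for t, r in zip(tests, results) if r == 1]
    # Step 2: a positive test with a single remaining candidate pins that candidate down.
    infected = {next(iter(t)) for t in positives if len(t) == 1}
    # Step 3: greedily cover the positive tests that are still unexplained.
    unexplained = [t for t in positives if not (t & infected)]
    while unexplained:
        candidates = sorted(possible - infected)
        if not candidates:
            break  # inconsistent input
        best = max(candidates, key=lambda x: sum(x in t for t in unexplained))
        if not any(best in t for t in unexplained):
            break  # inconsistent input
        infected.add(best)
        unexplained = [t for t in unexplained if best not in t]
    return infected

# Instance where step 2 fires: truth {1, 3} is recovered.
easy = scomp([[0, 1], [1, 2], [2, 3], [3, 4, 5], [0, 5]], [1, 1, 1, 1, 0], 6)
# 2-regular instance with truth {1, 2}: step 2 is void and the greedy step
# first picks the healthy individual 0, which appears in positive tests only.
hard = scomp([[1, 0, 4], [2, 0, 5], [1, 2, 3], [3, 4, 5]], [1, 1, 1, 0], 6)
```

The second instance mirrors the mechanism of the proof on a small scale: once the second step is void, a potential false positive of maximum degree can be selected before any truly infected individual.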

Since the success probability of the SCOMP algorithm is at least as high as that of the DD algorithm, we can prove the conjecture of [30] regarding the upper bound of the DD algorithm.

Corollary 5.3.

If $m<(1-\varepsilon)m_{\mathrm{alg}}$ , the DD algorithm will fail to retrieve the correct set of infected individuals w.h.p.

Acknowledgment

We thank Arya Mazumdar for bringing the group testing problem to our attention.

Appendix A Notation

The following sections contain the proofs of the lemmas omitted so far.

Appendix B Preliminaries

B.1. Preliminaries

We start by stating the Chernoff bound as applied in this paper.

Lemma B.1 (Chernoff bound, [29] (Section 2.1)).

Let $\bm{X}\sim{\rm Bin}(n,p)$ be a binomially-distributed random variable with $\lambda=\mathbb{E}[\bm{X}]$ . Further, let

[TABLE]

Then for any $t\geq 0$ ,

[TABLE]

As an application, we readily find

[TABLE]

Next, we justify that the Stirling approximation of Section 3 is accurate. Namely, let $\bm{A}_{i,j}^{\prime}=(\bm{A}_{i,j,1}^{\prime},\bm{A}_{i,j,2}^{\prime})$ be $\{0,1\}^{2}$ -valued random variables such that $(\bm{A}_{i,j}^{\prime})_{i\in[m],j\in[\Gamma_{i}]}$ are mutually independent and such that

[TABLE]

for all $i,j$ . As before, we denote by $\mathcal{T}$ the event that

[TABLE]

i.e., that all of the sums on the l.h.s. are precisely equal to their expected values. Since the $(\bm{A}_{i,j}^{\prime})_{i,j}$ are independent, Stirling’s formula yields

[TABLE]

This can be seen as follows. For the sake of brevity, define

[TABLE]

As $\bm{A}_{i,j}^{\prime}$ is a family of independent multinomial variables

[TABLE]

we find

[TABLE]

Hence, the probability of the event $\mathcal{T}$ is the probability that $\bm{X}$ hits its expectation. Thus, using the very basic approximation $n!=\Theta\left({\sqrt{n}}\right)(n/\mathrm{e}{})^{n}$ we find

[TABLE]

where (29) follows immediately from $\ell\leq k=o(n)$ and directly implies (28). In due course we apply similar calculations often; some of them involve conditional probabilities. These conditions only restrict $\Gamma_{i}$ to take specific (common) values, and the above argument is clearly invariant under different values of $\Gamma_{i}$ , as long as $\sum_{i=1}^{m}\Gamma_{i}=n\Delta$ .
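The approximation $n!=\Theta(\sqrt{n})(n/\mathrm{e})^{n}$ used above can be checked numerically. The short sketch below (function name is ours) evaluates the ratio $n!/(\sqrt{n}(n/\mathrm{e})^{n})$ in log-space to avoid overflow; by Stirling's formula it stays bounded and converges to $\sqrt{2\pi}$.

```python
import math

def stirling_ratio(n):
    """n! / (sqrt(n) * (n / e)**n), computed via lgamma to avoid overflow."""
    log_ratio = math.lgamma(n + 1) - 0.5 * math.log(n) - n * (math.log(n) - 1)
    return math.exp(log_ratio)

ratios = [stirling_ratio(n) for n in (1, 10, 1000, 10 ** 6)]
```

The bounded ratio is what turns the probability of hitting an expectation exactly into a polynomial factor rather than an exponential one.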

B.2. Getting started

In the next step, recall that neighbourhoods of different tests in the random multi-graph intersect sizeably. To cope with the ensuing correlations, we introduce a new family of random variables that, as we will see, are closely related to the statistics of the appearances of infected/uninfected individuals in the various tests. Specifically, recalling that $\Gamma_{i}$ signifies the degree of test $a_{i}$ and that $\sum_{i=1}^{m}\Gamma_{i}=n\Delta$ , let $(\bm{X}_{i})_{i\in[m]}$ be a sequence of independent ${\rm Bin}(\Gamma_{i},k/n)$ -variables. Moreover, let

[TABLE]

Because the $\bm{X}_{i}$ are mutually independent, Stirling’s formula shows that

[TABLE]

which follows along the lines of Section B.1. Additionally, let $\bm{Y}_{i}$ be the number of edges that connect test $a_{i}$ with an infected individual. (Since $\bm{G}$ is a multi-graph, it is possible that an infected individual contributes more than one to $\bm{Y}_{i}$ .) Further, let $\Gamma$ be the $\sigma$ -algebra generated by the random variables $(\Gamma_{i})_{i\in[m]}$ . Whenever we condition on $\Gamma$ , we assume that the bounds from Lemmas 2.4 and 2.5 hold.

Lemma B.2.

Given $\Gamma$ , the vectors $(\bm{Y}_{1},\ldots,\bm{Y}_{m})$ and $(\bm{X}_{1},\ldots,\bm{X}_{m})$ given ${\mathcal{E}}$ are identically distributed.

Proof.

For any integer sequence $(y_{i})_{i\in[m]}$ with $y_{i}\geq 0$ and $\sum_{i\in[m]}y_{i}=k\Delta$ we have

[TABLE]

Hence, for any sequences $(y_{i}),(y_{i}^{\prime})$ we obtain

[TABLE]

as claimed. ∎

B.3. Proof of Lemma 2.4

Since each variable draws a sequence of $\Delta$ tests uniformly at random, for every $i\in[m]$ the degree $\Gamma_{i}$ has distribution ${\rm Bin}(n\Delta,1/m)$ . Therefore, the assertion follows from the Chernoff bound.

B.4. Proof of Lemma 2.5

Let ${\bm{m}}_{0}^{\prime}=\sum_{i=1}^{m}\bm{1}\left\{{\bm{X}_{i}=0}\right\}$ . Then $\mathbb{E}[{\bm{m}}_{0}^{\prime}]=\sum_{i=1}^{m}{\mathbb{P}}\left[{{\rm Bin}(\Gamma_{i},k/n)=0}\right]=\sum_{i=1}^{m}(1-k/n)^{\Gamma_{i}}$ . Hence, Lemma 2.4 shows that with probability $1-o(n^{-2})$ ,

[TABLE]

Because the $\bm{X}_{i}$ are mutually independent, ${\bm{m}}_{0}^{\prime}$ is a binomial variable. Therefore, the Chernoff bound (e.g. Lemma B.1) shows that

[TABLE]

Finally, the assertion follows from (30), (31)–(34) and Lemma B.2.
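To get a feeling for the quantity $\sum_{i}(1-k/n)^{\Gamma_{i}}$ used above, the sketch below evaluates it for a hypothetical regular degree sequence and compares it with $m\exp(-d/c)$, using that $k\Delta/m=d/c$ when $\Delta=d\log(n/k)$ and $m=ck\log(n/k)$. All numerical values are illustrative assumptions.

```python
from math import exp

def expected_zero_tests(gamma, k, n):
    """E[m_0'] = sum_i (1 - k/n)^{Gamma_i} for independent
    Bin(Gamma_i, k/n) variables X_i."""
    return sum((1 - k / n) ** g for g in gamma)

# Hypothetical regular degree sequence: every test has degree n * delta / m.
n, k, m, delta = 10_000, 50, 400, 16
gamma = [n * delta // m] * m
exact = expected_zero_tests(gamma, k, n)
approx = m * exp(-k * delta / m)    # m * exp(-d/c), since k * delta / m = d/c
print(exact, approx)
```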

B.5. Proof of Lemma 2.6

The expected degree of a test $a_{i}$ equals $\Delta n/m$ . Therefore, if $\Delta=o(\log(n/k))$ , then by Lemma 2.5, ${\bm{m}}_{1}=o(m)$ w.h.p. To exploit this fact, call $\sigma\in\left\{{0,1}\right\}^{V}$ of Hamming weight $k$ bad for $\bm{G}$ if given $\bm{\sigma}=\sigma$ we indeed have ${\bm{m}}_{1}=o(m)$ . Let $B(\bm{G})$ be the set of all such bad $\sigma$ . Then w.h.p. $\bm{G}$ has the property that $|B(\bm{G})|\sim\binom{n}{k}$ , i.e. asymptotically most configurations will have few positive tests. Now, condition on the event that $|B(\bm{G})|\sim\binom{n}{k}$ and let $\mathcal{B}$ be the set of all subsets of $[m]$ of size $o(m)$ . Further, let $f_{\bm{G}}:B(\bm{G})\to\mathcal{B}$ map $\sigma\in B(\bm{G})$ to the corresponding set of positive tests. Finally, let $B^{\prime}(\bm{G})$ be the set of all $\sigma\in B(\bm{G})$ such that $|f_{\bm{G}}^{-1}(f_{\bm{G}}(\sigma))|<n$ , i.e. the set of all configurations for which there are fewer than $n$ other configurations rendering the same test results. Then

[TABLE]

Consequently, w.h.p. over the choice of $\bm{G}$ and $\bm{\sigma}$ we have $Z_{k}(\bm{G},\hat{\bm{\sigma}})\geq n$ . The same argument applies for $\log(n/k)=o(\Delta)$ with the term ‘positive test’ replaced by ‘negative test’.

B.6. Proof of Proposition 2.3

We start by proving part (1) using a straightforward second-moment calculation. Recall $\Delta=d\log(n/k)$ and $m=ck\log(n/k)$ . Lemma 2.4 and Lemma 2.5 show that with probability at least $1-o(n^{-2})$ the total degree of the negative tests comes to

[TABLE]

Consequently, with probability at least $1-o(n^{-2})$ the total number of edges between $V_{0}$ and the set of positive tests is $\Delta n\left({1-\exp(-d/c)+n^{-\Omega(1)}}\right)$ . Moreover, the total number of edges between $V_{0}$ and all tests comes down to $\Delta(n-k)$ . Given these events and since each individual is assigned to tests uniformly at random with replacement, the probability that a given $x\in V_{0}$ belongs to $V_{0}^{+}$ comes out as

[TABLE]

Next, we estimate the probability that $x,x^{\prime}\in V_{0}$ both belong to $V_{0}^{+}$ :

[TABLE]

Hence, $\mathbb{E}[|V_{0}^{+}{}|^{2}\mid\Gamma]-\mathbb{E}[|V_{0}^{+}{}|\mid\Gamma]^{2}=O(n^{2-\Omega(1)})$ . Therefore, the assertion follows from Chebyshev’s inequality.
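The first-moment heuristic behind part (1) can be explored with a small simulation. The sketch below reads $V_{0}^{+}$ as the set of healthy individuals all of whose tests are positive (an interpretation of the surrounding text) and compares the observed count with $(n-k)(1-\exp(-d/c))^{\Delta}$, where $d/c=k\Delta/m$. Parameter values are illustrative and far from the asymptotic regime.

```python
import random
from math import exp

def simulate_v0_plus(n, k, m, delta, rng):
    """Random design: every individual draws delta tests with replacement.
    Returns the number of healthy individuals whose tests are all positive."""
    infected = set(rng.sample(range(n), k))
    tests_of = [[rng.randrange(m) for _ in range(delta)] for _ in range(n)]
    positive = [False] * m
    for x in infected:
        for a in tests_of[x]:
            positive[a] = True
    return sum(
        1
        for x in range(n)
        if x not in infected and all(positive[a] for a in tests_of[x])
    )

rng = random.Random(1)
n, k, m, delta = 20_000, 40, 400, 10    # illustrative values only
observed = simulate_v0_plus(n, k, m, delta, rng)
heuristic = (n - k) * (1 - exp(-k * delta / m)) ** delta
print(observed, heuristic)
```

At such small sizes the two numbers agree only up to a constant factor, because the fraction of positive tests fluctuates noticeably and is raised to the power $\Delta$.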

Proceeding with part (2), let the number of tests containing a single infected individual be

[TABLE]

Then Lemma 2.4 shows that w.h.p.

[TABLE]

Analogously,

[TABLE]

Hence, because $W^{\prime}$ is a binomial random variable, the Chernoff bound (e.g. Lemma B.1) shows that

[TABLE]

Therefore, (30) yields

[TABLE]

Now, let $U$ be the number of $x\in V_{1}$ that are not adjacent to any test with precisely one positive individual. An individual $x\in V_{1}$ counts towards $U$ if, out of all $k\Delta$ possible assignments, it is only assigned to those tests where it is not the only infected individual (there are a total of $k\Delta-W$ such assignments). Using the notation $n^{\underline{k}}=n(n-1)\dots(n-k+1)$ and recalling $\Delta=\Theta(\log n)$ , the bound on $W$ yields

[TABLE]

By a similar token we obtain

[TABLE]

Therefore, Chebyshev’s inequality shows that w.h.p.

[TABLE]

To complete the proof we need to compare $U$ and $\left|{V_{1}^{+}}\right|{}$ . Clearly, $U\geq\left|{V_{1}^{+}}\right|{}$ . But the inequality may be strict because $U$ includes positive individuals that appear twice in the same test. To be precise, an individual might be assigned to one test twice as the only infected individual. Such an individual should not be in $V_{1}^{+}$ , but it shows up in $U$ . Indeed, letting $R$ be the number of such individuals, we obtain $\left|{V_{1}^{+}}\right|{}\geq U-R$ . Hence, we are left to estimate $R$ . To this end, we observe that the probability that an individual appears in a specific test twice is upper-bounded by $\left({\Delta/m}\right)^{2}$ . Recall $m=ck\log(n/k)$ and $\Delta=d\log(n/k)$ . Consequently, taking the union bound over all tests and infected individuals we obtain

[TABLE]

Since by assumption the r.h.s. of (36) is $n^{\Omega(1)}$ , we conclude that $\left|{V_{1}^{+}}\right|{}\geq U-R=n^{\Omega(1)}$ w.h.p., as claimed.
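For concreteness, the union bound on $R$ can be spelled out with the parameters recalled above. This is a worked restatement under the stated bound $(\Delta/m)^{2}$ for a single test and individual, not a quotation of the displayed estimate.

```latex
\mathbb{E}[R] \;\le\; k\cdot m\cdot\left(\frac{\Delta}{m}\right)^{2}
  \;=\; \frac{k\Delta^{2}}{m}
  \;=\; \frac{k\,d^{2}\log^{2}(n/k)}{c\,k\log(n/k)}
  \;=\; \frac{d^{2}}{c}\,\log(n/k)
  \;=\; O(\log n).
```

By Markov's inequality, $R\leq\log^{2}n$ w.h.p., which is negligible compared with $n^{\Omega(1)}$.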

Next, we consider (3). Define $U$ as in the proof of Proposition 2.3(2). Then we know that $U\geq\left|{V_{1}^{+}}\right|{}$ . Hence, if $k(1-\exp(-d/c))^{\Delta}=o(1)$ then $\left|{V_{1}^{+}}\right|=o(1)$ due to (36).

For part (4), we observe for a given $c$ that $\min_{d}(1-\exp(-d/c))^{\Delta}$ is attained at $d=c\log 2$ . To see this, consider the function $f(d)=(1-\exp(-d/c))^{\Delta}=n^{(1-\theta)d\log(1-\exp(-d/c))}$ and observe that the minimum of $f(d)$ coincides with the minimum of $g(d)=d\log(1-\exp(-d/c))$ . Letting $x=d/c$ , the derivatives read as

[TABLE]

For $d>0$ , the unique minimum is attained at $x=\log 2$ and accordingly, $d=c\log 2$ . Furthermore, it is the case that $k(1-\exp(-\log 2))^{c\log 2\log(n/k)}\geq n^{\Omega(1)}$ and therefore by Proposition 2.3(2), $\left|{V_{1}^{+}}\right|{}=n^{\Omega(1)}$ . By a similar token by Proposition 2.3(1), $\left|{V_{0}^{+}}\right|{}=n^{\Omega(1)}$ .
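The location of the extremal point can be confirmed numerically. The sketch below evaluates $x\mapsto x\log(1-\exp(-x))$, which equals $g(d)/c$ with $x=d/c$, on a grid and reports where the smallest value occurs; the grid range and step are arbitrary choices.

```python
from math import exp, log

def g(x):
    """g(x) = x * log(1 - exp(-x)), i.e. g(d)/c from the text with x = d/c."""
    return x * log(1 - exp(-x))

# Grid search over x in (0, 5] with step 1e-4 (illustrative choice).
grid = [i * 1e-4 for i in range(1, 50_001)]
x_star = min(grid, key=g)
print(x_star, log(2), g(x_star), -log(2) ** 2)
```

The smallest grid value sits at $x\approx\log 2$ with $g\approx-\log^{2}2$, in line with $d=c\log 2$.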

Finally, for part (5), setting $d=c\log 2$ , we see that $k(1-\exp(-\log 2))^{c\log 2\log(n/k)}=o(1)$ and therefore by Proposition 2.3(3), $\left|{V_{1}^{+}}\right|{}=o(1)$ .

Appendix C The information-theoretic upper bound

C.1. Proof of Lemma 3.4

The term ${\binom{k}{\ell}}{\binom{n-k}{k-\ell}}$ accounts for the number of assignments $\sigma\in\left\{{0,1}\right\}^{V}$ of Hamming weight $k$ whose overlap with $\bm{\sigma}$ is equal to $\ell$ . Hence, with $\mathcal{S}$ being the event that one specific $\sigma\in\{0,1\}^{V}$ that has overlap $\ell$ with $\bm{\sigma}$ belongs to $S_{k,\ell}(\bm{G},\hat{\bm{\sigma}})$ , we need to show that

[TABLE]

Due to symmetry we may assume that $\bm{\sigma}_{x_{i}}=\bm{1}\{i\leq k\}$ and that $\sigma_{x_{i}}=\bm{1}\{i\leq\ell\}+\bm{1}\{k<i\leq 2k-\ell\}$ .

Proceeding as in the proof of Proposition 3.1, we think of each test $a_{i}$ as a bin of capacity $\Gamma_{i}$ and of each clone $(x_{i},h)$ , $h\in[\Delta]$ , of an individual as a ball labelled $(\bm{\sigma}_{x_{i}},\sigma_{x_{i}})\in\{0,1\}^{2}$ . We toss the $\Delta n$ balls randomly into the bins. For $i\in[m]$ and for $j\in[\Gamma_{i}]$ we let $\bm{A}_{i,j}=(\bm{A}_{i,j,1},\bm{A}_{i,j,2})\in\{0,1\}^{2}$ be the label of the $j$ th ball that ends up in bin number $i$ . To cope with this experiment we introduce a new set of $\{0,1\}^{2}$ -valued random variables $\bm{A}_{i,j}^{\prime}=(\bm{A}_{i,j,1}^{\prime},\bm{A}_{i,j,2}^{\prime})$ such that $(\bm{A}_{i,j}^{\prime})_{i\in[m],j\in[\Gamma_{i}]}$ are mutually independent and

[TABLE]

for all $i,j$ . With $\mathcal{T}$ being the event that

[TABLE]

the vector $\bm{A}^{\prime}=(\bm{A}_{i,j}^{\prime})_{i,j}$ given $\mathcal{T}$ is distributed as $\bm{A}=(\bm{A}_{i,j})_{i,j}$ given $\Gamma$ . Moreover, with similar arguments as in Section B.1, Stirling’s formula yields

[TABLE]

Let $\mathcal{N}$ be the set of indices $i\in[m]$ such that $\max_{j\in[\Gamma_{i}]}\bm{A}_{i,j,1}^{\prime}=0$ . Moreover, let $\mathcal{M}$ be the set of all indices $i\in[m]$ for which there exists precisely one $g_{i}\in[\Gamma_{i}]$ such that $\bm{A}_{i,g_{i},1}^{\prime}=1$ and such that for this index we have $\bm{A}_{i,g_{i},2}^{\prime}=0$ . Further, let

[TABLE]

Then

[TABLE]

Furthermore, given $\mathcal{N},\mathcal{M}$ the events $\mathcal{S}^{\prime},\mathcal{S}^{\prime\prime}$ are independent and

[TABLE]

For an intuitive explanation of the above expressions, please refer to the section immediately following the statement of Lemma 3.4. Given $\left|{\mathcal{N}}\right|\geq\left({1-n^{-\Omega(1)}}\right){\bm{m}}_{0}$ and $\left|{\mathcal{M}}\right|\geq\delta\Delta(k-\ell)$ , we obtain

[TABLE]

Moreover, by Lemma 3.3, the concentration of $\left|{\mathcal{N}}\right|$ and the fact that $\mathbb{E}\left[{\left|{\mathcal{N}}\right|}\right]=\mathbb{E}\left[{{\bm{m}}_{0}}\right]=m/2$ , we find

[TABLE]

and thus

[TABLE]

Combining (40)–(41) and using the trivial bound

[TABLE]

we obtain by Bayes' theorem

[TABLE]

Because $\bm{A}^{\prime}=(\bm{A}_{i,j}^{\prime})_{i,j}$ given $\mathcal{T}$ is distributed as $\bm{A}=(\bm{A}_{i,j})_{i,j}$ given $\Gamma$ , (37) follows from (43).
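The counting term ${\binom{k}{\ell}}{\binom{n-k}{k-\ell}}$ from the start of this proof can be verified by brute force on a small instance. The sketch below fixes the ground truth to be the first $k$ individuals, as in the symmetry assumption, and enumerates all supports of size $k$; the function name and the values of `n` and `k` are illustrative.

```python
from itertools import combinations
from math import comb

def count_by_overlap(n, k):
    """Count weight-k subsets of range(n) by the size of their overlap
    with the ground truth {0, ..., k-1}."""
    truth = set(range(k))
    counts = [0] * (k + 1)
    for support in combinations(range(n), k):
        counts[len(truth.intersection(support))] += 1
    return counts

n, k = 10, 3
counts = count_by_overlap(n, k)
formula = [comb(k, l) * comb(n - k, k - l) for l in range(k + 1)]
print(counts == formula)
```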

Appendix D The SCOMP algorithm

D.1. Proof of Proposition 5.1

The proof of Proposition 5.1 proceeds in three steps. First, we show that $\left|{V_{0}^{+}}\right|$ is concentrated around its expectation; we write $\mathcal{W}$ for the corresponding event. Second, we need to get a handle on the subtle dependencies in $\bm{G}$ . To this end, we introduce a set of independent multinomial random variables indexed over the tests. Whereas $\bm{Y}^{i}_{1},\bm{Y}^{i}_{0+},\bm{Y}^{i}_{0-}$ denote the number of infected, potentially false positive and definitely healthy individuals in test $a_{i}$ , respectively, the triple $(\bm{X}^{i}_{1},\bm{X}^{i}_{0+},\bm{X}^{i}_{0-})$ denotes the corresponding multinomial random variable. We will show that conditioned on the sum of $\bm{X}^{i}_{1},\bm{X}^{i}_{0+},\bm{X}^{i}_{0-}$ hitting the total number of individuals of the three types, $(\bm{X}^{i}_{1},\bm{X}^{i}_{0+},\bm{X}^{i}_{0-})$ is distributed like $(\bm{Y}^{i}_{1},\bm{Y}^{i}_{0+},\bm{Y}^{i}_{0-})$ . The technical workout is delicate, but is based on standard results from balls-into-bins experiments. Third, we show that for $m<(1-\varepsilon)m_{\mathrm{alg}}$ , the number of tests $W$ for which $\bm{X}^{i}_{1}=1$ and $\bm{X}^{i}_{0+}=0$ decays exponentially in $n$ , which implies that $V_{1}^{--}=\emptyset$ w.h.p.

Proof.

Lemma 2.6 implies that the optimal choice for the variable degree is $\Delta=d\log(n/k)$ for a constant $d$ . Let ${\bm{m}}_{1}$ be the number of positive tests, assume w.l.o.g. that $a_{1},\ldots,a_{{\bm{m}}_{1}}$ are the positive tests, and define

[TABLE]

as the event that the number of ‘potential false positives’ $\left|{V_{0}^{+}}\right|$ is highly concentrated around its mean. Then by Proposition 2.3(1), we find

[TABLE]

Similarly as before, we introduce a family of independent random variables corresponding to the tests.

Let $\bm{Y}_{1}^{1},\ldots,\bm{Y}_{1}^{{\bm{m}}_{1}}$ be the number of ones in the tests corresponding to $a_{1},\ldots,a_{{\bm{m}}_{1}}$ respectively. Let $\bm{Y}_{0+}^{1},\ldots,\bm{Y}_{0+}^{{\bm{m}}_{1}}$ count the $V_{0}^{+}$ occurrences in $a_{1},\ldots,a_{{\bm{m}}_{1}}$ . Let $\bm{Y}_{0-}^{1},\ldots,\bm{Y}_{0-}^{{\bm{m}}_{1}}$ count the $V_{0}^{-}$ occurrences in $a_{1},\ldots,a_{{\bm{m}}_{1}}$ . By definition we find $\bm{Y}_{0-}^{i}=\Gamma_{i}-\bm{Y}_{0+}^{i}-\bm{Y}_{1}^{i}$ . We introduce auxiliary variables $\bm{X}_{1}^{1},\ldots,\bm{X}^{{\bm{m}}_{1}}_{1}$ , $\bm{X}^{1}_{0+},\ldots,\bm{X}^{{\bm{m}}_{1}}_{0+},\bm{X}^{1}_{0-},\ldots,\bm{X}^{{\bm{m}}_{1}}_{0-}$ such that $(\bm{X}^{i}_{1},\bm{X}^{i}_{0+},\bm{X}^{i}_{0-})$ have distribution

[TABLE]

a multinomial distribution conditioned on the first variable being at least one. The triples $\left((\bm{X}^{i}_{1},\bm{X}^{i}_{0+},\bm{X}^{i}_{0-})\right)_{i\in[{\bm{m}}_{1}]}$ are mutually independent. We seek a choice of $p$ satisfying the equation

[TABLE]

and will show following equation (48) that such a choice exists. Define

[TABLE]

Along the lines of Section B.1, Stirling’s formula implies

[TABLE]

Moreover, $(\bm{Y}^{1}_{1},\bm{Y}^{1}_{0+},\bm{Y}^{1}_{0-},\ldots,\bm{Y}^{{\bm{m}}_{1}}_{1},\bm{Y}^{{\bm{m}}_{1}}_{0+},\bm{Y}^{{\bm{m}}_{1}}_{0-})$ and $(\bm{X}^{1}_{1},\bm{X}^{1}_{0+},\bm{X}^{1}_{0-},\ldots,\bm{X}^{{\bm{m}}_{1}}_{1},\bm{X}^{{\bm{m}}_{1}}_{0+},\bm{X}^{{\bm{m}}_{1}}_{0-})$ given ${\mathcal{E}}$ are identically distributed. This can be seen as follows:

[TABLE]

Thus, given $y^{\prime\prime}_{i}=\Gamma_{i}-y_{i}-y^{\prime}_{i}$ and $\tilde{y}^{\prime\prime}_{i}=\Gamma_{i}-\tilde{y}_{i}-\tilde{y}^{\prime}_{i}$ for all $i\in[{\bm{m}}_{1}]$ , we find

[TABLE]

Given $x_{i}^{\prime\prime}=\Gamma_{i}-x_{i}-x^{\prime}_{i}$ , we find:

[TABLE]

where the last equality follows from the fact that we conditioned on ${\mathcal{E}}$ . Since the first terms are independent of $x_{i},x^{\prime}_{i},x^{\prime\prime}_{i}$ , we find

[TABLE]

Therefore, given $\Gamma_{i}=x_{i}+x^{\prime}_{i}+x_{i}^{\prime\prime}=\tilde{x}_{i}+\tilde{x}^{\prime}_{i}+\tilde{x}^{\prime\prime}_{i},$ we have by comparison with (46),

[TABLE]

which yields the claim. Let

[TABLE]

be the number of positive tests that contain exactly one infected individual and no healthy individuals in $V_{0}^{+}$ . Note that this split is the only possibility for the test to be positive. Then

[TABLE]

By Lemma 2.5 we readily find for any choice of $c,d=\Theta(1)$ that

[TABLE]

Hence,

[TABLE]

Moreover, since $W$ is a binomial random variable, the Chernoff bound (e.g. Lemma B.1) shows that

[TABLE]

Further, Lemma 2.4 yields approximations for $\Gamma_{\min}$ and $\Gamma_{\max}$ . Now assume that $c<\log^{-2}2$ . Using a similar reformulation as in (47), we find that $p=(1+o(1))k/n$ . Thus, we have

[TABLE]

As Lemma 2.6 shows, the optimal value of $d$ is a constant. For a fixed $c$ the same $d$ that maximizes $-d/c\log(1-\exp(-d/c))$ in (48), also maximizes $\mathbb{E}[W|\Gamma,{\mathcal{E}},\left|{V_{0}^{+}}\right|]$ . This maximum is attained at $d=c\log 2$ . Consequently $p=o(q)$ and

[TABLE]

Hence,

[TABLE]

As before, we find $\mathbb{E}[W]\to 0$ w.h.p. since ${\mathbb{P}}(\mathcal{W})=1-o(1)$ and ${\mathbb{P}}({\mathcal{E}})=\Omega(1/\sqrt{\Delta k})$ and Markov’s inequality leads to $V_{1}^{--}=\emptyset$ . Proposition 5.1 follows.

∎

D.2. Proof of Proposition 5.2

By Proposition 2.3, we have $\left|{V_{0}^{+}}\right|\geq k\log n$ for $m<(1-\varepsilon)m_{\mathrm{alg}}$ . To prove Proposition 5.2, we need to show that for such $m$ , we also have $\left|{V_{0}^{+,\Delta}}\right|\geq k\log n$ . We proceed in two steps. First, we show that every individual $x\in V$ is assigned to at least $\Delta-O(1)$ distinct tests. Second, we show that a constant fraction of individuals $x\in V_{0}^{+}$ are assigned to exactly $\Delta$ distinct tests, establishing Proposition 5.2.

Proof.

Let $d^{\star}(x)$ be the number of distinct neighbors of a vertex $x$ . We claim that w.h.p. the following statements are true.

[TABLE]

The probability that a given $x\in V$ appears $\ell\geq 2$ times in the same test is upper-bounded by

[TABLE]

provided that $\ell>1+1/\theta$ . Moreover, the probability that $x$ appears in one test twice is upper-bounded by $\Delta\cdot\Delta/m$ . Thus, the probability that $x$ appears in at least $\ell$ tests at least twice is upper-bounded by

[TABLE]

provided that $\ell>1/\theta$ and since $m=ck\log(n/k)$ and $\Delta=d\log(n/k)$ . The bound follows.
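The effect of repeated draws on the number $d^{\star}(x)$ of distinct neighbours can be illustrated by sampling. The sketch below records $\Delta-d^{\star}(x)$ for every individual under one hypothetical parameter choice; the values of `n`, `m` and `delta` are assumptions made for illustration.

```python
import random
from collections import Counter

def repeat_histogram(n, m, delta, rng):
    """For each individual, draw delta tests with replacement and record
    delta - d*(x), the number of draws lost to repeated tests."""
    hist = Counter()
    for _ in range(n):
        draws = [rng.randrange(m) for _ in range(delta)]
        hist[delta - len(set(draws))] += 1
    return hist

rng = random.Random(2)
n, m, delta = 20_000, 2_000, 12     # illustrative values only
hist = repeat_histogram(n, m, delta, rng)
print(sorted(hist.items()))
```

In such a run the bulk of individuals lose no draw at all and the remainder lose only a few, which is the qualitative behaviour the bound formalises.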

By Proposition 2.3, we know that for $m<(1-\varepsilon)m_{\mathrm{alg}}$ , $\left|{V_{0}^{+}}\right|\geq k\log n$ w.h.p. Since the SCOMP algorithm in its third stage selects the individual with the highest number of adjacent unexplained tests, we are left to show that also $\left|{V_{0}^{+,\Delta}}\right|\geq k\log n$ , which implies that w.h.p. we erroneously classify a healthy individual as infected. The prior bounds ensure that each individual is in at least $\Delta-O(1)$ tests. The question remains which fraction of individuals in $V_{0}^{+}$ are in $V_{0}^{+,\Delta}$ . In principle, it could be the case that most potentially false positive individuals of $V_{0}^{+}$ appear in fewer than $\Delta$ different tests. Indeed, it is more likely for such an individual in $V_{0}^{+}$ to be in fewer than $\Delta$ different tests since each additional test increases the probability for such an individual to be assigned to a negative test. However, we claim that a constant fraction of all potentially false positive individuals in $V_{0}^{+}$ will have degree $\Delta$ , and thus be in $V_{0}^{+,\Delta}$ . To see this, let $p$ be the maximum proportion of $\left|{V_{0}^{+,\Delta-i}}\right|$ and $\left|{V_{0}^{+,\Delta-i+1}}\right|$ for $i\in[2/\theta^{2}]$ , i.e.

[TABLE]

By conditioning on a test degree sequence $\Gamma_{1},\dots,\Gamma_{m}$ , we find

[TABLE]

as long as $c,d=\Theta(1)$ , which by Lemma 2.6 we can safely assume. Since each individual in $V_{0}^{+}$ is in at least $\Delta-O(1)$ different tests and the probability of being in any number of different tests $\Delta,\Delta-1\dots$ is constant, a constant fraction of individuals in $V_{0}^{+}$ will be in exactly $\Delta$ tests. Since $\left|{V_{0}^{+}}\right|=\Omega(k\log n)$ , the claim follows. ∎
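Since the argument hinges on the greedy third stage of SCOMP, a minimal sketch of the three stages may help. It follows the usual description of the algorithm (negative tests clear individuals, positive tests with a single remaining candidate fix that candidate, then greedy covering of unexplained positive tests). Representing tests as sets, which ignores multi-edges, and breaking ties by smallest index are assumptions of this sketch.

```python
def scomp(tests, outcomes, n):
    """tests[i] is the set of individuals in test i; outcomes[i] is True
    if test i is positive. Returns the set declared infected."""
    # Stage 1: everyone appearing in a negative test is healthy.
    healthy = set()
    for members, positive in zip(tests, outcomes):
        if not positive:
            healthy |= members
    candidates = set(range(n)) - healthy

    # Stage 2: a positive test with one remaining candidate pins it down.
    declared = set()
    for members, positive in zip(tests, outcomes):
        if positive:
            alive = members & candidates
            if len(alive) == 1:
                declared |= alive

    # Stage 3: greedily pick the candidate adjacent to the most
    # unexplained positive tests until every positive test is explained.
    unexplained = [
        members & candidates
        for members, positive in zip(tests, outcomes)
        if positive and not (members & declared)
    ]
    while unexplained:
        score = {}
        for alive in unexplained:
            for x in alive:
                score[x] = score.get(x, 0) + 1
        if not score:  # inconsistent input: positive test without candidate
            break
        best = max(sorted(score), key=score.get)  # ties: smallest index
        declared.add(best)
        unexplained = [alive for alive in unexplained if best not in alive]
    return declared

# Individual 0 is infected; no negative test is available, so stage 3 decides.
print(scomp([{0, 1}, {0, 2}], [True, True], 3))
# Here the negative tests clear everyone except individual 1.
print(scomp([{0, 1}, {1, 2}, {2, 3}, {3, 4}, {0, 4}],
            [True, True, False, False, False], 5))
```

In the first call the greedy stage selects individual 0 because it explains both positive tests. The failure mode analysed above arises when a healthy individual in $V_{0}^{+}$ attains the top score instead.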

