Information-theoretic and algorithmic thresholds for group testing
Amin Coja-Oghlan, Oliver Gebhard, Max Hahn-Klimroth, Philipp Loick

TL;DR
This paper determines the minimum number of tests needed for successful group testing using a randomized design, establishing sharp thresholds and analyzing algorithms to solve the problem efficiently.
Contribution
It precisely characterizes the information-theoretic threshold for group testing and analyzes the performance of inference algorithms, settling prior conjectures.
Findings
Identified sharp thresholds for the number of tests needed.
Analyzed the performance of two efficient inference algorithms.
Settled conjectures from previous studies.
Abstract
In the group testing problem we aim to identify a small number of infected individuals within a large population. We avail ourselves of a procedure that can test a group of multiple individuals, with the test result coming out positive iff at least one individual in the group is infected. With all tests conducted in parallel, what is the least number of tests required to identify the status of all individuals? In a recent test design [Aldridge et al. 2016] the individuals are assigned to test groups randomly, with every individual joining an equal number of groups. We pinpoint the sharp threshold for the number of tests required in this randomised design so that it is information-theoretically possible to infer the infection status of every individual. Moreover, we analyse two efficient inference algorithms. These results settle conjectures from [Aldridge et al. 2014, Johnson et al. …
| Notation | Definition & Properties | Description |
| population size | ||
| for | number of infected individuals | |
| number of tests | ||
| variable nodes | ||
| set of all individuals | ||
| factor nodes | ||
| set of all tests | ||
| tests per individual, variable node degree | ||
| individuals per test, factor node degree | ||
| -algebra generated by the random variables | ||
| -dimensional vector of Hamming weight indicating the individuals’ infection status | ||
| random bipartite graph on variable nodes, factor nodes and variable degree | ||
| for | set of tests that individual participates in under | |
| for | set of individuals in test under | |
| -dimensional vector indicating the test outcomes | ||
| number of positive and negative tests | ||
| set of healthy individuals | ||
| set of infected individuals | ||
| set of healthy individuals only included in positive tests | ||
| set of healthy individuals included in at least one negative test | ||
| set of infected individuals that have another infected individual in all their tests | ||
| set of infected individuals that occur in at least one test with only healthy individuals | ||
| minimum and maximum test degree | ||
| set of configurations consistent with the test results under | ||
| number of configurations consistent with the test results | ||
| number of configurations consistent with the test results and with overlap with | ||
| for | number of edges that connect test with an infected individual | |
| for | binomially-distributed random variable with parameters and | |
| is the number of tests containing a single infected individual, is a random variable depending on | ||
| number of infected individuals not adjacent to any test with precisely one infected individual | ||
| number of infected individuals who appear in fewer than tests as the only infected individual for some constant | ||
| number of infected individuals adjacent to some test multiple times with no other infected individual besides themselves | ||
| auxiliary random variables, defined in proof of Proposition 3.1 | ||
| event that every test under the balls-and-bins experiment features the same test result | ||
| event that the sum of is exactly | ||
| set of all indices for which there exists precisely one such that | ||
| set of indices such that | ||
| event that for every there are at least tests for some such that . | ||
| event that one specific that has overlap with belongs to | ||
| event that the sum of independent random variables is equal to a specific value, defined in (7) | ||
| event that around half of the tests are positive | ||
| event that the size of is concentrated around its mean | ||
| o(1) [ω(1)] denotes a term that vanishes [diverges] in the limit of large n | ||
| w.h.p. | probability of 1 - o(1) as n → ∞ |
Amin Coja-Oghlan, Goethe University, Mathematics Institute, 10 Robert Mayer St, Frankfurt 60325, Germany.
Oliver Gebhard, Goethe University, Mathematics Institute, 10 Robert Mayer St, Frankfurt 60325, Germany.
Max Hahn-Klimroth, Goethe University, Mathematics Institute, 10 Robert Mayer St, Frankfurt 60325, Germany.
Philipp Loick, Goethe University, Mathematics Institute, 10 Robert Mayer St, Frankfurt 60325, Germany.
Abstract.
In the group testing problem we aim to identify a small number of infected individuals within a large population. We avail ourselves of a procedure that can test a group of multiple individuals, with the test result coming out positive iff at least one individual in the group is infected. With all tests conducted in parallel, what is the least number of tests required to identify the status of all individuals? In a recent test design [Aldridge et al. 2016] the individuals are assigned to test groups randomly with replacement, with every individual joining an almost equal number of groups. We pinpoint the sharp threshold for the number of tests required in this randomised design so that it is information-theoretically possible to infer the infection status of every individual. Moreover, we analyse two efficient inference algorithms. These results settle conjectures from [Aldridge et al. 2014, Johnson et al. 2019].
Supported by DFG CO 646/3 and Stiftung Polytechnische Gesellschaft. An extended abstract of this work appeared in the 2019 ICALP proceedings. A revised version is to appear in IEEE Transactions on Information Theory (Copyright (c) 2017 IEEE DOI: 10.1109/TIT.2020.3023377)
1. Introduction
1.1. Background and motivation
The group testing problem goes back to the work of Dorfman from the 1940s [24]. Among a large population a few individuals are infected with a rare disease. The objective is to identify the infected individuals effectively. At our disposal we have a testing procedure capable of not merely testing one individual, but several. The test result will be positive if at least one individual in the test group is infected, and negative otherwise; all tests are conducted in parallel. We are at liberty to assign a single individual to several test groups. The aim is to devise a test design that identifies the status of every single individual correctly while requiring as small a number of tests as possible. A recently proposed test design allocates the individuals to tests randomly [10, 12, 13, 30, 33]. To be precise, given integers we create a random bipartite multi-graph by choosing independently for each of the vertices ‘at the top’ neighbours among the vertices ‘at the bottom’ uniformly at random with replacement. The vertices represent the individuals, the represent the test groups and an individual joins a test group iff the corresponding vertices are adjacent (see Figure 1). The wisdom behind this construction is that the expansion properties of the random bipartite graph precipitate virtuous correlations, facilitating inference. Given and (an estimate of) the number of infected individuals, what is the least for which, with a suitable choice of , the status of every individual can be inferred correctly from the test results with high probability?
Like in many other inference problems the answer comes in two instalments. First, we might ask for what it is information-theoretically possible to detect the infected individuals. In other words, regardless of computational resources, do the test results contain enough information in principle to identify the infection status of every individual? Second, for what does this problem admit efficient algorithms?
The first main result of this paper resolves the information-theoretic question completely. Specifically, Aldridge, Johnson and Scarlett [13] obtained a function such that for any fixed the inference problem is information-theoretically infeasible if . They conjectured that this bound is tight, i.e., that for there is an (exponential) algorithm that correctly identifies the infected individuals with high probability. We prove this conjecture. Furthermore, concerning the algorithmic question, Johnson, Aldridge and Scarlett [30] obtained a function that exceeds by a constant factor for small such that for certain efficient algorithms successfully identify the infected individuals with high probability. They conjectured that SCOMP, their most sophisticated algorithm, actually succeeds for smaller values of . We refute this conjecture and show that SCOMP asymptotically fails to outperform a much simpler algorithm called DD. A technical novelty of the present work is that we investigate the group testing problem from a new perspective. While most prior contributions rely either on elementary calculations and/or information-theoretic arguments [12, 13, 30, 39], here we bring to bear techniques from the theory of random constraint satisfaction problems [5, 32].
Indeed, group testing can be viewed naturally as a constraint satisfaction problem: the tests provide the constraints and the task is to find all possible ways of assigning a status (‘infected’ or ‘not infected’) to the individuals in a way consistent with the given test results. Since the allocation of individuals to tests is random, this question is similar in nature to, e.g., the random -SAT problem that asks for a Boolean assignment that satisfies a random collection of clauses [4, 6, 20, 23]. It also puts the group testing problem in the same framework as the considerable body of recent work on other inference problems on random graphs such as the stochastic block model (e.g., [1, 18, 22, 35, 37, 43]) or decoding from pooled data [7, 8].
We proceed to state the main results of the paper precisely, followed by a detailed discussion of the prior literature on group testing. The proofs of the information-theoretic and algorithmic bounds follow in Sections 3, 4, and 5. The technical details can be found in the appendix.
1.2. The information-theoretic threshold
Throughout the paper we labour under the assumptions commonly made in the context of group testing; we will revisit their merit in Section 1.4. Specifically, we assume that the number of infected individuals satisfies for a fixed . (Footnote 1: While we write that for the sake of brevity, our results immediately extend to the case for some constant .) Moreover, let be a vector of Hamming weight chosen uniformly at random. The (one-)entries of indicate which of the individuals are infected. Further, let signify the aforementioned random bipartite graph with multi-edges. Then induces a vector that indicates which of the tests come out positive. To be precise, iff test is adjacent to an individual with . For what is it possible to recover from ? (Throughout the paper all logarithms are base .)
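The model just described can be sketched in a few lines of Python. This is only an illustrative simulation under assumed names: `n` (population size), `k` (number of infected individuals), `m` (number of tests), `delta` (tests per individual) and the function `sample_instance` are labels chosen here, not notation taken from the text.

```python
import random

def sample_instance(n, k, m, delta, seed=0):
    """Sample a random test design, a ground truth and the test results.

    Hypothetical helper: returns (tests_of, sigma, sigma_hat).
    """
    rng = random.Random(seed)
    # Each individual joins `delta` tests chosen uniformly at random with
    # replacement, so a test may appear twice in a list (a multi-edge).
    tests_of = [[rng.randrange(m) for _ in range(delta)] for _ in range(n)]
    # Ground truth: a uniformly random 0/1 vector of Hamming weight k.
    infected = set(rng.sample(range(n), k))
    sigma = [1 if i in infected else 0 for i in range(n)]
    # A test is positive iff it contains at least one infected individual.
    sigma_hat = [0] * m
    for i in infected:
        for a in tests_of[i]:
            sigma_hat[a] = 1
    return tests_of, sigma, sigma_hat

tests_of, sigma, sigma_hat = sample_instance(n=1000, k=10, m=120, delta=8)
print(sum(sigma), "infected;", sum(sigma_hat), "of", len(sigma_hat), "tests positive")
```

The inference task is then to recover `sigma` from `tests_of` and `sigma_hat` alone.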
Theorem 1.1.
Suppose that , and and let
[TABLE]
(i) If , then there exists an algorithm that given outputs with high probability.
(ii) If , then there does not exist any algorithm that given outputs with a non-vanishing probability.
Since for the first part of Theorem 1.1 readily follows from a folklore argument [25], the interesting regime is . The negative part of Theorem 1.1 strengthens a result from [13], who showed that for any inference algorithm has a strictly positive error probability. By comparison, Theorem 1.1 shows that any algorithm fails with high probability.
But the main contribution of Theorem 1.1 is the first, positive statement. While the problem was solved for for a different test design [39, 40] and the case is easy because a plain greedy algorithm succeeds [30], the case proved more challenging. Only heuristic arguments predicting the result of Theorem 1.1 have been put forward for this regime so far [33]. Indeed, Aldridge et al. [12] conjectured that in this case inferring from is equivalent to solving a hypergraph minimum vertex cover problem. The proof of Theorem 1.1 vindicates this conjecture. Specifically, the vertex set of the hypergraph comprises all ‘potentially infected’ individuals, i.e., those that do not appear in any negative test. The hyperedges are the neighbourhoods of the positive tests in . Exhaustive search solves this vertex cover problem in time . But how about efficient algorithms for general ?
1.3. Efficient algorithms for group testing
Several polynomial time group testing algorithms have been proposed. A very simple greedy strategy called DD (for ‘definitive defectives’) first labels all individuals that are members of negative test groups as uninfected. Subsequently it checks for positive tests in which all individuals but one have been identified as uninfected in the first step. Clearly, the single as yet unlabelled individual in such a test group must be infected. Up to this point all decisions made by DD are correct. But in the final step DD marks all as yet unclassified individuals as uninfected, possibly causing false negatives. In fact, the output of DD may be inconsistent with the test results, as some positive tests may fail to include an individual classified as ‘infected’. While an achievability result is known for the DD algorithm, a corollary of the work in this paper is a matching converse.
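The three steps of DD can be sketched as follows. This is a minimal illustration, not the reference implementation: the function name `dd` and the input encoding (`tests_of[i]` lists the tests of individual `i`, `sigma_hat[a]` is the 0/1 result of test `a`) are assumptions made for this example.

```python
def dd(tests_of, sigma_hat):
    """Sketch of DD; returns a 0/1 list (1 = classified infected)."""
    n, m = len(tests_of), len(sigma_hat)
    # Step 1: every member of a negative test is certainly uninfected.
    cleared = [any(sigma_hat[a] == 0 for a in tests_of[i]) for i in range(n)]
    # Remaining candidates per test (sets, so multi-edges count once).
    candidates = [set() for _ in range(m)]
    for i in range(n):
        if not cleared[i]:
            for a in tests_of[i]:
                candidates[a].add(i)
    # Step 2: a positive test with exactly one candidate left pins it down.
    estimate = [0] * n
    for a in range(m):
        if sigma_hat[a] == 1 and len(candidates[a]) == 1:
            (i,) = candidates[a]
            estimate[i] = 1
    # Step 3: everyone still unclassified defaults to uninfected,
    # which is where false negatives can arise.
    return estimate

# Individual 1 sits in negative test 2, so test 1 exposes individual 0.
print(dd([[0, 1], [1, 2], [0]], [1, 1, 0]))   # [1, 0, 0]
# Two candidates share the only positive test: DD labels nobody infected,
# so its output is inconsistent with the test result.
print(dd([[0], [0]], [1]))                    # [0, 0]
```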
The more sophisticated SCOMP algorithm is roughly equivalent to the well-known greedy algorithm for the hypergraph vertex cover problem applied to the hypergraph from the previous paragraph. Specifically, in its first step SCOMP proceeds just like DD, classifying all individuals that occur in negative tests as uninfected. Then SCOMP identifies as infected all unmarked individuals that appear in at least one test whose other participants are already known to be uninfected. Subsequently the algorithm keeps picking an individual that appears in the largest number of as yet ‘unexplained’ (viz. uncovered) positive tests and marks that individual as infected, with ties broken randomly, until every positive test contains an individual classified as infected. Clearly, SCOMP may produce false positives as well as false negatives. But at least the output is consistent with the test results. Algorithm 1 summarises the procedure of SCOMP.
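Since Algorithm 1 is not reproduced in this text, the description above can be sketched as follows. This is a hedged reading of the prose, not the authors' pseudocode: the function name `scomp`, the input encoding (`tests_of[i]` lists the tests of individual `i`, `sigma_hat[a]` is the 0/1 result of test `a`) and the `seed` used for tie-breaking are assumptions made for this example.

```python
import random

def scomp(tests_of, sigma_hat, seed=0):
    """Greedy sketch of SCOMP; returns a 0/1 list (1 = classified infected)."""
    rng = random.Random(seed)
    n, m = len(tests_of), len(sigma_hat)
    # Step 1 (as in DD): members of negative tests are uninfected.
    cleared = [any(sigma_hat[a] == 0 for a in tests_of[i]) for i in range(n)]
    candidates = [set() for _ in range(m)]
    for i in range(n):
        if not cleared[i]:
            for a in tests_of[i]:
                candidates[a].add(i)
    # Step 2: a positive test with a single candidate identifies it.
    estimate = [0] * n
    for a in range(m):
        if sigma_hat[a] == 1 and len(candidates[a]) == 1:
            (i,) = candidates[a]
            estimate[i] = 1
    # Step 3: greedily cover the positive tests that are still unexplained.
    unexplained = {a for a in range(m)
                   if sigma_hat[a] == 1
                   and not any(estimate[i] for i in candidates[a])}
    while unexplained:
        score = {}
        for a in unexplained:
            for i in candidates[a]:
                score[i] = score.get(i, 0) + 1
        if not score:
            break  # only reachable if sigma_hat is not a valid test result
        best = max(score.values())
        # Ties are broken uniformly at random.
        pick = rng.choice(sorted(i for i, s in score.items() if s == best))
        estimate[pick] = 1
        unexplained = {a for a in unexplained if pick not in candidates[a]}
    return estimate

# Individual 0 explains both positive tests, so the greedy step picks it.
print(scomp([[0, 1], [0], [1]], [1, 1]))   # [1, 0, 0]
```

Unlike DD, the returned labelling always explains every positive test, at the price of possible false positives.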
Analysing SCOMP has been prominently posed as an open problem in the group testing literature [9, 12, 30]. Indeed, Aldridge et al. [12] opined that “the complicated sequential nature of SCOMP makes it difficult to analyse mathematically”. On the positive side, [12] proved that SCOMP succeeds in recovering correctly given if w.h.p. (footnote 2: w.h.p. refers to a probability of 1 - o(1) as n → ∞), where
[TABLE]
However, the algorithm succeeds for a trivial reason; namely, for even DD suffices to recover w.h.p. Yet based on experimental evidence [12, 30] conjectured that SCOMP strictly outperforms DD. The following theorem refutes this conjecture.
Theorem 1.2.
Suppose that and . If , then given w.h.p. both SCOMP and DD fail to output .
For the information-theoretic bound provided by Theorem 1.1 and the algorithmic bound supplied by Theorem 1.2 remain a modest constant factor apart; see Figure 2. Whether there exists an efficient algorithm for group testing that can close the gap to the information-theoretic bound has long been an open research question. A recent result by Coja-Oghlan et al. [19] shows that such a polynomial-time algorithm indeed exists. The proposed algorithm, which is inspired by the notion of spatial coupling from coding theory, is able to recover whenever . Moreover, the authors prove that below the information-theoretic threshold from Theorem 1.1 no non-adaptive algorithm can succeed under any test design (not only the random regular test design considered here), thereby establishing the presence of an adaptivity gap in the group testing problem. An exciting avenue for future research is to investigate the merits of the results and techniques of this paper and [19, 28] for the noisy variant of group testing.
1.4. Discussion and related work
Dorfman’s original group testing scheme, intended to test the American army for syphilis, was adaptive. In a first round of tests each soldier would be allocated to precisely one test group. If the test result came out negative, none of the soldiers in the group were infected. In a second round the soldiers whose group tested positive would be tested individually. Of course, Dorfman’s scheme was not information-theoretically optimal. A first-order optimal adaptive scheme that involves several test stages, with the tests conducted in the present stage governed by the results from the previous stages, is known [15, 25]. In the adaptive scenario the information-theoretic threshold works out to be
[TABLE]
The lower bound, i.e., that no adaptive design gets by with tests, follows from a very simple information-theoretic consideration. Namely, with a total of tests at our disposal there are merely possible test outcomes, and we need this number to exceed the count of possible vectors , i.e., [14].
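The counting consideration can be made concrete numerically. The sketch below assumes the symbols `n` (population size), `k` (number of infected individuals) and `m` (number of tests); the helper `counting_bound` is a name introduced here. It computes the smallest `m` for which the number of outcome vectors, 2^m, is at least the number of candidate infection vectors, C(n, k), and compares it with the elementary estimate k·log2(n/k), which is a lower bound since C(n, k) ≥ (n/k)^k.

```python
from math import comb, log2

def counting_bound(n, k):
    """Smallest m with 2**m >= C(n, k), computed with exact integers."""
    # For x >= 1, (x - 1).bit_length() equals ceil(log2(x)).
    return (comb(n, k) - 1).bit_length()

n, theta = 10**6, 0.3
k = round(n ** theta)            # sublinear number of infected individuals
m_min = counting_bound(n, k)
print(k, m_min, k * log2(n / k))
```

No test design, adaptive or not, can identify every possible infection vector with fewer than `m_min` tests.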
More recently there has been a great deal of interest in non-adaptive group testing, where the infection status of each individual is to be determined after just one round of tests [14, 17, 27, 33]. This is the version of the problem that we deal with in the present paper. An important advantage of the non-adaptive scenario is that tests, which may be time-consuming, can be conducted in parallel. Indeed, some of today’s most popular applications of group testing are non-adaptive such as DNA screening [17, 31, 38] or protein interaction experiments [36, 42] in computational molecular biology. The randomised test design that we deal with here is the best currently known non-adaptive design (in terms of the number of tests required).
The most interesting regime for the group testing problem is when the number of infected individuals scales as a power of the entire population. Mathematically this is because in the linear regime the optimal strategy is to perform individual tests [11] in order to achieve a vanishing error probability. Similarly, the case of constant has been solved for some time [41]. Thus, for linear in and constant the theory is already well established. But the sublinear case is also of practical relevance, as witnessed by Heaps’ law in epidemiology [16] or biological applications [27].
Besides the randomised test design where each individual chooses precisely tests (with replacement), there is the so-called Bernoulli design, which assigns each individual to every test independently with a certain probability. A considerable amount of attention has been devoted to this model, and its information-theoretic threshold as well as the thresholds for various algorithms have been determined [9, 10, 12, 39]. However, the Bernoulli test design, while easier to analyse, for is provably inferior to the test design that we study here. This is because in the Bernoulli design there are likely quite a few individuals that participate in far fewer tests than expected due to degree fluctuations. We note that our proofs can easily be adapted to reprove the known results for the Bernoulli design. In fact, many technical parts of the proofs become significantly easier and shorter, since we can assume independence between tests, whereas the constant-column design under consideration here gives rise to subtle dependencies between the tests. A significant portion of the proofs is devoted to getting a handle on these dependencies.
1.5. Notation
Throughout the paper denotes the random bipartite graph that describes which individuals take part in which test groups, the vector encodes which individuals are infected, and indicates the test results. Clearly, is independent of . Moreover, signifies the number of infected individuals. Additionally, we write
[TABLE]
for the set of all individuals, the set of uninfected and infected individuals, respectively. For an individual we write for the multi-set of tests adjacent to with . Analogously, for a test we denote by the multi-set of individuals that take part in the test and . These are multi-sets since individuals are assigned to tests uniformly at random with replacement and therefore features multi-edges w.h.p. Let be the vector . Furthermore, all asymptotic notation refers to the limit n → ∞. Thus, o(1) denotes a term that vanishes in the limit of large n, while ω(1) stands for a function that diverges to ∞ as n → ∞. We also let denote reals such that
[TABLE]
Later, we will prove that as is optimal for inference. Finally, let , . The following sections outline the proofs of the information-theoretic bounds and the analysis of the SCOMP algorithm and feature the important proofs. The technical details are left to the appendix.
2. Getting started
The very first item on the agenda is to get a handle on the posterior distribution of given and . To this end, let be the set of all vectors of Hamming weight such that
[TABLE]
In words, contains the set of all vectors with ones that label the individuals infected/uninfected in a way consistent with the test results, i.e. that are "satisfying sets" [12, 14]. Let . The following proposition shows that the posterior of given is uniform on .
Proposition 2.1 ([10]).
For all we have
Adopting the jargon of the recent literature on inference problems on random graphs, we refer to Proposition 2.1 as the Nishimori identity [18, 43]. The proposition shows that apart from the actual test results, there is no further ‘hidden information’ about encoded in . In particular, the information-theoretically optimal inference algorithm just outputs a uniform sample from . In effect, we obtain the following.
Corollary 2.2.
(1) If w.h.p., then for any algorithm we have
[TABLE]
(2) If w.h.p., then there is an algorithm such that
[TABLE]
Both the positive and the negative part of Corollary 2.2 assume that the precise number of infected individuals is known to the algorithm. This assumption makes the negative part stronger, but weakens the positive part. Yet we will see in due course how in the positive scenario the assumption that be known can be removed.
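To illustrate the optimal (exponential-time) inference rule, the following sketch enumerates all labellings with exactly `k` infected individuals that reproduce the observed test results and returns one of them uniformly at random. The names `outcomes`, `consistent_sets` and `optimal_guess`, and the encoding of the design as `tests_of[i]` (the tests of individual `i`), are assumptions made for this example; it is feasible only for tiny instances.

```python
import random
from itertools import combinations

def outcomes(tests_of, infected, m):
    """Test results produced if exactly the given individuals are infected."""
    out = [0] * m
    for i in infected:
        for a in tests_of[i]:
            out[a] = 1
    return out

def consistent_sets(tests_of, sigma_hat, k):
    """All size-k sets of individuals that explain the test results."""
    n, m = len(tests_of), len(sigma_hat)
    return [set(c) for c in combinations(range(n), k)
            if outcomes(tests_of, c, m) == sigma_hat]

def optimal_guess(tests_of, sigma_hat, k, seed=0):
    """Uniform sample from the consistent sets (assumes one exists)."""
    return random.Random(seed).choice(consistent_sets(tests_of, sigma_hat, k))

tests_of = [[0, 1], [0], [1], [2]]   # four individuals, three tests
sigma_hat = [1, 1, 0]
print(consistent_sets(tests_of, sigma_hat, 1))        # [{0}]: recovery succeeds
print(len(consistent_sets(tests_of, sigma_hat, 2)))   # 3: ambiguous instance
```

Recovery is possible exactly when the list of consistent sets has a single element, which is the dichotomy expressed by the corollary.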
For the information-theoretic bound, the proof hinges on analysing the number of individuals that can be flipped without affecting the test results. We encounter two kinds of such individuals. The first kind consists of healthy individuals that only appear in positive tests and which we will denote by . In symbols,
[TABLE]
Similarly, let be the set of all infected individuals such that every test in which occurs features another infected individual; in symbols,
[TABLE]
We think of the individuals in as the ‘potential false positives’. Indeed, if for any we obtain from by setting to one, then will render the same test results as . Similarly, the individuals in are potential false negatives. For completeness, we also define and as
[TABLE]
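The four classes of individuals can be computed directly from a design and a ground truth. The sketch below follows the verbal definitions given above; the function name `classify`, the dictionary keys and the input encoding (`tests_of[i]` lists the tests of individual `i`, `sigma[i]` is 1 for infected) are illustrative assumptions.

```python
def classify(tests_of, sigma, m):
    """Split individuals into the four classes described in the text."""
    # Infected participants of each test.
    infected_in = [set() for _ in range(m)]
    for i, ts in enumerate(tests_of):
        if sigma[i]:
            for a in ts:
                infected_in[a].add(i)
    groups = {
        "healthy_only_positive": set(),    # potential false positives
        "healthy_some_negative": set(),
        "infected_always_covered": set(),  # potential false negatives
        "infected_sometimes_alone": set(),
    }
    for i, ts in enumerate(tests_of):
        if sigma[i]:
            # Is i the only infected individual in at least one of its tests?
            alone = any(infected_in[a] == {i} for a in ts)
            key = "infected_sometimes_alone" if alone else "infected_always_covered"
        else:
            # Does i take part in at least one negative test?
            negative = any(not infected_in[a] for a in ts)
            key = "healthy_some_negative" if negative else "healthy_only_positive"
        groups[key].add(i)
    return groups

tests_of = [[0, 1], [0], [1], [2]]
sigma = [1, 1, 0, 0]
print(classify(tests_of, sigma, m=3))
```

Flipping any member of the first or third class leaves every test result unchanged, which is why these two classes govern the information-theoretic bound.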
In the following, let us get a handle on the size of sets and . Specifically, we prove the following five statements.
Proposition 2.3.
Let . Then, the following statements hold w.h.p.
(1)
(2) If , then
(3) If , then
(4) If , then
(5) If , then
The proof of Proposition 2.3, while not fundamentally difficult, requires a bit of care because we are dealing with a random bipartite multi-graph whose (test-)degrees scale as a power of . In effect, the diameter of the bipartite graph is quite small and the neighbourhoods of different tests may have a sizeable intersection. The technical workout follows in Section B.6. In the next step, let us get a handle on the size of the test degrees.
Lemma 2.4.
With probability at least we have
[TABLE]
The proofs of this and the subsequent elementary lemmas are included in Section B. Next, we calculate the number of positive and negative tests. Let be the number of positive tests and let be the number of negative tests. Clearly .
Lemma 2.5.
With probability at least we have
[TABLE]
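As a heuristic sanity check on the split between positive and negative tests (not a restatement of Lemma 2.5, whose formula is not reproduced here), note that under the with-replacement design each of the `k * delta` edges leaving infected individuals hits a fixed test with probability 1/m. A fixed test is therefore negative with probability (1 - 1/m)^(k·delta) ≈ exp(-k·delta/m), which is close to 1/2 when delta ≈ (m/k)·ln 2. The symbols `k`, `m`, `delta` and the helper `negative_fraction` are assumptions of this sketch.

```python
import math
import random

def negative_fraction(k, m, delta, seed=0):
    """Simulated fraction of negative tests for k infected individuals."""
    rng = random.Random(seed)
    positive = [False] * m
    # Only the k * delta edges leaving infected individuals decide outcomes.
    for _ in range(k * delta):
        positive[rng.randrange(m)] = True
    return 1 - sum(positive) / m

k, m = 20, 200
delta = round(m * math.log(2) / k)       # degree aiming at a half/half split
predicted = (1 - 1 / m) ** (k * delta)   # chance that a fixed test is negative
observed = negative_fraction(k, m, delta)
print(delta, round(predicted, 3), observed)
```

A balanced split maximises the entropy of a single test outcome, which is the intuition behind this choice of degree.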
Finally, we justify that setting as is optimal for inference. The fact that immediately follows from the information-theoretic counting bound, i.e., [14].
Lemma 2.6.
(1) If and , then w.h.p.
(2) If and m = Θ(k log(n/k)), then w.h.p.
3. The information-theoretic upper bound
We proceed to discuss the proof of Theorem 1.1. The proof of the first, positive statement and of the second, negative statement hinge on two separate arguments. We begin with the proof of the information-theoretic upper bound which is the principal achievement of the present work. The proof rests upon techniques that have come to play an important role in the theory of random constraint satisfaction problems. Specifically, we need to show that w.h.p., i.e., that is the only assignment compatible with the test results w.h.p. We establish this result by combining two separate arguments. First, we use a moment calculation to show that w.h.p. there are no other solutions that have a small ‘overlap’ with . Then we use an expansion argument to show that w.h.p. there are no alternative solutions with a big overlap. Both these arguments are variants of the arguments that have been used to study the solution space geometry of random constraint satisfaction problems such as random -SAT or random -XORSAT [3, 4, 26], as well as the freezing thresholds of random constraint satisfaction problems [2, 34]. Yet to our knowledge these methods have thus far not been applied to the group testing problem. In this section we choose which maximises the entropy of the test results. Formally, we define
[TABLE]
as the number of assignments different from the true configuration whose overlap
[TABLE]
with is equal to . The following two propositions rule out assignments with a small and a big overlap, respectively. In either case we choose to take its optimal value.
Proposition 3.1.
Let and and assume that . W.h.p. we have for all .
Proof.
For let be the degree of in , i.e., the number of edges incident with ; this number may exceed the number of different individuals that participate in test as may feature multi-edges. Let be the σ-algebra generated by the random variables . Whenever we condition on , we assume that the bounds from Lemmas 2.4 and 2.5 hold. Given we can generate from the well-known pairing model [29]. Specifically, we create a set of clones of each individual as well as sets of clones of the tests. Then we draw a perfect matching of the complete bipartite graph on the vertex sets , uniformly at random. For each matching edge linking a clone of with a clone of we insert an --edge. The resulting bipartite random multi-graph has the same distribution as given . As an application of this observation we obtain for every integer
[TABLE]
To see why (4) holds we use the linearity of expectation. The product of the two binomial coefficients simply accounts for the number of assignments that have overlap with . Hence, with the event that one specific that has overlap with belongs to , we need to show that
[TABLE]
By symmetry we may assume that and that .
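The pairing model used in this proof can be sketched as a shuffle: each individual contributes `delta` clones, each test contributes as many slots as its prescribed degree, and a uniformly random perfect matching between clones and slots is obtained by shuffling the clones and dealing them out. The names `pairing_model`, `delta` and `test_degrees` are assumptions made for this illustration.

```python
import random

def pairing_model(n, delta, test_degrees, seed=0):
    """Random multi-graph with prescribed test degrees.

    Returns, for every test, the list of individuals matched to its slots
    (with repetitions if an individual is matched to a test more than once).
    """
    assert sum(test_degrees) == n * delta, "degree sums must agree"
    rng = random.Random(seed)
    # One clone per (individual, copy); the clone carries the individual's label.
    clones = [i for i in range(n) for _ in range(delta)]
    # Shuffling the clones and filling the slots in order yields a uniformly
    # random perfect matching between clones and slots.
    rng.shuffle(clones)
    members, pos = [], 0
    for capacity in test_degrees:
        members.append(clones[pos:pos + capacity])
        pos += capacity
    return members

print(pairing_model(n=6, delta=2, test_degrees=[5, 4, 3]))
```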
To establish (5) we harness the pairing model. Namely, given we can think of each test as a bin of capacity . Moreover, we think of each clone , , of an individual as a ball. The ball is labelled . The random matching that creates effectively tosses the balls randomly into the bins. Hence, for and for let us write for the label of the th ball that ends up in bin number . Then we are left to calculate the probability that for every test either for every or there is at least one pair such that
[TABLE]
To calculate this probability we borrow a trick from the analysis of the random -SAT model [20]. Namely, we consider a new set of -valued random variables such that are mutually independent and such that
[TABLE]
for all . Due to their independence, these multinomially distributed random variables are much easier to handle than . It will turn out that, given a (not too unlikely) event, it suffices to analyse these independent variables instead of . Now, let be the event that
[TABLE]
i.e., that all of the sums on the l.h.s. are precisely equal to their expected values. Then given is distributed precisely as . Hence, (6) yields
[TABLE]
Thus, let
[TABLE]
The grand idea is now to calculate the probability . Subsequently, we employ Bayes’ Theorem to derive a bound for the conditional probability for which we know by the above application of the balls-into-bins principle
[TABLE]
Because the are mutually independent, we can easily compute the unconditional probability : by inclusion/exclusion,
[TABLE]
(the probability that , i.e., both tests positive, equals one minus the probability that minus the probability that plus the probability that ; then add the probability that , i.e., both tests negative).
Finally, to deal with the conditioning we use Bayes’ rule:
[TABLE]
Since the are independent, Stirling’s formula yields
[TABLE]
A short justification can be found in Section B.1. Moreover, by definition we have . Hence, (5) follows from (9)–(11). To complete the proof of the proposition, we claim that
[TABLE]
To prove Equation (12), let . Using Lemma 2.4 and recalling and , we find
[TABLE]
By the definition of and , we have
[TABLE]
Moreover, as we have . Thus (15) implies that (14) tends to zero with . Therefore, the proposition follows from Equations (14), (15) and Markov’s inequality.
∎
The argument from Proposition 3.1 does not extend to large overlaps (close to ) because the expression on the r.h.s. of (4) gets too large. In other words, merely computing the expected number of solutions with a given overlap does not do the trick. This ‘lottery phenomenon’ is ubiquitous in random constraint satisfaction problems: for big overlap values rare solution-rich instances drive up the expected number of solutions [4, 5]. Fortunately, we can find a remedy.
Proposition 3.2.
Let and and assume that . W.h.p. we have for all .
In order to cope with this issue we take another leaf out of the random CSP literature [2, 34]. Namely, we show that the solution is locally rigid. That is, the expansion properties of the random bipartite graph preclude the existence of other solutions that have a big overlap with . The following lemma holds the key to this effect.
Lemma 3.3.
For any there exists such that for all the following is true. Let be the event that for every with there are at least tests such that . Then .
Proof.
Let be a sequence of independent -variables as in Section 2. Also let as in Section 2. Proceeding along the lines of the proof of Proposition 2.3 (see (35) in Section B.6), we obtain
[TABLE]
Let be the number of infected individuals that are the only infected individual in fewer than of their tests, i.e.
[TABLE]
Moreover, let be a hypergeometric random variable with parameters (total eligible assignments for infected individuals), (tests with only one infected individual) and (number of tests per individual). Then the union bound over infected individuals yields
[TABLE]
Further, the Chernoff bound for the hypergeometric distribution implies
[TABLE]
Recall . Since and as and , we can choose small enough so that
[TABLE]
Finally, the assertion follows from (16)–(19). ∎
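The event described in Lemma 3.3 can be probed numerically on small instances. The following Python sketch is an illustration only and plays no role in the proof. It samples a design in which every individual joins a fixed number of tests drawn uniformly with replacement (the multi-graph model used here) and counts, for each infected individual, the tests in which it is the only infected member. The function names and the parameter names `n`, `k`, `m` and `delta` are assumptions made for this example.

```python
import random

def sample_design(n, m, delta, rng):
    """Hypothetical helper: each of n individuals joins delta tests,
    drawn uniformly from range(m) with replacement (multi-edges allowed)."""
    return [[rng.randrange(m) for _ in range(delta)] for _ in range(n)]

def sole_infected_counts(design, infected, m):
    """Map each infected individual to the number of distinct tests
    in which it is the only infected member."""
    infected_in = [set() for _ in range(m)]
    for x in infected:
        for a in design[x]:
            infected_in[a].add(x)
    return {
        x: sum(1 for a in set(design[x]) if infected_in[a] == {x})
        for x in infected
    }

if __name__ == "__main__":
    rng = random.Random(0)
    n, k, m, delta = 2000, 20, 400, 15  # illustrative sizes, not from the paper
    design = sample_design(n, m, delta, rng)
    infected = set(rng.sample(range(n), k))
    counts = sole_infected_counts(design, infected, m)
    # Smallest number of "sole infected" tests over all infected individuals.
    print(min(counts.values()), "out of", delta)
```

The minimum over all infected individuals is the quantity that the lemma bounds from below; the asymptotic statement itself is of course not verified by such a finite experiment.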
Hence, w.h.p. any infected individual appears in plenty of tests where all the other individuals are uninfected. This property causes to be locally rigid. To see why, consider the repercussions of just changing the status of a single individual from infected to uninfected. Because given the individual appears as the only infected individual in at least tests, in order to maintain the same test results we will also need to flip at least one individual in each of these tests from ‘uninfected’ to ‘infected’. Since tests typically have relatively few individuals in common, the necessary number of flips from [math] to will be . But then in order to keep the total number of infected individuals constant , we will need to perform another flips from to [math]. Yet given each of these ‘second generation’ individuals that we flip from infected to uninfected is itself the only infected individual in many tests. Thus, the single flip that we started from triggers a veritable avalanche of flips, which will stop only after the overlap has dropped significantly. The next lemma formalises this intuition. The lemma shows that while the unconditional expectation of is ‘too big’, the conditional expectation of given (as defined in Lemma 3.3) is much smaller. Let be the total number of negative tests.
Lemma 3.4.
Suppose that and let , . Then
[TABLE]
The proof of Lemma 3.4 is somewhat subtle, as we need to get a handle on the dependencies in ; it is included in Section C.1. To convey the intuition behind the expression in Lemma 3.4, the term accounts for the number of assignments of Hamming weight whose overlap with is equal to . The terms thereafter capture the probability that such an assignment exhibits the same test results as the true configuration . The first term provides a necessary condition for a positive test under to stay positive under . By Lemma 3.3, we know that every infected individual shows up in at least tests as the only infected individual. Now, there are infected under , but healthy under . For any of these tests, we need to have at least one individual that is healthy under , but infected under included in this test. Next, we need to ensure that any negative test under stays negative under . To this end, every individual included in a negative test under of which we have at least must be healthy under . The second term captures this probability.
Proof of Proposition 3.2.
In order to establish the proposition it suffices to show that there is such that
[TABLE]
Starting from the expression in Lemma 3.4, setting and recalling and , we obtain
[TABLE]
As long as , we find
[TABLE]
Moreover, . Thus, the expression (23) is of order
[TABLE]
Since (24) holds for any constant and any value of s.t. , it also holds for . Consequently (21) is established w.h.p. ∎
Propositions 3.1 and 3.2 readily imply that w.h.p. if . Hence, Corollary 2.2 shows that there exists an inference algorithm that given and outputs w.h.p. Up to now, the algorithm relies on exactly knowing the number of infected individuals , which in practice could be rather difficult to learn. Fortunately, this assumption can be removed. Namely, the following proposition shows that w.h.p. there is no assignment that is compatible with the test results and that has Hamming weight less than .
Proposition 3.5.
Let and and assume that . W.h.p. we have .
Proof.
To get started, suppose that and . We claim that for any value of , w.h.p. Indeed, from Proposition 2.3(1), we know that
[TABLE]
Recalling , the expression takes the minimum at . It follows that
[TABLE]
If for , then
[TABLE]
Now, the following two statements establish that if there does not exist a second satisfying set of Hamming weight , then w.h.p. there also does not exist a satisfying set with smaller Hamming weight.
First, we claim that if , w.h.p. there does not exist a satisfying configuration with Hamming weight smaller than the correct configuration, where the set of infected individuals is not a subset of the true set of infected individuals. To see why, suppose there existed a satisfying configuration with a smaller Hamming weight, whose infected individuals are not a subset of the true infected individuals. By (25), we know that for w.h.p. Therefore, we could construct a satisfying configuration of identical Hamming weight as the true configuration by flipping individuals in from healthy to infected. Observe that by the definition of , flipping individuals in does not change the test result. Therefore, we would be left with a second satisfying configuration of identical Hamming weight as the true configuration, a contradiction to Propositions 3.1 and 3.2.
Second, we argue that if , w.h.p. there does not exist a satisfying configuration with Hamming weight smaller than the correct configuration, where the set of infected individuals is a subset of the true set of infected individuals. Suppose there existed a satisfying configuration with a smaller Hamming weight, whose infected individuals are a subset of the true infected individuals. Then, the true configuration would need to contain individuals in , which can be flipped from infected to healthy without affecting the test result. However, Proposition 2.3(5) shows that for , w.h.p. ∎
As an immediate consequence of Proposition 3.5 we conclude that for the problem of inferring boils down to a minimum vertex cover problem, as previously conjectured by Aldridge, Baldassini and Johnson [12]. Namely, let be the set of all positive tests, i.e., all tests , , with . Moreover, let be the set of all variables such that ; in words, takes part in positive tests only. We set up a hypergraph with vertex set and hyperedges , . Clearly, the set of all individuals with provides a valid vertex cover of (as any positive test must feature an infected individual). Conversely, Propositions 3.1 and 3.2 show that w.h.p. this is the unique vertex cover of size , and Proposition 3.5 shows that there is no strictly smaller vertex cover w.h.p. Therefore, w.h.p. we can infer even without prior knowledge of by way of solving this minimum vertex cover instance.
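To make the reduction concrete, the following Python sketch builds the hypergraph from the positive tests and searches for a smallest vertex cover by brute force. It is a toy illustration under stated assumptions: the representation of the design as a list of test indices per individual, and the names `outcomes` and `min_hitting_set`, are hypothetical, and the exhaustive search takes exponential time, so it is only usable on very small instances.

```python
from itertools import combinations

def outcomes(design, infected, m):
    """A test is positive iff it contains at least one infected individual."""
    positive = [False] * m
    for x in infected:
        for a in design[x]:
            positive[a] = True
    return positive

def min_hitting_set(design, positive):
    """Return one smallest set of individuals that covers every positive test,
    using only individuals that take part in positive tests exclusively."""
    candidates = [x for x, tests in enumerate(design)
                  if all(positive[a] for a in tests)]
    # Hyperedges: for each positive test, the candidates it contains.
    members = {a: set() for a, pos in enumerate(positive) if pos}
    for x in candidates:
        for a in design[x]:
            members[a].add(x)
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            chosen = set(subset)
            if all(chosen & edge for edge in members.values()):
                return chosen
    return None  # only reached if some positive test has no candidate

if __name__ == "__main__":
    # Four individuals, four tests, two tests per individual (toy example).
    design = [[0, 1], [1, 2], [2, 3], [3, 0]]
    positive = outcomes(design, {0}, 4)
    print(min_hitting_set(design, positive))
```

In this toy instance individuals 1, 2 and 3 each appear in a negative test, so the only candidate is individual 0, and the cover returned coincides with the true infected set.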
4. The information-theoretic lower bound
We proceed with the negative statement that w.h.p. cannot be inferred if . In light of Corollary 2.2 in order to prove the first part of Theorem 1.1 we need to show that the number of assignments consistent with the test results is unbounded w.h.p. The proof of this fact is based on a very simple idea: we just identify a moderately large number of individuals whose infection status could be flipped without affecting the test results. The following lemma yields a bound on below which the number of such potential false positives () and negatives () abound.
Proposition 4.1.
Let and and assume that
[TABLE]
Then for any choice of we have w.h.p.
Proof.
Thanks to Lemma 2.6 we may assume that , for a constant as this choice minimizes the number of individuals in . Then Proposition 2.3(4) guarantees that for every such constant as long as , there are individuals in both and , which yields Proposition 4.1. ∎
As an immediate application we obtain the following information-theoretic lower bound.
Corollary 4.2.
Let and and assume that
[TABLE]
Then w.h.p.
Proof.
We need to exhibit alternative vectors with Hamming weight that render the same test results as . Thus, pick any and any and obtain from by setting and . By construction, has Hamming weight and renders the same test results. Hence, Proposition 4.1 shows that w.h.p. ∎
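The swap used in this proof can be replayed on a toy instance. The Python sketch below is an illustration under stated assumptions: it treats a potential false positive as a healthy individual that appears in positive tests only, and a potential false negative as an infected individual all of whose tests contain another infected individual. The function names and the data layout are hypothetical.

```python
def outcomes(design, infected, m):
    """A test is positive iff it contains at least one infected individual."""
    positive = [False] * m
    for x in infected:
        for a in design[x]:
            positive[a] = True
    return positive

def flippable_sets(design, infected, m):
    """Return (potential false positives, potential false negatives)."""
    infected_in = [set() for _ in range(m)]
    for x in infected:
        for a in design[x]:
            infected_in[a].add(x)
    # Healthy individuals whose tests are all positive.
    v0_plus = {x for x in range(len(design))
               if x not in infected and all(infected_in[a] for a in design[x])}
    # Infected individuals never alone among the infected in any of their tests.
    v1_plus = {x for x in infected
               if all(len(infected_in[a]) >= 2 for a in design[x])}
    return v0_plus, v1_plus

if __name__ == "__main__":
    design = [[0, 1], [1, 2], [0, 2], [0, 1], [2, 3]]  # toy design, 4 tests
    infected = {0, 1, 2}
    v0_plus, v1_plus = flippable_sets(design, infected, 4)
    x, y = min(v0_plus), min(v1_plus)
    swapped = (infected - {y}) | {x}
    # Same Hamming weight, same test results, different configuration.
    print(outcomes(design, infected, 4) == outcomes(design, swapped, 4))
```

Removing `y` leaves every test it touched positive, and adding `x` only touches tests that are positive already, which is why the two result vectors agree.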
The bound (26) matches for . A simpler, purely information-theoretic argument covers the remaining .
Proposition 4.3.
Let , . If , then w.h.p.
Proof.
This proposition follows from the classical information-theoretic lower bound for the group testing problem. Namely, tests allow for possible test results. Hence, if
[TABLE]
then the number of possible test results is far smaller than the number of vectors with Hamming weight . Therefore, w.h.p. there exists an unbounded number of vectors of Hamming weight that render the same test results as . ∎
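For finite parameters the counting argument reduces to a pigeonhole comparison between the number of result vectors and the number of configurations. The following Python sketch computes the smallest number of tests for which the result vectors are at least as numerous as the configurations of a given Hamming weight. It is a non-asymptotic illustration only; the function name and the parameter names `n` and `k` are assumptions, and the constant in the asymptotic condition above is not reproduced.

```python
from math import comb

def counting_lower_bound(n, k):
    """Smallest m with 2**m >= C(n, k). With fewer tests, two distinct
    weight-k configurations must share a result vector (pigeonhole)."""
    # For a positive integer c, (c - 1).bit_length() equals ceil(log2(c)).
    return (comb(n, k) - 1).bit_length()

if __name__ == "__main__":
    for n, k in [(100, 2), (1000, 10), (10**6, 100)]:
        print(n, k, counting_lower_bound(n, k))
```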
We thus conclude that for all , w.h.p. if . Therefore, the desired information-theoretic lower bound follows from Corollary 2.2.
5. The SCOMP algorithm
For we have and thus Theorem 1.1 implies that SCOMP as described in Section 1.3 w.h.p. fails to infer for . Therefore, we are left to establish Theorem 1.2 for , in which case
[TABLE]
The proof of Theorem 1.2 for hinges on two propositions. First we show that below , the set of infected individuals that the second step of SCOMP identifies correctly is empty. Formally, with from (3), let
[TABLE]
Proposition 5.1.
Suppose that and . If , then for all we have w.h.p.
The proofs of Propositions 5.1 and 5.2 are based on moment calculations that turn out to be mildly subtle due to the potentially very large degrees of the underlying graph . The technical workout is included in Sections D.1 and D.2.
With the second step of SCOMP failing to ‘explain’ (viz. cover) any positive tests, the greedy vertex cover algorithm takes over. This algorithm is applied to the hypergraph whose vertices are the as yet unclassified individuals and whose edges are the neighbourhoods of the positive tests. Our second lemma shows that the set of potentially false positive individuals that participate in the maximum number of different tests is far greater than the actual number of infected individuals. Formally, let
[TABLE]
Proposition 5.2.
Suppose that and . If , then for all constant we have w.h.p.
We complete the proof of Theorem 1.2 as follows.
Proof of Theorem 1.2.
The first step of SCOMP (correctly) marks all individuals that appear in negative tests as healthy. Moreover, Proposition 5.1 implies that the second step of SCOMP is void w.h.p., because there is no single infected individual that appears in a test whose other individuals have already been identified as healthy by the first step. Consequently, SCOMP simply applies the greedy vertex cover algorithm. Now, thanks to Proposition 5.2 it suffices to prove that SCOMP will fail w.h.p. if . Because they belong to positive tests only, all the individuals of are present in the vertex cover instance that SCOMP attempts to solve. Moreover, in the hypergraph no vertex has degree greater than , because the degrees of in are equal to . (Some of the hypergraph degrees may be strictly smaller than because is a multi-graph.) Therefore, since while the actual set of infected individuals only has size , w.h.p. the individual classified as infected by the very first step of the greedy set cover algorithm belongs to . Hence, this individual is not actually infected, i.e., SCOMP errs w.h.p. ∎
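The three steps referred to in this proof can be sketched in Python as follows. This is a minimal sketch based only on the description given in the proof above, not a reproduction of the definition in Section 1.3: the data layout, the function name `scomp`, and the tie-breaking rule (smallest index among individuals covering the most unexplained tests) are assumptions.

```python
def scomp(design, positive):
    """design[x] lists the tests of individual x; positive[a] is the result of test a.
    Returns the set of individuals classified as infected."""
    n = len(design)
    # Step 1: everyone who appears in a negative test is healthy.
    healthy = {x for x in range(n) if any(not positive[a] for a in design[x])}
    # Remaining (unclassified) members of each positive test.
    members = {a: set() for a, pos in enumerate(positive) if pos}
    for x in range(n):
        if x not in healthy:
            for a in design[x]:
                members[a].add(x)
    # Step 2: a sole remaining member of a positive test must be infected.
    infected = {next(iter(s)) for s in members.values() if len(s) == 1}
    # Step 3: greedy cover of the positive tests that are still unexplained.
    unexplained = {a for a, s in members.items() if not (s & infected)}
    while unexplained:
        candidates = sorted(set().union(*(members[a] for a in unexplained)))
        if not candidates:
            break  # inconsistent input: a positive test with no possible cause
        best = max(candidates,
                   key=lambda x: sum(1 for a in unexplained if x in members[a]))
        infected.add(best)
        unexplained = {a for a in unexplained if best not in members[a]}
    return infected

if __name__ == "__main__":
    design = [[0, 1], [2, 3], [0, 2], [1, 3]]  # toy design, every test positive
    print(scomp(design, [True, True, True, True]))
```

In the toy run no test has a single unclassified member, so step 2 contributes nothing and the output is determined entirely by the greedy step, which is the regime analysed in the proof.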
Since the success probability of the SCOMP algorithm is at least as high as that of the DD algorithm, we can prove the conjecture of [30] regarding the upper bound of the DD algorithm.
Corollary 5.3.
If , the DD algorithm will fail to retrieve the correct set of infected individuals w.h.p.
Acknowledgment
We thank Arya Mazumdar for bringing the group testing problem to our attention.
Appendix A Notation
The following sections contain the proofs of the lemmas omitted so far.
Appendix B Preliminaries
B.1. Preliminaries
We start by stating the Chernoff bound as applied in this paper.
Lemma B.1 (Chernoff bound, [29] (Section 2.1)).
Let be a binomially-distributed random variable with . Further, let
[TABLE]
Then for some ,
[TABLE]
As an application, we readily find
[TABLE]
Next, we justify that the Stirling approximation of Section 3 is accurate. Namely, let be -valued random variables such that are mutually independent and such that
[TABLE]
for all . As before, we denote by the event that
[TABLE]
i.e., that all of the sums on the l.h.s. are precisely equal to their expected values. Since the are independent, Stirling’s formula yields
[TABLE]
This can be seen as follows. For the sake of brevity, define
[TABLE]
As is a family of independent multinomial variables
[TABLE]
we find
[TABLE]
Hence, the probability of event occurring is the probability that hits its expectation. Thus, using the very basic approximation we find
[TABLE]
where (29) follows immediately from and directly implies (28). In due course we apply similar calculations repeatedly; some of them involve conditional probabilities. These conditions merely restrict to take specific (common) values, and the above argument is clearly invariant under different values of , as long as .
B.2. Getting started
In the next step, recall that neighbourhoods of different tests in the random multi-graph sizeably intersect. To cope with the ensuing correlations, we introduce a new family of random variables that, as we will see, are closely related to the statistics of the appearances of infected/uninfected individuals in the various tests. Specifically, recalling that signifies the degree of test and that , let be a sequence of independent -variables. Moreover, let
[TABLE]
Because the are mutually independent, Stirling’s formula shows that
[TABLE]
which follows along the lines of Section B.1. Additionally, let be the number of edges that connect test with an infected individual. (Since is a multi-graph, it is possible that an infected individual contributes more than one to .) Further, let be the -algebra generated by the random variables . Whenever we condition on , we assume that the bounds from Lemmas 2.4 and 2.5 hold.
Lemma B.2.
Given , the vectors and given are identically distributed.
Proof.
For any integer sequence with and we have
[TABLE]
Hence, for any sequences we obtain
[TABLE]
as claimed. ∎
B.3. Proof of Lemma 2.4
Since each variable draws a sequence of tests uniformly at random, for every the degree has distribution . Therefore, the assertion follows from the Chernoff bound.
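The distributional claim can be illustrated with a short simulation. Under the with-replacement model described here, each of the `n * delta` draws lands in a fixed test with probability `1 / m`, so the degree of a test counted with multiplicity is binomial with mean `n * delta / m`. The Python sketch below samples the degree sequence; the function name and parameter names are assumptions made for this example.

```python
import random
from collections import Counter

def sample_test_degrees(n, m, delta, rng):
    """Degrees (with multiplicity) of the m tests when each of n individuals
    draws delta tests uniformly at random with replacement."""
    deg = Counter()
    for _ in range(n):
        for _ in range(delta):
            deg[rng.randrange(m)] += 1
    return [deg[a] for a in range(m)]

if __name__ == "__main__":
    n, m, delta = 5000, 200, 10  # illustrative sizes
    degs = sample_test_degrees(n, m, delta, random.Random(0))
    mean = n * delta / m
    # Largest relative deviation of a test degree from the mean.
    print(max(abs(d - mean) / mean for d in degs))
```

The printed deviation shrinks as the mean degree grows, which is the concentration that the Chernoff bound quantifies.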
B.4. Proof of Lemma 2.5
Let . Then . Hence, Lemma 2.4 shows that with probability ,
[TABLE]
Because the are mutually independent, is a binomial variable. Therefore, the Chernoff bound (e.g. Lemma B.1) shows that
[TABLE]
Finally, the assertion follows from (30), (31)–(34) and Lemma B.2.
B.5. Proof of Lemma 2.6
The expected degree of a test equals . Therefore, if , then by Lemma 2.5, w.h.p. To exploit this fact, call of Hamming weight bad for if given we indeed have . Let be the set of all such bad . Then w.h.p. has the property that , i.e. asymptotically most configurations will have few positive tests. Now, condition on the event that and let be the set of all subsets of of size . Further, let map to the corresponding set of positive tests. Finally, let be the set of all such that , i.e. the set of all configurations for which there are fewer than other configurations rendering the same test results. Then
[TABLE]
Consequently, w.h.p. over the choice of and we have . The same argument applies for with the term ‘positive test’ replaced by ‘negative test’.
B.6. Proof of Proposition 2.3
We start by proving part (1) using a straightforward second-moment calculation. Recall and . Lemma 2.4 and Lemma 2.5 show that with probability at least the total degree of the negative tests comes to
[TABLE]
Consequently, with probability at least the total number of edges between and the set of positive tests is . Moreover, the total number of edges between and all tests comes down to . Given these events and since each individual is assigned to tests uniformly at random with replacement, the probability that a given belongs to comes out as
[TABLE]
Next, we estimate the probability that both belong to :
[TABLE]
Hence, . Therefore, the assertion follows from Chebyshev’s inequality.
Proceeding with part (2), let the number of tests containing a single infected individual be
[TABLE]
Then Lemma 2.4 shows that w.h.p.
[TABLE]
Analogously,
[TABLE]
Hence, because is a binomial random variable, the Chernoff bound (e.g. Lemma B.1) shows that
[TABLE]
Therefore, (30) yields
[TABLE]
Now, let be the number of that are not adjacent to any test with precisely one positive individual. An individual counts towards , if out of all possible assignments , it is only assigned to those tests where it is not the only infected individual (there are a total of such assignments). Using the notation and recalling , the bound on yields
[TABLE]
By a similar token we obtain
[TABLE]
Therefore, Chebyshev’s inequality shows that w.h.p.
[TABLE]
To complete the proof we need to compare and . Clearly, . But the inequality may be strict because includes positive individuals that appear twice in the same test. To be precise, an individual might be assigned to one test twice as the only infected individual. Such an individual should not be in , but it shows up in . Indeed, letting be the number of such individuals, we obtain . Hence, we are left to estimate . To this end, we observe that the probability that an individual appears in a specific test twice is upper-bounded by . Recall and . Consequently, taking the union bound over all tests and infected individuals we obtain
[TABLE]
Since by assumption the r.h.s. of (36) is , we conclude that w.h.p., as claimed.
Next, we consider (3). Define as in the proof of Proposition 2.3(2). Then we know that . Hence, if then due to (36).
For part (4), we observe for a given that is attained at . To see this, consider the function and observe that the minimum of coincides with the minimum of . Letting , the derivatives read as
[TABLE]
For , the unique maximum is attained at and accordingly, . Furthermore, it is the case that and therefore by Proposition 2.3(2), . By a similar token by Proposition 2.3(1), .
Finally, for part (5), setting , we see that and therefore by Proposition 2.3(3), .
Appendix C The information-theoretic upper bound
C.1. Proof of Lemma 3.4
The term accounts for the number of assignments of Hamming weight whose overlap with is equal to . Hence, with being the event that one specific that has overlap with belongs to , we need to show that
[TABLE]
Due to symmetry we may assume that and that .
Proceeding as in the proof of Proposition 3.1, we think of each test as a bin of capacity and of each clone , , of an individual as a ball labelled . We toss the balls randomly into the bins. For and for we let be the label of the th ball that ends up in bin number . To cope with this experiment we introduce a new set of -valued random variables such that are mutually independent and
[TABLE]
for all . With being the event that
[TABLE]
the vector given is distributed as given . Moreover, with similar arguments as in Section B.1, Stirling’s formula yields
[TABLE]
Let be the set of indices such that . Moreover, let be the set of all indices for which there exists precisely one such that and such that for this index we have . Further, let
[TABLE]
Then
[TABLE]
Furthermore, given the events are independent and
[TABLE]
For an intuitive explanation of the above expressions, please refer to the discussion immediately following the statement of Lemma 3.4. Given and , we obtain
[TABLE]
Moreover, we find by Lemma 3.3, the concentration of and the fact that
[TABLE]
and thus
[TABLE]
Combining (40)–(41) and using the trivial bound
[TABLE]
we obtain by Bayes’ theorem
[TABLE]
Because given is distributed as given , (37) follows from (43).
Appendix D The SCOMP algorithm
D.1. Proof of Proposition 5.1
The proof of Proposition 5.1 proceeds in three steps. First, we show that is concentrated around its expectation. denotes the corresponding event. Second, we need to get a handle on the subtle dependencies in . To this end, we introduce a set of independent multinomial random variables indexed over the tests. Whereas denotes the number of infected, potentially false positive and definitively healthy individuals in test , respectively, the triple denotes the corresponding multinomial random variable. We will show that conditioned on the sum of hitting the total number of individuals of the three types, is distributed like . The technical workout is delicate, but is based on standard results from balls-into-bins experiments. Third, we show that for , the number of tests for which and decays exponentially in , which implies that w.h.p.
Proof.
Lemma 2.6 implies that the optimal choice for the variable degree is for a constant . Let be the number of positive tests and, w.l.o.g., assume that are the positive tests and define
[TABLE]
as the event that the number of ‘potential false positives’ is highly concentrated around its mean. Then by Proposition 2.3(1), we find
[TABLE]
Similarly as before, we introduce a family of independent random variables corresponding to the tests.
Let be the number of ones in the tests corresponding to respectively. Let count the occurrences in . Let count the occurrences in . By definition we find . We introduce auxiliary variables , such that have distribution
[TABLE]
a multinomial distribution conditioned on the first variable being at least one. The triples are mutually independent. We seek a choice of satisfying the equation
[TABLE]
and will show following equation (48) that such a choice exists. Define
[TABLE]
Along the lines of Section B.1, Stirling’s formula implies
[TABLE]
Moreover, and given are identically distributed. This can be seen as follows:
[TABLE]
Thus, given and for all , we find
[TABLE]
Given , we find:
[TABLE]
where the last equality follows from the fact that we conditioned on . Since the first terms are independent of , we find
[TABLE]
Therefore, given we have by comparison with (46),
[TABLE]
which yields the claim. Let
[TABLE]
be the number of positive tests that contain exactly one infected individual and no healthy individuals in . Note that this split is the only possibility for the test to be positive. Then
[TABLE]
By Lemma 2.5 we readily find for any choice of that
[TABLE]
Hence,
[TABLE]
Moreover, since is a binomial random variable, the Chernoff bound (e.g. Lemma B.1) shows that
[TABLE]
Further, Lemma 2.4 yields approximations for and . Now assume that . Using a similar reformulation as in (47), we find that . Thus, we have
[TABLE]
As Lemma 2.6 shows, the optimal value of is a constant. For a fixed the same that maximizes in (48), also maximizes . This maximum is attained at . Consequently and
[TABLE]
Hence,
[TABLE]
As before, we find w.h.p. since and and Markov’s inequality leads to . Proposition 5.1 follows.
∎
D.2. Proof of Proposition 5.2
By Proposition 2.3, we have for . To prove Proposition 5.2, we need to show that for such , we also have . We proceed in two steps. First, we show that every individual is assigned to at least distinct tests. Second, we show that a constant fraction of individuals are assigned to exactly tests, establishing Proposition 5.2.
Proof.
Let be the number of distinct neighbors of a vertex . We claim that w.h.p. the following statements are true.
[TABLE]
The probability that a given appears times in the same test is upper-bounded by
[TABLE]
provided that . Moreover, the probability that appears in one test twice is upper-bounded by . Thus, the probability that appears in at least tests at least twice is upper-bounded by
[TABLE]
provided that and since and . The bound follows.
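The effect of repeated draws can be checked numerically. The Python sketch below compares the exact probability that an individual's draws (uniform, with replacement) are not all distinct with the standard union bound over pairs of draws. The specific bounds used in the proof above are not reproduced here; the function names and the parameter names `m` and `delta` are assumptions made for this example.

```python
from math import comb

def prob_some_repeat(m, delta):
    """Exact probability that delta uniform draws from m tests
    (with replacement) hit at least one test more than once."""
    p_distinct = 1.0
    for i in range(delta):
        p_distinct *= (m - i) / m
    return 1.0 - p_distinct

def pairwise_union_bound(m, delta):
    """Union bound: each of the C(delta, 2) pairs of draws collides
    with probability 1/m."""
    return comb(delta, 2) / m

if __name__ == "__main__":
    for m, delta in [(1000, 10), (10**5, 50), (10**6, 100)]:
        print(m, delta, prob_some_repeat(m, delta), pairwise_union_bound(m, delta))
```

When the number of tests per individual is small compared with the square root of the number of tests, both quantities are small, so most individuals have as many distinct tests as draws.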
By Proposition 2.3, we know that for , w.h.p. Since the SCOMP algorithm in its third stage selects the individual with the highest number of adjacent unexplained tests, we are left to show that also , which implies that w.h.p. we erroneously classify a healthy individual as infected. The prior bounds ensure that each individual is in at least tests. The question remains which fraction of individuals in are in . In principle, it could be the case that most potentially false positive individuals of appear in fewer than different tests. Indeed, it is more likely for such an individual in to be in fewer than different tests since each additional test increases the probability for such an individual to be assigned to a negative test. However, we claim that a constant fraction of all potentially false positive individuals in will have degree , and thus be in . To see this, let be the maximum proportion of and for , i.e.
[TABLE]
By conditioning on a test degree sequence , we find
[TABLE]
as long as , which by Lemma 2.6 we can safely assume. Since each individual in is in at least different tests and the probability of being in any number of different tests is constant, a constant fraction of individuals in will be in exactly tests. Since , the claim follows. ∎
References
- [1] E. Abbe: Community detection and stochastic block models: recent developments. The Journal of Machine Learning Research 18 (2017) 6446–6531.
- [2] D. Achlioptas, A. Coja-Oghlan: Algorithmic barriers from phase transitions. Proc. 49th FOCS (2008) 793–802.
- [3] D. Achlioptas, A. Coja-Oghlan, F. Ricci-Tersenghi: On the solution space geometry of random formulas. Random Structures and Algorithms 38 (2011) 251–268.
- [4] D. Achlioptas, C. Moore: Random $k$-SAT: two moments suffice to cross a sharp threshold. SIAM Journal on Computing 36 (2006) 740–762.
- [5] D. Achlioptas, A. Naor, Y. Peres: Rigorous location of phase transitions in hard optimization problems. Nature 435 (2005) 759–764.
- [6] D. Achlioptas, Y. Peres: The threshold for random $k$-SAT is $2^{k}\log 2-O(k)$. Journal of the AMS 17 (2004) 947–973.
- [7] A. Alaoui, A. Ramdas, F. Krzakala, L. Zdeborová, M. Jordan: Decoding from pooled data: Sharp information-theoretic bounds. SIAM Journal on Mathematics of Data Science 1 (2019) 161–188.
- [8] A. Alaoui, A. Ramdas, F. Krzakala, L. Zdeborová, M. Jordan: Decoding from pooled data: Phase transitions of message passing. IEEE Transactions on Information Theory 65 (2019) 572–585.
