Finding a planted clique by adaptive probing
Miklós Z. Rácz, Benjamin Schiffer

TL;DR
This paper investigates the query complexity of detecting and finding a planted clique in a random graph, establishing bounds on the number of adaptive edge queries needed for both tasks.
Contribution
It provides nearly tight bounds on the number of adaptive queries required for detection and finding of planted cliques, highlighting the query complexity in this problem.
Findings
Detection requires roughly n^2 / k^2 queries (up to logarithmic factors).
Roughly (n^2 / k^2) log^2 n + n log n adaptive queries suffice to find the clique.
No algorithm that makes o(n^2 / k^2 + n) adaptive queries can reliably find the clique.
Miklós Z. Rácz, Princeton University. Research supported in part by NSF grant DMS 1811724.
Benjamin Schiffer, Princeton University.
Abstract
We consider a variant of the planted clique problem where we are allowed unbounded computational time but can only investigate a small part of the graph by adaptive edge queries. We determine (up to logarithmic factors) the number of queries necessary both for detecting the presence of a planted clique and for finding the planted clique.
Specifically, let $G \sim G(n,1/2,k)$ be a random graph on $n$ vertices with a planted clique of size $k$. We show that no algorithm that makes at most $q = o(n^2/k^2 + n)$ adaptive queries to the adjacency matrix of $G$ is likely to find the planted clique. On the other hand, when $k \geq (2+\epsilon)\log_2 n$ there exists a simple algorithm (with unbounded computational power) that finds the planted clique with high probability by making $q = O((n^2/k^2)\log^2 n + n\log n)$ adaptive queries. For detection, the additive $n$ term is not necessary: the number of queries needed to detect the presence of a planted clique is $n^2/k^2$ (up to logarithmic factors).
1 Introduction
In the planted clique problem the goal is to find a clique that is planted within an Erdős-Rényi random graph. This problem has received widespread attention in the past few decades because there exists a (wide) range of clique sizes for which it is information-theoretically possible to find the planted clique but there are no known polynomial-time algorithms to do so [20, 23, 1, 14, 11, 12]. In this regime it is conjectured to be computationally hard to find the planted clique and this conjecture forms the basis of numerous average-case complexity results in recent years [6, 5, 18, 9, 8].
In this paper we consider a variant of the planted clique problem where we are allowed unbounded computational time but can only investigate a small part of the graph by adaptive edge queries. We consider the problems of detection and estimation under this model, and determine (up to logarithmic factors) the number of queries necessary both for detecting the presence of a planted clique and for finding the planted clique.
In the problems we consider there is an underlying $n$ vertex graph $G$ with vertex set $[n] := \{1, 2, \dots, n\}$. The algorithms that we consider are allowed unbounded computational power but we restrict the number of edges they are allowed to inspect. Specifically, we consider algorithms that evolve dynamically over a certain number of steps. In the first step, the algorithm chooses a pair $(i_1, j_1)$, $i_1 \neq j_1$, and asks whether this pair is an edge or not. Depending on the outcome, the algorithm selects a second pair $(i_2, j_2)$, $i_2 \neq j_2$, and asks whether this pair is an edge or not. It then selects $(i_3, j_3)$, and so on. The algorithm may ask $q$ such edge queries and use unbounded computational time to produce an output.
The detection problem can be phrased as a simple hypothesis testing problem. Under the null hypothesis $H_0$, the graph $G$ is an Erdős-Rényi random graph with edge density $1/2$. Under the alternative hypothesis $H_1$, the graph $G$ is drawn from the planted clique model with clique size $k$. That is, we first choose a (uniformly) random subset $K$ of the vertices of size $k$, we connect all pairs of vertices in $K$ (that is, the vertices in $K$ form a clique), and every other pair of vertices is connected independently with probability $1/2$. In short:
$$H_0 : G \sim G(n, 1/2), \qquad H_1 : G \sim G(n, 1/2, k). \tag{1}$$
We denote the two probability distributions over $n$ vertex graphs by $\mathbb{P}_0$ and $\mathbb{P}_1$, respectively. An algorithm $A$ for detection under the adaptive edge query model makes up to $q$ adaptive edge queries to $G$ and then outputs a hypothesis in $\{H_0, H_1\}$. We measure the performance of an algorithm by its risk, which is defined as the sum of its type I and type II errors:
$$R(A) := \mathbb{P}_0(A \text{ outputs } H_1) + \mathbb{P}_1(A \text{ outputs } H_0).$$
If an algorithm $A$ achieves vanishing risk, that is, $R(A) \to 0$ as $n \to \infty$, then we say that $A$ can detect the presence of a planted clique; otherwise, we say that it cannot do so.
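To make the query model concrete, here is a minimal Python sketch of an oracle that answers adaptive edge queries and counts them. The names `EdgeQueryOracle` and `sample_planted_instance` are hypothetical, vertices are labeled from 0 rather than 1, and edges are sampled lazily with density 1/2; this is an illustration of the model, not code from the paper.

```python
import random


class EdgeQueryOracle:
    """Lazily sampled graph that answers edge queries and counts distinct ones."""

    def __init__(self, n, clique=None, seed=None):
        self.n = n
        # Empty clique corresponds to the null model (pure Erdos-Renyi graph).
        self.clique = frozenset(clique) if clique is not None else frozenset()
        self._rng = random.Random(seed)
        self._answers = {}  # cache: a repeated query returns the same answer
        self.num_queries = 0

    def query(self, u, v):
        """Return True if {u, v} is an edge of the hidden graph."""
        if u == v:
            raise ValueError("a query must name two distinct vertices")
        pair = (min(u, v), max(u, v))
        if pair not in self._answers:
            self.num_queries += 1
            if u in self.clique and v in self.clique:
                self._answers[pair] = True  # clique pairs are always edges
            else:
                self._answers[pair] = self._rng.random() < 0.5  # fair coin
        return self._answers[pair]


def sample_planted_instance(n, k, seed=None):
    """Draw a uniformly random k-subset as the clique and wrap it in an oracle."""
    rng = random.Random(seed)
    clique = rng.sample(range(n), k)
    return EdgeQueryOracle(n, clique=clique, seed=rng.getrandbits(32))
```

An algorithm in this model only ever calls `query`, and `num_queries` records how much of the graph it has inspected.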
The following theorem determines (up to logarithmic factors) the number of queries necessary to detect the presence of a planted clique. All logarithms in this paper are in base 2.
Theorem 1 (Detecting a planted clique).
Consider the hypothesis testing problem in (1).
(a) Let $q = o(n^2/k^2)$ as $n \to \infty$. If an algorithm $A$ makes at most $q$ adaptive edge queries then its risk must satisfy $R(A) \geq 1 - o(1)$ as $n \to \infty$.
(b) Suppose that $k \geq (2+\epsilon)\log n$ for some constant $\epsilon > 0$ and let $\delta > 0$ be arbitrary. There exists an algorithm (with unlimited computational power) that can detect the presence of a planted clique by querying
[TABLE]
pairs of vertices. Moreover, the queries can be nonadaptive.
If we can detect the presence of a planted clique, the natural next goal is to find it. The following theorem determines (up to logarithmic factors) the number of queries necessary to find the planted clique. In particular, it shows that an extra $O(n \log n)$ queries suffice compared to detection and that this is tight (up to logarithmic factors).
Theorem 2 (Finding the planted clique).
Let , where .
(a) Let $q = o(n^2/k^2 + n)$ as $n \to \infty$. No algorithm that makes at most $q$ adaptive edge queries can find the planted clique. That is, any estimator $\widehat{K}$ of the planted clique $K$ that is based on at most $q$ adaptive edge queries satisfies $\mathbb{P}(\widehat{K} = K) \to 0$ as $n \to \infty$.
(b) Suppose that $k \geq (2+\epsilon)\log n$ for some constant $\epsilon > 0$ and let $\delta > 0$ be arbitrary. There exists an algorithm (with unlimited computational power) that adaptively queries
[TABLE]
pairs of vertices and finds the planted clique with probability $1 - o(1)$ as $n \to \infty$.
Theorems 1 and 2 give a complete phase diagram of when detection and estimation are possible as a function of the clique size $k$ and the number of queries $q$ (up to some boundary cases). A natural parametrization is to take both $k$ and $q$ to be polynomial in $n$: $k = n^{\alpha}$ and $q = n^{\beta}$ for some $\alpha \in (0,1)$ and $\beta \in (0,2)$. Corollary 3 summarizes the results with this parametrization; see also Figure 1 for an illustration. Note in particular the region of the phase space where detection is possible but estimation is not. Note also that the conjectured computational threshold is at $\alpha = 1/2$.
Corollary 3 (Phase diagram).
Suppose that $k = n^{\alpha}$ and $q = n^{\beta}$ for some $\alpha \in (0,1)$ and $\beta \in (0,2)$.
(a) If $\beta < 2 - 2\alpha$, then detecting the presence of a planted clique is impossible.
(b) If $\beta > 2 - 2\alpha$ and $\beta < 1$, then it is possible to detect the presence of a planted clique, but it is impossible to find the planted clique.
(c) If $\beta > 2 - 2\alpha$ and $\beta > 1$, then it is possible to find the planted clique.
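The three regions of the phase diagram can be read off from the thresholds $n^2/k^2$ (detection) and $n^2/k^2 + n$ (finding). The following Python sketch encodes that reading, assuming the clique size is $n^{\texttt{alpha}}$ and the query budget is $n^{\texttt{beta}}$. The function name `phase` and the exponent names are labels chosen for this illustration, and points on a boundary are reported separately because the theorems do not settle them.

```python
def phase(alpha, beta):
    """Classify (clique exponent, query exponent) into a region of the phase diagram."""
    detection_exponent = 2 - 2 * alpha  # n^2 / k^2 = n^(2 - 2*alpha)
    if beta < detection_exponent:
        return "detection impossible"
    if beta > detection_exponent and beta < 1:
        return "detection possible, finding impossible"
    if beta > detection_exponent and beta > 1:
        return "finding possible"
    return "boundary case"
```

For example, `phase(0.75, 0.75)` lands in the intermediate region: the budget exceeds $n^{1/2}$, which is enough for detection, but stays below $n$, which is too small for finding.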
These results raise several open questions. First, can the logarithmic factors be closed, to obtain results that are tight up to constant factors? Second, how do these results change if we plant a different subgraph instead of a clique? For instance, one can plant a dense random graph with . Finally, while we neglected all computational considerations in this paper, are there any connections to average-case computational hardness? We leave exploring these questions to future work.
The rest of this paper is outlined as follows. After discussing motivation and related work in the remainder of the introduction, we turn to algorithms for detection and estimation in Section 2, proving Theorem 1(b) and Theorem 2(b). Finally, we prove the impossibility results of Theorem 1(a) and Theorem 2(a) in Section 3.
1.1 Motivation
There are several potential applications where understanding the query complexity necessary to finding cliques may be of interest. For instance, in scientific applications one may wish to find closely related entities (corresponding to a clique or dense subgraph), and querying an edge may correspond to performing a physical experiment which is costly and/or time-consuming.
Another potential application is to the analysis of social media connections. Here the nodes of a graph represent individuals and the edges represent connections between individuals such as Facebook friends, Twitter following, or LinkedIn connections. Access to these connections may be expensive to obtain or limited (due to privacy limitations or any other source of incomplete information), and hence query complexity may be relevant when trying to reconstruct a specific close-knit group within the network.
The planted clique problem and related subgraph inference problems have been used in a variety of applications, including biological networks [29], cryptography [21], and finance [4]. Obtaining full information about the underlying networks in these applications may not be possible due to queries being expensive and/or limited, and hence the planted clique problem with limited adaptive probing could be relevant to these same applications.
1.2 Related work
This paper is a natural follow-up to the recent work of Feige, Gamarnik, Neeman, Rácz, and Tetali [13], where the authors consider the problem of finding cliques in an Erdős-Rényi random graph under the same adaptive edge query model. While the largest clique in an Erdős-Rényi random graph with edge density $1/2$ has size approximately $2\log n$, the current best algorithm that makes at most $n^{\delta}$ adaptive edge queries finds a clique of size approximately $(1+\delta/2)\log n$. Feige et al. [13] show an impossibility result if the adaptivity of the algorithm is limited: any algorithm that makes $n^{\delta}$ edge queries ($\delta < 2$) in $\ell$ rounds finds cliques of size at most $(2-\epsilon)\log n$ where $\epsilon = \epsilon(\delta, \ell) > 0$. Very recently, Alweiss, Ben Hamida, He, and Moreira [2] improved upon this result, showing that there exists such an $\epsilon$ that depends only on $\delta$ and not on $\ell$. However, closing the gap between the upper and lower bounds remains an open problem.
Several recent works consider finding structure in a random graph under such an adaptive edge query model. Ferber, Krivelevich, Sudakov, and Vieira studied finding a Hamilton cycle [16] and finding long paths [17], while Conlon, Fox, Grinshpun, and He [10] studied finding a copy of a fixed target graph (such as a constant size clique). All of these works focus on sparse random graphs.
As mentioned in the introduction, the planted clique problem has been studied from many angles in the past few decades [20, 23, 1, 14, 11, 12, 6, 5, 18, 9, 8]. To the best of our knowledge, it has not been considered under an edge query model before. It would be interesting to see if there are any connections to computational aspects of the planted clique problem. The recent work of Mardia, Ali, and Chandrasekher [25] develops sublinear time algorithms for finding the planted clique in the regime $k = \omega(\sqrt{n \log\log n})$ and makes such connections. (Here the input size is $\Theta(n^2)$, hence sublinear refers to $o(n^2)$.) As the authors point out, our results imply an $\Omega(n)$ running time lower bound for finding the planted clique, which shows that their algorithms are optimal (up to logarithmic factors) whenever $k = \Omega(n^{2/3})$.
Finally, we mention that query complexity arises naturally in many other areas, such as clustering [30, 27, 28], where answers to queries are often noisy due to them being crowdsourced, and community detection [19, 3], where the evolution of the underlying graph necessitates repeated queries. Statistical queries have also been widely studied [22], including in the setting of the planted clique model [15]. More generally, our work fits into the framework of online learning, a large and rapidly growing area which is beyond the scope of this article to survey.
2 Algorithms
We start with a simple sampling-based algorithm to detect the presence of a planted clique. This is contained in Section 2.1 and proves Theorem 1(b). We then extend this algorithm in Section 2.2 to find all vertices of the planted clique, thus proving Theorem 2(b).
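First, recall that the largest clique in an Erdős-Rényi random graph has size approximately $2\log n$. In fact, very precise results are known. Define $\omega_n := 2\log n - 2\log\log n + 2\log(e/2) + 1$. Matula [26] showed that for any $\epsilon > 0$, the clique number $\omega(G)$ (the size of the largest clique) of a random graph $G$ drawn from $G(n,1/2)$ satisfies $\lfloor \omega_n - \epsilon \rfloor \leq \omega(G) \leq \lfloor \omega_n + \epsilon \rfloor$ with probability tending to $1$ as $n \to \infty$; see also [7]. For our purposes much weaker estimates suffice. Indeed, a first moment argument shows that $\mathbb{P}(\omega(G) \geq 2\log n) \to 0$ as $n \to \infty$ (see [24]).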
2.1 Detecting the presence of a planted clique
The basic idea in detecting the presence of a planted clique is to sample all pairs of vertices among a set of size . After these queries we learn the induced subgraph on and we can use the size of the largest clique in as a statistic to distinguish between the hypotheses and , as follows. We know (see above) that under this statistic is at most with probability . On the other hand, under the set contains, in expectation, vertices from the planted clique, so with probability this statistic is at least .
The following proof makes this reasoning precise. Note that this algorithm to detect the presence of a planted clique is nonadaptive, making all queries at the same time.
Proof of Theorem 1(b).
Let be such that and . First, we choose an arbitrary subset of the vertices of size ; for instance, choose . We then query all pairs of vertices among . This results in
[TABLE]
queries. After the queries we know the induced subgraph on . In particular—due to the fact that we have no restrictions on computational power—we can compute the size of the largest clique in this induced subgraph. The algorithm then chooses a hypothesis based on this statistic: if contains a clique of size at least , then it accepts the alternative hypothesis ; otherwise, it accepts the null hypothesis .
We now argue that this algorithm achieves vanishing risk. First, as we discussed at the beginning of Section 2, if , then the largest clique in has size approximately . In particular, as ; that is, the type I error vanishes in the limit.
Next, assuming , let denote the number of planted clique vertices in . Observe that has a hypergeometric distribution with parameters , , and . Thus we have that and , so Chebyshev’s inequality implies that for some constant depending on . To conclude, note that if then contains a clique of size at least . ∎
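The detection procedure from the proof can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: `query` is any function answering edge queries, and the sampled vertex set `sample` and the clique-size `threshold` are passed in as parameters because the exact values used in the proof are not reproduced here. The function names are hypothetical, and the clique number is computed by brute force, which the model allows since computation is unbounded.

```python
from itertools import combinations


def max_clique_size(vertices, adjacent):
    """Brute-force clique number of the graph on `vertices` (exponential time)."""
    vertices = list(vertices)
    for size in range(len(vertices), 0, -1):
        for candidate in combinations(vertices, size):
            if all(adjacent(u, v) for u, v in combinations(candidate, 2)):
                return size
    return 0


def detect_planted_clique(query, sample, threshold):
    """Return True (accept the alternative) if the sampled subgraph has a large clique."""
    sample = list(sample)
    # All pairs are fixed in advance, so the queries are nonadaptive.
    answers = {frozenset(pair): query(*pair) for pair in combinations(sample, 2)}

    def adjacent(u, v):
        return answers[frozenset((u, v))]

    return max_clique_size(sample, adjacent) >= threshold
```

The number of queries is exactly the number of pairs in `sample`, so the query budget is controlled entirely by the sample size.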
2.2 Finding the planted clique
In order to find the planted clique, we start with the same step as in Section 2.1 above: we sample all pairs of vertices among a set of size . As we show below, with probability under , the set of vertices in the largest clique in is exactly the set of vertices in that are in the planted clique. Thus the remaining goal is to identify the vertices of the planted clique that are not in .
To do this, a natural idea is to query all pairs of vertices where one vertex is part of the largest clique in and the other vertex is not in . Any vertex that is in the planted clique and not in will necessarily be connected to all planted clique vertices in , while vertices not in and not in the planted clique will not be connected to all of the planted clique vertices in (with probability under ). Thus in the second step the algorithm selects all vertices not in that were connected to all vertices in where the pair was queried.
Finally, the algorithm outputs the union of the two sets of vertices identified in the two steps. The following proof makes all this precise and proves Theorem 2(b). Note that in the second step we take only a subset of the largest clique in —this is done in order to lessen the number of queries made. Note furthermore that this algorithm has limited adaptivity, as it can be implemented in two “rounds”.
Proof of Theorem 2(b).
The algorithm for finding the planted clique consists of two steps, the first step being the same as the one used for detection.
- •
Step 1: Choose a subset of the vertices of size , where is chosen as in Section 2.1. We then query all pairs of vertices among .
- •
Step 2: Let be the set of vertices in the largest clique in . Let be a fixed subset of size (e.g., take to be the nodes in with lowest label). (If is large such that , then let .) We then query all pairs of vertices where one of the vertices is in and the other is in .
Let denote the vertices in that are connected to all vertices in . The algorithm then outputs as its estimate for the planted clique.
We have seen in Section 2.1 that we make at most queries in the first step, while in the second step we make at most queries. We now argue that this algorithm succeeds in finding the planted clique with probability .
As we argued in Section 2.1, we have that as . Furthermore, with probability , the set contains only planted clique vertices. Indeed, as we discussed at the beginning of Section 2, with probability the largest clique in an Erdős-Rényi random graph with edge density has size at most , so no vertex outside of the planted clique is in a clique of size greater than . Thus in the first step of the algorithm we have found at least vertices of the planted clique. Moreover, we have found all vertices of the planted clique that are in .
Any vertex in that is in the planted clique will be connected to every planted clique vertex and hence every vertex in . Thus all vertices of the planted clique are contained in . To see that there are no false positives in this set, note that the probability that a vertex not in the planted clique is connected to a fixed set of planted clique vertices is . Taking a union bound over vertices in , we see that the probability that there exists a vertex not in the planted clique that is in is at most . ∎
Note that this algorithm succeeds in finding the planted clique even though it does not check that all pairs of vertices within the planted clique are connected. In fact, it checks the edge between pairs of vertices within the planted clique, instead of the pairs that exist.
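The two-step procedure can be sketched in Python as follows. This is an illustration under assumptions rather than the paper's exact algorithm: the sample `sample` and the size `core_size` of the subset used in the second step are parameters, since the specific values from the proof are not reproduced here, and the names `largest_clique` and `find_planted_clique` are hypothetical. Vertices are labeled from 0.

```python
from itertools import combinations


def largest_clique(vertices, adjacent):
    """Return one maximum clique by brute force (exponential time)."""
    vertices = list(vertices)
    for size in range(len(vertices), 0, -1):
        for candidate in combinations(vertices, size):
            if all(adjacent(u, v) for u, v in combinations(candidate, 2)):
                return set(candidate)
    return set()


def find_planted_clique(query, n, sample, core_size):
    """Two-round estimate of the planted clique from edge queries."""
    sample = list(sample)
    # Step 1: learn the induced subgraph on the sample.
    answers = {frozenset(pair): query(*pair) for pair in combinations(sample, 2)}
    clique_in_sample = largest_clique(
        sample, lambda u, v: answers[frozenset((u, v))]
    )
    # Step 2: test every outside vertex against a small core of the clique,
    # here the core_size vertices with the lowest labels.
    core = sorted(clique_in_sample)[:core_size]
    in_sample = set(sample)
    extension = set()
    for v in range(n):
        if v in in_sample:
            continue
        replies = [query(u, v) for u in core]
        if all(replies):
            extension.add(v)
    return clique_in_sample | extension
```

The second step costs at most `core_size` queries per outside vertex, which is where the additive term linear in $n$ (up to a logarithmic factor) comes from.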
3 Lower bounds
To prove our lower bounds we introduce a simpler variant problem that removes all graph structure from the problem. In this hypothesis testing problem we consider the set , where each element of the set is either marked or unmarked. Under the null hypothesis , all elements are unmarked. Under the alternative hypothesis , a uniformly randomly chosen subset of size is chosen and its elements are marked, and the elements of are unmarked. We denote the two probability distributions over by and , respectively.
We consider algorithms that can adaptively query pairs, where . We refer to such queries as pair queries to distinguish them from the edge queries of the original problem. When pair is queried, the query evaluates to true if both and are marked and it evaluates to false otherwise. The algorithm may ask such adaptive pair queries and use unbounded computational time to produce an output in (corresponding to or ). We again measure the performance of an algorithm by its risk, defined as
[TABLE]
We consider randomized algorithms as well, in which case the type I and type II error probabilities in the display above are taken over the internal randomness of the algorithm as well.
The following lemma connects this variant problem with the original hypothesis testing problem.
Lemma 4 (Reduction).
Suppose that there exists an algorithm that makes at most adaptive edge queries and achieves risk for the hypothesis testing problem in (1). Then there exists an algorithm that makes at most adaptive pair queries in the variant problem described above—distinguishing between and —and achieves risk .
Proof.
There is a direct correspondence between the two hypothesis testing problems, which allows the answers to pair queries to simulate answers to edge queries. Specifically, marked elements of correspond to planted clique vertices. Thus a pair query that evaluates to true corresponds to querying two planted clique vertices, while a pair query that evaluates to false corresponds to querying two vertices between which the edge is random. Thus given the answer to a pair query, the answer to an edge query can be simulated as follows: if the answer to the pair query is true, the answer to the corresponding edge query is that the edge exists, while if the answer to the pair query is false, then flip a fair coin to answer the corresponding edge query.
Thus for any algorithm that makes at most adaptive edge queries, there exists a corresponding algorithm that makes at most adaptive pair queries in the variant problem and simulates . We then let the output of be the same as the output of the simulated algorithm . Since the simulation of involves extra randomness, is thus a randomized algorithm. By conditioning on the extra randomness, it follows that the risk of is the same as the risk of . ∎
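The simulation in this proof is short enough to write down directly. The Python sketch below wraps a pair-query oracle so that it answers edge queries: a true pair query becomes an edge, and a false pair query becomes a fair coin flip. The name `simulate_edge_queries` is hypothetical, and the cache (so that a repeated edge query gets a consistent answer) is an implementation detail added for this illustration.

```python
import random


def simulate_edge_queries(pair_query, rng):
    """Turn a pair-query oracle into an edge-query oracle using extra randomness."""
    cache = {}

    def edge_query(u, v):
        key = frozenset((u, v))
        if key not in cache:
            # Both endpoints marked: they play the role of planted clique
            # vertices, so the edge exists. Otherwise the edge is random.
            cache[key] = True if pair_query(u, v) else rng.random() < 0.5
        return cache[key]

    return edge_query
```

Each simulated edge query costs at most one pair query, so an edge-query algorithm and its simulation use the same query budget.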
This lemma implies that to prove Theorem 1(a) it suffices to prove the analogous result for the variant problem. Consequently, we turn our focus to the variant problem. Observe that under all answers to all pair queries will be false. The next lemma considers the alternative hypothesis .
Lemma 5.
Let . Let be any algorithm that makes at most adaptive pair queries. Let denote the event that all of the pair queries of evaluate to false. We then have that
[TABLE]
In particular, if as , then as .
Proof.
To highlight the key elements of the proof, we first prove the claim for deterministic algorithms, where each query is a deterministic function of the previous queries and the answers to them; at the end of the proof we address how the proof changes for randomized algorithms, which may use additional randomness. Thus for now assume that the algorithm is deterministic. To describe the structure of deterministic algorithms we introduce some notation. We let denote the pair queries made by the algorithm. Furthermore, let denote the answers to the pair queries, as follows: if the th pair query evaluates to false, and if the th pair query evaluates to true. Any deterministic algorithm can thus be described as follows:
- •
First, makes the pair query . The algorithm receives the answer (which depends on the realization of ).
- •
The next pair query of depends on the answer :
- –
if , then makes the pair query ;
- –
if , then makes the pair query .
The algorithm receives the answer (which again depends on the realization of ).
- •
The third pair query of depends on the answers and :
- –
if and , then makes the pair query ;
- –
if and , then makes the pair query ;
- –
if and , then makes the pair query ;
- –
if and , then makes the pair query .
The algorithm receives the answer .
- •
And so on. The pair query of depends on the answers as follows: for every , if , then .
Thus the set of pairs
[TABLE]
completely determines how the algorithm behaves for any realization of ; and vice versa: any set of pairs as in (3) determines a deterministic pair query algorithm . (Note that in the description of a deterministic algorithm we have only described how the algorithm makes the pair queries and not how the algorithm produces an output after making adaptive pair queries—for the purposes of the claim this is all that we care about.)
In the following we thus fix the deterministic algorithm by fixing the set of pairs in (3). Also, for notational convenience, we write for when and ; furthermore, let and . Recall that denotes the event that the first adaptive pair queries of all evaluate to false. We prove (2) by determining the conditional law of given the event . Note that we fixed the set of pairs in (3), and thus we know that, given , the first pair queries were . Let denote the set of -tuples such that there was a pair query among these first pair queries that queried a pair from this -tuple; that is,
[TABLE]
Since all pair queries evaluated to false, no -tuple in can be the marked subset given (since otherwise a pair query would have evaluated to true). That is,
[TABLE]
for all . Now let us consider a -tuple that is not in . By Bayes’s rule we have that
[TABLE]
Since the prior on is uniform, we have that . Now since , if , then the answers to the pair queries are necessarily all false (due to the definition of ). Therefore for every we have that . Thus we have shown that for every we have that
[TABLE]
Altogether, we have shown that the conditional law of given is given by
[TABLE]
Notice that this conditional probability is equal for all -tuples that are not in . Therefore we also have that
[TABLE]
Putting together the previous two displays we have that
[TABLE]
Note that every pair (where ) is part of exactly different subsets of elements. This implies the following upper bound on the size of :
[TABLE]
Plugging this bound back into (4) we obtain (2), as desired.
Finally, we discuss randomized algorithms, which may use additional randomness. We may condition on the extra randomness and then use the argument above for deterministic algorithms. This shows that no matter what the realization of the additional randomness is, the conditional probability of all pair queries evaluating to false is at least . Taking an expectation over the additional randomness then shows the desired claim. ∎
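The counting step at the end of the deterministic case can be written out as one chain. The notation below is an assumption, since the displayed equations are not reproduced above: $K$ is the marked set, $E$ is the event that all $q$ pair queries evaluate to false, and $\mathcal{K}$ is the collection of $k$-subsets that contain at least one pair queried along the all-false path of the algorithm.

```latex
\begin{align*}
\mathbb{P}(E)
  &= \mathbb{P}(K \notin \mathcal{K})
   = 1 - \frac{|\mathcal{K}|}{\binom{n}{k}} \\
  &\geq 1 - \frac{q \binom{n-2}{k-2}}{\binom{n}{k}}
   = 1 - q \, \frac{k(k-1)}{n(n-1)} .
\end{align*}
```

The inequality uses that each queried pair lies in exactly $\binom{n-2}{k-2}$ subsets of size $k$, and the last equality is the identity $\binom{n-2}{k-2} / \binom{n}{k} = k(k-1)/(n(n-1))$. In particular the bound tends to $1$ whenever $q = o(n^2/k^2)$.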
We are now ready to prove the analogue of Theorem 1(a) for the variant problem.
Corollary 6 (Detecting a marked set of elements).
Consider the hypothesis testing problem versus . Let as . If an algorithm makes at most adaptive pair queries then its risk must satisfy as .
Proof.
No matter what algorithm does, all of its pair queries will evaluate to false under (by definition), and all of its pair queries will evaluate to false under with probability (by Lemma 5). Suppose that outputs [math] with probability and with probability when all of its queries evaluate to false, where . The first sentence of the proof then implies that its risk is at least . ∎
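The risk bound in this proof can be spelled out as follows. The notation is an assumption made for this sketch: $p$ is the probability that the algorithm outputs the alternative hypothesis when every pair query evaluates to false (the argument is symmetric in which hypothesis carries $p$), $E$ is the event that all pair queries evaluate to false, and $\mathbb{P}_1$ is the law under the alternative.

```latex
\begin{align*}
R(A)
  &\geq \underbrace{p}_{\text{type I error}}
      + \underbrace{(1-p)\,\mathbb{P}_1(E)}_{\text{lower bound on type II error}} \\
  &= p + (1-p)\bigl(1 - o(1)\bigr)
   \geq 1 - o(1) .
\end{align*}
```

The type I error equals $p$ exactly because all queries evaluate to false under the null hypothesis, and the type II error is bounded below by restricting to the event $E$.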
Proof of Theorem 1(a).
This follows directly from Lemma 4 and Corollary 6. ∎
We now turn to proving Theorem 2(a). Here too we leverage the connection to the corresponding estimation problem for the simplified variant problem, where we aim to estimate the set of marked elements.
Lemma 7.
Let be a uniformly randomly chosen set of size , where . Let the elements of be marked and let the elements of be unmarked. Let as . If is any estimator of the marked set that is based on at most adaptive pair queries, then satisfies as .
Proof.
There are two cases to consider. First, consider the case when . The proof of Lemma 5 shows that, with probability , after adaptive pair queries there remain a fraction of subsets of size that are equally likely to be the marked set. No estimator can do better than pick randomly among these, and this will succeed with probability .
Next, consider the case when . In this case we show that it is not possible to estimate the marked set even for algorithms with significantly more information. Specifically, we consider algorithms that can adaptively query pairs, where , and when pair is queried, the algorithm learns, for both and , whether they are marked or unmarked. We refer to such queries as strong pair queries to distinguish them from pair queries. From the answer to a strong pair query it is possible to determine the answer to the appropriate pair query. Therefore any algorithm that makes adaptive pair queries can be simulated by an algorithm that makes adaptive strong pair queries. Thus in order to prove the claim it suffices to show that if is any estimator of the marked set that is based on at most adaptive strong pair queries, then satisfies as . This is what we will show; thus in the following we consider algorithms that make at most adaptive strong pair queries, and we assume that .
We now argue that for we may, without loss of generality, consider the algorithm that makes the strong pair queries . We argue this by induction. Since the marked set is chosen uniformly at random, the first strong pair query may be without loss of generality. There are now three cases to consider, depending on the answer to this first strong pair query.
- •
Both elements are unmarked. Suppose that the answer to the strong pair query is that both and are unmarked. Thus and therefore neither nor will be in any estimator (since otherwise the estimator will be incorrect). The algorithm thus knows that . Moreover, by Bayes’s rule, the conditional distribution of , given that both and are unmarked, is uniform among -tuples in .
- •
One element is unmarked, the other is marked. Suppose that the answer to the strong pair query is that is marked and is unmarked. Thus and therefore will be in any estimator (since otherwise the estimator will be incorrect). We also learn that , so will not be in any estimator (since otherwise the estimator will be incorrect). Moreover, by Bayes’s rule, the conditional distribution of , given that is marked and is unmarked, is uniform among -tuples in that contain the element and do not contain the element . Thus the conditional distribution of , given that is marked and is unmarked, is uniform among -tuples in .
The case where is unmarked and is marked is analogous.
- •
Both elements are marked. Suppose that the answer to the strong pair query is that both and are marked. Thus and therefore both and will be in any estimator (otherwise the estimator will be incorrect). Moreover, by Bayes’s rule, the conditional distribution of , given that both and are marked, is uniform among -tuples in that contain both and . Thus the conditional distribution of , given that both and are marked, is uniform among -tuples in .
In summary, no matter what the answer to the strong pair query is, the algorithm deduces the following two points.
- •
For elements and , the algorithm knows whether or not to include them in any estimator that has any possibility of being correct.
- •
The conditional distribution of , given the answer to the strong pair query , is uniform among -tuples in ; here if both and are unmarked, if one is unmarked and the other is marked, and if both and are marked.
Due to the uniformity of the conditional distribution in the last bullet point, the next strong pair query may, without loss of generality, be . More generally, after having made the strong pair queries , the algorithm deduces the following two points.
- •
For each element in , the algorithm knows whether or not to include them in any estimator that has any possibility of being correct.
- •
The conditional distribution of , given the answers to the strong pair queries , is uniform among -tuples in , where is equal to minus the number of marked elements in .
Again, due to the uniformity of the conditional distribution in the bullet point above, the next strong pair query may, without loss of generality, be . This finishes the proof of the induction.
Finally, we analyze the algorithm that makes the strong pair queries . After the answers to these strong pair queries, the algorithm knows for each element in whether they are marked or unmarked. Let denote the subset of marked elements in , and let . Any estimator that has any possibility of being correct must include as a subset (since otherwise the estimator will be incorrect); similarly, any estimator that has any possibility of being correct must not include any elements in . If , then this determines that the estimator should be , and indeed in this case the estimator is correct: . If , then the estimator has to choose a subset of size and outputs . The estimator is then correct (that is, holds) if and only if . As we have argued above, the conditional distribution of , given the answers to the strong pair queries , is uniform among -tuples in . Due to the uniformity of this conditional distribution, for any estimator the conditional probability of is equal to . Putting everything together we have thus obtained that
[TABLE]
Since the distribution of is uniform among -tuples in , the distribution of is hypergeometric with parameters , , and . We now distinguish three cases based on how the parameters , , and relate to each other, and in each case we bound the expected value in (5).
- •
Case 1: . Since , we have that for all large enough. Thus for all large enough we have that . This implies that, for all large enough, if , then . Also, if , then . We thus have, for all large enough, that
[TABLE]
Since is a hypergeometric random variable with parameters , , and , we have that
[TABLE]
Combining (6) and (7) we have that as .
- •
Case 2: . Since , in this case we always have that , so . Also, . Put together, we have that , which implies that . Therefore in this case we have that
[TABLE]
- •
Case 3: . Since , we have that for all large enough. Note also that . Thus for all large enough we have that , so . Also note that by definition. If , then , where the second inequality holds for all large enough. We thus have, for all large enough, that
[TABLE]
Since is a hypergeometric random variable with parameters , , and , we have that
[TABLE]
Combining (8) and (9) we have that as .
In summary, in all three cases above we have that as . Combining this with (5) proves the claim. ∎
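For small parameters the quantity bounded in this proof can be computed exactly. The Python sketch below assumes the reading suggested by the argument above: after $q$ strong pair queries the status of $2q$ elements is known, the number $X$ of marked elements among them is hypergeometric with parameters $n$, $k$, and $2q$, and the best possible guess for the remaining $k - X$ marked elements is uniform over subsets of the $n - 2q$ unqueried elements. The function name `success_probability` is hypothetical.

```python
from fractions import Fraction
from math import comb


def success_probability(n, k, q):
    """Exact value of E[1 / C(n - 2q, k - X)] with X ~ Hypergeometric(n, k, 2q)."""
    m = 2 * q  # number of elements whose status is revealed
    if m > n:
        raise ValueError("cannot reveal more than n elements")
    total = Fraction(0)
    # X ranges over the values for which both binomial coefficients are nonzero.
    for x in range(max(0, k - (n - m)), min(k, m) + 1):
        pmf = Fraction(comb(k, x) * comb(n - k, m - x), comb(n, m))
        total += pmf / comb(n - m, k - x)
    return total
```

With `n = 4`, `k = 2`, and a single strong pair query the value is 2/3: the marked set is determined when both queried elements have the same status, and otherwise one of two candidates must be guessed.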
Proof of Theorem 2(a).
There exists a direct correspondence between the problem of estimating the planted clique and the problem of estimating the marked set in the variant problem. This correspondence for the estimation problem is analogous to the correspondence for the detection problem described in the proof of Lemma 4. The proof then follows directly from Lemma 7 and this correspondence. ∎
Acknowledgements
M.Z.R. is grateful to David Gamarnik for helpful discussions. We also thank Jay Mardia, Joe Neeman, an anonymous reviewer, and an anonymous associate editor for helpful comments, feedback, and questions which helped improve the manuscript.
References
- [1] N. Alon, M. Krivelevich, and B. Sudakov. Finding a large hidden clique in a random graph. Random Structures & Algorithms, 13(3-4):457–466, 1998.
- [2] R. Alweiss, C. Ben Hamida, X. He, and A. Moreira. On the subgraph query problem. Preprint available at https://arxiv.org/abs/1911.04413, 2019.
- [3] A. Anagnostopoulos, J. Łacki, S. Lattanzi, S. Leonardi, and M. Mahdian. Community detection on evolving graphs. In Advances in Neural Inf. Proc. Systems, pages 3522–3530, 2016.
- [4] S. Arora, B. Barak, M. Brunnermeier, and R. Ge. Computational complexity and information asymmetry in financial products. Communications of the ACM, 54(5):101–107, 2011.
- [5] Q. Berthet and P. Rigollet. Complexity Theoretic Lower Bounds for Sparse Principal Component Detection. In Proceedings of the 26th Annual Conference on Learning Theory (COLT), pages 1046–1066, 2013.
- [6] Q. Berthet and P. Rigollet. Optimal detection of sparse principal components in high dimension. The Annals of Statistics, 41(4):1780–1815, 2013.
- [7] B. Bollobás and P. Erdős. Cliques in random graphs. Mathematical Proceedings of the Cambridge Philosophical Society, 80(3):419–427, 1976.
- [8] M. Brennan and G. Bresler. Optimal Average-Case Reductions to Sparse PCA: From Weak Assumptions to Strong Hardness. Preprint at https://arxiv.org/abs/1902.07380, 2019.
