On the computational tractability of statistical estimation on amenable graphs
Ahmed El Alaoui, Andrea Montanari

TL;DR
This paper investigates how the structure of graphs influences the gap between statistically optimal and computationally feasible solutions in estimating discrete variables, showing that amenable graphs allow near-optimal local algorithms, unlike random graphs.
Contribution
It demonstrates that for amenable graphs, simple local algorithms can nearly achieve optimal estimation, contrasting with the persistent gap in random graphs.
Findings
Local algorithms achieve near-optimal accuracy on amenable graphs.
The information-computation gap persists in random regular graphs.
Graph structure critically affects the computational-statistical tradeoff.
Ahmed El Alaoui* and Andrea Montanari Department of Electrical Engineering and Department of Statistics, Stanford University
Abstract
We consider the problem of estimating a vector of discrete variables $(\theta_1,\dots,\theta_n)$, based on noisy observations $Y_{uv}$ of the pairs $(\theta_u,\theta_v)$ on the edges of a graph $G=([n],E)$. This setting comprises a broad family of statistical estimation problems, including group synchronization on graphs, community detection, and low-rank matrix estimation.
A large body of theoretical work has established sharp thresholds for weak and exact recovery, and sharp characterizations of the optimal reconstruction accuracy in such models, focusing however on the special case of Erdős–Rényi-type random graphs. The single most important finding of this line of work is the ubiquity of an information-computation gap. Namely, for many models of interest, a large gap is found between the optimal accuracy achievable by any statistical method, and the optimal accuracy achieved by known polynomial-time algorithms. Moreover, this gap is generally believed to be robust to small amounts of additional side information revealed about the $\theta_i$'s.
How does the structure of the graph affect this picture? Is the information-computation gap a general phenomenon or does it only apply to specific families of graphs?
We prove that the picture is dramatically different for graph sequences converging to amenable graphs (including, for instance, $d$-dimensional grids). We consider a model in which an arbitrarily small fraction of the vertex labels is revealed, and show that a linear-time local algorithm can achieve reconstruction accuracy that is arbitrarily close to the information-theoretic optimum. We contrast this with the case of random graphs. Indeed, focusing on group synchronization on random regular graphs, we prove that the information-computation gap still persists even when a small amount of side information is revealed.
1 Introduction
Classical statistics focuses on problems in which a small number of parameters needs to be estimated from data. As a consequence, it is mostly unconcerned with computational complexity considerations. Fundamental limits to statistical estimation are proven on the basis of information-theoretic considerations. On the contrary, in modern high-dimensional applications, it is not uncommon to come across statistical models that require estimating simultaneously thousands or even millions of parameters. In this setting, a large gap is often observed between information-theoretic limits and what is achieved by the best known polynomial-time algorithms. Indeed, it is expected that no polynomial-time algorithm can achieve optimal statistical performance in general. In specific classes of models, a precise information-computation gap has been conjectured on the basis of current knowledge (see, e.g., [MM09, DKMZ11, MR14, LM17, BKM+19, CM19] and references therein).
As explained below, most of our understanding of this information-computation gap was developed by analyzing probabilistic models with a high degree of exchangeability. This suggests a natural question: Is the same gap present in models with other types of structure?
Statistical estimation on graphs provides a rich and interesting setting to study this question. Let $G=(V,E)$ be a graph on $n$ vertices, $V=[n]$. Edges are assumed to be directed in an arbitrary way, i.e., they are ordered pairs $(u,v)$. We associate to the vertices random variables $\theta_u$, $u\in V$, uniformly distributed on a finite alphabet $\mathcal{X}$. For each edge $(u,v)\in E$, we observe $Y_{uv}\in\mathcal{Y}$, where $\mathcal{Y}$ is also a finite alphabet. The observations are conditionally independent with $\mathbb{P}(Y_{uv}=y\,|\,\bm{\theta})=Q(y\,|\,\theta_u,\theta_v)$, where $Q$ is a probability kernel from $\mathcal{X}\times\mathcal{X}$ to $\mathcal{Y}$. Given the edge observations $(Y_{uv})_{(u,v)\in E}$ (and, possibly, additional side information, see below), the purpose is to estimate the vertex assignment $\bm{\theta}=(\theta_u)_{u\in V}$.
This model is general enough to include a broad variety of examples studied in the literature, including group synchronization, community detection, low-rank matrix estimation, and so on. As an example, consider the $\mathbb{Z}_q$–synchronization problem (further examples are presented in Section 3.1). The unknown variables $\theta_u$ are i.i.d. uniform in $\{0,1,\dots,q-1\}$, which we identify with the cyclic group $\mathbb{Z}_q$ with additive structure. Observations are noisy measurements of the difference between $\theta_u$ and $\theta_v$ for each edge $(u,v)$:
[TABLE]
where is a collection of independent random variables , independent of .
In addition to the observations , we consider independent observations on the vertices of :
[TABLE]
where is a symbol not belonging to , so that with probability the value of is directly observed. We will write . Following the information theory literature, we refer to this noise model as the Binary Erasure Channel, and denote it by BEC(). (It is customary to parametrize the BEC by its erasure probability .) The parameter will be considered very small (eventually going to zero as becomes large). The purpose of this side information is to break the occasional group symmetry (sign symmetry or cyclic shifts in the case of ) that would otherwise be preserved by the observations .
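To make the observation model concrete, the following Python sketch simulates synchronization over a cyclic group together with erasure-channel side information. Several details are assumptions made only for illustration and are not fixed by the text above: the noise law (zero with probability `1 - p`, otherwise uniform on the group), the hypothetical function name `simulate_sync`, and the encoding of an erased vertex observation as `None`.

```python
import random

def simulate_sync(n, edges, q, p, eps, seed=0):
    """Simulate cyclic-group synchronization with erasure side information.

    Assumptions (illustrative, not taken from the paper): the edge noise is 0
    with probability 1 - p and uniform on {0, ..., q-1} otherwise; an erased
    vertex observation is encoded as None.
    """
    rng = random.Random(seed)
    # Hidden labels: i.i.d. uniform on {0, ..., q-1}.
    theta = [rng.randrange(q) for _ in range(n)]
    # Edge observations: noisy differences modulo q along each directed edge.
    Y = {}
    for (u, v) in edges:
        z = rng.randrange(q) if rng.random() < p else 0
        Y[(u, v)] = (theta[u] - theta[v] + z) % q
    # Side information: each label is revealed independently with prob. eps.
    xi = [theta[u] if rng.random() < eps else None for u in range(n)]
    return theta, Y, xi
```

With `eps = 1` every label is revealed, and with `p = 0` every edge observation equals the exact difference of its endpoint labels, which gives two easy sanity checks for the simulator.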
We consider two metrics for the estimation accuracy. In our first definition, the goal is to estimate the rank-one matrix whose entries are
[TABLE]
where is a given real-valued function. For instance by setting and then considering all values of , this allows to estimate whether for each pair of vertices . An estimator is a map , i.e., a function of the observations and the side information . We evaluate its risk under the square loss
[TABLE]
We denote by the minimal achievable error, i.e., the one achieved by the posterior expectation
[TABLE]
(We have made use of the following notation: for a graph , we denote by the union of the vertex and edge observations over : .) Our second metric for estimation accuracy is the ‘overlap’, and will be introduced in Section 4, see Eq. (4.7).
Statistical estimation on graphs has motivated a substantial amount of work. In this context, the first example of a statistical model with a large information-computation gap is probably the planted clique problem [Jer92, AKV02]. This can be recast in the general framework described above, with $G$ the complete graph over $n$ vertices (see Section 3.1). Despite more than a quarter century of research, and the study of increasingly powerful classes of algorithms [FK00, DM13, BHK+16], no known polynomial-time algorithm comes close to saturating the information-theoretic limits for this problem.
In recent years, a much more refined picture of the information-computation gap has emerged, mainly through the careful analysis of a variety of models on sparse random graphs (as well as models on dense graphs in a different noise regime than the hidden clique model). We refer to Section 2 for a brief summary of this vast literature. In most of these models an information-computation gap is observed, and has been precisely delineated. This gap is generally conjectured to remain unchanged if a small amount of side information is revealed, as in Eq. (1.2). [Footnote 1: The careful reader will notice that this statement does not apply to the planted clique problem. If the labels of a random subset of vertices are revealed (i.e., whether or not they belong to the clique), then it is easy to find planted cliques of sizes far below those reached by the best known polynomial-time algorithms without side information. This behavior is however related to the fact that, in the planted clique problem, the labels’ prior distribution is strongly dependent on $n$, as revealed by the fact that the clique’s size is sublinear in $n$.] As mentioned above, most of the theoretical work has focused however on random graphs (Erdős–Rényi random graphs, random regular graphs and their relatives). This motivates the following key question:
Does an information-computation gap exist for statistical estimation on other types of graphs?
In this paper, we consider the case of graph sequences that converge locally to amenable graphs. Roughly, these are graphs for which the boundary of large sets of vertices is negligible compared to their volume. We refer to Section 3 for a reminder on the relevant definitions. Our results are already interesting for the simplest example of such graphs, namely large boxes in the -dimensional grid (with ).
Our main finding is that no information-computation gap exists for such graphs (as long as the gap is defined in terms of polynomial- versus non-polynomial time algorithms). A specific formalization of this finding is given below, and proved in Section 4.
Theorem A.
Let be a function with for . Let be a sequence of finite graphs (with ) converging locally–weakly to a random rooted graph which is infinite, locally-finite, almost surely anchored–amenable and tame. Then for each there exists an estimator , with runtime , such that the following holds. For almost every , we have
[TABLE]
The notions of local–weak convergence, anchored–amenability and tameness will be defined in Section 3. More in detail, we present the following contributions:
No information-computation gap on amenable graphs.
Theorem A provides a concrete formalization of the general statement that statistically optimal estimation can be performed using polynomial-time algorithms on (asymptotically) amenable graphs. In fact, we will prove that this follows from a more fundamental result, establishing that the vertex marginals of the posterior $\mathbb{P}\big(\bm{\theta}\,|\,Y^{(\varepsilon)}_{G_n}\big)$ can be computed to arbitrary accuracy in polynomial time, for almost all values of $\varepsilon$, on asymptotically amenable graphs, cf. Section 4.
Note that approximating the Bayes estimator of Eq. (1.5) requires approximating the joint distribution of pairs of well-separated vertices. However, we will use a decoupling argument to reduce to the case of vertex marginals.
Local algorithms.
Our proof that vertex marginals can be computed efficiently follows from an even stronger, and somewhat surprising fact (as above, holding for almost all $\varepsilon$). The marginal at a vertex can be well approximated by computing the marginal with respect to the posterior given observations in a large constant-size ball centered at that vertex. In other words, the marginal can be approximated by a local algorithm. The reason for this phenomenon can be explained in information-theoretic terms. We will prove that the average conditional mutual information between a random vertex $v$ in a region $S$ and the boundary $\partial S$ of $S$, namely $I\big(\theta_v;\theta_{\partial S}\,|\,Y_S^{(\varepsilon)}\big)$, is upper bounded by . Hence, for amenable graphs, the effect of the boundary information is generally negligible.
Robust information-computation gap on random regular graphs.
We provide a counter-example, by showing that the conclusions at the previous points do not hold for random regular graphs, converging locally to -regular trees, which are non-amenable. As mentioned above, several cases of statistical estimation problems have been observed to present an information-computation gap, when the underlying graph is random. While this gap is often expected to be robust to side information about the vertices, we are not aware of any result that explicitly establishes robustness—in the setting of the present paper. We consider the –synchronization problem on random -regular graphs. We prove that, for a large range of the model parameters and all small enough: There exists a statistical estimator that achieves non-trivial reconstruction accuracy uniformly as ; Local algorithms can only achieve accuracy that vanishes as .
2 Related literature
As mentioned in the introduction, large information-computation gaps were observed in a number of statistical estimation problems, when the underlying structure is a random graph, the complete graph, or close relatives. An incomplete list includes community detection in the stochastic block model [DKMZ11, Mas14, MNS18, Abb17], high-dimensional linear regression and generalized linear models [BKM+19, CM19], low-rank matrix estimation and sparse principal component analysis [JL09, AW09, BR13, MW15, LM17], tensor principal component analysis [MR14, HSS15, HKP+17], tensor decomposition, and so on.
In many of these models, two types of results are established. On the one hand, an ‘information-theoretic’ analysis characterizes the optimal statistical accuracy that is achieved by an ideal estimator. On the other, specific classes of polynomial-time algorithms are analyzed. Sometimes the resulting statistical estimation limits are stated in terms of specific goals such as ‘weak recovery’ or ‘exact recovery’; in the present paper we consider the general goal of estimation with a certain expected accuracy, or risk.
The most frequently analyzed classes of algorithms have been spectral methods, local algorithms, and convex relaxations in the sum-of-squares hierarchy. A remarkable dichotomy has emerged from these works. Roughly speaking, in all the examples we know of, either highly sophisticated semidefinite programming hierarchies fail, or simple combinations of spectral methods and local algorithms succeed. The behavior of the latter is in turn characterized by studying the Bayes optimal local algorithm (belief propagation), in the presence of a small amount of side information. Partial rationalizations of this surprising dichotomy were given in [HSSS16, HKP+17, FM17]. Motivated by this work, our analysis of $\mathbb{Z}_q$–synchronization on random regular graphs (Section 5) will focus on the same simple algorithm: belief propagation in the presence of side information. As is common in the literature, we will use the weak recovery threshold for this algorithm as a proxy for the fundamental algorithmic threshold.
Let us stress that our main focus is statistical estimation on amenable graphs. Versions of this problem have been studied in a few recent papers [AMM+17, SB18, PW18, AB18, ABRS18]. In particular, [AMM+17] proved the existence of a weak recovery threshold for –synchronization on grids in dimensions. [Footnote 2: For , [AMM+17] proves that a threshold exists in the case , and indeed the same is expected to hold for as well. For no non-trivial threshold exists in that weak recovery is always impossible.] However, in contrast with random graphs, no explicit characterization exists (or is likely to exist) for the optimal statistical accuracy nor, in general, for the location of weak recovery thresholds. This poses a clear challenge to us: we want to prove that the optimal statistical accuracy can be achieved by polynomial-time algorithms, but we do not have an explicit characterization for the target accuracy. Indeed, our proof will be purely conceptual.
Let us finally mention that it is well understood that certain algorithmic tasks are easy on graphs that can be embedded well in Euclidean space (e.g., on grids). For instance, approximate optimization of a function that decomposes as a sum of edge terms over a grid is easy, by partitioning the grid into large boxes. Unfortunately, these ideas do not have direct implications for the questions addressed in this paper. Even if we can find an approximate maximum-likelihood assignment of the unknown variables $\bm{\theta}$, this is not guaranteed to have any good statistical properties, let alone achieve optimal estimation error. Inference and estimation do not reduce to optimization.
3 Background
3.1 Further examples
It is interesting to check that the framework defined in the introduction is broad enough to encompass a variety of models of interest.
Spiked Wigner and Wishart models. Low-rank plus noise models are ubiquitous in statistics and signal processing [Joh06], and can be recast in the language of the present paper. As an example, consider the case of a signal vector , with i.i.d. components, and assume we observe the rank-one-plus-noise matrix . Here is a noise matrix, with –for instance– and controls the noise level.
We take $G$ to be the complete graph, and be i.i.d. random variables from a distribution on . [Footnote 3: Unlike for the model described in the introduction, the variables ’s typically take any value in , and their distribution is non-uniform. However, it is easy to reduce from one case to the other. For instance, we can let . We can choose the nonlinear function so that when .] Observations on the edges are given by
[TABLE]
where denotes the Gaussian distribution.
This example can be easily generalized. For instance, higher rank models can be produced by taking , fixed. Rectangular (non-symmetric) random matrices of dimensions , can also be produced by setting . In this case where and depending whether belongs to the first vertices (left factor) or the last ones (right factor).
Community detection. The stochastic block model is a popular model for community detection in networks. The model is parametrized by a symmetric ‘connectivity’ matrix , whereby is the expected edge density between vertices in communities and . (For the sake of simplicity, we consider here the ‘balanced’ case in which the communities have all equal expected size.) Each vertex is assigned a label independently and uniformly at random. Conditional on , we generate a graph by connecting vertices independently with probability .
We can encode this model in our general framework as follows. The graph is the complete graph, and observe on every edge, where . The connection with the standard description is given by the correspondence . The same encoding can be used for the planted clique problem.
Let us note that although the above models are special cases of our framework, we will focus in the rest of the paper on graphs whose local–weak limit (to be defined shortly) is locally finite. This rules out graphs with diverging typical degree (in particular the complete graph).
3.2 Local–weak convergence and amenability
For the reader’s convenience, we collect here some relevant graph-theoretic definitions, referring to [BS01, AL07, LP17] for more details. In this paper, all graphs have a finite or countably infinite vertex set, are connected, and are locally finite; i.e., all vertices have finite degree. A rooted graph $(G,o)$ is a graph $G=(V,E)$ together with a choice of a vertex $o\in V$, called the root of $G$. We say that two rooted graphs $(G_1,o_1)$ and $(G_2,o_2)$ are isomorphic, and we write $(G_1,o_1)\equiv(G_2,o_2)$, if there exists an edge–preserving and root–preserving bijective map $\phi:V_1\to V_2$, i.e., $\phi(o_1)=o_2$, and $(u,v)\in E_1$ if and only if $(\phi(u),\phi(v))\in E_2$. For an integer $l\ge 0$, define $[G,o]_l$ to be the rooted subgraph spanned by a ball of radius $l$ around the root on $G$: this is the rooted graph $(G_l,o)$ where $G_l=(V_l,E_l)$, $V_l=B_G(o,l):=\{v\in V:\ d_G(o,v)\le l\}$, and $E_l=\{(u,v)\in E:\ u,v\in V_l\}$. Here, $d_G$ is the graph distance in $G$.
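The rooted ball used in these definitions can be obtained by a breadth-first search truncated at depth $l$. The following is a minimal Python sketch under the assumption that the graph is undirected and stored as an adjacency dictionary; the function name `rooted_ball` is hypothetical.

```python
from collections import deque

def rooted_ball(adj, root, l):
    """Return (vertices, edges) of the ball of radius l around root.

    adj maps each vertex to an iterable of its neighbours (undirected graph).
    Edges are returned as frozensets {u, w} with both endpoints in the ball.
    """
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        if dist[u] == l:
            continue  # do not expand beyond radius l
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    vertices = set(dist)
    # Induced edges: keep every edge whose two endpoints lie in the ball.
    edges = {frozenset((u, w)) for u in vertices for w in adj[u] if w in vertices}
    return vertices, edges
```

On a cycle with 8 vertices, for example, the ball of radius 2 around any root is a path on 5 vertices with 4 edges, so all rooted balls of that radius are isomorphic.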
Definition 3.1.
A sequence of rooted graphs $(G_n,o_n)$ is said to converge locally to a rooted graph $(G,o)$, and we write $(G_n,o_n)\to(G,o)$, if for every radius $l\ge 0$, there exists $n_0$ such that $[G_n,o_n]_l\equiv[G,o]_l$ for all $n\ge n_0$.
This notion of convergence endows the set $\mathcal{G}_*$ (of $\equiv$–equivalence classes) of rooted graphs with a metrizable topology, called the topology of local, or Benjamini–Schramm, convergence [BS01]. This gives $\mathcal{G}_*$ the structure of a complete separable metric space. Now we can define $\mathcal{P}(\mathcal{G}_*)$, the space of probability measures on $\mathcal{G}_*$ when endowed with its Borel $\sigma$–algebra. Then we endow $\mathcal{P}(\mathcal{G}_*)$ with the usual topology of weak convergence.
From a finite deterministic graph , we can construct a random rooted graph by choosing the root uniformly at random from . We denote the law of this random rooted graph by .
Definition 3.2.
A sequence of finite graphs is said to converge locally–weakly to a random rooted graph if the sequence of probability measures converges weakly to a probability measure , which is the law of .
In other words, the definition requires that given a fixed finite connected rooted graph $(H,o')$ and a fixed radius $l$, the probability $\mathbb{P}\big([G_n,o_n]_l\equiv(H,o')\big)$ converges to $\mathbb{P}\big([G,o]_l\equiv(H,o')\big)$ as $n\to\infty$.
Probability measures that are local–weak limits of sequences of finite graphs as per Definition 3.2 (such measures are called sofic in the literature) inherit an important stationarity property which roughly expresses the intuition that the random graph should “look the same” when viewed from any of its vertices. A formal definition takes the form of a mass–transport principle termed unimodularity [AL07]. Similarly to $\mathcal{G}_*$, we define the space $\mathcal{G}_{**}$ of $\equiv$–equivalence classes of doubly–rooted graphs, where the isomorphism relation and local convergence as per Definition 3.1 are both extended in the natural way.
Definition 3.3.
A measure is unimodular if for every Borel function ,
[TABLE]
when .
It is clear that if is finite then is unimodular, since the root is chosen uniformly at random. Furthermore, the property of unimodularity is closed in the topology of local–weak convergence [AL07], hence all local–weak limits of sequences of finite graphs are unimodular.
Next, we define the key concept of anchored–amenability.
Definition 3.4.
An infinite rooted graph where is said to be anchored–amenable if its Cheeger constant anchored at is zero:
[TABLE]
Here, is the vertex-boundary of the set .
We will informally use the phrase ‘asymptotically amenable’ to refer to graph sequences that converge locally–weakly to almost surely anchored–amenable graphs.
Observe that if $G$ is vertex–transitive, the above statement does not depend on the root $o$, and anchored–amenability reduces to the more classical notion of amenability of (non-rooted) graphs. For instance, the Euclidean lattice $\mathbb{Z}^d$ is amenable, whereas a regular tree of degree at least three is not (both graphs being transitive).
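The contrast between the lattice and the tree can be checked numerically on balls around the root. The Python sketch below computes the ratio of boundary size to volume for balls in the two-dimensional grid and in the 3-regular tree. It assumes the outer vertex boundary (vertices outside the set with a neighbour inside), which is one common convention and may differ from the one intended in Definition 3.4; all function names are hypothetical. It is an illustration on one family of sets, not a computation of the Cheeger constant, which involves an infimum over all admissible sets.

```python
def ball_and_outer_boundary(neighbors, root, r):
    """BFS ball of radius r around root and its outer vertex boundary."""
    ball = {root}
    frontier = [root]
    for _ in range(r):
        nxt = []
        for u in frontier:
            for w in neighbors(u):
                if w not in ball:
                    ball.add(w)
                    nxt.append(w)
        frontier = nxt
    # Vertices outside the ball that are adjacent to the sphere of radius r.
    boundary = {w for u in frontier for w in neighbors(u) if w not in ball}
    return ball, boundary

def grid_neighbors(v):
    """Neighbours in the two-dimensional integer lattice."""
    x, y = v
    return [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]

def tree_neighbors(v):
    """Neighbours in the 3-regular tree; vertices are tuples of child indices."""
    if len(v) == 0:
        return [(0,), (1,), (2,)]          # the root has three children
    return [v[:-1], v + (0,), v + (1,)]    # parent plus two children

def boundary_ratio(neighbors, root, r):
    ball, boundary = ball_and_outer_boundary(neighbors, root, r)
    return len(boundary) / len(ball)
```

For grid balls the ratio decays like the inverse of the radius, while for tree balls it stays above one at every radius, which is the qualitative difference that amenability captures.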
Observe that if is almost surely anchored–amenable, there exists a sequence of finite sets such that and which ‘witnesses’ the amenability of : as . Moreover, this random sequence can be chosen in a measurable way as a function of the rooted graph . Indeed, we can for instance label the vertices of by , the root being labelled by [math], and for every , choose the first finite set (among countably many) in the lexicographic ordering such that and . For clarity we make this dependence explicit: . We require a technical condition regarding such sets .
Definition 3.5.
We say that is tame if it is supported on anchored–amenable rooted graphs, and there exists a sequence of sets that witnesses anchored–amenability (i.e., such that is a measurable function of , and almost surely) such that the following holds. For every there exists such that
[TABLE]
By extension, we say that the random rooted graph is tame if its law is tame.
Intuitively, tameness is satisfied when the size of the neighborhoods of each vertex around the root is comparable with . To discuss it further, it is useful to introduce the random variables
[TABLE]
The tameness condition requires a uniform upper bound on the lower tail of . An equivalent way to express this condition is to say that the sequence of random variables is tight when .
Note that whenever is unimodular. Indeed, by a direct application of the mass-transport principle (for the function )
[TABLE]
Moreover, tameness is satisfied if is supported on vertex-transitive graphs, and in this case almost surely. Indeed, assume is supported on a single vertex-transitive graph. Then is unimodular whence , but is non-random and therefore . In the general case where is not an atom, since almost surely conditional on , we have almost surely unconditionally as well.
We next provide a few examples of graphs that are anchored-amenable and tame.
Example 1 (Percolation clusters). Consider the -dimensional grid , i.e., , and edges connect vertices at distance one . Remove edges independently with probability and let be the connected component of the origin . We consider , the percolation threshold on so that is infinite with positive probability, and condition on the event that is indeed infinite. In this case we can take to be the subset of vertices contained in the ball of radius around : , for a deterministic sequence of radii . A classical result of Newman and Schulman [NS81] implies almost surely for some non-random constant . Further whence . Hence, there exists a random such that almost surely , for all .
Further, , and if and only if . Therefore
[TABLE]
Therefore a.s., whence $\rho\big(\alpha_k(G,o)\le\delta\big)\to 0$ for all .
Example 2 (Random geometric graph). In this case the vertices are the points of a Poisson point process on with constant intensity . Any two vertices are connected by an edge if and only if for a fixed radius . We choose the root as the closest vertex to the origin and let be the connected component of . This graph is infinite with positive probability provided is larger than the percolation threshold for this model [P+03].
The calculations for Bernoulli bond percolation on can be applied almost verbatim to the random geometric graph. In particular, letting witnesses anchored–amenability and satisfies the tameness assumption.
4 Results for asymptotically amenable graphs
Recall that refers to the union of the vertex- and edge-observations over : . A natural way to construct an estimator is to first estimate the posterior marginals of given at every vertex:
[TABLE]
Letting be such estimates of the posterior marginals, we can construct , for instance, by independently sampling from the marginals: , for all .
Of course, computing the exact posterior probabilities is in general intractable. As a tractable alternative, we can compute a local version of the vertex marginals by using only observations in a ball of radius around each vertex. For and , let
[TABLE]
(Recall that $B_G(u,l)$ denotes the set of vertices within graph distance $l$ from $u$ in $G$.) The local marginals can be computed with complexity at most per vertex. The complexity of estimating all the vertex marginals is linear or nearly linear, under additional assumptions. In particular:
- If has degree bounded by independently of , then .
- If converges to a locally finite unimodular graph, then
[TABLE]
In other words, for each , there exists such that, for all large enough, all but a fraction of the vertices have neighborhood of size bounded by . Hence can be estimated for all but a fraction of the vertices in linear time.
Notice that we can safely neglect atypical vertices for our purposes. For instance, the matrix estimation risk (1.4) is bounded away from zero in the present setting (unless the channel is noiseless), and therefore ignoring a vanishing fraction of the vertices has a negligible impact on the asymptotic risk.
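The local marginal at a vertex can be computed by brute-force enumeration over all label assignments inside the ball, at a cost exponential in the ball size but independent of the size of the whole graph. The Python sketch below does this for synchronization over a cyclic group. It relies on assumptions that are not fixed by the text: the edge noise is zero with probability `1 - p` and uniform otherwise, erased side information is encoded as `None`, and the names `noise_prob` and `local_marginal` are hypothetical.

```python
import itertools

def noise_prob(z, q, p):
    """P(noise = z) under the assumed model: 0 w.p. 1 - p, else uniform on Z_q."""
    return (1.0 - p) * (z == 0) + p / q

def local_marginal(u, ball_vertices, ball_edges, Y, xi, q, p):
    """Posterior marginal of theta_u given the observations inside a ball.

    ball_vertices: vertices of the ball; ball_edges: directed edges (a, b)
    with both endpoints in the ball; Y[(a, b)]: observed noisy difference
    theta_a - theta_b mod q; xi[a]: revealed label or None.
    Runs in time proportional to q ** len(ball_vertices).
    """
    verts = sorted(ball_vertices)
    index = {v: i for i, v in enumerate(verts)}
    weights = [0.0] * q
    for assignment in itertools.product(range(q), repeat=len(verts)):
        # Discard assignments that contradict the revealed labels.
        if any(xi[v] is not None and xi[v] != assignment[index[v]] for v in verts):
            continue
        # Uniform prior and erasure probabilities are constant factors, so
        # only the edge likelihood matters up to normalization.
        likelihood = 1.0
        for (a, b) in ball_edges:
            z = (Y[(a, b)] - assignment[index[a]] + assignment[index[b]]) % q
            likelihood *= noise_prob(z, q, p)
        weights[assignment[index[u]]] += likelihood
    total = sum(weights)
    return [w / total for w in weights]
```

As a usage example, on a single edge `(0, 1)` with `q = 3`, observation `Y[(0, 1)] = 1`, revealed label `xi[1] = 1` and `p = 0.3`, the marginal at vertex 0 puts mass 0.8 on the label 2 and 0.1 on each of the other two labels.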
Do the local estimates provide good approximations of the actual marginals ? Our first result shows that this is the case for asymptotically amenable graphs, for almost all , and on average over vertices in .
Theorem B.
Let be a sequence of finite graphs (with ) that converges locally–weakly to a random rooted graph which is almost surely anchored–amenable and tame. Then for almost every ,
[TABLE]
The proof of this theorem follows from a technical result which we will present next.
We define an observation model on the infinite random graph exactly as for the finite graphs . We then let
[TABLE]
where we condition on the realization of the rooted graph $(G,o)$ and on the $\sigma$-algebra generated by the sequence of random variables $\big(Y^{(\varepsilon)}_{B_G(o,l)}\big)_{l\ge 0}$. Equivalently, this marginal can be defined as the almost-sure limit of the sequence $\big(\mathbb{P}(\theta_o=x\,|\,(G,o),Y^{(\varepsilon)}_{B_G(o,l)})\big)_{l\ge 0}$, where convergence is guaranteed by Lévy’s upward theorem. We have the following general relation between marginals on the finite graphs $G_n$, and marginals on the infinite rooted graph $(G,o)$.
Proposition 4.1.
Under the conditions of Theorem B, we have for all and almost every ,
[TABLE]
(The expectation on the right-hand side is w.r.t. the randomness of and .)
The proof of Proposition 4.1 is presented in Section 6. Theorem B is a consequence of Proposition 4.1 as shown below.
Proof of Theorem B.
We claim that $\mathbb{E}\big[\widehat{\mu}_{G_n,u,l}(x)\,\mu_{G_n,u}(x)\big]=\mathbb{E}\big[\widehat{\mu}_{G_n,u,l}(x)^2\big]$. Indeed, by conditioning on the observations $Y^{(\varepsilon)}_{B_{G_n}(u,l)}$ in the ball we obtain
[TABLE]
Now we use the fact that for two measures and on , :
[TABLE]
(Here and below denotes the norm of the vector .) The claim follows by averaging over , and applying Proposition 4.1.
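The conditioning step in this proof is an instance of the tower property of conditional expectation. The following sketch spells it out under two assumptions about the notation: that the local marginal $\widehat{\mu}_{G_n,u,l}(x)$ is the conditional probability of $\{\theta_u=x\}$ given the observations in the ball of radius $l$ around $u$, abbreviated here by the hypothetical shorthand $Y_B$, and that $\mu_{G_n,u}(x)$ is the conditional probability of the same event given all of $Y^{(\varepsilon)}_{G_n}$.

```latex
\begin{align*}
\mathbb{E}\big[\widehat{\mu}_{G_n,u,l}(x)\,\mu_{G_n,u}(x)\big]
  &= \mathbb{E}\Big[\widehat{\mu}_{G_n,u,l}(x)\,
     \mathbb{E}\big[\mu_{G_n,u}(x)\,\big|\,Y_B\big]\Big]
  && \text{(since $\widehat{\mu}_{G_n,u,l}(x)$ is a function of $Y_B$)}\\
  &= \mathbb{E}\big[\widehat{\mu}_{G_n,u,l}(x)^2\big]
  && \text{(tower property: $\mathbb{E}[\mu_{G_n,u}(x)\,|\,Y_B]=\widehat{\mu}_{G_n,u,l}(x)$)},\\[4pt]
\mathbb{E}\big[\big(\mu_{G_n,u}(x)-\widehat{\mu}_{G_n,u,l}(x)\big)^2\big]
  &= \mathbb{E}\big[\mu_{G_n,u}(x)^2\big]-\mathbb{E}\big[\widehat{\mu}_{G_n,u,l}(x)^2\big].
\end{align*}
```

The last line follows by expanding the square and substituting the first identity for the cross term, so the mean squared gap between the local and the full marginal equals the difference of their second moments.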
Note that Theorem B is not sufficient to establish Theorem A about the optimality of polynomial-time algorithms for estimating the pairwise correlations. Indeed, the latter requires approximating the joint distribution of the labels at two arbitrary vertices. In order to achieve this goal, we define a decoupled estimator:
[TABLE]
Note that may a priori have suboptimal accuracy. This is however not the case for almost all .
Proposition 4.2.
Let be defined as per Eq. (4.5). Then for almost every ,
[TABLE]
The proof of the above proposition can be found in Appendix A.
Given Theorem B and Proposition 4.2, it is natural to consider the following low complexity version of :
[TABLE]
Since we can compute for all but vertices in time , the overall complexity of is . (Setting for a sublinear fraction of vertices produces a negligible error.) We can now prove Theorem A.
Proof of Theorem A. Since Proposition 4.2 yields for almost all , we only need to compare the risks of and . We have
[TABLE]
We have
[TABLE]
By consecutive triangle inequalities, this is bounded in absolute value by
[TABLE]
Here, denotes the supremum norm of .
On the other hand, and following a similar strategy,
[TABLE]
Invoking Theorem B concludes the proof.
Theorem B and Proposition 4.1 allow us to control other metrics for the estimation error beyond the square-loss risk. As an example, we consider the ‘overlap’ metric that applies to estimators which assign labels to vertices. We define
[TABLE]
where is the set of permutations on , with .
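For small alphabets the overlap can be evaluated directly by enumerating all relabelings. The Python sketch below assumes the usual form of this metric, namely the fraction of vertices on which the estimate agrees with the truth, maximized over permutations of the alphabet; the displayed definition is not recoverable from the text here, so this form and the function name `overlap` are assumptions.

```python
import itertools

def overlap(theta_hat, theta, q):
    """Fraction of agreeing labels, maximized over relabelings of {0..q-1}.

    Assumed form of the metric: the maximum over permutations sigma of
    (1/n) * #{u : theta_hat[u] == sigma(theta[u])}.
    """
    n = len(theta)
    best = 0
    for sigma in itertools.permutations(range(q)):
        agree = sum(1 for a, b in zip(theta_hat, theta) if a == sigma[b])
        best = max(best, agree)
    return best / n
```

The maximization over permutations makes the metric insensitive to a global relabeling: an estimate that equals the truth up to a cyclic shift of the labels has overlap one.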
As a corollary of Proposition 4.1, the overlap between a sample from the local marginals and can be lower-bounded in a nontrivial way (the proof can be found in Appendix A):
Corollary 4.3.
For each let , let where independently for all . Then for almost every ,
[TABLE]
As the radius of the local balls increases, the performance of approaches that of a sample drawn from the full marginals .
5 Results for random regular graphs
The assumption of anchored–amenability is crucial in the proofs of Theorems A and B. While we do not know whether a weaker condition is sufficient, we show that these results do not hold for at least one non-amenable case, namely, when is a random -regular graph with constant degree . For the case of –synchronization we show that in a certain regime of signal-to-noise ratio (SNR), the local estimates of vertex marginals provide no information about the hidden assignment , while in the same regime, it is information-theoretically possible to estimate non-trivially.
As mentioned in the introduction, an information-computation gap has been observed in several statistical models. However, none of the rigorous results in the literature matches the setting of Theorems A and B. To the best of our knowledge, the closest example is the case of the stochastic block model with communities on sparse random graphs (see [Abb17] for a comprehensive survey and references therein). As explained in Section 3.1, this example fits our framework, although with being the complete graph. In particular, does not converge to a locally finite graph. In contrast, the example treated in this section satisfies all the assumptions of Theorems A and B except amenability (and tameness). Proofs for this section are deferred to Appendices B and C.
5.1 Information-theoretic reconstruction: An exhaustive search algorithm
Given a graph on vertices, and , we define the edge empirical distribution
[TABLE]
This is a probability distribution on : . (Recall that denotes the simplex of probability distributions over the set .) Define to be the uniform distribution on and via
[TABLE]
We then define the set of ‘typical’ assignments of node variables by
[TABLE]
We then consider the reconstruction algorithm that outputs a typical configuration
[TABLE]
If is empty, we define arbitrarily (for instance for a fixed reference configuration ). If contains more than one element, then selects one arbitrarily, e.g., the first one in lexicographic order. In fact our proofs apply to any algorithm that satisfies condition (5.2) with high probability. As discussed below (see the Remark following Lemma 5.1) this condition is also satisfied by the randomized estimator \hat{\bm{\theta}}\sim\operatorname{\mathbb{P}}\big{(}\cdot|Y^{(\varepsilon)}_{G_{n}}\big{)} that samples from the posterior.
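The exhaustive search described above can be sketched in Python for very small instances. Several ingredients are assumptions made for illustration, since the displayed definitions are not reproduced here: the reference distribution is taken to be uniform on both endpoint labels times the channel `Q`, the distance between distributions is taken to be the l1 norm, and the names `edge_empirical`, `reference_distribution`, `exhaustive_search`, `eta` and `fallback` are hypothetical.

```python
from collections import Counter
from itertools import product


def edge_empirical(edges, theta, Y):
    """Empirical distribution of (theta_u, theta_v, Y_uv) over the edges."""
    counts = Counter((theta[u], theta[v], Y[(u, v)]) for (u, v) in edges)
    m = len(edges)
    return {key: c / m for key, c in counts.items()}


def reference_distribution(q, Q):
    """Assumed form: nu(x, x', y) = (1/q) * (1/q) * Q[(x, x')][y]."""
    return {(x, xp, y): Q[(x, xp)][y] / q**2
            for x in range(q) for xp in range(q) for y in Q[(x, xp)]}


def l1_distance(p, r):
    """l1 distance between two distributions stored as sparse dicts."""
    keys = set(p) | set(r)
    return sum(abs(p.get(k, 0.0) - r.get(k, 0.0)) for k in keys)


def exhaustive_search(n, q, edges, Y, Q, eta, fallback=None):
    """Return the first 'typical' assignment in lexicographic order.

    An assignment is treated as typical when its edge empirical
    distribution is within l1 distance eta of the reference distribution.
    If no assignment qualifies, return the fallback configuration.
    """
    nu = reference_distribution(q, Q)
    for theta in product(range(q), repeat=n):  # lexicographic order
        if l1_distance(edge_empirical(edges, theta, Y), nu) <= eta:
            return theta
    return fallback if fallback is not None else (0,) * n
```

The loop visits all q^n assignments, which is why this estimator serves only as an information-theoretic benchmark and not as a practical algorithm.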
It is immediate to show that the typical set is non-empty with high probability. (Throughout this section, we use for the ground truth, in order to distinguish it from a generic vector .)
Lemma 5.1**.**
Let be a random -regular graph on vertices, and let be distributed according to the random observation model described in the Introduction. Then, there exists such that
[TABLE]
Remark**.**
As mentioned above, one might consider a randomized estimator that outputs a sample from the posterior: \hat{{\bm{\theta}}}\sim\operatorname{\mathbb{P}}\big{(}\,\cdot\,|Y^{(\varepsilon)}_{G_{n}}\big{)}. Note that this satisfies the condition (cf. Eq. (5.2)) with the same probability . Indeed this follows simply by noting that, with this definition, the pair is distributed as . Therefore all the results to follow apply to this randomized estimator as well.
Given two assignments , we define their joint empirical vertex distribution as
[TABLE]
This is a probability distribution on : .
To state the next result let us briefly recall some notions from information theory. Given a discrete random variable (or random vector) , we denote by the Shannon entropy of the law of , namely –with a slight abuse of notation– . For a vector , . The conditional entropy is defined by , and the mutual information by .
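For reference, the standard forms of these quantities are written out below for discrete random variables $X$, $Y$, $Z$, with $p_X$ the law of $X$. These are the textbook definitions; the paper's own notation may differ in minor ways.

```latex
\begin{align*}
H(X) &= -\sum_{x} p_X(x)\,\log p_X(x), \\
H(X \mid Y) &= H(X,Y) - H(Y), \\
I(X;Y) &= H(X) - H(X \mid Y) \;=\; H(X) + H(Y) - H(X,Y), \\
I(X;Y \mid Z) &= H(X \mid Z) - H(X \mid Y, Z).
\end{align*}
```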
Theorem C**.**
Assume there exists such that for all , , and let have joint distribution (recall that where is the uniform distribution over ). If
[TABLE]
for some , then there exists and a constant such that
[TABLE]
The proof of this theorem relies on a truncated first moment method, where we count the expected number of typical assignments having a given value of the empirical overlap distribution , conditioned on certain typicality constraints on the instance . The full argument is deferred to Appendix B.2. (We refer, e.g., to [DMS13] for similar calculations in a somewhat simpler context.)
The next corollary applies the result of Theorem C to –synchronization.
Corollary 5.2**.**
Consider the –synchronization problem. If
[TABLE]
then there exists depending on such that, with probability at least , .
Furthermore, as , we have
[TABLE]
This corollary follows from Theorem C simply by computing in the case of –synchronization. We omit the details. Finally, we deduce from Theorem C the possibility of weak recovery.
Corollary 5.3**.**
Under the assumptions of Theorem C, if , then there exists a constant such that
[TABLE]
Moreover, there exists a function with zero mean, unit variance, and a constant such that
[TABLE]
In particular, the conclusions (5.4) and (5.5) hold in the –synchronization model if .
5.2 Performance of the local algorithm
In this section we examine the asymptotics of the local marginals
[TABLE]
when is a random -regular graph, in the special case of –synchronization with side information from BEC().
We have seen in the previous section that weak recovery is possible (albeit non-efficiently) when even in the absence of side information (Corollary 5.3). We show on the other hand that the local marginals are approximately uniform if . The latter condition is known as the Kesten-Stigum threshold for the problem of robust reconstruction on the tree [JM04].
Theorem D**.**
Consider –synchronization with side information from BEC() on a random -regular graph . There exist constants and such that the following holds. If and then
[TABLE]
The above theorem implies that all estimators where independently for all , have almost trivial performance. Recall the definition of the matrix :
[TABLE]
Corollary 5.4**.**
In the setting of Theorem D, if , then there exist constants depending on and such that
[TABLE]
Moreover, for all with zero mean and unit variance,
[TABLE]
Remark**.**
The above implies that no local algorithm can estimate with non-trivial accuracy. Indeed, the estimator of of minimal risk based on the information contained in the balls of radius centered around and respectively is \operatorname{\mathbb{E}}\big{[}f(\theta_{u})f(\theta_{v})|Y^{(\varepsilon)}_{B_{G_{n}}(u,l)\cup B_{G_{n}}(v,l)}\big{]}. The latter quantity is equal to if the two balls are disjoint, which is the case for fraction of pairs of vertices when is held constant.
The proof of Theorem D is deferred to Appendix C.1, but we give here an outline. We use local–weak convergence to first lift the problem to the infinite -regular tree, in which the study of the local marginals reduces to the study of a certain distributional recursion. Then we prove that below the Kesten-Stigum threshold, the uniform distribution is a stable fixed point of this recursion. The argument proceeds as follows. Let be the root of the infinite –ary tree and denote by the subtree consisting of the first generations of rooted at . Now let \mu_{o,l}(x):=\operatorname{\mathbb{P}}\big{(}\theta_{o}=x|{Y}^{(\varepsilon)}_{T_{k}(l)}\big{)} for all and consider the sequence which measures the deviation from uniformity of the local marginal at the root. We use the recursive structure of the tree to show that for small enough and , the sequence satisfies the approximate recursion
[TABLE]
where is a constant depending only on . Since , this implies that if then the sequence stays within an interval of size around the origin. This, in turn, can be converted to the claim of Theorem D. The analysis of this recursion originates in the study of the robust reconstruction problem on the tree. In this problem, a spin at the root (an -valued r.v.) is broadcast through noisy channels along edges of the tree. The statistician observes a noisy realization of this process on the leaves of for large , and is tasked with inferring the value at the root (see e.g., [EKPS00, MP03, JM04]). Similar recursions also arise in the study of the ‘robustness’ of phase transitions in the Ising model on the tree [PS99]. In particular, our analysis builds on ideas from [MM06, Sly11].
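The stability mechanism behind this outline can be illustrated numerically. The sketch below iterates a scalar recursion of the hypothetical form z_{l+1} = a z_l + C z_l^2 + C eps, where `a` stands in for the contraction factor that is below one under the Kesten-Stigum condition, and `C` and `eps` are illustrative constants. This form is an assumption modeled on the description above, not the exact recursion of the paper.

```python
import math


def iterate_recursion(a, C, eps, z0, steps):
    """Iterate the worst case z_{l+1} = a*z_l + C*z_l**2 + C*eps."""
    zs = [z0]
    for _ in range(steps):
        z = zs[-1]
        zs.append(a * z + C * z * z + C * eps)
    return zs


def invariant_level(a, C, eps):
    """Smallest root of C*z^2 - (1 - a)*z + C*eps = 0, or None if no real root.

    Any z in [0, root] is mapped back into [0, root] by the recursion,
    so the root is an invariant level of order eps when eps is small.
    """
    disc = (1 - a) ** 2 - 4 * C * C * eps
    if disc < 0:
        return None
    return ((1 - a) - math.sqrt(disc)) / (2 * C)
```

With `a < 1` and `eps` small, the iterates started near zero stay below the invariant level, which is of order `eps`. When `a` is close to one the discriminant becomes negative and no such level exists, mirroring the role of the threshold condition.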
6 Proof of Proposition 4.1
We start with the proof of (4.3), which is straightforward and does not need the amenability assumption. The proof of (4.4) will crucially hinge upon a property of decay of certain point–to–set correlations (Lemma 6.2), which we establish using anchored–amenability and the presence of –side information. For ease of notation, we adopt the following convention in this section: in quantities of the form \operatorname{\mathbb{P}}\big{(}\theta_{o}=x|Y^{(\varepsilon)}_{A}\big{)} where is any subgraph of , it is implicit that the rooted graph is also conditioned on, abbreviating the more accurate but lengthier notation \operatorname{\mathbb{P}}\big{(}\theta_{o}=x|(G,o),Y^{(\varepsilon)}_{A}\big{)}.
6.1 Proof of the ‘local’ statement (4.3)
Let and . The function defined by f(G,o)=\operatorname{\mathbb{E}}\big{[}\operatorname{\mathbb{P}}\big{(}\theta_{o}=x|Y^{(\varepsilon)}_{[G,o]_{l}}\big{)}^{2}\big{]} is clearly continuous in the topology of local convergence. Indeed for , let such that for all . Hence for all . Since is also bounded, we obtain by local–weak convergence under uniform rooting that
[TABLE]
Next, we observe that the sequence \big{(}\operatorname{\mathbb{P}}\big{(}\theta_{o}=x|(G,o),Y^{(\varepsilon)}_{[G,o]_{l}}\big{)}\big{)}_{l\geq 1} is a bounded martingale, therefore it converges almost surely and in to \operatorname{\mathbb{P}}\big{(}\theta_{o}=x|Y^{(\varepsilon)}_{G}\big{)} by Lévy’s upward theorem. This concludes the proof of the first statement (4.3):
[TABLE]
6.2 Proof of the ‘global’ statement (4.4)
The proof breaks into three parts. First, we easily obtain a lower bound from Jensen’s inequality:
[TABLE]
Therefore
[TABLE]
where the last equality is the content of statement (4.3). As for the upper bound, we have
Lemma 6.1**.**
Consider the -algebra
[TABLE]
where is the distance in . Then
[TABLE]
Proof.
Fix and . We condition on the r.v.’s and apply Jensen’s inequality:
[TABLE]
We now observe that conditionally on the boundary variables , is independent of for all and outside the ball . This is guaranteed by the spatial Markov property of the model. Therefore
[TABLE]
The event on the right–hand side is localized to a ball of fixed radius. So by local–weak convergence, we pass to the limiting rooted graph , (similarly to the proof of (4.3)):
[TABLE]
Now using the same Markov property as above, the expectation in the right–hand side remains unchanged if we further condition on and , which are beyond the boundary of : the extra information is irrelevant to . We arrive at the upper bound
[TABLE]
Now we observe that the sequence \big{(}\operatorname{\mathbb{P}}\big{(}\theta_{o}=x|(G,o),Y^{(\varepsilon)}_{G},{\mathcal{F}}_{o}^{\geq l}\big{)}\big{)}_{l\geq 1} is a bounded backward martingale (since the corresponding filtration is decreasing), which converges to \operatorname{\mathbb{P}}\big{(}\theta_{o}=x|(G,o),\mathcal{T}^{(\varepsilon)}_{\infty}\big{)} a.s. and in by Lévy’s downward theorem. This concludes the argument.
The last piece of the proof is to show that the lower and upper bounds (6.1) and (6.2) coincide when is a.s. anchored–amenable:
Proposition 6.2**.**
Assume is unimodular, almost surely anchored–amenable and tame. Then for almost every and all ,
[TABLE]
This is the only part of the proof which requires assumptions on the limiting random rooted graph, and the presence of non-zero side information from BEC(). We reiterate that unimodularity is guaranteed if is the limit of a sequence of finite graphs (see Section 3), so it is automatically satisfied in our setting.
The first ingredient in the proof of Proposition 6.2 is the following generic lemma, which allows us to control the dependency under the posterior between variables associated to vertices in the interior of a set and variables associated to the boundary of this set. Our first lemma bounds the mutual information between and . This result is inspired by Lemma 3.1 in [Mon08]. Let us recall the definition of conditional mutual information between and given : , where is the conditional entropy.
Lemma 6.3**.**
Let be a graph, and finite and non-empty. For all , we have
[TABLE]
Proof.
The argument relies on differentiating the conditional Shannon entropy of given with respect to . Let us first replace the single parameter (the probability of non-erasure) by a set of parameters : for each vertex , is revealed with probability . We also replace the notation by , omitting an explicit reference to and to the ball . We finally denote with removed. We have
[TABLE]
Taking a derivative w.r.t. yields:
[TABLE]
where the latter is the conditional mutual information of and given . Now we set for all . We obtain
[TABLE]
We now integrate w.r.t. :
[TABLE]
The second line is by positivity of entropy, the third line follows from the fact that conditioning reduces the entropy, the fourth line is by sub-additivity, and the last line is since is marginally uniform on . Now we finish the proof by observing that I\big{(}\theta_{u};\theta_{\partial S}|Y,\xi\big{)}=I\big{(}\theta_{u};\theta_{\partial S}|Y,\xi^{\backslash(u)},\xi_{u}\big{)}\leq I\big{(}\theta_{u};\theta_{\partial S}|Y,\xi^{\backslash(u)}\big{)} because the left–hand side vanishes whenever .
Next, we translate Lemma 6.3 into an average statement about decay of point–to–set correlations:
Lemma 6.4**.**
Let be a graph and finite and non-empty. Then for all ,
[TABLE]
Proof.
Let so that for every . By Jensen’s inequality we have
[TABLE]
and
[TABLE]
Therefore,
[TABLE]
The last quantity is equal to
[TABLE]
Summing over and using (6.3) we get
[TABLE]
We send to infinity and use martingale convergence on the left-hand side, and use Pinsker’s inequality on the right-hand side to obtain
[TABLE]
Now we obtain the desired result by averaging over and and using Jensen’s inequality, and then invoking Lemma 6.3:
[TABLE]
Now we are in a position to prove Proposition 6.2.
Proof of Proposition 6.2.
Assume is almost surely anchored–amenable and tame. Let be the sequence of finite measurable random subsets of satisfying the conditions of Definition 3.5 (recall in particular that .) We use Lemma 6.4 with this choice of sequence , and then average over the realization of the rooted graph :
[TABLE]
where by an application of dominated convergence (since almost surely by assumption). Now we let be defined by
[TABLE]
With this notation, expression (6.4) is equal to \operatorname{\mathbb{E}}_{\rho}\big{[}\sum_{u\in V(G)}f_{k}(G,o,u)\big{]}. By unimodularity of , this is also equal to
[TABLE]
where . Now we use the tameness assumption: the sequence is tight. Let be the above integral over (so that the above display is .) For let such that
[TABLE]
Since all involved quantities are nonnegative, we have
[TABLE]
Since a.s., we obtain
[TABLE]
where we have used (6.4) and (6.2) to obtain the last display. Letting and then we obtain for all ,
[TABLE]
and this concludes the proof.
Acknowledgements
This work was partially supported by grants NSF DMS-1613091, CCF-1714305, IIS-1741162, and ONR N00014-18-1-2729.
Appendix A Amenable graphs: some omitted proofs
A.1 Proof of Proposition 4.2
The proof is based on a decoupling principle under -perturbation of a general observation channel. This principle is given in Lemma 3.1 in [Mon08], which once specialized to our setting, takes the following form:
Lemma A.1** (Lemma 3.1 [Mon08]).**
For all , it holds that
[TABLE]
This is very similar to our Lemma 6.3. In fact the latter follows the same line of proof.
Recall the definition of the decoupled estimator
[TABLE]
For a pair of vertices we let \mu_{u,v,G_{n}}(x,x^{\prime}):=\operatorname{\mathbb{P}}\big{(}\theta_{u}=x,\theta_{v}=x^{\prime}\big{|}Y^{(\varepsilon)}_{G_{n}}\big{)}, for . Expanding the squares and cancelling equal terms we have
[TABLE]
Moreover,
[TABLE]
and
[TABLE]
Therefore,
[TABLE]
We used Pinsker’s inequality and Jensen’s inequality in the last line. We apply Lemma A.1 and Jensen’s inequality and obtain for all ,
[TABLE]
Since the integrand is non-negative, it too converges to zero almost everywhere.
A.2 Proof of Corollary 4.3
The proof follows from statement (4.3) of Proposition 4.1 since
[TABLE]
Here in we used the fact that, by construction and in the remark, already made in the proof of Theorem B, that \operatorname{\mathbb{E}}\big{[}\widehat{\mu}_{G_{n},u,l}(x)\mu_{G_{n},u}(x)\big{]}=\operatorname{\mathbb{E}}\big{[}\widehat{\mu}_{G_{n},u,l}(x)^{2}\big{]}.
Appendix B Information-theoretic reconstruction on random graphs: Technical proofs
B.1 Proof of Lemma 5.1
This is a consequence of McDiarmid’s bounded differences inequality. For and , we let , and let . Since
[TABLE]
we have
[TABLE]
We associate to each edge an independent random variable . We can then construct a function , such that . Hence we can define by letting for each , and we view as a function of the independent random variables .
Moreover, if we change the value at vertex to and call the resulting value of , we have (recall that is the degree of and ). If we further change to at an edge , we have . The bounded differences inequality then implies
[TABLE]
Now we let and .
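For convenience, the bounded differences inequality used in this proof is recalled below in its standard form. Here $X_1,\dots,X_m$ are independent random variables and $F$ changes by at most $c_i$ when only its $i$-th argument is modified; these symbols are generic and not tied to the notation of this appendix.

```latex
\[
\mathbb{P}\Big( \big| F(X_1,\dots,X_m) - \mathbb{E}\, F(X_1,\dots,X_m) \big| \ge t \Big)
\;\le\; 2 \exp\!\Big( - \frac{2 t^2}{\sum_{i=1}^{m} c_i^2} \Big),
\qquad t > 0 .
\]
```

In the proof above, the independent coordinates are the vertex labels and the per-edge random variables, and the constants $c_i$ are the two bounded-difference estimates just derived.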
B.2 Proof of Theorem C: A truncated first moment method
Instead of working directly with the ensemble of random regular graphs, we will use the configuration model [Bol80] for our moment computations. Let be even and let be the set of perfect matchings on vertices. For we define the multi-graph on vertices where a vertex in is sent to a vertex in through the mapping . The resulting multi-graph may contain multiple edges and self-loops. The configuration model is the probability measure on multi-graphs induced by the uniform measure on perfect matchings through the above mapping. The measure conditioned on the multi-graph being simple (i.e., not having self-loops nor multiple edges) is the uniform measure on -regular graphs . The probability that is simple under is for large by a formula of McKay and Wormald [Wor99]. Therefore, for any event and sequence , implies with depending only on .
Let be from the configuration model with . We will assume edges to be directed, and the direction to be chosen uniformly at random. The number of such graphs is
[TABLE]
Indeed, is the number of ordered pairings of the half-edges. Such a pairing can be constructed by ordering the half-edges (which can be done in possible ways), and then pairing consecutive half edges following this ordering. Each pairing can arise in possible ways.
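The construction just described translates directly into a short sampler. The Python sketch below shuffles the half-edges and pairs consecutive ones, which yields a uniformly random perfect matching with a uniformly random direction on each edge. The convention that half-edge `h` is attached to vertex `h // k` is an assumption made here for concreteness, and the helper names are hypothetical. The last function checks by brute force, for small `m`, that the number of directed pairings of `2m` half-edges is (2m)!/m!, which appears to be the count described above.

```python
import random
from itertools import permutations


def configuration_model(n, k, rng):
    """Sample a directed k-regular multigraph on n vertices (n*k even)."""
    assert (n * k) % 2 == 0
    half_edges = list(range(n * k))
    rng.shuffle(half_edges)  # uniformly random ordering of the half-edges
    edges = []
    for i in range(0, n * k, 2):
        a, b = half_edges[i], half_edges[i + 1]
        edges.append((a // k, b // k))  # assumed map: half-edge h -> vertex h // k
    return edges


def is_simple(edges):
    """True if the multigraph has no self-loops and no multiple edges."""
    seen = set()
    for u, v in edges:
        key = (min(u, v), max(u, v))
        if u == v or key in seen:
            return False
        seen.add(key)
    return True


def count_directed_pairings(m):
    """Brute-force count of sets of m disjoint ordered pairs on 2m items."""
    seen = set()
    for order in permutations(range(2 * m)):
        seen.add(frozenset((order[i], order[i + 1]) for i in range(0, 2 * m, 2)))
    return len(seen)
```

Rejecting samples until `is_simple` returns true gives a uniform simple k-regular graph, at the cost of a number of retries that depends on k. The brute-force count is factorial in `m` and is only meant as a sanity check for `m` up to 3 or 4.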
We next state a standard counting lemma that will be useful in what follows. Given finite alphabets , and integers with even, let be the subset of probability distributions such that for all , , and for all .
Given , we let , . We further let .
Recall that Shannon entropy of a probability distribution on the finite set is , and the joint empirical edge distribution of on a graph is
[TABLE]
Lemma B.1**.**
For such , let be the number of triples where is a graph from the configuration model, , , with edge empirical distribution equal to . Let . Then
[TABLE]
Proof.
Recall that is the number of edges in . Note that is the number of edges such that , and is the number of edges such that . Therefore is the number of vertices such that . Further is the number of edges such that and .
Given a non-negative integer vector with , we denote the corresponding multinomial coefficient by
[TABLE]
We then obtain the following exact counting formula (where and ):
[TABLE]
The first factor accounts for the number of ways of choosing . The second corresponds to the ways of giving a matching type to half-edges. The third factor counts the number of ways of matching half-edges, and the last one the number of ways of assigning labels in to edges.
Equation (B.1) follows by using the following elementary bounds (that hold for any and any ):
[TABLE]
Now recall the joint empirical distribution of two assignments :
[TABLE]
Further, let , being the uniform distribution on , and
[TABLE]
Given a graph , a true assignment , observations , and a closed set we define
[TABLE]
where . We denote by the set of instances, i.e., triples , where is a graph over vertices, and .
Lemma B.2**.**
Assume there exists such that for all , . Define the map by
[TABLE]
(Here , are defined as in Lemma B.1, with , and denotes the Shannon entropy.) Further define by
[TABLE]
There is a set of ‘good’ instances such that the following happens. For a closed set, we have
[TABLE]
Proof.
Given a tuple , where is a graph, , , we define its joint edge empirical distribution as
[TABLE]
In other words is the probability that, sampling an edge uniformly at random, we have , , , , . Let be the subset of probability distributions with entries that are integer multiples of . For , we let denote the number of tuples with edge empirical distribution equal to :
[TABLE]
Notice that setting , we can view as a vector in and as a probability distribution in . Applying Eq. (B.1) and Lemma B.1, we get
[TABLE]
We define
[TABLE]
Then Eq. (B.6) follows immediately from Lemma 5.1.
We also define . With this notation
[TABLE]
and therefore, using Eq. (B.9),
[TABLE]
Recall the definition of from Eq. (B.7). We observe that this empirical measure has the following marginals:
[TABLE]
Moreover, if does not vanish, we have
[TABLE]
Therefore the summand in the formula (B.10) depends only on the empirical edge distribution of the instance . Now let be the set of satisfying the constraints
[TABLE]
We have
[TABLE]
We applied Lemma B.1 in the last line above. Due to the second constraint in (B.11), we can upper bound as follows
[TABLE]
Therefore, letting
[TABLE]
we arrive at
[TABLE]
which implies the claim.
The next result provides a sufficient condition for weak recovery using the estimator satisfying Eq. (5.2); this is a more general version of Theorem C.
Theorem E**.**
Assume there exists such that for all , . Assume . Then there exists such that, with probability at least , the following happens
[TABLE]
Proof.
Recall that denotes the set of probability distributions such that . We claim that, under the stated assumptions there exists such that, setting , and as in Lemma B.2, we have
[TABLE]
Hence, applying Lemma B.2, it follows that, with probability at least (possibly adjusting the constant ), . Hence by construction of , and therefore the claim follows.
We are left with the task of proving Eq. (B.13), which by Lemma B.2 and a continuity argument, follows from .
The condition might be hard to verify in practice because it requires solving the optimization problem (B.5). We provide a simpler sufficient condition, which is the content of Theorem C:
Lemma B.3**.**
Let , with . We have .
Proof.
Let be any distribution achieving the maximum in (B.5) for , and let have distribution . Note that , , , , , whence
[TABLE]
Step follows by sub-additivity of entropy.
Hence, if , then , and the claim follows by applying Theorem E.
B.3 Proof of Corollary 5.3
Let be the set of all non-negative doubly stochastic matrices (with ). It holds that
[TABLE]
Indeed, since the right-most expression in the above display is a linear program, the objective value is maximized at the extreme points of the polytope , which by Birkhoff’s theorem are permutation matrices: for , hence the equality.
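The extreme-point argument can be checked numerically on small instances. The Python sketch below compares the best value of a linear objective over permutation matrices with its value at randomly generated doubly stochastic matrices, built as convex combinations of permutation matrices (by Birkhoff's theorem, every doubly stochastic matrix has this form). The weight matrix `M` and all function names are hypothetical stand-ins for the objective in the display above.

```python
import random
from itertools import permutations


def max_over_permutations(M):
    """Maximum of sum_x M[x][sigma(x)] over all permutations sigma."""
    q = len(M)
    return max(sum(M[x][s[x]] for x in range(q)) for s in permutations(range(q)))


def random_doubly_stochastic(q, rng, terms=5):
    """Random convex combination of `terms` permutation matrices."""
    weights = [rng.random() for _ in range(terms)]
    total = sum(weights)
    D = [[0.0] * q for _ in range(q)]
    for w in weights:
        s = list(range(q))
        rng.shuffle(s)
        for x in range(q):
            D[x][s[x]] += w / total
    return D


def linear_objective(M, D):
    """Linear functional <M, D> = sum_{x,y} M[x][y] * D[x][y]."""
    q = len(M)
    return sum(M[x][y] * D[x][y] for x in range(q) for y in range(q))
```

For any weight matrix, `linear_objective(M, D)` never exceeds `max_over_permutations(M)` up to rounding, since the objective at a convex combination is the same convex combination of its values at the permutation matrices.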
Since (we abused notation and identified the joint distribution on with a matrix), we have
[TABLE]
Now, on the event , we have . Hence on the same event.
Next, we prove the second statement. For two functions , we let . Theorem C implies
[TABLE]
Indeed, if then there exist such that . Now take and .
On the other hand, letting , a union bound implies
[TABLE]
Therefore, there exists a (deterministic) pair such that \operatorname{\mathbb{P}}\big{(}|\hat{\omega}_{\hat{\bm{\theta}},{\bm{\theta}}_{0}}(f,g)|\geq\frac{\delta}{q(q-1)}\big{)}\geq\frac{1-o_{n}(1)}{q^{2}}>c_{0}>0. By Markov’s inequality, this in turn implies that for this specific pair we have
[TABLE]
Now consider estimating the matrix (recall that ) with the matrix having entries , with \lambda=n^{2}\operatorname{\mathbb{E}}\big{[}\hat{\omega}_{\hat{\bm{\theta}},{\bm{\theta}}_{0}}(f,g)^{2}\big{]}\big{/}\operatorname{\mathbb{E}}\big{[}\|\widehat{{\bm{X}}}^{(1)}\|_{F}^{2}\big{]}. Since
[TABLE]
the loss incurred is
[TABLE]
We have . So . Furthermore, since , . Combining these estimates with the lower bound (B.15) implies \limsup\mathcal{R}_{n}\big{(}\widehat{{\bm{X}}}^{(\lambda)};f\big{)}<1-c(q)\delta^{2}. Since \mathcal{R}_{n}^{\textup{Bayes}}(f)\leq\mathcal{R}_{n}\big{(}\widehat{{\bm{X}}}^{(\lambda)};f\big{)} this concludes the proof.
Appendix C Local algorithms on random graphs: Technical proofs
C.1 Proof of Theorem D
C.1.1 Preliminaries
Let denote the infinite -regular tree rooted at . (Except the root , every vertex has offsprings.) By expanding the square, we get
[TABLE]
(Here, is the distance in .) Since the graph sequence almost surely converges locally–weakly to (a Dirac delta on) , we have
[TABLE]
Recall
[TABLE]
Let be the conditional law of given the value at the root being and no information revealed by the side channel. This is a probability distribution on the simplex : . Furthermore, let . The following simple lemma from [MM06] is quite useful.
Lemma C.1**.**
For every , has a density w.r.t. , and for all .
Proof.
Let be bounded measurable. We let . Then
[TABLE]
Therefore .
With the above lemma in hand, the right-hand side in (C.1) can be written as
[TABLE]
The first equality follows by conditioning on and noting that, conditional on , . Lemma C.1 was used to obtain the second equality.
In light of the above expression, we will track the evolution of the sequence
[TABLE]
which measures the deviation from uniformity of the local marginal at the root. In order to exploit the recursive structure of the tree, we will need to work at the level of the first offsprings of . For every offspring of , we denote by the first generations of the subtree rooted at not containing ; this is a –ary tree. Now, (with a slight notation override) we redefine
[TABLE]
and consider the auxiliary sequence
[TABLE]
Note that the above definition does not depend on since have the same distribution for all . In the next proposition, we relate the two sequences and , and establish a recursion for the latter.
Proposition C.2**.**
Let and . There exist constants depending only on such that the following holds. If for some , and , then
[TABLE]
The proof of this proposition is presented in Section C.1.2. Theorem D follows directly from Proposition C.2, as shown in the next corollary.
Corollary C.3**.**
If and for a constant then there exists such that for all .
Proof.
We only need to prove that , which we will achieve by induction. Since , let’s assume that for a fixed . Then we obtain from Proposition C.2 that
[TABLE]
It suffices to find an (independent of ) such that the above upper bound is smaller than for all . This is equivalent to the quadratic inequality . The smallest solution to this inequality is , with , and . The latter is non-negative provided that for some constant . Moreover, for small enough we can write , so that . Therefore, we can take .
C.1.2 Proof of Proposition C.2: Analysis of the recursion on the tree
Here, we prove Proposition C.2. The two statements can be treated in exactly the same way; the only difference being that the root has children, while every other vertex has children. For this reason we only write a detailed proof for the first statement; the second one is obtained merely by replacing by .
Observe that conditional on the marginal at is obtained from the marginals at its offsprings by a sum-product relation which, in the case of –synchronization, has the form
[TABLE]
where is the normalizing constant, and is the Markov transition matrix associated to a ‘broadcasting process’ on the tree according to the –synchronization model.
The recursion (C.2) induces a deterministic recursion over probability distributions over the simplex . Namely, if we define , we obtain a recursion that determines in terms of (notice that, by Lemma C.1, once is given for one value of , it is determined for the other values as well.) The laws of are given by for all . Note that this law does not depend on since are i.i.d. given . Then can be obtained from as follows:
1. Draw and independently and uniformly at random from .
2. Construct according to the –synchronization model (1.1).
3. Draw from independently for each .
4. Construct a distribution according to (C.2).
5. Then, given , has the same law as .
We now analyze the map described above. Define
[TABLE]
so , where we have dropped the indices for convenience. Following the analysis of [Sly11], we use the identity with , and to write
[TABLE]
Next we compute the conditional expectations of and (given and ) in order to control .
Lemma C.4**.**
Let for . For all , we have
[TABLE]
and
[TABLE]
Proof.
We start with the first identity (C.4). Since the distributions are conditionally independent given , we have
[TABLE]
Moreover,
[TABLE]
The first term in the right-hand side is . The second term is
[TABLE]
Therefore
[TABLE]
So we obtain
[TABLE]
where is an arbitrary offspring since terms participating in the product are all equal. Now we deal with the second identity (C.5):
[TABLE]
Similarly to a previous computation, we have
[TABLE]
and
[TABLE]
Combining and rearranging terms we obtain the desired result.
Now we use the expressions just obtained to produce Taylor estimates for each term in the decomposition (C.3).
Lemma C.5**.**
Let and . There exist constants depending only on such that if and , then
[TABLE]
Proof.
We use for all such that . Applying this to (C.4) yields (C.6). For and we have
[TABLE]
Next, we use (C.5), combined with the fact to obtain that if for some constant then
[TABLE]
where gathers all the terms other than 1 in the expression (C.5), and the constant depends on . We use the inequality to obtain
[TABLE]
The last term was obtained by using Cauchy-Schwarz on the term in (C.5) and then replacing sums over by maxima. Now it remains to show that the last two terms in (C.1.2) are bounded by . Starting with the last term, we have
[TABLE]
As for the remaining term,
Lemma C.6**.**
We have .
This implies . This, combined with (C.6), allows us to deduce (C.7). Now we treat the last term (C.8):
[TABLE]
Similarly to our treatment of the quantity , we use expression (C.5) and perform a Taylor expansion to obtain
[TABLE]
Using (C.4) the cross term can be estimated as
[TABLE]
Now we conclude
[TABLE]
Proof of Lemma C.6. For , using Lemma C.1 we have . The last equality follows from . Then
[TABLE]
Inequality follows from , and follows from Cauchy-Schwarz. Lastly, we have
Now we plug the estimates of Lemma C.5 in (C.3). Using the fact , we obtain
[TABLE]
where is a constant that depends only on .
C.2 Proof of Corollary 5.4
We first prove the result concerning the overlap with . Let be a fixed permutation. We have
[TABLE]
The last line follows by Cauchy-Schwarz and then Jensen’s inequality. Averaging over , applying Jensen’s inequality once more, and then using Theorem D yields the first statement.
Next, let with and . The loss of is
[TABLE]
We have . So . On the other hand, since , we have
[TABLE]
where , . On the other hand we have
[TABLE]
We use Cauchy-Schwarz inequality and the fact to obtain
[TABLE]
We apply Theorem 5.6 to obtain \limsup_{l}\limsup_{n}\frac{1}{n^{2}}\operatorname{\mathbb{E}}\big{\langle}\widehat{{\bm{X}}}^{(l)},{\bm{X}}_{f}\big{\rangle}\leq C\|f\|_{\infty}^{2}\varepsilon, and this yields the desired result.
References
- [AB18] Emmanuel Abbe and Enric Boix, An information-percolation bound for spin synchronization on general graphs, arXiv:1806.03227 (2018).
- [Abb17] Emmanuel Abbe, Community detection and stochastic block models: recent developments, The Journal of Machine Learning Research 18 (2017), no. 1, 6446–6531.
- [ABRS18] Emmanuel Abbe, Enric Boix, Peter Ralli, and Colin Sandon, Graph powering and spectral robustness, arXiv:1809.04818 (2018).
- [AKV02] Noga Alon, Michael Krivelevich, and Van H Vu, On the concentration of eigenvalues of random symmetric matrices, Israel Journal of Mathematics 131 (2002), no. 1, 259–267.
- [AL07] David Aldous and Russell Lyons, Processes on unimodular random networks, Electron. J. Probab. 12 (2007), no. 54, 1454–1508.
- [AMM+17] Emmanuel Abbe, Laurent Massoulie, Andrea Montanari, Allan Sly, and Nikhil Srivastava, Group synchronization on grids, arXiv:1706.08561 (2017).
- [AW09] Arash A. Amini and Martin J. Wainwright, High-dimensional analysis of semidefinite relaxations for sparse principal components, Annals of Statistics 37 (2009), no. 5B, 2877–2921.
- [BHK+16] Boaz Barak, Samuel B Hopkins, Jonathan Kelner, Pravesh Kothari, Ankur Moitra, and Aaron Potechin, A nearly tight sum-of-squares lower bound for the planted clique problem, 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), IEEE, 2016, pp. 428–437.
