Uniform estimation in stochastic block models is slow
Isma\"el Castillo, Peter Orbanz

TL;DR
This paper demonstrates that uniform estimation in stochastic block models is inherently slow, especially when classes are similar, with convergence rates depending on the number of vertices rather than edges.
Contribution
It provides explicit nonasymptotic minimax bounds for estimation in SBMs, revealing slower uniform rates compared to pointwise estimation, and extends results to smooth graphons.
Findings
Uniform estimation rate scales with vertices, not edges.
Estimation is harder when classes are similar.
Lower bounds are local around any SBM.
Abstract
We explicitly quantify the empirically observed phenomenon that estimation under a stochastic block model (SBM) is hard if the model contains classes that are similar. More precisely, we consider estimation of certain functionals of random graphs generated by a SBM. The SBM may or may not be sparse, and the number of classes may be fixed or grow with the number of vertices. Minimax lower and upper bounds of estimation along specific submodels are derived. The results are nonasymptotic and imply that uniform estimation of a single connectivity parameter is much slower than the expected asymptotic pointwise rate. Specifically, the uniform quadratic rate does not scale as the number of edges, but only as the number of vertices. The lower bounds are local around any possible SBM. An analogous result is derived for functionals of a class of smooth graphons.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMarkov Chains and Monte Carlo Methods · Stochastic processes and statistical mechanics · Complex Network Analysis Techniques
Uniform estimation in stochastic block models is slow
Ismaël Castillolabel=e1][email protected] [
Laboratoire Probabilités, Statistique et Modélisation
Sorbonne Université
Peter Orbanzlabel=e2][email protected] [
Gatsby Computational Neuroscience Unit
University College London
LPSM, Sorbonne Université and Gatsby Unit, UCL
(2020)
Abstract
We explicitly quantify the empirically observed phenomenon that estimation under a stochastic block model (SBM) is hard if the model contains classes that are similar. More precisely, we consider estimation of certain functionals of random graphs generated by a SBM. The SBM may or may not be sparse, and the number of classes may be fixed or grow with the number of vertices. Minimax lower and upper bounds of estimation along specific submodels are derived. The results are nonasymptotic and imply that uniform estimation of a single connectivity parameter is much slower than the expected asymptotic pointwise rate. Specifically, the uniform quadratic rate does not scale as the number of edges, but only as the number of vertices. The lower bounds are local around any possible SBM. An analogous result is derived for functionals of a class of smooth graphons.
62G20,
Stochastic Blockmodel, semiparametric estimation of functionals, minimax rates, spectral clustering, graphon model,
keywords:
[class=MSC]
keywords:
††volume: 0††issue: 0
\setpkgattr
copyrighttext \setpkgattrissuedatatext
\startlocaldefs
\endlocaldefs
1 Introduction
Network data occurs in a range of fields, and its analysis has become a highly interdisciplinary effort [14, 18, 24, 28]. In statistical network analysis, two classes of models have recently received particular attention: Graphon models [9, 21, 29], and the subclass of stochastic block models (SBMs) [1, 7, 17]. The results of this paper show, informally speaking, that estimation under a SBM becomes difficult if the parameters specifying two classes are close to each other.
SBM and graphon models parametrize a random graph by a symmetric measurable function , which can be interpreted as representing an adjacency matrix in the limit of infinite graph size [9]. In a SBM, the function is in particular piece-wise constant. Examples of statistical problems arising in this field include estimation problems (see below), class label recovery [5, 25, 26, 31, 36], and signal detection, which refers to testing for the presence of a signal in settings where observed data constitutes a network or array [4, 3, 10, 33].
We consider estimation problems. SBMs label each vertex in a graph with a category (a “community”), and this labelling is typically not observed. There is a substantial body of work on rates of estimation in such models [7, 1, 11, 12, 2, 8]. This literature considers asymptotic pointwise rates, and shows that, informally speaking, estimators of finite-dimensional statistics can converge quickly even if the labelling of vertices is not observed. Estimation of the entire function has also been studied [16, 23, 34]. In a case where this parameter has infinite dimension and is estimated in a uniform way, Gao, Lu, and Zhou [16] show that not observing labels slows the rate. Our results show that that is not a consequence of the nonparametric setting: The best uniform (rather than pointwise) rate for estimating a finite-dimensional statistic—even a very simple one, and under a very simple parametric SBM—is slow. The same holds for a simple, one-dimensional functional of a smooth, infinite-dimensional parameter function .
1.1 An informal overview
The remainder of this section provides an informal overview of our results. Rigorous definitions and statements follow in the next sections. A SBM is defined by two parameters, a probability distribution on categories (which we regard as a vector in ), and a matrix . The model generates undirected random graphs of any size : Each vertex is assigned a category drawn randomly from , and an edge between vertices and is then added with probability . (A proper definition follows in Section 2.)
The main results. SBMs pose an estimation problem: Given an observed graph with vertices, estimate and . The purpose of our work is to show that, loosely speaking, the estimation problem can be harder than it appears, or indeed than previous results may suggest at first glance. Our results are phrased as minimax lower bounds: Suppose is a set of parameter pairs . A minimax bound specifies a decreasing function such that, informally,
[TABLE]
That is, given an observed graph with vertices, there exists no estimator whose quadratic risk is smaller than for all parameter values in . Since the supremum means that shrinking will not increase the lower bound, it can suffice to consider a subclass of parameters—any lower bound for is also a lower bound for . Indeed, we will see that one can obtain a meaningful bound by choosing a very small subclass with one degree of freedom, where is fixed to the uniform distribution, and the matrix is a function of a one-dimensional surrogate parameter . The statement above then takes the form
[TABLE]
Our main result shows that the relevant lower bound is
[TABLE]
where is permitted to depend on . This is Theorem 1 (for the simplest case ) and Theorem 2 (for ). Indeed our proofs imply a stronger statement:
The results are completely non-asymptotic, and it is possible to explicitly determine numerical values for all relevant constants. See Remark 2 for an example.
The minimax bound holds locally, not just globally: In principle, a slow minimax rate may be caused by just a few “pathological” points in the set . One can ask whether the rate can be improved by removing a small part of . That is not the case here: Shrinking the set of all SBM parameter pairs to any open Euclidean neighborhood of any specific pair still results in the same rate (see Section 3.3). Informally, every region of parameter space contains parameters that prevent the rate from improving.
SBMs are often used in “sparse” forms, and we verify in Appendix B that the result also applies in the sparse case. Since sparsification reduces the amount of available data, it slows convergence. Theorems 8 and 9 show the rate in the sparse setting also scales linearly in the expected number of edges.
Interpretation. If we were to simplify the estimation problem artificially by assuming that the assignments variables are observed, could be estimated at rate and at rate , both by computing sample averages. (The rates differ since is estimated from vertices, whereas is estimated from edges, and the expected number of edges grows quadratically with .) Since our bounds are phrased in terms of a quadratic risk, both rates must be squared: in the sequel, a bound as above is referred to as slow rate; by contrast, a fast rate corresponds to .
One must distinguish uniform rates (which hold uniformly over sets of parameters) and asymptotic pointwise rates (where asymptotics in are considered at just a given point). Previous work has established that estimation of can be fast even if the are not observed. For example, a remarkable result of [7] shows that, if and are chosen as certain profile maximum likelihood estimators, then, as ,
[TABLE]
where and are multivariate Gaussian variables. This holds up to label switching (see Section 2), and requires that the columns of are “not too similar”. Related results can be found in [1, 11, 12]. Since this result does not use a quadratic risk, it can be paraphrased informally as:
Under suitable conditions on the model, the matrix can be estimated, at least asymptotically and pointwise, at a fast rate.
It has long been recognized in statistics that pointwise asymptotic rates can be hard to interpret: As runs through some set , the constants implicit in the rate may change locally around a given parameter as well as with , and if they do so quickly enough, that results in an effective change in the rate. The Hodges phenomenon, for example, illustrates that highly pathological behavior of an estimator may only be visible in its uniform rate, whereas the asymptotic pointwise rate suggests good performance [see e.g. Section 8 and Figure 8.1 in 32]. Our result says:
If measured uniformly over any given neighborhood in parameter space, the best achievable rate for connectivity parameters (i.e. for ) is always slow.
In other words, the change of constants is indeed an issue here, and makes the rate drop from a fast to a slow one. Since the minimax bound is local, this problem cannot be avoided by removing some (fixed) parameters .
Further results. Section 4 and the Appendix provide additional results on achievability, i.e. upper bounds to complement the lower ones, and on graphon models and sparse graphs.
Upper bounds. Like most lower bound results, Theorems 1 and 2 do not show whether the bound is achievable—that is, the convergence rate of any actual estimator could be even slower than . To show that a rate is achievable uniformly, one has to specify an estimator whose uniform risk matches the lower bound up to constants. Estimators for SBMs and their convergence rates are subject of a substantial literature, but these rates are, once again, generally pointwise. Obtaining a tight uniform upper bound for arbitrary SBMs is beyond the scope of this work, but we do consider the “hard” one-dimensional model (1), and show the following:
For estimation of in (1), the rate is achieved by a type of maximum likelihood estimator. That is shown in Theorem 3 (for ) and in Theorem 4 (for general ), in Section 4.1.
However, this estimator is not generally computable in polynomial time, which raises the additional question whether the problem exhibits a computational gap—that is, whether this is a problem where a sample of size contains enough information to achieve a given rate , but this information cannot be extracted in polynomial time, and every practically computable estimate converges at a slower rate. In this context, we show:
Under additional conditions, a spectral estimator (based on work of Lei and Zhu [26]) achieves , and is computable in polynomial time. (See Theorem 5 in Section 4.2 for classes, and the appendix for the general case .)
Thus, in the submodel specified by the conditions, there is no computational gap. We do not know at present whether the same holds for general SBMs.
Smooth graphons. SBMs are a special case of so-called graphon models, which parametrize a random graph on vertices by a function of a certain form. In SBMs, is piece-wise constant. Section 4.4 instead considers a class of smooth graphons. It is known that uniform estimation of such a graphon from is only possible at a slow rate [16, 23]. Theorem 6 considers a simple, real-valued statistic that can be read as a form of standard deviation. It shows that
[TABLE]
In other words, even if the infinite-dimensional quantity is substituted by the much simpler, one-dimensional quantity , the rate is still slow. In this sense, Theorem 6 can be seen as a semiparametric counterpart to the nonparametric results of [16].
1.2 Contents
@starttoc
toc
2 Preliminaries and notation
This section defines the models we consider, and briefly reviews some related background.
Notation. We abbreviate , so that is the set of all mappings . For a subset of the integers, denotes cardinality. If is a square matrix, is its Frobenius norm and its spectral norm. Let Be be a shorthand for the Bernoulli distribution. By , we denote the law of an Erdös-Renyi random graph edge probability over nodes, that is, , where denotes a tensor product of distributions.
Stochastic block models. Consider sampling at random an undirected, simple graph on the vertex set as follows. Fix some . Let be a probability distribution on the set , with identified as a line vector of size . Let be a symmetric matrix with elements . To sample a graph , we generate its adjacency matrix . Since is undirected, it suffices to sample entries with ,
For each vertex , independently generate a label . 2. 2.
For each pair in , independently sample from the distribution .
In this notation, is a (random) mapping that attributes a label to each node of the graph. It is random because labels are by definition randomly sampled. The distribution so defined on the set of undirected, simple graphs is called a stochastic blockmodel of order with parameters and . One can also write
[TABLE]
where , and here and in the sequel refers to all pairs of indices with . Any given partitions the vertex set into distinct classes. We call the proportions vector and the matrix of connectivity parameters.
Graphon models. SBMs can be regarded as a special case of a more general class of random graphs, parametrized by the set of all measurable functions that are symmetric, i.e. . Any such defines a random graph : denoting by the uniform distribution on , and , set
[TABLE]
The law of the graph defined by the random matrix in (4) is called a graphon model [9]. SBMs are recovered by choosing as a histogram: subdivide the unit interval into intervals of respective lengths , and set
[TABLE]
Then . In a graphon model, the continuous vertex labels are almost surely distinct; in a stochastic block model, labels coincide whenever two vertices belong to the same class. Thus, the SBM labels can be regarded as discretization of graphon labels. Conversely, any graphon can be approximated by a sequence of stochastic blockmodels of increasing order ; indeed, the set of stochastic blockmodels—that is, of graphons of the form (5) for all , and —is dense in the set of functions endowed with its natural topology [see e.g. 22, for details]. This idea can be used to construct SBM-valued estimators for graphons [34, 16]. SBMs and graphon models both generalize to directed graphs, by dropping the symmetry constraints on and , and requiring only rather than ; in the following, we consider only the undirected case.
Label switching and identifiability. The distribution (4) remains invariant if is replaced by , for any measure-preserving transformation of : for . More generally, two graphons and are considered equivalent if . The equivalence class of is called a graph limit. Similarly in (3), if is a fixed arbitrary permutation of , with permutation matrix , then . The parameters of the SBM can only be recovered up to label switching. We refer to [1] and [11] for detailed identifiability statements.
Fixed and random design. In models (3)-(4), the latent variables, respectively and , are random. Sometimes, a slightly different version of the model is considered, where and are still unobserved, but fixed, non-random quantities. For instance, under this setting (3) becomes
[TABLE]
for a given, unknown, , and the data distribution is denoted . Such models will be referred to as fixed design SBM and random design SBM respectively. The term SBM as used in the literature typically refers to a random design. Some theoretical arguments simplify in the fixed design case, for which the data distribution is a product measure, rather than a mixture of products measures. Most results below are obtained for both cases.
Mixture interpretation. The -tuple in a graphon model, or, equivalenly, the mapping in a SBM, are in general not observed, and can hence be interpreted as latent variables. In other words, the distribution of the data is a mixture. The mixture representation is useful to relate fixed and random designs to each other. In the random design case, we have
[TABLE]
where and is the number of times the label is present. In the fixed design model, the labels given through are also unobserved, but fixed, so that the distribution is given by
[TABLE]
3 Main results: lower bounds
In this section, we first construct in Section 3.1 natural submodels of a SBM with along which the two classes become close, and derive an estimation lower bound in terms of the quadratic risk for the submodel parameter. We then consider in Section 3.2 the more general setting of a SBM with classes ‘containing’ the previously constructed difficult submodel and derive a minimax estimation lower bound in this setting, which is local around any possible SBM of this type, as we discuss in more detail in Section 3.3.
3.1 The case
Consider the set of distributions
[TABLE]
where and are given by
[TABLE]
The set is a –dimensional submodel of the set of all SBMs with at most two classes. For the matrix is degenerate and the model is simply an Erdös-Reyni graph model with parameter , that is all edges are independent and have a probability of being present. SBMs with connectivity matrices that—like above—specify only two, one for intra-group and one for between-group connections, are known as affiliation models [e.g. 2, 3, 27].
Theorem 1**.**
Consider a stochastic blockmodel (3) with specified by , that is with given by (8)-(9). There exists a constant such that for all ,
[TABLE]
where the infimum is taken over all estimators of in the model .
Proof.
See Section 6.1. ∎
Theorem 1 states that, even in a very simple SBM with classes and only one unknown parameter in its connectivity matrix, the minimax estimation rate is no faster than . This is no contradiction to the fast rate obtained by Bickel et al. [7] (meaning a rate for the convergence in distribution but a rate for the quadratic risk): the latter is a pointwise asymptotic result, and assumes that no two lines of the connectivity matrix are the same, whereas Theorem 1 is nonasymptotic and uniform. It shows that the rate in a two-class model changes for distributions close to an Erdős-Renyi model (); informally, models close to the ‘boundary’ are harder to estimate. We note the result does not require the sub-model to include the Erdős-Renyi model; see the remark below. The phenomenon is reminiscent of effects familiar from community detection, where matrices similar to (9) naturally arise as most difficult submodels. Community detection is a testing problem, though, as opposed to the estimation problem considered here. For a different but related result in the very sparse case, see [27].
Remark 1* (different parameter choices).*
One can easily check that the result of Theorem 1 remains unchanged if instead of in the matrix in (9), another number is used. If is bounded away from [math] and , assuming , then the result is only modified by constants. Also, if the proportions vector is of the form with , similar results continue to hold, provided the matrix is replaced by
[TABLE]
for suitable constants that depend on (one can take e.g. and ).
Remark 2* (numerical constants).*
In Theorem 1, one can take ; additionally, the supremum can be restricted to for
[TABLE]
Moreover, the proof implies that one can restrict the supremum to a set not actually containing , but rather two points close enough to , namely , for suitably chosen, fixed constants .
Fixed design. A result similar to Theorem 1 holds for fixed designs. In this case, the map is deterministic, and the model can be written as . Expectations with respect to the measures and are denoted respectively and . We then have
[TABLE]
where the infimum is taken over all estimators of in the fixed design model. The proof is the same as for Theorem 1, see Section 6.1.
3.2 The general case
We now consider an arbitrary number of classes. Above, we have perturbed a SBM with classes around an Erdös-Renyi model. We now similarly perturb a -class SBM around one with classes. The connectivity matrix of a SBM with at most classes is of the form, for for ,
[TABLE]
For simplicity of notation and easy comparison with Section 3.1, we assume throughout. Results are easily adapted to the case , requiring only that be bounded away from [math] and . If needed, one can ensure the number of classes is exactly by requiring no two rows of coincide, which we require only in Theorems 7 and 8, which describe the behavior of spectral estimators.
We consider -dimensional submodels in the parameter space of connectivity matrices: Set
[TABLE]
and, for coefficients as above, define
[TABLE]
where
[TABLE]
and
[TABLE]
Thus, is a symmetric matrix, obtained from by replacing the scalar coefficient by the matrix , and repeating the vector .
The number of nodes in a given class will be specified as follows. For simplicity, we choose the proportions vector in (3) equiproportional and equal to in (11). (As in the case , analogous results can be obtained if the proportions are of similar sizes.) Consider the model defined by
[TABLE]
for as in (11)-(12). This is a –dimensional submodel of the set of all SBMs with at most classes. For , the matrix again has two identical rows, and the model becomes a SBM with at most classes, with connectivity matrix given by defined in (10). By , we denote the expectation under in the model given by (13).
Theorem 2**.**
Consider a stochastic blockmodel (3) with classes specified by in (13), that is with given by (11)-(12), for fixed matrices with arbitrary coefficients. There exists a constant , independent of , such that, for all ,
[TABLE]
where the infimum is taken over all estimators of in the model .
Proof.
See Section 6.2. ∎
Fixed design. A similar result holds for the fixed design case, assuming that classes, given by the mapping , are balanced in the following sense. Let denote the set of maps such that, for some constants , for any ,
[TABLE]
The set thus consists of those maps that produce classes all of size of order . Then the conclusion of Theorem 2 still holds, provided is replaced by , and the supremum taken over and as defined just above.
3.3 Some comments on the results
Theorem 2 establishes that the minimax estimation rate of in model (13) is at best of the order , uniformly over and . An intuitive explanation for this particularly slow rate is as follows: the phenomenon observed for is still present but this time the part of the matrix containing information about is smaller, as only of the order of the nodes will be assigned to classes or , which are the elements of the connectivity matrix that depend on .
An important point is that this lower bound is minimax local (as opposed to more commonly proved minimax global results) that is, not only does this slow rate occur around one specific least-favorable point in the parameter space, it does occur around any point. More precisely: If we start with any , any proportions vector, and any connectivity matrix as in (10) with classes, there exists at least one submodel around , namely in (13), such that estimation of a connectivity parameter in cannot be faster than . In Theorem 1, the model given by is an Erdös-Renyi graph, which raises the question whether the slow rate in Theorem 1 is a consequence of the distinguished properties of the Erdös-Renyi model. This is not the case. Proving such a local bound makes the proof of Theorem 2 more involved in the random design case, as one has to quantify the -distance between two mixtures of probability measures, instead of between one fixed measure and a mixture as is often the case in proving minimax global bounds.
It is interesting to compare the rate in Theorem 2 to the one that would be obtained if the labels were observed. If is fixed, Lemma 2 in Bickel et al. [7] gives a quadratic rate of order for connectivity parameters when labels are observed. This result can be easily adapted to the case where possibly grows with , say in an asymptotic setting with and , leading to a quadratic rate of order . The uniform rate in Theorem 2 is the square-root of this rate and thus much slower.
4 Further results: upper bounds and smooth graphons
In this Section, we complement our main results by upper bounds (under some conditions when ) and results for certain smooth graphons, which can be seen as a continuous analogue of the results for the SBM parameter .
We establish upper bounds that show that the lower bounds in the previous section can be matched for certain subsets of connectivity matrices. In the case of classes, the conditions are arguably somewhat restrictive and can probably be improved. However, since the lower bounds are proved to be local around any possible SBM containing two classes that are close, the rate, if not matched, can only become worse. As we show below, some conditions are in fact necessary. Indeed, we give an example in Section 4.3 where the rate drops further, illustrating the difficulty of the estimation problem.
4.1 Upper bounds via maximum likelihood
Theorems 1 and 2 provide lower bounds. There are corresponding, matching upper-bound, which we obtain next.
The case . We define an estimator of as follows. For any an element of , i.e. for any mapping , define
[TABLE]
Maximising (14) in leads to set
[TABLE]
which leads to the profile maximum likelihood estimate
[TABLE]
This estimator can be seen as a (pseudo)-maximum likelihood estimate, see Appendix C.
Theorem 3**.**
Consider a stochastic blockmodel (3) with specified by , that is, with given by (8)-(9). Let be the estimator defined by (15). There exists a constant such that for all ,
[TABLE]
The same risk bound holds for in the fixed design model, uniformly over and .
Proof.
See Appendix C.2. ∎
The main takeaway from this result is that the uniform quadratic rate for estimating the connectivity parameter along the submodel is exactly of order , up to constants. That follows from combining Theorems 1 and 3. This ‘slow’ rate (as compared to the asymptotic pointwise quadratic rate of (2)) arises even if all other parameters—here, the vector of proportions —are assumed known. The submodel built for can be regarded as a local perturbation of an Erdös-Renyi graph model with connection probability . The drop in the rate is already noteworthy, as the rate of estimation of for a ER model is of the order .
The case . For this case, we make additional (but fairly mild) assumptions on the matrix . These conditions are for simplicity of presentation and could, in some cases, be improved. Our main purpose here is to show that, for ‘typical’ matrices and in (12), the rate of estimation of in (12) is indeed exactly of the order . In Section 4.3 below, we show that at least some conditions on possible matrices are necessary: for certain unfavourable matrices, the rate drops below . As was the case for Theorem 1, the result of Theorem 3 remains unchanged if the constant in is replaced by any .
We modify the criterion function (14) by restricting it to a given subset of indices,
[TABLE]
To avoid technicalities, we maximize over a grid, which constitutes no loss of generality. To this end, define the regular grid in , and
[TABLE]
Equation (17) defines a global maximum-likelihood type estimator, which is then used to obtain an estimate of the set of nodes labelled or . Given this estimate, one can apply the profile-type method already used in the case : For as in (18), , and as in (16), set
[TABLE]
We require the coefficient of the matrix in (10) to be sufficiently distinct from the remaining entries: Let be the set of coefficients of the matrices and in (12), with ,
[TABLE]
Theorem 4**.**
Consider a stochastic blockmodel (3) with classes specified by with and given by (11)-(12), for fixed matrices and . Define as in (20). Suppose (21) holds and that, for some small enough and as in (21),
[TABLE]
Then there exists a universal constant such that for ,
[TABLE]
The same risk bound holds for in the fixed design model, uniformly over .
Proof.
See Appendix C.3. ∎
Note in (22) may depend on and , and may go to zero in a framework where go to infinity. Below are two examples for the behaviour of . These examples illustrate that our conditions are indeed met in commonly encountered settings, in particular, with high probability, if is a random matrix and does not grow too rapidly with .
Example 1 (well-separated block). If is a fixed positive constant e.g. , then the submatrix is well separated from the other coefficients of the matrix . The procedure above then correctly picks up a sensible approximation of the true set via and the rate is achieved, as long as does not grow faster than , an already fairly important number of classes.
Example 2 (randomly sampled matrix ). Suppose that the symmetric matrix in (10) is a random matrix whose upper triangular entries are drawn i.i.d. with uniform distribution , except . The distribution of except for is then , and it is a standard fact that the first order statistic of a uniformly distributed sample of size is Beta distributed. That implies the random variable has law . Therefore, in (21) is of order no less than with high probability. From (22) one deduces that for of the form with and large enough, the rate is achieved uniformly and locally, for typical matrices . Inspection of the proof of Theorem 4 reveals that with in fact suffices for the rate to be attained with high probability when is random: this is achieved by distinguishing of the types or in the proof and noting that the minimum of over will be of larger order , instead of for the minimum over of .
Remark 3* (conditions and (21)).*
We slightly restrict the range of in the upper bound of Theorem 4. Formally, the matching upper bound is obtained for a somewhat smaller interval than when . (If is fixed and , the restriction is only to for a small enough constant .) The condition is needed to ensure, in combination with (21), that the block in the matrix (12) is separated sufficiently from the other submatrices and . If this is not the case, the estimation problem can become more difficult, and the rate hence slower. This is formally shown in Section 4.3, where the extreme case of all coefficients of being equal to is discussed. This phenomenon can also occur if only some parts of and are close to , or to either or for some .
We do not claim that the restriction to and (21) are sharp conditions; they can probably be improved. However, the argument above shows some condition of this form is needed, although it may vary depending on the estimation procedure considered: for spectral estimators as considered in Appendix A, for example, we need a similar separation assumption, although it takes a slightly different form (see Theorem 7 in Apprendix A and the comments below it). We also note that small values of are conceptually the most interesting case, since the subproblem becomes easier the larger becomes.
4.2 Upper bounds via spectral estimates
Since the maximum likelihood estimator (17) has to optimize over the set , it need not be computable in polynomial time. It hence seems natural to ask whether there is a “computational gap”, that is, whether the best estimator computable in polynomial time converges at a slower rate than predicted by the minimax bound. We do not have a complete answer to his question, but for a somewhat restricted model class, no such gap exists: The estimator described below for the case uses a spectral method, see e.g. [25]. A generalization to classes is discussed in Appendix A, which requires further conditions. Within the remit of these conditions, however, the minimax rate is achievable in polynomial time. An extension to sparse graphs is considered in Appendix B. A small simulation study in Section A.2 illustrates the behaviour of the estimator.
With the convention that and , define the matrix by
[TABLE]
Let denote the largest eigenvalue in absolute value of and set
[TABLE]
We refer to this procedure as spectral algorithm for and denote it . The intuition behind this estimator in the fixed design setting is the following. For , we have
[TABLE]
Set and V:=vv^{t}=\bigl{(}(-1)^{\mathds{1}\{\varphi(i)\neq\varphi(j)\}}\bigr{)}_{i,j\leq n}. Then for non-random ,
[TABLE]
where is the identity matrix of size . As is a rank matrix whose non-zero eigenvalue equals (with the corresponding eigenvector), this leads us to introduce as in (23).
Theorem 5**.**
In the same setting as in Theorem 3, let be the estimator defined by (23). There exists a constant such that for all ,
[TABLE]
The same risk bound holds for in the fixed design model, uniformly over and .
Proof.
This follows as a special case of Theorem 8, in Appendix B. ∎
4.3 Necessity of conditions on
What precedes shows that the rate is achieved under conditions on in (10) and/or . In general, we expect the rate to depend on the matrices . Although we do not investigate this point in full here, we discuss it briefly.
The estimation methods investigated in Section 4.1 (MLE) and Appendix A (spectral method) require the upper-left block of to be sufficiently separated from at least part of the other entries of . Among those matrices whose upper-left corner equals , a worst case scenario should correspond to a matrix whose coefficients in and all equal . This leads to the matrix
[TABLE]
which is of course heavily over-specified from the SBM perspective. Consider the SBM in a fixed design case, where is unobserved. Suppose all classes are of cardinality of order , and the connectivity matrix is given by (24). This specific model can be regarded as a special case of the setting considered from a testing perspective by Butucea and Ingster [10] and Arias-Castro and Verzelen [3]. From Theorem 4.3 of [10], one can deduce that the minimax rate for the quadratic risk when estimating is no better than , for and . The rate is therefore no better than for , and remains much slower than even for .
4.4 Minimax rates for a class of functionals of smooth graphons
Stochastic block models can be identified with piecewise constant graphons; we now consider the case where the graphon is a smooth function instead. Let a measurable function, let be its graphon equivalence class, and denote by the distribution of data generated by the graphon model (4). Consider the problem of estimating the functional
[TABLE]
for any representer of . This is well defined in terms of the graphon, as the integral is invariant under any simultaneous (Lebesgue-)measure-preserving transformation of and .
The statistic (25) can be interpreted as a ‘graphon-standard deviation’. Its estimation under a smooth graphon model is, in a sense, analogous to the problem of estimating the functional in the simple SBM with two classes discussed in Section 3.1: Let be the piece-wise constant graphon characterizing the SBM defined by (8)–(9). Since , estimating is then indeed equivalent (for positive values) to estimation of .
Under a 2-class SBM, the results of Section 3.1 show in (9) cannot be estimated faster than . It is natural to ask whether the same still holds if one works with ‘smoother’ graphons instead of histograms (where we refer to as smooth if at least one of its representers is a smooth function). The following result addresses this question for a simple class of smooth graphons, both for and for a larger class of functionals containing .
Let be the collection of all graphons that admit a representer which is a polynomial in , with degree bounded by some integer and coefficients bounded by an arbitrary constant (this boundedness restriction is only to ensure a –nearly, up to a log term– matching upper-bound in the next result). For any , let us denote by the function from to given by
[TABLE]
and let denote the constant function equal to . The function can be interpreted as a ‘smooth’ counterpart to the histogram graphon underlying the SBM (9).
Theorem 6**.**
Let be data from the graphon model (4). Let be defined as in (25). There exist constants such that
[TABLE]
where the infimum is taken over all possible estimators of in model (4). Let be an arbitrary functional defined on graphon equivalence classes satisfying
[TABLE]
for some , for any , and for the function in (26), Then for some ,
[TABLE]
Proof.
See Section 7. ∎
The first part of Theorem 6 asserts that the quadratic minimax rate for estimating (25) cannot be faster than , even if one restricts the parameter set to a small class of smooth graphons , namely graphons with a polynomial representer of bounded degree. This class can be seen as a smooth analogue of the histogram graphon underlying model (7), or more generally the model with classes and connectivity (12). The degree of the polynomial can be seen as the analog of . The rate of order is obtained because the degree of the polynomial is assumed bounded. Although we do not investigate this further here, one may conjecture that the rate would slow even further for a larger class (e.g. growing degree of polynomials, or a nonparametric class such as a Hölder ball).
The second part of Theorem 6 indicates that the specific form of the functional in (25) is not essential for the lower bound to hold. A given functional leads to a rate at least as slow of over the considered class of graphons as soon as (27) holds. This condition intuitively means that the functional is at least as hard as to estimate as the functional , for which the difference on the left hand-side of (27) indeed behaves like . By direct computation we see that an example of such a graphon functional is
[TABLE]
Providing a unified theory with matching lower and upper bounds for graphon functionals is an interesting topic for future research.
5 Discussion
Gao et al. [16] show that, if one estimates the parameter function of a graphon model, not observing the vertex labels—in this case, the variables in (4)—does (in general) impact on the optimal rate. In the present paper, we have considered uniform estimation of certain functionals of graphon models (in particular, the loss function is quite different from theirs). For estimation of certain random graph functionals—including the connectivity parameters considered by Bickel et al. [7]—we have shown that the uniform, minimax rate does depend on whether the labels are observed, i.e. the phenomenon described by [16] persists even if one does not try to recover the entire function , but only a specific –dimensional aspect of . The fast quadratic rate is not achievable uniformly. If the number of classes is known and fixed, the quadratic rate becomes . If the number of classes grows with , the rate drops to . We have used some mild assumptions on the part of the connectivity matrix other than the submodel. If those assumptions are not satisfied, the rate may even drop further. Similar results also hold for sparse graphs.
Interestingly, for the functionals considered here, the uniform rate is always, regardless of the number of classes , much below the rate in the case where labels would be observed. This is in contrast with the problem of recovery of the mean adjacency matrix considered in [16], where for is larger than , the (non–normalised) rate is dominated by the ‘parametric’ rate , the rate if labels are observed.
We claim no novelty regarding the algorithms—the MLE and spectral method—which we have adapted from existing work to the problem at hand. Their purpose is to verify that the lower bound is tight (both algorithms achieve it) under some mild conditions, and that there is no computational gap (the spectral method does so in polynomial time). Yet, we are not aware of other work providing uniform rates for SBM connectivity parameters for these or other algorithms, which constitutes another novelty of the paper.
Aspects of our proofs reflect the fact that graphon models constitute a specific type of mixture model, and estimation in mixtures can be difficult if mixture components are hard to distinguish; although no general theory of these phenomena seems to exist, we refer to the early work on estimation in finite mixture models by [19] and [6], and e.g. to [20] and [15] for more recent results.
6 Proofs of the lower bounds in SBMs
The proofs of Theorems 1 and 2 rely on variations of Le Cam’s ‘two-points’ method, which bounds the minimax risk from below by a quantity involving the distance between a distribution and a finite mixture. (Specifically, this is the ‘point versus mixture’ variant of the two points method, see e.g. [35].) This and other relevant technical lemmas are recalled in Section 6.3 below; the two points method is Lemma 3. For Theorem 2, for classes, one main idea is to ‘isolate’ the part corresponding to the submatrix . More details comments are given along the proof in Section 6.2 below.
Notation. Recall that a SBM with classes, proportions vector and connectivity matrix has distribution as given in (6). For a symmetric matrix with zero diagonal, we write
[TABLE]
If is only given by for , one extends it by symmetry and sets . The distribution of a SBM in the fixed design case with given and labelling function is hence , where . In the random design case, if is the vector with equal proportions , then from (6),
[TABLE]
We generically denote universal constants by , where the value may change from line to line.
6.1 Two classes
Proof of Theorem 1.
Let and let be the collection of symmetric matrices with general term , and zero diagonal, for all possible and some . Let be the matrix with all elements equal to on the off-diagonal, that is the matrix with . By Lemma 3, applied with and small to be chosen below, in order to get a lower bound for the minimax risk, it is enough to bound the -distance between
[TABLE]
If , , , a simple computation leads to
[TABLE]
By Lemma 4 applied to and , where ,
[TABLE]
Note that for any . Denote and , for any index . We have , so that . The term under brackets in the last display can be interpreted as an expectation over , where both variables are sampled uniformly from the set of all mappings from to . Under this distribution, the variables for are independent Rademacher, as well as the variables , and both samples are independent. Further note that the variables for form again a sample of independent Rademacher variables. It is thus enough to bound,
[TABLE]
where denotes expectation under the law of the . The previous exponent is an instance of Rademacher chaos; its Laplace transform can be bounded using Lemma 1. If , we have that for any (say ), there exists such that for all ,
[TABLE]
Choosing leads to , so that the minimax risk is bounded below by .
To obtain the constants as in the remark below the Theorem, using Lemma 2 in the final step of the proof with , , as in Lemma 2, gives
[TABLE]
6.2 Lower bounds for classes
Here the problem is more delicate compared to , as the typical number of nodes per class now depends on , and, in the random design case, the data distribution for , around which we build the lower bound, is itself a mixture. As a first step, we start by establishing a result in a fixed design setting, that is
[TABLE]
Proof of (28).
Define . Set and . Let be a mapping such that
[TABLE]
Let be such that
[TABLE]
and denote by the set of all such ’s. Then the restriction of to can be identified to an element of .
Let be the matrix defined in (12). For , let denote the matrix with general term equal to . There are as many such matrices as possible s, that is . As and are identical by construction on ,
[TABLE]
where belongs to . Next set, with the matrix with general term ,
[TABLE]
Now we apply Lemma 4 to . Both and are product measures over all pairs of indices with . By construction, the individual components of these products coincide as soon as either or does not belong to . We write
[TABLE]
where, for any indices with ,
[TABLE]
For , we set
[TABLE]
If or belongs to , then by definition, in which case the last display equals [math]. In Lemma 4, where play the role of the indices . Identifying with the corresponding mapping , we have and
[TABLE]
The last expression coincides with the bound obtained in the proof of Theorem 1, with replaced by . As in that proof, there hence exist independent Rademacher variables such that satisfies
[TABLE]
Provided is defined as, for a small enough constant,
[TABLE]
using Lemma 1 as in the proof of Theorem 1 leads to the bound if is a small enough multiple of , which again leads to a lower bound for the minimax risk of a positive constant times , which proves (28). ∎
Proof of Theorem 2.
[TABLE]
and set corresponding to . Our aim is to show that and are close in the sense say, while is a fixed positive multiple of . For a given , set
[TABLE]
By definition we have and .
The proof has two steps. First, one shows that with high probability one can restrict to designs (i.e. specific mapping ’s) such that there are around nodes that have label either or . Second, we show that estimation with a random design is ‘harder’ than in the (easiest) typical fixed design case. This argument is reminiscent of ‘information processing inequalities’ encountered in information theory, although here a maximisation also takes place for not knowing the class labels. It is then important to maximise only over designs obtained from Step 1, in order for the lower bound rate to be .
Step 1. One first shows that it is possible to restrict the sum in the definition of and to ’s in the set
[TABLE]
The reason is that the large majority of sets have a cardinality of the order close to . The proportion of ’s not in among all possible ’s is given by the probability of a binomial variable being farther than from its mean. By Bernstein’s inequality, as , for any ,
[TABLE]
Taking and setting , we have just shown that
[TABLE]
Now set
[TABLE]
By the triangle inequality,
[TABLE]
By Lemma 5, is bounded above by .
Step 2. We now focus on bounding the middle term . Let denote the collection of subsets of with . For a given , let denote the restriction of to . Below we use the notation with the meaning that each term of the sum corresponds to a possible mapping , that is a given collection of values .
To do so, we rewrite and as ‘mixtures of mixtures’, by splitting the sum over into a sum over and given . Specifying is equivalent to giving oneself (then ), and . Denote
[TABLE]
For given and , set
[TABLE]
where one sums over all possible mappings , while and are fixed. We have
[TABLE]
Note that the above measures are normalised to be probability measures. Indeed, given , there are possible choices for . As is of total mass one, we have
[TABLE]
Using the triangle inequality, one can bound
[TABLE]
It is now sufficient to bound uniformly the above -distance. For simplicity, we denote
[TABLE]
where and , and is the pair . Set
[TABLE]
Using the definition of above,
[TABLE]
Combining this with the previous bounds one deduces that
[TABLE]
To conclude the proof, observe that the structure of the bound in the maximum in the last display is nearly identical to the quantities appearing in Equation (31) for the fixed-design case.
In the present case, we have a fixed mapping , with and , that plays the role of in the fixed-design case. On the other hand, we have a collection of other mappings, say , that coincide with on , that is , and that cover all possible cases for the image of , namely . The only difference to the fixed-design case is that belongs to , instead of being exactly , as specified in the definition of above. That is, denoting as above , with ,
[TABLE]
This bound is uniform over . As , if one chooses , with the constant in Lemma 2, then this Lemma implies that for any between and , the -distance in the last display is bounded by . Crucially, the constant in Lemma 1 is independent of the number of terms in the Rademacher chaos. Deduce
[TABLE]
Choosing makes this bound smaller than for . ∎
6.3 Useful lemmas
Let be i.i.d. Rademacher variables. For reals and , set
[TABLE]
Lemma 1** (Corollary 3.2.6 of de la Peña and Giné [13]).**
Let , and and as above. For every , there exists independent of such that
[TABLE]
We repeatedly use Lemma 1 in the case where all are equal to , for various values of . In such a setting, a reformulation is as follows. For any and , one can find a constant independent of such that
[TABLE]
Lemma 2** (Rademacher chaos with explicit constant).**
Let , and and as above. For any ,
[TABLE]
The lemma applied with gives a bound for the right hand side.
Proof.
Theorem 3.2.2 in [13] gives, for any ,
[TABLE]
For one has . From this one deduces that for any ,
[TABLE]
and the result follows from an application of the nonasymptotic Stirling bound valid for . ∎
Lemma 3** (Le Cam’s method ‘point versus mixture’).**
Let be a collection of probability measures indexed by an arbitrary set , . Set
[TABLE]
If is a real-valued functional such that and for any , then
[TABLE]
where the infimum is over all estimators of based on the observation of .
Proof.
This is a standard variation on the case where stated in e.g. [35]. ∎
Lemma 4** (Bound on total variation distance).**
For , let and for , for some , be probability measures. Set
[TABLE]
Suppose that for any , has density with respect to . Denote . Then, for ,
[TABLE]
Proof.
The first bound on distances is standard, while the second bound follows from elementary calculations. ∎
Lemma 5**.**
Let be two integers with , , and be an arbitrary collection of probability measures with . If and , we have
[TABLE]
Proof.
The result follows by splitting the sum over in a sum over and , applying the triangle inequality and using the fact that . ∎
7 Proofs for results on graphon functionals
To prove Theorem 6, we observe that polynomial graphons of bounded degree include the graphon in (26). The proof approximates this smooth graphon by a piecewise constant graphon, and then uses a lower-bound for such piecewise constants. We prove this lower bound, Lemma 6, first. Similar to the SBM case, this builds on Le Cam’s point versus mixture method. We then proceed to prove Theorem 6.
7.1 Auxiliary lower bound
Assume the function is piecewise constant, with different values taken along blocks corresponding to a regular partition of in blocks, and an even integer , with . That defines a law of the form
[TABLE]
where is an element of and a given matrix defined below. In the next statement and proof, denotes the expectation under this distribution. Denote by the matrix with only ones as coefficients,
[TABLE]
and, for a symmetric matrix with coefficients , define the matrix
[TABLE]
We define as the matrix
[TABLE]
Lemma 6**.**
Let be an even integer and an arbitrary symmetric matrix. Let be the matrix defined in (33). There exists a constant such that
[TABLE]
where the infimum is over all estimators of valid under .
Proof of Lemma 6.
Let be as above. That is,
[TABLE]
with the matrix of general term , for ranging over the set . Let denote the Erdös-Renyi distribution over nodes, which also corresponds to for . Consider the functional defined as,
[TABLE]
By definition, for any , we have and . The same computation as in the proof of Theorem 1 now shows that, for given in the display below,
[TABLE]
The last term in the bound can be interpreted as an expectation over , where both variables are sampled uniformly from the set of all mappings from to . Recall that and for any integer , denote by the integer in that equals modulo , plus . The variable can be written
[TABLE]
When follows the uniform distribution over , the variables are independent and are marginally uniform over . Also, the variables and are independent under the uniform distribution for as we show next. If Pr denotes the corresponding distribution, then for any and any
[TABLE]
Note the identity holds both for and . Set and . Deduce from the previous reasoning that the variables and are independent. Now, denoting by the expectation under Pr,
[TABLE]
As and are independent, one can compute the inner expectation in the last display under the distribution of , the ’s being fixed. The form a sample of independent Rademacher variables, hence
[TABLE]
is a Rademacher chaos of order with weights . Suppose the matrix is not identically zero (otherwise the bound below holds trivially). By Lemma 1, for any one can find with
[TABLE]
Choose . By definition, all s are bounded by . There is hence a such that, if ,
[TABLE]
The result now follows from an application of Lemma 3 to the functional . ∎
7.2 Proof of the theorem
Proof of Theorem 6.
Let us recall the definition, for any , of the function in (26)
[TABLE]
and let be its graphon equivalence class. By definition, belongs to . One has
[TABLE]
for some constant , so that . The function is the constant , and the density of the data distribution with respect to counting measure on is
[TABLE]
where, for any in and in , we have set
[TABLE]
Next one shows that is close in the total variation sense to a discrete mixture of the previous Bernoulli-probability distributions, provided the number of points in the mixture is suitably large. To do so, we approximate the function defined by
[TABLE]
by a piecewise constant function , where is split into blocks, , using a regular grid of with points and for all . To do so, one just replaces by, say, the value of on the middle of the block the point belongs to. This defines a function
[TABLE]
where is constant on every block of the subdivision. Let denote the corresponding measure, with density
[TABLE]
Taking as above, the function is a polynomial in , and its degree with respect to each variable is . The partial derivatives of can be computed, and each of them can be seen to be bounded by : For each variable, only non-zero terms appear when evaluating the partial derivative, and each term is uniformly bounded by . Consequently, if and belong to the same block,
[TABLE]
For as above, we can thus bound the total variation distance as
[TABLE]
Each probability measure is a mixture of distributions, each of which in turn corresponds to a block in the subdivision of . One can rewrite
[TABLE]
where the matrix is the symmetric matrix with terms
[TABLE]
If is even, which one can assume without loss of generality, the matrix is exactly of the same form as in (33), with elements in , so one can use the bound in -distance between measures obtained in the proof of Lemma 6. Note that the argument remains valid even if the number of classes exceeds the number of observations , which will be of importance below. For a small constant and , we obtain
[TABLE]
for sufficiently small. Choosing , for large enough, leads to
[TABLE]
An application of Lemma 3 with the functional concludes the proof of the lower bound in Theorem 6 in the case where . The lower bound for a general follows by the same proof, noting that the specific form of the functional only comes in through the difference , which behaves as for by assumption.
For the upper-bound, we first link the squared distance to the truth for the functional to the squared -distance of corresponding graphons. Let be two fixed graphon functions, and suppose that at least one of these is non constant (almost everywhere), which means that either or . Then, writing simply to denote the double integral on ,
[TABLE]
where the denominator is nonzero by assumption on ; we henceforth denote it . Then
[TABLE]
The two factors in brackets are bounded as follows: For the second term, apply the inequality , followed by , which yields
[TABLE]
For the first term, use for a bounded measurable , as one integrates over . That yields
[TABLE]
and this inequality clearly still holds true in case . One concludes that
[TABLE]
where is the set of all measure-preserving bijections of . Indeed, the previous inequalities hold true for any choice of representer of the graphon , so one can take the infimum over in the previous bounds. By Corollary 3.6 of [23], for data generated from , there exists an estimator that satisfies . Since has a representer that belongs to by assumption, it belongs in particular to the Hölder class , provided is chosen large enough. For the plug-in estimator , combining the previous result with the last display implies
[TABLE]
for large enough depending only on , which concludes the proof. ∎
Acknowledgements
I. C. is very grateful for the hospitality of Columbia’s statistics department, where parts of this work where carried out. I. C.’s work is supported by ANR-17-CE40-0001 (BASICS).
Appendix A Upper bounds computable in polynomial-time
This section generalizes the polynomial-time estimate in Section 4.2 to classes, by combining the spectral clustering method of Lei and Rinaldo [25] with a refinement due to Lei and Zhu [26]. The latter is based on a sample splitting, and under appropriate conditions on the connectivity matrix recovers the labels exactly, with high probability. Theorem 7 below shows that, under additional conditions, this polynomial-time estimator achieves the minimax rate.
A.1 Spectral estimation for classes
Recall the assumed form of the connectivity matrix in (12). The conditions of the next results are in terms of an ‘aggregated’ matrix obtained from by merging the first and second row/columns when , that is
[TABLE]
Recall that denotes the true labelling map. Define a labelling by if and if . That is, we ‘aggregate’ nodes of label or in one class and renumber the remaining labels so that the label set is, now, . Following [26], we write for the true (aggregated) label of node and .
The algorithm Spec- specified in the frame below has three steps. First, one runs the exact label recovery algorithm V-Clust of Lei and Zhu [26] for classes. Under some conditions on the matrix , see (A1)–(A2) below, this finds the ‘aggregate’ labels above up to label permutation with high probability. Then the aim is to recover the aggregated class with original labels and . Due to the label switching issue, this requires some extra condition on . For simplicity (see also comments below) we assume in (A3) that the diagonal terms are separated from , which enables to estimate the aggregated class label by comparing diagonal empirical connectivities to . Finally, in a third step one can run the spectral algorithm from Section 3.1 on the nodes found at the previous step.
Algorithm: Spectral method for estimation of (Spec-)
Input: adjacency matrix (where we set ), number of classes
Subroutines: V-Clust (Lei-Zhu), Initial community recovery (Lei-Rinaldo), Spectral algorithm for (Section 3.1)
Apply V-Clust on adjacency matrix using classes, and
Set , where
\hat{\ell}\,=\,\underset{l\in[k-1]}{\text{argmin}}\ \Bigg{|}\,\frac{1}{{{|\hat{g}^{-1}(l)|}\choose{2}}}\sum_{i<j,\,i,j\in\hat{g}^{-1}(l)}X_{ij}\,-\,\frac{1}{2}\,\Bigg{|}
Run spectral algorithm for on corresponding nodes and set
where is the induced adjacency matrix over nodes in .
We set and assume that, for a large enough universal constant :
- (A1)
is full rank and any two rows of are separated by at least in -norm. 2. (A2)
For the smallest absolute eigenvalue of ,
[TABLE] 3. (A3)
For all ,
[TABLE]
where .
Comments on (A1)–(A3) follow below. For a version for sparse graphs, see Appendix B.
Theorem 7**.**
In the fixed design SBM model with classes, under the assumptions (A1)–(A3), let us set, for a small enough universal constant and ,
[TABLE]
Then the obtained from algorithm Spec- satisfies, for a large enough constant,
[TABLE]
Proof.
This is a special case of Theorem 9, in Appendix B below. ∎
The algorithm Spec-, unlike the likelihood method considered below, only uses the fact that the connectivity matrix is of the form , but does not use specific knowledge of the vector and matrix to compute .
Comments on the assumptions. Conditions (A1) and (A2) are typical for spectral methods; their specific form is that assumed by Lei and Zhu [26], with the initial recovery algorithm being that of Lei and Rinaldo [25]. If is fixed independently of , then (A2) follows from (A1) if is large enough. Condition (A3) is specific to our problem, and assumed in this form only for simplicity of exposition: To identify the special cluster arising from the coefficient in the matrix (step 2. in Spec-), some identifiability condition is needed, because even the refined spectral clustering algorithm of [26] can only recover the original labels up to a permutation. Condition (A3) is similar in spirit to condition (21), but weaker. It can be replaced with any other condition that ensures cluster can be identified from a noisy, permuted version of (with noise amplitude going to zero fast, as ). Note that, if is fixed and large enough, (A3) simply requires the diagonal terms of to differ from .
Finally, a comment on in (34). The label recovery in Steps 1–2 is run with classes, and hence joins two of the classes in the sample. The restriction on the range of ensures the classes joined are the first two, with high probability. Indeed, here we are interested in the situation where may be small, which makes identification of labels difficult, and the rate slow; if is large, the problem becomes easier. Again, note that if is fixed, the condition simply requires that is smaller than a given constant.
A.2 Simulation study
Those estimators described above that are computationally feasible—the spectral and sample splitting estimators for , and the Spec- estimator for —can be tested in simulation: Draw vertices from a stochastic block model as in (13) with a given value of , compute the respective estimate, and report the empirical quadratic risk. Figure 1 shows how the risk develops as a function of sample size for different values of , for the two-community model (9). For communities, the model is given by the connectivity matrix (12). Simulation results for , with , and , are shown in Figure 2. As is visible in Figures 1 and 2, smaller values of correspond overall to a larger risk, and a much slower decay of the empirical risk curves. This illustrates our theoretical finding that there exists a range of parameters corresponding to two classes that become close where estimation is much slower.
Appendix B Extension to sparse graphs
So far, we have for simplicity considered dense graphs, in the sense that at least some elements of the connectivity matrix (e.g. or ) are bounded away from zero.
B.1 Two classes
An –sparse SBM model is generally defined as one in which the connectivity matrix can be written, for a sequence going to [math] with , as , for a nonnegative symmetric matrix with maximum entry [e.g. 8, 25]. Here, we assume that the connectivity matrix is with
[TABLE]
and as in (12). Then the largest coefficient of is between and , as the coefficients of the upper block are . We also set, for ,
[TABLE]
In constructing upper bounds below, we assume that for a large enough constant,
[TABLE]
as up to a constant is the typical boundary between the moderately sparse and very sparse situations, the later requiring different tools, see [25]. For simplicity we also assume that is known for the upper-bound results.
Theorem 8**.**
Consider a stochastic blockmodel (3) with specified by with given by (8)-(36). There exists a constant such that for all ,
[TABLE]
where the infimum is taken over all estimators of in the model . Furthermore, if , and the largest absolute eigenvalue of , set . Then, under (B0), for some constant and ,
[TABLE]
B.2 classes
The case of classes carries over to the sparse situation as follows. The lower bound result is only modified by a scaling factor . For upper bounds, considering the more easily computable spectral algorithm Spec- only, Assumption (A2) is replaced by (B2) below, where has the same definition as in Appendix A.
- (B2)
For the smallest absolute eigenvalue of , there exists such that
[TABLE]
Theorem 9**.**
Consider a stochastic blockmodel (3) with classes specified by in (13), that is with given by (11)–(35), for fixed matrices with arbitrary coefficients. There exists a constant , independent of , such that, for all ,
[TABLE]
where the infimum is taken over all estimators of in the model . Let be the algorithm for classes in the sparse case described in Theorem 8. Consider the fixed-design setting and suppose and are satisfied. Then the algorithm Spec- used with subroutine outputs an estimator that satisfies, for as in (34),
[TABLE]
Similar comments as for Theorems 2–7 can be made. Also, in the case that does not grow with , then (B2) follows from (B0) for larger than a fixed constant. The proof of the lower bound in Theorem 9 is similar to that of Theorem 2 using the normalisation as in the proof of Theorem 8 and is omitted. The upper bound result includes that of Theorem 7 and is proved in Appendix D.
Appendix C Remaining proofs: likelihood-based upper bounds
The proof for below analyzes the least-squares criterion directly. For classes, we ‘isolate’ the part corresponding to the submodel, by controlling the number of errors in recovering the labels of the corresponding classes. We then invoke the result for the case .
C.1 Interpretation as a pseudo-likelihood
We first justify the interpretation of the estimator in (15) as a maximum (pseudo-)likelihood estimate. In the fixed design model, suppose the data is Gaussian instead of Bernoulli Be. This suggests defining a (pseudo-)log-likelihood as follows, with ,
[TABLE]
for a constant depending only on and . Setting and
[TABLE]
it is enough to study the function , which satisfies
[TABLE]
Consequently, the pseudo maximum likelihood estimator is given by (15) as claimed.
C.2 Upper bound result, two classes
Proof of Theorem 3.
We first prove the result in the fixed design case. Let denote the true values of . The aim is to show that holds uniformly in . For a given ,
[TABLE]
One can write, for any ,
[TABLE]
where we have set
[TABLE]
For any and , and for a large enough to be chosen below,
[TABLE]
By definition of , with defined in (37),
[TABLE]
For any , using that ,
[TABLE]
For , there are two cases, depending on the sign of ,
[TABLE]
Let us discuss the term first and note that if , then using the definition of as a maximum. First,
[TABLE]
where the first three inequalities use identities obtained for above and the inequality obtained before the display, and the last inequality uses and .
Second, as implies by definition of the maximum,
[TABLE]
where for the last inequality we have used the lower bound on obtained in Lemma 7.
The case is treated in a symmetric way, by distinguishing the two cases and respectively. To obtain a deviation bound for , it is enough to study the supremum of the process . For any given and , by Hoeffding’s inequality,
[TABLE]
A union bound now leads to
[TABLE]
This bound is smaller than if one chooses . Combining the bounds obtained previously, and choosing above as , one deduces
[TABLE]
The deviation bound in turn implies the bound in expectation
[TABLE]
where for the second term we have used , as has diameter . This concludes the proof of Theorem 3 in the fixed design case.
In the random design case, one slightly updates the definition of . Here, the design is specified by , which is now random, but one can consider
[TABLE]
By definition, . Now one can follow the proof in the fixed design case by writing all statements conditionally on . As conditionally on the variables are independent and centered, the arguments leading to the various upper bounds remain unchanged. As the upper-bounds themselves do not depend on , the bounds also hold unconditionally. ∎
What remains to be shown is the bound on used above:
Lemma 7**.**
For any , with defined in (37) and ,
[TABLE]
Proof.
The upper bound corresponds to the number of terms in the sum. For the lower bound, denote . By symmetry, one can always assume , otherwise one works with . The number of pairs for which is at most , if , . This implies , using the inequality , for any . Thus the number of positive elements in the sum defining is at least , where corresponds to the diagonal terms and the division by to the fact that the sum is restricted to only (note that the general term of the sum defining is symmetric in ). This is at least if . Hence,
[TABLE]
which is the desired bound in view of . ∎
C.3 Upper bound result, classes
Proof of Theorem 4.
First consider the fixed design case: As in the proof of Theorem 3, let denote the true values of . The aim is to show that
[TABLE]
Let us denote by and the matrices of general terms and respectively, with given by (17) and . One interpretation of (17) if that the matrix provides the best fit to the data with respect to the squared loss, when optimising over .
As a first step, we show that and are close with high probability, a result in the spirit of Gao et al [16], Theorem 2.1. This follows from Lemma 8 below, which states that with probability at least , where is the Frobenius norm.
In a second step, denoting and recalling from (18) that , we show that is close to . To do so, one separately bounds from below some terms from the quantity , recalling that one extends the matrices by symmetry and sets the diagonal to [math]. First, using the definitions, (21), , and ,
[TABLE]
if (otherwise the inequality below holds trivially), as well as
[TABLE]
and, with , if ,
[TABLE]
The previous bound on implies that
[TABLE]
It now follows from (22) that for any , one has , provided in (22) is small enough. So for small , as , and as by assumption so that , one deduces . From the bound on , it follows that
[TABLE]
By (22) this shows that is close to either or up to . Now for any in , if and is close to (the cases where or is close to are treated similarly), setting and using (22), with ,
[TABLE]
Therefore,
[TABLE]
and
[TABLE]
Combining the previous bounds on cardinalities and denoting for two sets and , one obtains
[TABLE]
which in turn implies that
[TABLE]
In a third and last step, we follow the proof of Theorem 3. Let be the mapping in (19). It is a map . Let be the mapping that coincides with on and with on . By definition we have, with ,
[TABLE]
where for the second identity we have used that the s are bounded by and (38), and is defined as in the proof of Theorem 3. Similarly, denoting and , we have
[TABLE]
with high probability, since (38) implies using that for two sets . Also, since and , by the same argument we have
[TABLE]
Let and be two sequences depending on and whose specific values are determined below (see the last paragraph of the proof). If , then . Let be the set of all maps . In the following inequalities we repeatedly use the fact that the normalisation in the definition of can be replaced by up to a factor , see (39),
[TABLE]
where the last inequality uses , see Lemma 7 with in place of , and .
Second, as implies by definition of ,
[TABLE]
where for the last inequality we have used the first inequality of Lemma 7. Also,
[TABLE]
By the same argument as in the proof of Theorem 3, the supremum
is of the order , by definition of . Recall that . Set and , with a large enough constant. Assumption (22) ensures that . Hence by taking a large enough constant, one obtains that the last three displays are bounded above by , which concludes the proof in the fixed design case, proceeding as in the proof of Theorem 3 to get the final bound in expectation.
The proof in the random design case is obtained by first deriving the results conditionally on and then integrating out , as we did in the proof of Theorem 3. The first part is almost identical to the fixed design case: one only needs to note that one can restrict to mappings that belong to, essentially, . Denote by the subset of those satisfying . Then
[TABLE]
For the first term, one can apply the arguments above in fixed design, while for the second an application of Bernstein’s inequality gives
[TABLE]
By (22), holds for large enough—note that must be smaller than , as the entries of are in . One deduces , for small enough and large enough, so the quadratic risk is at most in this case as well. ∎
Lemma 8**.**
Let and , with given by (17). Let denote the matrix Frobenius norm. With probability at least ,
[TABLE]
Proof of Lemma 8.
Let denote the element of closest to , so that . Let be the matrix given by . By definition, and hence , so
[TABLE]
where we denote . As elements of the matrix are between and , we note that is of the form , where are independent, and . So using Hoeffding’s inequality, for any ,
[TABLE]
The cardinality of the set is bounded above by . A union bound then shows that, with probability at least ,
[TABLE]
Inserting this back into the previous inequality on leads to with probability at least . As , the triangle inequality leads to the result. ∎
Appendix D Remaining proofs: lower and upper bounds in the sparse case
We begin with a brief overview of the proof techniques: For the lower bounds in the sparse setting (Theorems 8 and 9), proofs are very similar to the dense case, and it suffices to track the dependence on the sparsity parameter . To upper-bound the convergence rate of spectral estimates (Theorems 5 and 8), we use the fact that can be estimated from the largest absolute eigenvalue of the (translated) adjacency matrix. The latter can in turn be estimated empirically. For the proofs of the upper bounds for classes, we show that with high probability it is possible to recover the true aggregated labels, where aggregation means that classes and , corresponding to the ‘hard submodel’ are merged (this is the ‘ map’ introduced above in the second paragraph of Appendix A). To do so, one adapt techniques introduced by Lei and Rinaldo [25], and Lei and Zhu [26] and show that their results still hold ‘under small perturbations’, as explained in more details below. Once the true aggregated labels of classes and are obtained, it suffices to apply the (already derived) result for the case .
D.1 Proofs for the two-class case
Proof of the lower bound in Theorem 8.
One proceeds in the same way as in the proof of Theorem 1 with replaced by , . If , , , we now have
[TABLE]
This leads to, with and ,
[TABLE]
By the same argument as in the proof of Theorem 1, it is enough to solve
[TABLE]
for , where is a universal positive small enough constant, under the constraint that . This leads to take equal up to a constant to and the proof is complete. ∎
Proof of the upper bound in Theorem 8.
We write the proof directly in the possibly sparse setting. Let us first consider the fixed design case, where is non-random. Let denote the spectral norm of a matrix (for a symmetric matrix , , so ). By [25, Theorem 5.2], we have that for any , there exists a such that
[TABLE]
with probability at least . From this one deduces that . The eigenvalues of and those of and can be related to each other by a Weyl-type inequality as
[TABLE]
for any , see e.g. [30, eq. (1.64)]. Suppose for now that . In this case and , which by the previous inequality implies, with high probability,
[TABLE]
Now if , using the first inequality we have and follows from the second inequality, which means . If , the triangle inequality and the second inequality imply , which combined with the first inequality gives . So, for , in all cases with high probability. The case is treated similarly. In the random design setting, one can argue conditionally on , and then note that both the obtained bounds and the in-probability statements do not depend on , which gives the result in this setting as well. ∎
D.2 Proofs for the general case
Proof of the lower bound in Theorem 9.
The proof is similar to that of Theorem 2, where one now uses the sparse lower bound for two classes of Theorem 8 instead of Theorem 1, and is thus omitted. ∎
Proof of the upper bound in Theorem 9.
We show that the proof approach used by [26] to establish their Theorem 2 can be adapted to our problem. More precisely, it is amenable to a perturbation of the true matrix of connection probabilities: We show that, for a graph generated by model (12) with a sufficiently small value of , the V-Clust algorithm with classes recovers the aggregated labelling defined by above with high probability. We do the proof in the possibly sparse situation, thereby also proving the upper-bound in Theorem 9.
There are three steps. First, we show that the initial label recovery algorithm of [25] recovers most of the labels correctly, and control the error. Second, we show that the scheme of proof of [26] carries over to the problem of recovering the aggregated clustering up to label permutation. Finally, using assumption (A3) one can recover the aggregated class with high probability, and restricting to nodes with label in that class we can apply the spectral method of the case .
First step (Perturbed spectral method of Lei and Rinaldo).
[TABLE]
The matrix (i.e. with ) can be transformed into the matrix above by removing the first line and then the first column.
Let be the matrix . Since the relevant design is fixed, there exists a binary matrix , with a single 1 in each row, for which we have , where is a diagonal matrix with entries bounded by . Lei and Rinaldo call a membership matrix. It can be rewritten in terms of , using the relation between and noted above: for a membership matrix and the expected value of ,
[TABLE]
Now we can follow Lei and Rinaldo’s analysis of simple spectral clustering with and the expectation matrix ; one only needs to show that, despite the perturbation , the argument still holds. Intuitively, this is guaranteed by the assumption that is small enough, which ensures that the spectrum of the perturbation does not interact much with that of . More precisely, we decompose as
[TABLE]
Following the proof of Theorem 3.1 of [25], the pair parametrises a SBM with classes and is full rank. By their Lemma 2.1, the eigendecomposition of can be written , where is the matrix of the leading eigenvectors of , and one can write , for some matrix with orthogonal rows (and ). It also follows from the proof of that Lemma that if denotes the smallest absolute nonzero eigenvalue of , we have , with the cardinality of the smallest class, here of order using that classes are balanced, and the smallest absolute eigenvalue of .
By Lemma 5.1 of [25], one can control the distance between the leading eigenspaces of and (for the first non-zero eigenvalues) in terms of the spectral norm of . The assumptions of that Lemma are fulfilled with here of rank and of smallest nonzero singular value . If is the matrix of the leading eigenvectors of (and the one for , as above), there exists a orthogonal matrix such that, with and the Frobenius and spectral norms respectively,
[TABLE]
By the triangle inequality, the spectral norm is in turn bounded by
[TABLE]
The matrix can be written , where is the row of length . In particular, is of rank , and (a nonzero eigenvector is ). By construction, , the number of elements of classes and , so that . Also, since is diagonal with terms bounded by . By Theorem 5.2 of [25], the norm is, with probability at least , no larger than , for a sufficiently large constant . Gathering the last bounds and using , one obtains .
On the other hand, following Lei and Rinaldo [25], one can perform an approximate -means clustering on the rows of : Application of their Lemma 5.3 to the matrices and shows the approximate -means solution is a pair , where a membership matrix, a matrix, and is an approximate least-squares fit to . Moreover, the estimated membership coincides with up to label permutation, except on sets that are characterized as follows: Recall that is the ‘true’ labelling obtained by merging the original classes 1 and 2 of nodes. Each set satisfies
[TABLE]
whenever
[TABLE]
This implies, using the previous bounds and , that
[TABLE]
provided, for some suitably small constant , with ,
[TABLE]
The first summand coincides with the condition in [25]. The second term accounts for the perturbation induced by . Provided that
[TABLE]
the simple spectral clustering algorithm has recovery error at most , with
[TABLE]
The conditions on permit this quantity to be chosen suitably large. This means that, with high probability, Step 1 of the algorithm with recovers a sufficiently large proportions of the labels of , up to label permutation.
Second step (Lei and Zhu’s exact label recovery method via sample splitting). We can now use the method introduced by Lei and Zhu [26]: using a first rough estimate of the labels, one can refine it to an exact label recovery with high probability, provided is large enough in terms of a certain function of . The recovered labels are those of the original classes , and of the aggregated class containing classes and . To verify that the proof of Lei and Zhu generalizes to the perturbed cased, it suffices to note that the distortion of for from to does not interfere with the bounds of the proof of Theorem 2 in [26]. The sample splitting algorithm of Lei and Zhu involves two subroutines called and . The mean of enters in the proof of that result via two applications of Bernstein’s inequality, in the proofs of Lemma 6 (which implies the consistency of CrossClust via Lemma 3) and Lemma 7 (consistency of Merge) of Lei and Zhu [26].
We impose the assumptions of Lei and Zhu [26], Theorem 2, on and : one needs and the inequalities and . The last two conditions are implied by (B2) (respectively (A2) in the dense case). By (42), the first one is satisfied if
[TABLE]
Again, the first inequality holds by (B2). The second inequality asks for , which is guaranteed by (34).
The parameter affects the proof of Lemma 3 of Lei and Zhu [26] as follows. The proof relies on bounding three terms and via Lemma 6 in the Appendix of their paper. To be able to apply Bernstein’s inequality on , one needs
[TABLE]
where in the notation of [26], Definition 1. To bound , one needs
[TABLE]
while, similarly, to bound one needs
[TABLE]
On the other hand, consistency of Merge with requires
[TABLE]
The condition required for implies the remaining ones: The one required for is weaker, since for some constant , by (41)–(42). The condition for Merge is also weaker up to constants, provided that , which is satisfied under our conditions. To obtain the above Bernstein’s inequalities, one thus needs
[TABLE]
It follows from the spectral decomposition of that . By combining with (41), we see that it is enough that satisfies , which was already required above. Finally, to see that the last inequality is satisfied, one notes that it is implied by (34), using that . We have just proved that the exact recovery from [26] also holds here.
Third step (Finding true cluster and conclusion). The second step provides a labelling that coincides, up to permutation, with the aggregated labelling with high probability. The assumed separation from allows us to identify cluster : For , compute
[TABLE]
Since class sizes are of order , an application of Bernstein’s inequality gives
[TABLE]
for some permutation , with high probability. By (A3), if , we have . So w.h.p. there is exactly one diagonal element within of , since the conditions of the theorem imply
[TABLE]
The index then identifies the first cluster of —which is the aggregate cluster corresponding to clusters 1 and 2 defined by —with high probability. We can now apply the spectral algorithm for to the induced submatrix . Using the upper-bound part of Theorem 8 with a number of nodes leads to , by observing that the event with high probability arising from the previous arguments (that is, the concentration result by [25] and Bernstein’s inequalities) holds with probability at least . ∎
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Allman et al. [2011] E. S. Allman, C. Matias, and J. A. Rhodes. Parameter identifiability in a class of random graph mixture models. J. Statist. Plann. Inference , 141(5):1719–1736, 2011.
- 2Ambroise and Matias [2012] C. Ambroise and C. Matias. New consistent and asymptotically normal parameter estimates for random-graph mixture models. J. R. Stat. Soc. Ser. B. Stat. Methodol. , 74(1):3–35, 2012.
- 3Arias-Castro and Verzelen [2014] E. Arias-Castro and N. Verzelen. Community detection in dense random networks. Ann. Statist. , 42(3):940–969, 2014.
- 4Arias-Castro et al. [2011] E. Arias-Castro, E. J. Candès, and A. Durand. Detection of an anomalous cluster in a network. Ann. Statist. , 39(1):278–304, 2011.
- 5Bickel and Chen [2009] P. Bickel and A. Chen. A nonparametric view of network models and Newman-Girvan and other modularities. PNAS , 106(50):21068–21073, 2009.
- 6Bickel and Chernoff [1993] P. Bickel and H. Chernoff. Asymptotic distribution of the likelihood ratio statistic in a prototypical non regular problem. In Statistics and Probability: A Raghu Raj Bahadur Festschrift , pages 83–96. Wiley, New York, 1993.
- 7Bickel et al. [2013] P. Bickel, D. Choi, X. Chang, and H. Zhang. Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels. Ann. Statist. , 41(4):1922–1943, 2013.
- 8Bickel et al. [2011] P. J. Bickel, A. Chen, and E. Levina. The method of moments and degree distributions for network models. Ann. Statist. , 39(5):2280–2301, 2011.
