Uniform estimation in stochastic block models is slow

Ismaël Castillo; Peter Orbanz

arXiv:1703.03412·math.ST·April 27, 2022

TL;DR

This paper demonstrates that uniform estimation in stochastic block models is inherently slow, especially when classes are similar, with convergence rates depending on the number of vertices rather than edges.

Contribution

It provides explicit nonasymptotic minimax bounds for estimation in SBMs, revealing slower uniform rates compared to pointwise estimation, and extends results to smooth graphons.

Findings

01

Uniform estimation rate scales with vertices, not edges.

02

Estimation is harder when classes are similar.

03

Lower bounds are local around any SBM.

Abstract

We explicitly quantify the empirically observed phenomenon that estimation under a stochastic block model (SBM) is hard if the model contains classes that are similar. More precisely, we consider estimation of certain functionals of random graphs generated by a SBM. The SBM may or may not be sparse, and the number of classes may be fixed or grow with the number of vertices. Minimax lower and upper bounds of estimation along specific submodels are derived. The results are nonasymptotic and imply that uniform estimation of a single connectivity parameter is much slower than the expected asymptotic pointwise rate. Specifically, the uniform quadratic rate does not scale as the number of edges, but only as the number of vertices. The lower bounds are local around any possible SBM. An analogous result is derived for functionals of a class of smooth graphons.

Equations

$$\inf_{\text{all estimators}}\ \sup_{(\pi,M)\in\mathcal{S}} E_{\pi,M}\big[\|(\hat{\pi},\hat{M})-(\pi,M)\|\big]^{2}\ \geq\ \tau(n).$$

$$\inf_{\text{all estimators}}\ \sup_{\theta\in[-1/2,1/2]} E_{\theta}\big[\hat{\theta}-\theta\big]^{2}\ \geq\ \tau(n).$$

$$\tau(n)=\frac{\text{constant}\cdot k}{n},$$

$$\sqrt{n}\,(\hat{\pi}-\pi)\ \xrightarrow{d}\ Z_{\pi}\qquad\text{and}\qquad n\,(\hat{M}-M)\ \xrightarrow{d}\ Z_{M},$$

$$\inf_{\text{all estimators}}\ \sup_{w\in\mathcal{S}} E_{w}\big[\hat{\vartheta}-\vartheta(w)\big]^{2}\ \geq\ \text{constant}\cdot\frac{1}{n}.$$

$$(\varphi(1),\ldots,\varphi(n))\sim\pi^{\otimes n},\qquad (X_{ij})_{i<j}\mid\varphi\ \sim\ \bigotimes_{i<j}\text{Be}(M_{\varphi(i)\varphi(j)}),$$

$$(U_{i})_{i}\sim\text{Unif}[0,1]^{\otimes n},\qquad (X_{ij})_{i<j}\mid(U_{i})_{i}\ \sim\ \bigotimes_{i<j}\text{Be}(w(U_{i},U_{j})).$$

$$w(x,y):=M_{ij}\quad\text{for } x\in I_{i},\ y\in I_{j}.$$

$$(X_{ij})_{i<j}\ \sim\ \bigotimes_{i<j}\text{Be}(M_{\varphi(i)\varphi(j)}),$$

$$P_{\pi,M}=\sum_{\varphi\in[k]^{n}}\mu_{\pi}[\varphi]\ \bigotimes_{i<j}\text{Be}(M_{\varphi(i)\varphi(j)}),$$

$$P_{\varphi,M}=\bigotimes_{i<j}\text{Be}(M_{\varphi(i)\varphi(j)}).$$

$$\mathcal{M}=\{P_{\theta}:=P_{e,Q^{\theta}},\ \theta\in[-1/2,1/2]\},$$

$$e=\Big[\frac{1}{2},\frac{1}{2}\Big],\qquad Q^{\theta}=\begin{bmatrix}\frac{1}{2}+\theta & \frac{1}{2}-\theta\\ \frac{1}{2}-\theta & \frac{1}{2}+\theta\end{bmatrix}.$$

$$\inf_{T}\ \sup_{\theta\in[-1/2,1/2]} E_{\theta}\big[T(X)-\theta\big]^{2}\ \geq\ \frac{c_{1}}{n},$$

$$Q_{b}^{\theta}=\begin{bmatrix}\frac{1}{2}+c_{b}\theta & \frac{1}{2}-d_{b}\theta\\ \frac{1}{2}-d_{b}\theta & \frac{1}{2}+c_{b}\theta\end{bmatrix}$$

$$\theta_{n}=\frac{c_{0}}{\sqrt{n}}\quad\text{and}\quad c_{0}=\frac{1}{3\cdot 2^{3/4}}\approx 0.56.$$

$$\inf_{T_{f}}\ \sup_{\theta\in[-1/2,1/2],\ \varphi\in[2]^{n}} E_{\theta,\varphi}\big[T_{f}(X)-\theta\big]^{2}\ \geq\ \frac{c_{1}}{n},$$

$$M=\begin{bmatrix} a_{0} & a_{1} & \cdots & a_{k-2}\\ a_{1} & b_{11} & \cdots & b_{1\,k-2}\\ \vdots & \vdots & & \vdots\\ a_{k-2} & b_{1\,k-2} & \cdots & b_{k-2\,k-2}\end{bmatrix}.$$

$$e_{k}=\Big[\frac{1}{k},\ldots,\frac{1}{k}\Big],$$

$$M^{\theta}=\begin{bmatrix}\frac{1}{2}+\theta & \frac{1}{2}-\theta & a_{1} & \cdots & a_{k-2}\\ \frac{1}{2}-\theta & \frac{1}{2}+\theta & a_{1} & \cdots & a_{k-2}\\ a_{1} & a_{1} & b_{11} & \cdots & b_{1\,k-2}\\ \vdots & \vdots & \vdots & & \vdots\\ a_{k-2} & a_{k-2} & b_{1\,k-2} & \cdots & b_{k-2\,k-2}\end{bmatrix}=\begin{bmatrix} Q^{\theta} & A^{T}\\ A & B\end{bmatrix},$$

$$A=\begin{bmatrix} a_{1} & a_{1}\\ a_{2} & a_{2}\\ \vdots & \vdots\\ a_{k-2} & a_{k-2}\end{bmatrix},\qquad B=\begin{bmatrix} b_{11} & \cdots & b_{1\,k-2}\\ \vdots & & \vdots\\ b_{1\,k-2} & \cdots & b_{k-2\,k-2}\end{bmatrix}.$$

$$\mathcal{M}_{k}=\{P_{\theta}:=P_{e_{k},M^{\theta}},\ \theta\in[-1/2,1/2]\},$$

$$\inf_{T}\ \sup_{\theta\in[-1/2,1/2]} E_{\theta}\big[T(X)-\theta\big]^{2}\ \geq\ c_{3}\,\frac{k}{n},$$

$$c_{1}\frac{n}{k}\ \leq\ |\varphi^{-1}(j)|\ \leq\ c_{2}\frac{n}{k}.$$

$$\hat{\sigma}=\underset{\sigma\in[2]^{n}}{\operatorname{argmax}}\ |Z_{n}(\sigma,X)|$$

Taxonomy

Topics: Markov Chains and Monte Carlo Methods · Stochastic processes and statistical mechanics · Complex Network Analysis Techniques

Full text

Ismaël Castillo (Laboratoire Probabilités, Statistique et Modélisation, Sorbonne Université)

Peter Orbanz (Gatsby Computational Neuroscience Unit, University College London)

(2020)

MSC classification: 62G20

Keywords: stochastic blockmodel, semiparametric estimation of functionals, minimax rates, spectral clustering, graphon model

1 Introduction

Network data occurs in a range of fields, and its analysis has become a highly interdisciplinary effort [14, 18, 24, 28]. In statistical network analysis, two classes of models have recently received particular attention: Graphon models [9, 21, 29], and the subclass of stochastic block models (SBMs) [1, 7, 17]. The results of this paper show, informally speaking, that estimation under a SBM becomes difficult if the parameters specifying two classes are close to each other.

SBM and graphon models parametrize a random graph by a symmetric measurable function $w$ , which can be interpreted as representing an adjacency matrix in the limit of infinite graph size [9]. In a SBM, the function is in particular piece-wise constant. Examples of statistical problems arising in this field include estimation problems (see below), class label recovery [5, 25, 26, 31, 36], and signal detection, which refers to testing for the presence of a signal in settings where observed data constitutes a network or array [4, 3, 10, 33].

We consider estimation problems. SBMs label each vertex in a graph with a category (a “community”), and this labelling is typically not observed. There is a substantial body of work on rates of estimation in such models [7, 1, 11, 12, 2, 8]. This literature considers asymptotic pointwise rates, and shows that, informally speaking, estimators of finite-dimensional statistics can converge quickly even if the labelling of vertices is not observed. Estimation of the entire function $w$ has also been studied [16, 23, 34]. In a case where this parameter has infinite dimension and is estimated in a uniform way, Gao, Lu, and Zhou [16] show that not observing labels slows the rate. Our results show that this slowdown is not a consequence of the nonparametric setting: the best uniform (rather than pointwise) rate for estimating a finite-dimensional statistic, even a very simple one under a very simple parametric SBM, is slow. The same holds for a simple, one-dimensional functional of a smooth, infinite-dimensional parameter function $w$.

1.1 An informal overview

The remainder of this section provides an informal overview of our results. Rigorous definitions and statements follow in the next sections. A SBM is defined by two parameters, a probability distribution ${\pi}$ on $k$ categories (which we regard as a vector in $[0,1]^{k}$ ), and a matrix ${M\in[0,1]^{k\times k}}$ . The model generates undirected random graphs $G_{n}(\pi,M)$ of any size ${n\in\mathbb{N}}$ : Each vertex ${i\leq n}$ is assigned a category $\varphi(i)\in\{1,\ldots,k\}$ drawn randomly from $\pi$ , and an edge between vertices $i$ and $j$ is then added with probability ${M_{\varphi(i)\varphi(j)}}$ . (A proper definition follows in Section 2.)

The main results. SBMs pose an estimation problem: Given an observed graph $G_{n}$ with $n$ vertices, estimate $\pi$ and $M$ . The purpose of our work is to show that, loosely speaking, the estimation problem can be harder than it appears, or indeed than previous results may suggest at first glance. Our results are phrased as minimax lower bounds: Suppose $\mathcal{S}$ is a set of parameter pairs $(\pi,M)$ . A minimax bound specifies a decreasing function $\tau$ such that, informally,

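$$\inf_{\text{all estimators}}\ \sup_{(\pi,M)\in\mathcal{S}} E_{\pi,M}\big[\|(\hat{\pi},\hat{M})-(\pi,M)\|\big]^{2}\ \geq\ \tau(n).$$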

That is, given an observed graph with $n$ vertices, there exists no estimator whose quadratic risk is smaller than $\tau(n)$ for all parameter values in $\mathcal{S}$ . Since the supremum means that shrinking $\mathcal{S}$ will not increase the lower bound, it can suffice to consider a subclass ${\mathcal{S}^{\prime}\subset\mathcal{S}}$ of parameters—any lower bound for $\mathcal{S}^{\prime}$ is also a lower bound for $\mathcal{S}$ . Indeed, we will see that one can obtain a meaningful bound by choosing a very small subclass with one degree of freedom, where $\pi$ is fixed to the uniform distribution, and the matrix $M$ is a function $M(\theta)$ of a one-dimensional surrogate parameter ${\theta\in[-1/2,1/2]}$ . The statement above then takes the form

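$$\inf_{\text{all estimators}}\ \sup_{\theta\in[-1/2,1/2]} E_{\theta}\big[\hat{\theta}-\theta\big]^{2}\ \geq\ \tau(n). \tag{1}$$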

Our main result shows that the relevant lower bound is

$$\tau(n)=\frac{\text{constant}\cdot k}{n},$$

where $k=k(n)$ is permitted to depend on $n$ . This is Theorem 1 (for the simplest case ${k=2}$ ) and Theorem 2 (for $k=k(n)\geq 2$ ). Indeed our proofs imply a stronger statement:

$\bullet$ The results are completely non-asymptotic, and it is possible to explicitly determine numerical values for all relevant constants. See Remark 2 for an example.

$\bullet$ The minimax bound holds locally, not just globally: In principle, a slow minimax rate may be caused by just a few “pathological” points in the set $\mathcal{S}$. One can ask whether the rate $\tau$ can be improved by removing a small part of $\mathcal{S}$. That is not the case here: Shrinking the set of all SBM parameter pairs $(\pi,M)$ to any open Euclidean neighborhood of any specific pair still results in the same rate (see Section 3.3). Informally, every region of parameter space contains parameters that prevent the rate from improving.

SBMs are often used in “sparse” forms, and we verify in Appendix B that the result also applies in the sparse case. Since sparsification reduces the amount of available data, it slows convergence. Theorems 8 and 9 show the rate in the sparse setting also scales linearly in the expected number of edges.

Interpretation. If we were to simplify the estimation problem artificially by assuming that the assignment variables $\varphi(i)$ are observed, $\pi$ could be estimated at rate ${1/\sqrt{n}}$ and $M$ at rate $1/n$, both by computing sample averages. (The rates differ since $\pi$ is estimated from $n$ vertices, whereas $M$ is estimated from edges, and the expected number of edges grows quadratically with $n$.) Since our bounds are phrased in terms of a quadratic risk, both rates must be squared: in the sequel, a bound ${\tau(n)\approx 1/n}$ as above is referred to as a slow rate; by contrast, a fast rate corresponds to ${\tau(n)\approx 1/n^{2}}$.
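The sample averages mentioned above can be sketched as follows for the artificial case where labels are observed. This is a minimal illustration, not code from the paper: the function name `oracle_estimates`, the 0-based labels, and the list-of-lists adjacency matrix are assumptions of the sketch.

```python
def oracle_estimates(phi, X, k):
    """Sample-average estimates of pi and M when the labels phi are observed.

    phi: list of labels in {0, ..., k-1} (0-based labels are an assumption here)
    X:   symmetric 0/1 adjacency matrix given as a list of lists
    """
    n = len(phi)
    # pi is estimated from the n vertices: one label frequency per class.
    pi_hat = [sum(1 for a in phi if a == l) / n for l in range(k)]
    # M is estimated from vertex pairs: edge frequency per pair of classes.
    edges = [[0] * k for _ in range(k)]
    pairs = [[0] * k for _ in range(k)]
    for i in range(n):
        for j in range(i + 1, n):
            a, b = min(phi[i], phi[j]), max(phi[i], phi[j])
            edges[a][b] += X[i][j]
            pairs[a][b] += 1
    M_hat = [[None] * k for _ in range(k)]
    for a in range(k):
        for b in range(a, k):
            if pairs[a][b] > 0:  # stays None if no vertex pair carries these labels
                M_hat[a][b] = M_hat[b][a] = edges[a][b] / pairs[a][b]
    return pi_hat, M_hat


# Tiny hand-made example with four vertices and two classes.
phi = [0, 0, 1, 1]
X = [[0, 1, 1, 0],
     [1, 0, 1, 1],
     [1, 1, 0, 0],
     [0, 1, 0, 0]]
pi_hat, M_hat = oracle_estimates(phi, X, k=2)
```

Each entry of `pi_hat` averages over $n$ vertices, while each entry of `M_hat` averages over a number of pairs that grows quadratically in $n$, which is the source of the two different rates.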

One must distinguish uniform rates (which hold uniformly over sets of parameters) and asymptotic pointwise rates (where asymptotics in $n$ are considered at just a given point). Previous work has established that estimation of $M$ can be fast even if the $\varphi(i)$ are not observed. For example, a remarkable result of [7] shows that, if $\hat{\pi}$ and $\hat{M}$ are chosen as certain profile maximum likelihood estimators, then, as $n\to\infty$ ,

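$$\sqrt{n}\,(\hat{\pi}-\pi)\ \xrightarrow{d}\ Z_{\pi}\qquad\text{and}\qquad n\,(\hat{M}-M)\ \xrightarrow{d}\ Z_{M},$$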

where $Z_{\pi}$ and $Z_{M}$ are multivariate Gaussian variables. This holds up to label switching (see Section 2), and requires that the columns of $M$ are “not too similar”. Related results can be found in [1, 11, 12]. Since this result does not use a quadratic risk, it can be paraphrased informally as:

$\bullet$ Under suitable conditions on the model, the matrix $M$ can be estimated, at least asymptotically and pointwise, at a fast rate.

It has long been recognized in statistics that pointwise asymptotic rates can be hard to interpret: As $(\pi,M)$ runs through some set $\mathcal{S}$ , the constants implicit in the rate may change locally around a given parameter as well as with $n$ , and if they do so quickly enough, that results in an effective change in the rate. The Hodges phenomenon, for example, illustrates that highly pathological behavior of an estimator may only be visible in its uniform rate, whereas the asymptotic pointwise rate suggests good performance [see e.g. Section 8 and Figure 8.1 in 32]. Our result says:

$\bullet$ If measured uniformly over any given neighborhood in parameter space, the best achievable rate for connectivity parameters (i.e. for $M$) is always slow.

In other words, the change of constants is indeed an issue here, and makes the rate drop from a fast to a slow one. Since the minimax bound is local, this problem cannot be avoided by removing some (fixed) parameters $(\pi,M)$ .
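The gap between pointwise and uniform rates can be seen in the classical Hodges estimator mentioned above. The following Monte Carlo sketch is an illustration only and is unrelated to the SBM proofs: it uses the textbook version of the estimator for a Gaussian mean (keep the sample mean if it exceeds $n^{-1/4}$ in absolute value, otherwise return $0$), and the function names are hypothetical.

```python
import random


def hodges(xbar, n):
    """Textbook Hodges estimator: shrink the sample mean to 0 below the threshold n^(-1/4)."""
    return xbar if abs(xbar) >= n ** -0.25 else 0.0


def scaled_risk(theta, n, reps, rng):
    """Monte Carlo estimate of n * E[(T - theta)^2], drawing the sample mean from N(theta, 1/n)."""
    total = 0.0
    for _ in range(reps):
        xbar = rng.gauss(theta, n ** -0.5)
        total += (hodges(xbar, n) - theta) ** 2
    return n * total / reps


rng = random.Random(0)
n = 10_000
risk_at_zero = scaled_risk(0.0, n, 2000, rng)              # tiny: looks better than the sample mean
risk_near_threshold = scaled_risk(n ** -0.25, n, 2000, rng)  # large: grows like sqrt(n)
```

For every fixed parameter value the scaled risk stays bounded as $n$ grows, yet its supremum over a neighborhood of $0$ does not. This is the sense in which constants that vary with the parameter can change the effective rate.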

Further results. Section 4 and the Appendix provide additional results on achievability, i.e. upper bounds to complement the lower ones, and on graphon models and sparse graphs.

Upper bounds. Like most lower bound results, Theorems 1 and 2 do not show whether the bound $\tau$ is achievable; that is, the convergence rate of any actual estimator could be even slower than $\tau$. To show that a rate is achievable uniformly, one has to specify an estimator whose uniform risk matches the lower bound up to constants. Estimators for SBMs and their convergence rates are the subject of a substantial literature, but these rates are, once again, generally pointwise. Obtaining a tight uniform upper bound for arbitrary SBMs is beyond the scope of this work, but we do consider the “hard” one-dimensional model (1), and show the following:

$\bullet$ For estimation of $\theta$ in (1), the rate $\tau$ is achieved by a type of maximum likelihood estimator. That is shown in Theorem 3 (for $k=2$) and in Theorem 4 (for general $k$), in Section 4.1.

However, this estimator is not generally computable in polynomial time, which raises the additional question whether the problem exhibits a computational gap—that is, whether this is a problem where a sample of size $n$ contains enough information to achieve a given rate $\tau(n)$ , but this information cannot be extracted in polynomial time, and every practically computable estimate converges at a slower rate. In this context, we show:

$\bullet$ Under additional conditions, a spectral estimator (based on work of Lei and Zhu [26]) achieves $\tau$, and is computable in polynomial time. (See Theorem 5 in Section 4.2 for $k=2$ classes, and the appendix for the general case $k\geq 2$.)

Thus, in the submodel specified by the conditions, there is no computational gap. We do not know at present whether the same holds for general SBMs.
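To make the notion of a polynomial-time estimator concrete, here is a minimal sketch for the two-class model with within-group probability $1/2+\theta$ and between-group probability $1/2-\theta$. It is not the estimator analysed in Theorem 5, nor the procedure of Lei and Zhu [26]: it is a generic spectral split (power iteration on the centred adjacency matrix) followed by a plug-in estimate, and every function name is hypothetical. It assumes the signal is strong enough that both estimated groups are non-empty.

```python
import random


def sample_two_class(n, theta, rng):
    """Random-design draw from the two-class affiliation model with uniform proportions."""
    labels = [rng.randrange(2) for _ in range(n)]
    X = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            p = 0.5 + theta if labels[i] == labels[j] else 0.5 - theta
            X[i][j] = X[j][i] = int(rng.random() < p)
    return labels, X


def spectral_split(X, rng, iters=50):
    """Sign pattern of the dominant eigenvector of X - 1/2 (diagonal set to 0), by power iteration."""
    n = len(X)
    C = [[X[i][j] - 0.5 if i != j else 0.0 for j in range(n)] for i in range(n)]
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        w = [sum(c * x for c, x in zip(row, v)) for row in C]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return [0 if x >= 0 else 1 for x in v]


def plug_in_theta(X, split):
    """Half the difference between within-group and between-group edge frequencies."""
    n = len(X)
    within = [X[i][j] for i in range(n) for j in range(i + 1, n) if split[i] == split[j]]
    between = [X[i][j] for i in range(n) for j in range(i + 1, n) if split[i] != split[j]]
    return (sum(within) / len(within) - sum(between) / len(between)) / 2


rng = random.Random(0)
labels, X = sample_two_class(150, 0.3, rng)
split = spectral_split(X, rng)
theta_hat = plug_in_theta(X, split)
```

The sketch runs in time polynomial in $n$. Its behaviour for $\theta$ close to $0$, where the two classes are nearly indistinguishable, is exactly the regime the lower bounds are about, and nothing is claimed about it here.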

Smooth graphons. SBMs are a special case of so-called graphon models, which parametrize a random graph $G_{n}(w)$ on $n$ vertices by a function $w$ of a certain form. In SBMs, $w$ is piece-wise constant. Section 4.4 instead considers a class $\mathcal{S}$ of smooth graphons. It is known that uniform estimation of such a graphon $w$ from $G_{n}(w)$ is only possible at a slow rate [16, 23]. Theorem 6 considers a simple, real-valued statistic $\vartheta(w)$ that can be read as a form of standard deviation. It shows that

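$$\inf_{\text{all estimators}}\ \sup_{w\in\mathcal{S}} E_{w}\big[\hat{\vartheta}-\vartheta(w)\big]^{2}\ \geq\ \text{constant}\cdot\frac{1}{n}.$$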

In other words, even if the infinite-dimensional quantity $w$ is substituted by the much simpler, one-dimensional quantity $\vartheta$ , the rate is still slow. In this sense, Theorem 6 can be seen as a semiparametric counterpart to the nonparametric results of [16].

2 Preliminaries and notation

This section defines the models we consider, and briefly reviews some related background.

Notation. We abbreviate ${[k]:=\{1,\ldots,k\}}$, so that ${[k]^{n}}$ is the set of all mappings $\{1,\ldots,n\}\to\{1,\ldots,k\}$. For a subset $A$ of the integers, $|A|$ denotes cardinality. If $M$ is a square matrix, $\|M\|_{F}$ is its Frobenius norm and $\|M\|_{Sp}$ its spectral norm. Let Be$(p)$ be a shorthand for the Bernoulli$(p)$ distribution. By $ER(p)$, we denote the law of an Erdős-Rényi random graph with edge probability $p$ over $n$ nodes, that is, ${ER(p)=ER(p,n)=\otimes_{i<j\leq n}\text{Be}(p)}$, where $\otimes$ denotes a tensor product of distributions.

Stochastic block models. Consider sampling at random an undirected, simple graph $G$ on the vertex set $\mathcal{V}=\{1,\ldots,n\}$ as follows. Fix some $k\in\{1,\ldots,n\}$. Let ${\pi=(\pi_{1},\ldots,\pi_{k})}$ be a probability distribution on the set ${\{1,\ldots,k\}}$, with $\pi$ identified with a row vector of size $k$. Let ${M}:=(M_{lm})$ be a symmetric ${k\times k}$ matrix with elements ${M_{lm}\in[0,1]}$. To sample a graph $G$, we generate its adjacency matrix $X=(X_{ij})_{i,j\in\mathcal{V}}$. Since $G$ is undirected, it suffices to sample entries with ${i<j}$, in two steps:

1. For each vertex $i\in\mathcal{V}$, independently generate a label ${\varphi(i)\,\sim\,\pi}$.

2. For each pair ${i<j}$ in $\mathcal{V}$, independently sample from the distribution ${X_{ij}\,|\,\varphi(i),\varphi(j)\,\sim\,\text{Be}(M_{\varphi(i)\varphi(j)})}$.

In this notation, $\varphi$ is a (random) mapping $\varphi:\{1,\ldots,n\}\to\{1,\ldots,k\}$ that attributes a label to each node of the graph. It is random because labels are by definition randomly sampled. The distribution $P_{\pi,M}$ so defined on the set of undirected, simple graphs is called a stochastic blockmodel of order $k$ with parameters $\pi$ and $M$ . One can also write

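$$(\varphi(1),\ldots,\varphi(n))\sim\pi^{\otimes n},\qquad (X_{ij})_{i<j}\mid\varphi\ \sim\ \bigotimes_{i<j}\text{Be}(M_{\varphi(i)\varphi(j)}), \tag{3}$$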

where $\pi^{\otimes n}=\pi\otimes\cdots\otimes\pi$ , and here and in the sequel $i<j$ refers to all pairs of indices $(i,j)\in\mathcal{V}^{2}$ with $i<j$ . Any given $\varphi$ partitions the vertex set ${\{1,\ldots,n\}}$ into $k$ distinct classes. We call $\pi$ the proportions vector and $M$ the matrix of connectivity parameters.
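The two-step sampling scheme of the random design SBM can be sketched in a few lines. This is a minimal illustration using only the standard library; the function name `sample_sbm` and the 0-based class labels are assumptions of the sketch.

```python
import random


def sample_sbm(n, pi, M, rng):
    """Draw labels phi(i) ~ pi, then edges X_ij ~ Be(M[phi(i)][phi(j)]) independently for i < j."""
    k = len(pi)
    # Step 1: independent labels, one per vertex (classes are numbered 0, ..., k-1 here).
    phi = rng.choices(range(k), weights=pi, k=n)
    # Step 2: independent Bernoulli edges given the labels; the matrix is filled symmetrically.
    X = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            X[i][j] = X[j][i] = int(rng.random() < M[phi[i]][phi[j]])
    return phi, X


# Example: two classes, denser within classes than between them.
phi, X = sample_sbm(30, [0.5, 0.5], [[0.8, 0.2], [0.2, 0.8]], random.Random(1))
```

Only `X` would be available to an estimator; `phi` is returned so that the latent labels can be inspected in experiments.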

Graphon models. SBMs can be regarded as a special case of a more general class of random graphs, parametrized by the set of all measurable functions ${w:[0,1]^{2}\rightarrow[0,1]}$ that are symmetric, i.e. ${w(x,y)=w(y,x)}$ . Any such $w$ defines a random graph $G$ : denoting by $\text{Unif}[0,1]$ the uniform distribution on $[0,1]$ , and $(U_{i})_{i}=(U_{i})_{1\leq i\leq n}$ , set

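$$(U_{i})_{i}\sim\text{Unif}[0,1]^{\otimes n},\qquad (X_{ij})_{i<j}\mid(U_{i})_{i}\ \sim\ \bigotimes_{i<j}\text{Be}(w(U_{i},U_{j})). \tag{4}$$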

The law $P_{w}$ of the graph $G$ defined by the random matrix $X$ in (4) is called a graphon model [9]. SBMs are recovered by choosing $w$ as a histogram: subdivide the unit interval into $k$ intervals ${I_{s}:=[\,\sum_{i<s}\pi_{i},\sum_{i\leq s}\pi_{i})}$ of respective lengths $\pi_{s}$ , and set

$$w(x,y):=M_{ij}\quad\text{for } x\in I_{i},\ y\in I_{j}. \tag{5}$$

Then $P_{w}=P_{\pi,M}$. In a graphon model, the continuous vertex labels $U_{i}$ are almost surely distinct; in a stochastic block model, labels coincide whenever two vertices belong to the same class. Thus, the SBM labels can be regarded as a discretization of graphon labels. Conversely, any graphon can be approximated by a sequence of stochastic blockmodels of increasing order $k$; indeed, the set of stochastic blockmodels (that is, of graphons of the form (5) for all $k$, $\pi$ and $M$) is dense in the set of functions $w$ endowed with its natural topology [see e.g. 22, for details]. This idea can be used to construct SBM-valued estimators for graphons [34, 16]. SBMs and graphon models both generalize to directed graphs, by dropping the symmetry constraints on $M$ and $w$, and requiring only ${i\neq j}$ rather than ${i<j}$; in the following, we consider only the undirected case.
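The histogram construction and the graphon sampling scheme can be sketched as follows. The names `histogram_graphon` and `sample_graphon` are hypothetical, and classes are numbered from 0 in the code, so that `M[s][t]` plays the role of $M_{s+1,t+1}$.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def histogram_graphon(pi, M):
    """Return w with w(x, y) = M[s][t] whenever x lies in I_s and y lies in I_t."""
    # Right endpoints of all intervals but the last; I_s is closed on the left, open on the right.
    cuts = list(accumulate(pi))[:-1]

    def w(x, y):
        return M[bisect_right(cuts, x)][bisect_right(cuts, y)]

    return w


def sample_graphon(n, w, rng):
    """Draw U_i ~ Unif[0, 1) independently, then X_ij ~ Be(w(U_i, U_j)) for i < j."""
    U = [rng.random() for _ in range(n)]
    X = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            X[i][j] = X[j][i] = int(rng.random() < w(U[i], U[j]))
    return U, X


w = histogram_graphon([0.25, 0.75], [[0.1, 0.2], [0.2, 0.9]])
U, X = sample_graphon(20, w, random.Random(2))
```

Sampling from this `w` gives the same law for `X` as sampling the SBM with parameters $\pi$ and $M$ directly, since the interval containing $U_{i}$ has index distributed according to $\pi$.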

Label switching and identifiability. The distribution (4) remains invariant if $w$ is replaced by $w\circ g$ , for any measure-preserving transformation $g$ of $[0,1]$ : $P_{w}=P_{\tilde{w}}$ for ${\tilde{w}(x,y)=w(g(x),g(y))}$ . More generally, two graphons $w$ and $w^{\prime}$ are considered equivalent if ${P_{w}=P_{w^{\prime}}}$ . The equivalence class $\left<w\right>$ of $w$ is called a graph limit. Similarly in (3), if $\sigma$ is a fixed arbitrary permutation of $\{1,\ldots,k\}$ , with permutation matrix $\Sigma$ , then $P_{\pi,M}=P_{\pi\Sigma,\Sigma M\Sigma^{T}}$ . The parameters of the SBM can only be recovered up to label switching. We refer to [1] and [11] for detailed identifiability statements.

Fixed and random design. In models (3)-(4), the latent variables, respectively $\varphi$ and $U$ , are random. Sometimes, a slightly different version of the model is considered, where $\varphi$ and $U$ are still unobserved, but fixed, non-random quantities. For instance, under this setting (3) becomes

$$(X_{ij})_{i<j}\ \sim\ \bigotimes_{i<j}\text{Be}(M_{\varphi(i)\varphi(j)}),$$

for a given, unknown, $\varphi:\{1,\ldots,n\}\to\{1,\ldots,k\}$, and the data distribution is denoted $P_{\varphi,M}$. Such models will be referred to as fixed design SBM and random design SBM respectively. The term SBM as used in the literature typically refers to a random design. Some theoretical arguments simplify in the fixed design case, for which the data distribution is a product measure, rather than a mixture of product measures. Most results below are obtained for both cases.

Mixture interpretation. The $n$-tuple $(U_{i})$ in a graphon model, or, equivalently, the mapping $\varphi$ in a SBM, are in general not observed, and can hence be interpreted as latent variables. In other words, the distribution of the data $(X_{ij})_{i<j}$ is a mixture. The mixture representation is useful to relate fixed and random designs to each other. In the random design case, we have

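$$P_{\pi,M}=\sum_{\varphi\in[k]^{n}}\mu_{\pi}[\varphi]\ \bigotimes_{i<j}\text{Be}(M_{\varphi(i)\varphi(j)}),$$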

where $\mu_{\pi}[\varphi]=\prod_{l=1}^{k}\pi_{l}^{N_{l}(\varphi)}$ and $N_{j}(\varphi)=\sum_{i=1}^{n}1\!{\rm l}_{\varphi(i)=j}$ is the number of times the label $j$ is present. In the fixed design model, the labels given through $\varphi$ are also unobserved, but fixed, so that the distribution is $P_{\varphi,M}$ given by

$$P_{\varphi,M}=\bigotimes_{i<j}\text{Be}(M_{\varphi(i)\varphi(j)}).$$
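For very small $n$, the mixture representation can be evaluated by brute force, which also gives a numerical check of the label-switching invariance discussed above. The sketch below enumerates all $k^{n}$ label maps, so it is only meant for toy sizes; the function names are hypothetical and labels are 0-based.

```python
from itertools import product


def fixed_design_prob(X, phi, M):
    """P_{phi,M}(X): product over i < j of the Bernoulli(M[phi(i)][phi(j)]) probabilities."""
    n = len(X)
    p = 1.0
    for i in range(n):
        for j in range(i + 1, n):
            m = M[phi[i]][phi[j]]
            p *= m if X[i][j] else 1.0 - m
    return p


def random_design_prob(X, pi, M):
    """P_{pi,M}(X): sum over all label maps phi of mu_pi[phi] * P_{phi,M}(X)."""
    n, k = len(X), len(pi)
    total = 0.0
    for phi in product(range(k), repeat=n):
        mu = 1.0
        for a in phi:
            mu *= pi[a]  # mu_pi[phi] is the product of pi over the assigned labels
        total += mu * fixed_design_prob(X, phi, M)
    return total


# A path on three vertices: edges {1,2} and {2,3}, no edge {1,3}.
X = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
p_original = random_design_prob(X, [0.3, 0.7], [[0.2, 0.6], [0.6, 0.9]])
p_switched = random_design_prob(X, [0.7, 0.3], [[0.9, 0.6], [0.6, 0.2]])  # classes swapped
```

The two probabilities agree up to floating-point error, reflecting that $(\pi,M)$ is identifiable only up to a permutation of the classes.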

3 Main results: lower bounds

In this section, we first construct in Section 3.1 natural submodels of a SBM with $k=2$ along which the two classes become close, and derive an estimation lower bound in terms of the quadratic risk for the submodel parameter. We then consider in Section 3.2 the more general setting of a SBM with $k$ classes ‘containing’ the previously constructed difficult submodel and derive a minimax estimation lower bound in this setting, which is local around any possible SBM of this type, as we discuss in more detail in Section 3.3.

3.1 The case $k=2$

Consider the set of distributions

$$\mathcal{M}=\{P_{\theta}:=P_{e,Q^{\theta}},\ \theta\in[-1/2,1/2]\},$$

where $e$ and $Q^{\theta}$ are given by

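$$e=\Big[\frac{1}{2},\frac{1}{2}\Big], \tag{8}$$

$$Q^{\theta}=\begin{bmatrix}\frac{1}{2}+\theta & \frac{1}{2}-\theta\\ \frac{1}{2}-\theta & \frac{1}{2}+\theta\end{bmatrix}. \tag{9}$$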

The set $\mathcal{M}$ is a $1$-dimensional submodel of the set of all SBMs with at most two classes. For $\theta=0$ the matrix $Q^{0}$ is degenerate and the model is simply an Erdős-Rényi graph model with parameter $1/2$, that is, all edges are independent and each is present with probability $1/2$. SBMs with connectivity matrices that, like $Q^{\theta}$ above, specify only two distinct values, one for within-group and one for between-group connections, are known as affiliation models [e.g. 2, 3, 27].
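One elementary observation about this submodel, offered here as an illustration and not as a step of the authors' proof: under random design with uniform proportions, the marginal probability of any single edge is $\sum_{a,b}\pi_{a}\pi_{b}Q^{\theta}_{ab}=1/2$ for every $\theta$, so the overall edge density alone carries no information about $\theta$. The sketch below checks this numerically; the function names are hypothetical.

```python
def q_theta(theta):
    """Connectivity matrix of the two-class submodel: 1/2 + theta within, 1/2 - theta between."""
    return [[0.5 + theta, 0.5 - theta], [0.5 - theta, 0.5 + theta]]


def marginal_edge_probability(pi, M):
    """P(X_ij = 1) under random design: sum over label pairs (a, b) of pi_a * pi_b * M_ab."""
    k = len(pi)
    return sum(pi[a] * pi[b] * M[a][b] for a in range(k) for b in range(k))


e = [0.5, 0.5]
densities = [marginal_edge_probability(e, q_theta(t)) for t in (-0.5, -0.1, 0.0, 0.3, 0.5)]
```

Any information about $\theta$ must therefore come from how edges are arranged relative to the unobserved classes, not from how many edges there are.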

Theorem 1.

Consider a stochastic blockmodel (3) with $k=2$ specified by $\mathcal{M}$ , that is $P_{\theta}=P_{e,Q^{\theta}}$ with $e,Q^{\theta}$ given by (8)-(9). There exists a constant $c_{1}>0$ such that for all $n\geq 2$ ,

$$\inf_{T}\ \sup_{\theta\in[-1/2,1/2]} E_{\theta}\big[T(X)-\theta\big]^{2}\ \geq\ \frac{c_{1}}{n},$$

where the infimum is taken over all estimators $T$ of $\theta$ in the model $\mathcal{M}$ .

Proof.

See Section 6.1. ∎

Theorem 1 states that, even in a very simple SBM with ${k=2}$ classes and only one unknown parameter in its connectivity matrix, the minimax estimation rate is no faster than $1/n$. This does not contradict the fast rate obtained by Bickel et al. [7] (meaning a $1/n$ rate for the convergence in distribution but a $1/n^{2}$ rate for the quadratic risk): the latter is a pointwise asymptotic result, and assumes that no two rows of the connectivity matrix are the same, whereas Theorem 1 is nonasymptotic and uniform. It shows that the rate in a two-class model changes for distributions close to an Erdős-Rényi model ($k=1$); informally, models close to the ‘boundary’ are harder to estimate. We note that the result does not require the submodel $\mathcal{M}$ to include the Erdős-Rényi model; see Remark 2 below. The phenomenon is reminiscent of effects familiar from community detection, where matrices similar to (9) naturally arise as most difficult submodels. Community detection is a testing problem, though, as opposed to the estimation problem considered here. For a different but related result in the very sparse case, see [27].

Remark 1 (different parameter choices).

One can easily check that the result of Theorem 1 remains unchanged if instead of $1/2$ in the matrix $Q^{\theta}$ in (9), another number $a_{0}\in(0,1)$ is used. If $a_{0}$ is bounded away from $0$ and $1$, assuming $\min(a_{0},1-a_{0})\geq\rho>0$, then the result is only modified by constants. Also, if the proportions vector $\pi$ is of the form $[b\,,\,1-b]$ with $b>0$, similar results continue to hold, provided the matrix $Q^{\theta}$ is replaced by

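$$Q_{b}^{\theta}=\begin{bmatrix}\frac{1}{2}+c_{b}\theta & \frac{1}{2}-d_{b}\theta\\ \frac{1}{2}-d_{b}\theta & \frac{1}{2}+c_{b}\theta\end{bmatrix}$$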

for suitable constants $c_{b},d_{b}$ that depend on $b$ (one can take e.g. $c_{b}=1-b$ and $d_{b}=b$ ).

Remark 2 (numerical constants).

In Theorem 1, one can take $c_{1}=1/107$ ; additionally, the supremum can be restricted to $(-\theta_{n},\theta_{n})$ for

$$\theta_{n}=\frac{c_{0}}{\sqrt{n}}\quad\text{and}\quad c_{0}=\frac{1}{3\cdot 2^{3/4}}\approx 0.56.$$

Moreover, the proof implies that one can restrict the supremum to a set not actually containing $\theta=0$ , but rather two points close enough to $\theta=0$ , namely $\theta_{1}=c_{1}/\sqrt{n}$ , $\theta_{2}=c_{2}/\sqrt{n}$ for suitably chosen, fixed constants $c_{1},c_{2}>0$ .

Fixed design. A result similar to Theorem 1 holds for fixed designs. In this case, the map $\varphi$ is deterministic, and the model can be written as $\mathcal{M}_{F}=\{P_{\theta,\varphi}:=P_{\varphi,Q^{\theta}},\,\theta\in[-1/2,1/2],\ \varphi\in[2]^{n}\}$ . Expectations with respect to the measures $P_{\theta}$ and $P_{\theta,\varphi}$ are denoted respectively $E_{\theta}$ and $E_{\theta,\varphi}$ . We then have

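$$\inf_{T_{f}}\ \sup_{\theta\in[-1/2,1/2],\ \varphi\in[2]^{n}} E_{\theta,\varphi}\big[T_{f}(X)-\theta\big]^{2}\ \geq\ \frac{c_{1}}{n},$$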

where the infimum is taken over all estimators $T_{f}$ of $\theta$ in the fixed design model. The proof is the same as for Theorem 1, see Section 6.1.

3.2 The general case

We now consider an arbitrary number $k$ of classes. Above, we have perturbed a SBM with ${k=2}$ classes around an Erdös-Renyi model. We now similarly perturb a $k$ -class SBM around one with ${k-1}$ classes. The connectivity matrix of a SBM with at most ${k-1}$ classes is of the form, for $a_{0},a_{i},b_{ij}\in[0,1]$ for $i,j\in[k-2]$ ,

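$$M=\begin{bmatrix} a_{0} & a_{1} & \cdots & a_{k-2}\\ a_{1} & b_{11} & \cdots & b_{1\,k-2}\\ \vdots & \vdots & & \vdots\\ a_{k-2} & b_{1\,k-2} & \cdots & b_{k-2\,k-2}\end{bmatrix}. \tag{10}$$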

For simplicity of notation and easy comparison with Section 3.1, we assume $a_{0}=1/2$ throughout. Results are easily adapted to the case $a_{0}\in(0,1)$, requiring only that $a_{0}$ be bounded away from $0$ and $1$. If needed, one can ensure the number of classes is exactly ${k-1}$ by requiring that no two rows of $M$ coincide, which we require only in Theorems 7 and 8, which describe the behavior of spectral estimators.

We consider $1$ -dimensional submodels in the parameter space of connectivity matrices: Set

$$e_{k}=\Big[\frac{1}{k},\ldots,\frac{1}{k}\Big], \tag{11}$$

and, for coefficients $\{a_{i}\},\{b_{ij}\}$ as above, define

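$$M^{\theta}=\begin{bmatrix}\frac{1}{2}+\theta & \frac{1}{2}-\theta & a_{1} & \cdots & a_{k-2}\\ \frac{1}{2}-\theta & \frac{1}{2}+\theta & a_{1} & \cdots & a_{k-2}\\ a_{1} & a_{1} & b_{11} & \cdots & b_{1\,k-2}\\ \vdots & \vdots & \vdots & & \vdots\\ a_{k-2} & a_{k-2} & b_{1\,k-2} & \cdots & b_{k-2\,k-2}\end{bmatrix}=\begin{bmatrix} Q^{\theta} & A^{T}\\ A & B\end{bmatrix}, \tag{12}$$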

where

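$$Q^{\theta}=\begin{bmatrix}\frac{1}{2}+\theta & \frac{1}{2}-\theta\\ \frac{1}{2}-\theta & \frac{1}{2}+\theta\end{bmatrix}$$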

and

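$$A=\begin{bmatrix} a_{1} & a_{1}\\ a_{2} & a_{2}\\ \vdots & \vdots\\ a_{k-2} & a_{k-2}\end{bmatrix},\qquad B=\begin{bmatrix} b_{11} & \cdots & b_{1\,k-2}\\ \vdots & & \vdots\\ b_{1\,k-2} & \cdots & b_{k-2\,k-2}\end{bmatrix}.$$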

Thus, $M^{\theta}$ is a symmetric $k\times k$ matrix, obtained from $M$ by replacing the scalar coefficient $a_{0}$ by the $2\times 2$ matrix $Q^{\theta}$ , and repeating the vector $(a_{i})_{1\leq i\leq k-2}$ .
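The block construction of $M^{\theta}$ can be written out directly. The sketch below takes the vector $(a_{i})$ and the matrix $B$ as plain Python lists and assumes $a_{0}=1/2$ as in the text; the function name `m_theta` is hypothetical.

```python
def m_theta(theta, a, B):
    """Assemble the k x k matrix [[Q^theta, A^T], [A, B]] from a = (a_1, ..., a_{k-2}) and B."""
    q = [[0.5 + theta, 0.5 - theta], [0.5 - theta, 0.5 + theta]]
    # First two rows: a row of Q^theta followed by the repeated vector (a_i), i.e. [Q^theta, A^T].
    top = [q[0] + list(a), q[1] + list(a)]
    # Remaining rows: a_i twice (a row of A), followed by the corresponding row of B.
    bottom = [[a[i], a[i]] + list(B[i]) for i in range(len(a))]
    return top + bottom


a = [0.3, 0.7]
B = [[0.2, 0.4], [0.4, 0.6]]
M_pert = m_theta(0.1, a, B)   # k = 4 classes, classes 1 and 2 differ
M_zero = m_theta(0.0, a, B)   # first two rows coincide: effectively k - 1 = 3 classes
```

At $\theta=0$ the first two rows of the matrix are identical, so the first two classes cannot be told apart, which is the degenerate situation the submodel is built around.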

The number of nodes in a given class will be specified as follows. For simplicity, we choose the proportions vector $\pi$ in (3) equiproportional and equal to $e_{k}$ in (11). (As in the case $k=2$ , analogous results can be obtained if the proportions are of similar sizes.) Consider the model defined by

$$\mathcal{M}_{k}=\{P_{\theta}:=P_{e_{k},M^{\theta}},\ \theta\in[-1/2,1/2]\}, \tag{13}$$

for $e_{k},M^{\theta}$ as in (11)-(12). This is a $1$ –dimensional submodel of the set of all SBMs with at most $k$ classes. For $\theta=0$ , the matrix $M^{0}$ again has two identical rows, and the model becomes a SBM with at most $k-1$ classes, with connectivity matrix given by $M$ defined in (10). By $E_{\theta}$ , we denote the expectation under $P_{\theta}$ in the model $\mathcal{M}_{k}$ given by (13).

Theorem 2.

Consider a stochastic blockmodel (3) with $k\geq 2$ classes specified by $\mathcal{M}_{k}$ in (13), that is $P_{\theta}=P_{e_{k},M^{\theta}}$ with $e_{k},M^{\theta}$ given by (11)-(12), for fixed matrices $A,B$ with arbitrary coefficients. There exists a constant $c_{3}=c_{3}(\rho)>0$ , independent of $A,B$ , such that, for all $n\geq 12k$ ,

$$\inf_{T}\ \sup_{\theta\in[-1/2,1/2]} E_{\theta}\big[T(X)-\theta\big]^{2}\ \geq\ c_{3}\,\frac{k}{n},$$

where the infimum is taken over all estimators $T$ of $\theta$ in the model $\mathcal{M}_{k}$ .

Proof.

See Section 6.2. ∎

Fixed design. A similar result holds for the fixed design case, assuming that classes, given by the mapping $\varphi$ , are balanced in the following sense. Let $\Sigma_{e}$ denote the set of maps $\varphi\in[k]^{n}$ such that, for some constants $c_{1},c_{2}$ , for any $1\leq j\leq k$ ,

$$c_{1}\frac{n}{k}\ \leq\ |\varphi^{-1}(j)|\ \leq\ c_{2}\frac{n}{k}.$$

The set $\Sigma_{e}$ thus consists of those maps $\varphi$ that produce $k$ classes all of size of order $n/k$ . Then the conclusion of Theorem 2 still holds, provided $E_{\theta}$ is replaced by $E_{\theta,\varphi}$ , and the supremum taken over $\theta\in[-1/2,1/2]$ and $\varphi\in\Sigma_{e}$ as defined just above.
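The balance condition defining $\Sigma_{e}$ is easy to check for a given label map. In the sketch below the constants `c1` and `c2` are left as arguments, since the text does not fix their values; the function name and the 0-based labels are assumptions.

```python
from collections import Counter


def is_balanced(phi, k, c1, c2):
    """Check c1 * n / k <= |phi^{-1}(j)| <= c2 * n / k for every class j in {0, ..., k-1}."""
    n = len(phi)
    sizes = Counter(phi)
    # sizes.get(j, 0) treats a class that never occurs as having size 0.
    return all(c1 * n / k <= sizes.get(j, 0) <= c2 * n / k for j in range(k))


balanced = is_balanced([0, 1] * 50, k=2, c1=0.5, c2=2.0)        # two classes of size 50
unbalanced = is_balanced([0] * 99 + [1], k=2, c1=0.5, c2=2.0)   # class sizes 99 and 1
```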

3.3 Some comments on the results

Theorem 2 establishes that the minimax estimation rate of $\theta$ in model (13) is at best of the order $k/n$, uniformly over $k$ and $n$. An intuitive explanation for this particularly slow rate is as follows: the phenomenon observed for $k=2$ is still present, but this time the part of the matrix $M^{\theta}$ containing information about $\theta$ is smaller, as only a fraction of order $2/k$ of the nodes will be assigned to classes $1$ or $2$, which correspond to the elements of the connectivity matrix that depend on $\theta$.

An important point is that this lower bound is minimax local (as opposed to the more commonly proved minimax global results); that is, not only does this slow rate occur around one specific least-favorable point in the parameter space, it occurs around any point. More precisely: If we start with any $k\geq 2$ , any proportions vector, and any connectivity matrix $M$ as in (10) with $k-1$ classes, there exists at least one submodel around $M$ , namely $\mathcal{M}_{k}$ in (13), such that estimation of a connectivity parameter in $M$ cannot be faster than $k/n$ . In Theorem 1, the model given by $\theta=0$ is an Erdős-Rényi graph, which raises the question whether the slow rate in Theorem 1 is a consequence of the distinguished properties of the Erdős-Rényi model. This is not the case. Proving such a local bound makes the proof of Theorem 2 more involved in the random design case, as one has to quantify the $L^{1}$ -distance between two mixtures of probability measures, instead of between one fixed measure and a mixture as is often the case in proving minimax global bounds.

It is interesting to compare the rate in Theorem 2 to the one that would be obtained if the labels were observed. If $k$ is fixed, Lemma 2 in Bickel et al. [7] gives a quadratic rate of order $1/n^{2}$ for connectivity parameters when labels are observed. This result can be easily adapted to the case where $k$ possibly grows with $n$ , say in an asymptotic setting with $n\to\infty$ and $k/n\to 0$ , leading to a quadratic rate of order $(k/n)^{2}$ . The uniform rate in Theorem 2 is the square-root of this rate and thus much slower.
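As a purely illustrative computation (the values of $n$ and $k$ are chosen here and do not come from the results above), ignoring constants:

```latex
n = 10^{4},\quad k = 10:\qquad
\Bigl(\frac{k}{n}\Bigr)^{2} = 10^{-6}
\quad\text{(labels observed)},
\qquad
\frac{k}{n} = 10^{-3}
\quad\text{(labels unobserved, uniform rate)}.
```

In this example the uniform quadratic risk exceeds the labelled one by a factor of $n/k=10^{3}$ .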

4 Further results: upper bounds and smooth graphons

In this section, we complement our main results by upper bounds (under some conditions when $k\geq 3$ ) and results for certain smooth graphons, which can be seen as a continuous analogue of the results for the SBM parameter $\theta$ .

We establish upper bounds that show that the lower bounds in the previous section can be matched for certain subsets of connectivity matrices. In the case of $k\geq 3$ classes, the conditions are arguably somewhat restrictive and can probably be improved. However, since the lower bounds are proved to be local around any possible SBM containing two classes that are close, the rate, if not matched, can only become worse. As we show below, some conditions are in fact necessary. Indeed, we give an example in Section 4.3 where the rate drops further, illustrating the difficulty of the estimation problem.

4.1 Upper bounds via maximum likelihood

Theorems 1 and 2 provide lower bounds. There are corresponding matching upper bounds, which we obtain next.

The case ${k=2}$ . We define an estimator of $\theta$ as follows. For any $\sigma$ an element of $[2]^{n}$ , i.e. for any mapping $\{1,\ldots,n\}\to\{1,2\}$ , define

[TABLE]

Maximising (14) in $\sigma$ leads us to set

[TABLE]

which leads to the profile maximum likelihood estimate

[TABLE]

This estimator can be seen as a (pseudo)-maximum likelihood estimate, see Appendix C.

Theorem 3.

Consider a stochastic blockmodel (3) with $k=2$ specified by $\mathcal{M}$ , that is, $P_{\theta}=P_{e,Q^{\theta}}$ with $e,Q^{\theta}$ given by (8)-(9). Let $\smash{\hat{\theta}}$ be the estimator defined by (15). There exists a constant $C_{1}>0$ such that for all $n\geq 2$ ,

[TABLE]

The same risk bound holds for $\hat{\theta}$ in the fixed design model, uniformly over $\theta$ and $\varphi\in[2]^{n}$ .

Proof.

See Appendix C.2. ∎

The main takeaway from this result is that the uniform quadratic rate for estimating the connectivity parameter along the submodel $\mathcal{M}$ is exactly of order $n^{-1}$ , up to constants. This follows from combining Theorems 1 and 3. This ‘slow’ rate (as compared to the asymptotic pointwise quadratic rate $n^{-2}$ of (2)) arises even if all other parameters (here, the vector of proportions $\pi$ ) are assumed known. The submodel built for ${k=2}$ can be regarded as a local perturbation of an Erdős-Rényi graph model with connection probability $1/2$ . The drop in the rate is already noteworthy, as the rate of estimation of $p$ for an ER $(p)$ model is of the order $n^{-2}$ .

The case ${k\geq 2}$ . For this case, we make additional (but fairly mild) assumptions on the matrix $M$ . These conditions are for simplicity of presentation and could, in some cases, be improved. Our main purpose here is to show that, for ‘typical’ matrices $A$ and $B$ in (12), the rate of estimation of $\theta$ in (12) is indeed exactly of the order $k/n$ . In Section 4.3 below, we show that at least some conditions on possible matrices $A,B$ are necessary: for certain unfavourable matrices, the rate drops below $k/n$ . As was the case for Theorem 1, the result of Theorem 3 remains unchanged if the constant $1/2$ in $Q^{\theta}$ is replaced by any $a_{0}\in(0,1)$ .

We modify the criterion function (14) by restricting it to a given subset $S\subset\{1,2,\ldots,n\}$ of indices,

[TABLE]

To avoid technicalities, we maximize over a grid, which constitutes no loss of generality. To this end, define the regular grid $\Theta_{n}=\{i/(2n^{2}),\ i=-n^{2},\ldots,n^{2}\}$ in $\Theta=[-1/2,1/2]$ , and

[TABLE]
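The claim that restricting to the grid $\Theta_{n}$ costs nothing can be illustrated numerically. The sketch below builds the grid with exact rational arithmetic and checks that any $\theta\in[-1/2,1/2]$ lies within half a mesh width, $1/(4n^{2})$ , of a grid point, so the squared discretisation error is at most $1/(16n^{4})$ , negligible compared with rates of order $k/n$ . The helper names `grid` and `nearest` are introduced here for illustration.

```python
from fractions import Fraction

def grid(n):
    """Regular grid {i / (2 n^2) : i = -n^2, ..., n^2} inside [-1/2, 1/2]."""
    return [Fraction(i, 2 * n * n) for i in range(-n * n, n * n + 1)]

def nearest(theta, n):
    """Closest grid point to theta, obtained by rounding the grid index."""
    i = round(Fraction(theta) * 2 * n * n)
    return Fraction(i, 2 * n * n)

n = 10
pts = grid(n)
theta = Fraction(1, 3)
err = abs(nearest(theta, n) - theta)
# The rounding error is at most half the mesh width, i.e. 1 / (4 n^2).
print(len(pts), pts[0], pts[-1], err <= Fraction(1, 4 * n * n))
```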

Equation (17) defines a global maximum-likelihood type estimator, which is then used to obtain an estimate $\tilde{S}_{I}$ of the set of nodes labelled $1$ or $2$ . Given this estimate, one can apply the profile-type method already used in the case ${k=2}$ : For $\tilde{S}_{I}$ as in (18), $\tilde{n}_{k}={|\tilde{S}_{I}|\choose 2}$ , and $Z_{n}$ as in (16), set

[TABLE]

We require the coefficient $a_{0}$ of the matrix $M$ in (10) to be sufficiently distinct from the remaining entries: Let $\mathcal{C}=\{a_{i},b_{ij},1\leq i,j\leq k-2\}$ be the set of coefficients of the matrices $A$ and $B$ in (12), with $a_{0}=1/2$ ,

[TABLE]

Theorem 4.

Consider a stochastic blockmodel (3) with $k\geq 2$ classes specified by $P_{\theta}=P_{e_{k},M^{\theta}}$ with $e_{k}$ and $M^{\theta}$ given by (11)-(12), for fixed matrices $A$ and $B$ . Define $\smash{\hat{\theta}}=\smash{\hat{\theta}}(X)$ as in (20). Suppose (21) holds and that, for some small enough $d$ and $\kappa$ as in (21),

[TABLE]

Then there exists a universal constant $C_{1}>0$ such that for $n\geq 5$ ,

[TABLE]

The same risk bound holds for $\hat{\theta}$ in the fixed design model, uniformly over $|\theta|\leq\kappa,\varphi\in\Sigma_{e}$ .

Proof.

See Appendix C.3. ∎

Note $\kappa$ in (22) may depend on $k$ and $n$ , and may go to zero in a framework where $k,n$ go to infinity. Below are two examples for the behaviour of $\kappa$ . These examples illustrate that our conditions are indeed met in commonly encountered settings, in particular, with high probability, if $M$ is a random matrix and $k$ does not grow too rapidly with $n$ .

Example 1 (well-separated block). If $\kappa$ is a fixed positive constant, e.g. $1/4$ , then the submatrix $Q^{\theta}$ is well separated from the other coefficients of the matrix $M$ . The procedure above then correctly picks up a sensible approximation of the true set $\sigma^{-1}(\{1,2\})$ via $\tilde{S}_{I}$ and the rate $k/n$ is achieved, as long as $k$ does not grow faster than $n^{1/3}/\log{n}$ , which is already a fairly large number of classes.

Example 2 (randomly sampled matrix $M$ ). Suppose that the symmetric matrix ${M=:(c_{ij})}$ in (10) is a random matrix whose upper triangular entries are drawn i.i.d. with uniform distribution $\mathcal{U}[0,1]$ , except $c_{11}=1/2$ . The distribution of $|c_{ij}-1/2|$ except for $i=j=1$ is then $\mathcal{U}[0,1/2]$ , and it is a standard fact that the first order statistic of a $\mathcal{U}[0,1]$ sample of size $N$ is $\text{Beta}(1,N)$ distributed. That implies the random variable $2\min_{c_{ij}\in\mathcal{C}}|c_{ij}-1/2|$ has law $\text{Beta}(1,k(k-1)/2-1)$ . Therefore, $\kappa$ in (21) is of order no less than $1/k^{2}$ with high probability. From (22) one deduces that for $k$ of the form $n^{\delta}$ with $\delta<1/11$ and $n$ large enough, the rate $k/n$ is achieved uniformly and locally, for typical matrices $M$ . Inspection of the proof of Theorem 4 reveals that $k=o(n^{\delta})$ with $\delta<1/7$ in fact suffices for the rate $k/n$ to be attained with high probability when $M$ is random: this is achieved by distinguishing $c_{ij}$ of the types $a_{i}$ or $b_{ij}$ in the proof and noting that the minimum of $|a_{i}-1/2|$ over $i$ will be of larger order $k^{-1}$ , instead of $k^{-2}$ for the minimum over $i,j$ of $|b_{ij}-1/2|$ .
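The order-statistic fact used in Example 2 can be verified directly. The following derivation is a sketch; the symbols $U_{i}$ , $t$ and $\varepsilon$ are introduced here for illustration.

```latex
\begin{aligned}
&U_{1},\dots,U_{N}\ \text{i.i.d. } \mathcal{U}[0,1]:\qquad
P\Bigl(\min_{i\leq N}U_{i}>t\Bigr)=(1-t)^{N},\quad 0\leq t\leq 1,\\
&\text{so } \min_{i\leq N}U_{i} \text{ has density } N(1-t)^{N-1}
\text{ (the } \text{Beta}(1,N) \text{ law), and }
P\Bigl(\min_{i\leq N}U_{i}\leq t\Bigr)=1-(1-t)^{N}\leq Nt .\\
&\text{With } N=\tfrac{k(k-1)}{2}-1\leq\tfrac{k^{2}}{2}
\text{ and } t=\tfrac{2\varepsilon}{k^{2}}:\qquad
P\Bigl(2\min_{c_{ij}\in\mathcal{C}}\bigl|c_{ij}-\tfrac12\bigr|\leq\tfrac{2\varepsilon}{k^{2}}\Bigr)\leq\varepsilon .
\end{aligned}
```

Hence, with probability at least $1-\varepsilon$ , the smallest gap $\min_{c_{ij}\in\mathcal{C}}|c_{ij}-1/2|$ exceeds $\varepsilon/k^{2}$ , which is the sense in which the separation is of order no less than $1/k^{2}$ .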

Remark 3 (conditions $|\theta|\leq\kappa$ and (21)).

We slightly restrict the range of $\theta$ in the upper bound of Theorem 4. Formally, the matching upper bound is obtained for a somewhat smaller interval than $[-1/2,1/2]$ when $k\geq 3$ . (If $k$ is fixed and $n\to\infty$ , the restriction is only to $[-\delta,\delta]$ for a small enough constant $\delta>0$ .) The condition is needed to ensure, in combination with (21), that the block $Q^{\theta}$ in the matrix (12) is separated sufficiently from the other submatrices $A$ and $B$ . If this is not the case, the estimation problem can become more difficult, and the rate hence slower. This is formally shown in Section 4.3, where the extreme case of all coefficients of $A,B$ being equal to $1/2$ is discussed. This phenomenon can also occur if only some parts of $A$ and $B$ are close to $1/2$ , or to either $1/2+\theta$ or $1/2-\theta$ for some $\theta\in(0,1/2)$ .

We do not claim that the restriction to $[-\kappa,\kappa]$ and (21) are sharp conditions; they can probably be improved. However, the argument above shows that some condition of this form is needed, although it may vary depending on the estimation procedure considered: for spectral estimators as considered in Appendix A, for example, we need a similar separation assumption, although it takes a slightly different form (see Theorem 7 in Appendix A and the comments below it). We also note that small values of $\theta$ are conceptually the most interesting case, since the $2\times 2$ subproblem becomes easier the larger $\theta$ becomes.

4.2 Upper bounds via spectral estimates

Since the maximum likelihood estimator (17) has to optimize over the set $[k]^{n}$ , it need not be computable in polynomial time. It hence seems natural to ask whether there is a “computational gap”, that is, whether the best estimator computable in polynomial time converges at a slower rate than predicted by the minimax bound. We do not have a complete answer to this question, but for a somewhat restricted model class, no such gap exists: The estimator described below for the case ${k=2}$ uses a spectral method, see e.g. [25]. A generalization to ${k\geq 3}$ classes is discussed in Appendix A, which requires further conditions. Within the remit of these conditions, however, the minimax rate is achievable in polynomial time. An extension to sparse graphs is considered in Appendix B. A small simulation study in Section A.2 illustrates the behaviour of the estimator.

With the convention that $X_{ii}=1/2$ and $X_{ji}=X_{ij}$ , define the $n\times n$ matrix $\Delta$ by

[TABLE]

Let $\lambda_{1}^{a}(\Delta)$ denote the largest eigenvalue in absolute value of $\Delta$ and set

[TABLE]

We refer to this procedure as the spectral algorithm for $k=2$ and denote it $\mathcal{S}_{2}$ . The intuition behind this estimator in the fixed design setting is the following. For $i\neq j$ , we have

[TABLE]

Set ${v=((-1)^{\mathds{1}\{\varphi(i)=1\}})_{i\leq n}}$ and $V:=vv^{t}=\bigl{(}(-1)^{\mathds{1}\{\varphi(i)\neq\varphi(j)\}}\bigr{)}_{i,j\leq n}$ . Then for non-random $\varphi$ ,

[TABLE]

where $I_{n}$ is the identity matrix of size $n$ . As $E[\Delta]$ is a rank $1$ matrix whose non-zero eigenvalue equals $(n-1)\theta$ (with $v$ the corresponding eigenvector), this leads us to introduce $\tilde{\theta}$ as in (23).
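The intuition above can be tried out on simulated data. The following pure-Python sketch draws a two-class SBM with edge probability $1/2+\theta$ within classes and $1/2-\theta$ across classes, centres the adjacency matrix at $1/2$ , and finds the eigenvalue of largest absolute value by power iteration. Dividing that eigenvalue by $n-1$ is an assumption made here, based on the eigenvalue $(n-1)\theta$ mentioned in the text; the exact normalisation in (23) may differ. The function names, the seed, and the values of $n$ and $\theta$ are illustrative.

```python
import random

def simulate_delta(n, theta, rng):
    """Centred adjacency matrix Delta = X - 1/2 for a 2-class SBM.

    Edge probability is 1/2 + theta within a class and 1/2 - theta across
    classes; labels are drawn uniformly. The diagonal of Delta is 0,
    matching the convention X_ii = 1/2.
    """
    labels = [rng.choice((1, 2)) for _ in range(n)]
    delta = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            p = 0.5 + theta if labels[i] == labels[j] else 0.5 - theta
            x = 1.0 if rng.random() < p else 0.0
            delta[i][j] = delta[j][i] = x - 0.5
    return delta

def top_eigenvalue(a, iters=60, rng=None):
    """Signed eigenvalue of largest absolute value, by power iteration."""
    rng = rng or random.Random(0)
    n = len(a)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        w = [sum(row[j] * v[j] for j in range(n)) for row in a]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    av = [sum(row[j] * v[j] for j in range(n)) for row in a]
    return sum(x * y for x, y in zip(v, av))  # Rayleigh quotient

rng = random.Random(1)
n, theta = 200, 0.2
# Assumed normalisation: divide by n - 1 (see the lead-in).
theta_tilde = top_eigenvalue(simulate_delta(n, theta, rng)) / (n - 1)
print(round(theta_tilde, 3))
```

Power iteration is used only to keep the sketch free of dependencies; any symmetric eigenvalue routine would serve the same purpose.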

Theorem 5.

In the same setting as in Theorem 3, let $\tilde{\theta}$ be the estimator defined by (23). There exists a constant $C>0$ such that for all $n\geq 2$ ,

[TABLE]

The same risk bound holds for $\tilde{\theta}$ in the fixed design model, uniformly over $\theta$ and $\varphi\in[2]^{n}$ .

Proof.

This follows as a special case of Theorem 8, in Appendix B. ∎

4.3 Necessity of conditions on $M$

What precedes shows that the rate $k/n$ is achieved under conditions on $M$ in (10) and/or $k$ . In general, we expect the rate to depend on the matrix $M$ . Although we do not investigate this point in full here, we discuss it briefly.

The estimation methods investigated in Section 4.1 (MLE) and Appendix A (spectral method) require the upper-left ${2\times 2}$ block of $M^{\theta}$ to be sufficiently separated from at least part of the other entries of $M^{\theta}$ . Among those matrices $M^{\theta}$ whose upper-left corner equals $Q^{\theta}$ , a worst case scenario should correspond to a matrix whose coefficients in $A$ and $B$ all equal $1/2$ . This leads to the matrix

[TABLE]

which is of course heavily over-specified from the SBM perspective. Consider the SBM in a fixed design case, where $\varphi:\{1,\ldots,n\}\to\{1,\ldots,k\}$ is unobserved. Suppose all classes $\sigma^{-1}(i)$ are of cardinality of order $n/k$ , and the connectivity matrix is given by (24). This specific model can be regarded as a special case of the setting considered from a testing perspective by Butucea and Ingster [10] and Arias-Castro and Verzelen [3]. From Theorem 4.3 of [10], one can deduce that the minimax rate for the quadratic risk when estimating $\theta$ is no better than $\rho_{n}=\min\left(\frac{k^{2}}{n},\sqrt{\frac{k\log k}{n}}\right)$ , for $k,n\to\infty$ and $\rho_{n}=o(1)$ . The rate is therefore no better than $k^{2}/n$ for $k\leq n^{1/3}$ , and remains much slower than $k/n$ even for $k>n^{1/3}$ .
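To get a feel for the size of $\rho_{n}$ relative to $k/n$ , the following sketch evaluates both quantities for a few values of $k$ . Constants are ignored and the bound is asymptotic, so the numbers are indicative only; the function name `rho` and the chosen values of $n$ and $k$ are illustrative.

```python
import math

def rho(n, k):
    """Lower-bound rate min(k^2 / n, sqrt(k log k / n)) from the text."""
    return min(k * k / n, math.sqrt(k * math.log(k) / n))

n = 10**6
for k in (10, 50, 1000):
    # Columns: k, rho_n, k/n, and the ratio rho_n / (k/n).
    print(k, rho(n, k), k / n, rho(n, k) / (k / n))
```

For $k$ below $n^{1/3}$ the minimum is attained by $k^{2}/n$ , so the ratio to $k/n$ is $k$ itself.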

4.4 Minimax rates for a class of functionals of smooth graphons

Stochastic block models can be identified with piecewise constant graphons; we now consider the case where the graphon is a smooth function instead. Let $w:[0,1]^{2}\to[0,1]$ be a measurable function, let $\left<w\right>$ be its graphon equivalence class, and denote by ${P_{\left<w\right>}=P_{w}}$ the distribution of data $X$ generated by the graphon model (4). Consider the problem of estimating the functional

[TABLE]

for any representer $w$ of $\left<w\right>$ . This is well defined in terms of the graphon, as the integral is invariant under any simultaneous (Lebesgue-)measure-preserving transformation of $x$ and $y$ .

The statistic (25) can be interpreted as a ‘graphon-standard deviation’. Its estimation under a smooth graphon model is, in a sense, analogous to the problem of estimating the functional $\theta$ in the simple SBM with two classes discussed in Section 3.1: Let $h_{\theta}$ be the piece-wise constant graphon characterizing the SBM defined by (8)–(9). Since $\vartheta({\langle}h_{\theta}{\rangle})=|\theta|$ , estimating $\theta$ is then indeed equivalent (for positive values) to estimation of $\vartheta(\left<h_{\theta}\right>)$ .
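The identity $\vartheta({\langle}h_{\theta}{\rangle})=|\theta|$ can be checked by hand under two assumptions that are made here for illustration: that (25) is the $L^{2}$ standard deviation of $w$ over $[0,1]^{2}$ (suggested by the name ‘graphon-standard deviation’), and that the two classes have equal proportions, so that $h_{\theta}$ equals $1/2+\theta$ on $[0,1/2)^{2}\cup[1/2,1]^{2}$ and $1/2-\theta$ elsewhere.

```latex
\begin{aligned}
\int_{[0,1]^{2}} h_{\theta}(x,y)\,dx\,dy
  &= \tfrac12\bigl(\tfrac12+\theta\bigr)+\tfrac12\bigl(\tfrac12-\theta\bigr)=\tfrac12,\\
\Bigl(\int_{[0,1]^{2}}\bigl(h_{\theta}(x,y)-\tfrac12\bigr)^{2}\,dx\,dy\Bigr)^{1/2}
  &= \bigl(\theta^{2}\bigr)^{1/2}=|\theta| .
\end{aligned}
```

Each of the two regions has area $1/2$ , and the deviation from the mean has absolute value $|\theta|$ everywhere, which gives the result.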

Under a 2-class SBM, the results of Section 3.1 show $\theta$ in (9) cannot be estimated faster than $c/n$ . It is natural to ask whether the same still holds if one works with ‘smoother’ graphons instead of histograms (where we refer to ${\langle}w{\rangle}$ as smooth if at least one of its representers is a smooth function). The following result addresses this question for a simple class of smooth graphons, both for $\vartheta(\cdot)$ and for a larger class of functionals containing $\vartheta(\cdot)$ .

Let $\mathcal{P}_{B}$ be the collection of all graphons that admit a representer which is a polynomial in $x,y$ , with degree bounded by some integer $D\geq 2$ and coefficients bounded by an arbitrary constant $M>0$ (this boundedness restriction only serves to ensure a matching upper bound, up to a log term, in the next result). For any $0\leq\theta\leq 1$ , let us denote by $w_{\theta}$ the function from $[0,1]^{2}$ to $[0,1]$ given by

[TABLE]

and let $w_{0}$ denote the constant function equal to $1/2$ . The function $w_{\theta}$ can be interpreted as a ‘smooth’ counterpart to the histogram graphon underlying the SBM (9).

Theorem 6.

Let $X$ be data from the graphon model (4). Let $\vartheta(\cdot)$ be defined as in (25). There exist constants $c_{1},c_{2}>0$ such that

[TABLE]

where the infimum is taken over all possible estimators of $\vartheta({\langle}w{\rangle})$ in model (4). Let $\psi$ be an arbitrary functional defined on graphon equivalence classes satisfying

[TABLE]

for some $c>0$ , for any $0\leq\theta\leq 1$ , and for the function $w_{\theta}$ in (26). Then for some $d>0$ ,

[TABLE]

Proof.

See Section 7. ∎

The first part of Theorem 6 asserts that the quadratic minimax rate for estimating (25) cannot be faster than $c/n$ , even if one restricts the parameter set to a small class of smooth graphons $w$ , namely graphons with a polynomial representer of bounded degree. This class can be seen as a smooth analogue of the histogram graphon underlying model (7), or more generally the model with $k$ classes and connectivity (12). The degree of the polynomial can be seen as the analog of $k$ . The rate of order $1/n$ is obtained because the degree of the polynomial is assumed bounded. Although we do not investigate this further here, one may conjecture that the rate would slow even further for a larger class (e.g. growing degree of polynomials, or a nonparametric class such as a Hölder ball).

The second part of Theorem 6 indicates that the specific form of the functional $\vartheta(\cdot)$ in (25) is not essential for the lower bound to hold. A given functional $\psi(\cdot)$ leads to a rate at least as slow as that of $\vartheta(\cdot)$ over the considered class of graphons as soon as (27) holds. This condition intuitively means that the functional $\psi(\cdot)$ is at least as hard to estimate as the functional $\vartheta(\cdot)$ , for which the difference on the left-hand side of (27) indeed behaves like $|\theta|$ . By direct computation we see that an example of such a graphon functional is

[TABLE]

Providing a unified theory with matching lower and upper bounds for graphon functionals is an interesting topic for future research.

5 Discussion

Gao et al. [16] show that, if one estimates the parameter function $w$ of a graphon model, not observing the vertex labels (in this case, the variables $U_{i}$ in (4)) does in general affect the optimal rate. In the present paper, we have considered uniform estimation of certain functionals of graphon models (in particular, the loss function is quite different from theirs). For estimation of certain random graph functionals, including the connectivity parameters considered by Bickel et al. [7], we have shown that the uniform, minimax rate does depend on whether the labels are observed, i.e. the phenomenon described by [16] persists even if one does not try to recover the entire function $w$ , but only a specific $1$ –dimensional aspect of $w$ . The fast quadratic rate $1/n^{2}$ is not achievable uniformly. If the number $k$ of classes is known and fixed, the quadratic rate becomes $1/n$ . If the number of classes $k$ grows with $n$ , the rate drops to $k/n$ . We have used some mild assumptions on the part of the connectivity matrix other than the $2\times 2$ submodel. If those assumptions are not satisfied, the rate may even drop further. Similar results also hold for sparse graphs.

Interestingly, for the functionals considered here, the uniform rate is always, regardless of the number of classes $k$ , much below the rate in the case where labels would be observed. This is in contrast with the problem of recovery of the mean adjacency matrix considered in [16], where, for $k$ larger than $\sqrt{n\log{n}}$ , the (non–normalised) rate $k^{2}+n\log{k}$ is dominated by the ‘parametric’ rate $k^{2}$ , the rate if labels are observed.

We claim no novelty regarding the algorithms—the MLE and spectral method—which we have adapted from existing work to the problem at hand. Their purpose is to verify that the lower bound is tight (both algorithms achieve it) under some mild conditions, and that there is no computational gap (the spectral method does so in polynomial time). Yet, we are not aware of other work providing uniform rates for SBM connectivity parameters for these or other algorithms, which constitutes another novelty of the paper.

Aspects of our proofs reflect the fact that graphon models constitute a specific type of mixture model, and estimation in mixtures can be difficult if mixture components are hard to distinguish; although no general theory of these phenomena seems to exist, we refer to the early work on estimation in finite mixture models by [19] and [6], and e.g. to [20] and [15] for more recent results.

6 Proofs of the lower bounds in SBMs

The proofs of Theorems 1 and 2 rely on variations of Le Cam’s ‘two-points’ method, which bounds the minimax risk from below by a quantity involving the $L_{1}$ distance between a distribution and a finite mixture. (Specifically, this is the ‘point versus mixture’ variant of the two points method, see e.g. [35].) This and other relevant technical lemmas are recalled in Section 6.3 below; the two points method is Lemma 3. For Theorem 2, for $k\geq 2$ classes, one main idea is to ‘isolate’ the part corresponding to the submatrix $Q^{\theta}$ . More detailed comments are given along the proof in Section 6.2 below.

Notation. Recall that a SBM with $k$ classes, proportions vector $\pi$ and connectivity matrix $M$ has distribution $P_{\pi,M}$ as given in (6). For an $n\times n$ symmetric matrix $A$ with zero diagonal, we write

[TABLE]

If $A$ is only given by $A_{i,j}$ for $i<j$ , one extends it by symmetry and sets $A_{i,i}=0$ . The distribution of a SBM in the fixed design case with given $k,M$ and labelling function $\varphi$ is hence $P_{A^{\varphi}}$ , where $A^{\varphi}_{i,j}=A^{\varphi}(M)_{i,j}=M_{\varphi(i)\varphi(j)}$ . In the random design case, if $\pi$ is the vector with equal proportions $e_{k}=[k^{-1},\ldots,k^{-1}]$ , then from (6),

[TABLE]

We generically denote universal constants by $C$ , where the value may change from line to line.

6.1 Two classes

Proof of Theorem 1.

Let $N=2^{n}$ and let $A_{1},\ldots,A_{N}$ be the collection of symmetric $n\times n$ matrices with general term $a_{ij}(\varphi)=a_{ij}(\theta,\varphi)=Q_{\varphi(i)\varphi(j)}^{\theta}$ , $i<j$ and zero diagonal, for all possible $\varphi\in[2]^{n}$ and some $\theta\in\Theta$ . Let $A_{0}$ be the $n\times n$ matrix with all elements equal to $1/2$ on the off-diagonal, that is the matrix with $\theta=0$ . By Lemma 3, applied with $\vartheta=0$ and $\theta=\theta_{n}$ small to be chosen below, in order to get a lower bound for the minimax risk, it is enough to bound the $L^{1}$ -distance $\|\mathbb{P}-\mathbb{Q}\|_{1}$ between

[TABLE]

If $\lambda_{1}=\text{Be}(q)$ , $\lambda_{2}=\text{Be}(r)$ , $\mu=\text{Be}(s)$ , a simple computation leads to

[TABLE]

By Lemma 4 applied to $\mathbb{P}$ and $\mathbb{Q}$ , where $\vartheta_{i,j}(\varphi,\psi)=(2a_{ij}(\varphi)-1)(2a_{ij}(\psi)-1)$ ,

[TABLE]

Note that $2a_{ij}(\varphi)-1=2\theta(-1)^{\mathds{1}\{\varphi(i)\neq\varphi(j)\}}$ for any $\varphi\in[2]^{n}$ . Denote $\eta_{i}=\mathds{1}\{\varphi(i)=1\}-\mathds{1}\{\varphi(i)=2\}$ and $\eta_{i}^{\prime}=\mathds{1}\{\psi(i)=1\}-\mathds{1}\{\psi(i)=2\}$ , for any index $i$ . We have $2a_{ij}(\varphi)-1=2\theta\eta_{i}\eta_{j}$ , so that $\vartheta_{i,j}(\varphi,\psi)=4\theta^{2}\eta_{i}\eta_{j}\eta_{i}^{\prime}\eta_{j}^{\prime}$ . The term under brackets in the last display can be interpreted as an expectation over $\varphi,\psi$ , where both variables are sampled uniformly from the set of all mappings from $\{1,\ldots,n\}$ to $\{1,2\}$ . Under this distribution, the variables $\eta_{i}$ for $i=1,\ldots,n$ are independent Rademacher, as well as the variables $\eta_{i}^{\prime}$ , and both samples are independent. Further note that the variables $R_{i}:=\eta_{i}\eta_{i}^{\prime}$ for $i=1,\ldots,n$ form again a sample of independent Rademacher variables. It is thus enough to bound

[TABLE]

where $E$ denotes expectation under the law of the $R_{i}$ . The previous exponent is an instance of Rademacher chaos; its Laplace transform can be bounded using Lemma 1. If $Z_{n}:=\sum_{i<j}R_{i}R_{j}$ , we have that for any $\varepsilon$ (say $\varepsilon=1/2$ ), there exists $\lambda>0$ such that for all $n\geq 2$ ,

[TABLE]

Choosing $n\theta_{n}^{2}:=1/(4\lambda)$ leads to $\|\mathbb{P}-\mathbb{Q}\|_{1}^{2}\leq\varepsilon=1/2$ , so that the minimax risk is bounded below by $(32n\lambda)^{-1}$ .
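Two elementary facts used in this argument can be confirmed by exhaustive enumeration for small $n$ : the products $R_{i}=\eta_{i}\eta_{i}^{\prime}$ are uniform on $\{-1,1\}^{n}$ (hence independent Rademacher), and the chaos $\sum_{i<j}R_{i}R_{j}$ is centred with second moment $n(n-1)/2$ . The sketch below is an illustration only; the function names are introduced here.

```python
from itertools import product
from collections import Counter

def product_sign_counts(n):
    """Count each vector (eta_i * eta'_i)_i over all pairs of sign vectors."""
    signs = list(product((-1, 1), repeat=n))
    return Counter(tuple(a * b for a, b in zip(eta, eta_p))
                   for eta in signs for eta_p in signs)

def chaos_moments(n):
    """Exact mean and second moment of Z_n = sum_{i<j} R_i R_j."""
    values = []
    for r in product((-1, 1), repeat=n):
        s = sum(r)
        # Identity: sum_{i<j} r_i r_j = (s^2 - n) / 2 when all r_i^2 = 1.
        values.append((s * s - n) // 2)
    mean = sum(values) / len(values)
    second = sum(v * v for v in values) / len(values)
    return mean, second

counts = product_sign_counts(3)
mean, second = chaos_moments(5)
print(sorted(set(counts.values())), mean, second)
```

Every sign pattern of the products occurs equally often, and for $n=5$ the second moment equals $5\cdot 4/2=10$ .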

To obtain the constants stated in the remark below the theorem, one uses Lemma 2 in the final step of the proof: with $\theta_{n}^{2}=1/(12s_{n})$ , $s_{n}^{2}=n(n-1)/2$ , and $r(\cdot)$ as in Lemma 2, this gives

[TABLE]

6.2 Lower bounds for $k$ classes

Here the problem is more delicate compared to $k=2$ , as the typical number of nodes per class now depends on $k$ , and, in the random design case, the data distribution for $\theta=0$ , around which we build the lower bound, is itself a mixture. As a first step, we start by establishing a result in a fixed design setting, that is

[TABLE]

Proof of (28).

Define $m=m_{k}=2\lfloor\frac{n}{k}\rfloor$ . Set $S_{1}=\{1,\ldots,m\}$ and $S_{2}=\{m+1,\ldots,n\}$ . Let $\varphi_{0}\in[k]^{n}$ be a mapping such that

[TABLE]

Let $\varphi\in[k]^{n}$ be such that

[TABLE]

and denote by $\mathcal{F}=\mathcal{F}(\varphi_{0},S_{1})$ the set of all such $\varphi$ ’s. Then the restriction $\varphi_{|S_{1}}=:\varphi_{1}$ of $\varphi\in\mathcal{F}$ to $S_{1}$ can be identified with an element of $[2]^{m}$ .

Let $M^{\theta}$ be the $k\times k$ matrix defined in (12). For $\varphi\in\mathcal{F}$ , let $R_{\varphi}$ denote the matrix with general term $r_{ij}=r_{ij}(\varphi)$ equal to $M^{\theta}_{\varphi(i)\varphi(j)}$ . There are as many such matrices as possible $\varphi_{1}$ ’s, that is $|[2]^{m}|=2^{m}$ . As $\varphi$ and $\varphi_{0}$ are identical by construction on $S_{2}$ ,

[TABLE]

where $\varphi_{1}$ belongs to $[2]^{m}$ . Next set, with $A_{0}$ the matrix with general term $m_{ij}(\varphi_{0})=M^{0}_{\varphi_{0}(i)\varphi_{0}(j)}$ ,

[TABLE]

Now we apply Lemma 4 to $\mathbb{P}^{\prime},\mathbb{Q}^{\prime}$ . Both $P_{A_{0}}$ and $P_{R_{\varphi}}$ are product measures over all pairs of indices $(i,j)$ with $1\leq i<j\leq n$ . By construction, the individual components of these products coincide as soon as either $i$ or $j$ does not belong to $S_{1}$ . We write

[TABLE]

where, for any indices $i,j$ with $i<j$ ,

[TABLE]

For ${\varphi,\psi\in[k]^{n}}$ , we set

[TABLE]

If $i$ or $j$ belongs to $S_{2}$ , then $r_{ij}(\varphi)=M^{0}_{\varphi_{0}(i)\varphi_{0}(j)}=A_{0}(i,j)=r_{ij}(\psi)$ by definition, in which case the last display equals [math]. We apply Lemma 4, where $\varphi,\psi$ play the role of the indices $k,l$ . Identifying $\psi_{|S_{1}}$ with the corresponding mapping $\psi_{1}\in[2]^{m}$ , we have $\|\mathbb{P}^{\prime}-\mathbb{Q}^{\prime}\|_{1}^{2}\leq\chi^{2}(\mathbb{Q}^{\prime},\mathbb{P}^{\prime})$ and

[TABLE]

The last expression coincides with the bound obtained in the proof of Theorem 1, with $n$ replaced by $m=m_{k}$ . As in that proof, there hence exist independent Rademacher variables $R_{1},\ldots,R_{m}$ such that $Z_{m}=m^{-1}\sum_{1\leq i<j\leq m}R_{i}R_{j}$ satisfies

[TABLE]

Provided $\theta$ is defined as, for $a$ a small enough constant,

[TABLE]

using Lemma 1 as in the proof of Theorem 1 leads to the bound $\|\mathbb{P}^{\prime}-\mathbb{Q}^{\prime}\|_{1}\leq 1/2$ if $\theta^{2}$ is a small enough multiple of $k/n$ , which again leads to a lower bound for the minimax risk of a positive constant times $k/n$ , which proves (28). ∎

Proof of Theorem 2.

For $e=e_{k}$ and $M^{\theta}$ as in (11)-(12), let

[TABLE]

and set $\mathbb{P}=\mathbb{Q}^{0}$ corresponding to $\theta=0$ . Our aim is to show that $\mathbb{Q}^{\theta}$ and $\mathbb{P}$ are close in the sense $\|\mathbb{Q}^{\theta}-\mathbb{P}\|_{1}\leq 1/2$ say, while $\theta$ is a fixed positive multiple of $\sqrt{k/n}$ . For a given $\varphi\in[k]^{n}$ , set

[TABLE]

By definition we have $S_{1}=S_{2}^{c}:=\{1,\ldots,n\}\setminus S_{2}$ and $|S_{1}|+|S_{2}|=n$ .

The proof has two steps. First, one shows that with high probability one can restrict to designs (i.e. specific mappings $\varphi$ ) such that there are around $2n/k$ nodes that have label either $1$ or $2$ . Second, we show that estimation with a random design is ‘harder’ than in the (easiest) typical fixed design case. This argument is reminiscent of ‘information processing inequalities’ encountered in information theory, although here a maximisation also takes place because the class labels are not known. It is then important to maximise only over designs obtained from Step 1, in order for the lower bound rate to be $k/n$ .

Step 1. One first shows that it is possible to restrict the sum in the definition of $\mathbb{Q}^{\theta}$ and $\mathbb{P}$ to $\varphi$ ’s in the set

[TABLE]

The reason is that the large majority of sets $S_{1}$ have a cardinality of order $n/k$ . The proportion of $\varphi$ ’s not in $\mathcal{A}_{n}$ among all possible $\varphi$ ’s is given by the probability of a binomial $Y\sim\text{Bin}(n,2/k)$ variable being farther than $n/k$ from its mean. By Bernstein’s inequality, as $v:=\text{Var}[Y]=n(2/k)(1-2/k)$ , for any $t>0$ ,

[TABLE]

Taking $t=n/k$ and setting $R_{n}:=|\mathcal{A}_{n}|$ , we have just shown that

[TABLE]

Now set

[TABLE]

By the triangle inequality,

[TABLE]

By Lemma 5, $\|\mathbb{Q}^{\theta}-\widetilde{\mathbb{Q}}^{\theta}\|_{1}+\|\widetilde{\mathbb{P}}-\mathbb{P}\|_{1}$ is bounded above by $4(1-R_{n}/k^{n})\leq 8e^{-\frac{n}{k}(5-\frac{8}{k})^{-1}}$ .
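The binomial tail controlled in Step 1 can be compared numerically with the exponential bound $2e^{-\frac{n}{k}(5-\frac{8}{k})^{-1}}$ that appears in the last display. The sketch below computes the exact tail probability for a few pairs $(n,k)$ chosen here for illustration; the function names are hypothetical.

```python
import math

def exact_tail(n, k):
    """P(|Y - 2n/k| > n/k) for Y ~ Bin(n, 2/k), computed exactly."""
    p, mean, t = 2 / k, 2 * n / k, n / k
    return sum(math.comb(n, y) * p**y * (1 - p) ** (n - y)
               for y in range(n + 1) if abs(y - mean) > t)

def bernstein_bound(n, k):
    """2 * exp(-(n/k) / (5 - 8/k)), the bound used in Step 1."""
    return 2 * math.exp(-(n / k) / (5 - 8 / k))

for n, k in [(60, 3), (120, 10), (500, 20)]:
    print(n, k, exact_tail(n, k), bernstein_bound(n, k))
```

The exact tail is typically far below the bound. In the worst case allowed by the theorem, $n/k=12$ , the quantity $8e^{-12/5}\approx 0.726$ is below $4/5$ , which is what the final step of the proof requires.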

Step 2. We now focus on bounding the middle term $\|\widetilde{\mathbb{Q}}^{\theta}-\widetilde{\mathbb{P}}\|_{1}$ . Let $\Sigma_{n}$ denote the collection of subsets $S_{2}$ of $\{1,2,\ldots,n\}$ with $\bigl||S_{2}|-(k-2)n/k\bigr|\leq n/k$ . For a given $S\in\Sigma_{n}$ , let $\varphi_{S}=\varphi_{|S}$ denote the restriction of $\varphi$ to $S$ . Below we use the notation $\sum_{\varphi_{S}}$ with the meaning that each term of the sum corresponds to a possible mapping $\varphi_{S}$ , that is a given collection of values $(\varphi(i))_{i\in S}\in\{1,\ldots,k\}^{S}$ .

To do so, we rewrite $\widetilde{\mathbb{Q}}^{\theta}$ and $\widetilde{\mathbb{P}}$ as ‘mixtures of mixtures’, by splitting the sum over $\varphi$ into a sum over $S_{2},\varphi_{S_{2}}$ and $\varphi_{S_{1}}$ given $S_{2}$ . Specifying $\varphi$ is equivalent to giving oneself $S_{2}$ (then $S_{1}=S_{2}^{c}$ ), $\varphi_{S_{2}}$ and $\varphi_{S_{2}^{c}}=\varphi_{S_{1}}$ . Denote

[TABLE]

For given $S_{2}$ and $\varphi_{S_{2}}$ , set

[TABLE]

where one sums over all possible mappings $\varphi_{S_{1}}$ , while $S_{2}$ and $\varphi_{S_{2}}$ are fixed. We have

[TABLE]

Note that the above measures are normalised to be probability measures. Indeed, given $S_{2}\in\Sigma_{n}$ , there are $2^{|S_{1}|}=2^{n-|S_{2}|}$ possible choices for $\varphi_{S_{1}}$ . As $\widetilde{\mathbb{Q}}^{\theta}$ is of total mass one, we have

[TABLE]

Using the triangle inequality, one can bound

[TABLE]

It is now sufficient to bound uniformly the above $L^{1}$ -distance. For simplicity, we denote

[TABLE]

where $\varphi_{2}=\varphi_{S_{2}}$ and $\varphi_{1}=\varphi_{S_{1}}$ , and $\varphi$ is the pair $(\varphi_{1},\varphi_{2})$ . Set

[TABLE]

Using the definition of $T_{\varphi,S_{2}}^{\theta}$ above,

[TABLE]

Combining this with the previous bounds one deduces that

[TABLE]

To conclude the proof, observe that the structure of the bound in the maximum in the last display is nearly identical to the quantities appearing in Equation (31) for the fixed-design case.

In the present case, we have a fixed mapping $\varphi:\{1,\ldots,n\}\to\{1,\ldots,k\}$ , with $\varphi_{1}=\varphi_{\,|\,S_{1}}$ and $\varphi_{2}=\varphi_{\,|\,S_{2}}$ , that plays the role of $\varphi_{0}$ in the fixed-design case. On the other hand, we have a collection of other mappings, say $\bar{\varphi}$ , that coincide with $\varphi$ on $S_{2}$ , that is $\bar{\varphi}_{2}=\bar{\varphi}_{\,|\,S_{2}}=\varphi_{\,|\,S_{2}}=\varphi_{2}$ , and that cover all possible cases for the image of $S_{1}$ , namely $\bar{\varphi}_{1}=\bar{\varphi}_{\,|\,S_{1}}=\varphi_{1}^{\prime}$ . The only difference to the fixed-design case is that $|S_{1}|$ belongs to $[n/k,3n/k]$ , instead of being exactly $2\lfloor n/k\rfloor$ , as specified in the definition of $\Sigma_{n}$ above. That is, denoting as above $Z_{m}=m^{-1}\sum_{1\leq i<j\leq m}R_{i}R_{j}$ , with $m_{1}=|S_{1}|$ ,

[TABLE]

This bound is uniform over $S_{2},\varphi_{2}$ . As $m_{1}\leq 3n/k$ , if one chooses $\theta^{2}\leq 1/(12\lambda n/k)$ , with $\lambda=\lambda(1+\varepsilon)$ the constant in Lemma 1, then this lemma implies that for any $m_{1}$ between $n/k$ and $3n/k$ , the $L^{1}$ -distance in the last display is bounded by $\varepsilon$ . Crucially, the constant $\lambda$ in Lemma 1 is independent of the number of terms in the Rademacher chaos. One deduces

[TABLE]

Choosing $n/k>12$ makes this bound smaller than $\varepsilon+4/5<1$ for $\varepsilon<1/5$ . ∎

6.3 Useful lemmas

Let $\{Z_{i},\ i\geq 1\}$ be i.i.d. Rademacher variables. For reals $x_{ij}$ and $N\geq 2$ , set

[TABLE]

Lemma 1 (Corollary 3.2.6 of de la Peña and Giné [13]).

Let $N\geq 2$ , and $Y=Y_{N}$ and $s(Y)$ as above. For every $c>1$ , there exists $\lambda=\lambda(c)>0$ independent of $N$ such that

[TABLE]

We repeatedly use Lemma 1 in the case where all $x_{ij}$ are equal to $1$ , for various values of $N$ . In such a setting, a reformulation is as follows. For any $c>1$ and $N\geq 2$ , one can find a constant $a=a(c)$ independent of $N$ such that

[TABLE]
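As a sanity check on the normalisation in the equal-weights case, the following minimal sketch computes the first two moments of the chaos exactly by enumerating all sign vectors. It assumes (the displays are not reproduced here) that $Y=\sum_{i<j}x_{ij}Z_{i}Z_{j}$ and $s(Y)^{2}=\sum_{i<j}x_{ij}^{2}$ , so that $s(Y)^{2}=N(N-1)/2$ when all weights equal $1$ ; the function name `chaos_moments` is illustrative only.

```python
from fractions import Fraction
from itertools import product


def chaos_moments(N):
    """Exact E[Y] and E[Y^2] for Y = sum_{i<j} Z_i Z_j, by enumerating all 2^N sign vectors."""
    total, total_sq = 0, 0
    for signs in product((-1, 1), repeat=N):
        s = sum(signs)
        # With unit weights: sum_{i<j} z_i z_j = ((sum_i z_i)^2 - N) / 2 (always an integer).
        y = (s * s - N) // 2
        total += y
        total_sq += y * y
    count = 2 ** N
    return Fraction(total, count), Fraction(total_sq, count)


for N in range(2, 11):
    mean, second_moment = chaos_moments(N)
    assert mean == 0
    # Assumed normalisation: s(Y)^2 = number of pairs i < j.
    assert second_moment == Fraction(N * (N - 1), 2)
```

The enumeration is exponential in $N$ and is meant only to confirm the second-moment identity for small $N$ ; it says nothing about the exponential-moment bound itself.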

Lemma 2 (Rademacher chaos with explicit constant).

Let $N\geq 2$ , and $Y=Y_{N}$ and $s(Y)$ as above. For any $0\leq\delta\leq 1$ ,

[TABLE]

The lemma applied with $\delta=1/3$ gives a bound $1.87$ for the right hand side.

Proof.

Theorem 3.2.2 in [13] gives, for any $k\geq 2$ ,

[TABLE]

For $k=1$ one has $E[|Y|]\leq E[Y^{2}]^{1/2}=s(Y)$ . From this one deduces that for any $0\leq\delta\leq 1$ ,

[TABLE]

and the result follows from an application of the nonasymptotic Stirling bound $k!\geq e^{-k}k^{k+\frac{1}{2}}\sqrt{2\pi}$ valid for $k\geq 1$ . ∎
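The nonasymptotic Stirling bound used in the last step can be checked numerically for a range of $k$ . The snippet below is only an illustration; the range $1\leq k\leq 100$ is an arbitrary choice that keeps all quantities within floating-point range.

```python
import math


def stirling_lower(k):
    """Right-hand side of k! >= e^{-k} k^{k + 1/2} sqrt(2 pi)."""
    return math.exp(-k) * k ** (k + 0.5) * math.sqrt(2 * math.pi)


for k in range(1, 101):
    assert math.factorial(k) >= stirling_lower(k)

# The ratio k! / bound decreases towards 1 as k grows, so the bound is tight.
ratios = [math.factorial(k) / stirling_lower(k) for k in (1, 10, 100)]
assert ratios[0] > ratios[1] > ratios[2] > 1.0
```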

Lemma 3 (Le Cam’s method ‘point versus mixture’).

Let $\mathcal{P}=\{P_{M},\ M\in\mathcal{M}\}$ be a collection of probability measures indexed by an arbitrary set $\mathcal{M}=\{M_{0},M_{1},\ldots,M_{N}\}$ , $N>1$ . Set

[TABLE]

If $\psi$ is a real-valued functional such that $\psi(P_{M_{0}})=\tau$ and $\psi(P_{M_{i}})=\theta$ for any $i=1,\ldots,N$ , then

[TABLE]

where the infimum is over all estimators $\hat{\psi}(X)$ of $\psi(P_{M})$ based on the observation of $X\sim P_{M}$ .

Proof.

This is a standard variation on the case where $N=1$ stated in e.g. [35]. ∎

Lemma 4 (Bound on total variation distance).

For $n\geq 1$ , let $P_{1},\ldots,P_{n}$ and $Q_{1}(k),\ldots,Q_{n}(k)$ for $1\leq k\leq N$ , for some $N\geq 1$ , be probability measures. Set

[TABLE]

Suppose that for any $i$ , $Q_{i}(k)$ has density $1+\Delta_{i}(k)$ with respect to $P_{i}$ . Denote $\vartheta_{i}(k,l)=P_{i}\Delta_{i}(k)\Delta_{i}(l)$ . Then, for $\chi^{2}(\mathbb{Q},\mathbb{P})=\int(d\mathbb{Q}/d\mathbb{P}-1)^{2}d\mathbb{P}$ ,

[TABLE]

Proof.

The first bound on distances is standard, while the second bound follows from elementary calculations. ∎

Lemma 5.

Let $N,R$ be two integers with $N\geq 2$ , $1\leq R\leq N$ , and $(P_{i})_{i\in I}$ be an arbitrary collection of probability measures with $|I|=N$ . If $J\subset I$ and $|J|=R$ , we have

[TABLE]

Proof.

The result follows by splitting the sum over $I$ into sums over $J$ and $I\setminus J$ , applying the triangle inequality and using the fact that $\|\sum_{j\in J}P_{j}\|_{1}=|J|$ . ∎

7 Proofs for results on graphon functionals

To prove Theorem 6, we observe that polynomial graphons of bounded degree include the graphon $w_{\theta}$ in (26). The proof approximates this smooth graphon $w_{\theta}$ by a piecewise constant graphon, and then uses a lower bound for such piecewise constant graphons. We prove this lower bound, Lemma 6, first. Similar to the SBM case, this builds on Le Cam’s point versus mixture method. We then proceed to prove Theorem 6.

7.1 Auxiliary lower bound

Assume the function $w$ is piecewise constant, with different values taken along blocks corresponding to a regular partition of $[0,1]^{2}$ in $k\times k=k^{2}$ blocks, and $k$ an even integer $k=2l$ , with $l\geq 1$ . That defines a law of the form

[TABLE]

where $\varphi$ is an element of $[k]^{n}$ and $Q=Q^{\theta}$ a given $k\times k$ matrix defined below. In the next statement and proof, $E_{\theta}$ denotes the expectation under this distribution. Denote by $O_{k}$ the $k\times k$ matrix with only ones as coefficients,

[TABLE]

and, for a symmetric $l\times l$ matrix $A$ with coefficients $A_{ij}\in[0,1]$ , define the $k\times k=(2l)\times(2l)$ matrix

[TABLE]

We define $Q=Q^{\theta}$ as the $k\times k=(2l)\times(2l)$ matrix

[TABLE]

Lemma 6.

Let $k=2l$ be an even integer and $A$ an arbitrary symmetric $l\times l$ matrix. Let $Q=Q^{\theta}$ be the matrix defined in (33). There exists a constant $c_{3}>0$ such that

[TABLE]

where the infimum is over all estimators of $\theta$ valid under $E_{\theta}=E_{P_{e_{k},Q^{\theta}}}$ .

Proof of Lemma 6.

Let $\mathbb{Q}=P_{e_{k},Q^{\theta}}$ be as above. That is,

[TABLE]

with $\{Z_{\varphi}^{\theta}\}$ the matrix of general term $z_{ij}(\varphi,\theta)=Q^{\theta}_{\varphi(i)\varphi(j)}$ , for $\varphi$ ranging over the set $[k]^{n}$ . Let $\mathbb{P}$ denote the Erdős–Rényi $ER(1/2)$ distribution over $n$ nodes, which also corresponds to $P_{e_{k},Q^{\theta}}$ for $\theta=0$ . Consider the functional $\psi$ defined as

[TABLE]

By definition, for any $\varphi\in[k]^{n}$ , we have $\psi(\mathbb{P})=\psi(P_{e_{k},Q^{0}})=0$ and $\psi(\mathbb{Q})=\psi(P_{e_{k},Q^{\theta}})=\theta$ . The same computation as in the proof of Theorem 1 now shows that, for $B$ given in the display below,

[TABLE]

The last term in the bound can be interpreted as an expectation over $\varphi,\psi$ , where both variables are sampled uniformly from the set of all mappings from $\{1,\ldots,n\}$ to $\{1,\ldots,k\}$ . Recall that $l=k/2$ and for any integer $s$ , denote by $[s]_{l}$ the integer in $\{1,\ldots,l\}$ that equals $s$ modulo $l$ , plus $1$ . The variable $B_{\varphi(i)\varphi(j)}$ can be written

[TABLE]

When $\varphi$ follows the uniform distribution over $[k]^{n}$ , the variables $(\varphi(i))_{i}$ are independent and are marginally uniform over $[k]$ . Also, the variables $((-1)^{\varphi(i)>l})_{i}$ and $([\varphi(i)]_{l})_{i}$ are independent under the uniform distribution for $\varphi$ as we show next. If Pr denotes the corresponding distribution, then for any ${i\leq n}$ and any $s\leq l$

[TABLE]

Note the identity holds both for $k\leq n$ and $k>n$ . Set $R_{i}:=(-1)^{\varphi(i)>l}$ and $a_{ij}:=A_{[\varphi(i)]_{l}[\varphi(j)]_{l}}$ . Deduce from the previous reasoning that the variables $(R_{i})_{i}$ and $(a_{ij})_{i<j}$ are independent. Now, denoting by $E$ the expectation under Pr,

[TABLE]
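The independence of the sign $(-1)^{\varphi(i)>l}$ and the residue $[\varphi(i)]_{l}$ under a uniform label can be confirmed by direct enumeration. The sketch below assumes the convention $[s]_{l}=((s-1)\bmod l)+1$ , which is one reading of the definition above; the helper names are illustrative.

```python
from collections import Counter
from fractions import Fraction


def sign_and_residue(s, l):
    """Map a label s in {1, ..., 2l} to (sign, residue) with residue in {1, ..., l}."""
    sign = -1 if s > l else 1
    residue = (s - 1) % l + 1  # assumed convention for [s]_l
    return sign, residue


def is_independent(l):
    """Check P(sign, residue) = P(sign) * P(residue) when s is uniform on {1, ..., 2l}."""
    k = 2 * l
    joint = Counter(sign_and_residue(s, l) for s in range(1, k + 1))
    for sign in (-1, 1):
        for r in range(1, l + 1):
            # Marginals: P(sign) = 1/2 and P(residue = r) = 1/l.
            if Fraction(joint[(sign, r)], k) != Fraction(1, 2) * Fraction(1, l):
                return False
    return True


assert all(is_independent(l) for l in range(1, 20))
```

Each pair (sign, residue) is hit by exactly one label, which is why the joint probability factorises as $1/(2l)=(1/2)(1/l)$ .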

As $(R_{i})_{i}$ and $(a_{ij})_{i<j}$ are independent, one can compute the inner expectation in the last display under the distribution of $(R_{i})_{i}$ , the $a_{ij}$ ’s being fixed. The $(R_{i})_{i}$ form a sample of independent Rademacher variables, hence

[TABLE]

is a Rademacher chaos of order $2$ with weights $(a_{ij})$ . Suppose the matrix $A$ is not identically zero (otherwise the bound below holds trivially). By Lemma 1, for any $c>1$ one can find $\lambda>0$ with

[TABLE]

Choose $c=3/2$ . By definition, all $a_{ij}$ ’s are bounded by $1$ . There is hence a $\lambda>0$ such that, if $\theta^{2}=2/(\lambda n)$ ,

[TABLE]

The result now follows from an application of Lemma 3 to the functional $\psi$ . ∎

7.2 Proof of the theorem

Proof of Theorem 6.

Let us recall the definition, for any $0\leq\theta\leq 1$ , of the function $w=w_{\theta}$ in (26)

[TABLE]

and let ${\langle}w_{\theta}{\rangle}$ be its graphon equivalence class. By definition, ${\langle}w_{\theta}{\rangle}$ belongs to $\mathcal{P}$ . One has

[TABLE]

for some constant $c>0$ , so that $\vartheta({\langle}w_{\theta}{\rangle})-\vartheta({\langle}w_{0}{\rangle})=c\theta$ . The function $w_{0}$ is the constant $1/2$ , and the density of the data distribution $P_{{\langle}w{\rangle}}$ with respect to counting measure on $\{0,1\}^{n(n-1)/2}$ is

[TABLE]

where, for any $z$ in $[0,1]$ and $x_{ij}$ in $\{0,1\}$ , we have set

[TABLE]

Next one shows that $P_{{\langle}w{\rangle}}$ is close in the total variation sense to a discrete mixture of the previous Bernoulli-probability distributions, provided the number of points in the mixture is suitably large. To do so, we approximate the function $P_{n}$ defined by

[TABLE]

by a piecewise constant function $h_{N,\theta}=h_{N}$ , where $[0,1]^{n}$ is split into $N^{n}$ blocks, $N\geq 1$ , using a regular grid of $[0,1]^{n}$ with points $(i_{1}/N,\ldots,i_{n}/N)$ and $0\leq i_{j}\leq N$ for all $j$ . Concretely, one replaces $w(u_{i},u_{j})$ by, say, the value of $w$ at the middle of the block to which the point $(u_{i},u_{j})$ belongs. This defines a function

[TABLE]

where $\bar{w}$ is constant on every block of the subdivision. Let $Q_{w}^{N}$ denote the corresponding measure, with density

[TABLE]

Taking $w=w_{\theta}$ as above, the function $P_{n}$ is a polynomial in $u_{1},\ldots,u_{n}$ , and its degree with respect to each variable $u_{i}$ is $n-1$ . The partial derivatives of $P_{n}$ can be computed, and each of them can be seen to be bounded by ${n-1}$ : For each variable, only $n-1$ non-zero terms appear when evaluating the partial derivative, and each term is uniformly bounded by $1$ . Consequently, if $(u_{1},\ldots,u_{n})$ and $(u_{1}^{\prime},\ldots,u_{n}^{\prime})$ belong to the same block,

[TABLE]

For $w=w_{\theta}$ as above, we can thus bound the total variation distance as

[TABLE]

Each probability measure $Q_{w}^{N}$ is a mixture of $N^{n}$ distributions, each of which in turn corresponds to a block in the subdivision of $[0,1]^{n}$ . One can rewrite

[TABLE]

where the matrix $M=(M_{pq})_{1\leq p,q\leq N}$ is the symmetric matrix with terms

[TABLE]
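The midpoint discretisation can be sketched as follows. This is an illustration under stated assumptions: the graphon `w_example` is a hypothetical smooth graphon chosen for the example and is not the $w_{\theta}$ of (26), and the entries of $M$ are taken to be the values of $w$ at block midpoints, as the construction above suggests.

```python
def midpoint_matrix(w, N):
    """N x N matrix of values of w at the midpoints of a regular N x N partition of [0,1]^2."""
    mids = [(p + 0.5) / N for p in range(N)]
    return [[w(x, y) for y in mids] for x in mids]


def block_of(u, N):
    """Index of the block of the regular partition of [0,1] containing u."""
    return min(int(u * N), N - 1)


def step_graphon(M):
    """Piecewise constant graphon taking the value M[p][q] on block (p, q)."""
    N = len(M)
    return lambda x, y: M[block_of(x, N)][block_of(y, N)]


def w_example(x, y, theta=0.1):
    # Hypothetical smooth graphon with values in (0, 1); partial derivatives bounded by 2 * theta.
    return 0.5 + theta * ((2 * x - 1) * (2 * y - 1))


N = 8
M = midpoint_matrix(w_example, N)
w_bar = step_graphon(M)
grid = [i / 100 for i in range(101)]
max_err = max(abs(w_example(x, y) - w_bar(x, y)) for x in grid for y in grid)
# Each coordinate moves by at most 1/(2N), so the error is at most 2 * (2 * theta) / (2N).
assert max_err <= 2 * 0.1 / N
```

The approximation error decays like $1/N$ for this example, which is the mechanism that allows $N$ to be taken polynomially large in $n$ in the argument.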

If $N$ is even, which one can assume without loss of generality, the matrix $M$ is exactly of the same form as $Q$ in (33), with elements in $(0,1)$ , so one can use the bound in $\|\cdot\|_{1}$ -distance between measures obtained in the proof of Lemma 6. Note that the argument remains valid even if the number of classes exceeds the number of observations $n$ , which will be of importance below. For a small constant $c$ and $\theta^{2}=\kappa/n$ , we obtain

[TABLE]

for $\kappa$ sufficiently small. Choosing $N=Cn^{4}$ , for $C>0$ large enough, leads to

[TABLE]

An application of Lemma 3 with the functional $\psi(P_{{\langle}w{\rangle}}):=\psi({\langle}w{\rangle})$ concludes the proof of the lower bound in Theorem 6 in the case where $\psi(\cdot)=\vartheta(\cdot)$ . The lower bound for a general $\psi$ follows by the same proof, noting that the specific form of the functional only comes in through the difference $\psi({\langle}w_{\theta}{\rangle})-\psi({\langle}w_{0}{\rangle})$ , which behaves as for $\vartheta(\cdot)$ by assumption.

For the upper bound, we first link the squared distance to the truth for the functional to the squared $L^{2}$ -distance of corresponding graphons. Let $w,w_{1}$ be two fixed graphon functions, and suppose that at least one of these is nonconstant (almost everywhere), which means that either $\vartheta({\langle}w{\rangle})>0$ or $\vartheta({\langle}w_{1}{\rangle})>0$ . Then, writing simply $\int$ to denote the double integral on $[0,1]^{2}$ ,

[TABLE]

where the denominator is nonzero by assumption on $w,w_{1}$ ; we henceforth denote it $c$ . Then

[TABLE]

The two factors in brackets are bounded as follows: For the second term, apply the inequality $(a+b)^{2}\leq 2a^{2}+2b^{2}$ , followed by $\sqrt{u+v}\leq\sqrt{u}+\sqrt{v}$ , which yields

[TABLE]

For the first term, use $0\leq\int(g-\int g)^{2}\leq\int g^{2}$ for a bounded measurable $g$ , as one integrates over $[0,1]^{2}$ . That yields

[TABLE]

and this inequality clearly still holds true in case $\vartheta({\langle}w{\rangle})=\vartheta({\langle}w_{1}{\rangle})=0$ . One concludes that

[TABLE]

where $\mathcal{T}$ is the set of all measure-preserving bijections of $[0,1]$ . Indeed, the previous inequalities hold true for any choice of representer of the graphon $w_{1}$ , so one can take the infimum over $\mathcal{T}$ in the previous bounds. By Corollary 3.6 of [23], for data $X$ generated from $P_{w}$ , there exists an estimator $\hat{w}=\hat{w}(X)$ that satisfies $E_{P_{w}}[\delta^{2}(\hat{w},w)]\leq C(\log{n}/n)$ . Since ${\langle}w{\rangle}$ has a representer that belongs to $\mathcal{P}_{B}$ by assumption, it belongs in particular to the Hölder class $\Sigma(1,L)$ , provided $L$ is chosen large enough. For the plug-in estimator $\hat{\vartheta}(X):=\vartheta(\hat{w})$ , combining the previous result with the last display implies

[TABLE]

for $C$ large enough depending only on $\mathcal{P}_{B}$ , which concludes the proof. ∎

Acknowledgements

I. C. is very grateful for the hospitality of Columbia’s statistics department, where parts of this work were carried out. I. C.’s work is supported by ANR-17-CE40-0001 (BASICS).

Appendix A Upper bounds computable in polynomial-time

This section generalizes the polynomial-time estimate in Section 4.2 to ${k\geq 2}$ classes, by combining the spectral clustering method of Lei and Rinaldo [25] with a refinement due to Lei and Zhu [26]. The latter is based on a sample splitting, and under appropriate conditions on the connectivity matrix recovers the labels exactly, with high probability. Theorem 7 below shows that, under additional conditions, this polynomial-time estimator achieves the minimax rate.

A.1 Spectral estimation for ${k\geq 2}$ classes

Recall the assumed form of the connectivity matrix $M^{\theta}$ in (12). The conditions of the next results are in terms of an ‘aggregated’ $(k-1)\times(k-1)$ matrix $N$ obtained from $M^{\theta}$ by merging the first and second row/columns when $\theta=0$ , that is

[TABLE]

Recall that $\varphi$ denotes the true labelling map. Define a labelling $\psi:[n]\to[k-1]$ by $\psi(v)=1$ if $\varphi(v)\in\{1,2\}$ and $\psi(v)=\varphi(v)-1$ if $\varphi(v)\in\{3,\ldots,k\}$ . That is, we ‘aggregate’ nodes of label $1$ or $2$ in one class and renumber the remaining labels so that the label set is now $[k-1]$ . Following [26], we write $g_{v}=\psi(v)$ for the true (aggregated) label of node $v\in[n]$ and $\mathcal{I}^{(l)}=\{v\in[n]:\ g_{v}=l\}$ .
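The aggregation of labels can be written out in a few lines. The sketch below is purely illustrative: it represents the labelling as a dictionary from nodes to labels in $\{1,\ldots,k\}$ , and the function names are not taken from [26].

```python
def aggregate_label(label):
    """Merge original labels 1 and 2 into class 1 and shift the remaining labels down by one."""
    return 1 if label in (1, 2) else label - 1


def aggregated_classes(phi, k):
    """Return the sets I^(l) = {v : g_v = l} for l = 1, ..., k-1, given phi: node -> label."""
    classes = {l: set() for l in range(1, k)}
    for node, label in phi.items():
        classes[aggregate_label(label)].add(node)
    return classes


phi = {0: 1, 1: 2, 2: 3, 3: 5}  # toy labelling with k = 5 original classes
classes = aggregated_classes(phi, 5)
assert classes == {1: {0, 1}, 2: {2}, 3: set(), 4: {3}}
```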

The algorithm Spec- $\theta$ specified in the frame below has three steps. First, one runs the exact label recovery algorithm V-Clust of Lei and Zhu [26] for $K=k-1$ classes. Under some conditions on the matrix $N$ , see (A1)–(A2) below, this finds the ‘aggregate’ labels $\psi$ above up to label permutation with high probability. Then the aim is to recover the aggregated class with original labels $1$ and $2$ . Due to the label switching issue, this requires an extra condition on $N$ . For simplicity (see also comments below) we assume in (A3) that the diagonal terms $b_{ii}$ are separated from $1/2$ , which makes it possible to estimate the aggregated class label $1$ by comparing diagonal empirical connectivities to $1/2$ . Finally, in a third step one can run the spectral algorithm $\mathcal{S}_{2}$ from Section 3.1 on the nodes found at the previous step.

Algorithm: Spectral method for estimation of $\theta$ (Spec- $\theta$ )

Input: adjacency matrix $X$ (where we set $X_{ii}=0$ ), number of classes $k$

Subroutines: V-Clust (Lei-Zhu), Initial community recovery $\mathcal{S}$ (Lei-Rinaldo), Spectral algorithm $\mathcal{S}_{2}$ for $k=2$ (Section 3.1)

Apply V-Clust on adjacency matrix $X$ using $k-1$ classes, $\mathcal{S}$ and $V=2$

$\hat{g}=\texttt{V-Clust}(X,k-1,V,\mathcal{S}).$

Set $\hat{\mathcal{I}}^{(1)}=\{v\in[n]:\ \hat{g}_{v}=\hat{\ell}\,\}$ , where

$\hat{\ell}\,=\,\underset{l\in[k-1]}{\text{argmin}}\ \Bigg|\,\frac{1}{\binom{|\hat{g}^{-1}(l)|}{2}}\sum_{i<j,\,i,j\in\hat{g}^{-1}(l)}X_{ij}\,-\,\frac{1}{2}\,\Bigg|$

Run spectral algorithm $\mathcal{S}_{2}$ for $k=2$ on corresponding nodes and set

$\hat{\theta}=\mathcal{S}_{2}(X^{\hat{\mathcal{I}}^{(1)}}),$

where $X^{\hat{\mathcal{I}}^{(1)}}$ is the induced adjacency matrix over nodes in $\hat{\mathcal{I}}^{(1)}$ .
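The second step of Spec- $\theta$ (selecting the estimated cluster whose within-cluster edge density is closest to $1/2$ ) can be sketched as below. This is a minimal illustration and not the authors' implementation: the estimated labels `g_hat` are assumed to be given (for instance as the output of V-Clust, which is not implemented here), clusters with fewer than two nodes are skipped, and all function names are hypothetical.

```python
from itertools import combinations


def within_density(X, nodes):
    """Fraction of present edges among pairs i < j of the given nodes (None if no pair exists)."""
    pairs = list(combinations(sorted(nodes), 2))
    if not pairs:
        return None
    return sum(X[i][j] for i, j in pairs) / len(pairs)


def select_cluster(X, g_hat, num_clusters):
    """Return (label, nodes) of the cluster whose within-cluster density is closest to 1/2."""
    best_label, best_gap = None, None
    for l in range(1, num_clusters + 1):
        nodes = [v for v, g in enumerate(g_hat) if g == l]
        density = within_density(X, nodes)
        if density is None:
            continue  # a cluster with fewer than two nodes has no empirical density
        gap = abs(density - 0.5)
        if best_gap is None or gap < best_gap:
            best_label, best_gap = l, gap
    return best_label, [v for v, g in enumerate(g_hat) if g == best_label]


# Toy example: cluster 1 has density 3/6 = 1/2, cluster 2 has density 1.
X = [[0] * 6 for _ in range(6)]
for i, j in [(0, 1), (1, 2), (2, 3), (4, 5)]:
    X[i][j] = X[j][i] = 1
g_hat = [1, 1, 1, 1, 2, 2]
label, nodes = select_cluster(X, g_hat, 2)
assert label == 1 and nodes == [0, 1, 2, 3]
```

The third step would then apply the two-class algorithm to the adjacency matrix restricted to `nodes`.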

We set $K=k-1$ and assume that, for a large enough universal constant $C$ :

(A1)

$N$ is full rank and any two rows of $N$ are separated by at least $\gamma=\gamma(K)>0$ in $\ell_{2}$ -norm.

(A2)

For $\lambda=\lambda(K)$ the smallest absolute eigenvalue of $N$ ,

[TABLE]

(A3)

For all $i\in\{1,\ldots,k-2\}$ ,

[TABLE]

where $\kappa=\kappa(K)\geq C\sqrt{K(\log{n})/n}$ .

Comments on (A1)–(A3) follow below. For a version for sparse graphs, see Appendix B.

Theorem 7.

In the fixed design SBM model with $k$ classes, under the assumptions (A1)–(A3), let us set, for $c$ a small enough universal constant and $K=k-1$ ,

[TABLE]

Then the obtained $\hat{\theta}$ from algorithm Spec- $\theta$ satisfies, for $C_{3}$ a large enough constant,

[TABLE]

Proof.

This is a special case of Theorem 9, in Appendix B below. ∎

The algorithm Spec- $\theta$ , unlike the likelihood method considered below, only uses the fact that the connectivity matrix is of the form $M^{\theta}$ , but does not use specific knowledge of the vector $a$ and matrix $B$ to compute $\theta$ .

Comments on the assumptions. Conditions (A1) and (A2) are typical for spectral methods; their specific form is that assumed by Lei and Zhu [26], with the initial recovery algorithm being that of Lei and Rinaldo [25]. If $K$ is fixed independently of $n$ , then (A2) follows from (A1) if $n$ is large enough. Condition (A3) is specific to our problem, and assumed in this form only for simplicity of exposition: To identify the special cluster arising from the $1/2$ coefficient in the matrix $N$ (Step 2 in Spec- $\theta$ ), some identifiability condition is needed, because even the refined spectral clustering algorithm of [26] can only recover the original labels up to a permutation. Condition (A3) is similar in spirit to condition (21), but weaker. It can be replaced with any other condition that ensures cluster $1$ can be identified from a noisy, permuted version of $N$ (with noise amplitude going to zero fast, as $k/n$ ). Note that, if $k$ is fixed and $n$ large enough, (A3) simply requires the diagonal terms of $B$ to differ from $1/2$ .

Finally, a comment on $T_{K}$ in (34). The label recovery in Steps 1–2 is run with $k-1$ classes, and hence joins two of the $k$ classes in the sample. The restriction on the range of $\theta$ ensures the classes joined are the first two, with high probability. Indeed, here we are interested in the situation where $\theta$ may be small, which makes identification of labels difficult, and the rate slow; if $\theta$ is large, the problem becomes easier. Again, note that if $k$ is fixed, the condition simply requires that $|\theta|$ is smaller than a given constant.

A.2 Simulation study

Those estimators described above that are computationally feasible—the spectral and sample splitting estimators for $k=2$ , and the Spec- $\theta$ estimator for $k>2$ —can be tested in simulation: Draw $n$ vertices from a stochastic block model as in (13) with a given value of $\theta$ , compute the respective estimate, and report the empirical quadratic risk. Figure 1 shows how the risk develops as a function of sample size for different values of $\theta$ , for the two-community model (9). For $k>2$ communities, the model is given by the connectivity matrix (12). Simulation results for $k=5$ , with ${a_{1}=\frac{1}{12}}$ , ${a_{2}=\frac{11}{12}}$ and ${a_{3}=1}$ , are shown in Figure 2. As is visible in Figures 1 and 2, smaller values of $\theta$ correspond overall to a larger risk, and a much slower decay of the empirical risk curves. This illustrates our theoretical finding that there exists a range of parameters corresponding to two classes that become close where estimation is much slower.

Appendix B Extension to sparse graphs

So far, we have for simplicity considered dense graphs, in the sense that at least some elements of the connectivity matrix (e.g. $1/2+\theta$ or $1/2-\theta$ ) are bounded away from zero.

B.1 Two classes

An $\alpha_{n}$ -sparse SBM model is generally defined as one in which the connectivity matrix $M$ can be written, for $\alpha_{n}$ a sequence going to $0$ with $n$ , as $M=\alpha_{n}M_{0}$ , for $M_{0}$ a nonnegative symmetric matrix with maximum entry $1$ [e.g. 8, 25]. Here, we assume that the connectivity matrix is $M^{\theta}(\alpha_{n})$ with

[TABLE]

and $M^{\theta}$ as in (12). Then the largest coefficient of $M^{\theta}(\alpha_{n})$ is between $\alpha_{n}/2$ and $\alpha_{n}$ , as the coefficients of the upper $2\times 2$ block are $\alpha_{n}(1/2\pm\theta)$ . We also set, for $\theta\in[-1/2,1/2]$ ,

[TABLE]

In constructing upper bounds below, we assume that for $C_{s}$ a large enough constant,

[TABLE]

as up to a constant $\log{n}/n$ is the typical boundary between the moderately sparse and very sparse situations, the latter requiring different tools, see [25]. For simplicity we also assume that $\alpha_{n}$ is known for the upper-bound results.

Theorem 8.

Consider a stochastic blockmodel (3) with $k=2$ specified by $P_{\theta}=P_{e,Q^{\theta}(\alpha_{n})}$ with $e,Q^{\theta}(\alpha_{n})$ given by (8)-(36). There exists a constant $c_{1}>0$ such that for all $n\geq 2$ ,

[TABLE]

where the infimum is taken over all estimators $T$ of $\theta$ in the model $\mathcal{M}$ . Furthermore, if $\Delta_{n}=X-\alpha_{n}J/2$ , and $\lambda_{1}^{a}(\Delta_{n})$ is the largest absolute eigenvalue of $\Delta_{n}$ , set $\tilde{\theta}:=\lambda_{1}^{a}(\Delta_{n})/\{(n-1)\alpha_{n}\}$ . Then, under (B0), for some constant $C>0$ and $n\geq 2$ ,

[TABLE]
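The spectral estimate $\tilde{\theta}$ of Theorem 8 can be illustrated with a short, dependency-free sketch. Several choices below are assumptions made for the illustration and are not taken from the text: within-class and between-class connection probabilities are taken to be $\alpha_{n}(1/2+\theta)$ and $\alpha_{n}(1/2-\theta)$ respectively, the diagonal of $\Delta_{n}$ is set to zero, and the largest absolute eigenvalue is computed by plain power iteration. In the noiseless case, where $X$ is replaced by its mean, $\Delta_{n}$ is $\alpha_{n}\theta$ times a rank-one sign matrix with its diagonal removed, whose largest absolute eigenvalue is $\alpha_{n}|\theta|(n-1)$ , so the estimate returns $|\theta|$ exactly.

```python
import math
import random


def largest_abs_eigenvalue(A, iters=200):
    """Power iteration: ||A v|| for unit v converges to the largest absolute eigenvalue
    of a symmetric matrix A (the start vector is an arbitrary deterministic choice)."""
    n = len(A)
    v = [float(i + 1) for i in range(n)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    lam = 0.0
    for _ in range(iters):
        w = [sum(a * x for a, x in zip(row, v)) for row in A]
        lam = math.sqrt(sum(x * x for x in w))
        if lam == 0.0:
            return 0.0
        v = [x / lam for x in w]
    return lam


def spectral_theta(X, alpha):
    """theta_tilde = lambda_1^a(Delta_n) / ((n - 1) alpha), with Delta_n = X - alpha/2 off the diagonal."""
    n = len(X)
    delta = [[0.0 if i == j else X[i][j] - alpha / 2 for j in range(n)] for i in range(n)]
    return largest_abs_eigenvalue(delta) / ((n - 1) * alpha)


def mean_matrix(labels, theta, alpha):
    """Assumed two-class mean matrix: alpha(1/2 + theta) within classes, alpha(1/2 - theta) between."""
    n = len(labels)
    return [[0.0 if i == j else alpha * (0.5 + (theta if labels[i] == labels[j] else -theta))
             for j in range(n)] for i in range(n)]


def sample_graph(labels, theta, alpha, rng):
    """Draw a symmetric 0/1 adjacency matrix with independent edges and the mean above."""
    P = mean_matrix(labels, theta, alpha)
    n = len(labels)
    X = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            X[i][j] = X[j][i] = 1 if rng.random() < P[i][j] else 0
    return X


# Noiseless check: the estimate recovers |theta| exactly (up to rounding).
labels = [0] * 5 + [1] * 5
assert abs(spectral_theta(mean_matrix(labels, 0.2, 0.5), 0.5) - 0.2) < 1e-9
```

On sampled data one would call, for instance, `spectral_theta(sample_graph(labels, theta, alpha, random.Random(0)), alpha)` and average the squared error over repetitions to obtain an empirical quadratic risk. Note that this sketch only returns a nonnegative value, so it estimates the magnitude of $\theta$ .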

B.2 $k\geq 2$ classes

The case of $k$ classes carries over to the sparse situation as follows. The lower bound result is only modified by a scaling factor $1/\alpha_{n}$ . For upper bounds, considering the more easily computable spectral algorithm Spec- $\theta$ only, Assumption (A2) is replaced by (B2) below, where $N$ has the same definition as in Appendix A.

(B2)

For $\lambda=\lambda(K)$ the smallest absolute eigenvalue of $N$ , there exists $C>0$ such that

[TABLE]

Theorem 9.

Consider a stochastic blockmodel (3) with $k\geq 2$ classes specified by $\mathcal{M}_{k}$ in (13), that is $P_{\theta}=P_{e_{k},M^{\theta}}$ with $e_{k},M^{\theta}$ given by (11)–(35), for fixed matrices $A,B$ with arbitrary coefficients. There exists a constant $c_{3}=c_{3}(\rho)>0$ , independent of $A,B$ , such that, for all $n\geq 12k$ ,

[TABLE]

where the infimum is taken over all estimators $T$ of $\theta$ in the model $\mathcal{M}_{k}$ . Let $\mathcal{S}_{2,\alpha_{n}}$ be the algorithm for $k=2$ classes in the sparse case described in Theorem 8. Consider the fixed-design setting and suppose $(B0),(B2),(A1)$ and $(A3)$ are satisfied. Then the algorithm Spec- $\theta$ used with subroutine $\mathcal{S}_{2,\alpha_{n}}$ outputs an estimator $\hat{\theta}$ that satisfies, for $T_{K}$ as in (34),

[TABLE]

Similar comments as for Theorems 2–7 can be made. Also, in the case that $k$ does not grow with $n$ , then (B2) follows from (B0) for $n$ larger than a fixed constant. The proof of the lower bound in Theorem 9 is similar to that of Theorem 2 using the normalisation as in the proof of Theorem 8 and is omitted. The upper bound result includes that of Theorem 7 and is proved in Appendix D.

Appendix C Remaining proofs: likelihood-based upper bounds

The proof for ${k=2}$ below analyzes the least-squares criterion directly. For ${k\geq 2}$ classes, we ‘isolate’ the part corresponding to the $2\times 2$ submodel, by controlling the number of errors in recovering the labels of the corresponding $2$ classes. We then invoke the result for the case $k=2$ .

C.1 Interpretation as a pseudo-likelihood

We first justify the interpretation of the estimator $\hat{\theta}$ in (15) as a maximum (pseudo-)likelihood estimate. In the fixed design model, suppose the data is Gaussian $\mathcal{N}(\theta_{ij},1)$ instead of Bernoulli Be $(\theta_{ij})$ . This suggests defining a (pseudo-)log-likelihood $\ell_{n}(\sigma,\theta)$ as follows, with $c_{n}={n\choose 2}\log(2\pi)$ ,

[TABLE]

for a constant $C_{n}(X)$ depending only on $n$ and $X$ . Setting $b_{n}={n\choose 2}$ and

[TABLE]

it is enough to study the function $g_{n}(\theta,\sigma):=b_{n}\theta^{2}-2Z_{n}(\sigma,X)\theta$ , which satisfies

[TABLE]

Consequently, the pseudo maximum likelihood estimator $(\hat{\theta},\hat{\sigma})$ is given by (15) as claimed.
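For very small $n$ , the pseudo-likelihood estimate can be computed by brute force, which makes the reduction above concrete: for fixed $\sigma$ the quadratic $g_{n}(\cdot,\sigma)$ is minimised at $Z_{n}(\sigma,X)/b_{n}$ with minimum value $-Z_{n}(\sigma,X)^{2}/b_{n}$ , so one maximises $|Z_{n}(\sigma,X)|$ over $\sigma$ . The sketch below assumes $Z_{n}(\sigma,X)=\sum_{i<j}s_{i}s_{j}(X_{ij}-1/2)$ with signs $s_{i}=\pm 1$ encoding the two classes, which is what expanding the Gaussian squared loss gives when the means are $1/2\pm\theta$ ; the exact form of (15) is not reproduced here, and the function names are illustrative.

```python
from itertools import product


def z_stat(X, signs):
    """Assumed form of Z_n(sigma, X): sum over i < j of s_i s_j (X_ij - 1/2)."""
    n = len(X)
    return sum(signs[i] * signs[j] * (X[i][j] - 0.5)
               for i in range(n) for j in range(i + 1, n))


def pseudo_mle(X):
    """Brute-force maximiser of |Z_n(sigma, X)| over labellings; returns (theta_hat, signs)."""
    n = len(X)
    b_n = n * (n - 1) // 2
    best_z, best_signs = None, None
    for rest in product((-1, 1), repeat=n - 1):
        signs = (1,) + rest  # sigma and -sigma give the same Z, so fix the first sign
        z = z_stat(X, signs)
        if best_z is None or abs(z) > abs(best_z):
            best_z, best_signs = z, signs
    return best_z / b_n, best_signs


# Noiseless toy example: X is replaced by its mean 1/2 + theta * s_i * s_j with theta = 0.2.
truth = (1, 1, 1, -1, -1, -1)
X = [[0.0 if i == j else 0.5 + 0.2 * truth[i] * truth[j] for j in range(6)] for i in range(6)]
theta_hat, signs = pseudo_mle(X)
assert abs(theta_hat - 0.2) < 1e-9 and signs == truth
```

The search visits $2^{n-1}$ labellings, so this is only meant to illustrate the definition and is not a practical algorithm.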

C.2 Upper bound result, two classes

Proof of Theorem 3.

We first prove the result in the fixed design case. Let $\theta_{0},\sigma_{0}$ denote the true values of $\theta,\varphi$ . The aim is to show that $E_{\theta_{0},\sigma_{0}}(\hat{\theta}-\theta_{0})^{2}\leq C/n$ holds uniformly in $\theta_{0},\sigma_{0}$ . For a given $\sigma\in 2^{[n]}$ ,

[TABLE]

One can write, for any $\sigma\in 2^{[n]}$ ,

[TABLE]

where we have set

[TABLE]

For any $t>0$ and $t_{n}=M_{2}/\sqrt{n}$ , and for a large enough $M_{2}$ to be chosen below,

[TABLE]

By definition of $\hat{\theta}$ , with $\delta(\sigma,\sigma_{0})$ defined in (37),

[TABLE]

For any $t\geq 4M_{2}$ , using that $|\delta(\hat{\sigma},\sigma_{0})|\leq b_{n}$ ,

[TABLE]

For $\mathcal{P}_{2}(t)$ , there are two cases, depending on the sign of $\theta_{0}$ ,

[TABLE]

Let us discuss the term $\theta_{0}>t_{n}$ first and note that if $Z_{n}(\hat{\sigma},X)\geq 0$ , then $Z_{n}(\hat{\sigma},X)=|Z_{n}(\hat{\sigma},X)|\geq|Z_{n}(\sigma_{0},X)|\geq Z_{n}(\sigma_{0},X)$ using the definition of $\hat{\sigma}$ as a maximum. First,

[TABLE]

where the first three inequalities use identities obtained for $Z_{n}(\hat{\sigma},X),Z_{n}(\sigma_{0},X)$ above and the inequality obtained before the display, and the last inequality uses $\delta(\hat{\sigma},\sigma_{0})-b_{n}\leq 0$ and $\theta_{0}\geq 0$ .

Second, as $Z_{n}(\hat{\sigma},X)<0$ implies $Z_{n}(\hat{\sigma},X)<-|Z_{n}(\sigma_{0},X)|$ by definition of the maximum,

[TABLE]

where for the last inequality we have used the lower bound on $\delta$ obtained in Lemma 7.

The case $\theta_{0}<-t_{n}$ is treated in a symmetric way, by distinguishing the two cases $Z_{n}(\hat{\sigma},X)<0$ and $Z_{n}(\hat{\sigma},X)\geq 0$ respectively. To obtain a deviation bound for $\hat{\theta}$ , it is enough to study the supremum of the process $|R_{n}(\sigma)|$ . For any given $\sigma$ and $y>0$ , by Hoeffding’s inequality,

[TABLE]

A union bound now leads to

[TABLE]

This bound is smaller than $\exp\{-2n\}$ if one chooses $y=n^{3/2}$ . Combining the bounds obtained previously, and choosing $M_{2}$ above as $M_{2}=64$ , one deduces

[TABLE]

The deviation bound in turn implies the bound in expectation

[TABLE]

where for the second term we have used $|\hat{\theta}-\theta_{0}|\leq 1$ , as $\Theta$ has diameter $1$ . This concludes the proof of Theorem 3 in the fixed design case.

In the random design case, one slightly updates the definition of $r_{ij}$ . Here, the design is specified by $\varphi$ , which is now random, but one can consider

[TABLE]

By definition, $E[\tilde{r}_{ij}]=E[E[\tilde{r}_{ij}\,|\,\varphi]]=0$ . Now one can follow the proof in the fixed design case by writing all statements conditionally on $\varphi$ . As conditionally on $\varphi$ the variables $r_{ij}$ are independent and centered, the arguments leading to the various upper bounds remain unchanged. As the upper-bounds themselves do not depend on $\varphi$ , the bounds also hold unconditionally. ∎

What remains to be shown is the bound on $\delta$ used above:

Lemma 7.

For any $\sigma_{0},\sigma\in\Sigma$ , with $\delta(\sigma,\sigma_{0})$ defined in (37) and $b_{n}={n\choose 2}$ ,

[TABLE]

Proof.

The upper bound corresponds to the number of terms in the sum. For the lower bound, denote $\mathcal{C}_{1}=\sigma_{0}^{-1}(\{1\})=\{i,\ \sigma_{0}(i)=1\}$ . By symmetry, one can always assume $|\mathcal{C}_{1}|\geq n/2$ , otherwise one works with $\mathcal{C}_{2}=\sigma_{0}^{-1}(\{2\})$ . The number $T_{\sigma}(\mathcal{C}_{1})$ of pairs $(i,j)\in\mathcal{C}_{1}\times\mathcal{C}_{1}$ for which $\sigma(i)\neq\sigma(j)$ is at most $2N_{1}(\sigma)N_{2}(\sigma)$ , if $N_{i}(\sigma)=|\sigma^{-1}(\{i\})\cap\mathcal{C}_{1}|$ , $i=1,2$ . This implies $T_{\sigma}(\mathcal{C}_{1})\leq|\mathcal{C}_{1}|^{2}/2$ , using the inequality $p(q-p)\leq q^{2}/4$ , for any $0\leq p\leq q$ . Thus the number of positive elements in the sum defining $\delta(\sigma,\sigma_{0})$ is at least $(|\mathcal{C}_{1}|^{2}-|\mathcal{C}_{1}|^{2}/2-|\mathcal{C}_{1}|)/2$ , where $|\mathcal{C}_{1}|$ corresponds to the diagonal terms and the division by $2$ to the fact that the sum is restricted to $i<j$ only (note that the general term of the sum defining $\delta$ is symmetric in $i,j$ ). This is at least $|\mathcal{C}_{1}|^{2}/8$ if $|\mathcal{C}_{1}|\geq 4$ . Hence,

[TABLE]

which is the desired bound in view of $|\mathcal{C}_{1}|^{2}\geq b_{n}/2$ . ∎

C.3 Upper bound result, $k$ classes

Proof of Theorem 4.

First consider the fixed design case: As in the proof of Theorem 3, let $\theta_{0},\sigma_{0}$ denote the true values of $\theta,\varphi$ . The aim is to show that

[TABLE]

Let us denote by $Z^{0}$ and $\tilde{Z}$ the matrices of general terms $Z^{0}_{i,j}:=M_{\sigma_{0}(i)\sigma_{0}(j)}^{\theta_{0}}$ and $\tilde{Z}_{ij}:=M^{\tilde{\theta}}_{\tilde{\sigma}(i)\tilde{\sigma}(j)}$ respectively, with $\tilde{\sigma},\tilde{\theta}$ given by (17) and $i\neq j$ . One interpretation of (17) is that the matrix $\tilde{Z}$ provides the best fit to the data $X_{ij}$ with respect to the squared $L^{2}$ loss, when optimising over $\Sigma_{e}\times\Theta_{n}$ .

As a first step, we show that $\tilde{Z}$ and $Z^{0}$ are close with high probability, a result in the spirit of Gao et al. [16], Theorem 2.1. This follows from Lemma 8 below, which states that $\|\tilde{Z}-Z^{0}\|^{2}\leq Cn\log k$ with probability at least $1-e^{-n\log{k}}$ , where $\|\cdot\|$ is the Frobenius norm.

In a second step, denoting $S_{I}^{0}:=\sigma_{0}^{-1}(\{1,2\})$ and recalling from (18) that $\tilde{S}_{I}=\tilde{\sigma}^{-1}(\{1,2\})$ , we show that $\tilde{S}_{I}$ is close to $S_{I}^{0}$ . To do so, one separately bounds from below some terms from the quantity $\|Z^{0}-\tilde{Z}\|^{2}=\sum_{i,j}(Z^{0}_{i,j}-\tilde{Z}_{i,j})^{2}$ , recalling that one extends the $Z$ matrices by symmetry and sets the diagonal to $0$ . First, using the definitions, (21), $|\theta|\leq\kappa$ , and $n\geq 2$ ,

[TABLE]

if $|S_{I}^{0}\,\setminus\,\tilde{S}_{I}|\geq 2$ (otherwise the inequality below holds trivially), as well as

[TABLE]

and, with $a\wedge b=\min(a,b)$ , if $|S_{I}^{0}\,\cap\,\tilde{S}_{I}|\geq 2$ ,

[TABLE]

The previous bound on $\|\tilde{Z}-Z^{0}\|^{2}$ implies that

[TABLE]

It now follows from (22) that for any $\delta>0$ , one has $C\kappa^{-2}n\log{k}\leq\delta n^{2}/k^{2}$ , provided $d$ in (22) is small enough. So for small $d$ , as $|S_{I}^{0}|=|S_{I}^{0}\cap\tilde{S}_{I}|+|S_{I}^{0}\setminus\tilde{S}_{I}|$ , and as by assumption $\sigma_{0}\in\Sigma_{e}$ so that $|S_{I}^{0}|\asymp n/k$ , one deduces $|S_{I}^{0}\cap\tilde{S}_{I}|\gtrsim n/k$ . From the bound on $\|\tilde{Z}-Z^{0}\|^{2}$ , it follows that

[TABLE]

By (22) this shows that $\tilde{\theta}$ is close to either $\theta_{0}$ or $-\theta_{0}$ up to $k(\log{n}/n)^{1/2}=:\rho\leq\kappa/2$ . Now for any $i,j$ in $\tilde{S}_{I}\setminus S_{I}^{0}$ , if $\tilde{Z}_{ij}=\frac{1}{2}+\tilde{\theta}$ and $\tilde{\theta}$ is close to $\theta_{0}$ (the cases where $\tilde{Z}_{ij}=\frac{1}{2}-\tilde{\theta}$ or $\tilde{\theta}$ is close to $-\theta_{0}$ are treated similarly), setting $Z_{ij}^{0}=c_{ij}$ and using (22), with $a_{0}=1/2$ ,

[TABLE]

Therefore,

[TABLE]

and

[TABLE]

Combining the previous bounds on cardinalities and denoting $A\ \Delta\ B:=(A\setminus B)\,\cup\,(B\setminus A)$ for two sets $A$ and $B$ , one obtains

[TABLE]

which in turn implies that

[TABLE]

In a third and last step, we follow the proof of Theorem 3. Let $\hat{\sigma}=\hat{\sigma}_{I}$ be the mapping in (19). It is a map $\tilde{S}_{I}\to\{1,2\}$ . Let $\bar{\sigma}$ be the mapping $S_{I}^{0}\to\{1,2\}$ that coincides with $\hat{\sigma}$ on $\tilde{S}_{I}\cap S_{I}^{0}$ and with $\sigma_{0}$ on $S_{I}^{0}\setminus\tilde{S}_{I}$ . By definition we have, with $\Delta_{n}:=\kappa^{-2}n\log{k}$ ,

[TABLE]

where for the second identity we have used that the $X_{ij}$ s are bounded by $1$ and (38), and $R_{n}(\sigma)$ is defined as in the proof of Theorem 3. Similarly, denoting $n_{k}={|S_{I}^{0}|\choose 2}$ and $\tilde{n}_{k}={|\tilde{S}_{I}|\choose 2}$ , we have

[TABLE]

with high probability, since (38) implies $|\tilde{n}_{k}-n_{k}|=O(\Delta_{n})$ using that $||A|-|B||\leq|A\Delta B|$ for two sets $A,B$ . Also, since $\hat{\theta}=Z_{n}(\hat{\sigma},\tilde{S}_{I},X)/\tilde{n}_{k}$ and $|\hat{\theta}|\leq 1/2$ , by the same argument we have

[TABLE]

Let $v_{k}$ and $t_{k}$ be two sequences depending on $n$ and $k$ whose specific values are determined below (see the last paragraph of the proof). If $Z_{n}(\hat{\sigma},\tilde{S}_{I},X)\geq 0$ , then $Z_{n}(\hat{\sigma},\tilde{S}_{I},X)\geq|Z_{n}(\sigma_{0},\tilde{S}_{I},X)|\geq Z_{n}(\sigma_{0},\tilde{S}_{I},X)$ . Let $\Sigma_{0}$ be the set of all maps $S_{I}^{0}\to\{1,2\}$ . In the following inequalities we repeatedly use the fact that the normalisation $\tilde{n}_{k}$ in the definition of $\hat{\theta}$ can be replaced by $n_{k}$ up to a factor $O(\Delta_{n})/n_{k}$ , see (39),

[TABLE]

where the last inequality uses $\delta(\bar{\sigma},\sigma_{0})-n_{k}\leq 0$ , see Lemma 7 with $n_{k}$ in place of $b_{n}$ , and $\theta_{0}\geq 0$ .

Second, as $Z_{n}(\hat{\sigma},\tilde{S}_{I},X)<0$ implies $Z_{n}(\hat{\sigma},\tilde{S}_{I},X)<-|Z_{n}(\sigma_{0},\tilde{S}_{I},X)|$ by definition of $\hat{\sigma}$ ,

[TABLE]

where for the last inequality we have used the first inequality of Lemma 7. Also,

[TABLE]

By the same argument as in the proof of Theorem 3, the supremum $\sup_{\sigma\in\Sigma_{0}}|R_{n}(\sigma)|$ is of the order $|S_{I}^{0}|^{3/2}\asymp(n/k)^{3/2}$ , by definition of $\Sigma_{e}$ . Recall that $n_{k}\asymp(n/k)^{2}$ . Set $v_{k}=\sqrt{n/k}$ and $t_{k}=Dv_{k}^{-1}$ , with $D$ a large enough constant. Assumption (22) ensures that $\Delta_{n}=\kappa^{-2}n\log{k}=O((n/k)^{3/2})$ . Hence by taking $D$ a large enough constant, one obtains that the last three displays are bounded above by $e^{-Cn/k}$ , which concludes the proof in the fixed design case, proceeding as in the proof of Theorem 3 to get the final bound in expectation.
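The choice of $v_{k}$ and $t_{k}$ can be motivated by comparing orders of magnitude. The following is only a sketch that uses the orders stated in the paragraph above, with all constants suppressed:

$$
\frac{\sup_{\sigma\in\Sigma_{0}}|R_{n}(\sigma)|}{n_{k}}\asymp\frac{(n/k)^{3/2}}{(n/k)^{2}}=\Big(\frac{n}{k}\Big)^{-1/2}=v_{k}^{-1},
\qquad
\frac{\Delta_{n}}{n_{k}}=O\Big(\frac{(n/k)^{3/2}}{(n/k)^{2}}\Big)=O(v_{k}^{-1}),
\qquad
t_{k}^{2}=D^{2}v_{k}^{-2}=D^{2}\,\frac{k}{n}.
$$

Both the stochastic term and the error made by replacing $\tilde{n}_{k}$ with $n_{k}$ are thus of the order of $t_{k}$ , which is consistent with a quadratic risk of order $k/n$ .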

The proof in the random design case is obtained by first deriving the results conditionally on $\varphi$ and then integrating out $\varphi$ , as we did in the proof of Theorem 3. The first part is almost identical to the fixed design case: one only needs to note that one can restrict to mappings $\varphi$ that belong to, essentially, $\Sigma_{e}$ . Denote by $\Sigma_{e}^{\prime}$ the subset of those ${\sigma\in\Sigma_{e}}$ satisfying $||\sigma^{-1}(j)|-\frac{n}{k}|<\frac{n}{2k}$ . Then

[TABLE]

For the first term, one can apply the arguments above in fixed design, while for the second an application of Bernstein’s inequality gives

[TABLE]

By (22), $k^{3}\leq Cn$ holds for $C$ large enough—note that $\kappa$ must be smaller than $1$ , as the entries of $M$ are in $[0,1]$ . One deduces $ke^{-\frac{n}{10k}}\leq\frac{k}{n}ne^{-dn^{2/3}}\leq C\frac{k}{n}$ , for $d$ small enough and $C$ large enough, so the quadratic risk is at most $Ck/n$ in this case as well. ∎
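The last chain of inequalities can be unpacked as follows. This is a sketch with unoptimised constants, in which $d$ stands for any positive number not exceeding $1/(10C^{1/3})$ (a value introduced here only for illustration):

$$
k\leq(Cn)^{1/3}\ \Longrightarrow\ \frac{n}{10k}\geq\frac{n^{2/3}}{10\,C^{1/3}}\geq d\,n^{2/3},
\qquad
ke^{-\frac{n}{10k}}\leq\frac{k}{n}\,ne^{-dn^{2/3}}\leq\frac{k}{n}\,\sup_{x>0}xe^{-dx^{2/3}}=\frac{k}{n}\Big(\frac{3}{2de}\Big)^{3/2}.
$$

The supremum is attained at $x^{2/3}=3/(2d)$ , so the constant in front of $k/n$ depends on $d$ only.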

Lemma 8.

Let $Z^{0}_{i,j}:=M_{\sigma_{0}(i)\sigma_{0}(j)}^{\theta_{0}}$ and $\tilde{Z}_{ij}:=M^{\tilde{\theta}}_{\tilde{\sigma}(i)\tilde{\sigma}(j)}$ , with $\tilde{\sigma},\tilde{\theta}$ given by (17). Let $\|\cdot\|$ denote the matrix Frobenius norm. With probability at least $1-e^{-cn\log{k}}$ ,

[TABLE]

Proof of Lemma 8.

Let $\theta_{1}$ denote the element of $\Theta_{n}$ closest to $\theta_{0}$ , so that $|\theta_{0}-\theta_{1}|\leq n^{-2}$ . Let $Z^{1}$ be the matrix given by $Z^{1}_{ij}:=M^{\theta_{1}}_{\sigma_{0}(i)\sigma_{0}(j)}$ . By definition, $\|X-\tilde{Z}\|^{2}\leq\|X-Z^{1}\|^{2}$ and hence $\|Z^{1}-\tilde{Z}\|^{2}+2{\langle}X-Z^{1},Z^{1}-\tilde{Z}{\rangle}\leq 0$ , so

[TABLE]

where we denote $T_{n}(\sigma,\theta):={\langle}X-Z^{0},(M^{\theta}_{\sigma(i)\sigma(j)}-Z^{1})/\|M^{\theta}_{\sigma(i)\sigma(j)}-Z^{1}\|{\rangle}$ . As elements of the matrix $X-Z^{0}$ are between $-1$ and $1$ , we note that $T_{n}(\sigma,\theta)$ is of the form $\sum_{l}\mu_{l}\varepsilon_{l}$ , where $\varepsilon_{l}\in[-1,1]$ are independent, and $\sum_{l}\mu_{l}^{2}=1$ . So using Hoeffding’s inequality, for any $t>0$ ,

[TABLE]

The cardinality of the set $\Theta_{n}\times\Sigma_{e}$ is bounded above by $(2n^{2}+1)k^{n}\lesssim k^{Cn}$ . A union bound then shows that, with probability at least $1-e^{-cn\log k}$ ,

[TABLE]

Inserting this back into the previous inequality on $\|Z^{1}-\tilde{Z}\|^{2}$ leads to $\|Z^{1}-\tilde{Z}\|\leq C\sqrt{n\log k}+\|Z^{0}-Z^{1}\|$ with probability at least $1-e^{-cn\log{k}}$ . As $\|Z^{0}-Z^{1}\|^{2}\leq Cn^{2}/n^{2}\leq C$ , the triangle inequality leads to the result. ∎
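The union bound step can be written out explicitly. The sketch below assumes a Hoeffding bound of the form $P(|T_{n}(\sigma,\theta)|>t)\leq 2e^{-c_{0}t^{2}}$ for each fixed pair $(\sigma,\theta)$ ; the constants $c_{0}$ and $A$ are generic and introduced only for this illustration.

$$
P\Big(\max_{(\theta,\sigma)\in\Theta_{n}\times\Sigma_{e}}|T_{n}(\sigma,\theta)|>t\Big)
\leq|\Theta_{n}\times\Sigma_{e}|\cdot 2e^{-c_{0}t^{2}}
\leq 2\exp\big(Cn\log k-c_{0}t^{2}\big).
$$

Taking $t=A\sqrt{n\log k}$ with $c_{0}A^{2}\geq 2C$ gives the bound $2\exp(-Cn\log k)$ , which is of the form $e^{-cn\log k}$ after adjusting constants.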

Appendix D Remaining proofs: lower and upper bounds in the sparse case

We begin with a brief overview of the proof techniques. For the lower bounds in the sparse setting (Theorems 8 and 9), the proofs are very similar to the dense case, and it suffices to track the dependence on the sparsity parameter $\alpha_{n}$ . To upper-bound the convergence rate of spectral estimates (Theorems 5 and 8), we use the fact that $\theta$ can be estimated from the largest absolute eigenvalue of the (translated) adjacency matrix. The latter can in turn be estimated empirically. For the proofs of the upper bounds for $k$ classes, we show that with high probability it is possible to recover the true aggregated labels, where aggregation means that classes $1$ and $2$ , corresponding to the ‘hard submodel’, are merged (this is the ‘ $g$ map’ introduced above in the second paragraph of Appendix A). To do so, one adapts techniques introduced by Lei and Rinaldo [25] and by Lei and Zhu [26], and shows that their results still hold ‘under small perturbations’, as explained in more detail below. Once the true aggregated labels of classes $1$ and $2$ are obtained, it suffices to apply the (already derived) result for the case $k=2$ .

D.1 Proofs for the two-class case

Proof of the lower bound in Theorem 8.

One proceeds in the same way as in the proof of Theorem 1 with $a_{ij}(\varphi)$ replaced by $b_{ij}(\varphi):=Q_{\varphi(i)\varphi(j)}^{\theta}(\alpha_{n})$ , $i<j$ . If $\lambda_{1}=\text{Be}(b_{ij}(\varphi))$ , $\lambda_{2}=\text{Be}(b_{ij}(\psi))$ , $\mu=\text{Be}(\alpha_{n}/2)$ , we now have

[TABLE]

This leads to, with $\eta_{i}=1\!{\rm l}_{\varphi(i)=1}-1\!{\rm l}_{\varphi(i)=2}$ and $\eta_{i}^{\prime}=1\!{\rm l}_{\psi(i)=1}-1\!{\rm l}_{\psi(i)=2}$ ,

[TABLE]

By the same argument as in the proof of Theorem 1, it is enough to solve

[TABLE]

for $\theta$ , where $C$ is a small enough universal positive constant, under the constraint that $|\theta|\leq 1/2$ . This leads one to take $\theta^{2}$ equal, up to a constant, to $(n\alpha_{n})^{-1}\wedge 1$ , and the proof is complete. ∎

Proof of the upper bound in Theorem 8.

We write the proof directly in the possibly sparse setting. Let us first consider the fixed design case, where $\varphi$ is non-random. Let $\|\,.\,\|_{Sp}$ denote the spectral norm of a matrix (for a symmetric matrix $\Delta$ , $\|\,.\,\|_{Sp}=\max(|\lambda_{1}(\,.\,)|,|\lambda_{n}(\,.\,)|)$ , so $|\lambda_{1}^{a}(\Delta)|=\|\Delta\|_{Sp}$ ). By [25, Theorem 5.2], we have that for any $r>0$ , there exists a $C=C(r,c_{0})>0$ such that

[TABLE]

with probability at least $1-n^{-r}$ . From this one deduces that $\|\Delta_{n}-E[\Delta_{n}]\|_{Sp}\leq C\sqrt{n\alpha_{n}}$ . The eigenvalues of $\Delta$ and those of ${\Delta_{n}-E[\Delta_{n}]}$ and ${E[\Delta_{n}]}$ can be related to each other by a Weyl-type inequality as

[TABLE]

for any $1\leq i\leq n$ , see e.g. [30, eq. (1.64)]. Suppose for now that $\theta\geq 0$ . In this case $\lambda_{1}(E\Delta_{n})=(n-1)\alpha_{n}\theta$ and $\lambda_{n}(E\Delta_{n})=0$ , which by the previous inequality implies, with high probability,

[TABLE]

Now if $\theta>2C/\sqrt{n\alpha_{n}}$ , using the first inequality we have $\lambda_{1}(\Delta_{n})/\{\alpha_{n}(n-1)\}>C/\sqrt{n\alpha_{n}}$ and $\tilde{\lambda}_{1}=\lambda_{1}(\Delta_{n})$ follows from the second inequality, which means $|\hat{\theta}_{n}-\theta|\leq C/\sqrt{n\alpha_{n}}$ . If $\theta\leq 2C/\sqrt{n\alpha_{n}}$ , the triangle inequality and the second inequality imply $|\lambda_{n}(\Delta_{n})-\theta|\leq 3C/\sqrt{n\alpha_{n}}$ , which combined with the first inequality gives $|\hat{\theta}_{n}-\theta|\leq 3C/\sqrt{n\alpha_{n}}$ . So, for $\theta\geq 0$ , in all cases $|\hat{\theta}_{n}-\theta|\leq 3C/\sqrt{n\alpha_{n}}$ with high probability. The case $\theta<0$ is treated similarly. In the random design setting, one can argue conditionally on $\varphi$ , and then note that both the obtained bounds and the in-probability statements do not depend on $\varphi$ , which gives the result in this setting as well. ∎
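The spectral estimate discussed in this proof can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes a balanced two-class model with edge probabilities $\alpha_{n}(1/2+\theta)$ within classes and $\alpha_{n}(1/2-\theta)$ across classes, takes $\Delta_{n}$ to be the adjacency matrix with off-diagonal entries translated by $\alpha_{n}/2$ , and assumes the estimate has the form $\tilde{\lambda}_{1}/\{\alpha_{n}(n-1)\}$ with $\tilde{\lambda}_{1}$ the eigenvalue of largest modulus. All function names, the use of power iteration, and the final clipping to $[-1/2,1/2]$ are choices made for this illustration.

```python
import random

def simulate_two_class_sbm(n, theta, alpha=1.0, seed=0):
    """Adjacency matrix of a balanced two-class SBM (fixed design, assumed model)."""
    rng = random.Random(seed)
    labels = [1 if i < n // 2 else 2 for i in range(n)]
    X = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            same = labels[i] == labels[j]
            p = alpha * (0.5 + theta if same else 0.5 - theta)
            X[i][j] = X[j][i] = 1.0 if rng.random() < p else 0.0
    return X

def largest_abs_eigenvalue(A, iters=60, seed=1):
    """Signed eigenvalue of largest modulus of a symmetric matrix.

    Plain power iteration; adequate only when that eigenvalue is well
    separated in modulus from the rest of the spectrum.
    """
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(len(A))]
    lam = 0.0
    for _ in range(iters):
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
        w = [sum(a * x for a, x in zip(row, v)) for row in A]
        lam = sum(x * y for x, y in zip(v, w))  # Rayleigh quotient keeps the sign
        v = w
    return lam

def spectral_theta_estimate(X, alpha=1.0):
    """Estimate theta from the translated adjacency matrix (illustrative form)."""
    n = len(X)
    # Translate off-diagonal entries by alpha / 2 and keep a zero diagonal.
    delta = [[X[i][j] - alpha / 2 if i != j else 0.0 for j in range(n)]
             for i in range(n)]
    theta_hat = largest_abs_eigenvalue(delta) / (alpha * (n - 1))
    return max(-0.5, min(0.5, theta_hat))  # clipping is an added convenience

theta_hat = spectral_theta_estimate(simulate_two_class_sbm(n=200, theta=0.2))
```

In this toy setting the signal eigenvalue $(n-1)\alpha_{n}\theta$ is about $40$ , while the noise has spectral norm of order $\sqrt{n\alpha_{n}}$ , so the estimate should land close to $\theta=0.2$ . For $\theta$ near $0$ the leading eigenvalue is no longer separated and power iteration is not a reliable substitute for a full eigendecomposition.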

D.2 Proofs for the general case

Proof of the lower bound in Theorem 9.

The proof is similar to that of Theorem 2, where one now uses the sparse lower bound for two classes of Theorem 8 instead of Theorem 1, and is thus omitted. ∎

Proof of the upper bound in Theorem 9.

We show that the proof approach used by [26] to establish their Theorem 2 can be adapted to our problem. More precisely, it is amenable to a perturbation of the true matrix $M$ of connection probabilities: We show that, for a graph generated by model (12) with a sufficiently small value of $\theta$ , the V-Clust algorithm with $K=k-1$ classes recovers the aggregated labelling defined by $g$ above with high probability. We do the proof in the possibly sparse situation, thereby also proving the upper-bound in Theorem 9.

There are three steps. First, we show that the initial label recovery algorithm $\mathcal{S}$ of [25] recovers most of the labels correctly, and control the error. Second, we show that the scheme of proof of [26] carries over to the problem of recovering the aggregated clustering up to label permutation. Finally, using assumption (A3) one can recover the aggregated class $1$ with high probability, and restricting to nodes with label in that class we can apply the spectral method $\mathcal{S}_{2}$ of the case $k=2$ .

First step (Perturbed spectral method of Lei and Rinaldo).

[TABLE]

The matrix $M^{0}$ (i.e. $M^{\theta}$ with $\theta=0$ ) can be transformed into the matrix $N$ above by removing the first row and then the first column.

Let $X$ be the matrix $(X_{ij})$ . Since the relevant design is fixed, there exists a binary $n\times k$ matrix $T$ , with a single 1 in each row, for which we have ${E[X]=TM^{\theta}(\alpha_{n})T^{t}+D}$ , where $D=-\text{Diag}(TM^{\theta}(\alpha_{n})T^{t})$ is a diagonal matrix with entries bounded by $\alpha_{n}$ . Lei and Rinaldo call $T$ a membership matrix. It can be rewritten in terms of $N$ , using the relation between $M^{0}$ and $N$ noted above: for a $n\times K$ membership matrix $S$ and $E[X]$ the expected value of $X$ ,

[TABLE]

Now we can follow Lei and Rinaldo’s analysis of simple spectral clustering with $K=k-1$ and the expectation matrix $SNS^{t}$ ; one only needs to show that, despite the perturbation $\theta TRT^{t}$ , the argument still holds. Intuitively, this is guaranteed by the assumption that $\theta$ is small enough, which ensures that the spectrum of the perturbation $\theta TRT^{t}$ does not interact much with that of $SNS^{t}$ . More precisely, we decompose $X$ as

[TABLE]

Following the proof of Theorem 3.1 of [25], the pair $(S,N)$ parametrises a SBM with $K=k-1$ classes and $N$ is full rank. By their Lemma 2.1, the eigendecomposition of $P=S(\alpha_{n}N)S^{t}$ can be written $P=UDU^{t}$ , where $U$ is the matrix of the $K$ leading eigenvectors of $P$ , and one can write $U=S\xi$ , for some matrix $\xi\in\mathbb{R}^{K\times K}$ with orthogonal rows (and $\|\xi_{k*}-\xi_{l*}\|^{2}=n_{k}^{-1}+n_{l}^{-1}$ ). It also follows from the proof of that Lemma that if $\gamma_{n}$ denotes the smallest absolute nonzero eigenvalue of $P$ , we have $\gamma_{n}=n_{min}\alpha_{n}\lambda(K)$ , with $n_{min}$ the cardinality of the smallest class, here of order $n/k$ using that classes are balanced, and $\lambda(K)$ the smallest absolute eigenvalue of $N$ .
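The structure $U=S\xi$ can be seen from a short computation, sketched here under the assumption that all $K$ classes are non-empty. The symbols $\Lambda$ and $V$ are introduced only for this illustration, $n_{1},\ldots,n_{K}$ denote the class sizes, and $D$ is the diagonal matrix of the eigendecomposition $P=UDU^{t}$ .

$$
\Lambda:=\mathrm{Diag}(\sqrt{n_{1}},\ldots,\sqrt{n_{K}}),\qquad S^{t}S=\Lambda^{2},\qquad \Lambda(\alpha_{n}N)\Lambda=VDV^{t}\ \text{ with } V^{t}V=VV^{t}=I_{K},
$$

$$
P=S(\alpha_{n}N)S^{t}=(S\Lambda^{-1}V)\,D\,(S\Lambda^{-1}V)^{t},\qquad (S\Lambda^{-1}V)^{t}(S\Lambda^{-1}V)=V^{t}\Lambda^{-1}\Lambda^{2}\Lambda^{-1}V=I_{K}.
$$

Hence $U=S\xi$ with $\xi=\Lambda^{-1}V$ , and $\xi\xi^{t}=\Lambda^{-2}$ shows that the rows of $\xi$ are orthogonal with $\|\xi_{k*}\|^{2}=n_{k}^{-1}$ , so that $\|\xi_{k*}-\xi_{l*}\|^{2}=n_{k}^{-1}+n_{l}^{-1}$ for $k\neq l$ .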

By Lemma 5.1 of [25], one can control the distance between the leading eigenspaces of $X$ and $P$ (for the first $K$ non-zero eigenvalues) in terms of the spectral norm of $W$ . The assumptions of that Lemma are fulfilled with $P$ here of rank $K=k-1$ and of smallest nonzero singular value $\gamma_{n}$ . If $\hat{U}\in\mathbb{R}^{n\times K}$ is the matrix of the $K$ leading eigenvectors of $X$ (and $U$ the one for $P$ , as above), there exists a $K\times K$ orthogonal matrix $Q$ such that, with $\|\cdot\|_{K}$ and $\|\cdot\|_{Sp}$ the Frobenius and spectral norms respectively,

[TABLE]

By the triangle inequality, the spectral norm $\|X-P\|_{Sp}$ is in turn bounded by

[TABLE]

The matrix $R$ can be written $R=uu^{t}$ , where $u^{t}$ is the row $(1\,-1\ 0\ldots 0)$ of length $k$ . In particular, $R$ is of rank $1$ , and $\|TRT^{t}\|_{Sp}=\|Tu\|_{2}^{2}$ (a nonzero eigenvector is $Tu$ ). By construction, $\|Tu\|_{2}^{2}=n_{1}+n_{2}$ , the number of elements of classes $1$ and $2$ , so that $\|TRT^{t}\|_{Sp}\leq Cn/K$ . Also, $\|D\|_{Sp}\leq\alpha_{n}$ since $D$ is diagonal with terms bounded by $\alpha_{n}$ . By Theorem 5.2 of [25], the norm $\|X-E[X]\|$ is, with probability at least $1-1/n^{2}$ , no larger than $C\sqrt{n\alpha_{n}}$ , for a sufficiently large constant $C$ . Gathering the last bounds and using $\alpha_{n}\lesssim\sqrt{n\alpha_{n}}$ , one obtains $\|X-P\|_{Sp}\leq C(\sqrt{n\alpha_{n}}+|\theta|n\alpha_{n}/K)$ .
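The rank-one structure of the perturbation can be checked on a small example. The sketch below uses hypothetical numbers with $k=3$ , takes $\alpha_{n}=1$ and ignores the diagonal correction $D$ ; it verifies entry by entry that $TM^{\theta}T^{t}=SNS^{t}+\theta\,TRT^{t}$ , and that $Tu$ is an eigenvector of $TRT^{t}$ with eigenvalue $\|Tu\|_{2}^{2}=n_{1}+n_{2}$ . The helper functions and the particular matrix $M^{0}$ are assumptions made for illustration.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def membership(labels, k):
    """n x k binary matrix with a single 1 in each row (labels are 1-based)."""
    return [[1 if lab == c else 0 for c in range(1, k + 1)] for lab in labels]

# Hypothetical M^0: rows 1 and 2 coincide, so classes 1 and 2 differ only via theta.
theta = 0.1
M0 = [[0.5, 0.5, 0.2],
      [0.5, 0.5, 0.2],
      [0.2, 0.2, 0.9]]
u = [1, -1, 0]
R = [[a * b for b in u] for a in u]                      # R = u u^t, rank one
M_theta = [[M0[i][j] + theta * R[i][j] for j in range(3)] for i in range(3)]
N = [row[1:] for row in M0[1:]]                          # drop first row and column

labels = [1, 1, 1, 2, 2, 3, 3, 3, 3]                     # n_1 = 3, n_2 = 2, n_3 = 4
merged = [1 if lab <= 2 else lab - 1 for lab in labels]  # aggregate classes 1 and 2
T = membership(labels, 3)
S = membership(merged, 2)

lhs = matmul(matmul(T, M_theta), transpose(T))
SNS = matmul(matmul(S, N), transpose(S))
TRT = matmul(matmul(T, R), transpose(T))

# Largest entrywise gap in T M^theta T^t = S N S^t + theta * T R T^t.
n = len(labels)
max_gap = max(abs(lhs[i][j] - SNS[i][j] - theta * TRT[i][j])
              for i in range(n) for j in range(n))

# Tu has entries +1 (class 1), -1 (class 2), 0 (otherwise).
Tu = [sum(t * x for t, x in zip(row, u)) for row in T]
norm_sq = sum(x * x for x in Tu)                         # equals n_1 + n_2
image = [sum(a * x for a, x in zip(row, Tu)) for row in TRT]
```

Here `image` equals `norm_sq` times `Tu`, which is the eigenvector relation behind $\|TRT^{t}\|_{Sp}=\|Tu\|_{2}^{2}$ .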

On the other hand, following Lei and Rinaldo [25], one can perform a $(1+\varepsilon)$-approximate $k$ -means clustering on the rows of $\hat{U}$ : application of their Lemma 5.3 to the matrices $\hat{U}$ and $UQ$ shows that the approximate $k$ -means solution is a pair $(\hat{S},\hat{\xi})$ , where $\hat{S}$ is a membership matrix, $\hat{\xi}$ is a $K\times K$ matrix, and $\hat{S}\hat{\xi}$ is an approximate least-squares fit to $\hat{U}$ . Moreover, the estimated membership $\hat{S}$ coincides with $S$ up to label permutation, except on sets ${S_{1},\ldots,S_{K}}$ that are characterized as follows. Recall that $\psi$ is the ‘true’ labelling obtained by merging the original classes 1 and 2 of nodes. Each set $S_{j}\subset\psi^{-1}(j)$ satisfies

[TABLE]

whenever

[TABLE]

This implies, using the previous bounds and $\gamma_{n}=n_{min}\alpha_{n}\lambda(K)\gtrsim(n/K)\alpha_{n}\lambda(K)$ , that

[TABLE]

provided, for some suitably small constant $c>0$ , with $\lambda=\lambda(K)$ ,

[TABLE]

The first summand coincides with the condition in [25]. The second term accounts for the perturbation induced by $\alpha_{n}\theta R$ . Provided that

[TABLE]

the simple spectral clustering algorithm has recovery error at most $n/f(n\alpha_{n},K)$ , with

[TABLE]

The conditions on $n,\alpha_{n},\lambda,K,\theta$ permit this quantity to be chosen suitably large. This means that, with high probability, Step 1 of the algorithm with $K=k-1$ recovers a sufficiently large proportion of the labels of $N$ , up to label permutation.

Second step (Lei and Zhu’s exact label recovery method via sample splitting). We can now use the method introduced by Lei and Zhu [26]: using a first rough estimate of the labels, one can refine it to an exact label recovery with high probability, provided $f(n\alpha_{n},K)$ is large enough in terms of a certain function of $K$ . The recovered labels are those of the original classes $3,4,\ldots,k$ , and of the aggregated class containing classes $1$ and $2$ . To verify that the proof of Lei and Zhu generalizes to the perturbed case, it suffices to note that the distortion of $E[X_{ij}]$ for $i,j\in\psi^{-1}(\{1,2\})$ from $1/2$ to $1/2\pm\theta$ does not interfere with the bounds of the proof of Theorem 2 in [26]. The sample splitting algorithm of Lei and Zhu involves two subroutines called ${\tt CrossClust}$ and ${\tt Merge}$ . The mean of $X_{ij}$ enters the proof of that result via two applications of Bernstein’s inequality, in the proofs of Lemma 6 (which implies the consistency of CrossClust via Lemma 3) and Lemma 7 (consistency of Merge) of Lei and Zhu [26].

We impose the assumptions of Lei and Zhu [26], Theorem 2, on $K$ and $f(\alpha_{n}n/2,K)$ : one needs $f(n\alpha_{n}/2,K)\gamma(K)\geq CK^{2.5}$ and the inequalities $\alpha_{n}\geq CK^{3}\log{n}/(\gamma^{2}(K)n)$ and $Cn\geq K^{3}$ . The last two conditions are implied by (B2) (respectively (A2) in the dense case). By (42), the first one is satisfied if

[TABLE]

Again, the first inequality holds by (B2). The second inequality asks for $\theta^{2}<C\lambda^{2}\gamma(K)/K^{2.5}$ , which is guaranteed by (34).

The parameter $\theta$ affects the proof of Lemma 3 of Lei and Zhu [26] as follows. The proof relies on bounding three terms $T_{1},T_{2}$ and $T_{3}$ via Lemma 6 in the Appendix of their paper. To be able to apply Bernstein’s inequality on $T_{1}$ , one needs

[TABLE]

where $\pi_{0}=n_{min}/(n/K)\gtrsim 1$ in the notation of [26], Definition 1. To bound $T_{2}$ , one needs

[TABLE]

while, similarly, to bound $T_{3}$ one needs

[TABLE]

On the other hand, consistency of Merge with $V=2$ requires

[TABLE]

The condition required for $T_{1}$ implies the remaining ones: The one required for $T_{2}$ is weaker, since $f(n\alpha_{n},K)>C^{\prime}K$ for some constant $C^{\prime}>0$ , by (41)–(42). The condition for Merge is also weaker up to constants, provided that $n\gtrsim K^{3/2}$ , which is satisfied under our conditions. To obtain the above Bernstein’s inequalities, one thus needs

[TABLE]

It follows from the spectral decomposition of $N$ that $\lambda(K)\leq\gamma(K)/\sqrt{2}$ . By combining with (41), we see that it is enough that $\theta$ satisfies $|\theta|\leq C\lambda/\sqrt{K}$ , which was already required above. Finally, to see that the last inequality is satisfied, one notes that it is implied by (34), using that $\gamma(K)\leq\sqrt{K}$ . We have just proved that the exact recovery from [26] also holds here.

Third step (Finding true cluster $1$ and conclusion). The second step provides a labelling $\hat{g}$ that coincides, up to permutation, with the aggregated labelling $g$ with high probability. The assumed separation from $1/2$ allows us to identify cluster $1$ : For $l=1,\ldots,k-1$ , compute

[TABLE]

Since class sizes are of order $n/k$ , an application of Bernstein’s inequality gives

[TABLE]

for some permutation $\sigma$ , with high probability. By (A3), if $l\neq 1$ , we have $|N_{ll}-1/2|\geq\kappa$ . So w.h.p. there is exactly one diagonal element $\hat{N}_{ll}=\hat{N}_{\hat{\ell}\hat{\ell}}$ within $\kappa/2$ of $1/2$ , since the conditions of the theorem imply

[TABLE]

The index $\hat{\ell}$ then identifies the first cluster of $N$ —which is the aggregate cluster corresponding to clusters 1 and 2 defined by $M^{\theta}$ —with high probability. We can now apply the spectral algorithm for $k=2$ to the induced submatrix $(X_{ij})_{i,j\in\hat{g}^{-1}(\hat{\ell})}$ . Using the upper-bound part of Theorem 8 with a number of nodes $|\hat{g}^{-1}(\hat{\ell})|\asymp n/k$ leads to $E_{\theta}[(\hat{\theta}-\theta)^{2}]\leq Ck/(n\alpha_{n})$ , by observing that the event with high probability arising from the previous arguments (that is, the concentration result by [25] and Bernstein’s inequalities) holds with probability at least $1-1/n$ . ∎
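The identification of the aggregate cluster in the third step can be sketched in code. The example below is only an illustration in the dense case $\alpha_{n}=1$ : it assumes that $\hat{N}_{ll}$ is the empirical edge density within the estimated class $l$ , and it selects the unique class whose density lies within $\kappa/2$ of $1/2$ . The function names, the toy probabilities and the hidden split of class 1 are hypothetical.

```python
import random

def within_class_density(X, members):
    """Average of X[i][j] over pairs i < j of nodes in `members` (needs >= 2 nodes)."""
    total, count = 0.0, 0
    for a, i in enumerate(members):
        for j in members[a + 1:]:
            total += X[i][j]
            count += 1
    return total / count

def find_half_cluster(X, labels, kappa):
    """Return (label, densities); label is None unless exactly one class has
    within-class density within kappa / 2 of 1/2."""
    densities = {}
    for lab in sorted(set(labels)):
        members = [i for i, l in enumerate(labels) if l == lab]
        densities[lab] = within_class_density(X, members)
    close = [lab for lab, d in densities.items() if abs(d - 0.5) <= kappa / 2]
    return (close[0] if len(close) == 1 else None), densities

# Toy data: aggregated labels 1, 2, 3 with 60 nodes each (all numbers hypothetical).
rng = random.Random(0)
theta, kappa, size = 0.05, 0.3, 60
labels = [1 + i // size for i in range(3 * size)]
hidden = [i % 2 for i in range(3 * size)]   # hidden two-class split inside label 1
diag = {2: 0.9, 3: 0.1}                     # diagonal entries separated from 1/2

def edge_prob(i, j):
    if labels[i] != labels[j]:
        return 0.3
    if labels[i] == 1:
        return 0.5 + theta if hidden[i] == hidden[j] else 0.5 - theta
    return diag[labels[i]]

n = len(labels)
X = [[0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        X[i][j] = X[j][i] = 1 if rng.random() < edge_prob(i, j) else 0

chosen, densities = find_half_cluster(X, labels, kappa)
```

Once `chosen` is found, the two-class spectral estimate would be applied to the submatrix of `X` indexed by the nodes carrying that label, as described in the text.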

Bibliography


[1] E. S. Allman, C. Matias, and J. A. Rhodes. Parameter identifiability in a class of random graph mixture models. J. Statist. Plann. Inference, 141(5):1719–1736, 2011.
[2] C. Ambroise and C. Matias. New consistent and asymptotically normal parameter estimates for random-graph mixture models. J. R. Stat. Soc. Ser. B. Stat. Methodol., 74(1):3–35, 2012.
[3] E. Arias-Castro and N. Verzelen. Community detection in dense random networks. Ann. Statist., 42(3):940–969, 2014.
[4] E. Arias-Castro, E. J. Candès, and A. Durand. Detection of an anomalous cluster in a network. Ann. Statist., 39(1):278–304, 2011.
[5] P. Bickel and A. Chen. A nonparametric view of network models and Newman-Girvan and other modularities. PNAS, 106(50):21068–21073, 2009.
[6] P. Bickel and H. Chernoff. Asymptotic distribution of the likelihood ratio statistic in a prototypical non regular problem. In Statistics and Probability: A Raghu Raj Bahadur Festschrift, pages 83–96. Wiley, New York, 1993.
[7] P. Bickel, D. Choi, X. Chang, and H. Zhang. Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels. Ann. Statist., 41(4):1922–1943, 2013.
[8] P. J. Bickel, A. Chen, and E. Levina. The method of moments and degree distributions for network models. Ann. Statist., 39(5):2280–2301, 2011.

