Optimal graphon estimation in cut distance

Olga Klopp (CREST); Nicolas Verzelen (MISTEA)

arXiv:1703.05101·math.ST·October 17, 2018

TL;DR

This paper establishes minimax estimation rates for graphons and connection probability matrices in cut distance, revealing that the adjacency matrix alone is already optimally informative for this metric.

Contribution

It proves that the adjacency matrix achieves optimal minimax rates in cut distance, showing no benefit from more complex estimation procedures.

Findings

01

Raw adjacency matrix is minimax optimal in cut distance.

02

Estimation rates are established for block constant matrices and step function graphons.

03

Contrasts with classical distances where more complex methods improve results.

Abstract

Consider the twin problems of estimating the connection probability matrix of an inhomogeneous random graph and the graphon of a W-random graph. We establish the minimax estimation rates with respect to the cut metric for classes of block constant matrices and step function graphons. Surprisingly, our results imply that, from the minimax point of view, the raw data, that is, the adjacency matrix of the observed graph, is already optimal and more involved procedures cannot improve the convergence rates for this metric. This phenomenon contrasts with optimal rates of convergence with respect to other classical distances for graphons such as the $l_{1}$ or $l_{2}$ metrics.

Equations

$$\boldsymbol{\Theta}_{ij}=\rho_{n}W_{0}(\xi_{i},\xi_{j})\quad\forall i\neq j\quad\text{and}\quad\boldsymbol{\Theta}_{ii}=0$$

$$\|\boldsymbol{B}\|_{\square}=\frac{1}{n^{2}}\max_{S,T\subset[n]}\Big|\sum_{i\in S,j\in T}\boldsymbol{B}_{ij}\Big|$$

$$\|W\|_{\square}=\sup_{S,T\subset[0,1]}\Big|\int_{S\times T}W(x,y)\,dx\,dy\Big|$$

$$\delta_{\square}(W_{1},W_{2})=\inf_{\tau\in\mathcal{M}}\|W_{1}-W_{2}^{\tau}\|_{\square}$$

$$\delta_{N}(W_{1},W_{2})=\inf_{\tau\in\mathcal{M}}\|W_{1}-W_{2}^{\tau}\|_{N}$$

$$\delta_{\square}(\widetilde{f}_{\boldsymbol{A}},W_{0})\leq\frac{22}{\sqrt{\log(n)}}$$

$$\operatorname{\mathbb{E}}_{W_{0}}\big[\delta_{\square}(\widetilde{f}_{\boldsymbol{A}},W_{0})\big]\leq C\sqrt{\frac{k}{n\log(k)}}$$

$$\|\boldsymbol{B}\|_{p\rightarrow q}=\inf\big\{c\geq 0:\ \|\boldsymbol{B}v\|_{l_{q}}\leq c\|v\|_{l_{p}}\ \text{for all}\ v\in l_{p}\big\}$$

$$\|\boldsymbol{A}\|_{\square}\leq\frac{1}{n^{2}}\|\boldsymbol{A}\|_{1}\leq\frac{1}{n}\|\boldsymbol{A}\|_{2}$$

$$\|W\|_{\square}\leq\|W\|_{1}\leq\|W\|_{2}\leq\|W\|_{\infty}\leq 1$$

$$\|W\|_{\infty\rightarrow 1}=\sup_{\|f\|_{\infty},\|g\|_{\infty}\leq 1}\int_{[0,1]^{2}}W(x,y)f(x)g(y)\,dx\,dy$$

$$\|W\|_{\square}\leq\|W\|_{\infty\rightarrow 1}\leq 4\|W\|_{\square}$$

$$\operatorname{\mathbb{E}}_{\boldsymbol{\Theta}_{0}}\|\boldsymbol{A}-\boldsymbol{\Theta}_{0}\|_{\square}\leq 12\sqrt{\frac{\|\boldsymbol{\Theta}_{0}\|_{1}+n}{n^{3}}}$$

$$\operatorname{\mathbb{E}}_{\boldsymbol{\Theta}_{0}}\|\boldsymbol{A}-\boldsymbol{\Theta}_{0}\|_{\square}\leq 24\sqrt{\frac{\|\boldsymbol{\Theta}_{0}\|_{\infty}}{n}}$$

$$\mathcal{T}[k]=\Big\{\boldsymbol{\Theta}_{0}:\ \exists z\in\mathcal{Z}_{n,k},\ \boldsymbol{Q}\in[0,1]^{k\times k}_{\rm sym}\ \text{such that}\ \boldsymbol{\Theta}_{ij}=\boldsymbol{Q}_{z(i)z(j)},\ i\neq j,\ \text{and}\ \boldsymbol{\Theta}_{ii}=0\ \forall i\Big\}$$

$$\mathcal{T}[k,\rho_{n}]=\big\{\boldsymbol{\Theta}_{0}\in\mathcal{T}[k]:\ \|\boldsymbol{\Theta}_{0}\|_{\infty}\leq\rho_{n}\big\}$$

$$\inf_{\widehat{\boldsymbol{\Theta}}}\sup_{\boldsymbol{\Theta}_{0}\in\mathcal{T}[2,\rho_{n}]}\operatorname{\mathbb{E}}_{\boldsymbol{\Theta}_{0}}\big[\|\widehat{\boldsymbol{\Theta}}-\boldsymbol{\Theta}_{0}\|_{\square}\big]\geq C\min\Big(\sqrt{\frac{\rho_{n}}{n}},\rho_{n}\Big)$$

$$\inf_{\widehat{\boldsymbol{\Theta}}}\sup_{\boldsymbol{\Theta}_{0}\in\mathcal{T}[k,\rho_{n}]}\operatorname{\mathbb{E}}_{\boldsymbol{\Theta}_{0}}\big[\|\widehat{\boldsymbol{\Theta}}-\boldsymbol{\Theta}_{0}\|_{\square}\big]\asymp\min\Big(\sqrt{\frac{\rho_{n}}{n}},\rho_{n}\Big)$$

$$\operatorname{\mathbb{E}}_{\boldsymbol{\Theta}_{0}}\Big[\|\boldsymbol{\Theta}_{0}-\overline{\boldsymbol{A}}\|_{\square}\Big]\leq\frac{1}{n^{2}}\operatorname{\mathbb{E}}_{\boldsymbol{\Theta}_{0}}\Big[\|\boldsymbol{\Theta}_{0}-\overline{\boldsymbol{A}}\|_{1}\Big]\leq\frac{n-1}{n}\sqrt{\operatorname{Var}(\bar{A})}\leq\sqrt{\frac{2p}{n(n-1)}}$$

$$\inf_{\widehat{\boldsymbol{\Theta}}}\sup_{\boldsymbol{\Theta}_{0}\in\mathcal{T}[1,\rho_{n}]}\operatorname{\mathbb{E}}_{\boldsymbol{\Theta}_{0}}\big[\|\widehat{\boldsymbol{\Theta}}-\boldsymbol{\Theta}_{0}\|_{\square}\big]\geq C\min\Big(\frac{\sqrt{\rho_{n}}}{n},\rho_{n}\Big)$$

$$\inf_{\widehat{\boldsymbol{\Theta}}}\sup_{\boldsymbol{\Theta}_{0}\in\mathcal{T}[k,\rho_{n}]}\operatorname{\mathbb{E}}_{\boldsymbol{\Theta}_{0}}\Big[\frac{1}{n}\|\widehat{\boldsymbol{\Theta}}-\boldsymbol{\Theta}_{0}\|_{2}\Big]\asymp\min\Big(\sqrt{\frac{\rho_{n}\log(k)}{n}}+\frac{\sqrt{\rho_{n}}\,k}{n},\rho_{n}\Big)$$

$$\widehat{\boldsymbol{\Theta}}_{k,\rho_{n}}:=\arg\min_{\boldsymbol{\Theta}\in\mathcal{T}[k,\rho_{n}]}\|\boldsymbol{\Theta}-\boldsymbol{A}\|_{2}^{2}$$

$$\operatorname{\mathbb{E}}_{\boldsymbol{\Theta}_{0}}\Big[\big\|\widehat{\boldsymbol{\Theta}}_{k,\rho_{n}}-\boldsymbol{\Theta}_{0}\big\|_{\square}\Big]\leq C\sqrt{\frac{\rho_{n}\log(k)}{n}}$$

$$\boldsymbol{A}=\sum_{j=1}^{\operatorname{rank}(\boldsymbol{A})}\sigma_{j}(\boldsymbol{A})u_{j}(\boldsymbol{A})v_{j}(\boldsymbol{A})^{T}$$

$$\widehat{\boldsymbol{\Theta}}_{\lambda}=\sum_{j:\ \sigma_{j}(\boldsymbol{A})\geq\lambda}\sigma_{j}(\boldsymbol{A})u_{j}(\boldsymbol{A})v_{j}(\boldsymbol{A})^{T}$$

$$\frac{1}{n}\|\widehat{\boldsymbol{\Theta}}_{\lambda}-\boldsymbol{\Theta}_{0}\|_{2}$$

$$\|\widehat{\boldsymbol{\Theta}}_{\lambda}-\boldsymbol{\Theta}_{0}\|_{\square}\leq\frac{1}{n}\|\widehat{\boldsymbol{\Theta}}_{\lambda}-\boldsymbol{\Theta}_{0}\|_{2\rightarrow 2}$$

$$\inf_{\widehat{\boldsymbol{\Theta}}}\sup_{\boldsymbol{\Theta}_{0}\in\mathcal{T}[k,\rho_{n}]}\operatorname{\mathbb{E}}_{\boldsymbol{\Theta}_{0}}\Big[\frac{1}{n^{2}}\|\widehat{\boldsymbol{\Theta}}-\boldsymbol{\Theta}_{0}\|_{1}\Big]\asymp\min\Big\{\sqrt{\frac{\rho_{n}\log(k)}{n}}+\frac{\sqrt{\rho_{n}}\,k}{n},\rho_{n}\Big\}$$

$$\operatorname{\mathbb{E}}\Big[\frac{1}{n^{2}}\|\widehat{\boldsymbol{\Theta}}^{r}-\boldsymbol{\Theta}_{0}\|_{F}^{2}\Big]\leq C\rho_{n}\Big(\frac{\log(k)}{n}+\frac{k^{2}}{n^{2}}\Big)$$

$$W(x,y)=\boldsymbol{Q}_{\phi(x),\phi(y)}\quad\text{for all}\ x,y\in[0,1]$$

$$\widetilde{f}_{\boldsymbol{\Theta}}(x,y)=\boldsymbol{\Theta}_{\lceil nx\rceil,\lceil ny\rceil}$$

Taxonomy

Topics: Graph theory and applications · Random Matrices and Applications · Complex Network Analysis Techniques

Full text

Olga Klopp (ESSEC Business School and CREST, France) and Nicolas Verzelen (INRA, UMR 729 MISTEA, F-34060 Montpellier, France)

Keywords: inhomogeneous random graph, graphon, W-random graphs, networks, stochastic block model, cut distance.

1 Introduction

In the last decade, network analysis has become an important research field driven by applications in social sciences, computer sciences, statistical physics, genomics, ecology, etc. A flourishing line of literature is devoted to fitting observed networks to parametric or non-parametric models of random graphs. Among the parametric models, one of the most popular is the stochastic block model [23]. In the stochastic block model with $n$ vertices and $k$ blocks, the class $Z_{i}$ of each vertex $i\in[n]$ is drawn independently in $[k]$ according to some probability distribution $\pi$. Given $Z$, the edges of the graph are then sampled independently, the probability that there is an edge between $i$ and $j$ being equal to $\boldsymbol{Q}_{Z_{i}Z_{j}}$ where $\boldsymbol{Q}=(\boldsymbol{Q}_{ij})\in[0,1]^{k\times k}$ is a given symmetric matrix. Although this model is suitable for analyzing small networks, it does not allow one to analyze the finer structures of extremely large networks. To go beyond the possible limitations of parametric models, non-parametric models of random graphs have been introduced [22, 18].

One possible non-parametric generalization of the stochastic block models is given by the $W$-random graph model [18] based on the notion of graphon. Graphons are symmetric measurable functions $W:[0,1]^{2}\rightarrow[0,1]$. In the sequel, the space of graphons is denoted by $\mathcal{W}^{+}$. Given a graphon $W_{0}\in\mathcal{W}^{+}$, a graph on $n$ vertices is sampled according to the $W$-random graph model in the following way. Let $\boldsymbol{\Theta}_{0}=(\boldsymbol{\Theta}_{ij})$ be an $n\times n$ random symmetric matrix defined by

$$\boldsymbol{\Theta}_{ij}=\rho_{n}W_{0}(\xi_{i},\xi_{j})\quad\forall i\neq j\quad\text{and}\quad\boldsymbol{\Theta}_{ii}=0,\tag{1}$$

where $1\geq\rho_{n}>0$ is the scale parameter that can be interpreted as the expected proportion of non-zero edges and $\xi_{1},\dots,\xi_{n}$ are unobserved (latent) i.i.d. random variables uniformly distributed on $[0,1]$. Then, given $\boldsymbol{\Theta}_{0}$, the graph is sampled according to the inhomogeneous random graph model [6]. More precisely, vertices $i$ and $j$ are connected by an edge with probability $\boldsymbol{\Theta}_{ij}$ and these events are independent for all pairs $(i,j)$ with $i<j$. When $\boldsymbol{\Theta}_{0}$ is considered as a deterministic matrix, we call this model the inhomogeneous random graph model with respect to $\boldsymbol{\Theta}_{0}$. If $W_{0}$ is a step-function with $k$ steps, the graph is distributed as a stochastic block model with $k$ groups. The case of a dense graph corresponds to $\rho_{n}=1$, whereas the choice $\rho_{n}\rightarrow 0$ when $n\rightarrow\infty$ produces sparser graphs. This model was recently studied by a number of authors, see e.g., [4, 5, 34, 27, 17, 16].
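The two-stage sampling scheme just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sample_w_random_graph`, the example graphon `two_step_graphon`, and the chosen values of `n`, `rho` and `seed` are all hypothetical and do not come from the paper.

```python
import random


def sample_w_random_graph(w0, n, rho=1.0, seed=None):
    """Sample latent positions xi, the probability matrix theta, and the
    adjacency matrix adj from the W-random graph model with graphon rho * w0."""
    rng = random.Random(seed)
    xi = [rng.random() for _ in range(n)]  # latent i.i.d. uniform positions
    theta = [[0.0] * n for _ in range(n)]  # diagonal stays at zero
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            p = rho * w0(xi[i], xi[j])
            theta[i][j] = theta[j][i] = p
            # Independent Bernoulli(p) edge for each pair i < j.
            edge = 1 if rng.random() < p else 0
            adj[i][j] = adj[j][i] = edge
    return xi, theta, adj


def two_step_graphon(x, y):
    """Hypothetical 2-step graphon: 0.8 within a block, 0.2 across blocks."""
    same_block = (x < 0.5) == (y < 0.5)
    return 0.8 if same_block else 0.2


xi, theta, adj = sample_w_random_graph(two_step_graphon, n=6, rho=0.5, seed=0)
```

With a step-function graphon such as this one, the sampled graph follows a stochastic block model with two groups, and only `adj` would be available to the statistician.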

In the present paper we consider the problems of estimating the matrix of connection probabilities $\boldsymbol{\Theta}_{0}$ and the graphon $f_{0}=\rho_{n}W_{0}$ from a single observation of a graph. Suppose that we observe the $n\times n$ adjacency matrix $\boldsymbol{A}=(\boldsymbol{A}_{ij})$ of a graph that has either been sampled according to the inhomogeneous random graph model with a fixed matrix $\boldsymbol{\Theta}_{0}$ or to the $W$ -random graph model with graphon $W_{0}$ . Then, given a single observation $\boldsymbol{A}$ , we want to estimate $\boldsymbol{\Theta}_{0}$ or $f_{0}$ .

Graphon estimation is more challenging than probability matrix estimation, in particular, because of identifiability issues: multiple graphons can lead to the same distribution on the space of graphs of size $n$. This is not unexpected as the distribution of the network is invariant with respect to any change of labeling of its nodes. More precisely, two graphons $U$ and $W$ in $\mathcal{W}^{+}$ define the same probability distribution if and only if there exist measure preserving maps $\phi$, $\psi$: $[0,1]\to[0,1]$ such that $U\left(\phi(x),\phi(y)\right)=W\left(\psi(x),\psi(y)\right)$ almost everywhere. This equivalence relation is called a weak isomorphism [28]. The corresponding quotient space is denoted by $\widetilde{\mathcal{W}}^{+}$. As a consequence, one can only estimate the equivalence class of $\rho_{n}W_{0}$ in $\widetilde{\mathcal{W}}^{+}$ and we refer henceforth to graphon estimation as the problem of estimating this equivalence class from the adjacency matrix $\boldsymbol{A}$ sampled from the $W$-random graph model (1). When there is no ambiguity, we shall identify a graphon $W\in\mathcal{W}^{+}$ and its corresponding equivalence class.

The problem of estimating $\boldsymbol{\Theta}_{0}$ was previously considered in a number of papers. For the matrix estimation problem, the quality of an estimator $\widehat{\boldsymbol{\Theta}}$ is usually assessed through the Frobenius loss $\|\widehat{\boldsymbol{\Theta}}-\boldsymbol{\Theta}_{0}\|_{2}$. For instance, [16] obtain sub-optimal convergence rates for this problem using a singular value thresholding algorithm. Relying on a least-squares estimator, [20] established the minimax estimation rates for $\boldsymbol{\Theta}_{0}$ on classes of block constant matrices and smooth graphon classes. Their analysis is restricted to the dense case with constant $\|\boldsymbol{\Theta}_{0}\|_{\infty}$. More recently, [26] extended their results to the sparse case, where $\|\boldsymbol{\Theta}_{0}\|_{\infty}$ depends on $n$ and goes to zero when $n\rightarrow\infty$.

As for graphon estimation, most results on estimation error are expressed in terms of the $l_{2}$ loss $\|\widehat{W}-W_{0}\|_{2}$ (see below for a formal definition of this metric). For classes of smooth graphons, estimators based on maximum likelihood, restricted least-squares estimators, or neighborhood smoothing have been studied in [33, 15, 1, 14, 35, 26]. For classes of step-function graphons, restricted least-squares estimators have been considered in [10, 26] and the minimax optimal rates of convergence have been derived in [26].

Although one can take advantage of the Euclidean structure of the Frobenius matrix norm and the $l_{2}$ metric on $\mathcal{W}^{+}$, neither of these metrics readily reflects closeness in terms of the topology of the random graphs. As the graphon space is infinite-dimensional, not all norms are equivalent, and one may wonder whether the graphon estimation problem should be studied with respect to a more suitable distance. We argue below that the cut distance, which plays a central role in random graph theory, is a good candidate for this.

1.1 Cut metric

One of the fundamental questions in graph theory is the following one: what does it mean for two large graphs to be similar or close? There are different ways of defining the distance between two graphs. For example, the edit distance is defined as the normalized Hamming distance of the edge sets. Up to a normalization, it corresponds to the $l_{1}$ distance between the adjacency matrices. One of the troubles with this notion of distance is that it does not reflect well structural similarities between two graphs. For instance, the edit distance between two independent graphs drawn from the Erdős-Rényi model $\mathcal{G}(n,p)$ with $p=1/2$ is close to 1/2 with high probability. Another notion of distance, called cut distance, better reflects the structural similarity. The cut norm of a matrix $\boldsymbol{B}=(\boldsymbol{B}_{ij})\in\mathbb{R}^{n\times n}$ has been introduced by Frieze and Kannan [19]. It is defined by

$$\|\boldsymbol{B}\|_{\square}=\frac{1}{n^{2}}\max_{S,T\subset[n]}\Big|\sum_{i\in S,j\in T}\boldsymbol{B}_{ij}\Big|.$$

In other words, $\|\boldsymbol{B}\|_{\square}$ corresponds (up to a renormalization) to the maximal absolute value of the sum of entries over all submatrices of $\boldsymbol{B}$. Then, the cut distance $d_{\square}(G,G^{\prime})$ between two graphs $G$ and $G^{\prime}$ defined on the same set of nodes and with adjacency matrices $\boldsymbol{A}$ and $\boldsymbol{A}^{\prime}$ is defined as the cut norm $\|\boldsymbol{A}-\boldsymbol{A}^{\prime}\|_{\square}$. Denoting by $e_{G}(S,T)$ the number of edges between nodes in $S$ and nodes in $T$ in the graph $G$, the cut distance $d_{\square}(G,G^{\prime})$ is the maximum over all $S,T\subset[n]$ of $|e_{G}(S,T)-e_{G^{\prime}}(S,T)|/n^{2}$. In other words, $d_{\square}(G,G^{\prime})$ is small if the restrictions of $G$ and $G^{\prime}$ to all subsets $S,T$ have similar edge densities.
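For very small matrices the cut norm can be computed exactly by enumeration, which makes the comparison with the edit distance concrete. The sketch below is illustrative only: `cut_norm` and `erdos_renyi` are hypothetical helper names, the enumeration takes time exponential in $n$, and the choice $n=10$ is made purely so that it runs quickly. For a fixed row set $S$, the best column set $T$ collects either all positive or all negative restricted column sums, so only the subsets $S$ need to be enumerated.

```python
import random


def cut_norm(b):
    """Exact cut norm (1/n^2) * max_{S,T} |sum_{i in S, j in T} b[i][j]|.

    Enumerates all row subsets S (2^n of them); for each S the optimal T
    takes either every positive or every negative restricted column sum.
    """
    n = len(b)
    best = 0.0
    for mask in range(1 << n):
        col = [0.0] * n  # column sums restricted to the rows in S
        for i in range(n):
            if (mask >> i) & 1:
                for j in range(n):
                    col[j] += b[i][j]
        positive = sum(c for c in col if c > 0)
        negative = -sum(c for c in col if c < 0)
        best = max(best, positive, negative)
    return best / n ** 2


def erdos_renyi(n, p, rng):
    """Adjacency matrix of an Erdos-Renyi graph G(n, p) without self-loops."""
    a = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            a[i][j] = a[j][i] = 1 if rng.random() < p else 0
    return a


rng = random.Random(0)
n = 10
a1, a2 = erdos_renyi(n, 0.5, rng), erdos_renyi(n, 0.5, rng)
diff = [[a1[i][j] - a2[i][j] for j in range(n)] for i in range(n)]
edit = sum(abs(x) for row in diff for x in row) / n ** 2  # normalized l1 distance
cut = cut_norm(diff)
print(f"edit distance = {edit:.3f}, cut distance = {cut:.3f}")
```

The cut distance can never exceed the normalized $l_{1}$ distance, and for two independent Erdős-Rényi graphs it shrinks as $n$ grows while the edit distance stays near 1/2.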

Let us denote $\mathcal{W}$ the collection of symmetric measurable functions $[0,1]^{2}\rightarrow[-1,1]$ . By analogy with the matrix cut norm, we can define the cut norm of a kernel $W\in\mathcal{W}$ :

$$\|W\|_{\square}=\sup_{S,T\subset[0,1]}\Big|\int_{S\times T}W(x,y)\,dx\,dy\Big|,$$

where the supremum is taken over all measurable subsets $S$ and $T$ . Then, the distance $d_{\square}(W,W^{\prime})$ between two graphons $W$ and $W^{\prime}$ in $\mathcal{W}^{+}$ is simply $\|W-W^{\prime}\|_{\square}$ . As explained earlier in the introduction, graphons in $\mathcal{W}^{+}$ are not identifiable. This is why we consider the metric induced by $\|\cdot\|_{\square}$ on the quotient space $\widetilde{\mathcal{W}}^{+}$ defined by

$$\delta_{\square}(W_{1},W_{2})=\inf_{\tau\in\mathcal{M}}\|W_{1}-W_{2}^{\tau}\|_{\square},\tag{3}$$

where we take the infimum in the set $\mathcal{M}$ of all measure-preserving bijections $\tau:[0,1]\rightarrow[0,1]$ and $W^{\tau}(x,y)=W(\tau(x),\tau(y))$ .

The cut distance is also a cornerstone in the graph limit theory introduced by Lovász and Szegedy [29] and further developed in, e.g., [8, 9]. In particular, this theory states that graphons can be interpreted as limits (with respect to $\delta_{\square}$) of graph sequences. Besides, convergence in $\delta_{\square}$ is equivalent to other structural properties such as the convergence of all homomorphism numbers. Given a simple graph $F$ with $q$ nodes and a graphon $W_{0}$, the homomorphism number $t(F,W_{0})$ is the probability that the edge set of a graph of size $q$ sampled from the model (1) (with $\rho_{n}=1$) contains the edge set of $F$. As a consequence, the homomorphism numbers $t(F,W_{0})$ and $t(F,W^{\prime}_{0})$ are close when the expected number of subgraphs $F$ for a size $n$ random graph $G$ sampled from $W_{0}$ is close to that of a size $n$ random graph sampled from $W^{\prime}_{0}$. It has been established that convergence in the cut distance is equivalent to convergence of homomorphism numbers for all simple graphs $F$ (see Theorem 11.5 in [28] for more details). Hence, estimating the graphon $W_{0}$ well in the cut distance allows one to estimate well the number of small patterns induced by $W_{0}$. On the other hand, the cut distance controls other quantities of interest for computer scientists such as the size of multi-way cuts [12, 10]. So, a consistent estimator of $W_{0}$ in cut distance gives consistent estimators for the multi-way cuts.
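For a step-function graphon the homomorphism number reduces to a finite sum, which gives a concrete feel for the definition above. The sketch below assumes the standard integral form $t(F,W)=\int\prod_{(u,v)\in E(F)}W(x_{u},x_{v})\,dx$; the function name `hom_density_step`, the block proportions `pi`, and the example matrices are hypothetical and not taken from the paper.

```python
from itertools import product


def hom_density_step(edges, num_nodes, Q, pi):
    """Homomorphism number t(F, W) for a step-function graphon W.

    W is constant equal to Q[a][b] on block a x block b, and block a has
    Lebesgue measure pi[a]. F has nodes 0..num_nodes-1 and edge list `edges`.
    """
    k = len(pi)
    total = 0.0
    # Sum over every assignment of the nodes of F to the k blocks.
    for labels in product(range(k), repeat=num_nodes):
        weight = 1.0
        for a in labels:
            weight *= pi[a]  # probability of this block assignment
        for u, v in edges:
            weight *= Q[labels[u]][labels[v]]  # every edge of F must be present
        total += weight
    return total


TRIANGLE = [(0, 1), (1, 2), (0, 2)]

# Constant graphon 1/2: each of the three edges appears independently.
print(hom_density_step(TRIANGLE, 3, [[0.5]], [1.0]))

# Two disconnected halves: a triangle needs all three nodes in the same block.
print(hom_density_step(TRIANGLE, 3, [[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5]))
```

Two graphons that are close in cut distance give close values of this quantity for every fixed $F$, which is the sense in which the cut distance controls small-pattern counts.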

The construction of $\delta_{\square}$ can be extended to any other norm $N$ that is invariant under measure preserving maps:

$$\delta_{N}(W_{1},W_{2})=\inf_{\tau\in\mathcal{M}}\|W_{1}-W_{2}^{\tau}\|_{N}.\tag{4}$$

Besides the cut norm, we already mentioned the $l_{1}$ and $l_{2}$ -norms on $\mathcal{W}$ defined by $\|W\|_{1}=\int_{[0,1]^{2}}|W(x,y)|dxdy$ and $\|W\|_{2}=[\int_{[0,1]^{2}}W^{2}(x,y)dxdy]^{1/2}$ . These two norms define the corresponding distances $\delta_{1}$ and $\delta_{2}$ on the quotient space $\widetilde{\mathcal{W}}^{+}$ . The distance $\delta_{\square}$ is dominated by $\delta_{1}$ and $\delta_{2}$ (for details see Section 2.2). As already noted for instance in [10], this immediately implies that the convergence rate of an estimator $\widehat{W}$ with respect to the $\delta_{\square}$ -distance is at least as fast as its convergence rate with respect to the $\delta_{2}$ -distance. Then, one may wonder whether the convergence rates in $\delta_{\square}$ -distance can be significantly faster and whether those faster rates are achieved by the estimators that are already minimax optimal with respect to other metrics.

In fact, a partial result on uniform convergence rates has already been proved. One striking consequence of the celebrated Szemerédi’s Lemma [31] is that an adjacency matrix sampled from a $W$-random graph model converges to the true graphon $W_{0}$ in cut distance, at a uniform rate over all graphons. To be more specific, let $W_{0}\in\mathcal{W}^{+}$ be a graphon and let $\boldsymbol{A}$ be the size $n$ adjacency matrix sampled according to the $W$-random graph model (1) with $\rho_{n}=1$. It has been shown in [8] (see also [2] or [28]) that, with high probability, the empirical graphon $\widetilde{f}_{\boldsymbol{A}}$ associated to the adjacency matrix $\boldsymbol{A}$ (see (19) for a precise definition) is $O(1/\sqrt{\log(n)})$ close in the cut distance to the true graphon $W_{0}$:

Proposition 1 (Lemma 10.16 [28]).

Let $n\geq 1$ and let $W_{0}\in\mathcal{W}^{+}$ be a graphon. Then, with probability at least $1-\exp\left\{-n/(2\log n)\right\}$ ,

$$\delta_{\square}(\widetilde{f}_{\boldsymbol{A}},W_{0})\leq\frac{22}{\sqrt{\log(n)}}.$$

An important point is that the above result is valid for all $W_{0}\in\mathcal{W}^{+}$ . Note that if we replace the cut-distance by $\delta_{1}$ or $\delta_{2}$ -distance this is not true any more: even in the simple case of a constant graphon $W_{0}\equiv a$ (with $a\in(0,1)$ ), the $l_{2}$ distance between $\tilde{f}_{\boldsymbol{A}}$ and $W_{0}$ does not converge to zero.
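The failure of $l_{2}$ convergence for a constant graphon follows from a one-line argument, sketched here for completeness. It uses only that $\widetilde{f}_{\boldsymbol{A}}$ takes values in $\{0,1\}$ and that a constant graphon is unchanged by any measure-preserving bijection $\tau$.

```latex
% W_0 \equiv a with a in (0,1), so W_0^\tau = W_0 for every \tau in \mathcal{M}.
\[
|\widetilde{f}_{\boldsymbol{A}}(x,y)-a|\in\{a,\,1-a\}\ \text{ for all } x,y
\quad\Longrightarrow\quad
\delta_{2}(\widetilde{f}_{\boldsymbol{A}},W_{0})
=\|\widetilde{f}_{\boldsymbol{A}}-a\|_{2}
\geq\min(a,\,1-a)>0 .
\]
```

The lower bound holds for every $n$ and every realization of $\boldsymbol{A}$, so no amount of data brings the empirical graphon close to $W_{0}$ in $\delta_{2}$.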

1.2 Our contribution and related results

Our purpose in this paper is to go beyond uniform convergence rates over all graphons in $\mathcal{W}^{+}$ and to understand the optimal cut distance convergence rates when $W_{0}$ has a specific structure. First, optimal convergence rates are derived for the estimation of the connection probability matrix $\boldsymbol{\Theta}_{0}$ when it belongs to classes of block-constant matrices. Second, we establish the optimal convergence rates for all classes of step-function graphons $f=\rho_{n}W_{0}$ both in the sparse and the dense case. In particular for $\rho_{n}=1$ (dense case), our results imply that, for any integer $k\in[2,n]$ and any $k$-step graphon $W_{0}$, one has

$$\operatorname{\mathbb{E}}_{W_{0}}\big[\delta_{\square}(\widetilde{f}_{\boldsymbol{A}},W_{0})\big]\leq C\sqrt{\frac{k}{n\log(k)}},\tag{6}$$

where $C$ is a numerical constant (independent of $n$ and $k$ ) and that this convergence rate is optimal from the minimax point of view. This result has some interesting implications. In particular, this guarantees the optimality of the $\log(n)^{-1/2}$ rate in Proposition 1 for general graphons. On the other hand, our results imply that for more structured classes of graphons ( $k\ll n$ ) much faster rates are achievable. Interestingly, we show that the adjacency matrix and its associated empirical graphons are already adaptive to the unknown number of blocks of the matrix $\boldsymbol{\Theta}_{0}$ or steps of $W_{0}$ and minimax optimal. As a consequence, there is no need to look for more involved estimators.

In practice, it could be disappointing that the raw data are already optimal with respect to the cut distance, whereas they perform really badly with respect to the $\delta_{2}$ distance. This is why we prove that a singular value hard thresholding estimator is still optimal with respect to the cut metric $\delta_{\square}$ while achieving the best known rate in $\delta_{2}$ -distance in the class of polynomial-time estimators.

Our results are in sharp contrast to all aforementioned manuscripts [33, 15, 10, 1, 14, 35, 26] whose primary focus is the $\delta_{2}$-distance and whose convergence rates with respect to the $\delta_{\square}$-distance are derived from the domination of $\delta_{\square}$ by $\delta_{2}$. Closest to our contributions is the recent paper [7], where the authors introduce a least-cut norm estimator for a more general model of unbounded graphons. Translated into our framework, their non-polynomial time algorithm achieves, in some cases, the optimal convergence rate (up to a logarithmic loss) and is slower in other cases. In Section 4.3 we extend our study to unbounded graphons and compare our results to those of [7]. In particular, our Proposition 8 implies that the empirical graphons associated to the adjacency matrix and to the singular value hard thresholding estimator are optimal (up to a logarithmic factor) also in the general case of unbounded graphons. Note that the main difference with the method proposed in [7] is that both our estimators can be easily computed in polynomial time.

From a technical point of view, the tools needed for deriving optimal cut distance rates differ from those used for the $\delta_{2}$-distance. For establishing the minimax lower bounds, the main technical hurdle is to build a collection of graphons that are well spaced with respect to the cut distance. Indeed, the cut distance $\delta_{\square}(W_{1},W_{2})$ is difficult to lower bound as it is defined as an infimum over all measure-preserving transformations. As for the minimax upper bound on the estimation error in (6), it can be obtained quite easily without the correct logarithmic term thanks to Bernstein’s inequality together with some bounds from [26] for the stronger metric $\delta_{2}$. However, recovering the right logarithmic term in (6) is much more challenging. The proof relies among other things on a careful application of Szemerédi’s regularity lemma to distorted versions of the graphon.

The manuscript is organized as follows. In Section 2, we recall some basic results related to the cut metric. The problem of estimating the matrix of connection probabilities is considered in Section 3. We study the problem of graphon estimation in Section 4. The appendix contains all the proofs; Appendix A recalls some basic facts and results that are often used in the proofs.

2 Notation and Preliminaries

2.1 Notation

We gather here some of the notation used throughout this paper. Some of them have already been defined in the introduction.

• For a matrix $\boldsymbol{B}$, $\boldsymbol{B}_{ij}$ (or $\boldsymbol{B}_{i,j}$, or $(\boldsymbol{B})_{ij}$) is its $(i,j)$-th entry. Let $\boldsymbol{B}_{i,\cdot}$ and $\boldsymbol{B}_{\cdot,j}$ stand for its $i$th row and $j$th column respectively. We denote by $\mathbb{R}^{k\times k}_{\rm sym}$ the class of all symmetric $k\times k$ matrices with real-valued entries. Given a matrix $\boldsymbol{B}$ and $p\in[1,\infty]$, $\|\boldsymbol{B}\|_{p}$ denotes its entry-wise $l_{p}$ norm, that is $\|\boldsymbol{B}\|_{p}^{p}=\sum_{i,j}|\boldsymbol{B}_{ij}|^{p}$ for $p<\infty$ and $\|\boldsymbol{B}\|_{\infty}=\max_{i,j}|\boldsymbol{B}_{ij}|$. Given $(p,q)\in[1,\infty]^{2}$, $\|\boldsymbol{B}\|_{p\rightarrow q}$ stands for its $l_{p}\rightarrow l_{q}$ operator norm:

$$\|\boldsymbol{B}\|_{p\rightarrow q}=\inf\big\{c\geq 0:\ \|\boldsymbol{B}v\|_{l_{q}}\leq c\|v\|_{l_{p}}\ \text{for all}\ v\in l_{p}\big\}.$$

Finally, $\langle\boldsymbol{D},\boldsymbol{B}\rangle=\sum_{i,j}\boldsymbol{D}_{ij}\boldsymbol{B}_{ij}$ stands for the canonical inner product between matrices $\boldsymbol{D},\boldsymbol{B}\in\mathbb{R}^{n\times n}$.

• $\mathcal{W}$ is the collection of symmetric measurable functions $[0,1]^{2}\rightarrow[-1,1]$. Given a kernel $W\in\mathcal{W}$ and $p\in(1,\infty)$, its $l_{p}$ norm is defined by $\|W\|^{p}_{p}=\int|W(x,y)|^{p}dxdy$, whereas $\|W\|_{\infty}=\mathrm{ess\ sup}_{x,y}|W(x,y)|$. $\mathcal{W}^{+}$ is the space of graphons and $\widetilde{\mathcal{W}}^{+}$ is the corresponding quotient space. The cut distance $\delta_{\square}(\cdot,\cdot)$ on the graphon space is defined by (3). Also, $\delta_{1}(\cdot,\cdot)$ and $\delta_{2}(\cdot,\cdot)$, defined by (4), respectively correspond to the $l_{1}$ and $l_{2}$ distances on the quotient space of graphons $\widetilde{\mathcal{W}}^{+}$. Given a symmetric square matrix $\boldsymbol{\Theta}$ with values in $[0,1]$, $\widetilde{f}_{\boldsymbol{\Theta}}$ is the empirical graphon associated to $\boldsymbol{\Theta}$ as defined in (19).

• Given a probability matrix $\boldsymbol{\Theta}_{0}$, we denote by $\operatorname{\mathbb{E}}_{\boldsymbol{\Theta}_{0}}$ the expectation with respect to the distribution of $\boldsymbol{A}$ if we consider the inhomogeneous random graph model and, given a graphon $W$ and $\rho_{n}$, we write $\operatorname{\mathbb{E}}_{W}$ for the expectation with respect to the joint distribution of $({\boldsymbol{\xi}},\boldsymbol{A})$.

• We denote by $\lfloor x\rfloor$ the maximal integer less than or equal to $x$ and by $\lceil x\rceil$ the smallest integer greater than or equal to $x$. For a positive integer $m$, set $[m]=\{1,\dots,m\}$. $\mathds{1}_{A}(\cdot)$ denotes the indicator function of a set $A$. In the sequel, $C$ stands for a positive constant that can vary from line to line. These are absolute constants unless otherwise mentioned. For two positive functions $f$ and $g$, we write $f\asymp g$ when there exist two positive numerical constants $C$ and $C^{\prime}$ such that $Cg\leq f\leq C^{\prime}g$. Finally, $\lambda$ is the Lebesgue measure on the interval $[0,1]$.

• Given an $n\times n$ matrix $\boldsymbol{\Theta}$ with entries in $[0,1]$, we define the empirical graphon $\widetilde{f}_{\boldsymbol{\Theta}}$ as the following piecewise constant function: $\widetilde{f}_{\boldsymbol{\Theta}}(x,y)=\boldsymbol{\Theta}_{\lceil nx\rceil,\lceil ny\rceil}$ for all $x$ and $y$ in $(0,1]$.
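The empirical graphon of the last item is simple to evaluate in code. A minimal sketch follows; the function name `empirical_graphon` and the 2 x 2 example matrix are hypothetical, and the only subtlety is the shift from the 1-based index $\lceil nx\rceil$ to 0-based Python lists.

```python
import math


def empirical_graphon(theta):
    """Return the piecewise constant function (x, y) -> theta[ceil(n x)][ceil(n y)]
    on (0, 1]^2, with the 1-based indices of the text shifted to 0-based."""
    n = len(theta)

    def f(x, y):
        if not (0 < x <= 1 and 0 < y <= 1):
            raise ValueError("x and y must lie in (0, 1]")
        return theta[math.ceil(n * x) - 1][math.ceil(n * y) - 1]

    return f


f = empirical_graphon([[0, 1], [1, 0]])
print(f(0.25, 0.75))  # x falls in the first half, y in the second half
```

Each entry of the matrix thus becomes a square of side $1/n$ on which the function is constant.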

2.2 Preliminaries

We start with a few basic properties of the cut norm for matrices $\boldsymbol{A}$ and graphons $W$ . It is easy to see that

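$$\|\boldsymbol{A}\|_{\square}\leq\frac{1}{n^{2}}\|\boldsymbol{A}\|_{1}\leq\frac{1}{n}\|\boldsymbol{A}\|_{2},$$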

where $\|\cdot\|_{1}$ and $\|\cdot\|_{2}$ are the usual entry-wise $l_{1}$ and $l_{2}$ -norms of a matrix. For a function $W\in\mathcal{W}$ , we have

$$\|W\|_{\square}\leq\|W\|_{1}\leq\|W\|_{2}\leq\|W\|_{\infty}\leq 1,$$

where $\|\cdot\|_{1}$ and $\|\cdot\|_{2}$ denote $l_{1}$ and $l_{2}$ -norms of a graphon. In the opposite direction, we have $\|W\|_{2}\leq\sqrt{\|W\|_{1}}$ . As a consequence, the metric $\delta_{1}$ and $\delta_{2}$ define the same topology on the space $\widetilde{\mathcal{W}}^{+}$ of graphons. In contrast, the cut distance $\delta_{\square}$ defines a weaker topology on the space $\widetilde{\mathcal{W}}^{+}$ as illustrated by the aforementioned sampling result (Proposition 1).
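For readers who want the missing steps, the norm comparisons above follow from the triangle inequality, the Cauchy-Schwarz inequality over the $n^{2}$ entries, and the bound $|W|\leq 1$. A short sketch:

```latex
\begin{align*}
\Big|\sum_{i\in S,\,j\in T}\boldsymbol{A}_{ij}\Big|
  &\le \sum_{i,j}|\boldsymbol{A}_{ij}| = \|\boldsymbol{A}\|_{1}
  && \Longrightarrow\quad \|\boldsymbol{A}\|_{\square}\le \tfrac{1}{n^{2}}\|\boldsymbol{A}\|_{1},\\
\|\boldsymbol{A}\|_{1}=\sum_{i,j}1\cdot|\boldsymbol{A}_{ij}|
  &\le \sqrt{n^{2}}\,\|\boldsymbol{A}\|_{2}
  && \Longrightarrow\quad \tfrac{1}{n^{2}}\|\boldsymbol{A}\|_{1}\le \tfrac{1}{n}\|\boldsymbol{A}\|_{2},\\
\|W\|_{2}^{2}=\int W^{2}(x,y)\,dx\,dy
  &\le \int |W(x,y)|\,dx\,dy=\|W\|_{1}
  && \text{since } |W|\le 1 .
\end{align*}
```

The last line is the reverse inequality $\|W\|_{2}\leq\sqrt{\|W\|_{1}}$ that makes $\delta_{1}$ and $\delta_{2}$ topologically equivalent on bounded kernels.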

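We shall also sometimes rely on the equivalence between the cut norm and the $l_{\infty}\rightarrow l_{1}$ operator norm:

$$\|W\|_{\infty\rightarrow 1}=\sup_{\|f\|_{\infty},\|g\|_{\infty}\leq 1}\int_{[0,1]^{2}}W(x,y)f(x)g(y)\,dx\,dy,$$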

where the supremum is taken over all (real-valued) functions $f$ and $g$ with values in $[-1,1]$ . It is known that (see e.g., [24])

$$\|W\|_{\square}\leq\|W\|_{\infty\rightarrow 1}\leq 4\|W\|_{\square}.$$

3 Probability matrix estimation

3.1 Cut norm minimax risk

We start with a simple proposition that bounds the expected cut distance between $\boldsymbol{\Theta}_{0}$ and the sampled adjacency matrix $\boldsymbol{A}$. Similar results already appeared in the literature, see e.g., [28, Lemma 10.11], [7] or [21]. Its proof is based on Bernstein’s inequality and is given in Section B.

Proposition 2.

For any probability matrix $\boldsymbol{\Theta}_{0}$ we have

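$$\operatorname{\mathbb{E}}_{\boldsymbol{\Theta}_{0}}\|\boldsymbol{A}-\boldsymbol{\Theta}_{0}\|_{\square}\leq 12\sqrt{\frac{\|\boldsymbol{\Theta}_{0}\|_{1}+n}{n^{3}}}.$$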

In particular, if $\|\boldsymbol{\Theta}_{0}\|_{\infty}\geq 1/n$ , we get

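$$\operatorname{\mathbb{E}}_{\boldsymbol{\Theta}_{0}}\|\boldsymbol{A}-\boldsymbol{\Theta}_{0}\|_{\square}\leq 24\sqrt{\frac{\|\boldsymbol{\Theta}_{0}\|_{\infty}}{n}}.$$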

This implies that the adjacency matrix $\boldsymbol{A}$ is $\sqrt{\|\boldsymbol{\Theta}_{0}\|_{\infty}/n}$ -close in cut-distance to the probability matrix $\boldsymbol{\Theta}_{0}$ . This bound is valid for all matrices $\boldsymbol{\Theta}_{0}$ . It turns out that no estimator can perform much better than $\boldsymbol{A}$ , even on some simple classes of parameters $\boldsymbol{\Theta}_{0}$ .
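The step from the first bound of Proposition 2 to the second can be sketched as follows, assuming the first bound reads $12\sqrt{(\|\boldsymbol{\Theta}_{0}\|_{1}+n)/n^{3}}$ (the square roots are inferred from the $\sqrt{\|\boldsymbol{\Theta}_{0}\|_{\infty}/n}$ rate stated in the text).

```latex
% Assumption: \|\Theta_0\|_\infty \ge 1/n, so that n \le n^2 \|\Theta_0\|_\infty.
\[
\|\boldsymbol{\Theta}_{0}\|_{1}\le n^{2}\|\boldsymbol{\Theta}_{0}\|_{\infty}
\quad\text{and}\quad
n\le n^{2}\|\boldsymbol{\Theta}_{0}\|_{\infty}
\quad\Longrightarrow\quad
12\sqrt{\frac{\|\boldsymbol{\Theta}_{0}\|_{1}+n}{n^{3}}}
\le 12\sqrt{\frac{2\|\boldsymbol{\Theta}_{0}\|_{\infty}}{n}}
=12\sqrt{2}\,\sqrt{\frac{\|\boldsymbol{\Theta}_{0}\|_{\infty}}{n}}
\le 24\sqrt{\frac{\|\boldsymbol{\Theta}_{0}\|_{\infty}}{n}} .
\]
```

The condition $\|\boldsymbol{\Theta}_{0}\|_{\infty}\geq 1/n$ is exactly what lets the additive $n$ in the numerator be absorbed into the leading term.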

Let $n,k$ be integers such that $2\leq k\leq n$ and $\mathcal{T}[k]$ be defined by

[TABLE]

where we denote by $\mathcal{Z}_{n,k}$ the set of all mappings $z$ from $[n]$ to $[k]$ . In other words $\mathcal{T}[k]$ is made of matrices that, up to a permutation of their rows and their columns, are (up to the diagonal) block constants with at most $k$ blocks. Also, this corresponds to connection probability matrices of $k$ -class stochastic blocks models whose vector label $Z=(Z_{a})$ has been fixed. For any $\rho_{n}\in(0,1]$ , consider the set

[TABLE]

of matrices whose largest value is smaller than or equal to $\rho_{n}$. The following proposition, proved in Section C, gives a lower bound on the minimax risk over the class $\mathcal{T}[2,\rho_{n}]$ of block-constant matrices with only two blocks:

Proposition 3.

The minimax risk measured in cut norm satisfies

[TABLE]

where $\operatorname{\mathbb{E}}_{\boldsymbol{\Theta}_{0}}$ denotes the expectation with respect to the distribution of $\boldsymbol{A}$ when the underlying probability matrix is $\boldsymbol{\Theta}_{0}$ .

Comparing Proposition 3 with Proposition 2, we observe that the raw data $\boldsymbol{A}$ is minimax optimal for the class $\mathcal{T}[2,\rho_{n}]$ for all $\rho_{n}\geq 1/n$. As a consequence, there is no need to look for a more involved estimator. Since for $\rho_{n}\leq 1/n$ the constant estimator $\widehat{\boldsymbol{\Theta}}=0$ satisfies $\operatorname{\mathbb{E}}_{\boldsymbol{\Theta}_{0}}[\|\widehat{\boldsymbol{\Theta}}-\boldsymbol{\Theta}_{0}\|_{\square}]\leq\rho_{n}$ and since the collections $\mathcal{T}[k,\rho_{n}]$ are nested, the two previous propositions imply that the optimal cut norm estimation rate for $\mathcal{T}[k,\rho_{n}]$ with $k\geq 2$ is given by

[TABLE]

Until now, we left aside the specific case of constant matrices $\mathcal{T}[1,\rho_{n}]$, which correspond to Erdős-Rényi random graphs. It turns out that the situation is quite different for this simple class. For a constant matrix $\boldsymbol{\Theta}_{0}$, estimating $\boldsymbol{\Theta}_{0}$ given $\boldsymbol{A}$ amounts to inferring the parameter $p$ of a Bernoulli distribution given a sample of size $n(n-1)/2$. From this analogy, we consider the matrix $\overline{\boldsymbol{A}}$ all of whose off-diagonal entries are equal to $\bar{A}=\sum_{i,j}\boldsymbol{A}_{ij}/(n(n-1))$. Then, it is straightforward to prove that

[TABLE]

which is $\sqrt{n}$ -faster than what is achieved by the adjacency matrix $\boldsymbol{A}$ . Using again the analogy with the problem of Bernoulli parameter estimation, one may easily get the following minimax lower bound:

[TABLE]

which assesses that the $\sqrt{\rho_{n}}/n$ -rate achieved by $\overline{\boldsymbol{A}}$ is optimal.
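The averaging estimator can be sketched in a few lines. In the example below, `average_estimator` builds $\overline{\boldsymbol{A}}$ and `cut_error_constant` evaluates $\|\overline{\boldsymbol{A}}-\boldsymbol{\Theta}_{0}\|_{\square}$ in closed form for a constant $\boldsymbol{\Theta}_{0}$ with off-diagonal value $p$. Both function names, the values of $n$ and $p$ and the seed are illustrative assumptions; the closed form uses that all off-diagonal entries of the difference share one sign, so the maximum over $(S,T)$ is attained at $S=T=[n]$.

```python
import random


def average_estimator(A):
    """Matrix whose off-diagonal entries all equal the mean off-diagonal entry of A."""
    n = len(A)
    a_bar = sum(A[i][j] for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    return [[0.0 if i == j else a_bar for j in range(n)] for i in range(n)]


def cut_error_constant(a_bar, p, n):
    """Cut norm of the zero-diagonal matrix with constant off-diagonal value a_bar - p.

    With S = T = [n] the sum equals n (n - 1) (a_bar - p); dividing by n^2
    gives |a_bar - p| (n - 1) / n.
    """
    return abs(a_bar - p) * (n - 1) / n


if __name__ == "__main__":
    rng = random.Random(1)  # arbitrary seed
    n, p = 200, 0.3         # illustrative values
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            A[i][j] = A[j][i] = 1 if rng.random() < p else 0
    a_bar = average_estimator(A)[0][1]
    print("cut error of A-bar:", cut_error_constant(a_bar, p, n))
    print("sqrt(p)/n:", p ** 0.5 / n, "versus sqrt(p/n):", (p / n) ** 0.5)
```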

3.2 Comparison with $l_{1}$ and $l_{2}$ -estimation

The cut norm optimal estimation rate is quite different from what has been established for the Frobenius norm (also called $l_{2}$ ) estimation rate in [26] (see also [20] for the dense case), that is

[TABLE]

for any $k=2,\ldots,n$ . Besides, the minimax risk bound is achieved by the restricted least-square estimators [26] defined by

[TABLE]

Since the Frobenius norm dominates the cut norm, it is expected that the cut norm convergence rate is faster than the Frobenius norm estimation rate. When $\rho_{n}$ is not too small and the number of blocks remains small ( $k\leq\sqrt{n\log(n)}$ ), the gain is a $\log(k)$ factor, whereas, for larger $k$ , the gain is of order $k/\sqrt{n}$ . More importantly, the optimal Frobenius norm convergence rate (10) is only known to be achieved by non-polynomial time estimators such as (11).

In view of the above discussion, one may wonder whether it is possible to build estimators that are near optimal in terms of both the cut and Frobenius distances. Since for any matrix $\boldsymbol{B}$, $\|\boldsymbol{B}\|_{\square}\leq\|\boldsymbol{B}\|_{2}/n$, it follows that, for $k\leq\sqrt{n}$, the restricted least-square estimator $\widehat{\boldsymbol{\Theta}}_{k,\rho_{n}}$ (11) is also near optimal (up to a $\sqrt{\log(k)}$ factor) with respect to the cut distance, that is,

[TABLE]

For matrices $\boldsymbol{\Theta}_{0}$ with more than $\sqrt{n}$ blocks, it is not clear whether the estimator $\widehat{\boldsymbol{\Theta}}_{k,\rho_{n}}$ achieves a fast rate of convergence in the cut norm.

In any case, the computational complexity of $\widehat{\boldsymbol{\Theta}}_{k,\rho_{n}}$ is non polynomial. In fact, no polynomial-time algorithm is known to achieve the minimax risk (10) with respect to the Frobenius norm. Below, we describe an estimator that is optimal in the cut distance and also achieves the best known rate in Frobenius distance in the class of polynomial-time estimators. Let us write the singular value decomposition of $\boldsymbol{A}$ :

[TABLE]

where $\sigma_{j}(\boldsymbol{A})>0$ are the singular values of $\boldsymbol{A}$ indexed in the decreasing order, $u_{j}(\boldsymbol{A})$ are eigenvectors of $\boldsymbol{A}$ and $v_{j}(\boldsymbol{A})=\pm u_{j}(\boldsymbol{A})$ . Given a tuning parameter $\lambda>0$ , we define

[TABLE]

as the singular value hard thresholding estimator of $\boldsymbol{\Theta}_{0}$. We have the following result.

Proposition 4.

Assume that $\rho_{n}\geq\log(n)/n$ . Let $\lambda=c\sqrt{\rho_{n}n}$ where $c$ is a sufficiently large numerical constant. Then, for any $k\in[n]$ and any $\boldsymbol{\Theta}_{0}\in\mathcal{T}[k,\rho_{n}]$ , the hard thresholding estimator $\widetilde{\boldsymbol{\Theta}}_{\lambda}$ simultaneously satisfies, with probability larger than $1-1/n$ ,

[TABLE]

where $C$ is a numerical constant.

The low-rank estimator $\widetilde{\boldsymbol{\Theta}}_{\lambda}$ was previously considered in [16] for Frobenius norm estimation, but error bounds obtained in [16] are much more pessimistic than (14). It follows from (15), that for $\rho_{n}\geq\log(n)/n$ , with high probability, $\widetilde{\boldsymbol{\Theta}}_{\lambda}$ achieves the optimal rate in the cut norm and the $\sqrt{\rho_{n}k/n}$ rate in Frobenius norm, which is the best known rate among polynomial-time estimators.
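The thresholding step can be illustrated with a small, dependency-free sketch. Since $\boldsymbol{A}$ is symmetric, its singular values are the absolute values of its eigenvalues and $v_{j}=\pm u_{j}$, so the estimator keeps the spectral components $\lambda_{j}u_{j}u_{j}^{\top}$ with $|\lambda_{j}|$ above the threshold. The code below is an assumption-laden illustration rather than the authors' implementation: the eigendecomposition uses a textbook cyclic Jacobi iteration (a stand-in for a library SVD), the names `jacobi_eigh` and `hard_threshold` are hypothetical, and the non-strict comparison `>=` at the threshold is a choice made here. In the setting of Proposition 4 one would call it with $\lambda=c\sqrt{\rho_{n}n}$, where the constant $c$ has to be supplied by the user.

```python
import math


def jacobi_eigh(M, sweeps=100, tol=1e-18):
    """Eigendecomposition of a symmetric matrix by cyclic Jacobi rotations.

    Returns (eigenvalues, V); column j of V is the eigenvector of eigenvalue j.
    """
    n = len(M)
    A = [[float(v) for v in row] for row in M]
    V = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(A[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if A[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes A[p][q]; its absolute value is at most pi/4.
                if A[p][p] == A[q][q]:
                    angle = math.pi / 4
                else:
                    angle = 0.5 * math.atan(2 * A[p][q] / (A[q][q] - A[p][p]))
                c, s = math.cos(angle), math.sin(angle)
                for k in range(n):  # A <- A J (columns p and q)
                    akp, akq = A[k][p], A[k][q]
                    A[k][p], A[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # A <- J^T A (rows p and q)
                    apk, aqk = A[p][k], A[q][k]
                    A[p][k], A[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(n):  # accumulate eigenvectors: V <- V J
                    vkp, vkq = V[k][p], V[k][q]
                    V[k][p], V[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    return [A[i][i] for i in range(n)], V


def hard_threshold(A, lam):
    """Keep the spectral components of symmetric A whose singular value is >= lam."""
    evals, V = jacobi_eigh(A)
    n = len(A)
    out = [[0.0] * n for _ in range(n)]
    for j, ev in enumerate(evals):
        if abs(ev) >= lam:  # singular value of a symmetric matrix = |eigenvalue|
            for a in range(n):
                for b in range(n):
                    out[a][b] += ev * V[a][j] * V[b][j]
    return out
```

For the adjacency matrix of a triangle, whose eigenvalues are $2,-1,-1$, calling `hard_threshold` with threshold $1.5$ keeps only the leading component, a constant matrix with entries $2/3$.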

We close this section with the following proposition, which gives the minimax optimal rate of estimation in $l_{1}$-norm. This will allow us to further compare the $\delta_{1}$ and $\delta_{\square}$ convergence rates for graphon estimation in the next section.

Proposition 5.

For any sequence $\rho_{n}>0$ and any positive integer $2\leq k\leq n$ , one has

[TABLE]

To prove the upper bound we can use the following result which provides the control of the estimation error measured in Frobenius norm of the restricted least-squares estimator $\boldsymbol{\Theta}^{r}$ proven in [26]:

Proposition 6.

Consider the network sequence model. There exists a positive absolute constant $C$ such that the following holds. If $\|\boldsymbol{\Theta}_{0}\|_{\infty}\leq\rho_{n}$, then

[TABLE]

The upper bound in (16) is a consequence of the inequality $\|\boldsymbol{B}\|_{1}/n^{2}\leq\|\boldsymbol{B}\|_{2}/n$ and (17). The lower bound of the minimax risk in (16) is proved following the same lines as the proof of Proposition 2.4 in [26] with $\|\cdot\|_{2}$ replaced by $\|\cdot\|_{1}$ . We skip the details.
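For completeness, the norm comparison used for the upper bound follows from the Cauchy-Schwarz inequality applied to the $n^{2}$ entries of $\boldsymbol{B}$. The display below is a short worked derivation, assuming (as elsewhere in the paper) that $\|\boldsymbol{B}\|_{1}$ denotes the entrywise $l_{1}$ norm and $\|\boldsymbol{B}\|_{2}$ the Frobenius norm:

```latex
\|\boldsymbol{B}\|_{1}
  = \sum_{i,j=1}^{n} |\boldsymbol{B}_{ij}| \cdot 1
  \leq \Big(\sum_{i,j=1}^{n} \boldsymbol{B}_{ij}^{2}\Big)^{1/2} \Big(\sum_{i,j=1}^{n} 1\Big)^{1/2}
  = n \, \|\boldsymbol{B}\|_{2},
\qquad\text{hence}\qquad
\frac{\|\boldsymbol{B}\|_{1}}{n^{2}} \leq \frac{\|\boldsymbol{B}\|_{2}}{n}.
```

Equality holds exactly when all entries of $\boldsymbol{B}$ have the same absolute value, so no general improvement of the factor $n$ is possible.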

4 Graphon estimation problem

In this section, we are interested in estimating the graphon $W_{0}$ in the sparse $W$ -random graph model (1). Let $\mathcal{W}^{+}[k]$ be the collection of $k$ –step graphons, that is, the subset of graphons $W\in\mathcal{W}^{+}$ such that for some $\boldsymbol{Q}\in[0,1]^{k\times k}_{\text{sym}}$ and some $\phi:[0,1]\to[k]$ ,

$$W(x,y)=\boldsymbol{Q}_{\phi(x),\phi(y)}\quad\text{for all }x,y\in[0,1].\tag{18}$$

Note that $\mathcal{W}^{+}[k]$ is also in correspondence with the collection of stochastic block models with $k$ blocks. Our purpose here is to characterize the minimax convergence rates over the classes $\mathcal{W}^{+}[k]$.

4.1 Cut distance minimax risk

Following [26], we start by associating a graphon to any $n\times n$ probability matrix $\boldsymbol{\Theta}_{0}$. Then, we can estimate the graphon $f_{0}(\cdot,\cdot)=\rho_{n}W_{0}(\cdot,\cdot)$ using the empirical graphon associated to an estimate of $\boldsymbol{\Theta}_{0}$. Recall that, given an $n\times n$ matrix $\boldsymbol{\Theta}$ with entries in $[0,1]$, we define the graphon $\widetilde{f}_{\boldsymbol{\Theta}}$ as the following piecewise constant function:

$$\widetilde{f}_{\boldsymbol{\Theta}}(x,y)=\boldsymbol{\Theta}_{\lceil nx\rceil,\lceil ny\rceil}\tag{19}$$

for all $x$ and $y$ in $(0,1]$ . For any estimator $\widehat{\boldsymbol{T}}$ of $\boldsymbol{\Theta}_{0}$ and any norm $N$ that is invariant under measure preserving maps the triangle inequality implies

[TABLE]

We have two parts in (20). The first term is the estimation error term $\|\widehat{\boldsymbol{T}}-\boldsymbol{\Theta}_{0}\|_{N}$ that has been considered in the previous section. The second term $\delta_{N}(\widetilde{f}_{\boldsymbol{\Theta}_{0}},f_{0})$ is the agnostic error. It measures the $\delta_{N}$ -distance between the true graphon $f_{0}$ and its discretized version sampled at the unobserved random design points $\xi_{1},\ldots,\xi_{n}$ . The behavior of $\delta_{N}(\widetilde{f}_{\boldsymbol{\Theta}_{0}},f_{0})$ depends on the topology of the considered class of graphons. The following theorem, proved in Section E, gives the upper bound on the agnostic error, measured in $\delta_{\square}$ -distance for step function graphons:

Theorem 1 (Agnostic error measured in cut distance).

Consider the $W$ -random graph model (1). For all integers $k\geq 2$ , all positive integers $n$ , all $W_{0}\in\mathcal{W}^{+}[k]$ and $\rho_{n}>0$ , we have

[TABLE]

Note that the case $k>n$ is a consequence of Proposition 1 from [28], so that we effectively only have to consider the case $k\leq n$ . The proof combines two ideas. First, we build $W$ and $\widehat{W}$ as the representatives of $W_{0}$ and $\widetilde{f}_{{\boldsymbol{\Theta}}_{0}}$ in the quotient space $\widetilde{\mathcal{W}}^{+}$ such that $W$ and $\widehat{W}$ match everywhere except on a set of Lebesgue measure of order at most $\sqrt{k/n}$ . This allows us to get a risk bound of order $\sqrt{k/n}$ . In order to recover the correct logarithmic factor $\sqrt{\log(k)}$ , we rely on the weak Szemerédi’s Lemma. Here, the key idea is to build a cut-norm approximation of a distorted transformation of $W$ where the weights of the group have been modified to take into account the geometry of the sampling error.
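The empirical graphon associated to a matrix can be sketched as follows. The code assumes the usual convention $\widetilde{f}_{\boldsymbol{\Theta}}(x,y)=\boldsymbol{\Theta}_{\lceil nx\rceil,\lceil ny\rceil}$ on $(0,1]^{2}$ from [26]; the function name `empirical_graphon` and the example matrix are illustrative.

```python
import math


def empirical_graphon(theta):
    """Step function f(x, y) = theta[ceil(n x) - 1][ceil(n y) - 1] on (0, 1]^2 (0-based indices)."""
    n = len(theta)

    def f(x, y):
        if not (0.0 < x <= 1.0 and 0.0 < y <= 1.0):
            raise ValueError("x and y must lie in (0, 1]")
        # min(..., n) guards against floating-point round-up at the right edge.
        i = min(math.ceil(n * x), n) - 1
        j = min(math.ceil(n * y), n) - 1
        return theta[i][j]

    return f


if __name__ == "__main__":
    theta = [[0.0, 0.5], [0.5, 0.0]]  # illustrative 2 x 2 probability matrix
    f = empirical_graphon(theta)
    print(f(0.25, 0.75), f(0.5, 0.5))
```

Each matrix entry thus occupies a square of side $1/n$, which is why changing the diagonal of $\boldsymbol{\Theta}$ only modifies the empirical graphon on a set of measure $1/n$.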

As an immediate consequence of (20), Proposition 2 and Theorem 1, we get the following upper bound on the risk of the empirical graphon $\widetilde{f}_{\boldsymbol{A}}$ . For any $k\geq 2$ , it holds that

[TABLE]

where $C$ is an absolute constant. Here, $\operatorname{\mathbb{E}}_{W_{0}}$ denotes the expectation with respect to the distribution of observations $\boldsymbol{A}=(\boldsymbol{A}_{ij},1\leq j<i\leq n)$ when the underlying sparse graphon is $f_{0}=\rho_{n}W_{0}$. The following theorem provides a matching lower bound for $2\leq k\leq n$.

Theorem 2.

There exists a universal constant $C>0$ such that for any sequence $\rho_{n}>0$ and any positive integer $2\leq k\leq n$ ,

[TABLE]

where $\inf_{\widehat{f}}$ is the infimum over all estimators.

Since the collections $\mathcal{W}^{+}[k]$ are nested, it follows that for all $k\geq n$ , one has

[TABLE]

In view of (22) and (23), we observe that, as long as $\rho_{n}\geq 1/n$, the empirical graphon $\widetilde{f}_{\boldsymbol{A}}$ is minimax optimal over all classes $\mathcal{W}^{+}[k]$, $k\geq 2$. For sparser graphs ($\rho_{n}\leq 1/n$), the trivial estimator $\widehat{f}\equiv 0$ achieves the optimal rate $\rho_{n}$.

Note that there are two distinct regimes in the minimax convergence rate. When $\rho_{n}\geq\log(k)/k$ (weakly sparse graphs or large number of groups), the agnostic error dominates and the minimax risk is of order $\rho_{n}\sqrt{k/(n\log(k))}$ . For moderately sparse graphs or equivalently a small number of steps ( $n^{-1}\leq\rho_{n}\leq\log(k)/k$ ), the error arising from the probability matrix $\boldsymbol{\Theta}_{0}$ estimation dominates and the minimax risk is of order $\sqrt{\rho_{n}/n}$ .
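The regimes described above can be tabulated with a small helper. The sketch below returns the dominating term and the order of the minimax risk with all absolute constants dropped; the function name `cut_rate` and the regime labels are illustrative, and the formulas are exactly the orders quoted in the text, including the rate $\rho_{n}$ of the null estimator when $\rho_{n}\leq 1/n$.

```python
import math


def cut_rate(n, k, rho):
    """Regime and order of the minimax cut-distance risk, absolute constants omitted."""
    if k < 2 or n < k:
        raise ValueError("expects 2 <= k <= n")
    if rho <= 1.0 / n:
        # Very sparse graphs: the null estimator already achieves the rate rho.
        return "null estimator", rho
    if rho >= math.log(k) / k:
        # Weakly sparse graphs or many groups: the agnostic error dominates.
        return "agnostic error", rho * math.sqrt(k / (n * math.log(k)))
    # Moderately sparse graphs: the probability matrix estimation error dominates.
    return "probability matrix estimation", math.sqrt(rho / n)


if __name__ == "__main__":
    for rho in (0.5, 0.01, 1e-5):  # illustrative sparsity levels
        print(rho, cut_rate(10_000, 100, rho))
```

The two non-trivial rates coincide at the boundary: $\rho_{n}\sqrt{k/(n\log(k))}=\sqrt{\rho_{n}/n}$ exactly when $\rho_{n}=\log(k)/k$, which is where the regime changes.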

As in the previous section, we left aside the specific case of constant graphons $\mathcal{W}^{+}[1]$ . Note that for a graphon $W_{0}\in\mathcal{W}^{+}[1]$ the agnostic error is always zero and the loss comes from the probability matrix estimation. Following the arguments of the previous section, we derive that the graphon $\widetilde{f}_{\overline{\boldsymbol{A}}}$ converges to $\rho_{n}W_{0}$ at the rate $\sqrt{\rho_{n}}/n$ which is optimal as soon as $\rho_{n}\geq 1/n^{2}$ .

4.2 Comparison with $\delta_{1}$ and $\delta_{2}$ -estimation

Minimax risk for graphon estimation in the $\delta_{2}$ -distance was obtained in [26, Proposition 3.2] :

[TABLE]

The following proposition, proved in Section G, gives the minimax $\delta_{1}$ -convergence rate:

Proposition 7.

For any sequence $\rho_{n}>0$ and any positive integer $2\leq k\leq n$ , we have

[TABLE]

Conversely, there exists an estimator $\widehat{f}$ based on the restricted least-squares estimator (11) such that

[TABLE]

The upper and lower bounds given by Proposition 7 match (up to a $\sqrt{\log(k)}$ multiplicative term in one of the regimes). There are three regions in (26) for $\delta_{1}$ graphon estimation. The first one corresponds to the case of weakly sparse graphs with $\rho_{n}\geq k^{-1}\vee(k/n)$. In this case, the agnostic error dominates and the optimal risk is of order $\rho_{n}\sqrt{k/n}$. For moderately sparse graphs with $n^{-1}\vee(k/n)^{2}\leq\rho_{n}\leq k^{-1}\vee(k/n)$, the probability matrix estimation error dominates and the minimax rate is of order $\sqrt{\rho_{n}/n}+\sqrt{\rho_{n}}k/n$ (up to a $\log(k)$ multiplicative term). In the case of highly sparse graphs with $\rho_{n}\leq n^{-1}\vee(k/n)^{2}$, the minimax risk is $\rho_{n}$, which corresponds to the risk of the null estimator $\tilde{f}\equiv 0$.

Let us compare the optimal convergence rates with respect to $\delta_{1}$ (26), $\delta_{2}$ (24) and $\delta_{\square}$ (23). Bearing in mind that $\delta_{2}$ dominates $\delta_{1}$, which in turn dominates $\delta_{\square}$, one should not be surprised that the optimal rates with respect to $\delta_{2}$ are the slowest. When the number of steps $k$ is less than $\sqrt{n}$ or when the graph is weakly sparse ($\rho_{n}\geq\sqrt{k/n}$), the $\delta_{1}$ and $\delta_{\square}$ optimal rates only differ by a $\log(k)$ multiplicative term. For larger $k$ and sparser graphs, the optimal $\delta_{1}$-risk can be $k/\sqrt{n}$ larger than the $\delta_{\square}$-risk.

Following the discussion in Section 3.2, one may easily build graphon estimators performing well in all these three distances. For instance, the graphon $f_{\widehat{\boldsymbol{\Theta}}_{k,\rho_{n}}}$ based on the restricted-least-squares estimator is optimal with respect to $\delta_{2}$ and $\delta_{1}$ and near optimal (up to a possible $\sqrt{\log(k)}$ loss) with respect to $\delta_{\square}$ for $k\leq\sqrt{n}$ . Besides, the graphon $f_{\widetilde{\boldsymbol{\Theta}}_{\lambda}}$ based on the singular value thresholding estimator is optimal with respect to $\delta_{\square}$ and achieves best known convergence rates with respect to $\delta_{1}$ and $\delta_{2}$ among polynomial time algorithms.

4.3 Cut distance estimation of $L_{1}$ and $L_{2}$ graphons

Until now we have restricted our attention to graphons $W$ taking values in $[0,1]$. As argued in [11, 12], in this case the empirical degree distribution of a graph sampled from the corresponding $W$-random graph model (1) is light tailed. This contrasts with many practical situations, where the degree distribution is heavy tailed. To circumvent this limitation, Borgs et al. [11, 12] introduce, for $p\geq 1$, the class $\mathcal{W}_{p}^{+}$ of symmetric measurable functions $W:[0,1]^{2}\to\mathbb{R}^{+}$ such that $\int|W(x,y)|^{p}dxdy<\infty$. This collection $\mathcal{W}_{p}^{+}$ is referred to as the collection of $L_{p}$ graphons. We have the inclusions $\mathcal{W}^{+}\subset\mathcal{W}_{p}^{+}\subset\mathcal{W}_{p^{\prime}}^{+}$ for $p>p^{\prime}\geq 1$. Given a graphon $W_{0}\in\mathcal{W}_{p}^{+}$ and a sparsity parameter $1\geq\rho_{n}>0$, the corresponding $W$-random graph model amounts to generating a graph with $n$ vertices according to the random matrix $\boldsymbol{\Theta}_{0}$ sampled as follows

[TABLE]

where $\xi_{1},\ldots,\xi_{n}$ are, as in (1), i.i.d. random variables uniformly distributed in $[0,1]$ . Note that since $W_{0}$ is now unbounded, we have to take the minimum with $1$ in (27). We write $f^{\prime}_{0}=(\rho_{n}W_{0})\wedge 1$ . Since $W_{0}$ is now allowed to be unbounded, graphs sampled according to the model (27) may have power law degree distribution [11]. As in the introduction, we may extend the norms $\|.\|_{\square}$ and $\|.\|_{q}$ and the distances $\delta_{\square}$ and $\delta_{q}$ to any graphon $W_{0}\in\mathcal{W}_{p}^{+}$ with $p\leq q$ . Also, we write $\widetilde{\mathcal{W}}_{p}^{+}$ for the quotient space of $L_{p}$ graphons under weak isometry.
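The truncated sampling scheme can be sketched as follows, assuming that $(\boldsymbol{\Theta}_{0})_{ij}=(\rho_{n}W_{0}(\xi_{i},\xi_{j}))\wedge 1$ for $i\neq j$ with a zero diagonal, as the definition of $f^{\prime}_{0}$ suggests. The function names and the example graphon $W(x,y)=1/(4\sqrt{xy})$ are hypothetical choices made for illustration: this graphon is unbounded, has $\|W\|_{1}=1$, and belongs to $\mathcal{W}_{p}^{+}$ only for $p<2$.

```python
import random


def sample_theta(W, rho, n, rng):
    """Theta[i][j] = min(rho * W(xi_i, xi_j), 1) for i != j and 0 on the diagonal."""
    xi = [1.0 - rng.random() for _ in range(n)]  # uniform on (0, 1], so xi is never 0
    theta = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            theta[i][j] = theta[j][i] = min(rho * W(xi[i], xi[j]), 1.0)
    return theta


def power_law_graphon(x, y):
    """Hypothetical unbounded example W(x, y) = 1 / (4 sqrt(x y)), with L1 norm equal to 1."""
    return 0.25 / (x * y) ** 0.5


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary seed
    theta0 = sample_theta(power_law_graphon, rho=0.5, n=6, rng=rng)
    for row in theta0:
        print([round(v, 3) for v in row])
```

Without the minimum with $1$, latent positions $\xi_{i}$ close to $0$ would produce entries larger than $1$, which could not be interpreted as connection probabilities.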

Let us also define the collection $\mathcal{W}^{+}_{p}[k]$ of $k$-step $L_{p}$ graphons, that is, the subset of graphons $W\in\mathcal{W}_{p}^{+}$ such that $W(x,y)=\boldsymbol{Q}_{\phi(x),\phi(y)}$ for some $\boldsymbol{Q}\in(\mathbb{R}^{+})^{k\times k}_{\text{sym}}$ and some $\phi:[0,1]\to[k]$ (note that $\mathcal{W}^{+}_{p}[k]$ does not depend on $p$). For $1\geq\mu>0$ we denote by $\mathcal{W}^{+}_{p}[k,\mu]$ the subset of $\mathcal{W}^{+}_{p}[k]$ of “balanced” step functions, that is, $W\in\mathcal{W}^{+}_{p}[k,\mu]$ if $\lambda(\phi^{-1}(a))\geq\mu/k$ for all $a\in[k]$. This means that the size of each step is at least $\mu/k$.

Without loss of generality we can consider normalized graphons, that is, we assume that $\|W_{0}\|_{1}=1$. The following proposition, proved in Appendix H, gives an oracle inequality for the risk of the empirical graphon associated to the adjacency matrix and to the singular value hard thresholding estimator:

Proposition 8.

Let $\lambda=c\sqrt{\rho_{n}n}$ where $c$ is a sufficiently large numerical constant. Given a graphon $W_{0}$ and $\rho_{n}>0$ , write $W^{\prime}_{0}=\rho^{-1}_{n}[(\rho_{n}W_{0})\wedge 1]$ .

(1)

Let $W_{0}\in\mathcal{W}^{+}_{1}$ with $\|W_{0}\|_{1}=1$ , $\rho_{n}\geq 1/n$ and $1\geq\mu>0$ . Then, for any positive integer $k\leq\mu n$ , we have

[TABLE]

and

[TABLE]

(2)

Assume that $W_{0}\in\mathcal{W}^{+}_{2}$ with $\|W_{0}\|_{1}=1$ and $\rho_{n}\geq 1/n$ . For any positive integer $k\leq n$ , we have

[TABLE]

If $W_{0}$ belongs to some $\mathcal{W}^{+}_{2}[k]$ or to $\mathcal{W}^{+}_{1}[k,\mu]$, the convergence rates given by Proposition 8 are the same as the optimal rates for bounded graphons up to a $\log^{-1/2}(k)$ factor. We conjecture that the $\log^{-1/2}(k)$ factor should appear in Proposition 8. Indeed, for bounded graphons, this logarithmic term derives from Szemerédi’s Regularity lemma, and extensions of this lemma to $L_{p}$ graphons have been recently proved [11]. Nevertheless, our arguments in the proof of Theorem 1 make heavy use of the boundedness of the graphons. In particular, one should replace all applications of McDiarmid’s inequality (Lemma 1) by more involved concentration inequalities [13]. We leave this for future work.

When the graphon $W_{0}$ is not a finite step graphon, a bias term occurs in the risk bounds (28–31). As the estimation risk is measured in the cut distance, one could have hoped to obtain a bias term in the cut distance also (instead of the larger $l_{1}$ and $l_{2}$ distances). It is an interesting open problem to prove whether one can obtain oracle inequalities with cut distance bias terms. Note that, for bounded graphons $W\in\mathcal{W}^{+}$, using Theorem 1, we can also get an oracle inequality with the $\delta_{1}$ bias term and minimax optimal error term.

Upper bounds on the cut distance risk for $L_{p}$ graphon estimation were previously obtained in [7], where the authors introduced the least cut norm estimator $\widehat{f}_{LC}$. For any $L_{1}$ normalized graphon $W_{0}$ and any $\kappa\in[\log n/n,1]$, Borgs et al. [7] show in their Theorem 4.1 that this estimator $\widehat{f}_{LC}$ achieves the risk bound

[TABLE]

For $L_{1}$ graphons, this bound is quite similar (up to an additional $\log^{1/2}(n)$ term) to those we obtained in (28–29) for the empirical estimators $\widetilde{f}_{\boldsymbol{A}}$ and $\widetilde{f}_{\widetilde{\boldsymbol{\Theta}}_{\lambda}}$. Note that the least cut norm estimator cannot be computed in polynomial time, contrary to the empirical graphons associated to the adjacency matrix and to the singular value hard thresholding estimator. Also, when the true graphon $W_{0}$ belongs either to $\mathcal{W}^{+}_{2}$ or to $\mathcal{W}^{+}[k]$, the rate in (32) is much slower than what has been obtained in Proposition 4 and Theorem 1.

Appendix A Proof methods

In this section, we summarize some basic facts and fundamental results that we use in the proofs.

A.1 Non-symmetric kernels

At some point, we will need to work with non-symmetric kernels and with kernels defined on general measurable subsets of $\mathbb{R}$. In this section we define the corresponding spaces. Let $\mathcal{X}$ and $\mathcal{Y}$ denote two bounded measurable subsets of $\mathbb{R}$. Then, $\mathcal{W}_{\mathcal{X},\mathcal{Y}}$ refers to the collection of bounded measurable functions $W:\mathcal{X}\times\mathcal{Y}\rightarrow[-1,1]$. We will denote by $\mathcal{W}^{+}_{\mathcal{X},\mathcal{Y}}$ the collection of bounded measurable and non-negative functions $W:\mathcal{X}\times\mathcal{Y}\rightarrow[0,1]$. Let $\mathcal{W}_{\mathcal{X},\mathcal{Y}}[k]$ be the collection of $k$-step kernels, that is, the subset of kernels $W\in\mathcal{W}_{\mathcal{X},\mathcal{Y}}$ such that for some $\boldsymbol{Q}\in\mathbb{R}^{k\times k}$ and some $\phi_{1}:\mathcal{X}\to[k]$, $\phi_{2}:\mathcal{Y}\to[k]$,

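$$W(x,y)=\boldsymbol{Q}_{\phi_{1}(x),\phi_{2}(y)}\quad\text{for all }x\in\mathcal{X},\ y\in\mathcal{Y}.\tag{33}$$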

A kernel $W$ is also said to be a $q_{1}\times q_{2}$ -step function when it decomposes as in (33) but where $\boldsymbol{Q}$ is a size $q_{1}\times q_{2}$ matrix, $\phi_{1}$ mapping $\mathcal{X}$ to $[q_{1}]$ , and $\phi_{2}$ mapping $\mathcal{Y}$ to $[q_{2}]$ . The cut norm can be readily extended to kernels $W\in\mathcal{W}_{\mathcal{X},\mathcal{Y}}$ in the following way:

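$$\|W\|_{\square}=\sup_{X\subset\mathcal{X},\,Y\subset\mathcal{Y}}\Big|\int_{X\times Y}W(x,y)\,dx\,dy\Big|,\tag{34}$$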

where the supremum is taken over all measurable subsets $X$ and $Y$ .

A.2 Concentration inequalities

In the proofs we repeatedly use Bernstein’s inequality. We state it here for the readers’ convenience. Let $X_{1},\dots,X_{N}$ be independent zero-mean random variables. Suppose that $|X_{i}|\leq M$ almost surely, for all $i$ . Then, for any $t>0$ ,

[TABLE]

We shall also rely on the bounded difference inequality (also called McDiarmid’s inequality).

Lemma 1 (Bounded difference inequality).

Let $X_{1},\ldots,X_{n}$ denote $n$ independent real random variables. Assume that $g:\mathbb{R}^{n}\rightarrow\mathbb{R}$ is a measurable function satisfying, for some positive constants $(c_{i})_{1\leq i\leq n}$ , the bounded difference condition

[TABLE]

for all $x=(x_{1},\ldots,x_{i},\ldots,x_{n})\in\mathbb{R}^{n}$ , $x^{\prime}=(x_{1},\ldots,x^{\prime}_{i},\ldots,x_{n})\in\mathbb{R}^{n}$ and all $i\in[n]$ . Then, the random variable $Z=g(X_{1},\ldots,X_{n})$ satisfies

[TABLE]

for all $t>0$ .

A.3 Fano’s lemma

In the sequel, $\mathcal{KL}(.,.)$ denotes the Kullback-Leibler divergence between two distributions. In this manuscript, all the proofs of the minimax lower bounds rely on Fano’s method. The following version of Fano’s lemma is borrowed from [32]:

Lemma 2.

[32, Theorem 2.7] Consider a parametric model $\operatorname{\mathbb{P}}_{\theta}$, with $\theta\in\Theta$ and a metric $d(.,.)$ on $\Theta$. Assume that $\Theta$ contains elements $\theta_{1},\ldots,\theta_{M}$, $M\geq 3$, such that for all $j,k\in[M]$ with $j\neq k$

(i)

$d(\theta_{j},\theta_{k})\geq s>0$,

(ii)

$\mathcal{KL}(\mathbb{P}_{\theta_{j}},\operatorname{\mathbb{P}}_{\theta_{k}})\leq\log(M)/32$.

Then, we have

[TABLE]

where the constant $C>0$ is numeric.

A.4 Khintchine’s inequality

Next, we state a particular case of Khintchine’s inequality that turns out to be useful for bounding the cut norm of step kernels in terms of their $l_{1}$ norm:

Lemma 3.

[30] Let $\epsilon_{1},\ldots,\epsilon_{p}$ be i.i.d. Rademacher random variables and let $x_{1},\ldots,x_{p}$ be some real numbers. Then,

[TABLE]
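As a numerical illustration, the sketch below computes $\operatorname{\mathbb{E}}|\sum_{i}\epsilon_{i}x_{i}|$ exactly by enumerating all sign patterns and compares it with $\|x\|_{2}/\sqrt{2}$. The comparison level $1/\sqrt{2}$ is the classical sharp constant in the $L_{1}$ lower Khintchine inequality; taking it as the form of the bound used here is an assumption, and the function name `expected_abs_rademacher_sum` is illustrative.

```python
import math
from itertools import product


def expected_abs_rademacher_sum(x):
    """Exact value of E|sum_i eps_i x_i| over i.i.d. uniform signs eps_i in {-1, +1}."""
    p = len(x)
    total = sum(
        abs(sum(e * xi for e, xi in zip(eps, x)))
        for eps in product((-1, 1), repeat=p)
    )
    return total / 2 ** p


if __name__ == "__main__":
    for x in ([1, 1], [3, 4, 0], [1, 2, 3, 4]):  # illustrative vectors
        lhs = expected_abs_rademacher_sum(x)
        rhs = math.sqrt(sum(v * v for v in x)) / math.sqrt(2)
        print(x, "E|sum| =", lhs, ">= ||x||_2 / sqrt(2) =", rhs)
```

The vector $x=(1,1)$ is an equality case: $\operatorname{\mathbb{E}}|\epsilon_{1}+\epsilon_{2}|=1=\|x\|_{2}/\sqrt{2}$, which shows that the constant $1/\sqrt{2}$ cannot be improved.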

We use this result to prove the following lower bound on the cut norm of step kernels:

Lemma 4.

Let $U:\mathcal{X}\times\mathcal{Y}\mapsto[-1,1]$ denote a measurable $q_{1}\times q_{2}$ –step function. Then,

[TABLE]

Proof of Lemma 4.

There exist partitions $\mathcal{X}=\mathcal{X}_{1}\cup\ldots\cup\mathcal{X}_{q_{1}}$ and $\mathcal{Y}=\mathcal{Y}_{1}\cup\ldots\cup\mathcal{Y}_{q_{2}}$ such that, for any fixed $y\in\mathcal{Y}$, $U(x,y)$ is constant over $x\in\mathcal{X}_{i}$ for all $i\in[q_{1}]$ and, for any fixed $x\in\mathcal{X}$, $U(x,y)$ is constant over $y\in\mathcal{Y}_{i}$ for all $i\in[q_{2}]$. For any $a\in[q_{1}]$ (resp. $b\in[q_{2}]$), denote by $x_{a}$ (resp. $y_{b}$) any element of $\mathcal{X}_{a}$ (resp. $\mathcal{Y}_{b}$). By definition of $\|U\|_{\square}$,

[TABLE]

where we used in the last line that the value of the sum only depends on $S$ and $T$ through the quantities $\lambda(S\cap\mathcal{X}_{a})$ and $\lambda(T\cap\mathcal{Y}_{b})$ . Since the maximum of a linear function on a convex set is achieved at an extremal point, it follows that

[TABLE]

where we use (8) and take $\epsilon_{a}=\operatorname{sign}\sum_{b\in[q_{2}]}\epsilon^{\prime}_{b}\lambda(\mathcal{Y}_{b})U[x_{a},y_{b}]$. Let $v=(v_{1},\ldots,v_{q_{2}})$ denote i.i.d. Rademacher random variables and let $\operatorname{\mathbb{E}}_{v}[.]$ denote the expectation with respect to $v$. Now, Khintchine’s inequality (36) and the Cauchy-Schwarz inequality imply

[TABLE]

∎

Appendix B Proof of Proposition 2

Since the diagonals of $\boldsymbol{A}$ and $\boldsymbol{\Theta}_{0}$ are both zero, it suffices to control the supremum over disjoint subsets $S$ and $T$ (see, e.g., [8])

[TABLE]

Let $S$ and $T$ be any two disjoint subsets of $[n]$ . Using Bernstein’s inequality (35) we have that

[TABLE]

Now, using that the number of disjoint pairs $(S,T)$ is $3^{n}$ and the union bound, we get that the probability that $|\sum_{i\in S,j\in T}\boldsymbol{A}_{ij}-\boldsymbol{\Theta}_{ij}|$ exceeds $3\sqrt{\left(\|\boldsymbol{\Theta}_{0}\|_{1}+n\right)n}$ for some $(S,T)$ is bounded by $2\exp(-n)$ . Hence, we have

[TABLE]

with probability $1-2e^{-n}$ . Now bounding the distance by $1$ in the exceptional case we get the statement of Proposition 2.
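The count $3^{n}$ used in the union bound comes from the fact that each index of $[n]$ lies in $S$, in $T$, or in neither set. A minimal sketch checking this by enumeration for small $n$ (the function name `count_disjoint_pairs` is illustrative):

```python
from itertools import product


def count_disjoint_pairs(n):
    """Count ordered pairs (S, T) of disjoint subsets of {0, ..., n-1} by enumeration."""
    count = 0
    for s_mask in product((0, 1), repeat=n):
        for t_mask in product((0, 1), repeat=n):
            # Disjoint means that no index belongs to both S and T.
            if all(not (s and t) for s, t in zip(s_mask, t_mask)):
                count += 1
    return count


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, count_disjoint_pairs(n), 3 ** n)
```

Since $3^{n}=e^{n\log 3}$, a tail bound of order $e^{-(1+\log 3)n}$ for each fixed pair $(S,T)$ is what makes the union over all pairs at most of order $e^{-n}$.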

Appendix C Proof of Proposition 3

Fix $\rho_{n}\in(0,1)$ . This proof is based on Fano’s method. To apply Fano’s Lemma (Lemma 2), it is enough to check that there exists a finite subset $\Omega$ of $\mathcal{T}[2,\rho_{n}]$ such that for any two distinct $\boldsymbol{\Theta},\boldsymbol{\Theta}^{\prime}$ in $\Omega$ we have

(a)

$\|\boldsymbol{\Theta}-\boldsymbol{\Theta}^{\prime}\|_{\square}\geq C\,\sqrt{\rho_{n}}\left(\frac{1}{\sqrt{n}}\wedge\sqrt{\rho_{n}}\right)$ and

(b)

$\mathcal{KL}(\mathbb{P}_{\boldsymbol{\Theta}},\mathbb{P}_{\boldsymbol{\Theta}^{\prime}})\leq\log(|\Omega|)/32\,$

for some constant $C>0$. Then, applying Lemma 2 to $\Omega$ leads to the desired result. It remains to prove the existence of $\Omega$. As is classical for this kind of proof, we first build a collection $\Omega^{\prime}\subset\mathcal{T}[2,\rho_{n}]$ and then extract a maximal subset $\Omega\subset\Omega^{\prime}$ satisfying (a). Then, we control the Kullback-Leibler divergence between any two of the corresponding probability distributions to show (b).

Construction of $\Omega^{\prime}$ . Fix $\epsilon\in(0,\rho_{n}/4)$ . For any $u\in\{-1,1\}^{n}$ , define $\boldsymbol{\Theta}_{u}$ by $(\boldsymbol{\Theta}_{u})_{i,j}=\rho_{n}/2+u(i)u(j)\epsilon$ where $u=\left(u(1),\dots,u(n)\right)$ . In other words, the entries $\boldsymbol{\Theta}_{u}$ are equal to $\rho_{n}/2+\epsilon$ if $u(i)u(j)=1$ and $\rho_{n}/2-\epsilon$ if $u(i)u(j)=-1$ . Obviously, the collection $\Omega^{\prime}:=\left\{\boldsymbol{\Theta}_{u}\;:\;u\in\{-1,1\}^{n}\right\}$ is included in $\mathcal{T}[2,\rho_{n}]$ .
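The construction can be checked numerically. The sketch below builds $\boldsymbol{\Theta}_{u}$ from a sign vector $u$ and exhibits the two-block representation $(\boldsymbol{\Theta}_{u})_{ij}=\boldsymbol{Q}_{z(i)z(j)}$ for $i\neq j$. The function names, the zero diagonal (chosen to match the convention $\boldsymbol{\Theta}_{ii}=0$ of the model) and the numerical values are assumptions made for illustration.

```python
def theta_u(u, rho, eps):
    """Matrix with entries rho/2 + u[i] * u[j] * eps off the diagonal and 0 on it."""
    n = len(u)
    return [
        [0.0 if i == j else rho / 2 + u[i] * u[j] * eps for j in range(n)]
        for i in range(n)
    ]


def two_block_form(u, rho, eps):
    """Labels z and 2 x 2 matrix Q such that theta_u[i][j] = Q[z[i]][z[j]] for i != j."""
    z = [0 if ui == 1 else 1 for ui in u]
    Q = [[rho / 2 + eps, rho / 2 - eps],
         [rho / 2 - eps, rho / 2 + eps]]
    return z, Q


if __name__ == "__main__":
    u, rho, eps = [1, -1, 1, -1], 0.4, 0.05  # eps must lie in (0, rho / 4)
    theta = theta_u(u, rho, eps)
    z, Q = two_block_form(u, rho, eps)
    print("labels:", z)
    print("entries within [rho/4, rho]:",
          all(rho / 4 <= theta[i][j] <= rho
              for i in range(len(u)) for j in range(len(u)) if i != j))
```

Because $\epsilon<\rho_{n}/4$, every off-diagonal entry lies between $\rho_{n}/4$ and $3\rho_{n}/4$, so the matrix is a valid element of $\mathcal{T}[2,\rho_{n}]$ with entries bounded away from zero.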

Computation of the cut distances and extraction of a maximal subset. Given $u\in\{-1,1\}^{n}$, denote by $V_{u}:=\{i\in[n]\;:\;u(i)=1\}$ the set of indices corresponding to $u(i)=1$ and by $\bar{V}_{u}$ its complement. Then, given two vectors $u$ and $v$, defining $S:=V_{u}\setminus V_{v}$ and $T:=V_{v}\cap V_{u}$, we easily obtain

[TABLE]

By symmetry, we derive that

[TABLE]

where $A\triangle B$ is the symmetric difference of $A$ and $B$ . As a consequence, the cut distance between any two graphons is large as long as the symmetric difference between $u$ and $v$ is both bounded away from zero and from $n$ .

By Varshamov-Gilbert combinatorial bound (see, e.g., [32, Lemma 2.9]), we can in fact pick $u_{1},\dots,u_{N}$ satisfying

[TABLE]

with $N\geq\exp(c_{1}n)$ for some $c_{1}>0$ . In the sequel, we consider $\Omega=\{\boldsymbol{\Theta}_{u_{i}}\;:\;i=1,\dots,N\}$ . Hence, we have $\log|\Omega|\geq c_{1}n$ , whereas the previous inequalities ensure that

[TABLE]

which proves (a) when one takes $\epsilon$ as defined in (38) below.

Control of the Kullback Divergence. To prove (b) we use the definition of Kullback-Leibler divergence $\mathcal{KL}(\operatorname{\mathbb{P}}_{\boldsymbol{\Theta}_{u}},\operatorname{\mathbb{P}}_{\boldsymbol{\Theta}_{v}})$ and $\log x\leq x-1$ for $x>0$ to get

[TABLE]

Now, $(\boldsymbol{\Theta}_{v})_{i,j}\geq\rho_{n}/4$ and $\rho_{n}\leq 1$ imply

[TABLE]

Taking

[TABLE]

with a constant $c_{2}>0$ small enough, we derive from the lower bound $\log(|\Omega|)\geq c_{1}n$ that

[TABLE]

which proves (b).

Appendix D Proof of Proposition 4

Set $\boldsymbol{E}=\boldsymbol{A}-\boldsymbol{\Theta}_{0}$ . We have the following simple proposition (see Theorem 5 in [25])

Proposition 9.

If $\lambda\geq\|\boldsymbol{E}\|_{2\rightarrow 2}$ , then

[TABLE]

In view of Proposition 9 we need to estimate $\|\boldsymbol{E}\|$ with high probability in order to specify the value of the regularization parameter $\lambda$ . Let $\boldsymbol{E}^{*}=(\boldsymbol{E}^{*}_{ij})$ be such that $\boldsymbol{E}^{*}_{ij}=\boldsymbol{E}_{ij}$ for $i<j$ and $\boldsymbol{E}^{*}_{ij}=0$ for $i\geq j$ . Then $\|\boldsymbol{E}\|_{2\rightarrow 2}\leq 2\|\boldsymbol{E}^{*}\|$ . We can upper bound $\|\boldsymbol{E}^{*}\|$ using the following bound on the spectral norm of random matrices from [3]:

Proposition 10.

Let $\boldsymbol{W}$ be the $n\times m$ rectangular matrix whose entries $\boldsymbol{W}_{ij}$ are independent centered random variables bounded (in absolute value) by some $\sigma_{*}>0$ . Then, for any $0<\epsilon\leq 1/2$ there exists a universal constant $c_{\epsilon}$ such that, for every $t\geq 0$

[TABLE]

where we have defined

[TABLE]

For $\boldsymbol{E}^{*}$, we have $\sigma_{1}\leq\sqrt{\rho_{n}n}$, $\sigma_{2}\leq\sqrt{\rho_{n}n}$, and $\sigma_{*}\leq 1$. Taking $\epsilon=1/2$ and $t=\sqrt{2c_{\epsilon}\log(n)}$ in Proposition 10, we obtain that there exists an absolute constant $c^{*}$ such that

[TABLE]

with probability at least $1-1/n$ . Since $\rho_{n}\geq\log(n)/n$ , we can take $\lambda=c\sqrt{\rho_{n}n}$ where $c\geq 12\sqrt{2}+4c^{*}$ so that $\left\|\boldsymbol{E}\right\|_{2\rightarrow 2}\leq\lambda/2$ . Then, Proposition 9 implies

[TABLE]

It is easy to see that the cut-norm of a matrix can be bounded by its spectral norm:

[TABLE]
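For the reader's convenience, here is a short derivation of this standard comparison, written for the normalized cut norm $\|\boldsymbol{B}\|_{\square}=n^{-2}\max_{S,T\subset[n]}|\sum_{i\in S,j\in T}\boldsymbol{B}_{ij}|$ used in this paper; the notation $\mathbf{1}_{S}$ for the indicator vector of $S$ is introduced here only for this computation:

```latex
\Big|\sum_{i\in S,\,j\in T}\boldsymbol{B}_{ij}\Big|
  = \big|\mathbf{1}_{S}^{\top}\boldsymbol{B}\,\mathbf{1}_{T}\big|
  \leq \|\mathbf{1}_{S}\|_{2}\,\|\boldsymbol{B}\|_{2\rightarrow 2}\,\|\mathbf{1}_{T}\|_{2}
  = \sqrt{|S|\,|T|}\;\|\boldsymbol{B}\|_{2\rightarrow 2}
  \leq n\,\|\boldsymbol{B}\|_{2\rightarrow 2},
\qquad\text{so that}\qquad
\|\boldsymbol{B}\|_{\square}\leq\frac{\|\boldsymbol{B}\|_{2\rightarrow 2}}{n}.
```

In particular, on the event $\|\boldsymbol{E}\|_{2\rightarrow 2}\leq\lambda/2$ with $\lambda=c\sqrt{\rho_{n}n}$, this gives $\|\boldsymbol{E}\|_{\square}\leq(c/2)\sqrt{\rho_{n}/n}$, which is the order of the optimal cut norm rate.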

Bound on the cut-norm (15) then follows from

[TABLE]

In order to prove the Frobenius bound (14), we use the argument from [25]: we can equivalently write the singular value hard thresholding estimator as the solution to the following optimization problem:

[TABLE]

which implies that, with probability larger than $1-1/n$ ,

[TABLE]

where we used in the last line that $\|\boldsymbol{E}\|_{2\rightarrow 2}\leq\lambda/2$ . Since $\operatorname{rank}(\boldsymbol{\Theta}_{0})\leq k$ , we have proved (14).

Appendix E Proof of Theorem 1

Note that both $f_{0}=\rho_{n}W_{0}$ and $\tilde{f}_{\boldsymbol{\Theta}_{0}}$ are proportional to $\rho_{n}$ , so without loss of generality we can assume that $\rho_{n}=1$ . For $k\geq n/2$ , the result is a straightforward consequence of the second Sampling Lemma for Graphons of [28] stated in Proposition 1. Given any graphon $W_{0}\in\mathcal{W}^{+}[k]$ , one can always divide some of the steps into smaller steps in such a way that $W_{0}$ is a $2k$ –step graphon whose weights are all less than or equal to $1/k$ . Thus, we only need to prove the results for all graphons $W_{0}\in\mathcal{W}^{+}[k]$ with $32\leq k\leq n$ and such that its weights are all smaller or equal to $2/k$ .

Let ${\boldsymbol{\Theta}}_{0}^{\prime}$ be the matrix with entries $({\boldsymbol{\Theta}}_{0}^{\prime})_{ij}=W(\xi_{i},\xi_{j})$ for all $i,j$ . As opposed to $\boldsymbol{\Theta}_{0}$ , the diagonal entries of ${\boldsymbol{\Theta}}_{0}^{\prime}$ are not constrained to be null. By the triangle inequality, we have

[TABLE]

As the entries of $\boldsymbol{\Theta}_{0}$ coincide with those of ${\boldsymbol{\Theta}}_{0}^{\prime}$ outside the diagonal, the difference $\widetilde{f}_{\boldsymbol{\Theta}_{0}}-\widetilde{f}_{{\boldsymbol{\Theta}}_{0}^{\prime}}$ is null outside of a set of measure $1/n$ . Since $\|W_{0}\|_{\infty}\leq 1$ , $\operatorname{\mathbb{E}}[\delta_{\square}(\widetilde{f}_{\boldsymbol{\Theta}_{0}},\widetilde{f}_{{\boldsymbol{\Theta}}_{0}^{\prime}})]\leq 1/n$ . Thus, we only need to prove that

[TABLE]
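The diagonal correction used above can be checked numerically. The two empirical graphons are constant on the $n^{2}$ cells of side $1/n$ and differ only on the $n$ diagonal cells, so their $l_{1}$ distance (which dominates the cut distance) is at most $1/n$ when $\|W_{0}\|_{\infty}\leq 1$ . The sketch below, written in Python with $\rho_{n}=1$ , uses a hypothetical two-step graphon; the helper names and the numerical values are assumptions made for illustration only.

```python
import random


def step_graphon(Q, weights):
    """Return W(x, y) for the step graphon with block values Q and step lengths `weights`."""
    cuts = []
    acc = 0.0
    for w in weights:
        acc += w
        cuts.append(acc)

    def block(x):
        for a, c in enumerate(cuts):
            if x < c:
                return a
        return len(cuts) - 1

    return lambda x, y: Q[block(x)][block(y)]


def l1_gap_from_diagonal(W, xi):
    """L1 distance between the empirical graphons of Theta_0' and Theta_0.

    Theta_0 has a null diagonal whereas (Theta_0')_ii = W(xi_i, xi_i); each
    diagonal cell has area 1/n^2, so only n cells contribute.
    """
    n = len(xi)
    return sum(abs(W(x, x)) for x in xi) / n**2


# Hypothetical example: a 2-step graphon with values in [0, 1].
W0 = step_graphon([[0.9, 0.1], [0.1, 0.6]], [0.5, 0.5])
random.seed(1)
n = 20
xi = [random.random() for _ in range(n)]
gap = l1_gap_from_diagonal(W0, xi)
print(gap, 1.0 / n)
```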

We first need to build two suitable representations of $W_{0}$ and $\widetilde{f}_{{\boldsymbol{\Theta}}_{0}^{\prime}}$ in the quotient space $\widetilde{\mathcal{W}}^{+}$ .

As a first idea, one may want to define a representation $\widehat{W}$ of $\widetilde{f}_{{\boldsymbol{\Theta}}_{0}^{\prime}}$ that matches $W_{0}$ on the largest possible (with respect to the Lebesgue measure) Borel set. In fact, one can match the two representations everywhere except on a Borel set of measure of the order of $\sqrt{k/n}$ . This turns out to lead to a suboptimal bound of the order of $\sqrt{k/n}$ . In order to recover the correct logarithmic term, we refine the argument by showing that, for a suitable representation, the difference $\widehat{W}-W_{0}$ , when non-zero, is well approximated in cut distance by a $\lfloor\sqrt{k}\rfloor$ -step function which is zero except on a Borel set of measure much smaller than $\sqrt{k/(n\log(n))}$ . To prepare the proof, we carefully build the representations of $W_{0}$ and $\widetilde{f}_{{\boldsymbol{\Theta}}_{0}^{\prime}}$ .

Step 1: Construction of a suitable representation $W$ of $W_{0}$ in $\widetilde{\mathcal{W}}^{+}$ .

In the sequel, we denote $q_{1}:=\lfloor\sqrt{k}\rfloor$ . Here, we want to choose $W$ in such a way that a distortion of $W$ is well approximated in the cut norm by a $q_{1}$ –step kernel. We use the following lemma which is based on a variation of Szemerédi’s lemma. Let $\boldsymbol{Q}_{0}\in\mathbb{R}^{k\times k}_{\text{sym}}$ and $\phi_{0}:[0,1]\to[k]$ be associated to $W_{0}$ as in definition (18).

Lemma 5.

There exist a permutation $\pi$ of $[k]$ and a partition $\mathcal{P}=(P_{1},\ldots,P_{q_{1}})$ of $[k]$ made of successive intervals such that the following holds. Let ${\bf Q}$ be the matrix obtained from ${\bf Q}_{0}$ by jointly applying the permutation $\pi$ to its rows and its columns. Denote by $\phi=\pi\circ\phi_{0}$ , and for $a=1,\ldots,k$ , $\lambda_{a}:=\lambda(\phi^{-1}(a))$ . There are two matrices ${\bf Q}^{(ap)}$ and ${\bf Q}^{(ap,+)}\in[0,1]^{k\times k}$ that are $q_{1}$ -block-constant according to the partition $\mathcal{P}$ and that satisfy

[TABLE]

According to Lemma 5, there exist two $q_{1}$ -block constant matrices ${\bf Q}^{(ap)}$ and ${\bf Q}^{(ap,+)}$ that approximate ${\bf Q}$ well with respect to some weighted cut norm. As for (42), the weights are respectively $\lambda_{b}$ and $\sqrt{\lambda_{a}}$ whereas for (43), the weights are $\sqrt{\lambda_{a}}$ and $\sqrt{\lambda_{b}}$ . Informally, these weights arise for the following reason: writing $\widehat{\lambda}_{a}$ for the empirical weight of group $a$ in $\widehat{W}$ (see Step 2 for the definition), we have $\widehat{\lambda}_{a}-\lambda_{a}=O_{P}(\sqrt{\lambda_{a}/n})$ .

Invoking Lemma 5, we consider the graphons

[TABLE]

Obviously, $W$ is weakly isomorphic to $W_{0}$ .

Step 2: Construction of a suitable representation $\widehat{W}$ of $\widetilde{f}_{{\boldsymbol{\Theta}}_{0}^{\prime}}$ in the quotient space $\widetilde{\mathcal{W}}^{+}$ .

Recall that $\xi_{1},\ldots,\xi_{n}$ are the i.i.d. uniformly distributed random variables in the $W$ -random graph model (1) and that $\phi$ is defined in the previous step. For $a=1,\ldots,k$ , let

[TABLE]

be the (unobserved) empirical frequency of the group $a$ corresponding to a finer partition of $[0,1]$ given by $\phi$ . For $l=1,\ldots,q_{1}$ , let

[TABLE]

be the (unobserved) empirical frequency of the group $l$ corresponding to a coarser partition $P$ of $[0,1]$ given by $\mathcal{P}\circ\phi$ .

The relations $\sum_{a=1}^{k}\lambda_{a}=\sum_{a=1}^{k}\hat{\lambda}_{a}=1$ imply

[TABLE]

Consider a function $\psi:[0,1]\rightarrow[k]$ such that:

(i)

For all $a\in[k]$ , $\lambda(\{x,\ \psi(x)=\phi(x)=a\})=\widehat{\lambda}_{a}\wedge\lambda_{a}$ ,

(ii)

for all $l\in[q_{1}]$ , $\lambda\Big{[}\{x\ ,\psi(x)\in P_{l}\text{ and }\phi(x)\in P_{l}\}\Big{]}=\omega_{l}\wedge\widehat{\omega}_{l}$ ,

(iii)

for all $a\in[k]$ , $\lambda(\psi^{-1}(a))=\widehat{\lambda}_{a}$ .

Such a function $\psi$ exists. To see it, we first construct $\psi$ to satisfy (i) and (iii):

•

For each $a$ such that $\lambda_{a}>\widehat{\lambda}_{a}$ , conditions (i) and (iii) are trivially satisfied if we take $\psi^{-1}(a)$ to be any subset of $\phi^{-1}(a)$ of Lebesgue measure $\widehat{\lambda}_{a}$ . Then, there is a subset of $\phi^{-1}(a)$ of Lebesgue measure $\lambda_{a}-\widehat{\lambda}_{a}$ left non-assigned. Summing over all such $a$ , we see that there is a union of subsets with Lebesgue measure $m_{+}:=\sum_{a:\lambda_{a}>\widehat{\lambda}_{a}}(\lambda_{a}-\widehat{\lambda}_{a})$ left non-assigned.

•

For $a$ such that $\lambda_{a}<\widehat{\lambda}_{a}$ , we must have $\psi(x)=a$ for $x\in\phi^{-1}(a)$ to satisfy (i). On the other hand, to meet condition (iii) we additionally need to assign $\psi(x)=a$ for $x$ on a set of Lebesgue measure $\widehat{\lambda}_{a}-\lambda_{a}$ . Summing over all such $a$ , we additionally need to find a set of Lebesgue measure $m_{-}:=\sum_{a:\widehat{\lambda}_{a}>\lambda_{a}}(\widehat{\lambda}_{a}-\lambda_{a})$ to make such assignments. But this set is readily available as the union of non-assigned intervals for all $a$ such that $\lambda_{a}>\widehat{\lambda}_{a}$ since $m_{+}=m_{-}$ by virtue of (45).

Now, to ensure that condition (ii) is satisfied, we assign as a priority $\psi(x)$ to values belonging to the same partition element as $\phi(x)$ . Again, (45) ensures that this is possible.
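The existence argument for $\psi$ only involves the masses $\lambda(\{x:\ \phi(x)=a,\ \psi(x)=b\})$ , so it can be mimicked by a finite bookkeeping. The Python sketch below is a hypothetical greedy implementation of the two-stage assignment (same partition element first, anywhere afterwards) in exact rational arithmetic; the function name `overlap_matrix`, the encoding of $\mathcal{P}$ by a list of block labels, and the numerical weights are assumptions introduced for illustration.

```python
from fractions import Fraction


def overlap_matrix(lam, lam_hat, block):
    """m[a][b] = Lebesgue measure of {x : phi(x) = a, psi(x) = b}."""
    k = len(lam)
    m = [[Fraction(0)] * k for _ in range(k)]
    surplus = [Fraction(0)] * k  # part of phi^{-1}(a) not yet assigned
    deficit = [Fraction(0)] * k  # mass still missing from psi^{-1}(b)
    for a in range(k):
        m[a][a] = min(lam[a], lam_hat[a])  # condition (i)
        surplus[a] = lam[a] - m[a][a]
        deficit[a] = lam_hat[a] - m[a][a]

    def transfer(same_block_only):
        for a in range(k):
            for b in range(k):
                if same_block_only and block[a] != block[b]:
                    continue
                moved = min(surplus[a], deficit[b])
                m[a][b] += moved
                surplus[a] -= moved
                deficit[b] -= moved

    transfer(True)   # priority to the same partition element: condition (ii)
    transfer(False)  # leftover mass is assigned anywhere: condition (iii)
    return m


# Hypothetical example: k = 4 groups, q_1 = 2 partition elements, n = 10 points.
lam = [Fraction(1, 4)] * 4
lam_hat = [Fraction(4, 10), Fraction(3, 10), Fraction(2, 10), Fraction(1, 10)]
block = [0, 0, 1, 1]
m = overlap_matrix(lam, lam_hat, block)
measure_R = sum(m[a][b] for a in range(4) for b in range(4) if a != b)
print(measure_R)
```

In this example the off-diagonal mass, that is the measure of the set where $\phi\neq\psi$ , equals $m_{+}=m_{-}=1/5$ .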

Finally, define the graphons $\widehat{W}(x,y)=\boldsymbol{Q}_{\psi(x),\psi(y)}$ , $\widehat{W}_{1}(x,y)=\boldsymbol{Q}^{(ap)}_{\psi(x),\psi(y)}$ , and $\widehat{W}_{1}^{+}(x,y)=\boldsymbol{Q}^{(ap,+)}_{\psi(x),\psi(y)}$ where $\boldsymbol{Q}$ , $\boldsymbol{Q}^{(ap)}$ , and $\boldsymbol{Q}^{(ap,+)}$ are as in (44). Notice that in view of (iii) $\widehat{W}$ is weakly isomorphic to the empirical graphon $\widetilde{f}_{\boldsymbol{\Theta}^{\prime}_{0}}$ . Let $\mathcal{R}=\{x\ ,\phi(x)\neq\psi(x)\}$ . Since $W$ and $\widehat{W}$ match on $\mathcal{R}^{c}\times\mathcal{R}^{c}$ , the purpose of (i) is to minimize the Lebesgue measure of the support of $W-\widehat{W}$ . With properties (i) and (iii) alone, it would be possible to prove that $\operatorname{\mathbb{E}}[\|W-\widehat{W}\|_{\square}]\leq C\sqrt{k/n}$ as the Lebesgue measure of its support is at most of order $\sqrt{k/n}$ . We will improve this rate by a logarithmic term as (ii) will enforce that the cut norm of $W-\widehat{W}$ is much smaller than its Lebesgue measure.

Step 3: Control of the cut norm. Since $\delta_{\square}(\cdot,\cdot)$ is a metric on the quotient space $\widetilde{\mathcal{W}}^{+}$ ,

[TABLE]

By definition of $\psi$ , the two functions $W(x,y)$ and $\widehat{W}(x,y)$ are equal except possibly when either $x$ or $y$ belongs to $\mathcal{R}$ . As a consequence of the triangle inequality and of the symmetry of $W-\widehat{W}$ , we get

[TABLE]

First, we focus on $\mathbb{E}[\|(W-\widehat{W})|_{\mathcal{R}\times\mathcal{R}^{c}}\|_{\square}]$ , the second term being handled similarly at the end of the proof. For $a$ and $b$ in $[k]$ , we write $a\sim_{P}b$ (resp. $a\nsim_{P}b$ ) when $a$ and $b$ belong (resp. do not belong) to the same element of the partition $P$ . Define

[TABLE]

Obviously, we have $\mathcal{R}_{2}\subset\mathcal{R}$ . Property (ii) of $\psi$ implies that $\lambda(\mathcal{R}_{2})=\sum_{a=1}^{q_{1}}(\omega_{a}-\widehat{\omega}_{a})_{+}$ . We shall rely on the decomposition $W=W_{1}+(W-W_{1})$ and $\widehat{W}=\widehat{W}_{1}+(\widehat{W}-\widehat{W}_{1})$ . For any $x\in\mathcal{R}\setminus\mathcal{R}_{2}$ , we have by definition (44) of $W_{1}$ that $(W_{1}-\widehat{W}_{1})(x,y)=0$ . Together with the triangle inequality, this yields

[TABLE]

To control the first expression in the rhs, we simply bound the cut norm of the difference by its $l_{1}$ norm

[TABLE]

since $W_{1}$ and $\widehat{W}_{1}$ take values in $[0,1]$ . Then, relying on the fact that $n\widehat{\omega}_{a}$ is distributed as a Binomial random variable with parameters $(n,\omega_{a})$ and on Cauchy-Schwarz inequality, we get $\mathbb{E}\left|\omega_{a}-\widehat{\omega}_{a}\right|\leq\sqrt{\frac{\omega_{a}(1-\omega_{a})}{n}}$ and

[TABLE]

where we used again Cauchy-Schwarz in the last line. Let us turn to the second and third expressions in (47). To this end, we introduce a new kernel function $U$ . For $a=1,\ldots,k$ , define $\widehat{\lambda}^{\delta}_{a}=|\lambda_{a}-\widehat{\lambda}_{a}|$ and the functions $F_{\widehat{\lambda}^{\delta}}\;:\;[k]\rightarrow\left[0,\sum_{a}|\lambda_{a}-\widehat{\lambda}_{a}|\right]$ and $F_{\phi}\;:\;[k]\mapsto[0,1]$ by

[TABLE]

For any $a,b\in[k]$ , set $\widehat{\Pi}_{a,b}=[F_{\widehat{\lambda}^{\delta}}(a-1),F_{\widehat{\lambda}^{\delta}}(a))\times[F_{\phi}(b-1),F_{\phi}(b))$ and let $U$ be a $k\times k$ step kernel on $[0,\sum_{a}|\widehat{\lambda}_{a}-\lambda_{a}|]\times[0,1]$ defined by

[TABLE]

By definition of $\mathcal{R}$ and of the function $\psi$ , we have that for any $a\in[k]$ , $\lambda(\phi^{-1}(a)\cap\mathcal{R})=(\lambda_{a}-\widehat{\lambda}_{a})_{+}$ and $\lambda(\psi^{-1}(a)\cap\mathcal{R}^{c})=\lambda_{a}\wedge\widehat{\lambda}_{a}$ . As a consequence, the restriction of $(W-W_{1})$ to $\mathcal{R}\times\mathcal{R}^{c}$ is, up to a measure preserving bijection of its rows and of its columns, equal to the restriction of $U$ to the set $(\cup_{a:\ \lambda_{a}>\widehat{\lambda}_{a}}[F_{\widehat{\lambda}^{\delta}}(a-1),F_{\widehat{\lambda}^{\delta}}(a)))\times(\cup_{a}[F_{\phi}(a-1),F_{\phi}(a-1)+\widehat{\lambda}_{a}\wedge\lambda_{a}))$ . This entails that

[TABLE]

On the other hand, for any $(x,y)\in\mathcal{R}\times\mathcal{R}^{c}$ ,

[TABLE]

by the definition of $\mathcal{R}$ . In view of the definition of $\psi$ , for any $a\in[k]$ we have $\lambda(\psi^{-1}(a)\cap\mathcal{R})=(\widehat{\lambda}_{a}-\lambda_{a})_{+}$ . As a consequence, the restriction of $(\widehat{W}-\widehat{W}_{1})$ to $\mathcal{R}\times\mathcal{R}^{c}$ is, up to a measure preserving bijection of its rows and of its columns, equal to the restriction of $U$ to the set $(\cup_{a:\ \lambda_{a}<\widehat{\lambda}_{a}}[F_{\widehat{\lambda}^{\delta}}(a-1),F_{\widehat{\lambda}^{\delta}}(a)))\times(\cup_{a}[F_{\phi}(a-1),F_{\phi}(a-1)+\widehat{\lambda}_{a}\wedge\lambda_{a}))$ . This implies that $\|(\widehat{W}-\widehat{W}_{1})|_{\mathcal{R}\times\mathcal{R}^{c}}\|_{\square}\leq\|U\|_{\square}$ . Thus, we only have to control $\mathbb{E}[\|U\|_{\square}]$ .

Step 4: Control of $\mathbb{E}[\|U\|_{\square}]$ . Define the sets $\mathcal{B}_{1}:=\prod_{a=1}^{k}[0,|\widehat{\lambda}_{a}-\lambda_{a}|]$ and $\mathcal{B}_{2}:=\prod_{a=1}^{k}\left[0,\left|\lambda_{a}\right|\right]$ . Then, the cut norm of $U$ writes as

[TABLE]

since the supremum of a linear function on a convex set is achieved at an extremal point. The random variable $|\widehat{\lambda}_{a}-\lambda_{a}|$ is in expectation of the order $\sqrt{\lambda_{a}/n}$ . If we could replace each $|\widehat{\lambda}_{a}-\lambda_{a}|$ by $\sqrt{\lambda_{a}/n}$ in (51), then thanks to (42), we could prove that $\|U\|_{\square}$ is (up to a multiplicative constant) less than $\sqrt{k/(n\log(k))}$ . Unfortunately, if we directly applied Bernstein’s inequality or the bounded difference inequality to simultaneously control $|\widehat{\lambda}_{a}-\lambda_{a}|$ over all $a\in[k]$ or to simultaneously control $\sum_{a\in S,b\in T}\lambda_{b}|\widehat{\lambda}_{a}-\lambda_{a}|({\bf Q}_{ab}-{\bf Q}^{(ap)}_{ab})$ over all $S,T\subset[k]$ , we would lose at least a logarithmic factor.

To bypass this issue, we adapt Lemma 10.9 of [28], which is a key point in the proof of the Sampling Lemma for graphons (Lemma 10.5 in [28]). Given a bounded non-symmetric kernel $W\in\mathcal{W}_{\mathcal{X},\mathcal{Y}}$ , let us define the following one-sided version of the cut norm:

[TABLE]

where we take the supremum without any absolute value. As a consequence, the cut norm $\|W\|_{\square}$ is the maximum of $\|W\|^{+}_{\square}$ and $\|-W\|^{+}_{\square}$ .
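For a $k$ -step kernel with row weights $\alpha_{a}$ , column weights $\beta_{b}$ and values $\boldsymbol{Q}_{ab}$ , both suprema are attained on unions of steps, so the identity $\|W\|_{\square}=\max(\|W\|^{+}_{\square},\|-W\|^{+}_{\square})$ can be verified by enumeration. The following Python sketch does so in exact arithmetic on a hypothetical $3\times 3$ example; all names and values are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product


def bilinear(Q, alpha, beta, S, T):
    """W[S, T] = sum_{a in S, b in T} alpha_a * beta_b * Q_ab."""
    return sum(alpha[a] * beta[b] * Q[a][b] for a in S for b in T)


def subsets(k):
    """Enumerate all subsets of {0, ..., k-1} as lists of indices."""
    for mask in product((False, True), repeat=k):
        yield [a for a in range(k) if mask[a]]


def one_sided_cut_norm(Q, alpha, beta):
    """Supremum of W[S, T] over unions of steps, without absolute value."""
    k = len(Q)
    return max(bilinear(Q, alpha, beta, S, T) for S in subsets(k) for T in subsets(k))


def cut_norm(Q, alpha, beta):
    """Supremum of |W[S, T]| over unions of steps."""
    k = len(Q)
    return max(abs(bilinear(Q, alpha, beta, S, T)) for S in subsets(k) for T in subsets(k))


# Hypothetical signed step kernel with uniform weights.
Q = [[1, -2, 0], [-1, 3, -1], [0, -2, 2]]
neg_Q = [[-x for x in row] for row in Q]
alpha = beta = [Fraction(1, 3)] * 3
plus = one_sided_cut_norm(Q, alpha, beta)
minus = one_sided_cut_norm(neg_Q, alpha, beta)
print(plus, minus, cut_norm(Q, alpha, beta))
```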

Lemma 6.

Let $W\in\mathcal{W}_{[0,u],[0,v]}[k]$ and let $\boldsymbol{Q}\in\mathbb{R}^{k\times k}$ , $\phi_{1}:[0,u]\to[k]$ and $\phi_{2}:[0,v]\to[k]$ be associated to $W$ as in (33). For $a=1,\ldots,k$ , define $\alpha_{a}:=\lambda(\phi_{1}^{-1}(\{a\}))$ and $\beta_{a}:=\lambda(\phi_{2}^{-1}(\{a\}))$ . Given any subset $R\subset[k]$ , let

[TABLE]

Finally, we define for any $S,T\subset[k]$ , $W[S,T]:=\sum_{a\in S,b\in T}\alpha_{a}\beta_{b}\boldsymbol{Q}_{ab}\ .$ Then, for any integer $q$ with $1\leq q\leq k$ , we have

[TABLE]

Note that in contrast to Equation (51) where one considers a supremum of $2^{2k}$ sums, only $k^{2q}$ terms are involved in (53) at the price of an additive term of order $q^{-1/2}$ . The difficulty is that we will apply this lemma to $U$ for which these $k^{2q}$ terms will turn out to be random.

In the sequel, we fix $q=\lfloor\sqrt{k}\rfloor$ and apply Lemma 6 to $U$ . Then, we can take $u=v=1$ . Since $\sum_{a=1}^{k}\lambda_{a}=1$ and since we assumed at the beginning of the proof that the weights $\lambda_{a}$ are all smaller than $2/k$ , it follows that $(k\sum_{a=1}^{k}\lambda_{a}^{2})^{1/2}\leq\sqrt{2}$ . Let $M$ and $N$ denote the random variables $M:=\sum_{a=1}^{k}|\widehat{\lambda}_{a}-\lambda_{a}|$ and $N:=\left(\sum_{a=1}^{k}k|\widehat{\lambda}_{a}-\lambda_{a}|^{2}\right)^{1/2}$ . Both $M$ and $N$ are functions of the independent random variables $(\xi_{1},\ldots,\xi_{n})$ . Besides, if we change the value of one of the $\xi_{i}$ , the value of $M$ changes by at most $2/n$ and the value of $N$ changes by at most $\sqrt{2k}/n$ . As a consequence, we may apply the bounded difference inequality (Lemma (1)) to these two random variables. Then, with probability larger than $1-2\exp(-\sqrt{k}/\log(k))$ , one has

[TABLE]
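The bounded difference property of $M$ and $N$ used above is elementary: moving a single $\xi_{i}$ from one group to another changes two empirical frequencies by $1/n$ each. The Python sketch below checks this numerically; representing each $\xi_{i}$ by its group label $\phi(\xi_{i})$ , as well as the values of $k$ and $n$ , are assumptions made for illustration.

```python
import math
import random


def group_frequencies(labels, k):
    """Empirical frequencies lambda_hat_a of the k groups."""
    n = len(labels)
    counts = [0] * k
    for a in labels:
        counts[a] += 1
    return [c / n for c in counts]


def M_and_N(labels, lam):
    """M = sum_a |lambda_hat_a - lambda_a| and N = (k sum_a |lambda_hat_a - lambda_a|^2)^(1/2)."""
    k = len(lam)
    lam_hat = group_frequencies(labels, k)
    dev = [abs(h - l) for h, l in zip(lam_hat, lam)]
    return sum(dev), math.sqrt(k * sum(d * d for d in dev))


# Hypothetical setting: k equal-weight groups, n sampled labels phi(xi_i).
random.seed(0)
k, n = 5, 50
lam = [1.0 / k] * k
labels = [random.randrange(k) for _ in range(n)]
M, N = M_and_N(labels, lam)

# Change one coordinate and record the effect on M and N.
perturbed = list(labels)
perturbed[0] = (perturbed[0] + 1) % k
M2, N2 = M_and_N(perturbed, lam)
print(abs(M2 - M), 2 / n, abs(N2 - N), math.sqrt(2 * k) / n)
```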

In (54) - (55) we bound the expectation using that, since $\xi_{1},\ldots,\xi_{n}$ are i.i.d. uniformly distributed random variables, $n\widehat{\lambda}_{a}$ has a binomial distribution with parameters ( $n$ , $\lambda_{a}$ ) and the Cauchy-Schwarz inequality:

[TABLE]
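The Cauchy-Schwarz step for a binomial frequency, $\mathbb{E}|\widehat{\lambda}_{a}-\lambda_{a}|\leq\sqrt{\lambda_{a}(1-\lambda_{a})/n}$ , can be checked against the exact mean absolute deviation. The Python sketch below does this for a few hypothetical values of $(n,\lambda_{a})$ ; the function names are illustrative.

```python
import math


def binomial_mean_abs_deviation(n, p):
    """Exact E|X/n - p| for X ~ Binomial(n, p)."""
    return sum(
        math.comb(n, x) * p**x * (1 - p) ** (n - x) * abs(x / n - p)
        for x in range(n + 1)
    )


def cauchy_schwarz_bound(n, p):
    """sqrt(Var(X/n)) = sqrt(p (1 - p) / n), an upper bound on E|X/n - p|."""
    return math.sqrt(p * (1 - p) / n)


# Hypothetical parameter values.
for n in (5, 50, 200):
    for p in (0.01, 0.1, 0.5):
        print(n, p, binomial_mean_abs_deviation(n, p), cauchy_schwarz_bound(n, p))
```

Summing the bound over $k$ groups of equal weight $1/k$ gives at most $\sqrt{k/n}$ , which is the order of magnitude of $\mathbb{E}[M]$ used in the proof.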

Bound (55) and $(k\sum_{a=1}^{k}\lambda_{a}^{2})^{1/2}\leq\sqrt{2}$ imply that for $U$ , with probability larger than $1-2\exp(-\sqrt{k}/\log(k))$ ,

[TABLE]

Fix any two subsets $R_{1},R_{2}\subset[k]$ of size less than or equal to $q$ . In view of (53), one needs to control the following random variable

[TABLE]

This is done in the following lemma:

Lemma 7.

Let $R_{1},R_{2}$ be two subsets of $[k]$ of size less than or equal to $q$ and $Z_{R_{1},R_{2}}$ given by (57). Then, we have that with probability larger than $1-(1+2k)\exp(-\sqrt{k}/\log(k))$ ,

[TABLE]

Now, it follows from Lemma 6 together with (56) and Lemma 7 that, with probability larger than $1-(3+2k)\exp(-\sqrt{k}/\log(k))$ ,

[TABLE]

Controlling analogously $\|-U\|^{+}_{\square}$ , we conclude that there exists an event $\mathcal{A}$ of probability larger than $1-10\exp(-\sqrt{k}/\log(k))$ such that, on $\mathcal{A}$ ,

[TABLE]

To finish the control of $\operatorname{\mathbb{E}}[\|U\|_{\square}]$ , we use the rough bound $\|U\|_{\square}\leq\|U\|_{1}\leq\sum_{a=1}^{k}|\widehat{\lambda}_{a}-\lambda_{a}|$ on the complementary event $\bar{\mathcal{A}}$ .

[TABLE]

where we use (54). Now, using the decomposition (47), (48) and (50), we can conclude that

[TABLE]

The following lemma gives a corresponding bound on the second term $\left\|(W-\widehat{W})|_{\mathcal{R}\times\mathcal{R}}\right\|_{\square}$ in (46). The proof is somewhat analogous to that of the control of $\left\|(W-\widehat{W})|_{\mathcal{R}\times\mathcal{R}^{c}}\right\|_{\square}$ and is postponed to the end of the section.

Lemma 8.

We have

[TABLE]

In view of (46), we have proved Theorem 1.∎

Proof of Lemma 5.

For $a\in[k]$ , we denote $(\lambda_{0})_{a}=\lambda(\phi_{0}^{-1}(a))$ and $u_{a}=\frac{\sqrt{(\lambda_{0})_{a}}}{\sum_{b}\sqrt{(\lambda_{0})_{b}}}$ . For any $b\in[k]$ , define the cumulative distribution functions $F_{0}(b)=\sum_{a=1}^{b}(\lambda_{0})_{a}$ and $F_{1}(b)=\sum_{a=1}^{b}u_{a}$ . For $a,b\in[k]$ , let $(\Pi_{d})_{ab}=[F_{0}(a-1),F_{0}(a))\times[F_{1}(b-1),F_{1}(b))$ and $(\Pi^{+}_{d})_{ab}=[F_{1}(a-1),F_{1}(a))\times[F_{1}(b-1),F_{1}(b))$ . In order to construct a suitable $q_{1}$ -step kernel we first consider the (not necessarily symmetric) kernels $W_{d}$ and $W^{+}_{d}$ defined by

[TABLE]

In comparison to $W_{0}$ , the length of the steps in $W_{d}$ and $W_{d}^{+}$ has been modified.

Lemma 9.

Let $W\in\mathcal{W}_{[0,1],[0,1]}$ be a k-step kernel defined by

[TABLE]

where $\boldsymbol{Q}\in[0,1]^{k\times k}$ and $(S_{1},\dots,S_{k})$ and $(T_{1},\dots,T_{k})$ are two partitions of $[0,1]$ into a finite number of measurable sets. For any integer $q_{0}\geq 2$ , there exists a $q_{0}$ –step kernel $W^{(ap)}\in\mathcal{W}^{+}_{[0,1],[0,1]}$ satisfying

(i)

for any $(a,b)\in[k]^{2}$ , $W^{(ap)}$ is constant on $S_{a}\times T_{b}$ and

(ii)

$\left\|W-W^{(ap)}\right\|_{\square}\leq\frac{C}{\sqrt{\log(q_{0})}}$ .

The second property (ii) is just the consequence of the weak Regularity Lemma for kernels [19] (see also Corollary 9.13 in [28]). The first property, (i), follows from the explicit construction of the approximate kernel by Kannan and Frieze (see the proof of Lemma 9.10 in [28]). For the sake of completeness, we give the details at the end of this section.

Fix $q_{0}=\lfloor k^{1/4}\rfloor$ . Note that $q_{0}\geq 2$ since we assume that $k\geq 16$ . We denote by $W_{d}^{(ap)}$ and $W^{(ap,+)}_{d}$ the $q_{0}$ –step kernels given by Lemma 9 to respectively approximate $W_{d}$ and $W^{+}_{d}$ . By virtue of Property $(i)$ , there exist two matrices $\boldsymbol{Q}^{(ap)}_{0}$ and $\boldsymbol{Q}_{0}^{(ap,+)}$ in $[0,1]^{k\times k}$ such that

[TABLE]

There exist two partitions $\mathcal{P}_{d}$ and $\mathcal{P}_{d}^{+}$ of $[k]$ such that $\boldsymbol{Q}^{(ap)}_{0}$ is block constant according to $\mathcal{P}_{d}$ and $\boldsymbol{Q}^{(ap,+)}_{0}$ is block constant according to $\mathcal{P}^{+}_{d}$ . Let $\mathcal{P}^{*}$ be the coarsest partition that refines both $\mathcal{P}_{d}$ and $\mathcal{P}_{d}^{+}$ . As a consequence, $\mathcal{P}^{*}$ is made of at most $q_{0}^{2}\leq q_{1}$ subsets. By possibly refining $\mathcal{P}^{*}$ , we may assume without loss of generality that $\mathcal{P}^{*}=(P^{*}_{1},\ldots,P^{*}_{q_{1}})$ is made of exactly $q_{1}$ elements. Let $\pi$ be a permutation of $[k]$ transforming $\mathcal{P}^{*}$ into a partition $\mathcal{P}=(P_{1},\ldots,P_{q_{1}})$ with $P_{a}=\{\pi(b),b\in P^{*}_{a}\}$ made of consecutive intervals. Denoting by $\boldsymbol{\Pi}$ the corresponding permutation matrix, we finally take

[TABLE]

Now we are ready to prove (42) and (43). Recall that we denote $\phi=\pi\circ\phi_{0}$ and $\lambda_{a}:=\lambda(\phi^{-1}(a))$ for $a\in[k]$ . Define the sets $\mathcal{B}_{1}:=\prod_{a=1}^{k}[0,u_{\pi(a)}]$ and $\mathcal{B}_{2}:=\prod_{a=1}^{k}[0,\lambda_{a}]$ . Since $W_{d}-W_{d}^{(ap)}$ is a $k$ –step function, its cut norm writes as

[TABLE]

since the supremum is achieved at an extremal point of the convex set, and in the last inequality we use property (ii) of Lemma 9. Now (59) and the definition of $u_{\pi(a)}$ imply

[TABLE]

by Cauchy-Schwarz inequality. We have proved (42). The second inequality (43) is derived similarly.

∎

Proof of Lemma 9.

We adapt the proof of the weak Regularity Lemma for symmetric kernels [28, Lemma 9.9] to non symmetric ones. We use the following extension of Lemma 9.11(a) in [28].

Lemma 10.

For every $W\in\mathcal{W}_{[0,1],[0,1]}[k]$ such that

[TABLE]

where $\boldsymbol{Q}\in\mathbb{R}^{k\times k}$ and $\mathcal{P}=\left\{\left(S_{1},\dots,S_{k}\right),\left(T_{1},\dots,T_{k}\right)\right\}$ are two partitions of $[0,1]$ into a finite number of measurable sets, there are two sets $\mathcal{A},\mathcal{B}\subset[k]$ and a real number $0\leq a\leq\max_{a,b}|\boldsymbol{Q}_{ab}|$ such that, for $S^{\prime}=\cup_{a\in\mathcal{A}}S_{a}\quad\text{and }\quad T^{\prime}=\cup_{b\in\mathcal{B}}T_{b}$ ,

[TABLE]

Now we apply Lemma 10 repeatedly, to get pairs of sets $S^{\prime}_{i},T^{\prime}_{i}$ and real numbers $a_{i}$ such that for any positive integer $j$ , $W_{j}=W-\sum_{i=1}^{j}a_{i}\mathds{1}_{S^{\prime}_{i}\times T^{\prime}_{i}}$ we have

[TABLE]

Fix some integer $k_{0}>0$ . Since the right-hand side of the above equation remains non-negative, there exists $0\leq i<k_{0}$ with $\|W_{i}\|^{2}_{\square}\leq 1/k_{0}$ . Now putting $a_{l}=0$ for $l>i$ we get that for any $W\in\mathcal{W}_{[0,1],[0,1]}[k]$ and any $k_{0}\geq 1$ there are $k_{0}$ pairs of subsets $S^{\prime}_{i},T^{\prime}_{i}\subset[0,1]$ and $k_{0}$ real numbers $a_{i}$ such that

[TABLE]

Note that the approximation $W^{(ap)}=\sum_{i=1}^{k_{0}}a_{i}\mathds{1}_{S^{\prime}_{i}\times T^{\prime}_{i}}$ is a step function with at most $2^{k_{0}}$ steps and $a_{i}\geq 0$ for all $i$ . On the other hand, by construction, for any $(a,b)\in[k]^{2}$ , $W^{(ap)}$ is constant on all sets of the form $S_{a}\times T_{b}$ . We conclude by taking $k_{0}=\lfloor\log(q_{0})/\log(2)\rfloor$ . ∎

Proof of Lemma 10.

This lemma is proved in [28, Lemma 9.11] for symmetric kernels. For the reader's convenience, we give the details here. Let $W$ be a $k$ –step kernel and let $\left(S_{1},\dots,S_{k}\right),\left(T_{1},\dots,T_{k}\right)$ be two measurable partitions of $[0,1]$ such that $W$ is constant on each set $S_{i}\times T_{j}$ . Relying on a convexity argument as in the proof of Lemma 5, the cut norm is achieved for measurable sets $S$ and $T$ that are unions of $S_{i}$ and $T_{j}$ respectively, that is

[TABLE]

where $S=\cup_{a\in\mathcal{A}}S_{a}$ and $T=\cup_{b\in\mathcal{B}}T_{b}$ with $\mathcal{A}$ , $\mathcal{B}\subset[k]$ . Let $\mathbf{a}=\frac{1}{\lambda(S)\lambda(T)}\|W\|_{\square}$ . Then, we have

[TABLE]

which completes the proof. ∎
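The mechanism behind Lemmas 9 and 10 can be illustrated on a small step kernel. The Python sketch below assumes that the displayed conclusion of Lemma 10 is the usual energy decrement of [28, Lemma 9.11], namely that subtracting $a\mathds{1}_{S^{\prime}\times T^{\prime}}$ decreases the squared $l_{2}$ norm by at least $\|W\|_{\square}^{2}$ ; it uses the signed coefficient $a=W(S^{\prime},T^{\prime})/(\lambda(S^{\prime})\lambda(T^{\prime}))$ , a brute-force search over unions of steps, and a hypothetical $3\times 3$ kernel. All function names are illustrative.

```python
from fractions import Fraction
from itertools import product


def subsets(k):
    """Enumerate all subsets of {0, ..., k-1} as lists of indices."""
    for mask in product((False, True), repeat=k):
        yield [a for a in range(k) if mask[a]]


def best_rectangle(Q, s, t):
    """Return (value, A, B) maximising |sum_{a in A, b in B} s_a t_b Q_ab|."""
    k = len(Q)
    best = (Fraction(0), [], [])
    for A in subsets(k):
        for B in subsets(k):
            val = sum(s[a] * t[b] * Q[a][b] for a in A for b in B)
            if abs(val) > abs(best[0]):
                best = (val, A, B)
    return best


def energy(Q, s, t):
    """Squared L2 norm of the step kernel with values Q and step lengths s, t."""
    k = len(Q)
    return sum(s[a] * t[b] * Q[a][b] ** 2 for a in range(k) for b in range(k))


def decrement_step(Q, s, t):
    """Subtract a * 1_{S' x T'}; return the new values and the cut norm of Q."""
    val, A, B = best_rectangle(Q, s, t)
    new_Q = [row[:] for row in Q]
    if val == 0:
        return new_Q, Fraction(0)
    area = sum(s[a] for a in A) * sum(t[b] for b in B)
    coef = val / area  # signed average of the kernel over S' x T'
    for a in A:
        for b in B:
            new_Q[a][b] -= coef
    return new_Q, abs(val)


# Hypothetical 3-step kernel with uniform step lengths.
s = t = [Fraction(1, 3)] * 3
Q0 = [[Fraction(1), Fraction(0), Fraction(1, 2)],
      [Fraction(0), Fraction(1), Fraction(1, 4)],
      [Fraction(1, 2), Fraction(1, 4), Fraction(0)]]
Q_cur = Q0
for step in range(3):
    Q_next, cut = decrement_step(Q_cur, s, t)
    print(step, cut, energy(Q_cur, s, t), energy(Q_next, s, t))
    Q_cur = Q_next
```

Since the energy is non-negative and starts below one, the cut norm of the residual must fall below $k_{0}^{-1/2}$ within $k_{0}$ steps, which is the counting argument used in the proof of Lemma 9.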

Proof of Lemma 6.

This proof closely follows that of Lemma 10.9 in [28]. It is easy to see that

[TABLE]

so we only need to bound these expressions. Let $Q$ and $Q^{\prime}$ be independent uniformly chosen $q$ -subsets of $[k]$ and let $\operatorname{\mathbb{E}}_{Q}$ (resp. $\operatorname{\mathbb{E}}_{Q^{\prime}}$ ) denote the expectation with respect to $Q$ (resp. $Q^{\prime}$ ). We shall prove that, for any $S,T\subset[k]$

[TABLE]

By symmetry, this will imply

[TABLE]

so that gathering both inequalities yields

[TABLE]

Since the above expectation is less than or equal to $\sup_{R_{i},\ |R_{i}|\leq q}W\left[R_{2}^{r,W},R_{1}^{l,W}\right]$ , this will conclude the proof. Thus, we only have to show (62). Note that $W[S,T]\leq W[T^{r,W},T]$ implies that it suffices to prove

[TABLE]

Let us denote $Z$ the above difference of expectations. For any $a\in[k]$ , write $B_{a}=\sum_{b\in T}\beta_{b}\boldsymbol{Q}_{ab}$ and $A_{a}=\sum_{b\in T\cap Q}\beta_{b}\boldsymbol{Q}_{ab}$ . By the definition (52), we have that $B_{a}$ is non-negative for $a\in T^{r,W}$ and $B_{a}\leq 0$ if $a\not\in T^{r,W}$ . In the same way, $A_{a}>0$ for $a\in(Q\cap T)^{r,W}$ and $A_{a}\leq 0$ for $a\notin(Q\cap T)^{r,W}$ . Denoting $\operatorname{\mathbb{P}}_{Q}$ the probability with respect to $Q$ , we obtain

[TABLE]

Now, using $\operatorname{\mathbb{E}}_{Q}[A_{a}]=qB_{a}/k$ , it follows from the Chebyshev inequality that, for $a\in T^{r,W}$ , we have $\operatorname{\mathbb{P}}_{Q}[A_{a}<0]\leq\operatorname{Var}_{Q}[A_{a}]/\operatorname{\mathbb{E}}_{Q}^{2}[A_{a}]$ . Since a probability is smaller or equal to one, it follows that $\operatorname{\mathbb{P}}_{Q}[A_{a}<0]\leq\sqrt{\operatorname{Var}_{Q}[A_{a}]}/|\operatorname{\mathbb{E}}_{Q}[A_{a}]|$ . Similarly, for $a\notin T^{r,W}$ we also have that $\operatorname{\mathbb{P}}_{Q}[A_{a}>0]\leq\sqrt{\operatorname{Var}_{Q}[A_{a}]}/|\operatorname{\mathbb{E}}_{Q}[A_{a}]|$ . Coming back to $Z$ , this yields

[TABLE]

Working out the variance, we get $\operatorname{Var}_{Q}[A_{a}]\leq\tfrac{q}{k}\sum_{b\in T}\beta^{2}_{b}\boldsymbol{Q}^{2}_{ab}\leq q(\sum_{b\in[k]}\beta^{2}_{b})/k$ , which concludes the proof. ∎
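The inequality $W[S,T]\leq W[T^{r,W},T]$ used at the start of this proof reflects the fact that, for a fixed column set $T$ , the best row set keeps exactly the rows with positive partial sum $B_{a}=\sum_{b\in T}\beta_{b}\boldsymbol{Q}_{ab}$ . The Python sketch below checks this by enumeration. Since definition (52) is not reproduced in this section, the function `best_rows` is an assumption inferred from the sign property of $B_{a}$ stated in the proof, and the matrix and weights are hypothetical.

```python
from fractions import Fraction
from itertools import product


def subsets(k):
    """Enumerate all subsets of {0, ..., k-1} as lists of indices."""
    for mask in product((False, True), repeat=k):
        yield [a for a in range(k) if mask[a]]


def W_rect(Q, alpha, beta, S, T):
    """W[S, T] = sum_{a in S, b in T} alpha_a * beta_b * Q_ab."""
    return sum(alpha[a] * beta[b] * Q[a][b] for a in S for b in T)


def best_rows(Q, beta, T):
    """Rows a with B_a = sum_{b in T} beta_b Q_ab > 0 (assumed form of T^{r,W})."""
    return [a for a in range(len(Q)) if sum(beta[b] * Q[a][b] for b in T) > 0]


# Hypothetical signed 4-step kernel with non-uniform weights.
Q = [[2, -1, 0, 3], [-2, 1, 1, -1], [0, -3, 2, 1], [1, 1, -2, -2]]
alpha = [Fraction(1, 10), Fraction(2, 10), Fraction(3, 10), Fraction(4, 10)]
beta = [Fraction(1, 4)] * 4
k = len(Q)
for T in subsets(k):
    brute = max(W_rect(Q, alpha, beta, S, T) for S in subsets(k))
    assert brute == W_rect(Q, alpha, beta, best_rows(Q, beta, T), T)
print("row selection is optimal for every T")
```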

Proof of Lemma 7.

Note that in (57), the definition of $Z_{R_{1},R_{2}}$ , the set $R_{2}^{r,U}$ is deterministic whereas the set $R_{1}^{l,U}$ only depends on $(\widehat{\lambda}_{a})_{a\in R_{1}}$ . We can upper bound $Z_{R_{1},R_{2}}$ in the following way:

[TABLE]

where we use $\left|\sum_{b}\lambda_{b}(\boldsymbol{Q}_{ab}-\boldsymbol{Q}^{ad}_{ab})\right|\leq 1$ . We set

[TABLE]

Conditionally to $(\widehat{\lambda}_{a})_{a\in R_{1}}$ , $T_{R_{1},R_{2}}$ is distributed as a function of $n-n\sum_{a\in R_{1}}\widehat{\lambda}_{a}$ i.i.d. random variables $\xi^{\prime}_{i}$ such that $\operatorname{\mathbb{P}}[\xi^{\prime}=a]=\lambda_{a}/(1-\sum_{a\in R_{1}}\lambda_{a})$ for any $a\in[k]\setminus R_{1}$ . Besides, if we change the values of one of these $\xi^{\prime}_{i}$ the value of this expression changes by at most $2/n$ . It then follows from the bounded difference inequality (Lemma (1)) that, for any $t>0$

[TABLE]

Let us bound this conditional expectation:

[TABLE]

Now, using Cauchy-Schwarz inequality, we have

[TABLE]

where we used that $\lambda_{b}\leq 2/k$ , $|R_{1}|\leq q\leq k^{1/2}$ and $k\geq 8$ . The supremum in (67) is achieved for subsets ( $S^{*},T^{*}$ ) such that for all $a\in S^{*}$ , $\sum_{b\in T^{*}}\lambda_{b}(\boldsymbol{Q}_{ab}-\boldsymbol{Q}^{ad}_{ab})$ is non-negative (otherwise this contradicts the optimality of $S^{*},T^{*}$ ). As a consequence, we can plug the upper bounds on $\operatorname{\mathbb{E}}\left[\left|\widehat{\lambda}_{a}-\lambda_{a}\right|\Big{|}(\widehat{\lambda}_{c})_{c\in R_{1}}\right]$ into (67):

[TABLE]

where we used the property (42) of $\boldsymbol{Q}^{ad}$ . Coming back to (66) and integrating the deviation inequality with respect to $(\widehat{\lambda}_{a})_{a\in R_{1}}$ , we conclude that, for any $t>0$

[TABLE]

Fixing $t=2\log(k)q+\sqrt{k}/\log(k)$ and taking a union bound over all possible $R_{1}$ , $R_{2}$ , we derive that

[TABLE]

on an event of probability higher than $1-\exp(-\sqrt{k}/\log(k))$ .

Next we bound $\max_{a=1,\dots k}|\widehat{\lambda}_{a}-\lambda_{a}|$ . Recall that $n\widehat{\lambda}_{a}$ has a binomial distribution with parameters ( $n$ , $\lambda_{a}$ ) and $\lambda_{a}\leq 2/k$ . For any $a\in[k]$ , applying Bernstein’s inequality to $|\widehat{\lambda}_{a}-\lambda_{a}|$ we get

[TABLE]

Taking $t=C\sqrt{n/\log(k)}$ (for a suitable constant $C>0$ ) and applying the union bound, we derive that with probability larger than $1-2k\exp(-\sqrt{k}/\log(k))$

[TABLE]

The bound (68) together with (69) imply the statement of Lemma 7. ∎

Proof of Lemma 8.

As the control of $(W-\widehat{W})|_{\mathcal{R}\times\mathcal{R}}$ is quite similar to that of $(W-\widehat{W})|_{\mathcal{R}\times\mathcal{R}^{c}}$ , we only sketch the main steps. Relying on the graphon $W_{1}^{+}$ (defined in (44)), we have the following decomposition:

[TABLE]

Since $(W_{1}^{+}-\widehat{W}_{1}^{+})(x,y)$ is zero except if $x\in\mathcal{R}_{2}$ or $y\in\mathcal{R}_{2}$ , we bound the first expression by its $l_{1}$ norm as for $W_{1}-\widehat{W}_{1}$ :

[TABLE]

The two last expressions in (70) are bounded by the cut norm of a kernel $V$ defined as follows. For any $a,b\in[k]$ , define $\widetilde{\Pi}_{a,b}=[F_{\widehat{\lambda}^{\delta}}(a-1),F_{\widehat{\lambda}^{\delta}}(a))\times[F_{\widehat{\lambda}^{\delta}}(b-1),F_{\widehat{\lambda}^{\delta}}(b))$ where $F_{\widehat{\lambda}^{\delta}}(.)$ has been defined in (49). Let $V$ be the $k\times k$ step kernel on $\left[0,\sum_{a}|\widehat{\lambda}_{a}-\lambda_{a}|\right]^{2}$ given by

[TABLE]

Now, as for the restrictions of $W-W_{1}$ and $\widehat{W}-\widehat{W}_{1}$ to $\mathcal{R}\times\mathcal{R}^{c}$ , we have

[TABLE]

Thus, it boils down to controlling $\mathbb{E}\left[\|V\|_{\square}\right]$ . Since $V$ is a $k$ –step kernel, its cut norm writes as

[TABLE]

As for the kernel $U$ in the main proof, we rely on Lemma 6. The random variables $\sum_{a}|\widehat{\lambda}_{a}-\lambda_{a}|$ and $(\sum_{a}|\widehat{\lambda}_{a}-\lambda_{a}|^{2})^{1/2}$ are controlled as in (54) and (55).

Fix any two subsets $R_{1},R_{2}\subset[k]$ of size less than or equal to $q$ and define

[TABLE]

The set $R_{1}^{l,V}$ only depends on $(\widehat{\lambda}_{a})_{a\in R_{1}}$ and $R_{2}^{r,V}$ only depends on $(\widehat{\lambda}_{a})_{a\in R_{2}}$ . We have

[TABLE]

since $\sum_{a\in[k]}|\widehat{\lambda}_{a}-\lambda_{a}|\leq 2$ . We set

[TABLE]

Write $R:=R_{1}\cup R_{2}$ and $\widehat{\lambda}_{\{R\}}:=(\widehat{\lambda}_{a})_{a\in R}$ . Conditionally to $\widehat{\lambda}_{\{R\}}$ , $T_{R_{1},R_{2}}$ is a function of $n-n\sum_{a\in R}\widehat{\lambda}_{a}$ independent random variables. Besides, if we change the values of one of these independent random variables the value of $T_{R_{1},R_{2}}$ changes by at most $4/n$ . Hence, the bounded difference inequality enforces that, for any $t>0$ ,

[TABLE]

The conditional expectation is upper bounded by

[TABLE]

Here, unfortunately, we cannot directly replace $\operatorname{\mathbb{E}}\big{[}|\widehat{\lambda}_{a}-\lambda_{a}||\widehat{\lambda}_{b}-\lambda_{b}|\big{|}\widehat{\lambda}_{\{R\}}\big{]}$ by an upper bound of it because this expression does not factorize. We shall prove that $\operatorname{\mathbb{E}}\big{[}|\widehat{\lambda}_{a}-\lambda_{a}||\widehat{\lambda}_{b}-\lambda_{b}|\big{|}\widehat{\lambda}_{\{R\}}\big{]}$ is, up to a small loss, close to a product of expectations.

Write $N:=n-n\sum_{c\in R}\widehat{\lambda}_{c}$ , $\lambda_{R}:=\sum_{c\in R}\lambda_{c}$ and $\widehat{\lambda}_{R}=\sum_{c\in R}\widehat{\lambda}_{c}$ . Note that $n\widehat{\lambda}_{R}$ has a binomial distribution with parameters ( $n$ , $\lambda_{R}$ ). Applying Bernstein’s inequality to $|\widehat{\lambda}_{R}-\lambda_{R}|$ we get

[TABLE]

Let $\mathcal{R}=\left\{|\widehat{\lambda}_{R}-\lambda_{R}|\leq\frac{1}{\sqrt{n\log(k)}}\right\}$ . Taking $t=\sqrt{n/\log(k)}$ in (75) we have that

[TABLE]

In what follows we assume that the event $\mathcal{R}$ is true. Take any two distinct elements $a$ and $b$ of $[k]\setminus R$ . We shall prove that the conditional expectations $\operatorname{\mathbb{E}}\left[\left|\widehat{\lambda}_{a}-\lambda_{a}\right|\left|\widehat{\lambda}_{b}-\lambda_{b}\right|\Big{|}\widehat{\lambda}_{\{R\}}\right]$ are close to the products $\operatorname{\mathbb{E}}\left[\left|\widehat{\lambda}_{a}-\lambda_{a}\right|\Big{|}\widehat{\lambda}_{\{R\}}\right]\operatorname{\mathbb{E}}\left[\left|\widehat{\lambda}_{b}-\lambda_{b}\right|\Big{|}\widehat{\lambda}_{\{R\}}\right]$ . It is easy to see that conditionally on $(\widehat{\lambda}_{\{R\}},\widehat{\lambda}_{a})$ , $n\widehat{\lambda}_{b}$ follows the Binomial distribution with parameters $((N-n\widehat{\lambda}_{a}),\lambda_{b}/(1-\lambda_{R}-\lambda_{a}))$ . On the other hand, conditionally on $\widehat{\lambda}_{\{R\}}$ , $n\widehat{\lambda}_{b}$ follows the Binomial distribution with parameters $(N,\lambda_{b}/(1-\lambda_{R}))$ . Let $z_{1},z_{2},\ldots,$ be a sequence of independent Bernoulli random variables with parameters $\lambda_{b}/(1-\lambda_{a}-\lambda_{R})$ , $w_{1},w_{2}\ldots,$ be an independent sequence of Bernoulli random variables with parameters $(1-\lambda_{a}-\lambda_{R})/(1-\lambda_{R})$ and $v_{1},v_{2},\ldots,$ be an independent sequence of Bernoulli random variables with parameters $\lambda_{b}/(1-\lambda_{R})$ . We define the following random variables:

[TABLE]

where we use $\lambda_{c}\leq 2/k$ and $|R|\leq 2\sqrt{k}$ . It is easy to see that $X$ follows the Binomial distribution with parameters $(N-n\widehat{\lambda}_{a})$ and $\lambda_{b}/(1-\lambda_{R}-\lambda_{a})$ and $Y$ follows the Binomial distribution with parameters $N$ and $\lambda_{b}/(1-\lambda_{R})$ . Hence, we have that

[TABLE]
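As a reading aid, the coupling rests on a standard thinning identity, which follows only from the Bernoulli parameters stated above and the independence of the sequences $(z_{i})$ and $(w_{i})$ : the product $z_{i}w_{i}$ is again a Bernoulli variable, with the same parameter as $v_{i}$ .

$$\operatorname{\mathbb{P}}[z_{i}w_{i}=1]=\frac{\lambda_{b}}{1-\lambda_{a}-\lambda_{R}}\cdot\frac{1-\lambda_{a}-\lambda_{R}}{1-\lambda_{R}}=\frac{\lambda_{b}}{1-\lambda_{R}}=\operatorname{\mathbb{P}}[v_{i}=1].$$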

Relying on our coupling between $X$ and $Y$ , we obtain

[TABLE]

On the other hand, conditionally on $\widehat{\lambda}_{\{R\}}$ , $n\widehat{\lambda}_{a}$ follows the Binomial distribution with parameters $(N,\lambda_{a}/(1-\lambda_{R}))$ so that Cauchy-Schwarz inequality implies

[TABLE]

where we use that $\lambda_{a}\leq 2/k$ and the definition of the event $\mathcal{R}$ . Similarly we compute

[TABLE]

Plugging (77 – 79) into (76) we get

[TABLE]

where we use $\lambda_{b},\lambda_{a}\leq 2/k$ . For $a=b$ , (78) implies that the above difference is of order $(kn)^{-1}$ . Going back to (74), we obtain that

[TABLE]

Let $S^{*}$ and $T^{*}$ be two sets maximizing the above expression. Then, for all $a\in S^{*}$ we have that $\sum_{b\in T^{*}}\operatorname{\mathbb{E}}\big{[}|\widehat{\lambda}_{b}-\lambda_{b}|\big{|}\widehat{\lambda}_{R}\big{]}(\boldsymbol{Q}_{ab}-\boldsymbol{Q}^{ad,+}_{ab})$ is non-negative. As a consequence, using (78), we have that

[TABLE]

as soon as the event $\mathcal{R}$ holds. The same reasoning and $|\boldsymbol{Q}_{ab}-\boldsymbol{Q}^{ad,+}_{ab}|\leq 2$ lead to

[TABLE]

as soon as the event $\mathcal{R}$ holds. Going back to (73) and integrating the deviation inequality with respect to $\widehat{\lambda}_{\{R\}}$ , we conclude that

[TABLE]

where we use $\operatorname{\mathbb{P}}(\mathcal{R})\geq 1-2e^{-\sqrt{k}/\log(k)}$ . From this point on, the argument is identical to the main proof: we fix $t=2\log(k)q+\sqrt{k}/\log(k)$ and take a union bound over all possible $R_{1}$ and $R_{2}$ to derive that

[TABLE]

on an event of probability higher than $1-3\exp(-\sqrt{k}/\log(k))$ . Then, as in the main proof, Lemma 6 together with (56) and (69) implies that $\|V\|^{+}_{\square}\leq C\sqrt{k/(n\log(k))}$ with probability larger than $1-(5+2k)\exp(-\sqrt{k}/\log(k))$ . By symmetry, we can find an event $\mathcal{A}$ of probability larger than $1-(10+4k)\exp(-\sqrt{k}/\log(k))$ such that, on $\mathcal{A}$ ,

[TABLE]

In order to control $\operatorname{\mathbb{E}}[\|V\|_{\square}]$ on the complementary event $\bar{\mathcal{A}}$ we use the rough bound

[TABLE]

which implies

[TABLE]

where we use (54). Together with the decomposition (70), (71) and (72), we conclude that

[TABLE]

∎

Appendix F Proof of Theorem 2

It is enough to prove separately the following two minimax lower bounds:

[TABLE]

The proof of (81) is identical to the proof of (45) in [26] so we just sketch the main idea. Fix some $0<\epsilon\leq 1/4$ . We consider $W_{1}$ to be the constant graphon with $W_{1}(x,y)\equiv 1/2$ , and $W_{2}\in\mathcal{W}^{+}[2]$ to be the $2$ –step graphon with $W_{2}(x,y)=1/2+\epsilon$ if $(x,y)\in[0,1/2)^{2}\cup[1/2,1]^{2}$ and $W_{2}(x,y)=1/2-\epsilon$ elsewhere. Obviously, we have $\delta_{\square}[\rho_{n}W_{1},\rho_{n}W_{2}]=\rho_{n}\epsilon$ . Then, standard testing arguments [32] ensure that the minimax risk $\inf_{\widehat{f}}\sup_{W_{0}\in\mathcal{W}^{+}[2]}\operatorname{\mathbb{E}}_{W_{0}}[\delta_{\square}(\widehat{f},\rho_{n}W_{0})]$ is at least of the order $\rho_{n}\epsilon$ when $\epsilon$ is chosen small enough so that the $\chi^{2}$ -distance $\chi^{2}(\operatorname{\mathbb{P}}_{W_{2}},\operatorname{\mathbb{P}}_{W_{1}})$ is smaller than $1/4$ . According to Lemma 4.9 in [26], this is the case when $\epsilon$ is small compared with $(\rho_{n}n)^{-1/2}$ , which proves (81).

Henceforth, we only focus on (80). We first consider the case of $k$ multiple of $32$ and such that $k\geq C_{0}$ and $k\leq C_{1}n$ for some sufficiently large numerical constants $C_{0}$ and $C_{1}$ . As the collections $\mathcal{W}^{+}[k]$ are nested this will imply (80) for all $k\in[32C_{0},n]$ . Afterwards, it will suffice to show (80) for $k=2$ to prove the theorem. So, we assume that $k$ is a multiple of 32, that $k$ is large enough and that $k$ is small compared with $n$ . Define $k_{1}:=k/2$ , $M_{k}:=\lceil 128\log(k)\rceil$ , $\eta_{0}:=1/16$ and $\eta_{1}:=7/8$ .

As for Proposition 3, we will rely on Fano’s method (Lemma 2). Hence, we shall build a collection $(W_{u})$ of graphons that are well-spaced in cut distance and such that the Kullback-Leibler divergence between the associated distributions $\operatorname{\mathbb{P}}_{W_{u}}$ remains small enough. All the graphons considered in this collection will be based on a $k_{1}\times M_{k}$ matrix $\boldsymbol{B}$ such that (i) the rows of $\boldsymbol{B}$ are almost orthogonal and (ii) the $l_{1}$ distance between permutations and convex combinations of the columns of $\boldsymbol{B}$ is bounded from below. These properties will turn out to be useful when taking a lower bound on the $\delta_{\square}$ distance between the corresponding graphons.

Lemma 11.

For $k$ large enough, there exists a matrix $\boldsymbol{B}\in\{-1,1\}^{k_{1}\times M_{k}}$ satisfying the following two properties:

(i)

For any $(a,b)\in[k_{1}]^{2}$ with $a\neq b$ , the inner product of two rows $\langle\boldsymbol{B}_{a,\cdot},\boldsymbol{B}_{b,\cdot}\rangle$ satisfies

[TABLE]

(ii)

For any two subsets $X$ and $Y$ of $[k_{1}]$ satisfying $|X|=|Y|=\eta_{0}k_{1}$ and $X\cap Y=\emptyset$ , any labellings $\pi_{1}:[\eta_{0}k_{1}]\to X$ and $\pi_{2}:[\eta_{0}k_{1}]\to Y$ , any subset $Z$ of $[M_{k}]$ of size larger than $\eta_{1}M_{k}$ and any $Z\times M_{k}$ stochastic matrix $\omega$ , we have

[TABLE]

for some universal constant $C>0$ .

Taking $\boldsymbol{B}$ as in Lemma 11, we define the connection probability matrix $\boldsymbol{Q}:=(\boldsymbol{J}+\boldsymbol{B})/2$ where $\boldsymbol{J}$ is the $k_{1}\times M_{k}$ matrix with all entries equal to 1. Now we define a collection of step graphons based on $\boldsymbol{Q}$ that will only slightly differ by the weight of each step.

Fix some $\epsilon<1/(8k_{1})$ and denote by $\mathcal{C}_{0}$ the collection of vectors $u\in\{-\epsilon,\epsilon\}^{k_{1}}$ satisfying $\sum_{a=1}^{k_{1}}u_{a}=0$ . For any $u\in\mathcal{C}_{0}$ , define the cumulative distribution $F_{u}$ on $\{0,\ldots,k_{1}\}$ by the relations $F_{u}(0)=0$ and $F_{u}(a)=a/(2k_{1})+\sum_{b=1}^{a}u_{b}$ for $a\in[k_{1}]$ and the cumulative distribution $G$ on $\{0,\ldots,M_{k}\}$ by $G(0)=1/2$ and $G(b)=1/2+b/(2M_{k})$ . Note that $F_{u}$ takes values in $[0,1/2]$ and $G$ takes values in $[1/2,1]$ . Then, set $\Pi_{ab}(u)=[F_{u}(a-1),F_{u}(a))\times[G(b-1),G(b))$ and define the graphon $W_{u}\in\mathcal{W}^{+}[k_{1}+M_{k}]$ by

[TABLE]

See Figure 1 for a drawing of $W_{u}$ . Note that $W_{u}$ is a fairly unbalanced $(k_{1}+M_{k})$ –step graphon: $M_{k}$ of its steps have a large weight of order $1/\log(k)$ . Besides, the $k_{1}$ smaller steps are slightly unbalanced as the weight of each class is either $1/k-\epsilon$ or $1/k+\epsilon$ . The purpose of these $M_{k}$ big steps is to make the cut distances between $W_{u}$ and $W_{v}$ the largest possible (see the proof of Lemma 13).
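The construction can be made concrete with a small sketch. The cumulative weights $F_{u}$ and $G$ below follow the definitions given above; the values of the graphon outside the rectangles $\Pi_{ab}(u)$ are an assumption (symmetric extension, and the constant $1/2$ on $[0,1/2)^{2}\cup[1/2,1]^{2}$ , as suggested by Case (ii) of the proof of Lemma 13). The function names, the toy sizes and the matrix `B` are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

def make_F(u, k1):
    """Cumulative weights F_u(0), ..., F_u(k1) with F_u(a) = a/(2 k1) + sum_{b<=a} u_b."""
    return [Fraction(a, 2 * k1) + sum(u[:a], Fraction(0)) for a in range(k1 + 1)]

def make_G(Mk):
    """Cumulative weights G(0), ..., G(Mk) with G(b) = 1/2 + b/(2 Mk)."""
    return [Fraction(1, 2) + Fraction(b, 2 * Mk) for b in range(Mk + 1)]

def locate(cuts, x):
    """Return the index a >= 1 with cuts[a-1] <= x < cuts[a], or None if there is none."""
    for a in range(1, len(cuts)):
        if cuts[a - 1] <= x < cuts[a]:
            return a
    return None

def graphon_value(x, y, Q, F, G):
    """Evaluate the step graphon at (x, y)."""
    for s, t in ((x, y), (y, x)):          # ASSUMPTION: symmetric extension
        a, b = locate(F, s), locate(G, t)
        if a is not None and b is not None:
            return Q[a - 1][b - 1]
    return Fraction(1, 2)                  # ASSUMPTION: constant 1/2 elsewhere

# Toy sizes (hypothetical; the proof needs k large and M_k = ceil(128 log k)).
k1, Mk = 4, 3
eps = Fraction(1, 64)                      # satisfies eps < 1/(8 k1)
u = [eps, -eps, eps, -eps]                 # entries sum to zero, so u is in C_0
B = [[1, -1, 1], [-1, 1, 1], [1, 1, -1], [-1, -1, -1]]   # hypothetical sign matrix
Q = [[(1 + b) // 2 for b in row] for row in B]           # Q = (J + B)/2, entries in {0, 1}
F, G = make_F(u, k1), make_G(Mk)
```

With exact rational arithmetic, `F[-1]` equals $1/2$ and `G[-1]` equals $1$ , so the $k_{1}$ small steps fill $[0,1/2)$ and the $M_{k}$ big steps fill $[1/2,1)$ .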

Next, we shall consider a subcollection $\mathcal{C}$ of $\mathcal{C}_{0}$ such that the graphons $W_{u}$ with $u\in\mathcal{C}$ are well spaced. The following combinatorial result is in the spirit of the Varshamov-Gilbert lemma [32, Lemma 2.9]. It is borrowed from [26] (Lemma 4.4). For $u\in\mathcal{C}_{0}$ , let $\mathcal{A}_{u}:=\{a\in[k_{1}]:\ u_{a}=\epsilon\}$ . Notice that, by definition of $\mathcal{C}_{0}$ , we have $|\mathcal{A}_{u}|=k_{1}/2$ for all $u\in\mathcal{C}_{0}$ .

Lemma 12.

There exists a subset $\mathcal{C}$ of $\mathcal{C}_{0}$ such that $\log|\mathcal{C}|\geq k_{1}/16$ and

[TABLE]

for any $u\neq v\in\mathcal{C}$ .
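For small $k_{1}$ , a packing of this kind can be produced by brute force. The sketch below is a greedy search and not the combinatorial argument of [26]; it takes the separation condition in the form $|\mathcal{A}_{u}\cap\mathcal{A}_{v}|\leq 3k_{1}/8$ , which is how Lemma 12 is invoked in the proof of Lemma 13. The function names and the cap `max_size` are hypothetical.

```python
from itertools import combinations

def greedy_packing(k1, max_size=20):
    """Greedily collect sets A of size k1/2 with pairwise intersections of size <= 3*k1/8."""
    half, cap = k1 // 2, (3 * k1) // 8
    chosen = []
    for candidate in combinations(range(k1), half):
        A = frozenset(candidate)
        if all(len(A & B) <= cap for B in chosen):
            chosen.append(A)
            if len(chosen) == max_size:    # stop early; the search is only illustrative
                break
    return chosen

def to_sign_vector(A, k1, eps):
    """Vector u in {-eps, eps}^{k1} with A_u = A; it sums to zero because |A| = k1/2."""
    return [eps if a in A else -eps for a in range(k1)]

k1 = 16
packing = greedy_packing(k1)
vectors = [to_sign_vector(A, k1, 1) for A in packing]
```

Each vector in `vectors` has exactly $k_{1}/2$ positive entries, so it belongs to $\mathcal{C}_{0}$ (up to the scaling by $\epsilon$ , set to 1 here).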

Lemmas 11 and 12 are used to obtain the following lower bound on the distance $\delta_{\square}(W_{u},W_{v})$ between two distinct graphons with $u$ and $v$ in $\mathcal{C}$ . This lemma is the main ingredient of the proof.

Lemma 13.

There exist two positive universal constants $C_{1}$ and $C_{2}$ such that if $k\epsilon\leq C_{2}$ then, for any $u,v\in\mathcal{C}$ with $u\neq v$ , we have

[TABLE]

which implies

[TABLE]

Note that for any $u$ and $v$ in $\mathcal{C}$ it is possible to build a measure-preserving transformation $\tau$ such that $W_{u}-W_{v}^{\tau}$ is null except on a measurable set of Lebesgue measure of order $k\epsilon$ (see the proof of Theorem 1 in Section E for such a construction). Hence, the $l_{1}$ norm of $W_{u}-W_{v}^{\tau}$ is of order $k\epsilon$ . Lemma 13 states that, by taking the infimum over all $\tau$ and by considering the weaker norm $\|.\|_{\square}$ , one still has a lower bound of the same order. The $M^{-1/2}_{k}$ factor arises as a consequence of Lemma 4. See the proof for more details.

To apply Fano’s method, we need to upper bound the Kullback-Leibler divergence between the distributions corresponding to any two graphons $W_{u}$ and $W_{v}$ with $u$ and $v$ in $\mathcal{C}$ . Let $\operatorname{\mathbb{P}}_{W_{u}}$ denote the distribution of $\boldsymbol{A}$ sampled according to the sparse $W$ -random graph model (1) with $W_{0}=W_{u}$ . Since the matrix $\boldsymbol{Q}$ is fixed, the difficulty in distinguishing between the distributions $\operatorname{\mathbb{P}}_{W_{u}}$ and $\operatorname{\mathbb{P}}_{W_{v}}$ for $u\neq v$ comes from the randomness of the design points $\xi_{1},\ldots,\xi_{n}$ in the $W$ -random graph model (1) rather than from the randomness of the realization of the adjacency matrix $\boldsymbol{A}$ conditionally on $\xi_{1},\ldots,\xi_{n}$ . The following lemma gives an upper bound on the Kullback-Leibler divergences $\mathcal{KL}(\operatorname{\mathbb{P}}_{W_{u}},\operatorname{\mathbb{P}}_{W_{v}})$ :

Lemma 14.

For all $u,v\in\mathcal{C}_{0}$ we have

[TABLE]

Now, choose $\epsilon$ such that $\epsilon^{2}=\frac{3}{(16)^{3}nk_{1}}$ . When $k$ is small compared with $n$ , this choice of $\epsilon$ satisfies the conditions of Lemma 13. Then it follows from Lemmas 12 and 14 that

[TABLE]

In view of Fano’s Lemma (Lemma 2), inequalities (86) and (87) imply that

[TABLE]

where $C>0$ is an absolute constant. This completes the proof for $k$ large enough.
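For the reader's convenience, the two smallness requirements on $\epsilon$ can be checked directly from the choice $\epsilon^{2}=3/(16^{3}nk_{1})$ made above, using only $k=2k_{1}$ and $16^{3}=64^{2}$ :

$$k\epsilon=2k_{1}\cdot\frac{\sqrt{3}}{64\sqrt{nk_{1}}}=\frac{\sqrt{3}}{32}\sqrt{\frac{k_{1}}{n}},\qquad 8k_{1}\epsilon=\frac{\sqrt{3}}{8}\sqrt{\frac{k_{1}}{n}}<1\quad\text{whenever }k_{1}\leq n.$$

Hence $\epsilon<1/(8k_{1})$ holds for all $k_{1}\leq n$ , and the condition $k\epsilon\leq C_{2}$ of Lemma 13 holds as soon as $k/n$ is smaller than a constant depending only on $C_{2}$ .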

Now we turn to the case $k=2$ . We reduce the lower bound to the problem of testing two hypotheses. Consider the matrix $\boldsymbol{B}=\left(\begin{array}[]{cc}1&1\\ 1&-1\end{array}\right)$ . Given $u\in\{-\epsilon,+\epsilon\}$ define $F_{u}(0)=0$ , $F_{u}(1)=1/2+u$ and $F_{u}(2)=1$ . Then, we set $\Pi_{ab}(u)=[F_{u}(a-1),F_{u}(a))\times[F_{u}(b-1),F_{u}(b))$ for any $a,b\in\{1,2\}$ and define graphons

[TABLE]

For any measure preserving bijection $\tau$ , $(W_{\epsilon}-W^{\tau}_{-\epsilon})$ is a four-step graphon. Thanks to Lemma 4, we deduce that $\delta_{\square}(W_{\epsilon},W_{-\epsilon})\geq C\delta_{1}(W_{\epsilon},W_{-\epsilon})$ . Then, it is not hard to see that $\delta_{1}(W_{\epsilon},W_{-\epsilon})\geq C^{\prime}\epsilon$ so that $\delta_{\square}(\rho_{n}W_{\epsilon},\rho_{n}W_{-\epsilon})\geq C^{\prime}\rho_{n}\epsilon$ . Moreover, exactly as in Lemma 14, the Kullback-Leibler divergence between $\operatorname{\mathbb{P}}_{W_{\epsilon}}$ and $\operatorname{\mathbb{P}}_{W_{-\epsilon}}$ is bounded by $Cn\epsilon^{2}$ . Taking $\epsilon$ of the order $n^{-1/2}$ , this divergence is small. It is therefore impossible to reliably distinguish $\operatorname{\mathbb{P}}_{W_{\epsilon}}$ from $\operatorname{\mathbb{P}}_{W_{-\epsilon}}$ and the estimation error is at least of order $\rho_{n}\epsilon$ . More formally, we use Theorem 2.2 from [32] to conclude that

[TABLE]

where $C>0$ is an absolute constant.

Proof of Lemma 11.

Let $\boldsymbol{B}$ be a $k_{1}\times M_{k}$ random matrix whose entries are independent Rademacher variables. We shall prove that, with positive probability, $\boldsymbol{B}$ satisfies both (82) and (83). In particular, this implies the existence of $\boldsymbol{B}$ satisfying both (82) and (83).

Fix $a\neq b$ . Then, $\langle\boldsymbol{B}_{a,\cdot},\boldsymbol{B}_{b,\cdot}\rangle$ is distributed as a sum of $M_{k}$ independent Rademacher variables. Using Hoeffding’s inequality, we have that

[TABLE]

By the union bound, property (82) is satisfied for all $a\neq b$ with probability greater than $1-k_{1}^{2}\exp[-M_{k}/32]$ . Since $M_{k}\geq 128\log(k)$ , for $k$ greater than some absolute constant, this probability is greater than $3/4$ .
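The near-orthogonality of the rows is easy to observe in simulation. The sketch below draws one Rademacher matrix with the dimensions used in the proof and reports the largest inner product between two distinct rows. The threshold $M_{k}/4$ is an assumption: it is the value that matches the exponent $M_{k}/32$ in the Hoeffding bound above, since the displayed form of (82) is not reproduced here. Function names, the seed and the value of $k$ are hypothetical.

```python
import math
import random

def rademacher_matrix(rows, cols, rng):
    """Matrix with independent entries uniform on {-1, +1}."""
    return [[rng.choice((-1, 1)) for _ in range(cols)] for _ in range(rows)]

def max_offdiag_inner_product(B):
    """Largest |<B_a, B_b>| over pairs of distinct rows a != b."""
    worst = 0
    for a in range(len(B)):
        for b in range(a + 1, len(B)):
            inner = sum(x * y for x, y in zip(B[a], B[b]))
            worst = max(worst, abs(inner))
    return worst

k = 64                                   # a multiple of 32, as in the proof
k1 = k // 2
Mk = math.ceil(128 * math.log(k))        # M_k = ceil(128 log k)
B = rademacher_matrix(k1, Mk, random.Random(0))
worst = max_offdiag_inner_product(B)
threshold = Mk / 4                       # ASSUMED form of the bound in (82)
print(worst, threshold)
```

Each inner product is a sum of $M_{k}$ independent signs, so its typical size is of order $\sqrt{M_{k}}$ , far below $M_{k}/4$ .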

Turning to (83), we first fix $X$ , $Y$ , $Z$ , $\pi_{1}$ , $\pi_{2}$ , and $\omega$ . Let

[TABLE]

We have that, conditionally on $(\boldsymbol{B}_{b,c})_{b\in Y,c\in[M_{k}]}$ , $T_{X,Y,Z,\pi_{1},\pi_{2},\omega}$ stochastically dominates a binomial distribution with parameters $(\eta_{0}k_{1})\times|Z|$ and $1/2$ . Then, Hoeffding’s inequality yields

[TABLE]

Given any subset $Z\subset[M_{k}]$ of size in $[\eta_{1}M_{k},M_{k}]$ , define $\Omega_{Z}$ as the collection of $Z\times[M_{k}]$ stochastic matrices taking values in the discrete set $\{0,1/(8M_{k}),2/(8M_{k}),\ldots,1\}$ . Since $X,Y\subset[k_{1}]$ and $Z\subset[M_{k}]$ , it is easy to see that the cardinality of the set of all possible tuples $(X,Y,Z,\pi_{1},\pi_{2},\omega)$ with $\omega\in\Omega_{Z}$ is bounded by

[TABLE]

Now, taking the union bound, we derive that, simultaneously for all such parameters,

[TABLE]

with probability greater than $1-2^{2k_{1}+M_{k}+1}(\eta_{0}k_{1})!^{2}(8M_{k}+1)^{M_{k}^{2}}\exp[-\eta_{0}\eta_{1}k_{1}M_{k}/8]$ . Using Stirling’s approximation and $\eta_{1}M_{k}\geq 64\log(k)$ we get that this probability is larger than $3/4$ for $k$ large enough.

Finally, let us consider the general case, where the matrix $\omega$ does not necessarily belong to $\Omega_{Z}$ . Observe that in this case, there exists a matrix $\omega^{\prime}\in\Omega_{Z}$ such that $\max_{b\in Z}\sum_{c\in[M_{k}]}|\omega_{b,c}-\omega^{\prime}_{b,c}|\leq 1/8$ . This implies that

[TABLE]

We have proved that (83) holds with probability larger than $3/4$ . As a consequence, $\boldsymbol{B}$ satisfies both (82) and (83) with probability larger than $1/2$ . ∎
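The existence of the discretized matrix $\omega^{\prime}$ used in the last step can be made explicit. The sketch below shows one possible rounding scheme (largest-remainder rounding, an assumption; the proof only asserts existence): each row is rounded down to the grid of mesh $1/(8M_{k})$ , and the missing mass is returned one grid unit at a time to the entries that lost the most. Every entry then moves by at most one grid unit, so the row-wise $l_{1}$ error is at most $M_{k}/(8M_{k})=1/8$ . The function name is hypothetical.

```python
from fractions import Fraction
import random

def round_row_to_grid(row, Mk):
    """Round a probability vector of length Mk to the grid {0, h, 2h, ..., 1}, h = 1/(8 Mk).

    The output still sums to 1 and differs from `row` by at most 1/8 in l1 distance.
    """
    h = Fraction(1, 8 * Mk)
    floors = [(p // h) * h for p in row]           # round every entry down to the grid
    losses = [p - f for p, f in zip(row, floors)]  # each loss lies in [0, h)
    units = int((1 - sum(floors)) / h)             # missing mass, an integer number of grid units
    by_loss = sorted(range(len(row)), key=lambda c: losses[c], reverse=True)
    rounded = list(floors)
    for c in by_loss[:units]:                      # round up the entries with the largest losses
        rounded[c] += h
    return rounded

# Example: a random stochastic row with rational entries.
Mk = 7
rng = random.Random(0)
weights = [rng.randint(1, 100) for _ in range(Mk)]
row = [Fraction(w, sum(weights)) for w in weights]
rounded = round_row_to_grid(row, Mk)
l1_error = sum(abs(p - q) for p, q in zip(row, rounded))
```

Applying this to every row of $\omega$ gives a matrix in $\Omega_{Z}$ within the required distance.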

Proof of Lemma 13.

We fix $u$ and $v$ , two different vectors in $\mathcal{C}$ , and fix $\tau$ , a measure-preserving bijection $[0,1]\rightarrow[0,1]$ . We shall prove that for $k\epsilon$ small enough

[TABLE]

Since $\delta_{\square}\big{(}W_{u},W_{v}\big{)}=\inf_{\tau}\|W_{u}(.,.)-W_{v}(\tau.,\tau.)\|_{\square}$ both (85) and (86) straightforwardly follow from (88). We denote

[TABLE]

Since $\tau$ is measure-preserving, we have

[TABLE]

Now, we consider three cases: (i) $\lambda(\mathcal{B}_{12})\leq k_{1}\epsilon/64$ , (ii) $k_{1}\epsilon/64<\lambda(\mathcal{B}_{12})\leq 1/2-k_{1}\epsilon/64$ and (iii) $\lambda(\mathcal{B}_{12})>1/2-k_{1}\epsilon/64$ . In Case (i) we shall focus on the restriction of $W_{u}$ and $W_{v}^{\tau}$ on $\mathcal{B}_{11}\times\mathcal{B}_{22}$ so that these restrictions are $k_{1}\times M_{k}$ –step functions. In Case (ii), we focus on restrictions to $\mathcal{B}_{21}\times\mathcal{B}_{22}$ , so that $W_{v}^{\tau}$ is constant on this restriction. In the pathological Case (iii), we introduce a subset such that the restriction of $W_{u}$ is a $M_{k}\times k_{1}$ –step function and the restriction of $W_{v}^{\tau}$ is a $k_{1}\times M_{k}$ –step function.

Case (i). We focus our attention on coordinates $(x,y)$ in $\mathcal{B}_{11}\times\mathcal{B}_{22}$ . Recall that the cumulative distribution function $G$ is defined by $G(0)=1/2$ and $G(b)=1/2+b/(2M_{k})$ for $b\in[M_{k}]$ . For any $(r,s)\in[M_{k}]^{2}$ , define

[TABLE]

In other words, $\omega_{r,s}$ stands for the weight of indices corresponding to class $r$ in $W_{u}$ and class $s$ in $W_{v}^{\tau}$ . By definition of $\omega_{r,s}$ , for any $r\in[M_{k}]$ , we have

[TABLE]

Let $\mathcal{R}$ denote the set of $r\in[M_{k}]$ such that $[G(r-1),G(r))$ has a large intersection with $\tau^{-1}([1/2,1])$ :

[TABLE]

Denote $\bar{\mathcal{R}}$ the complementary set of $\mathcal{R}$ . We have that $\lambda(\mathcal{B}_{22})=1/2-\lambda(\mathcal{B}_{12})\geq 1/2-k_{1}\epsilon/64\geq\tfrac{27}{56}$ for $k_{1}\epsilon$ small enough. Hence, it follows that

[TABLE]

which implies that $|\mathcal{R}|\geq 3M_{k}/4$ and $\lambda(\mathcal{Y})=\sum_{r\in\mathcal{R}}\omega_{r\bullet}\geq 9/28$ .

Now, denoting $\mathcal{X}:=\mathcal{B}_{11}$ , we define a new kernel $\overline{W}_{v}^{\tau}:\mathcal{X}\times\mathcal{Y}\to[0,1]$ by

[TABLE]

We can view $\overline{W}_{v}^{\tau}$ as a smoothed version of the restriction of $W_{v}^{\tau}$ to $\mathcal{X}\times\mathcal{Y}$ . The marginal functions $\overline{W}_{v}^{\tau}(x,.)$ are step functions with at most $|\mathcal{R}|\leq M_{k}$ steps of the form $[G(r-1),G(r))\cap\mathcal{B}_{22}$ . Moreover, on each interval $[G(r-1),G(r))\cap\mathcal{B}_{22}$ , $\overline{W}_{v}^{\tau}(x,y)$ is equal to the mean of $W_{v}^{\tau}(x,z)$ for $z$ ranging over this set. Equipped with this notation, we can control the cut distance between $W_{u}$ and $W_{v}^{\tau}$ in terms of the $l_{1}$ distance between the restriction of $W_{u}$ to $\mathcal{X}\times\mathcal{Y}$ and $\overline{W}_{v}^{\tau}$ . For ease of notation, we still write $W_{u}$ for the restriction of $W_{u}$ to $\mathcal{X}\times\mathcal{Y}$ when there is no ambiguity.

The following lemma provides a lower bound on the cut norm $\|W_{u}-W_{v}^{\tau}\|_{\square}$ in terms of the $l_{1}$ norm $\|W_{u}-\overline{W}_{v}^{\tau}\|_{1}$ .

Lemma 15.

For any $u$ , $v$ in $\mathcal{C}$ and any measure-preserving transformation $\tau$ , we have

[TABLE]

where $\overline{W}_{v}^{\tau}$ is defined in (93).

In view of Lemma 15 it is enough to control the $l_{1}$ norm $\|W_{u}-\overline{W}_{v}^{\tau}\|_{1}$ . We can do it in a similar way as it is done in the proof of Lemma 4.5 in [26]. For $a\neq b$ and any $x\in\left[F_{u}(a-1),F_{u}(a)\right)\cap\mathcal{X}$ and $x^{\prime}\in\left[F_{u}(b-1),F_{u}(b)\right)\cap\mathcal{X}$ , the inner product between $W_{u}(x,.)$ and $W_{v}(x^{\prime},.)$ satisfies

[TABLE]

where we used (82) in the last line. For any $a,b\in[k_{1}]$ , let $\psi_{ab}$ denote the Lebesgue measure of the set

[TABLE]

Since $\tau$ is measure preserving, it follows that $\sum_{b}\psi_{ab}\leq 1/(2k_{1})+u_{a}$ and $\sum_{a}\psi_{ab}\leq 1/(2k_{1})+v_{b}$ . For any $y\in\mathcal{Y}$ , we set

[TABLE]

Equipped with this notation, we have

[TABLE]

Now take any $a_{1}\neq a_{2}$ . By (95), $|h_{u,a}(y)|=1/2$ and using the triangle inequality, we derive that

[TABLE]

where we used $\lambda(\mathcal{Y})\geq 9/28$ in the last line. As a consequence, for any $b\in[k_{1}]$ there exists at most one $a\in[k_{1}]$ such that $\|h_{u,a}-k_{v,b}\|_{1}<1/224$ . If such an index $a$ exists, it is denoted by $\pi(b)$ . Then, it is possible to extend $\pi$ as a function from $[k_{1}]$ to $[k_{1}]$ . Since $\sum_{a,b}\psi_{a,b}=\lambda(\mathcal{X})$ , we get

[TABLE]

since $\lambda[\mathcal{B}_{1,2}]\leq k_{1}\epsilon/64$ . If the sum $\sum_{b=1}^{k_{1}}1/(2k_{1})+v_{b}-\psi_{\pi(b),b}$ is greater than $k_{1}\epsilon/32$ , then (88) is satisfied. Thus, we can assume in the sequel that $\sum_{b=1}^{k_{1}}1/(2k_{1})+v_{b}-\psi_{\pi(b),b}\leq k_{1}\epsilon/32$ .

Using that $\psi_{a,b}\leq(1/(2k_{1})+u_{a})\wedge(1/(2k_{1})+v_{b})$ and that the cardinality of the collection $\{b\in[k_{1}]:\,v_{b}>0\}$ is $k_{1}/2$ we deduce that the collection $\{b\in[k_{1}]:\,v_{b}>0,\ u_{\pi(b)}>0\text{ and }\psi_{\pi(b),b}\geq 1/(2k_{1})\}$ has cardinality greater than $7k_{1}/16$ . Now, Lemma 12 implies that $|\mathcal{A}_{u}\cap\mathcal{A}_{v}|\leq 3k_{1}/8$ for $u\neq v\in\mathcal{C}$ . Then, there exist subsets $A\subset\mathcal{A}_{u}$ and $B\subset\mathcal{A}_{v}$ of cardinality $\eta_{0}k_{1}$ (recall that $\eta_{0}=1/16$ ) such that $\pi(B)=A$ , $A\cap B=\emptyset$ , and $\psi_{\pi(b),b}\geq 1/(2k_{1})$ for all $b\in B$ . The condition $\psi_{\pi(b),b}\geq 1/(2k_{1})$ implies that $\pi$ is injective on $B$ . Hence,

[TABLE]

where the second inequality follows from $\psi_{\pi(b),b}\geq 1/(2k_{1})$ and the fact that $h_{u,\pi(b)}$ and $k_{v,b}$ are step functions with steps larger than $3/(7M_{k})$ (see (91), the definition of $\mathcal{R}$ and $\mathcal{Y}$ ). Finally, we apply the property (83) of $\boldsymbol{B}$ to conclude that

[TABLE]

which, together with Lemma 15, proves (88).

Case (ii). Now we assume that $k_{1}\epsilon/64<\lambda(\mathcal{B}_{12})\leq 1/2-k_{1}\epsilon/64$ . Take $\mathcal{X}=\mathcal{B}_{21}$ and $\mathcal{Y}=\mathcal{B}_{22}$ . We have that, on $\mathcal{X}\times\mathcal{Y}$ , $W_{v}^{\tau}$ is constant and equals $1/2$ . Denote $U$ the restriction of $W_{u}-1/2$ to $\mathcal{X}\times\mathcal{Y}$ . Then, it follows that $\|W_{u}-W_{v}^{\tau}\|_{\square}\geq\|U\|_{\square}$ . The kernel $U$ is a step function with at most $k_{1}\times M_{k}$ steps. By Lemma 4, we obtain

[TABLE]

where the last equality follows from (90). Using $\lambda(\mathcal{X})=\lambda(\mathcal{B}_{12})$ and $x(1/2-x)\geq 1/4\min\left(x,(1/2-x)\right)$ we obtain (88).
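The elementary inequality used here holds for every $x\in[0,1/2]$ and can be verified in one line: if $x\leq 1/4$ then $1/2-x\geq 1/4$ , and the case $x>1/4$ is symmetric under $x\mapsto 1/2-x$ .

$$x\leq\tfrac{1}{4}\ \Longrightarrow\ x\left(\tfrac{1}{2}-x\right)\geq\tfrac{1}{4}\,x=\tfrac{1}{4}\min\left(x,\tfrac{1}{2}-x\right).$$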

Case (iii). Now we assume that $\lambda(\mathcal{B}_{12})>1/2-k_{1}\epsilon/64$ and take $\mathcal{X}=\mathcal{B}_{21}$ and $\mathcal{Y}=\mathcal{B}_{12}$ so that $\lambda(\mathcal{X})=\lambda(\mathcal{B}_{12})\geq 1/2-k_{1}\epsilon/64$ . Define the smoothed kernel $\overline{W}_{v}^{\tau}:\mathcal{X}\times\mathcal{Y}\to[0,1]$ by

[TABLE]

As a consequence, $\overline{W}_{v}^{\tau}$ is $M_{k}\times M_{k}$ block-constant on subsets of the form $\big{(}\tau^{-1}[G(a-1),G(a))\cap\mathcal{X}\big{)}\times\big{(}[G(b-1),G(b))\cap\mathcal{Y}\big{)}$ . Arguing as in the proof of Lemma 15, we derive that

[TABLE]

For any $a$ such that $[F_{u}(a-1),F_{u}(a))\cap\mathcal{X}\neq\emptyset$ define the function $h_{u,a}$ on $\mathcal{Y}$ by $h_{u,a}(y):=W_{u}(F_{u}(a-1),y)-1/2$ . Arguing as in Case (i), we observe that $\|h_{u,a_{1}}-h_{u,a_{2}}\|_{1}\geq 1/112$ for any $a_{1}\neq a_{2}$ . We have that the kernel $\overline{W}_{v}^{\tau}$ is a $M_{k}\times M_{k}$ step function. Hence, there exists a partition $(\mathcal{X}_{b})_{b=1,\ldots,M_{k}}$ of $\mathcal{X}$ and $M_{k}$ functions $k_{b}(y)$ such that $\left(\overline{W}_{v}^{\tau}-1/2\right)(x,y)=\sum_{b=1}^{M_{k}}\mathds{1}_{x\in\mathcal{X}_{b}}k_{b}(y)$ . Then, the triangle inequality ensures that, for any $a_{1}\neq a_{2}$ and any $b\in[M_{k}]$ , we have $\|h_{u,a_{1}}-k_{b}\|_{1}+\|h_{u,a_{2}}-k_{b}\|_{1}\geq\|h_{u,a_{1}}-h_{u,a_{2}}\|_{1}\geq 1/112$ . As a consequence, for any $b\in[M_{k}]$ there exists at most one $a$ , which we will denote by $\pi(b)$ , such that $\|h_{u,\pi(b)}-k_{b}\|_{1}\leq 1/224$ . Now we compute

[TABLE]

where we used $\lambda(\mathcal{X})\geq 1/4$ , $M_{k}/k\leq 1/8$ , and that $M_{k}\epsilon\leq k\epsilon$ is small enough. Together with (96), we obtain the desired result (88). ∎

Proof of Lemma 15.

We first prove that $\|W_{u}-\overline{W}_{v}^{\tau}\|_{\square}\leq\|W_{u}-W_{v}^{\tau}\|_{\square}$ . Fix any measurable subset $S\subset\mathcal{X}$ . Since the functions $\left[W_{u}-\overline{W}_{v}^{\tau}\right](x,\cdot)$ are constant on each set $[G(r-1),G(r))\cap\mathcal{Y}$ , the supremum $\sup_{T\subset\mathcal{Y}}\left|\int_{S\times T}W_{u}(x,y)-\overline{W}_{v}^{\tau}(x,y)dxdy\right|$ is achieved by a subset $T$ which is a union of some of the sets $[G(r-1),G(r))\cap\mathcal{Y}$ , that is $T=\cup_{r\in\mathcal{R}^{\prime}\subset\mathcal{R}}[G(r-1),G(r))\cap\mathcal{Y}$ . For such $T$ , the definition (93) of $\overline{W}^{\tau}_{v}$ implies $\int_{S\times T}\overline{W}_{v}^{\tau}(x,y)dxdy=\int_{S\times T}W_{v}^{\tau}(x,y)dxdy$ so that

[TABLE]

Taking the supremum over all $S$ leads to $\|W_{u}-\overline{W}_{v}^{\tau}\|_{\square}\leq\|W_{u}-W_{v}^{\tau}\|_{\square}$ . By definition of $W_{u}$ and $\overline{W}_{v}^{\tau}$ we have that $U$ is a $k_{1}^{2}\times M_{k}$ step function. Then, Lemma 4 allows us to conclude

[TABLE]

∎
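The first step of this proof has a simple finite analogue that can be checked by brute force. In the sketch below, matrices play the role of kernels, column groups play the role of the sets $[G(r-1),G(r))\cap\mathcal{Y}$ , and `U` is constant on each group in every row, like $W_{u}(x,\cdot)$ . Averaging `W` within the groups then cannot increase the cut norm of the difference. This discrete setting, the unnormalized cut norm and all names are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product
import random

def cut_norm(D):
    """Exact max over S, T of |sum_{i in S, j in T} D[i][j]| (no normalization)."""
    rows, cols = len(D), len(D[0])
    best = Fraction(0)
    for mask in product((0, 1), repeat=rows):       # enumerate row subsets S
        col_sums = [sum(D[i][j] for i in range(rows) if mask[i]) for j in range(cols)]
        positive = sum(s for s in col_sums if s > 0)   # best T keeps columns of one sign
        negative = -sum(s for s in col_sums if s < 0)
        best = max(best, positive, negative)
    return best

def average_within_groups(W, groups):
    """Replace each row of W by its mean over every column group."""
    out = [row[:] for row in W]
    for i, row in enumerate(W):
        for g in groups:
            mean = Fraction(sum(row[j] for j in g), len(g))
            for j in g:
                out[i][j] = mean
    return out

def random_instance(seed, rows=4, groups=((0, 1), (2, 3, 4))):
    """Return (U - averaged W, U - W) for a random 0/1 instance."""
    rng = random.Random(seed)
    cols = sum(len(g) for g in groups)
    U = [[Fraction(0)] * cols for _ in range(rows)]
    for i in range(rows):
        for g in groups:
            value = Fraction(rng.randint(0, 1))     # U is constant on each group
            for j in g:
                U[i][j] = value
    W = [[Fraction(rng.randint(0, 1)) for _ in range(cols)] for _ in range(rows)]
    W_bar = average_within_groups(W, groups)
    smooth = [[U[i][j] - W_bar[i][j] for j in range(cols)] for i in range(rows)]
    raw = [[U[i][j] - W[i][j] for j in range(cols)] for i in range(rows)]
    return smooth, raw

smooth, raw = random_instance(0)
print(cut_norm(smooth), cut_norm(raw))
```

For a fixed row subset the optimal column subset is found by keeping the columns whose sums share a sign, which is why only row subsets need to be enumerated.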

Proof of Lemma 14.

The proof of Lemma 14 follows the lines of the proof of Lemma 4.3 in [26] and we give it here for completeness. For $u\in\mathcal{C}_{0}$ , let $\zeta(u)=(\zeta_{1}(u),\ldots,\zeta_{n}(u))$ be the vector of $n$ i.i.d. random variables with the discrete distribution on $[k_{1}+M_{k}]$ defined by

[TABLE]

Let $\boldsymbol{\Theta}_{0}$ be the $n\times n$ symmetric matrix with elements $(\boldsymbol{\Theta}_{0})_{ii}=0$ and $(\boldsymbol{\Theta}_{0})_{ij}=\rho_{n}\boldsymbol{Q}_{\zeta_{i}(u),\zeta_{j}(u)}$ for $i\neq j$ . Assume that, conditionally on $\zeta(u)$ , the adjacency matrix $\boldsymbol{A}$ is sampled according to the network sequence model with such probability matrix $\boldsymbol{\Theta}_{0}$ . Notice that in this case the observations $\boldsymbol{A}^{\prime}=(\boldsymbol{A}_{ij},1\leq j<i\leq n)$ have the probability distribution $\operatorname{\mathbb{P}}_{W_{u}}$ . Using this remark and introducing the probabilities $\alpha_{\boldsymbol{a}}(u)=\operatorname{\mathbb{P}}[\zeta(u)=\boldsymbol{a}]$ and $p_{A\boldsymbol{a}}=\operatorname{\mathbb{P}}[\boldsymbol{A}^{\prime}=A|\zeta(u)=\boldsymbol{a}]$ for $\boldsymbol{a}\in[k_{1}+M_{k}]^{n}$ , we can write the Kullback-Leibler divergence between $\operatorname{\mathbb{P}}_{W_{u}}$ and $\operatorname{\mathbb{P}}_{W_{v}}$ in the form

[TABLE]

where the sums in $\boldsymbol{a}$ are over $[k_{1}+M_{k}]^{n}$ and the sum in $A$ is over all triangular upper halves of matrices in $\{0,1\}^{n\times n}$ . Since the function $(x,y)\mapsto x\log(x/y)$ is convex we can apply Jensen’s inequality to get

[TABLE]

where the last equality follows from the fact that $\alpha_{\boldsymbol{a}}(u)$ are $n$ -product probabilities. Using (97) we get

[TABLE]

which is equal to $n/2$ times the Kullback-Leibler divergence between two discrete distributions. Since the Kullback-Leibler divergence is less than the chi-square divergence we obtain

[TABLE]

where in the last inequality we use $|v_{a}|\leq\epsilon\leq 1/(8k_{1})$ , and $|u_{a}-v_{a}|\leq 2\epsilon$ . Combining this with (98) proves the lemma.
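The comparison between the Kullback-Leibler and chi-square divergences can be checked numerically. The sketch below assumes that the discrete distribution of $\zeta_{i}(u)$ puts mass $1/(2k_{1})+u_{a}$ on each small step $a\in[k_{1}]$ and $1/(2M_{k})$ on each big step, which is what the definitions of $F_{u}$ and $G$ suggest (the displayed definition (97) is not reproduced here). The quantity `bound` is derived only from $|u_{a}-v_{a}|\leq 2\epsilon$ and $1/(2k_{1})+v_{a}\geq 3/(8k_{1})$ and is not claimed to be the constant of Lemma 14. All names and toy sizes are hypothetical.

```python
import math

def step_weights(u, k1, Mk):
    """ASSUMED law of zeta_i(u): k1 small steps followed by Mk equal big steps."""
    return [1 / (2 * k1) + ua for ua in u] + [1 / (2 * Mk)] * Mk

def kl_divergence(p, q):
    """Kullback-Leibler divergence sum_a p_a log(p_a / q_a)."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0)

def chi_square(p, q):
    """Chi-square divergence sum_a (p_a - q_a)^2 / q_a."""
    return sum((pa - qa) ** 2 / qa for pa, qa in zip(p, q))

k1, Mk = 8, 5
eps = 1 / (8 * k1)                       # largest value allowed by |v_a| <= eps <= 1/(8 k1)
u = [eps, -eps] * 4                      # both vectors sum to zero, as in C_0
v = [-eps, eps, eps, -eps] * 2
p, q = step_weights(u, k1, Mk), step_weights(v, k1, Mk)
kl_value, chi_value = kl_divergence(p, q), chi_square(p, q)
bound = 32 * k1 ** 2 * eps ** 2 / 3      # k1 terms, each at most (2 eps)^2 / (3 / (8 k1))
print(kl_value, chi_value, bound)
```

The big steps contribute nothing to either divergence because their weights do not depend on $u$ .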

∎

Appendix G Proof of Proposition 7

To prove (25), it is enough to prove separately the following three minimax lower bounds:

[TABLE]

The proof of (99) follows from the proof of (43) in [26] using the trivial inequality

[TABLE]

The proof of (100) follows the lines of the proof of (44) using that $\|\mathbf{B}\|^{2}_{2}=\|\mathbf{B}\|_{1}$ for matrices with entries in $\{-1,1\}$ . The proof of (101) is identical to the proof of (45) in [26].

In order to prove the upper bound (26), the proof of Proposition 3.2 in [26] can be easily modified to get an upper bound on the agnostic error measured in $l_{1}$ -distance:

Lemma 16 (Agnostic error measured in $l_{1}$ -distance).

Consider the $W$ -random graph model. For all integers $k\leq n$ , $W_{0}\in\mathcal{W}^{+}[k]$ and $\rho_{n}>0$ , we have

[TABLE]

Now (26) follows from Lemma 16 and (16). Finally, the $\rho_{n}$ convergence rate is simply achieved by the constant estimator $\widehat{f}\equiv 0$ .

Appendix H Proof of Proposition 8

For $\boldsymbol{\Theta}_{0}$ generated according to the sparse $W$ -random graph model (27) with graphon $W_{0}\in\mathcal{W}^{+}_{1}$ , integrating (9) with respect to ${\boldsymbol{\xi}}$ and using $\|W_{0}\|_{1}=1$ , we get

[TABLE]

So, using the triangle inequality (20) it is enough to bound the agnostic error $\operatorname{\mathbb{E}}_{W_{0}}[\delta_{\square}(\widetilde{f}_{\boldsymbol{\Theta}_{0}},f^{\prime}_{0})]$ . We take $W^{*}\in\mathcal{W}^{+}_{1}[k,\mu]$ (or $W^{*}\in\mathcal{W}^{+}_{2}[k]$ in the case of $L_{2}$ graphons) such that

[TABLE]

or $\delta_{2}\left(W^{*},W^{\prime}_{0}\right)\leq\inf_{W\in\mathcal{W}^{+}_{2}[k]}\delta_{2}\left(W,W^{\prime}_{0}\right)(1+1/n^{2})$ for $L_{2}$ graphons. Without loss of generality we can assume that $\rho_{n}W^{*}(x,y)\leq 1$ . Let $f^{*}=\rho_{n}W^{*}$ and $\boldsymbol{\Theta}^{*}=(\boldsymbol{\Theta}^{*}_{ij})$ be such that for $i\neq j$ , $\boldsymbol{\Theta}^{*}_{ij}=\rho_{n}W^{*}(\xi_{i},\xi_{j})$ where $(\xi_{i})$ are the same as for $\boldsymbol{\Theta}_{0}$ . The triangle inequality implies

[TABLE]

where we use $\delta_{\square}(f^{\prime}_{0},f^{*})\leq\delta_{1}(f^{\prime}_{0},f^{*})$ and $\operatorname{\mathbb{E}}_{W_{0}}[\delta_{\square}(\widetilde{f}_{\boldsymbol{\Theta}^{*}},\widetilde{f}_{\boldsymbol{\Theta}_{0}})]\leq\delta_{1}(f^{\prime}_{0},f^{*})$ and that $\widetilde{f}_{\boldsymbol{\Theta}^{*}}$ is distributed as under $W^{*}$ . Similarly for $L_{2}$ graphons, we obtain $\operatorname{\mathbb{E}}_{W_{0}}[\delta_{\square}(\widetilde{f}^{\prime}_{\boldsymbol{\Theta}_{0}},f_{0})]\leq 2\delta_{2}(f^{\prime}_{0},f^{*})+\operatorname{\mathbb{E}}_{W^{*}}[\delta_{\square}(f^{*},\widetilde{f}_{\boldsymbol{\Theta}^{*}})]$ . Then, we use the following lemma:

Lemma 17.

(i)

Consider any $W^{*}\in\mathcal{W}^{+}_{1}[k,\mu]$ and $\rho_{n}\geq 1/n$ such that $\rho_{n}W^{*}(x,y)\leq 1$ . Then

[TABLE]

(ii)

Consider any $W^{*}\in\mathcal{W}^{+}_{2}[k]$ and $\rho_{n}\geq 1/n$ such that $\rho_{n}W^{*}(x,y)\leq 1$ . Then,

[TABLE]

Now (28) follows from (i) of Lemma 17 and $\|W^{*}\|_{1}\leq\|W_{0}\|_{1}(2+n^{-2})$ . The proof of (30) follows the same lines using (ii) of Lemma 17.

To prove (29) and (31) we only need to prove that $\operatorname{\mathbb{E}}_{W_{0}}\left[\|\widetilde{\boldsymbol{\Theta}}_{\lambda}-\boldsymbol{\Theta}_{0}\|_{\square}\right]\leq C\sqrt{\rho_{n}/n}$ . Using the definition of $\widetilde{\boldsymbol{\Theta}}_{\lambda}$ (13) we compute

[TABLE]

where we used that $\|\boldsymbol{B}\|_{\square}\leq\|\boldsymbol{B}\|_{2\rightarrow 2}/n$ and the definition of $\widetilde{\boldsymbol{\Theta}}_{\lambda}$ . This completes the proof of Proposition 8.
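The comparison between the cut norm and the operator norm is a direct consequence of the Cauchy-Schwarz inequality. Writing $\mathbf{1}_{S}\in\{0,1\}^{n}$ for the indicator vector of a subset $S\subset[n]$ (notation introduced here only for this remark), for any $S,T\subset[n]$ we have

$$\Big|\sum_{i\in S,j\in T}\boldsymbol{B}_{ij}\Big|=\big|\mathbf{1}_{S}^{\top}\boldsymbol{B}\mathbf{1}_{T}\big|\leq\|\boldsymbol{B}\|_{2\rightarrow 2}\,\|\mathbf{1}_{S}\|_{2}\,\|\mathbf{1}_{T}\|_{2}\leq n\,\|\boldsymbol{B}\|_{2\rightarrow 2},$$

and dividing by $n^{2}$ and taking the maximum over $S$ and $T$ gives $\|\boldsymbol{B}\|_{\square}\leq\|\boldsymbol{B}\|_{2\rightarrow 2}/n$ .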

Proof of Lemma 17.

Consider the matrix ${\boldsymbol{\Theta}}^{\prime}$ with entries $({\boldsymbol{\Theta}}^{\prime})_{ij}=\rho_{n}W^{*}(\xi_{i},\xi_{j})$ for all $i,j$ . As opposed to $\boldsymbol{\Theta}^{*}$ , the diagonal entries of ${\boldsymbol{\Theta}}^{\prime}$ are not constrained to be null. By the triangle inequality, we get

[TABLE]

Since the entries of $\boldsymbol{\Theta}^{*}$ coincide with those of ${\boldsymbol{\Theta}}^{\prime}$ outside the diagonal, the difference $\widetilde{f}_{\boldsymbol{\Theta}^{*}}-\widetilde{f}_{{\boldsymbol{\Theta}}^{\prime}}$ is null outside of a set of measure $1/n$ . Also, the entries of ${\boldsymbol{\Theta}}^{\prime}$ are smaller than $1$ . It follows that $\operatorname{\mathbb{E}}[\delta_{\square}(\widetilde{f}_{\boldsymbol{\Theta}^{*}},\widetilde{f}_{{\boldsymbol{\Theta}}^{\prime}})]\leq 1/n\leq\sqrt{\rho_{n}/n}$ . Since $\delta_{\square}(\widetilde{f}_{{\boldsymbol{\Theta}}^{\prime}},f^{*})\leq\delta_{1}(\widetilde{f}_{{\boldsymbol{\Theta}}^{\prime}},f^{*})$ , it suffices to prove that

[TABLE]

Since $W^{*}$ is a $k$ -step function, we can reorganize $f^{*}$ and $\widetilde{f}_{\boldsymbol{\Theta}^{\prime}}$ in such a way that these two graphons are equal on a set of large Lebesgue measure. More precisely, we adopt the same approach as in the proof of Theorem 1 and we only sketch the argument here. Let $\boldsymbol{Q}\in(\mathbb{R}^{+})^{k\times k}_{sym}$ and $\phi:[0,1]\to[k]$ be the matrix and the map that characterize $W^{*}$ . For $a=1,\ldots,k$ , denote $\lambda_{a}=\lambda(\phi^{-1}(\{a\}))$ . For any $b\in[k]$ , define the cumulative distribution function $F_{\phi}(b)=\sum_{a=1}^{b}\lambda_{a}$ and set $F_{\phi}(0)=0$ . For any $(a,b)\in[k]\times[k]$ define $\Pi_{ab}(\phi)=[F_{\phi}(a-1),F_{\phi}(a))\times[F_{\phi}(b-1),F_{\phi}(b))$ . Define $W^{\prime}(x,y)=\sum_{a=1}^{k}\sum_{b=1}^{k}\boldsymbol{Q}_{ab}\mathds{1}_{\Pi_{ab}(\phi)}(x,y)$ . Obviously, $f^{\prime}=\rho_{n}W^{\prime}$ is weakly isomorphic to $f^{*}=\rho_{n}W^{*}$ . Now, let $\widehat{\lambda}_{a}=\frac{1}{n}\sum_{i=1}^{n}\mathds{1}_{\{\xi_{i}\in\phi^{-1}(a)\}}$ be the (unobserved) empirical frequency of group $a$ . Consider a function $\psi:[0,1]\rightarrow[k]$ such that:

(i) $\psi(x)=a$ for all $a\in[k]$ and $x\in[F_{\phi}(a-1),F_{\phi}(a-1)+\widehat{\lambda}_{a}\wedge\lambda_{a})$,

(ii) $\lambda(\psi^{-1}(a))=\widehat{\lambda}_{a}$ for all $a\in[k]$.

Such a function $\psi$ exists (for details, see Step 2 of the proof of Theorem 1). Finally, define the graphon $\widehat{f}^{\prime}(x,y)=\boldsymbol{Q}_{\psi(x),\psi(y)}$. Notice that $\widehat{f}^{\prime}$ is weakly isomorphic to the empirical graphon $\widetilde{f}_{\boldsymbol{\Theta}^{*}}$. Since $\delta_{1}(\cdot,\cdot)$ is a metric on the quotient space of graphons, we have

[TABLE]

The two functions $f^{\prime}(x,y)$ and $\widehat{f}^{\prime}(x,y)$ are equal except possibly when either $x$ or $y$ belongs to one of the intervals $[F_{\phi}(a-1)+\widehat{\lambda}_{a}\wedge\lambda_{a},F_{\phi}(a-1)+\lambda_{a})$ for $a\in[k]$, and we have

[TABLE]
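The construction of $\psi$ can be made concrete with a small sketch. The code below is only an illustration under stated assumptions, not the authors' construction from Step 2 of the proof of Theorem 1: the function names `build_psi`, `label_measures` and `mismatch_measure` are hypothetical, labels are 0-based, and the leftover part of each block is reassigned greedily to the groups whose empirical frequency exceeds $\lambda_{a}$. It checks that conditions (i) and (ii) hold and that $\psi$ disagrees with the canonical block label on a set of measure $\sum_{a}(\lambda_{a}-\widehat{\lambda}_{a})_{+}=\frac{1}{2}\sum_{a}|\lambda_{a}-\widehat{\lambda}_{a}|$.

```python
from fractions import Fraction
from itertools import accumulate


def build_psi(lam, lam_hat):
    """Return psi as sorted (start, end, label) pieces partitioning [0, 1).

    lam[a] and lam_hat[a] are exact Fractions that each sum to 1.
    Labels are 0-based here, whereas the text uses a = 1, ..., k.
    """
    assert sum(lam) == 1 and sum(lam_hat) == 1
    k = len(lam)
    F = [Fraction(0)] + list(accumulate(lam))  # F[a] = lam[0] + ... + lam[a-1]
    pieces, leftovers, deficit = [], [], []
    for a in range(k):
        keep = min(lam[a], lam_hat[a])
        if keep > 0:
            # Condition (i): the first min(lam, lam_hat) part of block a keeps label a.
            pieces.append((F[a], F[a] + keep, a))
        if lam[a] > keep:
            # The rest of block a must receive some other label.
            leftovers.append((F[a] + keep, F[a + 1]))
        deficit.append(lam_hat[a] - keep)  # measure still owed to label a
    # Greedy reassignment (an illustrative choice): fill deficits in label order.
    a = 0
    for start, end in leftovers:
        while start < end:
            while deficit[a] == 0:
                a += 1
            take = min(end - start, deficit[a])
            pieces.append((start, start + take, a))
            start += take
            deficit[a] -= take
    return sorted(pieces)


def label_measures(pieces, k):
    """Lebesgue measure of psi^{-1}(a) for each label a."""
    out = [Fraction(0)] * k
    for start, end, a in pieces:
        out[a] += end - start
    return out


def mismatch_measure(pieces, lam):
    """Measure of the set where psi differs from the canonical block label."""
    F = [Fraction(0)] + list(accumulate(lam))
    return sum(
        (end - start for start, end, a in pieces if not F[a] <= start < F[a + 1]),
        Fraction(0),
    )


lam = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
lam_hat = [Fraction(3, 10), Fraction(5, 10), Fraction(2, 10)]  # e.g. n = 10
pieces = build_psi(lam, lam_hat)
assert label_measures(pieces, 3) == lam_hat  # condition (ii)
assert mismatch_measure(pieces, lam) == sum(abs(x - y) for x, y in zip(lam, lam_hat)) / 2
```

Exact `Fraction` arithmetic is used so that the measure identities can be checked with equality rather than a floating-point tolerance. Since $f^{\prime}$ and $\widehat{f}^{\prime}$ can differ only when $x$ or $y$ falls in the mismatched set, its measure is what controls the $\delta_{1}$ bound.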

Since $\xi_{1},\ldots,\xi_{n}$ are i.i.d. uniformly distributed random variables, $n\widehat{\lambda}_{a}$ has a binomial distribution with parameters $(n,\lambda_{a})$. By the Cauchy-Schwarz inequality, we get $\operatorname{\mathbb{E}}[|\lambda_{a}-\widehat{\lambda}_{a}|]\leq\sqrt{\lambda_{a}(1-\lambda_{a})/n}$ and $\mathbb{E}(|\lambda_{a}-\widehat{\lambda}_{a}||\lambda_{b}-\widehat{\lambda}_{b}|)\leq\sqrt{\lambda_{a}\lambda_{b}}/n$. Then, we get

[TABLE]
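For completeness, the two moment bounds on $\widehat{\lambda}_{a}$ stated above can be spelled out as follows. This is a standard computation written here as a sketch; it uses only $\operatorname{Var}(\widehat{\lambda}_{a})=\lambda_{a}(1-\lambda_{a})/n$ for the binomial distribution and $1-\lambda_{a}\leq 1$:

$$
\begin{aligned}
\operatorname{\mathbb{E}}\big[|\lambda_{a}-\widehat{\lambda}_{a}|\big]
&\leq\sqrt{\operatorname{\mathbb{E}}\big[(\lambda_{a}-\widehat{\lambda}_{a})^{2}\big]}
=\sqrt{\operatorname{Var}(\widehat{\lambda}_{a})}
=\sqrt{\frac{\lambda_{a}(1-\lambda_{a})}{n}},\\
\operatorname{\mathbb{E}}\big[|\lambda_{a}-\widehat{\lambda}_{a}|\,|\lambda_{b}-\widehat{\lambda}_{b}|\big]
&\leq\sqrt{\operatorname{\mathbb{E}}\big[(\lambda_{a}-\widehat{\lambda}_{a})^{2}\big]\operatorname{\mathbb{E}}\big[(\lambda_{b}-\widehat{\lambda}_{b})^{2}\big]}
=\frac{\sqrt{\lambda_{a}(1-\lambda_{a})\lambda_{b}(1-\lambda_{b})}}{n}
\leq\frac{\sqrt{\lambda_{a}\lambda_{b}}}{n}.
\end{aligned}
$$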

Now for $W^{*}\in\mathcal{W}^{+}_{1}[k,\mu]$ we use $\lambda_{a}\geq\mu/k$ for all $a\in[k]$ to get

[TABLE]

since we assume that $k\leq\mu n$. For $W^{*}\in\mathcal{W}^{+}_{2}[k]$, we use the Cauchy-Schwarz inequality:

[TABLE]

since $k\leq n$. ∎

Bibliography

[1] Edo M Airoldi, Thiago B Costa, and Stanley H Chan. Stochastic blockmodel approximation of a graphon: Theory and consistent estimation. In Advances in Neural Information Processing Systems, pages 692–700, 2013.
[2] Noga Alon, W. Fernandez De La Vega, Ravi Kannan, and Marek Karpinski. Random sampling and approximation of MAX-CSPs. Journal of Computer and System Sciences, 67(2):212–243, 2003.
[3] Afonso S. Bandeira and Ramon van Handel. Sharp nonasymptotic bounds on the norm of random matrices with independent entries. Ann. Probab., 44(4):2479–2506, 2016.
[4] Peter J Bickel and Aiyou Chen. A nonparametric view of network models and Newman–Girvan and other modularities. Proceedings of the National Academy of Sciences, 106(50):21068–21073, 2009.
[5] Peter J Bickel, Aiyou Chen, and Elizaveta Levina. The method of moments and degree distributions for network models. The Annals of Statistics, 39(5):2280–2301, 2011.
[6] Béla Bollobás, Svante Janson, and Oliver Riordan. The phase transition in inhomogeneous random graphs. Random Structures & Algorithms, 31(1):3–122, 2007.
[7] C. Borgs, J.T. Chayes, H. Cohn, and S. Ganguly. Consistent nonparametric estimation for heavy-tailed sparse graphs. arXiv e-prints, August 2015.
[8] C. Borgs, J.T. Chayes, L. Lovász, V. T. Sós, and K. Vesztergombi. Convergent sequences of dense graphs. I. Subgraph frequencies, metric properties and testing. Adv. Math., 219(6):1801–1851, 2008.
