Optimal graphon estimation in cut distance
Olga Klopp (CREST), Nicolas Verzelen (MISTEA)

TL;DR
This paper establishes minimax estimation rates for graphons and connection probability matrices in cut distance, revealing that the adjacency matrix alone is already optimally informative for this metric.
Contribution
It proves that the adjacency matrix achieves optimal minimax rates in cut distance, showing no benefit from more complex estimation procedures.
Findings
Raw adjacency matrix is minimax optimal in cut distance.
Estimation rates are established for block constant matrices and step function graphons.
Contrasts with classical distances where more complex methods improve results.
Abstract
Consider the twin problems of estimating the connection probability matrix of an inhomogeneous random graph and the graphon of a W-random graph. We establish the minimax estimation rates with respect to the cut metric for classes of block constant matrices and step function graphons. Surprisingly, our results imply that, from the minimax point of view, the raw data, that is, the adjacency matrix of the observed graph, is already optimal and more involved procedures cannot improve the convergence rates for this metric. This phenomenon contrasts with optimal rates of convergence with respect to other classical distances for graphons such as the $l_1$ or $l_2$ metrics.
Taxonomy
Topics: Graph theory and applications · Random Matrices and Applications · Complex Network Analysis Techniques
Olga Klopp (ESSEC Business School and CREST, France) and Nicolas Verzelen (INRA, UMR 729 MISTEA, F-34060 Montpellier, France)
Keywords: inhomogeneous random graph, graphon, W-random graphs, networks, stochastic block model, cut distance.
1 Introduction
In the last decade, network analysis has become an important research field driven by applications in social sciences, computer sciences, statistical physics, genomics, ecology, etc. A flourishing line of literature amounts to fitting observed networks to parametric or non-parametric models of random graphs. Among the parametric models, one of the most popular is the stochastic block model [23]. In the stochastic block model with $n$ vertices and $k$ blocks, the class $z_i$ of each vertex $i$ is drawn independently in $\{1,\dots,k\}$ according to some probability distribution. Given the classes $z_1,\dots,z_n$, the edges of the graph are then sampled independently, the probability that there is an edge between $i$ and $j$ being equal to $Q_{z_i z_j}$, where $Q$ is a given $k\times k$ symmetric matrix. Although this model is suitable for analyzing small networks, it does not allow one to analyze the finer structures of extremely large networks. To go beyond the possible limitations of parametric models, non-parametric models of random graphs have been introduced [22, 18].
One possible non-parametric generalization of the stochastic block model is given by the $W$-random graph model [18] based on the notion of graphon. Graphons are symmetric measurable functions $W:[0,1]^2\to[0,1]$. In the sequel, the space of graphons is denoted by $\mathcal{W}^+$. Given a graphon $W_0$, a graph on $n$ vertices is sampled according to the $W$-random graph model in the following way. Let $\Theta_0=(\Theta_{ij})$ be an $n\times n$ random symmetric matrix defined by
$$\Theta_{ii}=0,\qquad \Theta_{ij}=\rho_n W_0(\xi_i,\xi_j)\quad\text{for } i<j, \tag{1}$$
where $\rho_n>0$ is the scale parameter that can be interpreted as the expected proportion of edges and $\xi_1,\dots,\xi_n$ are unobserved (latent) i.i.d. random variables uniformly distributed on $[0,1]$. Then, given $\Theta_0$, the graph is sampled according to the inhomogeneous random graph model [6]. More precisely, vertices $i$ and $j$ are connected by an edge with probability $\Theta_{ij}$ and these events are independent for all pairs $(i,j)$ with $i<j$. When $\Theta_0$ is considered as a deterministic matrix, we call it the inhomogeneous random graph model with respect to $\Theta_0$. If $W_0$ is a step-function with $k$ steps, the graph is distributed as a stochastic block model with $k$ groups. The case of a dense graph corresponds to $\rho_n=1$, whereas the choice $\rho_n\to 0$ when $n\to\infty$ produces sparser graphs. This model was recently studied by a number of authors, see e.g., [4, 5, 34, 27, 17, 16].
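The sampling scheme described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not code from the paper: the function names, the example two-step graphon and the parameter values are hypothetical, and the diagonal of the probability matrix is set to zero to exclude self-loops.

```python
import numpy as np

def sample_w_random_graph(graphon, n, rho, rng):
    """Draw (adjacency, probability matrix, latent positions) from a sparse W-random graph."""
    xi = rng.uniform(0.0, 1.0, size=n)                   # latent i.i.d. uniform positions
    theta = rho * graphon(xi[:, None], xi[None, :])      # theta_ij = rho * W(xi_i, xi_j)
    np.fill_diagonal(theta, 0.0)                         # no self-loops (assumption)
    upper = np.triu(rng.uniform(size=(n, n)) < theta, k=1)  # independent edges for i < j
    adj = (upper | upper.T).astype(int)                  # symmetrize
    return adj, theta, xi

def two_step_graphon(x, y):
    # Hypothetical 2-step graphon: 0.8 inside a group, 0.2 across groups.
    same_group = (x < 0.5) == (y < 0.5)
    return np.where(same_group, 0.8, 0.2)

rng = np.random.default_rng(0)
A, Theta, xi = sample_w_random_graph(two_step_graphon, n=50, rho=0.5, rng=rng)
print(A.sum() // 2, "edges; largest connection probability:", Theta.max())
```

Because the graphon here is a step function with two steps, the sampled graph is a two-group stochastic block model whose group labels are determined by the latent positions.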
In the present paper we consider the problems of estimating the matrix of connection probabilities $\Theta_0$ and the graphon $W_0$ from a single observation of a graph. Suppose that we observe the adjacency matrix $A$ of a graph that has either been sampled according to the inhomogeneous random graph model with a fixed matrix $\Theta_0$ or to the $W$-random graph model with graphon $W_0$. Then, given a single observation $A$, we want to estimate $\Theta_0$ or $W_0$.
Graphon estimation is more challenging than probability matrix estimation, in particular, because of identifiability issues: multiple graphons can lead to the same distribution on the space of graphs of size $n$. This is not unexpected as the distribution of the network is invariant with respect to any change of labeling of its nodes. More precisely, two graphons $U$ and $W$ in $\mathcal{W}^+$ define the same probability distribution if and only if there exist measure preserving maps $\phi,\psi:[0,1]\to[0,1]$ such that $U(\phi(x),\phi(y))=W(\psi(x),\psi(y))$ almost everywhere. This equivalence relation is called a weak isomorphism [28]. The corresponding quotient space is denoted by $\widetilde{\mathcal{W}}^+$. As a consequence, one can only estimate the equivalence class of $W_0$ in $\widetilde{\mathcal{W}}^+$ and we refer henceforth to graphon estimation as the problem of estimating this equivalence class from the adjacency matrix $A$ sampled from the $W$-random graph model (1). When there is no ambiguity, we shall identify a graphon and its corresponding equivalence class.
The problem of estimating $\Theta_0$ was previously considered in a number of papers. For the matrix estimation problem, the quality of an estimator is usually assessed through the Frobenius loss. For instance, [16] obtain sub-optimal convergence rates for this problem using a singular value thresholding algorithm. Relying on a least-squares estimator, [20] established the minimax estimation rates for $\Theta_0$ on classes of block constant matrices and smooth graphon classes. Their analysis is restricted to the dense case with constant $\rho_n$. More recently, [26] extended their results to the sparse case, where $\rho_n$ depends on $n$ and goes to zero when $n\to\infty$.
As for graphon estimation, most results on estimation error are expressed in terms of the $\delta_2$ loss (see below for a formal definition of this metric). For classes of smooth graphons, estimators based on maximum likelihood, restricted least-squares estimators, or neighborhood smoothing have been studied in [33, 15, 1, 14, 35, 26]. For classes of step-function graphons, restricted least-squares estimators have been considered in [10, 26] and the minimax optimal rates of convergence have been derived in [26].
Although one can take advantage of the Euclidean structure of the Frobenius matrix norm and the metric on , both these metrics do not readily reflect the closeness in terms of the topology of the random graphs. As the structure of the graphon space is infinite-dimensional, not all norms are equivalent and one may wonder whether one should not study the graphon estimation problem with respect to a more suitable distance. We argue below that the cut distance which plays a central role in the random graph theory is a good candidate for this.
1.1 Cut metric
One of the fundamental questions in graph theory is the following one: what does it mean for two large graphs to be similar or close? There are different ways of defining the distance between two graphs. For example, the edit distance is defined as the normalized Hamming distance of the edge sets. Up to a normalization, it corresponds to the $l_1$ distance between the adjacency matrices. One of the troubles with this notion of distance is that it does not reflect well structural similarities between two graphs. For instance, the edit distance between two independent graphs drawn from the Erdős-Rényi model with $p=1/2$ is close to 1/2 with high probability. Another notion of distance, called cut distance, better reflects the structural similarity. The cut norm of an $n\times n$ matrix $A$ has been introduced by Frieze and Kannan [19]. It is defined by
$$\|A\|_\square=\frac{1}{n^2}\max_{S,T\subset[n]}\Big|\sum_{i\in S,\,j\in T}A_{ij}\Big|.$$
In other words, $\|A\|_\square$ corresponds (up to a renormalization) to the maximal sum of entries over all submatrices of $A$. Then, the cut distance between two graphs $G$ and $G'$ defined on the same set of $n$ nodes and with adjacency matrices $A$ and $A'$ is defined as the cut norm $\|A-A'\|_\square$. Denoting by $e_G(S,T)$ the number of edges between nodes in $S$ and in $T$ in the graph $G$, the cut distance is the supremum over all $S,T\subset[n]$ of $|e_G(S,T)-e_{G'}(S,T)|/n^2$. In other words, the cut distance is small if the restrictions of $G$ and $G'$ to all subsets have similar edge densities.
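To make the contrast between the edit distance and the cut distance concrete, here is a small brute-force sketch. It assumes the $1/n^2$ normalization of the cut norm and computes it exactly by enumerating the row subset $S$ (for a fixed $S$, the best column subset $T$ collects the columns whose partial sums share a sign). The helper names are hypothetical and the enumeration is only feasible for very small $n$.

```python
import numpy as np

def cut_norm(M):
    """Exact normalized cut norm: (1/n^2) * max over S, T of |sum_{i in S, j in T} M_ij|."""
    M = np.asarray(M, dtype=float)
    n = M.shape[0]
    best = 0.0
    for mask in range(1 << n):                        # enumerate every row subset S
        in_S = np.array([(mask >> i) & 1 for i in range(n)], dtype=float)
        col_sums = in_S @ M                           # sum_{i in S} M_ij for each column j
        pos = col_sums[col_sums > 0].sum()            # best T when the total is positive
        neg = -col_sums[col_sums < 0].sum()           # best T when the total is negative
        best = max(best, pos, neg)
    return best / n**2

def erdos_renyi(n, p, rng):
    """Symmetric 0/1 adjacency matrix with independent edges of probability p."""
    upper = np.triu(rng.uniform(size=(n, n)) < p, k=1)
    return (upper | upper.T).astype(int)

rng = np.random.default_rng(1)
n = 12
A, B = erdos_renyi(n, 0.5, rng), erdos_renyi(n, 0.5, rng)
edit_distance = np.abs(A - B).sum() / n**2            # normalized Hamming (l1) distance
cut_distance = cut_norm(A - B)
print(edit_distance, cut_distance)
```

The cut distance can never exceed the normalized $l_1$ distance, since every block sum is bounded by the sum of absolute entries; for two independent Erdős-Rényi graphs it is typically much smaller.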
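Let us denote by $\mathcal{W}$ the collection of symmetric measurable functions $W:[0,1]^2\to\mathbb{R}$. By analogy with the matrix cut norm, we can define the cut norm of a kernel $W\in\mathcal{W}$:
$$\|W\|_\square=\sup_{S,T\subset[0,1]}\Big|\int_{S\times T}W(x,y)\,dx\,dy\Big|,$$
where the supremum is taken over all measurable subsets $S$ and $T$. Then, the distance between two graphons $W$ and $W'$ in $\mathcal{W}^+$ is simply $\|W-W'\|_\square$. As explained earlier in the introduction, graphons in $\mathcal{W}^+$ are not identifiable. This is why we consider the metric induced by $\|\cdot\|_\square$ on the quotient space $\widetilde{\mathcal{W}}^+$ defined by
$$\delta_\square(W,W')=\inf_{\tau\in\mathcal{M}}\|W-W'^{\tau}\|_\square, \tag{3}$$
where we take the infimum in the set $\mathcal{M}$ of all measure-preserving bijections $\tau:[0,1]\to[0,1]$ and $W'^{\tau}(x,y)=W'(\tau(x),\tau(y))$.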
The cut distance is also a cornerstone in the graph limit theory introduced by Lovász and Szegedy [29] and further developed in, e.g., [8, 9]. In particular, this theory states that graphons can be interpreted as limits (with respect to ) of graph sequences. Besides, convergence in is equivalent to other structural properties such as the convergence of all homomorphism numbers. Given a simple graph with nodes and a graphon , the homomorphism number is the probability that the edge set of size of a graph sampled from the model (1) (with ) contains the edge set of . As a consequence, the homomorphism numbers and are close when the expected number of subgraphs for a size random graph sampled from is close to that of a size random graph sampled from . It has been established that convergence in the cut distance is equivalent to convergence of homomorphism numbers for all simple graphs (see Theorem 11.5 in [28] for more details). Hence, estimating the graphon well in the cut distance allows one to estimate well the number of small patterns induced by . On the other hand, the cut distance controls other quantities of interest for computer scientists such as the size of multi-way cuts [12, 10]. So, a consistent estimator of in cut distance gives consistent estimators for the multi-way cuts.
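As an illustration of the kind of pattern count that the cut distance controls, the following sketch computes the triangle homomorphism density in its standard form, once for a graph and once for a step-function graphon. The formulas $t(K_3,G)=\mathrm{tr}(A^3)/n^3$ and $t(K_3,W)=\sum_{a,b,c}\pi_a\pi_b\pi_c Q_{ab}Q_{bc}Q_{ca}$ are the usual textbook ones; the function names and the symbols $Q$ (block values) and $\pi$ (step lengths) are assumptions introduced here.

```python
import numpy as np

def triangle_density_graph(A):
    """t(K3, G) = trace(A^3) / n^3: fraction of vertex triples (with repetition) forming a triangle."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    return np.trace(A @ A @ A) / n**3

def triangle_density_step(Q, pi):
    """t(K3, W) for a step graphon with block values Q and step lengths pi."""
    Q = np.asarray(Q, dtype=float)
    pi = np.asarray(pi, dtype=float)
    return np.einsum("a,b,c,ab,bc,ca->", pi, pi, pi, Q, Q, Q)

# Complete graph on 4 vertices and a two-block graphon with no cross-block edges.
K4 = np.ones((4, 4)) - np.eye(4)
print(triangle_density_graph(K4))                      # 24 closed 3-walks out of 4^3 triples
print(triangle_density_step(np.eye(2), [0.5, 0.5]))    # 0.5^3 + 0.5^3
```

Two graphons that are close in cut distance have close values of such densities, which is why a cut-distance-consistent estimator yields consistent estimates of small subgraph frequencies.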
The construction of can be extended to any other norm that is invariant under measure preserving maps:
[TABLE]
Besides the cut norm, we already mentioned the and -norms on defined by and . These two norms define the corresponding distances and on the quotient space . The distance is dominated by and (for details see Section 2.2). As already noted for instance in [10], this immediately implies that the convergence rate of an estimator with respect to the -distance is at least as fast as its convergence rate with respect to the -distance. Then, one may wonder whether the convergence rates in -distance can be significantly faster and whether those faster rates are achieved by the estimators that are already minimax optimal with respect to other metrics.
In fact, a partial result on uniform convergence rates has already been proved. One of the striking consequences of the celebrated Szemerédi’s Lemma [31] states that an adjacency matrix sampled from a -random graph model converges to the true graphon in cut distance, and does so at a uniform rate over all graphons. To be more specific, let be a graphon and let be the size adjacency matrix sampled according to the -random graph model (1) with . It has been shown in [8] (see also [2] or [28]) that, with high probability, the empirical graphon associated to the adjacency matrix (see (19) for a precise definition) is close in the cut distance to the true graphon :
Proposition 1 (Lemma 10.16 [28]).
Let and let be a graphon. Then, with probability at least ,
[TABLE]
An important point is that the above result is valid for all . Note that if we replace the cut-distance by or -distance this is not true any more: even in the simple case of a constant graphon (with ), the distance between and does not converge to zero.
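The failure of convergence in $l_1$ distance for a constant graphon is easy to check numerically. In the sketch below (hypothetical helper names, dense case, constant graphon equal to $p$), the $l_1$ distance between the empirical graphon of the adjacency matrix and the constant $p$ is the average of $|A_{ij}-p|$ over all cells; since a constant function is unchanged by measure-preserving maps, no relabeling can reduce it. Off the diagonal each term has expectation $2p(1-p)$.

```python
import numpy as np

def l1_distance_to_constant(A, p):
    """||f_A - p||_1: average of |A_ij - p| over the n^2 cells of the empirical graphon."""
    return np.abs(np.asarray(A, dtype=float) - p).mean()

def erdos_renyi(n, p, rng):
    """Symmetric 0/1 adjacency matrix with independent edges of probability p."""
    upper = np.triu(rng.uniform(size=(n, n)) < p, k=1)
    return (upper | upper.T).astype(int)

rng = np.random.default_rng(2)
p = 0.3
for n in (100, 400, 1600):
    # The distance stays near 2 p (1 - p) instead of shrinking with n.
    print(n, l1_distance_to_constant(erdos_renyi(n, p, rng), p), 2 * p * (1 - p))
```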
1.2 Our contribution and related results
Our purpose in this paper is to go beyond uniform convergence rates over all graphons in and to understand the optimal cut distance convergence rates when has a specific structure. First, optimal convergence rates are derived for the estimation of the connection probability matrix when it belongs to classes of block-constant matrices. Second, we establish the optimal convergence rates for all classes of step-function graphons both in sparse and dense case. In particular for (dense case), our results imply that, for any integer and –steps graphon , one has
[TABLE]
where is a numerical constant (independent of and ) and that this convergence rate is optimal from the minimax point of view. This result has some interesting implications. In particular, this guarantees the optimality of the rate in Proposition 1 for general graphons. On the other hand, our results imply that for more structured classes of graphons () much faster rates are achievable. Interestingly, we show that the adjacency matrix and its associated empirical graphons are already adaptive to the unknown number of blocks of the matrix or steps of and minimax optimal. As a consequence, there is no need to look for more involved estimators.
In practice, it could be disappointing that the raw data are already optimal with respect to the cut distance, whereas they perform really badly with respect to the distance. This is why we prove that a singular value hard thresholding estimator is still optimal with respect to the cut metric while achieving the best known rate in -distance in the class of polynomial-time estimators.
Our results are in sharp contrast to all aforementioned manuscripts [33, 15, 10, 1, 14, 35, 26] whose primary focus is the -distance and whose convergence rates with respect to the -distance are derived from the domination of by . Closest to our contribution is the recent paper [7], where the authors introduce a least-cut norm estimator for a more general model of unbounded graphons. Translated into our framework, their non-polynomial time algorithm achieves, in some cases, the optimal convergence rate (up to a logarithmic loss) and it is slower in other cases. In Section 4.3 we extend our study to unbounded graphons and compare our results to those of [7]. In particular, our Proposition 8 implies that the empirical graphon associated to the adjacency matrix and to the singular value hard thresholding estimator are optimal (up to a logarithmic factor) also in the general case of unbounded graphons. Note that the main difference with the method proposed in [7] is that both our estimators can be easily computed in polynomial time.
From a technical point of view, the tools needed for deriving optimal cut distance rates differ from those used for the -distance. For establishing the minimax lower bounds, the main technical hurdle is to build a collection of well-spaced graphons with respect to the cut distance. Indeed, the cut distance is difficult to lower bound as it is defined as an infimum over all measure-preserving transformations. As for the minimax upper bound on the estimation error in (6), it can be obtained quite easily without the correct logarithmic term thanks to Bernstein’s inequality together with some bounds from [26] for the stronger metric . However, recovering the right logarithmic term in (6) is much more challenging. The proof relies among other things on a careful application of Szemerédi’s regularity lemma to distorted versions of the graphon.
The manuscript is organized as follows. First, we recall some basic results related to the cut metric. The problem of estimating the matrix of connection probabilities is considered in Section 3. We study the problem of graphon estimation in Section 4. The appendix contains all the proofs where in Appendix A we recall some basic facts and results that are often used in the proofs.
2 Notation and Preliminaries
2.1 Notation
We gather here some of the notation used throughout this paper. Some of them have already been defined in the introduction.
- For a matrix , (or , or ) is its -th entry. Let and stand for its th row and th column respectively. We denote by the class of all symmetric matrices with real-valued entries. Given a matrix and , denotes its entry-wise norm, that is for and . Given , stands for its operator norm:
[TABLE]
Finally, stands for the canonical inner product between matrices .
- is the collection of symmetric measurable functions . Given a kernel and , its norm is defined by , whereas . is the space of graphons and is the corresponding quotient space. The cut distance in the graphon spaces is defined by (3). Also, and defined by (4) respectively correspond to the and distances on the quotient space of graphons . Given a symmetric square matrix with values in , is the empirical graphon as defined in (19).
- Given a probability matrix , we denote by the expectation with respect to the distribution of if we consider the inhomogeneous random graph model and given a graphon and , we write for the expectation with respect to the joint distribution of .
- We denote by the maximal integer less than or equal to and by the smallest integer greater than or equal to . For a positive integer , set . denotes the indicator function of a set . In the sequel, stands for a positive constant that can vary from line to line. These are absolute constants unless otherwise mentioned. For two positive functions and , we write when there exist two positive numerical constants and such that . Finally, is the Lebesgue measure on the interval .
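- Given an $n\times n$ matrix $\Theta$ with entries in $[0,1]$, we define the empirical graphon $\widetilde{f}_{\Theta}$ as the following piecewise constant function: $\widetilde{f}_{\Theta}(x,y)=\Theta_{\lceil nx\rceil,\lceil ny\rceil}$ for all $x$ and $y$ in $(0,1]$.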
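The empirical graphon in the last item can be written as a small function factory. This is a sketch under the assumption that the piecewise constant function takes the value of the matrix entry indexed by $(\lceil nx\rceil,\lceil ny\rceil)$; the name `empirical_graphon` and the convention of sending $x=0$ to the first cell are choices made here for illustration.

```python
import numpy as np

def empirical_graphon(Theta):
    """Return f with f(x, y) = Theta[ceil(n x), ceil(n y)] (1-based indices) on (0, 1]^2."""
    Theta = np.asarray(Theta, dtype=float)
    n = Theta.shape[0]

    def f(x, y):
        # ceil(n x) lies in 1..n for x in (0, 1]; shift to 0-based and clip so x = 0 maps to cell 0.
        i = np.clip(np.ceil(n * np.asarray(x)).astype(int) - 1, 0, n - 1)
        j = np.clip(np.ceil(n * np.asarray(y)).astype(int) - 1, 0, n - 1)
        return Theta[i, j]

    return f

f = empirical_graphon([[0.0, 1.0], [1.0, 0.0]])
print(f(0.25, 0.75), f(0.25, 0.25))   # one cross-cell value, one diagonal-cell value
```

The function accepts scalars or NumPy arrays of equal shape, so it can be evaluated on a grid when approximating integrals of the graphon.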
2.2 Preliminaries
We start with a few basic properties of the cut norm for matrices and graphons . It is easy to see that
[TABLE]
where and are the usual entry-wise and -norms of a matrix. For a function , we have
[TABLE]
where and denote and -norms of a graphon. In the opposite direction, we have . As a consequence, the metric and define the same topology on the space of graphons. In contrast, the cut distance defines a weaker topology on the space as illustrated by the aforementioned sampling result (Proposition 1).
We shall also sometimes rely on the equivalence between the cut norm and the operator norm:
[TABLE]
where the supremum is taken over all (real-valued) functions and with values in . It is known that (see e.g., [24])
[TABLE]
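For matrices, the comparison between the cut norm and the $\infty\to 1$ operator norm can be checked by brute force. The sketch below assumes the standard inequality $\|M\|_\square\le\|M\|_{\infty\to1}\le 4\|M\|_\square$ (with the same normalization on both sides) and uses unnormalized versions of both norms; the function names are hypothetical and the enumeration is exponential in $n$.

```python
import numpy as np
from itertools import product

def cut_norm_raw(M):
    """max over 0/1 vectors s, t of |s^T M t| (no normalization)."""
    n = M.shape[0]
    best = 0.0
    for s in product((0.0, 1.0), repeat=n):
        col = np.array(s) @ M                     # partial column sums over the rows in S
        best = max(best, col[col > 0].sum(), -col[col < 0].sum())
    return best

def inf_to_one_norm(M):
    """max over sign vectors x, y of x^T M y, computed as max over x of ||M^T x||_1."""
    n = M.shape[0]
    return max(np.abs(np.array(x) @ M).sum() for x in product((-1.0, 1.0), repeat=n))

rng = np.random.default_rng(4)
M = rng.normal(size=(8, 8))
M = (M + M.T) / 2                                 # symmetric test matrix
c, o = cut_norm_raw(M), inf_to_one_norm(M)
print(c, o, o / c)                                # the ratio lies between 1 and 4
```

The factor 4 is attained, for instance, by the $2\times2$ matrix with $+1$ on the diagonal and $-1$ off the diagonal.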
3 Probability matrix estimation
3.1 Cut norm minimax risk
We start with a simple proposition that bounds the expected cut distance between and the sampled adjacency matrix . Similar results already appeared in the literature, see e.g., [28, Lemma 10.11], [7] or [21]. Its proof is based on Bernstein’s inequality and is given in Section B.
Proposition 2.
For any probability matrix we have
[TABLE]
In particular, if , we get
[TABLE]
This implies that the adjacency matrix is -close in cut-distance to the probability matrix . This bound is valid for all matrices . It turns out that no estimator can perform much better than , even on some simple classes of parameters .
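Computing the cut norm exactly is infeasible for large matrices, but a quick numerical sanity check is possible through the spectral upper bound $\|M\|_\square\le\|M\|_{\mathrm{op}}/n$, which follows from $|1_S^\top M 1_T|\le\|M\|_{\mathrm{op}}\sqrt{|S||T|}\le n\|M\|_{\mathrm{op}}$ under the $1/n^2$ normalization. The sketch below applies it to the difference between a sampled adjacency matrix and a constant probability matrix; the function names, the constant matrix and the reference scale $\sqrt{\rho/n}$ printed alongside are assumptions made for illustration.

```python
import numpy as np

def spectral_cut_bound(M):
    """Upper bound ||M||_op / n on the (1/n^2)-normalized cut norm of an n x n matrix."""
    return np.linalg.norm(M, ord=2) / M.shape[0]

def simulate(n, rho, rng):
    """Spectral bound on the cut norm of A - Theta for a constant matrix Theta = rho off the diagonal."""
    theta = np.full((n, n), rho)
    np.fill_diagonal(theta, 0.0)
    upper = np.triu(rng.uniform(size=(n, n)) < theta, k=1)
    adj = (upper | upper.T).astype(float)
    return spectral_cut_bound(adj - theta)

rng = np.random.default_rng(3)
for n in (100, 200, 400):
    # The bound decreases with n at the same order as the reference scale sqrt(rho / n).
    print(n, simulate(n, 0.5, rng), np.sqrt(0.5 / n))
```

This only bounds the error from above; it says nothing about the lower bound, which requires the construction discussed next.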
Let be integers such that and be defined by
[TABLE]
where we denote by the set of all mappings from to . In other words is made of matrices that, up to a permutation of their rows and their columns, are (up to the diagonal) block constants with at most blocks. Also, this corresponds to connection probability matrices of -class stochastic block models whose label vector has been fixed. For any , consider the set
[TABLE]
of matrices whose largest value is smaller than or equal to . The following proposition, proved in Section C, gives a lower bound on the minimax risk over the class of block-constant matrices with only two blocks:
Proposition 3.
The minimax risk measured in cut norm satisfies
[TABLE]
where denotes the expectation with respect to the distribution of when the underlying probability matrix is .
Comparing Proposition 3 with Proposition 2 we observe that the raw data is minimax optimal for the class for all . As a consequence, there is no need to look for a more involved estimator. Since for the constant estimator satisfies and using that the collections are nested, the two previous propositions imply that the optimal cut norm estimation rates for with are given by
[TABLE]
Until now, we left aside the specific case of constant matrices which correspond to Erdős-Rényi random graphs. It turns out that the situation is quite different for this simple class. For a constant matrix , estimating given amounts to inferring the parameter of a Bernoulli distribution given a sample of size . From this analogy, we consider the matrix all of whose non-diagonal entries are equal to . Then, it is straightforward to prove that
[TABLE]
which is -faster than what is achieved by the adjacency matrix . Using again the analogy with the problem of Bernoulli parameter estimation, one may easily get the following minimax lower bound:
[TABLE]
which assesses that the -rate achieved by is optimal.
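For the constant (Erdős-Rényi) case, the estimator obtained from the Bernoulli analogy can be sketched as follows, assuming it fills every off-diagonal entry with the empirical edge frequency. Because the difference between two such constant matrices has entries of a single sign, its cut norm is attained on the full index set and equals the absolute sum of entries divided by $n^2$. Function names are hypothetical.

```python
import numpy as np

def constant_estimator(A):
    """Matrix with all off-diagonal entries equal to the mean off-diagonal entry of A."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    off_diagonal = ~np.eye(n, dtype=bool)
    p_hat = A[off_diagonal].mean()
    estimate = np.full((n, n), p_hat)
    np.fill_diagonal(estimate, 0.0)
    return estimate, p_hat

def cut_norm_same_sign(M):
    """Cut norm of a matrix whose entries share a sign: the maximum is at S = T = [n]."""
    n = M.shape[0]
    return abs(M.sum()) / n**2

rng = np.random.default_rng(5)
n, p = 300, 0.2
upper = np.triu(rng.uniform(size=(n, n)) < p, k=1)
A = (upper | upper.T).astype(float)
theta = np.full((n, n), p)
np.fill_diagonal(theta, 0.0)
estimate, p_hat = constant_estimator(A)
print(p_hat, cut_norm_same_sign(estimate - theta))
```

The error is driven by the deviation of a single averaged Bernoulli parameter, which is why it is much smaller than the error of the raw adjacency matrix.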
3.2 Comparison with and -estimation
The cut norm optimal estimation rate is quite different from what has been established for the Frobenius norm (also called ) estimation rate in [26] (see also [20] for the dense case), that is
[TABLE]
for any . Besides, the minimax risk bound is achieved by the restricted least-square estimators [26] defined by
[TABLE]
Since the Frobenius norm dominates the cut norm, it is expected that the cut norm convergence rate is faster than the Frobenius norm estimation rate. When is not too small and the number of blocks remains small (), the gain is a factor, whereas, for larger , the gain is of order . More importantly, the optimal Frobenius norm convergence rate (10) is only known to be achieved by non-polynomial time estimators such as (11).
In view of the above discussion, one may wonder whether it is possible to build estimators that are near optimal in terms of both the cut and Frobenius distances. Since for any matrix , , it follows that, for , the restricted least-square estimator (11) is also near optimal (up to factor) with respect to the cut distance, that is,
[TABLE]
For matrices with more than blocks, it is not clear whether the estimator achieves a fast rate of convergence in the cut norm.
In any case, the computational complexity of is non polynomial. In fact, no polynomial-time algorithm is known to achieve the minimax risk (10) with respect to the Frobenius norm. Below, we describe an estimator that is optimal in the cut distance and also achieves the best known rate in Frobenius distance in the class of polynomial-time estimators. Let us write the singular value decomposition of :
[TABLE]
where are the singular values of indexed in the decreasing order, are eigenvectors of and . Given a tuning parameter , we define
[TABLE]
as the singular value hard thresholding estimator of . We have the following
Proposition 4.
Assume that . Let where is a sufficiently large numerical constant. Then, for any and any , the hard thresholding estimator simultaneously satisfies, with probability larger than ,
[TABLE]
where is a numerical constant.
The low-rank estimator was previously considered in [16] for Frobenius norm estimation, but error bounds obtained in [16] are much more pessimistic than (14). It follows from (15), that for , with high probability, achieves the optimal rate in the cut norm and the rate in Frobenius norm, which is the best known rate among polynomial-time estimators.
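The singular value hard thresholding estimator is straightforward to implement with a full SVD. The sketch below keeps the singular components whose singular value is at least a threshold `lam`; whether the comparison is strict, and the illustrative choice of `lam` proportional to $\sqrt{n\rho}$ with constant 2, are assumptions made here and not the tuning prescribed by the proposition.

```python
import numpy as np

def hard_threshold_svd(A, lam):
    """Keep only the singular components of A with singular value >= lam."""
    U, s, Vt = np.linalg.svd(np.asarray(A, dtype=float), full_matrices=False)
    keep = s >= lam
    return (U[:, keep] * s[keep]) @ Vt[keep, :]

# Illustration on a two-block stochastic block model (hypothetical parameters).
rng = np.random.default_rng(6)
n, p_in, p_out = 200, 0.6, 0.2
labels = np.arange(n) < n // 2
theta = np.where(labels[:, None] == labels[None, :], p_in, p_out)
np.fill_diagonal(theta, 0.0)
upper = np.triu(rng.uniform(size=(n, n)) < theta, k=1)
A = (upper | upper.T).astype(float)

lam = 2.0 * np.sqrt(n * p_in)                        # assumed threshold of order sqrt(n * rho)
theta_hat = hard_threshold_svd(A, lam)
print(np.linalg.matrix_rank(theta_hat))
print(np.linalg.norm(A - theta), np.linalg.norm(theta_hat - theta))  # Frobenius errors
```

The computation costs one SVD of an $n\times n$ matrix, which is what makes this estimator polynomial-time, in contrast with the restricted least-squares estimator.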
We close this section by the following proposition which gives the minimax optimal rate of estimation in -norm. This will allow us to further compare the and convergence rates for graphon estimation in the next section.
Proposition 5.
For any sequence and any positive integer , one has
[TABLE]
To prove the upper bound we can use the following result which provides the control of the estimation error measured in Frobenius norm of the restricted least-squares estimator proven in [26]:
Proposition 6.
Consider the network sequence model. There exist positive absolute constants such that the following holds. If , then
[TABLE]
The upper bound in (16) is a consequence of the inequality and (17). The lower bound of the minimax risk in (16) is proved following the same lines as the proof of Proposition 2.4 in [26] with replaced by . We skip the details.
4 Graphon estimation problem
In this section, we are interested in estimating the graphon in the sparse -random graph model (1). Let be the collection of –step graphons, that is, the subset of graphons such that for some and some ,
[TABLE]
Note that is also in correspondence with the collection of stochastic block models with blocks. Our purpose here is to characterize the minimax convergence rates over classes .
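A step-function graphon can be represented by a symmetric matrix of block values together with the lengths of the steps. The following sketch assumes the parametrization $W(x,y)=Q_{\phi(x),\phi(y)}$, where $\phi$ maps $[0,1]$ to the index of the interval containing the point; the names `step_graphon`, `Q` and `pi` are introduced here for illustration.

```python
import numpy as np

def step_graphon(Q, pi):
    """Return W(x, y) = Q[phi(x), phi(y)], where phi(x) is the index of the step containing x."""
    Q = np.asarray(Q, dtype=float)
    breakpoints = np.cumsum(pi)[:-1]                 # interior boundaries of the k intervals

    def W(x, y):
        a = np.searchsorted(breakpoints, x, side="right")
        b = np.searchsorted(breakpoints, y, side="right")
        return Q[a, b]

    return W

# Two steps of lengths 0.3 and 0.7 (hypothetical values).
W = step_graphon([[0.8, 0.2], [0.2, 0.5]], [0.3, 0.7])
print(W(0.1, 0.9), W(0.5, 0.5), W(0.1, 0.2))
```

Sampling latent positions uniformly on $[0,1]$ and connecting vertices with probability given by this function reproduces a stochastic block model whose group proportions are the step lengths.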
4.1 Cut distance minimax risk
Following [26], we start by associating a graphon to any probability matrix . Then, we can estimate graphon using the empirical graphon associated to an estimate of . Recall that, given a matrix with entries in , we define the graphon as the following piecewise constant function:
[TABLE]
for all and in . For any estimator of and any norm that is invariant under measure preserving maps the triangle inequality implies
[TABLE]
We have two parts in (20). The first term is the estimation error term that has been considered in the previous section. The second term is the agnostic error. It measures the -distance between the true graphon and its discretized version sampled at the unobserved random design points . The behavior of depends on the topology of the considered class of graphons. The following theorem, proved in Section E, gives the upper bound on the agnostic error, measured in -distance for step function graphons:
Theorem 1 (Agnostic error measured in cut distance).
Consider the -random graph model (1). For all integers , all positive integers , all and , we have
[TABLE]
Note that the case is a consequence of Proposition 1 from [28], so that we effectively only have to consider the case . The proof combines two ideas. First, we build and as the representatives of and in the quotient space such that and match everywhere except on a set of Lebesgue measure of order at most . This allows us to get a risk bound of order . In order to recover the correct logarithmic factor , we rely on the weak Szemerédi’s Lemma. Here, the key idea is to build a cut-norm approximation of a distorted transformation of where the weights of the group have been modified to take into account the geometry of the sampling error.
As an immediate consequence of (20), Proposition 2 and Theorem 1, we get the following upper bound on the risk of the empirical graphon . For any , it holds that
[TABLE]
where is an absolute constant. Here, denotes the expectation with respect to the distribution of observations when the underlying sparse graphon is . The following theorem provides a matching lower bound for .
Theorem 2.
There exists a universal constant such that for any sequence and any positive integer ,
[TABLE]
where is the infimum over all estimators.
Since the collections are nested, it follows that for all , one has
[TABLE]
In view of (22) and (23), we observe that, as long as, , the empirical graphon is minimax optimal over all classes , . For sparser graphs (), the trivial estimator achieves the optimal rate .
Note that there are two distinct regimes in the minimax convergence rate. When (weakly sparse graphs or large number of groups), the agnostic error dominates and the minimax risk is of order . For moderately sparse graphs or equivalently a small number of steps (), the error arising from the probability matrix estimation dominates and the minimax risk is of order .
As in the previous section, we left aside the specific case of constant graphons . Note that for a graphon the agnostic error is always zero and the loss comes from the probability matrix estimation. Following the arguments of the previous section, we derive that the graphon converges to at the rate which is optimal as soon as .
4.2 Comparison with and -estimation
Minimax risk for graphon estimation in the -distance was obtained in [26, Proposition 3.2] :
[TABLE]
The following proposition, proved in Section G, gives the minimax -convergence rate:
Proposition 7.
For any sequence and any positive integer , we have
[TABLE]
Conversely, there exists an estimator based on the restricted least-squares estimator (11) such that
[TABLE]
The upper and lower bounds given by Proposition 7 match (up to a multiplicative term in one of the regimes). There are three regions in (26) for graphon estimation. The first one corresponds to the case of weakly sparse graphs with . In this case, the agnostic error dominates and the optimal risk is of order . For moderately sparse graphs with , the probability matrix estimation error dominates and the minimax rate is of order (up to a multiplicative term). In the case of highly sparse graphs with , the minimax risk is which corresponds to the risk of the null estimator .
Let us compare the optimal convergence rates with respect to the (26), (24) and (23). Bearing in mind that dominates , which in turn dominates , one should not be surprised that optimal rates with respect to are the slowest. When the number of steps is less than or when the graph is weakly sparse (), then the and optimal rates only differ by a multiplicative term. For larger and sparser graph, the optimal -risk can be larger than the -risk.
Following the discussion in Section 3.2, one may easily build graphon estimators performing well in all these three distances. For instance, the graphon based on the restricted-least-squares estimator is optimal with respect to and and near optimal (up to a possible loss) with respect to for . Besides, the graphon based on the singular value thresholding estimator is optimal with respect to and achieves best known convergence rates with respect to and among polynomial time algorithms.
4.3 Cut distance estimation of and graphons
Until now we have restricted our attention to graphons taking values in . As argued in [11, 12], in this case the empirical degree distribution of a graph sampled from the corresponding -random graph model (1) is light. This contrasts with many practical situations, where the degree distribution is heavy tailed. To circumvent this limitation, Borgs et al. [11, 12] introduce, for , the class of symmetric measurable functions such that . This collection is referred to as the collection of graphons. We have the inclusions for . Given a graphon and a sparsity parameter , the corresponding -random graph model amounts to generating a graph with vertices according to the random matrix sampled as follows
[TABLE]
where are, as in (1), i.i.d. random variables uniformly distributed in . Note that since is now unbounded, we have to take the minimum with in (27). We write . Since is now allowed to be unbounded, graphs sampled according to the model (27) may have power law degree distribution [11]. As in the introduction, we may extend the norms and and the distances and to any graphon with . Also, we write for the quotient space of graphons under weak isomorphism.
Let us also define the collection of -step graphons, that is, the subset of graphons such that for some and some (note that does not depend on ). For we denote by the subset of of “balanced” step functions, that is, if for all . This means that the size of each step is larger than .
Without loss of generality, we can consider normalized graphons, that is, we assume that . The following proposition, proved in Appendix H, gives an oracle inequality for the risk of the empirical graphon associated to the adjacency matrix and to the singular value hard thresholding estimator:
Proposition 8.
Let where is a sufficiently large numerical constant. Given a graphon and , write .
- (1)
Let with , and . Then, for any positive integer , we have
[TABLE]
and
[TABLE]
- (2)
Assume that with and . For any positive integer , we have
[TABLE]
[TABLE]
If belongs to some or to , the convergence rates given by Proposition 8 are the same as the optimal rates for bounded graphons up to a factor. We conjecture that the factor should appear in Proposition 8. Indeed, for bounded graphons, this logarithmic term derives from Szemerédi’s Regularity lemma, and extensions of this lemma to graphons have recently been proved [11]. Nevertheless, our arguments in the proof of Theorem 1 make heavy use of the boundedness of the graphons. In particular, one would have to replace all applications of McDiarmid’s inequality (Lemma 1) by more involved concentration inequalities [13]. We leave this for future work.
When the graphon is not a finite step graphon, a bias term occurs in the risk bounds (28–31). As the estimation risk is measured in the cut distance, one could have hoped to obtain a bias term in the cut distance as well (instead of the larger and distances). It is an interesting open problem to prove whether one can obtain oracle inequalities with cut distance bias terms. Note that, for bounded graphons , using Theorem 1, we can also get an oracle inequality with the bias term and minimax optimal error term.
Upper bounds of the cut distance risk for graphons estimation were previously obtained in [7], where the authors introduced the least cut norm estimator . For any normalized graphon and any , Borgs et al. [7] show in their Theorem 4.1 that this estimator achieves the risk bound
[TABLE]
For graphons, this bound is quite similar (up to an additional term) to those we obtained in (28–29) for the empirical estimators and . Note that the least cut norm estimator cannot be computed in polynomial time, contrary to the empirical graphons associated to the adjacency matrix and to the singular value hard thresholding estimator. Also, when the true graphon either belongs to or to , then the rate in (32) is much slower than what has been obtained in Proposition 4 and Theorem 1.
Appendix A Proof methods
In this section, we summarize some basic facts and fundamental results that we use in the proofs.
A.1 Non-symmetric kernels
At some point, we will need to work with non-symmetric kernels and with kernels defined on general measurable subsets of . In this section we define the corresponding spaces. Let and denote two bounded measurable subsets of . Then, refers to the collection of bounded measurable functions . We will denote by the collection of bounded measurable and non-negative functions . Let be the collection of step kernels, that is, the subset of kernels such that for some and some , ,
[TABLE]
A kernel is also said to be a -step function when it decomposes as in (33) but where is a size matrix, mapping to , and mapping to . The cut norm can be readily extended to kernels in the following way:
[TABLE]
where the supremum is taken over all measurable subsets and .
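For a step kernel, the supremum in the cut norm can be restricted to unions of steps (a linear function on a convex set attains its maximum at an extremal point, as used in the proof of Lemma 4 below), so the norm can be computed by brute force when the number of steps is small. The following Python sketch illustrates this; the function name and the representation of the kernel by a value matrix `q` with step sizes `row_weights` and `col_weights` are assumptions made here for illustration.

```python
from itertools import product


def cut_norm_step_kernel(q, row_weights, col_weights):
    """Cut norm of the step kernel with values q[i][j] on steps of the given sizes.

    Brute force over all unions S of row steps (2^rows of them); for each S the
    best union T of column steps keeps either all positive or all negative
    weighted column sums.
    """
    n_rows, n_cols = len(q), len(col_weights)
    best = 0.0
    for rows in product((0, 1), repeat=n_rows):
        col_sums = [
            col_weights[j]
            * sum(row_weights[i] * q[i][j] for i in range(n_rows) if rows[i])
            for j in range(n_cols)
        ]
        positive = sum(c for c in col_sums if c > 0)
        negative = -sum(c for c in col_sums if c < 0)
        best = max(best, positive, negative)
    return best


checkerboard = [[1.0, -1.0], [-1.0, 1.0]]
half = [0.5, 0.5]
print(cut_norm_step_kernel(checkerboard, half, half))  # 0.25
```

The checkerboard kernel has integral of absolute value equal to 1 but cut norm 0.25, which shows how much weaker the cut norm can be on kernels with sign cancellations.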
A.2 Concentration inequalities
In the proofs we repeatedly use Bernstein’s inequality. We state it here for the readers’ convenience. Let be independent zero-mean random variables. Suppose that almost surely, for all . Then, for any ,
[TABLE]
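As a sanity check, one common textbook form of Bernstein's inequality can be compared with an exact binomial tail. The form used below, with bound exp(-t^2 / (2(v + Mt/3))) where v is the sum of the variances and M bounds the summands, is an assumption and may be stated differently from (35); the function names are likewise illustrative.

```python
from math import comb, exp


def binomial_upper_tail(n, p, t):
    """Exact P(S - n p >= t) for S ~ Binomial(n, p)."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if k - n * p >= t
    )


def bernstein_bound(n, p, t):
    """Textbook Bernstein bound for a sum of n centered Bernoulli(p) variables."""
    variance_sum = n * p * (1 - p)  # sum of the variances of the summands
    m = max(p, 1 - p)  # almost sure bound on each centered summand
    return exp(-t * t / (2 * (variance_sum + m * t / 3)))


for t in (1.5, 3.5, 6.5):
    print(t, binomial_upper_tail(40, 0.1, t), bernstein_bound(40, 0.1, t))
```

For small success probabilities the variance term dominates the denominator, which is what makes Bernstein's inequality well suited to sparse graphs.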
We shall also rely on the bounded difference inequality (also called McDiarmid’s inequality).
Lemma 1 (Bounded difference inequality).
Let denote independent real random variables. Assume that is a measurable function satisfying, for some positive constants , the bounded difference condition
[TABLE]
for all , and all . Then, the random variable satisfies
[TABLE]
for all .
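For reference, the bounded difference inequality is usually stated in the literature in the following one-sided form. The notation here (a function $F$ of independent variables $X_1,\ldots,X_n$, with bounded difference constants $c_1,\ldots,c_n$) is an assumption and may differ from the display above.

```latex
\mathbb{P}\big\{ Z - \mathbb{E}[Z] \geq t \big\}
  \leq \exp\!\Big( -\frac{2t^{2}}{\sum_{i=1}^{n} c_{i}^{2}} \Big),
\qquad Z = F(X_{1},\ldots,X_{n}), \quad t > 0 .
% Example: if F is the empirical mean of n variables taking values in [0,1],
% then c_i = 1/n for every i and the bound reads \exp(-2 n t^2).
```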
A.3 Fano’s lemma
In the sequel, denotes the Kullback-Leibler divergence between two distributions. In this manuscript, all the proofs of the minimax lower bounds rely on Fano’s method. The following version of Fano’s lemma is borrowed from [32]:
Lemma 2.
[32, Theorem 2.7] Consider a parametric model , with and a metric on . Assume that contains elements , , such that for all with
- (i)
- (ii)
$\mathcal{KL}(\mathbb{P}_{\theta_{j}},\mathbb{P}_{\theta_{k}})\leq\log(M)/32$.
Then, we have
[TABLE]
where the constant is numeric.
A.4 Khintchine’s inequality
Next, we state a particular case of Khintchine’s inequality that turns out to be useful for bounding the cut norm of step kernels in terms of their norm:
Lemma 3.
[30] Let be i.i.d. Rademacher random variables and let be some real numbers. Then,
[TABLE]
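A lower bound of Khintchine type can be checked exactly for short coefficient vectors by enumerating all sign patterns. The sketch below assumes that the inequality is the classical bound E|sum a_i eps_i| >= (sum a_i^2)^{1/2} / sqrt(2) with its sharp constant; this form and the function names are assumptions, since the display above may be normalized differently.

```python
from itertools import product
from math import sqrt


def expected_abs_rademacher_sum(coeffs):
    """Exact E|sum_i a_i eps_i| over all 2^n equally likely sign patterns."""
    n = len(coeffs)
    total = sum(
        abs(sum(s * a for s, a in zip(signs, coeffs)))
        for signs in product((-1, 1), repeat=n)
    )
    return total / 2**n


def khintchine_lower_bound(coeffs):
    """Euclidean norm of the coefficients divided by sqrt(2)."""
    return sqrt(sum(a * a for a in coeffs)) / sqrt(2)


for coeffs in ([1, 1], [3, -1, 2], [0.5, 0.5, 0.5, 0.5]):
    print(coeffs, expected_abs_rademacher_sum(coeffs), khintchine_lower_bound(coeffs))
```

The vector `[1, 1]` attains equality, which shows that the constant cannot be improved in this form.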
We use this result to prove the following lower bound on the cut norm of step kernels:
Lemma 4.
Let denote a measurable –step function. Then,
[TABLE]
Proof of Lemma 4.
There exist partitions and such that, for any fixed , is constant over for all and, for any fixed , is constant over for all . For any (resp. ), denote (resp. ) any element of (resp. ). By definition of ,
[TABLE]
where we used in the last line that the value of the sum only depends on and through the quantities and . Since the maximum of a linear function on a convex set is achieved at an extremal point, it follows that
[TABLE]
where we use (8) and take . Let denote i.i.d. Rademacher random variables and let denote the expectation with respect to . Now, Khintchine’s inequality (36) and the Cauchy-Schwarz inequality imply
[TABLE]
∎
Appendix B Proof of Proposition 2
Since the diagonals of and are both zero, it suffices to control the supremum over disjoint subsets and (see, e.g., [8])
[TABLE]
Let and be any two disjoint subsets of . Using Bernstein’s inequality (35) we have that
[TABLE]
Now, using that the number of disjoint pairs is and the union bound, we get that the probability that exceeds for some is bounded by . Hence, we have
[TABLE]
with probability . Now bounding the distance by in the exceptional case we get the statement of Proposition 2.
Appendix C Proof of Proposition 3
Fix . This proof is based on Fano’s method. To apply Fano’s Lemma (Lemma 2), it is enough to check that there exists a finite subset of such that for any two distinct in we have
- (a)
and
- (b)
for some constants . Then, applying Lemma 2 to leads to the desired result. It remains to prove the existence of . As is classical for this kind of proof, we first build a collection and then extract a maximal subset satisfying (a). Then, we control the Kullback-Leibler divergence between any two probability distributions to show (b).
Construction of . Fix . For any , define by where . In other words, the entries are equal to if and if . Obviously, the collection is included in .
Computation of the cut distances and extraction of a maximal subset. Given , denote the set of indices corresponding to and its complement. Then, given two vectors and , we define and . We easily obtain
[TABLE]
By symmetry, we derive that
[TABLE]
where is the symmetric difference of and . As a consequence, the cut distance between any two graphons is large as long as the symmetric difference between and is both bounded away from zero and from .
By the Varshamov-Gilbert combinatorial bound (see, e.g., [32, Lemma 2.9]), we can in fact pick satisfying
[TABLE]
with for some . In the sequel, we consider . Hence, we have , whereas the previous inequalities ensure that
[TABLE]
which proves (a) when one takes as defined in (38) below.
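The extraction step can be illustrated on a toy scale: the sketch below greedily collects binary vectors whose pairwise Hamming distances stay within a two-sided window, mirroring the requirement that the symmetric difference be bounded away both from zero and from its maximal value. The thresholds, the vector length and the function names are illustrative assumptions, not the constants of the proof, and a greedy search is not claimed to reach the cardinality guaranteed by the Varshamov-Gilbert bound.

```python
from itertools import product


def hamming(u, v):
    """Number of coordinates where the binary vectors u and v differ."""
    return sum(a != b for a, b in zip(u, v))


def greedy_packing(m, low, high):
    """Greedily collect binary vectors of length m with pairwise distance in [low, high]."""
    chosen = []
    for candidate in product((0, 1), repeat=m):
        if all(low <= hamming(candidate, c) <= high for c in chosen):
            chosen.append(candidate)
    return chosen


packing = greedy_packing(m=8, low=2, high=6)
print(len(packing))
```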
Control of the Kullback Divergence. To prove (b) we use the definition of Kullback-Leibler divergence and for to get
[TABLE]
Now, and imply
[TABLE]
Taking
[TABLE]
with a constant small enough, we derive from the lower bound that
[TABLE]
which proves (b).
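The type of bound used for (b) can be checked numerically for a single edge. Assuming the elementary inequality invoked is log(1+x) <= x, the Kullback-Leibler divergence between two Bernoulli distributions with parameters p and q is at most (p-q)^2 / (q(1-q)). The sketch below, whose function names are hypothetical, compares both sides.

```python
from math import log


def kl_bernoulli(p, q):
    """Kullback-Leibler divergence KL(Bernoulli(p) || Bernoulli(q)) for 0 < q < 1."""
    total = 0.0
    if p > 0.0:
        total += p * log(p / q)
    if p < 1.0:
        total += (1.0 - p) * log((1.0 - p) / (1.0 - q))
    return total


def chi_square_bound(p, q):
    """Upper bound (p - q)^2 / (q (1 - q)) obtained from log(1 + x) <= x."""
    return (p - q) ** 2 / (q * (1.0 - q))


for p, q in ((0.10, 0.12), (0.30, 0.25), (0.50, 0.45)):
    print(p, q, kl_bernoulli(p, q), chi_square_bound(p, q))
```

For independent edges the divergences add up over the pairs of vertices, so a per-edge bound of this kind translates directly into a bound for the whole adjacency matrix.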
Appendix D Proof of Proposition 4
Set . We have the following simple proposition (see Theorem 5 in [25])
Proposition 9.
If , then
[TABLE]
In view of Proposition 9 we need to estimate with high probability in order to specify the value of the regularization parameter . Let be such that for and for . Then . We can upper bound using the following bound on the spectral norm of random matrices from [3]:
Proposition 10.
Let be the rectangular matrix whose entries are independent centered random variables bounded (in absolute value) by some . Then, for any there exists a universal constant such that, for every
[TABLE]
where we have defined
[TABLE]
For , we have , , and . Taking and in Proposition 10, we obtain that there exist absolute constants such that
[TABLE]
with probability at least . Since , we can take where so that . Then, Proposition 9 implies
[TABLE]
It is easy to see that the cut-norm of a matrix can be bounded by its spectral norm:
[TABLE]
Bound on the cut-norm (15) then follows from
[TABLE]
In order to prove the Frobenius bound (14), we use the argument from [25]: we can equivalently write the singular value hard thresholding estimator as the solution to the following optimization problem:
[TABLE]
which implies that, with probability larger than ,
[TABLE]
where we used in the last line that . Since , we have proved (14).
Appendix E Proof of Theorem 1
Note that both and are proportional to , so without loss of generality we can assume that . For , the result is a straightforward consequence of the second Sampling Lemma for Graphons of [28] stated in Proposition 1. Given any graphon , one can always divide some of the steps into smaller steps in such a way that is a –step graphon whose weights are all less than or equal to . Thus, we only need to prove the results for all graphons with and such that their weights are all smaller than or equal to .
Let be the matrix with entries for all . As opposed to , the diagonal entries of are not constrained to be null. By the triangle inequality, we have
[TABLE]
As the entries of coincide with those of outside the diagonal, the difference is null outside of a set of measure . Since , . Thus, we only need to prove that
[TABLE]
We first need to build two suitable representations of and in the quotient space .
As a first idea, one may want to define a representation of that matches on the largest possible (with respect to the Lebesgue measure) Borel set. In fact, one can match the two representations everywhere except on a Borel set of measure of the order of . This turns out to lead to a suboptimal bound of the order of . In order to recover the correct logarithmic term, we refine the argument by showing that, for a suitable representation, the difference , when non-zero, is well approximated in cut distance by a -step function which is zero except on a Borel set of measure much smaller than . To prepare the proof, we carefully build the representations of and .
Step 1: Construction of a suitable representation of in .
In the sequel, we denote . Here, we want to choose in such a way that a distortion of is well approximated in the cut norm by a –step kernel. We use the following lemma which is based on a variation of Szemerédi’s lemma. Let and be associated to as in definition (18).
Lemma 5.
There exist a permutation of and a partition of made of successive intervals such that the following holds. Let be the matrix obtained from by jointly applying the permutation to its rows and its columns. Denote by , and for , . There are two matrices and that are -block-constant according to the partition and that satisfy
[TABLE]
According to Lemma 5, there exist two -block constant matrices and that approximate well with respect to some weighted cut norm. As for (42), the weights are respectively and whereas for (43), the weights are and . Informally, these weights arise for the following reason: writing as the empirical weight of group in (see Step 2 for the definition), we have .
Invoking Lemma 5, we consider the graphons
[TABLE]
Obviously, is weakly isomorphic to .
Step 2: Construction of a suitable representation of in the quotient space .
Recall that are the i.i.d. uniformly distributed random variables in the -random graph model (1) and that is defined in the previous step. For , let
[TABLE]
be the (unobserved) empirical frequency of the group corresponding to a finer partition of given by . For , let
[TABLE]
be the (unobserved) empirical frequency of the group corresponding to a coarser partition of given by .
The relations imply
[TABLE]
Consider a function such that:
- (i)
For all , ,
- (ii)
for all , $\lambda\big[\{x:\ \psi(x)\in P_{l}\text{ and }\phi(x)\in P_{l}\}\big]=\omega_{l}\wedge\widehat{\omega}_{l}$,
- (iii)
for all , .
Such a function exists. To see it, we first construct to satisfy (i) and (iii):
- •
For each such that , conditions (i) and (iii) are trivially satisfied if we take to be any subset of of Lebesgue measure . Then, there is a subset of of Lebesgue measure left non-assigned. Summing over all such , we see that there is a union of subsets with Lebesgue measure left non-assigned.
- •
For such that , we must have for to satisfy (i). On the other hand, to meet condition (iii) we need additionally to assign for on a set of Lebesgue measure . Summing over all such , we need additionally to find a set of Lebesgue measure to make such assignments. But this set is readily available as the union of non-assigned intervals for all such that since by virtue of (45).
Now, to ensure that condition (ii) is satisfied, we assign as a priority to values belonging to the same partition element as . Again, (45) ensures that this is possible.
Finally, define the graphons , , and where , , and are as in (44). Notice that in view of (iii) is weakly isomorphic to the empirical graphon . Let . Since and match on , the purpose of (i) is to minimize the Lebesgue measure of the support of . With properties (i) and (iii) alone, it would be possible to prove that as the Lebesgue measure of its support is at most of order . We will improve this rate by a logarithmic term as (ii) will enforce that the cut norm of is much smaller than its Lebesgue measure.
Step 3: Control of the cut norm. Since is a metric on the quotient space ,
[TABLE]
By definition of , the two functions and are equal except possibly when either or belongs to . As a consequence of the triangle inequality and of the symmetry of , we get
[TABLE]
First, we focus on , the second term being handled similarly at the end of the proof. For and in , we write (resp. ) when and belong (resp. do not belong) to the same element of the partition . Define
[TABLE]
Obviously, we have . Property (ii) of implies that . We shall rely on the decomposition and . For any , we have by definition (44) of that . Together with the triangle inequality, this yields
[TABLE]
To control the first expression on the right-hand side, we simply bound the cut norm of the difference by its norm
[TABLE]
since and take values in . Then, relying on the fact that is distributed as a Binomial random variable with parameters and on Cauchy-Schwarz inequality, we get and
[TABLE]
where we used again Cauchy-Schwarz in the last line. Let us turn to the second and third expressions in (47). To this end, we introduce a new kernel function . For , define and the functions and by
[TABLE]
For any , set and let be a step kernel on defined by
[TABLE]
By definition of and of the function , we have that for any , and . As a consequence, the restriction of to is, up to a measure preserving bijection of its rows and of its columns, equal to the restriction of to the set . This entails that
[TABLE]
On the other hand, for any ,
[TABLE]
by the definition of . In view of the definition of , for any we have . As a consequence, the restriction of to is, up to a measure preserving bijection of its rows and of its columns, equal to the restriction of to the set . This implies that . Thus, we only have to control .
Step 4: Control of . Define the sets and . Then, the cut norm of writes as
[TABLE]
since the supremum of a linear function on a convex set is achieved at an extremal point. The random variable is in expectation of the order . If we could replace each by in (51), then thanks to (42), we could prove that is (up to a multiplicative constant) less than . Unfortunately, if we directly applied Bernstein’s inequality or the bounded difference inequality to simultaneously control over all or to simultaneously control over all , we would lose at least a logarithmic factor.
To bypass this issue, we adapt Lemma 10.9 of [28], which is a key point in the proof of sampling Lemma for graphons (Lemma 10.5 in [28]). Given a bounded non-symmetric kernel , let us define the following one-side version of the cut norm:
[TABLE]
where we take the supremum without any absolute value. As a consequence, the cut norm is the maximum of and .
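The relation between the one-sided and the usual cut norm can be verified by brute force on a small matrix. In the sketch below the norms are left unnormalized (plain sums over index rectangles), which for a step kernel with equal steps differs from the kernel norm only by a constant factor; the function names and the test matrix are illustrative assumptions.

```python
from itertools import product


def subsets(n):
    """All subsets of range(n), encoded as 0/1 indicator tuples."""
    return product((0, 1), repeat=n)


def rectangle_sum(q, rows, cols):
    """Sum of q over the index rectangle selected by the two indicator tuples."""
    return sum(
        q[i][j] for i, r in enumerate(rows) if r for j, c in enumerate(cols) if c
    )


def one_sided_cut(q):
    """Maximum over S, T of the sum of q over S x T, without absolute value."""
    return max(
        rectangle_sum(q, s, t) for s in subsets(len(q)) for t in subsets(len(q[0]))
    )


def two_sided_cut(q):
    """Maximum over S, T of the absolute value of the sum of q over S x T."""
    return max(
        abs(rectangle_sum(q, s, t))
        for s in subsets(len(q))
        for t in subsets(len(q[0]))
    )


def negate(q):
    return [[-x for x in row] for row in q]


q = [[2, -3, 1], [-1, 4, -2], [0, -5, 3]]
print(two_sided_cut(q), one_sided_cut(q), one_sided_cut(negate(q)))
```

Working with the one-sided version removes the absolute value, so that the supremum becomes a maximum of linear functionals of the kernel.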
Lemma 6.
Let and let , and be associated to as in (33). For , define and . Given any subset , let
[TABLE]
Finally, we define for any , Then, for any integer with , we have
[TABLE]
Note that in contrast to Equation (51) where one considers a supremum of sums, only terms are involved in (53) up to the price of an additive term of order . The difficulty is that we will apply this lemma to for which these will turn out to be random.
In the sequel, we fix and apply Lemma 6 to . Then, we can take . Since and since we assumed at the beginning of the proof that the weights are all smaller than , it follows that . Let and denote the random variables and . Both and are functions of the independent random variables . Besides, if we change the value of one of these , the value of changes by at most and the value of changes by at most . As a consequence, we may apply the bounded difference inequality (Lemma 1) to these two random variables. Then, with probability larger than , one has
[TABLE]
In (54) - (55) we bound the expectation using that, since are i.i.d. uniformly distributed random variables, has a binomial distribution with parameters (, ) and the Cauchy-Schwarz inequality:
[TABLE]
[TABLE]
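The Cauchy-Schwarz step for binomial frequencies can be checked exactly: if n times the empirical frequency follows a binomial distribution with parameters n and omega, then the mean absolute deviation of the frequency is at most the standard deviation sqrt(omega(1-omega)/n). The sketch below uses hypothetical names and generic parameters rather than the specific weights of the proof.

```python
from math import comb, sqrt


def mean_abs_deviation(n, omega):
    """Exact E|omega_hat - omega| where n * omega_hat ~ Binomial(n, omega)."""
    return sum(
        comb(n, k) * omega**k * (1 - omega) ** (n - k) * abs(k / n - omega)
        for k in range(n + 1)
    )


def standard_deviation(n, omega):
    """Cauchy-Schwarz upper bound sqrt(Var(omega_hat)) = sqrt(omega (1 - omega) / n)."""
    return sqrt(omega * (1 - omega) / n)


for n, omega in ((20, 0.05), (20, 0.5), (200, 0.2)):
    print(n, omega, mean_abs_deviation(n, omega), standard_deviation(n, omega))
```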
Bound (55) and imply that, for , with probability larger than ,
[TABLE]
Fix any two subsets of size less than or equal to . In view of (53), one needs to control the following random variable
[TABLE]
This is done in the following lemma:
Lemma 7.
Let be two subsets of of size less than or equal to and given by (57). Then, we have that with probability larger than ,
[TABLE]
Now, it follows from Lemma 6 together with (56) and Lemma 7 that, with probability larger than ,
[TABLE]
Controlling analogously , we conclude that there exists an event of probability larger than such that, on ,
[TABLE]
To finish the control of , we use the rough bound on the complementary event .
[TABLE]
where we use (54). Now, using the decomposition (47), (48) and (50), we can conclude that
[TABLE]
The following lemma gives a corresponding bound on the second term in (46). The proof is somewhat analogous to that of the control of and is postponed to the end of the section.
Lemma 8.
We have
[TABLE]
In view of (46), we have proved Theorem 1.∎
Proof of Lemma 5.
For , we denote and . For any , define the cumulative distribution functions and . For , let and . In order to construct a suitable -step kernel we consider first the (not necessarily symmetric) kernels and defined by
[TABLE]
In comparison to , the length of the steps in and has been modified.
Lemma 9.
Let be a k-step kernel defined by
[TABLE]
where and and are two partitions of into a finite number of measurable sets. For any integer , there exists a –step kernel satisfying
- (i)
for any , is constant on and
- (ii)
.
The second property (ii) is just a consequence of the weak Regularity Lemma for kernels [19] (see also Corollary 9.13 in [28]). The first property, (i), follows from the explicit construction of the approximate kernel by Kannan and Frieze (see the proof of Lemma 9.10 in [28]). For the sake of completeness, we give the details at the end of this section.
Fix . Note that since we assume that . We denote by and the –step kernels given by Lemma 9 to respectively approximate and . By virtue of Property , there exist two matrices and in such that
[TABLE]
There exist two partitions and of such that is block constant according to and is block constant according to . Let be the coarsest partition that refines both and . As a consequence, is made of fewer than subsets. By possibly refining , we may assume without loss of generality that is made of exactly elements. Let be a permutation of transforming into a partition with made of consecutive intervals. Denoting the corresponding permutation matrix, we finally take
[TABLE]
Now we are ready to prove (42) and (43). Recall that we denote and for . Define the sets and . Since is a –step function, its cut norm writes as
[TABLE]
since the supremum is achieved at an extremal point of the convex set and, in the last inequality, we use property (ii) of Lemma 9. Now (59) and the definition of imply
[TABLE]
by the Cauchy-Schwarz inequality. We have proved (42). The second inequality (43) is derived similarly.
∎
Proof of Lemma 9.
We adapt the proof of the weak Regularity Lemma for symmetric kernels [28, Lemma 9.9] to non-symmetric ones. We use the following extension of Lemma 9.11(a) in [28].
Lemma 10.
For every such that
[TABLE]
where and are two partitions of into a finite number of measurable sets, there are two sets and a real number such that, for ,
[TABLE]
Now we apply Lemma 10 repeatedly, to get pairs of sets and real numbers such that for any positive integer , we have
[TABLE]
Fix some integer . Since the right-hand side of the above equation remains non-negative, there exists with . Now putting for we get that for any and any there are pairs of subsets and real numbers such that
[TABLE]
Note that the approximation is a step function with at most steps and , for all . On the other hand, by construction we have that for any , is constant on all sets of the form . We conclude by taking . ∎
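The iterative construction can be sketched on a small matrix viewed as a step kernel with equal steps. The sketch assumes that each step has the form of Lemma 9.11(a) in [28], namely that subtracting the average over the best rectangle decreases the squared L2 norm by at least the squared cut norm of the current residual. It only illustrates the energy argument behind property (ii), not the structural property (i); the function names are hypothetical, and exact rational arithmetic is used so that the inequalities can be checked without rounding.

```python
from fractions import Fraction
from itertools import product


def best_rectangle(q):
    """Return (total, rows, cols) maximizing |sum of q over rows x cols| by brute force."""
    n, m = len(q), len(q[0])
    best = (Fraction(0), (0,) * n, (0,) * m)
    for rows in product((0, 1), repeat=n):
        col_sums = [
            sum((q[i][j] for i in range(n) if rows[i]), Fraction(0)) for j in range(m)
        ]
        for sign in (1, -1):
            # For fixed rows, the best columns are all positive or all negative sums.
            cols = tuple(1 if sign * c > 0 else 0 for c in col_sums)
            total = sum((c for c, keep in zip(col_sums, cols) if keep), Fraction(0))
            if abs(total) > abs(best[0]):
                best = (total, rows, cols)
    return best


def squared_l2(q):
    """Squared L2 norm of the step kernel with equal steps and values q."""
    return sum((x * x for row in q for x in row), Fraction(0)) / (len(q) * len(q[0]))


def weak_regularity_steps(q, k):
    """Subtract up to k rectangle averages; record (squared L2, cut norm) before each step."""
    n, m = len(q), len(q[0])
    residual = [[Fraction(x) for x in row] for row in q]
    history = []
    for _ in range(k):
        total, rows, cols = best_rectangle(residual)
        history.append((squared_l2(residual), abs(total) / (n * m)))
        area = sum(rows) * sum(cols)
        if area == 0:
            break  # every rectangle sum vanishes, so the residual is zero
        average = total / area
        residual = [
            [residual[i][j] - average if rows[i] and cols[j] else residual[i][j]
             for j in range(m)]
            for i in range(n)
        ]
    return history, squared_l2(residual)


history, final_energy = weak_regularity_steps([[3, -1, 0], [-2, 4, 1], [0, 1, -5]], k=4)
```

Since the squared L2 norm stays non-negative while decreasing by at least the squared cut norm at each step, at least one of the first k residuals has squared cut norm at most the initial squared L2 norm divided by k.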
Proof of Lemma 10.
This lemma is proved in [28, Lemma 9.11] for symmetric kernels. For the reader’s convenience, we give the details here. Let be a –step kernel and let be two measurable partitions of such that is constant on each set . Relying on a convexity argument as in the proof of Lemma 5, the cut norm is achieved for measurable sets and that are unions of and respectively, that is
[TABLE]
where and with , . Let . Then, we have
[TABLE]
which completes the proof. ∎
Proof of Lemma 6.
This proof closely follows that of Lemma 10.9 in [28]. It is easy to see that
[TABLE]
so we only need to bound these expressions. Let and be independent uniformly chosen -subsets of and let (resp. ) denote the expectation with respect to (resp. ). We shall prove that, for any
[TABLE]
By symmetry, this will imply
[TABLE]
so that gathering both inequalities yields
[TABLE]
Since the above expectation is less than or equal to , this will conclude the proof. Thus, we only have to show (62). Note that implies that it suffices to prove
[TABLE]
Let us denote the above difference of expectations. For any , write and . By the definition (52), we have that is non-negative for and if . In the same way, for and for . Denoting the probability with respect to , we obtain
[TABLE]
Now, using , it follows from the Chebyshev inequality that, for , we have . Since a probability is smaller than or equal to one, it follows that . Similarly, for we also have that . Coming back to , this yields
[TABLE]
Working out the variance, we get , which concludes the proof. ∎
Proof of Lemma 7.
Note that in (57), the definition of , the set is deterministic whereas the set only depends on . We can upper bound in the following way:
[TABLE]
where we use . We set
[TABLE]
Conditionally on , is distributed as a function of i.i.d. random variables such that for any . Besides, if we change the value of one of these , the value of this expression changes by at most . It then follows from the bounded difference inequality (Lemma 1) that, for any
[TABLE]
Let us bound this conditional expectation:
[TABLE]
Now, using Cauchy-Schwarz inequality, we have
[TABLE]
where we used that , and . The supremum in (67) is achieved for subsets () such that for all , is non-negative (otherwise this contradicts the optimality of ). As a consequence, we can plug the upper bounds on into (67):
[TABLE]
where we used the property (42) of . Coming back to (66) and integrating the deviation inequality with respect to , we conclude that, for any
[TABLE]
Fixing and taking a union bound over all possible , , we derive that
[TABLE]
on an event of probability higher than .
Next we bound . Recall that has a binomial distribution with parameters (, ) and . For any , applying Bernstein’s inequality to we get
[TABLE]
Taking (for a suitable constant ) and applying the union bound, we derive that with probability larger than
[TABLE]
The bound (68) together with (69) imply the statement of Lemma 7. ∎
Proof of Lemma 8.
As the control of is quite similar to that of , we only sketch the main steps. Relying on the graphon (defined in (44)), we have the following decomposition:
[TABLE]
Since is zero except if or , we bound the first expression by its norm as for :
[TABLE]
The two last expressions in (70) are bounded by the cut norm of a kernel defined as follows. For any , define where has been defined in (49). Let be the step kernel on given by
[TABLE]
Now, as for the restrictions of and to , we have
[TABLE]
Thus, it boils down to controlling . Since is a –step kernel, its cut norm writes as
[TABLE]
As for the kernel in the main proof, we rely on Lemma 6. The random variables and are controlled as in (54) and (55).
Fix any two subsets of size less than or equal to and define
[TABLE]
The set only depends on and only depends on . We have
[TABLE]
since . We set
[TABLE]
Write and . Conditionally to , is a function of independent random variables. Besides, if we change the values of one of these independent random variables the value of changes by at most . Hence, the bounded difference inequality enforces that, for any ,
[TABLE]
The conditional expectation is upper bounded by
[TABLE]
Here, unfortunately, we cannot directly replace $\mathbb{E}\big[|\widehat{\lambda}_{a}-\lambda_{a}||\widehat{\lambda}_{b}-\lambda_{b}|\,\big|\,\widehat{\lambda}_{\{R\}}\big]$ by an upper bound of it because this expression does not factorize. We shall prove that $\mathbb{E}\big[|\widehat{\lambda}_{a}-\lambda_{a}||\widehat{\lambda}_{b}-\lambda_{b}|\,\big|\,\widehat{\lambda}_{\{R\}}\big]$ is, up to a small loss, close to a product of expectations.
Write , and . Note that has a binomial distribution with parameters (, ). Applying Bernstein’s inequality to we get
[TABLE]
Let . Taking in (75) we have that
[TABLE]
In what follows we assume that the event holds. Take any two distinct elements and of . We shall prove that the conditional expectations $\mathbb{E}\big[|\widehat{\lambda}_{a}-\lambda_{a}||\widehat{\lambda}_{b}-\lambda_{b}|\,\big|\,\widehat{\lambda}_{\{R\}}\big]$ are close to the products $\mathbb{E}\big[|\widehat{\lambda}_{a}-\lambda_{a}|\,\big|\,\widehat{\lambda}_{\{R\}}\big]\,\mathbb{E}\big[|\widehat{\lambda}_{b}-\lambda_{b}|\,\big|\,\widehat{\lambda}_{\{R\}}\big]$. It is easy to see that conditionally on , follows the Binomial distribution with parameters . On the other hand, conditionally on , follows the Binomial distribution with parameters . Let be a sequence of independent Bernoulli random variables with parameters , be an independent sequence of Bernoulli random variables with parameters and be an independent sequence of Bernoulli random variables with parameters . We define the following random variables:
[TABLE]
where we use and . It is easy to see that follows the Binomial distribution with parameters and and follows the Binomial distribution with parameters and . Hence, we have that
[TABLE]
Relying on our coupling between and , we obtain
[TABLE]
On the other hand, conditionally on , follows the Binomial distribution with parameters so that Cauchy-Schwarz inequality implies
[TABLE]
where we use that and the definition of the event . Similarly we compute
[TABLE]
Plugging (77 – 79) into (76) we get
[TABLE]
where we use . For , (78) implies that the above difference is of order . Going back to (74), we obtain that
[TABLE]
Let and be two sets maximizing the above expression. Then, for all we have that $\sum_{b\in T^{*}}\mathbb{E}\big[|\widehat{\lambda}_{b}-\lambda_{b}|\,\big|\,\widehat{\lambda}_{R}\big](\boldsymbol{Q}_{ab}-\boldsymbol{Q}^{ad,+}_{ab})$ is non-negative. As a consequence, using (78), we have that
[TABLE]
as soon as the event holds. The same reasoning and leads to
[TABLE]
as soon as the event holds. Going back to (73) and integrating the deviation inequality with respect to , we conclude that
[TABLE]
where we use . From this point on, the proof is identical to the main proof: we fix and take a union bound over all possible and to derive that
[TABLE]
on an event of probability higher than . Then, as in the main proof, Lemma 6 together with (56) and (69) enforce that with probability larger than . By symmetry, we can find an event of probability larger than such that, on ,
[TABLE]
In order to control on the complementary event we use the rough bound
[TABLE]
which implies
[TABLE]
where we use (54). Together with the decomposition (70), (71) and (72), we conclude that
[TABLE]
∎
Appendix F Proof of Theorem 2
It is enough to prove separately the following two minimax lower bounds:
[TABLE]
The proof of (81) is identical to the proof of (45) in [26] so we just sketch the main idea. Fix some . We consider to be the constant graphon with , and to be the –step graphon with if and elsewhere. Obviously, we have . Then, standard testing arguments [32] ensure that the minimax risk is at least of the order when is chosen small enough so that the -distance is smaller than . According to Lemma 4.9 in [26], this is the case when is small in front of which proves (81).
Henceforth, we only focus on (80). We first consider the case of multiple of and such that and for some sufficiently large numerical constants and . As the collections are nested this will imply (80) for all . Afterwards, it will suffice to show (80) for to prove the proposition. So, we assume that is a multiple of 32, is large enough and that is small in front of . Define , , and .
As for Proposition 3, we will rely on Fano’s method (Lemma 2). Hence, we shall build a collection of graphons that are well-spaced in cut distance and such that the Kullback-Leibler divergence between the associated distributions remains small enough. All the graphons considered in this collection will be based on a matrix such that (i) the rows of are almost orthogonal and (ii) the distances between permutations and convex combinations of the columns of are bounded from below. Such a property will turn out to be useful when taking a lower bound on the distance between the corresponding graphons.
Lemma 11.
For large enough, there exists a matrix satisfying the following two properties:
- (i)
For any with , the inner product of two columns satisfies
[TABLE]
- (ii)
For any two subsets and of satisfying and , any labellings and , any subset of of size larger than and any stochastic matrix , we have
[TABLE]
for some universal constant .
Taking as in Lemma 11, we define the connection probability matrix where is the matrix with all entries equal to 1. Now we define a collection of step graphons based on that will only slightly differ by the weight of each step.
Fix some and denote by the collection of vectors satisfying . For any , define the cumulative distribution on by the relations and for and the cumulative distribution on by and . Note that takes values in and takes values in . Then, set and define the graphon by
[TABLE]
See Figure (1) for a drawing of . Note that is a fairly unbalanced –step graphon: of its steps have a large weight of order . Besides, the smaller steps are slightly unbalanced as the weight of each class is either or . The purpose of these big steps is to make the cut distances between and the largest possible (see the proof of Lemma 13).
Next, we shall consider a subcollection of such that the graphons with are well spaced. The following combinatorial result is in the spirit of the Varshamov-Gilbert lemma [32, Lemma 2.9]. It is borrowed from [26] (Lemma 4.4). For , let . Notice that, by definition of , we have for all .
Lemma 12.
There exists a subset of such that and
[TABLE]
for any .
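The classical Varshamov-Gilbert argument behind results of this type can be illustrated with a greedy packing of the hypercube. The sketch below is not the construction of Lemma 12 itself, which works on a constrained set of weight vectors. It only shows, under illustrative sizes `m = 12` and `d = 4` that are assumptions of this example, how a greedy scan produces a large family of binary vectors that are pairwise far apart in Hamming distance. The helper names `hamming` and `greedy_packing` are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Hamming distance between two bit vectors encoded as integers."""
    return bin(a ^ b).count("1")


def greedy_packing(m, d):
    """Scan {0,1}^m in increasing order and keep every vector that is at
    Hamming distance at least d from all vectors kept so far."""
    kept = []
    for x in range(1 << m):
        if all(hamming(x, y) >= d for y in kept):
            kept.append(x)
    return kept


# Illustrative sizes (assumptions, not taken from the paper).
m, d = 12, 4
code = greedy_packing(m, d)
min_dist = min(hamming(a, b) for a, b in combinations(code, 2))

# Gilbert's covering argument: balls of radius d - 1 around the kept vectors
# cover the cube, so the family has at least 2^m / |ball of radius d - 1| elements.
ball = sum(len(list(combinations(range(m), r))) for r in range(d))
print(len(code), min_dist, (1 << m) / ball)
```

The greedy family is maximal, so the covering bound printed last is a guaranteed lower bound on its size. This is the same counting idea that yields exponentially many well-separated hypotheses for Fano’s method.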
Lemmas 11 and 12 are used to obtain the following lower bound on the distance between two distinct graphons with and in . This lemma is the main ingredient of the proof.
Lemma 13.
There exist two positive universal constants and such that if then, for any with , we have
[TABLE]
which implies
[TABLE]
Note that for any and in it is possible to build a measure-preserving transformation such that is null except on a measurable set of Lebesgue measure of order (see the proof of Theorem 1 in Section E for such a construction). Hence, the norm of is of order . Lemma 13 states that, by taking the infimum over all and by considering the weaker norm , one still has a lower bound of the same order. The factor arises as a consequence of Lemma 4. See the proof for more details.
To apply Fano’s method, we need to upper bound the Kullback-Leibler divergence between the distributions corresponding to any two graphons and with and in . Let denote the distribution of sampled according to the sparse -random graph model (1) with . Since the matrix is fixed, the difficulty in distinguishing between the distributions and for comes from the randomness of the design points in the -random graph model (1) rather than from the randomness of the realization of the adjacency matrix conditionally on . The following lemma gives an upper bound on the Kullback-Leibler divergences :
Lemma 14.
For all we have
[TABLE]
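To make concrete the remark that the two hypotheses differ only through the design points, here is a minimal sampler for a step-graphon random graph. It is a sketch under stated assumptions: latent positions are i.i.d. uniform on [0, 1], each pair is connected independently with probability `rho * Q[a][b]` given the step labels, and the diagonal is null. The block matrix `Q`, the step weights, the sparsity `rho` and the function name `sample_w_random_graph` are all hypothetical choices for illustration and are not the quantities used in the proof.

```python
import random


def sample_w_random_graph(n, rho, weights, Q, rng):
    """Sample (A, z): z_i is the step containing xi_i ~ U[0,1], and
    A_ij ~ Bernoulli(rho * Q[z_i][z_j]) independently for i < j."""
    # Cumulative step boundaries on [0, 1]; the step widths are the class weights.
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w
        bounds.append(acc)

    def label(x):
        for a, b in enumerate(bounds):
            if x < b:
                return a
        return len(bounds) - 1  # guard against rounding at the right end

    xi = [rng.random() for _ in range(n)]
    z = [label(x) for x in xi]
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < rho * Q[z[i]][z[j]]:
                A[i][j] = A[j][i] = 1
    return A, z


# Same block matrix Q, slightly different step weights (hypothetical values).
Q = [[0.75, 0.25], [0.25, 0.75]]
rng = random.Random(1)
A_u, z_u = sample_w_random_graph(50, 0.5, [0.5, 0.5], Q, rng)
A_v, z_v = sample_w_random_graph(50, 0.5, [0.6, 0.4], Q, rng)
print(sum(z_u), sum(z_v))  # number of nodes falling in the second step
```

Conditionally on the labels, both samples use exactly the same edge probabilities. Only the law of the labels changes between the two calls, which is why the divergence in Lemma 14 reduces to a divergence between discrete label distributions.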
Now, choose such that . When is small in front of , this choice of satisfies the conditions of Lemma 13. Then it follows from Lemmas 12 and 14 that
[TABLE]
In view of Fano’s Lemma (Lemma 2), inequalities (86) and (87) imply that
[TABLE]
where is an absolute constant. This completes the proof for large enough.
Now we turn to the case . We reduce the lower bound to the problem of testing two hypotheses. Consider the matrix $\boldsymbol{B}=\begin{pmatrix}1&1\\ 1&-1\end{pmatrix}$. Given define , and . Then, we set for any and define graphons
[TABLE]
For any measure preserving bijection , is a four-step graphon. Thanks to Lemma 4, we deduce that . Then, it is not hard to see that so that . Moreover, exactly as in Lemma 14, the Kullback-Leibler divergence between and is bounded by . Taking of the order , this divergence is small. It is therefore impossible to reliably distinguish from and the estimation error is at least of order . More formally, we use Theorem 2.2 from [32] to conclude that
[TABLE]
where is an absolute constant.
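The gap between the cut norm and the ℓ1 norm can be checked by brute force on the sign pattern of the matrix B above. The sketch below assumes the usual normalized definitions for a matrix viewed as a step kernel with equal-width steps: the cut norm is the maximum over row subsets S and column subsets T of the absolute block sum, divided by the number of entries, and the ℓ1 norm is the average absolute entry. The function names `cut_norm` and `l1_norm` are hypothetical helpers.

```python
from itertools import product


def cut_norm(M):
    """Normalized cut norm: max over row subsets S and column subsets T of
    |sum of M over S x T|, divided by the number of entries."""
    r, c = len(M), len(M[0])
    best = 0.0
    for S in product([0, 1], repeat=r):
        # Column sums restricted to the rows selected by S.
        col = [sum(M[i][j] for i in range(r) if S[i]) for j in range(c)]
        # For fixed S, the best T keeps all positive (or all negative) column sums.
        pos = sum(v for v in col if v > 0)
        neg = -sum(v for v in col if v < 0)
        best = max(best, pos, neg)
    return best / (r * c)


def l1_norm(M):
    """Normalized l1 norm: average absolute entry."""
    return sum(abs(v) for row in M for v in row) / (len(M) * len(M[0]))


B = [[1, 1], [1, -1]]
print(cut_norm(B), l1_norm(B))  # prints: 0.5 1.0
```

On this example the cut norm is half the ℓ1 norm. For kernels with a bounded number of steps the two norms stay within a constant factor of each other, which is the role played by Lemma 4 in the argument.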
Proof of Lemma 11.
Let be a random matrix whose entries are independent Rademacher variables. We shall prove that, with positive probability, satisfies both (82) and (83). In particular, this implies the existence of satisfying both (82) and (83).
Fix . Then, is distributed as a sum of independent Rademacher variables. Using Hoeffding’s inequality, we have that
[TABLE]
By the union bound, property (82) is satisfied for all with probability greater than . Since , for greater than some absolute constant, this probability is greater than .
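The Hoeffding-plus-union-bound step can be checked numerically. The sketch below draws a square Rademacher matrix and compares the largest inner product between distinct columns with a threshold `t` obtained from Hoeffding’s inequality and a union bound over pairs. The size `k = 128`, the failure probability `delta` and the resulting threshold are assumptions of this example. They are not the constants of (82).

```python
import math
import random


def rademacher_matrix(k, rng):
    """k x k matrix with independent entries uniform on {-1, +1}."""
    return [[rng.choice((-1, 1)) for _ in range(k)] for _ in range(k)]


def max_offdiag_inner_product(B):
    """Largest |<B_a, B_b>| over pairs of distinct columns a < b."""
    k = len(B)
    cols = list(zip(*B))
    best = 0
    for a in range(k):
        for b in range(a + 1, k):
            ip = abs(sum(x * y for x, y in zip(cols[a], cols[b])))
            best = max(best, ip)
    return best


k = 128
delta = 1e-6  # illustrative failure probability (assumption)

# Hoeffding: P(|sum of k Rademacher variables| >= t) <= 2 exp(-t^2 / (2k)).
# Union bound over k(k-1)/2 pairs gives total failure probability
# k(k-1) exp(-t^2 / (2k)), which equals delta for the t below.
t = math.sqrt(2 * k * math.log(k * (k - 1) / delta))

B = rademacher_matrix(k, random.Random(0))
observed = max_offdiag_inner_product(B)
print(observed, round(t, 1), k)
```

The threshold is of order the square root of k log k, well below the trivial bound k, which is the sense in which the columns are almost orthogonal.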
Turning to (83), we first fix , , , , , and . Let
[TABLE]
We have that, conditionally on , stochastically dominates a binomial distribution with parameters and . Then, Hoeffding’s inequality yields
[TABLE]
Given any integer , define the collection of stochastic matrices taking values in the discrete set . Since and , it is easy to see that the cardinality of the set of all possible tuples with is bounded by
[TABLE]
Now, taking the union bound, we derive that, simultaneously for all such parameters,
[TABLE]
with probability greater than . Using Stirling’s approximation
and we get that this probability is larger than for large enough.
Finally, let us consider a general case, when matrix does not necessarily belong to . Observe that in this case, there exists a matrix such that . This implies that
[TABLE]
We have proved that (83) holds with probability larger than . As a consequence, satisfies both (82) and (83) with probability larger than . ∎
Proof of Lemma 13.
We fix and , two different vectors in , and fix , a measure-preserving bijection on . We shall prove that for small enough
[TABLE]
Since $\delta_{\square}\big(W_{u},W_{v}\big)=\inf_{\tau}\|W_{u}(\cdot,\cdot)-W_{v}(\tau\cdot,\tau\cdot)\|_{\square}$, both (85) and (86) follow directly from (88). We denote
[TABLE]
Since is measure-preserving, we have
[TABLE]
Now, we consider three cases: (i) , (ii) and (iii) . In Case (i), we shall focus on the restriction of and on so that these restrictions are –step functions. In Case (ii), we focus on restrictions to , so that is constant on this restriction. In the pathological Case (iii), we introduce a subset such that the restriction of is a –step function and the restriction of is a –step function.
Case (i). We focus our attention on coordinates in . Recall that the cumulative distribution function is defined by and for . For any , define
[TABLE]
In other words, stands for the weight of indices corresponding to class in and class in . By definition of , for any , we have
[TABLE]
Let denote the sets of such that has a large intersection with :
[TABLE]
Denote the complementary set of . We have that for small enough. Hence, it follows that
[TABLE]
which implies that and .
Now, denoting , we define a new kernel by
[TABLE]
We can view as a smoothed version of the restriction of to . The marginal functions are step functions with at most steps of the form . Moreover, on each interval , is equal to the mean of for ranging on this set. Equipped with this notation, we can control the cut distance between and in terms of the distance between the restriction of to and . For ease of notation, we still write for for the restriction of to when there is no ambiguity.
The following lemma provides a lower bound of the cut norm in terms of the norm of .
Lemma 15.
For any , in and any measure-preserving transformation , we have
[TABLE]
where is defined in (93).
In view of Lemma 15 it is enough to control the norm . We can do it in a similar way as it is done in the proof of Lemma 4.5 in [26]. For and any and , the inner product between and satisfies
[TABLE]
where we used (82) in the last line. For any , let denote the Lebesgue measure of the set
[TABLE]
Since is measure preserving, it follows that and . For any , we set
[TABLE]
Equipped with this notation, we have
[TABLE]
Now take any . By (95), and using the triangle inequality, we derive that
[TABLE]
where we used in the last line. As a consequence, for any there exists at most one such that . If such index exists, it is denoted by . Then, it is possible to extend as a function from to . Since , we get
[TABLE]
since . If the sum is greater than , then (88) is satisfied. Thus, we can assume in the sequel that .
Using that and that the cardinality of the collection is we deduce that the collection has cardinality greater than . Now, Lemma 12 implies that for . Then, there exist subsets and of cardinality (recall that such that , , and for all . The condition implies that is injective on . Hence,
[TABLE]
where the second inequality follows from and the fact that and are step functions with steps larger than (see (91), the definition of and ). Finally, we apply the property (83) of to conclude that
[TABLE]
which, together with Lemma 15, proves (88).
Case (ii). Now we assume that . Take and . We have that, on , is constant and equals . Denote the restriction of to . Then, it follows that . The kernel is at most step function. By Lemma 4, we obtain
[TABLE]
where the last equality follows from (90). Using and we obtain (88).
Case (iii). Now we assume that and take and so that . Define the smoothed kernel by
[TABLE]
As a consequence, is block-constant on subsets of the form $\big(\tau^{-1}[G(a-1),G(a))\cap\mathcal{X}\big)\times\big([G(b-1),G(b))\cap\mathcal{Y}\big)$. Arguing as in the proof of Lemma 15, we derive that
[TABLE]
For any such that define the function on by . Arguing as in Case (i), we observe that for any . We have that the kernel is a step function. Hence, there exists a partition of and functions such that . Then, the triangle inequality ensures that, for any and any , we have . As a consequence, for any there exists at most one , which we will denote by , such that . Now we compute
[TABLE]
where we used , , and that is small enough. Together with (96), we obtain the desired result (88). ∎
Proof of Lemma 15.
We first prove that . Fix any measurable subset . Since functions are constant on each set , the supremum is achieved by a subset which is a union of some of , that is . For such , the definition (93) of implies so that
[TABLE]
Taking the supremum over all leads to . By definition of and we have that is a step function. Then, Lemma 4 allows us to conclude
[TABLE]
∎
Proof of Lemma 14.
The proof of Lemma 14 follows the lines of the proof of Lemma 4.3 in [26] and we give it here for completeness. For , let be the vector of i.i.d. random variables with the discrete distribution on defined by
[TABLE]
Let be the symmetric matrix with elements and for . Assume that, conditionally on , the adjacency matrix is sampled according to the network sequence model with such probability matrix . Notice that in this case the observations have the probability distribution . Using this remark and introducing the probabilities and for , we can write the Kullback-Leibler divergence between and in the form
[TABLE]
where the sums in are over and the sum in is over all triangular upper halves of matrices in . Since the function is convex we can apply Jensen’s inequality to get
[TABLE]
where the last equality follows from the fact that are -product probabilities. Using (97) we get
[TABLE]
which is equal to times the Kullback-Leibler divergence between two discrete distributions. Since the Kullback-Leibler divergence is less than the chi-square divergence we obtain
[TABLE]
where in the last inequality we use , and . Combining this with (98) proves the lemma.
∎
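The comparison between the Kullback-Leibler and chi-square divergences used in the proof of Lemma 14 can be verified on a small example. The sketch below uses two discrete distributions `p` and `q` that are hypothetical (a uniform law and a slight perturbation of it) and checks the chain KL(p‖q) ≤ log(1 + χ²(p‖q)) ≤ χ²(p‖q), with χ²(p‖q) defined as the sum of (p_i − q_i)²/q_i.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def chi2(p, q):
    """Chi-square divergence: sum of (p_i - q_i)^2 / q_i."""
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))


# Hypothetical distributions: q is uniform, p perturbs two of the weights.
p = [0.30, 0.20, 0.25, 0.25]
q = [0.25, 0.25, 0.25, 0.25]

kl_pq = kl(p, q)
chi2_pq = chi2(p, q)
print(kl_pq, math.log(1 + chi2_pq), chi2_pq)
```

Because the chi-square divergence is quadratic in the perturbation, a perturbation of the step weights of size ε yields a divergence of order ε² per design point, which is the scaling exploited in the lemma.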
Appendix G Proof of Proposition 7
To prove (25), it is enough to prove separately the following three minimax lower bounds:
[TABLE]
The proof of (99) follows from the proof of (43) in [26] using the trivial inequality
[TABLE]
The proof of (100) follows the lines of the proof of (44) using that for matrices with entries in . The proof of (101) is identical to the proof of (45) in [26].
In order to prove the upper bound (26), the proof of Proposition 3.2 in [26] can be easily modified to get an upper bound on the agnostic error measured in -distance:
Lemma 16 (Agnostic error measured in -distance).
Consider the -random graph model. For all integer , and , we have
[TABLE]
Now (26) follows from Lemma 16 and (16). Finally, the convergence rate is simply achieved by the constant estimator .
Appendix H Proof of Proposition 8
For generated according to the sparse -random graph model (27) with graphon , integrating (9) with respect to and using , we get
[TABLE]
So, using the triangle inequality (20) it is enough to bound the agnostic error . We take (or in the case of graphons) such that
[TABLE]
or for graphons. Without loss of generality we can assume that . Let and be such that for where are the same as for . The triangle inequality implies
[TABLE]
where we use and and that is distributed as under . Similarly for graphons, we obtain . Then, we use the following lemma:
Lemma 17.
- (i)
Consider any and such that . Then
[TABLE]
- (ii)
Consider any and such that . Then,
[TABLE]
Now (28) follows from (i) of Lemma 17 and . The proof of (30) follows the same lines using (ii) of Lemma 17.
To prove (29) and (31) we only need to prove that . Using the definition of (13) we compute
[TABLE]
where we used that and the definition of . This completes the proof of Proposition 8.
Proof of Lemma 17.
Consider the matrix with entries for all . As opposed to , the diagonal entries of are not constrained to be null. By the triangle inequality, we get
[TABLE]
Since the entries of coincide with those of outside the diagonal, the difference is null outside of a set of measure . Also, the entries of are smaller than . It follows that . Since , it suffices to prove that
[TABLE]
Since is a -step function, we can reorganize and in such a way that these two graphons are equal on a set of large Lebesgue measure. More precisely, we adopt the same approach as in the proof of Theorem 1 and we only sketch the argument here. Let and that characterize . For , denote . For any , define the cumulative distribution function and set . For any define . Define . Obviously, is weakly isomorphic to . Now, let be the (unobserved) empirical frequency of group . Consider a function such that:
- (i)
for all and ,
- (ii)
for all .
Such a function exists (for details see Step 2 of the proof of Theorem 1). Finally define the graphon . Notice that is weakly isomorphic to the empirical graphon . Since is a metric on the quotient space of graphons, we have
[TABLE]
The two functions and are equal except possibly when either or belongs to one of the intervals for and we have
[TABLE]
Since are i.i.d. uniformly distributed random variables, has a binomial distribution with parameters (, ). By the Cauchy-Schwarz inequality we get and . Then, we get
[TABLE]
Now for we use for all to get
[TABLE]
since we assume that . For we use the Cauchy-Schwarz inequality:
[TABLE]
since . ∎
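The Cauchy-Schwarz step for binomial counts used in the proof above can be checked exactly. The sketch below computes the mean absolute deviation E|N − np| of a binomial variable N with parameters (n, p) and compares it with the standard deviation sqrt(np(1 − p)), which dominates it by Cauchy-Schwarz. The values `n = 200` and `p = 0.3` are illustrative assumptions, and `binom_mad` is a hypothetical helper name.

```python
import math


def binom_mad(n, p):
    """Exact mean absolute deviation E|N - n p| for N ~ Binomial(n, p)."""
    mean = n * p
    return sum(
        math.comb(n, j) * p**j * (1 - p) ** (n - j) * abs(j - mean)
        for j in range(n + 1)
    )


n, p = 200, 0.3  # illustrative values (assumptions)
mad = binom_mad(n, p)
sd = math.sqrt(n * p * (1 - p))
print(mad, sd)
```

Dividing by n, the empirical frequency of a group of weight p deviates from p by at most sqrt(p/n) in expectation, which is the order of magnitude summed over the groups in the final bounds.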
References
- [1] Edo M. Airoldi, Thiago B. Costa, and Stanley H. Chan. Stochastic blockmodel approximation of a graphon: Theory and consistent estimation. In Advances in Neural Information Processing Systems, pages 692–700, 2013.
- [2] Noga Alon, W. Fernandez de la Vega, Ravi Kannan, and Marek Karpinski. Random sampling and approximation of MAX-CSPs. Journal of Computer and System Sciences, 67(2):212–243, 2003.
- [3] Afonso S. Bandeira and Ramon van Handel. Sharp nonasymptotic bounds on the norm of random matrices with independent entries. Ann. Probab., 44(4):2479–2506, 2016.
- [4] Peter J. Bickel and Aiyou Chen. A nonparametric view of network models and Newman–Girvan and other modularities. Proceedings of the National Academy of Sciences, 106(50):21068–21073, 2009.
- [5] Peter J. Bickel, Aiyou Chen, and Elizaveta Levina. The method of moments and degree distributions for network models. The Annals of Statistics, 39(5):2280–2301, 2011.
- [6] Béla Bollobás, Svante Janson, and Oliver Riordan. The phase transition in inhomogeneous random graphs. Random Structures Algorithms, 31(1):3–122, 2007.
- [7] C. Borgs, J. T. Chayes, H. Cohn, and S. Ganguly. Consistent nonparametric estimation for heavy-tailed sparse graphs. arXiv e-prints, August 2015.
- [8] C. Borgs, J. T. Chayes, L. Lovász, V. T. Sós, and K. Vesztergombi. Convergent sequences of dense graphs. I. Subgraph frequencies, metric properties and testing. Adv. Math., 219(6):1801–1851, 2008.
