Consistency and Asymptotic Normality of Stochastic Block Models Estimators from Sampled Data
Mahendra Mariadassou, Timothée Tabouy

Abstract
Statistical analysis of networks is an active research area, and the literature contains many papers on network models and their statistical analysis. However, very few papers deal with missing data in network analysis, although in practice networks are often observed with missing values. In this paper we focus on the Stochastic Block Model with valued edges and consider an MCAR setting by assuming that every dyad (pair of nodes) is sampled identically and independently of the others with probability . We prove that the maximum likelihood estimators and their variational approximations are consistent and asymptotically normal in the presence of missing data as soon as the sampling probability satisfies .
Mahendra Mariadassou & Timothée Tabouy
(∗ MaIAGE, INRAE, Université Paris-Saclay, 78352 Jouy-en-Josas, France
UMR MIA-Paris, AgroParisTech, INRA, Université Paris-Saclay, 75005 Paris, France)
Keywords: Stochastic Block Model, Maximum Likelihood, Missing data, Concentration Inequality
1 Introduction
For the last decade, statistical network analysis has been a very active research topic, and the statistical modeling of networks has found many applications in social sciences and biology; see for example Aicher et al. (2014), Barbillon et al. (2015), Mariadassou et al. (2010), Wasserman and Faust (1994) and Zachary (1977).
Many random graph models have been widely studied, either from a theoretical or an empirical point of view. The first model studied was the Erdős-Rényi model (Erdős and Renyi, 1959), which assumes that each pair of nodes (dyad) is connected independently of the others with the same probability. This model assumes homogeneity of all nodes across the network. In order to alleviate this constraint, many families of models have been introduced. Most are endowed with a latent structure (reviewed in Matias and Robin, 2014) to capture heterogeneity across nodes. Among those, the Stochastic Block Model (in short SBM, see Frank and Harary, 1982; Holland et al., 1983) is one of the oldest and most studied, as it is highly flexible and can capture a large variety of structures (affiliation, hub, bipartite and many others). In order to estimate this model, Bayesian approaches were first proposed (Snijders and Nowicki, 1997; Nowicki and Snijders, 2001) but have been superseded by variational methods (Daudin et al., 2008; Latouche et al., 2012). The former class of approaches is exact but lacks the computational efficiency and scalability that the latter offers.
Theoretical guarantees concerning maximum likelihood estimators (in short MLE) and variational estimators (in short VE), based on variational approximations of the likelihood, are quite difficult to obtain for the binary SBM. In Celisse et al. (2012), consistency of the MLE and VE is proven, but asymptotic normality requires that the estimators converge at rate at least , which is not proven in the paper, although some results were available for particular cases (affiliation for example). Ambroise and Matias (2012) tackle the specific case of the affiliation model with equal group proportions and prove the consistency and asymptotic normality of parameter estimates. Bickel et al. (2013) extend those results to arbitrary binary SBM graphs and improve on Celisse et al. (2012) by removing the condition on the convergence rate, as it is automatically satisfied by the MLE. Following the path of Bickel et al. (2013), Brault et al. (2020) extended consistency and asymptotic normality of the estimators (MLE and VE) to weighted Latent Block Models where the weight distribution belongs to a one-dimensional exponential family. In particular, considering unbounded edge values invalidates several parts of the proofs for binary graphs and requires substantial adaptations and additional results, notably concentration inequalities for sums of unbounded, non-Gaussian random variables.
Some results are also available for the related semi-parametric problem of assignment reconstruction. Mariadassou and Matias (2015) show that the conditional distribution of the (latent) assignments converges to a degenerate distribution, and Rohe et al. (2010) prove that, when the data are generated according to an SBM, spectral methods are consistent. Choi et al. (2012) extend those results to settings where the density of the graph goes to [math] as (for large enough) and/or the number of groups goes to as . Chatterjee (2015) also proves strong results for the reconstruction of large matrices with noisy entries and partial observation of the dyads, by means of universal singular value thresholding (USVT). In the special case of the binary SBM with groups, he achieves a reconstruction error rate of order as soon as the fraction of observed dyads is at least for (for large enough). Since USVT replaces missing dyads with [math]s, it naturally achieves the same limiting rate as the sparse setting. Finally, Wang and Bickel (2017) and Hu et al. (2017) show that model selection for the number of groups is consistent for dense graphs; they suggest using a penalized likelihood criterion with penalty of the form where is a tuning parameter.
In this paper we consider a simple setting with a fixed number of groups and fixed density but weighted edges and missing values. In most network studies, there is a strong asymmetry between the presence of an edge and its absence: the lack of proof that an edge exists is taken as proof that the edge does not exist, and edges with uncertain status are considered as non-existent in the graph. This is the strategy adopted in most sparse asymptotic settings where the density of edges goes to [math] asymptotically (Bickel et al., 2013). We adopt a different point of view where edges with uncertain status are considered as missing rather than absent, and their missing nature is explicitly accounted for. We use the framework of Rubin (1976) and its application to network data (see Kolaczyk (2009) and Handcock and Gile (2010)) for parameter inference in the presence of missing values, and more specifically its application to the SBM (Tabouy et al., 2019). We prove that, in the MCAR setting where each dyad is missing independently and with the same probability, the MLE and variational estimates are still consistent and asymptotically normal.
The article is organized as follows. We first present the model and missing data theory applied to our context with some examples of sampling designs. We then posit some definitions and discuss the assumptions required for our results in Section 2. In Section 3 we establish asymptotic normality for the complete-observed model (i.e. observed SBM where latent variables are known). Section 4 contains the main result of this paper and states that the observed likelihood behaves like the complete-observed likelihood (i.e. joint likelihood of the observed data and latent variables) close to its maximum. Consequences for the MLE and variational estimator are discussed in Section 5. The proof is sketched in Section 6. Comparisons with existing results are made and discussed in Section 7. Technical lemmas and details of the proofs are available in the appendices.
2 Statistical framework
2.1 Notations
[TABLE]
2.2 Stochastic Block Model
In SBM, nodes from a set are distributed among a set of hidden blocks that model the latent structure of the graph. The block-memberships are encoded by where the are independent random variables with prior probabilities , such that , for all . The value of any dyad in , with , only depends on the blocks and belong to. The variables s are thus independent conditionally on the s:
[TABLE]
In the following, is the adjacency matrix of the random graph, the -vector of the latent blocks. With a slight abuse of notation, we associate to a binary vector such that , for all . In this case is a matrix.
We note the complete parameter set as where stands for the parameter space. When performing inference from data, we note the true parameter set, i.e. the parameter values used to generate the data, and the true (and usually unobserved) memberships of nodes. For any , we also note:
- the size of the community (or block) for membership
- its counterpart for .
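The generative mechanism described in this section can be sketched numerically. The snippet below is a minimal illustration, not the authors' code: it assumes an undirected graph without self-loops and Poisson-valued edges, and the names `simulate_poisson_sbm`, `alpha` (block proportions) and `lam` (matrix of block-pair means) are hypothetical choices made for the example.

```python
import numpy as np

def simulate_poisson_sbm(n, alpha, lam, rng):
    """Draw block labels and a symmetric Poisson-valued adjacency matrix.

    Assumptions of this sketch: undirected graph, no self-loops,
    Poisson emission with mean lam[k, l] for a dyad between blocks k and l.
    """
    alpha = np.asarray(alpha, dtype=float)
    lam = np.asarray(lam, dtype=float)
    # Latent block of each node, drawn i.i.d. with prior probabilities alpha.
    z = rng.choice(len(alpha), size=n, p=alpha)
    # Conditionally on z, dyad (i, j) has mean lam[z_i, z_j].
    rates = lam[z][:, z]
    upper = np.triu(rng.poisson(rates), k=1)  # keep dyads with i < j only
    x = upper + upper.T                       # symmetrize, zero diagonal
    return z, x

rng = np.random.default_rng(0)
z, x = simulate_poisson_sbm(200, [0.3, 0.7], [[5.0, 1.0], [1.0, 3.0]], rng)
sizes = np.bincount(z, minlength=2)  # block sizes induced by the assignment z
```

Here `sizes` plays the role of the block sizes introduced above for a given membership vector.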
2.3 Missing data for SBM
Regarding SBM inference, a missing value corresponds to a missing entry in the adjacency matrix , typically denoted by NA’s. We rely on the sampling matrix to record the missing state of each entry:
[TABLE]
As a shortcut, we use and to respectively denote the observed and missing dyads. The sampling design is the description of the stochastic process that generates . The network is assumed to exist before the sampling design acts upon it, and the design is fully characterized by the conditional distribution , the parameters of which are such that and live in a product space . In this paper we focus on a specific type of missingness, called missing completely at random (MCAR), for which , and leave aside more complex forms of dependencies such as missing at random (MAR) and not missing at random (NMAR).
We then follow the framework of Rubin (1976) and Tabouy et al. (2019) for missing data and define the joint probability density function as
[TABLE]
Property 2.1**.**
According to Equation (2.2), if the sampling design is MCAR, then maximising or in is equivalent to maximising in . This corresponds to the notion of ignorability defined in Rubin (1976).
2.4 Sampling design examples
We present here some examples of sampling designs to illustrate differences between notions of MCAR, MAR and NMAR.
Definition 2.2** (Random dyad sampling).**
Each dyad has the same probability of being observed, independently of the others. This design is MCAR.
Definition 2.3** (Random node sampling).**
The random node sampling consists in selecting independently with probability a set of nodes and then observing the corresponding rows and columns of matrix .
The major point in both examples is that the probability ( in random dyad sampling and in the random node sampling) of observing a dyad does not depend on its value. In contrast, the following dyad-centered sampling design adapted to binary networks is NMAR since the probability to observe a dyad depends on its value:
Definition 2.4** (Double standard sampling).**
Each dyad is observed, independently of other dyads, with a probability depending on its value: and .
For non-binary networks, specifying the sampling design is more involved and requires defining the sampling density for every possible value of , e.g. for Poisson-valued edges.
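The three designs above can be contrasted with a short simulation. This is only a sketch under stated assumptions: dyads are undirected, the sampling matrix is stored as a symmetric boolean array with `True` for observed dyads, and the function names and the parameters `rho`, `rho_node`, `rho0`, `rho1` are hypothetical labels introduced for the example.

```python
import numpy as np

def random_dyad_sampling(n, rho, rng):
    """MCAR design: each dyad i < j is observed independently with probability rho."""
    upper = np.triu(rng.random((n, n)) < rho, k=1)
    return upper | upper.T

def random_node_sampling(n, rho_node, rng):
    """Select nodes independently, then observe all their rows and columns."""
    selected = rng.random(n) < rho_node
    r = selected[:, None] | selected[None, :]  # dyad seen if either endpoint is selected
    np.fill_diagonal(r, False)                 # no self-loops
    return r

def double_standard_sampling(x, rho0, rho1, rng):
    """NMAR design for a symmetric binary matrix x: the observation
    probability is rho1 for dyads equal to 1 and rho0 for dyads equal to 0."""
    prob = np.where(x == 1, rho1, rho0)
    upper = np.triu(rng.random(x.shape) < prob, k=1)
    return upper | upper.T
```

In the first two functions the adjacency matrix never enters the computation, whereas the third needs `x` as input: this is exactly the dependence on the dyad value that makes the design NMAR.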
Remark 2.5*.*
In this paper, we focus on data sampled according to random dyad sampling, which is the simplest case but already yields valuable insights into the differences between the partially and fully sampled settings.
As observed above, there are however many other ways to sample a network. In the case of node-centered sampling designs, like random node sampling, the main difficulty in proving consistency and asymptotic normality is the dependency between the variables. Indeed, in random node sampling, the variable depends on all and (for all ). As a consequence, a different inference strategy is required and many results proved in this paper are not valid under random node sampling. NMAR sampling designs raise problems of their own: each design requires its own estimation procedure (Tabouy et al., 2019) and therefore its own analysis. For example, parameter estimation under the seemingly simple double standard sampling for binary networks is still an open problem: numerical experiments suggest that and are jointly identifiable but there is no formal proof.
2.5 Observed-likelihoods
When the labels are known, the complete-observed log-likelihood is given by:
[TABLE]
But the labels are usually unobserved, and the observed log-likelihood is obtained by integration over all memberships:
[TABLE]
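The two likelihoods can be made concrete on a toy example. The sketch below assumes an undirected graph, Poisson emissions and the usual form of the complete-observed log-likelihood (label prior plus emission terms restricted to observed dyads); these are assumptions of the illustration, as are the names `complete_observed_loglik`, `observed_loglik`, `alpha` and `lam`. The observed log-likelihood is obtained by brute-force summation over all assignments, which is only feasible for a handful of nodes.

```python
import itertools
from math import lgamma

import numpy as np

def poisson_logpmf(x, lam):
    """Log-density of a Poisson variable with mean lam, evaluated at x."""
    return x * np.log(lam) - lam - np.vectorize(lgamma)(x + 1.0)

def complete_observed_loglik(x, r, z, alpha, lam):
    """log p(observed dyads, z): label prior plus emission terms over observed i < j."""
    z, alpha, lam = np.asarray(z), np.asarray(alpha), np.asarray(lam)
    i, j = np.triu_indices(len(z), k=1)
    logphi = poisson_logpmf(x[i, j], lam[z[i], z[j]])
    return np.log(alpha[z]).sum() + (r[i, j] * logphi).sum()

def observed_loglik(x, r, alpha, lam):
    """Log of the sum over all Q^n assignments (exponential cost, toy sizes only)."""
    n, q = x.shape[0], len(alpha)
    terms = [complete_observed_loglik(x, r, z, alpha, lam)
             for z in itertools.product(range(q), repeat=n)]
    return np.logaddexp.reduce(terms)
```

The exponential number of terms in `observed_loglik` is the reason why the observed likelihood cannot be evaluated directly for realistic network sizes.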
2.6 Models and Assumptions
We focus here on parametric models where belongs to a regular one-dimension exponential family in canonical form:
[TABLE]
where belongs to the space , so that is well defined for all . Classical properties of exponential families ensure that is convex and infinitely differentiable on , and that is well defined on . Furthermore, when , and .
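As an illustration of these classical properties, consider the Poisson family, which in canonical form has natural parameter equal to the log of the mean and log-normaliser equal to the exponential function. The sketch below (an assumed example, with the hypothetical names `psi` and `numerical_derivatives`) checks numerically that the first two derivatives of the log-normaliser match the mean and variance of the distribution.

```python
import numpy as np

def psi(theta):
    """Log-normaliser of the Poisson family in canonical form (theta = log mean)."""
    return np.exp(theta)

def numerical_derivatives(f, theta, h=1e-4):
    """Central finite-difference approximations of f'(theta) and f''(theta)."""
    d1 = (f(theta + h) - f(theta - h)) / (2 * h)
    d2 = (f(theta + h) - 2 * f(theta) + f(theta - h)) / h**2
    return d1, d2

theta = np.log(3.0)  # natural parameter of a Poisson with mean 3
mean_from_psi, var_from_psi = numerical_derivatives(psi, theta)

# Monte Carlo counterpart: empirical mean and variance of Poisson draws.
rng = np.random.default_rng(1)
sample = rng.poisson(np.exp(theta), size=200_000)
empirical_mean, empirical_var = sample.mean(), sample.var()
```

Both pairs of values should be close to the Poisson mean, since mean and variance coincide for this family.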
In the following, we recall that missing data are assumed to be produced according to random dyad sampling with parameter .
Moreover, we make the following assumptions on the parameter space and the asymptotics of :
1. goes to [math] but satisfies
2. There exists a positive constant , and a compact interval such that
[TABLE]
3. The true parameter lies in the interior of .
4. The map is injective.
5. The coordinates of , where is applied component-wise, are pairwise distinct.
The previous assumptions are standard. Assumption ensures that the fraction of observed dyads is not too small. Assumption ensures that the group proportions are bounded away from [math], so that no group disappears when goes to infinity. It also ensures that is bounded away from the boundaries of the . This is essential for the subexponential properties of Propositions 2.9 and 2.10. Assumption is in line with standard assumptions in parametric statistics. Assumption is necessary for identifiability purposes: the model is trivially not identifiable if the map is not injective. Assumption ensures identifiability of SBM parameters under random dyad sampling. Note that, combined with , it implies that all columns and all rows of are distinct and therefore that no two groups have the same connectivity profile. In the following, we consider the number of blocks to be known.
2.7 Identifiability
Since is independent of , the identifiability of the SBM with emission law in a one-dimensional exponential family under random dyad sampling can be established in two steps: first for the sampling parameter , and then for the SBM parameters given .
Proposition 2.6**.**
The sampling parameter of random dyad sampling is identifiable w.r.t. the sampling distribution.
Proof.
See Tabouy et al. (2019). The proof does not depend on being binary and also holds for distributed as in Eq. (2.5). ∎
Proposition 2.7**.**
Let and assume that for any , , and that the coordinates of , where is applied component-wise, are pairwise distinct. Then, under random dyad sampling, SBM parameters are identifiable w.r.t. the distribution of the observed part of the SBM up to label switching.
Proof.
The proof is nearly identical to the one given in Tabouy et al. (2019), itself inspired by Celisse et al. (2012), for the binary SBM under random dyad sampling. Substituting for in the proof ensures that is identifiable. Finally, the fact that is a one-to-one map ensures that is identifiable. ∎
Note that asymptotically, the assumption is always satisfied since is fixed and grows to infinity.
2.8 Subexponential variables
Remark 2.8*.*
Since we restricted to a bounded subset of , the variance of is bounded away from [math] and . We note
[TABLE]
Similarly, since belongs to a bounded subset of an open interval, there exists a constant , such that uniformly over all
Proposition 2.9**.**
With the previous notations, if and , then is subexponential with parameters .
Proposition 2.10**.**
Consider (we recall that ), with independent of and bounded. Then there are non-negative numbers and such that is subexponential with parameters .
Proof.
These results derive directly from Theorem C.1 (statement 2). ∎
2.9 Symmetry
We now introduce the concepts of assignment and parameter symmetries, which must be accounted for when studying the asymptotic properties of the MLE. Complications stemming from symmetries are related to, but not equivalent to, the problem of label-switching in mixture models.
Definition 2.11** (permutation).**
Let be a permutation on . If is a matrix with columns and rows, we define as the matrix obtained by permuting the columns of according to , i.e. for any row and column of , . If is a matrix with rows and columns, is defined similarly:
[TABLE]
Definition 2.12** (equivalence).**
We define the following equivalence relationships:
- Two assignments and are equivalent, noted , if they are equal up to label permutation, i.e. there exists a permutation such that .
- Two parameters and are equivalent, noted , if they are equal up to label permutation, i.e. there exists a permutation such that .
- and are equivalent, noted , if they are equal up to label permutation on and , i.e. there exists a permutation such that . This is label-switching.
Definition 2.13** (symmetry).**
We say that the parameter exhibits symmetry for the permutation if
[TABLE]
exhibits symmetry if it exhibits symmetry for any non-trivial permutation . Finally, the set of permutations for which exhibits symmetry is noted .
Remark 2.14*.*
The set of parameters that exhibit symmetry is a manifold of null Lebesgue measure in . The notion of symmetry allows us to deal with a notion of non-identifiability of the class labels that is subtler than and different from label switching. More precisely
[TABLE]
In particular, in label-switching, and have the same likelihood but under equivalent yet different parameters s. In contrast, in the presence of symmetry, and have exactly the same likelihood under . This implies in particular that the posterior can not concentrate on a single assignment. This is instrumental for Proposition 6.11.
Example 1*.*
In this example we illustrate what and its cardinality can be in a simple case. Consider a network with nodes,
[TABLE]
As a consequence, the two following assignments
[TABLE]
belong to . Indeed, they are the only assignments belonging to , so, in this particular case .
The issue of symmetry forces us to use a notion of distance between assignments that is invariant to label permutation.
Definition 2.15** (distance).**
We define the following distance, up to equivalence, between configurations and :
[TABLE]
where, for any matrix , we use the Hamming norm defined by
[TABLE]
Definition 2.16** (Set of local assignments).**
We note the set of configurations that have a representative (for ) within relative radius of :
[TABLE]
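A distance that is invariant to label permutation can be computed by brute force over permutations when the number of blocks is small. The sketch below assumes that assignments are encoded as one-hot membership matrices and that the Hamming norm counts the non-zero entries of a matrix; under this convention each misclassified node contributes two non-zero entries. The helper names `one_hot`, `distance_up_to_equivalence` and `in_local_set` are hypothetical.

```python
import itertools

import numpy as np

def one_hot(z, q):
    """Encode a label vector z (values in 0..q-1) as an n x q binary matrix."""
    return np.eye(q, dtype=int)[np.asarray(z)]

def distance_up_to_equivalence(z, z_star, q):
    """Smallest Hamming norm of the difference between z_star and any
    column-permuted version of z (brute force over the q! permutations)."""
    zm, zs = one_hot(z, q), one_hot(z_star, q)
    return min(int(np.count_nonzero(zm[:, list(s)] - zs))
               for s in itertools.permutations(range(q)))

def in_local_set(z, z_star, q, radius):
    """Check whether z has a representative within relative radius of z_star."""
    return distance_up_to_equivalence(z, z_star, q) <= radius * len(z_star)
```

For instance, two label vectors that differ only by a swap of the two block names are at distance zero, which is the invariance the definition is designed to provide.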
2.10 Other definitions
We finally introduce a few useful notions that will be instrumental in the proofs. The first is “regular” assignments, for which each group has “enough” nodes:
Definition 2.17** (-regular assignments).**
Let . For any , we say that is c-regular if
[TABLE]
Class distinctness captures the differences between groups: lower values of mean that at least two classes have very similar connectivity profiles. is intrinsically linked to the convergence rate of several estimates.
Definition 2.18** (class distinctness).**
For . We define:
[TABLE]
with the Kullback divergence between and , when comes from an exponential family.
Remark 2.19*.*
Since all have distinct rows and columns, .
Finally, the confusion matrix allows us to compare groups between assignments:
Definition 2.20** (confusion matrix).**
For given assignments and , we define the confusion matrix between and , noted , as follows:
[TABLE]
Definition 2.21**.**
For conciseness, we define
[TABLE]
3 Complete-observed Model
Hereafter and in the rest of the text, we use the term "complete" to say that true assignments are known, and "observed" to say that only some dyads are observed. In the following we study the asymptotic properties of the complete-observed data model.
Proposition 3.1**.**
Under random dyad sampling, define and the set of nodes with at least one observed dyad. Then
[TABLE]
Proof.
This proposition is a direct consequence of the Borel-Cantelli lemma. Details are available in appendix A. ∎
Remark 3.2*.*
This result shows that, with high probability, the network has no unobserved node. In the remainder, we work conditionally on .
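The statement that every node has at least one observed dyad with high probability can be checked by simulation. In the sketch below, which is an illustration under assumed values rather than part of the proof, a given node has no observed dyad with probability (1 - rho)^(n - 1), so a union bound over nodes gives n (1 - rho)^(n - 1). The names `has_unobserved_node` and `union_bound` and the values of `n` and `rho` are hypothetical.

```python
import numpy as np

def has_unobserved_node(n, rho, rng):
    """Draw a random dyad sampling matrix and report whether some node
    has none of its n - 1 dyads observed."""
    upper = np.triu(rng.random((n, n)) < rho, k=1)
    r = upper | upper.T
    return bool((r.sum(axis=1) == 0).any())

def union_bound(n, rho):
    """Upper bound on the probability that at least one node is unobserved."""
    return n * (1.0 - rho) ** (n - 1)

rng = np.random.default_rng(2)
n, rho = 200, 0.05
freq = np.mean([has_unobserved_node(n, rho, rng) for _ in range(500)])
bound = union_bound(n, rho)
```

The bound decreases geometrically in the number of nodes for a fixed sampling probability, which is why its sum over network sizes converges and a Borel-Cantelli argument applies.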
Let be the MLE of in the complete-observed data model. Simple manipulations of Equation (2.3) yield:
[TABLE]
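When labels are known, the estimates take a simple closed form: block proportions are empirical frequencies, and each block-pair mean is the average of the observed dyads in that pair. The sketch below illustrates this for a symmetric adjacency matrix and a symmetric boolean sampling matrix with empty diagonal; it returns the estimate on the mean scale (applying the inverse of the mean map, e.g. a logarithm for Poisson edges, would give the canonical parameter). The function name `complete_observed_mle` is hypothetical.

```python
import numpy as np

def complete_observed_mle(x, r, z, q):
    """Closed-form estimates when labels z are known and only dyads with r True are seen."""
    zm = np.eye(q)[np.asarray(z)]       # n x q one-hot membership matrix
    alpha_hat = zm.mean(axis=0)         # block proportions N_k / n
    r = np.asarray(r, dtype=float)
    # Counts and totals over ordered pairs (i, j); within-block dyads are
    # counted twice in both quantities, so the ratio is unaffected.
    n_obs = zm.T @ r @ zm
    totals = zm.T @ (r * x) @ zm
    with np.errstate(invalid="ignore", divide="ignore"):
        mean_hat = totals / n_obs       # NaN for a block pair with no observed dyad
    return alpha_hat, mean_hat
```

The denominator is a random number of observed dyads, which is why the asymptotic analysis needs a central limit theorem for random sums.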
Proposition 3.3**.**
Let . Then is positive semi-definite, of rank , and is asymptotically normal:
[TABLE]
Similarly, let be the matrix defined by and . Then the estimates are independent and asymptotically Gaussian with limit distribution:
[TABLE]
Proof.
The proof is postponed to appendix A. The first part is a direct application of the central limit theorem for i.i.d. variables and the second part relies on a variant of the central limit theorem for random sums of random variables. ∎
Remark 3.4*.*
The main differences with Bickel et al. (2013) are (i) the scaling of as and (ii) the need for a central limit theorem for random sums of random variables, as the sums involved in (3.1) are over a random number of terms.
Proposition 3.5** (Local asymptotic normality).**
Let be the complete likelihood function defined on by . For any and in a compact set, we have:
[TABLE]
where denotes the Hadamard product of two matrices (element-wise product) and and are defined in Proposition 3.3. is asymptotically Gaussian with zero mean and variance matrix . is a random matrix with independent entries that are asymptotically Gaussian with zero mean and variance .
Proof.
This result is based on a Taylor expansion of in a neighborhood of . Details are available in appendix A. ∎
4 Main Result
Our main result compares the observed likelihood ratio with the complete-observed likelihood ratio to show that they have the same argmax. To ease the comparison, we work only on the high probability set of -regular configurations, i.e. those that have nodes in each group, as defined in Section 2.
Proposition 4.1**.**
Define as the subset of made of -regular assignments, with defined in assumption . Denote by the event ; then:
[TABLE]
Proof.
This proposition is a consequence of Hoeffding’s inequality. See appendix A for more details. ∎
We can now state our main result:
Theorem 4.2** (complete-observed).**
Assume that to with random-dyad sampling hold for the Stochastic Block Model of known order with observations coming from a univariate exponential family and define as the set of permutations for which exhibits symmetry. Then, for tending to infinity and , the observed likelihood ratio behaves like the complete likelihood ratio, up to a bounded multiplicative factor:
[TABLE]
where the is uniform over all .
The maximum over all that are equivalent to stems from the fact that because of label-switching, is only identifiable up to its -equivalence class from the observed likelihood, whereas it is completely identifiable from the complete likelihood. The multiplicative factor arises from the fact that equivalent assignments have exactly the same complete likelihood and contribute equally to the observed likelihood.
Remark 4.3*.*
This result is very similar to the one of Brault et al. (2020) and corrects an error in the main result of Bickel et al. (2013): the missing terms and .
Corollary 4.4**.**
If contains only parameters with no symmetry:
[TABLE]
where the is uniform over all .
5 Variational and Maximum Likelihood Estimates
This section is devoted to the asymptotics of the MLE and the VE in the incomplete data model as a consequence of the main result, Theorem 4.2. Note that, with high probability, both estimators have no symmetry since the set is a manifold of null Lebesgue measure in and thus .
5.1 ML estimator
The asymptotic behavior of the maximum likelihood estimator in the incomplete data model is a direct consequence of Theorem 4.2 and Proposition 3.5.
Corollary 5.1** (Asymptotic behavior of ).**
Denote the maximum likelihood estimator and use the notations of Proposition 3.3. There exist permutations of such that
[TABLE]
Hence, the maximum likelihood estimator for the SBM under random-dyad sampling condition is consistent and asymptotically normal, with the same behavior as the maximum likelihood estimator in the complete data model. The proof is postponed to appendix B.10.
5.2 Variational estimator
Due to the complex dependency structure of the observations, the maximum likelihood estimator of the SBM is not numerically tractable, even with the Expectation Maximisation algorithm. In practice, a variational approximation is often used (see Daudin et al., 2008): for any joint distribution on a lower bound of is given by
[TABLE]
where . Choosing to be the set of product distributions, such that for all
[TABLE]
allows us to obtain tractable expressions of . The variational estimate of is defined as
[TABLE]
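The variational lower bound under a product distribution can be written down explicitly for a toy model. The sketch below is an illustration under stated assumptions, not the authors' implementation: it assumes an undirected graph, Poisson emissions and a mean-field distribution parameterized by an n x Q matrix `tau` whose rows sum to one. The names `elbo`, `tau`, `alpha` and `lam` are hypothetical.

```python
from math import lgamma

import numpy as np

def poisson_logpmf(x, lam):
    """Log-density of a Poisson variable with mean lam, evaluated at x."""
    return x * np.log(lam) - lam - np.vectorize(lgamma)(x + 1.0)

def elbo(x, r, tau, alpha, lam):
    """Mean-field lower bound: expected complete-observed log-likelihood
    under the product distribution tau, plus the entropy of tau."""
    n, q = tau.shape
    # Prior and entropy terms, with the convention 0 * log 0 = 0.
    safe_tau = np.where(tau > 0, tau, 1.0)
    prior_entropy = np.sum(tau * (np.log(alpha)[None, :] - np.log(safe_tau)))
    emission = 0.0
    for k in range(q):
        for l in range(q):
            logphi = poisson_logpmf(x, lam[k, l])      # n x n log-densities
            weights = np.outer(tau[:, k], tau[:, l])   # tau_ik * tau_jl
            # Only observed dyads with i < j contribute.
            emission += np.sum(np.triu(r * weights * logphi, k=1))
    return prior_entropy + emission
```

When `tau` is a one-hot encoding of a single assignment, the entropy term vanishes and the bound reduces to the complete-observed log-likelihood of that assignment. A variational estimate would then alternate between maximizing this quantity in `tau` and in the model parameters.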
The following corollary states that has the same asymptotic properties as and , in particular is consistent and asymptotically normal.
Corollary 5.2** (Variational estimate).**
Under the assumptions of Theorem 4.2, there exist permutations of such that
[TABLE]
The proof is very similar to the proof of Corollary 5.1 and postponed to appendix B.10.
6 Proof Sketch
The proof of the main theorem relies on deviations of the log-likelihood ratios from their expectations. We first define those quantities.
6.1 Log-likelihood ratios
Definition 6.1**.**
We define the conditional log-likelihood ratio and its expectation as:
[TABLE]
We also define the profile ratio and its counterpart as:
[TABLE]
The following decomposition of highlights the importance of :
[TABLE]
Since , the profile ratio is useful to remove the dependency on and reduce the study to a series of problems depending only on . The following propositions show that and are contrasts which are maximum (in expectation) at the true parameter value (up to group relabeling) and have negative curvature at those points. This allows us to prove that, asymptotically, only one (or a few) contribute to the above sum.
Proposition 6.2**.**
Conditionally on , we have
[TABLE]
with for such that or i.e. no dyad observed in class .
Remark 6.3*.*
Note the absence of the random variable in , which is integrated out in the expectation .
Proposition 6.4** (maximum of and in ).**
The functions and are maximum respectively in for and defined by:
[TABLE]
so that
[TABLE]
Proposition 6.5** (Local upperbound for ).**
Conditionally upon , there exists a positive constant such that for all :
[TABLE]
Proposition 6.6** (maximum of and in ).**
can be written:
[TABLE]
Conditionally on the set of regular assignments and for ,
- (i) is maximized at and its equivalence class and .
- (ii) is maximized at and its equivalence class and .
- (iii) The maximum of (and hence the maximum of ) is well separated.
Proofs of Propositions 6.2, 6.4, 6.5 and 6.6 are postponed to Appendix B.
6.2 High level view of the proof
The proof proceeds by splitting as a sum over three types of configurations that partition and studying the asymptotic behavior of and on each type:
1. global control: for such that , Proposition 6.7 proves a large deviation behavior and shows that . In turn, those assignments contribute a of to the sum (Proposition 6.8).
2. local control: a small deviation result (Proposition 6.9) is needed to show that the combined contribution of assignments close to but not equivalent to is also a of (Proposition 6.10).
3. equivalent assignments: Proposition 6.11 examines which of the remaining assignments, all equivalent to , contribute to the sum.
These results are presented in Section 6.3 below and their proofs are postponed to Appendix B. They are then put together in Section 6.4 to prove our main result. The remainder of the section is devoted to the asymptotics of the ML and variational estimators as a consequence of the main result.
6.3 Different asymptotic behaviors
6.3.1 Global Control
Proposition 6.7** (large deviations of ).**
Let . For all and large enough that
[TABLE]
Proposition 6.8** (contribution of global assignments).**
Choose decreasing to [math] slowly enough that . Then conditionally on and for large enough that , we have:
[TABLE]
6.3.2 Local Control
Proposition 6.9** (small deviations ).**
Conditionally on ,
[TABLE]
The next proposition uses Propositions 6.9 and 6.6 to show that the combined contribution to the observed likelihood of assignments close to is also a of :
Proposition 6.10** (contribution of local assignments).**
With the previous notations and the positive constant defined in Proposition 6.5:
[TABLE]
6.3.3 Equivalent assignments
It remains to study the contribution of equivalent assignments.
Proposition 6.11** (contribution of equivalent assignments).**
For all , we have
[TABLE]
where the is uniform in .
6.4 Proof of the main result
Proof.
We work conditionally on . Choose and a sequence decreasing to [math] but satisfying . According to Proposition 6.8,
[TABLE]
Since decreases to [math], it gets smaller than (used in Proposition 6.10) for large enough. At this point, Proposition 6.10 ensures that:
[TABLE]
And therefore the observed likelihood ratio reduces as:
[TABLE]
And Proposition 6.11 allows us to conclude
[TABLE]
∎
7 Discussion
Close examination of the different proofs, especially of Prop. 6.10, reveals that the quantities driving convergence of the estimates are , which must go to with to ensure validity of Prop. 6.8, and , which must be larger than while , to ensure validity of Prop. 6.10. Both conditions are met as soon as , allowing for a large fraction of missing edges. Note that this limiting rate for missingness is the same as the one found for graph density in sparse settings to achieve consistency and local asymptotic normality of (Bickel et al., 2013). It is also the same as the one found by Chatterjee (2015) for the structured matrix reconstruction problem. Note also that in the fixed setting, both MLE and VE are consistent and asymptotically normal, but the cost of missingness is an expected blow-up of the asymptotic variance matrix by a factor of .
The proof follows the lines of Bickel et al. (2013) but differs in some significant ways. First, since the number of observed dyads is random, we must rely on variants of the central limit theorem that hold for random sums of random variables. Second, the move from binary to unbounded dyads invalidates a counting argument used in Bickel et al. (2013) and requires different concentration inequalities. We leverage the facts that random variables with distribution in natural exponential families are subexponential and that the subexponential property is preserved by summation and multiplication to derive Bernstein-type inequalities. Finally, we add the missing terms , which have little impact on the corollaries but are required for a rigorous statement of the main result.
8 Acknowledgment
The authors thank Pierre Barbillon (INRA-MIA, AgroParisTech), Julien Chiquet (INRA-MIA, AgroParisTech), Stéphane Robin (INRA-MIA, AgroParisTech) and James Ridgway (CFM) for their helpful remarks and suggestions.
This work is supported by two public grants overseen by the French National Research Agency (ANR): first as part of the « Investissement d’Avenir » program, through the « IDI 2017 » project funded by the IDEX Paris-Saclay, ANR-11-IDEX-0003-02, and second by the « EcoNet » project.
Appendix A Technical results
A.1 Proof of proposition 3.1
Proof.
Noticing that , then . As a consequence , and . Then, by the Borel-Cantelli lemma (because converges) and as , the result follows. ∎
A.2 Technical lemma A.1
Lemma A.1.
[TABLE]
Proof.
Notice that and define . By the Hoeffding decomposition for U-statistics (see Hoeffding, 1948),
[TABLE]
where for each permutation , is a sum of independent r.v. Then, for , by Jensen’s inequality and Hoeffding’s lemma for bounded r.v.,
[TABLE]
Finally, the same argument as in the proof of Hoeffding’s inequality allows us to conclude. ∎
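As a rough numerical illustration of the kind of concentration used here, the sketch below checks Hoeffding's inequality on the fraction of observed dyads under MCAR sampling. It is a simplified i.i.d. setting (all dyads pooled, no block structure) rather than the U-statistic handled in Lemma A.1. The function names `observed_fraction` and `hoeffding_bound`, and the values of `n_nodes`, `rho` and `t`, are illustrative assumptions and do not come from the paper.

```python
import math
import random


def observed_fraction(n_nodes, rho, rng):
    """Sample each of the n(n-1)/2 dyads independently with probability rho
    (MCAR) and return the fraction of dyads that is observed."""
    n_dyads = n_nodes * (n_nodes - 1) // 2
    n_obs = sum(rng.random() < rho for _ in range(n_dyads))
    return n_obs / n_dyads


def hoeffding_bound(n_dyads, t):
    """Two-sided Hoeffding bound for the mean of n_dyads independent
    Bernoulli draws deviating from its expectation by at least t."""
    return min(1.0, 2.0 * math.exp(-2.0 * n_dyads * t * t))


# Illustrative values only (assumptions, not taken from the paper).
rng = random.Random(0)
n_nodes, rho, t, n_rep = 40, 0.3, 0.05, 500
n_dyads = n_nodes * (n_nodes - 1) // 2

deviations = [abs(observed_fraction(n_nodes, rho, rng) - rho) for _ in range(n_rep)]
empirical_tail = sum(d >= t for d in deviations) / n_rep
bound = hoeffding_bound(n_dyads, t)
print(f"empirical tail {empirical_tail:.4f}, Hoeffding bound {bound:.4f}")
```

The empirical tail frequency should sit below the bound, which is conservative because it ignores the variance of the Bernoulli draws.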
A.3 Proof of proposition 3.3
Proof.
Since is the sample mean of i.i.d. multinomial random variables with parameters and , a simple application of the central limit theorem (CLT) gives:
[TABLE]
which proves Equation (3.2), where is positive semi-definite of rank .
Similarly, is the average of i.i.d. random variables with mean and variance . is itself random but, thanks to Lemma A.1: . Therefore, by Slutsky’s lemma and the CLT for random sums of random variables (Shanthikumar and Sumita, 1984), we have:
[TABLE]
The differentiability of and the delta method then give:
[TABLE]
and the independence results from the independence of and as soon as or , as they involve different sets of i.i.d. variables. ∎
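The random-sum CLT invoked above can be checked numerically in a toy case. The sketch below assumes a single block pair with Bernoulli edges: each dyad is observed with probability `rho`, the connectivity is estimated as the mean over observed dyads, and the error is rescaled by the square root of the random number of observed dyads. The function name `standardized_error` and all parameter values are hypothetical choices made for illustration.

```python
import math
import random
import statistics


def standardized_error(n_dyads, rho, pi, rng):
    """Return sqrt(N_obs) * (pi_hat - pi) / sqrt(pi * (1 - pi)) for one
    simulated block pair, or None if no dyad happens to be observed."""
    n_obs = 0
    n_edges = 0
    for _ in range(n_dyads):
        if rng.random() < rho:            # the dyad is sampled (MCAR)
            n_obs += 1
            n_edges += rng.random() < pi  # Bernoulli(pi) edge value
    if n_obs == 0:
        return None
    pi_hat = n_edges / n_obs
    return math.sqrt(n_obs) * (pi_hat - pi) / math.sqrt(pi * (1.0 - pi))


# Illustrative values only (assumptions, not taken from the paper).
rng = random.Random(1)
n_dyads, rho, pi, n_rep = 1000, 0.2, 0.3, 1000

draws = [standardized_error(n_dyads, rho, pi, rng) for _ in range(n_rep)]
draws = [d for d in draws if d is not None]
mean = statistics.fmean(draws)
variance = statistics.variance(draws)
print(f"mean {mean:.3f}, variance {variance:.3f}")
```

With these settings the standardized errors should have mean close to 0 and variance close to 1, as expected for an approximately standard normal limit despite the random number of terms.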
A.4 Proof of proposition 3.5
Proof.
By Taylor expansion,
[TABLE]
where and denote the respective components of the gradient of evaluated at and and denote the conditional Hessian of evaluated at . By inspection, and converge in probability to constant matrices and the random vectors and converge in distribution by the central limit theorem. ∎
A.5 Proof of proposition 4.1
Proof.
In regular configurations, each group has members, where if there exist two constants such that for large enough . -regular assignments, with defined in Assumption , have high -probability in the space of all assignments, uniformly over all .
Each is a sum of i.i.d. Bernoulli r.v. with parameter . A simple Hoeffding bound shows that
[TABLE]
Taking a union bound over the values of then leads to Proposition 4.1. ∎
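The Hoeffding-plus-union-bound argument can be sketched numerically as follows. The code assumes that an assignment is called regular when every group has at least `n * c / 2` members, with `c` the smallest group proportion; this threshold, the function names `group_sizes` and `irregular_bound`, and the chosen proportions are assumptions made for illustration, since the exact constants of the proposition are not reproduced here.

```python
import math
import random


def group_sizes(n, alpha, rng):
    """Draw n i.i.d. group labels with proportions alpha; return group sizes."""
    labels = rng.choices(range(len(alpha)), weights=alpha, k=n)
    sizes = [0] * len(alpha)
    for z in labels:
        sizes[z] += 1
    return sizes


def irregular_bound(n, alpha):
    """Union bound over groups of the one-sided Hoeffding bound
    P(n_k < n * c / 2) <= exp(-n * c**2 / 2), with c = min(alpha)."""
    c = min(alpha)
    return min(1.0, len(alpha) * math.exp(-n * c * c / 2.0))


# Illustrative values only (assumptions, not taken from the paper).
rng = random.Random(2)
n, alpha, n_rep = 200, (0.5, 0.3, 0.2), 1000
threshold = n * min(alpha) / 2.0

n_irregular = sum(min(group_sizes(n, alpha, rng)) < threshold for _ in range(n_rep))
empirical = n_irregular / n_rep
bound = irregular_bound(n, alpha)
print(f"empirical frequency {empirical:.4f}, union bound {bound:.4f}")
```

The bound decays exponentially in `n`, which is why irregular assignments can be discarded at negligible cost in the main argument.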
Appendix B Main Results
B.1 Proof of Proposition 6.2
Proof.
First, we prove Equation (6.3):
[TABLE]
where . Notice that the for which do not contribute to either of the two terms of the ratio. Computing this expectation is therefore equivalent to computing an expectation of the general form , and .
Lemma B.1.
[TABLE]
Proof.
Define and notice that . Conditionally on
[TABLE]
∎
Now, applying lemma B.1 with leads to
[TABLE]
Finally, can be arbitrarily set to the same value as , which concludes the proof. ∎
B.2 Proof of proposition 6.4
Proof.
Define . For fixed, is maximized at . Manipulations yield
[TABLE]
which is maximized at . Similarly with ,
[TABLE]
is maximized at . ∎
B.3 Proof of Proposition 6.6 (maximum of and )
Proof.
We condition on and prove Equation (6.5):
[TABLE]
If is regular, and for , all the rows of have at least one positive element and we can apply Lemma 3.2 of Bickel et al. (2013) to characterize the maximum for .
The maximality of results from the fact that where is a particular value of , is immediately maximal at , and for those, we have .
The separation and local behavior of around is a direct consequence of Proposition 6.5. ∎
B.4 Proof of Proposition 6.5 (Local upper bound for )
Proof.
We work conditionally on . The principle of the proof relies on the extension of to a continuous subspace of , in which the confusion matrix is naturally embedded. The regularity assumption allows us to work on a subspace that is bounded away from the borders of . The proof then proceeds by computing the gradient of at and around its argmax and using those gradients to control the local behavior of around its argmax. The local behavior allows us in turn to show that is well-separated.
Note that only depends on through . We can therefore extend it to matrices , where is the subset of matrices with each row sum higher than .
[TABLE]
where
[TABLE]
and is the matrix filled with . Confusion matrices satisfy , with a vector only containing values, and are obviously in as soon as is regular.
The maps are twice differentiable with second derivatives bounded over and therefore so is . Tedious but straightforward computations show that the derivative of at is:
[TABLE]
is the matrix-derivative of at . Since is -regular and by definition of , if and for all . By boundedness of the second derivative, there exists such that for all and all , we have:
[TABLE]
Choose in satisfying . have nonnegative off-diagonal coefficients and negative diagonal coefficients. Furthermore, the coefficients of sum up to and . By Taylor expansion, there also exists in such that
[TABLE]
To conclude the proof, assume without loss of generality that achieves the norm (i.e. it is the closest to in its representative class). Then is in and satisfies . It only remains to note that to end the proof.
∎
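The notions of confusion matrix and of closest representative in an equivalence class can be made concrete with a small sketch. It assumes that assignments are stored as label vectors, that equivalent assignments differ only by a permutation of the group labels, and that the confusion matrix has entry (k, k') equal to the fraction of nodes with reference label k and candidate label k'. These conventions and the function names are assumptions made for illustration, and the brute-force search over permutations is only practical for a small number of groups.

```python
from itertools import permutations


def hamming(z, z_ref):
    """Number of nodes on which two label vectors disagree."""
    return sum(a != b for a, b in zip(z, z_ref))


def closest_representative(z, z_ref, n_groups):
    """Relabel z by the permutation of group labels that minimizes the
    Hamming distance to z_ref; return (distance, relabeled vector)."""
    best = None
    for perm in permutations(range(n_groups)):
        relabeled = [perm[a] for a in z]
        d = hamming(relabeled, z_ref)
        if best is None or d < best[0]:
            best = (d, relabeled)
    return best


def confusion_matrix(z, z_ref, n_groups):
    """Entry (k, k') is the fraction of nodes in group k of z_ref that are
    assigned to group k' by z (assumed convention)."""
    n = len(z)
    mat = [[0.0] * n_groups for _ in range(n_groups)]
    for a, b in zip(z_ref, z):
        mat[a][b] += 1.0 / n
    return mat


# Toy example: z equals z_ref up to relabeling, except for one node.
z_ref = [0, 0, 1, 1, 2, 2]
z = [1, 1, 2, 2, 0, 2]
distance, z_closest = closest_representative(z, z_ref, 3)
conf = confusion_matrix(z_closest, z_ref, 3)
print(distance, z_closest)
```

In this toy example the closest representative disagrees with the reference on a single node, so its confusion matrix is diagonal except for one off-diagonal entry, and its coefficients sum to one.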
B.5 Proof of Proposition 6.7 (global convergence )
Proof.
Conditionally upon ,
[TABLE]
uniformly in , where the are independent and by Taylor expansion defined by:
[TABLE]
is the sum of sub-exponential variables with parameters and is therefore itself sub-exponential with parameters . According to Proposition B.3 of Brault et al. (2020) , and is sub-exponential with parameters . In particular, for all
[TABLE]
We can then remove the conditioning and take a union bound. ∎
B.6 Proof of Proposition 6.8 (contribution of far away assignments)
Proof.
Conditionally on , we know from Proposition 6.6 that is maximal in and its equivalence class. Choose decreasing to [math] but satisfying . According to Proposition 6.6 (iii), for all
[TABLE]
since .
Set and large enough that . By proposition 6.7, and with our choice of , with probability higher than ,
[TABLE]
where the second line comes from inequality (B.1), the third from the global control studied in Proposition 6.7 and the definition of , the fourth from the definition of , the fifth from the bounds on and the last from .
In addition, with our choice of , we have so that the series converges and:
[TABLE]
∎
B.7 Proof of Proposition 6.9 (local convergence )
Proof.
We work conditionally on . Choose small. Assignments at -distance less than of are -regular. According to Proposition B.1 of Brault et al. (2020) , and are at distance at most with probability higher than . Defining
[TABLE]
where . Manipulation of , and yield
[TABLE]
where , and .
Concerning the first term.
The function is twice differentiable on with and . (resp. ) are bounded over by (resp. ) so that:
[TABLE]
By Proposition B.1 (adapted for SBM) of Brault et al. (2020) , where the is uniform in and does not depend on . Similarly,
[TABLE]
is a convex combination of the ; therefore,
[TABLE]
Note that:
[TABLE]
and . Therefore
[TABLE]
The remaining term can be written as
[TABLE]
and is also uniformly in and by Proposition C.2.
Concerning the second term.
For all , defining
[TABLE]
and noticing that and . Using the following notations
[TABLE]
we are able to write
[TABLE]
where the second expression is a sum of independent random variables.
Note that:
[TABLE]
and also that and . Therefore
[TABLE]
Concerning the third term.
The arguments developed previously lead to the same conclusion as before:
[TABLE]
As a conclusion, writing
[TABLE]
and noticing that since is maximized in (see Proposition 6.6), we have
[TABLE]
∎
B.8 Proof of Proposition 6.10 (contribution of local assignments)
Proof.
By Proposition 4.1, it is enough to prove that the sum is small compared to on . We work conditionally on . Choose in with defined in proposition 6.8.
[TABLE]
For small enough, we can assume without loss of generality that is the representative closest to and note . Then:
[TABLE]
where the first line comes from the definition of , the second line from Proposition 6.6 and the third from Proposition 6.9. Thanks to Proposition D.1, we also know that:
[TABLE]
There are at most assignments at distance of and each of them has at most equivalent configurations. Therefore,
[TABLE]
where .
∎
B.9 Proof of Proposition 6.11 (contribution of equivalent assignments)
Proof.
Choose permutations of and assume that . Then . If furthermore , and immediately . We can therefore partition the sum as
[TABLE]
unimodal in , with a mode in . By consistency of , either or and . In the latter case, any other than is bounded away from and thus . In summary,
[TABLE]
∎
B.10 Proof of Corollary 5.1: Behavior of
We may prove the corollary by contradiction. Note first that unless is constrained and with high probability, and exhibit no symmetries. Indeed, equalities like have vanishingly small probabilities of being simultaneously true when is discrete, and even null when is continuous. Assume then or where is a permutation of . Then, by Proposition 3.5 and the consistency of
[TABLE]
But, since and maximise respectively and and have no symmetries, it follows by Theorem 4.2 that
[TABLE]
which contradicts Eq (B.2) and concludes the proof.
B.11 Proof of Corollary 5.2: Behavior of
Remark first that for every and for every ,
[TABLE]
where denotes the dirac mass on . By dividing by , we obtain
[TABLE]
As this inequality is true for every couple , we have in particular:
[TABLE]
Noticing that , Theorem 4.2 therefore leads to the following bounds:
[TABLE]
Again unless is constrained, exhibits no symmetries with high probability and the same proof by contradiction as in appendix B.10 gives the result.
Appendix C Sub-exponential random variables
We now prove two propositions regarding sub-exponential variables. Recall first that a random variable is sub-exponential with parameters if for all such that ,
[TABLE]
In particular, all distributions coming from a natural exponential family are sub-exponential. Sub-exponential variables satisfy a large deviation Bernstein-type inequality:
[TABLE]
So that
[TABLE]
The sub-exponential property is preserved by summation and multiplication.
- If is sub-exponential with parameters and , then so is with parameters
- If the , are sub-exponential with parameters and independent, then so is with parameters
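The Bernstein-type bound and the two stability rules can be illustrated on a standard example. The sketch below uses the parametrization of Wainwright, in which a centered variable is sub-exponential with parameters (nu, b) when its moment generating function at lambda is at most exp(nu^2 lambda^2 / 2) for |lambda| < 1/b; this may differ from the parametrization used in the paper. In that parametrization, Z^2 - 1 with Z standard normal has parameters (2, 4). The function names are hypothetical.

```python
import math
import random


def scale_parameters(nu, b, c):
    """Parameters of c * X when X is sub-exponential with parameters (nu, b)."""
    return abs(c) * nu, abs(c) * b


def sum_parameters(params):
    """Parameters of a sum of independent sub-exponential variables:
    the nu's add in quadrature and b is the largest of the b's."""
    nu = math.sqrt(sum(nu_i * nu_i for nu_i, _ in params))
    b = max(b_i for _, b_i in params)
    return nu, b


def bernstein_bound(t, nu, b):
    """Two-sided tail bound P(|X| >= t) for a centered sub-exponential X."""
    if t <= nu * nu / b:
        return min(1.0, 2.0 * math.exp(-t * t / (2.0 * nu * nu)))
    return min(1.0, 2.0 * math.exp(-t / (2.0 * b)))


# Sum of n_terms independent copies of Z**2 - 1 (illustrative values).
rng = random.Random(3)
n_terms, t, n_rep = 50, 30.0, 2000
nu_sum, b_sum = sum_parameters([(2.0, 4.0)] * n_terms)

sums = [sum(rng.gauss(0.0, 1.0) ** 2 - 1.0 for _ in range(n_terms)) for _ in range(n_rep)]
empirical_tail = sum(abs(s) >= t for s in sums) / n_rep
bound = bernstein_bound(t, nu_sum, b_sum)
print(f"empirical tail {empirical_tail:.4f}, Bernstein bound {bound:.4f}")
```

Here `t` falls in the sub-Gaussian regime of the bound (`t <= nu_sum**2 / b_sum`), so the bound behaves like a Gaussian tail; larger deviations would switch to the exponential regime governed by `b_sum`.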
Theorem C.1 (Equivalent characterizations of sub-exponential variables).
For a zero-mean random variable , the following statements are equivalent:
1. There are non-negative numbers such that
[TABLE]
2. There is a positive number such that for all .
3. There are constants such that
[TABLE]
4. The quantity is finite.
Proof.
A proof of this theorem can be found in Wainwright (2015). ∎
Proposition C.2 (Maximum in ).
Let be any configuration and the -equivalent configuration that achieves , and let (resp. ) and (resp. = ) be as defined in Equations (3.1) and (6.3). Under the assumptions of Section 2.6, for all ,
[TABLE]
Proof.
Note . The numerator within the in the fraction can be expanded to
[TABLE]
and is thus a sum of at most non-null centered sub-exponential random variables with parameters . It is therefore a centered sub-exponential variable with parameters . By Bernstein’s inequality, for all we have
[TABLE]
There are at most at distance of . A union bound shows that:
[TABLE]
where the last equality is true as soon as .
∎
Appendix D Likelihood ratio of assignments
Proposition D.1.
Let be -regular and at -distance of . Then, for all
[TABLE]
Proof.
Note then that:
[TABLE]
where the first inequality comes from the definition of and the second from Lemma B.6 of Brault et al. (2020) and the fact that and are -regular. Finally, local asymptotic normality of the MLE for multinomial proportions ensures that .
- Aicher et al. (2014) C. Aicher, A. Z. Jacobs, and A. Clauset. Learning latent block structure in weighted networks. J. Compl. Net., 3(2):221–248, 2014.
- Ambroise and Matias (2012) C. Ambroise and C. Matias. New consistent and asymptotically normal parameter estimates for random-graph mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(1):3–35, 2012.
- Barbillon et al. (2015) P. Barbillon, S. Donnet, E. Lazega, and A. Bar-Hen. Stochastic block models for multiplex networks: an application to networks of researchers. J. R. Stat. Soc. C-Appl., 2015.
- Bickel et al. (2013) P. Bickel, D. Choi, X. Chang, H. Zhang, et al. Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels. The Annals of Statistics, 41(4):1922–1943, 2013.
- Brault et al. (2020) V. Brault, C. Keribin, and M. Mariadassou. Consistency and Asymptotic Normality of Latent Blocks Model Estimators. Electronic Journal of Statistics, 14(1):123–1268, 2020.
- Celisse et al. (2012) A. Celisse, J.-J. Daudin, L. Pierre, et al. Consistency of maximum-likelihood and variational estimators in the stochastic block model. Electronic Journal of Statistics, 6:1847–1899, 2012.
- Chatterjee (2015) S. Chatterjee. Matrix estimation by universal singular value thresholding. The Annals of Statistics, 43(1):177–214, 2015.
- Choi et al. (2012) D. S. Choi, P. J. Wolfe, and E. M. Airoldi. Stochastic blockmodels with growing number of classes. Biometrika, 99(2):273–284, 2012.
