Consistency and Asymptotic Normality of Stochastic Block Models Estimators from Sampled Data

Mahendra Mariadassou; Timothée Tabouy

arXiv:1903.12488·math.ST·October 21, 2020





Mahendra Mariadassou∗ & Timothée Tabouy†

∗ MaIAGE, INRAE, Université Paris-Saclay, 78352 Jouy-en-Josas, France

† UMR MIA-Paris, AgroParisTech, INRA, Université Paris-Saclay, 75005 Paris, France

Abstract

Statistical analysis of networks is an active research area, and the literature counts many papers concerned with network models and the statistical analysis of networks. However, very few papers deal with missing data in network analysis, although in practice networks are often observed with missing values. In this paper we focus on the Stochastic Block Model with valued edges and consider an MCAR setting by assuming that every dyad (pair of nodes) is sampled identically and independently of the others with probability $\rho>0$. We prove that the maximum likelihood estimator and its variational approximation are consistent and asymptotically normal in the presence of missing data as soon as the sampling probability $\rho$ satisfies $\rho\gg\log(n)/n$.

Keywords: Stochastic Block Model · Maximum Likelihood · Missing data · Concentration Inequality

1 Introduction

For the last decade, statistical network analysis has been a very active research topic, and the statistical modeling of networks has found many applications, for example in social sciences and biology (Aicher et al., 2014; Barbillon et al., 2015; Mariadassou et al., 2010; Wasserman and Faust, 1994; Zachary, 1977).

Many random graph models have been widely studied, either from a theoretical or an empirical point of view. The first model studied was the Erdős-Rényi model (Erdős and Rényi, 1959), which assumes that each pair of nodes (dyad) is connected independently of the others with the same probability. This model assumes homogeneity of all nodes across the network. In order to alleviate this constraint, many families of models have been introduced. Most are endowed with a latent structure (reviewed in Matias and Robin, 2014) to capture heterogeneity across nodes. Among those, the Stochastic Block Model (in short SBM, see Frank and Harary, 1982; Holland et al., 1983) is one of the oldest and most studied, as it is highly flexible and can capture a large variety of structures (affiliation, hub, bipartite and many others). In order to estimate this model, Bayesian approaches were first proposed (Snijders and Nowicki, 1997; Nowicki and Snijders, 2001) but have been superseded by variational methods (Daudin et al., 2008; Latouche et al., 2012). The former class of approaches is exact but lacks the computational efficiency and scalability that the latter offers.

Theoretical guarantees concerning maximum likelihood estimators (in short MLE) and variational estimators (in short VE), based on variational approximations of the likelihood, are quite difficult to obtain for the binary SBM. In Celisse et al. (2012), consistency of the MLE and VE is proven, but asymptotic normality requires that the estimators converge at rate at least $n^{-1}$, which is not proven in the paper, although some results were available for particular cases (affiliation for example). Ambroise and Matias (2012) tackle the specific case of the affiliation model with equal group proportions and prove the consistency and asymptotic normality of parameter estimates. Bickel et al. (2013) extend those results to arbitrary binary SBM graphs and improve on Celisse et al. (2012) by removing the condition on the convergence rate, as it is automatically satisfied by the MLE. Following the path of Bickel et al. (2013), Brault et al. (2020) extended consistency and asymptotic normality of the estimators (MLE and VE) to weighted Latent Block Models where the weight distribution belongs to a one-dimensional exponential family. In particular, considering unbounded edge values invalidates several parts of the proofs for binary graphs and requires substantial adaptations and additional results, notably concentration inequalities for sums of unbounded, non-Gaussian random variables.

Some results are also available for the related semi-parametric problem of assignment reconstruction. Mariadassou and Matias (2015) show that the conditional distribution of the (latent) assignments converges to a degenerate distribution and Rohe et al. (2010) prove that, when the data are generated according to an SBM, spectral methods are consistent. Choi et al. (2012) extend those results to settings where the density of the graph goes to $0$ as $\Omega(\log^{\alpha}(n)/n)$ (for $\alpha$ large enough) and/or the number of groups goes to $+\infty$ as $\sqrt{n}$. Chatterjee (2015) also proves strong results for the estimation of large matrices with noisy entries and partial observation of the dyads, by means of universal singular value thresholding (USVT). In the special case of the binary SBM with $k$ groups, he achieves a reconstruction error rate of order $\sqrt{k/n}$ as soon as the fraction of observed dyads is at least $\Omega(\log^{\alpha}(n)/n)$ (for $\alpha$ large enough). Since USVT replaces missing dyads with $0$s, it naturally achieves the same limiting rate as the sparse setting. Finally, Wang and Bickel (2017) and Hu et al. (2017) show that model selection for the number $k$ of groups is consistent for dense graphs; they suggest using a penalized likelihood criterion with a penalty of the form $\frac{k(k+1)}{2}\log(n)+\lambda n\log(k)$, where $\lambda$ is a tuning parameter.

In this paper we consider a simple setting with a fixed number of groups and fixed density but weighted edges and missing values. In most network studies, there is a strong asymmetry between the presence of an edge and its absence: the lack of proof that an edge exists is taken as proof that the edge does not exist, and edges with uncertain status are considered as non-existent in the graph. This is the strategy adopted in most sparse asymptotic settings, where the density of edges goes to $0$ asymptotically (Bickel et al., 2013). We adopt a different point of view where edges with uncertain status are considered as missing, rather than absent, and their missing nature is explicitly accounted for. We use the framework of Rubin (1976) and its application to network data, see Kolaczyk (2009) and Handcock and Gile (2010), for parameter inference in the presence of missing values, and more specifically its application to the SBM (Tabouy et al., 2019). We prove that, in the MCAR setting where each dyad is missing independently and with the same probability, the MLE and variational estimates are still consistent and asymptotically normal.

The article is organized as follows. In Section 2, we first present the model and missing data theory applied to our context, with some examples of sampling designs. We then posit some definitions and discuss the assumptions required for our results. In Section 3 we establish asymptotic normality for the complete-observed model (i.e. the observed SBM where latent variables are known). Section 4 contains the main result of this paper and states that the observed likelihood behaves like the complete-observed likelihood (i.e. the joint likelihood of the observed data and latent variables) close to its maximum. Consequences for the MLE and variational estimator are discussed in Section 5. The proof is sketched in Section 6. Comparisons with existing results are made and discussed in Section 7. Technical lemmas and details of the proofs are available in the appendices.

2 Statistical framework

2.1 Notations

[TABLE]

2.2 Stochastic Block Model

In the SBM, nodes from a set $\mathcal{N}\triangleq\{1,\dots,n\}$ are distributed among a set $\mathcal{Q}\triangleq\{1,\dots,Q\}$ of hidden blocks that model the latent structure of the graph. The block memberships are encoded by $(z_{i},i\in\mathcal{N})$ where the $z_{i}$ are independent random variables with prior probabilities $\alpha=(\alpha_{1},\dots,\alpha_{Q})$, such that $\mathbb{P}(z_{i}=q)=\alpha_{q}$, for all $q\in\mathcal{Q}$. The value $y_{ij}$ of any dyad $(i,j)$ in $\mathcal{D}=\mathcal{N}\times\mathcal{N}$, with $i\neq j$, only depends on the blocks $i$ and $j$ belong to. The variables $(y_{ij})$ are thus independent conditionally on the $(z_{i})$:

$$y_{ij}\mid z_{i}=q,\,z_{j}=\ell\ \overset{\text{ind}}{\sim}\ \varphi(\cdot,\pi_{q\ell}),\quad\forall(i,j)\in\mathcal{D},\ i\neq j,\quad\forall(q,\ell)\in\mathcal{Q}\times\mathcal{Q}.$$

In the following, $\mathbf{y}=(y_{ij})_{(i,j)\in\mathcal{D}}$ is the $n\times n$ adjacency matrix of the random graph and $\mathbf{z}=(z_{1},\dots,z_{n})$ the $n$-vector of the latent blocks. With a slight abuse of notation, we associate to $z_{i}$ a binary vector $(z_{i1},\dots,z_{iQ})$ such that $z_{i}=q\Leftrightarrow z_{iq}=1,z_{i\ell}=0$, for all $\ell\neq q$. In this case $\mathbf{z}$ is an $n\times Q$ matrix.

We denote the complete parameter set by $\boldsymbol{\theta}=(\boldsymbol{\alpha},\boldsymbol{\pi})\in\boldsymbol{\Theta}$, where $\boldsymbol{\Theta}$ stands for the parameter space. When performing inference from data, we denote by $\boldsymbol{\theta}^{\star}=(\boldsymbol{\alpha}^{\star},\boldsymbol{\pi}^{\star})$ the true parameter set, i.e. the parameter values used to generate the data, and by $\mathbf{z}^{\star}$ the true (and usually unobserved) memberships of nodes. For any $\mathbf{z}$, we also define:

• $z_{+q}=\sum_{i}z_{iq}$, the size of the $q^{th}$ community (or block) for membership $\mathbf{z}$;

• $z^{\star}_{+q}$, its counterpart for $\mathbf{z}^{\star}$.
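As a minimal sketch of the generative model above, the following Python snippet draws block memberships and binary dyads, then computes the block sizes $z_{+q}$. The function names, the Bernoulli emission and the parameter values are illustrative assumptions, not taken from the paper; here `pi[q][l]` is used directly as an edge probability rather than as a canonical parameter.

```python
import random

def sample_sbm(n, alpha, pi, seed=0):
    """Draw (z, y) from a binary SBM: z_i ~ alpha, y_ij | z ~ Bernoulli(pi[z_i][z_j])."""
    rng = random.Random(seed)
    Q = len(alpha)
    # Latent block of each node, drawn independently with probabilities alpha.
    z = rng.choices(range(Q), weights=alpha, k=n)
    # Dyads are independent given z; the diagonal (i == j) is left undefined.
    y = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                y[i][j] = 1 if rng.random() < pi[z[i]][z[j]] else 0
    return z, y

def block_sizes(z, Q):
    """z_{+q}: number of nodes assigned to block q."""
    return [sum(1 for zi in z if zi == q) for q in range(Q)]

alpha = [0.3, 0.7]                 # hypothetical block proportions
pi = [[0.8, 0.1], [0.1, 0.4]]      # hypothetical edge probabilities
z, y = sample_sbm(50, alpha, pi)
sizes = block_sizes(z, len(alpha))
```

Each ordered pair $(i,j)$ is drawn separately, which matches a directed graph on $\mathcal{D}=\mathcal{N}\times\mathcal{N}$; an undirected variant would draw only $i<j$ and mirror the value.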

2.3 Missing data for SBM

Regarding SBM inference, a missing value corresponds to a missing entry in the adjacency matrix $\mathbf{y}$ , typically denoted by NA’s. We rely on the $n\times n$ sampling matrix $\mathbf{r}$ to record the missing state of each entry:

$$r_{ij}=\begin{cases}1&\text{if }y_{ij}\text{ is observed,}\\0&\text{otherwise.}\end{cases}$$

As a shortcut, we use $\mathbf{y}^{\text{\rm o}}=\{y_{ij}:r_{ij}=1\}$ and $\mathbf{y}^{\text{\rm m}}=\{y_{ij}:r_{ij}=0\}$ to respectively denote the observed and missing dyads. The sampling design is the description of the stochastic process that generates $\mathbf{r}$. It is assumed that the network exists before the sampling design acts upon it. The sampling design is fully characterized by the conditional distribution $p_{\psi}(\mathbf{r}|\mathbf{y})$, whose parameter $\psi$ is such that $\theta$ and $\psi$ live in a product space $\Theta\times\Psi$. In this paper we focus on a specific type of missingness, called missing completely at random (MCAR), for which $p_{\psi}(\mathbf{r}|\mathbf{y})=p_{\psi}(\mathbf{r})$, and leave aside more complex forms of dependency such as missing at random (MAR) and not missing at random (NMAR).

We then follow the framework of Rubin (1976) and Tabouy et al. (2019) for missing data and define the joint probability density function as

$$p_{\theta,\psi}(\mathbf{y}^{\text{\rm o}},\mathbf{z},\mathbf{r})=\int p_{\theta}(\mathbf{y}^{\text{\rm o}},\mathbf{y}^{\text{\rm m}},\mathbf{z})\,p_{\psi}(\mathbf{r}\mid\mathbf{y}^{\text{\rm o}},\mathbf{y}^{\text{\rm m}},\mathbf{z})\,d\mathbf{y}^{\text{\rm m}}.\tag{2.2}$$

Property 2.1.

According to Equation (2.2), if the sampling design is MCAR, then maximising $p_{\theta,\psi}(\mathbf{y}^{\text{\rm o}},\mathbf{z},\mathbf{r})$ or $p_{\theta,\psi}(\mathbf{y}^{\text{\rm o}},\mathbf{r})$ in $\theta$ is equivalent to maximising $p_{\theta}(\mathbf{y}^{\text{\rm o}})$ in $\theta$. This corresponds to the notion of ignorability defined in Rubin (1976).

2.4 Sampling design examples

We present here some examples of sampling designs to illustrate differences between notions of MCAR, MAR and NMAR.

Definition 2.2 (Random dyad sampling).

Each dyad $(i,j)\in\mathcal{D}$ has the same probability $\mathbb{P}(r_{ij}=1)=\rho$ of being observed, independently of the others. This design is MCAR.

Definition 2.3 (Random node sampling).

The random node sampling consists in selecting independently with probability $\rho$ a set of nodes and then observing the corresponding rows and columns of matrix $\mathbf{y}$ .

The major point in both examples is that the probability ( $\rho$ in random dyad sampling and $1-(1-\rho)^{2}$ in the random node sampling) of observing a dyad does not depend on its value. In contrast, the following dyad-centered sampling design adapted to binary networks is NMAR since the probability to observe a dyad depends on its value:

Definition 2.4 (Double standard sampling).

Each dyad $(i,j)\in\mathcal{D}$ is observed, independently of other dyads, with a probability depending on its value: $\mathbb{P}(r_{ij}=1|y_{ij}=0)=\rho_{0}$ and $\mathbb{P}(r_{ij}=1|y_{ij}=1)=\rho_{1}$ .

For non-binary networks, specifying the sampling design is more involved and requires defining the sampling density for every possible value of $y_{ij}$ , e.g. $(\mathbb{P}(r_{ij}=1|y_{ij}=k))_{k\in\mathbb{N}}$ for Poisson-valued edges.
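The two value-independent designs above can be sketched as follows. This is an illustrative simulation, not the authors' code: the function names are hypothetical, each ordered dyad gets its own $r_{ij}$ (an assumption, since symmetry of $\mathbf{r}$ is not specified here), and the diagonal is treated as unobserved. The empirical fraction of observed dyads should be close to $\rho$ under random dyad sampling and close to $1-(1-\rho)^{2}$ under random node sampling.

```python
import random

def dyad_sampling(n, rho, rng):
    """Random dyad sampling: each r_ij (i != j) is Bernoulli(rho), independently."""
    return [[1 if i != j and rng.random() < rho else 0 for j in range(n)]
            for i in range(n)]

def node_sampling(n, rho, rng):
    """Random node sampling: pick nodes w.p. rho, observe their rows and columns."""
    picked = [rng.random() < rho for _ in range(n)]
    return [[1 if i != j and (picked[i] or picked[j]) else 0 for j in range(n)]
            for i in range(n)]

def observed_fraction(r):
    """Share of off-diagonal dyads that are observed."""
    n = len(r)
    return sum(map(sum, r)) / (n * (n - 1))

rng = random.Random(1)
n, rho = 600, 0.3
frac_dyad = observed_fraction(dyad_sampling(n, rho, rng))
frac_node = observed_fraction(node_sampling(n, rho, rng))
expected_node = 1 - (1 - rho) ** 2   # probability that at least one endpoint is picked
```

Under node sampling the entries of $\mathbf{r}$ that share a row or a column are dependent, which is why the observed fraction fluctuates much more from one draw to the next than under dyad sampling.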

Remark 2.5.

In this paper, we focus on data sampled according to random dyad sampling, which is the simplest case but already yields valuable insights into the differences between the partially and fully sampled settings.

As observed above, there are however many other ways to sample a network. In the case of node-centered sampling designs, like random node sampling, the main difficulty in proving consistency and asymptotic normality is the dependency between the $r_{ij}$ variables. Indeed, in random node sampling, the variable $r_{i_{0}j_{0}}$ depends on all $r_{ij_{0}}$ and $r_{i_{0}j}$ (for all $i,j\in\mathcal{N}$). As a consequence, a different inference strategy is required and many results proved in this paper are not valid under random node sampling. NMAR sampling designs raise problems of their own: each design requires its own estimation procedure (Tabouy et al., 2019) and therefore its own analysis. For example, parameter estimation under the seemingly simple double standard sampling for binary networks is still an open problem: numerical experiments suggest that $\boldsymbol{\theta}=(\boldsymbol{\pi},\boldsymbol{\alpha})$ and $\boldsymbol{\psi}=(\rho_{0},\rho_{1})$ are jointly identifiable but there is no formal proof.

2.5 Observed-likelihoods

When the labels are known, the complete-observed log-likelihood is given by:

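$$\mathcal{L}_{co}(\mathbf{z};\boldsymbol{\theta})=\log p(\mathbf{y}^{o},\mathbf{z};\boldsymbol{\theta})=\sum_{i,q}z_{iq}\log\alpha_{q}+\sum_{\substack{i,j,q,\ell\\ i\neq j}}z_{iq}z_{j\ell}r_{ij}\log\varphi(y_{ij};\pi_{q\ell}).\tag{2.3}$$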

But the labels are usually unobserved, and the observed log-likelihood is obtained by integration over all memberships:

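$$\mathcal{L}_{o}(\boldsymbol{\theta})=\log p(\mathbf{y}^{o};\boldsymbol{\theta})=\log\Big(\sum_{\mathbf{z}\in\mathcal{Z}}p(\mathbf{y}^{o},\mathbf{z};\boldsymbol{\theta})\Big).$$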
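To make the two likelihoods concrete, here is a small sketch under stated assumptions: a Bernoulli emission parametrized by edge probabilities `p[q][l]`, hypothetical function names, and a brute-force sum over all $Q^{n}$ assignments that is only feasible for toy sizes. The complete-observed log-likelihood adds $\log\alpha_{z_{i}}$ over nodes and the emission log-density over observed dyads only; the observed log-likelihood is the log of the sum of the exponentiated complete-observed terms.

```python
import itertools
import math

def log_bernoulli(y, p):
    """Log-density of a Bernoulli(p) variable at y in {0, 1}."""
    return math.log(p) if y == 1 else math.log(1.0 - p)

def complete_observed_loglik(y, r, z, alpha, p):
    """Sum of log alpha over nodes plus emission log-densities over observed dyads."""
    n = len(z)
    ll = sum(math.log(alpha[zi]) for zi in z)
    for i in range(n):
        for j in range(n):
            if i != j and r[i][j] == 1:      # missing dyads contribute nothing
                ll += log_bernoulli(y[i][j], p[z[i]][z[j]])
    return ll

def observed_loglik(y, r, alpha, p):
    """Log of the sum over all Q^n assignments (tiny n only)."""
    n, Q = len(y), len(alpha)
    terms = [complete_observed_loglik(y, r, z, alpha, p)
             for z in itertools.product(range(Q), repeat=n)]
    m = max(terms)                            # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(t - m) for t in terms))

# Toy data: 3 nodes, dyads (0, 2) and (2, 0) missing.
y = [[0, 1, 0], [1, 0, 1], [0, 0, 0]]
r = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
alpha, p = [0.5, 0.5], [[0.7, 0.2], [0.2, 0.6]]
L_o = observed_loglik(y, r, alpha, p)
L_co = complete_observed_loglik(y, r, (0, 0, 1), alpha, p)
```

The exponential cost of the sum over assignments is the reason why variational approximations are used in practice.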

2.6 Models and Assumptions

We focus here on parametric models where $\varphi$ belongs to a regular one-dimensional exponential family in canonical form:

$$\varphi(y,\pi)=b(y)\exp(\pi y-\psi(\pi)),\tag{2.5}$$

where $\pi$ belongs to the space $\mathcal{A}$, so that $\varphi(\cdot,\pi)$ is well defined for all $\pi\in\mathcal{A}$. Classical properties of exponential families ensure that $\psi$ is convex and infinitely differentiable on $\mathring{\mathcal{A}}$, and that $(\psi^{\prime})^{-1}$ is well defined on $\psi^{\prime}(\mathring{\mathcal{A}})$. Furthermore, when $y_{\pi}\sim\varphi(.,\pi)$, $\mathbb{E}[y_{\pi}]=\psi^{\prime}(\pi)$ and $\mathbb{V}[y_{\pi}]=\psi^{\prime\prime}(\pi)$.
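As a quick numerical illustration of the mean and variance identities, the sketch below uses two standard members of this family, chosen here as examples: the Bernoulli distribution, with $\pi=\log(p/(1-p))$ and $\psi(\pi)=\log(1+e^{\pi})$, and the Poisson distribution, with $\pi=\log\lambda$ and $\psi(\pi)=e^{\pi}$. Derivatives of $\psi$ are approximated by finite differences; the helper names and numerical values are hypothetical.

```python
import math

# Log-partition functions in canonical form (standard facts, not specific to the paper).
def psi_bernoulli(pi):
    return math.log1p(math.exp(pi))

def psi_poisson(pi):
    return math.exp(pi)

def derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def second_derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

p = 0.3
pi_b = math.log(p / (1 - p))                      # canonical parameter of Bernoulli(p)
mean_b = derivative(psi_bernoulli, pi_b)          # close to p
var_b = second_derivative(psi_bernoulli, pi_b)    # close to p * (1 - p)

lam = 2.5
pi_p = math.log(lam)                              # canonical parameter of Poisson(lam)
mean_p = derivative(psi_poisson, pi_p)            # close to lam
var_p = second_derivative(psi_poisson, pi_p)      # close to lam
```

In both cases the first derivative recovers the mean and the second derivative recovers the variance, up to finite-difference error.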

In the following, we recall that missing data are assumed to be produced according to random dyad sampling with parameter $\rho>0$.

Moreover, we make the following assumptions on the parameter space and the asymptotics of $\rho$ :

1. $A_{0}$: $\rho$ goes to $0$ but satisfies $\rho\gg\log(n)/n$.

2. $A_{1}$: There exist a positive constant $c$ and a compact interval $C_{\pi}$ such that

$$\boldsymbol{\Theta}\subset[c,1-c]^{Q}\times C_{\pi}^{Q\times Q}\quad\text{with}\quad C_{\pi}\subset\mathring{\mathcal{A}}.$$

3. $A_{2}$: The true parameter $\boldsymbol{\theta}^{\star}=(\boldsymbol{\alpha}^{\star},\boldsymbol{\pi}^{\star})$ lies in the interior of $\boldsymbol{\Theta}$.

4. $A_{3}$: The map $\pi\mapsto\varphi(\cdot,\pi)$ is injective.

5. $A_{4}$: The coordinates of $\boldsymbol{\alpha}^{\star}\psi^{\prime}(\boldsymbol{\pi}^{\star})$, where $\psi^{\prime}$ is applied component-wise, are pairwise distinct.

The previous assumptions are standard. Assumption $A_{0}$ ensures that the fraction of observed dyads is not too small. Assumption $A_{1}$ ensures that the group proportions are bounded away from $0$ and $1$ so that no group disappears when $n$ goes to infinity. It also ensures that $\pi$ is bounded away from the boundary of $\mathcal{A}$. This is essential for the subexponential properties of Propositions 2.9 and 2.10. $A_{2}$ is in line with standard assumptions in parametric statistics. $A_{3}$ is necessary for identifiability purposes: the model is trivially not identifiable if the map $\pi\mapsto\varphi(.,\pi)$ is not injective. $A_{4}$ ensures identifiability of SBM parameters under random dyad sampling. Note that, combined with $A_{3}$, it implies that all columns and all rows of $\boldsymbol{\pi}^{\star}$ are distinct and therefore that no two groups have the same connectivity profile. In the following, we consider the number of blocks $Q$ to be known.

2.7 Identifiability

Since $\mathbf{r}$ is independent of $\mathbf{y}$, the identifiability of the SBM with emission law in a one-dimensional exponential family under random dyad sampling can be established in two steps: first for the sampling parameter $\rho$, and then for the SBM parameters $\boldsymbol{\theta}^{\star}=(\boldsymbol{\alpha}^{\star},\boldsymbol{\pi}^{\star})$ given $\rho$.

Proposition 2.6.

The sampling parameter $\rho>0$ of random dyad sampling is identifiable w.r.t. the sampling distribution.

Proof.

See Tabouy et al. (2019). The proof does not rely on $\mathbf{y}$ being binary and also holds for $\mathbf{y}$ distributed as in Eq. (2.5). ∎

Proposition 2.7.

Let $n\geq 2Q$ and assume that $\rho>0$, that $\alpha^{\star}_{q}>0$ for any $1\leq q\leq Q$, and that the coordinates of $\boldsymbol{\alpha}^{\star}\psi^{\prime}(\boldsymbol{\pi}^{\star})$, where $\psi^{\prime}$ is applied component-wise, are pairwise distinct. Then, under random dyad sampling, SBM parameters are identifiable w.r.t. the distribution of the observed part of the SBM up to label switching.

Proof.

The proof is nearly identical to the one written in Tabouy et al. (2019) and inspired by Celisse et al. (2012) for the binary SBM under random dyad sampling. Substituting $\mathbb{E}[y_{ij}|z_{i}=q]$ for $s_{q}$ in that proof ensures that $\boldsymbol{\alpha}^{\star}$ is identifiable. Finally, the fact that $(\psi^{\prime})^{-1}$ is a one-to-one map ensures that $\boldsymbol{\pi}^{\star}$ is identifiable. ∎

Note that asymptotically, the assumption $n\geq 2Q$ is always satisfied since $Q$ is fixed and $n$ grows to infinity.

2.8 Subexponential variables

Remark 2.8.

Since we restricted $\pi$ to a compact subset of $\mathring{\mathcal{A}}$, the variance of $y_{\pi}$ is bounded away from $0$ and $+\infty$. We note

$$\bar{\sigma}^{2}=\sup_{\pi\in C_{\pi}}\mathbb{V}(y_{\pi})<+\infty\quad\text{and}\quad\underline{\sigma}^{2}=\inf_{\pi\in C_{\pi}}\mathbb{V}(y_{\pi})>0.$$

Similarly, since $\pi$ belongs to a compact subset of an open interval, there exists a constant $\kappa>0$ such that $[\pi-\kappa,\pi+\kappa]\subset\mathring{\mathcal{A}}$ uniformly over all $\pi\in C_{\pi}$.

Proposition 2.9.

With the previous notations, if $\pi\in C_{\pi}$ and $y_{\pi}\sim\varphi(.,\pi)$ , then $y_{\pi}$ is subexponential with parameters $(\bar{\sigma}^{2},\kappa^{-1})$ .

Proposition 2.10.

Consider $x=y_{\pi}r_{ij}+\lambda r_{ij}$ (we recall that $r_{ij}\sim\mathcal{B}(\rho)$), with $r_{ij}$ independent of $y_{\pi}$ and $\lambda\in\mathbb{R}$ bounded. Then there exist non-negative numbers $\nu$ and $b$ such that $x$ is subexponential with parameters $(\nu^{2},b^{-1})$.

Proof.

These results derive directly from Theorem C.1 (statement 2). ∎

2.9 Symmetry

We now introduce the concepts of assignment and parameter symmetries, which must be accounted for when studying the asymptotic properties of the MLE. Complications stemming from symmetries are related to, but not equivalent to, the problem of label-switching in mixture models.

Definition 2.11 (permutation).

Let $s$ be a permutation on $\{1,\dots,Q\}$ . If $\boldsymbol{A}$ is a matrix with $Q$ columns and $n$ rows, we define $\boldsymbol{A}^{s}$ as the matrix obtained by permuting the columns of $\boldsymbol{A}$ according to $s$ , i.e. for any row $i$ and column $q$ of $\boldsymbol{A}$ , ${A}^{s}_{iq}=A_{is(q)}$ . If $\boldsymbol{C}$ is a matrix with $Q$ rows and $Q$ columns, $\boldsymbol{C}^{s}$ is defined similarly:

$$\boldsymbol{A}^{s}=(A_{is(q)})_{i,q}\qquad\boldsymbol{C}^{s}=(C_{s(q)s(\ell)})_{q,\ell}$$

Definition 2.12 (equivalence).

We define the following equivalence relationships:

• Two assignments $\mathbf{z}$ and $\mathbf{z}^{\prime}$ are equivalent, noted $\sim$, if they are equal up to label permutation, i.e. there exists a permutation $s$ such that $\mathbf{z}^{\prime}=\mathbf{z}^{s}$.

• Two parameters $\boldsymbol{\theta}$ and $\boldsymbol{\theta}^{\prime}$ are equivalent, noted $\sim$, if they are equal up to label permutation, i.e. there exists a permutation $s$ such that $(\boldsymbol{\alpha}^{s},\boldsymbol{\pi}^{s})=(\boldsymbol{\alpha}^{\prime},\boldsymbol{\pi}^{\prime})$.

• $(\boldsymbol{\theta},\mathbf{z})$ and $(\boldsymbol{\theta}^{\prime},\mathbf{z}^{\prime})$ are equivalent, noted $\sim$, if they are equal up to label permutation on $\boldsymbol{\pi}$ and $\mathbf{z}$, i.e. there exists a permutation $s$ such that $(\boldsymbol{\pi}^{s},\mathbf{z}^{s})=(\boldsymbol{\pi}^{\prime},\mathbf{z}^{\prime})$. This is label-switching.

Definition 2.13 (symmetry).

We say that the parameter $\boldsymbol{\theta}$ exhibits symmetry for the permutation $s$ if

$$(\boldsymbol{\alpha}^{s},\boldsymbol{\pi}^{s})=(\boldsymbol{\alpha},\boldsymbol{\pi}).$$

$\boldsymbol{\theta}$ exhibits symmetry if it exhibits symmetry for at least one non-trivial permutation $s$. Finally, the set of permutations for which $\boldsymbol{\theta}$ exhibits symmetry is denoted $\operatorname{Sym}(\boldsymbol{\theta})$.

Remark 2.14.

The set of parameters that exhibit symmetry is a manifold of null Lebesgue measure in $\boldsymbol{\Theta}$. The notion of symmetry allows us to deal with a notion of non-identifiability of the class labels that is subtler than and different from label switching. More precisely:

• Label switching is when $p(\mathbf{y}^{o},\mathbf{z};\boldsymbol{\theta})=p(\mathbf{y}^{o},\mathbf{z}^{s};\boldsymbol{\theta}^{s})$.

• Symmetry is when $p(\mathbf{y}^{o},\mathbf{z};\boldsymbol{\theta})=p(\mathbf{y}^{o},\mathbf{z}^{s};\boldsymbol{\theta})$.

In particular, in label-switching, $\mathbf{z}$ and $\mathbf{z}^{s}$ have the same likelihood but under equivalent yet different parameters $\boldsymbol{\theta}$ and $\boldsymbol{\theta}^{s}$. In contrast, in the presence of symmetry, $\mathbf{z}$ and $\mathbf{z}^{s}$ have exactly the same likelihood under $\boldsymbol{\theta}$. This implies in particular that the posterior $p(\mathbf{z}|\mathbf{y^{o}},\boldsymbol{\theta})$ cannot concentrate on a single assignment. This is instrumental for Proposition 6.11.

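Example 1.

In this example we illustrate what $\textrm{Sym}(\theta)$ and its cardinality can be in a simple case. Consider a network with $n$ nodes,

$$\alpha=(1/6,1/6,2/3)\quad\text{and}\quad\pi=\begin{pmatrix}0&0.7&0.2\\ 0.7&0&0.2\\ 0.2&0.2&0.2\end{pmatrix}.$$

As a consequence, the two following assignments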

[TABLE]

belong to $\textrm{Sym}(\theta)=\{Id,[1,2]\}$. Indeed, they are the only assignments belonging to $\textrm{Sym}(\theta)$, so in this particular case $\#\textrm{Sym}(\theta)=2$.
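The set $\operatorname{Sym}(\boldsymbol{\theta})$ can be enumerated by brute force for small $Q$. The sketch below checks, for every permutation $s$, whether permuting $\boldsymbol{\alpha}$ and the rows and columns of $\boldsymbol{\pi}$ leaves the parameter unchanged, using $\alpha=(1/6,1/6,2/3)$ and the $3\times 3$ matrix $\pi$ with rows $(0,0.7,0.2)$, $(0.7,0,0.2)$ and $(0.2,0.2,0.2)$. Function names are hypothetical, blocks are indexed from 0, and exact fractions avoid floating-point comparisons.

```python
from fractions import Fraction
from itertools import permutations

def permute_parameter(alpha, pi, s):
    """Apply s to theta: alpha^s_q = alpha_{s(q)}, pi^s_{ql} = pi_{s(q)s(l)}."""
    Q = len(alpha)
    alpha_s = [alpha[s[q]] for q in range(Q)]
    pi_s = [[pi[s[q]][s[l]] for l in range(Q)] for q in range(Q)]
    return alpha_s, pi_s

def symmetries(alpha, pi):
    """All permutations s with (alpha^s, pi^s) == (alpha, pi)."""
    Q = len(alpha)
    return [s for s in permutations(range(Q))
            if permute_parameter(alpha, pi, s) == (alpha, pi)]

F = Fraction
alpha = [F(1, 6), F(1, 6), F(2, 3)]
pi = [[F(0), F(7, 10), F(2, 10)],
      [F(7, 10), F(0), F(2, 10)],
      [F(2, 10), F(2, 10), F(2, 10)]]
sym = symmetries(alpha, pi)   # identity and the swap of the first two blocks
```

With these values the enumeration returns two permutations, the identity and the transposition of the first two blocks, in line with $\#\textrm{Sym}(\theta)=2$.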

The issue of symmetry forces us to use a notion of distance between assignments that is invariant to label permutation.

Definition 2.15 (distance).

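We define the following distance, up to equivalence, between configurations $\mathbf{z}$ and $\mathbf{z}^{\star}$:

$$\left\|\mathbf{z}-\mathbf{z}^{\star}\right\|_{0,\sim}=\inf_{\mathbf{z}^{\prime}\sim\mathbf{z}}\left\|\mathbf{z}^{\prime}-\mathbf{z}^{\star}\right\|_{0}$$

where, for any matrix $\mathbf{z}$, we use the Hamming norm $\left\|\cdot\right\|_{0}$ defined by

$$\left\|\mathbf{z}\right\|_{0}=\frac{1}{2}\sum_{i,q}\mathds{1}\{z_{iq}\neq 0\}.$$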

Definition 2.16 (Set of local assignments).

We note $S(\mathbf{z}^{\star},r)$ the set of configurations that have a representative (for $\sim$ ) within relative radius $r$ of $\mathbf{z}^{\star}$ :

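$$S(\mathbf{z}^{\star},r)=\left\{\mathbf{z}:\left\|\mathbf{z}-\mathbf{z}^{\star}\right\|_{0,\sim}\leq rn\right\}$$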
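For one-hot membership matrices, a misassigned node produces two nonzero entries in the difference of the two matrices, so with the factor $1/2$ the Hamming norm counts misassigned nodes. Under that reading, the distance up to equivalence is the smallest number of disagreements over all relabelings, and membership in $S(\mathbf{z}^{\star},r)$ is a threshold on it. The sketch below is a brute-force illustration for small $Q$; function names and the toy assignments are hypothetical, and labels are coded as integers from 0.

```python
from itertools import permutations

def hamming_up_to_relabeling(z, z_star, Q):
    """Minimum over label permutations s of the number of nodes with s(z_i) != z*_i."""
    best = len(z)
    for s in permutations(range(Q)):
        mismatches = sum(1 for zi, zs in zip(z, z_star) if s[zi] != zs)
        best = min(best, mismatches)
    return best

def in_local_set(z, z_star, Q, r):
    """Membership in S(z*, r): distance up to equivalence at most r * n."""
    return hamming_up_to_relabeling(z, z_star, Q) <= r * len(z_star)

z_star = [0, 0, 0, 1, 1, 2]
z_swapped = [1, 1, 1, 0, 0, 2]     # same partition, labels 0 and 1 exchanged
z_one_off = [1, 1, 2, 0, 0, 2]     # additionally moves one node to another block
```

Enumerating all $Q!$ permutations is fine for a handful of blocks; for larger $Q$ the same minimum can be obtained by solving an assignment problem on the confusion counts.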

2.10 Other definitions

We finally introduce a few useful notions that will be instrumental in the proofs. The first is “regular” assignments, for which each group has “enough” nodes:

Definition 2.17 ( $c$ -regular assignments).

Let $\mathbf{z}\in\mathcal{Z}$. For any $c>0$, we say that $\mathbf{z}$ is $c$-regular if

$$\min_{q}z_{+q}\geq cn.$$

Class distinctness $\delta(\boldsymbol{\pi})$ captures the differences between groups: lower values of $\delta(\boldsymbol{\pi})$ mean that at least two classes have very similar connectivity profiles. $\delta(\boldsymbol{\pi})$ is intrinsically linked to the convergence rate of several estimates.

Definition 2.18 (class distinctness).

For $\boldsymbol{\theta}=(\boldsymbol{\alpha},\boldsymbol{\pi})\in\boldsymbol{\Theta}$, we define:

$$\delta(\boldsymbol{\pi})=\min_{q\neq q^{\prime}}\max_{\ell}\operatorname{KL}(\pi_{q\ell},\pi_{q^{\prime}\ell})$$

with $\operatorname{KL}(\pi,\pi^{\prime})=\mathbb{E}_{\pi}[\log(\varphi(Y,\pi)/\varphi(Y,\pi^{\prime}))]=\psi^{\prime}(\pi)(\pi-\pi^{\prime})+\psi(\pi^{\prime})-\psi(\pi)$ the Kullback-Leibler divergence between $\varphi(.,\pi)$ and $\varphi(.,\pi^{\prime})$, when $\varphi$ comes from an exponential family.
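Class distinctness can be computed directly for a Bernoulli emission. In this sketch the minimum is taken over pairs of distinct blocks, which is how we read the definition, and the divergence is evaluated both in the mean parametrization and through the canonical formula $\psi^{\prime}(\pi)(\pi-\pi^{\prime})+\psi(\pi^{\prime})-\psi(\pi)$ with $\psi(\pi)=\log(1+e^{\pi})$. Function names and the two probability matrices are hypothetical.

```python
import math

def kl_bernoulli(p, p_prime):
    """KL(Bernoulli(p) || Bernoulli(p')), with p and p' strictly inside (0, 1)."""
    return p * math.log(p / p_prime) + (1 - p) * math.log((1 - p) / (1 - p_prime))

def kl_canonical(pi, pi_prime):
    """Same divergence via psi'(pi)(pi - pi') + psi(pi') - psi(pi)."""
    psi = lambda x: math.log1p(math.exp(x))
    mean = 1 / (1 + math.exp(-pi))          # psi'(pi) for the Bernoulli family
    return mean * (pi - pi_prime) + psi(pi_prime) - psi(pi)

def class_distinctness(P):
    """min over pairs q != q' of max over l of KL(P[q][l], P[q'][l])."""
    Q = len(P)
    return min(max(kl_bernoulli(P[q][l], P[qp][l]) for l in range(Q))
               for q in range(Q) for qp in range(Q) if q != qp)

P_separated = [[0.8, 0.1], [0.1, 0.6]]      # hypothetical, clearly distinct rows
P_close = [[0.50, 0.30], [0.52, 0.31]]      # hypothetical, nearly identical rows
delta_separated = class_distinctness(P_separated)
delta_close = class_distinctness(P_close)
```

Nearly identical rows give a value close to zero, which is the regime where blocks are hard to tell apart.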

Remark 2.19.

Since all $\boldsymbol{\pi}$ have distinct rows and columns, $\delta(\boldsymbol{\pi})>0$ .

Finally, the confusion matrix allows us to compare groups between assignments:

Definition 2.20 (confusion matrix).

For given assignments $\mathbf{z}$ and $\mathbf{z}^{\star}$, we define the confusion matrix between $\mathbf{z}$ and $\mathbf{z}^{\star}$, noted ${I\!R}(\mathbf{z})$, as follows:

$${I\!R}(\mathbf{z})_{qq^{\prime}}=\frac{1}{n}\sum_{i}z^{\star}_{iq}z_{iq^{\prime}}$$
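A short sketch of the confusion matrix, read as the proportion of nodes that belong to block $q$ under $\mathbf{z}^{\star}$ and to block $q^{\prime}$ under $\mathbf{z}$. The function name and toy assignments are hypothetical, with labels coded as integers from 0.

```python
def confusion_matrix(z, z_star, Q):
    """Entry [q][q'] = (1/n) * number of nodes with z*_i = q and z_i = q'."""
    n = len(z_star)
    counts = [[0] * Q for _ in range(Q)]
    for zi, zs in zip(z, z_star):
        counts[zs][zi] += 1
    return [[c / n for c in row] for row in counts]

z_star = [0, 0, 0, 1, 1, 2]
z = [1, 1, 2, 0, 0, 2]
IR = confusion_matrix(z, z_star, 3)
```

Row sums give the block proportions of $\mathbf{z}^{\star}$ and column sums those of $\mathbf{z}$; when the two assignments are equivalent, the matrix has exactly one nonzero entry per row and per column.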

Definition 2.21.

For conciseness, we define

$$S^{\star}=(S^{\star}_{q\ell})_{q\ell}=\left(\psi^{\prime}(\pi^{\star}_{q\ell})\right)_{q\ell}$$

3 Complete-observed Model

Hereafter and in the rest of the text, we use the term "complete" to say that true assignments $\mathbf{z}^{\star}$ are known, and "observed" to say that only some dyads are observed. In the following we study the asymptotic properties of the complete-observed data model.

Proposition 3.1.

Under random dyad sampling, define $N_{i}=\sum_{j}r_{ij}$ and let $\Omega_{0,n}=\cap_{i=1}^{n}\{N_{i}\geqslant 1\}$ be the event that every node has at least one observed dyad. Then

$$\mathbb{P}\left(\lim_{n\to+\infty}\Omega_{0,n}\right)=1.$$

Proof.

This proposition is a direct consequence of the Borel-Cantelli lemma. Details are available in Appendix A. ∎
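To see why the rate $\rho\gg\log(n)/n$ matters here, one can look at a simple union bound rather than at the full argument of the appendix. Assuming each $N_{i}$ is a sum of $n-1$ independent Bernoulli($\rho$) variables (the diagonal being excluded, an assumption of this sketch), the probability that some node has no observed dyad is at most $n(1-\rho)^{n-1}\leq n\exp(-\rho(n-1))$. The two rates below are hypothetical choices on either side of the threshold.

```python
import math

def prob_some_node_unobserved_bound(n, rho):
    """Union bound: P(some N_i = 0) <= n * (1 - rho)^(n - 1)."""
    return n * (1.0 - rho) ** (n - 1)

def rho_fast(n):
    """Hypothetical sampling rate with rho >> log(n) / n."""
    return math.log(n) ** 2 / n

def rho_slow(n):
    """Hypothetical sampling rate below the log(n) / n threshold."""
    return 0.5 * math.log(n) / n

sizes = [10**2, 10**3, 10**4, 10**5]
bound_fast = [prob_some_node_unobserved_bound(n, rho_fast(n)) for n in sizes]
bound_slow = [prob_some_node_unobserved_bound(n, rho_slow(n)) for n in sizes]
```

With the faster rate the bound decays very quickly with $n$, whereas with $\rho=\log(n)/(2n)$ it grows roughly like $\sqrt{n}$ and gives no control.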

Remark 3.2.

This result shows that, with high probability, the network has no unobserved node. In the remainder, we work conditionally on $\Omega_{0,n}$.

Let $\widehat{\boldsymbol{\theta}}_{c}=\left(\widehat{\boldsymbol{\alpha}},\widehat{\boldsymbol{\pi}}\right)$ be the MLE of $\boldsymbol{\theta}$ in the complete-observed data model. Simple manipulations of Equation (2.3) yield:

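$$\hat{\alpha}_{q}=\hat{\alpha}_{q}(\mathbf{z})=\frac{z_{+q}}{n},\qquad y_{q\ell}(\mathbf{z})=\frac{\sum_{i\neq j}y_{ij}r_{ij}z_{iq}z_{j\ell}}{\sum_{i\neq j}r_{ij}z_{iq}z_{j\ell}}.\tag{3.1}$$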
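A minimal sketch of the plug-in quantities behind the complete-observed estimators, assuming the usual form: the estimated proportion of block $q$ is $z_{+q}/n$, and for each pair of blocks $(q,\ell)$ the statistic of interest is the mean of the observed dyads going from block $q$ to block $\ell$. The function name and toy data are hypothetical, and block pairs without any observed dyad are returned as `None`.

```python
def complete_observed_estimates(y, r, z, Q):
    """Return (alpha_hat, y_bar): block proportions and observed-dyad block means."""
    n = len(z)
    alpha_hat = [sum(1 for zi in z if zi == q) / n for q in range(Q)]
    num = [[0.0] * Q for _ in range(Q)]
    den = [[0] * Q for _ in range(Q)]
    for i in range(n):
        for j in range(n):
            if i != j and r[i][j] == 1:          # only sampled dyads are used
                num[z[i]][z[j]] += y[i][j]
                den[z[i]][z[j]] += 1
    # A block pair with no observed dyad has no estimate (None).
    y_bar = [[num[q][l] / den[q][l] if den[q][l] else None for l in range(Q)]
             for q in range(Q)]
    return alpha_hat, y_bar

z = [0, 0, 1, 1]
y = [[0, 1, 0, 0], [1, 0, 0, 1], [0, 0, 0, 1], [0, 1, 1, 0]]
r = [[0, 1, 1, 0], [1, 0, 1, 1], [1, 1, 0, 1], [0, 1, 1, 0]]
alpha_hat, y_bar = complete_observed_estimates(y, r, z, 2)
```

The denominators are random because they depend on $\mathbf{r}$, which is why a central limit theorem for random sums is needed when studying these block means.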

Proposition 3.3.

Let $\Sigma_{\boldsymbol{\alpha}^{\star}}=\operatorname{Diag}(\boldsymbol{\alpha}^{\star})-\boldsymbol{\alpha}^{\star}\left(\boldsymbol{\alpha}^{\star}\right)^{T}$. Then $\Sigma_{\boldsymbol{\alpha}^{\star}}$ is positive semi-definite, of rank $Q-1$, and $\widehat{\boldsymbol{\alpha}}$ is asymptotically normal:

$$\sqrt{n}\left(\hat{\boldsymbol{\alpha}}(\mathbf{z}^{\star})-\boldsymbol{\alpha}^{\star}\right)\xrightarrow[n\to\infty]{\mathcal{D}}\mathcal{N}(0,\Sigma_{\boldsymbol{\alpha}^{\star}})$$

Similarly, let $V(\boldsymbol{\pi}^{\star})$ be the matrix defined by $[V(\boldsymbol{\pi}^{\star})]_{q\ell}=1/\psi^{\prime\prime}(\pi^{\star}_{q\ell})$ and $\Sigma_{\boldsymbol{\pi}^{\star}}=\rho^{-1}\operatorname{Diag}^{-1}(\boldsymbol{\alpha}^{\star})V(\boldsymbol{\pi}^{\star})\operatorname{Diag}^{-1}(\boldsymbol{\alpha}^{\star})$. Then the estimates $\hat{\pi}_{q\ell}(\mathbf{z}^{\star})$ are independent and asymptotically Gaussian with limit distribution:

$$\sqrt{n(n-1)}\left(\hat{\pi}_{q\ell}(\mathbf{z}^{\star})-\pi^{\star}_{q\ell}\right)\xrightarrow[n\to\infty]{\mathcal{D}}\mathcal{N}(0,\Sigma_{\boldsymbol{\pi}^{\star},q\ell})\quad\text{for all }q,\ell$$

Proof.

The proof is postponed to appendix A. The first part is a direct application of central limit theorem for i.i.d. variables and the second part relies on a variant of the central limit theorem for random sums of random variables. ∎

Remark 3.4.

The main differences with Bickel et al. (2013) are (i) the scaling of $\Sigma_{\boldsymbol{\pi}^{\star}}$ as $\rho^{-1}$ and (ii) the need for a central limit theorem for random sums of random variables, as the sums involved in (3.1) are over a random number of terms.
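To make the $\rho^{-1}$ scaling concrete, the sketch below evaluates the entries of $\Sigma_{\boldsymbol{\pi}^{\star}}$ from the definition in Proposition 3.3, which gives $[\Sigma_{\boldsymbol{\pi}^{\star}}]_{q\ell}=1/(\rho\,\alpha^{\star}_{q}\alpha^{\star}_{\ell}\,\psi^{\prime\prime}(\pi^{\star}_{q\ell}))$ . The choice $\psi^{\prime\prime}=\exp$ (Poisson edges) and the numerical values are assumptions made for illustration.

```python
import math


def sigma_pi(alpha, pi, rho, psi_second=math.exp):
    """Entries of rho^{-1} Diag^{-1}(alpha) V(pi) Diag^{-1}(alpha), V_ql = 1/psi''(pi_ql)."""
    n_groups = len(alpha)
    return [[1.0 / (rho * alpha[q] * alpha[l] * psi_second(pi[q][l]))
             for l in range(n_groups)] for q in range(n_groups)]


alpha = [0.5, 0.5]
pi = [[0.0, 0.0], [0.0, 0.0]]          # natural parameters, hypothetical values
full = sigma_pi(alpha, pi, rho=1.0)     # every dyad observed
half = sigma_pi(alpha, pi, rho=0.5)     # half of the dyads observed
```

Halving the sampling rate doubles every entry of the asymptotic variance, which is the cost of missingness discussed in the text.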

Proposition 3.5 (Local asymptotic normality).

Let $\mathcal{L}_{co}^{\star}$ be the complete likelihood function defined on $\boldsymbol{\Theta}$ by $\mathcal{L}_{co}^{\star}\left(\boldsymbol{\alpha},\boldsymbol{\pi}\right)=\log p\left(\mathbf{y}^{o},\mathbf{z}^{\star};\boldsymbol{\theta}\right)$ . For any $s$ and $u$ in a compact set, we have:

[TABLE]

where $\odot$ denotes the Hadamard product of two matrices (element-wise product) and $\Sigma_{\boldsymbol{\alpha}^{\star}}$ and $\Sigma_{\boldsymbol{\pi}^{\star}}$ are defined in Proposition 3.3. $\mathbf{Y}_{\boldsymbol{\alpha}^{\star}}$ is asymptotically Gaussian with zero mean and variance matrix $\Sigma_{\boldsymbol{\alpha}^{\star}}$ . $\mathbf{Y}_{\boldsymbol{\pi}^{\star}}$ is a random matrix with independent entries that are asymptotically Gaussian with zero mean and variances given by the entries of $\Sigma_{\boldsymbol{\pi}^{\star}}$ .

Proof.

This result is based on a Taylor expansion of $\mathcal{L}_{co}^{\star}$ in a neighborhood of $(\boldsymbol{\alpha}^{\star},\boldsymbol{\pi}^{\star})$ . Details are available in appendix A. ∎

4 Main Result

Our main result compares the observed likelihood ratio $p(\mathbf{y}^{o};\boldsymbol{\theta})/p(\mathbf{y}^{o};\boldsymbol{\theta}^{\star})$ with the complete-observed likelihood ratio $p(\mathbf{y}^{o},\mathbf{z}^{\star};\boldsymbol{\theta}^{\prime})/p(\mathbf{y}^{o},\mathbf{z}^{\star};\boldsymbol{\theta}^{\star})$ to show that they have the same argmax. To ease the comparison, we work only on the high-probability set $\Omega_{1}$ of $c/2$ -regular configurations, i.e. those that have $\Omega(n)$ nodes in each group, as defined in Section 2.

Proposition 4.1.

Define $\mathcal{Z}_{1}$ as the subset of $\mathcal{Z}$ made of $c/2$ -regular assignments, with $c$ defined in assumption $H_{1}$ . Denote by $\Omega_{1}$ the event $\{\mathbf{z}^{\star}\in\mathcal{Z}_{1}\}$ . Then:

[TABLE]

Proof.

This proposition is a consequence of Hoeffding’s inequality. See appendix A for more details. ∎

We can now state our main result:

Theorem 4.2 (complete-observed).

Assume that $A_{1}$ to $A_{4}$ with random-dyad sampling hold for the Stochastic Block Model of known order with $n\times n$ observations coming from a univariate exponential family, and define $\operatorname{Sym}(\boldsymbol{\theta})$ as the set of permutations $s$ for which $\boldsymbol{\theta}=(\boldsymbol{\alpha},\boldsymbol{\pi})$ exhibits symmetry, with cardinality $\#\operatorname{Sym}(\boldsymbol{\theta})$ . Then, for $n$ tending to infinity and $\rho\gg\log(n)/n$ , the observed likelihood ratio behaves like the complete likelihood ratio, up to a bounded multiplicative factor:

[TABLE]

where the $o_{P}$ is uniform over all $\boldsymbol{\theta}\in\boldsymbol{\Theta}$ .

The maximum over all $\boldsymbol{\theta}^{\prime}$ that are equivalent to $\boldsymbol{\theta}$ stems from the fact that because of label-switching, $\boldsymbol{\theta}$ is only identifiable up to its $\sim$ -equivalence class from the observed likelihood, whereas it is completely identifiable from the complete likelihood. The multiplicative factor arises from the fact that equivalent assignments have exactly the same complete likelihood and contribute equally to the observed likelihood.

Remark 4.3.

This result is very similar to the one of Brault et al. (2020) and corrects an error in the main result of Bickel et al. (2013): the missing terms $\#\operatorname{Sym}(\boldsymbol{\theta})$ and $\#\operatorname{Sym}(\boldsymbol{\theta}^{\star})$ .
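The factor $\#\operatorname{Sym}(\boldsymbol{\theta})$ can be computed by brute force for small $Q$ . The sketch below assumes that a permutation $s$ acts on the parameter as $\boldsymbol{\theta}^{s}=((\alpha_{s(q)})_{q},(\pi_{s(q)s(\ell)})_{q\ell})$ and that $s$ is a symmetry when $\boldsymbol{\theta}^{s}=\boldsymbol{\theta}$ ; the exact convention ($s$ or $s^{-1}$ ) is an assumption, but it does not affect the count since symmetries form a group. The function name and the tolerance are illustrative.

```python
from itertools import permutations


def symmetries(alpha, pi, tol=1e-12):
    """List the permutations s with alpha[s(q)] == alpha[q] and pi[s(q)][s(l)] == pi[q][l]."""
    n_groups = len(alpha)
    found = []
    for s in permutations(range(n_groups)):
        same_alpha = all(abs(alpha[s[q]] - alpha[q]) <= tol for q in range(n_groups))
        same_pi = all(abs(pi[s[q]][s[l]] - pi[q][l]) <= tol
                      for q in range(n_groups) for l in range(n_groups))
        if same_alpha and same_pi:
            found.append(s)
    return found


pi = [[1.0, 2.0], [2.0, 1.0]]
n_sym_balanced = len(symmetries([0.5, 0.5], pi))    # swapping the groups is a symmetry
n_sym_unbalanced = len(symmetries([0.3, 0.7], pi))  # only the identity remains
```

The identity is always a symmetry, so the count is at least 1; parameters with a count of exactly 1 are those covered by Corollary 4.4.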

Corollary 4.4.

If $\boldsymbol{\Theta}$ contains only parameters with no symmetry:

[TABLE]

where the $o_{P}$ is uniform over all $\boldsymbol{\Theta}$ .

5 Variational and Maximum Likelihood Estimates

This section is devoted to the asymptotics of the MLE and the VE in the incomplete data model as a consequence of the main result (Theorem 4.2). Note that, with high probability, both estimators have no symmetry since the set $\{\boldsymbol{\theta}:\#\operatorname{Sym}(\boldsymbol{\theta})>1\}$ is a manifold of null Lebesgue measure in $\boldsymbol{\Theta}$ and thus $\mathbb{P}_{\boldsymbol{\theta}^{\star}}(\#\operatorname{Sym}(\hat{\boldsymbol{\theta}})>1)\to 0$ .

5.1 ML estimator

The asymptotic behavior of the maximum likelihood estimator in the incomplete data model is a direct consequence of Theorem 4.2 and Proposition 3.5.

Corollary 5.1 (Asymptotic behavior of $\widehat{\boldsymbol{\theta}}_{MLE}$ ).

Denote $\widehat{\boldsymbol{\theta}}_{MLE}$ the maximum likelihood estimator and use the notations of Proposition 3.3. There exist permutations $s$ of $\{1,\dots,Q\}$ such that

[TABLE]

Hence, the maximum likelihood estimator for the SBM under random-dyad sampling condition is consistent and asymptotically normal, with the same behavior as the maximum likelihood estimator in the complete data model. The proof is postponed to appendix B.10.

5.2 Variational estimator

Due to the complex dependency structure of the observations, the maximum likelihood estimator of the SBM is not numerically tractable, even with the Expectation Maximisation algorithm. In practice, a variational approximation is often used (see Daudin et al., 2008): for any joint distribution $\mathbb{Q}\in\mathcal{Q}$ on $\mathcal{Z}$ a lower bound of $\mathcal{L}(\boldsymbol{\theta})$ is given by

[TABLE]

where $\mathcal{H}\left(\mathbb{Q}\right)=-\mathbb{E}_{\mathbb{Q}}[\log(\mathbb{Q})]$ . Choosing $\mathcal{Q}$ to be the set of product distributions, such that for all $\mathbf{z}$

[TABLE]

allows us to obtain tractable expressions of $J\left(\mathbb{Q},\boldsymbol{\theta}\right)$ . The variational estimate $\widehat{\boldsymbol{\theta}}_{var}$ of $\boldsymbol{\theta}$ is defined as

[TABLE]
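The lower-bound property can be checked numerically on a tiny network. The sketch below assumes the standard mean-field form $J(\mathbb{Q},\boldsymbol{\theta})=\mathbb{E}_{\mathbb{Q}}[\mathcal{L}_{co}(\mathbf{z};\boldsymbol{\theta})]+\mathcal{H}(\mathbb{Q})$ with $\mathbb{Q}(\mathbf{z})=\prod_{i}\prod_{q}\tau_{iq}^{z_{iq}}$ , Bernoulli edges parameterized directly by their success probability (not the natural parameter), and exhaustive enumeration of $\mathcal{Z}$ for the observed likelihood. All names and numerical values are hypothetical.

```python
import math
from itertools import product


def log_phi(y, p):
    """Bernoulli log-density with success probability p (illustrative emission law)."""
    return math.log(p) if y == 1 else math.log(1.0 - p)


def complete_loglik(y, r, z, alpha, p):
    """log p(y^o, z; theta): label terms plus observed dyads only."""
    n = len(z)
    out = sum(math.log(alpha[q]) for q in z)
    for i in range(n):
        for j in range(n):
            if i != j and r[i][j]:
                out += log_phi(y[i][j], p[z[i]][z[j]])
    return out


def observed_loglik(y, r, alpha, p):
    """log p(y^o; theta) by summing over all Q^n assignments (tiny n only)."""
    terms = [complete_loglik(y, r, z, alpha, p)
             for z in product(range(len(alpha)), repeat=len(y))]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


def lower_bound(y, r, tau, alpha, p):
    """J(Q, theta) for a product distribution with marginals tau[i][q]."""
    n, n_groups = len(tau), len(alpha)
    out = 0.0
    for i in range(n):
        for q in range(n_groups):
            if tau[i][q] > 0:  # convention 0 * log 0 = 0
                out += tau[i][q] * (math.log(alpha[q]) - math.log(tau[i][q]))
    for i in range(n):
        for j in range(n):
            if i != j and r[i][j]:
                for q in range(n_groups):
                    for l in range(n_groups):
                        out += tau[i][q] * tau[j][l] * log_phi(y[i][j], p[q][l])
    return out


y = [[0, 1, 0], [1, 0, 1], [0, 0, 0]]
r = [[0, 1, 1], [1, 0, 0], [1, 1, 0]]
alpha, p = [0.4, 0.6], [[0.8, 0.2], [0.2, 0.7]]
tau = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
J = lower_bound(y, r, tau, alpha, p)
L = observed_loglik(y, r, alpha, p)
```

For any choice of `tau`, `J` cannot exceed `L`; when each row of `tau` is a one-hot vector, `J` reduces to the complete log-likelihood of the corresponding assignment.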

The following corollary states that $\widehat{\boldsymbol{\theta}}_{var}$ has the same asymptotic properties as $\widehat{\boldsymbol{\theta}}_{MLE}$ and $\widehat{\boldsymbol{\theta}}_{MC}$ ; in particular, it is consistent and asymptotically normal.

Corollary 5.2 (Variational estimate).

Under the assumptions of Theorem 4.2, there exist permutations $s$ of $\{1,\dots,Q\}$ such that

[TABLE]

The proof is very similar to the proof of Corollary 5.1 and postponed to appendix B.10.

6 Proof Sketch

The proof of Theorem 4.2 relies on deviations of the log-likelihood ratios from their expectations. We first define those quantities.

6.1 log-likelihood ratios

Definition 6.1.

We define the conditional log-likelihood ratio $LR$ and its expectation $ELR$ as:

[TABLE]

We also define the profile ratio $\Lambda$ and its counterpart $\tilde{\Lambda}$ as:

[TABLE]

The following decomposition of $p(\mathbf{y^{o}};\boldsymbol{\theta})$ highlights the importance of $LR(\boldsymbol{\theta},\mathbf{z})$ :

[TABLE]

Since $LR(\boldsymbol{\theta},\mathbf{z})\leq\Lambda(\mathbf{z})$ , the profile ratio is useful to remove the dependency on $\boldsymbol{\theta}$ and reduce the study to a series of problems depending only on $\mathbf{z}$ . The following propositions show that $\tilde{\Lambda}$ and $ELR$ are contrasts which are maximized (in expectation) at the true parameter value (up to group relabeling) and have negative curvature at those points. This allows us to prove that, asymptotically, only one (or a few) $z$ contribute to the above sum.

Proposition 6.2.

Conditionally on $\mathbf{z}^{\star}$ , we have

[TABLE]

with $\bar{y}_{q\ell}(\mathbf{z})=0$ for $\mathbf{z}$ such that $\widehat{\alpha}_{q}(\mathbf{z})=0$ or $\widehat{\alpha}_{\ell}(\mathbf{z})=0$ , i.e. no dyad observed in class $(q,\ell)$ .

Remark 6.3.

Note the absence of the random variable $\mathbf{r}$ in $\bar{y}_{q\ell}(\mathbf{z})$ , which is integrated out in the expectation $\mathbb{E}_{\boldsymbol{\theta}^{\star}}$ .

Proposition 6.4 (maximum of $ELR$ and $\tilde{\Lambda}$ in $\boldsymbol{\theta}$ ).

The functions $LR(\boldsymbol{\theta},\mathbf{z})$ and $ELR(\boldsymbol{\theta},\mathbf{z})$ are maximized in $\boldsymbol{\pi}$ at $\widehat{\boldsymbol{\pi}}(\mathbf{z})$ and $\bar{\boldsymbol{\pi}}(\mathbf{z})$ , respectively, defined by:

[TABLE]

so that

[TABLE]

Proposition 6.5 (Local upperbound for $\tilde{\Lambda}$ ).

Conditionally upon $\Omega_{1}$ , there exists a positive constant $C$ such that for all $\mathbf{z}\in S(\mathbf{z}^{\star},C)$ :

[TABLE]

Proposition 6.6 (maximum of $ELR$ and $\tilde{\Lambda}$ in $(\boldsymbol{\theta},\mathbf{z})$ ).

$ELR$ can be written:

[TABLE]

Conditionally on the set $\Omega_{1}$ of regular assignments and for $n>2/c$ ,

(i) $ELR$ is maximized at $(\boldsymbol{\pi}^{\star},\mathbf{z}^{\star})$ and its equivalence class and $ELR(\boldsymbol{\pi}^{\star},\mathbf{z}^{\star})=0$ .

(ii) $\tilde{\Lambda}$ is maximized at $\mathbf{z}^{\star}$ and its equivalence class and $\tilde{\Lambda}(\mathbf{z}^{\star})=0$ .

(iii) The maximum of $\tilde{\Lambda}$ (and hence the maximum of $ELR$ ) is well separated.

Proofs of Propositions 6.2, 6.4, 6.5 and 6.6 are postponed to Appendix B.

6.2 High level view of the proof

The proof proceeds by splitting $p(\mathbf{y^{o}};\boldsymbol{\theta})$ as a sum over three types of configurations that partition $\mathcal{Z}$ and studying the asymptotic behavior of $LR$ on each type:

1. global control: for $\mathbf{z}$ such that $\tilde{\Lambda}(\mathbf{z})=\Omega(-n^{2})$ , Proposition 6.7 proves a large deviation behavior and shows that $LR=-\Omega_{P}(n^{2})$ . In turn, those assignments contribute an $o_{P}$ of $p(\mathbf{y^{o}},\mathbf{z}^{\star};\boldsymbol{\theta}^{\star})$ to the sum (Proposition 6.8).

2. local control: a small deviation result (Proposition 6.9) is needed to show that the combined contribution of assignments close to but not equivalent to $\mathbf{z}^{\star}$ is also an $o_{P}$ of $p(\mathbf{y^{o}},\mathbf{z}^{\star};\boldsymbol{\theta}^{\star})$ (Proposition 6.10).

3. equivalent assignments: Proposition 6.11 examines which of the remaining assignments, all equivalent to $\mathbf{z}^{\star}$ , contribute to the sum.

These results are presented in Section 6.3 and their proofs are postponed to Appendix B. They are then put together in Section 6.4 to prove our main result. The remainder of the section is devoted to the asymptotics of the ML and variational estimators as a consequence of the main result.

6.3 Different asymptotic behaviors

6.3.1 Global Control

Proposition 6.7 (large deviations of $LR$ ).

Let $\operatorname{Diam}(\boldsymbol{\Theta})=\sup_{\boldsymbol{\theta},\boldsymbol{\theta}^{\prime}}\|\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}\|_{\infty}$ . For all $\varepsilon_{n}<\nu b$ and $n$ large enough that $2\sqrt{2n^{2}}\varepsilon_{n}\geq Q^{2}$ ,

[TABLE]

Proposition 6.8 (contribution of global assignments).

Choose $t_{n}$ decreasing to $0$ slowly enough that $\frac{\rho nt_{n}}{\sqrt{\log(n)}}\to+\infty$ . Then conditionally on $\Omega_{1}$ and for $n$ large enough that $2\sqrt{2n^{2}}\varepsilon_{n}\geq\mathcal{Q}^{2}$ , we have:

[TABLE]

6.3.2 Local Control

Proposition 6.9 (small deviations $LR$ ).

Conditionally on $\Omega_{1}$ ,

[TABLE]

The next proposition uses Propositions 6.9 and 6.6 to show that the combined contribution to the observed likelihood of assignments close to $\mathbf{z}^{\star}$ is also a $o_{P}$ of $p(\mathbf{z}^{\star},\mathbf{y}^{o};\boldsymbol{\theta}^{\star})$ :

Proposition 6.10 (contribution of local assignments).

With the previous notations and $C$ the positive constant defined in Proposition 6.5:

[TABLE]

6.3.3 Equivalent assignments

It remains to study the contribution of equivalent assignments.

Proposition 6.11 (contribution of equivalent assignments).

For all $\boldsymbol{\theta}\in\boldsymbol{\Theta}$ , we have

[TABLE]

where the $o_{P}$ is uniform in $\boldsymbol{\theta}$ .

6.4 Proof of the main result

Proof.

We work conditionally on $\Omega_{1}$ . Choose $\mathbf{z}^{\star}\in\mathcal{Z}_{1}$ and a sequence $t_{n}$ decreasing to $0$ but satisfying $\rho nt_{n}/\sqrt{\log(n)}\to+\infty$ . According to Proposition 6.8,

[TABLE]

Since $t_{n}$ decreases to $0$ , it gets smaller than $C$ (used in Proposition 6.10) for $n$ large enough. At this point, Proposition 6.10 ensures that:

[TABLE]

And therefore the observed likelihood ratio reduces as:

[TABLE]

And Proposition 6.11 allows us to conclude

[TABLE]

∎

7 Discussion

Close examination of the different proofs, especially of Prop. 6.10, reveals that the quantities driving convergence of the estimates are $\rho n\delta(\boldsymbol{\pi}^{\star})$ , which must go to $+\infty$ with $n$ to ensure validity of Prop. 6.8, and $\rho nt_{n}\delta(\boldsymbol{\pi}^{\star})$ , which must be larger than $\sqrt{\log(n)}$ while $t_{n}\to 0$ , to ensure validity of Prop. 6.10. Both conditions are met as soon as $\rho\gg\log(n)/n$ , allowing for a large fraction of missing edges. Note that this limiting rate for missingness is the same as the one found for graph density in sparse settings to achieve consistency and local asymptotic normality of $\boldsymbol{\theta}$ (Bickel et al., 2013). It is also the same as the one found by Chatterjee (2015) for the structured matrix reconstruction problem. Note also that in the fixed $\rho$ setting, both MLE and VE are consistent and asymptotically normal but the cost of missingness is an expected blow-up of the asymptotic variance matrix by a factor of $\rho^{-1}$ .

The proof follows the lines of Bickel et al. (2013) but differs in some significant ways. First, since the number of observed dyads is random, we must rely on variants of the central limit theorem that hold for random sums of random variables. Second, the move from binary to unbounded dyads invalidates a counting argument used in Bickel et al. (2013) and requires different concentration inequalities. We leverage the facts that random variables with distribution in natural exponential families are subexponential and that the subexponential property is preserved by summation and multiplication to derive Bernstein-type inequalities. Finally, we add the missing terms $\#\operatorname{Sym}(\boldsymbol{\theta})$ which have little impact on the corollaries but are required for the rigorous statement of the main result.

8 Acknowledgment

The authors thank Pierre Barbillon (INRA-MIA, AgroParisTech), Julien Chiquet (INRA-MIA, AgroParisTech), Stéphane Robin (INRA-MIA, AgroParisTech) and James Ridgway (CFM) for their helpful remarks and suggestions.

This work is supported by two public grants overseen by the French National Research Agency (ANR): first as part of the « Investissement d’Avenir » program, through the « IDI 2017 » project funded by the IDEX Paris-Saclay, ANR-11-IDEX-0003-02, and second by the « EcoNet » project.

Appendix A Technical results

A.1 Proof of proposition 3.1

Proof.

Since $N_{i}\sim\text{Bin}(n-1,\rho)$ , we have $\mathbb{P}(N_{i}\geqslant 1)=1-(1-{\rho})^{n-1}$ . As a consequence, $\mathbb{P}(\overline{\Omega_{0,n}})\leqslant\sum_{i}\mathbb{P}(N_{i}=0)=n(1-{\rho})^{n-1}\underset{n\to+\infty}{\longrightarrow}0$ , and $\mathbb{P}(\Omega_{0,n})\underset{n\to+\infty}{\longrightarrow}1$ . Then $\mathbb{P}(\limsup(\overline{\Omega_{0,n}}))=0$ by the Borel-Cantelli lemma (because $\sum_{n}\mathbb{P}(\overline{\Omega_{0,n}})$ converges), and as $\overline{\limsup\overline{\Omega_{0,n}}}=\overline{\bigcap_{n\geqslant 0}\bigcup_{q\geqslant n}\overline{\Omega_{0,q}}}=\bigcup_{n\geqslant 0}\bigcap_{q\geqslant n}{\Omega_{0,q}}=\liminf{\Omega_{0,n}}$ , the result follows. ∎
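The union bound $n(1-\rho)^{n-1}$ is easy to evaluate numerically. The sketch below does so for sampling rates of the form $\rho=c\log(n)/n$ , a parameterization chosen here for illustration (the constant $c$ and the function names are assumptions). For such rates the bound behaves roughly like $n^{1-c}$ , so it vanishes only when $c>1$ .

```python
import math


def union_bound(n, rho):
    """Upper bound n * (1 - rho)^(n - 1) on P(some node has no observed dyad)."""
    return n * (1.0 - rho) ** (n - 1)


def rho_log(n, c):
    """Sampling rate c * log(n) / n, capped at 1 (illustrative parameterization)."""
    return min(1.0, c * math.log(n) / n)


# Bound along two sequences of sampling rates.
fast = [union_bound(n, rho_log(n, 3.0)) for n in (100, 1000)]   # c = 3: bound shrinks
slow = [union_bound(n, rho_log(n, 0.5)) for n in (100, 1000)]   # c = 0.5: bound is vacuous
```

With `c = 0.5` the bound exceeds 1 and carries no information, which is consistent with the requirement that $\rho$ dominate $\log(n)/n$ .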

A.2 Technical lemma A.1

Lemma A.1.

[TABLE]

Proof.

Note that $\mathbb{E}[r_{ij}z_{iq}z_{j\ell}]=\rho\alpha_{q}\alpha_{\ell}$ and define $q_{i,j}^{q,\ell}=r_{ij}z_{iq}z_{j\ell}-\rho\alpha_{q}\alpha_{\ell}$ . By the Hoeffding decomposition for U-statistics (see Hoeffding (1948)),

[TABLE]

where for each permutation $\sigma\in\mathfrak{S}$ , $\sum_{i=1}^{\lfloor\frac{n}{2}\rfloor}q_{\sigma(i),\sigma(i+\lfloor\frac{n}{2}\rfloor)}^{q,\ell}$ is a sum of independent r.v. Then, for $\gamma>0$ , by Jensen’s inequality and Hoeffding’s lemma for bounded r.v.,

[TABLE]

Finally, the same argument as in the proof of Hoeffding’s inequality allows us to conclude. ∎

A.3 Proof of proposition 3.3

Proof.

Since $\hat{\boldsymbol{\alpha}}\left(\mathbf{z}^{\star}\right)=\left(\hat{\alpha}_{1}\left(\mathbf{z}^{\star}\right),\dots,\hat{\alpha}_{g}\left(\mathbf{z}^{\star}\right)\right)$ is the sample mean of $n$ i.i.d. multinomial random variables with parameters $1$ and $\boldsymbol{\alpha}^{\star}$ , a simple application of the central limit theorem (CLT) gives:

[TABLE]

which proves Equation (3.2) where $\Sigma_{\boldsymbol{\alpha}^{\star}}$ is positive semi-definite of rank $\mathcal{Q}-1$ .

Similarly, $\psi^{\prime}\left(\widehat{\pi}_{q\ell}\left(\mathbf{z}^{\star}\right)\right)$ is the average of $\sum_{i\neq j}r_{ij}z_{iq}^{\star}z_{j\ell}^{\star}$ i.i.d. random variables with mean $\psi^{\prime}\left(\pi^{\star}_{q\ell}\right)$ and variance $\psi^{\prime\prime}\left(\pi^{\star}_{q\ell}\right)$ . The number of terms $\sum_{i\neq j}r_{ij}z_{iq}^{\star}z_{j\ell}^{\star}$ is itself random but, thanks to Lemma A.1, $\frac{1}{n(n-1)}\sum_{i\neq j}r_{ij}z_{iq}^{\star}z_{j\ell}^{\star}\xrightarrow[n\to+\infty]{\mathbb{P}}\rho\alpha^{\star}_{q}\alpha^{\star}_{\ell}$ . Therefore, by Slutsky’s lemma and the CLT for random sums of random variables (Shanthikumar and Sumita, 1984), we have:

[TABLE]

The differentiability of $(\psi^{\prime})^{-1}$ and the delta method then give:

[TABLE]

and the independence results from the independence of $\widehat{\pi}_{q\ell}\left(\mathbf{z}^{\star}\right)$ and $\widehat{\pi}_{q^{\prime}\ell^{\prime}}\left(\mathbf{z}^{\star}\right)$ as soon as $q\neq q^{\prime}$ or $\ell\neq\ell^{\prime}$ , as they involve different sets of i.i.d. variables. ∎

A.4 Proof of proposition 3.5

Proof.

By Taylor expansion,

[TABLE]

where $\nabla{\mathcal{L}_{co}^{\star}}_{\boldsymbol{\alpha}}\left(\boldsymbol{\theta}^{\star}\right)$ and $\nabla{\mathcal{L}_{co}^{\star}}_{\boldsymbol{\pi}}\left(\boldsymbol{\theta}^{\star}\right)$ denote the respective components of the gradient of $\mathcal{L}_{co}^{\star}$ evaluated at $\boldsymbol{\theta}^{\star}$ and $\mathbf{H}_{\boldsymbol{\alpha}}$ and $\mathbf{H}_{\boldsymbol{\pi}}$ denote the conditional hessian of $\mathcal{L}_{co}^{\star}$ evaluated at $\boldsymbol{\theta}^{\star}$ . By inspection, $\mathbf{H}_{\boldsymbol{\alpha}}/n$ and $\mathbf{H}_{\boldsymbol{\pi}}/{(n(n-1))}$ converge in probability to constant matrices $\Sigma_{\alpha},\Sigma_{\pi}$ and the random vectors $\nabla{\mathcal{L}_{co}^{\star}}_{\boldsymbol{\alpha}}\left(\boldsymbol{\theta}^{\star}\right)/\sqrt{n}$ and $\nabla{\mathcal{L}_{co}^{\star}}_{\boldsymbol{\pi}}\left(\boldsymbol{\theta}^{\star}\right)/\sqrt{{n(n-1)}}$ converge in distribution by central limit theorem. ∎

A.5 Proof of proposition 4.1

Proof.

In regular configurations, each group has $\Omega(n)$ members, where $u_{n}=\Omega(n)$ if there exist two constants $a,b>0$ such that for $n$ large enough $an\leq u_{n}\leq bn$ . $c/2$ -regular assignments, with $c$ defined in Assumption $H_{1}$ , have high $\mathbb{P}_{\boldsymbol{\theta}^{\star}}$ -probability in the space of all assignments, uniformly over all $\boldsymbol{\theta}^{\star}\in\boldsymbol{\Theta}$ .

Each $z_{+q}$ is a sum of $n$ i.i.d Bernoulli r.v. with parameter $\alpha_{q}\geq\alpha_{\min}\geq c$ . A simple Hoeffding bound shows that

[TABLE]

taking a union bound over $\mathcal{Q}$ values of $q$ leads to Proposition 4.1. ∎
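A quick numerical check of this argument is sketched below. It compares the exact probability that a group holds fewer than $nc/2$ nodes with the bound $\exp(-nc^{2}/2)$ obtained from the standard one-sided Hoeffding inequality with deviation $t=n(\alpha_{q}-c/2)\geq nc/2$ . This constant is derived here under that assumption and is not necessarily the one displayed in the paper; function names and numerical values are illustrative.

```python
import math


def exact_tail(n, alpha_q, c):
    """Exact P(Bin(n, alpha_q) < n * c / 2), the probability that group q is too small."""
    threshold = n * c / 2
    return sum(math.comb(n, k) * alpha_q ** k * (1 - alpha_q) ** (n - k)
               for k in range(n + 1) if k < threshold)


def hoeffding_bound(n, c):
    """exp(-n c^2 / 2), from Hoeffding with deviation t >= n c / 2 (assumed form)."""
    return math.exp(-n * c * c / 2)


def irregular_bound(n, c, n_groups):
    """Union bound over the groups for P(z_star is not c/2-regular)."""
    return n_groups * hoeffding_bound(n, c)


tail = exact_tail(200, alpha_q=0.2, c=0.2)   # worst case alpha_q = c
bound = hoeffding_bound(200, c=0.2)
```

The exact tail is typically much smaller than the Hoeffding bound, which is all the proof needs.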

Appendix B Main Results

B.1 Proof of proposition 6.2

Proof.

We first prove Equation (6.3):

[TABLE]

where $Z_{i}=q\Leftrightarrow z_{iq}=1$ . Note that the pairs $(i,j)$ for which $z_{iq}z_{j\ell}=0$ do not contribute to either of the two terms of the ratio. Computing this expectation is then equivalent to computing an expectation of the general form $\mathbb{E}_{\boldsymbol{\theta}^{\star}}\left[\frac{\sum_{i=1}^{n}a_{i}T_{i}}{\sum_{i=1}^{n}T_{i}}\right]$ , with $(a_{i})_{i\in\{1,..,n\}}\in\mathbb{R}^{n}$ and $T_{i}\stackrel{{\scriptstyle iid}}{{\sim}}\mathcal{B}(\rho)$ .

Lemma B.1.

[TABLE]

Proof.

Define $N=\sum_{i=1}^{n}T_{i}$ and note that $\mathbb{E}[T_{i}|N=k]=\frac{k}{n}$ . Conditionally on $N\geq 1$ ,

[TABLE]

∎
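The identity $\mathbb{E}[T_{i}|N=k]=k/n$ implies that, conditionally on $N\geq 1$ , the expected ratio $\sum_{i}a_{i}T_{i}/\sum_{i}T_{i}$ equals the plain average of the $a_{i}$ , whatever the value of $\rho$ . The sketch below verifies this by exact enumeration of all $2^{n}$ sampling patterns for a small $n$ ; the function name and the numerical values are illustrative.

```python
from itertools import product


def conditional_mean_ratio(a, rho):
    """Exact E[sum(a_i T_i) / sum(T_i) | sum(T_i) >= 1] for T_i iid Bernoulli(rho)."""
    n = len(a)
    numerator, prob_nonempty = 0.0, 0.0
    for t in product((0, 1), repeat=n):
        k = sum(t)
        if k == 0:
            continue  # excluded by the conditioning N >= 1
        prob = rho ** k * (1.0 - rho) ** (n - k)
        numerator += prob * sum(ai * ti for ai, ti in zip(a, t)) / k
        prob_nonempty += prob
    return numerator / prob_nonempty


a = [1.0, 4.0, 7.0, 0.5]
value_sparse = conditional_mean_ratio(a, rho=0.3)
value_dense = conditional_mean_ratio(a, rho=0.9)
plain_mean = sum(a) / len(a)
```

Both sampling rates give the same value as `plain_mean`, which is why $\mathbf{r}$ no longer appears in $\bar{y}_{q\ell}(\mathbf{z})$ .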

Now, applying Lemma B.1 with $N_{q\ell}^{o}(z)=\sum_{i\neq j}z_{iq}z_{j\ell}r_{ij}$ leads to

[TABLE]

Finally, $\mathbb{E}_{\boldsymbol{\theta}^{\star}}[\widehat{y}_{q\ell}(\mathbf{z})|\mathbf{z}^{\star},N_{q\ell}^{o}(z)=0]$ can be arbitrarily set to the same value as $\mathbb{E}_{\boldsymbol{\theta}^{\star}}[\widehat{y}_{q\ell}(\mathbf{z})|\mathbf{z}^{\star},N_{q\ell}^{o}(z)\geq 1]$ , which concludes the proof. ∎

B.2 Proof of proposition 6.4

Proof.

Define $\nu(y,\pi)=y\pi-\psi(\pi)$ . For $y$ fixed, $\nu(y,\pi)$ is maximized at $\pi=(\psi^{\prime})^{-1}(y)$ . Manipulations yield

[TABLE]

which is maximized at $\pi_{q\ell}=(\psi^{\prime})^{-1}(\widehat{y}_{q\ell}(\mathbf{z}))$ . Similarly with $N_{q\ell}(z)=\sum_{i\neq j}z_{iq}z_{j\ell}$ ,

[TABLE]

is maximized at $\pi_{q\ell}=(\psi^{\prime})^{-1}(\bar{y}_{q\ell}(\mathbf{z}))$ . ∎
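The maximizer $\pi=(\psi^{\prime})^{-1}(y)$ of $\nu(y,\pi)=y\pi-\psi(\pi)$ can be checked numerically. The sketch below does so for the Poisson family, where $\psi(\pi)=e^{\pi}$ and $(\psi^{\prime})^{-1}(y)=\log y$ ; the family, the grid and the value of $y$ are assumptions chosen for illustration.

```python
import math


def nu(y, pi, psi):
    """Exponential-family contrast nu(y, pi) = y * pi - psi(pi)."""
    return y * pi - psi(pi)


def grid_argmax(y, psi, lo=-3.0, hi=3.0, steps=6001):
    """Maximize nu(y, .) over a regular grid (step 0.001 with the defaults)."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return max(grid, key=lambda pi: nu(y, pi, psi))


y_bar = 2.5                                   # hypothetical block average
pi_closed_form = math.log(y_bar)              # (psi')^{-1}(y) for the Poisson family
pi_numeric = grid_argmax(y_bar, math.exp)     # brute-force maximizer
```

Because $\psi$ is convex, $\nu(y,\cdot)$ is concave and the grid maximizer lies within one grid step of the closed-form value.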

B.3 Proof of Proposition 6.6 (maximum of $ELR$ and $\tilde{\Lambda}$ )

Proof.

We condition on $\mathbf{z}^{\star}$ and prove Equation (6.5):

[TABLE]

If $\mathbf{z}^{\star}$ is regular, and for $n>2/c$ , all the rows of ${I\!R}(\mathbf{z})$ have at least one positive element and we can apply Lemma 3.2 of Bickel et al. (2013) to characterize the maximum for $ELR$ .

The maximality of $\tilde{\Lambda}(\mathbf{z}^{\star})$ results from the fact that $\tilde{\Lambda}(\mathbf{z})=ELR(\bar{\boldsymbol{\pi}}(\mathbf{z}),\mathbf{z})$ , where $\bar{\boldsymbol{\pi}}(\mathbf{z})$ is a particular value of $\boldsymbol{\pi}$ . It follows immediately that $\tilde{\Lambda}$ is maximized at $\mathbf{z}\sim\mathbf{z}^{\star}$ and, for those, we have $\bar{\boldsymbol{\pi}}(\mathbf{z})\sim\boldsymbol{\pi}^{\star}$ .

The separation and local behavior of $\tilde{\Lambda}$ around $\mathbf{z}^{\star}$ are a direct consequence of Proposition 6.5. ∎

B.4 Proof of Proposition 6.5 (Local upper bound for $\tilde{\Lambda}$ )

Proof.

We work conditionally on $\mathbf{z}^{\star}$ . The principle of the proof relies on the extension of $\tilde{\Lambda}$ to a continuous subspace of $\mathcal{M}_{\mathcal{Q}}([0,1])$ , in which the confusion matrix is naturally embedded. The regularity assumption allows us to work on a subspace that is bounded away from the borders of $\mathcal{M}_{\mathcal{Q}}([0,1])$ . The proof then proceeds by computing the gradient of $\tilde{\Lambda}$ at and around its argmax and using those gradients to control the local behavior of $\tilde{\Lambda}$ around its argmax. The local behavior allows us in turn to show that $\tilde{\Lambda}$ is well-separated.

Note that $\tilde{\Lambda}$ only depends on $\mathbf{z}$ through ${I\!R}(\mathbf{z})$ . We can therefore extend it to matrices $U\in\mathcal{U}_{c}$ , where $\mathcal{U}_{c}$ is the subset of matrices of $\mathcal{M}_{\mathcal{Q}}([0,1])$ with each row sum higher than $c/2$ .

[TABLE]

where

[TABLE]

and $\mathbf{1}$ is the $\mathcal{Q}\times\mathcal{Q}$ matrix filled with $1$ . Confusion matrices ${I\!R}(\mathbf{z})$ satisfy ${I\!R}(\mathbf{z}){1\!I}=\boldsymbol{\alpha}(\mathbf{z}^{\star})$ , with ${1\!I}=(1,\ldots,1){{}^{T}}$ a vector only containing $1$ values, and are obviously in $\mathcal{U}_{c}$ as soon as $\mathbf{z}^{\star}$ is $c/2$ -regular.

The maps $f_{q,q^{\prime},\ell,\ell^{\prime}}:(U)\mapsto KL(\pi^{\star}_{q\ell},\bar{\pi}_{q\ell}(U))$ are twice differentiable with second derivatives bounded over $\mathcal{U}_{c}$ and therefore so is $\tilde{\Lambda}(U)$ . Tedious but straightforward computations show that the derivative of $\tilde{\Lambda}$ at $D_{\alpha}\coloneqq\operatorname{Diag}(\boldsymbol{\alpha}(\mathbf{z}^{\star}))$ is:

[TABLE]

$A(\mathbf{z}^{\star})$ is the matrix-derivative of $-\tilde{\Lambda}/n^{2}$ at $D_{\alpha}$ . Since $\mathbf{z}^{\star}$ is $c/2$ -regular and by definition of $\delta(\boldsymbol{\pi}^{\star})$ , $A(\mathbf{z}^{\star})_{qq^{\prime}}\geq c\rho\delta(\boldsymbol{\pi}^{\star})$ if $q\neq q^{\prime}$ and $A(\mathbf{z}^{\star})_{qq}=0$ for all $q$ . By boundedness of the second derivative, there exists $C>0$ such that for all $D_{\alpha}$ and all $H\in B(D_{\alpha},C)$ , we have:

[TABLE]

Choose $U$ in $\mathcal{U}_{c}\cap B(D_{\alpha},C)$ satisfying $U{1\!I}=\boldsymbol{\alpha}(\mathbf{z}^{\star})$ . $U-D_{\alpha}$ has nonnegative off-diagonal coefficients and negative diagonal coefficients. Furthermore, the coefficients of $U,D_{\alpha}$ sum up to $1$ and $\operatorname{Tr}(D_{\alpha})=1$ . By Taylor expansion, there exists $H$ also in $\mathcal{U}_{c}\cap B(D_{\alpha},C)$ such that

[TABLE]

To conclude the proof, assume without loss of generality that $\mathbf{z}\in S(\mathbf{z}^{\star},C)$ achieves the $\|.\|_{0,\sim}$ norm (i.e. it is the closest to $\mathbf{z}^{\star}$ in its representative class). Then $U={I\!R}(\mathbf{z})$ is in $\mathcal{U}_{c}\cap B(D_{\alpha},C)$ and satisfies $U{1\!I}=\boldsymbol{\alpha}(\mathbf{z}^{\star})$ . We just need to note that $n(1-\operatorname{Tr}({I\!R}(\mathbf{z})))=\|\mathbf{z}-\mathbf{z}^{\star}\|_{0,\sim}$ to end the proof.

∎

B.5 Proof of Proposition 6.7 (global convergence $LR$ )

Proof.

Conditionally upon $\mathbf{z}^{\star}$ ,

[TABLE]

uniformly in $\boldsymbol{\theta}$ , where the $W_{qq^{\prime}\ell\ell^{\prime}}$ are independent and by Taylor expansion defined by:

[TABLE]

is the sum of $n^{2}{I\!R}(\mathbf{z})_{qq^{\prime}}{I\!R}(\mathbf{z})_{\ell\ell^{\prime}}$ sub-exponential variables with parameters $(\nu^{2},1/b)$ and is therefore itself sub-exponential with parameters $(n^{2}{I\!R}(\mathbf{z})_{qq^{\prime}}{I\!R}(\mathbf{z})_{\ell\ell^{\prime}}\nu^{2},1/b)$ . According to Proposition B.3 of Brault et al. (2020) , $\mathbb{E}_{\boldsymbol{\theta}^{\star}}[Z|\mathbf{z}^{\star}]\leq Q^{2}\operatorname{Diam}(\boldsymbol{\Theta})\sqrt{n^{2}\nu^{2}}$ and $Z$ is sub-exponential with parameters $(n^{2}\operatorname{Diam}(\boldsymbol{\Theta})^{2}(2\sqrt{2})^{2}\nu^{2},2\sqrt{2}\operatorname{Diam}(\boldsymbol{\Theta})/b)$ . In particular, for all $\varepsilon_{n}<\nu b$

[TABLE]

We can then remove the conditioning and take a union bound. ∎

B.6 Proof of Proposition 6.8 (contribution of far away assignments)

Proof.

Conditionally on $\mathbf{z}^{\star}$ , we know from Proposition 6.6 that $\tilde{\Lambda}$ is maximal at $\mathbf{z}^{\star}$ and its equivalence class. Choose $0<t_{n}$ decreasing to $0$ but satisfying $\frac{n\rho t_{n}}{\sqrt{\log(n)}}\to+\infty$ . According to Proposition 6.6 (iii), for all $\mathbf{z}\notin S(\mathbf{z}^{\star},t_{n})$ ,

[TABLE]

since $\|\mathbf{z}-\mathbf{z}^{\star}\|_{0,\sim}\geq nt_{n}$ .

Set $\varepsilon_{n}=\inf(5c\rho\delta(\boldsymbol{\pi}^{\star})t_{n}/(\sqrt{2}\nu\operatorname{Diam}(\boldsymbol{\Theta})),\nu b)$ and $n$ large enough that $\varepsilon_{n}\geq\frac{Q^{2}}{n\sqrt{8}}$ . By Proposition 6.7, and with our choice of $\varepsilon_{n}$ , with probability higher than $1-\Delta_{n}^{1}(\varepsilon_{n})$ ,

[TABLE]

where the second line comes from inequality (B.1), the third from the global control studied in Proposition 6.7 and the definition of $\varepsilon_{n}$ , the fourth from the definition of $p(\mathbf{y^{o}},\mathbf{z}^{\star};\boldsymbol{\theta}^{\star})$ , the fifth from the bounds on $\boldsymbol{\alpha}^{\star}$ and the last from $\frac{n\rho t_{n}}{\sqrt{\log(n)}}\to+\infty$ .

In addition, with our choice of $t_{n}$ , we have $\varepsilon_{n}\gg\sqrt{\log(n)}/n$ so that the series $\sum_{n}\Delta_{n}^{1}(\varepsilon_{n})$ converges and:

[TABLE]

∎

B.7 Proof of Proposition 6.9 (local convergence $LR$ )

Proof.

We work conditionally on $\mathbf{z}^{\star}\in\mathcal{Z}_{1}$ . Choose $\varepsilon\leq\kappa\underline{\sigma}^{2}$ small. Assignments $\mathbf{z}$ at $\|.\|_{0,\sim}$ -distance less than $c/4$ of $\mathbf{z}^{\star}$ are $c/4$ -regular. According to Proposition B.1 of Brault et al. (2020) , $\widehat{y}_{q\ell}$ and $\bar{y}_{q\ell}$ are at distance at most $\varepsilon$ with probability higher than $1-2\exp\left(-\frac{n^{2}c^{2}\varepsilon^{2}}{32(\nu^{2}+b^{-1}\varepsilon)}\right)$ . Defining

[TABLE]

where $\tilde{\Lambda}(\mathbf{z})=\mathbb{E}\left[\tilde{\tilde{\Lambda}}(\mathbf{z})|\mathbf{z}^{\star}\right]$ . Manipulations of $\Lambda$ , $\tilde{\Lambda}$ and $\tilde{\tilde{\Lambda}}$ yield

[TABLE]

where $f(x)=x(\psi^{\prime})^{-1}(x)-\psi\circ(\psi^{\prime})^{-1}(x)$ , $\widehat{y}_{q\ell}^{\star}=\widehat{y}_{q\ell}(\mathbf{z}^{\star})$ and $\bar{y}_{q\ell}^{\star}=\psi^{\prime}(\pi^{\star}_{q\ell})$ .

Concerning the first term.

The function $f$ is twice differentiable on $\mathring{\mathcal{A}}$ with $f^{\prime}(x)=(\psi^{\prime})^{-1}(x)$ and $f^{\prime\prime}(x)=1/\psi^{\prime\prime}\circ(\psi^{\prime})^{-1}(x)$ . $f^{\prime}$ (resp. $f^{\prime\prime}$ ) is bounded over $I=\psi^{\prime}(C_{\pi})$ by $C_{\pi}$ (resp. $1/\underline{\sigma}^{2}$ ) so that:

[TABLE]

By Proposition B.1 (adapted for SBM) of Brault et al. (2020) , $(\widehat{y}_{q\ell}-\bar{y}_{q\ell})^{2}=\mathcal{O}_{P}(1/n^{2})$ where the $\mathcal{O}_{P}$ is uniform in $\mathbf{z}$ and does not depend on $\mathbf{z}^{\star}$ . Similarly,

[TABLE]

$\bar{y}_{q\ell}$ is a convex combination of the $S^{\star}_{q\ell}=\psi^{\prime}(\pi^{\star}_{q\ell})$ therefore,

[TABLE]

Note that:

[TABLE]

and $\widehat{y}_{q\ell}-\bar{y}_{q\ell}=o_{P}(1)$ . Therefore

[TABLE]

The remaining term writes

[TABLE]

and is also $o_{P}\left(\|\mathbf{z}-\mathbf{z}^{\star}\|_{0,\sim}/n\right)$ uniformly in $\mathbf{z}$ and $\mathbf{z}^{\star}\in\Omega_{1}$ by Proposition C.2.

Concerning the second term.

For all $q,\ell$ , defining

[TABLE]

and noticing that $N_{q\ell}^{+}(\mathbf{z},\mathbf{z}^{\star})=\#\{(i,j):z_{iq}=1,z_{j\ell}=1,(z_{iq},z_{j\ell})\neq(z_{iq}^{\star},z_{j\ell}^{\star})\}$ and $N_{q\ell}^{-}(\mathbf{z},\mathbf{z}^{\star})=\#\{(i,j):z_{iq}^{\star}=1,z_{j\ell}^{\star}=1,(z_{iq},z_{j\ell})\neq(z_{iq}^{\star},z_{j\ell}^{\star})\}$ . Using the following notations

[TABLE]

we are able to write

[TABLE]

where the second equality involves a sum of independent random variables.

Note that:

[TABLE]

and also that $\hat{\rho}_{q\ell}^{+}-\rho=o_{P}\left(1\right)$ and $\hat{\rho}_{q\ell}^{-}-\rho=o_{P}\left(1\right)$ . Therefore

[TABLE]

Concerning the third term.

The arguments developed previously lead to the same conclusion as before:

[TABLE]

As a conclusion, writing

[TABLE]

and noticing that $\frac{\tilde{\Lambda}(\mathbf{z})-\tilde{\Lambda}(\mathbf{z}^{\star})}{n\|\mathbf{z}-\mathbf{z}^{\star}\|_{0,\sim}}\leq 0$ since $\tilde{\Lambda}$ is maximized at $\mathbf{z}^{\star}$ (see Proposition 6.6), we have

[TABLE]

∎

B.8 Proof of Proposition 6.10 (contribution of local assignments)

Proof.

By Proposition 4.1, it is enough to prove that the sum is small compared to $p(\mathbf{z}^{\star},\mathbf{y^{o}};\boldsymbol{\theta}^{\star})$ on $\Omega_{1}$ . We work conditionally on $\mathbf{z}^{\star}\in\mathcal{Z}_{1}$ . Choose $\mathbf{z}$ in $S(\mathbf{z}^{\star},C)$ with $C$ defined in Proposition 6.5.

[TABLE]

For $C$ small enough, we can assume without loss of generality that $\mathbf{z}$ is the representative closest to $\mathbf{z}^{\star}$ and denote $r=\|\mathbf{z}-\mathbf{z}^{\star}\|_{0}$ . Then:

[TABLE]

where the first line comes from the definition of $\Lambda$ , the second line from Proposition 6.6 and the third from Proposition 6.9. Thanks to proposition D.1, we also know that:

[TABLE]

There are at most ${n\choose r}Q^{r}$ assignments $\mathbf{z}$ at distance $r$ of $\mathbf{z}^{\star}$ and each of them has at most $Q^{Q}$ equivalent configurations. Therefore,

[TABLE]

where $a_{n}=ne^{(Q+1)\log Q+M_{c/4}-c\rho n\frac{3\delta(\boldsymbol{\pi}^{\star})(1+o_{P}(1))}{4}}=o_{P}(1)$ .
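The counting step in this proof can be checked by brute force on a toy instance. The sketch below assumes that the $\|\cdot\|_{0}$ distance counts the nodes whose label differs (an assumption about Definition 2.15) and uses hypothetical helper names; it enumerates every assignment for a small $n$ and $Q$ and compares the exact count with the bound ${n\choose r}Q^{r}$.

```python
import itertools
import math


def hamming(z, z_star):
    """Number of nodes whose label differs (assumed meaning of the ||.||_0 distance)."""
    return sum(a != b for a, b in zip(z, z_star))


def count_at_distance(z_star, Q, r):
    """Exact number of assignments with Q classes at distance r from z_star."""
    n = len(z_star)
    return sum(
        1
        for z in itertools.product(range(Q), repeat=n)
        if hamming(z, z_star) == r
    )


# Toy reference assignment (hypothetical): n = 5 nodes, Q = 3 classes.
Z_STAR = (0, 0, 1, 1, 2)
Q = 3
N = len(Z_STAR)

counts = [count_at_distance(Z_STAR, Q, r) for r in range(N + 1)]
bounds = [math.comb(N, r) * Q**r for r in range(N + 1)]

for r, (c, b) in enumerate(zip(counts, bounds)):
    print(f"r={r}: exact={c}, bound C(n,r) Q^r={b}")
```

Under this reading of the distance, the exact count is ${n\choose r}(Q-1)^{r}$, so the bound used in the proof is slightly loose but has the same exponential order in $r$.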

∎

B.9 Proof of Proposition 6.11 (contribution of equivalent assignments)

Proof.

Choose a permutation $s$ of $\{1,\dots,Q\}$ and assume that $\mathbf{z}=\mathbf{z}^{\star,s}$. Then $p(\mathbf{y^{o}},\mathbf{z};\boldsymbol{\theta})=p(\mathbf{y^{o}},\mathbf{z}^{\star,s};\boldsymbol{\theta})=p(\mathbf{y^{o}},\mathbf{z}^{\star};\boldsymbol{\theta}^{s})$. If furthermore $s\in\operatorname{Sym}(\boldsymbol{\theta})$, then $\boldsymbol{\theta}^{s}=\boldsymbol{\theta}$ and immediately $p(\mathbf{y^{o}},\mathbf{z};\boldsymbol{\theta})=p(\mathbf{y^{o}},\mathbf{z}^{\star};\boldsymbol{\theta})$. We can therefore partition the sum as

[TABLE]

The function $\boldsymbol{\theta}\mapsto p(\mathbf{y^{o}},\mathbf{z}^{\star};\boldsymbol{\theta})$ is unimodal, with its mode at $\widehat{\boldsymbol{\theta}}_{MC}$. By consistency of $\widehat{\boldsymbol{\theta}}_{MC}$, either $p(\mathbf{y^{o}},\mathbf{z}^{\star};\boldsymbol{\theta})=o_{P}(p(\mathbf{y^{o}},\mathbf{z}^{\star};\boldsymbol{\theta}^{\star}))$ or $p(\mathbf{y^{o}},\mathbf{z}^{\star};\boldsymbol{\theta})=\mathcal{O}_{P}(p(\mathbf{y^{o}},\mathbf{z}^{\star};\boldsymbol{\theta}^{\star}))$ and $\boldsymbol{\theta}\to\boldsymbol{\theta}^{\star}$. In the latter case, any $\boldsymbol{\theta}^{\prime}\sim\boldsymbol{\theta}$ other than $\boldsymbol{\theta}$ is bounded away from $\boldsymbol{\theta}^{\star}$ and thus $p(\mathbf{y^{o}},\mathbf{z}^{\star};\boldsymbol{\theta}^{\prime})=o_{P}(p(\mathbf{y^{o}},\mathbf{z}^{\star};\boldsymbol{\theta}^{\star}))$. In summary,

[TABLE]

∎
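The label-switching identity used at the start of this proof can be illustrated numerically. The sketch below is a toy check under stated assumptions: Bernoulli edges, a directed graph, the conventions $\mathbf{z}^{s}_{i}=s(z_{i})$, $\alpha^{s}_{q}=\alpha_{s(q)}$ and $\pi^{s}_{q\ell}=\pi_{s(q)s(\ell)}$ (the paper's Definition 2.11 may use the inverse permutation), and hypothetical function names and data.

```python
import itertools
import math


def complete_loglik(y, r, z, alpha, pi):
    """log p(y_obs, z; theta): only dyads with r[i][j] == 1 are observed."""
    ll = sum(math.log(alpha[q]) for q in z)
    for i, j in itertools.permutations(range(len(z)), 2):
        if r[i][j]:
            p = pi[z[i]][z[j]]
            ll += math.log(p if y[i][j] else 1.0 - p)
    return ll


def permute_theta(alpha, pi, s):
    """theta^s under the assumed convention alpha^s_q = alpha_{s(q)}."""
    Q = len(alpha)
    alpha_s = [alpha[s[q]] for q in range(Q)]
    pi_s = [[pi[s[q]][s[l]] for l in range(Q)] for q in range(Q)]
    return alpha_s, pi_s


def permute_z(z, s):
    """z^s under the assumed convention z^s_i = s(z_i)."""
    return tuple(s[q] for q in z)


def symmetries(alpha, pi):
    """All permutations s with theta^s == theta, i.e. the set Sym(theta)."""
    theta = (list(alpha), [list(row) for row in pi])
    return [
        s
        for s in itertools.permutations(range(len(alpha)))
        if permute_theta(alpha, pi, s) == theta
    ]


# Toy data (hypothetical): 4 nodes, 2 classes, partially observed dyads.
ALPHA = [0.6, 0.4]
PI = [[0.8, 0.2], [0.3, 0.7]]
Y = [[0, 1, 0, 1], [1, 0, 0, 0], [0, 1, 0, 1], [1, 0, 1, 0]]
R = [[0, 1, 1, 0], [1, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 0]]

max_gap = 0.0
for s in itertools.permutations(range(2)):
    alpha_s, pi_s = permute_theta(ALPHA, PI, s)
    for z in itertools.product(range(2), repeat=4):
        lhs = complete_loglik(Y, R, permute_z(z, s), ALPHA, PI)
        rhs = complete_loglik(Y, R, z, alpha_s, pi_s)
        max_gap = max(max_gap, abs(lhs - rhs))

print("largest gap between p(y, z^s; theta) and p(y, z; theta^s):", max_gap)
print("#Sym for a generic theta:", len(symmetries(ALPHA, PI)))
print("#Sym for a symmetric theta:",
      len(symmetries([0.5, 0.5], [[0.8, 0.2], [0.2, 0.8]])))
```

A generic parameter has only the identity in $\operatorname{Sym}(\boldsymbol{\theta})$, whereas a parameter that is invariant under swapping the two classes has two symmetries, which is why the factor $\#\operatorname{Sym}(\boldsymbol{\theta})$ appears when equivalent assignments are grouped.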

B.10 Proof of Corollary 5.1: Behavior of $\widehat{\boldsymbol{\theta}}_{MLE}$

We prove the corollary by contradiction. Note first that, unless $\boldsymbol{\Theta}$ is constrained, $\widehat{\boldsymbol{\theta}}_{MLE}$ and $\widehat{\boldsymbol{\theta}}(\mathbf{z}^{\star})$ exhibit no symmetries with high probability. Indeed, equalities like $\widehat{y}_{q\ell}=\widehat{y}_{q^{\prime},\ell^{\prime}}$ have a vanishingly small probability of holding simultaneously when $y_{ij}$ is discrete, and probability zero when $y_{ij}$ is continuous. Assume then that $\min_{s}(\widehat{\boldsymbol{\alpha}}_{MLE}^{s}-\hat{\boldsymbol{\alpha}}\left(\mathbf{z}^{\star}\right))\neq o_{P}\left(1/\sqrt{n}\right)$ or $\min_{s}(\widehat{\boldsymbol{\pi}}_{MLE}^{s}-\hat{\boldsymbol{\pi}}\left(\mathbf{z}^{\star}\right))\neq o_{P}\left(1/n\right)$, where $s$ is a permutation of $\{1,\dots,Q\}$. Then, by Proposition 3.5 and the consistency of $\hat{\boldsymbol{\theta}}\left(\mathbf{z}^{\star}\right)$,

[TABLE]

But, since $\hat{\boldsymbol{\theta}}\left(\mathbf{z}^{\star}\right)$ and $\widehat{\boldsymbol{\theta}}_{MLE}$ maximise respectively $\frac{p(\mathbf{y^{o}},\mathbf{z}^{\star};\boldsymbol{\theta}^{\prime})}{p(\mathbf{y^{o}},\mathbf{z}^{\star};\boldsymbol{\theta}^{\star})}$ and $\frac{p(\mathbf{y^{o}};\boldsymbol{\theta})}{p\left(\mathbf{y^{o}};\boldsymbol{\theta}^{\star}\right)}$ and have no symmetries, it follows by Theorem 4.2 that

[TABLE]

which contradicts Eq (B.2) and concludes the proof.

B.11 Proof of Corollary 5.2: Behavior of $J\left(\mathbb{Q},\boldsymbol{\theta}\right)$

Remark first that for every $\boldsymbol{\theta}$ and for every $\mathbf{z}$ ,

[TABLE]

where $\delta_{\mathbf{z}}$ denotes the Dirac mass on $\mathbf{z}$. By dividing by $p\left(\mathbf{y^{o}};\boldsymbol{\theta}^{\star}\right)$, we obtain

[TABLE]

As this inequality holds for every $\mathbf{z}$, we have in particular:

[TABLE]

Noticing that $p\left(\mathbf{y^{o}};\boldsymbol{\theta}^{\star}\right)=\#\operatorname{Sym}(\boldsymbol{\theta}^{\star})p\left(\mathbf{y^{o}},\mathbf{z}^{\star};\boldsymbol{\theta}^{\star}\right)(1+o_{P}(1))$, Theorem 4.2 therefore leads to the following bounds:

[TABLE]

Again, unless $\boldsymbol{\Theta}$ is constrained, $\widehat{\boldsymbol{\theta}}_{VAR}$ exhibits no symmetries with high probability, and the same proof by contradiction as in Appendix B.10 gives the result.
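The sandwich argument of this proof rests on a standard fact: a variational criterion evaluated at a Dirac mass equals the complete log-likelihood and never exceeds the observed log-likelihood. The sketch below illustrates this on a toy example, assuming Bernoulli edges, a directed graph, and the usual form $J(\mathbb{Q},\boldsymbol{\theta})=\mathbb{E}_{\mathbb{Q}}[\log p(\mathbf{y^{o}},\mathbf{z};\boldsymbol{\theta})]+H(\mathbb{Q})$ with $\mathbb{Q}$ an arbitrary distribution over assignments (the paper's variational class may be more restricted). All names and data are hypothetical.

```python
import itertools
import math


def complete_loglik(y, r, z, alpha, pi):
    """log p(y_obs, z; theta): only dyads with r[i][j] == 1 are observed."""
    ll = sum(math.log(alpha[q]) for q in z)
    for i, j in itertools.permutations(range(len(z)), 2):
        if r[i][j]:
            p = pi[z[i]][z[j]]
            ll += math.log(p if y[i][j] else 1.0 - p)
    return ll


def elbo(y, r, weights, alpha, pi):
    """J(Q, theta) = E_Q[log p(y_obs, z; theta)] + entropy(Q), with Q = {z: prob}."""
    total = 0.0
    for z, w in weights.items():
        if w > 0.0:
            total += w * (complete_loglik(y, r, z, alpha, pi) - math.log(w))
    return total


# Toy data (hypothetical): 4 nodes, 2 classes, partially observed dyads.
ALPHA = [0.6, 0.4]
PI = [[0.8, 0.2], [0.3, 0.7]]
Y = [[0, 1, 0, 1], [1, 0, 0, 0], [0, 1, 0, 1], [1, 0, 1, 0]]
R = [[0, 1, 1, 0], [1, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 0]]

configs = list(itertools.product(range(2), repeat=4))
joint = {z: math.exp(complete_loglik(Y, R, z, ALPHA, PI)) for z in configs}
marginal = sum(joint.values())
log_marginal = math.log(marginal)

# J at a Dirac mass: the entropy term vanishes, leaving log p(y_obs, z; theta).
dirac_values = {z: elbo(Y, R, {z: 1.0}, ALPHA, PI) for z in configs}

# J at the exact posterior attains the observed log-likelihood.
posterior = {z: p / marginal for z, p in joint.items()}
posterior_value = elbo(Y, R, posterior, ALPHA, PI)

print("log p(y_obs; theta)       :", log_marginal)
print("max_z J(delta_z, theta)   :", max(dirac_values.values()))
print("J(posterior, theta)       :", posterior_value)
```

The best Dirac mass gives a lower bound on the maximized criterion, while the observed log-likelihood gives an upper bound; these are the two sides of the inequality chain exploited in the proof.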

Appendix C Sub-exponential random variables

We now prove two propositions regarding subexponential variables. Recall first that a random variable $X$ is sub-exponential with parameters $(\tau^{2},b)$ if for all $\lambda$ such that $|\lambda|\leq 1/b$ ,

[TABLE]

In particular, all distributions coming from a natural exponential family are sub-exponential. Sub-exponential variables satisfy a large deviation Bernstein-type inequality:

[TABLE]

So that

[TABLE]
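As a numerical sanity check, the sketch below evaluates the usual two-regime Bernstein bound for sub-exponential variables (the textbook form, which is assumed here rather than copied from the display above) and compares it with an exact tail. The test variable is a centered Exponential(1), which has parameters $(\tau^{2},b)=(2,2)$ because its log-moment generating function $-\lambda-\log(1-\lambda)$ is at most $\lambda^{2}$ for $|\lambda|\leq 1/2$. Function names are hypothetical.

```python
import math


def bernstein_tail_bound(t, tau2, b):
    """Two-regime upper bound on P(X - E[X] >= t) for a (tau2, b) sub-exponential X."""
    if t < 0:
        raise ValueError("t must be non-negative")
    if t <= tau2 / b:
        return math.exp(-t * t / (2.0 * tau2))  # sub-Gaussian regime (small t)
    return math.exp(-t / (2.0 * b))  # exponential regime (large t)


def exact_tail_centered_exp(t):
    """P(X - 1 >= t) for X ~ Exponential(1) and t >= 0."""
    return math.exp(-(1.0 + t))


TAU2, B = 2.0, 2.0
ts = [0.1 * k for k in range(101)]
gaps = [bernstein_tail_bound(t, TAU2, B) - exact_tail_centered_exp(t) for t in ts]

for t in (0.5, 1.0, 5.0):
    print(t, bernstein_tail_bound(t, TAU2, B), exact_tail_centered_exp(t))
```

The bound switches from a Gaussian-type decay to an exponential decay at $t=\tau^{2}/b$, and it dominates the exact tail on the whole grid.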

The subexponential property is preserved by summation of independent variables and by multiplication by a scalar.

• If $X$ is sub-exponential with parameters $(\tau^{2},b)$ and $\alpha\in\mathbb{R}$, then so is $\alpha X$ with parameters $(\alpha^{2}\tau^{2},|\alpha|b)$.

• If the $X_{i}$, $i=1,\dots,n$, are sub-exponential with parameters $(\tau_{i}^{2},b_{i})$ and independent, then so is $X=X_{1}+\dots+X_{n}$ with parameters $(\sum_{i}\tau_{i}^{2},\max_{i}b_{i})$.
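The two closure rules can be checked numerically on a grid. The sketch below is a minimal illustration with hypothetical helper names; it takes the scale parameter of $\alpha X$ to be $|\alpha|b$ so that it stays positive for negative $\alpha$, and it uses a centered Exponential(1) variable, whose log-moment generating function $-\lambda-\log(1-\lambda)$ satisfies the sub-exponential condition with $(\tau^{2},b)=(2,2)$.

```python
import math


def log_mgf_centered_exp(lam):
    """log E[exp(lam * (X - 1))] for X ~ Exponential(1); finite for lam < 1."""
    return -lam - math.log1p(-lam)


def is_subexp(log_mgf, tau2, b, grid=2001):
    """Check log_mgf(lam) <= lam^2 * tau2 / 2 on a grid of |lam| <= 1 / b."""
    lam_max = 1.0 / b
    for k in range(grid):
        lam = -lam_max + 2.0 * lam_max * k / (grid - 1)
        if log_mgf(lam) > lam * lam * tau2 / 2.0 + 1e-12:
            return False
    return True


def scale_params(tau2, b, alpha):
    """Parameters of alpha * X when X has parameters (tau2, b)."""
    return alpha * alpha * tau2, abs(alpha) * b


def sum_params(params):
    """Parameters of a sum of independent variables with the given parameters."""
    return sum(t for t, _ in params), max(b for _, b in params)


TAU2, B = 2.0, 2.0
ALPHA = -3.0
N_TERMS = 5

base_ok = is_subexp(log_mgf_centered_exp, TAU2, B)

# log-MGF of alpha * (X - 1) at lam is the log-MGF of (X - 1) at alpha * lam.
scaled = scale_params(TAU2, B, ALPHA)
scaled_ok = is_subexp(lambda lam: log_mgf_centered_exp(ALPHA * lam), *scaled)

# log-MGF of a sum of independent copies is the sum of the log-MGFs.
summed = sum_params([(TAU2, B)] * N_TERMS)
summed_ok = is_subexp(lambda lam: N_TERMS * log_mgf_centered_exp(lam), *summed)

print(base_ok, scaled, scaled_ok, summed, summed_ok)
```

The grid check is only evidence on a finite set of $\lambda$ values, not a proof, but it shows how the admissible range $|\lambda|\leq 1/b$ shrinks under scaling and is governed by the largest $b_{i}$ under summation.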

Theorem C.1 (Equivalent characterizations of sub-exponential variables).

For a zero-mean random variable $X$ , the following statements are equivalent:

1. There are non-negative numbers $(\nu,b^{-1})$ such that

[TABLE]

2. There is a positive number $c_{0}>0$ such that $\mathbb{E}[e^{\lambda X}]<\infty$ for all $|\lambda|<c_{0}$.

3. There are constants $c_{1},c_{2}>0$ such that

[TABLE]

4. The quantity $\gamma:=\sup_{k\geq 2}\left[\frac{\mathbb{E}[X^{k}]}{k!}\right]^{1/k}$ is finite.

Proof.

A proof of this theorem can be found in Wainwright (2015). ∎
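The moment characterization in the last item of Theorem C.1 can be made concrete on an example. For $X=E-1$ with $E$ an Exponential(1) variable, $\mathbb{E}[X^{k}]=k!\sum_{j=0}^{k}(-1)^{j}/j!$, which is the number of derangements of $k$ elements. The sketch below (hypothetical function names, truncated at a finite $k$, so it only gives a lower bound on $\gamma$) computes the normalized moment roots.

```python
import math


def derangements(k_max):
    """D_0..D_kmax via D_k = (k - 1) * (D_{k-1} + D_{k-2}); D_k = E[(E - 1)^k]."""
    d = [1, 0]
    for k in range(2, k_max + 1):
        d.append((k - 1) * (d[k - 1] + d[k - 2]))
    return d


def normalized_moment_roots(k_max):
    """Map k -> (E[X^k] / k!)^(1/k) for k = 2..k_max, with X a centered Exponential(1)."""
    d = derangements(k_max)
    return {
        k: (d[k] / math.factorial(k)) ** (1.0 / k)
        for k in range(2, k_max + 1)
    }


roots = normalized_moment_roots(40)
gamma_lower = max(roots.values())  # truncated supremum, a lower bound on gamma

for k in (2, 3, 4, 10, 40):
    print(k, roots[k])
print("max over k <= 40:", gamma_lower)
```

Since $\mathbb{E}[X^{k}]/k!$ stays between $1/3$ and $1/2$ for $k\geq 2$, every root is below 1, which is consistent with a finite $\gamma$ for this variable.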

Proposition C.2 (Maximum in $\mathbf{z}$ ).

Let $\bar{\mathbf{z}}$ be any configuration and $\mathbf{z}$ the $\sim$-equivalent configuration that achieves $\|\mathbf{z}-\mathbf{z}^{\star}\|_{0}=\|\bar{\mathbf{z}}-\mathbf{z}^{\star}\|_{0,\sim}$. Let $\widehat{y}_{q\ell}=\hat{y}_{q,\ell}(\mathbf{z})$ (resp. $\bar{y}_{q\ell}(\mathbf{z})$) and $\widehat{y}_{q\ell}^{\star}=\hat{y}_{q,\ell}(\mathbf{z}^{\star})$ (resp. $\bar{y}_{q\ell}^{\star}=\bar{y}_{q\ell}(\mathbf{z}^{\star})=\psi^{\prime}(\pi^{\star}_{q\ell})$) be as defined in Equations (3.1) and (6.3). Under the assumptions of Section 2.6, for all $\varepsilon\leq\kappa\bar{\sigma}^{2}$,

[TABLE]

Proof.

Denote $r=\|\mathbf{z}-\mathbf{z}^{\star}\|_{0}$. The numerator within the $\max$ in the fraction can be expanded to

[TABLE]

and is thus a sum of at most $N=nr$ non-null centered subexponential random variables with parameters $(a^{2},1/w)$. It is therefore a centered subexponential variable with parameters $(Na^{2},1/w)$. By Bernstein's inequality, for all $\varepsilon\leq\kappa a^{2}$ we have

[TABLE]

There are at most $n^{r}Q^{r}Q^{Q}$ configurations $\mathbf{z}$ at $\|.\|_{0,\sim}$-distance $r$ from $\mathbf{z}^{\star}$. A union bound shows that:

[TABLE]

where the last equality is true as soon as $n\varepsilon_{n}\gg\log n$ .

∎

Appendix D Likelihood ratio of assignments

Proposition D.1.

Let $\mathbf{z}^{\star}$ be $c/2$-regular and $\mathbf{z}$ at $\|.\|_{0}$-distance $c/4$ from $\mathbf{z}^{\star}$. Then, for all $\boldsymbol{\theta}\in\boldsymbol{\Theta}$,

[TABLE]

Proof.

Note that:

[TABLE]

where the first inequality comes from the definition of $\hat{\boldsymbol{\alpha}}(\mathbf{z})$ and the second from Lemma B.6 of Brault et al. (2020) and the fact that $\mathbf{z}^{\star}$ and $\mathbf{z}$ are $c/4$ -regular. Finally, local asymptotic normality of the MLE for multinomial proportions ensures that $\frac{p(\mathbf{z}^{\star};\hat{\boldsymbol{\alpha}}(\mathbf{z}^{\star}))}{p(\mathbf{z}^{\star};\boldsymbol{\alpha}^{\star})}=\mathcal{O}_{P}(1)$ .

∎

Bibliography


Aicher et al. (2014) C. Aicher, A. Z. Jacobs, and A. Clauset. Learning latent block structure in weighted networks. J. Compl. Net., 3(2):221–248, 2014.
Ambroise and Matias (2012) C. Ambroise and C. Matias. New consistent and asymptotically normal parameter estimates for random-graph mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(1):3–35, 2012.
Barbillon et al. (2015) P. Barbillon, S. Donnet, E. Lazega, and A. Bar-Hen. Stochastic block models for multiplex networks: an application to networks of researchers. J. R. Stat. Soc. C-Appl., 2015.
Bickel et al. (2013) P. Bickel, D. Choi, X. Chang, H. Zhang, et al. Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels. The Annals of Statistics, 41(4):1922–1943, 2013.
Brault et al. (2020) V. Brault, C. Keribin, and M. Mariadassou. Consistency and Asymptotic Normality of Latent Blocks Model Estimators. Electronic Journal of Statistics, 14(1):1234–1268, 2020.
Celisse et al. (2012) A. Celisse, J.-J. Daudin, L. Pierre, et al. Consistency of maximum-likelihood and variational estimators in the stochastic block model. Electronic Journal of Statistics, 6:1847–1899, 2012.
Chatterjee (2015) S. Chatterjee. Matrix estimation by universal singular value thresholding. The Annals of Statistics, 43(1):177–214, 2015.
Choi et al. (2012) D. S. Choi, P. J. Wolfe, and E. M. Airoldi. Stochastic blockmodels with growing number of classes. Biometrika, 99(2):273–284, 2012.
