missSBM: An R Package for Handling Missing Values in the Stochastic Block Model

Pierre Barbillon; Julien Chiquet; Timothée Tabouy

arXiv:1906.12201·stat.CO·May 28, 2021·J. Stat. Softw.

TL;DR

missSBM is an R package designed to fit stochastic block models to partially observed network data, accounting for missing values and external covariates, and includes methods for model selection and missing data imputation.

Contribution

The paper introduces missSBM, a novel R package that handles missing data in stochastic block models using variational inference and model selection techniques.

Findings

1. Effective imputation of missing network edges.

2. Automatic selection of the number of blocks using ICL.

3. Application to political blog interaction data.

Equations

$$Y_{ij} \mid Z_{i}, Z_{j} \overset{\text{ind}}{\sim} \mathcal{B}(\pi_{Z_{i} Z_{j}}), \quad \text{for all } (i,j) \in \mathcal{D},$$

Taxonomy

Topics: Complex Network Analysis Techniques · Bayesian Modeling and Causal Inference · Mental Health Research Topics

Full text

missSBM: An R Package for Handling Missing Values in the Stochastic Block Model

Pierre Barbillon

UMR MIA-Paris

Université Paris-Saclay

AgroParisTech

INRAE

Julien Chiquet

UMR MIA-Paris

Université Paris-Saclay

AgroParisTech

INRAE

Timothée Tabouy

UMR MIA-Paris

Université Paris-Saclay

AgroParisTech

INRAE

Abstract

The Stochastic Block Model (SBM) is a popular probabilistic model for random graphs. It is commonly used for clustering network data by aggregating nodes that share similar connectivity patterns into blocks. When fitting an SBM to a network which is partially observed, it is important to take into account the underlying process that generates the missing values, otherwise the inference may be biased. This paper introduces missSBM, an R package fitting the SBM when the network is partially observed, i.e., the adjacency matrix contains not only 1's or 0's encoding presence or absence of edges but also NA's encoding missing information between pairs of nodes. This package implements a set of algorithms for fitting the binary SBM, possibly in the presence of external covariates, by performing variational inference adapted to several observation processes. Our implementation automatically explores different block numbers to select the most relevant model according to the Integrated Classification Likelihood (ICL) criterion. The ICL criterion can also help determine which observation process better corresponds to a given dataset. Finally, missSBM can be used to perform imputation of missing entries in the adjacency matrix. We illustrate the package on a network data set consisting of interactions between political blogs sampled during the French presidential election in 2007.

Keywords: Network, Missing data, Stochastic Block Model

Address: Pierre Barbillon, Julien Chiquet & Timothée Tabouy

MIA Paris, Université Paris-Saclay, AgroParisTech, INRAE

1 Introduction

In many fields of science, networks are a natural way to represent interaction data. To cite a few examples, a network may represent social interactions such as friendship or collaboration between people in a social network, regulation between genes and their products in a gene regulatory network, or predation between animals in a food web. In this paper, we only consider networks which can be represented by graphs composed of binary edges connecting pairs of nodes (also referred to as dyads in the following).

Many software packages now perform network-related analyses. Unsurprisingly, the R community is extremely active in this area: the R programming language is well suited to data manipulation and visualization, and is thus appropriate for handling network data. Among the many available R packages related to networks, we suggest a classification into three groups (in addition to this brief typology, the interested reader may consult the CRAN task view on the related topic of graphical modeling; taskCRAN_GM):

i) Packages for representation, manipulation or visualization tasks, and packages computing descriptive statistics. We mention non-exhaustively the following top representatives: igraph (igraph), network and sna (network; sna).

ii) Packages learning the structure of a network from an external source of data, such as huge (huge), glasso (glasso), bnlearn (scutari2009learning) or bnstruct (bnstruct). These packages generally rely on a specific graphical modeling of the data (e.g., Gaussian graphical models (Lauritzen1996) in huge and glasso, or Bayesian networks (pearl2011bayesian) in bnlearn and bnstruct).

iii) Packages fitting (probabilistic) models on network data. The ergm package (ergm) fits the family of exponential random graph models (ERGM) introduced in ergm_model: it is part of the collection of tools around ERGM regrouped in the statnet metapackage (statnet); latentnet (latentnet) implements the latent space approach of hoff2002latent; mixer (mixer) and blockmodels (Leger2016) fit the Stochastic Block Model (SBM) when the distribution of the edges belongs to the exponential family (snijders1997estimation; Nowicki2001). Other R packages related to the SBM and its extensions include sbm (sbm), sbmr (sbmr), dynsbm (dynsbm), blockmodeling (blockmodeling), dBlockmodeling (dBlockmodeling), expSBM (expSBM), MLVSBM (MLVSBM), greed (greed), sbmSDP (sbmSDP), hergm (schweinberger2018hergm), lda (lda), graphon (graphon), GREMLINS (GREMLINS) and noisySBM (noisySBM). Some of these packages, as well as some implementations in other programming languages, are presented in the following.

The missSBM package which we introduce here belongs to the third category, that is, software that fits a specific probabilistic model on network data. More specifically, missSBM is dedicated to the estimation of the Stochastic Block Model (SBM), a mixture of Erdős-Rényi random graphs (Erdos1959) offering a high degree of heterogeneity in connectivity profiles (see abbe2017community, for a recent review). The SBM generally fits real-world network data well while keeping the advantage of being a probabilistic generative model (contrary to mechanistic approaches such as the Barabási-Albert model (albert2002statistical), defined by a preferential attachment algorithm). The main outcome of an SBM fit is a clustering of the nodes into "blocks", such that nodes within the same block share the same connectivity properties. To our knowledge, the reference package for fitting the SBM with the R programming system is blockmodels. It includes efficient implementations of variational algorithms to fit different flavors of the SBM, adapted to binary network data and valued networks, with optional covariates on the edges. Two other important extensions of the SBM are available as R packages: the degree-corrected Stochastic Block Model in randnet (randnet) and a dynamic version of the Stochastic Block Model in dynsbm (dynsbm). Beyond the R framework, there also exist Python packages and C++ libraries providing efficient code for particular variants of the SBM: the Python packages CommunityDetection (communityDetection) and BipartiteSBM (bipartiteSBM) are dedicated to the estimation of special network structures using various heuristics and network models, among which the SBM.

Beyond variational approaches, MCMC methods exist for inferring the SBM; they solve the exact problem but are generally more computationally demanding: the Python library graph-tool (graph-tool) includes an MCMC sampler to fit the binary SBM and its degree-corrected variant; the C++ libraries sbm_canonical_mcmc (sbmCanonicalMCMC) and bipartiteSBM-MCMC (bipartiteSbmMCMC) respectively implement an MCMC sampler for the SBM and the bipartite SBM. Finally, MODE-NET (modeNet) implements the belief propagation algorithm for inferring the degree-corrected SBM.

Despite their high quality, an important limitation of the aforementioned software is that it requires a fully observed network: missing values are not supported. The main feature of missSBM is to deal with cases where the network data is only partially observed. More precisely, we consider situations where the adjacency matrix of the network data contains not only 1's or 0's for presence or absence of an edge, but also NA's encoding missing information for some dyads. Note that this situation is different from the case considered in noisySBM: there, a similarity matrix is fully observed between all pairs of nodes, and the goal is to separate the 'true' interactions from noise by means of a dedicated SBM.
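To make the encoding concrete, here is a minimal base R sketch of a partially observed, undirected 4-node network. The matrix values and the variable names are hypothetical and chosen only for illustration; nothing here is part of the missSBM API.

```r
# Hypothetical undirected network on 4 nodes:
# 1 = edge observed, 0 = absence of edge observed, NA = dyad not observed.
# The diagonal is NA because self-edges are not considered.
Y <- matrix(c(NA,  1,  0, NA,
               1, NA, NA,  0,
               0, NA, NA,  1,
              NA,  0,  1, NA),
            nrow = 4, byrow = TRUE)

# For an undirected network, each dyad (i, j) with i < j is counted once.
upper <- upper.tri(Y)
n_observed <- sum(!is.na(Y) & upper)  # dyads with a recorded 0 or 1
n_missing  <- sum(is.na(Y) & upper)   # dyads encoded as NA

cat("observed dyads:", n_observed, "- missing dyads:", n_missing, "\n")
```

In this toy matrix, four of the six dyads carry a value and two are missing; a package requiring a fully observed network could not take `Y` as input without first discarding or filling the NA entries.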

When inferring the SBM from network data with missing values, it is important to take into account the underlying process that generates these missing values when estimating the model parameters; otherwise the estimation may be biased. More specifically, one has to identify whether the values are Missing At Random (MAR) or Missing Not At Random (MNAR) (see little2019statistical). This issue has been studied in the context of network data by handcock2010 for the ERGM and in our methodological paper (Tabouy2019) for the SBM. missSBM is an implementation of the methodology developed therein. It also considers new sampling designs and the inclusion of covariates simultaneously in the SBM and in the observation process, which was not studied by Tabouy2019. Specifically, missSBM implements variational algorithms in the vein of daudin2008mixture and Leger2016 for estimating the SBM, with or without covariates, under various missing data mechanisms. This includes cases of incomplete data where the inference can be made only on the observed part of the data (MAR), or cases where it is necessary to take the sampling design into account in the inference (MNAR).
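The following base R sketch illustrates why the missing-data mechanism matters. It uses the simplest possible setting, a single-block network where the only parameter is the edge density, and compares a naive estimate computed on the observed dyads under two observation processes. The MNAR mechanism used here (edges observed with probability 0.8, non-edges with probability 0.2) and all numerical values are assumptions made for illustration; they are not taken from the package.

```r
set.seed(1)

n <- 200
n_dyads <- n * (n - 1) / 2            # undirected network, dyads i < j
y <- rbinom(n_dyads, 1, 0.2)          # fully observed edge indicators

# MAR-type sampling: every dyad is observed with the same probability,
# regardless of whether an edge is present.
obs_mar <- rbinom(n_dyads, 1, 0.5) == 1

# MNAR-type sampling (hypothetical design): the probability of observing
# a dyad depends on the unobserved value itself.
obs_mnar <- rbinom(n_dyads, 1, ifelse(y == 1, 0.8, 0.2)) == 1

y_mar  <- ifelse(obs_mar,  y, NA)     # NA encodes a missing dyad
y_mnar <- ifelse(obs_mnar, y, NA)

density_true <- mean(y)
density_mar  <- mean(y_mar,  na.rm = TRUE)  # naive estimate, observed part only
density_mnar <- mean(y_mnar, na.rm = TRUE)  # same naive estimate under MNAR

print(round(c(true = density_true, MAR = density_mar, MNAR = density_mnar), 3))
```

Under the MAR-type design the naive estimate stays close to the true density, whereas under this MNAR-type design it is pulled toward 0.5, since edges are over-represented among the observed dyads. Correcting this requires modeling the observation process jointly with the network.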

Some frameworks deal with missing data from a cross-validation perspective rather than a sampling perspective. Cross-validation is used to perform model selection for networks, such as the choice of the number of blocks or communities (li2020network; chen2018network) or the choice of the latent structure (hoff2008modeling). These frameworks are therefore quite different from ours: cross-validation is done under MAR sampling, while our main goal is to infer an SBM under several MNAR sampling mechanisms.

The paper is organized as follows: Section 2 introduces the statistical framework of the binary SBM, with or without covariates, and summarizes the key points of its inference under missing data conditions. A subsequent section provides basic user guidelines for the main functions and classes of objects. We finally detail a case study which analyzes a network data set describing the French blogosphere during the period preceding the 2007 French presidential election, illustrating the most striking features of the package.

2 Statistical Framework

2.1 Binary Stochastic Block Model (SBM)

In an SBM, nodes from a set $\mathcal{N}\triangleq\{1,\ldots,n\}$ are distributed among a set $\mathcal{Q}\triangleq\{1,\ldots,Q\}$ of hidden blocks which model the latent structure of the graph. The group membership is described by independent categorical variables $(\mathbf{Z}_{i},i\in\mathcal{N})$ with multinomial distribution $\mathcal{M}(1,\boldsymbol{\alpha}=(\alpha_{1},\ldots,\alpha_{Q}))$. The probability of having an edge between any pair of nodes (or dyad) only depends on the blocks the two nodes belong to. Hence, the presence of an edge between $i$ and $j$, indicated by the binary variable $Y_{ij}$, is independent of the other edges conditionally on the latent blocks:

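$$Y_{ij} \mid Z_{i}, Z_{j} \overset{\text{ind}}{\sim} \mathcal{B}(\pi_{Z_{i} Z_{j}}), \quad \text{for all } (i,j) \in \mathcal{D},$$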

where $\mathcal{B}$ stands for the Bernoulli distribution and $\mathcal{D}$ for the set of dyads. This set is equal to $\{(i,j)\in\mathcal{N}^{2};\ i\neq j\}$ if the network is directed and to $\{(i,j)\in\mathcal{N}^{2};\ i<j\}$ otherwise (although self-edges $Y_{ii}$ could be defined in the SBM, they are not considered in missSBM since they are scarce in real data). In the following, we denote by ${\boldsymbol{\pi}}=\left(\pi_{q\ell}\right)_{(q,\ell)\in\mathcal{Q}^{2}}\in[0,1]^{Q^{2}}$ the connectivity matrix, $\boldsymbol{\alpha}\in\mathbb{D}^{Q}=\{(\alpha_{1},\ldots,\alpha_{Q})\in[0,1]^{Q};\ \alpha_{1}+\ldots+\alpha_{Q}=1\}$ the block proportions, ${\mathbf{Z}}=(\mathbf{Z}_{1},\ldots,\mathbf{Z}_{n})^{T}$ the $n\times Q$ membership matrix and $\mathbf{Y}=(Y_{ij})_{(i,j)\in\mathcal{D}}$ the $n\times n$ adjacency matrix. This matrix is binary, with a diagonal filled with NA's, and is symmetric if and only if the network is undirected. The vector encompassing all the unknown model parameters is $\boldsymbol{\theta}=(\boldsymbol{\alpha},\boldsymbol{\pi})$. A schematic representation of the binary SBM in the undirected case, highlighting the latent clustering, is given in the accompanying figure.
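The generative model above can be sketched in a few lines of base R for the undirected case. The values chosen for $n$, $Q$, $\boldsymbol{\alpha}$ and $\boldsymbol{\pi}$ are arbitrary, and the variable names (`alpha`, `pi_mat`, `Z`, `Y`) simply mirror the notation of this section; they are not part of the missSBM interface.

```r
set.seed(42)

n <- 60                                   # number of nodes
Q <- 3                                    # number of blocks
alpha  <- c(0.5, 0.3, 0.2)                # block proportions, sum to 1
pi_mat <- matrix(0.05, Q, Q)              # connectivity matrix (symmetric)
diag(pi_mat) <- 0.5                       # denser within-block connectivity

# Latent block labels: one categorical draw per node with probabilities alpha.
z <- sample(seq_len(Q), n, replace = TRUE, prob = alpha)
Z <- diag(Q)[z, ]                         # n x Q membership matrix (one-hot rows)

# Edge probabilities pi_{Z_i Z_j} for every pair of nodes.
P <- Z %*% pi_mat %*% t(Z)

# Draw Y_ij ~ Bernoulli(pi_{Z_i Z_j}) independently for dyads i < j,
# then symmetrize since the network is undirected.
Y <- matrix(0L, n, n)
upper <- upper.tri(Y)
Y[upper] <- rbinom(sum(upper), 1, P[upper])
Y <- Y + t(Y)
diag(Y) <- NA                             # self-edges are not considered

print(table(block = z))                   # realized block sizes
```

Drawing only the dyads with $i<j$ and then symmetrizing matches the definition of $\mathcal{D}$ for an undirected network; a directed network would instead draw every off-diagonal entry independently.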