Reliable data clustering with Bayesian community detection

Magnus Neuman; Jelena Smiljanić; Martin Rosvall

arXiv:2510.15013 · stat.ML · October 20, 2025 · 2 cites

Open Access

TL;DR

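This paper tests two Bayesian community detection methods, the Degree-Corrected Stochastic Block Model and the Regularized Map Equation, both grounded in the Minimum Description Length principle, as a way to cluster noisy similarity data more reliably than traditional methods such as hierarchical clustering, k-means, and WGCNA.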

Contribution

It shows that Bayesian community detection offers a principled, noise-resistant framework for clustering similarity data in fields such as neuroscience, genomics, systems biology, and ecology, uniting sparsification and clustering with model selection.

Findings

01. Outperforms traditional clustering in high-noise synthetic data

02. Identifies more robust gene modules in genomics data

03. Provides a unified, principled approach to clustering and sparsification

Abstract

From neuroscience and genomics to systems biology and ecology, researchers rely on clustering similarity data to uncover modular structure. Yet widely used clustering methods, such as hierarchical clustering, k-means, and WGCNA, lack principled model selection, leaving them susceptible to noise. A common workaround sparsifies a correlation matrix representation to remove noise before clustering, but this extra step introduces arbitrary thresholds that can distort the structure and lead to unreliable results. To detect reliable clusters, we capitalize on recent advances in network science to unite sparsification and clustering with principled model selection. We test two Bayesian community detection methods, the Degree-Corrected Stochastic Block Model and the Regularized Map Equation, both grounded in the Minimum Description Length principle for model selection. In synthetic data, they…
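
The threshold sensitivity mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' method or code: the synthetic data, the noise level, the candidate thresholds, and the helper names `threshold_graph`, `count_components`, and `sweep` are all assumptions made for illustration. It builds a correlation matrix from noisy data with three planted groups, sparsifies it at several cutoffs, and reports how the number of retained edges and connected components changes with the cutoff.

```python
import numpy as np


def threshold_graph(corr, tau):
    """Boolean adjacency keeping pairs with |correlation| >= tau, no self-loops."""
    adj = np.abs(corr) >= tau
    np.fill_diagonal(adj, False)
    return adj


def count_components(adj):
    """Count connected components with an iterative depth-first search."""
    n = adj.shape[0]
    seen = np.zeros(n, dtype=bool)
    components = 0
    for start in range(n):
        if seen[start]:
            continue
        components += 1
        seen[start] = True
        stack = [start]
        while stack:
            node = stack.pop()
            # Unvisited neighbours of the current node.
            for nb in np.flatnonzero(adj[node] & ~seen):
                seen[nb] = True
                stack.append(nb)
    return components


def sweep(corr, taus):
    """Return (tau, number of edges, number of components) for each threshold."""
    rows = []
    for tau in taus:
        adj = threshold_graph(corr, tau)
        rows.append((tau, int(adj.sum()) // 2, count_components(adj)))
    return rows


# Hypothetical synthetic data: 3 planted groups of 10 variables, 50 samples each.
rng = np.random.default_rng(0)
n_groups, per_group, n_samples = 3, 10, 50
signals = rng.normal(size=(n_groups, n_samples))
noise = 1.5 * rng.normal(size=(n_groups * per_group, n_samples))
data = np.repeat(signals, per_group, axis=0) + noise

corr = np.corrcoef(data)  # rows are variables, columns are samples
results = sweep(corr, [0.1, 0.3, 0.5, 0.7])
for tau, edges, comps in results:
    print(f"tau={tau:.1f}  edges={edges}  components={comps}")
```

Raising the cutoff can only remove edges, so the graph fragments as the threshold grows, and the apparent cluster structure depends on a value chosen by hand. The model-selection approach described in the abstract is meant to avoid exactly this kind of arbitrary choice.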

Taxonomy

Topics: Bioinformatics and Genomic Networks · Gene expression and cancer classification · Bayesian Methods and Mixture Models