Scalable Dirichlet Process Mixture Models with Unknown Concentration and Adaptive Covariance for High-Dimensional Clustering Applied to Leukemia Transcriptomics
Annesh Pal, Aguirre Mimoun, Rodolphe Thiébaut, Boris P. Hejblum

TL;DR
This paper introduces a scalable Dirichlet Process Mixture Model (DPMM) with adaptive covariance and an unknown concentration parameter, enabling effective high-dimensional clustering, demonstrated on leukemia transcriptomics data.
Contribution
It presents a novel collapsed variational inference approach for DPMMs with adaptive covariance and weakly informative priors on the concentration parameter and base distribution, improving convergence speed and clustering accuracy in high-dimensional data.
Findings
Faster convergence than MCMC methods on Gaussian data
Successfully identified known and novel sub-clusters in leukemia data
Robust performance demonstrated through sensitivity analyses
Abstract
We propose a novel method that performs adaptive clustering with a DPMM using collapsed variational inference (VI), while incorporating weakly informative priors for the DP concentration parameter alpha and the base distribution G0. We illustrate the importance of the G0 covariance structure and prior choice by considering different parameterisations of the data covariance matrix. On high-dimensional Gaussian simulations, our model demonstrates substantially faster convergence than a state-of-the-art MCMC slice sampler. We further evaluate performance on Negative Binomial simulations and conduct sensitivity analyses to assess robustness under realistic data conditions. Application to a publicly available leukemia transcriptomic data set comprising 72 samples and 2,194 gene expression measurements successfully recovers every known sub-type, all while identifying additional gene expression-based sub-clusters with meaningful biological…
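To illustrate why the concentration parameter alpha is worth treating as unknown, the following minimal sketch computes the prior expected number of occupied clusters under a Dirichlet process, using the standard identity E[K] = sum over i = 0..n-1 of alpha / (alpha + i). It is not the paper's inference procedure; the function name `expected_num_clusters` and the alpha values shown are illustrative assumptions, and n = 72 only mirrors the sample count quoted in the abstract.

```python
def expected_num_clusters(alpha: float, n: int) -> float:
    """Prior expected number of occupied clusters among n observations
    drawn from a Dirichlet process with concentration alpha.

    Uses the Chinese-restaurant-process identity:
        E[K] = sum_{i=0}^{n-1} alpha / (alpha + i)
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if n < 0:
        raise ValueError("n must be non-negative")
    return sum(alpha / (alpha + i) for i in range(n))


if __name__ == "__main__":
    n = 72  # sample count quoted in the abstract
    # Hypothetical alpha values, chosen only to show the sensitivity.
    for alpha in (0.1, 1.0, 10.0):
        k = expected_num_clusters(alpha, n)
        print(f"alpha={alpha:>4}: prior E[K] = {k:.2f}")
```

Because the implied number of clusters grows with alpha, fixing alpha in advance amounts to a strong prior statement about cluster count, which is one motivation for placing a prior on it instead.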
Taxonomy
Topics: Bayesian Methods and Mixture Models · Gaussian Processes and Bayesian Inference · Gene expression and cancer classification
