Model-based clustering of categorical data based on the Hamming distance
Raffaele Argiento, Edoardo Filippi-Mazzola, Lucia Paci

TL;DR
This paper introduces a Bayesian nonparametric mixture model for clustering unordered categorical data based on the Hamming distance. The number of clusters is inferred from the data rather than fixed in advance, and the authors report improved cluster recovery over existing methods.
Contribution
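The paper develops a Bayesian nonparametric mixture model whose kernels are built on the Hamming distance, together with a transdimensional blocked Gibbs sampler for full Bayesian inference on the number of clusters, their structure, and the group-specific parameters.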
Findings
Improved clustering recovery over existing methods
Effective Bayesian inference for unknown number of clusters
Demonstrated performance on simulated and real datasets
Abstract
A model-based approach is developed for clustering categorical data with no natural ordering. The proposed method exploits the Hamming distance to define a family of probability mass functions to model the data. The elements of this family are then considered as kernels of a finite mixture model with an unknown number of components. Conjugate Bayesian inference has been derived for the parameters of the Hamming distribution model. The mixture is framed in a Bayesian nonparametric setting, and a transdimensional blocked Gibbs sampler is developed to provide full Bayesian inference on the number of clusters, their structure, and the group-specific parameters, facilitating the computation with respect to customary reversible jump algorithms. The proposed model encompasses a parsimonious latent class model as a special case when the number of components is fixed. Model performances are…
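To make the idea of a Hamming-distance kernel concrete, here is a minimal sketch in Python. It assumes a kernel of the form p(x | c, sigma) proportional to exp(-d_H(x, c) / sigma), where c is a cluster "center" and sigma > 0 is a scale; this form, the single scale shared across attributes, and the names `hamming_distance`, `hamming_log_pmf`, and `n_levels` are illustrative assumptions, not necessarily the paper's exact parameterization. Under this assumption the normalizing constant factorizes over attributes, so the log-probability is cheap to evaluate.

```python
import math
from typing import Hashable, Sequence


def hamming_distance(x: Sequence[Hashable], c: Sequence[Hashable]) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(c):
        raise ValueError("x and c must have the same length")
    return sum(xj != cj for xj, cj in zip(x, c))


def hamming_log_pmf(
    x: Sequence[Hashable],
    center: Sequence[Hashable],
    sigma: float,
    n_levels: Sequence[int],
) -> float:
    """Log-probability of x under the ASSUMED kernel exp(-d_H(x, center) / sigma).

    n_levels[j] is the number of categories of attribute j (hypothetical
    argument name). Each attribute contributes a normalizing factor
    1 + (m_j - 1) * exp(-1 / sigma): one matching category plus
    m_j - 1 mismatching ones.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    if not (len(x) == len(center) == len(n_levels)):
        raise ValueError("x, center and n_levels must have the same length")
    log_p = 0.0
    for xj, cj, mj in zip(x, center, n_levels):
        mismatch = 1.0 if xj != cj else 0.0
        log_norm = math.log1p((mj - 1) * math.exp(-1.0 / sigma))
        log_p += -mismatch / sigma - log_norm
    return log_p
```

With this kernel, probability mass concentrates on the center as sigma shrinks and spreads toward the uniform distribution as sigma grows, which is the behavior one would expect from a location-scale family on categorical data.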
Taxonomy
Topics: Bayesian Methods and Mixture Models · Advanced Clustering Algorithms Research · Statistical Methods and Bayesian Inference
