GALILEO: A Generalized Low-Entropy Mixture Model
Cetin Savkli, Jeffrey Lin, Philip Graff, Matthew Kinsey

TL;DR
GALILEO is an entropy-based mixture model for clustering categorical data. It consistently finds high-quality clusters, and its runtime scales linearly with the number of records, which makes it suitable for very large datasets.
Contribution
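It proposes an entropy-based density metric in categorical space and an annealing process that prunes high-entropy, low-density components from an initial state with many components, improving mixture-model clustering of categorical attributes while scaling linearly in the number of records.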
Findings
Consistently finds high-quality clusters
Consistently arrives at the same optimal number of clusters
Scales linearly with the number of records in the dataset
Abstract
We present a new method of generating mixture models for data with categorical attributes. The keys to this approach are an entropy-based density metric in categorical space and annealing of high-entropy/low-density components from an initial state with many components. Pruning of low-density components using the entropy-based density allows GALILEO to consistently find high-quality clusters and the same optimal number of clusters. GALILEO has shown promising results on a range of test datasets commonly used for categorical clustering benchmarks. We demonstrate that the scaling of GALILEO is linear in the number of records in the dataset, making this method suitable for very large categorical datasets.
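To make the pruning idea concrete, here is a minimal sketch of scoring components by entropy and dropping the high-entropy ones. It is an illustration under stated assumptions, not the paper's method: the abstract does not give the form of the entropy-based density metric, so plain Shannon entropy summed over attributes stands in for it, and the names `attribute_entropy`, `component_entropy`, `prune_components`, and the `max_entropy` threshold are all hypothetical.

```python
import math
from collections import Counter


def attribute_entropy(values):
    """Shannon entropy, in bits, of one categorical attribute."""
    counts = Counter(values)
    total = len(values)
    # Sum p * log2(1/p) over the observed categories.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def component_entropy(records):
    """Sum of per-attribute entropies over the records in one component.

    Assumption: a stand-in for the paper's entropy-based density metric,
    whose exact definition is not given in the abstract.
    """
    columns = zip(*records)  # transpose rows into attribute columns
    return sum(attribute_entropy(list(col)) for col in columns)


def prune_components(components, max_entropy):
    """Keep only components whose entropy is at or below max_entropy."""
    return {
        name: records
        for name, records in components.items()
        if component_entropy(records) <= max_entropy
    }


# Two toy components over two categorical attributes.
components = {
    "tight": [("red", "small"), ("red", "small"), ("red", "small")],
    "diffuse": [("red", "small"), ("blue", "large"),
                ("red", "large"), ("blue", "small")],
}
kept = prune_components(components, max_entropy=1.0)
```

In this toy run the "tight" component has zero entropy because every record agrees on every attribute, while the "diffuse" component has one bit per attribute and is pruned. A full annealing procedure would repeat this step while reassigning records to the surviving components, which the sketch leaves out.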
Taxonomy
Topics: Bayesian Methods and Mixture Models · Image and Signal Denoising Methods · Gaussian Processes and Bayesian Inference
Methods: Pruning
