Training Gaussian Mixture Models at Scale via Coresets
Mario Lucic, Matthew Faulkner, Andreas Krause, Dan Feldman

TL;DR
This paper shows how to construct coresets (small, weighted subsets of the data) for Gaussian mixture models, so that a model fit on the coreset is also provably a good fit for the full data set, making training on massive data sets tractable.
Contribution
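The paper presents the first coresets for Gaussian mixtures whose size is polynomial in the dimension and the number of mixture components and independent of the data set size. The coresets can be constructed efficiently in both distributed and streaming settings, and the analysis reduces statistical estimation to problems in computational geometry.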
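As a rough illustration of what a coreset guarantee of this kind typically looks like, the multiplicative form below is a sketch only. The symbols (D for the full data set, C for the weighted coreset with weights γ, θ for the mixture parameters, ε for the approximation error) are notation assumed here, and the paper's exact statement, including the precise cost function and which mixtures θ are admissible, may differ.

```latex
% Generic multiplicative coreset guarantee (illustrative notation, not quoted from the paper)
\[
  (1-\varepsilon)\,\mathcal{L}(D \mid \theta)
  \;\le\; \mathcal{L}(C \mid \theta)
  \;\le\; (1+\varepsilon)\,\mathcal{L}(D \mid \theta)
  \qquad \text{for all admissible } \theta,
\]
\[
  \mathcal{L}(D \mid \theta) = -\sum_{x \in D} \ln P(x \mid \theta),
  \qquad
  \mathcal{L}(C \mid \theta) = -\sum_{(\gamma,\, x) \in C} \gamma \,\ln P(x \mid \theta).
\]
```

The key point is that the bound holds simultaneously for every admissible θ, which is what lets an optimizer run on C in place of D.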
Findings
Training on a coreset takes significantly less time than training on the full data set.
Models fit on a coreset remain a close approximation to models fit on the full data.
The method is applicable to real-world data sets.
Abstract
How can we train a statistical mixture model on a massive data set? In this work we show how to construct coresets for mixtures of Gaussians. A coreset is a weighted subset of the data, which guarantees that models fitting the coreset also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size polynomial in dimension and the number of mixture components, while being independent of the data set size. Hence, one can harness computationally intensive algorithms to compute a good approximation on a significantly smaller data set. More importantly, such coresets can be efficiently constructed both in distributed and streaming settings and do not impose restrictions on the data generating process. Our results rely on a novel reduction of statistical estimation to problems in computational geometry and new combinatorial…
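To make the idea of a weighted subset concrete, here is a minimal Python sketch that compares the negative log-likelihood of a fixed Gaussian mixture on a full data set with the weighted negative log-likelihood on a small subset. It is not the paper's construction: the subset is drawn uniformly at random with equal weights n/m purely as a stand-in, and the function names (`gmm_log_density`, `weighted_nll`, `uniform_weighted_subset`) and the synthetic data are hypothetical choices made for this illustration.

```python
import numpy as np


def gmm_log_density(X, mix_weights, means, covs):
    """Log-density of each row of X under a Gaussian mixture."""
    n, d = X.shape
    log_terms = []
    for w, mu, cov in zip(mix_weights, means, covs):
        chol = np.linalg.cholesky(cov)             # cov = chol @ chol.T
        z = np.linalg.solve(chol, (X - mu).T)      # shape (d, n)
        maha = np.sum(z ** 2, axis=0)              # squared Mahalanobis distances
        log_det = 2.0 * np.sum(np.log(np.diag(chol)))
        log_terms.append(
            np.log(w) - 0.5 * (d * np.log(2.0 * np.pi) + log_det + maha)
        )
    log_terms = np.stack(log_terms, axis=0)        # shape (k, n)
    m = log_terms.max(axis=0)                      # log-sum-exp for stability
    return m + np.log(np.exp(log_terms - m).sum(axis=0))


def weighted_nll(X, point_weights, mix_weights, means, covs):
    """Weighted negative log-likelihood: -sum_i gamma_i * ln P(x_i | theta)."""
    return -np.sum(point_weights * gmm_log_density(X, mix_weights, means, covs))


def uniform_weighted_subset(X, m, rng):
    """Stand-in for a coreset: m uniform samples, each with weight n / m.

    Assumption for illustration only; the paper's construction is different.
    """
    idx = rng.choice(len(X), size=m, replace=False)
    return X[idx], np.full(m, len(X) / m)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic two-component data in 2 dimensions (hypothetical example data).
    X = np.vstack([rng.normal(-2.0, 1.0, (10000, 2)),
                   rng.normal(3.0, 1.0, (10000, 2))])
    theta = (np.array([0.5, 0.5]),
             np.array([[-2.0, -2.0], [3.0, 3.0]]),
             np.array([np.eye(2), np.eye(2)]))

    subset, gamma = uniform_weighted_subset(X, 500, rng)
    full = weighted_nll(X, np.ones(len(X)), *theta)
    approx = weighted_nll(subset, gamma, *theta)
    print("relative error:", abs(approx - full) / full)
```

Uniform sampling can work for a single well-behaved θ, as here, but it offers no guarantee that holds for all candidate models at once, which is the property the abstract attributes to coresets.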
Taxonomy
Topics: Gaussian Processes and Bayesian Inference · Bayesian Methods and Mixture Models · Machine Learning and Algorithms
