Scalable Estimation of Dirichlet Process Mixture Models on Distributed Data
Ruohui Wang, Dahua Lin

TL;DR
This paper introduces a distributed estimation method for Dirichlet Process Mixture Models in which each computing node creates new components locally and components corresponding to the same cluster are merged through a probabilistic consolidation scheme, keeping estimation consistent at very low communication cost.
Contribution
The paper presents a novel distributed estimation approach for DPMMs that allows local creation of components and probabilistic merging, reducing communication costs and maintaining consistency.
Findings
Achieves high scalability in distributed environments
Maintains estimation consistency with low communication overhead
Performs well on large real-world datasets
Abstract
We consider the estimation of Dirichlet Process Mixture Models (DPMMs) in distributed environments, where data are distributed across multiple computing nodes. A key advantage of Bayesian nonparametric models such as DPMMs is that they allow new components to be introduced on the fly as needed. This, however, poses an important challenge to distributed estimation -- how to handle new components efficiently and consistently. To tackle this problem, we propose a new estimation method, which allows new components to be created locally in individual computing nodes. Components corresponding to the same cluster will be identified and merged via a probabilistic consolidation scheme. In this way, we can maintain the consistency of estimation with very low communication cost. Experiments on large real-world data sets show that the proposed method can achieve high scalability in distributed and…
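The idea of creating components locally and merging those that describe the same cluster can be illustrated with a toy sketch. This is not the paper's consolidation scheme. It swaps in a simple greedy Bayes-factor merge over additive sufficient statistics, assuming 1-D data, a Gaussian likelihood with known variance `SIGMA2`, and a zero-mean Gaussian prior with variance `TAU2` on each component mean. All names (`Stats`, `merge_score`, `consolidate`) are hypothetical.

```python
import math
from dataclasses import dataclass

SIGMA2 = 1.0    # assumed known observation variance (illustrative)
TAU2 = 100.0    # assumed prior variance of a component mean (illustrative)


@dataclass
class Stats:
    """Sufficient statistics of one component: count, sum, sum of squares."""
    n: int = 0
    s: float = 0.0
    ss: float = 0.0

    def add(self, x):
        self.n += 1
        self.s += x
        self.ss += x * x

    def merged(self, other):
        # Sufficient statistics are additive, so a merge needs no raw data.
        return Stats(self.n + other.n, self.s + other.s, self.ss + other.ss)


def from_points(xs):
    c = Stats()
    for x in xs:
        c.add(x)
    return c


def log_marginal(c):
    """Log marginal likelihood of a component's points with the mean integrated out."""
    return (
        -0.5 * c.n * math.log(2.0 * math.pi * SIGMA2)
        - 0.5 * math.log(1.0 + c.n * TAU2 / SIGMA2)
        - c.ss / (2.0 * SIGMA2)
        + TAU2 * c.s ** 2 / (2.0 * SIGMA2 * (SIGMA2 + c.n * TAU2))
    )


def merge_score(a, b):
    """Log Bayes factor of 'one shared component' versus 'two separate components'."""
    return log_marginal(a.merged(b)) - log_marginal(a) - log_marginal(b)


def consolidate(components):
    """Greedily merge the best-scoring pair until no pair has a positive score."""
    pool = list(components)
    while True:
        best = None
        for i in range(len(pool)):
            for j in range(i + 1, len(pool)):
                score = merge_score(pool[i], pool[j])
                if score > 0 and (best is None or score > best[0]):
                    best = (score, i, j)
        if best is None:
            return pool
        _, i, j = best
        merged = pool[i].merged(pool[j])
        pool = [c for k, c in enumerate(pool) if k not in (i, j)] + [merged]


# Two hypothetical workers, each having created its own local components.
worker_a = [from_points([-0.1, 0.2, 0.0]), from_points([9.9, 10.1])]
worker_b = [from_points([0.1, -0.2]), from_points([10.0, 10.2, 9.8])]
global_components = consolidate(worker_a + worker_b)
```

In this toy run the four local components collapse into two global ones of five points each. Only the three numbers in each `Stats` object would need to cross the network, which is the sense in which local creation plus later merging can keep communication low.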
