An Asynchronous Distributed Expectation Maximization Algorithm For Massive Data: The DEM Algorithm
Sanvesh Srivastava, Glen DePalma, and Chuanhai Liu

TL;DR
The paper introduces DEM, an asynchronous distributed EM algorithm for massive datasets. It distributes the E step across worker processes and runs the M step on manager processes. The authors report that it is significantly faster than existing EM algorithms in simulations while maintaining similar accuracy.
Contribution
It presents a novel distributed EM algorithm that enables scalable, asynchronous processing for large data, extending EM's applicability to massive datasets with proven convergence properties.
Findings
DEM is significantly faster than existing EM algorithms in simulations.
DEM maintains similar accuracy to traditional EM methods.
DEM performs well on a large-scale movie ratings dataset.
Abstract
The family of Expectation-Maximization (EM) algorithms provides a general approach to fitting flexible models for large and complex data. The expectation (E) step of EM-type algorithms is time-consuming in massive data applications because it requires multiple passes through the full data. We address this problem by proposing an asynchronous and distributed generalization of the EM called the Distributed EM (DEM). Using DEM, existing EM-type algorithms are easily extended to massive data settings by exploiting the divide-and-conquer technique and widely available computing power, such as grid computing. The DEM algorithm reserves two groups of computing processes called workers and managers for performing the E step and the maximization step (M step), respectively. The samples are randomly partitioned into a large number of disjoint subsets and are stored on the worker…
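The worker/manager split described in the abstract can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the authors' implementation: the model (a two-component 1-D Gaussian mixture), the function names `worker_e_step`, `manager_m_step`, and `dem_sketch`, and the `fraction` parameter are all hypothetical. In particular, the rule that only a random fraction of workers refreshes its statistics in each iteration is an assumption used here to mimic asynchrony; the truncated abstract does not specify the update rule.

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def worker_e_step(subset, params):
    """E step on one worker's subset: sufficient statistics per component."""
    weights, means, variances = params
    k = len(weights)
    stats = [[0.0, 0.0, 0.0] for _ in range(k)]  # [sum r, sum r*x, sum r*x^2]
    for x in subset:
        dens = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
        total = max(sum(dens), 1e-300)  # guard against numerical underflow
        for j in range(k):
            r = dens[j] / total  # responsibility of component j for x
            stats[j][0] += r
            stats[j][1] += r * x
            stats[j][2] += r * x * x
    return stats

def manager_m_step(stored_stats, k):
    """M step on the manager: pool the latest statistics held for every worker."""
    pooled = [[sum(s[j][i] for s in stored_stats) for i in range(3)] for j in range(k)]
    n = sum(p[0] for p in pooled)
    weights = [p[0] / n for p in pooled]
    means = [p[1] / p[0] for p in pooled]
    variances = [max(p[2] / p[0] - m * m, 1e-6) for p, m in zip(pooled, means)]
    return weights, means, variances

def dem_sketch(data, n_workers=10, fraction=0.5, n_iter=100, seed=0):
    """Two-component mixture fit; `fraction` of workers refresh per iteration (assumption)."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    # Random partition into disjoint subsets, one per simulated worker.
    subsets = [shuffled[i::n_workers] for i in range(n_workers)]
    params = ([0.5, 0.5], [min(data), max(data)], [1.0, 1.0])
    stored = [worker_e_step(s, params) for s in subsets]  # one initial full pass
    n_active = max(1, int(fraction * n_workers))
    for _ in range(n_iter):
        params = manager_m_step(stored, 2)
        # Only some workers report back; the rest keep their older statistics.
        for w in rng.sample(range(n_workers), n_active):
            stored[w] = worker_e_step(subsets[w], params)
    return params

# Usage: synthetic data from two well-separated Gaussians.
data_rng = random.Random(1)
data = [data_rng.gauss(-2, 1) for _ in range(500)] + [data_rng.gauss(3, 1) for _ in range(500)]
weights, means, variances = dem_sketch(data)
```

The manager never touches the raw samples in this sketch; it only sums the small per-worker statistics tables, which is what makes the M step cheap relative to the E step. A real deployment would replace the inner loop with actual worker processes communicating with the manager.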
Taxonomy
Topics: Advanced Bandit Algorithms Research · Cloud Computing and Resource Management · Gaussian Processes and Bayesian Inference
