Scalable Model-Based Clustering with Sequential Monte Carlo
Connie Trojan, Pavel Myshkov, Paul Fearnhead, James Hensman, Tom Minka, Christopher Nemeth

TL;DR
This paper introduces a scalable Sequential Monte Carlo (SMC) algorithm for online clustering. By decomposing a clustering problem into approximately independent subproblems, it represents the algorithm state more compactly, so it scales to large problems whose clusters follow complex distributions, as with text data.
Contribution
A novel SMC algorithm that decomposes clustering problems into approximately independent subproblems, allowing a more compact representation of the algorithm state and lower memory use on large-scale problems with complex cluster distributions.
Findings
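Accurately and efficiently solves clustering problems arising in knowledge base construction.
Represents and updates uncertainty over cluster assignments as data arrive, without the memory cost of standard SMC.
Handles settings where traditional SMC struggles, including clusters with complex distributions.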
Abstract
In online clustering problems, there is often a large amount of uncertainty over possible cluster assignments that cannot be resolved until more data are observed. This difficulty is compounded when clusters follow complex distributions, as is the case with text data. Sequential Monte Carlo (SMC) methods give a natural way of representing and updating this uncertainty over time, but have prohibitive memory requirements for large-scale problems. We propose a novel SMC algorithm that decomposes clustering problems into approximately independent subproblems, allowing a more compact representation of the algorithm state. Our approach is motivated by the knowledge base construction problem, and we show that our method is able to accurately and efficiently solve clustering problems in this setting and others where traditional SMC struggles.
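To make the memory issue concrete, here is a minimal sketch of a generic particle filter for online clustering. It is not the algorithm proposed in the paper. It assumes a Chinese restaurant process prior over assignments, one-dimensional Gaussian clusters with known variance and a zero-mean normal prior on each cluster mean, and multinomial resampling at every step. All function names and parameter values are illustrative.

```python
import math
import random


def log_normal_pdf(x, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def predictive_logpdf(x, n, total, obs_var=1.0, prior_var=10.0):
    """Log posterior-predictive density of x for a cluster holding n points
    whose sum is `total` (assumed model: known-variance normal likelihood,
    zero-mean normal prior on the cluster mean)."""
    post_var = 1.0 / (1.0 / prior_var + n / obs_var)
    post_mean = post_var * total / obs_var
    return log_normal_pdf(x, post_mean, post_var + obs_var)


def smc_cluster(data, n_particles=200, alpha=1.0, seed=0):
    """Generic SMC for online clustering (illustrative, not the paper's method).

    Each particle is one hypothesis about all cluster assignments so far.
    """
    rng = random.Random(seed)
    particles = [{"z": [], "stats": []} for _ in range(n_particles)]

    for x in data:
        log_weights = []
        for p in particles:
            stats = p["stats"]  # per-cluster [count, sum]
            # Unnormalised log posterior over assignments: CRP prior times
            # predictive likelihood, for each existing cluster and a new one.
            logs = [math.log(n) + predictive_logpdf(x, n, s) for n, s in stats]
            logs.append(math.log(alpha) + predictive_logpdf(x, 0, 0.0))
            m = max(logs)
            probs = [math.exp(v - m) for v in logs]
            # Incremental weight: predictive probability of x under this
            # particle (the CRP normaliser is shared, so it is dropped).
            log_weights.append(m + math.log(sum(probs)))
            # Propose the assignment from its conditional posterior.
            k = rng.choices(range(len(probs)), weights=probs)[0]
            if k == len(stats):
                stats.append([0, 0.0])
            stats[k][0] += 1
            stats[k][1] += x
            p["z"].append(k)

        # Multinomial resampling; every survivor is a full copy of its parent.
        m = max(log_weights)
        weights = [math.exp(v - m) for v in log_weights]
        idx = rng.choices(range(n_particles), weights=weights, k=n_particles)
        particles = [
            {"z": list(particles[j]["z"]),
             "stats": [list(s) for s in particles[j]["stats"]]}
            for j in idx
        ]
    return particles


def coclustering_prob(particles, i, j):
    """Fraction of particles that place observations i and j together."""
    return sum(p["z"][i] == p["z"][j] for p in particles) / len(particles)


data = [-5.1, -4.9, -5.0, 5.0, 5.2, 4.8]
particles = smc_cluster(data)
print(coclustering_prob(particles, 0, 1), coclustering_prob(particles, 0, 3))
```

In this naive sketch every particle carries its own copy of all assignments, so memory grows with the number of particles times the number of observations. That is one way to see why standard SMC becomes expensive at scale, and why a more compact state representation is attractive.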
