Large-scale entity resolution via microclustering Ewens--Pitman random partitions
Mario Beraha, Stefano Favaro

TL;DR
This paper introduces a novel microclustering Ewens--Pitman model for large-scale entity resolution, enabling efficient inference with significant speed improvements while preserving competitive accuracy.
Contribution
The paper develops the microclustering Ewens--Pitman model, in which the strength parameter of the Ewens--Pitman model scales linearly with the sample size, and proposes efficient variational inference schemes for posterior computation in entity resolution.
Findings
Achieves a speed-up of three orders of magnitude over existing Bayesian methods for entity resolution.
Maintains competitive empirical performance in entity resolution.
Demonstrates the microclustering property: the size of the largest cluster grows sub-linearly with the sample size, while the number of clusters grows linearly.
Abstract
We introduce the microclustering Ewens--Pitman model for random partitions, obtained by scaling the strength parameter of the Ewens--Pitman model linearly with the sample size. The resulting random partition is shown to have the microclustering property, namely: the size of the largest cluster grows sub-linearly with the sample size, while the number of clusters grows linearly. By leveraging the interplay between the Ewens--Pitman random partition and the Pitman--Yor process, we develop efficient variational inference schemes for posterior computation in entity resolution. Our approach achieves a speed-up of three orders of magnitude over existing Bayesian methods for entity resolution, while maintaining competitive empirical performance.
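The scaling described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' code: it samples a partition with the standard sequential (Chinese-restaurant-style) predictive rule of the Ewens--Pitman model with discount alpha and strength theta, and then sets theta = lam * n. The names sample_ep_partition, microclustering_summary, and the proportionality constant lam are hypothetical, introduced only for this illustration, and the sketch assumes theta > 0.

```python
import random


def sample_ep_partition(n, alpha, theta, rng):
    """Sample cluster sizes for n items from the Ewens--Pitman predictive rule.

    Assumes 0 <= alpha < 1 and theta > 0. With i items already placed in k
    clusters, the next item opens a new cluster with probability
    (theta + alpha * k) / (theta + i); otherwise it joins cluster j with
    probability proportional to sizes[j] - alpha.
    """
    sizes = []
    for i in range(n):  # i = number of items already placed
        k = len(sizes)
        p_new = (theta + alpha * k) / (theta + i)
        if rng.random() < p_new:
            sizes.append(1)
            continue
        # Total weight of existing clusters is sum(s - alpha) = i - alpha * k.
        u = rng.random() * (i - alpha * k)
        acc = 0.0
        for j, s in enumerate(sizes):
            acc += s - alpha
            if u < acc:
                sizes[j] += 1
                break
        else:
            # Guard against floating-point round-off at the upper boundary.
            sizes[-1] += 1
    return sizes


def microclustering_summary(n, lam, alpha=0.0, seed=0):
    """Return (number of clusters, largest cluster size) with theta = lam * n."""
    rng = random.Random(seed)
    sizes = sample_ep_partition(n, alpha, lam * n, rng)
    return len(sizes), max(sizes)


if __name__ == "__main__":
    # Hypothetical settings chosen only to keep the run short.
    for n in (500, 1000, 2000):
        k, largest = microclustering_summary(n, lam=1.0)
        print(f"n={n}: clusters={k}, largest cluster={largest}")
```

Running the script for increasing n should show the number of clusters growing roughly in proportion to n while the largest cluster stays small, which is the behaviour the abstract calls the microclustering property. As a sanity check for the special case alpha = 0, the expected number of clusters under this rule is the sum over i = 0, ..., n-1 of theta / (theta + i), which for theta = lam * n is approximately n * lam * log(1 + 1/lam).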
Taxonomy
Topics: Data-Driven Disease Surveillance · Data Quality and Management · Machine Learning in Healthcare
