SPER: Accelerating Progressive Entity Resolution via Stochastic Bipartite Maximization
Dimitrios Karapiperis, George Papadakis, Vassilios Verykios

TL;DR
SPER introduces a stochastic sampling approach to accelerate progressive entity resolution, enabling linear-time prioritization that significantly outperforms existing methods in speed while maintaining accuracy.
Contribution
The paper proposes a stochastic bipartite maximization framework that replaces global sorting of candidate pairs, making progressive ER fast and scalable enough for high-velocity data streams.
Findings
SPER achieves 3x to >6x speedup over baselines.
Maintains comparable recall and precision to state-of-the-art methods.
Scales effectively to high-velocity data streams.
Abstract
Entity Resolution (ER) is a critical data cleaning task for identifying records that refer to the same real-world entity. In the era of Big Data, traditional batch ER is often infeasible due to volume and velocity constraints, necessitating Progressive ER methods that maximize recall within a limited computational budget. However, existing progressive approaches fail to scale to high-velocity streams because they rely on deterministic sorting to prioritize candidate pairs, a process that incurs prohibitive super-linear complexity and heavy initialization costs. To address this scalability wall, we introduce SPER (Stochastic Progressive ER), a novel framework that redefines prioritization as a sampling problem rather than a ranking problem. By replacing global sorting with a continuous stochastic bipartite maximization strategy, SPER acts as a probabilistic high-pass filter that selects…
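To make the contrast between ranking and sampling concrete, here is a minimal Python sketch. It is not the authors' algorithm, since the abstract is truncated and does not give SPER's actual procedure. It only illustrates the general idea of a probabilistic high-pass filter under stated assumptions: each candidate pair carries a similarity weight in [0, 1], and a pair is emitted with probability equal to that weight in a single linear pass, so no global sort is needed. The function name `stochastic_filter`, the `(left, right, weight)` tuple layout, and the `budget` parameter are all hypothetical.

```python
import random


def stochastic_filter(pairs, budget, seed=0):
    """Select candidate pairs in one linear pass, without sorting.

    Assumption (not from the paper): each pair is (left_id, right_id, weight)
    with weight in [0, 1], and a pair is kept with probability equal to its
    weight. High-weight pairs pass far more often than low-weight ones.
    """
    rng = random.Random(seed)
    selected = []
    for left, right, weight in pairs:
        if len(selected) >= budget:
            break  # the comparison budget is exhausted
        # rng.random() is in [0.0, 1.0), so weight 1.0 always passes
        # and weight 0.0 never does.
        if rng.random() < weight:
            selected.append((left, right, weight))
    return selected


if __name__ == "__main__":
    candidates = [("a1", "b1", 0.95), ("a2", "b7", 0.10), ("a3", "b3", 0.80)]
    for pair in stochastic_filter(candidates, budget=2):
        print(pair)
```

A sort-based prioritizer would need all candidates up front and O(n log n) time before emitting the first pair. A filter of this kind touches each pair once and can start emitting immediately, at the cost of occasionally passing a low-weight pair or skipping a high-weight one.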
Taxonomy
Topics: Data Quality and Management · Topic Modeling · Time Series Analysis and Forecasting
