Efficient and Effective ER with Progressive Blocking

Sainyam Galhotra; Donatella Firmani; Barna Saha; Divesh Srivastava

arXiv:2005.14326 · cs.DB · March 17, 2021

TL;DR

This paper introduces pBlocking, a progressive blocking method for entity resolution (ER) that balances efficiency and effectiveness adaptively by feeding partial ER output back into the blocking step, with reported gains of 5x in efficiency and up to 60% in overall F-score.

Contribution

The paper proposes a novel progressive blocking approach that dynamically refines blocking results using partial ER feedback, applicable across various cluster size distributions.

Findings

01

pBlocking improves ER efficiency by 5x

02

pBlocking enhances ER effectiveness by 60%

03

Overall F-score of ER increases up to 60%

Abstract

Blocking, a mechanism for improving the efficiency of Entity Resolution (ER), aims to quickly prune out all non-matching record pairs. However, depending on the distributions of entity cluster sizes, existing techniques can be either (a) too aggressive, such that they help scale but can adversely affect the ER effectiveness, or (b) too permissive, potentially harming ER efficiency. In this paper, we propose a new methodology of progressive blocking (pBlocking) to enable both efficient and effective ER, which works seamlessly across different entity cluster size distributions. pBlocking is based on the insight that the effectiveness-efficiency trade-off is revealed only when the output of ER starts to be available. Hence, pBlocking leverages partial ER output in a feedback loop to refine the blocking result in a data-driven fashion. Specifically, we bootstrap pBlocking with…
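The feedback-loop idea in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm: the function names (`token_blocks`, `progressive_blocking`), the token-based bootstrap, the match-rate scoring rule, and the `rounds` and `budget` parameters are all assumptions chosen for illustration. It only shows the general shape of the loop: start from a traditional blocking, compare a limited batch of pairs, and use those partial ER results to re-score the blocks.

```python
from collections import defaultdict
from itertools import combinations


def token_blocks(records):
    """Bootstrap step (assumed): one block per token, a simple traditional blocking."""
    blocks = defaultdict(set)
    for rid, text in records.items():
        for token in text.lower().split():
            blocks[token].add(rid)
    # Blocks with a single record yield no pairs, so drop them.
    return {key: ids for key, ids in blocks.items() if len(ids) > 1}


def progressive_blocking(records, is_match, rounds=3, budget=4):
    """Toy feedback loop: compare a few pairs per round, then re-score blocks."""
    blocks = token_blocks(records)
    # Initial guess (assumed heuristic): small blocks are more likely to hold matches.
    scores = {key: 1.0 / len(ids) for key, ids in blocks.items()}
    compared, matches = set(), set()

    for _ in range(rounds):
        # Rank uncompared pairs by the best score of any block containing them.
        pair_score = {}
        for key, ids in blocks.items():
            for pair in combinations(sorted(ids), 2):
                if pair not in compared:
                    pair_score[pair] = max(pair_score.get(pair, 0.0), scores[key])
        if not pair_score:
            break
        batch = sorted(pair_score, key=lambda p: (-pair_score[p], p))[:budget]

        # Partial ER output: run the matcher on this round's batch only.
        for pair in batch:
            compared.add(pair)
            if is_match(*pair):
                matches.add(pair)

        # Feedback: re-score each block by its observed match rate so far.
        for key, ids in blocks.items():
            seen = [p for p in combinations(sorted(ids), 2) if p in compared]
            if seen:
                scores[key] = sum(p in matches for p in seen) / len(seen)

    return matches, compared


# Hypothetical product records and ground-truth entity labels.
records = {
    1: "apple iphone 12 black",
    2: "iphone 12 apple black",
    3: "samsung galaxy s20 black",
    4: "galaxy s20 samsung",
    5: "black leather case",
}
truth = {1: "a", 2: "a", 3: "b", 4: "b", 5: "c"}

matches, compared = progressive_blocking(records, lambda a, b: truth[a] == truth[b])
print(sorted(matches))  # [(1, 2), (3, 4)]
```

In this toy run the small, distinctive blocks are examined first, and the large `black` block is pushed down once its observed match rate drops. Pairs that share no token, such as records 1 and 4, are never compared at all, which is the efficiency side of the trade-off the abstract describes.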
