TL;DR
AutoBlock introduces an automated, scalable, and effective blocking framework for entity matching that reduces manual effort and outperforms traditional methods on large, complex datasets.
Contribution
AutoBlock is a hands-off blocking framework built on similarity-preserving representation learning and nearest neighbor search. It removes the need for manual data cleaning and blocking key tuning.
Findings
AutoBlock outperforms existing baselines on large-scale datasets.
AutoBlock handles dirty and unstructured data effectively.
AutoBlock has sub-quadratic time complexity, enabling deployment on millions of records.
Abstract
Entity matching seeks to identify data records over one or multiple data sources that refer to the same real-world entity. Virtually every entity matching task on large datasets requires blocking, a step that reduces the number of record pairs to be matched. However, most of the traditional blocking methods are learning-free and key-based, and their successes are largely built on laborious human effort in cleaning data and designing blocking keys. In this paper, we propose AutoBlock, a novel hands-off blocking framework for entity matching, based on similarity-preserving representation learning and nearest neighbor search. Our contributions include: (a) Automation: AutoBlock frees users from laborious data cleaning and blocking key tuning. (b) Scalability: AutoBlock has a sub-quadratic total time complexity and can be easily deployed for millions of records. (c) Effectiveness:…
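To make the idea of blocking by nearest neighbor search concrete, here is a minimal sketch. It is not AutoBlock's implementation. It assumes a hashed character-trigram vector as a stand-in for the learned similarity-preserving representation, and it uses brute-force cosine search, which is quadratic. A sub-quadratic system would replace that loop with an approximate nearest neighbor index. The names `embed`, `candidate_pairs`, `DIM`, `k`, and `threshold` are hypothetical and chosen for illustration only.

```python
import hashlib
import math

DIM = 256  # length of the hashed trigram vector (arbitrary, illustrative)


def embed(text):
    """Stand-in for a learned representation: hash character trigrams
    into a fixed-size vector and L2-normalise it."""
    vec = [0.0] * DIM
    padded = "  " + text.lower() + " "
    for i in range(len(padded) - 2):
        gram = padded[i:i + 3]
        digest = hashlib.md5(gram.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity of two already-normalised vectors."""
    return sum(x * y for x, y in zip(a, b))


def candidate_pairs(records, k=1, threshold=0.5):
    """Return the set of (i, j) index pairs, i < j, that survive blocking.

    Each record keeps its k most similar records as candidates, provided
    the similarity reaches the threshold. Brute force: O(n^2) comparisons.
    """
    vectors = [embed(r) for r in records]
    pairs = set()
    for i, vi in enumerate(vectors):
        scored = sorted(
            ((cosine(vi, vj), j) for j, vj in enumerate(vectors) if j != i),
            reverse=True,
        )
        for score, j in scored[:k]:
            if score >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return pairs


if __name__ == "__main__":
    records = [
        "Apple iPhone 12 64GB black",
        "apple iphone 12 64 gb (black)",
        "Samsung Galaxy S21 Ultra",
        "samsung galaxy s21 ultra 5g",
    ]
    for i, j in sorted(candidate_pairs(records)):
        print(records[i], "<->", records[j])
```

Only the pairs returned by `candidate_pairs` would be handed to the downstream matcher, so no blocking key has to be designed by hand. The quality of the candidates depends entirely on how well the representation keeps matching records close together.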
