AutoBlock: A Hands-off Blocking Framework for Entity Matching

Wei Zhang; Hao Wei; Bunyamin Sisman; Xin Luna Dong; Christos Faloutsos; David Page

arXiv:1912.03417·cs.DB·December 10, 2019

TL;DR

AutoBlock is an automated, scalable blocking framework for entity matching. It reduces manual effort and is reported to outperform traditional blocking methods on large, dirty datasets.

Contribution

AutoBlock is a hands-off blocking framework built on similarity-preserving representation learning and nearest neighbor search. It frees users from manual data cleaning and blocking key tuning.

Findings

01. AutoBlock outperforms existing baselines on large-scale datasets.

02. AutoBlock handles dirty and unstructured data effectively.

03. AutoBlock has sub-quadratic time complexity, enabling deployment on millions of records.
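
To give a sense of why blocking matters at this scale, the short calculation below compares the number of record pairs under exhaustive matching with the number of candidate pairs when each record keeps only its k nearest neighbors. The values n = 10^6 and k = 10 are illustrative assumptions, not figures from the paper, and the count covers candidate pairs only, not the cost of finding the neighbors.

```latex
\[
\underbrace{\binom{n}{2} = \frac{n(n-1)}{2}}_{\text{all pairs}}
\qquad\text{vs.}\qquad
\underbrace{\le n\,k}_{\text{top-}k\text{ candidates}}
\]
\[
n = 10^{6},\; k = 10:\qquad
\frac{10^{6}\,(10^{6}-1)}{2} = 499{,}999{,}500{,}000 \approx 5\times 10^{11}
\qquad\text{vs.}\qquad
10^{6}\cdot 10 = 10^{7}
\]
```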

Abstract

Entity matching seeks to identify data records over one or multiple data sources that refer to the same real-world entity. Virtually every entity matching task on large datasets requires blocking, a step that reduces the number of record pairs to be matched. However, most of the traditional blocking methods are learning-free and key-based, and their successes are largely built on laborious human effort in cleaning data and designing blocking keys. In this paper, we propose AutoBlock, a novel hands-off blocking framework for entity matching, based on similarity-preserving representation learning and nearest neighbor search. Our contributions include: (a) Automation: AutoBlock frees users from laborious data cleaning and blocking key tuning. (b) Scalability: AutoBlock has a sub-quadratic total time complexity and can be easily deployed for millions of records. (c) Effectiveness:…
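
The general idea of blocking by representation plus nearest neighbor search can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not the authors' method: the hashed character-trigram vector in `embed` stands in for a learned similarity-preserving representation, the brute-force scan in `block` stands in for an approximate nearest neighbor index (and is itself quadratic), and the names `embed`, `block`, `DIM`, and the sample records are all hypothetical.

```python
import math
import zlib
from collections import Counter

DIM = 4096  # hypothetical vector size; large enough that hash collisions are rare


def embed(text, dim=DIM):
    """Hash character trigrams into a unit-length vector.

    Assumption: a stand-in for a learned encoder, used only for illustration.
    """
    padded = f"  {text.lower()} "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    vec = [0.0] * dim
    for gram, count in counts.items():
        # crc32 is deterministic across runs, unlike the built-in hash() on str
        vec[zlib.crc32(gram.encode("utf-8")) % dim] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(u, v):
    """Dot product of two unit-length vectors, i.e. their cosine similarity."""
    return sum(a * b for a, b in zip(u, v))


def block(records, k=1):
    """Return candidate pairs (i, j) with i < j.

    Each record is linked to its k most similar records. The scan below is
    brute force; a real system would use an approximate index instead.
    """
    vectors = [embed(r) for r in records]
    pairs = set()
    for i, u in enumerate(vectors):
        scored = sorted(
            ((cosine(u, v), j) for j, v in enumerate(vectors) if j != i),
            reverse=True,
        )
        for _, j in scored[:k]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


# Made-up product listings, two noisy descriptions per product
records = [
    "Apple iPhone 11 Pro 64GB Space Gray",
    "iphone 11 pro (64 gb) - space grey, apple",
    "Samsung Galaxy S10 128GB Prism Black",
    "samsung galaxy s10 128 gb black",
]
candidates = block(records, k=1)
print(sorted(candidates))
```

On these four strings, the two iPhone listings and the two Galaxy listings should end up paired, so only two of the six possible pairs would be passed on to a matcher. No blocking key is designed by hand here, which is the point the abstract makes about automation.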

Code & Models

Repositories

vintasoftware/entity-embed (PyTorch)
