Efficient Entity Resolution on Heterogeneous Records
Yiming Lin, Hongzhi Wang, Jianzhong Li, Hong Gao

TL;DR
This paper introduces a framework for entity resolution on heterogeneous records that avoids schema matching and therefore preserves more information. It uses a new similarity function, adds indexing for efficiency, and is validated on real datasets.
Contribution
It proposes a new similarity function and an iterative framework for ER on heterogeneous data, bypassing schema matching to retain crucial information.
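To make the information-loss argument concrete, here is a minimal sketch of what happens when records are converted to a predefined schema. It does not reproduce the paper's similarity function or framework. The records, attribute names, target schema, and the helpers `to_target` and `dropped` are all hypothetical, invented for illustration.

```python
# Hypothetical attribute mappings from two source schemas to a predefined
# target schema with attributes "name" and "city". Attributes without a
# mapping have no place in the target schema.
MAPPINGS = {
    "source_a": {"full_name": "name", "town": "city"},
    "source_b": {"name": "name", "city": "city"},
}

def to_target(record, mapping):
    """Convert a record to the target schema, keeping only mapped attributes."""
    return {mapping[k]: v for k, v in record.items() if k in mapping}

def dropped(record, mapping):
    """Return the attributes that the conversion discards."""
    return {k: v for k, v in record.items() if k not in mapping}

# Two invented records that plausibly describe the same person.
rec_a = {"full_name": "J. Smith", "town": "Boston", "employer": "Acme Labs"}
rec_b = {"name": "John Smith", "city": "Boston", "affiliation": "Acme Labs"}

conv_a = to_target(rec_a, MAPPINGS["source_a"])
conv_b = to_target(rec_b, MAPPINGS["source_b"])
lost_a = dropped(rec_a, MAPPINGS["source_a"])
lost_b = dropped(rec_b, MAPPINGS["source_b"])

print(conv_a, conv_b)  # what a schema-matching pipeline would compare
print(lost_a, lost_b)  # evidence discarded before comparison
```

In this toy case the converted records differ in their name strings, while the shared value "Acme Labs" sits in attributes that were dropped. That is the kind of evidence a schema-matching step may remove before records are compared.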
Findings
Effective in identifying matching records across heterogeneous schemas
Achieves higher accuracy than existing methods
Demonstrates improved efficiency with indexing
Abstract
Entity resolution (ER) is the problem of identifying and merging records that refer to the same real-world entity. In many scenarios, raw records are stored in heterogeneous environments. Specifically, the schemas of records may differ from each other. To make better use of such records, most existing work assumes that schema matching and data exchange have already been done to convert records under different schemas to records under a predefined schema. However, we observe that schema matching can lose information in some cases, and that information could be useful or even crucial to ER. To leverage sufficient information from heterogeneous sources, in this paper we address several challenges of ER on heterogeneous records and show that none of the existing similarity metrics or their transformations can be applied to find similar records under heterogeneous settings. Motivated by this, we design the…
Taxonomy
Topics: Data Quality and Management · Semantic Web and Ontologies · Web Data Mining and Analysis
