Skyblocking for Entity Resolution
Jingyu Shao, Qing Wang, Yu Lin

TL;DR
This paper introduces skyblocking, an approach that combines skyline techniques with active learning to efficiently identify preferred blocking schemes for entity resolution, and reports better label efficiency and blocking quality than state-of-the-art approaches on real-world datasets.
Contribution
It proposes the concept of skyblocking and a scheme skyline learning method that integrates skyline techniques with active learning for entity resolution blocking.
Findings
Efficiently identifies scheme skylines with limited labels
Outperforms state-of-the-art approaches in label efficiency and blocking quality
Effective on multiple real-world datasets
Abstract
In this paper, for the first time, we introduce the concept of skyblocking, which aims to efficiently identify the "most preferred" blocking scheme in terms of a given set of selection criteria for entity resolution blocking. To capture all possible preferred blocking schemes, the scheme skyline (i.e., the set of blocking schemes on the skyline) is studied in a multi-dimensional scheme space whose dimensions correspond to the selection criteria for blocking (e.g., pair completeness (PC) and pair quality (PQ)). However, applying traditional skyline techniques to learn scheme skylines is a non-trivial task. Due to the unique characteristics of blocking schemes, we face several challenges, such as how to find a balanced number of match and non-match labels to effectively approximate a blocking scheme in a scheme space, and how to design efficient skyline algorithms that explore a scheme space to find scheme skylines. To overcome these…
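To make the notion of a scheme skyline concrete, the following is a minimal sketch rather than the authors' method. It assumes a toy set of five records with fully known true matches, treats a blocking scheme as a disjunction of exact-match blocking keys, and evaluates every scheme exhaustively; the paper instead learns the skyline with active learning from a limited number of labels. All names (`records`, `schemes`, `candidate_pairs`, `pc_pq`, `dominates`, `skyline`) are hypothetical. PC is computed as the fraction of true matches that appear among the candidate pairs, and PQ as the fraction of candidate pairs that are true matches.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Pair = FrozenSet[str]

# Toy data (an assumption for illustration, not taken from the paper).
records: Dict[str, Dict[str, str]] = {
    "r1": {"surname": "smith", "city": "perth", "year": "1990"},
    "r2": {"surname": "smith", "city": "perth", "year": "1990"},
    "r3": {"surname": "smith", "city": "sydney", "year": "1985"},
    "r4": {"surname": "jones", "city": "sydney", "year": "1970"},
    "r5": {"surname": "jones", "city": "perth", "year": "1971"},
}
true_matches: Set[Pair] = {frozenset(("r1", "r2")), frozenset(("r4", "r5"))}

# Hypothetical schemes: each is a disjunction of exact-match blocking keys.
schemes: Dict[str, Tuple[str, ...]] = {
    "surname": ("surname",),
    "city": ("city",),
    "year": ("year",),
    "surname_or_city": ("surname", "city"),
}


def candidate_pairs(keys: Tuple[str, ...]) -> Set[Pair]:
    """Record pairs that share a value on at least one of the given keys."""
    pairs: Set[Pair] = set()
    for key in keys:
        blocks: Dict[str, List[str]] = {}
        for rid, rec in records.items():
            blocks.setdefault(rec[key], []).append(rid)
        for ids in blocks.values():
            for i in range(len(ids)):
                for j in range(i + 1, len(ids)):
                    pairs.add(frozenset((ids[i], ids[j])))
    return pairs


def pc_pq(cands: Set[Pair]) -> Tuple[float, float]:
    """Pair completeness and pair quality of a set of candidate pairs."""
    tp = len(cands & true_matches)
    pc = tp / len(true_matches) if true_matches else 0.0
    pq = tp / len(cands) if cands else 0.0
    return pc, pq


def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    """a dominates b if it is no worse in every dimension and better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline(points: Dict[str, Tuple[float, float]]) -> List[str]:
    """Names of the schemes that no other scheme dominates."""
    return sorted(
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    )


# Place each scheme in the (PC, PQ) scheme space, then keep the skyline.
points = {name: pc_pq(candidate_pairs(keys)) for name, keys in schemes.items()}
result = skyline(points)
print(points)
print(result)
```

On this toy data the skyline is `['surname', 'year']`: `surname` reaches full PC at the cost of lower PQ, `year` reaches full PQ at the cost of lower PC, and the other two schemes are dominated by `surname`. The pairwise dominance check is quadratic in the number of schemes, which is acceptable for a sketch but is precisely the kind of cost that efficient skyline algorithms aim to avoid.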
Taxonomy
Topics: Data Quality and Management · Data Management and Algorithms · Data-Driven Disease Surveillance
