Developing a Temporal Bibliographic Data Set for Entity Resolution
Yichen Hu, Qing Wang, Peter Christen

TL;DR
This paper presents a new large-scale temporal bibliographic data set, derived from DBLP and MAG, that provides both temporal information and ground truth for evaluating temporal entity resolution methods.
Contribution
The authors created a comprehensive temporal bibliographic data set with ground truth, linking author profiles and publications using DBLP and MAG data sources.
Findings
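The data set includes around 80K author profiles and 2 million publications.
Missing links between publications and author profiles were completed using the DBLP public API, and temporal affiliation information for DBLP authors was linked from MAG.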
The data set is challenging for temporal entity resolution research.
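To illustrate why ground truth matters for this kind of evaluation, the following is a minimal sketch of pairwise precision and recall for an entity resolution result. It is not taken from the paper: the function names, the record identifiers, and the toy clusters are all hypothetical, and the paper may use different quality measures.

```python
from itertools import combinations


def same_entity_pairs(clusters):
    """Return all unordered record-id pairs that share a cluster."""
    pairs = set()
    for members in clusters.values():
        # Sort so that each pair has one canonical orientation.
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


def pairwise_precision_recall(predicted, truth):
    """Compare predicted clusters against ground truth clusters."""
    pred_pairs = same_entity_pairs(predicted)
    true_pairs = same_entity_pairs(truth)
    correct = len(pred_pairs & true_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 1.0
    recall = correct / len(true_pairs) if true_pairs else 1.0
    return precision, recall


# Hypothetical toy data: ground truth maps author profiles to records,
# while the predicted clustering comes from some resolution method.
truth = {"author_1": ["r1", "r2", "r3"], "author_2": ["r4"]}
predicted = {"c1": ["r1", "r2"], "c2": ["r3", "r4"]}

precision, recall = pairwise_precision_recall(predicted, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case one of the two predicted pairs is correct and one of the three true pairs is found. Without ground truth links, neither number could be computed.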
Abstract
Entity resolution is the process of identifying groups of records within or across data sets where each group represents a real-world entity. Novel techniques that consider temporal features to improve the quality of entity resolution have recently attracted significant attention. However, there are currently no large data sets available that contain both temporal information as well as ground truth information to evaluate the quality of temporal entity resolution approaches. In this paper, we describe the preparation of a temporal data set based on author profiles extracted from the Digital Bibliography and Library Project (DBLP). We completed missing links between publications and author profiles in the DBLP data set using the DBLP public API. We then used the Microsoft Academic Graph (MAG) to link temporal affiliation information for DBLP authors. We selected around 80K (1%) of…
Taxonomy
Topics: Data Quality and Management · Web Data Mining and Analysis · Advanced Database Systems and Queries
