Integrating Large Citation Datasets
Inci Yueksel-Erguen, Ida Litzel, Hanqiu Peng

TL;DR
This paper presents methods for merging large citation datasets using big data techniques to create a comprehensive and accurate citation graph, improving scientific impact evaluation.
Contribution
It introduces a novel approach to integrate large citation datasets, addressing data inconsistency and deduplication challenges to produce a reliable, extensive citation graph.
Findings
The merged dataset contains over 119 million records and 1.4 billion citations.
The integrated citation graph enhances the accuracy of scientific impact assessment.
The merged graph supports a more robust evaluation of scientific impact than traditional citation metrics alone.
Abstract
This paper explores methods for building a comprehensive citation graph using big data techniques to evaluate scientific impact more accurately. Traditional citation metrics have limitations, and this work investigates merging large citation datasets to create a more accurate picture. Challenges of big data, like inconsistent data formats and lack of unique identifiers, are addressed through deduplication efforts, resulting in a streamlined and reliable merged dataset with over 119 million records and 1.4 billion citations. We demonstrate that merging large citation datasets builds a more accurate citation graph facilitating a more robust evaluation of scientific impact.
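The abstract does not spell out how records without unique identifiers are deduplicated, so the following is only a minimal sketch of one common approach, not the authors' method. It assumes each record carries a title and a publication year, builds a matching key from the normalized title plus year, and remaps citation edges onto the merged record IDs. The field names (`source`, `id`, `title`, `year`), the function names, and the sample data are all hypothetical.

```python
import re
import unicodedata


def normalize_title(title):
    # Strip accents, lowercase, and collapse every run of
    # non-alphanumeric characters into a single space.
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def record_key(record):
    # Assumption: normalized title plus year is distinctive enough to
    # stand in for a missing unique identifier.
    return (normalize_title(record["title"]), record.get("year"))


def merge_datasets(records, citations):
    key_to_id = {}        # matching key -> merged record ID
    local_to_merged = {}  # (source, local ID) -> merged record ID
    merged = {}           # merged record ID -> first record seen
    for rec in records:
        merged_id = key_to_id.setdefault(record_key(rec), len(key_to_id))
        local_to_merged[(rec["source"], rec["id"])] = merged_id
        merged.setdefault(merged_id, rec)

    # Remap each citation onto merged IDs; using a set drops edges
    # that both datasets report.
    edges = set()
    for source, citing, cited in citations:
        a = local_to_merged.get((source, citing))
        b = local_to_merged.get((source, cited))
        if a is not None and b is not None and a != b:
            edges.add((a, b))
    return merged, edges


# Hypothetical sample: two datasets describing the same two papers.
records = [
    {"source": "A", "id": "a1", "title": "Graph Théory: An Introduction", "year": 2001},
    {"source": "B", "id": "b7", "title": "graph theory - an introduction", "year": 2001},
    {"source": "A", "id": "a2", "title": "Citation Analysis", "year": 2010},
    {"source": "B", "id": "b9", "title": "Citation Analysis", "year": 2010},
]
citations = [("A", "a2", "a1"), ("B", "b9", "b7")]

merged, edges = merge_datasets(records, citations)
print(len(merged), "merged records,", len(edges), "citation edges")
```

Exact-key matching of this kind misses duplicates whose titles differ by more than punctuation, case, or accents, and it can wrongly merge distinct papers that share a title and year. At the scale described in the abstract, the same keying idea would typically run on a distributed framework rather than in-memory dictionaries.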
Taxonomy
Topics: Scientific Computing and Data Management
