Bridging the Language Gap in Scholarly Data I: Enhancing Author Disambiguation Algorithms for Chinese Names
Mingrong She, Liuhuaying Yang, Ana Maria Jaramillo, Lisette Espín-Noboa

TL;DR
This paper introduces a rule-based disambiguation framework for Chinese scholar names, improving accuracy in author identification across Latin and Chinese scripts for large-scale scientometric research.
Contribution
The proposed framework integrates co-authorship networks, citation networks, author affiliations, and content similarity. It is script-agnostic, achieves high accuracy on Chinese names, and outperforms baseline methods.
Findings
F1-score of 0.88 for Pinyin names
F1-score of 0.89 for character-based names
Outperforms baseline approaches in recall
Abstract
Disambiguating scholars with identical names is essential for accurate authorship assignment and robust large-scale scientometric research. Existing methods are often designed for Latin-script metadata and perform poorly on Chinese names. In international publications, Chinese names typically appear as Romanized Pinyin, which is highly ambiguous as it can map to multiple distinct characters. Chinese characters, in contrast, reduce but do not eliminate this ambiguity, and are rarely available in international records. To address both challenges, we propose a rule-based disambiguation framework that integrates co-authorship networks, citation networks, author affiliations, and content similarity. We apply this framework to 65,241 physics papers from the China National Knowledge Infrastructure (CNKI), spanning over 70 years of data. On a human-annotated sample of 80 name pairs, our method…
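To make the idea of combining the four evidence sources concrete, here is a minimal Python sketch of a rule-based pairwise check. It is an illustration only, not the authors' rules: the `Record` fields, the voting scheme, and the `min_votes` and `content_threshold` values are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One authorship record: a name as it appears on a single paper.

    All fields are hypothetical; the paper's actual data schema is not shown here.
    """
    name: str
    coauthors: frozenset = frozenset()
    cited: frozenset = frozenset()
    affiliation: str = ""
    keywords: frozenset = frozenset()


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def same_author(r1, r2, min_votes=2, content_threshold=0.3):
    """Assumed rule: merge two same-name records only if at least
    `min_votes` of the four evidence sources agree."""
    if r1.name != r2.name:
        return False
    votes = 0
    if r1.coauthors & r2.coauthors:          # shared co-author
        votes += 1
    if r1.cited & r2.cited:                  # shared cited paper
        votes += 1
    if r1.affiliation and r1.affiliation == r2.affiliation:
        votes += 1                           # same non-empty affiliation
    if jaccard(r1.keywords, r2.keywords) >= content_threshold:
        votes += 1                           # similar content
    return votes >= min_votes


# Three made-up records sharing the Pinyin name "Wang Wei".
a = Record("Wang Wei", frozenset({"Li Na"}), frozenset({"p1", "p2"}),
           "Peking University", frozenset({"superconductivity", "thin films"}))
b = Record("Wang Wei", frozenset({"Li Na", "Zhang Min"}), frozenset({"p2"}),
           "Tsinghua University", frozenset({"superconductivity"}))
c = Record("Wang Wei", frozenset({"Chen Jie"}), frozenset({"p9"}),
           "Fudan University", frozenset({"plasma"}))

print(same_author(a, b))  # co-author, citation, and content evidence agree
print(same_author(a, c))  # no evidence source agrees
```

Because the check compares name strings and sets of identifiers rather than anything script-specific, the same logic would apply whether `name` holds Pinyin or Chinese characters.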
