Bridging the Language Gap in Scholarly Data I: Enhancing Author Disambiguation Algorithms for Chinese Names

Mingrong She; Liuhuaying Yang; Ana Maria Jaramillo; Lisette Espín-Noboa

arXiv:2604.03776·cs.DL·April 7, 2026

TL;DR

This paper introduces a rule-based disambiguation framework for Chinese scholar names, improving accuracy in author identification across Latin and Chinese scripts for large-scale scientometric research.

Contribution

The proposed framework integrates co-authorship networks, citation networks, author affiliations, and content similarity. It is script-agnostic, reaches F1-scores of 0.88 on Pinyin names and 0.89 on character-based names, and outperforms baseline approaches in recall.

Findings

01 F1-score of 0.88 for Pinyin names
02 F1-score of 0.89 for character-based names
03 Outperforms baseline approaches in recall
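
For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below assumes the score is computed over labeled name pairs, where True means "same person"; this page does not confirm that this is how the paper evaluates, and the function name `pairwise_f1` and the toy labels are made up for illustration.

```python
def pairwise_f1(predicted: list[bool], actual: list[bool]) -> float:
    """F1 over name pairs, where True means 'the two mentions are one person'."""
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)        # correctly merged pairs
    fp = sum(p and not a for p, a in pairs)    # merged, but different people
    fn = sum(a and not p for p, a in pairs)    # same person, but kept apart
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (illustrative only, not data from the paper).
toy_predicted = [True, True, False, True, False]
toy_actual = [True, False, False, True, True]
toy_score = pairwise_f1(toy_predicted, toy_actual)
```

A method that merges aggressively raises recall at the cost of precision, which is why a recall advantage over baselines is reported separately from the F1-scores.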

Abstract

Disambiguating scholars with identical names is essential for accurate authorship assignment and robust large-scale scientometric research. Existing methods are often designed for Latin-script metadata and perform poorly on Chinese names. In international publications, Chinese names typically appear as Romanized Pinyin, which is highly ambiguous as it can map to multiple distinct characters. Chinese characters, in contrast, reduce but do not eliminate this ambiguity, and are rarely available in international records. To address both challenges, we propose a rule-based disambiguation framework that integrates co-authorship networks, citation networks, author affiliations, and content similarity. We apply this framework to 65,241 physics papers from the China National Knowledge Infrastructure (CNKI), spanning over 70 years of data. On a human-annotated sample of 80 name pairs, our method…
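
To make the idea of combining the four evidence types concrete, here is a minimal sketch of a rule-based same-author check. It is not the paper's implementation: the `Mention` record layout, the voting scheme, the `min_votes` and `content_threshold` values, and the example records are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Mention:
    """One author-name occurrence on one paper (hypothetical record layout)."""
    paper_id: str
    name: str
    affiliation: str
    coauthors: set[str] = field(default_factory=set)
    cited_papers: set[str] = field(default_factory=set)
    title_tokens: set[str] = field(default_factory=set)


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard overlap of two token sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def same_author(m1: Mention, m2: Mention, min_votes: int = 2,
                content_threshold: float = 0.3) -> bool:
    """Count how many evidence types agree and merge when enough do.

    The voting rule and both thresholds are assumptions, not the paper's rules.
    """
    if m1.name != m2.name:
        return False
    votes = 0
    # Co-authorship evidence: at least one shared co-author.
    votes += bool(m1.coauthors & m2.coauthors)
    # Citation evidence: one paper cites the other, or they share a reference.
    votes += (m1.paper_id in m2.cited_papers
              or m2.paper_id in m1.cited_papers
              or bool(m1.cited_papers & m2.cited_papers))
    # Affiliation evidence: identical, non-empty affiliation string.
    votes += bool(m1.affiliation) and m1.affiliation == m2.affiliation
    # Content evidence: title-token overlap at or above the threshold.
    votes += jaccard(m1.title_tokens, m2.title_tokens) >= content_threshold
    return votes >= min_votes


# Made-up records sharing one Pinyin name (not data from the paper).
a = Mention("p1", "Wang Wei", "Peking University", {"Li Na"}, {"p9"},
            {"quantum", "spin", "chains"})
b = Mention("p2", "Wang Wei", "Peking University", {"Li Na", "Zhang Min"},
            {"p1"}, {"spin", "glass", "dynamics"})
c = Mention("p3", "Wang Wei", "Fudan University", {"Chen Jie"}, {"p7"},
            {"galaxy", "cluster", "survey"})
merge_ab = same_author(a, b)
merge_ac = same_author(a, c)
```

Because the check only compares names for equality and otherwise relies on network, affiliation, and content evidence, the same logic applies whether `name` holds Pinyin or Chinese characters, which is one way to read the "script-agnostic" claim.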
