Distortive Effects of Initial-Based Name Disambiguation on Measurements of Large-Scale Coauthorship Networks
Jinseok Kim, Jana Diesner

TL;DR
This study investigates how initial-based name disambiguation methods distort the structure and statistics of large-scale coauthorship networks across multiple scientific fields, revealing significant inaccuracies and biases.
Contribution
It provides empirical evidence that initial-based disambiguation significantly biases network measurements and misidentifies key authors, challenging its validity as a default method in coauthorship network research.
Findings
Initial-based disambiguation inflates some network metrics, such as productivity and density.
It underestimates the number of unique authors and network components.
Asian names are particularly prone to misidentification.
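The direction of these effects can be illustrated with a toy example. The sketch below is not the authors' pipeline: the two papers and the author names are hypothetical, each distinct full name is assumed to be a distinct real author, and `network_stats` and `first_initial` are helper names invented for illustration. It compares a coauthorship network built from full names with one built from surname plus first initial.

```python
from itertools import combinations

def network_stats(papers, key=lambda a: a):
    """Build a coauthorship network and return basic statistics.

    `key` maps a raw name string to the identity used as a node.
    """
    nodes, edges = set(), set()
    for authors in papers:
        ids = sorted({key(a) for a in authors})
        nodes.update(ids)
        edges.update(combinations(ids, 2))

    # Union-find to count connected components.
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)

    n = len(nodes)
    density = 2 * len(edges) / (n * (n - 1)) if n > 1 else 0.0
    return {
        "authors": n,
        "components": len({find(x) for x in nodes}),
        "density": density,
    }

def first_initial(name):
    """'Wang, Lei' -> 'Wang, L' (surname plus first initial)."""
    surname, given = name.split(", ")
    return f"{surname}, {given[0]}"

# Hypothetical records: two papers by four distinct authors.
papers = [
    ["Wang, Li", "Chen, Hao"],
    ["Wang, Lei", "Park, Min"],
]

truth = network_stats(papers)                    # full names as identities
merged = network_stats(papers, first_initial)    # initial-based identities
```

In this toy case, merging "Wang, Li" and "Wang, Lei" into "Wang, L" reduces the author count from 4 to 3, joins two components into one, and raises density from 1/3 to 2/3, which matches the direction of the findings listed above.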
Abstract
Scholars have often relied on name initials to resolve name ambiguities in large-scale coauthorship network research. This approach bears the risk of incorrectly merging or splitting author identities. The use of initial-based disambiguation has been justified by the assumption that such errors would not affect research findings too much. This paper tests this assumption by analyzing coauthorship networks from five academic fields - biology, computer science, nanoscience, neuroscience, and physics - and an interdisciplinary journal, PNAS. Name instances in datasets of this study were disambiguated based on heuristics gained from previous algorithmic disambiguation solutions. We use disambiguated data as a proxy of ground-truth to test the performance of three types of initial-based disambiguation. Our results show that initial-based disambiguation can misrepresent statistical properties…
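As a rough illustration of the merging and splitting risk described in the abstract, the sketch below implements two commonly used initial-based keys: surname plus first initial, and surname plus all initials. These two variants are an assumption for illustration and are not claimed to be the three types tested in the paper; the name strings are hypothetical, and `first_initial_key`, `all_initials_key`, and `group_by` are invented helper names.

```python
from collections import defaultdict

def first_initial_key(name):
    """'Kim, Jin S.' -> 'kim, j' (surname plus first initial only)."""
    surname, _, given = name.partition(",")
    parts = given.split()
    initial = parts[0][0] if parts else ""
    return f"{surname.strip().lower()}, {initial.lower()}"

def all_initials_key(name):
    """'Kim, Jin S.' -> 'kim, js' (surname plus every given-name initial)."""
    surname, _, given = name.partition(",")
    initials = "".join(p[0] for p in given.split())
    return f"{surname.strip().lower()}, {initials.lower()}"

def group_by(names, key_fn):
    """Group raw name strings by the identity key they map to."""
    groups = defaultdict(set)
    for n in names:
        groups[key_fn(n)].add(n)
    return dict(groups)

# Hypothetical name instances as they might appear in bibliographic records.
names = ["Kim, Jinseok", "Kim, Jaehoon", "Kim, Jin S.", "Diesner, Jana"]

by_first = group_by(names, first_initial_key)
by_all = group_by(names, all_initials_key)
```

With these inputs, the first-initial key places all three "Kim" records under "kim, j", so any distinct people among them are merged. The all-initials key separates "Kim, Jin S." ("kim, js") from "Kim, Jinseok" ("kim, j"), which would split one identity if both strings refer to the same person, while still merging "Kim, Jinseok" with "Kim, Jaehoon".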
