ORCID-linked labeled data for evaluating author name disambiguation at scale
Jinseok Kim, Jason Owen-Smith

TL;DR
This paper proposes ORCID profiles as a scalable source of labeled data for evaluating author name disambiguation methods, and demonstrates the approach by evaluating Author-ity2009 on 3 million automatically labeled name instances.
Contribution
It introduces a method to leverage ORCID data for large-scale evaluation of disambiguation algorithms, enabling more accessible and detailed performance analysis.
Findings
ORCID-linked labeled data effectively capture the 'high precision over high recall' performance of Author-ity2009.
ORCID-linked labeled data do not effectively represent the population of name instances in Author-ity2009.
Labeled data can be improved as ORCID expands and is regularly updated.
Abstract
How can we evaluate the performance of a disambiguation method implemented on big bibliographic data? This study suggests that the open researcher profile system, ORCID, can be used as an authority source to label name instances at scale. This study demonstrates the potential by evaluating the disambiguation performance of Author-ity2009 (which algorithmically disambiguates author names in MEDLINE) using 3 million name instances that are automatically labeled through linkage to 5 million ORCID researcher profiles. Results show that although ORCID-linked labeled data do not effectively represent the population of name instances in Author-ity2009, they do effectively capture the 'high precision over high recall' performance of Author-ity2009. In addition, ORCID-linked labeled data can provide nuanced details about Author-ity2009's performance when name instances are evaluated within…
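To make 'high precision over high recall' concrete, the sketch below scores a set of predicted author clusters against ORCID-derived labels using pairwise precision and recall. This is an illustrative assumption, not necessarily the measure used in the paper: the summary above does not say which measure the authors apply. The function names, instance IDs, cluster IDs, and ORCID labels are all hypothetical toy values.

```python
from collections import defaultdict
from itertools import combinations


def same_cluster_pairs(assignment):
    """Return the set of unordered instance pairs that share a cluster ID."""
    clusters = defaultdict(list)
    for instance_id, cluster_id in assignment.items():
        clusters[cluster_id].append(instance_id)
    pairs = set()
    for members in clusters.values():
        # Sorting makes each pair appear in one canonical order.
        pairs.update(combinations(sorted(members), 2))
    return pairs


def pairwise_precision_recall(predicted, labeled):
    """Score predicted clusters against labels on the instances both cover."""
    shared = predicted.keys() & labeled.keys()
    pred_pairs = same_cluster_pairs({i: predicted[i] for i in shared})
    true_pairs = same_cluster_pairs({i: labeled[i] for i in shared})
    correct = len(pred_pairs & true_pairs)
    # Convention assumed here: an empty pair set scores 1.0.
    precision = correct / len(pred_pairs) if pred_pairs else 1.0
    recall = correct / len(true_pairs) if true_pairs else 1.0
    return precision, recall


# Hypothetical toy data: name instance -> ORCID-derived author label.
labeled = {"n1": "orcid-A", "n2": "orcid-A", "n3": "orcid-A", "n4": "orcid-B"}
# Hypothetical algorithm output: name instance -> predicted cluster.
predicted = {"n1": "c1", "n2": "c1", "n3": "c2", "n4": "c3"}

precision, recall = pairwise_precision_recall(predicted, labeled)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the algorithm never merges two different authors but splits one author across two clusters, so precision is 1.0 while recall is 1/3. That is the shape of result the phrase 'high precision over high recall' describes.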
