A framework for constructing a huge name disambiguation dataset: algorithms, visualization and human collaboration

Zhuoyue Xiao; Yutao Zhang; Bo Chen; Xiaozhao Liu; Jie Tang

arXiv:2007.02086 · cs.SI · July 7, 2020 · 5 cites

TL;DR

This paper introduces a large, high-accuracy author name disambiguation dataset called WhoisWho, along with a collaborative annotation framework and an inductive disambiguation model, advancing research in author disambiguation.

Contribution

It presents a novel large-scale dataset, an efficient human-computer annotation framework, and a new inductive model that outperforms existing methods in author name disambiguation.

Findings

01

The proposed model outperforms other algorithms on WhoisWho.

02

The annotation framework improves accuracy and efficiency.

03

Author disambiguation remains a challenging problem.

Abstract

We present a manually labeled Author Name Disambiguation (AND) dataset called WhoisWho, which consists of 399,255 documents and 45,187 distinct authors with 421 ambiguous author names. To label such a large amount of AND data with high accuracy, we propose a novel annotation framework in which humans and computers collaborate efficiently and precisely. Within the framework, we also propose an inductive disambiguation model to classify whether two documents belong to the same author. We evaluate the proposed method and other state-of-the-art disambiguation methods on WhoisWho. The experimental results show that: (1) our model outperforms other disambiguation algorithms on this challenging benchmark; (2) the AND problem remains largely unsolved and requires more in-depth research. We believe that such a large-scale benchmark will bring great value to the author name disambiguation task.…
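
The abstract frames disambiguation as deciding whether two documents belong to the same author. The following is a minimal sketch of that pairwise formulation, not the paper's model: a hand-written Jaccard similarity over coauthor names stands in for the learned classifier, and pairwise decisions are merged into author clusters with union-find. The document fields (`id`, `coauthors`), the function names, and the 0.3 threshold are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def same_author(doc_a, doc_b, threshold=0.3):
    """Hypothetical stand-in for a learned pairwise classifier:
    two documents match if their coauthor lists overlap enough."""
    return jaccard(doc_a["coauthors"], doc_b["coauthors"]) >= threshold


def cluster(docs, threshold=0.3):
    """Merge pairwise 'same author' decisions into clusters (union-find)."""
    parent = list(range(len(docs)))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(docs)), 2):
        if same_author(docs[i], docs[j], threshold):
            parent[find(i)] = find(j)

    groups = {}
    for i, doc in enumerate(docs):
        groups.setdefault(find(i), []).append(doc["id"])
    return sorted(groups.values())


# Toy documents that all list the same ambiguous author name (made-up data).
docs = [
    {"id": "p1", "coauthors": ["A. Smith", "B. Lee"]},
    {"id": "p2", "coauthors": ["B. Lee", "C. Wu"]},
    {"id": "p3", "coauthors": ["D. Kim"]},
]
clusters = cluster(docs)
```

Here `p1` and `p2` share one of three distinct coauthors (similarity 1/3), so they fall into the same cluster, while `p3` stays on its own. A pairwise decision function of this shape is inductive in the sense that it can be applied to documents under names it has never seen.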

Taxonomy

Topics: Data Quality and Management · Topic Modeling · Natural Language Processing Techniques