S2AMP: A High-Coverage Dataset of Scholarly Mentorship Inferred from Publications

Shaurya Rohatgi; Doug Downey; Daniel King; Sergey Feldman

arXiv:2204.10838 · cs.DL · May 3, 2022


TL;DR

This paper introduces two large datasets for studying scholarly mentorship: a ground-truth set of over 300,000 mentor-mentee pairs linked to the Semantic Scholar knowledge graph, and an inferred mentorship graph with 137 million weighted edges among 24 million nodes. Together they support large-scale analysis of academic mentorship patterns.

Contribution

The authors provide the first large-scale, publicly available datasets for studying scholarly mentorship, together with a high-accuracy classifier that infers a mentorship network from bibliographic data.

Findings

01

The classifier achieves a held-out ROC AUC of 0.96 when predicting mentorship relations from bibliographic features.

02

The first dataset includes over 300,000 ground truth mentor-mentee pairs from multiple manually-curated sources.

03

Inferred mentorship graph contains 137 million edges among 24 million nodes.
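
To make the shape of such a graph concrete, here is a minimal sketch of how weighted mentorship edges might be indexed and queried. The record layout `(mentor_id, mentee_id, weight)`, the idea that weights lie in [0, 1], the author ids, and the helper names `build_mentor_index` and `top_mentor` are all assumptions made for illustration; they are not taken from the released data. The last line only does the arithmetic implied by the reported counts.

```python
from collections import defaultdict

# Hypothetical edge records: (mentor_id, mentee_id, weight).
# The real release may use a different layout; this is illustrative only.
edges = [
    ("A", "B", 0.91),
    ("A", "C", 0.40),
    ("D", "B", 0.15),
    ("B", "E", 0.77),
]

def build_mentor_index(edges):
    """Map each mentee to its candidate mentors, strongest edge first."""
    index = defaultdict(list)
    for mentor, mentee, weight in edges:
        index[mentee].append((mentor, weight))
    for mentee in index:
        index[mentee].sort(key=lambda pair: pair[1], reverse=True)
    return dict(index)

def top_mentor(index, mentee, threshold=0.5):
    """Return the highest-weight mentor if it clears the (assumed) threshold."""
    candidates = index.get(mentee, [])
    if candidates and candidates[0][1] >= threshold:
        return candidates[0][0]
    return None

index = build_mentor_index(edges)

# Mean number of edges per node implied by the reported totals.
mean_edges_per_node = 137e6 / 24e6
```

With the toy edges above, `top_mentor(index, "B")` picks `"A"`, while `"C"` has no mentor above the assumed 0.5 cutoff. The reported totals work out to roughly 5.7 edges per node.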

Abstract

Mentorship is a critical component of academia, but is not as visible as publications, citations, grants, and awards. Despite the importance of studying the quality and impact of mentorship, there are few large representative mentorship datasets available. We contribute two datasets to the study of mentorship. The first has over 300,000 ground truth academic mentor-mentee pairs obtained from multiple diverse, manually-curated sources, and linked to the Semantic Scholar (S2) knowledge graph. We use this dataset to train an accurate classifier for predicting mentorship relations from bibliographic features, achieving a held-out area under the ROC curve of 0.96. Our second dataset is formed by applying the classifier to the complete co-authorship graph of S2. The result is an inferred graph with 137 million weighted mentorship edges among 24 million nodes. We release this first-of-its-kind…
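
For readers unfamiliar with the metric quoted in the abstract, the following is a minimal sketch of what an area under the ROC curve measures: the probability that a randomly chosen true mentor-mentee pair receives a higher score than a randomly chosen non-pair, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are assumptions for illustration; this is not the authors' evaluation code, and the quadratic pairwise loop is only suitable for small examples.

```python
def roc_auc(labels, scores):
    """Pairwise definition of ROC AUC: the fraction of (positive, negative)
    pairs in which the positive example has the higher score; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data: 1 = true mentorship pair, 0 = ordinary co-authorship (made up).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.2, 0.1]
auc = roc_auc(labels, scores)  # 8 of the 9 pairs are ranked correctly
```

A value of 0.96 on held-out data therefore means that, under this definition, the classifier ranks a true pair above a non-pair about 96% of the time.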

Code & Models

Repositories

allenai/s2amp-data (official)