A Natural Language Processing Pipeline for Detecting Informal Data References in Academic Literature
Sara Lafia, Lizhou Fan, Libby Hemphill

TL;DR
This paper presents a natural language processing pipeline that detects informal data references in academic publications, helping data librarians and researchers link studies to datasets at scale.
Contribution
It introduces a novel NER model for detecting informal data references and provides a dataset linking social science publications to datasets, enhancing data citation analysis.
Findings
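The pipeline increases recall of literature to review for inclusion in data-related collections of publications.
The NER model detects informal data references at scale.
The authors contribute a new dataset linking social science publications to the datasets they reference.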
Abstract
Discovering authoritative links between publications and the datasets that they use can be a labor-intensive process. We introduce a natural language processing pipeline that retrieves and reviews publications for informal references to research datasets, which complements the work of data librarians. We first describe the components of the pipeline and then apply it to expand an authoritative bibliography linking thousands of social science studies to the data-related publications in which they are used. The pipeline increases recall for literature to review for inclusion in data-related collections of publications and makes it possible to detect informal data references at scale. We contribute (1) a novel Named Entity Recognition (NER) model that reliably detects informal data references and (2) a dataset connecting items from social science literature with datasets they reference.…
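To make the idea of an "informal data reference" concrete, the sketch below flags passages that mention a dataset by name or abbreviation without a formal citation. It is only an illustration under stated assumptions: it uses a simple alias lookup rather than the trained NER model the abstract describes, and the alias table, the `Mention` type, and the function name `find_informal_mentions` are hypothetical and not taken from the paper.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical alias table for illustration only; the paper's pipeline uses a
# trained NER model, not a fixed lookup like this one.
DATASET_ALIASES = {
    "General Social Survey": "GSS",
    "American National Election Studies": "ANES",
}


@dataclass
class Mention:
    dataset: str  # canonical dataset name from the alias table
    text: str     # the exact string matched in the passage
    start: int    # character offset where the match begins
    end: int      # character offset where the match ends


def find_informal_mentions(passage: str) -> List[Mention]:
    """Return dataset mentions found in a passage, ordered by position."""
    mentions = []
    for name, abbrev in DATASET_ALIASES.items():
        # Match either the full name or its abbreviation as a whole word.
        pattern = re.compile(rf"\b(?:{re.escape(name)}|{re.escape(abbrev)})\b")
        for m in pattern.finditer(passage):
            mentions.append(Mention(name, m.group(0), m.start(), m.end()))
    return sorted(mentions, key=lambda x: x.start)


if __name__ == "__main__":
    sample = (
        "We analyze data from the General Social Survey and "
        "compare with ANES responses."
    )
    for mention in find_informal_mentions(sample):
        print(mention.dataset, "->", repr(mention.text), mention.start)
```

A lookup like this only finds names it already knows. A learned NER model is useful precisely because it can generalize to dataset mentions that are phrased in unanticipated ways.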
