A Semi-Automatic Approach for Detecting Dataset References in Social Science Texts
Behnam Ghavimi (1,2), Philipp Mayr (1), Christoph Lange (2,3), Sahar Vahdati (2), Sören Auer (2,3) ((1) GESIS Leibniz Institute for the Social Sciences, (2) Enterprise Information Systems (EIS), University of Bonn, (3) Fraunhofer Institute for Intelligent Analysis

TL;DR
This paper presents a semi-automatic method to identify and match dataset references in social science texts, improving reproducibility and dataset accessibility without needing a large article corpus.
Contribution
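It introduces a three-step approach that extracts features from dataset titles, detects references to datasets using those features, and matches the detected references. Because it needs no corpus of articles to start from, it avoids the cold start problem, and it reports F-measures of 0.84 for detection and 0.83 for matching.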
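The three steps can be pictured with a small Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the registry entries and IDs are hypothetical, the only features used are upper-case abbreviations and four-digit years, and matching is a simple overlap score. The features and matching method actually used in the paper are not described in this summary.

```python
import re

# Hypothetical registry entries; real da|ra titles and identifiers will differ.
REGISTRY = {
    "id-allbus-2010": "German General Social Survey (ALLBUS) 2010",
    "id-allbus-2012": "German General Social Survey (ALLBUS) 2012",
    "id-soep-2012": "Socio-Economic Panel (SOEP) 2012",
}

# Assumed feature set: abbreviations (3+ capitals) and publication years.
ABBREVIATION = re.compile(r"\b[A-Z]{3,}\b")
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def extract_features(text):
    """Step 1: extract the assumed features from a title (or a sentence)."""
    return {
        "abbreviations": set(ABBREVIATION.findall(text)),
        "years": set(YEAR.findall(text)),
    }


def detect_references(text, features_by_id):
    """Step 2: keep sentences that mention an abbreviation seen in a title."""
    known = set().union(*(f["abbreviations"] for f in features_by_id.values()))
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if known & set(ABBREVIATION.findall(s))]


def match_reference(sentence, features_by_id):
    """Step 3: pick the entry sharing an abbreviation and the most years."""
    found = extract_features(sentence)
    best_id, best_score = None, 0
    for dataset_id, feats in features_by_id.items():
        if not feats["abbreviations"] & found["abbreviations"]:
            continue
        score = 1 + len(feats["years"] & found["years"])
        if score > best_score:
            best_id, best_score = dataset_id, score
    return best_id


features = {ds_id: extract_features(title) for ds_id, title in REGISTRY.items()}
article = "We analyse data from the ALLBUS 2012 survey. Results are discussed later."
candidates = detect_references(article, features)
matches = [match_reference(sentence, features) for sentence in candidates]
```

Note that the sketch starts from the registry titles alone, which mirrors the claim that no corpus of articles is required.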
Findings
F-measure of 0.84 for reference detection
F-measure of 0.83 for matching references
Effective without requiring a corpus of articles
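For readers unfamiliar with the metric, the snippet below shows how an F-measure is computed from precision and recall. It assumes the balanced form (beta = 1); this summary does not say which variant the paper uses, and the input values are made up for illustration and are not the paper's results.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta is 1)."""
    denominator = beta**2 * precision + recall
    if denominator == 0:
        return 0.0  # convention when both precision and recall are zero
    return (1 + beta**2) * precision * recall / denominator


# Hypothetical precision and recall values, chosen only to show the arithmetic.
example_score = f_measure(0.8, 0.9)
```

Because it is a harmonic mean, the score falls between precision and recall and sits closer to the lower of the two.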
Abstract
Today, full-texts of scientific articles are often stored in different locations than the used datasets. Dataset registries aim at a closer integration by making datasets citable but authors typically refer to datasets using inconsistent abbreviations and heterogeneous metadata (e.g. title, publication year). It is thus hard to reproduce research results, to access datasets for further analysis, and to determine the impact of a dataset. Manually detecting references to datasets in scientific articles is time-consuming and requires expert knowledge in the underlying research domain. We propose and evaluate a semi-automatic three-step approach for finding explicit references to datasets in social sciences articles. We first extract pre-defined special features from dataset titles in the da|ra registry, then detect references to datasets using the extracted features, and finally match the…
Taxonomy
Topics: Topic Modeling · Biomedical Text Mining and Ontologies · Data Quality and Management
