TL;DR
This paper introduces SoMeSci, a comprehensive gold standard knowledge graph of software mentions in scientific articles, enabling improved automatic extraction, disambiguation, and analysis of software usage in research.
Contribution
It presents the first high-quality, annotated dataset of software mentions in scientific literature, including relation labels and mention types, supporting various NLP tasks.
Findings
High inter-annotator agreement (κ=0.82)
Contains 3756 software mentions in 1367 PubMed Central articles
Provides baseline results for NLP tasks
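To make the agreement figure above concrete, the following is a minimal sketch of how a κ score can be computed for two annotators. It assumes Cohen's kappa over per-item labels; the summary does not state which κ variant or label granularity was used, and the function name and toy labels are illustrative only.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: one categorical label per item and per annotator.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")

    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels, not taken from the dataset.
annotator_a = ["software", "other", "software", "other"]
annotator_b = ["software", "other", "other", "other"]
kappa = cohens_kappa(annotator_a, annotator_b)
```

Values near 1 indicate agreement well beyond chance, and a value of 0 indicates agreement no better than chance.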
Abstract
Knowledge about software used in scientific investigations is important for several reasons, for instance, to enable an understanding of provenance and methods involved in data handling. However, software is usually not formally cited, but rather mentioned informally within the scholarly description of the investigation, raising the need for automatic information extraction and disambiguation. Given the lack of reliable ground truth data, we present SoMeSci (Software Mentions in Science), a gold standard knowledge graph of software mentions in scientific articles. It contains high-quality annotations (IRR: κ=0.82) of 3756 software mentions in 1367 PubMed Central articles. Besides the plain mention of the software, we also provide relation labels for additional information, such as the version, the developer, a URL or citations. Moreover, we distinguish between different types,…
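The annotation layers described in the abstract (a software mention plus relation labels for version, developer, URL and citation) might be represented roughly as follows. This is a hypothetical sketch: the class, field and function names are invented for illustration and are not the SoMeSci schema, and the example sentence is made up.

```python
from dataclasses import dataclass, field
from typing import Dict

# Relation labels named in the abstract; the lowercase keys are an assumption.
ALLOWED_RELATIONS = {"version", "developer", "url", "citation"}


@dataclass
class SoftwareMention:
    """One software mention with its character span and related details."""
    name: str
    start: int  # character offset where the mention begins
    end: int    # character offset just past the mention
    relations: Dict[str, str] = field(default_factory=dict)


def annotate(sentence: str, name: str, **relations: str) -> SoftwareMention:
    """Locate `name` in `sentence` and attach relation values found in the text."""
    start = sentence.find(name)
    if start < 0:
        raise ValueError(f"{name!r} does not occur in the sentence")
    unknown = set(relations) - ALLOWED_RELATIONS
    if unknown:
        raise ValueError(f"unsupported relation labels: {sorted(unknown)}")
    for label, value in relations.items():
        if value not in sentence:
            raise ValueError(f"{label} value {value!r} does not occur in the sentence")
    return SoftwareMention(name, start, start + len(name), dict(relations))


# Invented example sentence, not taken from the corpus.
sentence = "Statistical analyses were performed with SPSS version 22 (IBM)."
mention = annotate(sentence, "SPSS", version="22", developer="IBM")
```

Keeping character offsets alongside the relation values lets each annotation be traced back to the exact text span it came from.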
