SoMeSci- A 5 Star Open Data Gold Standard Knowledge Graph of Software Mentions in Scientific Articles

David Schindler; Felix Bensmann; Stefan Dietze; Frank Krüger

arXiv:2108.09070·cs.IR·August 23, 2021

TL;DR

This paper introduces SoMeSci, a comprehensive gold standard knowledge graph of software mentions in scientific articles, enabling improved automatic extraction, disambiguation, and analysis of software usage in research.

Contribution

The paper presents the first high-quality annotated dataset of software mentions in scientific literature, including relation labels and mention types, to support NLP tasks such as information extraction and disambiguation.

Findings

1. High inter-annotator agreement (κ = 0.82)
2. Contains 3756 software mentions in 1367 articles
3. Provides baseline results for NLP tasks

Abstract

Knowledge about software used in scientific investigations is important for several reasons, for instance, to enable an understanding of provenance and methods involved in data handling. However, software is usually not formally cited, but rather mentioned informally within the scholarly description of the investigation, raising the need for automatic information extraction and disambiguation. Given the lack of reliable ground truth data, we present SoMeSci (Software Mentions in Science), a gold standard knowledge graph of software mentions in scientific articles. It contains high-quality annotations (IRR: $\kappa = .82$) of 3756 software mentions in 1367 PubMed Central articles. Besides the plain mention of the software, we also provide relation labels for additional information, such as the version, the developer, a URL or citations. Moreover, we distinguish between different types,…
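
To make the annotation structure described in the abstract concrete, here is a minimal Python sketch of how one software mention and its relation labels might be held in memory. The class name, field names, label spellings, and the sample article identifier are all hypothetical; this is not SoMeSci's actual schema or vocabulary. Only the four relation kinds (version, developer, URL, citation) come from the abstract.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Relation kinds named in the abstract; the lowercase spellings are an assumption.
RELATION_LABELS = {"version", "developer", "url", "citation"}


@dataclass
class SoftwareMention:
    """One annotated software mention (illustrative only, not SoMeSci's schema)."""

    article_id: str  # hypothetical identifier of the source article
    text: str        # the software name as it appears in the article
    relations: Dict[str, List[str]] = field(default_factory=dict)

    def add_relation(self, label: str, value: str) -> None:
        """Attach extra information (e.g. a version string) to this mention."""
        if label not in RELATION_LABELS:
            raise ValueError(f"unknown relation label: {label}")
        self.relations.setdefault(label, []).append(value)


# Made-up example values, not taken from the dataset.
mention = SoftwareMention(article_id="PMC-EXAMPLE", text="R")
mention.add_relation("version", "3.6.1")
mention.add_relation("developer", "R Core Team")
```

Keeping the relations in a dictionary of lists lets a single mention carry several values for the same label, for example more than one citation.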

Code & Models

Repositories

dave-s477/somesci_code (Official)