# Soft-Search: Two Datasets to Study the Identification and Production of Research Software

**Authors:** Eva Maxfield Brown, Lindsey Schwartz, Richard Lewei Huang, and Nicholas Weber

arXiv: 2302.14177 · 2023-03-01

## TL;DR

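This paper introduces two datasets for identifying research software: a human-labeled set of software production annotations from NSF-funded projects, and an inferred set produced by applying models trained on those annotations to NSF award abstracts and project outcomes reports. The authors frame identification as a first step toward linking software to scholarly work.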

## Contribution

The paper contributes a human-labeled dataset, predictive models trained on it, and a model-inferred dataset of software production drawn from NSF award abstracts and project outcomes reports, enabling large-scale analysis of software produced by funded research.

## Key findings

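- Trained models to predict software production using almost 1000 human-labeled annotations from NSF-funded projects.
- Created an inferred dataset of software production labels for over 150,000 NSF awards, covering projects funded between 2010 and 2023.
- Released the Soft-Search dataset publicly at https://github.com/si2-urssi/eager to support the identification and study of research software production.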

## Abstract

Software is an important tool for scholarly work, but software produced for research is in many cases not easily identifiable or discoverable. A potential first step in linking research and software is software identification. In this paper we present two datasets to study the identification and production of research software. The first dataset contains almost 1000 human labeled annotations of software production from National Science Foundation (NSF) awarded research projects. We use this dataset to train models that predict software production. Our second dataset is created by applying the trained predictive models across the abstracts and project outcomes reports for all NSF funded projects between the years of 2010 and 2023. The result is an inferred dataset of software production for over 150,000 NSF awards. We release the Soft-Search dataset to aid in identifying and understanding research software production: https://github.com/si2-urssi/eager
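The two-stage workflow in the abstract (train on a small labeled set, then apply the model to unlabeled awards) can be sketched roughly as follows. This is a minimal illustration only: the Naive Bayes classifier is a stand-in chosen because it fits in a few lines of standard-library Python, not the models the authors trained, and the example texts, award IDs, and names such as `TinyNaiveBayes` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative stand-in)."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.n_docs = len(labels)
        # Per-class word frequencies gathered from the labeled texts.
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        # Words never seen during training carry no evidence, so skip them.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        best_label, best_score = None, -math.inf
        for c in self.classes:
            total = sum(self.word_counts[c].values())
            score = math.log(self.doc_counts[c] / self.n_docs)
            for t in tokens:
                smoothed = (self.word_counts[c][t] + 1) / (total + len(self.vocab))
                score += math.log(smoothed)
            if score > best_score:
                best_label, best_score = c, score
        return best_label


# Stage 1: hypothetical human-labeled examples (True = software was produced).
labeled = [
    ("We released an open source Python package and code repository on GitHub.", True),
    ("The project developed software tools and a simulation library for users.", True),
    ("Field surveys collected soil samples across three watersheds.", False),
    ("The award supported a workshop and graduate student training.", False),
]
model = TinyNaiveBayes().fit([t for t, _ in labeled], [y for _, y in labeled])

# Stage 2: apply the trained model to unlabeled award texts (hypothetical IDs).
unlabeled_awards = {
    "award-A": "Our team published a software package and source code repository.",
    "award-B": "Students attended a workshop and collected field samples.",
}
inferred = {award_id: model.predict(text) for award_id, text in unlabeled_awards.items()}
print(inferred)
```

The resulting `inferred` mapping plays the role of the second dataset: one predicted software-production label per award, produced by a model rather than by a human annotator.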

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2302.14177/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/2302.14177/full.md

## References

12 references — full list in the complete paper: https://tomesphere.com/paper/2302.14177/full.md

---
Source: https://tomesphere.com/paper/2302.14177