NERdME: a Named Entity Recognition Dataset for Indexing Research Artifacts in Code Repositories
Genet Asefa Gesese, Zongxiong Chen, Shufan Jiang, Mary Ann Tan, Zhaotai Liu, Sonja Schimmler, Harald Sack

TL;DR
This paper introduces NERdME, a new dataset of 200 annotated README files with over 10,000 labeled spans across 10 entity types, aimed at improving information extraction from implementation-level artifacts in code repositories.
Contribution
The paper presents a novel dataset for extracting structured information from README files, addressing a gap in existing scholarly information extraction datasets.
Findings
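Baseline large language models and fine-tuned transformers perform clearly differently on paper-level and implementation-level entities.
These differences indicate the value of extending scholarly information extraction (SIE) benchmarks with entity types found in README files.
A downstream entity-linking experiment shows that entities derived from README files can support artifact discovery and metadata integration.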
Abstract
Existing scholarly information extraction (SIE) datasets focus on scientific papers and overlook implementation-level details in code repositories. README files describe datasets, source code, and other implementation-level artifacts; however, their free-form Markdown offers little semantic structure, making automatic information extraction difficult. To address this gap, NERdME is introduced: 200 manually annotated README files with over 10,000 labeled spans and 10 entity types. Baseline results using large language models and fine-tuned transformers show clear differences between paper-level and implementation-level entities, indicating the value of extending SIE benchmarks with entity types available in README files. A downstream entity-linking experiment demonstrates that entities derived from READMEs can support artifact discovery and metadata integration.
Taxonomy
Topics: Advanced Text Analysis Techniques · Biomedical Text Mining and Ontologies · Topic Modeling
