TDMSci: A Specialized Corpus for Scientific Literature Entity Tagging of Tasks Datasets and Metrics

Yufang Hou; Charles Jochim; Martin Gleize; Francesca Bonin; Debasis Ganguly

arXiv:2101.10273·cs.CL·January 26, 2021·1 cites

TL;DR

This paper introduces TDMSci, a new annotated corpus for extracting Tasks, Datasets, and Metrics from NLP papers, intended to support information extraction and knowledge discovery over scientific literature.

Contribution

The paper presents a new corpus with domain expert annotations for Task (T), Dataset (D), and Metric (M) entities on 2,000 sentences from NLP papers, and demonstrates its utility through extraction experiments and a large-scale application.

Findings

01. A simple data augmentation strategy improves extraction accuracy

02. The tagger was applied to around 30,000 NLP papers from the ACL Anthology for large-scale analysis

03. The publicly available corpus fosters further research in scientific literature understanding

Abstract

Tasks, Datasets and Evaluation Metrics are important concepts for understanding experimental scientific papers. However, most previous work on information extraction for scientific literature focuses only on abstracts and does not treat datasets as a separate type of entity (Zadeh and Schumann, 2016; Luan et al., 2018). In this paper, we present a new corpus that contains domain expert annotations for Task (T), Dataset (D), Metric (M) entities on 2,000 sentences extracted from NLP papers. We report experimental results on TDM extraction using a simple data augmentation strategy and apply our tagger to around 30,000 NLP papers from the ACL Anthology. The corpus is made publicly available to the community to foster research on scientific publication summarization (Erera et al., 2019) and knowledge discovery.
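To make the idea of sentence-level T/D/M entity tagging concrete, here is a minimal sketch of how tagged tokens could be turned into entity mentions. The BIO tag scheme, the label names (`TASK`, `DATASET`, `METRIC`), the function name `bio_to_spans`, and the example sentence are all assumptions made for illustration. They are not taken from the corpus or from the authors' tagger.

```python
from typing import List, Tuple

def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (entity_type, text) pairs from BIO-tagged tokens.

    Assumes tags of the form "B-X", "I-X", or "O" (a hypothetical scheme).
    An "I-X" tag that does not continue an open span of type X is treated
    as outside any entity.
    """
    spans: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts: close any open span first.
            if current_type is not None:
                spans.append((current_type, " ".join(current_tokens)))
            current_type = tag[2:]
            current_tokens = [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continue the open span of the same type.
            current_tokens.append(token)
        else:
            # "O" or an inconsistent "I-" tag: close any open span.
            if current_type is not None:
                spans.append((current_type, " ".join(current_tokens)))
            current_type = None
            current_tokens = []
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

# Hypothetical example sentence with made-up gold tags.
EXAMPLE_TOKENS = ["We", "evaluate", "machine", "translation", "on",
                  "WMT14", "using", "BLEU", "."]
EXAMPLE_TAGS = ["O", "O", "B-TASK", "I-TASK", "O",
                "B-DATASET", "O", "B-METRIC", "O"]
EXAMPLE_SPANS = bio_to_spans(EXAMPLE_TOKENS, EXAMPLE_TAGS)
```

For the example above, `EXAMPLE_SPANS` holds one task mention ("machine translation"), one dataset mention ("WMT14"), and one metric mention ("BLEU"). Treating datasets as their own type is the distinction the abstract draws against earlier work.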

Code & Models

Repositories

IBM/science-result-extractor (tf, Official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Biomedical Text Mining and Ontologies