PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts
Franck Dernoncourt, Ji Young Lee

TL;DR
This paper introduces PubMed 200k RCT, a dataset of approximately 200,000 abstracts of randomized controlled trials from PubMed in which every sentence is labeled with its role in the abstract. It is aimed at improving sequential sentence classification and helping researchers skim the medical literature.
Contribution
The paper releases a large dataset for sequential sentence classification in medical abstracts, addressing the fact that most existing datasets for sequential short-text classification are small.
Findings
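The dataset contains 2.3 million sentences, each labeled with one of five roles: background, objective, method, result, or conclusion
It is intended to support the development of more accurate sequential classification algorithms
Sentence-level role labels are meant to help researchers skim abstracts more efficiently during literature review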
Abstract
We present PubMed 200k RCT, a new dataset based on PubMed for sequential sentence classification. The dataset consists of approximately 200,000 abstracts of randomized controlled trials, totaling 2.3 million sentences. Each sentence of each abstract is labeled with its role in the abstract using one of the following classes: background, objective, method, result, or conclusion. The purpose of releasing this dataset is twofold. First, the majority of datasets for sequential short-text classification (i.e., classification of short texts that appear in sequences) are small: we hope that releasing a new large dataset will help develop more accurate algorithms for this task. Second, from an application perspective, researchers need better tools to efficiently skim through the literature. Automatically classifying each sentence in an abstract would help researchers read abstracts more…
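To make the labeling scheme concrete, here is a minimal Python sketch of how one labeled abstract could be represented in memory and regrouped by role for skimming. The five role names come from the abstract above; the names `Role`, `LabeledSentence`, and `group_by_role` are hypothetical, and the toy sentences are invented for illustration and are not taken from the dataset. The sketch makes no claim about the dataset's actual file layout.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Role(Enum):
    """The five sentence roles named in the abstract."""
    BACKGROUND = "background"
    OBJECTIVE = "objective"
    METHOD = "method"
    RESULT = "result"
    CONCLUSION = "conclusion"


@dataclass
class LabeledSentence:
    """One sentence of an abstract with its role (hypothetical container)."""
    role: Role
    text: str


def group_by_role(abstract: List[LabeledSentence]) -> Dict[Role, List[str]]:
    """Collect sentence texts under their role, preserving sentence order."""
    grouped: Dict[Role, List[str]] = {}
    for sentence in abstract:
        grouped.setdefault(sentence.role, []).append(sentence.text)
    return grouped


# Invented toy abstract; an abstract is an ordered sequence of sentences.
example = [
    LabeledSentence(Role.BACKGROUND, "Condition X is common in adults."),
    LabeledSentence(Role.OBJECTIVE, "We tested whether drug A reduces symptoms."),
    LabeledSentence(Role.METHOD, "Participants were randomized to drug A or placebo."),
    LabeledSentence(Role.METHOD, "Symptoms were scored at 12 weeks."),
    LabeledSentence(Role.RESULT, "Scores were lower in the drug A group."),
    LabeledSentence(Role.CONCLUSION, "Drug A may reduce symptoms of condition X."),
]

if __name__ == "__main__":
    for role, texts in group_by_role(example).items():
        print(f"{role.value}: {' '.join(texts)}")
```

Keeping each abstract as an ordered list, rather than a bag of independent sentences, reflects the sequential nature of the task: the role of a sentence depends partly on the sentences around it.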
Taxonomy
Topics: Topic Modeling · Machine Learning in Healthcare · Biomedical Text Mining and Ontologies
