PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts

Franck Dernoncourt; Ji Young Lee

arXiv:1710.06071 · cs.CL · October 18, 2017 · 35 cites

Open Access 5 Repos 1 Models 1 Datasets

TL;DR

This paper introduces PubMed 200k RCT, a large dataset of 200,000 medical abstracts with sentence-level labels, aimed at improving sequential sentence classification and aiding literature review in medicine.

Contribution

The paper provides a large, publicly available dataset for sequential sentence classification in medical abstracts, addressing the scarcity of big datasets in this domain.

Findings

01 Dataset contains 2.3 million sentences labeled by role

02 Enables development of more accurate sequential classification algorithms

03 Facilitates efficient literature review in medical research

Abstract

We present PubMed 200k RCT, a new dataset based on PubMed for sequential sentence classification. The dataset consists of approximately 200,000 abstracts of randomized controlled trials, totaling 2.3 million sentences. Each sentence of each abstract is labeled with its role in the abstract using one of the following classes: background, objective, method, result, or conclusion. The purpose of releasing this dataset is twofold. First, the majority of datasets for sequential short-text classification (i.e., classification of short texts that appear in sequences) are small: we hope that releasing a new large dataset will help develop more accurate algorithms for this task. Second, from an application perspective, researchers need better tools to efficiently skim through the literature. Automatically classifying each sentence in an abstract would help researchers read abstracts more…
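
To make the labeling scheme concrete, here is a minimal Python sketch of how one labeled abstract might be represented and parsed. The five class names come from the abstract above; everything else is an assumption made for illustration: the text layout (a `###<id>` line followed by one `LABEL<TAB>sentence` line per sentence), the names `LabeledSentence` and `parse_abstracts`, and the toy sentences, which are invented and not taken from the dataset.

```python
from dataclasses import dataclass
from typing import List

# The five sentence roles named in the abstract.
LABELS = ("background", "objective", "method", "result", "conclusion")


@dataclass
class LabeledSentence:
    label: str  # one of LABELS
    text: str   # the sentence itself


def parse_abstracts(raw: str) -> List[List[LabeledSentence]]:
    """Parse an ASSUMED layout: a '###<id>' line opens an abstract,
    followed by one 'LABEL<TAB>sentence' line per sentence.
    Sentence order is preserved, since the task is sequential."""
    abstracts: List[List[LabeledSentence]] = []
    current = None
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines only separate abstracts
        if line.startswith("###"):
            current = []
            abstracts.append(current)
            continue
        if current is None:
            raise ValueError("sentence line found before any '###' header")
        label, _, text = line.partition("\t")
        label = label.lower()
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
        current.append(LabeledSentence(label, text))
    return abstracts


# Invented toy input (NOT real dataset content), in the assumed layout.
TOY_SAMPLE = "\n".join([
    "###0001",
    "BACKGROUND\tToy sentence about prior work.",
    "OBJECTIVE\tToy sentence stating the aim.",
    "METHOD\tToy sentence describing the trial design.",
    "RESULT\tToy sentence reporting an outcome.",
    "CONCLUSION\tToy sentence summarizing the finding.",
])

if __name__ == "__main__":
    for abstract in parse_abstracts(TOY_SAMPLE):
        # The ordered label sequence is what a sequential classifier predicts.
        print([sentence.label for sentence in abstract])
```

Keeping each abstract as an ordered list, rather than pooling all sentences, reflects the sequential framing: a sentence's role depends partly on the sentences around it.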

Code & Models

Models

SaborDay/Phi2_RCT1M-ft-heading
model · 8 dl · ♡ 1

Datasets

armanc/pubmed-rct20k
dataset · 663 dl

Taxonomy

Topics: Topic Modeling · Machine Learning in Healthcare · Biomedical Text Mining and Ontologies