A Distant Supervision Corpus for Extracting Biomedical Relationships Between Chemicals, Diseases and Genes
Dongxu Zhang, Sunil Mohan, Michaela Torkar, Andrew McCallum

TL;DR
This paper presents ChemDisGene, a large, high-quality biomedical relation extraction dataset with extensive annotations, designed to improve training and evaluation of models identifying relationships among chemicals, diseases, and genes.
Contribution
The creation of ChemDisGene, a substantially larger and cleaner dataset with entity linking and both training and evaluation portions, along with baseline models for biomedical relation extraction.
Findings
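The dataset contains 80k biomedical research abstracts; the training portion is distantly labeled via the CTD database with approximately 78% accuracy, and the evaluation portion is labeled by human experts with 18 types of relationships.
Three baseline deep neural network relation extraction models are trained and evaluated on the dataset.
Compared with similar preexisting datasets, ChemDisGene is substantially larger and cleaner, and it includes annotations linking mentions to their entities.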
Abstract
We introduce ChemDisGene, a new dataset for training and evaluating multi-class multi-label document-level biomedical relation extraction models. Our dataset contains 80k biomedical research abstracts labeled with mentions of chemicals, diseases, and genes, portions of which human experts labeled with 18 types of biomedical relationships between these entities (intended for evaluation), and the remainder of which (intended for training) has been distantly labeled via the CTD database with approximately 78% accuracy. In comparison to similar preexisting datasets, ours is both substantially larger and cleaner; it also includes annotations linking mentions to their entities. We also provide three baseline deep neural network relation extraction models trained and evaluated on our new dataset.
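To make the idea of distant labeling concrete, here is a minimal Python sketch of generic document-level distant supervision. It is an illustration under stated assumptions, not the authors' pipeline: the entity IDs, the relation names, the `KB_FACTS` set, and the `distant_labels` helper are all hypothetical, and the abstract does not describe how CTD facts were actually aligned to abstracts. The sketch assumes each abstract has already been reduced to the set of entity IDs its mentions link to, and it assigns a pair every relation type the knowledge base records for it, which is what makes the task multi-label.

```python
# Hypothetical curated facts in the style of a knowledge base such as CTD:
# (subject entity ID, relation type, object entity ID).
# All IDs and relation names below are invented for illustration.
KB_FACTS = {
    ("CHEM:1", "chem_gene:increases_expression", "GENE:7"),
    ("CHEM:1", "chem_gene:affects_binding", "GENE:7"),
    ("CHEM:1", "chem_disease:therapeutic", "DIS:3"),
}


def distant_labels(doc_entities, kb_facts):
    """Map each (subject, object) pair found in one document to its relation types.

    A pair receives a label whenever both entities are linked in the document
    and the knowledge base holds a fact connecting them. The text itself is
    never consulted, which is why such labels are noisy.
    """
    present = set(doc_entities)
    labels = {}
    for subj, rel, obj in kb_facts:
        if subj in present and obj in present:
            labels.setdefault((subj, obj), set()).add(rel)
    return labels


# Entities linked in one hypothetical abstract.
doc = ["CHEM:1", "GENE:7", "DIS:9"]
labels = distant_labels(doc, KB_FACTS)
for pair, relations in labels.items():
    print(pair, sorted(relations))
```

In this toy case the chemical-gene pair receives two relation types at once, while the chemical-disease fact is skipped because `DIS:3` is not linked in the document. A label produced this way can be wrong when the abstract does not actually express the recorded relationship, which is the kind of noise an accuracy estimate such as the approximately 78% figure quantifies.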
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Topic Modeling · Advanced Text Analysis Techniques
