The RareDis corpus: a corpus annotated with rare diseases, their signs and symptoms

Claudia Martínez-deMiguel; Isabel Segura-Bedmar; Esteban Chacón-Solano; Sara Guerrero-Aspizua

arXiv:2108.01204·cs.CL·December 10, 2021

TL;DR

The RareDis corpus is a high-quality annotated dataset of over 5,000 rare diseases and nearly 6,000 clinical signs, enabling improved NLP applications for diagnosis and treatment of rare diseases.

Contribution

This paper introduces the RareDis corpus, a novel annotated dataset of rare diseases and symptoms, addressing the scarcity of such resources for NLP research.

Findings

01

High inter-annotator agreement (F1 83.5%) for entities

02

High inter-annotator agreement (F1 81.3%) for relations

03

Potential to facilitate NLP applications in rare disease diagnosis

Abstract

The RareDis corpus contains annotations for more than 5,000 rare diseases and almost 6,000 clinical manifestations. The Inter-Annotator Agreement evaluation shows relatively high agreement (F1-measure of 83.5% under exact-match criteria for entities and 81.3% for relations). These results indicate that the corpus is of high quality, which represents a significant step for the field given the scarcity of available corpora annotated with rare diseases. The corpus could open the door to further NLP applications that facilitate the diagnosis and treatment of rare diseases and thereby dramatically improve patients' quality of life.
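As a rough illustration of how an exact-match F1 agreement score between two annotators can be computed, here is a minimal sketch. It is not the authors' evaluation script: the function name `exact_match_f1`, the `(start, end, label)` tuple representation, and the example spans and labels are all assumptions made for illustration. Under exact match, two annotations agree only if their offsets and label are identical, and F1 reduces to 2·|A ∩ B| / (|A| + |B|).

```python
from typing import Hashable, Iterable


def exact_match_f1(annotator_a: Iterable[Hashable],
                   annotator_b: Iterable[Hashable]) -> float:
    """Pairwise agreement as F1, counting only identical annotations as matches.

    Each annotation is any hashable item, e.g. a (start, end, label) tuple
    for an entity (hypothetical representation, not the corpus format).
    """
    a, b = set(annotator_a), set(annotator_b)
    if not a and not b:
        # Both annotators marked nothing: treat as perfect agreement.
        return 1.0
    matched = len(a & b)  # annotations with identical span and label
    # F1 = 2TP / (2TP + FP + FN), which equals 2|A∩B| / (|A| + |B|).
    return 2 * matched / (len(a) + len(b))


# Illustrative annotations only; offsets and labels are made up.
annotator_a = [(0, 15, "RAREDISEASE"), (30, 42, "SIGN"), (50, 60, "SYMPTOM")]
annotator_b = [(0, 15, "RAREDISEASE"), (30, 40, "SIGN"), (50, 60, "SYMPTOM")]

score = exact_match_f1(annotator_a, annotator_b)
print(f"Exact-match F1: {score:.3f}")
```

In this toy example the second span differs by two characters, so it counts as a disagreement for both annotators under exact match. The same function could be applied to relations by representing each one as a hashable tuple, for instance (relation type, first argument, second argument).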

Taxonomy

Topics: Natural Language Processing Techniques · Genomics and Rare Diseases · Text Readability and Simplification