The RareDis corpus: a corpus annotated with rare diseases, their signs and symptoms
Claudia Martínez-deMiguel, Isabel Segura-Bedmar, Esteban Chacón-Solano, Sara Guerrero-Aspizua

TL;DR
The RareDis corpus is a high-quality annotated dataset of over 5,000 rare diseases and nearly 6,000 clinical signs, enabling improved NLP applications for diagnosis and treatment of rare diseases.
Contribution
This paper introduces the RareDis corpus, a novel annotated dataset of rare diseases and symptoms, addressing the scarcity of such resources for NLP research.
Findings
High inter-annotator agreement (F1 83.5%) for entities
High inter-annotator agreement (F1 81.3%) for relations
Potential to facilitate NLP applications in rare disease diagnosis
Abstract
The RareDis corpus is a corpus in which more than 5,000 rare diseases and almost 6,000 clinical manifestations are annotated. Moreover, the Inter-Annotator Agreement evaluation shows relatively high agreement (an F1-measure of 83.5% under exact match criteria for the entities and 81.3% for the relations). These results indicate that the corpus is of high quality, which represents a significant step for the field given the scarcity of available corpora annotated with rare diseases. This could open the door to further NLP applications, which would facilitate the diagnosis and treatment of these rare diseases and, therefore, dramatically improve the quality of life of these patients.
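As a rough illustration of what an exact-match F1 agreement score measures, the sketch below compares the entity annotations of two annotators on the same text, treating one annotator as the reference. It is not the authors' evaluation script: the `Entity` tuple, the `exact_match_f1` function, the labels, and the character offsets are all hypothetical, chosen only to show that under exact match a span counts as agreed only when start, end, and label are all identical.

```python
from typing import Iterable, NamedTuple


class Entity(NamedTuple):
    """Hypothetical annotation record: character offsets plus an entity label."""
    start: int
    end: int
    label: str


def exact_match_f1(reference: Iterable[Entity], other: Iterable[Entity]) -> float:
    """F1 between two annotators, counting only identical (start, end, label) triples."""
    ref, oth = set(reference), set(other)
    if not ref and not oth:
        return 1.0  # both annotators marked nothing: treat as full agreement
    matches = len(ref & oth)
    precision = matches / len(oth) if oth else 0.0
    recall = matches / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative (made-up) annotations of one document by two annotators.
annotator_a = [
    Entity(0, 15, "RAREDISEASE"),
    Entity(30, 42, "SIGN"),
    Entity(50, 60, "SYMPTOM"),
]
annotator_b = [
    Entity(0, 15, "RAREDISEASE"),
    Entity(30, 40, "SIGN"),  # boundary differs by two characters: no exact match
    Entity(50, 60, "SYMPTOM"),
]

f1 = exact_match_f1(annotator_a, annotator_b)
print(f"Exact-match F1: {f1:.3f}")
```

Because a span that differs by even one character is counted as a disagreement, exact match is a strict criterion; a relaxed (overlap-based) criterion would generally yield a higher score for the same annotations.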
Taxonomy
Topics: Natural Language Processing Techniques · Genomics and Rare Diseases · Text Readability and Simplification
