Evaluating Large Language Models on Rare Disease Diagnosis: A Case Study using House M.D.
Arsh Gupta, Ajay Narayanan Sridhar, Bonam Mingole, Amulya Yadav

TL;DR
This study assesses the diagnostic accuracy of large language models on rare diseases using a novel dataset from House M.D., revealing significant performance variation and highlighting future research directions.
Contribution
Introduces a new dataset from House M.D. for evaluating LLMs on rare disease diagnosis and provides baseline performance metrics and an evaluation framework.
Findings
Model accuracy ranged from 16.48% to 38.64%.
Newer model generations achieved roughly 2.3 times the accuracy of older ones.
All models faced challenges with rare disease diagnosis.
Abstract
Large language models (LLMs) have demonstrated capabilities across diverse domains, yet their performance on rare disease diagnosis from narrative medical cases remains underexplored. We introduce a novel dataset of 176 symptom-diagnosis pairs extracted from House M.D., a medical television series validated for teaching rare disease recognition in medical education. We evaluate four state-of-the-art LLMs (GPT 4o mini, GPT 5 mini, Gemini 2.5 Flash, and Gemini 2.5 Pro) on narrative-based diagnostic reasoning tasks. Results show significant variation in performance, ranging from 16.48% to 38.64% accuracy, with newer model generations demonstrating a 2.3 times improvement. While all models face substantial challenges with rare disease diagnosis, the observed improvement across architectures suggests promising directions for future development. Our educationally validated benchmark…
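The reported figures can be sanity-checked with a little arithmetic. The sketch below assumes accuracy is simply correct diagnoses divided by the 176 cases. The abstract does not state the scoring rule or the raw counts, so the counts computed here are inferred, and the helper name `implied_correct` is hypothetical.

```python
# Back out the whole-number counts implied by the reported accuracies.
# Assumption: accuracy = correct / total over all 176 cases (not stated in the source).
TOTAL_CASES = 176  # dataset size given in the abstract


def implied_correct(accuracy_pct: float, total: int = TOTAL_CASES) -> int:
    """Nearest whole number of correct diagnoses for a reported accuracy."""
    return round(accuracy_pct / 100 * total)


low = implied_correct(16.48)   # weakest model
high = implied_correct(38.64)  # strongest model

# Recompute the percentages from the inferred counts to confirm they round
# back to the reported values, then compare best against worst.
low_pct = round(100 * low / TOTAL_CASES, 2)
high_pct = round(100 * high / TOTAL_CASES, 2)
ratio = high / low

print(low, high)          # 29 68
print(low_pct, high_pct)  # 16.48 38.64
print(round(ratio, 2))    # 2.34
```

Under that assumption, the range corresponds to 29 and 68 correct diagnoses out of 176, and the best-to-worst ratio of about 2.34 is consistent with the stated 2.3 times improvement.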
Taxonomy
Topics: Genomics and Rare Diseases · Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare
