When Cases Get Rare: A Retrieval Benchmark for Off-Guideline Clinical Question Answering
Doeun Lee, Muge Zhang, Yi Yu, Ashish Manne, Stephen Koesters, Frank Wen, Brady Buchanan, Lynda Villagomez, Oluwatoba Moninuola, James Lim, Kathryn Tobin, Andrew Srisuwananukorn, Ping Zhang, Sachin Kumar

TL;DR
This paper introduces OGCaReBench, a retrieval-based benchmark for evaluating medical LLMs on rare, real-world clinical questions requiring open-ended reasoning beyond guidelines.
Contribution
It presents a new benchmark dataset for assessing LLMs' ability to answer complex, rare clinical questions with evidence-grounded reasoning.
Findings
GPT-5.2 correctly answers 56% of questions without retrieval.
Specialized models reach only 42% accuracy.
Retrieval augmentation improves performance to 82%.
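The reported figures can be compared either as absolute gaps in percentage points or as relative improvements. The sketch below shows both. It assumes, purely for illustration, that the 82% retrieval-augmented figure can be set against the 56% and 42% baselines; the summary does not state which model or retrieval setup produced the 82%. The function names are hypothetical and not part of the benchmark.

```python
def gain_points(baseline_pct: float, improved_pct: float) -> float:
    """Absolute improvement in percentage points."""
    return improved_pct - baseline_pct


def relative_gain(baseline_pct: float, improved_pct: float) -> float:
    """Improvement as a fraction of the baseline accuracy."""
    return (improved_pct - baseline_pct) / baseline_pct


# Figures taken from the Findings list above. Pairing them with each other
# is an assumption, since the summary does not say they share a model.
NO_RETRIEVAL = 56.0   # GPT-5.2 without retrieval
SPECIALIZED = 42.0    # specialized models
WITH_RETRIEVAL = 82.0  # with retrieval augmentation

if __name__ == "__main__":
    for name, base in [("GPT-5.2, no retrieval", NO_RETRIEVAL),
                       ("specialized models", SPECIALIZED)]:
        pts = gain_points(base, WITH_RETRIEVAL)
        rel = relative_gain(base, WITH_RETRIEVAL)
        print(f"{name}: +{pts:.0f} points ({rel:.1%} relative)")
```

Percentage points and relative gain answer different questions, so it helps to state which one is meant when quoting the improvement.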
Abstract
Across medical specialties, clinical practice is anchored in evidence-based guidelines that codify the best-studied diagnostic and treatment pathways. These pathways routinely fall short for the long tail of real-world care not covered by guidelines. Most medical large language models (LLMs), however, are trained to encode common, guideline-focused medical knowledge in their parameters. Current evaluations test models primarily on recalling and reasoning with this memorized content, often in multiple-choice settings. Given the fundamental importance of evidence-based reasoning in medicine, it is neither feasible nor reliable to depend on memorization in practice. To address this gap, we introduce OGCaReBench, a free-form, retrieval-focused benchmark for evaluating how well LLMs answer clinical questions that require going beyond typical guidelines. Extracted from published medical case…
