Rethinking Retrieval-Augmented Generation for Medicine: A Large-Scale, Systematic Expert Evaluation and Practical Insights

Hyunjae Kim; Jiwoong Sohn; Aidan Gilson; Nicholas Cochran-Caggiano; Serina Applebaum; Heeju Jin; Seihee Park; Yujin Park; Jiyeong Park; Seoyoung Choi; Brittany Alexandra Herrera Contreras; Thomas Huang; Jaehoon Yun; Ethan F. Wei; Roy Jiang; Leah Colucci; Eric Lai; Amisha Dave; Tuo Guo; Maxwell B. Singer; Yonghoe Koo; Ron A. Adelman; James Zou; Andrew Taylor; Arman Cohan; Hua Xu; Qingyu Chen

arXiv:2511.06738·cs.CL·November 11, 2025

TL;DR

This study critically evaluates retrieval-augmented generation (RAG) in medicine, revealing significant performance issues and proposing simple strategies to improve its reliability for medical applications.

Contribution

The paper presents what its authors describe as the most comprehensive expert evaluation of RAG in medicine to date, identifying key failure points and demonstrating effective mitigation strategies.

Findings

01. Only 22% of retrieved passages were relevant

02. Evidence selection precision was 41-43%

03. Simple strategies improved performance by up to 12%

Abstract

Large language models (LLMs) are transforming the landscape of medicine, yet two fundamental challenges persist: keeping up with rapidly evolving medical knowledge and providing verifiable, evidence-grounded reasoning. Retrieval-augmented generation (RAG) has been widely adopted to address these limitations by supplementing model outputs with retrieved evidence. However, whether RAG reliably achieves these goals remains unclear. Here, we present the most comprehensive expert evaluation of RAG in medicine to date. Eighteen medical experts contributed a total of 80,502 annotations, assessing 800 model outputs generated by GPT-4o and Llama-3.1-8B across 200 real-world patient and USMLE-style queries. We systematically decomposed the RAG pipeline into three components: (i) evidence retrieval (relevance of retrieved passages), (ii) evidence selection (accuracy of evidence usage), and (iii)…
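The first two components named in the abstract can be illustrated with a small calculation. The sketch below is not the authors' evaluation code: the `PassageAnnotation` schema, its field names, and the exact metric definitions are assumptions made for illustration, since this excerpt does not spell out how relevance or selection precision were computed.

```python
from dataclasses import dataclass


@dataclass
class PassageAnnotation:
    """One expert judgment about one retrieved passage (hypothetical schema)."""
    passage_id: str
    relevant: bool  # expert judged the passage relevant to the query
    cited: bool     # the model used this passage as evidence in its answer


def retrieval_relevance_rate(annotations):
    """Fraction of all retrieved passages that experts judged relevant."""
    if not annotations:
        return 0.0
    return sum(a.relevant for a in annotations) / len(annotations)


def selection_precision(annotations):
    """Of the passages the model used, the fraction judged relevant (assumed definition)."""
    cited = [a for a in annotations if a.cited]
    if not cited:
        return 0.0
    return sum(a.relevant for a in cited) / len(cited)


# Toy data, not taken from the paper.
example = [
    PassageAnnotation("p1", relevant=True, cited=True),
    PassageAnnotation("p2", relevant=False, cited=True),
    PassageAnnotation("p3", relevant=False, cited=False),
    PassageAnnotation("p4", relevant=False, cited=False),
]

print(retrieval_relevance_rate(example))  # 0.25
print(selection_precision(example))       # 0.5
```

Keeping the two metrics separate mirrors the decomposition in the abstract: a pipeline can retrieve mostly irrelevant passages and still select well among them, or the reverse, and a single end-to-end score would hide which stage failed.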

Code & Models

Models

Yale-BIDS-Chen/Llama-3.1-8B-Evidence-Filtering
Hugging Face model · 48 downloads · 1 like

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Machine Learning in Healthcare