Answering real-world clinical questions using large language model based systems
Yen Sia Low (1), Michael L. Jackson (1), Rebecca J. Hyde (1), Robert E. Brown (1), Neil M. Sanghavi (1), Julian D. Baldwin (1), C. William Pike (1), Jananee Muralidharan (1), Gavin Hui (1, 2), Natasha Alexander (3), Hadeel Hassan (3), Rahul V. Nene (4), Morgan Pike (5)

TL;DR
This study evaluates various large language models for answering clinical questions, finding that specialized, purpose-built systems significantly outperform general-purpose models in relevance, reliability, and ability to handle novel queries.
Contribution
The paper demonstrates that combining retrieval-augmented generation and agentic LLMs enhances clinical question answering, highlighting the need for purpose-built healthcare LLM systems.
Findings
General-purpose LLMs rarely produced relevant, evidence-based answers (2%-10%).
RAG-based and agentic LLM systems produced relevant, evidence-based answers for 24% (OpenEvidence) to 58% (ChatRWD) of questions.
Only agentic ChatRWD answered novel questions effectively (65%).
Abstract
Evidence to guide healthcare decisions is often limited by a lack of relevant and trustworthy literature as well as difficulty in contextualizing existing research for a specific patient. Large language models (LLMs) could potentially address both challenges by either summarizing published literature or generating new studies based on real-world data (RWD). We evaluated the ability of five LLM-based systems to answer 50 clinical questions and had nine independent physicians review the responses for relevance, reliability, and actionability. As it stands, general-purpose LLMs (ChatGPT-4, Claude 3 Opus, Gemini Pro 1.5) rarely produced answers that were deemed relevant and evidence-based (2%-10%). In contrast, retrieval-augmented generation (RAG)-based and agentic LLM systems produced relevant and evidence-based answers for 24% (OpenEvidence) to 58% (ChatRWD) of questions. Only the…
Taxonomy
Topics: Topic Modeling · Biomedical Text Mining and Ontologies
Methods: Attention Is All You Need · Linear Layer · Weight Decay · Multi-Head Attention · Residual Connection · WordPiece · Softmax · Byte Pair Encoding · Layer Normalization
