Answering real-world clinical questions using large language model based systems

Yen Sia Low (1), Michael L. Jackson (1), Rebecca J. Hyde (1), Robert E. Brown (1), Neil M. Sanghavi (1), Julian D. Baldwin (1), C. William Pike (1), Jananee Muralidharan (1), Gavin Hui (1, 2), Natasha Alexander (3), Hadeel Hassan (3), Rahul V. Nene (4), Morgan Pike (5), Courtney J. Pokrzywa (6), Shivam Vedak (7), Adam Paul Yan (3), Dong-han Yao (7), Amy R. Zipursky (3), Christina Dinh (1), Philip Ballentine (1), Dan C. Derieg (1), Vladimir Polony (1), Rehan N. Chawdry (1), Jordan Davies (1), Brigham B. Hyde (1), Nigam H. Shah (1, 7), Saurabh Gombar (1, 8)

(1) Atropos Health, New York NY, USA; (2) Department of Medicine, University of California, Los Angeles CA, USA; (3) Department of Pediatrics, The Hospital for Sick Children, Toronto ON, Canada; (4) Department of Emergency Medicine, University of California, San Diego CA, USA; (5) Department of Emergency Medicine, University of Michigan, Ann Arbor MI, USA; (6) Department of Surgery, Columbia University, New York NY, USA; (7) Center for Biomedical Informatics Research, Stanford University, Stanford CA, USA; (8) Department of Pathology, Stanford University, Stanford CA, USA

arXiv:2407.00541·cs.CL·July 2, 2024·5 cites

Open Access

TL;DR

This study evaluates five LLM-based systems on 50 clinical questions, with physicians reviewing the responses, and finds that purpose-built RAG-based and agentic systems produce relevant, evidence-based answers far more often than general-purpose models (24%-58% versus 2%-10% of questions). Only the agentic system, ChatRWD, handled novel questions effectively.

Contribution

The paper demonstrates that combining retrieval-augmented generation and agentic LLMs enhances clinical question answering, highlighting the need for purpose-built healthcare LLM systems.

Findings

01 General-purpose LLMs (ChatGPT-4, Claude 3 Opus, Gemini Pro 1.5) rarely produced relevant, evidence-based answers (2%-10% of questions).

02 RAG-based and agentic LLM systems produced relevant, evidence-based answers for 24% (OpenEvidence) to 58% (ChatRWD) of questions.

03 Only the agentic ChatRWD answered novel questions effectively (65%).

Abstract

Evidence to guide healthcare decisions is often limited by a lack of relevant and trustworthy literature as well as difficulty in contextualizing existing research for a specific patient. Large language models (LLMs) could potentially address both challenges by either summarizing published literature or generating new studies based on real-world data (RWD). We evaluated the ability of five LLM-based systems in answering 50 clinical questions and had nine independent physicians review the responses for relevance, reliability, and actionability. As it stands, general-purpose LLMs (ChatGPT-4, Claude 3 Opus, Gemini Pro 1.5) rarely produced answers that were deemed relevant and evidence-based (2% - 10%). In contrast, retrieval augmented generation (RAG)-based and agentic LLM systems produced relevant and evidence-based answers for 24% (OpenEvidence) to 58% (ChatRWD) of questions. Only the…

Taxonomy

Topics: Topic Modeling · Biomedical Text Mining and Ontologies

Methods: Attention Is All You Need · Linear Layer · Weight Decay · Multi-Head Attention · Residual Connection · WordPiece · Softmax · Byte Pair Encoding · Layer Normalization