Evaluation of GPT-3.5 and GPT-4 for supporting real-world information needs in healthcare delivery

Debadutta Dash; Rahul Thapa; Juan M. Banda; Akshay Swaminathan; Morgan Cheatham; Mehr Kashyap; Nikesh Kotecha; Jonathan H. Chen; Saurabh Gombar; Lance Downing; Rachel Pedreira; Ethan Goh; Angel Arnaout; Garret Kenn Morris; Honor Magon; Matthew P Lungren; Eric Horvitz; Nigam H. Shah

arXiv:2304.13714 · cs.AI · May 2, 2023 · 21 cites

Open Access

TL;DR

This study evaluates whether GPT-3.5 and GPT-4 can safely and accurately answer physicians' information needs in healthcare delivery. It finds that responses are rarely judged harmful but often disagree with expert consult reports, so the models may require further customization.

Contribution

It provides a real-world assessment of LLMs in healthcare, highlighting safety and accuracy issues, and emphasizes the need for tailored prompt engineering and calibration.

Findings

01 For no question did a majority of physicians deem either model's response harmful.

02 Less than 20% of responses agreed with the informatics consult reports.

03 Responses often contained hallucinated references and lacked concordance.

Abstract

Despite growing interest in using large language models (LLMs) in healthcare, current explorations do not assess the real-world utility and safety of LLMs in clinical settings. Our objective was to determine whether two LLMs can serve information needs submitted by physicians as questions to an informatics consultation service in a safe and concordant manner. Sixty-six questions from an informatics consult service were submitted to GPT-3.5 and GPT-4 via simple prompts. Twelve physicians assessed the LLM responses' possibility of patient harm and concordance with existing reports from an informatics consultation service. Physician assessments were summarized based on majority vote. For no questions did a majority of physicians deem either LLM response as harmful. For GPT-3.5, responses to 8 questions were concordant with the informatics consult report, 20 discordant, and 9 were unable to be…
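The majority-vote summary described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the label strings, the function names `majority_label` and `summarize`, and the rule that a label must be chosen by strictly more than half of the raters are all assumptions made for the example.

```python
from collections import Counter


def majority_label(votes):
    """Return the label chosen by more than half of the raters, else None.

    Assumption: "majority" means a strict majority of the votes cast.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


def summarize(assessments):
    """Count questions per majority label.

    `assessments` maps a question id to the list of rater labels for that
    question. Questions with no strict majority are counted under the
    hypothetical bucket "No majority".
    """
    summary = Counter()
    for votes in assessments.values():
        summary[majority_label(votes) or "No majority"] += 1
    return dict(summary)


if __name__ == "__main__":
    # Illustrative votes from 12 raters on two made-up questions.
    example = {
        "q1": ["Agree"] * 7 + ["Disagree"] * 5,
        "q2": ["Agree"] * 4 + ["Disagree"] * 4 + ["Unable to assess"] * 4,
    }
    print(summarize(example))
```

Under this rule a question on which the raters split evenly falls into the "No majority" bucket rather than being forced into one of the labels, which is one plausible way to handle ties with an even number of raters.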

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Electronic Health Records Systems · Ethics in Clinical Research

Methods: Attention Is All You Need · Cosine Annealing · Linear Layer · Adam · Layer Normalization · Attention Dropout · Dense Connections · Weight Decay