Evaluation of GPT-3.5 and GPT-4 for supporting real-world information needs in healthcare delivery
Debadutta Dash, Rahul Thapa, Juan M. Banda, Akshay Swaminathan, Morgan Cheatham, Mehr Kashyap, Nikesh Kotecha, Jonathan H. Chen, Saurabh Gombar, Lance Downing, Rachel Pedreira, Ethan Goh, Angel Arnaout, Garret Kenn Morris, Honor Magon, Matthew P Lungren, Eric Horvitz

TL;DR
This study evaluates GPT-3.5 and GPT-4's ability to safely and accurately support healthcare information needs, revealing they are generally safe but often lack concordance with expert reports and may require further customization.
Contribution
It provides a real-world assessment of LLMs in healthcare, highlighting safety and accuracy issues, and emphasizes the need for tailored prompt engineering and calibration.
Findings
No responses were deemed overtly harmful by physicians.
Fewer than 20% of responses matched expert consultation reports.
Responses often contained hallucinated references and lacked concordance.
Abstract
Despite growing interest in using large language models (LLMs) in healthcare, current explorations do not assess the real-world utility and safety of LLMs in clinical settings. Our objective was to determine whether two LLMs can serve information needs submitted by physicians as questions to an informatics consultation service in a safe and concordant manner. Sixty-six questions from an informatics consult service were submitted to GPT-3.5 and GPT-4 via simple prompts. Twelve physicians assessed the LLM responses' possibility of patient harm and concordance with existing reports from an informatics consultation service. Physician assessments were summarized based on majority vote. For no questions did a majority of physicians deem either LLM response as harmful. For GPT-3.5, responses to 8 questions were concordant with the informatics consult report, 20 discordant, and 9 were unable to be…
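The majority-vote summarization mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the label names, the example ratings, and the strict "more than half" threshold are assumptions made for illustration, since the text here does not state how ties or pluralities were handled.

```python
from collections import Counter


def majority_label(votes):
    """Return the label chosen by a strict majority of raters.

    Returns None when no label receives more than half of the votes
    (an assumed rule; the study's exact tie handling is not given here).
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


# Hypothetical ratings from 12 physicians for one response (not study data).
ratings = ["discordant"] * 7 + ["concordant"] * 3 + ["unable to assess"] * 2
print(majority_label(ratings))
```

Under this assumed rule, a response split evenly between two labels would be reported as having no majority rather than being assigned to either label.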
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Electronic Health Records Systems · Ethics in Clinical Research
Methods: Attention Is All You Need · Cosine Annealing · Linear Layer · Adam · Layer Normalization · Attention Dropout · Dense Connections · Weight Decay
