Evaluating the Impact of Lab Test Results on Large Language Models Generated Differential Diagnoses from Clinical Case Vignettes

Balu Bhasuran; Qiao Jin; Yuzhang Xie; Carl Yang; Karim Hanna; Jennifer Costa; Cindy Shavor; Zhiyong Lu; Zhe He

arXiv:2411.02523·cs.CL·November 6, 2024

Open Access

TL;DR

This study evaluates how lab test results influence the accuracy of large language models in generating differential diagnoses from clinical vignettes, highlighting GPT-4's superior performance and the importance of lab data.

Contribution

It systematically assesses the impact of lab results on LLMs' diagnostic accuracy, comparing multiple models and incorporating clinician and knowledge graph evaluations.

Findings

01. GPT-4 achieved 55% accuracy for Top 1 diagnoses with lab data.

02. Lab results significantly improved diagnostic accuracy across models.

03. LLMs correctly interpreted common lab tests for differential diagnosis.

Abstract

Differential diagnosis is crucial in medicine because it helps healthcare providers systematically distinguish between conditions that share similar symptoms. This study assesses the impact of lab test results on differential diagnoses (DDx) made by large language models (LLMs). Clinical vignettes incorporating patient demographics, symptoms, and lab results were created from 50 case reports from PubMed Central. Five LLMs (GPT-4, GPT-3.5, Llama-2-70b, Claude-2, and Mixtral-8x7B) were tested on generating Top 10, Top 5, and Top 1 DDx with and without lab data. A comprehensive evaluation involving GPT-4, a knowledge graph, and clinicians was conducted. GPT-4 performed best, achieving 55% accuracy for Top 1 diagnoses and 60% for Top 10 with lab data, with lenient accuracy up to 80%. Lab results significantly improved accuracy, with GPT-4 and Mixtral excelling, though exact match rates were low.…
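
The Top 1, Top 5, and Top 10 figures quoted above are top-k accuracies: a case counts as a hit when the reference diagnosis appears among the first k entries of the model's ranked DDx list. The following is a minimal sketch of that metric only, not the study's evaluation pipeline. It assumes strict matching by normalized string equality, whereas the study relied on GPT-4, a knowledge graph, and clinicians and also reported lenient scores. The function names and the three toy cases are hypothetical and are not data from the paper.

```python
from typing import List, Sequence, Tuple


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not block a match."""
    return " ".join(name.lower().split())


def top_k_hit(ranked_ddx: Sequence[str], reference: str, k: int) -> bool:
    """Return True if the reference diagnosis is among the first k ranked candidates."""
    ref = normalize(reference)
    return any(normalize(candidate) == ref for candidate in ranked_ddx[:k])


def top_k_accuracy(cases: Sequence[Tuple[Sequence[str], str]], k: int) -> float:
    """Fraction of cases whose reference diagnosis appears in the top k candidates."""
    if not cases:
        return 0.0
    hits = sum(top_k_hit(ddx, ref, k) for ddx, ref in cases)
    return hits / len(cases)


# Hypothetical toy cases: (ranked DDx list from a model, reference diagnosis).
cases: List[Tuple[List[str], str]] = [
    (["Pneumonia", "Pulmonary embolism", "Heart failure"], "pulmonary embolism"),
    (["Iron deficiency anemia", "Thalassemia"], "Iron deficiency anemia"),
    (["Migraine", "Tension headache"], "Subarachnoid hemorrhage"),
]

for k in (1, 5, 10):
    print(f"Top {k} accuracy: {top_k_accuracy(cases, k):.2f}")
```

Running the same function on DDx lists generated with and without lab data for the same vignettes would give the kind of paired comparison the abstract describes. Exact string matching is deliberately strict; a lenient variant would need a synonym or ontology lookup, which is outside this sketch.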

Taxonomy

Topics: Topic Modeling · Machine Learning in Healthcare

Methods: Linear Layer · Cosine Annealing · Layer Normalization · Position-Wise Feed-Forward Layer · Adam · Attention Is All You Need · Attention Dropout