Prompt engineering does not universally improve Large Language Model performance across clinical decision-making tasks
Mengdi Chai, Ali R. Zomorrodi

TL;DR
This study evaluates the performance of three state-of-the-art LLMs in clinical decision support tasks, revealing that prompt engineering's effectiveness varies by model and task, and is not universally beneficial.
Contribution
It provides a comprehensive analysis of LLMs in clinical decision-making, demonstrating that prompt engineering effects are highly model- and task-dependent, challenging the notion of a universal solution.
Findings
LLM performance varies widely by clinical task: near-perfect accuracy in final diagnosis, but poor performance in relevant diagnostic testing.
Prompt engineering improves some tasks but not others.
Model- and task-specific strategies are necessary for effective LLM use in healthcare.
Abstract
Large Language Models (LLMs) have demonstrated promise in medical knowledge assessments, yet their practical utility in real-world clinical decision-making remains underexplored. In this study, we evaluated the performance of three state-of-the-art LLMs (ChatGPT-4o, Gemini 1.5 Pro, and Llama 3.3 70B) in clinical decision support across the entire clinical reasoning workflow of a typical patient encounter. Using 36 case studies, we first assessed the LLMs' out-of-the-box performance across five key sequential clinical decision-making tasks (differential diagnosis, essential immediate steps, relevant diagnostic testing, final diagnosis, and treatment recommendation) under two temperature settings (default vs. zero). All models showed high variability by task, achieving near-perfect accuracy in final diagnosis, poor performance in relevant diagnostic testing, and moderate performance in remaining…
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Genomics and Rare Diseases
