Prompt engineering does not universally improve Large Language Model performance across clinical decision-making tasks

Mengdi Chai; Ali R. Zomorrodi

arXiv:2512.22966·cs.CL·December 30, 2025

Open Access

TL;DR

This study evaluates the performance of three state-of-the-art LLMs in clinical decision support tasks, revealing that prompt engineering's effectiveness varies by model and task, and is not universally beneficial.

Contribution

The paper provides a comprehensive analysis of LLMs in clinical decision-making, demonstrating that the effects of prompt engineering are highly model- and task-dependent, which challenges the notion that prompt engineering is a universal solution.

Findings

01 LLMs show high variability across clinical tasks.

02 Prompt engineering improves some tasks but not others.

03 Model- and task-specific strategies are necessary for effective LLM use in healthcare.

Abstract

Large Language Models (LLMs) have demonstrated promise in medical knowledge assessments, yet their practical utility in real-world clinical decision-making remains underexplored. In this study, we evaluated the performance of three state-of-the-art LLMs (ChatGPT-4o, Gemini 1.5 Pro, and Llama 3.3 70B) in clinical decision support across the entire clinical reasoning workflow of a typical patient encounter. Using 36 case studies, we first assessed the LLMs' out-of-the-box performance across five key sequential clinical decision-making tasks under two temperature settings (default vs. zero): differential diagnosis, essential immediate steps, relevant diagnostic testing, final diagnosis, and treatment recommendation. All models showed high variability by task, achieving near-perfect accuracy in final diagnosis, poor performance in relevant diagnostic testing, and moderate performance in remaining…
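
The evaluation design described in the abstract (three models, two temperature settings, 36 cases, five sequential tasks) can be laid out as a grid. The following is a minimal sketch of that layout only, not the authors' code: the function names `evaluation_grid` and `mean_score_by_task` are hypothetical, and the per-response `score` is assumed to be a number between 0 and 1, since the abstract does not state how responses were graded. No model is called and no results are reproduced.

```python
from collections import defaultdict
from itertools import product

# Names taken from the abstract.
MODELS = ["ChatGPT-4o", "Gemini 1.5 Pro", "Llama 3.3 70B"]
TEMPERATURES = ["default", "zero"]
TASKS = [
    "differential diagnosis",
    "essential immediate steps",
    "relevant diagnostic testing",
    "final diagnosis",
    "treatment recommendation",
]
N_CASES = 36


def evaluation_grid():
    """Yield every (model, temperature, case_id, task) combination.

    Tasks are iterated innermost so that, within one case, they appear
    in the sequential workflow order listed in the abstract.
    """
    case_ids = range(1, N_CASES + 1)
    for model, temp, case_id in product(MODELS, TEMPERATURES, case_ids):
        for task in TASKS:
            yield model, temp, case_id, task


def mean_score_by_task(records):
    """Average scores per (model, temperature, task) cell.

    `records` is an iterable of (model, temp, case_id, task, score)
    tuples; `score` is an assumed numeric grade, not the paper's metric.
    """
    totals = defaultdict(lambda: [0.0, 0])  # cell -> [sum, count]
    for model, temp, _case_id, task, score in records:
        cell = totals[(model, temp, task)]
        cell[0] += score
        cell[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}
```

Under these assumptions the grid contains 3 × 2 × 36 × 5 = 1080 model responses to grade, and aggregation yields 30 cells (one per model, temperature, and task), which is the level at which the abstract reports task-by-task variability.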

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Genomics and Rare Diseases