When Chain-of-Thought Backfires: Evaluating Prompt Sensitivity in Medical Language Models
Binesh Sadanandan, Vahid Behzadan

TL;DR
This study evaluates the sensitivity of medical language models to prompt variations, revealing that common prompting strategies often decrease accuracy and highlighting the importance of robust evaluation methods.
Contribution
It systematically assesses prompt sensitivity in medical LLMs, demonstrating that standard prompt engineering techniques may not be effective and proposing alternative scoring methods.
Findings
Chain-of-Thought prompting decreases accuracy by 5.7% compared to direct answering.
Shuffling answer options changes the model's prediction 59.1% of the time.
Cloze scoring surpasses prompting strategies, achieving up to 64.5% accuracy.
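The shuffling finding can be illustrated with a minimal sketch of a prediction flip rate. This summary does not give the paper's exact metric definition, so the sketch assumes a flip means the model selects a different option text after the options are reordered. The names `flip_rate` and `predict` are hypothetical, and `predict` stands in for any model call that returns the index of the chosen option.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical type: a predictor takes a question and its options and
# returns the index of the option it selects.
Predictor = Callable[[str, Sequence[str]], int]


def flip_rate(
    questions: List[Tuple[str, List[str]]],
    predict: Predictor,
    seed: int = 0,
) -> float:
    """Fraction of questions whose selected option text changes after shuffling.

    Assumption: a "flip" is counted when the text of the chosen option
    differs between the original and the shuffled ordering.
    """
    rng = random.Random(seed)
    flips = 0
    for question, options in questions:
        # Answer chosen with the original option order.
        original_choice = options[predict(question, options)]
        # Answer chosen after a random reordering of the same options.
        shuffled = list(options)
        rng.shuffle(shuffled)
        shuffled_choice = shuffled[predict(question, shuffled)]
        if original_choice != shuffled_choice:
            flips += 1
    return flips / len(questions)


def always_first(question: str, options: Sequence[str]) -> int:
    """Toy predictor with pure position bias: always picks the first slot."""
    return 0


def longest_option(question: str, options: Sequence[str]) -> int:
    """Toy predictor that depends only on content, not on position."""
    return max(range(len(options)), key=lambda i: len(options[i]))


toy_questions = [(f"q{i}", ["a", "bb", "ccc", "dddd"]) for i in range(50)]
position_biased_rate = flip_rate(toy_questions, always_first)
content_based_rate = flip_rate(toy_questions, longest_option)
```

A predictor that depends only on option content yields a flip rate of zero, while one that depends only on position flips whenever the shuffle moves a different option into its preferred slot. A real evaluation would replace the toy predictors with calls to the model under test.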
Abstract
Large Language Models (LLMs) are increasingly deployed in medical settings, yet their sensitivity to prompt formatting remains poorly characterized. We evaluate MedGemma (4B and 27B parameters) on MedMCQA (4,183 questions) and PubMedQA (1,000 questions) across a broad suite of robustness tests. Our experiments reveal several concerning findings. Chain-of-Thought (CoT) prompting decreases accuracy by 5.7% compared to direct answering. Few-shot examples degrade performance by 11.9% while increasing position bias from 0.14 to 0.47. Shuffling answer options causes the model to change predictions 59.1% of the time, with accuracy dropping up to 27.4 percentage points. Front-truncating context to 50% causes accuracy to plummet below the no-context baseline, yet back-truncation preserves 97% of full-context accuracy. We further show that cloze scoring (selecting the highest log-probability…
