This Treatment Works, Right? Evaluating LLM Sensitivity to Patient Question Framing in Medical QA
Hye Sun Yun, Geetika Kapoor, Michael Mackert, Ramez Kouzy, Wei Xu, Junyi Jessy Li, Byron C. Wallace

TL;DR
This study systematically evaluates how prompt phrasing affects LLM responses in medical question answering, revealing significant inconsistencies driven by question framing and conversation context.
Contribution
It introduces a dataset and analysis demonstrating the impact of question framing on LLM consistency in medical QA, emphasizing the need for robustness in high-stakes applications.
Findings
Asking the same question with positive versus negative framing makes contradictory answers more likely.
Multi-turn conversations amplify framing-induced inconsistencies.
Language style (technical vs. plain language) has no significant effect on response consistency.
Abstract
Patients are increasingly turning to large language models (LLMs) with medical questions that are complex and difficult to articulate clearly. However, LLMs are sensitive to prompt phrasings and can be influenced by the way questions are worded. Ideally, LLMs should respond consistently regardless of phrasing, particularly when grounded in the same underlying evidence. We investigate this through a systematic evaluation in a controlled retrieval-augmented generation (RAG) setting for medical question answering (QA), where expert-selected documents are used rather than retrieved automatically. We examine two dimensions of patient query variation: question framing (positive vs. negative) and language style (technical vs. plain language). We construct a dataset of 6,614 query pairs grounded in clinical trial abstracts and evaluate response consistency across eight LLMs. Our findings show…
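To make the idea of "response consistency across query pairs" concrete, here is a minimal sketch of how an inconsistency rate might be tallied per dimension of variation. It is not the authors' evaluation code. The `PairResult` type, the `dimension` field, and the verdict labels are hypothetical names introduced for illustration, and the sketch assumes each model response has already been reduced to a single conclusion label.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PairResult:
    """One query pair answered by one model (hypothetical schema)."""
    dimension: str   # which aspect differs within the pair: "framing" or "style"
    verdict_a: str   # conclusion label extracted from the first response
    verdict_b: str   # conclusion label extracted from the second response


def inconsistency_rates(results):
    """Return, per dimension, the fraction of pairs whose verdicts disagree."""
    totals = defaultdict(int)
    mismatches = defaultdict(int)
    for r in results:
        totals[r.dimension] += 1
        if r.verdict_a != r.verdict_b:
            mismatches[r.dimension] += 1
    return {dim: mismatches[dim] / totals[dim] for dim in totals}


# Toy input for illustration only; these are not results from the paper.
toy_pairs = [
    PairResult("framing", "effective", "not_effective"),
    PairResult("framing", "effective", "effective"),
    PairResult("style", "effective", "effective"),
    PairResult("style", "not_effective", "not_effective"),
]
rates = inconsistency_rates(toy_pairs)
```

Grouping by the varied dimension keeps the framing and language-style comparisons separate, mirroring the two dimensions named in the abstract. How verdicts are extracted from free-text responses is left open here.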
