Stop Listening to Me! How Multi-turn Conversations Can Degrade LLM Diagnostic Reasoning

Kevin H. Guo; Chao Yan; Avinash Baidya; Katherine Brown; Xiang Gao; Juming Xiong; Zhijun Yin; and Bradley A. Malin

arXiv:2603.11394 · cs.CL · April 10, 2026

TL;DR

This study shows that multi-turn conversations can impair large language models' diagnostic reasoning in healthcare, causing them to abandon correct diagnoses and switch blindly to incorrect suggestions.

Contribution

The paper introduces a "stick-or-switch" evaluation framework and shows that multi-turn interactions often degrade LLM diagnostic performance in clinical settings.

Findings

01. Multi-turn conversations reduce LLM diagnostic accuracy compared with single-turn interactions.

02. Models often abandon correct diagnoses to align with user suggestions.

03. Several models fail to distinguish between correct signals and incorrect suggestions.

Abstract

Patients and clinicians are increasingly using chatbots powered by large language models (LLMs) for healthcare inquiries. While state-of-the-art LLMs exhibit high performance on static diagnostic reasoning benchmarks, their efficacy across multi-turn conversations, which better reflect real-world usage, has been understudied. In this paper, we evaluate 17 LLMs across three clinical datasets to investigate how partitioning the decision-space into multiple simpler turns of conversation influences their diagnostic reasoning. Specifically, we develop a "stick-or-switch" evaluation framework to measure model conviction (i.e., defending a correct diagnosis or safe abstention against incorrect suggestions) and flexibility (i.e., recognizing a correct suggestion when it is introduced) across conversations. Our experiments reveal the conversation tax, where multi-turn interactions consistently…
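The two quantities named in the abstract, conviction and flexibility, can be illustrated with a minimal sketch. This is not the authors' implementation: the `Trial` record, its field names, and the two rate functions are hypothetical, and the exact conditioning used in the paper may differ. The sketch assumes each trial records whether the model's initial answer was correct or a safe abstention, whether the suggestion introduced later was correct, and whether the model switched to that suggestion.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trial:
    """One hypothetical stick-or-switch trial (field names are assumptions)."""
    initial_ok: bool          # initial answer was correct or a safe abstention
    suggestion_correct: bool  # the suggestion introduced later is the true diagnosis
    switched: bool            # the model adopted the suggestion


def conviction_rate(trials: List[Trial]) -> float:
    """Share of trials where the model held an acceptable initial answer
    against an incorrect suggestion (assumed definition)."""
    relevant = [t for t in trials if t.initial_ok and not t.suggestion_correct]
    if not relevant:
        return float("nan")
    return sum(not t.switched for t in relevant) / len(relevant)


def flexibility_rate(trials: List[Trial]) -> float:
    """Share of trials where the model adopted a correct suggestion
    once it was introduced (assumed definition)."""
    relevant = [t for t in trials if t.suggestion_correct]
    if not relevant:
        return float("nan")
    return sum(t.switched for t in relevant) / len(relevant)


# Toy data for illustration only; these are not results from the paper.
trials = [
    Trial(initial_ok=True, suggestion_correct=False, switched=False),  # stuck: good
    Trial(initial_ok=True, suggestion_correct=False, switched=True),   # switched: bad
    Trial(initial_ok=False, suggestion_correct=True, switched=True),   # switched: good
    Trial(initial_ok=False, suggestion_correct=True, switched=False),  # stuck: bad
    Trial(initial_ok=True, suggestion_correct=True, switched=True),    # abstained, then accepted
]

print("conviction:", conviction_rate(trials))
print("flexibility:", flexibility_rate(trials))
```

Reporting the two rates separately matters because a model that always switches would score perfectly on flexibility and zero on conviction, while a model that never switches would show the reverse.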
