Stop Listening to Me! How Multi-turn Conversations Can Degrade LLM Diagnostic Reasoning
Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Gao, Juming Xiong, Zhijun Yin, and Bradley A. Malin

TL;DR
This study shows that multi-turn conversations can impair large language models' diagnostic reasoning in healthcare, causing them to abandon correct diagnoses and switch blindly to incorrect suggestions.
Contribution
The paper introduces a "stick-or-switch" evaluation framework that measures model conviction and flexibility, and uses it to show that multi-turn interactions often degrade LLM diagnostic performance in clinical settings.
Findings
Multi-turn conversations reduce LLM diagnostic accuracy compared with single-turn prompting.
Models often abandon correct diagnoses to align with user suggestions.
Several models fail to distinguish between correct signals and incorrect suggestions.
Abstract
Patients and clinicians are increasingly using chatbots powered by large language models (LLMs) for healthcare inquiries. While state-of-the-art LLMs exhibit high performance on static diagnostic reasoning benchmarks, their efficacy across multi-turn conversations, which better reflect real-world usage, has been understudied. In this paper, we evaluate 17 LLMs across three clinical datasets to investigate how partitioning the decision-space into multiple simpler turns of conversation influences their diagnostic reasoning. Specifically, we develop a "stick-or-switch" evaluation framework to measure model conviction (i.e., defending a correct diagnosis or safe abstention against incorrect suggestions) and flexibility (i.e., recognizing a correct suggestion when it is introduced) across conversations. Our experiments reveal the conversation tax, where multi-turn interactions consistently…
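As a rough illustration of the two quantities named in the abstract, the sketch below computes a conviction rate and a flexibility rate from per-conversation outcomes. It is not the authors' implementation: the `Trial` record, its field names, and the exact conditioning of each rate are assumptions made for this example, since the excerpt does not give the paper's formal metric definitions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trial:
    """One conversation outcome (hypothetical schema, not from the paper)."""
    initial_ok: bool          # first answer was a correct diagnosis or a safe abstention
    suggestion_correct: bool  # the diagnosis suggested in a later turn was correct
    switched: bool            # the model adopted the suggested diagnosis


def conviction(trials: List[Trial]) -> Optional[float]:
    """Share of trials in which the model kept a correct answer (or safe
    abstention) after an incorrect suggestion. Assumed definition."""
    relevant = [t for t in trials if t.initial_ok and not t.suggestion_correct]
    if not relevant:
        return None  # rate is undefined without any qualifying trials
    return sum(not t.switched for t in relevant) / len(relevant)


def flexibility(trials: List[Trial]) -> Optional[float]:
    """Share of trials in which the model adopted a correct suggestion after
    an initially incorrect answer. Assumed definition."""
    relevant = [t for t in trials if not t.initial_ok and t.suggestion_correct]
    if not relevant:
        return None  # rate is undefined without any qualifying trials
    return sum(t.switched for t in relevant) / len(relevant)
```

Under these assumed definitions, a model that switches whenever a suggestion appears would score high on flexibility and low on conviction, which is the "blind switching" pattern the findings describe.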
