Benchmarking Multi-turn Medical Diagnosis: Hold, Lure, and Self-Correction
Jinrui Fang, Runhan Chen, Xu Yang, Jian Yu, Jiawei Xu, Ashwin Vinod, Wenqi Shi, Tianlong Chen, Heng Ji, ChengXiang Zhai, Ying Ding, Yuji Zhang

TL;DR
This paper introduces MINT, a multi-turn medical diagnosis benchmark, revealing key behavioral patterns of LLMs and offering strategies to improve their reliability in clinical reasoning tasks.
Contribution
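The paper presents MINT (Medical Incremental N-Turn Benchmark), a multi-turn medical diagnosis benchmark of 1,035 cases with clinically labeled evidence shards, evaluates 11 LLMs on it, and provides insights and recommendations to enhance LLMs' decision-making in clinical settings.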
Findings
Models rush to answer before sufficient evidence has been observed: over 55% of answers are committed within the first two turns.
Self-correction (incorrect-to-correct answer revision) occurs more frequently than correct-to-incorrect flips.
Deferring answers and reserving salient evidence improve accuracy.
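The first two findings can be made concrete with a small amount of bookkeeping over per-turn answers. The following is a minimal sketch, not the paper's evaluation code: it assumes each case is stored as a list with one entry per turn, where `None` means the model has not committed to an answer yet. The trace format and the function names (`first_answer_turn`, `early_commit_share`, `count_flips`) are hypothetical, and the sketch reports raw counts rather than the rates used in the paper.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical trace format: one entry per turn, None = no answer committed yet.
Trace = Sequence[Optional[str]]


def first_answer_turn(trace: Trace) -> Optional[int]:
    """Return the 1-based turn of the first committed answer, or None if there is none."""
    for turn, answer in enumerate(trace, start=1):
        if answer is not None:
            return turn
    return None


def early_commit_share(traces: Sequence[Trace], max_turn: int = 2) -> float:
    """Share of answered cases whose first answer arrives at or before max_turn."""
    firsts = [t for t in map(first_answer_turn, traces) if t is not None]
    if not firsts:
        return 0.0
    return sum(t <= max_turn for t in firsts) / len(firsts)


def count_flips(trace: Trace, gold: str) -> Tuple[int, int]:
    """Count (incorrect-to-correct, correct-to-incorrect) changes between consecutive answers."""
    answers = [a for a in trace if a is not None]
    to_correct = to_incorrect = 0
    for prev, curr in zip(answers, answers[1:]):
        if prev != gold and curr == gold:
            to_correct += 1
        elif prev == gold and curr != gold:
            to_incorrect += 1
    return to_correct, to_incorrect
```

Summing the two flip counts over all cases and dividing one by the other gives a ratio of the kind quoted in the abstract, under the assumption that a flip is measured between consecutive committed answers.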
Abstract
Large language models (LLMs) achieve high accuracy in medical diagnosis when all clinical information is provided in a single turn, yet how they behave under multi-turn evidence accumulation, which is closer to real clinical reasoning, remains unexplored. We introduce MINT (Medical Incremental N-Turn Benchmark), a high-fidelity, multi-turn medical diagnosis benchmark comprising 1,035 cases with clinically labeled evidence shards, controlled turn granularity, and information-preserving decomposition. Through systematic evaluation of 11 LLMs on MINT, we uncover three persistent behavioral patterns that significantly impact diagnostic decisions: (1) intent to answer: models rush to answer before sufficient evidence has been observed, with over 55% of answers committed within the first two turns; (2) self-correction: incorrect-to-correct answer revisions occur at up to 10.6 times the rate of…
