Benchmarking Multi-turn Medical Diagnosis: Hold, Lure, and Self-Correction

Jinrui Fang; Runhan Chen; Xu Yang; Jian Yu; Jiawei Xu; Ashwin Vinod; Wenqi Shi; Tianlong Chen; Heng Ji; ChengXiang Zhai; Ying Ding; Yuji Zhang

arXiv:2604.04325 · cs.CL · April 7, 2026

TL;DR

This paper introduces MINT, a multi-turn medical diagnosis benchmark, revealing key behavioral patterns of LLMs and offering strategies to improve their reliability in clinical reasoning tasks.

Contribution

The paper presents MINT, a new benchmark for multi-turn medical diagnosis, and provides insights and recommendations to enhance LLMs' decision-making in clinical settings.

Findings

01

Models often commit to an answer prematurely: over 55% of answers are given within the first two turns.

02

Self-correction (incorrect-to-correct revision) occurs more frequently than correct-to-incorrect flips.

03

Deferring answers and reserving salient evidence improves accuracy.

Abstract

Large language models (LLMs) achieve high accuracy in medical diagnosis when all clinical information is provided in a single turn, yet how they behave under multi-turn evidence accumulation, which is closer to real clinical reasoning, remains unexplored. We introduce MINT (Medical Incremental N-Turn Benchmark), a high-fidelity, multi-turn medical diagnosis benchmark comprising 1,035 cases with clinically labeled evidence shards, controlled turn granularity, and information-preserving decomposition. Through systematic evaluation of 11 LLMs on MINT, we uncover three persistent behavioral patterns that significantly impact diagnostic decisions: (1) intent to answer: models rush to answer before sufficient evidence has been observed, with over 55% of answers committed within the first two turns; (2) self-correction: incorrect-to-correct answer revisions occur at up to 10.6 times the rate of…
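The two quantities named in the abstract, early commitment and answer revision, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the trajectory format (one answer per turn, `None` for a deferral), the helper names `first_commit_turn`, `early_commit_share`, and `count_flips`, and the toy cases are hypothetical and are not taken from the MINT codebase, whose exact metric definitions may differ.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical format: one entry per turn; None means the model deferred.
Trajectory = Sequence[Optional[str]]


def first_commit_turn(answers: Trajectory) -> Optional[int]:
    """Return the 1-based turn of the first committed answer, or None if never."""
    for turn, answer in enumerate(answers, start=1):
        if answer is not None:
            return turn
    return None


def early_commit_share(cases: List[Tuple[Trajectory, str]], max_turn: int = 2) -> float:
    """Fraction of committing cases whose first answer falls within max_turn turns."""
    turns = [first_commit_turn(answers) for answers, _gold in cases]
    committed = [t for t in turns if t is not None]
    if not committed:
        return 0.0
    return sum(t <= max_turn for t in committed) / len(committed)


def count_flips(answers: Trajectory, gold: str) -> Tuple[int, int]:
    """Count (incorrect-to-correct, correct-to-incorrect) revisions in one case."""
    fixes = breaks = 0
    previous: Optional[str] = None
    for answer in answers:
        if answer is None:
            continue  # deferrals are not revisions
        if previous is not None:
            was_correct = previous == gold
            is_correct = answer == gold
            if is_correct and not was_correct:
                fixes += 1
            elif was_correct and not is_correct:
                breaks += 1
        previous = answer
    return fixes, breaks


# Toy cases, invented for illustration only: (per-turn answers, gold diagnosis).
cases = [
    (["flu", "flu", "pneumonia"], "pneumonia"),
    ([None, None, "asthma"], "asthma"),
    ([None, "copd", "asthma", "copd"], "copd"),
]

share = early_commit_share(cases)
total_fixes = sum(count_flips(a, g)[0] for a, g in cases)
total_breaks = sum(count_flips(a, g)[1] for a, g in cases)
print(f"early-commit share: {share:.2f}")
print(f"incorrect-to-correct: {total_fixes}, correct-to-incorrect: {total_breaks}")
```

Comparing the two flip counts across a full set of trajectories gives a rate ratio of the kind the abstract reports, though the benchmark's own normalization is not specified in this excerpt.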
