MedMT-Bench: Can LLMs Memorize and Understand Long Multi-Turn Conversations in Medical Scenarios?
Lin Yang, Yuancheng Yang, Xu Wang, Changkun Liu, Haihua Yang

TL;DR
MedMT-Bench is a comprehensive benchmark designed to evaluate large language models' ability to understand and memorize long, multi-turn medical conversations, highlighting current limitations and guiding future improvements.
Contribution
This paper introduces MedMT-Bench, a novel, realistic medical multi-turn conversation benchmark with expert-validated test cases and an evaluation protocol, addressing gaps in existing medical AI assessments.
Findings
All tested models perform below 60% accuracy on MedMT-Bench.
The best model achieves 59.75% accuracy, indicating significant room for improvement.
MedMT-Bench effectively challenges models on long-context memory and instruction following in medical scenarios.
Abstract
Large Language Models (LLMs) have demonstrated impressive capabilities across various specialist domains and have been integrated into high-stakes areas such as medicine. However, existing medical benchmarks rarely stress-test the long-context memory, interference robustness, and safety defense required in practice. To bridge this gap, we introduce MedMT-Bench, a challenging medical multi-turn instruction-following benchmark that simulates the entire diagnosis and treatment process. We construct the benchmark via scene-by-scene data synthesis refined by manual expert editing, yielding 400 test cases that are highly consistent with real-world application scenarios. Each test case has an average of 22 rounds (maximum of 52 rounds), covering 5 types of difficult instruction-following issues. For evaluation, we propose an LLM-as-judge protocol with instance-level rubrics and…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Medical multi-turn conversations are both high-risk and frequent in real clinical workflows. The paper is well-motivated by identifying that existing medical benchmarks primarily test static knowledge via single-turn or short-dialogue formats.
2. The benchmark is structured around the entire diagnosis and treatment process, from pre-diagnosis to post-diagnosis. This improves the validity and realism of the scenarios.
3. The paper introduces an atomic test points mechanism for its LLM-as-judge
1. While the benchmark covers 24 medical departments, the main results in Tables 3 and 4 only report aggregate performance. There is no analysis of which departments are more or less difficult, which is common in medical benchmark studies and would provide more clinical insight.
2. Lack of modality ablation study: the paper presents results for text, image, and merged subsets but does not provide a quantitative breakdown of which capabilities are most impacted by the image modality. The analysis
- The authors focus on multi-turn evaluation in the medical domain, which is both timely and critically important.
- The authors conduct a comprehensive analysis across different LLMs and share insightful observations, pointing to opportunities for future work in both pure LLM research and the medical domain.
- The proposed data generation pipeline combines multi-agent synthesis with manual expert editing, resulting in 400 high-quality samples, which is promising for ensuring the reliability and accuracy
- The main limitation of the paper is its heavy dependence on LaaJ during evaluation, which poses additional concerns for specialized domains like medicine. Although the authors conduct human analysis, there is no guarantee that the LaaJ model (Gemini in this case) has sufficient domain knowledge to provide accurate judgments, especially when tasks increase in difficulty. In such situations, LaaJ may prove untrustworthy, especially for medical factuality. A multi-agent LaaJ approach could serve
I think the idea of the atomic test points is interesting, as a way to better use LLMs for automatic evaluation. I also think it's a valuable contribution to create more open source evaluation data along the key dimensions in the paper. The paper is testing a large range of models (17) which is good to understand which models perform well in this domain.
The most significant weakness of this work, from my point of view, is that it's not possible to properly evaluate the contributions in the paper because many details are missing, despite the paper being very long (60 pages). In particular: It's not clear if the accuracy metric is the fraction of the total test points that pass over the full corpus or if it's aggregated over the conversations. Why do the models have a consistency of less than 50% (random guessing, if I understand the metric) without the
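The two readings of the accuracy metric that this review distinguishes can give different numbers on the same judge output. A minimal sketch, assuming each conversation yields a list of pass/fail verdicts on its atomic test points; the function names and the verdict data are hypothetical and are not taken from the paper:

```python
from statistics import mean


def micro_accuracy(conversations):
    """Fraction of all test points passed, pooled over the whole corpus."""
    passed = sum(sum(points) for points in conversations)
    total = sum(len(points) for points in conversations)
    return passed / total


def macro_accuracy(conversations):
    """Mean of per-conversation pass rates; every conversation counts equally."""
    return mean(sum(points) / len(points) for points in conversations)


# Hypothetical judge verdicts (True = atomic test point passed).
verdicts = [
    [True, True, True, True, True, True, True, False],  # long conversation: 7 of 8
    [False, True],                                       # short conversation: 1 of 2
]

micro = micro_accuracy(verdicts)  # 8 passed out of 10 points
macro = macro_accuracy(verdicts)  # average of 0.875 and 0.5
print(f"micro={micro:.4f} macro={macro:.4f}")
```

With these illustrative verdicts the pooled score is 0.8 while the per-conversation average is 0.6875, so a reported accuracy is only interpretable once the aggregation is stated.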
- The medical domain doesn't see much multi-turn conversation evaluation, which makes the question at hand timely.
- The benchmark coverage is adequate: the authors used pretty much all the frontier and open-source models, so the numbers provided can help the community differentiate some of the models.
- Overall, the benchmark focuses only on synthetic data, i.e. no real patient data and no real-world-driven scenarios. Although there are some expert reviews, the contribution is incremental. Take HealthBench as an example: all of its scenarios are real-world driven and are sourced globally to uncover some of the common issues LLMs have in the medical domain.
- Further, on some concepts, the authors equate “instruction following” with “medical safety and reasoning.” In high-stakes domains, those are disti
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Topic Modeling
