MedDialogRubrics: A Comprehensive Benchmark and Evaluation Framework for Multi-turn Medical Consultations in Large Language Models

Lecheng Gong; Weimin Fang; Ting Yang; Dongjie Tao; Chunxiao Guo; Peng Wei; Bo Xie; Jinqun Guan; Zixiao Chen; Fang Shi; Jinjie Gu; and Junwei Liu

arXiv:2601.03023·cs.CL·January 8, 2026

Open Access

TL;DR

MedDialogRubrics introduces a comprehensive benchmark with synthetic patient cases and detailed evaluation rubrics to assess and improve multi-turn medical dialogue capabilities of large language models.

Contribution

The paper presents MedDialogRubrics, a novel benchmark and evaluation framework for medical LLMs, including synthetic case generation, expert-refined rubrics, and a multi-agent system to ensure clinical plausibility.

Findings

01. Current models struggle with multi-turn diagnostic tasks.

02. Improving medical dialogue requires advances in dialogue management architectures.

03. The benchmark reveals significant challenges in existing medical LLMs.

Abstract

Medical conversational AI plays a pivotal role in the development of safer and more effective medical dialogue systems. However, existing benchmarks and evaluation frameworks for assessing the information-gathering and diagnostic reasoning abilities of medical large language models (LLMs) have not been rigorously evaluated. To address these gaps, we present MedDialogRubrics, a novel benchmark comprising 5,200 synthetically constructed patient cases and over 60,000 fine-grained evaluation rubrics generated by LLMs and subsequently refined by clinical experts, specifically designed to assess the multi-turn diagnostic capabilities of LLMs. Our framework employs a multi-agent system to synthesize realistic patient records and chief complaints from underlying disease knowledge without accessing real-world electronic health records, thereby mitigating privacy and data-governance concerns.…

Taxonomy

Topics: Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education · Topic Modeling