ClinConsensus: A Consensus-Based Benchmark for Evaluating Chinese Medical LLMs across Difficulty Levels

Xiang Zheng; Han Li; Wenjie Luo; Weiqi Zhai; Yiyuan Li; Chuanmiao Yan; Tianyi Tang; Yubo Ma; Kexin Yang; Dayiheng Liu; Hu Wei; and Bing Zhao

arXiv:2603.02097 · cs.CL · March 20, 2026

Open Access

TL;DR

ClinConsensus is a Chinese medical benchmark of 2500 open-ended cases spanning specialties and care stages, designed to evaluate LLMs' clinical reasoning and evidence use in complex, real-world scenarios.

Contribution

The paper introduces an expert-validated benchmark with a dual-judge evaluation framework and a new consistency score for assessing Chinese medical LLMs across multiple complexity levels.

Findings

01

Top models show heterogeneity in reasoning and evidence use.

02

Clinical actionability remains a significant challenge.

03

Models perform variably across specialties and care stages.

Abstract

Large language models (LLMs) are increasingly applied to health management, showing promise across disease prevention, clinical decision-making, and long-term care. However, existing medical benchmarks remain largely static and task-isolated, failing to capture the openness, longitudinal structure, and safety-critical complexity of real-world clinical workflows. We introduce ClinConsensus, a Chinese medical benchmark curated, validated, and quality-controlled by clinical experts. ClinConsensus comprises 2500 open-ended cases spanning the full continuum of care--from prevention and intervention to long-term follow-up--covering 36 medical specialties, 12 common clinical task types, and progressively increasing levels of complexity. To enable reliable evaluation of such complex scenarios, we adopt a rubric-based grading protocol and propose the Clinically Applicable Consistency Score…

Taxonomy

Topics: Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education · Advanced Causal Inference Techniques