MedBench v4: A Robust and Scalable Benchmark for Evaluating Chinese Medical Language Models, Multimodal Models, and Intelligent Agents

Jinru Ding; Lu Lu; Chao Ding; Mouxiao Bian; Jiayuan Chen; Wenrao Pang; Ruiyao Chen; Xinwei Peng; Renjie Lu; Sijie Ren; Guanxu Zhu; Xiaoqin Wu; Zhiqiang Liu; Rongzhao Zhang; Luyi Jiang; Bing Han; Yunqiu Wang; Jie Xu

arXiv:2511.14439 · cs.CL · November 20, 2025

TL;DR

MedBench v4 is a comprehensive, cloud-based benchmark for evaluating Chinese medical language models, multimodal models, and agents, highlighting current capabilities and safety gaps in clinical AI systems.

Contribution

Introduces MedBench v4, a large-scale, expert-reviewed benchmarking platform tailored for Chinese medical AI models, including evaluation of safety, ethics, and multimodal reasoning.

Findings

01. Base LLMs score 54.1/100 on average; best: Claude Sonnet 4.5 at 62.5/100.

02. Multimodal models perform worse overall; best: GPT-5 at 54.9/100.

03. Agents significantly improve performance, reaching up to 85.3/100 overall.

Abstract

Recent advances in medical large language models (LLMs), multimodal models, and agents demand evaluation frameworks that reflect real clinical workflows and safety constraints. We present MedBench v4, a nationwide, cloud-based benchmarking infrastructure comprising over 700,000 expert-curated tasks spanning 24 primary and 91 secondary specialties, with dedicated tracks for LLMs, multimodal models, and agents. Items undergo multi-stage refinement and multi-round review by clinicians from more than 500 institutions, and open-ended responses are scored by an LLM-as-a-judge calibrated to human ratings. We evaluate 15 frontier models. Base LLMs reach a mean overall score of 54.1/100 (best: Claude Sonnet 4.5, 62.5/100), but safety and ethics remain low (18.4/100). Multimodal models perform worse overall (mean 47.5/100; best: GPT-5, 54.9/100), with solid perception yet weaker cross-modal…
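The abstract states that open-ended responses are scored by an LLM-as-a-judge calibrated to human ratings, but it does not describe the calibration procedure. As a minimal sketch of one common approach, assuming a simple linear mapping fitted on items that have both judge and clinician scores, the idea might look like the following. The function names and the toy scores are hypothetical illustrations, not taken from MedBench v4.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def fit_linear_calibration(judge, human):
    """Least-squares fit of human ~ slope * judge + intercept."""
    mj, mh = mean(judge), mean(human)
    slope = sum((j - mj) * (h - mh) for j, h in zip(judge, human)) / sum(
        (j - mj) ** 2 for j in judge
    )
    intercept = mh - slope * mj
    return slope, intercept


def calibrate(score, slope, intercept):
    """Map a raw judge score onto the human scale, clipped to 0-100."""
    return min(100.0, max(0.0, slope * score + intercept))


# Toy, made-up scores for four responses (illustration only, not benchmark data).
judge_scores = [40, 55, 70, 85]
human_scores = [35, 50, 65, 80]

slope, intercept = fit_linear_calibration(judge_scores, human_scores)
agreement = pearson(judge_scores, human_scores)
adjusted = calibrate(60, slope, intercept)
```

In this sketch, `agreement` indicates how closely the judge tracks human raters, and `calibrate` corrects a systematic offset in raw judge scores. A real pipeline would likely hold out part of the human-rated items to check that the fitted mapping generalizes.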

Code & Models

Models

MedAIBase/AntAngelMed (Hugging Face model · 130 downloads · 81 likes)

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Topic Modeling