A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains

Shirui Wang; Zhihui Tang; Huaxia Yang; Qiuhong Gong; Tiantian Gu; Hongyang Ma; Yongxin Wang; Wubin Sun; Zeliang Lian; Kehang Mao; Yinan Jiang; Zhicheng Huang; Lingyun Ma; Wenjie Shen; Yajie Ji; Yunhui Tan; Chunbo Wang; Yunlu Gao; Qianling Ye; Rui Lin; Mingyu Chen; Lijuan Niu; Zhihao Wang; Peng Yu; Mengran Lang; Yue Liu; Huimin Zhang; Haitao Shen; Long Chen; Qiguang Zhao; Si-Xuan Liu; Lina Zhou; Hua Gao; Dongqiang Ye; Lingmin Meng; Youtao Yu; Naixin Liang; Jianxiong Wu

arXiv:2507.23486 · cs.CL · August 14, 2025

TL;DR

This paper introduces a comprehensive benchmark for evaluating medical language models' safety and effectiveness in clinical settings, highlighting performance gaps and domain-specific advantages to guide safer deployment.

Contribution

The study presents the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), a new multidimensional evaluation framework built on clinical expert consensus, and applies it to assess six LLMs.

Findings

01

Moderate overall performance across the six tested LLMs (average total score 57.2%; safety 54.7%, effectiveness 62.3%).

02

Significant 13.3% performance drop in high-risk scenarios (p < 0.0001).

03

Domain-specific models outperform general-purpose models.

Abstract

Large language models (LLMs) hold promise in clinical decision support but face major challenges in safety evaluation and effectiveness validation. We developed the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), a multidimensional framework built on clinical expert consensus, encompassing 30 criteria that cover key areas such as critical illness recognition, guideline adherence, and medication safety, with weighted consequence measures. Thirty-two specialist physicians developed and reviewed 2,069 open-ended Q&A items aligned with these criteria, spanning 26 clinical departments to simulate real-world scenarios. Benchmark testing of six LLMs revealed moderate overall performance (average total score 57.2%, safety 54.7%, effectiveness 62.3%), with a significant 13.3% performance drop in high-risk scenarios (p < 0.0001). Domain-specific medical LLMs showed consistent…
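The abstract mentions criteria with "weighted consequence measures" and separate safety and effectiveness scores, but it does not give the scoring formula. As a rough illustration only, the sketch below shows one plausible way a dual-track weighted score could be aggregated. The `CriterionResult` type, the weights, the met/unmet grading, and the assignment of each criterion to a track are all assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriterionResult:
    """One graded criterion for a single model response (hypothetical schema)."""
    name: str
    track: str      # "safety" or "effectiveness" (track assignment is assumed)
    weight: float   # consequence weight; values here are invented for illustration
    met: bool       # whether the response satisfied the criterion


def weighted_score(results):
    """Percentage of the available weight earned by criteria that were met."""
    total = sum(r.weight for r in results)
    if total == 0:
        raise ValueError("no weighted criteria to score")
    earned = sum(r.weight for r in results if r.met)
    return 100.0 * earned / total


def track_scores(results):
    """Compute the weighted score separately for each track."""
    by_track = {}
    for r in results:
        by_track.setdefault(r.track, []).append(r)
    return {track: weighted_score(rs) for track, rs in by_track.items()}


# Illustrative data only: criterion names come from the abstract,
# but the weights, tracks, and outcomes are made up.
example = [
    CriterionResult("critical illness recognition", "safety", 5.0, True),
    CriterionResult("medication safety", "safety", 3.0, False),
    CriterionResult("guideline adherence", "effectiveness", 2.0, True),
]

if __name__ == "__main__":
    print("total:", weighted_score(example))
    print("per track:", track_scores(example))
```

Under this scheme, missing a heavily weighted criterion lowers the score more than missing a lightly weighted one, which is the general idea behind consequence weighting. The benchmark's actual rubric may differ.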
