MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models

Mianxin Liu; Jinru Ding; Jie Xu; Weiguo Hu; Xiaoyang Li; Lifeng Zhu; Zhian Bai; Xiaoming Shi; Benyou Wang; Haitao Song; Pengfei Liu; Xiaofan Zhang; Shanshan Wang; Kang Li; Haofen Wang; Tong Ruan; Xuanjing Huang; Xin Sun; Shaoting Zhang

arXiv:2407.10990 · cs.CL · July 17, 2024 · 2 cites

Open Access

TL;DR

MedBench is a comprehensive benchmarking system designed to evaluate Chinese medical large language models across multiple specialties, ensuring reliable, standardized, and unbiased assessment for real-world medical applications.

Contribution

This work introduces MedBench, the largest evaluation dataset and a fully automated, dynamic evaluation infrastructure specifically for Chinese medical LLMs, addressing current evaluation gaps.

Findings

01 Evaluation results align with medical professionals' perspectives.

02 MedBench provides unbiased and reproducible assessments.

03 The system covers 43 clinical specialties with 300,901 questions.

Abstract

Ensuring that medical large language models (LLMs) are effective and beneficial to humans before real-world deployment is crucial. However, a widely accepted and accessible evaluation process for medical LLMs, especially in the Chinese context, has yet to be established. In this work, we introduce "MedBench", a comprehensive, standardized, and reliable benchmarking system for Chinese medical LLMs. First, MedBench assembles the largest evaluation dataset to date (300,901 questions), covering 43 clinical specialties, and performs multi-faceted evaluation of medical LLMs. Second, MedBench provides a standardized, fully automatic cloud-based evaluation infrastructure that physically separates questions from ground truth. Third, MedBench implements dynamic evaluation mechanisms to prevent shortcut learning and answer memorization. Applying MedBench to popular general and medical…
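
The abstract does not spell out how the dynamic evaluation mechanisms work. As a rough illustration only, one common way to discourage answer memorization in multiple-choice benchmarks is to reshuffle the answer options on each evaluation run and remap the answer key accordingly. The sketch below assumes that approach; the class `MultipleChoiceItem` and the function `shuffle_options` are hypothetical names and are not taken from MedBench.

```python
import random
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MultipleChoiceItem:
    # Hypothetical container for one benchmark question (not MedBench's schema).
    question: str
    options: Tuple[str, ...]
    answer_index: int  # position of the correct option within `options`


def shuffle_options(item: MultipleChoiceItem, rng: random.Random) -> MultipleChoiceItem:
    """Return a copy of `item` with its options in a random order.

    The answer index is remapped so it still points at the same option text.
    This is an assumed mechanism, shown only to illustrate the idea.
    """
    order = list(range(len(item.options)))
    rng.shuffle(order)  # order[k] = original position of the option now at k
    new_options = tuple(item.options[i] for i in order)
    new_answer_index = order.index(item.answer_index)
    return MultipleChoiceItem(item.question, new_options, new_answer_index)


if __name__ == "__main__":
    item = MultipleChoiceItem(
        question="Which vitamin deficiency causes scurvy?",
        options=("Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"),
        answer_index=2,
    )
    shuffled = shuffle_options(item, random.Random())
    print(shuffled.options, "-> correct:", shuffled.options[shuffled.answer_index])
```

Under this sketch, a model that has memorized "the answer is C" gains nothing, because the position of the correct option changes between runs while the correct option text stays the same. Keeping `answer_index` on the server side and sending only `question` and `options` to the model would be one way to mirror the separation of questions from ground truth that the abstract describes.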

Taxonomy

Topics: Machine Learning in Healthcare