Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset
Junling Liu, Peilin Zhou, Yining Hua, Dading Chong, Zhongyu Tian, Andrew Liu, Helin Wang, Chenyu You, Zhenhua Guo, Lei Zhu, Michael Lingzhi Li

TL;DR
This paper introduces CMExam, a large-scale Chinese medical exam dataset with annotations, to evaluate and analyze large language models' performance in medical question answering, revealing significant gaps compared to human experts.
Contribution
The paper presents the first comprehensive Chinese medical exam dataset with annotations and benchmarks LLMs, providing insights into their capabilities and limitations in medical QA.
Findings
GPT-4 achieved 61.6% accuracy on CMExam.
LLMs lag behind human accuracy of 71.6%; GPT-4 trails it by 10.0 percentage points.
Fine-tuning improves LLM reasoning, but the models still fall short.
Abstract
Recent advancements in large language models (LLMs) have transformed the field of question answering (QA). However, evaluating LLMs in the medical field is challenging due to the lack of standardized and comprehensive datasets. To address this gap, we introduce CMExam, sourced from the Chinese National Medical Licensing Examination. CMExam consists of 60K+ multiple-choice questions for standardized and objective evaluations, as well as solution explanations for model reasoning evaluation in an open-ended manner. For in-depth analyses of LLMs, we invited medical professionals to label five additional question-wise annotations, including disease groups, clinical departments, medical disciplines, areas of competency, and question difficulty levels. Alongside the dataset, we further conducted thorough experiments with representative LLMs and QA algorithms on CMExam. The results show that…
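The multiple-choice evaluation described in the abstract can be illustrated with a minimal sketch that scores predictions and breaks accuracy down by one of the question-wise annotations. The record layout, the field names (`answer`, `prediction`, `department`), and the toy records are assumptions made for illustration; they are not taken from the CMExam release or from the authors' evaluation code.

```python
from collections import defaultdict


def accuracy_by_group(records, group_key):
    """Return (overall accuracy, {group: accuracy}) for multiple-choice predictions.

    Each record is a dict with hypothetical keys "answer" and "prediction",
    plus an annotation field named by `group_key`.
    """
    if not records:
        raise ValueError("records must not be empty")

    correct_total = 0
    per_group = defaultdict(lambda: [0, 0])  # group -> [correct, total]
    for record in records:
        hit = int(record["prediction"] == record["answer"])
        correct_total += hit
        per_group[record[group_key]][0] += hit
        per_group[record[group_key]][1] += 1

    overall = correct_total / len(records)
    by_group = {group: c / n for group, (c, n) in per_group.items()}
    return overall, by_group


# Toy records for illustration only; not real CMExam questions or results.
toy_records = [
    {"answer": "A", "prediction": "A", "department": "Internal Medicine"},
    {"answer": "B", "prediction": "C", "department": "Internal Medicine"},
    {"answer": "D", "prediction": "D", "department": "Surgery"},
    {"answer": "E", "prediction": "E", "department": "Surgery"},
]

overall, by_department = accuracy_by_group(toy_records, "department")
print(f"overall accuracy: {overall:.2f}")
for department, acc in sorted(by_department.items()):
    print(f"{department}: {acc:.2f}")
```

Grouping by a different annotation, such as a difficulty level, would only require passing another key, provided the records carry that field.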
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Adam · Absolute Position Encodings · Residual Connection · Dropout · Position-Wise Feed-Forward Layer · Layer Normalization
