Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset

Junling Liu; Peilin Zhou; Yining Hua; Dading Chong; Zhongyu Tian; Andrew Liu; Helin Wang; Chenyu You; Zhenhua Guo; Lei Zhu; Michael Lingzhi Li

arXiv:2306.03030 · cs.CL · October 24, 2023 · 32 cites

TL;DR

This paper introduces CMExam, a large-scale Chinese medical exam dataset with annotations, to evaluate and analyze large language models' performance in medical question answering, revealing significant gaps compared to human experts.

Contribution

The paper presents the first comprehensive Chinese medical exam dataset with annotations and benchmarks LLMs, providing insights into their capabilities and limitations in medical QA.

Findings

1. GPT-4 achieved 61.6% accuracy on CMExam.
2. LLMs lag behind the human accuracy of 71.6%.
3. Fine-tuning improves LLM reasoning, but the models still fall short.

Abstract

Recent advancements in large language models (LLMs) have transformed the field of question answering (QA). However, evaluating LLMs in the medical field is challenging due to the lack of standardized and comprehensive datasets. To address this gap, we introduce CMExam, sourced from the Chinese National Medical Licensing Examination. CMExam consists of 60K+ multiple-choice questions for standardized and objective evaluations, as well as solution explanations for model reasoning evaluation in an open-ended manner. For in-depth analyses of LLMs, we invited medical professionals to label five additional question-wise annotations, including disease groups, clinical departments, medical disciplines, areas of competency, and question difficulty levels. Alongside the dataset, we further conducted thorough experiments with representative LLMs and QA algorithms on CMExam. The results show that…
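The multiple-choice evaluation described in the abstract can be illustrated with a minimal accuracy sketch. This is not the authors' evaluation code: the function names `normalize` and `multiple_choice_accuracy`, the letter-based answer format, and the exact-match rule for multi-letter answers are all assumptions made for illustration.

```python
from typing import Sequence


def normalize(option: str) -> str:
    """Canonicalize an answer string (assumed format: option letters such as 'A' or 'b, a').

    Uppercases, keeps only ASCII letters, and sorts them so that
    'ba' and 'A, B' compare equal.
    """
    letters = [ch for ch in option.upper() if ch.isascii() and ch.isalpha()]
    return "".join(sorted(letters))


def multiple_choice_accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of questions whose predicted option set exactly matches the gold option set."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty question set")
    correct = sum(normalize(p) == normalize(a) for p, a in zip(predictions, answers))
    return correct / len(answers)


# Hypothetical toy data: three of four predictions match the gold answers.
toy_predictions = ["A", "b", "C", "D"]
toy_answers = ["A", "B", "D", "D"]
toy_accuracy = multiple_choice_accuracy(toy_predictions, toy_answers)
```

Exact match on the normalized option set is one simple scoring choice; a benchmark could instead award partial credit on multi-answer questions, which would change the reported numbers.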

Code & Models

Repositories

williamliujl/cmexam (PyTorch, official)

Videos

Benchmarking Large Language Models on CMExam - A comprehensive Chinese Medical Exam Dataset · SlidesLive

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Adam · Absolute Position Encodings · Residual Connection · Dropout · Position-Wise Feed-Forward Layer · Layer Normalization