CMQCIC-Bench: A Chinese Benchmark for Evaluating Large Language Models in Medical Quality Control Indicator Calculation

Guangya Yu; Yanhao Li; Zongying Jiang; Yuxiong Jin; Li Dai; Yupian Lin; Ruihui Hou; Weiyan Zhang; Yongqi Fan; Qi Ye; Jingping Liu; Tong Ruan

arXiv:2502.11703·cs.CL·July 10, 2025

TL;DR

This paper introduces CMQCIC-Bench, a Chinese EMR-based benchmark for evaluating large language models on Medical Quality Control Indicator Calculation (MQCIC), and proposes the CF-IR method, which outperforms Chain-of-Thought methods on this task.

Contribution

It presents a new Chinese EMR-based dataset and a novel CF-IR method for medical indicator calculation, advancing LLM applications in healthcare quality assessment.

Findings

01. CF-IR outperforms Chain-of-Thought methods on MQCIC tasks

02. The dataset contains 785 instances and 76 indicators

03. Experiments cover 20 representative LLMs, including general and medical models

Abstract

Medical quality control indicators are essential for assessing whether healthcare institutions are qualified to deliver medical services. Given the impressive performance of large language models (LLMs) such as GPT-4 in the medical field, applying them to Medical Quality Control Indicator Calculation (MQCIC) is a promising approach. In this work, (1) we introduce MQCIC as a real-world task and release CMQCIC-Bench, an open-source dataset built on Chinese electronic medical records (EMRs) that comprises 785 instances and 76 indicators. (2) We propose a semi-automatic method to enhance the rule representation, followed by the Clinical Facts-based Inferential Rule (CF-IR) method, which disentangles clinical fact verification from inferential rule reasoning. (3) We conduct comprehensive experiments on 20 representative LLMs, covering general and medical models. Our findings…
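The separation between clinical fact verification and inferential rule reasoning described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Indicator` class, the `evaluate` function, the toy indicator, and the sample record are all hypothetical, and simple string checks stand in for the LLM calls that CF-IR would presumably make at each stage.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Hypothetical stand-in for an LLM call that verifies one clinical fact in an EMR.
FactChecker = Callable[[str], bool]


@dataclass
class Indicator:
    """A toy quality control indicator (illustrative, not from the paper)."""
    name: str
    facts: Dict[str, FactChecker]            # stage 1: fact name -> verifier
    rule: Callable[[Dict[str, bool]], bool]  # stage 2: rule over verified facts


def evaluate(indicator: Indicator, emr_text: str) -> Tuple[Dict[str, bool], bool]:
    # Stage 1: verify each clinical fact independently against the record.
    verified = {name: check(emr_text) for name, check in indicator.facts.items()}
    # Stage 2: apply the inferential rule to the verified facts only,
    # without looking at the raw record again.
    return verified, indicator.rule(verified)


# Invented indicator: "a treatment is documented for a recorded diagnosis".
indicator = Indicator(
    name="treatment documented for recorded diagnosis",
    facts={
        "diagnosis_recorded": lambda emr: "diagnosis:" in emr.lower(),
        "treatment_documented": lambda emr: "treatment:" in emr.lower(),
    },
    rule=lambda f: f["diagnosis_recorded"] and f["treatment_documented"],
)

emr = "Diagnosis: condition X. Treatment: therapy Y on day 1."
facts, met = evaluate(indicator, emr)
print(facts, met)
```

Keeping the two stages apart means an error can be traced either to a wrongly verified fact or to a wrongly applied rule, which is one plausible motivation for the disentangled design.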

Taxonomy

Topics: Machine Learning in Healthcare · Artificial Intelligence in Healthcare · Clinical practice guidelines implementation

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Layer Normalization · Residual Connection · Dense Connections · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Softmax