# Supporting postgraduate exam preparation with large language models: implications for traditional Chinese medicine education

**Authors:** Baifeng Wang, Meiwei Zhang, Zhe Wang, Keyu Yao, Meng Hao, Junhui Wang, Suyuan Peng, Yan Zhu

PMC · DOI: 10.3389/fmed.2025.1667104 · Frontiers in Medicine · 2026-01-09

## TL;DR

This study evaluates how four large language models (Ernie Bot, ChatGLM, SparkDesk, and GPT-4) perform on the 2023 Chinese Postgraduate Examination for Traditional Chinese Medicine (CPE-TCM) and explores their potential to support TCM education.

## Contribution

The paper introduces the first evaluation of LLMs on the Chinese Postgraduate Examination for TCM, revealing their potential for educational support.

## Key findings

- Ernie Bot (50.30% accuracy) and ChatGLM (46.67% accuracy) both exceeded the passing score of the 2023 CPE-TCM exam.
- LLMs demonstrated strong logical reasoning and integration of background knowledge in TCM contexts.
- In SparkDesk, the presence of internal or external information in a response was significantly associated with answer correctness (P < 0.001).

## Abstract

In China, the medical education system features multiple co-existing levels, with higher education often leading to better job prospects. In career advancement, especially for entry into competitive urban hospitals, the postgraduate examination often plays a more decisive role than the licensing examination. The application of Large Language Models (LLMs) in Traditional Chinese Medicine (TCM) has rapidly expanded. TCM theories possess distinct scientific features, requiring LLMs to demonstrate advanced information processing and comprehension abilities in a Chinese context. While LLMs have shown strong performance in many countries' licensing examinations, their performance in selective TCM examinations remains underexplored. This study aimed to evaluate and compare the performance of Ernie Bot, ChatGLM, SparkDesk, and GPT-4 on the 2023 Chinese Postgraduate Examination for TCM (CPE-TCM), and to explore their potential in supporting TCM education and academic development.

We assessed the performance of four LLMs using the 2023 CPE-TCM as a test set. Exam scores were calculated to evaluate subject-specific performance. Additionally, responses were qualitatively analyzed based on logical reasoning and the use of internal and external information.

Ernie Bot and ChatGLM achieved accuracy rates of 50.30% and 46.67%, respectively, both above the passing score. Statistically significant differences in subject-specific performance were observed, with the highest scores in the medical humanistic spirit module. ChatGLM and GPT-4 provided logical explanations for all responses, while Ernie Bot and SparkDesk showed logical reasoning in 98.2% and 43.6% of responses, respectively. ChatGLM and GPT-4 incorporated internal information in all explanations, whereas SparkDesk rarely did. Over 60% of responses from Ernie Bot, ChatGLM, and GPT-4 included external information, and its presence did not differ significantly between correct and incorrect answers. In SparkDesk, the presence of internal or external information was significantly associated with answer correctness (P < 0.001).
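The association reported for SparkDesk is the kind of result a 2x2 contingency test produces. The abstract does not say which test the authors used, so the following is only a minimal sketch that assumes a Pearson chi-square test without continuity correction. The function name `chi_square_2x2` and all counts are hypothetical and are not taken from the paper.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the table

                      correct   incorrect
        info present     a          b
        info absent      c          d

    Returns (chi2, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, chi2 is the square of a standard normal,
    # so the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts for illustration only (NOT data from the study).
chi2, p = chi_square_2x2(a=40, b=10, c=15, d=35)
print(f"chi2 = {chi2:.2f}, p < 0.001: {p < 0.001}")
```

With small expected cell counts, an exact test would usually be preferred over this approximation.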

Ernie Bot and ChatGLM surpassed the passing threshold for postgraduate selection, reflecting solid TCM expertise. LLMs demonstrated strong capabilities in logical reasoning and integration of background knowledge, highlighting their promising role in enhancing TCM education.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12827181/full.md

## References

36 references — full list in the complete paper: https://tomesphere.com/paper/PMC12827181/full.md

---
Source: https://tomesphere.com/paper/PMC12827181