Performance of the Large Language Models on the Chinese National Nurse Licensure Examination: Cross-Sectional Evaluation Study
Longhui Xu, Xiao Cong, Renxiu Wang, Na Li, Xinru Liu, Ronghui Wang, Cuiping Xu

TL;DR
This study evaluates four large language models on 237 multiple-choice questions from the 2024 Chinese National Nurse Licensure Examination and finds that even the most accurate models show poor repeatability and confidence calibration.
Contribution
The study provides a rigorous evaluation of LLMs on a culturally specific, high-stakes nursing exam, highlighting critical reliability limitations.
Findings
DeepSeek V3 and Gemini 2.0 Pro achieved over 83% accuracy on CNNLE questions.
All models showed poor repeatability and confidence calibration despite high accuracy.
A stability-flexibility trade-off paradox was observed in model performance.
Abstract
Large language models (LLMs) are increasingly explored in nursing education, but their capabilities in specialized, high-stakes, culturally specific examinations, such as the Chinese National Nurse Licensure Examination (CNNLE), remain underevaluated, making rigorous evaluation crucial before their adoption in nursing training and practice. This study aimed to evaluate the performance, accuracy, repeatability, confidence, and robustness of 4 LLMs on the CNNLE. Four LLMs (Sider Fusion [Vidline Inc], GPT-4o [OpenAI], Gemini 2.0 Pro [Google DeepMind], and DeepSeek V3) were tested on 237 multiple-choice questions from the 2024 CNNLE. Accuracy and repeatability were assessed using 2 prompting strategies. Confidence was evaluated via self-ratings (1‐10 scale) and robustness via repeated adversarial prompting. DeepSeek V3 and Gemini 2.0 Pro demonstrated significantly higher overall accuracy…
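The abstract names three measurable quantities: accuracy, repeatability across repeated runs, and self-rated confidence on a 1-10 scale. The following is a minimal sketch of how such quantities could be computed from recorded model answers. It is not the study's actual analysis: the function names, the "all runs agree" definition of repeatability, the simple confidence-minus-accuracy gap, and the toy data are all assumptions made for illustration.

```python
def accuracy(answers, key):
    """Fraction of questions where the model's answer matches the answer key."""
    return sum(a == k for a, k in zip(answers, key)) / len(key)


def repeatability(runs):
    """Fraction of questions on which every repeated run gave the same answer.

    `runs` is a list of answer lists, one per repeated run (assumed layout).
    """
    per_question = list(zip(*runs))
    return sum(len(set(q)) == 1 for q in per_question) / len(per_question)


def calibration_gap(confidences, answers, key, scale_max=10):
    """Mean self-rated confidence (rescaled to 0-1) minus observed accuracy.

    A positive value suggests overconfidence under this simple definition.
    """
    mean_conf = sum(confidences) / (len(confidences) * scale_max)
    return mean_conf - accuracy(answers, key)


# Toy data (hypothetical, not taken from the study).
key = ["A", "C", "B", "D"]
runs = [
    ["A", "C", "B", "A"],
    ["A", "C", "D", "A"],
    ["A", "C", "B", "A"],
]
confidences = [9, 8, 9, 10]  # self-ratings for the first run, 1-10 scale

print(accuracy(runs[0], key))
print(repeatability(runs))
print(calibration_gap(confidences, runs[0], key))
```

In this toy case the first run answers three of four questions correctly, the runs disagree on one question, and the mean self-rating exceeds the observed accuracy, which is the kind of pattern the findings describe as high accuracy with poor repeatability and calibration.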
Taxonomy
Topics: Medical Education and Admissions · Simulation-Based Education in Healthcare · Innovations in Medical Education
