Performance of the Large Language Models on the Chinese National Nurse Licensure Examination: Cross-Sectional Evaluation Study

Longhui Xu; Xiao Cong; Renxiu Wang; Na Li; Xinru Liu; Ronghui Wang; Cuiping Xu

PMC · DOI: 10.2196/78279 · November 3, 2025

Open Access

TL;DR

This study evaluates how well large language models perform on the Chinese National Nurse Licensure Examination, finding that even the most accurate models showed poor repeatability and confidence calibration.

Contribution

The study provides a rigorous evaluation of LLMs on a culturally specific, high-stakes nursing exam, highlighting critical reliability limitations.

Findings

01. DeepSeek V3 and Gemini 2.0 Pro achieved over 83% accuracy on CNNLE questions.

02. All models showed poor repeatability and confidence calibration despite high accuracy.

03. A stability-flexibility trade-off paradox was observed in model performance.

Abstract

Large language models (LLMs) are increasingly explored in nursing education, but their capabilities in specialized, high-stakes, culturally specific examinations, such as the Chinese National Nurse Licensure Examination (CNNLE), remain underevaluated. Rigorous evaluation is therefore crucial before their adoption in nursing training and practice. This study aimed to evaluate the performance, accuracy, repeatability, confidence, and robustness of 4 LLMs on the CNNLE. Four LLMs (Sider Fusion [Vidline Inc], GPT-4o [OpenAI], Gemini 2.0 Pro [Google DeepMind], and DeepSeek V3) were tested on 237 multiple-choice questions from the 2024 CNNLE. Accuracy and repeatability were assessed using 2 prompting strategies. Confidence was evaluated via self-ratings (1-10 scale), and robustness was evaluated via repeated adversarial prompting. DeepSeek V3 and Gemini 2.0 Pro demonstrated significantly higher overall accuracy…
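The abstract names accuracy, repeatability, and self-rated confidence as outcome measures but, as excerpted here, does not define how they were computed. The following is a minimal sketch of one plausible way to compute such measures, not the authors' method. It assumes repeatability means the share of questions that receive the same answer in every repeated run, and it compares mean confidence on correct versus incorrect answers as a rough calibration check. All function names and the toy data are hypothetical.

```python
def accuracy(answers, key):
    """Fraction of questions answered correctly in a single run."""
    return sum(a == k for a, k in zip(answers, key)) / len(key)


def repeatability(runs):
    """Fraction of questions given the identical answer in every run.

    Assumption: this is one possible definition; the study may use another.
    """
    per_question = list(zip(*runs))  # group the answers by question
    return sum(len(set(q)) == 1 for q in per_question) / len(per_question)


def confidence_gap(confidences, correct):
    """Mean self-rated confidence (1-10) on correct minus incorrect answers.

    A gap near zero or below suggests confidence does not track correctness.
    Returns None when either group is empty.
    """
    right = [c for c, ok in zip(confidences, correct) if ok]
    wrong = [c for c, ok in zip(confidences, correct) if not ok]
    if not right or not wrong:
        return None
    return sum(right) / len(right) - sum(wrong) / len(wrong)


# Hypothetical toy data: 4 questions, 2 repeated runs of one model.
key = ["A", "B", "C", "D"]
runs = [["A", "B", "C", "A"], ["A", "B", "D", "A"]]
confidences = [9, 8, 9, 9]  # self-ratings for the first run
correct = [a == k for a, k in zip(runs[0], key)]

print(accuracy(runs[0], key))
print(repeatability(runs))
print(confidence_gap(confidences, correct))
```

In this toy case the model rates its one wrong answer as highly as its correct ones, which is the kind of pattern a calibration check is meant to surface.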

Figures: 5

Taxonomy

Topics: Medical Education and Admissions · Simulation-Based Education in Healthcare · Innovations in Medical Education