# Performance of the Large Language Models on the Chinese National Nurse Licensure Examination: Cross-Sectional Evaluation Study

**Authors:** Longhui Xu, Xiao Cong, Renxiu Wang, Na Li, Xinru Liu, Ronghui Wang, Cuiping Xu

PMC · DOI: 10.2196/78279 · 2025-11-03

## TL;DR

This study evaluates 4 large language models on the Chinese National Nurse Licensure Examination and finds that, although two of them exceed 83% accuracy, all show inconsistent repeatability and most show poorly calibrated confidence.

## Contribution

The study evaluates LLMs on a culturally specific, high-stakes nursing exam across accuracy, repeatability, confidence, and robustness, and highlights critical reliability limitations.

## Key findings

- DeepSeek V3 and Gemini 2.0 Pro achieved over 83% accuracy on CNNLE questions.
- All models showed suboptimal repeatability (at most 206/237 consistent answers; <87%), and most showed poorly calibrated confidence, including the more accurate ones.
- A stability-flexibility trade-off paradox was observed in model performance.

## Abstract

Large language models (LLMs) are increasingly explored in nursing education, but their capabilities in specialized, high-stakes, culturally specific examinations, such as the Chinese National Nurse Licensure Examination (CNNLE), remain underevaluated. Rigorous evaluation is therefore crucial before they are adopted in nursing training and practice.

This study aimed to evaluate the performance of 4 LLMs on the CNNLE in terms of accuracy, repeatability, confidence, and robustness.

Four LLMs (Sider Fusion [Vidline Inc], GPT-4o [OpenAI], Gemini 2.0 Pro [Google DeepMind], and DeepSeek V3) were tested on 237 multiple-choice questions from the 2024 CNNLE. Accuracy and repeatability were assessed using 2 prompting strategies. Confidence was evaluated via self-ratings (1‐10 scale) and robustness via repeated adversarial prompting.
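As a rough illustration of the two headline metrics, the sketch below scores a model's answers against an answer key and compares two runs of the same questions. The abstract does not give the paper's exact formulas, so the definitions here are assumptions: accuracy is the share of answers matching the key, and repeatability is the share of questions answered identically in both runs. The function names and the toy answer data are hypothetical.

```python
from typing import Sequence


def accuracy(answers: Sequence[str], key: Sequence[str]) -> float:
    """Share of answers that match the answer key (assumed definition)."""
    if len(answers) != len(key):
        raise ValueError("answers and key must have the same length")
    return sum(a == k for a, k in zip(answers, key)) / len(key)


def repeatability(run_a: Sequence[str], run_b: Sequence[str]) -> float:
    """Share of questions answered identically in two runs (assumed definition)."""
    if len(run_a) != len(run_b):
        raise ValueError("both runs must cover the same questions")
    return sum(a == b for a, b in zip(run_a, run_b)) / len(run_a)


# Toy data for illustration only; these are not answers from the study.
key = ["A", "C", "B", "D", "E"]
run_1 = ["A", "C", "B", "A", "E"]
run_2 = ["A", "C", "D", "A", "E"]

acc_1 = accuracy(run_1, key)
acc_2 = accuracy(run_2, key)
rep = repeatability(run_1, run_2)
print(f"accuracy run 1: {acc_1:.2f}, run 2: {acc_2:.2f}, repeatability: {rep:.2f}")
```

Note that under these definitions a model can repeat the same wrong answer (question 4 in the toy data), so repeatability and accuracy measure different things.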

DeepSeek V3 and Gemini 2.0 Pro demonstrated significantly higher overall accuracy (199/237 to 209/237; >83%) than GPT-4o and Sider Fusion (151/237 to 166/237; <71%). However, all LLMs showed suboptimal repeatability (highest at 206/237; <87% consistency). Critically, confidence was poorly calibrated: most models reported high confidence that often did not match their actual accuracy (Sider Fusion: P=.01; GPT-4o: P=.03; and Gemini 2.0 Pro: P=.049). A stability-flexibility trade-off paradox was also observed.
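The percentage bounds quoted with these counts can be checked with simple arithmetic. The short sketch below converts the reported counts (out of 237 questions) to percentages; the dictionary labels are descriptive names chosen here, not terms from the paper.

```python
TOTAL_QUESTIONS = 237  # number of 2024 CNNLE questions used in the study

# Counts as reported in the abstract; the labels are illustrative.
reported_counts = {
    "DeepSeek V3 / Gemini 2.0 Pro, lowest accuracy": 199,
    "DeepSeek V3 / Gemini 2.0 Pro, highest accuracy": 209,
    "GPT-4o / Sider Fusion, lowest accuracy": 151,
    "GPT-4o / Sider Fusion, highest accuracy": 166,
    "best repeatability across all models": 206,
}


def to_percent(count: int, total: int = TOTAL_QUESTIONS) -> float:
    """Convert a count of questions into a percentage of the total."""
    return 100 * count / total


percentages = {label: to_percent(n) for label, n in reported_counts.items()}
for label, value in percentages.items():
    print(f"{label}: {value:.1f}%")
# prints 84.0%, 88.2%, 63.7%, 70.0%, and 86.9%, in that order
```

These values agree with the bounds stated above: the higher-accuracy pair stays above 83%, the lower-accuracy pair stays below 71%, and the best repeatability is just under 87%.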

While some LLMs show promising accuracy on the CNNLE, fundamental reliability limitations (poor confidence calibration and inconsistent repeatability) hinder safe application in nursing education and practice. Future LLM development must prioritize trustworthiness and calibrated reliability over surface accuracy.

## Full-text entities

- **Diseases:** hallucination (MESH:D006212), CNNLE (MESH:C562377), LLM (MESH:D007806)
- **Chemicals:** GPT-4o (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12582878/full.md

---
Source: https://tomesphere.com/paper/PMC12582878