# Performance of DeepSeek and ChatGPT on the Chinese Health Professional and Technical Examination: A comparative study

**Authors:** Xu Li, Xu Hu, Huiting Xu, Zhiang Sun, Pin Yu, Hailing Ju

PMC · DOI: 10.1371/journal.pone.0338328 · PLOS One · 2026-01-22

## TL;DR

This study compares the performance of two large language models on a Chinese nursing exam, finding that DeepSeek-R1 outperforms GPT-4o in accuracy, though GPT-4o is more consistent.

## Contribution

The study provides empirical evidence on the performance of DeepSeek-R1 and GPT-4o in a high-stakes nursing examination context.

## Key findings

- DeepSeek-R1 had significantly higher overall accuracy than GPT-4o (88.5% vs. 67.9%, P < 0.001).
- GPT-4o showed higher response consistency than DeepSeek-R1 (96.5% vs. 88.5%) but lower consistent accuracy (66.7% vs. 84.0%).
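- After Holm–Bonferroni correction, differences in consistent accuracy remained significant in basic knowledge, professional knowledge, professional practice ability, and Type A questions, and in the surgical and gynecological nursing disciplines.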

## Abstract

Large language models (LLMs) are increasingly applied in medical education, yet their reliability in specialized, high-stakes assessments such as the Chinese Health Professional and Technical Examination remains unclear. DeepSeek-R1, a recently released reasoning-enhanced LLM, has shown promising performance, but empirical evidence within nursing examination contexts is limited.

This study aimed to compare the performance of DeepSeek-R1 and the GPT-4o API on the Chinese Health Professional and Technical Examination (Intermediate Nursing), focusing on accuracy, response consistency, and consistent accuracy.

Four hundred official practice examination multiple-choice questions were categorized into four competency units and two question types (A/B). Both models were evaluated using overall accuracy, consistency (agreement across repeated responses), and consistent accuracy (proportion of responses that were both consistent and correct). Stratified analyses were performed across units, question types, and disciplines. Chi-square tests were used for statistical comparison, and Holm–Bonferroni correction was applied for multiple comparisons.
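The three metrics described above can be illustrated with a minimal sketch. It is not the authors' scoring code: the function name `score_model`, the toy answer key, the three repetitions per question, and the choice to compute consistency per question are all assumptions made for illustration, since this summary does not state the number of repetitions or the exact denominators.

```python
def score_model(responses, answer_key):
    """Score repeated multiple-choice answers (hypothetical helper).

    responses:  dict mapping question id -> list of repeated answers
    answer_key: dict mapping question id -> correct answer
    """
    total_responses = 0
    correct_responses = 0
    consistent_questions = 0
    consistent_correct_questions = 0

    for qid, answers in responses.items():
        key = answer_key[qid]
        total_responses += len(answers)
        correct_responses += sum(a == key for a in answers)

        # Assumption: a question is "consistent" when every repetition agrees.
        if len(set(answers)) == 1:
            consistent_questions += 1
            if answers[0] == key:
                consistent_correct_questions += 1

    n_questions = len(responses)
    return {
        "accuracy": correct_responses / total_responses,
        "consistency": consistent_questions / n_questions,
        "consistent_accuracy": consistent_correct_questions / n_questions,
    }


# Toy data, invented for illustration only (not taken from the study).
answer_key = {"q1": "A", "q2": "C", "q3": "B", "q4": "D"}
toy_responses = {
    "q1": ["A", "A", "A"],  # consistent and correct
    "q2": ["B", "B", "B"],  # consistent but wrong
    "q3": ["B", "B", "D"],  # inconsistent
    "q4": ["D", "D", "D"],  # consistent and correct
}

metrics = score_model(toy_responses, answer_key)
print(metrics)
```

In this toy case, question `q2` shows how a model can be fully consistent yet wrong, which is why consistency and consistent accuracy are reported separately.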

DeepSeek-R1 demonstrated significantly higher overall accuracy than the GPT-4o API (88.5% vs. 67.9%, P < 0.001). GPT-4o API showed higher response consistency (96.5% vs. 88.5%) but lower consistent accuracy (66.7% vs. 84.0%). After multiple-comparison correction, significant differences in consistent accuracy remained in basic knowledge, professional knowledge, professional practice ability and Type A questions, as well as in surgical and gynecological nursing disciplines, while other domains showed no statistically significant differences.
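The kind of comparison reported here, a chi-square test per stratum followed by Holm–Bonferroni correction, can be sketched as follows. This is an illustrative sketch only: the helper names `chi_square_2x2` and `holm_bonferroni`, the counts, and the p-values are hypothetical, and the use of an uncorrected Pearson statistic (no continuity correction) is an assumption, since the summary does not say which variant the authors used.

```python
import math


def chi_square_2x2(correct_a, total_a, correct_b, total_b):
    """Pearson chi-square test for two proportions (no continuity correction).

    Assumes every expected cell count is non-zero.
    """
    observed = [
        [correct_a, total_a - correct_a],
        [correct_b, total_b - correct_b],
    ]
    row_totals = [total_a, total_b]
    col_totals = [observed[0][0] + observed[1][0], observed[0][1] + observed[1][1]]
    grand_total = total_a + total_b

    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def holm_bonferroni(p_values):
    """Return Holm-adjusted p-values in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical counts for one stratum (not from the study): 90/100 vs. 70/100 correct.
stat, p_value = chi_square_2x2(90, 100, 70, 100)

# Hypothetical raw p-values from three strata, adjusted together.
adjusted = holm_bonferroni([0.01, 0.04, 0.03])
print(stat, p_value, adjusted)
```

A stratum would then be called significant only if its adjusted p-value falls below the chosen threshold, which is how some differences can lose significance after correction.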

DeepSeek-R1 outperformed the GPT-4o API across multiple dimensions of nursing competency assessment, particularly in overall accuracy and consistent accuracy. GPT-4o API exhibited high response stability but a tendency toward systematic errors, underscoring the need for careful interpretation of model outputs. Further research is needed to evaluate LLM performance using open-ended clinical reasoning tasks and real-world assessment data to support safe and effective educational integration.

## Full-text entities

- **Diseases:** diseases (MESH:D004194), LLMs (MESH:D007806), infection (MESH:D007239)
- **Chemicals:** GPT-4o (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12826474/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12826474/full.md

## References

34 references — full list in the complete paper: https://tomesphere.com/paper/PMC12826474/full.md

---
Source: https://tomesphere.com/paper/PMC12826474