# Comparative performance of ChatGPT-5 and DeepSeek on the Chinese ultrasound medicine senior professional title examination

**Authors:** Dao-Rong Hong, Chun-Yan Huang, Jiu Gao

PMC · DOI: 10.3389/fdgth.2026.1783347 · Frontiers in Digital Health · 2026-03-09

## TL;DR

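This study compares ChatGPT-5 and DeepSeek on 100 questions from the Chinese Ultrasound Medicine Senior Professional Title Examination. ChatGPT-5 was more accurate overall (74.0% vs. 60.0%), a gap driven by image-based questions; the two models scored similarly on text-based questions.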

## Contribution

A head-to-head comparison of ChatGPT-5 and DeepSeek on a Chinese specialty medical certification exam, showing that the accuracy gap between the two models arises on image-based interpretation items rather than on text-based knowledge items.

## Key findings

- ChatGPT-5 had higher overall accuracy than DeepSeek on the exam (74.0% vs. 60.0%; p = 0.035).
- ChatGPT-5 outperformed DeepSeek on image-based items (61.7% vs. 40.0%; p = 0.018).
- Both models performed similarly on text-based items (92.5% vs. 90.0%).

## Abstract

Large language models (LLMs) have shown growing potential for medical education and assessment, but evidence on their performance in specialty certification exams in China, particularly in ultrasound medicine, remains limited.

This study aimed to compare the performance of ChatGPT-5 and DeepSeek on the Chinese Ultrasound Medicine Senior Professional Title Examination, overall and by item type.

Between August and September 2025, we randomly selected 100 multiple-choice questions from the official Chinese Ultrasound Medicine Senior Professional Title Examination bank (60 image-based interpretation items and 40 text-based items). We evaluated ChatGPT-5 and DeepSeek using identical prompts through their public web interfaces. The primary outcome was overall accuracy; secondary outcomes were accuracy by item type and subspecialty. Between-model differences were assessed using two-proportion z-tests (α = 0.05) in Python 3.12.

Overall accuracy was higher for ChatGPT-5 than for DeepSeek [74.0% (74/100) vs. 60.0% (60/100); p = 0.035]. Accuracy on image-based items was also higher for ChatGPT-5 (61.7% vs. 40.0%; p = 0.018). Performance on text-based items was similar for both models (92.5% vs. 90.0%). Subspecialty patterns varied across domains; however, no between-model differences reached statistical significance.
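The reported p-values can be checked with a short sketch of a pooled, two-sided two-proportion z-test. This is an illustration under stated assumptions, not the authors' code: the function name `two_proportion_z_test` is hypothetical, the abstract does not say whether a pooled variance or a continuity correction was used, and the image-based and text-based counts (37/60 vs. 24/60 and 37/40 vs. 36/40) are back-calculated from the reported percentages rather than quoted from the paper.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled, two-sided two-proportion z-test; returns (z, p_value).

    Hypothetical helper for illustration only. Assumes a pooled
    variance estimate and no continuity correction.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Overall counts are stated in the abstract (74/100 vs. 60/100).
z_overall, p_overall = two_proportion_z_test(74, 100, 60, 100)

# Assumption: counts derived from 61.7% and 40.0% of 60 image-based items.
z_image, p_image = two_proportion_z_test(37, 60, 24, 60)

# Assumption: counts derived from 92.5% and 90.0% of 40 text-based items.
z_text, p_text = two_proportion_z_test(37, 40, 36, 40)

for label, z, p in [
    ("overall", z_overall, p_overall),
    ("image-based", z_image, p_image),
    ("text-based", z_text, p_text),
]:
    print(f"{label}: z = {z:.3f}, p = {p:.3f}")
```

Under these assumptions, the overall and image-based comparisons round to the reported p = 0.035 and p = 0.018. The abstract gives no p-value for the text-based comparison; the value computed here is far above α = 0.05, which is consistent with the description of similar performance.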

ChatGPT-5 outperformed DeepSeek on image-based items (61.7% vs. 40.0%), while both models performed similarly on text-based knowledge items (92.5% vs. 90.0%). Overall, both LLMs showed strong performance on Chinese ultrasound senior-title examination questions, with complementary strengths across content areas. They may be useful as supplementary educational tools, but further advances in multimodal reasoning are needed to support more reliable image interpretation.

## Full-text entities

- **Diseases:** malignancy (MESH:D009369)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12968994/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12968994/full.md

## References

24 references — full list in the complete paper: https://tomesphere.com/paper/PMC12968994/full.md

---
Source: https://tomesphere.com/paper/PMC12968994