# A comparative study on DeepSeek and ChatGPT for bone and soft tissue tumor clinical practice

**Authors:** Baicheng Yang, Zhangfu Li, Xinxin Zhang, Xiaoyang Li, Ge Zhao, Yongli Jia, Shengji Yu

PMC · DOI: 10.3389/fonc.2025.1642880 · Frontiers in Oncology · 2026-01-09

## TL;DR

This study compares DeepSeek and ChatGPT on 249 bone and soft tissue tumor clinical questions and one expert-scored sarcoma case. DeepSeek was more accurate overall (74.7% vs 55.4%) and was rated higher on case analysis.

## Contribution

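The study introduces a two-phase framework for comparing AI models in bone and soft tissue tumor clinical practice: accuracy on 249 validated clinical questions across five domains, followed by blinded scoring of each model's analysis of a complex sarcoma case by nine clinicians across seven clinical dimensions.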

## Key findings

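- DeepSeek outperformed ChatGPT in overall accuracy (74.7% vs 55.4%, p < 0.001), with significant gains in Pathology & Genetics (72.5% vs 40.0%) and Treatment (71.3% vs 51.2%).
- DeepSeek correctly answered 60 questions that ChatGPT got wrong; both models answered the same 51 questions incorrectly.
- Experts rated DeepSeek higher in imaging interpretation (7.11 vs 6.00, p = 0.002) and overall case analysis (54.11 vs 51.56, p = 0.022).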

## Abstract

Artificial intelligence (AI) models are increasingly applied in clinical oncology, yet their comparative utility in specialized domains like bone and soft tissue tumors remains understudied. This study evaluates the diagnostic accuracy and clinical reasoning capabilities of DeepSeek and ChatGPT.

A two-phase evaluation framework was implemented. First, 249 validated clinical questions (191 single-choice, 58 multiple-choice) spanning five domains (diagnosis, imaging, pathology, staging, treatment) were administered, with expert-derived answers serving as ground truth. Second, nine blinded clinicians scored model-generated analyses of a complex sarcoma case across seven clinical dimensions. Statistical analysis employed chi-square tests for accuracy comparisons, Cohen’s kappa for inter-rater reliability, and independent t-tests for expert ratings (α = 0.05).
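The accuracy comparison described above can be sketched as a 2×2 chi-square test. This is a minimal illustration, not the authors' analysis code. The correct-answer counts are assumptions back-calculated from the reported overall percentages (74.7% and 55.4% of 249), and the test is run without a continuity correction because the abstract does not say whether one was applied. The function name `chi_square_2x2` is hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    stat = n * (a * d - b * c) ** 2 / denom
    # With 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


TOTAL_QUESTIONS = 249

# Assumption: counts recovered by rounding the reported percentages.
deepseek_correct = round(0.747 * TOTAL_QUESTIONS)
chatgpt_correct = round(0.554 * TOTAL_QUESTIONS)

# Rows are models, columns are (correct, incorrect).
stat, p_value = chi_square_2x2(
    deepseek_correct, TOTAL_QUESTIONS - deepseek_correct,
    chatgpt_correct, TOTAL_QUESTIONS - chatgpt_correct,
)

print(f"chi2 = {stat:.2f}, p = {p_value:.2e}")
```

Under these assumed counts the p-value falls below 0.001, in line with the reported result. The same function could be applied per domain if the per-domain question counts were known; they are not given in this summary.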

DeepSeek outperformed ChatGPT in overall accuracy (74.7% vs 55.4%, p < 0.001), excelling in single-choice questions (86.9% vs 64.9%, p < 0.001) and two key domains: Pathology & Genetics (72.5% vs 40.0%, p = 0.006) and Treatment (71.3% vs 51.2%, p = 0.015). Experts rated DeepSeek higher in imaging interpretation (7.11 vs 6.00, p = 0.002) and overall case analysis (54.11 vs 51.56, p = 0.022). Cross-model analysis revealed DeepSeek uniquely answered 60 questions correctly where ChatGPT erred, while both models shared 51 errors.
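The cross-model figures can be checked against the overall accuracies with a little arithmetic. The sketch below rebuilds the full agreement table from the numbers in this abstract. The "both correct" and "ChatGPT only" cells are not reported here; they are derived under the assumption that the correct-answer counts are the rounded percentages of 249, so treat them as illustrative.

```python
TOTAL = 249

# Reported in the abstract.
deepseek_only_correct = 60
both_wrong = 51

# Assumption: counts recovered by rounding the reported overall accuracies.
deepseek_correct = round(0.747 * TOTAL)
chatgpt_correct = round(0.554 * TOTAL)

# Every ChatGPT error is either a question DeepSeek got right or a shared error.
chatgpt_wrong = TOTAL - chatgpt_correct
consistent = chatgpt_wrong == deepseek_only_correct + both_wrong

# Derived cells (not stated in the abstract).
both_correct = deepseek_correct - deepseek_only_correct
chatgpt_only_correct = chatgpt_correct - both_correct

# Share of all questions that only DeepSeek answered correctly.
unique_share_pct = round(100 * deepseek_only_correct / TOTAL, 1)

print(f"consistent with reported accuracies: {consistent}")
print(f"both correct: {both_correct}, ChatGPT only: {chatgpt_only_correct}")
print(f"DeepSeek-only share: {unique_share_pct}%")
```

The reported figures are mutually consistent under this reconstruction, and the DeepSeek-only share matches the 24.1% quoted in the conclusion.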

DeepSeek outperforms ChatGPT in diagnostic accuracy and specialized clinical reasoning for bone and soft tissue tumors, particularly in the pathology and treatment domains. The significant accuracy gap (p < 0.001) and the 24.1% of questions (60 of 249) that only DeepSeek answered correctly position DeepSeek as the more reliable diagnostic aid, though the 51 questions that both models answered incorrectly necessitate hybrid AI-clinician workflows.

## Linked entities

- **Diseases:** sarcoma (MONDO:0005089)

## Full-text entities

- **Diseases:** tumors (MESH:D009369), sarcoma (MESH:D012509), bone and soft tissue tumor (MESH:D012983)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12827084/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12827084/full.md

## References

22 references — full list in the complete paper: https://tomesphere.com/paper/PMC12827084/full.md

---
Source: https://tomesphere.com/paper/PMC12827084