# Performance of ChatGPT-4o, Gemini 2.0 Pro, and DeepSeek-V3 in Patient-Facing Information on Chest Wall Deformities: A Comparative Evaluation of Accuracy, Reliability, and Reproducibility

**Authors:** Deniz Oke, Ozge Gulsum Illeez, Esra Giray, Betül Çiftçi

PMC · DOI: 10.3390/diagnostics16040589 · Diagnostics · 2026-02-15

## TL;DR

This study compares how accurately, reliably, and reproducibly ChatGPT-4o, Gemini 2.0 Pro, and DeepSeek-V3 answer patient-facing questions about chest wall deformities. ChatGPT-4o performed best overall.

## Contribution

First domain-specific comparative evaluation of LLMs in chest wall deformities with reproducibility analysis.

## Key findings

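- ChatGPT-4o had the highest accuracy and the lowest hallucination rate (5.0%); DeepSeek-V3 had the lowest accuracy and the highest hallucination rate (11.25%)
- Treatment-related questions produced the most errors across all models
- ChatGPT-4o showed the highest reproducibility (weighted κ in the "almost perfect" range), followed by Gemini and DeepSeek-V3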

## Abstract

Background: Large language models (LLMs) such as DeepSeek-V3, Google Gemini 2.0 Pro, and ChatGPT-4o are increasingly used by patients seeking online medical information. However, their accuracy, reliability, and reproducibility in patient-facing content related to chest wall deformities (CWD) remain unclear. This study aimed to compare the performance of three contemporary LLMs in generating information on pectus excavatum, pectus carinatum, and related thoracic deformities. Methods: Eighty patient-facing questions were developed across eight thematic domains and independently submitted to each model using newly created accounts over two consecutive days. Accuracy was assessed using a validated four-point rubric by blinded physiatrists, and reproducibility was evaluated using agreement metrics and weighted Cohen’s kappa. Results: ChatGPT-4o achieved the highest overall accuracy (median score: 1.20), the greatest proportion of fully accurate responses, and the lowest hallucination rate (5.0%). Gemini showed intermediate accuracy, while DeepSeek-V3 demonstrated the lowest accuracy and highest hallucination rate (11.25%). Across all models, general-information and quality-of-life domains had the best performance, whereas treatment-related questions showed the most errors. Reproducibility was highest for ChatGPT-4o (weighted κ = almost perfect), followed by Gemini and DeepSeek-V3. Inter-rater reliability was substantial (Fleiss’ κ = 0.69). Conclusions: Contemporary LLMs can generate largely accurate and reproducible patient-facing information on CWD, with ChatGPT-4o showing the strongest overall performance. This study provides the first domain-specific comparative evaluation of LLMs in CWD and integrates reproducibility analysis alongside accuracy and reliability assessment. While these tools may support patient education, treatment-related responses require caution, and LLMs should be used as adjuncts rather than substitutes for clinical counseling.
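The abstract reports reproducibility as weighted Cohen's kappa between the two submission days and describes agreement with verbal labels ("almost perfect", "substantial"). The following is a minimal sketch of how such a statistic can be computed, not the authors' analysis code. Several things here are assumptions: the linear weighting scheme (the abstract does not say whether linear or quadratic weights were used), the function names `weighted_kappa` and `landis_koch`, the example scores, and the use of the Landis and Koch bands, which the abstract's wording matches but does not cite.

```python
import math


def weighted_kappa(day1, day2, categories, weights="linear"):
    """Weighted Cohen's kappa between two sets of ordinal ratings.

    `categories` lists the rubric levels in order, e.g. [1, 2, 3, 4].
    `weights` is "linear" or "quadratic" (assumed options, not from the paper).
    """
    if len(day1) != len(day2) or not day1:
        raise ValueError("rating lists must be non-empty and of equal length")
    if weights not in ("linear", "quadratic"):
        raise ValueError("weights must be 'linear' or 'quadratic'")

    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(day1)

    # Observed joint proportions of (day-1 score, day-2 score).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(day1, day2):
        observed[index[a]][index[b]] += 1.0 / n

    # Marginal proportions, used for the agreement expected by chance.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    power = 1 if weights == "linear" else 2
    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            # Disagreement weight grows with distance between categories.
            w = (abs(i - j) / (k - 1)) ** power
            observed_disagreement += w * observed[i][j]
            expected_disagreement += w * row[i] * col[j]

    if expected_disagreement == 0:
        # Every rating fell in one category: kappa is undefined.
        return math.nan
    return 1.0 - observed_disagreement / expected_disagreement


def landis_koch(kappa):
    """Map a kappa value to the Landis and Koch verbal label."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


if __name__ == "__main__":
    # Hypothetical rubric scores for eight questions on two days
    # (illustrative only; not data from the study).
    day1 = [1, 1, 2, 1, 3, 1, 2, 4]
    day2 = [1, 1, 2, 2, 3, 1, 2, 3]
    kappa = weighted_kappa(day1, day2, categories=[1, 2, 3, 4])
    print(f"weighted kappa = {kappa:.2f} ({landis_koch(kappa)})")
```

Under these bands, the reported inter-rater Fleiss' κ of 0.69 falls in the "substantial" range, which agrees with the abstract's wording. For real analyses, a tested library routine such as scikit-learn's `cohen_kappa_score` with its `weights` argument is likely a better choice than a hand-rolled function.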

## Linked entities

- **Diseases:** pectus excavatum (MONDO:0008213)

## Full-text entities

- **Diseases:** Wall Deformities (MESH:D056988), deformities (MESH:D009140), PE (MESH:D005660), AI hallucinations (MESH:D006212), PC (MESH:D066166), CWD (MESH:D013898), knee osteoarthritis (MESH:D020370), chest pain (MESH:D002637), hip (MESH:D025981), LLMs (MESH:D007806), cardiac compression (MESH:D009408), congenital abnormalities (MESH:D000013), injury to (MESH:D014947), anxiety (MESH:D001007), cardiopulmonary restriction (MESH:D006323), thoracic deformities (MESH:D013896), dyspnea (MESH:D004417)
- **Chemicals:** ChatGPT-4o (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12939082/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12939082/full.md

## References

56 references — full list in the complete paper: https://tomesphere.com/paper/PMC12939082/full.md

---
Source: https://tomesphere.com/paper/PMC12939082