# Comparing Real and ChatGPT-Generated Radiographs for Training Deep Learning Models to Diagnose Knee Osteoarthritis

**Authors:** Rohan R Datir, Akshay Reddy, Yash Bhatia, Niall Baig, Roman Barrozo, Vinayak Sathe

PMC · DOI: 10.7759/cureus.100381 · 2025-12-29

## TL;DR

This study compares AI models trained on real and ChatGPT-generated knee X-rays for diagnosing osteoarthritis, finding that synthetic images can help but are not enough on their own.

## Contribution

The novel contribution is evaluating ChatGPT-generated radiographs as a supplement to real data for training AI in knee osteoarthritis diagnosis.

## Key findings

- Models trained on real radiographs (Model B) and a mix of real and synthetic images (Model C) outperformed those trained solely on synthetic data (Model A).
- Model C showed slightly better discrimination (AUROC 0.782) than Model B (0.758), though confidence intervals overlapped.
- Adding synthetic images raised sensitivity for grade 1 and grade 4 OA in unadjusted comparisons, but these differences were not statistically significant after Holm-Bonferroni adjustment.
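To make "grade-specific sensitivity" concrete, here is a minimal sketch of how it could be computed for a binary classifier: for each severity grade, take the fraction of images of that grade that the model flags as OA. The function name, the integer grade labels, and the toy data are illustrative assumptions, not the authors' code or results.

```python
from collections import defaultdict

def per_grade_sensitivity(grades, preds):
    """Fraction of images predicted positive (1) within each grade.

    Assumes every grade passed in counts as OA-positive, so the fraction
    of positive predictions within a grade is that grade's sensitivity.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for grade, pred in zip(grades, preds):
        totals[grade] += 1
        hits[grade] += int(pred == 1)
    return {g: hits[g] / totals[g] for g in totals}

# Hypothetical toy data: two grade-1 images and three grade-4 images.
toy_grades = [1, 1, 4, 4, 4]
toy_preds = [1, 0, 1, 1, 0]
toy_result = per_grade_sensitivity(toy_grades, toy_preds)
```

Comparing these per-grade values between two models yields one hypothesis test per grade, which is why a multiple-comparison adjustment is applied.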

## Abstract

Introduction: Osteoarthritis (OA) is a degenerative joint disease characterized by progressive cartilage loss, bone remodeling, and chronic pain. The growing global burden of OA motivates the evaluation of artificial intelligence (AI) approaches for automating radiographic diagnosis.

Purpose: This study aimed to compare AI models trained on real radiographs, ChatGPT-generated radiographs, and a combined dataset to assess whether synthetic imaging can improve OA detection.

Methods: Three binary classifiers were trained using knee radiographs: Model A (ChatGPT-generated images only), Model B (real images only), and Model C (real + synthetic). All models were developed using PyTorch in Google Colab and evaluated on 1,656 held-out real radiographs. Accuracy, sensitivity, specificity, precision, F1 score, and area under the receiver operating characteristic curve (AUROC) were computed. Between-model comparisons used two-sided McNemar’s tests on paired predictions; 95% confidence intervals were estimated by bootstrap resampling. Grade-specific comparisons were Holm-Bonferroni adjusted (with unadjusted p-values also reported).
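The statistical comparison described in the Methods can be sketched in plain Python. The code below shows an exact two-sided McNemar test on paired predictions and a percentile bootstrap confidence interval for AUROC. It is an illustration under assumptions: the abstract does not state whether the exact or the chi-square form of McNemar's test was used, how many bootstrap resamples were drawn, or which interval method was applied, and all function names and defaults here are hypothetical.

```python
import random
from math import comb

def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar p-value for two models scored on the same cases."""
    # Only discordant pairs (exactly one model correct) carry information.
    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    n = b + c
    if n == 0:
        return 1.0
    # Binomial(n, 0.5) tail probability at the smaller discordant count, doubled.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def auroc(y_true, scores):
    """AUROC as the probability that a positive outscores a negative (ties = 0.5).

    Pairwise O(n_pos * n_neg) version for clarity; a rank-based routine
    is preferable on large test sets.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auroc_ci(y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUROC, resampling cases with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        labels = [y_true[i] for i in idx]
        if len(set(labels)) < 2:
            continue  # AUROC is undefined when a resample has only one class
        stats.append(auroc(labels, [scores[i] for i in idx]))
    if not stats:
        raise ValueError("no valid bootstrap resamples")
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi
```

McNemar's test suits this design because all three models are evaluated on the same held-out radiographs, so their errors are paired rather than independent.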

Results: Models B and C outperformed Model A on overall performance, although Model A showed higher specificity. Model C demonstrated slightly higher discrimination than Model B (AUROC 0.782 vs 0.758), with overlapping 95% confidence intervals. Sensitivity for grade 1 and grade 4 OA was higher for Model C than for Model B in unadjusted comparisons, but these differences did not remain statistically significant after Holm-Bonferroni adjustment.
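The loss of significance after adjustment follows from how the Holm-Bonferroni step-down procedure works: the smallest of m p-values is multiplied by m, the next by m − 1, and so on, with a running maximum keeping the adjusted values monotone. A minimal sketch follows; the p-values in the example are made up for illustration and are not taken from the paper.

```python
def holm_adjust(pvalues):
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    m = len(pvalues)
    # Visit p-values from smallest to largest, remembering original positions.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The rank-th smallest p-value is scaled by the number of hypotheses
        # still in play, (m - rank), and capped at 1.
        candidate = min(1.0, (m - rank) * pvalues[i])
        # The running maximum enforces monotonicity of adjusted p-values.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted

# Hypothetical unadjusted p-values for four grade-specific comparisons.
raw = [0.03, 0.20, 0.04, 0.60]
adj = holm_adjust(raw)
significant = [p < 0.05 for p in adj]
```

In this made-up example, two comparisons fall below 0.05 before adjustment and none do afterwards, which is the same qualitative pattern the Results describe.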

Conclusion: ChatGPT-generated radiographs alone were insufficient for reliable training of OA diagnostic models. When used as a supplement to real radiographs, synthetic images produced small, directionally favorable changes in discrimination and grade-specific sensitivity, supporting their use as an adjunct for dataset expansion rather than a replacement for clinical imaging.

## Linked entities

- **Diseases:** osteoarthritis (MONDO:0005178)

## Full-text entities

- **Diseases:** degenerative joint disease (MESH:D019636), OA (MESH:D010003), cartilage loss (MESH:D002357), Knee Osteoarthritis (MESH:D020370), chronic pain (MESH:D059350)

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC12853004/full.md

---
Source: https://tomesphere.com/paper/PMC12853004