# ChatGPT models provide higher‐quality but lower‐readability responses than Google Gemini regarding anterior shoulder instability, with no added benefit of the orthopaedic expert plugin

**Authors:** Khaled Skaik, Sean Omoseni, Danielle Dagher, Darshil Shah, Theodorakys Marín Fermín, Piero Agostinone, Ashraf Hantouly, Moin Khan

PMC · DOI: 10.1002/ksa.70255 · 2025-12-26

## TL;DR

ChatGPT models provide higher-quality but harder-to-read responses about shoulder instability compared to Google Gemini, with no added benefit from an orthopaedic expert plugin.

## Contribution

This study compares the quality and readability of medical information on anterior shoulder instability from three large language models.

## Key findings

- ChatGPT 4o and ChatGPT OE provided higher-quality responses than Google Gemini.
- Google Gemini's responses were more readable but lower in quality.
- The orthopaedic expert plugin did not improve ChatGPT's performance.

## Abstract

The purpose of this study was to analyze and compare the quality and readability of information regarding anterior shoulder instability and shoulder stabilization surgery from three large language models (LLMs): ChatGPT 4o, ChatGPT Orthopaedic Expert (OE) and Google Gemini.

ChatGPT 4o, ChatGPT OE and Google Gemini were used to answer 21 questions commonly asked by patients about anterior shoulder instability. The responses were independently rated by three fellowship‐trained orthopaedic surgeons using the validated Quality Analysis of Medical Artificial Intelligence (QAMAI) tool. Assessors were blinded to the model, and evaluations were performed twice, 3 weeks apart. Readability was measured using the Flesch Reading Ease Score (FRES) and the Flesch–Kincaid Grade Level (FKGL). This study adhered to the TRIPOD‐LLM reporting guideline. Statistical analysis included the Friedman test, Wilcoxon signed‐rank tests and intraclass correlation coefficients (ICCs).
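The two readability metrics named above are closed-form formulas over word, sentence and syllable counts. The following is a minimal sketch of the standard published formulas, not the tool the authors used. The function names are hypothetical, and the counts are assumed to come from a separate text-processing step (syllable counting in particular is heuristic and varies between tools).

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease Score: higher values mean easier text."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    words_per_sentence = words / sentences
    syllables_per_word = syllables / words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade needed."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    words_per_sentence = words / sentences
    syllables_per_word = syllables / words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


if __name__ == "__main__":
    # Illustrative counts only (not taken from the study):
    # 100 words, 5 sentences, 150 syllables.
    print(round(flesch_reading_ease(100, 5, 150), 3))
    print(round(flesch_kincaid_grade(100, 5, 150), 2))
```

Because both formulas are driven by the same two ratios, longer sentences and longer words lower the FRES and raise the FKGL, which is why the two metrics move in opposite directions in the results.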

Inter‐rater reliability among the three surgeons was good or excellent for all LLMs. ChatGPT OE and ChatGPT 4o demonstrated comparable overall performance, each achieving a median QAMAI score of 22, with interquartile ranges (IQRs) of 5.25 and 6.75, respectively. Their median (IQR) domain scores were accuracy 4 (1) and 4 (1), clarity 4 (1) and 4 (1), relevance 4 (1) and 4 (1), completeness 4 (1) and 4 (1), provision of sources 1 (0) for both and usefulness 4 (1) and 4 (1), respectively. Google Gemini showed lower scores across these domains (accuracy 3 [1], clarity 3 [1], relevance 3 [1.25], completeness 3 [0.25], sources 3 [3] and usefulness 3 [1.25]), with a median QAMAI score of 19 (5.25) (p < 0.01 vs. each ChatGPT model). Readability was higher for Google Gemini (FRES = 36.96, FKGL = 11.92) than for ChatGPT OE (FRES = 21.90, FKGL = 14.94) and ChatGPT 4o (FRES = 24.24, FKGL = 15.11), indicating easier‐to‐read content (p < 0.01). There was no significant difference between ChatGPT 4o and OE in overall quality or readability.
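The scores above are reported as median (IQR). As a rough sketch of how such a summary can be computed, the hypothetical helper below uses Python's standard library. The quartile convention (`method="inclusive"`) is an assumption; the paper does not state which convention was used, and other conventions can give slightly different IQRs.

```python
from statistics import median, quantiles


def median_iqr(scores: list[float]) -> tuple[float, float]:
    """Return (median, interquartile range) for a list of scores."""
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return median(scores), q3 - q1


if __name__ == "__main__":
    # Made-up ratings for illustration only, not data from the study.
    example_scores = [1, 2, 3, 4, 5]
    print(median_iqr(example_scores))
```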

ChatGPT 4o and ChatGPT OE provided statistically higher‐quality responses than Google Gemini, though all models showed good‐quality responses overall. However, responses generated by ChatGPT 4o and OE were more difficult to read than those generated by Google Gemini.

Level V, expert opinion.

## Full-text entities

- **Diseases:** anterior shoulder instability (MESH:D000070599)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12850548/full.md

---
Source: https://tomesphere.com/paper/PMC12850548