ChatGPT models provide higher‐quality but lower‐readability responses than Google Gemini regarding anterior shoulder instability, with no added benefit of the orthopaedic expert plugin

Khaled Skaik; Sean Omoseni; Danielle Dagher; Darshil Shah; Theodorakys Marín Fermín; Piero Agostinone; Ashraf Hantouly; Moin Khan

PMC · DOI: 10.1002/ksa.70255 · December 26, 2025

Open Access

TL;DR

ChatGPT models provide higher-quality but harder-to-read responses about shoulder instability compared to Google Gemini, with no added benefit from an orthopaedic expert plugin.

Contribution

This study compares the quality and readability of medical information on anterior shoulder instability from three large language models.

Findings

01

ChatGPT 4o and ChatGPT OE provided higher-quality responses than Google Gemini.

02

Google Gemini's responses were more readable but lower in quality.

03

The orthopaedic expert plugin did not improve ChatGPT's performance.

Abstract

The purpose of this study was to analyze and compare the quality and readability of information regarding anterior shoulder instability and shoulder stabilization surgery from three LLMs: ChatGPT 4o, ChatGPT Orthopaedic Expert (OE) and Google Gemini. Each model was used to answer 21 questions commonly asked by patients about anterior shoulder instability. The responses were independently rated by three fellowship‐trained orthopaedic surgeons using the validated Quality Analysis of Medical Artificial Intelligence (QAMAI) tool. Assessors were blinded to the model, and evaluations were performed twice, 3 weeks apart. Readability was measured using the Flesch Reading Ease Score (FRES) and the Flesch–Kincaid Grade Level (FKGL). This study adhered to TRIPOD‐LLM. Statistical analysis included the Friedman test, Wilcoxon signed‐rank tests and intraclass correlation coefficients. Inter‐rater…
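The two readability measures named in the abstract follow standard published formulas, which can be sketched as below. This is a minimal illustration, not the authors' pipeline: the abstract does not say which tool computed the scores, and the `count_syllables` heuristic and the regex-based sentence and word splitting are assumptions made here for the example. Dedicated readability tools count syllables more carefully, so absolute values will differ somewhat.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption of this sketch): count vowel groups
    and discount a trailing silent 'e'. Always returns at least 1."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def text_stats(text: str) -> tuple[int, int, int]:
    """Return (sentences, words, syllables) using simple regex splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    syllables = sum(count_syllables(w) for w in words)
    return len(sentences), len(words), syllables


def flesch_reading_ease(text: str) -> float:
    """FRES: higher scores mean easier text."""
    s, w, syl = text_stats(text)
    return 206.835 - 1.015 * (w / s) - 84.6 * (syl / w)


def flesch_kincaid_grade(text: str) -> float:
    """FKGL: approximate US school grade level; higher means harder text."""
    s, w, syl = text_stats(text)
    return 0.39 * (w / s) + 11.8 * (syl / w) - 15.59


if __name__ == "__main__":
    # Hypothetical sample sentences, not taken from the study's model responses.
    plain = "Your shoulder can slip out of place. Rest helps it heal."
    dense = "Glenohumeral instability necessitates arthroscopic stabilization."
    for label, sample in (("plain", plain), ("dense", dense)):
        print(label, round(flesch_reading_ease(sample), 1),
              round(flesch_kincaid_grade(sample), 1))
```

The two scores move in opposite directions: longer sentences and more syllables per word lower the FRES and raise the FKGL, which is why a response can be rated higher in quality yet harder to read.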

Linked entities

Species (1)

Homo sapiens (human · species)

Diseases (1)

anterior shoulder instability

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI) · Clinical Reasoning and Diagnostic Skills