The assessment of ChatGPT‐4's performance compared to expert's consensus on chronic lateral ankle instability

Takuji Yokoe; Giulia Roversi; Nuno Sevivas; Naosuke Kamei; Pedro Diniz; Hélder Pereira

PMC · DOI: 10.1002/jeo2.70393 · August 5, 2025

Open Access

TL;DR

This study compares ChatGPT-4's answers on the surgical treatment of chronic lateral ankle instability with expert consensus. It finds agreement on roughly two-thirds of the questions, but many responses were overconclusive or incomplete.

Contribution

The study evaluates how reliably ChatGPT-4 supports surgical decision-making for chronic lateral ankle instability, a novel application of large language models (LLMs) in orthopedic surgery.

Findings

1. ChatGPT-4 agreed with the expert consensus on 64.7% (11 of 17) of the surgical management questions.
2. Most of the model's responses were overconclusive and incomplete.
3. Despite these limitations, ChatGPT-4 shows potential for supporting non-expert clinicians.

Abstract

This study evaluated the accuracy of ChatGPT‐4's answers to clinical questions on the surgical treatment of chronic lateral ankle instability (CLAI), using the consensus statements developed by the ESSKA‐AFAS Ankle Instability Group (AIG) as the reference. It simulated a clinical setting in which non‐expert clinicians treat patients with CLAI. On 10 February 2025, the large language model (LLM) ChatGPT‐4 was used to answer 17 questions on the surgical management of CLAI that had been developed by the ESSKA‐AFAS AIG, and its responses were compared with the group's consensus statements. The consistency and accuracy of ChatGPT's answers were evaluated against the experts' answers. Consistency with the consensus statements was assessed with the question, 'Is the answer by ChatGPT agreement with those by the experts? (Yes…
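As a quick arithmetic check, the reported 64.7% agreement over 17 questions can be tied back to a whole number of "Yes" judgments. The following Python snippet is a minimal sketch under stated assumptions: the helper name `agreement_rate` and the "Yes"/"No" string labels are hypothetical, and because the per-question judgments are not given here, the example list reproduces only the inferred count, not the study's actual per-question results.

```python
def agreement_rate(judgments: list[str]) -> float:
    """Percentage of 'Yes' judgments, rounded to one decimal place (hypothetical helper)."""
    if not judgments:
        raise ValueError("no judgments to score")
    yes = sum(1 for j in judgments if j == "Yes")
    return round(100 * yes / len(judgments), 1)


N_QUESTIONS = 17      # number of consensus questions stated in the abstract
REPORTED_RATE = 64.7  # agreement percentage stated in the findings

# Which "Yes" counts out of 17 round to the reported 64.7%?
consistent = [
    k for k in range(N_QUESTIONS + 1)
    if round(100 * k / N_QUESTIONS, 1) == REPORTED_RATE
]
print(consistent)  # [11]

# Illustrative judgments only: 11 "Yes" and 6 "No", order not taken from the study.
judgments = ["Yes"] * 11 + ["No"] * (N_QUESTIONS - 11)
print(agreement_rate(judgments))  # 64.7
```

Only a count of 11 is consistent with the reported figure, since 10 of 17 gives 58.8% and 12 of 17 gives 70.6%.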

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Meta-analysis and systematic reviews · Delphi Technique in Research