# The assessment of ChatGPT‐4's performance compared to expert's consensus on chronic lateral ankle instability

**Authors:** Takuji Yokoe, Giulia Roversi, Nuno Sevivas, Naosuke Kamei, Pedro Diniz, Hélder Pereira

PMC · DOI: 10.1002/jeo2.70393 · 2025-08-05

## TL;DR

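This study compares ChatGPT-4's answers on the surgical treatment of chronic lateral ankle instability with expert consensus statements, finding agreement on 11 of 17 questions (64.7%) and concluding that the model does not yet match foot and ankle experts.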

## Contribution

The study evaluates ChatGPT-4's reliability in surgical decision-making for ankle instability, a novel application of LLMs in orthopedic surgery.

## Key findings

- ChatGPT-4 agreed with expert consensus on 64.7% of surgical management questions.
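- Most responses were rated "No" for Overconclusiveness (76.5%, 13/17) and "No" for Incompleteness (88.2%, 15/17); 70.6% (12/17) contained supplementary information beyond the consensus statements.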
- Despite limitations, ChatGPT-4 shows potential for supporting non-expert clinicians.

## Abstract

To evaluate the accuracy of ChatGPT‐4's answers to clinical questions on the surgical treatment of chronic lateral ankle instability (CLAI), using the consensus statements developed by the ESSKA‐AFAS Ankle Instability Group (AIG) as the reference. This study simulated the clinical setting in which non‐expert clinicians treat patients with CLAI.

The large language model (LLM) ChatGPT‐4 was used on 10 February 2025 to answer a total of 17 questions regarding the surgical management of CLAI that were developed by the ESSKA‐AFAS AIG. The ChatGPT responses were compared with the consensus statements developed by the ESSKA‐AFAS AIG, and the consistency and accuracy of ChatGPT's answers were evaluated against the experts' answers. Consistency with the consensus statements was assessed by the question, 'Is the answer by ChatGPT agreement with those by the experts? (Yes or No)'. Four scoring categories were used to evaluate the quality of ChatGPT's answers: Accuracy, Overconclusiveness (a recommendation proposed despite the lack of consensus), Supplementary (additional information not covered by the consensus statement), and Incompleteness.

Of the 17 questions on the surgical management of CLAI, 11 answers (64.7%) were in agreement with the experts' consensus statements. The percentages of ChatGPT's answers rated 'Yes' for Accuracy and Supplementary were 64.7% (11/17) and 70.6% (12/17), respectively. The percentages of ChatGPT's answers rated 'No' for Overconclusiveness and Incompleteness were 76.5% (13/17) and 88.2% (15/17), respectively.
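The reported percentages follow directly from the counts out of 17 questions. The short sketch below reproduces that arithmetic; the dictionary layout and the `percentage` helper are illustrative assumptions, not part of the study's methodology, and only the counts themselves come from the abstract.

```python
TOTAL_QUESTIONS = 17  # number of ESSKA-AFAS AIG questions posed to ChatGPT-4

# Counts as reported in the abstract: (category, rating) -> number of answers.
# The key structure is a hypothetical layout chosen for this illustration.
reported_counts = {
    ("Agreement", "Yes"): 11,
    ("Accuracy", "Yes"): 11,
    ("Supplementary", "Yes"): 12,
    ("Overconclusiveness", "No"): 13,
    ("Incompleteness", "No"): 15,
}


def percentage(count, total=TOTAL_QUESTIONS):
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


percentages = {key: percentage(n) for key, n in reported_counts.items()}

for (category, rating), pct in percentages.items():
    n = reported_counts[(category, rating)]
    print(f"{category} ({rating}): {n}/{TOTAL_QUESTIONS} = {pct}%")
```

Note that a 'No' rating for Overconclusiveness or Incompleteness is the favourable outcome, so the 76.5% and 88.2% figures describe answers that were not overconclusive and not incomplete.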

The present study showed that ChatGPT‐4 could not answer queries on the surgical management of CLAI as well as foot and ankle experts. However, ChatGPT showed promising potential for use in managing patients with CLAI.

Level of evidence: Level Ⅳ.

## Full-text entities

- **Genes:** CFL1 (cofilin 1) [NCBI Gene 1072] {aka CFL, HEL-S-15, cofilin}
- **Diseases:** calcaneal fractures (MESH:D036982), Re-Injury (MESH:D000083102), laxity (MESH:D007593), Ligamentous Injury (MESH:D000070598), Obese (MESH:D009765), subluxation (MESH:D004204), Ligamentous laxity (MESH:C536012), subtalar arthritis (MESH:D001168), femoroacetabular impingement (MESH:D057925), sprains (MESH:D013180), LLM (MESH:D007806), functional disability (MESH:D003291), heel inversion instability (MESH:D007446), Mechanical Instability (MESH:D041781), Tendon (MESH:D052256), Oedema (MESH:C536897), Ehlers-Danlos Syndrome (MESH:D004535), Osteochondral lesions of the talus (MESH:D010007), chronic (MESH:D002908), tears (MESH:D012167), fractures (MESH:D050723), ATFL injury (MESH:D014947), deformities (MESH:D009140), bony abnormalities (MESH:D018213), neuromuscular dysfunction (MESH:D009468), faulty (MESH:C538446), knee or hip osteoarthritis (MESH:D020370), movement deficiencies (MESH:D009069), Instability (MESH:D043171), Pain (MESH:D010146), cavovarus deformity (MESH:D000070589), calcaneofibular ligament (MESH:D000082122), Ankle sprain (MESH:D016512), ATFL insufficiency (MESH:D000309), impingement (MESH:D019534), anterior drawer (MESH:D020759), Hypermobility (MESH:C536196), cartilage damage (MESH:D002357), Loose (MESH:D007594), Tenderness (MESH:D063806), hypertrophy (MESH:D006984), Swelling (MESH:D004487), Stress Fractures (MESH:D015775), thromboembolic (MESH:D013923), chondral lesions (MESH:D009059), -articular (MESH:D057072), tendon injury (MESH:D013708), sinus tarsi syndrome (MESH:C000604661), synovitis (MESH:D013585)
- **Chemicals:** NO (MESH:D009614), FAQ (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

---
Source: https://tomesphere.com/paper/PMC12322689