# A Retrospective Comparison of Artificial Intelligence and the Orthopaedic Multi-disciplinary Team in the Management of Intracapsular Neck of Femur Fractures

**Authors:** Matthew K Emmerson, Ryan Hillier-Smith, Amar Malhas

PMC · DOI: 10.7759/cureus.94699 · Cureus · 2025-10-16

## TL;DR

This study compared ChatGPT's recommendations for hip fracture surgery with those of orthopaedic consultants. ChatGPT agreed with consultants only after being corrected within the same session, and that agreement did not carry over to a new set of patients.

## Contribution

The study evaluates the reliability of ChatGPT in replicating orthopaedic decision-making for hip fractures and highlights risks of over-reliance on AI without proper validation.

## Key findings

- Initial agreement between ChatGPT and consultants was low (κ = 0.03), but improved after in-session adjustments (κ = 0.93).
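- Age was the only factor that significantly influenced ChatGPT's post-adjustment recommendations: increasing age reduced the likelihood of total hip replacement rather than hemiarthroplasty (odds ratio = 0.81).
- Validation on a new dataset showed poor generalization (κ = 0.29), not significantly different from the initial agreement (z = -1.51, p = 0.13).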

## Abstract

Introduction

Artificial intelligence (AI) tools, such as ChatGPT, could potentially support junior clinicians in making initial operative decisions for hip fractures. However, the safety and reliability of such use are uncertain. This study compared ChatGPT’s management recommendations for patients with intracapsular neck of femur (NOF) fractures to decisions made by orthopaedic consultants and evaluated which patient factors influenced those recommendations.

Methods

We identified a retrospective cohort of patients admitted with an intracapsular NOF fracture over an 18-week period to a United Kingdom District General Hospital. We collected patients’ age, sex, comorbidities, mobility status, and the 4 A’s Test (4AT) score. De-identified data were entered into ChatGPT with instructions to recommend management based on National Institute for Health and Care Excellence guidance; ChatGPT’s recommendations were compared with the operation recorded at the Trauma Meeting. When ChatGPT and consultants disagreed, we provided ChatGPT with the consultant's decision and rationale and recorded the revised recommendation. We then validated ChatGPT on a separate anonymized patient set. We used Cohen’s Kappa to assess agreement and applied multinomial and binomial logistic regression and two-proportion z-tests to assess significance.
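As an illustration of the agreement statistic named above, the following is a minimal sketch of Cohen's Kappa for two raters. The function name `cohens_kappa`, the category labels, and the six example decisions are hypothetical and are not taken from the study data; the authors' actual software is not stated in this summary.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of cases")
    n = len(rater_a)

    # Observed agreement: share of cases where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal proportions, summed over labels.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in labels) / n**2

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch treats it as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical decisions for six patients (not study data).
consultant = ["hemi", "hemi", "THR", "THR", "fixation", "hemi"]
chatgpt = ["hemi", "THR", "THR", "THR", "fixation", "hemi"]
kappa = cohens_kappa(consultant, chatgpt)
print(f"kappa = {kappa:.3f}")
```

Kappa discounts the agreement expected by chance, so two raters who mostly choose the same common operation can have a high raw agreement rate and still score a low κ.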

Results

One hundred five patients were included in the primary cohort, and 30 in the validation cohort. Initial agreement between ChatGPT and consultants was low [κ = 0.03, 95% confidence interval (CI) -0.11 to 0.19, p = 0.70]. After in-session adjustment, agreement rose (κ = 0.93, 95% CI 0.84 to 1.00, p < 0.001), a statistically significant improvement (z = 10.3, p < 0.001). Age was the only factor that significantly influenced ChatGPT’s post-adjustment recommendations: increasing age was associated with a reduced likelihood of receiving total hip replacement rather than hemiarthroplasty (odds ratio = 0.81, 95% CI 0.70 to 0.94, p = 0.006). In the validation cohort, agreement fell (κ = 0.29, 95% CI -0.04 to 0.60, p = 0.06) and did not significantly differ from the initial agreement (z = -1.51, p = 0.13).
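The z statistics above come from two-proportion z-tests. The sketch below shows one common form of that test, using a pooled standard error and a two-sided p-value. Whether the authors pooled the variance, and which proportions they compared, is not stated in this summary, so both are assumptions here; the function name `two_proportion_z` and the example counts are hypothetical and do not reproduce the reported values.

```python
import math


def two_proportion_z(successes_1, n_1, successes_2, n_2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1 = successes_1 / n_1
    p2 = successes_2 / n_2

    # Pooled proportion under the null hypothesis that both groups share one rate.
    pooled = (successes_1 + successes_2) / (n_1 + n_2)
    if pooled in (0.0, 1.0):
        raise ValueError("Test is undefined when every outcome is identical")

    se = math.sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    z = (p1 - p2) / se

    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts (not study data): 30 of 100 versus 60 of 100 cases in agreement.
z, p = two_proportion_z(30, 100, 60, 100)
print(f"z = {z:.2f}, p = {p:.5f}")
```

The sign of z only reflects the order in which the two groups are passed in; the p-value depends on its magnitude.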

Conclusion

ChatGPT did not reliably replicate orthopaedic consultant decision-making for intracapsular NOF fractures. Although in-session adjustments produced high superficial concordance, this effect did not generalize to an independent dataset. The model’s tendency to conform to user prompts risks creating false confidence in its outputs. Clinically focused, validated AI systems require further evaluation before they can safely augment operative decision-making.

## Full-text entities

- **Diseases:** hip fractures (MESH:D006620), NOF fracture (MESH:D005265), Trauma (MESH:D014947)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12529217/full.md

## References

30 references — full list in the complete paper: https://tomesphere.com/paper/PMC12529217/full.md

---
Source: https://tomesphere.com/paper/PMC12529217