# Accuracy Is Not Enough: Reasoning and Reference Reliability in Orthopaedic Large Language Model (LLM) Applications

**Authors:** Shashwat Singh, Pranav Chandrasekhar

PMC · DOI: 10.7759/cureus.100845 · 2026-01-05

## TL;DR

GPT-5 scored 78.3% on the 2024 Orthopaedic In-Training Examination (OITE), yet in a validated subsample of 88 responses about one in three cited a fabricated or misrepresented reference, including 15.9% of the correct answers.

## Contribution

The study systematically evaluates reasoning quality and reference reliability of an LLM on a standardised orthopaedic examination, rather than accuracy alone.

## Key findings

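- GPT-5 achieved 78.3% accuracy (159/203) on the 2024 OITE, exceeding the pass threshold (67%) and the mean PGY-5 resident score (73%).
- In a structured subsample of 88 responses (44 correct, 44 incorrect), 33% cited fabricated or misrepresented evidence; the rate was higher for incorrect answers (50%) than for correct answers (15.9%; p=0.001).
- Reasoning for correct answers was almost always concordant with the AAOS explanations (95.5% at the maximum score), yet some of those answers still relied on flawed references, so reasoning and evidence quality need to be evaluated alongside accuracy.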

## Abstract

Background: Large language models (LLMs) now achieve performance comparable to senior postgraduate trainees on orthopaedic examinations and are increasingly trusted by clinicians to provide explanations for educational and decision-support purposes. However, correct answers do not necessarily indicate sound reasoning or reliable referencing. Current evaluations in this field emphasise accuracy alone, ignoring the quality and evidentiary reliability of the reasoning process.

Aim: This study aimed to systematically evaluate the relationship between answer accuracy, reasoning quality, and reference reliability in the latest generation of LLMs applied to a standardised postgraduate orthopaedic examination.

Methods: The 2024 Orthopaedic In-Training Examination (OITE; 203 questions) was administered to GPT-5 (OpenAI, San Francisco, CA, USA). The model was prompted to provide one answer, a brief rationale, and one supporting reference per question. Accuracy and percentile were recorded relative to official American Academy of Orthopaedic Surgeons (AAOS) data. A structured subsample of 88 responses (44 correct, 44 incorrect) underwent detailed validation of referencing and reasoning. GPT-5's reasoning was compared against official AAOS answer explanations for each question. Reasoning quality was scored using a three-point ordinal scale. References were categorised as fabricated, misrepresented, or accurate. Hallucination rates and reasoning scores were compared between correct and incorrect answers.
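The validation scheme described in the Methods can be pictured as one record per audited response. The following is a minimal sketch, not the authors' tooling: the class and field names are hypothetical, the 0 to 2 range for the reasoning score is an assumption (the abstract says only "three-point ordinal scale"), and counting both fabricated and misrepresented references as hallucinations follows the wording of the Conclusions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class ReferenceStatus(Enum):
    """The three reference categories named in the Methods."""
    ACCURATE = "accurate"
    MISREPRESENTED = "misrepresented"
    FABRICATED = "fabricated"


@dataclass
class ValidatedResponse:
    """One audited response (hypothetical record layout)."""
    question_id: int
    is_correct: bool
    reasoning_score: int  # assumed range 0-2; the abstract does not give the anchors
    reference_status: ReferenceStatus

    def __post_init__(self) -> None:
        if self.reasoning_score not in (0, 1, 2):
            raise ValueError("reasoning_score must be 0, 1, or 2 (assumed scale)")

    @property
    def hallucinated(self) -> bool:
        # Any reference that is not accurate counts as a hallucination here.
        return self.reference_status is not ReferenceStatus.ACCURATE


def hallucination_rate(responses: Iterable[ValidatedResponse]) -> float:
    """Fraction of responses whose reference was fabricated or misrepresented."""
    items = list(responses)
    if not items:
        raise ValueError("no responses to summarise")
    return sum(r.hallucinated for r in items) / len(items)
```

Splitting a list of such records on `is_correct` and calling `hallucination_rate` on each half gives the per-group rates that the study compares.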

Results: GPT-5 achieved 78.3% accuracy (159/203), exceeding the OITE pass threshold (67%) and the mean postgraduate year-5 (PGY-5) resident score (73%). To our knowledge, this is the highest accuracy reported among peer-reviewed studies to date. In the subset of 88 responses, hallucinations occurred in 33% overall and were significantly more frequent in incorrect answers (50%) than in correct answers (15.9%; p=0.001). Reasoning among correct answers was consistently high (median 2.0, IQR 0.0), with 95.5% scoring maximum points, indicating reasoning entirely concordant with the reasoning provided by AAOS. Image-based questions showed lower accuracy (44.7%) compared with text-based questions (54%), though this difference was not statistically significant (p=0.52).
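The headline figures can be checked with a few lines of arithmetic. In the sketch below, the counts 22/44 and 7/44 are back-calculated from the reported 50% and 15.9% and are not stated in the abstract. The abstract also does not name the statistical test, so the two-sided Fisher exact test used here is an assumption, and its p-value need not match the reported p=0.001 exactly.

```python
from math import comb

# Reported in the abstract.
n_right, n_questions = 159, 203
accuracy = n_right / n_questions

# Derived from the reported percentages (assumption, see lead-in).
halluc_incorrect, n_incorrect = 22, 44  # 50% of 44
halluc_correct, n_correct = 7, 44       # 15.9% of 44

overall_rate = (halluc_incorrect + halluc_correct) / (n_incorrect + n_correct)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)  # tolerance for float ties
    )


# Rows: incorrect vs correct answers; columns: hallucinated vs not.
p_value = fisher_exact_two_sided(
    halluc_incorrect, n_incorrect - halluc_incorrect,
    halluc_correct, n_correct - halluc_correct,
)

print(f"accuracy = {accuracy:.1%}")
print(f"overall hallucination rate = {overall_rate:.1%}")
print(f"Fisher exact p = {p_value:.4f}")
```

The derived counts reproduce the reported 78.3% accuracy and 33% overall hallucination rate.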

Conclusions: GPT-5 appears to exceed previously reported LLM performance on the OITE and achieved accuracy higher than published mean scores for senior trainees, but demonstrated poor reference reliability, with one in three answers citing fabricated or misrepresented evidence. Even correct answers frequently relied on flawed or unverifiable sources. Evaluation of LLMs in medical education should incorporate systematic reasoning and evidence validation, not accuracy alone.

---
Source: https://tomesphere.com/paper/PMC12874175