Accuracy Is Not Enough: Reasoning and Reference Reliability in Orthopaedic Large Language Model (LLM) Applications

Shashwat Singh; Pranav Chandrasekhar

PMC · DOI: 10.7759/cureus.100845 · January 5, 2026

Open Access

TL;DR

This study shows that although GPT-5 scores well on the 2024 Orthopaedic In-Training Examination (78.3% accuracy), about a third of its answers cite fabricated or misrepresented evidence, including answers that are correct.

Contribution

The study introduces a systematic evaluation of reasoning quality and reference reliability in LLMs for orthopaedics, beyond just accuracy.

Findings

01. GPT-5 achieved 78.3% accuracy on the OITE, outperforming senior trainees.

02. 33% of GPT-5's answers cited fabricated or misrepresented evidence, with higher hallucination rates for incorrect answers.

03. Correct answers often relied on flawed references, highlighting the need for evaluating reasoning and evidence quality.
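The findings above combine two measurements per question: whether the answer was correct, and whether its cited evidence was fabricated or misrepresented. A minimal sketch of how such rates could be tabulated follows. The `GradedAnswer` record, the `summarise` helper, and the four toy records are hypothetical illustrations, not the study's data or its grading procedure.

```python
from dataclasses import dataclass


@dataclass
class GradedAnswer:
    """One graded exam response (hypothetical schema, not from the paper)."""
    correct: bool           # answer matched the exam key
    flawed_reference: bool  # cited evidence was fabricated or misrepresented


def rate(flags):
    """Fraction of True values; NaN when the group is empty."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else float("nan")


def summarise(answers):
    """Accuracy plus flawed-reference rates, overall and split by correctness."""
    correct = [a for a in answers if a.correct]
    incorrect = [a for a in answers if not a.correct]
    return {
        "accuracy": rate(a.correct for a in answers),
        "flawed_ref_overall": rate(a.flawed_reference for a in answers),
        "flawed_ref_given_correct": rate(a.flawed_reference for a in correct),
        "flawed_ref_given_incorrect": rate(a.flawed_reference for a in incorrect),
    }


# Toy records for illustration only; these are NOT the study's results.
toy = [
    GradedAnswer(correct=True, flawed_reference=False),
    GradedAnswer(correct=True, flawed_reference=True),
    GradedAnswer(correct=True, flawed_reference=False),
    GradedAnswer(correct=False, flawed_reference=True),
]
summary = summarise(toy)
print(summary)
```

Reporting the flawed-reference rate separately for correct and incorrect answers is what makes the third finding visible: an accuracy figure alone would hide correct answers that rest on unreliable citations.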

Abstract

Background: Large language models (LLMs) now achieve performance comparable to senior postgraduate trainees on orthopaedic examinations and are increasingly trusted by clinicians to provide explanations for educational and decision-support purposes. However, correct answers do not necessarily indicate sound reasoning or reliable referencing. Current evaluations in this field emphasise accuracy alone, ignoring the quality and evidentiary reliability of the reasoning process.

Aim: This study aimed to systematically evaluate the relationship between answer accuracy, reasoning quality, and reference reliability in the latest generation of LLMs applied to a standardised postgraduate orthopaedic examination.

Methods: The 2024 Orthopaedic In-Training Examination (OITE; 203 questions) was administered to GPT-5 (OpenAI, San Francisco, CA, USA). The model was prompted to provide one answer, a…

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Clinical Reasoning and Diagnostic Skills · Machine Learning in Healthcare