Accuracy Is Not Enough: Reasoning and Reference Reliability in Orthopaedic Large Language Model (LLM) Applications
Shashwat Singh, Pranav Chandrasekhar

TL;DR
This study shows that although GPT-5 performs well on a postgraduate orthopaedic examination, it often cites incorrect or unreliable references, even when its answers are correct.
Contribution
The study introduces a systematic evaluation of reasoning quality and reference reliability in LLMs for orthopaedics, beyond just accuracy.
Findings
GPT-5 achieved 78.3% accuracy on the OITE, outperforming senior trainees.
33% of GPT-5's answers cited fabricated or misrepresented evidence, with higher hallucination rates for incorrect answers.
Correct answers often relied on flawed references, highlighting the need for evaluating reasoning and evidence quality.
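The findings above separate three quantities: answer accuracy, the overall rate of flawed references, and that rate split by whether the answer was correct. A minimal sketch of the bookkeeping follows, assuming each graded response reduces to two boolean labels. The `GradedResponse` type, the `summarise` helper, and the five toy records are hypothetical illustrations, not the authors' grading rubric or study data.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class GradedResponse:
    """One model answer after manual grading (hypothetical schema)."""
    correct: bool            # answer matched the exam key
    reference_flawed: bool   # cited evidence was fabricated or misrepresented


def rate(flags: Iterable[bool]) -> float:
    """Fraction of True values; NaN when the group is empty."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else float("nan")


def summarise(responses: list[GradedResponse]) -> dict[str, float]:
    """Report accuracy and flawed-reference rates, overall and by correctness."""
    correct = [r for r in responses if r.correct]
    incorrect = [r for r in responses if not r.correct]
    return {
        "accuracy": rate(r.correct for r in responses),
        "flawed_overall": rate(r.reference_flawed for r in responses),
        "flawed_given_correct": rate(r.reference_flawed for r in correct),
        "flawed_given_incorrect": rate(r.reference_flawed for r in incorrect),
    }


# Toy records for illustration only; these are NOT the study's data.
toy = [
    GradedResponse(correct=True, reference_flawed=False),
    GradedResponse(correct=True, reference_flawed=True),
    GradedResponse(correct=True, reference_flawed=False),
    GradedResponse(correct=False, reference_flawed=True),
    GradedResponse(correct=True, reference_flawed=False),
]

summary = summarise(toy)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Reporting the stratified rates next to accuracy is what exposes the case the study highlights: a correct answer that rests on a flawed reference counts towards accuracy but also towards `flawed_given_correct`.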
Abstract
Background: Large language models (LLMs) now achieve performance comparable to senior postgraduate trainees on orthopaedic examinations and are increasingly trusted by clinicians to provide explanations for educational and decision-support purposes. However, correct answers do not necessarily indicate sound reasoning or reliable referencing. Current evaluations in this field emphasise accuracy alone, ignoring the quality and evidentiary reliability of the reasoning process.

Aim: This study aimed to systematically evaluate the relationship between answer accuracy, reasoning quality, and reference reliability in the latest generation of LLMs applied to a standardised postgraduate orthopaedic examination.

Methods: The 2024 Orthopaedic In-Training Examination (OITE; 203 questions) was administered to GPT-5 (OpenAI, San Francisco, CA, USA). The model was prompted to provide one answer, a…
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Clinical Reasoning and Diagnostic Skills · Machine Learning in Healthcare
