Diagnosing Hallucination Risk in AI Surgical Decision-Support: A Sequential Framework for Sequential Validation

Dong Chen; Yanzhe Wei; Zonglin He; Guan-Ming Kuang; Canhua Ye; Meiru An; Huili Peng; Yong Hu; Huiren Tao; Kenneth MC Cheung

arXiv:2511.00588 · cs.LG · November 21, 2025

Open Access

TL;DR

This paper presents a clinician-centered framework to evaluate hallucination risks in LLMs used for spine surgery decision support, revealing model vulnerabilities and emphasizing the need for interpretability and safety validation.

Contribution

Introduces a comprehensive validation framework for clinical LLMs, assessing hallucination risks and model robustness in high-stakes surgical decision-making.

Findings

01. DeepSeek-R1 achieved the highest total score (86.03 $\pm$ 2.08) among the six LLMs evaluated.

02. Reasoning-enhanced model variants did not uniformly outperform their standard counterparts.

03. Stress-testing revealed specific vulnerabilities under complex scenarios.

Abstract

Large language models (LLMs) offer transformative potential for clinical decision support in spine surgery but pose significant risks through hallucinations, which are factually inconsistent or contextually misaligned outputs that may compromise patient safety. This study introduces a clinician-centered framework to quantify hallucination risks by evaluating diagnostic precision, recommendation quality, reasoning robustness, output coherence, and knowledge alignment. We assessed six leading LLMs across 30 expert-validated spinal cases. DeepSeek-R1 demonstrated superior overall performance (total score: 86.03 $\pm$ 2.08), particularly in high-stakes domains such as trauma and infection. A critical finding reveals that reasoning-enhanced model variants did not uniformly outperform standard counterparts: Claude-3.7-Sonnet's extended thinking mode underperformed relative to its standard…
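
The abstract reports a total score as a mean with a spread (86.03 $\pm$ 2.08) built from five evaluation dimensions. The sketch below shows one way such a figure could be aggregated. It is not the authors' rubric: the equal weighting, the 0 to 20 scale per dimension, the reading of the spread as a sample standard deviation across cases, and the function names `case_total` and `summarize` are all assumptions made for illustration, and the ratings are toy values rather than study data.

```python
from statistics import mean, stdev

# Dimension names follow the abstract; the 0-20 scale and equal weighting
# are assumptions for this sketch, not the paper's scoring rubric.
DIMENSIONS = (
    "diagnostic_precision",
    "recommendation_quality",
    "reasoning_robustness",
    "output_coherence",
    "knowledge_alignment",
)
MAX_PER_DIMENSION = 20  # assumed, so that a perfect case totals 100


def case_total(ratings):
    """Sum one case's per-dimension ratings into a total out of 100."""
    missing = [d for d in DIMENSIONS if d not in ratings]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for d in DIMENSIONS:
        if not 0 <= ratings[d] <= MAX_PER_DIMENSION:
            raise ValueError(f"{d} out of range: {ratings[d]}")
    return sum(ratings[d] for d in DIMENSIONS)


def summarize(cases):
    """Return (mean, sample standard deviation) of the case totals."""
    totals = [case_total(c) for c in cases]
    return mean(totals), stdev(totals)


# Toy ratings for illustration only; these are not data from the study.
toy_cases = [
    dict(zip(DIMENSIONS, (18, 17, 16, 19, 17))),
    dict(zip(DIMENSIONS, (16, 18, 15, 18, 16))),
    dict(zip(DIMENSIONS, (19, 17, 17, 18, 18))),
]
avg, sd = summarize(toy_cases)
print(f"total score: {avg:.2f} ± {sd:.2f}")
```

Keeping the per-case totals separate from the summary step makes it straightforward to report the same statistic per clinical domain (for example trauma or infection) by filtering the case list before calling `summarize`.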

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Clinical Reasoning and Diagnostic Skills