Faithful or Just Plausible? Evaluating the Faithfulness of Closed-Source LLMs in Medical Reasoning

Halimat Afolabi; Zainab Afolabi; Elizabeth Friel; Jude Roberts; Antonio Ji-Xu; Lloyd Chen; Egheosa Ogbomo; Emiliomo Imevbore; Phil Eneje; Wissal El Ouahidi; Aaron Sohal; Alisa Kennan; Shreya Srivastava; Anirudh Vairavan; Laura Napitu; Katie McClure

arXiv:2603.13988 · cs.AI · May 13, 2026

TL;DR

This study systematically evaluates the faithfulness of three popular closed-source LLMs in medical reasoning. It finds that they often generate plausible but unfaithful explanations, which highlights the need for faithfulness-focused assessments.

Contribution

The paper introduces three perturbation-based probes (causal ablation, positional bias, and hint injection) for assessing the faithfulness of closed-source LLMs' medical reasoning, and provides empirical evidence of these models' limitations.

Findings

01. Chain-of-thought reasoning often does not causally influence predictions.
02. Models readily incorporate external hints without acknowledgment.
03. Positional biases had minimal impact in this setting.

Abstract

Closed-source large language models (LLMs), such as ChatGPT and Gemini, are increasingly consulted for medical advice, yet their explanations may appear plausible while failing to reflect the model's underlying reasoning process. This gap poses serious risks as patients and clinicians may trust coherent but misleading explanations. We conduct a systematic black-box evaluation of faithfulness in medical reasoning among three widely used closed-source LLMs. Our study consists of three perturbation-based probes: (1) causal ablation, testing whether stated chain-of-thought (CoT) reasoning causally influences predictions; (2) positional bias, examining whether models create post-hoc justifications for answers driven by input positioning; and (3) hint injection, testing susceptibility to external suggestions. We complement these quantitative probes with a small-scale human evaluation of model…
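The perturbation-based probes described in the abstract can be scored in a black-box way by comparing a model's answer before and after a perturbation. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `ProbeRecord` schema, the function names, and the keyword check for hint acknowledgment are all hypothetical, and the paper's actual metrics are not given in this excerpt.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class ProbeRecord:
    """One question answered before and after a perturbation (hypothetical schema)."""
    baseline_answer: str                # answer to the unperturbed prompt
    perturbed_answer: str               # answer after ablation, reordering, or a hint
    hint_answer: Optional[str] = None   # option pushed by an injected hint, if any
    explanation: str = ""               # explanation returned with the perturbed answer


def answer_change_rate(records: Sequence[ProbeRecord]) -> float:
    """Fraction of questions whose answer changes under the perturbation."""
    if not records:
        return 0.0
    changed = sum(r.baseline_answer != r.perturbed_answer for r in records)
    return changed / len(records)


def hint_following(
    records: Sequence[ProbeRecord],
    markers: Tuple[str, ...] = ("hint", "suggest"),
) -> Tuple[float, float]:
    """Return (follow_rate, silent_rate).

    follow_rate: share of eligible questions where the model switches to the
    hinted option. A question is eligible only if the baseline answer differed
    from the hint, so agreement cannot be mistaken for influence.
    silent_rate: share of those switches whose explanation never mentions the
    hint. Keyword matching is an assumed, crude stand-in for human judgment.
    """
    eligible = [
        r for r in records
        if r.hint_answer is not None and r.baseline_answer != r.hint_answer
    ]
    followers = [r for r in eligible if r.perturbed_answer == r.hint_answer]
    follow_rate = len(followers) / len(eligible) if eligible else 0.0
    silent = [
        r for r in followers
        if not any(m in r.explanation.lower() for m in markers)
    ]
    silent_rate = len(silent) / len(followers) if followers else 0.0
    return follow_rate, silent_rate
```

Under these assumptions, a low `answer_change_rate` after the stated chain-of-thought is removed or corrupted would suggest that the reasoning does not causally drive the prediction, while a high `silent_rate` would correspond to hints being incorporated without acknowledgment.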
