Towards Transparent Reasoning: What Drives Faithfulness in Large Language Models?
Teague McMillan, Gabriele Dominici, Martin Gjoreski, Marc Langheinrich

TL;DR
This paper investigates how inference and training choices affect the faithfulness of explanations generated by large language models, with implications for improving trustworthiness in healthcare and social bias contexts.
Contribution
The paper systematically evaluates how few-shot examples, prompting strategies, and training procedures influence explanation faithfulness in LLMs.
Findings
Few-shot example quantity and quality impact faithfulness
Prompting design significantly affects explanation faithfulness
Instruction tuning improves faithfulness in medical tasks
Abstract
Large Language Models (LLMs) often produce explanations that do not faithfully reflect the factors driving their predictions. In healthcare settings, such unfaithfulness is especially problematic: explanations that omit salient clinical cues or mask spurious shortcuts can undermine clinician trust and lead to unsafe decision support. We study how inference-time and training-time choices shape explanation faithfulness, focusing on factors practitioners can control at deployment. We evaluate three LLMs (GPT-4.1-mini, LLaMA 70B, LLaMA 8B) on two datasets, BBQ (social bias) and MedQA (medical licensing questions), and manipulate the number and type of few-shot examples, prompting strategies, and training procedure. Our results show: (i) both the quantity and quality of few-shot examples significantly impact model faithfulness; (ii) faithfulness is sensitive to prompting design; (iii) the…
