Faithfulness Serum: Mitigating the Faithfulness Gap in Textual Explanations of LLM Decisions via Attribution Guidance
Bar Alon, Itamar Zimerman, Lior Wolf

TL;DR
This paper introduces a training-free approach that improves the faithfulness of LLM explanations by steering explanation generation with attention-level interventions based on attribution heatmaps, narrowing the gap between subjective and epistemic faithfulness.
Contribution
It proposes a novel, training-free method that enhances the epistemic faithfulness of LLM explanations using attribution-guided attention interventions.
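As a rough illustration of what an attribution-guided attention intervention could look like, the sketch below adds a bias to pre-softmax attention scores in proportion to a per-token attribution heatmap. This is not the paper's implementation: the function names, the additive form of the bias, the `strength` parameter, and the assumption of a non-negative heatmap are all hypothetical choices made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def guided_attention(logits, attribution, strength=1.0):
    """Bias attention logits toward tokens with high attribution.

    Hypothetical sketch: each attribution score is normalized by the
    largest score and added, scaled by `strength`, to the matching
    pre-softmax logit. No model weights are changed, so the
    intervention is training-free.
    """
    if len(logits) != len(attribution):
        raise ValueError("logits and attribution must have equal length")
    peak = max(attribution)
    if peak <= 0:
        # Nothing to guide with; fall back to ordinary attention.
        return softmax(logits)
    biased = [l + strength * (a / peak) for l, a in zip(logits, attribution)]
    return softmax(biased)


# Toy example: four input tokens with uniform attention before guidance.
logits = [0.0, 0.0, 0.0, 0.0]
attribution = [0.1, 0.9, 0.0, 0.2]  # assumed heatmap; token 1 matters most
baseline = softmax(logits)
guided = guided_attention(logits, attribution, strength=2.0)
```

In this toy setting the guided distribution shifts attention mass toward the token with the highest attribution and away from the token with zero attribution, while still summing to one.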
Findings
Significantly improves epistemic faithfulness across multiple models
Addresses the gap between subjective appearance and actual evidence reliance
Works effectively across various benchmarks and prompts
Abstract
Large language models (LLMs) achieve strong performance and have revolutionized NLP, but their lack of explainability keeps them treated as black boxes, limiting their use in domains that demand transparency and trust. A promising direction to address this issue is post-hoc text-based explanations, which aim to explain model decisions in natural language. Prior work has focused on generating convincing rationales that appear to be subjectively faithful, but it remains unclear whether these explanations are epistemically faithful, that is, whether they reflect the internal evidence the model actually relied on for its decision. In this paper, we first assess the epistemic faithfulness of LLM-generated explanations via counterfactuals and show that they are often unfaithful. We then introduce a training-free method that enhances faithfulness by guiding explanation generation through…
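The counterfactual assessment mentioned in the abstract can be pictured with a small sketch: remove the evidence an explanation cites and check whether the decision changes. The abstract does not specify the actual protocol, so everything below (the keyword-weight classifier, the token-deletion counterfactual, and the names `toy_classifier` and `decision_flips`) is a hypothetical stand-in rather than the authors' procedure.

```python
def toy_classifier(tokens):
    """Stand-in for an LLM decision: a hypothetical keyword-weight sentiment model."""
    weights = {"excellent": 2.0, "boring": -1.5, "plot": 0.1}
    score = sum(weights.get(t, 0.0) for t in tokens)
    return "positive" if score > 0 else "negative"


def decision_flips(tokens, cited, classify):
    """Return True if deleting the cited tokens changes the decision.

    If an explanation names evidence the model truly relied on,
    removing that evidence should tend to change the prediction.
    """
    original = classify(tokens)
    counterfactual = [t for t in tokens if t not in cited]
    return classify(counterfactual) != original


tokens = ["the", "plot", "was", "excellent", "but", "boring"]

# An explanation citing "excellent" names evidence the toy model depends on.
faithful = decision_flips(tokens, {"excellent"}, toy_classifier)

# An explanation citing "plot" sounds plausible but names near-irrelevant evidence.
unfaithful = decision_flips(tokens, {"plot"}, toy_classifier)
```

Here deleting "excellent" flips the toy decision, while deleting "plot" leaves it unchanged, which is the kind of mismatch between a convincing rationale and the evidence actually used that the abstract describes.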
