NeuroFaith: Evaluating LLM Self-Explanation Faithfulness via Internal Representation Alignment
Milan Bhan, Jean-Noel Vittaut, Nicolas Chesneau, Sarath Chandar, Marie-Jeanne Lesot

TL;DR
NeuroFaith introduces a novel framework for evaluating and improving the faithfulness of LLM self-explanations by analyzing internal neural representations and concept influence.
Contribution
It presents a flexible, representation-based method to measure and enhance the faithfulness of LLM explanations, addressing limitations of prior behavioral and computational approaches.
Findings
NeuroFaith effectively assesses faithfulness in reasoning and classification tasks.
The linear faithfulness probe detects unfaithful explanations from internal representations.
Steering based on NeuroFaith improves explanation faithfulness.
Abstract
Large Language Models (LLMs) can generate plausible free-text self-explanations to justify their answers. However, these natural language explanations may not accurately reflect the model's actual reasoning process, indicating a lack of faithfulness. Existing faithfulness evaluation methods rely primarily on behavioral tests or computational block analysis without examining the semantic content of internal neural representations. This paper proposes NeuroFaith, a flexible framework that measures the faithfulness of LLM free-text self-explanations by identifying key concepts within explanations and mechanistically testing whether these concepts actually influence the model's predictions. We show the versatility of NeuroFaith across 2-hop reasoning and classification tasks. Additionally, we develop a linear faithfulness probe based on NeuroFaith to detect unfaithful self-explanations from…
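
To make the idea of a linear faithfulness probe more concrete, here is a minimal sketch of a linear probe in general: a logistic-regression classifier trained on vectors that stand in for hidden representations. It is not the authors' implementation. The activations are synthetic, the labels (1 = faithful, 0 = unfaithful) are assigned artificially, and every function name is hypothetical. A real probe would presumably be trained on hidden states extracted from an LLM and labeled by a faithfulness measure such as NeuroFaith.

```python
import math
import random


def make_synthetic_activations(n, dim, rng):
    """Hypothetical stand-in for hidden states (assumption: not real LLM data).

    Label 1 plays the role of 'faithful', label 0 of 'unfaithful'. The two
    classes are shifted in opposite directions along a fixed vector.
    """
    direction = [1.0 if i % 2 == 0 else -1.0 for i in range(dim)]
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        shift = 1.0 if label == 1 else -1.0
        x = [rng.gauss(0.0, 1.0) + shift * d for d in direction]
        data.append((x, label))
    return data


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(data, dim, lr=0.1, epochs=50):
    """Fit weights w and bias b by stochastic gradient descent on log loss."""
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def predict(w, b, x):
    # The probe is linear: the decision depends only on the sign of w.x + b.
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0


def accuracy(w, b, data):
    return sum(predict(w, b, x) == y for x, y in data) / len(data)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
DIM = 8
train = make_synthetic_activations(200, DIM, rng)
test = make_synthetic_activations(100, DIM, rng)
w, b = train_linear_probe(train, DIM)
probe_accuracy = accuracy(w, b, test)
print(f"held-out probe accuracy: {probe_accuracy:.2f}")
```

The probe is kept linear on purpose. If a single hyperplane separates the two classes, the distinction is linearly readable from the representation, which is the usual argument for linear probes over more expressive classifiers.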
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education · Topic Modeling
