NeuroFaith: Evaluating LLM Self-Explanation Faithfulness via Internal Representation Alignment

Milan Bhan; Jean-Noel Vittaut; Nicolas Chesneau; Sarath Chandar; Marie-Jeanne Lesot

arXiv:2506.09277 · cs.CL · January 30, 2026

Open Access

TL;DR

NeuroFaith introduces a novel framework for evaluating and improving the faithfulness of LLM self-explanations by analyzing internal neural representations and concept influence.

Contribution

The paper presents a flexible, representation-based method to measure and improve the faithfulness of LLM self-explanations, addressing the limitations of prior behavioral and computational approaches.

Findings

01. NeuroFaith effectively assesses faithfulness in reasoning and classification tasks.

02. The linear faithfulness probe detects unfaithful explanations from internal representations.

03. Steering based on NeuroFaith improves explanation faithfulness.
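
To make the second finding concrete, here is a minimal sketch of a linear probe: a logistic-regression classifier trained to separate "faithful" from "unfaithful" examples using a vector representation of each example. Everything in it is an assumption made for illustration. The toy vectors stand in for hidden states, the labels are synthetic rather than produced by NeuroFaith, and the function names (make_toy_data, train_linear_probe, accuracy) are hypothetical. This page does not specify which layers, features, or training procedure the authors use.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_toy_data(seed=0, n_per_class=100, dim=4, noise=0.3):
    """Synthetic stand-in for hidden states (NOT real model activations).

    Label 1 ("faithful") is shifted by +1 along dimension 0,
    label 0 ("unfaithful") by -1; all dimensions get Gaussian noise.
    """
    rng = random.Random(seed)
    features, labels = [], []
    for label, center in ((1, 1.0), (0, -1.0)):
        for _ in range(n_per_class):
            x = [rng.gauss(0.0, noise) for _ in range(dim)]
            x[0] += center
            features.append(x)
            labels.append(label)
    return features, labels


def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit logistic regression with batch gradient descent.

    Returns (weights, bias) of the linear decision function.
    """
    dim = len(features[0])
    n = len(features)
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            err = sigmoid(z) - y  # gradient of log-loss w.r.t. z
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def accuracy(weights, bias, features, labels):
    """Fraction of examples the probe classifies correctly (threshold 0.5)."""
    correct = 0
    for x, y in zip(features, labels):
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        correct += int((sigmoid(z) >= 0.5) == bool(y))
    return correct / len(labels)


X, y = make_toy_data()
w, b = train_linear_probe(X, y)
print("probe training accuracy:", accuracy(w, b, X, y))
```

In a real setting the feature vectors would be activations extracted from the model and the labels would come from a faithfulness measure. The probe is kept linear so that the learned weight vector can be read as a single direction in representation space.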

Abstract

Large Language Models (LLMs) can generate plausible free-text self-explanations to justify their answers. However, these natural language explanations may not accurately reflect the model's actual reasoning process, which indicates a lack of faithfulness. Existing faithfulness evaluation methods rely primarily on behavioral tests or computational block analysis without examining the semantic content of internal neural representations. This paper proposes NeuroFaith, a flexible framework that measures the faithfulness of LLM free-text self-explanations by identifying key concepts within explanations and mechanistically testing whether these concepts actually influence the model's predictions. We show the versatility of NeuroFaith across 2-hop reasoning and classification tasks. Additionally, we develop a linear faithfulness probe based on NeuroFaith to detect unfaithful self-explanations from…

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education · Topic Modeling