A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior

Harry Mayne; Justin Singh Kang; Dewi Gould; Kannan Ramchandran; Adam Mahdi; Noah Y. Siegel

arXiv:2602.02639·cs.AI·February 4, 2026

Open Access

TL;DR

This paper introduces Normalized Simulatability Gain (NSG), a metric for evaluating the faithfulness of LLM self-explanations, and shows that self-explanations substantially improve an observer's ability to predict model behavior across 18 models and datasets covering health, business, and ethics.

Contribution

The paper proposes NSG, a scalable metric for assessing explanation faithfulness, and provides empirical evidence that self-explanations improve prediction of model behavior, and do so more than external explanations.

Findings

1. Self-explanations improve behavior prediction by 11-37% NSG.

2. Self-generated explanations outperform external explanations in predictive power.

3. 5-15% of self-explanations are egregiously misleading.

Abstract

LLM self-explanations are often presented as a promising tool for AI oversight, yet their faithfulness to the model's true reasoning process is poorly understood. Existing faithfulness metrics have critical limitations, typically relying on identifying unfaithfulness via adversarial prompting or detecting reasoning errors. These methods overlook the predictive value of explanations. We introduce Normalized Simulatability Gain (NSG), a general and scalable metric based on the idea that a faithful explanation should allow an observer to learn a model's decision-making criteria, and thus better predict its behavior on related inputs. We evaluate 18 frontier proprietary and open-weight models, e.g., Gemini 3, GPT-5.2, and Claude 4.5, on 7,000 counterfactuals from popular datasets covering health, business, and ethics. We find self-explanations substantially improve prediction of model…
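The idea behind a simulatability-style metric can be illustrated with a small sketch. The abstract shown here does not give the NSG formula, so the normalization below (the accuracy gain divided by the remaining headroom to perfect prediction) is an assumption chosen for illustration, not the paper's definition. The function names `accuracy` and `normalized_gain` and the toy labels are hypothetical.

```python
from typing import Sequence


def accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of the model's actual outputs that the observer predicted."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def normalized_gain(acc_without: float, acc_with: float) -> float:
    """Gain from seeing the explanation, scaled by the available headroom.

    ASSUMPTION: this normalization is illustrative only; the source text
    does not state how NSG is computed.
    """
    headroom = 1.0 - acc_without
    if headroom <= 0.0:
        raise ValueError("no headroom: baseline prediction is already perfect")
    return (acc_with - acc_without) / headroom


# Toy data (hypothetical): the model's actual answers on four counterfactuals.
model_outputs = ["yes", "no", "no", "yes"]
# Observer guesses without and with access to the model's self-explanation.
guesses_without = ["yes", "yes", "yes", "yes"]
guesses_with = ["yes", "no", "yes", "yes"]

baseline = accuracy(guesses_without, model_outputs)
informed = accuracy(guesses_with, model_outputs)
gain = normalized_gain(baseline, informed)
print(f"baseline={baseline:.2f} informed={informed:.2f} gain={gain:.2f}")
```

Under this assumed normalization, a value of 0 means the explanation did not help the observer at all, 1 means it closed the full gap to perfect prediction, and a negative value means the explanation misled the observer.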

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education · Ethics and Social Impacts of AI