Investigating Training and Generalization in Faithful Self-Explanations of Large Language Models

Tomoki Doi; Masaru Isonuma; Hitomi Yanaka

arXiv:2512.07288 · cs.CL · December 9, 2025

TL;DR

This paper investigates methods to improve the faithfulness of self-generated explanations in large language models, demonstrating that training enhances faithfulness across styles and tasks, with evidence of generalization and cross-style transfer.

Contribution

It introduces a training approach using pseudo-faithful explanations to improve and generalize faithfulness in LLM self-explanations across multiple styles and tasks.

Findings

1. Training improves explanation faithfulness across tasks and styles
2. Improvements generalize to multi-word explanations and unseen tasks
3. Cross-style generalization indicates broader faithfulness enhancement

Abstract

Large language models have the potential to generate explanations for their own predictions in a variety of styles based on user instructions. Recent research has examined whether these self-explanations faithfully reflect the models' actual behavior and has found that they often lack faithfulness. However, the question of how to improve faithfulness remains underexplored. Moreover, because different explanation styles have superficially distinct characteristics, it is unclear whether improvements observed in one style also arise when using other styles. This study analyzes the effects of training for faithful self-explanations and the extent to which these effects generalize, using three classification tasks and three explanation styles. We construct one-word constrained explanations that are likely to be faithful using a feature attribution method, and use these pseudo-faithful…
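
As a rough illustration of how a one-word explanation might be derived from a feature attribution method, the sketch below scores each word by occlusion (leave-one-out) and picks the word that contributes most to the predicted class. The abstract does not say which attribution method the authors use, so occlusion is an assumption made here for illustration. The classifier `toy_positive_score`, its word weights, and the example sentence are likewise hypothetical stand-ins, not the paper's models or data.

```python
from typing import Callable, List, Tuple


def toy_positive_score(words: List[str]) -> float:
    """Hypothetical stand-in classifier: a score for the 'positive' class."""
    weights = {"great": 2.0, "good": 1.0, "boring": -1.5}
    return sum(weights.get(w.lower(), 0.0) for w in words)


def occlusion_attributions(
    words: List[str], score_fn: Callable[[List[str]], float]
) -> List[Tuple[str, float]]:
    """Attribute to each word the drop in score caused by removing it."""
    base = score_fn(words)
    attributions = []
    for i, word in enumerate(words):
        reduced = words[:i] + words[i + 1:]  # input with word i occluded
        attributions.append((word, base - score_fn(reduced)))
    return attributions


def one_word_explanation(
    words: List[str], score_fn: Callable[[List[str]], float]
) -> str:
    """Return the word with the largest attribution toward the scored class."""
    attributions = occlusion_attributions(words, score_fn)
    return max(attributions, key=lambda pair: pair[1])[0]


# Assumed example input; the toy model scores it as positive (score > 0).
sentence = "The plot was boring but the acting was great".split()
attributions = occlusion_attributions(sentence, toy_positive_score)
explanation = one_word_explanation(sentence, toy_positive_score)
print(explanation)
```

In a setup like the one the abstract describes, a word selected this way could serve as the target of a pseudo-faithful one-word explanation during training. The details of how the authors build and use those targets are not given in the text above.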

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education · Machine Learning in Materials Science