Investigating Training and Generalization in Faithful Self-Explanations of Large Language Models
Tomoki Doi, Masaru Isonuma, Hitomi Yanaka

TL;DR
This paper investigates how to improve the faithfulness of the explanations that large language models generate for their own predictions. It shows that training improves faithfulness across explanation styles and tasks, and that the improvement generalizes, including from one explanation style to another.
Contribution
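The paper introduces a training approach that uses pseudo-faithful explanations, constructed with a feature attribution method, to improve the faithfulness of LLM self-explanations, and it measures how far that improvement generalizes across explanation styles and tasks.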
Findings
Training improves explanation faithfulness across tasks and styles
Improvements generalize to multi-word explanations and unseen tasks
Cross-style generalization indicates broader faithfulness enhancement
Abstract
Large language models have the potential to generate explanations for their own predictions in a variety of styles based on user instructions. Recent research has examined whether these self-explanations faithfully reflect the models' actual behavior and has found that they often lack faithfulness. However, the question of how to improve faithfulness remains underexplored. Moreover, because different explanation styles have superficially distinct characteristics, it is unclear whether improvements observed in one style also arise when using other styles. This study analyzes the effects of training for faithful self-explanations and the extent to which these effects generalize, using three classification tasks and three explanation styles. We construct one-word constrained explanations that are likely to be faithful using a feature attribution method, and use these pseudo-faithful…
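As a rough illustration of the idea of building a one-word explanation from feature attributions, the sketch below scores each input word by leave-one-out occlusion and returns the highest-scoring word. The abstract does not say which attribution method the authors use, so occlusion is a stand-in here, and the toy lexicon classifier and all function names (`toy_positive_score`, `leave_one_out_attributions`, `one_word_explanation`) are hypothetical, not taken from the paper.

```python
from typing import Callable, List, Tuple

# Hypothetical toy sentiment lexicon; it stands in for a real model's behavior.
TOY_WEIGHTS = {"great": 2.0, "good": 1.0, "boring": -1.5, "terrible": -2.5}


def toy_positive_score(tokens: List[str]) -> float:
    """Hypothetical classifier: sum of lexicon weights (>= 0 means 'positive')."""
    return sum(TOY_WEIGHTS.get(t.lower(), 0.0) for t in tokens)


def leave_one_out_attributions(
    tokens: List[str], score_fn: Callable[[List[str]], float]
) -> List[Tuple[str, float]]:
    """Attribute the prediction to each word by deleting it and re-scoring.

    A word gets a positive attribution when removing it pushes the score
    away from the predicted class, i.e. the word supported the prediction.
    """
    base = score_fn(tokens)
    sign = 1.0 if base >= 0 else -1.0  # direction of the predicted class
    attributions = []
    for i, tok in enumerate(tokens):
        reduced = tokens[:i] + tokens[i + 1:]
        attributions.append((tok, sign * (base - score_fn(reduced))))
    return attributions


def one_word_explanation(
    tokens: List[str], score_fn: Callable[[List[str]], float]
) -> str:
    """Pick the single word with the largest attribution (first one on ties)."""
    attributions = leave_one_out_attributions(tokens, score_fn)
    return max(attributions, key=lambda pair: pair[1])[0]


if __name__ == "__main__":
    sentence = "The plot was boring but the acting was great".split()
    print(one_word_explanation(sentence, toy_positive_score))
```

In a setting like the one the abstract describes, the selected word would serve as the target explanation for a training example; how the authors actually build and use those targets is not recoverable from the truncated text.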
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education · Machine Learning in Materials Science
