Aligning What LLMs Do and Say: Towards Self-Consistent Explanations
Sahar Admoni, Ofra Amir, Assaf Hallak, Yftah Ziser

TL;DR
This paper introduces PSCB, a large-scale benchmark to measure and improve the alignment between LLMs' answers and their explanations, enhancing interpretability and faithfulness.
Contribution
It presents PSCB for large-scale evaluation, finds that Spearman rank correlation is a more reliable alignment signal than cosine similarity, and applies Direct Preference Optimization (DPO) to improve alignment without sacrificing accuracy.
Findings
Spearman rank correlation better indicates alignment than cosine similarity.
DPO improves alignment between answers and explanations.
Fine-tuning on attribution data outperforms supervised fine-tuning.
Abstract
Large language models (LLMs) seem to offer an easy path to interpretability: just ask them to explain their answers. Yet the features driving an answer often differ from those emphasized in its explanation, meaning post-hoc rationales can misrepresent what actually shaped the model's output. We quantify this gap by comparing the feature-importance distributions of answers and their explanations. Prior analyses reveal such discrepancies, but large-scale study has been limited by the high computational cost of attribution methods. To address this, we introduce the Post-hoc Self-Consistency Bank (PSCB), a large-scale benchmark linking model decisions with diverse explanations and attribution vectors across datasets, methods, and model families. Using PSCB, we find that Spearman rank correlation provides a more reliable signal of alignment than cosine similarity. Building on this insight,…
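To make the two alignment measures named in the abstract concrete, the following is a minimal sketch of how Spearman rank correlation and cosine similarity can be computed between an answer's attribution vector and its explanation's attribution vector. It is not the authors' implementation: the function names and the toy attribution values are hypothetical, and the sketch assumes both vectors score the same input features in the same order.

```python
import math


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        return float("nan")
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def cosine(x, y):
    """Cosine similarity; returns NaN if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return float("nan")
    return dot / (norm_x * norm_y)


# Hypothetical attribution scores over the same four input features.
answer_attr = [0.90, 0.05, 0.03, 0.02]
explanation_attr = [0.90, 0.02, 0.03, 0.05]

print(f"cosine   = {cosine(answer_attr, explanation_attr):.3f}")
print(f"spearman = {spearman(answer_attr, explanation_attr):.3f}")
```

In this toy case the two vectors agree on one dominant feature but order the remaining features differently. Cosine similarity comes out close to 1 because the large shared component dominates the dot product, while Spearman correlation is 0.2 because it depends only on the ordering. This illustrates one way the two measures can diverge; it is not a claim about why the paper finds Spearman more reliable.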
