FaithLM: Towards Faithful Explanations for Large Language Models
Yu-Neng Chuang, Guanchu Wang, Chia-Yuan Chang, Ruixiang Tang, Shaochen Zhong, Fan Yang, Mengnan Du, Xuanting Cai, Vladimir Braverman, and Xia Hu

TL;DR
FaithLM is a model-agnostic framework that evaluates and enhances the faithfulness of large language model explanations by using intervention-based metrics and iterative refinement, leading to more reliable and human-aligned explanations.
Contribution
FaithLM introduces a novel intervention-based evaluation method and an iterative optimization process to improve the faithfulness of LLM explanations without task-specific heuristics.
Findings
FaithLM significantly improves explanation faithfulness across multiple datasets.
The contrary-hint score effectively measures explanation faithfulness.
Iterative refinement enhances the alignment of explanations with human rationales.
Abstract
Large language models (LLMs) increasingly produce natural language explanations, yet these explanations often lack faithfulness: they do not reliably reflect the evidence the model uses to decide. We introduce FaithLM, a model-agnostic framework that evaluates and improves the faithfulness of LLM explanations without token masking or task-specific heuristics. FaithLM formalizes explanation faithfulness as an intervention property: a faithful explanation should yield a prediction shift when its content is contradicted. Theoretical analysis shows that the resulting contrary-hint score is a sound and discriminative estimator of faithfulness. Building on this principle, FaithLM iteratively refines both the elicitation prompt and the explanation to maximize the measured score. Experiments on three multi-domain datasets and multiple LLM backbones demonstrate that FaithLM consistently…
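To make the intervention idea concrete, here is a minimal Python sketch of a contrary-hint style score. It is an illustration under stated assumptions, not the paper's implementation: the function name `contrary_hint_score`, the "Hint:" prompt format, the flip-rate definition of the score, and the keyword-based `toy_predict` stand-in for an LLM are all hypothetical. The abstract does not give the exact formula, so the score is taken here as the fraction of inputs whose prediction changes once a statement contradicting the explanation is added to the prompt.

```python
from typing import Callable, Sequence

# A predictor maps a prompt to a label. In practice this would wrap an LLM call;
# here it is left abstract (assumption, not the paper's interface).
Predictor = Callable[[str], str]


def contrary_hint_score(
    predict: Predictor,
    questions: Sequence[str],
    contrary_hints: Sequence[str],
) -> float:
    """Fraction of inputs whose prediction shifts under a contrary hint.

    Each contrary hint is assumed to state the opposite of the explanation
    given for the matching question. A higher score suggests the explanations
    carried content the model actually relied on.
    """
    if len(questions) != len(contrary_hints):
        raise ValueError("questions and contrary_hints must have equal length")
    if not questions:
        raise ValueError("at least one question is required")

    flipped = 0
    for question, hint in zip(questions, contrary_hints):
        original = predict(question)
        # Intervention: append the contradicting statement (hypothetical format).
        intervened = predict(f"{question}\nHint: {hint}")
        if intervened != original:
            flipped += 1
    return flipped / len(questions)


def toy_predict(prompt: str) -> str:
    # Hypothetical stand-in for an LLM: answers "no" if the prompt says "cannot".
    return "no" if "cannot" in prompt.lower() else "yes"


questions = ["Can penguins swim?", "Can eagles fly?"]
# The first hint contradicts a relevant explanation; the second is unrelated.
contrary_hints = ["Penguins cannot swim.", "The sky is blue."]
score = contrary_hint_score(toy_predict, questions, contrary_hints)
```

In this toy setup only the first hint changes the prediction, so the score separates an explanation whose contradiction matters from one whose contradiction is irrelevant. An iterative refinement loop in the spirit of the abstract would regenerate explanations and keep those that raise this score.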
Taxonomy
Topics: Topic Modeling
