Evaluating Human Alignment and Model Faithfulness of LLM Rationale
Mohsen Fayyaz, Fan Yin, Jiao Sun, Nanyun Peng

TL;DR
This paper compares prompting-based and attribution-based rationales in large language models, finding that attribution-based rationales are generally better aligned with human rationales and more faithful to the models' decision processes, especially after fine-tuning.
Contribution
It systematically evaluates rationale extraction methods across three classification datasets with annotated rationales, highlighting the limitations of prompting-based explanations and the benefits of attribution-based methods.
Findings
Attribution-based explanations are more aligned with human rationales.
Fine-tuning improves attribution-based rationale alignment.
Prompting-based explanations are less faithful and less aligned than attribution-based ones.
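As a rough illustration of what "alignment with human rationales" can mean in practice, the sketch below scores a rationale by token-level F1 against a human-annotated token set. This is an assumption for illustration: the summary does not say which alignment metric the paper uses, and the sentence, scores, and token sets are hypothetical.

```python
def top_k_rationale(scores, k):
    """Return the indices of the k highest-scoring tokens as a set."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def token_f1(predicted, human):
    """Token-level F1 between two sets of token indices."""
    overlap = len(predicted & human)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(human)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: none of these values come from the paper.
tokens = ["the", "movie", "was", "absolutely", "wonderful"]
human = {3, 4}                                        # annotator marked "absolutely wonderful"
attribution_scores = [0.02, 0.10, 0.01, 0.35, 0.52]   # made-up attribution scores
prompted = {1, 4}                                     # made-up tokens the model named when prompted

attr_rationale = top_k_rationale(attribution_scores, k=2)
print("attribution F1:", token_f1(attr_rationale, human))
print("prompting F1:  ", token_f1(prompted, human))
```

Here the top-2 attribution tokens coincide with the human rationale, while the prompted rationale shares only one of its two tokens with it, so the first score is higher.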
Abstract
We study how well large language models (LLMs) explain their generations through rationales -- a set of tokens extracted from the input text that reflect the decision-making process of LLMs. Specifically, we systematically study rationales derived using two approaches: (1) popular prompting-based methods, where prompts are used to guide LLMs in generating rationales, and (2) technical attribution-based methods, which leverage attention or gradients to identify important tokens. Our analysis spans three classification datasets with annotated rationales, encompassing tasks with varying performance levels. While prompting-based self-explanations are widely used, our study reveals that these explanations are not always as "aligned" with the human rationale as attribution-based explanations. Even more so, fine-tuning LLMs to enhance classification task accuracy does not enhance the alignment…
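To make "attribution-based methods, which leverage attention or gradients" more concrete, here is a minimal sketch of one common gradient-based attribution, input-times-gradient, on a toy classifier. The toy model (a tanh over mean-pooled embeddings followed by a linear readout), the function names, and the random inputs are all assumptions for illustration; this is not the paper's LLM setup, and the truncated abstract above does not say which gradient method is used.

```python
import numpy as np


def logit(X, w):
    """Toy model (hypothetical): target-class logit = w . tanh(mean of token embeddings)."""
    return float(w @ np.tanh(X.mean(axis=0)))


def input_x_gradient(X, w):
    """Per-token input-times-gradient scores for the toy model above.

    X: (n_tokens, d) token embeddings; w: (d,) readout weights for the target class.
    """
    n = X.shape[0]
    h = np.tanh(X.mean(axis=0))
    # d logit / d X[i] is the same vector for every token i, because of mean pooling.
    grad_per_token = w * (1.0 - h ** 2) / n
    # Multiply each embedding elementwise by its gradient and sum over dimensions.
    return (X * grad_per_token).sum(axis=1)


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))   # 5 tokens with 4-dimensional embeddings (random stand-ins)
w = rng.normal(size=4)

scores = input_x_gradient(X, w)
ranking = np.argsort(-scores)  # token indices from most to least important
print("scores:", scores)
print("ranking:", ranking)
```

With a real LLM the gradient would come from automatic differentiation rather than a hand-derived formula, but the scoring step (embedding times gradient, summed per token) has the same shape.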
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Islamic Finance and Banking Studies · Business Process Modeling and Analysis
Methods: Softmax · Attention Is All You Need · Sparse Evolutionary Training · ALIGN
