Evaluating Human Alignment and Model Faithfulness of LLM Rationale

Mohsen Fayyaz; Fan Yin; Jiao Sun; Nanyun Peng

arXiv:2407.00219·cs.CL·October 23, 2024·2 cites

TL;DR

This paper compares prompting-based and attribution-based rationales in large language models and finds that attribution-based rationales are generally better aligned with human rationales and more faithful to the models' decision processes, especially after fine-tuning.

Contribution

The paper systematically evaluates rationale extraction methods across three classification datasets with annotated rationales, highlighting the limitations of prompting-based explanations and the benefits of attribution-based methods.

Findings

01. Attribution-based explanations are more aligned with human rationales.

02. Fine-tuning improves attribution-based rationale alignment.

03. Prompting-based explanations are less faithful and less aligned than attribution-based ones.

Abstract

We study how well large language models (LLMs) explain their generations through rationales -- a set of tokens extracted from the input text that reflect the decision-making process of LLMs. Specifically, we systematically study rationales derived using two approaches: (1) popular prompting-based methods, where prompts are used to guide LLMs in generating rationales, and (2) technical attribution-based methods, which leverage attention or gradients to identify important tokens. Our analysis spans three classification datasets with annotated rationales, encompassing tasks with varying performance levels. While prompting-based self-explanations are widely used, our study reveals that these explanations are not always as "aligned" with the human rationale as attribution-based explanations. Even more so, fine-tuning LLMs to enhance classification task accuracy does not enhance the alignment…
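To make the notion of "alignment" concrete, here is a minimal sketch of one common way to compare an extracted rationale with a human-annotated one: keep the top-k tokens by importance score and compute token-level F1 against the human rationale. This is an illustration only. The excerpt above does not say which alignment metric the paper uses, and the function names `top_k_rationale` and `token_f1`, the scores, and the annotation are all hypothetical.

```python
from typing import Sequence, Set


def top_k_rationale(scores: Sequence[float], k: int) -> Set[int]:
    """Return the positions of the k highest-scoring tokens."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def token_f1(predicted: Set[int], human: Set[int]) -> float:
    """Token-level F1 between a predicted rationale and a human one."""
    overlap = len(predicted & human)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(human)
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-token importance scores for a 6-token input (made-up
# numbers; in practice they could come from attention or gradients).
scores = [0.05, 0.40, 0.10, 0.30, 0.02, 0.13]

# Hypothetical token positions that a human annotator marked as the rationale.
human_rationale = {1, 3, 4}

predicted = top_k_rationale(scores, k=3)
alignment = token_f1(predicted, human_rationale)
print(f"predicted={sorted(predicted)} alignment F1={alignment:.3f}")
```

The same comparison applies to a prompting-based rationale: the set of tokens the model names in its self-explanation would simply replace `predicted`.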

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Islamic Finance and Banking Studies · Business Process Modeling and Analysis

Methods: Softmax · Attention Is All You Need · Sparse Evolutionary Training · ALIGN