Counterfactuals As a Means for Evaluating Faithfulness of Attribution Methods in Autoregressive Language Models

Sepehr Kamahi; Yadollah Yaghoobzadeh

arXiv:2408.11252 · cs.CL · March 11, 2025

TL;DR

This paper introduces a counterfactual-based method for evaluating attribution techniques in autoregressive language models. It targets a known weakness of faithfulness assessment: corrupting or removing input tokens produces out-of-distribution inputs.

Contribution

It proposes a counterfactual generation approach that produces fluent, in-distribution examples, so that the faithfulness of attribution methods can be evaluated without feeding autoregressive models out-of-distribution inputs.

Findings

1. Counterfactual generation improves evaluation reliability.
2. The method produces fluent, in-distribution counterfactuals.
3. Enhanced assessment of attribution faithfulness.

Abstract

Despite the widespread adoption of autoregressive language models, explainability evaluation research has predominantly focused on span infilling and masked language models. Evaluating the faithfulness of an explanation method -- how accurately it explains the inner workings and decision-making of the model -- is challenging because it is difficult to separate the model from its explanation. Most faithfulness evaluation techniques corrupt or remove input tokens deemed important by a particular attribution (feature importance) method and observe the resulting change in the model's output. However, for autoregressive language models, this approach creates out-of-distribution inputs due to their next-token prediction training objective. In this study, we propose a technique that leverages counterfactual generation to evaluate the faithfulness of attribution methods for autoregressive…
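The removal-based evaluation that the abstract criticizes can be sketched in a few lines. This is a minimal illustration only, not the paper's counterfactual method and not code from the authors' repository: the toy scoring function, the `[MASK]` placeholder, and all names below are hypothetical stand-ins for a real language model and attribution method.

```python
from typing import Callable, List, Sequence

def removal_faithfulness(
    tokens: Sequence[str],
    attributions: Sequence[float],
    score_fn: Callable[[Sequence[str]], float],
    k: int,
    mask_token: str = "[MASK]",  # hypothetical placeholder token
) -> float:
    """Return the drop in model score after masking the k top-attributed tokens."""
    # Rank token positions from most to least important according to the attribution.
    ranked = sorted(range(len(tokens)), key=lambda i: attributions[i], reverse=True)
    top = set(ranked[:k])
    # Corrupt the input by overwriting the top-k positions with the mask token.
    perturbed: List[str] = [mask_token if i in top else t for i, t in enumerate(tokens)]
    return score_fn(tokens) - score_fn(perturbed)

# Toy stand-in for a model: a bag-of-words score with made-up weights.
WEIGHTS = {"great": 2.0, "good": 1.0, "boring": -1.5}

def toy_score(tokens: Sequence[str]) -> float:
    return sum(WEIGHTS.get(t, 0.0) for t in tokens)

tokens = ["the", "movie", "was", "great", "and", "good"]

# An attribution that matches the toy model's true weights...
faithful = [WEIGHTS.get(t, 0.0) for t in tokens]
# ...and one that points at tokens the toy model ignores.
unfaithful = [1.0, 0.9, 0.8, 0.0, 0.0, 0.0]

drop_faithful = removal_faithfulness(tokens, faithful, toy_score, k=2)
drop_unfaithful = removal_faithfulness(tokens, unfaithful, toy_score, k=2)
print(drop_faithful, drop_unfaithful)
```

A larger score drop suggests the attribution picked tokens the model actually relies on. The perturbed sequence, however, contains mask tokens that a next-token-prediction model would not have seen in that role during training, which is the out-of-distribution problem the abstract describes and the motivation for replacing masking with fluent counterfactuals.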

Code & Models

Repositories

sepehr-kamahi/faith (PyTorch, official)

Videos

Counterfactuals As a Means for Evaluating Faithfulness of Attribution Methods in Autoregressive Language Models · Underline

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Counterfactual Explanations