Counterfactuals As a Means for Evaluating Faithfulness of Attribution Methods in Autoregressive Language Models
Sepehr Kamahi, Yadollah Yaghoobzadeh

TL;DR
This paper introduces a counterfactual-based method for evaluating attribution techniques in autoregressive language models. It addresses the problem that conventional faithfulness tests, which corrupt or remove input tokens, produce out-of-distribution inputs.
Contribution
It proposes a novel counterfactual generation approach that produces fluent, in-distribution examples to better evaluate attribution method faithfulness in autoregressive models.
Findings
Counterfactual generation improves evaluation reliability.
The method produces fluent, in-distribution counterfactuals.
The approach improves the assessment of attribution faithfulness.
Abstract
Despite the widespread adoption of autoregressive language models, explainability evaluation research has predominantly focused on span infilling and masked language models. Evaluating the faithfulness of an explanation method -- how accurately it explains the inner workings and decision-making of the model -- is challenging because it is difficult to separate the model from its explanation. Most faithfulness evaluation techniques corrupt or remove input tokens deemed important by a particular attribution (feature importance) method and observe the resulting change in the model's output. However, for autoregressive language models, this approach creates out-of-distribution inputs due to their next-token prediction training objective. In this study, we propose a technique that leverages counterfactual generation to evaluate the faithfulness of attribution methods for autoregressive…
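The conventional erasure-style evaluation described in the abstract (remove the tokens an attribution method ranks highest, then observe the change in the model's output) can be sketched as follows. This is a minimal illustration, not the paper's method: the bag-of-words scorer `toy_model_prob`, its weights, and the hand-written attribution scores are all hypothetical stand-ins for a real language model and a real attribution method.

```python
import math

# Hypothetical toy "model": token weights feeding a sigmoid.
# It stands in for a real model's output probability.
TOY_WEIGHTS = {"great": 2.0, "good": 1.0, "boring": -1.5, "not": -0.5}


def toy_model_prob(tokens):
    """Return sigmoid(sum of token weights); unknown tokens contribute 0."""
    logit = sum(TOY_WEIGHTS.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-logit))


def remove_top_k(tokens, scores, k):
    """Drop the k tokens with the highest attribution scores, keeping the rest in order."""
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    drop = set(ranked[:k])
    return [t for i, t in enumerate(tokens) if i not in drop]


def probability_drop(tokens, scores, k):
    """Change in the model output after erasing the k most-attributed tokens."""
    return toy_model_prob(tokens) - toy_model_prob(remove_top_k(tokens, scores, k))


tokens = ["the", "movie", "was", "great"]
good_attr = [0.0, 0.1, 0.0, 0.9]  # assumed attribution that points at "great"
bad_attr = [0.9, 0.1, 0.0, 0.0]   # assumed attribution that points at "the"

drop_good = probability_drop(tokens, good_attr, k=1)
drop_bad = probability_drop(tokens, bad_attr, k=1)
print(drop_good > drop_bad)
```

Under this scheme, a larger drop is read as evidence that the attribution identified tokens the model relies on. Note that the erased input (`["the", "movie", "was"]`) is no longer a fluent sentence, which illustrates the out-of-distribution concern the abstract raises for autoregressive models.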
Taxonomy
Topics: Topic Modeling · activated carbon and charcoal · Natural Language Processing Techniques
Methods: Counterfactual Explanations
