Can LLMs Explain Themselves Counterfactually?
Zahra Dehghanighobadi, Asja Fischer, Muhammad Bilal Zafar

TL;DR
This paper investigates whether Large Language Models can explain their own outputs through self-generated counterfactual explanations (SCEs). It finds that they sometimes fail to produce SCEs and that, when they do, their predictions often do not align with the counterfactuals.
Contribution
The study introduces tests for evaluating LLMs' ability to generate self-generated counterfactual explanations (SCEs) and analyzes performance across LLM families, model sizes, temperature settings, and datasets.
Findings
LLMs sometimes fail to generate counterfactual explanations for their own predictions.
When LLMs do generate counterfactuals, their predictions often do not align with their counterfactual reasoning.
Performance varies across model sizes, temperatures, and datasets.
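The second finding can be made concrete with a minimal sketch of a validity check: a counterfactual counts as valid only if the model's own prediction on it equals the label the counterfactual was meant to produce. This is an illustration under stated assumptions, not the paper's actual test suite, which this summary does not specify. The names `sce_validity_rate` and `toy_predict` are hypothetical, and `toy_predict` is a keyword classifier standing in for a real LLM call.

```python
from typing import Callable, List, Tuple

# (original_text, counterfactual_text, target_label)
ScePair = Tuple[str, str, str]


def sce_validity_rate(predict: Callable[[str], str], pairs: List[ScePair]) -> float:
    """Fraction of counterfactuals that actually flip the model's prediction.

    A pair counts as valid when the model predicts the target label on the
    counterfactual and did not already predict it on the original input.
    """
    if not pairs:
        return 0.0
    valid = 0
    for original, counterfactual, target in pairs:
        if predict(original) != target and predict(counterfactual) == target:
            valid += 1
    return valid / len(pairs)


def toy_predict(text: str) -> str:
    # Hypothetical stand-in for an LLM: a one-keyword sentiment classifier.
    return "positive" if "good" in text.lower() else "negative"


pairs = [
    # Counterfactual flips the prediction to the target: valid.
    ("The movie was bad.", "The movie was good.", "positive"),
    # Counterfactual flips the prediction to the target: valid.
    ("The food was good.", "The food was fine.", "negative"),
    # Counterfactual does not change the prediction: invalid.
    ("The plot was dull.", "The plot was less dull.", "positive"),
]

rate = sce_validity_rate(toy_predict, pairs)
print(f"validity rate: {rate:.2f}")
```

In a real evaluation, `predict` would wrap a call to the same LLM that produced the counterfactual, so the check measures whether the model agrees with its own explanation.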
Abstract
Explanations are an important tool for gaining insights into the behavior of ML models, calibrating user trust, and ensuring regulatory compliance. The past few years have seen a flurry of post-hoc methods for generating model explanations, many of which involve computing model gradients or solving specially designed optimization problems. However, owing to the remarkable reasoning abilities of Large Language Models (LLMs), self-explanation, that is, prompting the model to explain its outputs, has recently emerged as a new paradigm. In this work, we study a specific type of self-explanation: self-generated counterfactual explanations (SCEs). We design tests for measuring the efficacy of LLMs in generating SCEs. Analysis over various LLM families, model sizes, temperature settings, and datasets reveals that LLMs sometimes struggle to generate SCEs. Even when they do, their prediction often…
Taxonomy
Topics: Law, AI, and Intellectual Property
