Evaluating the Reliability of Self-Explanations in Large Language Models
Korbinian Randl, John Pavlopoulos, Aron Henriksson, and Tony Lindgren

TL;DR
This study assesses the reliability of self-generated explanations by large language models, revealing a gap between perceived and actual reasoning, and proposing counterfactual explanations as a more faithful alternative.
Contribution
It demonstrates that counterfactual explanations from LLMs can be more faithful and reliable than extractive explanations, offering a new approach to model interpretability.
Findings
Self-explanations correlate with human judgment but lack full fidelity.
Counterfactual explanations can produce more faithful and verifiable insights.
Prompt tailoring is crucial for effective counterfactual explanations.
Abstract
This paper investigates the reliability of explanations generated by large language models (LLMs) when prompted to explain their previous output. We evaluate two kinds of such self-explanations - extractive and counterfactual - using three state-of-the-art LLMs (2B to 8B parameters) on two different classification tasks (objective and subjective). Our findings reveal that, while these self-explanations can correlate with human judgement, they do not fully and accurately follow the model's decision process, indicating a gap between perceived and actual model reasoning. We show that this gap can be bridged because prompting LLMs for counterfactual explanations can produce faithful, informative, and easy-to-verify results. These counterfactuals offer a promising alternative to traditional explainability methods (e.g. SHAP, LIME), provided that prompts are tailored to specific tasks and…
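One way to read "easy-to-verify" is that a counterfactual can be checked by feeding the edited text back to the same classifier and confirming that the label changes. The sketch below illustrates that idea only; it is not the authors' evaluation code. The names `toy_classifier` and `check_counterfactual` are hypothetical, and the keyword-based classifier is an assumed stand-in for an LLM so that the example runs without any model.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, Union


def toy_classifier(text: str) -> str:
    """Hypothetical stand-in for an LLM classifier (keyword lookup, not a real model)."""
    negative_words = {"bad", "awful", "boring"}
    tokens = text.lower().split()
    return "negative" if any(t in negative_words for t in tokens) else "positive"


def check_counterfactual(
    predict: Callable[[str], str],
    original: str,
    counterfactual: str,
) -> Dict[str, Union[str, bool, int]]:
    """Check whether an edited text flips the prediction, and how large the edit is."""
    original_label = predict(original)
    counterfactual_label = predict(counterfactual)

    # Count whitespace-separated tokens touched by the edit (assumed proxy for minimality).
    a, b = original.split(), counterfactual.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    changed_tokens = sum(
        max(i2 - i1, j2 - j1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    )

    return {
        "original_label": original_label,
        "counterfactual_label": counterfactual_label,
        # The counterfactual is valid only if the predicted label actually changes.
        "valid": original_label != counterfactual_label,
        "changed_tokens": changed_tokens,
    }


# Usage: a one-word edit that is expected to flip the toy classifier's label.
result = check_counterfactual(
    toy_classifier,
    "the movie was boring",
    "the movie was great",
)
```

In practice `predict` would wrap a call to the model being explained, and the counterfactual text would come from the model's own self-explanation. A small `changed_tokens` count together with `valid` being true suggests a minimal edit that the model itself treats as decisive.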
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Counterfactual Explanations · Shapley Additive Explanations
