Evaluating the Reliability of Self-Explanations in Large Language Models

Korbinian Randl, John Pavlopoulos, Aron Henriksson, and Tony Lindgren

arXiv:2407.14487·cs.CL·February 3, 2025

TL;DR

This study assesses the reliability of self-generated explanations by large language models, revealing a gap between perceived and actual reasoning, and proposing counterfactual explanations as a more faithful alternative.

Contribution

It demonstrates that counterfactual explanations from LLMs can be more faithful and reliable than extractive explanations, offering a new approach to model interpretability.

Findings

01

Self-explanations correlate with human judgment but lack full fidelity.

02

Counterfactual explanations can produce more faithful and verifiable insights.

03

Prompt tailoring is crucial for effective counterfactual explanations.

Abstract

This paper investigates the reliability of explanations generated by large language models (LLMs) when prompted to explain their previous output. We evaluate two kinds of such self-explanations (extractive and counterfactual) using three state-of-the-art LLMs (2B to 8B parameters) on two different classification tasks (objective and subjective). Our findings reveal that, while these self-explanations can correlate with human judgement, they do not fully and accurately follow the model's decision process, indicating a gap between perceived and actual model reasoning. We show that this gap can be bridged because prompting LLMs for counterfactual explanations can produce faithful, informative, and easy-to-verify results. These counterfactuals offer a promising alternative to traditional explainability methods (e.g., SHAP, LIME), provided that prompts are tailored to specific tasks and…
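
The abstract calls counterfactual self-explanations "easy-to-verify". One way to read that claim is that a counterfactual can be checked by feeding the edited text back to the same model and confirming that the predicted label changes. The sketch below illustrates this idea only and is not taken from the paper or its repository: `toy_classifier` is a hypothetical keyword-based stand-in for an LLM classifier, and the function names, labels, and example sentences are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Callable


def toy_classifier(text: str) -> str:
    """Hypothetical stand-in for an LLM classifier (assumption, not the paper's model)."""
    return "hazard" if "recall" in text.lower() else "no-hazard"


def counterfactual_flips_label(
    classify: Callable[[str], str], original: str, counterfactual: str
) -> bool:
    """Return True if the model assigns the counterfactual a different label."""
    return classify(original) != classify(counterfactual)


def token_similarity(original: str, counterfactual: str) -> float:
    """Token-level similarity in [0, 1]; higher means a smaller edit."""
    return SequenceMatcher(None, original.split(), counterfactual.split()).ratio()


# Illustrative inputs (invented for this sketch).
original = "The company issued a recall for the contaminated product"
counterfactual = "The company issued a statement for the contaminated product"

flipped = counterfactual_flips_label(toy_classifier, original, counterfactual)
similarity = token_similarity(original, counterfactual)
print(f"label flipped: {flipped}, token similarity: {similarity:.2f}")
```

In practice `classify` would wrap a call to the model under study. A counterfactual that flips the label while staying close to the original text is the kind of result that can be verified mechanically, which is not possible in the same way for a list of extracted "important" words.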

Code & Models

Repositories

k-randl/self-explaining_llms (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Counterfactual Explanations · Shapley Additive Explanations