Can Language Models Explain Their Own Classification Behavior?
Dane Sherburn, Bilal Chughtai, Owain Evans

TL;DR
This paper introduces the ArticulateRules dataset to evaluate whether large language models can accurately explain their own classification decisions, revealing significant differences among models and challenges in improving explanation fidelity.
Contribution
The paper presents a new dataset, ArticulateRules, for assessing LLMs' ability to generate faithful natural language explanations of their classification behavior.
Findings
Articulation accuracy varies widely across models, with GPT-4 outperforming GPT-3.
GPT-3 struggles to articulate correct explanations even after finetuning.
The dataset enables evaluation of self-explanation capabilities in both in-context learning and finetuning.
Abstract
Large language models (LLMs) perform well at a myriad of tasks, but explaining the processes behind this performance is a challenge. This paper investigates whether LLMs can give faithful high-level explanations of their own internal processes. To explore this, we introduce a dataset, ArticulateRules, of few-shot text-based classification tasks generated by simple rules. Each rule is associated with a simple natural-language explanation. We test whether models that have learned to classify inputs competently (both in- and out-of-distribution) are able to articulate freeform natural language explanations that match their classification behavior. Our dataset can be used for both in-context and finetuning evaluations. We evaluate a range of LLMs, demonstrating that articulation accuracy varies considerably between models, with a particularly sharp increase from GPT-3 to GPT-4. We then…
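To make the setup in the abstract more concrete, here is a minimal sketch of a rule-generated few-shot classification task paired with a freeform articulation prompt. It is an illustration under stated assumptions, not the authors' code or the dataset's actual format: the rule ("contains the word 'lizard'"), the vocabulary, the prompt wording, and all function names are hypothetical.

```python
import random

# Hypothetical rule and vocabulary, for illustration only (not from ArticulateRules).
TARGET_WORD = "lizard"
RULE_EXPLANATION = f"The label is True if the input contains the word '{TARGET_WORD}'."
FILLER_WORDS = ["apple", "river", "stone", "cloud", "tiger", "window", "garden"]


def classify(text):
    """Ground-truth rule: True iff the target word appears as a whitespace-separated token."""
    return TARGET_WORD in text.split()


def make_input(rng, positive, n_words=5):
    """Build a random word sequence that does (positive) or does not satisfy the rule."""
    words = [rng.choice(FILLER_WORDS) for _ in range(n_words)]
    if positive:
        # Overwrite one random position with the target word.
        words[rng.randrange(n_words)] = TARGET_WORD
    return " ".join(words)


def few_shot_examples(n_shots=8, seed=0):
    """Alternate positive and negative examples so both labels appear in context."""
    rng = random.Random(seed)
    examples = []
    for i in range(n_shots):
        text = make_input(rng, positive=(i % 2 == 0))
        examples.append(f"Input: {text}\nLabel: {classify(text)}")
    return examples


def classification_prompt(examples, query):
    """Few-shot prompt asking the model to label a new input."""
    return "\n\n".join(examples + [f"Input: {query}\nLabel:"])


def articulation_prompt(examples):
    """Same examples, but asking the model to state the rule in natural language."""
    question = "Question: What rule determines the label?\nAnswer:"
    return "\n\n".join(examples + [question])


if __name__ == "__main__":
    shots = few_shot_examples()
    print(classification_prompt(shots, "tiger lizard cloud apple stone"))
    print("\n---\n")
    print(articulation_prompt(shots))
    print("\nReference explanation:", RULE_EXPLANATION)
```

In this sketch, a model's answer to the articulation prompt would be compared against the reference explanation, while its answers to classification prompts show whether it has actually learned the rule. How the paper scores that comparison is not specified in the text above, so no grading step is shown.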
Taxonomy
Topics: Natural Language Processing Techniques · Language and cultural evolution · Topic Modeling
Methods: Attention Is All You Need · Position-Wise Feed-Forward Layer · Cosine Annealing · Dropout · Linear Warmup With Cosine Annealing · Label Smoothing · Residual Connection
