Can Language Models Explain Their Own Classification Behavior?

Dane Sherburn; Bilal Chughtai; Owain Evans

arXiv:2405.07436 · cs.LG · May 14, 2024

TL;DR

This paper introduces the ArticulateRules dataset to evaluate whether large language models can accurately explain their own classification decisions, revealing significant differences among models and challenges in improving explanation fidelity.

Contribution

The paper presents a new dataset, ArticulateRules, for assessing LLMs' ability to generate faithful natural language explanations of their classification behavior.

Findings

01

Articulation accuracy varies widely across models, with GPT-4 outperforming GPT-3.

02

GPT-3 struggles to articulate correct explanations even after finetuning.

03

The dataset enables evaluation of self-explanation capabilities in both in-context learning and finetuning.

Abstract

Large language models (LLMs) perform well at a myriad of tasks, but explaining the processes behind this performance is a challenge. This paper investigates whether LLMs can give faithful high-level explanations of their own internal processes. To explore this, we introduce a dataset, ArticulateRules, of few-shot text-based classification tasks generated by simple rules. Each rule is associated with a simple natural-language explanation. We test whether models that have learned to classify inputs competently (both in- and out-of-distribution) are able to articulate freeform natural language explanations that match their classification behavior. Our dataset can be used for both in-context and finetuning evaluations. We evaluate a range of LLMs, demonstrating that articulation accuracy varies considerably between models, with a particularly sharp increase from GPT-3 to GPT-4. We then…
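The task format described in the abstract (a simple rule, a matching natural-language explanation, and few-shot labelled examples) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Rule` class, the `build_few_shot_prompt` helper, the example rule, and the prompt layout are hypothetical and are not taken from the ArticulateRules repository.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    """A hypothetical classification rule paired with its explanation."""
    explanation: str                  # natural-language description of the rule
    predicate: Callable[[str], bool]  # returns the label for a given input


def build_few_shot_prompt(rule: Rule, examples: List[str], query: str) -> str:
    """Label each example with the rule, then leave the query unlabelled."""
    blocks = [f"Input: {text}\nLabel: {rule.predicate(text)}" for text in examples]
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)


# Illustrative rule only; the actual rules in the dataset may differ.
rule = Rule(
    explanation="The input contains the word 'lizard'",
    predicate=lambda s: "lizard" in s.split(),
)

examples = [
    "a lizard sat here",
    "the cat ran away",
    "lizard on a rock",
    "nothing to see",
]
prompt = build_few_shot_prompt(rule, examples, "one green lizard")
print(prompt)
```

Under this sketch, a model would first be asked to complete the final label, and separately asked to state the rule in free-form text; that statement could then be compared against `rule.explanation`.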

Code & Models

Repositories

danesherbs/articulate-rules (official)

Taxonomy

Topics: Natural Language Processing Techniques · Language and cultural evolution · Topic Modeling

Methods: Attention Is All You Need · Position-Wise Feed-Forward Layer · Cosine Annealing · Dropout · Linear Warmup With Cosine Annealing · Label Smoothing · Residual Connection