Looking Inward: Language Models Can Learn About Themselves by Introspection

Felix J Binder; James Chua; Tomek Korbak; Henry Sleight; John Hughes; Robert Long; Ethan Perez; Miles Turpin; Owain Evans

arXiv:2410.13787·cs.CL·October 18, 2024·5 cites

Open Access · 1 Repo

TL;DR

This paper tests whether large language models can introspect, measured by how well they predict their own behavior. Such a capability could improve interpretability by letting us ask a model directly about its internal states.

Contribution

The paper introduces a method for finetuning LLMs to predict their own behavior, and it provides evidence that models can exhibit a form of introspection on simple tasks.

Findings

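01 A model finetuned to predict its own behavior outperforms a different model trained to predict that same behavior.

02 Models continue to predict their own behavior accurately even after that behavior is deliberately modified.

03 Introspection is limited to simple tasks; it does not extend to more complex tasks or to tasks requiring out-of-distribution generalization.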
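The comparison behind the first finding can be sketched as a self-prediction versus cross-prediction accuracy check. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the toy functions `m1`, `m1_self_prediction`, and `m2_cross_prediction`, the `behavior_property` helper (first character of a response), and the prompts are all hypothetical stand-ins for real model calls.

```python
from typing import Callable, Sequence


def behavior_property(response: str) -> str:
    """Hypothetical behavior property: the first character of a response."""
    return response[:1]


def prediction_accuracy(
    prompts: Sequence[str],
    target_model: Callable[[str], str],
    predictor: Callable[[str], str],
) -> float:
    """Fraction of prompts where the predictor names the target's actual property."""
    hits = sum(
        predictor(p) == behavior_property(target_model(p)) for p in prompts
    )
    return hits / len(prompts)


# Toy stand-ins (assumptions): real experiments would query language models.
def m1(prompt: str) -> str:
    """Target model M1: its ground-truth behavior is to upper-case the prompt."""
    return prompt.upper()


def m1_self_prediction(prompt: str) -> str:
    """M1 predicting a property of its own hypothetical response."""
    return prompt[:1].upper()


def m2_cross_prediction(prompt: str) -> str:
    """A different model M2 guessing M1's behavior; here it always answers 'A'."""
    return "A"


prompts = ["apple", "banana", "cherry", "avocado"]
self_acc = prediction_accuracy(prompts, m1, m1_self_prediction)
cross_acc = prediction_accuracy(prompts, m1, m2_cross_prediction)
print(f"self-prediction accuracy:  {self_acc:.2f}")
print(f"cross-prediction accuracy: {cross_acc:.2f}")
```

In this toy setup a gap between `self_acc` and `cross_acc` is the kind of signal the first finding describes; the second finding would correspond to rerunning the same check after changing what `m1` does.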

Abstract

Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings) that is not accessible to external observers. Can LLMs introspect? We define introspection as acquiring knowledge that is not contained in or derived from training data but instead originates from internal states. Such a capability could enhance model interpretability. Instead of painstakingly analyzing a model's internal workings, we could simply ask the model about its beliefs, world models, and goals. More speculatively, an introspective model might self-report on whether it possesses certain internal states such as subjective feelings or desires and this could inform us about the moral status of these states. Such self-reports would not be entirely dictated by the model's training data. We…

Code & Models

Repositories

felixbinder/introspection_self_prediction · Official

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Attention Is All You Need · Adam · Dropout · Dense Connections · Layer Normalization · Residual Connection · Position-Wise Feed-Forward Layer · Linear Layer · Byte Pair Encoding · Absolute Position Encodings