Looking Inward: Language Models Can Learn About Themselves by Introspection
Felix J Binder, James Chua, Tomek Korbak, Henry Sleight, John Hughes, Robert Long, Ethan Perez, Miles Turpin, Owain Evans

TL;DR
This paper explores whether large language models can perform introspection by predicting their own behavior, which could improve interpretability and understanding of their internal states.
Contribution
It introduces a method to finetune LLMs to predict their own behavior, providing evidence that models can exhibit a form of introspection on simple tasks.
Findings
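A model finetuned to predict its own behavior does so more accurately than a second model trained to predict that behavior.
A model continues to predict its own behavior accurately even after that behavior has been intentionally modified.
Introspection succeeds on simple tasks but fails on more complex tasks and on tasks that require out-of-distribution generalization.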
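The comparison behind the first finding can be sketched as a simple accuracy gap between self-prediction and cross-prediction. This is a minimal illustration, not the authors' evaluation code: the function names, the "even"/"odd" behavior labels, and the toy values are all hypothetical placeholders, not data or results from the paper.

```python
from typing import Sequence


def prediction_accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of prompts where the predicted behavior matches the actual behavior."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one prompt")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


def self_prediction_advantage(
    self_predicted: Sequence[str],
    cross_predicted: Sequence[str],
    actual: Sequence[str],
) -> float:
    """Accuracy of a model predicting itself minus accuracy of another model predicting it."""
    return prediction_accuracy(self_predicted, actual) - prediction_accuracy(
        cross_predicted, actual
    )


# Hypothetical placeholder values, NOT results from the paper.
# Each entry is a property of model M1's behavior on one prompt.
actual_m1 = ["even", "odd", "even", "even"]   # what M1 actually did
m1_self = ["even", "odd", "even", "odd"]      # M1's predictions about itself
m2_cross = ["even", "even", "odd", "odd"]     # M2's predictions about M1

advantage = self_prediction_advantage(m1_self, m2_cross, actual_m1)
```

A positive advantage on held-out prompts is the kind of gap the first finding describes; a value near zero would suggest the self-predicting model has no privileged access beyond what the second model learned from the same behavioral data.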
Abstract
Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings) that is not accessible to external observers. Can LLMs introspect? We define introspection as acquiring knowledge that is not contained in or derived from training data but instead originates from internal states. Such a capability could enhance model interpretability. Instead of painstakingly analyzing a model's internal workings, we could simply ask the model about its beliefs, world models, and goals. More speculatively, an introspective model might self-report on whether it possesses certain internal states such as subjective feelings or desires and this could inform us about the moral status of these states. Such self-reports would not be entirely dictated by the model's training data. We…
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Attention Is All You Need · Adam · Dropout · Dense Connections · Layer Normalization · Residual Connection · Position-Wise Feed-Forward Layer · Linear Layer · Byte Pair Encoding · Absolute Position Encodings
