TL;DR
This paper investigates how GPT-based models develop internal representations during gameplay, using OthelloGPT as a testbed, and compares interpretability methods to understand the progression of learned features.
Contribution
It introduces a framework for analyzing internal representations in GPT models through layer-wise analysis, and compares sparse autoencoders (SAEs) with linear probes as interpretability methods.
Findings
Early layers encode static board features
Deeper layers reflect dynamic gameplay changes
SAEs provide more disentangled insights than linear probes
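The two interpretability methods compared above can be illustrated with a minimal sketch. This is not the paper's implementation: the activations are synthetic, the binary label stands in for a hypothetical per-square board feature, and every function name, shape, and hyperparameter here is an assumption chosen for illustration. A linear probe fits a supervised linear map from one layer's activations to a known label, while a sparse autoencoder is unsupervised: it reconstructs the activations through a wider, sparsely active hidden code.

```python
import numpy as np

def fit_linear_probe(acts, labels, n_classes, ridge=1e-3):
    """Fit a ridge-regularised linear map from activations to one-hot labels."""
    X = np.hstack([acts, np.ones((acts.shape[0], 1))])  # append a bias column
    Y = np.eye(n_classes)[labels]                       # one-hot targets
    return np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ Y)

def probe_accuracy(W, acts, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    X = np.hstack([acts, np.ones((acts.shape[0], 1))])
    return float(np.mean(np.argmax(X @ W, axis=1) == labels))

def sae_loss(acts, W_enc, b_enc, W_dec, b_dec, l1_coeff=1e-3):
    """SAE objective: reconstruction MSE plus an L1 penalty on the hidden code."""
    code = np.maximum(acts @ W_enc + b_enc, 0.0)        # ReLU encoder
    recon = code @ W_dec + b_dec                        # linear decoder
    mse = np.mean((recon - acts) ** 2)
    sparsity = np.mean(np.sum(np.abs(code), axis=1))
    return float(mse + l1_coeff * sparsity), code

# Synthetic stand-in for one layer's activations (assumed 16-dimensional).
rng = np.random.default_rng(0)
acts = rng.normal(size=(500, 16))
direction = rng.normal(size=16)
labels = (acts @ direction > 0).astype(int)  # hypothetical linearly encoded feature

W_probe = fit_linear_probe(acts, labels, n_classes=2)
acc = probe_accuracy(W_probe, acts, labels)

# Untrained SAE with a 32-unit hidden code, evaluated once to show the objective.
W_enc = rng.normal(scale=0.1, size=(16, 32))
W_dec = rng.normal(scale=0.1, size=(32, 16))
loss, code = sae_loss(acts, W_enc, np.zeros(32), W_dec, np.zeros(16))

print("probe accuracy:", acc)
print("SAE loss:", loss)
```

In a layer-wise analysis, a probe like this would be fitted separately on each layer's activations and the accuracies compared across depth. An SAE would first be trained by minimising the loss above, after which its hidden units are inspected for features that no label was provided for.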
Abstract
Large Language Models (LLMs) excel at tasks like language processing, strategy games, and reasoning but struggle to build generalizable internal representations essential for adaptive decision-making in agents. For agents to effectively navigate complex environments, they must construct reliable world models. While LLMs perform well on specific benchmarks, they often fail to generalize, leading to brittle representations that limit their real-world effectiveness. Understanding how LLMs build internal world models is key to developing agents capable of consistent, adaptive behavior across tasks. We analyze OthelloGPT, a GPT-based model trained on Othello gameplay, as a controlled testbed for studying representation learning. Despite being trained solely on next-token prediction with random valid moves, OthelloGPT shows meaningful layer-wise progression in understanding board state and…
Taxonomy
Methods
Attention Is All You Need · Linear Layer · Multi-Head Attention · Discriminative Fine-Tuning · Layer Normalization · Dense Connections · Cosine Annealing · Attention Dropout · Adam
