
TL;DR
This paper introduces Grounded Prediction Networks (GPN), a language model built around a single recurrent block. It comes close to, but does not match, deeper transformer and recurrent baselines, and its single state vector makes the internal representations directly inspectable.
Contribution
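The paper proposes GPN, a recurrent architecture for language modeling in which one state vector is revisited at every step by a single block (one FFN, one shared matrix memory). At 130M parameters it reports perplexity within 13% to 18% of deep baselines without matching them, and it analyzes the geometry of the state vector.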
Findings
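At 130M parameters, a 1-layer GPN+M reaches FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34).
A 2-layer variant narrows the gap to 6% and 11% respectively; neither variant matches the deep baselines.
Internal analysis reveals a persistent default-token direction, a content-bearing horizon of tens of tokens, and memory dynamics.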
Abstract
Modern language models scale depth by stacking layers, each holding its own state - a per-layer KV cache in transformers, a per-layer matrix in Mamba, Gated DeltaNet (GDN), RWKV, and xLSTM. Biological systems lean heavily on recurrence rather than on stacking. We ask how far that shape can go on language modeling. We propose Grounded Prediction Networks (GPN): one state vector revisited at every step through a single recurrent block - one FFN, one shared matrix memory. At 130M parameters, a 1-layer GPN+M reaches FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34); a 2-layer variant closes the gap to 6%/11%. We do not match the deep baselines. Because the working context is a single vector, we can directly inspect its geometry: a persistent default-token direction, a content-bearing horizon of tens of tokens, and memory heads…
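The percentage gaps quoted in the abstract can be checked from the three perplexities it reports. The sketch below assumes the gap is measured relative to the baseline, i.e. (GPN perplexity / baseline perplexity) - 1; the excerpt does not state the convention, and the function and variable names are illustrative only.

```python
def relative_gap(ppl: float, baseline: float) -> float:
    """Fractional excess of `ppl` over `baseline` (lower perplexity is better).

    Assumed convention: gap = ppl / baseline - 1.
    """
    return ppl / baseline - 1.0


# Perplexities as reported in the abstract (FineWeb-Edu, 130M parameters).
gpn_1layer = 18.06       # 1-layer GPN+M
transformer_pp = 16.05   # 12-layer Transformer++
gdn = 15.34              # 10-layer GDN

gap_vs_transformer = relative_gap(gpn_1layer, transformer_pp)
gap_vs_gdn = relative_gap(gpn_1layer, gdn)

print(f"vs Transformer++: {gap_vs_transformer:.1%}")  # vs Transformer++: 12.5%
print(f"vs GDN:           {gap_vs_gdn:.1%}")          # vs GDN:           17.7%
```

Under this convention the two gaps round to the 13% and 18% stated in the abstract.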
