A Single-Layer Model Can Do Language Modeling

Zanmin Wang

arXiv:2605.10643 · cs.CL · May 12, 2026

TL;DR

This paper introduces Grounded Prediction Networks (GPN), a single-layer recurrent language model that approaches, but does not match, the perplexity of deeper Transformer++ and GDN baselines. Because its working context is a single vector, its internal geometry can be inspected directly.

Contribution

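The paper proposes GPN, a recurrent architecture built from one state vector, one FFN, and one shared matrix memory. It reports that a 1-layer GPN+M at 130M parameters comes within 13% of a 12-layer Transformer++ in FineWeb-Edu perplexity, and it analyzes the geometry of the model's state vector.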

Findings

01

At 130M parameters, a 1-layer GPN+M reaches FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and within 18% of a 10-layer GDN (15.34).

02

A 2-layer variant narrows the gap to 6% and 11%, respectively. Neither variant matches the deep baselines.

03

Inspecting the single state vector reveals a persistent default-token direction, a content-bearing horizon of tens of tokens, and memory dynamics.
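The percentage gaps quoted for the 1-layer model follow from the perplexities reported in the abstract. The short check below recomputes them as relative differences; the function and variable names are illustrative and not taken from the paper.

```python
def relative_gap(ppl: float, baseline: float) -> float:
    """Relative perplexity gap of `ppl` over `baseline`, as a fraction."""
    return (ppl - baseline) / baseline


# Perplexities as reported in the abstract (FineWeb-Edu, 130M parameters).
gpn_1layer = 18.06       # 1-layer GPN+M
transformer_pp = 16.05   # 12-layer Transformer++
gdn = 15.34              # 10-layer GDN

gap_vs_transformer = relative_gap(gpn_1layer, transformer_pp)
gap_vs_gdn = relative_gap(gpn_1layer, gdn)

print(f"gap vs Transformer++: {gap_vs_transformer:.1%}")  # about 12.5%
print(f"gap vs GDN:           {gap_vs_gdn:.1%}")          # about 17.7%
```

Rounded to whole percentages, these give the 13% and 18% figures stated in the abstract.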

Abstract

Modern language models scale depth by stacking layers, each holding its own state - a per-layer KV cache in transformers, a per-layer matrix in Mamba, Gated DeltaNet (GDN), RWKV, and xLSTM. Biological systems lean heavily on recurrence rather than on stacking. We ask how far that shape can go on language modeling. We propose Grounded Prediction Networks (GPN): one state vector revisited at every step through a single recurrent block - one FFN, one shared matrix memory. At 130M parameters, a 1-layer GPN+M reaches FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34); a 2-layer variant closes the gap to 6%/11%. We do not match the deep baselines. Because the working context is a single vector, we can directly inspect its geometry: a persistent default-token direction, a content-bearing horizon of tens of tokens, and memory heads…
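To make the shape described in the abstract concrete, here is a toy NumPy sketch of a recurrence with one state vector, one FFN, and one matrix memory. It is not the paper's architecture: the abstract does not specify the update rules, so every name, dimension, weight, and equation below (the outer-product memory write, the decay factor, the residual FFN update) is an assumption chosen only to illustrate the general idea of constant-size state revisited at every step.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 50, 16  # toy sizes, chosen only for illustration

# Hypothetical parameters; names and shapes are assumptions, not the paper's.
E = rng.normal(0.0, 0.1, (vocab, d))       # token embeddings
W1 = rng.normal(0.0, 0.1, (2 * d, 4 * d))  # FFN input projection
W2 = rng.normal(0.0, 0.1, (4 * d, d))      # FFN output projection
Wq = rng.normal(0.0, 0.1, (d, d))          # memory read query
Wk = rng.normal(0.0, 0.1, (d, d))          # memory write key
Wv = rng.normal(0.0, 0.1, (d, d))          # memory write value
W_out = rng.normal(0.0, 0.1, (d, vocab))   # next-token readout
DECAY = 0.9                                # assumed memory decay


def step(h, M, token):
    """One recurrent step: read memory, update the state, write memory."""
    x = E[token]
    read = (h @ Wq) @ M                     # query the matrix memory
    z = np.concatenate([x, h + read])       # combine input with context
    h_new = h + np.tanh(z @ W1) @ W2        # single shared FFN, residual
    k, v = h_new @ Wk, h_new @ Wv
    M_new = DECAY * M + np.outer(k, v)      # outer-product write with decay
    logits = h_new @ W_out                  # scores for the next token
    return h_new, M_new, logits


# Run the same block over a short token sequence.
h = np.zeros(d)
M = np.zeros((d, d))
tokens = [3, 17, 42, 8]
for t in tokens:
    h, M, logits = step(h, M, t)

print(h.shape, M.shape, logits.shape)
```

The point of the sketch is that the entire working context lives in `h` and `M`, whose sizes do not grow with sequence length or depth, which is what makes direct inspection of the state feasible.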
