The Recurrent Transformer: Greater Effective Depth and Efficient Decoding

Costin-Andrei Oncescu; Depen Morwani; Samy Jelassi; Alexandru Meterez; Mujin Kwun; Sham Kakade

arXiv:2604.21215 · cs.LG · April 24, 2026

TL;DR

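The Recurrent Transformer adds layerwise recurrent memory to the Transformer: each layer attends to key-value pairs computed from its own activations. This increases effective depth while preserving standard autoregressive decoding cost, and the model improves cross-entropy over parameter-matched standard Transformers in language-model pretraining.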

Contribution

The paper presents a simple architectural modification that gives Transformers layerwise recurrent memory, while avoiding the optimization instability of recurrent models and reducing memory-bandwidth demands during training and inference.

Findings

01. Improves cross-entropy over parameter-matched Transformers on large-scale pretraining.

02. Reduces KV cache memory footprint and inference latency.

03. Achieves greater effective depth through recurrence without increasing the number of model parameters.

Abstract

Transformers process tokens in parallel but are temporally shallow: at position $t$, each layer attends to key-value pairs computed based on the previous layer, yielding a depth capped by the number of layers. Recurrent models offer unbounded temporal depth but suffer from optimization instability and historically underutilize modern accelerators. We introduce the Recurrent Transformer, a simple architectural change where each layer attends to key-value pairs computed off its own activations, yielding layerwise recurrent memory while preserving standard autoregressive decoding cost. We show that the architecture can emulate both (i) a conventional Transformer and (ii) token-to-token recurrent updates under mild assumptions, while avoiding optimization instability. Naively, prefill/training appears bandwidth-bound with effective arithmetic intensity near $1$ because keys and values are…
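To make the distinction in the abstract concrete, here is a minimal NumPy sketch that contrasts a standard causal attention layer (keys and values computed from the layer's input) with a layer whose keys and values are computed from its own outputs. It is an illustration under stated assumptions, not the paper's implementation: the single-head setup, the residual form, the function names `standard_layer` and `recurrent_layer`, and the choice to attend only over strictly earlier positions in the recurrent variant are all hypothetical simplifications.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def standard_layer(x, Wq, Wk, Wv):
    """Causal attention with keys/values computed from the layer INPUT x.

    All keys and values are available up front, so positions could be
    processed in parallel; the loop is only for readability.
    """
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    out = np.zeros_like(x)
    for t in range(T):
        w = softmax(k[: t + 1] @ q[t] / np.sqrt(d))
        out[t] = x[t] + w @ v[: t + 1]  # residual form is an assumption
    return out


def recurrent_layer(x, Wq, Wk, Wv):
    """Causal attention with keys/values computed from the layer's OWN outputs.

    Assumption: position t attends only to strictly earlier positions,
    because its own key/value depend on an output not yet computed.
    """
    T, d = x.shape
    out = np.zeros_like(x)
    keys, values = [], []
    for t in range(T):
        q = x[t] @ Wq
        if keys:
            K, V = np.stack(keys), np.stack(values)
            ctx = softmax(K @ q / np.sqrt(d)) @ V
        else:
            ctx = np.zeros(d)  # no history at the first position
        out[t] = x[t] + ctx
        # The memory written at position t depends on out[t], which in turn
        # depended on out[t-1], out[t-2], ...: a chain that grows with t.
        keys.append(out[t] @ Wk)
        values.append(out[t] @ Wv)
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d = 6, 8
    x = rng.normal(size=(T, d))
    Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
    print(standard_layer(x, Wq, Wk, Wv).shape)
    print(recurrent_layer(x, Wq, Wk, Wv).shape)
```

In this sketch the recurrent variant must be evaluated position by position, since each new key-value pair depends on the output just produced. That sequential dependence is the source of the extra temporal depth, and it is also why a naive prefill or training pass looks bandwidth-bound, as the abstract notes. Per-token decoding still performs one attention read over the cache in both variants.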
