The Recurrent Transformer: Greater Effective Depth and Efficient Decoding
Costin-Andrei Oncescu, Depen Morwani, Samy Jelassi, Alexandru Meterez, Mujin Kwun, Sham Kakade

TL;DR
The Recurrent Transformer introduces a layerwise recurrent memory mechanism that increases effective depth and improves decoding efficiency, outperforming standard Transformers on language modeling.
Contribution
It presents a simple architectural modification enabling recurrent memory in Transformers, avoiding optimization issues and reducing memory bandwidth during training and inference.
Findings
Improves cross-entropy over parameter-matched Transformers on large-scale pretraining.
Reduces KV cache memory footprint and inference latency.
Achieves greater effective depth through recurrence without increasing the parameter count.
Abstract
Transformers process tokens in parallel but are temporally shallow: at any given position, each layer attends to key-value pairs computed from the previous layer, yielding a depth capped by the number of layers. Recurrent models offer unbounded temporal depth but suffer from optimization instability and have historically underutilized modern accelerators. We introduce the Recurrent Transformer, a simple architectural change where each layer attends to key-value pairs computed from its own activations, yielding layerwise recurrent memory while preserving standard autoregressive decoding cost. We show that the architecture can emulate both (i) a conventional Transformer and (ii) token-to-token recurrent updates under mild assumptions, while avoiding optimization instability. Naively, prefill/training appears bandwidth-bound with low effective arithmetic intensity because keys and values are…
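The difference between the two attention patterns described in the abstract can be illustrated with a toy single-head layer. This is only a sketch under assumptions that are not taken from the paper: the function names, the residual update, the omission of the MLP and normalization, and the choice to attend only to strictly earlier positions in the recurrent variant are all hypothetical simplifications.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(z - z.max())
    return e / e.sum()

def standard_layer(X, Wq, Wk, Wv):
    """Causal attention where keys/values come from the layer INPUT X."""
    T, d = X.shape
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    out = np.zeros_like(X)
    for t in range(T):
        w = softmax(K[: t + 1] @ Q[t] / np.sqrt(d))
        out[t] = X[t] + w @ V[: t + 1]
    return out

def recurrent_layer(X, Wq, Wk, Wv):
    """Keys/values come from the layer's OWN outputs at earlier positions.

    Assumption (not from the paper): position t attends only to
    positions < t, and position 0 has no memory to attend to.
    """
    T, d = X.shape
    out = np.zeros_like(X)
    K = np.zeros((T, d))  # one cached key per position, as in a standard KV cache
    V = np.zeros((T, d))  # one cached value per position
    for t in range(T):
        q = X[t] @ Wq
        if t == 0:
            attn = np.zeros(d)
        else:
            w = softmax(K[:t] @ q / np.sqrt(d))
            attn = w @ V[:t]
        out[t] = X[t] + attn
        # The key/value for position t depend on out[t], which itself
        # depends on earlier outputs: this is the layerwise recurrence.
        K[t] = out[t] @ Wk
        V[t] = out[t] @ Wv
    return out
```

In this sketch the recurrent variant still stores one key and one value per position per layer, so the per-token decoding work matches the standard layer, but the loop over positions can no longer be replaced by a single batched matrix product because each key depends on the previous outputs.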
