Bottlenecked Transformers: Periodic KV Cache Consolidation for Generalised Reasoning

Adnan Oomerjee; Zafeirios Fountas; Haitham Bou-Ammar; Jun Wang

arXiv:2505.16950·cs.LG·March 26, 2026

TL;DR

This paper introduces the Bottlenecked Transformer, which periodically rewrites the KV cache, inspired by memory consolidation in the brain, to improve reasoning in large language models, with gains of up to +6.6pp on math reasoning benchmarks.

Contribution

The paper proposes a novel memory consolidation approach for Transformers using periodic KV cache rewrites, justified by Information Bottleneck theory, and shows improved reasoning performance.
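
A minimal sketch of what a periodic, gated KV rewrite could look like, based only on the mechanism as summarised in the reviews on this page (rewrite recent-step entries plus the top-k past entries by attention mass, gate the update, trigger on newline). Everything here is an assumption made for illustration: the names `should_consolidate`, `select_topk_by_attention_mass`, `toy_processor`, and `consolidate` are hypothetical, the cache is a single 2-D NumPy array rather than per-layer keys and values, and `toy_processor` is a simple mean-mixing stand-in, not the paper's learned Cache Processor.

```python
import numpy as np


def should_consolidate(token_text: str) -> bool:
    """Assumed schedule: trigger a rewrite when a newline is generated."""
    return "\n" in token_text


def select_topk_by_attention_mass(attn: np.ndarray, k: int) -> np.ndarray:
    """Return sorted indices of the k columns with the largest total attention.

    attn has shape (num_queries, num_past_entries).
    """
    mass = attn.sum(axis=0)
    k = min(k, mass.shape[0])
    top = np.argsort(-mass, kind="stable")[:k]
    return np.sort(top)


def toy_processor(entries: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a learned processor.

    Non-causal mixing: every selected entry sees the mean of all selected
    entries. Output shape equals input shape, so the cache size is unchanged.
    """
    mean = entries.mean(axis=0, keepdims=True)
    return np.broadcast_to(mean, entries.shape)


def consolidate(cache, attn, step_start, k, gate=0.5):
    """Rewrite recent-step entries and top-k recalled past entries in place.

    cache:      (seq_len, dim) array of cache entries (toy, single layer)
    attn:       (num_recent_queries, seq_len) attention weights
    step_start: index where the most recent reasoning step begins
    gate:       blend factor in [0, 1]; 0 keeps the old entry, 1 replaces it
    """
    recent = np.arange(step_start, cache.shape[0])
    recalled = select_topk_by_attention_mass(attn[:, :step_start], k)
    idx = np.concatenate([recalled, recent])
    proposal = toy_processor(cache[idx])
    # Gated update: convex blend of the old entry and the proposed rewrite.
    cache[idx] = (1.0 - gate) * cache[idx] + gate * proposal
    return cache, idx


if __name__ == "__main__":
    cache = np.arange(12, dtype=float).reshape(6, 2)
    attn = np.array([
        [0.1, 0.5, 0.05, 0.05, 0.2, 0.1],
        [0.0, 0.6, 0.1, 0.1, 0.1, 0.1],
    ])
    if should_consolidate("step done\n"):
        cache, idx = consolidate(cache, attn, step_start=4, k=1)
        print("rewritten rows:", idx.tolist())
        print(cache)
```

In this toy setup only the selected rows change and the cache keeps its shape, which mirrors the reviewers' remark that the method leaves KV dimensionality unchanged. A real implementation would operate on per-layer key and value tensors and train the processor end to end.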

Findings

01

Performance gains of up to +6.6pp on math reasoning benchmarks.

02

Theoretical justification via Information Bottleneck theory.

03

Effective memory consolidation improves reasoning accuracy.
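
Finding 02 invokes the Information Bottleneck (IB). For orientation, the standard IB objective seeks a representation $Z$ of an input $X$ that stays predictive of a target $Y$ while discarding the rest of $X$. Using the notation quoted in the reviews as an assumed mapping ($X = S_{0:n}$, $Z = \hat Z$, $Y = S_{n+1}$), it can be written as:

$$\min_{p(\hat z \mid s_{0:n})} \; I(S_{0:n};\hat Z) \;-\; \beta\, I(\hat Z;S_{n+1}), \qquad \beta > 0.$$

This is the generic IB Lagrangian, not a restatement of the paper's Theorems 4.1 or 4.2. It shows why the reviews focus on the two mutual information terms: IB asks for the first term to be small, whereas autoregressive training is described as pushing both terms up.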

Abstract

Transformer LLMs have been shown to exhibit strong reasoning ability that scales with inference-time compute, most prominently through token-space "thinking" chains of thought. A growing line of work pushes extra computation into the model's latent space, which we term Auxiliary Latent-Space Computation (ALSC). Existing ALSC methods largely fall into three buckets: (i) token-mediated latent rollouts, (ii) residual/activation steering, and (iii) memory (KV) compression. An underexplored alternative is memory consolidation/reconsolidation, two processes in the brain that are responsible for stabilising newly formed memory traces, and, upon recall, transiently rendering established traces plastic such that they can integrate new contextual information before restabilising. In Transformer LLMs, this can be seen as analogous to performing in-place rewrites of new KV segments, and rewrites of…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. Clear problem framing of ALSC at the sequence level and principled positioning of cache rewriting as consolidation/reconsolidation rather than compression, with an architecture that is simple to integrate and keeps KV dimensionality unchanged.
2. Information-theoretic perspective highlights a plausible failure mode of vanilla decoder-only training (KV states retain unnecessary sequence detail), and motivates a targeted non-causal rewrite to increase predictive efficiency without shrinking m

Weaknesses

1. Theorem 4.2 provides a lower bound linking token cross-entropy to a sum of mutual information terms, but the text then treats autoregressive training as “maximizing both” $I(S_{0:n};\hat Z)$ and $I(\hat Z;S_{n+1})$, which does not follow from a loose bound and is not shown to hold per-term, weakening the justification for the proposed remedy.
2. The “terminal bottleneck” claim for the KV cache plus last hidden state is used to argue that the cache retains reconstructive detail that impede

Reviewer 02 · Rating 6 · Confidence 3

Strengths

* **Clear theoretical motivation**: The IB framing (Theorems 4.1–4.2) formalizes the KV cache + final hidden state as a terminal bottleneck and links autoregressive training to maximizing both $I(X;Z)$ and $I(Z;Y)$, motivating selective rewrites.
* **Simple, modular mechanism**: A lightweight, layer-aligned processor rewrites (i) recent-step KVs and (ii) top-k recalled past entries by attention mass; gating stabilizes updates. The schedule is practical (trigger on newline).
* **Consistent empi

Weaknesses

* **Scope & novelty relative to cache operators**: While the reconsolidation framing is fresh, the core operation (transform selected cache entries) is close to existing cache-edit/compression lines; novelty hinges on scheduling/selection rather than a fundamentally new cache objective.
* **Supervision signal may be weak**: The processor is trained only via next-step cross-entropy with truncated BPTT, which the authors note causes credit-assignment issues; no explicit IB/MI control is used.

Reviewer 03 · Rating 8 · Confidence 3

Strengths

**Strong Theoretical Foundation:** The primary strength of this work is its grounding in Information Bottleneck (IB) theory. The authors provide a formal proof that vanilla Transformers are constrained in their ability to form optimal sequence representations for generalization.

**Novel and Bio-Inspired Architecture:** The concept of a separate Cache Processor that performs periodic, in-place KV cache rewrites is highly novel. The design, explicitly inspired by the neural mechanisms of memory

Weaknesses

**Significant Computational Overhead:** The practical utility of the method is currently hampered by its high computational cost. The authors report that training the Cache Processor is ~20x slower than standard SFT, and inference is ~45% slower on a 1B parameter model. While acknowledged as a potential engineering issue, this overhead is a major barrier to adoption.

**Weak and Indirect Supervision:** The Cache Processor is trained only via the cross-entropy loss of the next reasoning step. As

Taxonomy

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Multi-Head Attention · Dense Connections · Focus · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Pruning