TL;DR
This paper introduces the Bottlenecked Transformer, which uses periodic KV cache rewrites inspired by brain memory consolidation to improve reasoning in large language models, reporting gains of up to +6.6pp on math reasoning benchmarks.
Contribution
The paper proposes a novel memory consolidation approach for Transformers using periodic KV cache rewrites, justified by Information Bottleneck theory, and shows improved reasoning performance.
Findings
Performance gains of up to +6.6pp on math reasoning benchmarks.
Theoretical justification via Information Bottleneck theory.
Effective memory consolidation improves reasoning accuracy.
Abstract
Transformer LLMs have been shown to exhibit strong reasoning ability that scales with inference-time compute, most prominently through token-space "thinking" chains of thought. A growing line of work pushes extra computation into the model's latent space, which we term Auxiliary Latent-Space Computation (ALSC). Existing ALSC methods largely fall into three buckets: (i) token-mediated latent rollouts, (ii) residual/activation steering, and (iii) memory (KV) compression. An underexplored alternative is memory consolidation/reconsolidation, two processes in the brain that are responsible for stabilising newly formed memory traces, and, upon recall, transiently rendering established traces plastic such that they can integrate new contextual information before restabilising. In Transformer LLMs, this can be seen as analogous to performing in-place rewrites of new KV segments, and rewrites of…
Peer Reviews
Decision · ICLR 2026 Poster
1. Clear problem framing of ALSC at the sequence level and principled positioning of cache rewriting as consolidation/reconsolidation rather than compression, with an architecture that is simple to integrate and keeps KV dimensionality unchanged.
2. Information-theoretic perspective highlights a plausible failure mode of vanilla decoder-only training (KV states retain unnecessary sequence detail), and motivates a targeted non-causal rewrite to increase predictive efficiency without shrinking m

1. Theorem 4.2 provides a lower bound linking token cross-entropy to a sum of mutual information terms, but the text then treats autoregressive training as “maximizing both” $I(S_{0:n};\hat Z)$ and $I(\hat Z;S_{n+1})$, which does not follow from a loose bound and is not shown to hold per-term, weakening the justification for the proposed remedy.
2. The “terminal bottleneck” claim for the KV cache plus last hidden state is used to argue that the cache retains reconstructive detail that impede
* **Clear theoretical motivation**: The IB framing (Theorems 4.1–4.2) formalizes the KV cache + final hidden state as a terminal bottleneck and links autoregressive training to maximizing both $I(X;Z)$ and $I(Z;Y)$, motivating selective rewrites.
* **Simple, modular mechanism**: A lightweight, layer-aligned processor rewrites (i) recent-step KVs and (ii) top-k recalled past entries by attention mass; gating stabilizes updates. The schedule is practical (trigger on newline).
* **Consistent empi

* **Scope & novelty relative to cache operators**: While the reconsolidation framing is fresh, the core operation (transform selected cache entries) is close to existing cache-edit/compression lines; novelty hinges on scheduling/selection rather than a fundamentally new cache objective.
* **Supervision signal may be weak**: The processor is trained only via next-step cross-entropy with truncated BPTT, which the authors note causes credit-assignment issues; no explicit IB/MI control is used.
*
**Strong Theoretical Foundation:** The primary strength of this work is its grounding in Information Bottleneck (IB) theory. The authors provide a formal proof that vanilla Transformers are constrained in their ability to form optimal sequence representations for generalization.

**Novel and Bio-Inspired Architecture:** The concept of a separate Cache Processor that performs periodic, in-place KV cache rewrites is highly novel. The design, explicitly inspired by the neural mechanisms of memory

**Significant Computational Overhead:** The practical utility of the method is currently hampered by its high computational cost. The authors report that training the Cache Processor is ~20x slower than standard SFT, and inference is ~45% slower on a 1B parameter model. While acknowledged as a potential engineering issue, this overhead is a major barrier to adoption.

**Weak and Indirect Supervision:** The Cache Processor is trained only via the cross-entropy loss of the next reasoning step. As
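
The rewrite step the reviewers describe (select recent-step KVs plus the top-k past entries by attention mass, then apply a gated in-place update, triggered on a newline) might be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every function name is hypothetical, the cache is a plain list of vectors, the `processor` is an arbitrary per-entry callable standing in for the learned Cache Processor, and the scalar `gate` is an assumed simplification of whatever gating the paper uses.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def should_trigger(token: str) -> bool:
    """Assumed schedule: run a rewrite whenever a generated token contains a newline."""
    return "\n" in token


def top_k_by_attention_mass(
    attn_mass: Sequence[float], k: int, exclude: Sequence[int] = ()
) -> List[int]:
    """Return indices of the k entries with the largest attention mass, skipping `exclude`."""
    banned = set(exclude)
    candidates = [i for i in range(len(attn_mass)) if i not in banned]
    candidates.sort(key=lambda i: attn_mass[i], reverse=True)
    return sorted(candidates[:k])


def gated_update(old: Vector, proposed: Vector, gate: float) -> Vector:
    """Interpolate between the old entry and the proposed rewrite (gate in [0, 1])."""
    return [(1.0 - gate) * o + gate * p for o, p in zip(old, proposed)]


def rewrite_cache(
    cache: Sequence[Vector],
    attn_mass: Sequence[float],
    recent: Sequence[int],
    k: int,
    processor: Callable[[Vector], Vector],
    gate: float,
) -> Tuple[List[Vector], List[int]]:
    """Rewrite the recent entries and the top-k recalled past entries.

    The cache keeps the same length and entry size; only selected entries change.
    """
    recalled = top_k_by_attention_mass(attn_mass, k, exclude=recent)
    selected = sorted(set(recent) | set(recalled))
    new_cache = [list(v) for v in cache]
    for i in selected:
        new_cache[i] = gated_update(cache[i], processor(cache[i]), gate)
    return new_cache, selected


if __name__ == "__main__":
    cache = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
    attn_mass = [0.1, 0.6, 0.2, 0.1]  # toy attention mass per cached position
    if should_trigger("\n"):
        new_cache, selected = rewrite_cache(
            cache, attn_mass, recent=[3], k=1,
            processor=lambda v: [0.0 for _ in v],  # toy stand-in for the learned processor
            gate=0.5,
        )
        print(selected, new_cache)
```

Because the update only interpolates selected entries, the cache size is unchanged, which matches the reviewers' remark that the method rewrites rather than compresses the KV cache.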
Taxonomy
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Multi-Head Attention · Dense Connections · Focus · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Pruning
