Learning to Forget Attention: Memory Consolidation for Adaptive Compute Reduction

Ibne Farabi Shihab; Sanjeda Akter; Anuj Sharma

arXiv:2602.12204·cs.LG·February 13, 2026

Open Access

TL;DR

This paper introduces a biologically inspired memory consolidation mechanism that reduces attention computation over training by distilling episodic memories into semantic memory, significantly improving efficiency and transferability in language models.

Contribution

The paper proposes a consolidation-based routing method under which attention utilization decreases over the course of training, supported by theoretical proofs and empirical results showing substantial compute reduction and improved task transfer.

Findings

01. 88% of attention operations retrieve information already predictable from the model's hidden state

02. 37.8× reduction in attention compute over training

03. 100% retrieval accuracy at 1.6% attention compute

Abstract

Hybrid architectures combining state-space models with attention have achieved strong efficiency-quality tradeoffs, yet existing approaches either apply attention uniformly or learn static sparse patterns. This misses a key opportunity: attention demand should decrease over time as recurring patterns become familiar. We present a surprising finding from analyzing GPT-2 models: 88% of attention operations retrieve information already predictable from the model's hidden state, and this redundancy does not decrease during training. Motivated by this observation, we introduce CRAM (Consolidation-based Routing for Adaptive Memory), a biologically inspired memory consolidation mechanism that gradually distills episodic retrievals into parametric semantic memory. Unlike prior sparse attention methods, CRAM exhibits…
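One way to make the redundancy claim concrete is to ask how often an attention output can be reconstructed from the hidden state alone. The sketch below is illustrative only and is not the paper's measurement procedure: the linear least-squares probe, the cosine-similarity criterion, the 0.9 threshold, and the function name `predictable_fraction` are all assumptions introduced here.

```python
import numpy as np

def predictable_fraction(hidden, attn_out, threshold=0.9):
    """Fraction of positions whose attention output a linear probe recovers.

    hidden:   (n_positions, d_hidden) hidden states before attention
    attn_out: (n_positions, d_out) attention outputs at the same positions
    threshold: hypothetical cosine-similarity cutoff for "predictable"
    """
    # Fit a linear probe hidden -> attn_out by ordinary least squares.
    W, *_ = np.linalg.lstsq(hidden, attn_out, rcond=None)
    pred = hidden @ W

    # Per-position cosine similarity between probe prediction and target.
    num = np.sum(pred * attn_out, axis=1)
    den = np.linalg.norm(pred, axis=1) * np.linalg.norm(attn_out, axis=1) + 1e-12
    cos = num / den

    # Share of positions where the probe is close enough to the true output.
    return float(np.mean(cos >= threshold))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h = rng.normal(size=(200, 8))
    fully_redundant = h @ rng.normal(size=(8, 4))   # exactly linear in h
    unrelated = rng.normal(size=(200, 4))           # independent of h
    print(predictable_fraction(h, fully_redundant))
    print(predictable_fraction(h, unrelated))
```

The probe is fit and scored on the same positions for brevity; a held-out split would be needed before reading the number as a genuine predictability estimate.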

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Ferroelectric and Negative Capacitance Devices · EEG and Brain-Computer Interfaces