Learning to Forget Attention: Memory Consolidation for Adaptive Compute Reduction
Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma

TL;DR
This paper introduces a biologically inspired memory consolidation mechanism that reduces attention computation over training by distilling episodic memories into semantic memory, significantly improving efficiency and transferability in language models.
Contribution
It proposes a consolidation-based routing method under which attention utilization decreases over the course of training, supported by theoretical proofs and empirical results showing substantial compute reduction and improved task transfer.
Findings
88% of attention operations retrieve predictable information
37.8× reduction in attention compute over training
100% retrieval accuracy at 1.6% attention compute
Abstract
Hybrid architectures combining state-space models with attention have achieved strong efficiency-quality tradeoffs, yet existing approaches either apply attention uniformly or learn static sparse patterns. This misses a key opportunity: attention demand should decrease over time as recurring patterns become familiar. We present a surprising finding from analyzing GPT-2 models: 88% of attention operations retrieve information already predictable from the model's hidden state, and this redundancy does not decrease during training. Motivated by this observation, we introduce CRAM (Consolidation-based Routing for Adaptive Memory), a biologically inspired memory consolidation mechanism that gradually distills episodic retrievals into parametric semantic memory. Unlike prior sparse attention methods, CRAM exhibits…
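To make the notion of "attention output already predictable from the hidden state" concrete, here is a minimal sketch. The abstract does not say how the authors measure predictability, so the least-squares linear probe, the relative-error tolerance `tol`, and both function names are assumptions for illustration only, not the paper's method.

```python
import numpy as np

def predictable_fraction(hidden, attn_out, tol=0.1):
    """Fraction of positions whose attention output a linear probe
    recovers from the hidden state (hypothetical redundancy measure).

    hidden:   (n_positions, d_model) hidden states
    attn_out: (n_positions, d_out) attention outputs at the same positions
    tol:      relative-error threshold below which a position counts
              as predictable (assumed value, not from the paper)
    """
    # Fit W minimising ||hidden @ W - attn_out||_F. The probe is a
    # stand-in for a parametric memory, chosen only for simplicity.
    W, *_ = np.linalg.lstsq(hidden, attn_out, rcond=None)
    pred = hidden @ W
    err = np.linalg.norm(pred - attn_out, axis=1)
    scale = np.linalg.norm(attn_out, axis=1) + 1e-12  # avoid divide-by-zero
    return float(np.mean(err / scale < tol))

def attention_needed_fraction(hidden, attn_out, tol=0.1):
    """Share of positions that would still need real attention if the
    predictable ones were served by the probe instead."""
    return 1.0 - predictable_fraction(hidden, attn_out, tol)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H = rng.normal(size=(200, 16))
    # Synthetic attention outputs that are an exact linear function of H,
    # so every position is predictable by construction.
    A = H @ rng.normal(size=(16, 16))
    print(predictable_fraction(H, A), attention_needed_fraction(H, A))
```

Under this toy definition, a routing scheme would skip attention at positions the probe already predicts well, so the fraction returned by `attention_needed_fraction` plays the role of the attention compute budget.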
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Ferroelectric and Negative Capacitance Devices · EEG and Brain-Computer Interfaces
