Learning to Remember, Learn, and Forget in Attention-Based Models
Djohan Bonnet, Jamie Lohoff, Jan Finkbeiner, Elidona Skhikerujah, Emre Neftci

TL;DR
This paper introduces Palimpsa, a self-attention model that addresses the fixed-capacity, interference-prone memory of gated linear attention models by framing in-context learning as a continual learning problem, with the aim of improving long-sequence processing.
Contribution
It proposes a Bayesian metaplasticity approach for attention models, linking various architectures and significantly enhancing memory capacity and performance.
Findings
Palimpsa outperforms baselines on the MQAR benchmark.
It improves performance on Commonsense Reasoning tasks.
The theoretical link between models enables the transformation of non-metaplastic models.
Abstract
In-Context Learning (ICL) in transformers acts as an online associative memory and is believed to underpin their high performance on complex sequence processing tasks. However, in gated linear attention models, this memory has a fixed capacity and is prone to interference, especially for long sequences. We propose Palimpsa, a self-attention model that views ICL as a continual learning problem that must address a stability-plasticity dilemma. Palimpsa uses Bayesian metaplasticity, where the plasticity of each attention state is tied to an importance state grounded by a prior distribution that captures accumulated knowledge. We demonstrate that various gated linear attention models emerge as specific architecture choices and posterior approximations, and that Mamba2 is a special case of Palimpsa where forgetting dominates. This theoretical link enables the transformation of any…
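The fixed capacity and interference described in the abstract can be illustrated with a minimal NumPy sketch of a gated linear attention memory. It assumes the common recurrence S_t = a * S_{t-1} + v_t k_t^T with a constant scalar decay a (actual models typically compute the gate from the input). The function names are illustrative and are not taken from the paper.

```python
import numpy as np

def write_pairs(keys, values, decay=1.0):
    """Write key-value pairs into a fixed-size state: S <- decay * S + v k^T.

    keys:   (n, d_k) array, one key per row
    values: (n, d_v) array, one value per row
    decay:  assumed constant scalar gate in (0, 1]; 1.0 means no forgetting
    """
    state = np.zeros((values.shape[1], keys.shape[1]))
    for k, v in zip(keys, values):
        state = decay * state + np.outer(v, k)
    return state

def read(state, query):
    """Linear attention read-out: o = S q."""
    return state @ query

def mean_recall_error(n_pairs, dim=16, decay=1.0, seed=0):
    """Average read-back error when n_pairs random unit keys share one state."""
    rng = np.random.default_rng(seed)
    keys = rng.standard_normal((n_pairs, dim))
    keys /= np.linalg.norm(keys, axis=1, keepdims=True)
    values = rng.standard_normal((n_pairs, dim))
    state = write_pairs(keys, values, decay)
    recalled = keys @ state.T  # row i equals read(state, keys[i])
    return float(np.mean(np.linalg.norm(recalled - values, axis=1)))

if __name__ == "__main__":
    # The state stays (dim x dim) no matter how many pairs are written,
    # so the read-back error grows as more pairs compete for it.
    for n in (4, 16, 64):
        print(n, mean_recall_error(n))
```

With orthonormal keys and `decay=1.0`, every value is read back exactly. With `decay=0.5`, the first of four pairs is read back scaled by 0.5**3, which is the forgetting side of the stability-plasticity trade-off. Once there are more pairs than key dimensions, the keys must overlap and reads mix several values together.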
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
1. It's interesting to see a Bayesian scheme for metaplasticity, combining ideas from Longhorn and MesaNet. 2. The authors attempt to structure the field a bit, e.g. by showing that methods like Mamba2 are a special case of their work. Table 1 is interesting in that respect, but could be made clearer. 3. They achieve competitive results, especially for 'smaller' models.
1. The paper needs significant restructuring and rewriting, in my opinion. - Half of the main paper is spent on contextualization and related work. They start from an overly broad context of self-attention and in-context learning, while their work actually applies to state space models. Stating that explicitly from the start would have made the actual contributions a lot clearer. - A lot of the relevant content of the paper, related to the actual method, is moved to the supplementary material. -
1. Principled derivation. The free‑energy view leads to exact, closed‑form update rules (Eq. 3) for both the mean and the per‑synapse importance, providing a clean, test‑time learning interpretation of attention that goes beyond heuristic gating. 2. Unifying lens on gated models. Table 1 methodically maps several models into the same objective, and shows Mamba2 as a limiting case. This is an interesting contribution.
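The idea of updating both a mean and a per-synapse importance can be illustrated with a generic precision-weighted delta rule. This is a sketch under stated assumptions and not the paper's Eq. 3: each memory entry is assumed to carry an independent Gaussian belief (a diagonal approximation), `noise_var` is a hypothetical observation-noise parameter, and the function name is invented for illustration. For one-hot keys the update coincides with the exact Gaussian posterior update of the addressed column.

```python
import numpy as np

def metaplastic_write(mean, importance, key, value, noise_var=1.0):
    """One precision-weighted write (illustrative only, not the paper's Eq. 3).

    mean:       (d_v, d_k) current estimate of the memory matrix
    importance: (d_v, d_k) per-entry precision; larger means less plastic
    key:        (d_k,) key vector
    value:      (d_v,) value to associate with the key
    noise_var:  assumed observation-noise variance (hypothetical parameter)
    """
    # Evidence about entry (i, j) grows with the squared key component k_j.
    new_importance = importance + (key ** 2)[None, :] / noise_var
    # Prediction error of the current memory for this key-value pair.
    error = value - mean @ key
    # Each entry's learning rate is its inverse importance,
    # so well-consolidated entries move less.
    new_mean = mean + np.outer(error, key) / (noise_var * new_importance)
    return new_mean, new_importance

if __name__ == "__main__":
    mean, importance = np.zeros((2, 2)), np.ones((2, 2))
    key, value = np.array([1.0, 0.0]), np.array([2.0, 2.0])
    for _ in range(5):
        mean, importance = metaplastic_write(mean, importance, key, value)
    print(mean)
    print(importance)
```

Repeated writes to the same key raise the importance of the addressed entries, so a later conflicting write changes them less than it would change a fresh memory. This is the stability side of the trade-off, and a decay on the importance would restore plasticity.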
1. Scaling story is mixed. At 760M, Palimpsa underperforms Gated DeltaNet on the averaged suite, weakening the claim that metaplastic updates are broadly advantageous; an analysis isolating why performance flips with scale is needed. 2. Fairness of comparisons needs tightening. Table 5 shows Palimpsa uses more layers yet half the state size to keep state budgets comparable. Please provide compute‑matched and parameter‑matched comparisons (same layers/expansions). 3. Limited long‑context evaluation.
The attempt to formally unify linear-gated transformers is admirable (see the Appendix) and a common area of recent work. This reviewer finds the framing of in-context learning as a continual learning problem compelling. The quality of the writing is above average and the arguments are typically clear.
This reviewer believes the evaluation and results of the authors’ layer, Palimpsa, could be much stronger. The language modeling and commonsense reasoning evaluation only compares to Gated DeltaNet. We have observed that results on those tasks tend to be much better with Gated DeltaNet-H2. While the authors give DeltaNet and Gated DeltaNet special treatment in the Appendix, that should not preclude the use of a different model – one closer to state-of-the-art – in their evaluation. The results o
The main strength is that the paper provides a compelling Bayesian reinterpretation of attention and state-space models, showing that many recent architectures (e.g., Gated DeltaNet, Mamba2, MesaNet) can be viewed as special cases of a single variational inference framework. This theoretical unification helps clarify connections between disparate model classes and gives principled meaning to heuristic gating mechanisms used in prior work.
1. Some parts of the paper are not well-presented. In Section 2.2, β is not defined until several paragraphs later and initially reads as if it comes directly from the input. The notation of placing diag(β) in a subscript is also unfamiliar to me and likely to other readers. It was not immediately clear why or how we aim to maximize the objective on line 245 — if I understand correctly, the goal is to update S to accommodate the new key–value pair (k, v). On line 265, what is x and how is it lin
Taxonomy
Topics: Visual Attention and Saliency Detection · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
