Adaptive Memory Decay for Log-Linear Attention
Yaxita Amin, Helen Zichen Li, Mengfan Zhang, Samet Ayhan

TL;DR
This paper introduces a method to adaptively learn memory decay parameters in log-linear attention models, enhancing their ability to recall relevant information in long sequences.
Contribution
It proposes a lightweight, input-dependent decay mechanism for log-linear attention that improves long-range memory recall without changing the model's log-linear compute cost.
Findings
Input-dependent decay outperforms fixed decay in associative recall and language modeling.
The largest improvements occur in long-range memory tasks where fixed decay fails.
The method preserves log-linear complexity with negligible parameter overhead.
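To make the log-linear claim concrete, the sketch below shows the standard Fenwick-tree (binary indexed tree) decomposition of a prefix into power-of-two segments. It is a generic illustration, not necessarily the exact bucketing used in the paper; the function name `fenwick_segments` is hypothetical.

```python
def fenwick_segments(t):
    """Split the prefix [0, t) into power-of-two segments, most recent first.

    Standard Fenwick-tree query decomposition: repeatedly strip the lowest
    set bit of the prefix length. Assumed here only as an illustration of
    why the number of memory segments grows logarithmically.
    """
    segments = []
    end = t
    while end > 0:
        size = end & -end                    # lowest set bit of `end`
        segments.append((end - size, end))   # half-open interval [start, end)
        end -= size
    return segments


# A prefix of length 6 (binary 110) splits into two segments.
print(fenwick_segments(6))   # [(4, 6), (0, 4)]
```

The number of segments equals the number of set bits in `t`, so it never exceeds `t.bit_length()`. Keeping one state per segment is what lets the hidden state grow logarithmically with sequence length.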
Abstract
Sequence models face a fundamental tradeoff between memory capacity and computational efficiency. Transformers achieve expressive context modeling at quadratic cost, while linear attention and state-space models run in linear time by compressing context into a fixed-size hidden state, inherently limiting recall. Log-linear attention navigates this tradeoff by organizing memory across a Fenwick tree hierarchy, growing its hidden state logarithmically with sequence length at log-linear compute cost. However, its memory decay parameter λ is fixed and independent of the input, assigning uniform weights across all hierarchy levels regardless of the content, which introduces unnecessary rigidity. We propose learning λ directly from the input via a lightweight two-layer MLP, producing per-token, per-level decay that adapts to content rather than position. A softplus activation…
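A minimal sketch of the idea described in the abstract: a two-layer MLP maps each token embedding to one decay value per hierarchy level. Everything beyond "two-layer MLP" and "softplus" is an assumption made for illustration, including the tanh hidden nonlinearity, the layer sizes, the placement of softplus on the output, and all names (`decay_mlp`, `init_linear`, `num_levels`). It is not the authors' implementation.

```python
import math
import random


def softplus(x):
    # Numerically stable log(1 + exp(x)); the result is never negative.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def matvec(W, x, b):
    # Dense layer: W @ x + b, using plain Python lists.
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def init_linear(rng, n_out, n_in):
    # Small uniform initialization (an arbitrary choice for this sketch).
    scale = 1.0 / math.sqrt(n_in)
    W = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


def decay_mlp(x, params):
    """Map one token embedding to one decay value per hierarchy level."""
    (W1, b1), (W2, b2) = params
    hidden = [math.tanh(h) for h in matvec(W1, x, b1)]  # assumed nonlinearity
    logits = matvec(W2, hidden, b2)
    return [softplus(z) for z in logits]                # assumed output activation


# Hypothetical sizes: 8-dim embeddings, 4 hidden units, 5 hierarchy levels.
rng = random.Random(0)
d_model, d_hidden, num_levels, seq_len = 8, 4, 5, 3
params = (init_linear(rng, d_hidden, d_model), init_linear(rng, num_levels, d_hidden))
tokens = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(seq_len)]

# One row per token, one column per level: decay depends on content.
lambdas = [decay_mlp(x, params) for x in tokens]

# Extra parameters added by the MLP under these assumed sizes.
num_params = d_hidden * (d_model + 1) + num_levels * (d_hidden + 1)
```

Under these assumed sizes the MLP adds `num_params` weights per layer it is attached to, which is small relative to typical attention projections; the output `lambdas` has shape `seq_len × num_levels`, in contrast to a single fixed value per level shared by all tokens.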
