GatedFWA: Linear Flash Windowed Attention with Gated Associative Memory
Jiaxu Liu, Yuhe Bai, Xiangyu Yin, Christos-Savvas Bouganis

TL;DR
GatedFWA introduces a gated windowed attention mechanism that stabilizes memory updates and controls gradient flow, maintaining efficiency and improving global context utilization in autoregressive models.
Contribution
The paper proposes GatedFWA, an attention method that combines the efficiency of sliding-window attention with gated stabilization of memory updates, aimed at improving autoregressive model performance.
Findings
Competitive throughput with negligible overhead
Better utilization of global context
Compatible with token compression methods
Abstract
Modern autoregressive models rely on attention, yet the Softmax full attention in Transformers scales quadratically with sequence length. Sliding Window Attention (SWA) achieves linear-time encoding/decoding by constraining the attention pattern, but under an Associative Memory interpretation, its difference-style update renders the training objective effectively unbounded. In contrast, Softmax attention normalizes updates, leading to memory shrinkage and gradient vanishing. We propose GatedFWA: a Memory-Gated (Flash) Windowed Attention mechanism that preserves SWA's efficiency while stabilizing memory updates and making gradient flow controllable. In essence, GatedFWA accumulates a per-token/head gate into a decay bias added to the attention logits, acting as a learnable contraction in the memory recurrence. We…
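The gating idea in the abstract (a per-token gate accumulated into a decay bias on the attention logits) can be illustrated with a small single-head NumPy sketch. This is not the authors' kernel: the sigmoid gate parameterization, the cumulative-sum form of the bias, and the names `gated_window_attention`, `gate_logits`, and `window` are all assumptions made for illustration, and the sketch builds the full T x T score matrix for clarity rather than running in linear time.

```python
import numpy as np

def gated_window_attention(q, k, v, gate_logits, window):
    """Illustrative single-head sketch (assumed form, not the paper's kernel).

    q, k, v:      arrays of shape (T, d)
    gate_logits:  array of shape (T,), one pre-sigmoid gate value per token
    window:       number of most recent tokens (including itself) each query sees
    """
    T, d = q.shape
    # Assumption: gate = sigmoid(gate_logits), so log(gate) <= 0.
    # Stable log-sigmoid: log(sigmoid(x)) = -log(1 + exp(-x)).
    log_gate = -np.logaddexp(0.0, -gate_logits)
    # Accumulate log-gates; bias[t, j] = sum of log_gate over tokens j+1..t,
    # which is 0 on the diagonal and non-positive for older keys.
    cum = np.cumsum(log_gate)
    bias = cum[:, None] - cum[None, :]
    # Causal sliding-window mask: key j is visible to query t if 0 <= t - j < window.
    idx = np.arange(T)
    dist = idx[:, None] - idx[None, :]
    mask = (dist >= 0) & (dist < window)
    # Decay bias is added to the scaled dot-product logits before the softmax.
    scores = q @ k.T / np.sqrt(d) + bias
    scores = np.where(mask, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)  # the diagonal is always visible
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

Under these assumptions, gates near 1 leave the bias near zero and the function reduces to plain sliding-window attention, while gates near 0 push almost all weight onto the current token. An efficient implementation would compute only the in-window scores instead of masking a dense matrix.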
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Parallel Computing and Optimization Techniques · Topic Modeling
