GatedFWA: Linear Flash Windowed Attention with Gated Associative Memory

Jiaxu Liu; Yuhe Bai; Xiangyu Yin; Christos-Savvas Bouganis

arXiv:2512.07782 · cs.LG · January 8, 2026

Open Access

TL;DR

GatedFWA introduces a gated windowed attention mechanism that stabilizes memory updates and controls gradient flow, maintaining efficiency and improving global context utilization in autoregressive models.

Contribution

The paper proposes GatedFWA, an attention mechanism that combines the efficiency of sliding-window attention with gated stabilization of memory updates to improve autoregressive model performance.

Findings

1. Competitive throughput with negligible overhead
2. Better utilization of global context
3. Compatible with token compression methods

Abstract

Modern autoregressive models rely on attention, yet the Softmax full attention in Transformers scales quadratically with sequence length. Sliding Window Attention (SWA) achieves linear-time encoding/decoding by constraining the attention pattern, but under an Associative Memory interpretation, its difference-style update renders the training objective effectively unbounded. In contrast, Softmax attention normalizes updates, leading to memory shrinkage and gradient vanishing. We propose GatedFWA: a Memory-Gated (Flash) Windowed Attention mechanism that preserves SWA's efficiency while stabilizing memory updates and making gradient flow controllable. In essence, GatedFWA accumulates a per-token/head gate into a decay bias added to the attention logits, acting as a learnable contraction in the memory recurrence. We…
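The gating idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not state: the gate is a sigmoid of a per-token scalar logit (single head), the decay bias for query i and key j is the sum of log-gates over tokens j+1..i, and the window is causal with a fixed width. The names `gated_window_attention`, `gate_logits`, and `window` are hypothetical. For readability the sketch materializes the full T×T score matrix, so it is neither linear-time nor a fused Flash-style kernel.

```python
import numpy as np

def gated_window_attention(q, k, v, gate_logits, window):
    """Illustrative single-head sketch (assumed formulation, not the paper's).

    q, k, v:      arrays of shape (T, d)
    gate_logits:  array of shape (T,), one scalar gate logit per token
    window:       number of most recent tokens each query may attend to
    Returns (output of shape (T, d), attention weights of shape (T, T)).
    """
    T, d = q.shape

    # Assumption: gate g_t = sigmoid(gate_logits[t]) in (0, 1).
    # log(sigmoid(x)) = -log(1 + exp(-x)), computed stably; always <= 0.
    log_gate = -np.logaddexp(0.0, -gate_logits)

    # Accumulate the gates: c[i] = sum of log g_t for t <= i.
    c = np.cumsum(log_gate)

    # Decay bias: bias[i, j] = c[i] - c[j] = sum of log g_t for t in (j, i].
    # For j <= i this is <= 0, so older keys are down-weighted.
    bias = c[:, None] - c[None, :]

    scores = q @ k.T / np.sqrt(d) + bias

    # Causal sliding window: key j is visible to query i if i - window < j <= i.
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    visible = (j <= i) & (j > i - window)
    scores = np.where(visible, scores, -np.inf)

    # Row-wise softmax; every row contains its own diagonal, so the max is finite.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
T, d = 6, 4
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
out, weights = gated_window_attention(q, k, v, gate_logits=np.zeros(T), window=3)
```

In this sketch, strongly negative gate logits push each query toward attending only to its own position, while strongly positive logits recover plain windowed softmax attention, which is one way to read the "learnable contraction" described above.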

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Parallel Computing and Optimization Techniques · Topic Modeling