Gated Slot Attention for Efficient Linear-Time Sequence Modeling
Yu Zhang, Songlin Yang, Ruijie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi, Bailin Wang, Wei Bi, Peng Zhou, Guohong Fu

TL;DR
This paper introduces Gated Slot Attention (GSA), a novel method that improves memory capacity and efficiency in sequence modeling by combining gating mechanisms with attention, enabling better recall and faster training.
Contribution
GSA enhances attention models with gating and memory control, achieving efficient linear-time sequence modeling and improved recall in pretrained transformer finetuning.
Findings
GSA outperforms existing models in recall-intensive tasks.
GSA reduces training and inference cost by reusing GLA's hardware-efficient training algorithm and keeping a smaller recurrent state.
GSA is effective in finetuning pretrained transformers to RNNs.
Abstract
Linear attention Transformers and their gated variants, celebrated for enabling parallel training and efficient recurrent inference, still fall short in recall-intensive tasks compared to traditional Transformers and demand significant resources for training from scratch. This paper introduces Gated Slot Attention (GSA), which enhances Attention with Bounded-memory-Control (ABC) by incorporating a gating mechanism inspired by Gated Linear Attention (GLA). Essentially, GSA comprises a two-layer GLA linked via softmax, utilizing context-aware memory reading and adaptive forgetting to improve memory capacity while maintaining compact recurrent state size. This design greatly enhances both training and inference efficiency through GLA's hardware-efficient training algorithm and reduced state size. Additionally, retaining the softmax operation is…
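The recurrence the abstract describes can be sketched in a few lines. This is a minimal single-head illustration, not the authors' implementation. It assumes the two passes are linked by a softmax (as the Methods taxonomy suggests) and that each memory slot keeps a fraction `alpha` of its old content while writing `1 - alpha` of the incoming key or value. The helper names (`gated_write`, `gsa_step`, `run_gsa`) and the plain-list representation are hypothetical, and the hardware-efficient chunked training algorithm is not shown.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def gated_write(slots, alpha, x):
    """Adaptive forgetting (assumed gate form): slot i keeps alpha[i] of its
    old row and writes (1 - alpha[i]) of the new vector x."""
    return [[a * s + (1.0 - a) * xi for s, xi in zip(row, x)]
            for a, row in zip(alpha, slots)]

def gsa_step(K, V, q, k, v, alpha):
    """One recurrent step (hypothetical helper). K and V each hold m slot rows."""
    K = gated_write(K, alpha, k)
    V = gated_write(V, alpha, v)
    # First pass: score every key slot against the query, normalize with softmax.
    weights = softmax([sum(ki * qi for ki, qi in zip(row, q)) for row in K])
    # Second pass: context-aware read, a softmax-weighted sum of the value slots.
    out = [sum(w * row[j] for w, row in zip(weights, V)) for j in range(len(v))]
    return K, V, out

def run_gsa(tokens, m, d):
    """Consume (q, k, v, alpha) tuples in order; the state stays m x d throughout."""
    K = [[0.0] * d for _ in range(m)]
    V = [[0.0] * d for _ in range(m)]
    outputs = []
    for q, k, v, alpha in tokens:
        K, V, out = gsa_step(K, V, q, k, v, alpha)
        outputs.append(out)
    return K, V, outputs
```

Because `K` and `V` never grow with the number of tokens processed, each step in this sketch costs work proportional to `m * d`, which is what makes the total cost linear in sequence length.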
Code & Models
- 🤗 fla-hub/gsa-1.3B-100B · model · 18 dl
- 🤗 fla-hub/gsa-2.7B-100B · model · 5 dl
- 🤗 linear-moe-hub/Liger-GLA-8B · model · 29 dl · ♡ 3
- 🤗 linear-moe-hub/Liger-GSA-8B · model · 4 dl · ♡ 2
- 🤗 linear-moe-hub/GSA-340M · model · 2 dl
- 🤗 msj19/mask_gdn_1B_hrr4_byt · model
- 🤗 msj19/mask_gdn_hrr2 · model · 1 dl
- 🤗 msj19/mask_gdn_1B_hrr4_byt_a100 · model
- 🤗 msj19/mask_gdn_hrr4 · model · 2 dl
- 🤗 msj19/mask_gdn_1B_hrr4_byt_a100_l34 · model
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques
Methods: Attention Is All You Need · Softmax
