ReGLA: Refining Gated Linear Attention
Peng Lu, Ivan Kobyzev, Mehdi Rezagholizadeh, Boxing Chen, Philippe Langlais

TL;DR
ReGLA refines gated linear attention with an improved feature mapping, normalization, and gating mechanism, outperforming previous gated linear attention methods on language modelling tasks while avoiding the quadratic cost of softmax attention.
Contribution
This work presents a comprehensive refinement of Gated Linear Attention by enhancing feature maps, normalization, and gating, resulting in better performance and training stability.
Findings
Outperforms previous Gated Linear Attention methods
Effective in training from scratch and continual pre-training
Reduces computational complexity of attention mechanisms
Abstract
Recent advancements in Large Language Models (LLMs) have set themselves apart with their exceptional performance in complex language modelling tasks. However, these models are also known for their significant computational and storage requirements, primarily due to the quadratic computational complexity of softmax attention. To mitigate this issue, linear attention has been designed to reduce the quadratic space-time complexity that is inherent in standard transformers. In this work, we embarked on a comprehensive exploration of three key components that substantially impact the performance of the Gated Linear Attention module: feature maps, normalization, and the gating mechanism. We developed a feature mapping function to address some crucial issues that previous suggestions overlooked. Then we offered further rationale for the integration of normalization layers to stabilize the…
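To make the abstract's three components concrete, here is a minimal sketch of a generic gated linear attention recurrence in plain Python. It is an illustration under stated assumptions, not the ReGLA method: the feature map is the commonly used elu(x) + 1 stand-in, the gate is a single scalar per token supplied by the caller, and the function names (feature_map, gated_linear_attention, quadratic_reference) are hypothetical. The point it shows is that a fixed-size running state replaces the pairwise token comparison that makes softmax attention quadratic.

```python
import math


def feature_map(x):
    # elu(x) + 1: a common positive feature map for linear attention.
    # Stand-in only; this is NOT the feature map proposed in ReGLA.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def gated_linear_attention(queries, keys, values, gates):
    """Recurrent form: one O(d_k * d_v) state update per token."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    norm = [0.0] * d_k                         # running sum of phi(k)
    outputs = []
    for q, k, v, g in zip(queries, keys, values, gates):
        fq, fk = feature_map(q), feature_map(k)
        for i in range(d_k):
            # The scalar gate g (assumed to lie in [0, 1]) decays the past.
            norm[i] = g * norm[i] + fk[i]
            for j in range(d_v):
                state[i][j] = g * state[i][j] + fk[i] * v[j]
        # Normalize by the accumulated key mass seen by this query.
        denom = sum(fq[i] * norm[i] for i in range(d_k))
        outputs.append(
            [sum(fq[i] * state[i][j] for i in range(d_k)) / denom
             for j in range(d_v)]
        )
    return outputs


def quadratic_reference(queries, keys, values):
    """Ungated causal attention with the same kernel, computed pairwise."""
    outputs = []
    for t, q in enumerate(queries):
        fq = feature_map(q)
        weights = [
            sum(a * b for a, b in zip(fq, feature_map(keys[s])))
            for s in range(t + 1)
        ]
        total = sum(weights)
        outputs.append(
            [sum(w * values[s][j] for s, w in enumerate(weights)) / total
             for j in range(len(values[0]))]
        )
    return outputs
```

With every gate set to 1.0 the recurrence matches the pairwise reference, which is the sense in which linear attention computes the same kind of weighted average at linear cost in sequence length. A gate of 0.0 discards all history, so each output collapses to the current value vector.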
Taxonomy
Topics: Semantic Web and Ontologies · Natural Language Processing Techniques
