Masked Gated Linear Unit
Yukito Tajima, Nakamasa Inoue, Yusuke Sekikawa, Ikuro Sato, Rio Yokota

TL;DR
This paper introduces Masked Gated Linear Units (MGLUs), an efficient and hardware-friendly variant of GLUs that reduces memory usage and increases inference speed in large language models, while maintaining or improving accuracy.
Contribution
The paper proposes MGLUs, which combine the Mixture of Element-wise Gating (MoEG) architecture with the FlashMGLU kernel to improve memory efficiency and inference speed in LLMs relative to standard GLUs.
Findings
Up to 19.7× inference-time speed-up with FlashMGLU over a naive PyTorch MGLU
47% more memory-efficient than standard GLUs
SwiMGLU matches or surpasses SwiGLU accuracy
Abstract
Gated Linear Units (GLUs) have become essential components in the feed-forward networks of state-of-the-art Large Language Models (LLMs). However, they require twice as many memory reads compared to feed-forward layers without gating, due to the use of separate weight matrices for the gate and value streams. To address this bottleneck, we introduce Masked Gated Linear Units (MGLUs), a novel family of GLUs with an efficient kernel implementation. The core contributions of MGLUs include: (1) the Mixture of Element-wise Gating (MoEG) architecture, which learns multiple binary masks, each determining gate or value assignments at the element level on a single shared weight matrix, resulting in reduced memory transfer, and (2) FlashMGLU, a hardware-friendly kernel that yields up to a 19.7× inference-time speed-up over a naive PyTorch MGLU and is 47% more memory-efficient and 34% faster…
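The masked-gating idea in the abstract can be illustrated with a small NumPy sketch. This is only a reading of the abstract, not the authors' implementation: it assumes each binary mask routes the elements of one shared weight matrix to either the gate or the value projection, that the outputs of multiple masks are summed, and that the gate uses a SiLU activation (as the name SwiMGLU suggests). The masks here are random and fixed, whereas the paper learns them; the function names `swiglu_hidden` and `masked_glu_hidden` are hypothetical, and the naive loop below has none of the memory-traffic benefits of the FlashMGLU kernel.

```python
import numpy as np

def silu(z):
    # SiLU / swish activation: z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

def swiglu_hidden(x, w_gate, w_value):
    # Standard SwiGLU hidden activation: two separate weight matrices.
    return silu(x @ w_gate) * (x @ w_value)

def masked_glu_hidden(x, w, masks):
    # Assumed MGLU form: one shared matrix `w`; each binary mask selects
    # which elements act as gate weights (mask == 1) and which act as
    # value weights (mask == 0). Per-mask outputs are summed (assumption).
    out = np.zeros((x.shape[0], w.shape[1]))
    for m in masks:
        gate = x @ (w * m)           # gate stream uses masked-in elements
        value = x @ (w * (1.0 - m))  # value stream uses the complement
        out += silu(gate) * value
    return out

rng = np.random.default_rng(0)
d_model, d_hidden, n_masks = 8, 16, 2

x = rng.standard_normal((4, d_model))
w = rng.standard_normal((d_model, d_hidden))
# Random fixed masks stand in for the learned masks (illustration only).
masks = [(rng.random((d_model, d_hidden)) < 0.5).astype(w.dtype)
         for _ in range(n_masks)]

h = masked_glu_hidden(x, w, masks)

# Baseline SwiGLU needs two matrices of the same shape as `w`.
w_gate = rng.standard_normal((d_model, d_hidden))
w_value = rng.standard_normal((d_model, d_hidden))
h_ref = swiglu_hidden(x, w_gate, w_value)

shared_params = w.size
swiglu_params = w_gate.size + w_value.size
```

In this sketch the shared matrix holds half as many floating-point weights as the two SwiGLU matrices, which is the source of the reduced memory reads the abstract refers to; the binary masks add only one bit per weight each.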
Taxonomy
Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Machine Learning in Materials Science
