The Devil is in the Condition Numbers: Why is GLU Better than non-GLU Structure?
Xingyu Lyu, Qianqian Xu, Zhiyong Yang, Peisong Wen, Qingming Huang

TL;DR
This paper investigates why Gated Linear Units (GLU) outperform non-GLU structures, showing that GLU reshapes the neural tangent kernel (NTK) spectrum in a way that speeds up training convergence.
Contribution
The study provides a theoretical analysis of GLU's effect on the NTK spectrum and on training dynamics, highlighting its role in accelerating optimization.
Findings
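GLU reshapes the NTK spectrum, giving a smaller condition number and a more compact eigenvalue distribution.
GLU models converge faster as a result of this spectral reshaping, which also produces a characteristic loss crossing between GLU and non-GLU models.
GLU has limited impact on the generalization gap in experiments on models including ViT and GPT-2.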
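The link between condition number and convergence speed in the findings above can be made concrete with the standard linearized (kernel) gradient-descent argument. This is a textbook sketch under assumed notation, not necessarily the derivation used in the paper: $K$ is the NTK Gram matrix on the training set, $f_t$ the vector of network outputs at step $t$, $y$ the targets, $\eta$ the step size, and the loss is the squared error.

```latex
% Assumed setting: squared loss and a fixed (lazy-regime) kernel K.
\begin{align}
  L(f_t) &= \tfrac{1}{2}\,\lVert f_t - y \rVert^2, &
  f_{t+1} - y &= (I - \eta K)\,(f_t - y). \\
\intertext{With the eigendecomposition $K = \sum_i \lambda_i v_i v_i^\top$, each residual component decays independently:}
  v_i^\top (f_t - y) &= (1 - \eta \lambda_i)^t \, v_i^\top (f_0 - y). \\
\intertext{Choosing $\eta = 1/\lambda_{\max}$ (stability requires $\eta < 2/\lambda_{\max}$), the slowest mode contracts as}
  \Bigl(1 - \frac{\lambda_{\min}}{\lambda_{\max}}\Bigr)^{t} &= \Bigl(1 - \frac{1}{\kappa}\Bigr)^{t}, &
  \kappa &= \frac{\lambda_{\max}}{\lambda_{\min}}.
\end{align}
```

Under these assumptions, a smaller $\kappa$ means the slowest residual direction shrinks faster per step, which is the sense in which a better-conditioned NTK spectrum speeds up training.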
Abstract
Gated Linear Units (GLU) and their variants are widely adopted in modern open-source large language model architectures and consistently outperform their non-gated counterparts, yet the underlying reasons for this advantage remain unclear. In this work, we study GLU by analyzing two-layer networks in the neural tangent kernel (NTK) regime. Our analysis reveals that the GLU structure reshapes the NTK spectrum, leading to a smaller condition number and a more compact eigenvalue distribution. Building on this finding, we further analyze the resulting training dynamics and show how the reshaped spectrum leads to faster convergence of GLU models, including a characteristic loss-crossing phenomenon observed between GLU and non-GLU models. Finally, we empirically observe that GLU has limited impact in reducing the generalization gap on various models, including ViT and GPT-2, suggesting that…
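To make the quantities in the abstract tangible, the following is a minimal NumPy sketch that builds the empirical NTK Gram matrix of a two-layer non-gated network and of a two-layer gated network at random initialization, then compares their condition numbers. Everything here is an illustrative assumption rather than the paper's setup: the ReLU gate (a ReGLU-style variant), the $1/\sqrt{m}$ output scaling, the Gaussian initialization, the random unit-norm inputs, and the helper names `ntk_mlp`, `ntk_glu`, and `condition_number`.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    return (z > 0).astype(z.dtype)

def ntk_mlp(X, W, a):
    """Empirical NTK of f(x) = a^T relu(W x) / sqrt(m), gradients w.r.t. W and a."""
    m = W.shape[0]
    Z = X @ W.T                       # (n, m) pre-activations
    H = relu(Z)                       # gradient w.r.t. a (up to 1/sqrt(m))
    D = relu_grad(Z) * a              # per-unit factor of the gradient w.r.t. W
    return (H @ H.T + (D @ D.T) * (X @ X.T)) / m

def ntk_glu(X, W, V, a):
    """Empirical NTK of f(x) = a^T (relu(W x) * (V x)) / sqrt(m), gradients w.r.t. W, V, a."""
    m = W.shape[0]
    Z = X @ W.T                       # gate pre-activations
    G = relu(Z)                       # gate values
    U = X @ V.T                       # linear branch
    H = G * U                         # gradient w.r.t. a (up to 1/sqrt(m))
    Dw = relu_grad(Z) * U * a         # per-unit factor of the gradient w.r.t. W
    Dv = G * a                        # per-unit factor of the gradient w.r.t. V
    XX = X @ X.T
    return (H @ H.T + (Dw @ Dw.T) * XX + (Dv @ Dv.T) * XX) / m

def condition_number(K):
    """Ratio of largest to smallest eigenvalue of a symmetric matrix."""
    eigvals = np.linalg.eigvalsh(K)   # ascending order
    return eigvals[-1] / eigvals[0]

# Assumed toy setup: n unit-norm inputs in d dimensions, hidden width m.
rng = np.random.default_rng(0)
n, d, m = 20, 10, 512
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

W = rng.standard_normal((m, d))
V = rng.standard_normal((m, d))
a = rng.standard_normal(m)

K_mlp = ntk_mlp(X, W, a)
K_glu = ntk_glu(X, W, V, a)
cond_mlp = condition_number(K_mlp)
cond_glu = condition_number(K_glu)

print(f"non-GLU NTK condition number: {cond_mlp:.2f}")
print(f"GLU NTK condition number:     {cond_glu:.2f}")
```

The numbers printed depend on the seed, width, data, and activation, so a single run of this toy comparison should be read as an illustration of how the condition number is measured, not as a reproduction of the paper's result.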
