GLU Variants Improve Transformer
Noam Shazeer

TL;DR
This paper tests variants of Gated Linear Units (GLUs) in the feed-forward sublayers of the Transformer and finds that some of them improve model quality over the typically used ReLU or GELU activations.
Contribution
It defines GLU variants that replace the sigmoid gate with other nonlinear (or even linear) functions, and empirically evaluates them in the Transformer's feed-forward sublayers.
Findings
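Some GLU variants yield better quality than the typically used ReLU or GELU activations.
The variants differ only in the function applied to one of the two linear projections: sigmoid in the original GLU, other nonlinear (or even linear) functions in the variants.
The gains are measured in the feed-forward sublayers of the Transformer sequence-to-sequence model.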
Abstract
Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.
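The gating described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the names `glu_ffn` and `GATE_FUNCTIONS` are hypothetical, the projections are assumed to be bias-free, GELU is assumed to use the common tanh approximation, and Swish is assumed to use beta = 1. The variant names follow the Methods list (ReGLU, GeGLU, SwiGLU), with "bilinear" standing for the linear case.

```python
import math
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gelu(z):
    # Assumption: tanh approximation of GELU.
    return 0.5 * z * (1.0 + np.tanh(math.sqrt(2.0 / math.pi) * (z + 0.044715 * z**3)))


def swish(z):
    # Assumption: Swish with beta = 1.
    return z * sigmoid(z)


# Function applied to the gated projection in each variant (hypothetical table).
GATE_FUNCTIONS = {
    "glu": sigmoid,                         # original GLU: sigmoid gate
    "bilinear": lambda z: z,                # linear "gate"
    "reglu": lambda z: np.maximum(z, 0.0),  # ReLU gate
    "geglu": gelu,                          # GELU gate
    "swiglu": swish,                        # Swish gate
}


def glu_ffn(x, w_gate, w_value, w_out, variant="glu"):
    """Feed-forward sublayer with a GLU-style gate (bias-free sketch).

    x:       (..., d_model)
    w_gate:  (d_model, d_ff)  projection passed through the gate function
    w_value: (d_model, d_ff)  projection left linear
    w_out:   (d_ff, d_model)  output projection
    """
    gate = GATE_FUNCTIONS[variant](x @ w_gate)
    # Component-wise product of the two projections, then project back.
    return (gate * (x @ w_value)) @ w_out


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
w_gate, w_value = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
w_out = rng.normal(size=(16, 8))
print(glu_ffn(x, w_gate, w_value, w_out, variant="swiglu").shape)  # (5, 8)
```

Note that a gated sublayer of this form uses three weight matrices where a plain ReLU or GELU feed-forward sublayer uses two, so a fair comparison has to account for the extra parameters when choosing the hidden width.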
Code & Models
- 🤗 google/t5-v1_1-small (model) · 38k dl · ♡ 28
- 🤗 google/t5-base-lm-adapt (model) · 8.3k dl · ♡ 19
- 🤗 google/t5-large-lm-adapt (model) · 144 dl · ♡ 8
- 🤗 google/t5-small-lm-adapt (model) · 235 dl · ♡ 10
- 🤗 google/t5-v1_1-base (model) · 26k dl · ♡ 59
- 🤗 google/t5-v1_1-large (model) · 64k dl · ♡ 18
- 🤗 google/t5-v1_1-xl (model) · 17k dl · ♡ 16
- 🤗 google/t5-v1_1-xxl (model) · 531k dl · ♡ 145
- 🤗 google/t5-xl-lm-adapt (model) · 50 dl · ♡ 14
- 🤗 google/t5-xxl-lm-adapt (model) · 209 dl · ♡ 10
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Neural Networks and Reservoir Computing · Analog and Mixed-Signal Circuit Design
Methods: Test · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · SwiGLU · GeGLU · ReGLU · Gated Linear Unit · Residual Connection · Byte Pair Encoding
