Revisiting the Shape Convention of Transformer Language Models
Feng-Ting Liao, Meng-Hsi Chen, Guan-Ting Yi, Da-shan Shiu

TL;DR
This paper challenges the traditional narrow-wide-narrow MLP shape in Transformer language models by proposing and empirically validating deeper hourglass-shaped FFNs, leading to more efficient and effective architectures.
Contribution
The paper introduces an hourglass-shaped FFN architecture for Transformers and demonstrates its advantages over the conventional design through extensive experiments.
Findings
Hourglass FFNs outperform conventional FFNs at model scales up to 400M parameters.
At 1B parameters, hourglass FFNs perform comparably to conventional FFNs.
Pairing lighter hourglass FFNs with more attention parameters improves efficiency.
Abstract
Dense Transformer language models have largely adhered to one consistent architectural shape: each layer consists of an attention module followed by a feed-forward network (FFN) with a narrow-wide-narrow MLP, allocating most parameters to the MLP at expansion ratios between 2 and 4. Motivated by recent results that residual wide-narrow-wide (hourglass) MLPs offer superior function approximation capabilities, we revisit the long-standing MLP shape convention in Transformer, challenging the necessity of the narrow-wide-narrow design. To study this, we develop a Transformer variant that replaces the conventional FFN with a deeper hourglass-shaped FFN, comprising a stack of hourglass sub-MLPs connected by residual pathways. We posit that a deeper but lighter hourglass FFN can serve as a competitive alternative to the conventional FFN, and that parameters saved by using a lighter hourglass…
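To make the shape comparison in the abstract concrete, the following is a minimal sketch of the weight counts involved, not the authors' implementation. It assumes bias-free, ungated two-layer MLPs, and it assumes each hourglass sub-MLP maps d_model to a narrower bottleneck and back. The function names, the `contraction` ratio, and the example sizes are hypothetical and chosen only for illustration.

```python
def conventional_ffn_params(d_model: int, expansion: float) -> int:
    """Weights in a narrow-wide-narrow MLP: d_model -> expansion * d_model -> d_model.

    Assumption: two weight matrices, no biases, no gating.
    """
    d_hidden = int(expansion * d_model)
    return 2 * d_model * d_hidden


def hourglass_ffn_params(d_model: int, contraction: float, num_sub_mlps: int) -> int:
    """Weights in a stack of wide-narrow-wide sub-MLPs joined by residual connections.

    Assumption: each sub-MLP is d_model -> contraction * d_model -> d_model,
    with contraction < 1. Residual pathways add no weights.
    """
    d_bottleneck = int(contraction * d_model)
    return num_sub_mlps * 2 * d_model * d_bottleneck


# Hypothetical sizes, not taken from the paper.
d_model = 1024
conventional = conventional_ffn_params(d_model, expansion=4)
hourglass = hourglass_ffn_params(d_model, contraction=0.5, num_sub_mlps=3)
saved = conventional - hourglass

print(f"conventional FFN weights: {conventional:,}")
print(f"hourglass FFN weights:    {hourglass:,}")
print(f"weights freed per layer:  {saved:,}")
```

Under these assumed sizes, a three-deep hourglass stack uses fewer weights per layer than a single 4x-expansion MLP, which is the kind of saving the abstract describes as available for use elsewhere in the model.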
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Generative Adversarial Networks and Image Synthesis
