One Wide Feedforward is All You Need
Telmo Pessoa Pires, António V. Lopes, Yannick Assogba, Hendra Setiawan

TL;DR
This paper demonstrates that a simplified Transformer architecture with a shared, single feedforward network can maintain high performance while reducing parameters and improving efficiency.
Contribution
It shares a single FFN across all encoder layers and removes the FFN from the decoder layers, substantially reducing the parameter count with only a modest drop in accuracy.
Findings
Shared FFN reduces model size by up to 50%.
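Widening the shared FFN until the model is back at its original size improves both accuracy and latency relative to the original Transformer Big.
Removing the FFN from the decoder layers has minimal impact on performance.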
Abstract
The Transformer architecture has two main non-embedding components: Attention and the Feed Forward Network (FFN). Attention captures interdependencies between words regardless of their position, while the FFN non-linearly transforms each input token independently. In this work we explore the role of the FFN, and find that despite taking up a significant fraction of the model's parameters, it is highly redundant. Concretely, we are able to substantially reduce the number of parameters with only a modest drop in accuracy by removing the FFN on the decoder layers and sharing a single FFN across the encoder. Finally we scale this architecture back to its original size by increasing the hidden dimension of the shared FFN, achieving substantial gains in both accuracy and latency with respect to the original Transformer Big.
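The parameter accounting behind the abstract can be sketched in a few lines. This is a rough illustration only, not the authors' code: it assumes the standard Transformer Big settings (model dimension 1024, FFN hidden dimension 4096, 6 encoder and 6 decoder layers), assumes each FFN is two linear layers with biases, and counts FFN parameters only, ignoring attention and embeddings. The helper names `ffn_params` and `matching_width` are hypothetical.

```python
# Assumed Transformer Big settings; these values are not stated in the text above.
D_MODEL = 1024      # model (embedding) dimension
D_FF = 4096         # hidden dimension of each FFN
ENC_LAYERS = 6
DEC_LAYERS = 6


def ffn_params(d_model: int, d_ff: int) -> int:
    """Parameters of one FFN: d_model -> d_ff -> d_model, with biases."""
    return 2 * d_model * d_ff + d_ff + d_model


def matching_width(d_model: int, budget: int) -> int:
    """Largest hidden dimension whose single FFN fits within `budget` parameters.

    Solves 2*d_model*d_ff + d_ff + d_model <= budget for d_ff.
    """
    return (budget - d_model) // (2 * d_model + 1)


# Baseline: one FFN in every encoder layer and every decoder layer.
baseline = (ENC_LAYERS + DEC_LAYERS) * ffn_params(D_MODEL, D_FF)

# Reduced model: one FFN shared by all encoder layers, none in the decoder.
one_shared = ffn_params(D_MODEL, D_FF)

# Widened model: spend the whole baseline FFN budget on the single shared FFN.
wide_d_ff = matching_width(D_MODEL, baseline)

if __name__ == "__main__":
    print(f"baseline FFN parameters:   {baseline:,}")
    print(f"single shared FFN:         {one_shared:,}")
    print(f"FFN parameters removed:    {1 - one_shared / baseline:.1%}")
    print(f"matching hidden dimension: {wide_d_ff:,} (about {wide_d_ff / D_FF:.1f}x wider)")
```

Under these assumptions, the single shared FFN has to be roughly twelve times wider than an original FFN to bring the model back to its original FFN parameter budget, since it replaces twelve separate networks.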
Taxonomy
Topics: Neural Networks and Applications · Embedded Systems Design Techniques · Advanced Neural Network Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Adam · Byte Pair Encoding · Softmax · Dropout · Label Smoothing · Absolute Position Encodings
