One Wide Feedforward is All You Need

Telmo Pessoa Pires; António V. Lopes; Yannick Assogba; Hendra Setiawan

arXiv:2309.01826·cs.CL·October 24, 2023

Open Access

TL;DR

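This paper shows that a Transformer with a single FFN shared across the encoder, and no FFN in the decoder, loses only a modest amount of accuracy while using substantially fewer parameters. Widening the shared FFN back to the original parameter budget improves both accuracy and latency over Transformer Big.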

Contribution

The paper shares a single FFN across all encoder layers and removes the FFN from the decoder layers, substantially reducing the parameter count with only a modest drop in accuracy.

Findings

01. Shared FFN reduces model size by up to 50%.

02. Scaling the shared FFN improves accuracy and latency.

03. Removing FFN from decoder layers has minimal impact on performance.

Abstract

The Transformer architecture has two main non-embedding components: Attention and the Feed Forward Network (FFN). Attention captures interdependencies between words regardless of their position, while the FFN non-linearly transforms each input token independently. In this work we explore the role of the FFN, and find that despite taking up a significant fraction of the model's parameters, it is highly redundant. Concretely, we are able to substantially reduce the number of parameters with only a modest drop in accuracy by removing the FFN on the decoder layers and sharing a single FFN across the encoder. Finally we scale this architecture back to its original size by increasing the hidden dimension of the shared FFN, achieving substantial gains in both accuracy and latency with respect to the original Transformer Big.
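The parameter trade-off described in the abstract can be made concrete with some rough arithmetic. The sketch below is not the authors' code. It assumes the commonly cited Transformer Big configuration (model dimension 1024, FFN hidden dimension 4096, 6 encoder and 6 decoder layers), which this page does not state, and it counts FFN weights and biases only. The helper names `ffn_params` and `widest_shared_ffn` are hypothetical.

```python
# Rough FFN parameter accounting for the "one shared FFN" idea.
# ASSUMPTION: Transformer Big dimensions below are the commonly cited
# ones; they are not given on this page.
D_MODEL, D_FF = 1024, 4096
ENC_LAYERS, DEC_LAYERS = 6, 6


def ffn_params(d_model: int, d_ff: int) -> int:
    """Parameters of one position-wise FFN: two linear layers with biases."""
    first = d_model * d_ff + d_ff      # d_model -> d_ff
    second = d_ff * d_model + d_model  # d_ff -> d_model
    return first + second


def widest_shared_ffn(d_model: int, budget: int) -> int:
    """Largest hidden size whose single FFN fits within `budget` parameters.

    Uses ffn_params = d_ff * (2 * d_model + 1) + d_model, solved for d_ff.
    """
    return (budget - d_model) // (2 * d_model + 1)


per_layer = ffn_params(D_MODEL, D_FF)

# Baseline: every encoder and decoder layer has its own FFN.
baseline = (ENC_LAYERS + DEC_LAYERS) * per_layer

# Reduced model: one FFN shared by all encoder layers, none in the decoder.
shared = per_layer

# Scaled-up variant: spend the whole baseline FFN budget on one wide FFN.
wide_d_ff = widest_shared_ffn(D_MODEL, baseline)

print(f"FFN params per layer:       {per_layer:,}")
print(f"Baseline FFN params:        {baseline:,}")
print(f"Single shared FFN params:   {shared:,}")
print(f"Hidden size at same budget: {wide_d_ff:,}")
```

Under these assumptions, the single shared FFN holds one twelfth of the baseline FFN parameters, and the same budget supports a hidden dimension roughly twelve times larger. Attention, embedding, and layer-norm parameters are deliberately left out, so these figures describe the FFN share only, not the whole model.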

Taxonomy

Topics: Neural Networks and Applications · Embedded Systems Design Techniques · Advanced Neural Network Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Adam · Byte Pair Encoding · Softmax · Dropout · Label Smoothing · Absolute Position Encodings