Not all parameters are born equal: Attention is mostly what you need
Nikolay Bogoychev

TL;DR
This paper investigates the relative importance of transformer components in machine translation and language modeling by randomly initialising groups of parameters and freezing them during training. It finds that attention and FFN layers are equally important, while the importance of embeddings depends on the task.
Contribution
It introduces an ablation study that assesses the importance of embeddings, attention, and FFN layers by initialising each group at random and freezing it, and shows that the decision to train or freeze a component matters at least as much as its parameter count.
Findings
Attention and FFN are equally important in transformers.
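Whether a component is frozen or trained matters at least as much for final model performance as its number of parameters.
Parameter count alone does not indicate a component's importance.
Embeddings are the least essential component for machine translation but vital for language modeling.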
Abstract
Transformers are widely used in state-of-the-art machine translation, but the key to their success is still unknown. To gain insight into this, we consider three groups of parameters: embeddings, attention, and feed forward neural network (FFN) layers. We examine the relative importance of each by performing an ablation study where we initialise them at random and freeze them, so that their weights do not change over the course of the training. Through this, we show that the attention and FFN are equally important and fulfil the same functionality in a model. We show that the decision about whether a component is frozen or allowed to train is at least as important for the final model performance as its number of parameters. At the same time, the number of parameters alone is not indicative of a component's importance. Finally, while the embedding layer is the least essential for machine…
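The freezing setup described in the abstract can be sketched in a few lines. What follows is a minimal illustration in plain Python, not the author's implementation: the parameter names, shapes, the prefix-based grouping in `group_of`, and the plain SGD update are all assumptions made for the example. Parameters in a frozen group keep their random initial values, while all others are updated.

```python
import random

# Hypothetical parameter names and shapes for a toy model. They are
# illustrative only and are not taken from the paper.
PARAM_SHAPES = {
    "embedding.weight": (100, 8),
    "attention.q_proj": (8, 8),
    "attention.k_proj": (8, 8),
    "attention.v_proj": (8, 8),
    "attention.out_proj": (8, 8),
    "ffn.up": (8, 32),
    "ffn.down": (32, 8),
}

# The three parameter groups considered in the abstract.
GROUPS = ("embedding", "attention", "ffn")


def group_of(name):
    """Map a parameter name to its group via the prefix before the first dot."""
    prefix = name.split(".", 1)[0]
    if prefix not in GROUPS:
        raise ValueError(f"unknown parameter group: {name}")
    return prefix


def numel(shape):
    """Number of scalar weights in a 2-D parameter."""
    rows, cols = shape
    return rows * cols


def init_params(shapes, seed=0):
    """Initialise every parameter at random, stored as a flat list of floats."""
    rng = random.Random(seed)
    return {
        name: [rng.gauss(0.0, 0.02) for _ in range(numel(shape))]
        for name, shape in shapes.items()
    }


def sgd_step(params, grads, frozen_groups, lr=0.1):
    """Apply one SGD update in place, skipping parameters in frozen groups."""
    for name, values in params.items():
        if group_of(name) in frozen_groups:
            continue  # frozen: weights stay at their random initial values
        grad = grads[name]
        for i in range(len(values)):
            values[i] -= lr * grad[i]


def trainable_counts(shapes, frozen_groups):
    """Count trainable weights per group under a given freezing decision."""
    counts = {group: 0 for group in GROUPS}
    for name, shape in shapes.items():
        group = group_of(name)
        if group not in frozen_groups:
            counts[group] += numel(shape)
    return counts
```

Calling `sgd_step(params, grads, frozen_groups={"attention"})` leaves every `attention.*` parameter untouched, and `trainable_counts` shows how many weights each group still contributes to training. Comparing runs with different `frozen_groups` is the kind of ablation the abstract describes, here reduced to a toy scale.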
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Neural Networks and Applications
