Synthesizer: Rethinking Self-Attention in Transformer Models
Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, Che Zheng

TL;DR
This paper questions whether dot product self-attention is necessary in Transformers. It introduces Synthesizer models, which learn synthetic attention weights, and reports competitive or superior performance across a range of NLP tasks.
Contribution
The paper proposes Synthesizer, an attention mechanism that learns its attention weights without token-token (query-key) interactions, and shows that it is competitive with vanilla Transformers while offering efficiency gains.
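To make "attention without token-token interactions" concrete, here is a minimal dependency-free sketch of the Random Synthesizer idea: the attention logits are a learned matrix R of shape (length x length) that never looks at the input, and the output is softmax(R) applied to the values. The function names are illustrative, and the value projection is assumed to be the identity for brevity; a real model would train R and a value projection.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Plain-Python matrix product of a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def synthesizer_weights(logits):
    """Row-wise softmax of the learned logits R; note there is no input argument."""
    return [softmax(row) for row in logits]


def random_synthesizer(values, logits):
    """Mix the value vectors (length x dim) with input-independent weights."""
    return matmul(synthesizer_weights(logits), values)


rng = random.Random(0)
seq_len, d_model = 4, 3
# R stands in for a trainable parameter; here it is just randomly initialised.
R = [[rng.gauss(0, 1) for _ in range(seq_len)] for _ in range(seq_len)]
X = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(seq_len)]
Y = random_synthesizer(X, R)  # assumption: identity value projection
```

Because the weights depend only on R, the same alignment pattern is applied to every input sequence, which is exactly what dot product attention does not do.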
Findings
Random Synthesizer performs surprisingly well.
When composed with dot product attention, Synthesizers outperform vanilla Transformers.
The simple Random Synthesizer is both faster (about 60%) than Dynamic Convolutions and more effective (a relative 3.5% improvement in perplexity).
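The second finding concerns combining synthesized and dot product attention. One way to sketch such a combination is to mix the two sets of logits before the softmax. The snippet below assumes a single scalar mixing weight `alpha` and identity query, key, and value projections; all names are illustrative rather than taken from the authors' code.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Plain-Python matrix product of a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def dot_product_logits(x):
    """Scaled dot product logits, assuming identity query/key projections."""
    scale = math.sqrt(len(x[0]))
    x_transposed = [list(col) for col in zip(*x)]
    return [[v / scale for v in row] for row in matmul(x, x_transposed)]


def mixed_attention(x, synth_logits, alpha):
    """Softmax over alpha * synthesized logits + (1 - alpha) * dot product logits."""
    dot_logits = dot_product_logits(x)
    mixed = [
        [alpha * s + (1.0 - alpha) * d for s, d in zip(s_row, d_row)]
        for s_row, d_row in zip(synth_logits, dot_logits)
    ]
    weights = [softmax(row) for row in mixed]
    return matmul(weights, x)  # assumption: identity value projection


tokens = [[1.0, 0.0], [0.0, 1.0]]
zero_logits = [[0.0, 0.0], [0.0, 0.0]]
only_synth = mixed_attention(tokens, zero_logits, alpha=1.0)
only_dot = mixed_attention(tokens, zero_logits, alpha=0.0)
```

With `alpha=1.0` the result ignores token content entirely, and with `alpha=0.0` it reduces to ordinary scaled dot product attention; values in between blend the two.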
Abstract
The dot product self-attention is known to be central and indispensable to state-of-the-art Transformer models. But is it really required? This paper investigates the true importance and contribution of the dot product-based self-attention mechanism on the performance of Transformer models. Via extensive experiments, we find that (1) random alignment matrices surprisingly perform quite competitively and (2) learning attention weights from token-token (query-key) interactions is useful but not that important after all. To this end, we propose Synthesizer, a model that learns synthetic attention weights without token-token interactions. In our experiments, we first show that simple Synthesizers achieve highly competitive performance when compared against vanilla Transformer models across a range of tasks, including machine translation, language modeling, text generation and…
Videos
Synthesizer: Rethinking Self-Attention in Transformer Models (Paper Explained), YouTube
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Feedforward Network · Factorized Dense Synthesized Attention · Factorized Random Synthesized Attention · Random Synthesized Attention · Dense Synthesized Attention · Synthesizer
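Among the listed methods, Factorized Random Synthesized Attention replaces the full (length x length) logit matrix with a product of two thin factors. The sketch below assumes factors R1 and R2 of shape (length x rank) whose product R1 R2^T gives the logits; the sizes and variable names are hypothetical and chosen only to show the parameter saving.

```python
import random


def factorized_logits(r1, r2):
    """Logits[i][j] is the dot product of row i of r1 with row j of r2 (R1 @ R2^T)."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in r2] for row_i in r1]


rng = random.Random(0)
seq_len, rank = 8, 2  # hypothetical sizes for illustration
R1 = [[rng.gauss(0, 1) for _ in range(rank)] for _ in range(seq_len)]
R2 = [[rng.gauss(0, 1) for _ in range(rank)] for _ in range(seq_len)]

logits = factorized_logits(R1, R2)   # still a full seq_len x seq_len matrix
full_params = seq_len * seq_len      # parameters of an unfactorized R
factored_params = 2 * seq_len * rank # parameters of R1 and R2 together
```

The factorized form stores fewer parameters whenever the rank is less than half the sequence length, at the cost of restricting the logits to a low-rank matrix.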
