Synthesizer: Rethinking Self-Attention in Transformer Models

Yi Tay; Dara Bahri; Donald Metzler; Da-Cheng Juan; Zhe Zhao; Che Zheng

arXiv:2005.00743 · cs.CL · May 25, 2021 · 198 cites

TL;DR

This paper questions whether dot product self-attention is necessary in Transformers. It introduces Synthesizer models, which learn synthetic attention weights without token-token interactions and achieve competitive or superior performance across various NLP tasks.

Contribution

The paper proposes Synthesizer, an attention mechanism that synthesizes attention weights without token-token (query-key) interactions, and shows that it performs competitively with traditional Transformers while offering efficiency gains over them.
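To make "without token-token interactions" concrete, here is a minimal single-head sketch of a dense synthesized attention layer in NumPy. It assumes that each token's row of attention logits is produced from that token alone by a small two-layer feed-forward map whose output width equals the sequence length. The parameter names (`W1`, `b1`, `W2`, `b2`, `Wv`), the toy sizes, and the random initialization are illustrative assumptions, not taken from the authors' code.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def dense_synthesizer(X, W1, b1, W2, b2, Wv):
    """Single-head dense synthesized attention (illustrative sketch).

    X has shape (seq_len, d_model). Row i of the attention logits is
    computed from token i alone, so no query-key product is formed.
    """
    hidden = np.maximum(X @ W1 + b1, 0.0)  # per-token ReLU layer
    logits = hidden @ W2 + b2              # (seq_len, seq_len)
    weights = softmax(logits, axis=-1)     # each row sums to 1
    values = X @ Wv                        # (seq_len, d_model)
    return weights @ values, weights


# Hypothetical toy sizes and randomly initialized parameters.
rng = np.random.default_rng(0)
seq_len, d_model, d_hidden = 6, 8, 16
X = rng.normal(size=(seq_len, d_model))
W1 = 0.1 * rng.normal(size=(d_model, d_hidden))
b1 = np.zeros(d_hidden)
W2 = 0.1 * rng.normal(size=(d_hidden, seq_len))
b2 = np.zeros(seq_len)
Wv = 0.1 * rng.normal(size=(d_model, d_model))

out, weights = dense_synthesizer(X, W1, b1, W2, b2, Wv)
print(out.shape, weights.shape)
```

Because each row of `weights` depends only on its own token, perturbing one input token leaves every other row of the attention matrix unchanged, which is not the case for dot product attention.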

Findings

1. The Random Synthesizer performs surprisingly well.

2. Synthesizers combined with dot product attention outperform Transformers.

3. Simple Synthesizers are faster and more effective than Dynamic Convolutions.

Abstract

The dot product self-attention is known to be central and indispensable to state-of-the-art Transformer models. But is it really required? This paper investigates the true importance and contribution of the dot product-based self-attention mechanism on the performance of Transformer models. Via extensive experiments, we find that (1) random alignment matrices surprisingly perform quite competitively and (2) learning attention weights from token-token (query-key) interactions is useful but not that important after all. To this end, we propose Synthesizer, a model that learns synthetic attention weights without token-token interactions. In our experiments, we first show that simple Synthesizers achieve highly competitive performance when compared against vanilla Transformer models across a range of tasks, including machine translation, language modeling, text generation and…
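The contrast between query-key attention and a random alignment matrix can be sketched in a few lines of NumPy. This is a minimal single-head illustration, assuming the random variant replaces the query-key logits with a single matrix `R` of shape (seq_len, seq_len) that is shared across all inputs (it may be trained or kept fixed). All names and sizes here are hypothetical and chosen for the example only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def dot_product_attention(X, Wq, Wk, Wv):
    """Standard scaled dot product self-attention for one head."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    logits = Q @ K.T / np.sqrt(Q.shape[-1])  # token-token interactions
    weights = softmax(logits, axis=-1)
    return weights @ V, weights


def random_synthesizer(X, R, Wv):
    """Random synthesized attention: the logits R do not depend on X."""
    weights = softmax(R, axis=-1)
    return weights @ (X @ Wv), weights


# Hypothetical toy sizes and randomly initialized parameters.
rng = np.random.default_rng(0)
seq_len, d_model = 5, 4
Wq, Wk, Wv = (0.5 * rng.normal(size=(d_model, d_model)) for _ in range(3))
R = rng.normal(size=(seq_len, seq_len))

# Two different input sequences of the same length.
X_a = rng.normal(size=(seq_len, d_model))
X_b = rng.normal(size=(seq_len, d_model))

_, dot_a = dot_product_attention(X_a, Wq, Wk, Wv)
_, dot_b = dot_product_attention(X_b, Wq, Wk, Wv)
out_a, rand_a = random_synthesizer(X_a, R, Wv)
out_b, rand_b = random_synthesizer(X_b, R, Wv)

print("dot product weights identical:", np.allclose(dot_a, dot_b))
print("random synthesizer weights identical:", np.allclose(rand_a, rand_b))
```

In this sketch the random variant applies the same mixing pattern to every input by construction, so only the value projection sees the tokens. That is the sense in which its alignment matrix involves no token-token interaction.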

Code & Models

Repositories

10-zin/Synthesizer · PyTorch

Videos

Synthesizer: Rethinking Self-Attention in Transformer Models (Paper Explained) · YouTube

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Feedforward Network · Factorized Dense Synthesized Attention · Factorized Random Synthesized Attention · Random Synthesized Attention · Dense Synthesized Attention · Synthesizer