RealFormer: Transformer Likes Residual Attention

Ruining He; Anirudh Ravula; Bhargav Kanagal; Joshua Ainslie

arXiv:2012.11747 · cs.LG · September 14, 2021 · 20 cites

TL;DR

RealFormer adds a residual attention mechanism to Transformer models, improving performance and training stability over the canonical Transformer across a range of NLP tasks.

Contribution

The paper proposes a simple residual attention layer for Transformers that improves performance and training stability over existing models such as BERT and ETC.

Findings

01 · Outperforms canonical Transformer on multiple NLP benchmarks

02 · Stabilizes training and results in sparser attention patterns

03 · Achieves superior results on tasks like GLUE, SQuAD, and translation

Abstract

Transformer is the backbone of modern NLP models. In this paper, we propose RealFormer, a simple and generic technique to create Residual Attention Layer Transformer networks that significantly outperform the canonical Transformer and its variants (BERT, ETC, etc.) on a wide spectrum of tasks including Masked Language Modeling, GLUE, SQuAD, Neural Machine Translation, WikiHop, HotpotQA, Natural Questions, and OpenKP. We also observe empirically that RealFormer stabilizes training and leads to models with sparser attention. Source code and pre-trained checkpoints for RealFormer can be found at https://github.com/google-research/google-research/tree/master/realformer.
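
The abstract does not spell out how a "Residual Attention Layer" works, so the following is only a minimal sketch under a stated assumption: that each layer adds the previous layer's raw (pre-softmax) attention scores to its own scores before the softmax, and passes the summed scores on to the next layer. The function names `softmax` and `residual_attention`, the single-head setup, and the toy inputs are hypothetical and chosen for illustration; the authors' actual implementation is in the repository linked above.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def residual_attention(q, k, v, prev_scores=None):
    """Single-head attention with a skip connection on the raw scores.

    q, k, v: lists of vectors (seq_len x d).
    prev_scores: pre-softmax scores from the previous layer
        (seq_len x seq_len), or None for the first layer.
    Returns (output, scores); `scores` is what the next layer receives.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    # Scaled dot-product scores, as in a standard Transformer layer.
    scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
              for qi in q]
    # Assumed residual step: add the previous layer's raw scores.
    if prev_scores is not None:
        scores = [[s + p for s, p in zip(srow, prow)]
                  for srow, prow in zip(scores, prev_scores)]
    weights = [softmax(row) for row in scores]
    dim_v = len(v[0])
    output = [[sum(w * vj[c] for w, vj in zip(wrow, v)) for c in range(dim_v)]
              for wrow in weights]
    return output, scores


# Toy usage: two stacked layers sharing the score "residual stream".
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out1, s1 = residual_attention(x, x, x)                    # first layer, no residual
out2, s2 = residual_attention(out1, out1, out1, prev_scores=s1)
```

In this sketch the only change relative to standard attention is the single addition before the softmax, which is consistent with the abstract's description of the technique as simple and generic. Learned projections, multiple heads, and masking are omitted for brevity.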

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Linear Layer · InfoNCE · Absolute Position Encodings · Relative Position Encodings · Position-Wise Feed-Forward Layer · Global-Local Attention · Contrastive Predictive Coding · Extended Transformer Construction · RealFormer · Softmax