RealFormer: Transformer Likes Residual Attention
Ruining He, Anirudh Ravula, Bhargav Kanagal, Joshua Ainslie

TL;DR
RealFormer introduces a residual attention mechanism in Transformer models, significantly improving performance and stability across various NLP tasks compared to standard Transformers.
Contribution
The paper proposes a simple residual attention layer for Transformers that improves performance and training stability over the canonical Transformer and variants such as BERT and ETC.
Findings
Outperforms canonical Transformer on multiple NLP benchmarks
Stabilizes training and results in sparser attention patterns
Achieves superior results on tasks like GLUE, SQuAD, and translation
Abstract
Transformer is the backbone of modern NLP models. In this paper, we propose RealFormer, a simple and generic technique to create Residual Attention Layer Transformer networks that significantly outperform the canonical Transformer and its variants (BERT, ETC, etc.) on a wide spectrum of tasks including Masked Language Modeling, GLUE, SQuAD, Neural Machine Translation, WikiHop, HotpotQA, Natural Questions, and OpenKP. We also observe empirically that RealFormer stabilizes training and leads to models with sparser attention. Source code and pre-trained checkpoints for RealFormer can be found at https://github.com/google-research/google-research/tree/master/realformer.
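The abstract names the technique but does not spell out the mechanism. As described in the RealFormer paper, each attention layer adds the previous layer's raw (pre-softmax) attention scores to its own scores before applying the softmax, and passes the summed scores on to the next layer. The block below is a minimal single-head sketch of that idea in plain Python, assuming no masking, no learned projections, and no multi-head split. The names `softmax` and `residual_attention` are illustrative and are not taken from the released source code.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def residual_attention(q, k, v, prev_scores=None):
    """Single-head attention with a residual connection on the raw scores.

    q, k, v are lists of row vectors. prev_scores is the pre-softmax score
    matrix from the previous layer, or None for the first layer.
    Returns (output, scores); scores is what the next layer should receive.
    Hypothetical helper for illustration only.
    """
    d_k = len(q[0])
    # Scaled dot-product scores for every (query, key) pair.
    scores = [
        [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in k]
        for q_row in q
    ]
    # Residual step: add the previous layer's pre-softmax scores.
    if prev_scores is not None:
        scores = [
            [s + p for s, p in zip(s_row, p_row)]
            for s_row, p_row in zip(scores, prev_scores)
        ]
    weights = [softmax(row) for row in scores]
    # Weighted sum of the value vectors.
    output = [
        [sum(w * v_row[j] for w, v_row in zip(w_row, v)) for j in range(len(v[0]))]
        for w_row in weights
    ]
    return output, scores


# Toy inputs: two tokens with two-dimensional queries, keys, and values.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]

out1, s1 = residual_attention(q, k, v)      # first layer: no previous scores
out2, s2 = residual_attention(q, k, v, s1)  # second layer: reuses s1
```

In this toy setup both layers see the same inputs, so the second layer's scores are exactly twice the first layer's and its attention weights are more peaked. In a real model each layer has its own projections, so the effect would not be this simple.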
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Linear Layer · InfoNCE · Absolute Position Encodings · Relative Position Encodings · Position-Wise Feed-Forward Layer · Global-Local Attention · Contrastive Predictive Coding · Extended Transformer Construction · RealFormer · Softmax
