Fastformer: Additive Attention Can Be All You Need

Chuhan Wu; Fangzhao Wu; Tao Qi; Yongfeng Huang; Xing Xie

arXiv:2108.09084 · cs.CL · September 7, 2021 · 79 cites

TL;DR

Fastformer introduces a linear-complexity Transformer model using additive attention to efficiently model global context, outperforming many existing models in speed while maintaining or improving performance on long text tasks.

Contribution

The paper proposes Fastformer, a Transformer variant that replaces pair-wise token attention with additive attention, so that global context is modeled with complexity linear in the input sequence length.

Findings

01. Fastformer achieves significant speed improvements over traditional Transformers.

02. It matches or surpasses existing models on long-text understanding tasks.

03. Experiments on five datasets validate its efficiency and effectiveness.

Abstract

Transformer is a powerful model for text understanding. However, it is inefficient due to its quadratic complexity to input sequence length. Although there are many methods on Transformer acceleration, they are still either inefficient on long sequences or not effective enough. In this paper, we propose Fastformer, which is an efficient Transformer model based on additive attention. In Fastformer, instead of modeling the pair-wise interactions between tokens, we first use additive attention mechanism to model global contexts, and then further transform each token representation based on its interaction with global context representations. In this way, Fastformer can achieve effective context modeling with linear complexity. Extensive experiments on five datasets show that Fastformer is much more efficient than many existing Transformer models and can meanwhile achieve comparable or even…
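The two steps the abstract describes (summarize the sequence into a global context vector with additive attention, then transform each token by interacting with that vector) can be sketched for a single head as below. This is a minimal illustration, not the authors' implementation. The abstract does not spell out the details, so the following are assumptions: the element-wise product as the "interaction", the scaling by sqrt(d), the residual connection to the query, and every name used here (`additive_pool`, `fastformer_head`, `Wq`, `Wk`, `Wv`, `wq`, `wk`, `Wo`). Plain Python lists are used to keep the sketch dependency-free.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_pool(vectors, w):
    """Additive attention: score each vector with a learned vector w,
    normalize the N scores with softmax, and return the weighted sum.
    Cost is O(N * d): one scalar score per token, no N x N matrix."""
    d = len(w)
    scores = [sum(wi * xi for wi, xi in zip(w, x)) / math.sqrt(d) for x in vectors]
    alphas = softmax(scores)
    return [sum(a * x[j] for a, x in zip(alphas, vectors)) for j in range(d)]


def project(X, W):
    """Row-wise linear map: X is N x d_in, W is d_in x d_out."""
    return [[sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]
            for x in X]


def fastformer_head(X, Wq, Wk, Wv, wq, wk, Wo):
    """One additive-attention head (assumed formulation, see lead-in)."""
    Q, K, V = project(X, Wq), project(X, Wk), project(X, Wv)
    # Step 1: pool all queries into a single global query vector.
    q_global = additive_pool(Q, wq)
    # Step 2: let every key interact with the global query (element-wise product).
    P = [[qg * kj for qg, kj in zip(q_global, k)] for k in K]
    # Step 3: pool the query-aware keys into a single global key vector.
    k_global = additive_pool(P, wk)
    # Step 4: let every value interact with the global key, then project.
    U = [[kg * vj for kg, vj in zip(k_global, v)] for v in V]
    R = project(U, Wo)
    # Step 5: residual connection back to the per-token queries.
    return [[r + q for r, q in zip(r_row, q_row)] for r_row, q_row in zip(R, Q)]


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


# Toy usage with random (untrained) parameters.
random.seed(0)
N, d = 6, 4  # sequence length and hidden size, chosen arbitrarily
X = rand_matrix(N, d)
Wq, Wk, Wv, Wo = (rand_matrix(d, d) for _ in range(4))
wq, wk = rand_matrix(1, d)[0], rand_matrix(1, d)[0]
out = fastformer_head(X, Wq, Wk, Wv, wq, wk, Wo)
print(len(out), len(out[0]))  # prints: 6 4
```

Every step touches each of the N tokens a constant number of times, which is where the linear complexity claimed in the abstract comes from. Standard self-attention would instead build an N x N score matrix.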

Code & Models

Videos

Fastformer: Additive Attention Can Be All You Need (Machine Learning Research Paper Explained) (YouTube)

Fastformer: Additive Attention Can Be All You Need | Paper Explained (YouTube)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Fastformer · Layer Normalization · Adam · Label Smoothing · Tanh Activation · Softmax