Fastformer: Additive Attention Can Be All You Need
Chuhan Wu, Fangzhao Wu, Tao Qi, Yongfeng Huang, Xing Xie

TL;DR
Fastformer is a linear-complexity Transformer that uses additive attention to model global context. It runs faster than many existing Transformer models while matching or improving on their performance on long-text tasks.
Contribution
The paper proposes Fastformer, a Transformer variant that replaces pair-wise token attention with additive attention, so that global context is modeled with complexity linear in the input sequence length.
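To make the complexity claim concrete, here is a rough sketch that counts attention scores only. Pair-wise attention scores every (query, key) pair, whereas additive attention scores each token once per pooling step. The function names are hypothetical, and the default of two pooling steps (one over queries, one over keys) is an assumption that is not stated in the summary above.

```python
def pairwise_score_count(seq_len: int) -> int:
    """Scores computed by pair-wise attention: one per (query, key) pair."""
    return seq_len * seq_len


def additive_score_count(seq_len: int, pooling_steps: int = 2) -> int:
    """Scores computed by additive attention: one per token per pooling step.

    pooling_steps=2 is an assumption (one pooling over queries, one over keys).
    """
    return seq_len * pooling_steps


# Doubling the sequence length quadruples the pair-wise count
# but only doubles the additive count.
for n in (128, 512, 2048):
    print(n, pairwise_score_count(n), additive_score_count(n))
```

The count ignores the per-token projection cost, which is the same in both cases and does not depend on how many other tokens are in the sequence.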
Findings
Fastformer achieves significant speed improvements over traditional Transformers.
It matches or surpasses existing models on long-text understanding tasks.
Experiments on five datasets validate its efficiency and effectiveness.
Abstract
Transformer is a powerful model for text understanding. However, it is inefficient due to its quadratic complexity to input sequence length. Although there are many methods on Transformer acceleration, they are still either inefficient on long sequences or not effective enough. In this paper, we propose Fastformer, which is an efficient Transformer model based on additive attention. In Fastformer, instead of modeling the pair-wise interactions between tokens, we first use additive attention mechanism to model global contexts, and then further transform each token representation based on its interaction with global context representations. In this way, Fastformer can achieve effective context modeling with linear complexity. Extensive experiments on five datasets show that Fastformer is much more efficient than many existing Transformer models and can meanwhile achieve comparable or even…
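The two stages described in the abstract (pool tokens into global context vectors with additive attention, then transform each token by its interaction with those vectors) can be sketched in plain Python as follows. This is a minimal single-head illustration, not the authors' implementation. The abstract does not specify how a token "interacts" with a global vector, so the element-wise product, the scaling by the square root of the dimension, the output projection, and the residual addition of the query are all assumptions here, and every function and parameter name is hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(x, weight):
    """Apply a (d_in x d_out) weight matrix to one vector x of length d_in."""
    return [sum(xi * weight[i][j] for i, xi in enumerate(x))
            for j in range(len(weight[0]))]


def additive_pool(vectors, score_vector):
    """Additive attention: score each vector once, return the weighted sum."""
    scale = math.sqrt(len(score_vector))
    scores = [sum(v * w for v, w in zip(vec, score_vector)) / scale
              for vec in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    return [sum(a * vec[j] for a, vec in zip(weights, vectors))
            for j in range(dim)]


def fastformer_attention(tokens, params):
    """Single-head sketch; each token is visited a constant number of times."""
    queries = [linear(x, params["w_q"]) for x in tokens]
    keys = [linear(x, params["w_k"]) for x in tokens]
    values = [linear(x, params["w_v"]) for x in tokens]

    # Step 1: summarize all queries into one global query vector.
    global_query = additive_pool(queries, params["score_q"])

    # Step 2: let each key interact with the global query (element-wise
    # product, an assumption), then summarize into one global key vector.
    mixed_keys = [[g * k for g, k in zip(global_query, key)] for key in keys]
    global_key = additive_pool(mixed_keys, params["score_k"])

    # Step 3: transform each value by its interaction with the global key,
    # project it, and add the query back as a residual (also an assumption).
    outputs = []
    for query, value in zip(queries, values):
        mixed_value = [g * v for g, v in zip(global_key, value)]
        projected = linear(mixed_value, params["w_o"])
        outputs.append([p + q for p, q in zip(projected, query)])
    return outputs


def random_params(dim, seed=0):
    """Hypothetical initializer for the demo; not a training recipe."""
    rng = random.Random(seed)

    def matrix():
        return [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]

    def vector():
        return [rng.gauss(0.0, 0.1) for _ in range(dim)]

    return {"w_q": matrix(), "w_k": matrix(), "w_v": matrix(),
            "w_o": matrix(), "score_q": vector(), "score_k": vector()}


# Demo: six random tokens of dimension four.
demo_rng = random.Random(1)
demo_tokens = [[demo_rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)]
demo_out = fastformer_attention(demo_tokens, random_params(4))
print(len(demo_out), len(demo_out[0]))  # one output vector per input token
```

No step compares one token against every other token: each pooling pass and each interaction pass is a single loop over the sequence, which is where the linear complexity comes from.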
Code & Models
- MindCode-4/code-7/tree/main/fastformer-additive · mindspore
- MindSpore-scientific-2/code-12/tree/main/fastformer-additive · mindspore
- nanzhaogang/contrib/tree/master/application/fastformer-additive-attention-can-b-all-you-need · mindspore
- lucidrains/fast-transformer-pytorch · pytorch
- MindCode-4/code-11/tree/main/fastformer-additive · mindspore
Videos
Fastformer: Additive Attention Can Be All You Need (Machine Learning Research Paper Explained) · youtube
Fastformer: Additive Attention Can Be All You Need | Paper Explained · youtube
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Fastformer · Layer Normalization · Adam · Label Smoothing · Tanh Activation · Softmax
