cosFormer: Rethinking Softmax in Attention
Zhen Qin, Weixuan Sun, Hui Deng, Dongxu Li, Yunshen Wei, Baohong Lv, Junjie Yan, Lingpeng Kong, Yiran Zhong

TL;DR
cosFormer introduces a linear attention mechanism based on cosine re-weighting. It keeps accuracy comparable to standard softmax attention while reducing time and space complexity from quadratic to linear in the sequence length, which makes long sequences practical to process.
Contribution
The paper presents cosFormer, a linear transformer that preserves two key properties of softmax attention, the non-negativity of the attention matrix and a non-linear re-weighting scheme, by using cosine-based re-weighting. This improves efficiency and performance on long-sequence tasks.
Findings
Achieves state-of-the-art results on the Long-Range Arena benchmark.
Maintains accuracy comparable to softmax attention in language modeling.
Reduces attention complexity from quadratic to linear in the sequence length, enabling scalable long-sequence processing.
Abstract
The Transformer has shown great success in natural language processing, computer vision, and audio processing. As one of its core components, softmax attention helps to capture long-range dependencies, yet it prohibits scale-up because its space and time complexity is quadratic in the sequence length. Kernel methods are often adopted to reduce the complexity by approximating the softmax operator. Nevertheless, due to the approximation errors, their performance varies across tasks and corpora and drops substantially when compared with vanilla softmax attention. In this paper, we propose a linear transformer called cosFormer that can achieve comparable or better accuracy than the vanilla transformer in both causal and cross attention. cosFormer is based on two key properties of softmax attention: (i) non-negativeness of the attention matrix; (ii) a non-linear re-weighting…
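The two properties named in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes a ReLU feature map to keep attention weights non-negative and a re-weighting term cos(π/2 · (i − j) / N) between positions i and j; neither specific is stated in the truncated abstract above, and the function names are hypothetical. Because cos(a − b) = cos a · cos b + sin a · sin b, the weight separates into per-query and per-key factors, so the N × N matrix never has to be formed.

```python
import numpy as np


def quadratic_cos_attention(q, k, v, eps=1e-6):
    """Reference O(N^2) form: materializes the full N x N weight matrix."""
    n = q.shape[0]
    # Assumed feature map: ReLU keeps every attention weight non-negative.
    qp, kp = np.maximum(q, 0.0), np.maximum(k, 0.0)
    idx = np.arange(n)
    # Assumed re-weighting: close to 1 for nearby positions, smaller far away.
    w = np.cos(np.pi / 2 * (idx[:, None] - idx[None, :]) / n)
    scores = (qp @ kp.T) * w                      # shape (N, N)
    denom = scores.sum(axis=1, keepdims=True) + eps
    return (scores @ v) / denom


def linear_cos_attention(q, k, v, eps=1e-6):
    """Same output, computed without ever forming an N x N matrix."""
    n = q.shape[0]
    qp, kp = np.maximum(q, 0.0), np.maximum(k, 0.0)
    angle = (np.pi / 2 * np.arange(n) / n)[:, None]   # shape (N, 1)
    # Split the cosine weight into separate query-side and key-side factors.
    q_cos, q_sin = qp * np.cos(angle), qp * np.sin(angle)
    k_cos, k_sin = kp * np.cos(angle), kp * np.sin(angle)
    # Key/value summaries of shape (d, d_v); cost grows linearly with N.
    kv_cos = k_cos.T @ v
    kv_sin = k_sin.T @ v
    num = q_cos @ kv_cos + q_sin @ kv_sin
    den = q_cos @ k_cos.sum(axis=0) + q_sin @ k_sin.sum(axis=0)
    return num / (den[:, None] + eps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 16, 8                                  # toy sizes for illustration
    q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
    same = np.allclose(linear_cos_attention(q, k, v),
                       quadratic_cos_attention(q, k, v), atol=1e-6)
    print("linear and quadratic forms agree:", same)
```

The sketch covers the non-causal case only. A causal variant would replace the global key/value sums with running prefix sums over positions.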
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Human Pose and Action Recognition
Methods: Softmax
