The Devil in Linear Transformer
Zhen Qin, XiaoDong Han, Weixuan Sun, Dongxu Li, Lingpeng Kong, Nick Barnes, Yiran Zhong

TL;DR
This paper introduces TransNormer, a linear transformer that stabilizes gradients by replacing attention scaling with normalization and uses diagonal attention to capture local context, improving performance and efficiency on NLP tasks and benchmarks.
Contribution
It proposes a new linear attention mechanism that replaces scaling with normalization and employs diagonal attention for local context, addressing two issues in existing linear transformers: unbounded gradients and attention dilution.
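The two ideas above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The following are assumptions made for illustration only: the `elu(x) + 1` feature map, the RMS-style normalization in `rms_norm`, the non-overlapping block-wise softmax used for the diagonal attention, the non-causal setting, and every function name.

```python
import numpy as np


def feature_map(x):
    # Assumed positive kernel feature map: elu(x) + 1.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def scaled_linear_attention(q, k, v, eps=1e-6):
    # Baseline kernel-based linear attention with the scaling (denominator) term:
    # O_i = phi(q_i) (sum_j phi(k_j)^T v_j) / (phi(q_i) . sum_j phi(k_j))
    q, k = feature_map(q), feature_map(k)
    kv = k.T @ v               # (d, d_v), computed once: linear in sequence length
    z = q @ k.sum(axis=0)      # (n,), per-query scaling factor
    return (q @ kv) / (z[:, None] + eps)


def rms_norm(x, eps=1e-6):
    # Assumed normalization: rescale each row to unit root-mean-square.
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)


def norm_linear_attention(q, k, v):
    # Same linear-time product, but the scaling term is dropped and the
    # output is normalized instead.
    q, k = feature_map(q), feature_map(k)
    kv = k.T @ v
    return rms_norm(q @ kv)


def block_diagonal_attention(q, k, v, block_size):
    # Assumed form of "diagonal attention": softmax attention restricted to
    # non-overlapping blocks, so each token only attends to its neighbours.
    n, d = q.shape
    out = np.zeros((n, v.shape[1]))
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        scores = q[start:end] @ k[start:end].T / np.sqrt(d)
        scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
        weights = np.exp(scores)
        weights = weights / weights.sum(axis=-1, keepdims=True)
        out[start:end] = weights @ v[start:end]
    return out


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
print(scaled_linear_attention(q, k, v).shape)            # (8, 4)
print(norm_linear_attention(q, k, v).shape)              # (8, 4)
print(block_diagonal_attention(q, k, v, block_size=4).shape)  # (8, 4)
```

In this sketch the cost of both linear variants is dominated by `k.T @ v`, which grows linearly with sequence length, while the block-diagonal variant costs a quadratic amount per block only.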
Findings
Outperforms vanilla and existing linear transformers on NLP benchmarks.
Demonstrates superior performance on the Long-Range Arena benchmark.
Achieves better space-time efficiency while maintaining high accuracy.
Abstract
Linear transformers aim to reduce the quadratic space-time complexity of vanilla transformers. However, they usually suffer from degraded performance on various tasks and corpora. In this paper, we examine existing kernel-based linear transformers and identify two key issues that lead to such performance gaps: 1) unbounded gradients in the attention computation, which adversely impact the convergence of linear transformer models; 2) attention dilution, which trivially distributes attention scores over long sequences while neglecting neighbouring structures. To address these issues, we first identify that the scaling of attention matrices is the devil in unbounded gradients, and that it turns out to be unnecessary in linear attention, as we show theoretically and empirically. To this end, we propose a new linear attention that replaces the scaling operation with a normalization to stabilize gradients. For…
Taxonomy
Topics: Topic Modeling · Text and Document Classification Technologies · Web Data Mining and Analysis
