Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences
Zicheng Liu, Siyuan Li, Li Wang, Zedong Wang, Yunfan Liu, and Stan Z. Li

TL;DR
This paper introduces CHELA, a hybrid approach combining short-long convolutions with hardware-efficient linear attention, enabling effective long sequence processing with real linear complexity and improved stability.
Contribution
It proposes a novel hybrid model that replaces SSMs with convolutions and implements linear attention efficiently for long sequences, addressing hardware and stability issues.
Findings
Outperforms existing methods on Long Range Arena benchmark
Achieves real linear complexity in long sequence processing
Demonstrates effectiveness on language modeling tasks
Abstract
To mitigate the computational complexity of the self-attention mechanism on long sequences, linear attention utilizes computation tricks to achieve linear complexity, while state space models (SSMs) popularize a favorable practice of using a non-data-dependent memory pattern, i.e., emphasizing the near and neglecting the distant, to process sequences. Recent studies have shown the benefits of combining the two. However, the efficiency of linear attention remains only at the theoretical level in a causal setting, and SSMs require various designed constraints to operate effectively on specific data. Therefore, in order to unveil the true power of the hybrid design, the following two issues need to be addressed: (1) hardware-efficient implementation for linear attention and (2) stabilization of SSMs. To achieve this, we leverage the ideas of tiling and hierarchy to propose CHELA…
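As an illustration of the "computation tricks" and tiling mentioned in the abstract, the following is a minimal NumPy sketch of causal linear attention computed chunk by chunk. It is not the authors' implementation of CHELA: the function names, the `elu(x) + 1` feature map, and the chunk size are assumptions chosen for illustration. The sketch only shows that a running key-value summary plus a small within-chunk causal product gives the same output as the quadratic masked form.

```python
import numpy as np

def feature_map(x):
    # Assumed positive feature map, elu(x) + 1 (illustrative, not from the paper).
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention_quadratic(q, k, v):
    # Reference form: builds the full (n, n) causal score matrix, O(n^2) in length.
    q, k = feature_map(q), feature_map(k)
    scores = np.tril(q @ k.T)
    return (scores @ v) / scores.sum(axis=1, keepdims=True)

def causal_linear_attention_chunked(q, k, v, chunk=4):
    # Tiled form: cost grows linearly with sequence length for a fixed chunk size.
    q, k = feature_map(q), feature_map(k)
    n, d = q.shape
    kv = np.zeros((d, v.shape[1]))  # running sum of k_j v_j^T over earlier chunks
    ksum = np.zeros(d)              # running sum of k_j over earlier chunks
    out = np.empty_like(v)
    for s in range(0, n, chunk):
        e = min(s + chunk, n)
        qc, kc, vc = q[s:e], k[s:e], v[s:e]
        local = np.tril(qc @ kc.T)           # causal scores inside the chunk
        num = qc @ kv + local @ vc           # earlier chunks + current chunk
        den = qc @ ksum + local.sum(axis=1)  # matching normalizer
        out[s:e] = num / den[:, None]
        kv += kc.T @ vc                      # fold this chunk into the summary
        ksum += kc.sum(axis=0)
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(10, 3))
k = rng.normal(size=(10, 3))
v = rng.normal(size=(10, 5))
reference = causal_linear_attention_quadratic(q, k, v)
chunked = causal_linear_attention_chunked(q, k, v, chunk=4)
print(np.allclose(reference, chunked))
```

Only the small within-chunk matrix is masked here, so the full score matrix is never materialized. A hardware-efficient kernel would presumably map each chunk to fast on-chip memory, which this sketch does not attempt to model.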
Taxonomy
Topics: Computational Physics and Python Applications · Algorithms and Data Compression · Advanced Data Compression Techniques
