WavSpA: Wavelet Space Attention for Boosting Transformers' Long Sequence Learning Ability
Yufan Zhuang, Zihan Wang, Fangbo Tao, Jingbo Shang

TL;DR
WavSpA introduces a wavelet-based attention mechanism for Transformers. It captures both position and frequency information in linear time, which improves long sequence learning and reasoning extrapolation.
Contribution
The paper proposes Wavelet Space Attention (WavSpA), which replaces the attention in Transformers with attention learned in a learnable wavelet coefficient space. It is presented as a better choice than learning attention in Fourier space for long sequence modeling.
Findings
WavSpA outperforms Fourier-based methods on Long Range Arena tasks.
Learning in wavelet space enhances Transformers' reasoning over long distances.
The wavelet transform runs in linear time and projects input sequences onto multi-resolution bases, capturing both position and frequency information.
Abstract
Transformer and its variants are fundamental neural architectures in deep learning. Recent works show that learning attention in the Fourier space can improve the long sequence learning capability of Transformers. We argue that wavelet transform shall be a better choice because it captures both position and frequency information with linear time complexity. Therefore, in this paper, we systematically study the synergy between wavelet transform and Transformers. We propose Wavelet Space Attention (WavSpA) that facilitates attention learning in a learnable wavelet coefficient space which replaces the attention in Transformers by (1) applying forward wavelet transform to project the input sequences to multi-resolution bases, (2) conducting attention learning in the wavelet coefficient space, and (3) reconstructing the representation in input space via backward wavelet transform. Extensive…
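The three steps in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. It assumes a fixed one-level Haar wavelet where the paper uses a learnable, multi-resolution transform. It also assumes plain single-head self-attention over the concatenated coefficients. The function names (`haar_forward`, `haar_inverse`, `wavelet_space_attention`) and the weight matrices are hypothetical and chosen only for this example.

```python
import numpy as np

def haar_forward(x):
    """One-level Haar transform along the sequence axis; x has shape (n, d), n even."""
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)   # low-frequency (coarse) coefficients
    detail = (even - odd) / np.sqrt(2.0)   # high-frequency (fine) coefficients
    return approx, detail

def haar_inverse(approx, detail):
    """Exact inverse of haar_forward: rebuilds a sequence of length 2 * len(approx)."""
    even = (approx + detail) / np.sqrt(2.0)
    odd = (approx - detail) / np.sqrt(2.0)
    x = np.empty((2 * approx.shape[0], approx.shape[1]), dtype=approx.dtype)
    x[0::2], x[1::2] = even, odd
    return x

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(h, wq, wk, wv):
    """Plain single-head scaled dot-product self-attention."""
    q, k, v = h @ wq, h @ wk, h @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def wavelet_space_attention(x, wq, wk, wv):
    # (1) forward wavelet transform: project the input to wavelet coefficients
    approx, detail = haar_forward(x)
    coeffs = np.concatenate([approx, detail], axis=0)  # assumption: attend over all coefficients jointly
    # (2) attention in the wavelet coefficient space
    out = self_attention(coeffs, wq, wk, wv)
    # (3) backward wavelet transform: reconstruct the representation in input space
    half = approx.shape[0]
    return haar_inverse(out[:half], out[half:])

rng = np.random.default_rng(0)
n, d = 8, 4                                  # toy sequence length and model width
x = rng.standard_normal((n, d))
wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
y = wavelet_space_attention(x, wq, wk, wv)   # same shape as x: (n, d)
```

The forward and backward Haar steps each touch every element once, so the transform itself is linear in sequence length. The attention step in this sketch is still quadratic in the number of coefficients.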
Taxonomy
Topics: Neural Networks and Applications · Neural Networks and Reservoir Computing · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Label Smoothing · Softmax · Byte Pair Encoding · Adam · Dense Connections · Absolute Position Encodings
