TL;DR
WERSA introduces a linear-time attention mechanism using wavelet-enhanced spectral features, enabling efficient processing of very long sequences across vision and NLP tasks with improved accuracy and reduced computational costs.
Contribution
The paper presents WERSA, a novel linear-time attention method combining spectral features and wavelets, outperforming existing mechanisms on multiple benchmarks.
Findings
WERSA achieves 1.2% higher accuracy on ArXiv classification.
WERSA reduces training time by 81% and FLOPS by 73.4%.
WERSA handles extremely long sequences where quadratic methods fail.
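The scaling gap between quadratic and linear attention can be illustrated with a generic random-feature sketch. The code below is not WERSA itself: it uses Performer-style positive random features as an assumed stand-in for the paper's spectral features, and the function names (`random_feature_map`, `quadratic_form`, `linear_form`) are hypothetical. It shows only that reordering the matrix products gives the same output without ever building the n-by-n score matrix.

```python
import numpy as np

def random_feature_map(x, proj):
    # Positive random features (Performer-style); an assumed stand-in
    # for the paper's content-adaptive spectral features.
    # x: (n, d), proj: (d, m) -> features: (n, m)
    sq_norm = 0.5 * np.sum(x * x, axis=-1, keepdims=True)
    return np.exp(x @ proj - sq_norm) / np.sqrt(proj.shape[1])

def quadratic_form(q_feat, k_feat, v):
    # Materialises the full (n, n) score matrix: O(n^2) time and memory.
    scores = q_feat @ k_feat.T
    return (scores @ v) / scores.sum(axis=-1, keepdims=True)

def linear_form(q_feat, k_feat, v):
    # Same result via associativity: phi(Q) (phi(K)^T V).
    # Cost grows linearly in n because only (m, d) and (m,) summaries are kept.
    kv = k_feat.T @ v               # (m, d)
    k_sum = k_feat.sum(axis=0)      # (m,)
    return (q_feat @ kv) / (q_feat @ k_sum)[:, None]

rng = np.random.default_rng(0)
n, d, m = 256, 16, 64               # sequence length, head dim, feature count
q = rng.normal(size=(n, d)) * 0.3
k = rng.normal(size=(n, d)) * 0.3
v = rng.normal(size=(n, d))
proj = rng.normal(size=(d, m))

q_feat = random_feature_map(q, proj)
k_feat = random_feature_map(k, proj)
out_quadratic = quadratic_form(q_feat, k_feat, v)
out_linear = linear_form(q_feat, k_feat, v)
max_diff = float(np.max(np.abs(out_quadratic - out_linear)))
```

Both forms agree up to floating-point error; the linear form is the one that remains feasible as n grows, since its intermediate tensors do not depend on n squared.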
Abstract
Transformer models are computationally costly on long sequences because standard attention has quadratic time complexity. We introduce Wavelet-Enhanced Random Spectral Attention (WERSA), a novel linear-time mechanism that enables long-sequence processing without a performance trade-off. WERSA combines content-adaptive random spectral features with multi-resolution Haar wavelets and learnable parameters to selectively attend to informative scales of data while preserving linear efficiency. Large-scale comparisons on a single GPU, across various benchmarks (vision, NLP, hierarchical reasoning) and attention mechanisms (Multiheaded Attention, Flash-Attention-2, FNet, Linformer, Performer, Waveformer), reveal uniform advantages of WERSA. It achieves the best accuracy in all tests. On ArXiv classification, WERSA…
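The multi-resolution Haar decomposition mentioned in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `haar_decompose` is a textbook orthonormal Haar transform, and `weight_scales` with its fixed `scale_weights` is a hypothetical stand-in for the learnable per-scale parameters the abstract describes.

```python
import math

def haar_step(signal):
    # One level of the orthonormal Haar transform on an even-length list:
    # pairwise sums give the coarse approximation, pairwise differences the detail.
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_decompose(signal, levels):
    # Returns bands ordered coarse to fine:
    # [coarsest approximation, coarsest detail, ..., finest detail].
    # Total work is n/2 + n/4 + ... which is O(n).
    details = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return [approx] + details[::-1]

def weight_scales(bands, scale_weights):
    # Hypothetical per-scale gating: one scalar weight per band,
    # standing in for learnable parameters.
    return [[w * c for c in band] for band, w in zip(bands, scale_weights)]

signal = [4.0, 2.0, 6.0, 8.0, 1.0, 1.0, 3.0, 5.0]
bands = haar_decompose(signal, levels=3)
band_lengths = [len(band) for band in bands]
energy_in = sum(x * x for x in signal)
energy_out = sum(c * c for band in bands for c in band)
gated = weight_scales(bands, [1.0, 1.0, 0.5, 0.0])  # suppress the finest scale
```

Because the transform is orthonormal, the signal energy is preserved across the bands, so scaling a band up or down directly controls how much that resolution contributes.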
Taxonomy
Methods: Wavelet-Enhanced Random Spectral Attention
