Multi Resolution Analysis (MRA) for Approximate Self-Attention
Zhanpeng Zeng, Sourav Pal, Jeffery Kline, Glenn M Fung, Vikas Singh

TL;DR
This paper introduces a wavelet-based multiresolution analysis (MRA) approach for efficient approximate self-attention in Transformers. It performs well on both short and long sequences and outperforms most existing efficient self-attention methods.
Contribution
It revisits classical MRA concepts like wavelets and adapts them for self-attention, offering a novel, effective approximation method for Transformers.
Findings
Outperforms most efficient self-attention methods
Effective for both short and long sequences
Demonstrates an excellent performance profile across most criteria of interest
Abstract
Transformers have emerged as a preferred model for many tasks in natural language processing and vision. Recent efforts on training and deploying Transformers more efficiently have identified many strategies to approximate the self-attention matrix, a key module in a Transformer architecture. Effective ideas include various prespecified sparsity patterns, low-rank basis expansions and combinations thereof. In this paper, we revisit classical Multiresolution Analysis (MRA) concepts such as Wavelets, whose potential value in this setting remains underexplored thus far. We show that simple approximations based on empirical feedback and design choices informed by modern hardware and implementation challenges eventually yield an MRA-based approach for self-attention with an excellent performance profile across most criteria of interest. We undertake an extensive set of experiments and…
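To make the multiresolution idea concrete, here is a minimal NumPy sketch of a two-level approximation of the attention score matrix. It is an illustration under stated assumptions, not the authors' algorithm: the function names (`two_level_attention`, `exact_attention`) and the parameters `block` and `keep` are hypothetical. Coarse scores are block averages of the score matrix (equivalent, up to scaling, to Haar scaling coefficients), and only the `keep` highest-scoring key blocks per query block are recomputed at full resolution. For clarity the sketch materializes the full n x n score matrix, so it shows the approximation itself rather than any efficiency gain.

```python
import numpy as np

def softmax_rows(x):
    # Numerically stable row-wise softmax.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def exact_attention(Q, K, V):
    # Standard scaled dot-product attention, used as the reference.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax_rows(scores) @ V

def two_level_attention(Q, K, V, block=4, keep=2):
    # Hypothetical two-resolution approximation (illustration only).
    n, d = Q.shape
    assert n % block == 0, "sequence length must be divisible by block"
    nb = n // block

    # Coarse level: the dot product of block-averaged queries and keys
    # equals the average of the exact scores inside each block.
    Qc = Q.reshape(nb, block, d).mean(axis=1)
    Kc = K.reshape(nb, block, d).mean(axis=1)
    coarse = Qc @ Kc.T / np.sqrt(d)            # shape (nb, nb)

    # Upsample: every entry in a block starts at its block average.
    scores = np.kron(coarse, np.ones((block, block)))

    # Fine level: refine the `keep` highest-scoring key blocks
    # for each query block with exact scores.
    top = np.argsort(-coarse, axis=1)[:, :keep]
    for i in range(nb):
        qs = slice(i * block, (i + 1) * block)
        for j in top[i]:
            ks = slice(j * block, (j + 1) * block)
            scores[qs, ks] = Q[qs] @ K[ks].T / np.sqrt(d)

    return softmax_rows(scores) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 8)) for _ in range(3))
approx = two_level_attention(Q, K, V, block=4, keep=2)
exact = exact_attention(Q, K, V)
print("max abs error:", np.abs(approx - exact).max())
```

Setting `keep` equal to the number of blocks refines every block and recovers exact attention, which gives a simple sanity check on the sketch.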
Taxonomy
Topics: Blind Source Separation Techniques · Sparse and Compressive Sensing Techniques · Image and Signal Denoising Methods
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Layer Normalization · Adam · Residual Connection
