Power-based Partial Attention: Bridging Linear-Complexity and Full Attention
Yufeng Huang

TL;DR
This paper introduces power-based partial attention (PPA), a scalable attention mechanism that interpolates between linear and quadratic complexity, demonstrating that sub-quadratic attention can match full attention performance.
Contribution
The paper proposes PPA, a novel attention method of order O(L^{1+p}) that bridges linear and full attention, enabling analysis of attention complexity-performance trade-offs.
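To make the O(L^{1+p}) scaling concrete, here is a minimal sketch of one way to build an attention mask whose cost interpolates between linear and quadratic in the sequence length L. It is an illustration under stated assumptions, not necessarily the paper's construction: the function names, the `base_window` parameter, and the rule that query i attends to its most recent ceil((i+1)^p) keys are all hypothetical. Under that rule the number of attended pairs grows roughly like L^{1+p}/(1+p), so p = 0 gives a fixed-width sliding window and p = 1 gives full causal attention.

```python
import math


def power_window_mask(seq_len, p, base_window=1):
    """Hypothetical causal mask (an assumption, not the paper's definition).

    Query i attends to its most recent `width` keys, where
    width = max(base_window, ceil((i + 1) ** p)), capped at i + 1
    so that no query attends to future positions.
    """
    mask = []
    for i in range(seq_len):
        width = min(i + 1, max(base_window, math.ceil((i + 1) ** p)))
        # Key j is visible to query i when it lies in the last `width` positions.
        mask.append([i - width < j <= i for j in range(seq_len)])
    return mask


def attended_pairs(mask):
    """Count the (query, key) pairs the mask allows, a proxy for attention cost."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    for p in (0.0, 0.5, 1.0):
        cost = attended_pairs(power_window_mask(seq_len=8, p=p))
        print(f"p={p}: {cost} attended pairs")
```

Sweeping p between 0 and 1 with a mask of this kind is one way to probe the complexity-performance trade-off the paper describes; a practical implementation would use a sparse or blocked kernel rather than a dense boolean mask.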
Findings
Sub-quadratic attention can achieve full attention performance.
Performance transitions sharply from the level of sliding-window (linear-complexity) attention to that of full attention over a narrow range of p.
There exists an intermediate p where attention complexity is reduced without performance loss.
Abstract
It is widely accepted from transformer research that "attention is all we need", but the amount of attention required has never been systematically quantified. Is quadratic attention necessary, or is there a sub-quadratic attention mechanism that can achieve comparable performance? To answer this question, we introduce power-based partial attention (PPA), an attention mechanism of order O(L^{1+p}), where 0 ≤ p ≤ 1, such that p = 0 corresponds to sliding window attention with linear complexity, and p = 1 corresponds to full attention. With this attention construction, we can explore how transformer architecture performance varies as a function of the attention scaling behavior controlled by p. The overall trend from our experiments shows an S-curve-like behavior where the performance transitions from sliding-window (linear-complexity) attention to full attention over…
Taxonomy
Topics: Advanced Memory and Neural Computing · Low-power high-performance VLSI design · Parallel Computing and Optimization Techniques
