ZETA: Leveraging Z-order Curves for Efficient Top-k Attention
Qiuhao Zeng, Jerry Huang, Peng Lu, Gezheng Xu, Boxing Chen, Charles Ling, Boyu Wang

TL;DR
ZETA introduces a novel method using Z-order curves to enable efficient, parallel top-$k$ attention in Transformers, significantly reducing computational costs for long sequences while maintaining high performance.
Contribution
The paper presents ZETA, a new approach that leverages Z-order curves for parallel top-$k$ attention, improving efficiency over existing methods for long sequence modeling.
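For readers unfamiliar with Z-order (Morton) curves, the sketch below shows the general idea: the bits of several integer coordinates are interleaved into a single key, so sorting by that key tends to keep points that are close in space close in the one-dimensional order. This illustrates the curve itself, not ZETA's implementation. How the paper maps queries and keys to coordinates is not described in this summary, and the names `morton_encode` and `bits` are hypothetical.

```python
def morton_encode(coords, bits=8):
    """Interleave the bits of non-negative integer coordinates into one Z-order key.

    Assumption for illustration: each coordinate already fits in `bits` bits.
    """
    dims = len(coords)
    key = 0
    for bit in range(bits):
        for d, c in enumerate(coords):
            # Bit `bit` of coordinate `d` lands at position bit * dims + d.
            key |= ((c >> bit) & 1) << (bit * dims + d)
    return key


# Sorting a 4x4 grid by Morton key visits it in the characteristic "Z" pattern.
points = [(x, y) for y in range(4) for x in range(4)]
ordered = sorted(points, key=morton_encode)
print(ordered[:4])
```

Because the key is a plain integer, nearest-neighbor candidates can be looked up with a sort followed by a search over a one-dimensional array, which is the property that makes such curves attractive for parallel search.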
Findings
Matches standard attention on synthetic tasks
Outperforms existing methods on Long Range Arena
Effective in language modeling tasks
Abstract
Over recent years, the Transformer has become a fundamental building block for sequence modeling architectures. Yet at its core is the use of self-attention, whose memory and computational cost grow quadratically with the sequence length $N$, rendering it prohibitively expensive for long sequences. A promising approach is top-$k$ attention, which selects only the $k$ most relevant tokens and achieves performance comparable to vanilla self-attention while significantly reducing space and computational demands. However, causal masks require the current query token to only attend to past tokens, preventing the existing top-$k$ attention method from efficiently searching for the most relevant tokens in parallel, thereby limiting training efficiency. In this work, we propose ZETA, leveraging Z-Order Curves for Efficient Top-$k$ Attention, to enable…
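As a point of reference for the problem the abstract describes, here is a minimal, deliberately naive sketch of causal top-$k$ attention in plain Python. Each query scores only the keys at or before its own position, keeps the $k$ best, and applies softmax over those alone. It is a sequential baseline for illustration, not ZETA's parallel method; the function name `causal_topk_attention` and the toy inputs are assumptions.

```python
import math


def causal_topk_attention(queries, keys, values, k):
    """Naive reference: query i attends to its k best-scoring keys at positions <= i."""
    outputs = []
    dim = len(values[0])
    for i, q in enumerate(queries):
        scale = 1.0 / math.sqrt(len(q))
        # Causal mask: only positions 0..i are visible to query i.
        scores = [scale * sum(a * b for a, b in zip(q, keys[j])) for j in range(i + 1)]
        # Indices of the k largest scores (fewer when i + 1 < k).
        top = sorted(range(i + 1), key=lambda j: scores[j], reverse=True)[:k]
        # Numerically stable softmax restricted to the selected positions.
        peak = max(scores[j] for j in top)
        weights = [math.exp(scores[j] - peak) for j in top]
        total = sum(weights)
        outputs.append(
            [sum(w * values[j][d] for w, j in zip(weights, top)) / total for d in range(dim)]
        )
    return outputs


# Toy inputs (hypothetical): three tokens, 2-dimensional queries/keys, 1-dimensional values.
toy_queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
toy_keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
toy_values = [[1.0], [2.0], [3.0]]
top1 = causal_topk_attention(toy_queries, toy_keys, toy_values, k=1)
top2 = causal_topk_attention(toy_queries, toy_keys, toy_values, k=2)
```

The loop over `i` makes the difficulty visible: the candidate set differs for every query because of the causal mask, so a straightforward implementation searches one query at a time.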
Taxonomy
Topics: Advanced Neural Network Applications · Anomaly Detection Techniques and Applications · Brain Tumor Detection and Classification
Methods: Attention Is All You Need · Softmax · Residual Connection · Dropout · Absolute Position Encodings · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer · Label Smoothing
