ZETA: Leveraging Z-order Curves for Efficient Top-k Attention

Qiuhao Zeng; Jerry Huang; Peng Lu; Gezheng Xu; Boxing Chen; Charles Ling; Boyu Wang

arXiv:2501.14577 · cs.LG · April 2, 2025

Open Access

TL;DR

ZETA uses Z-order curves to enable efficient, parallel top-$k$ attention in Transformers, reducing the memory and computational cost of long sequences while keeping performance comparable to standard attention.

Contribution

The paper presents ZETA, an approach that leverages Z-order curves so that top-$k$ attention can be computed in parallel, improving efficiency over existing top-$k$ attention methods for long-sequence modeling.
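For readers unfamiliar with Z-order (Morton) curves, the sketch below shows the general idea: the bits of each coordinate are interleaved into a single integer, so that sorting by that integer tends to keep nearby points close together in the one-dimensional order. This is a generic illustration and not ZETA's implementation; the function name `morton_code` and the `bits` parameter are hypothetical, and it assumes non-negative integer coordinates.

```python
def morton_code(coords, bits=8):
    """Interleave the low `bits` bits of each non-negative integer coordinate.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    d = len(coords)
    code = 0
    for b in range(bits):                # walk bit positions, lowest first
        for i, c in enumerate(coords):   # take bit b of every coordinate
            code |= ((c >> b) & 1) << (b * d + i)
    return code


# Sorting the points of a 4x4 grid by Morton code visits them along a
# Z-shaped path, one 2x2 block at a time.
points = [(x, y) for y in range(4) for x in range(4)]
ordered = sorted(points, key=morton_code)
```

Because the codes are plain integers, the resulting one-dimensional order can be sorted and searched with standard routines, which is what makes this kind of mapping attractive for nearest-neighbor style lookups.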

Findings

01 Matches standard attention on synthetic tasks

02 Outperforms existing methods on Long Range Arena

03 Effective in language modeling tasks

Abstract

Over recent years, the Transformer has become a fundamental building block for sequence modeling architectures. Yet at its core is the use of self-attention, whose memory and computational cost grow quadratically with the sequence length $N$, rendering it prohibitively expensive for long sequences. A promising approach is top-$k$ attention, which selects only the $k$ most relevant tokens and achieves performance comparable to vanilla self-attention while significantly reducing space and computational demands. However, causal masks require the current query token to attend only to past tokens, preventing existing top-$k$ attention methods from efficiently searching for the most relevant tokens in parallel, thereby limiting training efficiency. In this work, we propose ZETA, leveraging Z-Order Curves for Efficient Top-$k$ Attention, to enable…
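To make the semantics of causal top-$k$ attention concrete, here is a minimal NumPy sketch of a naive reference version. It is not ZETA: it still builds the full $N \times N$ score matrix and therefore keeps the quadratic cost the abstract describes. The function name `causal_topk_attention` and the single-head, unbatched shapes are assumptions made for illustration.

```python
import numpy as np


def causal_topk_attention(Q, K, V, k):
    """Naive single-head causal top-k attention (illustrative only).

    Q, K, V have shape (N, d). Each query attends to at most k of the
    tokens at or before its own position (more only if scores tie).
    """
    N, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                      # (N, N) scaled dot products

    # Causal mask: query i must not see keys j > i.
    future = np.triu(np.ones((N, N), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Keep only the k largest scores in each row; drop the rest.
    kk = min(k, N)
    kth_largest = np.sort(scores, axis=-1)[:, -kk][:, None]
    scores = np.where(scores >= kth_largest, scores, -np.inf)

    # Softmax over the surviving entries (the diagonal is always finite).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V


rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))
out = causal_topk_attention(Q, K, V, k=2)   # shape (6, 4)
```

The row-wise selection here depends on which keys are visible to each query, which is the interaction between causal masking and top-$k$ search that the abstract identifies as the obstacle to efficient parallel training.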

Taxonomy

Topics: Advanced Neural Network Applications · Anomaly Detection Techniques and Applications · Brain Tumor Detection and Classification

Methods: Attention Is All You Need · Softmax · Residual Connection · Dropout · Absolute Position Encodings · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer · Label Smoothing