Unshackling Context Length: An Efficient Selective Attention Approach through Query-Key Compression

Haoyu Wang; Tong Teng; Tianyu Guo; An Xiao; Duyu Tang; Hanting Chen; Yunhe Wang

arXiv:2502.14477 · cs.CL · February 21, 2025

Open Access

TL;DR

This paper introduces ESA, a novel selective attention method that efficiently extends context length in large language models by compressing query and key vectors, enabling better long-sequence processing with reduced computation.

Contribution

ESA is a new approach that improves long-context handling in LLMs by efficiently selecting critical tokens through query-key compression, outperforming other selective attention methods.

Findings

1. ESA achieves comparable performance to full-attention methods on long sequences.
2. ESA outperforms other selective attention techniques in multi-piece retrieval tasks.
3. ESA scales effectively to sequences up to 256k tokens.

Abstract

Handling long-context sequences efficiently remains a significant challenge in large language models (LLMs). Existing methods for token selection in sequence extrapolation either employ a permanent eviction strategy or select tokens by chunk, which may lead to the loss of critical information. We propose Efficient Selective Attention (ESA), a novel approach that extends context length by efficiently selecting the most critical tokens at the token level to compute attention. ESA reduces the computational complexity of token selection by compressing query and key vectors into lower-dimensional representations. We evaluate ESA on long sequence benchmarks with maximum lengths up to 256k using open-source LLMs with context lengths of 8k and 32k. ESA outperforms other selective attention methods, especially in tasks requiring the retrieval of multiple pieces of information, achieving…
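
The mechanism the abstract describes (score tokens in a compressed query-key space, then attend only to the best-scoring ones) can be illustrated with a minimal sketch. This is not the authors' implementation: the excerpt does not say how ESA obtains its compression, so a fixed random projection stands in for it here, and every name (`compress`, `select_top_k`, `attend`, `proj`, `d_low`, `top_k`) and every size is a hypothetical choice made for illustration.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def compress(vec, proj):
    """Project a d_model vector to d_low dimensions (proj is d_low x d_model)."""
    return [dot(row, vec) for row in proj]


def select_top_k(query_low, keys_low, k):
    """Score each cached token in the compressed space; return the k best indices."""
    scores = [dot(query_low, key_low) for key_low in keys_low]
    ranked = sorted(range(len(keys_low)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # restore positional order


def attend(query, keys, values, selected):
    """Full-dimensional softmax attention restricted to the selected tokens."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in selected]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [
        sum(w * values[i][d] for w, i in zip(weights, selected)) / total
        for d in range(len(values[0]))
    ]


def random_matrix(rows, cols, rng, scale=1.0):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


# Toy sizes, chosen only for illustration.
rng = random.Random(0)
d_model, d_low, n_tokens, top_k = 16, 4, 64, 8

keys = random_matrix(n_tokens, d_model, rng)
values = random_matrix(n_tokens, d_model, rng)
query = [rng.gauss(0.0, 1.0) for _ in range(d_model)]

# Assumption: a shared random projection stands in for the compression step.
proj = random_matrix(d_low, d_model, rng, scale=1.0 / math.sqrt(d_low))

# Keys are compressed once and can be cached; each new query is compressed on arrival.
keys_low = [compress(key, proj) for key in keys]
selected = select_top_k(compress(query, proj), keys_low, top_k)
output = attend(query, keys, values, selected)
```

In this sketch the per-token selection score costs `d_low` multiplications instead of `d_model`, and the full-dimensional attention runs over only `top_k` tokens rather than all `n_tokens`. How well the compressed scores track the full-dimensional ones depends on the compression used, which this toy projection does not attempt to model.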

Taxonomy

Topics: Advanced Database Systems and Queries · Neural Networks and Applications · Time Series Analysis and Forecasting

Methods: Softmax · Attention Is All You Need