TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
Wei Wu, Zhuoshi Pan, Chao Wang, Liyi Chen, Yunchu Bai, Tianfu Wang, Kun Fu, Zheng Wang, Hui Xiong

TL;DR
TokenSelect is a training-free method that makes long-context LLM inference more efficient by involving only a few critical KV cache tokens in attention, yielding significant speedups without sacrificing accuracy.
Contribution
It introduces a novel token-level KV cache selection mechanism that leverages attention sparsity and a new cache design for faster inference in long-context scenarios.
Findings
Up to 23.84x speedup in attention computation
Up to 2.28x reduction in end-to-end latency
Outperforms state-of-the-art long-context inference methods
Abstract
Rapid advances in Large Language Models (LLMs) have spurred demand for processing extended context sequences in contemporary applications. However, this progress faces two challenges: performance degradation on out-of-distribution sequence lengths, and excessively long inference times caused by the quadratic computational complexity of attention. These issues limit LLMs in long-context scenarios. In this paper, we propose Dynamic Token-Level KV Cache Selection (TokenSelect), a training-free method for efficient and accurate long-context inference. TokenSelect builds upon the observation of non-contiguous attention sparsity, using QK dot products to measure per-head KV cache criticality at the token level. Through a per-head soft voting mechanism, TokenSelect selectively involves a few critical KV cache tokens in the attention calculation without sacrificing accuracy. To further accelerate…
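The selection step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes that each head's "soft vote" is a softmax over its QK dot products and that votes are summed across heads before the top-scoring tokens are kept. The names `select_tokens` and `budget` are hypothetical, and pure Python lists stand in for real KV cache tensors.

```python
import math
from typing import List, Sequence


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def select_tokens(
    q_heads: Sequence[Sequence[float]],
    k_heads: Sequence[Sequence[Sequence[float]]],
    budget: int,
) -> List[int]:
    """Pick `budget` cached token indices by per-head soft voting.

    q_heads[h]    : query vector of head h for the current step
    k_heads[h][t] : cached key vector of head h for token t
    Assumption (not from the paper text above): each head votes with
    softmax-normalized QK dot products, and votes are summed over heads.
    """
    num_tokens = len(k_heads[0])
    votes = [0.0] * num_tokens
    for q, keys in zip(q_heads, k_heads):
        # Token-level criticality for this head: QK dot products.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        # Soft vote: normalize so no single head dominates by scale alone.
        for t, weight in enumerate(softmax(scores)):
            votes[t] += weight
    ranked = sorted(range(num_tokens), key=lambda t: votes[t], reverse=True)
    # Return indices in positional order; they need not be contiguous.
    return sorted(ranked[:budget])


# Two heads, four cached tokens, head dimension 2 (toy values).
q_heads = [[1.0, 0.0], [0.0, 1.0]]
k_heads = [
    [[5.0, 0.0], [0.0, 0.0], [0.1, 0.0], [0.0, 0.0]],  # head 0 favors token 0
    [[0.0, 0.0], [0.0, 0.0], [0.0, 0.2], [0.0, 5.0]],  # head 1 favors token 3
]
print(select_tokens(q_heads, k_heads, budget=2))  # [0, 3]
```

In this toy example the two heads prefer different, non-adjacent tokens, and the combined vote keeps both. Attention would then be computed only over the selected entries rather than the full cache.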
Taxonomy
Topics: Advanced Data Storage Technologies · Network Packet Processing and Optimization
Methods: Softmax · Attention Is All You Need
