TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection

Wei Wu; Zhuoshi Pan; Chao Wang; Liyi Chen; Yunchu Bai; Tianfu Wang; Kun Fu; Zheng Wang; Hui Xiong

arXiv:2411.02886 · cs.CL · October 10, 2025

TL;DR

TokenSelect is a training-free method that improves long-context inference efficiency in LLMs by selectively involving critical tokens in attention, achieving significant speedups without sacrificing accuracy.

Contribution

It introduces a novel token-level KV cache selection mechanism that leverages attention sparsity and a new cache design for faster inference in long-context scenarios.

Findings

01. Up to 23.84x speedup in attention computation

02. Up to 2.28x reduction in end-to-end latency

03. Outperforms state-of-the-art long-context inference methods

Abstract

Rapid advances in Large Language Models (LLMs) have spurred demand for processing extended context sequences in contemporary applications. However, this progress faces two challenges: performance degradation on out-of-distribution sequence lengths, and excessively long inference times caused by the quadratic computational complexity of attention. These issues limit LLMs in long-context scenarios. In this paper, we propose Dynamic Token-Level KV Cache Selection (TokenSelect), a training-free method for efficient and accurate long-context inference. TokenSelect builds upon the observation of non-contiguous attention sparsity, using QK dot products to measure per-head KV cache criticality at the token level. Through a per-head soft voting mechanism, TokenSelect selectively involves a few critical KV cache tokens in the attention calculation without sacrificing accuracy. To further accelerate…
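The selection step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract only states that QK dot products measure per-head criticality and that heads vote softly on which tokens to keep. The function name `select_tokens`, the scaling by the square root of the head dimension, the use of a per-head softmax as the "soft vote", and the summation across heads are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def select_tokens(queries, keys, k):
    """Pick k cached token indices by summing per-head softmax votes.

    queries: [num_heads][head_dim]             current query, one vector per head
    keys:    [num_heads][num_tokens][head_dim] cached keys per head
    Hypothetical helper; the voting rule here is an assumption.
    """
    num_tokens = len(keys[0])
    votes = [0.0] * num_tokens
    for q, head_keys in zip(queries, keys):
        scale = math.sqrt(len(q))
        # Per-head criticality: scaled QK dot product for every cached token.
        scores = [sum(a * b for a, b in zip(q, key)) / scale for key in head_keys]
        # Soft vote: each head distributes one unit of weight over the tokens.
        for t, p in enumerate(softmax(scores)):
            votes[t] += p
    ranked = sorted(range(num_tokens), key=lambda t: votes[t], reverse=True)
    return sorted(ranked[:k])  # return indices in positional order


# Toy example: 2 heads, 4 cached tokens, head_dim = 2.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [
    [[5.0, 0.0], [0.0, 0.0], [0.0, 0.0], [1.0, 0.0]],  # head 0 favors token 0
    [[0.0, 0.0], [0.0, 0.0], [0.0, 5.0], [0.0, 1.0]],  # head 1 favors token 2
]
selected = select_tokens(queries, keys, k=2)
```

In this toy input each head strongly prefers a different token, so the selected set is non-contiguous (tokens 0 and 2), which is the kind of sparsity pattern that block-level or window-based selection would miss. Attention would then be computed only over the selected KV cache entries.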

Taxonomy

Topics: Advanced Data Storage Technologies · Network Packet Processing and Optimization

Methods: Softmax · Attention Is All You Need