Spotlight Attention: Towards Efficient LLM Generation via Non-linear Hashing-based KV Cache Retrieval

Wenhao Li; Yuxin Zhang; Gen Luo; Haiyuan Wan; Ziyang Gong; Fei Chao; Rongrong Ji

arXiv:2508.19740·cs.CL·October 10, 2025

TL;DR

This paper introduces Spotlight Attention, which replaces random linear hashing with trained non-linear hash functions for KV cache retrieval in LLMs, shortening hash codes and raising decoding throughput during inference.

Contribution

The paper proposes a non-linear hashing technique for KV cache retrieval, a lightweight training framework that runs on a GPU with 16GB of memory, and a CUDA kernel implementation, improving both retrieval precision and decoding speed in LLMs.
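
The retrieval idea can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names (`linear_hash`, `mlp_hash`, `retrieve_top_k`), the single tanh hidden layer, and the random untrained weights are all assumptions made for illustration. The sketch maps queries and keys to binary codes, then selects the keys whose codes have the smallest Hamming distance to the query code.

```python
import math
import random


def linear_hash(vec, planes):
    """Pack the signs of the projections of `vec` onto `planes` into an int."""
    code = 0
    for plane in planes:
        dot = sum(v * p for v, p in zip(vec, plane))
        code = (code << 1) | (1 if dot > 0 else 0)
    return code


def mlp_hash(vec, w1, w2):
    """Hypothetical non-linear hash: one tanh hidden layer, then sign bits."""
    hidden = [math.tanh(sum(v * w for v, w in zip(vec, row))) for row in w1]
    return linear_hash(hidden, w2)


def hamming(a, b):
    """Number of differing bits between two integer codes."""
    return bin(a ^ b).count("1")


def retrieve_top_k(query_code, key_codes, k):
    """Indices of the k keys whose codes are closest to the query code."""
    order = sorted(range(len(key_codes)),
                   key=lambda i: hamming(query_code, key_codes[i]))
    return order[:k]


# Toy usage with random (untrained) weights; all sizes are arbitrary.
rng = random.Random(0)
dim, hidden_dim, bits, n_keys = 8, 16, 32, 64
w1 = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(hidden_dim)]
w2 = [[rng.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(bits)]
keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_keys)]
query = keys[5]  # a query that coincides with one cached key

key_codes = [mlp_hash(key, w1, w2) for key in keys]
query_code = mlp_hash(query, w1, w2)
top = retrieve_top_k(query_code, key_codes, k=4)
```

Only the retrieved indices would then take part in attention. Storing each code as an integer keeps the distance computation to an XOR and a bit count, which is the kind of bit operation a dedicated GPU kernel can exploit.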

Findings

01 Drastically improved retrieval precision

02 At least 5× shorter hash codes than linear hashing

03 Achieved 3× higher throughput in decoding

Abstract

Reducing the key-value (KV) cache burden in Large Language Models (LLMs) significantly accelerates inference. Dynamically selecting critical KV caches during decoding helps maintain performance. Existing methods use random linear hashing to identify important tokens, but this approach is inefficient due to the orthogonal distribution of queries and keys within two narrow cones in LLMs. We introduce Spotlight Attention, a novel method that employs non-linear hashing functions to optimize the embedding distribution of queries and keys, enhancing coding efficiency and robustness. We also developed a lightweight, stable training framework using a Bradley-Terry ranking-based loss, enabling optimization of the non-linear hashing module on GPUs with 16GB memory in 8 hours. Experimental results show that Spotlight Attention drastically improves retrieval precision while shortening the length of…
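
The abstract names a Bradley-Terry ranking-based loss but the truncated text does not give its exact form. As a hedged illustration, the sketch below implements the generic Bradley-Terry pairwise objective, where the probability that item i outranks item j is sigmoid(s_i - s_j). The split into "positive" keys (those that should be retrieved) and "negative" keys, and the function name `bradley_terry_loss`, are assumptions for this example and may differ from the paper's formulation.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bradley_terry_loss(scores, positive_ids, negative_ids):
    """Mean of -log sigmoid(s_pos - s_neg) over all (positive, negative) pairs.

    `scores` holds one retrieval score per cached key (hypothetical input);
    the loss is small when every positive key outscores every negative key.
    """
    if not positive_ids or not negative_ids:
        raise ValueError("need at least one positive and one negative key")
    total = 0.0
    for i in positive_ids:
        for j in negative_ids:
            total -= log_sigmoid(scores[i] - scores[j])
    return total / (len(positive_ids) * len(negative_ids))


# Toy usage: key 0 should rank above key 1.
good = bradley_terry_loss([2.0, 0.0], positive_ids=[0], negative_ids=[1])
bad = bradley_terry_loss([0.0, 2.0], positive_ids=[0], negative_ids=[1])
```

A ranking loss of this kind only constrains the relative order of scores, not their absolute values, which suits retrieval: the hash module needs to place the important keys ahead of the rest rather than reproduce attention weights exactly.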
