FIER: Fine-Grained and Efficient KV Cache Retrieval for Long-context LLM Inference
Dongwei Wang, Zijie Liu, Song Wang, Yuxin Ren, Jianing Deng, Jingtong Hu, Tianlong Chen, Huanrui Yang

TL;DR
FIER introduces a fine-grained, efficient key-value cache retrieval method for long-context LLM inference, significantly reducing latency while maintaining high performance by using 1-bit quantized keys for importance estimation.
Contribution
The paper presents a novel fine-grained KV retrieval approach using 1-bit quantized keys, improving accuracy and efficiency over existing methods for long-context LLMs.
Findings
Matches full KV performance with only 11% cache usage
Reduces decoding latency by a factor of 1.2 to 1.5
Effective across various long-context tasks
Abstract
The Key-Value (KV) cache reading latency increases significantly with context lengths, hindering the efficiency of long-context LLM inference. To address this, previous works propose retaining a small fraction of KV cache based on token importance. For example, KV eviction uses static heuristics to retain tokens, while KV retrieval dynamically selects query-relevant tokens for more adaptive cache management. However, we observe that important tokens are often sparsely distributed across the long context. This sparsity makes existing page-level KV retrieval inaccurate, as each page may include irrelevant tokens and miss critical ones. In this work, we propose Fier, a Fine-Grained and Efficient KV cache Retrieval method. Fier uses 1-bit quantized keys to estimate the importance of each token, resulting in efficient and precise retrieval. Experiments…
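The idea of scoring every token with 1-bit keys can be illustrated with a small sketch. The abstract does not specify the quantization scheme, so this example assumes sign-based quantization with one scale per channel (the mean absolute value), followed by a top-k selection under a fixed token budget. All function names here are hypothetical and are not taken from the paper or its code.

```python
def quantize_keys_1bit(keys):
    """Assumed scheme: keep the sign of each key element (1 bit)
    plus one floating-point scale per channel (mean absolute value)."""
    n, d = len(keys), len(keys[0])
    scale = [sum(abs(k[j]) for k in keys) / n for j in range(d)]
    signs = [[1 if x >= 0 else -1 for x in k] for k in keys]
    return signs, scale


def approx_scores(query, signs, scale):
    """Approximate q . k for every token using only the 1-bit signs.
    The per-channel scale is folded into the query once."""
    scaled_q = [q * s for q, s in zip(query, scale)]
    return [sum(w * b for w, b in zip(scaled_q, row)) for row in signs]


def select_top_k(scores, k):
    """Return the indices of the k highest-scoring tokens, in position order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def retrieve_token_indices(query, keys, budget):
    """Token-level retrieval: score each cached key individually,
    then keep only `budget` tokens for full-precision attention."""
    signs, scale = quantize_keys_1bit(keys)
    return select_top_k(approx_scores(query, signs, scale), budget)


# Toy cache: 4 tokens with 4-dimensional keys; token 2 aligns with the query.
keys = [
    [1.0, -1.0, 1.0, -1.0],
    [-1.0, 1.0, -1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
    [-1.0, -1.0, 1.0, 1.0],
]
query = [1.0, 1.0, 1.0, 1.0]
print(retrieve_token_indices(query, keys, budget=1))  # prints [2]
```

Because each token gets its own score, a single relevant token can be selected without pulling in its neighbours, which is the contrast with page-level retrieval that the abstract draws. In a real system the signs would be bit-packed and the scoring done on the GPU; this sketch only shows the selection logic.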
Taxonomy
Topics: Caching and Content Delivery · Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies
