FIER: Fine-Grained and Efficient KV Cache Retrieval for Long-context LLM Inference

Dongwei Wang; Zijie Liu; Song Wang; Yuxin Ren; Jianing Deng; Jingtong Hu; Tianlong Chen; Huanrui Yang

arXiv:2508.08256·cs.DB·September 18, 2025

TL;DR

FIER is a fine-grained KV cache retrieval method for long-context LLM inference. It estimates the importance of each token from 1-bit quantized keys, matching full-KV performance with 11% of the cache and speeding up decoding by 1.2× to 1.5×.

Contribution

The paper presents a token-level KV retrieval approach that scores every token using 1-bit quantized keys, improving retrieval accuracy and efficiency over existing page-level retrieval methods for long-context LLMs.

Findings

01

Matches full-KV performance using only 11% of the KV cache

02

Speeds up decoding by 1.2× to 1.5×

03

Effective across various long-context tasks

Abstract

The Key-Value (KV) cache reading latency increases significantly with context length, hindering the efficiency of long-context LLM inference. To address this, previous works propose retaining a small fraction of the KV cache based on token importance. For example, KV eviction uses static heuristics to retain tokens, while KV retrieval dynamically selects query-relevant tokens for more adaptive cache management. However, we observe that important tokens are often sparsely distributed across the long context. This sparsity makes existing page-level KV retrieval inaccurate, as each page may include irrelevant tokens and miss critical ones. In this work, we propose Fier, a Fine-Grained and Efficient KV cache Retrieval method. Fier uses 1-bit quantized keys to estimate the importance of each token, resulting in efficient and precise retrieval. Experiments…
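
The token-level retrieval idea in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the sign-based quantizer with a mean-absolute-value scale, the helper names (`quantize_1bit`, `approx_score`, `retrieve_top_k`, `sparse_attention`), and the toy data are all assumptions made for illustration, and the paper's actual quantization scheme may differ. The sketch scores every token against the query using only 1-bit keys, keeps the top-k tokens, and then runs exact attention over that subset.

```python
import heapq
import math


def quantize_1bit(key):
    """Return (sign bits, scale) for one key vector.

    Assumption: sign-based quantization with a mean-absolute-value scale.
    The paper's quantizer may differ.
    """
    bits = [1 if x >= 0 else 0 for x in key]
    scale = sum(abs(x) for x in key) / len(key)
    return bits, scale


def approx_score(query, bits, scale):
    """Approximate the dot product q.k using only the 1-bit key."""
    return scale * sum(q if b else -q for q, b in zip(query, bits))


def retrieve_top_k(query, quantized_keys, k):
    """Indices of the k highest-scoring tokens, returned in position order."""
    scores = [approx_score(query, bits, scale) for bits, scale in quantized_keys]
    top = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return sorted(top)


def sparse_attention(query, keys, values, selected):
    """Exact softmax attention restricted to the selected token indices."""
    d = len(query)
    logits = [
        sum(q * x for q, x in zip(query, keys[i])) / math.sqrt(d)
        for i in selected
    ]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, i in zip(weights, selected):
        for j, v in enumerate(values[i]):
            out[j] += (w / total) * v
    return out


# Toy cache: six tokens with 4-dimensional keys (hypothetical data).
keys = [
    [0.9, 0.8, -0.7, 0.6],
    [-0.5, 0.4, 0.3, -0.2],
    [-0.9, -0.8, 0.7, -0.6],
    [0.1, -0.1, 0.1, -0.1],
    [1.0, 0.9, -0.8, 0.7],
    [0.2, -0.3, 0.4, 0.1],
]
values = [[float(i), 1.0] for i in range(len(keys))]
query = [1.0, 1.0, -1.0, 1.0]

quantized = [quantize_1bit(k) for k in keys]
selected = retrieve_top_k(query, quantized, k=2)
output = sparse_attention(query, keys, values, selected)
print(selected, output)
```

In this toy cache the two tokens that align with the query sit at positions 0 and 4, so they are not adjacent. A selector that could only fetch one contiguous page of two tokens would return at most one of them, which is the sparsity problem the abstract describes for page-level retrieval.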

Taxonomy

Topics: Caching and Content Delivery · Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies