TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization

Dingyu Yao; Bowen Shen; Zheng Lin; Wei Liu; Jian Luan; Bin Wang; Weiping Wang

arXiv:2505.19586·cs.CL·May 28, 2025

TL;DR

TailorKV is a hybrid framework that optimizes long-context inference in large language models by combining selective KV cache offloading and quantization, significantly reducing memory and latency while maintaining near-lossless performance.

Contribution

The paper introduces TailorKV, a hybrid compression method that integrates quantization and offloading according to layer-specific characteristics, targeting long-context inference.

Findings

01 Achieves near-lossless performance with aggressive compression.

02 Serves 128k context on a single RTX 3090 within 82 ms per token.

03 Outperforms state-of-the-art methods in long-context LLM inference.
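
To put the 128k-context figure in perspective, the following back-of-the-envelope sketch estimates the size of an uncompressed KV cache. The model configuration (32 layers, 8 KV heads, head dimension 128, FP16 values) is an assumption chosen for illustration; this page does not state which model the measurement refers to. The helper name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Bytes needed to store keys and values for num_tokens across all layers."""
    # Factor 2 accounts for storing both a key and a value vector per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


# Assumed configuration (illustrative only, not taken from this page):
# 32 layers, 8 KV heads, head dimension 128, FP16 values (2 bytes each).
tokens = 128 * 1024
fp16_bytes = kv_cache_bytes(tokens, 32, 8, 128, 2)

print(f"per token: {kv_cache_bytes(1, 32, 8, 128, 2) / 1024:.0f} KiB")
print(f"128k tokens: {fp16_bytes / 2**30:.1f} GiB")
```

Under these assumptions the full-precision cache alone comes to 16 GiB. An RTX 3090 has 24 GB of memory that must also hold the model weights, which is why some combination of compression and offloading is needed to serve such a context on one card.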

Abstract

The Key-Value (KV) cache in generative large language models (LLMs) introduces substantial memory overhead. Existing works mitigate this burden by offloading or compressing the KV cache. However, loading the entire cache incurs significant latency due to PCIe bandwidth bottlenecks in CPU-GPU communication, while aggressive compression causes notable performance degradation. We identify that certain layers in the LLM need to maintain global information and are unsuitable for selective loading. In contrast, other layers primarily focus on a few tokens with dominant activations that potentially incur substantial quantization error. This observation leads to a key insight that loading dominant tokens and quantizing all tokens can complement each other. Building on this insight, we propose a hybrid compression method, TailorKV, which seamlessly integrates quantization and offloading.…
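
The two observations in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the function names, the 2-bit setting, and all numbers are assumptions made up for illustration. The first part shows that when attention is concentrated on a few dominant tokens, loading only the top-k tokens covers almost all of the attention mass, while a layer with spread-out attention loses most of it. The second part shows that a single dominant activation widens the quantization range and increases the error on the remaining values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_tokens(attn, k):
    """Indices of the k tokens with the largest attention weight."""
    return sorted(range(len(attn)), key=lambda i: attn[i], reverse=True)[:k]


def mass_covered(attn, k):
    """Fraction of attention mass retained when only the top-k tokens are loaded."""
    return sum(attn[i] for i in top_k_tokens(attn, k))


def quantize(values, bits):
    """Asymmetric uniform quantization of a list of floats to `bits` bits."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def max_error(values, bits):
    """Largest absolute reconstruction error after quantize/dequantize."""
    codes, scale, lo = quantize(values, bits)
    return max(abs(v - (c * scale + lo)) for v, c in zip(values, codes))


# Toy attention logits over 8 cached tokens for two hypothetical layers.
dense_layer = softmax([0.1, 0.0, 0.2, 0.1, 0.0, 0.1, 0.2, 0.1])   # spread out
sparse_layer = softmax([0.0, 8.0, 0.1, 0.0, 7.5, 0.1, 0.0, 0.2])  # two dominant tokens

print("dense  top-2 mass:", round(mass_covered(dense_layer, 2), 3))
print("sparse top-2 mass:", round(mass_covered(sparse_layer, 2), 3))

# Toy activations: the second list contains one dominant value.
plain = [0.1, -0.2, 0.05, 0.3]
with_outlier = [0.1, -0.2, 0.05, 30.0]

print("2-bit error, no outlier:  ", round(max_error(plain, 2), 3))
print("2-bit error, with outlier:", round(max_error(with_outlier, 2), 3))
```

In this toy setting, selective loading suits the sparse layer and uniform quantization suits values without outliers, which mirrors the complementarity the abstract describes.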

Code & Models

Repositories

ydyhello/tailorkv (PyTorch, official)

Taxonomy

Topics: Caching and Content Delivery · Advanced Data Compression Techniques · Algorithms and Data Compression

Methods: Focus