Crystal-KV: Efficient KV Cache Management for Chain-of-Thought LLMs via Answer-First Principle

Zihan Wang; Cheng Tang; Lei Gong; Cheng Li; Chao Wang; Teng Wang; Wenqi Lou; Xuehai Zhou

arXiv:2601.16986·cs.CL·January 27, 2026

Open Access

TL;DR

Crystal-KV introduces an answer-first principle and an attention-based cache management algorithm to optimize KV cache usage in Chain-of-Thought reasoning, significantly enhancing efficiency without sacrificing accuracy.

Contribution

The paper proposes a KV cache management framework tailored for CoT reasoning. It combines an attention-based eviction strategy with adaptive cache allocation to improve throughput and response time.

Findings

01. Achieves state-of-the-art KV cache compression

02. Significantly improves inference throughput and speed

03. Maintains or improves answer accuracy in CoT tasks

Abstract

Chain-of-Thought (CoT) reasoning in large language models (LLMs) significantly improves accuracy on complex tasks, yet it incurs excessive memory overhead because the long think-stage sequences are stored in the Key-Value (KV) cache. Unlike traditional generation tasks, where all tokens are uniformly important, CoT emphasizes the final answer, which renders conventional KV compression strategies ineffective. In this paper, we present Crystal-KV, an efficient KV cache management framework tailored for CoT reasoning. Our key insight is the answer-first principle. By mapping answer preferences onto the think-stage attention map, we distinguish between SlipKV, which mainly maintains the reasoning flow but may occasionally introduce misleading context, and CrystalKV, which truly contributes to the correctness of the final answer. Next, we propose an attention-based Least Recently Frequently Used algorithm.…
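The abstract is cut off before it describes the attention-based Least Recently Frequently Used algorithm, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes the classic LRFU scoring rule, in which each cached entry carries a score that decays by a factor of 0.5 ** lam per step, and it assumes (hypothetically) that the attention weight a token receives at each decoding step is what gets added to its score. The names `update_scores`, `tokens_to_keep`, `run`, `STEPS`, and the parameter `lam` are illustrative, and the toy keeps the number of cached tokens fixed rather than growing it as real decoding would.

```python
import heapq

def update_scores(scores, attn_row, lam):
    """Decay every cached token's score, then add this step's attention weight.

    lam = 0 means no decay (pure accumulated attention, frequency-like);
    a larger lam forgets old attention faster (recency-like).
    """
    decay = 0.5 ** lam
    return [s * decay + a for s, a in zip(scores, attn_row)]

def tokens_to_keep(scores, budget):
    """Return the sorted indices of the `budget` highest-scoring tokens."""
    top = heapq.nlargest(budget, range(len(scores)), key=lambda i: scores[i])
    return sorted(top)

def run(attn_steps, budget, lam):
    """Replay per-step attention rows and report which tokens survive eviction."""
    scores = [0.0] * len(attn_steps[0])
    for row in attn_steps:
        scores = update_scores(scores, row, lam)
    return tokens_to_keep(scores, budget)

# Toy attention rows (assumed data): token 0 is attended early, token 2 later.
STEPS = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.6, 0.2],
    [0.1, 0.1, 0.5, 0.3],
]
```

With a budget of two tokens, `run(STEPS, 2, lam=1.0)` keeps tokens `[2, 3]`, because the early attention on token 0 has decayed, while `run(STEPS, 2, lam=0.0)` keeps `[0, 2]`, because without decay token 0's accumulated attention still ranks second. The single `lam` knob is what lets an LRFU-style policy slide between recency-driven and frequency-driven eviction.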

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques