Beyond Speedup -- Utilizing KV Cache for Sampling and Reasoning
Zeyu Xing, Xing Li, Hui-Ling Zhen, Mingxuan Yuan, Sinno Jialin Pan

TL;DR
This paper demonstrates that KV caches, traditionally used for speed in autoregressive decoding, can be repurposed as lightweight representations for downstream tasks like reasoning and sampling, achieving competitive results without additional computation.
Contribution
The paper introduces an approach that uses KV caches as effective, low-cost representations for downstream tasks, reducing the need for recomputation and enabling new inference capabilities.
Findings
KV-derived representations are sufficient for chain-of-embedding tasks.
They enable adaptive reasoning with up to 5.7x speedup.
KV caches perform competitively or better than dedicated embeddings.
Abstract
KV caches, typically used only to speed up autoregressive decoding, encode contextual information that can be reused for downstream tasks at no extra cost. We propose treating the KV cache as a lightweight representation, eliminating the need to recompute or store full hidden states. Despite being weaker than dedicated embeddings, KV-derived representations are shown to be sufficient for two key applications: (i) Chain-of-Embedding, where they achieve competitive or superior performance on Llama-3.1-8B-Instruct and Qwen2-7B-Instruct; and (ii) Fast/Slow Thinking Switching, where they enable adaptive reasoning on Qwen3-8B and DeepSeek-R1-Distill-Qwen-14B, reducing token generation by up to with minimal accuracy loss. Our findings establish KV caches as a free, effective substrate for sampling and reasoning, opening new directions for representation reuse in…
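To make the idea of a KV-derived representation concrete, here is a minimal sketch of one way cached keys and values could be pooled into one vector per layer and compared across layers. It is not the authors' method: the mean pooling, the concatenation of keys and values, the cosine comparison, the tensor layout `(layers, heads, seq_len, head_dim)`, and the function names `kv_layer_embeddings` and `adjacent_layer_cosine` are all assumptions made for illustration, and random arrays stand in for a real model's cache.

```python
import numpy as np

def kv_layer_embeddings(keys, values):
    """Pool each layer's cached keys and values into one vector per layer.

    keys, values: arrays of shape (layers, heads, seq_len, head_dim).
    Returns an array of shape (layers, 2 * heads * head_dim).
    Mean pooling is an assumption here, not the paper's stated choice.
    """
    # Average over the sequence axis, then flatten the heads into one vector.
    k = keys.mean(axis=2).reshape(keys.shape[0], -1)
    v = values.mean(axis=2).reshape(values.shape[0], -1)
    return np.concatenate([k, v], axis=1)

def adjacent_layer_cosine(emb):
    """Cosine similarity between each pair of consecutive layer embeddings."""
    a, b = emb[:-1], emb[1:]
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    # Guard against division by zero for all-zero embeddings.
    return num / np.maximum(den, 1e-12)

# Random stand-in for a KV cache: 4 layers, 2 heads, 5 tokens, head_dim 8.
rng = np.random.default_rng(0)
keys = rng.standard_normal((4, 2, 5, 8))
values = rng.standard_normal((4, 2, 5, 8))

emb = kv_layer_embeddings(keys, values)
sims = adjacent_layer_cosine(emb)
print(emb.shape, sims.shape)
```

The per-layer vectors come from tensors the decoder already holds, so the only added work in this sketch is the pooling. A layer-wise signal such as `sims` is the kind of quantity a downstream scorer or switching rule could consume.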
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Ferroelectric and Negative Capacitance Devices
