ZoomR: Memory Efficient Reasoning through Multi-Granularity Key Value Retrieval
David H. Yang, Yuxuan Zhu, Mohammad Mohammadi Amiri, Keerthiram Murugesan, Tejaswini Pedapati, Subhajit Chaudhury, Pin-Yu Chen

TL;DR
ZoomR introduces a hierarchical key-value cache strategy for large language models, compressing reasoning steps into summaries to significantly reduce memory usage during long output generation.
Contribution
It proposes a novel multi-granularity KV retrieval method that adaptively compresses reasoning thoughts and selectively zooms in on details, improving memory efficiency.
Findings
Reduces inference memory by more than 4× on reasoning tasks.
Maintains competitive performance with baseline models.
Enables memory-efficient decoding for long outputs.
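To give a sense of scale for the memory findings above, the following is a back-of-the-envelope sketch of how a full (uncompressed) KV cache grows with the number of cached tokens. It uses the standard accounting of one key and one value vector per layer, per KV head, per token. The model dimensions in the usage lines (32 layers, 8 KV heads, head dimension 128, 16-bit values) are hypothetical and are not taken from the paper; the function name `kv_cache_bytes` is also just illustrative.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Approximate size in bytes of a dense KV cache.

    The leading factor of 2 accounts for storing both keys and values.
    All arguments describe a hypothetical model, not one from the paper.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens * batch_size


# Hypothetical configuration, chosen only to make the arithmetic concrete.
config = dict(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_value=2)

per_token_kib = kv_cache_bytes(1, **config) / 2**10
long_trace_gib = kv_cache_bytes(32_768, **config) / 2**30

print(f"per token: {per_token_kib:.0f} KiB")
print(f"32,768 cached tokens: {long_trace_gib:.0f} GiB")
```

Because the size is linear in `num_tokens`, every additional reasoning token kept in the cache costs the same fixed amount, which is why long generated outputs, and not only long prompts, can dominate decoding memory.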
Abstract
Large language models (LLMs) have shown great performance on complex reasoning tasks but often require generating long intermediate thoughts before reaching a final answer. During generation, LLMs rely on a key-value (KV) cache for autoregressive decoding. However, the memory footprint of the KV cache grows with output length. Prior work on KV cache optimization mostly focuses on compressing the long input context, while retaining the full KV cache for decoding. For tasks requiring long output generation, this leads to increased computational and memory costs. In this paper, we introduce ZoomR, a novel approach that enables LLMs to adaptively compress verbose reasoning thoughts into summaries and uses a dynamic KV cache selection policy that leverages these summaries while also strategically "zooming in" on fine-grained details. By using summary keys as a coarse-grained index during…
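The idea of using summary keys as a coarse-grained index and then "zooming in" can be illustrated with a toy two-level lookup. The sketch below is an assumption-laden illustration, not the paper's algorithm: the function `zoom_attention`, the `top_k` selection rule, the scaled dot-product scoring, and the choice to attend over all summaries plus the selected segments are hypothetical choices made only to show the general pattern for a single query vector and a single attention head.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - x.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def zoom_attention(query, summary_keys, summary_values,
                   segment_keys, segment_values, top_k=1):
    """Toy two-level KV retrieval (hypothetical, not the paper's method).

    query:          (d,) query vector for the current decoding step
    summary_keys:   (S, d) one coarse key per reasoning segment
    summary_values: (S, d) one coarse value per reasoning segment
    segment_keys:   list of S arrays, each (n_i, d), fine-grained keys
    segment_values: list of S arrays, each (n_i, d), fine-grained values
    """
    d = query.shape[0]

    # Coarse pass: score every segment through its summary key only.
    coarse_scores = summary_keys @ query / np.sqrt(d)
    chosen = np.argsort(coarse_scores)[-top_k:][::-1]

    # Fine pass: attend over all summaries plus the fine-grained
    # entries of the selected segments ("zooming in").
    keys = np.concatenate([summary_keys] + [segment_keys[i] for i in chosen])
    values = np.concatenate([summary_values] + [segment_values[i] for i in chosen])
    weights = softmax(keys @ query / np.sqrt(d))
    return weights @ values, chosen


# Usage with random fine-grained entries and hand-picked summary keys.
rng = np.random.default_rng(0)
d, num_segments, tokens_per_segment = 4, 3, 5
summary_keys = np.eye(num_segments, d)
summary_values = rng.normal(size=(num_segments, d))
segment_keys = [rng.normal(size=(tokens_per_segment, d)) for _ in range(num_segments)]
segment_values = [rng.normal(size=(tokens_per_segment, d)) for _ in range(num_segments)]

query = np.array([0.0, 5.0, 0.0, 0.0])  # aligned with the summary key of segment 1
output, chosen = zoom_attention(query, summary_keys, summary_values,
                                segment_keys, segment_values, top_k=1)
print(chosen, output.shape)
```

In this toy setup the decoding step touches `S + top_k * n_i` key-value pairs instead of all `S * n_i` fine-grained ones, which is the kind of saving a coarse index is meant to enable.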
