ZoomR: Memory Efficient Reasoning through Multi-Granularity Key Value Retrieval

David H. Yang; Yuxuan Zhu; Mohammad Mohammadi Amiri; Keerthiram Murugesan; Tejaswini Pedapati; Subhajit Chaudhury; Pin-Yu Chen

arXiv:2604.10898·cs.LG·April 15, 2026

TL;DR

ZoomR introduces a hierarchical key-value cache strategy for large language models, compressing reasoning steps into summaries to significantly reduce memory usage during long output generation.

Contribution

It proposes a novel multi-granularity KV retrieval method that adaptively compresses reasoning thoughts and selectively zooms in on details, improving memory efficiency.
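
To make the coarse-to-fine idea concrete, here is a minimal sketch of a two-level selection step. It is an illustration under stated assumptions, not the paper's algorithm: the `Segment` layout, the dot-product scoring, the `zoom_k` parameter, and the choice to represent non-selected steps by a single summary entry are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product, used here as a stand-in relevance score."""
    return sum(x * y for x, y in zip(a, b))


@dataclass
class Segment:
    """One compressed reasoning step (hypothetical layout)."""
    summary_key: Vector        # coarse-grained key for the whole step
    token_keys: List[Vector]   # fine-grained keys for the step's tokens


def select_entries(query: Vector, segments: List[Segment],
                   zoom_k: int) -> List[Tuple[int, int]]:
    """Return (segment_index, token_index) pairs to attend over.

    A token_index of -1 stands for the segment's summary entry.
    """
    # Coarse pass: score every segment by its summary key only.
    scores = [dot(query, seg.summary_key) for seg in segments]
    ranked = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    zoomed = set(ranked[:zoom_k])

    selected: List[Tuple[int, int]] = []
    for i, seg in enumerate(segments):
        if i in zoomed:
            # Fine pass: "zoom in" and expose every token of this step.
            selected.extend((i, t) for t in range(len(seg.token_keys)))
        else:
            # Otherwise the step is represented by its summary alone.
            selected.append((i, -1))
    return selected


# Toy data (assumed): three reasoning steps with 2-dimensional keys.
segments = [
    Segment([1.0, 0.0], [[1.0, 0.0], [0.9, 0.1]]),
    Segment([0.0, 1.0], [[0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]),
    Segment([0.5, 0.5], [[0.5, 0.5]]),
]
picked = select_entries([0.0, 1.0], segments, zoom_k=1)
```

In this toy run the query matches the second step best, so only that step is expanded to token level and the other two contribute one summary entry each. The attended set has five entries instead of six; the saving grows with the length of the steps that stay compressed.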

Findings

01. Reduces inference memory by more than 4 times on reasoning tasks.

02. Maintains performance competitive with baseline models.

03. Enables memory-efficient decoding for long outputs.

Abstract

Large language models (LLMs) have shown great performance on complex reasoning tasks but often require generating long intermediate thoughts before reaching a final answer. During generation, LLMs rely on a key-value (KV) cache for autoregressive decoding. However, the memory footprint of the KV cache grows with output length. Prior work on KV cache optimization mostly focuses on compressing the long input context, while retaining the full KV cache for decoding. For tasks requiring long output generation, this leads to increased computational and memory costs. In this paper, we introduce ZoomR, a novel approach that enables LLMs to adaptively compress verbose reasoning thoughts into summaries and uses a dynamic KV cache selection policy that leverages these summaries while also strategically "zooming in" on fine-grained details. By using summary keys as a coarse-grained index during…
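
The abstract's point that the KV cache footprint grows with output length can be checked with a short back-of-the-envelope calculation. The sketch below uses the standard size formula for a dense, uncompressed KV cache (one key and one value tensor per layer). The model dimensions are hypothetical and are not taken from the paper.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2,
                   batch_size: int = 1) -> int:
    """Size of a dense KV cache in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 corresponds to 16-bit storage (fp16 or bf16).
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


# Hypothetical model shape, chosen only for illustration.
config = dict(num_layers=32, num_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(seq_len=1, **config)        # bytes added per generated token
full = kv_cache_bytes(seq_len=32_768, **config)        # cache after 32,768 tokens

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{full / 1024**3:.0f} GiB at 32,768 tokens")
```

Under these assumed dimensions each generated token adds 128 KiB, so a 32,768-token reasoning trace occupies 4 GiB for a single sequence. The growth is linear in sequence length, which is why keeping the full cache during long decoding becomes the dominant memory cost.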
