SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning
Sanjay Kariyappa, G. Edward Suh

TL;DR
SideQuest is a model-driven KV cache compression method for long-horizon reasoning tasks. The reasoning model itself manages the cache, cutting peak token usage by up to 65% with minimal accuracy degradation.
Contribution
SideQuest has the reasoning model itself perform KV cache compression, addressing the limitations of heuristic methods on multi-step reasoning tasks.
Findings
Reduces peak token usage by up to 65%
Achieves minimal accuracy degradation
Outperforms heuristic-based compression techniques
Abstract
Long-running agentic tasks, such as deep research, require multi-hop reasoning over information distributed across multiple webpages and documents. In such tasks, the LLM context is dominated by tokens from external retrieval, causing memory usage to grow rapidly and limiting decode performance. While several KV cache compression techniques exist for long-context inputs, we find that existing heuristics fail to support multi-step reasoning models effectively. We address this challenge with SideQuest -- a novel approach that leverages the Large Reasoning Model (LRM) itself to perform KV cache compression by reasoning about the usefulness of tokens in its context. To prevent the tokens associated with this management process from polluting the model's memory, we frame KV cache compression as an auxiliary task executed in parallel to the main reasoning task. Our evaluations, using a model…
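The idea in the abstract, an auxiliary task that judges which context tokens are still useful and whose own tokens never enter the main context, can be illustrated with a toy sketch. This is not the authors' implementation: `Span`, `ToyKVCache`, `run_side_task`, and `keyword_judge` are hypothetical names, the "cache" only tracks which text spans are resident rather than real key/value tensors, the judge is a keyword check standing in for the model's own reasoning, and the side task runs sequentially here rather than in parallel.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Span:
    """A contiguous group of context tokens, e.g. one retrieved webpage."""
    span_id: int
    source: str          # "reasoning" or "retrieval" (hypothetical labels)
    tokens: List[str]


@dataclass
class ToyKVCache:
    """Stand-in for a KV cache: it only records which spans are resident."""
    spans: Dict[int, Span] = field(default_factory=dict)
    peak_tokens: int = 0

    def num_tokens(self) -> int:
        return sum(len(s.tokens) for s in self.spans.values())

    def append(self, span: Span) -> None:
        self.spans[span.span_id] = span
        self.peak_tokens = max(self.peak_tokens, self.num_tokens())

    def evict(self, span_ids: List[int]) -> None:
        for sid in span_ids:
            self.spans.pop(sid, None)


def run_side_task(cache: ToyKVCache, judge: Callable[[Span], bool]) -> List[int]:
    """Auxiliary pass over a snapshot of the context.

    It returns only a list of span ids to evict. Nothing it produces is
    appended to the main cache, so the management step does not add
    tokens to the main context.
    """
    snapshot = list(cache.spans.values())
    return [
        s.span_id
        for s in snapshot
        if s.source == "retrieval" and not judge(s)
    ]


def keyword_judge(keyword: str) -> Callable[[Span], bool]:
    """Assumed stand-in for the model's usefulness judgment."""
    return lambda span: keyword in span.tokens


# Toy run: one reasoning span, one irrelevant page, one relevant page.
cache = ToyKVCache()
cache.append(Span(0, "reasoning", ["find", "the", "launch", "year"]))
cache.append(Span(1, "retrieval", ["page", "about", "cooking"] * 10))
cache.append(Span(2, "retrieval", ["launch", "year", "was", "1998"]))

before = cache.num_tokens()
evicted = run_side_task(cache, keyword_judge("launch"))
cache.evict(evicted)
after = cache.num_tokens()
print(f"evicted spans: {evicted}, tokens before: {before}, after: {after}")
```

In this toy run only the irrelevant retrieved page is dropped, while the reasoning span and the relevant page stay resident. A real system would evict the corresponding key/value entries inside the inference engine and would rely on the model's judgment rather than a keyword match.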
Taxonomy
Topics: Semantic Web and Ontologies · Multi-Agent Systems and Negotiation · Natural Language Processing Techniques
