Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices
Yakov Pyotr Shkolnikov

TL;DR
This paper introduces a method for persisting multi-agent LLM KV caches to disk in 4-bit quantized form on edge devices and restoring them directly, which cuts time-to-first-token by up to 136x and fits 4x more agents in fixed memory.
Contribution
It proposes a novel persistent cache system with 4-bit quantization and direct restoration, improving multi-agent LLM inference efficiency on edge hardware.
Findings
Cache restoration reduces time-to-first-token by up to 136x.
Q4 quantization allows 4x more agents in fixed memory.
Perplexity impact is minimal: it increases by less than 3%.
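The capacity finding can be checked with a rough sizing sketch. It uses the 10.2 GB budget and 8K context quoted in the abstract. The model dimensions (48 layers, 8 KV heads, head dimension 256) and the function names are hypothetical, chosen only so that the FP16 case reproduces the 3-agent figure. Quantization metadata such as scales and offsets is ignored.

```python
def kv_cache_bytes(tokens, bits, layers=48, kv_heads=8, head_dim=256):
    """Bytes for one agent's keys and values, ignoring quantization metadata.

    The default model dimensions are illustrative assumptions, not values
    taken from the paper.
    """
    elements = 2 * layers * kv_heads * head_dim * tokens  # 2 = keys + values
    return elements * bits // 8


def agents_that_fit(budget_bytes, tokens, bits):
    """Number of whole per-agent caches that fit in the budget."""
    return budget_bytes // kv_cache_bytes(tokens, bits)


BUDGET = 10_200_000_000  # 10.2 GB cache budget quoted in the abstract
CONTEXT = 8192           # 8K context

fp16_agents = agents_that_fit(BUDGET, CONTEXT, bits=16)
q4_agents = agents_that_fit(BUDGET, CONTEXT, bits=4)
print(fp16_agents, q4_agents)  # prints "3 12"
```

Under these assumptions the nominal 16-bit to 4-bit reduction gives four times as many resident agents. In a real Q4 layout, per-group scales and offsets add some overhead, so the exact ratio depends on the quantization format.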
Abstract
Multi-agent LLM systems on edge devices face a memory management problem: device RAM is too small to hold every agent's KV cache simultaneously. On Apple M4 Pro with 10.2 GB of cache budget, only 3 agents fit at 8K context in FP16. A 10-agent workflow must constantly evict and reload caches. Without persistence, every eviction forces a full re-prefill through the model -- 15.7 seconds per agent at 4K context. We address this by persisting each agent's KV cache to disk in 4-bit quantized format and reloading it directly into the attention layer, eliminating redundant O(n) prefill computation via direct cache restoration. The system comprises three components: a block pool providing per-agent isolated Q4 KV caches in safetensors format, a BatchQuantizedKVCache for concurrent inference over multiple agents' quantized caches, and cross-phase context injection that accumulates attention…
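The persist-and-restore idea can be illustrated with a minimal sketch. It is not the paper's implementation: a flat list of floats stands in for a KV tensor, a plain binary file stands in for the safetensors format, and the group size of 32, the min/max affine scheme, and all function names are assumptions made for illustration.

```python
import math
import os
import struct
import tempfile

GROUP = 32  # hypothetical group size; the paper's value is not stated here


def quantize_q4(values, group=GROUP):
    """Map floats to 4-bit codes with one (scale, minimum) pair per group."""
    codes, meta = [], []
    for start in range(0, len(values), group):
        chunk = values[start:start + group]
        lo, hi = min(chunk), max(chunk)
        scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for constant groups
        meta.append((scale, lo))
        codes.extend(min(15, max(0, round((v - lo) / scale))) for v in chunk)
    return codes, meta


def dequantize_q4(codes, meta, group=GROUP):
    """Reconstruct approximate floats from codes and per-group metadata."""
    return [meta[i // group][1] + c * meta[i // group][0]
            for i, c in enumerate(codes)]


def pack_nibbles(codes):
    """Pack two 4-bit codes into each byte, low nibble first."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even count
    return bytes(codes[i] | (codes[i + 1] << 4)
                 for i in range(0, len(codes), 2))


def unpack_nibbles(data, count):
    """Inverse of pack_nibbles; count drops any padding nibble."""
    out = []
    for b in data:
        out.append(b & 0x0F)
        out.append(b >> 4)
    return out[:count]


def save_cache(path, values, group=GROUP):
    """Write header, per-group metadata, then packed 4-bit codes."""
    codes, meta = quantize_q4(values, group)
    with open(path, "wb") as f:
        f.write(struct.pack("<III", len(codes), group, len(meta)))
        for scale, lo in meta:
            f.write(struct.pack("<ff", scale, lo))
        f.write(pack_nibbles(codes))


def load_cache(path):
    """Read the file back and dequantize, with no recomputation of values."""
    with open(path, "rb") as f:
        count, group, n_groups = struct.unpack("<III", f.read(12))
        meta = [struct.unpack("<ff", f.read(8)) for _ in range(n_groups)]
        codes = unpack_nibbles(f.read(), count)
    return dequantize_q4(codes, meta, group)


if __name__ == "__main__":
    demo = [math.sin(0.1 * i) for i in range(100)]  # stand-in for KV values
    demo_path = os.path.join(tempfile.mkdtemp(), "agent0.q4")
    save_cache(demo_path, demo)
    worst = max(abs(a - b) for a, b in zip(demo, load_cache(demo_path)))
    print(f"max reconstruction error: {worst:.4f}")
```

In this sketch, restoring an evicted agent costs one file read and a dequantization pass instead of a forward pass over the whole prompt. That is the trade the abstract describes, although the real system reloads the quantized cache into the attention layer rather than dequantizing to a Python list.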
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Advanced Data Storage Technologies
