Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices

Yakov Pyotr Shkolnikov

arXiv:2603.04428 · cs.LG · March 6, 2026

Open Access

TL;DR

This paper persists each agent's KV cache to disk in 4-bit quantized form and restores it directly into the attention layer on edge devices, cutting time-to-first-token by up to 136x and fitting 4x more agents into fixed memory.

Contribution

The paper proposes a persistent KV cache system that combines 4-bit quantization with direct cache restoration, so that an evicted agent can resume without a full re-prefill on edge hardware.

Findings

01 Cache restoration reduces time-to-first-token by up to 136x.

02 Q4 quantization fits 4x more agents into fixed memory than FP16.

03 Perplexity increases by less than 3%.

Abstract

Multi-agent LLM systems on edge devices face a memory management problem: device RAM is too small to hold every agent's KV cache simultaneously. On Apple M4 Pro with 10.2 GB of cache budget, only 3 agents fit at 8K context in FP16. A 10-agent workflow must constantly evict and reload caches. Without persistence, every eviction forces a full re-prefill through the model -- 15.7 seconds per agent at 4K context. We address this by persisting each agent's KV cache to disk in 4-bit quantized format and reloading it directly into the attention layer, eliminating redundant O(n) prefill computation via direct cache restoration. The system comprises three components: a block pool providing per-agent isolated Q4 KV caches in safetensors format, a BatchQuantizedKVCache for concurrent inference over multiple agents' quantized caches, and cross-phase context injection that accumulates attention…
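To make the idea of a 4-bit quantized cache that can be written out and reloaded more concrete, here is a minimal sketch of group-wise Q4 quantization with a byte-level round trip. It is not the paper's implementation: the abstract names safetensors as the storage format, whereas this sketch uses an ad hoc byte layout, a flat list of floats in place of real K/V tensors, and an assumed group size of 32. The names `quantize_q4`, `dequantize_q4`, `to_bytes`, and `from_bytes` are hypothetical.

```python
import struct

GROUP_SIZE = 32  # assumed group size; the abstract does not state one


def quantize_q4(values, group_size=GROUP_SIZE):
    """Map floats to 4-bit codes, one (scale, minimum) pair per group."""
    if group_size % 2 or len(values) % group_size:
        raise ValueError("group_size must be even and divide len(values)")
    packed = bytearray()
    meta = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for a constant group
        meta.append((scale, lo))
        codes = [min(15, max(0, round((v - lo) / scale))) for v in group]
        # Pack two 4-bit codes per byte, low nibble first.
        for i in range(0, group_size, 2):
            packed.append(codes[i] | (codes[i + 1] << 4))
    return bytes(packed), meta


def dequantize_q4(packed, meta, group_size=GROUP_SIZE):
    """Rebuild approximate floats from packed codes and per-group metadata."""
    values = []
    bytes_per_group = group_size // 2
    for g, (scale, lo) in enumerate(meta):
        chunk = packed[g * bytes_per_group:(g + 1) * bytes_per_group]
        for byte in chunk:
            values.append(lo + (byte & 0x0F) * scale)
            values.append(lo + (byte >> 4) * scale)
    return values


def to_bytes(packed, meta):
    """Serialize as: group count, then (scale, minimum) doubles, then codes."""
    header = struct.pack("<I", len(meta))
    table = b"".join(struct.pack("<dd", scale, lo) for scale, lo in meta)
    return header + table + packed


def from_bytes(blob):
    """Inverse of to_bytes; returns (packed, meta)."""
    (n_groups,) = struct.unpack_from("<I", blob, 0)
    meta = [struct.unpack_from("<dd", blob, 4 + 16 * g) for g in range(n_groups)]
    return blob[4 + 16 * n_groups:], meta
```

The codes themselves take 4 bits per value, a quarter of the 16 bits FP16 needs. The per-group scale and minimum add overhead on top of that, and how much depends on the group size and on the precision used for the metadata (doubles here, purely for simplicity). Restoring from the serialized blob involves only unpacking and a multiply-add per value, which is the property that lets a reload avoid re-running prefill through the model.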

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Advanced Data Storage Technologies