QKVShare: Quantized KV-Cache Handoff for Multi-Agent On-Device LLMs
Pratik Honavar, Tejpratap GVSL

TL;DR
QKVShare is a quantized KV-cache handoff framework for multi-agent on-device LLMs. It uses adaptive quantization and cache injection to lower handoff latency relative to full re-prefill, as an alternative to full-precision KV transfer.
Contribution
The paper presents a quantized KV-cache handoff method for edge LLM systems that combines token-level mixed-precision allocation, a self-contained CacheCard representation, and a HuggingFace-compatible cache injection path.
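To make token-level mixed-precision allocation concrete, here is a minimal sketch under stated assumptions. It is not the paper's algorithm: the importance scores, the two bit widths (2 and 8), the average-bit budget, and the helper names `allocate_bits`, `quantize_token`, and `dequantize_token` are all hypothetical, chosen only to illustrate spending more bits on some tokens than on others within a fixed budget.

```python
from typing import List, Sequence, Tuple


def allocate_bits(importance: Sequence[float], avg_bits: float,
                  low: int = 2, high: int = 8) -> List[int]:
    """Assign `high` bits to the most important tokens and `low` bits to the
    rest, keeping the mean bit width at or below `avg_bits` (hypothetical policy)."""
    n = len(importance)
    if n == 0:
        return []
    # How many tokens can receive `high` bits without exceeding the budget.
    n_high = int((avg_bits - low) * n / (high - low))
    n_high = max(0, min(n, n_high))
    # Rank token indices from most to least important.
    ranked = sorted(range(n), key=lambda i: importance[i], reverse=True)
    bits = [low] * n
    for i in ranked[:n_high]:
        bits[i] = high
    return bits


def quantize_token(vec: Sequence[float], bits: int) -> Tuple[List[int], float]:
    """Symmetric per-token quantization to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1
    peak = max((abs(v) for v in vec), default=0.0)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in vec]
    return codes, scale


def dequantize_token(codes: Sequence[int], scale: float) -> List[float]:
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


# Example: four tokens, made-up importance scores, 5-bit average budget.
scores = [0.9, 0.1, 0.5, 0.2]
token_bits = allocate_bits(scores, avg_bits=5.0)
kv_rows = [[0.5, -1.0, 0.25]] * 4  # stand-in for per-token K or V vectors
packed = [quantize_token(row, b) for row, b in zip(kv_rows, token_bits)]
```

Each token's scale has to travel with its codes so the receiving agent can dequantize without extra context. That is one plausible reading of what a self-contained representation needs to carry, not a description of the CacheCard format itself.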
Findings
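Adaptive quantization remains competitive under repeated handoff and shows its clearest gains over uniform quantization in deeper-hop, higher-budget settings.
QKVShare reduces TTFT relative to full re-prefill at every tested context, from 130.7 ms vs. 150.2 ms at nominal 1K context to 397.1 ms vs. 1029.7 ms at nominal 8K context.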
Post-injection generation dominates QKVShare latency, indicating areas for further optimization.
Abstract
Multi-agent LLM systems on edge devices need to hand off latent context efficiently, but the practical choices today are expensive re-prefill or full-precision KV transfer. We study QKVShare, a framework for quantized KV-cache handoff between agents that combines token-level mixed-precision allocation, a self-contained CacheCard representation, and a HuggingFace-compatible cache injection path. Our current results support a narrower but clearer story than the original draft: on 150 GSM8K problems with Llama-3.1-8B-Instruct, adaptive quantization remains competitive under repeated handoff and shows its clearest gains against uniform quantization in deeper-hop, higher-budget settings; for handoff latency, the QKVShare path reduces TTFT relative to full re-prefill at every tested context, from 130.7 ms vs. 150.2 ms at nominal 1K context to 397.1 ms vs. 1029.7 ms at nominal 8K context; …
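The TTFT figures quoted in the abstract can be turned into relative terms with a few lines of arithmetic. The sketch below uses only the four numbers given above; the dictionary layout and the helper names `speedup` and `reduction_pct` are illustrative. It works out to roughly a 1.15x speedup (about 13% lower TTFT) at nominal 1K context and roughly 2.59x (about 61% lower) at nominal 8K, so the advantage over re-prefill widens as context grows.

```python
# TTFT figures (ms) quoted in the abstract; keys are nominal context lengths.
TTFT_MS = {
    "1K": {"qkvshare": 130.7, "re_prefill": 150.2},
    "8K": {"qkvshare": 397.1, "re_prefill": 1029.7},
}


def speedup(re_prefill_ms: float, qkvshare_ms: float) -> float:
    """Ratio of re-prefill TTFT to QKVShare TTFT (values above 1 favor QKVShare)."""
    return re_prefill_ms / qkvshare_ms


def reduction_pct(re_prefill_ms: float, qkvshare_ms: float) -> float:
    """Percentage of re-prefill TTFT saved by the QKVShare path."""
    return 100.0 * (re_prefill_ms - qkvshare_ms) / re_prefill_ms


for ctx, t in TTFT_MS.items():
    s = speedup(t["re_prefill"], t["qkvshare"])
    r = reduction_pct(t["re_prefill"], t["qkvshare"])
    print(f"{ctx}: {s:.2f}x speedup, {r:.1f}% lower TTFT")
```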
