ProxyKV: Cross-Model Proxy Pruning for Efficient Long-Context LLM Inference
Junjie Li, Jiong Lou, Jie Li

TL;DR
ProxyKV is a cross-model proxy pruning framework that speeds up long-context LLM inference by offloading KV-cache importance scoring to a lightweight proxy model that runs asynchronously to the target model, reaching up to a 3.21x speedup on Llama-3.1-8B with minimal accuracy loss.
Contribution
The paper presents ProxyKV, a cross-model proxy pruning method that pairs the HybridAxialMapper architecture with a Multi-Granularity Hybrid Loss, improving long-context inference speed and accuracy across multiple LLM families.
Findings
Matches KVZip accuracy while significantly speeding up KV prefetching.
Achieves up to 3.21x speedup on Llama-3.1-8B with minimal accuracy loss.
Maintains speedup at context lengths up to 170k tokens on Qwen-2.5-7B.
Abstract
Efficient long-context inference in Large Language Models (LLMs) is severely constrained by the Key-Value (KV) cache memory wall, yet existing pruning methods force a choice between low-latency heuristics that sacrifice precision and high-precision reconstruction methods that incur prohibitive prefilling overhead. To bridge this scoring-cost–accuracy gap, we propose ProxyKV, a cross-model proxy pruning framework that offloads importance scoring to a lightweight intra-family Small-Model Proxy executed asynchronously to the Large-Model Target. To bridge the architectural gap between heterogeneous models, we design the HybridAxialMapper, which disentangles temporal feature extraction from cross-head alignment, together with a Multi-Granularity Hybrid Loss that shifts the learning objective from rigid regression to relative ranking consistency. Across the Llama-3.1, Qwen-2.5, and…
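The abstract's contrast between "rigid regression" and "relative ranking consistency" can be illustrated with a small NumPy sketch. This is not the paper's Multi-Granularity Hybrid Loss, whose definition is not given here; it substitutes a generic pairwise hinge loss, and the names `mse_loss`, `pairwise_ranking_loss`, and `topk_keep` are hypothetical. It assumes that pruning keeps the highest-scoring tokens, so only the ordering of the proxy's scores matters.

```python
import numpy as np

def mse_loss(pred, target):
    # Regression objective: penalizes any mismatch in absolute score values.
    return float(np.mean((pred - target) ** 2))

def pairwise_ranking_loss(pred, target, margin=0.0):
    # Generic pairwise hinge loss (an illustrative stand-in, not the paper's loss).
    # For every pair (i, j) with target[i] > target[j], penalize the prediction
    # unless pred[i] exceeds pred[j] by at least `margin`.
    diff_pred = pred[:, None] - pred[None, :]
    mask = target[:, None] > target[None, :]
    if not mask.any():
        return 0.0
    return float(np.mean(np.maximum(0.0, margin - diff_pred[mask])))

def topk_keep(scores, keep):
    # Indices of the `keep` highest-scoring tokens, returned in position order.
    idx = np.argpartition(scores, -keep)[-keep:]
    return np.sort(idx)

# Hypothetical per-token importance scores from the target model.
target = np.array([0.9, 0.1, 0.5, 0.3])
# A proxy whose scores have the same ordering but a different scale and offset.
pred = 10.0 * target + 2.0
# A proxy whose ordering is fully reversed.
pred_bad = -target

print(mse_loss(pred, target))                # large, despite identical ordering
print(pairwise_ranking_loss(pred, target))   # zero: ordering is preserved
print(pairwise_ranking_loss(pred_bad, target))
print(topk_keep(pred, 2), topk_keep(target, 2))
```

In this toy case the rescaled proxy scores incur a large regression error yet select exactly the same tokens as the target scores, which is one plausible motivation for training a proxy on ranking rather than on exact values.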
