ProxyKV: Cross-Model Proxy Pruning for Efficient Long-Context LLM Inference

Junjie Li; Jiong Lou; Jie Li

arXiv:2605.16360·cs.LG·May 19, 2026


TL;DR

ProxyKV is a cross-model proxy pruning framework for long-context LLM inference. It offloads KV-cache importance scoring to a lightweight small-model proxy that runs asynchronously to the large target model, with a reported speedup of up to 3.21x on Llama-3.1-8B at minimal accuracy loss.

Contribution

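The paper presents ProxyKV, a cross-model proxy pruning method that pairs a HybridAxialMapper (which separates temporal feature extraction from cross-head alignment) with a Multi-Granularity Hybrid Loss built on relative ranking consistency, improving long-context inference speed and accuracy across multiple LLM families.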

Findings

01 Matches KVZip accuracy while significantly speeding up KV prefetching.

02 Achieves up to 3.21x speedup on Llama-3.1-8B with minimal accuracy loss.

03 Maintains speedup at context lengths up to 170k tokens on Qwen-2.5-7B.

Abstract

Efficient long-context inference in Large Language Models (LLMs) is severely constrained by the Key-Value (KV) cache memory wall, yet existing pruning methods force a choice between low-latency heuristics that sacrifice precision and high-precision reconstruction methods that incur prohibitive prefilling overhead. To bridge this scoring-cost–accuracy gap, we propose ProxyKV, a cross-model proxy pruning framework that offloads importance scoring to a lightweight intra-family Small-Model Proxy executed asynchronously to the Large-Model Target. To bridge the architectural gap between heterogeneous models, we design the HybridAxialMapper, which disentangles temporal feature extraction from cross-head alignment, together with a Multi-Granularity Hybrid Loss that shifts the learning objective from rigid regression to relative ranking consistency. Across the Llama-3.1, Qwen-2.5, and…
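As a rough illustration of the two ideas named in the abstract, the sketch below shows (a) pruning a KV cache by keeping the token positions with the highest importance scores, and (b) a pairwise hinge loss that rewards matching the target's ranking rather than its exact values. This is not the paper's implementation: the function names `prune_kv_by_scores` and `pairwise_ranking_loss`, the `keep_ratio` and `margin` parameters, and the single-head NumPy layout are all assumptions made for the example, and the pairwise hinge loss stands in for the Multi-Granularity Hybrid Loss, whose actual form is not given here.

```python
import numpy as np


def prune_kv_by_scores(keys, values, scores, keep_ratio):
    """Keep the highest-scoring token positions of a single-head KV cache.

    keys, values: arrays of shape (seq_len, head_dim)
    scores:       array of shape (seq_len,), e.g. produced by a proxy model
    keep_ratio:   fraction of positions to retain (hypothetical parameter)
    """
    seq_len = scores.shape[0]
    k = max(1, int(round(seq_len * keep_ratio)))
    # Select the k largest scores, then sort the indices so that the
    # retained tokens stay in their original sequence order.
    top = np.argpartition(scores, -k)[-k:]
    keep = np.sort(top)
    return keys[keep], values[keep], keep


def pairwise_ranking_loss(pred, target, margin=0.0):
    """Hinge loss over all pairs (i, j) where the target ranks i above j.

    The loss is zero when the predicted scores order every such pair the
    same way as the target (by at least `margin`), regardless of scale.
    """
    diff_pred = pred[:, None] - pred[None, :]
    diff_tgt = target[:, None] - target[None, :]
    mask = diff_tgt > 0
    if not mask.any():
        return 0.0
    return float(np.maximum(0.0, margin - diff_pred[mask]).mean())


# Toy usage: four tokens, keep half of them.
keys = np.arange(8, dtype=float).reshape(4, 2)
values = keys + 100.0
proxy_scores = np.array([0.1, 0.9, 0.3, 0.8])
pruned_k, pruned_v, kept = prune_kv_by_scores(keys, values, proxy_scores, 0.5)

# Scores on a different scale but with the same ordering incur no loss.
loss_same_order = pairwise_ranking_loss(
    np.array([1.0, 2.0, 3.0]), np.array([10.0, 20.0, 30.0])
)
```

The ranking loss depends only on pairwise order, which is one way to read "relative ranking consistency": a proxy whose scores are scaled differently from the target's can still select the same tokens.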
