MoE-nD: Per-Layer Mixture-of-Experts Routing for Multi-Axis KV Cache Compression
Libo Sun, Peixiong He, Po-Wei Harn, Xiao Qin

TL;DR
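MoE-nD is a mixture-of-experts routing framework for KV cache compression in long-context LLM inference. Instead of applying one compression recipe to every layer, it routes each layer to its own eviction ratio and key/value bit widths under a global memory budget, with the aim of preserving accuracy at high compression ratios.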
Contribution
The paper proposes per-layer heterogeneous eviction and quantization: an offline-calibrated greedy solver routes each layer to its own (eviction-ratio, K-bits, V-bits) configuration, chosen to minimize predicted quality loss under a global memory budget.
Findings
Achieves 14x compression with no accuracy loss on LongBench-v1 tasks.
Outperforms other compression baselines by a large margin in memory efficiency.
Improves reasoning benchmark scores by 6 to 27 points over uniform quantization methods.
Abstract
KV cache memory is the dominant bottleneck for long-context LLM inference. Existing compression methods each act on a single axis of the four-dimensional KV tensor -- token eviction (sequence), quantization (precision), low-rank projection (head dimension), or cross-layer sharing -- but apply the same recipe to every layer. We show that this homogeneity leaves accuracy on the table: different layers respond very differently to each compression operation, and the optimal per-layer mix of eviction and quantization is far from uniform. We propose MoE-nD, a mixture-of-experts framework that routes each layer to its own (eviction-ratio, K-bits, V-bits) tuple under a global memory budget. An offline-calibrated greedy solver chooses the routing that minimizes predicted quality loss; at inference time, per-layer heterogeneous eviction and quantization are applied jointly through a single…
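The budgeted per-layer routing described in the abstract can be illustrated with a small sketch. This is not the authors' solver: the cost model, the function names (`layer_cost`, `greedy_route`), the candidate configurations, and the loss table are all hypothetical. It assumes an fp16 baseline, equally sized layers, no quantization metadata overhead, and a precomputed table of predicted quality loss per layer and configuration, where each configuration is a (keep-ratio, K-bits, V-bits) tuple and the keep ratio is the fraction of tokens left after eviction.

```python
BASELINE_BITS = 16  # assumption: the uncompressed cache is stored in fp16


def layer_cost(keep_ratio, k_bits, v_bits):
    """Fraction of one layer's uncompressed KV memory that is retained.

    Assumption: keys and values are the same size and quantization
    metadata (scales, zero points) is ignored.
    """
    return keep_ratio * (k_bits + v_bits) / (2 * BASELINE_BITS)


def greedy_route(predicted_loss, mean_budget):
    """Pick one (keep_ratio, k_bits, v_bits) tuple per layer.

    predicted_loss: list with one dict per layer, config -> predicted loss.
    mean_budget: target retained fraction averaged over layers.
    Returns (routing, total_retained), where total_retained is summed over layers.
    """
    total_budget = mean_budget * len(predicted_loss)
    # Start every layer at its cheapest configuration.
    routing = [min(table, key=lambda c: layer_cost(*c)) for table in predicted_loss]
    used = sum(layer_cost(*c) for c in routing)
    if used > total_budget:
        raise ValueError("budget is below the cheapest possible routing")

    while True:
        best = None  # (loss reduction per unit memory, layer, config, extra memory)
        for layer, table in enumerate(predicted_loss):
            current = routing[layer]
            for config, loss in table.items():
                extra = layer_cost(*config) - layer_cost(*current)
                gain = table[current] - loss
                if gain <= 0 or used + extra > total_budget:
                    continue  # no improvement, or does not fit the budget
                score = float("inf") if extra <= 0 else gain / extra
                if best is None or score > best[0]:
                    best = (score, layer, config, extra)
        if best is None:
            return routing, used  # no affordable upgrade is left
        _, layer, config, extra = best
        routing[layer] = config
        used += extra


# Hypothetical loss table: layer 0 is sensitive to compression, layer 1 is robust.
AGGRESSIVE, MODERATE, LIGHT = (0.25, 4, 4), (0.5, 4, 4), (1.0, 8, 8)
predicted_loss = [
    {AGGRESSIVE: 1.0, MODERATE: 0.2, LIGHT: 0.0},
    {AGGRESSIVE: 0.1, MODERATE: 0.05, LIGHT: 0.0},
]
routing, used = greedy_route(predicted_loss, mean_budget=0.09375)
print(routing, used)
```

With this toy table the spare memory goes to the sensitive layer while the robust layer stays at the most aggressive setting, which is the kind of non-uniform allocation the abstract argues for. A greedy pass like this is a heuristic and does not guarantee the optimal routing.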
