MoE-nD: Per-Layer Mixture-of-Experts Routing for Multi-Axis KV Cache Compression

Libo Sun; Peixiong He; Po-Wei Harn; Xiao Qin

arXiv:2604.17695·cs.LG·April 21, 2026


TL;DR

MoE-nD is a per-layer mixture-of-experts routing framework for KV cache compression in long-context LLM inference. It routes each layer to its own eviction ratio and key/value bit-widths under a global memory budget, compressing the cache while preserving accuracy.

Contribution

It proposes per-layer heterogeneous eviction and quantization: an offline-calibrated greedy solver routes each layer to its own (eviction-ratio, K-bits, V-bits) configuration under a global memory budget, choosing the routing that minimizes predicted quality loss.

Findings

1. Achieves 14x compression with no accuracy loss on LongBench-v1 tasks.

2. Outperforms other compression baselines by a large margin in memory efficiency.

3. Improves reasoning benchmark scores by 6 to 27 points over uniform quantization methods.

Abstract

KV cache memory is the dominant bottleneck for long-context LLM inference. Existing compression methods each act on a single axis of the four-dimensional KV tensor -- token eviction (sequence), quantization (precision), low-rank projection (head dimension), or cross-layer sharing -- but apply the same recipe to every layer. We show that this homogeneity leaves accuracy on the table: different layers respond very differently to each compression operation, and the optimal per-layer mix of eviction and quantization is far from uniform. We propose MoE-nD, a mixture-of-experts framework that routes each layer to its own (eviction-ratio, K-bits, V-bits) tuple under a global memory budget. An offline-calibrated greedy solver chooses the routing that minimizes predicted quality loss; at inference time, per-layer heterogeneous eviction and quantization are applied jointly through a single…
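The abstract does not spell out the solver, so the following is only a minimal sketch of how a greedy per-layer router under a global memory budget might look. Everything here is an assumption made for illustration: the `Expert` class, the `greedy_route` function, the FP16 baseline used to measure memory cost, the cost formula, the loss-per-memory-saved greedy criterion, and the toy calibration numbers. It is not the authors' implementation.

```python
from dataclasses import dataclass

FP16_BITS = 16  # assumed precision of the uncompressed KV cache


@dataclass(frozen=True)
class Expert:
    """One candidate (eviction-ratio, K-bits, V-bits) configuration (hypothetical)."""
    keep_ratio: float  # fraction of tokens kept after eviction
    k_bits: int
    v_bits: int

    def cost(self) -> float:
        """Memory relative to an uncompressed FP16 layer (1.0 = no compression)."""
        return self.keep_ratio * (self.k_bits + self.v_bits) / (2 * FP16_BITS)


def greedy_route(experts, loss, budget):
    """Pick one expert per layer so that the mean per-layer cost is <= budget.

    loss[l][i] is the (offline-calibrated, here made-up) predicted quality
    loss of applying experts[i] to layer l.
    """
    n_layers = len(loss)
    # Start every layer at its lowest-loss expert, ignoring the budget.
    choice = [min(range(len(experts)), key=lambda i: loss[l][i])
              for l in range(n_layers)]
    total = sum(experts[c].cost() for c in choice)
    target = budget * n_layers

    while total > target + 1e-12:
        best = None  # (added loss per unit of memory saved, layer, expert, saved)
        for l in range(n_layers):
            current = experts[choice[l]]
            for i, cand in enumerate(experts):
                saved = current.cost() - cand.cost()
                if saved <= 0:
                    continue  # only consider moves that free memory
                score = (loss[l][i] - loss[l][choice[l]]) / saved
                if best is None or score < best[0]:
                    best = (score, l, i, saved)
        if best is None:
            raise ValueError("budget is infeasible for this expert set")
        _, l, i, saved = best
        choice[l] = i
        total -= saved
    return [experts[c] for c in choice]


# Toy example: layer 0 is sensitive to compression, layer 1 is robust.
experts = [Expert(1.0, 16, 16), Expert(0.5, 4, 4), Expert(0.25, 2, 2)]
loss = [
    [0.0, 0.5, 2.0],    # layer 0
    [0.0, 0.01, 0.05],  # layer 1
]
routing = greedy_route(experts, loss, budget=0.1)
for layer, expert in enumerate(routing):
    print(layer, expert, f"cost={expert.cost():.5f}")
```

With these toy numbers the robust layer is pushed to the most aggressive configuration while the sensitive layer keeps a milder one, which is the kind of non-uniform routing the abstract describes. A uniform recipe meeting the same budget would have to apply the most aggressive configuration to both layers.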
