Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference
Beomseok Kang, Dongwon Jo, Jiwon Song, Donghwee Son, Jae-Joon Kim

TL;DR
This paper introduces RotateK, a rotation-based structured key channel pruning method for vision-language models that improves both accuracy and latency over prior key channel pruning under a fixed KV cache budget.
Contribution
RotateK employs an online PCA-based rotation to enable accurate, hardware-friendly key channel pruning, addressing structural trade-offs in prior methods.
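As a rough illustration of the idea, and not the authors' implementation, the sketch below rotates keys and queries with a PCA basis computed from the keys of one attention head, then keeps only the leading rotated channels. Everything here is an assumption made for the example: the function names `pca_rotation` and `prune_key_channels`, the use of an uncentered covariance, the synthetic low-rank data, and the choice of 16 kept channels. Because the rotation is orthogonal, attention scores are unchanged before truncation, and every token shares the same kept channels, which is what makes the pruning structured.

```python
import numpy as np

def pca_rotation(keys):
    """Return an orthogonal basis (columns) sorted by decreasing key energy.

    Assumption for this sketch: eigenvectors of the uncentered key covariance.
    """
    cov = keys.T @ keys / keys.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # reorder to descending
    return eigvecs[:, order]

def prune_key_channels(keys, queries, keep):
    """Rotate keys and queries into the PCA basis, keep the first `keep` channels."""
    rot = pca_rotation(keys)
    k_rot = keys @ rot
    q_rot = queries @ rot
    return k_rot[:, :keep], q_rot[:, :keep], rot

# Synthetic example: keys that lie close to an 8-dimensional subspace.
rng = np.random.default_rng(0)
num_tokens, head_dim, keep = 256, 64, 16
basis = rng.standard_normal((8, head_dim))
keys = rng.standard_normal((num_tokens, 8)) @ basis
keys += 0.01 * rng.standard_normal((num_tokens, head_dim))
queries = rng.standard_normal((4, head_dim))

k_small, q_small, rot = prune_key_channels(keys, queries, keep)

exact = queries @ keys.T          # full attention logits
approx = q_small @ k_small.T      # logits from the pruned key cache
rel_err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
print(k_small.shape, rel_err)
```

In this toy setting the pruned cache stores 16 of 64 key channels per token, and the logits stay close to the exact ones only because the synthetic keys are nearly low-rank. How well this holds for real VLM keys is what the paper evaluates.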
Findings
RotateK outperforms prior key channel pruning methods in both accuracy and latency.
Joint token-channel pruning surpasses token-only baselines at the same cache budget.
Experiments validate effectiveness on two VLM backbones.
Abstract
Vision-Language Models suffer severe KV cache pressure at inference, as a single image is often encoded into thousands of tokens. Most existing methods exploit token sparsity through token pruning, but permanently discarding visual content causes substantial degradation on fine-grained perception tasks. This motivates a complementary axis, feature sparsity: under a fixed KV cache budget, compressing the channel dimension preserves more visual tokens at the same memory cost. Prior Key channel pruning methods, however, face a structural trade-off: token-wise channel pruning is expressive but unstructured and slow, while the head-wise approach is hardware-friendly but less robust. We resolve this with RotateK, a rotation-based structured Key channel pruning framework. RotateK applies an online PCA-based rotation that aligns token-dependent channel importance into a shared low-dimensional…
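The fixed-budget argument in the abstract can be made concrete with a small calculation. The numbers below (a head dimension of 128, 2048 visual tokens, a 50% cache budget, and keys pruned to half their channels) are hypothetical values chosen for illustration, not figures from the paper, and the helper `tokens_within_budget` is likewise invented for this sketch. It counts cache elements per attention head, assuming each token stores one key vector and one value vector.

```python
def tokens_within_budget(budget_elems, key_dim, value_dim):
    """Number of tokens whose keys and values fit in `budget_elems` cache elements."""
    return budget_elems // (key_dim + value_dim)

head_dim = 128          # hypothetical per-head channel count
full_tokens = 2048      # hypothetical number of visual tokens

# Budget: half of the full per-head KV cache (keys + values for every token).
budget = (full_tokens * 2 * head_dim) // 2

# Token-only pruning: keys and values both keep all channels.
token_only = tokens_within_budget(budget, head_dim, head_dim)

# Joint pruning: keys keep half of their channels, values are untouched.
joint = tokens_within_budget(budget, head_dim // 2, head_dim)

print(token_only, joint)  # 1024 1365
```

Under these assumptions, halving the key channels lets the same budget hold about a third more tokens than token-only pruning.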
