OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

Zunhai Su; Rui Yang; Chao Zhang; Yaxiu Liu; Yifan Zhang; Wei Wu; Jing Xiong; Dayou Du; Xialie Zhuang; Yulei Qian; Yuchen Xie; Yik-Chung Wu; Hongxia Yang; Ngai Wong

arXiv:2605.19660 · cs.LG · May 20, 2026

TL;DR

OScaR is a lightweight quantization framework for the key-value (KV) cache of large language models. It shrinks the cache's memory footprint and speeds up decoding while staying near-lossless at extreme compression levels such as INT2.

Contribution

It introduces Canalized Rotation and Omni-Token Scaling to address Token Norm Imbalance, outperforming existing methods in KV cache quantization for X-LLMs.

Findings

01 OScaR achieves up to 3.0x speedup over baseline decoding.

02 Memory footprint is reduced by 5.3x with OScaR.

03 Near-lossless performance under INT2 quantization is demonstrated.

Abstract

The rapid advancement toward long-context reasoning and multi-modal intelligence has made the memory footprint of the Key-Value (KV) cache a dominant memory bottleneck for efficient deployment. While the established per-channel quantization effectively accommodates intrinsic channel-wise outliers in Key tensors, its efficacy diminishes under extreme compression. In this work, we revisit the inherent limitations of the per-channel quantization paradigm from both empirical and theoretical perspectives. Our analysis identifies Token Norm Imbalance (TNI) as the primary bottleneck to quantization fidelity. We demonstrate that TNI systematically amplifies errors when shared quantization parameters are required to span token groups exhibiting substantial norm disparities. Instead of relying on intricate quantization pipelines (e.g., TurboQuant), we propose OScaR (Omni-Scaled Canalized…
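The Token Norm Imbalance effect described in the abstract can be illustrated with a small NumPy sketch. This is not the OScaR algorithm (neither Canalized Rotation nor Omni-Token Scaling is reproduced here). It is a toy experiment under assumed settings: a synthetic Gaussian Key group, one token with a hypothetical 20x larger norm, 2-bit asymmetric per-channel quantization, and a generic per-token norm rescaling as a stand-in remedy. All function and variable names are illustrative.

```python
import numpy as np

def quantize_per_channel(x, bits=2):
    """Asymmetric uniform quantize/dequantize with one (scale, offset) pair
    per channel, shared by every token in the group. x: (tokens, channels)."""
    levels = 2 ** bits - 1
    lo = x.min(axis=0, keepdims=True)
    hi = x.max(axis=0, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / levels
    q = np.clip(np.round((x - lo) / scale), 0, levels)
    return q * scale + lo

def relative_error(x, x_hat):
    """Frobenius-norm reconstruction error relative to the original."""
    return np.linalg.norm(x - x_hat) / np.linalg.norm(x)

rng = np.random.default_rng(0)
tokens, channels = 32, 64  # assumed group size and head dimension

# Balanced group: all tokens have similar norms.
balanced = rng.standard_normal((tokens, channels))

# Imbalanced group: token 0 is scaled by a hypothetical factor of 20.
imbalanced = balanced.copy()
imbalanced[0] *= 20.0

err_balanced = relative_error(balanced, quantize_per_channel(balanced))

# Error on the ordinary tokens when they share parameters with the outlier.
deq = quantize_per_channel(imbalanced)
err_small_tokens = relative_error(imbalanced[1:], deq[1:])

# Generic remedy (an assumption, not the paper's method): divide each token
# by its norm before quantizing, then multiply the norm back afterwards.
norms = np.linalg.norm(imbalanced, axis=1, keepdims=True)
deq_scaled = quantize_per_channel(imbalanced / norms) * norms
err_small_tokens_scaled = relative_error(imbalanced[1:], deq_scaled[1:])

print("balanced group:", err_balanced)
print("ordinary tokens next to outlier:", err_small_tokens)
print("ordinary tokens after per-token rescaling:", err_small_tokens_scaled)
```

In this toy setting the high-norm token stretches every channel's quantization range, so the ordinary tokens collapse onto one or two levels and their error rises sharply. Rescaling each token first restores a comparable range across the group, at the cost of storing one extra scalar per token.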

Code & Models

Repositories

ZunhaiSu/OScaR-KV-Quant (GitHub)
