TL;DR
OScaR is a lightweight quantization framework for the key-value (KV) caches of large language models. It reduces memory footprint by 5.3x and speeds up decoding by up to 3.0x over the baseline, while staying near-lossless under INT2 quantization.
Contribution
OScaR introduces Canalized Rotation and Omni-Token Scaling to address Token Norm Imbalance (TNI), and it outperforms existing KV cache quantization methods for X-LLMs.
Findings
OScaR achieves up to 3.0x speedup over baseline decoding.
Memory footprint is reduced by 5.3x with OScaR.
Near-lossless performance under INT2 quantization is demonstrated.
Abstract
The rapid advancement toward long-context reasoning and multi-modal intelligence has made the memory footprint of the Key-Value (KV) cache a dominant memory bottleneck for efficient deployment. While the established per-channel quantization effectively accommodates intrinsic channel-wise outliers in Key tensors, its efficacy diminishes under extreme compression. In this work, we revisit the inherent limitations of the per-channel quantization paradigm from both empirical and theoretical perspectives. Our analysis identifies Token Norm Imbalance (TNI) as the primary bottleneck to quantization fidelity. We demonstrate that TNI systematically amplifies errors when shared quantization parameters are required to span token groups exhibiting substantial norm disparities. Instead of relying on intricate quantization pipelines (e.g., TurboQuant), we propose OScaR (Omni-Scaled Canalized…
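The Token Norm Imbalance effect described in the abstract can be illustrated with a small numerical sketch. The code below is not OScaR itself: it assumes a toy Key tensor of random values, a plain asymmetric INT2 per-channel quantizer, and a single high-norm token in the group. The final step, rescaling each token to unit norm before quantization, is only a generic stand-in for token-level scaling and should not be read as the paper's Omni-Token Scaling. All function names and sizes here are hypothetical.

```python
import numpy as np

def quantize_per_channel(x, bits=2):
    """Asymmetric uniform quantization with one (scale, offset) per channel,
    shared by every token in the group. x has shape (tokens, channels).
    Returns the dequantized tensor so the error can be measured directly."""
    levels = 2 ** bits - 1
    lo = x.min(axis=0, keepdims=True)
    hi = x.max(axis=0, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / levels
    q = np.clip(np.round((x - lo) / scale), 0, levels)
    return q * scale + lo

def relative_error(x, x_hat):
    """Mean over tokens of ||x - x_hat|| / ||x||."""
    num = np.linalg.norm(x - x_hat, axis=1)
    den = np.linalg.norm(x, axis=1)
    return float(np.mean(num / den))

rng = np.random.default_rng(0)
tokens, channels = 32, 64          # toy group size, chosen for illustration
base = rng.standard_normal((tokens, channels))

# Balanced case: all tokens have comparable norms.
balanced = base

# Imbalanced case: one token has a norm 20x larger than the rest (assumed ratio).
norms = np.ones(tokens)
norms[0] = 20.0
imbalanced = base * norms[:, None]

err_balanced = relative_error(balanced, quantize_per_channel(balanced))
err_imbalanced = relative_error(imbalanced, quantize_per_channel(imbalanced))

# Illustrative stand-in (NOT the paper's method): normalize each token to unit
# norm, quantize, then restore the stored per-token norm.
token_norm = np.linalg.norm(imbalanced, axis=1, keepdims=True)
rescaled = quantize_per_channel(imbalanced / token_norm) * token_norm
err_rescaled = relative_error(imbalanced, rescaled)

print(f"balanced:            {err_balanced:.3f}")
print(f"imbalanced:          {err_imbalanced:.3f}")
print(f"imbalanced+rescaled: {err_rescaled:.3f}")
```

In this sketch the high-norm token stretches every channel's quantization range, so the low-norm tokens that share those parameters collapse onto one or two levels and their relative error grows. The rescaling step restores the error to roughly the balanced level at the cost of storing one extra scalar per token.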
