TL;DR
OSCAR is a novel offline rotation method for 2-bit KV-cache quantization in large language models, aligning quantization with attention covariance to improve accuracy and efficiency in long-context inference.
Contribution
It introduces a covariance-aware rotation technique for INT2 quantization, enabling accurate, deployable, and efficient long-context LLM serving with theoretical support and practical implementation.
Findings
Reduces the accuracy gap to BF16 to under 4 points on large models.
Maintains near-BF16 accuracy on models up to 358B parameters.
Achieves 8x memory reduction and up to 7x throughput improvement.
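The 8x figure matches the raw bit-width ratio between BF16 (16 bits per value) and INT2 (2 bits per value). The sketch below illustrates that arithmetic by packing four 2-bit codes into each byte. The helper names `pack_int2` and `unpack_int2`, the bit layout, and the head dimension of 128 are assumptions for illustration, not OSCAR's actual storage format. A real system also stores scales or thresholds, so the effective ratio may be slightly lower.

```python
def pack_int2(codes):
    """Pack 2-bit codes (ints in 0..3) four per byte, first code in the low bits."""
    if len(codes) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        for j, c in enumerate(codes[i:i + 4]):
            if not 0 <= c <= 3:
                raise ValueError("code out of range for INT2")
            byte |= c << (2 * j)
        out.append(byte)
    return bytes(out)


def unpack_int2(packed):
    """Inverse of pack_int2: recover the list of 2-bit codes."""
    return [(b >> (2 * j)) & 0b11 for b in packed for j in range(4)]


head_dim = 128                      # hypothetical head dimension
bf16_bytes = head_dim * 2           # BF16 uses 2 bytes per value
int2_bytes = len(pack_int2([0] * head_dim))
ratio = bf16_bytes / int2_bytes     # raw payload ratio, ignoring metadata
```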
Abstract
INT2 KV-cache quantization is attractive for long-context LLM serving, but it remains difficult to make both accurate and deployable. Simple rotations such as Hadamard transforms reduce outliers, but still degrade at INT2 because they are not aligned with downstream attention. We propose OSCAR, an ultra-low-bit KV-cache quantization method that estimates attention-aware covariance structures offline and uses them to derive fixed rotations and clipping thresholds for quantization. In this way, it aligns KV quantization with the covariance structures that attention actually consumes. More importantly, we not only provide theoretical justification but also develop a fully deployable OSCAR system with a custom INT2 attention kernel that remains compatible with paged KV-cache serving and fused kernel pipelines, enabling seamless integration into modern LLM serving frameworks such as SGLang…
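The general pattern the abstract describes (calibrate offline, fix a rotation and clipping thresholds, then quantize at serving time) can be sketched in NumPy as follows. This is not OSCAR's actual derivation, which the abstract does not spell out. As a loudly labeled assumption, the rotation here is the eigenbasis of the query second-moment matrix, standing in for an "attention-aware" covariance, and the clipping thresholds are per-channel percentiles. The function names and the `clip_pct` parameter are hypothetical.

```python
import numpy as np


def calibrate(keys, queries, clip_pct=99.0):
    """Offline step: derive a fixed rotation and per-channel clip thresholds.

    Assumption: the eigenbasis of the query second-moment matrix is used as
    a stand-in for an attention-aware covariance structure.
    """
    cov = queries.T @ queries / len(queries)
    _, eigvecs = np.linalg.eigh(cov)          # columns are orthonormal
    rot = eigvecs                             # (d, d) orthogonal rotation
    clip = np.percentile(np.abs(keys @ rot), clip_pct, axis=0)
    return rot, np.maximum(clip, 1e-8)        # avoid zero-width ranges


def quantize_int2(x, rot, clip):
    """Rotate, clip to [-clip, clip], and map to 4 uniform levels (codes 0..3)."""
    z = np.clip(x @ rot, -clip, clip)
    step = 2 * clip / 3
    return np.rint((z + clip) / step).astype(np.uint8)


def dequantize_int2(codes, clip):
    """Map codes 0..3 back to values in the rotated space."""
    step = 2 * clip / 3
    return codes * step - clip


rng = np.random.default_rng(0)
d = 16
mix = rng.normal(size=(d, d))                 # synthetic correlated channels
keys = rng.normal(size=(512, d)) @ mix
queries = rng.normal(size=(512, d)) @ mix

rot, clip = calibrate(keys, queries)
codes = quantize_int2(keys, rot, clip)
k_hat = dequantize_int2(codes, clip)

# Scores are computed in the rotated space: (qR)·(kR) = q·k for orthogonal R,
# so only the quantization error of k_hat separates approx from exact.
exact = queries[:4] @ keys.T
approx = (queries[:4] @ rot) @ k_hat.T
```

Because the rotation is orthogonal, applying it to both queries and keys leaves the unquantized attention scores unchanged. The rotation therefore only changes how the quantization error is distributed across channels, which is the lever a covariance-aware choice of rotation is meant to exploit.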
