OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

Zhongzhu Zhou; Donglin Zhuang; Jisen Li; Ziyan Chen; Shuaiwen Leon Song; Ben Athiwaratkun; Xiaoxia Wu

arXiv:2605.17757·cs.LG·May 19, 2026

TL;DR

OSCAR is a novel offline rotation method for 2-bit KV-cache quantization in large language models, aligning quantization with attention covariance to improve accuracy and efficiency in long-context inference.

Contribution

It introduces a covariance-aware rotation technique for INT2 quantization, enabling accurate, deployable, and efficient long-context LLM serving with theoretical support and practical implementation.

Findings

01 Reduces the accuracy gap relative to BF16 to under 4 points on large models.

02 Maintains near-BF16 accuracy on models of up to 358B parameters.

03 Achieves 8x memory reduction and up to 7x throughput improvement.
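The 8x memory figure matches the ratio of storage widths between BF16 (16 bits per value) and INT2 (2 bits per value). The sketch below works through that arithmetic for a KV cache. The function name `kv_cache_bytes` and the model shape are hypothetical and chosen only for illustration, and the calculation ignores quantization metadata such as scales, so it should be read as a back-of-the-envelope check rather than the paper's own accounting.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_value):
    """Bytes needed to store keys and values for one sequence (no metadata)."""
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len  # 2 = keys + values
    return values * bits_per_value / 8

# Hypothetical model shape, not taken from the paper.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=131072)

bf16 = kv_cache_bytes(**shape, bits_per_value=16)
int2 = kv_cache_bytes(**shape, bits_per_value=2)
ratio = bf16 / int2

print(f"BF16: {bf16 / 2**30:.1f} GiB, INT2: {int2 / 2**30:.1f} GiB, ratio: {ratio:.0f}x")
# prints "BF16: 16.0 GiB, INT2: 2.0 GiB, ratio: 8x"
```

In practice, per-group scales or clipping thresholds add a small overhead on top of the 2-bit payload, so a deployed ratio can fall slightly below the idealized 8x shown here.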

Abstract

INT2 KV-cache quantization is attractive for long-context LLM serving, but it remains difficult to make both accurate and deployable. Simple rotations such as Hadamard transforms reduce outliers, but still degrade at INT2 because they are not aligned with downstream attention. We propose OSCAR, an ultra-low-bit KV-cache quantization method that estimates attention-aware covariance structures offline and uses them to derive fixed rotations and clipping thresholds for quantization. In this way, it aligns KV quantization with the covariance structures that attention actually consumes. More importantly, we not only provide theoretical justification but also develop a fully deployable OSCAR system with a custom INT2 attention kernel that remains compatible with paged KV-cache serving and fused kernel pipelines, enabling seamless integration into modern LLM serving frameworks such as SGLang…
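The general pattern the abstract describes (estimate a covariance offline, derive a fixed rotation and clipping thresholds from it, then quantize to 2 bits at serving time) can be sketched in NumPy. This is not the authors' algorithm: it swaps in a plain PCA-style eigendecomposition of the key covariance as a stand-in for OSCAR's attention-aware covariance, and a percentile rule as a stand-in for its clipping thresholds. All function names (`fit_rotation`, `fit_clip`, `quantize_int2`, `dequantize_int2`), the 99th-percentile choice, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def fit_rotation(calib):
    """Offline step: eigenvectors of the calibration covariance as a fixed rotation."""
    cov = np.cov(calib, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)  # symmetric input, so columns are orthonormal
    return eigvecs

def fit_clip(rotated, pct=99.0):
    """Offline step: per-channel clipping threshold (assumed percentile rule)."""
    return np.percentile(np.abs(rotated), pct, axis=0)

def quantize_int2(x, clip):
    """Map each value to one of 4 evenly spaced levels in [-clip, clip]."""
    step = 2 * clip / 3
    return np.clip(np.round((x + clip) / step), 0, 3).astype(np.uint8)

def dequantize_int2(codes, clip):
    """Recover approximate values from 2-bit codes."""
    return codes * (2 * clip / 3) - clip

# Synthetic, correlated "key" vectors standing in for real calibration data.
rng = np.random.default_rng(0)
d = 16
mix = rng.normal(size=(d, d))
calib = rng.normal(size=(2048, d)) @ mix

R = fit_rotation(calib)        # fixed after calibration
clip = fit_clip(calib @ R)     # fixed after calibration

keys = rng.normal(size=(64, d)) @ mix
codes = quantize_int2(keys @ R, clip)           # what the cache would store
keys_hat = dequantize_int2(codes, clip) @ R.T   # rotate back after dequantization

rel_err = np.linalg.norm(keys - keys_hat) / np.linalg.norm(keys)
print(f"relative reconstruction error: {rel_err:.3f}")
```

Because the rotation and thresholds are fixed after calibration, the serving-time work in this sketch is limited to one matrix multiply and an elementwise quantization. A real kernel would also pack four 2-bit codes into each byte, which is omitted here.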

Code & Models

Repositories

futuremls-lab/OSCAR
github

Models

Zhongzhu/OSCAR-RotationZoo
Hugging Face model
