Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost

Haojun Xia; Xiaoxia Wu; Jisen Li; Robert Wu; Junxiong Wang; Jue Wang; Chenxi Li; Aman Singhal; Alay Dilipbhai Shah; Alpay Ariyak; Donglin Zhuang; Zhongzhu Zhou; Ben Athiwaratkun; Zhen Zheng; Shuaiwen Leon Song

arXiv:2511.18643·cs.LG·November 25, 2025

Open Access

TL;DR

Kitty introduces a mixed-precision 2-bit key-value cache system with dynamic channel-wise precision boosting, significantly reducing memory usage while maintaining accuracy for large language model inference.

Contribution

The paper presents a novel algorithm-system co-design that enables near-2-bit memory efficiency with minimal accuracy loss through dynamic channel-wise precision boosting.

Findings

01. Reduces KV memory by nearly 8x across multiple tasks and models.

02. Enables up to 8x larger batches and 2.1x-4.1x higher throughput.

03. Maintains near-zero accuracy loss with dynamic mixed-precision quantization.
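
As a rough sanity check on the "nearly 8x" memory figure, the sketch below computes the average bits per cached element when a fraction of Key channels is boosted. It rests on assumptions that this page does not state: a 16-bit baseline, Values stored uniformly at 2 bits, boosted Key channels at 4 bits, and no accounting for scale or zero-point metadata. The function names are hypothetical.

```python
def effective_key_bits(boost_fraction, base_bits=2, boost_bits=4):
    """Average bits per Key element when `boost_fraction` of channels use `boost_bits`."""
    return base_bits + boost_fraction * (boost_bits - base_bits)


def kv_compression_ratio(boost_fraction, baseline_bits=16, value_bits=2):
    """Compression versus an assumed 16-bit cache.

    Assumptions (not from the source): Keys and Values have equal size,
    Values stay at 2 bits, and quantization metadata is ignored.
    """
    avg_bits = 0.5 * (effective_key_bits(boost_fraction) + value_bits)
    return baseline_bits / avg_bits


if __name__ == "__main__":
    for fraction in (0.0, 0.0625, 0.125, 0.25):
        print(fraction, round(kv_compression_ratio(fraction), 2))
```

Under these assumptions the ratio is exactly 8x with no boosted channels and falls slowly as the boosted fraction grows, which is consistent with a "nearly 8x" reduction when only a small fraction of channels is boosted.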

Abstract

The KV cache is a dominant memory bottleneck for LLM inference. While 4-bit KV quantization preserves accuracy, 2-bit quantization often degrades it, especially on long-context reasoning. We close this gap via an algorithm-system co-design for mixed-precision KV caching: Kitty. On the algorithm side, extensive experiments show that Dynamic Channel-wise Precision Boost, which ranks Key-cache channels by sensitivity and keeps only a small fraction at higher precision, maintains near-zero accuracy loss while approaching 2-bit memory usage. The main challenge is handling dynamic 4-bit channel boosts while keeping the page layout coalesced and the dequantization uniform, with no scattered reads or hard-coded masks. Kitty addresses these issues by decomposing each mixed-precision Key page into two tensors with unified 2-bit precision. Based on this, Kitty provides a page-centric KV layout,…
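
The idea in the abstract can be illustrated with a small NumPy sketch. This is not Kitty's implementation: the sensitivity metric (per-channel value range), the 12.5% boost fraction, the per-channel asymmetric quantizer, and the function names `pack_key_page` and `unpack_key_page` are all assumptions made for illustration. The sketch shows one plausible way a page with a few 4-bit channels could be stored as two tensors that each hold only 2-bit codes: a dense tensor covering every channel, and a small residual tensor holding the low 2 bits of the boosted channels.

```python
import numpy as np


def quantize_per_channel(x, bits):
    """Asymmetric uniform quantization with one scale and zero point per channel."""
    levels = (1 << bits) - 1
    lo = x.min(axis=0, keepdims=True)
    hi = x.max(axis=0, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / levels
    codes = np.clip(np.round((x - lo) / scale), 0, levels).astype(np.uint8)
    return codes, scale, lo


def pack_key_page(keys, boost_fraction=0.125):
    """Quantize a (tokens, channels) Key page with a few 4-bit channels.

    Assumption: channel sensitivity is approximated by the value range.
    The source does not say which metric the paper uses.
    """
    n_tokens, n_channels = keys.shape
    n_boost = max(1, int(round(boost_fraction * n_channels)))
    sensitivity = keys.max(axis=0) - keys.min(axis=0)
    is_boosted = np.zeros(n_channels, dtype=bool)
    is_boosted[np.argsort(sensitivity)[-n_boost:]] = True

    dense = np.empty((n_tokens, n_channels), dtype=np.uint8)
    scale = np.empty((1, n_channels), dtype=np.float32)
    zero = np.empty((1, n_channels), dtype=np.float32)

    # Regular channels: plain 2-bit codes.
    c2, s2, z2 = quantize_per_channel(keys[:, ~is_boosted], 2)
    dense[:, ~is_boosted], scale[:, ~is_boosted], zero[:, ~is_boosted] = c2, s2, z2

    # Boosted channels: 4-bit codes split into high and low 2-bit halves.
    c4, s4, z4 = quantize_per_channel(keys[:, is_boosted], 4)
    dense[:, is_boosted] = c4 >> 2   # high 2 bits live in the dense tensor
    residual = c4 & 0b11             # low 2 bits live in a small second tensor
    scale[:, is_boosted], zero[:, is_boosted] = s4, z4

    return {"dense": dense, "residual": residual, "is_boosted": is_boosted,
            "scale": scale, "zero": zero}


def unpack_key_page(page):
    """Dequantize: boosted channels recombine as 4 * high + low before scaling."""
    codes = page["dense"].astype(np.float32)
    boosted = page["is_boosted"]
    codes[:, boosted] = codes[:, boosted] * 4 + page["residual"]
    return codes * page["scale"] + page["zero"]
```

In this sketch both stored tensors contain only values in the range 0 to 3, so a real kernel could bit-pack them identically; the codes here are left as one `uint8` per element for readability, so the arrays themselves do not realize the memory saving.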

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Advanced Neural Network Applications