AnTKV: Anchor Token-Aware Sub-Bit Vector Quantization for KV Cache in Large Language Models
Zeyu Li, Chuanfu Xiao, Yang Wang, Xiang Liu, Zhenheng Tang, Baotong Lu, Mao Yang, Xinyu Chen, Xiaowen Chu

TL;DR
AnTKV introduces a token-aware vector quantization method that selectively preserves high-sensitivity tokens to significantly improve memory efficiency and accuracy in large language model KV caches under ultra-low-bit quantization.
Contribution
The paper proposes a novel dual-stage framework with anchor token-aware vector quantization, combining offline centroid learning and online token selection for improved compression and accuracy.
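The two stages named above can be pictured with a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names `learn_centroids` and `quantize_with_anchors` are hypothetical, plain k-means stands in for the offline centroid learning, each token is treated as a single vector, and the per-token `scores` are supplied by the caller as a placeholder because this summary does not give the Anchor Score formula.

```python
import numpy as np

def learn_centroids(calib, k, iters=10, seed=0):
    """Offline stage (assumed): plain k-means over calibration vectors."""
    rng = np.random.default_rng(seed)
    centroids = calib[rng.choice(len(calib), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each vector to its nearest centroid (squared Euclidean distance).
        dists = ((calib[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for j in range(k):
            members = calib[assign == j]
            if len(members):  # leave empty clusters unchanged
                centroids[j] = members.mean(0)
    return centroids

def quantize_with_anchors(tokens, centroids, scores, keep_frac=0.01):
    """Online stage (assumed): keep the top-scoring tokens in full precision."""
    n_keep = max(1, int(round(keep_frac * len(tokens))))
    # Indices of the highest-scoring tokens; `scores` is a caller-supplied
    # placeholder, not the paper's Anchor Score.
    anchor_idx = np.argsort(scores)[-n_keep:]
    dists = ((tokens[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    codes = dists.argmin(1)          # one codebook index per token
    recon = centroids[codes]         # dequantized approximation (a copy)
    recon[anchor_idx] = tokens[anchor_idx]  # anchors bypass quantization
    return codes, anchor_idx, recon
```

In this sketch only `codes`, the codebook, and the full-precision anchor tokens would need to be stored; `recon` is returned just to make the effect of anchor preservation easy to inspect.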
Findings
Achieves up to 3.5x higher decoding throughput on LLaMA3-8B.
Reduces perplexity to 6.32 at 1-bit quantization on Mistral-7B.
Matches or surpasses prior quantization methods at 4-bit.
Abstract
Quantization has emerged as an effective and lightweight way to reduce the memory footprint of the KV cache in Large Language Models. Nevertheless, minimizing the accuracy degradation caused by ultra-low-bit KV cache quantization remains a significant challenge. While scalar quantization is constrained by a 1-bit lower bound, vector quantization exploits intra-vector correlations and enables sub-bit regimes, making it better suited to ultra-low-bit quantization. To further mitigate quantization-induced degradation, we show that its effect on attention quality is highly uneven across tokens. To investigate this unevenness, we introduce the Anchor Score, which measures each token's sensitivity to quantization. Our analysis and experiments show that preserving a small subset (1%) of tokens with the highest Anchor Score significantly mitigates accuracy loss under aggressive quantization. We…
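The sub-bit claim can be made concrete with a short piece of arithmetic. The symbols below are illustrative assumptions rather than the paper's notation: $d$ is the length of each quantized vector, $K$ the codebook size, $p$ the fraction of tokens kept in full precision, and $b_{\text{full}}$ the full-precision width (FP16 assumed). Codebook storage and anchor-index overhead are ignored.

```latex
% Scalar quantization stores at least one bit per element.
% Vector quantization stores one index per d-element vector:
\[
  b_{\mathrm{VQ}} = \frac{\log_2 K}{d}
  \qquad\text{e.g. } d = 8,\ K = 16 \;\Rightarrow\; b_{\mathrm{VQ}} = \tfrac{4}{8} = 0.5 \text{ bits per element.}
\]
% Keeping a fraction p of tokens at b_full bits raises the average to
\[
  b_{\mathrm{eff}} = (1 - p)\, b_{\mathrm{VQ}} + p\, b_{\mathrm{full}}
  \qquad\text{e.g. } p = 0.01,\ b_{\mathrm{VQ}} = 1,\ b_{\mathrm{full}} = 16
  \;\Rightarrow\; b_{\mathrm{eff}} = 0.99 + 0.16 = 1.15 .
\]
```

Under these assumptions, preserving 1% of tokens adds roughly 0.16 bits per element on average, which is why a small anchor set is compatible with an ultra-low-bit budget.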
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Network Packet Processing and Optimization · Advanced Data Storage Technologies
