CalibQuant: 1-Bit KV Cache Quantization for Multimodal LLMs

Insu Han; Zeliang Zhang; Zhiyuan Wang; Yifan Zhu; Susan Liang; Jiani Liu; Haiting Lin; Mingjie Zhao; Chenliang Xu; Kun Wan; Wentian Zhao

arXiv:2502.14882·cs.CV·March 26, 2025

TL;DR

CalibQuant introduces a 1-bit quantization method for KV caches in multimodal LLMs, drastically reducing memory and computation overhead while maintaining performance, enabling faster inference on memory-limited devices.

Contribution

We propose CalibQuant, a novel 1-bit KV cache quantization technique with calibration that improves the efficiency of multimodal LLMs without architectural modifications.

Findings

01 Achieves a 10x throughput increase on InternVL models.

02 Significantly reduces memory usage of KV caches.

03 Maintains model performance and multimodal capabilities.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance across diverse applications. However, their computational overhead during deployment remains a critical bottleneck. While Key-Value (KV) caching effectively trades memory for computation to enhance inference efficiency, the growing memory footprint from extensive KV caches significantly reduces throughput and restricts prolonged deployment on memory-constrained GPU devices. To address this challenge, we propose CalibQuant, a simple yet highly effective visual quantization strategy that drastically reduces both memory and computational overhead. Specifically, CalibQuant introduces an extreme 1-bit quantization scheme, complemented by novel post-scaling and calibration techniques tailored to the intrinsic patterns of KV caches, thereby ensuring high efficiency without compromising model performance.…
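To make the idea of 1-bit KV cache quantization concrete, here is a minimal NumPy sketch of plain min/max 1-bit quantization applied per channel. It is an illustration under stated assumptions, not the authors' implementation: the function names `quantize_1bit` and `dequantize_1bit`, the per-channel grouping, and the tensor shape are all hypothetical, and the post-scaling and calibration steps described in the abstract are not reproduced here.

```python
import numpy as np

def quantize_1bit(x, axis=0):
    """Map each element to 0 or 1 using the min/max range along `axis`.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = np.where(hi > lo, hi - lo, 1.0)           # guard against a zero range
    codes = np.rint((x - lo) / scale).astype(np.uint8)  # each code is 0 or 1
    packed = np.packbits(codes, axis=axis)              # store 8 codes per byte
    return packed, lo, scale

def dequantize_1bit(packed, lo, scale, n, axis=0):
    """Unpack the bits and map code 0 to the minimum, code 1 to the maximum."""
    codes = np.unpackbits(packed, axis=axis, count=n)
    return codes.astype(np.float32) * scale + lo

# Toy stand-in for one attention head's key cache: 64 tokens, 8 channels.
rng = np.random.default_rng(0)
x = rng.standard_normal((64, 8)).astype(np.float32)

packed, lo, scale = quantize_1bit(x, axis=0)
x_hat = dequantize_1bit(packed, lo, scale, n=x.shape[0], axis=0)

fp16_bytes = x.size * 2                 # size of the same tensor stored in 16-bit floats
ratio = fp16_bytes / packed.nbytes      # ignores the small per-channel lo/scale overhead
max_err = np.abs(x - x_hat).max()
```

Each element is reconstructed as either its channel's minimum or maximum, so the per-element error is bounded by half the channel range. That coarse error is why a bare 1-bit scheme like this one typically needs additional techniques, such as the calibration the abstract mentions, to preserve model quality.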

Code & Models

Repositories

insuhan/calibquant (PyTorch, official)
