TL;DR
CalibQuant introduces a 1-bit quantization method for KV caches in multimodal LLMs, drastically reducing memory and computation overhead while maintaining performance, enabling faster inference on memory-limited devices.
Contribution
We propose CalibQuant, a novel 1-bit KV cache quantization technique with calibration that improves the efficiency of multimodal LLMs without architectural modifications.
Findings
Achieves 10x throughput increase on InternVL models.
Significantly reduces memory usage of KV caches.
Maintains model performance and multimodal capabilities.
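To put the memory claim in perspective, here is a rough back-of-the-envelope calculation. The model shape (32 layers, 8 KV heads, head dimension 128, 4096 cached tokens) is a hypothetical choice for illustration, not a configuration taken from the paper, and the estimate ignores the small overhead of any per-channel scale metadata that a quantized cache would also store.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bits_per_value):
    """Bytes needed to hold the keys and values of one sequence."""
    # Factor 2 accounts for storing both keys and values.
    num_values = 2 * num_layers * num_kv_heads * head_dim * num_tokens
    return num_values * bits_per_value / 8


# Hypothetical model shape, chosen only for illustration.
fp16_bytes = kv_cache_bytes(32, 8, 128, 4096, bits_per_value=16)
one_bit_bytes = kv_cache_bytes(32, 8, 128, 4096, bits_per_value=1)

print(f"16-bit cache: {fp16_bytes / 2**20:.0f} MiB")
print(f"1-bit cache:  {one_bit_bytes / 2**20:.0f} MiB")
print(f"reduction:    {fp16_bytes / one_bit_bytes:.0f}x")
```

Under these assumptions, going from 16 bits to 1 bit per cached value shrinks the payload by a factor of 16, which frees room for larger batches on the same GPU.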
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance across diverse applications. However, their computational overhead during deployment remains a critical bottleneck. While Key-Value (KV) caching effectively trades memory for computation to enhance inference efficiency, the growing memory footprint from extensive KV caches significantly reduces throughput and restricts prolonged deployment on memory-constrained GPU devices. To address this challenge, we propose CalibQuant, a simple yet highly effective visual quantization strategy that drastically reduces both memory and computational overhead. Specifically, CalibQuant introduces an extreme 1-bit quantization scheme, complemented by novel post-scaling and calibration techniques tailored to the intrinsic patterns of KV caches, thereby ensuring high efficiency without compromising model performance.…
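The abstract does not spell out the quantization details, so the following is only a minimal NumPy sketch of generic per-channel 1-bit min/max quantization, not CalibQuant itself. The function names, the per-channel granularity, and the midpoint threshold are all assumptions made for illustration. The last part shows one plausible reading of "post-scaling": folding the per-channel scale into the query so that attention scores can be computed from the raw bits without materializing the dequantized cache. The calibration step is not covered.

```python
import numpy as np


def quantize_1bit(x):
    """Per-channel 1-bit quantization of a (tokens, channels) array (assumed scheme)."""
    lo = x.min(axis=0)                  # per-channel minimum
    scale = x.max(axis=0) - lo          # per-channel range
    bits = (x > lo + scale / 2).astype(np.uint8)  # 1 if above the channel midpoint
    packed = np.packbits(bits, axis=0)  # 8 tokens per stored byte
    return packed, lo, scale


def dequantize_1bit(packed, lo, scale, num_tokens):
    """Map each bit back to its channel's minimum (0) or maximum (1)."""
    bits = np.unpackbits(packed, axis=0, count=num_tokens)
    return lo + bits * scale


rng = np.random.default_rng(0)
keys = rng.normal(size=(10, 4))         # toy cache: 10 tokens, 4 channels
query = rng.normal(size=4)

packed, lo, scale = quantize_1bit(keys)
keys_hat = dequantize_1bit(packed, lo, scale, num_tokens=keys.shape[0])

# Attention scores computed from the fully dequantized keys.
scores_dequant = keys_hat @ query

# Same scores with the scale applied to the query instead of the cache:
# (lo + bits * scale) @ q == bits @ (scale * q) + lo @ q
bits = np.unpackbits(packed, axis=0, count=keys.shape[0])
scores_post_scaled = bits @ (scale * query) + lo @ query
```

The two score vectors agree up to floating-point error, which is why deferring the scale is attractive: the per-token work touches only bits, and the scale is applied once per query.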
