WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization

Wei Tao; Xiaoyang Qu; Peiqiang Wang; Guokuan Li; Jiguang Wan; Kai Lu; Jianzong Wang

arXiv:2605.02262·cs.CV·May 5, 2026

TL;DR

WindowQuant is a window-adaptive mixed-precision quantization method for VLMs. It reduces inference latency and GPU memory usage by choosing the KV cache bit-width per window, based on the similarity between each visual token window and the text prompt.

Contribution

The paper proposes a window-level quantization search and a window-level KV cache computation scheme that improve efficiency while maintaining accuracy in VLM inference.

Findings

01. Outperforms existing methods on multiple datasets.

02. Reduces inference latency and GPU memory usage.

03. Maintains model accuracy with adaptive quantization.

Abstract

Recently, video language models (VLMs) have been applied in various fields. However, the visual token sequence of a VLM is very long, which can cause intolerable inference latency and GPU memory usage. Existing methods apply mixed-precision quantization to the key-value (KV) cache at token granularity, which makes the bit-width search time-consuming and the computation hardware-inefficient. This paper introduces WindowQuant, which applies window-adaptive mixed-precision quantization to the KV cache. WindowQuant consists of two modules: window-level quantization search and window-level KV cache computation. Window-level quantization search quickly determines the optimal bit-width configuration of the KV cache windows based on the similarity scores between the corresponding visual token windows and the text prompt, maintaining the model…
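To make the window-level search described in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation. The abstract only says that bit-widths are chosen per window from similarity scores with the text prompt, so the mean pooling, the cosine similarity, the threshold table, and the function names (`assign_window_bits`, `quantize_window`) are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def assign_window_bits(visual_tokens, prompt_vec, window_size,
                       thresholds=((0.5, 8), (0.2, 4)), default_bits=2):
    """Pick one bit-width per window of consecutive visual tokens.

    Assumption: a window more similar to the prompt gets more bits.
    The thresholds are hypothetical (min_score, bits) pairs, highest first.
    """
    bits = []
    for start in range(0, len(visual_tokens), window_size):
        window = visual_tokens[start:start + window_size]
        score = cosine(mean_vector(window), prompt_vec)
        for min_score, width in thresholds:
            if score >= min_score:
                bits.append(width)
                break
        else:
            # No threshold was met: fall back to the lowest precision.
            bits.append(default_bits)
    return bits


def quantize_window(values, bits):
    """Uniform min-max quantization of a flat list of floats.

    Returns the dequantized values so the rounding error is easy to inspect.
    """
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((v - lo) / scale) * scale for v in values]


# Three windows of two tokens each, with decreasing similarity to the prompt.
tokens = [[1.0, 0.0], [1.0, 0.0], [1.0, 2.0], [1.0, 2.0], [0.0, 1.0], [0.0, 1.0]]
window_bits = assign_window_bits(tokens, prompt_vec=[1.0, 0.0], window_size=2)
```

Because every token in a window shares one bit-width, the search makes one decision per window rather than one per token, which is the efficiency argument the abstract makes against token-granularity methods.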
