VLMQ: Token Saliency-Driven Post-Training Quantization for Vision-language Models

Yufei Xue; Yushi Huang; Jiawei Shao; Lunjie Zhu; Chi Zhang; Xuelong Li; Jun Zhang

arXiv:2508.03351 · cs.CV · March 9, 2026

TL;DR

VLMQ introduces a token saliency-driven post-training quantization method tailored for vision-language models, effectively addressing visual over-representation and modality gap issues to improve quantization performance, especially in low-bit settings.

Contribution

The paper proposes VLMQ, a PTQ framework designed specifically for VLMs. It prioritizes salient tokens using gradient-based importance scores, relying on lightweight backpropagation and an importance-aware optimization objective.
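
To make the idea of importance-aware optimization concrete, here is a minimal sketch of a saliency-weighted layer-wise reconstruction error, written under stated assumptions rather than taken from the paper. It assumes that each token t receives a scalar weight s_t derived from a gradient norm, and that these weights enter a Hessian of the form H = X^T diag(s) X, as in common Hessian-based PTQ methods. The function names (`token_saliency`, `weighted_hessian`, `weighted_error`, `quantize_rtn`), the normalization, and the round-to-nearest quantizer are all hypothetical choices for illustration; the paper's actual scoring rule and solver may differ.

```python
import numpy as np


def token_saliency(grads):
    """Hypothetical saliency: per-token gradient L2 norm, rescaled to mean 1."""
    norms = np.linalg.norm(grads, axis=1)
    return norms * len(norms) / norms.sum()


def weighted_hessian(x, saliency):
    """H = X^T diag(s) X for activations x of shape (tokens, in_features)."""
    return (x * saliency[:, None]).T @ x


def weighted_error(w, w_q, x, saliency):
    """Sum over tokens of s_t * ||(W - W_q) x_t||^2."""
    diff = x @ (w - w_q).T  # shape (tokens, out_features)
    return float((saliency * (diff ** 2).sum(axis=1)).sum())


def quantize_rtn(w, bits):
    """Symmetric per-row round-to-nearest quantizer (generic baseline, not VLMQ)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale


# Synthetic stand-ins for calibration data (assumed shapes, random values).
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 16))      # 64 tokens, 16 input features
grads = rng.normal(size=(64, 16))  # per-token gradients from a backward pass
w = rng.normal(size=(8, 16))       # layer weight, 8 output features

saliency = token_saliency(grads)
h = weighted_hessian(x, saliency)
w_q = quantize_rtn(w, bits=4)
err = weighted_error(w, w_q, x, saliency)
```

With uniform weights (all s_t equal to 1), `weighted_hessian` reduces to the usual X^T X, so the sketch differs from an unweighted objective only in how much each token contributes. The weighted error also equals the trace of (W - W_q) H (W - W_q)^T, which is why a Hessian-based solver could consume `h` directly.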

Findings

01 Achieves state-of-the-art quantization performance on 8 benchmarks.

02 Demonstrates a 16.45% improvement on MME-RealWorld under 2-bit quantization.

03 Effectively addresses visual over-representation and the modality gap in VLMs.

Abstract

Post-training quantization (PTQ) has emerged as an effective technique for compressing large models and accelerating inference without retraining. While PTQ has been extensively studied in large language models (LLMs), its application to vision-language models (VLMs) remains underexplored. In this work, we identify two intrinsic characteristics of VLM activations: 1) visual over-representation, where vision tokens are excessive and often redundant, and 2) modality gap, which refers to the clear distribution gap between text and vision tokens in the latent feature space. Together, these two factors significantly deteriorate quantization performance but have been overlooked by existing PTQ methods. To address these challenges, we propose VLMQ, a VLM-tailored PTQ framework that selectively prioritizes salient tokens while suppressing redundant ones during quantization. In particular, we…
