Towards Joint Quantization and Token Pruning of Vision-Language Models

Xinqing Li, Xin He, Xindong Zhang, Ming-Ming Cheng, Lei Zhang, and Yun Liu

arXiv:2604.17320 · cs.CV · April 21, 2026

TL;DR

This paper introduces QUOTA, a unified framework for low-bit quantization and deterministic token pruning in vision-language models, improving robustness and efficiency during inference.

Contribution

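The paper proposes a collaborative quantization-and-pruning framework that unifies low-bit inference and deterministic visual-token pruning in a single deployable pipeline. Its core component, QUOTA, converts low-bit calibration signals into a layer-wise token allocation schedule and materializes it as a pruning recipe.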

Findings

01

Reports 95.65% retention while keeping only 30% of the visual tokens.

02

Outperforms stage-wise baselines in robustness under low-bit regimes.

03

Improves inference efficiency and robustness on standard benchmarks.

Abstract

Deploying Vision-Language Models (VLMs) under aggressive low-bit inference remains challenging because inference cost is dominated by the long visual-token prefix during prefill and the growing KV cache during autoregressive decoding. Token pruning and low-bit quantization are complementary for reducing these costs, yet naive stage-wise combinations are often brittle due to a mismatch between quantization calibration and pruning execution. We present a collaborative quantization-and-pruning framework that unifies low-bit inference and deterministic visual-token pruning in a single deployable pipeline. The framework introduces the Quantization Unified Offline Token Allocator (QUOTA), which converts low-bit calibration signals into a layer-wise token allocation schedule and materializes it as a pruning recipe. Token importance is…
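To make the idea of a layer-wise token allocation schedule concrete, here is a minimal Python sketch. It is not QUOTA's actual algorithm, which the truncated abstract does not specify. Everything in it is an assumption made for illustration: the function names `allocate_schedule` and `prune_tokens`, the per-layer `sensitivity` scores standing in for calibration signals, the proportional allocation rule, and the tie-breaking rule used to keep pruning deterministic.

```python
from typing import List


def allocate_schedule(sensitivity: List[float], num_tokens: int,
                      avg_keep_ratio: float) -> List[int]:
    """Map hypothetical per-layer sensitivity scores to per-layer keep counts.

    Assumption: a layer with a higher score is given more visual tokens.
    The result is forced to be non-increasing, because a token pruned at
    one layer is not available to later layers. Clamping means the global
    budget is only met approximately.
    """
    total = sum(sensitivity)
    # Total number of token slots to distribute across all layers.
    budget = avg_keep_ratio * num_tokens * len(sensitivity)
    schedule = []
    prev = num_tokens
    for score in sensitivity:
        raw = budget * score / total
        keep = max(1, min(prev, round(raw)))  # keep at least one token
        schedule.append(keep)
        prev = keep
    return schedule


def prune_tokens(importance: List[float], keep: int) -> List[int]:
    """Deterministic top-k selection over token importance scores.

    Ties are broken by the lower token index, so the same input always
    yields the same kept set. Kept indices are returned in original order.
    """
    ranked = sorted(range(len(importance)), key=lambda i: (-importance[i], i))
    return sorted(ranked[:keep])


if __name__ == "__main__":
    # Hypothetical scores for a 4-layer model with 100 visual tokens.
    recipe = allocate_schedule([4.0, 3.0, 2.0, 1.0], num_tokens=100,
                               avg_keep_ratio=0.3)
    print("keep counts per layer:", recipe)
    print("kept token indices:", prune_tokens([0.1, 0.9, 0.5, 0.9], keep=2))
```

Because the schedule is computed once, offline, it can be stored as a fixed recipe and applied unchanged at inference time, which is one plausible reading of "deterministic" pruning in the abstract.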
