QAPruner: Quantization-Aware Vision Token Pruning for Multimodal Large Language Models
Xinhao Wang, Zhongyu Xia, Zhiwei Lin, Zhe Li, Yongtao Wang

TL;DR
This paper introduces a quantization-aware vision token pruning method for multimodal large language models, improving low-bit inference accuracy by jointly optimizing token relevance and quantization stability.
Contribution
It proposes a hybrid sensitivity metric that combines quantization error and outlier intensity to effectively co-optimize token pruning and quantization.
Findings
Outperforms naive token pruning and quantization baselines.
Achieves 2.24% accuracy gain at 12.5% token retention.
Surpasses dense quantization without pruning in low-bit regimes.
Abstract
Multimodal Large Language Models (MLLMs) have shown strong reasoning ability, but their high computational and memory costs hinder deployment in resource-constrained settings. While Post-Training Quantization (PTQ) and vision token pruning are standard compression techniques, they are usually treated as independent optimizations. In this paper, we show that these two techniques are strongly coupled: naively applying semantic-based token pruning to PTQ-optimized MLLMs can discard activation outliers that are important for numerical stability and thus worsen quantization errors in low-bit regimes (e.g., W4A4). To address this issue, we propose a quantization-aware vision token pruning framework. Our method introduces a lightweight hybrid sensitivity metric that combines simulated group-wise quantization error with outlier intensity. By combining this metric with standard semantic…
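To make the idea of a hybrid sensitivity metric concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names, the symmetric 4-bit quantizer, the group size, the peak-to-mean definition of outlier intensity, the min-max normalization, and the weights `alpha` and `beta` are all assumptions made for illustration. The sketch also assumes that a higher sensitivity argues for keeping a token; the abstract does not state the sign or the weighting.

```python
import numpy as np

def groupwise_quant_error(x, n_bits=4, group_size=32):
    """Per-token MSE after simulated symmetric group-wise quantization (assumed scheme)."""
    n_tokens, dim = x.shape
    assert dim % group_size == 0, "hidden size must be divisible by group_size"
    groups = x.reshape(n_tokens, dim // group_size, group_size)
    qmax = 2 ** (n_bits - 1) - 1
    # One scale per group, set by that group's largest magnitude.
    scale = np.abs(groups).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    dequant = np.clip(np.round(groups / scale), -qmax - 1, qmax) * scale
    return ((groups - dequant) ** 2).mean(axis=(1, 2))

def outlier_intensity(x):
    """Peak-to-mean magnitude ratio per token (one possible definition, assumed)."""
    mag = np.abs(x)
    return mag.max(axis=1) / (mag.mean(axis=1) + 1e-8)

def minmax(v):
    """Rescale a vector to [0, 1]; a constant vector maps to all zeros."""
    span = v.max() - v.min()
    return (v - v.min()) / span if span > 0 else np.zeros_like(v)

def select_tokens(x, semantic, keep_ratio=0.125, alpha=0.5, beta=0.5):
    """Return sorted indices of vision tokens to keep.

    x:        (n_tokens, dim) vision-token activations
    semantic: (n_tokens,) relevance scores, e.g. attention received from text
    alpha:    hypothetical weight on semantic relevance vs. sensitivity
    beta:     hypothetical weight on quantization error vs. outlier intensity
    """
    sensitivity = (beta * minmax(groupwise_quant_error(x))
                   + (1 - beta) * minmax(outlier_intensity(x)))
    score = alpha * minmax(semantic) + (1 - alpha) * minmax(sensitivity)
    k = max(1, int(round(keep_ratio * x.shape[0])))
    keep = np.argsort(-score, kind="stable")[:k]
    return np.sort(keep)
```

For example, calling `select_tokens(x, semantic)` on 64 tokens with the default `keep_ratio=0.125` returns 8 indices. A token carrying a large activation outlier scores highly on both sensitivity terms in this sketch, so it is retained even when its semantic score is unremarkable.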
