QAPruner: Quantization-Aware Vision Token Pruning for Multimodal Large Language Models

Xinhao Wang; Zhonyu Xia; Zhiwei Lin; Zhe Li; Yongtao Wang

arXiv:2604.02816 · cs.CV · April 6, 2026

TL;DR

This paper introduces a quantization-aware vision token pruning method for multimodal large language models, improving low-bit inference accuracy by jointly optimizing token relevance and quantization stability.

Contribution

It proposes a hybrid sensitivity metric that combines quantization error and outlier intensity to effectively co-optimize token pruning and quantization.

Findings

01. Outperforms naive token pruning and quantization baselines.

02. Achieves 2.24% accuracy gain at 12.5% token retention.

03. Surpasses dense quantization without pruning in low-bit regimes.

Abstract

Multimodal Large Language Models (MLLMs) have shown strong reasoning ability, but their high computational and memory costs hinder deployment in resource-constrained settings. While Post-Training Quantization (PTQ) and vision token pruning are standard compression techniques, they are usually treated as independent optimizations. In this paper, we show that these two techniques are strongly coupled: naively applying semantic-based token pruning to PTQ-optimized MLLMs can discard activation outliers that are important for numerical stability and thus worsen quantization errors in low-bit regimes (e.g., W4A4). To address this issue, we propose a quantization-aware vision token pruning framework. Our method introduces a lightweight hybrid sensitivity metric that combines simulated group-wise quantization error with outlier intensity. By combining this metric with standard semantic…
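The idea in the abstract can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation. The function names, the symmetric 4-bit fake quantizer, the peak-to-mean definition of outlier intensity, the min-max normalization, and the mixing weights `alpha` and `beta` are all assumptions made for illustration. In particular, the sketch assumes that a higher hybrid sensitivity raises a token's chance of being kept; the paper may combine or sign the terms differently.

```python
import random


def quantize_group(values, bits=4):
    """Symmetric uniform fake-quantization of one group (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / qmax
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]


def group_quant_error(token, group_size=4, bits=4):
    """Mean squared error of a token after simulated group-wise quantization."""
    err = 0.0
    for start in range(0, len(token), group_size):
        group = token[start:start + group_size]
        dequantized = quantize_group(group, bits)
        err += sum((a - b) ** 2 for a, b in zip(group, dequantized))
    return err / len(token)


def outlier_intensity(token):
    """Peak magnitude divided by mean magnitude (hypothetical definition)."""
    mags = [abs(v) for v in token]
    mean = sum(mags) / len(mags)
    return max(mags) / mean if mean > 0 else 0.0


def min_max_normalize(xs):
    """Rescale a list to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0] * len(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def select_tokens(tokens, semantic_scores, keep_ratio=0.125, alpha=0.5, beta=0.5):
    """Return sorted indices of tokens kept under a hybrid score.

    alpha mixes quantization error with outlier intensity; beta mixes the
    resulting sensitivity with the semantic score. Both are assumed knobs.
    """
    q_err = min_max_normalize([group_quant_error(t) for t in tokens])
    outl = min_max_normalize([outlier_intensity(t) for t in tokens])
    sem = min_max_normalize(semantic_scores)
    sensitivity = [alpha * q + (1 - alpha) * o for q, o in zip(q_err, outl)]
    score = [(1 - beta) * s + beta * h for s, h in zip(sem, sensitivity)]
    k = max(1, round(keep_ratio * len(tokens)))
    ranked = sorted(range(len(tokens)), key=lambda i: score[i], reverse=True)
    return sorted(ranked[:k])


# Toy usage: 16 random "vision tokens" of width 8, keeping 12.5% (2 tokens).
random.seed(0)
tokens = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]
semantic_scores = [random.random() for _ in range(16)]
kept = select_tokens(tokens, semantic_scores)
print(kept)
```

Setting `beta=0.0` in this sketch reduces it to purely semantic pruning, which is the baseline the abstract describes as able to discard activation outliers.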
