QMoP: Query Guided Mixture-of-Projector for Efficient Visual Token Compression

Zhongyang Li; Yaqian Li; Faming Fang; Rinyoichi Takezoe; Zi-Hao Bo; Cheng Qian; Mo Guang; Guixu Zhang; Kaiwen Long

arXiv:2603.21232 · cs.CV · March 24, 2026

TL;DR

QMoP is an adaptive, query-guided framework for visual token compression in multimodal large language models. A router conditioned on the query dynamically coordinates three compression branches (pooling, resampler, and pruning), reducing memory and computation costs while maintaining performance.

Contribution

The paper presents QMoP, a flexible framework that combines a Query Guided Router (QGR) with mixture-of-experts fusion for adaptive visual token compression in multimodal large language models.

Findings

01. QMoP achieves better compression-performance trade-offs than baselines.

02. The framework significantly reduces memory and computation costs.

03. VTCBench effectively evaluates information loss from compression.

Abstract

Multimodal large language models suffer from severe computational and memory bottlenecks, as the number of visual tokens far exceeds that of textual tokens. While recent methods employ projector modules to align and compress visual tokens into text-aligned features, they typically depend on fixed heuristics that limit adaptability across diverse scenarios. In this paper, we first propose Query Guided Mixture-of-Projector (QMoP), a novel and flexible framework that adaptively compresses visual tokens via three collaborative branches: (1) a pooling-based branch for coarse-grained global semantics, (2) a resampler branch for extracting high-level semantic representations, and (3) a pruning-based branch for fine-grained token selection to preserve critical visual detail. To adaptively coordinate these branches, we introduce the Query Guided Router (QGR), which dynamically selects and…
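The three branches and the router described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`avg_pool`, `resample`, `prune`, `route`, `compress`), the linear gate, the dot-product scoring, and the weighted-sum fusion are all assumptions made for illustration, and the random weights stand in for parameters that would be learned.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def avg_pool(tokens, k):
    """Pooling branch (assumed form): average n tokens in k contiguous groups."""
    n, dim = len(tokens), len(tokens[0])
    out = []
    for g in range(k):
        group = tokens[g * n // k:(g + 1) * n // k]
        out.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return out

def resample(tokens, latents):
    """Resampler branch (assumed form): each latent query cross-attends to all tokens."""
    dim = len(tokens[0])
    out = []
    for q in latents:
        w = softmax([dot(q, t) / math.sqrt(dim) for t in tokens])
        out.append([sum(wi * t[d] for wi, t in zip(w, tokens)) for d in range(dim)])
    return out

def prune(tokens, query, k):
    """Pruning branch (assumed form): keep the k tokens most similar to the query."""
    scores = [dot(query, t) for t in tokens]
    top = sorted(range(len(tokens)), key=lambda i: -scores[i])[:k]
    return [tokens[i] for i in sorted(top)]  # preserve original token order

def route(query, gate_weights):
    """Hypothetical query-guided gate: one softmax weight per branch."""
    return softmax([dot(w, query) for w in gate_weights])

def compress(tokens, query, latents, gate_weights, k):
    """Fuse the three k-token branch outputs with the gate weights (assumed fusion)."""
    branches = [avg_pool(tokens, k), resample(tokens, latents), prune(tokens, query, k)]
    gates = route(query, gate_weights)
    dim = len(tokens[0])
    fused = [[sum(g * b[i][d] for g, b in zip(gates, branches)) for d in range(dim)]
             for i in range(k)]
    return fused, gates

# Toy usage: compress 16 visual tokens of width 8 down to 4 tokens.
rng = random.Random(0)
dim, n, k = 8, 16, 4
tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
query = [rng.gauss(0, 1) for _ in range(dim)]
latents = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(k)]
gate_weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
fused, gates = compress(tokens, query, latents, gate_weights, k)
print(len(fused), len(fused[0]))  # 4 8
```

In this sketch every branch emits the same number of tokens so that the outputs can be summed directly; the paper may coordinate branches differently, and the text above is truncated before it specifies how.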

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Natural Language Processing Techniques