Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

Yuheng Shi; Xiaohuan Pei; Linfeng Wen; Minjing Dong; Chang Xu

arXiv:2604.06912·cs.CV·April 9, 2026

TL;DR

Q-Zoom is a query-aware adaptive perception framework for multimodal large language models (MLLMs). It works coarse-to-fine, applying high-resolution visual processing only when and where a query needs it, so fine-grained accuracy improves without the inference cost of scaling resolution globally.

Contribution

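It introduces a query-aware, coarse-to-fine perception approach built on two modules: a Dynamic Gating Network that bypasses high-resolution processing when coarse global features suffice, and a Self-Distilled Region Proposal Network (SD-RPN) that localizes the task-relevant Region-of-Interest (RoI) for queries that need fine-grained visual detail.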

Findings

01 Q-Zoom accelerates inference by up to 4.39 times while maintaining accuracy.

02 It surpasses baseline performance by 1.1% to 8.1% on key benchmarks.

03 The approach generalizes across multiple multimodal models.

Abstract

MLLMs require high-resolution visual inputs for fine-grained tasks like document understanding and dense scene perception. However, current global resolution scaling paradigms indiscriminately flood the quadratic self-attention mechanism with visually redundant tokens, severely bottlenecking inference throughput while ignoring spatial sparsity and query intent. To overcome this, we propose Q-Zoom, a query-aware adaptive high-resolution perception framework that operates in an efficient coarse-to-fine manner. First, a lightweight Dynamic Gating Network safely bypasses high-resolution processing when coarse global features suffice. Second, for queries demanding fine-grained perception, a Self-Distilled Region Proposal Network (SD-RPN) precisely localizes the task-relevant Region-of-Interest (RoI) directly from intermediate feature spaces. To optimize these modules efficiently, the gating…
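
The two-stage control flow described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: every name here (`q_zoom_style_inference`, `toy_gate`, `toy_propose_roi`, `toy_answer`, the `threshold` parameter, and the `(top, left, bottom, right)` box convention) is a hypothetical stand-in. In the paper the gate and the region proposer are learned networks operating on intermediate features; here they are replaced by trivial rules so the routing logic can be run on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Image = List[List[float]]
Box = Tuple[int, int, int, int]  # (top, left, bottom, right); hypothetical convention


@dataclass
class Result:
    used_high_res: bool
    roi: Optional[Box]
    answer: str


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by box (bottom/right exclusive)."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def q_zoom_style_inference(
    image: Image,
    query: str,
    gate: Callable[[Image, str], float],
    propose_roi: Callable[[Image, str], Box],
    answer: Callable[[Image, Optional[Image], str], str],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> Result:
    # Stage 1: the gate scores how much the query needs fine detail.
    if gate(image, query) < threshold:
        # Coarse global view suffices: skip high-resolution processing.
        return Result(False, None, answer(image, None, query))
    # Stage 2: localize the query-relevant region and process only that crop.
    roi = propose_roi(image, query)
    return Result(True, roi, answer(image, crop(image, roi), query))


# Toy stand-ins for the learned modules (assumptions for illustration).
def toy_gate(image: Image, query: str) -> float:
    return 0.9 if "read" in query.lower() else 0.1


def toy_propose_roi(image: Image, query: str) -> Box:
    h, w = len(image), len(image[0])
    # Pick the brightest pixel and take a clipped 3x3 window around it.
    r, c = max(
        ((r, c) for r in range(h) for c in range(w)),
        key=lambda rc: image[rc[0]][rc[1]],
    )
    return (max(r - 1, 0), max(c - 1, 0), min(r + 2, h), min(c + 2, w))


def toy_answer(coarse: Image, fine: Optional[Image], query: str) -> str:
    if fine is None:
        return "coarse-only"
    return f"coarse+fine({len(fine)}x{len(fine[0])})"
```

With these stand-ins, a query such as "What color is the sky?" takes the coarse-only path, while "Read the small label" triggers the region proposal and a crop. The point of the design is that the extra cost of the second stage is paid only for queries the gate flags, and only over the proposed region rather than the whole image.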
