TL;DR
Q-Zoom is a query-aware adaptive perception framework for multimodal large language models (MLLMs). It works coarse to fine, spending high-resolution processing only on the queries and image regions that need it, which raises inference throughput while preserving accuracy.
Contribution
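It introduces a query-aware, coarse-to-fine perception pipeline built on two modules: a lightweight Dynamic Gating Network that skips high-resolution processing when coarse global features suffice, and a Self-Distilled Region Proposal Network (SD-RPN) that localizes the task-relevant region of interest from intermediate features.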
Findings
Q-Zoom accelerates inference by up to 4.39 times while maintaining accuracy.
It surpasses baseline performance by 1.1% to 8.1% on key benchmarks.
The approach generalizes across multiple multimodal models.
Abstract
MLLMs require high-resolution visual inputs for fine-grained tasks like document understanding and dense scene perception. However, current global resolution scaling paradigms indiscriminately flood the quadratic self-attention mechanism with visually redundant tokens, severely bottlenecking inference throughput while ignoring spatial sparsity and query intent. To overcome this, we propose Q-Zoom, a query-aware adaptive high-resolution perception framework that operates in an efficient coarse-to-fine manner. First, a lightweight Dynamic Gating Network safely bypasses high-resolution processing when coarse global features suffice. Second, for queries demanding fine-grained perception, a Self-Distilled Region Proposal Network (SD-RPN) precisely localizes the task-relevant Region-of-Interest (RoI) directly from intermediate feature spaces. To optimize these modules efficiently, the gating…
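The two-stage control flow described in the abstract can be sketched roughly as follows. This is a toy illustration and not the paper's implementation: the gate score and the per-cell relevance map are passed in as plain numbers standing in for the outputs of the Dynamic Gating Network and the SD-RPN, and the names `propose_roi`, `coarse_to_fine`, `Perception`, both thresholds, and the fallback to the coarse pass when no cell qualifies are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# (top, left, bottom, right) in coarse-grid cells, bottom/right exclusive.
Box = Tuple[int, int, int, int]


@dataclass
class Perception:
    used_high_res: bool   # True if a high-resolution RoI pass would run
    roi: Optional[Box]    # region to re-encode at high resolution, if any


def propose_roi(relevance: List[List[float]], threshold: float) -> Optional[Box]:
    """Return the tightest box around cells whose relevance >= threshold.

    Hypothetical stand-in for a region proposal step; a learned module
    such as the SD-RPN would produce the relevance map itself.
    """
    hits = [
        (r, c)
        for r, row in enumerate(relevance)
        for c, value in enumerate(row)
        if value >= threshold
    ]
    if not hits:
        return None
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return (min(rows), min(cols), max(rows) + 1, max(cols) + 1)


def coarse_to_fine(
    gate_score: float,
    relevance: List[List[float]],
    gate_threshold: float = 0.5,   # assumed value, not from the paper
    roi_threshold: float = 0.5,    # assumed value, not from the paper
) -> Perception:
    # Stage 1: the gate decides whether coarse global features suffice.
    if gate_score < gate_threshold:
        return Perception(used_high_res=False, roi=None)

    # Stage 2: localize the query-relevant region on the coarse grid.
    roi = propose_roi(relevance, roi_threshold)

    # Assumption: with no qualifying cell, fall back to the coarse pass.
    return Perception(used_high_res=roi is not None, roi=roi)


if __name__ == "__main__":
    relevance_map = [
        [0.1, 0.2, 0.1],
        [0.1, 0.9, 0.8],
        [0.0, 0.1, 0.2],
    ]
    print(coarse_to_fine(0.2, relevance_map))  # gate skips high resolution
    print(coarse_to_fine(0.9, relevance_map))  # gate fires, RoI is proposed
```

The point of the sketch is the ordering: the cheap gate runs first, so queries answerable from coarse features never pay for region proposal or high-resolution re-encoding, and only the proposed region (not the whole image) would be processed at high resolution otherwise.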
