Zoomer: Adaptive Image Focus Optimization for Black-box MLLM

Jiaxu Qian; Chendong Wang; Yifan Yang; Chaoyun Zhang; Huiqiang Jiang; Xufang Luo; Yu Kang; Qingwei Lin; Anlan Zhang; Shiqi Jiang; Ting Cao; Tianjun Mao; Suman Banerjee; Guyue Liu; Saravan Rajmohan; Dongmei Zhang; Yuqing Yang; Qi Zhang; Lili Qiu

arXiv:2505.00742 · cs.CV · January 1, 2026

Open Access

TL;DR

Zoomer is a visual prompting framework for black-box multimodal large language models (MLLMs) that adaptively focuses on important image regions, preserves fine detail, and allocates image tokens within a budget, improving accuracy by up to 27% while cutting image token usage by up to 67%.

Contribution

The paper introduces a visual prompting method for black-box MLLMs that combines region awareness, spatial preservation, and adaptive token budgeting. It targets two limitations of current approaches: the absence of region-adaptive attention and inflexible token budgets that force uniform downsampling.

Findings

01. Boosts accuracy by up to 27% across benchmarks.
02. Reduces image token usage by up to 67%.
03. Demonstrates effectiveness on nine benchmarks and three commercial MLLMs.

Abstract

Multimodal large language models (MLLMs) such as GPT-4o, Gemini Pro, and Claude 3.5 have enabled unified reasoning over text and visual inputs, yet they often hallucinate in real-world scenarios, especially when small objects or fine spatial context are involved. We pinpoint two core causes of this failure: the absence of region-adaptive attention and inflexible token budgets that force uniform downsampling, leading to critical information loss. To overcome these limitations, we introduce Zoomer, a visual prompting framework that delivers token-efficient, detail-preserving image representations for black-box MLLMs. Zoomer integrates (1) a prompt-aware emphasis module to highlight semantically relevant regions, (2) a spatial-preserving orchestration schema to maintain object relationships, and (3) a budget-aware strategy to adaptively allocate tokens between global context and local…
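
The idea of splitting a fixed token budget between a global view and a local region can be illustrated with a small sketch. This is not the paper's algorithm: the tile-based cost model (`TILE`, `TOKENS_PER_TILE`), the helper names, and the fixed `global_percent` split are all assumptions made for illustration, and a real MLLM API may charge for image tokens differently.

```python
import math

# Hypothetical cost model (assumption): the image is charged per tile.
TILE = 512             # assumed tile side length in pixels
TOKENS_PER_TILE = 100  # assumed tokens charged per tile


def token_cost(width, height):
    """Token cost of a width x height image under the assumed tile model."""
    return math.ceil(width / TILE) * math.ceil(height / TILE) * TOKENS_PER_TILE


def fit_to_budget(width, height, budget):
    """Shrink (width, height) uniformly until its cost fits the budget.

    Returns the resized (w, h), or None if even a single tile is too expensive.
    """
    if budget < TOKENS_PER_TILE:
        return None
    scale = 1.0
    while True:
        w = max(1, round(width * scale))
        h = max(1, round(height * scale))
        if token_cost(w, h) <= budget:
            return (w, h)
        scale *= 0.9  # coarse geometric search; enough for a sketch


def allocate(image_size, roi_box, budget, global_percent=30):
    """Split a token budget between a downscaled global view and an ROI crop.

    image_size: (width, height) of the full image.
    roi_box: (left, top, right, bottom) of the region of interest (hypothetical
             input; in practice it would come from some prompt-aware detector).
    global_percent: share of the budget reserved for the global view (assumed).
    """
    width, height = image_size
    left, top, right, bottom = roi_box

    # Reserve a share for global context, downscaling the full image to fit.
    global_budget = budget * global_percent // 100
    global_size = fit_to_budget(width, height, global_budget)
    global_cost = token_cost(*global_size) if global_size else 0

    # Whatever the global view did not use goes to the local crop.
    local_budget = budget - global_cost
    local_size = fit_to_budget(right - left, bottom - top, local_budget)
    local_cost = token_cost(*local_size) if local_size else 0

    return {
        "global_size": global_size,
        "global_cost": global_cost,
        "local_size": local_size,
        "local_cost": local_cost,
        "total_cost": global_cost + local_cost,
    }


plan = allocate((2048, 2048), (512, 512, 1024, 1024), budget=500)
```

Under these assumptions, the full image is downscaled to fit its reserved share, while the small region of interest keeps its native resolution because the remaining budget covers it. Uniform downsampling of the whole image would instead shrink the region along with everything else.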

Taxonomy

Topics: Image Processing Techniques and Applications · Medical Imaging Techniques and Applications · Advanced Radiotherapy Techniques