ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration
Haozhan Shen, Kangjia Zhao, Tiancheng Zhao, Ruochen Xu, Zilun Zhang, Mingwei Zhu, Jianwei Yin

TL;DR
ZoomEye introduces a tree-based image exploration method that enhances multimodal large language models by enabling human-like zooming into image regions, significantly improving visual reasoning performance across benchmarks.
Contribution
It presents a training-free, model-agnostic algorithm for vision-level reasoning that allows dynamic zooming into image regions, improving multimodal model accuracy.
Findings
Significant performance gains on high-resolution benchmarks.
Small models outperform large models with ZoomEye.
Effective across multiple multimodal large language models.
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in vision-language understanding. Recently, with the integration of test-time scaling techniques, these models have also shown strong potential in visual reasoning. However, most existing reasoning approaches remain text-level in nature: MLLMs are prompted to explore various combinations of textual tokens via their underlying language model, while the visual input remains fixed throughout the reasoning process. This paradigm limits the model's ability to fully exploit rich visual information, particularly when dealing with images containing numerous fine-grained elements. In such cases, vision-level reasoning becomes crucial - where models dynamically zoom into specific regions of the image to gather detailed visual cues necessary for accurate decision-making. In this paper, we propose Zoom Eye, a…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Taxonomy
TopicsMultimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
Methods+ ( 1 ) ⟷ 888 ⟷ ( 829 ) ⟷ 0881||How do I resolve a dispute on Expedia? · Focus · Balanced Selection
