Zoom-Refine: Boosting High-Resolution Multimodal Understanding via Localized Zoom and Self-Refinement

Xuan Yu; Dayan Guan; Yanfeng Gu

arXiv:2506.01663·cs.CV·August 12, 2025

TL;DR

Zoom-Refine is a training-free approach that improves high-resolution multimodal understanding by zooming into the task-relevant image region and then refining the model's preliminary response with the fine-grained details from that crop.

Contribution

The paper introduces a training-free method that uses localized zoom and self-refinement to improve high-resolution image understanding in Multimodal Large Language Models (MLLMs).

Findings

01

Significant performance improvements on high-resolution multimodal benchmarks.

02

Effective localization of task-relevant image regions without extra training.

03

Enhanced response accuracy through iterative self-refinement.

Abstract

Multimodal Large Language Models (MLLMs) often struggle to interpret high-resolution images accurately, where fine-grained details are crucial for complex visual understanding. We introduce Zoom-Refine, a novel training-free method that enhances MLLM capabilities to address this issue. Zoom-Refine operates through a synergistic process of Localized Zoom and Self-Refinement. In the Localized Zoom step, Zoom-Refine leverages the MLLM to provide a preliminary response to an input query and identifies the most task-relevant image region by predicting its bounding box coordinates. During the Self-Refinement step, Zoom-Refine integrates fine-grained details from the high-resolution crop (identified by Localized Zoom) with its initial reasoning to re-evaluate and refine its preliminary response. Our method harnesses the MLLM's inherent…
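
The two steps described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: `locate` and `refine` are hypothetical callables standing in for two MLLM calls, the image is a toy list of pixel rows, and the normalized `(x0, y0, x1, y1)` box format is assumed, since the abstract does not specify the coordinate convention or the prompts.

```python
import math
from typing import Callable, List, Tuple

# Assumed box format: (x0, y0, x1, y1), each normalized to [0, 1].
BBox = Tuple[float, float, float, float]
# Toy stand-in for a high-resolution image: a list of pixel rows.
Image = List[List[int]]


def crop(image: Image, bbox: BBox) -> Image:
    """Cut the predicted region out of the full-resolution image."""
    height, width = len(image), len(image[0])
    # Clamp to [0, 1] and order the corners so a sloppy prediction
    # still yields a valid, non-empty box.
    x0, y0, x1, y1 = (min(max(v, 0.0), 1.0) for v in bbox)
    x0, x1 = sorted((x0, x1))
    y0, y1 = sorted((y0, y1))
    left = min(int(x0 * width), width - 1)
    top = min(int(y0 * height), height - 1)
    right = min(width, max(left + 1, math.ceil(x1 * width)))
    bottom = min(height, max(top + 1, math.ceil(y1 * height)))
    return [row[left:right] for row in image[top:bottom]]


def zoom_refine(
    image: Image,
    question: str,
    locate: Callable[[Image, str], Tuple[str, BBox]],
    refine: Callable[[Image, Image, str, str], str],
) -> str:
    """Run one Localized Zoom pass followed by one Self-Refinement pass."""
    # Localized Zoom: the model gives a preliminary answer and predicts
    # the bounding box of the most task-relevant region (hypothetical call).
    preliminary, bbox = locate(image, question)
    region = crop(image, bbox)
    # Self-Refinement: the model re-evaluates its preliminary answer
    # with the high-resolution crop in view (hypothetical call).
    return refine(image, region, question, preliminary)
```

In practice `locate` and `refine` would wrap prompts to the same MLLM, and `crop` would operate on a real image object; the toy types here only keep the sketch self-contained.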

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Image Processing and 3D Reconstruction