Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception

Lai Wei; Liangbo He; Jun Lan; Lingzhong Dong; Yutong Cai; Siyuan Li; Huijia Zhu; Weiqiang Wang; Linghe Kong; Yue Wang; Zhuosheng Zhang; Weiran Huang

arXiv:2602.11858 · cs.CV · February 17, 2026

TL;DR

This paper introduces Region-to-Image Distillation, a training method that lets multimodal large language models perform fine-grained perception in a single forward pass, eliminating the need for iterative zooming at inference time.

Contribution

The method turns zooming from an inference-time tool into a training-time primitive, improving fine-grained perception without adding inference latency.

Findings

01. Achieves state-of-the-art results on fine-grained perception benchmarks.

02. Develops ZoomBench, a comprehensive VQA dataset for evaluation.

03. Demonstrates improved general multimodal cognition in experiments.

Abstract

Multimodal Large Language Models (MLLMs) excel at broad visual understanding but still struggle with fine-grained perception, where decisive evidence is small and easily overwhelmed by global context. Recent "Thinking-with-Images" methods alleviate this by iteratively zooming in and out of regions of interest during inference, but incur high latency due to repeated tool calls and visual re-encoding. To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM. In particular, we first zoom in to micro-cropped regions to let strong teacher models generate high-quality VQA data, and then distill this region-grounded supervision back to the full image. After training on such data, the smaller student model improves…
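
The data-construction idea in the abstract (a teacher answers questions about a zoomed-in crop, and the resulting question-answer pair is attached to the full image for student training) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the names `VQASample`, `crop`, `build_distillation_set`, and `stub_teacher` are hypothetical, the image is a plain grid of integers standing in for real pixels, and the teacher is a toy function rather than an actual MLLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumption: a box is (left, top, right, bottom) in pixel coordinates.
Box = Tuple[int, int, int, int]
# Assumption: a row-major grid of ints stands in for a real image.
Image = List[List[int]]
# Hypothetical teacher interface: sees only a crop, returns (question, answer) pairs.
Teacher = Callable[[Image], List[Tuple[str, str]]]


@dataclass
class VQASample:
    image: Image      # the FULL image the student is trained on
    question: str
    answer: str
    source_box: Box   # region the teacher saw; kept here only as metadata


def crop(image: Image, box: Box) -> Image:
    """Return the sub-grid covered by box (right and bottom are exclusive)."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def build_distillation_set(image: Image, boxes: List[Box], teacher: Teacher) -> List[VQASample]:
    """Zoom in for the teacher, then pair its supervision with the full image."""
    samples = []
    for box in boxes:
        region = crop(image, box)                 # zoom happens at data-construction time
        for question, answer in teacher(region):  # teacher only ever sees the crop
            samples.append(VQASample(image, question, answer, box))
    return samples


def stub_teacher(region: Image) -> List[Tuple[str, str]]:
    """Toy stand-in for a strong teacher model (illustration only)."""
    brightest = max(max(row) for row in region)
    return [("What is the largest pixel value in the small detail?", str(brightest))]
```

In this sketch the student never receives the crop: each training sample holds the original image, so any zooming is paid for once during data construction rather than at inference time. How questions are phrased so that they remain answerable from the full image is not covered by the abstract and is left out here.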

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis