Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai, Siyuan Li, Huijia Zhu, Weiqiang Wang, Linghe Kong, Yue Wang, Zhuosheng Zhang, Weiran Huang

TL;DR
This paper introduces Region-to-Image Distillation, a training method that lets multimodal large language models perform fine-grained perception in a single inference pass, eliminating the need for iterative zooming at inference time.
Contribution
It transforms zooming from an inference-time process into a training-time primitive, improving fine-grained perception without increased inference latency.
Findings
Achieves state-of-the-art results on fine-grained perception benchmarks.
Develops ZoomBench, a comprehensive VQA dataset for evaluation.
Demonstrates improved general multimodal cognition in experiments.
Abstract
Multimodal Large Language Models (MLLMs) excel at broad visual understanding but still struggle with fine-grained perception, where decisive evidence is small and easily overwhelmed by global context. Recent "Thinking-with-Images" methods alleviate this by iteratively zooming in and out of regions of interest during inference, but incur high latency due to repeated tool calls and visual re-encoding. To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM. In particular, we first zoom in to micro-cropped regions to let strong teacher models generate high-quality VQA data, and then distill this region-grounded supervision back to the full image. After training on such data, the smaller student model improves…
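The data flow described in the abstract (crop a small region, let a teacher write a question-answer pair from the crop, then attach that pair to the full image) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the toy image type, `crop`, `build_distillation_set`, `VQAExample`, and `toy_teacher` are all hypothetical names, and the real teacher would be a strong MLLM rather than a pixel rule.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumption: a toy grayscale image stored as rows of pixel values.
Image = List[List[int]]
# Assumption: boxes are (left, top, right, bottom), right/bottom exclusive.
Box = Tuple[int, int, int, int]


@dataclass
class VQAExample:
    image: Image      # the FULL image the student is trained on
    question: str     # question written by the teacher from the crop
    answer: str       # answer written by the teacher from the crop
    source_box: Box   # region the supervision was grounded in


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by box."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def build_distillation_set(
    image: Image,
    boxes: List[Box],
    teacher: Callable[[Image], Tuple[str, str]],
) -> List[VQAExample]:
    """Teacher sees only the zoomed-in crop; each example keeps the full image."""
    examples = []
    for box in boxes:
        question, answer = teacher(crop(image, box))
        examples.append(VQAExample(image, question, answer, box))
    return examples


def toy_teacher(region: Image) -> Tuple[str, str]:
    """Hypothetical stand-in for a teacher model: reports the brightest pixel."""
    brightest = max(max(row) for row in region)
    return ("What is the brightest pixel value in the marked region?", str(brightest))


if __name__ == "__main__":
    image = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
    for ex in build_distillation_set(image, [(0, 0, 2, 2), (2, 2, 4, 4)], toy_teacher):
        print(ex.source_box, ex.question, ex.answer)
```

The point of the sketch is the asymmetry: the teacher answers from the crop, where the evidence is large, while the stored training example pairs that answer with the uncropped image, so a student trained on it needs no zooming at inference time.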
Code & Models
- 🤗 inclusionAI/ZwZ-8B · model · 1.5k dl · ♡ 44
- 🤗 inclusionAI/ZwZ-4B · model · 1.2k dl · ♡ 31
- 🤗 inclusionAI/ZwZ-7B · model · 107 dl · ♡ 11
- 🤗 jayr23/ZwZ-8B-heretic · model · 10 dl · ♡ 3
- 🤗 swaylenhayes/ZwZ-4B-VL-MLX-4bit · model · 3 dl
- 🤗 swaylenhayes/ZwZ-8B-VL-MLX-4bit · model · 5 dl
- 🤗 swaylenhayes/ZwZ-4B-VL-MLX-8bit · model · 4 dl
- 🤗 swaylenhayes/ZwZ-8B-VL-MLX-8bit · model · 7 dl
- 🤗 Luxinos/sb_v4_dataset · model
- 🤗 megabytes/ZwZ-8B-heretic · model · 56 dl
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
