Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation
Qianhao Yuan, Jie Lou, Xing Yu, Hongyu Lin, Le Sun, Xianpei Han, Yaojie Lu

TL;DR
Vision-OPD introduces a self-distillation framework that enhances multimodal large language models' ability to focus on fine details in images, improving fine-grained visual understanding without external tools.
Contribution
The paper proposes a regional-to-global self-distillation method in which an MLLM internalizes the benefit of zooming in on evidence-centered crops, so that it perceives the relevant evidence directly from the full image.
Findings
Achieves superior performance on fine-grained visual benchmarks.
Enables models to internalize zooming benefits without external supervision.
Outperforms larger models and existing methods in experiments.
Abstract
Multimodal Large Language Models (MLLMs) still struggle with fine-grained visual understanding, where answers often depend on small but decisive evidence in the full image. We observe a regional-to-global perception gap: the same MLLM answers fine-grained questions more accurately when conditioned on evidence-centered crops than on the corresponding full images, suggesting that many failures stem from difficulty focusing on relevant evidence rather than insufficient local recognition ability. Motivated by this observation, we propose Vision-OPD (Vision On-Policy Distillation), a regional-to-global self-distillation framework that transfers the model's own privileged regional perception to its full-image policy. Vision-OPD instantiates two conditional policies from the same MLLM: a crop-conditioned teacher and a full-image-conditioned student. The student generates on-policy rollouts,…
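The abstract is cut off before it states the training objective, so the following is only a minimal sketch of how a teacher-student signal of this kind could be computed, not the paper's method. It assumes a per-token reverse KL divergence between the student and teacher distributions, evaluated on tokens the student itself sampled, which is a common choice in on-policy distillation. The function names and toy logits are hypothetical. In a real setup both sets of logits would come from the same MLLM, with the teacher conditioned on the evidence-centered crop and the student on the full image.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits: List[float], teacher_logits: List[float]) -> float:
    """KL(student || teacher) for a single token position.

    ASSUMPTION: reverse KL is used here for illustration only; the source
    text does not confirm which divergence Vision-OPD uses.
    """
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))


def distillation_loss(
    student_steps: List[List[float]], teacher_steps: List[List[float]]
) -> float:
    """Mean per-token reverse KL over one student-sampled rollout.

    student_steps[i]: logits of the full-image-conditioned student at token i.
    teacher_steps[i]: logits of the crop-conditioned teacher at the same token,
    scored on the same student-generated prefix (hypothetical inputs).
    """
    if len(student_steps) != len(teacher_steps):
        raise ValueError("student and teacher must score the same rollout")
    per_token = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)


# Toy rollout of two tokens over a three-word vocabulary (made-up numbers).
student_steps = [[2.0, 0.5, 0.1], [0.3, 0.3, 1.5]]
teacher_steps = [[0.2, 2.5, 0.1], [0.3, 0.3, 1.5]]

loss = distillation_loss(student_steps, teacher_steps)
print(f"mean per-token reverse KL: {loss:.4f}")
```

The loss is zero when the student already matches the teacher at every position and grows as the full-image policy diverges from the crop-conditioned one. In training, gradients would typically flow only through the student's logits.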
