CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception

Miguel Carvalho; Helder Dias; Bruno Martins

arXiv:2511.19820 · cs.CV · April 15, 2026

TL;DR

CropVLM is a reinforcement-learning-based method that lets VLMs dynamically zoom into relevant image regions, improving fine-grained vision-language perception without human-labeled bounding boxes as supervision.

Contribution

It introduces a low-cost method that is trained once and then improves existing VLMs on detailed image understanding tasks without fine-tuning those models.

Findings

01. Significant improvements on tasks that require fine-grained, high-resolution image understanding.

02. Effective on benchmarks that are out-of-domain for the target VLM, without modifying the original VLM.

03. Does not require human-labeled bounding boxes or expensive synthetic evaluations.

Abstract

Vision-Language Models (VLMs) often struggle with tasks that require fine-grained image understanding, such as scene-text recognition or document analysis, due to perception limitations and visual fragmentation. To address these challenges, we introduce CropVLM as an external low-cost method for boosting performance, enabling VLMs to dynamically "zoom in" on relevant image regions, enhancing their ability to capture fine details. CropVLM is trained using reinforcement learning, without using human-labeled bounding boxes as a supervision signal, and without expensive synthetic evaluations. The model is trained once and can be paired with both open-source and proprietary VLMs to improve their performance. Our approach delivers significant improvements on tasks that require high-resolution image understanding, notably for benchmarks that are out-of-domain for the target VLM, without…
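The "zoom in" step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a region proposer (not shown) has already returned a pixel bounding box, and that the target VLM is then given the full image together with the cropped view. The names `clamp_box`, `crop_region`, and `build_vlm_views`, the box convention, and the nested-list image representation are all hypothetical choices made to keep the example dependency-free.

```python
from typing import List, Tuple

# Hypothetical convention: (left, top, right, bottom) in pixels,
# with right and bottom exclusive.
Box = Tuple[int, int, int, int]
Image = List[List[int]]  # rows of pixel values; stands in for a real image


def clamp_box(box: Box, width: int, height: int) -> Box:
    """Clip a proposed box to the image bounds, keeping at least one pixel."""
    left, top, right, bottom = box
    left = max(0, min(left, width - 1))
    top = max(0, min(top, height - 1))
    right = max(left + 1, min(right, width))
    bottom = max(top + 1, min(bottom, height))
    return left, top, right, bottom


def crop_region(image: Image, box: Box) -> Image:
    """Return the sub-image covered by the (clamped) box."""
    height, width = len(image), len(image[0])
    left, top, right, bottom = clamp_box(box, width, height)
    return [list(row[left:right]) for row in image[top:bottom]]


def build_vlm_views(image: Image, box: Box) -> List[Image]:
    """Assumption: the target VLM receives the full image plus the zoomed crop."""
    return [image, crop_region(image, box)]


if __name__ == "__main__":
    # A 4x6 toy image whose pixel value encodes its (row, column) position.
    toy = [[r * 10 + c for c in range(6)] for r in range(4)]
    full, zoomed = build_vlm_views(toy, (2, 1, 5, 3))
    print(zoomed)
```

In a real pipeline the crop would typically be resized to the VLM's input resolution before being passed on, which is where the gain in fine detail would come from. That step is omitted here because it depends on the image library and the target model.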
