VLM-R$^3$: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought

Chaoya Jiang; Yongrui Heng; Wei Ye; Han Yang; Haiyang Xu; Ming Yan; Ji Zhang; Fei Huang; Shikun Zhang

arXiv:2505.16192 · cs.CV · June 2, 2025

TL;DR

VLM-R$^3$ enhances multimodal reasoning by dynamically recognizing, focusing on, and reasoning about visual regions, significantly improving performance on complex visual-textual tasks through a novel training paradigm and curated data.

Contribution

Introduces VLM-R$^3$, a framework with region recognition and reasoning capabilities, and a new training method, R-GRPO, advancing multimodal chain-of-thought reasoning.

Findings

01. Achieves new state-of-the-art results on MathVista and ScienceQA.

02. Significantly improves spatial reasoning and fine-grained visual cue extraction.

03. Improves zero-shot and few-shot performance.

Abstract

Recently, reasoning-based MLLMs have achieved a degree of success in generating long-form textual reasoning chains. However, they still struggle with complex tasks that necessitate dynamic and iterative focusing on and revisiting of visual regions to achieve precise grounding of textual reasoning in visual evidence. We introduce VLM-R$^3$ (Visual Language Model with Region Recognition and Reasoning), a framework that equips an MLLM with the ability to (i) decide when additional visual evidence is needed, (ii) determine where to ground within the image, and (iii) seamlessly weave the relevant sub-image content back into an interleaved chain-of-thought. The core of our method is Region-Conditioned Reinforcement Policy Optimization (R-GRPO), a training paradigm that rewards the model for selecting…
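The three abilities listed in the abstract (when to look, where to look, and how to fold the crop back into the reasoning) can be pictured with a toy loop. The following is a minimal Python sketch, not the authors' implementation: `crop`, `RegionStep`, `interleaved_trace`, and `toy_policy` are hypothetical names, the image is a plain nested list of pixel values, and the policy is a hard-coded stand-in for the MLLM. R-GRPO training is not shown.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple, Union

Image = List[List[int]]           # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]   # (x1, y1, x2, y2), with x2 and y2 exclusive


@dataclass
class RegionStep:
    """A visual step in the trace: the chosen box and its cropped pixels."""
    box: Box
    crop: Image


Step = Union[str, RegionStep]     # a trace mixes text thoughts and region crops
Policy = Callable[[Image, List[Step]], Tuple[str, Optional[Box]]]


def crop(image: Image, box: Box) -> Image:
    """Clamp the box to the image bounds and return the sub-image."""
    h, w = len(image), len(image[0])
    x1, y1, x2, y2 = box
    x1, x2 = max(0, x1), min(w, x2)
    y1, y2 = max(0, y1), min(h, y2)
    if x1 >= x2 or y1 >= y2:
        raise ValueError(f"empty region: {box}")
    return [row[x1:x2] for row in image[y1:y2]]


def interleaved_trace(image: Image, policy: Policy, max_steps: int = 4) -> List[Step]:
    """Alternate text thoughts with region crops until the policy stops asking."""
    trace: List[Step] = []
    for _ in range(max_steps):
        thought, box = policy(image, trace)
        trace.append(thought)
        if box is None:           # "when": no further visual evidence requested
            break
        # "where" is the box; the crop is woven back into the trace
        trace.append(RegionStep(box, crop(image, box)))
    return trace


def toy_policy(image: Image, trace: List[Step]) -> Tuple[str, Optional[Box]]:
    """Hypothetical stand-in for the model: zoom in once, then answer."""
    if not any(isinstance(step, RegionStep) for step in trace):
        return "Need a closer look at the top-right corner.", (2, 0, 4, 2)
    return "The zoomed region is enough to answer.", None


image = [[r * 4 + c for c in range(4)] for r in range(4)]
trace = interleaved_trace(image, toy_policy)
```

Here `trace` holds a text thought, one `RegionStep` carrying the cropped top-right corner, and a closing thought. In a real system the policy would be the model itself, emitting the box as part of its generated text.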

Taxonomy

Topics: Time Series Analysis and Forecasting · Robotics and Automated Systems · Cognitive Computing and Networks