RegionReasoner: Region-Grounded Multi-Round Visual Reasoning
Wenfang Sun, Hao Chen, Yingjun Du, Yefeng Zheng, Cees G. M. Snoek

TL;DR
This paper introduces RegionReasoner, a reinforcement learning framework for multi-round visual reasoning that emphasizes explicit grounding in visual regions, and presents a new benchmark for evaluating iterative reasoning in detection and segmentation tasks.
Contribution
The paper proposes a novel reinforcement learning approach, RegionReasoner, that enforces explicit grounding and semantic coherence in multi-round visual reasoning, along with a new benchmark for systematic evaluation.
Findings
RegionReasoner-7B improves reasoning accuracy
Enhanced spatial grounding precision
Better global-local semantic consistency
Abstract
Large vision-language models have achieved remarkable progress in visual reasoning, yet most existing systems rely on single-step or text-only reasoning, limiting their ability to iteratively refine understanding across multiple visual contexts. To address this limitation, we introduce a new multi-round visual reasoning benchmark with training and test sets spanning both detection and segmentation tasks, enabling systematic evaluation under iterative reasoning scenarios. We further propose RegionReasoner, a reinforcement learning framework that enforces grounded reasoning by requiring each reasoning trace to explicitly cite the corresponding reference bounding boxes, while maintaining semantic coherence via a global-local consistency reward. This reward extracts key objects and nouns from both global scene captions and region-level captions, aligning them with the reasoning trace to…
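The two reward ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The box citation format `[x1, y1, x2, y2]`, the stopword-based keyword extraction (a stand-in for real object and noun extraction), and the function names `citation_reward` and `consistency_reward` are all assumptions made for illustration.

```python
import re

# Assumption: the trace cites boxes as "[x1, y1, x2, y2]" with integer pixels.
BOX_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")

# Tiny illustrative stopword list; a real system would likely use a POS tagger
# to keep only objects and nouns.
STOPWORDS = {"a", "an", "the", "is", "are", "on", "in", "of", "at", "to", "and", "with", "next"}


def cited_boxes(trace):
    """Return the set of boxes (4-tuples of ints) cited in a reasoning trace."""
    return {tuple(int(v) for v in m.groups()) for m in BOX_PATTERN.finditer(trace)}


def citation_reward(trace, reference_boxes):
    """Fraction of reference boxes that the trace cites verbatim (hypothetical reward)."""
    if not reference_boxes:
        return 1.0  # nothing to cite, so nothing is missing
    cited = cited_boxes(trace)
    hits = sum(1 for box in reference_boxes if tuple(box) in cited)
    return hits / len(reference_boxes)


def keywords(text):
    """Lowercased alphabetic tokens longer than two letters, minus stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS and len(t) > 2}


def consistency_reward(trace, global_caption, region_captions):
    """Fraction of caption keywords (global plus regional) that reappear in the trace."""
    target = keywords(global_caption)
    for caption in region_captions:
        target |= keywords(caption)
    if not target:
        return 0.0
    return len(target & keywords(trace)) / len(target)


trace = "The man at [10, 20, 110, 220] holds a pizza slice near the table."
print(citation_reward(trace, [(10, 20, 110, 220), (200, 40, 260, 90)]))  # 0.5
print(consistency_reward(trace, "A man eating pizza at a table", ["a pizza slice"]))  # 0.8
```

In this toy example the trace cites one of the two reference boxes and mentions four of the five caption keywords ("eating" is missing), which gives the two printed values.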
Peer Reviews
Decision: ICLR 2026 Poster
1. This paper introduces RegionDial-Bench, a new benchmark designed to study multi-round conversational reasoning in VLMs, with a specific focus on the groundedness of evidential objects in each dialogue turn. 2. The authors propose a GRPO-based training framework that rewards models for accurate object grounding, global-local semantic consistency, and answer correctness. Experimental results demonstrate the effectiveness of their resulting model, RegionReasoner, on the proposed benchmark.
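As a rough illustration of how a GRPO-style objective could use the three reward signals mentioned in this review, the sketch below sums them into one scalar per sampled response and normalizes within the sampled group. The equal weights, the use of the population standard deviation, and the function names are assumptions; the paper's actual weighting and normalization are not stated here.

```python
from statistics import mean, pstdev


def total_reward(grounding, consistency, answer, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three reward components (weights are an assumption)."""
    w_g, w_c, w_a = weights
    return w_g * grounding + w_c * consistency + w_a * answer


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three hypothetical responses to the same multi-round query.
rewards = [
    total_reward(1.0, 0.8, 1.0),  # well grounded, coherent, correct
    total_reward(0.5, 0.4, 0.0),  # partially grounded, wrong answer
    total_reward(0.0, 0.2, 0.0),  # ungrounded, wrong answer
]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive positive advantages and the rest receive negative ones, so no separate value network is needed to provide a baseline.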
1. The creation process of RegionDial-Bench, which constitutes a major contribution of this work, is not sufficiently detailed in the paper. The authors should include a clear description of the benchmark construction methodology, such as data sources, annotation protocols, and key statistics (e.g., number of dialogues, turns, and object categories), to facilitate wider adoption. 2. The evaluation of RegionReasoner is currently limited to the proposed RegionDial-Bench. To better assess the
- This paper presents an interesting reasoning task that integrates QA and referring expressions in a multi-turn manner. - The authors propose new reward functions for this task, including a global-local consistency reward that aligns keywords from the global and local context.
- The way they expand the referring expression to multiple turns is confusing and may not be natural. In Appendix B, they illustrate how to simply use a preposition + bbox coordinates in the later turns. A natural referring expression considers the composition between objects. However, the qualitative examples contain more complicated and natural questions, such as "Which slice of pizza is R1 about to eat?" and "Who is the person next to R1?" They mention that those GPT-style questions used in
(1) RegionReasoner extends VisionReasoner, a strong previous single-round model, and adapts it to the challenging multi-round setting. Results on the proposed benchmark show the validity of RegionReasoner. (2) The benchmark itself can be used later for multi-round vision reasoning studies. The motivation of referring to object locations is direct and clear.
(1) The paper claims that "RegionReasoner consistently outperforms strong Vision-Language Models and task-specific baselines on both referring segmentation and detection." Previous benchmarks focus on single-round detection/segmentation, but the results in the main Table 1 and Table 2 are shown on the proposed multi-round benchmark. I think it would be reasonable to add a table showing some "task-specific baselines" on the previous single-round benchmarks. (2) Also, the proposed benchmark uses
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
