Grounded Reinforcement Learning for Visual Reasoning
Gabriel Sarch, Snigdha Saha, Naitik Khandelwal, Ayush Jain, Michael J. Tarr, Aviral Kumar, Katerina Fragkiadaki

TL;DR
ViGoRL is a novel vision-language reinforcement learning model that explicitly grounds each reasoning step in visual coordinates, significantly improving performance on diverse visual reasoning tasks.
Contribution
Introduces ViGoRL, a visually grounded RL framework that enhances spatial reasoning and visual attention in models, outperforming baselines across multiple benchmarks.
Findings
ViGoRL outperforms supervised and RL baselines on visual reasoning benchmarks.
Multi-turn RL with zoomed-in visual feedback improves localization of small GUI elements.
Grounding enhances exploration, subgoal setting, and visual verification behaviors.
Abstract
While reinforcement learning (RL) over chains of thought has significantly advanced language models in tasks such as mathematics and coding, visual reasoning introduces added complexity by requiring models to direct visual attention, interpret perceptual inputs, and ground abstract reasoning in spatial evidence. We introduce ViGoRL (Visually Grounded Reinforcement Learning), a vision-language model trained with RL to explicitly anchor each reasoning step to specific visual coordinates. Inspired by human visual decision-making, ViGoRL learns to produce spatially grounded reasoning traces, guiding visual attention to task-relevant regions at each step. When fine-grained exploration is required, our novel multi-turn RL framework enables the model to dynamically zoom into predicted coordinates as reasoning unfolds. Across a diverse set of visual reasoning benchmarks--including SAT-2 and…
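The abstract describes two mechanisms: reasoning steps anchored to image coordinates, and a multi-turn mode that zooms into a predicted coordinate. The sketch below illustrates both in a minimal way and is not the authors' implementation. The trace format (each step carrying an "(x, y)" pixel coordinate), the function names `extract_points` and `zoom_box`, and the fixed crop size are all assumptions made for illustration.

```python
import re
from typing import List, Tuple

# Assumed (hypothetical) trace format: reasoning steps mention pixel
# coordinates written as "(x, y)". The paper's actual format may differ.
COORD_PATTERN = re.compile(r"\((\d+),\s*(\d+)\)")


def extract_points(trace: str) -> List[Tuple[int, int]]:
    """Return every (x, y) coordinate mentioned in a reasoning trace, in order."""
    return [(int(x), int(y)) for x, y in COORD_PATTERN.findall(trace)]


def zoom_box(
    point: Tuple[int, int],
    image_size: Tuple[int, int],
    crop_size: Tuple[int, int],
) -> Tuple[int, int, int, int]:
    """Compute a crop box (left, top, right, bottom) centered on `point`.

    The box is shifted as needed so that it stays inside the image, and it
    is shrunk to the image size if the requested crop is larger.
    """
    x, y = point
    width, height = image_size
    crop_w = min(crop_size[0], width)
    crop_h = min(crop_size[1], height)
    # Center the crop on the point, then clamp it to the image bounds.
    left = min(max(x - crop_w // 2, 0), width - crop_w)
    top = min(max(y - crop_h // 2, 0), height - crop_h)
    return (left, top, left + crop_w, top + crop_h)


if __name__ == "__main__":
    trace = (
        "The search bar is near the top (512, 40). "
        "The submit button sits at (30, 700)."
    )
    for point in extract_points(trace):
        print(point, "->", zoom_box(point, (1024, 768), (200, 200)))
```

In a multi-turn loop, a box like this could be handed to an image library's crop routine and the enlarged region fed back to the model as the next observation. How ViGoRL actually selects and resizes the zoomed region is not specified in this summary.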
Code & Models
- 🤗 gsarch/ViGoRL-MCTS-SFT-3b-Web-Grounding (model · 5 dl)
- 🤗 gsarch/ViGoRL-MCTS-SFT-7b-Web-Grounding (model · 1 dl)
- 🤗 gsarch/ViGoRL-Multiturn-MCTS-SFT-3b-Web-Grounding (model · 193 dl · ♡ 1)
- 🤗 gsarch/ViGoRL-Multiturn-3b-Web-Grounding (model · 6 dl)
- 🤗 gsarch/ViGoRL-7b-Web-Grounding (model · 7 dl)
- 🤗 gsarch/ViGoRL-3b-Web-Action (model · 2 dl)
- 🤗 gsarch/ViGoRL-7b-Web-Action (model · 9 dl)
- 🤗 gsarch/ViGoRL-Multiturn-MCTS-SFT-3b-Visual-Search (model · 85 dl)
- 🤗 gsarch/ViGoRL-Multiturn-MCTS-SFT-7b-Visual-Search (model · 1 dl)
- 🤗 gsarch/ViGoRL-Multiturn-3b-Visual-Search (model · 45 dl · ♡ 1)
Taxonomy
Methods: Softmax · Attention Is All You Need · Sparse Evolutionary Training
