Grounded Reinforcement Learning for Visual Reasoning

Gabriel Sarch; Snigdha Saha; Naitik Khandelwal; Ayush Jain; Michael J. Tarr; Aviral Kumar; Katerina Fragkiadaki

arXiv:2505.23678·cs.CV·May 18, 2026

2 Repos · 16 Models · 1 Dataset · 1 Video

TL;DR

ViGoRL is a vision-language model trained with reinforcement learning to anchor each reasoning step to specific visual coordinates, which improves performance across a diverse set of visual reasoning benchmarks.

Contribution

Introduces ViGoRL, a visually grounded RL framework that trains vision-language models to produce spatially grounded reasoning traces and direct visual attention to task-relevant regions, outperforming baselines across multiple benchmarks.

Findings

01

ViGoRL outperforms supervised and RL baselines on visual reasoning benchmarks.

02

Multi-turn RL with zoomed-in visual feedback improves localization of small GUI elements.

03

Grounding enhances exploration, subgoal setting, and visual verification behaviors.

Abstract

While reinforcement learning (RL) over chains of thought has significantly advanced language models in tasks such as mathematics and coding, visual reasoning introduces added complexity by requiring models to direct visual attention, interpret perceptual inputs, and ground abstract reasoning in spatial evidence. We introduce ViGoRL (Visually Grounded Reinforcement Learning), a vision-language model trained with RL to explicitly anchor each reasoning step to specific visual coordinates. Inspired by human visual decision-making, ViGoRL learns to produce spatially grounded reasoning traces, guiding visual attention to task-relevant regions at each step. When fine-grained exploration is required, our novel multi-turn RL framework enables the model to dynamically zoom into predicted coordinates as reasoning unfolds. Across a diverse set of visual reasoning benchmarks--including SAT-2 and…
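The zoom step described in the abstract can be pictured with a small geometric sketch. The following is not the authors' implementation; it only illustrates, under stated assumptions, how a crop window might be derived from a predicted coordinate. The function name `zoom_box`, the fixed crop size, and the choice to shift the window so that it stays inside the image are all hypothetical.

```python
from typing import Tuple


def zoom_box(
    x: int, y: int, img_w: int, img_h: int, crop_w: int, crop_h: int
) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) for a crop around the point (x, y).

    Hypothetical helper: the window is centered on the predicted point and
    then shifted, if needed, so that it lies fully inside the image.
    """
    # The crop can never be larger than the image itself.
    crop_w = min(crop_w, img_w)
    crop_h = min(crop_h, img_h)
    # Center on the point, then clamp the top-left corner to valid positions.
    left = min(max(x - crop_w // 2, 0), img_w - crop_w)
    top = min(max(y - crop_h // 2, 0), img_h - crop_h)
    return left, top, left + crop_w, top + crop_h


if __name__ == "__main__":
    # A point near the image center keeps a centered window.
    print(zoom_box(500, 400, 1000, 800, 200, 200))
    # A point near the bottom-left corner gets a window shifted inward.
    print(zoom_box(10, 790, 1000, 800, 200, 200))
```

In a multi-turn setup, the cropped region would presumably be returned to the model as a new visual observation before the next reasoning step; how ViGoRL sizes and encodes that crop is not specified in the text above.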

Code & Models

Datasets

gsarch/vigorl_datasets
dataset · 115 downloads

Videos

Grounded Reinforcement Learning for Visual Reasoning · SlidesLive

Taxonomy

Methods

Softmax · Attention Is All You Need · Sparse Evolutionary Training