SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning
Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen, Stephen Hart, Ondrej Biza

TL;DR
SOLE-R1 introduces a video-language reasoning model that provides dense, task-progress-based rewards for online reinforcement learning, enabling robots to learn new manipulation tasks without ground-truth rewards or demonstrations.
Contribution
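The paper presents SOLE-R1, a video-language reasoning model that serves as the sole reward signal for online robot RL. It is trained on data from a large-scale video trajectory and reasoning synthesis pipeline, and it supports zero-shot learning of previously unseen tasks, without ground-truth rewards or demonstrations.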
Findings
SOLE-R1 enables robots to learn 24 unseen tasks without ground-truth rewards.
It outperforms GPT-5 and Gemini-3-Pro as vision-language rewarders.
SOLE-R1 is also more robust to reward hacking than GPT-5 and Gemini-3-Pro.
Abstract
Vision-language models (VLMs) have shown impressive capabilities across diverse tasks, motivating efforts to leverage these models to supervise robot learning. However, when used as evaluators in reinforcement learning (RL), today's strongest models often fail under partial observability and distribution shift, enabling policies to exploit perceptual errors rather than solve the task. To address this limitation, we introduce SOLE-R1 (Self-Observing LEarner), a video-language reasoning model explicitly designed to serve as the sole reward signal for online RL. Given only raw video observations and a natural-language goal, SOLE-R1 performs per-timestep spatiotemporal chain-of-thought (CoT) reasoning and produces dense estimates of task progress that can be used directly as rewards. To train SOLE-R1, we develop a large-scale video trajectory and reasoning synthesis pipeline that generates…
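To make the reward interface described in the abstract concrete, here is a minimal sketch of how a per-timestep progress estimate could be plugged into an RL loop as the only reward. It is not the authors' implementation. The names `ProgressReward`, `ProgressFn`, and `dummy_progress` are hypothetical, the frame buffer length is an arbitrary choice, and the stand-in estimator simply averages pixel values where the real system would call the video-language model.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

# Assumption: a frame is a flat sequence of pixel intensities in [0, 1].
Frame = Sequence[float]
# Hypothetical interface: (frames observed so far, goal text) -> progress in [0, 1].
ProgressFn = Callable[[List[Frame], str], float]


class ProgressReward:
    """Wraps a progress estimator so its output is used directly as the reward."""

    def __init__(self, progress_fn: ProgressFn, goal: str, max_frames: int = 16):
        self.progress_fn = progress_fn
        self.goal = goal
        # Bounded history of recent frames (buffer size is an assumption).
        self.frames: Deque[Frame] = deque(maxlen=max_frames)

    def reset(self) -> None:
        """Clear the video history at the start of each episode."""
        self.frames.clear()

    def step(self, frame: Frame) -> float:
        """Append the newest frame and return the clipped progress estimate."""
        self.frames.append(frame)
        progress = float(self.progress_fn(list(self.frames), self.goal))
        return min(max(progress, 0.0), 1.0)


def dummy_progress(frames: List[Frame], goal: str) -> float:
    # Stand-in for the reasoning model: mean intensity of the latest frame.
    latest = frames[-1]
    return sum(latest) / len(latest)


if __name__ == "__main__":
    rewarder = ProgressReward(dummy_progress, goal="open the drawer")
    rewarder.reset()
    for frame in ([0.0, 0.2], [0.4, 0.6], [0.9, 1.0]):
        print(rewarder.step(frame))
```

In an actual training loop, `step` would be called once per environment step with the latest camera image, and the returned value would replace the environment's ground-truth reward. Clipping to [0, 1] is a defensive choice in this sketch, not something the abstract specifies.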
