Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback

Yang Chen; Yufan Shen; Wenxuan Huang; Sheng Zhou; Qunshu Lin; Xinyu Cai; Zhi Yu; Jiajun Bu; Botian Shi; Yu Qiao

arXiv:2507.20766 · cs.CV · August 8, 2025

TL;DR

This paper introduces RRVF, a novel framework enabling multimodal models to learn complex visual reasoning solely from raw images by leveraging reinforcement learning and the verification of rendered outputs, reducing dependence on image-text supervision.

Contribution

The paper presents RRVF, a new reinforcement learning-based framework that allows visual reasoning models to learn from images alone, bypassing the need for curated image-text datasets.

Findings

01 Outperforms existing open-source MLLMs on image-to-code tasks

02 Demonstrates superior generalization across domains

03 Outperforms the more advanced MLLM used during training

Abstract

Multimodal Large Language Models (MLLMs) exhibit impressive performance across various visual tasks. Subsequent investigations into enhancing their visual reasoning abilities have significantly expanded their performance envelope. However, a critical bottleneck in the advancement of MLLMs toward deep visual reasoning is their heavy reliance on curated image-text supervision. To solve this problem, we introduce a novel framework, "Reasoning-Rendering-Visual-Feedback" (RRVF), that enables MLLMs to learn complex visual reasoning from only raw images. This framework builds on the "Asymmetry of Verification" principle, i.e., verifying the rendered output against the source image is substantially easier than performing deep visual reasoning to generate a faithful, structured representation such as code. We demonstrate that this relative ease provides an ideal reward signal for…
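
The "Asymmetry of Verification" idea in the abstract can be illustrated with a toy sketch: scoring a rendered output against the source image takes a few lines, whereas producing the code that renders it is the hard part. The function below is a hypothetical stand-in and not the paper's reward. It assumes images are same-sized 8-bit grayscale grids stored as nested lists, and it uses mean absolute pixel difference as the similarity measure; the name `visual_similarity_reward` and the rule that a size mismatch scores zero are assumptions made for illustration.

```python
from typing import List

# Assumption: an image is a grid of 8-bit grayscale values (0-255), row by row.
Image = List[List[int]]


def visual_similarity_reward(source: Image, rendered: Image) -> float:
    """Toy reward in [0, 1]: 1.0 for identical images, lower as pixels diverge."""
    same_shape = len(source) == len(rendered) and all(
        len(a) == len(b) for a, b in zip(source, rendered)
    )
    if not same_shape:
        return 0.0  # assumption: a size mismatch counts as a failed render
    total = sum(len(row) for row in source)
    if total == 0:
        return 0.0  # assumption: empty images carry no signal
    diff = sum(
        abs(a - b)
        for row_a, row_b in zip(source, rendered)
        for a, b in zip(row_a, row_b)
    )
    # Normalise by the largest possible total difference (255 per pixel).
    return 1.0 - diff / (255.0 * total)


# Hypothetical source image: a white 2x2 square on a black 4x4 canvas.
source = [
    [0, 0, 0, 0],
    [0, 255, 255, 0],
    [0, 255, 255, 0],
    [0, 0, 0, 0],
]
exact_render = [row[:] for row in source]        # a faithful reconstruction
blank_render = [[0] * 4 for _ in range(4)]       # a render that missed the square

reward_exact = visual_similarity_reward(source, exact_render)
reward_blank = visual_similarity_reward(source, blank_render)
```

Here the faithful reconstruction scores higher than the blank canvas, which is the kind of ordering a reinforcement learning loop could optimise against. A pixel difference is only the simplest possible verifier; any check that compares the rendered output with the source image would fit the same slot.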
