Semi-off-Policy Reinforcement Learning for Vision-Language Slow-Thinking Reasoning

Junhao Shen; Haiteng Zhao; Yuzhe Gu; Songyang Gao; Kuikun Liu; Haian Huang; Jianfei Gao; Dahua Lin; Wenwei Zhang; Kai Chen

arXiv:2507.16814 · cs.LG · October 23, 2025

TL;DR

This paper introduces SOPHIA, a semi-off-policy reinforcement learning method that enhances vision-language models with slow-thinking reasoning, leading to state-of-the-art multimodal reasoning performance and surpassing some closed-source models.

Contribution

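SOPHIA combines on-policy visual understanding from a trainable LVLM with off-policy slow-thinking reasoning from a language model. This avoids the visual hallucinations that can arise when trajectories are distilled directly from external models, and it lets the LVLM learn reasoning beyond what its initial policy can roll out.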

Findings

01

SOPHIA improves InternVL3.0-38B by 8.50% on average across reasoning benchmarks.

02

SOPHIA achieves 49.08% pass@1 accuracy on MathVision and 49.95% on OlympiadBench.

03

SOPHIA outperforms both supervised fine-tuning and on-policy RL methods in experiments.
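The findings report pass@1 accuracy, but this page does not say how it was computed. As a minimal sketch, assuming the commonly used unbiased pass@k estimator (which reduces to the fraction of correct samples when k = 1), the metric can be computed as follows. The function names and the toy per-problem counts are hypothetical and are not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random
    # size-k subset of the n samples contains no correct answer.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: list[tuple[int, int]]) -> float:
    """Mean pass@1 over problems; each entry is (samples, correct)."""
    return sum(pass_at_k(n, c, 1) for n, c in results) / len(results)


# Hypothetical toy data: (samples drawn, samples correct) per problem.
toy = [(4, 4), (4, 2), (4, 0), (4, 2)]
print(f"pass@1 = {benchmark_pass_at_1(toy):.2%}")
```

With a single sample per problem, this is simply the share of problems answered correctly on the first attempt.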

Abstract

Enhancing large vision-language models (LVLMs) with visual slow-thinking reasoning is crucial for solving complex multimodal tasks. However, since LVLMs are mainly trained with vision-language alignment, it is difficult to adopt on-policy reinforcement learning (RL) to develop slow-thinking ability, because the rollout space is restricted by the model's initial abilities. Off-policy RL offers a way to go beyond the current policy, but directly distilling trajectories from external models may cause visual hallucinations due to mismatched visual perception abilities across models. To address these issues, this paper proposes SOPHIA, a simple and scalable Semi-Off-Policy RL for vision-language slow-tHInking reAsoning. SOPHIA builds a semi-off-policy behavior model by combining on-policy visual understanding from a trainable LVLM with off-policy slow-thinking reasoning from a language model,…
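The visible part of the abstract describes a behavior model that pairs on-policy visual understanding with off-policy reasoning. The sketch below illustrates only that composition, under stated assumptions: the two callable interfaces, the dictionary layout, and the toy stand-in models are all hypothetical, and passing the visual description to the language model as text is an assumed way of connecting the two. It is not the authors' implementation and does not cover the part of the abstract that is cut off above.

```python
from typing import Callable

# Hypothetical interfaces (assumptions, not from the paper):
#   Describe: (image_ref, question) -> visual description text
#   Reason:   (description, question) -> slow-thinking reasoning text
Describe = Callable[[str, str], str]
Reason = Callable[[str, str], str]


def semi_off_policy_rollout(
    image_ref: str, question: str, describe: Describe, reason: Reason
) -> dict[str, str]:
    """Compose one trajectory from two different models."""
    # On-policy part: produced by the LVLM that is being trained.
    description = describe(image_ref, question)
    # Off-policy part: produced by a separate language model that
    # only receives text (assumed here to be the description).
    reasoning = reason(description, question)
    return {"description": description, "reasoning": reasoning}


# Toy stand-ins so the sketch runs without any real model.
def toy_lvlm(image_ref: str, question: str) -> str:
    return f"[description of {image_ref}]"


def toy_lm(description: str, question: str) -> str:
    return f"Given {description}, think step by step about: {question}"


rollout = semi_off_policy_rollout("fig.png", "What is x?", toy_lvlm, toy_lm)
print(rollout["reasoning"])
```

Keeping the two stages behind separate callables makes it explicit which part of a trajectory comes from the trainable model and which part comes from the external one.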

Taxonomy

Topics: Language, Metaphor, and Cognition