Semi-off-Policy Reinforcement Learning for Vision-Language Slow-Thinking Reasoning
Junhao Shen, Haiteng Zhao, Yuzhe Gu, Songyang Gao, Kuikun Liu, Haian Huang, Jianfei Gao, Dahua Lin, Wenwei Zhang, Kai Chen

TL;DR
This paper introduces SOPHIA, a semi-off-policy reinforcement learning method that enhances vision-language models with slow-thinking reasoning, leading to state-of-the-art multimodal reasoning performance and surpassing some closed-source models.
Contribution
SOPHIA combines on-policy visual understanding with off-policy reasoning to improve slow-thinking abilities in large vision-language models, addressing hallucination issues and enabling better reasoning performance.
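The combination described above can be pictured as a two-stage data flow: the trainable LVLM supplies the visual understanding, and a separate language model supplies the slow-thinking reasoning over that text. The following is a minimal sketch under stated assumptions, not the authors' implementation. `Trajectory`, `build_trajectory`, `toy_lvlm`, and `toy_reasoner` are hypothetical names introduced only for illustration, and rewards and policy updates are deliberately left out because this page does not describe them.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trajectory:
    """One semi-off-policy sample (hypothetical container, not from the paper)."""
    question: str
    visual_understanding: str  # on-policy part: produced by the trainable LVLM
    reasoning: str             # off-policy part: produced by a separate language model


def build_trajectory(
    image: str,
    question: str,
    describe_image: Callable[[str, str], str],
    reason_over_text: Callable[[str, str], str],
) -> Trajectory:
    """Chain the two models: the LVLM looks at the image, the language model reasons over text."""
    # Step 1 (on-policy): the LVLM being trained describes what it sees.
    description = describe_image(image, question)
    # Step 2 (off-policy): a language model reasons from that description only,
    # so it never has to perceive the image itself.
    reasoning = reason_over_text(question, description)
    return Trajectory(question, description, reasoning)


# Toy stand-ins for the two models (assumptions for illustration only).
def toy_lvlm(image: str, question: str) -> str:
    return f"[description of {image}]"


def toy_reasoner(question: str, description: str) -> str:
    return f"Given {description}, think step by step about: {question}"


traj = build_trajectory("triangle.png", "What is the area?", toy_lvlm, toy_reasoner)
print(traj.reasoning)
```

Because the reasoning model in this sketch only ever sees the LVLM's own description, the reasoning stays tied to what the trainable model actually perceives, which is the intuition behind avoiding the perception mismatch mentioned in the abstract.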
Findings
SOPHIA improves InternVL3.0-38B by 8.50% on reasoning benchmarks.
Achieves 49.08% and 49.95% pass@1 accuracy on MathVision and OlympiadBench, respectively.
Outperforms supervised fine-tuning and on-policy RL methods in experiments.
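For readers unfamiliar with the pass@1 metric quoted in the findings, here is a short sketch of how it is commonly computed. It assumes the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), which for k = 1 reduces to the fraction of correct samples per problem. This page does not say how the paper sampled or scored its outputs, so the function names and the toy data are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k wrong samples: every size-k subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: list[tuple[int, int]]) -> float:
    """Average pass@1 over problems; each entry is (samples drawn, samples correct)."""
    return sum(pass_at_k(n, c, 1) for n, c in results) / len(results)


# Hypothetical toy data: four problems, four samples each.
toy_results = [(4, 2), (4, 0), (4, 4), (4, 1)]
score = benchmark_pass_at_1(toy_results)
print(f"pass@1 = {score:.4f}")  # prints "pass@1 = 0.4375"
```

With a single sample per problem, pass@1 is simply the share of problems answered correctly on the first attempt.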
Abstract
Enhancing large vision-language models (LVLMs) with visual slow-thinking reasoning is crucial for solving complex multimodal tasks. However, since LVLMs are mainly trained with vision-language alignment, it is difficult to adopt on-policy reinforcement learning (RL) to develop the slow-thinking ability because the rollout space is restricted by the model's initial abilities. Off-policy RL offers a way to go beyond the current policy, but directly distilling trajectories from external models may cause visual hallucinations due to mismatched visual perception abilities across models. To address these issues, this paper proposes SOPHIA, a simple and scalable Semi-Off-Policy RL for vision-language slow-tHInking reAsoning. SOPHIA builds a semi-off-policy behavior model by combining on-policy visual understanding from a trainable LVLM with off-policy slow-thinking reasoning from a language model,…
Taxonomy
Topics: Language, Metaphor, and Cognition
