The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs

Jierun Chen; Tiezheng Yu; Haoli Bai; Lewei Yao; Jiannan Wu; Kaican Li; Fei Mi; Chaofan Tao; Lei Zhu; Manyi Zhang; Xiaohui Li; Lu Hou; Lifeng Shang; Qun Liu

arXiv:2507.07562·cs.CL·July 11, 2025

TL;DR

This paper systematically investigates the effects and interactions of long-CoT supervised fine-tuning and reinforcement learning in vision-language models, revealing trade-offs and the need for more adaptive combined training methods.

Contribution

It provides a comprehensive analysis of how long-CoT SFT and RL individually and jointly affect reasoning in VLMs, highlighting their limitations and the complexity of combining them effectively.

Findings

01

SFT improves performance on difficult questions but increases verbosity and reduces accuracy on simple questions.

02

RL enhances generalization and maintains brevity, with consistent improvements across question difficulties.

03

Combining SFT and RL through various strategies does not yield additive benefits and introduces trade-offs.

Abstract

Large vision-language models (VLMs) increasingly adopt post-training techniques such as long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL) to elicit sophisticated reasoning. While these methods exhibit synergy in language-only models, their joint effectiveness in VLMs remains uncertain. We present a systematic investigation into the distinct roles and interplay of long-CoT SFT and RL across multiple multimodal reasoning benchmarks. We find that SFT improves performance on difficult questions through in-depth, structured reasoning, but introduces verbosity and degrades performance on simpler ones. In contrast, RL promotes generalization and brevity, yielding consistent improvements across all difficulty levels, though its improvements on the hardest questions are less pronounced than those from SFT. Surprisingly, combining them through two-staged, interleaved,…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling