The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs
Jierun Chen, Tiezheng Yu, Haoli Bai, Lewei Yao, Jiannan Wu, Kaican Li, Fei Mi, Chaofan Tao, Lei Zhu, Manyi Zhang, Xiaohui Li, Lu Hou, Lifeng Shang, Qun Liu

TL;DR
This paper systematically investigates the effects and interactions of long-CoT supervised fine-tuning and reinforcement learning in vision-language models, revealing trade-offs and the need for more adaptive combined training methods.
Contribution
It provides a comprehensive analysis of how long-CoT SFT and RL individually and jointly affect reasoning in VLMs, highlighting their limitations and the complexity of combining them effectively.
Findings
SFT improves performance on difficult questions but increases verbosity and reduces accuracy on simple questions.
RL enhances generalization and maintains brevity, with consistent improvements across question difficulties.
Combining SFT and RL through various strategies does not yield additive benefits and introduces trade-offs.
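The findings above compare models per difficulty level rather than by a single aggregate score. A minimal sketch of that kind of breakdown follows, assuming each evaluated question is reduced to a `(difficulty_level, is_correct)` pair. The function name, the record layout, and the toy records are illustrative assumptions, not the authors' evaluation code or results.

```python
from collections import defaultdict


def accuracy_by_difficulty(records):
    """Return {difficulty_level: accuracy} for (level, is_correct) pairs.

    The (level, is_correct) layout is an assumed, hypothetical format.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for level, is_correct in records:
        totals[level] += 1
        correct[level] += int(is_correct)
    # Sort levels so the breakdown reads from easiest to hardest.
    return {level: correct[level] / totals[level] for level in sorted(totals)}


# Toy, made-up records for illustration only (not results from the paper).
toy_records = [
    (1, True), (1, True), (1, False),
    (5, False), (5, True), (5, False), (5, False),
]
per_level = accuracy_by_difficulty(toy_records)
for level, acc in per_level.items():
    print(f"level {level}: {acc:.2f}")
```

Running the same function on the outputs of two differently trained models would show whether a gain on hard questions comes at the cost of accuracy on easy ones, which an aggregate score alone would hide.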
Abstract
Large vision-language models (VLMs) increasingly adopt post-training techniques such as long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL) to elicit sophisticated reasoning. While these methods exhibit synergy in language-only models, their joint effectiveness in VLMs remains uncertain. We present a systematic investigation into the distinct roles and interplay of long-CoT SFT and RL across multiple multimodal reasoning benchmarks. We find that SFT improves performance on difficult questions through in-depth, structured reasoning, but introduces verbosity and degrades performance on simpler ones. In contrast, RL promotes generalization and brevity, yielding consistent improvements across all difficulty levels, though its improvements on the hardest questions are less pronounced than those of SFT. Surprisingly, combining them through two-staged, interleaved,…
Code & Models
- 🤗 JierunChen/SFT-RL-SynergyDilemma-SFT_s1.1_R1 (model · 4 dl)
- 🤗 JierunChen/SFT-RL-SynergyDilemma-SFT_Eureka_Distill (model · 3 dl)
- 🤗 JierunChen/SFT-RL-SynergyDilemma-RL (model · 6 dl)
- 🤗 JierunChen/SFT-RL-SynergyDilemma-Two_stage (model · 2 dl)
- 🤗 JierunChen/SFT-RL-SynergyDilemma-Interleave (model · 2 dl)
- 🤗 JierunChen/SFT-RL-SynergyDilemma-Progressive (model · 1 dl)
- 🤗 JierunChen/SFT-RL-SynergyDilemma-Data_Mixing (model · 1 dl)
- 🤗 JierunChen/SFT-RL-SynergyDilemma-Model_Merging (model)
- JierunChen/MathVision_with_difficulty_level (dataset · 19 dl)
- JierunChen/MathVista_with_difficulty_level (dataset · 35 dl)
- JierunChen/MathVerse_with_difficulty_level (dataset · 16 dl)
- JierunChen/MMMU_with_difficulty_level (dataset · 109 dl)
- JierunChen/MMStar_with_difficulty_level (dataset · 47 dl)
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
