Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training
Aojun Lu, Tao Feng, Hangjie Yuan, Wei Li, Yanan Sun

TL;DR
This paper investigates why reinforcement learning (RL) yields better out-of-distribution (OOD) generalization in vision-language models (VLMs) than supervised fine-tuning (SFT). It attributes the gap to an implicit data filtering effect in RL and proposes a difficulty-based data filtering method to improve SFT.
Contribution
The paper introduces Difficulty-Curated SFT (DC-SFT), a simple data filtering approach that enhances out-of-distribution generalization, outperforming RL and standard SFT methods.
Findings
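RL's generalization advantage arises from an implicit data filtering mechanism that prioritizes medium-difficulty training samples.
Training on hard samples significantly degrades SFT's OOD performance, so filtering them out improves it.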
DC-SFT surpasses RL in OOD generalization and offers greater stability.
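
The difficulty-based filtering idea in the findings above can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not say how difficulty is measured, so the sketch assumes a hypothetical per-sample `pass_rate` (the fraction of answers sampled from a reference model that were correct), and the thresholds `easy_min` and `hard_max` are illustrative values, not numbers from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    prompt: str
    answer: str
    # Hypothetical difficulty signal (an assumption, not from the paper):
    # fraction of reference-model answers that were correct, in [0, 1].
    pass_rate: float


def difficulty_bucket(pass_rate: float,
                      easy_min: float = 0.8,
                      hard_max: float = 0.2) -> str:
    """Map a pass rate to 'easy', 'medium', or 'hard' (illustrative thresholds)."""
    if pass_rate >= easy_min:
        return "easy"
    if pass_rate <= hard_max:
        return "hard"
    return "medium"


def drop_hard_samples(samples: List[Sample],
                      hard_max: float = 0.2) -> List[Sample]:
    """Keep only samples that are not classified as hard."""
    return [
        s for s in samples
        if difficulty_bucket(s.pass_rate, hard_max=hard_max) != "hard"
    ]


if __name__ == "__main__":
    data = [
        Sample("q1", "a", 0.0),   # hard: the reference model never answers correctly
        Sample("q2", "b", 0.5),   # medium
        Sample("q3", "c", 1.0),   # easy
    ]
    curated = drop_hard_samples(data)
    print([s.prompt for s in curated])  # prints ['q2', 'q3']
```

The curated list would then be passed to an ordinary SFT training loop. How the difficulty score is actually computed, and where the cutoffs sit, are the parts to take from the paper itself.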
Abstract
The adaptation of large-scale Vision-Language Models (VLMs) through post-training reveals a pronounced generalization gap: models fine-tuned with Reinforcement Learning (RL) consistently achieve superior out-of-distribution (OOD) performance compared to those trained with Supervised Fine-Tuning (SFT). This paper posits a data-centric explanation for this phenomenon, contending that RL's generalization advantage arises from an implicit data filtering mechanism that inherently prioritizes medium-difficulty training samples. To test this hypothesis, we systematically evaluate the OOD generalization of SFT models across training datasets of varying difficulty levels. Our results confirm that data difficulty is a critical factor, revealing that training on hard samples significantly degrades OOD performance. Motivated by this finding, we introduce Difficulty-Curated SFT (DC-SFT), a…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
