Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training

Aojun Lu; Tao Feng; Hangjie Yuan; Wei Li; Yanan Sun

arXiv:2602.10815 · cs.CV · February 12, 2026

Open Access

TL;DR

This paper investigates why reinforcement learning (RL) leads to better out-of-distribution generalization in vision-language models (VLMs) than supervised fine-tuning (SFT), attributing it to data filtering effects, and proposes a difficulty-based data filtering method to improve SFT.

Contribution

The paper introduces Difficulty-Curated SFT (DC-SFT), a simple data filtering approach that improves the out-of-distribution generalization of SFT, surpassing both RL and standard SFT.

Findings

01. RL's OOD advantage stems from an implicit data filtering mechanism that prioritizes medium-difficulty samples.

02. Filtering out hard samples improves SFT's OOD performance.

03. DC-SFT surpasses RL in OOD generalization and offers greater stability.
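
One way such implicit filtering can arise is sketched below. It assumes a group-relative advantage of the kind used in GRPO-style training; this summary does not say which RL algorithm the paper studies, so the function names and the binary rewards here are illustrative assumptions only. With this kind of advantage, a prompt whose sampled responses are all correct or all wrong yields zero advantage for every response, so only prompts of intermediate difficulty contribute a gradient signal.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled responses.

    Assumption: a GRPO-style advantage, (r - mean) / (std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def has_learning_signal(rewards):
    """True if at least one response in the group gets a nonzero advantage."""
    return any(abs(a) > 1e-9 for a in group_relative_advantages(rewards))


# Hypothetical binary rewards for four sampled responses per prompt.
too_easy = [1, 1, 1, 1]   # every response correct
too_hard = [0, 0, 0, 0]   # every response wrong
medium = [1, 0, 1, 0]     # mixed outcomes

print(has_learning_signal(too_easy))
print(has_learning_signal(too_hard))
print(has_learning_signal(medium))
```

Under these assumptions only the mixed group produces nonzero advantages, which is one concrete reading of "prioritizes medium-difficulty samples".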

Abstract

The adaptation of large-scale Vision-Language Models (VLMs) through post-training reveals a pronounced generalization gap: models fine-tuned with Reinforcement Learning (RL) consistently achieve superior out-of-distribution (OOD) performance compared to those trained with Supervised Fine-Tuning (SFT). This paper posits a data-centric explanation for this phenomenon, contending that RL's generalization advantage arises from an implicit data filtering mechanism that inherently prioritizes medium-difficulty training samples. To test this hypothesis, we systematically evaluate the OOD generalization of SFT models across training datasets of varying difficulty levels. Our results confirm that data difficulty is a critical factor, revealing that training on hard samples significantly degrades OOD performance. Motivated by this finding, we introduce Difficulty-Curated SFT (DC-SFT), a…
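
A minimal sketch of difficulty-based filtering for SFT data follows. The abstract is truncated before it describes DC-SFT in detail, so everything here is an assumption made for illustration: difficulty is approximated by the pass rate of several sampled base-model responses, and the names `Sample`, `pass_rate`, `drop_hard_samples`, and the threshold `min_pass_rate` are hypothetical rather than taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    prompt: str
    answer: str
    pass_rate: float  # fraction of sampled base-model responses judged correct


def pass_rate(correct_flags):
    """Estimate difficulty from 0/1 correctness flags (higher means easier)."""
    return sum(correct_flags) / len(correct_flags)


def drop_hard_samples(samples, min_pass_rate=0.25):
    """Keep samples whose pass rate is at least the (hypothetical) threshold."""
    return [s for s in samples if s.pass_rate >= min_pass_rate]


# Hypothetical correctness flags for four sampled responses per prompt.
dataset = [
    Sample("q1", "a", pass_rate([1, 1, 1, 1])),
    Sample("q2", "b", pass_rate([0, 0, 0, 1])),
    Sample("q3", "c", pass_rate([0, 0, 0, 0])),
]

curated = drop_hard_samples(dataset, min_pass_rate=0.25)
print([s.prompt for s in curated])
```

The sketch removes only the hardest samples, matching the stated finding that training on hard samples degrades OOD performance. Whether the paper also filters easy samples, and how it measures difficulty, is not recoverable from this page.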

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling