Why Reinforcement Fine-Tuning Enables MLLMs Preserve Prior Knowledge Better: A Data Perspective
Zhihao Zhang, Qiaole Dong, Qi Zhang, Jun Zhao, Enyu Zhou, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Mingqi Wu, Yanwei Fu, Tao Ji, Tao Gui, Xuanjing Huang, Kai Chen

TL;DR
This paper investigates why reinforcement fine-tuning (RFT) preserves prior knowledge in multimodal large language models (MLLMs) better than supervised fine-tuning (SFT), highlighting the role of the training data distribution and learning dynamics.
Contribution
It introduces an analysis of RFT versus SFT on a jigsaw-puzzle task, emphasizing the role of data distribution and learning dynamics in knowledge retention for multimodal large language models.
Findings
RFT maintains prior knowledge better than SFT.
Training data aligned with the model's probability landscape reduces forgetting.
RFT-simulated rollouts improve SFT's knowledge preservation.
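As an illustration of the second finding, one rough way to ask whether a training sample is "aligned with the model's probability landscape" is to check how likely the base model already finds it. The sketch below is an assumption for illustration, not the paper's metric: the function names, the sample names, and the per-token probabilities are all hypothetical, and in practice the probabilities would come from scoring each sample with the base model.

```python
import math

def mean_token_logprob(token_probs):
    """Average log-probability assigned to the tokens of one sample."""
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    return sum(math.log(p) for p in token_probs) / len(token_probs)

def rank_by_alignment(samples):
    """Sort (name, token_probs) pairs from most to least likely."""
    return sorted(samples, key=lambda s: mean_token_logprob(s[1]), reverse=True)

# Hypothetical per-token probabilities under a base model (made-up numbers).
samples = [
    ("human_written_answer", [0.05, 0.10, 0.02]),
    ("model_rollout", [0.60, 0.70, 0.50]),
]
ranked = [name for name, _ in rank_by_alignment(samples)]
```

Under this toy scoring, a rollout sampled from the model itself ranks above an externally written answer, which matches the intuition behind using RFT-style rollouts as SFT data.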
Abstract
Post-training algorithms such as Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) are widely used to adapt (multimodal) large language models to downstream tasks. While effective at task adaptation, their impact on retaining prior knowledge remains unclear. In this paper, we introduce jigsaw puzzles as a novel task absent from existing pretraining corpora and systematically study the behavior of SFT and RFT on the open-source Qwen2.5-VL series. Our experiments reveal a sharp trade-off: SFT enables rapid task acquisition but leads to catastrophic forgetting, whereas RFT learns more slowly but better maintains prior knowledge. We study this phenomenon through learning dynamics by examining both the magnitude and direction of how training data influence prior knowledge. Our analysis shows that RFT mainly reinforces correct samples naturally aligned with the base model's…
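To make the jigsaw-puzzle task concrete, here is a minimal sketch of how such a sample could be built: cut an image into a grid of patches, shuffle them, and keep the permutation as the target. The grid layout, the function names `make_jigsaw` and `placement_accuracy`, and the accuracy-style score are assumptions for illustration; the abstract does not specify the paper's exact task format or reward.

```python
import numpy as np

def make_jigsaw(image, rows, cols, seed=0):
    """Shuffle an image's grid patches; return the shuffled image and the permutation.

    perm[i] is the index (row-major) of the original patch shown at position i.
    """
    h, w = image.shape[:2]
    if h % rows or w % cols:
        raise ValueError("image size must be divisible by the grid size")
    ph, pw = h // rows, w // cols
    patches = [
        image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
        for r in range(rows)
        for c in range(cols)
    ]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(rows * cols)
    # Reassemble the grid row by row from the permuted patches.
    grid_rows = [
        np.concatenate([patches[perm[r * cols + c]] for c in range(cols)], axis=1)
        for r in range(rows)
    ]
    return np.concatenate(grid_rows, axis=0), perm

def placement_accuracy(predicted, target):
    """Fraction of positions where the predicted patch index matches the target."""
    return float(np.mean(np.asarray(predicted) == np.asarray(target)))

image = np.arange(16).reshape(4, 4)  # toy 4x4 "image"
shuffled, perm = make_jigsaw(image, rows=2, cols=2, seed=0)
```

A score like `placement_accuracy` could serve as a verifiable signal for RFT-style training, since the correct permutation is known by construction.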
Peer Reviews
Decision·ICLR 2026 Poster
1. Forgetting is less explored in the multimodal setting. 2. The paper provides theoretical analysis.
1. Limited Novelty of Core Insight: Given that [1] has already demonstrated that reinforcement learning can effectively mitigate catastrophic forgetting, the contribution of this work appears limited. In fact, [1] also conducted experiments showing that it is the on-policy training paradigm (e.g., RFT) that helps overcome catastrophic forgetting. The key distinction between the data used in SFT and RFT lies in whether they are on-policy or offline. Beyond the discussion in L153–L157, the authors
1. This paper explicitly addresses a problem that is crucial at the intersection of Continual Learning (CL) and Reinforcement Learning (RL): how to avoid catastrophic forgetting when adapting to a new task. 2. Based on a theoretical analysis of learning dynamics and the eNTK, this paper explains the impact of different data distributions on forgetting along two dimensions, magnitude and direction, which provides a new perspective for understanding the advantages of RFT.
1. Although the paper studies "catastrophic forgetting", the discussion of existing continual learning (CL) methods is relatively superficial, and classical CL methods (e.g., EWC, LwF, ER) are not fully compared. 2. While the authors note that resource constraints prevented experimentation with more multimodal models and large language models, experimenting with more methods is key to improving the contribution of this work. 3. The learning dynamic analysis formulation in Section 5.4 is
- The paper provides a comprehensive comparison between RFT and SFT in both learning efficiency and resistance to catastrophic forgetting, yielding valuable insights. - It proposes a method that uses model-generated reasoning trajectories for SFT, achieving both faster learning and stronger retention of prior knowledge.
- All conclusions are drawn from experiments on the jigsaw puzzle task using only the Qwen2.5-VL family. This setting is rather narrow and makes it difficult to claim general applicability. Therefore, the current experimental scope may not fully guarantee that the conclusions hold across other models or tasks. - Although Rea-GRPO-Rollout converges faster than standard RFT during training, it requires an additional stage of RFT training and reasoning trajectory generation beforehand. Hence, the o
