Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training
Song Lai, Haohan Zhao, Rong Feng, Changyi Ma, Wenzhuo Liu, Hongbo Zhao, Xi Lin, Dong Yi, Qingfu Zhang, Hongbin Liu, Gaofeng Meng, Fei Zhu

TL;DR
This paper compares supervised and reinforcement fine-tuning in continual post-training, revealing that reinforcement fine-tuning better preserves prior knowledge and enhances general capabilities without explicit regularization mechanisms.
Contribution
It introduces a comparative analysis of SFT and RFT in CPT, highlighting RFT's implicit regularization and proposing an instance filtering algorithm to improve stability and efficiency.
Findings
RFT inherently preserves prior knowledge better than SFT.
RFT maintains and enhances general knowledge on benchmarks.
Explicit regularization is not necessary for RFT's stability.
Abstract
Continual post-training (CPT) is a popular and effective technique for adapting foundation models like multimodal large language models to specific and ever-evolving downstream tasks. While existing research has primarily concentrated on methods like data replay, model expansion, or parameter regularization, the fundamental role of the learning paradigm within CPT remains largely unexplored. This paper presents a comparative analysis of two core post-training paradigms: supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), investigating their respective impacts on knowledge retention during CPT. Our experiments are conducted on a benchmark comprising seven diverse multimodal tasks, utilizing Qwen2.5-VL-7B-Instruct as the base model for continual post-training. The investigation yields two significant findings: (1) When continuously learning on downstream tasks, SFT leads to…
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
Originality & Importance: The paper addresses a previously overlooked question – whether reinforcement learning-based fine-tuning could fundamentally improve continual learning – making it a fresh and notable contribution. This exploration of the fine-tuning paradigm itself (as opposed to add-on techniques) is highly original. It targets the important problem of catastrophic forgetting in lifelong learning for foundation models, which has broad relevance for deployed AI systems. Empirical Result
Computational Complexity: RFT requires rollouts and policy optimization, making it more computationally intensive and sensitive to hyperparameters than standard SFT. While RIF improves efficiency, real-world deployments may still face engineering challenges. Evaluation Scope: The study focuses on one 7B multimodal model and QA-style tasks. Broader validation across domains, tasks, and model scales would strengthen the generality claims. Reward Dependence: RFT assumes access to clear reward signa
1. This paper presents an interesting theoretical claim: RFT mitigates catastrophic forgetting via implicit gradient regularization driven by reward variance. 2. It offers solid experiments supporting the claim that RFT reduces forgetting, and argues that the effect is not attributable to KL constraints or chain-of-thought (CoT). 3. The paper is clearly written and overall easy to follow.
1. My primary concern is that the central theorem lacks convincing empirical support. The proposed rollout-based instance filtering method (RIF-RFT), motivated by the theorem, should improve anti-forgetting via lower reward variance under the theory, yet it underperforms the baseline RFT. The authors note that filtering reduces training data, but this raises a fairness concern: why not match the effective data budget across RIF-RFT and baseline RFT to isolate the effect of variance reduction? 2. Th
1. This paper reveals an inherent forgetting-mitigation property of the reinforcement fine-tuning (RFT) paradigm, which gives RFT an advantage in post-training. 2. The analysis of the mechanism behind forgetting mitigation under RFT is in-depth and supported by theory.
1. Experiments are conducted only on Qwen2.5-VL-7B-Instruct, so it cannot be confirmed whether the forgetting-mitigation advantage of RFT is independent of the architecture. 2. Direct comparisons with mainstream continual learning methods are missing in terms of knowledge retention, storage overhead, and computational efficiency. 3. The method for selecting the filtering threshold in RIF-RFT is not provided, and the relevant ablation experiments are lacking.
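The reviews above discuss rollout-based instance filtering (RIF-RFT) and per-instance reward variance, and note that the threshold selection is not specified. As a rough illustration only, and not the authors' algorithm, the following Python sketch shows one way such a filter could look. It assumes that each training instance already has a list of scalar rewards from several rollouts, and that an instance is kept when its mean reward exceeds a cutoff. The names `summarize_rollouts`, `filter_instances`, and `min_mean_reward`, as well as the default cutoff of 0.0, are hypothetical choices made for this example.

```python
from statistics import mean, pvariance


def summarize_rollouts(rewards):
    """Return (mean, population variance) of one instance's rollout rewards.

    `rewards` must be a non-empty list of numbers.
    """
    return mean(rewards), pvariance(rewards)


def filter_instances(rollout_rewards, min_mean_reward=0.0):
    """Keep ids of instances whose mean rollout reward exceeds the cutoff.

    Assumption (not taken from the paper): with the default cutoff of 0.0
    and non-negative rewards, this drops instances on which no rollout
    earned any reward.
    """
    kept = []
    for instance_id, rewards in rollout_rewards.items():
        avg, _variance = summarize_rollouts(rewards)
        if avg > min_mean_reward:
            kept.append(instance_id)
    return kept


# Toy data: four rollouts per instance with binary rewards.
rollout_rewards = {
    "q1": [1.0, 0.0, 1.0, 1.0],  # mixed outcomes, non-zero variance
    "q2": [0.0, 0.0, 0.0, 0.0],  # never rewarded, filtered out
    "q3": [1.0, 1.0, 1.0, 1.0],  # always rewarded, zero variance
}
kept_ids = filter_instances(rollout_rewards)
```

In this toy run, `kept_ids` contains `"q1"` and `"q3"`. Reporting the variance alongside the mean makes it possible to check how a given cutoff changes the reward variance of the retained data, which is the quantity the reviewers' fairness question turns on.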
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Multimodal Machine Learning Applications
Methods: Shrink and Fine-Tune · Balanced Selection
