Patch the Distribution Mismatch: RL Rewriting Agent for Stable Off-Policy SFT

Jiacheng Wang; Ping Jian; Zhen Yang; Zirong Chen; Keren Liao; Zhongbin Guo

arXiv:2602.11220 · cs.LG · February 13, 2026

TL;DR

This paper introduces an RL-based data rewriting agent that improves the alignment and diversity of training data for fine-tuning large language models, reducing catastrophic forgetting and enhancing downstream performance.

Contribution

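The paper casts data rewriting as a policy learning problem. It uses reinforcement learning to train a rewriting policy that matches the backbone model's QA-style generation distribution while preserving diversity. This addresses two limitations of existing rewriting methods: rewrites sampled from a prompt-induced conditional distribution are not necessarily aligned with the model's natural outputs, and fixed templates can cause diversity collapse.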

Findings

01 Achieves downstream performance comparable to standard SFT.

02 Reduces forgetting on non-downstream benchmarks by 12.34%.

03 Effectively balances distributional alignment and diversity during data rewriting.

Abstract

Large language models (LLMs) have made rapid progress, yet adapting them to downstream scenarios still commonly relies on supervised fine-tuning (SFT). When downstream data exhibit a substantial distribution shift from the model's prior training distribution, SFT can induce catastrophic forgetting. To narrow this gap, data rewriting has been proposed as a data-centric approach that rewrites downstream training data prior to SFT. However, existing methods typically sample rewrites from a prompt-induced conditional distribution, so the resulting targets are not necessarily aligned with the model's natural QA-style generation distribution. Moreover, reliance on fixed templates can lead to diversity collapse. To address these issues, we cast data rewriting as a policy learning problem and learn a rewriting policy that better matches the backbone's QA-style generation distribution while…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Healthcare