Patch the Distribution Mismatch: RL Rewriting Agent for Stable Off-Policy SFT
Jiacheng Wang, Ping Jian, Zhen Yang, Zirong Chen, Keren Liao, Zhongbin Guo

TL;DR
This paper introduces an RL-based data rewriting agent that improves the alignment and diversity of training data for fine-tuning large language models, reducing catastrophic forgetting and enhancing downstream performance.
Contribution
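The paper proposes a reinforcement learning approach that learns a data rewriting policy matched to the backbone model's QA-style generation distribution while preserving diversity. This targets two limitations of existing rewriting methods: rewrites sampled from a prompt-induced conditional distribution may not align with the model's natural generation distribution, and fixed templates can cause diversity collapse.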
Findings
Achieves comparable downstream performance to standard SFT.
Reduces forgetting on non-downstream benchmarks by 12.34%.
Effectively balances distributional alignment and diversity during data rewriting.
Abstract
Large language models (LLMs) have made rapid progress, yet adapting them to downstream scenarios still commonly relies on supervised fine-tuning (SFT). When downstream data exhibit a substantial distribution shift from the model's prior training distribution, SFT can induce catastrophic forgetting. To narrow this gap, data rewriting has been proposed as a data-centric approach that rewrites downstream training data prior to SFT. However, existing methods typically sample rewrites from a prompt-induced conditional distribution, so the resulting targets are not necessarily aligned with the model's natural QA-style generation distribution. Moreover, reliance on fixed templates can lead to diversity collapse. To address these issues, we cast data rewriting as a policy learning problem and learn a rewriting policy that better matches the backbone's QA-style generation distribution while…
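As a rough illustration of the trade-off the abstract describes, the sketch below scores a candidate rewrite by combining an alignment term with a diversity term. It is not the paper's reward: the unigram table `unigram_probs` stands in for the backbone's generation distribution, distinct-n is a swapped-in diversity measure, and the names `mean_log_prob`, `distinct_n`, `rewrite_reward`, and the weight `alpha` are all hypothetical.

```python
import math


def mean_log_prob(tokens, unigram_probs, floor=1e-6):
    """Average token log-probability under a toy unigram model.

    Assumption: the unigram table is only a stand-in for the backbone
    model's QA-style generation distribution.
    """
    if not tokens:
        return 0.0
    return sum(math.log(unigram_probs.get(t, floor)) for t in tokens) / len(tokens)


def distinct_n(texts, n=2):
    """Fraction of unique n-grams across a batch of whitespace-tokenized texts."""
    ngrams = []
    for text in texts:
        toks = text.split()
        ngrams.extend(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def rewrite_reward(rewrite, batch, unigram_probs, alpha=0.5):
    """Hypothetical scalar reward: weighted sum of alignment and batch diversity."""
    alignment = mean_log_prob(rewrite.split(), unigram_probs)
    diversity = distinct_n(batch)
    return alpha * alignment + (1 - alpha) * diversity
```

In a policy-learning setup, a scalar of this kind would be the signal used to update the rewriting policy. A batch of identical rewrites lowers the diversity term, while rewrites that are unlikely under the stand-in model lower the alignment term.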
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Healthcare
