TMS: Trajectory-Mixed Supervision for Reward-Free, On-Policy SFT
Rana Muhammad Shahroz Khan, Zijie Liu, Zhen Tan, Charles Fleming, Tianlong Chen

TL;DR
TMS introduces a reward-free, dynamic curriculum method that reduces forgetting in language models by approximating RL benefits, improving performance on reasoning and instruction tasks without complex reward engineering.
Contribution
TMS presents a novel reward-free framework that minimizes policy-label divergence, bridging the gap between SFT and RL in model retention and performance.
Findings
TMS outperforms standard SFT on reasoning and instruction benchmarks.
TMS effectively reduces catastrophic forgetting in LLMs.
Drift in policy-label divergence (PLD) predicts model forgetting and is mitigated by TMS.
Abstract
Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) are the two dominant paradigms for enhancing Large Language Model (LLM) performance on downstream tasks. While RL generally preserves broader model capabilities (retention) better than SFT, it comes with significant costs: complex reward engineering, instability, and expensive on-policy sampling. In contrast, SFT is efficient but brittle, often suffering from catastrophic forgetting due to the divergence between the model's evolving policy and static training labels. We address this trade-off with Trajectory-Mixed Supervision (TMS), a reward-free framework that approximates the on-policy benefits of RL by creating a dynamic curriculum from the model's own historical checkpoints. TMS minimizes Policy-Label Divergence (PLD), preventing the mode collapse that drives…
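To make the idea of divergence between a policy and its training labels concrete, here is a minimal sketch of one possible proxy: the mean negative log-probability that the current policy assigns to the label tokens. This is an illustrative assumption, not the paper's definition of PLD; the function name `label_divergence_proxy` and the toy probabilities are hypothetical.

```python
import math


def label_divergence_proxy(token_probs):
    """Mean negative log-probability of the label tokens under the policy.

    ASSUMPTION: this is an illustrative proxy, not the paper's PLD formula.
    token_probs holds one probability in (0, 1] per label token.
    """
    if not token_probs:
        raise ValueError("need at least one label token")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


# Toy numbers (hypothetical): a static label the policy finds unlikely,
# and a target sampled from an earlier checkpoint that it finds likely.
static_label = [0.05, 0.10, 0.02]
checkpoint_target = [0.60, 0.70, 0.50]

static_score = label_divergence_proxy(static_label)
checkpoint_score = label_divergence_proxy(checkpoint_target)
print(static_score > checkpoint_score)
```

Under this proxy, targets closer to what the policy already produces score lower, which is the intuition behind supervising with the model's own historical checkpoints rather than only static labels.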
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
