TMS: Trajectory-Mixed Supervision for Reward-Free, On-Policy SFT

Rana Muhammad Shahroz Khan; Zijie Liu; Zhen Tan; Charles Fleming; Tianlong Chen

arXiv:2602.03073·cs.LG·February 4, 2026

TL;DR

TMS is a reward-free method that builds a dynamic curriculum from the model's own historical checkpoints to approximate the on-policy benefits of RL. It reduces forgetting in language models and improves performance on reasoning and instruction tasks without complex reward engineering.

Contribution

TMS is a reward-free framework that minimizes Policy-Label Divergence (PLD), narrowing the gap between SFT and RL in both capability retention and task performance.

Findings

01. TMS outperforms standard SFT on reasoning and instruction benchmarks.

02. TMS reduces catastrophic forgetting in LLMs.

03. Policy-Label Divergence (PLD) drift predicts model forgetting, and TMS mitigates it.

Abstract

Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) are the two dominant paradigms for enhancing Large Language Model (LLM) performance on downstream tasks. While RL generally preserves broader model capabilities (retention) better than SFT, it comes with significant costs: complex reward engineering, instability, and expensive on-policy sampling. In contrast, SFT is efficient but brittle, often suffering from catastrophic forgetting due to Supervision Mismatch: the divergence between the model's evolving policy and static training labels. We address this trade-off with Trajectory-Mixed Supervision (TMS), a reward-free framework that approximates the on-policy benefits of RL by creating a dynamic curriculum from the model's own historical checkpoints. TMS minimizes Policy-Label Divergence (PLD), preventing the mode collapse that drives…
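The abstract is truncated here and does not specify the algorithm, so the following is only a minimal sketch of the general idea of mixing supervision targets, not the authors' implementation. Every name in it (`label_nll`, `mix_targets`, `mix_ratio`, `checkpoint_outputs`) is hypothetical, and using the mean negative log-likelihood of the labels as a stand-in for a policy-label divergence is an assumption made purely for illustration.

```python
import math
import random
from typing import Dict, List, Sequence


def label_nll(label_token_probs: Sequence[float]) -> float:
    """Mean negative log-probability the current policy gives the label tokens.

    Assumption: a simple stand-in for a policy-label divergence, not the
    paper's definition of PLD.
    """
    return -sum(math.log(p) for p in label_token_probs) / len(label_token_probs)


def mix_targets(
    prompts: Sequence[str],
    static_labels: Dict[str, str],
    checkpoint_outputs: Dict[str, List[str]],
    mix_ratio: float,
    rng: random.Random,
) -> List[str]:
    """Pick one training target per prompt (hypothetical mixing rule).

    With probability mix_ratio the target is a response produced by an
    earlier checkpoint of the model; otherwise it is the static dataset label.
    """
    targets = []
    for prompt in prompts:
        candidates = checkpoint_outputs.get(prompt, [])
        if candidates and rng.random() < mix_ratio:
            targets.append(rng.choice(candidates))
        else:
            targets.append(static_labels[prompt])
    return targets
```

In this sketch, `mix_ratio=0.0` reduces to ordinary SFT on static labels, while larger values shift supervision toward text the model itself produced at earlier checkpoints, which is one plausible way to keep targets closer to the evolving policy without a reward model.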

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications