TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment
Jiaxuan Wang, Xuan Ouyang, Zhiyu Chen, Yulan Hu, Zheng Pan, Xin Li, Lan-Zhe Guo

TL;DR
TRACE is a token-routed self-distillation method that distills only on annotator-marked critical spans of a response, rather than on every token, to improve reinforcement learning performance and mitigate privileged-information leakage.
Contribution
It proposes a span-based distillation approach for RL training: forward KL on key spans of correct rollouts, optional reverse KL on localized error spans, and GRPO on all remaining tokens. The method outperforms previous methods on math benchmarks.
Findings
TRACE improves accuracy by 2.76 percentage points on average across benchmarks.
It maintains out-of-distribution scores where previous methods degrade.
Gains from online annotation are comparable to those obtained with an external annotator.
Abstract
On-policy self-distillation (self-OPD) densifies reinforcement learning with verifiable rewards (RLVR) by letting a policy teach itself under privileged context. We find that when this guidance spans the full response, all-token KL spends gradients on mostly redundant positions and amplifies privileged-information leakage, causing entropy rise, shortened reasoning, and out-of-distribution degradation in long-horizon math training. We propose Token-Routed Alignment for Critical rEasoning (TRACE), which distills only on annotator-marked critical spans: forward KL on key spans of correct rollouts, optional reverse KL on localized error spans, and GRPO on all remaining tokens, with the KL channel annealed away after a short warm-up. Our analysis explains TRACE through two effects: forward KL provides non-vanishing lift to teacher-supported tokens that the student under-allocates, while span…
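The token routing described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The route labels (`"key"`, `"error"`, `"rest"`), the linear annealing schedule, the unclipped policy-gradient term standing in for GRPO, and all function names are assumptions made for illustration; the abstract does not specify any of them.

```python
import math

def forward_kl(teacher, student):
    """KL(teacher || student) over one token's vocabulary distribution."""
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)

def reverse_kl(teacher, student):
    """KL(student || teacher) over one token's vocabulary distribution."""
    return sum(s * math.log(s / t) for t, s in zip(teacher, student) if s > 0)

def kl_weight(step, warmup_steps, beta0=1.0):
    """Assumed schedule: linear decay from beta0 to zero over warmup_steps."""
    return beta0 * max(0.0, 1.0 - step / warmup_steps)

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (reward - group mean) / group std."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

def routed_token_loss(route, teacher, student, sampled_id, advantage, beta):
    """Pick the loss for one token from its (hypothetical) route label."""
    if route == "key":      # key span of a correct rollout
        return beta * forward_kl(teacher, student)
    if route == "error":    # localized error span
        return beta * reverse_kl(teacher, student)
    # All remaining tokens: simplified, unclipped policy-gradient surrogate.
    return -advantage * math.log(student[sampled_id])

# Toy example: a two-token vocabulary, one token per route.
teacher = [0.5, 0.5]   # policy under privileged context
student = [0.9, 0.1]   # policy without privileged context
beta = kl_weight(step=5, warmup_steps=10)
advantage = group_advantages([1.0, 0.0])[0]
for route in ("key", "error", "rest"):
    print(route, routed_token_loss(route, teacher, student, 0, advantage, beta))
```

Once `kl_weight` reaches zero, the `"key"` and `"error"` routes contribute nothing and only the reward-driven term remains, which is one way to read "annealed away after a short warm-up".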
