TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment

Jiaxuan Wang; Xuan Ouyang; Zhiyu Chen; Yulan Hu; Zheng Pan; Xin Li; Lan-Zhe Guo

arXiv:2605.10194 · cs.AI · May 12, 2026

TL;DR

TRACE is a token-routed self-distillation method that distills only on critical spans of a response, improving reinforcement learning with verifiable rewards and mitigating privileged-information leakage.

Contribution

The paper proposes a span-based distillation approach that strengthens RL training by targeting the critical segments of a response, and it outperforms previous methods on math benchmarks.

Findings

1. TRACE improves accuracy by 2.76 percentage points on average across benchmarks.

2. It maintains out-of-distribution scores where previous methods degrade.

3. Gains from online annotation are comparable to those obtained with an external annotator.

Abstract

On-policy self-distillation (self-OPD) densifies reinforcement learning with verifiable rewards (RLVR) by letting a policy teach itself under privileged context. We find that when this guidance spans the full response, all-token KL spends gradients on mostly redundant positions and amplifies privileged-information leakage, causing entropy rise, shortened reasoning, and out-of-distribution degradation in long-horizon math training. We propose Token-Routed Alignment for Critical rEasoning (TRACE), which distills only on annotator-marked critical spans: forward KL on key spans of correct rollouts, optional reverse KL on localized error spans, and GRPO on all remaining tokens, with the KL channel annealed away after a short warm-up. Our analysis explains TRACE through two effects: forward KL provides non-vanishing lift to teacher-supported tokens that the student under-allocates, while span…
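
The routing described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the function names, the string labels `"key"` and `"error"`, the linear decay schedule, and the reading of forward KL as KL(teacher || student) are all hypothetical. The per-token GRPO loss is taken as a precomputed input rather than derived here, and the abstract does not say how span tokens are treated once the KL weight reaches zero, so the sketch simply lets their loss shrink to zero.

```python
import math

def forward_kl(teacher, student, eps=1e-12):
    """KL(teacher || student) for one token's distribution (assumed direction)."""
    return sum(p * math.log((p + eps) / (q + eps))
               for p, q in zip(teacher, student) if p > 0)

def reverse_kl(teacher, student, eps=1e-12):
    """KL(student || teacher) for one token's distribution."""
    return sum(q * math.log((q + eps) / (p + eps))
               for p, q in zip(teacher, student) if q > 0)

def kl_weight(step, warmup_steps, decay_steps, base=1.0):
    """Hypothetical schedule: hold `base` during warm-up, then decay linearly to 0."""
    if step < warmup_steps:
        return base
    frac = (step - warmup_steps) / decay_steps
    return base * max(0.0, 1.0 - frac)

def routed_token_losses(student, teacher, route, grpo_loss, is_correct,
                        weight, use_reverse=True):
    """Route each token to exactly one loss channel.

    student, teacher: per-token probability distributions (lists of lists).
    route: per-token label, "key", "error", or anything else (hypothetical labels).
    grpo_loss: per-token policy-gradient loss, assumed precomputed elsewhere.
    """
    losses = []
    for s, t, r, g in zip(student, teacher, route, grpo_loss):
        if r == "key" and is_correct:
            # Forward KL on key spans of correct rollouts.
            losses.append(weight * forward_kl(t, s))
        elif r == "error" and not is_correct and use_reverse:
            # Optional reverse KL on localized error spans.
            losses.append(weight * reverse_kl(t, s))
        else:
            # All remaining tokens keep the GRPO loss.
            losses.append(g)
    return losses
```

For example, calling `routed_token_losses` on a correct rollout whose first token is labeled `"key"` applies the weighted forward KL to that token and leaves every other token on its GRPO loss. Keeping the channels mutually exclusive per token is one simple way to express "distill only on marked spans"; a real implementation would operate on batched logits in an autodiff framework.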
