Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation

Anhao Zhao; Haoran Xin; Yingqi Fan; Junlong Tong; Wenjie Li; Xiaoyu Shen

arXiv:2605.16826 · cs.LG · May 19, 2026

TL;DR

This paper unifies LLM distillation methods by decoupling the token-level KL direction from the prefix source, and introduces practical techniques such as KL mixing and entropy-gated curricula to improve reasoning performance.

Contribution

It shows that prefix source and KL direction are orthogonal choices in distillation objectives, gives a gradient-level account of each KL direction, and proposes new methods to balance accuracy and diversity in LLM training.

Findings

01 Decoupling KL direction and prefix source yields four valid objectives.

02 KL direction influences the accuracy-entropy tradeoff.

03 An entropy-gated curriculum improves reasoning accuracy and reduces response length.
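
The four objectives in the first finding can be illustrated with a toy sketch. This is not the authors' implementation: the tabular policies, the names `make_policy`, `sample_prefixes`, and `objective`, and the vocabulary and length constants are all hypothetical, chosen only to show that prefix source and token-level KL direction can be set independently. Per the abstract, the two conventionally coupled cells are (teacher prefixes, forward KL) and (student prefixes, reverse KL); the other two cells are the decoupled combinations.

```python
import itertools
import numpy as np

VOCAB, LENGTH = 4, 3  # toy sizes (assumed, for illustration only)

def make_policy(seed):
    """Hypothetical tabular policy: one next-token distribution per prefix."""
    rng = np.random.default_rng(seed)
    table = {}
    for t in range(LENGTH):
        for prefix in itertools.product(range(VOCAB), repeat=t):
            logits = rng.normal(size=VOCAB)
            probs = np.exp(logits - logits.max())
            table[prefix] = probs / probs.sum()
    return table

def token_kl(p, q):
    """KL(p || q) between two strictly positive categorical distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def sample_prefixes(policy, n, rng):
    """Roll out n sequences from `policy` and collect every visited prefix."""
    prefixes = []
    for _ in range(n):
        prefix = ()
        for _step in range(LENGTH):
            prefixes.append(prefix)
            token = int(rng.choice(VOCAB, p=policy[prefix]))
            prefix = prefix + (token,)
    return prefixes

def objective(teacher, student, prefix_source, kl_direction, n=200, seed=0):
    """Monte Carlo estimate of one of the four (prefix source, KL direction) objectives."""
    rng = np.random.default_rng(seed)
    source = teacher if prefix_source == "teacher" else student
    total = 0.0
    for prefix in sample_prefixes(source, n, rng):
        if kl_direction == "forward":   # KL(teacher || student) at this prefix
            total += token_kl(teacher[prefix], student[prefix])
        else:                           # KL(student || teacher) at this prefix
            total += token_kl(student[prefix], teacher[prefix])
    return total / n  # mean over sequences of the summed token-level KLs

if __name__ == "__main__":
    teacher, student = make_policy(1), make_policy(2)
    for src in ("teacher", "student"):
        for direction in ("forward", "reverse"):
            value = objective(teacher, student, src, direction)
            print(f"prefixes={src:7s} kl={direction:7s} estimate={value:.4f}")
```

All four estimates are non-negative and vanish when the student equals the teacher, which is the sense in which each combination is a valid matching objective in this toy setting.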

Abstract

Knowledge distillation is central to LLM post-training, yet its design space remains poorly understood, especially alongside reinforcement learning (RL). We show that the prevailing paradigms, off-policy distillation and on-policy distillation (OPD), implicitly couple two orthogonal choices: prefix source and token-level KL direction. This follows from decomposing sequence-level KL over autoregressive response distributions: forward KL pairs teacher prefixes with token-level forward KL, and reverse KL pairs student prefixes with token-level reverse KL. We argue this coupling is not intrinsic: decoupling the two axes yields four valid objectives. We establish gradient-level identities showing forward KL gives SFT-style cross-entropy matching with teacher soft targets, whereas reverse KL gives an RL-style policy-gradient objective with a dense teacher-student log-ratio reward, connecting…
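
The decomposition the abstract refers to can be written out as a short worked derivation. The notation here is assumed rather than taken from the paper: $x$ is the prompt, $y$ a response, $\pi_T$ the teacher, and $\pi_\theta$ the student. The first two lines are the standard chain rule for KL over autoregressive distributions; the last two are per-token gradients at a fixed prefix $s = (x, y_{<t})$, given as a sketch of the kind of identity described, which may differ in form from the paper's own statements.

```latex
% Assumed notation: \pi_T = teacher, \pi_\theta = student, s = (x, y_{<t}).
\begin{align}
\mathrm{KL}\big(\pi_T(\cdot \mid x) \,\|\, \pi_\theta(\cdot \mid x)\big)
  &= \mathbb{E}_{y \sim \pi_T(\cdot \mid x)} \sum_{t=1}^{|y|}
     \mathrm{KL}\big(\pi_T(\cdot \mid x, y_{<t}) \,\|\, \pi_\theta(\cdot \mid x, y_{<t})\big), \\
\mathrm{KL}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_T(\cdot \mid x)\big)
  &= \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \sum_{t=1}^{|y|}
     \mathrm{KL}\big(\pi_\theta(\cdot \mid x, y_{<t}) \,\|\, \pi_T(\cdot \mid x, y_{<t})\big), \\
\nabla_\theta \, \mathrm{KL}\big(\pi_T(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\big)
  &= -\sum_{v} \pi_T(v \mid s) \, \nabla_\theta \log \pi_\theta(v \mid s), \\
\nabla_\theta \, \mathrm{KL}\big(\pi_\theta(\cdot \mid s) \,\|\, \pi_T(\cdot \mid s)\big)
  &= \mathbb{E}_{v \sim \pi_\theta(\cdot \mid s)}
     \Big[ \nabla_\theta \log \pi_\theta(v \mid s) \,
     \big( \log \pi_\theta(v \mid s) - \log \pi_T(v \mid s) \big) \Big].
\end{align}
```

The third line is the gradient of a cross-entropy loss with the teacher distribution as soft targets. The fourth has the score-function form of a policy gradient in which minimizing the KL corresponds to maximizing the per-token reward $\log \pi_T(v \mid s) - \log \pi_\theta(v \mid s)$; the constant term from differentiating $\log \pi_\theta$ drops out because $\mathbb{E}_{v \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(v \mid s)] = 0$. Both gradients hold the prefix $s$ fixed, so they do not include the dependence of student-sampled prefixes on $\theta$.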
