TL;DR
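This paper introduces Hybrid Policy Distillation (HPD), a knowledge distillation method for large language models that combines forward and reverse KL divergence and pairs off-policy data with lightweight, approximate on-policy sampling. The goal is more efficient and stable distillation on long-generation math reasoning and short-generation dialogue and code tasks.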
Contribution
It presents a unified view of existing knowledge distillation (KD) methods, reformulates KD as a reweighted log-likelihood objective at the token level, and proposes HPD to improve optimization stability, computational efficiency, and final performance.
Findings
HPD improves optimization stability, computational efficiency, and final performance across diverse model families and scales.
HPD balances mode coverage (from forward KL) against mode-seeking (from reverse KL).
Code is available at https://github.com/zwhong714/Hybrid-Policy-Distillation.
Abstract
Knowledge distillation (KD) is a powerful paradigm for compressing large language models (LLMs), whose effectiveness depends on intertwined choices of divergence direction, optimization strategy, and data regime. We break down the design of existing KD methods and present a unified view that establishes connections between them, reformulating KD as a reweighted log-likelihood objective at the token level. We further propose Hybrid Policy Distillation (HPD), which integrates the complementary advantages of forward and reverse KL to balance mode coverage and mode-seeking, and combines off-policy data with lightweight, approximate on-policy sampling. We validate HPD on long-generation math reasoning as well as short-generation dialogue and code tasks, demonstrating improved optimization stability, computational efficiency, and final performance across diverse model families and scales. The…
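To make the forward/reverse KL trade-off concrete, here is a minimal sketch of a token-level objective that interpolates between the two divergences. It is a generic illustration, not the HPD objective itself: the mixing weight `alpha`, the function names, and the plain linear interpolation are all assumptions made for this example, since the summary does not state the paper's actual reweighting.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def hybrid_kl_per_token(teacher_logits, student_logits, alpha=0.5):
    """Return (mixed, forward, reverse) divergences for one token position.

    alpha is a hypothetical mixing weight (an assumption of this sketch):
    alpha=1 gives pure forward KL, alpha=0 gives pure reverse KL.
    """
    p = softmax(teacher_logits)  # teacher distribution
    q = softmax(student_logits)  # student distribution
    forward = kl(p, q)  # KL(teacher || student): mode-covering
    reverse = kl(q, p)  # KL(student || teacher): mode-seeking
    return alpha * forward + (1 - alpha) * reverse, forward, reverse


def sequence_loss(teacher_seq, student_seq, alpha=0.5):
    """Average the mixed per-token divergence over a sequence of positions."""
    losses = [
        hybrid_kl_per_token(t, s, alpha)[0]
        for t, s in zip(teacher_seq, student_seq)
    ]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    teacher = [[2.0, 0.0, -1.0], [0.5, 0.5, 3.0]]
    student = [[0.0, 0.0, 0.0], [0.5, 0.5, 3.0]]
    print(sequence_loss(teacher, student, alpha=0.5))
```

In a real training loop the same quantities would be computed on framework tensors so that gradients flow to the student, and the sequences would come from either a fixed dataset (off-policy) or student samples (on-policy). The pure-Python version above is only meant to show which distribution sits on which side of each divergence.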
