HDPO: Hybrid Distillation Policy Optimization via Privileged Self-Distillation
Ken Ding

TL;DR
HDPO augments reinforcement learning for mathematical reasoning in large language models with privileged self-distillation on "cliff" prompts (prompts where every rollout fails), improving problem-solving coverage while maintaining greedy accuracy.
Contribution
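The paper introduces HDPO, a method that combines RL with privileged self-distillation targeting cliff prompts. Because teacher and student share the same weights and differ only in their input, the realizability gap is provably bounded, and the method improves coverage metrics on OpenMathInstruct-2.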
Findings
Improves coverage metrics on OpenMathInstruct-2
Maintains greedy accuracy during training
Provides a controllable exploration-exploitation tradeoff
Abstract
Large language models trained with reinforcement learning (RL) for mathematical reasoning face a fundamental challenge: on problems the model cannot solve at all - "cliff" prompts - the RL gradient vanishes entirely, preventing any learning signal from reaching these failure modes. We introduce Hybrid Distillation Policy Optimization (HDPO), which augments standard RL with privileged self-distillation targeting cliff prompts. On each training step, HDPO identifies prompts where all rollouts fail, generates privileged rollouts by providing the model with ground-truth information, filters for correct solutions, and distills the teacher's token-level distribution into the student. Because teacher and student share the same weights - differing only in their input - the realizability gap is provably bounded, unlike cross-model distillation. We prove that R=1 filtered privileged generation…
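The steps named in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the group-normalized advantage is an assumed GRPO-style baseline used only to show why the signal vanishes, the forward KL is one assumed choice of token-level distillation loss, and every function name here is hypothetical.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages (assumed GRPO-style baseline, not from the source)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_cliff(rewards):
    """A prompt is a 'cliff' when every rollout fails (all rewards are zero)."""
    return all(r == 0 for r in rewards)


def select_cliff_prompts(rewards_by_prompt):
    """Return the prompts whose rollouts all failed on this training step."""
    return [p for p, rs in rewards_by_prompt.items() if is_cliff(rs)]


def filter_correct(rollouts, is_correct):
    """Keep only the privileged rollouts that reach a correct answer."""
    return [r for r in rollouts if is_correct(r)]


def token_kl(teacher_probs, student_probs):
    """Forward KL(teacher || student) at one token position (assumed loss form).

    Assumes student_probs are strictly positive wherever teacher_probs are.
    """
    return sum(
        t * math.log(t / s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )


# Hypothetical binary rewards for two prompts, four rollouts each.
rewards_by_prompt = {
    "prompt_a": [1, 0, 0, 1],  # mixed outcomes: RL has a signal
    "prompt_b": [0, 0, 0, 0],  # all rollouts fail: a cliff prompt
}

cliffs = select_cliff_prompts(rewards_by_prompt)
mixed_adv = group_advantages(rewards_by_prompt["prompt_a"])
cliff_adv = group_advantages(rewards_by_prompt["prompt_b"])

# Hypothetical privileged rollouts, filtered by final answer.
privileged = [{"answer": "42"}, {"answer": "41"}]
kept = filter_correct(privileged, lambda r: r["answer"] == "42")
```

In this sketch every advantage for `prompt_b` is zero, so a policy-gradient update weighted by these advantages contributes nothing for that prompt. That is the gap the distillation term is meant to fill: only the prompts returned by `select_cliff_prompts` would receive privileged rollouts, and only the rollouts kept by `filter_correct` would serve as teacher targets.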
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Intelligent Tutoring Systems and Adaptive Learning
