HDPO: Hybrid Distillation Policy Optimization via Privileged Self-Distillation

Ken Ding

arXiv:2603.23871 · cs.LG · March 26, 2026

Open Access

TL;DR

HDPO enhances reinforcement learning for mathematical reasoning in large language models by incorporating privileged self-distillation on failure prompts, leading to improved problem-solving coverage without sacrificing accuracy.

Contribution

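The paper introduces HDPO, a method that combines RL with privileged self-distillation targeting cliff prompts (prompts on which every rollout fails). It gives a provable bound on the teacher-student realizability gap and reports empirical improvements in coverage.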

Findings

01. Improves coverage metrics on OpenMathInstruct-2
02. Maintains greedy accuracy during training
03. Provides a controllable exploration-exploitation tradeoff

Abstract

Large language models trained with reinforcement learning (RL) for mathematical reasoning face a fundamental challenge: on problems the model cannot solve at all - "cliff" prompts - the RL gradient vanishes entirely, preventing any learning signal from reaching these failure modes. We introduce Hybrid Distillation Policy Optimization (HDPO), which augments standard RL with privileged self-distillation targeting cliff prompts. On each training step, HDPO identifies prompts where all rollouts fail, generates privileged rollouts by providing the model with ground-truth information, filters for correct solutions, and distills the teacher's token-level distribution into the student. Because teacher and student share the same weights - differing only in their input - the realizability gap is provably bounded, unlike cross-model distillation. We prove that R=1 filtered privileged generation…
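
The per-step procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation. Several details are assumptions made only for illustration: the mean-baseline advantage (a common GRPO-style choice, used here to show why an all-fail prompt yields no gradient signal), the forward KL direction for the distillation loss, and all function names and the batch layout, which are hypothetical.

```python
import math

def group_advantages(rewards):
    """Mean-baseline advantages for one prompt's rollouts.
    Assumption: a GRPO-style baseline; the abstract does not name the RL algorithm."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def is_cliff(rewards):
    """A cliff prompt is one where all rollouts fail (reward 0)."""
    return all(r == 0 for r in rewards)

def log_softmax(logits):
    """Numerically stable log-softmax over one token position's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at one token position.
    Assumption: forward KL; the abstract only says the teacher's
    token-level distribution is distilled into the student."""
    lp = log_softmax(teacher_logits)
    lq = log_softmax(student_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

def distill_loss(teacher_seq, student_seq):
    """Mean token-level KL over one rollout (lists of per-token logit vectors)."""
    kls = [token_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(kls) / len(kls)

def select_distill_targets(batch):
    """Keep correct privileged rollouts, but only for cliff prompts.
    Hypothetical batch layout: 'rewards' holds the student's rollout rewards,
    'privileged' holds (rollout, is_correct) pairs generated with ground-truth input."""
    targets = []
    for item in batch:
        if is_cliff(item["rewards"]):
            targets.extend(r for r, ok in item["privileged"] if ok)
    return targets

batch = [
    {"rewards": [0, 0, 0, 0], "privileged": [("sol_a", True), ("sol_b", False)]},
    {"rewards": [1, 0, 0, 1], "privileged": [("sol_c", True)]},
]
print(group_advantages(batch[0]["rewards"]))  # [0.0, 0.0, 0.0, 0.0]: no RL signal
print(select_distill_targets(batch))          # ['sol_a']
```

In this sketch the first prompt is a cliff prompt: its advantages are all zero, so only the distillation term can contribute a learning signal, and only the privileged rollout that passed the correctness filter is kept. The second prompt has mixed outcomes and is left to the RL objective.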

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Intelligent Tutoring Systems and Adaptive Learning