Scaling Reasoning Efficiently via Relaxed On-Policy Distillation

Jongwoo Ko; Sara Abdali; Young Jin Kim; Tianyi Chen; Pashmina Cameron

arXiv:2603.11137·cs.LG·March 13, 2026

Open Access

TL;DR

This paper introduces REOPOLD, a novel framework for stabilizing on-policy distillation by relaxing imitation constraints, leading to more efficient training and better reasoning performance in capacity-limited models.

Contribution

REOPOLD provides a new method that stabilizes on-policy distillation through reward relaxation techniques, improving sample efficiency and inference scaling.

Findings

01. REOPOLD achieves 6.7-12x greater sample efficiency than recent RL methods.

02. REOPOLD enables a 7B student to match a 32B teacher in visual reasoning.

03. REOPOLD improves inference speed by approximately 3.32 times.

Abstract

On-policy distillation is pivotal for transferring reasoning capabilities to capacity-constrained models, yet remains prone to instability and negative transfer. We show that on-policy distillation can be interpreted, both theoretically and empirically, as a form of policy optimization, where the teacher-student log-likelihood ratio acts as a token reward. From this insight, we introduce REOPOLD (Relaxed On-Policy Distillation), a framework that stabilizes optimization by relaxing the strict imitation constraints of standard on-policy distillation. Specifically, REOPOLD temperately and selectively leverages rewards from the teacher through mixture-based reward clipping, entropy-based token-level dynamic sampling, and a unified exploration-to-refinement training strategy. Empirically, REOPOLD surpasses its baselines with superior sample efficiency during training and enhanced test-time…
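To make the "log-likelihood ratio as a token reward" reading concrete, here is a minimal sketch in plain Python. The first function computes the reward the abstract describes for tokens sampled from the student. The second, `mixed_token_rewards`, is a hypothetical illustration only: it scores the student against a teacher/student mixture, which bounds the reward from below. The text here does not specify REOPOLD's actual clipping rule, so treat that function, the `lam` parameter, and the made-up probabilities as assumptions.

```python
import math


def token_rewards(teacher_logprobs, student_logprobs):
    """Per-token reward: log p_teacher(y_t) - log p_student(y_t).

    Both lists hold log-probabilities of the same student-sampled tokens.
    """
    if len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("sequences must have the same length")
    return [t - s for t, s in zip(teacher_logprobs, student_logprobs)]


def mixed_token_rewards(teacher_logprobs, student_logprobs, lam=0.5):
    """Hypothetical bounded variant (an assumption, not the paper's rule).

    Uses log((lam * p_T + (1 - lam) * p_S) / p_S), which can never fall
    below log(1 - lam) however small the teacher probability is.
    """
    rewards = []
    for t, s in zip(teacher_logprobs, student_logprobs):
        ratio = math.exp(t - s)  # p_T / p_S for this token
        rewards.append(math.log(lam * ratio + (1.0 - lam)))
    return rewards


# Made-up probabilities for three tokens sampled from the student.
teacher = [math.log(0.5), math.log(0.2), math.log(0.001)]
student = [math.log(0.25), math.log(0.4), math.log(0.5)]

plain = token_rewards(teacher, student)
mixed = mixed_token_rewards(teacher, student, lam=0.5)

for p, m in zip(plain, mixed):
    print(f"plain reward {p:+.3f}   mixture-bounded reward {m:+.3f}")
```

Averaging the plain rewards over student samples gives a Monte Carlo estimate of the negative reverse KL divergence between student and teacher, which is why maximizing this reward amounts to imitation. The third token shows the instability risk: the teacher assigns it almost no probability, so the plain reward is a large negative outlier, while the mixture-bounded version stays above log(1 - lam).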

Taxonomy

Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning