SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting

Binbin Zheng; Xing Ma; Yiheng Liang; Jingqing Ruan; Xiaoliang Fu; Kepeng Lin; Benchang Zhu; Ke Zeng; Xunliang Cai

arXiv:2604.10688·cs.LG·April 14, 2026

TL;DR

SCOPE is a dual-path adaptive training framework for on-policy distillation of language models. It routes on-policy rollouts by correctness and calibrates the teacher's token-level supervision according to trajectory correctness and difficulty, improving reasoning performance.

Contribution

The paper proposes a dual-path adaptive distillation method that dynamically weights supervision signals according to trajectory correctness and difficulty.

Findings

01

SCOPE achieves an average 11.42% improvement in Avg@32 over baselines.

02

SCOPE attains a 7.30% increase in Pass@32 across six reasoning benchmarks.

03

Extensive experiments validate the effectiveness of the adaptive weighting scheme.

Abstract

On-policy reinforcement learning has become the dominant paradigm for reasoning alignment in large language models, yet its sparse, outcome-level rewards make token-level credit assignment notoriously difficult. On-Policy Distillation (OPD) alleviates this by introducing dense, token-level KL supervision from a teacher model, but typically applies this supervision uniformly across all rollouts, ignoring fundamental differences in signal quality. We propose Signal-Calibrated On-Policy Distillation Enhancement (SCOPE), a dual-path adaptive training framework that routes on-policy rollouts by correctness into two complementary supervision paths. For incorrect trajectories, SCOPE performs teacher-perplexity-weighted KL distillation to prioritize instances where the teacher demonstrates genuine corrective capability, while down-weighting unreliable guidance. For correct trajectories, it…
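The incorrect-trajectory path described in the abstract can be illustrated with a minimal sketch. It covers only that path, since the abstract is truncated before the correct-trajectory path is described. The weighting rule is an assumption made for illustration: the abstract does not state the formula, so this sketch uses a softmax over negative log-perplexity, giving larger weights to rollouts on which the teacher has lower perplexity. The function names, the `temperature` parameter, and the rollout dictionary layout are all hypothetical and are not taken from the paper or its repository.

```python
import math


def teacher_perplexity(teacher_logprobs):
    """Teacher perplexity on one rollout: exp of the mean negative log-prob."""
    return math.exp(-sum(teacher_logprobs) / len(teacher_logprobs))


def perplexity_weights(perplexities, temperature=1.0):
    """Turn per-rollout teacher perplexities into weights that sum to 1.

    ASSUMPTION: softmax over negative log-perplexity, so a lower teacher
    perplexity yields a larger weight. The paper's actual rule may differ.
    """
    scores = [-math.log(p) / temperature for p in perplexities]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def incorrect_path_loss(rollouts):
    """Weighted mean token-level KL over the incorrect rollouts only.

    Each rollout is a dict (hypothetical layout) with:
      'correct'          -> bool, outcome of the rollout
      'teacher_logprobs' -> teacher log-probs of the sampled tokens
      'token_kl'         -> per-token KL between student and teacher
    Correct rollouts are skipped here; they belong to the other path.
    """
    incorrect = [r for r in rollouts if not r["correct"]]
    if not incorrect:
        return 0.0
    ppls = [teacher_perplexity(r["teacher_logprobs"]) for r in incorrect]
    weights = perplexity_weights(ppls)
    return sum(
        w * sum(r["token_kl"]) / len(r["token_kl"])
        for w, r in zip(weights, incorrect)
    )
```

In an actual training loop the per-token KL values would come from the student and teacher logits on the sampled tokens; plain Python lists are used here only to keep the sketch self-contained.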

Code & Models

Repositories

machine981/SCOPE
github

Models

Datasets

Machine981/SCOPE
dataset · 63 downloads
