SOD: Step-wise On-policy Distillation for Small Language Model Agents

Qiyong Zhong; Mao Zheng; Mingyang Song; Xin Lin; Jie Sun; Houcheng Jiang; Xiang Wang; Junfeng Fang

arXiv:2605.07725 · cs.CL · May 11, 2026

TL;DR

SOD introduces a step-wise on-policy distillation method that adaptively reweights supervision signals for small language models, improving reasoning accuracy and stability in tool-integrated tasks.

Contribution

The paper proposes SOD, a distillation framework that mitigates cascading errors through step-wise reweighting of the distillation signal, improving small models' reasoning on complex benchmarks.

Findings

01 SOD achieves up to 20.86% improvement over baselines.

02 A 0.6B model attains 26.13% on AIME 2025.

03 SOD effectively transfers reasoning skills to lightweight models.

Abstract

Tool-integrated reasoning (TIR) is difficult to scale to small language models due to instability in long-horizon tool interactions and limited model capacity. Reinforcement learning methods like group relative policy optimization provide only sparse outcome-level rewards. Recently, on-policy distillation (OPD) has gained popularity by supplying dense token-level supervision from a teacher on student-generated trajectories. However, our experiments indicate that applying OPD to TIR leads to a critical failure mode: erroneous tool calls tend to cascade across subsequent reasoning steps, progressively amplifying student-teacher divergence and rendering the teacher's token-level supervision increasingly unreliable. To address this, we propose SOD, a step-wise on-policy distillation framework for small language model agents, which adaptively reweights distillation strength at each…
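
The idea of step-wise reweighting can be illustrated with a small sketch. The abstract is cut off before it describes the actual reweighting rule, so everything below is an assumption made for illustration and not the authors' method: the function names `token_reverse_kl` and `stepwise_weighted_loss`, the `beta` parameter, the use of reverse KL as the per-token distillation loss, and the exponential down-weighting rule are all hypothetical. The sketch only shows the general shape of the idea: steps where the student has drifted far from the teacher contribute less to the distillation loss.

```python
import math

def token_reverse_kl(student_probs, teacher_probs):
    """Reverse KL(student || teacher) for one token's vocabulary distribution.

    Assumes both inputs are probability lists of equal length and that the
    teacher assigns non-zero probability wherever the student does.
    """
    return sum(
        s * math.log(s / t)
        for s, t in zip(student_probs, teacher_probs)
        if s > 0
    )

def stepwise_weighted_loss(steps, beta=1.0):
    """Illustrative step-wise reweighted distillation loss.

    steps: list of reasoning steps; each step is a list of
           (student_probs, teacher_probs) pairs, one pair per token.
    beta:  hypothetical sharpness of the down-weighting.
    Returns (loss, weights), with one weight per step.
    """
    weights = []
    weighted_losses = []
    for step in steps:
        kls = [token_reverse_kl(s, t) for s, t in step]
        mean_kl = sum(kls) / len(kls)
        # Hypothetical rule (NOT from the paper): trust the teacher less as
        # the step-level student-teacher divergence grows. In real training
        # this weight would be treated as a constant (no gradient through it).
        w = math.exp(-beta * mean_kl)
        weights.append(w)
        weighted_losses.append(w * mean_kl)
    loss = sum(weighted_losses) / len(steps)
    return loss, weights

# Toy trajectory with two steps over a two-token vocabulary.
toy_steps = [
    [([0.9, 0.1], [0.8, 0.2])],  # step 1: student close to teacher
    [([0.9, 0.1], [0.1, 0.9])],  # step 2: large divergence, e.g. after a bad tool call
]
toy_loss, toy_weights = stepwise_weighted_loss(toy_steps)
```

In this toy trajectory the second step receives a much smaller weight than the first, so its unreliable token-level supervision is damped rather than applied at full strength, which is the behavior the abstract motivates.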

Code & Models

Repositories

YoungZ365/SOD (GitHub)
