TL;DR
SOD introduces a step-wise on-policy distillation method that adaptively reweights supervision signals for small language models, improving reasoning accuracy and stability in tool-integrated tasks.
Contribution
The paper proposes SOD, a novel distillation framework that mitigates cascading errors by step-wise reweighting, enhancing small models' reasoning capabilities in complex benchmarks.
Findings
SOD achieves up to 20.86% improvement over baselines.
A 0.6B model attains 26.13% on AIME 2025.
SOD effectively transfers reasoning skills to lightweight models.
Abstract
Tool-integrated reasoning (TIR) is difficult to scale to small language models due to instability in long-horizon tool interactions and limited model capacity, while reinforcement learning methods like group relative policy optimization provide only sparse outcome-level rewards. Recently, on-policy distillation (OPD) has gained popularity by supplying dense token-level supervision from a teacher on student-generated trajectories. However, our experiments indicate that applying OPD to TIR leads to a critical failure mode: erroneous tool calls tend to cascade across subsequent reasoning steps, progressively amplifying student-teacher divergence and rendering the teacher's token-level supervision increasingly unreliable. To address this, we propose SOD, a step-wise on-policy distillation framework for small language model agents, which adaptively reweights distillation strength at each…
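The general idea of step-wise reweighting can be illustrated with a minimal Python sketch. This is not the paper's formulation: the abstract is truncated before the weighting rule is stated, so the exponential-decay weight, the `temperature` parameter, and the function names `step_weights` and `reweighted_distill_loss` are all assumptions made for illustration. The sketch only shows how token-level distillation losses might be down-weighted in steps where the student has drifted far from the teacher.

```python
import math


def step_weights(step_divergences, temperature=1.0):
    """Map each step's student-teacher divergence to a weight in (0, 1].

    ASSUMPTION: exp(-d / temperature) is a hypothetical weighting rule,
    chosen only because it shrinks as divergence grows.
    """
    return [math.exp(-d / temperature) for d in step_divergences]


def reweighted_distill_loss(token_losses, step_ids, step_divergences,
                            temperature=1.0):
    """Weighted mean of token-level losses, with one weight per step.

    token_losses:     per-token distillation loss (e.g. a per-token KL term)
    step_ids:         index of the reasoning/tool step each token belongs to
    step_divergences: one non-negative divergence score per step (hypothetical)
    """
    weights = step_weights(step_divergences, temperature)
    weighted_sum = sum(weights[s] * loss
                       for loss, s in zip(token_losses, step_ids))
    weight_total = sum(weights[s] for s in step_ids)
    return weighted_sum / weight_total


# Toy trajectory: step 0 is on track, step 1 follows an erroneous tool call.
token_losses = [0.5, 0.5, 2.0, 2.0]
step_ids = [0, 0, 1, 1]
step_divergences = [0.0, 2.0]

uniform = sum(token_losses) / len(token_losses)
reweighted = reweighted_distill_loss(token_losses, step_ids, step_divergences)
print(uniform, reweighted)
```

In this toy case the tokens from the high-divergence step contribute less to the loss than they would under a uniform average, which is the qualitative behavior the abstract describes for unreliable teacher supervision.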
