Self-Distilled Agentic Reinforcement Learning

Zhengxi Lu; Zhiyuan Yao; Zhuowen Han; Zi-Han Wang; Jinyang Wu; Qi Gu; Xunliang Cai; Weiming Lu; Jun Xiao; Yueting Zhuang; Yongliang Shen

arXiv:2605.15155·cs.LG·May 15, 2026

TL;DR

SDAR introduces a gated auxiliary objective for reinforcement learning agents, enhancing stability and performance by integrating dense token-level guidance from self-distillation, especially in multi-turn scenarios.

Contribution

The paper proposes SDAR, a method that combines self-distillation with RL through a gating mechanism, with the aim of making multi-turn agent training more stable and more effective.

Findings

01 SDAR outperforms baseline methods on the ALFWorld, WebShop, and Search-QA benchmarks.

02 SDAR reports gains of +9.4% on ALFWorld and +10.2% on WebShop-Acc.

03 SDAR avoids the instability observed when GRPO and OPSD are combined naively.

Abstract

Reinforcement learning (RL) has emerged as a central paradigm for post-training LLM agents, yet its trajectory-level reward signal provides only coarse supervision for long-horizon interaction. On-Policy Self-Distillation (OPSD) complements RL by introducing dense token-level guidance from a teacher branch augmented with privileged context. However, transferring OPSD to multi-turn agents proves problematic: compounding multi-turn instability destabilizes supervision, while skill-conditioned privileged guidance requires asymmetric treatment, since negative teacher rejections may arise from imperfect skill retrieval or utilization. We introduce SDAR (Self-Distilled Agentic Reinforcement Learning), which treats OPSD as a gated auxiliary objective while keeping RL as the primary optimization backbone. SDAR maps detached token-level signals into a sigmoid gate, strengthening distillation on…
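The idea of a gated auxiliary objective can be illustrated with a small sketch. It is not the authors' implementation: the abstract is truncated before it says what the gate emphasizes, so the per-token `signals`, the `aux_weight` coefficient, the `temperature` parameter, and the function name `gated_total_loss` are all hypothetical placeholders. The sketch only shows the general shape, in which an RL loss stays primary and a per-token distillation loss is scaled by a sigmoid gate computed from values treated as constants.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_total_loss(rl_loss, distill_losses, signals,
                     aux_weight=0.1, temperature=1.0):
    """Combine an RL loss with a sigmoid-gated distillation term.

    All names and defaults are illustrative assumptions, not the paper's.
    rl_loss        -- scalar loss from the primary RL objective
    distill_losses -- per-token distillation losses (floats)
    signals        -- per-token gate inputs; in an autograd framework these
                      would be detached (stop-gradient) so that gradients
                      flow only through the distillation losses
    """
    if len(distill_losses) != len(signals):
        raise ValueError("distill_losses and signals must have equal length")
    if not distill_losses:
        return rl_loss  # no tokens: the auxiliary term vanishes

    # Each gate lies in (0, 1) and scales its token's distillation loss.
    gates = [sigmoid(s / temperature) for s in signals]
    aux = sum(g * d for g, d in zip(gates, distill_losses)) / len(distill_losses)

    # RL remains the backbone; distillation enters as a weighted add-on.
    return rl_loss + aux_weight * aux
```

With a gate input of zero every gate equals 0.5, so `gated_total_loss(1.0, [2.0, 2.0], [0.0, 0.0], aux_weight=0.5)` halves each token's distillation loss before averaging. Strongly negative inputs push a token's gate toward zero, which effectively switches distillation off for that token while leaving the RL term untouched.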

Code & Models

Repositories

zju-real/SDAR (GitHub)
