SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning
Yufei Ma, Zihan Liang, Ben Chen, Zhipeng Qian, Huangyu Dai, Lingtao Mao, Xuxin Zhang, Chenyi Lei, Wenwu Ou

TL;DR
SD-Search is an on-policy hindsight self-distillation method for search-augmented reasoning agents. It provides step-level supervision without external models or annotations, thereby improving the quality of individual query decisions.
Contribution
The paper proposes a self-distillation approach that derives step-level supervision from the policy itself, eliminating the need for external teachers or annotations in search-augmented reasoning.
Findings
Enables step-level credit assignment in search-augmented reasoning.
Improves search query decisions without external supervision.
Integrates seamlessly into standard RL training loops.
Abstract
Search-augmented reasoning agents interleave internal reasoning with calls to an external retriever, and their performance relies on the quality of each issued query. However, under outcome-reward reinforcement learning, every search decision in a rollout shares the same trajectory-level reward, leaving individual queries without step-specific credit. Recent process-supervision approaches address this gap by drawing step-level signals from outside the policy, relying either on a much larger teacher model, or on sub-question annotations produced by a stronger external system. In contrast, we propose SD-Search, which derives step-level supervision from the policy itself through on-policy hindsight self-distillation, requiring neither an external teacher nor additional annotations. In SD-Search, a single model plays two roles that differ only in conditioning: a student that sees only the…
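The credit-assignment gap described in the abstract can be illustrated with a minimal sketch. This is not the SD-Search method itself, only the outcome-reward baseline it aims to improve on. The `Step` type, the `broadcast_outcome_reward` helper, and the example queries are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    kind: str  # "think", "search", or "answer" (illustrative labels)
    text: str


def broadcast_outcome_reward(steps: List[Step], outcome_reward: float) -> List[float]:
    """Give every search step the same trajectory-level reward.

    This mirrors outcome-reward RL as described in the abstract: the
    reward depends only on the final outcome, not on any single query.
    """
    return [outcome_reward for step in steps if step.kind == "search"]


# Hypothetical rollout with one vague and one precise query.
rollout = [
    Step("think", "I need the author's birthplace."),
    Step("search", "author of the novel"),
    Step("search", "birthplace of the author"),
    Step("answer", "final answer text"),
]

credits = broadcast_outcome_reward(rollout, outcome_reward=1.0)
print(credits)  # [1.0, 1.0]
```

Both queries receive identical credit even though they may differ in usefulness, which is the gap that step-level supervision is meant to close.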
