SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning

Yufei Ma; Zihan Liang; Ben Chen; Zhipeng Qian; Huangyu Dai; Lingtao Mao; Xuxin Zhang; Chenyi Lei; Wenwu Ou

arXiv:2605.18299 · cs.AI · May 19, 2026

TL;DR

SD-Search is an on-policy hindsight self-distillation method for search-augmented reasoning agents. It derives step-level supervision from the policy itself, without external models or annotations, with the aim of improving the quality of individual search queries.

Contribution

The paper proposes a self-distillation approach that derives step-level supervision from the policy itself, eliminating the need for external teachers or annotations in search-augmented reasoning.

Findings

01 Enables step-level credit assignment in search-augmented reasoning.

02 Improves search query decisions without external supervision.

03 Integrates into standard RL training loops.

Abstract

Search-augmented reasoning agents interleave internal reasoning with calls to an external retriever, and their performance relies on the quality of each issued query. However, under outcome-reward reinforcement learning, every search decision in a rollout shares the same trajectory-level reward, leaving individual queries without step-specific credit. Recent process-supervision approaches address this gap by drawing step-level signals from outside the policy, relying either on a much larger teacher model or on sub-question annotations produced by a stronger external system. In contrast, we propose SD-Search, which derives step-level supervision from the policy itself through on-policy hindsight self-distillation, requiring neither an external teacher nor additional annotations. In SD-Search, a single model plays two roles that differ only in conditioning: a student that sees only the…
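
The credit-assignment gap described in the abstract can be illustrated with a small sketch. This is not the SD-Search method; it only shows the baseline problem, in which one trajectory-level reward is copied to every step of a rollout. The `Step` type, the step kinds, the example queries, and `broadcast_outcome_reward` are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Step:
    kind: str  # hypothetical step kinds: "think", "search", "answer"
    text: str


def broadcast_outcome_reward(steps: list[Step], outcome_reward: float) -> list[float]:
    """Give every step the same trajectory-level reward (outcome-reward RL)."""
    return [outcome_reward] * len(steps)


# A made-up rollout containing one useful query and one irrelevant query.
rollout = [
    Step("think", "I need the author's birthplace."),
    Step("search", "author birthplace"),
    Step("search", "weather tomorrow"),
    Step("answer", "final answer"),
]

credits = broadcast_outcome_reward(rollout, outcome_reward=1.0)

# Both search steps receive identical credit, regardless of query quality.
search_credits = [c for s, c in zip(rollout, credits) if s.kind == "search"]
print(search_credits)  # [1.0, 1.0]
```

Because the two queries receive the same credit, the reward alone cannot tell the policy which query helped. Step-level supervision is meant to supply that missing distinction.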
