$\pi$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data
Yaocheng Zhang, Yuanheng Zhu, Wenyue Chong, Songjun Tu, Qichao Zhang, Jiajun Chai, Xiaohan Wang, Wei Lin, Guojun Yin, Dongbin Zhao

TL;DR
This paper introduces $\pi$-Play, a multi-agent self-play framework that uses privileged self-distillation with question construction paths to improve learning efficiency without external data.
Contribution
The paper proposes using question construction paths (QCPs), which self-play produces during task generation, as privileged information for the teacher model. This gives the student dense supervision and makes training of search agents more efficient.
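To make the contrast between dense and sparse supervision concrete, here is a minimal sketch, not the paper's implementation. It assumes the teacher (which sees the QCP) and the student (which does not) each produce per-step logits over the same vocabulary, and that the dense signal is a per-step KL divergence between them. The function names `token_kl`, `dense_distillation_signal`, and `sparse_outcome_signal` are hypothetical, and the choice of KL as the distillation loss is an assumption that the summary above does not confirm.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) for one step (assumed distillation loss)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def dense_distillation_signal(teacher_steps, student_steps):
    """One learning signal per step: teacher logits are assumed to come from
    a model conditioned on the privileged QCP, student logits from the same
    model without it."""
    return [token_kl(t, s) for t, s in zip(teacher_steps, student_steps)]


def sparse_outcome_signal(num_steps, final_reward):
    """Conventional self-play: only the final step carries a reward."""
    return [0.0] * (num_steps - 1) + [final_reward]


# Toy two-step trajectory over a two-token vocabulary (made-up logits).
teacher = [[2.0, 0.0], [0.0, 2.0]]
student = [[0.0, 2.0], [0.0, 2.0]]
print(dense_distillation_signal(teacher, student))
print(sparse_outcome_signal(len(student), 1.0))
```

In this toy trajectory the dense signal is nonzero at the first step, where the student disagrees with the teacher, while the sparse signal says nothing until the final outcome.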
Findings
$\pi$-Play surpasses fully supervised agents in performance.
It improves evolutionary efficiency by a factor of 2 to 3 over conventional self-play.
The method operates without external data or human feedback.
Abstract
Deep search agents have emerged as a promising paradigm for addressing complex information-seeking tasks, but their training remains challenging due to sparse rewards, weak credit assignment, and limited labeled data. Self-play offers a scalable route to reduce data dependence, but conventional self-play optimizes students only through sparse outcome rewards, leading to low learning efficiency. In this work, we observe that self-play naturally produces a question construction path (QCP) during task generation, an intermediate artifact that captures the reverse solution process. This reveals a new source of privileged information for self-distillation: self-play can itself provide high-quality privileged context for the teacher model in a low-cost and scalable manner, without relying on human feedback or curated privileged information. Leveraging this insight, we propose Privileged…
