$\pi$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data

Yaocheng Zhang; Yuanheng Zhu; Wenyue Chong; Songjun Tu; Qichao Zhang; Jiajun Chai; Xiaohan Wang; Wei Lin; Guojun Yin; Dongbin Zhao

arXiv:2604.14054·cs.LG·April 16, 2026

TL;DR

This paper introduces $\pi$-Play, a multi-agent self-play framework that uses question construction paths as privileged information for self-distillation, improving learning efficiency without external data.

Contribution

It proposes leveraging question construction paths as privileged information in self-play, enabling dense supervision and more efficient training of search agents.

Findings

01

$\pi$-Play surpasses fully supervised agents in performance.

02

It improves evolutionary efficiency by 2-3 times over traditional self-play.

03

The method operates without external data or human feedback.

Abstract

Deep search agents have emerged as a promising paradigm for addressing complex information-seeking tasks, but their training remains challenging due to sparse rewards, weak credit assignment, and limited labeled data. Self-play offers a scalable route to reduce data dependence, but conventional self-play optimizes students only through sparse outcome rewards, leading to low learning efficiency. In this work, we observe that self-play naturally produces a question construction path (QCP) during task generation, an intermediate artifact that captures the reverse solution process. This reveals a new source of privileged information for self-distillation: self-play can itself provide high-quality privileged context for the teacher model in a low-cost and scalable manner, without relying on human feedback or curated privileged information. Leveraging this insight, we propose Privileged…
