Scaling Self-Play with Self-Guidance
Luke Bailey, Kaiyue Wen, Kefan Dong, Tatsunori Hashimoto, Tengyu Ma

TL;DR
This paper introduces Self-Guided Self-Play (SGS), a self-play method in which a language model guides its own problem generation, improving scaling and performance on tasks such as formal theorem proving.
Contribution
SGS enables language models to self-regulate problem generation, preventing collapse and improving scaling in self-play for complex tasks like theorem proving.
Findings
SGS surpasses the asymptotic solve rate of strong RL baselines in fewer than 80 rounds.
A 7B-parameter model trained with SGS solves more problems than a 671B-parameter model at pass@4.
Longer training with SGS improves problem-solving capabilities significantly.
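For readers unfamiliar with the pass@4 figure above: pass@k is commonly estimated as the probability that at least one of k samples, drawn without replacement from n attempts of which c succeeded, is correct. This summary does not say how the paper computed the metric, so the sketch below shows the widely used unbiased estimator rather than the authors' evaluation code. The function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n attempts, c of which were correct.

    Assumption: this is the common estimator 1 - C(n-c, k) / C(n, k);
    the summarized paper may compute pass@4 differently.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # With fewer than k failures, every size-k draw contains a success.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With exactly k attempts per problem (n = k), the estimate reduces to 1 if any attempt succeeded and 0 otherwise, which matches counting problems solved within four tries.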
Abstract
LLM self-play algorithms are notable in that, in principle, nothing bounds their learning: a Conjecturer model creates problems for a Solver, and both improve together. However, in practice, existing LLM self-play methods do not scale well with large amounts of compute, instead hitting learning plateaus. We argue this is because over long training runs, the Conjecturer learns to hack its reward, collapsing to artificially complex problems that do not help the Solver improve. To overcome this, we introduce Self-Guided Self-Play (SGS), a self-play algorithm in which the language model itself guides the Conjecturer away from degeneracy. In SGS, the model takes on three roles: Solver, Conjecturer, and a Guide that scores synthetic problems by their relevance to unsolved target problems and how clean and natural they are, providing supervision against Conjecturer collapse. Our core…
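The role of the Guide described above can be illustrated with a toy reward calculation. This is a minimal sketch under stated assumptions, not the paper's algorithm: the summary does not give the reward formulas, so the difficulty term, the multiplicative combination, and all names here are hypothetical. In a real system the scores would come from model calls; here they are plain numbers.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A synthetic problem with illustrative scores, all in [0, 1]."""
    statement: str
    solve_rate: float   # fraction of Solver attempts that succeeded
    relevance: float    # Guide: relevance to unsolved target problems
    naturalness: float  # Guide: how clean and natural the problem is


def difficulty_reward(solve_rate: float) -> float:
    # Hypothetical difficulty term: peaks at a 50% solve rate and is
    # zero for problems that are always or never solved.
    return 4.0 * solve_rate * (1.0 - solve_rate)


def guide_score(c: Candidate) -> float:
    # Hypothetical combination of the two Guide criteria.
    return c.relevance * c.naturalness


def conjecturer_reward(c: Candidate, use_guide: bool = True) -> float:
    # Without the Guide, only difficulty is rewarded, which a Conjecturer
    # could exploit with artificially complex problems.
    base = difficulty_reward(c.solve_rate)
    return base * guide_score(c) if use_guide else base


# Made-up example values, for illustration only.
contrived = Candidate("convoluted statement", 0.5, 0.1, 0.1)
clean = Candidate("natural lemma near a target", 0.25, 0.9, 0.8)
```

In this toy setting the contrived problem earns the higher reward when only difficulty counts, while the clean, relevant problem wins once the Guide score is applied, which is the kind of supervision against collapse that the abstract describes.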
