TL;DR
G-Zero introduces a verifier-free, co-evolutionary framework enabling large language models to self-improve in open-ended tasks without external judges, using intrinsic rewards and internal dynamics.
Contribution
The paper proposes Hint-$\delta$, a novel intrinsic reward, and a co-evolutionary training method for LLMs that eliminates reliance on external evaluators.
Findings
G-Zero achieves continuous self-improvement without external verification.
A best-iterate suboptimality guarantee is proved for an idealized standard-DPO version of G-Zero.
The framework effectively targets model blind spots through internal distributional signals.
Abstract
Self-evolving LLMs excel in verifiable domains but struggle in open-ended tasks, where reliance on proxy LLM judges introduces capability bottlenecks and reward hacking. To overcome this, we introduce G-Zero, a verifier-free, co-evolutionary framework for autonomous self-improvement. Our core innovation is Hint-$\delta$, an intrinsic reward that quantifies the predictive shift between a Generator model's unassisted response and its response conditioned on a self-generated hint. Using this signal, a Proposer model is trained via GRPO to continuously target the Generator's blind spots by synthesizing challenging queries and informative hints. The Generator is concurrently optimized via DPO to internalize these hint-guided improvements. Theoretically, we prove a best-iterate suboptimality guarantee for an idealized standard-DPO version of G-Zero, provided that the Proposer induces…
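The abstract does not give the exact form of Hint-$\delta$ or of the preference pairs, so the following is only a minimal sketch under stated assumptions. It assumes the "predictive shift" can be approximated as the length-normalized gain in log-likelihood of a response when the hint is added to the context, and that the hint-guided response is treated as the preferred sample in a standard DPO loss. The function names `hint_delta` and `dpo_loss`, the normalization, and the default `beta` are hypothetical and not taken from the paper.

```python
import math
from typing import Sequence


def hint_delta(logp_with_hint: Sequence[float],
               logp_without_hint: Sequence[float]) -> float:
    """Hypothetical Hint-delta proxy (an assumption, not the paper's formula).

    Both arguments are per-token log-probabilities of the SAME response,
    scored once with the hint in the context and once without it.
    Returns the average per-token log-likelihood gain from the hint.
    """
    if len(logp_with_hint) != len(logp_without_hint):
        raise ValueError("both scorings must cover the same tokens")
    if not logp_with_hint:
        raise ValueError("response must contain at least one token")
    gain = sum(logp_with_hint) - sum(logp_without_hint)
    return gain / len(logp_with_hint)


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair of sequence log-probs.

    Assumed pairing for illustration: chosen = hint-guided response,
    rejected = unassisted response.
    """
    margin = beta * ((policy_chosen - ref_chosen)
                     - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)) equals softplus(-margin); computed stably.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    # Toy numbers only: the hint makes each token of the response more likely.
    print(hint_delta([-0.5, -0.5], [-1.0, -2.0]))
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0, beta=1.0))
```

Under this reading, a large positive value would mark a query where the hint carries information the Generator does not yet use on its own, which is the kind of blind spot the Proposer is described as targeting.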
