RoboPhD: Evolving Diverse Complex Agents Under Tight Evaluation Budgets
Andrew Borthwick, Stephen Ash, Anthony Galczak

TL;DR
The paper systematically compares three optimization methods for evolving complex agents under a limited evaluation budget, finds that RoboPhD performs best on three of four benchmarks, and releases RoboPhD as an open toolkit.
Contribution
This work introduces RoboPhD, a validation-free evolution method that outperforms existing approaches (GEPA and Autoresearch) on most of the benchmarks tested, all under a fixed evaluation budget.
Findings
RoboPhD outperforms GEPA and Autoresearch on three of four benchmarks.
RoboPhD evolves a 22-line seed into a 1,013-line multi-strategy system.
RoboPhD improves accuracy from 27.8% to 65.8% on ARC-AGI.
Abstract
2026 has brought an explosion of interest in LLM-guided evolution of agentic artifacts, with systems like GEPA and Autoresearch demonstrating that LLMs can iteratively improve prompts, code, and agent architectures across diverse domains. As adoption accelerates, a central question emerges: given the same information, the same seed agent, and the same objective, which optimization algorithm yields the best results under the same evaluation budget? This question becomes critical when evaluations are expensive, such as when they require human judgment or multiple LLM calls. We present the first systematic comparison of three optimization paradigms -- Elo tournament selection (RoboPhD), Pareto-based selection (GEPA), and greedy hill-climbing (Autoresearch) -- across four benchmarks spanning abstract reasoning, cloud scheduling, SQL generation, and financial QA, all under a fixed budget…
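To make the three selection paradigms named in the abstract concrete, the sketch below shows a generic textbook version of each: an Elo rating update after a head-to-head comparison, a non-dominated (Pareto) filter over per-task scores, and a greedy accept-if-better rule. These are illustrative assumptions only. The function names, the K-factor of 32, the 400-point scale, and the data layout are hypothetical and are not taken from RoboPhD, GEPA, or Autoresearch.

```python
from typing import Dict, List, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that A beats B (400-point logistic scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> Tuple[float, float]:
    """Update both ratings after one comparison.

    score_a is 1.0 if A won, 0.0 if A lost, 0.5 for a tie.
    k is an assumed K-factor, not a value from the paper.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


def pareto_front(scores: Dict[str, List[float]]) -> List[str]:
    """Return candidates that no other candidate dominates.

    B dominates A if B is at least as good on every task and
    strictly better on at least one (higher is better).
    """
    def dominates(b: List[float], a: List[float]) -> bool:
        return (all(x >= y for x, y in zip(b, a))
                and any(x > y for x, y in zip(b, a)))

    return [
        name for name, vec in scores.items()
        if not any(dominates(other, vec)
                   for other_name, other in scores.items()
                   if other_name != name)
    ]


def hill_climb_accept(best_score: float, candidate_score: float) -> bool:
    """Greedy hill-climbing: keep the candidate only if it is strictly better."""
    return candidate_score > best_score


if __name__ == "__main__":
    # Two equally rated agents; the first one wins the comparison.
    print(elo_update(1500.0, 1500.0, 1.0))
    # Candidate "c" is dominated by "a" (and by "b"), so it is filtered out.
    print(pareto_front({"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.0, 0.0]}))
    print(hill_climb_accept(0.60, 0.62))
```

The practical difference is in what each rule retains: the Elo update keeps a ranking over a whole population, the Pareto filter keeps every candidate that is best on some trade-off, and the greedy rule keeps a single incumbent.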
