RoboPhD: Self-Improving Text-to-SQL Through Autonomous Agent Evolution
Andrew Borthwick, Stephen Ash

TL;DR
RoboPhD is an autonomous system where AI agents iteratively improve Text-to-SQL performance through self-guided evolution, discovering effective strategies without human domain input, leading to significant accuracy gains.
Contribution
This work introduces RoboPhD, presented as the first autonomous agent system that self-improves Text-to-SQL agents through evolutionary selection, without external guidance on the Text-to-SQL domain.
Findings
Achieved 73.67% accuracy on BIRD test set.
Discovered size-adaptive database analysis and SQL generation strategies.
Improved model performance by up to 8.9 points over weaker baselines.
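The size-adaptive analysis named in the findings is not specified further in this summary. As a rough sketch of the idea only, assuming a SQLite database, a column-count threshold for "small" schemas, and a report format that are all hypothetical (none of these come from the paper), an analysis script could add sample values for small schemas and fall back to names and types for large ones:

```python
import sqlite3


def analyze_database(conn, small_max_columns=30, sample_rows=3):
    """Return (mode, report). Threshold and report layout are illustrative assumptions."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    schema = {
        t: [(c[1], c[2]) for c in conn.execute(f'PRAGMA table_info("{t}")')]
        for t in tables
    }
    total_columns = sum(len(cols) for cols in schema.values())
    # Small schemas get the deeper (more expensive) analysis with sample values.
    mode = "deep" if total_columns <= small_max_columns else "shallow"

    lines = []
    for table, cols in schema.items():
        lines.append(f"Table {table}")
        for name, col_type in cols:
            line = f"  {name} {col_type}"
            if mode == "deep":
                samples = [
                    row[0]
                    for row in conn.execute(
                        f'SELECT DISTINCT "{name}" FROM "{table}" '
                        f'WHERE "{name}" IS NOT NULL LIMIT ?',
                        (sample_rows,),
                    )
                ]
                line += f"  samples={samples!r}"
            lines.append(line)
    return mode, "\n".join(lines)
```

The single threshold keeps the sketch short; a real script might scale the depth gradually with schema size rather than switching between two modes.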
Abstract
We present RoboPhD, a system where AI agents autonomously conduct research to improve Text-to-SQL performance. RoboPhD implements a closed-loop evolution cycle with two coordinated components: a SQL Generation agent composed of a database analysis script and SQL generation instructions, and an Evolution agent that designs new versions based on performance feedback. Central to the framework is an ELO-based selection mechanism enabling survival-of-the-fittest dynamics while handling non-transitivity in performance. Starting from a naive 70-line baseline, RoboPhD evolves agents through iterative cross-pollination, discovering effective techniques without any external guidance on the Text-to-SQL domain. Our best agent, evolved to 1500 lines over 18 iterations, autonomously discovered strategies such as size-adaptive database analysis that adjusts depth based on schema complexity and SQL…
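The abstract names an ELO-based selection mechanism, but this excerpt does not give its update rule. As a minimal sketch, assuming the standard Elo formula applied round-robin to agents compared by accuracy within one iteration (the function names, the K-factor of 32, and the tie handling are illustrative assumptions, not taken from the paper):

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, accuracies, k=32.0):
    """Update ratings from one iteration of head-to-head accuracy comparisons.

    ratings:    dict of agent name -> current rating
    accuracies: dict of agent name -> accuracy measured this iteration
    Every pair of evaluated agents plays one "match"; higher accuracy wins.
    """
    names = sorted(accuracies)
    deltas = {name: 0.0 for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if accuracies[a] > accuracies[b]:
                score_a = 1.0
            elif accuracies[a] < accuracies[b]:
                score_a = 0.0
            else:
                score_a = 0.5  # equal accuracy counts as a draw
            exp_a = expected_score(ratings[a], ratings[b])
            # Use pre-iteration ratings for every pair, then apply all deltas.
            deltas[a] += k * (score_a - exp_a)
            deltas[b] += k * ((1.0 - score_a) - (1.0 - exp_a))
    updated = dict(ratings)
    for name in names:
        updated[name] = ratings[name] + deltas[name]
    return updated


def select_survivors(ratings, n):
    """Keep the n highest-rated agents (ties broken by name)."""
    return sorted(ratings, key=lambda name: (-ratings[name], name))[:n]
```

Because ratings accumulate over many pairwise comparisons rather than a single ranking, an agent that loses one matchup can still survive, which is one plausible reading of how such a scheme tolerates non-transitive results.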
Peer Reviews
Decision·ICLR 2026 Conference Desk Rejected Submission
- Creative and well-written. The “autonomous PhD” metaphor and clear structure make the paper engaging and easy to follow.
- Complete system implementation. The framework runs end-to-end with concrete results and released configurations, showing technical feasibility.
- Innovative use of ELO-based evolution. Adapting ELO scoring for agent selection is a neat and transferable idea.
- Potentially inspiring direction. The paper contributes to the growing discussion on self-improving LLM systems and
- Insufficient experiments and analysis. Evaluation is limited to a single dataset (BIRD) and two short runs, with no ablations, variance reporting, or comparisons to prompt-optimization baselines (e.g., DSPy, OPRO). The 2% improvement could easily stem from randomness.
- Overstated claims. The work demonstrates automated prompt search within a fixed pipeline, not genuine autonomous “research.” The conceptual framing exceeds what the evidence supports.
- Limited novelty beyond Text-to-SQL. Simil
1. The closed-loop system combines evolution, evaluation, and analysis agents, allowing agents to learn from their own experiments without human intervention.
2. The use of ELO ratings as an active selection mechanism in evolutionary optimization is well justified.
3. The core algorithm (Algorithm 1: RoboPhD Evolution Cycle) is clearly written and well described.
1. Experiments are conducted only on very simple baselines, and the text-to-SQL literature is not considered. The current SOTA text-to-SQL method has achieved > 80% execution accuracy on the BIRD leaderboard [1]. I don't want to be harsh, but this work is obviously far from the latest text-to-SQL research, while its cost is also high.
2. In addition, the evaluation is very shallow. Experiments are conducted only on a single benchmark, limiting the generalizability of the claims. The evolution ana
1. The primary strength of this paper is the innovative concept of an "AI researcher" framework that automates the cycle of hypothesis, experimentation, and refinement. The three-agent architecture and the use of an ELO rating system for agent selection are elegant and technically sound.
2. The paper demonstrates that its framework can autonomously achieve a measurable performance gain (~2-2.6% absolute accuracy) over baselines on a challenging benchmark. This provides concrete evidence for its
1. The most significant weakness is the lack of systematic ablation studies on the evolution strategies. The authors acknowledge this (lines 450-452), but without this analysis, it is difficult for the reader to understand the marginal contribution of each component (e.g., "Research-Driven" vs. "Error-Focused"). This is a critical piece of analysis needed to fully validate the framework's design.
2. The framework's success is demonstrated on a single, high-end proprietary model (Claude Opus 4.
- The method completely relies on test-time optimization, which is a compelling story if one cares about cost.
- The goal of building an autonomous domain independent research agent is ambitious.
- Most of this paper is well-written and easy to understand.
- The performance improvement with the proposed method seems to be marginal, especially when compared with other methods on the BIRD-bench leaderboard. Given the scale of the model and complexity of the framework, ~2% absolute improvement doesn’t seem to justify 80 iterations of complex evolution.
- The proposed methodology with prompt optimization and tool evolution doesn’t seem to be capable of fundamentally resolving the challenges of NL2SQL, e.g. schema linking errors with ambiguous column/t
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Scientific Computing and Data Management
