Huxley-Gödel Machine: Human-Level Coding Agent Development by an Approximation of the Optimal Self-Improving Machine

Wenyi Wang, Piotr Piękos, Li Nanbo, Firas Laakom, Yimeng Chen, Mateusz Ostaszewski, Mingchen Zhuge, and Jürgen Schmidhuber

arXiv:2510.21614·cs.AI·October 30, 2025

3 Reviews

TL;DR

The paper introduces the Huxley-Gödel Machine, a self-improving coding agent that estimates its future performance potential using a new metric, enabling it to outperform previous methods and achieve human-level coding performance.

Contribution

It proposes the CMP metric to guide self-improvement, and develops the Huxley-Gödel Machine that effectively searches self-modification trees based on this metric.

Findings

01

HGM outperforms prior methods on SWE-bench and Polyglot datasets.

02

HGM achieves human-level coding performance with GPT-5-mini.

03

The approach demonstrates strong transferability to other datasets and models.

Abstract

Recent studies operationalize self-improvement through coding agents that edit their own codebases. They grow a tree of self-modifications through expansion strategies that favor higher software engineering benchmark performance, assuming that this implies more promising subsequent self-modifications. However, we identify a mismatch between the agent's self-improvement potential (metaproductivity) and its coding benchmark performance, namely the Metaproductivity-Performance Mismatch. Inspired by Huxley's concept of clade, we propose a metric ($CMP$) that aggregates the benchmark performances of the descendants of an agent as an indicator of its potential for self-improvement. We show that, in our self-improving coding agent development setting, access to the true $CMP$ is sufficient to simulate how the Gödel Machine would behave under certain assumptions. We…
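The idea of scoring an agent by its descendants rather than by its own benchmark result can be illustrated with a small sketch. This is not the paper's estimator: the `Agent` class, the `own_score` and `descendant_score` helpers, the pooled success rate used as the aggregate, and all numbers are assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Agent:
    """Hypothetical node in a tree of self-modifications."""
    name: str
    successes: int  # benchmark tasks solved by this agent
    attempts: int   # benchmark tasks attempted by this agent
    children: List["Agent"] = field(default_factory=list)


def descendants(agent: Agent) -> Iterator[Agent]:
    """Yield every agent below `agent` in the tree (excluding `agent`)."""
    for child in agent.children:
        yield child
        yield from descendants(child)


def own_score(agent: Agent) -> float:
    """The agent's own benchmark success rate."""
    return agent.successes / agent.attempts


def descendant_score(agent: Agent) -> Optional[float]:
    """Pooled success rate over all descendants (an assumed aggregate).

    Returns None when no descendant has been evaluated yet.
    """
    members = list(descendants(agent))
    attempts = sum(m.attempts for m in members)
    if attempts == 0:
        return None
    return sum(m.successes for m in members) / attempts


# Illustrative numbers only: `a` scores worse than `b` on the benchmark,
# yet the agents derived from `a` do much better than those derived from `b`.
a1 = Agent("a1", successes=9, attempts=10)
a2 = Agent("a2", successes=8, attempts=10)
a = Agent("a", successes=5, attempts=10, children=[a1, a2])
b1 = Agent("b1", successes=3, attempts=10)
b = Agent("b", successes=7, attempts=10, children=[b1])

for agent in (a, b):
    print(agent.name, own_score(agent), descendant_score(agent))
```

In this toy tree, an expansion strategy that ranks by own benchmark score would prefer `b`, while one that ranks by the descendant aggregate would prefer `a`, which is the kind of mismatch the abstract describes.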

Peer Reviews

Decision·ICLR 2026 Oral

Reviewer 01Rating 8Confidence 2

Strengths

Significance and Novelty:
- The paper is extremely insightful and I think it will be beneficial for the broader ICLR community as well.

Clarity:
- The paper was a joy to read, and I thank the authors for formally describing the algorithm as well as presenting pseudo-code in Appendix B; this really helped me better understand the details of the algorithm.

Weaknesses

Minor dataset concerns:
- Recent works in software engineering benchmarking have found contamination issues in SWE-Bench Lite. As such, I recommend the authors also verify results on SWE-Bench Live (https://github.com/microsoft/SWE-bench-Live).
- It's understandable that a full rerun might be too expensive. Even a small-scale experiment verifying that the main result holds on `SWE-Bench: Live - Lite` (the names are getting challenging to say out loud) would be extremely useful here.

Clarit

Reviewer 02Rating 4Confidence 3

Strengths

- The proposed method is inspired by the concept of clades and introduces a new metric (CMP), which measures the productivity of an entire lineage by aggregating the benchmark success of an agent’s descendants rather than relying on the agent’s own performance. This idea convincingly addresses the shortcomings of previous methods and provides a theoretically sound foundation for improving self-improving agent exploration. HGM achieves better performance than DGM and SICA.

Weaknesses

- While the authors cite STOP as related work, they do not include it in their experiments. STOP recursively improves its own reasoning code, namely prompts and inference strategies, and thus represents a closely related setup. The lack of empirical comparison with STOP leaves a gap in the comprehensiveness of evaluation.
- The experimental evaluation is limited to two coding benchmarks, SWE-Verified-60 and Polyglot, both within a narrow programming domain. If the goal is to improve the agent’s

Reviewer 03Rating 6Confidence 4

Strengths

The problem being studied is very interesting and has potentially enormous impact. The idea of using a bit of look-ahead is very expensive but principled. Experiments suggest that it may be more effective than other approaches.

Weaknesses

*Theory*: Theorem 1 is fine to include but it is not nearly enough to justify publication. The proof itself is almost tautological once the definitions are established. The theory appendix is poorly presented, which is concerning. For instance, the concepts referenced in the statement of Theorem 1 are defined only in the proof. The necessary definitions should be stated separately so that the theorem statement makes sense without reading the proof. Therefore the paper's justification is emp
