Frontier Coding Agents Can Now Implement an AlphaZero Self-Play Machine Learning Pipeline For Connect Four That Performs Comparably to an External Solver
Joshua Sherwood, Ben Aybar, Benjamin Kaplan

TL;DR
This paper introduces a benchmark in which frontier coding agents autonomously implement an AlphaZero-style machine learning pipeline for Connect Four on consumer hardware within a three-hour budget. The resulting game AIs are evaluated in a round-robin tournament anchored to the Pascal Pons Connect Four solver.
Contribution
It presents a proof-of-concept benchmark that measures whether frontier coding agents can autonomously implement an end-to-end ML pipeline from a minimal task description, using Connect Four as the test domain.
Findings
Claude Opus 4.7 outperformed the Pascal Pons solver in most trials.
Agents showed substantial differentiation in success rates.
GPT-5.4 exhibited anomalous time-budget usage, prompting further analysis.
Abstract
Forecasting when AI systems will become capable of meaningfully accelerating AI research is a central challenge for AI safety. Existing benchmarks measure broad capability growth, but may not provide ample early warning signals for recursive self-improvement. We propose measuring AI's capability to autonomously implement end-to-end machine learning pipelines from past AI research breakthroughs, given a minimal task description. By providing a concise task description instead of the full prior work as reference, we hope to better elicit emerging AI research taste. We introduce a proof-of-concept benchmark in which frontier coding agents autonomously implement an AlphaZero-style machine learning pipeline for Connect Four on consumer hardware within a three-hour budget, and we evaluate the resulting game AIs in a round-robin tournament anchored to the Pascal Pons Connect Four solver.…
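To make the evaluation setup concrete, the following is a minimal sketch of how a round-robin tournament with an anchor entrant could be tallied. It is not the authors' harness: the function names (`round_robin`, `toy_game`), the player labels, the scoring convention (1 for a win, 0.5 for a draw, 0 for a loss), and the choice to have every ordered pair play so that each entrant moves first once per pairing are all assumptions made for illustration. The anchor is simply treated as one more entrant.

```python
from itertools import permutations
from typing import Callable, Dict, List

# Hypothetical callback: plays one game and returns the first player's score
# (1.0 = first player wins, 0.5 = draw, 0.0 = first player loses).
GameFn = Callable[[str, str], float]


def round_robin(players: List[str], play_game: GameFn,
                games_per_pairing: int = 1) -> Dict[str, float]:
    """Tally scores over every ordered pair, so each entrant moves first once per opponent."""
    scores = {p: 0.0 for p in players}
    for first, second in permutations(players, 2):
        for _ in range(games_per_pairing):
            result = play_game(first, second)
            scores[first] += result
            scores[second] += 1.0 - result
    return scores


def toy_game(first: str, second: str) -> float:
    """Stand-in for a real match: the anchor always wins, other games are drawn."""
    if first == "solver":
        return 1.0
    if second == "solver":
        return 0.0
    return 0.5


if __name__ == "__main__":
    # Illustrative labels only; "solver" plays the role of the anchor entrant.
    print(round_robin(["solver", "agent_a", "agent_b"], toy_game))
```

In a real harness, `toy_game` would be replaced by a function that runs two game AIs against each other on an actual Connect Four board; comparing each agent's score against the anchor's then gives a fixed reference point across tournaments.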
