Frontier Coding Agents Can Now Implement an AlphaZero Self-Play Machine Learning Pipeline For Connect Four That Performs Comparably to an External Solver

Joshua Sherwood; Ben Aybar; Benjamin Kaplan

arXiv:2604.25067 · cs.MA · April 30, 2026

TL;DR

This paper introduces a proof-of-concept benchmark in which frontier coding agents autonomously implement an AlphaZero-style machine learning pipeline for Connect Four on consumer hardware within a three-hour budget, with the resulting game AIs performing comparably to an external solver.

Contribution

It presents a proof-of-concept benchmark for measuring how well frontier coding agents can autonomously implement end-to-end ML pipelines from a minimal task description, using Connect Four as the test case.

Findings

1. Claude Opus 4.7 outperformed the Pascal Pons solver in most trials.

2. Agents showed substantial differentiation in success rates.

3. GPT-5.4 exhibited anomalous time-budget usage, prompting further analysis.

Abstract

Forecasting when AI systems will become capable of meaningfully accelerating AI research is a central challenge for AI safety. Existing benchmarks measure broad capability growth, but may not provide ample early warning signals for recursive self-improvement. We propose measuring AI's capability to autonomously implement end-to-end machine learning pipelines from past AI research breakthroughs, given a minimal task description. By providing a concise task description instead of the full prior work as reference, we hope to better elicit emerging AI research taste. We introduce a proof-of-concept benchmark in which frontier coding agents autonomously implement an AlphaZero-style machine learning pipeline for Connect Four on consumer hardware within a three-hour budget, and we evaluate the resulting game AIs in a round-robin tournament anchored to the Pascal Pons Connect Four solver.…
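
As a rough illustration of what a round-robin tournament anchored to a reference player can look like, the sketch below plays every ordered pair of entrants (so each pairing is contested with both move orders) and then reports each entrant's mean score relative to the anchor. It is not the paper's evaluation code: the function names, the player labels, and the result convention (1.0 for a first-mover win, 0.5 for a draw, 0.0 for a second-mover win) are all assumptions made for this example.

```python
from itertools import permutations
from typing import Callable, Dict, List

# Hypothetical convention: play(first, second) returns 1.0 if the first
# mover wins, 0.5 for a draw, and 0.0 if the second mover wins.
PlayFn = Callable[[str, str], float]


def round_robin(players: List[str], play: PlayFn,
                games_per_pairing: int = 1) -> Dict[str, float]:
    """Return each player's mean score per game over a double round-robin."""
    totals = {p: 0.0 for p in players}
    counts = {p: 0 for p in players}
    # permutations() yields every ordered pair, so each pair of players
    # meets twice: once with each of them moving first.
    for first, second in permutations(players, 2):
        for _ in range(games_per_pairing):
            result = play(first, second)
            totals[first] += result
            totals[second] += 1.0 - result
            counts[first] += 1
            counts[second] += 1
    return {p: totals[p] / counts[p] for p in players}


def score_vs_anchor(scores: Dict[str, float], anchor: str) -> Dict[str, float]:
    """Express every non-anchor score as a difference from the anchor's score."""
    return {p: s - scores[anchor] for p, s in scores.items() if p != anchor}


if __name__ == "__main__":
    # Toy stand-in for real games: the player labelled "solver" always wins,
    # and any game between two agents is a draw.
    def toy_play(first: str, second: str) -> float:
        if first == "solver":
            return 1.0
        if second == "solver":
            return 0.0
        return 0.5

    scores = round_robin(["solver", "agent_a", "agent_b"], toy_play)
    print(score_vs_anchor(scores, "solver"))
```

Playing both move orders matters for Connect Four in particular, because the standard 7x6 game is a first-player win under perfect play, so a schedule with only one ordering per pair would favor whichever entrant happened to move first.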
