TL;DR
The paper introduces 1GC-7RC, a comprehensive benchmark for evaluating AI coding agents across seven diverse ML tasks on a single GPU within specified time limits.
Contribution
It presents a standardized, modular benchmark with evaluation scripts and baseline training code, enabling fair comparison of autonomous AI coding agents.
Findings
The seven evaluated agents differ substantially in performance.
The benchmark reveals varying levels of ML knowledge and planning ability.
All evaluation artifacts are publicly available for reproducibility.
Abstract
Autonomous AI coding agents are becoming a core tool for ML practitioners in industry and research alike. Despite this growing adoption, no standardized benchmark exists to evaluate their ability to design, implement, and train models from scratch across diverse domains. We introduce **1GC-7RC** (*Single Graphic Card: Seven Research Challenges*), a benchmark comprising seven ML tasks spanning language modeling, image classification, semantic segmentation, graph learning, tabular prediction, time-series forecasting, and text classification. Each task provides a locked data-preparation and evaluation script together with a baseline training script; the agent may only modify the training code, has no access to pretrained weights (with one controlled exception for semantic segmentation), no internet access, and must complete each task within a task-specific wall-clock budget (40-120…
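The wall-clock constraint described in the abstract can be illustrated with a small harness. This is a minimal sketch and not the benchmark's actual evaluation code: the function name `run_with_budget`, the result dictionary, and the budget values are all hypothetical, and the real per-task budgets and their units are not reproduced here.

```python
import subprocess
import sys
import time


def run_with_budget(cmd, budget_seconds):
    """Run a command and stop it if it exceeds a wall-clock budget.

    Hypothetical helper for illustration only; the benchmark's own
    harness may enforce its limits differently.
    """
    start = time.monotonic()
    try:
        # subprocess.run kills the child and raises TimeoutExpired
        # once the timeout elapses.
        proc = subprocess.run(
            cmd, timeout=budget_seconds, capture_output=True, text=True
        )
        finished, returncode = True, proc.returncode
    except subprocess.TimeoutExpired:
        finished, returncode = False, None
    elapsed = time.monotonic() - start
    return {"finished": finished, "returncode": returncode, "elapsed": elapsed}


# Stand-ins for a training script: one that fits the budget, one that does not.
# The budgets below are arbitrary demo values, not the paper's limits.
fast = run_with_budget([sys.executable, "-c", "print('trained')"], budget_seconds=30)
slow = run_with_budget(
    [sys.executable, "-c", "import time; time.sleep(5)"], budget_seconds=0.5
)
print(fast["finished"], slow["finished"])
```

A run that exceeds its budget is reported as unfinished instead of being scored, which is one plausible way to treat a timeout; the source does not say how the benchmark handles this case.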
