Synthetic Sandbox for Training Machine Learning Engineering Agents
Yuhang Zhou, Lizhu Zhang, Yifan Wu, Jiayi Liu, Xiangjun Fan, Zhuokai Zhao, Hong Yan

TL;DR
SandMLE introduces a synthetic sandbox environment for efficient, large-scale on-policy RL training of machine learning engineering agents, significantly reducing computational costs while maintaining problem complexity.
Contribution
The paper presents SandMLE, a framework that creates diverse, verifiable synthetic MLE environments from few seed tasks, enabling scalable on-policy RL in MLE.
Findings
SandMLE reduces execution time by a factor of more than 13.
It achieves significant performance gains over supervised fine-tuning baselines.
The trained policies generalize well to unseen agentic scaffolds.
Abstract
As large language model agents advance beyond software engineering (SWE) tasks toward machine learning engineering (MLE), verifying agent behavior becomes orders of magnitude more expensive: while SWE tasks can be verified via fast-executing unit tests, MLE verification requires running full ML pipelines -- data preprocessing, model training, and metric evaluation -- on large datasets at each rollout step, rendering trajectory-wise on-policy reinforcement learning (RL) prohibitively slow. Existing approaches retreat to supervised fine-tuning (SFT) or offline proxy rewards, sacrificing the exploration and generalization benefits of on-policy RL. We observe that sandbox data size is the primary source of this bottleneck. Based on this insight, we introduce SandMLE, a multi-agent framework that generates diverse, verifiable synthetic MLE environments from a small number of seed tasks,…
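To make the bottleneck concrete, here is a minimal toy sketch of a per-rollout verification step whose cost grows with the size of the sandbox dataset. It is not SandMLE's implementation. The function names (`make_dataset`, `fit_line`, `verify_rollout`), the linear-regression task, and the 80/20 split are all hypothetical choices made for illustration only.

```python
import random
import time


def make_dataset(n_rows, seed=0):
    """Build a toy regression dataset: y = 3x + 0.5 plus Gaussian noise (assumed task)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n_rows)]
    ys = [3.0 * x + 0.5 + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Closed-form least-squares fit of a line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def verify_rollout(n_rows):
    """Run the whole toy pipeline once and return (held-out MSE, wall-clock seconds).

    Every step (data generation, training, evaluation) touches each row,
    so the verification time scales with n_rows.
    """
    start = time.perf_counter()
    xs, ys = make_dataset(n_rows)
    split = int(0.8 * n_rows)  # hypothetical 80/20 train/test split
    slope, intercept = fit_line(xs[:split], ys[:split])
    test_pairs = list(zip(xs[split:], ys[split:]))
    mse = sum((slope * x + intercept - y) ** 2 for x, y in test_pairs) / len(test_pairs)
    return mse, time.perf_counter() - start


if __name__ == "__main__":
    # Compare a small sandbox against a larger one; timings vary by machine.
    for n_rows in (1_000, 100_000):
        mse, seconds = verify_rollout(n_rows)
        print(f"rows={n_rows:>7}  mse={mse:.4f}  seconds={seconds:.4f}")
```

In an on-policy RL loop, a step like `verify_rollout` would run for every trajectory, which is why shrinking the data each environment needs can matter more than speeding up any single pipeline stage.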
