ArchPilot: A Proxy-Guided Multi-Agent Approach for Machine Learning Engineering

Zhuowen Yuan; Tao Liu; Yang Yang; Yang Wang; Feng Qi; Kaushik Rangadurai; Bo Li; Shuang Yang

arXiv:2511.03985·cs.AI·November 7, 2025

Open Access · 3 Reviews

TL;DR

ArchPilot is a multi-agent system that improves machine learning architecture search efficiency by using proxy evaluations and adaptive search, reducing the need for costly full training runs.

Contribution

It introduces a novel multi-agent framework with proxy-guided evaluation and a Monte Carlo Tree Search-based orchestration for scalable ML architecture search.

Findings

01. Outperforms state-of-the-art methods such as AIDE and ML-Master on MLE-Bench.

02. Reduces reliance on full training runs, saving computational resources.

03. Demonstrates effective architecture search with limited budgets.

Abstract

Recent LLM-based agents have demonstrated strong capabilities in automated ML engineering. However, they heavily rely on repeated full training runs to evaluate candidate solutions, resulting in significant computational overhead, limited scalability to large search spaces, and slow iteration cycles. To address these challenges, we introduce ArchPilot, a multi-agent system that integrates architecture generation, proxy-based evaluation, and adaptive search into a unified framework. ArchPilot consists of three specialized agents: an orchestration agent that coordinates the search process using a Monte Carlo Tree Search (MCTS)-inspired novel algorithm with a restart mechanism and manages memory of previous candidates; a generation agent that iteratively generates, improves, and debugs candidate architectures; and an evaluation agent that executes proxy training runs, generates and…
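The abstract describes an orchestration agent that steers the search with an MCTS-inspired algorithm fed by proxy evaluations. As a rough illustration only, the sketch below shows textbook UCT selection and score backup over a tree of candidate architectures. It is not the paper's algorithm: the `Node` class, the `uct`, `select`, and `backup` helpers, and the exploration constant `c = 1.4` are all assumptions made for this example, and the restart mechanism and memory management mentioned in the abstract are omitted because the source does not specify them.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One candidate architecture in the search tree (hypothetical structure)."""
    name: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    total_score: float = 0.0  # sum of proxy scores backed up through this node

    def add_child(self, name: str) -> "Node":
        child = Node(name=name, parent=self)
        self.children.append(child)
        return child


def uct(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """Standard UCT value: mean proxy score plus an exploration bonus."""
    if child.visits == 0:
        return math.inf  # unvisited children are always tried first
    mean = child.total_score / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def select(root: Node) -> Node:
    """Descend from the root, taking the highest-UCT child until a leaf is reached."""
    node = root
    while node.children:
        node = max(node.children, key=lambda ch: uct(ch, node.visits))
    return node


def backup(node: Node, proxy_score: float) -> None:
    """Propagate a proxy score from the evaluated node back up to the root."""
    current: Optional[Node] = node
    while current is not None:
        current.visits += 1
        current.total_score += proxy_score
        current = current.parent
```

In a loop of this shape, a generation step would expand the node returned by `select`, a proxy evaluation would score the new candidate, and `backup` would record that score so later selections balance promising branches against unexplored ones.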

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

Novelty: I think this paper holds certain novelty in combining the strengths of MCTS and LLMs for automated NAS.

Weaknesses

While novel to some extent, I still think this paper is not well prepared for publication.

1. Writing: I have to say that the writing and content organization of this paper are not ideal. Where is the introduction to, or preliminary on, MCTS-based NAS? Why do the authors expect readers to be very familiar with this field? There are many NAS works, yet the related work section only reviews those that focus on efficient evaluation.

2. Contribution: I cannot understand why the three proxy f

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- The paper seeks to address a core challenge in AutoML, the prohibitive cost of searching over large search spaces, and improving performance over limited search budgets.
- The modular design is clean, although this is the case in many prior NAS approaches too.
- Evaluation on MLE-Bench is better than that on standard NAS benchmarks.

Weaknesses

- The agentic framing of ArchPilot is disingenuous. There is no action space, nor any decision making going on. The OA is simply calling tools (an LLM in this case); GA and OA are largely manually designed processes. Existing NAS methods also do the same.
- The primary novelty of ArchPilot is combining existing components (MCTS/UCT, proxy training, ridge-fitting, LLM-based code generation, restart). The paper overstates its contributions.
- Only two baseline methods are considered, AIDE and ML

Reviewer 03 · Rating 2 · Confidence 4

Strengths

- The idea of using a proxy to avoid retraining is technically sound and novel.
- The three designed agents each play a basic role in the agent system, forming a loop that gives the system a self-evolving capability.
- The paper is well-written and well-organized, which makes it easy to follow.

Weaknesses

- The experiments are weak. Only two baseline agents are compared on a single benchmark, MLE-Bench, and only one LLM backbone is used.
- MLE-Bench, the only benchmark used in this paper, is not described in detail, such as how many instances are in each task.
- The paper lacks ablation studies, so it cannot demonstrate the contributions of the different modules or agents.
- There are no quantitative results on how much time ArchPilot saves by using the proxy module. I believe the author should at least di

Taxonomy

Topics: Machine Learning and Data Classification · Scientific Computing and Data Management · Explainable Artificial Intelligence (XAI)