MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision

Zixuan Ke; Austin Xu; Yifei Ming; Xuan-Phi Nguyen; Ryan Chin; Caiming Xiong; Shafiq Joty

arXiv:2505.14996·cs.CL·March 10, 2026

TL;DR

MAS-ZERO introduces a self-evolving, inference-time framework for automatic multi-agent system design that adapts to each problem without validation sets, improving performance across various complex tasks.

Contribution

It is the first framework enabling dynamic, self-evolved MAS configuration at inference time without validation data, enhancing adaptability and performance.

Findings

1. Outperforms manual and automatic baselines in reasoning, coding, and agentic tasks.
2. Achieves up to 16.69% accuracy improvement in reasoning tasks.
3. Maintains cost efficiency while improving accuracy.

Abstract

Multi-agent systems (MAS) leveraging the impressive capabilities of Large Language Models (LLMs) hold significant potential for tackling complex tasks. However, most current MAS depend on manually designed agent roles and communication protocols. These manual designs often fail to align with the underlying LLMs' strengths and struggle to adapt to novel tasks. Recent automatic MAS approaches attempt to mitigate these limitations but typically necessitate a validation set for tuning and yield static MAS designs lacking adaptability during inference, while also removing the flexibility to reduce to simpler systems. We introduce MAS-ZERO, the first self-evolved, inference-time framework for automatic MAS design. MAS-ZERO employs meta-level design to iteratively design, critique, and refine MAS configurations tailored to each problem instance, without requiring a validation set. Critically,…
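The design, critique, and refine loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`design_critique_refine`, `stub_propose`, `stub_run`, `stub_critique`, `stub_select`), the `Candidate` type, and the majority-vote selection rule are all hypothetical, and the stubs stand in for LLM calls so that the example runs without any model or validation set.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    design: str  # textual description of one MAS configuration
    answer: str  # answer that configuration produced for the problem


def design_critique_refine(
    problem: str,
    propose: Callable[[str, str], str],
    run: Callable[[str, str], str],
    critique: Callable[[str, str, str], str],
    select: Callable[[str, List[Candidate]], Candidate],
    rounds: int = 3,
) -> Candidate:
    """Iterate design -> run -> critique on a single problem, then pick one candidate.

    No labels or validation set are used: the only signal is the critique text.
    """
    history: List[Candidate] = []
    feedback = ""  # empty feedback means "no critique yet"
    for _ in range(rounds):
        design = propose(problem, feedback)           # (re)design the MAS
        answer = run(design, problem)                 # execute it on the problem
        feedback = critique(problem, design, answer)  # self-generated feedback
        history.append(Candidate(design, answer))
    return select(problem, history)


# Hypothetical stubs standing in for LLM calls (assumptions, not the paper's prompts).
def stub_propose(problem: str, feedback: str) -> str:
    # Start with the simplest system; escalate only after a critique exists.
    return "single-agent" if feedback == "" else "two-agent debate"


def stub_run(design: str, problem: str) -> str:
    return "4" if design == "two-agent debate" else "5"


def stub_critique(problem: str, design: str, answer: str) -> str:
    if design == "single-agent":
        return "answer unverified; add a second agent"
    return "looks consistent"


def stub_select(problem: str, history: List[Candidate]) -> Candidate:
    # Assumed selection rule: most frequent answer, ties broken by the latest candidate.
    counts = Counter(c.answer for c in history)
    best = max(counts.values())
    for cand in reversed(history):
        if counts[cand.answer] == best:
            return cand
    raise ValueError("history is empty")


result = design_critique_refine(
    "What is 2 + 2?", stub_propose, stub_run, stub_critique, stub_select
)
print(result.design, result.answer)
```

Passing the four steps as callables keeps the loop itself independent of any particular model or prompt; in a real system each stub would be replaced by a call to the underlying LLM.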

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

- Comprehensive baselines.
- Relatively clear presentation.

Weaknesses

- The system seems to need considerable manual design, for example the modules in `MAS-INIT` and the prompts in Appendix J. It is not clear how robust the system is to these manual design choices or how much it relies on human instructions.
- For the same reason, the novelty may be limited because of the lack of automation.
- Some implementation details needed to ensure a fair comparison are unclear; see questions.
- The title does not seem precise. Specifically, "ZERO SUPERVISION" holds in the sens

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The framework's primary strength is its ability to adapt its design to each specific problem instance at inference time, without the need for a pre-tuned configuration or a validation set. This is a compelling alternative to static, validation-set-tuned systems.
2. The expanded ablation studies in Section 4.2 provide a clear and valuable breakdown of the system's performance. The ablations on MAS-Init, MAS-Evolve, meta-design, and meta-feedback effectively demonstrate that all components con

Weaknesses

1. While the empirical results are now much stronger, the core conceptual framework can be viewed as a very sophisticated and well-executed combination of existing ideas (problem decomposition, self-refinement loops, agent routing) orchestrated via prompt engineering. The "meta-design" component, while effective, is fundamentally a well-crafted heuristic rather than a wholly new paradigm.
2. The entire system's success hinges on the reasoning capability of the meta-agent to correctly decompose

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. Originality: a new zero-supervision, inference-time self-evolving MAS framework, breaking validation set dependence and supporting dynamic task decomposition/complexity switching.
2. Quality: Comprehensive experiments across domains/models, comparisons with 11 baselines, ablation studies validating core modules, and planned open-sourcing ensuring reproducibility.
3. Clarity: Clearly describes the framework (3 key steps) and details (code templates, prompts), with complete appendices reducing

Weaknesses

1. From my understanding, this paper focuses on prompt engineering and involves no model training. Its performance upper bound is constrained by the capabilities of the underlying model, and its effectiveness bears similarities to test-time scaling. I consider its contributions limited: had the authors proposed a training paradigm or a data framework to explore ways of pushing the model’s performance ceiling, the work would have been far more impactful.
2. Drawing parallels to test-ti

Reviewer 04 · Rating 2 · Confidence 4

Strengths

1. The motivation is strong. For example, the paper aims to eliminate the need for a labeled validation set.
2. The framework's explicit mechanism for breaking down complex problems into manageable sub-tasks and designing tailored sub-MAS for each is a robust approach to tackling intricate challenges.
3. The paper is easy to understand.

Weaknesses

1. The framework centralizes a heavy cognitive load on the meta-agent, which must simultaneously evaluate agent capabilities, refine system architecture, and verify final answers. This creates a potential single point of failure and places a high floor on the required capability of the underlying LLM.
2. While the framework avoids a traditional validation set, it replaces it with an online, self-referential validation loop (Meta-Feedback and MAS-Verify). This raises concerns about evaluation bl

Code & Models

Repositories

SalesforceAIResearch/MAS-Zero

Taxonomy

Topics: Multi-Agent Systems and Negotiation

Methods: Mixing Adam and SGD · ALIGN