MASLab: A Unified and Comprehensive Codebase for LLM-based Multi-Agent Systems
Rui Ye, Keduan Huang, Qimin Wu, Yuzhu Cai, Tian Jin, Xianghe Pang, Xiangrui Liu, Jiaqi Su, Chen Qian, Bohan Tang, Kaiqu Liang, Jiaao Chen, Yue Hu, Zhenfei Yin, Rongye Shi, Bo An, Yang Gao, Wenjun Wu, Lei Bai, Siheng Chen

TL;DR
MASLab is a comprehensive, unified codebase for LLM-based multi-agent systems that consolidates methods, standardizes evaluation, and facilitates research and comparison in the field.
Contribution
The paper introduces MASLab, a unified platform that integrates over 20 methods, provides standardized benchmarks, and lowers the barrier to research on LLM-based multi-agent systems.
Findings
Extensive experiments on 10+ benchmarks with 8 models.
Standardized evaluation protocols for fair comparison.
Facilitates understanding and extension of MAS methods.
Abstract
LLM-based multi-agent systems (MAS) have demonstrated significant potential in enhancing single LLMs to address complex and diverse tasks in practical applications. Despite considerable advancements, the field lacks a unified codebase that consolidates existing methods, resulting in redundant re-implementation efforts, unfair comparisons, and high entry barriers for researchers. To address these challenges, we introduce MASLab, a unified, comprehensive, and research-friendly codebase for LLM-based MAS. (1) MASLab integrates over 20 established methods across multiple domains, each rigorously validated by comparing step-by-step outputs with its official implementation. (2) MASLab provides a unified environment with various benchmarks for fair comparisons among methods, ensuring consistent inputs and standardized evaluation protocols. (3) MASLab implements methods within a shared…
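The abstract describes two linked ideas. First, every method runs in one shared environment on identical inputs. Second, every method is scored by the same evaluation protocol. The sketch below illustrates that pattern under stated assumptions. `MASMethod`, `inference`, `SingleAgent`, `MajorityVote`, `evaluate`, `toy_llm`, and `exact_match` are all hypothetical names chosen for illustration and are not MASLab's actual API. Exact-match scoring is a simplifying stand-in; the reviews note that MASLab's default evaluator is xVerify.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from collections import Counter
from typing import Callable

# Hypothetical LLM interface: prompt in, text out.
LLM = Callable[[str], str]


class MASMethod(ABC):
    """Hypothetical shared base class: every method gets the same LLM and config."""

    def __init__(self, llm: LLM, config: dict | None = None):
        self.llm = llm
        self.config = config or {}

    @abstractmethod
    def inference(self, query: str) -> str:
        """Map one benchmark query to one final answer string."""


class SingleAgent(MASMethod):
    """Baseline: one direct LLM call."""

    def inference(self, query: str) -> str:
        return self.llm(query)


class MajorityVote(MASMethod):
    """Toy multi-agent method: several agents answer, the most common answer wins."""

    def inference(self, query: str) -> str:
        n = self.config.get("num_agents", 3)
        answers = [self.llm(f"[agent {i}] {query}") for i in range(n)]
        return Counter(answers).most_common(1)[0][0]


def evaluate(method: MASMethod, dataset: list[dict], scorer: Callable[[str, str], int]) -> float:
    """Same inputs and same scorer for every method, so accuracies are comparable."""
    correct = 0
    for sample in dataset:
        prediction = method.inference(sample["query"])
        correct += scorer(prediction, sample["answer"])
    return correct / len(dataset)


def toy_llm(prompt: str) -> str:
    """Hypothetical stand-in LLM that adds two integers written as 'a+b'.
    It deliberately answers wrong for agent 1 so that voting has an effect."""
    a, b = prompt.split()[-1].split("+")
    total = int(a) + int(b)
    return str(total + 1) if prompt.startswith("[agent 1]") else str(total)


def exact_match(prediction: str, gold: str) -> int:
    """Simplified scorer (assumption); not the xVerify judge."""
    return int(prediction.strip() == gold.strip())


dataset = [{"query": "2+3", "answer": "5"}, {"query": "10+7", "answer": "17"}]
for method in (SingleAgent(toy_llm), MajorityVote(toy_llm, {"num_agents": 3})):
    print(type(method).__name__, evaluate(method, dataset, exact_match))
```

Because each method exposes only `inference` and receives its LLM and config from outside, the evaluation loop never branches on the method type. This is the property that makes side-by-side comparisons fair in this sketch.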
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
This paper proposes a codebase (framework) that combines two capabilities: easy implementation and unified execution, as well as wide-ranging, standardized benchmarking. This identifies a real gap in prior work, as existing methods typically focus on only one of these aspects. The framework also includes a wide range of recent methods and benchmarks, which strengthens the contribution. The experiments show broad coverage and a rigorous, thorough evaluation. This is a good indication of a versat
Novelty: While it is reasonable to claim that this work conceptually fills the gap between easily extensible multi-agent frameworks and broad benchmark coverage, prior works are not entirely lacking in this direction. For example, CRAB (https://github.com/camel-ai/crab , https://arxiv.org/abs/2407.01511) supports a much larger number of tasks (120) while maintaining a modular shared codebase. AgentBoard (https://openreview.net/forum?id=4S8agvKjle), while focusing mainly on unified environments a
1. The paper introduces MASLab, which consolidates over 20 state-of-the-art MAS methods across multiple domains (e.g., general tasks, coding, mathematics, science). This integration reduces redundant implementation efforts. 2. MASLab addresses a critical gap in the field by offering a standardized evaluation pipeline. It ensures consistent input preprocessing, configuration alignment, and evaluation protocols, which are essential for fair and reproducible comparisons. 3. By abstracting each MAS m
1. While the paper evaluates MAS methods across 10+ benchmarks, many of these benchmarks are not specifically designed for MAS. 2. In the evaluation system, the default xVerify method is supervised. Does this imply that users must train new evaluation models when extending to new assessment tasks? 3. The work lacks theoretical or methodological innovation. Its focus is primarily on engineering and software development. Although some interesting phenomena were observed during the evaluation of
1. The authors comprehensively survey the most popular agentic solutions, covering their features and applicable areas. 2. The authors run a wide range of agentic benchmarks to compare the adapted versions of these solutions in MASLab. 3. The paper presents the key results in a way that readers can understand with little effort.
1. A key concern of this paper is its limited scientific contribution to the community. It reads essentially as a benchmark paper that evaluates various agentic solutions. However, the paper also introduces some additional modifications to those solutions/frameworks, which may make the overall performance attribution more difficult. Namely, it is unclear whether the "unifying" operations and adaptations cause the performance degradation/improvement compared to their original implementa
* The primary strength of this paper is the codebase; it allows future researchers to conduct experiments in a unified and easier way. * The empirical studies presented in the paper are extensive and comprehensive.
* The codebase is not available in this submission, making it hard to evaluate its ease of adoption. Since the major contribution of this paper (i.e., a unified codebase) is more engineering than novel research, the lack of source code makes the impact less convincing. * Although the authors claimed that step-by-step output verification was conducted to ensure a validated implementation, quantitative evidence is insufficient (e.g., Table 6 shows the comparison in AFlow, but the rest meth
