MASEval: Extending Multi-Agent Evaluation from Models to Systems

Cornelius Emde; Alexander Rubinstein; Anmol Goel; Ahmed Heakl; Sangdoo Yun; Seong Joon Oh; Martin Gubri

arXiv:2603.08835 · cs.AI · March 11, 2026

TL;DR

MASEval is a system-level evaluation framework for agentic systems. It treats implementation choices beyond the model, such as framework, topology, and error handling, as variables to compare, so that whole systems can be evaluated component by component.

Contribution

MASEval provides a framework-agnostic library for evaluating entire agentic systems, making the effect of design decisions on performance measurable.

Findings

01 Framework choice influences system performance as much as model choice.

02 System components like topology and error handling significantly affect outcomes.

03 System-level analysis reveals new insights for designing agentic systems.

Abstract

The rapid adoption of LLM-based agentic systems has produced a rich ecosystem of frameworks (smolagents, LangGraph, AutoGen, CAMEL, LlamaIndex, i.a.). Yet existing benchmarks are model-centric: they fix the agentic setup and do not compare other system components. We argue that implementation decisions substantially impact performance, including choices such as topology, orchestration logic, and error handling. MASEval addresses this evaluation gap with a framework-agnostic library that treats the entire system as the unit of analysis. Through a systematic system-level comparison across 3 benchmarks, 3 models, and 3 frameworks, we find that framework choice matters as much as model choice. MASEval allows researchers to explore all components of agentic systems, opening new avenues for principled system design, and practitioners to identify the best implementation for their use case.…
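The system-level comparison described in the abstract can be pictured as a full grid over benchmarks, models, and frameworks. The following is a minimal sketch of that idea, not MASEval's actual API: `run_grid`, `factor_spread`, and the `run_system` callable are hypothetical names introduced here for illustration, and the spread statistic (range of per-level mean scores) is an assumed, deliberately crude way to compare how much each factor matters.

```python
from itertools import product
from statistics import mean
from typing import Callable, Dict, List, Tuple

# One grid cell: (benchmark, model, framework). All names are illustrative.
Config = Tuple[str, str, str]


def run_grid(
    benchmarks: List[str],
    models: List[str],
    frameworks: List[str],
    run_system: Callable[[str, str, str], float],
) -> Dict[Config, float]:
    """Score every (benchmark, model, framework) combination.

    `run_system` is a hypothetical user-supplied function that builds the
    agentic system for one configuration and returns a scalar score.
    """
    return {
        (b, m, f): run_system(b, m, f)
        for b, m, f in product(benchmarks, models, frameworks)
    }


def factor_spread(scores: Dict[Config, float], axis: int) -> float:
    """Range (max minus min) of the mean score per level of one factor.

    axis: 0 = benchmark, 1 = model, 2 = framework.
    """
    by_level: Dict[str, List[float]] = {}
    for config, score in scores.items():
        by_level.setdefault(config[axis], []).append(score)
    level_means = [mean(values) for values in by_level.values()]
    return max(level_means) - min(level_means)
```

With three values per factor, `run_grid` produces 27 scored configurations. Comparing `factor_spread(scores, 1)` with `factor_spread(scores, 2)` then gives a rough sense of whether swapping the model or swapping the framework moves the average score more; the paper's own analysis may use a different statistic.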

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Reinforcement Learning in Robotics · Model-Driven Software Engineering Techniques