Understanding Multi-Agent LLM Frameworks: A Unified Benchmark and Experimental Analysis
Abdelghny Orogat, Ana Rostam, Essam Mansour

TL;DR
This paper introduces a unified benchmark and analysis for multi-agent LLM frameworks, revealing how architectural choices significantly affect performance, accuracy, and scalability, and providing guidance for future design.
Contribution
It presents an architectural taxonomy and MAFBench, a standardized evaluation suite, enabling systematic comparison and analysis of multi-agent LLM frameworks.
Findings
Framework design alone can change latency by more than 100x
Planning accuracy can decrease by up to 30%
Coordination success drops from over 90% to below 30%
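The latency finding depends on comparing orchestration styles while holding model behavior fixed. The following is a minimal sketch of that idea, not the MAFBench implementation: `stub_llm`, `direct_call`, `routed_pipeline`, and `overhead_ratio` are hypothetical names introduced here, and the fixed-delay stub is an assumption standing in for a real model call.

```python
import time
from statistics import median
from typing import Callable


def stub_llm(prompt: str, delay_s: float = 0.002) -> str:
    """Hypothetical stand-in for a model call with a fixed latency."""
    time.sleep(delay_s)
    return f"echo:{prompt}"


def direct_call(task: str) -> str:
    """Baseline: a single model call with no orchestration layer."""
    return stub_llm(task)


def routed_pipeline(task: str, hops: int = 3) -> str:
    """Illustrative orchestration: each hop re-sends the accumulated history."""
    history: list[str] = []
    msg = task
    for _ in range(hops):
        history.append(msg)
        msg = stub_llm(" | ".join(history))
    return msg


def median_latency(fn: Callable[[str], str], task: str, runs: int = 5) -> float:
    """Median wall-clock time of fn(task) over several runs, in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(task)
        samples.append(time.perf_counter() - start)
    return median(samples)


def overhead_ratio(baseline: Callable[[str], str],
                   candidate: Callable[[str], str],
                   task: str = "plan a trip",
                   runs: int = 5) -> float:
    """Latency of the candidate orchestration relative to the baseline."""
    return median_latency(candidate, task, runs) / median_latency(baseline, task, runs)


if __name__ == "__main__":
    ratio = overhead_ratio(direct_call, routed_pipeline)
    print(f"routed pipeline is {ratio:.1f}x slower than a direct call")
```

Because the stub's latency is identical in both paths, any measured difference comes from the orchestration structure itself, which is the kind of isolation a framework-level comparison needs.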
Abstract
Multi-agent LLM frameworks are widely used to accelerate the development of agent systems powered by large language models (LLMs). These frameworks impose distinct architectural structures that govern how agents interact, store information, and coordinate tasks. However, their impact on system performance remains poorly understood. This gap is critical, as architectural choices alone can induce order-of-magnitude differences in latency and throughput, as well as substantial variation in accuracy and scalability. Addressing this challenge requires (i) jointly evaluating multiple capabilities, such as orchestration overhead, memory behavior, planning, specialization, and coordination, and (ii) conducting these evaluations under controlled, framework-level conditions to isolate architectural effects. Existing benchmarks focus on individual capabilities and lack standardized framework-level…
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Topic Modeling · Natural Language Processing Techniques
