Which LLM Multi-Agent Protocol to Choose?
Hongyi Du, Jiaqi Su, Jisen Li, Lijie Ding, Yingxuan Yang, Peixuan Han, Xiangru Tang, Kunlun Zhu, and Jiaxuan You

TL;DR
This paper introduces ProtocolBench, a benchmarking framework that compares multi-agent communication protocols along four axes (task success, end-to-end latency, message or byte overhead, and robustness under failures), and presents ProtocolRouter, a learnable system that optimizes protocol selection for improved performance and resilience.
Contribution
The paper provides the first systematic benchmark for multi-agent protocols and proposes ProtocolRouter, a learnable approach for dynamic protocol selection based on scenario requirements.
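To make the idea of scenario-based protocol selection concrete, here is a minimal sketch. It is not ProtocolRouter itself, which the paper describes as learnable: this version swaps in a fixed weighted score. The `ProtocolProfile` type, the `select_protocol` function, and every number in `PROFILES` are hypothetical placeholders, not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtocolProfile:
    """Hypothetical normalized scores in [0, 1]; higher is better on every axis."""
    success: float
    latency: float
    overhead: float
    robustness: float


# Placeholder values for illustration only; NOT measurements from the paper.
PROFILES = {
    "A2A": ProtocolProfile(0.8, 0.6, 0.5, 0.7),
    "ACP": ProtocolProfile(0.7, 0.9, 0.6, 0.5),
    "ANP": ProtocolProfile(0.6, 0.5, 0.7, 0.9),
    "Agora": ProtocolProfile(0.9, 0.4, 0.8, 0.6),
}


def select_protocol(weights: dict[str, float], profiles=PROFILES) -> str:
    """Return the protocol with the highest weighted score.

    `weights` maps an axis name (a ProtocolProfile field) to its importance
    for the scenario at hand. Axes left out of `weights` are ignored.
    """
    def score(profile: ProtocolProfile) -> float:
        return sum(w * getattr(profile, axis) for axis, w in weights.items())

    return max(profiles, key=lambda name: score(profiles[name]))


# A latency-sensitive scenario and a failure-prone scenario (assumed weights).
latency_choice = select_protocol({"latency": 0.7, "success": 0.3})
resilience_choice = select_protocol({"robustness": 0.8, "success": 0.2})
```

A learned router would replace the hand-set weights and profiles with values estimated from benchmark runs and runtime signals; the selection step itself could stay this simple.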
Findings
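Protocol choice significantly impacts system performance: in the Streaming Queue scenario, overall completion time varies by up to 36.5% across protocols and mean end-to-end latency differs by 3.48 s.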
ProtocolRouter reduces recovery time by up to 18.1%.
Scenario-specific protocol selection enhances success rates.
Abstract
As large-scale multi-agent systems evolve, the communication protocol layer has become a critical yet under-evaluated factor shaping performance and reliability. Despite the existence of diverse protocols (A2A, ACP, ANP, Agora, etc.), selection is often intuition-driven and lacks standardized guidance. We introduce ProtocolBench, a benchmark that systematically compares agent protocols along four measurable axes: task success, end-to-end latency, message or byte overhead, and robustness under failures. On ProtocolBench, protocol choice significantly influences system behavior. In the Streaming Queue scenario, overall completion time varies by up to 36.5% across protocols, and mean end-to-end latency differs by 3.48 s. Under Fail-Storm Recovery, resilience also differs consistently across protocols. Beyond evaluation, we present ProtocolRouter, a learnable protocol router that selects…
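The cross-protocol comparisons quoted in the abstract (a percentage variation in completion time and an absolute gap in mean latency) can be computed along the following lines. This is a sketch under stated assumptions: the abstract does not say how the 36.5% figure is normalized, so the code assumes the spread is measured relative to the fastest protocol. The function names and all input numbers are hypothetical.

```python
def relative_spread(times_by_protocol: dict[str, float]) -> float:
    """Spread between slowest and fastest protocol, relative to the fastest.

    ASSUMPTION: normalizing by the fastest value; the paper may define it differently.
    """
    fastest = min(times_by_protocol.values())
    slowest = max(times_by_protocol.values())
    return (slowest - fastest) / fastest


def absolute_gap(latency_by_protocol: dict[str, float]) -> float:
    """Difference between the largest and smallest mean latency, in seconds."""
    values = latency_by_protocol.values()
    return max(values) - min(values)


# Hypothetical measurements for illustration only; NOT data from the paper.
completion_minutes = {"A2A": 40.0, "ACP": 50.0, "ANP": 44.0, "Agora": 46.0}
mean_latency_s = {"A2A": 2.0, "ACP": 5.5, "ANP": 3.0, "Agora": 4.0}

spread = relative_spread(completion_minutes)
gap = absolute_gap(mean_latency_s)
print(f"completion-time spread: {spread:.1%}, latency gap: {gap:.2f} s")
```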
Peer Reviews
Decision · Submitted to ICLR 2026
1. Novelty and Significance: The paper tackles a highly relevant and important problem. The lack of a systematic framework for evaluating communication protocols is a major bottleneck in the field.
2. Comprehensive and Rigorous Evaluation: The experimental design is a major strength. The authors conduct experiments across four well-chosen, diverse scenarios that stress different aspects of protocol performance. The evaluation is multi-faceted, using a comprehensive set of metrics that capture
1. Statistical Reporting could be Clearer: While the authors mention using statistical procedures (e.g., bootstrap CIs, Holm-Bonferroni correction) in the appendix, the main results in the tables are often presented as point estimates (e.g., averages) without confidence intervals or p-values. Directly including measures of statistical uncertainty in the main tables would make the claims of performance differences (e.g., "statistically observable differences") more immediately transparent and com
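For readers unfamiliar with the two procedures named in this review, the sketch below shows a percentile bootstrap confidence interval for a mean and the Holm-Bonferroni step-down correction, using only the Python standard library. It illustrates the general techniques and is not the authors' analysis code; the function names, defaults, and sample values are assumptions.

```python
import random
from statistics import mean


def bootstrap_ci(samples, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `samples`."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(samples, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is compared with alpha / (m - k + 1).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # step-down: stop at the first non-rejection
    return reject


# Hypothetical latency samples (seconds) and p-values, for illustration only.
latencies = [2.1, 2.4, 1.9, 2.8, 2.2, 2.5, 2.0, 2.6]
ci_low, ci_high = bootstrap_ci(latencies)
decisions = holm_bonferroni([0.01, 0.04, 0.03])
```

Reporting the interval next to each point estimate in a results table is one way to address the concern this review raises.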
The paper addresses an important and timely problem in multi-agent LLM systems by introducing ProtocolBench and ProtocolRouter. The systematic evaluation of multiple protocols across diverse scenarios is a valuable contribution, as it highlights the practical impact of protocol choice on system performance and reliability. The experimental design is OK, with clear metrics and reproducible setups, which strengthens the credibility of the results. I appreciate the open-source release of ProtocolBe
**Summary:**
* ProtocolRouter’s learning approach assumes stationary or slowly-changing workloads, which may not generalize to highly dynamic or adversarial environments.
* Maintaining multiple protocol states and adapters could introduce computational overhead at very large scales, which is not fully addressed or quantified.
* Comparing protocols head-to-head may not always be fair because each protocol is designed with different optimization goals and assumptions.
* Technical contribution is l
The paper is clearly written and tackles a very important problem, given the recent surge of agent communication protocols. The four tasks and metrics covered are also central to the evaluation of these protocols. I liked how the benchmark tackles each specific metric and generates queries and scenarios tailored to it. The experimental analysis also provides good insights into the limitations of existing protocols.
I have two central concerns: the contribution of ProtocolRouter and the lack of actionable insights for future work. While the paper carefully reports where different protocols vary in performance, it does not sufficiently diagnose why these differences arise or how they could guide the design of better protocols. The analysis ends at identifying which protocol performs best under specific scenarios, rather than translating those observations into concrete recommendations for improving protocol
Taxonomy
Topics: Software System Performance and Reliability · Mobile Agent-Based Network Management · Distributed Systems and Fault Tolerance
