AgentCollabBench: Diagnosing When Good Agents Make Bad Collaborators

Aritra Mazumder; Shubhashis Roy Dipta; Nusrat Jahan Lia; Tanzila Khan; Kainat Raisa Hossain; Nehaa Shri; Shubhrangshu Debsarkar; Humayra Tasnim; Gour Gupal Talukder Shawon; Debjoty Mitra; Sumaiya Ahmed Rani; Al Jami Islam Anik; Al Nafeu Khan

arXiv:2605.08647·cs.CL·May 12, 2026

TL;DR

AgentCollabBench is a diagnostic benchmark of 900 tasks for probing vulnerabilities in multi-agent systems. It shows that communication topology and other structural factors, not model capability alone, shape how reliably agents collaborate.

Contribution

This paper presents a new benchmark and analysis framework to diagnose multi-agent collaboration failures, emphasizing the importance of architecture and communication topology.

Findings

01

The benchmark reveals model-specific vulnerabilities.

02

Communication topology explains 7% to 40% of the variance in information survival.

03

Structural bottlenecks can cause silent failures despite capable models.
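
To make the "share of variance explained" in finding 02 concrete, here is a minimal sketch of one common way to compute such a figure: eta-squared from a one-way analysis of variance, with topology as the grouping factor. This is an assumption for illustration only; the page does not say which statistic the authors used. The topology names (`chain`, `star`) and the survival scores are made-up toy values, not results from the paper.

```python
def eta_squared(groups):
    """Return the share of total variance explained by group membership.

    `groups` maps a group label (here, a hypothetical topology name) to a
    list of scores. This is one-way ANOVA eta-squared: SS_between / SS_total.
    """
    all_values = [v for scores in groups.values() for v in scores]
    grand_mean = sum(all_values) / len(all_values)

    # Total variation of every score around the grand mean.
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)

    # Variation of each group mean around the grand mean, weighted by group size.
    ss_between = sum(
        len(scores) * ((sum(scores) / len(scores)) - grand_mean) ** 2
        for scores in groups.values()
    )

    # If every score is identical there is no variance to explain.
    return ss_between / ss_total if ss_total else 0.0


# Toy data (hypothetical): information-survival rates per run, by topology.
survival = {
    "chain": [0.2, 0.3, 0.4],
    "star": [0.6, 0.7, 0.8],
}

share = eta_squared(survival)
print(f"topology explains {share:.0%} of the variance in this toy data")
```

A value near 0 would mean topology barely matters for survival; a value near 1 would mean it accounts for almost all of the spread across runs.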

Abstract

Multi-agent systems achieve state-of-the-art outcomes through peer collaboration. However, when an agent in the pipeline silently drops a constraint, the system's final output may look correct even though the reasoning chain was quietly corrupted, and existing outcome-based evaluations are blind to such multi-hop process failures. To make these vulnerabilities measurable before deployment, we introduce AgentCollabBench, a diagnostic benchmark of 900 human-validated tasks spanning software engineering, DevOps, and data engineering. Each task isolates one of four behavioral risks: instruction decay (does a constraint survive peer pressure?), false-belief contagion (does a falsehood spread through consensus?), context leakage (does information bleed between tasks?), and tracer durability (does marked data reach the final agent?). Evaluating four modern LLMs (GPT 4.1 mini, Gemini 2.5 Flash…
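
The tracer-durability question in the abstract (does marked data reach the final agent?) can be illustrated with a small sketch. Everything here is hypothetical: the agents are plain string functions standing in for LLM calls, and `TRACER-42`, `relay`, and `summarize` are invented names, not part of AgentCollabBench. The point is only to show a per-hop check, so a dropped marker is caught at the hop where it disappears even if the final output still looks plausible.

```python
from typing import Callable, List, Optional

# An agent is modeled as a function from an incoming message to an outgoing one.
Agent = Callable[[str], str]


def tracer_presence(agents: List[Agent], message: str, tracer: str) -> List[bool]:
    """Pass `message` through `agents` in order and record, after each hop,
    whether the tracer string is still present in the message."""
    presence = []
    for agent in agents:
        message = agent(message)
        presence.append(tracer in message)
    return presence


def first_drop(presence: List[bool]) -> Optional[int]:
    """Return the index of the first hop where the tracer is missing, or None."""
    for hop, present in enumerate(presence):
        if not present:
            return hop
    return None


# Hypothetical stand-in agents.
def relay(msg: str) -> str:
    # Forwards the message unchanged.
    return msg


def summarize(msg: str) -> str:
    # Keeps only the first sentence, silently dropping everything after it.
    return msg.split(".")[0] + "."


message = "Deploy the service. Constraint TRACER-42: never restart the database."
presence = tracer_presence([relay, summarize, relay], message, "TRACER-42")
print(presence)              # per-hop survival of the tracer
print(first_drop(presence))  # hop index where the tracer was lost
```

In this toy pipeline the second agent is the bottleneck: the first hop preserves the marker, the summarizer drops it, and no later hop can recover it.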
