Configurable multi-agent framework for scalable and realistic testing of LLM-based agents
Sai Wang, Senthilnathan Subramanian, Mudit Sahni, Praneeth Gone, Lingjie Meng, Xiaochen Wang, Nicolas Ferradas Bertoli, Tingxian Cheng, Jun Xu

TL;DR
Neo is a multi-agent framework that enables scalable, realistic testing of LLM-based agents through dynamic, human-like conversations, uncovering failures and improving evaluation efficiency.
Contribution
The paper introduces Neo, a modular, configurable multi-agent system for automated, high-fidelity testing of LLM agents, surpassing manual methods in efficiency and diversity.
Findings
Uncovered edge-case failures across five attack categories with a 3.3% break rate, close to the 5.8% achieved by expert human red-teamers
Generated 180 test questions in 45 minutes, 10-12X faster than humans
Achieved broader behavioral exploration than scripted testing
Abstract
Large-language-model (LLM) agents exhibit complex, context-sensitive behaviour that quickly renders static benchmarks and ad-hoc manual testing obsolete. We present Neo, a configurable, multi-agent framework that automates realistic, multi-turn evaluation of LLM-based systems. Neo couples a Question Generation Agent and an Evaluation Agent through a shared context-hub, allowing domain prompts, scenario controls and dynamic feedback to be composed modularly. Test inputs are sampled from a probabilistic state model spanning dialogue flow, user intent and emotional tone, enabling diverse, human-like conversations that adapt after every turn. Applied to a production-grade Seller Financial Assistant chatbot, Neo (i) uncovered edge-case failures across five attack categories with a 3.3% break rate, close to the 5.8% achieved by expert human red-teamers, and (ii) delivered 10-12X higher…
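The loop the abstract describes (sample a conversational state, generate a question, evaluate the reply, feed the result back) might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the state values, the `ContextHub` class, the templated question text, and the rule that upweights states which produced a failure are all hypothetical. The abstract names only the three state axes and the two agents.

```python
import random
from dataclasses import dataclass, field

# Hypothetical values for the three axes named in the abstract.
STATE_SPACE = {
    "dialogue_flow": ["continue_topic", "switch_topic", "ask_clarification"],
    "user_intent": ["information_seeking", "complaint", "adversarial_probe"],
    "emotional_tone": ["neutral", "frustrated", "polite"],
}


@dataclass
class ContextHub:
    """Shared store read and written by both agents (illustrative only)."""
    history: list = field(default_factory=list)
    weights: dict = field(default_factory=lambda: {
        axis: {value: 1.0 for value in values}
        for axis, values in STATE_SPACE.items()
    })


def sample_state(hub, rng):
    """Draw one value per axis from the current categorical weights."""
    state = {}
    for axis, options in hub.weights.items():
        values = list(options)
        probs = [options[v] for v in values]
        state[axis] = rng.choices(values, weights=probs, k=1)[0]
    return state


def update_weights(hub, state, broke, boost=1.5):
    """Assumed adaptation rule: upweight state values that led to a failure."""
    if broke:
        for axis, value in state.items():
            hub.weights[axis][value] *= boost


def run_session(target, evaluate, turns, seed=0):
    """Run a multi-turn test: target(question) -> answer, evaluate(q, a) -> bool."""
    rng = random.Random(seed)
    hub = ContextHub()
    for _ in range(turns):
        state = sample_state(hub, rng)
        # Stand-in for the Question Generation Agent: a templated placeholder.
        question = (f"[{state['emotional_tone']}/{state['user_intent']}/"
                    f"{state['dialogue_flow']}] test question")
        answer = target(question)
        # Stand-in for the Evaluation Agent: True means the target broke.
        broke = bool(evaluate(question, answer))
        hub.history.append({"state": state, "question": question,
                            "answer": answer, "broke": broke})
        update_weights(hub, state, broke)
    return hub


def break_rate(hub):
    """Fraction of turns on which the evaluator flagged a failure."""
    if not hub.history:
        return 0.0
    return sum(turn["broke"] for turn in hub.history) / len(hub.history)
```

In a real setup both stand-ins would be LLM calls, and `target` would wrap the chatbot under test. For example, `break_rate(run_session(my_bot, my_judge, turns=180))` would give the share of flagged turns, where `my_bot` and `my_judge` are callables supplied by the tester.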
