Configurable multi-agent framework for scalable and realistic testing of LLM-based agents

Sai Wang; Senthilnathan Subramanian; Mudit Sahni; Praneeth Gone; Lingjie Meng; Xiaochen Wang; Nicolas Ferradas Bertoli; Tingxian Cheng; Jun Xu

arXiv:2507.14705·cs.AI·July 22, 2025

TL;DR

Neo is a multi-agent framework that enables scalable, realistic testing of LLM-based agents through dynamic, human-like conversations, uncovering failures and improving evaluation efficiency.

Contribution

The paper introduces Neo, a modular, configurable multi-agent system for automated, high-fidelity testing of LLM agents, surpassing manual methods in efficiency and diversity.

Findings

01

Uncovered edge-case failures across five attack categories with a 3.3% break rate, close to the 5.8% achieved by expert human red-teamers

02

Generated 180 test questions in 45 minutes, 10-12X faster than humans

03

Achieved broader behavioral exploration than scripted testing

Abstract

Large-language-model (LLM) agents exhibit complex, context-sensitive behaviour that quickly renders static benchmarks and ad-hoc manual testing obsolete. We present Neo, a configurable, multi-agent framework that automates realistic, multi-turn evaluation of LLM-based systems. Neo couples a Question Generation Agent and an Evaluation Agent through a shared context-hub, allowing domain prompts, scenario controls and dynamic feedback to be composed modularly. Test inputs are sampled from a probabilistic state model spanning dialogue flow, user intent and emotional tone, enabling diverse, human-like conversations that adapt after every turn. Applied to a production-grade Seller Financial Assistant chatbot, Neo (i) uncovered edge-case failures across five attack categories with a 3.3% break rate close to the 5.8% achieved by expert human red-teamers, and (ii) delivered 10-12X higher…
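
The sampling idea in the abstract can be illustrated with a minimal sketch. It assumes each of the three dimensions named there (dialogue flow, user intent, emotional tone) is an independent categorical distribution, and that adapting after a turn means up-weighting one label and renormalizing. The category labels, the weights, and the names `STATE_MODEL`, `TurnState`, `sample_state` and `adapt` are all hypothetical; the abstract does not specify Neo's actual state model or update rule.

```python
import random
from dataclasses import dataclass

# Hypothetical labels and weights, chosen only for illustration.
# The paper's abstract does not list the real categories or probabilities.
STATE_MODEL = {
    "dialogue_flow": {"new_topic": 0.3, "follow_up": 0.5, "clarification": 0.2},
    "user_intent": {"information_seeking": 0.6, "task_request": 0.3, "adversarial": 0.1},
    "emotional_tone": {"neutral": 0.6, "frustrated": 0.25, "confused": 0.15},
}


@dataclass(frozen=True)
class TurnState:
    """One sampled configuration for the next simulated user turn."""
    dialogue_flow: str
    user_intent: str
    emotional_tone: str


def sample_state(model, rng):
    """Draw one label per dimension, weighted by that dimension's distribution."""
    picks = {}
    for dimension, dist in model.items():
        labels = list(dist)
        weights = list(dist.values())
        picks[dimension] = rng.choices(labels, weights=weights, k=1)[0]
    return TurnState(**picks)


def adapt(model, dimension, label, boost=1.5):
    """Return a new model with one label up-weighted and its dimension renormalized.

    This stands in for per-turn feedback (an assumption, not Neo's documented rule).
    The input model is left unchanged.
    """
    dist = dict(model[dimension])
    dist[label] *= boost
    total = sum(dist.values())
    updated = dict(model)
    updated[dimension] = {k: v / total for k, v in dist.items()}
    return updated


rng = random.Random(0)  # fixed seed so runs are repeatable
state = sample_state(STATE_MODEL, rng)
model = adapt(STATE_MODEL, "user_intent", "adversarial")
next_state = sample_state(model, rng)
```

In this sketch a question generator would condition its prompt on `state`, and an evaluator's verdict would decide which label to pass to `adapt` before the next turn is sampled.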
