TL;DR
The paper introduces the Agent-Testing Agent (ATA), a meta-agent that automates the testing of conversational AI agents. It generates persona-driven adversarial tests grounded in code analysis and literature, adapts their difficulty from LLM-judge feedback, and reports both quantitative and qualitative results.
Contribution
It presents a meta-agent framework that combines static code analysis, designer interrogation, literature mining, and persona-driven adversarial test generation, with test difficulty adapted through judge feedback, as an alternative to static benchmarks and small human studies.
Findings
On a travel planner and a Wikipedia writer, ATA surfaces more diverse and severe failures than expert annotators.
ATA completes testing in 20-30 minutes, whereas ten-annotator rounds of human annotation took days.
Ablation of code analysis and web search increases variance and miscalibration.
Abstract
LLM agents are increasingly deployed to plan, retrieve, and write with tools, yet evaluation still leans on static benchmarks and small human studies. We present the Agent-Testing Agent (ATA), a meta-agent that combines static code analysis, designer interrogation, literature mining, and persona-driven adversarial test generation whose difficulty adapts via judge feedback. Each dialogue is scored with an LLM-as-a-Judge (LAAJ) rubric and used to steer subsequent tests toward the agent's weakest capabilities. On a travel planner and a Wikipedia writer, the ATA surfaces more diverse and severe failures than expert annotators while matching severity, and finishes in 20-30 minutes versus ten-annotator rounds that took days. Ablating code analysis and web search increases variance and miscalibration, underscoring the value of evidence-grounded test generation. The ATA outputs quantitative…
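The feedback loop described in the abstract (score each dialogue with a judge, then steer the next test toward the weakest capability and adapt its difficulty) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the names `CapabilityState`, `run_adaptive_tests`, `toy_dialogue`, and `toy_judge`, the capability labels, the 0-10 score scale, and the "raise difficulty after a pass" rule are all assumptions made for illustration, and the agent and judge are deterministic stubs rather than LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class CapabilityState:
    """Running record for one capability under test (hypothetical structure)."""
    difficulty: int = 1
    scores: List[float] = field(default_factory=list)

    def mean_score(self) -> float:
        # Untested capabilities sort first so each one is probed at least once.
        return sum(self.scores) / len(self.scores) if self.scores else float("-inf")


def run_adaptive_tests(
    capabilities: List[str],
    run_dialogue: Callable[[str, int], str],
    judge: Callable[[str, str], float],
    rounds: int,
    pass_threshold: float = 7.0,  # assumed cutoff on an assumed 0-10 scale
    max_difficulty: int = 5,
) -> Tuple[Dict[str, CapabilityState], List[Tuple[str, int, float]]]:
    state = {name: CapabilityState() for name in capabilities}
    history: List[Tuple[str, int, float]] = []
    for _ in range(rounds):
        # Steer the next test toward the capability with the lowest mean score.
        target = min(capabilities, key=lambda name: state[name].mean_score())
        level = state[target].difficulty
        transcript = run_dialogue(target, level)
        score = judge(target, transcript)
        state[target].scores.append(score)
        history.append((target, level, score))
        # Assumed adaptation rule: make the next test harder only after a pass.
        if score >= pass_threshold:
            state[target].difficulty = min(level + 1, max_difficulty)
    return state, history


def toy_dialogue(capability: str, level: int) -> str:
    # Stub standing in for a real conversation with the agent under test.
    return f"{capability}@{level}"


def toy_judge(capability: str, transcript: str) -> float:
    # Stub standing in for an LLM judge: harder tests earn lower scores.
    level = int(transcript.rsplit("@", 1)[1])
    base = {"budgeting": 9.0, "date_handling": 6.0}[capability]
    return base - level


state, history = run_adaptive_tests(
    ["budgeting", "date_handling"], toy_dialogue, toy_judge, rounds=4
)
for target, level, score in history:
    print(target, level, score)
```

With these stubs, the loop probes each capability once and then keeps returning to the lower-scoring one, which is the steering behavior the abstract describes. A real setup would replace the stubs with calls to the agent under test and to a rubric-based judge.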