Efficient Agent Evaluation via Diversity-Guided User Simulation
Itay Nakash, George Kour, Ateret Anaby-Tavor

TL;DR
DIVERT is a snapshot-based, coverage-guided user simulation framework for evaluating LLM-based agents. It explores diverse agent-user interaction paths and reuses shared conversation prefixes, so it uncovers failures at lower computational cost.
Contribution
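DIVERT introduces a snapshot-based, coverage-guided approach that captures the agent-environment state at critical decision points and resumes from those snapshots, which reduces redundant computation and improves failure detection in agent evaluation.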
Findings
DIVERT discovers more failures per token than standard linear rollout protocols.
It expands the range of tasks where failures are identified.
The framework improves evaluation efficiency and coverage.
Abstract
Large language models (LLMs) are increasingly deployed as customer-facing agents, yet evaluating their reliability remains challenging due to stochastic, multi-turn interactions. Current evaluation protocols rely on linear Monte Carlo rollouts of complete agent-user conversations to estimate success. However, this approach is computationally inefficient, repeatedly regenerating identical early prefixes, and often fails to uncover deep failure modes that arise from rare user behaviors. We introduce DIVERT (Diversity-Induced Evaluation via Branching of Trajectories), an efficient, snapshot-based, coverage-guided user simulation framework for systematic exploration of agent-user interactions. DIVERT captures the full agent-environment state at critical decision points and resumes execution from these snapshots, enabling reuse of shared conversation prefixes and reducing redundant…
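The prefix-reuse idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: ConversationState, step, linear_rollouts, and branched_rollouts are hypothetical names, the agent reply is a placeholder string, a fixed token cost per turn is assumed, and copy.deepcopy stands in for whatever snapshot mechanism DIVERT actually uses. The coverage-guided choice of which branches to explore is not modeled here.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ConversationState:
    """Toy stand-in for the full agent-environment state (hypothetical)."""
    turns: list = field(default_factory=list)
    tokens_spent: int = 0


def step(state, user_msg, tokens_per_turn=10):
    """Advance the conversation by one turn, in place.

    The agent reply is a placeholder; a real system would call the agent here.
    """
    state.turns.append((user_msg, f"agent-reply-to:{user_msg}"))
    state.tokens_spent += tokens_per_turn
    return state


def linear_rollouts(prefix, variants, tokens_per_turn=10):
    """Baseline: every rollout regenerates the shared prefix from scratch."""
    finals, total_tokens = [], 0
    for variant in variants:
        state = ConversationState()
        for msg in prefix:
            step(state, msg, tokens_per_turn)
        step(state, variant, tokens_per_turn)
        total_tokens += state.tokens_spent
        finals.append(state)
    return finals, total_tokens


def branched_rollouts(prefix, variants, tokens_per_turn=10):
    """Snapshot variant: generate the prefix once, then branch from a copy."""
    state = ConversationState()
    for msg in prefix:
        step(state, msg, tokens_per_turn)
    snapshot = copy.deepcopy(state)          # saved state at the decision point
    total_tokens = snapshot.tokens_spent     # the prefix is paid for only once
    finals = []
    for variant in variants:
        branch = copy.deepcopy(snapshot)     # resume from the snapshot
        step(branch, variant, tokens_per_turn)
        total_tokens += branch.tokens_spent - snapshot.tokens_spent
        finals.append(branch)
    return finals, total_tokens


if __name__ == "__main__":
    prefix = ["hi", "I need a refund", "order 123"]
    variants = ["polite follow-up", "angry follow-up", "off-topic", "silence"]
    _, linear_cost = linear_rollouts(prefix, variants)
    _, branched_cost = branched_rollouts(prefix, variants)
    print(linear_cost, branched_cost)
```

In this toy setting both strategies end with the same four conversations, but the linear baseline pays for the three-turn prefix four times while the branched version pays for it once. The toy agent is deterministic; with a stochastic agent, linear rollouts would also diverge in the prefix, which this sketch does not capture.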
