RealUserSim: Bridging the Reality Gap in Agent Benchmarking via Grounded User Simulation
Ming Zhu, Juntao Tan, Rithesh Murthy, Jielin Qiu, Liangwei Yang, Wenting Zhao, Silvio Savarese, Shelby Heinecke, Huan Wang

TL;DR
RealUserSim grounds LLM user simulators in behavioral profiles extracted from real human-LLM conversations, so that simulated users in agent benchmarks behave more like real ones than unconstrained or directive-driven LLM simulators do.
Contribution
This work is the first to ground LLM user simulators in real behavioral data, enhancing fidelity and revealing new failure modes in agent evaluation.
Findings
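On the PT3 fidelity benchmark (600 conversations across 71+ domains), grounded simulation raises match rate from 24.2% to 45.3% across five behavioral dimensions.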
Grounded simulation exposes three failure mechanisms in agents.
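Hand-crafted behavioral directives in existing benchmarks trigger Directive Amplification: simulator models hyper-interpret the instructions into unnatural behavioral extremes that vary dramatically across simulator models.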
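As a rough illustration of the match-rate finding, the sketch below shows one way such a rate could be computed and how the reported improvement translates into absolute and relative terms. The paper's exact definition is not given in this summary, so the exact-match rule per dimension, the function name match_rate, the dimension names, and the toy labels are all assumptions made for illustration. Only the 24.2% and 45.3% figures come from the source.

```python
from typing import Mapping, Sequence

Labels = Mapping[str, str]  # dimension name -> categorical label (assumed representation)


def match_rate(real: Sequence[Labels], simulated: Sequence[Labels],
               dimensions: Sequence[str]) -> float:
    """Fraction of (conversation, dimension) pairs where the labels agree.

    Assumption: a "match" means exact label equality on one dimension.
    """
    if len(real) != len(simulated):
        raise ValueError("real and simulated must be paired one-to-one")
    total = len(real) * len(dimensions)
    if total == 0:
        raise ValueError("need at least one conversation and one dimension")
    hits = sum(r[d] == s[d] for r, s in zip(real, simulated) for d in dimensions)
    return hits / total


# Hypothetical dimension names and toy labels; the paper uses five dimensions
# that are not named in this summary.
DIMENSIONS = ["length", "formality"]
real_users = [
    {"length": "short", "formality": "casual"},
    {"length": "long", "formality": "casual"},
]
simulated_users = [
    {"length": "short", "formality": "formal"},  # formality mismatch
    {"length": "long", "formality": "casual"},
]
toy_rate = match_rate(real_users, simulated_users, DIMENSIONS)

# Figures reported in the source, in percent.
baseline, grounded = 24.2, 45.3
absolute_gain = grounded - baseline  # percentage points
relative_gain = grounded / baseline  # multiplicative factor
print(f"toy match rate: {toy_rate:.2f}")
print(f"gain: {absolute_gain:.1f} points, {relative_gain:.2f}x")
```

Read this way, the reported result is a gain of about 21 percentage points, or somewhat less than a doubling of the baseline match rate.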
Abstract
LLM-based user simulation is the primary mechanism for end-to-end agent evaluation, yet simulated users are poor proxies for real humans: unconstrained LLM defaults produce a Formalism Ceiling (style match rates of 6-8% against real users), while hand-crafted behavioral directives trigger Directive Amplification, where models hyper-interpret instructions into unnatural behavioral extremes that vary dramatically across simulator models. We present RealUserSim, the first user simulation framework grounded in real behavioral data. From 14,000+ authentic human-LLM conversations (WildChat), we extract 7,275 executable behavioral profiles and use them to ground LLM simulators. A fidelity benchmark (PT3) on 600 conversations across 71+ domains with anti-leakage controls shows that grounded simulation raises match rate from 24.2% to 45.3% across five behavioral dimensions. Agent evaluation on…
