RealUserSim: Bridging the Reality Gap in Agent Benchmarking via Grounded User Simulation

Ming Zhu; Juntao Tan; Rithesh Murthy; Jielin Qiu; Liangwei Yang; Wenting Zhao; Silvio Savarese; Shelby Heinecke; Huan Wang

arXiv:2605.20204 · cs.HC · May 21, 2026

TL;DR

RealUserSim grounds LLM user simulators in behavioral profiles extracted from authentic human-LLM conversations, making agent benchmarking more realistic and reliable than with unconstrained or hand-directed LLM simulators.

Contribution

This work is the first to ground LLM user simulators in real behavioral data, enhancing fidelity and revealing new failure modes in agent evaluation.

Findings

01

Grounded simulation raises the match rate against real users from 24.2% to 45.3% across five behavioral dimensions on the PT3 fidelity benchmark.

02

Grounded simulation exposes three failure mechanisms in agents.

03

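Hand-crafted behavioral directives trigger Directive Amplification: simulator models over-interpret the instructions into unnatural behavioral extremes that vary dramatically across simulator models.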
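To make the match-rate figures in the findings concrete, here is a minimal sketch of how such a score could be computed. It assumes that match rate is the fraction of (conversation, dimension) pairs in which the simulated user's behavioral label equals the real user's label; the page does not define the metric, so this definition, the function name `match_rate`, and the dimension names are all hypothetical.

```python
from typing import Dict, List

# Hypothetical dimension names: the source says only that there are five
# behavioral dimensions and does not name them.
DIMENSIONS = ["dim_1", "dim_2", "dim_3", "dim_4", "dim_5"]


def match_rate(
    real: List[Dict[str, str]],
    simulated: List[Dict[str, str]],
    dimensions: List[str],
) -> float:
    """Fraction of (conversation, dimension) pairs whose labels agree.

    Assumed definition for illustration only, not the paper's metric.
    real[i] and simulated[i] hold the labels for the same conversation.
    """
    if len(real) != len(simulated):
        raise ValueError("real and simulated must cover the same conversations")
    total = len(real) * len(dimensions)
    if total == 0:
        return 0.0
    hits = sum(
        1
        for r, s in zip(real, simulated)
        for d in dimensions
        if r[d] == s[d]
    )
    return hits / total


# Toy data with two conversations and two dimensions (made-up labels).
real_labels = [{"a": "terse", "b": "casual"}, {"a": "terse", "b": "formal"}]
sim_labels = [{"a": "terse", "b": "formal"}, {"a": "verbose", "b": "formal"}]
print(match_rate(real_labels, sim_labels, ["a", "b"]))
```

Under this assumed definition, a score of 45.3% would mean that slightly fewer than half of all per-dimension comparisons agree with the real user.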

Abstract

LLM-based user simulation is the primary mechanism for end-to-end agent evaluation, yet simulated users are poor proxies for real humans: unconstrained LLM defaults produce a Formalism Ceiling (style match rates of 6-8% against real users), while hand-crafted behavioral directives trigger Directive Amplification, where models hyper-interpret instructions into unnatural behavioral extremes that vary dramatically across simulator models. We present RealUserSim, the first user simulation framework grounded in real behavioral data. From 14,000+ authentic human-LLM conversations (WildChat), we extract 7,275 executable behavioral profiles and use them to ground LLM simulators. A fidelity benchmark (PT3) on 600 conversations across 71+ domains with anti-leakage controls shows that grounded simulation raises match rate from 24.2% to 45.3% across five behavioral dimensions. Agent evaluation on…
