Mind the Sim2Real Gap in User Simulation for Agentic Tasks

Xuhui Zhou; Weiwei Sun; Qianou Ma; Yiqing Xie; Jiarui Liu; Weihua Du; Sean Welleck; Yiming Yang; Graham Neubig; Sherry Tongshuang Wu; and Maarten Sap

arXiv:2603.11245·cs.AI·March 13, 2026

Open Access

TL;DR

This paper investigates the gap between LLM-based user simulators and real human behaviors in interactive NLP tasks, revealing significant differences and the need for human validation to improve simulation fidelity.

Contribution

The paper introduces the User-Sim Index (USI) metric, benchmarks 31 LLM simulators against real human data, and documents behavioral and evaluation discrepancies in current user simulation methods.

Findings

01

LLM simulators are overly cooperative and lack realistic frustration.

02

Simulated feedback is uniformly positive, unlike nuanced human feedback.

03

Higher model capability does not guarantee more faithful user simulation.

Abstract

As NLP evaluation shifts from static benchmarks to multi-turn interactive settings, LLM-based simulators have become widely used as user proxies, serving two roles: generating user turns and providing evaluation signals. Yet, these simulations are frequently assumed to be faithful to real human behaviors, often without rigorous verification. We formalize the Sim2Real gap in user simulation and present the first study running the full $\tau$-bench protocol with real humans (451 participants, 165 tasks), benchmarking 31 LLM simulators across proprietary, open-source, and specialized families using the User-Sim Index (USI), a metric we introduce to quantify how well LLM simulators resemble real user interactive behaviors and feedback. Behaviorally, LLM simulators are excessively cooperative, stylistically uniform, and lack realistic frustration or ambiguity, creating an "easy mode" that…

Taxonomy

Topics: AI in Service Interactions · Social Robot Interaction and HRI · Speech and dialogue systems