Lost in Simulation: LLM-Simulated Users are Unreliable Proxies for Human Users in Agentic Evaluations
Preethi Seshadri, Samuel Cahyawijaya, Ayomide Odumakinde, Sameer Singh, Seraphina Goldfarb-Tarrant

TL;DR
This study critically examines the reliability of LLM-simulated users in agentic evaluations, revealing significant biases, variability, and limitations that call into question their validity as proxies for real human users across diverse populations.
Contribution
The paper provides empirical evidence that LLM-simulated users are unreliable proxies, highlighting biases, calibration issues, and demographic disparities in agent evaluation outcomes.
Findings
Agent success rates vary by up to 9 percentage points depending on which LLM simulates the user.
Evaluations with simulated users systematically underestimate agent performance on challenging tasks and overestimate it on moderately difficult ones.
Disparities in success and calibration are pronounced for AAVE speakers and increase with age.
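The two quantities behind these findings, the spread in agent success rate across user LLMs and the gap between simulated and human evaluations, can be sketched as follows. This is a minimal illustration, not the paper's analysis code: the function names, the simulator labels, and every outcome value are hypothetical, and the paper's actual calibration metric may be defined differently.

```python
from statistics import mean


def success_rate(outcomes):
    """Percentage of episodes in which the agent completed the task."""
    return 100.0 * mean(1 if ok else 0 for ok in outcomes)


def spread_pp(outcomes_by_user_llm):
    """Max minus min agent success rate across user simulators, in percentage points."""
    rates = {name: success_rate(o) for name, o in outcomes_by_user_llm.items()}
    return max(rates.values()) - min(rates.values()), rates


def calibration_gap_pp(simulated, human):
    """Simulated minus human success rate; negative means the simulation underestimates."""
    return success_rate(simulated) - success_rate(human)


# Made-up outcomes for illustration only; these are NOT data from the paper.
toy_outcomes = {
    "user_llm_a": [True, True, False, True],
    "user_llm_b": [True, False, False, True],
}
spread, rates = spread_pp(toy_outcomes)

# Same (hypothetical) task set evaluated once with a simulated user, once with humans.
gap = calibration_gap_pp(
    simulated=[True, False, False, False],
    human=[True, True, False, False],
)

print(f"spread across user LLMs: {spread:.1f} pp")
print(f"simulated minus human:   {gap:+.1f} pp")
```

Reporting differences in percentage points rather than relative percent keeps the numbers comparable across tasks with very different baseline success rates.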
Abstract
Agentic benchmarks increasingly rely on LLM-simulated users to scalably evaluate agent performance, yet the robustness, validity, and fairness of this approach remain unexamined. Through a user study with participants across the United States, India, Kenya, and Nigeria, we investigate whether LLM-simulated users serve as reliable proxies for real human users in evaluating agents on τ-Bench retail tasks. We find that user simulation lacks robustness, with agent success rates varying up to 9 percentage points across different user LLMs. Furthermore, evaluations using simulated users exhibit systematic miscalibration, underestimating agent performance on challenging tasks and overestimating it on moderately difficult ones. African American Vernacular English (AAVE) speakers experience consistently worse success rates and calibration errors than Standard American English (SAE)…
Taxonomy
Topics: AI in Service Interactions · Speech and dialogue systems · Language and cultural evolution
