Lost in Simulation: LLM-Simulated Users are Unreliable Proxies for Human Users in Agentic Evaluations

Preethi Seshadri; Samuel Cahyawijaya; Ayomide Odumakinde; Sameer Singh; Seraphina Goldfarb-Tarrant

arXiv:2601.17087·cs.HC·January 29, 2026

Open Access

TL;DR

This study examines whether LLM-simulated users are reliable stand-ins for humans in agentic evaluations. It finds biases, variability across simulator LLMs, and miscalibration that call into question their validity as proxies for real users across diverse populations.

Contribution

The paper provides empirical evidence that LLM-simulated users are unreliable proxies, highlighting biases, calibration issues, and demographic disparities in agent evaluation outcomes.

Findings

01

Agent success rates vary by up to 9 percentage points depending on which LLM simulates the user.

02

Evaluations with simulated users systematically underestimate agent performance on challenging tasks and overestimate it on moderately difficult ones.

03

Disparities in success and calibration are pronounced for AAVE speakers and increase with age.
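
To make the first two findings concrete, here is a minimal Python sketch of how a robustness spread and a per-difficulty calibration gap could be computed. It is not the authors' code or methodology: the function names, the difficulty labels, and every outcome value are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only. All names and numbers below are hypothetical
# and are NOT taken from the paper.

def success_rate_pct(outcomes):
    """Percentage of tasks where the agent succeeded (1 = success, 0 = failure)."""
    return 100 * sum(outcomes) / len(outcomes)

def robustness_spread(simulated):
    """Max minus min agent success rate (percentage points) across user LLMs."""
    rates = [success_rate_pct(o) for o in simulated.values()]
    return max(rates) - min(rates)

def calibration_gap_by_bin(sim_outcomes, human_outcomes, bins):
    """Simulated minus human success rate (percentage points) per difficulty bin.

    Negative values mean the simulated user underestimates agent performance;
    positive values mean it overestimates.
    """
    gaps = {}
    for label in sorted(set(bins)):
        idx = [i for i, b in enumerate(bins) if b == label]
        sim = success_rate_pct([sim_outcomes[i] for i in idx])
        hum = success_rate_pct([human_outcomes[i] for i in idx])
        gaps[label] = sim - hum
    return gaps

# Made-up outcomes for 8 tasks: 4 labeled "hard", 4 labeled "moderate".
difficulty = ["hard"] * 4 + ["moderate"] * 4
human = [1, 1, 0, 0, 1, 0, 1, 0]
simulated = {
    "user_llm_a": [1, 0, 0, 0, 1, 1, 1, 0],
    "user_llm_b": [0, 0, 0, 0, 1, 1, 0, 0],
}

spread = robustness_spread(simulated)
gaps = calibration_gap_by_bin(simulated["user_llm_a"], human, difficulty)
print(f"spread across user LLMs: {spread} pp")
print(f"calibration gaps for user_llm_a: {gaps}")
```

In this toy data the gap is negative on the "hard" bin and positive on the "moderate" bin, which mirrors the direction of miscalibration described above but says nothing about its actual size.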

Abstract

Agentic benchmarks increasingly rely on LLM-simulated users to scalably evaluate agent performance, yet the robustness, validity, and fairness of this approach remain unexamined. Through a user study with participants across the United States, India, Kenya, and Nigeria, we investigate whether LLM-simulated users serve as reliable proxies for real human users in evaluating agents on τ-Bench retail tasks. We find that user simulation lacks robustness, with agent success rates varying up to 9 percentage points across different user LLMs. Furthermore, evaluations using simulated users exhibit systematic miscalibration, underestimating agent performance on challenging tasks and overestimating it on moderately difficult ones. African American Vernacular English (AAVE) speakers experience consistently worse success rates and calibration errors than Standard American English (SAE)…


Taxonomy

Topics: AI in Service Interactions · Speech and dialogue systems · Language and cultural evolution