MirrorBench: A Benchmark to Evaluate Conversational User-Proxy Agents for Human-Likeness

Ashutosh Hathidara; Julien Yu; Vaishali Senthil; Sebastian Schreiber; Anil Babu Ankisettipalli

arXiv:2601.08118·cs.AI·May 19, 2026

TL;DR

MirrorBench is a benchmarking framework that evaluates how human-like conversational user-proxy agents are, using lexical-diversity and LLM-judge-based metrics, independent of task success.

Contribution

It introduces a reproducible, extensible framework with novel metrics and calibration controls for assessing user proxies' human-likeness in dialogue systems.

Findings

01. MirrorBench reveals systematic gaps between user proxies and real humans.

02. The framework provides variance-aware comparisons across datasets.

03. The open-source implementation facilitates reproducibility and further research.

Abstract

Large language models (LLMs) are increasingly used as human simulators, both for evaluating conversational systems and for generating fine-tuning data. However, naive "act-as-a-user" prompting often yields verbose, unrealistic utterances, motivating principled evaluation of *user proxy agents*. We present **MirrorBench**, a reproducible and extensible benchmarking framework that evaluates user proxies solely on their ability to produce human-like user utterances across diverse conversational regimes, explicitly decoupled from downstream task success. **MirrorBench** combines three lexical-diversity metrics (**MATTR**, **Yule's $K$**, and **HD-D**) with three LLM-judge-based metrics (**GTEval**, **Pairwise Indistinguishability**, and **Rubric-and-Reason**), and contextualizes judge scores using Human-Human and Proxy-Proxy calibration controls. Across four public datasets, **MirrorBench**…
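For readers unfamiliar with the three lexical-diversity metrics named in the abstract, the following is a minimal sketch based on their standard textbook definitions. It is not taken from the MirrorBench code base. The function names, the whitespace tokenization, the window size of 50, and the sample size of 42 are assumptions chosen for illustration, and the framework may tokenize or parameterize differently.

```python
from collections import Counter
from math import comb


def mattr(tokens, window=50):
    """Moving-average type-token ratio: mean TTR over all sliding windows.

    The window size is an illustrative assumption, not a MirrorBench setting.
    """
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        # Too short for a full window: fall back to the plain TTR.
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


def yules_k(tokens):
    """Yule's K = 10^4 * (sum_f f^2 - N) / N^2 over type frequencies f.

    Higher values indicate more repetition (lower diversity).
    """
    n = len(tokens)
    if n == 0:
        return 0.0
    freqs = Counter(tokens).values()
    return 1e4 * (sum(f * f for f in freqs) - n) / (n * n)


def hdd(tokens, sample_size=42):
    """HD-D: expected TTR of a random sample drawn without replacement.

    For each type, the hypergeometric probability of appearing at least
    once in the sample is summed, then divided by the sample size.
    """
    n = len(tokens)
    if n < sample_size:
        raise ValueError("text is shorter than the sample size")
    total = 0.0
    for f in Counter(tokens).values():
        # Probability that none of the f occurrences lands in the sample.
        p_absent = comb(n - f, sample_size) / comb(n, sample_size)
        total += (1.0 - p_absent) / sample_size
    return total


if __name__ == "__main__":
    # Hypothetical user utterance, tokenized naively on whitespace.
    utterance = "hi can you help me book a flight i need a flight to boston"
    toks = utterance.split()
    print(mattr(toks, window=5), yules_k(toks), hdd(toks, sample_size=5))
```

MATTR and HD-D both correct the plain type-token ratio for its sensitivity to text length, which matters when comparing short human turns against the longer utterances that user proxies tend to produce. Yule's $K$ moves in the opposite direction: larger values mean more repetition.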

Code & Models

Repositories

SAP/mirrorbench (GitHub)

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Mobile Crowdsensing and Crowdsourcing