Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants
Joseph Suh, Ayush Raj, Minwoo Kang, Serina Chang

TL;DR
This paper investigates how the quality of a user simulator for training LLM assistants can be measured by how well the resulting assistants perform with real users, and emphasizes the importance of grounding simulators in real human data.
Contribution
It introduces a method that quantifies simulator quality by the downstream performance, with real users, of assistants trained against each simulator, and it compares different simulator training approaches.
Findings
Training with fine-tuned simulators improves assistant performance in user studies.
Role-playing LLMs benefit from persona conditioning but do not match fine-tuned simulators.
Scaling simulator size enhances fine-tuned models but not role-playing ones.
Abstract
User simulators are increasingly leveraged to build interactive AI assistants, yet how to measure the quality of these simulators remains an open question. In this work, we show how simulator quality can be quantified in terms of its downstream utility: how an LLM assistant trained with this user simulator performs in the wild when interacting with real humans. In a controlled experiment where only the user simulator varies, we train LLM assistants via reinforcement learning against a spectrum of simulators, from an LLM prompted to role-play a user to one fine-tuned on human utterances from WildChat. As evaluation, we measure pairwise win rates in a user study with 283 participants and on WildBench, a benchmark derived from real human–AI conversations. Training against the role-playing LLM yields an assistant statistically indistinguishable from the initial assistant in our user study…
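To make the evaluation metric concrete, here is a minimal sketch of how a pairwise win rate and a check for "statistically indistinguishable" could be computed. The abstract excerpt does not say which statistical test the authors use, so the exact two-sided binomial test below is an assumption, as are the function names, the choice to count ties as half a win, and the choice to drop ties from the test.

```python
from math import comb


def win_rate(wins: int, losses: int, ties: int = 0) -> float:
    """Pairwise win rate of assistant A over assistant B.

    Assumption: a tie counts as half a win (a common convention,
    not necessarily the one used in the paper).
    """
    total = wins + losses + ties
    if total == 0:
        raise ValueError("no comparisons recorded")
    return (wins + 0.5 * ties) / total


def binomial_two_sided_p(wins: int, losses: int) -> float:
    """Exact two-sided p-value against the null of a 50% win rate.

    Ties are excluded (an assumption). The p-value sums the probability
    of every outcome that is no more likely than the observed one.
    """
    n = wins + losses
    if n == 0:
        raise ValueError("no decisive comparisons recorded")
    pmf = [comb(n, k) / 2**n for k in range(n + 1)]
    observed = pmf[wins]
    # Small relative tolerance guards against floating-point noise.
    p_value = sum(p for p in pmf if p <= observed * (1 + 1e-9))
    return min(1.0, p_value)


if __name__ == "__main__":
    # Hypothetical counts for illustration only, not results from the paper.
    wins, losses, ties = 120, 100, 30
    print(f"win rate: {win_rate(wins, losses, ties):.3f}")
    print(f"p-value vs. 50%: {binomial_two_sided_p(wins, losses):.3f}")
```

Under this sketch, an assistant would be called indistinguishable from the baseline when the p-value stays above a chosen significance threshold, which the reader would need to set.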
