SalesSim: Benchmarking and Aligning Multimodal Language Models as Retail User Simulators
Yada Pruksachatkun, Elaine Wan, Lyanna Chen, Kai-Wei Chang, Chien-Sheng Wu

TL;DR
SalesSim is a new framework for evaluating multimodal language models' ability to simulate realistic, persona-driven retail conversations, highlighting current limitations and proposing reinforcement learning improvements.
Contribution
The paper introduces SalesSim, a benchmark and testbed for assessing and improving multimodal language models' fidelity in retail user simulation, emphasizing decision alignment and conversational quality.
Findings
Models produce fluent conversations but lack lexical diversity.
Models tend to be persuaded by suggestions and drift from personas.
UserGRPO improves decision alignment by 13.8% and also improves conversational quality.
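The summary does not say how lexical diversity is measured. One common proxy is distinct-n, the ratio of unique n-grams to total n-grams across a simulator's utterances. The sketch below is an illustration under that assumption, not the paper's metric; the function name `distinct_n` and the whitespace tokenization are hypothetical choices.

```python
from collections import Counter


def distinct_n(utterances, n=2):
    """Return unique n-grams / total n-grams over a list of utterances.

    Assumption: lowercase whitespace tokenization; the paper may use a
    different tokenizer or a different diversity measure entirely.
    """
    ngrams = Counter()
    for text in utterances:
        tokens = text.lower().split()
        # Slide a window of size n over the token sequence.
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
    total = sum(ngrams.values())
    # An empty input has no n-grams; report 0.0 rather than dividing by zero.
    return len(ngrams) / total if total else 0.0
```

A value near 1.0 means almost every n-gram is new, while a low value indicates repetitive phrasing of the kind the first finding describes.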
Abstract
We present SalesSim, a framework and testbed for evaluating the ability of Multimodal Large Language Models (MLLMs) to simulate realistic, persona-driven customer behavior in multi-turn, multi-modal, tool-augmented online retail conversations. Unlike prior work that treats user simulation as surface-level dialogue generation, SalesSim models retail interaction and decision-making as a grounded, agentic process, where shoppers with diverse backgrounds, preferences, and dealbreakers interact with a sales agent, seek clarifications, and make informed purchasing decisions. For evaluation, we design a suite of metrics centered on decision alignment, measuring the consistency between the simulator's actions and its persona specifications, as well as conversational quality. We find several behavioral gaps after benchmarking six open- and closed-source state-of-the-art models. First, while models…
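To make the idea of decision alignment concrete, here is a minimal sketch of how consistency between a simulated shopper's final action and its persona might be scored. Everything in it is an assumption for illustration: the `Persona` and `Decision` classes, the budget and dealbreaker fields, and the rule that walking away without buying counts as aligned are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Persona:
    # Hypothetical persona specification: a spending cap and attribute
    # tags the shopper refuses to accept.
    budget: float
    dealbreakers: set = field(default_factory=set)


@dataclass
class Decision:
    # Hypothetical record of the simulator's final action.
    purchased: bool
    price: float = 0.0
    attributes: set = field(default_factory=set)


def is_aligned(persona, decision):
    """True if the decision respects the persona's constraints.

    Assumption: declining to purchase never violates the persona.
    """
    if not decision.purchased:
        return True
    within_budget = decision.price <= persona.budget
    no_dealbreaker = persona.dealbreakers.isdisjoint(decision.attributes)
    return within_budget and no_dealbreaker


def decision_alignment_rate(pairs):
    """Fraction of (persona, decision) pairs that are aligned."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(is_aligned(p, d) for p, d in pairs) / len(pairs)
```

Under this toy rule, a simulator that is talked into an over-budget item or one carrying a dealbreaker attribute is scored as misaligned, which mirrors the persona-drift behavior reported in the findings.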
