ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders

Ofer Meshi; Krisztian Balog; Sally Goldman; Avi Caciularu; Guy Tennenholtz; Jihwan Jeong; Amir Globerson; Craig Boutilier

arXiv:2602.16938·cs.CL·February 20, 2026

TL;DR

This paper introduces ConvApparel, a dataset and validation framework for user simulators in conversational recommenders. It targets the realism gap between simulated and real users and uses counterfactual validation to test whether simulators generalize.

Contribution

The paper contributes a dataset collected with a dual-agent protocol (a "good" and a "bad" recommender) and a validation framework that combines statistical alignment, a human-likeness score, and counterfactual validation to evaluate user simulators.

Findings

01 Data-driven simulators outperform prompted baselines in realism.

02 Simulators adapt more realistically to unseen behaviors.

03 Significant realism gap identified across all tested simulators.

Abstract

The promise of LLM-based user simulators to improve conversational AI is hindered by a critical "realism gap," leading to systems that are optimized for simulated interactions, but may fail to perform well in the real world. We introduce ConvApparel, a new dataset of human-AI conversations designed to address this gap. Its unique dual-agent data collection protocol -- using both "good" and "bad" recommenders -- enables counterfactual validation by capturing a wide spectrum of user experiences, enriched with first-person annotations of user satisfaction. We propose a comprehensive validation framework that combines statistical alignment, a human-likeness score, and counterfactual validation to test for generalization. Our experiments reveal a significant realism gap across all simulators. However, the framework also shows that data-driven simulators outperform a prompted baseline,…
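As a rough illustration of how statistical alignment and counterfactual validation could fit together, the sketch below compares a per-conversation statistic between human and simulated dialogues, separately for the "good" and "bad" recommender conditions. The abstract does not say which statistics or distance measure the framework uses, so the choice of turn counts, the Jensen-Shannon divergence, and all names and numbers here are assumptions made for illustration only.

```python
import math
from collections import Counter


def distribution(values):
    """Normalize a list of discrete observations into a probability dict."""
    counts = Counter(values)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    Returns 0 for identical distributions and 1 for disjoint supports.
    """
    support = set(p) | set(q)
    # Mixture distribution over the union of both supports.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in support}

    def kl(a, b):
        return sum(a[k] * math.log2(a[k] / b[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical turn counts per conversation (not taken from ConvApparel).
human = {"good": [3, 4, 4, 5], "bad": [6, 7, 7, 8]}
# A hypothetical simulator that behaves the same under both recommenders.
simulated = {"good": [3, 4, 4, 5], "bad": [3, 4, 4, 5]}

gaps = {
    cond: js_divergence(distribution(human[cond]), distribution(simulated[cond]))
    for cond in human
}
for cond, gap in gaps.items():
    print(f"{cond} recommender: JS divergence = {gap:.3f}")
```

In this toy setup the simulator matches human behavior under the "good" recommender but shows a large gap under the "bad" one. That is the kind of failure a counterfactual check is meant to surface: a simulator that looks aligned in one condition without adapting when the recommender changes.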

Code & Models

Datasets

google/ConvApparel
dataset · 32 dl

Videos

ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders

Taxonomy

Topics: Recommender Systems and Techniques · Explainable Artificial Intelligence (XAI) · Mobile Crowdsensing and Crowdsourcing