On the Reliability of User-Centric Evaluation of Conversational Recommender Systems
Michael Müller, Amir Reza Mohammadi, Andreas Peintner, Beatriz Barroso Gstrein, Günther Specht, Eva Zangerle

TL;DR
This study examines the reliability of user-centric evaluation methods for conversational recommender systems, revealing significant variability and halo effects in third-party annotations that impact evaluation validity.
Contribution
It provides the first large-scale empirical analysis of the reliability of static dialogue annotations, highlighting limitations of current third-party evaluation practices.
Findings
Utilitarian dimensions like accuracy and satisfaction are moderately reliable.
Socially grounded dimensions such as humanness and rapport are less reliable.
Many dimensions tend to collapse into a single global quality score due to halo effects.
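One common way to check whether rating dimensions collapse into a single global score is to measure how much of the total variance the first principal component of the inter-dimension correlation matrix explains. The sketch below illustrates that idea only. It is not necessarily the analysis used in the paper, and the function name `first_component_share` and the toy data are hypothetical.

```python
import numpy as np

def first_component_share(scores):
    """Share of variance explained by the first principal component.

    `scores` is an (annotations x dimensions) array of ratings.
    A value near 1.0 suggests the dimensions move together (halo-like);
    a value near 1/num_dimensions suggests they are largely independent.
    """
    scores = np.asarray(scores, dtype=float)
    # Pearson correlations between dimensions (columns).
    corr = np.corrcoef(scores, rowvar=False)
    # eigvalsh returns eigenvalues of a symmetric matrix in ascending order.
    eigvals = np.linalg.eigvalsh(corr)
    return eigvals[-1] / eigvals.sum()

# Toy data (illustrative, not from the study): three dimensions that
# are exact multiples of each other, so they correlate perfectly.
halo_like = [[1, 2, 3], [2, 4, 6], [3, 6, 9], [4, 8, 12]]
share = first_component_share(halo_like)
```

With perfectly correlated columns the share is 1.0 up to floating-point error; with uncorrelated columns it drops toward one divided by the number of dimensions.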
Abstract
User-centric evaluation has become a key paradigm for assessing Conversational Recommender Systems (CRS), aiming to capture subjective qualities such as satisfaction, trust, and rapport. To enable scalable evaluation, recent work increasingly relies on third-party annotations of static dialogue logs by crowd workers or large language models. However, the reliability of this practice remains largely unexamined. In this paper, we present a large-scale empirical study investigating the reliability and structure of user-centric CRS evaluation on static dialogue transcripts. We collected 1,053 annotations from 124 crowd workers on 200 ReDial dialogues using the 18-dimensional CRS-Que framework. Using random-effects reliability models and correlation analysis, we quantify the stability of individual dimensions and their interdependencies. Our results show that utilitarian and outcome-oriented…
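The abstract mentions random-effects reliability models without specifying them in this excerpt. As a rough illustration of the general idea, the sketch below computes a one-way random-effects intraclass correlation, ICC(1), for a single rating dimension. It assumes a balanced design in which every dialogue is rated by the same number of annotators, which the study's data (1,053 annotations over 200 dialogues) does not satisfy exactly, so treat it as a simplified stand-in rather than the authors' model. The function name `icc_oneway` is hypothetical.

```python
import numpy as np

def icc_oneway(ratings):
    """ICC(1): one-way random-effects reliability of a single rating.

    `ratings` is a balanced (items x raters) array, for example
    dialogues x annotators for one evaluation dimension.
    """
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand_mean = ratings.mean()
    item_means = ratings.mean(axis=1)
    # Between-item mean square: how much item means differ from each other.
    ms_between = k * ((item_means - grand_mean) ** 2).sum() / (n - 1)
    # Within-item mean square: how much raters disagree on the same item.
    ms_within = ((ratings - item_means[:, None]) ** 2).sum() / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Toy example (illustrative): three dialogues, three annotators,
# with full agreement within each dialogue.
perfect = [[1, 1, 1], [3, 3, 3], [5, 5, 5]]
icc = icc_oneway(perfect)
```

Values close to 1 indicate that most rating variance comes from differences between dialogues, while values near or below 0 indicate that annotator disagreement dominates.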
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · AI in Service Interactions · Expert finding and Q&A systems
