On the Reliability of User-Centric Evaluation of Conversational Recommender Systems

Michael Müller; Amir Reza Mohammadi; Andreas Peintner; Beatriz Barroso Gstrein; Günther Specht; Eva Zangerle

arXiv:2602.17264 · cs.IR · February 20, 2026

Open Access

TL;DR

This study examines the reliability of user-centric evaluation methods for conversational recommender systems. It finds substantial variability and halo effects in third-party annotations, both of which threaten evaluation validity.

Contribution

The paper provides the first large-scale empirical analysis of the reliability of static dialogue annotations and highlights the limitations of current third-party evaluation practices.

Findings

01. Utilitarian dimensions like accuracy and satisfaction are moderately reliable.
02. Socially grounded dimensions such as humanness and rapport are less reliable.
03. Many dimensions tend to collapse into a single global quality score due to halo effects.

Abstract

User-centric evaluation has become a key paradigm for assessing Conversational Recommender Systems (CRS), aiming to capture subjective qualities such as satisfaction, trust, and rapport. To enable scalable evaluation, recent work increasingly relies on third-party annotations of static dialogue logs by crowd workers or large language models. However, the reliability of this practice remains largely unexamined. In this paper, we present a large-scale empirical study investigating the reliability and structure of user-centric CRS evaluation on static dialogue transcripts. We collected 1,053 annotations from 124 crowd workers on 200 ReDial dialogues using the 18-dimensional CRS-Que framework. Using random-effects reliability models and correlation analysis, we quantify the stability of individual dimensions and their interdependencies. Our results show that utilitarian and outcome-oriented…
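The abstract names "random-effects reliability models and correlation analysis" without giving the exact specification in this excerpt. As a rough illustration only, the sketch below computes ICC(1), a common one-way random-effects reliability estimate that tolerates an unequal number of annotators per dialogue, together with a Pearson correlation between per-dialogue means of two dimensions as a simple halo check. The choice of ICC(1), the function names `icc1` and `pearson`, and the toy ratings are all assumptions made for this example; none of them come from the paper.

```python
from statistics import mean


def icc1(ratings_by_dialogue):
    """One-way random-effects ICC(1); groups may have unequal sizes.

    ratings_by_dialogue maps a dialogue id to the list of scores that
    different annotators gave for ONE evaluation dimension.
    """
    groups = [g for g in ratings_by_dialogue.values() if g]
    k = len(groups)                                   # number of dialogues
    n_total = sum(len(g) for g in groups)             # number of annotations
    grand = sum(sum(g) for g in groups) / n_total     # grand mean

    # Between-dialogue and within-dialogue mean squares (one-way ANOVA).
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)

    # Effective group size; equals the common size when groups are balanced.
    n0 = (n_total - sum(len(g) ** 2 for g in groups) / n_total) / (k - 1)
    return (ms_between - ms_within) / (ms_between + (n0 - 1) * ms_within)


def pearson(xs, ys):
    """Pearson correlation between two equally long score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Invented toy ratings (NOT data from the paper): three dialogues,
# three annotators each, on a 1-5 scale.
accuracy = {"d1": [5, 5, 4], "d2": [2, 1, 2], "d3": [3, 4, 3]}
rapport = {"d1": [2, 5, 3], "d2": [4, 1, 3], "d3": [3, 5, 2]}

icc_accuracy = icc1(accuracy)   # annotators largely agree per dialogue
icc_rapport = icc1(rapport)     # annotators disagree within each dialogue

# Halo check: correlate per-dialogue mean scores of the two dimensions.
dialogues = sorted(accuracy)
halo_r = pearson([mean(accuracy[d]) for d in dialogues],
                 [mean(rapport[d]) for d in dialogues])
```

In this toy setup the accuracy ratings yield a much higher ICC than the rapport ratings, because annotators agree within each dialogue on the former and scatter widely on the latter. A consistently high correlation across many dimension pairs would be one symptom of the halo effect the abstract describes, although the paper's own analysis may differ from this sketch.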

Taxonomy

Topics: Mobile Crowdsensing and Crowdsourcing · AI in Service Interactions · Expert finding and Q&A systems