A Standardized Re-evaluation of Conversational Recommender Systems on the ReDial Dataset

Ivica Kostric; Krisztian Balog

arXiv:2605.13053·cs.IR·May 21, 2026

TL;DR

This paper conducts a standardized re-evaluation of conversational recommender systems on the ReDial dataset, highlighting issues in reproducibility, the influence of LLM capacity, and the importance of user-centric metrics.

Contribution

It provides a transparent baseline for CRS evaluation, analyzes factors affecting reproducibility, and advocates for metrics emphasizing novelty and interaction quality.

Findings

01. Recall@1 is highly sensitive to implementation details.

02. Nearly 50% of reported accuracy is due to shortcuts absent in novelty-focused metrics.

03. Performance gains are often due to LLM capacity rather than architectural innovations.

Abstract

Recent years have seen a surge of research into conversational recommender systems (CRS). Among existing datasets, ReDial is the most widely used benchmark, cited in hundreds of studies. However, variations in how the dataset is preprocessed and used in experiments, particularly in the definition of ground-truth items, make it difficult to compare results across studies. These comparisons are further complicated by confounding factors such as the choice of the underlying large language model (LLM) and the use of external data sources. In this work, we revisit seven prominent CRS methods across three architectural families and evaluate them under standardized conditions. Our reproducibility study reveals a "granularity gap," where fine-grained ranking (Recall@1) is highly sensitive to implementation details, while our replicability analysis shows that nearly 50% of reported accuracy…
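
As a rough illustration of why the definition of ground-truth items can move Recall@k so much, the sketch below scores the same toy rankings twice: once counting every target item, and once counting only targets that were not already mentioned earlier in the conversation. This is not the authors' evaluation protocol. The function name `mean_recall_at_k`, the `novel_only` flag, the record fields, and the toy data are all hypothetical, and treating "repeat an item already mentioned" as the shortcut is an assumption made only for this example.

```python
from typing import Dict, List


def mean_recall_at_k(examples: List[Dict], k: int, novel_only: bool = False) -> float:
    """Average hit rate of the target item within the top-k ranked items.

    Each example is a dict with hypothetical fields:
      "target":    the ground-truth item for this recommendation turn
      "mentioned": set of items already mentioned in the dialogue context
      "ranked":    the system's ranked list of candidate items
    With novel_only=True, turns whose target was already mentioned are skipped.
    """
    hits = []
    for ex in examples:
        if novel_only and ex["target"] in ex["mentioned"]:
            continue  # assumed shortcut: the target merely repeats the context
        hits.append(1.0 if ex["target"] in ex["ranked"][:k] else 0.0)
    return sum(hits) / len(hits) if hits else 0.0


# Toy data invented for illustration; not taken from ReDial.
EXAMPLES = [
    {"target": "A", "mentioned": {"A"}, "ranked": ["A", "B", "C"]},
    {"target": "B", "mentioned": {"A"}, "ranked": ["C", "B", "A"]},
    {"target": "C", "mentioned": set(), "ranked": ["C", "A", "B"]},
    {"target": "D", "mentioned": {"D", "A"}, "ranked": ["D", "A", "B"]},
]

all_items = mean_recall_at_k(EXAMPLES, k=1)
novel_items = mean_recall_at_k(EXAMPLES, k=1, novel_only=True)
print(f"Recall@1 (all targets):   {all_items:.2f}")
print(f"Recall@1 (novel targets): {novel_items:.2f}")
```

On this toy data the two definitions give different Recall@1 values for identical system output, which is the kind of discrepancy that makes cross-study comparison hard when the ground-truth definition is not reported.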
