A Standardized Re-evaluation of Conversational Recommender Systems on the ReDial Dataset
Ivica Kostric, Krisztian Balog

TL;DR
This paper conducts a standardized re-evaluation of conversational recommender systems on the ReDial dataset, highlighting issues in reproducibility, the influence of LLM capacity, and the importance of user-centric metrics.
Contribution
It provides a transparent baseline for CRS evaluation, analyzes factors affecting reproducibility, and advocates for metrics emphasizing novelty and interaction quality.
Findings
Recall@1 is highly sensitive to implementation details.
Nearly 50% of reported accuracy is due to shortcuts absent in novelty-focused metrics.
Performance gains are often due to LLM capacity rather than architectural innovations.
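To make the difference between standard and novelty-focused accuracy concrete, here is a minimal Python sketch. It assumes that the "shortcuts" are credit earned for recommending items already mentioned earlier in the dialogue, which is an interpretation and not something this summary confirms. The function names, the item IDs, and the per-turn recall formulation are hypothetical illustrations, not the authors' evaluation code.

```python
def recall_at_k(ranked, targets, k):
    """Fraction of distinct target items that appear in the top-k of the ranking."""
    target_set = set(targets)
    if not target_set:
        return 0.0
    return len(set(ranked[:k]) & target_set) / len(target_set)


def novel_targets(targets, context_items):
    """Assumed novelty filter: drop targets already mentioned in the dialogue context."""
    seen = set(context_items)
    return [t for t in targets if t not in seen]


# Hypothetical toy example: one dialogue turn, four candidate items.
ranked = ["m1", "m2", "m3", "m4"]   # system ranking, best first
targets = ["m1", "m3"]              # ground-truth items for this turn
context = ["m1"]                    # items already mentioned in the dialogue

standard_r1 = recall_at_k(ranked, targets, 1)
novel_r1 = recall_at_k(ranked, novel_targets(targets, context), 1)
novel_r3 = recall_at_k(ranked, novel_targets(targets, context), 3)
```

In this toy case the top-ranked item is one the user already mentioned, so it counts toward standard Recall@1 but not toward the novelty-filtered variant. It also shows how a small change in the ground-truth definition can shift a Recall@1 score.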
Abstract
Recent years have seen a surge of research into conversational recommender systems (CRS). Among existing datasets, ReDial is the most widely used benchmark, cited in hundreds of studies. However, variations in how the dataset is preprocessed and used in experiments, particularly in the definition of ground-truth items, make it difficult to compare results across studies. These comparisons are further complicated by confounding factors such as the choice of the underlying large language model (LLM) and the use of external data sources. In this work, we revisit seven prominent CRS methods across three architectural families and evaluate them under standardized conditions. Our reproducibility study reveals a "granularity gap," where fine-grained ranking (Recall@1) is highly sensitive to implementation details, while our replicability analysis shows that nearly 50% of reported accuracy…
