Do LLM-judges Align with Human Relevance in Cranfield-style Recommender Evaluation?

Gustavo Penha; Aleksandr V. Petrov; Claudia Hauff; Enrico Palumbo; Ali Vardasbi; Edoardo D'Amico; Francesco Fabbri; Alice Wang; Praveen Chandar; Henrik Lindstrom; Hugues Bouchard; Mounia Lalmas

arXiv:2511.23312·cs.IR·December 1, 2025

Open Access

TL;DR

This paper explores using Large Language Models as automatic judges for recommender system evaluation, demonstrating high agreement with human judgments and potential for scalable, standardized assessment.

Contribution

The paper proposes an LLM-judge as a scalable alternative to manual relevance judgments in recommender evaluation and shows that its labels align closely with human labels.

Findings

1. LLM-judge achieves Kendall's tau of 0.87 with human judgments.

2. Incorporating item metadata and user history improves LLM-judge alignment.

3. LLM-judge proves effective in a real-world podcast recommendation case study.
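
To make the Kendall's tau figure in the findings above concrete, here is a minimal sketch of how rank agreement between two sets of judgments can be computed. It assumes the tau is taken over per-system effectiveness scores, once with human labels and once with LLM labels; the summary above does not spell this out, so treat that as an assumption. The function name `kendall_tau`, the five systems, and all score values are hypothetical and are not taken from the paper. The sketch uses the simple tau-a variant with no tie correction.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same items.

    Counts concordant and discordant pairs; tied pairs count as neither
    (no tie correction is applied).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    n = len(scores_a)
    if n < 2:
        raise ValueError("need at least two items to compare")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # Same sign of the two differences means both judges order the pair alike.
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1

    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical effectiveness scores for five recommender systems,
# computed once with human labels and once with LLM-judge labels.
human_scores = [0.42, 0.38, 0.35, 0.30, 0.21]
llm_scores = [0.45, 0.36, 0.39, 0.28, 0.25]

tau = kendall_tau(human_scores, llm_scores)
print(f"Kendall's tau: {tau:.2f}")  # 9 concordant, 1 discordant pair -> 0.80
```

In this toy example the two judges disagree only on the order of the second and third systems, giving a tau of 0.80. A value of 1.0 would mean both sets of labels rank every pair of systems identically.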

Abstract

Evaluating recommender systems remains a long-standing challenge, as offline methods based on historical user interactions and train-test splits often yield unstable and inconsistent results due to exposure bias, popularity bias, sampled evaluations, and missing-not-at-random patterns. In contrast, textual document retrieval benefits from robust, standardized evaluation via Cranfield-style test collections, which combine pooled relevance judgments with controlled setups. While recent work shows that adapting this methodology to recommender systems is feasible, constructing such collections remains costly due to the need for manual relevance judgments, thus limiting scalability. This paper investigates whether Large Language Models (LLMs) can serve as reliable automatic judges to address these scalability challenges. Using the ML-32M-ext Cranfield-style movie recommendation collection,…

Taxonomy

Topics: Recommender Systems and Techniques · Topic Modeling · Information Retrieval and Search Behavior