Judging the Judges: A Collection of LLM-Generated Relevance Judgements

Hossein A. Rahmani; Clemencia Siro; Mohammad Aliannejadi; Nick Craswell; Charles L. A. Clarke; Guglielmo Faggioli; Bhaskar Mitra; Paul Thomas; Emine Yilmaz

arXiv:2502.13908·cs.IR·February 20, 2025·2 cites

Open Access · 1 Repo

TL;DR

This paper evaluates the use of Large Language Models for generating relevance judgments in information retrieval, benchmarking various approaches and analyzing biases and effectiveness compared to human assessments.

Contribution

It introduces the LLMJudge challenge, releasing a collection of 42 sets of LLM-generated relevance labels and benchmarking different methods for automated relevance assessment.

Findings

01. LLMs can produce diverse relevance judgments useful for evaluation.

02. Systematic biases in LLM-generated labels are identified.

03. Ensemble models and methodological improvements enhance automated evaluation.

Abstract

Using Large Language Models (LLMs) for relevance assessments offers promising opportunities to improve Information Retrieval (IR), Natural Language Processing (NLP), and related fields. Indeed, LLMs hold the promise of allowing IR experimenters to build evaluation collections with a fraction of the manual human labor currently required. This could help with fresh topics on which there is still limited knowledge and could mitigate the challenges of evaluating ranking systems in low-resource scenarios, where it is challenging to find human annotators. Given the fast-paced recent developments in the domain, many questions concerning LLMs as assessors are yet to be answered. Among the aspects that require further investigation, we can list the impact of various components in a relevance judgment generation pipeline, such as the prompt used or the LLM chosen. This paper benchmarks and…

Code & Models

Repositories

chuanmeng/qpp-genre
PyTorch · Official

Taxonomy

Topics: Legal Systems and Judicial Processes · Judicial and Constitutional Studies · Law, Economics, and Judicial Systems