Limitations of Automatic Relevance Assessments with Large Language Models for Fair and Reliable Retrieval Evaluation

David Otero; Javier Parapar; Álvaro Barreiro

arXiv:2411.13212·cs.IR·July 23, 2025

Open Access

TL;DR

This paper examines the limitations of using large language models for automatic relevance assessment in IR evaluation. It finds that LLM-based judgements are unfair when ranking top-performing systems and statistically unreliable when testing for significant differences between systems.

Contribution

The paper provides an empirical analysis of how LLM-based relevance judgements affect the fairness of system rankings and the outcome of significance testing, and it highlights where they fall short of human judgements.

Findings

01 LLM-based judgements are unfair when ranking top-performing systems.

02 LLM-based judgements yield a high false positive rate when detecting statistically significant differences between systems.

03 These limitations undermine the reliability of IR system evaluations.
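Finding 02 can be made concrete with a small sketch. It assumes that a "false positive" means a system pair whose difference is significant under LLM-based judgements but not under human judgements, that significance is tested with a paired sign-flip randomization test at alpha = 0.05, and that the per-topic scores are hypothetical values invented for illustration. None of these choices is confirmed by the text above; the paper's actual test and data may differ.

```python
from itertools import product

ALPHA = 0.05  # assumed significance level, not taken from the paper


def paired_sign_flip_p(x, y):
    """Exact two-sided p-value of a paired sign-flip randomization test."""
    diffs = [a - b for a, b in zip(x, y)]
    observed = abs(sum(diffs))
    extreme = total = 0
    # Enumerate every way of flipping the sign of each per-topic difference.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for float error
            extreme += 1
    return extreme / total


# Hypothetical per-topic scores for two systems over 8 topics.
sys2 = [0.45, 0.45, 0.55, 0.35, 0.50, 0.50, 0.30, 0.70]
sys1_human = [0.50, 0.40, 0.60, 0.30, 0.55, 0.45, 0.35, 0.65]
sys1_llm = [0.55, 0.50, 0.65, 0.40, 0.60, 0.55, 0.40, 0.75]

p_human = paired_sign_flip_p(sys1_human, sys2)
p_llm = paired_sign_flip_p(sys1_llm, sys2)

# Significant under LLM judgements only: counted here as a false positive.
false_positive = p_llm < ALPHA and p_human >= ALPHA
print(p_human, p_llm, false_positive)
```

In this toy data the human-judged differences alternate in sign, so the test finds nothing, while the LLM-judged differences all favour the first system and the test reports a significant gap. A false positive rate would be the share of system pairs that fall into this pattern.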

Abstract

Offline evaluation of search systems depends on test collections. These benchmarks provide researchers with a corpus of documents, a set of topics, and relevance judgements indicating which documents are relevant to each topic. While test collections are an integral part of Information Retrieval (IR) research, their creation requires significant manual annotation effort. Large language models (LLMs) are gaining much attention as tools for automatic relevance assessment. Recent research has shown that LLM-based assessments yield system rankings that correlate highly with those obtained from human judgements. These correlations are helpful in large-scale experiments but less informative if we want to focus on top-performing systems. Moreover, they ignore whether and how LLM-based judgements change which differences among systems are statistically significant, compared with human assessments. In this…
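The point that a high ranking correlation says little about top-performing systems can be illustrated with a minimal sketch. It uses Kendall's tau as one common rank correlation (an assumption, since the text above does not name the measure), and the system names and scores are hypothetical values chosen for illustration, not results from the paper.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts over the same systems (no ties)."""
    systems = sorted(scores_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        sign = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


def top_k(scores, k):
    """Return the k best systems, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical mean effectiveness of ten systems under each set of judgements.
human = {"A": 0.50, "B": 0.48, "C": 0.46, "D": 0.40, "E": 0.38,
         "F": 0.36, "G": 0.30, "H": 0.28, "I": 0.26, "J": 0.20}
llm = {"A": 0.47, "B": 0.52, "C": 0.49, "D": 0.40, "E": 0.38,
       "F": 0.36, "G": 0.30, "H": 0.28, "I": 0.26, "J": 0.20}

tau = kendall_tau(human, llm)
best_human = top_k(human, 3)
best_llm = top_k(llm, 3)
print(round(tau, 3), best_human, best_llm)
```

Here only two of the 45 system pairs swap order, so tau is 41/45 (about 0.91), yet the system ranked first by human judgements drops to third under the LLM-based ones. A single correlation over all systems hides exactly the disagreement that matters when comparing the best systems.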

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Softmax · Attention Is All You Need · Focus