Limitations of Automatic Relevance Assessments with Large Language Models for Fair and Reliable Retrieval Evaluation
David Otero, Javier Parapar, Álvaro Barreiro

TL;DR
This paper critically examines the limitations of using large language models for automatic relevance assessments in IR evaluation, revealing issues with fairness and statistical reliability in top-performing system rankings.
Contribution
It provides an empirical analysis of how LLM-based relevance judgements affect system ranking fairness and significance testing, highlighting their shortcomings.
Findings
LLM-based judgements are unfair at ranking top systems.
LLM-based judgements produce a high false positive rate when detecting statistically significant differences between systems.
These limitations undermine the reliability of IR system evaluations.
Abstract
Offline evaluation of search systems depends on test collections. These benchmarks provide researchers with a corpus of documents, topics, and relevance judgements indicating which documents are relevant for each topic. While test collections are an integral part of Information Retrieval (IR) research, their creation involves significant manual annotation effort. Large language models (LLMs) are gaining much attention as tools for automatic relevance assessment. Recent research has shown that LLM-based assessments yield system rankings that correlate highly with those obtained from human-made judgements. These correlations are helpful in large-scale experiments but less informative if we want to focus on top-performing systems. Moreover, these correlations ignore whether and how LLM-based judgements change which differences among systems are statistically significant, relative to human assessments. In this…
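The point about ranking correlations being less informative for top-performing systems can be illustrated with a minimal sketch. It assumes Kendall's tau as the rank correlation (the abstract does not name a measure here), and the scores are entirely hypothetical: 20 systems whose scores under LLM-based judgements match the human-based scores except that the four best systems appear in reverse order. The names `kendall_tau` and `top_k_tau` are illustrative helpers, not part of the paper.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a between two equal-length score lists (no tie correction)."""
    if len(xs) != len(ys):
        raise ValueError("score lists must have the same length")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        product = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


def top_k_tau(reference, candidate, k):
    """Tau restricted to the k systems that score best under the reference."""
    top = sorted(range(len(reference)), key=lambda i: reference[i], reverse=True)[:k]
    return kendall_tau([reference[i] for i in top], [candidate[i] for i in top])


# Hypothetical effectiveness scores for 20 systems (not data from the paper).
human_scores = [0.60 - 0.02 * i for i in range(20)]
# Same scores under LLM judgements, except the top four systems are reversed.
llm_scores = human_scores[:4][::-1] + human_scores[4:]

overall_tau = kendall_tau(human_scores, llm_scores)
top_tau = top_k_tau(human_scores, llm_scores, k=4)

print(f"tau over all systems: {overall_tau:.3f}")
print(f"tau over top 4 systems: {top_tau:.3f}")
```

In this toy setting the correlation over all 20 systems is about 0.94, while the correlation restricted to the four best systems is -1: a high overall correlation can coexist with a completely inverted ordering at the top.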
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Softmax · Attention Is All You Need · Focus
