How Discriminative Are Your Qrels? How To Study the Statistical Significance of Document Adjudication Methods
David Otero, Javier Parapar, Nicola Ferro

TL;DR
This paper introduces a new methodology to evaluate how low-cost document adjudication methods affect the statistical significance of system comparisons in retrieval benchmarks, emphasizing the importance of preserving significance over mere ranking stability.
Contribution
It proposes a novel approach to assess whether adjudication methods preserve the statistically significant differences among systems found under the full collection, a question that the traditional ranking-correlation analysis leaves unanswered.
Findings
Ranking correlation does not always align with significance preservation.
Some adjudication methods better preserve statistical significance than others.
Traditional correlation-based analysis may overlook how low-cost judgements change the outcome of significance tests.
Abstract
Creating test collections for offline retrieval evaluation requires human effort to judge documents' relevance. This expensive activity has motivated much work on methods for constructing benchmarks with lower assessment costs. In this respect, adjudication methods actively decide both which documents experts review and the order in which they review them, in order to better exploit the assessment budget or to lower it. Researchers evaluate the quality of those methods by measuring the correlation between the known gold ranking of systems under the full collection and the observed ranking of systems under the lower-cost one. This traditional analysis ignores whether and how the low-cost judgements affect the statistically significant differences among systems with respect to the full collection. We fill this void by proposing a novel methodology to evaluate how the low-cost adjudication…
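The contrast the abstract draws, ranking correlation versus preserved significant differences, can be illustrated with a small sketch. This is not the authors' methodology: it assumes per-topic effectiveness scores are already available for each system under both the full and the low-cost judgements, uses Kendall's tau-a and a paired sign-flip randomization test as stand-ins, applies no multiple-comparison correction, and all function names and scores are hypothetical.

```python
import itertools
import random
from statistics import mean


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two system orderings given as {system: mean score}."""
    systems = sorted(scores_a)
    concordant = discordant = 0
    for s, t in itertools.combinations(systems, 2):
        product = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


def sign_flip_p_value(x, y, n_resamples=2000, seed=0):
    """Two-sided paired randomization test on per-topic scores (assumed test)."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(x, y)]
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        # Small tolerance so floating-point noise does not hide exact ties.
        if abs(mean(flipped)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_resamples + 1)


def significant_pairs(per_topic, alpha=0.05):
    """Map each significantly different pair (s, t) to +1 if s is better, else -1."""
    result = {}
    for s, t in itertools.combinations(sorted(per_topic), 2):
        if sign_flip_p_value(per_topic[s], per_topic[t]) < alpha:
            result[(s, t)] = 1 if mean(per_topic[s]) > mean(per_topic[t]) else -1
    return result


def compare_collections(full, reduced, alpha=0.05):
    """Report ranking correlation and how many significant pairs survive."""
    means_full = {s: mean(v) for s, v in full.items()}
    means_reduced = {s: mean(v) for s, v in reduced.items()}
    sig_full = significant_pairs(full, alpha)
    sig_reduced = significant_pairs(reduced, alpha)
    preserved = sum(1 for pair, d in sig_full.items() if sig_reduced.get(pair) == d)
    return {
        "tau": kendall_tau(means_full, means_reduced),
        "significant_full": len(sig_full),
        "preserved": preserved,
        "lost_or_flipped": len(sig_full) - preserved,
        "new": sum(1 for pair in sig_reduced if pair not in sig_full),
    }


# Made-up per-topic scores for two systems over ten topics.
full = {
    "A": [0.8, 0.7, 0.9, 0.6, 0.8, 0.7, 0.9, 0.8, 0.7, 0.8],
    "B": [0.6, 0.5, 0.7, 0.4, 0.6, 0.5, 0.7, 0.6, 0.5, 0.6],
}
reduced = {
    "A": [0.8, 0.7, 0.9, 0.6, 0.8, 0.7, 0.9, 0.8, 0.7, 0.8],
    "B": [0.9, 0.5, 0.9, 0.4, 0.9, 0.5, 0.9, 0.6, 0.5, 0.9],
}
print(compare_collections(full, reduced))
```

A report of this kind separates two questions that a single correlation value merges: whether the system ordering is stable, and whether the pairs that were significantly different under the full judgements remain so under the cheaper ones.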
