LLM-Assisted Relevance Assessments: When Should We Ask LLMs for Help?
Rikiya Takehi, Ellen M. Voorhees, Tetsuya Sakai, Ian Soboroff

TL;DR
This paper introduces LLM-Assisted Relevance Assessments (LARA), a method that combines human judgment and LLM predictions to efficiently create reliable test collections for information retrieval evaluation, especially under budget constraints.
Contribution
LARA actively calibrates and debiases LLM relevance predictions to optimize manual annotation efforts, improving test collection quality with limited resources.
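The general idea of spending a limited manual budget where the LLM is least certain, then using those human labels to calibrate the remaining LLM predictions, can be illustrated with a minimal sketch. This is not the authors' algorithm: the selection rule (probability closest to 0.5), the Platt-style calibration, and every function name below are assumptions chosen only for illustration.

```python
import math

# Hypothetical sketch: uncertainty-based selection plus Platt-style calibration.
# None of these names or rules come from the paper.

def logit(p, eps=1e-6):
    """Log-odds of p, clipped away from 0 and 1 to stay finite."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1 / (1 + math.exp(-z))
    e = math.exp(z)
    return e / (1 + e)

def select_for_manual(llm_probs, budget):
    """Return indices of the `budget` documents whose LLM relevance
    probability is closest to 0.5 (assumed to be the most uncertain)."""
    ranked = sorted(range(len(llm_probs)), key=lambda i: abs(llm_probs[i] - 0.5))
    return ranked[:budget]

def fit_calibration(probs, labels, lr=0.1, steps=500):
    """Fit sigmoid(a * logit(p) + b) to human labels by gradient descent
    on the mean log loss, starting from the identity mapping."""
    a, b = 1.0, 0.0
    xs = [logit(p) for p in probs]
    n = len(xs)
    for _ in range(steps):
        errs = [sigmoid(a * x + b) - y for x, y in zip(xs, labels)]
        a -= lr * sum(e * x for e, x in zip(errs, xs)) / n
        b -= lr * sum(errs) / n
    return a, b

def assemble_labels(llm_probs, human_labels, a=1.0, b=0.0):
    """Use the human label where one exists; otherwise threshold the
    calibrated LLM probability at 0.5."""
    labels = []
    for i, p in enumerate(llm_probs):
        if i in human_labels:
            labels.append(human_labels[i])
        else:
            labels.append(int(sigmoid(a * logit(p) + b) >= 0.5))
    return labels
```

For example, with LLM probabilities `[0.95, 0.55, 0.10, 0.48, 0.80, 0.30]` and a budget of two manual judgments, `select_for_manual` picks documents 3 and 1, the two closest to 0.5. Their human labels would then be passed to `fit_calibration` and `assemble_labels` to label the remaining documents.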
Findings
LARA outperforms alternative methods across multiple datasets.
It effectively balances manual and LLM annotations under various budgets.
LARA enhances the reliability of test collections with reduced manual effort.
Abstract
Test collections are information-retrieval tools that allow researchers to quickly and easily evaluate ranking algorithms. While test collections have become an integral part of IR research, the process of data creation involves significant manual-annotation effort, which often makes it very expensive and time-consuming. Consequently, test collections can become too small when the budget is limited, which may lead to unstable evaluations. As a cheaper alternative, recent studies have proposed using large language models (LLMs) to completely replace human assessors. However, while LLMs correlate to some extent with human judgments, their predictions are not perfect and often show bias. Thus, a complete replacement with LLMs is considered too risky and not fully reliable. In this paper, we propose LLM-Assisted Relevance Assessments (LARA), an effective method to balance manual…
Taxonomy
Topics: Artificial Intelligence in Law
Methods: High-Order Consensuses
