LLMJudge: LLMs for Relevance Judgments

Hossein A. Rahmani; Emine Yilmaz; Nick Craswell; Bhaskar Mitra; Paul Thomas; Charles L. A. Clarke; Mohammad Aliannejadi; Clemencia Siro; Guglielmo Faggioli

arXiv:2408.08896·cs.IR·August 20, 2024

TL;DR

The paper discusses the LLMJudge challenge, which investigates using large language models to generate relevance judgments for information retrieval evaluation, aiming to find cost-effective alternatives to human labeling.

Contribution

It introduces a challenge to evaluate LLMs' effectiveness in producing accurate relevance judgments, comparing different models and analyzing biases and data leakage issues.

Findings

1. LLMs can generate reliable relevance judgments

2. Comparison of open-source and closed-source LLMs

3. Insights into biases and data leakage in synthetic data

Abstract

The LLMJudge challenge is organized as part of the LLM4Eval workshop at SIGIR 2024. Test collections are essential for evaluating information retrieval (IR) systems. The evaluation and tuning of a search system is largely based on relevance labels, which indicate whether a document is useful for a specific search and user. However, collecting relevance judgments on a large scale is costly and resource-intensive. Consequently, typical experiments rely on third-party labelers who may not always produce accurate annotations. The LLMJudge challenge aims to explore an alternative approach by using LLMs to generate relevance judgments. Recent studies have shown that LLMs can generate reliable relevance judgments for search systems. However, it remains unclear which LLMs can match the accuracy of human labelers, which prompts are most effective, how fine-tuned open-source LLMs compare to…
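One way to make the question of whether an LLM "matches" human labelers concrete is to measure chance-corrected agreement between the two sets of labels. The sketch below computes Cohen's kappa in plain Python. It is an illustration only: the function name `cohen_kappa`, the 0 to 3 label scale, and the toy labels are assumptions made for this example, and nothing here is taken from the challenge's own evaluation code or data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both labelers agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each labeler assigned labels independently,
    # following their own label distribution.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both labelers used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy, made-up labels on an assumed 0-3 relevance scale (not real data).
human_labels = [0, 1, 2, 3, 0, 1, 2, 3]
llm_labels = [0, 1, 2, 2, 0, 0, 2, 3]
kappa = cohen_kappa(human_labels, llm_labels)
print(f"Cohen's kappa: {kappa:.3f}")
```

On these toy labels the two labelers agree on 6 of 8 items, while chance agreement is 0.25, so kappa is well below the raw agreement rate. Plain kappa treats all disagreements equally; for graded relevance labels, a weighted variant that penalizes larger gaps more heavily may be a better fit.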

Code & Models

Repositories

llm4eval/LLMJudge (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Digital Rights Management and Security

Methods: Linear Layer · Residual Connection · Layer Normalization · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Softmax