LLMs as Assessors: Right for the Right Reason?
Sourav Saha, Mandar Mitra, Aditya Dutta

TL;DR
This study evaluates the effectiveness of Large Language Models as relevance assessors in Information Retrieval, emphasizing their ability to highlight relevant passages and the importance of human oversight.
Contribution
It extends prior research by using a Wikipedia-based test collection to assess LLMs' judgment accuracy and reasoning in relevance evaluation.
Findings
LLMs can identify relevant passages much as human assessors do.
LLMs often make correct relevance judgments but are not yet reliable enough to replace humans.
Caution is advised when using LLMs for dataset creation, because they can produce incorrect judgments.
Abstract
A good deal of recent research has focused on how Large Language Models (LLMs) may be used as judges in place of humans to evaluate the quality of the output produced by various text / image processing systems. Within this broader context, a number of studies have investigated the specific question of how effectively LLMs can be used as relevance assessors for the standard ad hoc task in Information Retrieval (IR). We extend these studies by looking at additional questions. Most importantly, we use a Wikipedia-based test collection created by the INEX initiative, and prompt LLMs not only to judge whether documents are relevant / non-relevant, but also to highlight relevant passages in documents that they regard as useful. The human relevance assessors involved in creating this collection were given analogous instructions, i.e., they were asked to highlight all passages within a document that…
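The abstract describes comparing passages highlighted by an LLM with passages highlighted by human assessors, but the visible text does not say how that agreement is measured. The following is a minimal sketch of one plausible approach, assuming highlights are stored as half-open character offsets into a document. The function names (`merge_spans`, `overlap_length`, `highlight_agreement`) and the choice of character-level precision, recall, and F1 are illustrative assumptions, not the authors' method.

```python
from typing import List, Tuple

# Assumption: a highlight is a half-open character range [start, end).
Span = Tuple[int, int]


def merge_spans(spans: List[Span]) -> List[Span]:
    """Sort spans and merge any that overlap or touch."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if start >= end:
            continue  # skip empty or inverted spans
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(spans: List[Span]) -> int:
    """Total number of characters covered by non-overlapping spans."""
    return sum(end - start for start, end in spans)


def overlap_length(a: List[Span], b: List[Span]) -> int:
    """Characters covered by both lists (each already merged and sorted)."""
    i = j = shared = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            shared += hi - lo
        # Advance whichever span ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared


def highlight_agreement(
    llm_spans: List[Span], human_spans: List[Span]
) -> Tuple[float, float, float]:
    """Character-level precision, recall, and F1 of LLM vs. human highlights."""
    llm = merge_spans(llm_spans)
    human = merge_spans(human_spans)
    shared = overlap_length(llm, human)
    precision = shared / total_length(llm) if llm else 0.0
    recall = shared / total_length(human) if human else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

For example, `highlight_agreement([(0, 10), (5, 20)], [(10, 30)])` merges the LLM spans into a single 20-character highlight, of which 10 characters fall inside the 20-character human highlight. A score of this kind only says whether the two highlights cover the same text; it does not by itself show that a correct relevance judgment was made for the right reason.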
