Formalized Information Needs Improve Large-Language-Model Relevance Judgments
Jüri Keller, Maik Fröbe, Björn Engelmann, Fabian Haak, Timo Breuer, Birger Larsen, and Philipp Schaer

TL;DR
Formalizing information needs for LLM relevance assessments enhances reliability and agreement in retrieval evaluations, aligning LLM judgments more closely with human standards.
Contribution
This study shows that synthetically formalizing topics, an approach new to LLM-based evaluation setups, improves the consistency of LLM relevance judgments and their agreement with human judgments.
Findings
Formalized topics lead to fewer documents judged relevant by LLMs.
Formalization increases agreement between LLM and human judgments.
Using formalized information needs improves evaluation reliability.
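To make the agreement finding concrete, the sketch below computes Cohen's kappa between a human assessor and two LLM assessors. The choice of Cohen's kappa is an assumption (the summary does not name the agreement measure), and the label lists are invented toy values for illustration only, not results from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which both assessors gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each assessor's label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both assessors used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy binary relevance labels (hypothetical, NOT data from the paper).
human = [1, 0, 0, 1, 0, 0]
llm_query_only = [1, 1, 1, 1, 0, 1]  # judges many documents relevant
llm_topic = [1, 0, 0, 1, 0, 1]       # judges fewer documents relevant

kappa_query_only = cohens_kappa(human, llm_query_only)
kappa_topic = cohens_kappa(human, llm_topic)
```

In this toy setup the assessor that marks fewer documents relevant also agrees more with the human labels, which mirrors the direction of the findings above without reproducing any reported numbers.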
Abstract
Cranfield-style retrieval evaluations with too few or too many relevant documents or with low inter-assessor agreement on relevance can reduce the reliability of observations. In evaluations with human assessors, information needs are often formalized as retrieval topics to avoid an excessive number of relevant documents while maintaining good agreement. However, emerging evaluation setups that use Large Language Models (LLMs) as relevance assessors often use only queries, potentially decreasing the reliability. To study whether LLM relevance assessors benefit from formalized information needs, we synthetically formalize information needs with LLMs into topics that follow the established structure from previous human relevance assessments (i.e., descriptions and narratives). We compare assessors using synthetically formalized topics against the LLM-default query-only assessor on…
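The contrast between a query-only assessor and one that sees a formalized topic can be sketched as a difference in prompt construction. The `Topic` class, the `build_assessor_prompt` helper, and the prompt wording are hypothetical illustrations, not the prompts used in the paper; only the query, description, and narrative fields follow the topic structure the abstract mentions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Topic:
    """A retrieval topic; description and narrative are optional."""
    query: str
    description: Optional[str] = None
    narrative: Optional[str] = None


def build_assessor_prompt(topic: Topic, document: str) -> str:
    """Assemble a relevance-judgment prompt (hypothetical wording)."""
    parts = [f"Query: {topic.query}"]
    # A query-only assessor skips both optional fields.
    if topic.description:
        parts.append(f"Description: {topic.description}")
    if topic.narrative:
        parts.append(f"Narrative: {topic.narrative}")
    parts.append(f"Document: {document}")
    parts.append("Is the document relevant to the information need? Answer 0 or 1.")
    return "\n".join(parts)


# Invented example topic, for illustration only.
query_only = Topic(query="solar panel efficiency")
formalized = Topic(
    query="solar panel efficiency",
    description="Find documents on factors that affect solar panel efficiency.",
    narrative="Relevant documents discuss measured efficiency or its causes; "
              "pure product advertisements are not relevant.",
)

doc = "Panel output drops as cell temperature rises."
prompt_query_only = build_assessor_prompt(query_only, doc)
prompt_formalized = build_assessor_prompt(formalized, doc)
```

The narrative spells out what does not count as relevant, which is one plausible reason a topic-based assessor would judge fewer documents relevant than a query-only one.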
