TL;DR
This study investigates the tendency of large language models to overrate relevance in information retrieval tasks, revealing systematic biases and sensitivity to superficial cues that challenge their reliability as human proxies.
Contribution
It provides a comprehensive analysis of overrating behaviors in LLM-based relevance judgments across models and evaluation paradigms, emphasizing the need for careful diagnostic frameworks.
Findings
LLMs tend to inflate relevance scores with high confidence.
Relevance judgments are sensitive to passage length and lexical cues.
Overrating behavior reflects a systematic bias rather than random fluctuation in judgment.
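One way to make the distinction between systematic bias and random fluctuation concrete is to look at the direction of disagreement between LLM and human labels. The sketch below is not the paper's method: the function name `overrating_summary`, the metric names, and the toy labels are all hypothetical, and it assumes both label lists use the same graded relevance scale. If disagreements were random noise, over- and under-rating would occur about equally often and the mean signed difference would sit near zero.

```python
from statistics import mean


def overrating_summary(human, llm):
    """Summarise the direction of disagreement between paired relevance labels.

    Hypothetical helper for illustration; assumes both lists share one scale.
    """
    if len(human) != len(llm) or not human:
        raise ValueError("need two non-empty label lists of equal length")

    # Positive difference: the LLM rated the passage higher than the human did.
    diffs = [m - h for h, m in zip(human, llm)]
    over = sum(d > 0 for d in diffs)
    under = sum(d < 0 for d in diffs)
    disagreements = over + under

    return {
        "mean_signed_diff": mean(diffs),
        "over_rate": over / len(diffs),
        "under_rate": under / len(diffs),
        # Share of disagreements in which the LLM was the higher rater.
        "over_share_of_disagreements": (
            over / disagreements if disagreements else 0.0
        ),
    }


# Made-up toy labels on a 0-3 scale, not data from the paper.
human = [0, 1, 0, 2, 3, 0, 1, 2]
llm = [1, 2, 0, 3, 3, 2, 1, 2]
summary = overrating_summary(human, llm)
print(summary)
```

A share of disagreements well above one half, together with a positive mean signed difference, would point to one-directional inflation; a formal claim would still need a significance test and far more data than this toy sample.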
Abstract
Human relevance assessment is time-consuming and cognitively intensive, limiting the scalability of Information Retrieval evaluation. This has led to growing interest in using large language models (LLMs) as proxies for human judges. However, it remains an open question whether LLM-based relevance judgments are reliable, stable, and rigorous enough to match humans for relevance assessment. In this work, we conduct a study of overrating behavior in LLM-based relevance judgments across model backbones, evaluation paradigms (pointwise and pairwise), and passage modification strategies. We show that models consistently assign inflated relevance scores, often with high confidence, to passages that do not genuinely satisfy the underlying information need, revealing a system-wide bias rather than random fluctuations in judgment. Furthermore, controlled experiments show that…
