When LLM Judges Inflate Scores: Exploring Overrating in Relevance Assessment

Chuting Yu; Hang Li; Guido Zuccon; Joel Mackenzie; Teerapong Leelanupab

arXiv:2602.17170 · cs.IR · April 28, 2026

TL;DR

This study investigates the tendency of large language models to overrate relevance in information retrieval tasks, revealing systematic biases and sensitivity to superficial cues that challenge their reliability as human proxies.

Contribution

It provides a comprehensive analysis of overrating behaviors in LLM-based relevance judgments across models and evaluation paradigms, emphasizing the need for careful diagnostic frameworks.

Findings

01 LLMs tend to inflate relevance scores, often with high confidence.

02 Relevance judgments are sensitive to passage length and lexical cues.

03 Overrating reflects systematic bias rather than random fluctuation.
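The distinction between systematic bias and random fluctuation can be made concrete with a signed comparison of LLM and human grades. The sketch below is illustrative only and is not taken from the paper: the function name `overrating_summary`, the 0-3 grading scale, and the toy labels are all assumptions. The idea is that random noise would push grades up and down about equally, giving a mean signed difference near zero, whereas overrating shows up as a positive mean and many more overrated than underrated pairs.

```python
from statistics import mean


def overrating_summary(human, llm):
    """Compare LLM grades with human grades for the same query-passage pairs.

    Hypothetical helper for illustration; not the paper's metric.
    """
    if len(human) != len(llm) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    # Positive difference: the LLM graded the passage higher than the human did.
    diffs = [l - h for h, l in zip(human, llm)]
    n = len(diffs)
    return {
        "mean_signed_diff": mean(diffs),
        "overrated": sum(d > 0 for d in diffs) / n,
        "underrated": sum(d < 0 for d in diffs) / n,
    }


# Toy labels on an assumed 0-3 scale, invented for illustration only.
human_labels = [0, 1, 0, 2, 1, 3, 0, 1]
llm_labels = [1, 2, 0, 3, 1, 3, 2, 1]

summary = overrating_summary(human_labels, llm_labels)
print(summary)
```

With real judgments, a significance test on the signed differences would be needed before calling the skew systematic; the counts alone are only a first look.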

Abstract

Human relevance assessment is time-consuming and cognitively intensive, limiting the scalability of Information Retrieval evaluation. This has led to growing interest in using large language models (LLMs) as proxies for human judges. However, it remains an open question whether LLM-based relevance judgments are reliable, stable, and rigorous enough to match humans for relevance assessment. In this work, we conduct a study of overrating behavior in LLM-based relevance judgments across model backbones, evaluation paradigms (pointwise and pairwise), and passage modification strategies. We show that models consistently assign inflated relevance scores, often with high confidence, to passages that do not genuinely satisfy the underlying information need, revealing a system-wide bias rather than random fluctuations in judgment. Furthermore, controlled experiments show that…
