Beyond the Illusion of Consensus: From Surface Heuristics to Knowledge-Grounded Evaluation in LLM-as-a-Judge

Mingyang Song; Mao Zheng; Chenning Xu

arXiv:2603.11027·cs.CL·March 12, 2026

TL;DR

This paper reveals that high agreement among LLM evaluators often masks superficial consensus based on surface heuristics, and proposes knowledge-grounded rubric generation to improve evaluation reliability.

Contribution

It introduces the Evaluation Illusion concept and the MERG framework for dynamic, knowledge-driven rubric generation to enhance LLM evaluation accuracy.

Findings

01. Model-level agreement is high (Spearman ρ = 0.99), but sample-level agreement is fragile (Pearson r̄ = 0.72; absolute-agreement ICC = 0.67).

02. Merely sharing rubric structure restores 62% of total agreement.

03. Knowledge-grounded rubrics increase agreement in objective domains.

Abstract

The paradigm of LLM-as-a-judge relies on a critical assumption, namely that high inter-evaluator agreement indicates reliable and objective evaluation. We present two complementary findings that challenge this assumption. First, we demonstrate that this consensus is frequently illusory. We identify and formalize Evaluation Illusion, a phenomenon where LLM judges generate sophisticated critiques yet anchor scores on shared surface heuristics rather than substantive quality. Through a large-scale study of 105,600 evaluation instances (32 LLMs $\times$ 3 frontier judges $\times$ 100 tasks $\times$ 11 temperatures), we show that model-level agreement (Spearman $\rho = 0.99$) masks fragile sample-level agreement (Pearson $\bar{r} = 0.72$; absolute agreement ICC $= 0.67$), that merely sharing rubric structure restores 62% of total agreement, and that high-quality outputs…
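The gap between model-level and sample-level agreement described in the abstract can be illustrated with a toy simulation. The sketch below is not the paper's data or method: the latent quality values, the Gaussian noise level, and the use of only two judges and one temperature are all assumptions made for illustration. It shows just the statistical point that averaging scores per model can yield near-perfect rank agreement while the same judges agree far less on individual samples. It also checks the instance count stated in the abstract.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Study dimensions as stated in the abstract.
N_MODELS, N_JUDGES, N_TASKS, N_TEMPS = 32, 3, 100, 11
total_instances = N_MODELS * N_JUDGES * N_TASKS * N_TEMPS

# Everything below is hypothetical: a latent quality per model and
# independent per-sample noise for each simulated judge.
rng = random.Random(0)
quality = [m / (N_MODELS - 1) for m in range(N_MODELS)]
NOISE_SD = 0.3  # assumed, chosen only to make the effect visible


def simulate_judge():
    """Return scores[model][task] for one simulated judge."""
    return [
        [quality[m] + rng.gauss(0, NOISE_SD) for _ in range(N_TASKS)]
        for m in range(N_MODELS)
    ]


judge_a = simulate_judge()
judge_b = simulate_judge()

# Sample-level agreement: correlate the two judges over every (model, task) pair.
flat_a = [s for row in judge_a for s in row]
flat_b = [s for row in judge_b for s in row]
r_sample = pearson(flat_a, flat_b)

# Model-level agreement: average per model first, then compare rankings.
mean_a = [sum(row) / N_TASKS for row in judge_a]
mean_b = [sum(row) / N_TASKS for row in judge_b]
rho_model = spearman(mean_a, mean_b)

print(f"total instances:      {total_instances}")
print(f"model-level Spearman: {rho_model:.2f}")
print(f"sample-level Pearson: {r_sample:.2f}")
```

In this toy setup the per-sample noise mostly cancels when scores are averaged over tasks, so the model ranking looks stable even though individual judgments diverge. The paper attributes the real-world gap to shared surface heuristics, which this simulation does not attempt to model.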

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI