Beyond the Illusion of Consensus: From Surface Heuristics to Knowledge-Grounded Evaluation in LLM-as-a-Judge
Mingyang Song, Mao Zheng, Chenning Xu

TL;DR
This paper reveals that high agreement among LLM evaluators often masks superficial consensus based on surface heuristics, and proposes knowledge-grounded rubric generation to improve evaluation reliability.
Contribution
It introduces the Evaluation Illusion concept and the MERG framework for dynamic, knowledge-driven rubric generation to enhance LLM evaluation accuracy.
Findings
Model-level agreement is high but sample-level agreement is fragile.
Shared rubric structure significantly improves agreement.
Knowledge-grounded rubrics increase agreement in objective domains.
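The gap between model-level and sample-level agreement in the first finding can be illustrated with a small simulation. The sketch below is not the paper's method or data. It assumes two hypothetical judges whose scores equal a latent per-model quality plus large independent noise, and the names `simulate`, `noise_sd`, and the quality scale are illustrative choices. Averaging over tasks cancels the noise, so the model rankings agree closely even though the individual scores barely correlate.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def simulate(n_models=32, n_tasks=100, noise_sd=1.5, seed=0):
    """Hypothetical setup: score = latent model quality + independent judge noise."""
    rng = random.Random(seed)
    quality = [2.0 * m / (n_models - 1) for m in range(n_models)]
    judge_a = [[q + rng.gauss(0, noise_sd) for _ in range(n_tasks)] for q in quality]
    judge_b = [[q + rng.gauss(0, noise_sd) for _ in range(n_tasks)] for q in quality]

    # Model level: compare per-model mean scores (noise averages out over tasks).
    model_rho = spearman([mean(row) for row in judge_a],
                         [mean(row) for row in judge_b])

    # Sample level: compare the individual scores for each (model, task) pair.
    flat_a = [s for row in judge_a for s in row]
    flat_b = [s for row in judge_b for s in row]
    sample_r = pearson(flat_a, flat_b)
    return model_rho, sample_r


model_rho, sample_r = simulate()
print(f"model-level Spearman: {model_rho:.2f}, sample-level Pearson: {sample_r:.2f}")
```

Under these assumptions the model-level correlation comes out far higher than the sample-level one, which is why a leaderboard-style agreement figure alone says little about whether judges agree on any individual output.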
Abstract
The paradigm of LLM-as-a-judge relies on a critical assumption, namely that high inter-evaluator agreement indicates reliable and objective evaluation. We present two complementary findings that challenge this assumption. First, we demonstrate that this consensus is frequently illusory. We identify and formalize Evaluation Illusion, a phenomenon where LLM judges generate sophisticated critiques yet anchor scores on shared surface heuristics rather than substantive quality. Through a large-scale study of 105,600 evaluation instances (32 LLMs × 3 frontier judges × 100 tasks × 11 temperatures), we show that model-level agreement (Spearman) masks fragile sample-level agreement (Pearson; absolute-agreement ICC), that merely sharing rubric structure restores 62% of total agreement, and that high-quality outputs…
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI
