Are We on the Right Way to Assessing LLM-as-a-Judge?
Yuanning Feng, Sinan Wang, Zhengxiang Cheng, Yao Wan, and Dongping Chen

TL;DR
This paper introduces Sage, an evaluation suite for LLM-as-a-Judge that requires no human annotations and uses axioms of rational choice to assess the consistency and reliability of language models in judgment tasks.
Contribution
Sage provides a novel, annotation-free framework for evaluating LLM judges based on rational choice axioms, revealing reliability issues and potential improvements.
Findings
Current SOTA LLMs show significant reliability problems as judges.
Finetuning and structured approaches improve judging consistency.
Human judgments also exhibit substantial inconsistency, which calls their reliability into question.
Abstract
LLM-as-a-Judge has been widely adopted as an evaluation method and has served as a source of supervised rewards in model training. However, existing benchmarks for LLM-as-a-Judge rely mainly on human-annotated ground truth, which introduces human bias that undermines the assessment of reliability and imposes scalability constraints. To overcome these limitations, we introduce Sage, a novel evaluation suite that assesses the quality of LLM judges without requiring any human annotation. Inspired by axioms of rational choice theory, Sage introduces two new lenses for measuring LLM-as-a-Judge: local self-consistency (pair-wise preference stability) and global logical consistency (transitivity across a full set of preferences). We curate a dataset of 650 questions by combining structured benchmark problems with real-world user queries. Our experiments demonstrate both the stability of our metrics…
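The two lenses named in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical `judge(x, y)` callable that returns whichever of the two answers it prefers; the function names, the toy judges, and the exact scoring below are illustrative assumptions, not Sage's published metric definitions.

```python
from itertools import combinations


def order_flip_rate(judge, answers):
    """Local self-consistency proxy (assumed form): fraction of answer pairs
    whose preferred answer changes when the presentation order is swapped."""
    pairs = list(combinations(answers, 2))
    flips = sum(judge(a, b) != judge(b, a) for a, b in pairs)
    return flips / len(pairs)


def cyclic_triple_rate(judge, answers):
    """Global logical consistency proxy (assumed form): fraction of answer
    triples whose pairwise preferences form a cycle (a > b, b > c, c > a).
    A triple is cyclic exactly when each answer wins one comparison.
    Each pair is queried in one fixed order only, so for an order-sensitive
    judge this number also reflects position effects."""
    triples = list(combinations(answers, 3))
    cycles = sum(
        len({judge(a, b), judge(b, c), judge(c, a)}) == 3
        for a, b, c in triples
    )
    return cycles / len(triples)


# Hypothetical toy judges standing in for an LLM call.
def prefers_longer(x, y):
    return x if len(x) >= len(y) else y


def prefers_first(x, y):
    return x  # pure position bias


answers = ["ok", "fine", "detailed", "very detailed answer"]
print(order_flip_rate(prefers_longer, answers))     # 0.0
print(order_flip_rate(prefers_first, answers))      # 1.0
print(cyclic_triple_rate(prefers_longer, answers))  # 0.0
```

Neither function needs a ground-truth label: both score the judge only against its own preferences, which is the sense in which an evaluation of this kind is annotation-free.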
Taxonomy
Topics: Artificial Intelligence in Law · Topic Modeling · Explainable Artificial Intelligence (XAI)
