Grading Scale Impact on LLM-as-a-Judge: Human-LLM Alignment Is Highest on 0-5 Grading Scale

Weiyue Li; Minda Zhao; Weixuan Dong; Jiahui Cai; Yuze Wei; Michael Pocress; Yi Li; Wanyan Yuan; Xiaoyue Wang; Ruoyu Hou; Kaiyuan Lou; Wenqi Zeng; Yutong Yang; Yilun Du; Mengyu Wang

arXiv:2601.03444·cs.CL·January 8, 2026

TL;DR

This study investigates how different grading scales affect the consistency and alignment of large language models as evaluators, finding that a 0-5 scale maximizes human-LLM agreement across diverse tasks.

Contribution

The paper systematically compares three grading scales across six benchmarks and shows that a 0-5 scale yields the strongest human-LLM alignment when aggregated over tasks, highlighting the importance of scale design in automated evaluation.

Findings

01. 0-5 scale yields highest human-LLM agreement

02. Grading scale choice significantly impacts consistency

03. Subgroup differences reveal scale effects across demographics

Abstract

Large language models (LLMs) are increasingly used as automated evaluators, yet prior work demonstrates that these LLM judges often lack consistency in scoring when the prompt is altered. However, the effect of the grading scale itself remains underexplored. We study the LLM-as-a-judge problem by comparing two kinds of raters: humans and LLMs. We collect ratings from both groups on three scales and across six benchmarks that include objective, open-ended subjective, and mixed tasks. Using intraclass correlation coefficients (ICC) to measure absolute agreement, we find that LLM judgments are not perfectly consistent across scales on subjective benchmarks, and that the choice of scale substantially shifts human-LLM agreement, even when within-group panel reliability is high. Aggregated over tasks, the grading scale of 0-5 yields the strongest human-LLM alignment. We further demonstrate…
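To make "ICC to measure absolute agreement" concrete, here is a minimal sketch of one common absolute-agreement form, the two-way random-effects, single-rater ICC(2,1) of Shrout and Fleiss. The abstract does not say which ICC variant the authors use, so this choice is an assumption, and the function name `icc_absolute_agreement` and the toy ratings are hypothetical illustrations, not data from the paper.

```python
def icc_absolute_agreement(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows, one per rated item; each row holds one
    score per rater (every row must have the same length).
    NOTE: the ICC variant is an assumption; the source does not specify it.
    """
    n = len(ratings)       # number of rated items
    k = len(ratings[0])    # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a rectangular table with >= 2 items and >= 2 raters")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: items (rows), raters (columns), residual.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Rater mean differences (ms_cols) enter the denominator, so a
    # systematic offset between raters lowers the coefficient.
    denom = ms_rows + (k - 1) * ms_error + (k / n) * (ms_cols - ms_error)
    if denom == 0:
        raise ValueError("ICC is undefined when all ratings are identical")
    return (ms_rows - ms_error) / denom


# Hypothetical toy data: each row is one item, columns are (human, LLM).
identical = icc_absolute_agreement([[1, 1], [2, 2], [3, 3]])
shifted = icc_absolute_agreement([[1, 2], [2, 3], [3, 4], [4, 5]])
print(identical, shifted)
```

In the second toy table the LLM column ranks every item exactly as the human column does but scores each one point higher. The absolute-agreement ICC drops to about 0.77 rather than 1.0, which is the property that makes this kind of coefficient sensitive to how a judge uses a grading scale and not only to how it orders items.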

Taxonomy

Topics: Computational and Text Analysis Methods · Topic Modeling · Artificial Intelligence in Healthcare and Education