Quantifying the Statistical Effect of Rubric Modifications on Human-Autorater Agreement
Jessica Huynh, Alfredo Gomez, Athiya Deviyani, Renee Shelby, Jeffrey P. Bigham, Fernando Diaz

TL;DR
This study statistically analyzes how modifications to an evaluation rubric affect score agreement between human raters and autoraters (LLM-as-judges), and identifies which changes raise or lower that agreement.
Contribution
It provides a statistical analysis of rubric modifications' effects on human-autorater agreement across different evaluation domains.
Findings
Rubrics with representative examples and added context increase agreement.
Reducing positional bias in rubrics improves consistency.
Higher rubric complexity and conservative aggregation decrease agreement.
Abstract
Autoraters, also referred to as LLM-as-judges, are increasingly used for evaluation and automated content moderation. However, there is limited statistical analysis of how modifications in a rubric presented to both humans and autoraters affect their score agreement. Rubrics that ask for an overall or holistic judgment - for example, rating the "quality" of an essay - may be inconsistently interpreted due to the complexity or subjectivity of the criteria. Conversely, rubrics can ask for analytic judgments, which decompose assessment criteria - for example, "quality" into "fluency" and "organization". While these rubrics can be edited to improve the individual accuracy of both human and automated scoring, this approach may result in disagreement between the two scores, or with the associated holistic judgment. Designing and deploying reliable autoraters requires…
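Illustrative sketch
The abstract does not say which agreement statistic the authors use. As a minimal sketch, assuming agreement is measured with Cohen's kappa (one common chance-corrected choice, not confirmed by the source), the comparison of human and autorater scores before and after a rubric edit could look like the following. The function name `cohen_kappa` and all score lists are hypothetical and invented purely for illustration; they are not data from the paper.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1.0 - expected)


# Hypothetical 3-point scores for eight items (assumed, not from the paper).
human = [1, 2, 3, 3, 2, 1, 3, 2]
autorater_before = [1, 3, 3, 2, 2, 1, 2, 2]  # scores under the original rubric
autorater_after = [1, 2, 3, 3, 2, 1, 2, 2]   # scores under a modified rubric

kappa_before = cohen_kappa(human, autorater_before)
kappa_after = cohen_kappa(human, autorater_after)
print(f"kappa before edit: {kappa_before:.3f}")
print(f"kappa after edit:  {kappa_after:.3f}")
```

Kappa treats the scores as unordered categories; for ordinal rubric scales, a weighted kappa or a rank correlation may be a better fit, and a real analysis would also need a significance test over many items.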
