Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach
Jodi M. Casabianca, Maggie Beiting-Parrish

TL;DR
This paper introduces an item response theory approach to correct human rater biases in AI evaluation, improving the accuracy and interpretability of human judgment data.
Contribution
It applies psychometric rater models, especially the multi-faceted Rasch model, to separate true AI output quality from rater effects, enhancing evaluation reliability.
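For orientation, one common rating-scale formulation of a Rasch model with a rater facet is shown below. It is a textbook form, not necessarily the exact parameterization used in the paper, and the symbols are chosen here for illustration: \theta_n is the quality of output n, \lambda_j is the severity of rater j, \tau_k is the threshold between rating categories k-1 and k, and P_{njk} is the probability that rater j assigns category k to output n.

```latex
\ln\!\left( \frac{P_{njk}}{P_{nj(k-1)}} \right) = \theta_n - \lambda_j - \tau_k
```

Because severity enters as a separate additive term on the logit scale, the quality estimate \theta_n can be read as what remains after the rater's harshness or leniency is accounted for.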
Findings
Adjusting for rater severity yields more accurate quality estimates.
Psychometric modeling provides diagnostic insights into rater performance.
Incorporating these models leads to more transparent AI evaluation practices.
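The severity adjustment can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes binary (0/1) ratings instead of an ordinal scale, uses a toy hand-built dataset, and fits the model by gradient ascent with a small ridge penalty added here only to keep estimates finite. The function name `fit_rasch_raters` and all data are hypothetical.

```python
import math

def fit_rasch_raters(ratings, n_iter=2000, lr=0.1, ridge=0.1):
    """Fit logit P(score=1) = quality[summary] - severity[rater].

    ratings: list of (summary_id, rater_id, score) with score in {0, 1}.
    Plain gradient ascent on a ridge-penalized log-likelihood; the ridge
    term is an assumption of this sketch, not part of the Rasch model.
    """
    quality = {s: 0.0 for s, _, _ in ratings}
    severity = {r: 0.0 for _, r, _ in ratings}
    for _ in range(n_iter):
        grad_q = {s: -ridge * q for s, q in quality.items()}
        grad_s = {r: -ridge * v for r, v in severity.items()}
        for s, r, x in ratings:
            p = 1.0 / (1.0 + math.exp(-(quality[s] - severity[r])))
            grad_q[s] += x - p  # higher quality raises P(score=1)
            grad_s[r] -= x - p  # higher severity lowers P(score=1)
        for s in quality:
            quality[s] += lr * grad_q[s]
        for r in severity:
            severity[r] += lr * grad_s[r]
    # Center severities at zero; shifting both facets by the same
    # constant leaves every predicted probability unchanged.
    shift = sum(severity.values()) / len(severity)
    quality = {s: q - shift for s, q in quality.items()}
    severity = {r: v - shift for r, v in severity.items()}
    return quality, severity

# Toy data: six summaries rated by both raters link the two raters,
# plus one summary seen only by each rater.
lenient_scores = [1, 1, 1, 1, 1, 0]
harsh_scores = [1, 1, 0, 0, 0, 0]
ratings = []
for i, (a, b) in enumerate(zip(lenient_scores, harsh_scores)):
    ratings.append((f"shared_{i}", "lenient", a))
    ratings.append((f"shared_{i}", "harsh", b))
ratings.append(("only_lenient", "lenient", 1))
ratings.append(("only_harsh", "harsh", 1))

quality, severity = fit_rasch_raters(ratings)
print("severity:", severity)
print("only_lenient:", quality["only_lenient"])
print("only_harsh:", quality["only_harsh"])
```

The two singly rated summaries receive the same raw score, yet the fitted model places the one approved by the harsher rater higher on the quality scale, which is the kind of correction the findings describe.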
Abstract
Human evaluations play a central role in training and assessing AI models, yet these data are rarely treated as measurements subject to systematic error. This paper integrates psychometric rater models into the AI pipeline to improve the reliability and validity of conclusions drawn from human judgments. The paper reviews common rater effects, severity and centrality, that distort observed ratings, and demonstrates how item response theory rater models, particularly the multi-faceted Rasch model, can separate true output quality from rater behavior. Using the OpenAI summarization dataset as an empirical example, we show how adjusting for rater severity produces corrected estimates of summary quality and provides diagnostic insight into rater performance. Incorporating psychometric modeling into human-in-the-loop evaluation offers more principled and transparent use of human data,…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Psychometric Methodologies and Testing · Ethics and Social Impacts of AI
