Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach

Jodi M. Casabianca; Maggie Beiting-Parrish

arXiv:2602.22585·cs.AI·February 27, 2026

Open Access

TL;DR

This paper introduces an item response theory approach to correct human rater biases in AI evaluation, improving the accuracy and interpretability of human judgment data.

Contribution

The paper applies psychometric rater models, especially the multi-faceted Rasch model, to separate true AI output quality from rater effects, making conclusions drawn from human ratings more reliable.
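
For orientation, a common rating-scale form of the many-facet Rasch model is sketched below for the case of AI outputs scored by several raters. The notation is an assumption made for illustration and is not taken from the paper, whose parameterization may differ (for example, by including additional facets).

```latex
% Adjacent-category log-odds that rater j gives output n a rating of k rather than k-1
\log \frac{P(X_{nj} = k)}{P(X_{nj} = k-1)} = \theta_n - \lambda_j - \tau_k
% \theta_n  : latent quality of output n (the quantity of interest)
% \lambda_j : severity of rater j (larger values mean harsher ratings)
% \tau_k    : threshold between rating categories k-1 and k
```

Because quality and severity enter additively on the logit scale, the same observed rating implies a higher quality estimate when it comes from a more severe rater.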

Findings

01. Adjusting for rater severity yields more accurate quality estimates.

02. Psychometric modeling provides diagnostic insights into rater performance.

03. Incorporating these models leads to more transparent AI evaluation practices.

Abstract

Human evaluations play a central role in training and assessing AI models, yet these data are rarely treated as measurements subject to systematic error. This paper integrates psychometric rater models into the AI pipeline to improve the reliability and validity of conclusions drawn from human judgments. The paper reviews common rater effects, severity and centrality, that distort observed ratings, and demonstrates how item response theory rater models, particularly the multi-faceted Rasch model, can separate true output quality from rater behavior. Using the OpenAI summarization dataset as an empirical example, we show how adjusting for rater severity produces corrected estimates of summary quality and provides diagnostic insight into rater performance. Incorporating psychometric modeling into human-in-the-loop evaluation offers more principled and transparent use of human data,…
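
The severity adjustment described in the abstract can be illustrated with a minimal sketch. It assumes a rating-scale many-facet Rasch model with a single rater facet, treats rater severities as already known (in practice they would be estimated jointly with quality), and uses made-up thresholds and severities rather than anything estimated from the OpenAI summarization dataset. The function names `category_probs`, `expected_rating`, and `estimate_quality` are hypothetical and not from the paper.

```python
import numpy as np

def category_probs(theta, severity, thresholds):
    """P(rating = k) for k = 0..K under a rating-scale many-facet Rasch model."""
    # Each step contributes (theta - severity - tau_k); category 0 has logit 0.
    steps = theta - severity - np.asarray(thresholds, dtype=float)
    logits = np.concatenate(([0.0], np.cumsum(steps)))
    logits -= logits.max()  # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def expected_rating(theta, severity, thresholds):
    """Model-implied mean rating for a given quality and rater severity."""
    p = category_probs(theta, severity, thresholds)
    return float(np.dot(np.arange(len(p)), p))

def estimate_quality(ratings, severities, thresholds, grid=None):
    """Grid-search maximum-likelihood estimate of theta, severities held fixed."""
    if grid is None:
        grid = np.linspace(-6.0, 6.0, 1201)
    loglik = np.zeros_like(grid)
    for r, s in zip(ratings, severities):
        loglik += np.array(
            [np.log(category_probs(t, s, thresholds)[r]) for t in grid]
        )
    return float(grid[np.argmax(loglik)])

# Illustrative (assumed) values: a 5-point scale coded 0..4 with symmetric thresholds.
thresholds = [-1.5, -0.5, 0.5, 1.5]

# The same summary quality looks worse under a severe rater than a lenient one.
mean_severe = expected_rating(0.0, severity=1.0, thresholds=thresholds)
mean_lenient = expected_rating(0.0, severity=-1.0, thresholds=thresholds)

# Two summaries with identical raw ratings, scored by raters of different severity.
theta_severe = estimate_quality([2, 2], [1.0, 1.0], thresholds)
theta_lenient = estimate_quality([2, 2], [-1.0, -1.0], thresholds)

print(f"expected rating, severe vs lenient rater: {mean_severe:.2f} vs {mean_lenient:.2f}")
print(f"estimated quality, severe vs lenient raters: {theta_severe:.2f} vs {theta_lenient:.2f}")
```

In this toy setup the two summaries have the same raw mean rating, yet the one scored by severe raters receives the higher quality estimate. A grid search is used only to keep the sketch short; dedicated psychometric software would estimate quality, severity, and thresholds together.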

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Psychometric Methodologies and Testing · Ethics and Social Impacts of AI