Improving Reproducibility in Evaluation through Multi-Level Annotator Modeling

Deepak Pandita; Flip Korn; Chris Welty; Christopher M. Homan

arXiv:2605.13801·cs.LG·May 14, 2026

TL;DR

This paper proposes a multi-level bootstrapping method for modeling annotator behavior and uses it to analyze how adding annotations per item and adding raters improves reproducibility in AI evaluation.

Contribution

The paper introduces an approach that accounts for annotator variance by using datasets with persistent rater identifiers, which standard evaluation data typically lack.

Findings

01. Analyzes tradeoffs between the number of items and the number of responses per item for significance.

02. Models individual annotator variance to improve evaluation reliability.

03. Provides insights into data collection strategies for reproducible AI assessments.

Abstract

As generative AI models such as large language models (LLMs) become more pervasive, ensuring the safety, robustness, and overall trustworthiness of these systems is paramount. However, AI is currently facing a reproducibility crisis driven by unreliable evaluations and unrepeatable experimental results. While human raters are often used to assess models for utility and safety, they introduce divergent biases and subjective opinions into their annotations. Overcoming this variance is exceptionally challenging because very little data exists to study how experimental repeatability actually improves as the annotator pool grows. Standard evaluation practices typically rely on a small number of annotations per item (often 3 to 5) and lack the persistent rater identifiers necessary to model individual variance across items. In this work, we introduce a multi-level bootstrapping approach to…
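The abstract is cut off before it describes the procedure, so the following is only a minimal sketch of what a generic two-level bootstrap over items and raters might look like, not the authors' method. It assumes ratings for two systems are stored as items-by-raters arrays whose columns correspond to persistent rater identifiers; the function name `two_level_bootstrap` and all of its parameters are hypothetical.

```python
import numpy as np

def two_level_bootstrap(scores_a, scores_b, n_items, n_raters, n_boot=1000, seed=0):
    """Hypothetical helper, not taken from the paper.

    scores_a, scores_b: (items x raters) arrays of ratings for systems A and B,
    where column j is assumed to hold the same rater for every item.
    Returns the fraction of bootstrap replicates in which A's mean rating
    is strictly greater than B's.
    """
    rng = np.random.default_rng(seed)
    total_items, total_raters = scores_a.shape
    wins = 0
    for _ in range(n_boot):
        # Level 1: resample items with replacement.
        items = rng.integers(0, total_items, size=n_items)
        # Level 2: for each sampled item, resample raters with replacement.
        raters = rng.integers(0, total_raters, size=(n_items, n_raters))
        # Broadcasting pairs each sampled item with its own sampled raters.
        a = scores_a[items[:, None], raters]
        b = scores_b[items[:, None], raters]
        if a.mean() > b.mean():
            wins += 1
    return wins / n_boot
```

Under these assumptions, calling the function over a grid of `n_items` and `n_raters` values shows how often simulated replications agree on which system is better, which is one way to explore the tradeoff between items and responses per item that the findings mention.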
