Modernizing Ground Truth: Four Shifts Toward Improving Reliability and Validity in AI in Education

Danielle R. Thomas, Conrad Borchers, Kirk P. Vanacore, Kenneth R. Koedinger, and René F. Kizilcec

arXiv:2603.29141·cs.CY·April 1, 2026

TL;DR

This paper proposes four practical shifts to improve the reliability and validity of ground truth in AI in Education, addressing issues with inter-rater reliability, transparency, bias, and validation.

Contribution

The paper introduces four shifts for establishing ground truth in educational AI datasets: diagnostic use of IRR, transparency in rater procedures, bias mitigation, and comprehensive validation.

Findings

01

IRR should be used diagnostically rather than as a strict threshold

02

Transparency in rater procedures enhances data quality

03

Bias audits and validation improve annotation reliability

Abstract

Generative Artificial Intelligence (GenAI) is now widespread in education, yet the efficacy of GenAI systems remains constrained by the quality and interpretation of the labeled data used to train and evaluate them. Studies commonly report inter-rater reliability (IRR), often summarized by a single coefficient such as Cohen's kappa (κ), as a gatekeeper to "ground truth." We argue that many educational assessment and practice support settings include challenges, such as high-inference constructs, skewed label distributions, and temporally segmented multimodal data, which yield potential misapplication or misinterpretation of threshold-based heuristics for IRR. The growing use of large language models as annotators and judges introduces risks such as automation bias and circular validation. We propose four practical shifts for establishing ground truth: (1) treat IRR as a diagnostic…
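To make the point about skewed label distributions concrete, the sketch below computes Cohen's kappa from its standard definition, κ = (p_o − p_e) / (1 − p_e), for two synthetic rating sets. The data, the label names, and the helper functions `cohen_kappa` and `build_ratings` are illustrative assumptions and do not come from the paper. Both sets have the same 90% raw agreement, but the rare-label set yields a much lower κ, which is one way a fixed threshold can mislead.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Return (observed agreement, Cohen's kappa) for two equal-length label lists."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("rating lists must be non-empty and of equal length")
    n = len(rater_a)
    # p_o: proportion of items on which the two raters agree
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # p_e: agreement expected by chance, from each rater's marginal label counts
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in labels) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined
        return observed, float("nan")
    return observed, (observed - expected) / (1 - expected)


def build_ratings(both_neg, both_pos, a_only_pos, b_only_pos):
    """Build two label lists from the four cells of a 2x2 agreement table (hypothetical data)."""
    rater_a = ["neg"] * both_neg + ["pos"] * both_pos + ["pos"] * a_only_pos + ["neg"] * b_only_pos
    rater_b = ["neg"] * both_neg + ["pos"] * both_pos + ["neg"] * a_only_pos + ["pos"] * b_only_pos
    return rater_a, rater_b


# Balanced labels: each rater marks 50 of 100 items "pos"; raters disagree on 10 items
agree_balanced, kappa_balanced = cohen_kappa(*build_ratings(45, 45, 5, 5))

# Skewed labels: each rater marks only 7 of 100 items "pos"; raters disagree on 10 items
agree_skewed, kappa_skewed = cohen_kappa(*build_ratings(88, 2, 5, 5))

print(f"balanced: agreement={agree_balanced:.2f}, kappa={kappa_balanced:.2f}")
print(f"skewed:   agreement={agree_skewed:.2f}, kappa={kappa_skewed:.2f}")
# balanced: agreement=0.90, kappa=0.80
# skewed:   agreement=0.90, kappa=0.23
```

In this toy case, a reader applying a single cutoff would accept the first dataset and reject the second, even though the raters disagree on the same number of items. Reporting the marginal label distribution alongside κ is one simple way to surface this.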
