Modernizing Ground Truth: Four Shifts Toward Improving Reliability and Validity in AI in Education
Danielle R. Thomas, Conrad Borchers, Kirk P. Vanacore, Kenneth R. Koedinger, and René F. Kizilcec

TL;DR
This paper proposes four practical shifts to improve the reliability and validity of ground truth in AI in Education, addressing issues with inter-rater reliability, transparency, bias, and validation.
Contribution
It introduces novel shifts for establishing ground truth, emphasizing diagnostic use of IRR, transparency, bias mitigation, and comprehensive validation in educational AI datasets.
Findings
IRR should be used diagnostically rather than as a strict threshold
Transparency in rater procedures enhances data quality
Bias audits and validation improve annotation reliability
Abstract
Generative Artificial Intelligence (GenAI) is now widespread in education, yet the efficacy of GenAI systems remains constrained by the quality and interpretation of the labeled data used to train and evaluate them. Studies commonly report inter-rater reliability (IRR), often summarized by a single coefficient such as Cohen's kappa (κ), as a gatekeeper to "ground truth." We argue that many educational assessment and practice support settings include challenges, such as high-inference constructs, skewed label distributions, and temporally segmented multimodal data, which yield potential misapplication or misinterpretation of threshold-based heuristics for IRR. The growing use of large language models as annotators and judges introduces risks such as automation bias and circular validation. We propose four practical shifts for establishing ground truth: (1) treat IRR as a diagnostic…
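To illustrate why a single kappa threshold can mislead when labels are skewed, the following minimal sketch computes Cohen's kappa from its standard definition, (p_o - p_e) / (1 - p_e), for two synthetic rating sets. The data, the function name `cohen_kappa`, and the helper `raw_agreement` are hypothetical and chosen for illustration only; they are not taken from the paper.

```python
from collections import Counter


def raw_agreement(rater_a, rater_b):
    """Fraction of items on which the two raters gave the same label."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must label the same non-empty item set.")
    n = len(rater_a)
    p_o = raw_agreement(rater_a, rater_b)  # observed agreement
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(count_a) | set(count_b)
    # Chance agreement: product of each rater's marginal label proportions.
    p_e = sum(count_a[c] * count_b[c] for c in labels) / (n * n)
    if p_e == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is 1
    return (p_o - p_e) / (1 - p_e)


def unzip(pairs):
    """Split (label_a, label_b) pairs into two parallel label lists."""
    a, b = zip(*pairs)
    return list(a), list(b)


# Synthetic data (assumption): 100 items, binary labels, 90 agreements each.
balanced = [(0, 0)] * 45 + [(1, 1)] * 45 + [(0, 1)] * 5 + [(1, 0)] * 5
skewed = [(0, 0)] * 90 + [(0, 1)] * 5 + [(1, 0)] * 5  # label 1 is rare

bal_a, bal_b = unzip(balanced)
skw_a, skw_b = unzip(skewed)

kappa_balanced = cohen_kappa(bal_a, bal_b)
kappa_skewed = cohen_kappa(skw_a, skw_b)

print(f"balanced: agreement={raw_agreement(bal_a, bal_b):.2f}, kappa={kappa_balanced:.2f}")
print(f"skewed:   agreement={raw_agreement(skw_a, skw_b):.2f}, kappa={kappa_skewed:.2f}")
```

Both synthetic sets have 90% raw agreement, yet kappa is about 0.80 for the balanced labels and about -0.05 for the skewed ones, because chance agreement rises to 0.905 when one label dominates. Under these assumptions, a fixed cutoff would accept the first set and reject the second even though the raters disagree on the same number of items, which is one reason to read the coefficient alongside the label distribution rather than alone.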
