Beyond Agreement: Rethinking Ground Truth in Educational AI Annotation

Danielle R. Thomas; Conrad Borchers; Kenneth R. Koedinger

arXiv:2508.00143·cs.AI·August 4, 2025

Open Access

TL;DR

This paper critiques the reliance on human inter-rater reliability metrics for educational AI annotation, advocating for alternative validation methods that better ensure data quality and educational impact.

Contribution

It introduces and advocates for complementary evaluation approaches beyond IRR to improve the validity and educational relevance of annotated data in AI systems.

Findings

01

IRR overreliance hampers valid data classification

02

Proposes multi-label and expert-based validation methods

03

Highlights importance of external validity across categories

Abstract

Humans can be notoriously imperfect evaluators. They are often biased, unreliable, and unfit to define "ground truth." Yet, given the surging need to produce large amounts of training data in educational applications using AI, traditional inter-rater reliability (IRR) metrics like Cohen's kappa remain central to validating labeled data. IRR remains a cornerstone of many machine learning pipelines for educational data. Take, for example, the classification of tutors' moves in dialogues or labeling open responses in machine-graded assessments. This position paper argues that overreliance on human IRR as a gatekeeper for annotation quality hampers progress in classifying data in ways that are valid and predictive in relation to improving learning. To address this issue, we highlight five examples of complementary evaluation methods, such as multi-label annotation schemes, expert-based…
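For readers unfamiliar with the metric named in the abstract, the following is a minimal sketch of Cohen's kappa for two raters, computed as (observed agreement - chance agreement) / (1 - chance agreement). The function name `cohens_kappa` and the tutor-move labels are hypothetical illustrations, not data or code from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)

    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each label, the product of the two raters'
    # marginal proportions, summed over labels.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined (0/0), so this sketch reports perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels for six tutor moves (illustrative only).
rater_a = ["praise", "hint", "hint", "praise", "hint", "praise"]
rater_b = ["praise", "hint", "praise", "praise", "hint", "hint"]
kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.3f}")
```

Note that kappa measures only whether the two raters agree with each other. It says nothing about whether the labels predict learning outcomes, which is the gap the paper's argument turns on.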

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Reliability and Agreement in Measurement · Explainable Artificial Intelligence (XAI)