AI Annotation Orchestration: Evaluating LLM Verifiers to Improve the Quality of LLM Annotations in Learning Analytics

Bakhtawar Ahtisham; Kirk Vanacore; Jinsook Lee; Zhuqian Zhou; Doug Pietrzak; Rene F. Kizilcec

arXiv:2511.09785 · cs.AI · January 29, 2026

Open Access

TL;DR

This study evaluates how verification strategies like self- and cross-verification improve the quality of LLM-generated annotations in learning analytics, showing significant gains in agreement with human labels across different models and configurations.

Contribution

Introduces a flexible framework for LLM annotation verification and empirically compares its effectiveness across multiple models and data, providing standardized notation for reporting.

Findings

01

Orchestration improves annotation agreement by 58%.

02

Self-verification nearly doubles agreement over unverified labels.

03

Cross-verification yields a 37% improvement, with pair-dependent effects.

Abstract

Large Language Models (LLMs) are increasingly used to annotate learning interactions, yet concerns about reliability limit their utility. We test whether verification-oriented orchestration, that is, prompting models to check their own labels (self-verification) or audit one another (cross-verification), improves qualitative coding of tutoring discourse. Using transcripts from 30 one-to-one math sessions, we compare three production LLMs (GPT, Claude, Gemini) under three conditions: unverified annotation, self-verification, and cross-verification across all orchestration configurations. Outputs are benchmarked against a blinded, disagreement-focused human adjudication using Cohen's kappa. Overall, orchestration yields a 58 percent improvement in kappa. Self-verification nearly doubles agreement relative to unverified baselines, with the largest gains for challenging tutor moves. Cross-verification…
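For readers unfamiliar with the agreement metric named in the abstract, the following is a minimal sketch of how Cohen's kappa between a human-adjudicated label sequence and an LLM label sequence can be computed, and how a relative improvement in kappa can be expressed. The function names, the label values ("probe", "explain"), and the kappa values passed to `relative_improvement` are hypothetical and chosen only for illustration; they are not taken from the paper, and the authors' exact computation (for example, per-category versus pooled kappa) may differ.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each rater's marginal label proportions,
    # summed over categories.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical category; kappa is undefined,
        # so this sketch returns 1.0 by convention (an assumption).
        return 1.0
    return (observed - expected) / (1.0 - expected)


def relative_improvement(baseline_kappa, improved_kappa):
    """Relative change in kappa, e.g. 0.58 corresponds to a 58 percent gain."""
    return (improved_kappa - baseline_kappa) / baseline_kappa


# Hypothetical toy data: four utterances coded by a human and by an LLM.
human = ["probe", "probe", "explain", "explain"]
llm = ["probe", "explain", "explain", "explain"]
kappa = cohens_kappa(human, llm)
```

In this toy case the raters agree on three of four items (observed agreement 0.75) while chance agreement is 0.5, so kappa is 0.5. A verified configuration would be scored the same way against the human labels, and the two kappa values compared with `relative_improvement`.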

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)