From Tool to Teammate: LLM Coding Agents as Collaborative Partners for Behavioral Labeling in Educational Dialogue Analysis
Eason Chen, Isabel Wang, Nina Yuan, Sophia Judicke, Kayla Beigh, Xinyi Tang

TL;DR
This paper introduces an autonomous LLM-based coding agent that iteratively improves dialogue labeling prompts, achieving human-level reliability in educational dialogue analysis at low cost.
Contribution
It presents a novel iterative prompt refinement methodology using LLM agents for behavioral coding, demonstrating improved accuracy and insights in educational dialogue analysis.
Findings
Achieved a Cohen's kappa of 0.78, matching human inter-rater reliability.
Demonstrated the approach on 659 AI tutoring sessions across four experiments.
Identified a new labeling pattern regarding expressions of confusion.
Abstract
Behavioral analysis of tutoring dialogues is essential for understanding student learning, yet manual coding remains a bottleneck. We present a methodology where LLM coding agents autonomously improve the prompts used by LLM classifiers to label educational dialogues. In each iteration, a coding agent runs the classifier against human-labeled validation data, analyzes disagreements, and proposes theory-grounded prompt modifications for researcher review. We applied this approach to 659 AI tutoring sessions across four experiments with three agents and three classifiers. 4-fold cross-validation on held-out data confirmed genuine improvement: the best agent achieved a test Cohen's kappa of 0.78, matching human inter-rater reliability, at a cost of approximately $5 to $8 per agent. While development-set performance reached --, the cross-validated results…
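To make the evaluation step concrete, here is a minimal sketch of how classifier output might be scored against human labels with Cohen's kappa, and how disagreements could be collected for the agent to analyze. It is not the authors' implementation: the function names, the label names ("confusion", "on_task"), and the sample utterances are hypothetical.

```python
from collections import Counter


def cohen_kappa(human, model):
    """Cohen's kappa between two equal-length sequences of labels."""
    if len(human) != len(model) or len(human) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(human)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(h == m for h, m in zip(human, model)) / n
    # Chance agreement: computed from each rater's marginal label frequencies.
    human_counts = Counter(human)
    model_counts = Counter(model)
    expected = sum(human_counts[c] * model_counts[c] for c in human_counts) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        # Returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


def disagreements(utterances, human, model):
    """Return (utterance, human_label, model_label) for every mismatch."""
    return [(u, h, m) for u, h, m in zip(utterances, human, model) if h != m]


# Hypothetical validation data, for illustration only.
utterances = ["I don't get it", "So x is 4?", "Next problem please", "Wait, why?"]
human = ["confusion", "on_task", "on_task", "confusion"]
model = ["confusion", "on_task", "confusion", "confusion"]

kappa = cohen_kappa(human, model)
mismatches = disagreements(utterances, human, model)
```

In this toy example, observed agreement is 0.75 and chance agreement is 0.5, so kappa is 0.5. The single mismatch ("Next problem please") is the kind of case a coding agent would examine before proposing a prompt change.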
