Decoding Student Dialogue: A Multi-Dimensional Comparison and Bias Analysis of Large Language Models as Annotation Tools
Jie Cao, Zhanxin Hao, Jifan Yu

TL;DR
This study assesses GPT-5.2 and Gemini-3 for educational dialogue annotation, revealing context-dependent accuracy, bias patterns, and the importance of deployment considerations.
Contribution
It provides a comprehensive evaluation of large language models as annotation tools, highlighting their biases and performance variations across educational contexts.
Findings
Multi-agent prompting achieved the highest accuracy, but the difference was not statistically significant.
Accuracy was significantly higher on K-12 datasets than on university-level data.
Bias patterns include a consistent optimistic bias in Gemini-3 on the affective dimension and domain-specific under- and overestimation.
Abstract
Educational dialogue is critical for decoding student learning processes, yet manual annotation remains time-consuming. This study evaluates the efficacy of GPT-5.2 and Gemini-3 using three prompting strategies (few-shot, single-agent, and multi-agent reflection) across diverse subjects, educational levels, and four coding dimensions. Results indicate that while multi-agent prompting achieved the highest accuracy, the results did not reach statistical significance. Accuracy proved highly context-dependent, with significantly higher performance in K-12 datasets compared to university-level data, alongside disciplinary variations within the same educational level. Performance peaked in the affective dimension but remained lowest in the cognitive dimension. Furthermore, analysis revealed four bias patterns: (1) Gemini-3 exhibited a consistent optimistic bias in the affective dimension…
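As a rough illustration of the kind of comparison the abstract describes, the sketch below computes per-group agreement between model and human codes, plus a signed bias score for ordinal codes. It is a minimal sketch under stated assumptions, not the authors' pipeline: the record fields (`level`, `dimension`, `human`, `model`), the helper names, and the toy data are all hypothetical and were invented for this example. The numbers are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Share of records where the model code equals the human code, per group."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        hits[r[key]] += int(r["model"] == r["human"])
    return {group: hits[group] / totals[group] for group in totals}


def signed_bias(records):
    """Mean of (model - human) for ordinal codes.

    Positive values mean the model codes higher than the human annotators
    (an "optimistic" direction); negative values mean it codes lower.
    """
    diffs = [r["model"] - r["human"] for r in records]
    return sum(diffs) / len(diffs)


# Hypothetical toy records, invented for illustration only.
records = [
    {"level": "K-12", "dimension": "affective", "human": 2, "model": 2},
    {"level": "K-12", "dimension": "affective", "human": 1, "model": 2},
    {"level": "university", "dimension": "cognitive", "human": 3, "model": 2},
    {"level": "university", "dimension": "cognitive", "human": 2, "model": 2},
]

by_level = accuracy_by_group(records, "level")
by_dimension = accuracy_by_group(records, "dimension")
affective_bias = signed_bias([r for r in records if r["dimension"] == "affective"])
cognitive_bias = signed_bias([r for r in records if r["dimension"] == "cognitive"])
```

Exact-match accuracy treats every disagreement the same, while the signed score keeps the direction of the error, which is what separates overestimation from underestimation. A real analysis would also need a significance test and a chance-corrected agreement measure, both omitted here.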
