LLM Reasoning Predicts When Models Are Right: Evidence from Coding Classroom Discourse
Bakhtawar Ahtisham, Kirk Vanacore, Zhuqian Zhou, Jinsook Lee, Rene F. Kizilcec

TL;DR
This study shows that the reasoning Large Language Models generate can predict whether their own labels for classroom dialogue are correct, supporting more reliable automated educational assessment.
Contribution
It introduces a reasoning-based approach using linguistic cues and supervised classifiers to detect errors in LLM predictions within educational dialogue analysis.
Findings
A Random Forest classifier achieved an F1 score of 0.83 (recall 0.854) in error detection
Construct-specific linguistic cues improve detection performance
Correct predictions show grounded causal language, while incorrect ones rely on hedging and metacognition
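For readers less familiar with the reported metrics, the following is a minimal sketch of how precision, recall, and F1 are computed for an error-detection task. It assumes the positive class is "the LLM's label was incorrect", which this summary does not state explicitly. The function name and the toy labels are hypothetical and are not the paper's data.

```python
def error_detection_scores(is_error_true, is_error_pred):
    """Precision, recall, and F1 with 'LLM label is wrong' as the positive class.

    Assumption: 1 marks an incorrect LLM label, 0 a correct one.
    """
    pairs = list(zip(is_error_true, is_error_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # errors caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # errors missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels invented for illustration only.
true_errors = [1, 1, 1, 0, 0, 0, 1, 0]
flagged = [1, 1, 0, 0, 1, 0, 1, 0]
precision, recall, f1 = error_detection_scores(true_errors, flagged)
```

Under this framing, high recall means few incorrect LLM labels slip through unflagged.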
Abstract
Large Language Models (LLMs) are increasingly deployed to automatically label and analyze educational dialogue at scale, yet current pipelines lack reliable ways to detect when models are wrong. We investigate whether reasoning generated by LLMs can be used to predict the correctness of a model's own predictions. We analyze 30,300 teacher utterances from classroom dialogue, each labeled by multiple state-of-the-art LLMs with an instructional move construct and an accompanying reasoning. Using human-verified ground-truth labels, we frame the task as predicting whether a model's assigned label for a given utterance is correct. We encode LLM reasoning using Term Frequency-Inverse Document Frequency (TF-IDF) and evaluate five supervised classifiers. A Random Forest classifier achieves an F1 score of 0.83 (Recall = 0.854), successfully identifying most incorrect predictions and outperforming…
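The TF-IDF encoding step mentioned in the abstract can be sketched in plain Python as follows. This is an illustrative sketch only, not the authors' pipeline: the tokenizer, the smoothed IDF formula, the L2 normalization, and the two example reasoning strings are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) using smoothed IDF and L2-normalized rows."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(documents)
    # Document frequency: number of documents containing each token.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF (an assumed variant): ln((1 + N) / (1 + df)) + 1
    idf = {
        tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0
        for tok in vocabulary
    }
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[tok] * idf[tok] for tok in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        vectors.append([w / norm for w in weights])
    return vocabulary, vectors


# Hypothetical reasoning strings, invented for illustration only.
reasonings = [
    "The teacher asks a question because the student gave an incomplete answer.",
    "This might be a question, but I am not sure what the teacher intends.",
]
vocab, vecs = tfidf_vectors(reasonings)
```

Each row of `vecs` could then be passed, together with a correct/incorrect target, to a supervised classifier such as a Random Forest.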
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Topic Modeling · Text Readability and Simplification
