Large Language Models as Students Who Think Aloud: Overly Coherent, Verbose, and Confident
Conrad Borchers, Jill-Jênn Vie, Roger Azevedo

TL;DR
This study evaluates GPT-4.1's ability to simulate novice reasoning in chemistry tutoring. The model's reasoning is overly coherent and verbose, and it overestimates learner success, which highlights the limitations of LLMs in modeling human learning.
Contribution
The paper introduces an evaluation framework for assessing LLMs as models of novice reasoning, emphasizing their limitations in faithfully representing human learning processes.
Findings
GPT-4.1 generates fluent, contextually appropriate reasoning but is overly coherent and verbose.
The model overestimates learner success, and its reasoning is less variable than human think-alouds.
Rich problem contexts amplify the model's over-coherence and overconfidence.
Abstract
Large language models (LLMs) are increasingly embedded in AI-based tutoring systems. Can they faithfully model novice reasoning and metacognitive judgments? Existing evaluations emphasize problem-solving accuracy, overlooking the fragmented and imperfect reasoning that characterizes human learning. We evaluate LLMs as novices using 630 think-aloud utterances from multi-step chemistry tutoring problems with problem-solving logs of student hint use, attempts, and problem context. We compare LLM-generated reasoning to human learner utterances under minimal and extended contextual prompting, and assess the models' ability to predict step-level learner success. Although GPT-4.1 generates fluent and contextually appropriate continuations, its reasoning is systematically over-coherent, verbose, and less variable than human think-alouds. These effects intensify with a richer problem-solving…
