Large Language Models as Students Who Think Aloud: Overly Coherent, Verbose, and Confident

Conrad Borchers; Jill-Jênn Vie; Roger Azevedo

arXiv:2602.01015·cs.CL·May 12, 2026

TL;DR

This study evaluates GPT-4.1's ability to simulate novice reasoning in chemistry tutoring. It finds that the model's reasoning is overly coherent and verbose and that the model overestimates learner success, which points to limitations in using LLMs to model human learning.

Contribution

The paper introduces an evaluation framework for assessing LLMs as models of novice reasoning, emphasizing their limitations in faithfully representing human learning processes.

Findings

01

GPT-4.1 generates fluent, contextually appropriate reasoning but is overly coherent and verbose.

02

The model overestimates learner success, and its reasoning is less variable than human think-alouds.

03

Rich problem contexts amplify the model's over-coherence and overconfidence.

Abstract

Large language models (LLMs) are increasingly embedded in AI-based tutoring systems. Can they faithfully model novice reasoning and metacognitive judgments? Existing evaluations emphasize problem-solving accuracy, overlooking the fragmented and imperfect reasoning that characterizes human learning. We evaluate LLMs as novices using 630 think-aloud utterances from multi-step chemistry tutoring problems with problem-solving logs of student hint use, attempts, and problem context. We compare LLM-generated reasoning to human learner utterances under minimal and extended contextual prompting, and assess the models' ability to predict step-level learner success. Although GPT-4.1 generates fluent and contextually appropriate continuations, its reasoning is systematically over-coherent, verbose, and less variable than human think-alouds. These effects intensify with a richer problem-solving…
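One way to picture the verbosity and variability comparison described in the abstract is a word-count summary per group of utterances. The sketch below is illustrative only: the function name `length_stats`, the choice of word count as a verbosity proxy, and the example utterances are all assumptions made here, not the paper's metrics or data.

```python
from statistics import mean, pstdev


def length_stats(utterances):
    """Return (mean, population std dev) of word counts for a list of utterances."""
    counts = [len(u.split()) for u in utterances]
    return mean(counts), pstdev(counts)


# Made-up placeholder utterances for illustration; NOT data from the paper.
human = [
    "um so moles",
    "wait I think divide by the molar mass maybe",
    "no",
]
llm = [
    "First I need to convert grams to moles using the molar mass.",
    "Next I will multiply by the mole ratio from the balanced equation.",
    "Finally I convert moles back to grams to get the answer.",
]

human_mean, human_sd = length_stats(human)
llm_mean, llm_sd = length_stats(llm)

# A higher mean suggests more verbose output; a lower std dev suggests less variability.
print(f"human: mean={human_mean:.2f}, sd={human_sd:.2f}")
print(f"llm:   mean={llm_mean:.2f}, sd={llm_sd:.2f}")
```

Word count is a crude proxy; a real comparison would likely need measures of coherence and content as well, which this sketch does not attempt.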
