Problems With Large Language Models for Learner Modelling: Why LLMs Alone Fall Short for Responsible Tutoring in K-12 Education

Danial Hooshyar, Yeongwook Yang, Gustav Šír, Tommi Kärkkäinen, Raija Hämäläinen, Mutlu Cukurova, and Roger Azevedo

arXiv:2512.23036 · cs.AI · December 30, 2025

Open Access

TL;DR

Large language models currently fall short in reliably modeling learners' evolving knowledge in K-12 education, highlighting the need for hybrid systems that combine LLMs with traditional learner modeling techniques for responsible tutoring.

Contribution

This study empirically compares LLMs with deep knowledge tracing, revealing significant limitations of LLMs in accuracy, temporal coherence, and reliability for learner modeling in high-stakes educational settings.

Findings

01. Deep knowledge tracing (DKT) outperforms LLMs in accuracy (DKT AUC = 0.83)

02. Fine-tuning improves LLM performance, but the fine-tuned LLM remains inferior to DKT

03. LLMs exhibit unstable, inconsistent mastery updates over time

Abstract

The rapid rise of large language model (LLM)-based tutors in K-12 education has fostered a misconception that generative models can replace traditional learner modelling for adaptive instruction. This is especially problematic in K-12 settings, which the EU AI Act classifies as a high-risk domain requiring responsible design. Motivated by these concerns, this study synthesises evidence on the limitations of LLM-based tutors and empirically investigates one critical issue: the accuracy, reliability, and temporal coherence of assessing learners' evolving knowledge over time. We compare a deep knowledge tracing (DKT) model with a widely used LLM, evaluated zero-shot and fine-tuned, using a large open-access dataset. Results show that DKT achieves the highest discrimination performance (AUC = 0.83) on next-step correctness prediction and consistently outperforms the LLM across settings.…
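To make the reported metric concrete, the following is a minimal sketch of how AUC can be computed for next-step correctness prediction: it is the probability that a randomly chosen correct response receives a higher predicted score than a randomly chosen incorrect one, with ties counted as half. The function name `auc` and the toy data are illustrative assumptions; this is not the authors' evaluation code or dataset.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    labels: observed next-step outcomes (1 = correct, 0 = incorrect)
    scores: predicted probability of a correct response for each step
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one correct and one incorrect response")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # correct response ranked above incorrect one
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: six learner interactions and a model's predictions.
observed = [1, 0, 1, 1, 0, 0]
predicted = [0.9, 0.4, 0.7, 0.3, 0.2, 0.6]
toy_auc = auc(observed, predicted)
print(f"AUC = {toy_auc:.3f}")
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which gives a sense of scale for the 0.83 reported for DKT.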

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Text Readability and Simplification · Topic Modeling