Can LLMs Reliably Simulate Real Students' Abilities in Mathematics and Reading Comprehension?

KV Aditya Srivatsa; Kaushal Kumar Maurya; Ekaterina Kochmar

arXiv:2507.08232 · cs.CL · July 14, 2025

TL;DR

This study evaluates how well large language models (LLMs) can simulate real students' abilities in mathematics and reading comprehension. Without guidance, strong models outperform the average student, and how closely a model matches real student performance depends on the prompting strategy.

Contribution

The paper systematically evaluates 11 diverse LLMs against real student data from NAEP, using Item Response Theory (IRT) to place models and students on the same ability scale, and highlights the variability and limitations of current models as proxy students.

Findings

01 Strong LLMs outperform average students without guidance

02 Prompting strategies influence models' ability to match student performance

03 No single model-prompt combination reliably simulates students across subjects and grades

Abstract

Large Language Models (LLMs) are increasingly used as proxy students in the development of Intelligent Tutoring Systems (ITSs) and in piloting test questions. However, to what extent these proxy students accurately emulate the behavior and characteristics of real students remains an open question. To investigate this, we collected a dataset of 489 items from the National Assessment of Educational Progress (NAEP), covering mathematics and reading comprehension in grades 4, 8, and 12. We then apply an Item Response Theory (IRT) model to position 11 diverse and state-of-the-art LLMs on the same ability scale as real student populations. Our findings reveal that, without guidance, strong general-purpose models consistently outperform the average student at every grade, while weaker or domain-mismatched models may align incidentally. Using grade-enforcement prompts changes models'…
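To make the idea of placing a model on a student ability scale concrete, here is a minimal sketch in Python. It assumes a two-parameter logistic (2PL) IRT model, which the text above does not specify, and it uses made-up item parameters rather than anything from NAEP or the paper. The function names and the grid search over ability values are illustrative choices, not the authors' method.

```python
import math

def p_correct(theta, a, b):
    """2PL probability that a respondent with ability theta answers an item
    with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Log-likelihood of a 0/1 response pattern at ability theta."""
    total = 0.0
    for (a, b), r in zip(items, responses):
        p = p_correct(theta, a, b)
        total += math.log(p) if r else math.log(1.0 - p)
    return total

def estimate_ability(items, responses, lo=-4.0, hi=4.0, steps=801):
    """Maximum-likelihood ability estimate via a simple grid search.
    Item parameters are treated as known and fixed."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))

# Hypothetical (discrimination, difficulty) pairs; NOT taken from the paper.
ITEMS = [(1.0, -1.0), (1.2, -0.5), (0.8, 0.0), (1.5, 0.5), (1.0, 1.0)]

# Two hypothetical response patterns, e.g. from a stronger and a weaker model.
strong = estimate_ability(ITEMS, [1, 1, 1, 1, 0])
weak = estimate_ability(ITEMS, [1, 0, 0, 0, 0])
print(f"stronger pattern: theta = {strong:.2f}, weaker pattern: theta = {weak:.2f}")
```

Because the item parameters are held fixed, any respondent, human or model, who answers the same items receives an estimate on the same scale, which is what allows a model's ability to be compared against a student population's average.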
