Can LLMs Reliably Simulate Real Students' Abilities in Mathematics and Reading Comprehension?
KV Aditya Srivatsa, Kaushal Kumar Maurya, Ekaterina Kochmar

TL;DR
This study evaluates how well large language models can simulate real students' abilities in math and reading. Strong models outperform the average student when unguided, and prompting shifts their performance, but no model-prompt combination matches students reliably across subjects and grades.
Contribution
The paper introduces a systematic evaluation of diverse LLMs against real student data using IRT, highlighting the variability and limitations of current models as proxies.
Findings
Strong LLMs outperform average students without guidance
Prompting strategies influence models' ability to match student performance
No single model-prompt combination reliably simulates students across subjects and grades
Abstract
Large Language Models (LLMs) are increasingly used as proxy students in the development of Intelligent Tutoring Systems (ITSs) and in piloting test questions. However, to what extent these proxy students accurately emulate the behavior and characteristics of real students remains an open question. To investigate this, we collected a dataset of 489 items from the National Assessment of Educational Progress (NAEP), covering mathematics and reading comprehension in grades 4, 8, and 12. We then apply an Item Response Theory (IRT) model to position 11 diverse and state-of-the-art LLMs on the same ability scale as real student populations. Our findings reveal that, without guidance, strong general-purpose models consistently outperform the average student at every grade, while weaker or domain-mismatched models may align incidentally. Using grade-enforcement prompts changes models'…
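To make the idea of placing a model on a student ability scale concrete, here is a minimal sketch of ability estimation under a two-parameter logistic (2PL) IRT model. The excerpt above does not say which IRT variant the authors use, so the 2PL form, the grid-search estimator, and all item parameters below are illustrative assumptions, not the paper's method or data.

```python
import math

def p_correct(theta, a, b):
    """2PL model: probability of a correct answer given ability theta,
    item discrimination a, and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, responses, items):
    """Log-likelihood of a 0/1 response vector at ability theta."""
    ll = 0.0
    for y, (a, b) in zip(responses, items):
        p = p_correct(theta, a, b)
        ll += math.log(p) if y else math.log(1.0 - p)
    return ll

def estimate_ability(responses, items, lo=-4.0, hi=4.0, steps=801):
    """Maximum-likelihood ability on a bounded grid.
    Bounding the grid keeps the estimate finite for all-correct
    or all-wrong response patterns."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, responses, items))

# Hypothetical item parameters (a, b), assumed to be already calibrated
# on real student responses; these are not NAEP values.
items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]

# Hypothetical graded answers from two simulated "students" (1 = correct).
stronger = estimate_ability([1, 1, 0], items)
weaker = estimate_ability([1, 0, 0], items)
print(stronger, weaker)
```

Because the item parameters are fixed from the student calibration, an LLM's graded answers yield an ability estimate on the same scale, which can then be compared with the average ability of a given grade.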
